EVL Anonymization

Products, services and company names referenced in this document may be either trademarks or registered trademarks of their respective owners.

Copyright © 2017–2020 EVL Tool, s.r.o.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.3 or any later version published by the Free Software Foundation; with no Invariant Sections, with no Front-Cover Texts, and with no Back-Cover Texts.

Config File

The main configuration file defines information about each field to be processed. It is shared by all metadata-driven EVL Microservices, so it is located in the configs directory of the project. It is a csv file named after its source, so a project can have several config files, for example:

configs/mssql_exports.csv
configs/oracle_db1_tables.csv
configs/oracle_db2_tables.csv
configs/some_json_source.csv

Here is an example showing most of the columns of such a configuration csv file (semicolon delimited by default):

src_type;entity_name;order;field_name;data_type;null;anon_type;evl_value
ORA;accounts;;id;int;No;ANON_UNIQ;
ORA;accounts;;cust_id;int;No;ANON_LOOKUP;
ORA;accounts;;iban;string;;ANON_IBAN;
ORA;accounts;;currency;string;;;
ORA;accounts;;score;decimal(8,2);;ANON_AMOUNT(0.1);
ORA;accounts;;when;date;;ANON_VAR;
FILE;cust.csv;1;id;int;No;ANON_UNIQ;
FILE;cust.csv;2;email;string;;ANON_EMAIL;
FILE;cust.csv;3;pers_id;string;No;;anon_rc(IN)
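
As an illustration only, such a file could be inspected outside EVL with Python's standard csv module. This is a minimal sketch, assuming a semicolon delimiter and the column names from the example header; the inline sample rows and the helper name `load_config` are hypothetical and not part of EVL.

```python
import csv
import io

# Illustrative sample only; the column names follow the example header.
SAMPLE = """\
src_type;entity_name;order;field_name;data_type;null;anon_type;evl_value
ORA;accounts;;id;int;No;ANON_UNIQ;
FILE;cust.csv;2;email;string;;ANON_EMAIL;
"""


def load_config(text, delimiter=";"):
    """Return one dict per configured field, keyed by column name."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


rows = load_config(SAMPLE)
for row in rows:
    # Empty columns come back as empty strings, not as None.
    print(row["entity_name"], row["field_name"], row["anon_type"] or "(none)")
```

Reading each row as a dictionary keeps the code independent of the column order.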

A description of each column of the configuration file follows.

Source Type

is used to distinguish different inputs. Possible values are:

FILE, file

for any supported file format, e.g. avro, csv, json, parquet, qvd, xls, xlsx, xml. The file format is determined from the file suffix of the Entity name.

avro, csv, json, parquet, qvd, xls, xlsx, xml

to be used only when the file name or mask (i.e. the Entity name) has no proper file suffix.

MySQL

for MySQL/MariaDB tables

ORA, Oracle

for Oracle tables

PG, PostgreSQL

for PostgreSQL tables

TD, Teradata

for Teradata tables

When the ‘source_type’ field is empty, ‘FILE’ is assumed.

Entity Name

is mandatory and is either a file name/mask or a table name.

For file sources, the file name or mask should contain a proper file extension so that the format can be recognized. The compression suffixes .gz, .tar.gz, and .zip are also supported and processed properly.
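
The suffix rules described for the Source Type and the Entity name can be sketched as follows. This is a hedged illustration, not EVL's actual implementation: the function name `detect_format` is hypothetical, and exact (case-sensitive) suffix matching is an assumption.

```python
FORMATS = {"avro", "csv", "json", "parquet", "qvd", "xls", "xlsx", "xml"}
# Longest suffix first, so ".tar.gz" is not mistaken for ".gz".
COMPRESSION_SUFFIXES = (".tar.gz", ".gz", ".zip")


def detect_format(entity_name, source_type=""):
    """Guess the file format from the entity name, else from the source type."""
    name = entity_name
    for suffix in COMPRESSION_SUFFIXES:
        if name.endswith(suffix):
            name = name[: -len(suffix)]  # drop the compression suffix
            break
    extension = name.rsplit(".", 1)[-1] if "." in name else ""
    if extension in FORMATS:
        return extension
    # Fall back to an explicit format given as the source type.
    if source_type in FORMATS:
        return source_type
    return None


print(detect_format("cust.csv"))
print(detect_format("export_*.json.gz"))
print(detect_format("nightly_dump", "parquet"))
```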

Field Order

is mandatory only for csv files. It specifies the position of the given field in a file, counted from 1.

Specifying the order for json, qvd, xls, xlsx, or xml files means that the EVL Read component does not need to use the --match-fields option, which speeds up the anonymization.

Field Name

is simply the name of the field. It is case sensitive!

The field name is also used for mapping generation, so it appears in the EVM mapping file and can be used in the ‘evl_value’ configuration field.

EVL Datatype

Possible EVL data types are:

string

to be used for all texts, varchars, etc.

char, uchar, short, ushort, int, uint, long, ulong, int128, uint128

to be used for all corresponding integral data types

decimal

to be used for decimal(m,n) and number(m,n) data types

float, double

to be used for float, double, real and similar source data types

date, datetime, timestamp

to be used for the corresponding date and time data types, where date keeps only the date, datetime keeps the date and time (in seconds), and timestamp can also hold nanoseconds and a time zone.

Format

specifies the format for date and time data types and the number of digits for decimals.

For the decimal data type you must specify ‘m,n’, where ‘m’ is the total number of digits and ‘n’ is the number of decimal places.

For date and time data types, when no format is specified, defaults are used:

EVL_DEFAULT_DATE_PATTERN="%Y-%m-%d"

to specify default formatting string for ‘date’ data type

EVL_DEFAULT_TIME_PATTERN="%H:%M:%S"

to specify default formatting string for ‘time’ data type

EVL_DEFAULT_DATETIME_PATTERN="%Y-%m-%d %H:%M:%S"

to specify default formatting string for ‘datetime’ data type

EVL_DEFAULT_TIMESTAMP_PATTERN="%Y-%m-%d %H:%M:%E*S"

to specify default formatting string for ‘timestamp’ data type

To specify a formatting pattern, standard C notation can be used. Examples of such formatting strings:

%d.%m.%Y

to have a date like ‘16.10.2020’

%d-%b-%Y

to have a date like ‘16-Oct-2020’

%Y%m%d%H%M%S

to have a datetime like ‘20201016093000’

%Y-%m-%dT%H:%M:%S.%E3f

to have a timestamp like ‘2020-10-16T09:30:59.545’
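
The plain C conversion specifiers can be tried out with any strftime implementation. Below is a small sketch using Python's datetime module; it covers only the standard specifiers, because the ‘%E3f’ and ‘%E*S’ extensions used for timestamps are not understood by Python, and the month abbreviation produced by ‘%b’ depends on the active locale.

```python
from datetime import datetime

# Sample moment chosen to match the examples in this section.
moment = datetime(2020, 10, 16, 9, 30, 0)

patterns = [
    "%d.%m.%Y",
    "%d-%b-%Y",           # month abbreviation is locale dependent
    "%Y%m%d%H%M%S",
    "%Y-%m-%d %H:%M:%S",  # same shape as EVL_DEFAULT_DATETIME_PATTERN
]

formatted = {pattern: moment.strftime(pattern) for pattern in patterns}
for pattern, text in formatted.items():
    print(f"{pattern:20} -> {text}")
```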

Nullable

says whether the field is nullable or not. When empty, the field is assumed to be nullable. Possible values are:

1, Y, y, yes, Yes, True, true

for NULLABLE

0, N, n, no, No, False, false

for NOT NULL

For file sources, when the field is NULLABLE, an empty string is interpreted as NULL and anonymization functions keep such a field NULL.

Important: When a field is flagged as NOT NULL, an empty string is treated as a regular string, so anonymization functions anonymize it. In that case an empty string can produce a non-empty anonymized value.
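
The interpretation of the flag and of empty strings can be sketched like this. It is a minimal illustration of the rules stated above, with hypothetical function names `is_nullable` and `read_file_value`; how EVL reacts to a flag value outside the two lists is not stated here, so raising an error is an assumption.

```python
NULLABLE_FLAGS = {"1", "Y", "y", "yes", "Yes", "True", "true"}
NOT_NULL_FLAGS = {"0", "N", "n", "no", "No", "False", "false"}


def is_nullable(flag):
    """Translate the config value into a boolean; empty means nullable."""
    if flag == "" or flag in NULLABLE_FLAGS:
        return True
    if flag in NOT_NULL_FLAGS:
        return False
    # Assumption: unknown flags are rejected rather than guessed.
    raise ValueError(f"unsupported nullable flag: {flag!r}")


def read_file_value(raw, flag):
    """Return None for NULL, otherwise the raw string to be anonymized."""
    if raw == "" and is_nullable(flag):
        return None  # stays NULL, anonymization leaves it alone
    return raw       # NOT NULL: even "" is a regular string


print(read_file_value("", "Yes"))
print(repr(read_file_value("", "No")))
```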

Min and Max

Strings

For strings, these columns specify the length of the strings created on output by the anonymization.

By default ‘min’ = 0 and ‘max’ = 10 are assumed, so for string fields shorter than 10 characters it is effectively mandatory to set the ‘max’ length. To change these defaults, specify other values in the project.sh configuration file.

export EVL_ANON_DEFAULT_MIN_STRING_LENGTH=0
export EVL_ANON_DEFAULT_MAX_STRING_LENGTH=10

Numbers

For numbers, these columns contain the range of values the anonymization should produce. Here are the environment variables with their default values (min/max values from the config file take precedence):

export EVL_ANON_DEFAULT_MIN_CHAR=0
export EVL_ANON_DEFAULT_MIN_SHORT=0
export EVL_ANON_DEFAULT_MIN_INT=0
export EVL_ANON_DEFAULT_MIN_LONG=0
export EVL_ANON_DEFAULT_MIN_DECIMAL=0
export EVL_ANON_DEFAULT_MIN_FLOAT=0
export EVL_ANON_DEFAULT_MIN_DOUBLE=0
export EVL_ANON_DEFAULT_MAX_CHAR=99
export EVL_ANON_DEFAULT_MAX_SHORT=9999
export EVL_ANON_DEFAULT_MAX_INT=99999999
export EVL_ANON_DEFAULT_MAX_LONG=99999999
export EVL_ANON_DEFAULT_MAX_DECIMAL=99999999
export EVL_ANON_DEFAULT_MAX_FLOAT=1000000
export EVL_ANON_DEFAULT_MAX_DOUBLE=1000000
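
The precedence rule (config file first, then environment variable, then built-in default) can be sketched as follows for the integral types. The function name `resolve_range` is hypothetical and the lookup order is only an illustration of the rule stated above, not EVL's code.

```python
import os

# Built-in defaults copied from the list of environment variables.
DOCUMENTED_DEFAULTS = {
    "CHAR": (0, 99),
    "SHORT": (0, 9999),
    "INT": (0, 99999999),
    "LONG": (0, 99999999),
}


def resolve_range(data_type, cfg_min="", cfg_max="", env=None):
    """Return (min, max): config column, else environment, else default."""
    env = os.environ if env is None else env
    key = data_type.upper()
    default_min, default_max = DOCUMENTED_DEFAULTS[key]
    low = cfg_min if cfg_min != "" else env.get(
        f"EVL_ANON_DEFAULT_MIN_{key}", default_min)
    high = cfg_max if cfg_max != "" else env.get(
        f"EVL_ANON_DEFAULT_MAX_{key}", default_max)
    return int(low), int(high)


print(resolve_range("int", env={}))
print(resolve_range("short", cfg_max="500", env={}))
print(resolve_range("char", env={"EVL_ANON_DEFAULT_MAX_CHAR": "50"}))
```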

Dates and times

For dates and times, these columns contain the range of values produced when the anonymize(value,min,max) function is called (e.g. by the ANON anon type). Default values for date, datetime and timestamp are set by environment variables to ‘1970-01-01’ and ‘2099-12-31’ this way:

export EVL_ANON_DEFAULT_MIN_DATE="1970-01-01"
export EVL_ANON_DEFAULT_MIN_DATETIME="1970-01-01 00:00:00"
export EVL_ANON_DEFAULT_MIN_TIMESTAMP="1970-01-01 00:00:00.000000000"
export EVL_ANON_DEFAULT_MAX_DATE="2099-12-31"
export EVL_ANON_DEFAULT_MAX_DATETIME="2099-12-31 00:00:00"
export EVL_ANON_DEFAULT_MAX_TIMESTAMP="2099-12-31 00:00:00.000000000"

When the function anonymize(value,seconds,minutes,hours,days,months,years) or anonymize(value,days,months,years) is used (e.g. by the ANON_VAR anon type), the min/max values are ignored and the values of these environment variables are used instead:

export EVL_ANON_DEFAULT_NANOSECONDS=500000000
export EVL_ANON_DEFAULT_SECONDS=30
export EVL_ANON_DEFAULT_MINUTES=30
export EVL_ANON_DEFAULT_HOURS=12
export EVL_ANON_DEFAULT_DAYS=15
export EVL_ANON_DEFAULT_MONTHS=6
export EVL_ANON_DEFAULT_YEARS=5

Note: All the environment variables can be defined either in project.sh (for the whole project), or in configs/anon/<source_name>.sh to have it only for sources given by <source_name>.csv.

Anon Type

The following predefined types of anonymization can be specified in the ‘anon_type’ configuration column.

ANON

this type is applicable to all EVL data types. It uses the ‘anonymize(value, min, max)’ EVL function for the given data type. It takes the min/max values from the ‘min’ and ‘max’ columns if they are set, otherwise it uses the default values from the variables starting with EVL_ANON_DEFAULT_MIN_ and EVL_ANON_DEFAULT_MAX_.

In general, two different inputs may produce the same anonymized value, so the mapping is not necessarily a bijection. For anonymization this is mostly desirable.

Important: For a given value and a given salt it always produces the same anonymized value.
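
The two properties described for ‘ANON’ (repeatable for a given value and salt, but not necessarily a bijection) can be demonstrated with a salted hash. This is a sketch of the properties only: the function `anonymize_int` and its use of SHA-256 are assumptions made for illustration and say nothing about how EVL's ‘anonymize()’ is implemented.

```python
import hashlib


def anonymize_int(value, low, high, salt):
    """Map value deterministically into [low, high] using a salted hash."""
    digest = hashlib.sha256(f"{salt}:{value}".encode("utf-8")).digest()
    return low + int.from_bytes(digest[:8], "big") % (high - low + 1)


# Same value and same salt always give the same result.
first = anonymize_int(42, 0, 99, "project-salt")
second = anonymize_int(42, 0, 99, "project-salt")
print(first == second)

# 100 inputs squeezed into 10 possible outputs must collide,
# so the mapping cannot be a bijection.
distinct = {anonymize_int(v, 0, 9, "project-salt") for v in range(100)}
print(len(distinct))
```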

ANON_VAR

can be used only for date and time data types. Internally it uses ‘anonymize(value, days, months, years)’ for date, ‘anonymize(value, seconds, minutes, hours, days, months, years)’ for datetime, and ‘anonymize(value, nanoseconds, seconds, minutes, hours, days, months, years)’ for timestamp. The variance arguments are taken from the following environment variables, which are set by default:

export EVL_ANON_DEFAULT_NANOSECONDS=500000000
export EVL_ANON_DEFAULT_SECONDS=30
export EVL_ANON_DEFAULT_MINUTES=30
export EVL_ANON_DEFAULT_HOURS=12
export EVL_ANON_DEFAULT_DAYS=15
export EVL_ANON_DEFAULT_MONTHS=6
export EVL_ANON_DEFAULT_YEARS=5

ANON_UNIQ

can be used only for integral data types. Internally it uses ‘anonymize_uniq()’ EVL function. It produces the output in a unique way, so bijection is guaranteed. Particularly useful for IDs.

Important: It keeps the mapping to be a bijection.
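
To make the bijection property concrete, here is a toy one-to-one mapping over a fixed range of IDs. It is an assumption-laden illustration, not the algorithm behind ‘anonymize_uniq()’: the affine map and the name `make_unique_mapper` are hypothetical, and such a simple map is trivially reversible.

```python
from math import gcd


def make_unique_mapper(size, multiplier, offset):
    """Return a one-to-one mapping on 0..size-1 (an affine permutation)."""
    if gcd(multiplier, size) != 1:
        raise ValueError("multiplier must be coprime with size")

    def mapper(value):
        return (multiplier * value + offset) % size

    return mapper


mapper = make_unique_mapper(size=1000, multiplier=377, offset=123)
outputs = {mapper(value) for value in range(1000)}
# Every input has its own output, so no two IDs collide.
print(len(outputs))
```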

ANON_NAME

can be used only for the string data type and internally maps to ‘anonymize(<IN>," ", true)’, i.e. the length and spaces are kept, capital letters stay capital, lowercase letters stay lowercase, and digits stay digits.

ANON_EMAIL

can be used only for the string data type and internally maps to ‘anonymize(<IN>,"@.")’, i.e. ‘@’ and dots are kept unchanged and the other parts are anonymized as usual. The length is not kept.

ANON_IBAN
ANON_IBAN_KEEP_COUNTRY
ANON_IBAN_KEEP_BANK

are applicable only to strings and map to

anonymize_iban(<IN>)
anonymize_iban(<IN>,iban_anon::keep_country)
anonymize_iban(<IN>,iban_anon::keep_country_and_bank)

In the first case the IBAN is anonymized into an IBAN of an arbitrary country, but it stays valid. In the second case the country is kept, and in the third case the bank is kept as well.

ANON_AMOUNT()

is applicable to int, uint, long, ulong and decimal. It is defined as

anonymize(<*IN>, <*IN> - <ARG> * <*IN>, <*IN> + <ARG> * <*IN>)

where ‘<*IN>’ is a placeholder for the input value and ‘<ARG>’ is the argument specified in the parentheses. For example, ‘ANON_AMOUNT(0.1)’ returns an anonymized value within a plus/minus 10% interval around the input.
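
The interval passed to ‘anonymize()’ follows directly from the definition. The sketch below only computes the two bounds; the helper name `amount_bounds` is hypothetical and the use of Python's Decimal type is a choice made here to avoid binary rounding.

```python
from decimal import Decimal


def amount_bounds(value, arg):
    """Return (value - arg*value, value + arg*value) as in the definition."""
    value = Decimal(str(value))
    arg = Decimal(str(arg))
    delta = arg * value
    return value - delta, value + delta


low, high = amount_bounds("250.00", "0.1")
print(low, high)

# For a negative input the first bound is the larger one.
print(amount_bounds(-100, "0.1"))
```

Note that for negative inputs the first bound is greater than the second; how ‘anonymize()’ treats such a pair is not described in this section.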

MASK_LEFT()
MASK_RIGHT()

can be used only for strings and call ‘str_mask_left(<IN>,<ARG>)’ or ‘str_mask_right(<IN>,<ARG>)’, where ‘<IN>’ is a placeholder for the pointer to the input value and ‘<ARG>’ is the argument specified in the parentheses. For example, ‘MASK_LEFT(4)’ masks the string with ‘*’ from the left, but keeps 4 characters unchanged.

RANDOM

behaves very similarly to the ‘ANON’ type, but each run may return a different value. Applicable to any data type; internally calls ‘random_<data_type>(min,max)’, where min/max is taken from the config file (see Min and Max) or is defined by environment variables.

RANDOM_VAR

behaves very similarly to the ‘ANON_VAR’ type, but each run may return a different value. Applicable only to date and time data types. Internally it calls the ‘randomize()’ function.

LOOKUP_ANON

first creates a lookup from the column and then uses it as the set of possible values. This way you get real values, but shuffled.

Important: For a given value and a given salt it always produces the same anonymized value.

LOOKUP_ANON()

uses an already prepared lookup specified as an argument in the parentheses, so anonymized values are always taken from this lookup. Such a lookup must be prepared according to this EVD

/opt/EVL-2.4/share/templates/anonymization/anon/evd/lookup.string.evd

which defines data structure:

key	int
value	string null=""

and must be sorted by the ‘key’ field, whose values start at zero and are incremented by 1.

Important: For a given value and a given salt it always produces the same anonymized value.
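
The requirement on the ‘key’ field can be checked before a lookup is used. The following is a minimal sketch, assuming the lookup has already been parsed into (key, value) pairs; the names `validate_lookup` and `pick_from_lookup` are hypothetical, and the hash-based selection only illustrates a repeatable choice, not EVL's internal method.

```python
import hashlib


def validate_lookup(rows):
    """Check that keys are 0, 1, 2, ... in exactly this order."""
    for expected, (key, _value) in enumerate(rows):
        if key != expected:
            raise ValueError(f"expected key {expected}, found {key}")
    return True


def pick_from_lookup(rows, source, salt):
    """Choose a lookup value repeatably for a given source value and salt."""
    digest = hashlib.sha256(f"{salt}:{source}".encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(rows)
    return rows[index][1]


lookup = [(0, "Prague"), (1, "Brno"), (2, "Ostrava")]
validate_lookup(lookup)
print(pick_from_lookup(lookup, "Vienna", "project-salt"))
```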

LOOKUP_RANDOM

behaves very similarly to the ‘LOOKUP_ANON’ type, but each run may return a different value.

LOOKUP_RANDOM()

behaves very similarly to the ‘LOOKUP_ANON()’ type, but each run may return a different value.

TOKEN*

the group of anonymization types starting with ‘TOKEN’ is not defined separately; these types act exactly like the types above starting with ‘ANON’. The only difference is that the conversion table is kept in the $EVL_ANON_TOKEN_DIR directory, so you can revert the anonymized values if needed. Once a value has been anonymized and is therefore present in the conversion table, it will always be anonymized the same way again, even if the salt has changed in the meantime.

Important: This anonymization type is reversible, which is its purpose.
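
The behaviour of a conversion table can be sketched with two dictionaries. This is only an in-memory illustration: the class `TokenTable` is hypothetical, the on-disk format used in $EVL_ANON_TOKEN_DIR is not described here, and hash collisions are ignored for brevity.

```python
import hashlib


class TokenTable:
    """Toy conversion table: remembers every value it has anonymized."""

    def __init__(self):
        self.forward = {}   # original value -> token
        self.backward = {}  # token -> original value

    def tokenize(self, value, salt):
        # A value already in the table keeps its token, whatever the salt.
        if value in self.forward:
            return self.forward[value]
        digest = hashlib.sha256(f"{salt}:{value}".encode("utf-8"))
        token = digest.hexdigest()[:12]
        self.forward[value] = token
        self.backward[token] = value
        return token

    def revert(self, token):
        return self.backward[token]


table = TokenTable()
token = table.tokenize("john.doe@example.com", "salt-2019")
print(table.tokenize("john.doe@example.com", "salt-2020") == token)
print(table.revert(token))
```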

All of these anonymization types are predefined in

/opt/EVL-2.4/share/templates/anonymization/configs/anon/anon-functions-config.csv

It is a mapping from an anonymization type and a data type to the exact internal EVL (or custom C/C++) function. For example ‘MASK_LEFT()’ is defined by

anon_type;evl_datatype;evl_value;description
MASK_LEFT();string;str_mask_left(<IN>,<ARG>);

Custom Anon Types

You can also define your own custom anonymization types within your Anonymization project by placing a file configs/anon/anon-functions-config.csv of the same structure in your project folder. For example, you can define the anonymization type ‘MASK_LEFT_X()’, which masks the input from the left with ‘X’ but keeps <ARG> characters unchanged, this way:

anon_type;evl_datatype;evl_value;description
MASK_LEFT_X();string;str_mask_left(<IN>,<ARG>,'X');

Placeholders

The following placeholders can be used in the Anonymization Type definition file:

<IN>
<*IN>

to be replaced by the value of the input field (‘<*IN>’) or by the pointer to that value (‘<IN>’)

<ARG>

for Anon Types ending with ‘()’ it resolves to the content specified in the parentheses

<MIN>
<MAX>

to be replaced by the values of the min/max fields from the config file or by the default values specified by the EVL_ANON_DEFAULT_MIN_* and EVL_ANON_DEFAULT_MAX_* environment variables

<FIELD_NAME>

to be replaced by the field name

<NANOSECONDS>
<SECONDS>
<MINUTES>
<HOURS>
<DAYS>
<MONTHS>
<YEARS>

to be replaced by values specified by environment variables: EVL_ANON_DEFAULT_NANOSECONDS, EVL_ANON_DEFAULT_SECONDS, EVL_ANON_DEFAULT_MINUTES, EVL_ANON_DEFAULT_HOURS, EVL_ANON_DEFAULT_DAYS, EVL_ANON_DEFAULT_MONTHS, and EVL_ANON_DEFAULT_YEARS
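
Placeholder resolution can be pictured as plain text substitution. The sketch below is an assumption about the general idea, not EVL's generator: the function `expand_template` is hypothetical, it handles only ‘<ARG>’, ‘<MIN>’, ‘<MAX>’ and ‘<FIELD_NAME>’, and it leaves ‘<IN>’ and ‘<*IN>’ untouched because their resolved form is not described here.

```python
import re


def expand_template(anon_type, template, field_name="", min_value="", max_value=""):
    """Fill <ARG>, <MIN>, <MAX> and <FIELD_NAME> in a function template."""
    # "MASK_LEFT(4)" -> argument "4"; types without parentheses have none.
    match = re.fullmatch(r"\w+\((.*)\)", anon_type)
    argument = match.group(1) if match else ""
    replacements = {
        "<ARG>": argument,
        "<MIN>": str(min_value),
        "<MAX>": str(max_value),
        "<FIELD_NAME>": field_name,
    }
    for placeholder, text in replacements.items():
        template = template.replace(placeholder, text)
    return template


print(expand_template("MASK_LEFT(4)", "str_mask_left(<IN>,<ARG>)"))
print(expand_template("MASK_LEFT_X(2)", "str_mask_left(<IN>,<ARG>,'X')"))
```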

EVL Value

Important: When the ‘anon_type’ column also has a value in the config file, this ‘evl_value’ is applied first and the anonymization type is applied afterwards.

This field can contain one or more EVL mapping functions to be applied to the given field. It can be used for special cases, when you don’t want to create a new custom anonymization type or when you need to prepare a field first.

For example, you may want to preserve some business logic, such as one date always being later than another:

field_name;evl_datatype;anon_type;evl_value
birth_date;date;ANON;
death_date;date;;anonymize(IN, *out->birth_date+1, *out->birth_date+36500)

In this example the anonymized date of death always lies between the anonymized birth date plus one day and the anonymized birth date plus roughly 100 years (36500 days).
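
The dependency between the two output fields can be reproduced in a few lines. This is a sketch of the constraint only: `anonymize_date` is a hypothetical stand-in built on a salted hash, not the EVL ‘anonymize()’ function, and the sample dates and salt are made up.

```python
import hashlib
from datetime import date, timedelta


def anonymize_date(value, low, high, salt="demo-salt"):
    """Map a date repeatably into the inclusive range [low, high]."""
    span = (high - low).days + 1
    digest = hashlib.sha256(f"{salt}:{value.isoformat()}".encode("utf-8")).digest()
    return low + timedelta(days=int.from_bytes(digest[:8], "big") % span)


# birth_date uses the default date range from the Min and Max section.
birth_out = anonymize_date(date(1950, 3, 14), date(1970, 1, 1), date(2099, 12, 31))

# death_date is bounded by the already anonymized birth date.
death_out = anonymize_date(
    date(2015, 7, 2),
    birth_out + timedelta(days=1),
    birth_out + timedelta(days=36500),
)
print(birth_out, death_out)
```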

Ignored Values

Next to NULLs, there are also usually several other values which you’d like to keep unchanged by anonymization. For example dates like ‘1970-01-01’, ‘4712-12-31’, or ‘9999-12-31’.

For such purpose, these environment variables can be defined either in project.sh (for the whole project), or in configs/anon/<source_name>.sh to have it only for sources given by <source_name>.csv:

EVL_ANON_IGNORE_DATE
EVL_ANON_IGNORE_DATETIME
EVL_ANON_IGNORE_TIMESTAMP

can be empty or contain a comma separated list of values to be ignored when the given data type is used,

EVL_ANON_IGNORE_STRING

can be empty or contain a comma separated list of values to be ignored when string is used,

EVL_ANON_IGNORE_CHAR
EVL_ANON_IGNORE_SHORT
EVL_ANON_IGNORE_INT
EVL_ANON_IGNORE_LONG
EVL_ANON_IGNORE_DECIMAL
EVL_ANON_IGNORE_FLOAT
EVL_ANON_IGNORE_DOUBLE

can be empty or contain a comma separated list of numbers to be ignored when the given data type is used. Unsigned variants of integral data types use the variables of their signed counterparts.

Here is a typical example for keeping dates unchanged when anonymizing database tables.

export EVL_ANON_IGNORE_DATE="1970-01-01,4712-12-31,9999-12-31"
export EVL_ANON_IGNORE_DATETIME="1970-01-01 00:00:00,4712-12-31 00:00:00"
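
How such a list could be applied is sketched below. The functions `ignored_values` and `anonymize_unless_ignored` are hypothetical, values are compared as plain text without trimming whitespace (an assumption), and the `str.upper` call in the usage lines merely stands in for a real anonymization function.

```python
import os

UNSIGNED_TO_SIGNED = {"uchar": "char", "ushort": "short", "uint": "int", "ulong": "long"}


def ignored_values(data_type, env=None):
    """Read the comma separated ignore list for the given data type."""
    env = os.environ if env is None else env
    # Unsigned integral types share the list of their signed counterpart.
    data_type = UNSIGNED_TO_SIGNED.get(data_type, data_type)
    raw = env.get(f"EVL_ANON_IGNORE_{data_type.upper()}", "")
    return {item for item in raw.split(",") if item != ""}


def anonymize_unless_ignored(text, data_type, anonymizer, env=None):
    """Keep ignored values unchanged, anonymize everything else."""
    if text in ignored_values(data_type, env):
        return text
    return anonymizer(text)


env = {"EVL_ANON_IGNORE_DATE": "1970-01-01,4712-12-31,9999-12-31"}
print(anonymize_unless_ignored("9999-12-31", "date", str.upper, env))
print(anonymize_unless_ignored("2020-10-16", "date", lambda text: "<anonymized>", env))
```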