EVL Anonymization

Products, services and company names referenced in this document may be either trademarks or registered trademarks of their respective owners.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.3 or any later version published by the Free Software Foundation; with no Invariant Sections, with no Front-Cover Texts, and with no Back-Cover Texts.

Config File

The main configuration file defines information about each field to be processed. It is shared by all metadata-driven EVL Microservices, so it is located in the configs project directory. It is a csv file named after its source, so a project can have several config files, for example:

configs/mssql_exports.csv
configs/oracle_db1_tables.csv
configs/oracle_db2_tables.csv
configs/some_json_source.csv

Here is an example showing most of the columns of such a configuration csv file (semicolon-delimited by default):

src_type	entity_name	order	field_name	data_type	null	anon_type	evl_value
`ORA`	`accounts`		`id`	`int`	`No`	`ANON_UNIQ`
`ORA`	`accounts`		`cust_id`	`int`	`No`	`ANON_LOOKUP`
`ORA`	`accounts`		`iban`	`string`		`ANON_IBAN`
`ORA`	`accounts`		`currency`	`string`
`ORA`	`accounts`		`score`	`decimal(8,2)`		`ANON_AMOUNT(0.1)`
`ORA`	`accounts`		`when`	`date`		`ANON_VAR`
`FILE`	`cust.csv`	`1`	`id`	`int`	`No`	`ANON_UNIQ`
`FILE`	`cust.csv`	`2`	`email`	`string`		`ANON_EMAIL`
`FILE`	`cust.csv`	`3`	`pers_id`	`string`	`No`		`anon_rc(IN)`
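As a rough illustration of the layout, the following sketch reads a semicolon-delimited configuration with Python's csv module. The sample rows mirror the example above; the helper name load_config is hypothetical and not part of EVL.

```python
import csv
import io

# Sample content mirroring the example table (semicolon delimited).
SAMPLE = """src_type;entity_name;order;field_name;data_type;null;anon_type;evl_value
ORA;accounts;;id;int;No;ANON_UNIQ;
ORA;accounts;;iban;string;;ANON_IBAN;
FILE;cust.csv;1;id;int;No;ANON_UNIQ;
FILE;cust.csv;3;pers_id;string;No;;anon_rc(IN)
"""


def load_config(text, delimiter=";"):
    """Return one dict per configured field, keyed by the header names."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


rows = load_config(SAMPLE)
for row in rows:
    # Empty cells come back as empty strings.
    print(row["entity_name"], row["field_name"], row["anon_type"] or "-")
```

Each row describes exactly one field, so several rows with the same entity name together describe one table or file.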

A description of each column of the configuration file follows.

Source Type

The source type is used to distinguish different inputs. Possible values are:

FILE, file: for any supported file format, e.g. avro, csv, json, parquet, qvd, xls, xlsx, xml. The file format is determined from the file suffix of the Entity name.
avro, csv, json, parquet, qvd, xls, xlsx, xml: to be used only when the file name or mask (i.e. the Entity name) has no proper file suffix.
MySQL: for MySQL/MariaDB tables
ORA, Oracle: for Oracle tables
PG, PostgreSQL: for PostgreSQL tables
TD, Teradata: for Teradata tables

When the ‘source_type’ field is empty, ‘FILE’ is assumed.

Entity Name

The entity name is mandatory. It is either a file name/mask or a table name.

For file sources, the file name or mask should contain a proper file extension so that the format can be recognized. The compression suffixes .gz, .tar.gz, and .zip are also supported and handled accordingly.
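The suffix rules above can be sketched as follows. This is only an illustration of the described behaviour, not EVL's implementation: the function name detect_format is hypothetical, and letting an explicit format in the source type take precedence over the suffix is an assumption.

```python
# Formats and compression suffixes listed in this document.
FORMATS = {"avro", "csv", "json", "parquet", "qvd", "xls", "xlsx", "xml"}
COMPRESSION = (".tar.gz", ".gz", ".zip")  # longest suffix first


def detect_format(entity_name, src_type=""):
    """Guess the file format of a file entity (illustrative only)."""
    # Assumption: an explicit format given as source type wins.
    if src_type.lower() in FORMATS:
        return src_type.lower()
    name = entity_name.lower()
    # Strip one compression suffix, if present.
    for suffix in COMPRESSION:
        if name.endswith(suffix):
            name = name[: -len(suffix)]
            break
    ext = name.rsplit(".", 1)[-1] if "." in name else ""
    return ext if ext in FORMATS else None


print(detect_format("cust.csv"))
print(detect_format("export_*.json.gz"))
print(detect_format("dump", "parquet"))
```

A name such as data.tar.gz has no format suffix left after the compression suffix is removed, which is the situation where a format-specific source type would be needed.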

Field Order

The field order is mandatory only for csv files. It specifies the position of the given field in a file, counted from 1.

When the order is specified for json, qvd, xls, xlsx, or xml files, the EVL Read component does not use the --match-fields option, which speeds up the anonymization.

Field Name

This is simply the field name. It is case sensitive.

The field name is also used for mapping generation, so it appears in the EVM mapping file and can be used in the ‘evl_value’ configuration field.

EVL Datatype

Possible EVL data types are:

string: to be used for all texts, varchars, etc.
char, uchar, short, ushort, int, uint, long, ulong, int128, uint128: to be used for all corresponding integral data types
decimal: to be used for decimal(m,n) and number(m,n) data types
float, double: to be used for float, double, real and similar source data types
date, datetime, timestamp: to be used for corresponding date and time data types, where date keeps only date, datetime keeps date and time (in seconds), and timestamp can hold also nanoseconds and time zone.

Format

Specifies the format for date and time data types and the number of decimal places for decimals.

For the decimal data type you must specify ‘m,n’, where ‘m’ is the total number of digits and ‘n’ is the number of decimal places.

For date and time data types, when no format is specified, the following defaults are used:

EVL_DEFAULT_DATE_PATTERN="%Y-%m-%d": to specify default formatting string for ‘date’ data type
EVL_DEFAULT_TIME_PATTERN="%H:%M:%S": to specify default formatting string for ‘time’ data type
EVL_DEFAULT_DATETIME_PATTERN="%Y-%m-%d %H:%M:%S": to specify default formatting string for ‘datetime’ data type
EVL_DEFAULT_TIMESTAMP_PATTERN="%Y-%m-%d %H:%M:%E*S": to specify default formatting string for ‘timestamp’ data type

To specify a formatting pattern, standard C notation can be used. Examples of such formatting strings:

%d.%m.%Y: to have a date like ‘08.12.2021’
%d-%b-%Y: to have a date like ‘08-Dec-2021’
%Y%m%d%H%M%S: to have a datetime like ‘20211208093000’
%Y-%m-%dT%H:%M:%S.%E3f: to have a timestamp like ‘2021-12-08T09:30:59.545’
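The standard C conversions in these patterns can be tried out with Python's datetime module, which uses the same notation. This is only a way to preview a pattern, not how EVL formats values. The %E conversions for fractional seconds shown above are extensions beyond standard C strftime, so they are left out of this sketch.

```python
from datetime import datetime

moment = datetime(2021, 12, 8, 9, 30, 0)

# Patterns taken from the examples above (standard C conversions only).
print(moment.strftime("%d.%m.%Y"))      # day.month.year
print(moment.strftime("%d-%b-%Y"))      # abbreviated month name, locale dependent
print(moment.strftime("%Y%m%d%H%M%S"))  # compact datetime

# Parsing goes the other way with the same pattern.
parsed = datetime.strptime("08.12.2021", "%d.%m.%Y")
print(parsed.date())
```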

Nullable

This column says whether the field is nullable. When it is empty, the field is assumed to be nullable. Possible values are:

1, Y, y, yes, Yes, True, true: for NULLABLE
0, N, n, no, No, False, false: for NOT NULL

For file sources, when the field is NULLABLE, an empty string is interpreted as NULL and anonymization functions keep such a field NULL.

Important: When a field is flagged as NOT NULL, an empty string is treated as a regular string, so anonymization functions anonymize it. In that case an empty string can yield a non-empty anonymized value.
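The nullable rules can be summarized in a small sketch. The function names are hypothetical, and rejecting unknown flag values with an error is an assumption; this document does not say how EVL treats them.

```python
NULLABLE_TRUE = {"1", "Y", "y", "yes", "Yes", "True", "true"}
NULLABLE_FALSE = {"0", "N", "n", "no", "No", "False", "false"}


def is_nullable(flag):
    """Interpret the value of the nullable column of the config file."""
    if flag == "" or flag in NULLABLE_TRUE:
        return True  # an empty flag means nullable
    if flag in NULLABLE_FALSE:
        return False
    # Assumption: anything else is a configuration error.
    raise ValueError(f"unknown nullable flag: {flag!r}")


def value_to_anonymize(raw, nullable):
    """Return None when the field stays NULL, else the string to anonymize."""
    if nullable and raw == "":
        return None  # empty string in a nullable field is NULL
    return raw       # in a NOT NULL field even "" gets anonymized


print(value_to_anonymize("", is_nullable("")))
print(repr(value_to_anonymize("", is_nullable("No"))))
```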

Min and Max

Strings

For strings, min and max specify the length range of the strings produced by anonymization.

By default ‘min’ = 0 and ‘max’ = 10 are assumed, so for string fields shorter than 10 characters it is effectively mandatory to set the ‘max’ length. To change these defaults, specify other values in the project.sh configuration file.

export EVL_ANON_DEFAULT_MIN_STRING_LENGTH=0
export EVL_ANON_DEFAULT_MAX_STRING_LENGTH=10

Numbers

For numbers, min and max specify the range of values produced by anonymization. Here are the environment variables with their default values (min/max values from the config file take precedence):

export EVL_ANON_DEFAULT_MIN_CHAR=0
export EVL_ANON_DEFAULT_MIN_SHORT=0
export EVL_ANON_DEFAULT_MIN_INT=0
export EVL_ANON_DEFAULT_MIN_LONG=0
export EVL_ANON_DEFAULT_MIN_DECIMAL=0
export EVL_ANON_DEFAULT_MIN_FLOAT=0
export EVL_ANON_DEFAULT_MIN_DOUBLE=0
export EVL_ANON_DEFAULT_MAX_CHAR=99
export EVL_ANON_DEFAULT_MAX_SHORT=9999
export EVL_ANON_DEFAULT_MAX_INT=99999999
export EVL_ANON_DEFAULT_MAX_LONG=99999999
export EVL_ANON_DEFAULT_MAX_DECIMAL=99999999
export EVL_ANON_DEFAULT_MAX_FLOAT=1000000
export EVL_ANON_DEFAULT_MAX_DOUBLE=1000000
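The precedence described above (config file first, then environment variable) could look like the following sketch. The function name resolve_bound is hypothetical, and only a few of the documented defaults are repeated here as a last fallback.

```python
import os

# A few of the documented defaults, used when the variable is not set.
DOCUMENTED_DEFAULTS = {
    "EVL_ANON_DEFAULT_MIN_INT": "0",
    "EVL_ANON_DEFAULT_MAX_INT": "99999999",
    "EVL_ANON_DEFAULT_MIN_SHORT": "0",
    "EVL_ANON_DEFAULT_MAX_SHORT": "9999",
}


def resolve_bound(kind, data_type, config_value, env=None):
    """Return the bound from the config row, else from the environment."""
    if config_value not in (None, ""):
        return int(config_value)  # the config file takes precedence
    env = os.environ if env is None else env
    name = f"EVL_ANON_DEFAULT_{kind}_{data_type.upper()}"
    fallback = env.get(name, DOCUMENTED_DEFAULTS.get(name))
    if fallback is None:
        raise KeyError(name)
    return int(fallback)


print(resolve_bound("MAX", "int", "500", env={}))
print(resolve_bound("MAX", "int", "", env={}))
print(resolve_bound("MAX", "short", "", env={"EVL_ANON_DEFAULT_MAX_SHORT": "100"}))
```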

Dates and times

For dates and times, min and max specify the range of values produced when calling the anonymize(value,min,max) function (e.g. by the ANON anon type). Default values for date, datetime and timestamp are set by environment variables to ‘1970-01-01’ and ‘2099-12-31’ this way:

export EVL_ANON_DEFAULT_MIN_DATE="1970-01-01"
export EVL_ANON_DEFAULT_MIN_DATETIME="1970-01-01 00:00:00"
export EVL_ANON_DEFAULT_MIN_TIMESTAMP="1970-01-01 00:00:00.000000000"
export EVL_ANON_DEFAULT_MAX_DATE="2099-12-31"
export EVL_ANON_DEFAULT_MAX_DATETIME="2099-12-31 00:00:00"
export EVL_ANON_DEFAULT_MAX_TIMESTAMP="2099-12-31 00:00:00.000000000"

When anonymize(value,seconds,minutes,hours,days,months,years) or anonymize(value,days,months,years) is used (e.g. by the ANON_VAR anon type), the min/max values are ignored and the values from these environment variables are used instead:

export EVL_ANON_DEFAULT_NANOSECONDS=500000000
export EVL_ANON_DEFAULT_SECONDS=30
export EVL_ANON_DEFAULT_MINUTES=30
export EVL_ANON_DEFAULT_HOURS=12
export EVL_ANON_DEFAULT_DAYS=15
export EVL_ANON_DEFAULT_MONTHS=6
export EVL_ANON_DEFAULT_YEARS=5

Note: All the environment variables can be defined either in project.sh (for the whole project), or in configs/anon/<source_name>.sh to have it only for sources given by <source_name>.csv.

Anon Type

The following predefined anonymization types can be specified in the ‘anon_type’ configuration column.

ANON

This type is applicable to all EVL data types. It uses the ‘anonymize(value, min, max)’ EVL function for the given data type. It takes the min/max values from the ‘min’ and ‘max’ columns if they exist; otherwise it uses the default values from the variables starting with EVL_ANON_DEFAULT_MIN_ and EVL_ANON_DEFAULT_MAX_.

In general, two different inputs may produce the same anonymized value, so the mapping is not necessarily a bijection. This is usually desirable for anonymization.

Important: For a given value and a given salt it always produces the same anonymized value.
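The two properties above (same value and salt give the same result, and different inputs may collide) can be demonstrated with a keyed hash. This sketch uses HMAC-SHA256 purely as a stand-in; the algorithm behind EVL's anonymize() is not described in this document, and the function name keyed_int is hypothetical.

```python
import hashlib
import hmac


def keyed_int(value, salt, low, high):
    """Map a value deterministically into [low, high] using a keyed hash."""
    digest = hmac.new(salt.encode(), str(value).encode(), hashlib.sha256).digest()
    span = high - low + 1
    return low + int.from_bytes(digest[:8], "big") % span


# Same value and same salt: always the same output.
first = keyed_int(12345, "s3cret", 0, 99)
second = keyed_int(12345, "s3cret", 0, 99)
print(first == second)

# 200 inputs squeezed into 100 possible outputs must collide somewhere,
# so the mapping cannot be a bijection.
outputs = {keyed_int(i, "s3cret", 0, 99) for i in range(200)}
print(len(outputs) <= 100)
```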

ANON_VAR

Can be used only for date and time data types. Internally it uses ‘anonymize(value, days, months, years)’ for date, ‘anonymize(value, seconds, minutes, hours, days, months, years)’ for datetime, and ‘anonymize(value, nanoseconds, seconds, minutes, hours, days, months, years)’ for timestamp. The variance arguments are taken from the following environment variables, which are set by default:

export EVL_ANON_DEFAULT_NANOSECONDS=500000000
export EVL_ANON_DEFAULT_SECONDS=30
export EVL_ANON_DEFAULT_MINUTES=30
export EVL_ANON_DEFAULT_HOURS=12
export EVL_ANON_DEFAULT_DAYS=15
export EVL_ANON_DEFAULT_MONTHS=6
export EVL_ANON_DEFAULT_YEARS=5

ANON_UNIQ

Can be used only for integral data types. Internally it uses the ‘anonymize_uniq()’ EVL function. It produces unique output values, so a bijection is guaranteed. Particularly useful for IDs.

Important: The mapping is kept a bijection.

ANON_NAME

Can be used only for the string data type and internally maps to ‘anonymize(<IN>," ", true)’, i.e. the length and spaces are kept, capital letters stay capital, lowercase letters stay lowercase, and digits stay digits.

ANON_EMAIL

Can be used only for the string data type and internally maps to ‘anonymize(<IN>,"@.")’, i.e. ‘@’ and dots are kept unchanged and the other parts are anonymized as usual. The length is not kept.

ANON_IBAN

ANON_IBAN_KEEP_COUNTRY

ANON_IBAN_KEEP_BANK

These types are applicable only to strings and map, respectively, to

anonymize_iban(<IN>)
anonymize_iban(<IN>,iban_anon::keep_country)
anonymize_iban(<IN>,iban_anon::keep_country_and_bank)

So in the first case the IBAN is anonymized into an IBAN of an arbitrary country while staying valid. In the second case the country is kept, and in the third case the bank is kept as well.

ANON_AMOUNT()

Applicable to int, uint, long, ulong and decimal. It is defined as

anonymize(<*IN>, <*IN> - <ARG> * <*IN>, <*IN> + <ARG> * <*IN>)

where ‘<*IN>’ is a placeholder for the input value and ‘<ARG>’ is the argument specified in the parentheses. For example, ‘ANON_AMOUNT(0.1)’ returns an anonymized value within a plus/minus 10% interval around the input.
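The interval from the definition above can be computed by hand. The following sketch only evaluates the two bounds passed to anonymize(); the function name amount_bounds is hypothetical.

```python
from decimal import Decimal


def amount_bounds(value, arg):
    """Return (value - arg * value, value + arg * value) as Decimals."""
    value = Decimal(str(value))
    arg = Decimal(str(arg))
    delta = arg * value
    return value - delta, value + delta


low, high = amount_bounds("250.00", "0.1")
print(low, high)

# For a negative input the first bound is the larger one.
print(amount_bounds(-100, "0.1"))
```

Because the bounds are proportional to the input, an input of zero yields a zero-width interval.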

MASK_LEFT()

MASK_RIGHT()

Can be used only for strings. They call ‘str_mask_left(<IN>,<ARG>)’ or ‘str_mask_right(<IN>,<ARG>)’, where ‘<IN>’ is a placeholder for the pointer to the input value and ‘<ARG>’ is the argument specified in the parentheses. For example, ‘MASK_LEFT(4)’ masks the string with ‘*’ from the left, but keeps 4 characters unchanged.
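One possible reading of the ‘MASK_LEFT(4)’ example is sketched below. It assumes that the kept characters are the rightmost ones and that strings shorter than the argument are left unchanged; neither detail is confirmed by this document, and the Python function mask_left is only a stand-in for the EVL function.

```python
def mask_left(text, keep, mask_char="*"):
    """Mask text from the left, leaving the last `keep` characters as they are."""
    if keep >= len(text):
        return text  # assumption: nothing to mask in short strings
    masked = len(text) - keep
    return mask_char * masked + text[masked:]


print(mask_left("1234567890", 4))
print(mask_left("1234567890", 4, "X"))
```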

RANDOM

Behaves very similarly to the ‘ANON’ type, but each run may return a different value. Applicable to any data type; internally calls ‘random_<data_type>(min,max)’, where min/max are taken from the config file (see Min and Max) or defined by environment variables.

RANDOM_VAR

Behaves very similarly to the ‘ANON_VAR’ type, but each run may return a different value. Applicable only to date and time data types. Internally it calls the ‘randomize()’ function.

ANON_LOOKUP

First it creates a lookup from the column and then uses it as the set of possible values. This way you get real values, but shuffled.

Important: For a given value and a given salt it always produces the same anonymized value.

ANON_LOOKUP()

Uses an already prepared lookup specified as the argument in parentheses, so anonymized values are always taken from this lookup. Such a lookup must be prepared according to this EVD

/opt/EVL-2.6/share/templates/anonymization/anon/evd/lookup.string.evd

which defines the data structure:

key	int
value	string null=""

and must be sorted by the ‘key’ field, with keys starting at zero and incremented by 1.

Important: For a given value and a given salt it always produces the same anonymized value.
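The key requirement for a prepared lookup (sorted, starting at zero, incremented by 1) is easy to check before use. The following sketch works on (key, value) pairs already read from the lookup; the function name check_lookup is hypothetical and reading the actual lookup file is out of scope here.

```python
def check_lookup(rows):
    """Verify that keys run 0, 1, 2, ... in order; return the row count."""
    count = 0
    for expected, (key, _value) in enumerate(rows):
        if key != expected:
            raise ValueError(f"row {expected}: expected key {expected}, got {key}")
        count += 1
    return count


print(check_lookup([(0, "Prague"), (1, "Brno"), (2, None)]))

try:
    check_lookup([(0, "Prague"), (2, "Brno")])
except ValueError as error:
    print("invalid lookup:", error)
```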

RANDOM_LOOKUP

Behaves very similarly to the ‘ANON_LOOKUP’ type, but each run may return a different value.

RANDOM_LOOKUP()

Behaves very similarly to the ‘ANON_LOOKUP()’ type, but each run may return a different value.

TOKEN*

The group of anonymization types starting with ‘TOKEN’ is not defined separately; these types act exactly like the types above starting with ‘ANON’. The only difference is that the conversion table is kept in the $EVL_ANON_TOKEN_DIR directory, so you can revert the anonymized values if needed. Once a value has been anonymized and is present in the conversion table, the same value will be anonymized the same way again, even if the salt has changed in the meantime.

Important: This anonymization type is reversible, which is its purpose.
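The behaviour of a conversion table can be pictured with a small in-memory sketch. The class TokenTable is hypothetical; the dictionaries stand in for the files kept in $EVL_ANON_TOKEN_DIR, whose format is not described here, and the sketch assumes the anonymizing function never returns the same token for two different values.

```python
class TokenTable:
    """In-memory stand-in for a persistent conversion table."""

    def __init__(self):
        self.forward = {}   # original value -> token
        self.backward = {}  # token -> original value

    def tokenize(self, value, anonymize):
        # A value already in the table keeps its token, whatever
        # function (or salt) is passed in later.
        if value not in self.forward:
            token = anonymize(value)
            self.forward[value] = token
            self.backward[token] = value
        return self.forward[value]

    def revert(self, token):
        return self.backward[token]


table = TokenTable()
token_1 = table.tokenize("alice", lambda v: v[::-1] + "1")
token_2 = table.tokenize("alice", lambda v: v.upper())  # "salt" changed
print(token_1, token_2, table.revert(token_1))
```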

All of these anonymization types are predefined in

/opt/EVL-2.6/share/templates/anonymization/configs/anon/anon-functions-config.csv

It is actually a mapping from an anonymization type and a data type to the exact internal EVL (or custom C/C++) function. For example, ‘MASK_LEFT()’ is defined by

anon_type;evl_datatype;evl_value;description
MASK_LEFT();string;str_mask_left(<IN>,<ARG>);

Custom Anon Types

You can also define your own custom anonymization types within your Anonymization project by placing a file configs/anon/anon-functions-config.csv of the same structure in your project folder. For example, you can define an anonymization type ‘MASK_LEFT_X()’, which masks the input from the left with ‘X’ but keeps <ARG> characters unchanged, this way:

anon_type;evl_datatype;evl_value;description
MASK_LEFT_X();string;str_mask_left(<IN>,<ARG>,'X');
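To illustrate how a parameterized type such as ‘MASK_LEFT_X(4)’ in the config file could be matched with its ‘MASK_LEFT_X()’ definition, the sketch below splits off the argument and substitutes it for <ARG>. EVL's actual resolution logic is not shown in this document, so the function names are hypothetical and <IN> is deliberately left untouched.

```python
import re


def split_anon_type(anon_type):
    """'MASK_LEFT(4)' -> ('MASK_LEFT()', '4'); 'ANON' -> ('ANON', None)."""
    match = re.fullmatch(r"(\w+)\((.*)\)", anon_type)
    if match is None:
        return anon_type, None
    return match.group(1) + "()", match.group(2)


def resolve(template, anon_type):
    """Return the definition name and the template with <ARG> filled in."""
    name, arg = split_anon_type(anon_type)
    if arg is not None:
        template = template.replace("<ARG>", arg)
    return name, template


print(resolve("str_mask_left(<IN>,<ARG>,'X')", "MASK_LEFT_X(4)"))
```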

Placeholders

The following placeholders can be used in the Anonymization Type definition file:

<IN>
<*IN>: to be replaced by the value of the input field (‘<*IN>’) or by the pointer to that value (‘<IN>’)
<ARG>: for Anon Types ending with ‘()’ it resolves to the content specified in the parentheses
<MIN>
<MAX>: to be replaced by values from the min/max fields of the config file or by default values specified by the EVL_ANON_DEFAULT_MIN_* and EVL_ANON_DEFAULT_MAX_* environment variables
<FIELD_NAME>: to be replaced by the field name
<NANOSECONDS>
<SECONDS>
<MINUTES>
<HOURS>
<DAYS>
<MONTHS>
<YEARS>: to be replaced by values specified by environment variables: EVL_ANON_DEFAULT_NANOSECONDS, EVL_ANON_DEFAULT_SECONDS, EVL_ANON_DEFAULT_MINUTES, EVL_ANON_DEFAULT_HOURS, EVL_ANON_DEFAULT_DAYS, EVL_ANON_DEFAULT_MONTHS, and EVL_ANON_DEFAULT_YEARS

EVL Value

Important: When the ‘anon_type’ column also has a value in the config file, this ‘evl_value’ is applied first and then the anonymization type is applied as well.

This field can contain one or more EVL mapping functions to be applied to the given field. This can be used for special cases, when you don’t want to create a new custom anonymization type or when you need to prepare a field first.

For example, it can be used to preserve some business logic, such as one date having to be later than another:

field_name	evl_datatype	anon_type	evl_value
`birth_date`	`date`	`ANON`
`death_date`	`date`		`anonymize(IN, out->birth_date+1, out->birth_date+36500)`

In this example the anonymized date of death always lies between the anonymized birth date plus one day and the anonymized birth date plus 36500 days (roughly 100 years).
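The window used for the date of death can be checked with plain date arithmetic. This sketch only computes the two bounds passed to anonymize() in the example; the function name death_date_window and the sample birth date are made up for illustration.

```python
from datetime import date, timedelta


def death_date_window(anonymized_birth):
    """Return the (min, max) dates allowed for the anonymized death date."""
    return (
        anonymized_birth + timedelta(days=1),
        anonymized_birth + timedelta(days=36500),
    )


birth = date(1950, 6, 15)
low, high = death_date_window(birth)
print(low, high)

# 36500 days is a little short of 100 calendar years because of leap days.
print(high < date(2050, 6, 15))
```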

Ignored Values

Besides NULLs, there are usually several other values which you’d like anonymization to keep unchanged, for example dates like ‘1970-01-01’, ‘4712-12-31’, or ‘9999-12-31’.

For such purpose, these environment variables can be defined either in project.sh (for the whole project), or in configs/anon/<source_name>.sh to have it only for sources given by <source_name>.csv:

EVL_ANON_IGNORE_DATE
EVL_ANON_IGNORE_DATETIME
EVL_ANON_IGNORE_TIMESTAMP: can be empty or contain a comma-separated list of values to be ignored when the given data type is used
EVL_ANON_IGNORE_STRING: can be empty or contain a comma-separated list of values to be ignored when string is used
EVL_ANON_IGNORE_CHAR
EVL_ANON_IGNORE_SHORT
EVL_ANON_IGNORE_INT
EVL_ANON_IGNORE_LONG
EVL_ANON_IGNORE_DECIMAL
EVL_ANON_IGNORE_FLOAT
EVL_ANON_IGNORE_DOUBLE: can be empty or contain a comma-separated list of numbers to be ignored when the given data type is used. Unsigned variants of integral data types use the lists of their signed counterparts.

Here is a typical example for keeping dates unchanged when anonymizing database tables.

export EVL_ANON_IGNORE_DATE="1970-01-01,4712-12-31,9999-12-31"
export EVL_ANON_IGNORE_DATETIME="1970-01-01 00:00:00,4712-12-31 00:00:00"
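The effect of such a list can be sketched as a simple membership test applied before anonymization. The function names are hypothetical, and trimming whitespace around the list items is an assumption not confirmed by this document.

```python
def parse_ignore_list(raw):
    """Split a comma-separated ignore list into a set of values."""
    return {item.strip() for item in raw.split(",") if item.strip()}


def should_anonymize(value, ignored):
    """Values on the ignore list are passed through unchanged."""
    return value not in ignored


ignored_dates = parse_ignore_list("1970-01-01,4712-12-31,9999-12-31")
print(should_anonymize("9999-12-31", ignored_dates))
print(should_anonymize("2021-12-08", ignored_dates))
```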
