Config File
The main configuration file describes each field to be processed. It is shared by all metadata-driven EVL Microservices, so it resides in the project's configs directory. It is a csv file named after its source, so a project can contain several config files, for example:
    configs/mssql_exports.csv
    configs/oracle_db1_tables.csv
    configs/oracle_db2_tables.csv
    configs/some_json_source.csv
Here is an example showing most of the columns of such a configuration csv file (semicolon-delimited by default):
src_type | entity_name | order | field_name | data_type | null | anon_type | evl_value |
---|---|---|---|---|---|---|---|
ORA | accounts | | id | int | No | ANON_UNIQ | |
ORA | accounts | | cust_id | int | No | ANON_LOOKUP | |
ORA | accounts | | iban | string | | ANON_IBAN | |
ORA | accounts | | currency | string | | | |
ORA | accounts | | score | decimal(8,2) | | ANON_AMOUNT(0.1) | |
ORA | accounts | | when | date | | ANON_VAR | |
FILE | cust.csv | 1 | id | int | No | ANON_UNIQ | |
FILE | cust.csv | 2 | email | string | | ANON_EMAIL | |
FILE | cust.csv | 3 | pers_id | string | No | | anon_rc(IN) |
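As a rough illustration of the file layout, the following Python sketch reads a semicolon-delimited config with the standard csv module and groups the fields by entity. The column names are taken from the example table above; the sample text and the helper name load_config are assumptions made for this sketch and are not part of EVL.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample mirroring the example table; real files live in configs/.
SAMPLE = """\
src_type;entity_name;order;field_name;data_type;null;anon_type;evl_value
ORA;accounts;;id;int;No;ANON_UNIQ;
ORA;accounts;;iban;string;;ANON_IBAN;
FILE;cust.csv;1;id;int;No;ANON_UNIQ;
FILE;cust.csv;2;email;string;;ANON_EMAIL;
"""


def load_config(text, delimiter=";"):
    """Return {entity_name: [row dict, ...]} from a delimited config text."""
    entities = defaultdict(list)
    for row in csv.DictReader(io.StringIO(text), delimiter=delimiter):
        entities[row["entity_name"]].append(row)
    return dict(entities)


config = load_config(SAMPLE)
for entity, fields in config.items():
    # Print each entity with the names of its configured fields.
    print(entity, [field["field_name"] for field in fields])
```

Empty cells come back as empty strings, which matches how empty ‘null’ and ‘anon_type’ values are described in the sections that follow.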
A description of each column of the configuration file follows.
Source Type
is used to distinguish different inputs. Possible values are:
FILE, file
for any supported file format, e.g. avro, csv, json, parquet, qvd, xls, xlsx, xml. The file format is determined from the file suffix of the Entity name.
avro, csv, json, parquet, qvd, xls, xlsx, xml
to be used only when the file name or mask (i.e. the Entity name) has no proper file suffix.
MySQL
for MySQL/MariaDB tables
ORA, Oracle
for Oracle tables
PG, PostgreSQL
for PostgreSQL tables
TD, Teradata
for Teradata tables
When the ‘source_type’ field is empty, ‘FILE’ is assumed.
Entity Name
is mandatory and is either a file name/mask or a table name.
For file sources, the file name or mask should contain a proper file extension so that the format can be recognized. The compression suffixes .gz, .tar.gz, and .zip are also supported and processed properly.
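The rules from the Source Type and Entity Name descriptions can be sketched in Python as follows. This is only an illustration of the described behaviour, not EVL's implementation: the function name resolve_file_format is hypothetical, and the case-insensitive comparison and the precedence of an explicit format over the suffix are assumptions.

```python
# File formats and compression suffixes listed in the text above.
FILE_FORMATS = {"avro", "csv", "json", "parquet", "qvd", "xls", "xlsx", "xml"}
COMPRESSION_SUFFIXES = (".tar.gz", ".gz", ".zip")  # longest suffix first


def resolve_file_format(src_type, entity_name):
    """Guess the file format of a file source, or None if undetermined."""
    src = (src_type or "FILE").strip().lower()  # empty source type means FILE
    if src in FILE_FORMATS:
        return src  # format named explicitly in the source type column
    if src != "file":
        return None  # a database source such as ORA or PG, not a file
    name = entity_name.lower()
    for suffix in COMPRESSION_SUFFIXES:
        if name.endswith(suffix):
            name = name[: -len(suffix)]  # drop the compression suffix
            break
    _, dot, ext = name.rpartition(".")
    return ext if dot and ext in FILE_FORMATS else None


print(resolve_file_format("FILE", "cust.csv"))
print(resolve_file_format("", "export_*.json.gz"))
print(resolve_file_format("xml", "daily_dump"))
```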
Field Order
is mandatory only for csv files. It specifies the position of the given field in a file, counted from 1.
Specifying the order for json, qvd, xls, xlsx, or xml files lets the EVL Read component skip the --match-fields option, which speeds up the anonymization.
Field Name
is simply a field name. Case sensitive!
The field name is also used for mapping generation, so it appears in the EVM mapping file and can be used in the ‘evl_value’ configuration field.
EVL Datatype
Possible EVL data types are:
string
to be used for all texts, varchars, etc.
char, uchar, short, ushort, int, uint, long, ulong, int128, uint128
to be used for all corresponding integral data types
decimal
to be used for decimal(m,n) and number(m,n) data types
float, double
to be used for float, double, real and similar source data types
date, datetime, timestamp
to be used for corresponding date and time data types, where date keeps only the date, datetime keeps date and time (in seconds), and timestamp can also hold nanoseconds and time zone.
Format
specifies the format for date and time data types and the number of digits and decimal places for decimals.
For the decimal data type you must specify ‘m,n’, where ‘m’ is the total number of digits and ‘n’ is the number of decimal places.
For date and time data types, when no format is specified, defaults are used:
EVL_DEFAULT_DATE_PATTERN="%Y-%m-%d"
to specify default formatting string for ‘date’ data type
EVL_DEFAULT_TIME_PATTERN="%H:%M:%S"
to specify default formatting string for ‘time’ data type
EVL_DEFAULT_DATETIME_PATTERN="%Y-%m-%d %H:%M:%S"
to specify default formatting string for ‘datetime’ data type
EVL_DEFAULT_TIMESTAMP_PATTERN="%Y-%m-%d %H:%M:%E*S"
to specify default formatting string for ‘timestamp’ data type
To specify a formatting pattern, standard C notation can be used. Examples of such formatting strings:
%d.%m.%Y
to have a date like ‘08.12.2021’
%d-%b-%Y
to have a date like ‘08-Dec-2021’
%Y%m%d%H%M%S
to have a datetime like ‘20211208093000’
%Y-%m-%dT%H:%M:%S.%E3f
to have a timestamp like ‘2021-12-08T09:30:59.545’
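To try out a pattern before putting it into the config file, the standard C directives can be checked with Python's datetime module, as in the sketch below. It only covers the first three example patterns; the ‘%E*S’ and ‘%E3f’ specifiers used for timestamps are extensions that Python's strftime does not understand, so they are left out here. Month abbreviations such as ‘%b’ depend on the active locale.

```python
from datetime import datetime

moment = datetime(2021, 12, 8, 9, 30, 0)

# Patterns copied from the examples above (standard C directives only).
for pattern in ("%d.%m.%Y", "%d-%b-%Y", "%Y%m%d%H%M%S"):
    print(pattern, "->", moment.strftime(pattern))

# The same pattern can be used in reverse to parse a source value.
parsed = datetime.strptime("20211208093000", "%Y%m%d%H%M%S")
print(parsed == moment)
```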
Nullable
says whether the field is nullable. When empty, the field is assumed to be nullable. Possible values are:
1, Y, y, yes, Yes, True, true
for NULLABLE
0, N, n, no, No, False, false
for NOT NULL
For file sources, when the field is NULLABLE, an empty string is interpreted as NULL and anonymization functions leave such a field NULL.
Important: When a field is flagged as NOT NULL, an empty string is treated as a regular string, so anonymization functions anonymize it. In that case an empty string can yield a non-empty anonymized value.
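The interpretation of the nullable flag and of empty strings in file sources can be summarized by the following Python sketch. The function names are hypothetical, and raising an error for an unlisted flag value is an assumption of this sketch; the text above does not say how EVL reacts to other values.

```python
NULLABLE_VALUES = {"1", "Y", "y", "yes", "Yes", "True", "true"}
NOT_NULL_VALUES = {"0", "N", "n", "no", "No", "False", "false"}


def is_nullable(flag):
    """Interpret the nullable column; an empty flag means nullable."""
    flag = (flag or "").strip()
    if flag == "" or flag in NULLABLE_VALUES:
        return True
    if flag in NOT_NULL_VALUES:
        return False
    raise ValueError(f"unsupported nullable flag: {flag!r}")  # assumption


def read_file_value(raw, nullable):
    """Map a raw value from a file to None (NULL) or keep it as a string."""
    if nullable and raw == "":
        return None  # NULL stays NULL and is not anonymized
    return raw  # for NOT NULL fields "" is a regular string to anonymize


print(is_nullable(""), is_nullable("No"))
print(read_file_value("", True), repr(read_file_value("", False)))
```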
Min and Max
Strings
For strings, these columns specify the length range of the strings produced by anonymization.
By default ‘min’ = 0 and ‘max’ = 10 are assumed, so for string fields whose length is less than 10 it is effectively mandatory to set the ‘max’ length. To change these defaults, specify other values in the project.sh configuration file.
    export EVL_ANON_DEFAULT_MIN_STRING_LENGTH=0
    export EVL_ANON_DEFAULT_MAX_STRING_LENGTH=10
Numbers
For numbers, these columns contain the range of values that anonymization should produce. Here are the environment variables with their default values (min/max values from the config file take precedence):
    export EVL_ANON_DEFAULT_MIN_CHAR=0
    export EVL_ANON_DEFAULT_MIN_SHORT=0
    export EVL_ANON_DEFAULT_MIN_INT=0
    export EVL_ANON_DEFAULT_MIN_LONG=0
    export EVL_ANON_DEFAULT_MIN_DECIMAL=0
    export EVL_ANON_DEFAULT_MIN_FLOAT=0
    export EVL_ANON_DEFAULT_MIN_DOUBLE=0
    export EVL_ANON_DEFAULT_MAX_CHAR=99
    export EVL_ANON_DEFAULT_MAX_SHORT=9999
    export EVL_ANON_DEFAULT_MAX_INT=99999999
    export EVL_ANON_DEFAULT_MAX_LONG=99999999
    export EVL_ANON_DEFAULT_MAX_DECIMAL=99999999
    export EVL_ANON_DEFAULT_MAX_FLOAT=1000000
    export EVL_ANON_DEFAULT_MAX_DOUBLE=1000000
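The stated precedence (config file first, then environment variable, then the shipped default) can be pictured with this small Python sketch. The helper resolve_range and the BUILTIN_DEFAULTS table are hypothetical; the table copies only three of the defaults listed above, and the values are kept as strings as they would appear in a csv cell.

```python
import os

# A subset of the default values listed above, keyed by EVL data type.
BUILTIN_DEFAULTS = {
    "short": ("0", "9999"),
    "int": ("0", "99999999"),
    "float": ("0", "1000000"),
}


def resolve_range(data_type, cfg_min="", cfg_max="", env=None):
    """Return (min, max): config values win over environment defaults."""
    env = os.environ if env is None else env
    key = data_type.upper()
    default_min, default_max = BUILTIN_DEFAULTS[data_type]
    low = cfg_min or env.get(f"EVL_ANON_DEFAULT_MIN_{key}", default_min)
    high = cfg_max or env.get(f"EVL_ANON_DEFAULT_MAX_{key}", default_max)
    return low, high


# The 'max' from the config file overrides the environment variable.
print(resolve_range("int", cfg_max="500",
                    env={"EVL_ANON_DEFAULT_MAX_INT": "1000"}))
```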
Dates and times
For dates and times, these columns contain the range of values produced when calling the anonymize(value,min,max) function (e.g. by the ANON anon type).
Default values for date, datetime and timestamp are set by environment variables to ‘1970-01-01’ and ‘2099-12-31’ this way:
    export EVL_ANON_DEFAULT_MIN_DATE="1970-01-01"
    export EVL_ANON_DEFAULT_MIN_DATETIME="1970-01-01 00:00:00"
    export EVL_ANON_DEFAULT_MIN_TIMESTAMP="1970-01-01 00:00:00.000000000"
    export EVL_ANON_DEFAULT_MAX_DATE="2099-12-31"
    export EVL_ANON_DEFAULT_MAX_DATETIME="2099-12-31 00:00:00"
    export EVL_ANON_DEFAULT_MAX_TIMESTAMP="2099-12-31 00:00:00.000000000"
When anonymize(value,seconds,minutes,hours,days,months,years) or anonymize(value,days,months,years) is used (e.g. by the ANON_VAR anon type), the min/max values are ignored and the values of these environment variables are used instead:
    export EVL_ANON_DEFAULT_NANOSECONDS=500000000
    export EVL_ANON_DEFAULT_SECONDS=30
    export EVL_ANON_DEFAULT_MINUTES=30
    export EVL_ANON_DEFAULT_HOURS=12
    export EVL_ANON_DEFAULT_DAYS=15
    export EVL_ANON_DEFAULT_MONTHS=6
    export EVL_ANON_DEFAULT_YEARS=5
Anon Type
The following predefined types of anonymization can be specified in the ‘anon_type’ configuration column.
ANON
this type is applicable to all EVL data types. It uses the ‘anonymize(value, min, max)’ EVL function for the given data type. It takes the min/max values from the ‘min’ and ‘max’ columns if they exist, otherwise it uses default values from variables starting with EVL_ANON_DEFAULT_MIN_ and EVL_ANON_DEFAULT_MAX_.
In general, two different inputs may produce the same anonymized value, so the mapping is not necessarily a bijection, which is usually desirable for anonymization.
Important: For given value and given salt it produces always the same anonymized value.
ANON_VAR
can be used only for date and time data types. Internally it uses ‘anonymize(value, days, months, years)’ for date, ‘anonymize(value, seconds, minutes, hours, days, months, years)’ for datetime, and ‘anonymize(value, nanoseconds, seconds, minutes, hours, days, months, years)’ for timestamp. Variance arguments are taken from the following environment variables, which are set by default:
    export EVL_ANON_DEFAULT_NANOSECONDS=500000000
    export EVL_ANON_DEFAULT_SECONDS=30
    export EVL_ANON_DEFAULT_MINUTES=30
    export EVL_ANON_DEFAULT_HOURS=12
    export EVL_ANON_DEFAULT_DAYS=15
    export EVL_ANON_DEFAULT_MONTHS=6
    export EVL_ANON_DEFAULT_YEARS=5
ANON_UNIQ
can be used only for integral data types. Internally it uses the ‘anonymize_uniq()’ EVL function. It produces unique output values, so a bijection is guaranteed. Particularly useful for IDs.
Important: It keeps the mapping to be a bijection.
ANON_NAME
can be used only for the string data type and internally maps to ‘anonymize(<IN>," ", true)’, i.e. it keeps the length and spaces; capital letters remain capitals, lowercase letters remain lowercase, and digits remain digits.
ANON_EMAIL
can be used only for the string data type and internally maps to ‘anonymize(<IN>,"@.")’, i.e. it keeps ‘@’ and dots unchanged, while other parts are anonymized as usual. The length is not kept.
ANON_IBAN
ANON_IBAN_KEEP_COUNTRY
ANON_IBAN_KEEP_BANK
applicable only to strings; these types map to
    anonymize_iban(<IN>)
    anonymize_iban(<IN>,iban_anon::keep_country)
    anonymize_iban(<IN>,iban_anon::keep_country_and_bank)
So in the first case it anonymizes the IBAN into an IBAN of an arbitrary country but keeps it valid. In the second case it keeps the country, and in the third it also keeps the bank.
ANON_AMOUNT()
is applicable to int, uint, long, ulong and decimal. It is defined as
    anonymize(<*IN>, <*IN> - <ARG> * <*IN>, <*IN> + <ARG> * <*IN>)
where ‘<*IN>’ is a placeholder for the input value and ‘<ARG>’ is the argument specified in the parentheses. For example, ‘ANON_AMOUNT(0.1)’ returns an anonymized value within a plus/minus 10% interval around the input.
MASK_LEFT()
MASK_RIGHT()
can be used only for strings; these types call ‘str_mask_left(<IN>,<ARG>)’ or ‘str_mask_right(<IN>,<ARG>)’, where ‘<IN>’ is a placeholder for the pointer to the input value and ‘<ARG>’ is the argument specified in the parentheses. For example, ‘MASK_LEFT(4)’ masks the string with ‘*’ from the left but keeps 4 characters unchanged.
RANDOM
behaves much like the ‘ANON’ type, but each run may return a different value. Applicable to any data type; internally it calls ‘random_<data_type>(min,max)’, where min/max is taken from the config file (see Min and Max) or is defined by environment variables.
RANDOM_VAR
behaves much like the ‘ANON_VAR’ type, but each run may return a different value. Applicable only to date and time data types. Internally it calls the ‘randomize()’ function.
ANON_LOOKUP
first it creates a lookup from the column and then uses it as the set of possible values. This way you get real values, but shuffled.
Important: For given value and given salt it produces always the same anonymized value.
ANON_LOOKUP()
it uses an already prepared lookup specified as the argument in the parentheses, so anonymized values are always taken from this lookup. Such a lookup must be prepared according to this EVD
/opt/EVL-2.6/share/templates/anonymization/anon/evd/lookup.string.evd
which defines the data structure:
    key int
    value string null=""
and it must be sorted by the ‘key’ field, with keys starting at zero and incremented by 1.
Important: For given value and given salt it produces always the same anonymized value.
RANDOM_LOOKUP
behaves much like the ‘ANON_LOOKUP’ type, but each run may return a different value.
RANDOM_LOOKUP()
behaves much like the ‘ANON_LOOKUP()’ type, but each run may return a different value.
TOKEN*
the group of anonymization types starting with ‘TOKEN’ is not defined separately; these types act exactly like the corresponding types above starting with ‘ANON’. The only difference is that the conversion table is kept in the $EVL_ANON_TOKEN_DIR directory, so you can revert the anonymized values if needed. Once a value has been anonymized and is therefore present in the conversion table, the same value will always be anonymized the same way, even if the salt has changed in the meantime.
Important: This anonymization type is reversible, but that is its purpose.
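Two of the rules in the list above are easy to check by hand: the interval that ANON_AMOUNT() passes to anonymize(), and the key sequence required of a lookup used by ANON_LOOKUP(). The Python sketch below illustrates both. The function names are hypothetical and the code only reproduces the arithmetic and the ordering rule as stated in the text; it does not call or emulate any EVL function.

```python
from decimal import Decimal


def amount_bounds(value, arg):
    """Bounds used by ANON_AMOUNT(arg): value - arg*value and value + arg*value."""
    delta = arg * value
    return value - delta, value + delta


def lookup_keys_are_valid(keys):
    """True if the keys are 0, 1, 2, ... in exactly that order."""
    keys = list(keys)
    return keys == list(range(len(keys)))


# ANON_AMOUNT(0.1) applied to 200.00 gives a plus/minus 10% interval.
low, high = amount_bounds(Decimal("200.00"), Decimal("0.1"))
print(low, high)

print(lookup_keys_are_valid([0, 1, 2]))  # sorted, starts at zero
print(lookup_keys_are_valid([1, 2, 3]))  # does not start at zero
```

Decimal is used here instead of float so that the 10% step is computed exactly.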
All of these anonymization types are predefined in
/opt/EVL-2.6/share/templates/anonymization/configs/anon/anon-functions-config.csv
It is actually a mapping from an anonymization type and a data type to the exact internal EVL (or custom C/C++) function. For example, ‘MASK_LEFT()’ is defined by
    anon_type;evl_datatype;evl_value;description
    MASK_LEFT();string;str_mask_left(<IN>,<ARG>);
Custom Anon Types
You can also define your own custom anonymization types within your Anonymization project by placing a file configs/anon/anon-functions-config.csv of the same structure in your project folder. For example, you can define an anonymization type ‘MASK_LEFT_X()’, which masks the input from the left with ‘X’ but keeps <ARG> characters unchanged, this way:
    anon_type;evl_datatype;evl_value;description
    MASK_LEFT_X();string;str_mask_left(<IN>,<ARG>,'X');
Placeholders
The following placeholders can be used in the Anonymization Type definition file:
<IN>
<*IN>
to be replaced by the value of the input field (‘<*IN>’) or by the pointer to that value (‘<IN>’)
<ARG>
for Anon Types ending with ‘()’ it resolves to the content specified in the parentheses
<MIN>
<MAX>
to be replaced by values from the min/max fields of the config file or by default values specified by the EVL_ANON_DEFAULT_MIN_* and EVL_ANON_DEFAULT_MAX_* environment variables
<FIELD_NAME>
to be replaced by the field name
<NANOSECONDS>
<SECONDS>
<MINUTES>
<HOURS>
<DAYS>
<MONTHS>
<YEARS>
to be replaced by values specified by the environment variables EVL_ANON_DEFAULT_NANOSECONDS, EVL_ANON_DEFAULT_SECONDS, EVL_ANON_DEFAULT_MINUTES, EVL_ANON_DEFAULT_HOURS, EVL_ANON_DEFAULT_DAYS, EVL_ANON_DEFAULT_MONTHS, and EVL_ANON_DEFAULT_YEARS
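To make the placeholder mechanism concrete, here is a minimal Python sketch of how an anon type such as ‘MASK_LEFT_X(3)’ could be matched to its definition row and have its placeholders filled in. It is an illustration only: the functions fill and expand are hypothetical, the DEFINITIONS table holds just the two rows shown in this document, and the way EVL itself resolves ‘<IN>’ and ‘<*IN>’ is not modelled, so those are left untouched.

```python
import re

# The two definition rows quoted in this document: (anon_type, evl_datatype).
DEFINITIONS = {
    ("MASK_LEFT()", "string"): "str_mask_left(<IN>,<ARG>)",
    ("MASK_LEFT_X()", "string"): "str_mask_left(<IN>,<ARG>,'X')",
}


def fill(template, values):
    """Replace each <NAME> placeholder that has a value in `values`."""
    for name, value in values.items():
        template = template.replace(f"<{name}>", str(value))
    return template


def expand(anon_type, evl_datatype, values=None):
    """Resolve e.g. 'MASK_LEFT(4)' to its evl_value with <ARG> filled in."""
    values = dict(values or {})
    match = re.fullmatch(r"(\w+)\((.*)\)", anon_type)
    if match:
        key = match.group(1) + "()"  # definitions are stored as NAME()
        values["ARG"] = match.group(2)
    else:
        key = anon_type
    return fill(DEFINITIONS[(key, evl_datatype)], values)


print(expand("MASK_LEFT(4)", "string"))
print(expand("MASK_LEFT_X(3)", "string"))
```

Values for ‘<MIN>’, ‘<MAX>’ or ‘<FIELD_NAME>’ could be passed through the values argument in the same way.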
EVL Value
Important: When the ‘anon_type’ column also has a value in the config file, this ‘evl_value’ is applied first and the anonymization type afterwards.
This field can contain EVL mapping function(s) to be applied to the given field. This can be used for special cases, when you don’t want to create a new custom anonymization type or when you need to prepare a field first.
For example, you may want to preserve some business logic, such as one date having to be later than another:
field_name | evl_datatype | anon_type | evl_value |
---|---|---|---|
birth_date | date | ANON | |
death_date | date | | anonymize(IN, *out->birth_date+1, *out->birth_date+36500) |
In this example the anonymized date of death always lies between the anonymized birth date plus one day and the anonymized birth date plus roughly 100 years (36500 days).
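The bounds in this example can be checked with a few lines of Python. The sketch only reproduces the date arithmetic of the two range arguments; the function name and the sample birth date are made up for illustration, and no EVL function is called.

```python
from datetime import date, timedelta


def death_date_bounds(anonymized_birth):
    """Range for death_date: birth + 1 day up to birth + 36500 days."""
    return (anonymized_birth + timedelta(days=1),
            anonymized_birth + timedelta(days=36500))


low, high = death_date_bounds(date(1980, 5, 17))
print(low, high)
```

Because both bounds are derived from the already anonymized birth date, any value picked from this range is later than the anonymized birth date.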
Ignored Values
Besides NULLs, there are usually several other values that you would like to keep unchanged by anonymization, for example dates like ‘1970-01-01’, ‘4712-12-31’, or ‘9999-12-31’.
For this purpose, the following environment variables can be defined either in project.sh (for the whole project), or in configs/anon/<source_name>.sh to apply them only to sources given by <source_name>.csv:
EVL_ANON_IGNORE_DATE
EVL_ANON_IGNORE_DATETIME
EVL_ANON_IGNORE_TIMESTAMP
can be empty or contain a comma-separated list of values to be ignored when the given data type is used,
EVL_ANON_IGNORE_STRING
can be empty or contain a comma-separated list of values to be ignored when string is used,
EVL_ANON_IGNORE_CHAR
EVL_ANON_IGNORE_SHORT
EVL_ANON_IGNORE_INT
EVL_ANON_IGNORE_LONG
EVL_ANON_IGNORE_DECIMAL
EVL_ANON_IGNORE_FLOAT
EVL_ANON_IGNORE_DOUBLE
can be empty or contain a comma-separated list of numbers to be ignored when the given data type is used. Unsigned variants of integral data types use the variables of their signed counterparts.
Here is a typical example for keeping dates unchanged when anonymizing database tables:
    export EVL_ANON_IGNORE_DATE="1970-01-01,4712-12-31,9999-12-31"
    export EVL_ANON_IGNORE_DATETIME="1970-01-01 00:00:00,4712-12-31 00:00:00"
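How such a list might be interpreted can be sketched in Python as below. The helper names are hypothetical, and two details are assumptions of this sketch rather than documented behaviour: values are compared as plain strings, and no whitespace is trimmed around the commas (datetime values contain spaces themselves).

```python
import os

# Unsigned integral types share the variable of their signed counterpart.
SIGNED_VARIANT = {"uchar": "char", "ushort": "short", "uint": "int", "ulong": "long"}


def ignored_values(data_type, env=None):
    """Return the set of values kept unchanged for the given data type."""
    env = os.environ if env is None else env
    name = SIGNED_VARIANT.get(data_type, data_type).upper()
    raw = env.get(f"EVL_ANON_IGNORE_{name}", "")
    return {item for item in raw.split(",") if item}


def should_anonymize(value, data_type, env=None):
    """NULLs and ignored values are left alone, everything else is anonymized."""
    return value is not None and value not in ignored_values(data_type, env)


env = {"EVL_ANON_IGNORE_DATE": "1970-01-01,4712-12-31,9999-12-31"}
print(should_anonymize("1970-01-01", "date", env))
print(should_anonymize("2021-12-08", "date", env))
```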