EVL – ETL Tool

Products, services and company names referenced in this document may be either trademarks or registered trademarks of their respective owners.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.3 or any later version published by the Free Software Foundation; with no Invariant Sections, with no Front-Cover Texts, and with no Back-Cover Texts.

Read

(since EVL 1.0)

Reads the <source>(s) (a file mask can be specified) and sends the data to output <f_out>. Multiple <source>s are concatenated.

It automatically parses various file formats: ‘Avro’, ‘json’, ‘Parquet’, ‘QVD’, ‘xls’, ‘xlsx’ and ‘xml’, just based on file suffix.

When a compression suffix such as ‘gz’, ‘tar’, ‘bz2’, ‘zip’ or ‘Z’ is recognized, the data are decompressed automatically.

When <source> starts with ‘file://’, ‘sftp://’, ‘hdfs://’, ‘s3://’, ‘gs://’ or ‘smb://’, the appropriate utility is used to get the data from that location. If no URI Scheme is present, the source is read from the local file system.

When <source> starts with ‘mysql://’, ‘postgres://’, ‘oracle://’, ‘sqlite://’ or ‘teradata://’ it uses appropriate utility to get data from such database.

Besides the options mentioned below, which change the file suffix behaviour, one can use the generic ‘--cmd=<cmd>’ option, which calls ‘echo <source>... | xargs <cmd>’ to obtain the input for this component. <cmd> can also be a pipeline (that is the reason for xargs). See examples below for inspiration.
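
The ‘echo <source>... | xargs <cmd>’ mechanism can be tried out with standard tools only. The following is a minimal sketch, not EVL's actual implementation; the sample file names and the commands ‘cat’ and ‘tr’ are assumptions chosen purely for illustration.

```shell
#!/bin/sh
# Illustration only: mimic 'echo <source>... | xargs <cmd>' with plain tools.
tmpdir=$(mktemp -d)
printf 'alpha\n' > "$tmpdir/part1.txt"   # hypothetical sample sources
printf 'beta\n'  > "$tmpdir/part2.txt"

# <cmd> = 'cat': both sources are passed as arguments to a single cat call,
# so their contents are concatenated.
plain=$(echo "$tmpdir/part1.txt" "$tmpdir/part2.txt" | xargs cat)

# <cmd> = 'cat | tr a-z A-Z': xargs feeds the first command of the pipeline,
# the remaining stage processes its output.
piped=$(echo "$tmpdir/part1.txt" "$tmpdir/part2.txt" | xargs cat | tr a-z A-Z)

printf '%s\n' "$plain" "$piped"
rm -rf "$tmpdir"
```

Because the sources arrive as arguments of the first command only, any further stages of a pipeline see a single concatenated stream.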

Read: is to be used in an EVS job structure definition file. <f_out> is either an output file or a flow name.
evl read: is intended for standalone usage, i.e. to be invoked from the command line and write to standard output.

EVD is the EVL data definition file and EVS defines the EVL job structure; for details see evl-evd(5) and evl-evs(5).

URI Scheme:

Based on the URI Scheme in the <source>, it calls appropriate utility to get files or tables.

no scheme, ‘file://’: assumes the local filesystem
‘hdfs://’: calls ‘hdfs dfs’ utility
‘gs://’: calls Google’s ‘gsutil’ utility
‘s3://’: calls AWS’s ‘aws s3’ utility
‘sftp://’: calls ‘ssh’ utility
‘smb://’: calls ‘smbclient’ utility
‘mysql://’: calls ‘mysql’ utility
‘postgres://’: calls PostgreSQL’s ‘psql’ command together with EVL binary ‘evl_readpg’ to manipulate data in binary way
‘oracle://’: calls Oracle’s ‘sqlplus’ utility
‘sqlite://’: calls ‘sqlite3’ binary
‘teradata://’: calls Teradata’s ‘fexp’ (FastExport) command together with EVL binary ‘evl_read_teradata_fastexport’ to manipulate data in binary way
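
The mapping above can be pictured as a simple prefix match. The sketch below only illustrates that list and is not how EVL is implemented; the function name ‘scheme_utility’ and the sample sources are hypothetical.

```shell
#!/bin/sh
# Illustration only: map a <source> to the utility named in the list above.
# 'scheme_utility' is a hypothetical helper, not part of EVL.
scheme_utility() {
  case "$1" in
    hdfs://*)     echo 'hdfs dfs' ;;
    gs://*)       echo 'gsutil' ;;
    s3://*)       echo 'aws s3' ;;
    sftp://*)     echo 'ssh' ;;
    smb://*)      echo 'smbclient' ;;
    mysql://*)    echo 'mysql' ;;
    postgres://*) echo 'psql' ;;
    oracle://*)   echo 'sqlplus' ;;
    sqlite://*)   echo 'sqlite3' ;;
    teradata://*) echo 'fexp' ;;
    *)            echo 'local filesystem' ;;  # no scheme or file://
  esac
}

scheme_utility 's3://some-bucket/data.csv'   # prints: aws s3
scheme_utility '/data/sample.json.gz'        # prints: local filesystem
```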

Compression:

Compressed file suffix behaviour (applied in the following order):

‘*.tgz’, ‘*.tar.gz’: calls ‘tar -zxO’
‘*.tar.Z’: calls ‘tar -ZxO’
‘*.tar.bz2’: calls ‘tar -jxO’
‘*.tar’: calls ‘tar -xO’
‘*.gz’, ‘*.GZ’, ‘*.Z’, ‘*.zip’, ‘*.bz2’: calls ‘gunzip -c’
‘*.zip’, ‘*.ZIP’: calls ‘unzip -p’
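
Since the suffixes are tested in order, ‘data.tar.gz’ is handled by ‘tar -zxO’ and never reaches the plain ‘gz’ rule. A minimal sketch of such first-match-wins dispatch follows; the function name ‘decompress_cmd’ and the ‘cat’ fallback for uncompressed files are assumptions, not part of EVL.

```shell
#!/bin/sh
# Illustration only: choose the decompression command by suffix,
# testing the patterns in the documented order (first match wins).
# 'decompress_cmd' is a hypothetical helper, not part of EVL.
decompress_cmd() {
  case "$1" in
    *.tgz|*.tar.gz)             echo 'tar -zxO' ;;
    *.tar.Z)                    echo 'tar -ZxO' ;;
    *.tar.bz2)                  echo 'tar -jxO' ;;
    *.tar)                      echo 'tar -xO' ;;
    *.gz|*.GZ|*.Z|*.zip|*.bz2)  echo 'gunzip -c' ;;
    *.zip|*.ZIP)                echo 'unzip -p' ;;
    *)                          echo 'cat' ;;  # assumption: not compressed
  esac
}

decompress_cmd data.tar.gz   # prints: tar -zxO
decompress_cmd data.csv.gz   # prints: gunzip -c
```

With the list taken literally, a lower-case ‘*.zip’ already matches the ‘gunzip -c’ rule, so in this sketch only ‘*.ZIP’ reaches ‘unzip -p’.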

File Type:

Read component behaves according to the <source> suffix.

Specific file formats suffix behaviour:

‘*.avro’, ‘*.AVRO’: calls ‘evl readavro’
‘*.csv’, ‘*.CSV’, ‘*.txt’, ‘*.TXT’: reads file(s) with the ‘--text-input’ option; an end-of-line character other than the standard Unix one (‘\n’) can be specified by option ‘--dos-eol’ or ‘--mac-eol’
‘*.json’, ‘*.JSON’: calls ‘evl readjson’
‘*.parquet’, ‘*.parq’, ‘*.PARQUET’, ‘*.PARQ’: calls ‘evl readparquet’
‘*.qvd’, ‘*.QVD’: calls ‘evl readqvd’
‘*.xls’, ‘*.XLS’: calls ‘evl readxls’
‘*.xlsx’, ‘*.XLSX’: calls ‘evl readxlsx’
‘*.xml’, ‘*.XML’: calls ‘evl readxml’
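
For a source such as ‘sample.json.gz’ both the compression and the format suffix are recognized. The sketch below shows one way to picture this: strip a trailing compression suffix first, then match the format suffix. The helper ‘format_reader’ and the strip-then-match order are assumptions made for illustration, not a description of EVL internals.

```shell
#!/bin/sh
# Illustration only: find the format reader for a possibly compressed source.
# 'format_reader' is a hypothetical helper, not part of EVL.
format_reader() {
  name=$1
  # Drop one trailing compression suffix, if any, to expose the format suffix.
  case "$name" in
    *.gz|*.GZ|*.Z|*.bz2|*.zip|*.ZIP) name=${name%.*} ;;
  esac
  case "$name" in
    *.avro|*.AVRO)                      echo 'evl readavro' ;;
    *.csv|*.CSV|*.txt|*.TXT)            echo 'text (--text-input)' ;;
    *.json|*.JSON)                      echo 'evl readjson' ;;
    *.parquet|*.parq|*.PARQUET|*.PARQ)  echo 'evl readparquet' ;;
    *.qvd|*.QVD)                        echo 'evl readqvd' ;;
    *.xls|*.XLS)                        echo 'evl readxls' ;;
    *.xlsx|*.XLSX)                      echo 'evl readxlsx' ;;
    *.xml|*.XML)                        echo 'evl readxml' ;;
    *)                                  echo 'unknown' ;;
  esac
}

format_reader sample.json.gz   # prints: evl readjson
format_reader report.XLSX      # prints: evl readxlsx
```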

Synopsis

Read
  <source>... <f_out> (<evd>|-d <inline_evd>)
  [--footer=<n>] [--header=<n>] [--cmd=<cmd>]
  [<file_type_options>] [--ignore-suffix]
  [-y|--text-output] [--validate]

evl read
  <source>... (<evd>|-d <inline_evd>)
  [--footer=<n>] [--header=<n>] [--cmd=<cmd>]
  [<file_type_options>] [--ignore-suffix]
  [-y|--text-output] [--validate]
  [-v|--verbose]

evl read
  ( --help | --usage | --version )

Options

-d, --data-definition=<inline_evd>: either this option or the file <evd> must be present. Example: ‘-d "user_name string, user_sum int"’
-f, --footer=<n>: skip last <n> records. When multiple files, skip last <n> records in each of them. Command ‘evl head -n-<n> --skip-parse’ is used for this job.
-h, --header=<n>: skip first <n> records. When multiple files, skip first <n> records in each of them. Command ‘evl tail -<n>+(N+1) --skip-parse’ is used for this job.
--cmd=<cmd>: bash command <cmd> is used to read the <source>s. In that case, recognition of the file suffix is switched off. See examples below for inspiration.
--ignore-suffix: ignore <source>’s suffix, act only based on options.
--validate: without this option, no fields are checked against data types. With this option, all output fields are checked.
-x, --text-input: treat the input as text, not binary
--dos-eol: treat the input as text with CRLF as end of line
--mac-eol: treat the input as text with CR as end of line
-y, --text-output: write the output as text, not binary
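
The per-file behaviour of ‘--header’ and ‘--footer’ can be illustrated with plain awk. This is only a sketch for newline-separated text records, not the ‘evl head’/‘evl tail’ implementation; the helper name ‘skip_header_footer’ and the sample files are hypothetical.

```shell
#!/bin/sh
# Illustration only: drop the first <h> and last <f> records of EACH file
# before the files are concatenated.
# 'skip_header_footer' is a hypothetical helper built on awk, not part of EVL.
skip_header_footer() {
  h=$1; f=$2; shift 2
  for file in "$@"; do
    awk -v h="$h" -v f="$f" '
      { rec[NR] = $0 }   # buffer all records of this file
      END { for (i = h + 1; i <= NR - f; i++) print rec[i] }
    ' "$file"
  done
}

tmpdir=$(mktemp -d)
printf 'id;value\n1;a\n2;b\ntotal;2\n' > "$tmpdir/one.csv"
printf 'id;value\n3;c\ntotal;1\n'      > "$tmpdir/two.csv"

# Skip one header and one footer record in each of the two files.
trimmed=$(skip_header_footer 1 1 "$tmpdir/one.csv" "$tmpdir/two.csv")
printf '%s\n' "$trimmed"
rm -rf "$tmpdir"
```

Only the three data records remain, because the header and footer are removed from each file separately rather than from the concatenated stream.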

Standard options:

--help: print this help and exit
--usage: print short usage information and exit
-v, --verbose: print to stderr info/debug messages of the component
--version: print version and exit

File type options:

--avro: whatever <source>’s suffix, act as reading ‘avro’ file format
--gz: whatever <source>’s suffix, act as reading ‘gz’, ‘Z’, ‘zip’, ‘bz2’ compressed file format
--json: whatever <source>’s suffix, act as reading ‘json’ file format
--parquet: whatever <source>’s suffix, act as reading ‘parquet’ file format
--qvd: whatever <source>’s suffix, act as reading Qlik’s ‘QVD’ file format
--xls, --xlsx: whatever <source>’s suffix, act as reading MS Excel ‘xls’ or ‘xlsx’ file format
--tar: whatever <source>’s suffix, act as reading tar file
--xml: whatever <source>’s suffix, act as reading ‘xml’ file format

QVD, XLS, XLSX, XML and JSON specific option:

--match-fields: this option is ignored for files other than QVD, XLS(X), XML and JSON.

XML and JSON specific option:

--all-fields-exist: this option is ignored for files other than XML and JSON.

XML specific options:

--document-tag=<tag>: this option is ignored for files other than XML. Check ‘man evl readxml’ for details.
--record-tag=<tag>: this option is ignored for files other than XML. Check ‘man evl readxml’ for details.
--vector-element-tag=<tag>: this option is ignored for files other than XML. Check ‘man evl readxml’ for details.

XLS and XLSX specific options:

--sheet-index=<n>: read <n>-th sheet, starting from number 0. ‘--sheet-index=0’ is default
--sheet-name=<name>: read sheet with name <name>

Examples

Standard examples of standalone usage:

Read tar.gz, skip header line and validate data types. The following writes into ‘example.csv’ the content of the tarred and gzipped source, without the header line and with validated data types:
```
evl read -d 'id int sep=";", value string sep="\n"' \
         -h1 --validate -xy <example.csv.tar.gz >example.csv
```
Gzipped json file:
```
evl read sample.json.gz sample.evd -y >sample.csv
```
As the file has the standard file suffixes ‘gz’ and ‘json’, they are recognized automatically: the file is gunzipped and parsed as JSON.

Standard examples of usage in EVL Job:

Gzipped json file. The same as the previous example, but for use in an EVS file:

Read   sample.json.gz SRC sample.evd
Write  SRC     sample.csv sample.evd