EVL – ETL Tool


Products, services and company names referenced in this document may be either trademarks or registered trademarks of their respective owners.

Copyright © 2017–2022 EVL Tool, s.r.o.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.3 or any later version published by the Free Software Foundation; with no Invariant Sections, with no Front-Cover Texts, and with no Back-Cover Texts.


Read

(since EVL 1.0)

Reads <source>(s) (a file mask can be specified) and sends the data to output <f_out>. Multiple <source>s are concatenated.

It automatically parses various file formats: ‘Avro’, ‘json’, ‘Parquet’, ‘QVD’, ‘xls’, ‘xlsx’ and ‘xml’, based solely on the file suffix.

When a compression suffix is recognized, like ‘gz’, ‘tar’, ‘bz2’, ‘zip’ or ‘Z’, the data are also decompressed automatically.

When <source> starts with ‘file://’, ‘sftp://’, ‘hdfs://’, ‘s3://’, ‘gs://’ or ‘smb://’, it uses the appropriate utility to get data from that location. If no URI Scheme is present, it reads from the local file system.

When <source> starts with ‘mysql://’, ‘postgres://’, ‘oracle://’, ‘sqlite://’ or ‘teradata://’, it uses the appropriate utility to get data from that database.

Besides the options mentioned below, which change the file suffix behaviour, one can use the generic ‘--cmd=<cmd>’ option, which calls ‘echo <source>... | xargs <cmd>’ to obtain the input for this component. <cmd> can also be a pipeline (that is the reason for xargs). See examples below for inspiration.
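As a rough illustration of the ‘echo <source>... | xargs <cmd>’ mechanism, the following sketch uses only standard shell tools, with ‘cat’ standing in for <cmd>. The file names and contents are made up for the demonstration, and the sketch does not call EVL itself:

```shell
# Two hypothetical source files in a temporary directory.
tmpdir=$(mktemp -d)
printf '1;alpha\n' > "$tmpdir/part1.csv"
printf '2;beta\n'  > "$tmpdir/part2.csv"

# The sources are echoed on one line and xargs turns them into
# arguments of the command, here 'cat', which concatenates them.
cmd_output=$(echo "$tmpdir/part1.csv" "$tmpdir/part2.csv" | xargs cat)
printf '%s\n' "$cmd_output"

rm -rf "$tmpdir"
```

Because xargs appends the sources to the first command only, anything piped after that command receives the already concatenated data.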

Read

is to be used in an EVS job structure definition file. <f_out> is either an output file or a flow name.

evl read

is intended for standalone usage, i.e. to be invoked from the command line and to write to standard output.

EVD is the EVL data definition file and EVS defines the EVL job structure; for details see evl-evd(5) and evl-evs(5).

URI Scheme:

Based on the URI Scheme in the <source>, it calls the appropriate utility to get files or tables.

no scheme, file://

assumes the local filesystem

hdfs://

calls ‘hdfs dfs’ utility

gs://

calls Google’s ‘gsutil’ utility

s3://

calls AWS’s ‘aws s3’ utility

sftp://

calls ‘ssh’ utility

smb://

calls ‘smbclient’ utility

mysql://

calls ‘mysql’ utility

postgres://

calls PostgreSQL’s ‘psql’ command together with EVL binary ‘evl_readpg’ to handle data in binary form

oracle://

calls Oracle’s ‘sqlplus’ utility

sqlite://

calls ‘sqlite3’ binary

teradata://

calls Teradata’s ‘fexp’ (FastExport) command together with EVL binary ‘evl_read_teradata_fastexport’ to handle data in binary form
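
The mapping above can be summarized as a small lookup. This is only a sketch of the dispatch as documented here; the function name ‘utility_for_source’ is hypothetical and not part of EVL, and it prints the utility name rather than running it:

```shell
# Print the utility that the table above associates with a source's URI scheme.
utility_for_source() {
  case "$1" in
    hdfs://*)     echo "hdfs dfs" ;;
    gs://*)       echo "gsutil" ;;
    s3://*)       echo "aws s3" ;;
    sftp://*)     echo "ssh" ;;
    smb://*)      echo "smbclient" ;;
    mysql://*)    echo "mysql" ;;
    postgres://*) echo "psql" ;;
    oracle://*)   echo "sqlplus" ;;
    sqlite://*)   echo "sqlite3" ;;
    teradata://*) echo "fexp" ;;
    # file:// and sources without any scheme are treated as local files.
    file://*|*)   echo "local filesystem" ;;
  esac
}

utility_for_source s3://some-bucket/data.csv
utility_for_source /tmp/data.csv
```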

Compression:

Behaviour for compressed file suffixes (rules are applied in the following order):

*.tgz, *.tar.gz

calls ‘tar -zxO’

*.tar.Z

calls ‘tar -ZxO’

*.tar.bz2

calls ‘tar -jxO’

*.tar

calls ‘tar -xO’

*.gz, *.GZ, *.Z, *.zip, *.bz2

calls ‘gunzip -c’

*.zip, *.ZIP

calls ‘unzip -p’
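
To make the ‘-O’ and ‘-c’ flags concrete: both send the decompressed data to standard output rather than to disk. A minimal sketch with standard tools and a made-up sample file follows. The ‘f’ flag is added here only to name the archive explicitly; how EVL itself passes the archive to tar is not covered by this sketch:

```shell
workdir=$(mktemp -d)
printf 'id;value\n1;alpha\n' > "$workdir/example.csv"

# Plain gzip: 'gunzip -c' writes the decompressed data to stdout.
gzip -c "$workdir/example.csv" > "$workdir/example.csv.gz"
from_gz=$(gunzip -c "$workdir/example.csv.gz")

# Gzipped tar: 'tar -zxO' extracts the members to stdout instead of to disk.
tar -czf "$workdir/example.csv.tar.gz" -C "$workdir" example.csv
from_tgz=$(tar -zxOf "$workdir/example.csv.tar.gz")

# Both routes yield the same stream of records.
[ "$from_gz" = "$from_tgz" ] && echo "same content"

rm -rf "$workdir"
```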

File Type:

Read component behaves according to the <source> suffix.

Behaviour for specific file format suffixes:

*.avro, *.AVRO

calls ‘evl readavro’

*.csv, *.CSV, *.txt, *.TXT

reads file(s) with the ‘--text-input’ option; an end-of-line character other than the standard Unix one (‘\n’) can be specified with ‘--dos-eol’ or ‘--mac-eol’

*.json, *.JSON

calls ‘evl readjson’

*.parquet, *.parq, *.PARQUET, *.PARQ

calls ‘evl readparquet’

*.qvd, *.QVD

calls ‘evl readqvd’

*.xls, *.XLS

calls ‘evl readxls’

*.xlsx, *.XLSX

calls ‘evl readxlsx’

*.xml, *.XML

calls ‘evl readxml’
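
Compression and file type suffixes can be stacked, as in ‘sample.json.gz’. The sketch below shows one way such a lookup could work: strip a plain compression suffix, then match the remaining suffix against the table above. The function name ‘reader_for’ and the fallback text are hypothetical; what EVL does with unlisted suffixes is not stated here:

```shell
# Print the reader that the table above associates with a file name.
reader_for() {
  name=$1
  # Strip one plain compression suffix first (sample.json.gz -> sample.json).
  case "$name" in
    *.gz|*.GZ|*.Z|*.bz2) name=${name%.*} ;;
  esac
  case "$name" in
    *.avro|*.AVRO)                     echo "evl readavro" ;;
    *.csv|*.CSV|*.txt|*.TXT)           echo "--text-input" ;;
    *.json|*.JSON)                     echo "evl readjson" ;;
    *.parquet|*.parq|*.PARQUET|*.PARQ) echo "evl readparquet" ;;
    *.qvd|*.QVD)                       echo "evl readqvd" ;;
    *.xls|*.XLS)                       echo "evl readxls" ;;
    *.xlsx|*.XLSX)                     echo "evl readxlsx" ;;
    *.xml|*.XML)                       echo "evl readxml" ;;
    *)                                 echo "no special handling" ;;  # sketch-only fallback
  esac
}

reader_for sample.json.gz
reader_for report.XLSX
```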

Synopsis

Read
  <source>... <f_out> (<evd>|-d <inline_evd>)
  [--footer=<n>] [--header=<n>] [--cmd=<cmd>]
  [<file_type_options>] [--ignore-suffix]
  [-y|--text-output] [--validate]

evl read
  <source>... (<evd>|-d <inline_evd>)
  [--footer=<n>] [--header=<n>] [--cmd=<cmd>]
  [<file_type_options>] [--ignore-suffix]
  [-y|--text-output] [--validate]
  [-v|--verbose]

evl read
  ( --help | --usage | --version )

Options

-d, --data-definition=<inline_evd>

either this option or the file <evd> must be present. Example: ‘-d "user_name string, user_sum int"’

-f, --footer=<n>

skip the last <n> records. With multiple files, skip the last <n> records in each of them. Command ‘evl head -n-<n> --skip-parse’ is used for this job.

-h, --header=<n>

skip the first <n> records. With multiple files, skip the first <n> records in each of them. Command ‘evl tail -<n>+(N+1) --skip-parse’ is used for this job.

--cmd=<cmd>

bash command <cmd> is used to read the <source>s. In that case, recognition of the file suffix is switched off. See examples below for inspiration.

--ignore-suffix

ignore <source>’s suffix, act only based on options.

--validate

without this option, no fields are checked against data types. With this option, all output fields are checked

-x, --text-input

treat the input as text, not binary

--dos-eol

treat the input as text with CRLF as end of line

--mac-eol

treat the input as text with CR as end of line

-y, --text-output

write the output as text, not binary
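
The effect of ‘--header’ and ‘--footer’ on multiple files can be imitated with standard tools. The sketch below uses ‘tail’ and ‘sed’ as stand-ins for the ‘evl tail’ and ‘evl head’ commands mentioned above, on made-up files, skipping one header and one footer line per file before concatenation:

```shell
workdir=$(mktemp -d)
printf 'HEADER\n1;a\n2;b\nFOOTER\n' > "$workdir/f1.txt"
printf 'HEADER\n3;c\nFOOTER\n'      > "$workdir/f2.txt"

result=$(
  for f in "$workdir/f1.txt" "$workdir/f2.txt"; do
    # tail -n +2 drops the first line (like --header=1),
    # sed '$d' drops the last line (like --footer=1).
    tail -n +2 "$f" | sed '$d'
  done
)
printf '%s\n' "$result"

rm -rf "$workdir"
```

The skipping is applied to each file separately, so no header or footer line from the second file ends up in the middle of the output.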

Standard options:

--help

print this help and exit

--usage

print short usage information and exit

-v, --verbose

print to stderr info/debug messages of the component

--version

print version and exit

File type options:

--avro

regardless of the <source> suffix, read it as ‘avro’ file format

--gz

regardless of the <source> suffix, read it as ‘gz’, ‘Z’, ‘zip’ or ‘bz2’ compressed file format

--json

regardless of the <source> suffix, read it as ‘json’ file format

--parquet

regardless of the <source> suffix, read it as ‘parquet’ file format

--qvd

regardless of the <source> suffix, read it as Qlik’s ‘QVD’ file format

--xls, --xlsx

regardless of the <source> suffix, read it as MS Excel ‘xls’ or ‘xlsx’ file format

--tar

regardless of the <source> suffix, read it as a tar file

--xml

regardless of the <source> suffix, read it as ‘xml’ file format

QVD, XLS, XLSX, XML and JSON specific option:

--match-fields

this option is ignored for files other than QVD, XLS(X), XML and JSON.

XML and JSON specific option:

--all-fields-exist

this option is ignored for files other than XML and JSON.

XML specific options:

--document-tag=<tag>

this option is ignored for files other than XML. Check ‘man evl readxml’ for details.

--record-tag=<tag>

this option is ignored for files other than XML. Check ‘man evl readxml’ for details.

--vector-element-tag=<tag>

this option is ignored for files other than XML. Check ‘man evl readxml’ for details.

XLS and XLSX specific options:

--sheet-index=<n>

read the <n>-th sheet, counting from 0. ‘--sheet-index=0’ is the default

--sheet-name=<name>

read the sheet named <name>
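
A hedged sketch of how a sheet option might be combined with the synopsis above. The wrapper name ‘read_sheet’ and its arguments are hypothetical, and the function only runs ‘evl read’ when an ‘evl’ command is found:

```shell
# read_sheet <workbook> <evd> <sheet_name>
# Hypothetical wrapper: writes the named sheet as text to standard output.
read_sheet() {
  if ! command -v evl >/dev/null 2>&1; then
    echo "evl not found in PATH" >&2
    return 127
  fi
  # Argument order follows the synopsis: <source> <evd> [options].
  evl read "$1" "$2" --sheet-name="$3" --text-output
}
```

It could then be called as, for instance, ‘read_sheet report.xlsx report.evd Summary >summary.csv’, where all three names are placeholders.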

Examples

Standard examples of standalone usage:

  1. Read tar.gz, skip the header line and validate data types. Writes the content of the tarred and gzipped source into ‘example.csv’, without the header line and with validated data types:
    evl read -d 'id int sep=";", value string sep="\n"' \
             -h1 -vxy <example.csv.tar.gz >example.csv
    
  2. Gzipped json file:
    evl read sample.json.gz sample.evd -y >sample.csv
    

    As the file has the standard file suffixes ‘gz’ and ‘json’, it is automatically gunzipped and parsed as JSON.

Standard examples of usage in EVL Job:

  1. Gzipped json file. The same as standalone example 2, but for use in an EVS file:
    Read   sample.json.gz SRC sample.evd
    Write  SRC     sample.csv sample.evd