Read
(since EVL 1.0)
Reads <source>(s) (a file mask can be specified) and sends the data to output <f_out>. Multiple <source>s are concatenated.
It automatically parses various file formats: ‘Avro’, ‘json’, ‘Parquet’, ‘QVD’, ‘xls’, ‘xlsx’ and ‘xml’, just based on file suffix.
Also when compression suffix is recognized, like ‘gz’, ‘tar’, ‘bz2’, ‘zip’, ‘Z’, data are decompressed automatically.
In general the <source> has one of these two forms:
[scheme://][[user@]host[:port]]/path/basename[.format][.compression]
[scheme://][[user@]host[:port]/]database?(table=<table>|query=<query>)
When <source> starts with ‘file://’, ‘sftp://’, ‘hdfs://’, ‘s3://’, ‘gs://’ or ‘smb://’, it uses the appropriate utility to get data from that location. If no URI scheme is present, it reads from the local file system.
When <source> starts with ‘mysql://’, ‘postgres://’, ‘oracle://’, ‘sqlite://’ or ‘teradata://’, it uses the appropriate utility to get data from that database.
Besides the options mentioned below, which change the file suffix behaviour, one can use the generic ‘--cmd=<cmd>’ option, which calls ‘echo <source>... | xargs <cmd>’ to obtain the input for this component. <cmd> can also be a pipeline (that is the reason for xargs). See the examples below for inspiration.
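The ‘echo <source>... | xargs <cmd>’ mechanism can be tried out with plain shell tools, without EVL. The following is only a sketch of the idea, not EVL's actual implementation: the helper name ‘read_with_cmd’ and the two sample files are hypothetical, and ‘cat’ stands in for <cmd>.

```shell
# Hypothetical sample files, created only for this illustration.
tmpdir=$(mktemp -d)
printf '1;a\n' > "$tmpdir/part1.csv"
printf '2;b\n' > "$tmpdir/part2.csv"

# Mimics the described behaviour of '--cmd=<cmd>': the sources are echoed
# and xargs passes them to <cmd> as arguments.
# 'read_with_cmd' is an illustrative name, not part of EVL.
read_with_cmd() {
    cmd=$1
    shift
    # $cmd is left unquoted on purpose so that "cmd with args" is split.
    echo "$@" | xargs $cmd
}

# Both files are read by a single 'cat' call and therefore concatenated.
read_with_cmd cat "$tmpdir/part1.csv" "$tmpdir/part2.csv"
```

Because xargs appends all sources to one invocation of the command, the records of both files arrive in the order in which the sources were given.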
- Read
-
is to be used in an EVS job structure definition file. <f_out> is either an output file or a flow name.
- evl read
-
is intended for standalone usage, i.e. to be invoked from the command line and write to standard output.
EVD is EVL data definition file and EVS defines EVL job structure, for details see evl-evd(5) and evl-evs(5).
URI Scheme for file:
Based on the URI scheme in the <source>, it calls the appropriate utility to get files or tables.
- no scheme, ‘file://’
-
assumes the local filesystem
- ‘gdrive://’
-
calls ‘gdrive’ utility
- ‘gs://’
-
calls Google’s ‘gsutil’ utility
- ‘hdfs://’
-
calls ‘hdfs dfs’ utility
- ‘s3://’
-
calls AWS’s ‘aws s3’ utility
- ‘sftp://’
-
calls ‘ssh’ utility
- ‘smb://’
-
calls ‘smbclient’ utility
URI Scheme for table:
- ‘mysql://’
-
calls Readmysql component to read MySQL/MariaDB table
- ‘postgres://’
-
calls Readpg component to read PostgreSQL table
- ‘oracle://’
-
calls Readora component to read Oracle table
- ‘sqlite://’
-
calls Readsqlite component to read SQLite table
- ‘teradata://’
-
calls Readtd component to read Teradata table
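The two scheme lists above amount to a prefix dispatch on the <source> string. A minimal sketch of that dispatch follows; the function name ‘source_handler’ and the ‘unknown scheme’ branch are assumptions made for illustration, and the function only names the utility or component instead of calling it.

```shell
# Illustrative only: maps the URI scheme of a source to the utility or
# component named in the lists above. This is not EVL's actual code.
source_handler() {
    case "$1" in
        file://*)     echo "local filesystem" ;;
        gdrive://*)   echo "gdrive" ;;
        gs://*)       echo "gsutil" ;;
        hdfs://*)     echo "hdfs dfs" ;;
        s3://*)       echo "aws s3" ;;
        sftp://*)     echo "ssh" ;;
        smb://*)      echo "smbclient" ;;
        mysql://*)    echo "Readmysql" ;;
        postgres://*) echo "Readpg" ;;
        oracle://*)   echo "Readora" ;;
        sqlite://*)   echo "Readsqlite" ;;
        teradata://*) echo "Readtd" ;;
        # Assumption: how EVL reacts to an unlisted scheme is not documented here.
        *://*)        echo "unknown scheme" ;;
        # No scheme at all: local filesystem.
        *)            echo "local filesystem" ;;
    esac
}

source_handler "s3://bucket/path/data.csv.gz"
source_handler "postgres://host/db?table=t"
source_handler "/data/in/file.json"
```

The bucket, host and path names are placeholders.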
Compression:
Compressed file suffix behaviour (applied in the following order):
- ‘*.tgz’, ‘*.tar.gz’
-
calls ‘tar -zxO’
- ‘*.tar.Z’
-
calls ‘tar -ZxO’
- ‘*.tar.bz2’
-
calls ‘tar -jxO’
- ‘*.tar’
-
calls ‘tar -xO’
- ‘*.gz’, ‘*.GZ’, ‘*.Z’, ‘*.zip’, ‘*.bz2’
-
calls ‘gunzip -c’
- ‘*.zip’, ‘*.ZIP’
-
calls ‘unzip -p’
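The order of the rules matters, because a name such as ‘data.tar.gz’ also matches ‘*.gz’. The sketch below reproduces the list literally as a first-match-wins ‘case’ statement. The function name ‘decompress_cmd’ and the ‘cat’ fallback for uncompressed files are assumptions for illustration; the function only prints the command from the list, it does not run it.

```shell
# Illustrative only: returns the decompression command from the list above
# for a given file name. The first matching pattern wins, as in the list.
decompress_cmd() {
    case "$1" in
        *.tgz|*.tar.gz)            echo "tar -zxO" ;;
        *.tar.Z)                   echo "tar -ZxO" ;;
        *.tar.bz2)                 echo "tar -jxO" ;;
        *.tar)                     echo "tar -xO" ;;
        *.gz|*.GZ|*.Z|*.zip|*.bz2) echo "gunzip -c" ;;
        *.zip|*.ZIP)               echo "unzip -p" ;;
        # Assumption: files without a compression suffix are passed through.
        *)                         echo "cat" ;;
    esac
}

decompress_cmd data.tar.gz
decompress_cmd data.csv.gz
decompress_cmd data.ZIP
decompress_cmd data.csv
```

Taken literally in this order, a lower-case ‘*.zip’ name is already caught by the ‘gunzip -c’ rule, so only ‘*.ZIP’ reaches ‘unzip -p’ in this sketch.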
File Type:
The Read component behaves according to the <source> suffix.
Specific file formats suffix behaviour:
- ‘*.avro’, ‘*.AVRO’
-
calls ‘evl readavro’
- ‘*.csv’, ‘*.CSV’, ‘*.txt’, ‘*.TXT’
-
reads file(s) with the ‘--text-input’ option; an end-of-line other than the standard Unix character (‘\n’) can be specified by option ‘--dos-eol’ or ‘--mac-eol’
- ‘*.json’, ‘*.JSON’
-
calls ‘evl readjson’
- ‘*.parquet’, ‘*.parq’, ‘*.PARQUET’, ‘*.PARQ’
-
calls ‘evl readparquet’
- ‘*.qvd’, ‘*.QVD’
-
calls ‘evl readqvd’
- ‘*.xls’, ‘*.XLS’
-
calls ‘evl readxls’
- ‘*.xlsx’, ‘*.XLSX’
-
calls ‘evl readxlsx’
- ‘*.xml’, ‘*.XML’
-
calls ‘evl readxml’
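Since the general <source> form is ‘basename[.format][.compression]’, the format suffix sits in front of any compression suffix. The sketch below assumes that one trailing compression suffix is stripped before the format is looked up. That two-step order, the function name ‘format_reader’ and the fallback message are illustrative assumptions, not a description of EVL internals.

```shell
# Illustrative only: names the reader from the list above for a file name.
format_reader() {
    name=$1
    # Assumption: drop one trailing compression suffix before matching.
    case "$name" in
        *.gz|*.GZ|*.Z|*.bz2|*.zip|*.ZIP) name=${name%.*} ;;
    esac
    case "$name" in
        *.avro|*.AVRO)                         echo "evl readavro" ;;
        *.csv|*.CSV|*.txt|*.TXT)               echo "--text-input" ;;
        *.json|*.JSON)                         echo "evl readjson" ;;
        *.parquet|*.parq|*.PARQUET|*.PARQ)     echo "evl readparquet" ;;
        *.qvd|*.QVD)                           echo "evl readqvd" ;;
        *.xls|*.XLS)                           echo "evl readxls" ;;
        *.xlsx|*.XLSX)                         echo "evl readxlsx" ;;
        *.xml|*.XML)                           echo "evl readxml" ;;
        # Assumption: anything else has no format-specific reader.
        *)                                     echo "no format-specific reader" ;;
    esac
}

format_reader sample.json.gz
format_reader report.XLSX
format_reader data.parq
```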
Synopsis
Read <source>... <f_out> (<evd>|-d <inline_evd>)
  [--footer=<n>] [--header=<n>] [--cmd=<cmd>]
  [<file_type_options>] [--ignore-suffix] [--allow-missing-file]
  [-y|--text-output [--dos-eol | --mac-eol] ]
  [--validate]

evl read <source>... (<evd>|-d <inline_evd>)
  [--footer=<n>] [--header=<n>] [--cmd=<cmd>]
  [<file_type_options>] [--ignore-suffix] [--allow-missing-file]
  [-y|--text-output [--dos-eol | --mac-eol] ]
  [--validate] [-v|--verbose]

evl read ( --help | --usage | --version )
Options
- --allow-missing-file
-
don’t fail if <source> doesn’t exist, and produce empty output
- -d, --data-definition=<inline_evd>
-
either this option or the file <evd> must be given. Example: ‘-d "user_name string, user_sum int"’
- -f, --footer=<n>
-
skip the last <n> records. With multiple files, skip the last <n> records in each of them. The command ‘evl head -n-<n> --skip-parse’ is used for this job.
- -h, --header=<n>
-
skip first <n> records. When multiple files, skip first <n> records in each of them. Command ‘evl tail -<n>+(N+1) --skip-parse’ is used for this job.
- --cmd=<cmd>
-
the bash command <cmd> is used to read the <source>s. In that case, recognition of the file suffix is switched off. See the examples below for inspiration.
- --ignore-suffix
-
ignore <source>’s suffix, act only based on options.
- --validate
-
without this option, no fields are checked against data types. With this option, all output fields are checked
- -x, --text-input
-
treat the input as text, not binary
- --dos-eol
-
suppose the input is text with CRLF as end of line
- --mac-eol
-
suppose the input is text with CR as end of line
- -y, --text-output
-
write the output as text, not binary
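The per-file behaviour of ‘--header=<n>’ and ‘--footer=<n>’ can be imitated with standard tools. The sketch below uses ‘tail’ and ‘head’ from GNU coreutils (‘head -n -N’ is a GNU extension) in place of the ‘evl head’ and ‘evl tail’ commands mentioned above. The function name ‘skip_header_footer’ and the sample files are hypothetical.

```shell
# Illustrative only: skips the first <header> and last <footer> lines of
# EACH file, then concatenates the results. Requires GNU head.
skip_header_footer() {
    header=$1
    footer=$2
    shift 2
    for f in "$@"; do
        # 'tail -n +K' starts output at line K; 'head -n -N' drops the last N lines.
        tail -n +"$((header + 1))" "$f" | head -n -"$footer"
    done
}

# Hypothetical sample files with one header and one footer line each.
tmpdir=$(mktemp -d)
printf 'id;value\n1;a\n2;b\ntotal;2\n' > "$tmpdir/a.csv"
printf 'id;value\n3;c\ntotal;1\n' > "$tmpdir/b.csv"

skip_header_footer 1 1 "$tmpdir/a.csv" "$tmpdir/b.csv"
```

Only the three data records remain; the header and footer lines of both files are dropped before concatenation.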
Standard options:
- --help
-
print this help and exit
- --usage
-
print short usage information and exit
- -v, --verbose
-
print to stderr info/debug messages of the component
- --version
-
print version and exit
File type options:
- --avro
-
whatever <source>’s suffix, act as reading ‘avro’ file format
- --gz
-
whatever <source>’s suffix, act as reading ‘gz’, ‘Z’, ‘zip’, ‘bz2’ compressed file format
- --json
-
whatever <source>’s suffix, act as reading ‘json’ file format
- --parquet
-
whatever <source>’s suffix, act as reading ‘parquet’ file format
- --qvd
-
whatever <source>’s suffix, act as reading Qlik’s ‘QVD’ file format
- --xls, --xlsx
-
whatever <source>’s suffix, act as reading MS Excel ‘xls’ or ‘xlsx’ file format
- --tar
-
whatever <source>’s suffix, act as reading tar file
- --xml
-
whatever <source>’s suffix, act as reading ‘xml’ file format
QVD, XLS, XLSX, XML and JSON specific option:
- --match-fields
-
this option is ignored for files other than QVD, XLS(X), XML and JSON.
XML and JSON specific option:
- --all-fields-exist
-
this option is ignored for files other than XML and JSON.
XML specific options:
- --document-tag=<tag>
-
this option is ignored for files other than XML. Check ‘man evl readxml’ for details.
- --record-tag=<tag>
-
this option is ignored for files other than XML. Check ‘man evl readxml’ for details.
- --vector-element-tag=<tag>
-
this option is ignored for files other than XML. Check ‘man evl readxml’ for details.
XLS and XLSX specific options:
- --sheet-index=<n>
-
read the <n>-th sheet, starting from number 0. ‘--sheet-index=0’ is the default
- --sheet-name=<name>
-
read the sheet with name <name>
Examples
Standard examples of standalone usage:
- Read tar.gz, skip header line and validate data types. Write into ‘example.csv’ the content of the tarred and gzipped source without the header line and with validated data types:
evl read -d 'id int sep=";", value string sep="\n"' \
  -h1 -vxy <example.csv.tar.gz >example.csv
- Gzipped json file:
evl read sample.json.gz sample.evd -y >sample.csv
As the file has the standard suffixes ‘gz’ and ‘json’, they are recognized automatically, and the file is gunzipped and parsed as JSON.
Standard examples of usage in EVL Job:
- Gzipped json file. The same as example 2, but to be used in an evs file:
Read sample.json.gz SRC sample.evd
Write SRC sample.csv sample.evd