ETL in general
ETL stands for Extract–Transform–Load; in short, it is a system for moving and transforming data between data storages.
ETL processing usually consists of three main parts:
- ETL itself (ETL jobs) – to process data
- Orchestration (ETL workflows) – to manage ETL jobs, handle job sequencing, await file delivery, provide information about processing via e-mail or SNMP traps, etc.
- Scheduling – to fire ETL workflows at a given time on a given day
Orchestration and Scheduling are quite often referred to together as a Scheduler, but let's keep these two parts of an ETL system distinct, following the Unix philosophy: "do one thing, and do it well".
ETL jobs and ETL workflows each consist of one or more oriented acyclic graphs, more often called DAGs = Directed Acyclic Graphs (the reason is probably that DAG is a nicer acronym than OAG).
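As an aside, a valid execution order for such a DAG can be illustrated with the standard `tsort` utility (the job names here are made up for illustration):

```shell
# Describe the edges of a small ETL DAG as "predecessor successor"
# pairs; tsort prints one valid execution order and fails on a cycle.
printf '%s\n' \
  'extract transform' \
  'extract validate' \
  'transform load' \
  'validate load' | tsort
```

Here `extract` must come first and `load` last; the order of the independent middle jobs is up to `tsort`.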
So the following DAG may describe either an ETL job or an ETL workflow.
The only difference is the meaning of vertices and edges:
|          | ETL job                   | ETL workflow          |
| -------- | ------------------------- | --------------------- |
| vertices | data modifying components | jobs, other workflows |
The approach to restarting jobs and workflows therefore also differs:
- When an ETL job fails, the whole job must be restarted.
- When an ETL workflow fails, it can be either restarted from the beginning or continued from the last failure(s).
So an ETL workflow like this:
might be restarted from the red (i.e. failed) job. Green (i.e. successful) jobs will be skipped.
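This restart-from-failure behaviour can be sketched in plain bash (a simplified illustration, not EVL's actual mechanism) using marker files to skip tasks that already succeeded:

```shell
#!/usr/bin/env bash
# Sketch: a workflow runner that skips tasks which already left a
# ".done" marker, so a rerun continues from the last failure.
set -euo pipefail

run_task() {
  local name=$1; shift
  if [[ -f "$name.done" ]]; then
    echo "skip $name (already successful)"
    return 0
  fi
  echo "run  $name"
  "$@" && touch "$name.done"
}

run_task extract   true   # 'true' stands in for the real job command
run_task transform true
run_task load      true
```

On a second invocation all three tasks are skipped; deleting the `.done` markers forces a full restart from the beginning.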
Considering the theory above, EVL splits an ETL system into three main entities:
All three entities are supposed to be tracked by Git or any other version control system.
Project
A scope of work isolated from the business perspective. EVL supports inter-project workflow dependencies, but a project often delivers data from the source to the final target independently. Technically, a project is represented by one directory, and the name of the directory is the name of the project. This directory contains the project.sh (bash) file, crontab.sh, and directories of all other EVL configuration files:
- EVL custom Components (*.evc) – a common piece of a job
- EVL Data Definitions – data types, separators, null flags, encoding, etc.; simply something like DDL
- EVL Mappings – C++ syntax
- EVL Structure files (*.evs) – templates of jobs
- EVL Workflow Structures (*.ews) – templates of workflows
*.evl files contain only parameters for a given *.evs file; these files are run by the `evl run` command. There might also be standard bash scripts (*.sh).
*.ewf files contain only parameters for a given *.ews file, to be run by the corresponding command.
Important: Files *.evc, *.evs, *.ews, *.evl and *.ewf are interpreted as bash scripts.
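Since these parameter files are plain bash, a hypothetical *.evl file might look like this (all variable names here are assumptions for illustration, not documented EVL parameters):

```shell
# my_load.evl -- parameters for one *.evs job template (names made up)
EVS_TEMPLATE="load_table.evs"        # the single job template this job uses
SRC_FILE="/data/in/customers.csv"    # input file to load
TARGET_TABLE="stage.customers"       # target table
```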
Task
A task is either a job, a workflow, or a wait for a file. Each run of a task has its so-called Order Date (ODATE) and Run ID (a unique incrementing integer per project).
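The two identifiers can be pictured with a small bash sketch (the counter file and the date format are assumptions for illustration; EVL maintains its Run ID internally):

```shell
# Hypothetical illustration of a task run's identifiers.
ODATE=$(date +%Y%m%d)                  # Order Date, e.g. 20250101
RUNID_FILE=".runid"                    # made-up per-project counter file
RUNID=$(( $(cat "$RUNID_FILE" 2>/dev/null || echo 0) + 1 ))
echo "$RUNID" > "$RUNID_FILE"          # persist the increment for the next run
echo "ODATE=$ODATE RUNID=$RUNID"
```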
Job
The basic unit of work. It can be either a standard Bash script (*.sh) or an *.evl parameters file which uses one and only one EVS job template. A job is always restarted from the beginning (although in some cases, e.g. downloading X files from the source, it can resume after the last successful step; it depends on how the job is written). So jobs are physically represented by *.evl and *.sh files.
Workflow
A list of task calls with dependencies, split by waits for other workflows (even from other projects). A project can have one or multiple workflows, physically represented by *.ewf files. A workflow can be restarted from the beginning or from the last successful step.
Scheduling
Workflows, and jobs not included in workflows, can be scheduled using crontab (or any other scheduler). For crontab there is the crontab.sh script to wrap scheduling up.
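For example, crontab entries firing one workflow and one standalone job might look like this (the paths, file names, and the arguments passed to crontab.sh are made up; the actual wrapper interface is project-specific):

```shell
# m h  dom mon dow  command
30 2 * * *  /opt/evl/projects/sales/crontab.sh daily.ewf        # daily workflow at 02:30
 5 * * * *  /opt/evl/projects/sales/crontab.sh hourly_load.evl  # standalone job at :05
```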