EVL – ETL Tool


Products, services and company names referenced in this document may be either trademarks or registered trademarks of their respective owners.

Copyright © 2017–2023 EVL Tool, s.r.o.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.3 or any later version published by the Free Software Foundation; with no Invariant Sections, with no Front-Cover Texts, and with no Back-Cover Texts.

EVL Overview

ETL in general

ETL stands for Extract–Transform–Load. In short, it is a system for moving and transforming data between data stores.

ETL processing usually consists of three main parts:

  • ETL itself (ETL jobs) – to process data
  • Orchestration (ETL workflows) – to manage ETL jobs, handle job dependencies, await file delivery, provide information about processing via e-mail or SNMP traps, etc.
  • Scheduling – to fire ETL workflows at a given time on a given day

Orchestration and Scheduling are quite often referred to together as the Scheduler, but let us distinguish these two parts of an ETL system to follow the Unix philosophy: “do one thing, and do it well”.

An ETL job or an ETL workflow consists of one or more oriented acyclic graphs, usually called DAGs (DAG = Directed Acyclic Graph; the reason is probably that DAG is a nicer acronym than OAG).

So the following DAG may describe either an ETL job or an ETL workflow.

../images/DAG

The only difference is the meaning of vertices and edges:

          jobs                        workflows
vertices  data modifying components   jobs, other workflows
edges     data flows                  successor
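The graph idea can be made concrete with a minimal Python sketch. It is only an illustration and not part of EVL: the vertex names are hypothetical, and the graph is stored as a mapping from each vertex to the set of vertices that must finish before it. The standard library module graphlib (Python 3.9 or later) then yields one valid processing order.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each key is a vertex (a job), each value is the
# set of its predecessors, i.e. the jobs that must finish before it.
PREDECESSORS = {
    "extract_a": set(),
    "extract_b": set(),
    "transform": {"extract_a", "extract_b"},
    "load": {"transform"},
}


def execution_order(predecessors):
    """Return one valid order in which the vertices can be processed."""
    # static_order() raises graphlib.CycleError if the graph has a cycle,
    # which is exactly the case a DAG excludes.
    return list(TopologicalSorter(predecessors).static_order())


order = execution_order(PREDECESSORS)
print(order)
```

The two extract vertices may appear in either order, because no edge connects them. The only guarantee is that every vertex comes after all of its predecessors.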

The approach to restarting jobs and workflows therefore differs as well:

  • When an ETL job fails, the whole job must be restarted.
  • When an ETL workflow fails, it can either be restarted from the beginning or continue from the last failure(s).

So an ETL workflow like this:

../images/DAG-workflow

might be restarted from the red job. The green jobs (i.e. the successful ones) will be skipped.
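The difference between restarting and continuing a workflow can be sketched in a few lines of Python. This is not how EVL implements it. It only illustrates the skipping rule under simple assumptions: tasks run one at a time, a task runs only when all of its predecessors succeeded, and all names (run_workflow, PREDECESSORS, the tasks A, B and C) are hypothetical.

```python
from graphlib import TopologicalSorter


def run_workflow(predecessors, actions, done=None):
    """Run tasks in dependency order and return (done, failed).

    Tasks listed in `done` are treated as already successful and skipped,
    which models continuing; passing nothing models a full restart.
    """
    done = set(done or ())
    failed = set()
    for task in TopologicalSorter(predecessors).static_order():
        if task in done:
            continue  # successful in an earlier run: skip it
        if not predecessors.get(task, set()) <= done:
            continue  # blocked by a failed or blocked predecessor
        try:
            actions[task]()
            done.add(task)
        except Exception:
            failed.add(task)
    return done, failed


# Hypothetical chain A -> B -> C in which B fails on the first run.
PREDECESSORS = {"A": set(), "B": {"A"}, "C": {"B"}}
log = []        # names of tasks that actually executed successfully
broken = {"B"}  # tasks that currently fail


def make_action(name):
    def action():
        if name in broken:
            raise RuntimeError(f"{name} failed")
        log.append(name)
    return action


ACTIONS = {name: make_action(name) for name in PREDECESSORS}

done, failed = run_workflow(PREDECESSORS, ACTIONS)        # first run, B fails
broken.clear()                                            # the cause is fixed
done, failed = run_workflow(PREDECESSORS, ACTIONS, done)  # continue
print(log)
```

In the second call, A is skipped because it already succeeded, so it appears in the log only once. Calling run_workflow without the done argument would instead run everything again from the beginning.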

EVL approach

Considering the theory above, EVL splits an ETL system into three main entities:

../images/EVL_overview

All three entities are supposed to be tracked by Git or any other version control system.

Terminology

Project

A scope of work isolated from the business perspective. EVL supports inter-project workflow dependencies, but a project often delivers data from the source to the final target independently. Technically, a project is represented by one directory, and the name of that directory is the name of the project. This directory contains the project.sh (Bash) file, crontab.sh, and the directories for all other EVL configuration files:

evc/

EVL custom Components, i.e. common pieces of a job

evd/

EVL Data Definitions: data types, separators, null flags, encoding, etc.; in short, something like DDL

evm/

EVL Mappings, C++ syntax

evs/

EVL Structure files, i.e. templates of jobs

ews/

EVL Workflow Structure, i.e. templates of workflows

run/

*.evl files contain only the parameters for a given *.evs file; these files are called by the evl run command. There may also be standard Bash scripts (*.sh).

workflow/

*.ewf files contain only the parameters for a given *.ews file, to be called by the command evl run/restart/continue.

Important: Files *.evc, *.evs, *.ews, *.evl and *.ewf are interpreted as bash scripts.
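The layout described above can be summarised by a short Python sketch that creates an empty project skeleton. It is a hypothetical helper, not an EVL command. The names create_skeleton and my_project are assumptions for illustration, and whether every directory is mandatory is not stated here, so the sketch simply creates all of them.

```python
import tempfile
from pathlib import Path

# Files and directories named in the project description above.
PROJECT_FILES = ("project.sh", "crontab.sh")
PROJECT_DIRS = ("evc", "evd", "evm", "evs", "ews", "run", "workflow")


def create_skeleton(root):
    """Create an empty project skeleton and return the project name."""
    root = Path(root)
    for directory in PROJECT_DIRS:
        (root / directory).mkdir(parents=True, exist_ok=True)
    for filename in PROJECT_FILES:
        (root / filename).touch()
    # The name of the directory is the name of the project.
    return root.name


# Demonstration in a temporary directory, so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    name = create_skeleton(Path(tmp) / "my_project")
    created = sorted(p.name for p in (Path(tmp) / "my_project").iterdir())

print(name, created)
```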

Task

Either a job, a workflow, or a wait for a file. Each run of a task has its so-called Order Date (ODATE) and Run ID (a unique, incrementing integer per project).
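The two identifiers of a task run can be pictured with a small Python sketch. It rests on assumptions that the text above does not confirm: the counter lives in memory, it starts at 1, and the ODATE is held as a calendar date. TaskRun, new_run and the project and task names are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date
from itertools import count

# One independent counter per project (assumed here to start at 1).
_run_ids = defaultdict(lambda: count(1))


@dataclass(frozen=True)
class TaskRun:
    project: str
    task: str
    odate: date   # Order Date of this run
    run_id: int   # unique within the project


def new_run(project, task, odate):
    """Register a run of a task and give it the project's next Run ID."""
    return TaskRun(project, task, odate, next(_run_ids[project]))


first = new_run("sales", "load_orders", date(2023, 1, 31))
second = new_run("sales", "load_orders", date(2023, 1, 31))  # same ODATE, rerun
other = new_run("hr", "load_staff", date(2023, 1, 31))
print(first.run_id, second.run_id, other.run_id)
```

The point of the sketch is that two runs with the same ODATE still get different Run IDs, and that the numbering of one project does not affect another.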

Job

The basic unit of work. It can be either a standard Bash script (*.sh) or an *.evl parameters file that uses one and only one EVS job template. A job is always restarted from the beginning (although in some cases, such as downloading X files from the source, it can start after the last successful step; this depends on how the job is written). Jobs are therefore physically represented by *.evl and *.sh files.

Workflow

A list of task calls with dependencies, split by waits for other workflows (even from other projects). A project can have one or more workflows, physically represented by *.ewf files. A workflow can be restarted from the beginning or continued after the last successful step.

Scheduling

Workflows, and jobs that are not included in workflows, can be scheduled using crontab (or any other scheduler). For crontab, there is a crontab.sh script that wraps the scheduling up.