Implements of ETL – Technology Org

Although a blockchain is essentially a database, managing its data has its own specifics. Users need tools that simplify storing, retrieving, transmitting and otherwise processing this information, and that deliver the results in the form that suits them best.

Every business pursues its own goals and every project has its own tasks, so each needs to work with a wide variety of data from various sources. Dedicated blockchain tools, Ethereum ETL in particular, help users avoid getting lost in this constantly changing flow of information.

Cryptocurrency – artistic photo. Image credit: Michael Förtsch via Unsplash, free license

The abbreviation ETL (Extract – Transform – Load) covers a broad range of data-handling tasks. As the name suggests, the work is divided into three stages. In the first stage (Extract), the required data is pulled from one or more sources. In the second stage (Transform), the data is reshaped: for example, unnecessary information is filtered out, or records are grouped by certain criteria. In the third stage (Load), the result is delivered to where it is needed, such as a file or a database.
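The three stages can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `RAW_TRANSFERS` records, the `sender_totals` table and all function names are hypothetical, and an in-memory SQLite database stands in for the real destination.

```python
import sqlite3

# Hypothetical raw records standing in for data pulled from a real source.
RAW_TRANSFERS = [
    {"sender": "alice", "amount": "10.5", "status": "ok"},
    {"sender": "bob", "amount": "3.0", "status": "failed"},
    {"sender": "alice", "amount": "4.5", "status": "ok"},
    {"sender": "carol", "amount": "7.0", "status": "ok"},
]


def extract(source):
    """Extract: read raw records from the source (here, an in-memory list)."""
    return list(source)


def transform(records):
    """Transform: drop failed transfers and total the amounts per sender."""
    totals = {}
    for record in records:
        if record["status"] != "ok":
            continue  # remove unnecessary information
        sender = record["sender"]
        totals[sender] = totals.get(sender, 0.0) + float(record["amount"])
    return totals


def load(totals, connection):
    """Load: store the aggregated totals in a database table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS sender_totals "
        "(sender TEXT PRIMARY KEY, total REAL)"
    )
    connection.executemany(
        "INSERT OR REPLACE INTO sender_totals VALUES (?, ?)",
        sorted(totals.items()),
    )
    connection.commit()


def run_etl(source, connection):
    """Chain the three stages in order."""
    load(transform(extract(source)), connection)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    run_etl(RAW_TRANSFERS, conn)
    query = "SELECT sender, total FROM sender_totals ORDER BY sender"
    print(conn.execute(query).fetchall())
```

Keeping each stage in its own function means a stage can be tested or replaced without touching the other two.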

Developers have traditional ways of solving such tasks: custom scripts, database functions, manual deployment, cron for regularly scheduled jobs, and so on. This approach is workable, but it raises several significant problems:

1) As the number of tasks grows, they snowball.

2) Dependencies appear between tasks, which makes the code much harder to maintain.

3) Without monitoring, errors go unnoticed for longer and the team is slower to respond to them.

4) There is no restart mechanism.

5) There is no fault tolerance.
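To make the restart problem concrete, here is a minimal sketch of the kind of retry wrapper that teams end up writing by hand around cron scripts. The names `run_with_retries` and `make_flaky_job` are hypothetical and belong to no particular library.

```python
import time


def run_with_retries(job, max_attempts=3, delay_seconds=0.0):
    """Call job(); if it raises, wait and try again, up to max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return job()
        except Exception as error:
            print(f"attempt {attempt} failed: {error}")
            if attempt == max_attempts:
                raise  # give up and surface the last error
            time.sleep(delay_seconds)


def make_flaky_job(failures_before_success):
    """Build a demo job that fails a fixed number of times, then succeeds."""
    state = {"calls": 0}

    def job():
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise RuntimeError("temporary failure")
        return state["calls"]

    return job


if __name__ == "__main__":
    # The job fails twice, so the third attempt returns its call count.
    print(run_with_retries(make_flaky_job(2)))
```

Every script needs the same wrapper, plus logging and alerting around it, which is exactly the repeated work that dedicated ETL tools take over.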

Fixing these issues distracts developers from their main task of building the product. Several ETL tools exist to solve such problems; here we look at Luigi, Apache Airflow and Prefect.

A closer look at ETL tools

Luigi, open-sourced by Spotify in 2012, is used by many IT companies. One of its main advantages is a small codebase with few dependencies, so it does not take long to understand. Luigi has a built-in mechanism for retrying failed tasks, failure notifications, and event hooks.

The key entity of Luigi is the Task, the object that describes the logic for executing a piece of work. The second entity is the Target, which represents the output of a task, such as a file; a task counts as complete once its target exists. For all its convenience, Luigi has a number of disadvantages. One is that it cannot trigger runs on a schedule by itself, so an external tool such as cron is still needed. The second is the central coordinator (a web service), which becomes a bottleneck: however many tasks there are, all calls go to this one coordinator, and because its process is single-threaded it can serve only one worker at a time.
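The relationship between a Task and a Target can be illustrated with a simplified model in plain Python. This is not Luigi's code: the method names `requires`, `output`, `run`, `complete` and `exists` mirror the ones Luigi uses, but `FileTarget`, `build` and the two example tasks are hypothetical stand-ins written for this sketch.

```python
import os
import tempfile


class FileTarget:
    """The output of a task: here, a file that either exists or does not."""

    def __init__(self, path):
        self.path = path

    def exists(self):
        return os.path.exists(self.path)


class Task:
    """Describes one unit of work, its dependencies and its output."""

    def requires(self):
        return []

    def output(self):
        raise NotImplementedError

    def run(self):
        raise NotImplementedError

    def complete(self):
        # A task is done once its target exists.
        return self.output().exists()


class WriteNumbers(Task):
    def __init__(self, workdir):
        self.workdir = workdir

    def output(self):
        return FileTarget(os.path.join(self.workdir, "numbers.txt"))

    def run(self):
        with open(self.output().path, "w") as handle:
            handle.write("1\n2\n3\n")


class SumNumbers(Task):
    def __init__(self, workdir):
        self.workdir = workdir

    def requires(self):
        return [WriteNumbers(self.workdir)]

    def output(self):
        return FileTarget(os.path.join(self.workdir, "sum.txt"))

    def run(self):
        with open(self.requires()[0].output().path) as handle:
            total = sum(int(line) for line in handle)
        with open(self.output().path, "w") as handle:
            handle.write(str(total))


def build(task):
    """Run a task after its dependencies, skipping anything already complete."""
    if task.complete():
        return []
    executed = []
    for dependency in task.requires():
        executed += build(dependency)
    task.run()
    executed.append(type(task).__name__)
    return executed


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        print(build(SumNumbers(workdir)))  # first run executes both tasks
        print(build(SumNumbers(workdir)))  # second run finds the target and skips
```

Because completion is decided by whether the target exists, a pipeline that is restarted after a failure repeats only the tasks whose outputs are still missing.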

Apache Airflow is a larger-scale tool and one of the most popular solutions among developers. It has the following advantages:

– convenient web UI with scheduler, database and logs;

– easy scalability due to queues;

– availability of config storage with encryption support;

– a large body of ready-made integrations with many external services;

– a managed version, Cloud Composer, for those who use Google Cloud Platform.

Apache Airflow has several key entities, the first of which is the Directed Acyclic Graph (DAG). Essentially, a DAG groups the tasks that must be executed together with the dependencies between them, and these tasks are defined by operators. Each operator in a DAG is a kind of “building block” and should be idempotent, meaning that running it again produces the same result. There are several categories of operators: action operators, transfer operators and sensors. Apache Airflow is not without its drawbacks. It is a rather complex tool with a large number of dependencies, and its size makes it considerably less flexible.
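The idea of a DAG of idempotent operators can be shown without Airflow itself. The sketch below uses only the Python standard library (`graphlib`, available from Python 3.9); `run_dag`, the three operator functions and the `store` dictionary are hypothetical names chosen for this illustration, not Airflow's API.

```python
from graphlib import TopologicalSorter


def extract(store):
    # Idempotent: always overwrites the same key with the same value.
    store["raw"] = [3, 1, 2]


def transform(store):
    store["clean"] = sorted(store["raw"])


def load(store):
    # Overwrites rather than appends, so a re-run leaves no duplicates.
    store["loaded"] = {"values": store["clean"]}


OPERATORS = {"extract": extract, "transform": transform, "load": load}

# Each key lists the tasks that must finish before it may start.
DEPENDENCIES = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}


def run_dag(dependencies, operators, store):
    """Run every operator once, after all of its upstream operators."""
    order = list(TopologicalSorter(dependencies).static_order())
    for name in order:
        operators[name](store)
    return order


if __name__ == "__main__":
    store = {}
    print(run_dag(DEPENDENCIES, OPERATORS, store))
    print(store["loaded"])
```

Running `run_dag` a second time on the same `store` leaves it unchanged, which is the property that makes it safe for a scheduler to retry a failed run.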

One of the newer tools is Prefect, created by a former contributor to Apache Airflow. Its core entities are broadly similar to those in Airflow. Today, Prefect is still a rather “raw” tool, but one with good prospects for the future.

In conclusion, all of the listed tools are built around the Python programming language, which is in high demand for developing applications and other software and has also found wide use in ML. In addition, Python is easy to learn and freely available.