Muhammad Rameez

Data Engineering (Part 02)

In this blog, we are going to discuss the working practices used in data engineering.

What is ETL?

ETL stands for Extract, Transform, and Load, a process used for transferring data from one system to another, whether through an API or a local host. The Extract phase involves retrieving data from multiple sources, such as databases, files, and web services. The Transform phase involves converting the data into a format that can be used in the target system. Finally, the Load phase involves loading the transformed data into the target system. ETL is used to ensure data accuracy and consistency across systems.
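
As a minimal sketch of the three phases, assuming a hypothetical sales.csv with order_date, customer_id, quantity, and unit_price columns, and SQLite standing in for the target warehouse:

```python
import sqlite3
import pandas as pd

# Extract: read raw data from a source file (hypothetical sales.csv).
raw = pd.read_csv("sales.csv")

# Transform: clean and reshape before loading.
raw["order_date"] = pd.to_datetime(raw["order_date"])
raw = raw.dropna(subset=["customer_id"])
raw["total"] = raw["quantity"] * raw["unit_price"]

# Load: write the transformed data into the target system.
with sqlite3.connect("warehouse.db") as conn:
    raw.to_sql("sales", conn, if_exists="append", index=False)
```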

ETL is commonly used to migrate data from legacy systems to modern systems, as well as to integrate data from multiple sources into a single system. It can also be used to cleanse data and prepare it for analysis, as well as to load data into data warehouses and data lakes. ETL processes are usually automated to ensure efficient and accurate data transfer.

What is ELT?

ELT (Extract, Load, Transform) is an alternative to ETL and is also used for transferring data from one system to another. The Extract phase involves retrieving data from multiple sources, such as databases, files, and web services. The Load phase involves loading the raw data into the target system. The Transform phase then runs inside the target system, converting the data into the format it needs. ELT is used to ensure data accuracy and consistency across systems.

Like ETL, it allows migrating data from legacy systems to modern systems, as well as integrating data from multiple sources into a single system. It can also be used to cleanse data and prepare it for analysis, and to load data into data warehouses and data lakes. ELT processes are also usually automated to ensure efficient and accurate data transfer. ELT is often preferred over ETL because it allows for parallel processing and makes it easier to handle large datasets.
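
A minimal sketch of the same pipeline reorganized as ELT (again assuming the hypothetical sales.csv and SQLite as the target): the raw data is landed first, and the transformation runs as SQL inside the target system:

```python
import sqlite3
import pandas as pd

with sqlite3.connect("warehouse.db") as conn:
    # Extract + Load: land the raw data as-is in a staging table.
    pd.read_csv("sales.csv").to_sql(
        "raw_sales", conn, if_exists="replace", index=False
    )

    # Transform: reshape inside the target system with SQL,
    # letting the database engine do the heavy lifting.
    conn.execute("""
        CREATE TABLE IF NOT EXISTS sales AS
        SELECT customer_id,
               order_date,
               quantity * unit_price AS total
        FROM raw_sales
        WHERE customer_id IS NOT NULL
    """)
```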

There are many tools available for ETL:

  1. Apache Airflow
  2. Talend
  3. AWS Glue
  4. Informatica
  5. SSIS
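
As an illustration of how such a tool is used, here is a minimal Apache Airflow DAG sketch (assuming Airflow 2.x; the three task bodies are placeholders) that chains the extract, transform, and load steps:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    pass  # pull data from the source

def transform():
    pass  # clean and reshape the data

def load():
    pass  # write the result to the warehouse

with DAG(
    dag_id="etl_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    # Run the phases in order: extract, then transform, then load.
    t_extract >> t_transform >> t_load
```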

There are multiple ways to load data in the ETL process. Common approaches are historical load, full load, and incremental load.

Historical Load:

Historical load is the process of loading data from the past into a data warehouse or data lake. This process typically involves extracting data from multiple sources, transforming the data into a format that can be used in the target system, and loading the data into the target system.

Historical load can be used to capture data from past events and to analyze trends over time.
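
A hedged sketch of a historical (backfill) load, assuming a hypothetical archive/ directory of monthly CSV extracts and the same SQLite target as above:

```python
import sqlite3
import pandas as pd

with sqlite3.connect("warehouse.db") as conn:
    # Backfill one year of past data, one month at a time,
    # from a hypothetical partitioned source layout.
    for month in pd.date_range("2022-01-01", "2022-12-01", freq="MS"):
        chunk = pd.read_csv(f"archive/sales_{month:%Y_%m}.csv")
        chunk.to_sql("sales", conn, if_exists="append", index=False)
```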

Full Load:

Full load loads the entire dataset into the data warehouse or data lake, replacing any data that is already there. Data extraction can be done from multiple sources.

Full load is often used when setting up a new data warehouse or data lake, or when starting from scratch with an existing system.
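
A minimal full-load sketch (same hypothetical sales.csv and SQLite target as above); with pandas, if_exists="replace" drops and recreates the target table, so every run reloads everything:

```python
import sqlite3
import pandas as pd

with sqlite3.connect("warehouse.db") as conn:
    # Full load: discard the existing table and reload
    # the entire dataset from the source.
    pd.read_csv("sales.csv").to_sql(
        "sales", conn, if_exists="replace", index=False
    )
```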

Incremental Load:

Incremental load is often used to keep a data warehouse or data lake up to date with the latest data. It extracts only the new or changed data from multiple sources since the last load.
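
A minimal incremental-load sketch using a watermark column (assuming the sales table from the full-load example already exists and has an order_date column):

```python
import sqlite3
import pandas as pd

with sqlite3.connect("warehouse.db") as conn:
    # Watermark: the most recent order_date already loaded.
    last = conn.execute("SELECT MAX(order_date) FROM sales").fetchone()[0]

    new_rows = pd.read_csv("sales.csv", parse_dates=["order_date"])
    if last is not None:
        # Keep only rows newer than the watermark.
        new_rows = new_rows[new_rows["order_date"] > pd.to_datetime(last)]

    # Append just the new rows instead of reloading everything.
    new_rows.to_sql("sales", conn, if_exists="append", index=False)
```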

3-Tier Architecture:

The three-tier architecture of data engineering is a software architecture pattern that divides a system into three separate tiers:

  • Data ingestion
  • Data processing
  • Data storage

The data ingestion tier is responsible for collecting, cleaning, and loading data from multiple sources.

The data processing tier is responsible for transforming and aggregating the data.

The data storage tier is responsible for storing the data in a structured format.
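
A minimal sketch of the three tiers as separate Python functions (the file, table, and column names are hypothetical, continuing the sales example from above):

```python
import sqlite3
import pandas as pd

def ingest(path: str) -> pd.DataFrame:
    """Ingestion tier: collect and lightly clean raw source data."""
    return pd.read_csv(path).dropna(subset=["customer_id"])

def process(df: pd.DataFrame) -> pd.DataFrame:
    """Processing tier: transform and aggregate the data."""
    df["total"] = df["quantity"] * df["unit_price"]
    return df.groupby("customer_id", as_index=False)["total"].sum()

def store(df: pd.DataFrame, db: str = "warehouse.db") -> None:
    """Storage tier: persist the result in a structured format."""
    with sqlite3.connect(db) as conn:
        df.to_sql("customer_totals", conn, if_exists="replace", index=False)

# Each tier can be scaled or replaced independently.
store(process(ingest("sales.csv")))
```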

This architecture allows for scalability and modularity, as well as improved performance and security.

This architecture is the backbone of data engineering.
