ETL stands for Extract, Transform, Load: the process of extracting data from one or multiple sources, transforming it, and loading it into a destination. It is also a common approach to data brokering.
It is widely used to move data from one location to another while reshaping its structure before it is loaded into the destination.
ETL (Extract, Transform, Load) pipeline
ETL is a process with three separate steps. It is often called a pipeline, because data moves through these steps in sequence; a minimal sketch in code follows the list below.
Steps in an ETL pipeline:
- Extract the data from wherever it lives (DB, API, ...).
- Transform or process the data in some way. This could be restructuring, renaming, removing invalid or unnecessary data, adding new values, or any other type of data processing.
- Load the data into its final destination (DB, flat file, ...).
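To make the three stages concrete, here is a minimal sketch in TypeScript (the Node flavor used in the other tutorials). The record shapes, the hard-coded sample data, and the console-based load step are assumptions for illustration, not a prescribed implementation:

```typescript
// Minimal ETL skeleton: each stage is a separate function,
// wired together in runPipeline(). Shapes and sources are hypothetical.

interface RawOrder {
  id: string;
  amount: string;      // amount arrives as a string from the source
  createdAt: string;
}

interface Order {
  id: string;
  amount: number;      // parsed into a number during the transform step
  createdAt: Date;
}

// Extract: pull raw records from wherever they live (DB, API, file, ...).
async function extract(): Promise<RawOrder[]> {
  // Placeholder: a real pipeline would query a database or call an API here.
  return [
    { id: "1", amount: "19.99", createdAt: "2021-05-01T10:00:00Z" },
    { id: "2", amount: "5.00", createdAt: "2021-05-01T11:30:00Z" },
  ];
}

// Transform: drop invalid rows, rename fields, parse values.
function transform(rows: RawOrder[]): Order[] {
  return rows
    .filter((row) => !Number.isNaN(Number(row.amount)))
    .map((row) => ({
      id: row.id,
      amount: Number(row.amount),
      createdAt: new Date(row.createdAt),
    }));
}

// Load: write the transformed records to their destination.
async function load(orders: Order[]): Promise<void> {
  // Placeholder: a real pipeline would insert into a warehouse or another DB.
  console.log(`Loaded ${orders.length} orders`);
}

export async function runPipeline(): Promise<void> {
  const raw = await extract();
  const transformed = transform(raw);
  await load(transformed);
}
```

In a real pipeline, extract would query the source system and load would write to a data warehouse or another database; only the wiring of the three steps stays the same.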
ETL solves the problem of having data in different places and disparate formats by allowing you to pull data from different sources into a centralized location with a standardized format. ETL pipelines are typically run as batch jobs. This means all the data is moved at once.
Use Cases for ETL
A common use case for an ETL pipeline is in data analytics, where data has to be aggregated for analysis, with the following steps:
- Extract the raw data from the source database
- Clean, validate, and aggregate the data in the transform stage (a sketch of this step follows the list)
- Load the transformed data into the destination
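The transform stage is where the cleaning, validation, and aggregation happen. Here is a small sketch of that step, assuming hypothetical sales rows with a date and an amount:

```typescript
// Transform step for the analytics use case: drop invalid rows,
// then aggregate revenue per day. Field names are hypothetical.

interface SaleRow {
  date: string;    // e.g. "2021-05-01"
  amount: number;
}

function transformForAnalytics(rows: SaleRow[]): Map<string, number> {
  const revenueByDay = new Map<string, number>();

  for (const row of rows) {
    // Validate: skip rows with a missing date or a non-positive amount.
    if (!row.date || row.amount <= 0) continue;

    // Aggregate: sum revenue per day.
    revenueByDay.set(row.date, (revenueByDay.get(row.date) ?? 0) + row.amount);
  }

  return revenueByDay;
}

// Example usage:
const daily = transformForAnalytics([
  { date: "2021-05-01", amount: 19.99 },
  { date: "2021-05-01", amount: 5.0 },
  { date: "", amount: 12.0 },          // dropped during validation
]);
console.log(daily); // Map { '2021-05-01' => 24.99 }
```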
Another use case would be to periodically move stored data to a new database in a different format than it is currently stored in. Let's imagine you are a company with stores around the globe, each making transactions in its local currency, and every store reports its revenue to the head office at the end of the month. You could use an ETL pipeline here to better analyze the data from each store. The first step would be to extract the data from the reports, then transform the different currency amounts into a single base currency, and finally load the modified report data into a reporting database.
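A sketch of the transform step for this scenario; the report shape, the base currency (EUR), and the exchange rates are made-up values for illustration:

```typescript
// Transform step for the store-report example: convert each store's
// monthly revenue into a single base currency (EUR here, chosen arbitrarily).

interface StoreReport {
  storeId: string;
  currency: string;  // ISO code, e.g. "USD"
  revenue: number;   // revenue in the local currency
}

// Hypothetical exchange rates; a real pipeline would fetch current rates.
const ratesToEur: Record<string, number> = {
  EUR: 1,
  USD: 0.82,
  GBP: 1.16,
};

function toBaseCurrency(report: StoreReport): StoreReport & { revenueEur: number } {
  const rate = ratesToEur[report.currency];
  if (rate === undefined) {
    throw new Error(`No exchange rate for ${report.currency}`);
  }
  return { ...report, revenueEur: report.revenue * rate };
}

// After this transform, every report can be loaded into the reporting
// database with a comparable revenueEur field.
console.log(toBaseCurrency({ storeId: "berlin-1", currency: "USD", revenue: 10000 }));
```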
An ETL pipeline is a practical choice for migrating large amounts of data, like converting hundreds of gigabytes of data stored in flat files into a new format, or computing new data based on those hundreds of gigabytes. In general, ETL is a great fit for:
- Big data analysis
- Cleaning and standardizing data sets
- Migrating large amounts of data
- Data plumbing (connecting data sources so data can flow)
Limitations of ETL
An ETL process can be computationally intensive, sometimes requires access to data that may not be available in real time, and often deals with massive amounts of data. Therefore, ETL processes are typically executed on a batch of data. This means an ETL pipeline is not running 24/7, and the state of the data in the destination lags behind the source, sometimes by minutes, sometimes by days. The entire pipeline takes time to extract, transform, and load all the required data, so it usually runs on a schedule.
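Because the pipeline runs as a scheduled batch job, the trigger is usually a scheduler rather than application code. One way to sketch this in Node is with the node-cron package (any scheduler, or a plain system cron entry, works just as well); the imported runPipeline is the hypothetical entry point from the sketch above:

```typescript
// Run the pipeline as a nightly batch job.
import cron from "node-cron";

// Hypothetical pipeline entry point from the earlier skeleton.
import { runPipeline } from "./pipeline";

// "0 2 * * *" = every day at 02:00, so the destination can lag the source by up to a day.
cron.schedule("0 2 * * *", async () => {
  console.log("Starting nightly ETL run");
  try {
    await runPipeline();
    console.log("ETL run finished");
  } catch (err) {
    console.error("ETL run failed", err);
  }
});
```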
TL;DR
- An ETL pipeline extracts data, transforms it, and then loads it into its destination (db, etc.)
- Both ends of an ETL pipeline should be known: How to access the source of the data, and where it is going to end up.
- ETL is a powerful way to automate moving data between different parts of an architecture in batches.
Thanks for reading! If you have any questions, use the comment function or send me a message @mariokandut.
If you want to know more about Node, have a look at these Node Tutorials.