Andre Yai

Steps of Big Data Pipeline

With the increase in computational and storage power, companies have been collecting more data than ever. This has created new tasks and job opportunities. To extract value from that data, companies rely on data pipelines, which consist of stages such as collection, storage, processing, governance, and analysis.

Collection

This step is responsible for ingesting data from different sources so it can be analyzed later. The data comes mainly from real-time (streaming) and batch sources.

In real-time platforms, we have those who produce data (producers) and those who consume data (consumers). This is the model companies like Netflix and Spotify use to deliver data to millions of users. Examples of streaming services include Apache Kafka, AWS Kinesis, and AWS SQS.
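
As an illustration of the producer side, here is a minimal sketch that publishes events to a Kinesis stream with boto3. The stream name, region, and event payload are hypothetical placeholders, not something from the original post.

```python
import json
import boto3

# Assumption: a Kinesis stream named "clickstream" already exists in us-east-1.
kinesis = boto3.client("kinesis", region_name="us-east-1")

def send_event(event: dict) -> None:
    """Publish one event; Kinesis routes records to shards by PartitionKey."""
    kinesis.put_record(
        StreamName="clickstream",           # hypothetical stream name
        Data=json.dumps(event).encode(),    # payload must be bytes
        PartitionKey=str(event["user_id"])  # keeps one user's events ordered
    )

send_event({"user_id": 42, "action": "play", "track": "song-123"})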

The batch collection step may involve migrating data from an existing database, for example ingesting data from a transactional database such as PostgreSQL, MySQL, Oracle, or Aurora (often hosted on RDS) into a data lake or a data warehouse like AWS Redshift. In AWS, you can use the AWS Database Migration Service (DMS) for that.
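
DMS itself is configured through replication tasks rather than a few lines of code, so as a lighter-weight sketch of batch ingestion, here is a pandas job that pulls a table slice from a transactional database and lands it in the data lake as Parquet. The connection string, table, and S3 path are all made-up placeholders, and it assumes pandas, SQLAlchemy, and s3fs are installed.

```python
import pandas as pd
from sqlalchemy import create_engine

# Assumption: credentials, host, and table names are placeholders.
engine = create_engine("postgresql://user:password@host:5432/shop")

# Pull one day of orders from the transactional database...
orders = pd.read_sql(
    "SELECT * FROM orders WHERE created_at >= '2023-01-01'", engine
)

# ...and land it in the data lake as Parquet (s3fs handles the s3:// path).
orders.to_parquet("s3://my-data-lake/raw/orders/2023-01-01.parquet")
```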

Storage

Once we collect our data, it needs a place to be stored. By knowing how often the data is accessed and what it is needed for, we can manage its lifecycle, keeping frequently accessed data readily available and archiving or deleting it as it ages.
A service that helps with that is AWS S3.
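
For instance, here is a sketch of an S3 lifecycle rule set with boto3 that archives objects to Glacier after 90 days and deletes them after a year. The bucket name, prefix, and time windows are assumptions for illustration.

```python
import boto3

s3 = boto3.client("s3")

# Assumption: the bucket "my-data-lake" exists and you own it.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-lake",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-then-expire",
                "Filter": {"Prefix": "raw/"},  # apply only to raw data
                "Status": "Enabled",
                # After 90 days, move objects to the Glacier storage class.
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
                # After 365 days, delete them entirely.
                "Expiration": {"Days": 365},
            }
        ]
    },
)
```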

Process

This step deals with ETL (extract, transform, load): cleaning, enriching, and transforming raw data into more refined layers.
Some services that help with that are AWS Glue, AWS EMR, and AWS Lambda.
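
As a sketch of the transform step, here is a small PySpark job of the kind you might run on Glue or EMR. It cleans raw JSON events and writes them out as Parquet; the S3 paths and column names are invented for the example.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("clean-events").getOrCreate()

# Assumption: raw JSON events were landed under this prefix by the collection step.
raw = spark.read.json("s3://my-data-lake/raw/events/")

clean = (
    raw.dropDuplicates(["event_id"])                # remove duplicate events
       .na.drop(subset=["user_id"])                 # drop rows missing a user
       .withColumn("date", F.to_date("timestamp"))  # enrich with a date column
)

# Write the refined layer, partitioned for efficient querying.
clean.write.mode("overwrite").partitionBy("date") \
     .parquet("s3://my-data-lake/clean/events/")
```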

Governance

Data governance consists of data management, data quality, and data stewardship. It covers policies for accessing data, data discovery, and checks for accuracy, validity, and completeness.
Some services that help with that are the AWS Glue Data Catalog and AWS Lake Formation.
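
As an example of managing access policies, here is a sketch that grants an analyst role read-only access to one catalog table through Lake Formation. The role ARN, database, and table names are placeholders.

```python
import boto3

lakeformation = boto3.client("lakeformation")

# Assumption: the role, database, and table below already exist.
lakeformation.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/analyst"
    },
    Resource={
        "Table": {"DatabaseName": "clean", "Name": "events"}
    },
    Permissions=["SELECT"],  # read-only access for analysts
)
```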

Analyze

This part is responsible for extracting value from the data through data analysis, machine learning, and data visualization. It consists of extracting meaning from the data: describing how it is organized, grouping it, and making predictions from it.
Some services that help with that are AWS SageMaker and AWS QuickSight.
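
To make "grouping" concrete, here is a tiny scikit-learn sketch that clusters users from the cleaned data, the kind of code you could run in a SageMaker notebook. The dataset path and feature columns are invented for the example.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Assumption: the processing step produced this cleaned dataset.
users = pd.read_parquet("s3://my-data-lake/clean/user_features.parquet")

features = users[["plays_per_day", "minutes_listened"]]  # invented columns

# Group users into 3 segments based on listening behaviour.
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(features)
users["segment"] = model.labels_

print(users.groupby("segment").mean(numeric_only=True))
```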
