Data sources episode 1: Common data sources in modern pipelines

TLDR

Databases aren’t the only sources of data; modern data pipelines can consume data from a wide variety of sources.

Outline

  • What is a Data Pipeline?
  • Difference between Data Pipeline and ETL (Extract, Transform, Load)?
  • What does Data Source mean?
  • Common data sources in modern pipelines
  • Why is it important to know your Data Source?
  • Conclusion

What’s a Data Pipeline?

A data pipeline is a series of interconnected steps that extracts, transforms, and loads data from different sources into a target destination, such as a data warehouse. It enables organizations to process and access data in a useful format for analysis, reporting, or other purposes.

A simple example of a data pipeline is the process of extracting sales data from multiple sources, such as point-of-sale (POS) systems and online sales platforms. The data is then transformed and cleaned to remove errors and inconsistencies, and aggregated to create a single view of all sales. Finally, the transformed data is loaded into a data warehouse, where it can be analyzed to identify trends, patterns, or insights that inform business decisions. A minimal sketch of this flow follows.
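
As an illustration, here is a minimal, hypothetical sketch of that sales pipeline in Python with pandas and SQLAlchemy. The file paths, column names, and warehouse connection string are all assumptions for illustration, not details from an actual pipeline:

```python
import pandas as pd
from sqlalchemy import create_engine

# Extract: read sales exports from two hypothetical sources.
pos_sales = pd.read_csv("pos_sales.csv")        # point-of-sale export (assumed path)
online_sales = pd.read_csv("online_sales.csv")  # online platform export (assumed path)

# Transform: combine both sources, drop bad rows, and aggregate into one view.
sales = pd.concat([pos_sales, online_sales], ignore_index=True)
sales = sales.dropna(subset=["order_id", "amount"]).drop_duplicates(subset=["order_id"])
daily_totals = sales.groupby("sale_date", as_index=False)["amount"].sum()

# Load: write the aggregated view to a warehouse table (connection string is a placeholder).
engine = create_engine("postgresql://user:password@warehouse-host/analytics")
daily_totals.to_sql("daily_sales", engine, if_exists="replace", index=False)
```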


Difference between Data Pipeline and ETL (Extract, Transform, Load)?

Data Pipeline and ETL are both essential concepts in data processing and integration, but they serve different purposes and have unique characteristics. Here's a comparison of Data Pipeline and ETL:

| Aspect | Data Pipeline | ETL (Extract, Transform, Load) |
| --- | --- | --- |
| Purpose | Moves and processes data from various sources to destinations | Extracts data from sources, transforms it, and loads it into a target |
| Scope | Broader; includes ETL as a subset | A specific type of data pipeline |
| Flexibility | Can handle various types of data processing | Primarily focused on structured data transformation |
| Data processing | Real-time or batch | Typically batch |
| Data transformation | Optional; can be performed in various ways | Integral part of the process |
| Data flow | Unidirectional or bidirectional | Unidirectional |
| Data types | Structured, semi-structured, and unstructured | Primarily structured and semi-structured |
| Use cases | Data replication, streaming, analytics, machine learning | Data warehousing, data migration, data integration |
| Processing complexity | Simple to complex | Simple to complex, but usually complex |

What does "Data Source" mean?

In the context of a data pipeline, a "Data Source" refers to the origin or starting point from which data is collected, extracted, or ingested before it’s processed, transformed, and either stored or utilized in the pipeline.


Common sources of data in modern data pipelines

Data sources can be diverse and include various types of systems, databases, applications, or files where raw data is generated, stored, or managed. Here are the most common ones (minimal code sketches for several of them follow the list):

  1. Relational databases: A relational database organizes data into one or more tables, with each table consisting of rows and columns. Each table represents a single entity or concept, such as customers, orders, products, or employees. Relational databases use Structured Query Language (SQL) to manipulate and retrieve data from the tables. Example - MySQL, PostgreSQL
  2. NoSQL databases: A NoSQL database differs from traditional relational databases in its data model and approach to storing and retrieving data. NoSQL databases are often designed to handle unstructured or semi-structured data, such as social media posts, documents, and sensor data. Unlike relational databases, NoSQL databases don’t use tables with fixed schemas to store data. Instead, they use various data models, such as key-value, document, graph, or column-family models. NoSQL databases are highly scalable and can handle large volumes of data and high levels of traffic. Example - MongoDB, Cassandra, Redis, Amazon DynamoDB, Azure Cosmos DB, Google Cloud Bigtable
    Data representation in MongoDB (Source: MongoDB)
  3. Data warehouses: A data warehouse is a large, centralized repository of data designed specifically for business intelligence and analytics. It’s used to store, manage, and analyze data from multiple sources to support decision-making and reporting in an organization. Example - Amazon Redshift, Snowflake, Google BigQuery
  4. File systems and object storage: In modern pipelines this usually means a distributed file system or object store: a scalable storage system designed to handle large volumes of data across multiple nodes or servers. It provides a way to store, manage, and access data in a distributed environment, enabling high availability and fault tolerance. Example - HDFS, Amazon S3, Azure Blob Storage, Google Cloud Storage
  5. APIs: An API (Application Programming Interface) acts as a mediator between different software applications, allowing them to communicate and exchange data with each other. APIs can be used to retrieve data, initiate actions or workflows, and enable integrations between different software applications. Example - RESTful APIs, GraphQL, [SOAP](https://www.soapui.org/learn/api/soap-vs-rest-api/)
    Working example of a REST API call and its response (Source)
  6. Messaging Queues: A messaging queue allows software applications to communicate asynchronously by exchanging messages or data using a publish-subscribe or point-to-point messaging model. It improves the scalability, reliability, and fault tolerance of software systems by decoupling message sending and receiving. Messaging queues are commonly used in distributed systems, cloud computing, and microservices architecture. Example - Apache Kafka, RabbitMQ
  7. Social media platforms: Social media platforms such as Facebook, Twitter, Instagram, and LinkedIn can serve as valuable data sources for businesses and researchers. They provide large volumes of user-generated content, including text, images, and videos, that can be analyzed to gain insights into user behavior, sentiment, trends, and preferences.
  8. IoT devices and sensors: IoT devices and sensors generate vast amounts of data that can be used to monitor and analyze physical processes, environments, and activities. This data can be leveraged for predictive maintenance, process optimization, and real-time decision-making in various industries such as manufacturing, healthcare, transportation, and agriculture. Example - Smart Meters, Temperature Sensors, GPS trackers
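
To make these sources concrete, here are a few minimal, hypothetical Python sketches. First, extracting from a relational database: this assumes a PostgreSQL instance and the psycopg2 driver, and the table, columns, and credentials are all placeholders:

```python
import psycopg2

# Connect to a hypothetical PostgreSQL instance (credentials are placeholders).
conn = psycopg2.connect(
    host="localhost", dbname="shop", user="etl_user", password="secret"
)
with conn, conn.cursor() as cur:
    # SQL is the standard way to retrieve rows from a relational table.
    cur.execute(
        "SELECT order_id, amount, created_at FROM orders WHERE created_at >= %s",
        ("2023-01-01",),
    )
    for order_id, amount, created_at in cur.fetchall():
        print(order_id, amount, created_at)
conn.close()
```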
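
Reading from a document store looks different: queries filter JSON-like documents instead of rows. A sketch using pymongo, where the database, collection, and filter are assumptions:

```python
from pymongo import MongoClient

# Connect to a hypothetical local MongoDB instance.
client = MongoClient("mongodb://localhost:27017")
collection = client["social"]["posts"]  # database and collection names are assumptions

# Documents are filtered with a JSON-like query rather than SQL.
for post in collection.find({"likes": {"$gte": 100}}).limit(10):
    print(post["_id"], post.get("text"))
```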
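
Warehouses are also queried with SQL, usually through a vendor client. A sketch against BigQuery using the google-cloud-bigquery package; the project, dataset, and table names are made up, and credentials are assumed to be configured in the environment:

```python
from google.cloud import bigquery

# Assumes application-default credentials are available in the environment.
client = bigquery.Client()

# Standard SQL against a hypothetical dataset and table.
query = """
    SELECT sale_date, SUM(amount) AS total
    FROM `my_project.analytics.daily_sales`
    GROUP BY sale_date
    ORDER BY sale_date
"""
for row in client.query(query).result():
    print(row.sale_date, row.total)
```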
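
Object stores expose whole files through get/put operations rather than queries. A sketch using boto3 against S3; the bucket and key are placeholders, and AWS credentials are assumed to be configured:

```python
import boto3

s3 = boto3.client("s3")

# Download one object; bucket and key are hypothetical.
response = s3.get_object(Bucket="my-data-lake", Key="raw/sales/2023-01-01.csv")
raw_bytes = response["Body"].read()
print(f"fetched {len(raw_bytes)} bytes")
```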
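
Pulling from a REST API is typically a sequence of HTTP GETs. A sketch with the requests library; the endpoint, parameters, and response shape are invented for illustration, and real APIs usually also require authentication headers:

```python
import requests

# Hypothetical REST endpoint and query parameters.
resp = requests.get(
    "https://api.example.com/v1/orders",
    params={"since": "2023-01-01", "page_size": 100},
    timeout=10,
)
resp.raise_for_status()  # fail loudly on HTTP errors
for order in resp.json()["results"]:  # response shape is an assumption
    print(order["id"], order["amount"])
```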
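
Finally, streaming sources such as messaging queues are consumed continuously rather than in one batch. A sketch of a Kafka consumer using the kafka-python package; the topic, broker address, and message format are assumptions:

```python
import json
from kafka import KafkaConsumer

# Subscribe to a hypothetical topic on a local broker.
consumer = KafkaConsumer(
    "sales-events",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    auto_offset_reset="earliest",
)
for message in consumer:  # blocks and yields messages as they arrive
    print(message.value)
```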


Why is it important to know your data source?

Understanding your data source at a technical level is important because it affects how the data is captured, stored, processed, and analyzed. Different data sources can differ in data format, quality, volume, and velocity, all of which shape the design of the data pipeline.

For example, if the data source generates large volumes of data in real-time, such as IoT sensors, the data pipeline must be designed to handle the high data velocity and ensure timely processing and analysis. Similarly, if the data source has poor data quality or inconsistent data formats, data cleaning and transformation steps must be included in the pipeline to ensure accurate and reliable data analysis.

Understanding the technical aspects of the data source can also help identify potential data integration or compatibility issues with other data sources or systems, which can impact the overall performance and effectiveness of the data pipeline.


Conclusion

In conclusion, data pipelines play a crucial role in processing and integrating data from various sources. The most common sources of data in modern pipelines include relational databases, NoSQL databases, data warehouses, file systems and object storage, APIs, messaging queues, social media platforms, and IoT devices and sensors. By harnessing data from these diverse sources, businesses and organizations can gain valuable insights, optimize their operations, and drive data-driven decision-making. As the volume, variety, and velocity of data continue to increase, the importance of robust and scalable data pipelines cannot be overstated, making them a fundamental component in today's data-centric world.
In episode 2 of the data sources series, we’ll cover the Singer spec, an open-source standard for syncing data from various data sources.

Link to original blog: https://www.mage.ai/blog/data-sources-ep1-common-data-sources-in-modern-pipelines
