So you want to break into data engineering? Start today by learning the fundamental concepts of the field.
Data engineering encompasses the processes that collect and integrate raw data from various sources into a unified, accessible data repository that can be used for analytics and other applications.
What Does a Data Engineer Do?
- Data collection: extracting and integrating data from a variety of sources.
- Data processing: preparing data for analysis and other downstream tasks by applying suitable transformations, including cleaning, validating, and transforming the data (see the sketch after this list).
- Data pipelines: designing, building, and maintaining pipelines that carry data from source to destination.
- Infrastructure management: designing and maintaining the infrastructure for data collection, processing, and storage.
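For instance, here is a minimal sketch of the data preparation step using pandas. The dataset and its columns (order_id, amount, order_date) are hypothetical:

```python
import pandas as pd

# A minimal cleaning-and-validation pass; the column names are hypothetical.
def clean_orders(df: pd.DataFrame) -> pd.DataFrame:
    df = df.drop_duplicates(subset="order_id")     # remove duplicate records
    df = df.dropna(subset=["order_id", "amount"])  # drop rows missing required fields
    df["amount"] = pd.to_numeric(df["amount"], errors="coerce")       # coerce bad values to NaN
    df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
    return df[df["amount"] > 0]                    # validate: keep positive amounts only
```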
Data Engineering Concepts
Incoming data arrives from sources across the spectrum: from relational databases and web scraping to news feeds and user chats. Data from these sources can be classified into one of three broad categories:
- Structured data: has a well-defined schema. Examples include data in relational databases, spreadsheets, and the like.
- Semi-structured data: has some structure but no rigid schema, typically with metadata tags that provide additional information. Examples include JSON and XML data, emails, and zip files (see the sketch after this list).
- Unstructured data: lacks a well-defined schema. Examples include images, videos and other multimedia files, and website data.
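Here's a quick illustration of semi-structured data: a hypothetical JSON event whose metadata tags let us navigate it even though no rigid schema is enforced:

```python
import json

# A hypothetical event record: structured via tags, but with no enforced schema.
raw = '{"type": "user_chat", "timestamp": "2024-01-15T09:30:00Z", "payload": {"user": "alice", "message": "hi"}}'

event = json.loads(raw)
print(event["type"], event["payload"]["user"])  # user_chat alice
```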
Data Repositories: Data Warehouses, Data Lakes, and Data Marts
Before we take a deep dive, let's learn about two types of data processing systems, namely, OLTP and OLAP systems:
OLTP, or Online Transaction Processing, systems are used to store day-to-day operational data for applications such as inventory management. OLTP systems include relational databases whose data can also be used for analysis and deriving business insights.

OLAP, or Online Analytical Processing, systems are used to store large volumes of historical data for carrying out complex analytics. In addition to databases, OLAP systems also include data warehouses and data lakes.
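To make the contrast concrete, here is a minimal sketch using Python's built-in sqlite3 module; the sales table is a hypothetical example, and a real OLAP system would of course operate at far larger scale:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, amount REAL, sold_at TEXT)")

# OLTP-style workload: many small reads and writes on individual transactions.
conn.execute("INSERT INTO sales (region, amount, sold_at) VALUES (?, ?, ?)",
             ("EU", 19.99, "2024-01-15"))
row = conn.execute("SELECT * FROM sales WHERE id = ?", (1,)).fetchone()

# OLAP-style workload: scan large volumes of history to aggregate and analyze.
totals = conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region").fetchall()
```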
Data warehouses: A data warehouse is a single, comprehensive storehouse of incoming data.
Data lakes: Data lakes allow you to store all data types, including semi-structured and unstructured data, in their raw format without processing them. Data lakes are often the destination for ELT processes.
Data marts: You can think of a data mart as a smaller subsection of a data warehouse, tailored for a specific business use case.
Data lakehouses: More recently, data lakehouses have been gaining popularity, as they allow the flexibility of data lakes while offering the structure and organization of data warehouses.
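As a small illustration of the data lake idea, here is a sketch that lands raw JSON events in a date-partitioned directory layout without any processing; the paths and partitioning scheme are assumptions for illustration:

```python
import json
from datetime import date
from pathlib import Path

LAKE_ROOT = Path("datalake/raw/events")  # hypothetical lake location

def land_raw(record: dict) -> None:
    # Append the record as-is: no schema enforcement, no transformation.
    partition = LAKE_ROOT / f"ingest_date={date.today().isoformat()}"
    partition.mkdir(parents=True, exist_ok=True)
    with open(partition / "events.jsonl", "a") as f:
        f.write(json.dumps(record) + "\n")

land_raw({"type": "page_view", "url": "/home"})
```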
Data Pipelines: ETL and ELT Processes
Data pipelines encompass the journey of data from source to destination systems, through ETL and ELT processes.
An ETL (Extract, Transform, Load) process includes the following steps:
- Extract data from various sources
- Transform the data: clean, validate, and standardize it
- Load the data into a data repository or a destination application
ETL processes often have a data warehouse as the destination.
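Here is a minimal end-to-end ETL sketch in Python, with SQLite standing in for the warehouse; the source file orders.csv and its columns are hypothetical:

```python
import csv
import sqlite3

def extract(path: str) -> list[dict]:
    # Extract: read raw rows from a source file.
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows: list[dict]) -> list[tuple]:
    # Transform: clean, validate, and standardize before loading.
    clean = []
    for r in rows:
        try:
            clean.append((r["order_id"], float(r["amount"])))
        except (KeyError, ValueError):
            continue  # drop malformed rows
    return clean

def load(rows: list[tuple]) -> None:
    # Load: write the cleaned data into the warehouse (SQLite stands in here).
    conn = sqlite3.connect("warehouse.db")
    conn.execute("CREATE TABLE IF NOT EXISTS orders (order_id TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    conn.commit()

load(transform(extract("orders.csv")))
```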
ELT (Extract, Load, Transform) is a variation of the ETL process in which the steps run in a different order: extract, load, and then transform.
This means the raw data collected from the source is loaded into the data repository before any transformation is applied, which allows us to apply transformations specific to a particular application. ELT processes often have data lakes as their destination.
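For contrast, here's an ELT sketch: the raw values land untouched first, and the transformation happens later inside the repository itself (again with SQLite standing in):

```python
import sqlite3

conn = sqlite3.connect("lake.db")  # SQLite stands in for the data lake
conn.execute("CREATE TABLE IF NOT EXISTS raw_orders (order_id TEXT, amount TEXT)")

# Load first: raw strings land as-is, bad values and all.
conn.execute("INSERT INTO raw_orders VALUES (?, ?)", ("A-1", "19.99"))
conn.commit()

# Transform later, per application, inside the repository.
clean = conn.execute(
    "SELECT order_id, CAST(amount AS REAL) FROM raw_orders WHERE amount != ''"
).fetchall()
```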
Tools Data Engineers Should Know
- Programming language: intermediate to advanced proficiency in a programming language, preferably one of Python, Scala, and Java
- Databases and SQL: a good understanding of database design and the ability to work with both relational databases, such as MySQL and PostgreSQL, and non-relational databases, such as MongoDB
- Command-line fundamentals: familiarity with shell scripting and data processing at the command line
- Knowledge of operating systems and networking
- Data warehousing fundamentals
- Fundamentals of distributed systems
Data engineering also requires strong software engineering skills, including version control, logging, and application monitoring. You should know how to use containerization tools like Docker and container orchestration tools like Kubernetes.
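As a small example of the logging piece, here is a minimal setup you might add to a pipeline step; the logger name and messages are illustrative:

```python
import logging

# Basic logging configuration for a pipeline step.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
logger = logging.getLogger("etl.orders")  # hypothetical pipeline name

logger.info("extracted %d rows", 1000)
logger.warning("dropped %d malformed rows", 3)
```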
- dbt (data build tool) for analytics engineering
- Apache Spark for big data analysis and distributed data processing
- Airflow for data pipeline orchestration (see the DAG sketch after this list)
- Fundamentals of cloud computing and working with at least one cloud provider, such as AWS or Microsoft Azure
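Since Airflow DAGs are defined in Python, here is a minimal DAG sketch wiring up the ETL steps from earlier. The task bodies are placeholders, and the dag_id, schedule, and start date are assumptions; exact parameter names can vary slightly across Airflow versions:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Placeholder task bodies; in practice these would call your ETL functions.
def extract(): ...
def transform(): ...
def load(): ...

with DAG(
    dag_id="orders_etl",            # hypothetical pipeline name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    t_extract >> t_transform >> t_load  # run extract, then transform, then load
```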