Data Engineering 101: Introduction to Data Engineering

#dataengineering #datascience #beginners #codenewbie

Data Engineering is the practice of designing and building systems for collecting, storing, and analyzing data at scale.

There are four general steps through which data flows within an organization:

Data Collection and Storage
Data Preparation
Exploration and Visualization
Experimentation and Prediction

Data Engineers focus on the first part of the workflow. Their role is to ingest and store the data so it's easily accessible and ready to be analyzed. They do this by building data pipelines. A data pipeline is a set of actions that ingest raw data from disparate sources and move the data to a destination for storage and analysis.

Data pipelines ensure the data flows efficiently through the organization. They automate extracting, transforming, combining, validating, and loading data, to reduce human intervention and errors, and decrease the time it takes for data to flow through the organization.
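To make the extract, transform, validate, and load steps concrete, here is a minimal sketch of a pipeline using only the Python standard library. The sample CSV, the `CUSTOMERS` lookup, the `orders` table, and the function names are all assumptions made up for illustration; a real pipeline would read from actual sources and usually run under an orchestration tool.

```python
import csv
import io
import sqlite3

# Hypothetical raw inputs standing in for two separate sources.
ORDERS_CSV = "order_id,customer_id,amount\n1,10,19.99\n2,11,\n3,10,5.00\n"
CUSTOMERS = {10: "Ada", 11: "Grace"}


def extract(raw_csv):
    """Extract: parse raw CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(raw_csv)))


def transform(rows, customers):
    """Validate, convert types, and combine orders with customer names."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue  # validation: skip rows with a missing amount
        cleaned.append(
            (
                int(row["order_id"]),
                customers[int(row["customer_id"])],  # combine the two sources
                float(row["amount"]),
            )
        )
    return cleaned


def load(records, conn):
    """Load: write the cleaned records to the destination table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", records)
    conn.commit()


def run_pipeline(conn):
    """Run all steps in order and return the number of rows loaded."""
    load(transform(extract(ORDERS_CSV), CUSTOMERS), conn)
    return conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]


conn = sqlite3.connect(":memory:")
loaded = run_pipeline(conn)
print(loaded)
```

Because each step is a plain function, the whole flow can be scheduled and re-run without anyone copying data by hand, which is the point of automating it.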

Data Scientists work on the rest of the workflow: they prepare the data according to their analysis needs, explore it, build insightful visualizations, and then run experiments or build predictive models. Data engineers lay the groundwork that makes data science activity possible.

Data can be stored in different formats. Some data is structured, but most of it is unstructured. Structured and unstructured data are sourced, collected, and scaled in different ways, and each typically resides in a different type of storage system.

Structured data is easy to search and organize. Data is entered following a rigid structure, like a spreadsheet with a fixed set of columns. SQL (Structured Query Language) is used to query such data.
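As a small illustration of querying structured data with SQL, the sketch below uses Python's built-in `sqlite3` module and an in-memory database. The `employees` table and its rows are hypothetical sample data, not something from a real system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# A rigid structure: every row must have the same typed columns.
conn.execute(
    "CREATE TABLE employees ("
    "id INTEGER PRIMARY KEY, "
    "name TEXT NOT NULL, "
    "department TEXT NOT NULL, "
    "salary REAL NOT NULL)"
)
conn.executemany(
    "INSERT INTO employees (name, department, salary) VALUES (?, ?, ?)",
    [
        ("Ada", "Engineering", 120000.0),
        ("Grace", "Engineering", 130000.0),
        ("Linus", "Support", 70000.0),
    ],
)

# Because the columns are fixed, aggregating by department is one query.
rows = conn.execute(
    "SELECT department, AVG(salary) FROM employees "
    "GROUP BY department ORDER BY department"
).fetchall()
print(rows)
```

The fixed schema is what makes the data easy to search: the database knows in advance that every row has a `department` and a `salary`.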

Semi-structured data resembles structured data but allows more flexibility: records carry some organizing structure, yet they don't all have to follow the same rigid schema. This keeps it relatively easy to organize. Semi-structured data is often stored in NoSQL databases and usually uses the JSON, XML, or YAML formats.
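The flexibility of semi-structured data is easiest to see in a short example. In this sketch, two hypothetical JSON records share some fields but not others, which a rigid table would not allow without extra columns. The field names and values are invented for illustration.

```python
import json

# Two records in the same collection with different sets of fields.
raw = """
[
  {"id": 1, "name": "Ada", "email": "ada@example.com"},
  {"id": 2, "name": "Grace", "tags": ["admin", "ops"]}
]
"""

records = json.loads(raw)

# The union of keys shows there is structure, just not a fixed schema.
all_keys = sorted({key for record in records for key in record})

# Missing fields have to be handled explicitly by whoever reads the data.
emails = [record.get("email") for record in records]

print(all_keys)
print(emails)
```

The trade-off is that consumers of the data must cope with missing or extra fields themselves, as the `record.get("email")` call does here.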

Unstructured data does not follow a model and can't be contained in a rows-and-columns format, which makes it difficult to search and organize. It usually takes the form of text, sound, pictures, or videos. It's typically stored in data lakes, although it can also appear in data warehouses or databases.

When it comes to storing big data, the two most popular options are data lakes and data warehouses. Data warehouses are used for analyzing archived structured data, while data lakes are used to store big data of all structures.

A Data Lake is a storage repository that can hold large amounts of structured, semi-structured, and unstructured data. It stores every type of data in its native format, with no fixed limits on account size or file size.

As the data pipeline graph shows, the Data Lake is where all the collected raw data gets stored, just as it was uploaded from the different sources. It's unprocessed and messy. While the data lake stores all the data, the data warehouse stores specific data for a specific use.
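The difference between the two can be sketched in a few lines. In this toy example, a temporary directory plays the role of the data lake and an in-memory SQLite database plays the role of the data warehouse. Both stand-ins, along with the file names and the `purchases` table, are assumptions for illustration; production lakes and warehouses are usually dedicated storage services.

```python
import json
import sqlite3
import tempfile
from pathlib import Path

# "Data lake": raw files of any type, kept exactly as they arrived.
lake = Path(tempfile.mkdtemp())
(lake / "events.json").write_text(
    json.dumps(
        [
            {"user": "ada", "action": "login"},
            {"user": "ada", "action": "purchase", "amount": 30},
            {"user": "grace", "action": "purchase", "amount": 12.5},
        ]
    )
)
(lake / "support_note.txt").write_text("Customer called about a late delivery.")

# "Data warehouse": a structured table holding specific data for one use.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE purchases (username TEXT, amount REAL)")

events = json.loads((lake / "events.json").read_text())
warehouse.executemany(
    "INSERT INTO purchases VALUES (?, ?)",
    [(e["user"], e["amount"]) for e in events if e["action"] == "purchase"],
)

total = warehouse.execute("SELECT SUM(amount) FROM purchases").fetchone()[0]
lake_files = sorted(p.name for p in lake.iterdir())
print(lake_files, total)
```

The lake still holds everything, including the free-text note and the login event, while the warehouse holds only the purchase records shaped for revenue analysis.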

Generalist data engineers typically work on small teams or at small companies, where they wear many hats as one of the few data-focused people in the company. These generalists are often responsible for every step of the data process, from managing data to analyzing it.

Pipeline-centric data engineers work alongside data scientists to help make use of the data they collect. They need in-depth knowledge of distributed systems and computer science.

In large organizations, where managing the flow of data is a full-time job, database-centric data engineers focus on analytics databases. They work with data warehouses across multiple databases and are responsible for developing table schemas.

According to Glassdoor, the average salary for a data engineer is $117,671 per year, with a reported salary range of $87,000 to $174,000 depending on skills, experience, and location. Senior data engineers earn an average salary of $134,244 per year, while lead data engineers earn an average salary of $139,907 per year.
