In this article, I am going to discuss a number of key data engineering concepts that will help you understand the data engineering career path.
Big data is a term used to describe large, complex datasets that are difficult to process using traditional computing techniques. Big data often includes datasets whose size exceeds what commonly used software tools can capture, manage, and process within a reasonable time.
Business intelligence (BI) is defined as the collection of processes and strategies for analyzing data to generate insights used to make business decisions.
Data architecture involves the process of designing, constructing, and maintaining data systems. Data architecture includes the design of data models, database management systems, and data warehouses. Data engineers often work with data architects to design and implement data systems, but they can also work independently.
Data lakes in data architecture.
Data architecture in data marts and OLAP cubes.
Containerization is the process of packaging an application so that it can run in isolated environments known as containers. Containerization allows for better resource utilization and portability of applications. A containerized application encapsulates all of its dependencies, libraries, binaries, and configuration files into containers. This allows an application to run in the cloud or on a virtual machine without needing to be refactored.
Docker has become synonymous with containers and is a suite of tools that can be used to create, run, and share containerized applications.
Docker for data engineering. Image courtesy of Google Images.
Kubernetes, or k8s, is a portable, open-source platform for managing containerized applications.
Kubernetes for data engineering. Image courtesy of Google Images.
Cloud computing is a model for delivering IT services over the internet. Data engineers often use cloud-based storage services, like Amazon S3 and Google Cloud Storage, to store data, alongside cloud compute services to process it.
Databases are collections of data that can be queried. Relational databases, such as MySQL, Oracle, and Microsoft SQL Server, store data in tables and have existed for over four decades. Now, there are many different types of databases including:
- Wide-column stores such as Cassandra and HBase
- Key-value stores such as DynamoDB and memcachedb
- Document databases such as MongoDB and Couchbase
- Graph databases such as Neo4j
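To make the differences between these database types more concrete, here is a rough sketch of how the same kind of user data might be shaped in each model. Plain Python structures stand in for the real stores, and all names and values are made up for illustration. Real systems such as Cassandra, DynamoDB, MongoDB, and Neo4j each have their own APIs and query languages, which are not shown here.

```python
import json

# Key-value: an opaque value looked up by a single key.
key_value_store = {"user:42": json.dumps({"name": "Amina", "city": "Nairobi"})}

# Document: a nested, self-describing record.
document = {
    "_id": 42,
    "name": "Amina",
    "address": {"city": "Nairobi"},
    "skills": ["sql", "python"],
}

# Wide-column: row key -> column family -> columns.
# Rows do not need to share the same set of columns.
wide_column = {
    "user:42": {"profile": {"name": "Amina"}, "activity": {"last_login": "2023-01-05"}},
    "user:43": {"profile": {"name": "Brian", "city": "Kampala"}},
}

# Graph: nodes plus labelled edges between them.
nodes = {42: "Amina", 43: "Brian"}
edges = [(42, "FOLLOWS", 43)]

name_from_kv = json.loads(key_value_store["user:42"])["name"]
city_from_doc = document["address"]["city"]
columns_for_43 = sorted(wide_column["user:43"]["profile"])
followed_by_amina = [nodes[dst] for src, rel, dst in edges if src == 42 and rel == "FOLLOWS"]

print(name_from_kv, city_from_doc, columns_for_43, followed_by_amina)
```

The point of the sketch is the shape of the data, not performance or storage details, which differ a lot between real products.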
Data accessibility is the ability of users to access data stored in a system.
Data compliance is the act of following the laws and regulations that apply to data. Data privacy is the practice of protecting data from unauthorized access.
Data governance is the process of managing the availability, usability, integrity, and security of data within an organization. It includes the policies and procedures that define how data is managed.
Data marts are subsets of data warehouses that contain only the data needed by a specific group or department.
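As a small illustration of the idea, the sketch below uses Python's built-in sqlite3 module with an in-memory database standing in for a warehouse. The `sales` table, the `marketing_mart` view, and the sample rows are all hypothetical. In practice a data mart is often a separate set of tables or even a separate database, so treating it as a view is a simplification.

```python
import sqlite3

# In-memory database standing in for a warehouse (illustrative only).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, department TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales (department, amount) VALUES (?, ?)",
    [("marketing", 120.0), ("finance", 300.0), ("marketing", 80.0)],
)

# The "mart": a view exposing only the marketing department's rows.
conn.execute(
    "CREATE VIEW marketing_mart AS "
    "SELECT id, amount FROM sales WHERE department = 'marketing'"
)

mart_rows = conn.execute(
    "SELECT id, amount FROM marketing_mart ORDER BY id"
).fetchall()
print(mart_rows)
```

The marketing team queries only `marketing_mart` and never sees the finance rows.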
Data integration platforms are tools that help organizations combine data from multiple sources. These typically include features for data cleaning and transformation. For example, an integration platform as a service (iPaaS) is a set of automated tools that integrate software applications deployed in different environments.
Data infrastructure components can include virtual machines, cloud services, networking, storage, and software. These components are necessary for data systems to function.
Data pipelines encompass the process of extracting data from one or more sources, transforming the data into a format that can be used by applications further down the line, and loading the data into a target system. Data pipelines essentially automate the process of moving data from one system to another.
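The three stages can be sketched as three small functions chained together. This is only a minimal illustration using the Python standard library. The sample CSV text, the field names, and the in-memory list used as the target are all assumptions made for the example. A real pipeline would read from files, APIs, or databases and write to an actual target system.

```python
import csv
import io

# Hypothetical source data; a real pipeline would read from a file, API, or database.
RAW_CSV = "name,signup_date,country\n alice ,2023-01-05,ke\nBob,2023-02-10,UG\n"


def extract(text):
    """Read CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Normalize names and country codes."""
    return [
        {
            "name": row["name"].strip().title(),
            "signup_date": row["signup_date"],
            "country": row["country"].strip().upper(),
        }
        for row in rows
    ]


def load(rows, target):
    """Append rows to the target and return how many were loaded."""
    target.extend(rows)
    return len(rows)


target_table = []  # stand-in for a target system
loaded = load(transform(extract(RAW_CSV)), target_table)
print(loaded, target_table)
```

Keeping each stage as its own function makes it easier to test a stage in isolation and to swap the source or the target later.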
Data repositories or data stores are systems that are used to store data, as discussed earlier. Examples include relational databases, NoSQL databases, and traditional file systems.
Data sources are the systems or devices from which data is extracted. Examples include operational databases, application logs, APIs, sensors, and public datasets such as global mental health data.
Data warehouses are centralized systems that store structured data collected from across an organization for reporting and analysis. Data warehousing involves extracting data from multiple sources, transforming the data into a format that can be used for analysis, and loading the data into the warehouse.
Data lakes are repositories that store all the data organizations collect, in their rawest form. Data lakes are often used for storing data that has not been transformed or processed in any way.
ETL and ELT processes are used for moving data from one system to another.
- ETL (extract, transform, load) processes involve extracting data from one or more sources, transforming the data into a format that can be used by the target system, and loading the data into the target system.
- ELT (extract, load, transform) processes involve extracting data from one or more sources, loading the data into the target system, and then transforming the data into the desired format.
ETL processes are useful for data that needs cleaning before the target system can use it. ELT processes, on the other hand, are useful when the target system can handle the data in its raw form. Because transformation is deferred until after loading, ELT tends to make data available in the target system faster than ETL.
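The difference in ordering can be sketched with Python's built-in sqlite3 module, where an in-memory database plays the role of the target system. The `raw_orders` sample data, the table names, and the cleaning rules are assumptions chosen for illustration. In the ETL function the cleaning happens in Python before anything is loaded. In the ELT function the raw rows are loaded first and the cleaning is done with SQL inside the target.

```python
import sqlite3

# Hypothetical raw records: untrimmed names and amounts stored as text.
raw_orders = [(" alice ", "10.50"), ("BOB", "4"), ("carol", "7.25")]


def etl(records):
    """ETL: clean in Python first, then load only the cleaned rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
    cleaned = [(name.strip().lower(), float(amount)) for name, amount in records]
    conn.executemany("INSERT INTO orders VALUES (?, ?)", cleaned)
    return conn.execute(
        "SELECT customer, amount FROM orders ORDER BY customer"
    ).fetchall()


def elt(records):
    """ELT: load the raw rows as-is, then transform with SQL inside the target."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE raw_orders (customer TEXT, amount TEXT)")
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?)", records)
    conn.execute(
        "CREATE TABLE orders AS "
        "SELECT lower(trim(customer)) AS customer, CAST(amount AS REAL) AS amount "
        "FROM raw_orders"
    )
    return conn.execute(
        "SELECT customer, amount FROM orders ORDER BY customer"
    ).fetchall()


etl_result = etl(raw_orders)
elt_result = elt(raw_orders)
print(etl_result == elt_result)
```

Both approaches end with the same cleaned `orders` table. The ELT version also keeps the untouched `raw_orders` table in the target, which can be useful if the transformation rules change later.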
Data formats for storage include text files, CSV files, JSON files, and XML files. Data can also be stored in binary formats, such as Parquet and Avro.
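As a quick illustration of how text formats differ, the sketch below writes the same records as CSV and as JSON using only the Python standard library, then reads them back. The sample records are made up. Parquet and Avro are left out because they need third-party libraries.

```python
import csv
import io
import json

# Hypothetical sample records.
records = [
    {"id": 1, "city": "Nairobi", "temp_c": 24.5},
    {"id": 2, "city": "Kampala", "temp_c": 26.0},
]

# Write the records as CSV text.
csv_buffer = io.StringIO()
writer = csv.DictWriter(csv_buffer, fieldnames=["id", "city", "temp_c"])
writer.writeheader()
writer.writerows(records)
csv_text = csv_buffer.getvalue()

# Write the same records as JSON text.
json_text = json.dumps(records)

# Read both back.
csv_roundtrip = list(csv.DictReader(io.StringIO(csv_text)))
json_roundtrip = json.loads(json_text)

print(type(csv_roundtrip[0]["id"]).__name__)   # CSV values come back as strings
print(type(json_roundtrip[0]["id"]).__name__)  # JSON keeps numbers as numbers
```

CSV carries no type information, so every value comes back as a string and has to be converted by the reader. JSON preserves basic types such as numbers and strings.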
Data visualization is the process of creating visual representations of data. These can be used to examine data, find patterns, and make decisions. They are most often used to communicate data to non-technical audiences.
Data engineering dashboards are web-based applications that allow data engineers to monitor the status of their data pipelines. These typically display the status of data pipelines, the number of errors in a pipeline, and the time it took to run a pipeline.
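The metrics mentioned above (status, error count, and run time) have to be collected somewhere before a dashboard can show them. The sketch below is one simple way this might look. The `run_pipeline` function, the step functions, and the shape of the summary dictionary are all hypothetical. Orchestration tools usually record this kind of information for you.

```python
import time


def run_pipeline(name, steps):
    """Run each step and return the metrics a dashboard might display."""
    started = time.perf_counter()
    errors = 0
    for step in steps:
        try:
            step()
        except Exception:
            errors += 1  # count the failure and continue with the next step
    return {
        "pipeline": name,
        "status": "failed" if errors else "succeeded",
        "errors": errors,
        "duration_seconds": time.perf_counter() - started,
    }


def ok_step():
    """A step that completes normally."""
    return "done"


def failing_step():
    """A step that simulates a bad record."""
    raise ValueError("bad record")


run_summary = run_pipeline("daily_sales", [ok_step, failing_step, ok_step])
print(run_summary)
```

A dashboard would then read summaries like `run_summary` from wherever they are stored and display them per pipeline and per run.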
SQL and NoSQL databases are two types of databases that are used to store data.
- SQL (structured query language) databases are relational databases, which means that data is stored in tables and can be queried using SQL.
- NoSQL (not only SQL) databases are non-relational databases, which means that data is stored in a format other than tables and can be queried using a variety of methods.
You would use SQL databases for structured data, such as data from a financial system, while NoSQL databases are best suited for unstructured data, such as data from social media. For semi-structured data, such as data from a weblog, you could use either SQL or NoSQL databases.
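The trade-off can be sketched in a few lines. Below, Python's built-in sqlite3 module represents a SQL database with a fixed schema, and a plain list of dictionaries stands in for a document-style NoSQL collection. The `payments` table, the `posts` list, and all values are hypothetical. A real NoSQL database would have its own client library and query syntax.

```python
import sqlite3

# SQL side: structured financial data with a fixed schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE payments ("
    "id INTEGER PRIMARY KEY, account TEXT NOT NULL, amount REAL NOT NULL)"
)
conn.executemany(
    "INSERT INTO payments (account, amount) VALUES (?, ?)",
    [("A-1", 50.0), ("A-2", 75.0), ("A-1", 25.0)],
)
total_for_a1 = conn.execute(
    "SELECT SUM(amount) FROM payments WHERE account = ?", ("A-1",)
).fetchone()[0]

# The schema is enforced: a column that was never declared is rejected.
try:
    conn.execute(
        "INSERT INTO payments (account, amount, hashtag) VALUES ('A-3', 1.0, '#x')"
    )
    schema_rejected = False
except sqlite3.OperationalError:
    schema_rejected = True

# NoSQL side (stand-in): social media posts whose fields vary per record.
posts = [
    {"user": "amina", "text": "Hello", "hashtags": ["data"]},
    {"user": "brian", "text": "Photo day", "image_url": "https://example.com/p.jpg"},
]
tagged = [p["user"] for p in posts if "data" in p.get("hashtags", [])]

print(total_for_a1, schema_rejected, tagged)
```

The table rejects data that does not match its schema, which suits financial records. The document list accepts records with different fields, which suits varied content like social media posts, but the reader has to handle missing fields.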