DEV Community

WanjohiChristopher


Data Engineering Toolset 2023

Happy new year!
Let's recap what Data Engineering (DE) is and why it is important in this era.

In a nutshell, Data Engineering is the process of collecting, transforming, cleansing, profiling, and aggregating huge datasets.
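To make those steps concrete, here is a minimal sketch in plain Python. The records, field names, and helper functions (`cleanse`, `aggregate`) are made up for illustration; a real pipeline would read from actual sources and most likely use SQL or Spark at scale.

```python
from collections import defaultdict

# Hypothetical raw records, as they might arrive from a source system.
raw_orders = [
    {"customer": " Alice ", "amount": "120.50"},
    {"customer": "bob", "amount": "80"},
    {"customer": "ALICE", "amount": "30.25"},
    {"customer": "bob", "amount": None},  # missing amount
    {"customer": "", "amount": "15"},     # missing customer
]


def cleanse(records):
    """Drop incomplete rows, normalize names, and cast amounts to float."""
    cleaned = []
    for row in records:
        name = (row["customer"] or "").strip().lower()
        if not name or row["amount"] is None:
            continue  # skip rows that cannot be trusted
        cleaned.append({"customer": name, "amount": float(row["amount"])})
    return cleaned


def aggregate(records):
    """Sum the amount per customer."""
    totals = defaultdict(float)
    for row in records:
        totals[row["customer"]] += row["amount"]
    return dict(totals)


clean_orders = cleanse(raw_orders)
totals = aggregate(clean_orders)
print(totals)
```

Two of the five rows are dropped during cleansing, and the two spellings of "Alice" collapse into one customer before aggregation.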

Why do Organizations need Data Engineers?
Organizations are increasingly becoming data-driven, requiring data to inform their decision-making processes. However, the data that analysts and scientists need to perform their jobs is not always readily accessible, leading to time-consuming data gathering processes. In such cases, organizations need Data Engineers to help ensure that the data they use is accurate and reliable, so they can achieve their business objectives. The saying "garbage in, garbage out" highlights the importance of having quality data.

As Data Engineers, our role is to ensure that the data collected from various sources, such as social media platforms, cloud systems, CRMs, and flat files, is readily accessible to the data team. To achieve this, we create a central repository, known as a Data Warehouse, to store this data. This will ensure that the data team has easy access to the information they need to perform their jobs effectively. I will include a snapshot of the Data Warehouse at the end of this article.
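As a rough illustration of pulling several sources into one central store, the sketch below loads a flat file and some CRM-style records into a single table. It uses an in-memory SQLite database as a stand-in for a real Data Warehouse; the table name, columns, sample data, and the `load` helper are all assumptions made for this example.

```python
import csv
import io
import sqlite3

# Hypothetical flat-file export, inlined so the sketch is self-contained.
flat_file = io.StringIO(
    "email,country\n"
    "ann@example.com,KE\n"
    "joe@example.com,US\n"
)

# Hypothetical records pulled from a CRM.
crm_records = [{"email": "sue@example.com", "country": "UK"}]

# In-memory SQLite stands in for the warehouse here.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (email TEXT PRIMARY KEY, country TEXT, source TEXT)"
)


def load(rows, source):
    """Insert rows into the central table, tagging where they came from."""
    conn.executemany(
        "INSERT INTO customers (email, country, source) VALUES (?, ?, ?)",
        [(r["email"], r["country"], source) for r in rows],
    )


load(csv.DictReader(flat_file), "flat_file")
load(crm_records, "crm")
conn.commit()

rows_by_source = conn.execute(
    "SELECT source, COUNT(*) FROM customers GROUP BY source ORDER BY source"
).fetchall()
print(rows_by_source)
```

Once everything sits in one place, the data team can query it with plain SQL regardless of where each row originated.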

Toolset
Most important of all is SQL.
Then:

  • Python
  • Scala
  • Apache Spark
  • RDBMS (e.g. PostgreSQL, MSSQL)
  • Apache Kafka
  • AWS services (Amazon Redshift: cloud data warehouse; Athena: SQL query engine; S3: data lake or storage; AWS Glue: ETL tool)
  • Apache Airflow, cron jobs
  • Hadoop, MongoDB or Cassandra
  • Power BI/Tableau/Metabase/Amazon QuickSight for analytics

Start with SQL as it cuts across all technologies.

Earlier, we discussed the importance of having a central repository, known as a Data Warehouse (DW), to store data. The DW uses a multidimensional data model.

One of the key concepts in a DW is data modeling, which involves creating a visual representation of the entire information system or its parts, to show the connections between data points and structures. A DW is composed of fact tables and dimension tables.
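A small sketch may help show how fact and dimension tables relate. The table names (`fact_sales`, `dim_product`, `dim_date`), columns, and sample rows below are invented for illustration, and SQLite is used only because it ships with Python; a production warehouse would use something like Redshift.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables hold descriptive attributes.
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);

-- The fact table holds measures plus keys pointing at the dimensions.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id    INTEGER REFERENCES dim_date(date_id),
    quantity   INTEGER,
    revenue    REAL
);

INSERT INTO dim_product VALUES (1, 'Laptop'), (2, 'Phone');
INSERT INTO dim_date VALUES (20230101, 2023, 1), (20230201, 2023, 2);
INSERT INTO fact_sales VALUES
    (1, 20230101, 2, 2000.0),
    (2, 20230101, 5, 2500.0),
    (1, 20230201, 1, 1000.0);
""")

# Analytical queries join the fact table to a dimension and aggregate a measure.
revenue_by_product = conn.execute("""
    SELECT p.name, SUM(f.revenue)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    GROUP BY p.name
    ORDER BY p.name
""").fetchall()
print(revenue_by_product)
```

The measures (`quantity`, `revenue`) live in the fact table, while the dimensions supply the labels used for grouping and filtering.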

[Figure: conceptual modelling of a DW.]

Lastly, Data Engineers need to understand these two schemas:
1. Star schema - a fact table in the middle connected to a set of dimension tables.
2. Snowflake schema - a refinement of the star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake.
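The difference between the two schemas can be sketched with one product dimension modelled both ways. All table and column names below are hypothetical, and SQLite again stands in for a real warehouse. In the star version the category is a plain column on the dimension; in the snowflake version it is normalized into its own `dim_category` table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact_sales (product_id INTEGER, revenue REAL);

-- Star: the category is stored directly on the product dimension.
CREATE TABLE star_dim_product (
    product_id INTEGER PRIMARY KEY, name TEXT, category TEXT
);

-- Snowflake: the category hierarchy is normalized into its own table.
CREATE TABLE dim_category (category_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE snow_dim_product (
    product_id  INTEGER PRIMARY KEY,
    name        TEXT,
    category_id INTEGER REFERENCES dim_category(category_id)
);

INSERT INTO fact_sales VALUES (1, 1000.0), (2, 500.0), (3, 300.0);
INSERT INTO star_dim_product VALUES
    (1, 'Laptop', 'Computers'), (2, 'Phone', 'Mobile'), (3, 'Tablet', 'Mobile');
INSERT INTO dim_category VALUES (10, 'Computers'), (20, 'Mobile');
INSERT INTO snow_dim_product VALUES
    (1, 'Laptop', 10), (2, 'Phone', 20), (3, 'Tablet', 20);
""")

# Star schema: one join from the fact table to the dimension.
star_result = conn.execute("""
    SELECT p.category, SUM(f.revenue)
    FROM fact_sales AS f
    JOIN star_dim_product AS p ON p.product_id = f.product_id
    GROUP BY p.category
    ORDER BY p.category
""").fetchall()

# Snowflake schema: an extra join to reach the normalized category table.
snowflake_result = conn.execute("""
    SELECT c.category, SUM(f.revenue)
    FROM fact_sales AS f
    JOIN snow_dim_product AS p ON p.product_id = f.product_id
    JOIN dim_category AS c ON c.category_id = p.category_id
    GROUP BY c.category
    ORDER BY c.category
""").fetchall()

print(star_result)
print(snowflake_result)
```

Both queries return the same totals. The star version needs fewer joins, while the snowflake version avoids repeating the category name on every product row.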

Have a fantastic day!
