In this article, we will talk about data engineers: what they manage, what they develop, and the other activities they perform. We will also explore what a BI architecture is and the role data engineers play in the vast (and, for many people, still relatively new) world of data.
We will also discuss the skills typically possessed by data engineers. If you're unsure about the positions we'll be referring to, or about how a BI team is usually structured, I recommend reading my other article, linked below, which explains the differences in skills and activities across these positions.
Data roles in data teams and your skill set. Using math.
Data Engineer
A data engineer is responsible for designing, building, and managing the data infrastructure that handles the processing, transformation, and storage of large volumes of data from various sources. There are different "types" of data engineers: those working with streaming data and messaging systems, big data engineers dealing with distributed systems, and many others. In this article, I will focus on those who work in BI teams.
A data engineer (in BI teams) is the one who equips BI analysts, data analysts, data scientists, ML models, data products, managers, and the entire company with reliable data. They achieve this by using tools for large-scale data processing and by creating, managing, and monitoring routines. They develop tooling such as APIs and applications that abstract away user activities, making it easier for every department to leverage the data while keeping processes transparent and auditable. They apply techniques like data modeling and normalization, and they use current tools to develop scalable, performant data warehouses, data lakes, and data lakehouses that consume minimal storage and processing resources.
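To make the modeling side concrete, here is a minimal sketch of a star schema defined with SQLAlchemy Core. All table and column names (`dim_customer`, `fact_sales`, and so on) are illustrative assumptions, not a model from any specific company:

```python
# A minimal star-schema sketch with SQLAlchemy Core; names are hypothetical.
from sqlalchemy import (
    Column, Date, ForeignKey, Integer, MetaData, Numeric, String, Table,
)

metadata = MetaData()

# Dimension: one row per customer, holding descriptive attributes.
dim_customer = Table(
    "dim_customer", metadata,
    Column("customer_key", Integer, primary_key=True),
    Column("name", String(120)),
    Column("segment", String(40)),
)

# Dimension: one row per calendar date.
dim_date = Table(
    "dim_date", metadata,
    Column("date_key", Integer, primary_key=True),
    Column("full_date", Date),
    Column("year", Integer),
    Column("month", Integer),
)

# Fact: one narrow row per sale, with foreign keys to the dimensions
# and numeric measures to aggregate over.
fact_sales = Table(
    "fact_sales", metadata,
    Column("sale_id", Integer, primary_key=True),
    Column("customer_key", ForeignKey("dim_customer.customer_key")),
    Column("date_key", ForeignKey("dim_date.date_key")),
    Column("quantity", Integer),
    Column("gross_revenue", Numeric(12, 2)),
)

# metadata.create_all(engine) would emit the CREATE TABLE statements
# against, for example, a PostgreSQL engine.
```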
Dated Processes
In the vast majority of companies, data analysis is carried out in Excel and Google Sheets. Typically it is repetitive work that consumes time which could be spent on other tasks. It also has several weaknesses, such as limited options for visualizing large volumes of data with charts, which makes it difficult for people to grasp the magnitude of the data and make informed decisions quickly. And since manual spreadsheet work is prone to human error, using it as the primary method of data analysis is a significant disadvantage.
I'm not against companies using Excel for their analysis; I'm against companies that have valuable data, which could be used as a growth pillar, but still treat it as a mere consequence of events.
When do I know I need a data engineer?
Speaking from my own experience, it became evident that we needed one or more data engineers when the BI team's workloads (metric-tracking routines) started competing with production systems for processing power, causing resource consumption to grow rapidly. The routines we created began to affect overall sales processing, degrading the user experience on our e-commerce platform. On top of that, we faced the daunting task of managing costs for machines, processing, and storage that had been incurred without proper planning.
To address these issues, we decided to separate the production environment from the analytics environment almost entirely. We improved storage by adopting data lakehouse principles and compressing files, which significantly reduced our space requirements. By switching from full reprocessing of all data to incremental updates (sketched below), we eliminated processing bottlenecks and improved delivery speed for the analytics and data science teams. With many abstractions in place, the processes became transparent, and most team members understood how KPIs were calculated. This transparency encouraged the company to become even more data-driven.
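As a rough illustration of the incremental approach, here is a minimal PySpark sketch that reads only the rows changed since the last run. The table names, the `updated_at` column, and the hardcoded watermark are all illustrative assumptions:

```python
# A minimal sketch of watermark-based incremental extraction with PySpark.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("incremental_orders").getOrCreate()

# Last timestamp successfully processed by the previous run. Hardcoded
# here for brevity; in practice it would be read from a control table.
last_watermark = "2023-01-01 00:00:00"

# Read only the rows that changed since the last run, instead of
# reprocessing the full production table.
changed_rows = (
    spark.read.table("production.orders")
    .where(f"updated_at > timestamp'{last_watermark}'")
)

# Append the delta to the analytics layer.
changed_rows.write.mode("append").saveAsTable("analytics.orders")
```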
In general, you will want a data engineer when your operation starts to grow and a data-driven culture is well established, or when you want to commercialize your data, build data products from it, or simply need better performance and cost-effectiveness. A data engineer can address many of these challenges and provide guidance on optimizing resource utilization for data processing, handling, storage, and management, helping you make the most of the data your company possesses.
In an ideal world, the data engineer is the pioneer of the entire data movement, but in reality this role is still relatively new in the market, so that is not always the case.
Architecture Planning
As companies, teams, and operations grow, it's natural for these outdated processes to fall behind, and a BI team starts to be structured (or at least it should be). At this stage we enter the realm of data engineering, starting with a study of the architecture to be used: Will we implement a data warehouse, or perhaps a data lakehouse? Will we use a cloud or an on-premise solution? What does our budget allow us to build? Which tool will we use for daily KPI monitoring: Power BI, Tableau, or something else? These questions, among others, are answered in collaboration with other departments, taking into account the company's current state, its historical and cultural context, and the skills of the people directly involved. These are some of the "obvious" variables that must be considered when planning a data engineering center of excellence.
Development and Tooling
Once these pertinent questions are answered, the subsequent stages should be solid, metric-driven, well-documented, and well-architected to ensure reliable ETL/ELT processes. Pipeline development mainly involves moving information from one system to another: you can perform an ETL (extract-transform-load) directly into a data warehouse on your RDS instance, or an ELT (extract-load-transform) process on your data lake. You may be aggregating data from your production database or consuming an FTP feed from a partner company to enrich your own data. Developing APIs for other services to consume your transformed data, or for data scientists to access it, is also quite common and not exclusive to back-end developers. The choice of tools can be made during development, but it generally follows common practice:

- For pipeline orchestration, a strong candidate is Airflow, a Python framework for authoring and managing routines (a minimal DAG is sketched below).
- For distributed processing, you have Spark and its Python API, PySpark.
- For an on-premise data lake, you can use MinIO.
- For the data warehouse, PostgreSQL with star-schema modeling is a common choice; if you scale up to many fact tables and numerous dimensions, making a pure star schema impractical, you can opt for a snowflake schema.
- For scraping data to enrich your datasets, you can use low-code software like IBM RPA, or, if you prefer to stay in Python, Scrapy, an excellent web-crawling framework.
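For orchestration, a daily routine in Airflow can be as simple as the sketch below. The DAG id, task names, and task bodies are placeholders, and the exact parameters vary between Airflow versions:

```python
# A minimal Airflow DAG sketch: a daily extract step followed by a
# transform step. All names are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    # Placeholder: pull raw files from source systems into the data lake.
    print("extracting")


def transform():
    # Placeholder: clean and aggregate raw data into warehouse tables.
    print("transforming")


with DAG(
    dag_id="daily_sales_elt",          # hypothetical name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",        # run once a day
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)

    # Extraction must finish before the transformation starts.
    extract_task >> transform_task
```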
I will write more articles in the future about MinIO and data lakes, Airflow and task orchestration, and distributed processing with Spark.
Metadata
When developing data systems, it is essential to have documentation, not just for the code but also for metadata, which is data about data. A pipeline that consumes information from a daily API performs various transformations before storing it alongside data from other systems. But what transformations does it perform, and why? In BI teams, datasets are often prepared daily with numerous transformations, aggregations, and abstractions. How are these aggregations done? How is the `gross_revenue` column calculated? Why do many columns from the production table not appear in this dataset? These are common questions that analysts and data scientists will ask, highlighting the need for a robust knowledge base with this metadata.
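One lightweight way to keep such metadata close to the code is a machine-readable document per dataset. The structure below is a hypothetical sketch, not a standard format; the dataset, column, and source names are assumptions:

```python
# Hypothetical, machine-readable metadata for a daily dataset: where it
# comes from, how derived columns are computed, and what was excluded.
DAILY_SALES_METADATA = {
    "dataset": "daily_sales_summary",
    "refresh": "daily, after the 06:00 UTC pipeline run",
    "sources": ["production.orders", "dim_customer"],
    "columns": {
        "gross_revenue": {
            "description": "Revenue before discounts and refunds",
            "formula": "SUM(quantity * unit_price)",
            "owner": "bi-team",
        },
    },
    # Production columns intentionally absent from the dataset, with reasons.
    "excluded_columns": {
        "customer_document": "dropped for privacy compliance",
        "internal_notes": "free text with no analytical value",
    },
}

if __name__ == "__main__":
    import json
    print(json.dumps(DAILY_SALES_METADATA, indent=2))
```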
Data Management
Data management is one of the tasks data engineers handle, and it shares similarities with the work of DBAs. Applying privacy guidelines to your data, granting appropriate access to the right users, and managing all of this continuously is labor-intensive, even though many DBMSs and S3-style storage services come with integrated permission controls. It is also necessary to build robust logging and metrics systems that monitor the daily health of the data and the pipelines, reporting on routines that ran with errors, ran incompletely, or hit any other kind of inconsistency. The reliability of the data and its margin of error need to be measured and relentlessly communicated: reliability is usually a subject of discussion tied to the company's external data sources, while the margin of error comes from rounding and from updates made in the production environment, which directly affect OLAP systems.
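As a sketch of the monitoring side, a routine health check can log row counts and error counts after each load and flag runs that look incomplete. The thresholds, names, and example numbers below are illustrative assumptions:

```python
# A minimal pipeline health-check sketch using the standard logging module.
import logging

logging.basicConfig(format="%(asctime)s %(levelname)s %(message)s",
                    level=logging.INFO)
logger = logging.getLogger("pipeline_health")


def check_daily_load(actual_rows: int, expected_min_rows: int,
                     error_count: int) -> bool:
    """Flag loads that finished with errors or look incomplete."""
    if error_count > 0:
        logger.error("load finished with %d row-level errors", error_count)
        return False
    if actual_rows < expected_min_rows:
        logger.warning("load looks incomplete: %d rows, expected at least %d",
                       actual_rows, expected_min_rows)
        return False
    logger.info("load healthy: %d rows, no errors", actual_rows)
    return True


# Hypothetical example: yesterday's load wrote fewer rows than expected.
check_daily_load(actual_rows=9_500, expected_min_rows=10_000, error_count=0)
```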
Conclusion
In conclusion, we have explored what a data engineer does, how they can begin their journey, and when their efforts are needed. It is important to note that these insights are based on my experiences working across the three main data fronts. If you have any remaining questions about the positions within a data team or if you would like to learn more about the skill set required for a data engineer, I encourage you to read my article on the composition of a BI team and the skills typically sought after in this role.
Thank you very much for reading