This was by far the hardest part of the data engineering course given by DataTalksClub. The introduction was relatively easy to understand, though, and I supplemented it with questions and answers from ChatGPT.
What does an analytics engineer do?
The data engineer prepares and maintains the infrastructure that the data team needs. The data analyst uses this data to answer questions and solve problems. However, data analysts and data scientists aren't meant to be software engineers; it's not part of their training and not their first priority. Similarly, data engineers are great software engineers, but they aren't trained in how the data is going to be used by the business users. This is the gap that the analytics engineer fills. The analytics engineer uses software to present data to the data analyst for business use. They use data loading and data storing tools, like a data engineer, as well as data modelling and data presentation tools. DBT is a data modelling tool.
Data modelling concepts
ETL vs. ELT.
ETL takes longer to implement, but it gives us more stable and compliant data. The data is transformed first and then loaded into the data warehouse.
ELT is faster and more flexible, with lower cost and lower maintenance. With ELT, the data is transformed after it has been loaded into the data warehouse.
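As a rough illustration of the ELT pattern, the transformation step can be just a SQL statement run inside the warehouse on data that has already been loaded. The table and column names below are made up for the sake of the example:

```sql
-- ELT: the raw data already lives in the warehouse, so the transform
-- is simply a SQL statement executed there (names are illustrative).
CREATE TABLE cleaned_trips AS
SELECT
    trip_id,
    CAST(pickup_datetime AS TIMESTAMP) AS pickup_datetime,
    LOWER(payment_type)                AS payment_type,
    total_amount
FROM raw_trips
WHERE total_amount > 0;  -- drop obviously bad records
```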
Kimball's dimensional modeling
Kimball's approach emphasizes business process modeling and building data warehouses that are optimized for querying and analysis. Kimball advocates dimensional models such as the star schema, which consists of a central fact table surrounded by denormalized dimension tables (the snowflake schema is a variant in which the dimension tables are further normalized). Kimball encourages the use of conformed dimensions, which are dimensions that have consistent definitions and attributes across multiple data marts, facilitating data consistency and integration. Kimball's methodology often employs a bottom-up approach, where data marts are built first to address specific business requirements and then integrated into a larger data warehouse environment.
Bill Inmon's Corporate Information Factory
Inmon's approach emphasizes building a centralized enterprise data warehouse (EDW) that serves as the single source of truth for an organization's data. Inmon advocates a layered architecture, where data is extracted from source systems into a staging area, transformed and integrated in the data warehouse, and then accessed by end users through data marts or an access layer. Inmon promotes the use of a normalized data model for the data warehouse, which reduces redundancy and improves data integrity but can be more complex to query and analyze than dimensional models. Inmon's methodology often employs a top-down approach, where the enterprise data warehouse is built first as a comprehensive repository of all organizational data, and data marts are derived from the EDW to serve specific business needs.
Kimball's approach typically focuses on building data marts optimized for specific business areas, while Inmon's approach emphasizes building a centralized data warehouse. Kimball advocates for dimensional modeling, which is optimized for querying and analysis, while Inmon's approach often involves normalized data models for better data integrity. Kimball's approach offers flexibility and agility in building data marts to address evolving business needs, while Inmon's approach prioritizes data consistency and integration through a centralized data warehouse.
Elements of dimensional modelling
Fact tables contain measurements, metrics, or facts that correspond to a business process. We can think of them as verbs, for example sales or orders. Dimension tables provide context to the fact tables; they correspond to a business entity. We can think of them as nouns, like customer or product. A fact table surrounded by its dimension tables is often called a star schema.
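As a minimal sketch of what this could look like for a sales process (the table and column names here are hypothetical, not from the course data):

```sql
-- Dimension table: a business entity (a "noun").
CREATE TABLE dim_customer (
    customer_key  INT PRIMARY KEY,
    customer_name VARCHAR(100),
    city          VARCHAR(50)
);

-- Fact table: measurements of a business process (a "verb").
-- Each row points to the dimensions that give it context.
CREATE TABLE fact_sales (
    sale_id      INT PRIMARY KEY,
    customer_key INT REFERENCES dim_customer (customer_key),
    sale_date    DATE,
    amount       NUMERIC(10, 2)
);
```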
We can also look at this through a restaurant analogy. The Staging area contains the raw data. It's not meant to be exposed to everyone, only to the people who know how to work with raw data. This can be compared to the raw ingredients.
The Processing area takes the raw data and turns it into a data model. The focus is on efficiency and ensuring standards. This can be compared to the kitchen, where the food is prepared, and it's limited to the cooks who know how to do this.
The Presentation area is where the final presentation of the data occurs. This is where the data is exposed to the business stakeholders. It can be compared to the dining room, where the food is served. We'll be following these steps during this week of the course.
What is DBT?
DBT is a transformation workflow that allows anyone who knows SQL to deploy analytics code that follows software engineering best practices. We will be working with taxi data. If we were working in a business, we would be taking data from various sources - backend systems, frontend sources from customer interactions, and so on - and loading it into a data warehouse. We'll be working in BigQuery, but we could work in another data warehouse, like Amazon Redshift or Snowflake.
DBT will sit on top of our data warehouse, and it will help us transform the data into something that's useful to expose to the business user through a BI (business intelligence) tool or another application such as a machine learning workflow. We have raw data, and dbt will help us transform that data.
DBT will also introduce software engineering best practices into our analytics code. It will introduce layers. We'll develop SQL code in something like a sandbox, where each developer has their own separate environment. It introduces a testing and documentation framework as part of our development workflow. Then we deploy to production, using CI and CD (continuous integration and continuous deployment). It also introduces other concepts like version control, modularity, and "don't repeat yourself". All of this helps us create high-quality, highly available pipelines to deliver our data to our stakeholders.
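As a small illustration of the testing framework, dbt supports "singular tests": a .sql file containing a SELECT that returns the rows violating an assumption, and the test fails if any rows come back. The file and model names below are hypothetical:

```sql
-- tests/assert_positive_total_amount.sql (hypothetical singular test)
-- dbt runs this query; if it returns any rows, the test fails.
SELECT *
FROM {{ ref('stg_trips') }}   -- 'stg_trips' is a made-up model name
WHERE total_amount <= 0
```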
How does DBT work?
We create a .sql file. DBT takes away a lot of the complexity around how and where the resulting tables get created: it generates the DDL and DML for us. DDL and DML are SQL categories of commands. DDL is "data definition language" and includes commands like CREATE, ALTER, and DROP, which create the infrastructure. DML is "data manipulation language" and includes commands like SELECT, INSERT, UPDATE, and DELETE, which query and use the infrastructure. Whenever we do a dbt run, it compiles all that code and then persists the result back into a table in the data warehouse.
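As a rough sketch, a dbt model is just a SELECT statement in a .sql file, plus a bit of Jinja; the model and source names below are hypothetical. When we run dbt run, dbt compiles this into the DDL/DML (for example, a CREATE TABLE ... AS SELECT ...) that the warehouse executes:

```sql
-- models/stg_trips.sql (hypothetical model name)
-- We only write the SELECT; dbt wraps it in the DDL/DML needed to
-- materialize the result as a table (or view) in the warehouse.
{{ config(materialized='table') }}

SELECT
    trip_id,
    pickup_datetime,
    total_amount
FROM {{ source('staging', 'raw_trips') }}  -- source defined in a .yml file
WHERE total_amount > 0
```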
There are two ways to use DBT. One is DBT Core, a free, open-source project that handles the data transformation. It builds and runs a DBT project using .sql and .yml files, and it includes the SQL compilation logic and a CLI to run DBT commands locally.
There is also DBT Cloud, a SaaS application that runs DBT Core. It's used to develop and manage DBT projects. It has a web-based IDE and a cloud CLI that can be used to manage environments, plus orchestration, logging, alerting, and more.
For this course, we can use DBT in two ways. We can either use the cloud IDE with BigQuery and not install anything locally, or we can use a local IDE such as VSCode, install DBT Core locally, and connect it to a Postgres database. In this class, we'll take the taxi data that we already have loaded in Google Cloud Storage, load it into BigQuery, use DBT to transform it, and then load it into an analytics tool to expose the data to the user.