What is DataOps and how to make it real with Dataform?

The software development world has advanced CI/CD tools that enable efficient software delivery. The data world is catching up by adopting the same CI/CD practices to build successful data platforms. With automation and cloud technologies, data teams can automate their data processes and ensure reliability and scalability.

This is where DataOps starts. DataOps is a set of practices that improves quality, speed, and collaboration and promotes a culture of continuous improvement among people working on data analytics.

By converging software development and data practices, organizations maximize the potential of their data assets for informed decision-making. The first part of this article relates each step of a DevOps CI/CD pipeline to its counterpart in a DataOps CI/CD pipeline. The second part shows how Dataform can help with each of these steps.

What are the expectations of a DataOps CI/CD?

As for software development projects, a data project needs a CI/CD chain that ensures modifications to the code base are automatically tested and deployed. The code base here is the logic, written in SQL, that transforms raw data into curated, consumable data.
The software development world can help us identify the key features of a successful data CI/CD.

1. Compilation

In the software development world, compilation is used to build an executable package and ensure there are no syntax errors in the language used. There can still be runtime errors, but at least the code follows the language rules. For an interpreted language, the analogous step is the parsing performed by a linter.

Similarly, in the context of a data project, this step focuses on validating the written SQL transformations. In the case of a sequence of transformations, it becomes crucial to verify that the resulting directed acyclic graph (DAG) generated by these transformations is valid. This verification can be conducted during the compilation step, as well as with the aid of a linter integrated into your preferred code editor.
By employing a compilation step and leveraging a linter, the following benefits can be obtained:

  1. Language compliance: The compilation step ensures that the SQL transformations conform to the syntax and rules specified by the database management system or query language.
  2. Syntax validation: The linter performs static analysis on the SQL code, highlighting potential syntax errors, typos, or incorrect usage of language constructs. It helps identify issues early in the development process, promoting clean and error-free code.
  3. Structural integrity: In the case of a chain of transformations, verifying the resulting DAG's validity is crucial. By confirming that the graph is acyclic, developers can ensure that data flows correctly through the transformations without encountering circular dependencies or infinite loops.
  4. IDE integration: IDEs equipped with linters offer real-time feedback and suggestions while writing SQL code, enabling developers to spot and rectify errors immediately. This streamlines the development workflow and enhances code quality.

By combining compilation checks and linters, developers can improve the reliability and correctness of their SQL transformations in data projects. This proactive approach helps catch errors early, promotes adherence to language rules, and ensures the integrity of the transformation process within the overall project.
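To make this concrete, here is a minimal, tool-agnostic sketch of a two-step transformation chain; the table and column names are invented for the illustration. The second view reads from the first, so the dependency graph is a simple two-node DAG that a compilation step or a linter can validate before anything is run.

-- Step 1: clean the raw events
CREATE OR REPLACE VIEW curated.clean_events AS
SELECT
  event_id,
  LOWER(country_code) AS country_code,
  event_timestamp
FROM raw.events
WHERE event_id IS NOT NULL;

-- Step 2: aggregate on top of step 1; this reference is the single edge of the DAG
CREATE OR REPLACE VIEW curated.daily_events AS
SELECT
  DATE(event_timestamp) AS event_date,
  country_code,
  COUNT(*) AS nb_events
FROM curated.clean_events
GROUP BY DATE(event_timestamp), country_code;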

2. Unit Testing

Unit testing in computer science is a software testing technique used to verify the correctness and functionality of individual units or components of a software application. A unit refers to the smallest testable part of a program, such as a function or a method. The purpose of unit testing is to test each unit in isolation, ensuring that it behaves as expected and meets the specified requirements.
In a data project, unit testing involves verifying the accuracy and functionality of a SQL transformation that generates outputs based on input data. It is beneficial to utilize mock data during unit testing to ensure consistent and expected results are obtained with each test run.
During unit testing of a SQL transformation, mock input data can be employed to simulate various scenarios and validate the behavior of the transformation. By providing predetermined input data, developers can assess whether the transformation produces the anticipated output(s) as defined by the test cases.
This ensures:

  1. Early bug detection: bugs in the SQL query are caught early if the logic is changed in an unexpected way.
  2. Code maintainability: unit tests are a way to document code and provide insights into the logic implemented.
  3. Modularity: unit testing encourages modular design by emphasizing individual units. You might write simpler and more modular SQL queries if you write unit tests for them.
  4. Faster feedback loop: unit tests are easily executable and fast, providing immediate feedback on the correctness of a unit.

3. Deployment

In software development once the tests pass and the code analysis is satisfactory, the deployment step is triggered. The artifacts are deployed to the target environment, which can include development, staging, or production environments, depending on the deployment strategy.
In a data project, deployment can be seen as the step where the SQL queries are run on the target environment. The compiled directed acyclic graph is actually run on the environment to produce the data objects defined.
The deployment step in a data project plays a pivotal role in transforming raw data into meaningful and usable insights. By executing SQL queries on the target environment and running the compiled DAG, data objects are generated, paving the way for further analysis, reporting, or utilization within the project.

4. Integration test

Integration testing, in the context of software development, is a testing technique that focuses on verifying the interactions and cooperation between different components or modules of a system. Integration tests aim to identify issues that may arise when multiple components are integrated and working together as a whole.
Integration tests play a crucial role in ensuring the accuracy and reliability of the data project by validating the integrity and consistency of the transformed data. By examining the output tables resulting from the executed DAG, these tests assess whether the data conforms to the expected format, structure, and content.
The key aspects to consider when conducting integration tests in a data project are:

  1. Real data evaluation: Integration tests involve analyzing the transformed data with real-world characteristics.
  2. DAG transformation verification: The integration tests focus on verifying the proper execution of the DAG of transformations. By evaluating the resulting tables, developers can identify any unexpected or incorrect data manipulation that may occur during the transformation process.
  3. Anomaly detection: Integration tests aim to uncover any anomalies or inconsistencies in the transformed data. This includes detecting missing data, data corruption, data loss, or any deviations from the expected outcomes.
  4. Validation against requirements: Integration tests assess whether the transformed data aligns with the specified requirements and business rules. This ensures that the data project delivers the expected results and meets the defined criteria for success.

To effectively perform integration tests in a data project, it is crucial to develop comprehensive test cases that cover various scenarios, including edge cases and critical data paths. These tests should be automated to enable frequent execution and ensure consistency in the evaluation process.
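As an illustration, the query below is the kind of automated data quality check an integration test might run on an output table of the DAG; the table and column names are hypothetical. The query returns the offending rows, so an empty result means the check passes.

-- Fails (returns rows) if an order references a customer that does not exist
SELECT o.order_id
FROM curated.orders AS o
LEFT JOIN curated.customers AS c
  ON o.customer_id = c.customer_id
WHERE c.customer_id IS NULL;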

| Step | Software | Data |
| --- | --- | --- |
| Compilation | Syntax and typing validation, artefact generation | DAG generation and validation, SQL syntax validation |
| Unit Tests | Code functionality testing | Query logic testing |
| Deployment | Artefact deployment | DAG execution |
| Integration Test | System testing, functional testing | Data testing |

How to make DataOps real with Dataform?

To meet the industrial requirements of DataOps, a continuous integration and continuous delivery chain must be established, building on the cool features of Dataform. Dataform is a service for data analysts to develop, test, version control, and schedule complex SQL workflows for data transformation in BigQuery.

Let's explore how DataOps can be achieved with Dataform, based on the DataOps expectations described in the previous section.

1. Compile the changes

You can compile your project using the Dataform CLI, the API, or the web user interface. The compilation process checks that the project is valid by verifying the following points:

  • Do the dependencies used exist?
  • Are the config blocks correct?
  • Is the templating used valid?

The output of the compilation is a DAG of the actions to be run, corresponding to the tables to be created and the operations to be performed on your data warehouse. This output can be displayed as JSON through the CLI.
Below is a truncated output example defining a source table and a downstream one.

{
    "tables": [
        {
            "type": "table",
            "target": {
                "schema": "demo",
                "name": "source_table",
                "database": "demo_db"
            },
            "query": "SELECT 1 AS demo",
            "disabled": false,
            "fileName": "definitions/demo/demo_source.sqlx",
            "dependencyTargets": [],
            "enumType": "TABLE"
        },
        {
            "type": "table",
            "target": {
                "schema": "demo",
                "name": "downstream",
                "database": "demo_db"
            },
            "query": "SELECT * FROM source_table",
            "disabled": false,
            "fileName": "definitions/demo/downstream.sqlx",
            "dependencyTargets": [
                {
                    "schema": "demo",
                    "name": "source_table",
                    "database": "demo_db"
                }
            ],
            "enumType": "TABLE"
        }
    ],
    "projectConfig": {...},
    "graphErrors": {...},
    "declarations": {...},
    "targets": {...}
}

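For reference, here is a minimal sketch of what the two SQLX files behind this output could look like. The file paths come from the JSON above; the exact content of the config blocks is an assumption made for the illustration.

definitions/demo/demo_source.sqlx:

config {
  type: "table",
  schema: "demo",
  name: "source_table"
}

SELECT 1 AS demo

definitions/demo/downstream.sqlx:

config {
  type: "table",
  schema: "demo"
}

SELECT * FROM ${ref("source_table")}

The ${ref("source_table")} call is what makes Dataform record source_table as a dependency of downstream in the compiled DAG.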

2. Run unit tests

Using the test feature of Dataform, unit tests can be run against queries to validate the SQL code logic. For a given select statement, you can mock the from clause with fake input data and define the expected output of the query for that fake data. If the query logic changes in a way that the query run on the mocked input no longer matches the mocked output, an error is raised.
Unit tests give you confidence that your code produces the output data you expect. If a test fails, an error is thrown and you can prevent the changes from being merged into your code base.
Below is an example of a test for the downstream dataset defined in the previous example.

config {
  type: "test",
  dataset: "downstream"
}

input "ages" {
  SELECT 1 AS demo
}

SELECT 1 AS demo

3. Run the changes

You can run your project using the Dataform CLI, the API, or the web user interface. Running a project means executing the DAG of actions generated at the compilation step. The actions can be the creation of tables and views, as well as feeding tables with the logic implemented.
You can also run a single action, or a list of actions that match specific tags. This is very useful to schedule different parts of the project at different frequencies.
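As a sketch, an action only needs a tags entry in its config block to be selectable this way; the tag name below is arbitrary and the query reuses the earlier example.

config {
  type: "table",
  tags: ["daily"]
}

SELECT * FROM ${ref("source_table")}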

4. Run the integration tests

The Dataform assertions can be used to run integration tests once your project has been run. With assertions you can validate the content of the data and throw an error in case the data quality does not match your expectations.
An assertion is a SQL query that should not return any row. You can see it as a query that looks for errors in the data. For instance, the following query defines an assertion that checks that the id column does not contain null values.

SELECT id FROM ${ref("table_1")} WHERE id IS NULL

It also has table_1 as a dependency, so once table_1 is built, the assertion runs and checks for errors in the data. If errors are found, the assertion fails and raises an error. This way you can ensure the data quality of your data platform.
Assertions can also be manually configured as dependencies of other actions in the DAG, so that you can interrupt the project run in case assertions are not fulfilled.
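Concretely, such a manual assertion lives in its own SQLX file with an assertion config block. A minimal sketch built on the query above could look like this.

config {
  type: "assertion"
}

SELECT id FROM ${ref("table_1")} WHERE id IS NULL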
Below is an example of adding simple assertions to the previously defined downstream table.

config {
  type: "table",
  assertions: {
    uniqueKey: ["demo"],
    nonNull: ["demo"],
    rowConditions: [
      'demo is null or demo > 0'
    ]
  }
}

SELECT 1 AS demo

Conclusion

All of the features proposed by Dataform integrate well with a CI/CD tool chain as typical steps can be performed to validate and deploy changes to the SQL code base. This brings data engineering to a level of industrialisation on par with the software development world.

Look out for my next article, which will walk through a concrete industrialisation example with Dataform!

Thanks for reading! I'm Fabien, data engineer at Stack Labs.
If you want to discover the Stack Labs Data Platform or join an enthusiastic Data Engineering team, please contact us.
