Samuel Earl

Data Pipelines with Great Expectations | Step 3: Create Expectations

An Expectation is essentially a rule that defines how your data should be validated. An Expectation Suite is a collection of Expectations.

To create our first Expectation Suite, let’s go to our terminal. Stop the previous Jupyter Notebook by pressing Ctrl+C and type the following from inside the gx-getting-started directory:

great_expectations suite new

This will bring up the following prompt:

How would you like to create your Expectation Suite?
    1. Manually, without interacting with a sample Batch of data (default)
    2. Interactively, with a sample Batch of data
    3. Automatically, using a Data Assistant
: 3

You can create your expectations in one of three ways, as outlined in the above prompt. We are going to select option 3 and press Enter, which uses a Data Assistant to automatically generate an Expectation Suite based on the profile of our data.


What is a data profile?

If you were to write a profile of a person, then you would write down a description of that person and their characteristics. A data profile is similar. It is a description of the data and its characteristics (e.g. data types, value ranges, mean, median, mode).
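To make the idea concrete, here is a minimal sketch of a hand-rolled profile for a single column, using only Python's standard library. It is not how the GX Data Assistant builds its profile; the profile_column helper and the sample passenger counts are made up for illustration.

```python
from statistics import mean, median, mode


def profile_column(values):
    """Describe a column: types seen, null count, range, and central tendency."""
    present = [v for v in values if v is not None]
    return {
        "types": sorted({type(v).__name__ for v in present}),
        "null_count": len(values) - len(present),
        "min": min(present),
        "max": max(present),
        "mean": mean(present),
        "median": median(present),
        "mode": mode(present),
    }


# Hypothetical passenger counts, not taken from the tutorial's CSV files.
sample_passenger_counts = [1, 2, 1, 1, 3, 6, 1, 2, None, 5]

profile = profile_column(sample_passenger_counts)
for name, value in profile.items():
    print(f"{name}: {value}")
```

A profile like this is what makes rules such as "values are never null" or "values fall between the observed minimum and maximum" possible to propose automatically.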

The GX Data Assistant requires a batch of data so it can create a data profile. GX will then automatically generate an Expectation Suite that matches the data profile. That automatic Expectation Suite can be used as a starting point and you can edit each Expectation however you want.


The next prompt asks which dataset we want to use to generate our Expectation Suite:

A batch of data is required to edit the suite - let's help you to specify it.

Which data asset (accessible by data connector "default_inferred_data_connector_name") would you like to use?
    1. yellow_tripdata_sample_2019-01.csv
    2. yellow_tripdata_sample_2019-02.csv
: 1

We will use the January 2019 data to create a data profile and an automatic Expectation Suite. We will then validate any subsequent batches of data with the Expectation Suite that was generated from the January 2019 batch. Select option 1 and press Enter.

That will bring up the next prompt.

Name the new Expectation Suite [yellow_tripdata_sample_2019-01.csv.warning]: getting_started_expectation_suite_taxi.demo

You can name the suite anything you want. Let’s use the same name from GX’s Getting Started Tutorial: getting_started_expectation_suite_taxi.demo. Press Enter.

You should see one last prompt that asks if you want to proceed with creating the Expectation Suite. Type y then press Enter.

A new notebook will open up for you automatically in your browser. (You can close the tab with the previous notebook file.)


File Location

The Expectation Suite notebook file is located here: great_expectations/uncommitted/edit_getting_started_expectation_suite_taxi.demo.ipynb


Creating Expectations in Jupyter Notebooks

In the Jupyter Notebook that opened in your browser, scroll down to the second code cell, which contains a list named exclude_column_names. Every column that stays in this list is left out of the Expectation Suite, so you include a column by commenting it out of the list. We want to create an Expectation Suite based on the number of passengers in each taxi ride, so comment out the "passenger_count" line. It should look like this:

exclude_column_names = [
    "vendor_id",
    "pickup_datetime",
    "dropoff_datetime",
#     "passenger_count",
    "trip_distance",
    "rate_code_id",
    "store_and_fwd_flag",
    "pickup_location_id",
    "dropoff_location_id",
    "payment_type",
    "fare_amount",
    "extra",
    "mta_tax",
    "tip_amount",
    "tolls_amount",
    "improvement_surcharge",
    "total_amount",
    "congestion_surcharge",
]
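The list works by exclusion: any column whose name remains in exclude_column_names is skipped, and everything else is profiled. The plain-Python sketch below shows that logic. It is an illustration rather than the notebook's own code, and the all_columns list is an assumption built from the column names shown in the cell.

```python
# Column names as they appear in the notebook cell (assumed to be the full set).
all_columns = [
    "vendor_id", "pickup_datetime", "dropoff_datetime", "passenger_count",
    "trip_distance", "rate_code_id", "store_and_fwd_flag",
    "pickup_location_id", "dropoff_location_id", "payment_type",
    "fare_amount", "extra", "mta_tax", "tip_amount", "tolls_amount",
    "improvement_surcharge", "total_amount", "congestion_surcharge",
]

# Equivalent to the edited cell: every column except passenger_count stays excluded.
exclude_column_names = [name for name in all_columns if name != "passenger_count"]

# Whatever is not excluded is what ends up in the suite.
excluded = set(exclude_column_names)
included_columns = [name for name in all_columns if name not in excluded]
print(included_columns)  # ['passenger_count']
```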

Run all the cells in the Jupyter Notebook to save the Expectation Suite and create the configs for your new suite. Another browser tab will automatically open to show the Data Docs for your Expectation Suite.

Viewing Expectations in Data Docs

Data Docs are where you can view your Expectations and see the status of your data validations. They also show the CLI command for editing a suite, which we will use later in this step.

The Data Docs open to a page titled “Expectation Validation Result”. If you click on “Home” in the breadcrumbs you will be taken to a page that has two tabs: “Validation Results” and “Expectation Suites”.

The “Validation Results” tab lists every validation that you have run, including the one we were just looking at. Each time you run all the cells in the edit_getting_started_expectation_suite_taxi.demo.ipynb file, you run the Expectation Suite against the dataset named by the data_asset_name key in the notebook's first code cell.

Each of those runs adds a new entry to the “Validation Results” list on the Data Docs home page. Click the first result in that list to view its Expectation Validation Result.

On the “Expectation Validation Result” page you will see a heading for “Table-Level Expectations” with a table of Expectations below it. Below that you will see a subheading for passenger_count with a table of Expectations below it.

Each table has “Status”, “Expectation”, and “Observed Value” columns. Each row shows the result for one Expectation: a checkmark if it succeeded or an "X" if it failed, along with some other details.
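The same three pieces of information can also be pulled out of a validation result in code. The sketch below works on a hand-written, heavily trimmed dictionary that is assumed to mirror the general shape of a GX validation result (a list of results, each with a success flag, the Expectation's type, and sometimes an observed value). The values are invented, real results carry more fields, and the summarize helper is hypothetical.

```python
# Hand-written stand-in for a validation result; values are invented for illustration.
validation_result = {
    "success": False,
    "results": [
        {
            "success": True,
            "expectation_config": {
                "expectation_type": "expect_column_values_to_not_be_null",
                "kwargs": {"column": "passenger_count"},
            },
            "result": {"unexpected_count": 0},
        },
        {
            "success": False,
            "expectation_config": {
                "expectation_type": "expect_column_max_to_be_between",
                "kwargs": {"column": "passenger_count", "min_value": 6, "max_value": 6},
            },
            "result": {"observed_value": 7},
        },
    ],
}


def summarize(result):
    """Return one (status, expectation, observed value) row per Expectation."""
    rows = []
    for item in result["results"]:
        status = "pass" if item["success"] else "FAIL"
        expectation = item["expectation_config"]["expectation_type"]
        # Not every Expectation reports a single observed value.
        observed = item["result"].get("observed_value", "n/a")
        rows.append((status, expectation, observed))
    return rows


rows = summarize(validation_result)
for status, expectation, observed in rows:
    print(f"{status:4}  {expectation}  observed={observed}")
```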

How to edit an Expectation Suite

The Expectation Suite that the Data Assistant generated is a good starting point, but you probably won't want all of those Expectations in production, and you may want to change the configuration of some of them. Let’s edit our suite to include only the Expectations that we want for our project.

On the “Expectation Validation Result” page, click the button on the left side of the screen labeled “How to Edit This Suite”. Copy the CLI command that appears. Back in your terminal, stop the Jupyter Notebook server (Ctrl+C), paste the CLI command that you copied, and run it:

great_expectations suite edit getting_started_expectation_suite_taxi.demo

You will see the following two prompts:

How would you like to edit your Expectation Suite?
    1. Manually, without interacting with a sample batch of data (default)
    2. Interactively, with a sample batch of data
: 2

Select option 2, which will give us an easier way to edit our Expectation Suite.

A batch of data is required to edit the suite - let's help you to specify it.

Which data asset (accessible by data connector "default_inferred_data_connector_name") would you like to use?
    1. yellow_tripdata_sample_2019-01.csv
    2. yellow_tripdata_sample_2019-02.csv

Type [n] to see the next page or [p] for the previous. When you're ready to select an asset, enter the index.
: 1

Select option 1 to use the same dataset that was used to profile our data.

A new Jupyter Notebook will open with edit_getting_started_expectation_suite_taxi.demo.ipynb loaded again. This time the file contains the Expectations that the Data Assistant generated automatically, each in its own cell. (You can close any other tabs that have notebooks or Data Docs in them.)

Go ahead and delete both of the Table Expectations. To delete a cell, select it and choose “Edit” → “Delete Cells”, or click the In [ ] prompt to the left of the cell and press D twice.

Under the passenger_count Column Expectation, delete all of the Expectations except for the validator.expect_column_values_to_not_be_null() and validator.expect_column_values_to_be_between() Expectations.

Once you have edited your Expectations, save them the same way as before: run all the cells in your notebook. This saves the updates to your Expectation Suite and opens its Data Docs again.


Troubleshooting Tip

The steps above should overwrite the existing getting_started_expectation_suite_taxi.demo Expectation Suite, but that didn’t happen for me. I had to go into the great_expectations/expectations folder, delete the getting_started_expectation_suite_taxi folder, and then run all the cells in the edit_getting_started_expectation_suite_taxi.demo.ipynb file again. Then it worked.
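If you would rather script that cleanup than delete the folder by hand, something like the following sketch would do it. The remove_stale_suite_folder helper is hypothetical, and it assumes the folder layout described in this tip; check the path before running it, because the deletion cannot be undone.

```python
import shutil
from pathlib import Path


def remove_stale_suite_folder(project_root, folder_name="getting_started_expectation_suite_taxi"):
    """Delete great_expectations/expectations/<folder_name> if it exists.

    Returns True if a folder was removed, False if there was nothing to remove.
    """
    suite_dir = Path(project_root) / "great_expectations" / "expectations" / folder_name
    if suite_dir.is_dir():
        shutil.rmtree(suite_dir)
        return True
    return False


if __name__ == "__main__":
    # Assumes the script is run from inside the gx-getting-started directory.
    removed = remove_stale_suite_folder(".")
    print("Removed stale suite folder" if removed else "No suite folder found")
```

After the folder is gone, run all the cells in the notebook again so the suite is written fresh.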


Now you should only see the Expectations for "values must never be null" and "values must be greater than or equal to 1 and less than or equal to 6" under the passenger_count column.
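In plain terms, those two rules amount to the checks sketched below. This is ordinary Python rather than GX code: the check_passenger_counts helper and both sample batches are invented for illustration, and the 1 to 6 range is the one reported in the Data Docs above.

```python
def check_passenger_counts(values, min_value=1, max_value=6):
    """Apply the two remaining rules to a list of passenger counts."""
    nulls = [v for v in values if v is None]
    # Null values are skipped by the range check here; the first rule covers them.
    out_of_range = [
        v for v in values
        if v is not None and not (min_value <= v <= max_value)
    ]
    return {
        "not_null": len(nulls) == 0,
        "between": len(out_of_range) == 0,
        "unexpected_values": out_of_range,
    }


# Hypothetical batches, not taken from the tutorial's CSV files.
good_batch = [1, 2, 6, 3]
bad_batch = [1, 0, None, 7]

print(check_passenger_counts(good_batch))
print(check_passenger_counts(bad_batch))
```

The first batch passes both rules. The second fails both: it contains a missing value, and 0 and 7 fall outside the allowed range.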

Expectation Gallery

You can see the full list of Expectations that are available in the Expectation Gallery.

