Data Pipelines with Great Expectations | Step 2: Connect to data

Before you can validate data, you need data to validate. In Great Expectations (GX), a datasource is the configuration that tells GX how to connect to the data at a given stage of your pipeline, so that the data can be validated before it moves on to the next stage.

While inside your gx-getting-started directory in your terminal, run the following command to create a new datasource:

great_expectations datasource new

The format of GX CLI commands

Most of the GX CLI commands follow this format:
great_expectations <noun> <verb>
For more details you can type one of the following commands in a terminal:
great_expectations --help
great_expectations <noun> --help
great_expectations <noun> <verb> --help
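If you ever need to call the CLI from a script, the noun/verb format above can be sketched as a small helper that builds the argument list. This is only an illustration: the gx_command function and its parameters are hypothetical names introduced here, not part of GX.

```python
from typing import List, Optional


def gx_command(
    noun: Optional[str] = None,
    verb: Optional[str] = None,
    help_flag: bool = False,
) -> List[str]:
    """Build an argument list of the form: great_expectations <noun> <verb> [--help].

    Hypothetical helper for illustration; it only assembles strings and
    does not check that the noun or verb exists in the GX CLI.
    """
    if verb is not None and noun is None:
        raise ValueError("a verb only makes sense after a noun")
    argv = ["great_expectations"]
    if noun is not None:
        argv.append(noun)
    if verb is not None:
        argv.append(verb)
    if help_flag:
        argv.append("--help")
    return argv


# The command used in this step, rebuilt from its parts.
print(" ".join(gx_command("datasource", "new")))
```

The returned list could be passed to subprocess.run, although datasource new is interactive, so for this tutorial it is simpler to type the command in a terminal.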

You will be prompted to select a type of datasource:

What data would you like Great Expectations to connect to?
    1. Files on a filesystem (for processing with Pandas or Spark)
    2. Relational database (SQL)
:1

Since we will be using the local data files in our data directory, select option 1 and press Enter.

The next prompt will ask you how you want to process your data files:

What are you processing your files with?
    1. Pandas
    2. PySpark
:1

Select option 1 and press Enter.

The final prompt for your datasource asks where your data is located:

Enter the path of the root directory where the data files are stored. If files are on local disk enter a path relative to your current working directory or an absolute path.
:data

Since we are working inside the gx-getting-started directory, we can type data as the path to our data directory and press Enter.
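Because the path is resolved relative to your current working directory, it can help to confirm that the directory resolves as expected before you type it. The following is a minimal sketch using only the Python standard library. The list_data_files name and the *.csv pattern are assumptions made for illustration, so adjust the pattern to match your files.

```python
from pathlib import Path
from typing import List


def list_data_files(root: str = "data", pattern: str = "*.csv") -> List[str]:
    """Return the sorted names of files directly under `root` that match `pattern`.

    A relative `root` is resolved against the current working directory,
    mirroring how the prompt describes relative paths.
    """
    base = Path(root)
    if not base.is_absolute():
        base = Path.cwd() / base
    if not base.is_dir():
        raise FileNotFoundError(f"No data directory found at {base}")
    return sorted(p.name for p in base.glob(pattern) if p.is_file())
```

Calling list_data_files("data") from inside gx-getting-started should list your data files. A FileNotFoundError suggests that you are in a different directory than you intended.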

When you press Enter, Jupyter Notebook opens in your default browser with a file named datasource_new.ipynb loaded. This notebook lets you configure your datasource.

File Location

You can find the datasource_new.ipynb file here: great_expectations/uncommitted/datasource_new.ipynb

Let’s change the name of our datasource. In the Jupyter Notebook, scroll down to the second code cell and change it as follows:

datasource_name = "getting_started_datasource"

Run all the cells in your notebook to save your datasource configuration in the great_expectations/great_expectations.yml file. The great_expectations.yml file contains the main project configuration. If you open that file, you should see a new entry under the datasources key for the datasource that you just configured.
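If you would rather check this from a script than open the file by hand, the sketch below scans the YAML text for names listed directly under a top-level datasources key. It uses only the standard library and assumes plain block-style YAML indented with spaces. The datasource_names function is a hypothetical helper, and a real YAML parser would be more robust.

```python
from typing import List, Optional


def datasource_names(yaml_text: str) -> List[str]:
    """Return the keys found directly under a top-level `datasources:` key.

    Simplified line scan, not a YAML parser: it assumes block-style YAML
    indented with spaces and ignores blank lines and full-line comments.
    """
    names: List[str] = []
    inside = False
    child_indent: Optional[int] = None
    for raw in yaml_text.splitlines():
        stripped = raw.strip()
        if not stripped or stripped.startswith("#"):
            continue
        indent = len(raw) - len(raw.lstrip(" "))
        if indent == 0:
            # Any top-level key starts or ends the datasources section.
            inside = stripped == "datasources:"
            child_indent = None
            continue
        if not inside:
            continue
        if child_indent is None:
            # The first nested line fixes the indentation of datasource names.
            child_indent = indent
        if indent == child_indent and stripped.endswith(":"):
            names.append(stripped[:-1].strip("'\""))
    return names
```

With the file's contents read into a string, for example with Path("great_expectations/great_expectations.yml").read_text(), the returned list should include getting_started_datasource once the notebook has run.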

How are notebook files related to GX project configs?

The Jupyter Notebook files inside the uncommitted folder are scripts that update the configuration in other parts of the great_expectations folder. They are generated when you run CLI commands. Once a notebook is generated, GX opens it automatically in Jupyter Notebook. You can change the settings in the notebook and then run all of its cells to update your configs.

Everything in the uncommitted folder is temporary and is not intended to be saved in version control. However, the rest of the contents in your great_expectations folder should be saved in version control. The great_expectations/.gitignore file already comes with an entry for uncommitted/.
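To confirm that the ignore entry is in place before your first commit, a short check like the one below may be useful. It is a sketch using only the standard library. The ignores_uncommitted name is hypothetical, and the check looks only for a line that reads exactly uncommitted/, so it does not interpret other gitignore patterns.

```python
from pathlib import Path


def ignores_uncommitted(gitignore_path: str = "great_expectations/.gitignore") -> bool:
    """Return True if the .gitignore file has a line that is exactly `uncommitted/`.

    Simplified check for illustration: it does not evaluate wildcards or
    negated patterns the way git itself would.
    """
    path = Path(gitignore_path)
    if not path.is_file():
        return False
    entries = {line.strip() for line in path.read_text(encoding="utf-8").splitlines()}
    return "uncommitted/" in entries
```

Run from inside gx-getting-started, ignores_uncommitted() should return True for a freshly initialized project.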