Before you can validate data, you need data to validate. A datasource is simply the data at some stage in your pipeline that must be validated before it can move on to the next stage.
While inside your gx-getting-started directory in your terminal, run the following command to create a new datasource:
great_expectations datasource new
The format of GX CLI commands
Most of the GX CLI commands follow this format:
great_expectations <noun> <verb>
For more details you can type one of the following commands in a terminal:
great_expectations --help
great_expectations <noun> --help
great_expectations <noun> <verb> --help
You will be prompted to select a type of datasource:
What data would you like Great Expectations to connect to?
1. Files on a filesystem (for processing with Pandas or Spark)
2. Relational database (SQL)
:1
Since we will be using the local data files in our data directory, select option 1 and press Enter.
The next prompt will ask you how you want to process your data files:
What are you processing your files with?
1. Pandas
2. PySpark
:1
Select option 1 and press Enter.
The final prompt for your datasource asks where your data is located:
Enter the path of the root directory where the data files are stored. If files are on local disk enter a path relative to your current working directory or an absolute path.
:data
Since we are working inside the gx-getting-started directory, we can type data as the path to our data directory and press Enter.
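To see how a relative path typed at this prompt relates to your working directory, here is a minimal Python sketch. It is not part of GX; the helper name resolve_data_dir is hypothetical, and the sketch assumes you run it from the gx-getting-started directory.

```python
from pathlib import Path


def resolve_data_dir(path_text: str, cwd: Path) -> Path:
    """Resolve a path typed at the prompt against a working directory.

    Hypothetical helper for illustration; not part of Great Expectations.
    """
    candidate = Path(path_text)
    # Absolute paths are used as-is; relative paths are joined to cwd.
    return candidate if candidate.is_absolute() else cwd / candidate


if __name__ == "__main__":
    data_dir = resolve_data_dir("data", Path.cwd())
    print(f"Resolved to: {data_dir}")
    if data_dir.is_dir():
        # List the files the datasource would be pointed at.
        for entry in sorted(data_dir.iterdir()):
            print(" ", entry.name)
    else:
        print("Directory not found; check your working directory.")
```

If the script reports that the directory was not found, you are probably not inside gx-getting-started, and the relative path data would not point where you expect.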
When you press Enter, a Jupyter Notebook will open in your default browser with a file named datasource_new.ipynb loaded. This file lets you configure your datasource.
File Location
You can find the datasource_new.ipynb file here: great_expectations/uncommitted/datasource_new.ipynb
Let's change the name of our datasource. In the Jupyter Notebook, scroll down to the second code cell and change it as follows:
datasource_name = "getting_started_datasource"
Run all the cells in your notebook to save your datasource configuration in the great_expectations/great_expectations.yml file. The great_expectations.yml file contains the main project configuration. If you open that file, you should see a new entry under the datasources header with the datasource that you just configured.
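If you would rather check the file from a script than open it by hand, the following is a rough sketch. It assumes the file uses block-style YAML with space indentation and a top-level datasources key; the function list_datasource_names is a hypothetical helper, not a GX API, and a real YAML parser would be the more robust choice.

```python
from pathlib import Path
from typing import List


def list_datasource_names(yml_text: str) -> List[str]:
    """Return the keys nested directly under a top-level `datasources:` key.

    Naive line scan for illustration only; assumes block-style YAML
    indented with spaces. Not part of Great Expectations.
    """
    names: List[str] = []
    in_section = False
    child_indent = None
    for line in yml_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and comments
        indent = len(line) - len(line.lstrip(" "))
        if indent == 0:
            # Any top-level key either opens or closes the section.
            in_section = stripped.startswith("datasources:")
            child_indent = None
            continue
        if in_section:
            if child_indent is None:
                child_indent = indent  # first nested line sets the level
            if indent == child_indent and stripped.endswith(":"):
                names.append(stripped[:-1])
    return names


if __name__ == "__main__":
    yml_path = Path("great_expectations/great_expectations.yml")
    if yml_path.is_file():
        print(list_datasource_names(yml_path.read_text()))
    else:
        print(f"{yml_path} not found; run this from gx-getting-started.")
```

After running the notebook, the printed list should include getting_started_datasource if the name was changed as described above.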
How are notebook files related to GX project configs
The Jupyter Notebook files inside the uncommitted folder are basically scripts that update the configurations in other parts of the great_expectations folder. The notebook files inside the uncommitted folder are generated when you run the CLI commands. Once a notebook file is generated, GX will open it automatically for you in Jupyter Notebook. You can change the settings in the notebook file and run all the cells in the notebook to update your configs.
Everything in the uncommitted folder is temporary and is not intended to be saved in version control. However, the rest of the contents of your great_expectations folder should be saved in version control. The great_expectations/.gitignore file already comes with an entry for uncommitted/.
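To confirm that the uncommitted folder really is excluded from version control, a small check like the one below can help. It is only a sketch: ignores_uncommitted is a hypothetical helper, and it looks for a plain uncommitted entry rather than evaluating full gitignore pattern rules.

```python
from pathlib import Path


def ignores_uncommitted(gitignore_text: str) -> bool:
    """Return True if a non-comment line is an entry for uncommitted/.

    Hypothetical helper; does not implement full gitignore matching.
    """
    for line in gitignore_text.splitlines():
        entry = line.strip()
        if entry.startswith("#"):
            continue  # comment line
        # Accept "uncommitted", "uncommitted/", and "/uncommitted/".
        if entry.strip("/") == "uncommitted":
            return True
    return False


if __name__ == "__main__":
    gitignore = Path("great_expectations/.gitignore")
    if gitignore.is_file():
        print("uncommitted/ ignored:", ignores_uncommitted(gitignore.read_text()))
    else:
        print(f"{gitignore} not found; run this from gx-getting-started.")
```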