As a data scientist, Jupyter Notebooks are probably one of the most frequently used "tools" to get the job done. Whether you are loading or processing data, analyzing data, using data to train a model, or perform other tasks of the data science workflow, notebooks are usually key.
Let's say you've created a set of notebooks that load, cleanse, and analyze time-series data, which is made available periodically. Instead of running each notebook manually (or performing all tasks in a single notebook, which limits reusability of task specific code) you could create and run a re-usable machine learning pipeline:
With Elyra you can do this in JupyterLab or leverage Kubeflow Pipelines, a popular platform for building and deploying portable, scalable machine learning workflows based on Docker containers.
In Elyra 2.1 (and later releases) you can run pipelines also on Apache Airflow, as outlined in this blog post.
Before I'll outline how you can create and run pipelines, a bit of background information.
JupyterLab can be extended using extensions, making it possible for anybody to customize the user experience by providing new functionality (like a CSV file editor or a visualization), integrating services (like git for sharing and version control), or themes.
Elyra is a set of AI-centric extension for JupyterLab that aim to simplify and streamline day-to-day activities. It's main feature is the Visual Pipeline Editor, which enables you to create workflows from Python notebooks or scripts and run them locally in JupyterLab, on Kubeflow Pipelines, or Apache Airflow.
Assembling a pipeline
Pipelines are assembled in Elyra using the Visual Pipeline Editor. The pipeline assembly process generally involves
- creating a new pipeline,
- adding Python notebooks or Python scripts and defining their runtime properties, and
- connecting the notebooks and scripts to define execution dependencies.
Creating a pipeline
Pipelines are created by opening the Elyra Pipeline Editor from the Launcher.
Adding Python notebooks and scripts to the pipeline
Python notebooks and scripts are added to the pipeline by dragging them from the JupyterLab File Browser onto the canvas.
Each notebook or file is represented by a node that includes an input port and an output port.
Runtime properties, which can be accessed via the context menu, define the execution environment (Docker image) in which the notebook or script is run during remote execution, inputs (file dependencies and environment variables), and output files.
Nodes can optionally be associated with comments to describe their purpose.
Defining dependencies between notebooks and scripts
Dependencies between notebooks or scripts are defined by connecting output ports (on the right side of a node) to input ports (on the left side of a node).
Dependencies are used to determine the order in which the nodes will be executed during a pipeline run.
The following rules are applied:
- circular dependencies are not allowed
- if two nodes are not connected (directly or indirectly), they can be executed in parallel
- if two nodes are connected, the node producing the inputs for the other node is executed first
There are some distinct differences between how pipelines are executed in JupyterLab and on a third-party workflow orchestration framework, such as Kubeflow Pipelines.
Running pipelines in JupyterLab
Pipelines can be executed in JupyterLab as long as the environment provides access to the pipeline's prerequisites. For example, the kernels that the notebooks are associated with must be already installed, just like required packages (if they are not installed in the notebooks).
Running pipelines in the JupyterLab environment should be a viable approach if
- you are assembling a new pipeline and are testing it using relatively small data volumes
- pipeline tasks don't require hardware resources in excess of what's available
- pipeline tasks complete in an acceptable amount of time, given existing resource constraints.
Nodes are executed as a sub-process in the JupyterLab environment and always processed sequentially.
Output files (like processed data files or training artifacts) are stored in the local file system and can be accessed using the JupyterLab File Browser.
Processed notebooks are updated in place, meaning their output cells reflect the execution results.
Python script output, such as messages sent to STDOUT or STDERR are displayed in the JupyterLab console.
Elyra currently does not provide pipeline monitoring capability in the JupyterLab UI aside from a message after processing has completed. However, the relevant information is contained in the JupyterLab console output.
To learn more about how to create a pipeline and run it in JupyterLab take a look at the Running notebook pipelines in JupyterLab tutorial.
Running pipelines on Kubeflow Pipelines
While running pipelines locally might be feasible in some scenarios, it's rather impractical if large data volumes need to be processed or if compute tasks require special purpose hardware like GPUs or TPUs to perform resource intensive calculations.
Elyra can be configured to access Kubeflow Pipeline instances (secured and unsecured) by defining a runtime configuration. Once configured the selected configuration is used to run the pipeline.
The main difference between local pipeline execution and execution on Kubeflow Pipelines is that each node is processed in an isolated Docker container, allowing for better portability, scalability, and manageability.
The following chart illustrates this for two dependent notebook nodes.
Data is exchanged between nodes using S3-compatible cloud storage. Before a notebook or Python script is executed the declared input file dependencies are automatically downloaded from cloud storage into the container. After processing has completed the declared output files are automatically uploaded from the container to cloud storage.
Since Elyra is not yet a mature project it currently relies on the Kubeflow Pipelines UI for pipeline execution monitoring.
Additional details, along with step-by-step instructions can be found in the Running notebook pipelines on Kubeflow Pipelines tutorial.
Ways to try Elyra (and pipelines)
The referenced tutorials are a great way to get started with pipelines. If you are looking for a more complex example this COVID-19 pipeline might fit the bill.
If you'd like to try out Elyra and start building your own pipelines, you have three options, outlined in the sections below:
- Use a sandbox environment hosted on the cloud
- Use Elyra Docker images
- Install JupyterLab and Elyra on your local machine
Kubeflow Pipelines is not included in any of the Elyra installation options.
Running Elyra in a sandbox environment on the cloud
You can test drive Elyra on mybinder.org, without having to install anything. Follow this link to try out the latest stable release or the latest development version (if you feel adventurous) in a sandbox environment.
The sandbox environment contains a getting_started
markdown document, which provides a short tour of the Elyra features:
A couple of things to note:
- Performance can sometimes be sluggish since this is a shared environment.
- The sandbox environment is not persisted and any changes you make will be lost when it is shut down.
If you have Docker installed on your machine consider using one of the Docker images instead.
Running Elyra container images
The Elyra community publishes ready-to-use container images on Docker Hub, which have JupyterLab and the Elyra extension pre-installed. Docker images are approximately one GB in size are tagged as follows:
-
elyra/elyra:latest
is the latest stable release -
elyra/elyra:x.y.z
has thex.y.z
release installed -
elyra/elyra:dev
is automatically built each time a change is committed.
Once you've decided which image to use (elyra/elyra:latest
is always an excellent choice because you won't miss out on the latest features!), you can spin up a sandbox container as follows:
docker run -it -p 8888:8888\
-v ${HOME}/jupyter-notebooks/:/home/jovyan/work\
-w /home/jovyan/work\
elyra/elyra:latest jupyter lab
Open your web browser to the displayed URL and you are ready to start.
To access the notebook, open this file in a browser:
file:///home/jovyan/.local/share/jupyter/runtime/nbserver-6-open.html
Or copy and paste one of these URLs:
http://4d17829ecd4c:8888/?token=d690bde267ec75d6f88c64a39825f8b05b919dd084451f82
or http://127.0.0.1:8888/?token=d690bde267ec75d6f88c64a39825f8b05b919dd084451f82
The caveat is, that in sandbox mode you cannot access existing files (such as notebooks) on your local machine and all changes you make are discarded when you shut down the container.
Therefore it's better to launch the Docker container like so, replacing ${HOME}/jupyter-notebooks/
and ${HOME}/jupyter-data-dir
with the names of existing local directories:
docker run -it -p 8888:8888\
-v ${HOME}/jupyter-notebooks/:/home/jovyan/work\
-w /home/jovyan/work\
-v ${HOME}/jupyter-data-dir:/home/jovyan/.local/share/jupyter\
elyra/elyra:latest jupyter lab
This way all changes are preserved when you shut down the container and you won't have to start from scratch when you bring it up again.
Installing Elyra locally
If your local environment meets the prerequisites, you can install JupyterLab and Elyra using pip
, conda
, or from source code, following the instructions in the installation guide.
Closing thoughts
Elyra is a community-driven effort. We welcome contributions of any kind: bug reports, feature requests, and of course pull requests. Head on over to https://github.com/elyra-ai/elyra and let's get started.
Originally published on IBM Developer.
Top comments (0)