As a data scientist, one of the most exciting things to me about Faethm is that data science is at the heart of our products.
As the head of our data engineering team, it's my responsibility to ensure our data science can scale to meet the needs of our rapidly growing and global customer base.
In this article, I'm going to share some of the most interesting parts of our approach to scaling data science products, and a few of the unique challenges that we have to address.
Faethm is data science for the evolution of work
Before we delve into our approach, it's important to understand a few things about Faethm and what we do.
Our customers depend on us to understand the future of work, and the impacts that technology and shifts in work patterns have on their most critical asset: their people.
Our data science team is responsible for designing and building our occupation ontology, breaking down the concept of "work" into roles, tasks, skills and a myriad of dynamic analytical attributes that describe all of these at the most detailed level. Our analytics are derived from a growing suite of proprietary machine learning models.
Our platform ties it all together to help people leaders, strategy leaders and technology leaders make better decisions about their workforce, with a level of detail and speed to insight that is impossible without Faethm.
We use Python and Jupyter notebooks for data science
Our data scientists primarily use Python, Jupyter notebooks and the ever-growing range of Python packages for data transformation, analysis and modelling that you would expect to see in any data scientist's toolkit (and perhaps some you wouldn't).
Luckily, running an interactive Jupyter workbench in the cloud is pretty easy.
AWS SageMaker provides the notebook platform for our teams to configure managed compute instances to their requirements and turn them on and off on demand. Self-service access to modelling environments of varying power requires little more than a few IAM role policies and some clicks in the AWS Console.
This means a data scientist can SSO into the AWS Console and get started on their next project with access to whatever S3 data is permitted by their access profile. Results are written back to S3, and notebooks are pushed to the appropriate Git repository.
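In practice, the day-to-day pattern inside one of these notebooks is straightforward; here is a minimal sketch (the bucket names and paths below are hypothetical):

import pandas as pd

# read input data straight from S3 (the SageMaker execution role scopes what is visible);
# pandas resolves s3:// paths via the s3fs package
df = pd.read_csv("s3://example-input-bucket/path/to/input.csv")

# ... transformation, analysis and modelling ...
summary = df.describe()

# write results back to S3
summary.to_csv("s3://example-output-bucket/path/to/summary.csv")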
How do we turn this into a product so that our data scientists never have to think about running an operational workflow?
Engineering data science without re-engineering notebooks
One of the core design goals of our approach is to scale without re-engineering data science workflows wherever possible.
Due to the complexity of our models, it's critical that data scientists have full transparency of how their models are functioning in production. So we don't re-write Jupyter notebooks. We don't even replicate the code within them into executable Python scripts. We just execute them, exactly as written, no change required.
We do this with Papermill.
Papermill is a Python package for parameterising and executing Jupyter notebooks. As long as a notebook is written with parameters for dynamic functionality (usually with sensible defaults in the first notebook cell), Papermill can execute the notebook ($NOTEBOOK) on the command line with a single command. Any parameter (-r for raw string values or -p for parsed values) can be overridden at runtime, and Papermill does this by injecting a new notebook cell that assigns the new parameter values.
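For context, the parameters themselves are just ordinary assignments in a notebook cell (Papermill looks for a cell tagged "parameters" and injects the overrides directly after it); a minimal example using the parameter names from the command below:

# default parameters (cell tagged "parameters" so Papermill can override them)
A_RAW_PARAMETER = "a sensible default string"
A_PARAMETER = False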
A simple Papermill command line operation looks like this:
pip install papermill

papermill "$NOTEBOOK" "$OUTPUT_NOTEBOOK" \
  -r A_RAW_PARAMETER "this is always a Python string" \
  -p A_PARAMETER "True"  # this is converted to a Python data type
Since Papermill executes the notebook and not just the code, the cell outputs including print statements, error messages, tables and plots are all rendered in the resulting output notebook ($OUTPUT_NOTEBOOK). This means that the notebook itself becomes a rich log of exactly what was executed, and serves as a friendly diagnostic tool for data scientists to assess model performance and detect any process anomalies.
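The same execution can also be driven from Python rather than the shell, which is handy for local testing; a minimal sketch using Papermill's execute_notebook API (the paths here are placeholders):

import papermill as pm

# execute the notebook and render the output notebook, overriding the default parameters
pm.execute_notebook(
    "my-notebook.ipynb",
    "output/my-notebook.ipynb",
    parameters={"A_RAW_PARAMETER": "this is always a Python string", "A_PARAMETER": True},
)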
Reproducible notebook workflows
Papermill is great for executing our notebooks, but we need notebooks to be executed outside of the SageMaker instance they were created in. We can achieve this by capturing a few extra artifacts alongside our notebooks.
Firstly, we store a list of package dependencies in a project's Git repository. This is easily generated in the Jupyter terminal with pip freeze > requirements.txt, but is often best hand-crafted to keep dependencies to the essentials.
Any other dependencies are also stored in the repository. These can include scripts, pickled objects (such as trained models) and common metadata.
We also capture some metadata in a YAML configuration file:
...
Notebooks:
- my-notebook.ipynb
- my-second-notebook.ipynb
...
This file lists the notebooks in execution order, so a workflow can be composed of multiple independent notebooks to maintain readability.
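As a rough illustration (not our actual build code), turning that list into a comma-separated NOTEBOOKS value for the container build shown below takes only a few lines of Python; the config.yml filename here is hypothetical:

import yaml

# load the project configuration and join the notebook list for the Docker build argument
with open("config.yml") as f:
    config = yaml.safe_load(f)

notebooks_arg = ",".join(config["Notebooks"])
print(notebooks_arg)  # e.g. my-notebook.ipynb,my-second-notebook.ipynb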
Finally, a simple buildspec.yml configuration file is included that initiates the build process. This is the standard for AWS CodeBuild, which we use as a build pipeline.
Changes to notebooks, dependencies and other repository items are managed through a combination of production and non-production Git branches, just like any other software project. Pull Requests provide a process for code promotion between staging and production environments, facilitate manual code review, and trigger a series of automated merge checks to create confidence in code changes.
Notebook containers built for production deployment
To keep our data science team focused on creating data science workflows and not build pipelines, the container build and deployment process is abstracted from individual Jupyter projects.
Webhooks are configured on each Git repository. Pushing to a branch in a notebook project triggers the build process. Staging and production branches are protected from bad commits by requiring a Pull Request for all changes.
A standard Dockerfile consumes the artifacts stored in the project repository at build time:
FROM python:3.7

RUN pip install papermill

# work from a dedicated directory inside the image
WORKDIR /app

# package dependencies
COPY requirements.txt .
RUN pip install -r requirements.txt

# notebook execution order from YAML config
ARG NOTEBOOKS
ENV NOTEBOOKS=${NOTEBOOKS}

# prepare entrypoint script
COPY entrypoint.sh .

# catch-all for other dependencies in the repository
COPY . .

# these parameters will be injected at run-time
ENV PARAM1=
ENV PARAM2=

CMD ./entrypoint.sh
The entrypoint is a simple bash script that iterates over the notebooks:
#!/bin/bash
set -e  # fail the task if any notebook execution fails

# NOTEBOOKS is a comma-separated list set at build time; split it on commas
for NOTEBOOK in ${NOTEBOOKS//,/ }
do
  # execute the notebook with Papermill and write the rendered output notebook to S3
  papermill "$NOTEBOOK" "s3://notebook-output-bucket/$NOTEBOOK" \
    -r PARAM1 "$PARAM1" \
    -p PARAM2 "$PARAM2"
done
This entrypoint.sh script executes each of the notebooks at run-time in the order defined by the YAML configuration, and stores the resulting output notebook in S3.
AWS CodeBuild determines the target environment from the repository branch, builds the container and pushes it to AWS ECR so it is available to be deployed into our container infrastructure.
Serverless task execution for Jupyter notebooks
With Faethm's customers spanning many different regions across the world, their data is subject to the data regulations of each customer's local jurisdiction. Our data science workflows need to be able to execute in whichever region a customer specifies for their data to be stored. With our approach, data never has to transfer between regions for processing.
We operate cloud environments in a growing number of customer regions across the world, throughout the Asia Pacific, US and Europe. As Faethm continues to scale, we need to be able to support new regions.
To run our Jupyter notebook containers, each supported region has a VPC with an ECS Fargate cluster configured to run notebook tasks on-demand.
Each Jupyter project is associated with an ECS task definition; the task definition template is configured by the build pipeline and deployed through CloudFormation.
Event-driven Jupyter notebook tasks
To simplify task execution, each notebook repository has a single event trigger. Typically, a notebook task will run in response to a data object landing in S3. An example is a CSV being uploaded from a user portal, upon which our analysis takes place.
In the project repository, the YAML configuration file captures the S3 bucket and key prefix that will trigger the ECS task when a matching CloudTrail event is delivered to EventBridge:
...
S3TriggerBucket: notebook-trigger-bucket
S3TriggerKeyPrefix: path/to/data/
...
The EventBridge rule template is configured by the build pipeline and deployed through CloudFormation, and this completes the basic requirements for automating Jupyter notebook execution.
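For illustration only, launching one of these notebook tasks by hand boils down to a boto3 call like the one below; in production the EventBridge rule target starts the task for us, and the region, cluster, task definition, container and network identifiers here are all hypothetical:

import boto3

ecs = boto3.client("ecs", region_name="ap-southeast-2")

# run the notebook container as a one-off Fargate task, injecting the run-time parameters
ecs.run_task(
    cluster="notebook-cluster",
    taskDefinition="my-notebook-project",
    launchType="FARGATE",
    networkConfiguration={
        "awsvpcConfiguration": {
            "subnets": ["subnet-0123456789abcdef0"],
            "securityGroups": ["sg-0123456789abcdef0"],
        }
    },
    overrides={
        "containerOverrides": [
            {
                "name": "notebook",
                "environment": [
                    {"name": "PARAM1", "value": "a raw string"},
                    {"name": "PARAM2", "value": "True"},
                ],
            }
        ]
    },
)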
Putting it all together
In this article we've looked at a few of the challenges to scaling and automating data science workflows in a multi-region environment. We've also looked at how to address them within the Jupyter ecosystem and how we are implementing solutions that take advantage of various AWS serverless offerings.
When you put all of these together, the result is our end-to-end serverless git-ops containerised event-driven Jupyter-notebooks-as-code data science workflow execution pipeline architecture.
We just call it notebook-pipeline.
You’ve been reading a post from the Faethm AI engineering blog. We’re hiring, too! If you share our passion for the future of work and want to pioneer world-leading data science and engineering projects, we’d love to hear from you. See our current openings: https://faethm.ai/careers