This blog was originally published at The AI Journal
If you are a Data Scientist, a Data Analyst or a Data Engineer, this post is about helping you decide if you need to start using Docker.
Automation of Data Science environments, and bringing the development and production environments for Data Science closer together, are becoming first-class concerns with every passing day. Docker has been advocated as an important solution to a wide variety of Data Engineering problems like these.
You might have heard about Docker and want to give it a try. You might even have tried it for a week and then given up in frustration! After reading through this post, you’ll:
- Understand what Docker is, and how it can be useful,
- Understand where Docker can help you in your Data Science workflow,
- Determine if Docker is a good fit for the current problems you have in your Data Science workflow.
Your experience with Docker will improve over time. Hang in there!
Here is how this post is organized:
- Challenges we face as Data Scientists and Data Engineers
- What is Docker? What can it do?
- Docker Application Scenarios
- Challenges in using Docker for Data Science (and how to overcome them)
- Next Steps
Data Science is about ideas, experimentation and sharing insights. Data Science folks are great at all of those, but they often hit a snag when it comes to getting their work out to people outside the team.
Getting results from the Data Scientist’s desk to the business, where they can be used to make decisions, is of the utmost priority. The impetus is on getting experiments right and getting them out fast, whatever it takes. Over time, this leads to technical debt (and dependency hell) that can make our lives difficult when we revisit those experiments.
Docker is awesome. People routinely use it for:
- automating, sharing and reproducing experiments,
- packaging and deploying Data Science applications,
- creating easy-to-use Data Science sandboxes for on-boarding and experiments, and
- large-scale data analysis and machine learning in cloud environments.
Docker is an open platform for developers and sysadmins to build, ship, and run distributed applications, whether on laptops, data-center VMs, or the cloud. (Official Doc)
Docker uses containers to make it easier to create, deploy, and run applications.
Containers enable a programmer or a data engineer to isolate and package an application with all the dependencies it needs (files, libraries, etc.).
The whole thing ships as one package that can be run on any other machine running the Linux OS, reproducing the packaged environment exactly, irrespective of how the target machine is configured.
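To make this concrete, here is a minimal sketch of a Dockerfile for a Python-based analysis. The file names and the pinned Python version are illustrative assumptions, not a prescription:

```dockerfile
# Start from an official, version-pinned Python base image
FROM python:3.10-slim

WORKDIR /app

# Install pinned library versions so every build reproduces the same environment
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the analysis code into the image and run it on container start
COPY analysis.py .
CMD ["python", "analysis.py"]
```

Anyone with Docker installed can build this with `docker build -t my-analysis .` and run it with `docker run my-analysis`, regardless of what is installed on their own machine.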
Importantly, Docker has seen an equally strong following in both the start-up and the larger enterprise space, validating its usefulness across the spectrum of engineering tasks.
- Netflix runs a fleet of containers in their operations infrastructure (read more).
- The majority of cloud providers have some form of support or platform-based offering for running containers (most commonly Docker).
- For example, AWS provides Amazon Elastic Container Service (ECS), a scalable, high-performance native container orchestration service.
The Docker learning curve is manageable, yet folks in Data Science with less exposure to the engineering side of things might find it steep. We’ll talk about how to get started on your Docker journey, but let’s first look at use cases where Docker really shines.
1. Do you need to share and reproduce experiments?

One of the biggest pain points in Data Science is sharing and reproducing experiments. The challenges involved include:
- Everyone makes their own choices about operating systems and tooling, and those choices break installations of a lot of Data Science libraries.
- A library that installs smoothly on Linux-based systems might not be as easy to install on Windows.
- Different Python/R versions and different library versions can wreak havoc when sharing analyses.
- There have been efforts to streamline this process, yet the aforementioned operating-system issue is still a big factor.
- A lot of the time, in smaller teams and even at the enterprise level, processes for sharing these experiments are either non-existent or tacked onto regular reporting work that is a manual process.
- Data Science experiments are creative and non-linear, which makes it difficult to neatly fit them into processes that already exist in a system.
Docker helps with all of the above. Since a Docker image is immutable, the issue of conflicting operating systems is resolved. The versions of the operating system and libraries stay frozen (unless deliberately updated), and we can set up workflows that integrate Docker for each person involved in the Data Science process (all the technical folks).
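As a sketch of what that workflow can look like in practice (the image and tag names here are hypothetical), sharing a frozen experiment environment boils down to a few commands:

```shell
# Build the experiment image once, on the author's machine
docker build -t myteam/churn-experiment:v1 .

# Push it to a registry the whole team can access
docker push myteam/churn-experiment:v1

# A colleague reproduces the exact same environment with two commands
docker pull myteam/churn-experiment:v1
docker run myteam/churn-experiment:v1
```

Because the image freezes the operating system, the language runtime and every library version, the colleague sees exactly what the author saw.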
2. Is building proof-of-concept (or production-level) dashboards, Web APIs and backend cron jobs a part of your job?
Data Science applications can come in many types:
- Dashboards using RShiny/Dash/Bokeh
- Web APIs created to serve Machine Learning models
- Scripts that run regularly (batch processes) using Machine Learning models
Sharing Data Science applications with end users is a big task. You often don’t want to share them publicly, only with a specific subset of users. Dashboards for proofs of concept (POCs) are especially tough: sharing RShiny or Python code directly doesn’t really cut it if the user is non-technical, and it defeats the entire purpose of a web-based user interface that’s easy to use. Docker and its ecosystem can help Data Scientists publish Data Science applications to a limited user base.
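As an illustration of the “Web API serving a model” case, the sketch below exposes a stand-in `predict` function over HTTP using only the Python standard library; the weights and endpoint are made up for the example, and a real service would more likely use Flask or FastAPI with a serialized model. Packaged into a Docker image, something like this can be handed to exactly the users who need it:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(features):
    """Stand-in for model.predict(): a fixed linear score (weights are made up)."""
    weights = [0.5, 0.25]
    return sum(w * x for w, x in zip(weights, features))


class PredictHandler(BaseHTTPRequestHandler):
    """POST a JSON body like {"features": [2.0, 4.0]} to get a score back."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        body = json.dumps({"score": predict(payload["features"])}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Blocks forever serving requests; inside a container this would be the CMD."""
    HTTPServer(("0.0.0.0", port), PredictHandler).serve_forever()
```

A Dockerfile wrapping this script would expose the chosen port and set `serve()` as the entrypoint, so end users never touch the Python environment at all.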
3. Is onboarding new team members slow and painful?

Onboarding is a key concern for a lot of big organisations. Corporate firewalls and ServiceNow requests to install Anaconda are some of the technical hurdles that eat into productive time.

Docker-based images can help new people get up and running on their laptops/desktops with just a few commands. Sandboxes for quick feasibility analysis can also be provided with Docker: if there is something new the team wants to try without the headache of installation, all we need is a Docker image for it.
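For instance, spinning up a disposable sandbox with the scientific Python stack is a single command, using the community-maintained jupyter/scipy-notebook image from Docker Hub:

```shell
# Launch a throwaway Jupyter environment with NumPy, pandas, scikit-learn, etc.
# Nothing is installed on the host beyond Docker itself.
docker run --rm -p 8888:8888 jupyter/scipy-notebook
```

When the container is stopped, the host machine is left exactly as it was.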
4. Are deployments a constant headache?

Deployments are a big headache. It’s common to have a huge information asymmetry between Data Science groups and the DevOps group on how a Machine Learning or Data Science application should be integrated into a system.

Both groups work in isolation, and regular software engineering deployments are often not effective against the unique problems associated with Machine Learning applications. Docker can help bridge this gap, if Data Science groups are enabled to do small-scale deployments themselves, or if a hybrid team, enabled by Docker, works in tandem with them.
5. Do you want to make it easier to carry out large-scale data analysis and machine learning training in cloud environments?
With the increasing complexity of data analysis (mostly due to volume) and machine learning (training models), more computing power is needed. The majority of cloud providers offer access to that compute power via their platforms.
In these cases, time is money, and tools like Docker, along with Chef/Puppet/Terraform, can help save time in setting up and managing cloud infrastructure.
- Not enough learning resources exist yet for building Data Science applications with Docker.
- It is a new skill with a steep learning curve.
- It takes effort to get an entire team to change their workflow to integrate Docker.
Now we have a good understanding of the use cases for Docker.

If you are struggling with some of these problems, it would probably be a good idea to evaluate Docker for your workflow or your organisation. To learn more about Docker, what it can do, and how to use it in practice/production, watch this space. I’ll be blogging more about it in the coming weeks.
You can also sign up for the newsletter to stay updated about my upcoming blogposts in this series. Feedback much appreciated! 😊