Machine learning and automation are intrinsically intertwined in our world. Without automation, machine learning is reduced to making complex calculations by hand.
Current levels of automation have tightened the machine learning life cycle and the integration of models into production. Not only are the tools that automate these processes advancing, but the mathematics involved in optimizing the model itself is advancing as well.
Concepts like Sequential Model-Based Optimization provide a means to potentially generate a viable model without domain or even data-related expertise.
In this article we will discuss some of the tools your team can use to automate and manage your machine learning processes.
AutoML And Automation Tools
Companies have already hit the market with end-to-end autoML tools, designed to take input data and generate an optimized model with minimal oversight from the operator. Alongside them, tools like MLflow automate the tracking and management of the models being developed. In addition, there are automation tools with broader use cases, like Airflow, that can be used to manage statistical models. This increase in automation used in conjunction with machine learning provides a much-needed easy access point to powerful predictive tools. This easy access point, however, can be very deceptive. There are still plenty of pitfalls for the data illiterate and even for more seasoned practitioners.
With tools such as MLflow, data professionals can now automate sophisticated model tracking with ease. MLflow debuted at the 2018 Spark + AI Summit as an open source project from Databricks, released under the Apache 2.0 license. MLflow allows a Data Scientist to automate much of the bookkeeping around model development. Through MLflow, the optimal model can be selected with greater ease using a tracking server. Parameters, attributes, and performance metrics can all be logged to this server and can then be used to quickly query for models that fit particular criteria. Airflow and MLflow are quickly becoming industry staples for automating the implementation, integration, and development of Machine Learning models.
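To make the log-then-query idea concrete, here is a minimal sketch of a tracking store. It deliberately uses only the Python standard library and is not MLflow's API: `Run`, `RunStore`, and the run names, parameters, and metric values are all hypothetical stand-ins for what a real tracking server would hold.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    """One logged training run: its parameters and resulting metrics."""
    name: str
    params: dict = field(default_factory=dict)
    metrics: dict = field(default_factory=dict)


class RunStore:
    """Hypothetical in-memory stand-in for a tracking server (not MLflow's API)."""

    def __init__(self):
        self.runs = []

    def log(self, name, params, metrics):
        """Record one run with copies of its parameters and metrics."""
        run = Run(name, dict(params), dict(metrics))
        self.runs.append(run)
        return run

    def query(self, metric, max_value, **param_filters):
        """Return runs with metric <= max_value whose params match, best first."""
        matches = [
            run for run in self.runs
            if metric in run.metrics
            and run.metrics[metric] <= max_value
            and all(run.params.get(k) == v for k, v in param_filters.items())
        ]
        return sorted(matches, key=lambda run: run.metrics[metric])


store = RunStore()
# Made-up runs purely for illustration.
store.log("ridge-a", {"model": "ridge", "alpha": 0.1}, {"rmse": 0.42})
store.log("ridge-b", {"model": "ridge", "alpha": 1.0}, {"rmse": 0.35})
store.log("tree-a", {"model": "tree", "max_depth": 4}, {"rmse": 0.31})

# Ask for ridge models with an RMSE of at most 0.40.
best_ridge = store.query("rmse", max_value=0.40, model="ridge")
print([run.name for run in best_ridge])
```

The point of the sketch is the shape of the workflow: every experiment writes its parameters and metrics to one place, and model selection becomes a filter over that record rather than a search through notebooks.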
Although MLflow is a powerful tool for sorting through logged models, it does little to answer the question of which models should be made. This is a more difficult question because, depending on your model, training may take a sizable amount of resources, hyper-parameters could be unintuitive, or both. Even these problems can, in part, be automated away.
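One way to automate the choice of hyper-parameters is Sequential Model-Based Optimization: fit a cheap surrogate to the results seen so far, use it to pick the next setting worth an expensive training run, and repeat. The sketch below is a deliberately simple illustration under stated assumptions, not a production optimizer. The objective `validation_loss` is a made-up stand-in for a real train-and-validate step, the surrogate is a plain kernel-weighted average, and the distance to the nearest tried point serves as a crude proxy for uncertainty.

```python
import math


def validation_loss(log_lr):
    """Hypothetical stand-in for an expensive train-and-validate step."""
    return (log_lr + 2.5) ** 2 + 0.1


def surrogate_mean(x, observed, bandwidth=0.5):
    """Cheap surrogate: kernel-weighted average of the losses seen so far."""
    weights = [math.exp(-((x - xi) / bandwidth) ** 2) for xi, _ in observed]
    total = sum(weights)
    if total == 0.0:
        # Far from every observation: fall back to the plain average.
        return sum(y for _, y in observed) / len(observed)
    return sum(w * y for w, (_, y) in zip(weights, observed)) / total


def acquisition(x, observed, kappa=1.0):
    """Lower confidence bound: predicted loss minus a bonus for unexplored regions."""
    distance = min(abs(x - xi) for xi, _ in observed)
    return surrogate_mean(x, observed) - kappa * distance


def smbo(objective, low, high, initial, n_iter=10, n_candidates=101):
    """Evaluate the initial points, then spend n_iter evaluations where the surrogate suggests."""
    observed = [(x, objective(x)) for x in initial]
    step = (high - low) / (n_candidates - 1)
    candidates = [low + i * step for i in range(n_candidates)]
    for _ in range(n_iter):
        tried = {x for x, _ in observed}
        untried = [c for c in candidates if c not in tried]
        # Pick the candidate the acquisition function rates most promising.
        next_x = min(untried, key=lambda c: acquisition(c, observed))
        observed.append((next_x, objective(next_x)))
    return observed


history = smbo(validation_loss, low=-5.0, high=0.0, initial=[-5.0, -1.0, 0.0])
best_x, best_loss = min(history, key=lambda pair: pair[1])
print(f"evaluations={len(history)} best log_lr={best_x:.2f} loss={best_loss:.3f}")
```

Practical SMBO implementations typically swap in a stronger surrogate, such as a Gaussian process or tree-based model, but the loop of fit, propose, evaluate stays the same.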
Automation With Airflow
When discussing automation, especially automation using Python, one tool is mentioned far more often than most: Airflow. Airflow was started in 2014 by Maxime Beauchemin at Airbnb. The project joined the Apache Software Foundation's Incubator program in 2016, and in 2019 Apache Airflow was announced as a Top-Level Project.
But what does Airflow do? Apache Airflow makes working through DAGs (Directed Acyclic Graphs) a breeze. Some of you may be wondering how this impacts machine learning. Great question! Airflow has been called the "Swiss army knife of data pipelines". Airflow allows the user to create a DAG of modular operators. That is to say, Airflow allows someone to create an entire workflow, complete with operation start times and steps to perform in case of errors.
This allows independent operations to be carried out in parallel, such as calling a model for a prediction while loading the next web page for a user. It transfers to distributed systems easily and scales up with little to no issues. Airflow has become a staple in automating machine learning processes from ETL to production and has been used by everyone from Adobe to United Airlines. Implementation and integration of machine learning models aren't the only things being automated through pervasive Python packages.
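The DAG idea described above can be sketched without Airflow itself. The example below uses only the standard library's `graphlib` to order tasks by their dependencies and retries a task when it fails. The task names, the simulated failure, and the `run_pipeline` helper are all hypothetical, and a real Airflow deployment would add scheduling, parallel execution, and monitoring on top of this.

```python
from graphlib import TopologicalSorter


def run_pipeline(tasks, dependencies, max_retries=1):
    """Run callables in dependency order, retrying each failed task up to max_retries times."""
    order = list(TopologicalSorter(dependencies).static_order())
    results = {}
    for name in order:
        for attempt in range(max_retries + 1):
            try:
                results[name] = tasks[name](results)
                break
            except Exception:
                if attempt == max_retries:
                    raise  # Out of retries: surface the failure.
    return order, results


attempts = {"extract": 0}


def extract(_results):
    """Simulate a source that fails on the first call and succeeds on the second."""
    attempts["extract"] += 1
    if attempts["extract"] == 1:
        raise ConnectionError("simulated transient failure")
    return [1.0, 2.0, 3.0, 4.0]


def validate(results):
    """Check that the extract step produced at least one row."""
    return len(results["extract"]) > 0


def transform(results):
    """Center the extracted values on their mean."""
    rows = results["extract"]
    mean = sum(rows) / len(rows)
    return [value - mean for value in rows]


def train(results):
    """Toy 'model': the mean absolute deviation of the transformed data."""
    data = results["transform"]
    return sum(abs(value) for value in data) / len(data)


def report(results):
    """Summarize the outcome once training and validation have both finished."""
    return f"model score: {results['train']:.2f}"


tasks = {"extract": extract, "validate": validate, "transform": transform,
         "train": train, "report": report}
# Each key lists the tasks that must finish before it can start.
dependencies = {
    "transform": {"extract"},
    "validate": {"extract"},
    "train": {"transform"},
    "report": {"train", "validate"},
}

order, results = run_pipeline(tasks, dependencies)
print(order)
print(results["report"])
```

Here `validate` and `transform` depend only on `extract`, so a scheduler would be free to run them at the same time. This sketch runs them one after another for simplicity.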
Automated Machine Learning Shouldn't Be A Catch-All
As automation becomes more sophisticated, it becomes tempting to abstract away most of the steps used in developing machine learning models. Already, products like Cloud AutoML by Google and Azure Machine Learning's AutoML feature offer users a simplified experience developing machine learning models. Although these services certainly have their place, automating machine learning can introduce pitfalls that the user should be aware of. The first of these is understanding the data.
An autoML system cannot be expected to understand the quality of the data input. As the provider of the data, it's important to know that your data are representative of the population you intend to model. A simple search for "racist AI" reveals that many projects have suffered from this oversight.
In many of these cases, the data sets these AI were trained on were likely either too small or skewed. With someone to look over the data as it gets processed, this kind of mishap can often be caught.
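A first pass at this kind of check can itself be scripted. The following is a minimal sketch, assuming you already have a reference breakdown of the population you intend to model. The group labels, reference shares, and 5% tolerance are made up for illustration, and `representation_gaps` is a hypothetical helper rather than part of any autoML product.

```python
from collections import Counter


def representation_gaps(sample_labels, population_shares, tolerance=0.05):
    """Return groups whose share of the sample differs from the reference share by more than tolerance."""
    counts = Counter(sample_labels)
    total = len(sample_labels)
    gaps = {}
    for group, expected in population_shares.items():
        observed = counts.get(group, 0) / total
        if abs(observed - expected) > tolerance:
            # Positive means over-represented, negative means under-represented.
            gaps[group] = round(observed - expected, 3)
    return gaps


# Hypothetical training sample and reference population shares.
sample = ["A"] * 80 + ["B"] * 15 + ["C"] * 5
reference = {"A": 0.60, "B": 0.25, "C": 0.15}

gaps = representation_gaps(sample, reference)
print(gaps)
```

A check like this only flags skew along dimensions you thought to measure, so it supplements a person looking over the data and does not replace them.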
Processing and analyzing data reveals a lot more than just data quality, and autoML can inhibit these insights. Often, while working with data, an Analyst or Data Scientist might notice an interesting correlation or peculiar data points that they could then research further.
The autoML process lacks domain knowledge and comprehension of the intended application. These can in part be substituted by some clever mathematical adjustments to the model, but adjusting the model manually reduces the usefulness of autoML in the first place. Despite these drawbacks, autoML can be incredibly useful in cases where data quality is not a concern, especially if there are not enough data-proficient staff to process and analyze the data.
Automation and machine learning have been tied together since the beginning. The question is never, "Should I use automation in my machine learning project?" Rather, it is, "How much automation should I use?" The benefits of appropriately applying automation to a project can be profound. Using Airflow and MLflow, the machine learning life cycle can be tightened, and procedurally generated models and experiments can be explored with greater convenience. Including mathematical principles such as Sequential Model-Based Optimization can tighten the machine learning life cycle even further. The combination of extant automation tools and principles can, in fact, remove much of the needed skill. AutoML tools are incredibly powerful and grant a higher degree of access.
Automation And Machine Learning Are Tools And Need To Be Treated As Such
Excessive automation tends to increase the distance between the data professional and the ground truth data set. This inhibits data insights unless time is purposefully invested in having that professional examine the data beforehand. In the worst case, heavy-handed automation could lead to a model that actively hurts its intended user base. Despite the significance of these pitfalls, they by no means suggest automation should be avoided or viewed in a negative light.
Automation, much like any tool we use, is just that: a tool. It can be used well or used poorly. As automation in machine learning becomes even more sophisticated, the best thing we can do is continue to learn. Understanding the automation process and how it integrates with machine learning makes it easier to see where potential pitfalls might be and gives us a better understanding of when these tools should be used.
Top comments (3)
I agree these ML tools should be treated as just tools, but understanding the principles helps a lot and makes it much easier. I'm actually quite amazed that it is only recently that there's an emphasis on DevOps mindset/principles being adopted for Data Science.
I agree with your DevOps comment, especially as concepts like MLOps become more popular. However, I am not surprised it took a decade or so XD.