"A baby learns to crawl, walk and then run. We are in the crawling stage when it comes to applying machine learning." ~ Dave Waters
Machine learning monitoring is the final phase of the machine learning system lifecycle. It is essential for making sure that a healthy, useful model is serving predictions to your users, and it helps you identify issues with the model before they have a significant impact on those users.
Monitoring means different things to different teams working on machine learning projects. Data scientists are generally interested in the statistical properties of the input data and the generated predictions, while machine learning engineers and software engineers are more interested in operational monitoring.
Machine learning models are software: they require testing, maintenance, and monitoring before and after they reach production, like any other software. As a data scientist or machine learning engineer, once you have deployed a model to production it rapidly becomes apparent that the work is not over. You have to watch the behavior of the deployed model, both to avoid unexpected points of failure and to ensure that its predictions stay accurate and effective.
At ground level, monitoring is about collecting observed events and presenting visualized statistics about them over time. Based on that monitoring, we should be able to define alerts on certain values of the monitored metrics, along with the actions to perform when they fire, such as sending an alert email or SMS message.
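To make the idea of "collect events, track a statistic over time, alert on a threshold" concrete, here is a minimal sketch in plain Python. The class name `MetricMonitor`, the `error_rate` metric, the window size, and the threshold are all hypothetical choices for illustration, and the alert action is just a callback that appends to a list where a real system would send an email or SMS.

```python
from collections import deque
from typing import Callable, Deque, List


class MetricMonitor:
    """Keeps the most recent values of one metric and alerts on their rolling mean."""

    def __init__(self, name: str, threshold: float, window: int,
                 on_alert: Callable[[str], None]) -> None:
        self.name = name
        self.threshold = threshold
        # Only the last `window` observations are kept.
        self.values: Deque[float] = deque(maxlen=window)
        # Hypothetical alert action; swap in an email or SMS sender.
        self.on_alert = on_alert

    def record(self, value: float) -> None:
        """Store one observed event and fire the alert if the window mean is too high."""
        self.values.append(value)
        if len(self.values) < self.values.maxlen:
            return  # Not enough data yet to judge the rolling statistic.
        mean = sum(self.values) / len(self.values)
        if mean > self.threshold:
            self.on_alert(
                f"{self.name}: rolling mean {mean:.2f} exceeds {self.threshold:.2f}"
            )


# Assumed example data: 1 means a failed request, 0 means a successful one.
sent: List[str] = []
monitor = MetricMonitor("error_rate", threshold=0.5, window=4, on_alert=sent.append)
for outcome in [0, 0, 1, 0, 1, 1, 1]:
    monitor.record(outcome)

for message in sent:
    print(message)
```

In practice a tool such as Prometheus would collect the metric and Grafana would chart it, but the underlying logic of a windowed statistic plus a threshold is the same.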
How do you know if your models are behaving as you expect them to?
And when customer behavior changes, does your model still predict appropriately?
Monitoring machine learning solutions is a complex and important task, made harder by the fact that machine learning is a rapidly evolving field in terms of both techniques and tools.
Common open-source tools for machine learning model monitoring include Prometheus, Grafana, and Boxkite.
Key parts in monitoring ML in production
There are essentially three key parts to monitoring machine learning models in a production environment:
- Service Monitoring: here you look at service-level metrics such as request throughput and error rate, and also at resource utilization such as I/O, CPU, and storage. For service monitoring, tools like New Relic can be used.
- Input Data Monitoring: here you look at what kind of data is coming into your machine learning model. The nature of the incoming data indicates when the model should be retrained because the data distribution has changed substantially.
- Prediction Monitoring: the process should begin by comparing the distribution of predictions on live data with the distribution of predictions on the training data, and it should continue by monitoring model performance throughout the model's entire lifespan serving customers. If the predictions on live data are vastly different from the predictions on the training data, this is a good indicator that discrepancies in the data exist in the live environment.
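One way to put a number on "the live distribution differs from the training distribution" is the Population Stability Index (PSI). This is a technique I am bringing in for illustration, not something prescribed above, and the function names, bin count, and sample data below are all assumptions. The same comparison works for an input feature or for the model's predictions.

```python
import math
from typing import List, Sequence


def histogram_fractions(values: Sequence[float], edges: Sequence[float]) -> List[float]:
    """Fraction of values per bin; out-of-range values fall into the first or last bin."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        idx = 0
        while idx < len(counts) - 1 and v >= edges[idx + 1]:
            idx += 1
        counts[idx] += 1
    return [c / len(values) for c in counts]


def population_stability_index(expected: Sequence[float], actual: Sequence[float],
                               bins: int = 10, eps: float = 1e-6) -> float:
    """PSI between a reference sample (training) and a new sample (live)."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # Guard against a constant reference sample.
    edges = [lo + i * width for i in range(bins + 1)]
    exp_frac = histogram_fractions(expected, edges)
    act_frac = histogram_fractions(actual, edges)
    psi = 0.0
    for e, a in zip(exp_frac, act_frac):
        # Clip empty bins to a small value so the logarithm stays defined.
        e, a = max(e, eps), max(a, eps)
        psi += (a - e) * math.log(a / e)
    return psi


# Assumed example data: a reference sample and a live sample shifted upwards.
training = [i / 100 for i in range(100)]
live_shifted = [0.5 + i / 200 for i in range(100)]

psi_same = population_stability_index(training, training)
psi_shifted = population_stability_index(training, live_shifted)
print(f"PSI, identical data: {psi_same:.3f}")
print(f"PSI, shifted data:   {psi_shifted:.3f}")
```

A commonly quoted rule of thumb treats a PSI above roughly 0.25 as a significant shift that may justify retraining, but that cut-off is a convention rather than a guarantee, so it is worth tuning against your own data.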
Final Thoughts
Monitoring systems give data scientists and machine learning engineers confidence that systems are running smoothly and, in the event of a system failure, can quickly provide the context needed to diagnose the root cause.
When deploying machine learning models, we want confidence that the model is making useful predictions in production. But points of failure can still arise, for example when the data distribution in production differs from the one used in model training, or when the model is misconfigured in production.
A point to remember: all machine learning models are software, so we should not ignore the best practices for treating software before and after production.
Thank you for making it to the end of this article. If you found it informative, feel free to share it with others.