Avoid these common ML monitoring mistakes – your model’s success hangs in the balance. 

So you’ve built a machine learning model that works well in the lab. You’ve validated it, gotten the green light from internal stakeholders, ensured that it meets all applicable regulatory requirements, and finally pushed it live. High fives all around. So what’s next? For too many companies, the answer is: not much. But placing blind faith in your model’s real-world performance is hugely risky, leaving you exposed to serious business and regulatory threats.

The sudden, ongoing data shifts caused by the pandemic’s twists and turns continue to stress and break models across industries worldwide, underscoring a reality that has always existed: model performance needs to be actively managed. Models degrade over time, so they require continuous monitoring to remain effective in production. And while many existing “observability” technologies provide some visibility into emerging threats, they often fall short of what organizations truly need.

So where are organizations going astray when it comes to monitoring?

ML Monitoring Mistakes to Watch Out For

Mistake #1 – We don’t monitor. It worked well in the lab, so it’ll keep working in production, right?

Too many organizations are flying blind when it comes to machine learning models in production. Based on the confidence built during the lab development phase, they assume that their model will perform roughly the same once it moves to production. Time and again, this proves not to be the case. The data is constantly changing. Bad actors might be stressing your application, trying to get around it. The meaning of the data itself might shift, so its implications are now different. Accuracy slips. Bias creeps in. By not monitoring, you are not only risking the business impact of model error or outright failure, but also inviting a regulatory investigation or a lawsuit from customers who believe that they were unfairly treated by your AI. In an environment where AI is increasingly mistrusted, lack of performance visibility could be fatal.

(Image: Dev -> Validation -> Production -> ?)

Mistake #2 – Our infrastructure ops and BI systems will catch any problems. 

The second mistake that organizations make is to assume that they don’t need to monitor their models directly. Instead, they believe that they can catch model drift indirectly – by leveraging their infrastructure analytics or by watching the business key performance indicators (KPIs) already tracked in their BI systems. The assumption is that any issues will eventually show up somewhere, so why invest in monitoring models directly? Unfortunately, this is also unwise. First, it won’t catch the broad range of issues that a model can run into, because those systems were set up to analyze different things. Second, the information will come too late and be too vague. Sure, your credit approval rate has faltered, and you suspect your AI system is the cause, but what specifically is going wrong? And you’ve only realized this a month after your model started malfunctioning. The lack of direct model monitoring means that you’re operating in a data hole. That hole is what we call the “ML Monitoring Gap.”

If you don’t have broad visibility into your model performance directly and are relying on infrastructure or KPI analytics, you have fallen into the ML Monitoring Gap.

Mistake #3 – Monitoring is just about identifying drift and having a few alerts. That’s enough, right?

A raft of solutions claiming to offer ML monitoring have emerged, providing dashboards and alerts for negative changes in live models. These can help you to identify when a model might be experiencing problems: they can alert you to accuracy slips or some common types of drift. There are two challenges with this. 

  • First, accuracy and simple drift metrics are not enough to evaluate the full health and performance of your model. There are many risks that are simply not being properly evaluated, such as fairness. You need a monitoring solution that has a comprehensive view of what could be going wrong with your models, so that you can catch problems quickly.
  • Second, there is a huge drawback to solutions that only let you know that a problem exists. They don’t tell you why it exists or where specifically to look to fix it. The answer may be hidden in an obscure place and take weeks to find. For these reasons, observability alone only solves part of the monitoring problem. It leaves the most important part – “Now, how do I fix it?” – completely in the dark.
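
To make the “where do I look?” question concrete, here is a minimal sketch of the kind of drill-down that turns a vague alert into a diagnosis: rather than stopping at a top-line accuracy number, it breaks the error rate out by segment so you can see where a problem concentrates. The column names (“segment”, “label”, “prediction”) are hypothetical placeholders, not a reference to any specific tool.

```python
# A minimal root-cause drill-down sketch: break top-line error out by segment.
# Column names ("segment", "label", "prediction") are hypothetical placeholders.
import pandas as pd

def error_by_segment(df: pd.DataFrame, segment_col: str = "segment") -> pd.DataFrame:
    """Compare each segment's error rate and volume against the overall error rate."""
    df = df.assign(error=(df["label"] != df["prediction"]).astype(int))
    overall = df["error"].mean()
    return (
        df.groupby(segment_col)["error"]
          .agg(error_rate="mean", volume="count")
          .assign(lift_vs_overall=lambda s: s["error_rate"] - overall)
          .sort_values("lift_vs_overall", ascending=False)
    )
```

In practice, the same breakdown can be run over time windows, data sources, or model versions to localize when and where the degradation began.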

The key things to watch for when monitoring ML models

There are multiple areas where a model can go wrong in production. Catching these quickly and resolving the underlying issues are key to ensuring model uptime and effectiveness. 

  • Segment analysis and drift: identifies important data or business segments and tracks how the distribution of instances shifts between training and operational data
  • Accuracy: analyzes how well the model is performing against its intended result
  • Data quality: includes situations like wrong data types, missing data, changes in how missing values are encoded, transposed data, unexpected or out-of-range values, etc.
  • Conceptual soundness and drift: analyzes and monitors the fundamental relationships between the inputs and outputs of the model and how these change over time.
  • Fairness: identifies and monitors potential bias against protected groups, such as groups identified by gender, ethnicity, national origin, religion, etc.
  • Estimated accuracy: if accuracy can’t be measured directly (for example, because ground-truth labels arrive with a delay), it can still be estimated by comparing the distributional shift between test and operational data.
  • Explainability: the ability to determine precisely how the model is determining its results. Strong explainability capabilities are key to identifying the root cause of model issues in production.
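
To make a couple of these checks concrete, here is a minimal sketch of per-feature drift scoring (using the Population Stability Index) combined with basic data-quality screening. The helper names, bin count, and structure are illustrative assumptions, not the API of any particular monitoring product.

```python
# A minimal sketch of two checks from the list above: per-feature drift (PSI)
# and basic data-quality screening. Helper names and bin count are illustrative.
import numpy as np
import pandas as pd

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between a reference sample and a live sample."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    # Count live values outside the training range in the edge bins.
    actual = np.clip(actual, edges[0], edges[-1])
    exp_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    act_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    exp_pct = np.clip(exp_pct, 1e-6, None)  # avoid log(0) for empty bins
    act_pct = np.clip(act_pct, 1e-6, None)
    return float(np.sum((act_pct - exp_pct) * np.log(act_pct / exp_pct)))

def monitoring_report(train: pd.DataFrame, live: pd.DataFrame) -> pd.DataFrame:
    """Per-feature drift and data-quality summary for numeric features."""
    rows = []
    for col in train.select_dtypes(include="number").columns:
        rows.append({
            "feature": col,
            "psi": psi(train[col].dropna().values, live[col].dropna().values),
            "live_missing_rate": live[col].isna().mean(),
            "live_out_of_range": (
                (live[col] < train[col].min()) | (live[col] > train[col].max())
            ).mean(),
        })
    return pd.DataFrame(rows).sort_values("psi", ascending=False)
```

A common rule of thumb treats a PSI below roughly 0.1 as stable, 0.1–0.25 as worth watching, and above 0.25 as a significant shift, but the right thresholds depend on the feature and the business context.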

Getting it right – and getting to the root cause

In addition to knowing what to monitor for, there are a number of considerations to keep in mind when evaluating how to meet your machine learning monitoring needs.

  1. Does it have a broad range of analytics?
    There are myriad ways that a machine learning model can go awry in production. Paying close attention to accuracy without looking at data quality or fairness, for example, gives you only a narrow view of model health.
  2. How can I get relevant alerts and not suffer from alert fatigue?
    Alerting and notifications sound great, until you receive your 1,000th alert for a possible problem, only to realize that it’s not really a problem at all. Are you able to set appropriate thresholds so that you’re only receiving alerts that matter? Can the system ensure that you are only evaluating significant problems in the first place? (A simple sketch of this kind of filtering follows this list.)
  3. Does it pinpoint how to solve the problem, not just identify problems?
    The ability to do root cause analysis is critical. Getting an alert, or reviewing a dashboard and seeing a problem emerging, is not enough to rapidly resolve the issue and get your model back into production. Without the ability to understand exactly where the problem is, data science teams can spend weeks hunting for the true cause, with the model out of commission all the while.
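
As an illustration of the alert-fatigue point above, a monitoring setup can gate notifications on both the size of a shift and the amount of live data behind it. The thresholds below are hypothetical starting points, and the example assumes the monitoring_report() sketch from the previous section.

```python
# Illustrative alert filter: only surface drift that is both large and backed
# by enough live data. Thresholds are hypothetical and should be tuned per
# feature and per business segment; assumes the monitoring_report() sketch above.
PSI_ALERT_THRESHOLD = 0.2   # ignore small distribution shifts
MIN_LIVE_SAMPLES = 500      # too few rows to call anything a problem yet

def alerts(report, live_sample_count: int) -> list[str]:
    if live_sample_count < MIN_LIVE_SAMPLES:
        return []
    drifted = report[report["psi"] > PSI_ALERT_THRESHOLD]
    return [
        f"Feature '{row.feature}' drifted (PSI={row.psi:.2f}); "
        f"missing rate {row.live_missing_rate:.1%}, "
        f"out-of-range rate {row.live_out_of_range:.1%}"
        for row in drifted.itertuples()
    ]
```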

Monitoring is part of a comprehensive lifecycle view of AI Quality

The reality is that monitoring is not a standalone activity. It exists as part of a full model lifecycle. At TruEra, our solutions embrace the concept of continuous model evaluation instead of a reactive, standalone monitoring approach. With a continuous approach, ML teams proactively evaluate the quality of models before they move into production and then use that information to set up more precise and actionable monitoring. They also more effectively debug production issues and use the results as feedback into an iterative model development process or into operational updates of the existing ML application. 

This continuous AI quality evaluation approach offers real and significant benefits. Organizations can identify and act on emerging issues faster and resolve them more effectively. Resource-constrained ML teams can reduce alert fatigue and avoid unnecessary rabbit-hole debugging exercises. Teams can quickly identify short- and long-term model improvements. And best of all, it improves the ROI on machine learning initiatives by raising both initial and ongoing ML success rates.

What is AI Quality? It’s a framework for systematically improving your ML performance. Check out the blog “What is AI Quality? A framework.”

This blog has been republished by AIIA. To view the original article, please click here: https://truera.com/ml-monitoring-why-it-matters-and-how-to-get-it-right/