We’re living in unprecedented times where, in a matter of a few weeks, things changed dramatically for many people and businesses across the globe. With COVID-19 spreading around the world and taking human lives, we are seeing record jumps in unemployment and small-business bankruptcies.

Today, AI is increasingly being applied by companies across industries, but AI is not the easiest technology to operationalize. Most production AI systems are patchworks of proprietary, open-source, and cloud-based technology amassed organically over time. However, the past few years have seen the emergence of GUI-based AI tools and open-source libraries that help enterprises less inclined to build in-house successfully train and deploy AI models.

As these tools have surfaced, companies have come to realize that training and deploying AI is only the first step: they must then monitor and manage their deployed models to ensure risk-free and reliable business outcomes. With the rise of higher-performing black-box models, the need to govern these models has become both more necessary and more challenging. Increasingly, companies are learning that models require ongoing attention once in production.

Indeed, because their performance can degrade over time due to changes in input data post-deployment, models require continuous monitoring to ensure their fidelity while in production. And while many existing monitoring technologies provide real-time issue visibility, they are often insufficient to identify the root cause of issues within complex AI systems.

Lack of a Feedback Loop

Most organizations are only able to discover issues with production ML systems after it’s too late and the damage has been done. In some cases, production issues can persist undetected until the ultimate business metrics powered by the ML system decline.

Most of AI today is unmonitored (Image by Author)

Instead of relying on downstream business metrics as the indicator of an upstream model performance issue, businesses can get ahead of potential issues by monitoring leading indicators, including prediction drift, feature drift, and input data errors. Tracking these leading indicators and being able to identify unexpected shifts allows an ML Ops team to investigate in real time rather than after the fact. But tracking the right metrics only solves half the problem. Once a shift is detected, an investigation or root cause analysis should be conducted as quickly as possible. To ensure a speedy and accurate root cause analysis, AI explainability can be used to identify the underlying causes of the issue and the course of action (e.g., retrain the model on new data, fix a data pipeline) that should be taken.

The architecture of an Explainable Monitoring System (Image by Author)
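
To make these leading indicators concrete, here is a minimal sketch of how drift in a model's predictions (or in a single feature) could be quantified with the Population Stability Index, comparing a recent production window against a training baseline. The function, the synthetic data, and the 0.2 alert threshold are illustrative assumptions, not a prescription.

```python
import numpy as np

def population_stability_index(baseline, production, bins=10):
    """Quantify distribution shift between a baseline (training) sample
    and a production sample for a single numeric feature or model score."""
    # Bin edges come from the baseline so both samples are compared on the same grid.
    edges = np.histogram_bin_edges(baseline, bins=bins)
    base_counts, _ = np.histogram(baseline, bins=edges)
    prod_counts, _ = np.histogram(production, bins=edges)

    # Convert counts to proportions; a small epsilon avoids division by zero / log(0).
    eps = 1e-6
    base_pct = base_counts / max(base_counts.sum(), 1) + eps
    prod_pct = prod_counts / max(prod_counts.sum(), 1) + eps

    return float(np.sum((prod_pct - base_pct) * np.log(prod_pct / base_pct)))

# Illustrative alerting rule: flag the model's score (or a feature) when PSI
# exceeds a team-chosen threshold, e.g. 0.2.
baseline_scores = np.random.default_rng(0).normal(0.4, 0.1, 10_000)
production_scores = np.random.default_rng(1).normal(0.55, 0.1, 10_000)
psi = population_stability_index(baseline_scores, production_scores)
if psi > 0.2:
    print(f"Prediction drift detected: PSI={psi:.2f}")
```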

Gaps in traditional Monitoring solutions

Today, there are two primary approaches to monitoring production software:

  • Service or infrastructure monitoring used by DevOps to get broad operational visibility and service health.
  • Business metrics monitoring via telemetry used by business owners to track business health.

Unfortunately, these methods fall short for ML systems, whose performance, unlike that of traditional software systems, is non-deterministic and depends on factors such as seasonality, new user behavior trends, and often extremely high-dimensional upstream data. For instance, a perfectly functioning ads model might need to be updated when a new holiday season arrives. Similarly, a model trained to show content recommendations in the US may not do very well for users signing up internationally.

The advent of AI brings about the need for Model Performance Monitoring (Image by Author)

Challenges unique to Model Monitoring

1. Model decay. Unlike other software, ML model performance can decay over time. Monitoring model outcomes against ground truth, when it is available, provides immediate notification of changes in business impact. Being able to monitor model decay helps teams know when it is time to refresh the model.

Model Performance Drift over time (Image by Author)
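
As a rough illustration, once ground-truth labels arrive (often with a delay), performance can be recomputed over rolling time windows; a sustained drop signals decay. The column names, window size, and toy data below are assumptions made for the sketch.

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

# Assumed log of production predictions joined with (delayed) ground-truth labels.
log = pd.DataFrame({
    "timestamp": pd.date_range("2020-01-01", periods=8, freq="W"),
    "score": [0.9, 0.8, 0.85, 0.7, 0.65, 0.6, 0.55, 0.5],
    "label": [1, 1, 0, 1, 0, 0, 1, 0],
})

# AUC per four-week window: a sustained drop signals decay and a possible retrain.
windowed_auc = log.groupby(pd.Grouper(key="timestamp", freq="4W")).apply(
    lambda w: roc_auc_score(w["label"], w["score"])
    if w["label"].nunique() == 2 else float("nan")
)
print(windowed_auc)
```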

2. Data drift. Although ML models are trained on specific data (e.g., ages 20–60), they can encounter different data in production (e.g., ages 60–80) and consequently make sub-optimal predictions.

Types of Drift (Image by Author)
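
For a single numeric feature such as age, one simple way to detect this kind of shift is a two-sample Kolmogorov–Smirnov test between the training sample and a recent production window; the synthetic data and the 0.01 significance threshold below are assumptions for illustration.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)

# Training data covered ages roughly 20-60; production traffic has shifted older.
train_age = rng.uniform(20, 60, size=5_000)
prod_age = rng.uniform(60, 80, size=1_000)

# Two-sample KS test: a large statistic / tiny p-value indicates the production
# distribution no longer matches what the model was trained on.
stat, p_value = ks_2samp(train_age, prod_age)
if p_value < 0.01:
    print(f"Data drift on 'age': KS statistic={stat:.2f}, p={p_value:.1e}")
```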

3. Data integrity. Business data is dynamic and its composition is constantly changing. This can have an adverse impact on ML model performance, especially with automated data pipelines, and data inconsistencies can often go unnoticed in deployed AI systems.

Feature Distribution Screenshot (Image by Author)
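
Below is a lightweight sketch of the kind of integrity checks an automated pipeline could run on every incoming batch; the expected schema and value ranges are assumptions for the example.

```python
import pandas as pd

# Expected valid range per feature, derived from training data (illustrative).
EXPECTED_RANGES = {
    "age": (18, 100),
    "income": (0, 1e7),
}

def integrity_report(batch: pd.DataFrame) -> dict:
    """Flag missing columns, null values, and out-of-range values in a batch."""
    report = {}
    for col, (lo, hi) in EXPECTED_RANGES.items():
        if col not in batch.columns:
            report[col] = "missing column"
            continue
        values = batch[col]
        report[col] = {
            "null_rate": float(values.isna().mean()),
            "out_of_range_rate": float(((values < lo) | (values > hi)).mean()),
        }
    return report

batch = pd.DataFrame({"age": [25, None, 240], "income": [50_000, 60_000, -10]})
print(integrity_report(batch))
```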

4. Outliers. Deployed ML models can run into data that is far outside the training distribution. These outliers can cause isolated performance issues that are difficult to debug globally, and pinpointing them in real time provides insight for addressing issues right away. Outlier detection is a well-studied problem with many applicable techniques, but it becomes harder in the context of ML model performance: outliers must be treated as a multivariate problem across a large number of variables, and their impact on the model's behavior must be assessed, i.e., whether they cause the model to act erratically.

Model Monitoring Screenshot (Image by Author)
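
One common way to sketch multivariate outlier detection is an Isolation Forest fit on the training features and applied to live traffic; the features, contamination rate, and injected extreme point below are assumptions, and in practice the flagged rows would be joined with model outputs to check for erratic behavior.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(7)

# Training features (e.g. two numeric inputs the model was trained on).
X_train = rng.normal(0, 1, size=(5_000, 2))

# Fit the detector on the training distribution only.
detector = IsolationForest(contamination=0.01, random_state=0).fit(X_train)

# Score live traffic: predict() returns -1 for points far outside the training distribution.
X_prod = np.vstack([rng.normal(0, 1, size=(100, 2)),
                    np.array([[8.0, -9.0]])])   # one injected extreme point
flags = detector.predict(X_prod)
outlier_rows = np.where(flags == -1)[0]
print("Outlier rows in production batch:", outlier_rows)
```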

5. Bias. Even with monitoring for data changes, a model's true impact on protected groups can shift after validation, i.e., an ML model could become biased after deployment. The first defense could be to drop protected attributes (e.g., race, gender) during the training process, but models can still exhibit bias through other features that are highly correlated with the protected attributes. What is needed is continuous tracking of model fairness, with these metrics computed on the fly and in real time. Bias definitions (equality of opportunity, equality of outcome, etc.) vary from organization to organization and from problem to problem, as there is no unified definition of fairness. Therefore, the system should support a pluggable policy and enforce it continuously to detect potential bias issues. If bias is detected, it is important to drill down into the causes to determine whether the model needs to be replaced or whether there is a data pipeline issue.

Model Evaluation Screenshot (Image by Author)
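
Here is a minimal, hedged sketch of computing two common group fairness metrics (demographic parity and equal opportunity gaps) over a production log; the column names, toy data, and choice of metrics stand in for whatever policy an organization plugs in.

```python
import pandas as pd

# Assumed production log: model decision, ground truth (when available), protected attribute.
log = pd.DataFrame({
    "group":    ["A", "A", "A", "B", "B", "B"],
    "decision": [1,   0,   1,   0,   0,   1],
    "label":    [1,   0,   1,   1,   0,   1],
})

# Demographic parity: positive-decision rate per group.
positive_rate = log.groupby("group")["decision"].mean()

# Equal opportunity: true-positive rate per group (decisions among truly positive cases).
tpr = log[log["label"] == 1].groupby("group")["decision"].mean()

parity_gap = positive_rate.max() - positive_rate.min()
tpr_gap = tpr.max() - tpr.min()
print(f"Demographic parity gap: {parity_gap:.2f}, equal opportunity gap: {tpr_gap:.2f}")

# A pluggable policy could alert when either gap exceeds an organization-defined threshold.
```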

What is Explainable Monitoring?

A robust AI monitoring system requires integration with the model serving infrastructure to guard against the five operational challenges above. It lets users easily review monitored output in real time to spot KPI and other issues, or act on alerts. Investigating flagged operational ML issues often takes a lot of effort, and the black-box nature of ML models makes them especially difficult for ML developers to understand and debug.

An Explainable ML Monitoring system extends traditional monitoring to provide deep model insights with actionable steps. With such a system, users can understand the drivers of a problem, root-cause issues, and analyze the model to prevent a repeat, saving considerable time.

We think such a system should exhibit three key properties:

Comprehensive. An explainable ML monitoring system should cover all the essential leading indicators of model performance as well as the performance metrics themselves. In addition to being statistically comprehensive, an ideal system provides intuitive user interfaces for both technical (model developers, ML Ops) and non-technical (analysts, business owners) stakeholders.

Pluggable. Teams should be able to integrate the monitoring system with existing data and AI infrastructure and the most common open-source ML frameworks (scikit-learn, PyTorch, TensorFlow, Spark, etc.) to quickly see actionable results.
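
As an illustration of what pluggability can look like, a monitoring hook can often be added as a thin wrapper around whatever predict function the serving framework already exposes; the log_event function below is a hypothetical stand-in for a real monitoring client.

```python
import time
from typing import Any, Callable

def log_event(payload: dict) -> None:
    """Hypothetical monitoring client; a real system would ship this to a backend."""
    print(payload)

def monitored(predict_fn: Callable[[Any], Any], model_name: str) -> Callable[[Any], Any]:
    """Wrap any framework's predict function (scikit-learn, PyTorch, TensorFlow, ...)
    so every call logs inputs, outputs, and latency for downstream monitoring."""
    def wrapper(features):
        start = time.time()
        prediction = predict_fn(features)
        log_event({
            "model": model_name,
            "features": features,
            "prediction": prediction,
            "latency_ms": (time.time() - start) * 1000,
        })
        return prediction
    return wrapper

# Usage with a scikit-learn-style model (assumed to exist):
# predict = monitored(model.predict, model_name="churn_v3")
# predict(X_batch)
```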

Actionable. Users should be able to derive actionable insights into production issues. Live explanations with deeper analytics are essential to quickly uncovering the ‘why’ and ‘how’ of model behavior. A flood of alerts creates noise, so it is paramount that the system gives users the controls to configure alerts only for shifts that require action.
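
One hedged way to connect an alert to the ‘why’ is to compare feature attributions between a baseline window and the drifted window, for example with the open-source shap library and a tree model; the model, data, and simulated drift below are placeholders for a real deployment.

```python
import numpy as np
import shap                                   # open-source explainability library
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Placeholder model and data standing in for a deployed model and its traffic.
X_train = rng.normal(size=(1_000, 3))
y_train = X_train[:, 0] * 2 + rng.normal(scale=0.1, size=1_000)
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train, y_train)

X_baseline = rng.normal(size=(200, 3))
X_drifted = X_baseline.copy()
X_drifted[:, 0] += 1.5                        # simulate drift in feature 0

# Mean absolute SHAP attribution per feature, before and after the shift.
explainer = shap.TreeExplainer(model)
attr_base = np.abs(explainer.shap_values(X_baseline)).mean(axis=0)
attr_drift = np.abs(explainer.shap_values(X_drifted)).mean(axis=0)

# The feature whose attribution changed most is the first place to look
# when deciding between retraining and fixing an upstream pipeline.
print("Attribution change per feature:", attr_drift - attr_base)
```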

The financial stakes associated with AI are immense. Trust is part of those stakes, and it is more easily lost than gained. We’ve seen what black-swan events like COVID-19 can do to businesses. If your AI products run unmonitored, you could be delivering bad decisions to your customers. Visibility, therefore, is going to be extremely important!

That’s why we decided to start Fiddler.AI and build a system that we think can help bridge this gap. Feel free to reach out and learn more about it by sending an email to info@fiddler.ai.