What keeps you up at night? If you’re an ML engineer or data scientist, then drift is most likely right up there at the top of the list. But drift in machine learning comes in many forms and variations. Concept drift, data drift, and model drift all pop up on this list, but even they only scratch the surface of much more nuanced problems. In this article, we’ll review all the types of drift and the nuances of each of them, as well as best practices to monitor, detect, investigate, and resolve drift in machine learning.

Intro

“The only constant in life is change” (Heraclitus of Ephesus, ~500 BC)

Our world is constantly changing. And what’s true and obvious today may not be so tomorrow. This is undoubtedly on-point in the world of machine learning. The ultimate goal of machine learning models is to extract patterns from past data and use them to predict future behavior for unseen instances. These patterns, also referred to as ‘concepts,’ are fundamental to machine learning because they help classify data and recognize relationships between different variables.

Concept drift example: Sesame Street Bert vs. NLP BERT

When these relationships change in the real world, as they inevitably do, the patterns our model learned become invalid and can limit the model’s predictive power. This ‘model drift’ tends to happen when a model moves from development to a live production environment, when the data changes, or when real-world conditions change.

What is drift in machine learning?

We tend to assume that a model will make correct predictions once it goes into production. After all, if we didn’t think so, we would keep retraining the model until we felt it worked well enough to be deployed to the real world. However, most of us can attest that the real world waits for no one, and that holds true for the data our models run on. From day one, the data our models use to make predictions already differs from the data they were trained on. Depending on the degree of change, our models may suffer from model drift and model decay, pick up unwanted bias, or simply become suboptimal for the type of drift we are facing. Drift may signal that our results will worsen over time, or that performance is already suboptimal on specific slices of data or populations.

What causes drift in ML? It can be anything from errors in data collection to changes in the way people behave, or even time gaps that alter what is considered a good prediction and what is not. An example of model drift could be an ML model whose predictions help approve or reject loans for bank customers. In the past, most people requesting loans were between the ages of 32 and 40, with specific behavior profiles. But today, 80% of the loan requests are from people under the age of 30. In this case, the model must be optimized for the new mix of data, or it will offer incorrect predictions when it comes to who is a good loan candidate.
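To make the loan example concrete, here is a minimal sketch of how the shift in the age feature could be quantified with the Population Stability Index (PSI). The data, bin count, and cut-offs below are hypothetical and only for illustration:

```python
import numpy as np

def population_stability_index(expected, actual, bins=10):
    """Rough PSI between a baseline (training) sample and a production sample."""
    # Bin edges come from the baseline so both samples share the same grid;
    # production values outside the baseline range are simply ignored here.
    edges = np.histogram_bin_edges(expected, bins=bins)
    expected_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    actual_pct = np.histogram(actual, bins=edges)[0] / len(actual)

    # Clip empty bins to avoid division by zero and log(0).
    expected_pct = np.clip(expected_pct, 1e-6, None)
    actual_pct = np.clip(actual_pct, 1e-6, None)
    return float(np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct)))

# Hypothetical applicant ages: training data centered on 32-40,
# production traffic dominated by applicants under 30.
rng = np.random.default_rng(42)
train_ages = rng.normal(36, 4, size=10_000)
prod_ages = rng.normal(27, 3, size=10_000)

# Rule of thumb (not a hard limit): PSI below ~0.1 is stable, above ~0.25 is a major shift.
print(f"PSI on 'age': {population_stability_index(train_ages, prod_ages):.2f}")
```

In practice, a check like this would run per feature and per segment on rolling windows of production data, not on a single snapshot.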

As more and more machine learning models are deployed and used in live environments for real-world applications, model drift has become a major issue. In our digital and big data era, it’s unrealistic to expect data distributions to remain stable over a long period of time. This means drift is a top concern for data science teams scaling their use of ML. Let’s face it, the amount of time we spend maintaining model health is going up exponentially. It has definitely become a core—and painful—part of the day-to-day tasks for any team maintaining models in a live production environment. 

Types of drift in machine learning for $800 please

Drift in machine learning

Concept drift or model drift is sometimes used as a generic term to describe any change in the statistical properties of the data. Mathematically, it indicates a change in the distribution P(y | X), which describes the relationship between the predictors and the target variables. The common formal definition of concept drift is “a change in the joint probability distribution, i.e., P_t(X, y) ≠ P_{t+1}(X, y)” [1, 2, 4]. It’s also referred to as ‘dataset shift’. We can decompose the joint probability P(X, y) into smaller components to better understand which changes can trigger concept drift (a minimal monitoring sketch for these components follows the list below):

  • Covariate shift P(X) – Also known as input drift, data drift, or population drift, covariate shift occurs when there are changes in the distribution of the input variables (i.e., features). This is the case in our example above where the age of people asking for loans evolves over time. It may happen for technical reasons, like a change in the data source pipeline or sensors that become inaccurate over time. Or, it may be caused by changes in the population, such as new types of customers, trends, or lifestyle fluctuations. Covariate shift can be detected on a univariate level, also referred to as feature drift, but it may also be analyzed on a multivariate level across the entire feature space distribution. 
  • Prior probability shift P(y) – Sometimes referred to as label drift or unconditional class shift, this drift occurs when the distribution of the class variable (y) changes. Two typical examples are spam and fraud detection models, where the proportion of spam emails or fraudulent transactions can vary significantly over time. For example, email phishing attacks and coronavirus scams spiked during the COVID-19 pandemic.
  • Posterior class shift P(y | X) – Also known as conditional change, concept shift, or ‘real concept drift’, this refers to changes in the relationship between the input variables and the target variables. Take BERT or corona, for example: just a few years ago, searches for these terms would turn up our favorite childhood character from Sesame Street and a Mexican beer you drink with a lemon wedge. Today, the same searches are dominated by a deep learning language model and articles on COVID-19. This is typically the hardest type of drift to detect, and also the most dramatic, because it can change the decision boundary and necessitate updates to the model.
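To ground this decomposition, here is a minimal sketch of how each component might be watched in production for a binary classifier. The reference and production windows below are synthetic, and SciPy’s two-sample Kolmogorov-Smirnov and chi-squared tests are just two of many possible checks:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical reference (training-time) and production windows.
ref_feature = rng.normal(36, 4, 5_000)       # e.g. applicant age at training time
prod_feature = rng.normal(27, 3, 5_000)      # the same feature observed in production

ref_labels = rng.binomial(1, 0.10, 5_000)    # e.g. 10% positive class in the reference window
prod_labels = rng.binomial(1, 0.25, 5_000)   # positive rate jumps to 25% in production

# 1) Covariate shift P(X): two-sample Kolmogorov-Smirnov test on a single feature.
ks_stat, ks_p = stats.ks_2samp(ref_feature, prod_feature)
print(f"P(X):  KS statistic={ks_stat:.3f}, p-value={ks_p:.1e}")

# 2) Prior probability shift P(y): chi-squared test on the class counts.
counts = np.array([
    np.bincount(ref_labels, minlength=2),
    np.bincount(prod_labels, minlength=2),
])
chi2, chi_p, _, _ = stats.chi2_contingency(counts)
print(f"P(y):  chi2={chi2:.1f}, p-value={chi_p:.1e}")

# 3) Posterior shift P(y | X) usually has to be inferred indirectly,
#    e.g. by tracking a performance metric once delayed labels arrive.
```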

Some researchers also distinguish between ‘real’ and ‘virtual’ concept drifts, where real concept drift refers to changes in P(y | X) and virtual concept drift refers to changes in P(X) or P(y) that don’t affect the decision boundaries or the posterior probabilities P(y | X). Although these ‘virtual’ changes appear to be less serious, they tend to be a side effect of the real ones. It’s hard to imagine a real-world application in which P(X) changes without impacting P(y | X) [2].

Real vs virtual concept drift

Although researchers may need to distinguish between real and virtual drift, we need to monitor real-world applications for all types of drift. Here’s why:

  1. Every kind of drift can lead to poor performance. A sudden peak in the frequency of certain types of data (i.e., changes in P(X) or P(y)) can make it harder to classify these cases correctly.
  2. Examining different types of drift can help us understand what is causing the more serious ones, help diagnose potential problems with the model, and make it easier to choose the right path for fast resolution. 

What causes drift?

Drift in machine learning can occur for any number of reasons, but these causes generally fall into two main groups: bad training data and changing environments. 

Bad training data, i.e., training data that doesn’t accurately represent real-world situations, is also known as unrepresentative training data:

  • Sample selection bias – This occurs when the training data was collected or prepared using a biased or flawed method. It doesn’t reliably represent the operating environment where the model will be deployed [3].
  • Changes in hidden variables – Hidden variables can’t be measured directly but have a tremendous influence on some of the observed variables [7]. Essentially, a change in hidden variables will change the data we observe. Even if there is no actual drift from the data source, there may still be changes that look similar to concept drift. For example, if we want to predict the number of visitors to an amusement park, we might look at the weather, the day of the week, holidays, etc. But the general economic situation or the general public mood can have an even more significant influence, as could a local tragedy or a big win by the local sports team. Although these factors cannot be measured directly, they have a crucial impact on the number of visitors (our target). 

Changing environments

  • Dynamic environment – This is the more basic and intuitive case of instability, where the change in data and relations is beyond our control. Some examples are:
  1. Any system that follows users’ personal interests, such as advertisements customized for constantly changing preferences. 
  2. Use-cases affected by weather, such as traffic predictions, where the data used to train the model may no longer be relevant. 
  3. Changes in the market, such as new competitors moving in or new pricing models.
  4. Changes in regulations.
  • Technical issues – These can be caused by a broken data pipeline or an upstream change in one of the feature values, whether due to a bug, a schema change, or even a change in a default value.
  • Adversarial classification problems – Some common examples are spam filtering, network intrusion detection, or fraud detection, where attackers change their methods to bypass the model.
  • Deliberate business actions – These can include launching a marketing campaign that attracts new types of users or changes in a website that affect the users’ behavior.
  • Domain shift – This refers to changes in the meaning of values or terms. For example, inflation reduces the value of money, which means that an item’s price or a person’s income will have different effects at different times. Another example is a change in the meaning of terms: a web search for ‘corona’ will retrieve completely different results in 2022 than it did in 2019.
  • Hidden feedback loops – In many cases, deploying a model in a live environment inevitably changes that environment and invalidates the assumptions of the initial model in the process. For example, a churn prediction model will use the historical DB of user engagement to predict the chances of a given user abandoning the product. Typically, whenever the model predicts that a user is likely to churn, the marketing department will contact the user. If retention efforts work and the user stays, the dataset now contains concept drift, because a user who was predicted as likely to churn has stayed put.

What are drift patterns, and why do they matter?

The word ‘drift’ generally implies a gradual change over time. But these changes in data distribution over time can manifest in different shapes and sizes. Here are some of the different ways they take place, based on transition speed: 

  • Gradual – A gradual transition will happen over time when new concepts come into play. For example, in a movie recommendation task, movies, genres, and user preferences all change gradually over time.
  • Sudden – A drift can happen suddenly, for example, when a sensor is replaced by another one with a different calibration.
  • Incremental – Drift can also happen in a sequence of small steps, such that it is only noticed after a long period of time. An example may be when a sensor wears down and becomes less accurate over time. 
  • Blip – Spikes or blips are essentially outliers or one-off occurrences that influence the model, such as a war or some other exceptional event. There may also be recurring anomalies, with no clue as to when they will happen again.
Patterns of drift in machine learning
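As a rough illustration (purely synthetic numbers, not data from any real system), the four patterns above can be mimicked with a few lines of NumPy; plotting these series side by side makes the differences between them easy to see:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000
t = np.arange(n)
noise = rng.normal(0, 0.3, n)

sudden      = np.where(t < 500, 0.0, 3.0) + noise                # step change at t=500
incremental = np.clip((t - 300) / 400, 0, 1) * 3.0 + noise       # slow ramp from t=300 to t=700
blip        = np.where((t > 480) & (t < 520), 3.0, 0.0) + noise  # short-lived spike
# Gradual: the stream alternates between the old and new concept,
# with the new one appearing more and more often over time.
gradual     = np.where(rng.random(n) < t / n, 3.0, 0.0) + noise
```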

ML drift detection

These different patterns of model drift bring us directly to the issue of ML drift detection. Because drift in the wild takes different forms, different methods are needed to detect each one. Detecting seasonal model drift, for example, calls for time series methods that can decompose seasonal components, whereas statistical process control techniques are primarily used to detect sudden or outlier changes. There are a few common statistical measures used to quantify these drifts, and in the next post on the subject, we’ll dive into those as well.
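As a taste of what detection can look like in practice, below is a minimal, from-scratch sketch of the Page-Hinkley test, a classic sequential change detector in the statistical process control family. This simple variant only watches for upward shifts in the monitored value, the delta and threshold values are arbitrary, and mature streaming libraries ship tuned, production-ready versions of this and other detectors:

```python
import numpy as np

class PageHinkley:
    """Minimal Page-Hinkley change detector for a stream of numeric values.

    Raises a flag when the cumulative deviation of the stream from its running
    mean exceeds `threshold`; `delta` is the tolerance for small fluctuations.
    """

    def __init__(self, delta: float = 0.05, threshold: float = 20.0):
        self.delta = delta
        self.threshold = threshold
        self.reset()

    def reset(self) -> None:
        self.n = 0
        self.mean = 0.0
        self.cum = 0.0
        self.min_cum = 0.0

    def update(self, x: float) -> bool:
        self.n += 1
        self.mean += (x - self.mean) / self.n         # running mean of the stream
        self.cum += x - self.mean - self.delta        # cumulative deviation
        self.min_cum = min(self.min_cum, self.cum)
        if self.cum - self.min_cum > self.threshold:  # deviation has drifted upward
            self.reset()
            return True
        return False

# A stream whose mean jumps suddenly at t=500 (e.g. a replaced, recalibrated sensor).
rng = np.random.default_rng(7)
stream = np.concatenate([rng.normal(0.0, 1.0, 500), rng.normal(3.0, 1.0, 500)])

detector = PageHinkley()
for t, x in enumerate(stream):
    if detector.update(x):
        print(f"Drift flagged at t={t}")
        break
```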

Last words

As the ML industry, and MLOps in particular, becomes increasingly mature, it’s essential to align terminology among practitioners. Drift is a fundamental concept that every ML practitioner needs to understand. While general macro drift may occur only occasionally, smaller local drifts on specific subpopulations or segments happen frequently and usually fly under the radar. And while concept drift is often referred to as one monolithic issue, as we saw in this post, it can be divided into many different types of drift that happen for various reasons and appear in different forms. Understanding all the possible patterns and types is an important step toward detecting them efficiently and knowing how to deal with and fix their actual root cause.
