Model Monitoring for Financial Fraud Classification

Every $1 of fraud loss costs financial services firms $4 in losses [1]. These losses stem from incurred interest, fines and legal fees, labor and investigation costs, and external recovery expenses. To avoid this, financial services firms deploy machine learning models to predict if a transaction is fraudulent or not based on historical data. But how can these firms be confident that their models are effective after they’ve been deployed in production? Issues that can affect a model’s effectiveness include:

  • The economic environment changing transaction patterns
  • The definition of a fraudulent transaction changing
  • Schema changes in how data is being recorded
  • Models receiving new values for the first time causing faulty predictions or model failure

When no safeguards are in place, financial services firms suffer more losses because they are failing to catch model performance degradation. This is where model monitoring comes in. Machine learning engineers can monitor their models across different performance metrics and KPIs, get alerted when there are anomalies, and take immediate corrective action.

Financial institutions can use the WhyLabs Observatory to monitor their machine learning models. With its unique approach to monitoring, WhyLabs ensures financial data remains secure and private, and scales with the volume of data being processed by the models. Since data never leaves its environment, there is no risk of any data leaks or misuse of data.

We’ll use a fraud transaction classification model example to show how the WhyLabs Observatory can be used in the financial services industry. The data will be logged with whylogs, and monitored with the WhyLabs Observatory.

A little bit about WhyLabs

WhyLabs Observatory detects and alerts on data issues as data is fed into a machine learning model, including data drift, new unique values, and missing values. Financial services firms can use the Observatory to minimize losses from fraud. The platform can identify data quality issues and changes in a data distribution, detect anomalies, and send notifications. It can also show which aspects of the data have issues, shortening time to resolution. Less time spent debugging means data scientists and machine learning engineers can spend more time developing and deploying models that provide value for your business.

The WhyLabs platform monitors data, whether it is being transformed in a feature store, moving through a data pipeline (batch or real-time), or feeding into AI/ML systems or applications. The platform has two components: the open-source whylogs logging library and the WhyLabs Observatory. The whylogs logging library fits into existing tech stacks through a simple Python or Java integration and supports both structured and unstructured data. No data is copied, duplicated, or moved out of the environment, eliminating the risk of data leaks. whylogs analyzes the whole dataset and creates a statistical profile of the different aspects of the data. Because profiles are built from every record rather than from a sample, whylogs captures rare events, seasonality, and outliers that sampling might miss, while keeping sensitive financial data private.
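To make the idea of a statistical profile concrete, here is a minimal sketch in plain Python. It is not how whylogs is implemented, and the class name ColumnProfile and its fields are hypothetical. It only illustrates that summary statistics can be updated one value at a time, so every row contributes to the profile while no raw value is stored.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ColumnProfile:
    """Hypothetical running summary of one numeric column."""
    count: int = 0
    null_count: int = 0
    total: float = 0.0
    minimum: Optional[float] = None
    maximum: Optional[float] = None

    def track(self, value: Optional[float]) -> None:
        """Update the summary with one value; the value itself is not kept."""
        self.count += 1
        if value is None:
            self.null_count += 1
            return
        self.total += value
        self.minimum = value if self.minimum is None else min(self.minimum, value)
        self.maximum = value if self.maximum is None else max(self.maximum, value)

    @property
    def mean(self) -> Optional[float]:
        """Mean of the non-null values, or None if there are none."""
        non_null = self.count - self.null_count
        return self.total / non_null if non_null else None


# Made-up transaction amounts, including one missing value
profile = ColumnProfile()
for amount in [120.0, None, 9800.5, 42.0]:
    profile.track(amount)

print(profile.count, profile.null_count, profile.minimum, profile.maximum)
```

Only the handful of numbers in the summary would need to leave the environment, which is the property the paragraph above describes.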

Once whylogs profiles are ingested into the WhyLabs Observatory, monitors are enabled and anomaly detection is run on the profiles. Pre-built data monitors can be enabled with just a click to look for data drift, null values, data type changes, new unique values, and model performance metrics (e.g. Accuracy, Precision, Recall, and F1 Score). If no pre-built monitor covers a particular data issue or model metric, a guided wizard is available for creating a custom monitor. If anomalies are detected, notifications are generated showing which aspects of the data or model have issues. For more on data and model monitoring, go here.

Discussion of the dataset and data dictionary

The dataset used for this example is a modified version of this Kaggle dataset and has 9 features (model inputs) and 1 target (model output, isFraud). For this example, the “nameDest” column has been replaced with a column called “countries”, and the class balance of the target (isFraud) was changed to 91.7% not fraudulent (0) and 8.3% fraudulent (1).

The data dictionary is:

  • step: Maps a unit of time in the real world. In this case, 1 step is 1 hour of time
  • type: Type of transaction, with 5 unique types: CASH-IN, CASH-OUT, DEBIT, PAYMENT, and TRANSFER (two additional types, PAYPAL and VENMO, are added later)
  • amount: Amount of the transaction in US Dollars
  • nameOrig: Customer who started the transaction
  • oldbalanceOrig: Initial balance before the transaction
  • newbalanceOrig: Customer’s balance after the transaction
  • country: Country where the transaction occurred
  • oldbalanceDest: Initial recipient balance before the transaction
  • newbalanceDest: Recipient’s balance after the transaction
  • isFraud (Target): If transaction is fraudulent. 0 (not fraudulent), 1 (fraudulent)
NOTE: Although the Financial Fraud example has 9 inputs, WhyLabs is capable of monitoring models with 1000s of model inputs.

Getting started with WhyLabs

To start using WhyLabs, sign up for a free account! After creating an account and logging in to the Observatory, the Project Dashboard will appear.

Figure 1: The WhyLabs Observatory Project Dashboard

Once in the Project Dashboard, click “Create a project,” and a new box will appear. Under “Model project”, click on “Set up model.” A new screen will appear, where you’ll be prompted to enter a project name and type (if applicable). The below .gif shows this process.

Figure 2: Creating a New Project Flow

After creating a new project, click on WhyLabs in the top left to get back to the Observatory. Once in the Observatory, find the project card that was just created. Then at the bottom of that project card, click on “Set up a whylogs integration.” A new page will appear. Locate the dropdown menu underneath “Select a model project for whylogs integration.” Click on the dropdown, and select the newly created model. Note the Model ID, which in the example below is model-42. Copy it, as it will be needed when logging profiles with whylogs.

Once the model has been selected, copy the Organization ID underneath (it would be similar to org-xxxxxx) as that will be needed for whylogs. Then click the orange button underneath that says “Create Access Token.” This will generate an API Access Token. Copy the API Access Token. The Model ID, Organization ID, and API Access Token will be used by whylogs to send the profiles to the correct model in the WhyLabs Observatory.

Figure 3: Generating Necessary Credentials for whylogs
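whylogs picks up these three credentials from the environment variables WHYLABS_API_KEY, WHYLABS_DEFAULT_ORG_ID, and WHYLABS_DEFAULT_DATASET_ID, which are the names used in the setup code in this post. A small check like the sketch below can catch a missing value before any profile is logged. The helper name missing_whylabs_settings is hypothetical and not part of whylogs.

```python
import os

REQUIRED_VARS = (
    "WHYLABS_API_KEY",
    "WHYLABS_DEFAULT_ORG_ID",
    "WHYLABS_DEFAULT_DATASET_ID",
)


def missing_whylabs_settings(env=None):
    """Return the names of required settings that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


# Example with a plain dict standing in for the real environment
example_env = {
    "WHYLABS_DEFAULT_ORG_ID": "org-xxxxxx",
    "WHYLABS_DEFAULT_DATASET_ID": "model-42",
}
print(missing_whylabs_settings(example_env))
```

Calling missing_whylabs_settings() with no argument checks the real environment, which also makes it easier to keep the access token out of notebooks and source control.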

The below example discusses how production data can be profiled with whylogs using a pipelined XGBoost model in Python. An example notebook that shows how to use whylogs with Python can be found here. whylogs creates profiles of data by taking in the data as a pandas dataframe (df).

To use whylogs, a few imports and configurations need to be done. The code below shows how to set up whylogs for use in Python.

# if whylogs isn't installed, it can be installed via pip
# pip install "whylogs[whylabs]"

# importing whylogs
import whylogs as why

# confirm that at least whylogs v1.1 is being used
print(why.__version__)
import os

os.environ["WHYLABS_API_KEY"] = "The WhyLabs API Generated Access Token Here"
os.environ["WHYLABS_DEFAULT_ORG_ID"] = "The WhyLabs Organization ID Here"
os.environ["WHYLABS_DEFAULT_DATASET_ID"] = "The WhyLabs Model ID Here That Will Receive the whylogs Profiles"

from whylogs.api.writer.whylabs import WhyLabsWriter

# importing pandas to read data into a dataframe. whylogs takes in dataframes.
import pandas as pd

# read in data as a dataframe, dataframe will be used for profiling with whylogs
df = pd.read_csv("path/to/your/data.csv")

After whylogs is imported, import all necessary libraries needed for building and training an XGBoost model. Then load the training data, train the XGBoost model on the training data, create a pipelined XGBoost model that is fitted on the training data, and feed production data to the pipeline model. For an idea of what this looks like, see the below starter code.

Importing additional necessary libraries:

# For data visualization
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

# For building and training an XGBoost model
from xgboost import XGBClassifier

# For getting different metric scores, and splitting the data
from sklearn import metrics
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.metrics import (
    accuracy_score,
    recall_score,
    confusion_matrix,
    classification_report,
)

# For data scaling and one hot encoding
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder

# For creating pipelines and personalizing them
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

# For defining maximum number of columns to be displayed in a dataframe
pd.set_option("display.max_columns", None)

# For suppressing scientific notation in a dataframe
pd.set_option("display.float_format", lambda x: "%.3f" % x)

# For suppressing warnings
import warnings
warnings.filterwarnings("ignore")

After reading the data into a dataframe and verifying that it is suitable for training, build and train an XGBoost model:

# Making a copy of the dataframe
training = df.copy()

# Separating target variable and other variables
X = training.drop(columns = "Target")

Y = training["Target"]

# Creating dummy variables for non-numeric columns
ohe = OneHotEncoder()
X = ohe.fit_transform(X[["all non-numeric columns"]])

# Splitting data into training and test set:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.3, random_state = 1, stratify = Y)
print(X_train.shape, X_test.shape)

# Building and training an XGBoost model
model = XGBClassifier()
model.fit(X_train, Y_train)
preds = model.predict(X_test)

print(f'Accuracy = {accuracy_score(Y_test, preds):.2f}\nRecall = {recall_score(Y_test, preds):.2f}\n')
cm = confusion_matrix(Y_test, preds)
report = classification_report(Y_test, preds)
plt.figure(figsize=(8, 6))
plt.title('XGBoost Classifier Confusion Matrix (with original data)', size=16)
sns.heatmap(cm, annot=True, cmap='Blues');

Once the model is trained and its performance is acceptable, wrap it in a pipeline:

# Pipelining XGBoost with default parameters
# (single-step pipeline; the step name "xgb" is illustrative)
Model = Pipeline(steps = [("xgb", XGBClassifier())])

# Making a copy of the dataframe
training = df.copy()

# Separating target variable and other variables
X = training.drop(columns = "Target")

Y = training["Target"]

# Creating dummy variables for non-numeric columns
ohe = OneHotEncoder()
X = ohe.fit_transform(X[["all non-numeric columns"]])

# Fitting the model on training data
Model.fit(X, Y)

Feed production data into the pipeline model:

# Importing production data
production = pd.read_csv("path/to/your/production.csv")

# Making a copy of the dataframe
prod = production.copy()

# Separating target variable and other variables
prod_inputs = prod.drop(columns = "Target")

# Keeping the target as a one-column dataframe so output columns can be added to it
prod_y = prod[["Target"]].copy()

# Creating dummy variables for non-numeric columns,
# reusing the encoder that was fitted on the training data
prod_x = ohe.transform(prod_inputs[["non-numeric columns"]])

# Predicting on production data
model_predictions = Model.predict(prod_x)

# Predicting probabilities
predict_proba = Model.predict_proba(prod_x)

# Determining model's confidence score for a prediction
scores = [max(p) for p in predict_proba]

# Adding a new column to output called prediction_output with model's predictions
prod_y["prediction_output"] = model_predictions

# Adding a new column to output called output_scores with model's prediction confidence score
prod_y["output_scores"] = scores

# Combining the raw inputs and the outputs before logging with whylogs (if any outputs are going to be logged)
prod = pd.concat([prod_inputs, prod_y], axis = 1)

# Renaming target column to include "output" so WhyLabs will recognize it as actual output
prod = prod.rename(columns = {"Target": "output_Target"})

Once production data is available, profiling with whylogs can start, and it takes only a few lines of code. If ground truth (actual outcome) data is available, the data and ground truth would be logged with the below code:

results = why.log_classification_metrics(
    prod,
    target_column = "output_Target",  # the target column name must contain the word "output"
    prediction_column = "prediction_output",
    score_column = "output_scores"
)

profile = results.profile()

If only model inputs are available, the data would be logged with the below code.

profile_results = why.log(prod) 
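
The snippets above create profiles but do not yet send them anywhere. Below is a minimal sketch of the upload step, assuming whylogs v1, where the object returned by why.log or why.log_classification_metrics exposes writer("whylabs").write() and whylogs reads the credentials from the WHYLABS_* environment variables set earlier. The helper name upload_to_whylabs is hypothetical, and the stand-in class exists only so the sketch can run without a WhyLabs account.

```python
def upload_to_whylabs(result_set):
    """Send a whylogs result set to WhyLabs.

    Assumes the object follows the whylogs v1 result-set interface:
    result_set.writer("whylabs").write(). Credentials are read by whylogs
    from the WHYLABS_* environment variables.
    """
    result_set.writer("whylabs").write()


# In the pipeline this would be called as upload_to_whylabs(results) or
# upload_to_whylabs(profile_results). The stand-in below only records the
# calls it receives so the sketch can run offline.
class _FakeResultSet:
    def __init__(self):
        self.calls = []

    def writer(self, name):
        self.calls.append(("writer", name))
        return self

    def write(self):
        self.calls.append(("write",))


fake = _FakeResultSet()
upload_to_whylabs(fake)
print(fake.calls)
```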

WhyLabs AI Observatory platform

Day 1 (7/21)

Now that profiles are logged, the first day of using the WhyLabs Observatory is focused on getting familiar with the data and setting up appropriate monitors for the fraud classification model. When first logging into the WhyLabs Observatory, the Project Dashboard appears, which is a central view of all pipelines/models being monitored. For every Project being monitored, there is a high level overview of the number of detected anomalies as well as the category the alerts fall in. The Financial Fraud model is what will be investigated.

Figure 4: Day 1 (7/21): Project Dashboard View of Monitored Projects

Looking at the first day (7/21) of profiles in Profile view, the WhyLabs Observatory automatically displays the statistics captured from the whylogs profiles and automatically generates visualizations from those statistics showing the distribution of the data.

Figure 5: Day 1 (7/21): Seeing how a whylogs Profile is rendered in the WhyLabs Observatory
Figure 6: Day 1 (7/21): Performing Exploratory Data Analysis in the WhyLabs Observatory

After understanding what the data looks like, appropriate monitors can be enabled. For users unsure which monitors are available, the WhyLabs Observatory has Preset monitors in the Monitor Manager. Multiple monitors can be enabled, and each monitor’s anomaly threshold can be configured to suit the model. Custom monitors can be created through a no-code wizard or a code editor. Actions can also be attached to a monitor so that an alert triggers a response. For example, WhyLabs supports using a Webhook to trigger a model re-training pipeline if a model’s F1 Score falls below 10%.

Figure 7: Day 1 (7/21): Reviewing the Pre-Built Monitors in the WhyLabs Observatory
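
The monitor-plus-action pattern described above can be sketched in a few lines of plain Python. In WhyLabs this is configured in the Observatory rather than written by hand, so the function check_metric and the payload fields below are hypothetical and only illustrate the idea of a threshold that triggers an action.

```python
from typing import Callable, Dict, List


def check_metric(name: str, value: float, minimum: float,
                 on_alert: Callable[[Dict], None]) -> bool:
    """Call on_alert with a small payload when value drops below minimum."""
    if value < minimum:
        on_alert({"metric": name, "value": value, "threshold": minimum})
        return True
    return False


# Hypothetical action: in practice this might POST the payload to a webhook
# that starts a re-training pipeline. Here it only collects the payloads.
triggered: List[Dict] = []

check_metric("f1", 0.82, minimum=0.10, on_alert=triggered.append)  # healthy
check_metric("f1", 0.04, minimum=0.10, on_alert=triggered.append)  # alert

print(triggered)
```

Passing the action in as a callable keeps the check itself independent of how the alert is delivered.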

Through WhyLabs -> Settings -> Notifications panel, users can specify how they would like to receive anomaly notifications.

Figure 8: Day 1 (7/21): Setting Up Notification Channels

Day 2 (7/22)

Now that profiles are in the platform and monitors are enabled and configured, data comparison can begin. In the Profiles UI, the Day 2 (7/22) profile can be compared with the Day 1 (7/21) profile. The WhyLabs Observatory generates a side-by-side comparison of the captured statistics and the generated visualizations. The profile comparison shows that the “step” input has somewhat different data compared to the previous day; however, no alert has gone off.

Figure 9: Day 2 (7/22): Performing a Profile Comparison with 7/21 profiles
Figure 10: Day 2 (7/22): Performing Exploratory Data Analysis for 7/21 and 7/22 Profiles

To confirm that there are no anomalies, go to the Anomalies Feed section of the Monitor Manager.

Figure 11: Day 2 (7/22): The Anomalies Feed

For this example, ground truth is available. In the Performance section of the platform, classification metrics (Accuracy, Precision, Recall, F1 Score) are automatically calculated and displayed, showing how the model performs over time.
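
For reference, the four metrics follow directly from the confusion matrix counts. The sketch below uses the standard definitions with made-up counts. It is not the platform's code, and the function name classification_metrics is hypothetical.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy, precision, recall, and F1 from confusion matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up counts for one day of 1,000 transactions, 83 of them fraudulent
day = classification_metrics(tp=70, fp=20, fn=13, tn=897)
print({name: round(value, 3) for name, value in day.items()})
```

With a target this imbalanced, accuracy stays high even when a noticeable share of fraud is missed, which is why recall and F1 are worth tracking alongside it.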

If there are additional model-specific metrics or business KPIs that need to be tracked, custom code can be written in whylogs that defines these metrics and KPIs, and they would be tracked and displayed in the Performance section in WhyLabs.

Figure 12: Day 2 (7/22): Model Performance Metric Tracking

Day 3 (7/23)

A notification appeared for the first time. Going to the Anomalies Feed, information is provided on what the anomaly is. The “amount” feature had a significant distribution shift.

Figure 13: Day 3 (7/23): Anomalies Feed with Detected Anomalies

In the Performance section, all the model metrics show a significant drop. Most concerning is that the False Negatives (lower left corner of the Confusion Matrix) increased by almost 6X, meaning that the model is classifying almost 6X more fraudulent transactions as not fraudulent. If not fixed, the model could lead the business to spend more money handling fraudulent transactions than it has allocated.

Figure 14: Day 3 (7/23): Model Performance Metric Tracking

To figure out what’s changed with “amount”, a good place to start would be to do a profile comparison comparing today’s profile (7/23) to the previous day’s profile (7/22). Looking at the statistics between the two profiles as well as the visualizations, it looks like on 7/23 the values for “amount” came in as cents as opposed to the usual dollars, which means a schema change happened somewhere upstream of the model. Through this data comparison, the WhyLabs Observatory was able to help debug what happened with “amount”.

Figure 15: Day 3 (7/23): Comparing 7/22 and 7/23 Profiles
Figure 16: Day 3 (7/23): Comparing 7/22 and 7/23 Profiles side by side
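
A dollars-to-cents change has a recognizable signature: typical values become roughly 100 times larger. The sketch below shows one way such a check could be expressed, using made-up amounts. The function names and the 20% tolerance are assumptions for illustration, and the baseline median is assumed to be non-zero.

```python
from statistics import median


def median_ratio(baseline, current):
    """Ratio of the current median to the baseline median."""
    return median(current) / median(baseline)


def looks_like_cents(baseline, current, tolerance=0.2):
    """True when the typical value is roughly 100x larger than the baseline."""
    ratio = median_ratio(baseline, current)
    return abs(ratio - 100) <= 100 * tolerance


# Made-up amounts: the second list is the first expressed in cents
amounts_0722 = [25.0, 180.5, 990.0, 12.75, 4300.0]
amounts_0723 = [a * 100 for a in amounts_0722]

print(round(median_ratio(amounts_0722, amounts_0723), 1))
print(looks_like_cents(amounts_0722, amounts_0723))
```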

To compare the data distribution with previous days, users can go to Inputs and click on “amount”. Several distribution views are available, and the Estimated Quantiles Distribution view shows a data distribution alert because this day’s distribution exceeded the anomaly threshold. With this information, users can go talk to the data engineering team to address the schema change and make sure “amount” is recorded in dollars again.

Figure 17: Day 3 (7/23): View of the Model’s Inputs
Figure 18: Day 3 (7/23): Quantile View of the Data Drift for “amount”
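
The idea of comparing quantiles against an anomaly threshold can be illustrated with a small sketch. This is not the drift algorithm WhyLabs uses. The decile comparison, the function names, and the 0.5 threshold are assumptions chosen to keep the example short, and the baseline deciles are assumed to be non-zero.

```python
from statistics import quantiles


def decile_shift(baseline, current):
    """Largest relative change between matching deciles of two samples."""
    base_q = quantiles(baseline, n=10)
    curr_q = quantiles(current, n=10)
    return max(abs(c - b) / abs(b) for b, c in zip(base_q, curr_q) if b != 0)


def drift_alert(baseline, current, threshold=0.5):
    """Hypothetical rule: alert when any decile moves by more than threshold."""
    return decile_shift(baseline, current) > threshold


baseline = [float(v) for v in range(10, 510, 10)]  # 50 made-up amounts
same_scale = [v * 1.05 for v in baseline]          # mild day-to-day change
in_cents = [v * 100 for v in baseline]             # unit change

print(drift_alert(baseline, same_scale), drift_alert(baseline, in_cents))
```

A mild day-to-day change stays under the threshold, while a change of units moves every decile far past it.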

Day 6 (7/26)

After addressing the “amount” schema change with the data engineering team on Day 3, Days 4 and 5 were normal days with no WhyLabs Observatory alerts. However, on Day 6 (7/26), multiple alerts went off and going to the Anomalies Feed confirms multiple alerts.

Figure 19: Day 6 (7/26): Anomalies Feed through 7/26

After reviewing the alerts in the Anomalies Feed, the Performance tab shows how catastrophic the situation is: all the model performance metrics have plummeted to 0.

Figure 20: Day 6 (7/26): Model Performance Metrics

Seeing that the “type” input was labeled a high priority alert in the Anomalies Feed, a good place to look is Profiles to see what happened to “type”.

A profile comparison between Day 6 (7/26) and Day 5 (7/25) shows that two new transaction types, PAYPAL and VENMO, appeared on 7/26.

Figure 21: Day 6 (7/26): Profile Comparison between 7/25 and 7/26
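
The same kind of check can also be run in the pipeline before the model predicts, as a guard against categories it has never seen. A minimal sketch, with a hypothetical helper name and made-up production values:

```python
def unseen_categories(training_values, production_values):
    """Return production categories that never appeared in training, sorted."""
    return sorted(set(production_values) - set(training_values))


training_types = ["CASH-IN", "CASH-OUT", "DEBIT", "PAYMENT", "TRANSFER"]
production_types = ["PAYMENT", "PAYPAL", "TRANSFER", "VENMO", "CASH-OUT"]

print(unseen_categories(training_types, production_types))
```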

This was further confirmed by going to Inputs, clicking on “type” and seeing the visualization below.

Figure 22: Day 6 (7/26): Top 5 Most Frequent Items for “type” from 7/21 - 7/26

Also, going to the Output section and clicking on the “output - isFraud” output confirms that the number of missing values went up. PAYPAL and VENMO weren’t among the transaction types included during model training, so when the classification model saw them for the first time, it broke because it didn’t know how to predict on those two transaction types.

Figure 23: Day 6 (7/26): Data Drift Threshold Exceeded in “output - isFraud”
Figure 24: Day 6 (7/26): Significant Increase in Fraction of Null Values for “output - isFraud”

PAYPAL and VENMO are confirmed to be two new transaction types introduced into the production pipeline without the data scientists’ knowledge, so a miscommunication likely led to the model failure. Connecting a Webhook to a model performance metric monitor can limit the damage. Once the model performance metric drops below the set threshold, the Webhook triggers a re-training job on new data, so the model can predict on new incoming data without costly, unplanned downtime.

Day 7 (7/27)

After what happened on Day 6, an automated re-training trigger re-trained the model on new data that included the new “type” values. On Day 7, the model recognized PAYPAL and VENMO, and the metrics returned to the expected levels.


As you can see, without monitoring in place, some of the anomalies detected by the WhyLabs Observatory might have gone unnoticed for long periods. Even if they were noticed, data scientists and MLOps engineers might struggle to debug why a model’s performance degraded without knowing where to look. The WhyLabs Observatory is a safety mechanism that helps ensure models provide the results the business needs and that issues are addressed as soon as they arise, rather than after losses have accumulated.

Sign up to try WhyLabs Observatory for free and start monitoring your data and models today!

Please check out the Resources section below to get started with whylogs and WhyLabs. If you’re interested in learning how you can apply data and/or model monitoring to your organization, please contact us, and we would be happy to talk!



