In the dynamic realm of machine learning and artificial intelligence, models are often deployed with the assumption that the data they process will remain stable over time. However, this is seldom the case due to dataset shift—changes in the underlying data distribution—which can lead to significant performance drops known as accuracy degradation.
Real-world incidents highlight the severity of this issue. For instance, during the COVID-19 pandemic, abrupt shifts in consumer behavior rendered many pre-pandemic predictive models ineffective. Likewise, self-driving cars have struggled when encountering weather conditions or road environments not represented in their training data, raising safety concerns.
Traditionally, identifying dataset shifts and subsequent accuracy degradation is a reactive process. Performance issues are addressed after they have occurred, leading to costly delays and potential negative impacts.
But what if we could anticipate these shifts before they happen? By developing proactive metrics that predict potential changes in data distributions and model performance, we can adjust our models in advance to maintain optimal accuracy. This foresight is crucial in applications where real-time reliability is essential.
In this blog post, we will delve into the concept of the Accuracy Degradation Factor (ADF) and Accuracy Degradation Profile (ADP) - a metric and an analysis designed to predict model performance shifts before they occur. We will explore how anticipating dataset shifts can help maintain model integrity and discuss practical methodologies for implementing this proactive approach. By staying ahead of the curve, we can ensure our models remain robust and reliable, even as the data landscape evolves.
ADP measures how accuracy changes as we sequentially reduce the dataset, evaluating accuracy over smaller, contiguous subsets. ADF identifies the first point at which these reductions cause a significant drop in performance. Both concepts are easier to understand with an example, so we consider the 2D binary classification dataset presented in plot (a):
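The dataset behind plot (a) is not published with this post, so as a stand-in we can generate a comparable synthetic 2D binary classification problem. This is a minimal sketch; the `make_moons` call, its noise level, and the sample count are illustrative assumptions, not the actual data used.

```python
# Illustrative stand-in for the 2D binary classification dataset in plot (a).
from sklearn.datasets import make_moons

# 500 points, 2 features, 2 classes; noise makes the boundary imperfect.
X, y = make_moons(n_samples=500, noise=0.3, random_state=42)
```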
We will begin by applying a typical machine learning process, training a Decision Tree model and calculating the accuracy on the entire test set. The test points are highlighted on the graph with red circles (b), and this set will be the focus of the analysis as we explore possible changes in its composition.
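A minimal sketch of that step, continuing the synthetic stand-in above; the split ratio and the tree's hyperparameters are assumptions for illustration, not the settings used in the post.

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Hold out a test set; it is the focus of the neighborhood analysis below.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = DecisionTreeClassifier(max_depth=5, random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print(f"Accuracy on the full test set: {accuracy_score(y_test, y_pred):.3f}")
```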
Next, we will keep only the test set on the graph and assign an index to each point for tracking purposes (c). We will plot both the true labels (y_true) and the predicted labels (y_pred) on the same graph, shifting the predicted labels slightly upward for clarity (d). This allows us to see where the model made mistakes - misclassifications appear as a difference in color between the true labels (darker circles) and the predicted ones (shaded circles).
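A rough way to reproduce panels (c) and (d) with matplotlib, assuming the arrays from the sketch above; the vertical offset value is arbitrary.

```python
import matplotlib.pyplot as plt

# True labels as solid markers, predictions shifted slightly upward so
# mismatching colors (misclassifications) are easy to spot.
offset = 0.08
plt.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap="viridis", label="y_true")
plt.scatter(X_test[:, 0], X_test[:, 1] + offset, c=y_pred, cmap="viridis",
            alpha=0.4, label="y_pred (shifted)")
for i, (x1, x2) in enumerate(X_test):
    plt.annotate(str(i), (x1, x2), fontsize=6)  # index for tracking each point
plt.legend()
plt.show()
```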
We can now select points in the test set and analyze their neighborhoods, exploring how accuracy fluctuates as we move from one neighborhood to another. Imagine a particular point, say point 83, along with its neighboring points, visualized as a convex hull around them. In this case, the accuracy is a perfect 100%, as all predicted labels in that neighborhood match the true labels.
Similarly, when we examine point 76 and its neighbors, we again achieve 100% accuracy since the predicted yellow labels perfectly align with the true yellow labels.
However, not all points enjoy such perfect accuracy. Take point 96 as an example. The accuracy drops significantly to just 40% because only two out of five predicted labels match the actual ones, indicating a local mismatch between the predictions and reality. This sharp contrast between neighborhoods reveals how localized performance can vary depending on the region of the dataset.
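The exact neighborhood construction (the convex hulls drawn around each point) is only shown graphically in the post; as an approximation, a simple k-nearest-neighbor neighborhood yields the same kind of local accuracy. The neighborhood size of 5 is an assumption, and the point indices 83, 76, and 96 come from the walkthrough and only make sense if the test set has at least that many points.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def local_accuracy(idx, X_test, y_test, y_pred, n_neighbors=5):
    """Accuracy over one test point and its nearest neighbors in the test set."""
    nn = NearestNeighbors(n_neighbors=n_neighbors).fit(X_test)
    _, neighbor_idx = nn.kneighbors(X_test[idx].reshape(1, -1))
    members = neighbor_idx[0]  # the point itself plus its closest neighbors
    return np.mean(y_test[members] == y_pred[members])

# Neighborhoods around the points discussed above (indices from the figure).
for point in (83, 76, 96):
    print(point, local_accuracy(point, X_test, y_test, y_pred))
```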
Through various experiments, both with synthetic and real-world datasets, we have observed that local accuracies fluctuate based on the data point and the size of its neighborhood. These fluctuations are often due to the unique topological features of each dataset - subtle structures and complexities that influence how well a model performs in certain areas. These nuances lead to variations in accuracy across different neighborhoods, offering valuable insight into the localized performance of machine learning models.
To better understand how accuracy changes as we reduce the dataset, we introduce the Accuracy Degradation Profile (ADP). This method helps evaluate the robustness of machine learning models by gradually reducing the size of the test set and observing how accuracy is affected. For each portion of the reduced dataset, ADP reports the mean accuracy across the samples - each evaluated over its neighborhood - and compares it against a defined threshold.
Simply put, ADP shows us how well a model handles shrinking datasets and where performance starts to noticeably degrade. To make this concept even clearer, let's walk through an example that demonstrates how ADP works in practice.
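Below is a simplified sketch of how such a profile could be computed, continuing the example above. The choice of size factors, the neighborhood size, how the minimum acceptable accuracy threshold is derived, and how the contiguous reduced subsets are selected are all assumptions here, not the authors' exact procedure.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def accuracy_degradation_profile(X_test, y_test, y_pred, baseline_acc,
                                 size_factors=np.arange(0.95, 0.45, -0.05),
                                 n_neighbors=5, acc_margin=0.05):
    """Build a simplified ADP table: for each size_factor, keep that fraction of
    the test set, average the per-point neighborhood accuracies, and flag a
    degradation when the mean falls below the minimum acceptable accuracy."""
    min_acceptable = baseline_acc - acc_margin  # assumed definition of the threshold
    # Placeholder for the contiguous reduction: keep the points closest to the
    # test-set centroid, so each reduced subset is a spatially coherent region.
    order = np.argsort(np.linalg.norm(X_test - X_test.mean(axis=0), axis=1))
    rows = []
    for factor in size_factors:
        keep = order[: int(factor * len(X_test))]
        Xs, ys, ps = X_test[keep], y_test[keep], y_pred[keep]
        _, idx = NearestNeighbors(n_neighbors=n_neighbors).fit(Xs).kneighbors(Xs)
        mean_acc = float(np.mean([np.mean(ys[nbrs] == ps[nbrs]) for nbrs in idx]))
        rows.append({
            "size_factor": round(float(factor), 2),
            "mean_neighborhood_acc": round(mean_acc, 3),
            "min_acceptable_acc": round(min_acceptable, 3),
            "status": "acc degrad!" if mean_acc < min_acceptable else "OK",
        })
    return rows

# Usage, continuing the example above:
profile = accuracy_degradation_profile(X_test, y_test, y_pred,
                                       baseline_acc=np.mean(y_test == y_pred))
```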
Let us perform ADP over the selected dataset (Table 1):
Let's break down the Accuracy Degradation Profile (ADP) matrix to see how each column helps you understand model performance in detail:
Every time the term 'acc degrad!' appears in the ADP matrix, watch out (!!!): for the given minimum number of samples, the mean accuracy over the convex hulls (each comprising the sample being examined together with its neighborhood) is not higher than the minimum acceptable accuracy. When that happens, it may be valuable to look deeper into the topology of the dataset to understand what is actually happening with the classifier applied to the reduced dataset. Even if the dataset is multidimensional, dimensionality reduction can be used to view it in 2D.
The accuracy degradation factor (ADF) is the first size_factor at which an accuracy degradation occurs. ADF ranges from 0.0 to 1.0: the closer it is to 1.0, the less robust the model is to dataset shifts according to the ADP methodology, and the closer it is to 0.0, the more robust it is.
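Given a profile like the sketch above, extracting ADF is simply a matter of finding the first flagged size_factor; returning 0.0 when nothing is flagged is an assumption for the no-degradation case.

```python
def accuracy_degradation_factor(profile):
    """First size_factor at which the ADP reports a degradation.

    The profile is iterated from the largest size_factor down, so the first
    flagged row corresponds to the smallest reduction that already degrades.
    Returns 0.0 if no degradation is flagged (assumed convention here)."""
    for row in profile:
        if row["status"] == "acc degrad!":
            return row["size_factor"]
    return 0.0

adf = accuracy_degradation_factor(profile)
print(f"ADF = {adf}")
```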
We can now plot the information from Table 1:
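With the hypothetical profile from the sketch above (not the actual Table 1 values), an analogous plot could be produced like this:

```python
import matplotlib.pyplot as plt

factors = [row["size_factor"] for row in profile]
mean_accs = [row["mean_neighborhood_acc"] for row in profile]
threshold = profile[0]["min_acceptable_acc"]

plt.plot(factors, mean_accs, marker="o", label="mean neighborhood accuracy")
plt.axhline(threshold, color="red", linestyle="--", label="min acceptable accuracy")
plt.gca().invert_xaxis()  # read from the full test set toward the smallest subset
plt.xlabel("size_factor (fraction of the test set kept)")
plt.ylabel("accuracy")
plt.legend()
plt.show()
```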
ADP and ADF serve as a flag, an early warning system to give practitioners a sense of where potential issues may lie. By highlighting potential degradation points, these metrics help identify areas that might require preemptive action, allowing teams to address problems before they lead to significant performance loss. They provide a call to action, encouraging further investigation, deeper analysis, and proactive interventions.
Once a warning is identified, the next step is to locate the specific problem area. ADF pinpoints the smallest shift in the data at which neighborhoods begin to show a significant drop in accuracy, which makes it a powerful tool for identifying performance degradation. Combined with mitigations such as data rebalancing, feature engineering, or retraining the model, this targeted exploration allows practitioners to allocate resources effectively to the most impactful issues.
A key strength of these metrics is that they make very few assumptions about the data or model. This flexibility makes ADP and ADF applicable across various types of datasets and machine learning models. They act as an all-direction measure of shifts, distilled into just one number, offering a comprehensive overview of potential changes. This simplicity provides a high-level summary that can be easily communicated to stakeholders, while still retaining enough nuance to guide technical efforts.