In the dynamic realm of machine learning and artificial intelligence, models are often deployed with the assumption that the data they process will remain stable over time. However, this is seldom the case due to dataset shift—changes in the underlying data distribution—which can lead to significant performance drops known as accuracy degradation.
Real-world incidents highlight the severity of this issue. For instance, during the COVID-19 pandemic, abrupt shifts in consumer behavior rendered many pre-pandemic predictive models ineffective. Likewise, self-driving cars have struggled when encountering weather conditions or road environments not represented in their training data, raising safety concerns.
Traditionally, identifying dataset shifts and subsequent accuracy degradation is a reactive process. Performance issues are addressed after they have occurred, leading to costly delays and potential negative impacts.
But what if we could anticipate these shifts before they happen? By developing proactive metrics that predict potential changes in data distributions and model performance, we can adjust our models in advance to maintain optimal accuracy. This foresight is crucial in applications where real-time reliability is essential.
In this blog post, we will delve into the concept of the Accuracy Degradation Factor (ADF) and Accuracy Degradation Profile (ADP) - a metric and an analysis designed to predict model performance shifts before they occur. We will explore how anticipating dataset shifts can help maintain model integrity and discuss practical methodologies for implementing this proactive approach. By staying ahead of the curve, we can ensure our models remain robust and reliable, even as the data landscape evolves.
ADP measures how accuracy changes as we sequentially reduce the dataset, evaluating accuracy over smaller, contiguous subsets. ADF identifies the first point at which these reductions cause a significant drop in performance. Both concepts are easier to understand with an example, so we consider the 2D binary classification dataset presented in plot (a):
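The dataset behind plot (a) is not published with this post, so as a stand-in we can generate a comparable synthetic 2D binary classification problem. This is a minimal sketch; the `make_moons` call, its noise level, and the sample count are illustrative assumptions, not the actual data used.

```python
# Illustrative stand-in for the 2D binary classification dataset in plot (a).
from sklearn.datasets import make_moons

# 500 points, 2 features, 2 classes; noise makes the boundary imperfect.
X, y = make_moons(n_samples=500, noise=0.3, random_state=42)
```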
We will begin by applying a typical machine learning process, training a Decision Tree model and calculating the accuracy on the entire test set. The test points are highlighted on the graph with red circles (b), and this set will be the focus of the analysis as we explore possible changes in its composition.
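A minimal sketch of that step, continuing the synthetic stand-in above; the split ratio and the tree's hyperparameters are assumptions for illustration, not the settings used in the post.

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Hold out a test set; it is the focus of the neighborhood analysis below.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = DecisionTreeClassifier(max_depth=5, random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print(f"Accuracy on the full test set: {accuracy_score(y_test, y_pred):.3f}")
```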
Next, we will keep only the test set on the graph and assign an index to each point for tracking purposes (c). We will plot both the true labels (y_true) and the predicted labels (y_pred) on the same graph, shifting the predicted labels slightly upward for clarity (d). This allows us to see where the model made mistakes - misclassifications appear as a difference in color between the true labels (darker circles) and the predicted ones (shaded circles).
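A rough way to reproduce panels (c) and (d) with matplotlib, assuming the arrays from the sketch above; the vertical offset value is arbitrary.

```python
import matplotlib.pyplot as plt

# True labels as solid markers, predictions shifted slightly upward so
# mismatching colors (misclassifications) are easy to spot.
offset = 0.08
plt.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap="viridis", label="y_true")
plt.scatter(X_test[:, 0], X_test[:, 1] + offset, c=y_pred, cmap="viridis",
            alpha=0.4, label="y_pred (shifted)")
for i, (x1, x2) in enumerate(X_test):
    plt.annotate(str(i), (x1, x2), fontsize=6)  # index for tracking each point
plt.legend()
plt.show()
```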
We can now select points in the test set and analyze their neighborhoods, exploring how accuracy fluctuates as we move from one neighborhood to another. Imagine a particular point, say point 83, along with its neighboring points, visualized as a convex hull around them. In this case, the accuracy is a perfect 100%, as all predicted labels in that neighborhood match the true labels.
Similarly, when we examine point 76 and its neighbors, we again achieve 100% accuracy since the predicted yellow labels perfectly align with the true yellow labels.
However, not all points enjoy such perfect accuracy. Take point 96 as an example. The accuracy drops significantly to just 40% because only two out of five predicted labels match the actual ones, indicating a local mismatch between the predictions and reality. This sharp contrast between neighborhoods reveals how localized performance can vary depending on the region of the dataset.
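The exact neighborhood construction (the convex hulls drawn around each point) is only shown graphically in the post; as an approximation, a simple k-nearest-neighbor neighborhood yields the same kind of local accuracy. The neighborhood size of 5 is an assumption, and the point indices 83, 76, and 96 come from the walkthrough and only make sense if the test set has at least that many points.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def local_accuracy(idx, X_test, y_test, y_pred, n_neighbors=5):
    """Accuracy over one test point and its nearest neighbors in the test set."""
    nn = NearestNeighbors(n_neighbors=n_neighbors).fit(X_test)
    _, neighbor_idx = nn.kneighbors(X_test[idx].reshape(1, -1))
    members = neighbor_idx[0]  # the point itself plus its closest neighbors
    return np.mean(y_test[members] == y_pred[members])

# Neighborhoods around the points discussed above (indices from the figure).
for point in (83, 76, 96):
    print(point, local_accuracy(point, X_test, y_test, y_pred))
```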
Through various experiments, both with synthetic and real-world datasets, we have observed that local accuracies fluctuate based on the data point and the size of its neighborhood. These fluctuations are often due to the unique topological features of each dataset - subtle structures and complexities that influence how well a model performs in certain areas. These nuances lead to variations in accuracy across different neighborhoods, offering valuable insight into the localized performance of machine learning models.
To better understand how accuracy changes as we reduce the dataset, we introduce the Accuracy Degradation Profile (ADP). This method helps evaluate the robustness of machine learning models by gradually reducing the size of the test set and observing how accuracy is affected. For each portion of the reduced dataset, ADP reports the mean accuracy across the samples - each evaluated over its neighborhood - and compares it against a defined threshold.
Simply put, ADP shows us how well a model handles shrinking datasets and where performance starts to noticeably degrade. To make this concept even clearer, let's walk through an example that demonstrates how ADP works in practice.
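Below is a simplified sketch of how such a profile could be computed, continuing the example above. The choice of size factors, the neighborhood size, how the minimum acceptable accuracy threshold is derived, and how the contiguous reduced subsets are selected are all assumptions here, not the authors' exact procedure.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def accuracy_degradation_profile(X_test, y_test, y_pred, baseline_acc,
                                 size_factors=np.arange(0.95, 0.45, -0.05),
                                 n_neighbors=5, acc_margin=0.05):
    """Build a simplified ADP table: for each size_factor, keep that fraction of
    the test set, average the per-point neighborhood accuracies, and flag a
    degradation when the mean falls below the minimum acceptable accuracy."""
    min_acceptable = baseline_acc - acc_margin  # assumed definition of the threshold
    # Placeholder for the contiguous reduction: keep the points closest to the
    # test-set centroid, so each reduced subset is a spatially coherent region.
    order = np.argsort(np.linalg.norm(X_test - X_test.mean(axis=0), axis=1))
    rows = []
    for factor in size_factors:
        keep = order[: int(factor * len(X_test))]
        Xs, ys, ps = X_test[keep], y_test[keep], y_pred[keep]
        _, idx = NearestNeighbors(n_neighbors=n_neighbors).fit(Xs).kneighbors(Xs)
        mean_acc = float(np.mean([np.mean(ys[nbrs] == ps[nbrs]) for nbrs in idx]))
        rows.append({
            "size_factor": round(float(factor), 2),
            "mean_neighborhood_acc": round(mean_acc, 3),
            "min_acceptable_acc": round(min_acceptable, 3),
            "status": "acc degrad!" if mean_acc < min_acceptable else "OK",
        })
    return rows

# Usage, continuing the example above:
profile = accuracy_degradation_profile(X_test, y_test, y_pred,
                                       baseline_acc=np.mean(y_test == y_pred))
```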
Let us perform ADP over the selected dataset (Table 1):
Let's break down the Accuracy Degradation Profile (ADP) matrix to see how each column helps you understand model performance in detail:
Every time the term 'acc degrad!' appears in the ADP matrix, watch out (!!!): for the given minimum number of samples, the mean accuracy over the convex hulls (each comprising the sample being examined together with its neighborhood) is not higher than the minimum acceptable accuracy. When that happens, it may be valuable to look deeper into the topology of the dataset to understand what is actually happening with the classifier applied to the reduced dataset. Even if the dataset is multidimensional, dimensionality reduction can be used to view it in 2D.
The accuracy degradation factor (ADF) is the first size_factor at which an accuracy degradation occurs. ADF ranges from 0.0 to 1.0: the closer it is to 1.0, the less robust the model is to dataset shifts according to the ADP methodology, and the closer it is to 0.0, the more robust it is.
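Given a profile like the sketch above, extracting ADF is simply a matter of finding the first flagged size_factor; returning 0.0 when nothing is flagged is an assumption for the no-degradation case.

```python
def accuracy_degradation_factor(profile):
    """First size_factor at which the ADP reports a degradation.

    The profile is iterated from the largest size_factor down, so the first
    flagged row corresponds to the smallest reduction that already degrades.
    Returns 0.0 if no degradation is flagged (assumed convention here)."""
    for row in profile:
        if row["status"] == "acc degrad!":
            return row["size_factor"]
    return 0.0

adf = accuracy_degradation_factor(profile)
print(f"ADF = {adf}")
```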
We can now plot the information from Table 1:
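With the hypothetical profile from the sketch above (not the actual Table 1 values), an analogous plot could be produced like this:

```python
import matplotlib.pyplot as plt

factors = [row["size_factor"] for row in profile]
mean_accs = [row["mean_neighborhood_acc"] for row in profile]
threshold = profile[0]["min_acceptable_acc"]

plt.plot(factors, mean_accs, marker="o", label="mean neighborhood accuracy")
plt.axhline(threshold, color="red", linestyle="--", label="min acceptable accuracy")
plt.gca().invert_xaxis()  # read from the full test set toward the smallest subset
plt.xlabel("size_factor (fraction of the test set kept)")
plt.ylabel("accuracy")
plt.legend()
plt.show()
```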
ADP and ADF serve as a flag, an early warning system to give practitioners a sense of where potential issues may lie. By highlighting potential degradation points, these metrics help identify areas that might require preemptive action, allowing teams to address problems before they lead to significant performance loss. They provide a call to action, encouraging further investigation, deeper analysis, and proactive interventions.
Once a warning is identified, the next step is to locate the specific problem area. ADF pinpoints the smallest shift in the data at which neighborhoods begin to show a significant drop in accuracy, which makes it a powerful tool for identifying performance degradation. Combined with mitigations such as data rebalancing, feature engineering, or retraining the model, this targeted exploration allows practitioners to allocate resources effectively to the most impactful issues.
A key strength of these metrics is that they make very few assumptions about the data or model. This flexibility makes ADP and ADF applicable across various types of datasets and machine learning models. They act as an all-direction measure of shifts, distilled into just one number, offering a comprehensive overview of potential changes. This simplicity provides a high-level summary that can be easily communicated to stakeholders, while still retaining enough nuance to guide technical efforts.