An Overview of Data Contamination: The Causes, Risks, Signs, and Defenses

Authored by
Jonathan Lutch
Machine Learning Intern at Holistic AI
Xin Guan
Machine Learning Intern at Holistic AI
Zekun Wu
Machine Learning Researcher at Holistic AI
Published on
Jul 16, 2024

The world has seen a massive surge in the production and use of AI in the last decade, especially in the field of large language models (LLMs). Accordingly, one of the defining corporate challenges of our time is how to deploy AI in a way that not only cuts costs, maximizes productivity, and energizes products, but is also safe, ethical, and adheres to legal standards. This has led to an explosion in the number and types of benchmarks available to evaluate and monitor LLM performance.

However, there is a subtle issue lurking: data contamination. Data contamination occurs when parts or characteristics of a test set leak into a training set, meaning the two datasets are no longer separate. This can lead to artificially high scores on benchmarks, decreasing their value, as well as to amplified biases in models. Combating this critical issue is key for companies and researchers who plan to utilize, fine-tune, or study LLMs. As such, this blog post provides an overview of what data contamination is, why it can be harmful, how to detect it, and how to mitigate it in the context of LLMs.

What is data contamination?

LLMs get their general knowledge of the world from pretraining, where they are fed a wide variety of information from hundreds of thousands or even millions of sources. If a particular model is going to be used in a specific context, then it also might be fine-tuned, or fed information pertinent to its use case. This results in a chatbot that can deal with all types of questions, but “specializes” in a certain area.

Since the pretraining and fine-tuning of LLMs require vast amounts of data, it is often the case that the training data contains some of the question-answer pairs that will be used to test the model. This is known as data contamination, and it can lead to artificially high benchmark scores that paint an overly positive picture of an LLM's ability. For example, our research found that a version of GPT-2 scored 15 percentage points higher when tested on a benchmark that had seeped into its training data than on a benchmark it had never seen before. The table below shows the scores of the model on the two test sets before and after it was fine-tuned on the contaminated data. TQA-contaminated is a test set the model has partially seen before, while TQA-uncontaminated contains only new questions.

Model | Accuracy on TQA-contaminated | Accuracy on TQA-uncontaminated
Gpt2-without-contamination | 0.051471 | 0.050459
Gpt2-contaminated-with-answers-only | 0.036765 | 0.016514
Gpt2-contaminated-with-questions-answer-pairs | 0.143382 | 0.115596
Gpt2-contaminated-with-full-instructions | 0.443015 | 0.275229
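
To make the comparison concrete, here is a minimal sketch of how such a before-and-after measurement might be run. The `ask_model` callable is a hypothetical stand-in for whatever inference interface wraps the model being audited; it is not part of our experimental code.

```python
from typing import Callable, List, Tuple

def accuracy(ask_model: Callable[[str], str], qa_pairs: List[Tuple[str, str]]) -> float:
    """Fraction of questions whose model answer matches the reference (case-insensitive)."""
    correct = sum(
        1 for question, reference in qa_pairs
        if ask_model(question).strip().lower() == reference.strip().lower()
    )
    return correct / len(qa_pairs)

def contamination_gap(ask_model, contaminated_set, uncontaminated_set) -> float:
    """A positive gap suggests the model is benefiting from test items it saw in training."""
    return accuracy(ask_model, contaminated_set) - accuracy(ask_model, uncontaminated_set)
```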

Because data contamination can inflate performance in this way, a chatbot that is not up to company standards might be deployed, risking reputational damage or, if left unchecked, even real harm.

Detection of data contamination

Given the damage that data contamination can cause, how do we detect it? Xu et al. break detection approaches down into two categories: matching-based methods and comparison-based methods. As the names suggest, matching-based methods inspect testing and training data for overlap, whether in question-answer pairs or in information about the other set (for example, the training data might reveal to the LLM that exactly half of the answers on a multiple-choice test set are A). Comparison-based methods, by contrast, compare the responses of LLMs across different datasets to sniff out symptoms of data contamination.

Matching-based Methods

Say you had an English passage and wanted to know whether it showed up in a training dataset. The simplest and most intuitive way to find out is to scan the training data for that string of words, an approach known as information retrieval. Deng et al. implemented this approach using a search engine to find and return documents in a model's training corpora that contain significant overlap with a question or answer in a test set, so that duplicated information could be identified and removed.

However, finding an overlap in the two datasets does not always mean that a dataset is contaminated, particularly if the phrase or sequence of words is common. For example, the phrase “the cat” shows up in millions of books and screenplays — stumbling across it in a training set does not necessarily mean any of those works have been memorized.
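
As a rough illustration of the matching idea (not Deng et al.'s actual search-engine pipeline), a simplified check can flag any test item that shares a sufficiently long word n-gram with the training corpus. Using long n-grams (a dozen or so words) keeps common phrases like "the cat" from triggering false alarms.

```python
from typing import Iterable, List, Set

def ngrams(text: str, n: int = 13) -> Set[tuple]:
    """All word n-grams of a piece of text, lower-cased."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def flag_overlapping_test_items(test_items: List[str],
                                training_docs: Iterable[str],
                                n: int = 13) -> List[int]:
    """Indices of test items sharing at least one n-gram with the training corpus."""
    corpus_ngrams: Set[tuple] = set()
    for doc in training_docs:
        corpus_ngrams |= ngrams(doc, n)
    return [i for i, item in enumerate(test_items) if ngrams(item, n) & corpus_ngrams]
```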

A more sophisticated matching technique is guessing analysis. Guessing analysis asks an LLM to answer an almost impossible, or at least improbable, question about a particular work. If the LLM gets the question right, it is a fairly good sign that the model has seen the work before and that its training data is contaminated. Chang et al. explored this approach using the contents of books, where the task for the model was to guess the title of the book. This task was chosen because knowing the title requires knowledge of that particular book, not of the English language or the broader world of literature. Thus, if the model guessed the title correctly, it had likely seen the exact work before.

Another example of guessing analysis, implemented by Deng et al., is taking a multiple-choice dataset, removing one of the options for each question, and then asking an LLM to fill in that missing option. An example of this can be seen in the figure below. While valid replacement options are fine, an exact match of the missing answer is incredibly suspicious and indicative of data contamination.

[Figure: A multiple-choice question with one option removed, which the model is asked to reconstruct (Deng et al.)]
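
In code, the missing-option probe from the figure above might look something like the sketch below, where `query_llm` is a hypothetical wrapper around whatever model is being audited.

```python
def missing_option_probe(question: str, options: dict, hidden_key: str, query_llm) -> bool:
    """Return True if the model reproduces the hidden answer option verbatim."""
    shown = "\n".join(f"{key}. {text}" for key, text in options.items() if key != hidden_key)
    prompt = (
        "The following multiple-choice question is missing one option.\n"
        f"Question: {question}\n"
        f"{shown}\n"
        f"Write out the missing option {hidden_key} exactly."
    )
    guess = query_llm(prompt)
    return guess.strip().lower() == options[hidden_key].strip().lower()
```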

Comparison-based Methods

Comparison-based methods leverage the fact that LLMs respond differently to benchmarks they have been contaminated by. The open-ended nature of how exactly these responses change has led to a multitude of studies using a host of creative signals. For example, Huang et al. found that GPT-4 was significantly worse at solving Codeforces competitive programming problems released after September 2021, implying that the LLM's success on earlier challenges was partially due to question-answer memorization.
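
A minimal sketch of this kind of temporal comparison is shown below; it simply assumes each benchmark problem carries a release date and a solved/unsolved result for the model under test.

```python
from datetime import date
from typing import List, Tuple

def solve_rate_gap(results: List[Tuple[date, bool]], cutoff: date) -> float:
    """`results` pairs each problem's release date with whether the model solved it."""
    before = [solved for released, solved in results if released <= cutoff]
    after = [solved for released, solved in results if released > cutoff]
    rate = lambda outcomes: sum(outcomes) / len(outcomes) if outcomes else float("nan")
    return rate(before) - rate(after)  # a large positive gap hints at memorization
```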

Another type of comparison-based detection, introduced by Dong et al., uses the distribution of an LLM's outputs to check whether the model is unusually likely to produce certain content. As the figure below shows, a model that has never seen a question before will generally produce a spread of different "correct" answers. Conversely, if the LLM already has an inkling of what the answer should be, its outputs will look very similar to one another. Using this approach, Dong et al. were able to show that ChatGPT's training data was partially contaminated by the HumanEval dataset!

[Figure: Output distributions of a model on unseen versus previously seen questions (Dong et al.)]
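
In the same spirit (though not Dong et al.'s exact method), one can sample the model several times at a non-zero temperature and measure how similar the samples are to each other; unusually peaked, near-identical outputs on benchmark items are a warning sign. `sample_llm` is a hypothetical function returning one sampled completion per call.

```python
from difflib import SequenceMatcher
from itertools import combinations

def sample_similarity(prompt: str, sample_llm, n_samples: int = 10) -> float:
    """Average pairwise similarity (0 to 1) across repeated samples for one prompt."""
    samples = [sample_llm(prompt) for _ in range(n_samples)]
    pairs = list(combinations(samples, 2))
    return sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)
```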

Mitigation of data contamination

The common saying goes that acknowledging a problem is the first step to fixing it, but how do we actually deal with data contamination? Like all complex problems, there is no single perfect solution; AI vendors are best served by combining several approaches. Xu et al. break mitigation strategies down into three types: finding new data, regenerating old data, and bypassing benchmarks altogether.

Finding New Data

The first approach is simply to test LLMs on new data. For example, if you have a chatbot that was trained on every Wall Street Journal article up to 2020, you could create a test set from articles written in 2024. Nothing in the 2024 test set could appear in the pre-2020 training data simply because it did not exist yet. You could also keep the 2024 articles private and never train on them at all; that way, even in the year 2050, you would have a benchmark that shares no overlap with your training set.
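
A sketch of this time-split idea, assuming the corpus is stored as records with a hypothetical `published` date field:

```python
from datetime import date

def build_post_cutoff_benchmark(articles, cutoff=date(2020, 12, 31)):
    """Keep only articles that could not have been in a corpus frozen at `cutoff`."""
    return [article for article in articles if article["published"] > cutoff]
```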

The downsides to this "private benchmark" approach are threefold. First, LLMs are always hungry for data, and withholding information from them could make them worse at the task they were designed for. Second, the example above assumes that future publications contain minimal references to the 2024 articles; private datasets that are frozen in time are always at risk of having their "freshness" contaminated by later data that references them. Third, if new data is itself generated by models trained on past data, the fact that those models never saw the private set might skew that new data in one direction or another.

A close cousin of the private benchmark is the dynamic benchmark. Running with the Wall Street Journal example, instead of taking our test set to always be the 2024 articles, we could annually update this set to only contain the articles from the most recent year. This way, our benchmark will remain fresh, and we do not have to worry about future articles referencing our test set. This also frees us up to train on the 2024 articles in the future. One of the more prominent examples of a dynamic benchmark is LiveBench, which is updated with new questions monthly, many of which come from recently published sources.
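
The refresh itself can be as simple as re-filtering the corpus on a rolling window, as in the sketch below (again assuming a hypothetical `published` field):

```python
from datetime import date, timedelta

def refresh_benchmark(articles, today=None):
    """Keep only articles from roughly the last year, so the benchmark stays fresh."""
    today = today or date.today()
    window_start = today - timedelta(days=365)
    return [article for article in articles if article["published"] >= window_start]
```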

The two main drawbacks of a dynamic benchmark are cost and feasibility. It is expensive to perpetually generate new test cases every month or year, and finding new sources also takes resources. Moreover, dynamic benchmarks that focus on a particular field might be impossible to generate if that area is stagnant or slow-moving, such as the study of particle physics.

Dataset Manipulation

A second approach to combating data contamination is dataset manipulation. This can take a few forms, one of which is removing overlapping content between testing and training sets. Unfortunately, this approach is time- and resource-intensive.

A different tactic is to "generate" a new test set from an existing one by modifying its contents. This could mean rewording a question, flipping a semantic negative, or adding needless context. The DyVal line of work implements this regeneration via a system called Meta Probing Agents (MPA). Below is an example from that paper showing the effect of MPA on a single multiple-choice question. Like removing overlapping data, this approach is resource-intensive, since a whole new benchmark is being created.

[Figure: A multiple-choice question before and after regeneration by Meta Probing Agents]
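
As a toy illustration of this kind of regeneration (far simpler than MPA itself), the sketch below shuffles a question's answer options, so that a memorized letter no longer helps, and prepends harmless filler context; real systems typically also reword the question itself with another LLM.

```python
import random

def regenerate_item(question: str, options: list, answer_index: int, seed: int = 0):
    """Return a perturbed copy of a multiple-choice item plus its new answer index."""
    rng = random.Random(seed)
    order = list(range(len(options)))
    rng.shuffle(order)
    new_options = [options[i] for i in order]          # shuffled answer choices
    new_answer_index = order.index(answer_index)       # where the correct answer moved to
    new_question = "Answer the following question carefully. " + question  # filler context
    return new_question, new_options, new_answer_index
```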

Third-Party Judges

The issue of data contamination fundamentally stems from the fact that the information used to train a model can (and, as models grow larger and are trained on more and more corpora, increasingly will) end up containing some of the data used to test that model. Remove the test set and you remove the issue of data contamination. The only question is: how do you rate AI models now? Enter third-party judges.

Picture two models, A and B. Give them the same question and they will likely produce different answers. You can then have a "judge" determine which answer is better, allowing for a grading process. Interestingly, the judge can be either a human or an LLM.

In the case where the judge is human, ratings are crowdsourced, with the theory being that the wisdom of the crowd would reflect the true “correct” answer. Chiang et al. implemented this approach by creating Chatbot Arena, a website where users can ask two LLMs questions and then rate which response was better.
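
One standard way to turn such crowdsourced pairwise votes into a leaderboard is an Elo-style rating update, sketched below; this is a generic formulation, not Chatbot Arena's exact ranking pipeline.

```python
def elo_update(rating_a: float, rating_b: float, winner: str, k: float = 32.0):
    """Update two model ratings given a single vote: 'A', 'B', or 'tie'."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    score_a = {"A": 1.0, "B": 0.0, "tie": 0.5}[winner]
    rating_a += k * (score_a - expected_a)
    rating_b += k * ((1.0 - score_a) - (1.0 - expected_a))
    return rating_a, rating_b
```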

When the judge is another LLM, a system such as TreeEval, presented by Xiang et al., might be used. Here, a separate judge LLM compares the responses of two models on various types of questions and tool-usage challenges, as can be seen in the figure below. When using an approach like this, however, it is important that the model trusted with the role of judge is unbiased with respect to the use cases of the LLMs being tested. Otherwise, the models being trained against its verdicts will develop the same biases as the judge!

[Figure: A judge LLM comparing the responses of two models in TreeEval (Xiang et al.)]
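
A generic sketch of pairwise LLM-as-judge grading is shown below (it is not TreeEval's specific protocol). Judging each pair twice with the answer order swapped is a common guard against one well-known judge bias, position bias; `query_judge` is a hypothetical call to the judge model.

```python
def judge_pair(question: str, answer_a: str, answer_b: str, query_judge) -> str:
    """Ask the judge twice with the answers swapped; only a consistent verdict counts."""
    template = (
        "You are an impartial judge. Question:\n{q}\n\n"
        "Response 1:\n{r1}\n\nResponse 2:\n{r2}\n\n"
        "Reply with exactly '1' or '2' to indicate the better response."
    )
    first = query_judge(template.format(q=question, r1=answer_a, r2=answer_b)).strip()
    second = query_judge(template.format(q=question, r1=answer_b, r2=answer_a)).strip()
    if first == "1" and second == "2":
        return "A"
    if first == "2" and second == "1":
        return "B"
    return "tie"  # inconsistent verdicts across orderings are treated as a tie
```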

Machine Unlearning

Another novel approach to the data contamination issue is machine unlearning. Currently, LLMs "learn" by taking in vast amounts of data and uncovering trends in it. Machine unlearning is the process of taking a trained model and removing the specifics of the data it was trained on while keeping the trends it learned intact. With this approach, we could feed contaminated data, personal information, or anything private into an LLM, have it learn from that data, and then erase the materials it used to acquire its knowledge. The result would be an LLM that, for example, knows certain groups are at risk for Parkinson's, but does not know who actually has Parkinson's. While still in its infancy, machine unlearning could grow into a powerful tool for combating both data contamination and privacy concerns. A more comprehensive overview of the present-day subject can be found here.
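
As a heavily simplified sketch of one family of unlearning methods, the snippet below performs gradient ascent on a "forget" batch while continuing ordinary training on a "retain" batch. It assumes a Hugging Face-style PyTorch model whose forward pass returns a loss when labels are provided; real unlearning pipelines are considerably more involved.

```python
def unlearning_step(model, forget_batch, retain_batch, optimizer, alpha: float = 1.0):
    """One optimization step: push the loss up on data to forget, down on data to keep."""
    optimizer.zero_grad()
    forget_loss = model(**forget_batch).loss   # specifics we want the model to forget
    retain_loss = model(**retain_batch).loss   # general capability we want to preserve
    loss = -alpha * forget_loss + retain_loss  # ascend on forget, descend on retain
    loss.backward()
    optimizer.step()
    return float(loss)
```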

Conclusion

It is clear that we are adopting artificial intelligence in a way that is both pervasive and permanent, and it is imperative that we take the necessary steps to evaluate and monitor the models we use. Data contamination is a major threat to our ability to benchmark models accurately. As such, we need to be constantly searching for signs of contamination using matching- and comparison-based techniques that look for direct overlaps in data, differences in performance on regenerated test sets, and large jumps in scores tied to the date a test set was released.

Moreover, defending against data contamination should be a priority, and the developers and maintainers of machine learning models should always aim to bolster their ability to accurately assess their products. This can take the form of testing on dynamic test sets, regularly regenerating old test sets, or even forgoing the traditional benchmark altogether in favor of an LLM or human judge.

To find out how Holistic AI can help you with your LLM governance, schedule a demo with our expert team to discover how our AI Safeguard can help you shield against generative AI risk.

DISCLAIMER: This blog article is for informational purposes only. This blog article is not intended to, and does not, provide legal advice or a legal opinion. It is not a do-it-yourself guide to resolving legal issues or handling litigation. This blog article is not a substitute for experienced legal counsel and does not provide legal advice regarding any situation or employer.
