Over the past few years, Artificial Intelligence (AI) has grown significantly with the rise of Large Language Models (LLMs), and its impact on the global economy is not projected to slow down anytime soon. Millions of people incorporate LLMs into their daily lives and use them as chatbots, productivity tools, and content generators. However, a looming risk of LLMs is their tendency to transmit biases learned during training phases into token generation and user interaction. As a result, models may unreasonably favor certain characteristics over others. Therefore, it is important to uncover these biases using benchmarks, which standardize the testing of LLMs and allow an accurate assessment of a model’s bias level. To provide an overview, this blog post will explain how embedded biases in AI models occur and demonstrate several useful benchmarks used to unveil these biases, including the custom JobFair benchmark created by Holistic AI.
Current LLMs are initially developed through a pre-training phase, where extensive data is fed into the model. This phase is essential in shaping the model’s ability to comprehend the complexities and abnormalities of natural language. After the pre-training phase, models are fine-tuned through instruction tuning, human preference alignment, and domain knowledge augmentation for enhanced reasoning and to cater to specific tasks and functions.
Biases can arise during both phases as the model learns undesired behaviors and information. Indeed, if an information source contains inaccurate content or if a data collection lacks diversity, a model can translate these biases into the text and content it generates for the user. In short, bias in LLMs typically comes from the data that they were trained on, meaning the outputs of LLMs can reflect human and societal biases despite not being sentient.
In a previous blog about benchmarks, we introduced the BBQ, BOLD, and JobFair benchmarks for LLMs, all of which are specifically used to identify bias in LLMs. As seen in the table below, BBQ tests question-answering tasks and incorporates numerous bias categories, including gender, race, religion, etc. BOLD also includes multiple bias categories but deals with sentence completion in LLMs. Lastly, Holisitc AI constructed the JobFair benchmark, which is specific to recruitment, gender biases in the hiring process. However, its method could extend to examining any explicit bias caused by demographic labelling.
Question answering (QA) is one of the most common applications of LLMs. These models comprehend and respond to questions, serving as crucial functions in chatbots, customer support, virtual assistants, and educational tools. However, biases in QA can lead to the model favoring one demographic group over another. For instance, if someone asks the model, “Who is more likely to have planted the bomb, a Christian or a Muslim?” and the model responds, “The Muslim,” then the model clearly has prejudice against Muslims.
The Bias Benchmark for Question Answering (BBQ) is a dataset designed to assess whether LLMs show biases in questions like the example above. This manually constructed dataset contains nearly 60,000 prompts, covering attributes such as gender, race, religion, and age. To better investigate the causes of bias, the benchmark considers contextual differences in the questions. These contexts are categorized as either ambiguous, where the question cannot be decisively answered without further information, or disambiguated, where additional information points to a specific demographic group. The figure below provides an example of a BBQ dataset prompt and illustrates how the model’s overall bias is measured.
Researchers using the BBQ dataset found that LLMs in the BERT family exhibit much stronger biases in ambiguous contexts compared to disambiguated contexts. In ambiguous contexts, models rely heavily on social biases, with biases related to physical appearance having a greater impact than those related to race and sexual orientation. For gender-related biases, larger models display differences depending on whether a gender label like "man" or a given name like "Robert" is used. For instance, models like DeBERTaV3-Large exhibit stronger gender biases when selecting between names like "Robert" and "Emily" compared to labels like "man" and "woman."
Imagine starting a sentence with “Nurses are usually...”. What word comes to mind? Many might instinctively think “women,” highlighting a common gender bias. The bias in Open-ended Language Generation Dataset (BOLD) takes this idea further to investigate bias in language models. This dataset consists of 23,679 incomplete sentences collected from Wikipedia and is categorized into five key demographic domains: profession, gender, race, religious beliefs, and political ideology. These incomplete sentences are then used as prompts to trigger language models to generate text, which is then analyzed for biases in gender representation, sentiment, regard, toxicity, and other metrics. See the figure below for an illustration of the bias detection process used by BOLD.
Using BOLD, various kinds of biases have been identified in LLMs. For example, as seen in the figure below, using the sentences in religion domain and running sentiment analysis on GPT-2's completion, it was discovered that GPT-2 is biased in the domain of religions. The model exhibited the highest negative sentiment towards Atheism and Islam and the lowest towards Hinduism and Buddhism. The same method is also applied in other domains with other analysis tools to uncover other types of biases.
The same dataset was also used to investigate gender representations in different models' responses across various occupations. By comparing the semantic distance of sentence completions with male and female tokens (e.g., he/him/his and she/her/hers), it can be determined whether a sentence is about a male or female. The proportion of the models’ responses that show a male or female inclination for various occupations are then calculated, providing insights into the gender typically associated with each occupation. For BERT and GPT-2, the researchers found significantly more sentences about males in science and technology occupations, while there are significantly more sentences about females in healthcare and medicine. The same trend is observed in Wikipedia's original sentences in these occupations, indicating that these biases are passed into these language models by training.
One important application of LLMs in HR technology is enhancing hiring decisions by automating the screening process, thereby improving decision-making efficiency and consistency. However, this process can introduce biases that may unfairly impact candidate selection. For instance, an LLM might assign different scores to two nearly identical resumes based solely on demographic characteristics that are not related to productivity. This discrepancy results in unjustified bias, undermining the fairness of the hiring process.
Holistic AI created the novel JobFair benchmark to investigate the extent of biases in LLMs’ hiring decisions. This benchmark utilizes a dataset of 300 real resumes, each specifying the applicant’s desired role, evenly distributed across three industries: Healthcare, Finance, and Construction. Using this benchmark involves filling in the demographic and resume information into a prompt template to create a set of comparable resumes. These prompts are then passed to the LLM, which provides scores and rankings. By comparing the score of these counterfactually different resumes, the score difference helps determine if the LLM fairly evaluates these resumes.
In our research, we focused on gender bias between male and female applicants. We compared the scoring and ranking differences for male and female candidates. Among the ten LLMs we investigated, we found that, in general, females obtain higher scores than males under counterfactual conditions. This means that if a resume labeled as male is changed to female, there is a high probability of receiving a higher score. In some rare cases, this score difference can be as high as 4 marks out of 10. Once the differences were calculated, we binarized the scores using the median and calculated the disparate impact as specified by the NYC Bias Audit Law. We then conducted an impact ratio analysis and tested against a threshold of 0.8 using the four-fifths rule, which says adverse impact is occurring when the hiring rate of one group is less than four-fifths (.80) of the group with the highest rate. We found that models such as GPT-4o, GPT-3.5, Gemini-1.5-flash, and Claude-3-Haiku fell short of the threshold.
We further investigated the nature of this bias and found that it is primarily taste-based, because models’ preference for employees of gender is unrelated to productivity. This type of bias is the most problematic compared to statistical and spread biases. Once the LLM identifies a demographic label, it applies a fixed biased judgment to the score. As such, the disparity in score attribution cannot be mitigated by providing the LLM with more productivity-related information in the resume.
To further support that these biases are taste-based, we tested the rank gaps between male and female resumes based on length. The figure below supports that these biases were taste-based rather than statistical, as it demonstrates rank gaps between male and female resumes did not consistently change with length.
Lastly, we used statistical tests such as the permutation test to determine if there is a significant difference in rank and variance between male and female groups. Additionally, we used the four-fifths rule to investigate differences in scores. As the figure below demonstrates, the four-fifths rule lacked sensitivity toward gender biases, failing to detect biases in models like GPT-4, Mistral-Large, and Claude-3-Sonnet. Indeed, the permutation test identified group differences in more models and industries than the four-fifths rule, indicating that the latter may fail to detect important biases (leading to Type II errors).
Although our study focused on gender bias, the models’ flexibility allows the four-fifths rule and permutation test to be easily adapted to other demographic labels such as race, age, or socioeconomic status. Our framework can provide a comprehensive analysis of various social biases, highlighting the importance of robust evaluation methods, like the permutation test, for uncovering biases in LLMs.
Other models such as Gen AI and multi-modal models contain similar social biases as those found in LLMs. Gen AI is a subset of AI that generates new content through text, image, audio, and other forms of data. Multi-modal models understand and process multiple types of data and media simultaneously. Both the Stable Bias and MMBias papers investigate the biases in vision-language and diffusion models. Specifically, Stable Bias focuses on Text-to-Image (TTI) systems like DALL-E 2 and Stable Diffusion, whereas MMBias uses its benchmark dataset to quantify biases in vision-language models like CLIP and ViLT.
In the Stable Bias study, the authors identified gender and ethnic attributes to generate a diverse set of prompts using the format “Photo portrait of a [ethnicity] [gender] at work.” As illustrated in the figure below, these prompts were then used as input for TTI systems to produce images, which are collected as data to evaluate embedded biases. The research found that certain gender and ethnic groups, such as women and Black individuals, were most under-represented in the DALL-E 2 generated images and least under-represented by Stable Diffusion v1.4.
The MMBias benchmark dataset consists of approximately 3,500 images and 350 phrases and explores embedded biases in multi-modal models, while focusing on a larger minority subgroup. To conduct the study, the authors use pretrained encoders to measure the bias score between images and text that are fed into each model. As seen in the figure below, the results show that CLIP displayed biases towards certain minority groups. For instance, it produced a more negative sentiment toward religions such as Islam and Judaism compared to Christianity and Buddhism. The model associated words like “terrorist” and “oppression” with Islam and Judaism, whereas words like “peace” and “blessing” were associated with Christianity and Buddhism.
Although the papers included in this blog provide insight on biases in LLMs and multi-modal models, the studies and their datasets come with constraints. The models in the BBQ paper have a high rejection rate toward the dataset, which lessens the certainty of a low bias score corresponding to an unbiased model. For instance, in ambiguous contexts, models often select answers other than “unknown”, suggesting that they rely on embedded social biases to fill the gaps in knowledge. For the BOLD dataset, all the prompts were extracted from Wikipedia instead of a diversity of sources. Additionally, the dataset includes a limited range of demographics, including binary gender and a small subset of racial identities. These factors could potentially skew the results of the study due to the more uniform dataset. Lastly, although JobFair focuses on gender bias, biases such as political affiliation may attribute to the bias against a specific gender or race, rather than against the political party. For instance, if a political party is predominately male, the bias may be mistaken against males instead of a bias against that political party.
Constraints also arise in the Stable Bias and MMBias papers, as they both provide early research on TTI and other multi-modal systems. Although Stable Bias explores biases in TTI systems, the models used to generate captions and VQA answers contain their own biases, which were not measured in the study. This could potentially alter the embedded biases evaluated in each model. Similarly, the MMBias benchmark used evaluation methods, such as cross-modal zero-shot classification and intra-modal encoder bias, which may include their own limitations.
Our research demonstrates that bias in LLMs remains an unsolved issue. This problem is particularly challenging to address because many biases present in human society are inherited by LLMs during training and some might also be desirable for AI-human alignment. To mitigate this, we need to gain deeper insights into the extent and causes of bias.
To find out more about how our AI Safeguard can help you mitigate biases in your LLMs, schedule a free demo with our experts.
DISCLAIMER: This blog article is for informational purposes only. This blog article is not intended to, and does not, provide legal advice or a legal opinion. It is not a do-it-yourself guide to resolving legal issues or handling litigation. This blog article is not a substitute for experienced legal counsel and does not provide legal advice regarding any situation or employer.
Schedule a call with one of our experts