The explosion of generative AI models and their widespread applications has highlighted a vital need for organizations developing and deploying the highly complex models to ensure their integrity, safety, security, and reliability. In general, two crucial processes contribute to this assurance: model evaluations and algorithm audits. While both aim to assess and enhance the trustworthiness of AI systems, they operate distinctively and each serves a unique purpose in the journey toward responsible AI deployment.
This blog post provides an overview of model evaluations and algorithm audits, and how they should be jointly leveraged to ensure the responsible, safe, and ethical deployment of powerful generative models.
Key Takeaways:
Model evaluations (or evals) are essentially a set of socio-technical interventions that investigate, assess, and determine the effectiveness of an algorithmic system's broader capabilities. This covers a diverse set of activities, including evaluating a model’s efficacy across baseline performance levels (assessing their reading comprehension, common-sense reasoning, and mathematical capabilities), model modalities (text, image, audio, video, and emerging multimodal configurations), and model risks (such as cybersecurity, mis/disinformation, stereotyping and biased outputs, and toxicity in generated text), among others.
The most popular types of interventions used in model evaluation exercises are Benchmarking and Adversarial Testing (or Red-Teaming).
Benchmarking, a success-testing intervention, helps assess an AI system's performance against a predefined task by mapping an AI system’s outputs to a dataset of prompts and responses. An examination or test of sorts, these can help gauge a model’s capabilities across general performance considerations, such as common-sense reasoning (eg: HellaSwag) and math capabilities (MATH), in mitigating risks like bias (e.g., WinoBias and BOLD) and model-generated risks like toxicity (e.g., ToxiGen) across text, audio, image and other multimodal interfaces. Benchmarking can be integrated into a model's development process to continuously enhance its efficacy through iterative improvements, a process known as hill-climbing. Additionally, benchmarking aids in comparing a model's efficacy with that of competing models.
While a potent tool for measuring and improving a model’s baseline levels of efficacy, benchmarking is only useful in cases where the model risks are known. However, this is often not the case - new and unknown model risks can emerge frequently. Consequently, the current state of benchmarking techniques are not adequate to detect future vectors of harm, and large generative models sometimes show a tendency to conceal risky and deceptive behaviors. In such cases, adversarial testing, or red teaming emerges as an effective intervention.
Drawing from cybersecurity risk management principles, adversarial testing or red teaming helps uncover vulnerabilities and latent risks inherent in deploying a generative model. Here, the model is subjected to various stress-testing techniques, including curated prompt attacks, training data extraction, backdooring models, adversarial prompting, data poisoning, and exfiltration mechanisms, through which systemic vulnerabilities can be unearthed.
A socio-technical measure, red-teaming can (and should) be participatory exercises leveraging the collective intelligence of experts across domains. For example, adversarial testing  for mis/disinformation can involve convening practitioners from the medical and public health domain, journalism, media studies, sociology, behavioral science, computer science, machine learning, safety policy, and hate-speech, among others. This not only fosters inclusivity but also enriches the evaluation process by considering a broader spectrum of potential risks and vulnerabilities.
While useful interventions to assess and monitor the safety and efficacy levels of a system, solely relying on model evaluations may lead to instances wherein unknown biases remain ignored, external oversight mechanisms are absent, and appropriate levels of transparency, explainability, and accountability are not provided. Furthermore, when a generative model is fine-tuned and appropriated for a dedicated downstream use-case, its risk profile must be investigated in the context that it is operating in through third-party interventions. Considering the profound societal and ethical impacts generative models can be associated with, it becomes imperative to complement model evaluations with independent algorithm audits.
Algorithm Audits are socio-technical mechanisms of independent and impartial system evaluation, whereby an auditor with no conflict of interest, can assess an AI system’s reliability, detect unidentified errors, discrepancies, system deficits, and vulnerabilities, and offer recommendations on the same. In addition to providing a defensible and robust mechanism to verify a model’s safety, algorithm audits also differ from model evaluations, in that they serve as potent signals of regulatory compliance.
Algorithm audits for generative models generally encompass two critical dimensions: Governance Audits and Outcome Audits. Governance Audits scrutinize the organizational procedures, accountability structures, oversight mechanisms, and quality management systems of AI developers, ensuring that robust processes are in place to govern AI development and deployment. On the other hand, Outcome Audits evaluate the effectiveness, robustness, biases, explainability, and privacy implications of the results produced by AI systems. Essentially, algorithm audits involve thoroughly examining not only the code, but also the processes and systems set in place during a generative model’s development.
As generative models are also fine-tuned for specific use-cases, a third type of audit has surfaced – research has indicated the need to institutionalize application audits, which involve a down-streamed model being continuously monitored and audited for context-specific risks and harms.
Model evaluations and Algorithm Audits are different sides of the same coin; they should be leveraged simultaneously to build and establish a robust evidence base to validate a model's safety and risk management capabilities. As this evidence base expands, so should the model's interaction with the external world. This progression is gradual and iterative, requiring model audits to be implemented at different stages of development, and scaling up access for external researchers to interrogate the model through interventions like adversarial testing. Additionally, model evaluations and audits can identify red lines and dangerous zones where development should halt until adequate protective measures are enhanced and implemented for production.
Concerted regulatory momentum to legislate generative models, and specifically Large Language Models (LLMs) is accelerating across the world – and companies desirous of developing and deploying such models must proactively ensure they fulfill the increasing list of obligations.
Holistic AI takes a comprehensive, interdisciplinary approach to responsible AI. We combine technical expertise with ethical analyses to assess systems from multiple angles.
Safeguard, Holistic AI's LLM Auditing product, acts as a robust solution to identify and address these issues through a multifaceted approach:
To find out more about how Holistic AI can help you, schedule a call with our expert team.
DISCLAIMER: This blog article is for informational purposes only. This blog article is not intended to, and does not, provide legal advice or a legal opinion. It is not a do-it-yourself guide to resolving legal issues or handling litigation. This blog article is not a substitute for experienced legal counsel and does not provide legal advice regarding any situation or employer.
Schedule a call with one of our experts