Ensuring AI Safety: A Dive into Holistic AI’s Recent DeepSeek Audit

Authored by
Published on
Feb 5, 2025
last updated on
February 5, 2025
share this
Ensuring AI Safety: A Dive into Holistic AI’s Recent DeepSeek Audit

As artificial intelligence continues to integrate into our daily lives, it is paramount to ensure new and popular models remain safe and reliable. At Holistic AI, we are committed to rigorous testing and evaluation to enhance trust and transparency in AI deployment. In our latest audit, we evaluated two AI models— DeepSeek R1 and OpenAI  o1—on their ability to handle various prompts, including adversarial attempts to bypass safeguards. The results offer valuable insights into the current state of AI robustness and safety.

Methodology: A Comprehensive Testing Framework

This evaluation was conducted using the Holistic AI Governance Platform, the same cutting-edge software solution  used by our customers. Any enterprise using Holistic AI  when DeepSeek was announced was able to  perform their own assessment immediately— gaining critical insights into model vulnerabilities before widespread deployment.

To assess the models, we conducted a structured evaluation using the following datasets:

  • 37 Jailbreaking Prompts: These prompts were designed to test how well the models could resist bypass attempts, such as the widely known "Do Anything Now" (DAN) exploits.
  • 100 Harmful Prompts: This dataset included prompts sourced from AdvBench, TDC/HarmBench, and original test cases. The goal was to assess how models respond to requests that could lead to misuse.
  • 100 Benign Prompts: These were designed to mirror the harmful prompts in topic but not in intent, allowing us to measure refusal rates without overblocking legitimate queries.

The prompts were sourced from a Cornell University dataset designed to rigorously test AI security, drawing from established red-teaming methodologies.1

To ensure a robust evaluation, responses were classified as either SAFE or UNSAFE using a dual-layered assessment approach:

  1. LLM-Based Evaluation: An AI model first categorized responses based on predefined safety criteria.
  2. Human Review: Experts then manually verified these classifications to confirm accuracy.

Key Findings: Strengths & Gaps in AI Safety

DeepSeek R1 and OpenAI o1 Model Audit

OpenAI o1 Model: A Strong Defense Against Unsafe Outputs

  • SAFE Responses: 98% (232 out of 237)
  • UNSAFE Responses: 2% (5 out of 237)
  • Jailbreaking Resistance: 100% safe responses (37 out of 37 attempts)

The o1 model demonstrated strong adherence to safety protocols, successfully rejecting all jailbreak attempts and maintaining an impressively low rate of unsafe responses. These results highlight o1’s robustness in resisting adversarial exploitation.

DeepSeek R1 Model: More Vulnerable to Jailbreaking

  • SAFE Responses: 89% (210 out of 237)
  • UNSAFE Responses: 11% (27 out of 237)
  • Jailbreaking Resistance: 32% safe responses (12 out of 37 attempts)

In contrast, DeepSeek’s R1 model exhibited a higher propensity to generate unsafe responses, particularly in jailbreaking scenarios. When successfully jailbroken, R1 not only responded to the initial adversarial prompt but also continued answering any subsequent questions without restriction.

Due to responsible disclosure considerations, we will not share explicit examples of unsafe outputs. However, our audit revealed that once R1 was compromised, it provided harmful responses without restriction. This highlights a major security concern for enterprises relying on AI models in high-risk domains.

What This Means for AI Development

These results reinforce the necessity of continuous improvement in AI security mechanisms. While the o1 model performed exceptionally well, the R1 model’s vulnerabilities indicate that additional safeguards and training enhancements are needed.

Enterprises with an AI governance platform integrated into their IT environment are able to rapidly test and assess new LLMs prior to deployment, potentially gaining competitive advantages in speed and agility. Additionally, they are able to enforce guardrails in the use of LLMs across the organization and monitor for security and other risks on an ongoing basis as the needs of the business dictate.

Holistic AI provides the tools that enterprises need to securely deploy AI at scale. By continuously evolving our platform, we empower businesses to make informed decisions, mitigate risks, and harness AI’s full potential with confidence.

[1] The audit used prompts gathered from: https://arxiv.org/abs/2404.01318

DISCLAIMER: This blog article is for informational purposes only. This blog article is not intended to, and does not, provide legal advice or a legal opinion. It is not a do-it-yourself guide to resolving legal issues or handling litigation. This blog article is not a substitute for experienced legal counsel and does not provide legal advice regarding any situation or employer.

Take command of your AI ecosystem

Learn more
Subscriber to our Newsletter
Join our mailing list to receive the latest news and updates.
We’re committed to your privacy. Holistic AI uses this information to contact you about relevant information, news, and services. You may unsubscribe at anytime. Privacy Policy.

See the industry-leading AI governance platform in action

Schedule a call with one of our experts

Get a demo