
Understanding the Robustness Attackers for Binary Classification

By Franklin Cardenoso Fernandez, Researcher at Holistic AI
Published on Oct 15, 2024

As machine learning (ML) models continue to revolutionize critical sectors such as healthcare, finance, and cybersecurity, ensuring their robustness becomes more important than ever. Robustness refers to a model's ability to maintain reliable performance under various adversarial conditions, such as noise, data distribution shifts, or, more critically, deliberate adversarial attacks aimed at deceiving the system.

A particularly striking example of the need for robustness can be seen in the autonomous vehicle industry. Researchers demonstrated how seemingly harmless adversarial stickers on road signs confused a self-driving car’s autopilot system. This subtle manipulation caused the vehicle to misinterpret lane markers and veer dangerously into the wrong lane—an incident that underscores how even minor alterations can lead to major safety risks. This case makes it clear that machine learning models must be not only accurate but also resilient against deliberate adversarial attacks.

In the context of binary classification, black-box attackers such as the HopSkipJump (HSJ) and Zeroth-Order Optimization (ZOO) methods exploit similar vulnerabilities. Without direct access to the model’s internal architecture, they can manipulate inputs to degrade performance, often with serious consequences. In this blog, we’ll explore these attack methods in detail and discuss how to measure and enhance model robustness so that models remain reliable in real-world applications.

Key takeaways:

  • Understand robustness in machine learning
  • Understand how robustness attackers are categorized
  • Describe the HSJ and ZOO attack methods
  • Present robustness evaluation metrics

What Is Robustness in Machine Learning?

Robustness is a critical aspect of machine learning that indicates how well a model performs under unexpected or challenging circumstances. This matters because the objective of building an ML model is not only to achieve high accuracy on its training data but also to generalize well to unseen data. This characteristic is particularly important in real-world applications where data can be noisy, incomplete, or manipulated.

Robustness assessment: adversarial attacks

The best-known and most widely adopted strategies for evaluating ML models' robustness are adversarial attacks. These methods craft adversarial samples that remain close to the correct input samples according to some similarity metric, yet receive a different prediction from the model.
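
To make this concrete, here is a minimal, hypothetical sketch of what "adversarial" means in this setting. It assumes a scikit-learn-style classifier with a `predict` method and an illustrative L2 budget `eps`; both are assumptions for the example, not part of any specific attack:

```python
import numpy as np

def is_adversarial(model, x, x_adv, eps=0.5):
    """Check whether x_adv counts as an adversarial example for x:
    it must stay within an L2 budget of the original input while
    receiving a different predicted label from the model."""
    close_enough = np.linalg.norm(x_adv - x) <= eps                 # similarity constraint
    label_flipped = (model.predict(x.reshape(1, -1))[0]
                     != model.predict(x_adv.reshape(1, -1))[0])     # prediction changes
    return bool(close_enough and label_flipped)
```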

Depending on how these methods access the target model, adversarial attackers can be categorized into white-box and black-box attacks: white-box attackers have complete knowledge of the model and its training data, whereas black-box attackers can only access the model through input-output queries.

Figure 1. General pipeline of a black-box adversarial attacker. We assume that the target model has been trained previously.
  • White-Box Methods: White-box robustness methods rely on a clear understanding of the model's architecture, including its parameters and gradients. This allows more control when crafting and evaluating adversarial examples based on the model's behaviour. Among the most representative white-box techniques are the gradient-based methods, which calculate the gradient of the loss function with respect to the input data. By making small perturbations in the direction of the gradient, attackers can create adversarial examples that are often imperceptible to humans but lead to incorrect classifications (see the sketch after this list).
  • Black-box methods: Unlike in white-box attacks, in the black-box setting the model is treated as an opaque entity, and the attacker focuses on its output rather than its internal structure. Although these attackers tend to be less effective than white-box methods, they are particularly useful when the model is proprietary or too complex to understand fully, since they still allow attackers to infer the model's vulnerabilities. Based on the type of model output available, these methods can be further categorized into score-based attacks, which have access to the full class probabilities (such as the Zeroth-order optimization method), and decision-based attacks, which have access only to the predicted labels (such as the HopSkipJump method).
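
As a concrete illustration of the gradient-based white-box idea above, here is a minimal FGSM-style sketch for a logistic regression classifier. The weights `w`, bias `b`, and step size `eps` are assumptions introduced for the example, not part of the attacks discussed in this post:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm_logistic(x, y, w, b, eps=0.1):
    """White-box, gradient-based perturbation for a logistic regression
    model p(y=1|x) = sigmoid(w.x + b). For the cross-entropy loss the
    gradient w.r.t. the input is (p - y) * w, so stepping eps in the
    direction of its sign nudges the sample towards misclassification."""
    p = sigmoid(np.dot(w, x) + b)   # model confidence for the positive class
    grad = (p - y) * w              # d(loss)/dx, computed analytically (white-box access)
    return x + eps * np.sign(grad)  # small, often imperceptible perturbation
```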

Zeroth-order optimization attackers

The Zeroth-order optimization (ZOO) attacker builds on the white-box Carlini & Wagner (C&W) attack, which crafts adversarial samples by solving an optimization problem that uses the logit layer of the targeted model and the model's gradients.

Unlike the C&W method, the ZOO attacker modifies the loss function so that it depends only on the model's confidence scores and computes approximate gradients with a finite-difference method. These modifications yield a black-box attack that does not require training a surrogate model to produce adversarial samples.

The core idea of these methods is to use the value of the objective function f(x) at any input x, known as the zeroth-order oracle, to evaluate two very close points during the optimization process and thereby generate adversarial samples. The advantage of this approach is that these attackers do not explicitly require gradient calculations; they only evaluate the objective function at different points, which is why they can be considered derivative-free optimization methods.
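
The finite-difference idea can be sketched in a few lines. In the snippet below, `score_fn` is an assumed black-box wrapper that queries the target model and returns the confidence score being attacked; the perturbation size `h` is illustrative:

```python
import numpy as np

def zoo_gradient_estimate(score_fn, x, h=1e-4):
    """Coordinate-wise zeroth-order gradient estimate: query the
    objective (the zeroth-order oracle) at two very close points per
    coordinate and take a symmetric finite difference, so no analytic
    gradient of the model is ever required."""
    grad = np.zeros_like(x, dtype=float)
    for i in range(x.size):
        e_i = np.zeros_like(x, dtype=float)
        e_i[i] = h                                              # perturb a single coordinate
        grad[i] = (score_fn(x + e_i) - score_fn(x - e_i)) / (2.0 * h)
    return grad
```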

HopSkipJump attacker

In contrast to ZOO, HopSkipJump is a decision-based attacker that uses only the binary information from output labels instead of confidence scores or function values. Like the ZOO method, however, its core component is a gradient-direction estimate.

Its iterative algorithm crafts the perturbations with three steps at each iteration: gradient-direction estimation, boundary search via a binary search, and step-size updating via geometric progression, repeated until the perturbation becomes successful; this design allows the HopSkipJump attacker to operate as a hyperparameter-free algorithm. Another key aspect of this algorithm is that it only requires a limited number of queries to the target model, making it a query-efficient attacker capable of outperforming other decision-based methods.
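
To give a feel for the decision-based setting, here is a minimal sketch of the boundary-search step (the second of the three steps above). It assumes a hard-label oracle `predict_label` that returns only the predicted class, and a starting point `x_adv` that is already misclassified; both names are assumptions for the example:

```python
import numpy as np

def boundary_binary_search(predict_label, x_orig, x_adv, tol=1e-3):
    """Binary search along the segment between the original input and a
    misclassified point, using only predicted labels, to locate a point
    close to the decision boundary (still adversarial, near x_orig)."""
    y_orig = predict_label(x_orig)
    low, high = 0.0, 1.0                    # interpolation coefficients towards x_adv
    while high - low > tol:
        mid = (low + high) / 2.0
        x_mid = (1.0 - mid) * x_orig + mid * x_adv
        if predict_label(x_mid) != y_orig:  # still fools the model: move towards x_orig
            high = mid
        else:                               # back on the correct side: move towards x_adv
            low = mid
    return (1.0 - high) * x_orig + high * x_adv
```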

Measuring Robustness

To ensure that an ML model is robust, especially in binary classification tasks, it's essential to have effective metrics for evaluation. Some common methods for measuring robustness are the adversarial accuracy and empirical robustness metrics.

Adversarial accuracy focuses on the consistency of predictions by evaluating the proportion of correctly classified samples that maintain their classifications when subjected to intentional perturbations, which means that a higher score indicates that the assessed model presents better robustness. On the other hand, empirical robustness quantifies the minimum perturbation required to induce a misclassification, thereby evaluating how much an adversarial input must deviate from the original to mislead the model successfully. A higher empirical robustness score signifies that more significant perturbations are needed to alter the model's predictions, indicating enhanced resistance to adversarial manipulation. Together, these metrics offer valuable insights into a model's durability, ensuring that machine learning systems are accurate and secure against potential threats.
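
Both metrics can be sketched in a few lines. The snippet below is an illustrative, simplified formulation assuming a scikit-learn-style classifier, arrays `x` and `x_adv` of original and attacked inputs, and true labels `y`; the exact definitions used by a given library may differ:

```python
import numpy as np

def adversarial_accuracy(model, x, x_adv, y):
    """Share of originally correct predictions that keep their label
    under attack: higher means more consistent, more robust decisions."""
    correct = model.predict(x) == y
    still_correct = model.predict(x_adv) == y
    return (correct & still_correct).sum() / max(correct.sum(), 1)

def empirical_robustness(model, x, x_adv, y):
    """Average relative L2 size of the successful perturbations: higher
    means larger changes are needed to flip the model's predictions."""
    fooled = (model.predict(x) == y) & (model.predict(x_adv) != y)
    if not fooled.any():
        return np.inf                       # no sample was successfully attacked
    deltas = np.linalg.norm(x_adv[fooled] - x[fooled], axis=1)
    return float(np.mean(deltas / np.linalg.norm(x[fooled], axis=1)))
```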

Practical implementations

If you want to put into practice everything presented in this blog, you can use the robustness module from the holisticai package, where you will find demos for the described attackers as well as for the presented metrics. Feel free to test it with your own models to observe how well they perform under varying input conditions.

Conclusion

In summary, robustness attackers pose a significant challenge to binary classification models in machine learning. By understanding the various attack methods, both white-box and black-box, and implementing effective strategies for measuring robustness, practitioners can build models that are not only accurate but also reliable under a range of conditions. As the field continues to evolve, focusing on robustness will be essential for advancing the trust and effectiveness of machine learning systems.

DISCLAIMER: This blog article is for informational purposes only. This blog article is not intended to, and does not, provide legal advice or a legal opinion. It is not a do-it-yourself guide to resolving legal issues or handling litigation. This blog article is not a substitute for experienced legal counsel and does not provide legal advice regarding any situation or employer.
