Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Analysing Adversarial Attacks with Linear Probing, published by Yoann Poupart on June 17, 2024 on The AI Alignment Forum.
This work was produced as part of the Apart Fellowship. @Yoann Poupart and @Imene Kerboua led the project; @Clement Neo and @Jason Hoelscher-Obermaier provided mentorship, feedback and project guidance.
Here, we present a qualitative analysis of our preliminary results. We are at the very beginning of our experiments, so any suggestion about the setup, the experiments, or new ideas is welcome. We also welcome pointers to any relevant papers we may have missed. For a better introduction to interpreting adversarial attacks, we recommend reading scasper's post: EIS IX: Interpretability and Adversaries.
Code is available on GitHub (still a work in progress).
TL;DR
Basic adversarial attacks produce adversarial examples that are indistinguishable from the original images to the human eye.
To explore the features-vs-bugs view of adversarial attacks, we trained linear probes on a classifier's activations (we used CLIP-based models with future multimodal work in mind). The idea is that linear probes trained to detect basic concepts can help analyse and detect adversarial attacks.
As the concepts studied are simple (e.g. color), they can be probed from every layer of the model.
We show that the concept probes' predictions are indeed shifted by the adversarial attack, but only at the later layers.
Hence, we can naively detect adversarial attacks by observing the disagreement between early layers' probes and later layers' probes.
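As a rough illustration of this last point, here is a minimal Python sketch of the naive detection rule; the probe objects and the feature-extraction step are assumptions for the sake of the example, not our exact pipeline.

```python
import numpy as np

def flag_adversarial(early_probe, late_probe, early_feats, late_feats):
    """Naive detector sketch: flag an input when the concept probe trained on an
    early layer disagrees with the one trained on a late layer."""
    early_pred = early_probe.predict(early_feats)  # concept predicted from early-layer activations
    late_pred = late_probe.predict(late_feats)     # concept predicted from late-layer activations
    return np.asarray(early_pred) != np.asarray(late_pred)  # True where the layers disagree
```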
Introduction
Motivation
Adversarial samples are a concerning failure mode of deep learning models. It has been shown that by optimising an input, e.g. an image, the output of the targeted model can be shifted at the attacker's will. This failure mode appears repeatedly across domains like image classification and text encoders, and more recently in multimodal models, where it enables arbitrary outputs or jailbreaking.
Hypotheses to Test
This work was partly inspired by the feature vs bug world hypothesis. In our particular context this would imply that adversarial attacks might be caused by meaningful feature directions being activated (feature world) or by high-frequency directions overfitted by the model (bug world). The hypotheses we would like to test are thus the following:
The adversarial perturbation doesn't change the features. Nakkiran et al. found truly adversarial samples that are "bugs", while Ilyas et al. seem to indicate that adversarial attacks exploit real features.
Different optimisation schemes don't produce the same categories of adversarial samples (features vs. bugs); some might be more robust, e.g. with respect to the representations they induce.
Post Navigation
First, we briefly present the background on adversarial attacks and linear probing that we rely on. Then we present our experimental setup and experiments, aiming to understand the impact of adversarial attacks on linear probes and to see whether we can detect them naively. Finally, we present the limitations and perspectives of our work before concluding.
Background
Adversarial Attacks
For our experiments, we begin with the most basic and well-known adversarial attack, the Fast Gradient Sign Method (FGSM). This intuitive method takes advantage of the classifier's differentiability to optimise the adversary's goal (misclassification) with a gradient step on the input. It can be described as:
$$\hat{x} = x + \epsilon \cdot \mathrm{sign}\big(\nabla_x \mathcal{L}(x, y)\big)$$
with $x, y$ the original image and label, $\hat{x}$ the adversarial example and $\epsilon$ the perturbation amplitude. This simple method can be derived in an iterative form:
$$x_{n+1} = \Pi_\epsilon\big(x_n + \alpha \cdot \mathrm{sign}(\nabla_x \mathcal{L}(x_n, y))\big), \qquad x_0 = x,$$
with $\alpha$ the step size, and seen as projected gradient descent, with the projection $\Pi_\epsilon$ ensuring that $\|x - x_{n+1}\|_\infty \le \epsilon$. In our experiments, $\epsilon = 3$.
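As a concrete illustration, here is a minimal PyTorch sketch of FGSM and its iterative, projected form. The model, the loss choice (cross-entropy), the step size, and the variable names are assumptions rather than our exact implementation, and clamping to the valid pixel range is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, epsilon):
    """One-step FGSM: move x in the direction of the sign of the loss gradient."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    return (x + epsilon * x.grad.sign()).detach()

def iterative_fgsm(model, x, y, epsilon, alpha, n_steps):
    """Iterative FGSM / PGD: repeated gradient-sign steps, each followed by a
    projection back into the L-infinity ball of radius epsilon around x."""
    x_orig = x.clone().detach()
    x_adv = x_orig.clone()
    for _ in range(n_steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        # Projection: enforce ||x_adv - x_orig||_inf <= epsilon.
        x_adv = torch.clamp(x_adv, x_orig - epsilon, x_orig + epsilon)
    return x_adv
```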
Linear Probing
Linear probing is a simple idea where you train a linear model (a probe) to predict a concept from the internals of the interpreted target model. The prediction performance is then attributed to the knowledge contained in the target model's latent representations.
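For concreteness, here is a minimal sketch of how such a probe might be trained on activations captured with a forward hook. `model`, `target_layer`, and `dataloader` are assumed to exist (a frozen CLIP-based classifier, one of its intermediate modules, and a loader of image/concept-label batches), and scikit-learn's logistic regression stands in for whatever linear probe is actually used.

```python
import torch
from sklearn.linear_model import LogisticRegression

# Assumed to be defined elsewhere: `model` (a frozen CLIP-based classifier),
# `target_layer` (one of its intermediate modules) and `dataloader`
# (yielding batches of images and simple concept labels, e.g. colour).
activations = {}

def save_activation(module, inputs, output):
    # Keep a flattened, detached copy of this layer's output for probing.
    activations["feats"] = output.detach().flatten(start_dim=1).cpu()

handle = target_layer.register_forward_hook(save_activation)

features, labels = [], []
for images, concept_labels in dataloader:
    with torch.no_grad():
        model(images)  # the forward pass fills `activations` via the hook
    features.append(activations["feats"])
    labels.append(concept_labels)
handle.remove()

X = torch.cat(features).numpy()
y = torch.cat(labels).numpy()

probe = LogisticRegression(max_iter=1000).fit(X, y)  # the linear probe
print("train accuracy:", probe.score(X, y))
```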