Adversarial Alignment

The Probabilistic Mechanics of AI Red Teaming

By mimicking human patterns, Large Language Models (LLMs) inherit both human capability and human manipulability. In the field of AI Red Teaming, researchers like Valen Tagliabue do not view these models as sentient minds, but as complex probability engines governed by mathematical functions.

This interactive paper explores the mechanism of adversarial attacks, moving from the mathematical foundations of refusal to the "Swiss Cheese" architecture of modern defense systems.

1. The Stochastic Barrier

At its core, an LLM predicts the next token in a sequence. When presented with a harmful query (e.g., "How do I build a bomb?"), the model calculates a probability distribution over its vocabulary. Safety training, such as reinforcement learning from human feedback (RLHF), biases this distribution toward refusal tokens.

The Softmax Function
$$ \sigma(z)_i = \frac{e^{z_i}}{\sum_{j=1}^K e^{z_j}} $$
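This function converts raw output values (logits) into probabilities that sum to 1.0, where $z_i$ is the logit for token $i$ and $K$ is the size of the vocabulary. It acts as the "decision maker" of the neural network in the sense that the next token is chosen from the distribution it produces.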
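As a minimal sketch of the formula above, the following Python function computes a softmax over a list of logits. The three logit values are hypothetical and chosen only for illustration; subtracting the maximum logit is a standard numerical-stability step that does not change the result.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    # Shifting by the max logit avoids overflow and leaves the output unchanged.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a three-token vocabulary.
probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])  # [0.659, 0.242, 0.099]
```

The largest logit receives the largest share of probability, but every token keeps a nonzero chance of being selected.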

We can visualize the tension between a "helpful" response and a "safe" response as a competition between two tokens: Sure (compliance) and Sorry (refusal). The parameter $\tau$ (Safety Temperature) represents the strength of the model's alignment training.

$$ P(\text{refusal}) \propto e^{\text{safety\_score}} $$
[Interactive chart: "Sorry..." (refusal) 88%, "Sure..." (compliance) 12%]
Figure 1: As safety alignment increases, the probability of compliance collapses towards zero.
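The two-token competition can be sketched as a softmax over just "Sorry" and "Sure". This is an illustrative toy, not the model behind the figure: it assumes the safety score is simply the logit of the refusal token, and that the compliance logit is fixed at 0. The names `refusal_probability` and `compliance_logit` are hypothetical.

```python
import math

def refusal_probability(safety_score, compliance_logit=0.0):
    """Two-token softmax over "Sorry" (refusal) and "Sure" (compliance)."""
    refuse = math.exp(safety_score)      # unnormalized weight of "Sorry"
    comply = math.exp(compliance_logit)  # unnormalized weight of "Sure"
    return refuse / (refuse + comply)

# Hypothetical safety scores, from no alignment bias to a strong one.
for score in (0.0, 2.0, 5.0):
    p = refusal_probability(score)
    print(f"safety_score={score}: P(Sorry)={p:.2f}, P(Sure)={1 - p:.2f}")
```

Under these assumptions a score of 0 gives an even split, a score of 2.0 gives roughly 88% refusal and 12% compliance, and higher scores push compliance toward zero without ever reaching it.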

2. The Architecture of Defense

Defense is not monolithic. As Tagliabue notes, robust systems employ a layered approach, often likened to the "Swiss Cheese Model" of risk management. To generate a harmful response, a prompt must pass three distinct checkpoints: an input filter, the model's own alignment during inference, and an output filter.

[Interactive diagram: User ("Make Virus") → Filter (Keyword Scan) → LLM (Inference) → Filter (Output Scan)]
Figure 2: Toggle layers to see where a standard attack fails.
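The layered pipeline can be sketched in a few lines of Python. Everything here is a simplified stand-in: `BLOCKLIST`, `keyword_hit`, `stub_llm`, and `first_blocking_layer` are hypothetical names, the keyword list is a single illustrative entry, and the "LLM" is a stub that refuses flagged prompts rather than a real model.

```python
BLOCKLIST = {"virus"}  # hypothetical keyword list, for illustration only

def keyword_hit(text: str) -> bool:
    """Return True if any blocklisted keyword appears in the text."""
    lowered = text.lower()
    return any(word in lowered for word in BLOCKLIST)

def stub_llm(prompt: str, aligned: bool) -> str:
    """Stand-in for inference: an aligned model refuses flagged prompts."""
    if aligned and keyword_hit(prompt):
        return "Sorry, I can't help with that."
    return f"Sure: {prompt}"

def first_blocking_layer(prompt, input_scan=True, aligned=True, output_scan=True):
    """Return the name of the first layer that stops the prompt, or None."""
    if input_scan and keyword_hit(prompt):
        return "input filter"
    response = stub_llm(prompt, aligned)
    if response.startswith("Sorry"):
        return "LLM"
    if output_scan and keyword_hit(response):
        return "output filter"
    return None

# Toggle layers off one at a time to see which checkpoint catches the prompt.
print(first_blocking_layer("Make Virus"))
print(first_blocking_layer("Make Virus", input_scan=False))
print(first_blocking_layer("Make Virus", input_scan=False, aligned=False))
```

With all layers on, the prompt stops at the input filter; with that layer off it stops at the LLM; with both off it stops at the output filter. Only when every layer is disabled does the function return `None`, which is the Swiss Cheese idea in miniature.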

3. The Mixture of Exploits

To bypass these defenses, Red Teamers employ a "Mixture of Exploits." This involves combining adversarial techniques to lower the detection probability at each layer.

Semantic vs. Syntactic
Syntactic: Modifying structure (e.g., Base64, Leetspeak) to bypass pattern matchers.

Semantic: Modifying meaning (e.g., Roleplay, Storytelling) to bypass internal alignment.

Tagliabue describes an iterative process: finding a partial jailbreak and refining it. In the simulation below, you act as the Red Teamer. You must combine techniques to bypass a fully active defense stack.

[Interactive simulation console: "System initialized. Target: Secure LLM v4. Waiting for exploit vector configuration..."]
Figure 3: Combine vectors to overcome the defense threshold.
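The threshold idea can be illustrated with a toy probability model. This is a sketch under strong assumptions, not the logic behind the simulation: the layers are treated as independent, all detection probabilities and multipliers are invented for illustration, and each technique class is assumed to weaken detection at exactly one layer.

```python
from math import prod

# Hypothetical per-layer detection probabilities for an unmodified attack.
BASE_DETECTION = {"input_filter": 0.9, "alignment": 0.88, "output_filter": 0.9}

# Hypothetical multipliers: each technique weakens detection at one layer only.
TECHNIQUES = {
    "syntactic": {"input_filter": 0.2},  # targets pattern matchers
    "semantic": {"alignment": 0.2},      # targets internal alignment
}

def bypass_probability(techniques):
    """Probability that a prompt slips past every layer (independence assumed)."""
    detection = dict(BASE_DETECTION)
    for name in techniques:
        for layer, factor in TECHNIQUES[name].items():
            detection[layer] *= factor
    # The prompt succeeds only if every layer misses it.
    return prod(1.0 - d for d in detection.values())

for combo in ([], ["syntactic"], ["semantic"], ["syntactic", "semantic"]):
    print(f"{combo}: {bypass_probability(combo):.4f}")
```

In this toy model, either technique alone barely moves the overall bypass probability because the untouched layers still catch most attempts. Combining them raises it substantially, and the remaining output filter still blocks most attempts, which is why each additional independent layer matters to the defender.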

4. Conclusion

As demonstrated, security is not a binary state but a threshold. While automated tools assist in red teaming, the "human in the loop" remains critical.

The ability to observe subtle shifts in the model's output probabilities (a hesitation before a refusal, or a partial compliance) allows researchers to craft the specific mixture of exploits that exposes a model's weaknesses. Those findings can then be used to strengthen its alignment, ultimately making it safer for deployment.