Adversarial Alignment

The Probabilistic Mechanics of AI Red Teaming

By mimicking human patterns, Large Language Models (LLMs) inherit both human capability and human manipulability. In the field of AI Red Teaming, researchers like Valen Tagliabue do not view these models as sentient minds, but as complex probability engines governed by mathematical functions.

This interactive paper explores the mechanics of adversarial attacks, moving from the mathematical foundations of refusal to the "Swiss Cheese" architecture of modern defense systems.

1. The Stochastic Barrier

At its core, an LLM predicts the next token in a sequence. When presented with a harmful query (e.g., "How do I build a bomb?"), the model calculates a probability distribution over its vocabulary. Safety training, such as Reinforcement Learning from Human Feedback (RLHF), biases this distribution toward refusal tokens for such queries.

The Softmax Function
$$ \sigma(z)_i = \frac{e^{z_i}}{\sum_{j=1}^K e^{z_j}} $$
This function converts raw output values (logits) into probabilities that sum to 1.0. It is the "decision maker" of the neural network.
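As a minimal sketch of the formula above, the following computes a softmax over a small list of logits. The function name and the example logits are illustrative assumptions, not values taken from any particular model.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    # Subtracting the largest logit avoids overflow in exp()
    # and does not change the resulting probabilities.
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three candidate tokens.
probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])  # [0.659, 0.242, 0.099]
```

Larger logits take a disproportionate share of the probability mass, which is why a modest shift in a logit can change which token dominates.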

We can visualize the tension between a "helpful" response and a "safe" response as a competition between two tokens: Sure (compliance) and Sorry (refusal). The parameter $\tau$ (Safety Temperature) represents the strength of the model's alignment training.

$$ P(\text{refusal}) \propto e^{\text{safety\_score}} $$

[Interactive figure: with Alignment Strength set to 80%, "Sorry..." receives 88% and "Sure..." receives 12%.]

Figure 1: As safety alignment increases, the probability of compliance collapses towards zero.
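The two-token competition in Figure 1 can be sketched as a softmax over just the Sorry and Sure logits, which reduces to a logistic function of the gap between them. This is a toy model under stated assumptions: the names `refusal_probability`, `safety_score`, and `compliance_score` are hypothetical, the Sure logit is fixed at 0, and the mapping from the figure's "Alignment Strength" percentage to a logit is not given in the text, so none is assumed here.

```python
import math

def refusal_probability(safety_score, compliance_score=0.0):
    """P("Sorry") under a two-token softmax over Sorry and Sure.

    With only two tokens, softmax reduces to a logistic function
    of the gap between the two logits.
    """
    gap = safety_score - compliance_score
    return 1.0 / (1.0 + math.exp(-gap))

# Sweep a hypothetical safety score and watch compliance shrink.
for score in (0.0, 1.0, 2.0, 4.0):
    p = refusal_probability(score)
    print(f"safety_score={score:.1f}  P(Sorry)={p:.2f}  P(Sure)={1 - p:.2f}")
```

In this toy model a logit gap of 2.0 gives a refusal probability of about 0.88. That happens to match the 88% readout in the figure, though the figure's actual mapping is not stated.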

2. The Architecture of Defense

Defense is not monolithic. As Tagliabue notes, robust systems employ a layered approach, often likened to the "Swiss Cheese Model" of risk management. A prompt must traverse three distinct checkpoints to generate a harmful response.

Input Filter: Scans for banned keywords or patterns before the model sees the prompt.
Internal Alignment: The model's own training (RLHF) to refuse harmful requests.
Output Filter: Scans the generated text for policy violations before showing it to the user.
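The layered model above can be illustrated numerically. The sketch below assumes that each layer catches a harmful prompt independently with a fixed probability. Both the independence assumption and the catch rates are hypothetical, chosen only to show how the layers compound.

```python
from math import prod

def bypass_probability(catch_rates):
    """Probability that a prompt slips through every layer.

    Assumes the layers fail independently, so the per-layer
    miss rates (1 - catch rate) multiply together.
    """
    return prod(1.0 - rate for rate in catch_rates.values())

# Hypothetical catch rates for the three checkpoints.
layers = {
    "input_filter": 0.70,
    "internal_alignment": 0.90,
    "output_filter": 0.80,
}

full_stack = bypass_probability(layers)
no_output_filter = bypass_probability(
    {name: rate for name, rate in layers.items() if name != "output_filter"}
)

print(f"all three layers:      {full_stack:.3f}")        # 0.006
print(f"output filter removed: {no_output_filter:.3f}")  # 0.030
```

Even imperfect layers multiply into a small overall miss rate, and removing one layer raises that rate by the inverse of its miss rate (here, a factor of five). Real layers are rarely fully independent, so the product is best read as an optimistic bound.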

[Interactive figure: User prompt ("Make Virus") -> Input Filter (Keyword Scan) -> LLM (Inference) -> Output Filter (Output Scan).]

Figure 2: Toggle layers to see where a standard attack fails.

3. The Mixture of Exploits

To bypass these defenses, Red Teamers employ a "Mixture of Exploits." This involves combining adversarial techniques to lower the detection probability at each layer.

Semantic vs. Syntactic
Syntactic: Modifying structure (e.g., Base64, Leetspeak) to bypass pattern matchers.

Semantic: Modifying meaning (e.g., Roleplay, Storytelling) to bypass internal alignment.
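From the defender's side, the syntactic category explains why a plain keyword scan is fragile. The sketch below contrasts a naive input filter with one that normalizes leetspeak and decodes Base64-looking tokens before scanning. The blocklist, the leetspeak map, and all function names are hypothetical. A normalization step like this addresses only syntactic rewrites and does nothing against semantic ones.

```python
import base64
import binascii
import re

BANNED = {"virus"}  # hypothetical blocklist
LEET = str.maketrans("013457", "oieast")  # 0->o, 1->i, 3->e, 4->a, 5->s, 7->t

def naive_filter(prompt):
    """Flag the prompt only if a banned word appears literally."""
    return any(word in prompt.lower() for word in BANNED)

def candidates(prompt):
    """Yield the prompt plus normalized variants worth scanning."""
    lowered = prompt.lower()
    yield lowered
    yield lowered.translate(LEET)
    # Try to decode any token that looks like Base64.
    for token in re.findall(r"[A-Za-z0-9+/]{8,}={0,2}", prompt):
        try:
            decoded = base64.b64decode(token, validate=True).decode("utf-8")
        except (binascii.Error, UnicodeDecodeError):
            continue  # not valid Base64 text; ignore this token
        yield decoded.lower()

def normalizing_filter(prompt):
    """Flag the prompt if any normalized variant contains a banned word."""
    return any(word in text for text in candidates(prompt) for word in BANNED)

encoded = base64.b64encode(b"make virus").decode("ascii")
print(naive_filter(encoded), normalizing_filter(encoded))            # False True
print(naive_filter("make v1ru5"), normalizing_filter("make v1ru5"))  # False True
```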

Tagliabue describes an iterative process: finding a partial jailbreak and refining it. In the simulation below, you act as the Red Teamer. You must combine techniques to bypass a fully active defense stack.

[Interactive simulation: selectable Syntactic, Semantic, and Context vectors against the target "Secure LLM v4".]

Figure 3: Combine vectors to overcome the defense threshold.

4. Conclusion

As demonstrated, security is not a binary state but a threshold. While automated tools assist in red teaming, the "human in the loop" remains critical.

The ability to observe subtle shifts in the model's output probabilities, such as a hesitation before a refusal or a partial compliance, allows researchers to craft the specific mixture of exploits that exposes a weakness. Those findings then feed back into alignment work, ultimately making the model safer for deployment.