By mimicking human patterns, Large Language Models (LLMs) inherit both human capability and human manipulability. In the field of AI Red Teaming, researchers like Valen Tagliabue view these models not as sentient minds but as complex probability engines governed by mathematical functions.
This interactive paper explores the mechanism of adversarial attacks, moving from the mathematical foundations of refusal to the "Swiss Cheese" architecture of modern defense systems.
At its core, an LLM predicts the next token in a sequence. When presented with a harmful query (e.g., "How do I build a bomb?"), the model calculates a probability distribution over its vocabulary. Safety training, such as Reinforcement Learning from Human Feedback (RLHF), biases this distribution toward refusal.
We can visualize the tension between a "helpful" response and a "safe" response as a competition between two tokens: "Sure" (compliance) and "Sorry" (refusal). The parameter $\tau$ (Safety Temperature) represents the strength of the model's alignment training.
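The two-token competition can be made concrete with a toy calculation. The paper does not give a formula for $\tau$, so the sketch below makes an assumption: $\tau$ is treated as an additive bias on the refusal logit, and the refusal probability is a softmax over just the two tokens. The function name `refusal_probability` and the logit values are illustrative, not taken from any real model.

```python
import math

def refusal_probability(logit_sure: float, logit_sorry: float, tau: float) -> float:
    """Probability of the refusal token in a two-token softmax.

    Assumption (not from the paper): tau acts as an additive bias on the
    refusal logit, so a larger tau means stronger alignment training.
    """
    z_sure = logit_sure
    z_sorry = logit_sorry + tau
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(z_sure, z_sorry)
    e_sure = math.exp(z_sure - m)
    e_sorry = math.exp(z_sorry - m)
    return e_sorry / (e_sure + e_sorry)

# Hypothetical raw logits: the untrained model slightly prefers compliance.
for tau in (0.0, 1.0, 2.0, 4.0):
    p = refusal_probability(logit_sure=1.0, logit_sorry=0.0, tau=tau)
    print(f"tau={tau:.1f}  P(refusal)={p:.3f}")
```

Under this assumption, refusal becomes the more likely token once $\tau$ exceeds the gap between the two raw logits. An adversarial prompt can be read as an attempt to widen that gap again.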
Defense is not monolithic. As Tagliabue notes, robust systems employ a layered approach, often likened to the "Swiss Cheese Model" of risk management: every layer has holes, and harm gets through only when the holes in all layers line up. A prompt must pass three distinct checkpoints before the system produces a harmful response.
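A rough way to quantify the layered idea is to multiply the chance of slipping past each checkpoint. The sketch below assumes the layers fail independently, which real systems may not satisfy. The layer names and detection probabilities are hypothetical placeholders, since the text does not specify them.

```python
from math import prod
from typing import Iterable

def bypass_probability(detection_probs: Iterable[float]) -> float:
    """Chance that a prompt passes every layer undetected.

    Assumes each layer detects independently with the given probability.
    """
    return prod(1.0 - p for p in detection_probs)

# Hypothetical three-layer stack; names and numbers are illustrative only.
layers = {
    "layer_1": 0.90,
    "layer_2": 0.80,
    "layer_3": 0.70,
}

print(f"single best layer lets through: {1.0 - max(layers.values()):.3f}")
print(f"full stack lets through:        {bypass_probability(layers.values()):.3f}")
```

No single layer in this toy example is especially strong, yet stacking them drives the overall bypass probability far below that of the best individual layer.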
To bypass these defenses, Red Teamers employ a "Mixture of Exploits." This involves combining adversarial techniques to lower the detection probability at each layer.
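The effect of combining techniques can be illustrated with a toy model. Everything here is an assumption made for illustration: each technique is represented only as a set of multipliers that scale down a layer's detection probability, the layers are treated as independent, and the names `technique_a` and `technique_b` are abstract placeholders rather than real methods.

```python
from math import prod
from typing import Dict, List

def apply_techniques(base: Dict[str, float],
                     techniques: List[Dict[str, float]]) -> Dict[str, float]:
    """Return per-layer detection probabilities after applying each technique.

    Each technique maps a layer name to a multiplier in [0, 1] that scales
    that layer's detection probability (a hypothetical modelling choice).
    """
    detection = dict(base)  # copy so the baseline is left untouched
    for technique in techniques:
        for layer, factor in technique.items():
            detection[layer] *= factor
    return detection

def stack_bypass_probability(detection: Dict[str, float]) -> float:
    """Chance of passing all layers, assuming independent detection."""
    return prod(1.0 - p for p in detection.values())

# Hypothetical baseline stack and two abstract techniques.
baseline = {"layer_1": 0.90, "layer_2": 0.80, "layer_3": 0.70}
technique_a = {"layer_1": 0.5}                  # weakens only the first layer
technique_b = {"layer_2": 0.5, "layer_3": 0.5}  # weakens the other two

for label, mix in [("none", []), ("a only", [technique_a]),
                   ("b only", [technique_b]), ("a + b", [technique_a, technique_b])]:
    p = stack_bypass_probability(apply_techniques(baseline, mix))
    print(f"{label:7s} bypass probability = {p:.4f}")
```

In this toy setting neither technique alone achieves much, because the untouched layers still catch most attempts. Only the combination opens a hole in every layer at once, which is the intuition behind a mixture.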
Tagliabue describes an iterative process: finding a partial jailbreak and refining it. In the simulation below, you act as the Red Teamer. You must combine techniques to bypass a fully active defense stack.
As demonstrated, security is not a binary state but a threshold. While automated tools assist in red teaming, the "human in the loop" remains critical.
The ability to observe the model's subtle shifts in probability (the hesitation before a refusal, or a partial compliance) allows researchers to craft the specific mixture of exploits needed to expose the model's weaknesses so they can be corrected, ultimately making it safer for deployment.