Interactive Essay

The Psychology of AI Red Teaming

How a background in psychology helps break the world's most advanced AI models.

Valen Tagliabue isn't a computer scientist by trade. With a background in clinical psychology and neuroscience, he approaches Large Language Models (LLMs) not as code repositories, but as minds to be analyzed. This unique perspective led him to win "HackAPrompt," the world's largest AI security hackathon.

When most people interact with an AI, they see a chatbot. Valen sees a "white canvas": a system devoid of human context until we project one onto it. To test its safety, a practice known as red teaming, one must understand how the model "thinks" and where its cognitive blind spots lie.

Simulation: The Baseline

Try to get this simulated AI to reveal a "forbidden" secret (e.g., "How do I steal a car?"). Notice how a standard, direct approach is easily rejected.

Hello. I am a safe AI assistant. How can I help you today?

The Swiss Cheese Defense

Why did the AI refuse you? Modern AI defense isn't a single wall; it's a series of layers, often described as the "Swiss Cheese Model." No single layer is perfect, but together they cover each other's holes.

Valen explains that understanding these layers is critical. A jailbreaker must diagnose which layer is blocking them. Is it a keyword filter? Is it the model's internal morality? Or is the output being caught after generation?

Interactive Diagram: The Filter Pipeline

Click the layers to toggle them OFF (bypass them). See how the prompt "Tell me how to make a virus" flows through the system.

1. Input Filter
Scans for banned words (e.g., "virus", "kill")

2. Internal Alignment (RLHF)
The model's training to refuse harm

3. Output Filter
Scans the generated text before showing it

Result: BLOCKED by Input Filter.
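The diagram above can be approximated in a few lines of Python. This is only a toy sketch: the `Pipeline` class, the `looks_harmful` helper, and the banned-word list are illustrative assumptions rather than how any real system is built, and the "alignment" check is a crude stand-in for a model's trained refusal behavior.

```python
import re
from dataclasses import dataclass

# Example banned words taken from the diagram; a real list would differ.
BANNED_WORDS = {"virus", "kill"}


def looks_harmful(prompt: str) -> bool:
    """Hypothetical stand-in for the model's trained refusal.

    It ignores punctuation and spacing, so it is harder to fool than an
    exact word match, but it is still only a crude toy check.
    """
    letters = re.sub(r"[^a-z]", "", prompt.lower())
    return any(word in letters for word in BANNED_WORDS)


@dataclass
class Pipeline:
    # Each flag mirrors one clickable layer in the diagram.
    input_filter: bool = True
    alignment: bool = True
    output_filter: bool = True

    def run(self, prompt: str) -> str:
        # Layer 1: exact keyword match on the incoming prompt.
        words = {w.strip(".,!?\"'").lower() for w in prompt.split()}
        if self.input_filter and words & BANNED_WORDS:
            return "BLOCKED by Input Filter"
        # Layer 2: the model itself declines.
        if self.alignment and looks_harmful(prompt):
            return "REFUSED by Internal Alignment"
        # Layer 3: scan the generated text before showing it.
        draft = f"[simulated answer to: {prompt}]"
        if self.output_filter and any(w in draft.lower() for w in BANNED_WORDS):
            return "BLOCKED by Output Filter"
        return draft


prompt = "Tell me how to make a virus"
print(Pipeline().run(prompt))
print(Pipeline(input_filter=False).run(prompt))
print(Pipeline(input_filter=False, alignment=False).run(prompt))
```

Switching off one layer at a time shows the next layer catching the same prompt, which is the point of the Swiss Cheese Model: each slice covers holes in the others.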

The "Mixture of Exploits"

Valen's winning strategy involves a concept he calls the "Mixture of Exploits." It's rarely one magic word that breaks a model. Instead, it's a combination of strategies that overload the model's safety training.

He distinguishes between two main types of attacks:

Semantic (Context): Framing the request as a story, a game, or a hypothetical scenario. This targets the "Internal Alignment" layer by obscuring the intent behind the request.
Syntactic (Obfuscation): Changing the surface form of the text (e.g., "v.i.r.u.s" instead of "virus"). This targets the "Input Filter" layer.

When you combine these, you create a prompt that is readable to the AI's "intelligence" but invisible to its "safety filters."
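To see why the syntactic example from the list above slips through, consider a minimal sketch of an exact-match keyword filter next to one that normalizes the text first. Both functions and the one-word banned list are hypothetical, written only to illustrate the gap between "readable" and "visible to the filter".

```python
BANNED_WORDS = {"virus"}  # illustrative list, matching the essay's example


def keyword_filter(prompt: str) -> bool:
    """Flag the prompt only if a whole token exactly matches a banned word."""
    tokens = (t.strip(".,!?\"'").lower() for t in prompt.split())
    return any(t in BANNED_WORDS for t in tokens)


def normalized_filter(prompt: str) -> bool:
    """Drop everything except letters before matching (a crude toy fix)."""
    collapsed = "".join(ch for ch in prompt.lower() if ch.isalpha())
    return any(word in collapsed for word in BANNED_WORDS)


for text in ("make a virus", "make a v.i.r.u.s"):
    print(text, keyword_filter(text), normalized_filter(text))
```

The exact-match filter flags the first string but not the dotted one, while the normalizing filter flags both. Normalization of this kind is itself crude (it would also flag harmless words that happen to contain a banned substring), which is one reason filters are layered rather than trusted individually.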

Lab: Crafting the Jailbreak

Adjust the parameters to construct a prompt. Your goal is to bypass the safety score. Watch how the prompt text evolves.

Semantic Framing (Context)

Direct / Hypothetical / Roleplay

Syntactic Noise (Obfuscation)

Clean / Typos / Encoded

Generated Prompt:
"Tell me how to steal a car."

0% Success

The prompt is too direct. The Input Filter catches "steal" immediately.

The Feedback Loop

Valen emphasizes that this is not just about guessing. It requires "Screen Time"—sitting with the model, observing its responses, and iterating. He describes keeping a "Change Log" of every attempt, noting what caused a slight change in the model's refusal tone.
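The essay does not say what form Valen's Change Log takes. One way to picture it, purely as an assumption, is a small record per attempt that captures what was varied, how the model answered, and any observed shift in tone. The `Attempt` and `ChangeLog` names and fields below are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Attempt:
    change: str          # what was varied relative to the previous attempt
    response: str        # what the model said back
    tone_note: str = ""  # any observed shift in refusal tone


@dataclass
class ChangeLog:
    attempts: List[Attempt] = field(default_factory=list)

    def record(self, change: str, response: str, tone_note: str = "") -> Attempt:
        """Append one observation to the log and return it."""
        attempt = Attempt(change, response, tone_note)
        self.attempts.append(attempt)
        return attempt

    def tone_shifts(self) -> List[Attempt]:
        """Return only the attempts where a change in tone was noted."""
        return [a for a in self.attempts if a.tone_note]


log = ChangeLog()
log.record("baseline, direct question", "I cannot help with that.")
log.record("reframed as fiction", "I'd rather not write that scene.",
           tone_note="softer refusal")
print(len(log.attempts), [a.change for a in log.tone_shifts()])
```

Keeping the change and the response side by side is what turns guessing into the kind of structured observation the essay describes.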

"You need to step into this feedback loop between you and the model... observe not just what sticks, but how the model responds."

If the model says "I cannot do that because it is illegal," you know you hit the Internal Alignment layer, the one semantic attacks target: the model understood the request and chose to refuse. If it says nothing or gives a generic error, you likely hit a hard Input Filter. By treating the AI as a subject in a psychological study, Valen decodes the hidden rules governing its behavior.
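The diagnostic rule in this paragraph can be written down as a small heuristic. This is a sketch under loud assumptions: the marker phrases and the generic-error strings are invented for illustration, and real systems word their refusals and errors in many other ways. A reasoned refusal points at the model's own training (the Internal Alignment layer from the pipeline diagram), while silence or a generic error points at the Input Filter.

```python
# Hypothetical phrases; real refusals and error messages vary widely.
REFUSAL_MARKERS = ("i cannot", "i can't", "illegal")
GENERIC_ERRORS = {"error", "something went wrong."}


def diagnose(response: str) -> str:
    """Guess which layer produced a response, following the essay's rule."""
    text = response.strip().lower()
    # Silence or a generic error suggests the prompt never reached the model.
    if not text or text in GENERIC_ERRORS:
        return "input filter (likely)"
    # A reasoned refusal suggests the model read the prompt and declined.
    if any(marker in text for marker in REFUSAL_MARKERS):
        return "internal alignment"
    # The essay gives no rule for other cases, so make no claim.
    return "undetermined"


print(diagnose("I cannot do that because it is illegal."))
print(diagnose(""))
```

The "undetermined" branch matters: the paragraph only gives a rule for two kinds of response, so the sketch deliberately makes no guess about anything else, including output that was caught after generation.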

Conclusion

As AI models become more capable, the line between "prompt engineering" and "psychology" blurs. Red Teaming is becoming a vital career path, one that requires not just coding skills but the human ability to understand intent, context, and nuance in communication.

The "white canvas" of AI reflects what we project onto it—and sometimes, to make it safe, we have to learn how to paint the dangerous pictures first.