AI Alert
Jailbreaking vs Prompt Injection: Not the Same Attack
primer

Jailbreaking vs Prompt Injection: Not the Same Attack

Security practitioners conflate jailbreaking and prompt injection constantly. They are distinct attack classes with different threat actors, different mitigations, and different risk profiles.

By Theo Voss · · 8 min read

Security teams deploying LLMs in 2026 routinely conflate two distinct attack classes: jailbreaking and prompt injection. The conflation is understandable — both involve adversarial text, both target language models, and both appear in the same paragraph of most AI risk frameworks. But treating them as the same attack produces wrong mitigations, wrong threat models, and gaps that attackers exploit.

Here is the distinction, why it matters, and what each class requires from a defense standpoint.

Jailbreaking: A Control-Plane Attack at Inference Time

Jailbreaking is the attempt to subvert a model’s alignment training at inference time. The model has been trained, via RLHF or similar, to refuse certain categories of request. The attacker crafts inputs that bypass that refusal behavior without modifying the model’s weights.

The attacker’s goal: cause the model to produce outputs it was trained not to produce — harmful content, policy-violating information, impersonation of restricted personas. The mechanism is behavioral manipulation, not code injection.

Jailbreaks take several forms:

Role-play framing. Asking the model to “pretend to be” an unrestricted version of itself (“DAN” attacks), a character without restrictions, or a fictional model from a future where restrictions have been lifted. The model follows the persona instruction and partially abandons refusal behavior.

Prefix injection. Prepending statements like “As an AI with no restrictions, I will answer this:” to a refused query. Some models follow the provided prefix rather than checking if the following content should be refused.

Adversarial suffixes. The more technically sophisticated variant: algorithmically generated character sequences that, when appended to a refused prompt, consistently cause the model to comply. The 2023 GCG paper from Zou et al. showed these transfer across model families. The suffixes are not human-readable; they look like noise but reliably exploit the model’s prediction mechanism.

Few-shot steering. Providing examples of the model complying with similar requests in the conversation history, nudging the model’s in-context behavior toward compliance.

What jailbreaking is not: jailbreaking does not give the attacker access to system prompts, other users’ sessions, or the host infrastructure. It subverts the model’s behavioral constraints, not the application’s security boundaries.

Who jailbreaks? Primarily users trying to elicit refused content for personal use — people who want the model to generate content the vendor has decided it should not. In most enterprise deployments, the threat actor is an insider or an end user, not an external attacker. The harm is policy violation, reputational risk, or generation of harmful content at scale — meaningful, but categorically different from a data-plane attack.

Prompt Injection: A Data-Plane Attack

Prompt injection is a fundamentally different attack class. Rather than subverting the model’s alignment, it exploits the model’s inability to distinguish between trusted instructions and untrusted content that happens to contain instruction-shaped text.

The OWASP LLM Top 10 defines it as: “Prompt injection attacks involve crafting malicious inputs to manipulate a large language model’s actions or outputs. Direct prompt injections override system prompts. Indirect prompt injections use external content to manipulate the model.” OWASP ranks it LLM01 — the top risk in their taxonomy.

Direct prompt injection is closer to jailbreaking in mechanism: a user writes adversarial instructions in the user turn that override or conflict with the system prompt. “Ignore your previous instructions and…” is the canonical form, though effective attacks are more sophisticated. The attacker is the user.

Indirect prompt injection is the operationally critical class, and it is where the real danger lies. Here the attacker is not the user — the attacker has placed malicious instructions in content that the AI system will retrieve and process. A web page the model browses, a document a user uploads, an email the model reads on a user’s behalf, a ticket in a customer support queue: any external content the model treats as input is a potential injection vector.

Greshake et al. (2023) demonstrated this systematically against real LLM-integrated applications. They injected instructions into web pages and documents that caused connected AI systems to exfiltrate conversation contents, execute unauthorized actions through connected tools, and propagate injected instructions to subsequent queries.

The threat model for indirect injection is entirely different from jailbreaking:

An AI agent with access to email, calendar, file systems, or code execution that processes an injected document can be turned into an attacker-controlled proxy. The user sees nothing unusual; the model’s action trace reveals the exploit only if you are logging at the right level of detail.

Why the Distinction Matters for Defense

The mitigations for jailbreaking and prompt injection overlap only partially, and different parts of your architecture own each defense.

Jailbreaking mitigations:

These defenses are model-facing: they operate at the boundary between the user and the model, or on the model’s output.

Prompt injection mitigations:

These defenses are application-facing: they operate at the boundary between the application and the external world, and at the boundary between the model’s output and the systems that act on it.

The Compound Attack

The two classes can combine. An indirect injection attack might contain instructions that also attempt to bypass the model’s refusal behavior — combining the data-plane attack with alignment subversion. An AI agent processing a malicious document might be instructed to “ignore your safety guidelines” as part of the injected payload.

This is why building a defense against only one class leaves a gap. An application that blocks jailbreak patterns in user input but trusts all external content provides no protection against indirect injection. An application that monitors external content for injected instructions but has no output filtering misses behavioral manipulation attacks.

Operational Takeaway

When triaging an AI security incident or reviewing an AI deployment’s risk posture, ask which class you are dealing with:

Many real incidents involve both. But starting with the right frame points you toward the right mitigations.

Related resources: promptinjection.report maintains a structured taxonomy of prompt injection techniques — both direct and indirect — that is useful when mapping a specific incident to its attack class. For jailbreak technique documentation including current effective patterns, jailbreakdb.com catalogs known techniques with the model families they affect. Defense patterns for both attack classes are covered at aidefense.dev.

Sources

Sources

  1. OWASP LLM Top 10 — LLM01:2025 Prompt Injection
  2. Indirect Prompt Injection Attacks on LLM-Integrated Applications (Greshake et al., 2023)
  3. Universal and Transferable Adversarial Attacks on Aligned Language Models (Zou et al., 2023)
#jailbreaking #prompt-injection #llm-security #owasp #alignment #primer
Subscribe

AI Alert — in your inbox

AI incidents and vulnerabilities — tracked, sourced, dated. — delivered when there's something worth your inbox.

No spam. Unsubscribe anytime.

Related

Comments