Building an AI Security Incident Response Plan
A practical incident response plan for AI systems — what a prompt-injection or model-extraction incident looks like, how the NIST SP 800-61r3 / CSF 2.0 functions map onto AI-specific incidents, and the containment and evidence steps generic IR plans miss.
Most organizations now have an AI deployment and a generic incident response plan, but the IR plan was written for a world of servers, endpoints, and network intrusions. When the incident is a prompt-injection campaign that exfiltrated data through an agent’s tool calls, or a model-extraction effort that scraped your proprietary model through the API, the standard playbook does not obviously apply. What do you contain? What is the evidence? Who do you even call?
This guide is a practical structure for an AI security incident response plan. It does not replace your existing IR program — it extends it to cover the AI-specific incident classes, the containment actions that have no equivalent in traditional IR, and the evidence you have to capture before the incident if you want any hope of investigating it. It is built on the current NIST incident response guidance and grounded in the real attack classes we track across this site.
Start from the current NIST structure
NIST overhauled its incident response guidance in April 2025 with SP 800-61 Revision 3 ↗. The big change matters here: Rev. 3 abandons the old rigid lifecycle (Preparation → Detection & Analysis → Containment/Eradication/Recovery → Post-Incident) and instead organizes incident response around the five functions of the NIST Cybersecurity Framework 2.0 — Identify, Protect, Detect, Respond, Recover — plus the cross-cutting Govern function. It is delivered as a CSF 2.0 Community Profile rather than a step-by-step recipe.
The practical reason to start here is that it lets you fold AI incident response into the same framework your security program already speaks, instead of bolting on a separate “AI IR” silo. The rest of this guide maps the AI-specific work onto those functions.
The AI incident classes you are planning for
Before the plan, the threat model. The incident types that distinguish AI IR from generic IR:
- Prompt injection leading to action. Direct or indirect injection that causes an LLM or agent to take an unintended action — exfiltrate data through a tool call, send a message, modify a record. The highest-impact class because the model has agency. (Background: direct vs. indirect prompt injection ↗ and the 2025 OWASP LLM Top 10 where injection is LLM01.)
- Sensitive data exposure. Confidential data entering the model context and leaking — either to other users, in logs, or back out through the model. The Samsung ChatGPT leak is the canonical example. (See our incident analysis.)
- Model or system-prompt extraction. Adversaries reconstructing a proprietary model or extracting confidential system-prompt contents through crafted queries. (See model extraction attacks and system prompt leakage.)
- Supply-chain compromise. A malicious model artifact, a poisoned RAG corpus, or a vulnerable inference dependency. (See LLM supply chain poisoning and our CVE roundups.)
- Cost/resource abuse. Token-budget exhaustion or runaway agent loops driven by a stolen key or an adversarial input — OWASP’s “unbounded consumption.”
MITRE ATLAS ↗ is the reference that names these as adversary techniques and gives you a shared vocabulary for the after-action report. Mapping each of your detections to an ATLAS technique at design time pays off when you have to explain the incident.
Prepare: the evidence you must capture before the incident
This is where AI IR most often fails, and it fails before the incident happens. The single most common reason an AI security incident cannot be investigated is that the telemetry needed to reconstruct it was never logged. Under the CSF Protect/Identify functions, the preparation work specific to AI:
- Log the full interaction, with provenance. For every model call: the input and its source (was it direct user input, retrieved RAG content, tool output, OCR’d text?), the system prompt version, the tools invoked and their arguments, the output, and the identity the call was made on behalf of. The input-source field is the one generic logging never captures, and it is exactly what tells you whether an injection came from a user or from poisoned retrieved content.
- Retain prompt/response content deliberately and securely. You need enough content to investigate, but raw prompts contain PII and secrets. Decide the retention window and access controls before you need them, and treat the log store as regulated data.
- Inventory and version your models, prompts, and corpora. You cannot investigate a poisoned RAG corpus if you cannot tell what was in it last Tuesday. Version the corpus, pin model revisions by hash, and version system prompts.
- Define and rehearse the escalation path. Who is the on-call? When does an alert page versus log-and-review? Who can pull a model from production or revoke an API key, and how fast? AI incidents cross team boundaries — the model team, app team, and security team rarely share an on-call rotation by default.
If you instrument nothing else from this guide, instrument the interaction log with provenance. Everything downstream depends on it.
Detect and analyze: triage for AI incidents
Detection draws on the same telemetry. The analysis questions that are specific to AI incidents:
- Was the input attacker-controlled, and through which channel? Direct input is a user-trust problem; injection via retrieved content is an architectural one with a wider blast radius. The
input_sourceprovenance answers this. - Did the model take an action, or just produce text? A leaked string is bad; an executed tool call that sent data or modified state is an active-harm incident. Triage severity on agency, not on output toxicity.
- What was the blast radius? Which sessions, which users, which downstream systems did the compromised output or action reach? For data exposure: what entered the context, and where could it have gone?
- Is it reproducible? If a single crafted input leaks the system prompt or bypasses a guardrail once, assume the adversary can repeat it. A reproducible bypass is a vulnerability, not a one-off.
Respond: containment actions with no traditional equivalent
This is where AI IR diverges most from the standard playbook. The containment levers, roughly in order of escalation:
- Revoke the compromised credential or session. If a stolen API key is driving cost abuse or scraping, kill the key. This is the closest analog to traditional account containment.
- Disable or constrain the agent’s tools. For an injection-to-action incident, the fastest containment is often to revoke the tool the agent abused — pull
send_email, disable the write path — rather than take the whole system down. Capability scoping at the tool layer is your circuit breaker, which is why tools should be independently disableable. - Roll back the prompt, model, or corpus. If a poisoned RAG document or a bad system-prompt change is the root cause, roll back to a known-good version. This is why versioning is a Prepare-phase requirement.
- Switch to a degraded but safe mode. Disable the risky feature (agentic actions, the upload pipeline, the affected tool) while keeping the rest of the service up. Total takedown is rarely the right first move for a customer-facing system, and a feature-level kill is usually available if you designed for it.
- Block the input pattern, carefully. Adding a guardrail rule for the specific bypass can stop an active campaign, but treat it as a stopgap — adversaries iterate around point fixes, and a brittle rule creates false confidence.
The discipline: contain at the narrowest effective layer. Generic IR reaches for “isolate the host.” AI IR usually has a more surgical option — revoke a tool, roll back a corpus, kill a key — that preserves the service.
Recover and learn: close the loop
Recovery under CSF 2.0 is about restoring service safely and verifying the fix held. For AI incidents:
- Verify the bypass is actually closed. Re-run the exact attack against the fixed system. If you patched a guardrail, confirm it now catches the payload and obvious variants — and that legitimate traffic still works.
- Rotate what leaked. If a system prompt leaked, assume it is public and rotate any secrets that were (wrongly) in it. The real lesson is usually that secrets did not belong in model context at all.
- Feed the incident back into detection. The attack that succeeded becomes a test case in your adversarial CI and a new detection rule. An incident that does not improve your detection coverage was a wasted incident.
- Update the threat model and the inventory. A novel incident class often reveals a system or data flow that was not in your inventory or risk assessment. Close that gap.
A note on the Govern function: AI incidents frequently raise reporting and disclosure questions that traditional IR plans do not anticipate — regulatory notification for data exposure, customer communication, and whether to contribute a sanitized writeup to a public resource like the AI Incident Database ↗, which exists precisely so the field learns from documented failures. Decide your disclosure posture in advance, not under pressure.
The minimum viable AI IR plan
If you are building this from nothing, the order that delivers the most resilience per unit effort:
- Instrument the provenance-rich interaction log and lock down its access. Without it, nothing else in this plan is executable.
- Define the escalation path across the model, app, and security teams, and name who can revoke a key, pull a tool, and roll back a model.
- Make tools, prompts, models, and corpora independently disableable and versioned, so containment has surgical options.
- Write per-class runbooks for the five incident types above, each mapped to an ATLAS technique and to the CSF function it lives under.
- Add the successful attack to adversarial CI after every incident, so detection compounds over time.
An AI security incident response plan is mostly the disciplined application of incident response fundamentals to a system whose telemetry, containment levers, and evidence look different. Get the logging and the surgical containment options in place before the incident, map the work onto the framework your security team already uses, and the AI-specific parts stop being a separate emergency and become a covered class of incident like any other.
Sources
- NIST SP 800-61r3 — Incident Response Recommendations (CSF 2.0 Profile) ↗ — the April 2025 IR guidance, restructured around the CSF 2.0 functions.
- NIST SP 800-61r3 (full PDF) ↗ — the complete document with the Community Profile tables.
- MITRE ATLAS ↗ — adversary tactics and techniques against AI systems; the shared vocabulary for AI incident reports.
- AI Incident Database ↗ — a public, searchable record of documented AI incidents and harms.
- OWASP Top 10 for LLM Applications 2025 ↗ — the risk taxonomy underlying the incident classes above.
See also
Sources
AI Alert — in your inbox
AI incidents and vulnerabilities — tracked, sourced, dated. — delivered when there's something worth your inbox.
No spam. Unsubscribe anytime.
Related
Deepfake Cybersecurity: Detection Methods and Practical Defenses
From the FBI's May 2025 warning on AI voice attacks targeting US officials to NIST's synthetic content framework, here is what detection technology actually delivers — and where the gaps remain.
LLM Security Alerts: Monitoring, Detection, and Response
A practical guide to setting up LLM security alerting — what to monitor, what alert patterns indicate compromise or attack, how to triage LLM security incidents, and what a response playbook looks like.
AI Security: Attack Categories, Defense Gaps, and How to Respond
A practitioner guide to the four core attack categories against AI/ML systems — from adversarial inputs to supply chain compromise — with mitigation priorities and framework references.