Apr 10, 2026
Incident Response for Misbehaving Models: Playbooks for Outages, Harms, and PR Crises
AI Safety


Marcus Thompson · December 3, 2025 · 22 min read

Introduction

At some point, your model is going to do something you can't defend in a one-liner. It will leak something it shouldn't. It will help someone do something they shouldn't. It will say something about a protected group that lands in a journalist's inbox. Or it will just fall over during a launch and turn your flagship feature into an apology screen. You don't get to choose whether you have incidents. You only get to choose whether you treat them like an engineering discipline or like improv theater. Most orgs have reasonable incident response for classic failures: outages, latency spikes, corrupted data.

Almost none have the same maturity for model behavior. They have "guardrails," "policies," and "alignment goals."
They don't have:

  • Clear severity levels for model failures
  • On-call rotations that include safety
  • Kill switches for behavior, not just for traffic

If you ship models at scale, you need playbooks for three distinct classes of events:

  1. Functional incidents: outages, quality regressions, weird responses that break workflows.
  2. Safety incidents: actual or near-miss harms to users or third parties.
  3. Narrative incidents: screenshots and headlines that threaten trust, regardless of technical root cause.

Treat them as different families, with overlapping mechanics.

Why model incidents are not just "bugs with extra steps"

Traditional outages are mostly about absence:

  • No responses
  • Slow responses
  • Wrong but deterministic behavior

Model incidents often involve presence:

  • A response that should never have been generated
  • A response that is technically correct but contextually explosive
  • A response that is benign on its own but toxic at scale

Key differences:

Ambiguity

The same prompt–response pair can be seen as fine, borderline, or unacceptable depending on culture, law, and PR context.

User agency

Attackers craft prompts explicitly to elicit the worst-case behavior. Some incidents are induced, not accidental.

Soft boundaries

There is no single line of code to point at. Behavior is distributed across weights, data, and safety layers.

Trying to run model incidents through a standard "P0 outage" lens gives you two failure modes:

  • You under-react to serious harm because "the service is up."
  • You over-react to social media heat without understanding whether there is a systemic issue.

You need explicit structure for this class of incidents.

Foundations: what must exist before the bad day

If you don't do this up front, you will not invent it under pressure.

Define severity levels specific to models

You can layer on top of your existing Sev model, but make criteria explicit. For example:

Sev 0 – Catastrophic model behavior

  • Clear physical or major financial harm likely or ongoing
  • Large-scale data leakage from model behavior
  • Widespread abuse (e.g. model being used as a crime accelerator)

Sev 1 – High-impact harmful behavior or major quality break

  • Reproducible policy-violating outputs on common prompts
  • Harmful outputs affecting vulnerable users
  • Large customer impact (e.g. core feature obviously broken)

Sev 2 – Contained or edge-case issues

  • Hard-to-hit jailbreaks
  • Narrow topic bias or toxicity
  • Localized quality regressions

Tie each severity level to:

  • Who must be paged
  • Maximum acknowledgement time
  • Maximum time to first mitigation
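One way to make that tie-in concrete is to encode it as data your paging tooling can consume. A minimal sketch, where the role names and timings are illustrative placeholders rather than a recommendation:

```python
# Sketch: map model-incident severities to escalation rules.
# Role names and timings below are illustrative assumptions.
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationPolicy:
    pages: tuple             # roles that must be paged
    ack_minutes: int         # maximum acknowledgement time
    mitigation_minutes: int  # maximum time to first mitigation


SEVERITY_POLICIES = {
    "SEV0": EscalationPolicy(
        pages=("incident_commander", "model_owner", "safety_lead",
               "legal", "comms"),
        ack_minutes=5, mitigation_minutes=30),
    "SEV1": EscalationPolicy(
        pages=("incident_commander", "model_owner", "safety_lead"),
        ack_minutes=15, mitigation_minutes=60),
    "SEV2": EscalationPolicy(
        pages=("model_owner",),
        ack_minutes=60, mitigation_minutes=24 * 60),
}


def escalation_for(severity: str) -> EscalationPolicy:
    # Fail loudly on unknown severities instead of silently under-paging.
    return SEVERITY_POLICIES[severity]
```

The point of the table-driven shape is that the severity criteria and the paging consequences live in one reviewable place, instead of in tribal knowledge.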

Assign real ownership

For any incident at or above Sev 1 involving models, the response team must include:

  • Incident commander (usually from core infra or SRE)
  • Model owner (who actually knows the training and deployment details)
  • Safety lead (policy + red-team contact)
  • Product owner for the affected surface
  • Comms / PR and Legal on call for anything user-visible

If "safety" is not in the room, you will mis-grade the impact.
If "model" is not in the room, you will waste hours blaming infra.

Build telemetry for behavior, not just uptime

Logs that matter:

  • Prompt, response, and metadata for a sample of traffic (with privacy controls)
  • Aggregated stats on refusal rates, safety triggers, category hits
  • Per-model and per-version routing and performance

Without this, every incident starts with "can anyone reproduce this?" followed by blind flailing. You need enough logging to:

  • Reconstruct the exact interaction, when permitted
  • See whether an issue is widespread or limited to one tenant / region / version
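A minimal sketch of what that telemetry layer can look like: sample a fraction of traffic, hash user identifiers so analysts can group without seeing raw IDs, and keep aggregate refusal counters per model version. The field names and the 1% sample rate are assumptions, not a spec.

```python
# Sketch: behavior telemetry with privacy controls.
# SAMPLE_RATE and field names are illustrative assumptions.
import hashlib
import random
from collections import Counter

SAMPLE_RATE = 0.01
refusal_stats = Counter()  # aggregate counts per model version
behavior_log = []          # would be a real logging sink in production


def record_interaction(user_id, prompt, response, model_version,
                       refused, rng=random.random):
    refusal_stats[(model_version, "total")] += 1
    if refused:
        refusal_stats[(model_version, "refused")] += 1
    if rng() < SAMPLE_RATE:
        behavior_log.append({
            # One-way hash: analysts can group by user, and the raw ID
            # can still be recovered from a sealed mapping if needed.
            "user": hashlib.sha256(user_id.encode()).hexdigest()[:16],
            "prompt": prompt,
            "response": response,
            "model_version": model_version,
            "refused": refused,
        })


def refusal_rate(model_version):
    total = refusal_stats[(model_version, "total")]
    return refusal_stats[(model_version, "refused")] / total if total else 0.0
```

With this in place, "is it widespread or one tenant?" becomes a query over `behavior_log` and `refusal_stats` instead of a scramble for repro cases.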

Implement kill switches that actually cut risk

Not just "turn the feature off." You want the ability to:

  • Roll back model version per surface and region
  • Route to a smaller / safer fallback model
  • Force responses through stricter safety filters
  • Disable specific tools / actions the model can trigger
  • In the worst case, hard-block known dangerous prompt patterns at the edge
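Those layers compose naturally as a per-surface control object consulted on every request. A sketch under assumed names (the surface key, flag fields, and routing shape are all hypothetical):

```python
# Sketch: layered kill switches as per-surface config.
# Surface names and flag fields are hypothetical.
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SurfaceControls:
    model_version: str = "prod"                # roll back per surface
    fallback_model: Optional[str] = None       # smaller / safer model
    strict_filters: bool = False               # force stricter filtering
    disabled_tools: set = field(default_factory=set)
    blocked_patterns: list = field(default_factory=list)  # edge blocks


controls = {"chat.us": SurfaceControls()}


def route_request(surface, prompt, tool=None):
    c = controls[surface]
    # Most aggressive check first: hard-block known dangerous patterns.
    if any(p in prompt.lower() for p in c.blocked_patterns):
        return {"action": "block"}
    if tool and tool in c.disabled_tools:
        return {"action": "reject_tool"}
    model = c.fallback_model or c.model_version
    return {"action": "serve", "model": model,
            "strict_filters": c.strict_filters}
```

The design choice that matters: every switch is a config flip, not a code change, so the incident commander can cut risk in minutes without a deploy.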

You do not want to be editing prompts and safety policies in live code while Twitter is melting down.

Run at least one serious drill

Take an ugly, plausible scenario:

  • The model provides detailed self-harm instructions to a minor
  • The model leaks internal customer data in a public surface
  • The model generates racist content that hits the press

Run a tabletop with the people who would be on the call. Walk it through:

  • How it's detected
  • Who gets paged
  • What gets turned off in the first 30 minutes
  • How you talk to affected users
  • How you decide between rollback and patch

If that meeting turns into "we have no idea how to do this," fix that before you ship more capabilities.

Playbook 1: functional outages and quality incidents

These are the closest to classic incidents. They still need model-specific handling.

Triggers:

  • Latency or error monitoring spikes
  • Huge increase in user retries or abandonment
  • Bulk customer complaints like "it stopped following instructions" or "it's suddenly much worse at X"

Step 1: triage infra vs model

Quick checks:

  • Are non-model endpoints healthy?
  • Are requests reaching the model service?
  • Did any infra change roll out near the onset (network, storage, auth)?

If infra is broken, handle as usual. If infra is healthy but:

  • Response rates drop
  • Nonsense or low-quality answers spike
  • Specific capabilities degrade

then the root cause is likely:

  • Bad deployment (wrong weights, bad version routing)
  • Misconfigured safety or orchestration layer
  • Unintended effect of a new training run or fine-tune

Step 2: contain the blast radius

Options in order of aggression:

  • Roll back to last known good model version
  • Switch affected traffic slice (tenant, region, product) to a fallback model
  • Reduce temperature / sampling weirdness if settings changed
  • If you can't trust behavior at all, disable only the affected feature while leaving the rest up

Do not keep serving obviously broken outputs because "uptime is green."

Step 3: isolate the change

You need a diff:

  • What changed in the last N hours?
  • Model version, safety policy, routing logic, eval thresholds
  • Did the change affect all prompts or only specific flows?
  • Can you reproduce the bad behavior on a fixed set of prompts across versions?

Build a small repro suite from real prompts where quality obviously regressed. Run it against:

  • Current bad version
  • Last known good
  • Any candidate patches

Keep it small and focused. This becomes part of your regression harness later.
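The repro suite itself can be tiny. A sketch, where `call_model` stands in for your real inference client and the per-prompt checks are whatever caught the regression (even a keyword match is enough for gross quality breaks):

```python
# Sketch: run a fixed repro suite across model versions and compare
# pass rates. `call_model` and the check functions are stand-ins.
def run_repro_suite(prompts_and_checks, call_model, versions):
    """Return pass rate per version for a fixed set of real prompts."""
    results = {}
    for version in versions:
        passed = 0
        for prompt, check in prompts_and_checks:
            response = call_model(prompt, version=version)
            if check(response):
                passed += 1
        results[version] = passed / len(prompts_and_checks)
    return results
```

Run it against the current bad version, the last known good, and each candidate patch; the version whose pass rate recovered is your answer, and the suite folds into the regression harness afterward.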

Step 4: remediate

Root causes you will see:

  • Training run that improved some metrics and degraded unmeasured ones
  • Bad config for orchestration (wrong tool selection logic, incorrect system prompts)
  • Safety filters interfering with core functionality in unanticipated ways

Responses:

  • Revert and schedule a new, better-instrumented training run
  • Patch orchestration logic, including tests that would have caught this
  • Adjust safety config and add explicit tests for the throttled capabilities

Only ship forward when:

  • Repro suite passes
  • Key business metrics (task success, user-visible quality) are back in band
  • You understand, in words, what actually went wrong

Step 5: post-incident review

Standard PIR, but with some model-specific questions:

  • Did our eval suite actually cover the degradations users cared about?
  • Did we push a model update without enough monitoring on behavior?
  • Did infra or product teams roll this change out without a clear rollback plan?

Feed the answers back into:

  • Pre-deployment eval design
  • Change management around model releases
  • Routing and versioning strategy

Playbook 2: safety incidents and real harms

This is where you can't hide behind "it's just a beta."

Triggers:

  • A user receives content that plausibly causes harm
    • Self-harm encouragement or instructions
    • Hate or harassment
    • Guidance on serious crime or violence
  • Sensitive data appears in a response
    • Other users' data
    • Internal customer information
    • Secrets that should not be in any output
  • External reports from journalists, watchdogs, or partners showing policy-violating outputs

Step 1: stop the bleeding

Within the first 30–60 minutes, you want to:

  • Disable or restrict the specific surface where the behavior was observed (feature, tenant, product)
  • Apply stricter safety filters or route to a safer fallback model if available
  • If the issue is clearly systemic for a category (e.g. self-harm prompts), apply temporary global blocks on that category while you investigate
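A temporary category block can be as simple as an expiring deny-list consulted before the model. In this sketch the classifier is a keyword stub standing in for your real safety classifier, and the category names are assumptions:

```python
# Sketch: temporary global category blocks with expiry.
# `classify` is a stand-in for a real safety classifier.
import time

temporary_blocks = {}  # category -> expiry timestamp (epoch seconds)


def block_category(category, minutes, now=time.time):
    temporary_blocks[category] = now() + minutes * 60


def classify(prompt):
    # Stand-in: real systems would use a trained classifier here.
    if "self-harm" in prompt.lower():
        return "self_harm"
    return "other"


def should_block(prompt, now=time.time):
    expiry = temporary_blocks.get(classify(prompt))
    return expiry is not None and now() < expiry
```

The expiry is the important part: temporary blocks that never expire quietly become permanent capability loss nobody decided on.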

You are trading capabilities for immediate risk reduction. Do it.

Step 2: preserve evidence, protect privacy

You will need:

  • Exact prompts and responses
  • Timestamps, user IDs or identifiers, model and version IDs
  • Routing and safety configuration at the time

But these incidents often involve sensitive content. So:

  • Restrict access to the raw logs to the minimal necessary group
  • Redact identifiers where you can, but keep enough to contact the user if appropriate
  • Store a sealed copy for legal and regulatory needs
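One way to reconcile those two needs: produce a redacted working copy for the investigation and a checksummed, access-restricted sealed copy for legal. A sketch with illustrative field names:

```python
# Sketch: seal incident evidence and produce a redacted working copy.
# Field names and the access model are illustrative assumptions.
import hashlib
import json


def seal_evidence(record, allowed_readers):
    raw = json.dumps(record, sort_keys=True).encode()
    sealed = {
        "sha256": hashlib.sha256(raw).hexdigest(),  # tamper evidence
        "payload": raw,
        "access": set(allowed_readers),             # minimal group only
    }
    redacted = dict(record)
    # Hash the identifier: investigators can group by user, and the
    # sealed copy still allows contacting the user if appropriate.
    redacted["user_id"] = hashlib.sha256(
        record["user_id"].encode()).hexdigest()[:16]
    return sealed, redacted
```

The redacted copy is what goes into the investigation channel; the sealed copy never does.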

You cannot investigate without data. You cannot dump that data into every Slack channel either.

Step 3: assemble the right team

For a genuine safety incident, the live call should include:

  • Incident commander
  • Model / infra lead
  • Safety lead
  • Legal and privacy
  • Comms / PR
  • Product owner

Assign explicit roles:

  • One person talking to the rest of the org
  • One person coordinating technical isolation and patches
  • One person owning user communications and external statements

If more than three people on the call are trying to "own" the incident, you have a problem.

Step 4: assess scope and severity

Key questions:

  • Is this reproducible in a straightforward way, or did it require elaborate prompting?
  • Is it confined to a narrow surface or global?
  • Does it appear to affect one tenant / region / language more than others?
  • Has this pattern appeared before in logs or red-team findings?

Use:

  • Targeted probing based on the original prompt style
  • Sampling of similar prompts in existing logs
  • Synthetic adversarial tests if you have internal red-team tools
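The log-sampling step reduces to a prevalence estimate over whatever interaction logs you retained. A sketch, where `matches_incident` stands in for targeted probing or a classifier tuned to the original prompt style:

```python
# Sketch: estimate how widespread the incident pattern is in sampled
# logs. `matches_incident` is a stand-in predicate.
def estimate_prevalence(logged_prompts, matches_incident):
    hits = [p for p in logged_prompts if matches_incident(p)]
    rate = len(hits) / len(logged_prompts) if logged_prompts else 0.0
    return {"hits": len(hits), "sampled": len(logged_prompts), "rate": rate}
```

Even a rough rate is enough to answer the severity question: a pattern hitting 2% of sampled traffic is a different incident from one requiring elaborate prompting that never appears organically.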

Decide:

  • Is this Sev 0/1 (systemic, high impact) or Sev 2 (edge-case but real)?
  • Do we need regulatory reporting?
  • Do we pause related development / rollouts until mitigation?

Step 5: handle affected users

This is not generic "sorry if you were offended" territory.

For harmful content:

  • If feasible and appropriate, contact the user directly with:
    • A clear acknowledgment of what happened
    • A straightforward apology
    • Pointers to support resources if the topic is self-harm or trauma adjacent

For data exposure:

  • Follow your data breach playbook:
    • Identify whose data was exposed
    • Notify according to legal requirements and contracts
    • Offer remediation if appropriate

Do not over-promise on technical root cause before you understand it.

Step 6: remediate technically

Common root causes:

  • Safety filters not applied or misconfigured in a particular surface
  • Reward models under-trained on specific harm patterns
  • New model version with different generalization behavior around harmful topics

Possible mitigations:

  • Tighten or fix safety filter application in the stack (and test it)
  • Add the prompt / behavior pattern to your red-team and training sets
  • Increase weight on safety objectives for the relevant harm category in follow-up fine-tuning
  • In extreme cases, permanently disallow certain topic combinations or response shapes, even at cost of false positives

Treat this like you would treat a critical security bug. Assume attackers and opportunists will try to replicate and publicize it once it's known.

Step 7: adjust policy and training pipeline

If a real incident slipped through, your existing safety spec and training pipeline are insufficient somewhere.

Questions:

  • Did the spec explicitly cover this category of harm?
  • Were labeler guidelines and reward models aligned with the spec?
  • Did we have evals that would have caught this behavior before shipping?

Fixes:

  • Clarify and update the safety spec and labeling instructions
  • Collect more high-quality labeled data in this slice
  • Improve adversarial testing around this harm type

Otherwise you will see the same class of incident again under a slightly different surface.

Playbook 3: narrative and PR incidents

Sometimes the "incident" is less about the underlying risk and more about the perception:

  • A screenshot of a biased or offensive answer goes viral
  • A public figure demonstrates a jailbreak on stage
  • A high-profile customer complains publicly about misuse or harm

You cannot ignore these on the basis that "the system is working as designed." Perception affects regulators, partners, and future users.

Step 1: separate signal from noise

Verify:

  • Is the screenshot / report real?
  • If so, can you reproduce it? Under what conditions?
  • Does it reflect current behavior or an older model / config?

If it's fabricated or heavily edited, you still may need to respond, but your technical playbook changes.

Step 2: align the internal narrative

Before anyone tweets or talks to press, internal alignment:

  • What exactly happened?
  • How common is this behavior?
  • What immediate steps have we taken?
  • What is the plan over the next 24–72 hours?

Comms, legal, safety, and engineering should agree on:

  • What we will say
  • What we will not claim yet
  • Who is the single spokesperson

Step 3: craft the external response

Good patterns:

  • Acknowledge the issue without defensiveness
  • State clearly whether it reflects current behavior and scope
  • Outline immediate mitigations if user safety is implicated
  • Commit to specific follow-up (and actually do it)

Bad patterns:

  • Over-technical deflection ("it's just a sampling artifact")
  • Blaming users wholesale for "abusing" the system when the behavior is easy to trigger
  • Vague "we take this seriously" with no concrete actions

Remember: your audience is not your research team. It's regulators, customers, and people deciding whether to trust you.

Step 4: decide whether to treat it as a real incident

Sometimes a PR spike reveals a genuine systemic issue your existing metrics underweighted. Sometimes it's a one-off corner case with limited real-world impact.

You still run the same internal steps:

  • Reproduce
  • Check prevalence
  • Compare against your own severity definitions

If it meets your own criteria for Sev 1 or Sev 0, treat it as such, regardless of how loud or quiet the online conversation is.

If it does not, still decide on:

  • Whether to tighten mitigations for that narrow case
  • Whether to update user-facing documentation or warnings
  • Whether to share more about limitations and expected behavior

Common failure modes in model incident response

Patterns that keep repeating.

No one owns the incident

  • Infra says "works as spec."
  • Safety says "not our incident channel."
  • Product says "we just forwarded user reports."

Result: hours of churn, no clear decisions.

Fix: explicit incident ownership rules for model behavior, with a single on-call rotation empowered to pull in others and declare severities.

Over-indexing on infra metrics

Everything looks "green":

  • Latency fine
  • Error rates fine
  • CPU/GPU utilization fine

Meanwhile, the model is:

  • Spitting out garbage due to a bad fine-tune
  • Refusing legitimate requests due to safety overreach
  • Quietly leaking patterns of sensitive data in a subset of responses

If you don't have behavior metrics and canary tests, you will discover this via angry users, not dashboards.
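A behavior canary can be as simple as replaying a fixed prompt set on a schedule and alerting on drift from a baseline. In this sketch the refusal check is a crude prefix match and the drift threshold is an assumption; real systems would use proper classifiers:

```python
# Sketch: behavior canary that fires on refusal-rate drift even when
# infra metrics are green. Threshold and check are assumptions.
def behavior_canary(canary_prompts, call_model, baseline_refusal,
                    max_drift=0.10):
    refused = sum(1 for p in canary_prompts
                  if call_model(p).startswith("I can't"))
    rate = refused / len(canary_prompts)
    return {
        "refusal_rate": rate,
        "alert": abs(rate - baseline_refusal) > max_drift,
    }
```

The same shape works for gibberish detection or task-success scoring; the key property is that it watches what the model says, not whether the service answered.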

Treating safety incidents as PR only

Some orgs respond to harmful outputs by:

  • Issuing statements
  • Adding more disclaimers
  • Tightening terms of service

But they never adjust:

  • Training data
  • Safety reward models
  • On-device filters
  • Incident classification

They're playing comms defense, not reducing future risk.

Conflating adversarial demos with real-world risk

Yes, someone with a full day to poke at your model will find a jailbreak. You need to distinguish:

  • High-effort, low-frequency exploits that require exotic prompts and persistence
  • Low-effort, high-frequency behaviors that ordinary users can trigger accidentally

Both matter, but they sit at different points on the risk curve. Incident response should prioritize harm likelihood and scale, not only how bad the worst screenshot looks.

Building muscle instead of theater

All of this sounds heavy until you remember how many times the industry has done this before for other risks.

Security incident response used to be ad hoc. Now it's a discipline.
Site reliability used to be "hope the ops team can fix it." Now on-call rotations and runbooks are normal.

Model incident response will get there, if people admit it's a distinct problem.

Concrete moves:

  • Fold model incidents into your existing incident management tooling, with dedicated tags and severities
  • Add safety and model engineers to the on-call ladder for relevant services
  • Maintain a small but serious catalog of past incidents and near misses, with clear lessons learned
  • Run quarterly exercises based on those real cases, not imaginary ones

The goal is boring competence:

  • When something bad happens, the right people get paged.
  • They know what they're allowed to shut off.
  • They can see the data they need.
  • They can contain, understand, and remediate without making things worse.

Models will misbehave. Attackers will push them. Edge cases will slip through. Users will post screenshots. You decide whether that turns into chaos every time, or into a contained incident that you handle, learn from, and move on.


Keywords

Incident Response · Model Safety · Crisis Management · Model Operations · Risk Management · Production Systems
