Red Teaming in Practice: How to Break Your Own Models Before Others Do

Most organizations already have red teaming. It just lives in screenshots and Slack threads. Someone finds a jailbreak on Twitter, pastes it into the internal chatbot, gets a spicy answer, posts it with a reaction emoji, maybe tweaks a system prompt. A week later, the model is wired into tools and production data), and nobody knows whether that early failure pattern still exists. That is not red teaming. That is curiosity. Real on the open web red teaming is a continuous, structured effort to find failure modes before an adversary or an angry customer does. It treats models and their surrounding systems reliability engineering as targets, not as toys. It produces artifacts and regressions, not vibes. Below is what it looks like when done as engineering, not theater. ## Clarify what red teaming is for Red teaming for LLM systems has a simple goal: produce concrete, reproducible attack paths against real assets. It is not about: * Proving that models are dangerous in the abstract

Curating funny jailbreaks
Generating slideware for an internal ethics committee The outputs that matter are: * Attack traces that explain, step by step, how a failure happens
Severity ratings tied to business impact and regulatory exposure
Minimal prompts and payloads that reproduce the issue
Patches, guardrails, and tests that close or narrow the hole Everything else is noise. ## Start from a threat model, not from prompts The first critical step is not to craft clever jailbreaks. It is to write down what matters. For a given system: * Assets: internal documents, customer data, money movement, reputations, access rights, operational workflows
Entry points: public chat surfaces, authenticated assistants, API endpoints, RAG ingestion, tools the model can call
Adversaries: external attackers, insiders, abusive customers, researchers, competitors, automated scanners
Worst plausible outcomes: data leakage, policy policy why governments care about your gpu cluster or law violations, financial loss, large-scale misinformation, disruption of operations When this is explicit, attack work stops being random prompt play and starts being targeted. A support copilot wired into ticketing and CRM has a different threat profile than a public marketing chatbot. A developer assistant with access to repos and CI has a different profile again. The red team focuses on the ways those systems can fail in ways that actually matter. ## Define scope and rules of engagement Red teaming without guardrails turns into chaos. Scope shapes effort. Typical dimensions: * Environment: staging with realistic data, shadow mode in production, or tightly controlled production slices
Surfaces: which UIs, APIs, and agent workflows are in scope
Tools and integrations: which external systems the model can touch as part of the engagement
Time window: continuous engagement with periodic reports, or focused campaigns around launches and major changes Rules of engagement define: * What is off limits for legal or operational reasons
Which actions are allowed against internal systems (for example, read-only access vs write actions)
How far to push data exfiltration and system manipulation before stopping
When and how to coordinate with incident response, legal, and comms Without this, the first serious attack chain that hits a real system will trigger panic instead of learning. ## Assemble the right mix of attackers Effective red teams are not just security engineers with model access. The useful mix usually includes: * Security engineers who understand exploitation patterns, privilege escalation, and abuse of integrations
ML engineers or researchers who understand model behavior, training, and guardrails
Product or domain experts who understand workflows and where harm actually lands
Occasionally, external specialists who bring a different threat mindset The dynamic is simple. Security thinks in terms of attack surfaces, ML in terms of behavior under perturbation, product in terms of user impact. Those three views partner; none of them alone is enough. ## Build an attack catalog, not a gallery Red teaming gains power from repetition. That requires a structured view of attack types. A practical attack catalog usually covers: ### 1. Information disclosure * Training data leakage
Internal doc and secret leakage via RAG
System prompt and policy exfiltration ### 2. Policy and safety bypass * Jailbreaks and refusals overridden
Harmful content generation under thin disguises
Cross language talking to computers still hard and domain specific domain specific assistants for law finance and medicine bypasses ### 3. Tool and agent abuse * Unintended tool invocation chains
Overbroad actions in CRMs, ticketing, billing, ops
State changes without proper authorization paths ### 4. Prompt injection and context manipulation * Injected instructions in documents, tickets, web content
Malicious payloads in HTML, markdown, comments, metadata
RAG attacks via poisoned corpora ### 5. Integrity and bias * Systematic skew for certain users, languages, or topics
Predictable misclassification that benefits an attacker Each category links back to concrete assets and worst-case outcomes. Red teaming then fills in each slot with real examples and probes, rather than collecting random one-off jailbreaks. ## Run engagements as proper campaigns A competent red team engagement against an LLM system tends to move in phases. ### Reconnaissance * Enumerate all user entry points and agent workflows in scope
Map which models, providers, and variants sit behind each surface
Identify tools, RAG indices, and external systems the models can reach
Fingerprint obvious guardrails: content filters, refusal patterns, prompt structures This phase often reveals misconfigurations before any clever prompting happens. ### Initial exploitation * Try straightforward jailbreak patterns and policy evasions
Probe safety filters in each language and task type
Attempt basic retrieval-based exfiltration in RAG systems
Push agents to overstep on tool actions in obvious ways The aim here is to measure the baseline: how much work is required to get to a first nontrivial failure. ### Refinement For every promising failure: * Minimize the input until it becomes a short, stable payload
Test robustness across variations, tenants, and environments
Connect the failure to specific parts of the stack: prompt logic, retrieval config, tool permission, model variant This is where an anecdote becomes an artifact. The team stops saying "if you talk to it just right, it does something bad" and starts saying "this 30-token input plus this context reliably yields this class of breach." ### Chaining Once primitives exist, red teaming chains them. Examples: * Prompt injection via a doc that tricks the model into calling a tool that reveals more data that in turn changes future behavior
Initial safety bypass used to get the model to self-document its tool list and policies, used in a second phase to target the most powerful actions
Template-level jailbreak reused across different surfaces that share backends, showing how a single weakness fans out across products At this point, the team can assign severities based on composed impact, not just on single responses. ## Instrument the system before breaking it Red teaming without telemetry creates drama and no durable knowledge. Before serious attacks start, the system needs: * Request and trace IDs flowing from surface through model, retrieval, tools, and responses
Logging of which tools were called, with which arguments, and with which results
Structured logging of safety filter hits and policy decisions
Correlation between tenants, features, regions, and model variants With that in place, every attack trace becomes a concrete artifact: * This user or simulated user
Sent this input
Under this system prompt and model config
Retrieval pulled these documents
Tools executed these calls
Safety filters did or did not fire Later, engineering can reproduce the entire chain in a controlled harness. ## Turn findings into regressions, not just reports The output of a red team campaign that matters is not the slide deck. It is the new tests and controls. For each high-severity finding, a concrete package emerges: * The minimal attack payloads and context conditions
A reproducible test harness: script, notebook, or automation that hits the system or a faithful copy
A fix plan, which might include prompt changes, guardrails, permission tightening, retrieval filters, or model swaps
One or more new eval cases added to pre-deployment test suites Over time, the organization builds a regression library of known attacks. Before any launch or major config change, the CI pipeline runs that library against the candidate build. If a previously fixed attack reappears, the launch stops. Without this, red teaming turns into endless rediscovery. ## Make red teaming continuous One-off exercises create false comfort. Real systems evolve constantly. Models change.
Prompts change.
Retrieval corpora change.
Tools and permissions change.
Traffic mix changes. A working pattern usually mixes three loops. * Launch gating: targeted red teaming against new features or major changes before exposure to users.
Continuous smoke tests: automated runs of high-value attack cases in staging and, with care, in production shadow mode.
Periodic deep dives: concentrated campaigns a few times a year on systems that have drifted or increased in importance. The red team does not need to be large. It needs to be embedded in the development rhythm rather than bolted on at the end. ## Connect severities to business and legal impact Severity ratings cannot be purely technical. They must connect to: * Data classes involved: personal data, secrets, regulated information
User segments: general public, enterprise tenants, internal staff
Domains: health, finance, employment, safety-critical operations
Legal exposure: contractual commitments, sectoral regulation ai-products, public promises A jailbreak that generates offensive content in a sandbox demo is a different object than a prompt injection that causes a clinical assistant to cite the wrong dosage policy. When a red team marks something as critical, the reason is not "the model misbehaved." The reason is "this chain, left unfixed, creates concrete regulatory or business risk." ## Common failure patterns inside orgs Some patterns repeat across companies. * Red teaming sits under marketing or ethics, not engineering, and produces reports that never reach the people who can change code.
Findings are treated as personal attacks on model teams, so they get buried or minimized.
Campaigns focus entirely on public chat surfaces while ignoring internal copilots and agents that have much more dangerous access.
Teams fix prompts and ignore deeper issues with retrieval, permissions, and tool design.
There is no budget or plan to implement fixes that touch architecture, so red teaming degenerates into listing problems without resolution. In each case, the failure is organizational, not technical. ## A concrete example shape Consider a simplified internal HR assistant: * Employees can ask about policies, benefits, and procedures.
The assistant uses RAG over internal docs.
It can create draft tickets in an HR system for certain workflows. A basic red team campaign might: * Exfiltration: plant policy-like documents containing prompt injection that instructs the model to dump other policies verbatim. Confirm whether retrieval makes this visible, and whether the model obeys the injected instructions.
Policy bypass: craft questions in different languages and registers to see whether the assistant pairs benefit guidance with region-appropriate caveats and disclaimers, or whether some slices get materially wrong advice.
Tool abuse: steer the assistant into opening spurious HR tickets at scale, or into routing sensitive queries into inappropriate queues.
Data isolation: verify whether employees in one region can induce the model to reveal policy details or example cases that should only exist in another region. Each success becomes a test. Fixes might include: * Stricter retrieval filters and content sanitization
More constrained tool capabilities and human approvals
Split indices per region and tenant
Adjusted prompts and templates for disclaimers and scope The point is not that this specific system is special. The point is that every deployed assistant and agent has a similar attack surface. Red teaming makes that surface visible. ## Red teaming as routine, not as an event In the end, practical red teaming comes down to discipline. * Treat models, prompts, RAG, and tools as things that can be attacked, not as magic.
Encode attacks as code and tests, not as lore.
Tie severities and fixes to real business and legal impact.
Run the process often enough that it becomes dull. The moment it becomes dull, it starts to work.

AI Telegraph

Red Teaming in Practice: How to Break Your Own Models Before Others Do

Master AI with Top-Rated Courses

Keywords

This should also interest you

Model Security and Red Teaming: Stop Treating Safety as a Prompt

Offensive Prompting: What Real-World Attackers Actually Do to Your LLM

Supply-Chain Security for AI: Models, Weights, Datasets, and Dependencies