Introduction
Inside most labs, "safety" starts as a document. A few dozen pages of goals, red lines, and phrases like "the model should never" and "the model must always." No self-harm instructions.
No targeted harassment.
No incitement to violence.
No evasion of law enforcement.
Be honest, be helpful, be respectful, be neutral. On paper, it looks clean. Then you try to turn those sentences into something a model can optimize against, and the whole thing comes apart. The reality is blunt: the distance between a safety spec and the loss function in a training run is where most of the real decisions get made. It's also where most of the comforting simplifications die. If you are not tracking that translation step by step, you are not doing safety work. You are writing policy theater. Let's walk the path.
The safety spec is a political document first
Before a single loss term is written, the spec is negotiated. You have:
Legal worrying about liability.
Policy teams worrying about regulators and headlines.
Security worrying about abuse.
Product worrying about usability.
Research worrying about not torpedoing capabilities.
They argue over things like:
"Should the model ever explain how to bypass DRM if the user claims it's for legitimate reasons?"
"Is criticism of institutions allowed if it sounds harsh?"
"What counts as 'medical advice' versus general information?"
Every "must" or "must not" in the spec encodes a compromise between these groups. The spec ends up as a layered object:
High-level goals
"Do not cause physical harm."
"Respect user autonomy and privacy."
Category lists
Violence, self-harm, extremism, hate, sexual content, crime, child safety, etc.
Examples
Snippets of "allowed," "borderline," and "disallowed" content.
Operational rules
How to respond: refuse, partially comply, redirect, warn, escalate.
None of that is numeric yet. It's written for humans. The model never sees it. What the model sees is labels and gradients. The translation from one to the other is where the real governance sits.
From prose to taxonomies and labeling schemas
The first mechanical step is turning the spec into something labelers can use. You cannot just hand raters a PDF and say "mark this as safe or unsafe." You need:
A taxonomy
Where exactly do you draw the line between "self-harm support" and "mental health education"?
Is "how to hide from your abusive partner" allowed? What about "how to hide from the police"?
Instructions
What should the model do when the user asks for something disallowed?
Give a generic refusal?
Offer safe alternatives?
Express empathy?
Scoring rubrics
Binary safe/unsafe is rarely enough. You end up with:
"Strongly harmful"
"Borderline"
"Benign but sensitive"
"Clearly safe"
These rubrics must encode two things at once:
Content classification: what kind of topic is this?
Behavioral evaluation: did the model respond the way the spec wants?
A typical labeling task might look like:
You're given a user prompt and two model replies.
You're asked to:
- Classify each reply into policy categories.
- Rate each reply on axes like "helpfulness," "harmlessness," "honesty," "politeness."
- Choose which reply is better overall given the policy.
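The task above can be sketched as a simple record. This is an illustrative schema, not any lab's actual tooling; all field names are invented.

```python
from dataclasses import dataclass, field

@dataclass
class LabelingTask:
    """One rater task: a prompt plus two candidate replies (illustrative schema)."""
    prompt: str
    reply_a: str
    reply_b: str
    # Policy categories assigned to each reply, e.g. ["self_harm", "benign"]
    categories_a: list = field(default_factory=list)
    categories_b: list = field(default_factory=list)
    # Per-axis ratings on a 1-5 scale, e.g. {"helpfulness": 4, "harmlessness": 5}
    ratings_a: dict = field(default_factory=dict)
    ratings_b: dict = field(default_factory=dict)
    # Which reply the rater prefers overall, given the policy: "a" or "b"
    preferred: str = ""

task = LabelingTask(
    prompt="How do I pick a lock?",
    reply_a="I can't help with that.",
    reply_b="Locksmithing is a licensed trade; here is general background...",
)
task.ratings_a = {"helpfulness": 1, "harmlessness": 5}
task.ratings_b = {"helpfulness": 4, "harmlessness": 4}
task.preferred = "b"
```

Note that the `preferred` field is where the whole spec gets compressed into a single bit per comparison.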
In that moment, the safety spec is being compressed into human judgments. Those judgments are what get distilled into a reward model or direct supervised loss. If the labeling guidelines are vague, inconsistent, or contradictory, the policy that ends up in the model will be too.
Reward models: policy turned into numbers
Most modern systems do not directly optimize on binary "safe/unsafe" tags. They optimize on reward models. Roughly:
You collect many pairs of model outputs for the same prompt.
Human raters say which one is better given the spec.
You train a separate model R to predict those human preferences.
That model outputs a scalar reward r(x, y) for prompt x and reply y.
That scalar is what shows up in the loss. In reinforcement learning terms, you try to maximize expected reward:
Maximize E[r(x, y_model)] over the data distribution, with some regularization.
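The preference-prediction step is usually a Bradley-Terry style pairwise loss: train R so the human-preferred reply scores higher. A minimal sketch in plain Python, with scalar rewards standing in for a learned model:

```python
import math

def preference_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry / logistic loss: -log(sigmoid(r_chosen - r_rejected)).
    Small when the reward model scores the human-preferred reply higher,
    large when it misranks the pair."""
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# If the reward model agrees with the rater, the loss is near zero;
# if it disagrees, the loss is large and the gradient pushes hard.
agree = preference_loss(2.0, -1.0)
disagree = preference_loss(-1.0, 2.0)
```

Every rater disagreement, every guideline ambiguity, flows through this margin term into the gradients.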
In practice, it's more tangled:
You might have separate reward models for "helpful" and "harmless" behavior.
You might add penalties for specific red-flag categories.
You might mix in language modeling loss to preserve base capabilities.
But the core point holds: the safety spec becomes a set of reward signals, and those signals are just numbers.
The mess hides in the details:
What prompts are included in the reward data.
How heavily you oversample dangerous or borderline prompts.
How you handle disagreement between raters.
How you treat "refuse politely" versus "comply carefully" in ambiguous cases.
If your raters consistently mark "over-refusal" as safer, the reward model will learn that being evasive is good.
If your raters reward "honest but uncomfortable" answers, the model will mirror that.
All of that is safety work. None of it is in the glossy spec.
Loss functions as a battleground of objectives
By the time you reach a real training run, "safety" is one term among several in the loss. You might have:
A base language modeling loss
Encouraging the model to stay close to pretraining behavior.
A supervised fine-tuning loss
Pushing it toward preferred answers collected under guidance.
A reward-based loss
Using the learned reward model to nudge outputs toward "better" and away from "worse."
Adversarial losses
Penalizing the model when red-teaming systems find violations.
Constraint losses
Enforcing particular formats or behaviors (e.g., tool calls, references, citations).
Mathematically, this ends up as something like:
L_total = L_LM + λ_helpful L_helpful + λ_safe L_safe + λ_format L_format + …
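That weighted sum can be sketched in a few lines; the loss names and lambda values here are illustrative, not any real run's configuration:

```python
def total_loss(losses: dict, weights: dict) -> float:
    """Weighted sum of per-objective losses: L_total = sum(lambda_i * L_i).
    A missing weight defaults to 0.0, i.e. that objective is silently
    ignored -- exactly the kind of decision that should be explicit."""
    return sum(weights.get(name, 0.0) * value for name, value in losses.items())

# Illustrative per-batch loss values and priority weights:
losses = {"lm": 2.1, "helpful": 0.8, "safe": 1.5, "format": 0.2}
weights = {"lm": 1.0, "helpful": 0.5, "safe": 2.0, "format": 0.1}
```

The `weights` dict is the organization chart in numeric form: doubling `safe` relative to `helpful` is a policy decision, whoever actually commits the change.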
Those lambdas are not just tuning knobs. They are statements of priority.
If you set λ_safe too low, the model will happily trade off safety to be more helpful or fluent.
If you set λ_safe too high, the model becomes uselessly cautious, refusing innocuous requests and frustrating users.
There is no "correct" choice. There is only what your leadership decides is acceptable.
You also have to choose the shape of the penalty:
Do you apply a gentle gradient when behavior is mildly non-compliant?
Do you apply a sharp penalty when a threshold is crossed?
Do you treat all policy violations equally or weight some more heavily?
For example:
Sharing bomb-making instructions might receive a huge negative reward.
Speaking harshly about a public figure might receive a mild penalty or none, depending on your policy.
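Those choices correspond to different penalty curves. A toy sketch contrasting a smooth gradient with a thresholded cliff; the severity scale, threshold, and category weights are all invented for illustration:

```python
def smooth_penalty(severity: float) -> float:
    """Gentle gradient: penalty grows quadratically with violation severity,
    so mildly non-compliant behavior gets mildly pushed down."""
    return severity ** 2

def thresholded_penalty(severity: float, threshold: float = 0.5,
                        spike: float = 100.0) -> float:
    """Sharp cliff: behaves like the smooth penalty below the threshold,
    then jumps to a huge fixed penalty once the line is crossed."""
    return severity ** 2 if severity < threshold else spike

# Per-category weights encode "not all violations are equal":
category_weight = {"weapons_instructions": 50.0, "harsh_criticism": 0.0}
```

The optimizer only ever sees these curves, never the prose rationale behind them.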
All of those decisions show up as curves in the loss landscape. That is where the spec becomes something the optimizer can see.
Data distribution and mining: the hidden half of "safety"
Loss functions do not act in a vacuum. They act on whatever data you feed them. Two models with identical loss terms can behave very differently if:
One sees many adversarial prompts and refined refusals during training.
The other mostly sees benign chitchat and polite answers.
Labs quietly do a lot of work on:
Collecting real usage data and sifting out risky prompts.
Generating synthetic attacks via automated red-teamers.
Mining logs for failures and high-impact near misses.
Then:
Curating those examples.
Labeling them more carefully.
Feeding them into new training runs with higher weight.
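The "higher weight" step can be sketched as weighted sampling across data buckets; the bucket names, contents, and weights here are invented:

```python
import random

def build_training_mix(buckets: dict, weights: dict, n: int, seed: int = 0) -> list:
    """Draw n training examples, with each bucket's selection probability
    proportional to its weight. Mined failures typically get a much
    higher weight than benign chitchat."""
    rng = random.Random(seed)
    names = list(buckets)
    probs = [weights[name] for name in names]
    mix = []
    for _ in range(n):
        bucket = rng.choices(names, weights=probs)[0]
        mix.append(rng.choice(buckets[bucket]))
    return mix

buckets = {
    "benign_chat": ["hi", "tell me a joke"],
    "mined_failures": ["<adversarial prompt>"],
}
weights = {"benign_chat": 1.0, "mined_failures": 5.0}
```

With these weights, roughly five-sixths of the mix comes from mined failures, which is why the sampling table matters as much as the loss terms.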
In practice, this feedback loop shapes behavior as much as the original spec. Policies drift as the data drifts:
If new attack styles emerge, they show up in the training data.
If certain categories become more politically sensitive, they get more attention.
If regulators focus on one specific harm, that harm gets oversampled in the next training run.
What matters is not what's on page 14 of the spec. It is what ends up in the high-weight buckets of your training dataset.
The fracture lines: where safety specs and loss functions clash
Once you see the whole pipeline, you can predict where things will break.
Ambiguous categories
Specs often try to split hairs:
"Supportive discussion of self-harm without providing instructions."
"Discussion of illegal activity in a discouraging or critical context."
Labelers struggle with these. Reward models inherit their confusion. Models then learn weird boundary behavior:
Over-refusing harmless content that contains certain keywords.
Under-reacting to genuine risk phrased in euphemism or humor.
Cross-cultural and multilingual variability
Specs are usually written in one language and cultural frame. Models operate in many.
Questions:
Is a particular phrase hate speech, reclaimed slang, or satire?
Is a description of a medical practice normal in one country and taboo in another?
If your reward data is mostly in one language and culture, the model will be safer there and sloppier elsewhere. The loss function does not know what "culture" means. It sees only token sequences and labels.
Capabilities versus safety tension
Every time you penalize a model for explaining how to do something dangerous, you risk removing knowledge that is also useful in legitimate contexts.
Chemistry and bio.
Security and exploit mitigation.
Law and gray-area advice.
You can try to encode "only when user intent is clearly malicious." In practice, intent inference is noisy. The loss function will push the model toward patterns that satisfy the reward on average:
Smoothed-over, generic responses whenever certain topics appear.
Polite evasions that avoid specific details.
You end up with models that "know but won't say" in ways that are uneven and sometimes absurd.
Measurable metrics versus actual risk
Internal dashboards want numbers.
Refusal rate on red-team prompts.
Rate of policy-violating completions on some test suite.
Average reward according to safety reward model.
These are proxies, not ground truth. Goodhart applies:
Optimizing for "low measured violation rate" encourages:
Models that over-refuse.
Tests that avoid complicated edge cases.
Reward models that learn to recognize safe-sounding evasions as "good."
If your safety work reports progress only in terms of these metrics, you will drift toward models that look safe under your own instruments and less safe under real adversarial use.
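The proxy nature is easy to see once a dashboard metric is written down. A sketch; the refusal classifier is a placeholder, and real systems use far noisier judges:

```python
def refusal_rate(responses: list, is_refusal) -> float:
    """Fraction of responses judged as refusals by some classifier.
    A model that refuses everything 'wins' on this number."""
    return sum(1 for r in responses if is_refusal(r)) / len(responses)

# A degenerate model that always refuses scores a perfect 1.0:
responses = ["I can't help with that."] * 10
rate = refusal_rate(responses, lambda r: "can't help" in r)
```

Nothing in that ratio distinguishes a careful refusal from a lazy one, which is precisely how Goodhart creeps in.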
The organization chart inside the loss
One uncomfortable fact: organizational power shows up in the loss function.
Who sets λ_safe versus λ_helpful is not decided by the optimizer. It is decided by:
Leadership appetite for risk.
External pressure from regulators and media.
Internal advocacy from safety, product, and research groups.
If product insists the model must answer as many user questions as possible, you get pressure to reduce over-refusals.
If legal insists the company must never be seen "assisting wrongdoing," you get pressure to increase refusals and sanitization.
The compromise ends up as:
Different penalties for different harm categories.
Special handling for particular topics.
Backstops like hard filters on output.
From the outside, you see a model that refuses some requests, answers others, and wobbles on the line. From the inside, you see competing internal factions, each with their own nightmare scenario, pushing on the loss landscape.
A stripped-down path from spec to run
If you compress the whole process into a skeleton, it looks like this.
1. Policy writes a spec
Topics, red lines, desired behaviors, examples.
2. Safety and research teams convert it into schemas
Taxonomies, labeler guidelines, prompt templates.
3. Data and ops collect and label
Benign and adversarial prompts.
Model responses.
Human judgments of "better" and "worse."
4. Research trains reward models
One or more models that map (prompt, response) to reward scores along different axes.
5. Training engineers design a composite objective
Define the mix of:
Base LM loss.
Supervised fine-tune loss.
Reward-based losses.
Regularizers and constraints.
6. The training run happens
Gradients flow. Weights move. Resource meters spin.
7. Evaluation hits the new model
Synthetic test suites.
Red-team probes.
Offline replay of some real data.
8. Gaps are discovered
Too cautious here.
Too loose there.
Reward model brittle in some slices.
9. The loop repeats
Spec is updated.
Guidelines adjust.
Data is recollected.
New run.
At no point is there a magical step where "the policy" jumps directly into the model. It is always mediated by people, data, and choices about how hard to push in each direction.
What better practice actually looks like
If you want to do more than gesture at safety, you need to admit what the pipeline really is and intervene where it counts.
Explicit trade-offs, not hidden ones
Write down, in plain language:
Where you are deliberately trading helpfulness for safety.
Where you are accepting higher risk to avoid crippling the product.
Then reflect those choices in:
Loss weights.
Reward model targets.
Public documentation.
Ignoring the trade-offs does not remove them. It just hides them from scrutiny.
Reward model introspection
Treat reward models as safety-critical artifacts. Audit them:
What prompts do they see?
Where do they over-reward or under-reward?
How do they behave on adversarially chosen inputs?
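An audit can start as simply as sweeping the reward model over probe pairs where humans agree on the ranking, and flagging misrankings. Everything below is a placeholder sketch, not a real artifact:

```python
def audit_reward_model(reward, probes: list) -> list:
    """probes: list of (good_reply, bad_reply) pairs where human consensus
    says good should outrank bad. Returns the pairs the reward model
    misranks (scores good at or below bad)."""
    return [(g, b) for g, b in probes if reward(g) <= reward(b)]

# A 'flattery detector': a broken reward that just likes safe-sounding phrases.
def naive_reward(text: str) -> float:
    return 1.0 if "I want to be helpful and safe" in text else 0.0

probes = [
    ("Here is the accurate, careful answer.",
     "I want to be helpful and safe, so here is a vague non-answer."),
]
failures = audit_reward_model(naive_reward, probes)
```

Here the audit flags the one probe pair, because the reward model prefers the safe-sounding evasion over the substantive answer.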
If you find that your safety reward model gives high scores to safe-sounding but misleading answers, you have not built "safety." You have built a flattery detector.
Real adversarial testing, not toy attacks
Test with:
Attackers who don't care about your internal taxonomies.
Languages and dialects underrepresented in your training data.
Cross-domain prompts that combine benign and dangerous elements.
Then connect those failures back into the pipeline:
Update specs.
Adjust labeling.
Retrain reward models.
Change loss weights and sampling strategies.
Cross-functional ownership
Get research, safety, infra, and product in the same room and walk through an actual training pipeline.
Who owns:
The spec.
The data.
The reward models.
The loss design.
The final sign-off before deployment.
If the answer to any of those is "no one clearly," that's where trouble will come from.
The point
"Safety" is easy to talk about at the level of principles and policies. It is harder to talk about at the level of loss functions and datasets. That's where most organizations stop.
But models do not optimize against memos. They optimize against the signals we actually give them: labels, rewards, penalties, and the distributions we sample from.
From the moment a sentence in a safety spec becomes a line in a labeling rubric, then a scalar from a reward model, then a term in a composite loss, you are no longer arguing about values in the abstract. You are arguing about which behaviors will be pushed up or down by gradient descent.
If you care about what these systems do in the world, that is the argument you cannot afford to skip.