Inside the Training Run: Curriculum Design, Data Mixtures, and Emergent Behavior

Victor Ramirez · December 2, 2025 · 19 min read

Most people imagine a training run as a black box. You wire up a giant model.
You dump in "the internet plus some extras."
You let it churn for a few weeks.
You get "intelligence" out the other side.

That story is comforting because it makes scale the only variable that matters. If performance is bad, just add more compute and more data. If something weird emerges, shrug and call it "emergent intelligence."

Inside a real lab, that's not how it feels. A large training run is closer to commissioning an industrial plant. Inputs, flows, and operating regimes matter. Small choices about what goes in and when can show up months later as strange behaviors you didn't intend and can't easily trace back.

Curriculum and data mixtures sound like implementation details. In practice, they are how you steer a system that is too big to reason about directly.

If you work anywhere near serious models, you can't afford to treat the training run as an afterthought. It's where you decide, usually without admitting it, what kind of "mind" you're trying to build.

A training run is a process, not just a job

Strip away the math and you're doing something simple and dangerous:

  • You expose a huge function approximator to a sequence of examples and errors.
  • You ask it to adjust itself to predict its inputs a little better each time.
  • You do this billions or trillions of times.

The structure of that sequence is the curriculum.
The composition of those examples is the mixture.

Both are under your control, even if you act like they aren't. If you randomly mix everything you have and hit "train," you still made a choice. You chose to let data availability decide what the model sees most. You chose not to shape what it learns first, what it sees repeatedly, and what it mostly ignores.
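To make that concrete, here's a toy sketch (source names and token counts are invented, not from any real corpus): if you sample uniformly over whatever documents you have, corpus sizes quietly become your mixture weights.

```python
# Toy illustration: sampling uniformly over all available documents means
# corpus sizes silently become your mixture weights. Source names and
# token counts below are made up for illustration.
corpus_tokens = {
    "scraped_web": 900e9,
    "code": 60e9,
    "books": 30e9,
    "low_resource_langs": 10e9,
}

total = sum(corpus_tokens.values())
implicit_weights = {src: n / total for src, n in corpus_tokens.items()}

for src, w in implicit_weights.items():
    print(f"{src:>20}: {w:.1%}")
# Scraped web dominates at 90%. You "chose" that by not choosing.
```

No sampler was configured here, and yet the run has a mixture: whatever availability handed you.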

On small models, you can get away with that. On large ones, the system is plastic enough that whatever dominates the mixture early and often becomes part of the model's default worldview.

You can pretend that's "just statistics." Or you can admit that, at this scale, statistics and behavior are the same thing.

What curriculum means in practice

Curriculum in ML is not philosophy. It's a blunt set of decisions:

  • What the model sees early versus late.
  • What it sees more often versus rarely.
  • How quickly you increase difficulty or diversity.

You can think of it in a few layers.

Warm-up

Early in training, the model is random and brittle. You can:

  • Start it on simple, short sequences with clean signals.
  • Avoid hammering it with noisy, adversarial, or rare patterns.
  • Let it learn basic syntax, structures, and local consistency.

If you skip this and dump the full chaos of the web in from step one, you get a model that spends a lot of capacity just learning to stay upright. It will still get somewhere, but you waste a lot of gradient budget stabilizing it.

Complexity ramp

Once the model can handle basic patterns, you can:

  • Gradually increase sequence length.
  • Introduce more diverse domains.
  • Mix in more "edge" cases, code, math, and long documents.

The timing here matters.

  • Go too slow and you under-utilize your capacity. You end up with a model that's very good at shallow language games and weak everywhere else.
  • Go too fast and you destabilize learning, especially if you're aggressive with optimization and regularization.
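A minimal sketch of what a ramp can look like, with hypothetical numbers — the function and schedule below are illustrative, not a standard API:

```python
# A toy complexity ramp: linearly grow the maximum sequence length from a
# short warm-up value to the full context over an early fraction of
# training, then hold. All numbers are hypothetical.
def seq_len_at_step(step, total_steps, start_len=512, final_len=8192,
                    ramp_fraction=0.3):
    """Linear ramp from start_len to final_len over the first
    ramp_fraction of training, then hold at final_len."""
    ramp_steps = int(total_steps * ramp_fraction)
    if step >= ramp_steps:
        return final_len
    frac = step / ramp_steps
    return int(start_len + frac * (final_len - start_len))

print(seq_len_at_step(0, 100_000))       # 512: short, stable warm-up
print(seq_len_at_step(15_000, 100_000))  # 4352: midway through the ramp
print(seq_len_at_step(40_000, 100_000))  # 8192: full context
```

The "too slow versus too fast" trade-off lives in `ramp_fraction`: stretch it and you under-train on long context; shrink it and your earliest, most unstable gradients meet your hardest inputs.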

Domain phasing

You might front-load some domains and delay others.

Example:

  • Early: generic web, books, Wikipedia.
  • Middle: more code, technical content, scientific texts.
  • Late: curated, high-value domains, internal data, or specific languages.

The idea is not mystical. You just don't want your early gradients dominated by obscure formats that will be a tiny fraction of usage. You teach the model to be a competent generalist first. Then you layer on specialized skills once its internal representation is strong enough to absorb them.
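Here's one way to sketch that phasing in code. The phase boundaries and weights are invented for illustration; real schedules are tuned against telemetry, not written down once.

```python
# A hypothetical phased mixture schedule mirroring the early/middle/late
# example above. Thresholds and weights are invented.
PHASES = [
    # (training-progress threshold, mixture weights)
    (0.3, {"web": 0.7, "books": 0.2, "wikipedia": 0.1}),
    (0.8, {"web": 0.4, "books": 0.1, "code": 0.3, "science": 0.2}),
    (1.0, {"web": 0.2, "code": 0.2, "curated": 0.4, "internal": 0.2}),
]

def mixture_at(progress):
    """Return the source weights for a point in training, progress in [0, 1]."""
    for threshold, weights in PHASES:
        if progress <= threshold:
            return weights
    return PHASES[-1][1]

print(mixture_at(0.1))   # early: generalist, web-heavy blend
print(mixture_at(0.95))  # late: curated, high-value domains
```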

Refresher and consolidation

Toward the end of training, you can:

  • Cycle back to core domains you care most about.
  • Oversample evaluation-like distributions and high-stakes patterns.
  • Clean up obvious weaknesses exposed by interim evals.

This is where the curriculum stops being a pre-planned schedule and starts being reactive: you look at what the model is actually doing and feed it what it's clearly missing.

If you never do that, you're leaving performance on the table because you're pretending that a static mixture is "neutral."

Data mixtures: what you feed the beast

Mixture design is less romantic than curriculum, but it probably matters more.

A mixture is just a set of sources plus weights:

  • Source A: 30%
  • Source B: 15%
  • Source C: 5%

The sources can be:

  • Scraped web.
  • Books and long-form text.
  • Code repositories.
  • Forums and chats.
  • Documentation and manuals.
  • Domain-specific corpora.
  • Internal logs and curated datasets.

You don't train on "everything." You train on the union of what you could get, what you're allowed to use, and what you chose to include.

Each mixture weight is a lever on emergent behavior.

  • Overweight conversational logs and the model becomes chatty and informal.
  • Overweight code and technical content and it becomes terse and symbolic, but maybe weird in everyday language.
  • Overweight social media and it picks up slang, aggression, and fragmented style.
  • Overweight curated, edited prose and it sounds polished but stiff.

You can't have all of these at full strength. Mixture weights are trade-offs.
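Mechanically, sampling from a mixture is simple; the hard part is choosing the numbers. A minimal sketch, with illustrative sources and weights:

```python
import random

# A toy mixture sampler: each batch slot draws its source according to
# the weights, then (in a real pipeline) a document from that source.
# Sources and weights are illustrative, not from any real run.
SOURCES = {
    "scraped_web":    0.30,
    "books":          0.15,
    "code":           0.25,
    "forums":         0.10,
    "documentation":  0.10,
    "domain_corpora": 0.10,
}

def sample_batch_sources(batch_size, rng=random):
    """Draw a source name for each slot in a batch, weighted by mixture."""
    names = list(SOURCES)
    weights = list(SOURCES.values())
    return rng.choices(names, weights=weights, k=batch_size)

random.seed(0)
print(sample_batch_sources(8))  # 8 source names drawn according to the weights
```

Every behavior trade-off in the list above ultimately cashes out as an edit to a table like `SOURCES`.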

The illusion of "unbiased" mixtures

A common mistake is to treat the mixture as something like a census:

  • "We sample the web proportionally."
  • "We use whatever is popular."

In practice:

  • The web is not a random sample of human experience.
  • Some groups, domains, and styles are massively overrepresented.
  • Others barely exist or are hidden behind paywalls and walled gardens.

If you blindly follow availability, you are encoding those structural biases into your model. At scale, that becomes the "default voice" and "default worldview."

So labs nudge:

  • Down-weight spam, SEO junk, near-duplicates, and obvious garbage.
  • Up-weight higher-quality sources according to internal heuristics.
  • Filter out disallowed content categories (child abuse, explicit violence, etc.).

Each of those filters is a mixture decision in disguise.

How aggressively you cull noisy or offensive data affects:

  • Model robustness to bad inputs.
  • Model familiarity with real-world ugliness.
  • Model tendency to reproduce harmful patterns versus ignore or critique them.

Again, there is no neutral choice. If your filtering is too aggressive, the model becomes brittle and blinkered. If it's too loose, the model picks up behaviors you'll later spend millions trying to train out.
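A toy example of why filtering is a mixture decision in disguise. The token counts and survival rates below are invented: the point is only that a filter which removes different fractions of different sources rewrites your effective weights without touching the mixture table.

```python
# Invented numbers: an aggressive quality filter removes a different
# fraction of each source, so the post-filter mixture differs from the
# one you wrote down.
raw_tokens = {"scraped_web": 500, "forums": 200, "books": 100}   # billions
survival   = {"scraped_web": 0.4, "forums": 0.2, "books": 0.9}   # fraction kept

kept = {src: raw_tokens[src] * survival[src] for src in raw_tokens}
total = sum(kept.values())
effective = {src: n / total for src, n in kept.items()}

raw_total = sum(raw_tokens.values())
for src in effective:
    print(f"{src}: {effective[src]:.1%} effective "
          f"(raw share was {raw_tokens[src] / raw_total:.1%})")
# Books jump from 12.5% raw to ~27% effective without anyone "choosing" that.
```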

Mixture over time

Mixture is not static. Early runs might use one blend. Later runs adjust based on:

  • Observed weaknesses.
  • Legal and licensing constraints.
  • Strategic shifts (more code focus, more multi-modal data, more non-English).

Between generations, you're accumulating changes that compound. A small increase in one domain and small decrease in another doesn't show up immediately, but over years it can tilt your entire model family.
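Compounding drift is easy to see with toy numbers (these are invented, not any lab's actual weights):

```python
# Shift 2 points from web to code each generation. No single step looks
# dramatic, but over a model family it compounds.
weights = {"web": 0.60, "code": 0.10, "other": 0.30}
for gen in range(1, 6):
    weights["web"]  -= 0.02
    weights["code"] += 0.02
    print(f"v{gen}: web={weights['web']:.2f} code={weights['code']:.2f}")
# By v5, code's share has doubled relative to v0.
```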

People outside see "v4 is more X, less Y." Inside, that's often just mixture drift plus curriculum tweaks.

Emergent behavior: when the model teaches you what you trained it on

Emergence gets talked about like magic. It isn't.

When you see a large model:

  • Doing arithmetic it wasn't explicitly taught.
  • Following complex instructions in natural language.
  • Inventing intermediate representations for tasks you never supervised.

you're seeing the consequences of three things:

  • Scale.
  • Redundancy in the data.
  • A training process that doesn't collapse under its own weight.

Curriculum and mixtures shape which emergent behaviors are accessible.

  • If you feed enough math-like patterns, algorithmic explanations, and code, the model will eventually internalize procedures. Not as symbolic programs, but as distributed tendencies that approximate algorithms.
  • If you feed enough dialogues, Q&A, and policy-driven interactions, the model will eventually internalize "roles" and "rules of discourse."
  • If you feed enough narratives, long-form reasoning, and editing examples, the model will eventually learn to maintain threads over longer horizons.

You don't directly train those behaviors. You create conditions where they're useful for reducing loss on a wide variety of examples. Then you turn the crank.

The surprises are usually where your mental model of the mixture was wrong. You discover the model:

  • Knows far more about a niche topic than expected.
  • Is strangely stubborn or evasive in some domains.
  • Picks up weird stylistic quirks you thought you filtered out.

The model is just reflecting the actual distribution of your data and the effective curriculum it experienced, not the document where you wrote down what you thought it saw.

Why curriculum and mixtures matter more as models grow

On small models, your main problem is undercapacity. You worry about how much you can cram in before the model saturates.

On large models, you're closer to the opposite: overcapacity relative to what you can understand. The model is big enough to:

  • Memorize a lot more than you expect.
  • Form internal abstractions you cannot easily inspect.
  • Interact with your curriculum in nonlinear ways.

At that point, curriculum and mixtures become steering mechanisms, not just knobs.

You're no longer asking "what can this model learn at all?" You're asking "which behaviors do we want to make easy, and which do we want to leave underdeveloped?"

If you don't ask that, the answer becomes "whatever was easy and frequent in the training data," which is rarely what you would choose if you were being deliberate.

Curriculum failures in the wild

You can see curriculum and mixture mistakes in many deployed models if you know what to look for.

Overfitted personas

Models that default to a narrow voice:

  • Constant hedging and policy language.
  • Forced cheerfulness.
  • Over-formal or overly casual tone regardless of context.

This often reflects:

  • Late-stage oversampling of safety and policy conversations.
  • A curriculum with heavy exposure to scripted refusals and apologies.

It's not that "the model is cautious." You trained it to simulate a support agent that's always on thin ice.

Topic whiplash

Models that can do deep reasoning in some domains and fall apart in others.

  • Good on programming, brittle on basic factual questions in low-resource languages.
  • Solid on scientific text, but confused by everyday idioms or low-prestige dialects.

That's often mixture and curriculum:

  • High-quality, dense coverage in a few domains.
  • Patchy, noisy, or sparse coverage elsewhere.
  • No deliberate curriculum to normalize and improve low-resource areas.

Short-horizon thinking

Models that sound smart but can't maintain coherent long chains of reasoning. They answer locally well but lose the plot over long contexts.

You can blame architecture, but curriculum is in there:

  • Insufficient emphasis on long documents, multi-step tasks, and structured arguments.
  • A training regime that rewards sentence-level prediction more than document-level coherence.

If you never train on tasks that require maintaining a thread over 10+ steps, the model has no reason to make that capacity robust, even if the architecture allows it.

Curriculum as a bridge between pretraining and fine-tuning

Most discussions slice training into:

  • Pretraining: broad, generic data.
  • Fine-tuning: narrow, task- or instruction-specific data.

Curriculum connects them.

Pretraining curriculum decisions affect:

  • What base capabilities are easy to elicit in fine-tuning.
  • How much you have to fight the model's defaults when you teach it instructions.
  • Which domains transfer well to new tasks with little data.

Fine-tuning curriculum decisions affect:

  • How much the model forgets or suppresses from pretraining.
  • Where it becomes overconfident versus appropriately uncertain.
  • How brittle it becomes to prompt phrasing.

You can't treat these phases as independent. If your pretraining mixture is heavy on informal, noisy chat, and your fine-tuning is a small, clean set of terse instruction-following examples, the model will constantly slip back toward its noisy defaults when stressed.

If your pretraining is saturated with code and structured reasoning, and your fine-tuning barely touches those capabilities, you'll leave performance on the table.

Curriculum across phases is where you decide:

  • "Teach this early and deep."
  • "Introduce this later, but gently."
  • "Protect this behavior from being overwritten by later, smaller runs."

Instrumentation: seeing inside the run while it's in flight

You can't fix what you can't see. Serious training runs are instrumented heavily:

  • Loss by domain and source over time.
  • Perplexity curves segmented by data type.
  • Emergent capability probes at checkpoints.
  • Adversarial and eval suites run periodically mid-training.

Curriculum and mixture choices are tested live:

  • If code perplexity lags while language improves, you may adjust mixture or learning rates for code-heavy batches.
  • If long-context performance plateaus early, you may introduce more long documents or change masking schemes.
  • If safety probes start failing midway, you may need to adjust how much harmful or borderline content the model sees unsupervised.
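A stripped-down sketch of the first bullet — smoothed loss per data source, with lagging domains flagged for a mixture or learning-rate adjustment. The class, thresholds, and losses are invented for illustration:

```python
# Toy per-domain telemetry: keep an exponential moving average (EMA) of
# loss per data source and flag sources sitting well above the mean.
class DomainLossTracker:
    def __init__(self, alpha=0.1):
        self.alpha = alpha  # EMA smoothing factor
        self.ema = {}

    def update(self, domain, loss):
        prev = self.ema.get(domain, loss)
        self.ema[domain] = (1 - self.alpha) * prev + self.alpha * loss

    def lagging(self, margin=0.5):
        """Domains whose smoothed loss sits `margin` nats above the mean."""
        if not self.ema:
            return []
        mean = sum(self.ema.values()) / len(self.ema)
        return [d for d, v in self.ema.items() if v > mean + margin]

tracker = DomainLossTracker()
for _ in range(50):
    tracker.update("web", 2.0)
    tracker.update("code", 3.2)  # code loss lagging the rest of the run

print(tracker.lagging())  # ['code'] -> adjust mixture or LR for code batches
```

Real instrumentation is far richer than this, but the shape is the same: per-source signals, a baseline, and a trigger for intervention.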

You're not flying blind. You're flying a large, sluggish system where every correction has a time delay and a cost.

Good teams treat curriculum and mixture as controllable inputs, not accidents. They're willing to pause, adjust, and restart when the telemetry says the run is drifting somewhere bad.

Bad teams treat the run as sacred because it's expensive, and then act surprised when the model comes out warped.

Emergence you don't want

Some emergent behaviors are useful. Some are not.

Useful:

  • Generalization to new tasks.
  • Robustness to noise.
  • Implicit translation or code synthesis.

Not useful:

  • Mode-locking into one politeness style.
  • Overconfident hallucination.
  • Exploitable jailbreak patterns.
  • Subtle biases that track your worst data regions.

Curriculum and mixtures don't guarantee you avoid the bad ones. But they can make them more or less likely.

  • Models that see a lot of uncertain, self-correcting behavior during training tend to be more calibrated later.
  • Models that see overconfident, absolute statements on every topic tend to adopt that tone.
  • Models that see a single dominant narrative about controversial topics tend to internalize it as "truth" rather than "one of several views."

Again: the model is just doing gradient descent on what you fed it. If you train it on monologues, don't be surprised when it lectures. If you train it on arguments and revisions, don't be surprised when it hedges and debates.

Why this matters beyond labs

Curriculum design and data mixtures sound like inside baseball for people with badges and hardware quotas. They're not.

They are the closest thing we have to "values" encoded at training time:

  • Which languages and dialects matter.
  • Which domains are treated as central versus peripheral.
  • Which discourses are overrepresented.
  • Which kinds of uncertainty are acknowledged versus bulldozed.

Deployment-time safety filters and instruction tuning can only do so much. They're guardrails and surface polish on top of a distribution of behaviors baked in earlier.

If the underlying training run was careless, you're constantly fighting the base distribution with patches. If it was deliberate, your downstream work is aligning something already inclined to cooperate.

The point

A training run is not just a large matrix multiplication exercise. It's a process where you:

  • Choose what the model sees.
  • Choose when it sees it.
  • Choose how hard you push on different objectives.

Curriculum and mixtures are how those choices show up concretely.

Ignore them, and you get whatever pattern of behavior the path of least resistance produces.

Design them, instrument them, and adjust them, and you have at least some say in which emergent behaviors become dominant.

At the scale we're operating now, that difference is the line between building systems you can reason about and systems you only recognize after they've already started shaping the world around them.


Keywords

Training, Curriculum Design, Data Mixtures, Emergent Behavior, ML Pipeline, Model Development
