For a few years, the field lived under a simple rule of thumb: if you make the model bigger, give it more data, and spend more compute, it gets better. The original scaling laws for language models turned this intuition into curves on a log–log plot. Loss fell in a smooth, almost comforting way as parameters, dataset size, and compute all went up together. Architecture details looked like second-order noise.

That picture was reinforced when people started pointing out that many large models were actually undertrained on data. The "compute-optimal" recipes that followed gave teams something close to a spreadsheet formula for planning training runs. You pick a budget, you pick a parameter count, you match it with the right number of tokens, and you can roughly predict where you will land.

If you buy that view, the roadmap is obvious: raise more money, secure more GPUs, scrape more data. Keep going up and to the right. We are now in the part of the story where that picture is visibly breaking down.

## Cracks in the scaling story

Two things have been happening at the same time.

First, the neat curves turned out to be less universal than they looked in the early papers. Replication work and stress tests show that "compute-optimal" ratios come with wide error bars and heavy dependence on details of the training setup. Change the data mixture, regularization, or optimization a little, and your smooth power law starts to wobble. Push into new regimes of data or longer sequences, and you see diminishing returns earlier than you would expect.

Second, people who have been in the field long enough have started saying explicitly that scaling alone will not buy the capabilities everyone is implicitly hoping for. Performance on public benchmarks continues to rise, yet certain behaviors plateau stubbornly: brittle reasoning, shallow planning, poor handling of real-world constraints.
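Both the original promise and the early diminishing returns are easy to see in a toy additive power-law fit of the kind the early papers reported. Everything below is illustrative: the constants are invented for this sketch, not taken from any published fit.

```python
# Toy additive power-law loss of the form L(N, D) = E + A/N^alpha + B/D^beta.
# All constants are invented for illustration; they are not published fits.
E, A, B = 1.7, 400.0, 400.0      # irreducible loss and scale constants
ALPHA, BETA = 0.34, 0.28         # parameter and data exponents

def loss(n_params: float, n_tokens: float) -> float:
    """Predicted loss for n_params parameters trained on n_tokens tokens."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA

# Doubling parameters at fixed data always helps, but each doubling
# shaves off a smaller slice -- the diminishing returns described above.
gain_1 = loss(1e9, 2e10) - loss(2e9, 2e10)   # 1B -> 2B params, 20B tokens
gain_2 = loss(2e9, 2e10) - loss(4e9, 2e10)   # 2B -> 4B params, same data
print(gain_1 > gain_2 > 0)                   # returns shrink with each doubling
```

Under a fit like this, the spreadsheet works: pick a budget, solve for the parameter and token counts, read off the predicted loss. The cracks appear when the exponents themselves drift with the data mixture and training setup.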
The models get larger and more impressive in demos, but some failure modes barely move.

So "what comes after scaling laws" is not a mystical question about the end of progress. It is a practical question about what else needs to enter the equation besides raw size, tokens, and floating-point operations.

## New axes that actually matter

The original scaling laws treated three quantities as central: number of parameters, amount of data, and total compute. Architecture and training strategy were held roughly constant. That was a useful simplification. It is no longer enough. Four additional axes are now clearly first-class.

**Data quality and diversity.** Not all tokens are created equal. Redundant web text, synthetic noise, or badly filtered corpora can keep your loss trending down while doing very little for capabilities. In practice, careful curation and mixture design often buy more than another round of brute-force scaling. At the same parameter count, two models can behave very differently depending on how their data was assembled and in which order it was fed.

**Training objectives.** Pure next-token prediction is a very narrow proxy task. Instruction tuning, preference optimization, supervised fine-tuning on tool use, and other multi-signal objectives change how capacity is used. You can hold size constant and get a model that behaves like it is "smarter" simply by changing what it is punished and rewarded for during training.

**Coupling to the world.** The more systems rely on retrieval, tools, and agents, the less meaningful it is to talk about "the model" as an isolated object. A modest base model backed by a high-quality retrieval stack and a disciplined orchestration layer can outperform a huge monolith for many enterprise workloads. Architecture moves up a level, from the network alone to the overall system graph.

**Safety and control overhead.**
Post-training alignment, guardrails, filtering, and policy enforcement are now part of the core training story. They add extra loss terms, constraints, and potential failure modes. How you shape behavior after pretraining has become as important as how you shrink cross-entropy during it.

Scaling is no longer just "more of the same." It is "more of what, exactly, and in service of which behavior."

## Mixture-of-Experts and conditional capacity

One of the most visible architectural responses to the limits of dense scaling is Mixture-of-Experts. Instead of pushing every token through the exact same stack of dense layers, MoE models route tokens to a subset of specialized experts. Total parameters go up, but active parameters per token stay bounded.

This changes the economics dramatically. You can:

- Concentrate capacity on parts of the input distribution that are genuinely hard.
- Adjust how many experts fire per token to trade off cost against quality.
- Separate the "total knowledge" baked into the network from the price you pay on each forward pass.

The trade-offs are real. Routing quality becomes a critical failure mode. Experts can collapse, under-utilize, or drift. Load-balancing issues show up as performance cliffs at inference time. But the direction is clear: once you make parts of the model conditional, "size" stops being a single number. You have total parameters, active parameters, number of experts, routing capacity, and sometimes even specialized experts for specific modalities or domains.

The scaling question therefore splits. You can scale total parameters without scaling cost linearly. You can shift where in the model you spend those parameters. And you can choose to leave some parts almost untouched for most inputs.

## State space models and the long-context problem

Transformers built around quadratic attention hit hard constraints when you push sequence length. There are many engineering patches for this: windowed attention, clever caching, hierarchical schemes. All of them buy some headroom; none of them removes the fundamental scaling pressure.

State space models attack the problem at its root. Instead of paying quadratic costs to let every token attend to every other token, they build an explicit recurrent state that evolves in linear time. Recent variants, combined with hardware-aware implementations, show that you can get competitive language modeling performance with very different throughput and context trade-offs.

This reshapes the scaling question again. Now you are not just asking how loss scales with parameters and tokens, but how it scales with sequence length at constant latency, or with memory footprint at constant throughput. Hybrid architectures that mix attention blocks, state space layers, and MoE experts bring even more knobs into the picture.
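The linear-time recurrence behind state space layers can be sketched in a few lines. This is a deliberately minimal diagonal SSM, not any particular published variant; the dimensions and constants are arbitrary:

```python
import numpy as np

def ssm_scan(x, a, b, c):
    """Run a diagonal linear state space recurrence over a sequence.

    h_t = a * h_{t-1} + b * x_t    (elementwise/diagonal state update)
    y_t = c . h_t                  (readout)

    Cost is O(T * state_dim): each step touches the previous state once,
    so doubling sequence length doubles work -- no quadratic attention matrix.
    """
    h = np.zeros(a.shape[0])
    ys = []
    for x_t in x:                  # one pass over the sequence
        h = a * h + b * x_t        # evolve the compressed state
        ys.append(c @ h)           # emit an output from the state
    return np.array(ys)

# Toy usage: scalar inputs, a 4-dimensional hidden state.
rng = np.random.default_rng(0)
T, state_dim = 16, 4
x = rng.standard_normal(T)
a = np.full(state_dim, 0.9)        # decay < 1 keeps the state stable
b = rng.standard_normal(state_dim)
c = rng.standard_normal(state_dim)
y = ssm_scan(x, a, b, c)
print(y.shape)                     # one output per input token
```

The design choice worth noticing: everything a token can learn from its past must fit into `h`, a fixed-size compressed state. That is exactly the trade attention refuses to make, and why hybrids that interleave both kinds of layer are attractive.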
You end up deciding which parts of the input deserve global attention, which can rely on compressed state, and which need specialist experts. That is very far from the original recipe of "take a standard Transformer, double everything, and call your cloud rep."

## System-level scaling: retrieval, tools, agents

In practice, powerful models rarely run naked. They sit at the center of systems that include:

- Indexes and vector stores for retrieval.
- Tooling layers for function calls, code execution, or external APIs.
- Orchestration frameworks that chain calls, manage context, and route between models.

There are no clean, widely accepted "laws" for these composite systems yet, but a few patterns recur.

Performance on knowledge-heavy tasks often scales more with retrieval architecture and data quality than with another jump in base model size. Deep and thoughtful RAG design can dominate raw parameter count.

Agent-like systems tend to hit combinatorial blow-ups. Give a model too much autonomy, too many tools, and too few constraints, and you get erratic behavior and unpredictable cost. Here, scaling complexity down in a controlled way can improve reliability more than scaling capacity up.

Latency budgets and user experience become hard constraints. Once you chain multiple calls, each with its own compute and network cost, the naive idea of "bigger is better" collides with real-world patience and wallet sizes.

The important shift is conceptual. The object you are scaling is no longer a single function approximator, but a graph of components. Number of calls, graph depth, retry strategies, and caching all matter as much as parameter count.

## Where the next generation of "laws" is likely to go

If you try to extrapolate, you can see the outlines of the next phase.

First, metrics will have to become multi-dimensional. Instead of one neat power law in parameters, tokens, and compute, you will get joint relationships that depend on active parameters, sequence length, data mixture quality, routing behavior, and system-level design choices.

Second, frontiers will be drawn per task family rather than in the abstract. The architecture that defines the frontier for code generation will not be the same as the architecture that defines the frontier for long-horizon planning, or for multimodal reasoning, or for tightly constrained enterprise workflows.
Third, risk and robustness will be pulled into the core picture. Today, we treat harmful outputs, jailbreaks, and brittleness as a separate layer of concern. That separation is already cracking. The way harmful behavior scales with capacity, data, and architecture will increasingly dictate which scaling paths are acceptable at all.

Fourth, state and memory will become explicit design choices, not side effects. Systems that maintain long-lived representations of the world, user preferences, and past interactions will raise new questions around evaluation, safety, and failure modes. They will also create new axes along which "scaling" can happen: more persistent state, more refined world models, richer planning, even if the base model size remains stable.

## Rethinking what "bigger" means

The practical lesson is simple and uncomfortable. For almost a decade, "bigger" quietly meant "more parameters in a dense Transformer trained on more tokens." The field built culture, infrastructure, and funding narratives around that assumption. It worked for a while. It will continue to work in some regimes.

But if you are building serious systems now, you no longer have the luxury of thinking that way. Bigger can mean more conditional capacity via MoE; more temporal reach via state space models; a broader system surface via tools and retrieval; or a deeper investment in data quality, evaluation, and control.

The original scaling laws were a map for the first part of the journey. They told us that we had left the regime of toy models and entered a world where simple architectures plus brute force were enough to keep capabilities climbing. The next phase will not fit on a single curve. It will look more like a decision tree: choices about where to put capacity, how to couple models to the world, and which risks you are willing to take on.
That is harder to communicate in a keynote slide, but if you care about what these systems actually do in the world, it is the only kind of scaling that still matters.



