Apr 9, 2026
Category

AI Development

6 articles in this category

Mixture-of-Experts at Scale: Sparse Compute, Routing, and Failure Modes
AI Development

Most people still picture a "big model" as a single, dense stack where every token flows through the same layers. Double the parameters, double the memory, almost double the compute. That picture stopped scaling cleanly the moment we tried to push beyond a few dozen billion parameters while staying inside realistic latency and cost budgets.

Brandon Scott · Nov 18, 2025 · 10 min read
Retrieval-Augmented Generation Done Right: Architectures That Actually Work
AI Development

RAG became the default answer to a simple question: how do you get an LLM to talk about things it was never trained on, using data that changes every day? Most teams implement the same recipe. Split documents into chunks, stuff them into a vector store, run a similarity search on user queries, feed the top few chunks into the prompt, hope hallucinations go away.

Daniel Brooks · Nov 13, 2025 · 11 min read
Beyond Chatbots: LLM Tool Use, Function Calling, and Agentic Workflows
AI Development

The "chatbot" metaphor was useful at the beginning. It let people map a strange capability onto something familiar: a text box, a reply, a back-and-forth. As soon as teams tried to build serious systems on top of that metaphor, they hit the wall. A chatbot is a UI. A modern LLM stack is closer to a programmable runtime.

Daniel Brooks · Nov 8, 2025 · 12 min read
Mixture-of-Experts at Scale: Sparse Compute, Routing, and Failure Modes
AI Development

Most people still picture a "large model" as one big uniform block: same layers, same weights, every token marching through the same path. You want more capacity, you make the block bigger. You pay almost linearly in memory, compute, and power. That picture breaks the moment you try to push capacity far beyond what you can afford to run for every single token.

Brandon Scott · Oct 25, 2025 · 11 min read
Quantization, Pruning, Distillation: How to Shrink Models Without Breaking Them
AI Development

Most teams wait too long before thinking about compression. They train or adopt a big model, get excited by its benchmark numbers, ship a prototype, then hit the wall: GPU bills, latency, memory limits, mobile deployment, compliance constraints. At that point, "make it smaller" becomes an urgent request, not a design choice.

Nathan Price · Oct 19, 2025 · 9 min read
Fine-Tuning, Adapters, and Instruction Tuning: A Practical Map of the Options
AI Development

"Let's just fine-tune it" is one of the most expensive sentences in this field. People use the same phrase to describe wildly different things: nudging tone, injecting domain knowledge, fixing safety issues, matching a house style, or building a brand-new capability on top of a base model. Under the hood, those goals map to different techniques, different data requirements, and very different risk profiles.

Nathan Price · Oct 16, 2025 · 11 min read