If you only look at the line item "LLM API spend" in your billing console, you are already behind. Language model inference looks cheap in the prototype phase. A few cents here and there, some generous free credits, a dashboard full of nice round numbers per million tokens. Then the product begins to stick. Traffic ramps. Finance starts asking why your COGS is tied to a third-party rate card you do not control.

The problem is not just that models are "expensive." The problem is that most teams do not have a clear mental model of what they are actually buying, how they are consuming it, or what levers they have to change the economics. This is an attempt to put structure on that.

## What you are actually paying for

At the crudest level, you are paying for tokens: input plus output. But behind those tokens is a cost stack. You are paying for GPU time and memory on someone else's cluster.
You are paying for the engineering that keeps those clusters up, balanced, and patched.
You are paying for training amortization and for the R&D budget baked into the model's price.
You are paying for guardrails, logging, abuse handling, and compliance work you do not see.

From your perspective, all of that collapses into a simple unit: dollars per million tokens for a given model and context length. That simplicity hides the variables you actually control. The three levers you do own are:

- How many tokens you send and receive per task.
- Which model tier you use for which task.
- How efficiently you batch and schedule those tasks.

The difference between an "expensive" product and a profitable one is mostly how seriously you take those three points.

## Tokens are not free; context is the real tax

Most cost discussions start with outputs: how long the model's answers are. In practice, inputs dominate once you move beyond trivial prompts. Every extra paragraph in your system prompt, every redundant chunk in your RAG context, every verbose tool schema is a recurring tax. You pay for it on every request. You can see this in two products built on the same model:

### Product A:

- Short system prompt.
- Tight, template-based instructions.
- Retrieval that feeds only the minimum set of relevant passages.
- Deliberate cap on answer length.

### Product B:

- Huge "persona" prompts.
- Dynamic instructions pasted in full on every turn.
- Naive retrieval that dumps ten documents into context "just in case."
- No length control on answers.

Both hit the same model. Product B spends multiples more per completed task, for no proportional gain in value. The difference is not technology; it is discipline. The simplest way to improve unit economics is to treat context tokens like money. Because they are.

## Latency, batching, and utilization

The second lever is how you use the provider's hardware. Providers make money by keeping GPUs hot: high utilization, large batches, minimal idle time. When you fire single-token requests at random intervals, they can still batch across tenants. When you insist on low latency and use big models with random spikes in traffic, you become an expensive customer to serve, and you pay for it in the price.

On your side, the constraints are:

- Latency you can tolerate for each user-facing path.
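The gap between the two product profiles above is easy to quantify. A minimal sketch, with hypothetical per-token prices and token counts standing in for a real rate card:

```python
# Hypothetical prices; substitute your provider's actual rate card.
PRICE_IN = 3.00 / 1_000_000    # dollars per input token
PRICE_OUT = 15.00 / 1_000_000  # dollars per output token

def cost_per_task(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one completed task at the prices above."""
    return input_tokens * PRICE_IN + output_tokens * PRICE_OUT

# Product A: tight prompt, minimal retrieval, capped answers (assumed counts).
product_a = cost_per_task(input_tokens=1_500, output_tokens=300)
# Product B: persona prompt, ten dumped documents, uncapped answers (assumed counts).
product_b = cost_per_task(input_tokens=12_000, output_tokens=900)

print(f"A: ${product_a:.4f}/task  B: ${product_b:.4f}/task  "
      f"ratio: {product_b / product_a:.1f}x")
```

The absolute prices are made up; the point is that the ratio between the two profiles is driven almost entirely by input-side discipline.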
- The shape of your traffic: spiky vs smooth, interactive vs batch.
- Your ability to batch calls or precompute.

A few patterns change the economics dramatically:

### Micro-batching

Combine many small tasks into a single request where possible: multiple classification decisions per call, multiple short prompts processed together.

### Asynchronous work

Move non-critical work off the hot path. Precompute embeddings, summaries, and suggestions in the background instead of doing everything synchronously in the user's request.

### Tiered latency

Not every interaction needs the same latency SLO. A chat reply is different from a nightly report. Relaxed SLOs let you use cheaper models and benefit more from batching.

If you design everything like a low-latency chat, you will run your compute like a trader but sell it like a SaaS company. That is a bad trade.

## Model tiers and cascades

The third lever is matching the model to the job. Treating all requests as equal and sending them to the same "best" model is easy. It is also a good way to turn impressive demos into unprofitable products.

A more realistic stack has at least three types of models:

- A small, cheap filter or router. It classifies, routes, and sometimes answers trivial queries.
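The micro-batching pattern above reduces to prompt packing and response unpacking. A minimal sketch; `call_llm` is a hypothetical stand-in for whatever client you actually use:

```python
# Micro-batching sketch: N classification decisions in one request instead of N.
# `call_llm` is a hypothetical callable (prompt -> completion text).

def build_batch_prompt(items: list[str]) -> str:
    """Pack many small classification tasks into a single prompt."""
    numbered = [f"{i + 1}. {text}" for i, text in enumerate(items)]
    return (
        "Label each numbered item as POSITIVE or NEGATIVE.\n"
        "Answer with one label per line, in order.\n\n" + "\n".join(numbered)
    )

def parse_batch_response(response: str, n: int) -> list[str]:
    """Split the model's reply back into per-item labels, failing loudly."""
    labels = [line.strip() for line in response.splitlines() if line.strip()]
    if len(labels) != n:
        raise ValueError(f"expected {n} labels, got {len(labels)}")
    return labels

def classify_batch(items: list[str], call_llm) -> list[str]:
    """One round trip carrying many decisions."""
    return parse_batch_response(call_llm(build_batch_prompt(items)), len(items))
```

The shared instruction text is paid for once per batch instead of once per item, which is exactly the context tax at work.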
- A mid-tier workhorse. It handles most of the routine load at a reasonable price.
- A high-end frontier model. It is reserved for hard cases where the quality difference actually matters.

You wire them into cascades. For example:

- Router decides whether a query is simple FAQ, a document lookup, or a complex reasoning task.
- FAQ queries go to a small model that pulls from a precomputed answer table.
- Document lookups go through RAG plus a mid-tier model.
- Only genuinely complex reasoning or delicate user-facing cases hit the frontier model.

The economics are simple: you minimize the percentage of traffic that touches the expensive model, without meaningfully degrading user-perceived quality. The design work is not in the models themselves. It is in specifying what "hard enough to justify the expensive model" actually means for your use case, and in building evals that tell you whether your cascades are making the right tradeoffs.

## Retries, agents, and hidden waste

One of the easiest ways to burn money is to hide retries and loops inside your orchestration. Every time you:

- Regenerate because you did not like the answer.
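A cascade like the one above is, structurally, just a router plus a dispatch table. Everything in this sketch is a placeholder: the keyword heuristic stands in for a small classifier model, and the tier names are illustrative, not real endpoints:

```python
# Cascade sketch: route each query so only hard cases reach the frontier tier.
# FAQ table, keywords, and tier names are all hypothetical placeholders.

FAQ_ANSWERS = {"what are your hours?": "We are open 9-5, Monday to Friday."}

def classify_query(query: str) -> str:
    """Cheap router stand-in; in practice a small model or trained classifier."""
    q = query.lower().strip()
    if q in FAQ_ANSWERS:
        return "faq"
    if any(word in q for word in ("document", "policy", "contract")):
        return "lookup"
    return "complex"

def route(query: str) -> tuple[str, str]:
    """Return (kind, tier) so the expensive path stays rare."""
    kind = classify_query(query)
    if kind == "faq":
        return kind, "precomputed-table"   # no model call at all
    if kind == "lookup":
        return kind, "mid-tier-model"      # RAG plus the workhorse
    return kind, "frontier-model"          # expensive, deliberately rare path
```

The eval work then consists of checking how often `route` sends traffic to each tier, and whether quality actually drops when you push borderline cases down a tier.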
- Let an agent chain call tools and models in a blind loop.
- Handle vague errors by "trying again with a slightly different prompt."

you are spending tokens without tracking them to a user-perceived unit of value. In many systems, a significant fraction of token spend comes from:

- Agent loops that never converge.
- Safety filters that force reruns.
- RAG pipelines that retrieve far too much and often fetch the same content repeatedly.
- Downstream services that reformat or clarify instead of catching the issue earlier.

If you cannot answer "on average, how many tokens do we spend per user task, including retries and hidden calls," you are not doing unit economics, you are guessing.

The fix is mechanical:

- Instrument at the task level. For each "ticket resolved," "document summarized," or "code change accepted," track total tokens consumed across all calls.
- Put hard caps on agent loops and tool calls. If an agent cannot solve a problem within a bounded number of steps, fail fast or escalate.
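A hard cap on agent loops can be a few lines of guard code. In this sketch, `agent_step` is a hypothetical callable representing one tool-or-model step; the bounded-loop shape is what matters, not the names:

```python
# Bounded agent loop: fail fast or escalate instead of spinning forever.
# `agent_step` is a hypothetical callable: step_index -> (done, result).

class BudgetExceeded(Exception):
    """Raised when the agent burns its step budget without converging."""

def run_agent(agent_step, max_steps: int = 8):
    """Run the agent at most `max_steps` times, then stop spending tokens."""
    for step in range(max_steps):
        done, result = agent_step(step)
        if done:
            return result
    raise BudgetExceeded(f"agent did not converge in {max_steps} steps")
```

Catching `BudgetExceeded` is where escalation lives: hand off to a human, a different model, or a clarifying question, instead of another silent retry.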
- Treat retries as a metric to reduce, not a normal behavior. Each retry is a cost and a symptom.

## APIs vs self-hosting: the volume curve

At low volume, the API price is almost irrelevant. Your real cost is engineering time and speed to market. Buying tokens at full retail is fine if it gets you to product fit faster. At very high volume, the equation flips. You pay for every inefficiency. Owning the stack, whether by self-hosting open-weight models or negotiating custom deals, starts to look rational.

The transition between those regimes is not a philosophical moment; it is a spreadsheet moment. You look at:

- Average tokens per task.
- Tasks per user per month.
- Target gross margin.
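That spreadsheet moment fits in a few lines. Every number below is hypothetical and exists only to show the shape of the comparison:

```python
# API-vs-self-host comparison at projected volume. All inputs are hypothetical;
# plug in your own tokens-per-task, traffic, prices, and fixed costs.

def monthly_tokens(tasks_per_user: int, users: int, tokens_per_task: int) -> int:
    return tasks_per_user * users * tokens_per_task

def api_cost(tokens: int, price_per_million: float) -> float:
    """Pure variable cost: pay per token at the quoted rate."""
    return tokens / 1_000_000 * price_per_million

def self_host_cost(tokens: int, fixed_monthly: float,
                   variable_per_million: float) -> float:
    """Fixed infra-and-team cost plus a per-token serving cost."""
    return fixed_monthly + tokens / 1_000_000 * variable_per_million

tokens = monthly_tokens(tasks_per_user=40, users=50_000, tokens_per_task=6_000)
print(f"API:       ${api_cost(tokens, 5.0):,.0f}/mo")
print(f"Self-host: ${self_host_cost(tokens, 120_000, 0.8):,.0f}/mo")
```

At these assumed numbers the API still wins; double the traffic or halve the fixed costs and the answer flips, which is exactly why this is a spreadsheet decision and not a philosophical one.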
- Current and projected traffic.

You compare "API at public pricing" with "API at negotiated pricing" with "self-hosting with fixed and variable costs." You factor in the operational tax of running your own models: infra team, reliability, safety pipeline. At some point, a decision becomes obvious. Until you have those numbers, "we should self-host to save money" is just noise.

## Denominator: cost per what?

Tokens are the numerator. The harder question is the denominator. You cannot run a business on "dollars per million tokens." You run it on "dollars per resolved ticket," "dollars per qualified lead," "dollars per document processed," "dollars per code change merged." That is where unit economics lives.

Two products can have identical LLM bills and very different economics because:

- One product wastes tokens on retries, verbose outputs, and unnecessary RAG.
- The other compresses tasks, designs workflows around the model, and charges for the value created, not the raw compute consumed.

Until you instrument at the level of real business units and tie model spend to those, you are optimizing in the wrong space.

## Pricing risk and vendor dependence

There is one last dimension: pricing is not static. Providers can change:

- Token prices.
- Context-length rules.
- Rate limits and throughput guarantees.
- Terms about data usage and privacy.

If your product has no abstraction between "our business logic" and "this specific model endpoint with this pricing and behavior," you are exposed. A price change becomes a business risk, not just a procurement annoyance.

The mitigation is not magical. It is the same pattern used everywhere else in infrastructure:

- Internal abstraction for "generate," "embed," "classify" that can be wired to different providers or self-hosted models.
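Such an internal abstraction can be a thin interface. The provider classes in this sketch are stubs with hypothetical names, not real client code:

```python
# Thin "generate" abstraction: business logic depends on an interface,
# not on any one vendor. Both providers below are hypothetical stubs.
from typing import Protocol

class TextGenerator(Protocol):
    def generate(self, prompt: str) -> str: ...

class VendorA:
    def generate(self, prompt: str) -> str:
        return f"[vendor-a completion for: {prompt}]"  # stub for a real API call

class SelfHosted:
    def generate(self, prompt: str) -> str:
        return f"[local completion for: {prompt}]"     # stub for your own model

def summarize(doc: str, llm: TextGenerator) -> str:
    """Business code sees only the interface, so a price change on one
    vendor becomes a config change here, not a rewrite."""
    return llm.generate(f"Summarize: {doc}")
```

The same seam is where the monitoring and diversification items below attach: one interface, many interchangeable backends.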
- Monitoring that tells you when a model regression makes your current choice too expensive for the quality it delivers.
- A plan for diversification when you hit certain scale thresholds.

## Conclusion

Inference cost is not a technical curiosity. It is a core part of your unit economics. Treat it that way from day one and most "AI is too expensive" debates vanish. What you have instead are ordinary questions: which model for which workload, how many tokens per unit of value, which mix of providers and hosting strategies. Boring questions, which is exactly how you want your cost structure to feel.



