Token throughput optimization without quality loss

The most durable LLM cost reductions come from understanding where tokens are actually going — not from switching to the cheapest model and hoping quality holds. Token budget optimization is an instrumentation problem first and a prompt engineering problem second. Without per-stage token telemetry, you're optimizing blind.

This article walks through four levers for reducing LLM token spend: context trimming, prompt caching, model-tier routing, and semantic caching. Each requires a different kind of visibility to apply correctly, and each has tradeoffs that teams routinely underestimate.

Understanding Your Token Budget Before Optimizing It

Before touching any code, the first step is understanding the current token distribution: what fraction of your total token spend is input tokens vs. output tokens, and within input tokens, what is the split between system prompt, retrieved context, and user message?

For RAG (retrieval-augmented generation) architectures, retrieved context is often 60–80% of input tokens per request. System prompts typically contribute 10–25%. The actual user message is often under 5% of input tokens. This breakdown matters because the optimization levers are different for each component. You can't compress a user message without risking a change in its meaning. You can aggressively trim retrieved context if you know retrieval precision is high. You can cache a system prompt across requests if it doesn't vary per user.

Spend one sprint instrumenting token counts per pipeline stage before spending any time on prompt compression. Teams that skip this step frequently spend engineering hours optimizing the wrong component — shortening the system prompt by 30% when the retrieved context blob is the actual cost driver.
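As a starting point, per-stage instrumentation can be as small as the sketch below. It is a minimal illustration, not a production telemetry layer: the stage names and the `stage_token_breakdown` helper are hypothetical, and `approx_token_count` is a crude whitespace stand-in that should be replaced with your provider's tokenizer before trusting the numbers.

```python
from typing import Callable, Dict


def approx_token_count(text: str) -> int:
    """Crude stand-in tokenizer (whitespace words). Assumption: replace with
    your provider's tokenizer for real measurements."""
    return len(text.split())


def stage_token_breakdown(
    stages: Dict[str, str],
    count_tokens: Callable[[str], int] = approx_token_count,
) -> Dict[str, float]:
    """Return each pipeline stage's share of total input tokens for one request."""
    counts = {name: count_tokens(text) for name, text in stages.items()}
    total = sum(counts.values())
    if total == 0:
        return {name: 0.0 for name in counts}
    return {name: n / total for name, n in counts.items()}


# Hypothetical request, split into the three input components discussed above.
request = {
    "system_prompt": "You are a helpful assistant. " * 4,
    "retrieved_context": "chunk text " * 35,
    "user_message": "How do I reset my password?",
}

shares = stage_token_breakdown(request)
for stage, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{stage}: {share:.0%}")
```

Aggregating these shares over a week of traffic, rather than a single request, is what tells you which component is the real cost driver.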

Context Trimming: Where Most of the Gain Is

For RAG workloads, retrieved context trimming is consistently the highest-ROI optimization. The mechanism: instead of passing all retrieved chunks verbatim, apply a secondary relevance scoring pass (cross-encoder reranking or simple token-budget-aware truncation) to keep only the N highest-scoring chunks within a target token budget.

A concrete scenario: an internal knowledge assistant at a growing software company. The retrieval pipeline returns the top-8 chunks from a dense vector index, averaging 400 tokens each, for 3,200 input tokens of context per request. Analysis of model attention patterns and answer quality shows that for 70% of queries, the correct answer is fully contained in the top-2 chunks. Trimming to the top-3 chunks (1,200 tokens of context) reduces context input tokens by ~62% (3,200 down to 1,200) with minimal measurable quality degradation on factual Q&A tasks.
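The trimming step in this scenario can be sketched as a budget-aware selection over scored chunks. This is an illustration under assumptions: the `Chunk` type and `trim_to_budget` function are hypothetical names, the relevance scores are assumed to come from whatever reranker you already run, and token counts are assumed to be precomputed.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    text: str
    score: float  # relevance score from a reranker (assumed; higher is better)
    tokens: int   # precomputed token count for this chunk


def trim_to_budget(chunks: List[Chunk], max_chunks: int, token_budget: int) -> List[Chunk]:
    """Keep the highest-scoring chunks that fit within both limits."""
    kept: List[Chunk] = []
    used = 0
    for chunk in sorted(chunks, key=lambda c: c.score, reverse=True):
        if len(kept) == max_chunks:
            break
        if used + chunk.tokens > token_budget:
            continue  # this chunk would overflow; a smaller one may still fit
        kept.append(chunk)
        used += chunk.tokens
    return kept


# The scenario above: 8 retrieved chunks of 400 tokens each, trimmed to the top 3.
retrieved = [Chunk(f"chunk {i}", score=1.0 - 0.1 * i, tokens=400) for i in range(8)]
trimmed = trim_to_budget(retrieved, max_chunks=3, token_budget=1200)
print([c.text for c in trimmed], sum(c.tokens for c in trimmed))
```

Capping on both chunk count and token budget keeps the behavior predictable when chunk sizes vary widely.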

The queries where answer quality does degrade, at most the 30% whose answer is not fully contained in the top chunks, are typically multi-hop reasoning questions that genuinely require broader context. The correct approach isn't to apply the same trim budget uniformly. It's to apply a query classifier that routes complex queries to the full context window and simple factual queries to the trimmed window. Because a share of traffic still pays for the full context, this produces savings of 40–55% of context token spend rather than 62%, but with quality held constant across query types.
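The gap between the uniform-trim figure and the routed figure is simple arithmetic, sketched below. The function name is hypothetical, and the 70% trimmed share is an assumption borrowed from the scenario above; your classifier's actual split will differ.

```python
def blended_context_savings(full_tokens: int, trimmed_tokens: int, trimmed_share: float) -> float:
    """Fraction of context token spend saved when only `trimmed_share`
    of requests get the trimmed window and the rest keep full context."""
    per_request_saving = 1 - trimmed_tokens / full_tokens
    return trimmed_share * per_request_saving


# Uniform trimming: every request drops from 3,200 to 1,200 context tokens.
uniform = blended_context_savings(3200, 1200, trimmed_share=1.0)

# Routed trimming: assume 70% of requests are classified as simple.
routed = blended_context_savings(3200, 1200, trimmed_share=0.7)

print(f"uniform: {uniform:.2%}, routed: {routed:.2%}")
```

With a 70% trimmed share the blended saving lands near 44%, inside the 40–55% range; a classifier that safely routes more traffic to the trimmed window moves it toward the upper end.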

We're not saying aggressive context trimming is always safe. For high-stakes applications (compliance Q&A, legal document review), the cost of a missed relevant passage exceeds the token savings. Know your error tolerance before setting a context budget.

Prompt Caching: The Underused Lever

Several providers offer prompt caching: when the same prefix (typically a system prompt or static context block) appears repeatedly, the KV cache from the first computation is reused for subsequent requests, and the cached tokens are billed at a reduced rate — in some cases 50–90% less than fresh input tokens.

The prerequisite for effective prompt caching is structural: the cacheable prefix must be deterministic, placed at the start of the prompt, and large enough that the cache lookup overhead is worth the savings. Providers typically require a minimum prefix length of 1,024–2,048 tokens for caching to engage.

This means teams that fragment system prompts (building them dynamically with per-request interpolation early in the prompt) lose the caching benefit entirely. A simple restructuring — moving all static instructions to a fixed prefix block, placing per-request context after it — can activate prompt caching without any quality change. For high-volume deployments with consistent system prompts, this alone can reduce effective input token cost by 30–50% on compatible providers.
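A minimal sketch of that restructuring follows. The prompt text, `build_prompt`, and `prefix_fingerprint` are all hypothetical; the point is only that every per-request value (including the session ID) sits after the static block, so the leading characters are identical across requests. Real static prefixes would also need to meet the provider's minimum length, which this short example does not.

```python
import hashlib

# Static instructions only: no timestamps, user names, or session IDs in here.
STATIC_PREFIX = (
    "You are a support assistant for an internal knowledge base.\n"
    "Answer only from the provided context and cite the chunk you used.\n"
)


def build_prompt(context: str, user_message: str, session_id: str) -> str:
    """Static instructions first; everything that varies per request goes after."""
    dynamic_suffix = (
        f"Session: {session_id}\n"
        f"Context:\n{context}\n"
        f"Question: {user_message}\n"
    )
    return STATIC_PREFIX + dynamic_suffix


def prefix_fingerprint(prompt: str, prefix_chars: int = len(STATIC_PREFIX)) -> str:
    """Hash the leading characters so prefix drift is easy to detect in logs."""
    return hashlib.sha256(prompt[:prefix_chars].encode("utf-8")).hexdigest()[:12]


a = build_prompt("Doc A", "How do I rotate keys?", session_id="s-001")
b = build_prompt("Doc B", "What is the refund policy?", session_id="s-002")

# Identical fingerprints mean the cacheable prefix did not change between requests.
print(prefix_fingerprint(a) == prefix_fingerprint(b))
```

Logging the fingerprint alongside each request gives a cheap signal when a deploy accidentally introduces per-request content into the prefix.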

The operational requirement: you need to track whether caching is actually engaging. Providers that support prompt caching generally report cached and uncached input token counts separately in the usage metadata returned with each response. If you're not monitoring that breakdown, you might assume caching is working when it isn't, perhaps because a timestamp or session ID is being interpolated early in the prompt and busting the cache on every request.
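One way to monitor this is to aggregate the cached share of input tokens over a window of requests, as in the sketch below. The record keys `input_tokens` and `cached_input_tokens` are placeholder names, not any specific provider's schema, and the 30% alert threshold is an arbitrary assumption; map your provider's usage fields onto them and pick a threshold from your own baseline.

```python
from typing import Iterable, Mapping


def cached_token_ratio(usage_records: Iterable[Mapping[str, int]]) -> float:
    """Share of input tokens served from the prompt cache across a set of requests."""
    total = 0
    cached = 0
    for record in usage_records:
        total += record["input_tokens"]
        cached += record.get("cached_input_tokens", 0)
    return cached / total if total else 0.0


# Hypothetical usage records: the first request warms the cache, later ones hit it.
records = [
    {"input_tokens": 2000, "cached_input_tokens": 0},
    {"input_tokens": 2000, "cached_input_tokens": 1500},
    {"input_tokens": 2000, "cached_input_tokens": 1500},
]

ratio = cached_token_ratio(records)
ALERT_BELOW = 0.3  # assumed threshold, tune against your own traffic
if ratio < ALERT_BELOW:
    print(f"warning: cached share is only {ratio:.0%}")
else:
    print(f"cached share: {ratio:.0%}")
```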

Model-Tier Routing: Matching Complexity to Capability

Not every request needs your highest-capability model. Classification tasks, simple extraction, short factual questions, and format-normalizing tasks are categories where a smaller, faster model produces output that is, for most use cases, indistinguishable from a larger model's, at 5–15× lower per-token cost.

The challenge with model-tier routing is calibration: which requests are truly simple? Overconfident routing (classifying too many requests as "simple") degrades quality for edge cases. Underconfident routing (classifying too conservatively) leaves cost savings unrealized.

The correct instrumentation approach is empirical: run a shadow routing experiment where every request goes to both tiers simultaneously, and measure output quality divergence (human eval, embedding distance, downstream task success rate) across the request distribution. This gives you an empirically derived confidence threshold for routing, rather than one set by intuition.
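Once shadow results are collected, deriving the threshold can be sketched as follows. This assumes each shadow record pairs the router's confidence that a request is simple with a boolean saying whether the small model's output was judged equivalent to the large model's; the function name, the candidate thresholds, and the sample data are all hypothetical.

```python
from typing import List, Optional, Tuple


def pick_routing_threshold(
    shadow_results: List[Tuple[float, bool]],
    target_agreement: float,
    candidates: List[float],
) -> Optional[float]:
    """Return the lowest candidate threshold at which the requests that would be
    routed to the small model meet the agreement target, or None if none does."""
    for threshold in sorted(candidates):
        routed = [ok for confidence, ok in shadow_results if confidence >= threshold]
        if routed and sum(routed) / len(routed) >= target_agreement:
            return threshold
    return None


# (router confidence that the request is simple, small model matched large model)
shadow = [
    (0.95, True), (0.90, True), (0.85, True), (0.80, False),
    (0.70, True), (0.60, False), (0.55, False), (0.50, True),
]

threshold = pick_routing_threshold(shadow, target_agreement=0.95,
                                   candidates=[0.5, 0.6, 0.7, 0.8, 0.9])
print(threshold)
```

Choosing the lowest passing threshold maximizes routed traffic for a given quality target. With real data you would want far more samples per bucket than this toy set before trusting the result.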

For typical workloads with a mix of simple and complex queries, effective model-tier routing yields 25–45% cost reduction on the routed traffic, depending on the complexity distribution. The initial measurement investment is a few days of shadow traffic; the savings are ongoing.

Semantic Caching: Valid for Specific Use Cases, Often Overhyped

Semantic caching stores LLM responses and returns cached answers for semantically similar future queries, avoiding a new model call entirely. The appeal is obvious: in FAQ-style applications where many users ask essentially the same question, cache hit rates of 40–70% are plausible.

The reality is narrower than the pitch. Semantic caching works well for: closed-domain FAQ assistants with stable knowledge, query autocomplete, and batch classification tasks with high query repetition. It works poorly for: open-ended conversation, personalized responses, time-sensitive information, and any domain where freshness matters.

Semantic caching fails in two subtle ways. The first is a false match: a user gets a cached response to a semantically similar question that has a slightly different correct answer. The second is staleness: the cached answer was right when stored but the underlying source has since changed. Both are more damaging in a general knowledge assistant than they look, because users interpret LLM responses as authoritative. Cache invalidation strategy (TTL, change detection on source documents) is a non-trivial engineering problem that teams routinely underestimate when evaluating semantic caching ROI.
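The two controls that matter most, a similarity floor and a TTL, can be seen in a minimal in-memory sketch. This is illustrative only: `SemanticCache` is a hypothetical class, embeddings are assumed to be supplied by the caller from whatever embedding model you use, the linear scan would not scale past small caches, and change detection on source documents is not covered.

```python
import math
import time
from typing import List, Optional, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, min_similarity: float, ttl_seconds: float) -> None:
        self.min_similarity = min_similarity
        self.ttl_seconds = ttl_seconds
        self._entries: List[Tuple[List[float], str, float]] = []

    def put(self, embedding: List[float], response: str, now: Optional[float] = None) -> None:
        stored_at = time.time() if now is None else now
        self._entries.append((embedding, response, stored_at))

    def get(self, embedding: List[float], now: Optional[float] = None) -> Optional[str]:
        current = time.time() if now is None else now
        # Drop expired entries first so stale answers can never be returned.
        self._entries = [e for e in self._entries if current - e[2] <= self.ttl_seconds]
        best, best_sim = None, self.min_similarity
        for stored_embedding, response, _ in self._entries:
            sim = cosine(embedding, stored_embedding)
            if sim >= best_sim:
                best, best_sim = response, sim
        return best


cache = SemanticCache(min_similarity=0.95, ttl_seconds=3600)
cache.put([1.0, 0.0], "Reset it from the account settings page.", now=0)

near_hit = cache.get([0.99, 0.05], now=10)   # very similar query, within TTL
miss = cache.get([0.0, 1.0], now=10)         # unrelated query
expired = cache.get([1.0, 0.0], now=4000)    # identical query, past TTL
print(near_hit, miss, expired)
```

A high similarity floor trades hit rate for fewer false matches; the right value depends on how costly a slightly wrong answer is in your domain.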

Putting It Together: A Measurement-First Workflow

The most expensive mistake in token optimization is applying levers before measuring the baseline. Teams optimize the system prompt (easy, low-risk) while ignoring retrieved context (harder, high-ROI). They implement semantic caching without instrumenting cache hit rates. They set a model-tier routing threshold based on assumptions about query complexity without validating against real traffic.

A measurement-first workflow looks like: (1) instrument token counts per pipeline stage for one week; (2) identify the highest-cost component; (3) instrument quality alongside cost for that component; (4) apply the appropriate lever with monitoring in place; (5) validate quality held before moving to the next component.
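The validation gate in step (5) can be expressed as a small check that a change must pass before it ships. This is a sketch under assumptions: the `Measurement` type, the single scalar quality score, and the 0.02 tolerance are all hypothetical, and real gates usually compare distributions rather than two averages.

```python
from dataclasses import dataclass


@dataclass
class Measurement:
    tokens_per_request: float
    quality_score: float  # any metric in [0, 1]; higher is better (assumed)


def passes_gate(baseline: Measurement, candidate: Measurement, max_quality_drop: float) -> bool:
    """A change passes only if it saves tokens and quality stays within tolerance."""
    saves_tokens = candidate.tokens_per_request < baseline.tokens_per_request
    quality_held = baseline.quality_score - candidate.quality_score <= max_quality_drop
    return saves_tokens and quality_held


baseline = Measurement(tokens_per_request=4000, quality_score=0.92)
trimmed = Measurement(tokens_per_request=2400, quality_score=0.91)
too_aggressive = Measurement(tokens_per_request=1800, quality_score=0.85)

print(passes_gate(baseline, trimmed, max_quality_drop=0.02))
print(passes_gate(baseline, too_aggressive, max_quality_drop=0.02))
```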

The teams that see 40–60% sustained token cost reductions aren't applying multiple optimizations simultaneously. They're iterating through components in ROI order, with measurement gates between each step. That's slower than the "do everything at once" approach, but it's the only way to know which changes actually caused what savings — and which changes introduced quality regressions you should roll back.