The anatomy of an LLM cost spike

LLM API costs don't usually spiral slowly. They spike — a 4× jump in a single billing period that shows up as a line item you weren't expecting, for a model you thought was running quietly in the background. By the time the invoice arrives, the window for root-cause analysis has closed and the spend is already sunk.

What follows is a breakdown of the six patterns we see most often when teams lose cost visibility on LLM workloads. Each one has a distinct signature in per-request telemetry. None of them are obvious until you're watching the right metrics.

Pattern 1: Context Window Bloat from Conversation History

Multi-turn chat applications accumulate context by appending the full conversation history to every subsequent request. The input token count for turn N is roughly the sum of all previous turns plus the new message, so per-request cost grows linearly with the turn number and the cumulative cost of a session grows quadratically with its length.

Consider a customer support application where average session length creeps from 8 turns to 14 turns after a product launch. On a model billed per token, that 75% increase in session length produces somewhere between 2× and 3× the input token cost per session, depending on message verbosity and on how much fixed prompt overhead each request carries. Most teams look at request count, not tokens per request. Session count stays flat and request count rises only in line with turns, so neither metric hints at a near-3× jump. The billing cycle ends and the team is staring at a number that doesn't make sense.
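A rough sketch of that arithmetic, assuming each turn adds the same number of tokens and every request resends the full history. The function name, the 150-token turn size, and the 500-token system prompt are illustrative assumptions, not measurements.

```python
def session_input_tokens(turns: int, tokens_per_turn: int, system_tokens: int = 0) -> int:
    """Total input tokens billed across one session when every request
    resends the system prompt plus the full history so far."""
    total = 0
    for n in range(1, turns + 1):
        # Request n carries n turns' worth of messages (history + new message).
        total += system_tokens + n * tokens_per_turn
    return total


# Assumed values: 150 tokens added per turn, optional 500-token system prompt.
short = session_input_tokens(8, tokens_per_turn=150)
long = session_input_tokens(14, tokens_per_turn=150)
ratio_no_prompt = long / short

short_sp = session_input_tokens(8, tokens_per_turn=150, system_tokens=500)
long_sp = session_input_tokens(14, tokens_per_turn=150, system_tokens=500)
ratio_with_prompt = long_sp / short_sp
```

Under these assumptions the 14-turn session costs about 2.9× the 8-turn session with no system prompt, and about 2.4× with a 500-token system prompt resent on every request. Fixed per-request overhead pulls the ratio down toward the 1.75× growth in request count.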

The signature in telemetry: input token count per request drifting upward over days, not hours. It's a slow ramp, which is why a daily P95 token-count alert catches it; a per-minute alert doesn't fire until it's already expensive.
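One way to express that daily check, as a minimal sketch: compare today's P95 input token count against the average P95 of a trailing baseline. The function names, the 7-day baseline, and the 1.25 threshold are assumptions to tune, not recommendations.

```python
from statistics import mean, quantiles


def p95(values):
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
    return quantiles(values, n=20)[-1]


def p95_drift_alert(days, baseline_days=7, threshold=1.25):
    """days: per-day lists of input token counts, oldest first; the last
    entry is today. Returns True when today's P95 exceeds the mean of the
    baseline days' P95 values by the given factor."""
    if len(days) < baseline_days + 1:
        return False  # not enough history to form a baseline
    baseline = mean(p95(d) for d in days[-(baseline_days + 1):-1])
    return p95(days[-1]) > threshold * baseline
```

A trailing baseline will absorb a sufficiently slow ramp, so it may be worth also comparing against a fixed reference week.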

Pattern 2: Retry Amplification Under Rate Limits

When a provider returns a 429 Too Many Requests or a transient 5xx, retry logic kicks in. If that retry logic doesn't implement exponential backoff with jitter correctly — or if a downstream queue feeds requests faster than the backoff window drains them — you can end up in a retry storm: every failed request generates 3-5 additional requests, each of which may also fail and retry.

We're not saying exponential backoff is a novel idea. Every LLM client library documents it. The issue is that most implementations set max_retries=3 globally but don't instrument how often retries actually fire in production. A traffic spike on a Tuesday afternoon can push a deployment into a retry regime for 20 minutes. Those 20 minutes of retried requests cost 2-4× normal, and they don't show up as an error in application logs — they show up as increased latency and increased spend.
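As an illustration of instrumenting retries rather than only configuring them, here is a minimal sketch of capped exponential backoff with full jitter that counts how often retries fire. `TransientError`, `call_with_backoff`, and `retry_counts` are hypothetical names; a real client library raises its own exception types and may already retry internally.

```python
import random
import time
from collections import Counter

# Hypothetical in-process counters; in practice, export these to a metrics system.
retry_counts = Counter()


class TransientError(Exception):
    """Stand-in for a 429 or transient 5xx response from the provider."""


def call_with_backoff(send, max_retries=3, base_delay=0.5, max_delay=8.0, sleep=time.sleep):
    """Call send(), retrying on TransientError with capped exponential
    backoff and full jitter. Every retry is counted."""
    retry_counts["logical_requests"] += 1
    for attempt in range(max_retries + 1):
        try:
            return send()
        except TransientError:
            if attempt == max_retries:
                retry_counts["exhausted"] += 1
                raise
            retry_counts["retries"] += 1
            # Full jitter: wait a random time up to the capped exponential delay.
            sleep(random.uniform(0, min(max_delay, base_delay * 2 ** attempt)))
```

The number to watch is `retries` divided by `logical_requests` over time. A sustained rise suggests the deployment has entered a retry regime even if no errors reach the application logs.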

Detect this by tracking request count vs. unique session count separately. A diverging ratio (more requests per session than usual) is the early signal.
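The ratio itself is cheap to compute. A minimal sketch, assuming each request is logged as a (session_id, request_id) pair; the function names and the 1.5 tolerance are illustrative:

```python
def requests_per_session(events):
    """events: iterable of (session_id, request_id) pairs seen in one time window."""
    request_count = 0
    sessions = set()
    for session_id, _request_id in events:
        request_count += 1
        sessions.add(session_id)
    return request_count / len(sessions) if sessions else 0.0


def ratio_diverged(current, baseline, tolerance=1.5):
    """True when the current ratio exceeds the baseline ratio by the tolerance factor."""
    return baseline > 0 and current > tolerance * baseline
```

Longer conversations also raise this ratio, so it is most useful when read next to a retry counter.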

Pattern 3: Silent Model Tier Escalation

Many LLM deployments use model routing: fast cheap model for most requests, capable expensive model for edge cases. The routing logic typically lives in application code — a conditional that says "if classification confidence is below threshold, escalate to GPT-4o" or similar.

The problem: classification thresholds are often set once and never revisited. A change in input distribution — new user cohort, new feature that generates unusual queries, a prompt change that shifts confidence scores — can push the escalation rate from 5% to 40% without any engineer being paged. There's no infrastructure failure. The expensive model is doing exactly what the code told it to do.

Per-model cost breakdown is the only way to see this. A single "LLM API total" metric is blind to tier composition. You need to see what fraction of spend is hitting each model tier, broken out by time, so you catch a distribution shift within hours rather than at month-end.
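A minimal sketch of that breakdown, assuming each request is logged with an hour bucket, a model name, and a computed cost. The field names `hour`, `model`, and `cost_usd` are hypothetical.

```python
from collections import defaultdict


def spend_share_by_model(records):
    """records: iterable of dicts with 'hour', 'model', and 'cost_usd' keys.
    Returns {hour: {model: fraction of that hour's spend}}."""
    totals = defaultdict(lambda: defaultdict(float))
    for r in records:
        totals[r["hour"]][r["model"]] += r["cost_usd"]

    shares = {}
    for hour, by_model in totals.items():
        hour_total = sum(by_model.values())
        # Guard against an hour with zero recorded spend.
        shares[hour] = (
            {model: cost / hour_total for model, cost in by_model.items()}
            if hour_total
            else {}
        )
    return shares
```

Plotting the expensive tier's share hour by hour makes an escalation-rate shift visible as a step change, even when total request volume is unchanged.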

Pattern 4: Prompt Template Bloat After Feature Additions

System prompts get longer over time. A team adds guardrails after an incident, then adds persona instructions for a new use case, then adds structured output requirements, then adds examples for few-shot guidance. Nobody deletes old instructions when adding new ones — the prompt grows by accretion.

Take a realistic scenario: a knowledge-retrieval assistant at a growing software company. Initial system prompt is 200 tokens. After six months of iteration, it's 1,800 tokens, a 9× expansion. Every request now pays 1,600 tokens of added overhead before a single word of user input is processed. At 50,000 requests per day, that is 80 million extra input tokens per day; on a model billing at $X per 1M input tokens, it comes to 80 × $X per day, entirely from instructions that were never audited for redundancy.

Tracking median input tokens at the prompt-template level (not per-request) over time is the correct instrumentation. A slow, monotonic increase in baseline input tokens usually means prompt bloat, not user behavior change.
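A sketch of that instrumentation, assuming each request record carries a template identifier and a day bucket. The field names `template_id`, `day`, and `input_tokens` are hypothetical.

```python
from collections import defaultdict
from statistics import median


def median_input_tokens_by_template(records):
    """records: iterable of dicts with 'template_id', 'day', and 'input_tokens'.
    Returns {(template_id, day): median input tokens}."""
    buckets = defaultdict(list)
    for r in records:
        buckets[(r["template_id"], r["day"])].append(r["input_tokens"])
    return {key: median(values) for key, values in buckets.items()}
```

Grouping by template before taking the median keeps a bloated prompt from being averaged away by traffic on other, leaner templates.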

Pattern 5: Uncapped max_tokens on Summarization Pipelines

Summarization tasks are often written with max_tokens left at the model default or set to a large ceiling "just in case the document is long." Depending on the provider, the default ceiling is several thousand tokens or whatever remains of the context window, far more than a summary needs. max_tokens is a cap rather than a target, but if the model decides to be thorough, nothing stops it short of that cap.

This particularly shows up in document-processing pipelines that run asynchronously. Nobody is reading the output in real time, so unexpectedly verbose responses go unnoticed. The cost signal is an elevated output token average that only becomes visible when you look at median output tokens per pipeline stage.

The fix isn't just capping max_tokens — it's instrumenting the actual output length distribution so you know what the right cap is. Setting max_tokens=4096 on a summarization step that 95% of the time uses under 400 output tokens is leaving money on the table even when nothing is technically wrong.
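One way to derive a cap from observed data, as a minimal sketch: take a high percentile of recorded output lengths and add headroom. The function name, the 99th-percentile default, and the 1.25 headroom factor are assumptions to adjust per pipeline.

```python
import math


def suggest_max_tokens(output_lengths, percentile=99, headroom=1.25):
    """Suggest a max_tokens cap covering `percentile` percent of observed
    output lengths, with multiplicative headroom on top."""
    ordered = sorted(output_lengths)
    if not ordered:
        raise ValueError("need at least one observed output length")
    # Nearest-rank percentile: the smallest observed value with at least
    # `percentile` percent of samples at or below it.
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    return math.ceil(ordered[rank - 1] * headroom)
```

Outputs longer than the cap will be truncated, so it is worth tracking how often responses stop at the limit after tightening it (on OpenAI-compatible APIs, a finish_reason of "length").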

Pattern 6: Streaming Requests That Don't Stream

When you set stream=True on an OpenAI-compatible endpoint, you expect token-by-token delivery: faster time-to-first-token, interruptible requests. But there are several conditions under which streaming silently degrades to buffered delivery at the proxy or SDK layer — a misconfigured gateway, a caching layer that buffers the stream, a network intermediary that re-assembles chunks.

When streaming doesn't stream, you lose the ability to abort expensive requests early. A user who closes a chat window after 2 seconds still pays for the full completion if your abort-on-disconnect logic depends on streaming chunks to trigger. For long completions (story generation, code synthesis tasks), if 10% of requests are abandoned by the user but still run to completion instead of being cut off, that is meaningful unrecovered spend.

We're not saying streaming is always the right choice — buffered mode simplifies retries and is correct for async pipelines. The point is that your cost model probably assumes one behavior, and if the other is happening, you don't know unless you're measuring time-to-first-token separately from time-to-last-token.
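A minimal sketch of that measurement, assuming the response is exposed as an iterable of chunks. `measure_stream` and the `ttft_ratio` field are hypothetical names; the `clock` parameter exists so the timing source can be swapped out.

```python
import time


def measure_stream(chunks, clock=time.monotonic):
    """Consume an iterable of streamed chunks, recording when the first
    and last chunks arrive relative to the start of consumption."""
    start = clock()
    first = last = None
    count = 0
    for _chunk in chunks:
        now = clock()
        if first is None:
            first = now
        last = now
        count += 1
    if first is None:
        return None  # no chunks were delivered

    ttft = first - start
    ttlt = last - start
    return {
        "ttft": ttft,
        "ttlt": ttlt,
        "chunks": count,
        # Near 1.0 with many chunks suggests the stream was buffered upstream.
        "ttft_ratio": ttft / ttlt if ttlt > 0 else 1.0,
    }
```

A genuinely streamed long completion should show a time-to-first-token that is a small fraction of time-to-last-token. When the two are nearly equal across many multi-chunk responses, something between the provider and the client is likely reassembling the stream.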

What Useful Cost Instrumentation Looks Like

The common thread across all six patterns: none of them are visible in a "total monthly API spend" number. They require granular, per-request telemetry that is actually surfaced to the people who can act on it — not buried in a billing dashboard that engineers check quarterly.

Specifically: input token count per request (at P50 and P95), output token count per request, requests per session, per-model breakdown of total spend, prompt template identifier attached to each request, and time-to-first-token alongside time-to-last-token. These aren't exotic metrics. They're what every team needs to operate LLM workloads with the same discipline applied to database query costs or CDN egress, categories where most engineering teams already have working intuitions about anomalies.
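As a sketch of what one per-request record might carry, assuming the fields listed above are captured at call time. The class name, field names, and the per-million-token price parameters are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LLMRequestRecord:
    request_id: str
    session_id: str
    model: str
    prompt_template_id: str
    input_tokens: int
    output_tokens: int
    ttft_seconds: float
    ttlt_seconds: float
    retry_attempt: int = 0  # 0 for the first attempt, 1+ for retries

    def cost_usd(self, input_price_per_m: float, output_price_per_m: float) -> float:
        """Cost of this request given prices per 1M input and output tokens."""
        return (
            self.input_tokens * input_price_per_m
            + self.output_tokens * output_price_per_m
        ) / 1_000_000
```

Keeping session, model, and template identifiers on the same record is what allows the percentile, per-session, and per-model views to be derived from a single stream of events.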

The expensive part isn't collecting the data. It's having it available in a form where a P95 token-count regression at 2pm on a Wednesday triggers an alert instead of an invoice surprise six weeks later. That's the gap worth closing.