P99 latency: OpenAI vs Claude vs Gemini

P50 latency is the number you show in marketing materials. P99 is the number that determines whether your LLM-powered feature is usable for the users who hit it when things are slow. In production, those two numbers can be separated by an order of magnitude — and which provider narrows that gap most consistently matters enormously for user-facing applications.

This post describes what we observe across production LLM workloads instrumented through Framewren, covering OpenAI (GPT-4o and GPT-3.5-turbo-equivalent tiers), Anthropic Claude (Sonnet and Haiku class), and Gemini (1.5 Pro and Flash). The goal isn't a headline ranking. Latency varies by region, time of day, request characteristics, and rate limit tier. The goal is to explain what the distributions actually look like so you can make better architectural decisions.

Why P99 Behaves Differently Than P50

For most API services, P99 is P50 multiplied by some factor — perhaps 2–4×. Queuing theory tells you that tail latency grows with utilization, and well-engineered backends keep that multiplier bounded.

LLMs violate this expectation in a specific way: P99 is often driven not by infrastructure queuing but by output token variance. A request that generates a 40-token response completes in a fraction of a second. A request that generates a 2,000-token response can take tens of seconds, even if the model's tokens-per-second throughput is consistent. If your application doesn't tightly bound max_tokens, you're allowing output length to become a latency variable that the P99 tail will explore aggressively.
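A rough model makes the effect concrete. The sketch below assumes total latency is TTFT plus output tokens divided by a constant decode rate. The 0.3 s TTFT, the 60 tokens/sec rate, and the output-length mix are illustrative assumptions, not measurements from any provider.

```python
import math


def completion_seconds(output_tokens, ttft_s=0.3, tokens_per_s=60.0):
    """Estimated total latency, assuming a constant decode rate."""
    return ttft_s + output_tokens / tokens_per_s


def percentile(values, p):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical output-length mix: mostly short answers, a few very long ones.
output_lengths = [40] * 90 + [400] * 8 + [2000] * 2

uncapped = [completion_seconds(n) for n in output_lengths]
# Same traffic with output truncated at 512 tokens, as max_tokens=512 would do.
capped = [completion_seconds(min(n, 512)) for n in output_lengths]

for name, latencies in (("uncapped", uncapped), ("capped", capped)):
    p50, p99 = percentile(latencies, 50), percentile(latencies, 99)
    print(f"{name}: P50={p50:.2f}s P99={p99:.2f}s ratio={p99 / p50:.1f}x")
```

Under these assumed numbers the median is identical in both runs. Only the tail moves, which is why a max_tokens bound is a tail-latency control rather than a median-latency one.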

This means provider P99 comparisons are only meaningful when controlling for output token distribution. A provider that looks slow at P99 might simply be generating longer outputs for the same prompts — which could be a quality win, not a performance loss.
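One way to control for output length is to compare P99 within buckets of output token count instead of across all requests. The sketch below assumes each request record carries a provider name, an output token count, and a latency. The field names and bucket edges are hypothetical and should be adapted to your own instrumentation.

```python
import math
from collections import defaultdict
from typing import NamedTuple


class Sample(NamedTuple):
    provider: str
    output_tokens: int
    latency_s: float


BUCKET_EDGES = (64, 256, 512)  # hypothetical upper bounds, in output tokens


def bucket_label(output_tokens):
    """Map an output token count to a coarse length bucket."""
    for edge in BUCKET_EDGES:
        if output_tokens <= edge:
            return f"<={edge}"
    return f">{BUCKET_EDGES[-1]}"


def percentile(values, p):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def p99_by_bucket(samples):
    """Return {(provider, bucket): P99 latency} so like is compared with like."""
    grouped = defaultdict(list)
    for s in samples:
        grouped[(s.provider, bucket_label(s.output_tokens))].append(s.latency_s)
    return {key: percentile(vals, 99) for key, vals in grouped.items()}
```

Comparing two providers within the same bucket separates "this provider is slower" from "this provider writes longer answers."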

What We See in Completion Latency Distributions

Across instrumented production workloads using prompt-length-controlled requests (system prompt + user message totaling 300–500 input tokens, max_tokens=512), the latency distributions break into two distinct classes: tight-tail providers and wide-tail providers.

Tight-tail behavior means P99/P50 ratios in the 2.5–4× range. Wide-tail behavior means ratios of 6–12×. In practical terms: if P50 is 1.2 seconds, a tight-tail provider delivers P99 around 3–5 seconds, while a wide-tail provider delivers P99 at roughly 7–14 seconds on the same workload.
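Converting a ratio band into absolute P99 numbers is plain multiplication, and the reverse direction gives a quick way to label your own measurements. In the sketch below, the "intermediate" label for ratios between 4× and 6× is our assumption, since the two bands above leave that range unnamed.

```python
def p99_range(p50_s, low_ratio, high_ratio):
    """Absolute P99 bounds implied by a P50 and a P99/P50 ratio band."""
    return (p50_s * low_ratio, p50_s * high_ratio)


def tail_class(p50_s, p99_s):
    """Label a measured P50/P99 pair using the ratio bands described above."""
    ratio = p99_s / p50_s
    if ratio <= 4.0:
        return "tight"
    if ratio >= 6.0:
        return "wide"
    return "intermediate"  # assumption: the 4-6x gap has no name in the text


tight_low, tight_high = p99_range(1.2, 2.5, 4.0)
wide_low, wide_high = p99_range(1.2, 6.0, 12.0)
print(f"tight tail: {tight_low:.1f}s to {tight_high:.1f}s")
print(f"wide tail:  {wide_low:.1f}s to {wide_high:.1f}s")
```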

What drives the wide tail? Several compounding factors: provider-side model serving queue depth spikes (which users have no visibility into), geographic routing jitter when requests cross availability zones, and cold-start penalties on provider infrastructure that scales down during off-peak windows and must ramp back up during morning traffic surges.

Flash/Haiku-class models (the faster, cheaper tier from each provider) consistently show tighter P99/P50 ratios than their Pro/GPT-4 equivalents. This makes sense architecturally — smaller models have lower per-request compute variance and are served on infrastructure designed for high-throughput, low-latency workloads. The cost-latency tradeoff isn't just about price; it's also about tail latency predictability.

The Time-of-Day Effect Is Real and Uncontrolled

Latency benchmarks that aggregate across all hours obscure one of the most significant production variables: LLM provider latency degrades during US business hours, particularly 9am–3pm Pacific, when the largest concentration of API customers is active simultaneously.

We see P99 latency on heavy-usage workloads run 40–80% higher during peak hours than in off-peak windows (overnight US, or weekend mornings). For applications where you control request timing (batch processing, background enrichment, scheduled report generation), shifting work to off-peak windows is often a more reliable latency optimization than switching providers. Providers that appear to have better P99 performance in aggregate benchmarks may simply have more off-peak traffic in their samples.

If your application is user-facing and synchronous, you can't shift to off-peak. But you should be measuring latency with time-of-day segmentation, not just a rolling 24-hour window, or your SLO analysis will be systematically optimistic.
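A minimal sketch of time-of-day segmentation, assuming request records are (timezone-aware timestamp, latency) pairs. The fixed UTC-8 offset and the weekday-only definition of "peak" are simplifying assumptions. The 9am–3pm window comes from the peak period described above.

```python
import math
from collections import defaultdict
from datetime import datetime, timedelta, timezone

# Fixed UTC-8 offset keeps the sketch dependency-free but ignores daylight saving.
# zoneinfo.ZoneInfo("America/Los_Angeles") handles DST where tz data is installed.
PACIFIC = timezone(timedelta(hours=-8))
PEAK_START_HOUR, PEAK_END_HOUR = 9, 15  # 9am to 3pm Pacific


def percentile(values, p):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def segment(ts):
    """Classify a timezone-aware timestamp as 'peak' or 'off_peak'."""
    local = ts.astimezone(PACIFIC)
    is_weekday = local.weekday() < 5  # assumption: weekends count as off-peak
    if is_weekday and PEAK_START_HOUR <= local.hour < PEAK_END_HOUR:
        return "peak"
    return "off_peak"


def p99_by_segment(records):
    """records: iterable of (timezone-aware datetime, latency in seconds)."""
    grouped = defaultdict(list)
    for ts, latency_s in records:
        grouped[segment(ts)].append(latency_s)
    return {name: percentile(vals, 99) for name, vals in grouped.items()}
```

Evaluating an SLO against the "peak" figure rather than the blended one is what removes the optimistic bias of a rolling 24-hour window.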

Time-to-First-Token vs. Time-to-Last-Token

For streaming applications, the metric that matters for perceived responsiveness is time-to-first-token (TTFT), not total completion time. A provider that returns the first token in 300ms and then streams at 30 tokens/sec feels fast, even if total time-to-last-token is 6 seconds for a long response. A provider with TTFT of 2 seconds feels slow even if it's faster overall.

TTFT is not always more important than total latency: for non-streaming use cases (document processing, classification), only total time matters. But for chat interfaces and copilot-style UX, TTFT dominates perceived responsiveness, and conflating it with total completion time gives you a misleading picture of how users experience latency.

In practice, TTFT and total completion time are not perfectly correlated across providers. Some providers have very fast TTFT (the model starts generating tokens quickly) but lower streaming token throughput (tokens arrive slowly after the first). Others buffer briefly before streaming but then deliver at high throughput. Measuring both separately, and understanding which matters for your use case, is the correct instrumentation strategy.
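Measuring the two separately can be done with a thin wrapper around whatever streaming iterator your client library exposes. The sketch below is provider-agnostic and assumes only that the stream is a plain Python iterable of chunks. The injectable clock parameter is our addition so the function can be tested deterministically. Chunks are not necessarily tokens, so treat the rate as chunks per second unless you count tokens yourself.

```python
import time
from typing import Iterable, NamedTuple, Optional


class StreamTiming(NamedTuple):
    ttft_s: Optional[float]          # time to first chunk, None if stream was empty
    total_s: float                   # time to last chunk
    chunks: int
    chunks_per_s: Optional[float]    # streaming rate after the first chunk


def time_stream(chunks: Iterable, clock=time.perf_counter) -> StreamTiming:
    """Consume a stream and record TTFT, total time, and post-first-chunk rate.

    Call this with a lazy iterator so the clock starts before the request is sent.
    """
    start = clock()
    first_at = last_at = None
    count = 0
    for _ in chunks:
        last_at = clock()
        if first_at is None:
            first_at = last_at
        count += 1
    if first_at is None:
        return StreamTiming(None, clock() - start, 0, None)
    decode_window = last_at - first_at
    rate = (count - 1) / decode_window if decode_window > 0 else None
    return StreamTiming(first_at - start, last_at - start, count, rate)
```

Recording ttft_s and chunks_per_s as separate fields on each request is what lets you tell a fast-start, slow-stream provider apart from a slow-start, fast-stream one.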

Latency Regression Detection: The P99 Alert Problem

Model version updates from providers can introduce latency regressions without notice. This has happened with documented model version changes across multiple providers, where a "minor update" changes the model's token generation characteristics enough to shift P95/P99 by 20–40% on certain prompt shapes.

Consider a chat application where the system prompt includes structured output requirements (JSON schema, specific field ordering). A model update that changes how the model processes schema-constrained outputs can materially slow completion time for that specific prompt structure while leaving general P50 unaffected. If your alerting watches aggregate P50, you miss it.

Effective latency regression detection requires three things: a baseline P99 per model version, alerts on P95/P99 divergence rather than on P50, and correlation between latency shifts and the model version identifier returned with each response. Most providers report model version information in the response, typically in the response body rather than in a header. Tag it on every request record rather than treating it as static.
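The comparison can be sketched as follows, assuming request records are (model_version, latency) pairs. The 20% default threshold is an assumption taken from the low end of the shifts mentioned above, and the version strings in the usage lines are hypothetical.

```python
import math
from collections import defaultdict


def percentile(values, p):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def flag_regressions(records, baseline_version, threshold=0.20):
    """Return {version: fractional P99 increase} for versions above the threshold.

    records: iterable of (model_version, latency_seconds).
    """
    grouped = defaultdict(list)
    for version, latency_s in records:
        grouped[version].append(latency_s)
    if baseline_version not in grouped:
        raise ValueError(f"no samples for baseline version {baseline_version!r}")

    baseline_p99 = percentile(grouped[baseline_version], 99)
    flagged = {}
    for version, latencies in grouped.items():
        if version == baseline_version:
            continue
        shift = percentile(latencies, 99) / baseline_p99 - 1.0
        if shift > threshold:
            flagged[version] = shift
    return flagged


# Hypothetical version identifiers and latencies.
records = (
    [("model-v1", 1.0)] * 98 + [("model-v1", 4.0)] * 2
    + [("model-v2", 1.0)] * 98 + [("model-v2", 5.5)] * 2
)
print(flag_regressions(records, baseline_version="model-v1"))
```

In this made-up data both versions have the same median, so a P50 alert would stay silent while the P99 comparison flags the newer version. Running the same check per prompt shape would narrow it to the affected requests.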

Implications for Provider Selection and Routing

The practical takeaway from production latency data isn't "provider X is better than provider Y." Provider performance is workload-specific, region-specific, and tier-specific enough that blanket comparisons mislead more than they inform.

What the data does support: multi-provider routing with real-time latency awareness is the correct architecture for latency-sensitive applications. Routing 100% of traffic to a single provider means you absorb 100% of that provider's bad P99 days. With fallback routing, where requests that exceed a TTFT threshold are retried against a secondary provider, you can shave significant width from your tail without changing your primary provider at all.
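A minimal asyncio sketch of TTFT-based fallback, assuming each provider is wrapped as an async generator function that yields tokens. The two provider functions here are hypothetical stand-ins, not real client calls, and the sketch deliberately puts no budget on the secondary provider.

```python
import asyncio


async def stream_with_fallback(primary, secondary, prompt, ttft_budget_s=1.0):
    """Yield tokens from primary unless its first token misses the TTFT budget."""
    stream = primary(prompt)
    try:
        # Only the first token is subject to the budget; later tokens are not.
        first = await asyncio.wait_for(stream.__anext__(), timeout=ttft_budget_s)
    except asyncio.TimeoutError:
        await stream.aclose()  # abandon the slow primary stream
        stream = secondary(prompt)
        first = await stream.__anext__()  # no budget on the fallback in this sketch
    yield first
    async for token in stream:
        yield token


# Hypothetical stand-ins for real provider clients.
async def slow_provider(prompt):
    await asyncio.sleep(0.5)  # simulates a bad TTFT
    yield "slow"


async def fast_provider(prompt):
    yield "fast"
    yield " reply"


async def collect(primary, secondary, budget_s):
    """Gather the full token list for a given primary/secondary ordering."""
    return [t async for t in stream_with_fallback(primary, secondary, "hi", budget_s)]


if __name__ == "__main__":
    print(asyncio.run(collect(slow_provider, fast_provider, 0.05)))
```

A retried request is billed by both providers for whatever work each did, so the TTFT budget is a cost knob as well as a latency one. Setting it near the primary's observed P95 TTFT is one reasonable starting point, though that choice is ours and not something the data above prescribes.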

The prerequisite for multi-provider routing is unified latency instrumentation: the same measurement methodology, the same percentile calculations, across all providers simultaneously. Without that, you're routing based on intuition about which provider "feels faster" rather than measured percentile comparison. That's the instrumentation problem worth solving first, before any routing architecture decisions.