Per-token pricing is the number that triggers the conversation about switching LLM providers. It's a visible, comparable number that makes migration seem like a straightforward financial decision. In practice, the per-token price difference is often the smallest part of the true migration cost — and teams that anchor on it tend to be surprised by what they didn't account for.
This post walks through the full cost model for an LLM provider migration: the engineering overhead, the latency delta and its downstream effects, the quality regression risk, and the instrumentation you need to make the decision with eyes open.
Why Per-Token Price Comparisons Mislead
Consider a workload running 2M requests per month at 800 input tokens and 400 output tokens per request on average. If Provider A charges $X per 1M input tokens and $Y per 1M output tokens, and Provider B charges 40% less, the arithmetic looks compelling. At scale, a 40% cost reduction on the API bill is meaningful.
But the per-token comparison assumes token count parity across providers, which is often wrong. Different models produce different response lengths for the same prompt. A model that is 20% more verbose on average cuts a 40% output-price discount to an effective 28% saving (0.6 × 1.2 = 0.72 of the original output spend). Token count parity also assumes identical tokenization, which varies by model family: a prompt that is 800 tokens under GPT-4o's tokenizer may be 920 tokens under a different provider's tokenizer, which raises the billed input volume by 15% before any discount is applied.
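To make the interaction between price and token count concrete, here is a minimal sketch of the arithmetic. The function name `effective_cost_ratio` is illustrative, and the inputs are the example figures from this section (a 40% discount, 20% extra output verbosity, and an 800-token prompt that becomes 920 tokens), not measurements from any real provider.

```python
def effective_cost_ratio(price_ratio: float, token_ratio: float) -> float:
    """Candidate spend as a fraction of current spend for one token type.

    price_ratio: candidate price / current price (0.60 means 40% cheaper).
    token_ratio: candidate token count / current token count on the same prompts.
    """
    return price_ratio * token_ratio


# Example figures from the text, treated as assumptions.
output_ratio = effective_cost_ratio(0.60, 1.20)       # 20% more verbose output
input_ratio = effective_cost_ratio(0.60, 920 / 800)   # tokenizer inflation on input

print(f"output-side savings: {1 - output_ratio:.0%}")
print(f"input-side savings:  {1 - input_ratio:.0%}")
```

The headline discount only holds when both token ratios are 1.0, so the ratios are worth measuring rather than assuming.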
The correct pre-migration analysis runs the current prompt corpus against the candidate provider and measures actual token counts, not theoretical ones. This is a two-hour engineering task with the right tooling and a non-trivial data quality problem without it.
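One way to sketch that measurement: replay the same prompts against both providers, record the token counts each one reports, and compare totals. The `Usage` record and `token_ratios` helper below are hypothetical names, and the numbers are made-up placeholders standing in for counts you would capture from each provider's usage reporting.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    """Token counts reported for one request (hypothetical record shape)."""
    input_tokens: int
    output_tokens: int


def token_ratios(current: list[Usage], candidate: list[Usage]) -> tuple[float, float]:
    """Return (input_ratio, output_ratio): candidate totals over current totals."""
    if len(current) != len(candidate):
        raise ValueError("both runs must cover the same prompts in the same order")
    cur_in = sum(u.input_tokens for u in current)
    cur_out = sum(u.output_tokens for u in current)
    cand_in = sum(u.input_tokens for u in candidate)
    cand_out = sum(u.output_tokens for u in candidate)
    return cand_in / cur_in, cand_out / cur_out


# Placeholder data: three prompts replayed against both providers.
current_run = [Usage(800, 400), Usage(1200, 300), Usage(500, 650)]
candidate_run = [Usage(920, 470), Usage(1350, 390), Usage(605, 760)]

input_ratio, output_ratio = token_ratios(current_run, candidate_run)
print(f"input tokens x{input_ratio:.2f}, output tokens x{output_ratio:.2f}")
```

Comparing totals rather than per-request averages weights long prompts by their actual share of the bill, which is what the cost comparison needs.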
Integration Overhead: The One-Time Cost That Isn't
Moving from one OpenAI-compatible endpoint to another sounds like a one-line change. For simple deployments, it sometimes is. For production deployments with meaningful scale and complexity, it rarely is.
The sources of integration overhead that teams consistently underestimate include: retry and error handling logic that references provider-specific error codes; structured output schemas that have been tuned for a specific model's JSON generation behavior; function calling / tool use implementations that differ across model families; rate limit handling code that assumes specific tier limits; and monitoring and alerting that is coupled to provider-specific response fields (model version in response headers, usage object structure).
Each of these is individually manageable. Together, they represent a non-trivial engineering project — typically 1–3 engineering weeks for a moderately complex production deployment, longer for teams with multiple LLM integrations across different services. This one-time cost needs to be amortized over the expected duration of the new provider relationship to determine whether the per-token savings justify the migration.
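The amortization itself is simple division, sketched below. The loaded cost per engineering week and the monthly savings figure are hypothetical placeholders; only the two-week effort falls inside the 1–3 week range mentioned above.

```python
def break_even_months(integration_cost: float, monthly_savings: float) -> float:
    """Months of savings needed to pay back a one-time integration cost."""
    if monthly_savings <= 0:
        return float("inf")  # the migration never pays for itself
    return integration_cost / monthly_savings


# Assumptions for illustration only.
engineering_weeks = 2
cost_per_week = 4_000.0      # hypothetical loaded cost per engineering week
monthly_savings = 1_000.0    # hypothetical savings on the API bill

integration_cost = engineering_weeks * cost_per_week
months = break_even_months(integration_cost, monthly_savings)
print(f"break-even after {months:.1f} months")
```

If the break-even point is longer than you expect to stay with the new provider, the per-token savings do not justify the move on their own.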
We're not saying integration overhead should block migration decisions. We're saying it should be explicitly estimated before the decision is made, not discovered partway through implementation.
Latency Delta and Its Downstream Effects
Provider migrations frequently shift latency — P50, P95, and P99. Sometimes favorably, sometimes not. This is the cost dimension that teams most consistently fail to model before migrating.
Latency changes from a provider migration can affect cost in non-obvious ways. For synchronous user-facing applications, a 30% increase in P50 latency may require infrastructure changes to maintain acceptable user experience — more aggressive connection pooling, timeout adjustments, or UX changes to accommodate longer wait times. These are engineering costs triggered by the migration even if they're not paid to the provider.
For batch processing workloads, latency changes affect throughput: if you're running 100 concurrent requests and average response time increases by 40%, you either process roughly 29% less volume per hour (throughput at fixed concurrency scales with 1/1.4) or raise concurrency by about 40% to compensate, which may push you into higher rate limit tiers and partially offset the per-token savings.
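The relationship between latency and batch throughput can be sketched with a steady-state model: with every concurrency slot always busy, throughput is concurrency divided by average latency. Under that model a 40% latency increase at fixed concurrency lowers throughput by a factor of 1/1.4, and holding throughput constant takes 1.4 times the concurrency. The 2-second baseline latency is an assumption; the result does not depend on it.

```python
def requests_per_hour(concurrency: int, avg_latency_s: float) -> float:
    """Steady-state throughput when every concurrency slot is always busy."""
    return concurrency / avg_latency_s * 3600


def concurrency_to_hold_throughput(concurrency: int, latency_ratio: float) -> float:
    """Concurrency needed to keep throughput constant after a latency change."""
    return concurrency * latency_ratio


baseline_latency_s = 2.0   # assumed baseline average response time
latency_ratio = 1.4        # 40% slower, as in the example above

baseline = requests_per_hour(100, baseline_latency_s)
slower = requests_per_hour(100, baseline_latency_s * latency_ratio)
throughput_drop = 1 - slower / baseline
needed_concurrency = concurrency_to_hold_throughput(100, latency_ratio)

print(f"throughput drop at fixed concurrency: {throughput_drop:.1%}")
print(f"concurrency needed to compensate: {needed_concurrency:.0f}")
```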
The correct instrumentation for a migration evaluation is a shadow traffic period: mirror a percentage of production traffic to the candidate provider while the current provider continues to serve responses, measure latency side-by-side under identical load conditions, and compute the throughput implications before committing to the migration.
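Once both providers have latency samples from the same requests, the comparison is a percentile calculation. The sketch below uses Python's standard `statistics.quantiles`; the helper name `latency_percentiles` and the synthetic samples are assumptions standing in for real shadow traffic measurements.

```python
import statistics


def latency_percentiles(samples_ms: list[float]) -> dict[str, float]:
    """P50/P95/P99 of a latency sample, in the same unit as the input."""
    # quantiles(n=100) returns 99 cut points; index 49 is P50, 94 is P95, 98 is P99.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


# Synthetic placeholder samples, one entry per mirrored request.
current_ms = [100.0 + i for i in range(101)]
candidate_ms = [130.0 + 2 * i for i in range(101)]

current = latency_percentiles(current_ms)
candidate = latency_percentiles(candidate_ms)

for name in ("p50", "p95", "p99"):
    change = candidate[name] / current[name] - 1
    print(f"{name}: {current[name]:.0f} ms -> {candidate[name]:.0f} ms ({change:+.0%})")
```

Comparing all three percentiles matters because a provider can match the current P50 while having a much longer tail.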
Quality Regression Risk
Output quality differences between providers are real, workload-dependent, and difficult to measure with off-the-shelf benchmarks. Publicly available model benchmarks (MMLU, HumanEval, etc.) measure performance on standardized tasks that may not correlate with your specific use case.
For extractive tasks (parsing structured data from unstructured text, named entity extraction, classification), quality differences between models of comparable capability tier are often small. For generative tasks (writing, code synthesis, multi-step reasoning), quality differences can be substantial and prompt-dependent — a prompt that elicits excellent output from one model may produce mediocre output from another model of similar capability tier, simply because the training distribution and RLHF process differ.
This means quality validation requires running your actual production prompts against the candidate provider, evaluating outputs against your quality criteria, and identifying the subset of your request types where quality degrades meaningfully. Automated evaluation with LLM-as-judge is a reasonable approach for high-volume workloads where human evaluation is impractical. The key is having a quality evaluation framework in place before migration, not improvising one after you're committed.
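A minimal sketch of the last step, finding the request types where quality degrades, is shown below. It assumes each output has already been scored between 0 and 1 by whatever evaluator you use (human raters or an LLM judge); the helper name, the 0.05 threshold, and the sample scores are all hypothetical.

```python
from collections import defaultdict
from typing import Iterable


def degraded_request_types(
    scores: Iterable[tuple[str, float, float]], threshold: float = 0.05
) -> dict[str, float]:
    """Map request type to mean score drop, for types that drop more than threshold.

    scores: (request_type, current_score, candidate_score) for the same prompt.
    """
    totals: dict[str, list[float]] = defaultdict(lambda: [0.0, 0.0, 0.0])
    for request_type, current, candidate in scores:
        entry = totals[request_type]
        entry[0] += current
        entry[1] += candidate
        entry[2] += 1
    drops = {}
    for request_type, (cur_sum, cand_sum, count) in totals.items():
        mean_drop = (cur_sum - cand_sum) / count
        if mean_drop > threshold:
            drops[request_type] = mean_drop
    return drops


# Placeholder scores for illustration.
sample_scores = [
    ("extraction", 0.9, 0.9),
    ("extraction", 0.8, 0.8),
    ("generation", 0.9, 0.7),
    ("generation", 0.8, 0.6),
]
regressions = degraded_request_types(sample_scores)
print(regressions)
```

Grouping by request type keeps a regression in one category from being averaged away by stable results in another.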
The True Cost Model
A structured migration cost model has four components. The per-token savings figure is the number everyone calculates. The integration engineering cost is the one-time spend that needs amortizing. The latency delta effect captures the infrastructure and throughput implications. The quality regression risk is harder to quantify but should be estimated as an expected value: probability of regression × cost of regression (refund rates, re-processing costs, user churn).
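The four components can be combined into a single net-value estimate, sketched below. The field names, the 12-month horizon default, and every number in the example are assumptions; the sketch also treats the regression cost as a single expected charge over the horizon, which is one of several reasonable ways to model it.

```python
from dataclasses import dataclass


@dataclass
class MigrationEstimate:
    """Inputs to a migration cost model (all figures in the same currency)."""
    monthly_token_savings: float
    integration_cost: float          # one-time engineering spend
    monthly_latency_cost: float      # positive = extra cost, negative = savings
    regression_probability: float    # between 0 and 1
    regression_cost: float           # refunds, re-processing, churn
    horizon_months: int = 12

    def net_value(self) -> float:
        """Net value over the horizon; negative means the migration loses money."""
        recurring = (
            self.monthly_token_savings - self.monthly_latency_cost
        ) * self.horizon_months
        expected_regression = self.regression_probability * self.regression_cost
        return recurring - self.integration_cost - expected_regression


# Hypothetical inputs for illustration.
estimate = MigrationEstimate(
    monthly_token_savings=1_000.0,
    integration_cost=8_000.0,
    monthly_latency_cost=100.0,
    regression_probability=0.2,
    regression_cost=10_000.0,
)
net = estimate.net_value()
print(f"net value over {estimate.horizon_months} months: {net:,.0f}")
```

With these placeholder inputs the recurring savings are mostly consumed by the integration cost and the expected regression cost, which is the near-break-even pattern a pricing-only comparison hides.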
Running this model before migration often produces surprising conclusions. Migrations that look obviously correct based on per-token pricing sometimes only break even over 12 months after accounting for integration cost and quality validation. Migrations that look marginal on price sometimes clear easily when latency improvements reduce infrastructure overhead.
The common thread: none of these numbers are visible without instrumentation. You can't estimate token count delta without capturing actual token counts per provider. You can't model latency implications without P50/P95/P99 data from shadow traffic. You can't assess quality regression risk without a quality measurement framework tied to your actual request corpus.
When Migration Economics Actually Work
Provider migrations with strong economics share a few common characteristics. The workload has high request volume and predictable prompt shapes — making token count estimates reliable and per-token savings significant in absolute terms. The migration is driven by a capability gain (a feature the new provider offers that the current one doesn't), not just price, reducing the risk that quality regression erodes the savings. The team has existing multi-provider infrastructure — routing, normalized SDK, unified observability — that reduces integration overhead to near zero.
Teams that migrate purely on per-token price, at moderate volume, without multi-provider infrastructure in place, and without quality validation, tend to find the migration financially neutral at best after accounting for total costs. The per-token savings are real. They're just rarely the whole story.
The right time to build migration optionality is before you need it — multi-provider routing infrastructure and unified observability that normalizes across providers. With that foundation, the integration cost of a migration drops from weeks to days, and the evaluation period becomes a matter of turning on shadow traffic rather than a separate engineering project. That's the architectural posture that makes provider migrations actually cheap.