An interactive cost model for sovereign AI deployment — comparing hosted APIs, cloud VPC, on-prem and consumer-fleet options across DeepSeek, Qwen and Liquid models, priced per developer.
Sovereign AI — where does it run?
Sovereign AI is keeping both the model and the data it touches under your control. The open-weight field — DeepSeek, Qwen, Llama, Liquid — alongside the frontier APIs offers a spectrum, from someone else's endpoint to your own air-gapped rack.
If you are reading this, your token bill is climbing and the obvious levers (renegotiating the contract, rationing usage) have run their course. The instinct is to bring AI in-house and stop paying by the token. Before that, three pressures are worth separating, because only one of them is actually about cost.
Cost itself: a hundred developers on a frontier model cost $45–50K a month ($450–500 per developer), and the output tokens that dominate the bill do not get cheaper at scale.
Jurisdiction: some data legally cannot leave the country, or cannot touch a vendor whose servers fall under a foreign government's disclosure law, which quietly removes the cheapest hosted options from the table.
Dependency: when one provider holds your capability, its pricing, rate limits and policy changes are yours to absorb, not to negotiate.
Two shifts moved on-prem from a last resort to a real option. Open models got good and small — a 30-billion-parameter coding model now does, with the right setup, work that needed a frontier call a year ago. And the hardware stopped requiring a datacenter: a single workstation-class card holds a capable model. So the decision in front of you is not which model to buy. It is where on a spectrum you sit — from someone else's API to your own air-gapped rack — and that is set by where your data must not go, what you spend, and how much control you are willing to cede.
For a runaway token bill, owning GPUs is rarely the answer. The cheapest fix is almost always a right-sized open model on the cheapest endpoint your compliance posture allows, often 15 to 30 times cheaper than a frontier API. Until you reach real scale, owning hardware is a decision about sovereignty and risk, not cost.
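The size of that gap can be sanity-checked against the per-developer ranges in the summary table in section 3. The following is a minimal Python sketch, assuming those ranges are taken at face value; the function name `savings_multiple` is illustrative and not part of the model itself.

```python
# Per-developer monthly ranges in USD, transcribed from the summary table.
FRONTIER = (450, 500)
HOSTED_ALTERNATIVES = {
    "Vendor direct": (8, 17),
    "Western managed": (10, 30),
}


def savings_multiple(frontier, path):
    """Return (conservative, optimistic) multiples by which `path` undercuts `frontier`.

    Conservative: cheapest frontier figure against the dearest alternative figure.
    Optimistic: dearest frontier figure against the cheapest alternative figure.
    """
    f_lo, f_hi = frontier
    p_lo, p_hi = path
    return f_lo / p_hi, f_hi / p_lo


for name, cost_range in HOSTED_ALTERNATIVES.items():
    low, high = savings_multiple(FRONTIER, cost_range)
    print(f"{name}: roughly {low:.1f}x to {high:.1f}x cheaper than a frontier API")
```

The conservative end for the Western managed row is 450 / 30 = 15, which is where the lower bound of the claim comes from. The multiples say nothing about whether the cheaper model needs more turns to do the same work.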
Three lenses follow: a deployment ladder, a build-your-own fleet, and a one-glance summary. The controls let your technical team pressure-test the assumptions; the per-developer figures and the summary table are the parts that matter for the budget.
1. The deployment ladder
API vs cloud vs on-prem, with the model toggle spanning DeepSeek, Qwen and Liquid. The turn-multiplier models the extra tokens a smaller model needs to reach frontier-class output; the direct-API rung changes jurisdiction by vendor.
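One way to read the turn-multiplier is as a scaling factor on token volume. The sketch below is a simplified illustration, not the page's actual formula: it assumes a single blended price per million tokens, and every number in it (token volume, both prices, the multiplier of 2.0) is a placeholder rather than a measured value.

```python
def monthly_cost_per_dev(frontier_tokens_m, price_per_m, turn_multiplier=1.0):
    """Monthly cost per developer in USD.

    frontier_tokens_m: millions of tokens a developer would consume on a frontier model.
    price_per_m:       blended price in USD per million tokens (assumption: one blended rate).
    turn_multiplier:   extra-token factor for a smaller model; 1.0 means no extra turns.
    """
    if turn_multiplier < 1.0:
        raise ValueError("turn_multiplier models extra tokens, so it must be >= 1.0")
    return frontier_tokens_m * turn_multiplier * price_per_m


def break_even_multiplier(frontier_price_per_m, small_price_per_m):
    """Turn multiplier at which the smaller model costs as much as the frontier model."""
    return frontier_price_per_m / small_price_per_m


# Placeholder inputs, chosen only to make the arithmetic easy to follow.
TOKENS_M = 50.0        # hypothetical: 50M tokens per developer per month
FRONTIER_PRICE = 10.0  # hypothetical blended USD per million tokens
SMALL_PRICE = 0.50     # hypothetical blended USD per million tokens

frontier = monthly_cost_per_dev(TOKENS_M, FRONTIER_PRICE)
small = monthly_cost_per_dev(TOKENS_M, SMALL_PRICE, turn_multiplier=2.0)
print(f"frontier: ${frontier:.0f}/dev/mo, small model at 2x turns: ${small:.0f}/dev/mo")
print(f"break-even multiplier: {break_even_multiplier(FRONTIER_PRICE, SMALL_PRICE):.0f}x")
```

Under these placeholder prices the smaller model would need 20 times the tokens before it stopped being cheaper, which is why the multiplier matters more for quality and latency than for the bill.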
2. Build your own fleet
Strip the cloud margin and the datacenter premium. Pick hardware and a right-sized model — the cheaper the model, the more developers a single card serves. Every fleet option is air-gapped by definition.
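The link between developers per card and cost per developer can be sketched as straight-line amortisation. This is a rough illustration under stated assumptions: the ~$9k card price comes from the summary table, while the 36-month amortisation period, the developers-per-card values and the optional `opex_per_card_month` parameter are hypothetical inputs to be replaced with your own.

```python
import math


def fleet_cost_per_dev(developers, devs_per_card, card_price=9_000.0,
                       amortization_months=36, opex_per_card_month=0.0):
    """Return (cards needed, upfront USD, USD per developer per month).

    Assumes straight-line amortisation and whole cards only; power, hosting and
    staff time are folded into `opex_per_card_month` (default 0, i.e. ignored).
    """
    if developers <= 0 or devs_per_card <= 0:
        raise ValueError("developers and devs_per_card must be positive")
    cards = math.ceil(developers / devs_per_card)
    upfront = cards * card_price
    monthly_total = cards * (card_price / amortization_months + opex_per_card_month)
    return cards, upfront, monthly_total / developers


# Hypothetical: the same 100 developers served at two different densities.
for density in (25, 10):
    cards, upfront, per_dev = fleet_cost_per_dev(100, density)
    print(f"{density} devs/card: {cards} cards, ${upfront:,.0f} upfront, ${per_dev:.2f}/dev/mo")
```

A smaller model that lets one card serve more developers lowers both the card count and the per-developer figure, which is the effect the paragraph above describes.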
3. The whole board, summarised
Representative per-developer figures at roughly 100–300 developers. The "best for" column is the deciding constraint, not a ranking.
| Path | Example config | ~$/dev/mo | Upfront | Data sits | Best for |
|---|---|---|---|---|---|
| Frontier API | Opus 4.8 / GPT-5.5 | $450–500 | $0 | US vendor | Max capability, zero ops |
| Vendor direct | V4-Flash / Qwen hosted | $8–17 | $0 | China | Lowest token cost |
| LFM direct | Liquid first-party | $1–4 | $0 | US vendor | Small model, US first-party |
| Western managed | Fireworks / DeepInfra | $10–30 | $0 | US / EU | Low cost, no capex |
| Cloud VPC | V4-Pro 8×H200 · 3-yr | $140–160 | commit | Your cloud | No hardware owned |
| Datacenter on-prem | V4-Pro 8×H200 owned | $90* | ~$340k/node | Air-gapped | Hard-regulated, high volume |
| Fleet · PRO 6000 | Qwen3.6-30B · 1 card/replica | $8–22 | ~$9k/card | Air-gapped | Self-hosted coding |
| Fleet · PRO 6000 | LFM2.5-8B · MIG-split | $4–8 | ~$9k/card | Air-gapped | Tool-calling tier at scale |
| Fleet · M5 Ultra | Big model, low power | $30+ | ~$10k/unit | Air-gapped | Silent, fits big, low concurrency |
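Because the deciding constraint comes first and cost second, the table can be treated as a filter followed by a sort. The sketch below transcribes the rows above into Python and shortlists paths by where data may sit and whether upfront spend is allowed. The function `shortlist` and its parameters are illustrative; the "$90*" and "$30+" cells are simplified to 90 and an open-ended upper bound.

```python
# (path, low $/dev/mo, high $/dev/mo, upfront, data location), transcribed from the table.
ROWS = [
    ("Frontier API", 450, 500, "$0", "US vendor"),
    ("Vendor direct", 8, 17, "$0", "China"),
    ("LFM direct", 1, 4, "$0", "US vendor"),
    ("Western managed", 10, 30, "$0", "US / EU"),
    ("Cloud VPC", 140, 160, "commit", "Your cloud"),
    ("Datacenter on-prem", 90, 90, "~$340k/node", "Air-gapped"),
    ("Fleet PRO 6000 / Qwen3.6-30B", 8, 22, "~$9k/card", "Air-gapped"),
    ("Fleet PRO 6000 / LFM2.5-8B", 4, 8, "~$9k/card", "Air-gapped"),
    ("Fleet M5 Ultra", 30, float("inf"), "~$10k/unit", "Air-gapped"),
]


def shortlist(rows, allowed_locations, allow_upfront=True):
    """Keep rows whose data location is allowed, then sort by the low cost bound."""
    keep = [
        row for row in rows
        if row[4] in allowed_locations and (allow_upfront or row[3] == "$0")
    ]
    return sorted(keep, key=lambda row: row[1])


# Constraint 1: data must stay air-gapped.
air_gapped = shortlist(ROWS, {"Air-gapped"})
# Constraint 2: US or EU hosting is acceptable, but no upfront spend or commitment.
no_capex_us_eu = shortlist(ROWS, {"US vendor", "US / EU"}, allow_upfront=False)

for label, rows in (("air-gapped", air_gapped), ("no capex, US/EU", no_capex_us_eu)):
    print(label, "->", [row[0] for row in rows])
```

The sort is only by cost, and the rows cover different capability tiers, so the first result is the cheapest path that passes the constraint rather than a recommendation.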
What this is, and what it isn't
This is a model, not an answer. Every figure here is directional and dated — a May 2026 snapshot of prices and hardware that are moving quickly, built on assumptions about token use, utilisation and quality that your own workloads will not match exactly. The things that actually decide your answer — what your developers truly consume, which data your regulators ring-fence, what a smaller model's extra turns cost you in practice — turn on numbers we don't have. Treat the per-developer figures as a way to ask sharper questions of your own team, not as a verdict. If it changes the question you start from, it has done its job.