Self Hosted LLM vs Cloud Cost: When Local Inference Actually Saves Money
Self hosted LLM vs cloud cost breakdown: GPU purchase, rental, and power vs API bills at 50M–1B tokens/month. Find your crossover point with real numbers.
Introduction
Teams that hit $2,000+/month in API bills often ask whether buying or renting a GPU would be cheaper. The answer depends on three variables: monthly token volume, the model size you need, and whether you already own hardware. This guide uses June 2026 pricing for both cloud APIs and consumer/datacenter GPUs so you can find your crossover point with numbers, not hype.
What self-hosting actually requires
Running inference locally means serving an open-weight model (7B–70B parameters) on your own GPU via tools like vLLM, llama.cpp, or Ollama. VRAM is the hard constraint:
- 7B models (Q4 quantized): ~5 GB VRAM; runs on RTX 4080 (16 GB) or RTX 4090 (24 GB)
- 13B models: ~8 GB VRAM; comfortable on RTX 4090
- 70B models: ~40 GB VRAM; requires A100 80G or H100 80G, not feasible on consumer cards
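The VRAM figures above can be roughly reproduced from parameter count and quantization width. The sketch below is a back-of-the-envelope estimate, not a measurement: the 1.2 overhead factor (for KV cache and runtime buffers) is an assumption chosen for illustration, and real usage varies with context length and serving tool.

```python
def estimate_vram_gb(params_billion: float,
                     bits_per_weight: float = 4.0,
                     overhead: float = 1.2) -> float:
    """Rough VRAM estimate in GB for a quantized model.

    overhead is an assumed multiplier for KV cache and runtime buffers;
    it is not taken from the article.
    """
    # 1 billion parameters at 8 bits per weight is about 1 GB of weights.
    weight_gb = params_billion * bits_per_weight / 8
    return weight_gb * overhead


for size in (7, 13, 70):
    print(f"{size}B at 4-bit: ~{estimate_vram_gb(size):.1f} GB")
```

With these assumptions the estimates land close to the ~5 GB, ~8 GB, and ~40 GB figures quoted above, which is why a 70B model does not fit on a 24 GB consumer card.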
If your quality bar needs 70B-class reasoning, hardware costs jump from ~$1,800 (RTX 4090 purchase) to $15,000+ (A100 80G purchase) or about $3.20/hour for a rented H100 80G.
Cloud API cost baseline
At 500M input + 100M output tokens/month (a busy internal tool or customer-facing assistant):
- GPT-5.4 Mini API: $375 + $450 = $825/month
- DeepSeek V4 Flash API: $50 + $20 = $70/month
- Claude Sonnet 4.6 API: $1,500 + $1,500 = $3,000/month
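To rerun this baseline for a different token mix, a small helper is enough. The per-million-token prices below are not quoted from any price sheet; they are simply what the totals above imply at 500M input and 100M output tokens, so treat them as assumptions and replace them with current list prices.

```python
# USD per million tokens as (input, output), back-derived from the totals above.
IMPLIED_PRICES_PER_M = {
    "GPT-5.4 Mini": (0.75, 4.50),
    "DeepSeek V4 Flash": (0.10, 0.20),
    "Claude Sonnet 4.6": (3.00, 15.00),
}


def monthly_api_cost(input_m: float, output_m: float,
                     price_in: float, price_out: float) -> float:
    """Monthly bill in USD for token volumes given in millions."""
    return input_m * price_in + output_m * price_out


for model, (price_in, price_out) in IMPLIED_PRICES_PER_M.items():
    bill = monthly_api_cost(500, 100, price_in, price_out)
    print(f"{model}: ${bill:,.0f}/month")
```

Changing the first two arguments shows how sensitive each bill is to the input/output split, since output tokens are priced several times higher than input tokens on two of the three models.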
DeepSeek has made cloud inference so cheap that self-hosting wins only when you need data residency or zero egress, or when volume exceeds ~2B tokens/month on a mid-tier model.
Self-hosted cost: purchase vs rental
RTX 4090 purchase ($1,800 upfront, 450W TDP):
- Inference speed: ~120 tokens/sec on 7B, ~65 tokens/sec on 13B
- At 730 hours/month uptime: power cost ≈ 0.45 kW × 730h × $0.12/kWh = $39/month
- Amortized hardware over 24 months: $75/month
- Total operating cost: ~$114/month excluding engineer time
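The purchase arithmetic can be sketched as two small functions, one for monthly cost and one for how many tokens the card can produce in a month. Both rest on simplifying assumptions: the card draws its full 450W TDP around the clock (as the power estimate above does), and the quoted tokens/sec is a sustained single-stream rate. Batched serving can raise aggregate throughput, so the capacity figure is a conservative floor rather than a limit.

```python
HOURS_PER_MONTH = 730


def owned_gpu_monthly_cost(price_usd: float, tdp_kw: float,
                           usd_per_kwh: float, amortize_months: int,
                           hours: float = HOURS_PER_MONTH):
    """Return (power, hardware, total) monthly cost in USD, excluding engineer time."""
    power = tdp_kw * hours * usd_per_kwh      # assumes full TDP draw all month
    hardware = price_usd / amortize_months    # straight-line amortization
    return power, hardware, power + hardware


def monthly_token_capacity(tokens_per_sec: float,
                           hours: float = HOURS_PER_MONTH) -> float:
    """Tokens generated in a month at a sustained single-stream rate."""
    return tokens_per_sec * 3600 * hours


power, hardware, total = owned_gpu_monthly_cost(1800, 0.45, 0.12, 24)
print(f"power ${power:.2f} + hardware ${hardware:.2f} = ${total:.2f}/month")
for model, rate in (("7B", 120), ("13B", 65)):
    print(f"{model}: ~{monthly_token_capacity(rate) / 1e6:.0f}M tokens/month")
```

Comparing the capacity output against your own monthly volume is a quick check on whether one card is enough before looking at cost at all.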
RTX 4090 rental (RunPod ~$0.74/hour):
- Full-month 24/7: 730 × $0.74 = $540/month, more expensive than DeepSeek API at most volumes
- Part-time (8h/day weekdays): ~173 hours × $0.74 = $128/month, viable for batch jobs
H100 80G rental ($3.20/hour):
- Needed for 70B models at ~90 tokens/sec
- 24/7 monthly: $2,336/month, only justified above ~5B tokens/month vs GPT-5.4 Mini API
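Rental cost scales linearly with hours, which makes it easy to find the point where renting stops being cheaper than owning. The sketch below reuses the hourly rates above and the ~$114/month owned-card figure. The break-even helper is an illustration added here, and it ignores engineer time and resale value.

```python
def rental_monthly_cost(usd_per_hour: float, hours: float) -> float:
    """Monthly rental bill in USD for a given number of billed hours."""
    return usd_per_hour * hours


def breakeven_rental_hours(owned_monthly_usd: float, usd_per_hour: float) -> float:
    """Billed hours per month at which renting costs the same as owning."""
    return owned_monthly_usd / usd_per_hour


print(f"RTX 4090, 24/7:      ${rental_monthly_cost(0.74, 730):,.2f}")
print(f"RTX 4090, part-time: ${rental_monthly_cost(0.74, 173):,.2f}")
print(f"H100 80G, 24/7:      ${rental_monthly_cost(3.20, 730):,.2f}")
print(f"Break-even vs owned: {breakeven_rental_hours(114, 0.74):.0f} hours/month")
```

Under these numbers, renting an RTX 4090 is the cheaper option only below roughly 150 billed hours a month, which is consistent with the part-time batch scenario being the viable one.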
The crossover math
Self-hosting a 13B model on an owned RTX 4090 beats the GPT-5.4 Mini API when your monthly API spend exceeds roughly $400–$500 and you can keep the GPU utilized above 50%. It rarely beats DeepSeek V4 Flash on pure cost: $70/month for 600M tokens is hard to undercut with consumer hardware.
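One way to locate a crossover is to turn an API bill into a blended price per million tokens and divide the self-hosted fixed cost by it. The sketch below assumes the 5:1 input/output mix from the baseline (600M tokens total) and uses the ~$114/month hardware-plus-power figure, which excludes engineer time and assumes the local model is an acceptable substitute. A fixed cost that included engineer time would push each threshold higher.

```python
def blended_price_per_m(bill_usd: float, total_tokens_m: float) -> float:
    """Average USD per million tokens for a given bill and token mix."""
    return bill_usd / total_tokens_m


def crossover_tokens_m(self_hosted_fixed_usd: float, api_price_per_m: float) -> float:
    """Monthly volume (millions of tokens) where the API bill equals the fixed cost."""
    return self_hosted_fixed_usd / api_price_per_m


FIXED_USD = 114  # owned RTX 4090: power + amortization, no engineer time

for name, bill in (("GPT-5.4 Mini", 825), ("DeepSeek V4 Flash", 70)):
    price = blended_price_per_m(bill, 600)
    volume = crossover_tokens_m(FIXED_USD, price)
    print(f"{name}: ${price:.3f}/M tokens, crossover at ~{volume:.0f}M tokens/month")
```

The gap between the two crossover volumes shows why the comparison against DeepSeek V4 Flash is so much harder to win than the one against GPT-5.4 Mini.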
Self-hosting wins on non-price factors:
- Data never leaves your network (healthcare, legal, defense)
- Predictable latency without third-party rate limits
- Custom fine-tunes on proprietary data
- No per-token marginal cost once hardware is sunk
Decision framework
| Monthly tokens | Best economic choice | Best privacy choice |
|---|---|---|
| Under 100M | DeepSeek V4 Flash or Groq API | Local 7B on RTX 4090 |
| 100M–1B | DeepSeek/Groq API or owned RTX 4090 | Owned RTX 4090 + 13B model |
| 1B+ | Owned A100/H100 or negotiated API volume discount | On-prem A100 cluster |
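The table can also be expressed as a small lookup, which is handy if you want to embed the rule of thumb in a planning script. The function name and the handling of the boundaries are assumptions made for this sketch: exactly 100M falls in the middle tier and exactly 1B in the top tier.

```python
def recommend(monthly_tokens_m: float, privacy_first: bool = False) -> str:
    """Return the table's suggestion for a monthly volume in millions of tokens."""
    # (upper bound in millions, economic choice, privacy choice)
    tiers = [
        (100, "DeepSeek V4 Flash or Groq API", "Local 7B on RTX 4090"),
        (1000, "DeepSeek/Groq API or owned RTX 4090", "Owned RTX 4090 + 13B model"),
    ]
    for upper, economic, privacy in tiers:
        if monthly_tokens_m < upper:
            return privacy if privacy_first else economic
    # 1B tokens and above
    if privacy_first:
        return "On-prem A100 cluster"
    return "Owned A100/H100 or negotiated API volume discount"


print(recommend(50))
print(recommend(500, privacy_first=True))
print(recommend(2000))
```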
Next steps
Use our self-hosted LLM vs cloud cost calculator to plug in your token volumes, pick a GPU tier, and see monthly totals side by side. Update your assumptions quarterly: API prices dropped 40–60% between 2024 and 2026, which shifted the crossover point for most startups.