Fireworks AI: Fast Open-Source Inference

Overview

Fireworks AI is an inference provider focused on speed and cost for open-source models. It offers serverless pay-per-token pricing, dedicated on-demand GPU instances, and a batch inference mode, all backed by custom GPU optimization work rather than stock llama.cpp or vLLM deployments.

Pricing (May 2026)

Model tier	Serverless price ($ per 1M tokens)
Under 4B params	$0.10
4B-16B params	$0.20
16B+ params	$0.90
DeepSeek V4 Pro (High)	$2.17

Cached input tokens are priced at 50% of the serverless rate. Batch inference is 50% off for both input and output. Dedicated GPU instances range from $2.90/hour (A100) to $9.00/hour (B200). New accounts receive $1 in free credits.
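The discounts above can be turned into a rough cost estimate. The sketch below is illustrative only: the function name `serverless_cost_usd` and its parameters are hypothetical, and it assumes that a single $/M rate covers both input and output tokens, that cached tokens are a subset of input tokens, and that the batch and cache discounts are not combined. None of these assumptions is stated in the pricing text.

```python
def serverless_cost_usd(rate_per_m, input_tokens, output_tokens,
                        cached_input_tokens=0, batch=False):
    """Estimate the cost in USD of a serverless workload.

    Assumptions (not confirmed by the pricing text):
    - one $/M rate covers both input and output tokens;
    - cached_input_tokens is a subset of input_tokens;
    - the batch and cache discounts are not combined.
    """
    if min(input_tokens, output_tokens, cached_input_tokens) < 0:
        raise ValueError("token counts must be non-negative")
    if cached_input_tokens > input_tokens:
        raise ValueError("cached_input_tokens cannot exceed input_tokens")

    if batch:
        # Batch inference: 50% off both input and output.
        billable = 0.5 * (input_tokens + output_tokens)
    else:
        # Cached input tokens are billed at 50% of the serverless rate.
        uncached = input_tokens - cached_input_tokens
        billable = uncached + 0.5 * cached_input_tokens + output_tokens

    return rate_per_m * billable / 1_000_000


if __name__ == "__main__":
    # 16B+ tier ($0.90/M): 2M input tokens (1M of them cached), 0.5M output.
    realtime = serverless_cost_usd(0.90, 2_000_000, 500_000,
                                   cached_input_tokens=1_000_000)
    batched = serverless_cost_usd(0.90, 2_000_000, 500_000, batch=True)
    print(f"real-time with cache: ${realtime:.2f}")
    print(f"batch:                ${batched:.3f}")
```

If the discounts turn out to stack, only the `batch` branch needs to change.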

Artificial Analysis data from April 2026 shows a median blended rate of $0.84/M tokens for Fireworks across 16 tracked models.

Where it fits

Fireworks occupies the performance end of the open-source inference market, competing with Groq on latency and Together AI on model breadth. The tiered serverless pricing (size-based, not model-by-model) makes cost estimation more predictable: know your model's parameter count and you know your rate.
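As a minimal sketch of that lookup, the function below maps a parameter count to the size tiers in the pricing table. The name `serverless_rate_per_m` is hypothetical, and the handling of a model with exactly 16B parameters (placed in the middle tier here) is an assumption, since the table's "4B-16B" and "16B+" labels overlap at that boundary. The table also lists one model-specific row, DeepSeek V4 Pro (High), which a size-only lookup like this does not cover.

```python
def serverless_rate_per_m(params_billions):
    """Return the serverless $/M-token rate for a model of the given size.

    Covers only the three size tiers from the pricing table.
    Assumption: a model with exactly 16B parameters falls in the 4B-16B tier.
    """
    if params_billions <= 0:
        raise ValueError("parameter count must be positive")
    if params_billions < 4:
        return 0.10   # Under 4B params
    if params_billions <= 16:
        return 0.20   # 4B-16B params
    return 0.90       # 16B+ params


if __name__ == "__main__":
    for size in (1.5, 8, 70):
        print(f"{size}B params -> ${serverless_rate_per_m(size):.2f}/M tokens")
```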

On-demand GPU deployments suit teams that need a dedicated endpoint with SLA guarantees rather than shared serverless capacity.
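Cost is a second consideration alongside SLA needs. A rough break-even between a dedicated instance and serverless can be computed from the listed prices, as in the sketch below. The helper name `breakeven_tokens_per_hour` is hypothetical, and the calculation assumes the instance is billed for every hour it runs and ignores whether a given GPU can actually sustain the resulting throughput.

```python
def breakeven_tokens_per_hour(gpu_hourly_usd, serverless_rate_per_m):
    """Tokens per hour at which a dedicated GPU costs the same as serverless.

    Above this volume the dedicated instance is cheaper per token; below it,
    serverless is cheaper. Throughput limits and idle time are ignored.
    """
    if gpu_hourly_usd <= 0 or serverless_rate_per_m <= 0:
        raise ValueError("prices must be positive")
    return gpu_hourly_usd / serverless_rate_per_m * 1_000_000


if __name__ == "__main__":
    # Listed hourly prices, compared against the 16B+ serverless tier ($0.90/M).
    for name, hourly in (("A100", 2.90), ("B200", 9.00)):
        tokens = breakeven_tokens_per_hour(hourly, 0.90)
        print(f"{name}: about {tokens / 1e6:.1f}M tokens/hour to break even")
```

At the 16B+ serverless rate, this works out to roughly 3.2M tokens per hour for the A100 and 10M tokens per hour for the B200.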

For teams that want to route across Fireworks and other providers without managing multiple API keys, OpenRouter carries the Fireworks catalog.

Field notes

Fireworks B200 GPU instance pricing ($9.00/hour) was confirmed in the May 2026 pricing page update. The B200 tier is positioned as the high-throughput option for teams running 70B+ models at production scale who need dedicated capacity rather than serverless burst. [changelog, 2026-05-01]
