Groq: LPU Inference API

Overview

Groq is an inference API that runs open-source LLMs on Language Processing Units (LPUs), purpose-built silicon for sequential token generation. The architecture is designed to remove the memory-bandwidth bottleneck that limits GPU throughput on autoregressive decoding.

Speed

Groq's LPUs deliver 276 to 1,500 tokens per second depending on model size, versus 40 to 100 tokens per second on comparable GPU infrastructure. For interactive applications where latency is user-visible (chat, autocomplete, real-time agents), this throughput gap is the main reason teams evaluate Groq before Together AI or Fireworks AI.
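To make the throughput ranges above concrete, the following minimal sketch converts tokens per second into wall-clock streaming time. The 500-token response length is an assumption chosen for illustration, and the calculation ignores queueing, network latency, and time to first token.

```python
def generation_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Time to stream `output_tokens` at a steady decode rate.

    Ignores queueing, network latency, and time to first token.
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return output_tokens / tokens_per_second


# Throughput figures quoted in the Speed section (tokens per second).
RATES = {
    "LPU, low end": 276,
    "LPU, high end": 1500,
    "GPU, low end": 40,
    "GPU, high end": 100,
}

RESPONSE_TOKENS = 500  # assumed response length, for illustration only

for label, rate in RATES.items():
    print(f"{label}: {generation_seconds(RESPONSE_TOKENS, rate):.2f} s")
```

Under these assumptions a 500-token reply takes between 5 and 12.5 seconds at the quoted GPU rates and under 2 seconds at the quoted LPU rates.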

Pricing (May 2026)

Model	Input ($/M tokens)	Output ($/M tokens)
Llama 3.3 70B	$0.59	$0.79
Llama 3.1 8B	$0.05	$0.08
Mixtral 8x7B	$0.24	$0.24
Gemma 2 9B	$0.20	$0.20
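
As a rough worked example, the prices in the table can be turned into a per-request cost. This is a sketch only: the function name `request_cost` and the 2,000-input / 500-output token counts are illustrative assumptions, and the figures exclude any tier discount.

```python
# Prices from the table above, in USD per million tokens: (input, output).
PRICES = {
    "Llama 3.3 70B": (0.59, 0.79),
    "Llama 3.1 8B": (0.05, 0.08),
    "Mixtral 8x7B": (0.24, 0.24),
    "Gemma 2 9B": (0.20, 0.20),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one request at list price (no discount applied)."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Assumed request shape: 2,000 prompt tokens and 500 completion tokens.
for model in PRICES:
    print(f"{model}: ${request_cost(model, 2_000, 500):.6f} per request")
```

For Llama 3.3 70B this works out to (2,000 x 0.59 + 500 x 0.79) / 1,000,000 = $0.001575 per request.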

Free tier: 30 requests per minute and 6,000 tokens per minute, per model. Developer tier (requires a credit card on file): 10x the free-tier limits, which works out to 300 requests per minute and 60,000 tokens per minute, plus a 25% discount on all usage. No free trial credit is required; the discount applies from the first billed token.
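
The tier arithmetic can be sketched as follows. Two assumptions are baked in and not confirmed by the text above: that the per-minute token limit counts input plus output tokens, and that the 25% discount applies uniformly to the listed prices. All function names are illustrative.

```python
FREE_RPM = 30            # requests per minute, free tier
FREE_TPM = 6_000         # tokens per minute, free tier
DEVELOPER_MULTIPLIER = 10
DEVELOPER_DISCOUNT = 0.25


def developer_limits() -> tuple[int, int]:
    """Developer-tier (requests/min, tokens/min), derived as 10x the free tier."""
    return FREE_RPM * DEVELOPER_MULTIPLIER, FREE_TPM * DEVELOPER_MULTIPLIER


def discounted(price_per_million: float) -> float:
    """List price after the developer-tier discount (assumed to apply uniformly)."""
    return price_per_million * (1 - DEVELOPER_DISCOUNT)


def max_requests_per_minute(tokens_per_request: int, rpm: int, tpm: int) -> int:
    """Sustainable request rate: whichever of the two limits binds first."""
    return min(rpm, tpm // tokens_per_request)


# Assumed workload: RAG-style requests of about 2,000 tokens each.
dev_rpm, dev_tpm = developer_limits()
print("free tier:", max_requests_per_minute(2_000, FREE_RPM, FREE_TPM), "requests/min")
print("developer tier:", max_requests_per_minute(2_000, dev_rpm, dev_tpm), "requests/min")
print(f"Llama 3.3 70B input, discounted: ${discounted(0.59):.4f} per million tokens")
```

With 2,000-token requests the token limit binds long before the request limit: about 3 requests per minute on the free tier versus about 30 on the developer tier.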

Model selection

Groq runs open-weight models only. It does not host Claude, GPT-5, or Gemini. For frontier closed-source models, teams use the Anthropic API or OpenAI API directly. The Groq pitch is: when Llama 3.3 70B is good enough for your task, Groq gets you to that answer faster and cheaper than any GPU-based alternative.

Where it fits

Groq is the inference layer for latency-sensitive applications that can use open-weight models. Use cases: real-time transcription, sub-100ms chat responses, streaming code completions in custom tooling. Teams that need model flexibility across providers (Claude + Llama + GPT in one stack) often route through OpenRouter or Portkey, which both support Groq as a backend.
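
The routing decision described above can be sketched as a small pure function. Everything here is hypothetical: the provider labels, the model-name prefixes, and the `pick_route` helper are illustrative and do not reflect the configuration format of OpenRouter, Portkey, or any other gateway.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    provider: str  # hypothetical provider label, not a real gateway identifier
    reason: str


# Model families named in this document that ship as open weights.
OPEN_WEIGHT_PREFIXES = ("llama", "mixtral", "gemma")


def pick_route(model: str, latency_sensitive: bool) -> Route:
    """Choose a backend for a model name under the fit criteria described above."""
    name = model.lower()
    if name.startswith(OPEN_WEIGHT_PREFIXES):
        if latency_sensitive:
            return Route("groq", "open-weight model on a latency-sensitive path")
        return Route("any-open-weight-host", "open-weight model, latency not critical")
    if name.startswith("claude"):
        return Route("anthropic", "closed-source model, not hosted on Groq")
    if name.startswith("gpt"):
        return Route("openai", "closed-source model, not hosted on Groq")
    raise ValueError(f"no route configured for model {model!r}")


print(pick_route("Llama 3.3 70B", latency_sensitive=True))
print(pick_route("Claude", latency_sensitive=True))
```

Keeping the decision in one function makes the trade-off explicit: Groq is selected only when the model is open-weight and the call path is latency-sensitive.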

The trade-off: Groq's model catalog is narrower than Together AI's and the hardware is proprietary, so you cannot self-host the same architecture on your own cluster.

Field notes

Developer tier limit increase confirmed on groq.com/pricing: adding a credit card bumps rate limits 10x and applies a 25% discount from first use. Several teams on the community Discord reported hitting free tier limits within an hour of loading a RAG pipeline; the developer tier resolved this without a full paid plan commitment. [changelog, 2026-04-15]
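
One way to avoid hitting the free-tier ceiling during bulk work is a client-side throttle. The sketch below is an assumption-laden approximation: it tracks a trailing 60-second window locally, which may not match how Groq actually measures its limits, and the class name `MinuteWindowLimiter` is illustrative.

```python
import time
from collections import deque


class MinuteWindowLimiter:
    """Tracks requests and tokens over a trailing 60-second window (client side)."""

    def __init__(self, max_requests: int, max_tokens: int, clock=time.monotonic):
        self.max_requests = max_requests
        self.max_tokens = max_tokens
        self.clock = clock      # injectable so the limiter can be tested without sleeping
        self.events = deque()   # (timestamp, tokens) for each admitted request

    def _prune(self, now: float) -> None:
        # Drop events that have aged out of the 60-second window.
        while self.events and now - self.events[0][0] >= 60:
            self.events.popleft()

    def try_acquire(self, tokens: int) -> bool:
        """Return True and record the request if it fits both limits, else False."""
        now = self.clock()
        self._prune(now)
        used_tokens = sum(t for _, t in self.events)
        if len(self.events) >= self.max_requests:
            return False
        if used_tokens + tokens > self.max_tokens:
            return False
        self.events.append((now, tokens))
        return True


# Free-tier limits quoted in the Pricing section.
limiter = MinuteWindowLimiter(max_requests=30, max_tokens=6_000)
if limiter.try_acquire(tokens=2_000):
    print("send request")
else:
    print("wait before sending")
```

A caller that gets False can sleep briefly and retry, or queue the request; switching to developer-tier limits only requires passing 300 and 60,000 instead.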