Ollama: Local Model Runtime

Overview

Ollama is an open-source runtime for downloading and running large language models locally. It wraps llama.cpp inference behind a simple CLI and a REST API that follows the OpenAI format, abstracting GPU memory management, model quantization, and multi-model switching behind single commands.

Pricing

Free. Open source (MIT). You pay only for the hardware you run it on. There are no usage fees, API keys, or rate limits.

Model library

Ollama maintains a public model registry with 100+ models including Llama 3.3 70B, Qwen3, DeepSeek-R1, Gemma 2 (2B, 9B, 27B), Mistral, and Phi-4. Running a model is a single pull command; Ollama handles chunked downloads, verification, and hot-reload.

As of May 2026, over 112 million pulls for Llama 3.1 alone have been logged, making it the most-used local model runtime by a wide margin.

Performance

On GPU-accelerated hardware, Ollama delivers 300+ tokens per second for 8B models and up to 1,200 TPS on high-end configurations. On 8GB RAM consumer hardware with an integrated GPU, a quantized 8B model runs at useful speeds for development and testing.

Windows ARM64 received a native build in 2026, eliminating the emulation performance overhead on Snapdragon-based Windows machines.

Where it fits

Ollama is the local development substrate for teams that want:

Zero API costs during development
Privacy: prompts never leave the machine
Offline capability
Testing different open-source models without cloud latency

For production inference at scale, cloud providers (Groq, Together AI, Fireworks AI) offer better throughput per dollar than self-managed hardware at most company sizes.

Ollama's OpenAI-compatible API means Continue and Cline both support it as a provider -- local development with Claude-level capabilities for code completions at zero cost.

Field notes

Windows ARM64 native build shipped in early 2026 (confirmed in Ollama GitHub release notes). Teams using Copilot+ PCs reported performance improvements of 2-3x over the previous x86 emulation path for 8B models. [changelog, 2026-02-10]