Standardised testing

Benchmarks

We run the same suite of tasks against every tool. The methodology is public, and vendors receive 48 hours' pre-notification. Scores update as new runs land.

Read the methodology

Aider Polyglot

Code-editing accuracy across C++, Go, Java, JavaScript, Python, and Rust. Methodology by Paul Gauthier (Aider AI), licensed for public reference. Scores represent each model's performance on Aider's polyglot coding test harness.

Unit: percent | Higher = better | Last run: May 11, 2026

Methodology

Top 3

01 | OpenAI API | 88.0%
02 | Aider | 88.0%
03 | OpenRouter | 88.0%

Full leaderboard

Chatbot Arena Elo

Human-preference Elo rating from LMSYS Chatbot Arena. Measures conversational quality via pairwise battle votes, in which voters pick the better of two responses. The only Phase-1 benchmark with a human-preference signal rather than a capability metric.
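To make "Elo from pairwise battle votes" concrete, here is a minimal sketch of the classic online Elo update applied to a stream of votes. It is an illustration only: the starting rating of 1000, the K-factor of 32, and the function and model names are assumptions, and the Arena's published ratings may be computed with a different procedure.

```python
from collections import defaultdict


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(ratings, model_a, model_b, outcome, k=32.0):
    """Apply one vote in place.

    outcome is 1.0 if model_a won, 0.0 if model_b won, 0.5 for a tie.
    k (assumed 32 here) controls how far a single vote moves the ratings.
    """
    e_a = expected_score(ratings[model_a], ratings[model_b])
    delta = k * (outcome - e_a)
    ratings[model_a] += delta
    ratings[model_b] -= delta  # zero-sum: B loses what A gains


def rate(votes, initial=1000.0, k=32.0):
    """Replay (model_a, model_b, outcome) votes in order and return ratings."""
    ratings = defaultdict(lambda: initial)  # assumed starting rating
    for model_a, model_b, outcome in votes:
        update_elo(ratings, model_a, model_b, outcome, k)
    return dict(ratings)


if __name__ == "__main__":
    # Hypothetical votes between two placeholder models.
    votes = [("model-x", "model-y", 1.0), ("model-x", "model-y", 0.5)]
    for name, rating in sorted(rate(votes).items()):
        print(f"{name}: {rating:.1f}")
```

Because each update is zero-sum, the total rating across models stays constant. Only the differences between ratings carry meaning, which is why a higher Elo reads as "preferred more often" rather than as an absolute capability score.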

Unit: Elo | Higher = better | Last run: Jun 1, 2026

Methodology

Top 3

Full leaderboard