Aider Polyglot
Code-editing accuracy across C++, Go, Java, JavaScript, Python, and Rust. Methodology by Paul Gauthier (Aider AI), licensed for public reference. Scores represent each model's performance on Aider's polyglot coding test harness.
Standardised testing
We run the same suite of tasks against every tool. The methodology is public, vendors receive 48 hours' advance notice, and scores update as new runs land.
Human-preference Elo rating from the LMSYS Chatbot Arena. Measures conversational quality from votes on pairwise model battles; it is the only Phase-1 benchmark based on a human-preference signal rather than a capability metric.
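To illustrate how pairwise votes can turn into a rating, here is a minimal sketch of a textbook Elo update. It is not necessarily how the Arena computes its published scores. The function names, the starting rating of 1000, and the K-factor of 32 are assumptions chosen for illustration.

```python
from typing import Dict, Iterable, Tuple


def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_update(r_a: float, r_b: float, outcome_a: float, k: float = 32.0) -> Tuple[float, float]:
    """Return updated ratings for A and B after one battle.

    outcome_a is 1.0 if A won the vote, 0.0 if B won, 0.5 for a tie.
    k is an assumed K-factor, not a value taken from any real leaderboard.
    """
    delta = k * (outcome_a - expected_score(r_a, r_b))
    return r_a + delta, r_b - delta


def rate_battles(
    battles: Iterable[Tuple[str, str, float]],
    initial: float = 1000.0,  # assumed starting rating
    k: float = 32.0,
) -> Dict[str, float]:
    """Replay (model_a, model_b, outcome_a) votes in order and return ratings."""
    ratings: Dict[str, float] = {}
    for a, b, outcome_a in battles:
        r_a = ratings.get(a, initial)
        r_b = ratings.get(b, initial)
        ratings[a], ratings[b] = elo_update(r_a, r_b, outcome_a, k)
    return ratings
```

With these assumed parameters, a single vote between two models that both start at 1000 moves the winner to 1016 and the loser to 984. Sequential Elo depends on the order of votes, which is one reason a leaderboard may prefer a batch fit instead.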
Contamination-resistant coding benchmark (UC Berkeley / UCLA). Reports pass@1 on programming-contest problems; new problems are added monthly to prevent data leakage.
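For readers unfamiliar with pass@1, the sketch below shows the widely used unbiased pass@k estimator, which reduces to the fraction of correct samples when k = 1. Whether this benchmark's harness computes its score exactly this way is an assumption; the function names and the (samples, correct) input layout are hypothetical.

```python
from math import comb
from typing import Iterable, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k samples passes.

    n: samples generated for a problem, c: samples that passed all tests.
    Uses 1 - C(n - c, k) / C(n, k); for k = 1 this equals c / n.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples exist, so any k draws include a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: Iterable[Tuple[int, int]], k: int = 1) -> float:
    """Average pass@k over problems, given (n, c) pairs per problem."""
    scores = [pass_at_k(n, c, k) for n, c in results]
    if not scores:
        raise ValueError("no problems supplied")
    return sum(scores) / len(scores)
```

A problem with 10 samples of which 3 pass contributes roughly 0.3 to pass@1. The benchmark-level score is the mean of these per-problem values.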
Coding-agent ability on a 300-issue subset of real GitHub issues. Measures end-to-end issue resolution rate. Expected: Q3 2026.
Human-verified subset of SWE-bench (500 issues); the current standard for serious coding-agent claims. Expected: Q3 2026.
First run · Q3 2026
Scores publish after methodology lock and a complete cross-tool run.
Terminal agent benchmark from Stanford / laude-institute. Measures end-to-end agent accuracy across Linux terminal tasks.