By promptfoo

promptfoo

Test your prompts, agents, and RAGs. Red teaming/pentesting/vulnerability scanning for AI. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration. Used by OpenAI and Anthropic.

Filed under AI Evals & Testing. Status: Published

Stars: 21.4k
Forks: 1.9k
Open issues: 272
Last commit: 13d ago
Stats refreshed: May 19, 2026

On the maker

promptfoo

Pricing

Open-source

Field notes

No field notes yet.

Field notes for promptfoo will land here when sources support a confident take. They are synthesized from postmortems, vendor retros, dev-team blogs, deeply engaged GitHub issues, and our own builds.

Coverage isn’t promised on every tool; empty sections are honest. Field notes are curated, not generated from vendor copy.

Benchmarks

Scores aren’t in yet.

We’re wiring up SWE-bench, Aider Polyglot, and a custom dev-task suite next. Methodology will be public; vendor pre-notification is 48 hours.

How we make money

This directory is supported by display advertising. Advertisers do not influence editorial rankings, benchmark scoring, or which tools are featured. Tools are ordered by data.
