By promptfoo
promptfoo
Test your prompts, agents, and RAGs. Red teaming/pentesting/vulnerability scanning for AI. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration. Used by OpenAI and Anthropic.
Filed under AI Evals & Testing. Status: Published
- Stars: 21.4k
- Forks: 1.9k
- Open issues: 272
- Last commit: 13d ago
- Stats refreshed: May 19, 2026
On the maker
Pricing
Open-source
Field notes
No field notes yet.
Field notes for promptfoo will land here when sources support a confident take. They are synthesized from postmortems, vendor retros, dev-team blogs, deeply engaged GitHub issues, and our own builds.
Coverage isn’t promised on every tool; empty sections are honest. Field notes are curated, not generated from vendor copy.
Benchmarks
Scores aren’t in yet.
We’re wiring up SWE-bench, Aider Polyglot, and a custom dev-task suite next. Methodology will be public; vendor pre-notification is 48 hours.
How we make money
This directory is supported by display advertising. Advertisers do not influence editorial rankings, benchmark scoring, or which tools are featured. Tools are ordered by data.