The directory

AI dev and agent tools, decision-grade.

Benchmarks we run. Prices we verify daily. Field evidence we curate from postmortems, dev posts, and vendor retros: terse, dated, honest.

Tools: 798
Vendors: 2102
Categories: 23

Showing 3 of 798 tools

Section: AI Evals & Testing (3 tools)

appworld
StonyBrookNLP
🌍 AppWorld: A Controllable World of Apps and People for Benchmarking Function Calling and Interactive Coding Agent, ACL'24 Best Resource Paper.
promptfoo
promptfoo
Test your prompts, agents, and RAGs. Red teaming/pentesting/vulnerability scanning for AI. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration. Used by OpenAI and Anthropic.
trulens
truera
Evaluation and Tracking for LLM Experiments and AI Agents