Methodology · vybing.dev

The tier system

Every tool in the directory is assigned a tier based on the completeness and verifiability of the data we hold on it.

Tier A · Published: Full data, meaning a complete benchmark suite, pricing verified within the last 48 hours, and at least two field notes from production use. Fully indexed. These are the pages that carry the directory's editorial weight.
Tier B · Comparison & Category: Structured comparison pages (/compare/tool-a-vs-tool-b) and category landing pages (/categories/slug). Data is inherited from the two Tier-A tools being compared or the tools in the category.
Tier C · Discovered / Enriched: Tools we know exist but haven't fully profiled yet. Pages in this tier emit a robots noindex tag at the renderer level, so they keep thin content out of search indexes. We enrich these continuously.

What decision-grade means

"Decision-grade" is a bar we hold ourselves to, not a marketing claim. It means a senior engineer should be able to open a tool page and make a real build-vs-buy decision from what they find there, without needing to visit the vendor's site.

Benchmarks we run

Where mature public benchmarks exist with documented, stable methodology (e.g. Aider Polyglot by Paul Gauthier), we attribute and import them. Where they do not exist, we run our own suites; those are version-controlled and results are dated. When a tool updates and the numbers change materially, we rerun and note the delta.

Prices we verify daily

An automated connector polls public pricing pages every 24 hours, hashes the content, and writes back on change. When pricing changes materially, the tool page surfaces a freshness banner. We do not rely on vendor-submitted pricing.
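The hash-and-compare step can be sketched as follows. This is a minimal illustration, not the connector itself: the PricingSnapshot shape, the function names, the choice of SHA-256, and the whitespace normalization are all assumptions made for the example.

```typescript
import { createHash } from "node:crypto";

// Hypothetical record of what a connector might store per tool.
interface PricingSnapshot {
  contentHash: string;
  checkedAt: Date;
  changedAt: Date;
}

// Collapse whitespace before hashing so trivial reflows of the page
// do not register as pricing changes (an assumption, not a documented rule).
function hashPricingContent(pageText: string): string {
  const normalized = pageText.replace(/\s+/g, " ").trim();
  return createHash("sha256").update(normalized, "utf8").digest("hex");
}

// Compare the new hash with the stored one and report whether it changed.
// The first poll for a tool establishes a baseline and is not a change.
function recordPoll(
  previous: PricingSnapshot | null,
  pageText: string,
  now: Date,
): { snapshot: PricingSnapshot; changed: boolean } {
  const contentHash = hashPricingContent(pageText);
  const changed = previous !== null && previous.contentHash !== contentHash;
  const changedAt = previous === null || changed ? now : previous.changedAt;
  return { snapshot: { contentHash, checkedAt: now, changedAt }, changed };
}
```

A caller would write the returned snapshot back only when changed is true, and could use changedAt to decide whether to show the freshness banner.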

Field evidence we curate

Field notes are sourced from postmortems, developer blog posts, and vendor retrospectives. Each note is terse (one to two sentences), dated, and tagged with the context in which the evidence was gathered. We don't summarise testimonials.

Ranking and scoring

Ranking on category and comparison pages is derived from a weighted composite of:

Benchmark performance (task-completion rate, latency, cost-per-task)
Pricing accessibility (free tier, per-seat vs usage-based, self-hostable)
Integration coverage (language ecosystem, CI/CD, IDE)
Freshness (how recently data was verified and field notes were added)
Editorial signal (production usage patterns visible in public postmortems)

Scores are computed at ingestion time and stored in the database. The page template reads the score; it does not compute it. This separation means ranking logic can be audited independently of the renderer.

Current benchmark sources: Aider Polyglot (methodology by Paul Gauthier): a code-editing accuracy benchmark spanning six languages (C++, Go, Java, JavaScript, Python, and Rust). Results are attributed to the source, dated to the run, and will update automatically when new runs are published.

pSEO ranking signals (best-listicle pages)

Best-listicle pages (e.g. /best/ai-coding-agents-for-python) rank tools by a signal score computed at request time against the qualifying tool set on that page:

score = (GH_stars / max_GH_stars)   × 0.5
      + (PH_votes / max_PH_votes)     × 0.3
      + useCaseMatchDepth              × 0.2

GH_stars and PH_votes — normalized per slot

GitHub star count and ProductHunt vote count are each divided by the highest value among qualifying tools on that page, and the result is clamped to [0, 1]. Raw counts are not compared across slots: a tool holding the highest star count in its candidate set scores 1.0 on that axis regardless of absolute number. Snapshots refresh weekly; metric age is shown on each best-listicle page.

useCaseMatchDepth — binary

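Measures how tightly a tool's tag set aligns with the page's use-case slot. Currently binary: a tool either scores 1.0 because it has an explicit tag for the slot (assigned manually, via category-level default, or by description keyword match), or it is excluded from the page entirely. Because every tool that appears on a page scores 1.0 on this axis, the term currently adds a constant 0.2 and does not change relative order within a slot.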
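The formula and its two normalization rules can be sketched as below. The SlotCandidate shape and function names are hypothetical, candidates are assumed to have already passed the completeness gate, and treating a missing metric as 0 is an assumption the text does not state.

```typescript
// Hypothetical input shape; field names are illustrative only.
interface SlotCandidate {
  slug: string;
  ghStars: number | null;
  phVotes: number | null;
  hasSlotTag: boolean; // explicit tag for the page's use-case slot
}

// Divide by the per-slot maximum and clamp to [0, 1].
// A missing metric is treated as 0 here (an assumption).
function normalize(value: number | null, max: number): number {
  if (value === null || max <= 0) return 0;
  return Math.min(1, Math.max(0, value / max));
}

// Compute the signal score for every tagged candidate in one slot.
function signalScores(candidates: SlotCandidate[]): Map<string, number> {
  // Untagged tools are excluded entirely, so they do not influence the maxima.
  const tagged = candidates.filter((c) => c.hasSlotTag);
  const maxStars = Math.max(0, ...tagged.map((c) => c.ghStars ?? 0));
  const maxVotes = Math.max(0, ...tagged.map((c) => c.phVotes ?? 0));

  const scores = new Map<string, number>();
  for (const c of tagged) {
    const useCaseMatchDepth = 1.0; // binary: present on the page means 1.0
    const score =
      normalize(c.ghStars, maxStars) * 0.5 +
      normalize(c.phVotes, maxVotes) * 0.3 +
      useCaseMatchDepth * 0.2;
    scores.set(c.slug, score);
  }
  return scores;
}
```

For example, in a slot where the maxima are 1,000 stars and 100 votes, a tool with 1,000 stars and 50 votes scores 0.5 + 0.15 + 0.2 = 0.85, while a tool with 500 stars and 100 votes scores 0.25 + 0.3 + 0.2 = 0.75.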

Completeness gate (per tool) — ≥7/8 fields required

Only tools with ≥7 of 8 completeness fields qualify for a slot. The 8 fields: slug and name (always present), description, website_url, vendor_name, a primary category assignment, at least one of GitHub stars or ProductHunt votes, and last_verified_at. Tools below the gate are excluded from scoring and from the page regardless of raw star count.
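One way to read the gate is as a count over eight boolean checks. The sketch below assumes a ToolRecord shape; primary_category, gh_stars, and ph_votes are hypothetical field names, and treating blank strings as missing is also an assumption.

```typescript
// Hypothetical record shape. description, website_url, vendor_name, and
// last_verified_at follow the field names in the text; the rest are illustrative.
interface ToolRecord {
  slug: string;
  name: string;
  description?: string | null;
  website_url?: string | null;
  vendor_name?: string | null;
  primary_category?: string | null;
  gh_stars?: number | null;
  ph_votes?: number | null;
  last_verified_at?: string | null;
}

const REQUIRED_FIELDS = 7; // of 8

// A string field counts as present only when it is non-empty after trimming.
function present(value: string | null | undefined): boolean {
  return typeof value === "string" && value.trim().length > 0;
}

function completenessCount(tool: ToolRecord): number {
  const checks: boolean[] = [
    present(tool.slug),
    present(tool.name),
    present(tool.description),
    present(tool.website_url),
    present(tool.vendor_name),
    present(tool.primary_category),
    // One combined field: at least one popularity metric is recorded.
    tool.gh_stars != null || tool.ph_votes != null,
    present(tool.last_verified_at),
  ];
  return checks.filter(Boolean).length;
}

function passesCompletenessGate(tool: ToolRecord): boolean {
  return completenessCount(tool) >= REQUIRED_FIELDS;
}
```

Under this reading a tool may be missing any single field, for example vendor_name, and still qualify; missing two fields excludes it.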

Slot gate — 4 qualifying tools minimum per page

A best-listicle page is only generated when at least 4 tools pass the completeness gate for the category × use-case pair. Pairs below this minimum are not written to the manifest and return 404 rather than exposing thin pages. This gate applies at generation time and again at request time.
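A rough sketch of the slot gate at generation time follows. The QualifiedTool shape, the key format, and both function names are hypothetical; the input is assumed to contain only tools that already passed the completeness gate.

```typescript
// Hypothetical shape for a tool that has already passed the completeness gate.
interface QualifiedTool {
  slug: string;
  category: string;
  useCases: string[];
}

const MIN_TOOLS_PER_SLOT = 4;

// Return the category × use-case keys with enough tools to justify a page.
function buildSlotManifest(tools: QualifiedTool[]): string[] {
  const slots = new Map<string, Set<string>>();
  for (const tool of tools) {
    for (const useCase of tool.useCases) {
      const key = `${tool.category}/${useCase}`;
      const slugs = slots.get(key) ?? new Set<string>();
      slugs.add(tool.slug);
      slots.set(key, slugs);
    }
  }
  return Array.from(slots.entries())
    .filter(([, slugs]) => slugs.size >= MIN_TOOLS_PER_SLOT)
    .map(([key]) => key)
    .sort();
}

// The same membership check can run again at request time;
// a false result would map to a 404 response.
function slotIsServable(manifest: string[], key: string): boolean {
  return manifest.indexOf(key) !== -1;
}
```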

Content-signal gate — noindex when signals are thin

At render time, the page evaluates 4 content signals across all tool entries:

Any entry has pricing data or an open-source license
Any entry has GitHub stars or ProductHunt votes
Any description mentions integration surface (API, SDK, plugin, CLI, workflow)
The entries represent ≥2 distinct vendors

Pages that pass fewer than 3 of 4 signals emit robots: noindex, nofollow and are excluded from the sitemap. The page becomes indexable automatically once tool data quality pushes 3 or more signals to passing; no editorial action is required.
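The four signals and the 3-of-4 threshold might be evaluated along these lines. The ListicleEntry shape and function names are hypothetical, and matching the integration keywords as whole words, case-insensitively, is an assumption about how "mentions" is interpreted.

```typescript
// Hypothetical per-entry shape; field names are illustrative only.
interface ListicleEntry {
  description: string;
  vendorName: string;
  hasPricing: boolean;
  openSourceLicense: string | null;
  ghStars: number | null;
  phVotes: number | null;
}

// Whole-word, case-insensitive match on the integration keywords (an assumption).
const INTEGRATION_PATTERN = /\b(api|sdk|plugin|cli|workflow)s?\b/i;

// Evaluate the four page-level signals in the order the text lists them.
function contentSignals(entries: ListicleEntry[]): boolean[] {
  return [
    entries.some((e) => e.hasPricing || e.openSourceLicense !== null),
    entries.some((e) => e.ghStars !== null || e.phVotes !== null),
    entries.some((e) => INTEGRATION_PATTERN.test(e.description)),
    new Set(entries.map((e) => e.vendorName)).size >= 2,
  ];
}

// Fewer than 3 passing signals means the page is kept out of the index.
function robotsDirective(entries: ListicleEntry[]): string {
  const passed = contentSignals(entries).filter(Boolean).length;
  return passed >= 3 ? "index, follow" : "noindex, nofollow";
}
```

Because the signals are recomputed from tool data on each render, a page flips to indexable as soon as a third signal passes, with no manual step.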

Tier-C tools and the promotion path

Tier-C tools — those not yet in published state or with incomplete data — are excluded from best-listicle slots. When a tool reaches ≥7/8 completeness and is promoted to published state, it appears in rankings on existing best-listicle pages at the next ISR revalidation (≤1 hour). New category × use-case slots not yet in the manifest resolve on first request via dynamicParams, then cache. Position within a slot is determined by signal score, not editorial discretion.
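Assuming the site runs on the Next.js App Router, which the terms ISR and dynamicParams suggest but this page does not state, the revalidation behavior described above would correspond to route segment config roughly like the following. The file path and the manifest contents are hypothetical.

```typescript
// app/best/[slug]/page.tsx (hypothetical path)

// Revalidate cached pages at most once per hour, matching the ≤1 hour window.
export const revalidate = 3600;

// Allow slugs that were not pre-generated to render on first request, then cache.
export const dynamicParams = true;

// Hypothetical manifest contents; a real build would read the generated manifest.
const MANIFEST_SLUGS: string[] = ["ai-coding-agents-for-python"];

// Pre-generate one page per slot already present in the manifest.
export async function generateStaticParams(): Promise<{ slug: string }[]> {
  return MANIFEST_SLUGS.map((slug) => ({ slug }));
}
```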

Tiebreak rule — publishedAt ascending

When two tools produce the same signal score, the tool with the earlier publishedAt date ranks higher. A longer-published tool has had more time to accumulate real adoption evidence than a newer entrant at the same current score. This is a stability proxy — it does not factor in editorial preference, affiliate status, or any criterion outside publishedAt.
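As a comparator, the ordering rule could look like this. The RankedTool shape is hypothetical, and publishedAt is assumed to be an ISO 8601 string.

```typescript
// Hypothetical shape; publishedAt is assumed to be an ISO 8601 timestamp.
interface RankedTool {
  slug: string;
  score: number;
  publishedAt: string;
}

// Higher score first; on an exact tie, the earlier publishedAt ranks higher.
function compareForRanking(a: RankedTool, b: RankedTool): number {
  if (a.score !== b.score) return b.score - a.score;
  return Date.parse(a.publishedAt) - Date.parse(b.publishedAt);
}

// Sort a copy so the caller's array is left untouched.
function rankTools(tools: RankedTool[]): RankedTool[] {
  return [...tools].sort(compareForRanking);
}
```

The tie check uses exact equality, so two scores that differ only by floating-point noise would not be treated as tied; rounding scores before comparison is one way to handle that if it matters.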

Comparison pages (/compare/…) do not use signal scoring. They present two tools side-by-side without applying a ranked order — the reader determines which tool fits their context. A comparison page renders and indexes when both conditions are met: ≥80% data-completeness per tool and at least one shared category. Full template spec: docs/templates/comparison.md.
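The two eligibility conditions can be expressed as a single predicate. In this sketch the ComparableTool shape is hypothetical, and completeness is assumed to be a precomputed fraction between 0 and 1; how that fraction is derived is not specified here.

```typescript
// Hypothetical shape; completeness is assumed to be a fraction in [0, 1].
interface ComparableTool {
  slug: string;
  completeness: number;
  categories: string[];
}

const MIN_COMPLETENESS = 0.8;

// A comparison page renders only when both tools are complete enough
// and they share at least one category.
function canRenderComparison(a: ComparableTool, b: ComparableTool): boolean {
  const bothComplete =
    a.completeness >= MIN_COMPLETENESS && b.completeness >= MIN_COMPLETENESS;
  const bCategories = new Set(b.categories);
  const sharesCategory = a.categories.some((c) => bCategories.has(c));
  return bothComplete && sharesCategory;
}
```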

The full editorial specification for best-listicle pages — covering the ranking algorithm, FAQ generation rules, generator requirements, and sitemap integration — is at docs/templates/best-listicle.md.

Freshness signals

A tool page is flagged stale when any of the following conditions is met:

Pricing content hash changes and the page hasn't been regenerated
No new field note has been added in the past 90 days for a Tier-A tool
A benchmark run is overdue (cadence varies by tool category)
The tool's GitHub metrics (stars, recent commits) have shifted by more than 20% from the last snapshot

Stale tools are queued for re-review. Tier-A pages surface a freshness banner when verified data is older than 48 hours. ISR ensures pages regenerate within the cadence set per tier: 1 hour for Tier-A tool pages, 30 minutes for the homepage and tool index.
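The four staleness conditions could be checked together as sketched below. The FreshnessInput shape, the reason labels, and the function names are hypothetical. Treating a Tier-A tool with no field notes at all as stale, and applying the 20% threshold to stars and recent commits independently, are both assumptions.

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;

// Hypothetical per-tool inputs; none of these field names come from the real schema.
interface MetricPair {
  current: number;
  snapshot: number;
}

interface FreshnessInput {
  tier: "A" | "B" | "C";
  livePricingHash: string; // latest hash from the pricing connector
  renderedPricingHash: string; // hash the current page was generated from
  lastFieldNoteAt: Date | null;
  benchmarkDueAt: Date | null; // null when no run is scheduled
  stars: MetricPair;
  recentCommits: MetricPair;
}

// Relative change against the last snapshot; a zero snapshot is handled explicitly.
function relativeShift(m: MetricPair): number {
  if (m.snapshot === 0) return m.current === 0 ? 0 : Infinity;
  return Math.abs(m.current - m.snapshot) / m.snapshot;
}

// Return every reason a tool page is stale; an empty array means fresh.
function staleReasons(input: FreshnessInput, now: Date): string[] {
  const reasons: string[] = [];
  if (input.livePricingHash !== input.renderedPricingHash) {
    reasons.push("pricing-changed");
  }
  if (input.tier === "A") {
    const last = input.lastFieldNoteAt;
    if (last === null || now.getTime() - last.getTime() > 90 * DAY_MS) {
      reasons.push("field-note-gap");
    }
  }
  if (input.benchmarkDueAt !== null && input.benchmarkDueAt.getTime() < now.getTime()) {
    reasons.push("benchmark-overdue");
  }
  if (relativeShift(input.stars) > 0.2 || relativeShift(input.recentCommits) > 0.2) {
    reasons.push("github-metrics-shift");
  }
  return reasons;
}
```

A non-empty result would place the tool in the re-review queue, and the reason labels give reviewers a starting point.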

Affiliate disclosure

Some tools in the directory participate in affiliate programs. When a reader clicks through to a vendor site from a tool page that has an active affiliate relationship, we may earn a commission.

Non-negotiable rule

Affiliate program participation and commission rate play zero role in a tool's ranking, position on category pages, or visual prominence in the UI. Tools are ranked by benchmark data and editorial criteria only. A tool that pays a high commission but benchmarks poorly ranks below a tool that pays nothing but benchmarks well.

Affiliate relationships are disclosed structurally, not in fine print. Pages with active affiliate links carry a visible disclosure element. If you believe affiliate bias has affected a ranking, email hello@vybing.dev and we'll investigate publicly.

Funding & sources

We pay for all benchmark runs and infrastructure at production rates. We accept no vendor credits, sponsorships, hardware grants, or rebates for benchmark execution. The site is funded by editorial ads and affiliate placements that are clearly labeled.

Most benchmark scores you see on vybing are aggregated from established third-party leaderboards: aider.chat (Aider Polyglot), swebench.com (SWE-bench Lite + Verified), tbench.ai (Terminal-Bench), livecodebench.github.io (LiveCodeBench), and lmarena.ai (Chatbot Arena Elo). Each benchmark links to its source leaderboard where available; the methodology and run dates are theirs, not ours.

If you find a benchmark row that disagrees with the source leaderboard by more than rounding error, file an issue at github.com/theLifeOfLewis/vybing/issues. We will reconcile within 48 hours.

Corrections and disputes

Data errors happen. If you spot incorrect pricing, a stale benchmark, a field note that misrepresents a tool, or any other factual error, email hello@vybing.dev with a link to the page, the specific claim, and the correct information. We respond within two business days.

Vendors may not submit tool data directly. All data is gathered through automated connectors or verified by the editorial team. This separation is intentional: it prevents tools from self-promoting into higher tiers.

Last updated · 2026-05-23


How we rank AI dev tools