The contract

Methodology

Every number on Modelyst carries a source, a date, and a method. This page is the method. Data last verified against its live sources on Jul 27, 2026.

Benchmark scores

Frontier benchmark scores (MMLU-Pro, GPQA Diamond, HLE, LiveCodeBench, SciCode, AIME, IFBench, τ²-bench, Terminal-Bench Hard, AA-LCR and friends) come from Artificial Analysis, who run the evaluations themselves with a consistent harness, and are refreshed here weekly. They're shown on a 0–100 scale. Scores from other sources (model cards, leaderboards, community submissions) are labeled with their provenance on every row — hover any score for its source and date. Community-submitted scores require a public source URL and are marked community; they never overwrite a measured value.

Prices — where they come from

Per-token prices are medians across the API providers serving each model, via Artificial Analysis — not list prices from a single vendor page, which is why a number here can differ slightly from any one provider's pricing page.
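To illustrate why a median can differ from every individual pricing page, here is a minimal sketch. The provider names and prices are hypothetical, not measured values.

```python
from statistics import median

# Hypothetical input prices (USD per 1M tokens) for one model
# as served by four different API providers.
provider_prices = {
    "provider_a": 2.50,
    "provider_b": 2.75,
    "provider_c": 3.25,
    "provider_d": 4.00,
}

# With an even number of providers the median is the mean of the two
# middle prices, so it need not equal any single provider's price.
median_price = median(provider_prices.values())
```

In this example the median falls between the two middle providers, so the displayed figure matches none of the four pricing pages exactly.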

The headline $/1M figure is the blended 3:1 price: three parts input to one part output, or (3 × input price + output price) ÷ 4 per 1M tokens. Example: a model priced $5 in / $30 out blends to (3×5 + 30) ÷ 4 = $11.25/1M. Model pages always show the raw input and output prices alongside the blend. A price of — means no measured price, never “free.”
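The blend formula above can be sketched as follows. The function names are illustrative, and treating a missing input or output price as "no blend" is an assumption about how unmeasured values propagate.

```python
from typing import Optional

def blended_price(input_price: Optional[float],
                  output_price: Optional[float]) -> Optional[float]:
    """Blend per-1M-token prices at three parts input to one part output."""
    if input_price is None or output_price is None:
        # Assumption: if either side is unmeasured, the blend is unmeasured too.
        return None
    return (3 * input_price + output_price) / 4

def format_price(price: Optional[float]) -> str:
    """Render a price; an unmeasured price shows as a dash, never as $0."""
    if price is None:
        return "\u2014"  # the dash placeholder used for unmeasured values
    return f"${price:.2f}/1M"
```

Calling `format_price(blended_price(5, 30))` reproduces the worked example, `$11.25/1M`, while `format_price(blended_price(None, 30))` yields the dash rather than a zero.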

Speed & latency

tok/s is the median output tokens per second across providers; TTFT is median time to first token. For reasoning models the latency figures include thinking time, which is why a frontier reasoning model can show latency of a minute or more. Both via Artificial Analysis; unmeasured values display as —.

The capability score

Modelyst's CAP (0–100) is a cross-benchmark percentile: for each benchmark, every model's best score becomes a percentile among all measured models; a model's capability score is its percentile averaged across the benchmarks it's measured on (minimum two), weighted by log-coverage so broadly-run benchmarks anchor the index. It answers “how does this model rank against the field,” not “what did it score on test X” — click any model for the underlying per-benchmark scores. Per-capability league tables apply the same percentile method scoped to one capability's benchmarks.
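One way to read the description above is sketched below. Two details are assumptions, since the text does not pin them down: a percentile is taken as the share of measured models scoring at or below a given model, and the log-coverage weight is taken as log(1 + number of models measured on the benchmark). The function names are illustrative.

```python
import math
from collections import defaultdict

def percentile_ranks(scores):
    """Map model -> percentile (0-100) within one benchmark.

    Assumption: percentile = share of measured models whose best score
    is at or below this model's best score.
    """
    values = list(scores.values())
    n = len(values)
    return {
        model: 100.0 * sum(v <= score for v in values) / n
        for model, score in scores.items()
    }

def capability_scores(best_scores, min_benchmarks=2):
    """best_scores: {benchmark: {model: best score}} -> {model: score 0-100}."""
    weighted_sum = defaultdict(float)
    weight_total = defaultdict(float)
    benchmark_count = defaultdict(int)

    for scores in best_scores.values():
        # Assumed log-coverage weight: broadly-run benchmarks count for more.
        weight = math.log(1 + len(scores))
        for model, pct in percentile_ranks(scores).items():
            weighted_sum[model] += weight * pct
            weight_total[model] += weight
            benchmark_count[model] += 1

    # Models measured on fewer than min_benchmarks benchmarks get no score.
    return {
        model: weighted_sum[model] / weight_total[model]
        for model in weighted_sum
        if benchmark_count[model] >= min_benchmarks
    }

# Hypothetical data: model "d" is measured on only one benchmark.
example = {
    "bench_x": {"a": 90, "b": 80, "d": 70},
    "bench_y": {"a": 60, "b": 75},
}
caps = capability_scores(example)
```

In this toy data, model `d` receives no capability score because it falls under the two-benchmark minimum, and `b` edges out `a` because it ranks well on both benchmarks rather than topping only one.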

Benchmark health

Each benchmark page carries a health badge derived from the top-10 mean of best scores: saturating (≥88; frontier models cluster at the ceiling and differences stop meaning much), active (≥65 and below 88), headroom (below 65). Saturation is the strongest known failure mode of public benchmarks; we surface it instead of silently keeping stale numbers.
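The badge thresholds can be sketched as a small function. The function name is illustrative, and the handling of benchmarks with fewer than ten measured models (average whatever is available) or none at all (no badge) is an assumption.

```python
from typing import Iterable, Optional

def health_badge(best_scores: Iterable[float], top_n: int = 10) -> Optional[str]:
    """Classify a benchmark from the mean of its top-N best scores (0-100)."""
    top = sorted(best_scores, reverse=True)[:top_n]
    if not top:
        return None  # assumption: no measured models means no badge
    top_mean = sum(top) / len(top)
    if top_mean >= 88:
        return "saturating"
    if top_mean >= 65:
        return "active"
    return "headroom"
```

Only the top of the field matters here: a benchmark where ten models score 95 is saturating even if dozens of weaker models score far lower.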

The score ledger

Modelyst keeps an append-only ledger: every observed change to any score — re-evaluations, silent endpoint updates, upstream methodology changes — is recorded with the old value, new value, and date, and shown on Changes, on model pages, and on benchmark pages. Headline metrics (capability, indexes, price, speed) are snapshotted on every refresh. Nothing is rewritten in place.
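As an illustration of the append-only idea, here is a minimal in-memory sketch. The class and field names are hypothetical, not Modelyst's actual schema, and recording a first observation with an old value of `None` is an assumption.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List, Optional, Tuple

@dataclass(frozen=True)  # frozen: an entry cannot be modified once written
class LedgerEntry:
    model: str
    benchmark: str
    old_value: Optional[float]
    new_value: float
    observed_on: date

class ScoreLedger:
    """Append-only record of score changes; entries are never edited or removed."""

    def __init__(self) -> None:
        self._entries: List[LedgerEntry] = []
        self._current: Dict[Tuple[str, str], float] = {}

    def observe(self, model: str, benchmark: str, value: float,
                observed_on: date) -> Optional[LedgerEntry]:
        """Record a score if it differs from the last observed value."""
        key = (model, benchmark)
        old = self._current.get(key)
        if old == value:
            return None  # unchanged score: nothing to append
        entry = LedgerEntry(model, benchmark, old, value, observed_on)
        self._entries.append(entry)
        self._current[key] = value
        return entry

    def history(self, model: str, benchmark: str) -> List[LedgerEntry]:
        """All recorded changes for one model on one benchmark, oldest first."""
        return [e for e in self._entries
                if (e.model, e.benchmark) == (model, benchmark)]

# Hypothetical usage: a score is observed, re-observed unchanged, then revised.
ledger = ScoreLedger()
ledger.observe("model-x", "bench-y", 71.2, date(2026, 7, 13))
ledger.observe("model-x", "bench-y", 71.2, date(2026, 7, 20))
ledger.observe("model-x", "bench-y", 73.0, date(2026, 7, 27))
changes = ledger.history("model-x", "bench-y")
```

The revision is stored as a new entry carrying both the old and the new value, so the earlier number stays visible in the history instead of being overwritten.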

Papers

The papers corpus comes from arXiv (every paper links back to its arXiv page — we never host PDFs), with the curated Notable view from Hugging Face daily papers (community-upvoted, refreshed daily) and citation counts from OpenAlex.

What we don't do

No pay-for-placement, no vendor-submitted numbers presented as measured, no deleting history. Where data is thin or a benchmark is saturating, the page says so. Corrections: contribute with a source.