The score ledger

Changes

Every benchmark score on Modelyst is tracked in an append-only ledger. When a number moves between refreshes — a re-evaluation, a silent endpoint update, a harness change upstream — it lands here, with the old value, the new value, and the date. Last refresh: Jul 27, 2026.
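The append-only behaviour described above can be sketched as a small data structure. This is a minimal illustration, not Modelyst's implementation: the names `LedgerEntry`, `ScoreLedger`, and `record` are hypothetical, and treating an unchanged score as "no entry" is an assumption.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class LedgerEntry:
    """One immutable ledger row (hypothetical shape)."""
    model: str
    benchmark: str
    old: Optional[float]  # None marks a first observation
    new: float
    observed: date

    @property
    def delta(self) -> Optional[float]:
        # Rounded to one decimal, matching how the ledger displays changes.
        return None if self.old is None else round(self.new - self.old, 1)


class ScoreLedger:
    """Append-only: entries are only ever added, never edited or removed."""

    def __init__(self) -> None:
        self._entries: list[LedgerEntry] = []
        self._latest: dict[tuple[str, str], float] = {}

    def record(self, model: str, benchmark: str, score: float,
               observed: date) -> Optional[LedgerEntry]:
        key = (model, benchmark)
        previous = self._latest.get(key)
        if previous is not None and previous == score:
            return None  # assumption: an unchanged score produces no entry
        entry = LedgerEntry(model, benchmark, previous, score, observed)
        self._entries.append(entry)
        self._latest[key] = score
        return entry

    def entries(self) -> tuple[LedgerEntry, ...]:
        # Return a copy so callers cannot mutate the history.
        return tuple(self._entries)
```

For example, recording 14 and then 26.6 for Llama 3.2 Instruct 1B on MATH yields a first observation followed by a change with `delta == 12.6`; recording 26.6 again adds nothing.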

Capability · price · speed drift — last 30 days

GPT-5 mini · cap -5.1 · -102 tok/s
Qwen3.5 4B · cap -5.0 · +6 tok/s
Qwen3.5 2B · cap -4.2 · price −0.04 · -36 tok/s
GPT-5.5 Instant (May 2026) · cap +2.7
LFM2 2.6B · cap +2.4 · -336 tok/s
Qwen3.5 0.8B · cap -1.8 · price −0.02 · -30 tok/s
MiMo-V2-Pro · cap -1.8 · price −1.5 · -48 tok/s
Mercury 2 · cap -1.8 · +113 tok/s
Hy3-preview · cap -1.8 · -41 tok/s
DiffusionGemma 26B A4B · cap -1.8
Mistral Medium 3.5 · cap -1.8 · -32 tok/s
Qwen3.5 Omni Plus · cap -1.8 · -3 tok/s

Jul 27, 2026

35 changes · 34 first observations

Qwen3.5 2B · GPQA Diamond · 45.6 → 32.5 · ▼ 13.1
Qwen3.5 4B · GPQA Diamond · 77.1 → 68 · ▼ 9.1
Ministral 3 3B · GPQA Diamond · 35.8 → 30.1 · ▼ 5.7
Gemma 4 E4B · GPQA Diamond · 57.6 → 52.2 · ▼ 5.4
LFM2.5-8B-A1B · GPQA Diamond · 51.3 → 46.6 · ▼ 4.7
Gemma 4 E4B · IFBench · 44.2 → 40.6 · ▼ 3.6
Granite 4.0 350M · GPQA Diamond · 20.3 → 23.7 · ▲ 3.4
Granite 4.0 H 350M · GPQA Diamond · 25.7 → 29 · ▲ 3.3
Granite 4.0 Micro · GPQA Diamond · 33.6 → 30.3 · ▼ 3.3
Llama 3.2 Instruct 1B · GPQA Diamond · 17.7 → 20.8 · ▲ 3.1
Llama 3.2 Instruct 3B · GPQA Diamond · 25.5 → 22.4 · ▼ 3.1
LFM2.5-1.2B-Instruct · GPQA Diamond · 32.6 → 29.5 · ▼ 3.1
Granite 4.0 350M · MATH · 11.2 → 14 · ▲ 2.8
LFM2.5-1.2B-Instruct · IFBench · 43.8 → 41 · ▼ 2.8
Gemma 4 E2B · GPQA Diamond · 43.3 → 40.5 · ▼ 2.8
Granite 4.0 Micro · IFBench · 24.8 → 22.1 · ▼ 2.7
Ministral 3 3B · IFBench · 26.8 → 24.1 · ▼ 2.7
LFM2.5-8B-A1B · IFBench · 55.6 → 53.3 · ▼ 2.3
LFM2 2.6B · GPQA Diamond · 29.2 → 31.5 · ▲ 2.3
Gemma 4 E2B · IFBench · 38 → 36 · ▼ 2.0
Llama 3.2 Instruct 3B · MATH · 48.9 → 47 · ▼ 1.9
Qwen3.5 0.8B · MATH · 11.1 → 9.3 · ▼ 1.8
Qwen3.5 4B · IFBench · 52 → 50.2 · ▼ 1.8
Granite 4.0 350M · IFBench · 16.8 → 15.1 · ▼ 1.7
Granite 4.0 H 1B · GPQA Diamond · 26.3 → 24.6 · ▼ 1.7
LFM2 2.6B · MATH · 76.8 → 75.3 · ▼ 1.5
Qwen3.5 2B · IFBench · 31.5 → 30.4 · ▼ 1.1
Llama 3.2 Instruct 1B · MATH · 26.6 → 27.7 · ▲ 1.1
Granite 4.0 H 1B · IFBench · 26.2 → 25.1 · ▼ 1.1
Llama 3.2 Instruct 3B · IFBench · 26.2 → 25.2 · ▼ 1.0
Llama 3.2 Instruct 1B · IFBench · 22.7 → 23.5 · ▲ 0.8
LFM2 2.6B · IFBench · 26.5 → 25.7 · ▼ 0.8
Qwen3.5 0.8B · IFBench · 21.6 → 20.9 · ▼ 0.7
Qwen3.5 0.8B · GPQA Diamond · 11.4 → 12 · ▲ 0.6
Granite 4.0 H 350M · IFBench · 17.6 → 17.1 · ▼ 0.5

Jul 20, 2026

10 changes · 19 first observations

Llama 3.2 Instruct 1B · MATH · 14 → 26.6 · ▲ 12.6
LFM2 2.6B · IFBench · 19.5 → 26.5 · ▲ 7.0
Granite 4.0 350M · GPQA Diamond · 26.1 → 20.3 · ▼ 5.8
Llama 3.2 Instruct 1B · GPQA Diamond · 19.6 → 17.7 · ▼ 1.9
LFM2 2.6B · GPQA Diamond · 30.6 → 29.2 · ▼ 1.4
Granite 4.0 350M · IFBench · 15.9 → 16.8 · ▲ 0.9
Qwen3.5 0.8B · GPQA Diamond · 11.1 → 11.4 · ▲ 0.3
Llama 3.2 Instruct 1B · IFBench · 22.8 → 22.7 · ▼ 0.1
Qwen3.5 0.8B · IFBench · 21.5 → 21.6 · ▲ 0.1
North Mini Code · Humanity's Last Exam · 9.9 → 10 · ▲ 0.1

Jul 6, 2026

14 changes · 16 first observations

GPT-5 mini · LiveCodeBench · 83.8 → 69.2 · ▼ 14.6
GPT-5.5 Instant (May 2026) · AA Long-Context Reasoning · 55.7 → 64 · ▲ 8.3
GPT-5 mini · AIME · 90.7 → 85 · ▼ 5.7
GPT-5 mini · Humanity's Last Exam · 19.7 → 14.6 · ▼ 5.1
GPT-5 mini · Terminal-Bench Hard · 33.3 → 28.8 · ▼ 4.5
GPT-5 mini · IFBench · 75.4 → 71.2 · ▼ 4.2
GPT-5 mini · τ²-bench · 68.4 → 71.1 · ▲ 2.7
GPT-5 mini · GPQA Diamond · 82.8 → 80.3 · ▼ 2.5
GPT-5.5 Instant (May 2026) · GPQA Diamond · 84.6 → 82.3 · ▼ 2.3
GPT-5 mini · AA Long-Context Reasoning · 68 → 66 · ▼ 2.0
GPT-5 mini · SciCode · 39.2 → 41 · ▲ 1.8
GPT-5.5 Instant (May 2026) · Humanity's Last Exam · 20.3 → 18.6 · ▼ 1.7
GPT-5.5 Instant (May 2026) · SciCode · 50.3 → 48.6 · ▼ 1.7
GPT-5 mini · MMLU-Pro · 83.7 → 82.8 · ▼ 0.9

Jun 29, 2026

10 changes · 10 first observations

Nova 2.0 Lite · AIME · 94.3 → 88.7 · ▼ 5.6
Nova 2.0 Lite · LiveCodeBench · 71.1 → 66.3 · ▼ 4.8
Nova 2.0 Lite · GPQA Diamond · 81.1 → 76.8 · ▼ 4.3
Nova 2.0 Lite · AA Long-Context Reasoning · 55.3 → 58.3 · ▲ 3.0
Nova 2.0 Lite · τ²-bench · 72.8 → 75.7 · ▲ 2.9
Nova 2.0 Lite · Humanity's Last Exam · 10.9 → 8.6 · ▼ 2.3
Nova 2.0 Lite · IFBench · 70.7 → 68.5 · ▼ 2.2
Nova 2.0 Lite · Terminal-Bench Hard · 16.7 → 17.4 · ▲ 0.7
Nova 2.0 Lite · MMLU-Pro · 81.8 → 81.3 · ▼ 0.5
Nova 2.0 Lite · SciCode · 36.9 → 36.8 · ▼ 0.1
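
Each entry above pairs an old and new value with a direction marker and an absolute delta, and within each date the entries appear to be ordered by the size of the change, largest first. The sketch below reproduces that arithmetic and ordering. The function names `format_change` and `rank_changes` are hypothetical, and the one-decimal rounding is inferred from the displayed values rather than documented.

```python
def format_change(old: float, new: float) -> str:
    """Render 'old → new ▲/▼ delta' the way the ledger rows show it."""
    delta = round(new - old, 1)  # one decimal avoids float noise like 13.100000000000001
    if delta == 0:
        raise ValueError("scores are equal at one decimal; not a change")
    arrow = "▲" if delta > 0 else "▼"
    # ':g' drops a trailing '.0' on scores (68, not 68.0); deltas keep one decimal.
    return f"{old:g} → {new:g} {arrow} {abs(delta):.1f}"


def rank_changes(changes: list[tuple[str, str, float, float]]) -> list[tuple[str, str, float, float]]:
    """Sort (model, benchmark, old, new) tuples by absolute change, largest first."""
    return sorted(changes, key=lambda c: abs(round(c[3] - c[2], 1)), reverse=True)


# Three rows taken from the Jul 20, 2026 block, deliberately out of order.
rows = [
    ("Granite 4.0 350M", "GPQA Diamond", 26.1, 20.3),
    ("Llama 3.2 Instruct 1B", "MATH", 14, 26.6),
    ("LFM2 2.6B", "IFBench", 19.5, 26.5),
]
for model, benchmark, old, new in rank_changes(rows):
    print(model, benchmark, format_change(old, new))
```

Ties in the rounded delta keep their input order here because Python's sort is stable; how the ledger itself breaks ties is not stated.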

Source of each value: see the score's provenance on its model page. Scores via Artificial Analysis are medians across providers; changes can reflect re-evaluation, endpoint updates, or methodology changes upstream.
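
To illustrate why a median across providers can move when only one endpoint changes, here is a minimal sketch. The provider names and per-provider scores are invented for illustration and do not come from the ledger; `headline_score` is a hypothetical helper, and rounding to one decimal is an assumption.

```python
from __future__ import annotations

from statistics import median


def headline_score(per_provider: dict[str, float]) -> float:
    """Median of per-provider scores, rounded to one decimal (assumed display precision)."""
    if not per_provider:
        raise ValueError("no provider scores to aggregate")
    return round(median(per_provider.values()), 1)


# Hypothetical per-provider results for one model on one benchmark.
before = {"provider_a": 79.4, "provider_b": 80.1, "provider_c": 82.3}
# Only provider_c changes (for example, after an endpoint update).
after = {"provider_a": 79.4, "provider_b": 80.1, "provider_c": 78.0}

print(headline_score(before), "→", headline_score(after))
```

In this made-up case the headline value shifts from 80.1 to 79.4 even though two of the three endpoints are unchanged, which is one way a ledger entry can appear without any re-evaluation of the model itself.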