Benchmark data · Updated Feb 2026

We benchmarked 50 AI models
so you don't have to.

50 models from 13 providers, each tested against 25 hand-crafted extraction tasks across e-commerce, legal, medical, finance, and logic domains. Every response is scored deterministically with field-level accuracy - a field passes when its embedding similarity to the expected value is ≥ 80%.

1,250 real API calls, scored deterministically. No cherry-picked examples - every model runs every task under identical conditions.
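
The field-level scoring rule can be sketched roughly as below. This is a minimal stand-in, not the production scorer: the hypothetical `field_similarity` uses string similarity in place of the embedding model the benchmark actually uses.

```python
# Sketch of field-level scoring with an 80% pass threshold.
# Assumption: field_similarity is a stand-in for embedding cosine similarity.
from difflib import SequenceMatcher

def field_similarity(expected: str, actual: str) -> float:
    # Returns 0.0-1.0; the real benchmark compares embeddings instead.
    return SequenceMatcher(None, expected.lower(), actual.lower()).ratio()

def score_response(expected: dict, actual: dict, threshold: float = 0.80) -> float:
    # A field passes when its similarity meets the threshold;
    # the task score is the fraction of expected fields that pass.
    passed = 0
    for field, want in expected.items():
        got = actual.get(field)
        if got is not None and field_similarity(str(want), str(got)) >= threshold:
            passed += 1
    return passed / len(expected)
```

Because scoring is keyed on the expected fields, a response missing a field (or using a different field name) simply fails that field rather than erroring out.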

About “cost per 1k tasks”: This is the estimated USD to run 1,000 real extraction tasks like these through each provider's API - calculated from actual token usage during the benchmark. Not abstract per-token pricing. A single “task” is one structured extraction from a real-world document (a product listing, a legal clause, a medical report).
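
The estimate works out roughly as follows (illustrative sketch with made-up prices and token counts; real numbers come from each provider's published per-token rates and the token usage observed during the benchmark):

```python
# Sketch of the "cost per 1k tasks" estimate: total benchmark spend
# from observed token usage, scaled to 1,000 tasks.
def cost_per_1k_tasks(prompt_tokens: int, completion_tokens: int,
                      input_price_per_m: float, output_price_per_m: float,
                      n_tasks: int) -> float:
    spend = (prompt_tokens / 1e6) * input_price_per_m \
          + (completion_tokens / 1e6) * output_price_per_m
    return spend / n_tasks * 1000

# e.g. 25 tasks using 50k prompt tokens and 20k completion tokens,
# at $0.10 / $0.40 per million tokens (illustrative prices):
print(round(cost_per_1k_tasks(50_000, 20_000, 0.10, 0.40, 25), 2))  # 0.52
```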

50
Models tested
1,250
Total evaluations
13
Providers
92.6%
Best accuracy
Key Takeaways

What the data says.

Six observations from benchmarking 50 models across 25 extraction tasks - focused on what actually matters: accuracy, speed, and cost.

⚡ Fast & accurate
Grok 4 Fast - 1.4s avg

The fastest model with ≥80% accuracy. Ranks #16 at 89.6% - and it's eco-priced at $0.25/1k tasks.

💰 Cheap & accurate
Gemini 2.0 Flash - $0.17/1k tasks

Cheapest model to break 90% accuracy. Ranks #11 at 91.2% - 35× cheaper than the #1 model with only 1.4pp less accuracy.

🏅 Best value
DeepSeek V3.2 - #9 overall

Highest-ranked eco-tier model at 91.7% accuracy for just $0.24/1k tasks. Outperforms multiple ultra-tier models that cost 41× more.

💸 Price ≠ performance
$4.25/1k tasks → #17

Command A costs $4.25/1k tasks but ranks #17. Meanwhile Grok 3 Mini ranks #5 at just $0.30/1k tasks.

🕐 The speed gap
68× between fastest and slowest

Grok 4 Fast averages 1.4s. DeepSeek R1 averages 97s. For batch workloads, speed compounds - choose wisely.

🏆 #1 overall
Grok 4 - 92.6%

Top accuracy at $6.00/1k tasks and 34.6s avg latency. But the top 10 is separated by just 1.0pp - the real difference is cost and speed.

Full Rankings

Leaderboard

Ranked by accuracy · Click any model to dive deeper

#  | Model                 | Accuracy | Cost/1k tasks
1  | Grok 4                | 92.6%    | $6.00
2  | Gemini 2.5 Pro        | 92.5%    | $3.63
3  | Gemini 3 Pro          | 92.5%    | $4.60
4  | Claude Opus 4.6       | 92.5%    | $10.00
5  | Grok 3 Mini           | 92.2%    | $0.30
6  | Gemini 3 Flash        | 92.2%    | $1.15
7  | Claude Sonnet 4.6     | 92.2%    | $6.00
8  | Grok 3                | 92.2%    | $6.00
9  | DeepSeek V3.2         | 91.7%    | $0.24
10 | Gemini 2.5 Flash      | 91.6%    | $0.90
11 | Gemini 2.0 Flash      | 91.2%    | $0.17
12 | Claude 3.5 Haiku      | 91.2%    | $1.60
13 | Gemini 2.5 Flash Lite | 91.1%    | $0.17
14 | Mistral Large 3       | 90.0%    | $0.70
15 | Claude Haiku 4.5      | 89.7%    | $2.00
16 | Grok 4 Fast           | 89.6%    | $0.25
17 | Command A             | 88.6%    | $4.25
18 | Qwen 3 32B            | 88.4%    | $0.14
19 | Mistral Small         | 88.4%    | $0.14
20 | MiniMax M2.5          | 88.1%    | $0.51
21 | Nova Pro              | 87.5%    | $1.36
22 | Claude 3 Haiku        | 87.1%    | $0.50
23 | Gemini 2.0 Flash Lite | 86.7%    | $0.13
24 | Nova Lite             | 85.7%    | $0.10
25 | GPT-4.1               | 82.5%    | $3.40
26 | GPT-5 Nano            | 82.1%    | $0.14
27 | GPT-5                 | 81.7%    | $3.63
28 | o3                    | 81.7%    | $3.40
29 | GPT-4o                | 81.7%    | $4.25
30 | GPT-5 Mini            | 81.3%    | $0.72
31 | o3 Mini               | 80.9%    | $1.87
32 | o4 Mini               | 80.6%    | $1.87
33 | GPT-4o Mini           | 79.5%    | $0.26
34 | GPT-4.1 Mini          | 77.7%    | $0.68
35 | GPT-4.1 Nano          | 77.5%    | $0.17
36 | Llama 4 Scout         | 74.4%    | $0.13
37 | Llama 3.1 8B          | 59.1%    | $0.03
38 | Nova Micro            | 56.3%    | $0.06
39 | Magistral Small       | 51.1%    | $0.70
40 | Magistral Medium      | 49.3%    | $2.50
41 | Qwen 3.5 Plus         | 29.3%    | $0.92
42 | DeepSeek R1           | 25.9%    | $0.90
43 | DeepSeek V3.1         | 24.6%    | $0.34
44 | DeepSeek V3           | 24.0%    | $0.62
45 | Kimi K2               | 22.5%    | $0.85
46 | Llama 4 Maverick      | 21.9%    | $0.26
47 | Qwen 3.5 Flash        | 21.6%    | $0.17
48 | Llama 3.3 70B         | 21.6%    | $0.58
49 | Seed 1.8              | 19.6%    | $0.72
50 | Qwen 3 14B            | 15.5%    | $0.10

Click any column header to sort. Click a model to see its head-to-head benchmark. Cost = estimated USD to run 1,000 extraction tasks like these via each provider's API - not token pricing.

Tradeoffs

Accuracy vs cost & speed

Higher accuracy doesn't always mean higher cost. Some models punch above their weight - hovering reveals details.

Accuracy vs Cost per 1k Tasks

[Scatter chart: accuracy (11%–95%) vs cost per 1,000 tasks ($0.00–$11.00 USD); points colored by pricing tier: eco / pro / ultra]

Accuracy vs Latency

[Scatter chart: accuracy (11%–95%) vs avg latency (0 ms–107.1 s); points colored by pricing tier: eco / pro / ultra]
Stress Test

The toughest tasks

These tasks had the lowest average scores across all 50 models. See which models held up - and which ones completely fell apart.

Legal · Difficulty 5/5

Indemnification Cap Extraction

42%
avg score

Extract the indemnification liability cap from dense contractual language with multiple competing dollar figures.

Why it's hard: Must correctly distinguish the GENERAL liability cap ($500,000) from the INDEMNIFICATION cap ($2,000,000). Must identify both exceptions to the consequential damages waiver. Models often conflate the two caps or miss the 'notwithstanding' override.

24 passed (≥80%) · 25 scored 0% · 23 returned content but used wrong field names · 2 returned nothing
Expected output (7 fields)
{
  "general_liability_cap": "$500,000",
  "general_liability_basis": "Greater of 12-month fees or $500,000",
  "indemnification_cap": "$2,000,000",
  "indemnification_section": "9.1",
  "consequential_damages_excluded": true,
  "consequential_damages_exceptions": [
    "Confidentiality breaches",
    "Willful misconduct"
  ],
  "liability_cap_exceptions": [
    "Indemnification under Section 9.1",
    "Confidentiality under Section 12"
  ]
}
#  | Model                 | Score
1  | Claude 3.5 Haiku      | 94%
2  | Gemini 2.0 Flash      | 92%
3  | Nova Lite             | 91%
4  | Nova Pro              | 91%
5  | Gemini 2.0 Flash Lite | 90%
6  | Claude 3 Haiku        | 90%
7  | Gemini 2.5 Flash Lite | 89%
8  | MiniMax M2.5          | 86%
9  | Claude Sonnet 4.6     | 86%
10 | Gemini 2.5 Flash      | 84%
11 | Gemini 2.5 Pro        | 84%
12 | Gemini 3 Pro          | 84%
13 | Claude Opus 4.6       | 84%
14 | DeepSeek V3.2         | 82%
15 | Mistral Large 3       | 82%
16 | Grok 3 Mini           | 81%
17 | Claude Haiku 4.5      | 81%
18 | Grok 4                | 81%
19 | Qwen 3 32B            | 80%
20 | Mistral Small         | 80%
21 | Grok 4 Fast           | 80%
22 | Gemini 3 Flash        | 80%
23 | Command A              | 80%
24 | Grok 3                | 80%
25 | Nova Micro            | 77%
26 | Llama 3.1 8B          | 0%
27 | GPT-5 Nano            | 0%
28 | Qwen 3 14B            | 0%
29 | Llama 4 Scout         | 0%
30 | Qwen 3.5 Flash        | 0%
31 | GPT-4.1 Nano          | 0%
32 | GPT-4o Mini           | 0%
33 | Llama 4 Maverick      | 0%
34 | DeepSeek V3.1         | 0%
35 | GPT-5 Mini            | 0%
36 | Seed 1.8              | 0%
37 | GPT-4.1 Mini          | 0%
38 | Qwen 3.5 Plus         | 0%
39 | DeepSeek R1           | 0%
40 | Kimi K2               | 0%
41 | Magistral Small       | 0%
42 | Llama 3.3 70B         | 0%
43 | DeepSeek V3           | 0%
44 | o3 Mini               | 0%
45 | o4 Mini               | 0%
46 | GPT-5                 | 0%
47 | Magistral Medium      | 0%
48 | GPT-4.1               | 0%
49 | o3                    | 0%
50 | GPT-4o                | 0%

Click any model with ▶ to inspect its raw output. Models that scored 0% with “wrong field names” returned correct content but used a different schema - our scoring compares field-by-field against the expected structure.
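
The "wrong field names" failure mode boils down to this (minimal sketch, not the production scorer; `indemnity_limit` is a made-up example of a renamed key):

```python
# Scoring is keyed on the expected schema's exact field names:
# a field is only compared when the actual output uses the same key.
expected = {"indemnification_cap": "$2,000,000"}

good = {"indemnification_cap": "$2,000,000"}  # matching key -> field is compared
bad  = {"indemnity_limit": "$2,000,000"}      # same content, wrong key -> scores 0

def fields_scored(expected: dict, actual: dict) -> int:
    # Count expected fields that are even eligible for comparison.
    return sum(1 for k in expected if k in actual)

print(fields_scored(expected, good), fields_scored(expected, bad))  # 1 0
```

This is why a model can extract every value correctly and still score 0% on a task: the content never gets compared because it lives under the wrong keys.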

Head to Head

Compare any two models

Pick two models and see how they stack up across all 25 tasks - with the actual outputs, field-level scores, and scoring breakdown.

Your data is different

These are our tasks.
Yours will tell a different story.

Upload your own dataset and run the same evaluation on any of 164+ models. No API keys, no integration work. Just upload, pick models, and get results in minutes.