50 models from 13 providers, each tested against 25 hand-crafted extraction tasks across e-commerce, legal, medical, finance, and logic domains. Every response scored deterministically with field-level accuracy - a field passes when embedding similarity ≥ 80%.
1,250 real API calls, scored deterministically. No cherry-picked examples - every model runs every task under identical conditions.
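The field-level scoring above can be sketched as follows. This is an illustrative reconstruction, not the benchmark's actual code: `score_response` and `toy_embed` are hypothetical names, and the character-frequency "embedding" is a stand-in for a real embedding model. The key idea is the same - each extracted field passes when its embedding similarity to the expected value clears the 0.80 bar, and a response's score is the fraction of expected fields that pass.

```python
import math

def cosine(a, b):
    # Cosine similarity between two vectors; 0.0 if either is all-zero.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def toy_embed(text):
    # Stand-in for a real embedding model: a character-frequency vector.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def score_response(expected, actual, embed, threshold=0.80):
    # Fraction of expected fields whose value is similar enough to pass.
    passed = 0
    for field, want in expected.items():
        got = actual.get(field)
        if got is not None and cosine(embed(str(want)), embed(str(got))) >= threshold:
            passed += 1
    return passed / len(expected)

# A matching field passes; a missing field fails.
print(score_response({"basis": "confidentiality"},
                     {"basis": "confidentiality"}, toy_embed))  # → 1.0
print(score_response({"basis": "confidentiality"}, {}, toy_embed))  # → 0.0
```

Note that similarity-based matching (rather than exact string equality) is what lets near-miss phrasings like "Confidentiality breach" vs. "Confidentiality breaches" still earn credit.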
About “cost per 1k tasks”: This is the estimated USD to run 1,000 real extraction tasks like these through each provider's API - calculated from actual token usage during the benchmark. Not abstract per-token pricing. A single “task” is one structured extraction from a real-world document (a product listing, a legal clause, a medical report).
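The arithmetic behind that metric is straightforward. A minimal sketch, assuming per-million-token provider pricing (the function name and the example prices below are illustrative, not any provider's actual rates):

```python
def cost_per_1k_tasks(input_tokens, output_tokens, n_tasks,
                      usd_per_m_input, usd_per_m_output):
    # Total USD for the benchmark run, from measured token usage...
    usd_total = (input_tokens * usd_per_m_input +
                 output_tokens * usd_per_m_output) / 1_000_000
    # ...scaled from the actual task count to a standard 1,000 tasks.
    return usd_total / n_tasks * 1_000

# Example: a 25-task run that consumed 50k input / 12.5k output tokens,
# priced at a hypothetical $1.00 / $4.00 per million tokens:
print(round(cost_per_1k_tasks(50_000, 12_500, 25, 1.00, 4.00), 2))  # → 4.0
```

Because the token counts come from the benchmark run itself, the metric reflects how verbose each model actually is on these documents - a chatty model pays for its own output tokens.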
Six observations from benchmarking 50 models across 25 extraction tasks - focused on what actually matters: accuracy, speed, and cost.
The fastest model with ≥80% accuracy. Ranks #16 at 89.6% - and it sits in the eco tier at $0.25/1k tasks.
Cheapest model to break 90% accuracy. Ranks #11 at 91.2% - 35× cheaper than the #1 model with only 1.4pp less accuracy.
Highest-ranked eco-tier model at 91.7% accuracy for just $0.24/1k tasks. Outperforms multiple ultra-tier models that cost 41× more.
Command A costs $4.25/1k tasks but ranks #17. Meanwhile Grok 3 Mini ranks #5 at just $0.30/1k tasks.
Grok 4 Fast averages 1.4s. DeepSeek R1 averages 97s. For batch workloads, speed compounds - choose wisely.
Top accuracy at $6.00/1k tasks and 34.6s avg latency. But the top 10 is separated by just 1.0pp - the real difference is cost and speed.
Ranked by accuracy · Click any model to dive deeper
Click any column header to sort. Click a model to see its head-to-head benchmark. Cost = estimated USD to run 1,000 extraction tasks like these via each provider's API - not token pricing.
Higher accuracy doesn't always mean higher cost. Some models punch above their weight - hovering reveals details.
These tasks had the lowest average scores across all 50 models. See which models held up - and which ones completely fell apart.
Extract the indemnification liability cap from dense contractual language with multiple competing dollar figures.
Why it's hard: Must correctly distinguish the GENERAL liability cap ($500,000) from the INDEMNIFICATION cap ($2,000,000). Must identify both exceptions to the consequential damages waiver. Models often conflate the two caps or miss the 'notwithstanding' override.
{
"general_liability_cap": "$500,000",
"general_liability_basis": "Greater of 12-month fees or $500,000",
"indemnification_cap": "$2,000,000",
"indemnification_section": "9.1",
"consequential_damages_excluded": true,
"consequential_damages_exceptions": [
"Confidentiality breaches",
"Willful misconduct"
],
"liability_cap_exceptions": [
"Indemnification under Section 9.1",
"Confidentiality under Section 12"
]
}

| # | Model | Score |
|---|---|---|
| 1 | Claude 3.5 Haiku▶ | 94% |
| 2 | Gemini 2.0 Flash▶ | 92% |
| 3 | Nova Lite▶ | 91% |
| 4 | Nova Pro▶ | 91% |
| 5 | Gemini 2.0 Flash Lite▶ | 90% |
| 6 | Claude 3 Haiku▶ | 90% |
| 7 | Gemini 2.5 Flash Lite▶ | 89% |
| 8 | MiniMax M2.5▶ | 86% |
| 9 | Claude Sonnet 4.6▶ | 86% |
| 10 | Gemini 2.5 Flash▶ | 84% |
| 11 | Gemini 2.5 Pro▶ | 84% |
| 12 | Gemini 3 Pro▶ | 84% |
| 13 | Claude Opus 4.6▶ | 84% |
| 14 | DeepSeek V3.2▶ | 82% |
| 15 | Mistral Large 3▶ | 82% |
| 16 | Grok 3 Mini▶ | 81% |
| 17 | Claude Haiku 4.5▶ | 81% |
| 18 | Grok 4▶ | 81% |
| 19 | Qwen 3 32B▶ | 80% |
| 20 | Mistral Small▶ | 80% |
| 21 | Grok 4 Fast▶ | 80% |
| 22 | Gemini 3 Flash▶ | 80% |
| 23 | Command A▶ | 80% |
| 24 | Grok 3▶ | 80% |
| 25 | Nova Micro▶ | 77% |
| 26 | Llama 3.1 8B▶ | 0% |
| 27 | GPT-5 Nano▶ | 0% |
| 28 | Qwen 3 14B▶ | 0% |
| 29 | Llama 4 Scout▶ | 0% |
| 30 | Qwen 3.5 Flash▶ | 0% |
| 31 | GPT-4.1 Nano▶ | 0% |
| 32 | GPT-4o Mini▶ | 0% |
| 33 | Llama 4 Maverick▶ | 0% |
| 34 | DeepSeek V3.1▶ | 0% |
| 35 | GPT-5 Mini▶ | 0% |
| 36 | Seed 1.8▶ | 0% |
| 37 | GPT-4.1 Mini▶ | 0% |
| 38 | Qwen 3.5 Plus▶ | 0% |
| 39 | DeepSeek R1▶ | 0% |
| 40 | Kimi K2▶ | 0% |
| 41 | Magistral Small | 0% |
| 42 | Llama 3.3 70B▶ | 0% |
| 43 | DeepSeek V3▶ | 0% |
| 44 | o3 Mini▶ | 0% |
| 45 | o4 Mini▶ | 0% |
| 46 | GPT-5▶ | 0% |
| 47 | Magistral Medium | 0% |
| 48 | GPT-4.1▶ | 0% |
| 49 | o3▶ | 0% |
| 50 | GPT-4o▶ | 0% |
Click any model with ▶ to inspect its raw output. Models that scored 0% with “wrong field names” returned correct content but used a different schema - our scoring compares field-by-field against the expected structure.
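The "wrong field names" failure mode can be made concrete with a small sketch (illustrative, not the benchmark's code): field-by-field scoring looks up each expected key in the model's output, so correct content under a different key scores nothing.

```python
expected = {"indemnification_cap": "$2,000,000"}

on_schema  = {"indemnification_cap": "$2,000,000"}  # key matches the schema
off_schema = {"indemnity_limit": "$2,000,000"}      # same content, wrong key

def fields_found(expected, actual):
    # Fraction of expected keys present in the output at all.
    return sum(1 for k in expected if k in actual) / len(expected)

print(fields_found(expected, on_schema))   # → 1.0
print(fields_found(expected, off_schema))  # → 0.0
```

This is why several otherwise capable models sit at 0% on this task: the extraction was often right, but the schema was not.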
Pick two models and see how they stack up across all 25 tasks - with the actual outputs, field-level scores, and scoring breakdown.
Upload your own dataset and run the same evaluation on any of 164+ models. No API keys, no integration work. Just upload, pick models, and get results in minutes.