50 models from 13 providers, each tested against 25 hand-crafted extraction tasks across e-commerce, legal, medical, finance, and logic domains. Every response scored deterministically with field-level accuracy - a field passes when embedding similarity ≥ 80%.
1,250 real API calls, scored deterministically. No cherry-picked examples - every model runs every task under identical conditions.
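The field-level scoring above can be sketched as follows. This is an illustrative reconstruction, not the benchmark's actual code: `score_response` and `toy_embed` are hypothetical names, and the character-frequency "embedding" is a stand-in for a real embedding model. The key idea is the same - each extracted field passes when its embedding similarity to the expected value clears the 0.80 bar, and a response's score is the fraction of expected fields that pass.

```python
import math

def cosine(a, b):
    # Cosine similarity between two vectors; 0.0 if either is all-zero.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def toy_embed(text):
    # Stand-in for a real embedding model: a character-frequency vector.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def score_response(expected, actual, embed, threshold=0.80):
    # Fraction of expected fields whose value is similar enough to pass.
    passed = 0
    for field, want in expected.items():
        got = actual.get(field)
        if got is not None and cosine(embed(str(want)), embed(str(got))) >= threshold:
            passed += 1
    return passed / len(expected)

# A matching field passes; a missing field fails.
print(score_response({"basis": "confidentiality"},
                     {"basis": "confidentiality"}, toy_embed))  # → 1.0
print(score_response({"basis": "confidentiality"}, {}, toy_embed))  # → 0.0
```

Note that similarity-based matching (rather than exact string equality) is what lets near-miss phrasings like "Confidentiality breach" vs. "Confidentiality breaches" still earn credit.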
About “cost per 1k tasks”: This is the estimated USD to run 1,000 real extraction tasks like these through each provider's API - calculated from actual token usage during the benchmark. Not abstract per-token pricing. A single “task” is one structured extraction from a real-world document (a product listing, a legal clause, a medical report).
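The arithmetic behind that metric is straightforward. A minimal sketch, assuming per-million-token provider pricing (the function name and the example prices below are illustrative, not any provider's actual rates):

```python
def cost_per_1k_tasks(input_tokens, output_tokens, n_tasks,
                      usd_per_m_input, usd_per_m_output):
    # Total USD for the benchmark run, from measured token usage...
    usd_total = (input_tokens * usd_per_m_input +
                 output_tokens * usd_per_m_output) / 1_000_000
    # ...scaled from the actual task count to a standard 1,000 tasks.
    return usd_total / n_tasks * 1_000

# Example: a 25-task run that consumed 50k input / 12.5k output tokens,
# priced at a hypothetical $1.00 / $4.00 per million tokens:
print(round(cost_per_1k_tasks(50_000, 12_500, 25, 1.00, 4.00), 2))  # → 4.0
```

Because the token counts come from the benchmark run itself, the metric reflects how verbose each model actually is on these documents - a chatty model pays for its own output tokens.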
Six observations from benchmarking 50 models across 25 extraction tasks - focused on what actually matters: accuracy, speed, and cost.
The fastest model with ≥80% accuracy. Ranks #16 at 89.6% - and it sits in the eco tier at $0.25/1k tasks.
Cheapest model to break 90% accuracy. Ranks #11 at 91.2% - 35× cheaper than the #1 model with only 1.4pp less accuracy.
Highest-ranked eco-tier model at 91.7% accuracy for just $0.24/1k tasks. Outperforms multiple ultra-tier models that cost 41× more.
Command A costs $4.25/1k tasks but ranks #17. Meanwhile Grok 3 Mini ranks #5 at just $0.30/1k tasks.
Grok 4 Fast averages 1.4s. DeepSeek R1 averages 97s. For batch workloads, speed compounds - choose wisely.
Top accuracy at $6.00/1k tasks and 34.6s avg latency. But the top 10 is separated by just 1.0pp - the real difference is cost and speed.
Ranked by accuracy · Click any model to dive deeper
Click any column header to sort. Click a model to see its head-to-head benchmark. Cost = estimated USD to run 1,000 extraction tasks like these via each provider's API - not token pricing.
Higher accuracy doesn't always mean higher cost. Some models punch above their weight - hovering reveals details.
These tasks had the lowest average scores across all 50 models. See which models held up - and which ones completely fell apart.
Extract the indemnification liability cap from dense contractual language with multiple competing dollar figures.
Why it's hard: Must correctly distinguish the GENERAL liability cap ($500,000) from the INDEMNIFICATION cap ($2,000,000). Must identify both exceptions to the consequential damages waiver. Models often conflate the two caps or miss the 'notwithstanding' override.
{
"general_liability_cap": "$500,000",
"general_liability_basis": "Greater of 12-month fees or $500,000",
"indemnification_cap": "$2,000,000",
"indemnification_section": "9.1",
"consequential_damages_excluded": true,
"consequential_damages_exceptions": [
"Confidentiality breaches",
"Willful misconduct"
],
"liability_cap_exceptions": [
"Indemnification under Section 9.1",
"Confidentiality under Section 12"
]
}

| # | Model | Score |
|---|---|---|
| 1 | Claude 3.5 Haiku▶ | 94% |
| 2 | Gemini 2.0 Flash▶ | 92% |
| 3 | Nova Lite▶ | 91% |
| 4 | Nova Pro▶ | 91% |
| 5 | Gemini 2.0 Flash Lite▶ | 90% |
| 6 | Claude 3 Haiku▶ | 90% |
| 7 | Gemini 2.5 Flash Lite▶ | 89% |
| 8 | MiniMax M2.5▶ | 86% |
| 9 | Claude Sonnet 4.6▶ | 86% |
| 10 | Gemini 2.5 Flash▶ | 84% |
| 11 | Gemini 2.5 Pro▶ | 84% |
| 12 | Gemini 3 Pro▶ | 84% |
| 13 | Claude Opus 4.6▶ | 84% |
| 14 | DeepSeek V3.2▶ | 82% |
| 15 | Mistral Large 3▶ | 82% |
| 16 | Grok 3 Mini▶ | 81% |
| 17 | Claude Haiku 4.5▶ | 81% |
| 18 | Grok 4▶ | 81% |
| 19 | Qwen 3 32B▶ | 80% |
| 20 | Mistral Small▶ | 80% |
| 21 | Grok 4 Fast▶ | 80% |
| 22 | Gemini 3 Flash▶ | 80% |
| 23 | Command A▶ | 80% |
| 24 | Grok 3▶ | 80% |
| 25 | Nova Micro▶ | 77% |
| 26 | Llama 3.1 8B▶ | 0% |
| 27 | GPT-5 Nano▶ | 0% |
| 28 | Qwen 3 14B▶ | 0% |
| 29 | Llama 4 Scout▶ | 0% |
| 30 | Qwen 3.5 Flash▶ | 0% |
| 31 | GPT-4.1 Nano▶ | 0% |
| 32 | GPT-4o Mini▶ | 0% |
| 33 | Llama 4 Maverick▶ | 0% |
| 34 | DeepSeek V3.1▶ | 0% |
| 35 | GPT-5 Mini▶ | 0% |
| 36 | Seed 1.8▶ | 0% |
| 37 | GPT-4.1 Mini▶ | 0% |
| 38 | Qwen 3.5 Plus▶ | 0% |
| 39 | DeepSeek R1▶ | 0% |
| 40 | Kimi K2▶ | 0% |
| 41 | Magistral Small | 0% |
| 42 | Llama 3.3 70B▶ | 0% |
| 43 | DeepSeek V3▶ | 0% |
| 44 | o3 Mini▶ | 0% |
| 45 | o4 Mini▶ | 0% |
| 46 | GPT-5▶ | 0% |
| 47 | Magistral Medium | 0% |
| 48 | GPT-4.1▶ | 0% |
| 49 | o3▶ | 0% |
| 50 | GPT-4o▶ | 0% |
Click any model with ▶ to inspect its raw output. Models that scored 0% with “wrong field names” returned correct content but used a different schema - our scoring compares field-by-field against the expected structure.
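The "wrong field names" failure mode can be made concrete with a small sketch (illustrative, not the benchmark's code): field-by-field scoring looks up each expected key in the model's output, so correct content under a different key scores nothing.

```python
expected = {"indemnification_cap": "$2,000,000"}

on_schema  = {"indemnification_cap": "$2,000,000"}  # key matches the schema
off_schema = {"indemnity_limit": "$2,000,000"}      # same content, wrong key

def fields_found(expected, actual):
    # Fraction of expected keys present in the output at all.
    return sum(1 for k in expected if k in actual) / len(expected)

print(fields_found(expected, on_schema))   # → 1.0
print(fields_found(expected, off_schema))  # → 0.0
```

This is why several otherwise capable models sit at 0% on this task: the extraction was often right, but the schema was not.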
Pick two models and see how they stack up across all 25 tasks - with the actual outputs, field-level scores, and scoring breakdown.
Upload your own dataset and run the same evaluation on any of 164+ models. No API keys, no integration work. Just upload, pick models, and get results in minutes.