Qwen 3.5 Flash vs GPT-4.1

Head-to-head comparison across 25 high-difficulty extraction tasks spanning e-commerce, legal, medical, finance, and logic domains.

Qwen 3.5 Flash

Alibaba · eco
Accuracy: 21.6%
Cost / 1k Tasks: $0.17
Latency: 26092ms

GPT-4.1

OpenAI · ultra
★ Higher Accuracy
Accuracy: 82.5%
Cost / 1k Tasks: $3.40
Latency: 3656ms

How we test

Last benchmarked: February 27, 2026

25 Extraction Tasks

Hand-crafted extraction tasks across 5 domains (E-commerce, Legal, Medical, Finance, Logic). Each is rated difficulty 4–5 and designed to expose ambiguous entities, nested structures, multi-step reasoning, and type coercion.

Deterministic Scoring

Every response is scored with a deterministic field-level comparison: booleans are checked exactly, numbers are checked within a 5% tolerance, and strings are compared via embedding cosine similarity. A string field is marked correct when its similarity is ≥ 80%. Arrays are matched element by element. Scores range from 0 to 100 per task.
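
The field-level rules described here can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the benchmark's actual scorer: `field_correct` and `count_correct` are hypothetical names, and `difflib.SequenceMatcher` is swapped in for embedding cosine similarity so the example runs on the standard library alone. It reports correct-field counts only, because the page does not state how the 0–100 task score is derived from the fields.

```python
from difflib import SequenceMatcher
from typing import Any, Callable


def string_similarity(a: str, b: str) -> float:
    """Stand-in for embedding cosine similarity (assumption, not the real metric)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def field_correct(
    expected: Any,
    actual: Any,
    sim: Callable[[str, str], float] = string_similarity,
    num_tol: float = 0.05,
    threshold: float = 0.80,
) -> bool:
    """Compare one field using type-specific rules."""
    if actual is None:
        return False
    # Booleans first: bool is a subclass of int in Python.
    if isinstance(expected, bool):
        return isinstance(actual, bool) and actual == expected
    if isinstance(expected, (int, float)):
        if isinstance(actual, bool) or not isinstance(actual, (int, float)):
            return False
        if expected == 0:
            return actual == 0
        return abs(actual - expected) <= num_tol * abs(expected)
    if isinstance(expected, str):
        return isinstance(actual, str) and sim(expected, actual) >= threshold
    if isinstance(expected, list):
        # Element-by-element matching, in order.
        return (
            isinstance(actual, list)
            and len(actual) == len(expected)
            and all(field_correct(e, a, sim, num_tol, threshold)
                    for e, a in zip(expected, actual))
        )
    if isinstance(expected, dict):
        return isinstance(actual, dict) and all(
            field_correct(v, actual.get(k), sim, num_tol, threshold)
            for k, v in expected.items()
        )
    return expected == actual


def count_correct(expected: dict, actual: dict) -> tuple[int, int]:
    """Return (correct_fields, total_fields) for one task."""
    correct = sum(field_correct(v, actual.get(k)) for k, v in expected.items())
    return correct, len(expected)
```

For instance, `count_correct({"brand": "Nike", "year": "2018"}, {"brand": "Nike"})` counts the missing `year` as incorrect. A real embedding model would score paraphrases differently from the character-based stand-in used here.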

Consistent & Reproducible

Each model runs every task once under identical conditions (same prompts, same schema). Results are timestamped and versioned so you can track changes over time.

Task Results

Each task shows the input, both model outputs, and the accuracy score.

25 tasks · 5 domains
01 · E-commerce · Ambiguous Color Extraction
51 vs 98
Input
Nike Air Max 97 OG QS "Metallic Gold" Bullet — Titanium Violet / Varsity Red — Men's US 10.5 — 2018 Retro Release
Expected Output
{
  "brand": "Nike",
  "model": "Air Max 97 OG QS",
  "colorway": "Metallic Gold",
  "color_1": "Titanium Violet",
  "color_2": "Varsity Red",
  "size": "US 10.5",
  "gender": "Men's",
  "year": "2018",
  "edition": "Retro Release"
}
Qwen 3.5 Flash
14362ms · 51/100

Incorrect/missing fields: color_1, color_2, year, edition, colorway

brand: Nike
model: Air Max 97
colorway: Titanium Violet / Varsity Red (expected: Metallic Gold)
color_1: - (expected: Titanium Violet)
color_2: - (expected: Varsity Red)
size: US 10.5
gender: Men's
year: - (expected: 2018)
edition: - (expected: Retro Release)
4/9 fields correct
GPT-4.1
995ms · 98/100

All fields correct

brand: Nike
model: Air Max 97 OG QS
colorway: Metallic Gold Bullet
color_1: Titanium Violet
color_2: Varsity Red
size: US 10.5
gender: Men's
year: 2018
edition: Retro Release
9/9 fields correct
Why this is hard: All 9 fields must be exact. 'Titanium Violet' must be extracted as a single compound color, NOT split into material + color. 'Metallic Gold' is the colorway nickname, not color_1.
02 · E-commerce · Nested Size & Variant Parsing
50 vs 99
Input
Apple MacBook Pro 14" M3 Max (16-core CPU / 40-core GPU) — 48GB Unified Memory — 1TB SSD — Space Black — AppleCare+ Bundle — Open Box Like New
Expected Output
{
  "brand": "Apple",
  "model": "MacBook Pro 14\"",
  "chip": "M3 Max",
  "cpu_cores": 16,
  "gpu_cores": 40,
  "memory_gb": 48,
  "storage": "1TB SSD",
  "color": "Space Black",
  "condition": "Open Box Like New",
  "includes_warranty": true
}
Qwen 3.5 Flash
17605ms · 50/100

Incorrect/missing fields: model, chip, memory_gb, storage, includes_warranty

brand: Apple
model: - (expected: MacBook Pro 14")
chip: - (expected: M3 Max)
cpu_cores: 16
gpu_cores: 40
memory_gb: - (expected: 48)
storage: - (expected: 1TB SSD)
color: Space Black
condition: Open Box Like New
includes_warranty: - (expected: true)
5/10 fields correct
GPT-4.1
1538ms · 99/100

All fields correct

brand: Apple
model: MacBook Pro 14"
chip: Apple M3 Max
cpu_cores: 16
gpu_cores: 40
memory_gb: 48
storage: 1TB SSD
color: Space Black
condition: Open Box Like New
includes_warranty: true
10/10 fields correct
Why this is hard: Numeric fields (cpu_cores, gpu_cores, memory_gb) must be integers, not strings. 'includes_warranty' must be boolean true (AppleCare+ = warranty). 'condition' must not be just 'New' — it is specifically 'Open Box Like New'.
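
The type requirements above can be checked with a short sketch like the following. It assumes the model output has already been parsed into a Python dict; `is_strict_int` and the sample `output` (with `gpu_cores` deliberately returned as a string) are hypothetical and not taken from either model's response.

```python
def is_strict_int(value) -> bool:
    """True for real integers only; bool is excluded because it subclasses int."""
    return isinstance(value, int) and not isinstance(value, bool)


# Hypothetical parsed output with one type-coercion mistake.
output = {
    "cpu_cores": 16,
    "gpu_cores": "40",          # string instead of integer
    "memory_gb": 48,
    "includes_warranty": True,
}

int_fields = ("cpu_cores", "gpu_cores", "memory_gb")
type_errors = [name for name in int_fields if not is_strict_int(output[name])]
if not isinstance(output["includes_warranty"], bool):
    type_errors.append("includes_warranty")

print(type_errors)
```

Checking `bool` separately matters in Python, since `isinstance(True, int)` is true and a plain integer check would accept `true` in a numeric field.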
03 · E-commerce · Multi-pack Unit Price Trap
22 vs 100
Input
Olaplex Hair Perfector No. 3 — 100ml (3.3 fl oz) × 3 Pack — $89.97 ($29.99/each) — Sulfate Free — Salon Professional — Ships from Authorized Dealer
Expected Output
{
  "brand": "Olaplex",
  "product_name": "Hair Perfector No. 3",
  "volume_ml": 100,
  "volume_oz": 3.3,
  "pack_quantity": 3,
  "total_price_usd": 89.97,
  "unit_price_usd": 29.99,
  "is_sulfate_free": true,
  "seller_type": "Authorized Dealer"
}
Qwen 3.5 Flash
17138ms · 22/100

Incorrect/missing fields: volume_ml, volume_oz, pack_quantity, total_price_usd, unit_price_usd, is_sulfate_free, seller_type

brand: Olaplex
product_name: Hair Perfector No. 3
volume_ml: - (expected: 100)
volume_oz: - (expected: 3.3)
pack_quantity: - (expected: 3)
total_price_usd: - (expected: 89.97)
unit_price_usd: - (expected: 29.99)
is_sulfate_free: - (expected: true)
seller_type: - (expected: Authorized Dealer)
2/9 fields correct
GPT-4.1
1070ms · 100/100

All fields correct

brand: Olaplex
product_name: Hair Perfector No. 3
volume_ml: 100
volume_oz: 3.3
pack_quantity: 3
total_price_usd: 89.97
unit_price_usd: 29.99
is_sulfate_free: true
seller_type: Authorized Dealer
9/9 fields correct
Why this is hard: Must distinguish total_price_usd (89.97) from unit_price_usd (29.99). pack_quantity must be integer 3. volume must reflect SINGLE unit, not total. is_sulfate_free must be boolean.
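
One way to catch a swapped total and unit price is an arithmetic consistency check: the total should equal the unit price times the pack quantity (3 × 29.99 = 89.97). The sketch below is an illustrative assumption, not part of the benchmark's scoring; `prices_consistent` and its one-cent tolerance are hypothetical. It uses `decimal.Decimal` to avoid binary floating-point rounding on currency values.

```python
from decimal import Decimal


def prices_consistent(total_price, unit_price, pack_quantity: int,
                      tolerance: str = "0.01") -> bool:
    """Check that total == unit * quantity, within a small currency tolerance."""
    total = Decimal(str(total_price))
    unit = Decimal(str(unit_price))
    return abs(total - unit * pack_quantity) <= Decimal(tolerance)


# Values from the expected output of this task.
print(prices_consistent(89.97, 29.99, 3))   # fields in the right slots
# Same numbers with total and unit swapped.
print(prices_consistent(29.99, 89.97, 3))
```

The first call passes and the second fails, so a swap of the two price fields is detectable without knowing the expected answer.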
04 · E-commerce · Japanese Electronics with Voltage
40 vs 99
Input
Sony WH-1000XM5 ワイヤレスノイズキャンセリングヘッドホン ブラック — Bluetooth 5.3 / LDAC — 30hr Battery — 100V-240V AC Adapter — JAN: 4548736132351 — Japan Domestic Model
Expected Output
{
  "brand": "Sony",
  "model": "WH-1000XM5",
  "type": "Wireless Noise Cancelling Headphones",
  "color": "Black",
  "bluetooth_version": "5.3",
  "audio_codec": "LDAC",
  "battery_hours": 30,
  "voltage_range": "100V-240V",
  "jan_code": "4548736132351",
  "region": "Japan Domestic"
}
Qwen 3.5 Flash
24599ms · 40/100

Incorrect/missing fields: type, bluetooth_version, audio_codec, battery_hours, voltage_range, region

brand: Sony
model: WH-1000XM5
type: - (expected: Wireless Noise Cancelling Headphones)
color: Black
bluetooth_version: - (expected: 5.3)
audio_codec: - (expected: LDAC)
battery_hours: - (expected: 30)
voltage_range: - (expected: 100V-240V)
jan_code: 4548736132351
region: - (expected: Japan Domestic)
4/10 fields correct
GPT-4.1
1174ms · 99/100

All fields correct

brand: Sony
model: WH-1000XM5
type: wireless noise-cancelling headphones
color: Black
bluetooth_version: 5.3
audio_codec: LDAC
battery_hours: 30
voltage_range: 100V-240V
jan_code: 4548736132351
region: Japan Domestic Model
10/10 fields correct
Why this is hard: The Japanese text ワイヤレスノイズキャンセリングヘッドホン must be translated to 'Wireless Noise Cancelling Headphones'. Color 'ブラック' = 'Black'. battery_hours must be integer. jan_code must be string (not number — leading zeros matter).
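
The reason identifier codes should stay strings can be shown in a few lines. The code `0012345678905` below is a hypothetical 13-character value chosen only because it starts with zeros (the JAN in this task does not), and `is_13_digit_code` is an illustrative helper that assumes the 13-digit form and does not verify the check digit.

```python
import json

# Hypothetical payload; the code is kept as a JSON string.
raw = '{"jan_code": "0012345678905"}'
as_string = json.loads(raw)["jan_code"]

# Converting to a number silently drops the leading zeros.
round_tripped = str(int(as_string))
print(as_string, "->", round_tripped)


def is_13_digit_code(value) -> bool:
    """Accept only a 13-character, digits-only string."""
    return isinstance(value, str) and len(value) == 13 and value.isdigit()


print(is_13_digit_code("4548736132351"))   # string from this task
print(is_13_digit_code(4548736132351))     # same digits as a number
```

After the numeric round trip the hypothetical code is two characters shorter, so it no longer identifies the same product.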
05 · E-commerce · Jewelry with Carat vs Karat
45 vs 99
Input
Tiffany & Co. Soleste Round Brilliant Diamond Engagement Ring — 1.52ct E/VVS2 — Platinum Setting with 18K Rose Gold Accent Band — GIA #2215847290 — Size 6.5 — Retail $42,800
Expected Output
{
  "brand": "Tiffany & Co.",
  "collection": "Soleste",
  "stone_type": "Round Brilliant Diamond",
  "carat_weight": 1.52,
  "color_grade": "E",
  "clarity_grade": "VVS2",
  "primary_metal": "Platinum",
  "accent_metal": "18K Rose Gold",
  "gia_report_number": "2215847290",
  "ring_size": 6.5,
  "retail_price_usd": 42800
}
Qwen 3.5 Flash
25326ms · 45/100

Incorrect/missing fields: collection, stone_type, primary_metal, accent_metal, gia_report_number, retail_price_usd

brand: Tiffany & Co.
collection: - (expected: Soleste)
stone_type: - (expected: Round Brilliant Diamond)
carat_weight: 1.52
color_grade: E
clarity_grade: VVS2
primary_metal: - (expected: Platinum)
accent_metal: - (expected: 18K Rose Gold)
gia_report_number: - (expected: 2215847290)
ring_size: 6.5
retail_price_usd: - (expected: 42800)
5/11 fields correct
GPT-4.1
1393ms · 99/100

All fields correct

brand: Tiffany & Co.
collection: Soleste
stone_type: Diamond
carat_weight: 1.52
color_grade: E
clarity_grade: VVS2
primary_metal: Platinum
accent_metal: 18K Rose Gold
gia_report_number: 2215847290
ring_size: 6.5
retail_price_usd: 42800
11/11 fields correct
Why this is hard: carat_weight is gemstone weight (1.52ct), NOT gold purity. accent_metal includes '18K' (karat = gold purity). ring_size must be numeric 6.5, not string. gia_report_number is string. retail_price_usd is integer with no comma.
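
The carat/karat distinction and the comma-free integer price can be illustrated with a few regular expressions. This is a rule-based sketch for illustration only, not how either model works; the `listing` string is abbreviated from the task input and the patterns are assumptions that fit this one listing.

```python
import re

# Abbreviated from the task input.
listing = "1.52ct E/VVS2, Platinum Setting with 18K Rose Gold Accent Band, Retail $42,800"

# "ct" marks gemstone weight; "K" marks gold purity.
carat_match = re.search(r"(\d+(?:\.\d+)?)\s*ct\b", listing)
karat_match = re.search(r"\b(\d+)K\b", listing)
price_match = re.search(r"\$([\d,]+)", listing)

carat_weight = float(carat_match.group(1))
gold_karat = int(karat_match.group(1))
# Strip the thousands separator before converting to an integer.
retail_price_usd = int(price_match.group(1).replace(",", ""))

print(carat_weight, gold_karat, retail_price_usd)
```

Keeping the two units in separate patterns prevents `18` from being read as a gemstone weight, which is the confusion this task is built around.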
06 · Legal · Indemnification Cap Extraction
0 vs 0
Input
MASTER SERVICES AGREEMENT dated January 15, 2026. Section 8.2 (Limitation of Liability): Except for Vendor's indemnification obligations under Section 9.1 and breaches of Section 12 (Confidentiality), in no event shall either party's aggregate liability under this Agreement exceed the greater of (a) the total fees paid or payable by Client in the twelve (12) month period immediately preceding the event giving rise to the claim, or (b) Five Hundred Thousand Dollars ($500,000). Notwithstanding the foregoing, Vendor's aggregate liability for indemnification claims under Section 9.1 shall not exceed Two Million Dollars ($2,000,000). Section 8.3: IN NO EVENT SHALL EITHER PARTY BE LIABLE FOR ANY INDIRECT, INCIDENTAL, SPECIAL, CONSEQUENTIAL, OR PUNITIVE DAMAGES, REGARDLESS OF THE CAUSE OF ACTION OR THEORY OF LIABILITY, EVEN IF SUCH PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. The foregoing limitation shall not apply to breaches of Section 12 (Confidentiality) or willful misconduct.
Expected Output
{
  "general_liability_cap": "$500,000",
  "general_liability_basis": "Greater of 12-month fees or $500,000",
  "indemnification_cap": "$2,000,000",
  "indemnification_section": "9.1",
  "consequential_damages_excluded": true,
  "consequential_damages_exceptions": [
    "Confidentiality breaches",
    "Willful misconduct"
  ],
  "liability_cap_exceptions": [
    "Indemnification under Section 9.1",
    "Confidentiality under Section 12"
  ]
}
Qwen 3.5 Flash
11750ms · 0/100

Incorrect/missing fields: general_liability_cap, general_liability_basis, indemnification_cap, indemnification_section, consequential_damages_excluded, consequential_damages_exceptions, liability_cap_exceptions

general_liability_cap: - (expected: $500,000)
general_liability_basis: - (expected: Greater of 12-month fees or $500,000)
indemnification_cap: - (expected: $2,000,000)
indemnification_section: - (expected: 9.1)
consequential_damages_excluded: - (expected: true)
consequential_damages_exceptions: - (expected: ["Confidentiality breaches","Willful misconduct"])
liability_cap_exceptions: - (expected: ["Indemnification under Section 9.1","Confidentiality under Section 12"])
0/7 fields correct
GPT-4.1
6754ms · 0/100

Incorrect/missing fields: general_liability_cap, general_liability_basis, indemnification_cap, indemnification_section, consequential_damages_excluded, consequential_damages_exceptions, liability_cap_exceptions

general_liability_cap: - (expected: $500,000)
general_liability_basis: - (expected: Greater of 12-month fees or $500,000)
indemnification_cap: - (expected: $2,000,000)
indemnification_section: - (expected: 9.1)
consequential_damages_excluded: - (expected: true)
consequential_damages_exceptions: - (expected: ["Confidentiality breaches","Willful misconduct"])
liability_cap_exceptions: - (expected: ["Indemnification under Section 9.1","Confidentiality under Section 12"])
0/7 fields correct
Why this is hard: Must correctly distinguish the GENERAL liability cap ($500,000) from the INDEMNIFICATION cap ($2,000,000). Must identify both exceptions to the consequential damages waiver. Models often conflate the two caps or miss the 'notwithstanding' override.
07 · Legal · Termination Clause Parsing
0 vs 89
Input
Section 14 — TERMINATION. 14.1 Termination for Convenience: Either party may terminate this Agreement upon ninety (90) days' prior written notice to the other party. 14.2 Termination for Cause: Either party may terminate this Agreement immediately upon written notice if the other party: (a) materially breaches this Agreement and fails to cure such breach within thirty (30) days after receiving written notice; (b) becomes insolvent, files for bankruptcy under Title 11, or has a receiver appointed for substantially all of its assets; or (c) is acquired by a direct competitor of the non-breaching party ('Change of Control'). 14.3 Effect of Termination: Upon termination under 14.1, all prepaid but unused fees shall be refunded pro rata. Upon termination under 14.2(a), the breaching party shall forfeit any unused prepaid fees. Upon termination under 14.2(b) or 14.2(c), fees shall be handled in accordance with Section 7.4.
Expected Output
{
  "termination_for_convenience.notice_period_days": 90,
  "termination_for_convenience.available_to": "Either party",
  "termination_for_cause.triggers": [
    {
      "type": "Material breach",
      "cure_period_days": 30
    },
    {
      "type": "Insolvency/Bankruptcy",
      "cure_period_days": 0
    },
    {
      "type": "Change of Control",
      "cure_period_days": 0
    }
  ],
  "termination_effects.convenience": "Pro rata refund of prepaid unused fees",
  "termination_effects.cause_breach": "Breaching party forfeits unused prepaid fees",
  "termination_effects.cause_insolvency_or_change_of_control": "Per Section 7.4"
}
Qwen 3.5 Flash
27757ms · 0/100

Incorrect/missing fields: termination_for_convenience.notice_period_days, termination_for_convenience.available_to, termination_for_cause.triggers, termination_effects.convenience, termination_effects.cause_breach, termination_effects.cause_insolvency_or_change_of_control

termination_for_convenience.notice_period_days: - (expected: 90)
termination_for_convenience.available_to: - (expected: Either party)
termination_for_cause.triggers: - (expected: [{"type":"Material breach","cure_period_days":30},{"type":"Insolvency/Bankruptcy","cure_period_days":0},{"type":"Change of Control","cure_period_days":0}])
termination_effects.convenience: - (expected: Pro rata refund of prepaid unused fees)
termination_effects.cause_breach: - (expected: Breaching party forfeits unused prepaid fees)
termination_effects.cause_insolvency_or_change_of_control: - (expected: Per Section 7.4)
0/6 fields correct
GPT-4.1
2727ms · 89/100

Incorrect/missing fields: termination_effects.cause_insolvency_or_change_of_control

termination_for_convenience.notice_period_days: 90
termination_for_convenience.available_to: either party
termination_for_cause.triggers: [{"type":"material breach, uncured","cure_period_days":30},{"type":"insolvency or bankruptcy","cure_period_days":0},{"type":"change of control (acquisition by direct competitor)","cure_period_days":0}]
termination_effects.convenience: All prepaid but unused fees shall be refunded pro rata.
termination_effects.cause_breach: The breaching party shall forfeit any unused prepaid fees.
termination_effects.cause_insolvency_or_change_of_control: Fees shall be handled in accordance with Section 7.4. (expected: Per Section 7.4)
5/6 fields correct
Why this is hard: Must correctly map each termination trigger to its cure period (only material breach has 30 days). Must distinguish fee treatment across all three termination scenarios. Insolvency and Change of Control have NO cure period (immediate).
08 · Legal · IP Assignment vs License Distinction
0 vs 79
Input
Section 6 — INTELLECTUAL PROPERTY. 6.1 Work Product: All deliverables, work product, and inventions conceived, created, or developed by Vendor in the performance of this Agreement ("Work Product") shall be considered "works made for hire" as defined by the U.S. Copyright Act. To the extent any Work Product does not qualify as a work made for hire, Vendor hereby irrevocably assigns to Client all right, title, and interest in and to such Work Product, including all intellectual property rights therein. 6.2 Pre-Existing IP: Notwithstanding Section 6.1, Vendor retains all right, title, and interest in and to any tools, frameworks, libraries, methodologies, and know-how that: (a) existed prior to this Agreement; (b) were developed independently outside the scope of this Agreement; or (c) are of general applicability and not specific to Client's business (collectively, "Pre-Existing IP"). To the extent any Pre-Existing IP is incorporated into the Work Product, Vendor grants Client a perpetual, irrevocable, worldwide, royalty-free, non-exclusive license to use, reproduce, and modify such Pre-Existing IP solely as embedded in the Work Product. 6.3 Residual Knowledge: Nothing in this Agreement shall restrict Vendor's right to use general skills, knowledge, experience, and ideas retained in the unaided memories of Vendor's personnel, provided such use does not infringe Client's patents or copyrights in the Work Product.
Expected Output
{
  "work_product_ownership": "Assigned to Client",
  "assignment_mechanism": "Work made for hire + irrevocable assignment fallback",
  "pre_existing_ip_ownership": "Retained by Vendor",
  "pre_existing_ip_categories": [
    "Prior tools/frameworks",
    "Independently developed",
    "General applicability"
  ],
  "pre_existing_ip_license_to_client.type": "Non-exclusive license",
  "pre_existing_ip_license_to_client.scope": "Use, reproduce, modify as embedded in Work Product",
  "pre_existing_ip_license_to_client.duration": "Perpetual",
  "pre_existing_ip_license_to_client.royalty": "Royalty-free",
  "pre_existing_ip_license_to_client.territory": "Worldwide",
  "residual_knowledge_clause": true,
  "residual_knowledge_limitation": "Must not infringe Client's patents or copyrights"
}
Qwen 3.5 Flash
32125ms · 0/100

Incorrect/missing fields: work_product_ownership, assignment_mechanism, pre_existing_ip_ownership, pre_existing_ip_categories, pre_existing_ip_license_to_client.type, pre_existing_ip_license_to_client.scope, pre_existing_ip_license_to_client.duration, pre_existing_ip_license_to_client.royalty, pre_existing_ip_license_to_client.territory, residual_knowledge_clause, residual_knowledge_limitation

work_product_ownership: - (expected: Assigned to Client)
assignment_mechanism: - (expected: Work made for hire + irrevocable assignment fallback)
pre_existing_ip_ownership: - (expected: Retained by Vendor)
pre_existing_ip_categories: - (expected: ["Prior tools/frameworks","Independently developed","General applicability"])
pre_existing_ip_license_to_client.type: - (expected: Non-exclusive license)
pre_existing_ip_license_to_client.scope: - (expected: Use, reproduce, modify as embedded in Work Product)
pre_existing_ip_license_to_client.duration: - (expected: Perpetual)
pre_existing_ip_license_to_client.royalty: - (expected: Royalty-free)
pre_existing_ip_license_to_client.territory: - (expected: Worldwide)
residual_knowledge_clause: - (expected: true)
residual_knowledge_limitation: - (expected: Must not infringe Client's patents or copyrights)
0/11 fields correct
GPT-4.1
4640ms · 79/100

Incorrect/missing fields: pre_existing_ip_categories, work_product_ownership, assignment_mechanism, pre_existing_ip_ownership, pre_existing_ip_license_to_client.type, pre_existing_ip_license_to_client.scope, residual_knowledge_limitation

work_product_ownership: All deliverables, work product, and inventions conceived, created, or developed by Vendor in the performance of this Agreement shall be considered 'works made for hire'; to the extent they are not, all rights are assigned to Client. (expected: Assigned to Client)
assignment_mechanism: Irrevocable assignment by Vendor to Client of all right, title, and interest in and to the Work Product, including all intellectual property rights, to the extent not works made for hire. (expected: Work made for hire + irrevocable assignment fallback)
pre_existing_ip_ownership: Vendor retains all right, title, and interest in and to Pre-Existing IP. (expected: Retained by Vendor)
pre_existing_ip_categories: ["tools","frameworks","libraries","methodologies","know-how"] (expected: ["Prior tools/frameworks","Independently developed","General applicability"])
pre_existing_ip_license_to_client.type: royalty-free, non-exclusive, irrevocable, perpetual license (expected: Non-exclusive license)
pre_existing_ip_license_to_client.scope: to use, reproduce, and modify Pre-Existing IP solely as embedded in the Work Product (expected: Use, reproduce, modify as embedded in Work Product)
pre_existing_ip_license_to_client.duration: perpetual
pre_existing_ip_license_to_client.royalty: royalty-free
pre_existing_ip_license_to_client.territory: worldwide
residual_knowledge_clause: true
residual_knowledge_limitation: Vendor may use general skills, knowledge, experience, and ideas retained in unaided memories of its personnel, provided such use does not infringe Client's patents or copyrights in the Work Product. (expected: Must not infringe Client's patents or copyrights)
4/11 fields correct
Why this is hard: Must distinguish IP assignment (Work Product → Client) from IP licensing (Pre-Existing IP → non-exclusive license). Must capture all three categories of Pre-Existing IP. Residual knowledge clause is a separate carve-out. Models often incorrectly say ALL IP is assigned.
09 · Legal · Force Majeure Scope Extraction
0 vs 0
Input
Section 15 — FORCE MAJEURE. 15.1 Neither party shall be liable for any failure or delay in performing its obligations under this Agreement (other than payment obligations, which shall not be excused) where such failure or delay results from Force Majeure Events including, but not limited to: acts of God, fire, flood, earthquake, epidemic, pandemic, war, terrorism, civil unrest, government sanctions, embargoes, strikes or labor disputes (excluding those involving a party's own employees), power grid failures, or internet backbone outages. 15.2 The affected party must provide written notice within five (5) business days of the Force Majeure Event's commencement, including a reasonable estimate of the expected duration. The affected party shall use commercially reasonable efforts to mitigate the impact. 15.3 If a Force Majeure Event continues for more than one hundred and twenty (120) consecutive calendar days, either party may terminate this Agreement upon written notice without penalty.
Expected Output
{
  "payment_obligations_excused": false,
  "notice_period_business_days": 5,
  "notice_requirements": [
    "Written notice",
    "Reasonable duration estimate"
  ],
  "mitigation_standard": "Commercially reasonable efforts",
  "termination_threshold_days": 120,
  "termination_penalty": false,
  "excluded_events": [
    "Strikes involving own employees"
  ],
  "included_events": [
    "Acts of God",
    "Fire",
    "Flood",
    "Earthquake",
    "Epidemic",
    "Pandemic",
    "War",
    "Terrorism",
    "Civil unrest",
    "Government sanctions",
    "Embargoes",
    "Strikes/labor disputes",
    "Power grid failures",
    "Internet backbone outages"
  ]
}
Qwen 3.5 Flash
16270ms · 0/100

Incorrect/missing fields: payment_obligations_excused, notice_period_business_days, notice_requirements, mitigation_standard, termination_threshold_days, termination_penalty, excluded_events, included_events

payment_obligations_excused: - (expected: false)
notice_period_business_days: - (expected: 5)
notice_requirements: - (expected: ["Written notice","Reasonable duration estimate"])
mitigation_standard: - (expected: Commercially reasonable efforts)
termination_threshold_days: - (expected: 120)
termination_penalty: - (expected: false)
excluded_events: - (expected: ["Strikes involving own employees"])
included_events: - (expected: ["Acts of God","Fire","Flood","Earthquake","Epidemic","Pandemic","War","Terrorism","Civil unrest","Government sanctions","Embargoes","Strikes/labor disputes","Power grid failures","Internet backbone outages"])
0/8 fields correct
GPT-4.1
5030ms · 0/100

Incorrect/missing fields: payment_obligations_excused, notice_period_business_days, notice_requirements, mitigation_standard, termination_threshold_days, termination_penalty, excluded_events, included_events

payment_obligations_excused: - (expected: false)
notice_period_business_days: - (expected: 5)
notice_requirements: - (expected: ["Written notice","Reasonable duration estimate"])
mitigation_standard: - (expected: Commercially reasonable efforts)
termination_threshold_days: - (expected: 120)
termination_penalty: - (expected: false)
excluded_events: - (expected: ["Strikes involving own employees"])
included_events: - (expected: ["Acts of God","Fire","Flood","Earthquake","Epidemic","Pandemic","War","Terrorism","Civil unrest","Government sanctions","Embargoes","Strikes/labor disputes","Power grid failures","Internet backbone outages"])
0/8 fields correct
Why this is hard: Critical: payment_obligations_excused must be FALSE. The key exclusion (own-employee strikes) is often missed. notice_period must be 5 BUSINESS days (not calendar). termination_threshold is 120 CALENDAR days.
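
The gap between business days and calendar days is easy to see with concrete dates. The sketch below is an illustration under assumptions: the start date is hypothetical, `add_business_days` is an illustrative helper that skips only weekends (no public holidays), and it counts from the day after commencement, which the clause itself does not specify.

```python
from datetime import date, timedelta


def add_business_days(start: date, days: int) -> date:
    """Advance by the given number of weekdays (Mon-Fri); holidays are ignored."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:   # 0-4 are Monday to Friday
            remaining -= 1
    return current


event_start = date(2026, 1, 5)                      # hypothetical, a Monday
notice_deadline = add_business_days(event_start, 5)  # 5 business days
naive_deadline = event_start + timedelta(days=5)     # 5 calendar days
day_120 = event_start + timedelta(days=120)          # 120 calendar days

print(notice_deadline, naive_deadline, day_120)
```

With this start date the business-day deadline falls two days later than the calendar-day one, which is why the two units must be extracted into separate, correctly named fields.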
10 · Legal · Non-Compete Geographic Scope
0 vs 91
Input
Section 11 — RESTRICTIVE COVENANTS. 11.1 Non-Competition: For a period of eighteen (18) months following termination of employment for any reason, Employee shall not, directly or indirectly, engage in, own, manage, operate, consult for, or be employed by any Competing Business within the Restricted Territory. "Competing Business" means any entity that derives more than twenty percent (20%) of its annual revenue from products or services substantially similar to those offered by Company. "Restricted Territory" means the United States, the United Kingdom, and any country in which Company generated more than $5,000,000 in revenue during the twelve (12) months preceding termination. 11.2 Non-Solicitation of Employees: For a period of twenty-four (24) months following termination, Employee shall not recruit, solicit, or induce any person employed by Company (or who was employed within the six (6) months preceding such solicitation). 11.3 Non-Solicitation of Customers: For a period of twelve (12) months following termination, Employee shall not solicit any customer or prospective customer with whom Employee had material contact during the last twenty-four (24) months of employment. 11.4 Non-Disparagement: For a period of thirty-six (36) months following termination, Employee shall not make any public statements that disparage or defame Company, its officers, directors, or products. This obligation is mutual — Company shall instruct its officers and directors to refrain from disparaging Employee.
Expected Output
{
  "non_compete.duration_months": 18,
  "non_compete.scope": "Engage, own, manage, operate, consult, or be employed",
  "non_compete.competing_business_threshold": "20% of annual revenue from similar products/services",
  "non_compete.restricted_territory": [
    "United States",
    "United Kingdom",
    "Countries with >$5M Company revenue in prior 12 months"
  ],
  "non_solicitation_employees.duration_months": 24,
  "non_solicitation_employees.lookback_months": 6,
  "non_solicitation_customers.duration_months": 12,
  "non_solicitation_customers.contact_lookback_months": 24,
  "non_disparagement.duration_months": 36,
  "non_disparagement.is_mutual": true,
  "non_disparagement.company_scope": "Officers and directors"
}
Qwen 3.5 Flash
29897ms · 0/100

Incorrect/missing fields: non_compete.duration_months, non_compete.scope, non_compete.competing_business_threshold, non_compete.restricted_territory, non_solicitation_employees.duration_months, non_solicitation_employees.lookback_months, non_solicitation_customers.duration_months, non_solicitation_customers.contact_lookback_months, non_disparagement.duration_months, non_disparagement.is_mutual, non_disparagement.company_scope

non_compete.duration_months: - (expected: 18)
non_compete.scope: - (expected: Engage, own, manage, operate, consult, or be employed)
non_compete.competing_business_threshold: - (expected: 20% of annual revenue from similar products/services)
non_compete.restricted_territory: - (expected: ["United States","United Kingdom","Countries with >$5M Company revenue in prior 12 months"])
non_solicitation_employees.duration_months: - (expected: 24)
non_solicitation_employees.lookback_months: - (expected: 6)
non_solicitation_customers.duration_months: - (expected: 12)
non_solicitation_customers.contact_lookback_months: - (expected: 24)
non_disparagement.duration_months: - (expected: 36)
non_disparagement.is_mutual: - (expected: true)
non_disparagement.company_scope: - (expected: Officers and directors)
0/11 fields correct
GPT-4.1
2777ms · 91/100

Incorrect/missing fields: non_compete.scope, non_compete.competing_business_threshold, non_disparagement.company_scope

non_compete.duration_months: 18
non_compete.scope: Employee shall not, directly or indirectly, engage in, own, manage, operate, consult for, or be employed by any Competing Business. (expected: Engage, own, manage, operate, consult, or be employed)
non_compete.competing_business_threshold: Any entity that derives more than 20% of its annual revenue from products or services substantially similar to those offered by Company. (expected: 20% of annual revenue from similar products/services)
non_compete.restricted_territory: ["United States","United Kingdom","Any country in which Company generated more than $5,000,000 in revenue during the 12 months preceding termination"]
non_solicitation_employees.duration_months: 24
non_solicitation_employees.lookback_months: 6
non_solicitation_customers.duration_months: 12
non_solicitation_customers.contact_lookback_months: 24
non_disparagement.duration_months: 36
non_disparagement.is_mutual: true
non_disparagement.company_scope: Company, its officers, directors, or products (expected: Officers and directors)
8/11 fields correct
Why this is hard: Each covenant has a DIFFERENT duration (18/24/12/36 months). Models frequently assign a single duration to all. The geographic scope has a dynamic '$5M revenue' threshold — not just static countries. Non-disparagement is MUTUAL. Non-solicitation of employees has a 6-month lookback.
11 · Medical · Acute vs Chronic Symptom Classification
0 vs 91
Input
Patient: Female, 67yo, presents to ED with acute onset chest pain (started 2 hours ago, 8/10 severity, substernal, radiating to left arm), on a background of chronic stable angina (diagnosed 2019, managed with isosorbide mononitrate 60mg QD). History of Type 2 DM (HbA1c 7.2%, on metformin 1000mg BID + empagliflozin 25mg QD), hypertension (BP 168/94 on arrival, home meds: amlodipine 10mg + lisinopril 40mg), and previous NSTEMI (March 2023). Allergies: Aspirin (urticaria), Atorvastatin (myalgia — switched to rosuvastatin 20mg). Current vitals: HR 96 bpm, RR 22, SpO2 94% on RA, Temp 36.8°C.
Expected Output
{
  "demographics.sex": "Female",
  "demographics.age": 67,
  "chief_complaint": "Acute onset chest pain",
  "presenting_symptoms": [
    {
      "symptom": "Chest pain",
      "onset": "Acute",
      "duration": "2 hours",
      "severity": "8/10",
      "location": "Substernal",
      "radiation": "Left arm"
    }
  ],
  "chronic_conditions": [
    {
      "condition": "Chronic stable angina",
      "diagnosed": "2019",
      "status": "Managed"
    },
    {
      "condition": "Type 2 Diabetes Mellitus",
      "marker": "HbA1c 7.2%"
    },
    {
      "condition": "Hypertension",
      "current_bp": "168/94"
    },
    {
      "condition": "Previous NSTEMI",
      "date": "March 2023"
    }
  ],
  "medications": [
    {
      "name": "Isosorbide mononitrate",
      "dose": "60mg",
      "frequency": "QD",
      "indication": "Angina"
    },
    {
      "name": "Metformin",
      "dose": "1000mg",
      "frequency": "BID",
      "indication": "Diabetes"
    },
    {
      "name": "Empagliflozin",
      "dose": "25mg",
      "frequency": "QD",
      "indication": "Diabetes"
    },
    {
      "name": "Amlodipine",
      "dose": "10mg",
      "frequency": "QD",
      "indication": "Hypertension"
    },
    {
      "name": "Lisinopril",
      "dose": "40mg",
      "frequency": "QD",
      "indication": "Hypertension"
    },
    {
      "name": "Rosuvastatin",
      "dose": "20mg",
      "frequency": "QD",
      "indication": "Cholesterol"
    }
  ],
  "allergies": [
    {
      "drug": "Aspirin",
      "reaction": "Urticaria"
    },
    {
      "drug": "Atorvastatin",
      "reaction": "Myalgia",
      "note": "Switched to rosuvastatin"
    }
  ],
  "vitals.hr_bpm": 96,
  "vitals.rr": 22,
  "vitals.spo2_percent": 94,
  "vitals.temp_celsius": 36.8
}
Qwen 3.5 Flash
33793ms · 0/100

Incorrect/missing fields: demographics.sex, demographics.age, chief_complaint, presenting_symptoms, chronic_conditions, medications, allergies, vitals.hr_bpm, vitals.rr, vitals.spo2_percent, vitals.temp_celsius

demographics.sex: - (expected: Female)
demographics.age: - (expected: 67)
chief_complaint: - (expected: Acute onset chest pain)
presenting_symptoms: - (expected: [{"symptom":"Chest pain","onset":"Acute","duration":"2 hours","severity":"8/10","location":"Substernal","radiation":"Left arm"}])
chronic_conditions: - (expected: [{"condition":"Chronic stable angina","diagnosed":"2019","status":"Managed"},{"condition":"Type 2 Diabetes Mellitus","marker":"HbA1c 7.2%"},{"condition":"Hypertension","current_bp":"168/94"},{"condition":"Previous NSTEMI","date":"March 2023"}])
medications: - (expected: [{"name":"Isosorbide mononitrate","dose":"60mg","frequency":"QD","indication":"Angina"},{"name":"Metformin","dose":"1000mg","frequency":"BID","indication":"Diabetes"},{"name":"Empagliflozin","dose":"25mg","frequency":"QD","indication":"Diabetes"},{"name":"Amlodipine","dose":"10mg","frequency":"QD","indication":"Hypertension"},{"name":"Lisinopril","dose":"40mg","frequency":"QD","indication":"Hypertension"},{"name":"Rosuvastatin","dose":"20mg","frequency":"QD","indication":"Cholesterol"}])
allergies: - (expected: [{"drug":"Aspirin","reaction":"Urticaria"},{"drug":"Atorvastatin","reaction":"Myalgia","note":"Switched to rosuvastatin"}])
vitals.hr_bpm: - (expected: 96)
vitals.rr: - (expected: 22)
vitals.spo2_percent: - (expected: 94)
vitals.temp_celsius: - (expected: 36.8)
0/11 fields correct
GPT-4.1
4850ms · 91/100

Incorrect/missing fields: chronic_conditions, medications

demographics.sexfemale
demographics.age67
chief_complaintAcute onset chest pain
presenting_symptoms[{"symptom":"chest pain","onset":"acute, started 2 hours ago","duration":"2 hours","severity":"8/10","location":"substernal","radiation":"left arm"}]
chronic_conditions[{"condition":"chronic stable angina","diagnosed":"2019","status":"stable, on treatment"},{"condition":"type 2 diabetes mellitus","diagnosed":"unknown","status":"managed, HbA1c 7.2%"},{"condition":"hypertension","diagnosed":"unknown","status":"elevated BP 168/94 on arrival"},{"condition":"previous NSTEMI","diagnosed":"March 2023","status":"history"}](expected: [{"condition":"Chronic stable angina","diagnosed":"2019","status":"Managed"},{"condition":"Type 2 Diabetes Mellitus","marker":"HbA1c 7.2%"},{"condition":"Hypertension","current_bp":"168/94"},{"condition":"Previous NSTEMI","date":"March 2023"}])
medications[{"name":"isosorbide mononitrate","dose":"60mg","frequency":"once daily (QD)","indication":"chronic stable angina"},{"name":"metformin","dose":"1000mg","frequency":"twice daily (BID)","indication":"type 2 diabetes"},{"name":"empagliflozin","dose":"25mg","frequency":"once daily (QD)","indication":"type 2 diabetes"},{"name":"amlodipine","dose":"10mg","frequency":"once daily (QD)","indication":"hypertension"},{"name":"lisinopril","dose":"40mg","frequency":"once daily (QD)","indication":"hypertension"},{"name":"rosuvastatin","dose":"20mg","frequency":"once daily (QD)","indication":"hyperlipidemia"}](expected: [{"name":"Isosorbide mononitrate","dose":"60mg","frequency":"QD","indication":"Angina"},{"name":"Metformin","dose":"1000mg","frequency":"BID","indication":"Diabetes"},{"name":"Empagliflozin","dose":"25mg","frequency":"QD","indication":"Diabetes"},{"name":"Amlodipine","dose":"10mg","frequency":"QD","indication":"Hypertension"},{"name":"Lisinopril","dose":"40mg","frequency":"QD","indication":"Hypertension"},{"name":"Rosuvastatin","dose":"20mg","frequency":"QD","indication":"Cholesterol"}])
allergies[{"drug":"Aspirin","reaction":"urticaria"},{"drug":"Atorvastatin","reaction":"myalgia"}]
vitals.hr_bpm96
vitals.rr22
vitals.spo2_percent94
vitals.temp_celsius36.8
9/11 fields correct
Why this is hard: The chest pain is ACUTE (2 hours), while the angina is CHRONIC (diagnosed 2019); models must not merge the two. All 6 medications must be extracted with the correct frequency. Atorvastatin is an ALLERGY, not a current medication; rosuvastatin is its replacement. Vitals must be numeric values, not strings.
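The pitfalls listed above can be caught with a small post-extraction check. The following is a minimal sketch, not part of the benchmark's scorer; the function name `check_intake_record` is hypothetical, and it assumes the flat key layout shown in the expected output (`medications`, `allergies`, and `vitals.*` keys).

```python
from numbers import Number


def check_intake_record(record: dict) -> list[str]:
    """Return human-readable problems found in an extracted intake record."""
    problems = []

    # A drug recorded as an allergy should not also appear as a current medication.
    allergy_drugs = {a["drug"].lower() for a in record.get("allergies", [])}
    for med in record.get("medications", []):
        if med["name"].lower() in allergy_drugs:
            problems.append(f"{med['name']} is listed as both an allergy and a medication")

    # Vitals must be real numbers; bool is excluded because it subclasses int.
    for key, value in record.items():
        if key.startswith("vitals."):
            if isinstance(value, bool) or not isinstance(value, Number):
                problems.append(f"{key} should be numeric, got {type(value).__name__}")

    return problems


if __name__ == "__main__":
    sample = {
        "medications": [{"name": "Atorvastatin"}],
        "allergies": [{"drug": "Atorvastatin"}],
        "vitals.hr_bpm": "96",
    }
    for problem in check_intake_record(sample):
        print(problem)
```

A check like this only flags structural mistakes; it cannot tell whether an acute symptom was merged into a chronic condition, which still needs field-level comparison against the expected output.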
12 · Medical · Lab Results with Reference Ranges
5 vs 95
Input
COMPREHENSIVE METABOLIC PANEL — Collected: 02/15/2026 07:30. Glucose: 187 mg/dL (ref: 70-100, FLAG: HIGH). BUN: 42 mg/dL (ref: 7-20, FLAG: HIGH). Creatinine: 2.1 mg/dL (ref: 0.7-1.3, FLAG: HIGH). eGFR: 28 mL/min/1.73m² (ref: >60, FLAG: LOW — Stage 4 CKD). Sodium: 138 mEq/L (ref: 136-145). Potassium: 5.6 mEq/L (ref: 3.5-5.0, FLAG: HIGH — CRITICAL). Chloride: 102 mEq/L (ref: 98-106). CO2: 19 mEq/L (ref: 23-29, FLAG: LOW). Calcium: 8.2 mg/dL (ref: 8.5-10.5, FLAG: LOW). Total Protein: 6.8 g/dL (ref: 6.0-8.3). Albumin: 3.1 g/dL (ref: 3.5-5.5, FLAG: LOW). Bilirubin, Total: 0.9 mg/dL (ref: 0.1-1.2). ALT: 24 U/L (ref: 7-56). AST: 31 U/L (ref: 10-40). Alk Phos: 98 U/L (ref: 44-147).
Expected Output
{
  "panel_type": "Comprehensive Metabolic Panel",
  "collected": "2026-02-15T07:30",
  "results": [
    {
      "test": "Glucose",
      "value": 187,
      "unit": "mg/dL",
      "ref_low": 70,
      "ref_high": 100,
      "flag": "HIGH"
    },
    {
      "test": "BUN",
      "value": 42,
      "unit": "mg/dL",
      "ref_low": 7,
      "ref_high": 20,
      "flag": "HIGH"
    },
    {
      "test": "Creatinine",
      "value": 2.1,
      "unit": "mg/dL",
      "ref_low": 0.7,
      "ref_high": 1.3,
      "flag": "HIGH"
    },
    {
      "test": "eGFR",
      "value": 28,
      "unit": "mL/min/1.73m²",
      "ref_low": 60,
      "ref_high": null,
      "flag": "LOW",
      "note": "Stage 4 CKD"
    },
    {
      "test": "Sodium",
      "value": 138,
      "unit": "mEq/L",
      "ref_low": 136,
      "ref_high": 145,
      "flag": null
    },
    {
      "test": "Potassium",
      "value": 5.6,
      "unit": "mEq/L",
      "ref_low": 3.5,
      "ref_high": 5,
      "flag": "HIGH",
      "critical": true
    },
    {
      "test": "Chloride",
      "value": 102,
      "unit": "mEq/L",
      "ref_low": 98,
      "ref_high": 106,
      "flag": null
    },
    {
      "test": "CO2",
      "value": 19,
      "unit": "mEq/L",
      "ref_low": 23,
      "ref_high": 29,
      "flag": "LOW"
    },
    {
      "test": "Calcium",
      "value": 8.2,
      "unit": "mg/dL",
      "ref_low": 8.5,
      "ref_high": 10.5,
      "flag": "LOW"
    },
    {
      "test": "Total Protein",
      "value": 6.8,
      "unit": "g/dL",
      "ref_low": 6,
      "ref_high": 8.3,
      "flag": null
    },
    {
      "test": "Albumin",
      "value": 3.1,
      "unit": "g/dL",
      "ref_low": 3.5,
      "ref_high": 5.5,
      "flag": "LOW"
    },
    {
      "test": "Bilirubin Total",
      "value": 0.9,
      "unit": "mg/dL",
      "ref_low": 0.1,
      "ref_high": 1.2,
      "flag": null
    },
    {
      "test": "ALT",
      "value": 24,
      "unit": "U/L",
      "ref_low": 7,
      "ref_high": 56,
      "flag": null
    },
    {
      "test": "AST",
      "value": 31,
      "unit": "U/L",
      "ref_low": 10,
      "ref_high": 40,
      "flag": null
    },
    {
      "test": "Alk Phos",
      "value": 98,
      "unit": "U/L",
      "ref_low": 44,
      "ref_high": 147,
      "flag": null
    }
  ],
  "critical_values": [
    "Potassium 5.6 mEq/L"
  ],
  "abnormal_count": 7,
  "normal_count": 8
}
Qwen 3.5 Flash
26265ms · 5/100

Incorrect/missing fields: panel_type, collected, critical_values, abnormal_count, normal_count, results

panel_type-(expected: Comprehensive Metabolic Panel)
collected-(expected: 2026-02-15T07:30)
results[{"test_name":"Glucose","value":187,"unit":"mg/dL","reference_range":"70-100","status":"HIGH","flagged":true,"notes":null},{"test_name":"BUN","value":42,"unit":"mg/dL","reference_range":"7-20","status":"HIGH","flagged":true,"notes":null},{"test_name":"Creatinine","value":2.1,"unit":"mg/dL","reference_range":"0.7-1.3","status":"HIGH","flagged":true,"notes":null},{"test_name":"eGFR","value":28,"unit":"mL/min/1.73m²","reference_range":">60","status":"LOW","flagged":true,"notes":"Stage 4 CKD"},{"test_name":"Sodium","value":138,"unit":"mEq/L","reference_range":"136-145","status":"NORMAL","flagged":false,"notes":null},{"test_name":"Potassium","value":5.6,"unit":"mEq/L","reference_range":"3.5-5.0","status":"HIGH","flagged":true,"notes":"CRITICAL"},{"test_name":"Chloride","value":102,"unit":"mEq/L","reference_range":"98-106","status":"NORMAL","flagged":false,"notes":null},{"test_name":"CO2","value":19,"unit":"mEq/L","reference_range":"23-29","status":"LOW","flagged":true,"notes":null},{"test_name":"Calcium","value":8.2,"unit":"mg/dL","reference_range":"8.5-10.5","status":"LOW","flagged":true,"notes":null},{"test_name":"Total Protein","value":6.8,"unit":"g/dL","reference_range":"6.0-8.3","status":"NORMAL","flagged":false,"notes":null},{"test_name":"Albumin","value":3.1,"unit":"g/dL","reference_range":"3.5-5.5","status":"LOW","flagged":true,"notes":null},{"test_name":"Bilirubin, Total","value":0.9,"unit":"mg/dL","reference_range":"0.1-1.2","status":"NORMAL","flagged":false,"notes":null},{"test_name":"ALT","value":24,"unit":"U/L","reference_range":"7-56","status":"NORMAL","flagged":false,"notes":null},{"test_name":"AST","value":31,"unit":"U/L","reference_range":"10-40","status":"NORMAL","flagged":false,"notes":null},{"test_name":"Alk Phos","value":98,"unit":"U/L","reference_range":"44-147","status":"NORMAL","flagged":false,"notes":null}](expected: [{"test":"Glucose","value":187,"unit":"mg/dL","ref_low":70,"ref_high":100,"flag":"HIGH"},{"test":"BUN","value":42,"unit":"mg/dL","ref_low":7,"ref_high":20,"flag":"HIGH"},{"test":"Creatinine","value":2.1,"unit":"mg/dL","ref_low":0.7,"ref_high":1.3,"flag":"HIGH"},{"test":"eGFR","value":28,"unit":"mL/min/1.73m²","ref_low":60,"ref_high":null,"flag":"LOW","note":"Stage 4 CKD"},{"test":"Sodium","value":138,"unit":"mEq/L","ref_low":136,"ref_high":145,"flag":null},{"test":"Potassium","value":5.6,"unit":"mEq/L","ref_low":3.5,"ref_high":5,"flag":"HIGH","critical":true},{"test":"Chloride","value":102,"unit":"mEq/L","ref_low":98,"ref_high":106,"flag":null},{"test":"CO2","value":19,"unit":"mEq/L","ref_low":23,"ref_high":29,"flag":"LOW"},{"test":"Calcium","value":8.2,"unit":"mg/dL","ref_low":8.5,"ref_high":10.5,"flag":"LOW"},{"test":"Total Protein","value":6.8,"unit":"g/dL","ref_low":6,"ref_high":8.3,"flag":null},{"test":"Albumin","value":3.1,"unit":"g/dL","ref_low":3.5,"ref_high":5.5,"flag":"LOW"},{"test":"Bilirubin Total","value":0.9,"unit":"mg/dL","ref_low":0.1,"ref_high":1.2,"flag":null},{"test":"ALT","value":24,"unit":"U/L","ref_low":7,"ref_high":56,"flag":null},{"test":"AST","value":31,"unit":"U/L","ref_low":10,"ref_high":40,"flag":null},{"test":"Alk Phos","value":98,"unit":"U/L","ref_low":44,"ref_high":147,"flag":null}])
critical_values-(expected: ["Potassium 5.6 mEq/L"])
abnormal_count-(expected: 7)
normal_count-(expected: 8)
0/6 fields correct
GPT-4.1
6567ms · 95/100

Incorrect/missing fields: critical_values

panel_typeComprehensive Metabolic Panel
collected2026-02-15T07:30:00
results[{"test":"Glucose","value":187,"unit":"mg/dL","ref_low":70,"ref_high":100,"flag":"HIGH"},{"test":"BUN","value":42,"unit":"mg/dL","ref_low":7,"ref_high":20,"flag":"HIGH"},{"test":"Creatinine","value":2.1,"unit":"mg/dL","ref_low":0.7,"ref_high":1.3,"flag":"HIGH"},{"test":"eGFR","value":28,"unit":"mL/min/1.73m²","ref_low":60,"ref_high":999,"flag":"LOW"},{"test":"Sodium","value":138,"unit":"mEq/L","ref_low":136,"ref_high":145,"flag":"NORMAL"},{"test":"Potassium","value":5.6,"unit":"mEq/L","ref_low":3.5,"ref_high":5,"flag":"HIGH"},{"test":"Chloride","value":102,"unit":"mEq/L","ref_low":98,"ref_high":106,"flag":"NORMAL"},{"test":"CO2","value":19,"unit":"mEq/L","ref_low":23,"ref_high":29,"flag":"LOW"},{"test":"Calcium","value":8.2,"unit":"mg/dL","ref_low":8.5,"ref_high":10.5,"flag":"LOW"},{"test":"Total Protein","value":6.8,"unit":"g/dL","ref_low":6,"ref_high":8.3,"flag":"NORMAL"},{"test":"Albumin","value":3.1,"unit":"g/dL","ref_low":3.5,"ref_high":5.5,"flag":"LOW"},{"test":"Bilirubin, Total","value":0.9,"unit":"mg/dL","ref_low":0.1,"ref_high":1.2,"flag":"NORMAL"},{"test":"ALT","value":24,"unit":"U/L","ref_low":7,"ref_high":56,"flag":"NORMAL"},{"test":"AST","value":31,"unit":"U/L","ref_low":10,"ref_high":40,"flag":"NORMAL"},{"test":"Alk Phos","value":98,"unit":"U/L","ref_low":44,"ref_high":147,"flag":"NORMAL"}]
critical_values["Potassium"](expected: ["Potassium 5.6 mEq/L"])
abnormal_count7
normal_count8
5/6 fields correct
Why this is hard: All 15 lab tests must be extracted. Potassium must be flagged as CRITICAL (not just HIGH). eGFR has a one-sided reference range (>60), so ref_high should be null. Numeric values must be numbers, not strings. The abnormal count (7) and normal count (8) must sum to 15.
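The one-sided range rule is easy to get wrong, as GPT-4.1's `"ref_high":999` for eGFR shows. Here is a minimal sketch of turning a reference-range string into numeric `ref_low` / `ref_high` values, with `None` (JSON `null`) for the open side. The helper name `parse_ref_range` and the accepted input shapes (`"70-100"`, `">60"`, `"<5"`) are assumptions for illustration, not something the benchmark specifies.

```python
import re
from typing import Optional, Tuple

NUMBER = r"(\d+(?:\.\d+)?)"


def parse_ref_range(text: str) -> Tuple[Optional[float], Optional[float]]:
    """Parse '70-100', '>60' or '<5' into a (ref_low, ref_high) pair of numbers."""
    text = text.strip()

    # Two-sided range such as "0.7-1.3".
    match = re.fullmatch(NUMBER + r"\s*-\s*" + NUMBER, text)
    if match:
        return float(match.group(1)), float(match.group(2))

    # Lower bound only, such as ">60": there is no upper limit, so ref_high is None.
    match = re.fullmatch(r">\s*" + NUMBER, text)
    if match:
        return float(match.group(1)), None

    # Upper bound only, such as "<5".
    match = re.fullmatch(r"<\s*" + NUMBER, text)
    if match:
        return None, float(match.group(1))

    raise ValueError(f"unrecognized reference range: {text!r}")


if __name__ == "__main__":
    for raw in ["70-100", "0.7-1.3", ">60"]:
        print(raw, parse_ref_range(raw))
```

Returning `None` rather than a sentinel such as 999 keeps the open side distinguishable from a real limit when the result is serialized to JSON.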
13 · Medical · Medication Reconciliation with Conflicts
6 vs 60
Input
MEDICATION RECONCILIATION — Inpatient Day 3. Current Orders: (1) Warfarin 5mg PO QD (for AFib — INR target 2.0-3.0, last INR 2.4 on 02/13). (2) Metoprolol Succinate 100mg PO QD. (3) Diltiazem ER 240mg PO QD. (4) Amiodarone 200mg PO QD (started 02/12 for rate-refractory AFib). (5) Fluconazole 400mg IV QD (Day 5 of 14 for candidemia). (6) Enoxaparin 80mg SQ Q12H (DVT prophylaxis — started before warfarin was therapeutic). (7) Acetaminophen 1000mg PO Q6H PRN. (8) Omeprazole 40mg PO QD. Home Medications NOT continued: Aspirin 81mg QD (held due to warfarin + enoxaparin = triple antithrombotic risk).
Expected Output
{
  "active_medications": [
    {
      "name": "Warfarin",
      "dose": "5mg",
      "route": "PO",
      "frequency": "QD",
      "indication": "AFib"
    },
    {
      "name": "Metoprolol Succinate",
      "dose": "100mg",
      "route": "PO",
      "frequency": "QD"
    },
    {
      "name": "Diltiazem ER",
      "dose": "240mg",
      "route": "PO",
      "frequency": "QD"
    },
    {
      "name": "Amiodarone",
      "dose": "200mg",
      "route": "PO",
      "frequency": "QD",
      "indication": "Rate-refractory AFib",
      "start_date": "02/12"
    },
    {
      "name": "Fluconazole",
      "dose": "400mg",
      "route": "IV",
      "frequency": "QD",
      "indication": "Candidemia",
      "day": 5,
      "total_days": 14
    },
    {
      "name": "Enoxaparin",
      "dose": "80mg",
      "route": "SQ",
      "frequency": "Q12H",
      "indication": "DVT prophylaxis"
    },
    {
      "name": "Acetaminophen",
      "dose": "1000mg",
      "route": "PO",
      "frequency": "Q6H",
      "prn": true
    },
    {
      "name": "Omeprazole",
      "dose": "40mg",
      "route": "PO",
      "frequency": "QD"
    }
  ],
  "held_medications": [
    {
      "name": "Aspirin",
      "dose": "81mg",
      "reason": "Triple antithrombotic risk with warfarin + enoxaparin"
    }
  ],
  "potential_interactions": [
    {
      "drugs": [
        "Warfarin",
        "Fluconazole"
      ],
      "severity": "Major",
      "effect": "Fluconazole inhibits CYP2C9, significantly increasing warfarin levels and bleeding risk"
    },
    {
      "drugs": [
        "Warfarin",
        "Amiodarone"
      ],
      "severity": "Major",
      "effect": "Amiodarone inhibits CYP2C9/1A2/3A4, increasing warfarin effect — typical dose reduction 30-50%"
    },
    {
      "drugs": [
        "Metoprolol",
        "Diltiazem"
      ],
      "severity": "Major",
      "effect": "Additive AV nodal blockade — risk of severe bradycardia or heart block"
    },
    {
      "drugs": [
        "Warfarin",
        "Enoxaparin"
      ],
      "severity": "High",
      "effect": "Dual anticoagulation increases bleeding risk"
    }
  ],
  "therapeutic_duplications": [
    {
      "category": "Rate control",
      "drugs": [
        "Metoprolol",
        "Diltiazem",
        "Amiodarone"
      ],
      "note": "Triple rate control — assess need for all three"
    }
  ]
}
Qwen 3.5 Flash
31182ms · 6/100

Incorrect/missing fields: active_medications, held_medications, therapeutic_duplications, potential_interactions

active_medications-(expected: [{"name":"Warfarin","dose":"5mg","route":"PO","frequency":"QD","indication":"AFib"},{"name":"Metoprolol Succinate","dose":"100mg","route":"PO","frequency":"QD"},{"name":"Diltiazem ER","dose":"240mg","route":"PO","frequency":"QD"},{"name":"Amiodarone","dose":"200mg","route":"PO","frequency":"QD","indication":"Rate-refractory AFib","start_date":"02/12"},{"name":"Fluconazole","dose":"400mg","route":"IV","frequency":"QD","indication":"Candidemia","day":5,"total_days":14},{"name":"Enoxaparin","dose":"80mg","route":"SQ","frequency":"Q12H","indication":"DVT prophylaxis"},{"name":"Acetaminophen","dose":"1000mg","route":"PO","frequency":"Q6H","prn":true},{"name":"Omeprazole","dose":"40mg","route":"PO","frequency":"QD"}])
held_medications-(expected: [{"name":"Aspirin","dose":"81mg","reason":"Triple antithrombotic risk with warfarin + enoxaparin"}])
potential_interactions[{"id":1,"drugs_involved":["Warfarin","Amiodarone"],"severity":"Major","type":"Pharmacokinetic (CYP2C9 Inhibition)","risk_description":"Amiodarone inhibits the metabolism of Warfarin, significantly increasing INR and bleeding risk.","recommendation":"Monitor INR closely; Warfarin dose reduction is often required upon initiation."},{"id":2,"drugs_involved":["Warfarin","Fluconazole"],"severity":"Major","type":"Pharmacokinetic (CYP2C9 Inhibition)","risk_description":"Fluconazole inhibits CYP2C9, potentiating the effect of Warfarin and increasing INR.","recommendation":"Monitor INR closely during Fluconazole therapy."},{"id":3,"drugs_involved":["Metoprolol Succinate","Diltiazem ER"],"severity":"Moderate to Major","type":"Pharmacodynamic (Additive Negative Chronotropy)","risk_description":"Combination increases risk of severe bradycardia, AV block, and hypotension.","recommendation":"Monitor heart rate and blood pressure frequently."},{"id":4,"drugs_involved":["Warfarin","Enoxaparin"],"severity":"Moderate","type":"Pharmacodynamic (Anticoagulant Additivity)","risk_description":"Concurrent use (Bridging) increases bleeding risk compared to monotherapy.","recommendation":"Standard bridging protocol; discontinue Enoxaparin once therapeutic INR reached."},{"id":5,"drugs_involved":["Warfarin","Acetaminophen"],"severity":"Moderate","type":"Pharmacodynamic/Pharmacokinetic","risk_description":"High-dose or chronic use (up to 4g/day) can potentiate Warfarin effect via VKORC1 inhibition.","recommendation":"Use lowest effective dose; monitor INR if used regularly at high doses."},{"id":6,"drugs_involved":["Warfarin","Aspirin","Enoxaparin"],"status":"Managed (Action Taken)","severity":"Major","type":"Pharmacodynamic (Triple Antithrombotic)","risk_description":"Combined antiplatelet and dual anticoagulant use poses extremely high bleeding risk.","management":"Aspirin has been held per reconciliation notes."}](expected: [{"drugs":["Warfarin","Fluconazole"],"severity":"Major","effect":"Fluconazole inhibits CYP2C9, significantly increasing warfarin levels and bleeding risk"},{"drugs":["Warfarin","Amiodarone"],"severity":"Major","effect":"Amiodarone inhibits CYP2C9/1A2/3A4, increasing warfarin effect — typical dose reduction 30-50%"},{"drugs":["Metoprolol","Diltiazem"],"severity":"Major","effect":"Additive AV nodal blockade — risk of severe bradycardia or heart block"},{"drugs":["Warfarin","Enoxaparin"],"severity":"High","effect":"Dual anticoagulation increases bleeding risk"}])
therapeutic_duplications-(expected: [{"category":"Rate control","drugs":["Metoprolol","Diltiazem","Amiodarone"],"note":"Triple rate control — assess need for all three"}])
0/4 fields correct
GPT-4.1
9035ms · 60/100

Incorrect/missing fields: potential_interactions, therapeutic_duplications

active_medications[{"name":"Warfarin","dose":"5mg","route":"PO","frequency":"QD","indication":"Atrial fibrillation (INR target 2.0-3.0)"},{"name":"Metoprolol Succinate","dose":"100mg","route":"PO","frequency":"QD","indication":"Rate control in atrial fibrillation"},{"name":"Diltiazem ER","dose":"240mg","route":"PO","frequency":"QD","indication":"Rate control in atrial fibrillation"},{"name":"Amiodarone","dose":"200mg","route":"PO","frequency":"QD","indication":"Rate-refractory atrial fibrillation"},{"name":"Fluconazole","dose":"400mg","route":"IV","frequency":"QD","indication":"Candidemia"},{"name":"Enoxaparin","dose":"80mg","route":"SQ","frequency":"Q12H","indication":"DVT prophylaxis; bridge until warfarin therapeutic"},{"name":"Acetaminophen","dose":"1000mg","route":"PO","frequency":"Q6H PRN","indication":"Pain/fever"},{"name":"Omeprazole","dose":"40mg","route":"PO","frequency":"QD","indication":"GI prophylaxis"}]
held_medications[{"name":"Aspirin","dose":"81mg","reason":"Held due to increased risk of bleeding with concurrent warfarin and enoxaparin (triple antithrombotic risk)."}]
potential_interactions[{"drugs":["Fluconazole","Warfarin"],"severity":"High","effect":"Fluconazole inhibits warfarin metabolism, increasing bleeding risk; requires closer INR monitoring."},{"drugs":["Warfarin","Enoxaparin"],"severity":"High","effect":"Increased risk of bleeding with concurrent anticoagulants."},{"drugs":["Amiodarone","Warfarin"],"severity":"Moderate","effect":"Amiodarone can increase warfarin's anticoagulant effect by inhibiting warfarin metabolism; monitor INR closely."},{"drugs":["Amiodarone","Diltiazem"],"severity":"Moderate","effect":"Additive effects on heart rate and AV conduction; risk of bradycardia or heart block."},{"drugs":["Metoprolol","Diltiazem"],"severity":"Moderate","effect":"Additive effects on AV nodal blockade; may cause bradycardia or heart block."}](expected: [{"drugs":["Warfarin","Fluconazole"],"severity":"Major","effect":"Fluconazole inhibits CYP2C9, significantly increasing warfarin levels and bleeding risk"},{"drugs":["Warfarin","Amiodarone"],"severity":"Major","effect":"Amiodarone inhibits CYP2C9/1A2/3A4, increasing warfarin effect — typical dose reduction 30-50%"},{"drugs":["Metoprolol","Diltiazem"],"severity":"Major","effect":"Additive AV nodal blockade — risk of severe bradycardia or heart block"},{"drugs":["Warfarin","Enoxaparin"],"severity":"High","effect":"Dual anticoagulation increases bleeding risk"}])
therapeutic_duplications[{"category":"Rate control agents","drugs":["Metoprolol Succinate","Diltiazem ER","Amiodarone"],"note":"Multiple agents for rate control may increase risk of bradycardia or AV block."},{"category":"Antithrombotic therapy","drugs":["Warfarin","Enoxaparin"],"note":"Dual anticoagulation during warfarin bridge; monitor for excessive bleeding."}](expected: [{"category":"Rate control","drugs":["Metoprolol","Diltiazem","Amiodarone"],"note":"Triple rate control — assess need for all three"}])
2/4 fields correct
Why this is hard: Models must identify all 4 expected drug interactions (three rated Major, one rated High). The Warfarin-Fluconazole CYP2C9 interaction is frequently missed by smaller models. The Metoprolol + Diltiazem dual AV nodal blockade must be flagged. Aspirin is HELD, not active. Acetaminophen must be marked PRN.
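To see which interaction pairs a model missed or added, the drug lists can be compared as unordered sets, so that `["Fluconazole","Warfarin"]` and `["Warfarin","Fluconazole"]` count as the same pair. This is a hedged sketch only: the page does not say how the benchmark's scorer handles ordering, and the helper names `interaction_pairs` and `compare_pairs` are hypothetical. The sample data is the `drugs` field from the expected output and from GPT-4.1's response above.

```python
def interaction_pairs(interactions: list[dict]) -> set[frozenset]:
    """Collect each interaction's drugs as a case-insensitive, order-free set."""
    return {frozenset(d.lower() for d in item["drugs"]) for item in interactions}


def compare_pairs(expected: list[dict], actual: list[dict]) -> dict:
    """Report pairs missing from the model output and pairs it added."""
    exp, act = interaction_pairs(expected), interaction_pairs(actual)
    return {
        "missing": sorted(sorted(pair) for pair in exp - act),
        "extra": sorted(sorted(pair) for pair in act - exp),
    }


EXPECTED = [
    {"drugs": ["Warfarin", "Fluconazole"]},
    {"drugs": ["Warfarin", "Amiodarone"]},
    {"drugs": ["Metoprolol", "Diltiazem"]},
    {"drugs": ["Warfarin", "Enoxaparin"]},
]

GPT41 = [
    {"drugs": ["Fluconazole", "Warfarin"]},
    {"drugs": ["Warfarin", "Enoxaparin"]},
    {"drugs": ["Amiodarone", "Warfarin"]},
    {"drugs": ["Amiodarone", "Diltiazem"]},
    {"drugs": ["Metoprolol", "Diltiazem"]},
]

if __name__ == "__main__":
    print(compare_pairs(EXPECTED, GPT41))
```

On this data the comparison reports no missing pairs and one extra pair (Amiodarone with Diltiazem). Exact name matching is a simplification: a response that writes "Metoprolol Succinate" instead of "Metoprolol" would need name normalization first.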
14 · Medical · Surgical Note Parsing
25 vs 95
Input
OPERATIVE REPORT. Date: 02/20/2026. Surgeon: Dr. Sarah Chen, MD, FACS. Assistant: Dr. James Park, MD. Anesthesia: General endotracheal (Dr. Reeves). Procedure: Laparoscopic cholecystectomy converted to open cholecystectomy. Indication: Acute cholecystitis with empyema, failed medical management. Findings: Gallbladder severely inflamed, gangrenous with empyema. Dense adhesions to duodenum and hepatic flexure of colon. Critical view of safety could NOT be obtained laparoscopically — decision to convert at 47 minutes. Common bile duct diameter 6mm, no stones, confirmed with intraoperative cholangiogram. Estimated blood loss: 350mL. Specimens: Gallbladder sent to pathology. Drain: 19-Fr Blake drain placed in Morrison's pouch. Complications: None intraoperative. Patient extubated and transferred to PACU in stable condition.
Expected Output
{
  "date": "2026-02-20",
  "surgeon.name": "Dr. Sarah Chen",
  "surgeon.credentials": "MD, FACS",
  "assistant.name": "Dr. James Park",
  "assistant.credentials": "MD",
  "anesthesiologist": "Dr. Reeves",
  "anesthesia_type": "General endotracheal",
  "procedure_planned": "Laparoscopic cholecystectomy",
  "procedure_actual": "Open cholecystectomy",
  "was_converted": true,
  "conversion_time_minutes": 47,
  "indication": "Acute cholecystitis with empyema",
  "findings.gallbladder_status": "Gangrenous with empyema",
  "findings.adhesions": "Dense, to duodenum and hepatic flexure",
  "findings.critical_view_obtained": false,
  "findings.cbd_diameter_mm": 6,
  "findings.cbd_stones": false,
  "findings.cholangiogram_performed": true,
  "ebl_ml": 350,
  "specimens": [
    "Gallbladder"
  ],
  "drain.type": "19-Fr Blake",
  "drain.location": "Morrison's pouch",
  "complications": "None",
  "disposition": "PACU, stable"
}
Qwen 3.5 Flash
16796ms · 25/100

Incorrect/missing fields: surgeon.name, surgeon.credentials, assistant.name, assistant.credentials, anesthesiologist, anesthesia_type, procedure_planned, procedure_actual, was_converted, conversion_time_minutes, findings.gallbladder_status, findings.critical_view_obtained, findings.cbd_diameter_mm, findings.cbd_stones, ebl_ml, drain.type, drain.location, complications

date: 02/20/2026
surgeon.name: - (expected: Dr. Sarah Chen)
surgeon.credentials: - (expected: MD, FACS)
assistant.name: - (expected: Dr. James Park)
assistant.credentials: - (expected: MD)
anesthesiologist: - (expected: Dr. Reeves)
anesthesia_type: - (expected: General endotracheal)
procedure_planned: - (expected: Laparoscopic cholecystectomy)
procedure_actual: - (expected: Open cholecystectomy)
was_converted: - (expected: true)
conversion_time_minutes: - (expected: 47)
indication: Acute cholecystitis with empyema, failed medical management
findings.gallbladder_status: - (expected: Gangrenous with empyema)
findings.adhesions: Dense adhesions to duodenum and hepatic flexure of colon
findings.critical_view_obtained: - (expected: false)
findings.cbd_diameter_mm: - (expected: 6)
findings.cbd_stones: - (expected: false)
findings.cholangiogram_performed: true
ebl_ml: - (expected: 350)
specimens: ["Gallbladder"]
drain.type: - (expected: 19-Fr Blake)
drain.location: - (expected: Morrison's pouch)
complications: None intraoperative (expected: None)
disposition: Patient extubated and transferred to PACU in stable condition
6/24 fields correct
GPT-4.1
3095ms · 95/100

Incorrect/missing fields: complications, disposition

date: 2026-02-20
surgeon.name: Dr. Sarah Chen
surgeon.credentials: MD, FACS
assistant.name: Dr. James Park
assistant.credentials: MD
anesthesiologist: Dr. Reeves
anesthesia_type: General endotracheal
procedure_planned: Laparoscopic cholecystectomy
procedure_actual: Open cholecystectomy (after conversion)
was_converted: true
conversion_time_minutes: 47
indication: Acute cholecystitis with empyema, failed medical management
findings.gallbladder_status: Severely inflamed, gangrenous with empyema
findings.adhesions: Dense adhesions to duodenum and hepatic flexure of colon
findings.critical_view_obtained: false
findings.cbd_diameter_mm: 6
findings.cbd_stones: false
findings.cholangiogram_performed: true
ebl_ml: 350
specimens: ["Gallbladder"]
drain.type: 19-Fr Blake drain
drain.location: Morrison's pouch
complications: None intraoperative (expected: None)
disposition: Extubated and transferred to PACU in stable condition (expected: PACU, stable)
22/24 fields correct
Why this is hard: was_converted must be true. critical_view_obtained must be false (this is WHY the surgeons converted). cbd_stones must be false (the duct measured 6 mm with no stones). conversion_time_minutes must be 47. Models often miss the conversion narrative or default critical_view_obtained to true.
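The four hard requirements named above are exact boolean and integer values, so they can be checked strictly, including type. The sketch below is illustrative only and is not the benchmark's scoring code; `REQUIRED` and `failed_assertions` are hypothetical names, and the flat dotted keys follow the expected output for this task.

```python
# Hard requirements for this task, taken from the expected output above.
REQUIRED = {
    "was_converted": True,
    "findings.critical_view_obtained": False,
    "findings.cbd_stones": False,
    "conversion_time_minutes": 47,
}


def failed_assertions(output: dict) -> list[str]:
    """Return the required keys that are missing, mistyped, or have the wrong value."""
    failures = []
    for key, wanted in REQUIRED.items():
        got = output.get(key)
        # Compare types exactly so that "47" (string) or 1 (int for True) does not pass.
        if key not in output or type(got) is not type(wanted) or got != wanted:
            failures.append(key)
    return failures


if __name__ == "__main__":
    partial = {"findings.cholangiogram_performed": True, "conversion_time_minutes": "47"}
    print(failed_assertions(partial))
```

The exact type comparison is a deliberate choice here: Python treats `True == 1`, so a plain equality test would accept an integer where a boolean is required.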
15 · Medical · Radiology Report Structured Extraction
6 vs 91
Input
CT CHEST WITH CONTRAST — 02/18/2026. CLINICAL INDICATION: 67F with history of NSCLC s/p right upper lobectomy (2024), new cough, rule out recurrence. TECHNIQUE: Helical CT from thoracic inlet to adrenals with 80mL Omnipaque 350 IV contrast. COMPARISON: CT Chest 08/12/2025. FINDINGS: LUNGS: New 1.8 × 1.4 cm spiculated soft tissue nodule in the right lower lobe (series 4, image 187), suspicious for recurrence. No prior correlate. Stable 4mm ground-glass nodule in the left lower lobe (previously noted, unchanged from 08/2025). Post-surgical changes in the right upper lobe with expected fibrotic bands. MEDIASTINUM: New subcarinal lymph node measuring 1.6 cm short axis (previously 0.8 cm), concerning for metastatic adenopathy. Right hilar lymphadenopathy, largest node 1.2 cm (new). Heart size normal. No pericardial effusion. PLEURA: Small right-sided pleural effusion (new). No pneumothorax. BONES: No suspicious osseous lesions. Mild degenerative changes thoracic spine. UPPER ABDOMEN: Left adrenal gland 2.1 cm nodule with enhancement (not present on prior) — cannot exclude metastasis. Right adrenal normal. IMPRESSION: 1. New 1.8 cm spiculated RLL nodule — highly suspicious for recurrence of NSCLC. 2. New mediastinal and right hilar lymphadenopathy — concerning for nodal metastasis. 3. New left adrenal nodule — metastasis cannot be excluded; recommend dedicated adrenal CT or PET-CT. 4. New small right pleural effusion — may represent malignant effusion in this context. 5. Stable LLL 4mm GGN — recommend continued surveillance. RECOMMENDATION: PET-CT recommended. Multidisciplinary tumor board review.
Expected Output
{
  "study_type": "CT Chest with contrast",
  "date": "2026-02-18",
  "clinical_history": "67F, NSCLC s/p right upper lobectomy (2024), new cough",
  "comparison_study": "CT Chest 08/12/2025",
  "new_findings": [
    {
      "location": "Right lower lobe",
      "description": "1.8 × 1.4 cm spiculated soft tissue nodule",
      "concern": "Recurrence of NSCLC",
      "severity": "Highly suspicious"
    },
    {
      "location": "Subcarinal",
      "description": "Lymph node 1.6 cm short axis (was 0.8 cm)",
      "concern": "Metastatic adenopathy"
    },
    {
      "location": "Right hilum",
      "description": "Lymphadenopathy, largest 1.2 cm",
      "concern": "Metastatic adenopathy"
    },
    {
      "location": "Right pleura",
      "description": "Small pleural effusion",
      "concern": "Possible malignant effusion"
    },
    {
      "location": "Left adrenal",
      "description": "2.1 cm enhancing nodule",
      "concern": "Cannot exclude metastasis"
    }
  ],
  "stable_findings": [
    {
      "location": "Left lower lobe",
      "description": "4mm ground-glass nodule",
      "comparison": "Unchanged from 08/2025"
    }
  ],
  "post_surgical": "Right upper lobectomy changes with fibrotic bands",
  "impressions_count": 5,
  "recommendations": [
    "PET-CT",
    "Multidisciplinary tumor board review",
    "Continued surveillance of LLL GGN"
  ],
  "overall_concern": "Likely NSCLC recurrence with possible nodal and adrenal metastasis"
}
Qwen 3.5 Flash
31664ms · 6/100

Incorrect/missing fields: study_type, date, clinical_history, comparison_study, new_findings, stable_findings, post_surgical, impressions_count, overall_concern, recommendations

study_type-(expected: CT Chest with contrast)
date-(expected: 2026-02-18)
clinical_history-(expected: 67F, NSCLC s/p right upper lobectomy (2024), new cough)
comparison_study-(expected: CT Chest 08/12/2025)
new_findings-(expected: [{"location":"Right lower lobe","description":"1.8 × 1.4 cm spiculated soft tissue nodule","concern":"Recurrence of NSCLC","severity":"Highly suspicious"},{"location":"Subcarinal","description":"Lymph node 1.6 cm short axis (was 0.8 cm)","concern":"Metastatic adenopathy"},{"location":"Right hilum","description":"Lymphadenopathy, largest 1.2 cm","concern":"Metastatic adenopathy"},{"location":"Right pleura","description":"Small pleural effusion","concern":"Possible malignant effusion"},{"location":"Left adrenal","description":"2.1 cm enhancing nodule","concern":"Cannot exclude metastasis"}])
stable_findings-(expected: [{"location":"Left lower lobe","description":"4mm ground-glass nodule","comparison":"Unchanged from 08/2025"}])
post_surgical-(expected: Right upper lobectomy changes with fibrotic bands)
impressions_count-(expected: 5)
recommendations["PET-CT recommended.","Multidisciplinary tumor board review."](expected: ["PET-CT","Multidisciplinary tumor board review","Continued surveillance of LLL GGN"])
overall_concern-(expected: Likely NSCLC recurrence with possible nodal and adrenal metastasis)
0/10 fields correct
GPT-4.1
8157 ms · 91/100

Incorrect/missing fields: new_findings

study_type: CT Chest with Contrast
date: 2026-02-18
clinical_history: 67F with history of NSCLC s/p right upper lobectomy (2024), new cough, rule out recurrence.
comparison_study: CT Chest 08/12/2025
new_findings: [{"location":"Right lower lobe (lung)","description":"New 1.8 × 1.4 cm spiculated soft tissue nodule, no prior correlate","concern":"Suspicious for recurrence of NSCLC","severity":"High"},{"location":"Subcarinal mediastinal lymph node","description":"Enlarged to 1.6 cm short axis (previously 0.8 cm)","concern":"Concerning for metastatic adenopathy","severity":"High"},{"location":"Right hilar lymph nodes","description":"New lymphadenopathy, largest node 1.2 cm","concern":"Concerning for nodal metastasis","severity":"High"},{"location":"Right pleural space","description":"Small right-sided pleural effusion","concern":"May represent malignant effusion in this context","severity":"Moderate"},{"location":"Left adrenal gland","description":"New 2.1 cm nodule with enhancement, not present before","concern":"Cannot exclude metastasis","severity":"High"}] (expected: [{"location":"Right lower lobe","description":"1.8 × 1.4 cm spiculated soft tissue nodule","concern":"Recurrence of NSCLC","severity":"Highly suspicious"},{"location":"Subcarinal","description":"Lymph node 1.6 cm short axis (was 0.8 cm)","concern":"Metastatic adenopathy"},{"location":"Right hilum","description":"Lymphadenopathy, largest 1.2 cm","concern":"Metastatic adenopathy"},{"location":"Right pleura","description":"Small pleural effusion","concern":"Possible malignant effusion"},{"location":"Left adrenal","description":"2.1 cm enhancing nodule","concern":"Cannot exclude metastasis"}])
stable_findings: [{"location":"Left lower lobe (lung)","description":"4mm ground-glass nodule","comparison":"Stable, unchanged from 08/2025"},{"location":"Bones (thoracic spine)","description":"Mild degenerative changes","comparison":"No suspicious osseous lesions"}]
post_surgical: Right upper lobe: Post-surgical changes with expected fibrotic bands.
impressions_count: 5
recommendations: ["PET-CT recommended","Multidisciplinary tumor board review","Consider dedicated adrenal CT or PET-CT for left adrenal nodule","Continued surveillance of stable LLL ground-glass nodule"]
overall_concern: Findings are highly concerning for recurrence and metastatic disease from prior NSCLC.
9/10 fields correct
Why this is hard: Must identify all 5 NEW findings and distinguish from the 1 STABLE finding. The subcarinal node GREW (0.8→1.6cm) — this is new/changed, not stable. Left adrenal nodule 'cannot exclude metastasis' — models must preserve the uncertainty language. Recommendations must include all three actions.
16 · Finance · Earnings Call P&L Extraction
0 vs 99
Input
Q4 2025 Earnings Call — TechCorp Inc. (Ticker: TCHK). CFO Maria Santos: "Total revenue for Q4 was $4.23 billion, up 18% year-over-year. Breaking that down: Cloud Services revenue was $2.87 billion, growing 31% and now representing 68% of total revenue. Enterprise Software was $980 million, roughly flat. Professional Services contributed $380 million, down 12% as we continue the strategic shift away from lower-margin consulting. On the cost side, GAAP gross profit was $2.71 billion, for a gross margin of 64.1%. Non-GAAP gross margin, excluding $142 million in stock-based compensation allocated to cost of revenue, was 67.4%. Operating expenses: R&D was $890 million, up from $760 million — we've been investing heavily in our AI platform. Sales & Marketing was $620 million, and G&A was $210 million. GAAP operating income was $990 million, and non-GAAP operating income was $1.38 billion. The delta is primarily SBC of $310 million, restructuring charges of $47 million related to the EMEA headcount reduction, and $33 million in acquisition-related costs. GAAP net income was $742 million, or $2.47 per diluted share on 300.4 million shares. Non-GAAP EPS was $3.84."
Expected Output
{
  "period": "Q4 2025",
  "company": "TechCorp Inc.",
  "ticker": "TCHK",
  "total_revenue_billions": 4.23,
  "revenue_yoy_growth": "18%",
  "revenue_breakdown": [
    {
      "segment": "Cloud Services",
      "revenue_billions": 2.87,
      "growth": "31%",
      "pct_of_total": 68
    },
    {
      "segment": "Enterprise Software",
      "revenue_millions": 980,
      "growth": "~0%"
    },
    {
      "segment": "Professional Services",
      "revenue_millions": 380,
      "growth": "-12%"
    }
  ],
  "gaap_gross_profit_billions": 2.71,
  "gaap_gross_margin_pct": 64.1,
  "non_gaap_gross_margin_pct": 67.4,
  "sbc_in_cogs_millions": 142,
  "operating_expenses.rd_millions": 890,
  "operating_expenses.rd_prior_millions": 760,
  "operating_expenses.sales_marketing_millions": 620,
  "operating_expenses.ga_millions": 210,
  "gaap_operating_income_millions": 990,
  "non_gaap_operating_income_millions": 1380,
  "gaap_non_gaap_reconciliation": [
    {
      "item": "Stock-based compensation",
      "amount_millions": 310
    },
    {
      "item": "Restructuring charges",
      "amount_millions": 47
    },
    {
      "item": "Acquisition-related costs",
      "amount_millions": 33
    }
  ],
  "gaap_net_income_millions": 742,
  "gaap_eps": 2.47,
  "non_gaap_eps": 3.84,
  "diluted_shares_millions": 300.4
}
Qwen 3.5 Flash
29601 ms · 0/100

Incorrect/missing fields: period, company, ticker, total_revenue_billions, revenue_yoy_growth, revenue_breakdown, gaap_gross_profit_billions, gaap_gross_margin_pct, non_gaap_gross_margin_pct, sbc_in_cogs_millions, operating_expenses.rd_millions, operating_expenses.rd_prior_millions, operating_expenses.sales_marketing_millions, operating_expenses.ga_millions, gaap_operating_income_millions, non_gaap_operating_income_millions, gaap_non_gaap_reconciliation, gaap_net_income_millions, gaap_eps, non_gaap_eps, diluted_shares_millions

period: - (expected: Q4 2025)
company: - (expected: TechCorp Inc.)
ticker: - (expected: TCHK)
total_revenue_billions: - (expected: 4.23)
revenue_yoy_growth: - (expected: 18%)
revenue_breakdown: - (expected: [{"segment":"Cloud Services","revenue_billions":2.87,"growth":"31%","pct_of_total":68},{"segment":"Enterprise Software","revenue_millions":980,"growth":"~0%"},{"segment":"Professional Services","revenue_millions":380,"growth":"-12%"}])
gaap_gross_profit_billions: - (expected: 2.71)
gaap_gross_margin_pct: - (expected: 64.1)
non_gaap_gross_margin_pct: - (expected: 67.4)
sbc_in_cogs_millions: - (expected: 142)
operating_expenses.rd_millions: - (expected: 890)
operating_expenses.rd_prior_millions: - (expected: 760)
operating_expenses.sales_marketing_millions: - (expected: 620)
operating_expenses.ga_millions: - (expected: 210)
gaap_operating_income_millions: - (expected: 990)
non_gaap_operating_income_millions: - (expected: 1380)
gaap_non_gaap_reconciliation: - (expected: [{"item":"Stock-based compensation","amount_millions":310},{"item":"Restructuring charges","amount_millions":47},{"item":"Acquisition-related costs","amount_millions":33}])
gaap_net_income_millions: - (expected: 742)
gaap_eps: - (expected: 2.47)
non_gaap_eps: - (expected: 3.84)
diluted_shares_millions: - (expected: 300.4)
0/21 fields correct
GPT-4.1
4861 ms · 99/100

Incorrect/missing fields: revenue_breakdown

period: Q4 2025
company: TechCorp Inc.
ticker: TCHK
total_revenue_billions: 4.23
revenue_yoy_growth: 18%
revenue_breakdown: [{"segment":"Cloud Services","revenue_billions":2.87,"growth":"31%","pct_of_total":68},{"segment":"Enterprise Software","revenue_billions":0.98,"growth":"0%","pct_of_total":23.2},{"segment":"Professional Services","revenue_billions":0.38,"growth":"-12%","pct_of_total":9}] (expected: [{"segment":"Cloud Services","revenue_billions":2.87,"growth":"31%","pct_of_total":68},{"segment":"Enterprise Software","revenue_millions":980,"growth":"~0%"},{"segment":"Professional Services","revenue_millions":380,"growth":"-12%"}])
gaap_gross_profit_billions: 2.71
gaap_gross_margin_pct: 64.1
non_gaap_gross_margin_pct: 67.4
sbc_in_cogs_millions: 142
operating_expenses.rd_millions: 890
operating_expenses.rd_prior_millions: 760
operating_expenses.sales_marketing_millions: 620
operating_expenses.ga_millions: 210
gaap_operating_income_millions: 990
non_gaap_operating_income_millions: 1380
gaap_non_gaap_reconciliation: [{"item":"Stock-based compensation","amount_millions":310},{"item":"Restructuring charges","amount_millions":47},{"item":"Acquisition-related costs","amount_millions":33}]
gaap_net_income_millions: 742
gaap_eps: 2.47
non_gaap_eps: 3.84
diluted_shares_millions: 300.4
20/21 fields correct
Why this is hard: Must distinguish GAAP from non-GAAP figures throughout. The reconciliation items (SBC $310M + restructuring $47M + acquisition $33M = $390M delta) must all be captured. SBC in COGS ($142M) is SEPARATE from total SBC ($310M). Models frequently confuse these or sum them incorrectly.
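The reconciliation arithmetic described above can be checked mechanically. The following is a minimal Python sketch using only figures quoted in the call; the constant and function names are illustrative and are not part of the benchmark harness.

```python
# Figures in USD millions, taken from the earnings-call input above.
GAAP_OPERATING_INCOME = 990
NON_GAAP_OPERATING_INCOME = 1380
SBC_IN_COGS = 142  # SBC allocated to cost of revenue; not a reconciling item here

# Items the CFO lists as the GAAP to non-GAAP operating income delta.
RECONCILIATION = {
    "Stock-based compensation": 310,
    "Restructuring charges": 47,
    "Acquisition-related costs": 33,
}


def reconciliation_delta(items: dict[str, int]) -> int:
    """Sum the reconciling items (USD millions)."""
    return sum(items.values())


def reconciles(gaap: int, non_gaap: int, items: dict[str, int]) -> bool:
    """True when GAAP income plus the reconciling items equals non-GAAP income."""
    return gaap + reconciliation_delta(items) == non_gaap


if __name__ == "__main__":
    print(reconciliation_delta(RECONCILIATION))
    print(reconciles(GAAP_OPERATING_INCOME, NON_GAAP_OPERATING_INCOME, RECONCILIATION))
```

Adding the $142M COGS figure to the reconciling items breaks the identity, which is one way to catch the confusion described above.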
17 · Finance · Balance Sheet Ratio Calculation
0 vs 98
Input
CONSOLIDATED BALANCE SHEET — December 31, 2025 (in millions). ASSETS: Cash and equivalents $3,420. Short-term investments $1,850. Accounts receivable, net $2,190 (allowance for doubtful accounts $68). Inventories $890. Prepaid expenses $340. Total Current Assets $8,690. Property and equipment, net $12,400. Goodwill $8,750. Intangible assets, net $3,200. Operating lease right-of-use assets $1,860. Other non-current assets $920. Total Assets $35,820. LIABILITIES: Accounts payable $1,640. Accrued expenses $2,180. Current portion of long-term debt $500. Deferred revenue, current $1,420. Total Current Liabilities $5,740. Long-term debt $8,200 (senior notes: $5B at 3.75% due 2030, $3.2B at 4.25% due 2033). Operating lease liabilities $1,680. Deferred tax liabilities $890. Other non-current liabilities $460. Total Liabilities $16,970. EQUITY: Common stock $30. Additional paid-in capital $12,840. Retained earnings $9,180. Treasury stock ($2,400). Accumulated other comprehensive loss ($800). Total Stockholders' Equity $18,850. Total Liabilities + Equity $35,820.
Expected Output
{
  "as_of": "2025-12-31",
  "currency": "USD millions",
  "current_assets": 8690,
  "total_assets": 35820,
  "current_liabilities": 5740,
  "total_liabilities": 16970,
  "total_equity": 18850,
  "cash_and_equivalents": 3420,
  "total_debt": 8700,
  "debt_breakdown": [
    {
      "instrument": "Senior notes 3.75%",
      "amount": 5000,
      "maturity": 2030
    },
    {
      "instrument": "Senior notes 4.25%",
      "amount": 3200,
      "maturity": 2033
    }
  ],
  "computed_ratios.current_ratio": 1.51,
  "computed_ratios.quick_ratio": 1.3,
  "computed_ratios.debt_to_equity": 0.46,
  "computed_ratios.debt_to_assets": 0.24,
  "computed_ratios.working_capital_millions": 2950,
  "balance_sheet_check": true
}
Qwen 3.5 Flash
27763 ms · 0/100

Incorrect/missing fields: as_of, currency, current_assets, total_assets, current_liabilities, total_liabilities, total_equity, cash_and_equivalents, total_debt, debt_breakdown, computed_ratios.current_ratio, computed_ratios.quick_ratio, computed_ratios.debt_to_equity, computed_ratios.debt_to_assets, computed_ratios.working_capital_millions, balance_sheet_check

as_of: - (expected: 2025-12-31)
currency: - (expected: USD millions)
current_assets: - (expected: 8690)
total_assets: - (expected: 35820)
current_liabilities: - (expected: 5740)
total_liabilities: - (expected: 16970)
total_equity: - (expected: 18850)
cash_and_equivalents: - (expected: 3420)
total_debt: - (expected: 8700)
debt_breakdown: - (expected: [{"instrument":"Senior notes 3.75%","amount":5000,"maturity":2030},{"instrument":"Senior notes 4.25%","amount":3200,"maturity":2033}])
computed_ratios.current_ratio: - (expected: 1.51)
computed_ratios.quick_ratio: - (expected: 1.3)
computed_ratios.debt_to_equity: - (expected: 0.46)
computed_ratios.debt_to_assets: - (expected: 0.24)
computed_ratios.working_capital_millions: - (expected: 2950)
balance_sheet_check: - (expected: true)
0/16 fields correct
GPT-4.1
4605 ms · 98/100

Incorrect/missing fields: currency

as_of: 2025-12-31
currency: USD (expected: USD millions)
current_assets: 8690
total_assets: 35820
current_liabilities: 5740
total_liabilities: 16970
total_equity: 18850
cash_and_equivalents: 3420
total_debt: 8700
debt_breakdown: [{"instrument":"Senior Notes (3.75% due 2030)","amount":5000,"maturity":2030},{"instrument":"Senior Notes (4.25% due 2033)","amount":3200,"maturity":2033},{"instrument":"Current portion of long-term debt","amount":500,"maturity":2026}]
computed_ratios.current_ratio: 1.514
computed_ratios.quick_ratio: 1.357
computed_ratios.debt_to_equity: 0.461
computed_ratios.debt_to_assets: 0.243
computed_ratios.working_capital_millions: 2950
balance_sheet_check: true
15/16 fields correct
Why this is hard: Total debt = current portion ($500M) + long-term ($8,200M) = $8,700M. Current ratio = 8690/5740 = 1.514. Quick ratio = (8690 - 890 - 340)/5740 = 1.30. Working capital = 8690-5740 = 2950. balance_sheet_check confirms Assets = L+E (35820 = 16970 + 18850). Treasury stock is NEGATIVE equity.
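The ratio arithmetic above can be reproduced in a few lines. This is a sketch, not the benchmark's scoring code: the variable names are illustrative, and the quick ratio follows the definition used in the explanation above (current assets less inventories and prepaid expenses), which is one of several definitions in common use.

```python
# All amounts in USD millions, copied from the balance sheet input above.
current_assets = 8690
inventories = 890
prepaid_expenses = 340
current_liabilities = 5740
total_assets = 35820
total_liabilities = 16970
total_equity = 18850
current_portion_of_debt = 500
long_term_debt = 8200


def compute_ratios() -> dict:
    """Derive the computed fields of the expected output, rounded to 2 places."""
    total_debt = current_portion_of_debt + long_term_debt
    quick_assets = current_assets - inventories - prepaid_expenses
    return {
        "total_debt": total_debt,
        "current_ratio": round(current_assets / current_liabilities, 2),
        "quick_ratio": round(quick_assets / current_liabilities, 2),
        "debt_to_equity": round(total_debt / total_equity, 2),
        "debt_to_assets": round(total_debt / total_assets, 2),
        "working_capital_millions": current_assets - current_liabilities,
        # Accounting identity: assets = liabilities + equity.
        "balance_sheet_check": total_assets == total_liabilities + total_equity,
    }


if __name__ == "__main__":
    for name, value in compute_ratios().items():
        print(f"{name}: {value}")
```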
18 · Finance · Multi-Currency FX Impact Analysis
8 vs 96
Input
GEOGRAPHIC REVENUE ANALYSIS — FY2025. Americas: Reported revenue $6.24B (up 22% YoY). Europe: Reported revenue €2.18B ($2.31B at avg rate 1.06 USD/EUR), up 14% in euros but only 9% in USD due to euro weakness (prior year avg rate was 1.11 USD/EUR). Asia-Pacific: Reported ¥412B ($2.78B at avg rate 148.1 JPY/USD), up 19% in yen but down 2% in USD — the yen depreciated from 128.4 to 148.1 JPY/USD. Japan alone was ¥298B ($2.01B), up 23% in local currency but flat in USD. Rest of World: $680M, up 8%. Total reported revenue: $12.01B, up 15.2% YoY. On a constant-currency basis (using FY2024 exchange rates), total revenue would have been $12.68B, representing 21.7% growth. FX headwind to reported revenue: approximately $670M or 5.6 percentage points of growth.
Expected Output
{
  "fiscal_year": "FY2025",
  "total_revenue_reported_billions": 12.01,
  "total_revenue_constant_currency_billions": 12.68,
  "reported_growth_pct": 15.2,
  "constant_currency_growth_pct": 21.7,
  "fx_headwind_millions": 670,
  "fx_headwind_growth_points": 5.6,
  "regions": [
    {
      "region": "Americas",
      "reported_usd_billions": 6.24,
      "yoy_reported_pct": 22
    },
    {
      "region": "Europe",
      "local_currency": "EUR",
      "local_revenue_billions": 2.18,
      "reported_usd_billions": 2.31,
      "avg_fx_rate": 1.06,
      "prior_year_fx_rate": 1.11,
      "yoy_local_pct": 14,
      "yoy_usd_pct": 9
    },
    {
      "region": "Asia-Pacific",
      "local_currency": "JPY",
      "local_revenue_billions_jpy": 412,
      "reported_usd_billions": 2.78,
      "avg_fx_rate": 148.1,
      "prior_year_fx_rate": 128.4,
      "yoy_local_pct": 19,
      "yoy_usd_pct": -2
    },
    {
      "region": "Rest of World",
      "reported_usd_millions": 680,
      "yoy_reported_pct": 8
    }
  ],
  "japan_subset.local_revenue_billions_jpy": 298,
  "japan_subset.reported_usd_billions": 2.01,
  "japan_subset.yoy_local_pct": 23,
  "japan_subset.yoy_usd_pct": 0
}
Qwen 3.5 Flash
28418 ms · 8/100

Incorrect/missing fields: total_revenue_reported_billions, total_revenue_constant_currency_billions, reported_growth_pct, constant_currency_growth_pct, fx_headwind_millions, fx_headwind_growth_points, regions, japan_subset.local_revenue_billions_jpy, japan_subset.reported_usd_billions, japan_subset.yoy_local_pct, japan_subset.yoy_usd_pct

fiscal_year: FY2025
total_revenue_reported_billions: - (expected: 12.01)
total_revenue_constant_currency_billions: - (expected: 12.68)
reported_growth_pct: - (expected: 15.2)
constant_currency_growth_pct: - (expected: 21.7)
fx_headwind_millions: - (expected: 670)
fx_headwind_growth_points: - (expected: 5.6)
regions: - (expected: [{"region":"Americas","reported_usd_billions":6.24,"yoy_reported_pct":22},{"region":"Europe","local_currency":"EUR","local_revenue_billions":2.18,"reported_usd_billions":2.31,"avg_fx_rate":1.06,"prior_year_fx_rate":1.11,"yoy_local_pct":14,"yoy_usd_pct":9},{"region":"Asia-Pacific","local_currency":"JPY","local_revenue_billions_jpy":412,"reported_usd_billions":2.78,"avg_fx_rate":148.1,"prior_year_fx_rate":128.4,"yoy_local_pct":19,"yoy_usd_pct":-2},{"region":"Rest of World","reported_usd_millions":680,"yoy_reported_pct":8}])
japan_subset.local_revenue_billions_jpy: - (expected: 298)
japan_subset.reported_usd_billions: - (expected: 2.01)
japan_subset.yoy_local_pct: - (expected: 23)
japan_subset.yoy_usd_pct: - (expected: 0)
1/12 fields correct
GPT-4.1
2292 ms · 96/100

Incorrect/missing fields: regions

fiscal_year: FY2025
total_revenue_reported_billions: 12.01
total_revenue_constant_currency_billions: 12.68
reported_growth_pct: 15.2
constant_currency_growth_pct: 21.7
fx_headwind_millions: 670
fx_headwind_growth_points: 5.6
regions: [{"region":"Americas","reported_usd_billions":6.24,"yoy_reported_pct":22},{"region":"Europe","reported_usd_billions":2.31,"yoy_reported_pct":9},{"region":"Asia-Pacific","reported_usd_billions":2.78,"yoy_reported_pct":-2},{"region":"Rest of World","reported_usd_billions":0.68,"yoy_reported_pct":8}] (expected: [{"region":"Americas","reported_usd_billions":6.24,"yoy_reported_pct":22},{"region":"Europe","local_currency":"EUR","local_revenue_billions":2.18,"reported_usd_billions":2.31,"avg_fx_rate":1.06,"prior_year_fx_rate":1.11,"yoy_local_pct":14,"yoy_usd_pct":9},{"region":"Asia-Pacific","local_currency":"JPY","local_revenue_billions_jpy":412,"reported_usd_billions":2.78,"avg_fx_rate":148.1,"prior_year_fx_rate":128.4,"yoy_local_pct":19,"yoy_usd_pct":-2},{"region":"Rest of World","reported_usd_millions":680,"yoy_reported_pct":8}])
japan_subset.local_revenue_billions_jpy: 298
japan_subset.reported_usd_billions: 2.01
japan_subset.yoy_local_pct: 23
japan_subset.yoy_usd_pct: 0
11/12 fields correct
Why this is hard: APAC grew 19% in yen but SHRANK 2% in USD — models must correctly report negative USD growth. Japan is a SUBSET of APAC, not a separate region. FX headwind is $670M (constant_currency - reported). Prior year FX rates must be distinct from current year. Europe: 14% local growth but only 9% in USD.
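The conversions and the headwind figure above can be recomputed from the quoted inputs. This is a minimal Python sketch under stated assumptions: the helper `to_usd` and all variable names are hypothetical, and the two quote conventions (USD per EUR, JPY per USD) are read directly from the input text.

```python
def to_usd(local_amount: float, rate: float, usd_per_local: bool) -> float:
    """Convert local currency to USD under the stated quote convention."""
    return local_amount * rate if usd_per_local else local_amount / rate


# Amounts in billions; rates as quoted in the input above.
europe_usd = round(to_usd(2.18, 1.06, usd_per_local=True), 2)   # EUR at 1.06 USD/EUR
apac_usd = round(to_usd(412, 148.1, usd_per_local=False), 2)    # JPY at 148.1 JPY/USD
japan_usd = round(to_usd(298, 148.1, usd_per_local=False), 2)   # subset of Asia-Pacific

# Japan is already inside the Asia-Pacific figure, so it is not added again.
regional_total = round(6.24 + europe_usd + apac_usd + 0.68, 2)

# FX headwind = constant-currency revenue minus reported revenue, in USD millions.
fx_headwind_millions = round((12.68 - 12.01) * 1000)

# Two ways to express the headwind in "points"; the sketch computes both
# and does not decide which one the source intended.
growth_gap_points = round(21.7 - 15.2, 1)
headwind_pct_of_reported = round(fx_headwind_millions / 12010 * 100, 1)

if __name__ == "__main__":
    print(europe_usd, apac_usd, japan_usd, regional_total)
    print(fx_headwind_millions, growth_gap_points, headwind_pct_of_reported)
```

The quoted 5.6 points matches the headwind as a share of reported revenue; the gap between the two quoted growth rates (21.7% and 15.2%) is 6.5 points. The expected output keeps the figure as stated in the input.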
19 · Finance · Cash Flow Statement Extraction
0 vs 99
Input
CONSOLIDATED STATEMENT OF CASH FLOWS — FY2025 (in millions). OPERATING ACTIVITIES: Net income $2,890. Adjustments: Depreciation and amortization $1,420. Stock-based compensation $680. Deferred income taxes ($210). Changes in working capital: Accounts receivable ($340). Inventories $85. Accounts payable $220. Deferred revenue $190. Accrued expenses ($65). Net cash from operating activities $4,870. INVESTING ACTIVITIES: Capital expenditures ($2,180). Acquisitions, net of cash acquired ($3,400). Purchases of short-term investments ($2,800). Maturities of short-term investments $1,950. Proceeds from sale of assets $120. Net cash used in investing activities ($6,310). FINANCING ACTIVITIES: Proceeds from issuance of debt $3,000. Repayment of debt ($1,500). Share repurchases ($2,200). Dividends paid ($580). Proceeds from employee stock plans $340. Net cash used in financing activities ($940). Effect of exchange rate changes ($85). NET DECREASE IN CASH ($2,465). Cash at beginning of period $5,885. Cash at end of period $3,420.
Expected Output
{
  "fiscal_year": "FY2025",
  "operating_cash_flow": 4870,
  "investing_cash_flow": -6310,
  "financing_cash_flow": -940,
  "fx_effect": -85,
  "net_change_in_cash": -2465,
  "beginning_cash": 5885,
  "ending_cash": 3420,
  "free_cash_flow": 2690,
  "key_items.net_income": 2890,
  "key_items.depreciation_amortization": 1420,
  "key_items.sbc": 680,
  "key_items.capex": -2180,
  "key_items.acquisitions": -3400,
  "key_items.share_repurchases": -2200,
  "key_items.dividends": -580,
  "key_items.debt_issued": 3000,
  "key_items.debt_repaid": -1500,
  "cash_flow_check": true
}
Qwen 3.5 Flash
31631 ms · 0/100

Incorrect/missing fields: fiscal_year, operating_cash_flow, investing_cash_flow, financing_cash_flow, fx_effect, net_change_in_cash, beginning_cash, ending_cash, free_cash_flow, key_items.net_income, key_items.depreciation_amortization, key_items.sbc, key_items.capex, key_items.acquisitions, key_items.share_repurchases, key_items.dividends, key_items.debt_issued, key_items.debt_repaid, cash_flow_check

fiscal_year: - (expected: FY2025)
operating_cash_flow: - (expected: 4870)
investing_cash_flow: - (expected: -6310)
financing_cash_flow: - (expected: -940)
fx_effect: - (expected: -85)
net_change_in_cash: - (expected: -2465)
beginning_cash: - (expected: 5885)
ending_cash: - (expected: 3420)
free_cash_flow: - (expected: 2690)
key_items.net_income: - (expected: 2890)
key_items.depreciation_amortization: - (expected: 1420)
key_items.sbc: - (expected: 680)
key_items.capex: - (expected: -2180)
key_items.acquisitions: - (expected: -3400)
key_items.share_repurchases: - (expected: -2200)
key_items.dividends: - (expected: -580)
key_items.debt_issued: - (expected: 3000)
key_items.debt_repaid: - (expected: -1500)
cash_flow_check: - (expected: true)
0/19 fields correct
GPT-4.1
2180 ms · 99/100

All fields correct

fiscal_year: 2025
operating_cash_flow: 4870
investing_cash_flow: -6310
financing_cash_flow: -940
fx_effect: -85
net_change_in_cash: -2465
beginning_cash: 5885
ending_cash: 3420
free_cash_flow: 2690
key_items.net_income: 2890
key_items.depreciation_amortization: 1420
key_items.sbc: 680
key_items.capex: -2180
key_items.acquisitions: -3400
key_items.share_repurchases: -2200
key_items.dividends: -580
key_items.debt_issued: 3000
key_items.debt_repaid: -1500
cash_flow_check: true
19/19 fields correct
Why this is hard: Free cash flow = Operating CF ($4,870) - CapEx ($2,180) = $2,690. Cash flow check: 4870 + (-6310) + (-940) + (-85) = -2465, and 5885 + (-2465) = 3420. Negative values MUST have negative signs. Models often forget to negate outflows or miscalculate FCF by including acquisitions.
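The free cash flow and cash reconciliation checks above can be written as a short Python sketch. Function and variable names are illustrative; outflows are stored with their negative signs, as the expected output requires.

```python
# All amounts in USD millions, from the cash flow statement input above.
operating_cf = 4870
investing_cf = -6310
financing_cf = -940
fx_effect = -85
capex = -2180          # outflow, stored as a negative number
acquisitions = -3400   # outflow; not part of free cash flow in this definition
beginning_cash = 5885
ending_cash = 3420


def free_cash_flow(operating: int, capex_outflow: int) -> int:
    """FCF = operating cash flow minus capital expenditures.

    capex_outflow is negative, so adding it subtracts the spend.
    """
    return operating + capex_outflow


def net_change_in_cash(*flows: int) -> int:
    """Sum of operating, investing, financing flows and the FX effect."""
    return sum(flows)


def cash_flow_check(begin: int, end: int, *flows: int) -> bool:
    """True when beginning cash plus the net change equals ending cash."""
    return begin + net_change_in_cash(*flows) == end


if __name__ == "__main__":
    print(free_cash_flow(operating_cf, capex))
    print(net_change_in_cash(operating_cf, investing_cf, financing_cf, fx_effect))
    print(cash_flow_check(beginning_cash, ending_cash,
                          operating_cf, investing_cf, financing_cf, fx_effect))
```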
20 · Finance · Convertible Note Terms Extraction
20 vs 93
Input
SERIES B CONVERTIBLE NOTE TERM SHEET — Acme AI, Inc. Principal Amount: $25,000,000. Issuance Date: January 15, 2026. Maturity Date: January 15, 2028 (24 months). Interest Rate: 6% per annum, simple interest, compounded annually, payable at conversion or maturity (not current-pay). Conversion Discount: 25% discount to the price per share in the Next Qualified Financing (minimum $50M raise). Valuation Cap: $200,000,000 pre-money. Conversion Mechanics: Notes convert at the LOWER of (a) the Valuation Cap price or (b) the Discount Price. If no Qualified Financing occurs by maturity, noteholder may elect: (i) conversion at the Valuation Cap, (ii) repayment of principal plus accrued interest, or (iii) extension for 12 months at 8% interest. Anti-Dilution: Broad-based weighted average adjustment. Most Favored Nation: If the Company issues subsequent convertible instruments with more favorable terms (lower cap or higher discount), these notes shall automatically adjust to the more favorable terms. Pro Rata Rights: Noteholders have the right to participate in the Next Qualified Financing for up to 2× their original investment. Information Rights: Monthly financial statements, annual audited financials, and board observer seat for investors holding >$5M in notes.
Expected Output
{
  "instrument": "Series B Convertible Note",
  "company": "Acme AI, Inc.",
  "principal_amount": 25000000,
  "issuance_date": "2026-01-15",
  "maturity_date": "2028-01-15",
  "term_months": 24,
  "interest_rate_pct": 6,
  "interest_type": "Simple, compounded annually",
  "interest_payment": "At conversion or maturity (not current-pay)",
  "conversion_discount_pct": 25,
  "qualified_financing_minimum": 50000000,
  "valuation_cap": 200000000,
  "conversion_price": "Lower of cap price or discount price",
  "maturity_options": [
    {
      "option": "Convert at valuation cap"
    },
    {
      "option": "Repayment of principal plus accrued interest"
    },
    {
      "option": "Extension for 12 months at 8% interest"
    }
  ],
  "anti_dilution": "Broad-based weighted average",
  "most_favored_nation": true,
  "mfn_triggers": [
    "Lower valuation cap",
    "Higher discount"
  ],
  "pro_rata_rights.available": true,
  "pro_rata_rights.participation_limit": "2× original investment",
  "information_rights.monthly": "Financial statements",
  "information_rights.annual": "Audited financials",
  "information_rights.board_observer": true,
  "information_rights.board_observer_threshold": 5000000
}
Qwen 3.5 Flash
24451 ms · 20/100

Incorrect/missing fields: instrument, company, interest_rate_pct, interest_payment, conversion_discount_pct, qualified_financing_minimum, valuation_cap, conversion_price, maturity_options, anti_dilution, most_favored_nation, mfn_triggers, pro_rata_rights.available, pro_rata_rights.participation_limit, information_rights.monthly, information_rights.annual, information_rights.board_observer, information_rights.board_observer_threshold, interest_type

instrument: - (expected: Series B Convertible Note)
company: - (expected: Acme AI, Inc.)
principal_amount: 25000000
issuance_date: 2026-01-15
maturity_date: 2028-01-15
term_months: 24
interest_rate_pct: - (expected: 6)
interest_type: simple (expected: Simple, compounded annually)
interest_payment: - (expected: At conversion or maturity (not current-pay))
conversion_discount_pct: - (expected: 25)
qualified_financing_minimum: - (expected: 50000000)
valuation_cap: - (expected: 200000000)
conversion_price: - (expected: Lower of cap price or discount price)
maturity_options: - (expected: [{"option":"Convert at valuation cap"},{"option":"Repayment of principal plus accrued interest"},{"option":"Extension for 12 months at 8% interest"}])
anti_dilution: - (expected: Broad-based weighted average)
most_favored_nation: - (expected: true)
mfn_triggers: - (expected: ["Lower valuation cap","Higher discount"])
pro_rata_rights.available: - (expected: true)
pro_rata_rights.participation_limit: - (expected: 2× original investment)
information_rights.monthly: - (expected: Financial statements)
information_rights.annual: - (expected: Audited financials)
information_rights.board_observer: - (expected: true)
information_rights.board_observer_threshold: - (expected: 5000000)
4/23 fields correct
GPT-4.1
4567 ms · 93/100

Incorrect/missing fields: mfn_triggers, conversion_price, pro_rata_rights.participation_limit, information_rights.monthly

instrument: Convertible Note
company: Acme AI, Inc.
principal_amount: 25000000
issuance_date: 2026-01-15
maturity_date: 2028-01-15
term_months: 24
interest_rate_pct: 6
interest_type: Simple, compounded annually
interest_payment: Payable at conversion or maturity (not current-pay)
conversion_discount_pct: 25
qualified_financing_minimum: 50000000
valuation_cap: 200000000
conversion_price: Lower of (a) Valuation Cap price or (b) Discount Price (25% discount in Qualified Financing) (expected: Lower of cap price or discount price)
maturity_options: [{"option":"Conversion at Valuation Cap"},{"option":"Repayment of principal plus accrued interest"},{"option":"Extension for 12 months at 8% interest"}]
anti_dilution: Broad-based weighted average adjustment
most_favored_nation: true
mfn_triggers: ["Lower valuation cap on new convertible instruments","Higher conversion discount on new convertible instruments"] (expected: ["Lower valuation cap","Higher discount"])
pro_rata_rights.available: true
pro_rata_rights.participation_limit: Up to 2× original investment in Next Qualified Financing (expected: 2× original investment)
information_rights.monthly: Monthly financial statements (expected: Financial statements)
information_rights.annual: Annual audited financials
information_rights.board_observer: true
information_rights.board_observer_threshold: 5000000
19/23 fields correct
Why this is hard: Interest is 'simple, compounded annually' and NOT current-pay — this distinction matters. Extension option changes rate from 6% to 8%. board_observer_threshold is for investors >$5M (not all noteholders). MFN clause triggers must be specific. Conversion is at the LOWER of cap/discount, not either.
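The "lower of cap price or discount price" rule above can be illustrated with a short Python sketch. Several things here are assumptions rather than terms from the sheet: the $10.00 round price and the pre-money valuations in the usage lines are hypothetical, the cap price is modelled as the round price scaled by cap over pre-money (which presumes both valuations are divided by the same share count), and interest is treated as plain simple interest even though the term sheet also says "compounded annually".

```python
def accrued_simple_interest(principal: int, rate_bps: int, months: int) -> int:
    """Simple (non-compounding) interest in whole dollars.

    rate_bps is the annual rate in basis points (6% -> 600).
    """
    return principal * rate_bps * months // (10_000 * 12)


def conversion_price(round_price: float, round_pre_money: float,
                     valuation_cap: float, discount_pct: float) -> float:
    """Return the LOWER of the cap price and the discount price.

    Assumption: cap price = round price * (valuation cap / round pre-money).
    """
    cap_price = round_price * valuation_cap / round_pre_money
    discount_price = round_price * (1 - discount_pct / 100)
    return min(cap_price, discount_price)


if __name__ == "__main__":
    # Terms from the sheet: $25M principal, 6% for 24 months, $200M cap, 25% discount.
    print(accrued_simple_interest(25_000_000, 600, 24))
    # Hypothetical round at $400M pre-money: the cap price is lower.
    print(conversion_price(10.0, 400_000_000, 200_000_000, 25))
    # Hypothetical round at $250M pre-money: the discount price is lower.
    print(conversion_price(10.0, 250_000_000, 200_000_000, 25))
```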
21 · Logic / Schema · Conditional Array Extraction with camelCase
0 vs 100
Input
Extract ONLY the third item from the following list IF its price is greater than $100. Format all field names in camelCase. If the condition is not met, return {"result": null, "reason": "condition_not_met"}.

Items:
1. Widget Alpha — Price: $45.00 — Category: Hardware — In Stock: Yes
2. Widget Beta — Price: $220.00 — Category: Software — In Stock: No
3. Widget Gamma — Price: $189.50 — Category: Hardware — In Stock: Yes
4. Widget Delta — Price: $67.00 — Category: Electronics — In Stock: Yes
Expected Output
{
  "result.itemName": "Widget Gamma",
  "result.price": 189.5,
  "result.category": "Hardware",
  "result.inStock": true,
  "conditionMet": true,
  "extractedIndex": 3
}
Qwen 3.5 Flash
49250 ms · 0/100

Incorrect/missing fields: result.itemName, result.price, result.category, result.inStock, conditionMet, extractedIndex

result.itemName: - (expected: Widget Gamma)
result.price: - (expected: 189.5)
result.category: - (expected: Hardware)
result.inStock: - (expected: true)
conditionMet: - (expected: true)
extractedIndex: - (expected: 3)
0/6 fields correct
GPT-4.1
1004 ms · 100/100

All fields correct

result.itemName: Widget Gamma
result.price: 189.5
result.category: Hardware
result.inStock: true
conditionMet: true
extractedIndex: 3
6/6 fields correct
Why this is hard: Must extract the THIRD item (Widget Gamma at $189.50), NOT the most expensive or first >$100. Field names MUST be camelCase (itemName, inStock, conditionMet, extractedIndex). inStock must be boolean true, not string 'Yes'. price must be numeric without dollar sign.
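The selection and type-coercion rules above can be sketched in Python. The items are represented here as already-split raw strings, which is an assumption made to keep the example short; the function name and the snake_case input keys are illustrative, while the camelCase output keys follow the expected output.

```python
RAW_ITEMS = [
    {"name": "Widget Alpha", "price": "$45.00", "category": "Hardware", "in_stock": "Yes"},
    {"name": "Widget Beta", "price": "$220.00", "category": "Software", "in_stock": "No"},
    {"name": "Widget Gamma", "price": "$189.50", "category": "Hardware", "in_stock": "Yes"},
    {"name": "Widget Delta", "price": "$67.00", "category": "Electronics", "in_stock": "Yes"},
]


def extract_nth_if_price_above(items: list, index: int, threshold: float) -> dict:
    """Return the item at 1-based `index` only if its price exceeds `threshold`."""
    raw = items[index - 1]                     # position decides, not price ranking
    price = float(raw["price"].lstrip("$"))    # "$189.50" -> 189.5
    if price <= threshold:
        return {"result": None, "reason": "condition_not_met"}
    return {
        "result": {
            "itemName": raw["name"],
            "price": price,
            "category": raw["category"],
            "inStock": raw["in_stock"] == "Yes",   # "Yes" -> boolean True
        },
        "conditionMet": True,
        "extractedIndex": index,
    }


if __name__ == "__main__":
    print(extract_nth_if_price_above(RAW_ITEMS, 3, 100))
```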
22 · Logic / Schema · Nested Object Flattening with Conflict Resolution
100 vs 100
Input
Flatten the following nested JSON into a single-level object. When key conflicts arise (same key name at different nesting levels), use dot notation with the full path. Preserve original data types. Do NOT flatten arrays — keep them as-is.

{"user": {"id": 42, "name": "Alice", "address": {"city": "Portland", "state": "OR", "zip": "97201"}}, "order": {"id": 1087, "items": [{"sku": "A1", "qty": 2}, {"sku": "B3", "qty": 1}], "total": 299.99, "address": {"city": "Seattle", "state": "WA", "zip": "98101"}}, "id": 999}
Expected Output
{
  "user.id": 42,
  "user.name": "Alice",
  "user.address.city": "Portland",
  "user.address.state": "OR",
  "user.address.zip": "97201",
  "order.id": 1087,
  "order.items": [
    {
      "sku": "A1",
      "qty": 2
    },
    {
      "sku": "B3",
      "qty": 1
    }
  ],
  "order.total": 299.99,
  "order.address.city": "Seattle",
  "order.address.state": "WA",
  "order.address.zip": "98101",
  "id": 999
}
Qwen 3.5 Flash
37480 ms · 100/100

All fields correct

user.id: 42
user.name: Alice
user.address.city: Portland
user.address.state: OR
user.address.zip: 97201
order.id: 1087
order.items: [{"sku":"A1","qty":2},{"sku":"B3","qty":1}]
order.total: 299.99
order.address.city: Seattle
order.address.state: WA
order.address.zip: 98101
id: 999
12/12 fields correct
GPT-4.1
1740 ms · 100/100

All fields correct

user.id: 42
user.name: Alice
user.address.city: Portland
user.address.state: OR
user.address.zip: 97201
order.id: 1087
order.items: [{"sku":"A1","qty":2},{"sku":"B3","qty":1}]
order.total: 299.99
order.address.city: Seattle
order.address.state: WA
order.address.zip: 98101
id: 999
12/12 fields correct
Why this is hard: There are THREE 'id' fields (user.id=42, order.id=1087, id=999) — all must be preserved with correct paths. The top-level 'id' keeps its short name. Arrays must NOT be flattened (order.items stays as array). Two different addresses (Portland and Seattle) must be distinguished by path. Data types must be preserved (integers, floats, strings, arrays).
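One way to produce the expected output is a recursive walk that builds a dot-separated path for every nested key and stops at anything that is not an object. This is a minimal Python sketch; the function name `flatten` is illustrative, and it uses the full path for every nested key (not only conflicting ones), which is what the expected output shows.

```python
def flatten(obj: dict, prefix: str = "") -> dict:
    """Flatten nested dicts into dot-notation keys; lists and scalars are kept as-is."""
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))   # recurse into nested objects only
        else:
            flat[path] = value                  # arrays are NOT flattened
    return flat


NESTED = {
    "user": {"id": 42, "name": "Alice",
             "address": {"city": "Portland", "state": "OR", "zip": "97201"}},
    "order": {"id": 1087,
              "items": [{"sku": "A1", "qty": 2}, {"sku": "B3", "qty": 1}],
              "total": 299.99,
              "address": {"city": "Seattle", "state": "WA", "zip": "98101"}},
    "id": 999,
}

if __name__ == "__main__":
    for key, value in flatten(NESTED).items():
        print(key, value)
```

Because top-level scalars have an empty prefix, the top-level `id` keeps its short name while `user.id` and `order.id` carry their paths.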
23 · Logic / Schema · Multi-step Date Reasoning
0 vs 29
Input
Calculate the number of BUSINESS DAYS between February 14, 2026 (Saturday) and March 6, 2026 (Friday), INCLUSIVE of both dates if they fall on business days. Exclude weekends (Saturday, Sunday) and the following US holidays: Presidents' Day (February 16, 2026 — Monday). Show your work by listing each business day.
Expected Output
{
  "start_date": "2026-02-14",
  "end_date": "2026-03-06",
  "start_is_business_day": false,
  "end_is_business_day": true,
  "holidays_excluded": [
    "2026-02-16"
  ],
  "business_days": [
    "2026-02-17",
    "2026-02-18",
    "2026-02-19",
    "2026-02-20",
    "2026-02-23",
    "2026-02-24",
    "2026-02-25",
    "2026-02-26",
    "2026-02-27",
    "2026-03-02",
    "2026-03-03",
    "2026-03-04",
    "2026-03-05",
    "2026-03-06"
  ],
  "total_business_days": 14
}
Qwen 3.5 Flash
22512ms · 0/100

Incorrect/missing fields: start_date, end_date, start_is_business_day, end_is_business_day, holidays_excluded, business_days, total_business_days

start_date: - (expected: 2026-02-14)
end_date: - (expected: 2026-03-06)
start_is_business_day: - (expected: false)
end_is_business_day: - (expected: true)
holidays_excluded: - (expected: ["2026-02-16"])
business_days: - (expected: ["2026-02-17","2026-02-18","2026-02-19","2026-02-20","2026-02-23","2026-02-24","2026-02-25","2026-02-26","2026-02-27","2026-03-02","2026-03-03","2026-03-04","2026-03-05","2026-03-06"])
total_business_days: - (expected: 14)
0/7 fields correct
GPT-4.1
3058ms · 29/100

Incorrect/missing fields: start_date, end_date, start_is_business_day, end_is_business_day, holidays_excluded

start_date: - (expected: 2026-02-14)
end_date: - (expected: 2026-03-06)
start_is_business_day: - (expected: false)
end_is_business_day: - (expected: true)
holidays_excluded: - (expected: ["2026-02-16"])
business_days: ["2026-02-16 (excluded, Presidents' Day)","2026-02-14 (excluded, Saturday)","2026-02-15 (excluded, Sunday)","2026-02-17","2026-02-18","2026-02-19","2026-02-20","2026-02-21 (excluded, Saturday)","2026-02-22 (excluded, Sunday)","2026-02-23","2026-02-24","2026-02-25","2026-02-26","2026-02-27","2026-02-28 (excluded, Saturday)","2026-03-01 (excluded, Sunday)","2026-03-02","2026-03-03","2026-03-04","2026-03-05","2026-03-06"]
total_business_days: 14
2/7 fields correct
Why this is hard: Feb 14 is Saturday — NOT a business day. Feb 16 is Presidents' Day — excluded. The first business day is Feb 17 (Tuesday). March 6 IS a business day (Friday) and is INCLUSIVE. Total must be exactly 14. Models commonly miscount by ±1 or forget to skip the holiday.
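The count can be reproduced with a short Python sketch using only the standard library. The variable names are illustrative, and the holiday set contains only the one date the task names.

```python
from datetime import date, timedelta

start = date(2026, 2, 14)
end = date(2026, 3, 6)
holidays = {date(2026, 2, 16)}  # Presidents' Day, as given in the task

business_days = []
day = start
while day <= end:  # inclusive of both endpoints
    # weekday(): Monday is 0, Saturday is 5, Sunday is 6
    if day.weekday() < 5 and day not in holidays:
        business_days.append(day.isoformat())
    day += timedelta(days=1)

total_business_days = len(business_days)
start_is_business_day = start.weekday() < 5 and start not in holidays
end_is_business_day = end.weekday() < 5 and end not in holidays
```

Walking the range day by day keeps the inclusive endpoints and the holiday skip explicit, which are the two places where the off-by-one errors described above tend to appear.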
24 · Logic / Schema · Schema Migration with Backward Compatibility
100 vs 100
Input
Migrate the following V1 data to V2 schema. V2 rules: (1) 'full_name' splits into 'firstName' and 'lastName'. (2) 'age' is REMOVED — compute 'birthYear' from age assuming current year is 2026. (3) 'email' stays but add 'emailDomain'. (4) 'tags' array moves under a new 'metadata' object. (5) Add 'schemaVersion': 2. (6) For backward compatibility: keep 'full_name' as 'deprecated_fullName' and 'age' as 'deprecated_age'.

V1 Data: {"full_name": "Dr. Maria Santos-Garcia", "age": 34, "email": "maria.sg@tufts-medical.edu", "tags": ["cardiology", "research", "AI/ML"], "role": "Senior Fellow"}
Expected Output
{
  "schemaVersion": 2,
  "firstName": "Maria",
  "lastName": "Santos-Garcia",
  "email": "maria.sg@tufts-medical.edu",
  "emailDomain": "tufts-medical.edu",
  "birthYear": 1992,
  "role": "Senior Fellow",
  "metadata.tags": [
    "cardiology",
    "research",
    "AI/ML"
  ],
  "deprecated_fullName": "Dr. Maria Santos-Garcia",
  "deprecated_age": 34
}
Qwen 3.5 Flash
28845ms · 100/100

All fields correct

schemaVersion: 2
firstName: Maria
lastName: Santos-Garcia
email: maria.sg@tufts-medical.edu
emailDomain: tufts-medical.edu
birthYear: 1992
role: Senior Fellow
metadata.tags: ["cardiology","research","AI/ML"]
deprecated_fullName: Dr. Maria Santos-Garcia
deprecated_age: 34
10/10 fields correct
GPT-4.1
1506ms · 100/100

All fields correct

schemaVersion: 2
firstName: Maria
lastName: Santos-Garcia
email: maria.sg@tufts-medical.edu
emailDomain: tufts-medical.edu
birthYear: 1992
role: Senior Fellow
metadata.tags: ["cardiology","research","AI/ML"]
deprecated_fullName: Dr. Maria Santos-Garcia
deprecated_age: 34
10/10 fields correct
Why this is hard: Name splitting: 'Dr.' is a title, NOT part of firstName. 'Santos-Garcia' is a hyphenated surname, NOT two last names. birthYear = 2026 - 34 = 1992. emailDomain must extract from the full email address. The 'AI/ML' tag has a slash — must be preserved as-is in the array. Deprecated fields must retain original values exactly.
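A minimal Python sketch of the six migration rules follows. The `migrate_v1_to_v2` function and the `TITLES` set are assumptions made for illustration; the benchmark does not say how titles should be detected, and the nested `metadata` object here corresponds to the `metadata.tags` path in the expected output.

```python
CURRENT_YEAR = 2026  # stated in the task prompt
TITLES = {"Dr.", "Mr.", "Mrs.", "Ms.", "Prof."}  # assumed list of titles to drop


def migrate_v1_to_v2(v1: dict) -> dict:
    # Drop the title, then treat the first token as the first name and the
    # rest as the last name, so a hyphenated surname stays in one piece.
    parts = [p for p in v1["full_name"].split() if p not in TITLES]
    first_name, last_name = parts[0], " ".join(parts[1:])
    return {
        "schemaVersion": 2,
        "firstName": first_name,
        "lastName": last_name,
        "email": v1["email"],
        "emailDomain": v1["email"].rsplit("@", 1)[1],
        "birthYear": CURRENT_YEAR - v1["age"],
        "role": v1["role"],
        "metadata": {"tags": list(v1["tags"])},  # tags copied unchanged
        "deprecated_fullName": v1["full_name"],  # original value, title included
        "deprecated_age": v1["age"],
    }


v1 = {
    "full_name": "Dr. Maria Santos-Garcia",
    "age": 34,
    "email": "maria.sg@tufts-medical.edu",
    "tags": ["cardiology", "research", "AI/ML"],
    "role": "Senior Fellow",
}
v2 = migrate_v1_to_v2(v1)
```

The first-token split is a simplification that would mishandle multi-word given names, so it should be read as a sketch for this record rather than a general name parser.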
25 · Logic / Schema · Recursive Tree Summarization
63 vs 63
Input
Given this org tree, compute for EACH node: (1) total_reports = count of ALL descendants (direct + indirect), (2) total_salary_below = sum of all descendant salaries, (3) max_depth = deepest level below this node (leaf = 0). Return the tree with computed fields added.

{"name": "CEO Jane", "salary": 450000, "reports": [{"name": "VP Engineering Bob", "salary": 320000, "reports": [{"name": "Dir Platform Carol", "salary": 250000, "reports": [{"name": "Senior Eng Dan", "salary": 185000, "reports": []}, {"name": "Senior Eng Eve", "salary": 180000, "reports": [{"name": "Eng Intern Frank", "salary": 65000, "reports": []}]}]}, {"name": "Dir ML Grace", "salary": 260000, "reports": [{"name": "ML Eng Hank", "salary": 195000, "reports": []}]}]}, {"name": "VP Sales Iris", "salary": 310000, "reports": [{"name": "Dir Enterprise Jack", "salary": 240000, "reports": []}]}]}
Expected Output
{
  "name": "CEO Jane",
  "salary": 450000,
  "total_reports": 9,
  "total_salary_below": 2005000,
  "max_depth": 4,
  "reports": [
    {
      "name": "VP Engineering Bob",
      "salary": 320000,
      "total_reports": 6,
      "total_salary_below": 1135000,
      "max_depth": 3,
      "reports": [
        {
          "name": "Dir Platform Carol",
          "salary": 250000,
          "total_reports": 3,
          "total_salary_below": 430000,
          "max_depth": 2,
          "reports": [
            {
              "name": "Senior Eng Dan",
              "salary": 185000,
              "total_reports": 0,
              "total_salary_below": 0,
              "max_depth": 0,
              "reports": []
            },
            {
              "name": "Senior Eng Eve",
              "salary": 180000,
              "total_reports": 1,
              "total_salary_below": 65000,
              "max_depth": 1,
              "reports": [
                {
                  "name": "Eng Intern Frank",
                  "salary": 65000,
                  "total_reports": 0,
                  "total_salary_below": 0,
                  "max_depth": 0,
                  "reports": []
                }
              ]
            }
          ]
        },
        {
          "name": "Dir ML Grace",
          "salary": 260000,
          "total_reports": 1,
          "total_salary_below": 195000,
          "max_depth": 1,
          "reports": [
            {
              "name": "ML Eng Hank",
              "salary": 195000,
              "total_reports": 0,
              "total_salary_below": 0,
              "max_depth": 0,
              "reports": []
            }
          ]
        }
      ]
    },
    {
      "name": "VP Sales Iris",
      "salary": 310000,
      "total_reports": 1,
      "total_salary_below": 240000,
      "max_depth": 1,
      "reports": [
        {
          "name": "Dir Enterprise Jack",
          "salary": 240000,
          "total_reports": 0,
          "total_salary_below": 0,
          "max_depth": 0,
          "reports": []
        }
      ]
    }
  ]
}
Qwen 3.5 Flash
15814ms · 63/100

Incorrect/missing fields: total_reports, total_salary_below, reports

name: CEO Jane
salary: 450000
total_reports: 10 (expected: 9)
total_salary_below: 1875000 (expected: 2005000)
max_depth: 4
reports: [{"name":"VP Engineering Bob","salary":320000,"total_reports":6,"total_salary_below":1075000,"max_depth":3,"reports":[{"name":"Dir Platform Carol","salary":250000,"total_reports":3,"total_salary_below":430000,"max_depth":2,"reports":[{"name":"Senior Eng Dan","salary":185000,"total_reports":0,"total_salary_below":0,"max_depth":0,"reports":[]},{"name":"Senior Eng Eve","salary":180000,"total_reports":1,"total_salary_below":65000,"max_depth":1,"reports":[{"name":"Eng Intern Frank","salary":65000,"total_reports":0,"total_salary_below":0,"max_depth":0,"reports":[]}]}]},{"name":"Dir ML Grace","salary":260000,"total_reports":1,"total_salary_below":195000,"max_depth":1,"reports":[{"name":"ML Eng Hank","salary":195000,"total_reports":0,"total_salary_below":0,"max_depth":0,"reports":[]}]}]},{"name":"VP Sales Iris","salary":310000,"total_reports":1,"total_salary_below":240000,"max_depth":1,"reports":[{"name":"Dir Enterprise Jack","salary":240000,"total_reports":0,"total_salary_below":0,"max_depth":0,"reports":[]}]}] (expected: [{"name":"VP Engineering Bob","salary":320000,"total_reports":6,"total_salary_below":1135000,"max_depth":3,"reports":[{"name":"Dir Platform Carol","salary":250000,"total_reports":3,"total_salary_below":430000,"max_depth":2,"reports":[{"name":"Senior Eng Dan","salary":185000,"total_reports":0,"total_salary_below":0,"max_depth":0,"reports":[]},{"name":"Senior Eng Eve","salary":180000,"total_reports":1,"total_salary_below":65000,"max_depth":1,"reports":[{"name":"Eng Intern Frank","salary":65000,"total_reports":0,"total_salary_below":0,"max_depth":0,"reports":[]}]}]},{"name":"Dir ML Grace","salary":260000,"total_reports":1,"total_salary_below":195000,"max_depth":1,"reports":[{"name":"ML Eng Hank","salary":195000,"total_reports":0,"total_salary_below":0,"max_depth":0,"reports":[]}]}]},{"name":"VP Sales Iris","salary":310000,"total_reports":1,"total_salary_below":240000,"max_depth":1,"reports":[{"name":"Dir Enterprise Jack","salary":240000,"total_reports":0,"total_salary_below":0,"max_depth":0,"reports":[]}]}])
3/6 fields correct
GPT-4.1
5784ms · 63/100

Incorrect/missing fields: total_reports, total_salary_below, reports

name: CEO Jane
salary: 450000
total_reports: 8 (expected: 9)
total_salary_below: 1425000 (expected: 2005000)
max_depth: 4
reports: [{"name":"VP Engineering Bob","salary":320000,"reports":[{"name":"Dir Platform Carol","salary":250000,"reports":[{"name":"Senior Eng Dan","salary":185000,"reports":[],"total_reports":0,"total_salary_below":0,"max_depth":0},{"name":"Senior Eng Eve","salary":180000,"reports":[{"name":"Eng Intern Frank","salary":65000,"reports":[],"total_reports":0,"total_salary_below":0,"max_depth":0}],"total_reports":1,"total_salary_below":65000,"max_depth":1}],"total_reports":3,"total_salary_below":430000,"max_depth":2},{"name":"Dir ML Grace","salary":260000,"reports":[{"name":"ML Eng Hank","salary":195000,"reports":[],"total_reports":0,"total_salary_below":0,"max_depth":0}],"total_reports":1,"total_salary_below":195000,"max_depth":1}],"total_reports":6,"total_salary_below":875000,"max_depth":3},{"name":"VP Sales Iris","salary":310000,"reports":[{"name":"Dir Enterprise Jack","salary":240000,"reports":[],"total_reports":0,"total_salary_below":0,"max_depth":0}],"total_reports":1,"total_salary_below":240000,"max_depth":1}] (expected: [{"name":"VP Engineering Bob","salary":320000,"total_reports":6,"total_salary_below":1135000,"max_depth":3,"reports":[{"name":"Dir Platform Carol","salary":250000,"total_reports":3,"total_salary_below":430000,"max_depth":2,"reports":[{"name":"Senior Eng Dan","salary":185000,"total_reports":0,"total_salary_below":0,"max_depth":0,"reports":[]},{"name":"Senior Eng Eve","salary":180000,"total_reports":1,"total_salary_below":65000,"max_depth":1,"reports":[{"name":"Eng Intern Frank","salary":65000,"total_reports":0,"total_salary_below":0,"max_depth":0,"reports":[]}]}]},{"name":"Dir ML Grace","salary":260000,"total_reports":1,"total_salary_below":195000,"max_depth":1,"reports":[{"name":"ML Eng Hank","salary":195000,"total_reports":0,"total_salary_below":0,"max_depth":0,"reports":[]}]}]},{"name":"VP Sales Iris","salary":310000,"total_reports":1,"total_salary_below":240000,"max_depth":1,"reports":[{"name":"Dir Enterprise Jack","salary":240000,"total_reports":0,"total_salary_below":0,"max_depth":0,"reports":[]}]}])
3/6 fields correct
Why this is hard: CEO total_reports = 9 (all other people). total_salary_below for CEO = 320000+250000+185000+180000+65000+260000+195000+310000+240000 = 2,005,000. max_depth for CEO = 4 (Jane→Bob→Carol→Eve→Frank). Carol has 3 total_reports (Dan+Eve+Frank). Models commonly miscount indirect reports or miscalculate recursive sums.
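The three recursive quantities can be computed bottom-up in one pass. The following Python sketch uses two hypothetical helpers, `annotate` and `person`, and rebuilds the org tree from the task input.

```python
def annotate(node: dict) -> dict:
    """Return a copy of node with the three computed fields added."""
    children = [annotate(child) for child in node["reports"]]
    return {
        "name": node["name"],
        "salary": node["salary"],
        # Each child counts itself plus everyone below it.
        "total_reports": sum(1 + c["total_reports"] for c in children),
        "total_salary_below": sum(
            c["salary"] + c["total_salary_below"] for c in children
        ),
        # A leaf has depth 0; otherwise one more than the deepest child.
        "max_depth": (1 + max(c["max_depth"] for c in children)) if children else 0,
        "reports": children,
    }


def person(name: str, salary: int, *reports: dict) -> dict:
    """Hypothetical shorthand for building one node of the input tree."""
    return {"name": name, "salary": salary, "reports": list(reports)}


org = person(
    "CEO Jane", 450000,
    person(
        "VP Engineering Bob", 320000,
        person(
            "Dir Platform Carol", 250000,
            person("Senior Eng Dan", 185000),
            person("Senior Eng Eve", 180000, person("Eng Intern Frank", 65000)),
        ),
        person("Dir ML Grace", 260000, person("ML Eng Hank", 195000)),
    ),
    person("VP Sales Iris", 310000, person("Dir Enterprise Jack", 240000)),
)

tree = annotate(org)
```

Because every node's totals are derived from its already-annotated children, an error at one level (such as the VP Engineering salary sums both models got wrong) cannot hide behind a correct leaf level.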
ROI Analysis

Should you swap?

Cost Savings
95%

Qwen 3.5 Flash is 95% cheaper than GPT-4.1

Qwen 3.5 Flash: $0.17/1k
GPT-4.1: $3.40/1k
Accuracy Gap
26.2%

Qwen 3.5 Flash is 26.2% as accurate as GPT-4.1 (60.9pt gap)

Qwen 3.5 Flash: 21.6%
GPT-4.1: 82.5%
Recommendation
Test First

GPT-4.1 outperforms by 60.9pts. Run on your data before switching.
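The headline ROI figures follow directly from the cost and accuracy numbers on this page. A small Python sketch of the arithmetic, with illustrative variable names and assuming the page rounds to the precision shown:

```python
qwen_cost, gpt_cost = 0.17, 3.40  # USD per 1k tasks
qwen_acc, gpt_acc = 21.6, 82.5    # overall accuracy, percent

# Share of GPT-4.1's cost that is saved by switching to Qwen 3.5 Flash.
cost_savings_pct = round((1 - qwen_cost / gpt_cost) * 100)

# Qwen's accuracy expressed as a fraction of GPT-4.1's accuracy.
relative_accuracy_pct = round(qwen_acc / gpt_acc * 100, 1)

# Absolute difference in percentage points.
accuracy_gap_pts = round(gpt_acc - qwen_acc, 1)
```

Note that the 26.2% figure is a ratio of accuracies, while the 60.9 figure is a difference in percentage points; the two describe the same gap in different units.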

Average Latency
Qwen 3.5 Flash: 26,092ms
GPT-4.1: 3,656ms
Accuracy by Domain

25 tasks across 5 domains

Domain | Tasks | Qwen 3.5 Flash | GPT-4.1
E-commerce | 5 | 41.6 | 99.0
Legal | 5 | 0.0 | 51.8
Medical | 5 | 8.4 | 86.4
Finance | 5 | 5.6 | 97.0
Logic / Schema | 5 | 52.6 | 78.4
Overall Average | 25 | 21.6 | 82.5
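Since every domain has the same number of tasks, the overall average is simply the mean of the five domain scores. A quick Python check of that relationship, assuming the page rounds to one decimal place:

```python
from statistics import mean

# Per-domain accuracy as (Qwen 3.5 Flash, GPT-4.1), five tasks per domain.
domain_scores = {
    "E-commerce": (41.6, 99.0),
    "Legal": (0.0, 51.8),
    "Medical": (8.4, 86.4),
    "Finance": (5.6, 97.0),
    "Logic / Schema": (52.6, 78.4),
}

qwen_overall = round(mean(q for q, _ in domain_scores.values()), 1)
gpt_overall = round(mean(g for _, g in domain_scores.values()), 1)
```

An unweighted mean is only valid here because the domains are equally sized; with uneven task counts the per-task scores would need to be averaged directly.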
Failure Deep-Dive

Where models disagree

Tasks with the largest score gap between models - showing the actual outputs and expected values side by side.

Logic / Schema · Conditional Array Extraction with camelCase

Qwen 3.5 Flash scored 0/100 vs GPT-4.1's 100/100

Input Text
Extract ONLY the third item from the following list IF its price is greater than $100. Format all field names in camelCase. If the condition is not met, return {"result": null, "reason": "condition_not_met"}.

Items:
1. Widget Alpha — Price: $45.00 — Category: Hardware — In Stock: Yes
2. Widget Beta — Price: $220.00 — Category: Software — In Stock: No
3. Widget Gamma — Price: $189.50 — Category: Hardware — In Stock: Yes
4. Widget Delta — Price: $67.00 — Category: Electronics — In Stock: Yes
Qwen 3.5 Flash: 0/100

Incorrect/missing fields: result.itemName, result.price, result.category, result.inStock, conditionMet, extractedIndex

GPT-4.1: 100/100

All fields correct

Field-by-Field Comparison
Field | Expected | Qwen 3.5 Flash | GPT-4.1
result.itemName | Widget Gamma | - | Widget Gamma
result.price | 189.5 | - | 189.5
result.category | Hardware | - | Hardware
result.inStock | true | - | true
conditionMet | true | - | true
extractedIndex | 3 | - | 3
Why this task is hard:

Must extract the THIRD item (Widget Gamma at $189.50), NOT the most expensive or first >$100. Field names MUST be camelCase (itemName, inStock, conditionMet, extractedIndex). inStock must be boolean true, not string 'Yes'. price must be numeric without dollar sign.
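The positional rule and the type coercions can be sketched as follows. The `extract_nth_if_over` function is hypothetical, and the rows are transcribed from the task input as plain tuples rather than parsed from the original text.

```python
# Rows transcribed from the task input as (name, price, category, in_stock).
ITEMS = [
    ("Widget Alpha", "$45.00", "Hardware", "Yes"),
    ("Widget Beta", "$220.00", "Software", "No"),
    ("Widget Gamma", "$189.50", "Hardware", "Yes"),
    ("Widget Delta", "$67.00", "Electronics", "Yes"),
]


def extract_nth_if_over(items, position: int = 3, threshold: float = 100.0) -> dict:
    # Select by position (1-based), never by price rank.
    name, raw_price, category, raw_stock = items[position - 1]
    price = float(raw_price.lstrip("$"))  # numeric, no dollar sign
    if price <= threshold:
        return {"result": None, "reason": "condition_not_met"}
    return {
        "result": {
            "itemName": name,
            "price": price,
            "category": category,
            "inStock": raw_stock == "Yes",  # boolean, not the string "Yes"
        },
        "conditionMet": True,
        "extractedIndex": position,
    }


output = extract_nth_if_over(ITEMS)
fallback = extract_nth_if_over(ITEMS, position=1)
```

Here `output` describes Widget Gamma, while `fallback` shows the null branch, because the first item costs less than the threshold.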

Finance · Earnings Call P&L Extraction

Qwen 3.5 Flash scored 0/100 vs GPT-4.1's 99/100

Input Text
Q4 2025 Earnings Call — TechCorp Inc. (Ticker: TCHK). CFO Maria Santos: "Total revenue for Q4 was $4.23 billion, up 18% year-over-year. Breaking that down: Cloud Services revenue was $2.87 billion, growing 31% and now representing 68% of total revenue. Enterprise Software was $980 million, roughly flat. Professional Services contributed $380 million, down 12% as we continue the strategic shift away from lower-margin consulting. On the cost side, GAAP gross profit was $2.71 billion, for a gross margin of 64.1%. Non-GAAP gross margin, excluding $142 million in stock-based compensation allocated to cost of revenue, was 67.4%. Operating expenses: R&D was $890 million, up from $760 million — we've been investing heavily in our AI platform. Sales & Marketing was $620 million, and G&A was $210 million. GAAP operating income was $990 million, and non-GAAP operating income was $1.38 billion. The delta is primarily SBC of $310 million, restructuring charges of $47 million related to the EMEA headcount reduction, and $33 million in acquisition-related costs. GAAP net income was $742 million, or $2.47 per diluted share on 300.4 million shares. Non-GAAP EPS was $3.84."
Qwen 3.5 Flash: 0/100

Incorrect/missing fields: period, company, ticker, total_revenue_billions, revenue_yoy_growth, revenue_breakdown, gaap_gross_profit_billions, gaap_gross_margin_pct, non_gaap_gross_margin_pct, sbc_in_cogs_millions, operating_expenses.rd_millions, operating_expenses.rd_prior_millions, operating_expenses.sales_marketing_millions, operating_expenses.ga_millions, gaap_operating_income_millions, non_gaap_operating_income_millions, gaap_non_gaap_reconciliation, gaap_net_income_millions, gaap_eps, non_gaap_eps, diluted_shares_millions

GPT-4.1: 99/100

Incorrect/missing fields: revenue_breakdown

Field-by-Field Comparison
Field | Expected | Qwen 3.5 Flash | GPT-4.1
period | Q4 2025 | - | Q4 2025
company | TechCorp Inc. | - | TechCorp Inc.
ticker | TCHK | - | TCHK
total_revenue_billions | 4.23 | - | 4.23
revenue_yoy_growth | 18% | - | 18%
revenue_breakdown | [{"segment":"Cloud Services","revenue_billions":2.87,"growth":"31%","pct_of_total":68},{"segment":"Enterprise Software","revenue_millions":980,"growth":"~0%"},{"segment":"Professional Services","revenue_millions":380,"growth":"-12%"}] | - | [{"segment":"Cloud Services","revenue_billions":2.87,"growth":"31%","pct_of_total":68},{"segment":"Enterprise Software","revenue_billions":0.98,"growth":"0%","pct_of_total":23.2},{"segment":"Professional Services","revenue_billions":0.38,"growth":"-12%","pct_of_total":9}]
gaap_gross_profit_billions | 2.71 | - | 2.71
gaap_gross_margin_pct | 64.1 | - | 64.1
non_gaap_gross_margin_pct | 67.4 | - | 67.4
sbc_in_cogs_millions | 142 | - | 142
operating_expenses.rd_millions | 890 | - | 890
operating_expenses.rd_prior_millions | 760 | - | 760
operating_expenses.sales_marketing_millions | 620 | - | 620
operating_expenses.ga_millions | 210 | - | 210
gaap_operating_income_millions | 990 | - | 990
non_gaap_operating_income_millions | 1380 | - | 1380
gaap_non_gaap_reconciliation | [{"item":"Stock-based compensation","amount_millions":310},{"item":"Restructuring charges","amount_millions":47},{"item":"Acquisition-related costs","amount_millions":33}] | - | [{"item":"Stock-based compensation","amount_millions":310},{"item":"Restructuring charges","amount_millions":47},{"item":"Acquisition-related costs","amount_millions":33}]
gaap_net_income_millions | 742 | - | 742
gaap_eps | 2.47 | - | 2.47
non_gaap_eps | 3.84 | - | 3.84
diluted_shares_millions | 300.4 | - | 300.4
Why this task is hard:

Must distinguish GAAP from non-GAAP figures throughout. The reconciliation items (SBC $310M + restructuring $47M + acquisition $33M = $390M delta) must all be captured. SBC in COGS ($142M) is SEPARATE from total SBC ($310M). Models frequently confuse these or sum them incorrectly.
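The reconciliation can be checked mechanically. This Python sketch, with illustrative variable names and all amounts in millions of dollars taken from the call transcript, confirms that the three items bridge GAAP to non-GAAP operating income and keeps the cost-of-revenue SBC as a separate figure.

```python
gaap_operating_income = 990
non_gaap_operating_income = 1380

# Items the CFO lists as the GAAP to non-GAAP delta.
reconciliation = {
    "Stock-based compensation": 310,
    "Restructuring charges": 47,
    "Acquisition-related costs": 33,
}

# SBC allocated to cost of revenue is a subset reported for gross margin;
# it must not be added on top of the 310 total.
sbc_in_cogs = 142

delta = non_gaap_operating_income - gaap_operating_income
reconciles = sum(reconciliation.values()) == delta
```

Adding `sbc_in_cogs` to the reconciliation would push the sum to 532 and break the check, which is one way to catch the confusion described above.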

Finance · Cash Flow Statement Extraction

Qwen 3.5 Flash scored 0/100 vs GPT-4.1's 99/100

Input Text
CONSOLIDATED STATEMENT OF CASH FLOWS — FY2025 (in millions). OPERATING ACTIVITIES: Net income $2,890. Adjustments: Depreciation and amortization $1,420. Stock-based compensation $680. Deferred income taxes ($210). Changes in working capital: Accounts receivable ($340). Inventories $85. Accounts payable $220. Deferred revenue $190. Accrued expenses ($65). Net cash from operating activities $4,870. INVESTING ACTIVITIES: Capital expenditures ($2,180). Acquisitions, net of cash acquired ($3,400). Purchases of short-term investments ($2,800). Maturities of short-term investments $1,950. Proceeds from sale of assets $120. Net cash used in investing activities ($6,310). FINANCING ACTIVITIES: Proceeds from issuance of debt $3,000. Repayment of debt ($1,500). Share repurchases ($2,200). Dividends paid ($580). Proceeds from employee stock plans $340. Net cash used in financing activities ($940). Effect of exchange rate changes ($85). NET DECREASE IN CASH ($2,465). Cash at beginning of period $5,885. Cash at end of period $3,420.
Qwen 3.5 Flash: 0/100

Incorrect/missing fields: fiscal_year, operating_cash_flow, investing_cash_flow, financing_cash_flow, fx_effect, net_change_in_cash, beginning_cash, ending_cash, free_cash_flow, key_items.net_income, key_items.depreciation_amortization, key_items.sbc, key_items.capex, key_items.acquisitions, key_items.share_repurchases, key_items.dividends, key_items.debt_issued, key_items.debt_repaid, cash_flow_check

GPT-4.1: 99/100

All fields correct

Field-by-Field Comparison
Field | Expected | Qwen 3.5 Flash | GPT-4.1
fiscal_year | FY2025 | - | 2025
operating_cash_flow | 4870 | - | 4870
investing_cash_flow | -6310 | - | -6310
financing_cash_flow | -940 | - | -940
fx_effect | -85 | - | -85
net_change_in_cash | -2465 | - | -2465
beginning_cash | 5885 | - | 5885
ending_cash | 3420 | - | 3420
free_cash_flow | 2690 | - | 2690
key_items.net_income | 2890 | - | 2890
key_items.depreciation_amortization | 1420 | - | 1420
key_items.sbc | 680 | - | 680
key_items.capex | -2180 | - | -2180
key_items.acquisitions | -3400 | - | -3400
key_items.share_repurchases | -2200 | - | -2200
key_items.dividends | -580 | - | -580
key_items.debt_issued | 3000 | - | 3000
key_items.debt_repaid | -1500 | - | -1500
cash_flow_check | true | - | true
Why this task is hard:

Free cash flow = Operating CF ($4,870) - CapEx ($2,180) = $2,690. Cash flow check: 4870 + (-6310) + (-940) + (-85) = -2465, and 5885 + (-2465) = 3420. Negative values MUST have negative signs. Models often forget to negate outflows or miscalculate FCF by including acquisitions.
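The derived fields can be reproduced with signed arithmetic. In this Python sketch the variable names are illustrative, amounts are in millions, and outflows are stored as negative numbers, as the expected output requires.

```python
operating_cash_flow = 4870
investing_cash_flow = -6310
financing_cash_flow = -940
fx_effect = -85
capex = -2180          # part of investing activities
acquisitions = -3400   # also investing, but not part of free cash flow
beginning_cash = 5885
ending_cash = 3420

# Free cash flow subtracts capital expenditures only.
free_cash_flow = operating_cash_flow + capex

net_change_in_cash = (
    operating_cash_flow + investing_cash_flow + financing_cash_flow + fx_effect
)
cash_flow_check = beginning_cash + net_change_in_cash == ending_cash
```

Keeping outflows negative means every step is an addition, so a forgotten sign shows up immediately as a failed `cash_flow_check`.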

Finance · Balance Sheet Ratio Calculation

Qwen 3.5 Flash scored 0/100 vs GPT-4.1's 98/100

Input Text
CONSOLIDATED BALANCE SHEET — December 31, 2025 (in millions). ASSETS: Cash and equivalents $3,420. Short-term investments $1,850. Accounts receivable, net $2,190 (allowance for doubtful accounts $68). Inventories $890. Prepaid expenses $340. Total Current Assets $8,690. Property and equipment, net $12,400. Goodwill $8,750. Intangible assets, net $3,200. Operating lease right-of-use assets $1,860. Other non-current assets $920. Total Assets $35,820. LIABILITIES: Accounts payable $1,640. Accrued expenses $2,180. Current portion of long-term debt $500. Deferred revenue, current $1,420. Total Current Liabilities $5,740. Long-term debt $8,200 (senior notes: $5B at 3.75% due 2030, $3.2B at 4.25% due 2033). Operating lease liabilities $1,680. Deferred tax liabilities $890. Other non-current liabilities $460. Total Liabilities $16,970. EQUITY: Common stock $30. Additional paid-in capital $12,840. Retained earnings $9,180. Treasury stock ($2,400). Accumulated other comprehensive loss ($800). Total Stockholders' Equity $18,850. Total Liabilities + Equity $35,820.
Qwen 3.5 Flash: 0/100

Incorrect/missing fields: as_of, currency, current_assets, total_assets, current_liabilities, total_liabilities, total_equity, cash_and_equivalents, total_debt, debt_breakdown, computed_ratios.current_ratio, computed_ratios.quick_ratio, computed_ratios.debt_to_equity, computed_ratios.debt_to_assets, computed_ratios.working_capital_millions, balance_sheet_check

GPT-4.1: 98/100

Incorrect/missing fields: currency

Field-by-Field Comparison
Field | Expected | Qwen 3.5 Flash | GPT-4.1
as_of | 2025-12-31 | - | 2025-12-31
currency | USD millions | - | USD
current_assets | 8690 | - | 8690
total_assets | 35820 | - | 35820
current_liabilities | 5740 | - | 5740
total_liabilities | 16970 | - | 16970
total_equity | 18850 | - | 18850
cash_and_equivalents | 3420 | - | 3420
total_debt | 8700 | - | 8700
debt_breakdown | [{"instrument":"Senior notes 3.75%","amount":5000,"maturity":2030},{"instrument":"Senior notes 4.25%","amount":3200,"maturity":2033}] | - | [{"instrument":"Senior Notes (3.75% due 2030)","amount":5000,"maturity":2030},{"instrument":"Senior Notes (4.25% due 2033)","amount":3200,"maturity":2033},{"instrument":"Current portion of long-term debt","amount":500,"maturity":2026}]
computed_ratios.current_ratio | 1.51 | - | 1.514
computed_ratios.quick_ratio | 1.3 | - | 1.357
computed_ratios.debt_to_equity | 0.46 | - | 0.461
computed_ratios.debt_to_assets | 0.24 | - | 0.243
computed_ratios.working_capital_millions | 2950 | - | 2950
balance_sheet_check | true | - | true
Why this task is hard:

Total debt = current portion ($500M) + long-term ($8,200M) = $8,700M. Current ratio = 8690/5740 = 1.514. Quick ratio = (8690 - 890 - 340)/5740 = 1.30. Working capital = 8690-5740 = 2950. balance_sheet_check confirms Assets = L+E (35820 = 16970 + 18850). Treasury stock is NEGATIVE equity.
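The ratios above can be recomputed from the balance sheet lines. This Python sketch uses illustrative variable names, amounts in millions, and rounds to two decimals to match the expected output. The quick ratio follows the formula given above, which excludes both inventories and prepaid expenses.

```python
current_assets = 8690
inventories = 890
prepaid_expenses = 340
current_liabilities = 5740
total_assets = 35820
total_liabilities = 16970
total_equity = 18850
current_portion_of_debt = 500
long_term_debt = 8200

total_debt = current_portion_of_debt + long_term_debt
current_ratio = round(current_assets / current_liabilities, 2)
quick_ratio = round(
    (current_assets - inventories - prepaid_expenses) / current_liabilities, 2
)
debt_to_equity = round(total_debt / total_equity, 2)
debt_to_assets = round(total_debt / total_assets, 2)
working_capital = current_assets - current_liabilities
balance_sheet_check = total_assets == total_liabilities + total_equity
```

GPT-4.1's quick ratio of 1.357 differs from the expected 1.30 but falls inside the 5% numeric tolerance, which is why that field was still scored as correct.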

Legal · Non-Compete Geographic Scope

Qwen 3.5 Flash scored 0/100 vs GPT-4.1's 91/100

Input Text
Section 11 — RESTRICTIVE COVENANTS. 11.1 Non-Competition: For a period of eighteen (18) months following termination of employment for any reason, Employee shall not, directly or indirectly, engage in, own, manage, operate, consult for, or be employed by any Competing Business within the Restricted Territory. "Competing Business" means any entity that derives more than twenty percent (20%) of its annual revenue from products or services substantially similar to those offered by Company. "Restricted Territory" means the United States, the United Kingdom, and any country in which Company generated more than $5,000,000 in revenue during the twelve (12) months preceding termination. 11.2 Non-Solicitation of Employees: For a period of twenty-four (24) months following termination, Employee shall not recruit, solicit, or induce any person employed by Company (or who was employed within the six (6) months preceding such solicitation). 11.3 Non-Solicitation of Customers: For a period of twelve (12) months following termination, Employee shall not solicit any customer or prospective customer with whom Employee had material contact during the last twenty-four (24) months of employment. 11.4 Non-Disparagement: For a period of thirty-six (36) months following termination, Employee shall not make any public statements that disparage or defame Company, its officers, directors, or products. This obligation is mutual — Company shall instruct its officers and directors to refrain from disparaging Employee.
Qwen 3.5 Flash: 0/100

Incorrect/missing fields: non_compete.duration_months, non_compete.scope, non_compete.competing_business_threshold, non_compete.restricted_territory, non_solicitation_employees.duration_months, non_solicitation_employees.lookback_months, non_solicitation_customers.duration_months, non_solicitation_customers.contact_lookback_months, non_disparagement.duration_months, non_disparagement.is_mutual, non_disparagement.company_scope

GPT-4.1: 91/100

Incorrect/missing fields: non_compete.scope, non_compete.competing_business_threshold, non_disparagement.company_scope

Field-by-Field Comparison
Field | Expected | Qwen 3.5 Flash | GPT-4.1
non_compete.duration_months | 18 | - | 18
non_compete.scope | Engage, own, manage, operate, consult, or be employed | - | Employee shall not, directly or indirectly, engage in, own, manage, operate, consult for, or be employed by any Competing Business.
non_compete.competing_business_threshold | 20% of annual revenue from similar products/services | - | Any entity that derives more than 20% of its annual revenue from products or services substantially similar to those offered by Company.
non_compete.restricted_territory | ["United States","United Kingdom","Countries with >$5M Company revenue in prior 12 months"] | - | ["United States","United Kingdom","Any country in which Company generated more than $5,000,000 in revenue during the 12 months preceding termination"]
non_solicitation_employees.duration_months | 24 | - | 24
non_solicitation_employees.lookback_months | 6 | - | 6
non_solicitation_customers.duration_months | 12 | - | 12
non_solicitation_customers.contact_lookback_months | 24 | - | 24
non_disparagement.duration_months | 36 | - | 36
non_disparagement.is_mutual | true | - | true
non_disparagement.company_scope | Officers and directors | - | Company, its officers, directors, or products
Why this task is hard:

Each covenant has a DIFFERENT duration (18/24/12/36 months). Models frequently assign a single duration to all. The geographic scope has a dynamic '$5M revenue' threshold — not just static countries. Non-disparagement is MUTUAL. Non-solicitation of employees has a 6-month lookback.

Run this comparison on your data.

These benchmarks use generic tasks. Your use case is unique. Upload your own dataset and compare any of 50+ models with field-level accuracy scoring.


Other Comparisons