Stop writing eval scripts. Just change your baseURL.

Spin up the Checkstack sidecar locally, point your SDK to port 3456, and let our engine automatically grade cheaper models in the background while you test your app.

Start the proxy

Auth happens automatically. Shadow models receive traffic immediately.

$ npx checkstack run

Point your SDK at Checkstack

One-line change. Checkstack proxies every call transparently.

baseURL: "http://localhost:3456/v1"
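A minimal sketch of keeping that swap to a single setting, assuming your SDK's client accepts a `baseURL` option (the official `openai` Node package does). The `CHECKSTACK_BASE_URL` variable name, the `resolveBaseURL` helper, and the provider default shown here are illustrative assumptions, not documented Checkstack settings.

```typescript
// Hypothetical helper: choose the SDK base URL from the environment.
// CHECKSTACK_BASE_URL is an assumed variable name used only for this sketch.
const DEFAULT_BASE_URL = "https://api.openai.com/v1"; // example provider default

function resolveBaseURL(env: Record<string, string | undefined>): string {
  const override = env["CHECKSTACK_BASE_URL"]?.trim();
  return override ? override : DEFAULT_BASE_URL;
}

// Local development: route calls through the sidecar on port 3456.
const devURL = resolveBaseURL({ CHECKSTACK_BASE_URL: "http://localhost:3456/v1" });

// Wherever the variable is unset, the SDK talks to the provider directly.
const prodURL = resolveBaseURL({});

console.log(devURL, prodURL);
// With an SDK that takes a baseURL option, pass the result through, e.g.
//   new OpenAI({ baseURL: resolveBaseURL(process.env) })
```

Keeping the URL in one place means unsetting a single variable takes the proxy back out of the path.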

Use your app normally

Every prompt is silently replayed against the models you choose. An AI judge scores each response.

> POST /api/chat → 200 OK

+ [claude-3.5-sonnet] get_weather({...})

gemini-flash ✓ match $0.0002

claude-haiku ✓ match $0.0014

gpt-4o-mini ✓ match $0.0003

> POST /api/chat → 200 OK

+ [claude-3.5-sonnet] search_places({...})

gemini-flash ✓ match $0.0004

Stop and compare

Hit Ctrl+C. Get a cost and accuracy breakdown instantly. Switch to the winner.

Session complete · 4 calls

MODEL	MATCH	COST/10k
claude-3.5-sonnet *	100%	$234
gemini-2.0-flash	100%
claude-haiku	75%	$62

💡 Switch to gemini-flash → save $226 per 10k runs
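The recommendation is plain arithmetic on the summary. The sketch below is one way to express it, not Checkstack's actual logic: the `ModelSummary` type and `recommend` function are hypothetical, and the $8 figure for gemini-2.0-flash is inferred from the stated saving ($234 − $226), since the summary shows no cost for that row.

```typescript
// Hypothetical shape for one row of the session summary.
interface ModelSummary {
  model: string;
  matchRate: number;  // fraction of calls that matched the primary model
  costPer10k: number; // US dollars per 10,000 runs
}

// Pick the cheapest shadow model that meets the match threshold and report the saving.
function recommend(primary: ModelSummary, shadows: ModelSummary[], minMatch = 1.0) {
  const eligible = shadows
    .filter((m) => m.matchRate >= minMatch && m.costPer10k < primary.costPer10k)
    .sort((a, b) => a.costPer10k - b.costPer10k);
  if (eligible.length === 0) return null;
  const best = eligible[0];
  return { model: best.model, savingPer10k: primary.costPer10k - best.costPer10k };
}

const primary: ModelSummary = { model: "claude-3.5-sonnet", matchRate: 1.0, costPer10k: 234 };
const shadows: ModelSummary[] = [
  { model: "gemini-2.0-flash", matchRate: 1.0, costPer10k: 8 }, // inferred: 234 - 226
  { model: "claude-haiku", matchRate: 0.75, costPer10k: 62 },
];

const pick = recommend(primary, shadows);
if (pick) {
  console.log(`Switch to ${pick.model}: save $${pick.savingPer10k} per 10k runs`);
}
```

Lowering `minMatch` to 0.75 would also make claude-haiku eligible, but the cheaper 100% match would still win.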

Rerun to verify (optional)

Not convinced? Re-run the exact same traffic with multiple repeats to check consistency before you commit.

$ checkstack rerun chk_9a2f18x --repeat 3

Re-running 4 calls × 3 repeats...

get_weather ✓ 3/3

search_places ✓ 3/3

get_tee_times ✓ 3/3

book_tee_time ✓ 3/3

Consistency: 12/12 (100%) across 3 runs
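The consistency figure is matches over total attempts: 4 calls × 3 repeats = 12. A small sketch of that tally, using a hypothetical `RerunResult` shape rather than Checkstack's real output format:

```typescript
// Hypothetical record of one replayed call: how many of its repeats matched.
interface RerunResult {
  tool: string;
  passed: number;
  repeats: number;
}

function consistency(results: RerunResult[]) {
  const passed = results.reduce((sum, r) => sum + r.passed, 0);
  const total = results.reduce((sum, r) => sum + r.repeats, 0);
  const percent = total === 0 ? 0 : Math.round((passed / total) * 100);
  return { passed, total, percent };
}

const rerun: RerunResult[] = ["get_weather", "search_places", "get_tee_times", "book_tee_time"]
  .map((tool) => ({ tool, passed: 3, repeats: 3 }));

const tally = consistency(rerun);
console.log(`Consistency: ${tally.passed}/${tally.total} (${tally.percent}%)`);
```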

Optimize your prompt (optional)

If a cheaper model almost passes, Checkstack can suggest prompt tweaks that close the gap without changing your logic.

$ checkstack rerun chk_9a2f18x --suggest

Analyzing 1 failed call...

⚠ search_places: haiku returns 2 results vs 3

💡 Suggested: add "Return at least 3 results" to system prompt

Estimated match rate: 75% → 100%

✓ Ready: Run --apply to update codebase

Read the full integration guide →

Why Checkstack

How we compare.

There's nothing else quite like Checkstack. It's the only tool that brings real-time parallel shadow testing to your local dev environment, letting you instantly test against 160+ models with zero API key setup.

Feature	Checkstack	Enterprise Evals (LangSmith / Braintrust)	API Gateways (Helicone / Portkey)	Static CLI (Promptfoo)
Real-Time Testing	✓ Live Shadow on Every Call	✗ No: Offline Eval Suites	✗ No: Logging / A/B Routing	✗ No: Static Batch Runs
Models Available	✓ 160+ via Checkstack Proxy, No Keys	⚠ Bring Your Own Keys	⚠ Bring Your Own Keys	⚠ Bring Your Own Keys
Integration	✓ 1-Line baseURL Swap	✗ Custom SDK + Tracing Setup	✓ 1-Line baseURL Swap	⚠ Config Files Required
Test Data	✓ Real App Traffic	⚠ Prod Traces / Datasets	⚠ Production Logs	✗ Hardcoded Test Cases
Prod Risk	✓ Zero: Pre-Prod Only	⚠ Medium: SDK in Prod	✗ High: Inline Proxy	✓ Zero: Static Only
Pricing	✓ $15/mo + At-Cost Usage	⚠ Free Tier, Then $39–249/mo	⚠ Usage-Based ($79/mo+)	⚠ Free (Funded by OpenAI)

Checkstack vs LangSmith or Braintrust

✓ Real-time model comparison on every request vs running offline eval suites as a separate step
✓ One env var change vs custom SDK integration and tracing instrumentation
✓ Tests against your real local dev traffic, no dataset curation needed
✓ Runs pre-production so broken prompts never reach users

Checkstack vs Helicone or Portkey

✓ Parallel shadow testing across models vs single-model A/B routing or logging only
✓ No production proxy in the critical path, zero risk of gateway downtime
✓ $15/mo platform fee vs usage-based pricing starting at $79/mo
✓ 160+ models through the Vercel AI Gateway with no API keys to manage

Checkstack vs Promptfoo

✓ Real-time shadow testing on live traffic vs batch runs against static YAML test cases
✓ Automatic schema-aware field-level accuracy scoring with no eval config to write
✓ Independent and vendor-neutral (Promptfoo is funded by OpenAI)
✓ No API keys needed: proxies 160+ models through the Vercel AI Gateway

How Shadow Testing Works

1. Point your local dev environment at the Checkstack proxy
2. Every AI call is forwarded to your primary model and shadowed to cheaper alternatives
3. Get field-level accuracy scores, latency, and cost comparisons in your terminal
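The three steps can be sketched roughly as follows. This is an illustration of the shadow-testing pattern, not Checkstack's implementation: the models are local stand-in functions, every name (`shadowCall`, `fieldScore`, `ShadowReport`) is hypothetical, exact field equality stands in for the AI judge, and cost tracking is left out.

```typescript
type ToolArgs = Record<string, string | number | boolean>;
type Model = (prompt: string) => Promise<ToolArgs>;

interface ShadowReport {
  model: string;
  score: number; // share of the primary's fields reproduced exactly, 0..1
  latencyMs: number;
  error?: string;
}

// Field-level score: compare each field of the primary output with the shadow output.
function fieldScore(expected: ToolArgs, actual: ToolArgs): number {
  const keys = Object.keys(expected);
  if (keys.length === 0) return 1;
  const hits = keys.filter((k) => actual[k] === expected[k]).length;
  return hits / keys.length;
}

async function shadowCall(prompt: string, primary: Model, shadows: Record<string, Model>) {
  // The app only ever receives the primary model's answer.
  const response = await primary(prompt);
  // Shadows run in parallel; the promise is returned un-awaited so the caller is not blocked.
  const reports: Promise<ShadowReport[]> = Promise.all(
    Object.entries(shadows).map(async ([model, run]): Promise<ShadowReport> => {
      const started = Date.now();
      try {
        const out = await run(prompt);
        return { model, score: fieldScore(response, out), latencyMs: Date.now() - started };
      } catch (err) {
        // A failing shadow is recorded and never affects the primary response.
        return { model, score: 0, latencyMs: Date.now() - started, error: String(err) };
      }
    }),
  );
  return { response, reports };
}

// Stand-in models for the demo; real ones would call provider APIs.
const primaryModel: Model = async () => ({ city: "Austin", units: "F" });
const shadowModels: Record<string, Model> = {
  "shadow-a": async () => ({ city: "Austin", units: "F" }),
  "shadow-b": async () => ({ city: "Austin", units: "C" }),
  "shadow-c": async () => { throw new Error("timeout"); },
};

const demo = shadowCall("weather in Austin", primaryModel, shadowModels);
demo.then(async ({ response, reports }) => {
  console.log(response, await reports);
});
```

Returning the reports as a separate promise keeps grading off the request path, which is the property that lets you use your app normally while the comparison runs.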

Pricing

Zero markup. You pay what the compute costs.

Checkstack passes the exact token cost to you, to the eighth decimal. No hidden fees, no per-seat pricing, no surprises.
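As a rough illustration of what pricing "to the eighth decimal" looks like, the sketch below computes a per-call cost from token counts. The `Rate` type, the `tokenCost` function, and the rates themselves are placeholders chosen for the example; they are not Checkstack's or any provider's actual prices.

```typescript
// Placeholder rates in US dollars per one million tokens (illustrative only).
interface Rate {
  inputPerMTok: number;
  outputPerMTok: number;
}

// Cost of one call, formatted to eight decimal places with no markup multiplier.
function tokenCost(rate: Rate, inputTokens: number, outputTokens: number): string {
  const dollars =
    (inputTokens * rate.inputPerMTok + outputTokens * rate.outputPerMTok) / 1_000_000;
  return dollars.toFixed(8);
}

const exampleRate: Rate = { inputPerMTok: 0.5, outputPerMTok: 2 };
const cost = tokenCost(exampleRate, 1200, 350);
console.log(`$${cost}`); // $0.00130000
```

A production billing system would more likely keep amounts in integer micro-units than in floating point; the float arithmetic here is only for readability.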

The $1 Bet

Sign up and get $1.00 of real API compute free. Find a cheaper model and save, or confirm your stack is already optimal. You win either way.

Claim Your $1 →

Checkstack Pro

$15/month

14-day free trial. No credit card required.

✓ CLI proxy for live shadow testing
✓ Real-time accuracy dashboard
✓ 164 models across 24 providers
✓ AI-graded accuracy with field-level scores
✓ Exportable production middleware
✓ $1 of evaluation credit included with trial

Start Free Trial →

Evaluation Credit

$0 markup

Pay-as-you-go. Top up when you need to.

✓ Purchase credit as a one-time top-up
✓ Exact token cost, zero markup, ever
✓ Top-up options: $10, $25, or $50
✓ Credit expires 1 year after purchase
✓ Covers thousands of evals on Eco-tier
✓ Unused credit carries forward

Available after signing up.

Model Catalog

Test any of 164 models across 24 providers.

The benchmarks compare 6 popular models. When you run your own evals you can pick any model from the full catalog. Compare frontier models, open-source options, and cost-optimized choices side-by-side on your exact data.

OpenAI: 34 models total

GPT-3.5 Turbo, GPT-4o, GPT-4o Mini, GPT-4.1, GPT-4.1 Mini, GPT-4.1 Nano, GPT-5 Mini, GPT-5 Nano, o1, o3, o4-mini, +23 more

We benchmarked 50 of these models so you don't have to.

Accuracy, latency, and cost across JSON extraction, RAG audit, tool calling, and more.

View Benchmark →

Effortless benchmarking
while you build.

Stop guessing.
Start measuring.
