# Checkstack

> Checkstack is an LLM evaluation and shadow-testing platform for AI engineers. Compare 164+ language models on your own data across three modes: live shadow testing via CLI proxy, RAG auditing with LLM-as-judge grading, and structured JSON extraction with deterministic field-level scoring. Per-task cost and latency tracking, zero API markup.

## Modes

### 1. Shadow Testing (CLI Proxy)

Run `npx checkstack start` to start a local proxy that shadow-tests your LLM calls. It intercepts requests to your primary model, forwards them immediately (zero latency added), and replays them against cheaper alternatives in the background. You get side-by-side accuracy, latency, and cost data with zero code changes.

### 2. RAG Auditor

Compare how multiple models answer the same question given the same retrieved context. A consistent Claude Sonnet 4.6 judge grades every response on the RAG Triad:

- **Faithfulness** (Context ↔ Answer): Is the answer supported by the context?
- **Context Relevance** (Query ↔ Context): Does the retrieved context contain relevant information?
- **Answer Relevance** (Query ↔ Answer): Does the answer address the user's question?

### 3. JSON Optimizer

Upload a dataset with a target JSON schema, run extractions across multiple LLMs, and score outputs per field using deterministic comparison against ground truth (exact match for booleans, 5% tolerance for numbers, embedding cosine similarity for strings).

## Quick Start

### Shadow Testing (recommended)

```sh
npx checkstack auth    # sign in via browser
npx checkstack start   # start proxy on localhost:3456
# Point your app at http://localhost:3456/v1 — done.
```

### Web Evaluations

Sign up at https://checkstack.ai — the playground lets you try both RAG auditing and JSON extraction without an account.
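The JSON Optimizer's per-field rules (exact match for booleans, 5% relative tolerance for numbers, embedding similarity for strings) can be sketched as follows. This is an illustrative reconstruction, not Checkstack's actual implementation; the `string_sim` callable and the 0.80 string threshold are stand-ins.

```python
def score_field(expected, actual, string_sim=None):
    """Score one extracted field against ground truth, per the rules above.

    string_sim: optional callable returning cosine similarity between two
    strings (e.g. from embeddings). A stand-in, not Checkstack's API.
    """
    # Booleans first: in Python, bool is a subclass of int, so this check
    # must precede the numeric branch.
    if isinstance(expected, bool):
        return 1.0 if actual is expected else 0.0
    if isinstance(expected, (int, float)):
        # Numbers: within 5% relative tolerance counts as correct.
        if not isinstance(actual, (int, float)) or isinstance(actual, bool):
            return 0.0
        if expected == 0:
            return 1.0 if actual == 0 else 0.0
        return 1.0 if abs(actual - expected) / abs(expected) <= 0.05 else 0.0
    if isinstance(expected, str):
        # Strings: embedding cosine similarity against a threshold
        # (0.80 here is illustrative); exact match if no embedder given.
        if string_sim is None:
            return 1.0 if actual == expected else 0.0
        return 1.0 if string_sim(expected, actual) >= 0.80 else 0.0
    # Anything else (null, nested values not flattened): strict equality.
    return 1.0 if actual == expected else 0.0
```

For example, `score_field(100, 104)` passes the 5% tolerance while `score_field(100, 106)` does not.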
## SDK Compatibility

The CLI proxy works with anything that takes a base URL:

```js
// OpenAI Node.js
new OpenAI({ baseURL: "http://localhost:3456/v1" })

// Vercel AI SDK
createOpenAI({ baseURL: "http://localhost:3456/v1" })

// LangChain JS
new ChatOpenAI({ configuration: { baseURL: "http://localhost:3456/v1" } })
```

```python
# OpenAI Python
OpenAI(base_url="http://localhost:3456/v1")

# LangChain Python
ChatOpenAI(openai_api_base="http://localhost:3456/v1")
```

```sh
# LlamaIndex / LiteLLM / curl
export OPENAI_BASE_URL=http://localhost:3456/v1
```

Supported upstream providers: OpenAI, Anthropic, Google Gemini, Mistral, Groq, Together AI, Fireworks AI, Perplexity, DeepSeek, Cohere, AWS Bedrock. The provider is auto-detected from the model ID.

## Shadow Evaluation Pipeline

Each request that hits the proxy runs through a 7-phase pipeline:

1. **Intercept**: Forward to the original model immediately, capture the response, fire an async shadow request. Your API keys never leave your machine.
2. **Dispatch**: Authenticate, check balance, convert messages/tools to AI SDK format (preserves multi-turn history, tool calls, images, system prompts). Falls back to a sanitized conversion if the full conversion fails.
3. **Model Selection**: Pick shadow models from the pool (Gemini Flash Lite, Gemini Flash, Mistral Small, Claude 3.5 Haiku, GPT-4o Mini) minus the original. Configurable via `npx checkstack configure`. If a shadow model lacks tool support, a tool-call fallback is used and flagged with `tool_fallback: true`.
4. **Compare (Text/Tool)**: Embed both responses with text-embedding-3-small. Cosine similarity: ≥0.85 = match, 0.70–0.85 = borderline (escalated to the LLM judge), <0.70 = mismatch. Tool calls are compared structurally. Mixed types go to the LLM judge with full tool definitions.
5. **Compare (Structured)**: For JSON schema responses: flatten to dot-paths and compare field by field. Strings use embedding similarity (≥0.80 match, 0.65–0.80 judge, <0.65 mismatch). Overall score ≥90% = match.
6. **LLM Judge**: Claude Sonnet 4.6 (temperature 0) for borderline cases.
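The comparison logic in phases 4 and 5 reduces to a three-way threshold check plus a dot-path flattening step. A minimal sketch, assuming plain embedding vectors; the function names are mine, not part of any Checkstack SDK:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def classify_shadow(similarity):
    """Map a similarity score to the phase-4 verdict."""
    if similarity >= 0.85:
        return "match"
    if similarity >= 0.70:
        return "borderline"  # escalated to the LLM judge
    return "mismatch"

def flatten(obj, prefix=""):
    """Flatten nested JSON to dot-path keys, as in phase 5.

    {"order": {"total": 9}} -> {"order.total": 9}
    """
    out = {}
    if isinstance(obj, dict):
        for k, v in obj.items():
            out.update(flatten(v, f"{prefix}{k}."))
    elif isinstance(obj, list):
        for i, v in enumerate(obj):
            out.update(flatten(v, f"{prefix}{i}."))
    else:
        out[prefix[:-1]] = obj
    return out
```

Flattening both the original and shadow JSON to dot-paths lets each leaf value be compared independently with the type-appropriate rule.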
Evaluates whether the shadow response is a valid alternative, not an exact match.
7. **Results & Billing**: Costs are deducted at exact provider rates, zero markup. The CLI prints a receipt. Results are persisted to the dashboard, grouped by session.

## 50-Model Extraction Benchmark

50 models from 14 providers, benchmarked across 25 high-difficulty extraction tasks in 5 domains (e-commerce, legal, medical, finance, logic) with deterministic field-level scoring. Results: https://checkstack.ai/compare

Top 10 by accuracy:

1. Gemini 3 Pro — 92.9%
2. Claude Opus 4.6 — 92.8%
3. Grok 4 — 92.6%
4. Claude Sonnet 4.6 — 92.5%
5. Gemini 2.5 Pro — 92.5%
6. Grok 3 — 92.2%
7. Grok 3 Mini — 92.2%
8. Gemini 3 Flash — 92.2%
9. DeepSeek V3.2 — 92.0%
10. Gemini 2.5 Flash — 91.9%

Key finding: price tier doesn't predict accuracy. Eco-tier models frequently outperform ultra-tier models, and reasoning models (o3, o4 Mini, DeepSeek R1) don't consistently dominate extraction.

Head-to-head comparisons: https://checkstack.ai/compare/{model-a-slug}-vs-{model-b-slug}

All 50 model slugs: llama-3-1-8b, nova-micro, gpt-5-nano, qwen-3-14b, nova-lite, gemini-2-0-flash-lite, llama-4-scout, gemini-2-0-flash, gemini-2-5-flash-lite, qwen-3-32b, qwen-3-5-flash, gpt-4-1-nano, mistral-small, gpt-4o-mini, llama-4-maverick, grok-4-fast, deepseek-v3-1, deepseek-v3-2, gpt-5-mini, seed-1-8, claude-3-haiku, grok-3-mini, gemini-2-5-flash, minimax-m2-5, gpt-4-1-mini, qwen-3-5-plus, deepseek-r1, kimi-k2, mistral-large-3, gemini-3-flash, magistral-small, llama-3-3-70b, deepseek-v3, nova-pro, claude-3-5-haiku, claude-haiku-4-5, o3-mini, o4-mini, gpt-5, gemini-2-5-pro, magistral-medium, gpt-4-1, o3, gemini-3-pro, command-a, gpt-4o, claude-sonnet-4-6, grok-3, grok-4, claude-opus-4-6

## Models

164+ models from: OpenAI, Anthropic, Google, Mistral, Meta, DeepSeek, xAI, Alibaba (Qwen), Amazon Nova, Cohere, MiniMax, Moonshot, ByteDance, Magistral.
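The head-to-head URL template above can be filled in mechanically from any two slugs in the list. A trivial helper (the function name is mine):

```python
def compare_url(slug_a: str, slug_b: str) -> str:
    """Build a head-to-head comparison URL from two model slugs."""
    return f"https://checkstack.ai/compare/{slug_a}-vs-{slug_b}"

# compare_url("gemini-3-pro", "claude-opus-4-6")
# -> "https://checkstack.ai/compare/gemini-3-pro-vs-claude-opus-4-6"
```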
Full list with pricing: `GET https://checkstack.ai/api/v1/models`

## Pricing

- 14-day free trial, $1 preloaded balance, no credit card
- $15/month subscription
- Actual model costs deducted from balance (zero markup)
- Top up anytime via Stripe

## REST API

Base URL: `https://checkstack.ai/api/v1`

Auth: `Authorization: Bearer sk-cs_...` (get a key at https://checkstack.ai/app → Settings). Read-only endpoints and playground evaluations (5/day) work without auth.

### Public Endpoints (no auth)

- `GET /api/v1/models` — List 164+ LLMs with pricing. Filter: `?tier=eco|pro|ultra&provider=OpenAI`
- `GET /api/v1/benchmark` — 50-model leaderboard. Filter: `?domain=ecommerce|legal|medical|finance|logic&tier=eco|pro|ultra&top=10`

### Evaluation Endpoints

- `POST /api/v1/evaluate/json` — JSON extraction comparison (single or batch up to 200 rows × 5 models).
  Body: `{ text, schema, ground_truth?, models?, project_name? }` or `{ rows: [...], schema, models? }`
- `POST /api/v1/evaluate/rag` — RAG audit comparison (single or batch).
  Body: `{ query, context, models?, project_name? }` or `{ rows: [...], models? }`
- `POST /api/v1/recommend` — AI model recommendation.
  Body: `{ use_case, priority?: "accuracy"|"cost"|"latency"|"balanced", budget?: "eco"|"pro"|"ultra"|"any", top_n? }`
- `GET /api/v1/evaluate/session/{sessionId}` — Poll for results

### Account Endpoints (auth required)

- `GET /api/v1/credits` — Check balance
- `GET /api/v1/keys` — List API keys
- `POST /api/v1/keys/revoke` — Revoke a key: `{ key_id }`
- `POST /api/v1/keys/rotate` — Atomic rotate: `{ key_id }` → new key

## MCP Server (Agent Integration)

```sh
npx checkstack-mcp@latest
```

Add to Claude Desktop, Cursor, VS Code, or any MCP client:

```json
{
  "mcpServers": {
    "checkstack": {
      "command": "npx",
      "args": ["-y", "checkstack-mcp@latest"]
    }
  }
}
```

No API key needed — the agent runs device-flow auth automatically.
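Calling the evaluation endpoints from Python might look like the sketch below. It only constructs the request from the body fields documented above and does not hit the network; the helper name is mine, and the model IDs shown in the comment are placeholders.

```python
import json
import urllib.request

BASE = "https://checkstack.ai/api/v1"

def build_json_eval_request(api_key, text, schema, models=None):
    """Prepare a POST /evaluate/json request from the documented body fields."""
    body = {"text": text, "schema": schema}
    if models:
        body["models"] = models
    return urllib.request.Request(
        f"{BASE}/evaluate/json",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

# To actually send it, then poll GET /evaluate/session/{sessionId} for results:
# resp = urllib.request.urlopen(build_json_eval_request("sk-cs_...", text, schema))
```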
### MCP Tools

- `setup_account` — Sign in + get API key (opens browser, key flows back)
- `recommend_model` — Best LLM for a use case (uses benchmark data)
- `get_benchmark` — 50-model leaderboard
- `list_models` — Browse 164+ models with pricing
- `evaluate_json_extraction` — Compare models on JSON extraction
- `evaluate_rag` — Compare models on RAG (hallucination detection)
- `compare_models` — Head-to-head comparison of two models
- `check_credits` — Account balance
- `buy_credits` — Top-up link
- `manage_keys` — List, revoke, or rotate API keys

## CLI Commands

- `npx checkstack auth` — Sign in via browser (device flow)
- `npx checkstack start` — Start the proxy on localhost:3456
- `npx checkstack run` — Open the results dashboard
- `npx checkstack wallet` — Check credit balance
- `npx checkstack configure` — Set shadow models + preferences
- `npx checkstack rerun` — Re-run a previous evaluation
- `npx checkstack subscribe` — Manage subscription

## Links

- Website: https://checkstack.ai
- Docs: https://checkstack.ai/docs
- Playground: https://checkstack.ai/#playground
- 50-Model Benchmark: https://checkstack.ai/compare
- Sign up: https://checkstack.ai/sign-up
- npm: https://www.npmjs.com/package/checkstack
- MCP: https://www.npmjs.com/package/checkstack-mcp