# Checkstack

> Checkstack is an LLM evaluation and shadow-testing platform for AI engineers. Compare 164+ language models on your own data across three modes: live shadow testing via CLI proxy, RAG auditing with LLM-as-judge grading, and structured JSON extraction with deterministic field-level scoring. Per-task cost and latency tracking, zero API markup.

## Modes

### 1. Shadow Testing (CLI Proxy)

Run `npx checkstack start` to start a local proxy that shadow-tests your LLM calls. It intercepts requests to your primary model, forwards them immediately (zero latency added), and replays them against cheaper alternatives in the background. You get side-by-side accuracy, latency, and cost data with zero code changes.

### 2. RAG Auditor

Compare how multiple models answer the same question given the same retrieved context. A consistent Claude Sonnet 4.6 judge grades every response on the RAG Triad:

- **Faithfulness** (Context ↔ Answer): Is the answer supported by the context?
- **Context Relevance** (Query ↔ Context): Does the retrieved context contain relevant information?
- **Answer Relevance** (Query ↔ Answer): Does the answer address the user's question?

### 3. JSON Optimizer

Upload a dataset with a target JSON schema, run extractions across multiple LLMs, and score outputs per field using deterministic comparison against ground truth (exact match for booleans, 5% tolerance for numbers, embedding cosine similarity for strings).

## Quick Start

### Shadow Testing (recommended)

```sh
npx checkstack auth    # sign in via browser
npx checkstack start   # start proxy on localhost:3456
# Point your app at http://localhost:3456/v1 — done.
```

### Web Evaluations

Sign up at https://checkstack.ai — the playground lets you try both RAG auditing and JSON extraction without an account.
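The JSON Optimizer's per-field rules (exact match for booleans, 5% relative tolerance for numbers, embedding similarity for strings) can be sketched as follows. This is an illustrative reconstruction, not Checkstack's actual implementation; the `string_sim` callable and the 0.80 string threshold are stand-ins.

```python
def score_field(expected, actual, string_sim=None):
    """Score one extracted field against ground truth, per the rules above.

    string_sim: optional callable returning cosine similarity between two
    strings (e.g. from embeddings). A stand-in, not Checkstack's API.
    """
    # Booleans first: in Python, bool is a subclass of int, so this check
    # must precede the numeric branch.
    if isinstance(expected, bool):
        return 1.0 if actual is expected else 0.0
    if isinstance(expected, (int, float)):
        # Numbers: within 5% relative tolerance counts as correct.
        if not isinstance(actual, (int, float)) or isinstance(actual, bool):
            return 0.0
        if expected == 0:
            return 1.0 if actual == 0 else 0.0
        return 1.0 if abs(actual - expected) / abs(expected) <= 0.05 else 0.0
    if isinstance(expected, str):
        # Strings: embedding cosine similarity against a threshold
        # (0.80 here is illustrative); exact match if no embedder given.
        if string_sim is None:
            return 1.0 if actual == expected else 0.0
        return 1.0 if string_sim(expected, actual) >= 0.80 else 0.0
    # Anything else (null, nested values not flattened): strict equality.
    return 1.0 if actual == expected else 0.0
```

For example, `score_field(100, 104)` passes the 5% tolerance while `score_field(100, 106)` does not.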
## SDK Compatibility

The CLI proxy works with anything that takes a base URL:

```js
// OpenAI Node.js
new OpenAI({ baseURL: "http://localhost:3456/v1" })

// Vercel AI SDK
createOpenAI({ baseURL: "http://localhost:3456/v1" })

// LangChain JS
new ChatOpenAI({ configuration: { baseURL: "http://localhost:3456/v1" } })
```

```python
# OpenAI Python
OpenAI(base_url="http://localhost:3456/v1")

# LangChain Python
ChatOpenAI(openai_api_base="http://localhost:3456/v1")
```

```sh
# LlamaIndex / LiteLLM / curl
export OPENAI_BASE_URL=http://localhost:3456/v1
```

Supported upstream providers: OpenAI, Anthropic, Google Gemini, Mistral, Groq, Together AI, Fireworks AI, Perplexity, DeepSeek, Cohere, AWS Bedrock. The provider is auto-detected from the model ID.

## Shadow Evaluation Pipeline

Each request that hits the proxy runs through a 7-phase pipeline:

1. **Intercept**: Forward to the original model immediately, capture the response, fire an async shadow request. Your API keys never leave your machine.
2. **Dispatch**: Authenticate, check balance, convert messages/tools to AI SDK format (preserves multi-turn history, tool calls, images, system prompts). Falls back to a sanitized conversion if the full conversion fails.
3. **Model Selection**: Pick shadow models from the pool (Gemini Flash Lite, Gemini Flash, Mistral Small, Claude 3.5 Haiku, GPT-4o Mini) minus the original. Configurable via `npx checkstack configure`. If a shadow model lacks tool support, a tool-call fallback is used and flagged with `tool_fallback: true`.
4. **Compare (Text/Tool)**: Embed both responses with text-embedding-3-small. Cosine similarity: ≥0.85 = match, 0.70–0.85 = borderline (escalated to the LLM judge), <0.70 = mismatch. Tool calls are compared structurally. Mixed types go to the LLM judge with full tool definitions.
5. **Compare (Structured)**: For JSON schema responses: flatten to dot-paths and compare field by field. Strings use embedding similarity (≥0.80 match, 0.65–0.80 judge, <0.65 mismatch). Overall score ≥90% = match.
6. **LLM Judge**: Claude Sonnet 4.6 (temperature 0) for borderline cases.
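The comparison logic in phases 4 and 5 reduces to a three-way threshold check plus a dot-path flattening step. A minimal sketch, assuming plain embedding vectors; the function names are mine, not part of any Checkstack SDK:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def classify_shadow(similarity):
    """Map a similarity score to the phase-4 verdict."""
    if similarity >= 0.85:
        return "match"
    if similarity >= 0.70:
        return "borderline"  # escalated to the LLM judge
    return "mismatch"

def flatten(obj, prefix=""):
    """Flatten nested JSON to dot-path keys, as in phase 5.

    {"order": {"total": 9}} -> {"order.total": 9}
    """
    out = {}
    if isinstance(obj, dict):
        for k, v in obj.items():
            out.update(flatten(v, f"{prefix}{k}."))
    elif isinstance(obj, list):
        for i, v in enumerate(obj):
            out.update(flatten(v, f"{prefix}{i}."))
    else:
        out[prefix[:-1]] = obj
    return out
```

Flattening both the original and shadow JSON to dot-paths lets each leaf value be compared independently with the type-appropriate rule.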
Evaluates whether the shadow response is a valid alternative, not an exact match.
7. **Results & Billing**: Costs are deducted at exact provider rates, zero markup. The CLI prints a receipt. Results are persisted to the dashboard, grouped by session.

## 50-Model Extraction Benchmark

50 models from 14 providers, benchmarked across 25 high-difficulty extraction tasks in 5 domains (e-commerce, legal, medical, finance, logic) with deterministic field-level scoring. Results: https://checkstack.ai/compare

Top 10 by accuracy:

1. Gemini 3 Pro — 92.9%
2. Claude Opus 4.6 — 92.8%
3. Grok 4 — 92.6%
4. Claude Sonnet 4.6 — 92.5%
5. Gemini 2.5 Pro — 92.5%
6. Grok 3 — 92.2%
7. Grok 3 Mini — 92.2%
8. Gemini 3 Flash — 92.2%
9. DeepSeek V3.2 — 92.0%
10. Gemini 2.5 Flash — 91.9%

Key finding: price tier doesn't predict accuracy. Eco-tier models frequently outperform ultra-tier models, and reasoning models (o3, o4 Mini, DeepSeek R1) don't consistently dominate extraction.

Head-to-head comparisons: https://checkstack.ai/compare/{model-a-slug}-vs-{model-b-slug}

All 50 model slugs: llama-3-1-8b, nova-micro, gpt-5-nano, qwen-3-14b, nova-lite, gemini-2-0-flash-lite, llama-4-scout, gemini-2-0-flash, gemini-2-5-flash-lite, qwen-3-32b, qwen-3-5-flash, gpt-4-1-nano, mistral-small, gpt-4o-mini, llama-4-maverick, grok-4-fast, deepseek-v3-1, deepseek-v3-2, gpt-5-mini, seed-1-8, claude-3-haiku, grok-3-mini, gemini-2-5-flash, minimax-m2-5, gpt-4-1-mini, qwen-3-5-plus, deepseek-r1, kimi-k2, mistral-large-3, gemini-3-flash, magistral-small, llama-3-3-70b, deepseek-v3, nova-pro, claude-3-5-haiku, claude-haiku-4-5, o3-mini, o4-mini, gpt-5, gemini-2-5-pro, magistral-medium, gpt-4-1, o3, gemini-3-pro, command-a, gpt-4o, claude-sonnet-4-6, grok-3, grok-4, claude-opus-4-6

## Models

164+ models from: OpenAI, Anthropic, Google, Mistral, Meta, DeepSeek, xAI, Alibaba (Qwen), Amazon Nova, Cohere, MiniMax, Moonshot, ByteDance, Magistral.
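The head-to-head URL template above can be filled in mechanically from any two slugs in the list. A trivial helper (the function name is mine):

```python
def compare_url(slug_a: str, slug_b: str) -> str:
    """Build a head-to-head comparison URL from two model slugs."""
    return f"https://checkstack.ai/compare/{slug_a}-vs-{slug_b}"

# compare_url("gemini-3-pro", "claude-opus-4-6")
# -> "https://checkstack.ai/compare/gemini-3-pro-vs-claude-opus-4-6"
```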
Full list with pricing: `GET https://checkstack.ai/api/v1/models`

## Pricing

- 14-day free trial, $1 preloaded balance, no credit card
- $15/month subscription
- Actual model costs deducted from balance (zero markup)
- Top up anytime via Stripe

## REST API

Base URL: `https://checkstack.ai/api/v1`

Auth: `Authorization: Bearer sk-cs_...` (get a key at https://checkstack.ai/app → Settings). Read-only endpoints and playground evaluations (5/day) work without auth.

### Public Endpoints (no auth)

- `GET /api/v1/models` — List 164+ LLMs with pricing. Filter: `?tier=eco|pro|ultra&provider=OpenAI`
- `GET /api/v1/benchmark` — 50-model leaderboard. Filter: `?domain=ecommerce|legal|medical|finance|logic&tier=eco|pro|ultra&top=10`

### Evaluation Endpoints

- `POST /api/v1/evaluate/json` — JSON extraction comparison (single or batch up to 200 rows × 5 models).
  Body: `{ text, schema, ground_truth?, models?, project_name? }` or `{ rows: [...], schema, models? }`
- `POST /api/v1/evaluate/rag` — RAG audit comparison (single or batch).
  Body: `{ query, context, models?, project_name? }` or `{ rows: [...], models? }`
- `POST /api/v1/recommend` — AI model recommendation.
  Body: `{ use_case, priority?: "accuracy"|"cost"|"latency"|"balanced", budget?: "eco"|"pro"|"ultra"|"any", top_n? }`
- `GET /api/v1/evaluate/session/{sessionId}` — Poll for results

### Account Endpoints (auth required)

- `GET /api/v1/credits` — Check balance
- `GET /api/v1/keys` — List API keys
- `POST /api/v1/keys/revoke` — Revoke a key: `{ key_id }`
- `POST /api/v1/keys/rotate` — Atomic rotate: `{ key_id }` → new key

## MCP Server (Agent Integration)

```sh
npx checkstack-mcp@latest
```

Add to Claude Desktop, Cursor, VS Code, or any MCP client:

```json
{
  "mcpServers": {
    "checkstack": {
      "command": "npx",
      "args": ["-y", "checkstack-mcp@latest"]
    }
  }
}
```

No API key needed — the agent runs device-flow auth automatically.
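Calling the evaluation endpoints from Python might look like the sketch below. It only constructs the request from the body fields documented above and does not hit the network; the helper name is mine, and the model IDs shown in the comment are placeholders.

```python
import json
import urllib.request

BASE = "https://checkstack.ai/api/v1"

def build_json_eval_request(api_key, text, schema, models=None):
    """Prepare a POST /evaluate/json request from the documented body fields."""
    body = {"text": text, "schema": schema}
    if models:
        body["models"] = models
    return urllib.request.Request(
        f"{BASE}/evaluate/json",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

# To actually send it, then poll GET /evaluate/session/{sessionId} for results:
# resp = urllib.request.urlopen(build_json_eval_request("sk-cs_...", text, schema))
```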
### MCP Tools

- `setup_account` — Sign in + get API key (opens browser, key flows back)
- `recommend_model` — Best LLM for a use case (uses benchmark data)
- `get_benchmark` — 50-model leaderboard
- `list_models` — Browse 164+ models with pricing
- `evaluate_json_extraction` — Compare models on JSON extraction
- `evaluate_rag` — Compare models on RAG (hallucination detection)
- `compare_models` — Head-to-head comparison of two models
- `check_credits` — Account balance
- `buy_credits` — Top-up link
- `manage_keys` — List, revoke, or rotate API keys

## CLI Commands

- `npx checkstack auth` — Sign in via browser (device flow)
- `npx checkstack start` — Start the proxy on localhost:3456
- `npx checkstack run` — Open the results dashboard
- `npx checkstack wallet` — Check credit balance
- `npx checkstack configure` — Set shadow models + preferences
- `npx checkstack rerun` — Re-run a previous evaluation
- `npx checkstack subscribe` — Manage subscription

## Links

- Website: https://checkstack.ai
- Docs: https://checkstack.ai/docs
- Playground: https://checkstack.ai/#playground
- 50-Model Benchmark: https://checkstack.ai/compare
- Sign up: https://checkstack.ai/sign-up
- npm: https://www.npmjs.com/package/checkstack
- MCP: https://www.npmjs.com/package/checkstack-mcp