Documentation

Checkstack Docs

Evaluate LLMs on your exact workload. Compare cost, accuracy, and latency across 164+ models - in under 30 seconds.

Quickstart

Get running in three commands. No API keys to configure - Checkstack uses device auth (like GitHub CLI).

Terminal
# 1. Authenticate (opens browser, one-time)
npx checkstack auth

# 2. Start the local proxy
npx checkstack start

# 3. Point your app at the proxy
#    → set baseURL to http://localhost:3456/v1

That's it. Every LLM call your app makes now flows through Checkstack. Shadow probes run in the background, comparing your model against cheaper alternatives - and you get a cost/accuracy receipt in your terminal.

Or, use checkstack run to auto-inject the proxy into a child process:

Terminal
# Automatically sets OPENAI_BASE_URL for the child process
checkstack run -- node my-agent.js
checkstack run -- python main.py

Compatibility

Checkstack works with any tool or SDK that lets you set a custom base URL. The proxy speaks the OpenAI-compatible chat completions API, so if your client can point at a custom endpoint, it works with Checkstack.

OpenAI Node SDK

app.ts
import OpenAI from "openai";

const openai = new OpenAI({
  baseURL: "http://localhost:3456/v1",  // ← Checkstack proxy
  apiKey: process.env.OPENAI_API_KEY,   // your real key, forwarded upstream
});

const res = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [{ role: "user", content: "Hello" }],
});

OpenAI Python SDK

main.py
import os

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:3456/v1",  # ← Checkstack proxy
    api_key=os.environ["OPENAI_API_KEY"],
)

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}],
)

Vercel AI SDK

app.ts
import { createOpenAI } from "@ai-sdk/openai";
import { generateText } from "ai";

const provider = createOpenAI({
  baseURL: "http://localhost:3456/v1",  // ← Checkstack proxy
});

const { text } = await generateText({
  model: provider("gpt-4o"),
  prompt: "Hello",
});

LangChain (Python)

chain.py
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="gpt-4o",
    openai_api_base="http://localhost:3456/v1",  # ← Checkstack proxy
)

response = llm.invoke("Hello")

LangChain (JS/TS)

chain.ts
import { ChatOpenAI } from "@langchain/openai";

const llm = new ChatOpenAI({
  modelName: "gpt-4o",
  configuration: {
    baseURL: "http://localhost:3456/v1",  // ← Checkstack proxy
  },
});

const response = await llm.invoke("Hello");

LlamaIndex (Python)

index.py
from llama_index.llms.openai import OpenAI

llm = OpenAI(
    model="gpt-4o",
    api_base="http://localhost:3456/v1",  # ← Checkstack proxy
)

response = llm.complete("Hello")

LiteLLM

app.py
import litellm

litellm.api_base = "http://localhost:3456/v1"  # ← Checkstack proxy

response = litellm.completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}],
)

curl

Terminal
curl http://localhost:3456/v1/chat/completions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Hello"}]
  }'

Environment Variable

Many SDKs respect the OPENAI_BASE_URL or OPENAI_API_BASE env var. Set it once and every OpenAI-compatible client in that process uses the proxy automatically:

Terminal
export OPENAI_BASE_URL=http://localhost:3456/v1

# Now any SDK that reads this env var routes through Checkstack
node my-agent.js
python main.py

checkstack run sets this automatically for the child process.

Anything with a Base URL

The proxy is a standard OpenAI-compatible HTTP server. If your tool, framework, or custom code can point its base URL at a custom endpoint, it's compatible. This includes Cursor, Continue, Cody, Aider, OpenRouter clients, Semantic Kernel, AutoGen, CrewAI, and any OpenAI-compatible wrapper.

CLI Reference

Install globally or use via npx:

Terminal
npm install -g checkstack    # global install
npx checkstack <command>     # or run via npx

checkstack auth

Authenticate via device auth (opens browser). One-time setup - stores your Checkstack API key in ~/.checkstack/config.json.

Terminal
checkstack auth

checkstack start

Start the local proxy server. Stays running until you press Ctrl+C.

Terminal
checkstack start              # default port 3456
checkstack start --port 8080  # custom port
Flag                 Description
-p, --port <port>    Proxy port (default: 3456)
--skip-auth          Skip authentication check (proxy-only mode)

checkstack run

Start the proxy and spawn a child command with OPENAI_BASE_URL auto-injected.

Terminal
checkstack run -- node my-agent.js
checkstack run -- python main.py
checkstack run --port 8080 -- npm start

checkstack wallet

Check your evaluation credit balance and subscription status.

Terminal
checkstack wallet

checkstack configure

Opens the dashboard to pick your shadow models (up to 5).

Terminal
checkstack configure

checkstack rerun

Re-run a past evaluation with optional prompt optimization.

Terminal
checkstack rerun <run-id>
checkstack rerun <run-id> --repeat 3
checkstack rerun <run-id> --suggest
checkstack rerun <run-id> --optimize 5
Flag                  Description
-r, --repeat <n>      Run 1–5 repeats for consistency
-s, --suggest         Ask AI for prompt improvements
-o, --optimize <n>    Auto-iterate prompt refinement 1–20 times
--system-prompt       Override system prompt
--user-prompt         Override user prompt

checkstack subscribe

Opens Stripe checkout in your browser to start a Checkstack Pro subscription.

Terminal
checkstack subscribe

How the Proxy Works

The Checkstack proxy is a lightweight local HTTP server that sits between your app and your LLM provider. Here's the flow:

  1. Your app sends a request to http://localhost:3456/v1/chat/completions
  2. The proxy detects the upstream provider from your API key prefix or model name (e.g. claude-* → Anthropic, gemini-* → Google)
  3. The request is forwarded immediately to the real provider - your app gets its response at full speed
  4. In the background, Checkstack fire-and-forgets a shadow request, running the same prompt against your configured shadow models
  5. An AI judge grades each response. A cost/accuracy receipt is printed in your terminal
Key Design Principles
  • Zero latency impact - your original request is forwarded immediately, shadow probes are async
  • Fail-safe - if Checkstack is down or your wallet is empty, the original request still succeeds
  • Keys never stored - API keys are forwarded to the upstream provider only, never persisted
  • Streaming supported - works with both SSE streaming and standard JSON responses

Provider Detection

The proxy detects which upstream provider to forward to using this priority chain:

  1. X-Upstream-Base header - explicit override
  2. API key prefix - e.g. sk-ant- → Anthropic, sk-or- → OpenRouter
  3. Model name - e.g. claude-* → Anthropic, gemini-* → Google
  4. Default → OpenAI
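In code, that chain is a handful of ordered checks. The TypeScript sketch below is illustrative only: the function name and return values are assumptions, not Checkstack's internal implementation.

// Illustrative sketch of the detection priority chain (not Checkstack's actual code).
function detectUpstream(
  headers: Record<string, string>,
  apiKey: string,
  model: string
): string {
  // 1. Explicit override via the X-Upstream-Base header
  const override = headers["x-upstream-base"];
  if (override) return override;

  // 2. API key prefix
  if (apiKey.startsWith("sk-ant-")) return "anthropic";
  if (apiKey.startsWith("sk-or-")) return "openrouter";

  // 3. Model name
  if (model.startsWith("claude-")) return "anthropic";
  if (model.startsWith("gemini-")) return "google";

  // 4. Default
  return "openai";
}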

Supported Providers

The proxy can forward to any of these upstream providers:

OpenAI
Anthropic
Google (Gemini)
Mistral
DeepSeek
OpenRouter
Vercel AI Gateway
Cloudflare AI Gateway
Fireworks AI
Together AI
Cohere

Evaluation Flow

When you run your app through the Checkstack proxy, every LLM call goes through a 7-phase evaluation pipeline. Your app is never slowed down — shadow testing happens asynchronously after the original response is returned.

Phase 1: Interception

When your app sends a request to localhost:3456/v1/chat/completions, the proxy:

  1. Detects the upstream provider from the API key prefix (e.g. sk-ant- → Anthropic) or model name (e.g. claude-* → Anthropic)
  2. Forwards immediately to the real provider — your app receives its normal response at full speed
  3. Captures the response (streaming or buffered) along with the original request payload
  4. Fires a background shadow request to Checkstack's server — this is async and fire-and-forget, so if Checkstack is unreachable, your app is unaffected
What Gets Sent

The shadow request includes the session ID, original model name, messages (full conversation history including images), tool definitions, the original response (text + tool calls), and optionally a response_format (JSON schema). Your API keys are never sent to Checkstack — they are only forwarded to the upstream provider.
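For orientation, the shadow payload described above might look roughly like the following TypeScript type. The field names are assumptions for illustration, not Checkstack's actual wire format.

// Illustrative shape only; field names are assumptions, not the real wire format.
interface ShadowRequest {
  sessionId: string;          // groups every call from one session into a single run
  originalModel: string;      // e.g. "gpt-4o"
  messages: unknown[];        // full conversation history, including images
  tools?: unknown[];          // tool definitions, if any
  originalResponse: {
    text: string | null;
    toolCalls?: unknown[];
  };
  responseFormat?: object;    // optional JSON schema for structured output
  // Note: API keys are never part of this payload.
}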

Phase 2: Shadow Dispatch

On the server, Checkstack authenticates the request, verifies your balance, then converts the OpenAI-format messages and tools into the AI SDK's internal format. This conversion preserves:

  • Multi-turn conversation history with proper role attribution
  • Tool call / tool result chains with call ID linking
  • Image content in user messages (base64 and URL)
  • Image content in tool results (e.g. browser screenshots)
  • System prompts and structured output schemas

If the primary conversion fails (some models have non-standard formats), a sanitized fallback flattens tool calls into plain text prose — so every shadow model sees a usable conversation regardless of tool support. For example:

Before (native tool format)
{ "role": "assistant",
  "tool_calls": [{
    "function": { "name": "search_web", "arguments": "{\"q\": \"weather NYC\"}" }
  }] }
{ "role": "tool", "tool_call_id": "call_abc", "content": "72°F, sunny" }

After (sanitized fallback)
{ "role": "assistant", "content": "Called search_web({\"q\": \"weather NYC\"})" }
{ "role": "user", "content": "Tool result for search_web: 72°F, sunny" }

Phase 3: Model Selection

Checkstack picks 2–3 cheaper shadow models to compare against your original. You can configure these via checkstack configure or let Checkstack auto-select from the default pool:

Model                    Provider
Gemini 2.0 Flash Lite    Google
Gemini 2.0 Flash         Google
Mistral Small            Mistral
Claude 3.5 Haiku         Anthropic
GPT-4o Mini              OpenAI

The auto-selector excludes any model that matches the original (you won't shadow-test GPT-4o against itself). All remaining models from the pool run in parallel via the AI Gateway.

If a shadow model doesn't support tool calling, Checkstack automatically retries with the sanitized (tool-free) message format. When this happens, the response includes a tool_fallback: true flag so you know that model was evaluated without native tool support.
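Conceptually, the selection and dispatch step looks like the sketch below. The model identifiers and the generate callback are placeholders, not real Checkstack names.

// Illustrative auto-selection + parallel dispatch; identifiers are placeholders.
const DEFAULT_POOL = [
  "gemini-2.0-flash-lite",
  "gemini-2.0-flash",
  "mistral-small",
  "claude-3-5-haiku",
  "gpt-4o-mini",
];

async function runShadows(
  originalModel: string,
  generate: (model: string) => Promise<unknown>
) {
  // Never shadow-test a model against itself.
  const shadows = DEFAULT_POOL.filter((m) => m !== originalModel);
  // All remaining shadows run in parallel; one failure doesn't block the others.
  return Promise.allSettled(shadows.map((m) => generate(m)));
}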

Phase 4: Comparison Logic

Each shadow response is compared against the original using a multi-stage pipeline. The path depends on the response type:

Standard Path (Text + Tool Calls)
  1. Classify response types — both text, both tool calls, or mixed (one text, one tool call)
  2. Both tool calls → direct comparison: same function names + same args = match; same functions but string arg differences = close → embed the differing args; different functions = miss → LLM judge
  3. Both text → embed both responses using text-embedding-3-small, compute cosine similarity
  4. Mixed type (e.g. original used a tool call, shadow responded with text) → sent to the LLM judge with full tool definitions so it can verify tool calls are real (not hallucinated), plus explicit guidance that text-only responses are equally valid
Embedding Similarity Thresholds
Cosine Similarity    Result
≥ 0.85               Match — responses are semantically equivalent
0.70 – 0.85          Borderline — escalated to LLM judge
< 0.70               Mismatch — responses diverge
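As a rough sketch of the text path, classification comes down to a cosine similarity over the two embedding vectors plus the thresholds above. The embedding call itself is omitted, and both helpers are illustrative, not Checkstack functions.

// Cosine similarity between two embedding vectors (e.g. from text-embedding-3-small).
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Threshold logic from the table above.
function classify(similarity: number): "match" | "borderline" | "mismatch" {
  if (similarity >= 0.85) return "match";      // semantically equivalent
  if (similarity >= 0.7) return "borderline";  // escalated to the LLM judge
  return "mismatch";                           // responses diverge
}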

Phase 5: Structured Output Evaluation

When your request includes a response_format with a JSON schema, Checkstack uses deterministic field-level comparison instead of embedding the full response:

  1. Parse shadow response as JSON (mark as invalid if it fails)
  2. Flatten both objects to dot-path keys (e.g. address.city, items[0].name)
  3. Compare each field:
    • Booleans and numbers → exact match (score 1.0 or 0.0)
    • Strings → batched into a single embedMany() call, scored by cosine similarity
  4. Borderline strings (similarity 0.65–0.80) → escalated to the LLM judge for a semantic equivalence check
  5. Final accuracy = (total field score / total fields) × 100. Shadow matches if accuracy ≥ 90%
Field-Level Thresholds
String Similarity    Field Score
≥ 0.80               1.0 (match)
0.65 – 0.80          LLM judge decides
< 0.65               0.0 (mismatch)
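The dot-path flattening in step 2 can be pictured with a small recursive helper like the one below; an illustrative sketch, not Checkstack's implementation.

// Flattens nested objects/arrays into dot-path keys, e.g. "address.city", "items[0].name".
function flatten(
  value: unknown,
  prefix = "",
  out: Record<string, unknown> = {}
): Record<string, unknown> {
  if (Array.isArray(value)) {
    value.forEach((v, i) => flatten(v, `${prefix}[${i}]`, out));
  } else if (value !== null && typeof value === "object") {
    for (const [k, v] of Object.entries(value)) {
      flatten(v, prefix ? `${prefix}.${k}` : k, out);
    }
  } else {
    out[prefix] = value;  // leaf: boolean, number, string, or null
  }
  return out;
}

// flatten({ address: { city: "NYC" }, items: [{ name: "pen" }] })
// → { "address.city": "NYC", "items[0].name": "pen" }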

Phase 6: LLM Judge

When embedding similarity falls in a borderline zone, or when response types are mixed, Checkstack escalates to an LLM judge (Claude Sonnet 4.6, temperature 0) for a definitive ruling.

The judge receives:

  • A conversation summary (last user message + truncated history, capped at 6,000 characters)
  • The original response and shadow response (text or tool call details)
  • Available tool definitions (stripped for mixed-type comparisons to avoid tool bias)

The judge is asked: “Is the shadow response a valid next step for this conversation?” — it evaluates independently whether the shadow's approach would satisfy the user, rather than requiring an exact match with the original.

Judge Variants
Scenario            Judge Behavior
Borderline text     Full context including tool descriptions
Mixed type          Full context with tool definitions — verifies tool calls are real, but text-only responses are not penalized
Tool call miss      Evaluates whether different tool calls achieve the same intent
Structured field    Field-level semantic equivalence check
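To make the setup concrete, a judge prompt built from these inputs might be assembled roughly as follows. The wording is an assumption, not the exact prompt Checkstack sends.

// Illustrative prompt assembly; not the exact judge prompt.
function buildJudgePrompt(summary: string, original: string, shadow: string): string {
  return [
    `Conversation summary (truncated to 6,000 characters):\n${summary}`,
    `Original response:\n${original}`,
    `Shadow response:\n${shadow}`,
    "Is the shadow response a valid next step for this conversation?",
    "Judge whether it would satisfy the user, not whether it matches the original exactly.",
  ].join("\n\n");
}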

Phase 7: Results & Billing

After comparison completes, Checkstack:

  1. Deducts the cost of shadow generation + embedding + judge calls from your wallet (zero markup — exact provider cost)
  2. Calculates savings — finds the cheapest matching shadow and computes the percentage you'd save by switching
  3. Returns a receipt to the CLI, which prints a summary in your terminal showing each shadow model's match status, cost, and latency
  4. Persists results asynchronously — dataset rows and per-model results are written to your dashboard after the response is returned

All requests within a single checkstack start session are grouped into one run on your dashboard, so you can browse through every intercepted call with its comparison results.
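The savings figure on the receipt follows directly from the comparison results: take the cheapest shadow that matched and compare its cost to the original. A minimal sketch, with assumed result shapes:

// Illustrative savings calculation; the result shape is an assumption.
interface ShadowResult {
  model: string;
  cost: number;     // USD for this request
  matched: boolean; // did it pass the comparison pipeline?
}

function savingsPercent(originalCost: number, shadows: ShadowResult[]): number | null {
  const matching = shadows.filter((s) => s.matched);
  if (matching.length === 0) return null; // no matching shadow, no savings claim
  const cheapest = Math.min(...matching.map((s) => s.cost));
  return Math.round((1 - cheapest / originalCost) * 100);
}

// With the receipt below: savingsPercent(0.0032, shadows) → 97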

Example Receipt (per request)
┌────────────────────────────────────────────┐
│  Shadow Eval #14                           │
├────────────────────────────────────────────┤
│  Original: gpt-4o         $0.0032  820ms   │
├────────────────────────────────────────────┤
│  ✓ Gemini Flash Lite      $0.0001  156ms   │
│  ✓ Gemini Flash           $0.0003  203ms   │
│  ✗ Mistral Small          $0.0004  312ms   │
├────────────────────────────────────────────┤
│  Overhead: $0.0001  Savings: 97%           │
│  Balance: $11.82                           │
└────────────────────────────────────────────┘
Full Pipeline
Your App
   │ request
   ▼
CLI Proxy (localhost:3456)
   ├─ forwards to the upstream provider (response returned to your app immediately)
   └─ captures the exchange and fires a shadow request to the Checkstack Server
        ├─ auth + balance check
        ├─ convert messages + tools
        ├─ generate shadows in parallel (Gemini Flash Lite · Gemini Flash · Mistral Small · Claude 3.5 Haiku · ...)
        ├─ compare each shadow: embedding similarity · tool call comparison · LLM judge
        └─ deduct cost · calculate savings → return receipt to the CLI

Pricing

Checkstack Pro
$15/mo

14-day free trial with $1 of evaluation credit included. No credit card required. Full access to CLI proxy, dashboard, and all 164+ models.

Evaluation Credit
$0 markup

Purchase credit as a one-time top-up ($10, $25, or $50). We pass the exact token cost to you - zero markup, ever. Credit expires 1 year after purchase.

Ready to get started?

14-day free trial. $1 of evaluation credit included. No credit card required.