Checkstack Docs
Evaluate LLMs on your exact workload. Compare cost, accuracy, and latency across 164+ models - in under 30 seconds.
Quickstart
Get running in three commands. No API keys to configure - Checkstack uses device auth (like GitHub CLI).
# 1. Authenticate (opens browser, one-time)
npx checkstack auth

# 2. Start the local proxy
npx checkstack start

# 3. Point your app at the proxy
# → set baseURL to http://localhost:3456/v1
That's it. Every LLM call your app makes now flows through Checkstack. Shadow probes run in the background, comparing your model against cheaper alternatives - and you get a cost/accuracy receipt in your terminal.
Or, use checkstack run to auto-inject the proxy into a child process:
# Automatically sets OPENAI_BASE_URL for the child process
checkstack run -- node my-agent.js
checkstack run -- python main.py
Compatibility
Checkstack works with any tool or SDK that lets you set a custom base URL. The proxy speaks the OpenAI-compatible chat completions API, so if your client can point at a custom endpoint, it works with Checkstack.
OpenAI Node SDK
import OpenAI from "openai";
const openai = new OpenAI({
baseURL: "http://localhost:3456/v1", // ← Checkstack proxy
apiKey: process.env.OPENAI_API_KEY, // your real key, forwarded upstream
});
const res = await openai.chat.completions.create({
model: "gpt-4o",
messages: [{ role: "user", content: "Hello" }],
});
OpenAI Python SDK
import os
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:3456/v1", # ← Checkstack proxy
api_key=os.environ["OPENAI_API_KEY"],
)
resp = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": "Hello"}],
)
Vercel AI SDK
import { createOpenAI } from "@ai-sdk/openai";
import { generateText } from "ai";
const provider = createOpenAI({
baseURL: "http://localhost:3456/v1", // ← Checkstack proxy
});
const { text } = await generateText({
model: provider("gpt-4o"),
prompt: "Hello",
});
LangChain (Python)
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(
model="gpt-4o",
openai_api_base="http://localhost:3456/v1", # ← Checkstack proxy
)
response = llm.invoke("Hello")
LangChain (JS/TS)
import { ChatOpenAI } from "@langchain/openai";
const llm = new ChatOpenAI({
modelName: "gpt-4o",
configuration: {
baseURL: "http://localhost:3456/v1", // ← Checkstack proxy
},
});
const response = await llm.invoke("Hello");
LlamaIndex (Python)
from llama_index.llms.openai import OpenAI
llm = OpenAI(
model="gpt-4o",
api_base="http://localhost:3456/v1", # ← Checkstack proxy
)
response = llm.complete("Hello")
LiteLLM
import litellm
litellm.api_base = "http://localhost:3456/v1" # ← Checkstack proxy
response = litellm.completion(
model="gpt-4o",
messages=[{"role": "user", "content": "Hello"}],
)
curl
curl http://localhost:3456/v1/chat/completions \
-H "Authorization: Bearer $OPENAI_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4o",
"messages": [{"role": "user", "content": "Hello"}]
}'
Environment Variable
Many SDKs respect the OPENAI_BASE_URL or OPENAI_API_BASE env var. Set it once and every OpenAI-compatible client in that process uses the proxy automatically:
export OPENAI_BASE_URL=http://localhost:3456/v1

# Now any SDK that reads this env var routes through Checkstack
node my-agent.js
python main.py
checkstack run sets this automatically for the child process.
Anything with a Base URL
The proxy is a standard OpenAI-compatible HTTP server. If your tool, framework, or custom code can point baseURL at a custom endpoint, it's compatible. This includes Cursor, Continue, Cody, Aider, OpenRouter clients, Semantic Kernel, AutoGen, CrewAI, and any OpenAI-compatible wrapper.
CLI Reference
Install globally or use via npx:
npm install -g checkstack    # global install
npx checkstack <command>     # or run via npx
checkstack auth
Authenticate via device auth (opens browser). One-time setup - stores your API key in ~/.checkstack/config.json.
checkstack auth
checkstack start
Start the local proxy server. Stays running until you press Ctrl+C.
checkstack start              # default port 3456
checkstack start --port 8080  # custom port
| Flag | Description |
|---|---|
| -p, --port <port> | Proxy port (default: 3456) |
| --skip-auth | Skip authentication check (proxy-only mode) |
checkstack run
Start the proxy and spawn a child command with OPENAI_BASE_URL auto-injected.
checkstack run -- node my-agent.js
checkstack run -- python main.py
checkstack run --port 8080 -- npm start
checkstack wallet
Check your evaluation credit balance and subscription status.
checkstack wallet
checkstack configure
Opens the dashboard to pick your shadow models (up to 5).
checkstack configure
checkstack rerun
Re-run a past evaluation with optional prompt optimization.
checkstack rerun <run-id>
checkstack rerun <run-id> --repeat 3
checkstack rerun <run-id> --suggest
checkstack rerun <run-id> --optimize 5
| Flag | Description |
|---|---|
| -r, --repeat <n> | Run 1–5 repeats for consistency |
| -s, --suggest | Ask AI for prompt improvements |
| -o, --optimize <n> | Auto-iterate prompt refinement 1–20 times |
| --system-prompt | Override system prompt |
| --user-prompt | Override user prompt |
checkstack subscribe
Opens Stripe checkout in your browser to start a Checkstack Pro subscription.
checkstack subscribe
How the Proxy Works
The Checkstack proxy is a lightweight local HTTP server that sits between your app and your LLM provider. Here's the flow:
1. Your app sends a request to http://localhost:3456/v1/chat/completions
2. The proxy detects the upstream provider from your API key prefix or model name (e.g. claude-* → Anthropic, gemini-* → Google)
3. The request is forwarded immediately to the real provider - your app gets its response at full speed
4. In the background, Checkstack fires off a fire-and-forget shadow request, running the same prompt against your configured shadow models
5. An AI judge grades each response, and a cost/accuracy receipt is printed in your terminal
- ✓ Zero latency impact - your original request is forwarded immediately; shadow probes are async
- ✓ Fail-safe - if Checkstack is down or your wallet is empty, the original request still succeeds
- ✓ Keys never stored - API keys are forwarded to the upstream provider only, never persisted
- ✓ Streaming supported - works with both SSE streaming and standard JSON responses
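The hot path above can be sketched as follows. This is a minimal illustration of the forward-then-probe pattern, not the actual Checkstack source; `forward_upstream` and `send_shadow_probe` are hypothetical helpers standing in for the real HTTP calls:

```python
import threading

def handle_request(forward_upstream, send_shadow_probe, request):
    """Sketch of the proxy's request flow: forward first, probe in the
    background. Helper callables here are hypothetical placeholders."""
    # 1. Forward to the real provider synchronously - the app is never
    #    blocked on anything Checkstack-related.
    response = forward_upstream(request)

    # 2. Fire-and-forget the shadow probe on a daemon thread, so an
    #    unreachable Checkstack server cannot delay the app.
    def probe():
        try:
            send_shadow_probe(request, response)
        except Exception:
            pass  # fail-safe: shadow errors never surface to the app

    threading.Thread(target=probe, daemon=True).start()

    # 3. The original response is returned immediately.
    return response
```

Because the probe runs after the response is already on its way back, a slow or failing shadow path adds no latency to the original call.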
Provider Detection
The proxy detects which upstream provider to forward to using this priority chain:
1. X-Upstream-Base header - explicit override
2. API key prefix - e.g. sk-ant- → Anthropic, sk-or- → OpenRouter
3. Model name - e.g. claude-* → Anthropic, gemini-* → Google
4. Default → OpenAI
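The priority chain can be sketched as a simple cascade. The prefixes below are only the ones the docs name as examples; the real proxy may recognise more:

```python
def detect_provider(headers=None, api_key="", model=""):
    """Illustrative sketch of the documented detection priority chain."""
    headers = headers or {}
    # 1. Explicit override: forward to this base URL directly
    if "X-Upstream-Base" in headers:
        return headers["X-Upstream-Base"]
    # 2. API key prefix
    if api_key.startswith("sk-ant-"):
        return "anthropic"
    if api_key.startswith("sk-or-"):
        return "openrouter"
    # 3. Model name
    if model.startswith("claude-"):
        return "anthropic"
    if model.startswith("gemini-"):
        return "google"
    # 4. Default
    return "openai"
```

Earlier rules always win, so an explicit `X-Upstream-Base` header bypasses key- and model-based detection entirely.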
Supported Providers
The proxy can forward to any supported upstream provider, including OpenAI, Anthropic, Google, and OpenRouter.
Evaluation Flow
When you run your app through the Checkstack proxy, every LLM call goes through a 7-phase evaluation pipeline. Your app is never slowed down — shadow testing happens asynchronously after the original response is returned.
Phase 1: Interception
When your app sends a request to localhost:3456/v1/chat/completions, the proxy:
- Detects the upstream provider from the API key prefix (e.g. sk-ant- → Anthropic) or model name (e.g. claude-* → Anthropic)
- Forwards immediately to the real provider — your app receives its normal response at full speed
- Captures the response (streaming or buffered) along with the original request payload
- Fires a background shadow request to Checkstack's server — this is async and fire-and-forget, so if Checkstack is unreachable, your app is unaffected
The shadow request includes the session ID, original model name, messages (full conversation history including images), tool definitions, the original response (text + tool calls), and optionally a response_format (JSON schema). Your API keys are never sent to Checkstack — they are only forwarded to the upstream provider.
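The payload described above could look roughly like the following. Field names and values here are illustrative guesses based on the fields the docs list; the real wire format is not documented:

```python
# Hypothetical shape of the shadow request body (field names are ours).
shadow_payload = {
    "session_id": "sess_abc123",      # groups calls into one dashboard run
    "original_model": "gpt-4o",
    "messages": [                     # full conversation history, incl. images
        {"role": "user", "content": "Hello"},
    ],
    "tools": [],                      # tool definitions, if any
    "original_response": {            # text + tool calls from the real provider
        "text": "Hi! How can I help?",
        "tool_calls": [],
    },
    "response_format": None,          # optional JSON schema
}

# Note what is absent: the upstream API key is never part of this payload.
assert "api_key" not in shadow_payload
```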
Phase 2: Shadow Dispatch
On the server, Checkstack authenticates the request, verifies your balance, then converts the OpenAI-format messages and tools into the AI SDK's internal format. This conversion preserves:
- Multi-turn conversation history with proper role attribution
- Tool call / tool result chains with call ID linking
- Image content in user messages (base64 and URL)
- Image content in tool results (e.g. browser screenshots)
- System prompts and structured output schemas
If the primary conversion fails (some models have non-standard formats), a sanitized fallback flattens tool calls into plain text prose — so every shadow model sees a usable conversation regardless of tool support. For example:
Before (native tool calls):

{ "role": "assistant",
  "tool_calls": [{
    "id": "call_abc",
    "function": {
      "name": "search_web",
      "arguments": "{\"q\": \"weather NYC\"}" }}]}
{ "role": "tool",
  "tool_call_id": "call_abc",
  "content": "72°F, sunny" }

After (sanitized fallback):

{ "role": "assistant",
  "content": "Called search_web({\"q\": \"weather NYC\"})" }
{ "role": "user",
  "content": "Tool result for search_web: 72°F, sunny" }

Phase 3: Model Selection
Checkstack picks 2–3 cheaper shadow models to compare against your original. You can configure these via checkstack configure or let Checkstack auto-select from the default pool:
| Model | Provider |
|---|---|
| Gemini 2.0 Flash Lite | Google |
| Gemini 2.0 Flash | Google |
| Mistral Small | Mistral |
| Claude 3.5 Haiku | Anthropic |
| GPT-4o Mini | OpenAI |
The auto-selector excludes any model that matches the original (you won't shadow-test GPT-4o against itself). All remaining models from the pool run in parallel via the AI Gateway.
If a shadow model doesn't support tool calling, Checkstack automatically retries with the sanitized (tool-free) message format. When this happens, the response includes a tool_fallback: true flag so you know that model was evaluated without native tool support.
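The auto-selector's exclusion rule can be sketched in a few lines. The pool entries and exact-match comparison are simplifications; the real selector may normalise model IDs differently:

```python
DEFAULT_POOL = [
    "gemini-2.0-flash-lite",
    "gemini-2.0-flash",
    "mistral-small",
    "claude-3.5-haiku",
    "gpt-4o-mini",
]

def select_shadow_models(original_model, pool=DEFAULT_POOL):
    """Sketch of the auto-selector: drop any pool entry matching the
    original, so a model is never shadow-tested against itself."""
    return [m for m in pool if m != original_model]
```

Everything that survives the filter is then dispatched in parallel.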
Phase 4: Comparison Logic
Each shadow response is compared against the original using a multi-stage pipeline. The path depends on the response type:
- Classify response types — both text, both tool calls, or mixed (one text, one tool call)
- Both tool calls → direct comparison: same function names + same args = match; same functions but string arg differences = close → embed the differing args; different functions = miss → LLM judge
- Both text → embed both responses using text-embedding-3-small, compute cosine similarity
- Mixed type (e.g. original used a tool call, shadow responded with text) → sent to the LLM judge with full tool definitions so it can verify tool calls are real (not hallucinated), plus explicit guidance that text-only responses are equally valid
| Cosine Similarity | Result |
|---|---|
| ≥ 0.85 | Match — responses are semantically equivalent |
| 0.70 – 0.85 | Borderline — escalated to LLM judge |
| < 0.70 | Mismatch — responses diverge |
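The text-comparison thresholds in the table above reduce to a cosine similarity over the two embedding vectors plus a three-way cut. A minimal sketch (the embedding call itself is omitted; only the math and thresholds come from the docs):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def classify(similarity):
    """Apply the documented text-comparison thresholds."""
    if similarity >= 0.85:
        return "match"       # semantically equivalent
    if similarity >= 0.70:
        return "borderline"  # escalated to the LLM judge
    return "mismatch"        # responses diverge
```

Only the borderline band costs a judge call; clear matches and mismatches are settled by embeddings alone.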
Phase 5: Structured Output Evaluation
When your request includes a response_format with a JSON schema, Checkstack uses deterministic field-level comparison instead of embedding the full response:
- Parse shadow response as JSON (mark as invalid if it fails)
- Flatten both objects to dot-path keys (e.g. address.city, items[0].name)
- Compare each field:
  - Booleans and numbers → exact match (score 1.0 or 0.0)
  - Strings → batched into a single embedMany() call, scored by cosine similarity
- Borderline strings (similarity 0.65–0.80) → escalated to the LLM judge for a semantic equivalence check
- Final accuracy = (total field score / total fields) × 100. Shadow matches if accuracy ≥ 90%
| String Similarity | Field Score |
|---|---|
| ≥ 0.80 | 1.0 (match) |
| 0.65 – 0.80 | LLM judge decides |
| < 0.65 | 0.0 (mismatch) |
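The dot-path flattening and the final accuracy formula can be sketched directly from the description above (this is an illustration of the documented behavior, not the real implementation):

```python
def flatten(obj, prefix=""):
    """Flatten nested JSON into dot-path keys; list items become [i]."""
    out = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            path = f"{prefix}.{key}" if prefix else key
            out.update(flatten(value, path))
    elif isinstance(obj, list):
        for i, value in enumerate(obj):
            out.update(flatten(value, f"{prefix}[{i}]"))
    else:
        out[prefix] = obj
    return out

def accuracy(field_scores):
    """Final accuracy = (total field score / total fields) x 100;
    the shadow matches if accuracy >= 90."""
    pct = sum(field_scores) / len(field_scores) * 100
    return pct, pct >= 90
```

For example, `flatten({"address": {"city": "NYC"}})` yields `{"address.city": "NYC"}`, and each flattened field then contributes one score to the accuracy average.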
Phase 6: LLM Judge
When embedding similarity falls in a borderline zone, or when response types are mixed, Checkstack escalates to an LLM judge (Claude Sonnet 4.6, temperature 0) for a definitive ruling.
The judge receives:
- A conversation summary (last user message + truncated history, capped at 6,000 characters)
- The original response and shadow response (text or tool call details)
- Available tool definitions (stripped for mixed-type comparisons to avoid tool bias)
The judge is asked: “Is the shadow response a valid next step for this conversation?” — it evaluates independently whether the shadow's approach would satisfy the user, rather than requiring an exact match with the original.
| Scenario | Judge Behavior |
|---|---|
| Borderline text | Full context including tool descriptions |
| Mixed type | Full context with tool definitions — verifies tool calls are real, but text-only responses are not penalized |
| Tool call miss | Evaluates whether different tool calls achieve the same intent |
| Structured field | Field-level semantic equivalence check |
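Assembling the judge's input might look like the sketch below. The prompt wording is entirely ours (hypothetical); only the 6,000-character cap on conversation context and the final question come from the docs:

```python
MAX_SUMMARY_CHARS = 6000  # documented cap on conversation context

def build_judge_prompt(last_user_msg, history, original, shadow):
    """Hypothetical assembly of the judge input."""
    # Last user message first, then truncated history, capped at 6,000 chars
    summary = (last_user_msg + "\n" + history)[:MAX_SUMMARY_CHARS]
    return (
        f"Conversation:\n{summary}\n\n"
        f"Original response: {original}\n"
        f"Shadow response: {shadow}\n\n"
        "Is the shadow response a valid next step for this conversation?"
    )
```

The key property is the framing of the question: the judge scores the shadow's answer on its own merits rather than on similarity to the original.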
Phase 7: Results & Billing
After comparison completes, Checkstack:
- Deducts the cost of shadow generation + embedding + judge calls from your wallet (zero markup — exact provider cost)
- Calculates savings — finds the cheapest matching shadow and computes the percentage you'd save by switching
- Returns a receipt to the CLI, which prints a summary in your terminal showing each shadow model's match status, cost, and latency
- Persists results asynchronously — dataset rows and per-model results are written to your dashboard after the response is returned
All requests within a single checkstack start session are grouped into one run on your dashboard, so you can browse through every intercepted call with its comparison results.
┌────────────────────────────────────────────┐
│ Shadow Eval #14                            │
├────────────────────────────────────────────┤
│ Original: gpt-4o        $0.0032     820ms  │
├────────────────────────────────────────────┤
│ ✓ Gemini Flash Lite     $0.0001     156ms  │
│ ✓ Gemini Flash          $0.0003     203ms  │
│ ✗ Mistral Small         $0.0004     312ms  │
├────────────────────────────────────────────┤
│ Overhead: $0.0001       Savings: 97%       │
│ Balance: $11.82                            │
└────────────────────────────────────────────┘
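The savings figure on the receipt follows from the "cheapest matching shadow" rule described above. A sketch of that arithmetic, using the receipt's own numbers (the function name and result shape are ours):

```python
def savings_percent(original_cost, shadow_results):
    """Percentage saved by switching to the cheapest *matching* shadow.
    Non-matching shadows are ignored; no match means no claimed savings."""
    matching_costs = [r["cost"] for r in shadow_results if r["match"]]
    if not matching_costs:
        return 0
    cheapest = min(matching_costs)
    return round((original_cost - cheapest) / original_cost * 100)
```

With the receipt's numbers (original $0.0032, cheapest matching shadow $0.0001), this gives (0.0032 − 0.0001) / 0.0032 ≈ 97%.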
Pricing
14-day free trial with $1 of evaluation credit included. No credit card required. Full access to CLI proxy, dashboard, and all 164+ models.
Purchase credit as a one-time top-up ($10, $25, or $50). We pass the exact token cost to you - zero markup, ever. Credit expires 1 year after purchase.