Checkstack Docs
Evaluate LLMs on your exact workload. Compare cost, accuracy, and latency across 164+ models - in under 30 seconds.
Quickstart
Get running in three commands. No API keys to configure - Checkstack uses device auth (like GitHub CLI).
# 1. Authenticate (opens browser, one-time)
npx checkstack auth

# 2. Start the local proxy
npx checkstack start

# 3. Point your app at the proxy
# → set baseURL to http://localhost:3456/v1
That's it. Every LLM call your app makes now flows through Checkstack. Shadow probes run in the background, comparing your model against cheaper alternatives - and you get a cost/accuracy receipt in your terminal.
Or, use checkstack run to auto-inject the proxy into a child process:
# Automatically sets OPENAI_BASE_URL for the child process
checkstack run -- node my-agent.js
checkstack run -- python main.py
Compatibility
Checkstack works with any tool or SDK that lets you set a custom base URL. The proxy speaks the OpenAI-compatible chat completions API, so if your client can point at a custom endpoint, it works with Checkstack.
OpenAI Node SDK
import OpenAI from "openai";
const openai = new OpenAI({
baseURL: "http://localhost:3456/v1", // ← Checkstack proxy
apiKey: process.env.OPENAI_API_KEY, // your real key, forwarded upstream
});
const res = await openai.chat.completions.create({
model: "gpt-4o",
messages: [{ role: "user", content: "Hello" }],
});
OpenAI Python SDK
import os
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:3456/v1", # ← Checkstack proxy
api_key=os.environ["OPENAI_API_KEY"],
)
resp = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": "Hello"}],
)
Vercel AI SDK
import { createOpenAI } from "@ai-sdk/openai";
import { generateText } from "ai";
const provider = createOpenAI({
baseURL: "http://localhost:3456/v1", // ← Checkstack proxy
});
const { text } = await generateText({
model: provider("gpt-4o"),
prompt: "Hello",
});
LangChain (Python)
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(
model="gpt-4o",
openai_api_base="http://localhost:3456/v1", # ← Checkstack proxy
)
response = llm.invoke("Hello")
LangChain (JS/TS)
import { ChatOpenAI } from "@langchain/openai";
const llm = new ChatOpenAI({
modelName: "gpt-4o",
configuration: {
baseURL: "http://localhost:3456/v1", // ← Checkstack proxy
},
});
const response = await llm.invoke("Hello");
LlamaIndex (Python)
from llama_index.llms.openai import OpenAI
llm = OpenAI(
model="gpt-4o",
api_base="http://localhost:3456/v1", # ← Checkstack proxy
)
response = llm.complete("Hello")
LiteLLM
import litellm
litellm.api_base = "http://localhost:3456/v1" # ← Checkstack proxy
response = litellm.completion(
model="gpt-4o",
messages=[{"role": "user", "content": "Hello"}],
)
curl
curl http://localhost:3456/v1/chat/completions \
-H "Authorization: Bearer $OPENAI_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4o",
"messages": [{"role": "user", "content": "Hello"}]
}'
Environment Variable
Many SDKs respect the OPENAI_BASE_URL or OPENAI_API_BASE env var. Set it once and every OpenAI-compatible client in that process uses the proxy automatically:
export OPENAI_BASE_URL=http://localhost:3456/v1

# Now any SDK that reads this env var routes through Checkstack
node my-agent.js
python main.py
checkstack run sets this automatically for the child process.
Anything with a Base URL
The proxy is a standard OpenAI-compatible HTTP server. If your tool, framework, or custom code can point baseURL at a custom endpoint, it's compatible. This includes Cursor, Continue, Cody, Aider, OpenRouter clients, Semantic Kernel, AutoGen, CrewAI, and any OpenAI-compatible wrapper.
CLI Reference
Install globally or use via npx:
npm install -g checkstack    # global install
npx checkstack <command>     # or run via npx
checkstack auth
Authenticate via device auth (opens browser). One-time setup - stores your API key in ~/.checkstack/config.json.
checkstack auth
checkstack start
Start the local proxy server. Stays running until you press Ctrl+C.
checkstack start              # default port 3456
checkstack start --port 8080  # custom port
| Flag | Description |
|---|---|
| -p, --port <port> | Proxy port (default: 3456) |
| --skip-auth | Skip authentication check (proxy-only mode) |
checkstack run
Start the proxy and spawn a child command with OPENAI_BASE_URL auto-injected.
checkstack run -- node my-agent.js
checkstack run -- python main.py
checkstack run --port 8080 -- npm start
checkstack wallet
Check your evaluation credit balance and subscription status.
checkstack wallet
checkstack configure
Opens the dashboard to pick your shadow models (up to 5).
checkstack configure
checkstack rerun
Re-run a past evaluation with optional prompt optimization.
checkstack rerun <run-id>
checkstack rerun <run-id> --repeat 3
checkstack rerun <run-id> --suggest
checkstack rerun <run-id> --optimize 5
| Flag | Description |
|---|---|
| -r, --repeat <n> | Run 1–5 repeats for consistency |
| -s, --suggest | Ask AI for prompt improvements |
| -o, --optimize <n> | Auto-iterate prompt refinement 1–20 times |
| --system-prompt | Override system prompt |
| --user-prompt | Override user prompt |
checkstack subscribe
Opens Stripe checkout in your browser to start a Checkstack Pro subscription.
checkstack subscribe
How the Proxy Works
The Checkstack proxy is a lightweight local HTTP server that sits between your app and your LLM provider. Here's the flow:
1. Your app sends a request to http://localhost:3456/v1/chat/completions
2. The proxy detects the upstream provider from your API key prefix or model name (e.g. claude-* → Anthropic, gemini-* → Google)
3. The request is forwarded immediately to the real provider - your app gets its response at full speed
4. In the background, Checkstack fires off a fire-and-forget shadow request, running the same prompt against your configured shadow models
5. An AI judge grades each response, and a cost/accuracy receipt is printed in your terminal
- ✓ Zero latency impact - your original request is forwarded immediately; shadow probes are async
- ✓ Fail-safe - if Checkstack is down or your wallet is empty, the original request still succeeds
- ✓ Keys never stored - API keys are forwarded to the upstream provider only, never persisted
- ✓ Streaming supported - works with both SSE streaming and standard JSON responses
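The hot path above can be sketched as follows. This is a minimal illustration of the forward-then-probe pattern, not the actual Checkstack source; `forward_upstream` and `send_shadow_probe` are hypothetical helpers standing in for the real HTTP calls:

```python
import threading

def handle_request(forward_upstream, send_shadow_probe, request):
    """Sketch of the proxy's request flow: forward first, probe in the
    background. Helper callables here are hypothetical placeholders."""
    # 1. Forward to the real provider synchronously - the app is never
    #    blocked on anything Checkstack-related.
    response = forward_upstream(request)

    # 2. Fire-and-forget the shadow probe on a daemon thread, so an
    #    unreachable Checkstack server cannot delay the app.
    def probe():
        try:
            send_shadow_probe(request, response)
        except Exception:
            pass  # fail-safe: shadow errors never surface to the app

    threading.Thread(target=probe, daemon=True).start()

    # 3. The original response is returned immediately.
    return response
```

Because the probe runs after the response is already on its way back, a slow or failing shadow path adds no latency to the original call.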
Provider Detection
The proxy detects which upstream provider to forward to using this priority chain:
1. X-Upstream-Base header - explicit override
2. API key prefix - e.g. sk-ant- → Anthropic, sk-or- → OpenRouter
3. Model name - e.g. claude-* → Anthropic, gemini-* → Google
4. Default → OpenAI
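The priority chain can be sketched as a simple cascade. The prefixes below are only the ones the docs name as examples; the real proxy may recognise more:

```python
def detect_provider(headers=None, api_key="", model=""):
    """Illustrative sketch of the documented detection priority chain."""
    headers = headers or {}
    # 1. Explicit override: forward to this base URL directly
    if "X-Upstream-Base" in headers:
        return headers["X-Upstream-Base"]
    # 2. API key prefix
    if api_key.startswith("sk-ant-"):
        return "anthropic"
    if api_key.startswith("sk-or-"):
        return "openrouter"
    # 3. Model name
    if model.startswith("claude-"):
        return "anthropic"
    if model.startswith("gemini-"):
        return "google"
    # 4. Default
    return "openai"
```

Earlier rules always win, so an explicit `X-Upstream-Base` header bypasses key- and model-based detection entirely.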
Supported Providers
The proxy can forward to any supported upstream provider, including OpenAI, Anthropic, Google, and OpenRouter.
Evaluation Flow
When you run your app through the Checkstack proxy, every LLM call goes through a 7-phase evaluation pipeline. Your app is never slowed down — shadow testing happens asynchronously after the original response is returned.
Phase 1: Interception
When your app sends a request to localhost:3456/v1/chat/completions, the proxy:
- Detects the upstream provider from the API key prefix (e.g. sk-ant- → Anthropic) or model name (e.g. claude-* → Anthropic)
- Forwards immediately to the real provider — your app receives its normal response at full speed
- Captures the response (streaming or buffered) along with the original request payload
- Fires a background shadow request to Checkstack's server — this is async and fire-and-forget, so if Checkstack is unreachable, your app is unaffected
The shadow request includes the session ID, original model name, messages (full conversation history including images), tool definitions, the original response (text + tool calls), and optionally a response_format (JSON schema). Your API keys are never sent to Checkstack — they are only forwarded to the upstream provider.
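The payload described above could look roughly like the following. Field names and values here are illustrative guesses based on the fields the docs list; the real wire format is not documented:

```python
# Hypothetical shape of the shadow request body (field names are ours).
shadow_payload = {
    "session_id": "sess_abc123",      # groups calls into one dashboard run
    "original_model": "gpt-4o",
    "messages": [                     # full conversation history, incl. images
        {"role": "user", "content": "Hello"},
    ],
    "tools": [],                      # tool definitions, if any
    "original_response": {            # text + tool calls from the real provider
        "text": "Hi! How can I help?",
        "tool_calls": [],
    },
    "response_format": None,          # optional JSON schema
}

# Note what is absent: the upstream API key is never part of this payload.
assert "api_key" not in shadow_payload
```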
Phase 2: Shadow Dispatch
On the server, Checkstack authenticates the request, verifies your balance, then converts the OpenAI-format messages and tools into the AI SDK's internal format. This conversion preserves:
- Multi-turn conversation history with proper role attribution
- Tool call / tool result chains with call ID linking
- Image content in user messages (base64 and URL)
- Image content in tool results (e.g. browser screenshots)
- System prompts and structured output schemas
If the primary conversion fails (some models have non-standard formats), a sanitized fallback flattens tool calls into plain text prose — so every shadow model sees a usable conversation regardless of tool support. For example:
Before (native tool calls):

{ "role": "assistant",
  "tool_calls": [{
    "id": "call_abc",
    "function": {
      "name": "search_web",
      "arguments": "{\"q\": \"weather NYC\"}" }}]}
{ "role": "tool",
  "tool_call_id": "call_abc",
  "content": "72°F, sunny" }

After (sanitized fallback):

{ "role": "assistant",
  "content": "Called search_web({\"q\": \"weather NYC\"})" }
{ "role": "user",
  "content": "Tool result for search_web: 72°F, sunny" }

Phase 3: Model Selection
Checkstack picks 2–3 cheaper shadow models to compare against your original. You can configure these via checkstack configure or let Checkstack auto-select from the default pool:
| Model | Provider |
|---|---|
| Gemini 2.0 Flash Lite | Google |
| Gemini 2.0 Flash | Google |
| Mistral Small | Mistral |
| Claude 3.5 Haiku | Anthropic |
| GPT-4o Mini | OpenAI |
The auto-selector excludes any model that matches the original (you won't shadow-test GPT-4o against itself). All remaining models from the pool run in parallel via the AI Gateway.
If a shadow model doesn't support tool calling, Checkstack automatically retries with the sanitized (tool-free) message format. When this happens, the response includes a tool_fallback: true flag so you know that model was evaluated without native tool support.
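The auto-selector's exclusion rule can be sketched in a few lines. The pool entries and exact-match comparison are simplifications; the real selector may normalise model IDs differently:

```python
DEFAULT_POOL = [
    "gemini-2.0-flash-lite",
    "gemini-2.0-flash",
    "mistral-small",
    "claude-3.5-haiku",
    "gpt-4o-mini",
]

def select_shadow_models(original_model, pool=DEFAULT_POOL):
    """Sketch of the auto-selector: drop any pool entry matching the
    original, so a model is never shadow-tested against itself."""
    return [m for m in pool if m != original_model]
```

Everything that survives the filter is then dispatched in parallel.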
Phase 4: Comparison Logic
Each shadow response is compared against the original using a multi-stage pipeline. The path depends on the response type:
- Classify response types — both text, both tool calls, or mixed (one text, one tool call)
- Both tool calls → direct comparison: same function names + same args = match; same functions but string arg differences = close → embed the differing args; different functions = miss → LLM judge
- Both text → embed both responses using text-embedding-3-small, compute cosine similarity
- Mixed type (e.g. original used a tool call, shadow responded with text) → sent to the LLM judge with full tool definitions so it can verify tool calls are real (not hallucinated), plus explicit guidance that text-only responses are equally valid
| Cosine Similarity | Result |
|---|---|
| ≥ 0.85 | Match — responses are semantically equivalent |
| 0.70 – 0.85 | Borderline — escalated to LLM judge |
| < 0.70 | Mismatch — responses diverge |
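The text-comparison thresholds in the table above reduce to a cosine similarity over the two embedding vectors plus a three-way cut. A minimal sketch (the embedding call itself is omitted; only the math and thresholds come from the docs):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def classify(similarity):
    """Apply the documented text-comparison thresholds."""
    if similarity >= 0.85:
        return "match"       # semantically equivalent
    if similarity >= 0.70:
        return "borderline"  # escalated to the LLM judge
    return "mismatch"        # responses diverge
```

Only the borderline band costs a judge call; clear matches and mismatches are settled by embeddings alone.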
Phase 5: Structured Output Evaluation
When your request includes a response_format with a JSON schema, Checkstack uses deterministic field-level comparison instead of embedding the full response:
- Parse shadow response as JSON (mark as invalid if it fails)
- Flatten both objects to dot-path keys (e.g. address.city, items[0].name)
- Compare each field:
  - Booleans and numbers → exact match (score 1.0 or 0.0)
  - Strings → batched into a single embedMany() call, scored by cosine similarity
- Borderline strings (similarity 0.65–0.80) → escalated to the LLM judge for a semantic equivalence check
- Final accuracy = (total field score / total fields) × 100. Shadow matches if accuracy ≥ 90%
| String Similarity | Field Score |
|---|---|
| ≥ 0.80 | 1.0 (match) |
| 0.65 – 0.80 | LLM judge decides |
| < 0.65 | 0.0 (mismatch) |
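The dot-path flattening and the final accuracy formula can be sketched directly from the description above (this is an illustration of the documented behavior, not the real implementation):

```python
def flatten(obj, prefix=""):
    """Flatten nested JSON into dot-path keys; list items become [i]."""
    out = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            path = f"{prefix}.{key}" if prefix else key
            out.update(flatten(value, path))
    elif isinstance(obj, list):
        for i, value in enumerate(obj):
            out.update(flatten(value, f"{prefix}[{i}]"))
    else:
        out[prefix] = obj
    return out

def accuracy(field_scores):
    """Final accuracy = (total field score / total fields) x 100;
    the shadow matches if accuracy >= 90."""
    pct = sum(field_scores) / len(field_scores) * 100
    return pct, pct >= 90
```

For example, `flatten({"address": {"city": "NYC"}})` yields `{"address.city": "NYC"}`, and each flattened field then contributes one score to the accuracy average.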
Phase 6: LLM Judge
When embedding similarity falls in a borderline zone, or when response types are mixed, Checkstack escalates to an LLM judge (Claude Sonnet 4.6, temperature 0) for a definitive ruling.
The judge receives:
- A conversation summary (last user message + truncated history, capped at 6,000 characters)
- The original response and shadow response (text or tool call details)
- Available tool definitions (stripped for mixed-type comparisons to avoid tool bias)
The judge is asked: “Is the shadow response a valid next step for this conversation?” — it evaluates independently whether the shadow's approach would satisfy the user, rather than requiring an exact match with the original.
| Scenario | Judge Behavior |
|---|---|
| Borderline text | Full context including tool descriptions |
| Mixed type | Full context with tool definitions — verifies tool calls are real, but text-only responses are not penalized |
| Tool call miss | Evaluates whether different tool calls achieve the same intent |
| Structured field | Field-level semantic equivalence check |
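Assembling the judge's input might look like the sketch below. The prompt wording is entirely ours (hypothetical); only the 6,000-character cap on conversation context and the final question come from the docs:

```python
MAX_SUMMARY_CHARS = 6000  # documented cap on conversation context

def build_judge_prompt(last_user_msg, history, original, shadow):
    """Hypothetical assembly of the judge input."""
    # Last user message first, then truncated history, capped at 6,000 chars
    summary = (last_user_msg + "\n" + history)[:MAX_SUMMARY_CHARS]
    return (
        f"Conversation:\n{summary}\n\n"
        f"Original response: {original}\n"
        f"Shadow response: {shadow}\n\n"
        "Is the shadow response a valid next step for this conversation?"
    )
```

The key property is the framing of the question: the judge scores the shadow's answer on its own merits rather than on similarity to the original.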
Phase 7: Results & Billing
After comparison completes, Checkstack:
- Deducts the cost of shadow generation + embedding + judge calls from your wallet (zero markup — exact provider cost)
- Calculates savings — finds the cheapest matching shadow and computes the percentage you'd save by switching
- Returns a receipt to the CLI, which prints a summary in your terminal showing each shadow model's match status, cost, and latency
- Persists results asynchronously — dataset rows and per-model results are written to your dashboard after the response is returned
All requests within a single checkstack start session are grouped into one run on your dashboard, so you can browse through every intercepted call with its comparison results.
┌────────────────────────────────────────────┐
│ Shadow Eval #14                            │
├────────────────────────────────────────────┤
│ Original: gpt-4o        $0.0032     820ms  │
├────────────────────────────────────────────┤
│ ✓ Gemini Flash Lite     $0.0001     156ms  │
│ ✓ Gemini Flash          $0.0003     203ms  │
│ ✗ Mistral Small         $0.0004     312ms  │
├────────────────────────────────────────────┤
│ Overhead: $0.0001       Savings: 97%       │
│ Balance: $11.82                            │
└────────────────────────────────────────────┘
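The savings figure on the receipt follows from the "cheapest matching shadow" rule described above. A sketch of that arithmetic, using the receipt's own numbers (the function name and result shape are ours):

```python
def savings_percent(original_cost, shadow_results):
    """Percentage saved by switching to the cheapest *matching* shadow.
    Non-matching shadows are ignored; no match means no claimed savings."""
    matching_costs = [r["cost"] for r in shadow_results if r["match"]]
    if not matching_costs:
        return 0
    cheapest = min(matching_costs)
    return round((original_cost - cheapest) / original_cost * 100)
```

With the receipt's numbers (original $0.0032, cheapest matching shadow $0.0001), this gives (0.0032 − 0.0001) / 0.0032 ≈ 97%.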
Pricing
14-day free trial with $1 of evaluation credit included. No credit card required. Full access to CLI proxy, dashboard, and all 164+ models.
Purchase credit as a one-time top-up ($10, $25, or $50). We pass the exact token cost to you - zero markup, ever. Credit expires 1 year after purchase.