EvanCore — a lab for AI agent harnesses

The lab · evan compare

Two configs. One task. The judge decides.

Here, two variants differ only by model. An LLM judge scores both on your stated criteria. The cost and tokens are real; the config delta is computed from the specs.

$ evan compare variant_a.yaml variant_b.yaml --trigger "Write a 3-sentence pitch…" --criteria "punchy, concrete, no buzzwords"

Variant A · variant_a.yaml

7.0/10

cost: $0.0020

model: haiku

in: 10 tok

out: 509 tok

Variant B · variant_b.yaml ◀ winner

9.0/10

cost: $0.0023

model: sonnet

in: 3 tok

out: 152 tok

Config delta (A → B)

~ model.model_id: 'haiku' → 'sonnet'
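
A delta line like the one above can be produced by flattening both specs into dotted paths and comparing them key by key. The sketch below is only an illustration of that idea, not EvanCore's engine: `flatten` and `config_delta` are hypothetical names, the `+` and `-` markers for added and removed keys are assumed, and the specs are assumed to be nested mappings already loaded from the YAML files.

```python
def flatten(spec, prefix=""):
    """Flatten a nested dict into {'a.b.c': value} pairs."""
    flat = {}
    for key, value in spec.items():
        path = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def config_delta(a, b):
    """Return one line per dotted path that differs between spec a and spec b."""
    fa, fb = flatten(a), flatten(b)
    lines = []
    for path in sorted(fa.keys() | fb.keys()):
        if path not in fb:
            lines.append(f"- {path}: {fa[path]!r}")      # only in A (assumed marker)
        elif path not in fa:
            lines.append(f"+ {path}: {fb[path]!r}")      # only in B (assumed marker)
        elif fa[path] != fb[path]:
            lines.append(f"~ {path}: {fa[path]!r} → {fb[path]!r}")
    return lines


# Hypothetical spec contents; only model.model_id is taken from the example above.
variant_a = {"model": {"model_id": "haiku"}}
variant_b = {"model": {"model_id": "sonnet"}}
print("\n".join(config_delta(variant_a, variant_b)))
```

Because the delta is derived from the specs alone, it can be shown before any model call is made.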

🏆 Variant B wins — 9.0 vs 7.0

B opens with a sharper pain point, the 'last Tuesday' detail is concrete, and "git for the part of your stack that lives in someone's head or a Notion doc" lands harder than A's generic closer.

Ship it: evan release variant_b.yaml

Runs anywhere

Real Claude, no API key.

EvanCore harnesses run on the model you already have. Point a config at the local claude CLI and compare real Claude models with zero setup — your prompts and data never leave your machine.

claude-code

Routes through your local claude CLI using existing auth. No API key needed.

ollama

Fully offline. Compare local models, $0 per run.

anthropic

Raw API when you want exact per-token cost accounting.

openai

Any OpenAI-compatible endpoint, same spec shape.

The spine beneath the lab

A comparison you can trust, a winner you can ship.

An A/B test is only worth it if the thing you scored is the thing you ship. Every config is a content-addressed, signed, reversible artifact — so "B won" becomes "B is shipped," and stays reversible.

content-addressed specs
blake2b-64 hashing — the exact config you scored is the exact config you ship. No drift.
overlays
Author A/B/C variants without copy-pasting whole specs.
release · rollback · upgrade · promote
Ship the winner; every release has an inverse; move it across environments.
signed & verifiable
Ed25519 signatures + trust + revocation — distribute configs others can verify.
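
The first two items above can be sketched with the Python standard library: an overlay is deep-merged onto a base spec, and the result is hashed from a canonical encoding so the same content always yields the same ID. This is a sketch under stated assumptions, not EvanCore's actual scheme: `apply_overlay` and `spec_hash` are hypothetical names, canonical JSON is an assumed encoding, "blake2b-64" is read here as BLAKE2b with an 8-byte digest, and the `h-` prefix is borrowed from the sample output.

```python
import copy
import hashlib
import json


def apply_overlay(base, overlay):
    """Deep-merge overlay onto a copy of base; overlay values win."""
    merged = copy.deepcopy(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = apply_overlay(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def spec_hash(spec):
    """Hash a canonical JSON encoding with BLAKE2b using an 8-byte (64-bit) digest."""
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return "h-" + hashlib.blake2b(canonical, digest_size=8).hexdigest()


# Hypothetical base spec; max_tokens is an invented field for illustration.
base = {"model": {"model_id": "haiku", "max_tokens": 512}}
variant_b = apply_overlay(base, {"model": {"model_id": "sonnet"}})

print(spec_hash(base), spec_hash(variant_b))
```

Sorting keys before hashing means two specs with the same content but different key order get the same ID, which is what lets the scored config and the shipped config be compared by hash alone.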

$ evan release variant_b.yaml --sign alice@acme

📦 SPEC_RELEASED h-abc123f7d9e2

✍ SPEC_SIGNED alice@acme

🚀 HARNESS_DEPLOYED app

$ evan rollback h-abc123 --yes

↩ HARNESS_ROLLED_BACK
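
One way to read "every release has an inverse" is that each release records both the spec it replaced and the spec it deployed, so a rollback is just another logged release with the two swapped. The sketch below assumes that reading; `Release` and `Harness` are hypothetical names and do not describe EvanCore's internals.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Release:
    previous: Optional[str]  # spec hash deployed before this release, if any
    current: str             # spec hash deployed by this release

    def inverse(self) -> "Release":
        """Return the release that undoes this one."""
        if self.previous is None:
            raise ValueError("first release has nothing to roll back to")
        return Release(previous=self.current, current=self.previous)


class Harness:
    def __init__(self) -> None:
        self.deployed: Optional[str] = None
        self.log: List[Release] = []

    def _apply(self, release: Release) -> Release:
        self.deployed = release.current
        self.log.append(release)
        return release

    def release(self, spec_hash: str) -> Release:
        return self._apply(Release(self.deployed, spec_hash))

    def rollback(self) -> Release:
        """Apply the inverse of the latest release; the rollback is logged too."""
        if not self.log:
            raise ValueError("nothing has been released")
        return self._apply(self.log[-1].inverse())


app = Harness()
app.release("h-aaa")   # placeholder hashes
app.release("h-bbb")
app.rollback()
print(app.deployed)
```

Since a rollback is stored as an ordinary release, it can itself be reversed, and the log keeps a full record of what was deployed when.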

Not a tracing tool

Tracing tools watch the run. EvanCore version-controls the config — and tells you which one to keep.

not this → An agent framework. No runtime DSL, no orchestration primitives, no LangGraph / CrewAI overlap.

not this → A hosting platform. You still choose where your agents run.

not this → A dashboard you paste prompts into. The lab runs in your repo and CI; configs stay local.

Stop guessing
which config is better.

EvanCore is in private beta. The design, spec schema, and evaluation log are open. The engine is closed.

Request access · Read the design ↗
