A lab for AI agent harnesses

Change a config.
Know if it got better.

You tune an agent — a prompt, a tool, a memory setting — and you don't actually know if it improved. EvanCore runs two harness configs on the same task, scores the outcomes, and shows which won — output, diff, cost, verdict. Then the winner ships with one command.

$ evan compare a.yaml b.yaml
runs with no API key · 633 tests passing · public design · private core
The lab · evan compare
Two configs. One task. The judge decides.

Here, two variants differ only by model. An LLM judge scores both on your stated criteria. The cost and tokens are real; the config delta is computed from the specs.

$ evan compare variant_a.yaml variant_b.yaml --trigger "Write a 3-sentence pitch…" --criteria "punchy, concrete, no buzzwords"
Variant A · variant_a.yaml
score: 7.0/10
cost: $0.0020
model: haiku
in: 10 tok
out: 509 tok
Variant B · variant_b.yaml ◀ winner
score: 9.0/10
cost: $0.0023
model: sonnet
in: 3 tok
out: 152 tok
Config delta (A → B)
~ model.model_id: 'haiku' → 'sonnet'
🏆 Variant B wins — 9.0 vs 7.0
B opens with a sharper pain point, the 'last Tuesday' detail is concrete, and "git for the part of your stack that lives in someone's head or a Notion doc" lands harder than A's generic closer.
Ship it: evan release variant_b.yaml
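
One way to picture how a config delta can be computed from two specs is sketched below. This is only an illustration, not EvanCore's implementation: the function names `flatten` and `config_delta` and the `prompt` field are hypothetical, and only the `model.model_id` path comes from the example above. The sketch flattens each spec to dotted paths and reports what was added, removed, or changed.

```python
from typing import Any


def flatten(spec: dict[str, Any], prefix: str = "") -> dict[str, Any]:
    """Flatten nested dicts into dotted paths, e.g. model.model_id."""
    flat: dict[str, Any] = {}
    for key, value in spec.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def config_delta(a: dict[str, Any], b: dict[str, Any]) -> list[str]:
    """Describe the changes from spec a to spec b (hypothetical helper)."""
    flat_a, flat_b = flatten(a), flatten(b)
    lines: list[str] = []
    for path in sorted(flat_a.keys() | flat_b.keys()):
        if path not in flat_b:
            lines.append(f"- {path}: {flat_a[path]!r}")      # removed in B
        elif path not in flat_a:
            lines.append(f"+ {path}: {flat_b[path]!r}")      # added in B
        elif flat_a[path] != flat_b[path]:
            lines.append(f"~ {path}: {flat_a[path]!r} → {flat_b[path]!r}")
    return lines


# Assumed spec shapes: only model.model_id appears in the page's example.
variant_a = {"model": {"model_id": "haiku"}, "prompt": "Write a pitch"}
variant_b = {"model": {"model_id": "sonnet"}, "prompt": "Write a pitch"}

print("\n".join(config_delta(variant_a, variant_b)))
# prints: ~ model.model_id: 'haiku' → 'sonnet'
```

Because the delta is derived from the specs themselves rather than typed by hand, a one-line result like this is evidence that the two variants differ only in the field under test.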
Runs anywhere
Real Claude, no API key.

EvanCore harnesses run on the model you already have. Point a config at the local claude CLI and compare real Claude models with zero setup. Configs stay in your repo, and model calls go through the CLI's existing auth rather than a separate API key.

claude-code (no API key)
Routes through your local claude CLI using existing auth.
ollama
Fully offline. Compare local models, $0 per run.
anthropic
Raw API when you want exact per-token cost accounting.
openai
Any OpenAI-compatible endpoint, same spec shape.
The spine beneath the lab
A comparison you can trust, a winner you can ship.

An A/B test is only worth it if the thing you scored is the thing you ship. Every config is a content-addressed, signed artifact, so "B won" becomes "B is shipped" and can still be rolled back.

  • content-addressed specs

    blake2b-64 hashing — the exact config you scored is the exact config you ship. No drift.

  • overlays

    Author A/B/C variants without copy-pasting whole specs.

  • release · rollback · upgrade · promote

    Ship the winner; every release has an inverse; move it across environments.

  • signed & verifiable

    Ed25519 signatures + trust + revocation — distribute configs others can verify.
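
The first two items above, overlays and content addressing, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the engine's code: `apply_overlay` and `spec_address` are hypothetical names, the spec fields other than `model.model_id` are invented for the example, canonical JSON is an assumed serialization, and "blake2b-64" is read here as a BLAKE2b digest truncated to 64 bits. The `h-` prefix follows the release log shown on this page, but the exact address length and encoding EvanCore uses are not specified here.

```python
import copy
import hashlib
import json
from typing import Any


def apply_overlay(base: dict[str, Any], overlay: dict[str, Any]) -> dict[str, Any]:
    """Return a new spec where overlay values win and nested dicts merge."""
    merged = copy.deepcopy(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = apply_overlay(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def spec_address(spec: dict[str, Any]) -> str:
    """Content address of a spec (assumed scheme: canonical JSON + BLAKE2b)."""
    # Sorted keys and fixed separators make the bytes independent of key order.
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":")).encode("utf-8")
    digest = hashlib.blake2b(canonical, digest_size=8).hexdigest()  # 8 bytes = 64 bits
    return f"h-{digest}"


# Hypothetical base spec; variant B is authored as a one-field overlay.
base = {"model": {"model_id": "haiku", "temperature": 0.2}, "tools": ["search"]}
variant_b = apply_overlay(base, {"model": {"model_id": "sonnet"}})

print(variant_b["model"])                               # untouched fields carry over
print(spec_address(base) != spec_address(variant_b))    # different content, different address
```

The point of the design is that the address depends only on the content: the same spec hashes to the same address regardless of key order, and any change, however small, produces a new address, so the artifact that was scored can be matched exactly against the artifact that is released.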

$ evan release variant_b.yaml --sign alice@acme
📦 SPEC_RELEASED h-abc123f7d9e2
✍ SPEC_SIGNED alice@acme
🚀 HARNESS_DEPLOYED app
$ evan rollback h-abc123 --yes
↩ HARNESS_ROLLED_BACK
Not a tracing tool

Tracing tools watch the run. EvanCore version-controls the config — and tells you which one to keep.

not this →

An agent framework. No runtime DSL, no orchestration primitives, no LangGraph / CrewAI overlap.

not this →

A hosting platform. You still choose where your agents run.

not this →

A dashboard you paste prompts into. The lab runs in your repo and CI; configs stay local.

Stop guessing
which config is better.

EvanCore is in private beta. The design, spec schema, and evaluation log are open. The engine is closed.