You tune an agent — a prompt, a tool, a memory setting — and you don't actually know if it improved. EvanCore runs two harness configs on the same task, scores the outcomes, and shows which won — output, diff, cost, verdict. Then the winner ships with one command.
Here, two variants differ only by model. An LLM judge scores both on your stated criteria. The cost and tokens are real; the config delta is computed from the specs.
EvanCore harnesses run on the model you already have. Point a config at the local claude CLI and compare real Claude models with zero setup — your prompts and data never leave your machine.
An A/B test is only worth it if the thing you scored is the thing you ship. Every config is a content-addressed, signed, reversible artifact — so "B won" becomes "B is shipped," and stays reversible.
blake2b-64 hashing — the exact config you scored is the exact config you ship. No drift.
Author A/B/C variants without copy-pasting whole specs.
Ship the winner; every release has an inverse; move it across environments.
Ed25519 signatures + trust + revocation — distribute configs others can verify.
Tracing tools watch the run. EvanCore version-controls the config — and tells you which one to keep.
An agent framework. No runtime DSL, no orchestration primitives, no LangGraph / CrewAI overlap.
A hosting platform. You still choose where your agents run.
A dashboard you paste prompts into. The lab runs in your repo and CI; configs stay local.
EvanCore is in private beta. The design, spec schema, and evaluation log are open. The engine is closed.