Policy Engine Supremacy
A human-authored policy layer retains final authority. Generated guardrails predict validity; business rules decide.
Open-Source Research Project
Stop relying on prompt engineering for safety. AutoHarness validates AI agent actions against human-authored policy engines. Manual harnesses catch invalid actions today; LLM-powered synthesis of validation code from failure traces is on the roadmap.
Based on AutoHarness (arXiv:2603.03329), applied here to enterprise AI workflows.
The Governance Stack
LLMs are excellent at writing code, but terrible at enforcing rules. AutoHarness separates generation from execution. The LLM synthesizes harness code; a human-authored policy engine gives the final ruling. Deterministic reproducibility is guaranteed through seeded execution and config hashing.
LLM Agent proposes action
def evaluate(action, state):
    return {
        "rejected": violates_harness_rules(action, state),
        "reason": "generated harness evaluates action",
    }
Final Authority
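The separation described above can be sketched as a two-stage check. This is a minimal illustration, not the project's actual API: `Verdict`, `harness_predicts_invalid`, `policy_engine`, and `decide` are hypothetical names, and the self-approval rule stands in for a real business rule. The harness only predicts validity; anything it lets through must still pass the policy engine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    allowed: bool
    source: str  # which layer produced the ruling
    reason: str


def harness_predicts_invalid(action: dict, state: dict) -> bool:
    """Hypothetical generated guardrail: a cheap prediction, not a ruling."""
    return action.get("type") not in state.get("known_action_types", ())


def policy_engine(action: dict, state: dict) -> Verdict:
    """Hypothetical human-authored rule: nobody approves their own expense."""
    is_self_approval = (
        action.get("type") == "approve"
        and action.get("actor") == state.get("submitter")
    )
    if is_self_approval:
        return Verdict(False, "policy", "self-approval is not permitted")
    return Verdict(True, "policy", "no policy rule violated")


def decide(action: dict, state: dict) -> Verdict:
    # Stage 1: the harness filters actions it predicts are invalid.
    if harness_predicts_invalid(action, state):
        return Verdict(False, "harness", "harness predicted an invalid action")
    # Stage 2: the policy engine rules on everything the harness lets through.
    return policy_engine(action, state)
```

In this sketch a harness pass can never override a policy denial, because `decide` returns the policy engine's verdict unchanged for every action that reaches it.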
The Proof
Across expense approval, customer support, and software deployment workflows, a noisy agent produces the failure corpus. A scripted oracle represents the deterministic upper bound. The policy engine remains authoritative throughout, catching self-approvals, unauthorized refunds, SLA violations, production-freeze violations, and unauthorized rollbacks.
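One way to picture how a noisy agent yields a failure corpus: sample actions with a seeded random generator, run each through a policy check, and record every denial as a structured record. Everything in this sketch is an assumption made for illustration. The `Counterexample` fields, the `refund_policy` rule with its 100.0 limit, and the `collect_failures` helper are hypothetical and are not the project's data model.

```python
from __future__ import annotations

import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Counterexample:
    step: int
    action: dict
    reason: str


def refund_policy(action: dict) -> str | None:
    """Hypothetical rule: refunds above the limit require a supervisor."""
    over_limit = action["type"] == "refund" and action["amount"] > 100.0
    if over_limit and action["role"] != "supervisor":
        return "unauthorized refund"
    return None


def collect_failures(seed: int, steps: int) -> list[Counterexample]:
    # A local generator avoids global state: the same seed gives the same corpus.
    rng = random.Random(seed)
    corpus = []
    for step in range(steps):
        action = {
            "type": rng.choice(["refund", "reply"]),
            "amount": round(rng.uniform(0.0, 200.0), 2),
            "role": rng.choice(["agent", "supervisor"]),
        }
        reason = refund_policy(action)
        if reason is not None:
            # Only policy denials are kept; allowed actions are not recorded.
            corpus.append(Counterexample(step, action, reason))
    return corpus
```

Calling `collect_failures(7, 50)` twice returns equal lists, which is the property a reproducible failure corpus depends on.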
Benchmark Results: expense-approval (test)
| Condition | Success Rate | Invalid Rate | Policy Denial | Composite Score | Actions |
|---|---|---|---|---|---|
| Noisy Agent (no harness) | 0.0% | 100.0% | 0.0% | -0.500 | 180 |
| Noisy Agent + manual harness | 0.0% | 100.0% | 0.0% | -0.500 | 180 |
| Scripted Agent (deterministic baseline) | 100.0% | 0.0% | 0.0% | 1.000 | 9 |
Benchmark Results: support-ticket (test)
| Condition | Success Rate | Invalid Rate | Policy Denial | Composite Score | Actions |
|---|---|---|---|---|---|
| Noisy Agent (no harness) | 77.8% | 19.1% | 73.0% | 0.609 | 89 |
| Noisy Agent + manual harness | 77.8% | 92.1% | 0.0% | 0.317 | 89 |
| Scripted Agent (deterministic baseline) | 33.3% | 0.0% | 97.6% | 0.236 | 123 |
Benchmark Results: deployment (test)
| Condition | Success Rate | Invalid Rate | Policy Denial | Composite Score | Actions |
|---|---|---|---|---|---|
| Noisy Agent (no harness) | 88.9% | 12.9% | 61.3% | 0.763 | 31 |
| Noisy Agent + manual harness | 88.9% | 29.0% | 45.2% | -0.574 | 31 |
| Scripted Agent (deterministic baseline) | 88.9% | 0.0% | 71.4% | 0.817 | 28 |
Feature Grid
A human-authored policy layer retains final authority. Generated guardrails predict validity; business rules decide.
Every experiment is reproducible from config, seed, and code hash. No drifting behavior between runs.
Execution failures are captured as structured counterexamples, the raw material for synthesis. The counterexample model is defined; the extraction pipeline is on the roadmap.
LLM-powered generation of validation harness code from counterexamples. AST validation and sandbox execution on the roadmap.
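The reproducibility claim in this grid (config, seed, and code hash) could be realized roughly as follows. This is a sketch under stated assumptions: `run_fingerprint` and `seeded_trace` are hypothetical helpers, and SHA-256 over canonical JSON is one reasonable choice, not necessarily what the project uses.

```python
import hashlib
import json
import random


def run_fingerprint(config: dict, seed: int, code_hash: str) -> str:
    """Hash the three inputs that determine a run into one identifier."""
    payload = json.dumps(
        {"config": config, "seed": seed, "code_hash": code_hash},
        sort_keys=True,         # key order must not change the hash
        separators=(",", ":"),  # fixed separators give a canonical encoding
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def seeded_trace(seed: int, n: int) -> list:
    """Stand-in for a seeded run: the same seed yields the same sequence."""
    rng = random.Random(seed)
    return [rng.randint(0, 9) for _ in range(n)]
```

Two runs with the same fingerprint are expected to produce the same trace; changing the seed, any config value, or the code hash produces a different fingerprint.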
Enterprise Workflows
CLI Demo
$ uv run autoharness compare -e expense-approval -c no-harness,manual,scripted -d test
> Loaded 9 scenarios from scenarios/expense-approval/test.jsonl
>
> Comparison: expense-approval (test)
> ┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━┓
> ┃            ┃              ┃              ┃ Policy      ┃           ┃         ┃
> ┃ Condition  ┃ Success Rate ┃ Invalid Rate ┃ Denial      ┃ Composite ┃ Actions ┃
> ┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━┩
> │ no-harness │ 0.0%         │ 100.0%       │ 0.0%        │ -0.500    │ 180     │
> │ manual     │ 0.0%         │ 100.0%       │ 0.0%        │ -0.500    │ 180     │
> │ scripted   │ 100.0%       │ 0.0%         │ 0.0%        │ 1.000     │ 9       │
> └────────────┴──────────────┴──────────────┴─────────────┴───────────┴─────────┘
$ uv run autoharness compare -e support-ticket -c no-harness,manual,scripted -d test
> Loaded 9 scenarios from scenarios/support-ticket/test.jsonl
>
> Comparison: support-ticket (test)
> ┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━┓
> ┃            ┃              ┃              ┃ Policy      ┃           ┃         ┃
> ┃ Condition  ┃ Success Rate ┃ Invalid Rate ┃ Denial      ┃ Composite ┃ Actions ┃
> ┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━┩
> │ no-harness │ 77.8%        │ 19.1%        │ 73.0%       │ 0.609     │ 89      │
> │ manual     │ 77.8%        │ 92.1%        │ 0.0%        │ 0.317     │ 89      │
> │ scripted   │ 33.3%        │ 0.0%         │ 97.6%       │ 0.236     │ 123     │
> └────────────┴──────────────┴──────────────┴─────────────┴───────────┴─────────┘
$ uv run autoharness compare -e deployment -c no-harness,manual,scripted -d test
> Loaded 9 scenarios from scenarios/deployment/test.jsonl
>
> Comparison: deployment (test)
> ┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━┓
> ┃            ┃              ┃              ┃ Policy      ┃           ┃         ┃
> ┃ Condition  ┃ Success Rate ┃ Invalid Rate ┃ Denial      ┃ Composite ┃ Actions ┃
> ┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━┩
> │ no-harness │ 88.9%        │ 12.9%        │ 61.3%       │ 0.763     │ 31      │
> │ manual     │ 88.9%        │ 29.0%        │ 45.2%       │ -0.574    │ 31      │
> │ scripted   │ 88.9%        │ 0.0%         │ 71.4%       │ 0.817     │ 28      │
> └────────────┴──────────────┴──────────────┴─────────────┴───────────┴─────────┘
Final CTA