Open-Source Research Project

Deterministic Guardrails for Enterprise AI.

Stop relying on prompt engineering for safety. AutoHarness validates AI agent actions against human-authored policy engines. Manual harnesses catch invalid actions today; LLM-powered synthesis of validation code from failure traces is on the roadmap.

Based on AutoHarness (arXiv:2603.03329), applied here to enterprise AI workflows.

166 Tests, all passing
3 Environments, all ready

The Governance Stack

Generated code. Not generated decisions.

LLMs are excellent at writing code but unreliable at enforcing rules. AutoHarness separates generation from execution: harness code is what an LLM will synthesize (today it is written by hand), and a human-authored policy engine gives the final ruling. Runs are deterministically reproducible through seeded execution and config hashing.

Generation Layer
The Agent

LLM Agent proposes action

Execution Boundary
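def violates_harness_rules(action, state):
    # Placeholder check (assumption): reject action types the state does not allow.
    return action["type"] not in state["allowed_actions"]

def evaluate(action, state):
    # The harness only predicts validity; the policy engine rules afterwards.
    return {
        "rejected": violates_harness_rules(action, state),
        "reason": "generated harness evaluates action",
    }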
Authority Layer
Policy Engine

Final Authority
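
The three layers above can be read as a small pipeline. The sketch below is one minimal way to wire them together, not the project's actual code: `harness_evaluate`, `policy_engine`, `run_action`, and the action and state fields are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    allowed: bool
    decided_by: str  # "harness" or "policy"
    reason: str


def harness_evaluate(action: dict, state: dict) -> dict:
    # Hypothetical harness rule: the action type must be known to the environment.
    rejected = action["type"] not in state["allowed_actions"]
    return {"rejected": rejected, "reason": "unknown action type" if rejected else "ok"}


def policy_engine(action: dict, state: dict) -> tuple[bool, str]:
    # Hypothetical human-authored rule: nobody approves their own request.
    if action["type"] == "approve" and action["actor"] == state["requester"]:
        return False, "self-approval"
    return True, "ok"


def run_action(action: dict, state: dict) -> Verdict:
    # The harness filters early, but only the policy engine can allow an action.
    result = harness_evaluate(action, state)
    if result["rejected"]:
        return Verdict(False, "harness", result["reason"])
    allowed, reason = policy_engine(action, state)
    return Verdict(allowed, "policy", reason)


state = {"allowed_actions": {"submit", "approve", "reject"}, "requester": "alice"}
print(run_action({"type": "approve", "actor": "alice"}, state))
print(run_action({"type": "approve", "actor": "bob"}, state))
print(run_action({"type": "delete", "actor": "bob"}, state))
```

In this arrangement the harness can only reject early. An action is allowed only when the policy engine says so, which is the sense in which the policy engine holds final authority.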

The Proof

Benchmarking the Gap.

Across expense approval, customer support, and software deployment workflows, a noisy agent produces the failure corpus. A scripted oracle represents the deterministic upper bound. The policy engine remains authoritative throughout, catching self-approvals, unauthorized refunds, SLA violations, production freezes, and unauthorized rollbacks.

Benchmark Results

Expense Approval (test set, 9 scenarios)

Condition                                Success Rate  Invalid Rate  Policy Denial  Composite Score  Actions
Noisy Agent (no harness)                 0.0%          100.0%        0.0%           -0.500           180
Noisy Agent + manual harness             0.0%          100.0%        0.0%           -0.500           180
Scripted Agent (deterministic baseline)  100.0%        0.0%          0.0%           1.000            9

The noisy agent produces 100% invalid actions, with or without the manual harness: the baseline chaos that motivates this project. The scripted oracle achieves 100% success.

Customer Support (test set, 9 scenarios)

Condition                                Success Rate  Invalid Rate  Policy Denial  Composite Score  Actions
Noisy Agent (no harness)                 77.8%         19.1%         73.0%          0.609            89
Noisy Agent + manual harness             77.8%         92.1%         0.0%           0.317            89
Scripted Agent (deterministic baseline)  33.3%         0.0%          97.6%          0.236            123

Without a harness, 73.0% of actions hit policy denial (self-assignment, unauthorized refunds, critical resolution). With the manual harness, 92.1% of actions are blocked before they reach the policy engine, so invalid operations are caught earlier in the pipeline. The gap: the remaining 7.9% of actions pass the harness, and any that still fail downstream are the exact surface synthesis targets.
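
The percentages in this table can be sanity-checked with simple arithmetic. The sketch below assumes that the success rate is the fraction of the 9 test scenarios completed and that the harness pass-through is the complement of the harness invalid rate. Both readings are assumptions, not documented definitions, and `percent` is a hypothetical helper.

```python
def percent(part: float, whole: float) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


# Assumption: a 77.8% success rate corresponds to 7 of the 9 test scenarios.
success_rate = percent(7, 9)

# Share of actions the manual harness lets through: 100% minus its 92.1% invalid rate.
pass_through = round(100.0 - 92.1, 1)

print(success_rate, pass_through)
```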

Software Deployment (test set, 9 scenarios)

Condition                                Success Rate  Invalid Rate  Policy Denial  Composite Score  Actions
Noisy Agent (no harness)                 88.9%         12.9%         61.3%          0.763            31
Noisy Agent + manual harness             88.9%         29.0%         45.2%          -0.574           31
Scripted Agent (deterministic baseline)  88.9%         0.0%          71.4%          0.817            28

The manual harness degrades performance here: it blocks valid deployment approvals (29.0% invalid rate vs 12.9% without it), dropping the composite score from 0.763 to -0.574. This is a real harness bug: production approval checks are incorrectly rejecting legitimate workflows. The scripted oracle achieves zero invalid actions but still hits 71.4% policy denial from production freezes and approval gates.

Feature Grid

Failure-Driven Refinement.

Policy Engine Supremacy

A human-authored policy layer retains final authority. Generated guardrails predict validity; business rules decide.

Deterministic Reproducibility

Every experiment is reproducible from config, seed, and code hash. No drifting behavior between runs.
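
One way to make "reproducible from config, seed, and code hash" concrete is to fold the three into a single run fingerprint and to draw all randomness from a seeded generator. The sketch below is a minimal illustration under those assumptions. `run_fingerprint`, `seeded_actions`, and the config keys are hypothetical and are not taken from the project.

```python
import hashlib
import json
import random


def run_fingerprint(config: dict, seed: int, code_hash: str) -> str:
    # Canonical JSON (sorted keys, fixed separators) so equal configs hash equally.
    payload = json.dumps(
        {"config": config, "seed": seed, "code_hash": code_hash},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def seeded_actions(seed: int, actions: list[str], n: int) -> list[str]:
    # A private Random instance keeps the run independent of global RNG state.
    rng = random.Random(seed)
    return [rng.choice(actions) for _ in range(n)]


config_a = {"environment": "expense-approval", "dataset": "test"}
config_b = {"dataset": "test", "environment": "expense-approval"}  # same content, other key order

print(run_fingerprint(config_a, 7, "abc123") == run_fingerprint(config_b, 7, "abc123"))
print(seeded_actions(7, ["submit", "approve", "reject"], 5)
      == seeded_actions(7, ["submit", "approve", "reject"], 5))
```

Both comparisons print True: the fingerprint ignores key order, and the same seed yields the same action sequence.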

Counterexample Extraction (planned)

Execution failures are captured as structured counterexamples, the raw material for synthesis. The data model is defined; the extraction pipeline is on the roadmap.
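
As a rough picture of what a structured counterexample could hold, here is a minimal sketch. The `Counterexample` class, its field names, and the JSON Lines serialization are assumptions made for illustration, not the project's actual model.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Counterexample:
    environment: str   # e.g. "deployment"
    action: dict       # the action the agent proposed
    state: dict        # environment state when it was proposed
    failure_kind: str  # hypothetical labels: "invalid_action" or "policy_denial"
    reason: str


def to_jsonl_line(example: Counterexample) -> str:
    # One JSON object per line keeps a failure corpus easy to append to and to diff.
    return json.dumps(asdict(example), sort_keys=True)


example = Counterexample(
    environment="deployment",
    action={"type": "rollback", "actor": "bob"},
    state={"stage": "production", "freeze": True},
    failure_kind="policy_denial",
    reason="production freeze",
)
line = to_jsonl_line(example)
print(line)
```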

Harness Synthesis (planned)

LLM-powered generation of validation harness code from counterexamples. AST validation and sandbox execution on the roadmap.
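
AST validation of generated harness code could start with a static pass like the one below. This is a sketch under stated assumptions, not the planned implementation: the rule set (a top-level `evaluate` function, no imports, no calls to a few dangerous builtins) and the name `validate_harness_source` are hypothetical. A static check of this kind complements sandbox execution; it does not replace it.

```python
import ast

FORBIDDEN_NODES = (ast.Import, ast.ImportFrom, ast.Global, ast.Nonlocal)
FORBIDDEN_NAMES = {"eval", "exec", "open", "compile", "__import__"}


def validate_harness_source(source: str) -> list[str]:
    """Return a list of problems; an empty list means these checks passed."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg}"]

    problems = []
    top_level = [n.name for n in tree.body if isinstance(n, ast.FunctionDef)]
    if "evaluate" not in top_level:
        problems.append("missing top-level evaluate function")

    for node in ast.walk(tree):
        if isinstance(node, FORBIDDEN_NODES):
            problems.append(f"forbidden statement: {type(node).__name__}")
        elif isinstance(node, ast.Name) and node.id in FORBIDDEN_NAMES:
            problems.append(f"forbidden name: {node.id}")
    return problems


good = "def evaluate(action, state):\n    return {'rejected': False, 'reason': 'ok'}\n"
bad = "import os\ndef evaluate(action, state):\n    return eval('1')\n"
print(validate_harness_source(good))
print(validate_harness_source(bad))
```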

Enterprise Workflows

Built for High-Stakes Environments.

Expense Approval

Actions

submit, request_receipt, approve, reject, escalate

Catches

self-approvals, out-of-policy amounts, missing receipts
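
The three expense checks listed above could be expressed as a small rule function. The sketch below is illustrative only: the field names (`actor`, `submitted_by`, `amount`, `has_receipt`) and the 500.0 limit are assumptions, not the project's policy.

```python
def expense_policy_violations(action: dict, expense: dict, limit: float = 500.0) -> list[str]:
    """Return the policy violations an approval would trigger (empty if none)."""
    violations = []
    if action["type"] != "approve":
        return violations  # this sketch only constrains approvals
    if action["actor"] == expense["submitted_by"]:
        violations.append("self-approval")
    if expense["amount"] > limit:  # the limit is a hypothetical policy threshold
        violations.append("out-of-policy amount")
    if not expense["has_receipt"]:
        violations.append("missing receipt")
    return violations


expense = {"submitted_by": "alice", "amount": 820.0, "has_receipt": False}
print(expense_policy_violations({"type": "approve", "actor": "alice"}, expense))
print(expense_policy_violations({"type": "escalate", "actor": "alice"}, expense))
```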

Customer Support

Actions

assign, prioritize, resolve, refund, escalate

Catches

self-assignment, unauthorized refunds, critical resolution

Software Deployment

Actions

create, approve, start, cancel, rollback

Catches

self-approvals, production freezes, unauthorized rollbacks
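
The three environments above can be summarized as a lookup table. The sketch below copies the action and catch lists from this page. The dictionary layout and `is_known_action` are hypothetical, and the keys assume the environment names used by the CLI (`expense-approval`, `support-ticket`, `deployment`).

```python
ENVIRONMENTS = {
    "expense-approval": {
        "actions": ("submit", "request_receipt", "approve", "reject", "escalate"),
        "catches": ("self-approvals", "out-of-policy amounts", "missing receipts"),
    },
    "support-ticket": {
        "actions": ("assign", "prioritize", "resolve", "refund", "escalate"),
        "catches": ("self-assignment", "unauthorized refunds", "critical resolution"),
    },
    "deployment": {
        "actions": ("create", "approve", "start", "cancel", "rollback"),
        "catches": ("self-approvals", "production freezes", "unauthorized rollbacks"),
    },
}


def is_known_action(environment: str, action_type: str) -> bool:
    # A cheap first check: is the action type part of this environment at all?
    return action_type in ENVIRONMENTS[environment]["actions"]


print(is_known_action("deployment", "rollback"))
print(is_known_action("deployment", "refund"))
```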

CLI Demo

Try It Yourself.

Terminal: expense-approval
$ uv run autoharness compare -e expense-approval -c no-harness,manual,scripted -d test

> Loaded 9 scenarios from scenarios/expense-approval/test.jsonl
>
>                       Comparison: expense-approval (test)
> ┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━┓
> ┃            ┃              ┃              ┃ Policy      ┃           ┃         ┃
> ┃ Condition  ┃ Success Rate ┃ Invalid Rate ┃ Denial      ┃ Composite ┃ Actions ┃
> ┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━┩
> │ no-harness │ 0.0%         │ 100.0%       │ 0.0%        │ -0.500    │ 180     │
> │ manual     │ 0.0%         │ 100.0%       │ 0.0%        │ -0.500    │ 180     │
> │ scripted   │ 100.0%       │ 0.0%         │ 0.0%        │ 1.000     │ 9       │
> └────────────┴──────────────┴──────────────┴─────────────┴───────────┴─────────┘
Terminal: support-ticket
$ uv run autoharness compare -e support-ticket -c no-harness,manual,scripted -d test

> Loaded 9 scenarios from scenarios/support-ticket/test.jsonl
>
>                        Comparison: support-ticket (test)
> ┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━┓
> ┃            ┃              ┃              ┃ Policy      ┃           ┃         ┃
> ┃ Condition  ┃ Success Rate ┃ Invalid Rate ┃ Denial      ┃ Composite ┃ Actions ┃
> ┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━┩
> │ no-harness │ 77.8%        │ 19.1%        │ 73.0%       │ 0.609     │ 89      │
> │ manual     │ 77.8%        │ 92.1%        │ 0.0%        │ 0.317     │ 89      │
> │ scripted   │ 33.3%        │ 0.0%         │ 97.6%       │ 0.236     │ 123     │
> └────────────┴──────────────┴──────────────┴─────────────┴───────────┴─────────┘
Terminal: deployment
$ uv run autoharness compare -e deployment -c no-harness,manual,scripted -d test

> Loaded 9 scenarios from scenarios/deployment/test.jsonl
>
>                          Comparison: deployment (test)
> ┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━┓
> ┃            ┃              ┃              ┃ Policy      ┃           ┃         ┃
> ┃ Condition  ┃ Success Rate ┃ Invalid Rate ┃ Denial      ┃ Composite ┃ Actions ┃
> ┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━┩
> │ no-harness │ 88.9%        │ 12.9%        │ 61.3%       │ 0.763     │ 31      │
> │ manual     │ 88.9%        │ 29.0%        │ 45.2%       │ -0.574    │ 31      │
> │ scripted   │ 88.9%        │ 0.0%         │ 71.4%       │ 0.817     │ 28      │
> └────────────┴──────────────┴──────────────┴─────────────┴───────────┴─────────┘
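
To run all three comparisons in one go, the commands above can be assembled from Python. The sketch below only builds and prints the command lines, using the flags exactly as they appear in the transcripts. `compare_command` is a hypothetical helper, not part of the autoharness package.

```python
import shlex

ENVIRONMENTS = ["expense-approval", "support-ticket", "deployment"]
CONDITIONS = ["no-harness", "manual", "scripted"]


def compare_command(environment: str, dataset: str = "test") -> list[str]:
    # Mirrors: uv run autoharness compare -e <env> -c <conditions> -d <dataset>
    return [
        "uv", "run", "autoharness", "compare",
        "-e", environment,
        "-c", ",".join(CONDITIONS),
        "-d", dataset,
    ]


for env in ENVIRONMENTS:
    print(shlex.join(compare_command(env)))
```

Passing each list to `subprocess.run(..., check=True)` from the project root would execute the comparison instead of printing it.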

Final CTA

Ready to put guardrails on your AI agents?