Open-Source Research Project

Deterministic Guardrails for Enterprise AI.

Stop relying on prompt engineering for safety. AutoHarness validates AI agent actions against human-authored policy engines. Manual harnesses catch invalid actions today; LLM-powered synthesis of validation code from failure traces is on the roadmap.

View on GitHub Read the Research Paper

Based on AutoHarness (arXiv:2603.03329) — applied here to enterprise AI workflows.

166 Tests, all passing

3 Environments, all ready

The Governance Stack

Generated code. Not generated decisions.

LLMs are excellent at writing code but terrible at enforcing rules. AutoHarness separates generation from execution: harness code (hand-written today, LLM-synthesized on the roadmap) predicts whether an action is valid, and a human-authored policy engine gives the final ruling. Runs are reproducible through seeded execution and config hashing.

Generation Layer

The Agent

LLM Agent proposes action

Execution Boundary

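def evaluate(action, state):
    # violates_harness_rules is supplied by the harness (hand-written today).
    # The verdict is advisory; the policy engine still gives the final ruling.
    return {
        "rejected": violates_harness_rules(action, state),
        "reason": "generated harness evaluates action",
    }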

Authority Layer

Policy Engine

Final Authority
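
The flow above can be sketched end to end. This is a minimal illustration, not AutoHarness's actual API: `harness_evaluate`, `policy_engine`, `decide`, and the shape of the action and state dictionaries are all hypothetical names chosen for the example.

```python
Action = dict
State = dict
Verdict = dict


def harness_evaluate(action: Action, state: State) -> Verdict:
    # Hypothetical harness rule: only action types listed in the state are valid.
    rejected = action["type"] not in state["allowed_actions"]
    reason = "harness: unknown action" if rejected else "harness: ok"
    return {"rejected": rejected, "reason": reason}


def policy_engine(action: Action, state: State) -> Verdict:
    # Hypothetical human-authored rule: nobody approves their own request.
    if action["type"] == "approve" and action["actor"] == state["requester"]:
        return {"rejected": True, "reason": "policy: self-approval"}
    return {"rejected": False, "reason": "policy: ok"}


def decide(action: Action, state: State) -> Verdict:
    # The harness may veto early, but it can never approve on its own:
    # anything it lets through is still ruled on by the policy engine.
    verdict = harness_evaluate(action, state)
    if verdict["rejected"]:
        return verdict
    return policy_engine(action, state)
```

In this sketch an action executes only when both layers accept it, so a faulty harness can block too much but cannot let a policy violation through.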

The Proof

Benchmarking the Gap.

Across expense approval, customer support, and software deployment workflows, a noisy agent produces the failure corpus, and a scripted agent provides the deterministic baseline. The policy engine remains authoritative throughout, catching self-approvals, unauthorized refunds, SLA violations, production freezes, and unauthorized rollbacks.

Benchmark Results

Expense Approval (test set, 9 scenarios)

Condition	Success Rate	Invalid Rate	Policy Denial	Composite Score	Actions
Noisy Agent (no harness)	0.0%	100.0%	0.0%	−0.500	180
Noisy Agent + manual harness	0.0%	100.0%	0.0%	−0.500	180
Scripted Agent (deterministic baseline)	100.0%	0.0%	0.0%	1.000	9

Customer Support (test set, 9 scenarios)

Condition	Success Rate	Invalid Rate	Policy Denial	Composite Score	Actions
Noisy Agent (no harness)	77.8%	19.1%	73.0%	0.609	89
Noisy Agent + manual harness	77.8%	92.1%	0.0%	0.317	89
Scripted Agent (deterministic baseline)	33.3%	0.0%	97.6%	0.236	123

Software Deployment (test set, 9 scenarios)

Condition	Success Rate	Invalid Rate	Policy Denial	Composite Score	Actions
Noisy Agent (no harness)	88.9%	12.9%	61.3%	0.763	31
Noisy Agent + manual harness	88.9%	29.0%	45.2%	−0.574	31
Scripted Agent (deterministic baseline)	88.9%	0.0%	71.4%	0.817	28
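
As a reading aid: the percentages in these tables are consistent with Invalid Rate and Policy Denial being fractions of all actions taken, and Success Rate being a fraction of the 9 scenarios. The short check below uses the Customer Support rows. The raw counts (17, 65, 82, 7) are inferred from the published percentages and are an assumption, not figures taken from the project. The Composite Score formula is not stated on this page, so it is not reproduced here.

```python
def pct(count: int, total: int) -> str:
    """Format count/total as a percentage with one decimal place."""
    return f"{100 * count / total:.1f}%"


ACTIONS = 89    # Customer Support, noisy agent (from the table)
SCENARIOS = 9   # test set size (from the table)

# (invalid actions, policy denials) per condition; counts are inferred.
inferred_counts = {
    "no harness": (17, 65),
    "manual harness": (82, 0),
}

for condition, (invalid, denied) in inferred_counts.items():
    print(condition, pct(invalid, ACTIONS), pct(denied, ACTIONS))

print("success", pct(7, SCENARIOS))
```

Under these inferred counts, 82 of 89 actions are blocked in both conditions. The manual harness appears to change where an action is stopped (at the harness rather than at the policy engine), not how many are stopped.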

Feature Grid

Failure-Driven Refinement.

Policy Engine Supremacy

A human-authored policy layer retains final authority. Generated guardrails predict validity; business rules decide.

Deterministic Reproducibility

Every experiment is reproducible from config, seed, and code hash. No drifting behavior between runs.
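
One way to pin an experiment to its config, seed, and code hash is a single fingerprint over all three. The sketch below is an assumption about how that could look, not the project's implementation; `run_fingerprint` and `noisy_choices` are hypothetical names.

```python
import hashlib
import json
import random


def run_fingerprint(config: dict, seed: int, code_hash: str) -> str:
    """Hash the three inputs that define an experiment."""
    # sort_keys gives a canonical JSON form, so key order cannot change the hash.
    payload = json.dumps(
        {"config": config, "seed": seed, "code": code_hash},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def noisy_choices(seed: int, actions: list[str], n: int) -> list[str]:
    """Draw n pseudo-random actions that depend only on the seed."""
    # A private Random instance keeps the run independent of global RNG state.
    rng = random.Random(seed)
    return [rng.choice(actions) for _ in range(n)]
```

Two runs with the same fingerprint should then produce the same action sequence; a changed fingerprint signals that config, seed, or code differs.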

Counterexample Extraction (planned)

Execution failures are captured as structured counterexamples, the raw material for synthesis. The counterexample model is defined; the extraction pipeline is on the roadmap.
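
To make "structured counterexample" concrete, here is one possible record shape. The field names are hypothetical and chosen for illustration; the project's actual model may differ.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Counterexample:
    # All field names are assumptions made for this sketch.
    environment: str   # e.g. "expense-approval"
    action: dict       # the action the agent proposed
    state: dict        # the workflow state at the time
    outcome: str       # e.g. "invalid" or "policy_denied"
    reason: str        # human-readable explanation of the failure


def to_jsonl_line(cx: Counterexample) -> str:
    """Serialize one counterexample as a single JSON line."""
    return json.dumps(asdict(cx), sort_keys=True)
```

A flat, serializable record like this keeps each failure self-contained, which is what a later synthesis step would need as input.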

Harness Synthesis (planned)

LLM-powered generation of validation harness code from counterexamples. AST validation and sandbox execution on the roadmap.
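
AST validation of generated code could look roughly like the following, using Python's standard `ast` module. The deny-lists and the requirement of a top-level `evaluate` function are assumptions for this sketch, not the project's planned rules.

```python
import ast

# Hypothetical deny-lists; a real policy would likely be stricter.
FORBIDDEN_NODES = (ast.Import, ast.ImportFrom, ast.Global, ast.Nonlocal)
FORBIDDEN_NAMES = {"eval", "exec", "open", "compile", "__import__"}


def validate_harness_source(source: str) -> list[str]:
    """Return a list of problems; an empty list means the source passed."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg}"]

    problems = []
    for node in ast.walk(tree):
        if isinstance(node, FORBIDDEN_NODES):
            problems.append(f"line {node.lineno}: {type(node).__name__} not allowed")
        elif isinstance(node, ast.Name) and node.id in FORBIDDEN_NAMES:
            problems.append(f"line {node.lineno}: use of {node.id!r} not allowed")

    # Require the entry point the execution boundary would call.
    has_entry = any(
        isinstance(node, ast.FunctionDef) and node.name == "evaluate"
        for node in tree.body
    )
    if not has_entry:
        problems.append("missing top-level function 'evaluate'")
    return problems
```

A static check like this only filters obviously unsafe source before it runs. It does not replace sandboxed execution, since name-based deny-lists can be bypassed.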

Enterprise Workflows

Built for High-Stakes Environments.

Expense Approval

Actions

submit, request_receipt, approve, reject, escalate

Catches

self-approvals, out-of-policy amounts, missing receipts
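
The three expense checks above could be written as plain, deterministic rules. This is a sketch under assumptions: the `Expense` fields, the `APPROVAL_LIMIT` threshold, and `check_approve` are hypothetical and not taken from the project.

```python
from dataclasses import dataclass


@dataclass
class Expense:
    submitter: str
    amount: float
    has_receipt: bool


APPROVAL_LIMIT = 1000.00  # hypothetical policy threshold


def check_approve(expense: Expense, approver: str) -> list[str]:
    """Return policy violations for an 'approve' action; empty means allowed."""
    violations = []
    if approver == expense.submitter:
        violations.append("self-approval")
    if expense.amount > APPROVAL_LIMIT:
        violations.append("out-of-policy amount")
    if not expense.has_receipt:
        violations.append("missing receipt")
    return violations
```

Returning every violation, rather than stopping at the first, gives a fuller failure trace for each denied action.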

Customer Support

Actions

assign, prioritize, resolve, refund, escalate

Catches

self-assignment, unauthorized refunds, critical resolution

Software Deployment

Actions

create, approve, start, cancel, rollback

Catches

self-approvals, production freezes, unauthorized rollbacks
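
Deployment rules add a time dimension (freeze windows) and a role dimension (who may roll back). The sketch below is illustrative only: the freeze dates, role names, and the `check_deployment_action` signature are hypothetical.

```python
from datetime import date

# Hypothetical policy data for the example.
FREEZE_WINDOWS = [(date(2025, 12, 20), date(2026, 1, 2))]
ROLLBACK_ROLES = {"release-manager", "sre"}


def check_deployment_action(
    *, action: str, actor: str, role: str, author: str, environment: str, today: date
) -> list[str]:
    """Return policy violations for a deployment action; empty means allowed."""
    violations = []
    if action == "approve" and actor == author:
        violations.append("self-approval")
    in_freeze = any(start <= today <= end for start, end in FREEZE_WINDOWS)
    if action == "start" and environment == "production" and in_freeze:
        violations.append("production freeze")
    if action == "rollback" and role not in ROLLBACK_ROLES:
        violations.append("unauthorized rollback")
    return violations
```

Passing `today` in as an argument, instead of reading the clock inside the function, keeps the rule deterministic and easy to replay.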

CLI Demo

Try It Yourself.

Terminal — expense-approval

$ uv run autoharness compare -e expense-approval -c no-harness,manual,scripted -d test

> Loaded 9 scenarios from scenarios/expense-approval/test.jsonl
>
>                       Comparison: expense-approval (test)
> ┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━┓
> ┃            ┃              ┃              ┃ Policy      ┃           ┃         ┃
> ┃ Condition  ┃ Success Rate ┃ Invalid Rate ┃ Denial      ┃ Composite ┃ Actions ┃
> ┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━┩
> │ no-harness │ 0.0%         │ 100.0%       │ 0.0%        │ -0.500    │ 180     │
> │ manual     │ 0.0%         │ 100.0%       │ 0.0%        │ -0.500    │ 180     │
> │ scripted   │ 100.0%       │ 0.0%         │ 0.0%        │ 1.000     │ 9       │
> └────────────┴──────────────┴──────────────┴─────────────┴───────────┴─────────┘

Terminal — support-ticket

$ uv run autoharness compare -e support-ticket -c no-harness,manual,scripted -d test

> Loaded 9 scenarios from scenarios/support-ticket/test.jsonl
>
>                        Comparison: support-ticket (test)
> ┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━┓
> ┃            ┃              ┃              ┃ Policy      ┃           ┃         ┃
> ┃ Condition  ┃ Success Rate ┃ Invalid Rate ┃ Denial      ┃ Composite ┃ Actions ┃
> ┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━┩
> │ no-harness │ 77.8%        │ 19.1%        │ 73.0%       │ 0.609     │ 89      │
> │ manual     │ 77.8%        │ 92.1%        │ 0.0%        │ 0.317     │ 89      │
> │ scripted   │ 33.3%        │ 0.0%         │ 97.6%       │ 0.236     │ 123     │
> └────────────┴──────────────┴──────────────┴─────────────┴───────────┴─────────┘

Terminal — deployment

$ uv run autoharness compare -e deployment -c no-harness,manual,scripted -d test

> Loaded 9 scenarios from scenarios/deployment/test.jsonl
>
>                          Comparison: deployment (test)
> ┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━┓
> ┃            ┃              ┃              ┃ Policy      ┃           ┃         ┃
> ┃ Condition  ┃ Success Rate ┃ Invalid Rate ┃ Denial      ┃ Composite ┃ Actions ┃
> ┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━┩
> │ no-harness │ 88.9%        │ 12.9%        │ 61.3%       │ 0.763     │ 31      │
> │ manual     │ 88.9%        │ 29.0%        │ 45.2%       │ -0.574    │ 31      │
> │ scripted   │ 88.9%        │ 0.0%         │ 71.4%       │ 0.817     │ 28      │
> └────────────┴──────────────┴──────────────┴─────────────┴───────────┴─────────┘
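
Each run above loads its scenarios from a `.jsonl` file, meaning one JSON object per line. A minimal reader for that format might look like this; the scenario schema is not shown on this page, so the sketch treats each line as an opaque dictionary, and `load_scenarios` is a hypothetical helper rather than part of the CLI.

```python
import json
from pathlib import Path


def load_scenarios(path: str) -> list[dict]:
    """Read one JSON object per non-empty line of a JSONL file."""
    scenarios = []
    with Path(path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                scenarios.append(json.loads(line))
            except json.JSONDecodeError as exc:
                # Point at the offending line so a bad scenario is easy to find.
                raise ValueError(f"{path}:{line_number}: {exc.msg}") from exc
    return scenarios
```

For example, `len(load_scenarios("scenarios/expense-approval/test.jsonl"))` would be expected to match the "Loaded 9 scenarios" line in the output above.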

Final CTA

Ready to put guardrails on your AI agents?

View on GitHub Read the Research Paper