Benchmark

AI agents get it wrong 50% of the time.

Using GPT-5.4

50% → 82% optimal decisions
Same cases. Same context.
Only the decision layer changed.

No fine-tuning. No extra data. No different model.

The question

Why not just let the model decide?
If it already has the context, tools, and instructions…

why add another layer?

We answered with data.

Primary result · Code execution · N=50

50% → 82%

Optimal action match. GPT-5.4. No training on the benchmark set.

.00

Protocol

The setup: a fair comparison

Controlled A/B benchmarks

  • Model: GPT-5.4
  • Same case context
  • Same available actions
  • Same workflow
  • No training on the dataset
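The benchmark harness is not published here; a minimal sketch of the A/B loop under these constraints, assuming a hypothetical `call_model` client and a per-case oracle label (both names are illustrative, not BIGHUB's API), might look like:

```python
# Hypothetical A/B loop: both arms share the model, case context, and
# action set; only the prompt-building function (the decision layer) differs.
from typing import Callable

def run_arm(cases: list[dict], build_prompt: Callable[[dict], str],
            call_model: Callable[[str], str]) -> float:
    """Return the fraction of cases whose chosen action matches the oracle."""
    hits = 0
    for case in cases:
        chosen = call_model(build_prompt(case))  # one model call per arm per trace
        hits += chosen == case["oracle_action"]
    return hits / len(cases)

def baseline_prompt(case: dict) -> str:
    return f"{case['context']}\nActions: {case['actions']}\nChoose one."

def packet_prompt(case: dict) -> str:
    # The decision layer adds structured signals on top of the same context.
    return baseline_prompt(case) + f"\nDecision packet: {case['packet']}"
```

The key property is that `run_arm` is identical across arms; only `build_prompt` varies.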

The only difference

BIGHUB adds a structured decision layer: signals, precedents, alternatives, regret.

No fine-tuning. No extra data. No different model. Just better decisions.
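The packet's exact schema is not given here; one hypothetical shape for the four components named above (signals, precedents, alternatives, regret), purely as an illustration:

```python
# Hypothetical decision-packet structure; field names are assumptions,
# not BIGHUB's actual schema.
from dataclasses import dataclass, field

@dataclass
class Alternative:
    action: str
    expected_outcome: str
    regret_if_wrong: float  # estimated cost of choosing this action and being wrong

@dataclass
class DecisionPacket:
    signals: dict = field(default_factory=dict)        # e.g. risk scores
    precedents: list = field(default_factory=list)     # similar past cases + outcomes
    alternatives: list = field(default_factory=list)   # Alternative entries

    def render(self) -> str:
        """Serialize the packet into prompt text the model can evaluate."""
        alts = "\n".join(
            f"- {a.action}: {a.expected_outcome} (regret {a.regret_if_wrong:.2f})"
            for a in self.alternatives
        )
        return (f"Signals: {self.signals}\n"
                f"Precedents: {self.precedents}\n"
                f"Alternatives:\n{alts}")
```

The point of `render` is the claim above: the packet turns tradeoffs into text the model can compare, rather than leaving them implicit.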

.01

B1 · Hero

Optimal action match

Baseline vs. runs where only the BIGHUB decision layer is added. Same model, same cases.

Beyond accuracy: outcome quality

Mean outcome score under the framework rubric.

50% Baseline
82% With BIGHUB

+32 percentage points · 64% fewer wrong decisions
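Both headline figures follow directly from the two match rates; as a quick check:

```python
# Derive the headline figures from the 50% -> 82% match rates.
baseline, packet = 0.50, 0.82

pp_gain = round((packet - baseline) * 100)           # percentage-point uplift
error_rate_drop = (1 - baseline) - (1 - packet)      # wrong-decision rate: 50% -> 18%
error_reduction = error_rate_drop / (1 - baseline)   # relative drop in errors

print(f"+{pp_gain} pp, {error_reduction:.0%} fewer wrong decisions")
# prints "+32 pp, 64% fewer wrong decisions"
```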

Changing only the decision layer significantly increases the rate of optimal decisions.

0.646 Baseline
0.884 With BIGHUB

+0.238 mean rubric uplift

The model does not just choose differently; its choices also score higher under the framework rubric.

.02

B2 · Slices

Where the gap widens

Performance by decision difficulty

Critical risk
+66.7 pp
High regret
+57.1 pp
Limited evidence
+33.3 pp
Suboptimal actions
+25.0 pp

The gap widens on ambiguous, high-risk decisions. The harder the decision, the bigger the gap.

On simple cases, models perform well. On difficult decisions, where tradeoffs matter, the decision structure becomes critical.

.03

B3 · Ablation

What drives the improvement

One model call per arm per trace · N=50

50% Baseline
46% Risk only
48% Reco only
50% Precedents
46% Regret only
84% Full packet

No single signal explains the gain; each isolated signal performs at or below baseline. The improvement comes from the full decision structure, not from any one component.

Adding isolated signals does not match full-packet performance. The gain comes from combining signals into a structured decision process.
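The arm construction is not shown; one hypothetical way to express the ablation arms is as component subsets of the full packet (component and arm names here are assumptions matching the labels above):

```python
# Hypothetical ablation arms: each arm exposes a subset of packet components;
# "full_packet" includes everything, "baseline" includes nothing.
COMPONENTS = ("risk", "reco", "precedents", "regret")

ARMS = {
    "baseline": (),
    "risk_only": ("risk",),
    "reco_only": ("reco",),
    "precedents_only": ("precedents",),
    "regret_only": ("regret",),
    "full_packet": COMPONENTS,
}

def ablated_prompt(context: str, packet: dict, arm: str) -> str:
    """Append only the arm's packet components to the shared case context."""
    parts = [f"{k}: {packet[k]}" for k in ARMS[arm] if k in packet]
    return context if not parts else context + "\n" + "\n".join(parts)
```

Every arm shares `context`; the only variable is which components reach the prompt.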

.04

B4 · Recovery

Failure recovery

When the baseline chooses the wrong action

25 baseline errors
  • Corrected by BIGHUB · 16
  • Not corrected · 9

Recovery rate: 64%

Main failure modes corrected:

  • Over-cautious decisions
  • Execution versus review tradeoffs

BIGHUB does not merely influence decisions; on this set it corrects 16 of 25 baseline mistakes, and most of the overall improvement comes from those corrections.

.05

B5 · Incidents

Second benchmark vertical · incident response

Incident next-step

Production-style incidents · N=20 · Oracle aligned with rubric

60% Baseline
80% With BIGHUB

+20 percentage points

Outcome score (mean): baseline 0.7285 · BIGHUB 0.79

Divergences: improved 4 · worsened 0 · neutral 0

On divergent traces, packet choices improved oracle match; none worsened relative to baseline.

The decision layer moves choices toward stronger corrective actions (rollback, escalate) and higher rubric alignment on this corpus.

.06

In practice

What changes in practice

Same stack, different decision behavior

Before

  • Model hesitates
  • Chooses safe or wrong action
  • Misses tradeoffs

With BIGHUB

  • Evaluates alternatives
  • Understands consequences
  • Chooses the best action

.07

Mechanism

Why this works

Models do not fail only from lack of capability. They fail when tradeoffs stay implicit, alternatives stay unstructured, and past outcomes are not surfaced in a comparable form.

BIGHUB turns decisions into something the model can evaluate, not only generate.

.08

Caveats

Limitations

  • Framework-aligned evaluation, not production logs
  • Small N (20–50 traces per vertical in these runs)
  • Model and API variance possible
  • Decision layer and alternatives use the same rubric ontology as the oracle

These numbers measure alignment inside a defined benchmark, not guaranteed real-world optimality.

.09

Close

Conclusion

The model already has the answer. It just chooses the wrong action.

The bottleneck is not the model. It is the decision structure.

With the same model and the same information, better framing yields better decisions under the framework.