B1 · Hero
Optimal action match
Baseline vs. runs where only the BIGHUB decision layer is added. Same model, same cases.
Beyond accuracy: outcome quality
Mean outcome score under the framework rubric.
+32 percentage points · 64% fewer wrong decisions
Changing only the decision layer substantially improves the rate of optimal decisions.
+0.238 mean rubric uplift
The model does not just choose differently; its choices score higher under the framework rubric.
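As a concrete sketch, the hero numbers above (accuracy delta in percentage points, reduction in wrong decisions, mean rubric uplift) reduce to simple arithmetic over per-case records. The records below are hypothetical stand-ins; real runs would load the benchmark traces.

```python
# Hypothetical per-case records: (baseline_correct, bighub_correct,
# baseline_rubric_score, bighub_rubric_score). Real runs load traces.
cases = [
    (False, True, 0.4, 0.8),
    (True,  True, 0.7, 0.9),
    (False, False, 0.3, 0.5),
    (False, True, 0.5, 0.9),
]

n = len(cases)
base_acc = sum(b for b, _, _, _ in cases) / n
pack_acc = sum(p for _, p, _, _ in cases) / n
delta_pp = (pack_acc - base_acc) * 100  # accuracy gain, percentage points

base_wrong = sum(not b for b, _, _, _ in cases)
pack_wrong = sum(not p for _, p, _, _ in cases)
fewer_wrong_pct = (base_wrong - pack_wrong) / base_wrong * 100  # % fewer wrong

mean_uplift = sum(ps - bs for _, _, bs, ps in cases) / n  # mean rubric uplift
```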
B2 · Slices
Where the gap widens
Performance by decision difficulty
The gap widens on ambiguous, high-risk decisions: the harder the decision, the larger the advantage.
On simple cases, both arms perform well. On difficult decisions, where tradeoffs matter, the decision structure becomes critical.
B3 · Ablation
What drives the improvement
One model call per arm per trace · N=50
No single signal explains the gain: removing any one component breaks the improvement.
Adding isolated signals back does not match full-packet performance; the gain comes from combining them into a structured decision process.
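The per-arm protocol above can be sketched as a loop: one model call per arm per trace, scored against the oracle. The arm names and the `run_arm` stub are hypothetical illustrations, not BIGHUB's actual API.

```python
# Ablation sketch: each arm is a packet variant with one signal removed
# (or a single signal in isolation). Arm names are hypothetical.
ARMS = ["full_packet", "no_alternatives", "no_outcomes", "signals_only"]

def run_arm(arm: str, trace: dict) -> str:
    # Placeholder for the single model call made per arm per trace.
    return trace["stub_answers"][arm]

def arm_accuracy(traces: list[dict]) -> dict[str, float]:
    # Fraction of traces where each arm's chosen action matches the oracle.
    return {
        arm: sum(run_arm(arm, t) == t["oracle"] for t in traces) / len(traces)
        for arm in ARMS
    }
```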
B4 · Recovery
Failure recovery
When the baseline chooses the wrong action
- Corrected by BIGHUB · 16
- Not corrected · 9
Total baseline errors: 25
Corrected by BIGHUB: 16
Recovery rate: 64%
Main failure modes corrected:
- Over-cautious decisions
- Execution versus review tradeoffs
BIGHUB does more than nudge decisions: on this set it corrects a large share of the baseline's mistakes, and most of the overall improvement comes from those corrections.
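The recovery rate above is straightforward arithmetic over the counts reported in this block:

```python
# Counts reported above for this benchmark set.
baseline_errors = 25
corrected_by_bighub = 16

recovery_rate = corrected_by_bighub / baseline_errors
# 16 / 25 = 0.64, i.e. the 64% recovery rate reported above.
```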
B5 · Incidents
Second benchmark vertical · incident response
Incident next-step
Production-style incidents · N=20 · Oracle aligned with rubric
+20 percentage points
Outcome score (mean): baseline 0.7285 · BIGHUB 0.79
Divergences: improved 4 · worsened 0 · neutral 0
On divergent traces, packet choices improved oracle match in every case; no regressions relative to baseline were observed.
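A minimal sketch of the divergence tally: for each trace where the two arms choose different actions, compare oracle match before and after. The trace tuples are hypothetical stand-ins that reproduce the reported split.

```python
def classify(baseline_hit: bool, bighub_hit: bool) -> str:
    # A divergent trace improves if BIGHUB matches the oracle where the
    # baseline did not, worsens in the opposite case, else is neutral.
    if bighub_hit and not baseline_hit:
        return "improved"
    if baseline_hit and not bighub_hit:
        return "worsened"
    return "neutral"

# (baseline_matches_oracle, bighub_matches_oracle) per divergent trace.
divergent = [(False, True)] * 4  # hypothetical; mirrors the reported N=4

tally = {"improved": 0, "worsened": 0, "neutral": 0}
for b, p in divergent:
    tally[classify(b, p)] += 1
```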
The decision layer moves choices toward stronger corrective actions (rollback, escalate) and higher rubric alignment on this corpus.
In practice
What changes in practice
Same stack, different decision behavior
Before
- Model hesitates
- Chooses safe or wrong action
- Misses tradeoffs
With BIGHUB
- Evaluates alternatives
- Understands consequences
- Chooses the best action
Mechanism
Why this works
Models do not fail only from lack of capability. They fail when tradeoffs stay implicit, alternatives stay unstructured, and past outcomes are not surfaced in a comparable form.
BIGHUB turns decisions into something the model can evaluate, not merely generate.
Caveats
Limitations
- Framework-aligned evaluation, not production logs
- Small N (20–50 traces per vertical in these runs)
- Model and API variance possible
- Decision layer and alternatives use the same rubric ontology as the oracle
These numbers measure alignment inside a defined benchmark, not guaranteed real-world optimality.
Close
Conclusion
The model already has the answer. It just chooses the wrong action.
The bottleneck is not the model. It is the decision structure.
With the same model and the same information, better framing yields better decisions under the framework.