41.11% → 73.14%
Average good decision rate across the full GPT-5.5 suite. A decision counts as good when it matches the benchmark-defined optimal action.
Executive summary
BIGHUB takes GPT-5.5 from 41.11% to 73.14% average good decision rate across the full suite (match to the benchmark-defined optimal action). Same model, same frozen traces, same rubric; arms differ only by inclusion of the BIGHUB decision packet in the input.
Full suite
Seven benchmark views: the IT matrix split into incident and helpdesk, plus incident coldstart, incident large, incident large coldstart, refunds, and refunds large.
IT (incident): 71.95% → 91.67% (+19.72 pp)
IT (helpdesk): 40.28% → 82.78% (+42.50 pp)
Coldstart: 71.39% → 85.56% (+14.17 pp)
Cardinality scaling: 44.17% → 86.67% (+42.50 pp)
High-cardinality coldstart: 44.45% → 75.55% (+31.11 pp)
Transfer: 11.95% → 47.50% (+35.55 pp)
Non-IT cardinality: 3.61% → 42.22% (+38.61 pp)
Proof
Same model family, same frozen benchmark traces, same rubric. Between arms, only whether the BIGHUB decision packet is included in the model input changes; the evaluation contract is otherwise fixed.
| Benchmark | Baseline GPT-5.5 | With BIGHUB | Uplift |
|---|---|---|---|
| IT incident | 71.95% | 91.67% | +19.72 pp |
| IT helpdesk | 40.28% | 82.78% | +42.50 pp |
| Incident coldstart | 71.39% | 85.56% | +14.17 pp |
| Incident large | 44.17% | 86.67% | +42.50 pp |
| Incident large coldstart | 44.45% | 75.55% | +31.11 pp |
| Refunds | 11.95% | 47.50% | +35.55 pp |
| Refunds large | 3.61% | 42.22% | +38.61 pp |
Good decision rate measures match to the benchmark-defined optimal action.
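The headline average and the Uplift column can be rechecked from the table itself. The sketch below assumes the headline is an unweighted mean over the seven views, which the page does not state explicitly; `ROWS`, `mean`, and the other names are illustrative, not part of any BIGHUB tooling.

```python
# Rows copied from the table above:
# (benchmark, baseline %, with-packet %, published uplift in pp).
ROWS = [
    ("IT incident", 71.95, 91.67, 19.72),
    ("IT helpdesk", 40.28, 82.78, 42.50),
    ("Incident coldstart", 71.39, 85.56, 14.17),
    ("Incident large", 44.17, 86.67, 42.50),
    ("Incident large coldstart", 44.45, 75.55, 31.11),
    ("Refunds", 11.95, 47.50, 35.55),
    ("Refunds large", 3.61, 42.22, 38.61),
]


def mean(values):
    """Plain arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


# Assumption: the headline figure is an unweighted mean over the seven views.
baseline_avg = mean(b for _, b, _, _ in ROWS)
packet_avg = mean(p for _, _, p, _ in ROWS)

# Uplift recomputed from the rounded table cells.
recomputed = {name: round(p - b, 2) for name, b, p, _ in ROWS}

# Rows where the recomputed uplift differs from the published one.
mismatches = {
    name: (recomputed[name], published)
    for name, _, _, published in ROWS
    if abs(recomputed[name] - published) > 0.005
}

print(f"average: {baseline_avg:.2f}% -> {packet_avg:.2f}%")
for name, (ours, published) in mismatches.items():
    print(f"{name}: recomputed {ours:.2f} pp, published {published:.2f} pp")
```

Under that assumption the averages come out at 41.11% and 73.14%, matching the headline. The recomputed uplift agrees with the published column on every row except Incident large coldstart, where the rounded cells give 31.10 pp against a published 31.11 pp; a 0.01 pp gap of this kind would be consistent with uplift being computed before rounding, though the page does not say so.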
Why this matters
BIGHUB is not only lifting weak baselines. On IT incident and incident coldstart, where GPT-5.5 already scores above 71% on its own, the packet still adds 19.72 pp and 14.17 pp.
What the suite shows
The uplift is broad: all seven views improve, by between +14.17 pp and +42.50 pp, across IT, coldstart, large-cardinality, and transfer benchmarks. It is not a single-benchmark trick.
Method
The suite keeps the same basic design across all benchmark families: frozen datasets, deterministic benchmark rubrics, paired baseline vs packet evaluation, and three seeds per benchmark. Good decision rate measures match to the benchmark-defined optimal action under the same trace context; uplift is the gain in that rate when the packet arm is used.
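The paired design can be sketched in a few lines. This is a minimal illustration, not the suite's actual harness: the function names, the action labels, and the toy data are invented, and averaging per-seed rates is an assumption, since the page does not say how the three seeds are aggregated.

```python
from statistics import mean


def good_decision_rate(chosen, optimal):
    """Percent of traces where the chosen action equals the rubric's optimal action."""
    if len(chosen) != len(optimal):
        raise ValueError("both arms must be scored on the same frozen traces")
    hits = sum(c == o for c, o in zip(chosen, optimal))
    return 100.0 * hits / len(optimal)


def paired_uplift(baseline_runs, packet_runs, optimal):
    """Average each arm over its seeds; return (baseline %, packet %, uplift in pp).

    Assumption: seeds are combined by a simple mean of per-seed rates.
    """
    base = mean(good_decision_rate(run, optimal) for run in baseline_runs)
    pack = mean(good_decision_rate(run, optimal) for run in packet_runs)
    return base, pack, pack - base


# Toy data, invented for illustration: 4 frozen traces, 3 seeds per arm.
optimal = ["escalate", "refund", "deny", "escalate"]
baseline_runs = [
    ["escalate", "deny", "deny", "refund"],
    ["refund", "refund", "deny", "refund"],
    ["escalate", "deny", "refund", "refund"],
]
packet_runs = [
    ["escalate", "refund", "deny", "escalate"],
    ["escalate", "refund", "deny", "refund"],
    ["escalate", "refund", "refund", "escalate"],
]

base, pack, uplift = paired_uplift(baseline_runs, packet_runs, optimal)
print(f"{base:.2f}% -> {pack:.2f}% ({uplift:+.2f} pp)")
```

Both arms are scored against the same `optimal` list, which mirrors the frozen-trace, fixed-rubric contract: the only thing that differs between `baseline_runs` and `packet_runs` is what the model saw as input.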
This page focuses on a single question: once you already have a very strong frontier model like GPT-5.5, how much extra decision quality do you unlock by adding BIGHUB on top?
Limits
These benchmarks measure alignment under a frozen authored contract, not guaranteed production business lift. The packet and the rubric share the same benchmark ontology by design; that is what makes the decision surfaces auditable, but it also means this is a framework-aligned evaluation rather than unconstrained production ground truth.
Outcome maps and scores are benchmark-authored, not learned from production logs.
API spend depends on current model pricing and does not include engineering overhead or downstream business cost.
Higher uplift can come from a weaker baseline. That is why this page reports both the uplift and the absolute good decision rate of the packet arm.
Temperature is fixed, but API-side nondeterminism and model revision drift can still affect exact cell outcomes over time.