Benchmark

AI agents get it wrong 50% of the time.

Using GPT-5.4

50% → 82% optimal decisions
Same cases. Same context.
Only the decision layer changed.

No fine-tuning. No extra data. No different model.

The question

Why not just let the model decide?
If it already has the context, tools, and instructions…

why add another layer?

We answered with data.

Primary result · Code execution · N=50

50% → 82%

Optimal action match. GPT-5.4. No training on the benchmark set.

.00

Protocol

The setup: a fair comparison

Controlled A/B benchmarks

  • Model: GPT-5.4
  • Same case context
  • Same available actions
  • Same workflow
  • No training on the dataset
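The benchmark harness is not published here; a minimal sketch of the A/B loop under these constraints, assuming a hypothetical `call_model` client and a per-case oracle label (both names are illustrative, not BIGHUB's API), might look like:

```python
# Hypothetical A/B loop: both arms share the model, case context, and
# action set; only the prompt-building function (the decision layer) differs.
from typing import Callable

def run_arm(cases: list[dict], build_prompt: Callable[[dict], str],
            call_model: Callable[[str], str]) -> float:
    """Return the fraction of cases whose chosen action matches the oracle."""
    hits = 0
    for case in cases:
        chosen = call_model(build_prompt(case))  # one model call per arm per trace
        hits += chosen == case["oracle_action"]
    return hits / len(cases)

def baseline_prompt(case: dict) -> str:
    return f"{case['context']}\nActions: {case['actions']}\nChoose one."

def packet_prompt(case: dict) -> str:
    # The decision layer adds structured signals on top of the same context.
    return baseline_prompt(case) + f"\nDecision packet: {case['packet']}"
```

The key property is that `run_arm` is identical across arms; only `build_prompt` varies.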

The only difference

BIGHUB adds a structured decision layer: signals, precedents, alternatives, regret.

No fine-tuning. No extra data. No different model. Just better decisions.
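The packet's exact schema is not given here; one hypothetical shape for the four components named above (signals, precedents, alternatives, regret), purely as an illustration:

```python
# Hypothetical decision-packet structure; field names are assumptions,
# not BIGHUB's actual schema.
from dataclasses import dataclass, field

@dataclass
class Alternative:
    action: str
    expected_outcome: str
    regret_if_wrong: float  # estimated cost of choosing this action and being wrong

@dataclass
class DecisionPacket:
    signals: dict = field(default_factory=dict)        # e.g. risk scores
    precedents: list = field(default_factory=list)     # similar past cases + outcomes
    alternatives: list = field(default_factory=list)   # Alternative entries

    def render(self) -> str:
        """Serialize the packet into prompt text the model can evaluate."""
        alts = "\n".join(
            f"- {a.action}: {a.expected_outcome} (regret {a.regret_if_wrong:.2f})"
            for a in self.alternatives
        )
        return (f"Signals: {self.signals}\n"
                f"Precedents: {self.precedents}\n"
                f"Alternatives:\n{alts}")
```

The point of `render` is the claim above: the packet turns tradeoffs into text the model can compare, rather than leaving them implicit.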

.01

B1 · Hero

Optimal action match

Baseline vs. runs where only the BIGHUB decision layer is added. Same model, same cases.

Beyond accuracy: outcome quality

Mean outcome score under the framework rubric.

50% Baseline
82% With BIGHUB

+32 percentage points · 64% fewer wrong decisions
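Both headline figures follow directly from the two match rates; as a quick check:

```python
# Derive the headline figures from the 50% -> 82% match rates.
baseline, packet = 0.50, 0.82

pp_gain = round((packet - baseline) * 100)           # percentage-point uplift
error_rate_drop = (1 - baseline) - (1 - packet)      # wrong-decision rate: 50% -> 18%
error_reduction = error_rate_drop / (1 - baseline)   # relative drop in errors

print(f"+{pp_gain} pp, {error_reduction:.0%} fewer wrong decisions")
# prints "+32 pp, 64% fewer wrong decisions"
```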

Changing only the decision layer significantly increases the rate of optimal decisions.

0.646 Baseline
0.884 With BIGHUB

+0.238 mean rubric uplift

The model does not just choose differently; its choices also score higher under the framework rubric.

.02

B2 · Slices

Where the gap widens

Performance by decision difficulty

Critical risk
+66.7 pp
High regret
+57.1 pp
Limited evidence
+33.3 pp
Suboptimal actions
+25.0 pp

The gap widens on ambiguous, high-risk decisions. The harder the decision, the bigger the gap.

On simple cases, models perform well. On difficult decisions, where tradeoffs matter, the decision structure becomes critical.

.03

B3 · Ablation

What drives the improvement

One model call per arm per trace · N=50

50% Baseline
46% Risk only
48% Reco only
50% Precedents
46% Regret only
84% Full packet

No single signal explains the gain; each isolated signal performs at or below baseline. The improvement comes from the full decision structure, not from any one component.

Adding isolated signals does not match full-packet performance. The gain comes from combining signals into a structured decision process.
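The arm construction is not shown; one hypothetical way to express the ablation arms is as component subsets of the full packet (component and arm names here are assumptions matching the labels above):

```python
# Hypothetical ablation arms: each arm exposes a subset of packet components;
# "full_packet" includes everything, "baseline" includes nothing.
COMPONENTS = ("risk", "reco", "precedents", "regret")

ARMS = {
    "baseline": (),
    "risk_only": ("risk",),
    "reco_only": ("reco",),
    "precedents_only": ("precedents",),
    "regret_only": ("regret",),
    "full_packet": COMPONENTS,
}

def ablated_prompt(context: str, packet: dict, arm: str) -> str:
    """Append only the arm's packet components to the shared case context."""
    parts = [f"{k}: {packet[k]}" for k in ARMS[arm] if k in packet]
    return context if not parts else context + "\n" + "\n".join(parts)
```

Every arm shares `context`; the only variable is which components reach the prompt.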

.04

B4 · Recovery

Failure recovery

When the baseline chooses the wrong action

25 baseline errors
  • Corrected by BIGHUB · 16
  • Not corrected · 9

Recovery rate: 64%

Main failure modes corrected:

  • Over-cautious decisions
  • Execution versus review tradeoffs

BIGHUB does not merely influence decisions; on this set it corrects 16 of 25 baseline mistakes, and most of the overall improvement comes from those corrections.

.05

B5 · Incidents

Second benchmark vertical · incident response

Incident next-step

Production-style incidents · N=20 · Oracle aligned with rubric

60% Baseline
80% With BIGHUB

+20 percentage points

Outcome score (mean): baseline 0.7285 · BIGHUB 0.79

Divergences: improved 4 · worsened 0 · neutral 0

On divergent traces, packet choices improved oracle match; none worsened relative to baseline.

The decision layer moves choices toward stronger corrective actions (rollback, escalate) and higher rubric alignment on this corpus.

.06

In practice

What changes in practice

Same stack, different decision behavior

Before

  • Model hesitates
  • Chooses safe or wrong action
  • Misses tradeoffs

With BIGHUB

  • Evaluates alternatives
  • Understands consequences
  • Chooses the best action

.07

Mechanism

Why this works

Models do not fail only from lack of capability. They fail when tradeoffs stay implicit, alternatives stay unstructured, and past outcomes are not surfaced in a comparable form.

BIGHUB turns decisions into something the model can evaluate, not only generate.

.08

Caveats

Limitations

  • Framework-aligned evaluation, not production logs
  • Small N (20–50 traces per vertical in these runs)
  • Model and API variance possible
  • Decision layer and alternatives use the same rubric ontology as the oracle

These numbers measure alignment inside a defined benchmark, not guaranteed real-world optimality.

.09

Close

Conclusion

The model already has the answer. It just chooses the wrong action.

The bottleneck is not the model. It is the decision structure.

With the same model and the same information, better framing yields better decisions under the framework.