Top models. Better decisions.

Benchmark suite · GPT-5.5 · April 2026

GPT-5.5 is already a top-tier model. BIGHUB still improves its decision quality on frozen operational workloads by structuring each action before execution.

Same GPT-5.5 model, same frozen traces, same benchmark rubric. Baseline and packet arms differ only by whether the BIGHUB decision packet is included in the model input.

21 cells · 2,520 labeled traces · 5,040 LLM calls

GPT-5.5 makes better decisions with BIGHUB

BIGHUB raises GPT-5.5's average good decision rate across the full suite from 41.11% to 73.14%. A good decision is one that matches the benchmark-defined optimal action. Same model, same frozen traces, same rubric; the arms differ only in whether the BIGHUB decision packet is included in the input.

Primary proof

41.11% → 73.14%

Average good decision rate across the full GPT-5.5 suite. A decision counts as good when it matches the benchmark-defined optimal action.

Average uplift: +32.02 pp
Best view: +42.50 pp
Consistency: 7 / 7 views positive

GPT-5.5 benchmark views

Seven benchmark views: the IT matrix split into incident and helpdesk, plus coldstart, incident large, incident large coldstart, refunds, and refunds large.

[Figure: Good decision rate by view, baseline GPT-5.5 versus with BIGHUB, across the seven benchmark views (IT incident, IT helpdesk, coldstart, incident large, incident large coldstart, refunds, refunds large). Vertical axis: good decision rate, 0% to 100%. BIGHUB is higher on every view.]

IT

Incident

71.95% → 91.67%

+19.72 pp

High-baseline workflow. BIGHUB still removes about 70% of the residual decision error, cutting the miss rate from 28.05% to 8.33%.

IT

Helpdesk

40.28% → 82.78%

+42.50 pp

One of the clearest wins in the suite. GPT-5.5 routing quality more than doubles with BIGHUB.

Coldstart

Incident coldstart

71.39% → 85.56%

+14.17 pp

Even with thinner precedent coverage, the packet still produces a clear positive lift.

Cardinality scaling

Incident large

44.17% → 86.67%

+42.50 pp

Large action space, same pattern: BIGHUB keeps GPT-5.5 highly aligned under more complex choice sets.

High-cardinality coldstart

Incident large coldstart

44.45% → 75.55%

+31.11 pp

The lift persists even when both uncertainty and action-space complexity increase together.

Transfer

Refunds

11.95% → 47.50%

+35.55 pp

Different vertical, same mechanism: BIGHUB still sharply improves GPT-5.5 transfer decisions.

Non-IT cardinality

Refunds large

3.61% → 42.22%

+38.61 pp

The strongest transfer stress test in the suite still shows a large absolute gain with BIGHUB.

Baseline GPT-5.5 vs GPT-5.5 with BIGHUB

Same GPT-5.5 model, same frozen benchmark traces, same rubric. The only difference between arms is whether the BIGHUB decision packet is included in the model input; the rest of the evaluation contract is fixed.

Benchmark | Baseline GPT-5.5 | With BIGHUB | Uplift
IT incident | 71.95% | 91.67% | +19.72 pp
IT helpdesk | 40.28% | 82.78% | +42.50 pp
Incident coldstart | 71.39% | 85.56% | +14.17 pp
Incident large | 44.17% | 86.67% | +42.50 pp
Incident large coldstart | 44.45% | 75.55% | +31.11 pp
Refunds | 11.95% | 47.50% | +35.55 pp
Refunds large | 3.61% | 42.22% | +38.61 pp

Good decision rate is the share of traces in which the model's chosen action matches the benchmark-defined optimal action.
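The headline figures can be rechecked from the per-view rates in the table. The following is a minimal sketch, assuming the suite averages are unweighted means over the seven views (plausible because every view has the same number of traces); the names `VIEWS` and `summarize` are illustrative and not part of the benchmark.

```python
# Per-view good decision rates in percent, copied from the table:
# (baseline GPT-5.5, with BIGHUB).
VIEWS = {
    "IT incident": (71.95, 91.67),
    "IT helpdesk": (40.28, 82.78),
    "Incident coldstart": (71.39, 85.56),
    "Incident large": (44.17, 86.67),
    "Incident large coldstart": (44.45, 75.55),
    "Refunds": (11.95, 47.50),
    "Refunds large": (3.61, 42.22),
}


def summarize(views):
    """Recompute the suite-level summary from per-view rates.

    Assumption: each view carries equal weight in the average.
    """
    n = len(views)
    baseline_avg = sum(base for base, _ in views.values()) / n
    packet_avg = sum(packet for _, packet in views.values()) / n
    uplifts = {name: packet - base for name, (base, packet) in views.items()}
    return {
        "baseline_avg": round(baseline_avg, 2),
        "packet_avg": round(packet_avg, 2),
        "avg_uplift": round(packet_avg - baseline_avg, 2),
        "best_uplift": round(max(uplifts.values()), 2),
        "positive_views": sum(1 for u in uplifts.values() if u > 0),
    }


if __name__ == "__main__":
    for key, value in summarize(VIEWS).items():
        print(f"{key}: {value}")
```

Under that assumption the sketch reproduces 41.11%, 73.14%, +32.02 pp, +42.50 pp, and 7 positive views. Subtracting the rounded rates for incident large coldstart gives 31.10 pp rather than the published +31.11 pp, which would be consistent with the published uplift having been computed from unrounded rates.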

Why this matters

BIGHUB is not just rescuing a weak baseline. It improves GPT-5.5 directly, even where the base model is already strong: on IT incident, GPT-5.5 alone scores 71.95% and BIGHUB still adds 19.72 pp.

What the suite shows

The uplift is broad across IT, coldstart, large-cardinality, and transfer benchmarks. It is not a single benchmark trick.

Frozen benchmark contract

The suite keeps the same basic design across all benchmark families: frozen datasets, deterministic benchmark rubrics, paired baseline vs packet evaluation, and three seeds per benchmark. Good decision rate is the share of traces in which the chosen action matches the benchmark-defined optimal action under the same trace context; uplift is the gain in that rate, in percentage points, when the packet arm is used.
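The two metrics defined above can be sketched in a few lines. This is an illustration only: the record type `TraceResult`, its field names, and the action labels in the usage example are hypothetical and do not reflect the benchmark's actual schema or data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceResult:
    """One frozen trace scored in both arms (hypothetical field names)."""
    optimal_action: str   # benchmark-defined optimal action for the trace
    baseline_action: str  # action chosen without the decision packet
    packet_action: str    # action chosen with the decision packet


def good_decision_rate(chosen, optimal):
    """Percent of traces where the chosen action equals the optimal action."""
    if not chosen or len(chosen) != len(optimal):
        raise ValueError("need equal-length, non-empty action lists")
    matches = sum(c == o for c, o in zip(chosen, optimal))
    return 100.0 * matches / len(chosen)


def paired_uplift(results):
    """Return (baseline rate, packet rate, uplift in percentage points).

    Both arms are scored against the same traces and the same optimal
    actions, so the comparison is paired.
    """
    optimal = [r.optimal_action for r in results]
    baseline = good_decision_rate([r.baseline_action for r in results], optimal)
    packet = good_decision_rate([r.packet_action for r in results], optimal)
    return baseline, packet, packet - baseline


if __name__ == "__main__":
    # Made-up toy traces, for illustration only.
    toy = [
        TraceResult("escalate", "escalate", "escalate"),
        TraceResult("restart", "escalate", "restart"),
        TraceResult("reroute", "close", "reroute"),
        TraceResult("close", "reroute", "escalate"),
    ]
    print(paired_uplift(toy))  # (25.0, 75.0, 50.0)
```

Scoring both arms against one shared list of optimal actions is what keeps the uplift attributable to the packet rather than to differences in the traces.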

This page focuses on a single question: once you already have a very strong frontier model like GPT-5.5, how much extra decision quality do you unlock by adding BIGHUB on top?

Common contract
  • 3 seeds per benchmark view
  • N = 120 traces per seed
  • Baseline arm vs packet arm
  • Oracle-aligned good decision rate
Benchmark families
  • IT incident and helpdesk
  • Coldstart structure-only probe
  • Large-action incident probes
  • Non-IT transfer and cardinality tests
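The suite totals quoted at the top of the page (21 cells, 2,520 labeled traces, 5,040 LLM calls) follow from the common contract. A minimal sketch, assuming a cell is one view-seed pair and each trace costs exactly one LLM call per arm; the constant and function names are illustrative.

```python
VIEWS = 7              # seven benchmark views
SEEDS_PER_VIEW = 3     # 3 seeds per benchmark view
TRACES_PER_SEED = 120  # N = 120 traces per seed
ARMS = 2               # baseline arm and packet arm


def suite_size(views=VIEWS, seeds=SEEDS_PER_VIEW,
               traces=TRACES_PER_SEED, arms=ARMS):
    """Return (cells, labeled traces, LLM calls) for the suite.

    Assumption: one LLM call per trace per arm.
    """
    cells = views * seeds
    labeled_traces = cells * traces
    llm_calls = labeled_traces * arms
    return cells, labeled_traces, llm_calls


if __name__ == "__main__":
    print(suite_size())  # (21, 2520, 5040)
```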

What these numbers do and do not claim

These benchmarks measure alignment under a frozen authored contract, not guaranteed production business lift. The packet and the rubric share the same benchmark ontology by design; that is what makes the decision surfaces auditable, but it also means this is a framework-aligned evaluation rather than unconstrained production ground truth.

Authored outcomes

Outcome maps and scores are benchmark-authored, not learned from production logs.

Cost realism

API spend depends on current model pricing and does not include engineering overhead or downstream business cost.

Absolute vs relative quality

Higher uplift can come from a weaker baseline. That is why this page reports both uplift and absolute packet good rate.
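One way to see why uplift alone can mislead is to compare it with the share of baseline error that the packet removes. The relative error reduction below is not a metric this page reports; it is an illustrative add-on, and the function name is hypothetical.

```python
def relative_error_reduction(baseline_pct, packet_pct):
    """Percent of the baseline's missed decisions recovered by the packet arm.

    Inputs are good decision rates in percent. This is an illustrative
    metric, not one defined by the benchmark.
    """
    baseline_error = 100.0 - baseline_pct
    if baseline_error <= 0:
        raise ValueError("baseline has no error left to reduce")
    return 100.0 * (packet_pct - baseline_pct) / baseline_error


if __name__ == "__main__":
    # Rates taken from the results table.
    print(round(relative_error_reduction(71.95, 91.67), 1))  # IT incident: 70.3
    print(round(relative_error_reduction(3.61, 42.22), 1))   # Refunds large: 40.1
```

IT incident has roughly half the absolute uplift of refunds large (+19.72 pp versus +38.61 pp) but removes a larger share of the remaining error, because its baseline leaves far less room to improve.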

Model volatility

Temperature is fixed, but API-side nondeterminism and model revision drift can still affect exact cell outcomes over time.