GPT-5.5's hallucination rate on legal citations is the number every managing partner wants quantified, and the public benchmark answer is incomplete. Per OpenAI's GPT-5.5 system card, the April 23, 2026 launch flagged improved calibration as a primary upgrade. Translation: fewer fabricated citations on niche legal questions. But OpenAI doesn't publish a legal-citation-specific hallucination benchmark, and Anthropic doesn't either for Opus 4.7. What we have: general calibration improvements, observational evidence from legal-tech teams, and the Damien Charlotin hallucination database tracking real sanctions cases. This spoke walks through what's documented, what's anecdotal, and how to interpret hallucination claims for procurement decisions.
What OpenAI and Anthropic actually published about hallucination rates
Neither lab published model-vs-prior-version hallucination benchmarks specific to legal citations. What they did publish:
OpenAI's GPT-5.5 system card describes calibration improvements in general terms: "less likely to proceed confidently with a bad plan." The card includes safety evaluations and capability benchmarks across domains (math, coding, reasoning) but doesn't isolate legal-citation generation as a separate metric.
Anthropic's Opus 4.7 release notes describe similar improvements without legal-specific quantification. The 87.6% SWE-bench Verified score and 94.2% GPQA Diamond score are dev/research benchmarks, not legal benchmarks.
The practical implication: there's no publicly cited "GPT-5.5 hallucinates X% less often on legal citations than GPT-5.4" number. Anyone quoting one is citing internal vendor data, citing a third-party benchmark not yet widely replicated, or making the number up. For procurement purposes, treat specific percentage claims with skepticism unless they're sourced.
What is documented: the directional improvement is real, per both labs' published claims, per observational evidence from legal-tech teams running before-and-after testing, and per the practical experience of attorneys using both versions. The magnitude isn't precisely quantified.
The Charlotin database: real sanctions data, but not version-attributed
Damien Charlotin's hallucination database hosted at HEC Paris's Smart Law Hub catalogs 1,227 documented AI hallucination sanctions cases globally as of early 2026 — up from 719 in January per NPR's April 3, 2026 piece. The database is the most-cited public ledger of legal AI sanctions.
What the database tracks: jurisdiction, sanction type, model named (when disclosed), outcome. What it doesn't reliably track: model version. The Cherry Hill federal sanction on April 27, 2026 (per The Inquirer's coverage) sanctioned an attorney who couldn't recall whether he'd used Claude, ChatGPT, or Grok — let alone which version. The Alabama sanction of W. Perry Hall ($17,200 plus solo-filing bar per WFTV's coverage) doesn't specify version either.
For version-attribution analysis, the database is incomplete. The cases that surface in 2027 with version-attributed sanctions data will create the first version-specific empirical evidence. Until then, version-rate comparison relies on observational evidence from legal-tech teams with controlled testing.
The second-order angle: as state bar ethics opinions and insurance underwriters reference the Charlotin database (per the ABA Journal sanctions ramp-up coverage), version-attribution becomes increasingly important. Firms documenting model version on every AI-assisted draft (per the federal court AI disclosure rules need model version specifics spoke) build the data infrastructure that the field is going to need.
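For concreteness, here's a minimal sketch of the kind of per-draft provenance record a firm could attach to AI-assisted work product. The field names, identifiers, and values are illustrative assumptions, not a standard; the point is that model version gets captured at drafting time rather than reconstructed later.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
import json

@dataclass
class DraftProvenance:
    """Per-draft metadata; field names are illustrative, not a standard."""
    matter_id: str
    model_vendor: str      # e.g. "OpenAI", "Anthropic", "Google"
    model_version: str     # the field missing from most sanctions records
    prompt_hash: str       # hash of the prompt, so the record avoids storing privileged text
    verified_by: str       # attorney or associate who ran the citation check
    verified_at: str       # ISO-8601 timestamp of the verification pass

record = DraftProvenance(
    matter_id="2026-00412",            # hypothetical matter number
    model_vendor="OpenAI",
    model_version="gpt-5.5",           # illustrative identifier
    prompt_hash="sha256:0f3a...",      # truncated for the example
    verified_by="reviewing attorney",
    verified_at=datetime.now(timezone.utc).isoformat(),
)
print(json.dumps(asdict(record), indent=2))
```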
What legal-tech teams have observed in controlled testing
Legal-tech teams running before-and-after comparisons of GPT-5.4 and GPT-5.5 on legal-citation tasks report directional improvement without precise quantification. Common observations from internal testing across multiple firms (anecdotal but consistent):
- Confident fabrication of whole cases dropped meaningfully. GPT-5.4's most embarrassing failure mode (generating a plausible-looking case name, citation, and holding for a case that doesn't exist) surfaces less often on 5.5. The reduction is observable but not benchmarked.
- Holding-misquote and statute-section drift rose as a share of remaining errors. As confident fabrication dropped, the residual hallucinations shifted toward subtler mistakes: the citations are real, but what the model says they hold isn't always what they hold. This isn't a regression; it's a failure-mode shift.
- Niche state-bar questions show the largest improvement. Questions about state-specific bar rules, recent state appellate decisions, and statute renumberings were the worst-performing categories on 5.4. They're the most-improved on 5.5.
- Federal-circuit case law shows smaller improvement. Federal-circuit citation generation was already relatively reliable on 5.4; the 5.5 improvement is smaller in this category because the baseline was higher.
- Long-tail jurisdictions (territorial, tribal, foreign) show variable improvement. These categories have less training data; the calibration improvement helps less when the underlying knowledge is sparse.
The practical procurement read: 5.5 is meaningfully better than 5.4 on the failure modes that drove the most-prominent sanctions cases. The improvement isn't transformative; verification protocols still apply (per the citation verification protocol after GPT-5.5 launch spoke). The shift in failure-mode distribution requires updating what verification looks for.
How GPT-5.5 compares to Claude Opus 4.7 and Gemini 3.1 Pro on hallucination behavior
Three frontier models, three different observational profiles on legal hallucination:
GPT-5.5 shows the largest improvement on confident fabrication of whole cases. Calibration improvement per OpenAI's system card translates to a more measured tone on niche legal questions — the model is more likely to flag uncertainty rather than generate a plausible-looking but fabricated answer. Failure-mode distribution shifts toward subtler errors.
Claude Opus 4.7 shows the strongest performance on niche state-bar variations and statute renumberings in observational comparison. Anthropic's calibration improvements appear strongest on the categories where prior Claude versions had the largest gap from accurate answers. Per the GPT-5.5 vs Claude Opus 4.7 comparison spoke, the cross-vendor performance is workload-dependent.
Gemini 3.1 Pro is variable. Strong on multi-jurisdictional regulatory reasoning where the 2M-token context window helps; weaker on US-state-specific case law where the calibration on niche questions appears less reliable than GPT-5.5 or Opus 4.7.
For pure citation-generation tasks across the three: Opus 4.7 and GPT-5.5 sit close on observable hallucination rates, with different failure-mode distributions; Gemini 3.1 Pro sits behind on US legal citations specifically but matches or beats the other two on multi-jurisdictional regulatory work.
The practical procurement implication: hallucination behavior alone shouldn't drive single-vendor consolidation. Different models hallucinate differently, but all three need verification. Workflow shape, cost structure, and deployment surface matter more for the procurement decision than marginal hallucination-rate differences.
What firms should actually measure for version-rate comparison
Firms that want internal data on hallucination rates by model version should run controlled testing. The protocol that produces useful data:
Test set construction: Build a corpus of 100-200 legal-citation queries spanning the failure modes the firm cares about (niche state-bar variations, recently renumbered statutes, recent appellate decisions, multi-jurisdictional regulatory questions). Each query has a known correct answer documented in advance.
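For concreteness, a sketch of what one test-set entry could look like in code. The structure, category labels, and the example query are assumptions for illustration ("State X" and the rule cited are placeholders); the firm's own corpus would encode its actual jurisdictions and documented correct answers.

```python
from dataclasses import dataclass

@dataclass
class CitationQuery:
    """One entry in the internal test corpus (illustrative structure)."""
    query: str                 # the prompt posed to the model
    category: str              # e.g. "state-bar", "statute-renumbering", "recent-appellate"
    jurisdiction: str
    expected_citation: str     # verified correct answer, documented in advance
    expected_holding: str      # one-sentence statement of what the authority actually holds

# A corpus of 100-200 of these, spanning the firm's practice areas, is the test set.
corpus = [
    CitationQuery(
        query="What rule governs trust-account recordkeeping for solo practitioners in State X?",
        category="state-bar",
        jurisdiction="State X",
        expected_citation="State X Rule of Professional Conduct 1.15 (illustrative)",
        expected_holding="Client funds must be held in a separate, reconciled trust account.",
    ),
]
```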
Controlled comparison: Run the test set through each model version under controlled conditions — same prompt, same context, same effort level. Capture each model's output verbatim.
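A minimal runner for that comparison might look like the sketch below, assuming corpus entries shaped like the CitationQuery example above. `call_model` is a placeholder for whichever SDK or internal gateway the firm actually uses, and the model identifiers are illustrative; the only hard requirements are identical prompt, context, and effort settings across versions, and verbatim capture of every output.

```python
import json

def call_model(model_version: str, prompt: str) -> str:
    """Placeholder for the firm's actual client call. Whatever SDK is used,
    hold prompt, context, and effort/reasoning settings constant across versions."""
    raise NotImplementedError

MODEL_VERSIONS = ["gpt-5.4", "gpt-5.5"]   # illustrative identifiers

def run_comparison(corpus, out_path="runs.jsonl"):
    """Run every query through every version and capture outputs verbatim."""
    with open(out_path, "w") as f:
        for version in MODEL_VERSIONS:
            for item in corpus:
                output = call_model(version, item.query)
                f.write(json.dumps({
                    "model_version": version,
                    "category": item.category,
                    "query": item.query,
                    "output": output,
                }) + "\n")
```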
Verification protocol: Score each output for fabrication (citation doesn't exist), holding-misquote (citation exists but the holding doesn't match), date drift (citation has the wrong date), section drift (statute citation has the wrong section), and over-confidence (model asserts certainty it shouldn't have, or fails to flag genuine uncertainty).
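Scoring stays more consistent across reviewers if the failure modes are enumerated up front. A sketch, with labels matching the categories just listed; the enum values are descriptive strings, not a standard taxonomy.

```python
from dataclasses import dataclass
from enum import Enum

class FailureMode(Enum):
    NONE = "accurate"
    FABRICATION = "citation does not exist"
    HOLDING_MISQUOTE = "citation exists, holding misstated"
    DATE_DRIFT = "citation carries the wrong date"
    SECTION_DRIFT = "statute citation points to the wrong section"
    OVERCONFIDENCE = "asserts certainty where uncertainty was warranted"

@dataclass
class ScoredOutput:
    """One reviewer label per model output."""
    model_version: str
    category: str            # test-set category, e.g. "state-bar"
    failure_mode: FailureMode
```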
Comparison metrics: Calculate hallucination rate per category per model version. Track failure-mode distribution. Score by practice area (litigation, transactional, regulatory).
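And the aggregation step: hallucination rate per model version per category, plus the failure-mode distribution the procurement discussion actually turns on. The sketch below uses plain tuples rather than the ScoredOutput dataclass so it runs on its own; the label strings are illustrative.

```python
from collections import Counter, defaultdict

def hallucination_rates(scored):
    """Error rate and failure-mode mix per (model_version, category).

    `scored` is an iterable of (model_version, category, failure_mode) tuples,
    where failure_mode is "accurate" for a clean output or one of the labels
    from the scoring step ("fabrication", "holding-misquote", and so on)."""
    totals, errors = Counter(), Counter()
    modes = defaultdict(Counter)
    for model_version, category, failure_mode in scored:
        key = (model_version, category)
        totals[key] += 1
        if failure_mode != "accurate":
            errors[key] += 1
            modes[key][failure_mode] += 1
    return {
        key: {"rate": errors[key] / totals[key], "failure_modes": dict(modes[key])}
        for key in totals
    }

# Example: one category, two versions, a handful of hand-scored outputs.
example = [
    ("gpt-5.4", "state-bar", "fabrication"),
    ("gpt-5.4", "state-bar", "accurate"),
    ("gpt-5.5", "state-bar", "holding-misquote"),
    ("gpt-5.5", "state-bar", "accurate"),
]
print(hallucination_rates(example))
```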
The operational cost: 1-2 days of associate time to build the test set, 4-8 hours of model runtime to execute, 2-3 days of attorney review for scoring. Total: about a week of sustained work for actionable internal data. The data informs model selection, verification protocol design, and effort-level routing decisions.
For most mid-market and BigLaw firms, this is worth doing once a quarter. The legal-tech engineering required is modest (per the Codex CLI for legal-tech engineering spoke). The data quality is substantially better than relying on vendor claims or industry observations.
The Bottom Line: My take: GPT-5.5 hallucinates less often than GPT-5.4 on legal citations. The directional improvement is real per both labs' claims and observational evidence from legal-tech teams, but the magnitude isn't precisely benchmarked publicly. The failure-mode distribution shifts toward subtler errors (holding misquotes, statute-section drift) rather than confident fabrication of whole cases. Verification protocols still apply; what they check needs updating. Firms that want internal data should run controlled testing once a quarter rather than rely on vendor claims.
AI-Assisted Research. This piece was researched and written with AI assistance, reviewed and edited by Manu Ayala. For deeper takes and the perspective behind the research, follow me on LinkedIn or email me directly.
