Calibration is the malpractice variable now. OpenAI shipped GPT-5.5 on April 23, 2026, with the system card describing the model as "less likely to proceed confidently with a bad plan." That's calibration. With 1,227 documented AI hallucination sanctions cases globally cataloged in Damien Charlotin's database at HEC Paris (up from 719 in January 2026 per NPR's April 3 piece), calibration just stopped being an engineering metric and started being a malpractice metric. The Cherry Hill ruling on April 27, 2026 (per The Inquirer's coverage) sanctioned attorney Raja Rajan in NJ federal court — and he wasn't sure whether he'd used Claude, ChatGPT, or Grok. That's the floor case. This spoke walks the calibration-to-sanctions causal chain and what changes after GPT-5.5.


What calibration actually means inside the model

Calibration in language models is the alignment between a model's confidence in an answer and the actual probability that the answer is correct. A well-calibrated model that says "I'm 90% sure this case stands for X" is right about 90% of the time on similar-confidence questions. A poorly-calibrated model says "90% sure" on questions where it's actually right 60% of the time — overconfident, prone to fabrication.
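To make that concrete, here is a minimal sketch of how calibration is typically quantified: expected calibration error, the average gap between stated confidence and observed accuracy. The confidence/correctness pairs below are hypothetical, not measurements of any model.

```python
from dataclasses import dataclass

@dataclass
class Answer:
    confidence: float   # model's stated probability that its answer is correct (0-1)
    correct: bool       # whether the answer held up on verification

def expected_calibration_error(answers: list[Answer], n_bins: int = 10) -> float:
    """Average |stated confidence - observed accuracy|, weighted by bin size."""
    bins: list[list[Answer]] = [[] for _ in range(n_bins)]
    for a in answers:
        idx = min(int(a.confidence * n_bins), n_bins - 1)
        bins[idx].append(a)
    total = len(answers)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(a.confidence for a in bucket) / len(bucket)
        accuracy = sum(a.correct for a in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece

# Hypothetical: a model that says "90% sure" but is right only 60% of the time
sample = [Answer(0.9, i < 6) for i in range(10)]
print(f"ECE: {expected_calibration_error(sample):.2f}")  # large gap = poor calibration
```

A well-calibrated model drives that gap toward zero; the fabrication-prone failure mode shows up as the 0.9-confidence, 0.6-accuracy bucket in the example.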

For legal research, the practical translation: a poorly-calibrated model fabricates plausible-looking citations, misquotes holdings with high confidence, and asserts statute sections that don't exist. The Charlotin database's 1,227 cases are what happens when poorly-calibrated AI output gets filed without verification.

GPT-5.5's calibration improvement, per OpenAI's system card framing, reduces the rate of high-confidence-but-wrong outputs. The model is more willing to express uncertainty, more likely to flag when its answer is based on inference rather than direct knowledge, and less likely to generate fabricated citations on the kinds of niche legal questions where prior versions failed.

We don't have a public benchmark for GPT-5.5's hallucination rate vs GPT-5.4's specifically (see the hallucination rate vs prior versions spoke for what is documented). Anecdotal observation across legal-tech teams suggests the improvement is meaningful — fewer fabricated cases on niche state-bar questions, fewer overconfident answers on recently-renumbered statutes — but not transformative. 5.5 still hallucinates. It hallucinates less often, and the failure modes shift toward subtler errors.

The 1,227-case database and what it actually documents

Damien Charlotin's hallucination database, hosted at damiencharlotin.com/hallucinations and maintained at HEC Paris's Smart Law Hub, catalogs documented cases of AI-generated legal hallucinations that surfaced in court filings, sanctions orders, or other public legal records. The count grew from 719 in January 2026 to 1,227 by April 2026, per NPR's reporting: roughly 5-6 new documented cases per day across jurisdictions.

The database tracks cases by jurisdiction, sanction type, model used (when disclosed), and outcome. Per the ABA Journal's coverage of sanctions ramp-up, the cases span US federal courts, state courts, foreign jurisdictions, and professional disciplinary proceedings. Recent high-profile examples: the Alabama Supreme Court sanction of attorney W. Perry Hall ($17,200 plus a solo-filing bar; Hall infamously cited two more nonexistent cases in his apology footnote, per WFTV's coverage); Oregon's $109,700 sanction (a record high for AI errors); and the 6th Circuit's $30,000 sanction against two attorneys for 24+ fake citations.

The database is the most-cited public ledger in legal coverage of AI sanctions. It's also a source state bar regulators, federal courts, and insurance carriers cite when developing AI-related rules and underwriting standards. The database is becoming infrastructure — when state ethics opinions reference "the documented sanctions cases," they're referencing this dataset.

For firms, the practical implication: documented sanctions are a leading indicator of regulatory and insurance treatment. The cases that surface this year shape the rules that govern next year.

Why calibration improvements only help if verification protocols exist

The Cherry Hill ruling on April 27, 2026 surfaces the structural problem. Per The Inquirer's coverage, attorney Raja Rajan was sanctioned in NJ federal court for AI hallucinations. He wasn't sure whether he'd used Claude, ChatGPT, or Grok. That's the floor: model brand and version don't matter when verification is missing.

Calibration improvements reduce the rate of high-confidence-but-wrong outputs. They don't eliminate it. Per the Heppner ruling and per the federal court AI disclosure rules need model version specifics spoke, the legal infrastructure assumes verification happens. When verification doesn't happen, the calibration improvement is a rounding error against the underlying procedural failure.

The operator read: GPT-5.5's improved calibration reduces the floor probability that a non-verifying associate ships a fake case to a federal judge. It doesn't eliminate the probability. Workflows that depend on the model being right — without an associate-driven Westlaw or Lexis verification step — fail at lower rates than pre-5.5 but still fail. Workflows with verification catch the residual errors that calibration improvements miss.
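A back-of-envelope sketch shows why. The per-citation error rates and the verification catch rate below are illustrative assumptions, not measured figures; the point is the shape of the math, not the specific numbers.

```python
# Illustrative assumptions only -- not measured hallucination rates.
pre_55_rate = 0.03    # assumed: 3% chance any given citation is fabricated or misstated
post_55_rate = 0.01   # assumed: better calibration cuts that to 1%
verify_catch = 0.95   # assumed: Westlaw/Lexis lookup plus reading the cited page catches 95%
citations = 20        # citations in a typical brief

def p_bad_filing(per_cite_rate: float, catch_rate: float = 0.0) -> float:
    """Probability at least one bad citation survives into the filing."""
    residual = per_cite_rate * (1 - catch_rate)
    return 1 - (1 - residual) ** citations

print(f"No verification, pre-5.5:    {p_bad_filing(pre_55_rate):.0%}")                 # ~46%
print(f"No verification, post-5.5:   {p_bad_filing(post_55_rate):.0%}")                # ~18%
print(f"With verification, post-5.5: {p_bad_filing(post_55_rate, verify_catch):.0%}")  # ~1%
```

With these assumed numbers, the model improvement helps, but the verification step is what moves the filing-level risk from double digits to negligible.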

The second-order angle: calibration improvements raise the cost of NOT having verification. The firms that catch errors via verification keep getting better outputs over time. The firms that skip verification keep shipping the residual errors and accumulating sanctions exposure. The gap between disciplined and undisciplined firms widens with each model improvement.

Failure-mode shift: from confident fabrication to subtle errors

Pre-5.5 (and pre-Opus 4.7) AI hallucination failure modes were dominated by confident fabrication. The model invented whole cases that didn't exist, with case names and citations that looked plausible. Verification caught these reliably because Westlaw and Lexis returned no result on the fake citation.

Post-5.5, the failure-mode distribution shifts. Confident fabrications are less common. What replaces them: subtler errors that pass cursory verification but fail under careful review.

- Misquoted holdings. Citation is real, case exists, but the holding the model summarized doesn't match what the case actually held. Verification by citation lookup misses this; verification by reading the cited page catches it.
- Slightly-wrong dates. Decision dates off by a few months, statute amendment dates off by a year. Hard to catch without precise verification.
- Statutes cited at wrong section. Underlying statute exists; section cited is similar but not the section that supports the proposition.
- Mis-attributed authority. Holding attributed to a Supreme Court case when it's actually a circuit court case (or vice versa). Citation appears correct; jurisdictional weight is wrong.
- Paraphrase drift. The model's restatement of a case's reasoning subtly distorts the actual reasoning. Direct quotes from the case would be correct; the model's paraphrase introduces error.

For litigation teams, the verification protocol update is to confirm not just that citations exist, but that the holdings the model summarized actually appear at the cited page. The citation verification protocol after GPT-5.5 launch spoke walks the workflow change.
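Here is a minimal sketch of what that two-stage check might look like as a structured record, assuming a firm logs verification per citation. Field names, statuses, and the structure are illustrative, not a prescribed standard.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum

class CheckStatus(Enum):
    NOT_CHECKED = "not_checked"
    PASSED = "passed"
    FAILED = "failed"

@dataclass
class CitationCheck:
    citation: str                 # the cite as it appears in the draft
    proposition: str              # what the draft claims the authority holds
    exists_check: CheckStatus = CheckStatus.NOT_CHECKED    # stage 1: Westlaw/Lexis returns the case
    holding_check: CheckStatus = CheckStatus.NOT_CHECKED   # stage 2: cited page supports the proposition
    checked_by: str = ""
    checked_on: date | None = None

    def cleared_for_filing(self) -> bool:
        # Existence alone misses misquoted holdings, wrong sections,
        # mis-attributed authority, and paraphrase drift; both stages must pass.
        return (self.exists_check is CheckStatus.PASSED
                and self.holding_check is CheckStatus.PASSED)
```

The point of the structure is that a citation can pass stage 1 and still fail stage 2, which is exactly the post-5.5 failure-mode shift described above.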

Insurance, ethics opinions, and the next 12-18 months

Insurance carriers writing legal AI riders are starting to ask firms which models they use and which versions. Better-calibrated models lower expected loss. Carriers haven't published version-specific underwriting standards yet, but the Charlotin database is being cited in carrier risk assessments. Firms that document their model-version decisions now build defensible records when carriers start asking version-specific questions.

State bar ethics opinions are the next layer. Per the ABA Journal coverage of sanctions ramp-up, state bars are increasingly issuing AI-specific guidance. Most current opinions name "AI tools generally" without differentiating models or versions. That's likely to change as model-version-specific failure modes become evident. The 12-18 month horizon: state bar opinions begin naming model behavior characteristics.

Federal court standing orders are the third layer. Per Bloomberg Law's standing-order tracker and Ropes & Gray's AI Court Order Tracker, 300+ federal judges have AI-related standing orders. Almost none differentiate by model version. The structural argument for updating: with the 5.4-to-5.5 calibration delta, treating the versions identically is sanctioning the wrong thing. The federal court AI disclosure rules need model version specifics spoke walks this argument.

The firm-level move: update internal disclosure templates to include model version and date of use even when not strictly required. If a judge, carrier, or bar regulator asks the version question first, the firm that documented it is protected. The firms that didn't document have to invent a record retroactively.

What changes in the firm's AI use policy after April 23

Three concrete policy updates that reflect GPT-5.5's calibration improvement and the sanctions landscape:

1. Model-version logging. Every AI-assisted draft includes a metadata note: model name, version, date of use, effort level (standard vs Pro for OpenAI, standard vs xhigh for Anthropic). The metadata travels with the document through review and filing. When a sanctions question surfaces later, the firm has the version-specific record.

2. Verification protocol update. Citation verification now includes content verification — confirming the holding the model summarized actually appears at the cited page, not just that the citation exists. This is a 30-minute training session for associates plus a paragraph in the AI use policy. The citation verification protocol after GPT-5.5 launch spoke walks the workflow.

3. Effort-level controls. Per the Pro vs standard upgrade spoke, effort-level controls prevent drift. Associates who default to ChatGPT Pro on personal accounts ($200/month, with the Pro variant priced at $30/$180) burn firm reimbursement at six-times-standard rates without procurement tracking. The internal AI use policy specifies which workloads warrant Pro vs standard; a minimal sketch of how items 1 and 3 might look in practice follows this list.
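The sketch below assumes the firm keeps a small workload-to-effort config plus a per-document metadata note written at drafting time. The workload categories, field names, and example values are illustrative assumptions, not a prescribed schema.

```python
import json
from datetime import date

# Item 3: which workloads warrant Pro vs standard effort (categories are illustrative).
EFFORT_POLICY = {
    "citation-heavy brief research": "standard",   # verification step catches residual errors
    "novel legal theory exploration": "pro",       # higher effort, partner sign-off required
    "internal memo summarization": "standard",
    "contract clause extraction": "standard",
}

# Item 1: the metadata note that travels with every AI-assisted draft.
def draft_metadata(document: str, model_name: str, model_version: str,
                   workload: str) -> dict:
    return {
        "document": document,
        "model_name": model_name,
        "model_version": model_version,
        "date_of_use": date.today().isoformat(),
        "effort_level": EFFORT_POLICY.get(workload, "standard"),
        "verified_by": "",          # filled in at the citation-verification step
    }

# Example: write the note as a sidecar file so the record survives review and filing.
note = draft_metadata("motion_to_dismiss_v3.docx", "GPT-5.5", "2026-04-23",
                      "citation-heavy brief research")
with open("motion_to_dismiss_v3.ai-metadata.json", "w") as f:
    json.dump(note, f, indent=2)
```

The sidecar-file choice is arbitrary; what matters is that the record is created at drafting time, not reconstructed after a sanctions question surfaces.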

The operational reality: the policy update is cheap. Implementation costs days, not weeks. The firms that update now have the policy in place when the next round of sanctions cases names model-version-specific behaviors. The firms that don't update will be the firms that show up in the Charlotin database six months from now.

The Bottom Line: Calibration is now a malpractice variable, not an engineering metric. GPT-5.5's improvement reduces the floor probability of fabrication but doesn't eliminate it. The firms that update verification protocols, log model versions, and control effort-level drift build defensible records against the next wave of sanctions cases. The firms that treat calibration as a vendor marketing claim rather than an operational input keep filing the residual errors and accumulating exposure.

AI-Assisted Research. This piece was researched and written with AI assistance, reviewed and edited by Manu Ayala. For deeper takes and the perspective behind the research, follow me on LinkedIn or email me directly.