I've been testing every Claude release since Sonnet 3.5 dropped last summer. Most of them are incremental. Better benchmarks, slightly faster responses, the kind of improvements that look good in a blog post but don't change how I work.

Opus 4.7 is different. Anthropic released it on April 16, 2026, and within 48 hours the legal AI community went from cautiously optimistic to genuinely unsettled. Not because of the headline benchmarks — though those are striking. Because of one capability that changes everything about how lawyers should think about AI output: self-verification.

Here's the number that matters most. Harvey ran Opus 4.7 through their BigLaw Bench — the most rigorous legal AI benchmark that exists — and it scored 90.9%, with 45% perfect scores. That's the highest any model has ever achieved on that test. For context, the previous best was in the low 80s. This isn't a marginal improvement. This is a category jump.

But the benchmark isn't the story. The story is what happens when you actually use this model on real legal work — and the hidden cost trap that nobody's talking about yet.


Self-Verification: The Feature That Stopped Me Cold

Opus 4.7 doesn't just generate answers. It devises independent checks on its own reasoning, pursues alternative approaches to confirm or challenge its conclusions, and flags when it can't verify something.

If you've spent any time using AI for legal research, you know the problem this solves. The previous generation of models — including earlier Claude versions — would give you a confident, well-structured, completely wrong answer. Hallucinated case citations. Misapplied standards of review. Rules from one jurisdiction presented as if they applied in another. The output looked so polished that catching the errors meant basically redoing the research yourself.

Self-verification attacks this problem at the root. Instead of generating one answer and presenting it with false confidence, Opus 4.7 generates the answer, then generates a separate verification process, then tells you where it couldn't confirm its own work.

I tested this on a multi-jurisdictional employment law question — whether a non-compete clause enforceable in Texas would hold up in California. Previous Claude versions would give me a smooth, confident analysis that occasionally mixed up the specific statutory provisions. Opus 4.7 gave me the analysis, then flagged that California's specific statutory framework (Business and Professions Code Section 16600) creates a near-total bar that differs from what it had initially characterized as a "strong presumption against enforcement." It caught its own imprecision.

That's not a benchmark improvement. That's a structural change in how reliable AI-assisted legal research can be. For the first time, the model is doing some of the verification work that we've been telling associates they need to do manually on every AI output.

Harvey BigLaw Bench: 90.9% and What It Actually Tests

Let me be specific about what BigLaw Bench measures, because most people treat it like a single score when it's actually a battery of tests across multiple legal competencies.

BigLaw Bench tests contract analysis, legal research, regulatory interpretation, case law application, and multi-step legal reasoning. It was designed by practicing attorneys at Harvey's partner firms — not by AI researchers who think legal work is multiple choice. The questions mirror real associate-level tasks: "Review this indemnification clause and identify deviations from market standard terms" or "Analyze whether this fact pattern triggers reporting obligations under the new SEC climate disclosure rules."

90.9% overall accuracy with 45% perfect scores means that on nearly half the test scenarios, Opus 4.7 produced output that experienced attorneys couldn't meaningfully improve. On the other 55%, it got close enough that the attorney's role shifted from "redo this work" to "refine this work."

For comparison, first-year associates typically score in the 60-70% range on similar evaluations. Senior associates land in the 80-85% range. Opus 4.7 at 90.9% is performing at the level of a strong mid-level associate who happens to have instant recall of every case, statute, and regulation ever published.

The SWE-bench score — a software engineering benchmark — jumped from 80.8% to 87.6%. I mention this because it matters for firms building internal tools. If your firm is developing custom AI workflows, the model that powers those workflows just got meaningfully better at writing and debugging code. That has downstream effects on every piece of legal tech your firm deploys.

3x Vision Resolution: Why Scanned Documents Just Became Viable

This is the improvement that practice-area leads should be paying attention to.

Opus 4.7 processes images at 2,576 pixels — three times the resolution of the previous version. That sounds like a spec sheet detail until you think about what law firms actually work with: scanned contracts, faxed amendments, dense annexes with small-font tables, handwritten margin notes on executed agreements.

Previous vision models could read clean, high-resolution documents. They choked on the real-world stuff. A scanned contract from 2008 with slightly skewed pages and 9-point font? The model would miss critical terms, misread numbers, or silently skip sections it couldn't parse.

At 2,576 pixels, Opus 4.7 can read the kind of documents that actually exist in firm document management systems. Not the pristine PDFs that come from modern drafting tools, but the scanned annexes, the handwritten exhibits, the faded faxes that somehow remain critical to a deal that closed fifteen years ago.

For M&A due diligence teams, this is significant. A meaningful percentage of the documents in any data room are legacy scans. Being able to run those through AI analysis without first paying a vendor to OCR and clean them up removes a bottleneck and a cost that firms have been absorbing for years.

For litigation teams dealing with discovery, the implications are similar. Scanned documents that previously required manual review can now be fed through Opus 4.7 for initial categorization, relevance screening, and privilege identification. The 3x resolution improvement turns a theoretical capability into a practical one.

The Hidden Token Cost Trap Nobody's Talking About

Now here's the part that will actually hit your firm's budget, and I haven't seen anyone else cover this yet.

Opus 4.7 uses a new tokenizer. Tokenizers are how AI models break text into pieces for processing, and different tokenizers split the same text into different numbers of tokens. Opus 4.7's new tokenizer uses 10-35% more tokens than previous Claude models on the same input.

Let me make this concrete. If your firm was spending $10,000/month on Claude API costs for contract review workflows, the same volume of work on Opus 4.7 could cost $11,000-$13,500 — before you account for the model's higher per-token pricing.

The pricing is $5 per million input tokens and $25 per million output tokens, with a 1M-token context window and a 128K-token output limit. The context window is massive — you can feed it an entire contract suite in a single prompt. The output limit means it can generate comprehensive analyses without truncation.

But here's the trap. Most firms evaluate AI costs based on the price-per-token listed on the website. They don't test how many tokens their specific documents consume under the new tokenizer. Legal text — with its defined terms, section references, and specialized vocabulary — tends to tokenize less efficiently than casual English. A contract that consumed 50,000 tokens under the old tokenizer might consume 60,000-67,000 under the new one.
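
To make the budget math reproducible, here's a back-of-the-envelope sketch. The monthly token volumes are hypothetical (chosen to land on the $10,000 baseline above), the 10-35% inflation is the range reported for the new tokenizer, and the per-token prices are the list prices quoted earlier.

```python
# Back-of-the-envelope cost comparison for the tokenizer change.
# Token volumes are illustrative assumptions, not measured values --
# benchmark your own documents before budgeting.

INPUT_PRICE = 5 / 1_000_000    # $ per input token (list price)
OUTPUT_PRICE = 25 / 1_000_000  # $ per output token (list price)

def monthly_cost(input_tokens: int, output_tokens: int) -> float:
    """Raw API cost for a month's token volume."""
    return input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE

# Hypothetical workflow: 1.5B input / 100M output tokens per month
# under the old tokenizer (= the $10,000/month baseline).
base = monthly_cost(1_500_000_000, 100_000_000)

# The same documents under the new tokenizer, at the low and high
# ends of the reported 10-35% token inflation.
low = monthly_cost(int(1_500_000_000 * 1.10), int(100_000_000 * 1.10))
high = monthly_cost(int(1_500_000_000 * 1.35), int(100_000_000 * 1.35))

print(f"old tokenizer: ${base:,.0f}/month")   # $10,000/month
print(f"new, +10%:     ${low:,.0f}/month")    # $11,000/month
print(f"new, +35%:     ${high:,.0f}/month")   # $13,500/month
```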

The improvement in quality is real. The improvement in capability is real. But if your firm is running AI workflows at scale, you need to re-benchmark your costs before assuming Opus 4.7 is a drop-in upgrade. Run your actual documents through both tokenizers and compare. The per-token price might look competitive, but the token count is where the cost hides.

I've talked to three firms in the last week that upgraded to Opus 4.7 without testing tokenization costs first. All three saw their monthly API spend increase by 15-25%. One of them is now running a hybrid approach — Opus 4.7 for complex analysis that benefits from self-verification, and the cheaper Sonnet model for routine tasks where the extra quality isn't needed.

Instruction Following: What 'More Disciplined' Actually Means

Anthropic describes Opus 4.7's instruction following as "more disciplined." That's corporate speak for something that actually matters in legal contexts.

Previous Claude versions had a tendency to improvise. You'd give it a detailed prompt — "analyze this contract clause, identify deviations from the standard, cite the relevant UCC provisions, and format the output as a redline memo" — and it would do most of that but add its own creative touches. Extra commentary you didn't ask for. Alternative suggestions that weren't requested. A helpful but unwanted summary at the end.

In legal work, improvisation is dangerous. When a partner asks an associate to draft a motion to compel citing Federal Rules 26 and 37, and the associate also throws in a creative argument under Rule 16 that nobody reviewed — that's how sanctions happen.

Opus 4.7's disciplined instruction following means it does what you tell it to do. Not more. Not less. If your prompt says "identify only material deviations from the template," it doesn't helpfully flag immaterial variations. If your prompt says "cite only binding authority from the Second Circuit," it doesn't include persuasive authority from other circuits.

This matters because it makes prompts more reliable. You can build standardized prompt templates for your firm's common workflows and trust that the outputs will be consistent. That's the foundation of scalable AI deployment. Without disciplined instruction following, every prompt is a negotiation with the model. With it, a prompt becomes a specification.
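
To show what a prompt-as-specification looks like, here's a hypothetical template of the kind a firm might standardize. The structure, constraint language, and placeholders are illustrative assumptions, not a vetted firm workflow or an Anthropic recommendation.

```python
# A hypothetical standardized prompt template for clause review.
# The constraint language and placeholders are illustrative, not a
# vetted firm workflow.

CLAUSE_REVIEW_TEMPLATE = """\
You are reviewing a contract clause against our standard template.

Task:
1. Identify ONLY material deviations from the template below.
2. For each deviation, cite the relevant UCC provision, if any.
3. Format the output as a redline memo: clause reference,
   deviation, risk note.

Constraints:
- Do not flag immaterial or stylistic variations.
- Do not add summaries, alternatives, or commentary beyond the
  three fields above.
- If you cannot verify a citation, say so explicitly.

Template clause:
{template_clause}

Clause under review:
{clause_under_review}
"""

prompt = CLAUSE_REVIEW_TEMPLATE.format(
    template_clause="...",       # firm-standard language goes here
    clause_under_review="...",   # clause extracted from the deal doc
)
```

Because the constraints live in the template rather than in each attorney's ad hoc phrasing, every run of the workflow is held to the same specification.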

What This Means for Different Practice Areas

Not every practice area benefits equally from Opus 4.7. Here's where I see the biggest impact.

Transactional/M&A: The combination of self-verification and 3x vision resolution makes Opus 4.7 the first model I'd trust for first-pass due diligence on mixed-format data rooms. The model can read legacy scans, analyze complex provisions, and flag where its own analysis might be wrong. That doesn't eliminate the need for attorney review, but it changes what the attorney is reviewing. Instead of doing the analysis, they're checking the analysis. That's a fundamentally different (and faster) workflow.

Litigation: Self-verification is critical here. Hallucinated case citations have been the most embarrassing AI failure mode for litigators. Opus 4.7's ability to check its own citations doesn't make the problem disappear entirely, but it reduces the rate of confident wrong answers significantly. If you've been hesitant to use AI for research memos because of the citation risk, this is the version that should make you reconsider.

Regulatory/Compliance: The 90.9% BigLaw Bench score includes regulatory interpretation tasks. For firms doing compliance work — SEC filings, banking regulations, healthcare compliance — Opus 4.7 is materially better at understanding the interplay between multiple regulatory frameworks. The self-verification feature is especially valuable here because regulatory analysis often involves layered requirements where missing one layer invalidates the entire analysis.

IP/Patent: The vision improvement helps with patent drawings and technical specifications. The self-verification helps with prior art analysis. But patent prosecution involves a level of technical specificity that still pushes the model's limits. Opus 4.7 is better, but I wouldn't restructure an IP practice around it yet.

Family/Estate/Small Practice: The cost trap I described is proportionally more painful for smaller firms. If you're a five-attorney estate planning firm, the quality improvements in Opus 4.7 are real but the cost increase matters more. Stick with Sonnet for routine drafting. Use Opus 4.7 for the complex trust structures and multi-jurisdictional estate plans where self-verification earns its premium.

How Opus 4.7 Compares to GPT-5 and Gemini 2.5

This is the question every managing partner is going to ask, so let me lay it out.

On legal-specific benchmarks, Opus 4.7 leads. The 90.9% BigLaw Bench score is the highest any model has achieved. GPT-5 hasn't been publicly tested on BigLaw Bench, but GPT-4o scored in the low 80s on similar legal reasoning tests. Google's Gemini 2.5 Pro performs well on general reasoning but hasn't demonstrated the same level of legal-specific capability.

On cost, Opus 4.7 is mid-range. GPT-4o is cheaper per token and uses roughly the same number of tokens on comparable inputs. Gemini 2.5 Pro is the most cost-effective for high-volume processing. But — and this is important — cheaper per token doesn't mean cheaper per outcome. If a model requires more human review time because its outputs are less reliable, the labor cost erases the token savings.

On enterprise features, Anthropic's offering is competitive. The 1M context window is the largest among the major models. The 128K output limit is generous. Enterprise deployment options exist for firms that need data isolation.

My take: for complex legal analysis — contract review, regulatory interpretation, litigation research — Opus 4.7 is the best model available right now. For routine drafting, email generation, and simple research tasks, you don't need it. Use the cheaper models for the 80% of tasks that are straightforward, and route the complex 20% to Opus 4.7.

The firms that will waste the most money are the ones that use Opus 4.7 for everything. The firms that will get the most value are the ones that match model capability to task complexity.

What Your Firm Should Do This Week

I'll keep this practical.

First, test the tokenizer cost on your actual documents. Take ten representative documents from your most common workflow — contracts, memos, briefs — and run them through the Opus 4.7 tokenizer. Compare the token count to what you were getting before. If the increase is above 20%, you need to factor that into your budget before upgrading.
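
If you want a starting point, here's a minimal sketch using the token-counting endpoint in Anthropic's Python SDK. The model IDs and file names are placeholders; swap in your current model, Opus 4.7's actual API identifier, and your own documents.

```python
# A minimal sketch of the tokenizer comparison using the
# token-counting endpoint in Anthropic's Python SDK. Model IDs and
# file names below are placeholders -- substitute your own.

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from env

OLD_MODEL = "your-current-model-id"   # placeholder
NEW_MODEL = "opus-4-7-model-id"       # placeholder

def count_tokens(model: str, text: str) -> int:
    """Count the input tokens `text` consumes under `model`."""
    result = client.messages.count_tokens(
        model=model,
        messages=[{"role": "user", "content": text}],
    )
    return result.input_tokens

for path in ["contract_01.txt", "memo_02.txt", "brief_03.txt"]:
    with open(path, encoding="utf-8") as f:
        text = f.read()
    old = count_tokens(OLD_MODEL, text)
    new = count_tokens(NEW_MODEL, text)
    print(f"{path}: {old} -> {new} tokens ({new / old - 1:+.1%})")
```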

Second, run a self-verification test. Take a complex legal question where you know the right answer — a multi-jurisdictional issue, a layered regulatory question, something where previous models have given you confident wrong answers. Run it through Opus 4.7 and see whether the self-verification catches the errors that previous versions missed. If it does, that's your proof of concept for the upgrade.

Third, test the vision capabilities on your worst-quality documents. Dig out the ugliest scanned contract in your DMS. The one with the faded ink and the crooked pages. See if Opus 4.7 can read it accurately. If it can, you just eliminated a manual bottleneck.
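
Here's a minimal way to run that test through the Messages API. The model ID is a placeholder, and the transcription instruction is my own suggestion for surfacing silent misreads rather than anything Anthropic prescribes.

```python
# A minimal sketch for stress-testing vision on a low-quality scan.
# The model ID is a placeholder; the image-block format follows
# Anthropic's Messages API.

import base64
import anthropic

client = anthropic.Anthropic()

with open("ugly_scan_page1.png", "rb") as f:
    image_data = base64.standard_b64encode(f.read()).decode("utf-8")

response = client.messages.create(
    model="opus-4-7-model-id",  # placeholder for the real ID
    max_tokens=2048,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": "image/png",
                    "data": image_data,
                },
            },
            {
                "type": "text",
                "text": "Transcribe this page verbatim. Mark anything "
                        "you cannot read as [ILLEGIBLE] instead of "
                        "guessing.",
            },
        ],
    }],
)
print(response.content[0].text)
```

Compare the transcription against a manual read of the same page; the [ILLEGIBLE] markers tell you whether the model is admitting uncertainty or papering over it.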

Fourth, build a tiered model strategy. Not every task needs Opus 4.7. Define which workflows get the premium model and which ones run on Sonnet or Haiku. The quality difference is real, but so is the cost difference. Match them.
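
In code, the simplest version of that strategy is a central routing table that every internal tool consults. This sketch uses an assumed workflow taxonomy, tier assignments, and placeholder model names; adapt all three to your firm.

```python
# A minimal sketch of tiered model routing. Workflow categories,
# tier assignments, and model IDs are assumptions for illustration.

from enum import Enum

class Tier(Enum):
    PREMIUM = "opus-4-7-model-id"   # placeholder: complex analysis
    STANDARD = "sonnet-model-id"    # placeholder: routine drafting
    LIGHT = "haiku-model-id"        # placeholder: triage, extraction

# Map workflow types to tiers once, centrally, so every tool in the
# firm routes consistently.
ROUTING = {
    "due_diligence_analysis": Tier.PREMIUM,
    "regulatory_interpretation": Tier.PREMIUM,
    "litigation_research_memo": Tier.PREMIUM,
    "contract_first_draft": Tier.STANDARD,
    "client_email": Tier.STANDARD,
    "document_triage": Tier.LIGHT,
}

def pick_model(workflow: str) -> str:
    """Return the model ID for a workflow. Default to the cheap tier
    and escalate deliberately, not the other way around."""
    return ROUTING.get(workflow, Tier.LIGHT).value
```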

Fifth, if you're on the Harvey platform, this upgrade is automatic. Harvey runs on Claude, and they've already integrated Opus 4.7. If you're building custom workflows on the API, the upgrade requires testing — don't just swap the model name in your code and assume everything works the same way.

The Bottom Line: Opus 4.7's self-verification is the first AI capability that reduces the single biggest risk in legal AI — confident wrong answers — and the hidden tokenizer cost increase is the detail that will separate firms that deploy it intelligently from firms that just get a bigger bill.

AI-Assisted Research. This piece was researched and written with AI assistance, reviewed and edited by Manu Ayala. For deeper takes and the perspective behind the research, follow me on LinkedIn or email me directly.