Stanford's 2025 study on legal AI accuracy dropped numbers that should alarm every managing partner relying on AI-assisted research. Lexis+ AI hallucinated on 17-33% of queries. Westlaw AI hallucinated at nearly twice that rate. These aren't edge cases or adversarial prompts; they're standard legal research questions asked the way a practicing lawyer would ask them.
The data vendors don't advertise these numbers. LexisNexis achieved 65% accuracy on verified legal queries. Westlaw managed 42%. That means fewer than half the research queries you run through Westlaw's AI come back fully accurate, roughly two out of every five. If an associate performed at that level, you'd fire them. But firms are trusting these tools with client matters every day without understanding the failure rate.
What Stanford Actually Tested and Found
The Stanford study evaluated retrieval-augmented generation (RAG) systems — the architecture both Lexis+ AI and Westlaw AI use. RAG is supposed to solve hallucinations by grounding AI responses in actual legal databases. The study tested whether it actually does.
Researchers submitted hundreds of legal research queries across practice areas and verified every citation, quote, and legal proposition in the outputs. Lexis+ AI's hallucination rate ranged from 17% on straightforward queries to 33% on complex multi-issue questions. Westlaw AI's numbers were consistently worse, with accuracy at 42% compared to Lexis's 65%. The hallucinations weren't random gibberish — they were plausible-sounding citations to real courts with fabricated case names, or real case names with fabricated holdings. That's the dangerous kind.
Why RAG Doesn't Fix the Problem
Legal AI vendors sold RAG as the solution to hallucinations. The pitch: "Our AI is grounded in actual legal databases, so it can't make things up." The Stanford data proves that's marketing, not reality.
RAG reduces hallucinations but doesn't eliminate them. The retrieval step can pull irrelevant documents. The generation step can mischaracterize what it retrieved. The synthesis step can combine accurate fragments into inaccurate conclusions. When a RAG system retrieves a real case but summarizes the holding incorrectly, that's worse than an obvious fabrication — because the citation checks out but the law is wrong. A lawyer who verifies the case exists but doesn't read the actual opinion will miss the error entirely.
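To see where those failure points live, here is a minimal sketch of a RAG pipeline in Python. It is an illustration under stated assumptions, not how Lexis+ AI or Westlaw AI is actually built: the toy index, the keyword scoring, and function names like retrieve and generate are all hypothetical. The comments mark where each failure mode enters.

```python
# Minimal RAG sketch: hypothetical names throughout, not any vendor's API.

def retrieve(query: str, index: dict[str, str], k: int = 2) -> list[str]:
    # Failure point 1: retrieval. Naive keyword overlap can surface an
    # irrelevant opinion, which downstream steps then treat as authority.
    scored = sorted(
        index.items(),
        key=lambda kv: sum(w in kv[1].lower() for w in query.lower().split()),
        reverse=True,
    )
    return [f"{name}: {text}" for name, text in scored[:k]]

def generate(query: str, passages: list[str]) -> str:
    # Failure point 2: generation. A real system calls an LLM with these
    # passages as grounding; even then the model can restate a holding
    # incorrectly. Stubbed here so the sketch runs without an API key.
    return f"Answer to {query!r}, citing {len(passages)} retrieved passages."

def answer(query: str, index: dict[str, str]) -> str:
    # Failure point 3: synthesis. Individually accurate fragments can be
    # combined into a conclusion that none of the cited cases supports.
    return generate(query, retrieve(query, index))

if __name__ == "__main__":
    toy_index = {  # invented placeholder cases, not real citations
        "Case A": "Holding: summary judgment requires no genuine dispute of material fact.",
        "Case B": "Holding: personal jurisdiction requires minimum contacts with the forum.",
    }
    print(answer("What is the summary judgment standard?", toy_index))
```

Notice that nothing in this pipeline checks the final answer against the retrieved text. That gap is exactly why a correct-looking citation can carry a wrong holding.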
What the Vendors Don't Tell You
Neither LexisNexis nor Westlaw publishes its own accuracy benchmarks in a way that allows independent verification. When pressed on the Stanford numbers, both vendors pointed to internal testing that supposedly shows higher accuracy rates. Neither has released that data for peer review.
Here's what the sales reps won't mention: accuracy varies dramatically by practice area. Complex regulatory questions, multi-jurisdictional issues, and recent case law perform worst. The tools are most accurate on well-established, frequently cited propositions — exactly the research you least need AI help with. The harder the question, the less you can trust the answer. Both platforms also perform worse on state law than federal law, which matters for the majority of practicing lawyers who work primarily in state courts.
The Real Cost of a 1-in-3 Failure Rate
A 17-33% hallucination rate doesn't just risk sanctions. It compounds across a practice. If your firm runs 100 AI-assisted research queries per week and one-third contain some form of hallucination, that's 30+ potentially flawed research memos hitting partner desks every week.
Not all hallucinations lead to filed documents. But they waste associate time chasing phantom authorities, create false confidence in legal positions, and occasionally make it into briefs. The Portland attorney who paid $109,700 in sanctions relied on AI output he didn't verify. The Mata v. Avianca lawyers trusted their tool. At a 1-in-3 failure rate, the question isn't whether your firm will file a hallucinated citation — it's when. And the malpractice implications extend beyond sanctions to client harm on matters where flawed research shaped case strategy.
What Lawyers Should Actually Verify
Given these accuracy rates, every AI research output requires verification, but the verification needs to be targeted. Check these five things on every AI research response:
1. Case existence: Confirm every cited case exists in an actual reporter. Don't just search the case name; verify the citation format, court, and year.
2. Holding accuracy: Read the actual opinion. AI frequently gets the court right but the holding wrong, sometimes stating the opposite of what the court held.
3. Current status: Check that cited cases haven't been overruled, distinguished, or superseded. AI training data has cutoff dates; Shepardize everything.
4. Quotation accuracy: If the AI puts text in quotation marks, verify word-for-word against the source. Fabricated quotes from real cases are the most common hallucination type in RAG systems.
5. Logical synthesis: Even when individual citations are accurate, the AI may combine them into a legal argument the cases don't actually support. Verify the reasoning chain, not just the components.
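If you want that checklist to live somewhere more durable than a sticky note, here is a minimal sketch of the five checks as a structured review record in Python. The class and field names are hypothetical, invented for this illustration; the design point is that an AI answer earns no trust until a human reviewer has cleared every flag against the primary sources.

```python
# Hypothetical review record for the five-point check; not a vendor API.

from dataclasses import dataclass, field

@dataclass
class CitationCheck:
    citation: str
    exists_in_reporter: bool = False  # 1. case confirmed in an actual reporter
    holding_matches: bool = False     # 2. opinion read; holding matches AI summary
    still_good_law: bool = False      # 3. Shepardized; not overruled or superseded
    quotes_verbatim: bool = False     # 4. quoted text matches source word-for-word

    def cleared(self) -> bool:
        # A citation clears only when every individual check has passed.
        return all((self.exists_in_reporter, self.holding_matches,
                    self.still_good_law, self.quotes_verbatim))

@dataclass
class ResearchReview:
    query: str
    checks: list[CitationCheck] = field(default_factory=list)
    synthesis_verified: bool = False  # 5. reasoning chain holds across the cases

    def ready_for_partner(self) -> bool:
        # Defaults are all False, so the output stays untrusted until proven.
        return (bool(self.checks)
                and all(c.cleared() for c in self.checks)
                and self.synthesis_verified)

if __name__ == "__main__":
    review = ResearchReview(query="Summary judgment standard")
    review.checks.append(CitationCheck(citation="Placeholder v. Example (invented)"))
    print(review.ready_for_partner())  # False until a human clears every flag
```

Defaulting every flag to False encodes the Stanford finding directly: the burden of proof sits on verification, not on the AI's confidence.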
The Bottom Line: Your legal AI vendor is selling you a tool with a documented 1-in-3 failure rate and calling it innovation. The Stanford data is clear: no legal AI platform is reliable enough to use without full human verification on every output. Treat AI research as a first draft from an unreliable summer clerk, not a finished work product.
AI-Assisted Research. This piece was researched and written with AI assistance, reviewed and edited by Manu Ayala. For deeper takes and the perspective behind the research, follow me on LinkedIn or email me directly.
