In the past twelve months, advances in technology that leverage artificial intelligence have come at staggering speed. On the tech side, there is a state of outright war among the giants: Microsoft, Google, OpenAI, Meta, and Anthropic. Meanwhile, on the legal side, Chief Justice Roberts urges “caution and humility” when using artificial intelligence in legal practice. Between these two extremes, the break-neck pace of tech and the cry for no pace at all, rests the attorney, trying to practice law in a competitive landscape.
For the hard-working practitioner, we break down recent claims of 100% hallucination-free legal tools put forth by the big law platforms. LexisNexis, for example, currently claims (as of June 2024) that its product is hallucination free, at least in terms of its “linked legal citations,” whatever that means. If true, hallucination-free output is great news for us all. Artificial intelligence can benefit all attorneys and help them reduce the stress and overhead of legal practice. But is it true? Is AI reliable? To help us answer that question, we look to a study conducted by Stanford that tackled this very question.
The Stanford Study: Unveiling the Truth About Legal AI
In a paper entitled “Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools,” researchers from Stanford and Yale evaluated several prominent legal AI tools, including Lexis+ AI, Westlaw AI-Assisted Research, and Thomson Reuters’ Ask Practical Law AI. Despite the advanced technology behind these tools, the findings reveal a concerning tendency for these AI systems to “hallucinate,” or generate incorrect or misleading information, with error rates ranging from 17% to 43%. This suggests that while these tools hold promise, they are not yet entirely reliable. It also raises questions about the assertions to the contrary from the companies that make these products.
Summary of the Stanford Study
The Stanford study focused on the performance of AI-driven legal research tools using retrieval-augmented generation (RAG) systems, which combine the power of large language models with legal databases to produce responses. The study involved a comprehensive, preregistered dataset of over 200 legal queries designed to test these tools’ accuracy and vulnerability to hallucinations.
The benchmarks used in the study were as follows (a small scoring sketch follows the list):
- Correctness: The factual accuracy of the response.
- Groundedness: The relationship between the model’s response and its cited sources.
- Hallucination: A response that is either incorrect or misgrounded.
- Responsiveness: The tool’s ability to provide complete and relevant answers to queries.
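To make these benchmarks concrete, here is a minimal sketch of how labeled query results could be rolled up into the rates reported in the study. The labels and field names are assumptions for illustration only; this is not the Stanford team’s actual evaluation code.

```python
from collections import Counter

# Hypothetical labels a reviewer might assign to each AI response.
# "correct"      -> factually accurate and grounded in the cited sources
# "hallucinated" -> incorrect or misgrounded (wrong holding, bad citation, etc.)
# "incomplete"   -> unresponsive or missing a complete answer
labeled_responses = [
    "correct", "hallucinated", "correct", "incomplete", "correct",
]

def benchmark_rates(labels: list[str]) -> dict[str, float]:
    """Compute accuracy, hallucination, and incomplete-response rates."""
    counts = Counter(labels)
    total = len(labels)
    return {label: counts[label] / total
            for label in ("correct", "hallucinated", "incomplete")}

print(benchmark_rates(labeled_responses))
# {'correct': 0.6, 'hallucinated': 0.2, 'incomplete': 0.2}
```

These are the same three rates reported for each tool in the table below.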
The results were presented in a comparative table, showcasing the performance of each legal AI tool:
| Tool | Accuracy Rate | Hallucination Rate | Incomplete Responses |
|---|---|---|---|
| Lexis+ AI | 65% | 17% | 18% |
| Westlaw AI-Assisted Research | 42% | 33% | 25% |
| Ask Practical Law AI | 20% | 17% | 62% |
| GPT-4 | 49% | 43% | 8% |
Findings and Discussion
The study highlighted several key findings. Lexis+ AI was the highest-performing tool, answering 65% of queries accurately. However, it still hallucinated in 17% of responses. The hallucinations identified were often subtle and required careful analysis to spot; they involved misunderstandings of case holdings, incorrect citations, and misinterpretations of legal roles and hierarchies. For example, Westlaw’s AI-Assisted Research (AI-AR) often provided lengthy responses, increasing the chance of including at least one hallucination, while Lexis+ AI and Ask Practical Law AI struggled with distinguishing between legal actors and respecting the hierarchy of legal authority.
Case Studies of Hallucinations
The study provided detailed examples of hallucinations across different tools. For instance, Westlaw’s AI-AR erroneously claimed that a specific provision in the Federal Rules of Bankruptcy Procedure stated that deadlines were jurisdictional, which was not true. Similarly, Lexis+ AI mistakenly asserted that the Telephone Consumer Protection Act of 1991 granted exclusive jurisdiction to federal courts over actions brought by state attorneys general, contrary to the Supreme Court’s decision in Mims v. Arrow Financial Services, which found concurrent state and federal court jurisdiction. Westlaw’s AI-Assisted Research, while also using a RAG system, had a significantly higher hallucination rate of 33%. Thomson Reuters’ Ask Practical Law AI, although relying on a specialized database of practical law documents, had the highest rate of incomplete responses at 62%.
What is a Hallucination?
Hallucination is a poor choice of terminology. Humans hallucinate. Computers do not. Referring to an error as a hallucination gives more credence to the fear-mongering instinct of the naysayers. Still, the term is now industry slang, and it carries several different definitions. In the context of AI legal research, the Stanford team defines “hallucinations” as instances where the AI generates incorrect or misleading information. This can occur in several forms, such as misinterpreting case holdings, confusing the arguments made by litigants with the court’s decision, disregarding legal hierarchies, or citing overruled or irrelevant cases. Essentially, a hallucination happens when the AI provides a response that is not factually accurate or properly grounded in authoritative legal sources, leading to potentially significant errors in legal analysis and decision-making.
LexisNexis’ claim that its AI tool delivers 100% hallucination-free output is based on a narrow definition of hallucinations, focusing primarily on the absence of fabricated legal citations. While the tool may avoid outright fabrications by linking to real legal documents, it can still produce hallucinations in a broader sense. For instance, the AI might cite a genuine case but misinterpret its holding, or provide an accurate citation while failing to acknowledge that the case has been overruled. By narrowly defining hallucinations, LexisNexis may overlook these subtler yet equally critical inaccuracies, thus overstating the reliability of its AI outputs.
For our purposes, hallucinations are defined as inaccuracies in several key areas.
Misinterpretation of Holdings
This occurs when the AI correctly cites an existing case but misunderstands or misrepresents its actual holding. For example, the AI might cite a case to support a legal proposition that the case does not actually endorse, leading to incorrect conclusions.
Failing to Distinguish Parties to the Controversy
Here the output correctly identifies the citation and the court but attributes a litigant’s argument to the court’s decision, which can cause significant confusion about the source of a legal principle. This error blurs the lines between the arguments presented by the parties involved and the judicial decisions made by the court.
Including Incorrect Legal Authority
In many instances, the output correctly summarizes the holding but attributes it to an incorrect source of legal authority, e.g., identifying the holding of a tax case, Clayton v. Commissioner, as codified in the tax code.
Confusion Over the Identity of the Court
This can occur in subtle ways: the narrative of the output suggests, for example, that a state court overruled a federal decision on a matter of federal law, an impossibility within the U.S. legal system. Such mistakes reveal a fundamental misunderstanding of the judicial hierarchy and the precedence of legal authority.
Meandering Legal Authority
Lastly, citing overruled or irrelevant cases is another form of hallucination. In this instance, the AI might reference real cases that are no longer considered good law or are not applicable to the specific legal question being addressed. This leads to outdated or incorrect legal advice, undermining the reliability of the AI’s output.
These examples illustrate the various ways in which AI can produce inaccurate and misleading information, emphasizing the need for careful oversight and validation of AI-generated legal research.
Correctness and Groundedness
Correctness refers to whether the AI’s response is factually accurate, while groundedness evaluates whether the response is backed by appropriate legal sources. Lexis+ AI had the highest accuracy rate, providing correct and grounded responses 65% of the time. In contrast, Westlaw AI-Assisted Research was accurate 42% of the time, and Ask Practical Law AI had an accuracy rate of just 20%.
One key area of concern identified was the AI’s tendency to misinterpret case holdings. For example, Westlaw AI-AR frequently misunderstood the actual holdings of cases, citing them correctly but misrepresenting their implications. In one instance, it claimed that the Federal Rules of Bankruptcy Procedure stated that certain deadlines were jurisdictional, a misinterpretation that could lead to significant legal errors.
To improve the reliability and accuracy of AI-powered legal research tools, several strategies can be employed, each addressing specific aspects of the AI’s functionality and output quality.
Implementing Verification Nodes to Check Citations
Verification nodes serve as a critical mechanism for quality assurance within AI systems. These nodes, essentially blocks of code, perform rigorous checks on the citations provided by the AI. While introducing these nodes may increase latency—meaning there might be a slight delay between a user’s query and the AI’s response—the trade-off is a significant boost in accuracy. By meticulously verifying each citation, the system can ensure that the references are valid, current, and correctly applied to the legal context.
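As a concrete illustration, here is a minimal sketch of what such a verification node might look like. The `CitationRecord` structure and the `lookup_citation` helper are hypothetical placeholders for whatever citator or case-law API a real system would use.

```python
from dataclasses import dataclass

@dataclass
class CitationRecord:
    """Hypothetical record returned by a citator lookup."""
    exists: bool          # the cited opinion is real
    still_good_law: bool  # no negative subsequent history (overruled, superseded)
    cited_for: str        # the proposition the opinion actually supports

def lookup_citation(citation: str) -> CitationRecord:
    """Placeholder for a call to a real citator or case-law database."""
    raise NotImplementedError("wire this to your citation database")

def verify_citations(answer: str, citations: list[str]) -> list[str]:
    """Return warnings for the attorney; an empty list means every check passed."""
    warnings = []
    for citation in citations:
        record = lookup_citation(citation)
        if not record.exists:
            warnings.append(f"{citation}: no such opinion found")
        elif not record.still_good_law:
            warnings.append(f"{citation}: negative subsequent history")
        elif record.cited_for.lower() not in answer.lower():
            # Crude textual check; a real node would use a semantic comparison.
            warnings.append(f"{citation}: cited proposition not clearly supported")
    return warnings
```

The extra lookup per citation is where the added latency comes from, but each warning gives the attorney a specific, reviewable reason to distrust the draft answer rather than a vague assurance of quality.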
Better Integration of Subsequent Case History
A more sophisticated approach to integrating subsequent case history can greatly enhance the AI’s legal reasoning. Rather than relying on a simple Boolean result indicating whether a court opinion is still good law, the AI should be capable of reviewing how various courts have treated the opinion over time. This deeper analysis includes examining the sentiment surrounding the opinion and how its holding has been applied in subsequent cases. By doing so, the AI can provide a more nuanced understanding of the opinion’s current relevance and impact, offering users a richer and more accurate legal analysis.
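A sketch of what that richer treatment signal might look like is below. The `CitingCase` fields and the weighting scheme are assumptions for illustration, not a description of any vendor’s actual pipeline.

```python
from dataclasses import dataclass

@dataclass
class CitingCase:
    """Hypothetical summary of how a later court treated the opinion."""
    court_level: int   # 3 = supreme court, 2 = appellate, 1 = trial
    treatment: str     # "followed", "distinguished", "criticized", "overruled"

TREATMENT_SCORES = {"followed": 1.0, "distinguished": 0.0,
                    "criticized": -0.5, "overruled": -2.0}

def treatment_signal(citing_cases: list[CitingCase]) -> float:
    """Weight each later treatment by the citing court's level and average the
    result, instead of collapsing subsequent history into a yes/no flag."""
    if not citing_cases:
        return 0.0
    weighted = [TREATMENT_SCORES[c.treatment] * c.court_level for c in citing_cases]
    return sum(weighted) / sum(c.court_level for c in citing_cases)

# Example: followed twice by appellate courts, criticized once at the trial level.
history = [CitingCase(2, "followed"), CitingCase(2, "followed"), CitingCase(1, "criticized")]
print(round(treatment_signal(history), 2))  # 0.7
```

Under this kind of scheme, a score near 1 suggests the opinion is still being followed by higher courts, while a low or negative score flags it for human review even if no court has formally overruled it.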
Adopting a Consensus-Based Approach Using Multiple Sources
Adopting a consensus-based approach that utilizes multiple sources can further improve the quality of AI-generated legal research. If we consider retrieval-augmented generation (RAG) as an initial step that refines the query before it is submitted to the large language model (LLM), we can also implement a post-output node for additional quality checks. This node would analyze the AI’s output, comparing it against several reliable sources, such as law review articles and relevant internet content. Given AI’s ability to process information rapidly, this comparative analysis can help ensure that the final output is not only accurate but also comprehensive and well-supported by diverse perspectives.
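Here is a minimal sketch of such a post-output consensus check, assuming a `retrieve_supporting_passages` function that pulls candidate passages from secondary sources (law reviews, treatises, reputable web content). Both the retrieval function and the simple lexical-overlap similarity are stand-ins; a production system would use its own retrieval stack and an embedding-based comparison.

```python
def retrieve_supporting_passages(claim: str) -> list[str]:
    """Placeholder: fetch candidate passages from a secondary-source index."""
    raise NotImplementedError("wire this to your law review / treatise index")

def jaccard(a: str, b: str) -> float:
    """Crude lexical overlap; a real system would use semantic similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def consensus_check(claims: list[str], threshold: float = 0.3,
                    min_sources: int = 2) -> dict[str, bool]:
    """Flag each claim in the draft answer as supported or not, based on how
    many independent sources agree with it above a similarity threshold."""
    results = {}
    for claim in claims:
        passages = retrieve_supporting_passages(claim)
        agreeing = sum(1 for p in passages if jaccard(claim, p) >= threshold)
        results[claim] = agreeing >= min_sources
    return results
```

Claims that fail the check could be returned to the user with a caveat, or routed back through the RAG step for another pass, rather than presented as settled law.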
Are You Net Ahead?
Until better safeguards are in place, until providers rein in overstated claims about quality, and until transparency improves, the concerns of Chief Justice Roberts seem valid. While these tools are impressive, do they really save a hard-working attorney time? Practice demands more than an output of court opinions that actually exist. These products are no longer just paywalled search tools for legal research; their value proposition is that the summary and analysis are also correct, and that is not yet entirely the case.
The Road Ahead
As we navigate this new era of AI-assisted legal research, the need for transparency, rigorous testing, and human oversight has never been clearer. While these tools offer immense potential, they are not infallible. Legal professionals must remain vigilant, verifying key propositions and citations to ensure the integrity of their work.
What are your thoughts on AI in legal research? Have you had experiences with these tools? Share in the comments below!