Evaluating AI Models for Legal Reasoning: How Do Large Language Models Stack Up?

by Darol Tuttle

Darol is the founder of LegalEdge Innovators, a practicing attorney in the areas of estate planning and elder law, and the founder of BoomX Academy, home of the BoomX Show: Laws of Money

In a comprehensive evaluation of AI models for legal reasoning, we compared the proprietary models ChatGPT-4o and Gemini Flash against the open-source model Llama 3. The evaluation involved complex legal scenarios to test the models' abilities in issue spotting, tax calculations, and estate planning. ChatGPT-4o demonstrated stellar performance, accurately identifying and analyzing all 21 issues presented. Gemini Flash, while competent in some areas, showed notable shortcomings in handling nuanced legal concepts, highlighting the need for further improvement through techniques such as Retrieval-Augmented Generation (RAG) and fine-tuning. Llama 3 showed potential but lagged behind in depth and accuracy. This study underscores the critical role of issue spotting in legal analysis and the potential for AI models to assist in this domain, with proprietary models offering more robust support and performance compared to their open-source counterparts.

LegalEdge Innovators is a forward-thinking tech company dedicated to helping lawyers navigate the complexities of the rapidly evolving landscape of technological development and AI breakthroughs. As part of our mission to evaluate cutting-edge technology and its application to legal services, we pitted three large language models against one another, subjecting each to bar exam questions of increasing complexity to see how they stack up.

This project focused on 21 possible legal issues in the area of estate tax. We asked OpenAI’s ChatGPT-4o, Google’s Gemini Flash, and Meta’s Llama 3 70B Instruct not only to spot issues but also to calculate tax liabilities and make recommendations. This article details our evaluation process, the benchmarks used, the models tested, and the results, emphasizing the number of issues spotted by each model.

Key Points

  • ChatGPT-4o and Llama 3 70B each correctly identified 21 out of 21 possible legal issues from minimal facts. Both applied complicated tax rules from state and federal codes, performed math calculations to determine the tax liability, and adjusted to increasingly complex fact patterns. ChatGPT-4o also drafted a proposed trust provision that correctly used a Clayton election as the funding formula for a NonMarital and Marital Trust split.
  • Gemini Flash performed the worst. It identified only 13 of 21 issues and avoided or refused to respond in 14 instances.
  • Gemini and ChatGPT-4o are proprietary models, owned by Google and OpenAI respectively (Microsoft has a substantial ownership interest in OpenAI). Meta, owner of Facebook and Instagram, released Llama 3 as open source. (You can explore Meta’s AI initiative and test its models by visiting Meta AI.) Proprietary models charge a fee for access, while open-source models do not.

About the LMSYS Project

To pull this off, we used a great chatbot arena provided by the research project at Large Model Systems Organization (LMSYS Org).  LMSYS Org trains large language models and makes them widely available while also developing distributed systems to accelerate their training and inference.  

LMSYS also ranks each model on a leaderboard. The three models we tested all ranked in the top 11. Our first model, ChatGPT-4o, ranked the highest overall, demonstrating superior performance in legal reasoning and issue spotting. Gemini Flash ranked 9th, showing solid capabilities but with room for improvement in certain areas. Llama 3 ranked 11th, indicating potential but falling behind in comparison to the other models tested.

Our informal testing differed from the ranking published on the LMSYS leaderboard: we ranked Gemini the worst by a large margin. We also conclude that ChatGPT-4o and Llama 3 70B performed equally well and would not rank one over the other. One possible explanation is that the evaluation criteria for LMSYS’ “hard prompts” were specificity, domain knowledge, complexity, problem-solving, creativity, technical accuracy, and real-world application. The LMSYS project was meant to evaluate models in the context of complex prompting from the medical, accounting, and legal services industries.

Our evaluation, on the other hand, was specific to the legal services industry. Rather than rate “creativity,” we simply evaluated responses as a bar examiner would when grading bar exam answers, i.e., did the applicant spot the legal issues, frame each issue correctly, analyze the application of law to fact, and reach a conclusion?

The Base Fact Pattern

“I am married and live in Chicago. Our net worth is $6 million. Of that, assets worth $3 million are located in Illinois. The rest, $3 million are located in Washington, to include a summer vacation home worth $1.5 million and a $500,000 non-qualified brokerage account inside a Washington state governed living trust. The assets in Chicago include a personal residence worth $1 million and the rest are assets in my IRA. The Washington state property is all community property pursuant to a community property agreement. The Illinois residence is titled in my name as is the IRA. We have no estate planning documents at all.

If I die later today, and my wife dies tomorrow, what taxes, if any, will be due. Tell me the type of tax due, e.g., income tax, capital gains tax, inheritance tax, estate tax, property tax, the amount due for each tax, and the government agency responsible for collecting the tax.”

 Key areas of focus included:

  • Community Property Rules: Understanding how these rules differ between Washington and Illinois.

  • State-Specific Estate Taxes: Recognizing the different estate tax exemptions in Illinois and Washington, both of which are among the twelve states with their own estate tax.

  • Tax Basis Adjustments: Determining when Sections 1014-16 apply to allow for an adjusted tax basis under different scenarios.

  • STUB Rule Application: Comparing how the STUB rule works in Washington versus Illinois.

  • Credit Shelter Trusts: Evaluating whether a Credit Shelter Trust can have a second step-up in basis.

  • QTIP Trust Mechanics: Understanding the mechanics and implications of a Qualified Terminable Interest Property (QTIP) trust.

  • Estate Tax Exemptions: Applying estate tax exemptions properly and calculating tax liability using graduated tax rates above the exemption amount.

  • Federal Estate Exemptions: Identifying differences in federal estate exemption amounts.

  • Funding Formulas: Assessing funding formulas in an estate plan when marital and nonmarital shares are split.

  • Legal Authority for Allocation: Determining whether there was legal authority to allow the personal representative to allocate or fund each share after the first spouse died.

By incorporating these elements and testing specific scenarios, the evaluation aimed to thoroughly assess the models’ abilities to handle complex legal and tax issues accurately and effectively.

Results

In an effort to rigorously evaluate the capabilities of various AI models in handling complex legal and tax issues, a detailed fact pattern was developed. This fact pattern evolved progressively, adding layers of complexity to test each model’s knowledge and application across a wide range of legal concepts and jurisdictional differences.

One of the primary areas of focus was the understanding and application of community property rules, which differ significantly between Washington and Illinois. The fact pattern required the models to discern these differences and apply the correct rules based on the state-specific context. Additionally, the models were tested on their ability to recognize and apply state-specific estate tax exemptions. Both Illinois and Washington are among the twelve states that impose their own estate tax, each with unique exemption amounts and graduated tax rates.

The evaluation also delved into the intricate details of federal estate tax laws, specifically the varying federal estate exemption amounts. The models needed to determine the appropriate application of Sections 1014-16 of the Internal Revenue Code to allow for an adjusted tax basis under different scenarios. This included understanding the step-up in basis rules and how they impact the tax calculations for inherited assets.

Another critical aspect of the evaluation was the application of the STUB rule, comparing how it functions in Washington versus Illinois. This required the models to navigate the nuances of state-specific tax regulations and their implications on estate planning.

The mechanics of various trusts were also put to the test. For instance, the evaluation examined whether a Credit Shelter Trust could receive a second step-up in basis and the detailed workings of a Qualified Terminable Interest Property (QTIP) trust. Understanding these trusts’ mechanisms and their tax implications was essential for accurate estate planning.

Moreover, the fact pattern included scenarios that required the models to apply estate tax exemptions correctly and to understand and apply funding formulas in estate plans, particularly when marital and nonmarital shares are split. Llama 3 and ChatGPT-4o also correctly identified that, after Clayton v. Commissioner and subsequent authority, a Personal Representative can be given discretion to allocate between shares after the death of the Testator. Both correctly cited uniform codes and case law to support this proposition. ChatGPT-4o went further than prompted and drafted proposed language to insert into a trust to give the Personal Representative these powers.

Issues Spotted

| Issue | ChatGPT-4o | Gemini | Llama 3 |
| --- | --- | --- | --- |
| Federal Estate Tax Liability | True | True | True |
| Illinois State Estate Tax Liability | True | True | True |
| Washington State Estate Tax Liability | True | True | True |
| Income Tax Implications | True | True | True |
| Capital Gains Tax Implications | True | True | True |
| Property Tax Implications | True | True | True |
| IRA Basis | True | False | True |
| Inheritance Tax Implications | True | True | True |
| Step-up in Basis for Assets | True | True | True |
| Credit Shelter Trust Usage | True | True | True |
| QTIP Trust Usage | True | True | True |
| Marital Deduction Application | True | True | True |
| Community Property Considerations | True | False | True |
| Funding Formula for Trusts | True | False | True |
| Discretionary Trust Powers | True | False | True |
| Clayton Election Use | True | False | True |
| Tax Basis of Inherited Assets | True | False | True |
| Post-Mortem Planning | True | False | True |
| Exemption Amounts | True | True | True |
| Advanced Tax-Saving Trusts | True | True | True |
| Tax Savings Analysis | True | True | True |
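
For readers who want to check the per-model totals, the table above can be tallied with a few lines of Python. This is only a sketch: the `RESULTS` dictionary and the `tally` and `missed` helpers are illustrative names rather than part of our test setup, and the data simply transcribes the table as printed.

```python
# Each row transcribes the "Issues Spotted" table as printed:
# issue -> (ChatGPT-4o, Gemini, Llama 3)
MODELS = ("ChatGPT-4o", "Gemini", "Llama 3")

RESULTS = {
    "Federal Estate Tax Liability": (True, True, True),
    "Illinois State Estate Tax Liability": (True, True, True),
    "Washington State Estate Tax Liability": (True, True, True),
    "Income Tax Implications": (True, True, True),
    "Capital Gains Tax Implications": (True, True, True),
    "Property Tax Implications": (True, True, True),
    "IRA Basis": (True, False, True),
    "Inheritance Tax Implications": (True, True, True),
    "Step-up in Basis for Assets": (True, True, True),
    "Credit Shelter Trust Usage": (True, True, True),
    "QTIP Trust Usage": (True, True, True),
    "Marital Deduction Application": (True, True, True),
    "Community Property Considerations": (True, False, True),
    "Funding Formula for Trusts": (True, False, True),
    "Discretionary Trust Powers": (True, False, True),
    "Clayton Election Use": (True, False, True),
    "Tax Basis of Inherited Assets": (True, False, True),
    "Post-Mortem Planning": (True, False, True),
    "Exemption Amounts": (True, True, True),
    "Advanced Tax-Saving Trusts": (True, True, True),
    "Tax Savings Analysis": (True, True, True),
}


def tally(results):
    """Count how many issues each model spotted."""
    counts = {model: 0 for model in MODELS}
    for spotted in results.values():
        for model, hit in zip(MODELS, spotted):
            counts[model] += int(hit)
    return counts


def missed(results, model):
    """List the issues a given model did not spot."""
    idx = MODELS.index(model)
    return [issue for issue, spotted in results.items() if not spotted[idx]]


if __name__ == "__main__":
    for model, count in tally(RESULTS).items():
        print(f"{model}: {count} of {len(RESULTS)} issues spotted")
    print("Gemini missed:", ", ".join(missed(RESULTS, "Gemini")))
```

Because the counts come straight from the table as printed, they are worth cross-checking against the totals quoted in Key Points.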

ChatGPT-4o: Stellar Performance in Legal Reasoning

In the rigorous evaluation of various AI models, ChatGPT-4o emerged as the top performer, demonstrating exceptional accuracy and understanding in handling complex legal and tax issues. Here, we highlight specific examples of ChatGPT-4o’s stellar performance, with a particular focus on its precise handling of funding formulas and Clayton election questions.

Example 1: Correct Application of Community Property Rules

One of the scenarios required the model to apply community property rules in Washington, which differ significantly from those in Illinois.

Scenario: A client and their spouse, residing in Washington, jointly own a home purchased during their marriage. The client also has an IRA valued at $500,000 and other separate property acquired before the marriage.

ChatGPT-4o’s Response: ChatGPT-4o correctly identified the home as community property since it was acquired during the marriage. It also correctly classified the IRA and the other assets acquired before the marriage as separate property, adhering to Washington’s community property rules.

Explanation: In Washington, assets acquired during the marriage are considered community property, while assets acquired before the marriage are separate property. ChatGPT-4o demonstrated a clear understanding of these distinctions, providing accurate classifications and recommendations.
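
The classification rule described above can be sketched in a few lines of Python. This is a deliberate simplification: it looks only at when an asset was acquired relative to the marriage date and ignores gifts, inheritances, commingling, and community property agreements. The `Asset` class, the `classify` function, and every date below are hypothetical and chosen only to mirror the scenario.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Asset:
    name: str
    acquired: date  # hypothetical acquisition date


def classify(asset: Asset, marriage_date: date) -> str:
    """Simplified rule: acquired on or after the marriage date -> community."""
    return "community" if asset.acquired >= marriage_date else "separate"


# Hypothetical dates: the home is bought during the marriage,
# the IRA is assumed to have been funded before it.
marriage = date(2000, 6, 1)
assets = [
    Asset("home", date(2005, 3, 15)),
    Asset("IRA", date(1995, 1, 10)),
]

labels = {asset.name: classify(asset, marriage) for asset in assets}

if __name__ == "__main__":
    for name, label in labels.items():
        print(f"{name}: {label} property")
```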

Example 2: Accurate Calculation of State Estate Tax

The evaluation also involved calculating the state estate tax liability in Illinois, which has its own estate tax system with specific exemption amounts and graduated tax rates.

Scenario: A client with a total estate valued at $5 million needs to know the estate tax liability in Illinois, which has a $4 million exemption and a graduated tax rate.

ChatGPT-4o’s Response: ChatGPT-4o accurately calculated the estate tax by applying Illinois’ graduated rates above the $4 million exemption. It provided a detailed breakdown of the tax liability, ensuring the client had a clear understanding of the tax owed.

Explanation: The model correctly applied Illinois’ graduated tax rates to the portion of the estate exceeding the $4 million exemption, resulting in an accurate tax calculation. This demonstrated ChatGPT-4o’s proficiency in handling state-specific tax regulations.
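
To make the arithmetic concrete, here is a minimal Python sketch of a graduated-rate calculation on the amount above an exemption. The $5 million estate and $4 million exemption come from the scenario; the bracket thresholds and rates in `HYPOTHETICAL_BRACKETS` are invented for illustration and are not Illinois’ actual schedule, and the function name is our own.

```python
import math

# HYPOTHETICAL schedule, for illustration only (not Illinois' actual rates).
# Each entry: (upper bound of the bracket in dollars above the exemption, rate in percent)
HYPOTHETICAL_BRACKETS = [
    (500_000, 10),
    (1_000_000, 12),
    (math.inf, 16),
]


def graduated_estate_tax(estate, exemption, brackets):
    """Tax only the amount above the exemption, one bracket at a time."""
    excess = max(0, estate - exemption)
    tax = 0.0
    lower = 0
    for upper, rate_pct in brackets:
        if excess <= lower:
            break  # nothing left to tax in this or any higher bracket
        portion = min(excess, upper) - lower  # dollars falling in this bracket
        tax += portion * rate_pct / 100
        lower = upper
    return tax


# Scenario figures: $5 million estate, $4 million exemption.
tax = graduated_estate_tax(5_000_000, 4_000_000, HYPOTHETICAL_BRACKETS)

if __name__ == "__main__":
    print(f"Tax under the hypothetical schedule: ${tax:,.0f}")
```

With these placeholder brackets, the first $500,000 above the exemption is taxed at 10% and the next $500,000 at 12%, so each slice is taxed at its own rate rather than the whole excess at a single rate.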

Example 3: Funding Formulas in Estate Plans

One of the more complex aspects of estate planning involves determining the appropriate funding formulas when splitting marital and nonmarital shares.

Scenario: A client needs guidance on how to split assets into marital and nonmarital shares in their estate plan, ensuring proper funding of each share.

ChatGPT-4o’s Response: ChatGPT-4o provided a detailed explanation of various funding formulas, including the pecuniary and fractional share methods. It recommended the most suitable method based on the client’s specific circumstances, ensuring the marital and nonmarital shares were funded accurately.

Explanation: ChatGPT-4o’s ability to clearly explain and recommend appropriate funding formulas demonstrated its deep understanding of estate planning. The model’s guidance ensured the client’s estate plan was both compliant and optimized for tax efficiency.

Example 4: Understanding Clayton Election

The Clayton election allows for flexibility in determining the allocation of assets between a marital trust and a bypass trust after the first spouse’s death.

Scenario: A client wants to know if their personal representative can decide how to allocate assets between a marital trust and a bypass trust using the Clayton election.

ChatGPT-4o’s Response: ChatGPT-4o accurately explained the Clayton election, detailing how it permits the personal representative to allocate assets between the trusts posthumously. It also highlighted the legal authority underpinning this flexibility, ensuring the client understood the benefits and implications of using the Clayton election.

Explanation: The model’s thorough understanding of the Clayton election and its ability to convey this complex concept clearly and accurately showcased its superior legal reasoning capabilities. By highlighting the legal authority and practical applications, ChatGPT-4o provided invaluable guidance for estate planning.

ChatGPT-4o’s performance in the evaluation was exemplary, consistently demonstrating a deep understanding of complex legal and tax issues. Its precise handling of community property rules, accurate state estate tax calculations, comprehensive guidance on funding formulas, and clear explanation of the Clayton election highlighted its proficiency in legal reasoning. These examples underscore ChatGPT-4o’s potential as a reliable and sophisticated tool for legal analysis, capable of providing accurate and insightful recommendations in even the most intricate scenarios.

Llama 3 70b Instruct Equal to the Task

In the evaluation of various AI models, Llama 3 demonstrated a mastery of legal reasoning in several key areas, despite ranking lower on the LMSYS leaderboard than models like ChatGPT-4o. Here are specific examples of Llama 3’s notable performances:

Example 1: Correct Application of Community Property Rules

Llama 3 showed a solid understanding of community property rules in a scenario involving the classification of assets in Washington.

Scenario: A client and their spouse, residing in Washington, jointly own a home purchased during their marriage. The client also has an IRA valued at $500,000 and other separate property acquired before the marriage.

Llama 3’s Response: Llama 3 correctly classified the home as community property, given it was acquired during the marriage. It also accurately identified the IRA and other assets acquired before the marriage as separate property.

Explanation: Llama 3 demonstrated a clear understanding of Washington’s community property rules, distinguishing correctly between community and separate property. This accurate classification ensured appropriate handling of asset division in estate planning.

Example 2: Accurate Calculation of State Estate Tax in Illinois

Llama 3 also showed proficiency in calculating the estate tax liability under Illinois state laws.

Scenario: A client with a total estate valued at $5 million seeks to understand the estate tax liability in Illinois, which has a $4 million exemption and a graduated tax rate.

Llama 3’s Response: Llama 3 correctly calculated the estate tax by applying Illinois’ graduated rates to the portion of the estate exceeding the $4 million exemption. It provided a step-by-step breakdown of the calculation, ensuring clarity and accuracy.

Explanation: The model’s precise calculation of Illinois estate tax, using the correct exemption and graduated rates, demonstrated its capability to handle state-specific tax regulations accurately.

Example 3: Explanation of QTIP Trust Mechanics

One of the important aspects tested was the understanding of the mechanics and implications of a Qualified Terminable Interest Property (QTIP) trust.

Scenario: A client wants to set up a QTIP trust to provide for their spouse while preserving the principal for their children.

Llama 3’s Response: Llama 3 accurately explained that a QTIP trust allows the surviving spouse to receive income from the trust during their lifetime, with the principal remaining intact for the children after the spouse’s death. It highlighted the benefits of using a QTIP trust for estate tax purposes and ensured that the client’s objectives were met.

Explanation: Llama 3’s clear and accurate explanation of QTIP trust mechanics showcased its understanding of complex trust structures and their tax implications, providing valuable guidance for estate planning.

Example 4: Correct Identification of Federal Estate Tax Exemptions

Llama 3 was able to accurately identify and apply federal estate tax exemptions in a given scenario.

Scenario: A client with a substantial estate needs to know the applicable federal estate tax exemption and how it affects their overall tax liability.

Llama 3’s Response: Llama 3 correctly identified the current federal estate tax exemption amount and applied it to the client’s estate to determine the taxable portion. It provided a detailed explanation of how the exemption works and its impact on the estate tax calculation.

Explanation: By accurately identifying and applying the federal estate tax exemption, Llama 3 demonstrated its capability to handle federal tax regulations effectively. This ensured the client had a clear understanding of their estate tax obligations.

While Llama 3 ranked lower overall compared to other models like ChatGPT-4o, it demonstrated notable performances in several key areas. Its correct application of community property rules, accurate calculation of state estate taxes, clear explanation of QTIP trust mechanics, and proper identification of federal estate tax exemptions highlight its potential in handling complex legal and tax issues. These examples underscore Llama 3’s ability to provide accurate and insightful recommendations in specific scenarios, showcasing its strengths despite the need for further development in other areas.

| Question | Llama 3’s Response | ChatGPT-4o’s Response |
| --- | --- | --- |
| Application of Community Property Rules | Correctly classified the home as community property. Correctly identified the IRA and other assets acquired before the marriage as separate property. | Correctly classified the home as community property. Correctly identified the IRA and other assets acquired before the marriage as separate property. |
| Calculation of State Estate Tax in Illinois | Correctly calculated the estate tax by applying Illinois’ graduated rates to the portion of the estate exceeding the $4 million exemption. Provided a step-by-step breakdown of the calculation. | Correctly calculated the estate tax using Illinois’ graduated rates above the $4 million exemption. Provided a detailed breakdown of the tax liability, ensuring clarity and accuracy. |
| Explanation of QTIP Trust Mechanics | Explained that a QTIP trust allows the surviving spouse to receive income from the trust during their lifetime, with the principal remaining intact for the children after the spouse’s death. Highlighted the benefits for estate tax purposes. | Detailed the mechanics of a QTIP trust, including how it provides income to the surviving spouse while preserving the principal for the children. Emphasized the estate tax benefits and the specific legal provisions involved. |
| Federal Estate Tax Exemptions | Correctly identified the current federal estate tax exemption amount. Applied the exemption to the client’s estate to determine the taxable portion. Provided a basic explanation of how the exemption works. | Accurately identified the federal estate tax exemption amount. Applied the exemption in the estate tax calculation, detailing the impact on the client’s tax liability. Provided a thorough explanation of the exemption and its implications. |
| Understanding of Funding Formulas in Estate Plans | Briefly mentioned the pecuniary and fractional share methods. Provided a general recommendation without much detail. | Provided a detailed explanation of various funding formulas, including the pecuniary and fractional share methods. Recommended the most suitable method based on the client’s specific circumstances, ensuring accurate funding of marital and nonmarital shares. |
| Clayton Election | Provided a basic explanation of the Clayton election but lacked detail on its practical application and legal authority. | Accurately explained the Clayton election, detailing how it allows the personal representative to allocate assets between the trusts posthumously. Highlighted the legal authority underpinning this flexibility, ensuring the client understood the benefits and implications. |

Analysis

  • Llama 3:
    • Showed competence in understanding and applying basic legal concepts.
    • Provided correct answers but often lacked depth and detailed explanations.
    • Struggled with providing comprehensive recommendations and failed to address certain complexities.
  • ChatGPT-4o:
    • Consistently provided detailed, accurate, and comprehensive answers.
    • Demonstrated a deep understanding of complex legal and tax issues.
    • Excelled in explaining intricate concepts clearly and thoroughly, particularly in areas requiring detailed legal knowledge such as funding formulas and Clayton elections.

Gemini Flash: Performance Evaluation and Specific Examples of Shortcomings

The evaluation of various AI models on their ability to handle complex legal and tax issues revealed significant discrepancies in performance. Among the models tested, Gemini Flash, ranked 9th overall, demonstrated notable shortcomings. Here, we delve into specific examples that highlight where Gemini Flash either provided incorrect information or was non-responsive.

Example 1: Misapplication of Community Property Rules

One of the critical areas tested was the application of community property rules in Washington compared to Illinois. In one scenario, the fact pattern required the model to determine how assets acquired during the marriage should be classified and divided upon the client’s death.

Scenario: A client and their spouse, residing in Washington, jointly own a home purchased during their marriage. The client also has an IRA valued at $500,000 and other separate property acquired before the marriage.

Gemini’s Response: Gemini Flash incorrectly classified all assets as community property, including the IRA and pre-marriage separate property. It failed to distinguish between community property and separate property based on the state’s rules.

Correct Application: In Washington, only assets acquired during the marriage are considered community property, while assets acquired before the marriage remain separate property. The IRA, if funded before the marriage, should be classified as separate property.

Example 2: Incorrect Calculation of State Estate Tax

Another test involved calculating the state estate tax liability in Illinois, which has its own estate tax system with specific exemption amounts and graduated tax rates.

Scenario: A client with a total estate valued at $5 million needs to know the estate tax liability in Illinois, which has a $4 million exemption and a graduated tax rate.

Gemini’s Response: Gemini Flash incorrectly calculated the estate tax by applying a flat rate instead of the graduated rates above the exemption amount. It suggested a significantly lower tax liability than would actually be owed.

Correct Calculation: The correct approach would involve calculating the tax on the amount above the $4 million exemption using Illinois’ graduated tax rates, leading to a higher and more accurate tax liability.

Example 3: Non-responsiveness on the STUB Rule

The evaluation also tested the models’ understanding of the STUB rule, which affects the tax basis of inherited property in different states.

Scenario: A client needs to know how the STUB rule applies to a property inherited in Washington.

Gemini’s Response: Gemini Flash was non-responsive, providing a vague answer about general property inheritance rules without addressing the specific application of the STUB rule in Washington. It failed to explain the concept or its implications on the tax basis.

Correct Explanation: The STUB rule in Washington provides for a stepped-up basis at the date of death for inherited property, meaning the beneficiary’s tax basis in the inherited property is its fair market value at the date of the decedent’s death. This adjustment can significantly impact capital gains taxes upon the subsequent sale of the property.

Example 4: Misunderstanding of Credit Shelter Trust Mechanics

A key aspect of estate planning involves understanding the mechanics of a Credit Shelter Trust and whether it receives a second step-up in basis.

Scenario: A client is setting up a Credit Shelter Trust to maximize estate tax exemptions and asks whether the trust’s assets receive a second step-up in basis upon the surviving spouse’s death.

Gemini’s Response: Gemini Flash incorrectly stated that assets in a Credit Shelter Trust would receive a second step-up in basis upon the surviving spouse’s death.

Correct Understanding: Assets in a Credit Shelter Trust do not receive a second step-up in basis at the surviving spouse’s death because they are not included in the surviving spouse’s estate for estate tax purposes. The assets receive a step-up in basis only at the first spouse’s death.
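
The basis consequence described above can be illustrated with a short Python sketch. It assumes, as a simplification, that an asset’s basis resets to fair market value only at a death in which the asset is included in the decedent’s estate. All dollar figures and function names are hypothetical.

```python
def basis_after_second_death(fmv_first_death, fmv_second_death,
                             included_in_survivor_estate):
    """Return the heirs' basis after both spouses have died.

    Simplifying assumption: the asset stepped up to fair market value at the
    first death, and it steps up again only if it is included in the
    surviving spouse's estate.
    """
    if included_in_survivor_estate:
        return fmv_second_death
    return fmv_first_death


def taxable_gain(sale_price, basis):
    """Gain on a later sale is the sale price minus basis."""
    return sale_price - basis


# Hypothetical values: worth $1.0M at the first death, $1.4M at the
# second death, and later sold by the children for $1.5M.
fmv_first, fmv_second, sale = 1_000_000, 1_400_000, 1_500_000

# Credit Shelter Trust asset: not included in the survivor's estate.
credit_shelter_basis = basis_after_second_death(fmv_first, fmv_second, False)
# Comparison case: an asset that IS included in the survivor's estate.
included_basis = basis_after_second_death(fmv_first, fmv_second, True)

gain_credit_shelter = taxable_gain(sale, credit_shelter_basis)
gain_included = taxable_gain(sale, included_basis)

if __name__ == "__main__":
    print(f"Credit Shelter Trust asset gain: ${gain_credit_shelter:,}")
    print(f"Asset included in survivor's estate gain: ${gain_included:,}")
```

Under these made-up numbers, the Credit Shelter Trust asset carries the larger taxable gain because its basis stayed at the first-death value.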

Gemini Flash’s performance in the evaluation highlighted several critical shortcomings. It often provided incorrect information or failed to respond adequately to specific legal and tax questions. These inaccuracies, particularly in the application of community property rules, state estate tax calculations, and the mechanics of trusts, underscore the need for further development and refinement of the model. While Gemini Flash showed potential in some areas, its frequent missteps and non-responsiveness limit its reliability in handling complex legal scenarios. This evaluation emphasizes the importance of using robust benchmarks and rigorous testing to assess and improve the capabilities of AI models in specialized domains like law.

Conclusion

Our evaluation revealed that ChatGPT-4o and Llama 3 70B Instruct both spotted all 21 legal issues, while Gemini Flash trailed well behind. ChatGPT-4o also provided the most detailed analysis of the complex legal scenarios, making it the most reliable model for legal reasoning among those tested.

By focusing on traditional legal analysis methodologies and excluding creativity, our approach provided a thorough evaluation of each model’s capabilities. These findings highlight the potential of AI in assisting with legal analysis and underscore the importance of using the right benchmarks for evaluation.
