Risks of Gen AI: accuracy and reliability

Tell me in two minutes

The number of organisations adopting AI in some form has surged over the past year.
However, one of the most significant issues facing organisations in deploying AI (and especially generative AI) is inaccuracy, with one McKinsey survey reporting that 23% of respondents had experienced negative consequences from generative AI’s inaccuracy.
There are a range of reasons why LLMs in particular fail, including their statistical nature, limited generalisation skills, non-deterministic outputs and issues with the quality and currency of their training data.
Despite the development of techniques like retrieval augmented generation that seek to minimise inaccuracy, the solution to the problem has so far eluded the best minds in the world of AI.
Accuracy issues can lead to numerous consequences, including loss of customers, damage to brand and wasted time and expense in developing, testing and deploying unfit AI tools. They can also lead to legal consequences including under consumer law, contract law, discrimination law, defamation law and administrative law.
In deploying AI tools, organisations will need to consider the risk of inaccuracy (and related issues of reliability and robustness) and take appropriate mitigating steps.

This article is part of KWM’s series on the risks of GenAI and examines the issues of accuracy and reliability. Find more articles in this series here.

Accuracy and reliability of LLMs

LLMs at a glance

LLMs, like those underlying chatbots such as ChatGPT, Bing and Gemini, are statistical models. They are initially trained to consider a passage of text and predict what comes next, by assigning a probability to each possible next token (often a word or sub-word) in the model’s vocabulary. The currently dominant architecture for LLMs is autoregressive, which means that once the next token is selected, the entire passage (including both the original input and the new token) is passed back through the LLM to produce the following token (and so forth).
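For readers who find code easier to follow than prose, the loop described above can be sketched in a few lines of Python. This is a toy illustration only: the vocabulary and probabilities in NEXT_TOKEN_PROBS are invented, and the hypothetical predict_next function looks only at the last token, whereas a real LLM conditions on the entire passage.

```python
# Toy next-token table: for each token, a probability for each possible next token.
# The vocabulary and the numbers are invented purely for illustration.
NEXT_TOKEN_PROBS = {
    "the": {"contract": 0.6, "clause": 0.4},
    "contract": {"is": 0.7, "<end>": 0.3},
    "clause": {"is": 0.8, "<end>": 0.2},
    "is": {"void": 0.55, "valid": 0.45},
    "void": {"<end>": 1.0},
    "valid": {"<end>": 1.0},
}


def predict_next(passage):
    """Return a probability for every candidate next token.

    Assumption: this toy model looks only at the last token.
    A real LLM would take the whole passage into account.
    """
    return NEXT_TOKEN_PROBS[passage[-1]]


def generate(prompt, max_tokens=10):
    """Autoregressive loop: predict a token, append it, feed the passage back in."""
    passage = list(prompt)
    for _ in range(max_tokens):
        probs = predict_next(passage)
        next_token = max(probs, key=probs.get)  # pick the most likely token
        if next_token == "<end>":
            break
        passage.append(next_token)  # the extended passage becomes the next input
    return passage


print(" ".join(generate(["the"])))  # prints "the contract is void"
```

Note that the toy model completes the sentence with ‘void’ simply because that token carries the highest probability in its table, not because it has assessed any contract.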

After initial training, the LLM is then further trained by a variety of techniques, including supervised learning and reinforcement learning from human feedback, to make its responses more helpful and less harmful.

These LLMs are then incorporated into products such as chatbots, which might include additional features such as user interfaces, content filters and other tools such as the ability to search the internet.

Accuracy and reliability limitations

LLMs have a troubled relationship with the truth and are prone to confidently confabulate (commonly referred to as “hallucinating”). Most lawyers will now be familiar with Mata v. Avianca in which a US attorney relied on outputs of ChatGPT that contained fake cases, which the lawyer ultimately submitted to a US federal court.

Sidebar

LLM HALLUCINATIONS AND LEGAL HANGOVERS

A 'hallucination' refers to LLM outputs that are false, inaccurate or otherwise not grounded in reality. This presents a range of potential challenges from a legal and risk perspective. If inaccurate information is used for commercial or legal purposes, it can lead to incorrect advice, poor decision-making, mistrust and potentially serious legal consequences. As a result, it is imperative for businesses to consider hallucination risks when using large language models to generate content and to mitigate those risks accordingly.

However, hallucinations are just one example of the issues associated with LLMs. Many errors in LLM outputs will not be so obvious. The AI Risk Management Framework published by the US National Institute of Standards and Technology helpfully draws attention to three related concepts:

Concept	Definition	Explanation
Accuracy	How close the outputs of an AI system are to the true values. Measurements of accuracy can take into account measures such as rates of false positives and false negatives.	For example, in a legal context, if a lawyer seeks to use an LLM to provide a list of potential issues with a contract, inaccuracy might involve not only outputs that incorrectly interpret the contract (false positives) but also outputs that fail to identify a significant issue (false negatives). The false negative problem is particularly troublesome. In undertaking a contract review, missing a key clause could be disastrous and, while it is often easy to fact-check what an LLM has output, it is not so simple to determine what has been missed.
Robustness	The ability of a system to maintain its level of performance under a variety of circumstances.	Robustness is a key challenge with LLMs. While they can produce good results for many questions, they often fail unexpectedly. Sometimes a slightly novel, complex or unexpected example can expose the shallowness of their reasoning. Sometimes even seemingly innocuous and trivial changes to the prompt can be the difference between a right and wrong answer. This presents a large challenge when moving from limited testing to wide-scale deployment of LLMs in the messy real world.
Reliability	A goal for overall correctness of AI system operation under the conditions of expected use and over a given period of time.	Reliability over time is an issue for LLMs in two key ways. First, LLMs are often used in a non-deterministic mode, meaning that they can produce different outputs from the same input. An LLM-based tool might therefore be right the first 10 times it is tested, but then fail on attempt 11. Second, the underlying models of many commonly used LLM products are regularly updated or fine-tuned, which often changes results noticeably. For example, many developers noticed that one update of OpenAI’s GPT-4 model seemed to make it more ‘lazy’. As a result, although an LLM-based tool might work on the day it was tested, it might begin to produce errors if the underlying model is later updated.
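The false positive and false negative rates mentioned under Accuracy can be measured once a human reviewer has produced a reference list of issues. The Python sketch below assumes such a list exists; the issue names and the review_metrics helper are hypothetical and are not drawn from the NIST framework.

```python
def review_metrics(flagged, true_issues):
    """Compare issues flagged by an LLM with a reviewer's reference list."""
    flagged, true_issues = set(flagged), set(true_issues)
    true_positives = flagged & true_issues
    false_positives = flagged - true_issues   # flagged, but not a real issue
    false_negatives = true_issues - flagged   # real issue that the LLM missed
    precision = len(true_positives) / len(flagged) if flagged else 0.0
    recall = len(true_positives) / len(true_issues) if true_issues else 0.0
    return {
        "false_positives": sorted(false_positives),
        "false_negatives": sorted(false_negatives),
        "precision": precision,  # share of flagged issues that were real
        "recall": recall,        # share of real issues that were found
    }


# Hypothetical contract review: reference list versus the LLM's output.
reference = ["uncapped indemnity", "auto-renewal",
             "unilateral variation", "exclusive jurisdiction"]
llm_output = ["uncapped indemnity", "auto-renewal", "missing signature block"]

result = review_metrics(llm_output, reference)
print(result["false_negatives"])  # the issues that were silently missed
print(result["recall"])           # 0.5: only half of the real issues were found
```

In this invented example a reader of the LLM’s output could spot the one false positive by checking the contract, but nothing in the output reveals the two issues that were missed. Measuring recall requires an independent reference list.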

Garbage in, garbage out - Why do LLMs fail?

Five Key Points of Failure

There are a number of causes of inaccuracy and unreliability in LLMs. But a few key causes to be aware of are:

Statistics vs reasoning: LLMs are fundamentally statistical models. Although outputs may seem human-like at times, many AI researchers are sceptical that LLMs genuinely reason and have highlighted the inherent limitations in their reasoning capabilities.

The difference between statistics and reasoning is clear when an LLM deals with a slight variation on a very common question. For instance, Google’s Gemini has seen so many examples of the trick question ‘what weighs more: a pound of bricks or a pound of feathers?’ that even when the question is changed to refer to 2 pounds of bricks it replies ‘a pound of feathers and 2 pounds of bricks actually weigh the same’. In this case, the LLM’s supposed reasoning abilities begin to appear more like advanced (but imperfect) pattern-matching from its training data.

This lack of reasoning could have serious consequences in tasks that require precise abstract reasoning, such as (in the legal context) interpreting a complex contractual clause (especially if it involves definitions and cross-references scattered throughout a document), or applying a specific law or policy (eg, when calculating employee entitlements).
Non-deterministic: LLMs often operate in a non-deterministic setting (often controlled by a parameter known as “temperature”). This has two consequences: first, the next token prediction will not always result in the most likely token being chosen; and second, even with the same inputs, LLMs may produce different outputs. While this non-determinism is often seen as improving the overall quality of LLM outputs, it also creates an additional avenue for accuracy and reliability issues.
Training data: Additional reliability and accuracy concerns stem from an LLM’s training data. LLMs are trained on a vast amount of data. Although the exact details of this training data are not usually disclosed by the developers of LLMs, it is known that many LLMs have been trained on portions of the public internet. For example, GPT-3 was trained on data sets that included the Common Crawl (a dataset created by crawling the public internet) and a data set of web pages linked from upvoted Reddit posts. These data sets will include inaccurate information resulting in a ‘garbage in, garbage out’ problem, as well as built-in biases.
Currency: The biggest step in LLM training (both computationally and in terms of training data) generally occurs only once in the lifetime of an LLM. This means that knowledge of recent events may be limited and outputs might be skewed by some historical biases. Although more recent knowledge can be introduced into the model through subsequent fine-tuning or through tool use (as discussed below), it is important to note that some of the underlying training data might become out of date over time (or might even have been out of date at the time of training).
Generalisation to new scenarios: A related issue occurs when LLMs encounter unfamiliar scenarios. While LLMs demonstrate some ability to generalise to new scenarios and to new tasks, researchers have noted that LLM reasoning skills decline markedly when they encounter scenarios or tasks that differ significantly from those they encountered in training (sometimes referred to as “out of distribution”). Unfortunately, the scenarios that might be “out of distribution” for a particular LLM (or any AI model) may not be apparent to its users, deployers or even its developers before the failure is seen in the real world.
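The effect of “temperature” described in the non-determinism point above can be sketched in Python. This is a minimal illustration under stated assumptions: the raw scores in SCORES are invented, and the two helper functions are hypothetical simplifications of how a real LLM product selects tokens.

```python
import math
import random


def softmax_with_temperature(scores, temperature):
    """Turn raw scores into probabilities; a higher temperature flattens them."""
    scaled = [s / temperature for s in scores.values()]
    peak = max(scaled)  # subtract the maximum for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    return {token: w / total for token, w in zip(scores, weights)}


def sample_token(scores, temperature, rng):
    """Temperature 0 always picks the top token; otherwise sample at random."""
    if temperature == 0:
        return max(scores, key=scores.get)
    probs = softmax_with_temperature(scores, temperature)
    return rng.choices(list(probs), weights=list(probs.values()), k=1)[0]


# Invented raw scores for three candidate next tokens.
SCORES = {"same": 2.0, "bricks": 1.0, "feathers": 0.5}

rng = random.Random(0)  # seeded only so that the sketch is repeatable
greedy = {sample_token(SCORES, 0, rng) for _ in range(20)}
sampled = [sample_token(SCORES, 1.0, rng) for _ in range(20)]

print(greedy)        # a single token: the most likely one, every time
print(set(sampled))  # may contain several different tokens
```

At temperature 0 the same input yields the same token on every run. At higher temperatures less likely tokens are sometimes chosen, which is why a tool can pass several tests and then fail on a later attempt with an identical prompt.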

RAG is no panacea, but does increase reliability and accuracy

Retrieval Augmented Generation (or RAG for those who like an acronym) is a technique that enables LLMs to access information that was not included in their original training data. This is often achieved by taking the user’s query to run some type of search and then providing the LLM with both the user’s query and the top results from the search. In this way, the LLM is able to formulate answers that can take into account not only more recent information but also information in specialised databases (such as a company’s internal knowledge) that never made it into the original training data.
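The two steps described above (run a search, then give the LLM both the query and the top results) can be sketched as follows. This is a simplified illustration, not any particular product’s implementation: the knowledge base is invented, the hypothetical search function uses crude word overlap as a stand-in for a real search engine, and the final call to an LLM is omitted.

```python
import re

# Invented internal knowledge base, for illustration only.
KNOWLEDGE_BASE = [
    "Annual leave requests require manager approval.",
    "Expense claims must be lodged within 30 days.",
    "Laptops are replaced every three years.",
]


def words(text):
    """Lower-case a text and split it into a set of words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def search(query, documents, top_k=2):
    """Stand-in for a real search: rank documents by words shared with the query."""
    scored = [(len(words(query) & words(doc)), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:top_k] if score > 0]


def build_prompt(query, documents, top_k=2):
    """Combine the user's query with the top search results for the LLM."""
    extracts = search(query, documents, top_k)
    context = "\n".join(f"[{i}] {text}" for i, text in enumerate(extracts, start=1))
    return (
        "Answer using only the numbered extracts below.\n\n"
        f"Extracts:\n{context}\n\n"
        f"Question: {query}"
    )


query = "Who approves annual leave requests?"
prompt = build_prompt(query, KNOWLEDGE_BASE)
print(prompt)  # this text, not the whole knowledge base, is what the LLM sees
```

The design point to notice is that the LLM only ever sees the extracts that the search step returns, so the quality of the answer is bounded by the quality of the search.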

However, this approach is far from foolproof. For example, many implementations of RAG rely on a type of search called vector search (sometimes referred to as semantic search). A vector search tries to locate extracts that are semantically similar to the user’s query. Although the results of a vector search can sometimes be very impressive, the search can also return both false positives (ie, irrelevant results) and false negatives (ie, relevant results that are missed).

False positive search results fed into an LLM can often lead to wild hallucinations. A recent example of this (in the context of a RAG product relying on internet search) is Google’s AI Overview, which advised users to put glue on their pizza to stop the cheese falling off – an inaccuracy caused by the reliance on an irrelevant result. More mundane examples could occur where an underlying dataset is out of date (eg, it contains outdated policies or product information) or an extract is missing some key context (eg, the LLM may see an extract of product information without seeing the name of the product to which it relates).
False negatives can also be problematic. While no one would expect something like an internet search to be exhaustive, when an LLM is presented with a few search results that do not contain any relevant information, the LLM’s output may assume (or even confidently state) that no such information exists. For example, a user relying on a talk-to-your-document tool may ask whether a set of documents contains any policy or law about a certain matter. If the underlying search fails to pull back the relevant part of the corpus in the top results (which is not unusual in vector search), the LLM may mislead the user by incorrectly stating that there is no policy or restriction to be concerned about.
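Both failure modes can be seen in a small vector search sketch. The three-number vectors below are invented by hand for illustration; a real system would obtain vectors with hundreds or thousands of dimensions from an embedding model. The document titles and the top_k helper are likewise hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity: closer to 1.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, corpus, k):
    """Return the titles of the k extracts most similar to the query."""
    ranked = sorted(corpus, key=lambda title: cosine(query_vector, corpus[title]),
                    reverse=True)
    return ranked[:k]


# Invented vectors. The relevant policy is worded differently from the query,
# so its vector sits further away than two irrelevant but similar-sounding extracts.
CORPUS = {
    "Supplier onboarding checklist": [0.9, 0.2, 0.0],
    "Hospitality venue booking guide": [0.8, 0.4, 0.2],
    "Gifts and benefits policy": [0.4, 0.3, 0.8],
}

# Hypothetical vector for the query
# "Are there restrictions on accepting hospitality from suppliers?"
QUERY_VECTOR = [0.9, 0.3, 0.1]

results = top_k(QUERY_VECTOR, CORPUS, k=2)
print(results)                                 # two irrelevant extracts
print("Gifts and benefits policy" in results)  # False: the relevant policy was cut off
```

With only the top two results passed on, the LLM receives two false positives and never sees the policy that actually answers the question. Raising k to 3 would retrieve it in this toy example, at the cost of feeding the LLM more irrelevant material.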

In fact, the limitations of RAG were recently demonstrated by a Stanford University study in which researchers found that Lexis+ AI, Westlaw AI-Assisted Research and Ask Practical Law AI (which are all implementations of RAG) hallucinated between 17% and 33% of the time.

RAG does have one key benefit though: the underlying search is often capable of pulling back faithful extracts of the original documents. As a result, many RAG products include citations and links to the underlying sources that informed their outputs. However, while this can assist in fact-checking the positive claims in LLM outputs, it does not address what the search missed (the false negative problem). As a result, while RAG may be useful in some contexts, it does not fully solve the accuracy problem.

Potential Legal Consequences

Exposure risks

Inaccurate outputs from LLMs (and AI more generally) can lead to numerous consequences, including loss of customers, damage to brand and wasted time and expense in developing, testing and deploying unfit AI tools. They can also lead to legal consequences. A few examples of potential legal claims are:

Australian Consumer Law: Although we are unaware of any claims so far under the ACL relating to LLMs, a recent case in Canada highlights the risks of deploying unreliable customer-facing chatbots. In that case, a chatbot on Air Canada’s website incorrectly told a customer that it was possible to obtain a discounted bereavement fare by applying for a partial refund after the relevant flight. In reliance on this, the customer purchased a full-price ticket. The Civil Resolution Tribunal of British Columbia rejected Air Canada’s claim that it was not responsible for the misleading statements made by its chatbot, found Air Canada liable for negligent misrepresentation and ordered it to pay compensation to the customer.
Contract law: Last year, a Chevrolet dealership in Watsonville, California, had to remove access to its customer-facing chatbot after the chatbot purported to agree to sell cars for much less than retail price - in one case agreeing to a price of US$1 and stating ‘that's a deal, and that's a legally binding offer - no takesies backsies’. Although in this case no one seems to have seriously attempted to assert that an actual contract was formed (or to enforce that contract), it serves as a reminder of the risks in letting chatbots engage in open-ended conversations with end users.
Discrimination law: There are many examples of discriminatory outcomes from the use of AI systems. For instance, in the United States a system used to assist parole boards in determining a probability score for the likelihood of an inmate reoffending was shown to treat black defendants far less favourably than white defendants. Similarly, a health care risk-prediction algorithm used in the United States to select which individuals would receive additional care was found to favour white patients over black patients. More recently, Bloomberg’s analysis of 5100 images created with Stable Diffusion v1.5 demonstrated gender and racial bias in its outputs, by depicting most subjects in high-paying roles as men of lighter skin tones. Given the possibility of discriminatory outputs, companies should be aware of their obligations under anti-discrimination legislation and how they may apply to their business, especially when using AI tools in decision-making (such as in automated hiring systems).
Defamation law: Companies may face defamation lawsuits as a result of hallucinations from their chatbots. For example, Brian Hood, the mayor of Hepburn Shire in regional Victoria, threatened (but later dropped) a defamation lawsuit against OpenAI for ChatGPT’s false claims that he was a guilty party in a foreign bribery scandal (when in fact he had acted as the whistleblower).
Administrative law: AI tools will likely become more widely used in the course of making administrative decisions. Administrative decision-makers will need to be aware of their obligations under law and follow the guidance issued by the relevant government, including by ensuring that they continue to use their own discretion in any AI-assisted decision-making.

What now?

In light of these risks, organisations deploying AI tools should consider mitigation strategies. These could include:

carefully testing any customer-facing tools to understand the risks of inaccurate answers and, where possible, limiting the ability of these tools to generate open-ended outputs in response to any question;
including clear disclaimers about the purpose of the tool and the risks relating to inaccuracy;
implementing processes to monitor the outputs of the AI tool;
ensuring appropriate human oversight of outputs (including considering what needs to be done to make this human oversight effective);
where possible, seeking appropriate protections from the supplier of any AI tools; and
taking into account the risk profile of the relevant use case in deciding what AI tool to use (or whether to use AI at all).

Stay tuned for the next update in our risks in GenAI series. Subscribe to data and technology newsletters here.

Getting lost in the changing landscape of AI regulatory requirements?

View our resources and videos developed by our experts to help you stay on top of the latest GenAI and tech developments.

Our GenAI regulatory map will help you to understand and keep up with this fast moving regulatory and stakeholder landscape.


This easy-to-use and regularly updated timeline will help you stay on top of important developments across key areas of tech-related regulation, including GenAI.

