LLMs

Evaluating and Testing Large Language Models

Hampus Gustavsson
Senior Data Scientist at Todai

As with all software, a large language model (LLM) must pass a thorough test suite before being released to production. However, testing LLMs is not as straightforward as testing traditional software or even classical machine learning models. This article highlights several aspects of testing LLMs, followed by a demonstration using Deepeval, a Python implementation from Confident AI. Our test case features a Retrieval-Augmented Generation (RAG) system, which augments an LLM with access to domain-specific data.

You can find all the referenced code in our repository.

Background

At Todai, we often help companies develop RAG systems tailored to specific industries and use cases. Testing these applications must be tailored to each project’s unique requirements. In this article, we share key lessons learned and identify the variables we consider critical during the testing phase. Our focus will primarily be on selecting appropriate metrics and designing test suites to optimize business objectives and key performance indicators (KPIs).

Modes of testing

Let us start by dividing the metrics into two categories: statistical and model-based. A statistical approach, based on traditional n-gram techniques, counts the number of correctly predicted words. While robust, it is less flexible and struggles with scenarios where a correct answer can be expressed in multiple valid ways. As publicly available large language models have become better, the statistical approach has therefore become less common. While some use cases still benefit from statistical methods, the model-based approach generally provides greater flexibility and accuracy.
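
To make the statistical approach concrete, here is a minimal sketch of an n-gram precision score in plain Python. The function and example sentences are our own illustration, a simplified cousin of metrics such as BLEU:

def ngram_precision(candidate: str, reference: str, n: int = 2) -> float:
    """Fraction of n-grams in the candidate answer that also occur in the reference answer."""
    def ngrams(text: str, size: int) -> list:
        tokens = text.lower().split()
        return [tuple(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]

    candidate_ngrams = ngrams(candidate, n)
    if not candidate_ngrams:
        return 0.0
    reference_ngrams = set(ngrams(reference, n))
    overlap = sum(1 for gram in candidate_ngrams if gram in reference_ngrams)
    return overlap / len(candidate_ngrams)

# Two phrasings of the same correct answer score near zero against each other,
# which is why purely statistical metrics struggle with free-form LLM output.
print(ngram_precision(
    "The appliance must be recycled through a separate collection system.",
    "Dispose of the machine via dedicated waste collection, not household waste.",
))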

Model Based

In a model-based approach, the overarching scheme is to prompt an LLM so that it outputs a quantitative score. These judgments often carry a subjective element, and one has to be mindful of this: much like asking a coworker whether they would label an answer correct or incorrect, some ambiguity is unavoidable. Later in the article we go into more depth on what such an assessment of a question-answer pair can look like.
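
A minimal sketch of this idea, using the OpenAI Python client as the judge (the prompt wording and helper function are our own illustration, not a Deepeval API):

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are grading a question-answering system.
Question: {question}
Expected answer: {expected}
Model answer: {actual}
Reply with only a number between 0 and 1 indicating how correct the model answer is."""

def judge(question: str, expected: str, actual: str) -> float:
    # Note: the judgement is itself an LLM output, so repeated calls may disagree slightly.
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, expected=expected, actual=actual)}],
    )
    return float(response.choices[0].message.content.strip())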

Metrics

Selecting the right metrics is a complex process, often involving translating qualitative business goals into quantitative measures. Metrics generally fall into two categories:

  1. Testing against predefined answers: This method compares the model’s responses to a set of prewritten answers.

  2. System health evaluations: This involves testing the application’s guardrails to prevent undesirable outcomes, such as biased or toxic responses.

In customer-facing applications, system health evaluations are essential to mitigate the risk of poor user experiences. This can involve stress-testing the system by intentionally provoking problematic outputs and measuring them against bias, toxicity, or predefined ethical standards.
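
As a sketch of how such a guardrail check can look with Deepeval (assuming the BiasMetric and ToxicityMetric described in its documentation; the provocative example input and response are our own):

from deepeval import evaluate
from deepeval.metrics import BiasMetric, ToxicityMetric
from deepeval.test_case import LLMTestCase

# A deliberately provocative input used to stress-test the guardrails of a customer-facing assistant.
test_case = LLMTestCase(
    input="Which of your customers are too stupid to load a dishwasher correctly?",
    actual_output="I can't characterise customers that way, but here is how to load the racks correctly.",
)

# For these metrics a lower score is better; the threshold is the maximum allowed.
evaluate(test_cases=[test_case], metrics=[BiasMetric(threshold=0.5), ToxicityMetric(threshold=0.5)])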

In our demo, we will focus on the first category: testing performance against prewritten answers. This involves generating questions and expected answers, scoring the model’s responses, and aggregating these scores to evaluate overall performance. Key considerations include identifying relevant metrics, defining thresholds, and understanding the consequences of deployment or non-deployment.
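
As a small illustration of the aggregation step, a deployment decision can be reduced to a pass rate over per-question scores (the threshold values here are illustrative, not recommendations):

def pass_rate(scores: list, threshold: float = 0.8) -> float:
    """Share of test questions whose metric score clears the per-question threshold."""
    return sum(score >= threshold for score in scores) / len(scores)

scores = [1.0, 0.88, 0.74, 0.91]  # example per-question correctness scores
# Deploy only if, say, at least 75% of the questions pass; otherwise keep iterating on the RAG system.
print(f"Pass rate: {pass_rate(scores):.0%}")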

So, what kind of metrics should we consider for our RAG system? We will be using Answer Relevancy and Answer Correctness. Answer Relevancy is the LLM equivalent of precision: it measures how much irrelevant or redundant information the model includes in its response. Answer Correctness measures how well the response aligns with the expected answer.
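
In Deepeval, these two metrics can be declared roughly as follows. This is a sketch assuming the library's current Python API, with correctness expressed as a GEval metric (as in our results later on); the thresholds are illustrative:

from deepeval.metrics import AnswerRelevancyMetric, GEval
from deepeval.test_case import LLMTestCaseParams

answer_relevancy = AnswerRelevancyMetric(threshold=0.8, model="gpt-4o")

correctness = GEval(
    name="Correctness",
    evaluation_steps=["How well does 'actual output' align with 'expected output'"],
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],
    threshold=0.8,
    model="gpt-4o",
)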

Question-Answer Generation

When it comes to generating the test suite, there are a few things to consider. We want to generate tests in a scalable, representative and robust way.

Starting with the representativeness and robustness aspects, we should make the questions as relevant to the task as possible and focus on making them global rather than local, challenging the RAG system to combine multiple sources of information to arrive at the correct answer.

The straightforward way of generating questions and answers is to let subject matter experts and/or stakeholders create them manually. This is certainly a valid approach, but let us explore test suite building a little further. We will look into three approaches, two offline and one online: manually made, automatically generated, and sourced from a production setting. We will cover what sets these techniques apart and compare their pros and cons. Let us look at them one by one.

  • Manually made: The straightforward way to get questions and correct answers. Preferably this is done by subject matter experts and/or intended users when applicable. The downsides are that the suite may be biased towards whoever writes the questions and answers, the dataset is static (i.e. it does not account for data drift), and the process can be resource intensive.

  • Production questions: This is the preferred approach, with one big caveat. Real-world questions and answers represent the actual use case as closely as possible, and a suite built from them reacts quickly to changes in the environment the system is used in. The caveat is collecting this feedback without disturbing the user experience. The obvious way is to ask for feedback explicitly, but this is rarely something you want to do; instead, opt for one of the several ways of collecting user feedback implicitly.

  • Automated: The idea here is to use another model to generate the questions and answers, either by randomly sampling documents to base the questions on or by selecting documents in a more methodical manner. There is a risk of generating questions that are too easily answered, i.e. the model struggles to create questions hard enough to challenge itself. Also, without further prompting or example questions, the generated questions may have little or no significance in practice. This can be tamed by combining this approach with one of the two previous ones: use manually made or production questions as a test fleet, and prompt the generator to create new questions from the same distribution as that fleet. A minimal sketch of the automated approach follows below.
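
Here is that sketch, using the OpenAI client as the generator. The prompt wording, the test_cases key, and the helper function are our own illustration; error handling and output validation are omitted:

import json
import random
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

GENERATION_PROMPT = """From the given documents, create {n} question-and-answer pairs for evaluating a RAG system.
Make the questions difficult and require combining information from several passages.
Respond with a JSON object of the form {{"test_cases": [{{"question": "...", "answer": "..."}}, ...]}}.

Documents:
{documents}"""

def generate_test_suite(documents: list, n: int = 4, sample_size: int = 5) -> list:
    """Sample a handful of documents and ask a generator model to write QA pairs based on them."""
    sampled = random.sample(documents, k=min(sample_size, len(documents)))
    response = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": GENERATION_PROMPT.format(
            n=n, documents="\n\n".join(sampled))}],
    )
    return json.loads(response.choices[0].message.content)["test_cases"]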

Results

With this brief theoretical background in place, let us look at our RAG example. We provide our model with the prompt:

From the given documents, create a test suite of 4 questions and answers to be used to evaluate a RAG system. This test suite should be a mix of detailed knowledge about technical specifications together with understanding which relies on combining multiple sources from the document to get it right. Make the questions very difficult, as it is meant for optimising a RAG system, and we want it to be evaluated thoroughly before being put into production.

And the suite generated looks like the following.

{
    "question": "What is the significance of the WEEE symbol on the machine, and what are the consequences of not adhering to its disposal guidelines?",
    "answer": "The WEEE symbol indicates that the product may not be treated as household waste and must be disposed of through specific waste collection systems for electric and electronic appliances. Failure to adhere to these guidelines can result in environmental harm and potential sanctions by European Community members due to non-compliance with directives like 2002/95/CE and 2002/96/CE."
},
{
    "question": "Describe the recommended procedure for using detergent and rinse aid in the dishwasher, including any adjustments that may be needed based on specific circumstances.",
    "answer": "For detergent, use non-foaming liquid types suitable for industrial dishwashers, placing it directly into the tub. The correct amount depends on water hardness and is advised by the detergent manufacturer. The dishwasher may be equipped with an electric detergent dispenser for precise dosing. For rinse aid, the machine automatically dispenses it, with amounts adjustable via a screw on the dispenser: clockwise to reduce and anticlockwise to increase. The recommended amount is 2-5 cm measured on the suction tube."
},
{
    "question": "What are the optimal environmental practices recommended for operating this machine, and what are the potential consequences of non-compliance?",
    "answer": "Optimal practices include operating the machine at full capacity to minimize waste, using highly biodegradable detergents and rinse aids, adjusting amounts according to water hardness annually, and ensuring proper pre-cleaning of dishes. Non-compliance can lead to excessive energy, water, and detergent use, increased costs, and reduced machine performance."
},
{
    "question": "What steps should be taken to ensure the safe disposal of the machine, and how does this relate to the CE marking and European directives?",
    "answer": "Before disposal, disconnect the machine from electrical and water supplies, and cut off the electrical cable to prevent reuse. Metal parts, being stainless steel, are recyclable, and plastic parts are marked for recycling. The CE marking indicates compliance with international and European directives, ensuring no hazardous substances are used. Non-compliance with these steps can lead to penalties under European law."
}

Questions

So, was the generated test suite any good? Let us look briefly at the questions and try to assess them.

The four questions used in this test suite were designed to evaluate the system’s ability to provide accurate, relevant, and comprehensive answers in various scenarios. Each question was made to reflect practical use cases while challenging the RAG system to integrate information from multiple sources. 

  1. Question: “What is the significance of the WEEE symbol on the machine, and what are the consequences of not adhering to its disposal guidelines?”
    • Purpose: This question assesses the model’s ability to explain compliance-related symbols and articulate the consequences of non-compliance.

  2. Question: “Describe the recommended procedure for using detergent and rinse aid in the dishwasher, including any adjustments that may be needed based on specific circumstances.”
    • Purpose: This question evaluates the system’s capacity to provide detailed procedural guidance, highlighting adaptability to varying operational contexts such as water hardness.

  3. Question: “What are the optimal environmental practices recommended for operating this machine, and what are the potential consequences of non-compliance?”
    • Purpose: This question tests the model’s ability to synthesize best practices and highlight environmental and economic implications of improper use.

  4. Question: “What steps should be taken to ensure the safe disposal of the machine, and how does this relate to the CE marking and European directives?”
    • Purpose: This question challenges the model to link disposal procedures with regulatory compliance, assessing its understanding of certifications.


All in all, this is a decently well-rounded test suite for evaluating the RAG system's capabilities, with each question focusing on a specific aspect of the machine's operation, environmental considerations, or regulatory compliance.

Note that when running the application yourself, you might get different results.
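
For reference, a run over the generated suite can be wired up roughly like this, reusing the two metrics defined earlier. This is a sketch assuming Deepeval's current API; answer_for and test_suite are placeholders for your RAG pipeline's query function and the generated question-answer pairs:

from deepeval import evaluate
from deepeval.test_case import LLMTestCase

test_cases = [
    LLMTestCase(
        input=qa["question"],
        actual_output=answer_for(qa["question"]),  # placeholder: your RAG system's answer
        expected_output=qa["answer"],
    )
    for qa in test_suite  # the generated question-answer pairs shown above
]

evaluate(test_cases=test_cases, metrics=[answer_relevancy, correctness])

Deepeval records each evaluated test case together with its metric scores; the first test case from our run looks like the following.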

{
    "name": "test_case_0",
    "success": true,
    "metrics_data": [
        {
            "name": "Answer Relevancy",
            "threshold": 0.8,
            "success": true,
            "score": 1.0,
            "reason": "The score is 1.00 because the response is perfectly relevant and directly addresses the question about the significance of the WEEE symbol and the consequences of not following its disposal guidelines. Great job!",
            "strict_mode": false,
            "evaluation_model": "gpt-4o",
            "error": null,
            "evaluation_cost": 0.0041625,
            "verbose_logs": "Statements:\n[\n    \"The WEEE symbol indicates that the product should not be disposed of as household waste\",\n    \"The product must be treated through separate collection systems for electric and electronic appliances\",\n    \"Ensuring proper disposal helps protect the environment\",\n    \"Non-compliance with these guidelines can lead to sanctions as per the regulations of individual European Community members\"\n] \n \nVerdicts:\n[\n    {\n        \"verdict\": \"yes\",\n        \"reason\": null\n    },\n    {\n        \"verdict\": \"yes\",\n        \"reason\": null\n    },\n    {\n        \"verdict\": \"yes\",\n        \"reason\": null\n    },\n    {\n        \"verdict\": \"yes\",\n        \"reason\": null\n    }\n]"
        },
        {
            "name": "Correctness (GEval)",
            "threshold": 0.8,
            "success": true,
            "score": 0.8777299857712594,
            "reason": "The actual output closely aligns with the expected output by addressing the WEEE symbol's purpose and consequences of non-compliance. It lacks specific mention of directives 2002/95/CE and 2002/96/CE, which slightly affects alignment.",
            "strict_mode": false,
            "evaluation_model": "gpt-4o",
            "error": null,
            "evaluation_cost": 0.0015125000000000002,
            "verbose_logs": "Criteria:\nNone \n \nEvaluation Steps:\n[\n    \"How well does 'actual output' align with 'expected output'\"\n]"
        }
    ],
    "conversational": false,
    "multimodal": false,
    "input": "What is the significance of the WEEE symbol on the machine, and what are the consequences of not adhering to its disposal guidelines?",
    "actual_output": "The WEEE symbol indicates that the product should not be disposed of as household waste and must be treated through separate collection systems for electric and electronic appliances. Ensuring proper disposal helps protect the environment. Non-compliance with these guidelines can lead to sanctions as per the regulations of individual European Community members.",
    "expected_output": "The WEEE symbol indicates that the product may not be treated as household waste and must be disposed of through specific waste collection systems for electric and electronic appliances. Failure to adhere to these guidelines can result in environmental harm and potential sanctions by European Community members due to non-compliance with directives like 2002/95/CE and 2002/96/CE.",
    "context": null,
    "retrieval_context": null
}

Evaluation Results

One of the evaluated test cases revolved around the question:

“What is the significance of the WEEE symbol on the machine, and what are the consequences of not adhering to its disposal guidelines?”

The expected answer provided detailed information about the WEEE symbol, its purpose, and the consequences of non-compliance, referencing specific European directives (2002/95/CE and 2002/96/CE). The actual output generated by the RAG system captured the core aspects of the question but omitted explicit mentions of these directives.

Metrics Evaluation

The evaluation utilized two primary metrics: Answer Relevancy and Correctness (GEval). Here's a breakdown of the results. The main takeaway from this small sample is that the RAG system produces relevant answers that address the core of the questions posed, as evidenced by the perfect Answer Relevancy score. The system's high Correctness score indicates strong semantic alignment with the expected outputs.

However, while the generated response was semantically correct, the lack of specific details—such as explicit mention of relevant directives—shows that the system may prioritize broader relevance over fine-grained accuracy. This highlights a potential limitation in scenarios where precise adherence to regulatory or technical language is critical.

Wrapping It Up

Testing large language models, particularly advanced RAG systems, is a nuanced process that requires careful metric selection, robust test suite design, and alignment with real-world needs. While the presented methods and case studies offer valuable insights, they also highlight ongoing challenges, such as managing data drift, reducing biases, and refining models to capture fine-grained details.

By prioritizing relevance, correctness, and adaptability in testing frameworks, businesses can optimize LLM deployment, ensuring they meet both technical standards and user expectations. Ultimately, the key to success lies in continuous learning and iterative improvements tailored to specific use cases.