Benchmarks For LLMs

Understand the role and limitations of benchmarks in LLM performance evaluation. Explore the techniques for developing robust LLMs.

Large Language Models have gained massive popularity in recent years, and you have seen why. LLMs' exceptional ability to understand human language commands has made them a natural fit for businesses, supporting critical workflows and automating tasks for maximum efficiency. And beyond what the average user sees, there is much more LLMs can do. As our reliance on them grows, we must pay more attention to measures that ensure the accuracy and reliability we need. This is a broad task that concerns whole institutions, but in the business realm there are now several benchmarks that can be used to evaluate an LLM's performance across various domains. These test the model's abilities in comprehension, logical reasoning, mathematics, and so on, and the results help determine whether an LLM is ready for business deployment.

In this article, I have gathered a comprehensive list of the most popular benchmarks for LLM evaluation. We will discuss each benchmark in detail and see how different LLMs fare against the evaluation criteria. But first, let’s understand LLM evaluation in more detail.

What is LLM Evaluation?

Like other AI models, LLMs need to be evaluated against specific benchmarks that assess various aspects of a language model's performance: knowledge, accuracy, reliability, and consistency. The evaluation typically involves:

  1. Understanding User Queries: Assessing the model’s ability to accurately comprehend and interpret a wide range of user inputs.
  2. Output Verification: Verifying the AI-generated responses against a trusted knowledge base to ensure they are correct and relevant.
  3. Robustness: Measuring how well the model performs with ambiguous, incomplete, or noisy inputs.

LLM evaluation gives developers the power to identify and address limitations efficiently so that they can improve the overall user experience. A thoroughly evaluated LLM will be accurate and robust enough to handle a range of real-world applications, including those with ambiguous or unexpected inputs.

Benchmarks

LLMs are among the most complicated pieces of technology to date and can power even the trickiest of applications. So the evaluation process has to be equally rigorous, putting the model's reasoning process and technical accuracy to the test.

A benchmark uses specific datasets, metrics, and evaluation tasks to test LLM performance. It allows different LLMs to be compared and their accuracy measured, which in turn drives progress in the industry through improved performance.

Here are some of the most typical aspects of LLM performance:

  • Knowledge: The model’s knowledge needs to be tested across various domains. That’s what knowledge benchmarks are for: they evaluate how effectively the model can recall information from different fields, like physics, programming, geography, etc.
  • Logical Reasoning: These benchmarks test a model’s ability to ‘think’ step by step and derive a logical conclusion. They typically involve scenarios where the model has to select the most plausible continuation or explanation based on everyday knowledge and logic.
  • Reading Comprehension: Models have to be excellent at interpreting natural language and generating responses accordingly. The test involves answering questions based on passages to gauge comprehension, inference, and detail retention, much like a school reading test.
  • Code Understanding: This is needed to measure a model’s proficiency in understanding, writing, and debugging code. These benchmarks give the model coding tasks or problems that the model has to solve accurately, often covering a range of programming languages and paradigms.
  • World Knowledge: To evaluate the model’s grasp of general knowledge about the world. These datasets typically have questions that need broad, encyclopedic knowledge to be answered correctly, which makes them different from more specific and specialized knowledge benchmarks.

“Knowledge” Benchmarks

MMLU (Massive Multitask Language Understanding)

This benchmark tests an LLM’s grasp of factual knowledge across topics like the humanities, social sciences, history, computer science, and even law. It spans 57 subjects and roughly 15,000 questions, all directed at probing both knowledge and reasoning. This makes MMLU a good tool to assess an LLM’s factual recall and reasoning across a wide range of topics.
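
To make the mechanics concrete, here is a minimal sketch of how an MMLU-style multiple-choice evaluation can be scored. The `MCQuestion` structure and the `ask_model` callback are placeholders invented for illustration; the scoring itself is just exact-match accuracy over answer letters.

```python
from dataclasses import dataclass

@dataclass
class MCQuestion:
    question: str
    choices: list[str]   # e.g. ["Paris", "Rome", "Berlin", "Madrid"]
    answer: str          # gold answer letter, e.g. "A"

def format_prompt(q: MCQuestion) -> str:
    """Render a question as a lettered multiple-choice prompt."""
    letters = "ABCD"
    options = "\n".join(f"{letters[i]}. {c}" for i, c in enumerate(q.choices))
    return f"{q.question}\n{options}\nAnswer with a single letter."

def accuracy(questions: list[MCQuestion], ask_model) -> float:
    """Exact-match accuracy over answer letters; ask_model(prompt) -> 'A'..'D'."""
    correct = sum(ask_model(format_prompt(q)).strip().upper()[:1] == q.answer
                  for q in questions)
    return correct / len(questions)
```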

Recently, it has become a key benchmark for evaluating LLMs in the areas mentioned above. Developers want their models to outperform others on it, which has made it a de facto standard for evaluating advanced reasoning and knowledge in LLMs. Large enterprise-grade models have posted impressive scores, including GPT-4o at 88.7%, Claude 3 Opus at 86.8%, Gemini 1.5 Pro at 85.9%, and Llama 3 70B at 82%. Small models typically don’t perform as well, usually not exceeding 60-65%, but the recent 75.3% from Phi-3-small (7B) is something to think about.

However, MMLU is not without drawbacks: it has known issues such as ambiguous questions, incorrect answers, and missing context, and many consider some of its tasks too easy for proper LLM evaluation.

I’d like to make it clear that benchmarks like MMLU don’t perfectly depict real-world scenarios. A great score does not necessarily mean the LLM has become a subject-matter expert. Benchmarks are quite limited in scope and often rely on multiple-choice questions, which can never fully capture the complexity and context of real-world interactions. True understanding means knowing facts and applying that knowledge dynamically, which involves critical thinking, problem-solving, and contextual awareness. For these reasons, benchmarks themselves constantly need to be refined and updated to stay relevant and effective.

GPQA (Graduate-Level Google-Proof Q&A Benchmark)

This benchmark assesses LLMs on logical reasoning using a dataset of just 448 questions. Developed by domain experts, it covers topics in biology, physics, and chemistry.

Each question goes through the following validation process:

  1. An expert in the same topic answers the question and provides detailed feedback.
  2. The question writer revises the question based on this feedback.
  3. A second expert answers the revised question.

This process helps ensure the questions are objective, accurate, and genuinely challenging for a language model. Even experienced PhD scholars achieve an accuracy of only 65% on these questions, while GPT-4o reaches just 53.6%, highlighting the gap between human and machine intelligence.

Because of the high qualification requirements, the dataset is quite small, which limits its statistical power for comparing accuracies and requires large effect sizes to detect differences. The experts who created and validated these questions were recruited through Upwork, so they may have introduced biases based on their expertise and the topics they chose to cover.

Code Benchmarks

HumanEval

HumanEval is a set of 164 programming problems and a real test of an LLM’s coding abilities. It is designed to test the basic coding skills of large language models and uses the pass@k metric to judge the functional correctness of the generated code: the probability that at least one of k LLM-generated code samples passes the test cases.
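
The pass@k estimator itself is easy to compute. Below is a small sketch based on the unbiased estimator described in the HumanEval paper: generate n samples per problem, count the c samples that pass the unit tests, and estimate the probability that at least one of k randomly drawn samples would have passed.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k),
    computed as a running product for numerical stability."""
    if n - c < k:
        return 1.0  # fewer failing samples than k, so every k-draw contains a pass
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 200 samples for one problem, 12 of which pass the tests
print(pass_at_k(n=200, c=12, k=1))   # 0.06
```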

While the HumanEval dataset includes function signatures, docstrings, code bodies, and several unit tests, it does not cover the full range of real-world coding problems, so it cannot adequately test a model’s capability to write correct code for diverse scenarios.

MBPP (Mostly Basic Python Problems)

The MBPP benchmark consists of 1,000 crowd-sourced Python programming questions. These are entry-level problems that focus on fundamental programming skills. Model performance is evaluated with few-shot prompting and fine-tuning approaches, with larger models typically performing better on this dataset. However, since the dataset contains mainly entry-level programs, it does not fully represent the complexities and challenges of real-world applications.

Math Benchmarks

While most LLMs are quite good at structuring standard responses, mathematical reasoning is a much bigger problem for them. Why? Because it requires understanding the question, working through a step-by-step logical and mathematical process, and deriving the correct answer.

The “Chain of Thought” (CoT) method is commonly used when evaluating LLMs on mathematics-related benchmarks. It involves prompting models to explain their step-by-step reasoning process when solving a problem. There are several benefits to this: it makes the reasoning process more transparent, helps identify flaws in the model’s logic, and allows for a more granular assessment of problem-solving skills. By breaking down complex problems into a series of simpler steps, CoT can improve the model’s performance on math benchmarks and provide deeper insights into its reasoning capabilities.
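
As a quick illustration, here is roughly what a one-shot chain-of-thought prompt for a grade-school word problem might look like; the worked example and wording below are my own, not taken from any particular benchmark.

```python
# A sketch of a one-shot chain-of-thought prompt for a grade-school math problem.
# The worked example and phrasing are illustrative, not from any benchmark.
COT_TEMPLATE = """Q: A bakery sells 12 muffins per tray and bakes 7 trays a day.
How many muffins does it bake in 5 days?
A: Let's think step by step.
Each day the bakery bakes 12 * 7 = 84 muffins.
Over 5 days that is 84 * 5 = 420 muffins.
The answer is 420.

Q: {question}
A: Let's think step by step."""

def build_cot_prompt(question: str) -> str:
    """Insert a new question into the one-shot chain-of-thought template."""
    return COT_TEMPLATE.format(question=question)
```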

GSM8K: A Popular Math Benchmark

One of the best-known benchmarks for evaluating math abilities in LLMs is the GSM8K dataset. GSM8K consists of 8.5k grade-school math problems that take a few steps to solve, with solutions primarily involving a sequence of elementary calculations. Larger models, or those specifically trained for mathematical reasoning, typically perform better on this benchmark; for example, GPT-4 models boast a score of 96.5%, while DeepSeekMATH-RL-7B lags slightly behind at 88.2%.
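
Scoring GSM8K outputs is usually a matter of extracting the final number from the model’s worked solution and comparing it to the gold answer, which is marked with "####" in the dataset. Here is a rough sketch of that extraction; it is a common heuristic, not an official scorer.

```python
import re

def extract_final_number(text: str) -> str | None:
    """Pull the last number out of a model's worked solution.
    Taking the last number in the output is a common scoring heuristic."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return numbers[-1] if numbers else None

def gsm8k_correct(model_output: str, gold_solution: str) -> bool:
    """GSM8K gold solutions end with '#### <answer>'; compare against it."""
    gold = gold_solution.split("####")[-1].strip().replace(",", "")
    return extract_final_number(model_output) == gold
```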

While GSM8K is useful for assessing a model’s ability to handle grade school-level math problems, it may not fully capture a model’s capacity to solve more advanced or diverse mathematical challenges, thus limiting its effectiveness as a comprehensive measure of math ability.

The MATH Dataset: A Comprehensive Alternative

The MATH dataset addresses the shortcomings of benchmarks like GSM8K. It is more extensive, covering everything from elementary arithmetic to high school and even college-level problems. Human baselines exist as well: a computer science PhD student who doesn’t particularly like mathematics achieved an accuracy of 40%, while a gold medalist achieved 90%.

It provides a more well-rounded assessment of an LLM’s mathematical capabilities, testing both proficiency in basic arithmetic and competence in complex areas like algebra, geometry, and calculus. But the increased complexity and diversity of problems can make it challenging for models to achieve high accuracy, especially those not explicitly trained on a wide range of mathematical concepts. Also, the varied problem formats in the MATH dataset can introduce inconsistencies in model performance, which makes it harder to draw definitive conclusions about a model’s overall mathematical proficiency.

Using the Chain of Thought method with the MATH dataset can enhance the evaluation because it reveals the step-by-step reasoning abilities of LLMs across a wide spectrum of mathematical challenges. A combined approach like this provides a more robust and detailed assessment of an LLM’s true mathematical capabilities.

Reading Comprehension Benchmarks

A reading comprehension assessment evaluates the model’s ability to understand and process complex text, which is especially fundamental for applications like customer support, content generation, and information retrieval. There are a few benchmarks designed to assess this skill, each with unique attributes that contribute to a comprehensive evaluation of a model’s capabilities.

RACE (Reading Comprehension dataset from Examinations)

The RACE benchmark contains almost 28,000 passages and 100,000 questions collected from English exams for Chinese middle and high school students between the ages of 12 and 18. Questions and answers are not restricted to spans that can be extracted directly from the given passages, which makes the tasks all the more challenging.

It covers a broad range of topics and question types at different difficulty levels, which makes for a thorough assessment. The questions in RACE were also specifically designed to test human reading skills and were created by domain experts.

However, the benchmark does have some drawbacks. Since it is built on Chinese educational materials, it can introduce cultural biases that do not reflect a global context. Also, the high difficulty level of some questions is not representative of typical real-world tasks, so performance evaluations may not transfer cleanly to practical use.

DROP (Discrete Reasoning Over Paragraphs)

Another significant benchmark is DROP (Discrete Reasoning Over Paragraphs), which challenges models to perform discrete reasoning over paragraphs. It contains 96,000 questions built from Wikipedia passages and crowdsourced via Amazon Mechanical Turk. DROP questions often require models to perform operations like addition, subtraction, and comparison based on information scattered across a passage.

The questions are challenging: they require LLMs to locate multiple numbers in the passage and add or subtract them to get the final answer. Large models such as GPT-4 and PaLM achieve around 80% and 85% respectively, while humans achieve 96% on the DROP dataset.
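
For short span or numeric answers, DROP-style evaluations typically report exact match alongside a token-overlap F1. Here is a simplified version of that F1; it ignores DROP’s special handling of multi-span answers and number normalization, and is meant only to show the shape of the metric.

```python
from collections import Counter
import re
import string

def normalize(text: str) -> list[str]:
    """Lowercase, strip punctuation and articles, split into tokens."""
    text = re.sub(f"[{re.escape(string.punctuation)}]", " ", text.lower())
    return [t for t in text.split() if t not in {"a", "an", "the"}]

def token_f1(prediction: str, gold: str) -> float:
    """Bag-of-tokens F1 between a predicted answer and a gold answer."""
    pred, ref = normalize(prediction), normalize(gold)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```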

Common Sense Benchmarks

Testing common sense in language models is interesting and also key, because it evaluates a model’s ability to make judgments and inferences that align with human reasoning. Unlike humans, who develop a model of the world through practical experience, language models are trained on huge datasets without inherently understanding context. This means models can struggle with tasks requiring an intuitive grasp of everyday situations, logical reasoning, and practical knowledge, all of which are very important for robust and reliable AI applications.

HellaSwag (Harder Endings, Longer contexts, and Low-shot Activities for Situations With Adversarial Generations)

HellaSwag was developed by Rowan Zellers and colleagues at the University of Washington and the Allen Institute for Artificial Intelligence. It is designed to test a model’s ability to predict the most plausible continuation of a given scenario. The benchmark is constructed using Adversarial Filtering (AF), in which a series of discriminators iteratively selects adversarial machine-generated wrong answers. This yields a dataset whose examples are trivial for humans but challenging for models, hitting a “Goldilocks” zone of difficulty.

While HellaSwag was challenging for earlier models, state-of-the-art models like GPT-4 have achieved performance levels close to human accuracy, indicating significant progress in the field. However, these results also suggest the need for continuously evolving benchmarks to keep pace with advancements in AI capabilities.

OpenBookQA

OpenBookQA is modeled after open-book exams and consists of 5,957 elementary-level science multiple-choice questions. The questions are designed to probe understanding of 1,326 core science facts and their application to novel situations, so the benchmark requires reasoning capability beyond simple information retrieval. GPT-4 currently achieves the highest accuracy at 95.9%.

Similar to Hellaswag, earlier models found OpenbookQA challenging, but modern models like GPT-4 have achieved near-human performance levels. This progress underscores the importance of developing even more complex and nuanced benchmarks to continue pushing the boundaries of AI understanding.

Are Benchmarks Enough for LLM Performance Evaluation?

Not entirely. While benchmarks provide a standardized approach to evaluating LLM performance, they can also be misleading. The Large Model Systems Organization says that a good LLM benchmark should be scalable, capable of evaluating new models with a relatively small number of trials, and able to provide a unique ranking order for all models. Still, there are reasons why benchmarks may not be enough. Here are some:

Benchmark Leakage

This is a common issue: it happens when training data overlaps with test data, producing a misleading evaluation. If a model has already encountered some test questions during training, its result may not accurately reflect its true capabilities. An ideal benchmark should minimize memorization and reflect real-world scenarios.

Evaluation Bias

LLM benchmark leaderboards are used to compare LLMs’ performance on various tasks. However, relying on those leaderboards for model comparison can be misleading. Simple changes in benchmark tests, like altering the order of questions, can shift the ranking of models by up to eight positions. LLMs may also perform differently depending on the scoring method, highlighting the importance of accounting for evaluation biases.
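
One practical way to probe for this kind of bias is to re-score the same multiple-choice items with the answer options shuffled and see whether accuracy or rankings move. A rough sketch, where `ask_model` is a placeholder for your model call:

```python
import random

def shuffled_accuracy(questions, ask_model, seed: int = 0) -> float:
    """Re-score a multiple-choice set with the options randomly reordered.

    `questions` is a list of (question_text, choices, gold_index) tuples and
    `ask_model(prompt, num_choices)` is a placeholder that returns the 0-based
    index of the option the model picked.
    """
    rng = random.Random(seed)
    correct = 0
    for text, choices, gold_index in questions:
        order = list(range(len(choices)))
        rng.shuffle(order)                      # new presentation order
        prompt = text + "\n" + "\n".join(
            f"{chr(65 + i)}. {choices[j]}" for i, j in enumerate(order))
        picked = ask_model(prompt, len(choices))
        correct += order[picked] == gold_index  # map back to the original index
    return correct / len(questions)

# Comparing this against the unshuffled score (or across several seeds)
# gives a quick read on how position-sensitive the model's answers are.
```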

Open Endedness

Real-world LLM interaction involves designing prompts to elicit the desired outputs, and those outputs depend heavily on the effectiveness of the prompts. While benchmarks are designed to test an LLM’s context awareness, they do not always translate directly to real-world performance. For example, a model achieving a 100% score on a benchmark dataset, such as the LSAT, does not guarantee the same level of accuracy in practical applications. This underscores the importance of considering the open-ended nature of real-world tasks in LLM evaluation.

Effective Evaluation for Robust LLMs

So now you know that benchmarks alone are not always the best option, because they don’t generalize across all problems. But there are other ways.

Custom Benchmarks

These are perfect for testing specific behaviors and functionalities in task-specific scenarios. Let’s say an LLM is designed for medical staff: datasets collected from medical settings will represent real-world scenarios much more faithfully. Custom benchmarks can focus on domain-specific language understanding, performance, and unique contextual requirements. By aligning the benchmarks with likely real-world scenarios, you can ensure that the LLM not only performs well in general but excels in the specific tasks it’s intended for. This helps identify and address any gaps or weaknesses in the model’s capabilities early on.
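
As a sketch of what a custom benchmark can look like in practice, here is a toy example; the triage-style cases, expected keywords, and scoring rule below are entirely made up for illustration.

```python
# A toy domain-specific benchmark: each case pairs a prompt with keywords the
# answer is expected to mention. The cases and scoring rule are illustrative only.
CASES = [
    {"prompt": "A patient reports chest pain radiating to the left arm. "
               "What should the triage nurse do first?",
     "must_mention": ["emergency", "ecg"]},
    {"prompt": "List two common drug interactions to check before "
               "prescribing warfarin.",
     "must_mention": ["aspirin", "nsaid"]},
]

def score_case(answer: str, must_mention: list[str]) -> float:
    """Fraction of expected keywords that appear in the answer (case-insensitive)."""
    answer = answer.lower()
    return sum(kw in answer for kw in must_mention) / len(must_mention)

def run_benchmark(ask_model) -> float:
    """Average keyword coverage across cases; ask_model(prompt) -> answer string."""
    return sum(score_case(ask_model(c["prompt"]), c["must_mention"])
               for c in CASES) / len(CASES)
```

In a real setting you would replace keyword matching with domain-expert grading or reference answers, but the structure of the harness stays the same.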

Data Leakage Detection Pipeline

If you want your evaluations to have integrity, a leakage-free benchmark pipeline is very important. Data leakage happens when benchmark data is included in the model’s pretraining corpus, resulting in artificially high performance scores. To avoid this, benchmarks should be cross-referenced against pretraining data, and steps should be taken to exclude any previously seen information. This can involve using proprietary or newly curated datasets that are kept separate from the model’s training pipeline, which ensures that the performance metrics you get reflect the model’s ability to generalize.
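
A common first-pass check is n-gram overlap between benchmark items and the pretraining corpus; the GPT-3 paper, for example, used 13-gram overlap in its contamination analysis. The sketch below flags any test item that shares a long n-gram with training text. Real pipelines are more sophisticated (hashing, fuzzy matching, decontamination at scale), but the idea is the same.

```python
def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    """Lowercased word n-grams; 13-grams are a common choice for decontamination."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def flag_leaked_items(benchmark_items: list[str],
                      training_docs: list[str],
                      n: int = 13) -> list[int]:
    """Return indices of benchmark items sharing an n-gram with any training doc."""
    train_grams = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)
    return [i for i, item in enumerate(benchmark_items)
            if ngrams(item, n) & train_grams]
```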

Human Evaluation

Automated metrics on their own can’t capture the full spectrum of a model’s performance, especially when it comes to very nuanced and subjective aspects of language understanding and generation. Here, human evaluation gives a much better assessment:

  • Hiring Professionals: Expert annotators can provide detailed and reliable evaluations, especially for specialized domains.
  • Crowdsourcing: Platforms like Amazon Mechanical Turk allow you to gather diverse human judgments quickly and at little cost.
  • Community Feedback: Using platforms like the LMSYS leaderboard arena, where users can vote and compare models, adds an extra layer of insight. The LMSYS Chatbot Arena Hard, for instance, is particularly effective in highlighting subtle differences between top models through direct user interactions and votes. A sketch of how such pairwise votes become a ranking follows this list.
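
Arena-style leaderboards turn pairwise user votes into a ranking. LMSYS has used Elo-style and Bradley-Terry-style ratings for this; the snippet below is a deliberately simplified online Elo update, just to show the shape of the idea rather than any leaderboard’s exact method.

```python
from collections import defaultdict

def elo_ratings(votes, k: float = 32.0, base: float = 1000.0) -> dict[str, float]:
    """Online Elo over pairwise votes: each vote is (winner_model, loser_model).
    A simplification of how arena-style leaderboards rank models from user votes."""
    ratings = defaultdict(lambda: base)
    for winner, loser in votes:
        # Expected score of the winner under the current ratings
        expected_win = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400))
        ratings[winner] += k * (1.0 - expected_win)
        ratings[loser] -= k * (1.0 - expected_win)
    return dict(ratings)

# Example with two hypothetical models and three votes
print(elo_ratings([("model-a", "model-b"),
                   ("model-a", "model-b"),
                   ("model-b", "model-a")]))
```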

Conclusion

Without evaluation and benchmarking, we would have no way of knowing whether an LLM’s ability to handle real-world tasks is as accurate and applicable as we think it is. But, as I said, benchmarks are not a completely foolproof way to check that: they can leave gaps in how we measure LLM performance, which can in turn slow down the development of LLMs that are truly robust for real work.

In an ideal world, LLMs would understand user queries, identify errors in prompts, complete tasks as instructed, and generate reliable outputs. The results today are already great, but not ideal. This is where task-specific benchmarks prove very helpful, alongside human evaluation and benchmark leakage detection. By using them together, we get a real chance to produce genuinely robust LLMs.