How to design an evaluation metric
Introduction
Designing a metric is easier when you use the same organizing ideas the literature already uses. A major survey of LLM evaluation groups the field into three strands: what to evaluate (tasks and capabilities), where to evaluate (datasets and benchmarks), and how to evaluate (metrics, judges, and protocols).1 That lines up with a sensible build path: clarify the construct, gather benchmark material, then implement scoring.
If your topic is specialized quality (for instance the quality of machine learning explanations), studies that pit LLM judges against people find that models can be informative yet still out of step with humans on some dimensions, so calibration on a shared sample is time well spent.2
Closer to cognitive evaluation, a systematic review of theory of mind benchmarks for LLMs stresses how much conclusions depend on which task and metric you pick, and how easy it is to read too much into a narrow score.3 Keeping that scepticism in mind steers you away from metrics that look decisive but say little about misuse or edge cases.
The sections below turn those ideas into a concise checklist from purpose through to running numbers on data.
Step 1: Decide the purpose of the metric
You need to know what you measure before you can claim an accurate measurement. That means deciding what the metric is supposed to assess: the behaviour or quality you care about in context, not just "overall quality." Our earlier piece Why evaluating your LLM or agent based features matters makes the same point in stronger terms: evaluation should show how well a model does relative to its intended purpose. As that article recalls:
The objective of performance evaluations is to generate quantitative evidence, most commonly in the form of performance metric scores, about how well a model performs with respect to its intended purpose.
If the purpose is vague, the metric might look precise while measuring the wrong thing. In our particular context of designing and implementing an evaluation metric for a LLM, we need to decide what we want to assess relative to the output of the LLM.
Let's say, for example, that we start with the very general idea of assessing whether a text is harmful. The next step is to go down the rabbit hole and question this definition until we get a single atomic answer. This harmfulness can be defined in a lot of ways: the text might be toxic, containing insults, it might be biased, supporting nararatives against a certain group, it might be manipulative, trying to coerce the user into doing something against their own will, it might also have a negative psychological effect on the user. Now, from all these ways in which a text can be defined as having a negative impact on the user, we need to decide which fits best our purpose. Let's say that in our case, we actually want to assess discrimination against a certain group. This means we need to concentrate on assessing the bias of a text.
Step 2: Decide the scaling
Next step is to choose whether the outcome is binary (pass or fail, yes or no) or a score on a scale. If it is numeric, fix the range early (for example zero to one or zero to one hundred) so everyone interprets movement the same way and you will be able to compare runs over time.
A numerical scale is more flexible and allows for more nuanced assessment, but a binary scale is simpler and easier to interpret. If you are not sure, you can start with a numerical scale and later convert it to a binary score, as it is easier to do the conversion this way. Similar to the logits of a neural network that represent the probability of a class, a numerical score can be interpreted as the probability of the text being biased and then converted into a binary label.
One thing to keep in mind is that the scaling should be consistent across different existing metrics. For example, if you are using a numerical scale, you should use the same range across all metrics. If you are using a binary scale, you should use the same threshold across all metrics. This will make it easier to compare the results of different metrics.
Step 3: Research existing implementations
Before you invent everything from scratch, use any LLM powered research assistant to scan for existing solutions: open source frameworks, vendor tooling, blog posts, and GitHub repos where people solved a similar scoring problem. The goal is a shortlist of patterns you can reuse or adapt, not to copy without understanding.
For a bias oriented tools list, for example:
- Responsibly: Toolkit for Auditing and Mitigating Bias and Fairness of Machine Learning Systems (documentation; includes word level word embedding bias metrics such as WEAT)
- LangFair — Python library for use-case level LLM bias and fairness assessments (CVS Health)
- FairLangProc — fairness metrics, datasets and algorithms for bias in NLP
- mlm-bias — Bias evaluation methods for masked language models (PyTorch)
- Biaslyze — the NLP bias identification toolkit
- BIAS_METRIC_RANKING — compare different bias related metrics
- Sentence-Level Stereotype Classifier — DistilBERT stereotype detection (gender, race, profession, religion)
- Bias detection model — DistilBERT sequence classifier for biased text, trained on MBAD (d4data)
- DistilROBERTA fine-tuned for bias detection (neutral vs biased, wikirev-bias)
- bias-detector — RoBERTa fine-tuned for bias in English news (BABE), with the related paper “To Bias or Not to Bias: Detecting bias in News with bias-detector”
Step 4: Research published work
Run a second pass with a research assistant focused on papers and citations. To see how articles cluster and to discover related work through citation links, Connected Papers is a practical way to build a graph around any of the works cited in the introduction (or a closer paper for your niche) and spot connections you might miss from keyword search alone.
For a bias oriented reading list, for example, you might start from:
- An Actionable Framework for Assessing Bias and Fairness in Large Language Model Use Cases — arXiv PDF (Bouchard et al.)
- ROBBIE: Robust Bias Evaluation of Large Generative Language Models — arXiv PDF (Meta)
- Benchmarking Adversarial Robustness to Bias Elicitation in Large Language Models: Scalable Automated Assessment with LLM-as-a-Judge — PDF on SciSpace
- Bias and Fairness in Large Language Models: A Survey — Computational Linguistics PDF (Gallegos et al.)
- An Actionable Framework for Assessing Bias and Fairness in Large Language Model Use Cases — PDF on SciSpace (mirror of the arXiv framework paper)
- LangFair: A Python Package for Assessing Bias and Fairness in Large Language Model Use Cases — PDF on SciSpace (JOSS / software paper)
Step 5: Collect benchmark datasets
As explained in Why evaluating your LLM or agent based features matters, benchmarking and evaluation overlap but are not the same: benchmarks centre on what you test on (datasets and tasks), while evaluation centres on how you score outputs for your purpose. This step is where benchmarking matters most for you. Look for public datasets aimed at the task you care about, download or pin versions, and store them somewhere stable so you can plug them in when the metric is ready.
For a bias oriented collection of Hugging Face datasets, for example:
- SoFa (Social Fairness) — large-scale fairness benchmark for social bias probing
- StereoSet — stereotype bias across gender, race, religion, and profession
- BOLD — Bias in Open-ended Language Generation Dataset (Amazon Science)
- Holistic Bias — descriptor dataset for measuring likelihood bias in LMs (Smith et al., via FairNLP mirror)
- CLEAR-Bias — corpus for adversarial robustness to bias elicitation in LLMs
- Bias_identification — gathered stereotypical bias dataset compiled from StereoSet, CrowS-Pairs, and others
Step 6: Decide on the core of the algorithm
At this point, you have an idea of what is going on in the field and what problems the existing solutions might have. This is the time to decide on the core of the algorithm that will be used to score the text. You now have a scale, a literature and tooling scan, and candidate datasets. Before you write the full spec, fix the central idea of how the score is produced. Typical choices are:
- a deterministic pipeline (rules, string checks, parsers)
- a pre-trained model scorer (for example a classifier or embedding model from Hugging Face)
- an LLM-as-a-judge with a fixed prompt and rubric
- a proprietary model trained on a specific dataset for a specific purpose from scratch
You can also combine these categories if you are clear how each part feeds the outcome.
During this step, you can also capitalize on specific problems or gaps in the existing solutions to design a more efficient and effective metric.
With that core in mind, get practical. Write down what the metric receives as input (raw answer only, retrieved context, tool traces, full conversation), what methods from Step 6 are used, whether you aggregate intermediate results or keep them separate, and how the final score is produced.
This is the spine everything else hangs on. After you have decided on the core of the algorithm and you have a proof-of-concept (either mentally or a tryout Jupyter notebook), you can move on to the next step.
Step 7: The actual coding work
Now that you have a clear understanding of the metric and how it will be implemented in code, the actual coding work can start. If the idea was clearly defined in the previous step, this step should be straightforward, but might still require some trial and error to get the details right.
Among other things, keep in mind:
- Model size — if your pipeline uses a neural or LLM component, the size of the model (parameters, memory, latency) affects cost, batching, and where you can run it; plan for the footprint of the actual checkpoint you ship.
- LLM-as-a-Judge — give clear examples in the judge prompt or rubric and be as specific as possible about what counts as a pass or fail so scores stay stable across runs.
Step 8: Run the metric on your data
After having implemented the metric, you can run it on the datasets you collected or on proprietary data that you have access to. This step will help you to get a clear understanding of how the metric performs and what improvements can be made. If using existing datasets, some of them already have the ground truth labels, which makes it easier to evaluate the metric. At this point, it is very important to make sure your data covers all the corner cases and edge cases you can think of.
Step 9: Analyze and refine
Finally, analyze the results and refine the metric if needed. This step will probably be repeated multiple times. For example:
- LLM-as-a-judge — you might need to refine the prompt
- Pre-trained model — you might need to fine-tune it on a specific dataset
- Proprietary model — you might need to add more data
- Deterministic pipeline — you might need to check if the approach can cover all the corner cases and edge cases
- Combination of these — you might need to check the contribution of each of the approaches to the final score
Footnotes
-
A Survey on Evaluation of Large Language Models. Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, Wei Ye, Yue Zhang, Yi Chang, Philip S. Yu, Qiang Yang, Xing Xie. https://arxiv.org/pdf/2307.03109 ↩
-
Can LLM Assist in the Evaluation of the Quality of Machine Learning Explanations? Bo Wang, Yiqiao Li, Jianlong Zhou, Fang Chen. https://arxiv.org/pdf/2502.20635v1 ↩
-
A Systematic Review on the Evaluation of Large Language Models in Theory of Mind Tasks. Karahan Sarıtaş, Kıvanç Tezören, Yavuz Durmazkeser. https://arxiv.org/pdf/2502.08796 ↩
FAQ
Frequently asked questions
Decide the purpose of the metric and what construct it assesses. You need a clear idea of what you are measuring before the number means anything.
Benchmarks mostly define what you test on: datasets and tasks. After you know your scoring method, collecting benchmark datasets helps you exercise the metric in a controlled way.
Yes. Many teams combine rules or small models with an LLM as judge. Sketch inputs, intermediate steps, and how you aggregate before you commit to one approach.