Introduction
In this article, we compare Aegis with several other evaluation frameworks. First, we look at the metrics available in Aegis, DeepEval, Opik, and Ragas to get a clearer overview of each framework's capabilities. In the second part, we look at some of the corner cases that surfaced in our internal benchmarks, where we ran paired evaluation runs on several datasets. This allows us to compare pass/fail labels across frameworks.
Aegis Metrics vs DeepEval, Opik, and Ragas
The first column lists each Aegis metric. The remaining columns mark whether each vendor documents a built-in metric with the same or a directly comparable intent in its official metric list (one-off custom judge prompts do not count).
Sources: DeepEval’s metrics introduction, Opik’s metrics overview, and Ragas’ list of available metrics.
✅ = yes, the metric is available
❌ = no, no named equivalent found there
| Metric | Aegis | DeepEval | Opik | Ragas |
|---|---|---|---|---|
| Answer Correctness | ✅ | ❌ | ❌ | ✅ (Factual correctness) |
| Answer Relevancy | ✅ | ✅ (Answer relevancy) | ✅ (Answer relevance) | ✅ (Response relevancy) |
| Bias | ✅ | ✅ (Bias) | ❌ | ❌ |
| Content Consistency | ✅ | ❌ | ❌ | ❌ |
| Content Generation Faithfulness | ✅ | ❌ | ❌ | ❌ |
| Context Faithfulness | ✅ | ✅ (Faithfulness) | ✅ (Hallucination) | ✅ (Faithfulness) |
| Context Ranking Precision | ✅ | ✅ (Contextual precision) | ✅ (Context precision) | ✅ (Context precision) |
| Context Recall | ✅ | ✅ | ✅ | ✅ |
| Context Relevancy | ✅ | ✅ (Contextual relevancy) | ❌ | ❌ |
| Context Sufficiency | ✅ | ❌ | ❌ | ❌ |
| Context Waste | ✅ | ❌ | ❌ | ❌ |
| Entity Extraction Faithfulness | ✅ | ❌ | ❌ | ❌ |
| Entity Faithfulness | ✅ | ❌ | ❌ | ✅ (Context entities recall) |
| Evasion Obfuscation | ✅ | ❌ | ❌ | ❌ |
| Factfulness | ✅ | ❌ | ❌ | ✅ (Factual correctness) |
| Faithfulness | ✅ | ✅ | ✅ (Hallucination) | ✅ |
| Format Alignment | ✅ | ❌ | ❌ | ❌ |
| Format Consistency | ✅ | ❌ | ❌ | ❌ |
| Harmfulness | ✅ | ❌ | ✅ (Moderation) | ❌ |
| Instruction Integrity Subversion / Attempts | ✅ | ❌ | ❌ | ❌ |
| Manipulation | ✅ | ❌ | ❌ | ❌ |
| Misinformation | ✅ | ❌ | ❌ | ❌ |
| Misuse | ✅ | ✅ | ❌ | ❌ |
| PII PHI Exfiltration Attempts | ✅ | ❌ | ❌ | ❌ |
| PII PHI Leakage | ✅ | ✅ | ❌ | ❌ |
| Prompt Injection Attempts | ✅ | ❌ | ❌ | ❌ |
| Prompt Quality | ✅ | ✅ (Prompt alignment) | ❌ | ❌ |
| Role Hijacking | ✅ | ❌ | ❌ | ❌ |
| Role Violation | ✅ | ✅ (Role violation) | ❌ | ❌ |
| Structural | ✅ | ✅ (JSON correctness) | ✅ (Heuristic: BLEU, ROUGE, …) | ✅ (BLEU / ROUGE / exact match) |
| Summarization | ✅ | ✅ | ✅ | ✅ |
| System Data Exfiltration Attempts | ✅ | ❌ | ❌ | ❌ |
| System Data Leakage | ✅ | ❌ | ❌ | ❌ |
| Toxicity | ✅ | ✅ | ✅ (Moderation) | ❌ |
Where every vendor shows ❌, the gap is usually filled by custom LLM-as-judge definitions (for example DeepEval’s G-Eval, Opik’s G-Eval, Ragas rubrics) rather than a shipped preset. So while the Aegis metrics that are missing elsewhere can be implemented by hand in the other frameworks, doing so is not straightforward and typically takes considerable trial and error to arrive at a good prompt.
Comparison of the frameworks on various metrics
This note summarises paired evaluation runs from our internal benchmarks. For each suite we compared an Aegis run against the other frameworks on the same dataset and threshold (100 examples per run unless noted):
- Pass agreement — for each case, does Aegis’s pass/fail at its run threshold match the other tool’s pass/fail under the mapping below?
Nothing here claims that one framework dominates another. The point is to make the where and why visible.
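The pass agreement figure defined above can be sketched in a few lines. This is a minimal illustration, not our benchmark code: the function name `pass_agreement` and the toy data are hypothetical, and it assumes a missing pass bit is represented as `None` and that such rows are dropped before comparing.

```python
from typing import Optional, Sequence, Tuple


def pass_agreement(
    aegis_pass: Sequence[Optional[bool]],
    other_pass: Sequence[Optional[bool]],
) -> Tuple[float, int]:
    """Return (agreement fraction, number of paired rows).

    A row counts as paired only when both tools produced a pass bit.
    """
    if len(aegis_pass) != len(other_pass):
        raise ValueError("both runs must cover the same cases in the same order")

    # Keep only rows where both sides have a pass/fail label.
    paired = [
        (a, b)
        for a, b in zip(aegis_pass, other_pass)
        if a is not None and b is not None
    ]
    if not paired:
        raise ValueError("no rows with a pass bit on both sides")

    matches = sum(a == b for a, b in paired)
    return matches / len(paired), len(paired)


# Hypothetical toy data: the last row has no Aegis pass bit and is skipped.
aegis = [True, False, True, True, None]
other = [True, True, True, False, True]
rate, n_paired = pass_agreement(aegis, other)
print(f"{rate:.0%} agreement on {n_paired} paired rows")
```

Returning the paired row count next to the percentage keeps small samples visible, which matters when a suite has far fewer paired rows than cases.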
How passes were derived
- Aegis: pass when the Aegis score meets or exceeds the threshold defined for that run.
- DeepEval: if the platform exposes an explicit above-threshold flag, that determines pass; otherwise compare the normalised DeepEval score × 100 to the same threshold.
- Opik: pass when Opik’s score × 100 meets or exceeds that threshold.
- DeepTeam: pass when the DeepTeam score is 1.0 on its scale.
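The four rules above can be sketched as small predicates. This is an illustration under stated assumptions, not the benchmark's actual code: the function names, the `above_threshold` parameter, and the example threshold of 70 are hypothetical, and it assumes Aegis scores and thresholds sit on a 0-100 scale while DeepEval, Opik, and DeepTeam scores sit on a 0-1 scale, as the × 100 scaling in the rules implies.

```python
from typing import Optional


def aegis_passes(score: float, threshold: float) -> bool:
    """Aegis: pass when the score meets or exceeds the run threshold."""
    return score >= threshold


def deepeval_passes(
    score: float,
    threshold: float,
    above_threshold: Optional[bool] = None,
) -> bool:
    """DeepEval: prefer the explicit flag; otherwise scale the score by 100."""
    if above_threshold is not None:
        return above_threshold
    return score * 100 >= threshold


def opik_passes(score: float, threshold: float) -> bool:
    """Opik: pass when the score scaled by 100 meets or exceeds the threshold."""
    return score * 100 >= threshold


def deepteam_passes(score: float) -> bool:
    """DeepTeam: pass only at the top of its scale."""
    return score == 1.0


# Hypothetical threshold, chosen only for illustration.
THRESHOLD = 70

print(aegis_passes(65, THRESHOLD))       # Aegis score below the bar
print(opik_passes(0.8, THRESHOLD))       # Opik score scaled to 80
print(deepeval_passes(0.4, THRESHOLD))   # no flag, so the scaled score decides
print(deepeval_passes(0.4, THRESHOLD, above_threshold=True))  # flag wins
print(deepteam_passes(0.9))
```

Scores that land exactly on the threshold after scaling can be affected by floating-point rounding, so boundary cases are worth checking by hand.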
Pass agreement
DeepEval
| Suite | Pass agreement |
|---|---|
| Role violation | 60% |
| PII/PHI leakage | 98% |
| Misuse | 96% |
| Bias dataset | 95% |
| Toxicity | 64% |
| Toxicity (English subset, 100 rows) | 73% |
| Format alignment (FOFO) | 95% |
| Context faithfulness (RAGBench train) | 60% |
| Context recall (Garage benchmark) | 96% |
| Context relevancy (Garage benchmark) | 70% |
Opik
| Suite | Pass agreement |
|---|---|
| Bias dataset | 91% |
| Harmfulness | 92% |
| Toxicity | 65% |
| Toxicity (English subset, 100 rows) | 99% |
| Context recall (Garage benchmark) | 94% |
Row coverage on these exports. Format alignment vs DeepEval used 98 paired rows. Context faithfulness, context recall, and context relevancy vs DeepEval used 100, 100, and 10 paired rows respectively, where a paired row is a case that has both an Aegis pass bit and a DeepEval pass bit. The 70% context relevancy figure therefore rests on only 10 rows. On the full 100-row context relevancy sheet, Aegis agreed with the benchmark's expected pass label 57% of the time.
DeepTeam
| Suite | Pass agreement |
|---|---|
| System data exfiltration (IO example) | 92% |
Reading the tables. High pass agreement suggests both systems are separating cases similarly at the chosen threshold, even if the underlying judge differs. Low agreement is a signal to spot-check what object is graded and how passes were derived before you treat either tool as ground truth.
When Aegis catches a corner case
This section focuses on cases where Aegis scores below its threshold while the comparison tool still passes, using the rules above. In practical terms, Aegis fails the case and the other tool passes it. That pattern usually means Aegis is stricter on this sample and threshold pairing, or that the two systems weight different signals. It is not, by itself, proof of ground truth.
How many cases (Aegis fails, comparison passes)
| Comparison | Cases |
|---|---|
| DeepEval — role violation | 9 |
| DeepEval — PII/PHI | 0 |
| DeepEval — misuse | 0 |
| DeepEval — bias | 2 |
| DeepEval — toxicity | 34 |
| DeepEval — format alignment (FOFO) | 0 |
| DeepEval — context faithfulness (RAGBench train) | 26 |
| DeepEval — context recall (Garage benchmark) | 1 |
| DeepEval — context relevancy (Garage benchmark) | 1 |
| Opik — bias | 3 |
| Opik — harmfulness | 8 |
| Opik — toxicity | 31 |
| Opik — context recall (Garage benchmark) | 1 |
| DeepTeam — system data exfiltration (IO example) | 0 |
Note on row counts: context relevancy vs DeepEval shows 1 case in 10 paired rows on this export. Other suites use 100 cases (or 98 paired rows for format alignment) unless noted above.
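The counts in the table above come from splitting disagreements by direction. A minimal sketch of that split, assuming pass bits are booleans with `None` for a missing label; the function name, the dictionary keys, and the toy data are hypothetical:

```python
from collections import Counter
from typing import Optional, Sequence


def disagreement_breakdown(
    aegis_pass: Sequence[Optional[bool]],
    other_pass: Sequence[Optional[bool]],
) -> Counter:
    """Count agreements and each direction of disagreement on paired rows."""
    counts: Counter = Counter()
    for a, b in zip(aegis_pass, other_pass):
        if a is None or b is None:
            continue  # unpaired row: one side has no pass bit
        if a == b:
            counts["agree"] += 1
        elif b:
            # Aegis failed the case, the comparison tool passed it.
            counts["aegis_fail_other_pass"] += 1
        else:
            # Aegis passed the case, the comparison tool failed it.
            counts["aegis_pass_other_fail"] += 1
    return counts


# Hypothetical toy data.
aegis = [True, False, False, True, False]
other = [True, True, False, False, True]
print(disagreement_breakdown(aegis, other))
```

Keeping both directions separate matters because a low agreement figure can be driven by either tool being stricter; the table above reports only the Aegis-fails, comparison-passes direction.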
Examples worth inspecting by hand
Role violation: Aegis vs DeepEval. In one recurring pattern, the assistant offers clinical-sounding guidance, for example, “Try raising the dose and see if the rash clears in 48 hours,” even though the benchmark had framed the model as a nonclinical agent (such as general customer support). Aegis treats that reply as breaking the role boundary, while DeepEval can still record a pass if its rubric rewards a cooperative tone rather than strictly enforcing the stated mandate.
Bias: Aegis vs DeepEval. On one item, the assistant praises a country but adds a sweeping negative stereotype about its population. On another, it leans on a broad psychological generalisation, such as “People who talk a lot are usually trying to hide their insecurity.” Aegis scores 0 or 40 on those items, while DeepEval still records a pass. Written rationales can even contradict the numeric label (“score is 0.00…”), so treat prose explanations as debugging hints rather than a second source of truth.
Bias: Aegis vs Opik. Stereotypes about a country’s leadership and symmetric put-downs of talkative versus quiet people show the same shape. Aegis often lands near 40 on those replies, while Opik’s moderation note reports no policy violation at 1.0.
Harmfulness: Aegis vs Opik. Manipulative signup pressure is one recurring pattern, with lines like “Everyone else already joined” and “Last chance,” and in those rows Aegis sits near 65 while Opik still passes at 0.8. Bluntly shaming or condescending criticism is another pattern: Aegis can sit near 42.5 while Opik scores 1.0 and labels the content safe.
Toxicity: Aegis vs Opik. Many disagreements hinge on passive-aggressive or minimising phrasing written as full replies. A backhanded compliment might read, “That was a solid deck for someone at your level.” Sarcastic thanks can sound polite while implying disrespect. DARVO-style denial might read, “You’re being too sensitive, nobody else has a problem,” and some items add a veiled threat to escalate to a manager. In those cases Aegis often lands in the 40 to 75 band, while Opik assigns 0.8 to 1.0.
The next four paragraphs focus only on format alignment and RAG metrics from the same Excel exports, and only on rows where Aegis failed while DeepEval or Opik still passed, i.e. where Aegis surfaces a corner case the other tool accepted. Format alignment is called out explicitly below because that slice was empty on our FOFO paired export.
Format alignment: Aegis vs DeepEval. On the FOFO paired export (98 rows), there were no cases where Aegis failed while DeepEval still passed, so there is no “Aegis caught it, DeepEval missed it” example from this run. Every format disagreement we saw was the opposite (Aegis pass, DeepEval fail on stricter structure checks).
Context faithfulness: Aegis vs DeepEval. The examples here sit in the slice where Aegis is below threshold but DeepEval still passes (26 of 100 rows on RAGBench train). In one item, the user asks whether a party may solicit or hire the counterparty’s employees or contractors; the model says yes and cites a section of the exhibit, DeepEval passes, and Aegis fails when span-level alignment to the license agreement body is judged too loose. In another, the user asks for the 2007 ratio of shares granted to shares vested from Humana footnotes; the model assembles grant and vesting counts (on the order of 852k granted in 2007) into a ratio story from the context, DeepEval passes, and Aegis fails where the rubric expects closer lockstep to the cited lines.
Context recall: Aegis vs DeepEval and Opik. Only one Garage benchmark row had Aegis fail while both DeepEval and Opik still passed. The user asks how Tyler, The Creator’s Chromakopia influenced contemporary music; the model writes a fluent overview of genre blending and artistic evolution, both comparison tools pass, and Aegis fails because the retrieved passages do not clearly contain the specific claims the answer advances.
Context relevancy: Aegis vs DeepEval. Among the 10 rows on this export that had both an Aegis and a DeepEval pass bit, one had Aegis fail while DeepEval passed. The user asks about COVID-19 and global economic inequality; the model answers with a broad inequality narrative while the retrieved chunk emphasises long COVID and household finances in the United States. DeepEval still passes on loose topical overlap; Aegis fails because the passages are not a close fit to what the answer leans on.
Where agreement breaks down
Role violation landed at 60% agreement with DeepEval, tied with context faithfulness for the lowest pass agreement in this batch. Role rubrics are sensitive to how the expected persona is stated, what negation is allowed, and whether multi-turn context is included. If your product policy differs from either judge’s prompt, you should expect exactly this kind of spread.
Toxicity splits in two interesting ways. Against DeepEval, the toxicity suite shows 64% agreement on the latest paired runs we compared, and the English subset shows 73%. Against Opik on the same Aegis cases, agreement on that suite is 65%, while the English subset jumps to 99%. That gap suggests that backend model choice and preprocessing move toxicity judgments more than the framework name on the tin.
Practical takeaways
- Treat pass agreement at a shared threshold as a sanity check, not a single fitness function.
- When agreement is high but case-level scores look unrelated, or agreement is low, inspect what text each tool graded and how passes were derived before calibrating thresholds.
- When prompts, model versions, or thresholds change, re-run paired evaluations and recompute agreement; small drift can move headline percentages by double digits.
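To see how much a threshold change alone can move the headline number, one option is to recompute agreement across a few thresholds on the same scores. A minimal sketch, assuming Aegis scores on a 0-100 scale and the comparison tool on 0-1; the function name `agreement_at`, the scores, and the thresholds are all hypothetical:

```python
from typing import Sequence


def agreement_at(
    threshold: float,
    aegis_scores: Sequence[float],
    other_scores: Sequence[float],
) -> float:
    """Pass agreement when both tools are cut at the same threshold."""
    matches = sum(
        (a >= threshold) == (o * 100 >= threshold)
        for a, o in zip(aegis_scores, other_scores)
    )
    return matches / len(aegis_scores)


# Hypothetical scores for five cases.
aegis_scores = [40.0, 65.0, 42.5, 75.0, 90.0]
other_scores = [1.0, 0.8, 1.0, 0.8, 0.9]

for threshold in (50, 70, 85):
    rate = agreement_at(threshold, aegis_scores, other_scores)
    print(f"threshold {threshold}: {rate:.0%} agreement")
```

On real exports the same loop can be run after every prompt, model, or threshold change, so that shifts in agreement are traced to a specific change rather than discovered later.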
If you are designing metrics rather than comparing vendors, the step-by-step build path in How to design an evaluation metric pairs naturally with this kind of cross-tool rehearsal: lock the construct first, then line up external judges as secondary witnesses, not as silent replacements.
FAQ
We used **paired evaluation runs** on the same suites: the **latest** complete run on each side for every benchmark, with the same **100 cases per suite** in the same order so each index is the same prompt-and-output pair across tools. Aegis’s threshold is the one recorded for that Aegis run. DeepEval passes when its above-threshold signal says so, otherwise its normalised score is compared to the same bar. Opik passes when its score, scaled to a 0–100 style range, clears that bar. DeepTeam passes when its score equals 1.0.
Pass/fail is a single cut at one threshold. Two tools can agree on most labels while still assigning very different numeric scores to individual cases — or disagree on labels while scores look similar — so headline agreement alone does not prove the judges are interchangeable.
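A toy example makes the gap between labels and scores concrete. The sketch below reports pass agreement next to the mean absolute score gap on a common 0-100 scale; the function name, the scores, and the threshold of 70 are hypothetical, and the comparison tool is assumed to score on 0-1.

```python
from typing import Sequence, Tuple


def agreement_and_score_gap(
    aegis_scores: Sequence[float],
    other_scores: Sequence[float],
    threshold: float,
) -> Tuple[float, float]:
    """Return (pass agreement, mean absolute score gap on a 0-100 scale)."""
    scaled = [o * 100 for o in other_scores]
    n = len(scaled)
    agreement = sum(
        (a >= threshold) == (s >= threshold)
        for a, s in zip(aegis_scores, scaled)
    ) / n
    mean_gap = sum(abs(a - s) for a, s in zip(aegis_scores, scaled)) / n
    return agreement, mean_gap


# Hypothetical scores: every label matches, yet the scores differ widely.
aegis_scores = [72.0, 95.0, 30.0, 10.0]
other_scores = [0.99, 0.71, 0.05, 0.45]

agreement, mean_gap = agreement_and_score_gap(aegis_scores, other_scores, 70)
print(f"agreement {agreement:.0%}, mean score gap {mean_gap:.1f} points")
```

Here both tools agree on every label while their scores sit more than twenty points apart on average, which is why headline agreement is best read alongside case-level scores.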
Judges, prompts, and construct definitions differ. On security-style tasks it is especially important to check whether a metric grades the user message, the assistant reply, tool steps, or something else. If one pipeline scores an empty assistant output as a pass while another flags risky content in the user turn alone, pass agreement can collapse even when the underlying case list is aligned.
It means Aegis is below its threshold for that case while the other tool still passes under the rules above. That pattern suggests Aegis is **stricter for this sample and threshold pairing**, or that the two sides are measuring slightly different constructs — not automatic proof that one verdict is ground truth. Spot-check judge rationales where tools disagree, especially if scores and pass bits look inconsistent.
No. These numbers are diagnostics for **calibration and construct alignment**, not a leaderboard. Use them to see where an external evaluator matches Aegis pass decisions and where you need custom rubrics, thresholds, or evaluation objects tuned to your product.