Aegis alongside DeepEval, Opik, and DeepTeam: what paired runs showed us
We ran each framework on the same test cases, in the same order and with the same thresholds, then compared the resulting pass/fail labels. Agreement ranged from strong alignment on several suites to sharp splits where the frameworks' rubrics measure different things.
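
As a rough illustration of what comparing pass/fail labels on paired runs can look like, here is a minimal sketch in plain Python. It assumes each framework's results have already been reduced to a list of booleans in the same test-case order. The names `agreement_rate`, `pairwise_agreement`, and the `framework_a`/`framework_b`/`framework_c` keys are hypothetical, and the labels are made-up placeholders, not results from our runs. It does not call any framework's API.

```python
from itertools import combinations


def agreement_rate(a, b):
    """Fraction of test cases on which two label lists agree.

    Both lists must be in the same test-case order (True = pass).
    """
    if len(a) != len(b):
        raise ValueError("paired runs must cover the same ordered test cases")
    if not a:
        raise ValueError("no test cases to compare")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def pairwise_agreement(labels):
    """Agreement rate for every unordered pair of frameworks."""
    return {
        (f1, f2): agreement_rate(labels[f1], labels[f2])
        for f1, f2 in combinations(labels, 2)
    }


# Hypothetical placeholder labels (assumption: illustrative only,
# not measured output from any of the frameworks discussed here).
labels = {
    "framework_a": [True, True, False, True],
    "framework_b": [True, False, False, True],
    "framework_c": [True, True, False, True],
}

rates = pairwise_agreement(labels)
for pair, rate in rates.items():
    print(pair, rate)
```

Raw agreement is the simplest summary. When one label dominates a suite, a chance-corrected statistic such as Cohen's kappa may give a fairer picture, though that is a design choice rather than something this comparison requires.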