AI quality is multi-dimensional

Aegis organises AI evaluation into distinct metric categories, each addressing a critical dimension of trust. Together, they provide a complete picture of how models and agents behave across use cases and environments. Each category is evaluated independently, scored on a 0-100 scale, and can be tracked over time.

Metric Categories

Coverage at a Glance

Aegis organises evaluations into focused dimensions so teams can ship with confidence.

General Performance

Measures core response quality, ensuring outputs are accurate, relevant, and consistent with the user’s intent across common AI tasks. Examples: Factfulness · Answer Relevancy · Content Consistency · Summarization.

Retrieval-Augmented Generation (RAG)

Evaluates how effectively retrieved context is used, helping prevent hallucinations and ensuring responses are grounded in enterprise knowledge. Examples: Context Relevancy · Context Sufficiency · Context Recall · Context Faithfulness.

Security

Assesses resilience against adversarial inputs and data exposure, identifying vulnerabilities that could lead to prompt manipulation or sensitive data leakage. Examples: Role Hijacking · Instruction Integrity Subversion · System Data Leakage · PII-PHI Leakage.

Safety

Detects harmful, misleading, or policy-violating behavior that could impact users, brand trust, or regulatory compliance. Examples: Misinformation · Misuse · Role Violation.

Structural Integrity

Validates that AI outputs conform to required formats, schemas, and structural constraints for safe downstream processing. Examples: JSON Schema Match · XML Schema Match · Exact Match · Is Valid JSON.
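To make the structural checks concrete, here is a minimal sketch of what "Is Valid JSON" and "Exact Match" style checks could look like. It is not Aegis's implementation: the function names `is_valid_json`, `exact_match`, and `to_score` are hypothetical, and mapping a binary check to 0 or 100 is an assumption made only to line up with the 0-100 scale.

```python
import json


def is_valid_json(output: str) -> bool:
    """Return True if the text parses as JSON (hypothetical check, not Aegis's API)."""
    try:
        json.loads(output)
    except (json.JSONDecodeError, TypeError):
        return False
    return True


def exact_match(output: str, expected: str) -> bool:
    """Return True if output equals expected, ignoring surrounding whitespace."""
    return output.strip() == expected.strip()


def to_score(passed: bool) -> int:
    """Map a binary check onto a 0-100 scale (assumed mapping: 100 or 0)."""
    return 100 if passed else 0
```

For example, `to_score(is_valid_json('{"status": "ok"}'))` gives 100, while an output with unquoted keys such as `{status: ok}` gives 0.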

Alignment & Output Control

Ensures responses follow expected structure, format, and generation constraints, enabling predictable and controllable AI behavior. Examples: Format Alignment · Format Consistency · Prompt Quality · Content Generation Faithfulness.

How Aegis metrics work

A hybrid approach: objectivity and nuance. Each metric in Aegis produces a score on a 0-100 scale rather than a simple pass/fail result.
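The difference between a graded score and a pass/fail result can be sketched as follows. This is an illustration only: how Aegis computes its scores is not described here, so the averaging rule, the per-sample scores in the range 0 to 1, the pass threshold, and the names `metric_score`, `passes`, and `regressed` are all assumptions.

```python
from statistics import mean


def metric_score(sample_scores: list[float]) -> float:
    """Average per-sample scores (assumed to lie in [0, 1]) and rescale to 0-100."""
    if not sample_scores:
        raise ValueError("at least one sample score is required")
    # Clamp out-of-range values so the result always stays within 0-100.
    clipped = [min(1.0, max(0.0, s)) for s in sample_scores]
    return round(100 * mean(clipped), 1)


def passes(score: float, threshold: float = 70.0) -> bool:
    """Collapse a graded score into pass/fail (the threshold is hypothetical)."""
    return score >= threshold


def regressed(previous: float, current: float, tolerance: float = 2.0) -> bool:
    """Flag a drop between two runs that exceeds the tolerance, in score points."""
    return previous - current > tolerance
```

With these assumptions, a run that scores 85.0 and a later run that scores 75.0 both pass a threshold of 70, yet `regressed(85.0, 75.0)` is true. A pass/fail result alone would hide that drop.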

See your AI through a clearer lens

Run structured evaluations, track regressions, and understand how your models and agents behave across performance, safety, and alignment.