Judges

OntoEval provides two evaluation methods (judges) for comparing AI-generated changes against ground truth changes:

Available Judges

  • Metadiff Judge - Structural diff comparison with precision/recall metrics
  • LLM Judge - AI-powered semantic evaluation using GPT-4o
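
As a rough illustration of the precision/recall idea behind the Metadiff Judge, the sketch below scores one set of serialized change operations against another. The function name, serialization format, and set-based comparison are assumptions for illustration, not the actual OntoEval implementation:

```python
# Illustrative only: set-based precision/recall over serialized change
# operations, not the actual OntoEval Metadiff Judge.

def score_structural_diff(predicted: set[str], ground_truth: set[str]) -> dict[str, float]:
    """Compare predicted change operations against ground-truth ones."""
    true_positives = predicted & ground_truth
    precision = len(true_positives) / len(predicted) if predicted else 0.0
    recall = len(true_positives) / len(ground_truth) if ground_truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


predicted = {"add_subclass(NewTerm, ParentTerm)", "add_definition(NewTerm)"}
ground_truth = {"add_subclass(NewTerm, ParentTerm)", "add_synonym(NewTerm)"}
print(score_structural_diff(predicted, ground_truth))  # precision=0.5, recall=0.5
```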

Overview

Both judges take two diffs as input - typically the AI-generated diff and the human-generated ground-truth diff - and produce evaluation metrics. The judges can be used independently or together to provide comprehensive evaluation from different perspectives.
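
A minimal sketch of this shared contract, using hypothetical names rather than the real OntoEval API, is that each judge exposes a single call that maps a pair of diffs to a dictionary of metrics:

```python
# Hypothetical interface sketch; class and method names are assumptions,
# not the actual OntoEval API.
from typing import Protocol


class Judge(Protocol):
    """Shared contract: two diffs in, a dictionary of metrics out."""

    def evaluate(self, predicted_diff: str, ground_truth_diff: str) -> dict:
        ...
```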

Common Use Cases

  • Benchmarking: Evaluate how well AI agents perform on ontology editing tasks
  • Debugging: Understand where AI agents are making mistakes
  • Comparison: Compare different AI agents or configurations
  • Quality Assessment: Validate that changes meet expected standards

Judge Selection

Choose your judge based on your evaluation needs:

  • Use Metadiff Judge for fast, objective structural comparison
  • Use LLM Judge for nuanced semantic evaluation and detailed feedback
  • Use both for comprehensive analysis combining structural and semantic perspectives
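
For the combined approach, a rough sketch (again with hypothetical judge classes standing in for whatever OntoEval actually exposes) would run both judges over the same pair of diffs and merge their outputs into one report:

```python
# Hypothetical combination of both judges; metadiff_judge and llm_judge are
# assumed to follow the two-diffs-in, metrics-out contract sketched above.

def comprehensive_report(predicted_diff: str, ground_truth_diff: str,
                         metadiff_judge, llm_judge) -> dict:
    """Merge structural metrics with semantic evaluation into one report."""
    return {
        "structural": metadiff_judge.evaluate(predicted_diff, ground_truth_diff),
        "semantic": llm_judge.evaluate(predicted_diff, ground_truth_diff),
    }
```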