Evaluations

Metacoder provides a powerful evaluation framework built on DeepEval, an open-source library for evaluating LLM outputs. This integration enables systematic testing of AI coders across different models, tasks, and metrics.

Why Evaluate AI Coders?

Evaluating AI coding assistants is crucial for:

  • Performance Comparison: Compare different coders on the same tasks
  • Model Selection: Test various LLMs to find the best fit
  • Regression Testing: Ensure changes don't degrade performance
  • Tool Integration: Validate MCP and tool usage accuracy
  • Reproducible Research: Create benchmarks for academic papers

Key Features

  • 40+ Ready-to-Use Metrics: Access DeepEval's comprehensive metric suite
  • LLM-Powered Evaluation: Use any LLM as a judge to score outputs against qualitative criteria
  • Flexible Integration: Compatible with all DeepEval metrics and custom evaluations
  • Reproducible Benchmarks: Systematic testing across model × coder × case × metric combinations
  • MCP Support: Test coders with external tools and services

How It Works

The evaluation system runs a matrix of tests:

  1. Models: The underlying LLMs (GPT-4, Claude, etc.)
  2. Coders: Various coding assistants (Claude Code, Goose, Codex, etc.)
  3. Cases: Test scenarios with inputs and expected outputs
  4. Metrics: Quality measures from DeepEval

Each combination produces a scored result, enabling comprehensive comparisons.
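
In code, the matrix is simply a cross product over the four dimensions. The sketch below is illustrative only: run_case and score_output are hypothetical placeholders, not Metacoder's internal API.

# Illustrative sketch of the evaluation matrix; run_case/score_output
# are hypothetical placeholders, not Metacoder internals.
from itertools import product

models = ["gpt-4o", "claude-sonnet-4"]
coders = ["claude", "goose"]
cases = [{"name": "hello_fn", "input": "Write a hello() function"}]
metrics = ["AnswerRelevancyMetric"]

results = []
for model, coder, case, metric in product(models, coders, cases, metrics):
    # output = run_case(coder, model, case)       # hypothetical runner
    # score = score_output(metric, case, output)  # hypothetical scorer
    results.append((model, coder, case["name"], metric))

print(len(results))  # 2 models x 2 coders x 1 case x 1 metric = 4 results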

Quick Start

# Run evaluation suite
metacoder eval tests/input/example_eval_config.yaml

# Compare specific coders
metacoder eval my_evals.yaml -c claude -c goose

# Custom output location
metacoder eval my_evals.yaml -o results.yaml
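
The YAML files passed to metacoder eval declare the four matrix dimensions described above. The snippet below generates a minimal config as a sketch only; the field names are illustrative, not Metacoder's documented schema.

# Hypothetical eval config; field names are illustrative, not
# Metacoder's documented schema. Requires PyYAML (pip install pyyaml).
import yaml

config = {
    "models": ["gpt-4o", "claude-sonnet-4"],
    "coders": ["claude", "goose"],
    "cases": [{
        "name": "hello_fn",
        "input": "Write a Python function that returns 'hello'",
        "expected_output": "def hello",
    }],
    "metrics": ["AnswerRelevancyMetric"],
}

with open("my_evals.yaml", "w") as f:
    yaml.safe_dump(config, f, sort_keys=False)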

DeepEval Integration

Metacoder's evaluation system leverages DeepEval's powerful features:

  • Dynamic Metric Loading: Any DeepEval metric can be used by name (see the sketch after this list)
  • LLMTestCase Compatibility: Our EvalCase model maps to DeepEval's test case format
  • Flexible Scoring: All DeepEval scoring mechanisms and thresholds are supported
  • Custom Metrics: Create your own metrics using DeepEval's abstractions
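
These points map directly onto DeepEval's Python API. The following is a minimal sketch, not Metacoder's actual conversion code, and it assumes a judge model is configured (DeepEval defaults to OpenAI, so OPENAI_API_KEY must be set).

# Minimal DeepEval sketch: resolve a metric by name, map an
# EvalCase-like record onto LLMTestCase, and score it. The mapping
# shown is illustrative; Metacoder's actual conversion may differ.
import deepeval.metrics
from deepeval.test_case import LLMTestCase

metric_cls = getattr(deepeval.metrics, "AnswerRelevancyMetric")
metric = metric_cls(threshold=0.7)  # standard DeepEval threshold

test_case = LLMTestCase(
    input="Write a Python function that returns 'hello'",
    actual_output="def hello():\n    return 'hello'",
    expected_output="A function named hello returning 'hello'",
)

metric.measure(test_case)  # runs the LLM-as-judge evaluation
print(metric.score, metric.reason)

Because the metric class is resolved by name, any metric DeepEval ships (or a custom subclass on the import path) works without changes to the harness.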

Next Steps