Configuration Reference
This page provides detailed documentation for evaluation configuration using the data models from `metacoder.evals.eval_model`.
Configuration Models
The evaluation system uses Pydantic models for configuration validation and type safety.
EvalCase
Individual test case configuration:
Bases: BaseModel
A single evaluation test case.
Defines an individual test scenario with input, expected output, and evaluation criteria. Each case is run through all specified metrics to produce scored results.
Example
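A minimal illustrative case, written as it would appear under a dataset's `cases:` list. The concrete values are hypothetical; the field names follow the schema documented under Configuration File Structure below.

```yaml
- name: capital_of_france              # unique case identifier
  metrics: [CorrectnessMetric]         # metrics applied to this case
  input: "What is the capital of France?"
  expected_output: "Paris"
  threshold: 0.7                       # pass threshold (default 0.7)
  tags: [geography, smoke_test]
```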
EvalDataset
Complete evaluation dataset configuration:
Bases: BaseModel
Defines the full evaluation suite including models, coders, test cases, and optional MCP servers. The evaluation runner will execute all combinations of model × coder × case × metric.
Example
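A sketch of a small dataset. The model identifier, coder, and case shown here are placeholders, but every field name is documented under Configuration File Structure below.

```yaml
name: basic_qa_eval
description: Small question-answering smoke test

coders:
  claude: {}                           # use the default coder configuration

models:
  gpt-4o:
    provider: openai
    name: gpt-4o                       # placeholder model identifier

cases:
  - name: capital_of_france
    metrics: [CorrectnessMetric]
    input: "What is the capital of France?"
    expected_output: "Paris"
```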
Configuration File Structure
A complete evaluation configuration file contains:
```yaml
# Dataset metadata
name: string                       # Required: Name of the evaluation dataset
description: string                # Optional: Description of what's being tested

# Coders to test
coders:                            # Optional: Defaults to all available coders
  claude: {}                       # Empty dict uses default configuration
  goose:
    custom_option: value           # Coder-specific configuration

# AI models configuration
models:                            # Required: At least one model
  model_name:
    provider: string               # Provider name (openai, anthropic, etc.)
    name: string                   # Model identifier

# MCP servers
servers:                           # Optional: MCP server configurations
  server_name:
    name: string
    command: string
    args: [list]
    env:
      KEY: value

# Test cases
cases:                             # Required: At least one test case
  - name: string                   # Required: Unique case identifier
    metrics: [list]                # Required: List of metric names
    input: string                  # Required: Input prompt/question
    expected_output: string        # Optional: Expected response
    retrieval_context: string|list # Optional: Context for RAG metrics
    threshold: float               # Optional: Pass threshold (default: 0.7)
    context: [list]                # Optional: Additional context
    additional_metadata: dict      # Optional: Extra metadata
    comments: string               # Optional: Notes about the test
    tags: [list]                   # Optional: Tags for filtering
```
Field Details
Model Configuration
The `models` field maps model names to their configurations:
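For example (the keys and model identifiers below are placeholders; `provider` and `name` are the documented fields):

```yaml
models:
  gpt-4o:                  # arbitrary key used to refer to this model elsewhere
    provider: openai       # provider name (openai, anthropic, etc.)
    name: gpt-4o           # model identifier understood by the provider
  sonnet:
    provider: anthropic
    name: claude-sonnet-4  # placeholder identifier
```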
Server Configuration
The `servers` field defines MCP servers available for evaluation:
```yaml
servers:
  pubmed:
    name: pubmed
    command: uvx
    args: [mcp-simple-pubmed]
    env:
      PUBMED_EMAIL: ${PUBMED_EMAIL}
    type: stdio
    enabled: true
```
Metrics
Metrics can be specified by name. Common metrics include:
- `CorrectnessMetric` - Built-in factual accuracy metric
- `AnswerRelevancyMetric` - How well the answer relates to the question
- `FaithfulnessMetric` - Factual accuracy based on context
- `HallucinationMetric` - Detects false information
- `GEval` - Custom criteria evaluation
See DeepEval metrics for the complete list.
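In a configuration file, metrics are referenced by name in a case's `metrics` list, for example:

```yaml
metrics: [CorrectnessMetric, AnswerRelevancyMetric]
```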
Test Case Fields
- name: Unique identifier for the test case
- metrics: List of metric names to apply
- input: The prompt or question to send to the coder
- expected_output: The expected response (for comparison metrics)
- retrieval_context: Context documents for RAG metrics (string or list)
- threshold: Score threshold for passing (0.0-1.0, default 0.7)
- context: Additional context for the case (list)
- additional_metadata: Dictionary of extra data for custom metrics
- comments: Human-readable notes about the test
- tags: List of tags for categorization/filtering
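Putting these fields together, a case exercising the optional fields might look like this (all values are illustrative):

```yaml
cases:
  - name: pubmed_summary
    metrics: [FaithfulnessMetric, HallucinationMetric]
    input: "Summarize the key finding of the retrieved abstract."
    retrieval_context:
      - "Abstract text providing the grounding context for the answer."
    threshold: 0.8
    additional_metadata:
      source: manual_review
    comments: Checks that the summary stays grounded in the abstract.
    tags: [rag, literature]
```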
Command Line Usage
```bash
# Basic evaluation
metacoder eval config.yaml

# Specify output file
metacoder eval config.yaml -o results.yaml

# Test specific coders
metacoder eval config.yaml -c claude -c goose

# Custom working directory
metacoder eval config.yaml -w ./eval_workspace

# Verbose output
metacoder eval config.yaml -v
```
Important Notes
The output of evaluations can be large, particularly the `execution_metadata` field, which contains a full trace of everything the agent did, including tool responses. For something like PubMed full-text retrieval this trace can be very large, so you may wish to partition evals into sub-evals.
See Also
- Examples - Complete configuration examples
- Evaluation Introduction - Overview of the evaluation system