Configuration Reference

This page documents the evaluation configuration format, which is defined by the data models in metacoder.evals.eval_model.

Configuration Models

The evaluation system uses Pydantic models for configuration validation and type safety.
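For example, a configuration file can be loaded and validated in a few lines. This is a minimal sketch, assuming the models expose the standard Pydantic v2 API (model_validate); with Pydantic v1 the equivalent call would be parse_obj.

import yaml
from metacoder.evals.eval_model import EvalDataset

# Load the YAML configuration (config.yaml as in the CLI examples below)
with open("config.yaml") as f:
    raw = yaml.safe_load(f)

# Validate the structure; invalid or missing fields raise a ValidationError
dataset = EvalDataset.model_validate(raw)
print(dataset.name, len(dataset.cases))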

EvalCase

Individual test case configuration:

Bases: BaseModel

A single evaluation test case.

Defines an individual test scenario with input, expected output, and evaluation criteria. Each case is run through all specified metrics to produce scored results.

Example
name: "title_extraction"
metrics: [CorrectnessMetric]
input: "What is the title of PMID:12345?"
expected_output: "Example Paper Title"
threshold: 0.9
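
The same case can also be constructed directly in Python. A minimal sketch, assuming the YAML keys above map one-to-one onto EvalCase attributes and that metrics are given as metric names:

from metacoder.evals.eval_model import EvalCase

case = EvalCase(
    name="title_extraction",
    metrics=["CorrectnessMetric"],   # metrics referenced by name
    input="What is the title of PMID:12345?",
    expected_output="Example Paper Title",
    threshold=0.9,
)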

EvalDataset

Complete evaluation dataset configuration:

Bases: BaseModel

Complete evaluation dataset configuration.

Defines the full evaluation suite including models, coders, test cases, and optional MCP servers. The evaluation runner will execute all combinations of model × coder × case × metric.

Example
name: "pubmed_evaluation"
description: "Test PubMed MCP integration"
models:
  gpt-4o:
    provider: openai
    name: gpt-4
coders:
  claude: {}
servers:
  pubmed:
    name: pubmed
    command: uvx
    args: [mcp-simple-pubmed]
cases:
  - name: "test1"
    metrics: [CorrectnessMetric]
    input: "What is 1+1?"
    expected_output: "2"
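
As noted above, the runner executes every combination of model × coder × case × metric. The snippet below is only a hypothetical illustration of that expansion (not metacoder's actual runner code), assuming models and coders are mappings from name to configuration as in the YAML above and that all sections are populated:

import yaml
from itertools import product
from metacoder.evals.eval_model import EvalDataset

with open("config.yaml") as f:  # placeholder path
    dataset = EvalDataset.model_validate(yaml.safe_load(f))

# One run per model x coder x case x metric combination
runs = [
    (model_name, coder_name, case.name, metric)
    for model_name, coder_name, case in product(dataset.models, dataset.coders, dataset.cases)
    for metric in case.metrics
]
print(f"{len(runs)} evaluation runs")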

Configuration File Structure

A complete evaluation configuration file contains:

# Dataset metadata
name: string                    # Required: Name of the evaluation dataset
description: string             # Optional: Description of what's being tested

# Coders to test
coders:                         # Optional: Defaults to all available coders
  claude: {}                    # Empty dict uses default configuration
  goose: 
    custom_option: value        # Coder-specific configuration

# AI models configuration
models:                         # Required: At least one model
  model_name:
    provider: string            # Provider name (openai, anthropic, etc.)
    name: string                # Model identifier

# MCP servers
servers:                        # Optional: MCP server configurations
  server_name:
    name: string
    command: string
    args: [list]
    env: 
      KEY: value

# Test cases
cases:                          # Required: At least one test case
  - name: string                # Required: Unique case identifier
    metrics: [list]             # Required: List of metric names
    input: string               # Required: Input prompt/question
    expected_output: string     # Optional: Expected response
    retrieval_context: string|list  # Optional: Context for RAG metrics
    threshold: float            # Optional: Pass threshold (default: 0.7)
    context: [list]             # Optional: Additional context
    additional_metadata: dict   # Optional: Extra metadata
    comments: string            # Optional: Notes about the test
    tags: [list]                # Optional: Tags for filtering
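
Because the configuration is validated up front, a file that omits a required section should fail at load time rather than partway through a run. A hedged sketch, assuming cases is declared as a required field on EvalDataset:

from pydantic import ValidationError
from metacoder.evals.eval_model import EvalDataset

incomplete = {
    "name": "incomplete_eval",
    "models": {"gpt-4o": {"provider": "openai", "name": "gpt-4"}},
    # 'cases' omitted, although at least one test case is required
}

try:
    EvalDataset.model_validate(incomplete)
except ValidationError as err:
    print(err)  # lists the missing or invalid fields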

Field Details

Model Configuration

The models field maps model names to their configurations:

models:
  gpt-4o:
    provider: openai
    name: gpt-4
  claude-3:
    provider: anthropic
    name: claude-3-opus

Server Configuration

The servers field defines MCP servers available for evaluation:

servers:
  pubmed:
    name: pubmed
    command: uvx
    args: [mcp-simple-pubmed]
    env:
      PUBMED_EMAIL: ${PUBMED_EMAIL}
    type: stdio
    enabled: true

Metrics

Metrics can be specified by name. Common metrics include:

  • CorrectnessMetric - Built-in factual accuracy metric
  • AnswerRelevancyMetric - Measures how well the answer addresses the question
  • FaithfulnessMetric - Checks factual consistency with the provided retrieval context
  • HallucinationMetric - Detects fabricated or false information
  • GEval - Custom criteria evaluation

See DeepEval metrics for the complete list.

Test Case Fields

  • name: Unique identifier for the test case
  • metrics: List of metric names to apply
  • input: The prompt or question to send to the coder
  • expected_output: The expected response (for comparison metrics)
  • retrieval_context: Context documents for RAG metrics (string or list)
  • threshold: Score threshold for passing (0.0-1.0, default 0.7)
  • context: Additional context for the case (list)
  • additional_metadata: Dictionary of extra data for custom metrics
  • comments: Human-readable notes about the test
  • tags: List of tags for categorization/filtering
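
A sketch of a case that exercises the RAG-oriented fields; the case name, prompt, and context text are hypothetical, and metrics are again assumed to be referenced by name:

from metacoder.evals.eval_model import EvalCase

rag_case = EvalCase(
    name="abstract_grounding",
    metrics=["FaithfulnessMetric", "AnswerRelevancyMetric"],
    input="Summarize the key finding of the retrieved abstract.",
    retrieval_context=["(retrieved abstract text goes here)"],
    threshold=0.8,
    tags=["rag", "pubmed"],
    comments="Checks that the summary stays grounded in the retrieved context.",
)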

Command Line Usage

# Basic evaluation
metacoder eval config.yaml

# Specify output file
metacoder eval config.yaml -o results.yaml

# Test specific coders
metacoder eval config.yaml -c claude -c goose

# Custom working directory
metacoder eval config.yaml -w ./eval_workspace

# Verbose output
metacoder eval config.yaml -v

Important Notes

Evaluation output can be large, particularly the execution_metadata field, which contains a full trace of everything the agent did, including tool responses. For tasks such as PubMed full-text retrieval, this trace can become substantial.

To keep output manageable, you may wish to partition large evaluations into smaller sub-evals.

See Also