PAL Evaluation Guide

The PAL evaluation system allows you to create automated test suites for your prompt assemblies, ensuring they produce consistent and correct outputs across different scenarios.

Quick Start

1. Create an Evaluation File

Evaluation files use the .eval.yaml extension and define test cases for your prompts:

pal_version: "1.0"
prompt_id: "my-prompt-id"
target_version: "1.0.0"
description: "Test suite for my prompt"

test_cases:
  - name: "basic_test"
    description: "Test basic functionality"
    variables:
      user_input: "Hello world"
    assertions:
      - type: "contains"
        config:
          text: "greeting"
          case_sensitive: false

2. Run Evaluations

# Run evaluation with mock LLM (for testing)
pal evaluate my_prompt.eval.yaml

# Run with a real model
pal evaluate my_prompt.eval.yaml --model gpt-4o --provider openai

# Specify a custom PAL file
pal evaluate tests.eval.yaml --pal-file custom.pal

# Output results to file
pal evaluate tests.eval.yaml --output results.json --output-format json

Evaluation File Structure

Required Fields

- pal_version: PAL specification version (currently "1.0")
- prompt_id: ID of the prompt assembly to test
- target_version: Expected version of the prompt assembly
- test_cases: List of test scenarios

Test Case Structure

test_cases:
  - name: "unique_test_name"
    description: "What this test validates"
    variables:
      # Variables to pass to the prompt
      param1: "value1"
      param2: ["list", "of", "values"]
    assertions:
      # List of assertions to validate the response
      - type: "assertion_type"
        config:
          # Assertion-specific configuration

Available Assertions

1. Contains Assertion

Checks if the response contains specific text.

- type: "contains"
  config:
    text: "expected text"
    case_sensitive: true  # Optional, defaults to true
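
The check described above can be sketched in a few lines of Python. This is an illustration of the semantics only, not PAL's implementation: the function name check_contains is hypothetical, and folding case on both sides for the case-insensitive comparison is an assumption.

```python
def check_contains(response: str, text: str, case_sensitive: bool = True) -> bool:
    """Return True if `text` occurs anywhere in `response`."""
    if not case_sensitive:
        # Assumption: case-insensitive mode folds case on both strings.
        return text.casefold() in response.casefold()
    return text in response
```

With the default case_sensitive: true, searching for "world" in "Hello, World" fails in this sketch; setting case_sensitive: false makes it pass.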

2. Regex Match Assertion

Validates the response against a regular expression pattern.

- type: "regex_match"
  config:
    pattern: "\\d{4}-\\d{2}-\\d{2}"  # Date pattern
    flags: 0  # Optional regex flags
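
A rough Python sketch of this assertion follows. It rests on two assumptions that the guide does not confirm: that the pattern may match anywhere in the response (search rather than full match), and that flags is an integer bitmask in the style of Python's re module. The name check_regex_match is hypothetical.

```python
import re


def check_regex_match(response: str, pattern: str, flags: int = 0) -> bool:
    """Return True if `pattern` matches somewhere in `response`."""
    # Assumption: a match anywhere in the text counts, so re.search is used.
    return re.search(pattern, response, flags) is not None


# Usage: the date pattern from the YAML above, written as a Python raw string.
has_date = check_regex_match("Due on 2024-05-17.", r"\d{4}-\d{2}-\d{2}")
```

Under the bitmask assumption, a non-zero flags value would combine constants such as re.IGNORECASE and re.MULTILINE.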

3. JSON Valid Assertion

Ensures the response is valid JSON.

- type: "json_valid"
  config: {}
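
As a minimal sketch, a JSON validity check can be written with Python's standard json module. The name check_json_valid is hypothetical, and the sketch assumes the whole response must parse as JSON; whether PAL strips surrounding text or code fences first is not stated here.

```python
import json


def check_json_valid(response: str) -> bool:
    """Return True if the entire response parses as JSON."""
    try:
        json.loads(response)
    except json.JSONDecodeError:
        return False
    return True
```

Under this strict reading, a response that wraps JSON in explanatory prose would fail, which is worth keeping in mind when writing the prompt.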

4. JSON Field Equals Assertion

Checks if a specific JSON field equals an expected value.

- type: "json_field_equals"
  config:
    path: "$.result.status"  # JSONPath-like syntax
    value: "success"
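
To make the path lookup concrete, here is a small Python sketch that resolves a dotted path such as $.result.status. It is an assumption-laden illustration: the helper names resolve_path and check_json_field_equals are hypothetical, only object keys are supported (no array indexes or wildcards), and PAL's actual path syntax may accept more.

```python
import json
from typing import Any

_MISSING = object()  # sentinel for "path not present"


def resolve_path(data: Any, path: str) -> Any:
    """Resolve a simple dotted path like "$.result.status" against parsed JSON."""
    if not path.startswith("$"):
        raise ValueError("path must start with '$'")
    current = data
    for key in path[1:].split("."):
        if not key:
            continue  # skip the empty segment right after "$"
        if isinstance(current, dict) and key in current:
            current = current[key]
        else:
            return _MISSING
    return current


def check_json_field_equals(response: str, path: str, value: Any) -> bool:
    """Return True if the field at `path` exists and equals `value`."""
    try:
        data = json.loads(response)
    except json.JSONDecodeError:
        return False  # assumption: invalid JSON fails the assertion
    return resolve_path(data, path) == value
```

In this sketch a missing field simply fails the comparison rather than raising an error.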

5. Length Assertion

Validates response length constraints.

- type: "length"
  config:
    min_length: 100      # Minimum characters
    max_length: 1000     # Maximum characters
    # OR
    exact_length: 150    # Exact character count
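
The length rules can be sketched as follows. The function name check_length is hypothetical, and three details are assumptions rather than documented behavior: bounds are inclusive, length is counted with Python's len (Unicode code points), and exact_length takes precedence when it is set.

```python
from typing import Optional


def check_length(
    response: str,
    min_length: Optional[int] = None,
    max_length: Optional[int] = None,
    exact_length: Optional[int] = None,
) -> bool:
    """Return True if the response length satisfies the configured limits."""
    n = len(response)
    if exact_length is not None:
        # Assumption: an exact length overrides any min/max range.
        return n == exact_length
    if min_length is not None and n < min_length:
        return False
    if max_length is not None and n > max_length:
        return False
    return True
```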

Advanced Features

Variable Types

Test variables can be simple values or complex structures:

variables:
  # Simple string
  user_query: "What is the weather?"
  
  # Number
  temperature: 72
  
  # List
  options: ["A", "B", "C"]
  
  # Object
  user_profile:
    name: "John Doe"
    age: 30
    preferences: ["tech", "sports"]

Multiple Assertions per Test

Each test case can have multiple assertions:

- name: "comprehensive_test"
  variables:
    input: "Generate a JSON response"
  assertions:
    - type: "json_valid"
      config: {}
    - type: "contains"
      config:
        text: "status"
    - type: "length"
      config:
        min_length: 50
        max_length: 500

Auto-Discovery

If you omit --pal-file, the evaluation runner automatically searches the same directory and its subdirectories for a PAL file whose id matches prompt_id.
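
A recursive search of this kind might look like the Python sketch below. Everything in it is an assumption made for illustration: the function name find_pal_file, the .pal file extension (taken from the custom.pal example), and the idea of reading a top-level id: line with a regular expression instead of a real YAML parser. The actual runner may work differently.

```python
import re
from pathlib import Path
from typing import Optional

# Matches a top-level line such as: id: "my-prompt-id"
_ID_LINE = re.compile(r'^id:\s*["\']?([^"\'\s#]+)', re.MULTILINE)


def find_pal_file(start_dir: Path, prompt_id: str) -> Optional[Path]:
    """Return the first .pal file under start_dir whose id equals prompt_id."""
    # rglob walks start_dir and all of its subdirectories.
    for candidate in sorted(start_dir.rglob("*.pal")):
        match = _ID_LINE.search(candidate.read_text(encoding="utf-8"))
        if match and match.group(1) == prompt_id:
            return candidate
    return None
```

If a search like this comes back empty, passing --pal-file explicitly avoids the lookup altogether.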

Output Formats

Console Output (Default)

pal evaluate tests.eval.yaml

Displays a human-readable report with pass/fail status and assertion details.

JSON Output

pal evaluate tests.eval.yaml --output results.json --output-format json

Generates a detailed JSON report with:

- Summary statistics (total tests, pass rate)
- Individual test results
- Assertion details
- Execution metadata

Best Practices

1. Test Edge Cases

Cover boundary inputs, such as empty and very long values, alongside the typical cases:

test_cases:
  - name: "empty_input"
    variables:
      user_input: ""
    assertions:
      - type: "contains"
        config:
          text: "please provide input"

  - name: "long_input"
    variables:
      user_input: "{{ very_long_text }}"
    assertions:
      - type: "length"
        config:
          max_length: 2000

2. Use Descriptive Names

Make test names and descriptions clear:

- name: "classification_high_confidence"
  description: "Verify classification returns high confidence for clear inputs"

3. Version Compatibility

Always specify the target version to catch version mismatches:

target_version: "2.1.0"  # Will warn if prompt version differs

Integration with Development Workflow

1. Continuous Testing

Run evaluations as part of your development process:

# Test before committing changes
pal evaluate tests.eval.yaml

# Run with different models
pal evaluate tests.eval.yaml --model gpt-4o --provider openai
pal evaluate tests.eval.yaml --model claude-3-sonnet --provider anthropic

2. Regression Testing

Use evaluation files to catch regressions when modifying prompts:

- Create comprehensive test suites covering expected behavior
- Run tests before and after changes
- Compare results to identify any degradation

3. Model Comparison

Evaluate the same prompt with different models:

pal evaluate tests.eval.yaml --model gpt-4o --provider openai --output gpt4-results.json --output-format json
pal evaluate tests.eval.yaml --model claude-3-sonnet --provider anthropic --output claude-results.json --output-format json

Troubleshooting

Common Issues

- Version Mismatch Warning: The prompt version doesn’t match target_version
  - Update the evaluation file or the prompt version so the two agree
- Prompt Not Found: Auto-discovery can’t find the PAL file
  - Use --pal-file to specify the exact path
  - Ensure the prompt_id matches the PAL file’s id
- Assertion Failures: Tests are failing unexpectedly
  - Check if the model output format has changed
  - Verify assertion configurations are correct
  - Use --model mock to test with predictable responses

Debug Mode

Use verbose output to see detailed execution information:

pal evaluate tests.eval.yaml --verbose

This will show:

- Compiled prompt content
- Model responses
- Detailed assertion results
- Execution timing information

Example: Complete Evaluation Suite

Here’s a comprehensive example for a content classification prompt:

pal_version: "1.0"
prompt_id: "content-classifier"
target_version: "1.2.0"
description: "Comprehensive test suite for content classification"

test_cases:
  - name: "clear_spam_classification"
    description: "Classify obvious spam content"
    variables:
      content: "URGENT!!! Win money now!!! Click here!!!"
      categories: ["spam", "legitimate", "promotional"]
    assertions:
      - type: "json_valid"
        config: {}
      - type: "json_field_equals"
        config:
          path: "$.category"
          value: "spam"
      - type: "json_field_equals"
        config:
          path: "$.confidence"
          value: "high"

  - name: "legitimate_content"
    description: "Classify normal business content"
    variables:
      content: "Thank you for your order. It will ship within 3-5 business days."
      categories: ["spam", "legitimate", "promotional"]
    assertions:
      - type: "json_valid"
        config: {}
      - type: "json_field_equals"
        config:
          path: "$.category"
          value: "legitimate"

  - name: "edge_case_empty"
    description: "Handle empty content gracefully"
    variables:
      content: ""
      categories: ["spam", "legitimate", "promotional"]
    assertions:
      - type: "json_valid"
        config: {}
      - type: "contains"
        config:
          text: "insufficient"
          case_sensitive: false

  - name: "response_format"
    description: "Ensure consistent response structure"
    variables:
      content: "Sample content for testing"
      categories: ["spam", "legitimate", "promotional"]
    assertions:
      - type: "json_valid"
        config: {}
      - type: "regex_match"
        config:
          # Single-quoted YAML strings keep backslashes literally, so \s takes one backslash
          pattern: '"category"\s*:\s*"(spam|legitimate|promotional)"'
      - type: "regex_match"
        config:
          pattern: '"confidence"\s*:\s*"(high|medium|low)"'
      - type: "length"
        config:
          min_length: 50
          max_length: 300

This evaluation suite tests multiple scenarios, validates JSON structure, checks specific field values, and ensures consistent response formatting.