PAL Evaluation Guide
The PAL evaluation system allows you to create automated test suites for your prompt assemblies, ensuring they produce consistent and correct outputs across different scenarios.
Quick Start
1. Create an Evaluation File
Evaluation files use the .eval.yaml extension and define test cases for your prompts:
pal_version: "1.0"
prompt_id: "my-prompt-id"
target_version: "1.0.0"
description: "Test suite for my prompt"
test_cases:
- name: "basic_test"
description: "Test basic functionality"
variables:
user_input: "Hello world"
assertions:
- type: "contains"
config:
text: "greeting"
case_sensitive: false
2. Run Evaluations
# Run evaluation with mock LLM (for testing)
pal evaluate my_prompt.eval.yaml
# Run with a real model
pal evaluate my_prompt.eval.yaml --model gpt-4o --provider openai
# Specify a custom PAL file
pal evaluate tests.eval.yaml --pal-file custom.pal
# Output results to file
pal evaluate tests.eval.yaml --output results.json --output-format json
Evaluation File Structure
Required Fields
pal_version: PAL specification version (currently “1.0”)prompt_id: ID of the prompt assembly to testtarget_version: Expected version of the prompt assemblytest_cases: List of test scenarios
Test Case Structure
test_cases:
- name: "unique_test_name"
description: "What this test validates"
variables:
# Variables to pass to the prompt
param1: "value1"
param2: ["list", "of", "values"]
assertions:
# List of assertions to validate the response
- type: "assertion_type"
config:
# Assertion-specific configuration
Available Assertions
1. Contains Assertion
Checks if the response contains specific text.
- type: "contains"
config:
text: "expected text"
case_sensitive: true # Optional, defaults to true
2. Regex Match Assertion
Validates the response against a regular expression pattern.
- type: "regex_match"
config:
pattern: "\\d{4}-\\d{2}-\\d{2}" # Date pattern
flags: 0 # Optional regex flags
3. JSON Valid Assertion
Ensures the response is valid JSON.
- type: "json_valid"
config: {}
4. JSON Field Equals Assertion
Checks if a specific JSON field equals an expected value.
- type: "json_field_equals"
config:
path: "$.result.status" # JSONPath-like syntax
value: "success"
5. Length Assertion
Validates response length constraints.
- type: "length"
config:
min_length: 100 # Minimum characters
max_length: 1000 # Maximum characters
# OR
exact_length: 150 # Exact character count
Advanced Features
Variable Types
Test variables can be simple values or complex structures:
variables:
# Simple string
user_query: "What is the weather?"
# Number
temperature: 72
# List
options: ["A", "B", "C"]
# Object
user_profile:
name: "John Doe"
age: 30
preferences: ["tech", "sports"]
Multiple Assertions per Test
Each test case can have multiple assertions:
- name: "comprehensive_test"
variables:
input: "Generate a JSON response"
assertions:
- type: "json_valid"
config: {}
- type: "contains"
config:
text: "status"
- type: "length"
config:
min_length: 50
max_length: 500
Auto-Discovery
If you don’t specify a --pal-file, the evaluation runner will automatically search for a PAL file with the matching prompt_id in the same directory and subdirectories.
Output Formats
Console Output (Default)
pal evaluate tests.eval.yaml
Displays a human-readable report with pass/fail status and assertion details.
JSON Output
pal evaluate tests.eval.yaml --output results.json --output-format json
Generates a detailed JSON report with:
Summary statistics (total tests, pass rate)
Individual test results
Assertion details
Execution metadata
Best Practices
1. Test Edge Cases
Create test cases for various scenarios:
test_cases:
- name: "empty_input"
variables:
user_input: ""
assertions:
- type: "contains"
config:
text: "please provide input"
- name: "long_input"
variables:
user_input: "{{ very_long_text }}"
assertions:
- type: "length"
config:
max_length: 2000
2. Use Descriptive Names
Make test names and descriptions clear:
- name: "classification_high_confidence"
description: "Verify classification returns high confidence for clear inputs"
4. Version Compatibility
Always specify the target version to catch version mismatches:
target_version: "2.1.0" # Will warn if prompt version differs
Integration with Development Workflow
1. Continuous Testing
Run evaluations as part of your development process:
# Test before committing changes
pal evaluate tests.eval.yaml
# Run with different models
pal evaluate tests.eval.yaml --model gpt-4o --provider openai
pal evaluate tests.eval.yaml --model claude-3-sonnet --provider anthropic
2. Regression Testing
Use evaluation files to catch regressions when modifying prompts:
Create comprehensive test suites covering expected behavior
Run tests before and after changes
Compare results to identify any degradation
3. Model Comparison
Evaluate the same prompt with different models:
pal evaluate tests.eval.yaml --model gpt-4o --output gpt4-results.json --output-format json
pal evaluate tests.eval.yaml --model claude-3-sonnet --output claude-results.json --output-format json
Troubleshooting
Common Issues
Version Mismatch Warning: The prompt version doesn’t match
target_versionUpdate the evaluation file or prompt version
Prompt Not Found: Auto-discovery can’t find the PAL file
Use
--pal-fileto specify the exact pathEnsure the
prompt_idmatches the PAL file’sid
Assertion Failures: Tests are failing unexpectedly
Check if the model output format has changed
Verify assertion configurations are correct
Use
--model mockto test with predictable responses
Debug Mode
Use verbose output to see detailed execution information:
pal evaluate tests.eval.yaml --verbose
This will show:
Compiled prompt content
Model responses
Detailed assertion results
Execution timing information
Example: Complete Evaluation Suite
Here’s a comprehensive example for a content classification prompt:
pal_version: "1.0"
prompt_id: "content-classifier"
target_version: "1.2.0"
description: "Comprehensive test suite for content classification"
test_cases:
- name: "clear_spam_classification"
description: "Classify obvious spam content"
variables:
content: "URGENT!!! Win money now!!! Click here!!!"
categories: ["spam", "legitimate", "promotional"]
assertions:
- type: "json_valid"
config: {}
- type: "json_field_equals"
config:
path: "$.category"
value: "spam"
- type: "json_field_equals"
config:
path: "$.confidence"
value: "high"
- name: "legitimate_content"
description: "Classify normal business content"
variables:
content: "Thank you for your order. It will ship within 3-5 business days."
categories: ["spam", "legitimate", "promotional"]
assertions:
- type: "json_valid"
config: {}
- type: "json_field_equals"
config:
path: "$.category"
value: "legitimate"
- name: "edge_case_empty"
description: "Handle empty content gracefully"
variables:
content: ""
categories: ["spam", "legitimate", "promotional"]
assertions:
- type: "json_valid"
config: {}
- type: "contains"
config:
text: "insufficient"
case_sensitive: false
- name: "response_format"
description: "Ensure consistent response structure"
variables:
content: "Sample content for testing"
categories: ["spam", "legitimate", "promotional"]
assertions:
- type: "json_valid"
config: {}
- type: "regex_match"
config:
pattern: '"category"\\s*:\\s*"(spam|legitimate|promotional)"'
- type: "regex_match"
config:
pattern: '"confidence"\\s*:\\s*"(high|medium|low)"'
- type: "length"
config:
min_length: 50
max_length: 300
This evaluation suite tests multiple scenarios, validates JSON structure, checks specific field values, and ensures consistent response formatting.