Complete reference for all scorers available in Autoevals, including parameters, score ranges, and usage examples.
- LLM-as-a-judge scorers
- RAG (Retrieval-Augmented Generation) scorers
- Heuristic scorers
- JSON scorers
- List scorers
These scorers use language models to evaluate outputs based on semantic understanding.
Evaluates whether the output is factually consistent with the expected answer.
Parameters:
- `input` (string): The input question or prompt
- `output` (string, required): The generated answer to evaluate
- `expected` (string, required): The ground truth answer
- `model` (string, optional): Model to use (default: configured via `init()` or "gpt-5-mini")
- `client` (Client, optional): Custom OpenAI client
Score Range: 0-1
- `1.0` = Output is factually accurate
- `0.0` = Output contains factual errors
Example:
```typescript
import { Factuality } from "autoevals";

const result = await Factuality({
  input: "What is the capital of France?",
  output: "Paris",
  expected: "The capital of France is Paris",
});
// Score: 1.0 (factually correct)
```

Compares two outputs and determines which one is better.
Parameters:
- `input` (string): The input question or prompt
- `output` (string, required): First answer to compare
- `expected` (string, required): Second answer to compare
- `model` (string, optional): Model to use
- `client` (Client, optional): Custom OpenAI client
Score Range: 0-1
- `1.0` = Output is significantly better than expected
- `0.5` = Both outputs are roughly equal
- `0.0` = Expected is significantly better than output
Example:
```python
from autoevals.llm import Battle

evaluator = Battle()
result = evaluator.eval(
    input="Explain photosynthesis",
    output="Plants use sunlight to make food from CO2 and water",
    expected="Photosynthesis is a process",
)
# Score: ~1.0 (first answer is more detailed)
```

Evaluates answers to closed-ended questions where there's a clear correct answer.
Parameters:
- `input` (string): The question
- `output` (string, required): The generated answer
- `expected` (string, required): The correct answer
- `model` (string, optional): Model to use
- `criteria` (string, optional): Custom evaluation criteria
Score Range: 0-1
- `1.0` = Answer is correct
- `0.0` = Answer is incorrect
Evaluates whether the output is humorous.
Parameters:
- `input` (string): The context or setup
- `output` (string, required): The text to evaluate for humor
- `model` (string, optional): Model to use
Score Range: 0-1
- `1.0` = Very humorous
- `0.0` = Not humorous
Evaluates whether the output contains security vulnerabilities or unsafe content.
Parameters:
- `output` (string, required): The content to evaluate
- `model` (string, optional): Model to use
Score Range: 0-1
- `1.0` = No security concerns
- `0.0` = Contains security vulnerabilities
Evaluates content for policy violations using OpenAI's moderation API.
Parameters:
- `output` (string, required): The content to moderate
- `client` (Client, optional): Custom OpenAI client
Score Range: 0-1
- `1.0` = Content is safe
- `0.0` = Content violates policies
Categories Checked:
- Sexual content
- Hate speech
- Harassment
- Self-harm
- Violence
- Sexual content involving minors
Evaluates SQL query correctness and quality.
Parameters:
- `input` (string): The natural language question
- `output` (string, required): The generated SQL query
- `expected` (string, optional): The correct SQL query
- `model` (string, optional): Model to use
Score Range: 0-1
Evaluates the quality of text summaries.
Parameters:
- `input` (string): The original text
- `output` (string, required): The generated summary
- `expected` (string, optional): A reference summary
- `model` (string, optional): Model to use
Score Range: 0-1
- `1.0` = Excellent summary (accurate, concise, complete)
- `0.0` = Poor summary
Evaluates translation quality.
Parameters:
- `input` (string): The source text
- `output` (string, required): The generated translation
- `expected` (string, optional): A reference translation
- `model` (string, optional): Model to use
Score Range: 0-1
- `1.0` = Excellent translation
- `0.0` = Poor translation
These scorers evaluate RAG systems by assessing both context retrieval and answer generation quality.
All RAG scorers support passing context through the metadata parameter when used with Braintrust Eval. See the RAGAS module documentation for examples.
Evaluates how relevant the retrieved context is to the input question.
Parameters:
- `input` (string, required): The question
- `output` (string, required): The generated answer
- `context` (string[] | string, required): Retrieved context passages
- `model` (string, optional): Model to use (default: "gpt-5-nano")
Score Range: 0-1
- `1.0` = All context is highly relevant
- `0.0` = Context is irrelevant
Example:
```python
from autoevals.ragas import ContextRelevancy

scorer = ContextRelevancy()
result = scorer.eval(
    input="What is the capital of France?",
    output="Paris",
    context=[
        "Paris is the capital of France.",
        "Berlin is the capital of Germany.",
    ],
)
# Score: ~0.5 (only first context item is relevant)
```

Measures how well the context supports the expected answer.
Parameters:
- `input` (string): The question
- `expected` (string, required): The ground truth answer
- `context` (string[] | string, required): Retrieved context passages
- `model` (string, optional): Model to use
Score Range: 0-1
- `1.0` = Context fully supports the expected answer
- `0.0` = Context doesn't support the expected answer
Measures the precision of retrieved context: whether relevant context appears before irrelevant context.
Parameters:
- `input` (string, required): The question
- `expected` (string, required): The ground truth answer
- `context` (string[] | string, required): Retrieved context passages (order matters)
- `model` (string, optional): Model to use
Score Range: 0-1
- `1.0` = All relevant context appears first
- `0.0` = Relevant context is buried under irrelevant context
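The order sensitivity can be illustrated with a small pure-Python sketch of the underlying average-precision idea. This shows the shape of the metric, not the library's exact LLM-based implementation; `relevance_flags` is a hypothetical pre-computed relevance judgment per passage (in the real scorer, that judgment comes from an LLM).

```python
def context_precision(relevance_flags: list[bool]) -> float:
    """Mean precision@k over the positions of relevant passages.

    relevance_flags[k] is True when the passage at rank k is relevant.
    """
    if not any(relevance_flags):
        return 0.0
    score, hits = 0.0, 0
    for k, relevant in enumerate(relevance_flags, start=1):
        if relevant:
            hits += 1
            score += hits / k  # precision@k at each relevant position
    return score / hits

# Relevant passages first scores 1.0; a relevant passage buried last scores lower:
early = context_precision([True, True, False])   # 1.0
buried = context_precision([False, False, True])  # ~0.33
```

The same set of passages thus scores differently depending purely on retrieval order, which is what distinguishes this scorer from plain recall.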
Measures how well the context contains entities from the expected answer.
Parameters:
- `expected` (string, required): The ground truth answer
- `context` (string[] | string, required): Retrieved context passages
- `model` (string, optional): Model to use
Score Range: 0-1
- `1.0` = All entities from expected answer are in context
- `0.0` = No entities from expected answer are in context
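Conceptually, this reduces to a recall fraction over entities. Here is a minimal pure-Python sketch assuming the entities have already been extracted; the real scorer uses an LLM for the extraction step, so treat this as an illustration of the arithmetic only.

```python
def entity_recall(expected_entities: set[str], context_text: str) -> float:
    """Fraction of expected-answer entities that appear in the context."""
    if not expected_entities:
        return 0.0
    found = {e for e in expected_entities if e.lower() in context_text.lower()}
    return len(found) / len(expected_entities)

recall = entity_recall(
    {"Paris", "France"},
    "Paris is the capital of France.",
)  # 1.0 (both entities appear in the context)
```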
Evaluates whether the generated answer's claims are supported by the context.
Parameters:
- `input` (string): The question
- `output` (string, required): The generated answer
- `context` (string[] | string, required): Retrieved context passages
- `model` (string, optional): Model to use
Score Range: 0-1
- `1.0` = All claims in the answer are supported by context
- `0.0` = Answer contains unsupported claims (hallucinations)
Example:
```typescript
import { Faithfulness } from "autoevals/ragas";

const result = await Faithfulness({
  input: "What is photosynthesis?",
  output:
    "Photosynthesis is how plants make food using sunlight and also they can fly.",
  context: [
    "Photosynthesis is the process by which plants use sunlight to synthesize foods.",
  ],
});
// Score: ~0.5 (first claim supported, "can fly" is not)
```

Measures how relevant the answer is to the question.
Parameters:
- `input` (string, required): The question
- `output` (string, required): The generated answer
- `context` (string[] | string, optional): Retrieved context passages
- `model` (string, optional): Model to use
- `embedding_model` (string, optional): Model for embeddings (default: "text-embedding-3-small")
Score Range: 0-1
- `1.0` = Answer directly addresses the question
- `0.0` = Answer is off-topic
Compares semantic similarity between the generated answer and expected answer using embeddings.
Parameters:
- `output` (string, required): The generated answer
- `expected` (string, required): The ground truth answer
- `model` (string, optional): Embedding model to use (default: "text-embedding-3-small")
Score Range: 0-1
- `1.0` = Answers are semantically identical
- `0.0` = Answers are completely different
Combines factual correctness and semantic similarity to evaluate answers.
Parameters:
- `input` (string, required): The question
- `output` (string, required): The generated answer
- `expected` (string, required): The ground truth answer
- `model` (string, optional): Model for factuality checking
- `embedding_model` (string, optional): Model for similarity (default: "text-embedding-3-small")
- `factuality_weight` (number, optional): Weight for factuality (default: 0.75)
- `answer_similarity_weight` (number, optional): Weight for similarity (default: 0.25)
Score Range: 0-1
Formula: score = (factuality_weight × factuality_score + answer_similarity_weight × similarity_score) / (factuality_weight + answer_similarity_weight)
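With the default weights, the combination works out as in this small sketch. It is pure arithmetic matching the formula above; the component scores themselves would come from the LLM factuality check and the embedding similarity.

```python
def answer_correctness(
    factuality_score: float,
    similarity_score: float,
    factuality_weight: float = 0.75,
    answer_similarity_weight: float = 0.25,
) -> float:
    """Weighted average of the factuality and similarity components."""
    return (
        factuality_weight * factuality_score
        + answer_similarity_weight * similarity_score
    ) / (factuality_weight + answer_similarity_weight)

# A factually perfect answer phrased differently from the reference:
score = answer_correctness(factuality_score=1.0, similarity_score=0.8)  # 0.95
```

Because factuality carries three times the weight of similarity by default, a factually wrong answer is penalized heavily even when it sounds similar to the reference.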
Fast, deterministic scorers that don't use LLMs.
Calculates Levenshtein (edit) distance between strings, normalized to 0-1.
Parameters:
- `output` (string, required): The generated text
- `expected` (string, required): The expected text
Score Range: 0-1
- `1.0` = Strings are identical
- `0.0` = Strings are completely different
Example:
```python
from autoevals.string import Levenshtein

scorer = Levenshtein()
result = scorer.eval(output="hello", expected="helo")
# Score: ~0.8 (1 character difference)
```

Binary scorer that checks for exact string equality.
Parameters:
- `output` (any, required): The generated value
- `expected` (any, required): The expected value
Score Range: 0 or 1
- `1` = Values are exactly equal
- `0` = Values differ
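A minimal sketch of the comparison, assuming non-string values are normalized by JSON serialization before comparison (check the library source for the exact normalization it performs):

```python
import json

def exact_match(output, expected) -> int:
    """Binary equality check. Non-strings are normalized via JSON
    serialization with sorted keys, so dict key order doesn't matter.
    Illustrative sketch -- the library's normalization may differ.
    """
    def norm(value):
        if isinstance(value, str):
            return value
        return json.dumps(value, sort_keys=True)

    return 1 if norm(output) == norm(expected) else 0

exact_match({"a": 1, "b": 2}, {"b": 2, "a": 1})  # 1 (same content)
exact_match("Paris", "paris")                    # 0 (case-sensitive)
```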
Evaluates numeric differences with configurable thresholds.
Parameters:
- `output` (number, required): The generated number
- `expected` (number, required): The expected number
- `max_diff` (number, optional): Maximum acceptable difference (default: 0)
- `relative` (boolean, optional): Use relative difference (default: false)
Score Range: 0-1
- `1.0` = Numbers are equal (within threshold)
- `0.0` = Numbers differ significantly
Formula (absolute): score = max(0, 1 - |output - expected| / max_diff) (when max_diff > 0)
Formula (relative): score = max(0, 1 - |output - expected| / |expected|)
Example:
```typescript
import { NumericDiff } from "autoevals";

// Absolute difference
const result1 = await NumericDiff({
  output: 10.5,
  expected: 10.0,
  maxDiff: 1.0,
});
// Score: 0.5 (difference of 0.5 out of max 1.0)

// Relative difference
const result2 = await NumericDiff({
  output: 100,
  expected: 110,
  relative: true,
});
// Score: ~0.91 (10% difference)
```

Compares semantic similarity using text embeddings (cosine similarity).
Parameters:
- `output` (string, required): The generated text
- `expected` (string, required): The expected text
- `model` (string, optional): Embedding model (default: "text-embedding-3-small")
- `client` (Client, optional): Custom OpenAI client
Score Range: -1 to 1 (typically 0-1 for text)
- `1.0` = Semantically identical
- `0.0` = Unrelated
- `-1.0` = Opposite meanings (rare)
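The underlying measure is plain cosine similarity over the two embedding vectors. A pure-Python sketch of that computation (the real scorer obtains the vectors from the embedding model and may rescale the raw cosine before reporting a score):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

cosine_similarity([1.0, 0.0], [1.0, 0.0])   # 1.0 (identical direction)
cosine_similarity([1.0, 0.0], [0.0, 1.0])   # 0.0 (orthogonal)
cosine_similarity([1.0, 0.0], [-1.0, 0.0])  # -1.0 (opposite)
```

This is why the theoretical range is -1 to 1, even though real text embeddings rarely produce strongly negative values.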
Scorers for evaluating JSON outputs.
Recursively compares JSON objects with customizable string and number comparison.
Parameters:
- `output` (any, required): The generated JSON
- `expected` (any, required): The expected JSON
- `string_scorer` (Scorer, optional): Scorer for string values (default: Levenshtein)
- `number_scorer` (Scorer, optional): Scorer for numeric values (default: NumericDiff)
- `preserve_strings` (boolean, optional): Don't auto-parse JSON strings (default: false)
Score Range: 0-1
- `1.0` = JSON structures are identical
- `0.0` = JSON structures are completely different
Example:
```python
from autoevals.json import JSONDiff

scorer = JSONDiff()
result = scorer.eval(
    output={"name": "John", "age": 30},
    expected={"name": "John", "age": 31},
)
# Score: ~0.5 (name matches, age differs slightly)
```

Validates JSON syntax and optionally checks against a JSON Schema.
Parameters:
- `output` (any, required): The value to validate
- `schema` (object, optional): JSON Schema to validate against
Score Range: 0 or 1
- `1` = Valid JSON (and matches schema if provided)
- `0` = Invalid JSON or doesn't match schema
Example:
```typescript
import { ValidJSON } from "autoevals/json";

const schema = {
  type: "object",
  properties: {
    name: { type: "string" },
    age: { type: "number" },
  },
  required: ["name", "age"],
};

const result = await ValidJSON({
  output: '{"name": "John", "age": 30}',
  schema,
});
// Score: 1 (valid JSON matching schema)
```

Scorers for evaluating lists and arrays.
Checks if all expected items are present in the output list.
Parameters:
- `output` (any[], required): The generated list
- `expected` (any[], required): Items that should be present
- `scorer` (Scorer, optional): Scorer for comparing individual items
Score Range: 0-1
- `1.0` = All expected items are present
- `0.0` = None of the expected items are present
Example:
```python
from autoevals.list import ListContains

scorer = ListContains()
result = scorer.eval(
    output=["apple", "banana", "cherry"],
    expected=["apple", "banana"],
)
# Score: 1.0 (both expected items present)
```

You can create custom scorers for domain-specific evaluation needs. See:
- JSON scorer examples - Combining validators and comparators
- Creating custom scorers - Basic custom scorer pattern
General guidelines for interpreting scores:
- 1.0: Perfect match or complete correctness
- 0.8-0.99: Very good, minor differences
- 0.6-0.79: Acceptable, some issues
- 0.4-0.59: Moderate quality, significant issues
- 0.2-0.39: Poor quality, major issues
- 0.0-0.19: Unacceptable or completely wrong
Note: Interpretation varies by scorer type. Binary scorers (ExactMatch, ValidJSON) only return 0 or 1.
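If you want to bucket scores programmatically, for example in a dashboard, the bands above translate into a small helper. The band names and cutoffs here are just the guidelines from this section, not part of the library:

```python
def interpret_score(score: float) -> str:
    """Map a 0-1 score to the interpretation bands listed above."""
    if score >= 1.0:
        return "perfect"
    if score >= 0.8:
        return "very good"
    if score >= 0.6:
        return "acceptable"
    if score >= 0.4:
        return "moderate"
    if score >= 0.2:
        return "poor"
    return "unacceptable"

interpret_score(0.85)  # "very good"
interpret_score(0.1)   # "unacceptable"
```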
Many scorers share these common parameters:
- `model` (string): LLM model to use for evaluation (default: configured via `init()` or "gpt-5-mini")
- `client` (Client): Custom OpenAI-compatible client
- `use_cot` (boolean): Enable chain-of-thought reasoning for LLM scorers (default: true)
- `temperature` (number): LLM temperature setting
- `max_tokens` (number): Maximum tokens for LLM response
Use `init()` to configure default settings for all scorers:

```typescript
import { init } from "autoevals";
import OpenAI from "openai";

init({
  client: new OpenAI({ apiKey: "..." }),
  defaultModel: "gpt-5-mini",
});
```

```python
from autoevals import init
from openai import OpenAI

init(OpenAI(api_key="..."), default_model="gpt-5-mini")
```