Autoevals scorer reference

Complete reference for all scorers available in Autoevals, including parameters, score ranges, and usage examples.

Table of contents

  • LLM-as-a-judge scorers
  • RAG (Retrieval-Augmented Generation) scorers
  • Heuristic scorers
  • JSON scorers
  • List scorers
  • Custom scorers
  • Score interpretation
  • Common parameters

LLM-as-a-judge scorers

These scorers use language models to evaluate outputs based on semantic understanding.

Factuality

Evaluates whether the output is factually consistent with the expected answer.

Parameters:

  • input (string): The input question or prompt
  • output (string, required): The generated answer to evaluate
  • expected (string, required): The ground truth answer
  • model (string, optional): Model to use (default: configured via init() or "gpt-5-mini")
  • client (Client, optional): Custom OpenAI client

Score Range: 0-1

  • 1.0 = Output is factually accurate
  • 0.0 = Output contains factual errors

Example:

import { Factuality } from "autoevals";

const result = await Factuality({
  input: "What is the capital of France?",
  output: "Paris",
  expected: "The capital of France is Paris",
});
// Score: 1.0 (factually correct)

Battle

Compares two outputs and determines which one is better.

Parameters:

  • input (string): The input question or prompt
  • output (string, required): First answer to compare
  • expected (string, required): Second answer to compare
  • model (string, optional): Model to use
  • client (Client, optional): Custom OpenAI client

Score Range: 0-1

  • 1.0 = Output is significantly better than expected
  • 0.5 = Both outputs are roughly equal
  • 0.0 = Expected is significantly better than output

Example:

from autoevals.llm import Battle

evaluator = Battle()
result = evaluator.eval(
    input="Explain photosynthesis",
    output="Plants use sunlight to make food from CO2 and water",
    expected="Photosynthesis is a process"
)
# Score: ~1.0 (first answer is more detailed)

ClosedQA

Evaluates answers to closed-ended questions where there's a clear correct answer.

Parameters:

  • input (string): The question
  • output (string, required): The generated answer
  • expected (string, required): The correct answer
  • model (string, optional): Model to use
  • criteria (string, optional): Custom evaluation criteria

Score Range: 0-1

  • 1.0 = Answer is correct
  • 0.0 = Answer is incorrect
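
Example (a minimal sketch; assumes the parameters above map directly to eval() keyword arguments, as in the Battle example):

from autoevals.llm import ClosedQA

scorer = ClosedQA()
result = scorer.eval(
    input="What year did Apollo 11 land on the Moon?",
    output="1969",
    expected="1969"
)
# Score: 1.0 (the answer matches the correct answer)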

Humor

Evaluates whether the output is humorous.

Parameters:

  • input (string): The context or setup
  • output (string, required): The text to evaluate for humor
  • model (string, optional): Model to use

Score Range: 0-1

  • 1.0 = Very humorous
  • 0.0 = Not humorous
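
Example (a minimal sketch following the eval() pattern of the Battle example; the joke is illustrative):

from autoevals.llm import Humor

scorer = Humor()
result = scorer.eval(
    input="Tell me a programming joke",
    output="Why do programmers prefer dark mode? Because light attracts bugs."
)
# Score: closer to 1.0 the more humorous the judge finds the output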

Security

Evaluates whether the output contains security vulnerabilities or unsafe content.

Parameters:

  • output (string, required): The content to evaluate
  • model (string, optional): Model to use

Score Range: 0-1

  • 1.0 = No security concerns
  • 0.0 = Contains security vulnerabilities
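
Example (a minimal sketch; assumes output is the only required keyword argument, per the parameter list above):

from autoevals.llm import Security

scorer = Security()
result = scorer.eval(
    output="To reset your password, use the self-service link on the login page."
)
# Score: ~1.0 (no unsafe instructions or vulnerable code)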

Moderation

Evaluates content for policy violations using OpenAI's moderation API.

Parameters:

  • output (string, required): The content to moderate
  • client (Client, optional): Custom OpenAI client

Score Range: 0-1

  • 1.0 = Content is safe
  • 0.0 = Content violates policies

Categories Checked:

  • Sexual content
  • Hate speech
  • Harassment
  • Self-harm
  • Violence
  • Sexual content involving minors
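
Example (a minimal sketch; the import from the package root is an assumption):

from autoevals import Moderation

scorer = Moderation()
result = scorer.eval(
    output="Here is a short, friendly summary of today's weather forecast."
)
# Score: 1.0 (no moderation category is flagged)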

Sql

Evaluates SQL query correctness and quality.

Parameters:

  • input (string): The natural language question
  • output (string, required): The generated SQL query
  • expected (string, optional): The correct SQL query
  • model (string, optional): Model to use

Score Range: 0-1
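
Example (a minimal sketch; assumes the parameters above map to eval() keyword arguments, as in the Battle example):

from autoevals.llm import Sql

scorer = Sql()
result = scorer.eval(
    input="How many users signed up in 2023?",
    output="SELECT COUNT(*) FROM users WHERE signup_year = 2023",
    expected="SELECT COUNT(*) FROM users WHERE signup_year = 2023"
)
# Score: ~1.0 (the generated query matches the expected query)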

Summary

Evaluates the quality of text summaries.

Parameters:

  • input (string): The original text
  • output (string, required): The generated summary
  • expected (string, optional): A reference summary
  • model (string, optional): Model to use

Score Range: 0-1

  • 1.0 = Excellent summary (accurate, concise, complete)
  • 0.0 = Poor summary
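
Example (a minimal sketch; the article text and reference summary are illustrative):

from autoevals.llm import Summary

scorer = Summary()
result = scorer.eval(
    input="Photovoltaic cells convert sunlight into electricity, and an inverter converts their direct current into alternating current for the grid.",
    output="Solar panels turn sunlight into electricity, which an inverter converts for grid use.",
    expected="Solar panels generate electricity from sunlight; an inverter makes it usable on the grid."
)
# Score: ~1.0 (the summary is accurate, concise, and complete)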

Translation

Evaluates translation quality.

Parameters:

  • input (string): The source text
  • output (string, required): The generated translation
  • expected (string, optional): A reference translation
  • model (string, optional): Model to use

Score Range: 0-1

  • 1.0 = Excellent translation
  • 0.0 = Poor translation
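
Example (a minimal sketch; assumes the parameters above map to eval() keyword arguments):

from autoevals.llm import Translation

scorer = Translation()
result = scorer.eval(
    input="Bonjour, comment allez-vous ?",
    output="Hello, how are you?",
    expected="Hello, how are you?"
)
# Score: ~1.0 (the translation is faithful to the source text)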

RAG (Retrieval-Augmented Generation) scorers

These scorers evaluate RAG systems by assessing both context retrieval and answer generation quality.

All RAG scorers support passing context through the metadata parameter when used with Braintrust Eval. See the RAGAS module documentation for examples.

ContextRelevancy

Evaluates how relevant the retrieved context is to the input question.

Parameters:

  • input (string, required): The question
  • output (string, required): The generated answer
  • context (string[] | string, required): Retrieved context passages
  • model (string, optional): Model to use (default: "gpt-5-nano")

Score Range: 0-1

  • 1.0 = All context is highly relevant
  • 0.0 = Context is irrelevant

Example:

from autoevals.ragas import ContextRelevancy

scorer = ContextRelevancy()
result = scorer.eval(
    input="What is the capital of France?",
    output="Paris",
    context=[
        "Paris is the capital of France.",
        "Berlin is the capital of Germany."
    ]
)
# Score: ~0.5 (only first context item is relevant)

ContextRecall

Measures how well the context supports the expected answer.

Parameters:

  • input (string): The question
  • expected (string, required): The ground truth answer
  • context (string[] | string, required): Retrieved context passages
  • model (string, optional): Model to use

Score Range: 0-1

  • 1.0 = Context fully supports the expected answer
  • 0.0 = Context doesn't support the expected answer
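
Example (a minimal sketch following the ContextRelevancy example above):

from autoevals.ragas import ContextRecall

scorer = ContextRecall()
result = scorer.eval(
    input="What is the capital of France?",
    expected="The capital of France is Paris.",
    context=["Paris is the capital and largest city of France."]
)
# Score: ~1.0 (the context supports every statement in the expected answer)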

ContextPrecision

Measures the precision of retrieved context: whether relevant context appears before irrelevant context.

Parameters:

  • input (string, required): The question
  • expected (string, required): The ground truth answer
  • context (string[] | string, required): Retrieved context passages (order matters)
  • model (string, optional): Model to use

Score Range: 0-1

  • 1.0 = All relevant context appears first
  • 0.0 = Relevant context is buried under irrelevant context
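
Example (a minimal sketch following the ContextRelevancy example above; note that the order of the context list matters):

from autoevals.ragas import ContextPrecision

scorer = ContextPrecision()
result = scorer.eval(
    input="What is the capital of France?",
    expected="Paris",
    context=[
        "Paris is the capital of France.",   # relevant passage ranked first
        "France is famous for its cuisine."  # less relevant passage ranked last
    ]
)
# Score: high, because the relevant passage appears before the irrelevant one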

ContextEntityRecall

Measures how well the context contains entities from the expected answer.

Parameters:

  • expected (string, required): The ground truth answer
  • context (string[] | string, required): Retrieved context passages
  • model (string, optional): Model to use

Score Range: 0-1

  • 1.0 = All entities from expected answer are in context
  • 0.0 = No entities from expected answer are in context
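
Example (a minimal sketch following the ContextRelevancy example above):

from autoevals.ragas import ContextEntityRecall

scorer = ContextEntityRecall()
result = scorer.eval(
    expected="The Eiffel Tower was completed in 1889 in Paris.",
    context=["The Eiffel Tower, finished in 1889, stands in Paris, France."]
)
# Score: ~1.0 (the entities "Eiffel Tower", "1889", and "Paris" all appear in the context)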

Faithfulness

Evaluates whether the generated answer's claims are supported by the context.

Parameters:

  • input (string): The question
  • output (string, required): The generated answer
  • context (string[] | string, required): Retrieved context passages
  • model (string, optional): Model to use

Score Range: 0-1

  • 1.0 = All claims in the answer are supported by context
  • 0.0 = Answer contains unsupported claims (hallucinations)

Example:

import { Faithfulness } from "autoevals/ragas";

const result = await Faithfulness({
  input: "What is photosynthesis?",
  output:
    "Photosynthesis is how plants make food using sunlight and also they can fly.",
  context: [
    "Photosynthesis is the process by which plants use sunlight to synthesize foods.",
  ],
});
// Score: ~0.5 (first claim supported, "can fly" is not)

AnswerRelevancy

Measures how relevant the answer is to the question.

Parameters:

  • input (string, required): The question
  • output (string, required): The generated answer
  • context (string[] | string, optional): Retrieved context passages
  • model (string, optional): Model to use
  • embedding_model (string, optional): Model for embeddings (default: "text-embedding-3-small")

Score Range: 0-1

  • 1.0 = Answer directly addresses the question
  • 0.0 = Answer is off-topic
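
Example (a minimal sketch following the ContextRelevancy example above):

from autoevals.ragas import AnswerRelevancy

scorer = AnswerRelevancy()
result = scorer.eval(
    input="What is the boiling point of water at sea level?",
    output="Water boils at 100 degrees Celsius at sea level.",
    context=["At sea level, water boils at 100 degrees Celsius (212 degrees Fahrenheit)."]
)
# Score: ~1.0 (the answer directly addresses the question)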

AnswerSimilarity

Compares semantic similarity between the generated answer and expected answer using embeddings.

Parameters:

  • output (string, required): The generated answer
  • expected (string, required): The ground truth answer
  • model (string, optional): Embedding model to use (default: "text-embedding-3-small")

Score Range: 0-1

  • 1.0 = Answers are semantically identical
  • 0.0 = Answers are completely different
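
Example (a minimal sketch following the ContextRelevancy example above):

from autoevals.ragas import AnswerSimilarity

scorer = AnswerSimilarity()
result = scorer.eval(
    output="Paris is the capital of France.",
    expected="The capital of France is Paris."
)
# Score: close to 1.0 (the two answers are semantically equivalent)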

AnswerCorrectness

Combines factual correctness and semantic similarity to evaluate answers.

Parameters:

  • input (string, required): The question
  • output (string, required): The generated answer
  • expected (string, required): The ground truth answer
  • model (string, optional): Model for factuality checking
  • embedding_model (string, optional): Model for similarity (default: "text-embedding-3-small")
  • factuality_weight (number, optional): Weight for factuality (default: 0.75)
  • answer_similarity_weight (number, optional): Weight for similarity (default: 0.25)

Score Range: 0-1

Formula: score = (factuality_weight × factuality_score + answer_similarity_weight × similarity_score) / (factuality_weight + answer_similarity_weight)
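
Example (a minimal sketch following the ContextRelevancy example above; the comment works through the formula with the default weights):

from autoevals.ragas import AnswerCorrectness

scorer = AnswerCorrectness()
result = scorer.eval(
    input="What is the capital of France?",
    output="Paris is the capital of France.",
    expected="The capital of France is Paris."
)
# With the default weights, a factuality score of 1.0 and a similarity score of 0.8
# would combine to (0.75 * 1.0 + 0.25 * 0.8) / (0.75 + 0.25) = 0.95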


Heuristic scorers

Fast, deterministic scorers that don't use LLMs.

Levenshtein

Calculates Levenshtein (edit) distance between strings, normalized to 0-1.

Parameters:

  • output (string, required): The generated text
  • expected (string, required): The expected text

Score Range: 0-1

  • 1.0 = Strings are identical
  • 0.0 = Strings are completely different

Example:

from autoevals.string import Levenshtein

scorer = Levenshtein()
result = scorer.eval(output="hello", expected="helo")
# Score: ~0.8 (1 character difference)

ExactMatch

Binary scorer that checks for exact string equality.

Parameters:

  • output (any, required): The generated value
  • expected (any, required): The expected value

Score Range: 0 or 1

  • 1 = Values are exactly equal
  • 0 = Values differ
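
Example (a minimal sketch; the import from the package root is an assumption):

from autoevals import ExactMatch

scorer = ExactMatch()
result = scorer.eval(output="Paris", expected="Paris")
# Score: 1 (values are exactly equal)

result = scorer.eval(output="paris", expected="Paris")
# Score: 0 (case differs, so the values are not exactly equal)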

NumericDiff

Evaluates numeric differences with configurable thresholds.

Parameters:

  • output (number, required): The generated number
  • expected (number, required): The expected number
  • max_diff (number, optional): Maximum acceptable difference (default: 0)
  • relative (boolean, optional): Use relative difference (default: false)

Score Range: 0-1

  • 1.0 = Numbers are equal (within threshold)
  • 0.0 = Numbers differ significantly

Formula (absolute): score = max(0, 1 - |output - expected| / max_diff) (when max_diff > 0)

Formula (relative): score = max(0, 1 - |output - expected| / |expected|)

Example:

import { NumericDiff } from "autoevals";

// Absolute difference
const result1 = await NumericDiff({
  output: 10.5,
  expected: 10.0,
  maxDiff: 1.0,
});
// Score: 0.5 (difference of 0.5 out of max 1.0)

// Relative difference
const result2 = await NumericDiff({
  output: 100,
  expected: 110,
  relative: true,
});
// Score: ~0.91 (10% difference)

EmbeddingSimilarity

Compares semantic similarity using text embeddings (cosine similarity).

Parameters:

  • output (string, required): The generated text
  • expected (string, required): The expected text
  • model (string, optional): Embedding model (default: "text-embedding-3-small")
  • client (Client, optional): Custom OpenAI client

Score Range: -1 to 1 (typically 0-1 for text)

  • 1.0 = Semantically identical
  • 0.0 = Unrelated
  • -1.0 = Opposite meanings (rare)
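
Example (a minimal sketch following the Levenshtein example above; computing embeddings requires an OpenAI API key):

from autoevals.string import EmbeddingSimilarity

scorer = EmbeddingSimilarity()
result = scorer.eval(
    output="The weather is sunny today.",
    expected="It is a bright, sunny day."
)
# Score: high (the sentences are close in embedding space)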

JSON scorers

Scorers for evaluating JSON outputs.

JSONDiff

Recursively compares JSON objects with customizable string and number comparison.

Parameters:

  • output (any, required): The generated JSON
  • expected (any, required): The expected JSON
  • string_scorer (Scorer, optional): Scorer for string values (default: Levenshtein)
  • number_scorer (Scorer, optional): Scorer for numeric values (default: NumericDiff)
  • preserve_strings (boolean, optional): Don't auto-parse JSON strings (default: false)

Score Range: 0-1

  • 1.0 = JSON structures are identical
  • 0.0 = JSON structures are completely different

Example:

from autoevals.json import JSONDiff

scorer = JSONDiff()
result = scorer.eval(
    output={"name": "John", "age": 30},
    expected={"name": "John", "age": 31}
)
# Score: ~0.5 (name matches, age differs slightly)

ValidJSON

Validates JSON syntax and optionally checks against a JSON Schema.

Parameters:

  • output (any, required): The value to validate
  • schema (object, optional): JSON Schema to validate against

Score Range: 0 or 1

  • 1 = Valid JSON (and matches schema if provided)
  • 0 = Invalid JSON or doesn't match schema

Example:

import { ValidJSON } from "autoevals/json";

const schema = {
  type: "object",
  properties: {
    name: { type: "string" },
    age: { type: "number" },
  },
  required: ["name", "age"],
};

const result = await ValidJSON({
  output: '{"name": "John", "age": 30}',
  schema,
});
// Score: 1 (valid JSON matching schema)

List scorers

Scorers for evaluating lists and arrays.

ListContains

Checks if all expected items are present in the output list.

Parameters:

  • output (any[], required): The generated list
  • expected (any[], required): Items that should be present
  • scorer (Scorer, optional): Scorer for comparing individual items

Score Range: 0-1

  • 1.0 = All expected items are present
  • 0.0 = None of the expected items are present

Example:

from autoevals.list import ListContains

scorer = ListContains()
result = scorer.eval(
    output=["apple", "banana", "cherry"],
    expected=["apple", "banana"]
)
# Score: 1.0 (both expected items present)

Custom scorers

You can create custom scorers for domain-specific evaluation needs.
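
As an illustration, a custom heuristic scorer can be as simple as a function that returns a score between 0 and 1 (a sketch only; the exact interface expected by your evaluation harness may differ):

# A hypothetical word-overlap scorer, written as a plain function.
def word_overlap(output: str, expected: str) -> float:
    """Fraction of expected words that also appear in the output (order-insensitive)."""
    expected_words = set(expected.lower().split())
    output_words = set(output.lower().split())
    if not expected_words:
        return 1.0
    return len(expected_words & output_words) / len(expected_words)

score = word_overlap(
    output="Paris is the capital of France",
    expected="The capital of France is Paris"
)
# score == 1.0 (every expected word appears in the output)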


Score interpretation

General guidelines for interpreting scores:

  • 1.0: Perfect match or complete correctness
  • 0.8-0.99: Very good, minor differences
  • 0.6-0.79: Acceptable, some issues
  • 0.4-0.59: Moderate quality, significant issues
  • 0.2-0.39: Poor quality, major issues
  • 0.0-0.19: Unacceptable or completely wrong

Note: Interpretation varies by scorer type. Binary scorers (ExactMatch, ValidJSON) only return 0 or 1.


Common parameters

Many scorers share these common parameters:

  • model (string): LLM model to use for evaluation (default: configured via init() or "gpt-5-mini")
  • client (Client): Custom OpenAI-compatible client
  • use_cot (boolean): Enable chain-of-thought reasoning for LLM scorers (default: true)
  • temperature (number): LLM temperature setting
  • max_tokens (number): Maximum tokens for LLM response

Configuring defaults

Use init() to configure default settings for all scorers:

TypeScript:

import { init } from "autoevals";
import OpenAI from "openai";

init({
  client: new OpenAI({ apiKey: "..." }),
  defaultModel: "gpt-5-mini",
});

Python:

from autoevals import init
from openai import OpenAI

init(OpenAI(api_key="..."), default_model="gpt-5-mini")