Autoevals scorer reference

Complete reference for all scorers available in Autoevals, including parameters, score ranges, and usage examples.

Table of contents

  • LLM-as-a-judge scorers
  • RAG (Retrieval-Augmented Generation) scorers
  • Heuristic scorers
  • JSON scorers
  • List scorers
  • Custom scorers
  • Score interpretation
  • Common parameters

LLM-as-a-judge scorers

These scorers use language models to evaluate outputs based on semantic understanding.

Factuality

Evaluates whether the output is factually consistent with the expected answer.

Parameters:

  • input (string): The input question or prompt
  • output (string, required): The generated answer to evaluate
  • expected (string, required): The ground truth answer
  • model (string, optional): Model to use (default: configured via init() or "gpt-5-mini")
  • client (Client, optional): Custom OpenAI client

Score Range: 0-1

  • 1.0 = Output is factually accurate
  • 0.0 = Output contains factual errors

Example:

import { Factuality } from "autoevals";

const result = await Factuality({
  input: "What is the capital of France?",
  output: "Paris",
  expected: "The capital of France is Paris",
});
// Score: 1.0 (factually correct)

Battle

Compares two outputs and determines which one is better.

Parameters:

  • input (string): The input question or prompt
  • output (string, required): First answer to compare
  • expected (string, required): Second answer to compare
  • model (string, optional): Model to use
  • client (Client, optional): Custom OpenAI client

Score Range: 0-1

  • 1.0 = Output is significantly better than expected
  • 0.5 = Both outputs are roughly equal
  • 0.0 = Expected is significantly better than output

Example:

from autoevals.llm import Battle

evaluator = Battle()
result = evaluator.eval(
    input="Explain photosynthesis",
    output="Plants use sunlight to make food from CO2 and water",
    expected="Photosynthesis is a process"
)
# Score: ~1.0 (first answer is more detailed)

ClosedQA

Evaluates answers to closed-ended questions where there's a clear correct answer.

Parameters:

  • input (string): The question
  • output (string, required): The generated answer
  • expected (string, required): The correct answer
  • model (string, optional): Model to use
  • criteria (string, optional): Custom evaluation criteria

Score Range: 0-1

  • 1.0 = Answer is correct
  • 0.0 = Answer is incorrect
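
Example (a minimal sketch; assumes the parameters above map directly to eval() keyword arguments, as in the Battle example):

from autoevals.llm import ClosedQA

scorer = ClosedQA()
result = scorer.eval(
    input="What year did Apollo 11 land on the Moon?",
    output="1969",
    expected="1969"
)
# Score: 1.0 (the answer matches the correct answer)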

Humor

Evaluates whether the output is humorous.

Parameters:

  • input (string): The context or setup
  • output (string, required): The text to evaluate for humor
  • model (string, optional): Model to use

Score Range: 0-1

  • 1.0 = Very humorous
  • 0.0 = Not humorous
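
Example (a minimal sketch following the eval() pattern of the Battle example; the joke is illustrative):

from autoevals.llm import Humor

scorer = Humor()
result = scorer.eval(
    input="Tell me a programming joke",
    output="Why do programmers prefer dark mode? Because light attracts bugs."
)
# Score: closer to 1.0 the more humorous the judge finds the output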

Security

Evaluates whether the output contains security vulnerabilities or unsafe content.

Parameters:

  • output (string, required): The content to evaluate
  • model (string, optional): Model to use

Score Range: 0-1

  • 1.0 = No security concerns
  • 0.0 = Contains security vulnerabilities
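
Example (a minimal sketch; assumes output is the only required keyword argument, per the parameter list above):

from autoevals.llm import Security

scorer = Security()
result = scorer.eval(
    output="To reset your password, use the self-service link on the login page."
)
# Score: ~1.0 (no unsafe instructions or vulnerable code)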

Moderation

Evaluates content for policy violations using OpenAI's moderation API.

Parameters:

  • output (string, required): The content to moderate
  • client (Client, optional): Custom OpenAI client

Score Range: 0-1

  • 1.0 = Content is safe
  • 0.0 = Content violates policies

Categories Checked:

  • Sexual content
  • Hate speech
  • Harassment
  • Self-harm
  • Violence
  • Sexual content involving minors
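
Example (a minimal sketch; the import from the package root is an assumption):

from autoevals import Moderation

scorer = Moderation()
result = scorer.eval(
    output="Here is a short, friendly summary of today's weather forecast."
)
# Score: 1.0 (no moderation category is flagged)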

Sql

Evaluates SQL query correctness and quality.

Parameters:

  • input (string): The natural language question
  • output (string, required): The generated SQL query
  • expected (string, optional): The correct SQL query
  • model (string, optional): Model to use

Score Range: 0-1
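
Example (a minimal sketch; assumes the parameters above map to eval() keyword arguments, as in the Battle example):

from autoevals.llm import Sql

scorer = Sql()
result = scorer.eval(
    input="How many users signed up in 2023?",
    output="SELECT COUNT(*) FROM users WHERE signup_year = 2023",
    expected="SELECT COUNT(*) FROM users WHERE signup_year = 2023"
)
# Score: ~1.0 (the generated query matches the expected query)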

Summary

Evaluates the quality of text summaries.

Parameters:

  • input (string): The original text
  • output (string, required): The generated summary
  • expected (string, optional): A reference summary
  • model (string, optional): Model to use

Score Range: 0-1

  • 1.0 = Excellent summary (accurate, concise, complete)
  • 0.0 = Poor summary
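
Example (a minimal sketch; the article text and reference summary are illustrative):

from autoevals.llm import Summary

scorer = Summary()
result = scorer.eval(
    input="Photovoltaic cells convert sunlight into electricity, and an inverter converts their direct current into alternating current for the grid.",
    output="Solar panels turn sunlight into electricity, which an inverter converts for grid use.",
    expected="Solar panels generate electricity from sunlight; an inverter makes it usable on the grid."
)
# Score: ~1.0 (the summary is accurate, concise, and complete)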

Translation

Evaluates translation quality.

Parameters:

  • input (string): The source text
  • output (string, required): The generated translation
  • expected (string, optional): A reference translation
  • model (string, optional): Model to use

Score Range: 0-1

  • 1.0 = Excellent translation
  • 0.0 = Poor translation
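
Example (a minimal sketch; assumes the parameters above map to eval() keyword arguments):

from autoevals.llm import Translation

scorer = Translation()
result = scorer.eval(
    input="Bonjour, comment allez-vous ?",
    output="Hello, how are you?",
    expected="Hello, how are you?"
)
# Score: ~1.0 (the translation is faithful to the source text)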

RAG (Retrieval-Augmented Generation) scorers

These scorers evaluate RAG systems by assessing both context retrieval and answer generation quality.

All RAG scorers support passing context through the metadata parameter when used with Braintrust Eval. See the RAGAS module documentation for examples.

ContextRelevancy

Evaluates how relevant the retrieved context is to the input question.

Parameters:

  • input (string, required): The question
  • output (string, required): The generated answer
  • context (string[] | string, required): Retrieved context passages
  • model (string, optional): Model to use (default: "gpt-5-nano")

Score Range: 0-1

  • 1.0 = All context is highly relevant
  • 0.0 = Context is irrelevant

Example:

from autoevals.ragas import ContextRelevancy

scorer = ContextRelevancy()
result = scorer.eval(
    input="What is the capital of France?",
    output="Paris",
    context=[
        "Paris is the capital of France.",
        "Berlin is the capital of Germany."
    ]
)
# Score: ~0.5 (only first context item is relevant)

ContextRecall

Measures how well the context supports the expected answer.

Parameters:

  • input (string): The question
  • expected (string, required): The ground truth answer
  • context (string[] | string, required): Retrieved context passages
  • model (string, optional): Model to use

Score Range: 0-1

  • 1.0 = Context fully supports the expected answer
  • 0.0 = Context doesn't support the expected answer
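
Example (a minimal sketch following the ContextRelevancy example above):

from autoevals.ragas import ContextRecall

scorer = ContextRecall()
result = scorer.eval(
    input="What is the capital of France?",
    expected="The capital of France is Paris.",
    context=["Paris is the capital and largest city of France."]
)
# Score: ~1.0 (the context supports every statement in the expected answer)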

ContextPrecision

Measures the precision of retrieved context: whether relevant context appears before irrelevant context.

Parameters:

  • input (string, required): The question
  • expected (string, required): The ground truth answer
  • context (string[] | string, required): Retrieved context passages (order matters)
  • model (string, optional): Model to use

Score Range: 0-1

  • 1.0 = All relevant context appears first
  • 0.0 = Relevant context is buried under irrelevant context
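
Example (a minimal sketch following the ContextRelevancy example above; note that the order of the context list matters):

from autoevals.ragas import ContextPrecision

scorer = ContextPrecision()
result = scorer.eval(
    input="What is the capital of France?",
    expected="Paris",
    context=[
        "Paris is the capital of France.",   # relevant passage ranked first
        "France is famous for its cuisine."  # less relevant passage ranked last
    ]
)
# Score: high, because the relevant passage appears before the irrelevant one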

ContextEntityRecall

Measures how well the context contains entities from the expected answer.

Parameters:

  • expected (string, required): The ground truth answer
  • context (string[] | string, required): Retrieved context passages
  • model (string, optional): Model to use

Score Range: 0-1

  • 1.0 = All entities from expected answer are in context
  • 0.0 = No entities from expected answer are in context
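
Example (a minimal sketch following the ContextRelevancy example above):

from autoevals.ragas import ContextEntityRecall

scorer = ContextEntityRecall()
result = scorer.eval(
    expected="The Eiffel Tower was completed in 1889 in Paris.",
    context=["The Eiffel Tower, finished in 1889, stands in Paris, France."]
)
# Score: ~1.0 (the entities "Eiffel Tower", "1889", and "Paris" all appear in the context)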

Faithfulness

Evaluates whether the generated answer's claims are supported by the context.

Parameters:

  • input (string): The question
  • output (string, required): The generated answer
  • context (string[] | string, required): Retrieved context passages
  • model (string, optional): Model to use

Score Range: 0-1

  • 1.0 = All claims in the answer are supported by context
  • 0.0 = Answer contains unsupported claims (hallucinations)

Example:

import { Faithfulness } from "autoevals/ragas";

const result = await Faithfulness({
  input: "What is photosynthesis?",
  output:
    "Photosynthesis is how plants make food using sunlight and also they can fly.",
  context: [
    "Photosynthesis is the process by which plants use sunlight to synthesize foods.",
  ],
});
// Score: ~0.5 (first claim supported, "can fly" is not)

AnswerRelevancy

Measures how relevant the answer is to the question.

Parameters:

  • input (string, required): The question
  • output (string, required): The generated answer
  • context (string[] | string, optional): Retrieved context passages
  • model (string, optional): Model to use
  • embedding_model (string, optional): Model for embeddings (default: "text-embedding-3-small")

Score Range: 0-1

  • 1.0 = Answer directly addresses the question
  • 0.0 = Answer is off-topic
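
Example (a minimal sketch following the ContextRelevancy example above):

from autoevals.ragas import AnswerRelevancy

scorer = AnswerRelevancy()
result = scorer.eval(
    input="What is the boiling point of water at sea level?",
    output="Water boils at 100 degrees Celsius at sea level.",
    context=["At sea level, water boils at 100 degrees Celsius (212 degrees Fahrenheit)."]
)
# Score: ~1.0 (the answer directly addresses the question)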

AnswerSimilarity

Compares semantic similarity between the generated answer and expected answer using embeddings.

Parameters:

  • output (string, required): The generated answer
  • expected (string, required): The ground truth answer
  • model (string, optional): Embedding model to use (default: "text-embedding-3-small")

Score Range: 0-1

  • 1.0 = Answers are semantically identical
  • 0.0 = Answers are completely different
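
Example (a minimal sketch following the ContextRelevancy example above):

from autoevals.ragas import AnswerSimilarity

scorer = AnswerSimilarity()
result = scorer.eval(
    output="Paris is the capital of France.",
    expected="The capital of France is Paris."
)
# Score: close to 1.0 (the two answers are semantically equivalent)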

AnswerCorrectness

Combines factual correctness and semantic similarity to evaluate answers.

Parameters:

  • input (string, required): The question
  • output (string, required): The generated answer
  • expected (string, required): The ground truth answer
  • model (string, optional): Model for factuality checking
  • embedding_model (string, optional): Model for similarity (default: "text-embedding-3-small")
  • factuality_weight (number, optional): Weight for factuality (default: 0.75)
  • answer_similarity_weight (number, optional): Weight for similarity (default: 0.25)

Score Range: 0-1

Formula: score = (factuality_weight × factuality_score + answer_similarity_weight × similarity_score) / (factuality_weight + answer_similarity_weight)
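
Example (a minimal sketch following the ContextRelevancy example above; the comment works through the formula with the default weights):

from autoevals.ragas import AnswerCorrectness

scorer = AnswerCorrectness()
result = scorer.eval(
    input="What is the capital of France?",
    output="Paris is the capital of France.",
    expected="The capital of France is Paris."
)
# With the default weights, a factuality score of 1.0 and a similarity score of 0.8
# would combine to (0.75 * 1.0 + 0.25 * 0.8) / (0.75 + 0.25) = 0.95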


Heuristic scorers

Fast, deterministic scorers that don't use LLMs.

Levenshtein

Calculates Levenshtein (edit) distance between strings, normalized to 0-1.

Parameters:

  • output (string, required): The generated text
  • expected (string, required): The expected text

Score Range: 0-1

  • 1.0 = Strings are identical
  • 0.0 = Strings are completely different

Example:

from autoevals.string import Levenshtein

scorer = Levenshtein()
result = scorer.eval(output="hello", expected="helo")
# Score: ~0.8 (1 character difference)

ExactMatch

Binary scorer that checks for exact string equality.

Parameters:

  • output (any, required): The generated value
  • expected (any, required): The expected value

Score Range: 0 or 1

  • 1 = Values are exactly equal
  • 0 = Values differ
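
Example (a minimal sketch; the import from the package root is an assumption):

from autoevals import ExactMatch

scorer = ExactMatch()
result = scorer.eval(output="Paris", expected="Paris")
# Score: 1 (values are exactly equal)

result = scorer.eval(output="paris", expected="Paris")
# Score: 0 (case differs, so the values are not exactly equal)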

NumericDiff

Evaluates numeric differences with configurable thresholds.

Parameters:

  • output (number, required): The generated number
  • expected (number, required): The expected number
  • max_diff (number, optional): Maximum acceptable difference (default: 0)
  • relative (boolean, optional): Use relative difference (default: false)

Score Range: 0-1

  • 1.0 = Numbers are equal (within threshold)
  • 0.0 = Numbers differ significantly

Formula (absolute): score = max(0, 1 - |output - expected| / max_diff) (when max_diff > 0)

Formula (relative): score = max(0, 1 - |output - expected| / |expected|)

Example:

import { NumericDiff } from "autoevals";

// Absolute difference
const result1 = await NumericDiff({
  output: 10.5,
  expected: 10.0,
  maxDiff: 1.0,
});
// Score: 0.5 (difference of 0.5 out of max 1.0)

// Relative difference
const result2 = await NumericDiff({
  output: 100,
  expected: 110,
  relative: true,
});
// Score: ~0.91 (10% difference)

EmbeddingSimilarity

Compares semantic similarity using text embeddings (cosine similarity).

Parameters:

  • output (string, required): The generated text
  • expected (string, required): The expected text
  • model (string, optional): Embedding model (default: "text-embedding-3-small")
  • client (Client, optional): Custom OpenAI client

Score Range: -1 to 1 (typically 0-1 for text)

  • 1.0 = Semantically identical
  • 0.0 = Unrelated
  • -1.0 = Opposite meanings (rare)
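
Example (a minimal sketch following the Levenshtein example above; computing embeddings requires an OpenAI API key):

from autoevals.string import EmbeddingSimilarity

scorer = EmbeddingSimilarity()
result = scorer.eval(
    output="The weather is sunny today.",
    expected="It is a bright, sunny day."
)
# Score: high (the sentences are close in embedding space)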

JSON scorers

Scorers for evaluating JSON outputs.

JSONDiff

Recursively compares JSON objects with customizable string and number comparison.

Parameters:

  • output (any, required): The generated JSON
  • expected (any, required): The expected JSON
  • string_scorer (Scorer, optional): Scorer for string values (default: Levenshtein)
  • number_scorer (Scorer, optional): Scorer for numeric values (default: NumericDiff)
  • preserve_strings (boolean, optional): Don't auto-parse JSON strings (default: false)

Score Range: 0-1

  • 1.0 = JSON structures are identical
  • 0.0 = JSON structures are completely different

Example:

from autoevals.json import JSONDiff

scorer = JSONDiff()
result = scorer.eval(
    output={"name": "John", "age": 30},
    expected={"name": "John", "age": 31}
)
# Score: ~0.5 (name matches, age differs slightly)

ValidJSON

Validates JSON syntax and optionally checks against a JSON Schema.

Parameters:

  • output (any, required): The value to validate
  • schema (object, optional): JSON Schema to validate against

Score Range: 0 or 1

  • 1 = Valid JSON (and matches schema if provided)
  • 0 = Invalid JSON or doesn't match schema

Example:

import { ValidJSON } from "autoevals/json";

const schema = {
  type: "object",
  properties: {
    name: { type: "string" },
    age: { type: "number" },
  },
  required: ["name", "age"],
};

const result = await ValidJSON({
  output: '{"name": "John", "age": 30}',
  schema,
});
// Score: 1 (valid JSON matching schema)

List scorers

Scorers for evaluating lists and arrays.

ListContains

Checks if all expected items are present in the output list.

Parameters:

  • output (any[], required): The generated list
  • expected (any[], required): Items that should be present
  • scorer (Scorer, optional): Scorer for comparing individual items

Score Range: 0-1

  • 1.0 = All expected items are present
  • 0.0 = None of the expected items are present

Example:

from autoevals.list import ListContains

scorer = ListContains()
result = scorer.eval(
    output=["apple", "banana", "cherry"],
    expected=["apple", "banana"]
)
# Score: 1.0 (both expected items present)

Custom scorers

You can create custom scorers for domain-specific evaluation needs.
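
As an illustration, a custom heuristic scorer can be as simple as a function that returns a score between 0 and 1 (a sketch only; the exact interface expected by your evaluation harness may differ):

# A hypothetical word-overlap scorer, written as a plain function.
def word_overlap(output: str, expected: str) -> float:
    """Fraction of expected words that also appear in the output (order-insensitive)."""
    expected_words = set(expected.lower().split())
    output_words = set(output.lower().split())
    if not expected_words:
        return 1.0
    return len(expected_words & output_words) / len(expected_words)

score = word_overlap(
    output="Paris is the capital of France",
    expected="The capital of France is Paris"
)
# score == 1.0 (every expected word appears in the output)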


Score interpretation

General guidelines for interpreting scores:

  • 1.0: Perfect match or complete correctness
  • 0.8-0.99: Very good, minor differences
  • 0.6-0.79: Acceptable, some issues
  • 0.4-0.59: Moderate quality, significant issues
  • 0.2-0.39: Poor quality, major issues
  • 0.0-0.19: Unacceptable or completely wrong

Note: Interpretation varies by scorer type. Binary scorers (ExactMatch, ValidJSON) only return 0 or 1.


Common parameters

Many scorers share these common parameters:

  • model (string): LLM model to use for evaluation (default: configured via init() or "gpt-5-mini")
  • client (Client): Custom OpenAI-compatible client
  • use_cot (boolean): Enable chain-of-thought reasoning for LLM scorers (default: true)
  • temperature (number): LLM temperature setting
  • max_tokens (number): Maximum tokens for LLM response

Configuring defaults

Use init() to configure default settings for all scorers:

TypeScript:

import { init } from "autoevals";
import OpenAI from "openai";

init({
  client: new OpenAI({ apiKey: "..." }),
  defaultModel: "gpt-5-mini",
});

Python:

from autoevals import init
from openai import OpenAI

init(OpenAI(api_key="..."), default_model="gpt-5-mini")