ArtemisKit

Open-source LLM evaluation toolkit - Test, evaluate, stress-test, and red-team your AI applications with scenario-based testing and multi-provider support.

📚 Documentation | 🚀 Getting Started

Features

Scenario-Based Testing - Define test cases in YAML with multi-turn conversation support
Security Red Teaming - OWASP LLM Top 10 2025 attack vectors with 7+ mutation strategies
Guardian Mode - Runtime AI protection with injection detection, PII filtering, and action validation
Programmatic SDK - TypeScript/JavaScript SDK with Jest/Vitest integration
Stress Testing - Measure latency, throughput, and reliability under load
Multi-Provider Support - OpenAI, Anthropic, Azure OpenAI, Vercel AI SDK (20+ providers)
Agentic Testing - Test LangChain and DeepAgents applications
Rich Reports - Interactive HTML reports with configuration traceability
CI/CD Ready - Exit codes, JUnit export, and baseline regression detection

Installation

npm install -g @artemiskit/cli
# or
pnpm add -g @artemiskit/cli
# or
bun add -g @artemiskit/cli

Quick Start (Basic Example)

This is the simplest way to get started with ArtemisKit.

1. Set up your API key

export OPENAI_API_KEY="your-api-key"

2. Create a simple scenario

# scenarios/hello.yaml
name: hello-test
description: My first ArtemisKit test

cases:
  - id: greeting-test
    prompt: "Say hello"
    expected:
      type: contains
      values:
        - "hello"
      mode: any

3. Run it

artemiskit run scenarios/hello.yaml
# or use the short alias
akit run scenarios/hello.yaml

That's it! ArtemisKit will use OpenAI by default. See below for full configuration options.

Configuration

Config File (Full Reference)

Create artemis.config.yaml in your project root. Here's every available option:

# artemis.config.yaml - Full Reference
# =====================================

# Project identifier (used in run storage and reports)
project: my-project

# Default provider to use when not specified in scenario or CLI
# Options: openai, azure-openai, vercel-ai
provider: openai

# Default model to use
# NOTE: For azure-openai, this is DISPLAY ONLY - the actual model
# is determined by your Azure deployment, not this value.
# See docs/providers/azure-openai.md for details.
model: gpt-4o

# Directory containing scenario files
scenariosDir: ./scenarios

# Provider-specific configuration
providers:
  openai:
    # API key (can use environment variable reference)
    apiKey: ${OPENAI_API_KEY}
    
  azure-openai:
    # API key for Azure OpenAI
    apiKey: ${AZURE_OPENAI_API_KEY}
    # Your Azure resource name (the subdomain in your endpoint URL)
    resourceName: ${AZURE_OPENAI_RESOURCE_NAME}
    # The deployment name you created in Azure Portal
    deploymentName: ${AZURE_OPENAI_DEPLOYMENT_NAME}
    # API version (optional, has sensible default)
    apiVersion: "2024-02-15-preview"

  vercel-ai:
    # Underlying provider for Vercel AI SDK
    underlyingProvider: openai
    apiKey: ${OPENAI_API_KEY}

# Storage configuration for run history
storage:
  # Storage type: "local" or "supabase"
  type: local
  # Base path for local storage (relative to project root)
  basePath: ./artemis-runs

# Output configuration for reports
output:
  # Output format: "json", "html", or "both"
  format: html
  # Directory for generated reports
  dir: ./artemis-output

# CI-specific settings (optional)
ci:
  # Fail if regression exceeds threshold
  failOnRegression: true
  # Regression threshold (0-1)
  regressionThreshold: 0.05

Minimal Config File

If you just want to set defaults, a minimal config works too:

# artemis.config.yaml - Minimal
project: my-project
provider: openai
model: gpt-4o

Scenario Format

Basic Scenario (Simple Prompts)

# scenarios/basic.yaml
name: basic-test
description: Simple prompt-response tests

# Optional: Override provider/model for this scenario
provider: openai
model: gpt-4o

cases:
  - id: greeting
    prompt: "Say hello"
    expected:
      type: contains
      values:
        - "hello"
      mode: any

Full Scenario Reference

Here's every available option for scenarios:

# scenarios/full-reference.yaml - Complete Example
# =================================================

# Required: Unique name for this scenario
name: customer-support-eval

# Optional: Human-readable description
description: Evaluate customer support bot responses

# Optional: Scenario version
version: "1.0"

# Optional: Tags for filtering (use --tags flag)
tags:
  - support
  - production

# Optional: Provider override (defaults to config file, then "openai")
# Options: openai, azure-openai, vercel-ai
provider: openai

# Optional: Model override
# NOTE: For azure-openai, this is DISPLAY ONLY - actual model
# is determined by your Azure deployment. See docs/providers/azure-openai.md
model: gpt-4o

# Optional: Model parameters
temperature: 0.7
maxTokens: 1024
seed: 42

# Optional: System prompt prepended to all cases
setup:
  systemPrompt: |
    You are a helpful customer support assistant.
    Always be polite and professional.

# Optional: Scenario-level variables (available to all cases)
# Case-level variables override these. Use {{var_name}} syntax.
variables:
  company_name: "Acme Corp"
  default_greeting: "Hello"

# Required: Test cases to run
cases:
  # ---- Simple prompt/response case ----
  - id: simple-greeting
    name: Simple greeting test
    description: Test basic greeting response
    # The prompt to send to the model
    prompt: "Hello, I need help"
    # Expected result validation
    expected:
      type: contains
      values:
        - "help"
        - "assist"
      mode: any
    # Optional: Tags for this case
    tags:
      - basic

  # ---- Case with regex matching ----
  - id: order-number-check
    name: Order number extraction
    prompt: "My order number is #12345"
    expected:
      type: regex
      pattern: "12345"
      flags: "i"

  # ---- Case with exact match ----
  - id: yes-no-response
    name: Binary response test
    prompt: "Reply with only 'Yes' or 'No': Is the sky blue?"
    expected:
      type: exact
      value: "Yes"
      caseSensitive: false

  # ---- Case with fuzzy matching ----
  - id: fuzzy-match-test
    name: Fuzzy similarity test
    prompt: "What color is grass?"
    expected:
      type: fuzzy
      value: "green"
      threshold: 0.8

  # ---- Case with LLM grading ----
  - id: quality-check
    name: Response quality evaluation
    prompt: "Explain quantum computing in simple terms"
    expected:
      type: llm_grader
      rubric: |
        Score 1.0 if the explanation is clear and accurate.
        Score 0.5 if partially correct but confusing.
        Score 0.0 if incorrect or overly technical.
      threshold: 0.7

  # ---- Case with JSON schema validation ----
  - id: json-output-test
    name: Structured output test
    prompt: "Return a JSON object with name and age fields"
    expected:
      type: json_schema
      schema:
        type: object
        properties:
          name:
            type: string
          age:
            type: number
        required:
          - name
          - age

  # ---- Multi-turn conversation ----
  - id: multi-turn-support
    name: Multi-turn conversation
    # Use array of messages for multi-turn
    prompt:
      - role: user
        content: "I have a problem with my order"
      - role: assistant
        content: "I'd be happy to help. What's your order number?"
      - role: user
        content: "Order number is #99999"
    expected:
      type: contains
      values:
        - "99999"
      mode: any

  # ---- Case with variables ----
  - id: dynamic-content
    name: Variable substitution test
    # Case-level variables override scenario-level
    variables:
      product_name: "Widget Pro"
      order_id: "ORD-789"
    prompt: "What's the status of my {{product_name}} order {{order_id}}?"
    expected:
      type: contains
      values:
        - "ORD-789"
      mode: any

  # ---- Case with timeout and retries ----
  - id: slow-response-test
    name: Timeout handling test
    prompt: "Generate a detailed report"
    expected:
      type: contains
      values:
        - "report"
      mode: any
    timeout: 30000
    retries: 2

Variables

Variables let you create dynamic, reusable scenarios. Use {{variable_name}} syntax in prompts.

name: customer-support
description: Test with dynamic content

# Scenario-level variables - available to all cases
variables:
  company_name: "Acme Corp"
  support_email: "support@acme.com"

cases:
  # Uses scenario-level variables
  - id: contact-info
    prompt: "What is the email for {{company_name}}?"
    expected:
      type: contains
      values:
        - "support@acme.com"
      mode: any

  # Case-level variables override scenario-level
  - id: different-company
    variables:
      company_name: "TechCorp"  # Overrides "Acme Corp"
      product: "Widget"
    prompt: "Tell me about {{product}} from {{company_name}}"
    expected:
      type: contains
      values:
        - "TechCorp"
      mode: any

Variable precedence: case-level > scenario-level

Expectation Types

Type	Description	Key Fields
`contains`	Response contains string(s)	`values: [...]`, `mode: all\|any`
`exact`	Response exactly equals value	`value: "..."`, `caseSensitive: bool`
`regex`	Response matches regex pattern	`pattern: "..."`, `flags: "i"`
`fuzzy`	Fuzzy string similarity	`value: "..."`, `threshold: 0.8`
`llm_grader`	LLM-based evaluation	`rubric: "..."`, `threshold: 0.7`
`json_schema`	Validate JSON structure	`schema: {...}`

CLI Commands

Command	Description
`artemiskit run <scenario>`	Run scenario-based evaluations
`artemiskit validate <path>`	Validate scenarios without running them
`artemiskit redteam <scenario>`	Run security red team tests
`artemiskit stress <scenario>`	Run load/stress tests
`artemiskit report <run-id>`	Regenerate report from saved run
`artemiskit history`	View run history
`artemiskit compare <id1> <id2>`	Compare two runs
`artemiskit baseline`	Manage baselines for regression detection
`artemiskit init`	Initialize configuration

Use akit as a shorter alias for artemiskit.

Run Command Options

artemiskit run <scenario> [options]

Options:
  -p, --provider <provider>   Provider: openai, azure-openai, vercel-ai
  -m, --model <model>         Model to use
  -o, --output <dir>          Output directory for results
  -v, --verbose               Verbose output
  -t, --tags <tags...>        Filter test cases by tags
  -c, --concurrency <n>       Number of concurrent test cases (default: 1)
  --parallel <n>              Number of scenarios to run in parallel
  --timeout <ms>              Timeout per test case in milliseconds
  --retries <n>               Number of retries per test case
  --config <path>             Path to config file
  --save                      Save results to storage (default: true)
  --ci                        CI mode: machine-readable output
  --baseline                  Compare against baseline for regression
  --budget <amount>           Maximum budget in USD
  --export <format>           Export format: markdown or junit

Validate Command Options

artemiskit validate <path> [options]

Options:
  --json                      Output results as JSON
  --strict                    Treat warnings as errors
  -q, --quiet                 Only output errors
  --export junit              Export to JUnit XML for CI

CI/CD Integration

ArtemisKit supports CI/CD pipelines with machine-readable output and JUnit exports:

# Machine-readable output for CI
akit run scenarios/ --ci

# Export to JUnit XML for CI platforms
akit run scenarios/ --export junit --export-output ./test-results

# Validate scenarios before running
akit validate scenarios/ --strict --export junit

GitHub Actions example:

- name: Validate scenarios
  run: akit validate scenarios/ --strict

- name: Run tests
  run: akit run scenarios/ --export junit --export-output ./test-results

- name: Publish Test Results
  uses: EnricoMi/publish-unit-test-result-action@v2
  if: always()
  with:
    files: test-results/*.xml

Providers

ArtemisKit supports multiple LLM providers. See the provider documentation for detailed setup guides.

Provider	Use Case	Docs
`openai`	Direct OpenAI API	docs/providers/openai.md
`azure-openai`	Azure OpenAI Service	docs/providers/azure-openai.md
`vercel-ai`	20+ providers via Vercel AI SDK	docs/providers/vercel-ai.md

Quick Setup

OpenAI:

export OPENAI_API_KEY="sk-..."
akit run scenario.yaml --provider openai --model gpt-4o

Azure OpenAI:

export AZURE_OPENAI_API_KEY="..."
export AZURE_OPENAI_RESOURCE_NAME="my-resource"
export AZURE_OPENAI_DEPLOYMENT_NAME="gpt-4o-deployment"
akit run scenario.yaml --provider azure-openai --model gpt-4o
# Note: --model is for display only; actual model is your deployment

Vercel AI (any provider):

export ANTHROPIC_API_KEY="sk-ant-..."
akit run scenario.yaml --provider vercel-ai --model anthropic:claude-3-5-sonnet-20241022

Security Testing (Red Team)

Test your LLM for vulnerabilities:

akit redteam scenarios/my-bot.yaml --mutations typo,role-spoof,cot-injection

Attack Configuration File

Fine-tune your red team testing with a YAML configuration file:

akit redteam scenarios/my-bot.yaml --attack-config attacks.yaml

Example attacks.yaml:

version: "1.0"

defaults:
  severity: medium

mutations:
  bad-likert-judge:
    enabled: true
    scaleType: effectiveness
  crescendo:
    enabled: true
    steps: 5
  encoding:
    enabled: true
    types:
      - base64
      - rot13

owasp:
  LLM01:
    enabled: true
    minSeverity: medium
  LLM05:
    enabled: false  # Disable this category

Mutations must be explicitly listed in the config to be included. OWASP category settings can disable entire categories or set minimum severity thresholds.

Available Mutations

Mutation	Description	OWASP
`encoding`	Base64, ROT13, hex, unicode obfuscation	—
`multi_turn`	Multi-message escalation sequences	—
`bad-likert-judge`	Exploit evaluation capability	LLM01
`crescendo`	Multi-turn gradual escalation	LLM01
`deceptive-delight`	Positive framing bypass	LLM01
`output-injection`	XSS, SQLi, command injection in output	LLM05
`excessive-agency`	Unauthorized action claim testing	LLM06
`system-extraction`	System prompt leakage techniques	LLM07
`hallucination-trap`	Confident fabrication triggers	LLM09

OWASP Compliance Testing

# Test specific OWASP categories
akit redteam --prompt "..." --owasp LLM01,LLM05

# Full OWASP compliance scan
akit redteam --prompt "..." --owasp-full

Packages

ArtemisKit is a monorepo with the following packages:

Package	Description
`@artemiskit/cli`	Command-line interface
`@artemiskit/core`	Core runner, types, and storage
`@artemiskit/sdk`	Programmatic SDK for TypeScript/JavaScript
`@artemiskit/reports`	HTML and JSON report generation
`@artemiskit/redteam`	Red team mutation strategies with OWASP LLM Top 10
`@artemiskit/adapter-openai`	OpenAI/Azure provider adapter
`@artemiskit/adapter-vercel-ai`	Vercel AI SDK adapter
`@artemiskit/adapter-anthropic`	Anthropic provider adapter
`@artemiskit/adapter-langchain`	LangChain.js agent testing adapter
`@artemiskit/adapter-deepagents`	DeepAgents.js agentic testing adapter

Development

# Clone the repository
git clone https://github.com/code-sensei/artemiskit.git
cd artemiskit

# Install dependencies
bun install

# Build all packages
bun run build

# Run tests
bun test

# Type check
bun run typecheck

# Lint
bun run lint

Roadmap

See ROADMAP.md for the full development roadmap.

Contributing

Contributions are welcome! Please read CONTRIBUTING.md before submitting a pull request.

License

Apache-2.0 - See LICENSE for details.

Name		Name	Last commit message	Last commit date
Latest commit History 218 Commits
.agents/skills		.agents/skills
.changeset		.changeset
.claude		.claude
.cursor/skills		.cursor/skills
.githooks		.githooks
.github		.github
.opencode/skills		.opencode/skills
docs-content @ 4c92382		docs-content @ 4c92382
docs		docs
examples		examples
migrations		migrations
packages		packages
scenarios		scenarios
scripts		scripts
skills		skills
supabase/migrations		supabase/migrations
.env.example		.env.example
.gitignore		.gitignore
.gitmodules		.gitmodules
CHANGELOG.md		CHANGELOG.md
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
README.md		README.md
ROADMAP.md		ROADMAP.md
artemis.config.yaml		artemis.config.yaml
biome.json		biome.json
bun.lock		bun.lock
bunfig.toml		bunfig.toml
flux-prompting.skill		flux-prompting.skill
package.json		package.json
tsconfig.json		tsconfig.json

Folders and files

Latest commit

History

Repository files navigation

ArtemisKit

Features

Installation

Quick Start (Basic Example)

1. Set up your API key

2. Create a simple scenario

3. Run it

Configuration

Config File (Full Reference)

Minimal Config File

Scenario Format

Basic Scenario (Simple Prompts)

Full Scenario Reference

Variables

Expectation Types

CLI Commands

Run Command Options

Validate Command Options

CI/CD Integration

Providers

Quick Setup

Security Testing (Red Team)

Attack Configuration File

Available Mutations

OWASP Compliance Testing

Packages

Development

Roadmap

Contributing

License

About

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases 12

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages