A/B testing: promote variant selection to the automation server level #147

@csmith49

Context

PR #146 added A/B testing support for plugin automations. The current implementation embeds all variant configs in a single tarball and performs variant selection at runtime inside sdk_main.py. This works well for plugin presets but has a structural limitation: it only supports A/B testing within plugin preset automations, not arbitrary custom tarballs or other automation types.

In his review of #146, @malhotra5 noted this gap and proposed an alternative architecture for a future iteration.

Current approach (PR #146)

  • Variants are defined as part of the POST /v1/preset/plugin request body
  • A single tarball is generated containing an experiment_config.json with all variant plugin configs
  • At runtime, sdk_main.py reads the config, does weighted-random selection, and loads the chosen variant's plugins
  • Experiment metadata (experiment_id, variant) is passed as conversation tags
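
The runtime selection step described above can be sketched roughly as follows. This is a minimal illustration, not the code from PR #146: the layout of experiment_config.json (the `variants`, `name`, `weight`, and `plugins` keys) and the `select_variant` helper are assumptions made for the example.

```python
import json
import random
from typing import Optional


def select_variant(config: dict, rng: Optional[random.Random] = None) -> dict:
    """Pick one variant by weighted-random choice (hypothetical helper)."""
    rng = rng or random.Random()
    variants = config["variants"]
    weights = [v["weight"] for v in variants]
    # random.choices returns a list of k items; we only want one.
    return rng.choices(variants, weights=weights, k=1)[0]


# Assumed shape of experiment_config.json; the real schema may differ.
raw = """
{
  "experiment_id": "exp-001",
  "variants": [
    {"name": "control", "weight": 50, "plugins": ["plugin-a"]},
    {"name": "treatment", "weight": 50, "plugins": ["plugin-a", "plugin-b"]}
  ]
}
"""
config = json.loads(raw)
chosen = select_variant(config)

# Experiment metadata that would be attached as conversation tags.
tags = {"experiment_id": config["experiment_id"], "variant": chosen["name"]}
```

Because the choice happens inside the script, only automations that ship this script can participate in an experiment.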

Proposed evolution

Move variant support to the automation definition level:

  1. Multiple tarballs per automation — each variant maps to a separate tarball rather than packing all variants into one
  2. Server-side variant selection — the automation server picks the variant at dispatch time and runs the corresponding tarball, instead of the script choosing at runtime
  3. Run-level experiment tracking — experiment metadata (which variant was selected, weights, etc.) stored on the automation run record by the server
  4. Universal A/B support — since selection happens before tarball execution, this works for any automation type: plugin presets, prompt presets, and custom scripts
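
A rough sketch of what dispatch-time selection might look like, assuming each variant carries its own tarball reference. The `Variant` and `AutomationRun` classes, their fields, and the `dispatch` function are hypothetical names for illustration and do not reflect the automation server's actual models.

```python
import random
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Variant:
    name: str
    weight: float
    tarball_ref: str  # hypothetical pointer to this variant's tarball


@dataclass
class AutomationRun:
    automation_id: str
    tarball_ref: str  # the tarball the server will actually execute
    experiment: dict = field(default_factory=dict)  # run-level tracking


def dispatch(
    automation_id: str,
    variants: List[Variant],
    rng: Optional[random.Random] = None,
) -> AutomationRun:
    """Select a variant on the server and record the choice on the run."""
    rng = rng or random.Random()
    chosen = rng.choices(variants, weights=[v.weight for v in variants], k=1)[0]
    return AutomationRun(
        automation_id=automation_id,
        tarball_ref=chosen.tarball_ref,
        experiment={
            "variant": chosen.name,
            # Snapshot of the weights at selection time, since they may change later.
            "weights": {v.name: v.weight for v in variants},
        },
    )


run = dispatch(
    "auto-1",
    [
        Variant("control", 1.0, "tarballs/control.tar.gz"),
        Variant("treatment", 0.0, "tarballs/treatment.tar.gz"),
    ],
)
```

Here the tarball itself needs no knowledge of the experiment, which is what would make the approach work for custom scripts as well as presets.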

Trade-offs

| | Current (PR #146) | Proposed |
|---|---|---|
| Scope | Plugin presets only | Any automation type |
| Variant selection | Runtime, inside script | Server-side, at dispatch |
| Storage | Single tarball | One tarball per variant |
| Migration | None | DB migration needed (automation + run models) |
| Custom script A/B | Not supported | Supported natively |

What this would require

  • Schema changes to the automation model to support multiple tarball references + variant metadata
  • DB migration
  • Dispatch logic updated to select a variant and record the choice on the run
  • Plugin/prompt preset endpoints updated to generate per-variant tarballs
  • Existing single-tarball approach from PR #146 either migrated or kept as a stepping stone
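
One way the migration question could be handled is to treat a legacy single-tarball automation as a one-variant experiment, so dispatch logic only ever sees a list of variants. The sketch below assumes automations are plain dicts with hypothetical `tarball_ref` and `variants` keys; the real automation model is not described in this issue.

```python
from typing import List


def normalize_variants(automation: dict) -> List[dict]:
    """Return a validated variant list, wrapping legacy automations (hypothetical helper)."""
    variants = automation.get("variants")
    if not variants:
        # Legacy automation: a single tarball becomes the only variant.
        return [
            {"name": "default", "weight": 1.0, "tarball_ref": automation["tarball_ref"]}
        ]

    names = [v["name"] for v in variants]
    if len(set(names)) != len(names):
        raise ValueError("variant names must be unique")
    if any(v["weight"] < 0 for v in variants):
        raise ValueError("variant weights must be non-negative")
    if sum(v["weight"] for v in variants) <= 0:
        raise ValueError("at least one variant needs a positive weight")
    return variants


legacy = {"id": "auto-1", "tarball_ref": "tarballs/auto-1.tar.gz"}
legacy_variants = normalize_variants(legacy)
```

With a shim like this, existing automations would keep running unchanged while new ones opt in to multiple variants.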

Open questions

  • Should the current in-script approach from PR #146 be deprecated once server-side selection ships, or kept as a lightweight option for plugin-only experiments?
  • What metadata should be stored on the run record (variant name, weight snapshot, selection reason)?

This issue was created by an AI agent (OpenHands) on behalf of csmith49.

Labels: enhancement (New feature or request)
