A/B testing: promote variant selection to the automation server level #147

## Context

PR #146 added A/B testing support for plugin automations. The current implementation embeds all variant configs in a single tarball and performs variant selection at runtime inside `sdk_main.py`. This works well for plugin presets but has a structural limitation: **it only supports A/B testing within plugin preset automations**, not arbitrary custom tarballs or other automation types.

In [his review of #146](https://github.com/OpenHands/automation/pull/146), @malhotra5 noted this gap and proposed an alternative architecture for a future iteration.

## Current approach (PR #146)

- Variants are defined as part of the `POST /v1/preset/plugin` request body
- A single tarball is generated containing an `experiment_config.json` with all variant plugin configs
- At runtime, `sdk_main.py` reads the config, performs weighted-random selection, and loads the chosen variant's plugins
- Experiment metadata (`experiment_id`, `variant`) is passed as conversation tags
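
For illustration, the runtime flow above could look roughly like the sketch below. The JSON shape, the field names (`variants`, `weight`, `plugins`), and the helpers `select_variant` and `conversation_tags` are all assumptions made for this example; the actual `experiment_config.json` schema and `sdk_main.py` code from #146 may differ.

```python
import json
import random
from typing import Optional

# Hypothetical config shape; the actual experiment_config.json schema from
# PR #146 is not shown in this issue and may differ.
EXAMPLE_CONFIG = """
{
  "experiment_id": "exp-demo",
  "variants": [
    {"name": "control", "weight": 70, "plugins": ["plugin-a"]},
    {"name": "treatment", "weight": 30, "plugins": ["plugin-a", "plugin-b"]}
  ]
}
"""


def select_variant(config: dict, rng: Optional[random.Random] = None) -> dict:
    """Pick one variant with probability proportional to its weight."""
    rng = rng or random.Random()
    variants = config["variants"]
    weights = [v["weight"] for v in variants]
    if not variants or min(weights) < 0 or sum(weights) <= 0:
        raise ValueError("need at least one variant and a positive total weight")
    return rng.choices(variants, weights=weights, k=1)[0]


def conversation_tags(config: dict, variant: dict) -> dict:
    """Build the experiment metadata that would be attached as conversation tags."""
    return {"experiment_id": config["experiment_id"], "variant": variant["name"]}


if __name__ == "__main__":
    config = json.loads(EXAMPLE_CONFIG)
    chosen = select_variant(config)
    print(chosen["plugins"], conversation_tags(config, chosen))
```

Because the choice is made inside the script, the server only learns which variant ran through the tags, which is part of what the proposal below would change.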

## Proposed evolution

Move variant support to the **automation definition level**:

1. **Multiple tarballs per automation** — each variant maps to a separate tarball rather than packing all variants into one
2. **Server-side variant selection** — the automation server picks the variant at dispatch time and runs the corresponding tarball, instead of the script choosing at runtime
3. **Run-level experiment tracking** — experiment metadata (which variant was selected, weights, etc.) stored on the automation run record by the server
4. **Universal A/B support** — since selection happens before tarball execution, this works for _any_ automation type: plugin presets, prompt presets, and custom scripts
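
A minimal sketch of what server-side selection with run-level tracking might look like, under stated assumptions: `Variant`, `AutomationRun`, `dispatch`, and every field name here are hypothetical and do not reflect the automation server's real models or dispatch code.

```python
import random
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Variant:
    # Hypothetical: one tarball reference per variant (point 1 above).
    name: str
    tarball_ref: str
    weight: float


@dataclass
class AutomationRun:
    # Hypothetical run record; experiment metadata lives on the run (point 3).
    automation_id: str
    tarball_ref: str
    experiment: dict = field(default_factory=dict)


def dispatch(
    automation_id: str,
    variants: List[Variant],
    rng: Optional[random.Random] = None,
) -> AutomationRun:
    """Select a variant before any tarball executes and record the choice."""
    if not variants:
        raise ValueError("automation has no variants")
    rng = rng or random.Random()
    weights = [v.weight for v in variants]
    chosen = rng.choices(variants, weights=weights, k=1)[0]
    return AutomationRun(
        automation_id=automation_id,
        tarball_ref=chosen.tarball_ref,
        experiment={
            "variant": chosen.name,
            # Snapshot of the weights in effect at dispatch time.
            "weights": {v.name: v.weight for v in variants},
        },
    )
```

Nothing in `dispatch` inspects the tarball contents, which is why the same path would work for plugin presets, prompt presets, and custom scripts alike.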

## Trade-offs

| | Current (PR #146) | Proposed |
|---|---|---|
| Scope | Plugin presets only | Any automation type |
| Variant selection | Runtime, inside script | Server-side, at dispatch |
| Storage | Single tarball | One tarball per variant |
| Migration | None | DB migration needed (automation + run models) |
| Custom script A/B | Not supported | Supported natively |

## What this would require

- Schema changes to the automation model to support multiple tarball references and variant metadata
- DB migration
- Dispatch logic updated to select a variant and record the choice on the run
- Plugin/prompt preset endpoints updated to generate per-variant tarballs
- Existing single-tarball approach from #146 either migrated or kept as a stepping stone
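
One way the schema change and the migration question could fit together is to treat an existing single-tarball automation as an experiment with exactly one variant. The sketch below is only an illustration of that idea: `normalize_variants`, the `"default"` variant name, and the `tarball_ref` / `variants` keys are assumptions, not the current automation model.

```python
from typing import List


def normalize_variants(automation: dict) -> List[dict]:
    """Return a validated variant list for an automation record.

    Hypothetical rule: a record without a "variants" key is treated as a
    single variant named "default" that always wins.
    """
    variants = automation.get("variants")
    if not variants:
        return [
            {"name": "default", "tarball_ref": automation["tarball_ref"], "weight": 1.0}
        ]

    names = [v["name"] for v in variants]
    if len(set(names)) != len(names):
        raise ValueError("variant names must be unique")
    if any(not v.get("tarball_ref") for v in variants):
        raise ValueError("every variant needs a tarball reference")
    weights = [v["weight"] for v in variants]
    if min(weights) < 0 or sum(weights) <= 0:
        raise ValueError("weights must be non-negative with a positive total")
    return list(variants)
```

With a rule like this, dispatch logic would only ever need to handle the multi-variant shape, and older automations would keep working without being rewritten.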

## Open questions

- Should the current in-script approach from #146 be deprecated once server-side selection ships, or kept as a lightweight option for plugin-only experiments?
- What metadata should be stored on the run record (variant name, weight snapshot, selection reason)?

_This issue was created by an AI agent (OpenHands) on behalf of csmith49._
