- Version: v1
- Effective from: Cohort 1
- Document status: Current implementation-aligned reference
- Last updated: March 7, 2026
This document specifies the benchmark methodology implemented by Forecaster Arena today. It focuses on the measurement protocol rather than deployment or operational details, but where implementation constraints materially affect the benchmark, they are described explicitly.
Forecaster Arena evaluates whether frontier LLMs can make useful, calibrated forecasts about real future events by participating in paper-traded prediction markets.
The benchmark is designed to measure:
- probabilistic reasoning under uncertainty
- confidence calibration
- portfolio construction and risk management
- consistency under identical decision conditions
The benchmark explicitly does not attempt to measure:
- tool-augmented web research ability during a decision run
- conversational helpfulness
- coding performance
- user preference alignment
Prediction markets are used because they provide:
- future-facing questions that cannot exist in pretraining corpora as answered facts
- externally resolved outcomes
- continuously refreshed evaluation items
- market prices that can serve as a practical confidence baseline
This does not make the benchmark perfect. Prediction markets still contain noise, liquidity issues, and crowd biases. But they are far more resistant to benchmark contamination than static Q&A sets.
A cohort is an independent weekly competition instance containing one agent for each active benchmark family slot.
- A new cohort starts on Sunday at 00:00 UTC
- The weekly decision run is scheduled for Sunday at 00:05 UTC
- Cohorts remain active until all open positions resolve or are otherwise settled
Weekly cohorts were chosen because they:
- provide repeated samples for comparing models across time
- avoid the interpretability problems of one endless rolling contest
- are frequent enough to collect data, but not so frequent that every minor price move becomes a decision event
The system enforces one cohort per normalized UTC week start. This is not only an operational convenience; it is part of benchmark integrity. Duplicate weekly cohorts would distort aggregate results and break fair comparability.
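As an illustration only, the following sketch shows one way to normalize a timestamp to the Sunday 00:00 UTC week start used for cohort uniqueness; the function name and implementation are assumptions, not the system's actual code.

```python
from datetime import datetime, timedelta, timezone

def normalized_week_start(now: datetime) -> datetime:
    """Return the most recent Sunday 00:00 UTC for a timezone-aware timestamp."""
    now_utc = now.astimezone(timezone.utc)
    # datetime.weekday(): Monday=0 ... Sunday=6, so a Sunday maps to offset 0.
    days_since_sunday = (now_utc.weekday() + 1) % 7
    sunday = now_utc.date() - timedelta(days=days_since_sunday)
    return datetime(sunday.year, sunday.month, sunday.day, tzinfo=timezone.utc)
```

Enforcing "one cohort per week" then reduces to a uniqueness check on this normalized value.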
The current active family lineup is:
| Internal ID | Display Name | Provider | OpenRouter ID |
|---|---|---|---|
| gpt-5.1 | GPT-5.2 | OpenAI | openai/gpt-5.2 |
| gemini-2.5-flash | Gemini 3 Pro | Google | google/gemini-3-pro-preview |
| grok-4 | Grok 4.1 | xAI | x-ai/grok-4.1-fast |
| claude-opus-4.5 | Claude Opus 4.5 | Anthropic | anthropic/claude-opus-4.5 |
| deepseek-v3.1 | DeepSeek V3.2 | DeepSeek | deepseek/deepseek-v3.2 |
| kimi-k2 | Kimi K2 | Moonshot AI | moonshotai/kimi-k2-thinking |
| qwen-3-next | Qwen 3 | Alibaba | qwen/qwen3-235b-a22b-2507 |
Important note:
- Historical internal IDs are intentionally stable and may lag display-name updates. For example, `gpt-5.1` currently maps to display name `GPT-5.2`.
- Benchmark analysis should use stable IDs for joins and display names for presentation.
All agents receive:
- the same decision cadence
- the same market universe for a given run
- the same starting balance
- the same bet constraints
- the same system-prompt structure
- deterministic temperature (`0`)
The benchmark does not attempt to normalize provider-native context windows, hidden reasoning behavior, or proprietary inference stacks beyond what OpenRouter exposes.
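For illustration, a minimal sketch of a deterministic decision call through OpenRouter. Only the endpoint, the model IDs from the table above, and the temperature setting come from this document; the prompt variables, headers beyond authorization, and error handling are simplified assumptions.

```python
import requests  # assumes the `requests` package is available

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

def run_decision(api_key: str, openrouter_id: str, system_prompt: str, user_prompt: str) -> str:
    """Call one agent with temperature 0 so identical inputs yield stable outputs."""
    response = requests.post(
        OPENROUTER_URL,
        headers={"Authorization": f"Bearer {api_key}"},
        json={
            "model": openrouter_id,  # e.g. "anthropic/claude-opus-4.5"
            "messages": [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt},
            ],
            "temperature": 0,
        },
        timeout=120,
    )
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]
```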
Each agent begins each cohort with $10,000 in virtual cash.
Constraints:
| Rule | Value |
|---|---|
| Starting balance | $10,000 |
| Minimum bet | $50 |
| Maximum single bet | 25% of current cash balance |
| Position model | one open position per market per side |
| Recapitalization | none |
Why these constraints exist:
- The minimum bet prevents trivial “dust” positions.
- The max bet forces portfolio construction rather than all-in gambling.
- One position per market/side keeps accounting and attribution auditable.
- No recapitalization makes bankruptcy and cash management meaningful.
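A minimal sketch of how these sizing constraints might be checked before execution; the function and constant names are illustrative, not the engine's actual code.

```python
MIN_BET = 50.0
MAX_BET_FRACTION = 0.25

def validate_bet(amount: float, cash_balance: float) -> None:
    """Reject bets that violate the benchmark's sizing constraints."""
    max_bet = cash_balance * MAX_BET_FRACTION
    if amount < MIN_BET:
        raise ValueError(f"Bet {amount:.2f} is below the {MIN_BET:.2f} minimum")
    if amount > max_bet:
        raise ValueError(f"Bet {amount:.2f} exceeds 25% of cash ({max_bet:.2f})")
    if amount > cash_balance:
        raise ValueError("Bet exceeds available cash")
```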
Markets are sourced from the Polymarket Gamma API.
The benchmark decision engine pulls the top 500 markets by volume for the
decision prompt context.
The system supports:
- binary markets (`YES`/`NO`)
- multi-outcome markets (named outcomes with per-outcome prices)
Showing all markets would:
- exceed practical prompt budgets
- include illiquid noise
- make decision context uneven in quality
The top-volume filter is therefore part of the benchmark design, not just a UI optimization.
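A sketch of the selection step, assuming markets have already been fetched from the Gamma API as a list of dictionaries with a numeric `volume` field; the field name and helper are assumptions.

```python
TOP_MARKETS = 500

def select_prompt_markets(markets: list[dict]) -> list[dict]:
    """Keep only the highest-volume markets for the decision prompt context."""
    ranked = sorted(markets, key=lambda m: float(m.get("volume", 0.0)), reverse=True)
    return ranked[:TOP_MARKETS]
```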
During a weekly decision run, each agent receives:
- current cash balance
- open positions
- average entry price
- side-correct current price
- current marked value
- unrealized P/L
For binary markets, the stored market price is the YES price. When a model
holds a NO position, the prompt converts that to the side-correct price so
the model sees the relevant value for its existing exposure.
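A small sketch of that side-correct conversion for binary markets; the function name is illustrative.

```python
def side_correct_price(yes_price: float, side: str) -> float:
    """Return the price relevant to the held side; stored prices are YES prices."""
    return yes_price if side == "YES" else 1.0 - yes_price
```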
For each selected market, the prompt includes fields such as:
- market ID
- question
- category
- current price or outcome prices
- volume
- close date
The weekly prompt does not directly provide:
- real-time news browsing
- other models’ positions or decisions
- explicit historical price series
- human-authored external commentary
This keeps the benchmark focused on model reasoning over a shared static snapshot.
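As an illustration only, one way a single market entry could be serialized into that shared snapshot using the fields listed above; the exact formatting and key names used by the engine are not specified here and are assumed.

```python
def format_market_line(market: dict) -> str:
    """Render one market as a compact prompt line from the documented fields."""
    return (
        f"[{market['id']}] {market['question']} "
        f"(category: {market['category']}, price: {market['current_price']}, "
        f"volume: {market['volume']}, closes: {market['close_date']})"
    )
```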
Each model must output exactly one of three actions: `BET`, `SELL`, or `HOLD`.
A `BET` decision:

```json
{
  "action": "BET",
  "bets": [
    { "market_id": "uuid", "side": "YES", "amount": 500.0 }
  ],
  "reasoning": "Why this trade should be made."
}
```

A `SELL` decision:

```json
{
  "action": "SELL",
  "sells": [
    { "position_id": "uuid", "percentage": 100 }
  ],
  "reasoning": "Why this position should be reduced or exited."
}
```

A `HOLD` decision:

```json
{
  "action": "HOLD",
  "reasoning": "Why the current portfolio should remain unchanged."
}
```

The engine:
- parses the response
- validates structure and value ranges
- retries malformed outputs up to the configured malformed-response retry limit
- defaults to `HOLD` if the output still cannot be made valid
This fallback is part of benchmark execution because a benchmark run must complete even when a model’s output format fails.
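A minimal sketch of that parse-validate-retry-fallback loop, assuming the retry limit and validation rules shown; names and the depth of validation are illustrative, not the engine's actual implementation.

```python
import json

MAX_FORMAT_RETRIES = 2  # placeholder; the real limit is a configured value

def parse_decision(raw: str) -> dict:
    """Parse and minimally validate a decision payload; raise on anything malformed."""
    decision = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on bad JSON
    action = decision.get("action")
    if action not in {"BET", "SELL", "HOLD"}:
        raise ValueError(f"Unknown action: {action!r}")
    if action == "BET" and not decision.get("bets"):
        raise ValueError("BET decision contains no bets")
    if action == "SELL" and not decision.get("sells"):
        raise ValueError("SELL decision contains no sells")
    return decision

def decide_with_fallback(call_model) -> dict:
    """Retry malformed outputs, then fall back to HOLD so the run always completes."""
    for _ in range(MAX_FORMAT_RETRIES + 1):
        try:
            return parse_decision(call_model())
        except ValueError:
            continue
    return {"action": "HOLD", "reasoning": "Fallback: output could not be made valid."}
```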
Within a cohort, agents are processed sequentially.
This means:
- all models still see the same market universe
- but each model is called one after another, not concurrently
The design is intended to reduce provider contention and keep operations predictable.
The engine claims a unique decision row for `(agent_id, cohort_id, decision_week)` before making the network call.
Why this matters:
- overlapping cron executions must not generate duplicate decisions
- duplicate trades would invalidate the fairness of the benchmark
- retryable failures should update the same decision row, not create a new one
This is an implementation safeguard that directly supports benchmark integrity.
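One way to implement this claim, sketched against a PostgreSQL-style unique constraint; the table, columns, and status value are assumptions, not the actual schema.

```python
CLAIM_DECISION_SQL = """
INSERT INTO decisions (agent_id, cohort_id, decision_week, status)
VALUES (%s, %s, %s, 'pending')
ON CONFLICT (agent_id, cohort_id, decision_week) DO NOTHING
RETURNING id;
"""

def claim_decision_row(cursor, agent_id, cohort_id, decision_week):
    """Claim the unique decision slot; return None if another run already holds it."""
    cursor.execute(CLAIM_DECISION_SQL, (agent_id, cohort_id, decision_week))
    row = cursor.fetchone()
    return row[0] if row else None
```

Because the insert either claims the row or returns nothing, an overlapping cron execution that loses the race simply skips the agent instead of producing a duplicate decision.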
If a model emits `BET` or `SELL` but all executions fail, the result is stored as a retryable failure rather than as a final successful decision.
That allows the system to distinguish:
- valid “no action” outcomes, from
- apparent trading decisions that produced zero state change
For a buy:
- cash balance decreases by `total_amount`
- open position cost basis increases
- shares are computed from `amount / executable_price`
For a sell:
- a fraction of the position is reduced or closed
- realized proceeds are returned to cash
- cost basis is reduced proportionally
For binary markets:
- `YES` trades use the stored `current_price`
- `NO` trades use `1 - current_price`
For multi-outcome markets:
- the engine reads the named outcome price from `current_prices`
- invalid or missing outcome prices cause execution failure
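A sketch of how the executable price and share count could be resolved under these rules; the `type` field and helper names are assumptions for illustration.

```python
def executable_price(market: dict, side: str, outcome: str | None = None) -> float:
    """Resolve the execution price for a trade, per the rules above."""
    if market["type"] == "binary":
        yes_price = market["current_price"]  # stored price is always the YES price
        return yes_price if side == "YES" else 1.0 - yes_price
    # Multi-outcome: look up the named outcome's price.
    price = market["current_prices"].get(outcome)
    if price is None or not (0.0 < price < 1.0):
        raise RuntimeError(f"Unusable price for outcome {outcome!r}")
    return price

def shares_for_bet(amount: float, price: float) -> float:
    """Shares bought with a given stake at the executable price."""
    return amount / price
```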
Forecaster Arena uses two primary evaluation axes:
- calibration quality
- portfolio value generation
For a resolved trade:
Brier = (forecast_probability - actual_outcome)^2
Where:
- `forecast_probability` is derived from bet size
- `actual_outcome` is `1` if the traded side won, otherwise `0`
Interpretation:
- lower is better
- `0` is perfect
- `0.25` corresponds to an uninformative 50/50 forecast on binary outcomes
The system derives confidence from bet size:
max_possible_bet = cash_balance * 0.25
implied_confidence = bet_amount / max_possible_bet
This intentionally couples stated conviction to actual capital allocation.
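A short sketch combining the two formulas above; the function names are illustrative, and the worked example in the comment is simple arithmetic rather than benchmark data.

```python
def implied_confidence(bet_amount: float, cash_balance: float) -> float:
    """Confidence implied by bet size relative to the 25% maximum bet."""
    max_possible_bet = cash_balance * 0.25
    return bet_amount / max_possible_bet

def brier_score(forecast_probability: float, actual_outcome: int) -> float:
    """Squared error between the implied forecast and the resolved 0/1 outcome."""
    return (forecast_probability - actual_outcome) ** 2

# Example: a $500 bet from a $10,000 balance implies 500 / 2500 = 0.2 confidence;
# if the traded side wins, the Brier contribution is (0.2 - 1)^2 = 0.64.
```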
Portfolio return is measured from the $10,000 cohort baseline:
Return % = (total_value - 10000) / 10000 * 100
Portfolio return alone can reward luck or over-concentration. Brier score alone can ignore practical value generation.
Using both means the benchmark values:
- being right
- knowing how confident to be
- and sizing positions coherently
Portfolio snapshots are taken on a 10-minute schedule.
Each snapshot records:
- cash balance
- positions value
- total value
- total P/L
- total P/L percent
- cumulative Brier score
- number of resolved bets
Snapshots use `snapshot_timestamp`, not a daily bucket.
This supports:
- intraday chart ranges such as `10M` and `1H`
- detailed debugging of position valuation
- cohort-level and aggregate model curves
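A sketch of the per-snapshot record as a data structure, built from the fields listed above; the class and attribute names are illustrative rather than the actual schema.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class PortfolioSnapshot:
    """One 10-minute portfolio snapshot; field names are assumptions."""
    snapshot_timestamp: datetime
    cash_balance: float
    positions_value: float
    total_value: float
    total_pnl: float
    total_pnl_percent: float
    cumulative_brier: float
    resolved_bets: int
```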
When a market is closed but unresolved and fresh pricing is unusable, the system can fall back to prior valuation behavior rather than collapsing the position to zero. This preserves more stable mark-to-market curves during settlement gaps.
Resolution status is checked against Polymarket.
The local system only marks a market resolved after settlement succeeds.
This is important because marking a market resolved too early can strand open positions that would no longer be revisited by later resolution jobs.
If upstream data says a market is resolved but the winning outcome cannot be reliably determined, the system refunds positions as `CANCELLED`.
This is a conservative operational choice intended to protect benchmark accounting from ambiguous resolution data.
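A minimal sketch of that refund path, assuming each open position tracks its cost basis; the field names and in-memory handling are illustrative only.

```python
def settle_ambiguous_market(positions: list[dict], cash: float) -> float:
    """Refund each open position at cost and mark it CANCELLED (illustrative only)."""
    for position in positions:
        cash += position["cost_basis"]
        position["status"] = "CANCELLED"
    return cash
```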
The benchmark stores:
- full prompts
- raw model responses
- parsed decision payloads
- trade records
- portfolio snapshots
- Brier scores
- system logs
This supports:
- replay and inspection
- external research review
- debugging of model and execution failures
- export of bounded data slices for analysis
Important limitations remain:
Even though the target event is in the future, models still reason from a frozen knowledge state rather than from live browsing during the run.
Markets themselves may already reflect public consensus or liquidity distortions. The benchmark therefore evaluates forecasting inside a market environment, not in isolation from it.
Hosted models can change behavior between benchmark weeks even when internal IDs stay constant.
The benchmark is operationally deterministic, but not infinitely parallel. Decision runtime and provider availability can still shape what “production reproducibility” means in practice.
Changes that would require a methodology version bump include:
- altering scoring formulas
- changing starting balance or bet constraints
- changing market-selection rules
- adding or removing decision actions
- introducing external retrieval or tool use into the prompt protocol
Changes that can happen without a methodology bump include:
- implementation hardening for safety or idempotency
- UI improvements
- admin/export/security improvements
- model display-name refreshes when the benchmark protocol itself is unchanged