Skip to content

Deploy: Claude leaderboard entries + PR #71 fixes + your look-ahead fix, reconciled#2

Open
FlyM1ss wants to merge 19 commits into
Allan-Feng:mainfrom
Open-Finance-Lab:main
Open

Deploy: Claude leaderboard entries + PR #71 fixes + your look-ahead fix, reconciled#2
FlyM1ss wants to merge 19 commits into
Allan-Feng:mainfrom
Open-Finance-Lab:main

Conversation

@FlyM1ss

@FlyM1ss FlyM1ss commented Jul 5, 2026

Copy link
Copy Markdown

What this is

Your fork's main is Render prod's deploy branch, and it diverged from Open-Finance-Lab main today (your look-ahead fix + agent editor vs. upstream PR Open-Finance-Lab#71/Open-Finance-Lab#72/Open-Finance-Lab#73). Upstream main now contains everything reconciled — this PR fast-forwards your fork to it.

Merging this deploys to prod:

After merging, your fork matches upstream exactly, so future "Sync fork" clicks stay conflict-free.

🤖 Generated with Claude Code

https://claude.ai/code/session_01SVGFyxaNN4VPyAxn2hgsiu

YangletLiu and others added 19 commits July 4, 2026 20:32
Edit Landing page and dashboard text style
Edit website snapshot in readme
…ipped bundle

pages/not-found.tsx was fully dead (App.tsx renders its own inline 404) and
ui/card.tsx survived the LOW sweep only as its import — both deleted; tsc +
vite build verified green.

The committed frontend bundle was still the PRE-prune build: its CSS
(107.76 kB) predates the 65-file component deletion (3618b6a). Swapped in the
fresh assets (CSS 31.24 kB, JS byte-identical) and updated index.html's
references, dropping the now-redundant ?v= cache-busters (filenames are
content-hashed and vercel.json serves /assets/* immutable, / no-cache).
Headless smoke: root renders, auth modal present, no console errors.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01SVGFyxaNN4VPyAxn2hgsiu
…TH backstop in CI

B0/B1 follow-ups: _owner_context's X-Session-Id fallback makes session_id an
ownership credential for import-created agents — document that 'session_id is
never a credential' only holds for built-in agents. CI gets an explicit
DATABASE_PATH env so a conftest isolation regression can never write into the
committed seed DB.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01SVGFyxaNN4VPyAxn2hgsiu
H4 follow-up. RunStore (protocol steps) and BacktestDatabase (equity/trades)
share one file; under the rollback journal a finalize's heavy writes block
every reader. WAL lets readers proceed during commits — the actual concurrency
shape here. Best-effort with fallback for filesystems without shm support;
journal_mode persists in the file so one switch at construction covers all
later connections from either layer (RunStore's own _enable_wal lands with the
lifecycle-unification commit that reworks that module). Verified on WSL drvfs.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01SVGFyxaNN4VPyAxn2hgsiu
… sweep, heartbeat recovery

Closes the three H4 reconciliation follow-ups in one mechanism: v2 runs now
write protocol_runs rows through their whole lifecycle (create=loading,
load-failure/finalize/cancel update the row), so

* the per-agent active-run cap counts run_store.count_active_runs() on BOTH
  surfaces under the SHARED v1 create lock — an agent can no longer hold
  2× MAX_ACTIVE_RUNS_PER_AGENT by splitting across v1+v2. The v2 create path
  first reconciles its own finished backends (passive is_active() peek only;
  live runs are never touched under the lock);
* the reaper reaches v2: reap_runs() drives sweeps registered via
  register_reaper_sweep() (composition root registers reap_v2_runs — domain
  never imports api/*). The sweep drains abandoned v2 runs through elapsed
  deadlines and swaps terminal backends for a DB-backed
  ArchivedBacktestBackend tombstone: market-data buffers freed, while
  status/result/decisions/idempotent-replay stay serviceable — and v2's
  _require_run rehydrates terminal rows post-restart (owners see 'failed'
  instead of a 404; anti-oracle 404 shape preserved for foreign probes);
* startup orphan recovery now covers v2 rows for free, and multi-worker
  deployments get heartbeat-based recovery: rows carry
  owner_instance/heartbeat_at (schema migrated in place), every reaper pass
  heartbeats runs whose session is live in this process, and
  fail_stale_runs() only fails runs nobody heartbeats
  (RUN_HEARTBEAT_STALE_SECONDS, default 300s; blanket
  RUN_RECOVERY_ON_STARTUP stays for single-process).

16 new tests (tests/test_run_lifecycle_unification.py) pin the cross-surface
cap both directions, sweep archival + deadline draining, restart rehydration,
oracle shape, heartbeat stamping/refresh, stale recovery, and the legacy
schema migration. FakeBackend gained an honest is_active();
test_active_run_count_is_side_effect_free re-pinned to the ledger-based
counting (passivity invariant unchanged).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01SVGFyxaNN4VPyAxn2hgsiu
…t() (B0 deep fix)

The class fix behind B0 and its recurring guard sites: AlpacaDataLoader
(missing creds / missing alpaca-py), BaselineGenerator (same) and
HourlyBacktester.load_data (no data) used to sys.exit(1). SystemExit is a
BaseException — it evaded every 'except Exception' server boundary, silently
killed daemon loader threads, and wedged the ASGI loop. They now raise
MarketDataUnavailableError (a plain RuntimeError subclass); the CLI
entrypoints (backtest_hourly_agent, backtest_custom_algo) translate it to
exit code 1 at the process boundary.

The existing (Exception, SystemExit) guards stay as defense-in-depth; their
comments now describe them as such. app.py's paper-baseline init, the legacy
/backtest/run background thread, and the algo-service subprocess path are
covered by their existing 'except Exception' handlers with no local changes.

Tests: tests/test_market_data_errors.py (red-green) pins raise-not-exit for
all three libraries, the plain-Exception contract, and the CLI boundary.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01SVGFyxaNN4VPyAxn2hgsiu
…_TOKENS

LOW-sweep residual: the per-request output-token ceiling is an env knob that
changes a run's spend/behavior, but nothing recorded which value was in
force. agent_runs gains a JSON metadata column (added in BOTH _init_schema
and _migrate_schema, per the schema convention); insert_run accepts an
optional metadata dict; every agent_runs reader returns it parsed (one
_parse_run_row helper) so listings and get_run agree. The engine's LLM
agent run records {llm_max_output_tokens: <effective value>}; rule-based
runs record nothing. 6 red-green tests incl. legacy-schema migration.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01SVGFyxaNN4VPyAxn2hgsiu
The new CI ran RED on 5 failures that predate this PR (they fail on main
identically). All five pinned contracts the product deliberately no longer
has:

* GET /runs is a PUBLIC listing (route docstring; the dashboard home page
  fetches it session-less for the seed runs) — the strict per-session
  listing tests now pin the real contract: public metadata listing,
  session-scoped per-run data access, middleware 400 on malformed session
  ids for scoped routes (fixture also now patches the backtests router's
  import-time db binding, which the old tests silently missed).
* /runs/latest/metrics filters agent_name == 'Agent' (the internal hourly
  agent); the old fixture's 'Agent A'/'Agent B' rows could never match.
* create_safe_prompt's strategy body was rewritten; the old CANNOT-phrased
  constraint sentences only exist in a retired commented-out prompt. The
  active prompt regains a compact no-tools constraint line (parity with
  CUSTOM_STRATEGY_OUTPUT_CONTRACT — tightens, does not loosen) and the test
  pins the current wording; the hard boundary remains response-side
  validate_llm_response enforcement (unchanged, still pinned).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01SVGFyxaNN4VPyAxn2hgsiu
…ise contract

test_missing_credentials_exits pinned the retired sys.exit(1) behavior; it
now pins MarketDataUnavailableError (companion to the B0 deep-fix commit).
Full backend suite: 829 passed, 0 failed.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01SVGFyxaNN4VPyAxn2hgsiu
…fication

A 50-agent review (5 lenses, 3-vote refute panels) of the follow-ups diff
confirmed 12 findings; the real defects, all fixed red-green:

* CRITICAL cancel-clobber: POST /v2/runs/{id}/cancel hardcoded status
  "closed" into the run's ledger row. A cleanup cancel landing after the
  final decision (before the sweep archives the backend) downgraded a
  genuinely completed run to "closed" — in the DB row AND the in-memory
  session — permanently hiding its metrics. cancel_run now no-ops on
  terminal runs and reports the run's true state; BacktestBackend.cancel()
  guards the terminal check under the session lock (closes the peek→cancel
  race with a finalizing submit).
* _archive_run no longer swaps in the tombstone when the terminal row write
  fails — the backend stays live so the next sweep retries, instead of the
  row freezing in an active status with nothing left to reconcile it.
* v2 rows now leave "loading" (→ "running") when the background market-data
  load succeeds.
* RunStore's heartbeat-column migration tolerates losing the probe→ALTER
  race to a concurrently-starting worker (duplicate-column errors are the
  goal state, not a crash).
* ArchivedBacktestBackend.from_record recovers step_index/total_steps for
  completed runs from the decision log (restart-rehydrated runs reported a
  misleading 0/0).
* test_active_run_count_is_side_effect_free regained its count assertion
  (ledger-based) alongside the passivity invariants.

Accepted + documented in code: cross-surface create serialization on the
shared lock, and deliberate v2<->v1 cross-surface run visibility.

Suites after fixes: backend 837 passed / 0 failed; packaging 41 passed.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01SVGFyxaNN4VPyAxn2hgsiu
Upstream merged PR #67 plus two landing refreshes (#69/#70) while this
follow-up pass was in flight. Conflicts were confined to the shipped
frontend bundle: upstream's refresh was built before this branch's dead
component prune, and this branch's refresh predated upstream's brand
changes (Hero/Navbar/index.css, atltransparent logo, Inter 800). Resolved
by REBUILDING the bundle from the merged landing source — brand changes +
prune together (tsc green, vite build green, index-CoQ1VbAs.css 31.71 kB,
headless smoke: renders, auth modal, 0 console errors) — and pointing
index.html at the rebuilt hashed assets (content hashes + the vercel.json
immutable/no-cache split make the ?v= params redundant).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01SVGFyxaNN4VPyAxn2hgsiu
… reads, bounded registry

Seven findings from the PR #71 review, each red-green pinned in
tests/test_pr71_review_fixes.py:

1. cancel-during-loading race — load_market_data published waiting_decision
   WITHOUT _step_lock, defeating cancel()'s lock guard; the background _load
   also wrote failed/running unconditionally. A cancel landing during the load
   was silently reverted (run kept running, cap slot freed, client misled).
   Both load paths now respect an already-terminal (cancelled) status under the
   lock.

2. archived status() dropped baseline_run_ids/compare_url — the live
   get_status() returns them for a completed run but the tombstone omitted
   them, so the same endpoint changed shape once the reaper archived the run.
   ArchivedBacktestBackend.status() now rebuilds both from the persisted
   baseline columns (shared build_compare_url helper).

3. rehydrated failed/closed runs reported 0/0 steps — from_record only
   recovered counts for completed runs (from the decision log). protocol_runs
   now persists step_index/total_steps at archive time; from_record reads them
   (decision-log fallback kept for legacy rows).

4. cancel hung during finalize — cancel() blocked on _step_lock held through
   _finalize's baseline generation (seconds-to-minutes). _finalize now
   publishes "completed" before the best-effort baseline block, and cancel()
   short-circuits on a lock-free terminal read.

5. v2 _runs never pruned — the reaper now evicts prior-pass archived tombstones
   (a later read rehydrates from the terminal row), bounding _runs by ACTIVE
   runs instead of total historical runs.

6/7. dedupe — extracted a shared enable_wal(db_path) helper (was duplicated
   verbatim across both DB layers) and a single TERMINAL_STATUSES constant in
   execution/base.py (was defined three times).

Suite: 843 passed / 0 failed; packaging 41.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01KoZVnrSPdYNyrxSfrUwaok
…eanup

Quality-only cleanups (no behavior change; suite 843/0). From a 4-lens
review of the PR diff:

- Extract shared build_final_metrics(run) in external_run_service (mirrors
  build_compare_url); the 9-key completed-run metrics dict was duplicated
  verbatim in ArchivedBacktestBackend.status().
- BacktestBackend.advance() calls session.drain_expired() instead of
  get_current_step(), which rebuilt + discarded a full market snapshot
  under _step_lock on every reaper pass, per live run.
- _archive_run derived step_index/total_steps twice with divergent
  is-None vs or fallbacks (disagreeing on a real 0) — collapse to one helper.
- Drop the dead `not isinstance(backend, ArchivedBacktestBackend)` guard in
  cancel_run (active is read from is_active(), which a tombstone always
  reports False, so active already implies a live backend).
- reap_runs and the _v2_fakes test double use the TERMINAL_STATUSES constant
  instead of hardcoded status tuples (the constant's own anti-drift purpose).
- get_runs_by_mode routes through _parse_run_row like the other five
  agent_runs readers (no consumer depends on the raw-string metadata shape).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01KoZVnrSPdYNyrxSfrUwaok
PR #67 follow-ups: run-lifecycle unification, heartbeat recovery, sys.exit deep fix, green CI
…tries

Full contest-window (2026-04-15 -> 2026-05-15) LLM agent runs deployed
into the seed DB via deploy_leaderboard_model.py, 161/161 steps
LLM-decided for both entries (zero parse failures, zero API errors,
verified from the deploy logs before commit).

Gemini/Qwen/DeepSeek entries are deliberately NOT included: the
CommonStack gateway budget capped out mid-run, degrading their curves
into partial rule-based fallback (Qwen: 1/161 real decisions), so they
were removed pending a clean rerun. Note: the H6 fallback guard passes
any run with llm_calls > 0 and cannot see partial fallback; tracked as
a follow-up.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01SVGFyxaNN4VPyAxn2hgsiu
feat(leaderboard): publish Claude Haiku 4.5 + Sonnet 4.6 contest entries
…stream main

Reconciles the diverged prod branch (Allan-Feng/AgenticTrading@main) with
Open-Finance-Lab main (PR #71 lifecycle fixes + PR #72 Claude leaderboard
entries). Conflict resolutions:

- backtest.db (binary): took allan-fork's DB (baselines recomputed with the
  reference_start_date look-ahead fix) and re-injected the two verified
  Claude contest runs (161 equity points each) via SQL.
- alpaca_bars.py: kept PR #71's MarketDataUnavailableError contract
  (raise on missing creds/SDK, no silent disabled-loader state); preserved
  Allan's AlpacaCredentialsError as a subclass so both exception names work.
- baseline_generator.py: union of both sides' imports;
  strategies._common import made lazy to break the circular import
  baseline_generator -> strategies/__init__ -> buy_hold -> baseline_generator.

Full backend suite green on the merged tree: 843 passed / 0 failed.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01SVGFyxaNN4VPyAxn2hgsiu
Sync: absorb allan-fork prod commits (baseline look-ahead fix, agent editor) + conflict resolution
@vercel

vercel Bot commented Jul 5, 2026

Copy link
Copy Markdown

@FlyM1ss is attempting to deploy a commit to the allan-feng's projects Team on Vercel.

A member of the Team first needs to authorize it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants