Deploy: Claude leaderboard entries + PR #71 fixes + your look-ahead fix, reconciled by FlyM1ss · Pull Request #2 · Allan-Feng/AgenticTrading

FlyM1ss · 2026-07-05T12:54:51Z

What this is

Your fork's main is Render prod's deploy branch, and it diverged from Open-Finance-Lab main today (your look-ahead fix + agent editor vs. upstream PR Open-Finance-Lab#71/Open-Finance-Lab#72/Open-Finance-Lab#73). Upstream main now contains everything reconciled — this PR fast-forwards your fork to it.

Merging this deploys to prod:

Your baseline look-ahead fix + recomputed baselines + agent-editor UI (already live)
The two verified Claude leaderboard entries (Haiku 4.5 + Sonnet 4.6, 161/161 LLM-decided steps, injected into your recomputed DB)
PR PR #67 follow-ups: run-lifecycle unification, heartbeat recovery, sys.exit deep fix, green CI Open-Finance-Lab/AgenticTrading#71's run-lifecycle/heartbeat/WAL/sys.exit fixes
Conflict resolutions verified with the full backend suite: 843 passed / 0 failed (details in Sync: absorb allan-fork prod commits (baseline look-ahead fix, agent editor) + conflict resolution Open-Finance-Lab/AgenticTrading#73)

After merging, your fork matches upstream exactly, so future "Sync fork" clicks stay conflict-free.

🤖 Generated with Claude Code

https://claude.ai/code/session_01SVGFyxaNN4VPyAxn2hgsiu

Edit Landing page and dashboard text style

Edit website snapshot in readme

…ipped bundle pages/not-found.tsx was fully dead (App.tsx renders its own inline 404) and ui/card.tsx survived the LOW sweep only as its import — both deleted; tsc + vite build verified green. The committed frontend bundle was still the PRE-prune build: its CSS (107.76 kB) predates the 65-file component deletion (3618b6a). Swapped in the fresh assets (CSS 31.24 kB, JS byte-identical) and updated index.html's references, dropping the now-redundant ?v= cache-busters (filenames are content-hashed and vercel.json serves /assets/* immutable, / no-cache). Headless smoke: root renders, auth modal present, no console errors. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01SVGFyxaNN4VPyAxn2hgsiu

…TH backstop in CI B0/B1 follow-ups: _owner_context's X-Session-Id fallback makes session_id an ownership credential for import-created agents — document that 'session_id is never a credential' only holds for built-in agents. CI gets an explicit DATABASE_PATH env so a conftest isolation regression can never write into the committed seed DB. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01SVGFyxaNN4VPyAxn2hgsiu

H4 follow-up. RunStore (protocol steps) and BacktestDatabase (equity/trades) share one file; under the rollback journal a finalize's heavy writes block every reader. WAL lets readers proceed during commits — the actual concurrency shape here. Best-effort with fallback for filesystems without shm support; journal_mode persists in the file so one switch at construction covers all later connections from either layer (RunStore's own _enable_wal lands with the lifecycle-unification commit that reworks that module). Verified on WSL drvfs. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01SVGFyxaNN4VPyAxn2hgsiu

… sweep, heartbeat recovery Closes the three H4 reconciliation follow-ups in one mechanism: v2 runs now write protocol_runs rows through their whole lifecycle (create=loading, load-failure/finalize/cancel update the row), so * the per-agent active-run cap counts run_store.count_active_runs() on BOTH surfaces under the SHARED v1 create lock — an agent can no longer hold 2× MAX_ACTIVE_RUNS_PER_AGENT by splitting across v1+v2. The v2 create path first reconciles its own finished backends (passive is_active() peek only; live runs are never touched under the lock); * the reaper reaches v2: reap_runs() drives sweeps registered via register_reaper_sweep() (composition root registers reap_v2_runs — domain never imports api/*). The sweep drains abandoned v2 runs through elapsed deadlines and swaps terminal backends for a DB-backed ArchivedBacktestBackend tombstone: market-data buffers freed, while status/result/decisions/idempotent-replay stay serviceable — and v2's _require_run rehydrates terminal rows post-restart (owners see 'failed' instead of a 404; anti-oracle 404 shape preserved for foreign probes); * startup orphan recovery now covers v2 rows for free, and multi-worker deployments get heartbeat-based recovery: rows carry owner_instance/heartbeat_at (schema migrated in place), every reaper pass heartbeats runs whose session is live in this process, and fail_stale_runs() only fails runs nobody heartbeats (RUN_HEARTBEAT_STALE_SECONDS, default 300s; blanket RUN_RECOVERY_ON_STARTUP stays for single-process). 16 new tests (tests/test_run_lifecycle_unification.py) pin the cross-surface cap both directions, sweep archival + deadline draining, restart rehydration, oracle shape, heartbeat stamping/refresh, stale recovery, and the legacy schema migration. FakeBackend gained an honest is_active(); test_active_run_count_is_side_effect_free re-pinned to the ledger-based counting (passivity invariant unchanged). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01SVGFyxaNN4VPyAxn2hgsiu

…t() (B0 deep fix) The class fix behind B0 and its recurring guard sites: AlpacaDataLoader (missing creds / missing alpaca-py), BaselineGenerator (same) and HourlyBacktester.load_data (no data) used to sys.exit(1). SystemExit is a BaseException — it evaded every 'except Exception' server boundary, silently killed daemon loader threads, and wedged the ASGI loop. They now raise MarketDataUnavailableError (a plain RuntimeError subclass); the CLI entrypoints (backtest_hourly_agent, backtest_custom_algo) translate it to exit code 1 at the process boundary. The existing (Exception, SystemExit) guards stay as defense-in-depth; their comments now describe them as such. app.py's paper-baseline init, the legacy /backtest/run background thread, and the algo-service subprocess path are covered by their existing 'except Exception' handlers with no local changes. Tests: tests/test_market_data_errors.py (red-green) pins raise-not-exit for all three libraries, the plain-Exception contract, and the CLI boundary. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01SVGFyxaNN4VPyAxn2hgsiu

…_TOKENS LOW-sweep residual: the per-request output-token ceiling is an env knob that changes a run's spend/behavior, but nothing recorded which value was in force. agent_runs gains a JSON metadata column (added in BOTH _init_schema and _migrate_schema, per the schema convention); insert_run accepts an optional metadata dict; every agent_runs reader returns it parsed (one _parse_run_row helper) so listings and get_run agree. The engine's LLM agent run records {llm_max_output_tokens: <effective value>}; rule-based runs record nothing. 6 red-green tests incl. legacy-schema migration. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01SVGFyxaNN4VPyAxn2hgsiu

The new CI ran RED on 5 failures that predate this PR (they fail on main identically). All five pinned contracts the product deliberately no longer has: * GET /runs is a PUBLIC listing (route docstring; the dashboard home page fetches it session-less for the seed runs) — the strict per-session listing tests now pin the real contract: public metadata listing, session-scoped per-run data access, middleware 400 on malformed session ids for scoped routes (fixture also now patches the backtests router's import-time db binding, which the old tests silently missed). * /runs/latest/metrics filters agent_name == 'Agent' (the internal hourly agent); the old fixture's 'Agent A'/'Agent B' rows could never match. * create_safe_prompt's strategy body was rewritten; the old CANNOT-phrased constraint sentences only exist in a retired commented-out prompt. The active prompt regains a compact no-tools constraint line (parity with CUSTOM_STRATEGY_OUTPUT_CONTRACT — tightens, does not loosen) and the test pins the current wording; the hard boundary remains response-side validate_llm_response enforcement (unchanged, still pinned). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01SVGFyxaNN4VPyAxn2hgsiu

…ise contract test_missing_credentials_exits pinned the retired sys.exit(1) behavior; it now pins MarketDataUnavailableError (companion to the B0 deep-fix commit). Full backend suite: 829 passed, 0 failed. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01SVGFyxaNN4VPyAxn2hgsiu

…fication A 50-agent review (5 lenses, 3-vote refute panels) of the follow-ups diff confirmed 12 findings; the real defects, all fixed red-green: * CRITICAL cancel-clobber: POST /v2/runs/{id}/cancel hardcoded status "closed" into the run's ledger row. A cleanup cancel landing after the final decision (before the sweep archives the backend) downgraded a genuinely completed run to "closed" — in the DB row AND the in-memory session — permanently hiding its metrics. cancel_run now no-ops on terminal runs and reports the run's true state; BacktestBackend.cancel() guards the terminal check under the session lock (closes the peek→cancel race with a finalizing submit). * _archive_run no longer swaps in the tombstone when the terminal row write fails — the backend stays live so the next sweep retries, instead of the row freezing in an active status with nothing left to reconcile it. * v2 rows now leave "loading" (→ "running") when the background market-data load succeeds. * RunStore's heartbeat-column migration tolerates losing the probe→ALTER race to a concurrently-starting worker (duplicate-column errors are the goal state, not a crash). * ArchivedBacktestBackend.from_record recovers step_index/total_steps for completed runs from the decision log (restart-rehydrated runs reported a misleading 0/0). * test_active_run_count_is_side_effect_free regained its count assertion (ledger-based) alongside the passivity invariants. Accepted + documented in code: cross-surface create serialization on the shared lock, and deliberate v2<->v1 cross-surface run visibility. Suites after fixes: backend 837 passed / 0 failed; packaging 41 passed. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01SVGFyxaNN4VPyAxn2hgsiu

Upstream merged PR #67 plus two landing refreshes (#69/#70) while this follow-up pass was in flight. Conflicts were confined to the shipped frontend bundle: upstream's refresh was built before this branch's dead component prune, and this branch's refresh predated upstream's brand changes (Hero/Navbar/index.css, atltransparent logo, Inter 800). Resolved by REBUILDING the bundle from the merged landing source — brand changes + prune together (tsc green, vite build green, index-CoQ1VbAs.css 31.71 kB, headless smoke: renders, auth modal, 0 console errors) — and pointing index.html at the rebuilt hashed assets (content hashes + the vercel.json immutable/no-cache split make the ?v= params redundant). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01SVGFyxaNN4VPyAxn2hgsiu

… reads, bounded registry Seven findings from the PR #71 review, each red-green pinned in tests/test_pr71_review_fixes.py: 1. cancel-during-loading race — load_market_data published waiting_decision WITHOUT _step_lock, defeating cancel()'s lock guard; the background _load also wrote failed/running unconditionally. A cancel landing during the load was silently reverted (run kept running, cap slot freed, client misled). Both load paths now respect an already-terminal (cancelled) status under the lock. 2. archived status() dropped baseline_run_ids/compare_url — the live get_status() returns them for a completed run but the tombstone omitted them, so the same endpoint changed shape once the reaper archived the run. ArchivedBacktestBackend.status() now rebuilds both from the persisted baseline columns (shared build_compare_url helper). 3. rehydrated failed/closed runs reported 0/0 steps — from_record only recovered counts for completed runs (from the decision log). protocol_runs now persists step_index/total_steps at archive time; from_record reads them (decision-log fallback kept for legacy rows). 4. cancel hung during finalize — cancel() blocked on _step_lock held through _finalize's baseline generation (seconds-to-minutes). _finalize now publishes "completed" before the best-effort baseline block, and cancel() short-circuits on a lock-free terminal read. 5. v2 _runs never pruned — the reaper now evicts prior-pass archived tombstones (a later read rehydrates from the terminal row), bounding _runs by ACTIVE runs instead of total historical runs. 6/7. dedupe — extracted a shared enable_wal(db_path) helper (was duplicated verbatim across both DB layers) and a single TERMINAL_STATUSES constant in execution/base.py (was defined three times). Suite: 843 passed / 0 failed; packaging 41. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01KoZVnrSPdYNyrxSfrUwaok

…eanup Quality-only cleanups (no behavior change; suite 843/0). From a 4-lens review of the PR diff: - Extract shared build_final_metrics(run) in external_run_service (mirrors build_compare_url); the 9-key completed-run metrics dict was duplicated verbatim in ArchivedBacktestBackend.status(). - BacktestBackend.advance() calls session.drain_expired() instead of get_current_step(), which rebuilt + discarded a full market snapshot under _step_lock on every reaper pass, per live run. - _archive_run derived step_index/total_steps twice with divergent is-None vs or fallbacks (disagreeing on a real 0) — collapse to one helper. - Drop the dead `not isinstance(backend, ArchivedBacktestBackend)` guard in cancel_run (active is read from is_active(), which a tombstone always reports False, so active already implies a live backend). - reap_runs and the _v2_fakes test double use the TERMINAL_STATUSES constant instead of hardcoded status tuples (the constant's own anti-drift purpose). - get_runs_by_mode routes through _parse_run_row like the other five agent_runs readers (no consumer depends on the raw-string metadata shape). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01KoZVnrSPdYNyrxSfrUwaok

PR #67 follow-ups: run-lifecycle unification, heartbeat recovery, sys.exit deep fix, green CI

…tries Full contest-window (2026-04-15 -> 2026-05-15) LLM agent runs deployed into the seed DB via deploy_leaderboard_model.py, 161/161 steps LLM-decided for both entries (zero parse failures, zero API errors, verified from the deploy logs before commit). Gemini/Qwen/DeepSeek entries are deliberately NOT included: the CommonStack gateway budget capped out mid-run, degrading their curves into partial rule-based fallback (Qwen: 1/161 real decisions), so they were removed pending a clean rerun. Note: the H6 fallback guard passes any run with llm_calls > 0 and cannot see partial fallback; tracked as a follow-up. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01SVGFyxaNN4VPyAxn2hgsiu

feat(leaderboard): publish Claude Haiku 4.5 + Sonnet 4.6 contest entries

…stream main Reconciles the diverged prod branch (Allan-Feng/AgenticTrading@main) with Open-Finance-Lab main (PR #71 lifecycle fixes + PR #72 Claude leaderboard entries). Conflict resolutions: - backtest.db (binary): took allan-fork's DB (baselines recomputed with the reference_start_date look-ahead fix) and re-injected the two verified Claude contest runs (161 equity points each) via SQL. - alpaca_bars.py: kept PR #71's MarketDataUnavailableError contract (raise on missing creds/SDK, no silent disabled-loader state); preserved Allan's AlpacaCredentialsError as a subclass so both exception names work. - baseline_generator.py: union of both sides' imports; strategies._common import made lazy to break the circular import baseline_generator -> strategies/__init__ -> buy_hold -> baseline_generator. Full backend suite green on the merged tree: 843 passed / 0 failed. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01SVGFyxaNN4VPyAxn2hgsiu

Sync: absorb allan-fork prod commits (baseline look-ahead fix, agent editor) + conflict resolution

vercel · 2026-07-05T12:54:56Z

@FlyM1ss is attempting to deploy a commit to the allan-feng's projects Team on Vercel.

A member of the Team first needs to authorize it.

YangletLiu and others added 19 commits July 4, 2026 20:32

Merge pull request #69 from Allan-Feng/main

aa0c328

Edit Landing page and dashboard text style

Merge pull request #70 from Allan-Feng/main

294e873

Edit website snapshot in readme

Merge pull request #71 from Open-Finance-Lab/pr67-followups

16a780e

PR #67 follow-ups: run-lifecycle unification, heartbeat recovery, sys.exit deep fix, green CI

Merge pull request #72 from Open-Finance-Lab/leaderboard-llm-entries

3ec7036

feat(leaderboard): publish Claude Haiku 4.5 + Sonnet 4.6 contest entries

Merge pull request #73 from Open-Finance-Lab/sync-fork-20260705

b5667ab

Sync: absorb allan-fork prod commits (baseline look-ahead fix, agent editor) + conflict resolution

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Deploy: Claude leaderboard entries + PR #71 fixes + your look-ahead fix, reconciled#2

Deploy: Claude leaderboard entries + PR #71 fixes + your look-ahead fix, reconciled#2
FlyM1ss wants to merge 19 commits into
Allan-Feng:mainfrom
Open-Finance-Lab:main

FlyM1ss commented Jul 5, 2026

Uh oh!

vercel Bot commented Jul 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

FlyM1ss commented Jul 5, 2026

What this is

Uh oh!

vercel Bot commented Jul 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants