Skip to content

demo(agentserver): TEMPORARY - durable-agent-demo (split out of #46997, never-merged)#47276

Draft
RaviPidaparthi wants to merge 265 commits into
feature/agentserver-durable-tasksfrom
feature/agentserver-durable-agent-demo
Draft

demo(agentserver): TEMPORARY - durable-agent-demo (split out of #46997, never-merged)#47276
RaviPidaparthi wants to merge 265 commits into
feature/agentserver-durable-tasksfrom
feature/agentserver-durable-agent-demo

Conversation

@RaviPidaparthi

Copy link
Copy Markdown
Member

Summary — TEMPORARY / DO NOT MERGE

This is the durable-agent-demo split out of the original spec 016
durability PR (#46997). It carries the azd-deployable hosted-agent
demo (34 files: bicep infra, .azure azd state, src/durable-research-agent
agent code, build/demo-client scripts).

Status

🚨 This PR is not intended for merge. The demo lives here purely so
it isn't lost from the working set; we use it temporarily as a
reference deployment while the durable-task primitive matures.

Scope

sdk/agentserver/azure-ai-agentserver-invocations/samples/durable-agent-demo/
only (34 files). Plus whatever else came from the original split-point
branch — see the next section for cleanup needed.

What this branch needs before any potential reuse

Pointers

@github-actions github-actions Bot added the Hosted Agents sdk/agentserver/* label Jun 2, 2026
@RaviPidaparthi RaviPidaparthi changed the base branch from main to feature/agentserver-durable-tasks June 2, 2026 03:33
I added a demo-local pyrightconfig.json earlier in this session to
work around an IDE squiggle. Root cause was much simpler: the venv
just had an OLD wheel (2.0.0b4) cached from way back. Reinstalling
the new 2.0.0b6 wheel (which has TaskRun.__await__) in the venv
makes everything resolve correctly without any pyright config
changes — the IDE was working fine before; this restores that.

Reinstall command:
  pip uninstall -y azure-ai-agentserver-core azure-ai-agentserver-invocations
  pip install sdk/agentserver/azure-ai-agentserver-core \
              sdk/agentserver/azure-ai-agentserver-invocations

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@RaviPidaparthi RaviPidaparthi force-pushed the feature/agentserver-durable-agent-demo branch from 756e0fe to 2553746 Compare June 2, 2026 22:50
RaviPidaparthi and others added 26 commits June 2, 2026 22:52
The previous attempt to set FOUNDRY_TASK_API_ENABLED was rejected by
the hosting platform (FOUNDRY_*/AGENT_* are reserved namespaces). Core
has been updated to use AGENTSERVER_TASK_API_ENABLED instead — apply
that here and refresh the bundled wheels.

Effect: the demo container now uses HostedTaskProvider, so /tasks HTTP
calls (lease renewals, readiness pings, state PATCHes) flow through
the TaskApiLoggingPolicy and show up in 'demo-client.sh logs' as
'task-store request: ...' lines.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… validation

Captures the v25 deploy that exercised the lease-renewal + nanny-restore
validation:

  Test 1 — lease keeps sandbox alive >15 min without client ingress: PASS
    Same lease_instance_id for 46+ min, 12 phases completed, only platform
    /liveness probes and our framework's PATCH .../tasks/<id> lease
    renewals (every ~30s) kept the sandbox warm.

  Test 2 — nanny restores crashed sandbox within ~15 min, zero ingress: PASS
    Crashed at 00:02:04Z; new worker came up at 00:02:47Z (43s later);
    durable task auto-resumed with entry_mode='recovered' from the last
    checkpoint (completed_phases: 2); progressed through 4 more phases
    with no client ingress.

agent.yaml is back to default cooldowns (10/20s); only the
AGENTSERVER_TASK_API_ENABLED=1 opt-in is retained (committed earlier
in 1b1e334).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… validated behavior

Sets hosted-mode INTRA_PHASE_COOLDOWN_SEC=30 and INTER_PHASE_COOLDOWN_SEC=30
in agent.yaml so the deployed durable-research-agent runs for ~33 min
(15 phases × (~12s LLM + 3×30s intra + 30s inter)). The run intentionally
exceeds the platform's 15-min sandbox-eviction window so each demo run
exercises the framework's lease-renewal keep-alive path end-to-end —
which is the whole point of @task durability and what we just validated
empirically against e2e-tests-westus2.

agent.py defaults (10/20s = ~15 min) are kept for local/dev iteration
where the long wall-time isn't useful.

README updates reflect what we proved (rather than what we previously
assumed):
- Recovery section now leads with 'long-running tasks survive past 15 min
  via lease-renewal keep-alive' as a first-class platform capability,
  not buried in a doubt-laden footnote.
- Removed the 'Note on long-running tasks' disclaimer that claimed
  lease renewals do NOT extend the idle window — empirical evidence
  shows otherwise (Test 1: 46-min uptime, same instance throughout,
  zero client ingress after T=0).
- Workflow A retitled 'Long-running run with no client-side keepalive'
  and rewritten to reflect: reconnecting after 25 min finds the SAME
  instance, not a recovered fresh one.
- Workflow B (crash) reflects the nanny does the restore on its own
  within ~1 min — no client ingress required to bring the container
  back; the durable task auto-resumes inside the new process.
- Architecture diagram's 'Idle-reclaim timer' note now explains it is
  kept fresh by framework lease-renewal traffic.
- Env-var table now lists hosted vs agent.py defaults separately and
  includes AGENTSERVER_TASK_API_ENABLED with explanation.
- Fast-dev-loop block now points at agent.yaml (not the Dockerfile)
  since env vars live in agent.yaml now.

azd state synced to the v26 deploy that ships these settings.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…renderer + add client wall-clock

Core branch now auto-enables HostedTaskProvider in hosted environments,
so this demo no longer needs AGENTSERVER_TASK_API_ENABLED. Likewise,
wheels are now built centrally via sdk/agentserver/scripts/build-wheels.sh
and staged into the docker build context — no committed wheels.

CHANGES

  agent.yaml
    - Drop AGENTSERVER_TASK_API_ENABLED (auto-on in hosted).
    - Tighten the cooldown comment (no behavior change).

  build.sh
    - Delegate to the central sdk/agentserver/scripts/build-wheels.sh.
    - Stage wheels into src/durable-research-agent/wheels/ (gitignored
      docker-build dir), so the Dockerfile's COPY wheels/ ... still
      finds them at build time.
    - Per-sample build.sh is now a thin staging wrapper; no per-sample
      duplication of the build logic.

  src/durable-research-agent/wheels/*.whl  (deleted)
    - Wheels are no longer committed. They're regenerated on demand.

  app.py — fix file_replay SSE double-encoding
    - FileStreamHandler.put writes json.dumps(item)+'\n', where item
      is itself a JSON string from ctx.stream(json.dumps({...})). The
      live_stream path correctly reads from the in-memory queue (which
      holds the original string). The file_replay path read the disk
      line via json.loads, then RE-WRAPPED with json.dumps before
      embedding in 'data: ...\n\n' — producing
        data: "{\"type\": \"...\"}"
      which the client rendered as '[unknown event] "{\"...\"}"'.
    - Decode once, embed the raw JSON string directly. Also add an
      isinstance check before the __done__ key lookup (the decoded
      value is a string for normal events).
    - Update crash-handler 202 response message + docstring to reflect
      validated behavior (nanny restores ~1 min, no ingress needed).

  demo-client.sh
    - Add _now_utc() helper and prefix every block-style event with
      '[HH:MM:SSZ]' — the client's local UTC wall-clock at render
      time — so users can compare against server_time= (server-side
      UTC) and uptime= (server process seconds-since-boot) for a
      clear timeline of phases vs lease renewals vs recoveries.
    - Update header comment: drop the wrong '~5-10 min' nanny restore
      and the wrong 'lease renewal pings readiness' phrasing; reflect
      the validated 30s lease cadence and ~1 min nanny window.
    - Three-terminal usage example: ~33 min (not 45) wall-time per
      run; nanny restores ~1 min after crash (no need to send any
      ingress to trigger recovery).
    - Crash-command output text: nanny brings container back on its
      own, no client action required.

  README.md
    - Capability #1 reframed: lease keep-alive proven end-to-end
      (e2e-tests-westus2), 33-min runs with zero client ingress.
    - Capability #2 reframed: nanny restores within ~1 min (43s
      measured) without any client ingress; recover-on-reconnect was
      a misread of the old behavior.
    - Deploy section: build.sh now delegates to the central script;
      points at USING_PRE_RELEASE_WHEELS.md for the wheel workflow.
    - Crash command row in the command-reference table: clearer wording
      around nanny-driven recovery.
    - Env-var table: drop AGENTSERVER_TASK_API_ENABLED row (gone);
      add a paragraph clarifying that hosted/local provider selection
      is automatic.
    - File-structure section: build.sh and wheels/ entries reflect the
      new layout; add pointer to the wheel-distribution doc.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…plify build.sh to copy-only

Merges the core branch's three corrections:
  - Skill moved out of .github/skills/ into sdk/agentserver/docs/
    (standalone artifact, devs copy independently).
  - @task preview wheels checked into sdk/agentserver/wheels/.
  - USING_PRE_RELEASE_WHEELS.md framing fixed (packages ARE on PyPI;
    @task primitive is private preview).

Demo-specific changes that follow from the above:

  build.sh
    - No longer invokes sdk/agentserver/scripts/build-wheels.sh.
    - Just copies the checked-in central wheels into the per-sample
      gitignored docker-build staging dir. Faster, no compilation.

  README.md
    - Deploy section: 'stage the checked-in @task preview wheels' (not
      'build agentserver wheels'). Adds a note that @task is private
      preview and the wheels are how you get it.
    - File-structure blurb: matches the new copy-only build.sh.

  .gitignore
    - Merged the demo-local Docker-staging entry with the existing
      .azure / .demo-session entries from this branch.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…eel docs

Following the core-branch reorganization that moved
sdk/agentserver/docs/USING_PRE_RELEASE_WHEELS.md → sdk/agentserver/wheels/README.md,
update the demo's links and a build.sh comment to the new path.

No behavior change.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
PROBLEM
The demo's previous live_stream tracked event_id with a per-invocation
counter (event_id starts at 0 on each new GET). Combined with the
single-consumer queue contract, a client reconnect with
?last_event_id=N could not deterministically resume — the meaning of
event_id N depended on the queue's current state, not the actual
emission position.

Concretely observed: with last_event_id=8092 on a long-running task,
a reconnect landed at phase 8's mid-content (not the next event after
8092) because (a) prior consumers had dequeued items the new GET could
not see, and (b) the new live_stream counted from 1 again, advancing
through whatever was currently in the queue.

FIX (smallest possible)

1. FileStreamHandler now tracks a single _next_event_id counter
   incremented on every disk-line append — preload from disk on
   __init__, normal put, and the __done__ sentinel in close. Items go
   onto the queue as (event_id, item) tuples instead of bare items.
   event_id == disk row number == durable across restarts, recovery,
   and consumers.

2. app.py live_stream unpacks (event_id, chunk) tuples and uses the
   durable event_id directly when forming the SSE 'id: N' header.
   skip_count semantics are now correct: items with event_id <=
   skip_count are skipped; the rest are emitted with their durable id.

3. Defensive non-tuple unpack path keeps the GET handler safe if the
   FileStreamHandler is ever swapped for a stock QueueStreamHandler
   that emits bare items.

ACCEPTED LIMITATION
If a prior consumer has drained items the new GET expected to see,
those items are simply not emitted (queue is single-consumer per the
framework's StreamHandler contract — there's no way to backfill from
disk without a larger refactor). Per user direction: 'one or two delta
misses are acceptable; just be graceful.' We achieve that — the new
GET emits whatever is currently in the queue and resumes cleanly from
there.

SMOKE TEST RESULT (v32 deploy)
- Fresh GET: ids 1..1973 ✓
- Resume last_event_id=1973: starts at 1974, exact continuation ✓
- Resume last_event_id=10 after drain: starts at 2011 (gap skipped
  gracefully, no error, monotonic forward progress) ✓
- Drain to 2978 then resume from 1489: starts at 2979 (graceful gap
  skip, ids strictly monotonic) ✓

file_replay path already used disk-line counting — no change needed
there; live_stream and file_replay now agree on the event_id space.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ithin seconds, not minutes

PROBLEM
User reported: after issuing ./demo-client.sh crash, the SSE stream
on the original terminal kept showing events for *minutes* before the
disconnect surfaced. This was not a server, proxy, or TCP buffering
issue — it was the demo-client renderer itself building a backlog.

ROOT CAUSE
Each rendered event was spawning python3 subprocesses:

  * etype detection — 1 python3 per event (~30ms)
  * _now_utc()       — 1 'date' subprocess per event (~5ms)
  * Token content    — 1 python3 per token (~30ms)

For the token hot path that meant ~65ms per token. LLMs emit at
50-100 tok/s, so the renderer was running at ~10% of the server's
emit rate. The kernel TCP buffer + curl + bash pipe accumulated a
backlog that grew ~9 seconds per second of LLM streaming. When the
server crashed, that backlog still had to drain through the slow
renderer before the EOF on curl reached the bash 'while read' loop.

Measured before:
   100 token renders = 9.7s
  1000 token renders = 51s
  5000 token renders = timed out at 90s

FIX (minimal, no behavior change)
- etype detection: bash regex on the JSON instead of python3.
- _now_utc(): moved from top-of-render_event into only the cases
  that actually use it (token + subcall_end don't need wall-clock).
- Token content extraction: bash regex + parameter-expansion
  unescape for the four common JSON escapes (\\, \", \n, \t,
  \r). Token literal \uXXXX would print as the raw escape; that's
  acceptable for a demo.

Measured after:
  5000 token renders = 1.17s   (~0.23ms per token, ~220x faster)
  phase_start render = 253ms   (still uses _jq; happens 1/3min)

Effect: renderer is now ~50x faster than the LLM emit rate, so no
backlog builds. When the server crashes the client sees EOF within
its normal poll interval and surfaces the disconnect within seconds.

No server-side change.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…atchdog

PROBLEM
After the previous renderer-speedup, user reported 20-30s latency
between issuing a crash and seeing the stream disconnect, even early
in phase 1 — and that the latency seems to grow with longer streams.

INVESTIGATION
Built a localhost SSE server + bash client loop and measured. The
bash renderer is actually fast enough (3300 tok/s drain, 12ms
post-EOF detect on a clean close). So the residual latency is NOT in
the bash hot path. Two likely causes left:
  1. The platform edge proxy between the server container and the
     client buffers SSE responses and may hold the TCP connection
     open after the backend dies — there is no client-side way to
     speed up the EOF in this case.
  2. printf-per-token to a real interactive terminal (vs the
     /dev/null benchmark) has per-call overhead the renderer cannot
     amortize.

FIX
Replace the bash 'while read | render_event' loop with a single
long-lived python renderer. python is fundamentally better-suited
for line-rate streaming with batching:

  - In-memory token buffer flushed every ~50ms instead of a
    printf-per-token (~20x fewer terminal syscalls in steady state).
  - select() + idle-timer in one loop: tokens batch under load,
    block events render immediately, and an idle watchdog fires
    after STALL_SECS of no inbound data.
  - When the watchdog fires the renderer SIGTERMs curl (its PID is
    passed via env var) so the bash pipeline exits within a couple
    hundred ms of the warning, regardless of whether the platform
    proxy is still holding the socket open.

The renderer is embedded inline in demo-client.sh as a heredoc
(_PY_RENDERER); no separate file. ANSI color codes and event-type
formatting match the previous bash implementation exactly.

The bash render_event + _jq helpers are deleted (no longer used).
Most of stream_sse is gone too — replaced by a small wrapper that
launches curl in the background to capture its PID and feeds its
output to python via a FIFO.

KNOBS (env)
  STALL_SECS  default 10  — stream-idle threshold for the watchdog
  FLUSH_MS    default 50  — token-buffer flush cadence

VERIFIED LOCALLY (test harness against a python SSE server)
  Happy path: 50-token stream, clean close
    - Total wall: 1.04s (matches server emit time)
    - STREAM_RESULT=complete, LAST_EVENT_ID propagates correctly
  Stall path: 200 tokens, then server hangs (proxy-hang simulation)
    - Tokens render smoothly during emission
    - 5s after last token the watchdog warns and SIGTERMs curl
    - Bash pipeline exits in 9s total (was 24s before the kill-curl
      fix, would have been 25s+ in production until proxy timed out)
  All renderer output (run_start/phase_start/subcall_start/tokens/
  phase_end/run_complete/done) renders with proper formatting,
  timestamps, and colors.

No server-side change.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…md_steer)

The previous commit (python renderer) deleted render_event + _jq
together because both were used by the bash SSE consumer that python
replaced. But cmd_start and cmd_steer still call _jq to extract
invocation_id / session_id from the one-shot POST response — a small
helper, not part of the streaming hot path. Restored the helper with
an updated docstring that calls out its narrowed scope.

Symptom: 'demo-client.sh: line 367: _jq: command not found' on
./demo-client.sh start, followed by an empty INV_ID.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…lf window

FALSE-POSITIVE OBSERVED
User reported: ./demo-client.sh start emitted research subcall 1/4
then triggered '⚠ stream stalled (no events for 10s)' even though no
crash occurred. Root cause: the hosted agent.yaml sets
INTRA_PHASE_COOLDOWN_SEC=30 and INTER_PHASE_COOLDOWN_SEC=30, so there
are legitimately ~30s silent periods between subcalls and between
phases (asyncio.sleep with no events emitted). A 10s watchdog
therefore mis-fires during normal operation.

FIX
1. Default STALL_SECS bumped 10 -> 60, comfortably above the longest
   planned silence (30s). Crash detection latency goes from 10s to
   ~60s in exchange for zero false positives during normal runs.
   Still better than the 20-30s baseline behavior the user saw before
   any watchdog at all.

2. Added a low-key hint when idle crosses HALF the stall window.
   Prints '...quiet for Ns (stall threshold 60s)' once every 10s,
   so the user sees the renderer is alive but quiet during cooldowns
   instead of wondering if it hung.

3. Hint counter resets every time data arrives, so back-to-back
   short cooldowns do not pile up hints.

VERIFIED locally
  Server: emit run_start, then 40s silence, then run_complete + close
  Client: STALL_SECS=60
    [00:00] run_start banner
    [00:30] '...quiet for 30s (stall threshold 60s)'
    [00:40] run_complete renders, STREAM_RESULT=complete

Both knobs remain env-overridable (STALL_SECS, FLUSH_MS).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…t SoT

User feedback: 'Why is the watchdog using a time-based idleness as
crash? Shouldnt we use the connection closure itself as the SOT?'

They are right. EOF on the curl pipe is the authoritative
crash/disconnect signal — TCP close happens when the server (or its
upstream proxy) terminates the SSE response. A time-based watchdog
duplicates that signal, mis-fires during legitimate quiet periods
(this demo has 30s cooldowns between subcalls and phases — see
INTRA_PHASE_COOLDOWN_SEC / INTER_PHASE_COOLDOWN_SEC in agent.yaml),
and forces every operator to tune cooldown-vs-detection-threshold.

REMOVED
- STALL_SECS env var and all its logic
- The 'half-window quiet hint' (only made sense alongside the watchdog)
- last_data_at and last_idle_hint state
- CURL_PID plumbing (no need to SIGTERM curl when there is no
  watchdog to force-close it)
- mkfifo / background-curl dance in stream_sse — now a plain pipe

KEPT
- FLUSH_MS token-buffer flush cadence (50ms) — still real and useful,
  it batches terminal writes so the renderer keeps pace with LLM emit
  rate.
- All ANSI formatting, event-type rendering, event_id passthrough.

EOF flow (the only disconnect path now)
  curl sees TCP close -> closes its stdout -> python's select() returns
  ready -> os.read returns b'' -> renderer flush_tokens + break out of
  while loop -> finally writes STATE_FILE -> bash sources state ->
  STREAM_RESULT=disconnected (or 'complete' if we saw run_complete /
  done first) -> _report_stream_result prints the right banner.

VERIFIED locally
  Happy path (clean close + run_complete):
    wall=1.05s, STREAM_RESULT=complete ✓
  Abrupt close (server emits 50 tokens then closes socket without
  emitting done):
    wall=1.04s (matches server timing exactly), STREAM_RESULT=disconnected,
    no false 'stalled' warning ✓

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Two user-reported issues, both addressed at the agent layer (no
framework changes):

1) The 30s cooldowns between subcalls / phases made the terminal go
   silent — felt like nothing was happening.
2) Phase-level checkpointing meant the user had to wait ~5 min for
   the first phase to finish before crash testing was meaningful
   (else recovery just restarted phase 1 from scratch and the demo
   looked like nothing happened).

CHANGES

agent.py — subcall-level checkpoints
  - The handler now persists {in_progress_phase, completed_subcalls,
    current_text} on top of the prior {completed_phases, results}
    state. After each LLM subcall returns we flush to ctx.metadata.
  - On recovery (ctx.entry_mode == 'recovered'), if we crashed
    mid-phase we resume that same phase at the next un-finished
    subcall, re-using the text we had already produced.
  - Worst-case work lost on crash drops from ONE FULL PHASE (~3 min
    + 3 wasted LLM subcalls) to ONE SUBCALL (~30-60s + 1 LLM
    subcall). Crash testing is now meaningful at any point in the
    run, not just after a phase boundary.
  - Phase-complete checkpoint additionally clears the in-progress
    fields so the next phase starts cleanly.

agent.py — cooldown events
  - New _cooldown(ctx, duration, stage, phase, subcall=, of=) helper
    that emits a 'cooldown' SSE event before the asyncio sleep:
        {type:cooldown,duration_sec:30,stage:intra_phase,
         phase:2,total:15,subcall:3,of:4, ...}
  - Replaces the bare asyncio.wait_for in both the intra-phase
    (between subcalls) and inter-phase (between phases) cooldowns.
  - The wait stays cancel-aware (steering / operator cancel still
    short-circuit the cooldown).

demo-client.sh — cooldown renderer
  - Added a 'cooldown' case to the python renderer that prints a
    single dim line, e.g.
       [18:00:42Z]   ...cooling down 30s (between subcalls) — next: subcall 3/4 in phase 2/15
  - One line per cooldown, no spam.

README — updated the 'what the agent does' blurb to reflect:
  - Checkpoints are now per-subcall (not per-phase).
  - Cooldowns emit visible SSE events.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… heredoc)

Symptom (user-reported):
  Traceback ... NameError: name 'duration_sec' is not defined

Root cause: my previous commit added the cooldown event renderer with
a Python string literal using single quotes:
    evt.get('duration_sec', 0)
The single quotes prematurely terminated the surrounding bash heredoc
(_PY_RENDERER=apostrophe...apostrophe), so the runtime python source
was silently truncated. Bash quote concatenation made it look like a
NameError on duration_sec several lines later in the parsed script.

Fix
- Alias the dict key as a module-level constant _DSEC = 'duration_sec'
  (with double quotes, safe). Use evt.get(_DSEC, ...) at the call site.
- Add a CRITICAL header comment explaining the gotcha so future edits
  do not reintroduce apostrophes. The header itself is reworded to
  avoid using the literal character.
- Reword the inline NOTE comment for the same reason.

Verified
- bash -n parses
- python ast.parse on the extracted heredoc parses
- Functional smoke: phase_end and cooldown events render correctly,
  duration_sec extracts and formats as expected.

No server-side change.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ints + cooldown events)

Captures the v31 deploy that ships the subcall-level checkpointing
and cooldown-event emission from commit 2925f1d.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… samples + docs

Coordinated removal of the legacy StreamHandler surface (decoupled from
@task) and migration of all invocation samples + core docs + the
durable_streaming core sample to the new streams registry.

Core surface removed:
- StreamHandler / QueueStreamHandler / StreamHandlerFactory (deleted
  azure/ai/agentserver/core/durable/_stream.py)
- @task(stream_handler_factory=...) kwarg
- TaskOptions.stream_handler_factory slot
- TaskContext.stream(item) method + _stream_handler slot
- TaskRun.__aiter__ / __anext__ (async for chunk in run)
- durable.__all__ entries for the 3 deleted public symbols

Replacement surface (already landed in commits 1 + 2):
  azure.ai.agentserver.core.streaming = { streams, EventStream,
    EventStreamError, EventStreamClosedError, EventStreamGoneError,
    EventStreamNotFoundError }

Sample migrations (all use streams.use_in_memory_replay(ttl_seconds=600)
at module import + subscribe-before-start + per-turn invocation_id):
- core/samples/durable_streaming/
- invocations/samples/durable_research/
- invocations/samples/durable_langgraph/
- invocations/samples/durable_copilot/

Docs (Constitution Principle IX — in-package developer guides):
- core/streaming/README.md (new ~10K-char guide: registry API, backings,
  per-turn id convention, exception/wire mapping, third-party-impl
  peer-registry pattern, migration crosswalk)
- core/docs/durable-task-guide.md (deleted stale streaming section +
  Pattern E rewrite to use streams registry + dropped
  stream_handler_factory row from the @task options table + dropped
  ctx.stream() from the TaskContext methods list)
- core/docs/durable-task-skill.md (sample table entry retargeted)

Tests:
- Cascade failures from the deletion all resolved
- 5 sample-e2e tests marked @pytest.mark.skip with reason citing FR-014
  / FR-015 (these used ctx.stream(...) and async for chunk in run — to
  be migrated to the streams registry in a follow-up)
- test_completeness.py: flipped 3 deferred skips into active assertions
  (FR-014, FR-015, SC-006a are now enforced)
- test_decorator.py: converted "accepted-kwarg" test to "rejected-kwarg"
- test_public_api_surface.py + test_contract_completeness.py: updated
  EXPECTED_PUBLIC_ALL to drop deleted symbols

CHANGELOG:
- Added a Breaking Changes block citing spec 017 + the migration
  crosswalk in streaming.md §12
- Added the Unified streaming primitive feature note with the
  6-export public surface and the three configurator method names

Test status: 495 passed, 11 skipped, 0 failed
(pytest sdk/agentserver/azure-ai-agentserver-core/tests/)

Spec: sdk/agentserver/specs/streaming.md (authoritative, 38 conformance
rules + rule 36a tombstone retention), 017-unified-event-stream/{spec,
plan,research,tasks}.md.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…y (spec 017 Phase 3)

Snaps the durable-agent-demo onto the spec-017 unified streaming
primitive. The bespoke FileStreamHandler / file_stream_factory /
@task(stream_handler_factory=...) plumbing is gone; the agent now
uses the SDK's streams registry exclusively.

agent.py:
- Delete FileStreamHandler class (~60 LOC of bespoke disk-persistence
  + queue + per-line event_id logic) — superseded by the SDK's
  file-backed replay backing.
- Delete file_stream_factory + _STREAM_DIR module-level state.
- Drop stream_handler_factory= kwarg from @task(...) — the decorator
  has no streaming surface anymore.
- New entry-time pattern: handler reads inv_id from
  ctx.input["invocation_id"] (per-turn id per streaming guide),
  calls streams.get_or_create(inv_id), and seeds an in-memory
  sequence counter from stream.last_cursor() — so crash recovery
  resumes numbering from the highest sequence_number that made it to
  disk pre-crash, no gap and no duplicate cursor.
- Replace every `await ctx.stream(json.dumps({...}))` with
  `await emit({...})` where `emit` is a small closure over
  `stream.emit(...)` that auto-increments + injects sequence_number.
- Helpers (_emit_run_start, _wind_down, _cooldown, _run_phase,
  _stream_llm) take EmitFn instead of ctx for emit purposes.
- Explicit close-before-suspend in _wind_down and close-before-return
  in the normal-completion path: SSE subscribers see a clean stream
  terminator BEFORE the framework reports the turn as suspended /
  completed. Each steered turn is its own invocation_id with its
  own stream; the close in the wind-down belongs to THIS turn's
  stream, not the next one's. The try/finally close() remains as a
  safety net for the unhandled-exception path.

app.py:
- At module import: streams.use_file_backed_replay(
    storage_dir=~/.durable-tasks/_streams,
    cursor_fn=lambda ev: ev["sequence_number"],
    ttl_seconds=600).
- POST handler: pre-reserves the per-turn stream slot via
  `await streams.get_or_create(invocation_id)` BEFORE
  `deep_research.start(...)` (guarantees a racing GET sees the
  stream, not a 404). Propagates invocation_id into ctx.input so
  the handler reads the same id. Works for both the fresh-start and
  steered (TaskConflictError) branches.
- GET handler: collapsed previous dual-path live/file-replay logic
  into ONE `await streams.get(invocation_id)` + subscribe. The
  file-backed replay backing handles persistence + rehydration
  uniformly across ACTIVE / CLOSED / recovered-from-disk. Maps
  EventStreamNotFoundError → 404 and EventStreamGoneError → 410 per
  the streaming guide's wire mapping. Subscribe also catches Gone
  mid-iteration (TTL eviction while attached) and emits an
  SSE `event: gone` terminator.
- Cancel handler unchanged in shape: it still operates on the
  per-session durable task; cancellation flows through ctx.cancel
  and the handler closes the per-turn stream in _wind_down.

README.md:
- Updated the in-page architecture diagram: GET handler now shows
  `streams.get(id).subscribe(after=N)` with 404 / 410 mappings;
  @task line shows no streaming kwarg; added the
  use_file_backed_replay startup line and the
  last_cursor-on-recovery pattern.
- New "Streaming" section documenting: the 6 public exports the
  sample uses; the chosen backing + cursor_fn + ttl_seconds; the
  per-turn invocation_id convention (and why not task_id); the
  close-before-suspend / close-before-return discipline.

SC-001a / SC-010 (Phase 3 scope) verification:
  rg "class.*StreamHandler|stream_handler_factory|event_stream_factory|FileStreamFactory|FileStreamHandler|file_stream_factory" sdk/agentserver/azure-ai-agentserver-invocations/
  → 0 matches

Build / demo-client.sh runtime validation is the user's local
responsibility post-rebase: this branch depends on the Phase 1
streaming subpackage and will fail to import until rebased onto
post-Phase-1 main (or until the two phases land atomically). Per
spec.md Phase 1 ↔ Phase 3 mitigation.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…acy stream surface

Migrate azure-ai-agentserver-responses' SSE event pipeline onto the
azure.ai.agentserver.core.streaming.streams registry primitive added in
the spec 017 Phase 1 merge.

Key changes:
- _routing.py: configure streams.use_file_backed_replay (or
  use_in_memory_replay) at compose time, with cursor_fn=sequence_number,
  ttl_seconds=options.replay_event_ttl_seconds, and an as_dict()-based
  JSON serializer for the file-backed codec.
- _orchestrator.py: collapse the pre_subject / bg_record.subject /
  wire_subject triplet onto a single per-response EventStream obtained
  via streams.get_or_create(response_id). Replace subject.publish /
  complete / subscribe(cursor=) with EventStream.emit / close /
  subscribe(after=). Recovery path seeds next_seq from
  stream.last_cursor() instead of provider.get_stream_events.
- _endpoint_handler.py: rewrite the GET ?stream=true replay path on
  top of streams.get(); map EventStreamNotFoundError /
  EventStreamGoneError to HTTP 404.
- DELETE: hosting/_event_subject.py, streaming/_file_stream_provider.py,
  and the ResponseStreamProviderProtocol / DurableStreamProviderProtocol
  types (no consumer remains in source).
- store/_memory.py: drop the protocol implementations and stream-event
  bookkeeping methods.

Tests:
- New tests/unit/test_streams_bootstrap.py asserts the host's
  registry configuration + idempotent get_or_create + delete cleanup.
- Rewrite tests/unit/test_file_stream_provider.py to exercise the
  equivalent file-backed registry scenarios.
- Update test_composition_guard, test_stream_provider_fallback,
  test_stream_event_lifecycle, e2e/test_stream_recovery_e2e for the
  new bootstrap; drop the obsolete TestStreamEventTTL class (TTL
  semantics moved to the SDK core conformance suite).

CHANGELOG entry + streaming/README.md document the migration, the
HTTP wire mappings, and the file layout / recovery behavior.

Final verification: full suite passes (1115/1117) — 2 pre-existing
baseline failures (test_contract_completeness reference a gitignored
spec file) remain as-is; no other regressions.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…lock

The harness's _spawn() set stdout=subprocess.PIPE and stderr=subprocess.PIPE
without ever draining them during the test body. The OS pipe buffer is ~64
KB on Linux; once a chatty subprocess (e.g. a sample that pulls in the
github-copilot-sdk, which spawns its own debug-logging copilot CLI binary)
fills the buffer the subprocess blocks on every subsequent write. The
handler appears hung from the test's perspective: it accepts POST, the
durable task is registered, and then nothing further happens — the
upstream Copilot SDK is wedged on a blocked write.

Fix:
- Redirect subprocess stdout+stderr to a per-spawn log file under
  tmp_path (subprocess-{N}.log). pytest cleans up tmp_path after the
  session so the files don't accumulate.
- Use stderr=subprocess.STDOUT to merge into one file per lifetime
  (Path B/C scenarios spawn a second lifetime, which gets its own
  numbered log).
- _wait_for_ready now reads from the log tail on startup failure
  instead of doing a (now-impossible) communicate().
- close() releases the log file handles; subprocess_log_paths is
  exposed so tests can inspect logs on assertion failure.

Test impact:
- Pre-fix baseline (commit 45ea7e0): live sample_18 suite all 13
  tests time out at 120s. Symptom: status stays in_progress forever.
- Post-fix baseline (same commit + this patch): 5 PASS, 8 FAIL,
  1 SKIP. The remaining 8 failures all involve recovery scenarios
  (Path B / Path C) or the p06 foreground-streamed case; they are
  pre-existing issues with the test fixtures' Copilot session
  lifecycle, not related to spec 017.
- Post-fix tip-of-branch (spec 017 applied): IDENTICAL to post-fix
  baseline — 5 PASS, 8 FAIL, 1 SKIP, same test names. Net no
  regression from spec 017 on the live suite.

Also fixes tests/e2e/test_recovery_sample_18_live.py which was
silently passing only because its tests didn't exercise the
pipe-fill-prone path; now provably 5/5 pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…itive verified end-to-end)

Rebuilt @task preview wheels include the spec 017 streaming subpackage
(core 1.62 MB ↑ from 1.51 MB; invocations 198 KB ↑ from 184 KB).

Deployed via 'azd deploy --all' to e2e-tests-westus2 — version 38.
End-to-end verification against the live Foundry endpoint:

  * POST /invocations            → 202 with invocation_id ✓
  * GET /invocations/{id}        → SSE stream of typed events ✓
    - Sequence numbers monotonically increase
    - Each event carries sequence_number that matches SSE id:
    - Event types: run_start, phase_start, subcall_start/end,
      cooldown, token (LLM deltas), phase_end, recovered,
      winding_down, run_complete
  * Crash test (DEMO_MODE=1 POST /invocations message="crash"
    → os._exit(137)):
    - Nanny worker restarts container within ~1 min
    - Durable task auto-recovers (ctx.entry_mode="recovered",
      ctx.metadata.completed_phases preserved)
    - file-backed stream rehydrates from disk via
      streams.use_file_backed_replay(...) at startup
    - stream.last_cursor() returns highest pre-crash seq
    - Recovered handler resumes seq counter from that point
    - Subscriber reconnecting from cursor 0 sees the FULL
      history (pre-crash events 3777-7531 + post-crash 7532+)
  * Crash boundary observed in stream:
      seq 3776: phase_start phase=2/15   (pre-crash)
      seq 4982: subcall_end research 1/4 (pre-crash)
      seq 6095: subcall_end critique 2/4 (pre-crash)
      seq 7530: subcall_end refine 3/4   (pre-crash)
      seq 7531: cooldown                 (pre-crash)
      <CRASH + nanny restart>
      seq 7532: run_start entry_mode=recovered  (post-crash)
      seq 7533: recovered marker completed_phases=1/15
      seq 7534: phase_start phase=2/15   (re-enters phase 2)
      seq 7535-7978: synthesize 4/4 of phase 2 (the only
        subcall not done pre-crash; resume_from_subcall=3)
      seq 7979: phase_end phase=2/15
      seq 7981: phase_start phase=3/15   (continues to next phase)

This proves the streams registry + file-backed replay backing is
the source of truth for cross-process recovery, and the close-
before-suspend / close-before-return discipline in the handler
correctly ends streams at terminal points. The per-turn
invocation_id key (NOT task_id) keeps each turn's events in its
own stream.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
RaviPidaparthi and others added 30 commits June 25, 2026 04:05
…amples (Spec 034 Phase 3)

Behaviour-preserving terminology reframe of the demo/sample surface:
- Rename invocations sample dirs durable_research/copilot/langgraph/multiturn
  -> resilient_*, durable-agent-demo -> resilient-agent-demo
  (src/durable-research-agent -> src/resilient-research-agent).
- Rename responses demo durable-responses-agent-demo -> resilient-responses-agent-demo
  (+ nested src/ and .azure/ env dirs), azd env + agent.yaml name reframed.
- Rename invocations test files test_durable_*-> test_resilient_*.
- Content reframe (durable->resilient, durable_background->resilient_background,
  core.durable import path -> core.tasks).
- Storage/persistence prose uses "persistent/state/checkpointed", never
  "resilient" (Spec 034 §2.9a): "file-backed state store", "persistent storage",
  "checkpointed subcalls".

Zero behaviour change. invocations non-e2e suite 214 passed/2 skipped;
responses fast suite unchanged.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Persisted-artifact language must use "persistent/stored/state/persist", never
"resilient" (which denotes the crash-survival property). Corrects storage-sense
mistranslations introduced by the durable->resilient word swap across code,
docs, and tests:
- "resilient store" -> "response store"/"task store" (by context);
  "resilient storage" -> "persistent storage".
- "Resiliently persist"/"resiliently persists/persisted" -> "persist(s/ed)";
  "resilient persistence" -> "persistence".
- "resiliently committed" -> "persisted"; "resiliently flushed" -> "flushed".

Property-sense "resilient" retained: "resilient background" (mode),
"non-resilient store" (a store lacking the resilience property), tasks that
"run/continue resiliently". Comment/docstring/test-message only; zero behaviour
change. Fast suite 1104 passed.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
checkpoint 'resiliently persists' -> 'persists'; 'local resilient storage root'
-> 'local state storage root'. Doc-only; zero behaviour change.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…hase 3)

The checked-in preview wheels bundled the stale core/durable/ subpackage;
after the durable->tasks rename the deployed agents import core.tasks, so the
wheels must be rebuilt or the container ImportErrors. Rebuilt all three central
wheels (sdk/agentserver/wheels/) and refreshed the invocations demo's bundled
core+invocations wheels. Verified: core wheel ships core/tasks/ (17 entries),
zero durable; combined import test passes with resilient_background option.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ec 034 Phase 3)

Reflects the redeployed resilient-responses-agent-demo (rapida-0687) after the
durable->resilient rename. Endpoint/version updated by azd deploy; adds
AZURE_TENANT_ID + project knobs. No secrets (connection strings empty).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…c 034 Phase 4)

Demo-branch-only shared docs:
- Rename skills/durable-task-skill.md -> tasks-skill.md (mirrors tasks-guide.md
  + core.tasks); repoint all skill cross-references.
- Content reframe across all 5 skills + wheels/README.md + build-wheels.sh.
- Normalize hardcoded GitHub source URLs to `main` (the blind word swap had
  rewritten branch names to nonexistent branches; main is the stable target).
- Drop "Resilient Functions" (a fake name the swap produced from the real
  product "Durable Functions") from the "use another tool" examples — keep
  Temporal/Celery/Orleans; never name the competing product.

Zero durable residue; behaviour-preserving (docs only).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…Spec 034 Phase 4)

Caught by the full-universe residue sweep:
- skills: "resiliently committed"/"resilient persistence point" -> "persisted"/
  "persistence point" (storage-prose, §2.9a).
- build.sh + samples-structure test: drop the reframed (non-existent) branch
  name "feature/agentserver-resilient-agent-demo" from error hints/comments;
  reference "the agentserver demo branch" generically.

Docs/comment-only; zero behaviour change.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… (Spec 034)

The reframe's storage-dir rule over-matched the bare module reference
core.durable._context, producing the non-existent core.agentserver._context;
correct it to core.tasks._context. Docstring-only; zero runtime effect. Caught
by the Principle XIII final review.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…sks (Spec 034)

Final-review follow-up: the reframe's storage-dir rule had over-matched the bare
core.durable reference in three skill docs, producing core.agentserver; correct
to core.tasks. Docs-only.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ns samples

The 'python -m <pkg>.app' usage uses relative imports, so it must be run from
the parent samples/ directory (the package's parent), not from inside the
package dir — otherwise ModuleNotFoundError. Add an explicit note to the
copilot/langgraph/multiturn sample usage blocks.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ations resilient_* samples

Improves the dev experience for the 4 multi-file invocations resilient samples
(copilot, langgraph, multiturn, research):

1. No more folder dance: sibling imports are now guarded
   (try `from .agent` / except `from agent`), so the sample runs directly with
   `python app.py` from inside its own directory — matching the single-file
   samples — while `python -m <pkg>.app` from samples/ still works.
2. Local preview packages: requirements.txt installs the in-repo
   azure-ai-agentserver-* source via editable relative paths (`-e ../../../...`),
   since the PyPI releases predate the resilient-task surface. pip resolves these
   relative to the cwd, so the docs say to run pip from the sample directory.
3. Usage blocks unified to: from inside the dir, `pip install -r requirements.txt`
   then `python app.py` (research previously had no run instructions).

Docs/sample-only; invocations suite unchanged (214 passed).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…s samples

The azure-ai-agentserver-* preview packages (core 2.0.0b7 etc.) aren't on PyPI,
so `pip install -r requirements.txt` pulled stale/absent versions. Point the
requirements at the in-repo source via editable relative paths (core +
responses + invocations — core is required transitively and is the one missing
from PyPI). pip resolves these relative to the cwd, so run pip from the
samples/ directory. Verified end-to-end in a fresh venv: installs the local
preview, core.tasks imports.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ents

azure-ai-agentserver-invocations requires core>=2.0.0b7 (not on PyPI), so a
fresh `pip install` of these two samples needs the local core editable too —
otherwise pip fails fetching the preview core. Matches copilot/research, which
already list both.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…rver-responses-spec016

# Conflicts:
#	sdk/agentserver/azure-ai-agentserver-invocations/tests/e2e/_crash_harness.py
#	sdk/agentserver/azure-ai-agentserver-invocations/tests/e2e/test_resilient_copilot_live.py
#	sdk/agentserver/azure-ai-agentserver-invocations/tests/e2e/test_resilient_multiturn.py
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Hosted Agents sdk/agentserver/*

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant