feat(agentserver): light up durable-task primitive (core 2.0.0b6 + invocations 1.0.0b5)#46997
Open
RaviPidaparthi wants to merge 267 commits into
Open
feat(agentserver): light up durable-task primitive (core 2.0.0b6 + invocations 1.0.0b5)#46997RaviPidaparthi wants to merge 267 commits into
RaviPidaparthi wants to merge 267 commits into
Conversation
…urable-tasks # Conflicts: # sdk/agentserver/.gitignore # sdk/agentserver/azure-ai-agentserver-core/CHANGELOG.md # sdk/agentserver/azure-ai-agentserver-core/azure/ai/agentserver/core/_base.py # sdk/agentserver/azure-ai-agentserver-core/azure/ai/agentserver/core/_tracing.py # sdk/agentserver/azure-ai-agentserver-core/samples/selfhosted_invocation/selfhosted_invocation.py # sdk/agentserver/azure-ai-agentserver-core/tests/test_tracing_e2e.py # sdk/agentserver/azure-ai-agentserver-invocations/CHANGELOG.md # sdk/agentserver/azure-ai-agentserver-invocations/azure/ai/agentserver/invocations/_invocation.py # sdk/agentserver/azure-ai-agentserver-invocations/tests/conftest.py # sdk/agentserver/azure-ai-agentserver-invocations/tests/test_span_parenting.py # sdk/agentserver/azure-ai-agentserver-invocations/tests/test_tracing.py
Contributor
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Implements spec-009’s “pluggable stream handler” work for the durable task framework by introducing a StreamHandler protocol with a default QueueStreamHandler, plus related durable-task capabilities (retry, resume route, metadata, samples/tests) and extensive formatting/tidying across tests and samples.
Changes:
- Added a pluggable streaming abstraction (
StreamHandler,QueueStreamHandler, factory type) and wired it intoTaskContext.stream()andTaskRunasync iteration. - Introduced/expanded durable-task building blocks:
TaskResult,RetryPolicy, resume HTTP route, hosted provider client, lease renewal helper, and substantial new test coverage + samples. - Updated docs/changelogs and reformatted various tests/samples for style consistency.
Reviewed changes
Copilot reviewed 88 out of 92 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| sdk/agentserver/azure-ai-agentserver-invocations/tests/test_tracing_e2e.py | Formatting-only adjustments (line wrapping/blank lines). |
| sdk/agentserver/azure-ai-agentserver-invocations/tests/test_session_id.py | Formatting-only adjustments (blank lines, wrapped AsyncClient context). |
| sdk/agentserver/azure-ai-agentserver-invocations/tests/test_server_routes.py | Formatting-only adjustments (blank lines). |
| sdk/agentserver/azure-ai-agentserver-invocations/tests/test_request_limits.py | Minor whitespace cleanup. |
| sdk/agentserver/azure-ai-agentserver-invocations/tests/test_request_id.py | Formatting-only adjustments. |
| sdk/agentserver/azure-ai-agentserver-invocations/tests/test_multimodal_protocol.py | Minor whitespace cleanup and section spacing. |
| sdk/agentserver/azure-ai-agentserver-invocations/tests/test_invoke.py | Formatting-only adjustments (blank lines). |
| sdk/agentserver/azure-ai-agentserver-invocations/tests/test_graceful_shutdown.py | Formatting + wrapped long asserts for readability. |
| sdk/agentserver/azure-ai-agentserver-invocations/tests/test_get_cancel.py | Minor whitespace cleanup. |
| sdk/agentserver/azure-ai-agentserver-invocations/tests/test_edge_cases.py | Formatting-only adjustments (blank lines). |
| sdk/agentserver/azure-ai-agentserver-invocations/tests/test_decorator_pattern.py | Formatting (wrapped JSONResponse returns). |
| sdk/agentserver/azure-ai-agentserver-invocations/samples/streaming_invoke_agent/streaming_invoke_agent.py | Reformatted token list for readability. |
| sdk/agentserver/azure-ai-agentserver-invocations/samples/simple_invoke_agent/simple_invoke_agent.py | Minor whitespace cleanup. |
| sdk/agentserver/azure-ai-agentserver-invocations/samples/multiturn_invoke_agent/multiturn_invoke_agent.py | Formatting; JSONResponse construction wrapped. |
| sdk/agentserver/azure-ai-agentserver-invocations/samples/durable_multiturn/store.py | New sample persistence helper (file-backed JSON store). |
| sdk/agentserver/azure-ai-agentserver-invocations/samples/durable_multiturn/requirements.txt | New sample requirements. |
| sdk/agentserver/azure-ai-agentserver-invocations/samples/durable_multiturn/app.py | New durable multiturn sample host wiring. |
| sdk/agentserver/azure-ai-agentserver-invocations/samples/durable_multiturn/agent.py | New durable multiturn sample agent task. |
| sdk/agentserver/azure-ai-agentserver-invocations/samples/durable_langgraph/store.py | New sample persistence helper (file-backed JSON store). |
| sdk/agentserver/azure-ai-agentserver-invocations/samples/durable_langgraph/requirements.txt | New sample requirements (LangGraph + deps). |
| sdk/agentserver/azure-ai-agentserver-invocations/samples/durable_langgraph/app.py | New streaming + steering durable LangGraph host sample. |
| sdk/agentserver/azure-ai-agentserver-invocations/samples/durable_copilot/store.py | New sample persistence helper (file-backed JSON store). |
| sdk/agentserver/azure-ai-agentserver-invocations/samples/durable_copilot/requirements.txt | New sample requirements (Copilot SDK, core, Starlette, uvicorn). |
| sdk/agentserver/azure-ai-agentserver-invocations/samples/durable_copilot/app.py | New durable Copilot host sample with SSE. |
| sdk/agentserver/azure-ai-agentserver-invocations/samples/durable_copilot/agent.py | New steerable durable Copilot agent sample. |
| sdk/agentserver/azure-ai-agentserver-invocations/samples/durable_claude/store.py | New sample persistence helper (file-backed JSON store). |
| sdk/agentserver/azure-ai-agentserver-invocations/samples/durable_claude/requirements.txt | New sample requirements (Anthropic SDK + runtime deps). |
| sdk/agentserver/azure-ai-agentserver-invocations/samples/durable_claude/app.py | New durable Claude host sample with SSE. |
| sdk/agentserver/azure-ai-agentserver-invocations/samples/durable_claude/agent.py | New steerable durable Claude agent sample. |
| sdk/agentserver/azure-ai-agentserver-invocations/samples/async_invoke_agent/async_invoke_agent.py | Formatting-only adjustments (wrapped JSON dict literals). |
| sdk/agentserver/azure-ai-agentserver-invocations/CHANGELOG.md | Changelog updates to mention durable samples + dependency bump. |
| sdk/agentserver/azure-ai-agentserver-core/tests/test_tracing.py | Formatting-only adjustments. |
| sdk/agentserver/azure-ai-agentserver-core/tests/test_startup_logging.py | Formatting-only adjustments and wrapped long lines. |
| sdk/agentserver/azure-ai-agentserver-core/tests/test_server_routes.py | Minor whitespace cleanup. |
| sdk/agentserver/azure-ai-agentserver-core/tests/test_logger.py | Minor whitespace cleanup. |
| sdk/agentserver/azure-ai-agentserver-core/tests/test_graceful_shutdown.py | Formatting-only adjustments and wrapped long asserts. |
| sdk/agentserver/azure-ai-agentserver-core/tests/test_config.py | Formatting for long function signatures. |
| sdk/agentserver/azure-ai-agentserver-core/tests/durable/test_task_result.py | New tests for TaskResult wrapper behavior + guardrails. |
| sdk/agentserver/azure-ai-agentserver-core/tests/durable/test_streaming.py | New tests for pluggable stream handler integration. |
| sdk/agentserver/azure-ai-agentserver-core/tests/durable/test_source.py | New tests exercising source field persistence. |
| sdk/agentserver/azure-ai-agentserver-core/tests/durable/test_retry.py | New tests for RetryPolicy and retry integration. |
| sdk/agentserver/azure-ai-agentserver-core/tests/durable/test_resume_route.py | New tests for the resume HTTP route behavior. |
| sdk/agentserver/azure-ai-agentserver-core/tests/durable/test_models.py | New tests for durable models/exceptions. |
| sdk/agentserver/azure-ai-agentserver-core/tests/durable/test_metadata.py | New tests for dict-like TaskMetadata + flush semantics. |
| sdk/agentserver/azure-ai-agentserver-core/tests/durable/test_local_provider.py | New tests for local durable provider CRUD/listing. |
| sdk/agentserver/azure-ai-agentserver-core/tests/durable/test_lifecycle.py | New lifecycle automation tests. |
| sdk/agentserver/azure-ai-agentserver-core/tests/durable/test_get.py | New tests for DurableTask.get(). |
| sdk/agentserver/azure-ai-agentserver-core/tests/durable/test_entry_mode.py | New tests for ctx.entry_mode across paths. |
| sdk/agentserver/azure-ai-agentserver-core/tests/durable/test_decorator.py | New tests for @durable_task decorator/options/type extraction. |
| sdk/agentserver/azure-ai-agentserver-core/tests/durable/test_cancellation_timeout.py | New tests for cancellation, timeout, and termination. |
| sdk/agentserver/azure-ai-agentserver-core/tests/durable/test_callable_factories.py | New tests for callable factories on tags/description. |
| sdk/agentserver/azure-ai-agentserver-core/tests/durable/init.py | New package init for durable tests. |
| sdk/agentserver/azure-ai-agentserver-core/tests/conftest.py | Formatting-only adjustments. |
| sdk/agentserver/azure-ai-agentserver-core/samples/durable_streaming/requirements.txt | New durable sample requirements. |
| sdk/agentserver/azure-ai-agentserver-core/samples/durable_streaming/durable_streaming.py | New sample demonstrating streaming with durable tasks. |
| sdk/agentserver/azure-ai-agentserver-core/samples/durable_source/requirements.txt | New durable sample requirements. |
| sdk/agentserver/azure-ai-agentserver-core/samples/durable_source/durable_source.py | New sample demonstrating source usage. |
| sdk/agentserver/azure-ai-agentserver-core/samples/durable_retry/requirements.txt | New durable sample requirements. |
| sdk/agentserver/azure-ai-agentserver-core/samples/durable_retry/durable_retry.py | New sample demonstrating retry policies. |
| sdk/agentserver/azure-ai-agentserver-core/pyproject.toml | Added httpx dependency + optional hosted extras (azure-identity). |
| sdk/agentserver/azure-ai-agentserver-core/azure/ai/agentserver/core/durable/_stream.py | New StreamHandler protocol + default QueueStreamHandler + factory alias. |
| sdk/agentserver/azure-ai-agentserver-core/azure/ai/agentserver/core/durable/_run.py | New TaskRun async-iter streaming integration and lifecycle control methods. |
| sdk/agentserver/azure-ai-agentserver-core/azure/ai/agentserver/core/durable/_retry.py | New RetryPolicy implementation and presets. |
| sdk/agentserver/azure-ai-agentserver-core/azure/ai/agentserver/core/durable/_resume_route.py | New Starlette route for POST /tasks/resume. |
| sdk/agentserver/azure-ai-agentserver-core/azure/ai/agentserver/core/durable/_result.py | New TaskResult wrapper class. |
| sdk/agentserver/azure-ai-agentserver-core/azure/ai/agentserver/core/durable/_provider.py | New storage provider protocol for durable subsystem. |
| sdk/agentserver/azure-ai-agentserver-core/azure/ai/agentserver/core/durable/_metadata.py | New dict-like TaskMetadata with flush/auto-flush. |
| sdk/agentserver/azure-ai-agentserver-core/azure/ai/agentserver/core/durable/_lease.py | New lease identity utilities + renewal loop. |
| sdk/agentserver/azure-ai-agentserver-core/azure/ai/agentserver/core/durable/_exceptions.py | New durable exception types (failed/suspended/cancelled/etc.). |
| sdk/agentserver/azure-ai-agentserver-core/azure/ai/agentserver/core/durable/_context.py | New TaskContext with stream support and lifecycle fields. |
| sdk/agentserver/azure-ai-agentserver-core/azure/ai/agentserver/core/durable/_client.py | New hosted durable task provider httpx client. |
| sdk/agentserver/azure-ai-agentserver-core/azure/ai/agentserver/core/durable/init.py | New public durable API exports. |
| sdk/agentserver/azure-ai-agentserver-core/azure/ai/agentserver/core/_middleware.py | Formatting-only adjustments for imports/log calls. |
| sdk/agentserver/azure-ai-agentserver-core/azure/ai/agentserver/core/_errors.py | Minor formatting simplification. |
| sdk/agentserver/azure-ai-agentserver-core/azure/ai/agentserver/core/_config.py | Minor formatting simplification. |
| sdk/agentserver/azure-ai-agentserver-core/azure/ai/agentserver/core/init.py | Minor whitespace cleanup. |
| sdk/agentserver/azure-ai-agentserver-core/README.md | Added durable-task documentation section + link. |
| sdk/agentserver/azure-ai-agentserver-core/CHANGELOG.md | Large changelog entry documenting durable subsystem and other changes. |
| sdk/agentserver/.gitignore | Added .vscode/ ignore. |
Comments suppressed due to low confidence (1)
sdk/agentserver/azure-ai-agentserver-invocations/samples/durable_multiturn/store.py:1
- For JSON persistence, it’s better to write/read with an explicit encoding (UTF-8) for cross-platform consistency. Consider using
open(fd, \"w\", encoding=\"utf-8\")(oros.fdopen) and also usingread_text(encoding=\"utf-8\")inload()to avoid platform-default encoding surprises.
Comment on lines
+57
to
+72
| if initial_delay.total_seconds() < 0: | ||
| raise ValueError(f"initial_delay must be >= 0, got {initial_delay}") | ||
| if max_attempts < 1 and not ( | ||
| max_attempts == 1 and initial_delay == timedelta(0) | ||
| ): | ||
| pass # allow no_retry preset | ||
| if backoff_coefficient < 1.0: | ||
| raise ValueError( | ||
| f"backoff_coefficient must be >= 1.0, got {backoff_coefficient}" | ||
| ) | ||
| if max_delay < initial_delay: | ||
| raise ValueError( | ||
| f"max_delay ({max_delay}) must be >= initial_delay ({initial_delay})" | ||
| ) | ||
| if max_attempts < 1: | ||
| raise ValueError(f"max_attempts must be >= 1, got {max_attempts}") |
Comment on lines
+191
to
+192
| except Exception as exc: | ||
| if "not found" in str(exc).lower(): |
Comment on lines
+210
to
+213
| if task_info.payload and "metadata" in task_info.payload: | ||
| meta_data: dict[str, Any] = task_info.payload["metadata"] | ||
| for key, value in meta_data.items(): | ||
| self._metadata.set(key, value) |
| and self._flush_callback is not None | ||
| and self._flush_task is None | ||
| ): | ||
| self._flush_task = asyncio.get_event_loop().create_task( |
Comment on lines
+60
to
+67
| except Exception as exc: # pylint: disable=broad-exception-caught | ||
| msg = str(exc).lower() | ||
| if "not found" in msg: | ||
| return Response(status_code=404) | ||
| if "not 'suspended'" in msg or "already" in msg or "conflict" in msg: | ||
| return Response(status_code=409) | ||
| logger.error("Resume failed for task %s: %s", task_id, exc, exc_info=True) | ||
| return Response(status_code=500) |
|
|
||
| ### Breaking Changes | ||
|
|
||
| - **`source` parameter removed** — The `source` keyword argument has been removed from `@durable_task()`, `.run()`, `.start()`, and `.options()`. Source provenance is now auto-stamped by the framework and cannot be overridden by developers. Use `tags` for custom metadata. |
- Pin aiohttp>=3.9.0,<4.0.0 to prevent pre-release 4.0.0a1 from being pulled by --pre flag (fails to compile on Python 3.13) - Disable mindependency for invocations/responses since azure-ai-agentserver-core>=2.0.0b4 is not yet on PyPI - Disable apistub for core (tool bug with Generic[Input,Output] on 3.10) - Change task API route from /storage/tasks to /internal/tasks - Add durable task overview documentation Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
….0a0 - AgentServerHost lifespan now automatically creates and initializes a DurableTaskManager during startup, and shuts it down on exit. This fixes 'DurableTaskManager not initialized' errors when using @durable_task without manual manager setup. - Pin aiohttp<4.0.0a0 to exclude pre-release 4.0.0a1 which fails to build (missing longintrepr.h) when CI uses --pre flag for nightly builds. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Changed HostedDurableTaskProvider base URL from /storage/tasks to /tasks - Task API integration remains disabled (FOUNDRY_TASK_API_ENABLED=0) - Includes all durable demo improvements: 12-stage research pipeline, crash recovery, GET reconnect with file fallback, cancel support, supervisor proxy, and updated README with demo script Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ointer Replace hand-crafted @durable_task checkpoint logic with LangGraph StateGraph and AsyncSqliteSaver. This eliminates FileStreamHandler, manual metadata management, and JSONL-based replay. Key changes: - agent.py: LangGraph StateGraph with looping research_stage node - app.py: Simplified HTTP handlers (no durable task framework imports) - GET handler: replays from checkpoint state instead of JSONL files - Cancel: asyncio.Event checked at node entry - requirements.txt: added langgraph, langgraph-checkpoint-sqlite, aiosqlite - README: updated architecture docs - .env: committed for deployment config All 5 test scenarios pass: - Full 12-stage execution with checkpointing - Already-complete detection on re-invocation - Cancel mid-execution (stops at next node boundary) - Resume after cancel (clears stale cancel flag) - Unknown thread returns None Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…e checkpointer" This reverts commit 4cf120a.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- POST handler returns 202 immediately (fire-and-forget) with invocation_id, session_id, task_id in response body - GET handler streams SSE with sequential id: N on each event - Supports last_event_id query param to skip already-seen events on reconnect (platform strips non x-client- headers) - Crash handler returns 202 then exits asynchronously - Session ID resolution simplified to use framework config - Demo client (demo-client.sh): POST→GET flow, client-side skip, LAST_EVENT_ID tracking, logs command for 3rd terminal - Verified live: crash mid-stage-5 → reconnect resumes from stage 5 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… architecture - Document POST 202 fire-and-forget flow - Document GET streaming with last_event_id query param - Add container logs section (azd ai agent monitor) - Update manual curl demo steps - Add 'How it works (client flow)' section Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…veness Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When reusing a session for a new task, the stale LAST_EVENT_ID from the previous run caused the client to skip the first stage header. Now resets to 0 when POST returns status='started'. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…n=v1 - task_id = invocation_id (no stale stream file reuse across sessions) - Reconnect does GET-only with last_event_id (no redundant POST) - Switch from api-version=2025-11-15-preview to v1 - Add Foundry-Features: HostedAgents=V1Preview header - Print exact method + endpoint on each call for debugging Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- durable_task → task (decorator) - DurableTask → Task (wrapper class) - DurableTaskOptions → TaskOptions (config) - DurableTaskManager → TaskManager - DurableTaskProvider → TaskProvider - HostedDurableTaskProvider → HostedTaskProvider - LocalFileDurableTaskProvider → LocalFileTaskProvider - _durable_task_ tag prefix → _task_ - agentserver.durable_task source type → agentserver.task - LIST API: server-side source_type filter (replaces client-side) - LIST API: cursor-based pagination (limit=100, has_more/after) - Dockerfile: FOUNDRY_TASK_API_ENABLED=1 All 189 tests pass. E2E validated with 2 crash recoveries. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…pagation fix Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…agation on cancel + get Adds 3 tests pinning the contract that the cancel and get invocation endpoints expose agent_session_id as request.state.session_id (mirror of the invoke endpoint, fixed in 57ca8f7): 1. test_cancel_propagates_agent_session_id_to_request_state - POST /invocations/{id}/cancel?agent_session_id=X - Asserts request.state.session_id == "X" inside a custom cancel handler. 2. test_get_propagates_agent_session_id_to_request_state - GET /invocations/{id}?agent_session_id=X - Same assertion for the get endpoint's handler. 3. test_cancel_without_agent_session_id_leaves_request_state_session_id_empty - Pins the explicit-empty-string contract when the query param is absent (handlers can branch on falsy without an AttributeError). Verified by reverting the framework fix locally — all 3 tests fail with AttributeError: 'State' object has no attribute 'session_id'. With the fix re-applied, all 10 cancel/get tests pass. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…var per invocation protocol spec Per `invocation-protocol-spec.md` §1.2 (GET) and §1.3 (cancel), neither endpoint has a platform-defined `agent_session_id` query parameter. The session is implicit: sourced from the `FOUNDRY_AGENT_SESSION_ID` environment variable the platform sets on the container (surfaced via `self.config.session_id`). The previous one-line fix (commit 57ca8f7) read the query param and stamped it on `request.state.session_id`. That worked for local dev where the demo's curl loop passes `agent_session_id` as a query param, but in the hosted contract the platform never sends a query param, so `request.state.session_id` resolved to the empty string and the demo's cancel handler had to depend on its own `app.config.session_id` fallback — defeating the purpose of having the framework propagate it. The fix here matches the invoke endpoint's source-precedence: caller-provided query param wins (callers may still pass one — the spec forwards any non-platform-defined query params transparently), env var falls back (the hosted default), empty string when both absent. Custom cancel/get handlers can now uniformly read `request.state.session_id` without owning the fallback ladder. Test coverage (`test_get_cancel.py`): - `test_cancel_propagates_session_id_from_env_var` (hosted default) - `test_get_propagates_session_id_from_env_var` (hosted default) - `test_cancel_caller_query_param_overrides_env_var` (precedence) - `test_cancel_without_env_var_or_query_param_yields_empty_session_id` (graceful no-op) Verified by reverting the framework change locally: all 4 fail; with the fix re-applied, all 11 cancel/get tests pass. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…n_id source for cancel/get Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…anch The checked-in preview wheels were sitting at sdk/agentserver/wheels/ on the durable-tasks branch even though only the durable-agent-demo sample consumes them (via its build.sh that copies into the docker build context). Wheels move to the demo branch where they actually belong, and where they can also bundle the responses package alongside core + invocations (the demo's docker image is the only consumer of this wheels directory). durable-tasks branch is now wheels-free. Downstream consumers that need the preview wheels should base off the demo branch where the build-wheels.sh script + checked-in *.whl files live. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The two standalone "AI-coding-agent skill" files (durable-task-skill.md, streaming-skill.md) move to the demo branch alongside the preview wheels they pair with. Their long-form developer guides (durable-task-guide.md, streaming-guide.md, task-and-streaming-spec.md) stay here in the core package's docs — those are SOT reference documentation tied to the package surface, not portable copy-into-your-project artifacts like the skills. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The cold-start recovery scan (_recover_stale_tasks) reclaimed a stale in_progress task via a direct provider.update that discarded the post-reclaim record and pre-tracked the stale scan etag. The lease renewal heartbeat (routed through _provider_update_locked with force_if_match=True) then kept sending the stale pre-reclaim etag. Both the LocalFileTaskProvider and the hosted task API enforce If-Match strictly, so the first heartbeat (~one lease half-life, 30s) 412'd and the renewal loop cancelled the recovered execution as 'lost ownership', truncating every crash recovery. Fix: route the scan reclaim through _reclaim_one (which now returns the post-reclaim TaskInfo and refreshes the tracked etag via _provider_update_locked), and adopt that record so (a) the heartbeat's tracked etag matches the store and (b) _start_existing_task sees the post-reclaim lease generation/instance. Also have _steering_cleanup_orphan_attachments return its updated record so the reclaim carries the current etag when cleanup wrote, and correct the stale _reclaim_one docstring that claimed the local provider ignores if_match. Adds tests/durable/test_recovery_lease_etag.py (RED without the fix). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Backport the spec-024 storage-path unification (.durable-tasks -> .durable via resolve_durable_subdir) to the durable_research invocation sample on the base branch, so the invocation samples match across the durable-tasks -> responses-spec016 -> demo stack. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…te-serialization (spec 031) Behavioral tests through real framework wiring (no mocks): pending_input_count must reflect the live queued-steering count at the cancel boundary (SOT §13 ordering invariant), and a steer+drain must run the steered turn with no blind PATCHes (SOT §25.1). The pending_input_count test lands RED (count==0, the observed defect) ahead of the fix. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…pec 031) Conformance-restoration: the implementation had drifted from the SOT spec (docs/task-and-streaming-spec.md). Fixes: - pending_input_count (FR-001/002, SOT §12/§13): add settable _ActiveTask slot _pending_input_count; record it BEFORE ctx.cancel in the same-process enqueue and both cross-process steering polls (§13 ordering invariant); reset to the remaining backlog on drain. Previously read a never-written attribute -> 0. - No blind writes (FR-005b, SOT §25.1): the queued-steering-cancel path used a bare provider.update with no If-Match (lost-update risk) -> now routes through the lock-held update primitive carrying the tracked etag. - Lock-held update primitive (FR-005a): extract _provider_update_lock_held for callers that already hold the non-reentrant per-task lock; _provider_update_locked delegates to it. - Drain read-inside-lock (FR-005, SOT §25.2): the steering drain now performs its record read + compute + write under one per-task lock hold (read was lock-free before, letting the lease heartbeat invalidate the pinned etag and starve the retry budget). Cross-process conflicts retry outside the lock. - SOT reconciled: §12/§13 (pending_input_count is in-process observed state, failure-tolerant, set-before-cancel), §25.1 (no blind writes), §25.2 (read-inside-lock rationale + two-helper split). Validated: 646 durable + 746 durable+streaming green; responses consumer green (2 pre-existing unrelated failures). Tests from prior commit now GREEN. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ry, provider parity Completes the remaining spec-031 conformance work: - Drain recovers from cross-process etag conflict (FR-006/SOT §25.3): the drain now also catches EtagConflict (consistent with the terminal-write + metadata paths) so a conflict landing on its write re-reads + retries instead of being abandoned by the caller's bare except. RED test reproduced the exact hosted 'Failed to drain steering queue' failure, then GREEN. - Metadata flush conflict retry (FR-006a/SOT §25.3): re-read + retry (≤5, last-write-wins on the namespace slot) instead of swallow-and-defer. - Local/file-provider hosted-parity assertions (FR-008/009): stale-if_match -> hosted-identical etag_mismatch/412 classification; lease-only update bumps the etag; two providers over one store reproduce a deterministic cross-process conflict + recovery. - Registered the new behavioral tests in the completeness meta-test (FR-016). - FR-005c if_match-site audit recorded; FR-012 streams conformance confirmed already-comprehensive by audit; FR-018 guide already consistent. Validated: 651 durable + 755 durable+streaming green; pylint unchanged (9.59/10). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…n bug (spec 031)
Hosted re-test of the invocations steering demo revealed the real drain failure
is NOT an etag race but a lease-renewal-on-suspended rejection: the multi-turn
turn writes status=suspended before the drain runs, and the hosted task store
rejects the drain PATCH's lease-renewal piggyback with 'lease renewal is only
supported for in_progress tasks'. The local provider was too permissive (allowed
renewal on {in_progress, suspended}), masking it. This commit tightens the local
provider to in_progress-only (hosted parity) + adds the drain status-transition
regression test, landing the bug RED.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…red turn runs (spec 031) The drain promotes a queued steering input to a NEW turn, but the record was left 'suspended' by the just-ended turn. The drain's PATCH now sets status='in_progress' (a valid suspended->in_progress claim that the hosted store accepts WITH lease params), instead of a bare lease renewal that the store rejects on a suspended task. Validated on a deployed hosted agent (rapida-0687): the steered turn now runs the new topic end-to-end; logs show 'Steering drain: drained next input' with no 'Failed to drain'. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…se renewal requires in_progress (spec 031) Corrects §52 step 14 (drain Phase-1 PATCH MUST set status='in_progress') and adds the §25.4 rule that the lease-extension trio is a renewal only on an already-in_progress record, or a claim when the same PATCH flips to in_progress. This is the doc-level root of the hosted 'Failed to drain steering queue' bug. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ion (data loss) _compact_on_disk() rewrote the log to a temp file and os.replace()'d it over self._path, but never reopened self._file — which still pointed at the old, now-unlinked inode. Every emit()/close() after the first compaction (once the eviction interval is crossed) therefore wrote to the orphaned inode and was LOST on the next process lifetime, and the single-writer flock was held on the dead inode. This breaks the persist-before-publish crash-recovery contract the file-backed replay backing exists to provide. Reopen self._file against the live path and re-acquire the flock after the swap (open+lock the new handle before closing the old so the single-writer guarantee is never released). Adds regression tests asserting post-compaction emit + terminal persist to the live file and survive rehydrate (RED before fix). Found via code review of the durable-tasks branch. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…yers Spec 033 Phase 4 (FR-007) core side — promote the internals that the composed protocol packages (responses/invocations) reach for, so they stop importing core underscore-private modules: - TaskRun.is_queued — public queued-steering-input predicate (replaces reaching TaskRun._queued_cancel_callback). Narrowly documented; SOT task-and-streaming-spec §35 + durable-task-guide §4.6 updated. - azure.ai.agentserver.core.platform_headers — public re-export of the shared platform HTTP/wire-contract constants (header names, isolation keys, error tags). - core.read_request_id(scope) — public ASGI-scope helper so callers read the middleware-resolved request id without depending on REQUEST_ID_STATE_KEY. - Genericize the durable metadata docstring (drop the named '_responses' wrapper example — core must not name a wrapper layer). RED-first tests: test_public_contract_surface.py (module + helper), TaskRun shape + behavioral is_queued (queued vs fresh) in test_steering.py. Full core suite green (866 passed). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…e 1) De-brand the 'durable/durability' terminology (collides with Microsoft Durable Functions / DTF / DTS) to 'resilient/resilience (crash resilience)'. Pure rename / reframe — zero behaviour change (core suite: 866 passed, 5 skipped, unchanged). - core/durable/ subpackage -> core/tasks/ (the broad task primitive: long-running, multi-turn, steering, retry AND resilience); tests/durable/ -> tests/tasks/. - docs/durable-task-guide.md -> docs/tasks-guide.md. - storage_paths: AGENTSERVER_DURABLE_ROOT -> AGENTSERVER_STATE_ROOT, .durable -> .agentserver, resolve_durable_* -> resolve_state_*, DurableSubdir -> StateSubdir (storage is 'state', not 'resilient'; nothing shipped, no compat needed). - word swaps durability->resilience / durable->resilient across source, tests, docs, README, CHANGELOG; module-path string literals -> tasks. - removed stale tracked .test-runs/ generated fixtures. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Persisted-artifact language must use "persistent/stored/state", never "resilient" (which denotes the crash-survival property). Corrects three storage-sense mistranslations introduced by the durable->resilient word swap: - _manager.py: "NOT resiliently persisted" -> "NOT persisted"; "previously- existing resilient backup write" -> "...backup write". - task-and-streaming-spec.md: "abstraction over the resilient store" -> "...task store"; pseudo-code comment "resets retry budget resiliently" -> "resets retry budget". - test_file_backed_replay_event_stream.py docstring: "resiliently persisted" -> "persisted". Property-sense "resiliently" (task runs/continues resiliently) retained. Comment/docstring-only; zero behaviour change. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…Phase 4) Behaviour-preserving governance-doc reframe to de-brand "durable/durability". - Principle X "Durability Contract Conformance" -> "Resilience Contract Conformance"; Principle XII (core primitive) prose reframed to task/resilience vocabulary. - Repoint path references to the renamed files: resilience-contract.md, tests/e2e/resilience_contract/, core/tasks/, tests/tasks/, resilient-responses-developer-guide.md, resilient_background. - Version 1.6.0 -> 1.7.0 (renamed principle). The Constitution is the contract-of-record and must reference no stale name (FR-034-7): zero durable residue. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… (Spec 034) The blind durable->resilient word swap turned the real Microsoft product "Azure Durable Functions" into a non-existent "Resilient Functions" in three "use another workflow engine instead" references. Drop it, keeping the other real examples (Temporal, Celery, Orleans) and never naming the competing product. Comment/docstring-only; zero behaviour change. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The durable->resilient word swap's `.durable`->`.agentserver` storage-dir rule over-matched the logger namespace `azure.ai.agentserver.durable`, producing the doubled, unintended `azure.ai.agentserver.agentserver`. Correct it to `azure.ai.agentserver.tasks` (mirroring the core.tasks subpackage) across 8 source loggers + the `.taskapi` child, the C-OBS-2 doc clause, and the matching test assertion. Loggers stay children of `azure.ai.agentserver`, so propagation was unaffected and suites passed — caught by the Principle XIII final review. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… core branch (Spec 034)
Per the branch-ownership model (invocations samples are owned by the core
branch and reach the demo branch transitively via core->responses->demo), the
invocations samples + tests reframe — previously applied only on the demo
branch — now originates here on core:
- Rename durable_{copilot,langgraph,multiturn,research} -> resilient_*
(+ e2e/test_durable_* -> test_resilient_*, test_durable_samples_structure
-> test_resilient_samples_structure); content reframed durable->resilient.
- Sample run UX: guarded sibling imports so each runs directly with
`python app.py` from its own dir (-m still works); requirements install the
in-repo preview packages via editable relative paths (core preview isn't on
PyPI).
- Storage-prose fixes (durable storage -> persistent storage) in
async_invoke_agent / multiturn_invoke_agent / _crash_harness; CHANGELOG.
Files are byte-identical to the demo branch's versions, so the forward merge
stays conflict-free. invocations suite: 214 passed.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…"running") A new conversation turn whose text repeated a previous turn would hang forever in "running": the content-based dedup (compare against the last UserMessage in Copilot's history) wrongly decided the message was "already sent", skipped session.send, then waited for a SessionIdle that never fires (no new turn was started). Fix: dedup by invocation identity, not content. Persist the Copilot messageId returned by session.send under the invocation_id as a durable "already delivered" marker. Only a crash-recovered invocation (same invocation_id, marker present) skips the re-send; a brand-new turn (resumed) gets a fresh invocation_id with no marker and therefore always sends — even with identical text. On the recovery skip-path, if the turn already completed upstream, return that reply instead of waiting for an idle that won't come. Also add debug-friendly logging (per-event at DEBUG; send/idle/turn-resolution milestones at INFO) — the absence of these made this hard to diagnose. Reproduced + verified end-to-end against the live GitHub Copilot CLI: sending the same message twice to one session now completes both turns; a recovered in-flight turn resumes and completes; suite unchanged (214 passed). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…copilot dedup Corrects the previous dedup fix. The marker approach (persist the send() messageId and skip on its presence) was wrong for crash recovery: send() returning does NOT guarantee Copilot persisted the message, so a crash between send() and Copilot's persistence would leave the marker set, skip the resend, and hang — the message never reached Copilot. (Also verified send()'s messageId does not correlate to any get_messages() event id, so it can't be used to probe persistence.) New approach keeps Copilot's persisted history as the source of truth, but scopes that content check to crash recovery only: copilot_has_message = entry_mode == "recovered" and <message is last UserMessage> - Fresh / resumed *new* turns always send (fixes the original hang where a repeated-text turn was wrongly deduped). - On recovery, if Copilot's history lacks the message (Copilot never persisted it), we RESEND; if it has it and the turn already finished, we return that reply instead of waiting for a SessionIdle that won't fire. Removed the messageId marker; restored the history-based check. Kept the debug/INFO logging. Basic flow + new-turn case verified live; suite 214 passed. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Lights up the durable-task primitive in
azure-ai-agentserver-core2.0.0b6 (and the matching invocations-protocol sample suite in
azure-ai-agentserver-invocations1.0.0b5) as a new feature.The durable-task primitive is a small decorator-driven API that lets a
hosted agent run long operations as named tasks that survive
process crashes, OOM kills, and container redeployments. Tasks pick up
exactly where they were after recovery, without the developer writing
any explicit checkpoint or replay code.
Full developer guide:
sdk/agentserver/azure-ai-agentserver-core/docs/durable-task-guide.md.Scope of THIS PR
azure-ai-agentserver-core— full durable-task primitive shippingfor the first time (no prior release of the primitive).
azure-ai-agentserver-invocations— matching durable samplesuite (
durable_copilot,durable_multiturn,durable_langgraph,durable_research) demonstrating the primitive end-to-end on theinvocations transport. Plus per-sample READMEs, a
SHIPPABLE.mdmanifest, a cross-sample
DURABLE_SAMPLES.mdoperational guide, anda CI gate (
test_samples_shippable_bar.py) that enforces theper-sample shippable bar on every PR.
Out of scope of this PR (split into separate PRs)
azure-ai-agentserver-responsesdurable orchestration→ see PR for branch
feature/agentserver-responses-spec016durable-agent-demoazd-deployable hosted-agent sample→ see PR for branch
feature/agentserver-durable-agent-demo(temporary; never-merged demo sample)
What the primitive ships
Tour:
Concepts shipping
@task(...)decorator +Taskreturned object with.run(),.start(),.options(...),.get_active_run(task_id).TaskContext—entry_mode,input,metadata(with auto-flushat lifecycle boundaries),
cancel(asyncio.Event), causebooleans
timeout_exceeded/cancel_requested, steering signalspending_input_count/is_steered_turn,shutdown,retry_attempt,recovery_count. Providesctx.suspend(output=...),ctx.stream(chunk),ctx.exit_for_recovery().TaskResult.status: Literal["completed", "suspended"].Failure paths surface as exceptions (
TaskFailed,TaskCancelled,TaskConflictError).TaskConflictError— single error type for any "task is busy / notavailable" state (live elsewhere, recovered elsewhere, evicted under
split-brain protection, terminal with queued steerer). Carries
current_statusso callers can branch.RetryPolicy— exponential / fixed / linear backoff presets,durable across crash and recovery.
EntryModeLiteral:"fresh" | "resumed" | "recovered".Suspended(sentinel for.run()of a suspended task),TaskStatusLiteral,TaskMetadata,StreamHandler,StreamHandlerFactory,QueueStreamHandler.Behavior shipping
three layers (startup scan, periodic background scan, inline reclaim
at scheduling primitives). The developer sees
ctx.entry_mode == "recovered"and otherwise the sameTaskContextsurface as on a fresh entry.
session cancels stranded executions in the previous process cleanly
via
HTTP 409 binding_mismatch. The previous process cancels itsexecution, suppresses its terminal write, and signals its awaiters
with
TaskConflictError.Task.start(...)on an already-active steerable task queues the new input. The first turn's
ctx.suspend(...)call resolves the steerer's.result()with thenext turn's outcome.
@task(timeout=...)isanchored to a persisted per-turn-start timestamp. A crash mid-turn
does NOT reset the budget; the recovered watchdog computes
remaining budget from the persisted timestamp.
ctx.metadataisflushed automatically at every terminal-of-turn boundary.
ETag-protected; steerable input data is cleared at the suspend
transition (data minimization); the lease owner string incorporates
both
FOUNDRY_AGENT_NAMEand session ID so two different agentssharing a session ID cannot collide on lease ownership.
Transport
HostedTaskProvideris built onazure.core.AsyncPipelineClientwith the standard policy chain (request-id, headers, user-agent,
retry,
AsyncBearerTokenCredentialPolicy, task-API logging,distributed tracing). Retry policy retries on 5xx / 408 / 429 only —
never on 409 regardless of body.
ContentDecodePolicyintentionallyexcluded; body parsing happens at the call site with defensive
error handling.
httpxis no longer a production dependency.Validation