AgentService (TS): emit §6.4 mandatory leading ack before handler runs#100

Merged
M64GitHub merged 1 commit into main from sdk-mandatory-initial-ack on May 11, 2026
Conversation


M64GitHub commented May 11, 2026

Summary

TS-side counterpart to PR #101 (Python SDK, merged) — brings @synadia-ai/agent-service into parity with the §6.4 spec tightening (nats-agent-sdk-docs@b1c6972), which requires an immediate {"type":"status","data":"ack"} chunk as the first message on the reply subject, before any latency-inducing handler work.

Rebased and rescoped. This branch was rebased onto main after PRs #101 + #102 landed; the Python changes that lived in earlier revisions of this branch are now in main and have been dropped. What remains is the TS-only change.

What changes

  • agent-sdk/typescript/src/service.ts: #dispatchPrompt now publishes encodeChunk({type:"status",status:"ack"}) synchronously after envelope decode and before invoking the handler. The comment block uses "leading ack" terminology for cross-SDK consistency with the Python side. The pre-existing periodic keep-alive interval is retained as a defense against the §6.6 inactivity timer; keepaliveIntervalS now controls only the periodic cadence, not the leading ack.
  • agent-sdk/typescript/test/integration/agent-service.test.ts: new test "emits the §6.4 mandatory leading 'ack' as the first chunk before the handler runs". The handler blocks on a promise, and the test asserts that the ack lands before the handler is released. A try/finally around the pre-release assertions guarantees the suspended handler is unblocked even on assertion failure.
  • agent-sdk/typescript/CHANGELOG.md: [Unreleased] entry mirroring the shape of the Python SDK's agent-sdk/python/CHANGELOG.md.
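
The ordering described in the list above (decode, then leading ack, then handler) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the SDK's code: the `Chunk`, `ReplyMsg`, and `Handler` types, the `decodePrompt` helper, the `response` chunk shape, and the JSON-string encoding are all hypothetical. Only the names `encodeChunk`, `msg.respond`, and the ack shape `{type:"status",status:"ack"}` appear in this PR.

```typescript
// Hypothetical, simplified shapes; the real SDK types are not shown in this PR.
type Chunk =
  | { type: "status"; status: "ack" | "done" }
  | { type: "response"; data: string };

interface ReplyMsg {
  data: string;
  respond(payload: string): void;
}

type Handler = (prompt: string, send: (chunk: Chunk) => void) => Promise<void>;

// Stand-in for the SDK's encodeChunk; here a chunk is just a JSON string.
function encodeChunk(chunk: Chunk): string {
  return JSON.stringify(chunk);
}

// Hypothetical envelope decode: returns undefined for a malformed request.
function decodePrompt(raw: string): string | undefined {
  try {
    const parsed = JSON.parse(raw) as { prompt?: unknown };
    return typeof parsed.prompt === "string" ? parsed.prompt : undefined;
  } catch {
    return undefined;
  }
}

async function dispatchPrompt(msg: ReplyMsg, handler: Handler): Promise<void> {
  const prompt = decodePrompt(msg.data);
  if (prompt === undefined) {
    // Decode failed: the request was never accepted, so no ack is sent.
    // (The real service reports an error here; omitted in this sketch.)
    return;
  }

  // Leading ack: published before any handler work. Everything up to the
  // first `await` below runs synchronously when dispatchPrompt is called.
  const ackBytes = encodeChunk({ type: "status", status: "ack" });
  try {
    msg.respond(ackBytes);
  } catch {
    // Best effort: a failed publish does not stop the handler.
  }

  try {
    await handler(prompt, (chunk) => msg.respond(encodeChunk(chunk)));
  } finally {
    msg.respond(encodeChunk({ type: "status", status: "done" }));
  }
}
```

Because an `async` function runs synchronously up to its first `await`, a caller that does not await `dispatchPrompt` still observes the ack as soon as the call returns, even if the handler then blocks.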

Wire compatibility

Fully compatible. Callers already decode arbitrary status chunks via @synadia-ai/agents. The only observable change is one extra ack event at the head of every stream, which clients have always accepted.
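
To illustrate why one extra leading ack is harmless to a tolerant consumer, here is a minimal sketch of a reader that skips ack chunks. `StreamChunk`, `collectResponses`, and the `response` chunk shape are hypothetical names for this example; this is not the `@synadia-ai/agents` API.

```typescript
// Hypothetical decoded chunk shape, modeled on the ack asserted in this PR's test.
type StreamChunk =
  | { type: "status"; status: "ack" | "done" }
  | { type: "response"; data: string };

// Collects response payloads, ignoring any number of ack chunks and
// stopping at the terminating "done" status.
function collectResponses(chunks: readonly StreamChunk[]): string[] {
  const out: string[] = [];
  for (const chunk of chunks) {
    if (chunk.type === "status") {
      if (chunk.status === "done") break;
      continue; // "ack": leading or keep-alive ack, carries no payload
    }
    out.push(chunk.data);
  }
  return out;
}

const withoutLeadingAck: StreamChunk[] = [
  { type: "response", data: "hello" },
  { type: "status", status: "done" },
];
const withLeadingAck: StreamChunk[] = [
  { type: "status", status: "ack" },
  ...withoutLeadingAck,
];
```

Both streams yield the same payloads, which is the sense in which the change is wire-compatible for a consumer written this way.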

Test plan

  • agent-sdk/typescript typecheck + eslint + prettier --check + 38 tests (37 existing + 1 new) all green against a local NATS broker
  • client-sdk/typescript 280 tests still pass — wire decode untouched
  • Verified the new test fails fast (assertion-error, not timeout) if the eager ack is removed from #dispatchPrompt


synadia-claude-reviewer Bot commented May 11, 2026

Claude finished @M64GitHub's task in 12m 38s


Review

  • Gather context (PR diff, source files)
  • Review service.ts changes
  • Review integration test
  • Review CHANGELOG
  • Post findings

The implementation is correct and the approach is sound. The ack is emitted synchronously before await handler(...), which is the right place — it satisfies §6.4's "first message on the reply subject" requirement regardless of handler latency. The ackBytes variable is cleanly reused by the keepalive interval, and the CHANGELOG entry is accurate.

Three findings follow: one parity gap to resolve before merge and two non-blocking notes.

1. Python AgentService parity (missing fix)

agent-sdk/python/src/synadia_ai/agent_service/service.py _on_prompt_request() still only emits acks via _keepalive_loop — no eager initial ack before the handler runs. It's the exact same pre-PR pattern just fixed in TS. The downstream impact table flags agents/claude-code as a follow-up but doesn't mention the Python agent-sdk. This PR should either fix the Python side or explicitly call it out as a tracked follow-up (a TODO comment and an issue are both acceptable; silently leaving it is not, given the spec tightening applies to both).

2. Test resource leak on assertion failure (inline comment on test line 128)

If either pre-releaseHandler assertion fails, releaseHandler!() is never called, leaving the handler suspended until the test runner timeout fires. A try/finally around the polling block would make the test faster to recover and cleaner to debug.

3. Silent initial ack failure → handler runs for a gone caller (inline comment on service.ts line 578)

Consistent with the existing error-handling pattern (PromptResponse.send() and the keepalive both eat respond() errors), so not a regression. Worth a follow-up to abort early when the initial ack fails, to avoid paying LLM/network costs for a dead connection.

Items 2 and 3 are non-blocking — item 1 should be resolved (either fixed or explicitly tracked) before this merges.

const ackBytes = encodeChunk(ack);
try {
  msg.respond(ackBytes);
} catch {

If the initial ack fails here (connection already gone), the handler still runs and does potentially expensive work (LLM call, etc.) that no one will receive. All subsequent response.send() calls will also silently throw for the same dead connection.

This is consistent with existing error-handling elsewhere in the file (keepalive and PromptResponse.send() both eat errors), but it's worth a follow-up: an early-exit when the initial ack fails would prevent wasted work. Not a blocker for this PR's scope.
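
One possible shape for that follow-up is sketched below. It is hypothetical and not part of this PR: `Replyable`, `sendLeadingAck`, and `dispatchWithEarlyAbort` are invented names, and the sketch assumes that publishing on a dead connection throws synchronously from `respond()`.

```typescript
// Hypothetical minimal message shape for this sketch.
interface Replyable {
  respond(payload: string): void;
}

// Returns true if the leading ack was published, false if publishing threw.
function sendLeadingAck(msg: Replyable): boolean {
  try {
    msg.respond(JSON.stringify({ type: "status", status: "ack" }));
    return true;
  } catch {
    return false;
  }
}

// Runs the handler only when the leading ack went out.
async function dispatchWithEarlyAbort(
  msg: Replyable,
  handler: () => Promise<void>,
): Promise<"handled" | "aborted"> {
  if (!sendLeadingAck(msg)) {
    return "aborted"; // reply path is gone; skip the expensive work
  }
  await handler();
  return "handled";
}
```

The trade-off is that a transient publish failure would also skip the handler, which is a behavior change relative to the current best-effort pattern.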

while (messages.length < 1 && Date.now() < deadline) {
  await new Promise((r) => setTimeout(r, 25));
}
expect(messages[0]).toEqual({ type: "status", status: "ack" });

If either assertion on lines 128-129 fails, releaseHandler!() is never called — the handler stays suspended in handlerCanProceed, and iter is a dangling unresolved promise. afterEach calls service.stop() which stops the NATS microservice, but the handler task itself stays alive until the test runner times it out.

This won't cause false positives (Vitest throws immediately on assertion failure), but it does mean a single unexpected failure here can slow down the suite by the configured test timeout. Consider a try/finally around the polling + pre-release assertions to ensure releaseHandler!() is always called.
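
The suggested pattern can be sketched in isolation as follows. This is an illustration only: `Gate`, `createGate`, and `assertThenRelease` are hypothetical helpers, not code from the test file, and the assertions callback is synchronous here for brevity whereas the real test polls asynchronously.

```typescript
// A gate the test controls: the handler awaits `wait`, the test calls `release`.
interface Gate {
  wait: Promise<void>;
  release: () => void;
  isReleased: () => boolean;
}

function createGate(): Gate {
  let released = false;
  let resolveWait: () => void = () => {};
  const wait = new Promise<void>((resolve) => {
    resolveWait = resolve;
  });
  return {
    wait,
    release: () => {
      released = true;
      resolveWait();
    },
    isReleased: () => released,
  };
}

// Runs the pre-release assertions and always opens the gate afterwards,
// so a failed assertion cannot leave the handler suspended.
function assertThenRelease(gate: Gate, assertions: () => void): void {
  try {
    assertions();
  } finally {
    gate.release();
  }
}
```

The assertion error still propagates to the test runner; the `finally` only ensures the suspended handler is released first.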

M64GitHub added a commit that referenced this pull request May 11, 2026
Addresses the three reviewer items on PR #100:

1. **Python `AgentService` parity (the main miss).** The Python
   agent-sdk had the exact same pre-PR pattern — `_keepalive_loop`
   only, no eager initial ack. Now emits the §6.4 mandatory ack
   synchronously in `_on_prompt_request` before the handler runs,
   mirroring the TS change. Brings the Python reference agent and
   every Python harness consuming `AgentService` into spec compliance
   automatically.

   - `service.py`: eager `await request.respond(encode_chunk(...))`
     before the keep-alive task is created.
   - `test_keepalive_e2e.py`: renamed + updated existing tests for
     the new "always one initial ack" semantics; added
     `test_initial_ack_arrives_before_handler_runs` that gates the
     handler on `asyncio.Event` and pins the eager behavior.
   - `test_query_e2e.py`, `test_error_completion_e2e.py`,
     `test_echo_e2e.py`, `test_attachments_e2e.py`,
     `test_reference_agent_e2e.py`: account for the additional ack
     frame in their wire-shape / chunk-count assertions.
   - `CHANGELOG.md`: `[Unreleased]` entry mirroring the TS one.

2. **TS test resource leak on assertion failure.** Wrapped the
   pre-`releaseHandler()` polling block in a `try { ... } finally
   { releaseHandler!(); }` so a failed assertion still releases the
   suspended handler — keeps the test runner from hanging until its
   own timeout.

3. **TS CHANGELOG.md prettier formatting.** CI's `format:check` was
   tripping on indentation in the new `[Unreleased]` block;
   `prettier --write` reflowed it cleanly. (Was the cause of the
   `CI — agent-sdk/typescript / Lint + typecheck (pull_request)`
   failure.)

Reviewer's third note (silent initial-ack publish failure → handler
runs on a dead reply subject) intentionally not addressed in this
commit — it matches the existing best-effort error-handling pattern
across `PromptResponse.send()` and the keepalive loop in both
languages. Worth a follow-up that adds early-abort across all four
sites at once.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@M64GitHub

@claude - please review again, I think we fixed everything


synadia-claude-reviewer Bot commented May 11, 2026

Claude finished @M64GitHub's task in 1m 47s


Re-review

  • Gather context
  • Review Python AgentService parity fix (service.py)
  • Review Python tests (keepalive, echo, error completion, etc.)
  • Review TS test resource leak fix (agent-service.test.ts)
  • Review TS service.ts
  • Review changelogs
  • Post findings

All three issues from the previous review have been addressed. Here's the rundown:

Issue 1 — Python parity ✅ Fixed

agent-sdk/python/src/synadia_ai/agent_service/service.py lines 426–439 now emit the mandatory initial ack before the handler runs, exactly mirroring the TS pattern:

try:
    await request.respond(encode_chunk(StatusChunk(status="ack")))
except Exception:
    # Best-effort, matching the keepalive loop and PromptStream
    # error-handling pattern…
    log.exception("initial ack emit failed on %s", request.subject)

The implementation is correct: ack fires before the await handler(envelope, stream) call, error handling is consistent with the existing pattern, and the keepalive_interval_s=None path correctly skips the task while still emitting the mandatory initial ack.

Issue 2 — TS test resource leak ✅ Fixed

The polling + assertion block is now wrapped in try/finally (lines 128–137), guaranteeing releaseHandler!() is called even on assertion failure:

try {
  // poll + assertions
} finally {
  releaseHandler!();
}

The Python counterpart (test_initial_ack_arrives_before_handler_runs in test_keepalive_e2e.py, lines 222–238) follows the same pattern, with handler_can_proceed.set() in a finally block.

Issue 3 (informational): silent ack failure ✅ Acknowledged

Behavior unchanged (consistent with existing error-handling), which is the right call for this PR's scope.


Additional observations

Python test coverage is thorough. test_keepalive_e2e.py covers all four scenarios: initial ack + keepalive, initial ack only (keepalive disabled), eager pre-handler ack via gated handler, and no-ack-between-error-and-terminator. The test for fast handlers correctly expects exactly one ack.

test_error_completion_e2e.py updated correctly. The 500 path now asserts 3 messages (initial ack → error frame → terminator), while the 400 path still asserts 2 messages (error frame → terminator) — which is correct, because the decode failure returns before the ack is emitted, and the finally still runs the terminator.
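
The message counts above follow from where the ack sits relative to the decode step. A small model of the expected frame order, written in TypeScript for consistency with the rest of this PR, is sketched below. `FrameKind`, `Outcome`, and `expectedFrames` are hypothetical names, and the success path is simplified to a single response frame.

```typescript
type FrameKind = "ack" | "response" | "error" | "terminator";
type Outcome = "decode-failure" | "handler-error" | "success";

// Hypothetical model of the frame order described above; not SDK code.
function expectedFrames(outcome: Outcome): FrameKind[] {
  const frames: FrameKind[] = [];
  if (outcome === "decode-failure") {
    // 400 path: the request is rejected before the ack is emitted.
    frames.push("error");
  } else {
    frames.push("ack"); // accepted request: leading ack comes first
    frames.push(outcome === "handler-error" ? "error" : "response");
  }
  frames.push("terminator"); // always sent, like the `finally` in the service
  return frames;
}
```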

Minor: TS CHANGELOG indentation drift. Two continuation lines in agent-sdk/typescript/CHANGELOG.md are missing leading indentation (the --wait-for-empty and status:"ack"} lines have no indent relative to the surrounding bullet). Cosmetic only — not a blocker.


Verdict

All three originally flagged items are resolved. The Python and TypeScript implementations are now in parity, test coverage pins the eager-ack behavior on both sides, and the changelogs are accurate. Ready to merge (minor CHANGELOG formatting aside).

Spec §6.4 was tightened to require an immediate
`{"type":"status","data":"ack"}` chunk as the first message on the
reply subject, before any latency-inducing handler work
(nats-agent-sdk-docs@b1c6972). The Python SDK adopted this in PR #101
(merged as 0.4.0); this commit brings `@synadia-ai/agent-service`
into parity.

Change: `AgentService.#dispatchPrompt` publishes
`encodeChunk({type:"status",status:"ack"})` synchronously after a
successful envelope decode and before invoking the user-supplied
handler. The pre-existing periodic keep-alive interval is retained
for §6.6 inactivity-timer defense — the spec text no longer mentions
it but the wire shape remains valid; `keepaliveIntervalS` now
controls only the periodic cadence, not the leading ack.

Added integration test pinning the eager behavior: a handler that
blocks on a promise, the test asserts the ack arrives before the
handler is released. try/finally around the pre-release assertions
makes a failed expectation reliably unblock the suspended handler
so the test runner doesn't hang.

Wire-compatible — callers already decode arbitrary `status` chunks
via `@synadia-ai/agents`. Brings the TS reference agent and
`agents/open-agent` into compliance automatically. The other
agent harnesses (pi, openclaw, claude-code-headless, pi-headless)
already comply via their own pre-handler ack publishes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
M64GitHub force-pushed the sdk-mandatory-initial-ack branch from c8879d4 to 5c297d2 on May 11, 2026 18:52
M64GitHub changed the title from "AgentService: emit §6.4 mandatory initial ack before handler runs" to "AgentService (TS): emit §6.4 mandatory leading ack before handler runs" on May 11, 2026
@M64GitHub

@claude - we had to abandon the python stuff, as that is already done meanwhile by PR 101 + PR 102. Can you please check now.


synadia-claude-reviewer Bot commented May 11, 2026

Claude finished @M64GitHub's task in 2m 4s


Re-review (TS-only scope)

  • Gather context — confirmed diff is 3 TS-only files, no Python changes
  • Review service.ts changes
  • Review integration test changes
  • Review CHANGELOG
  • Post findings

The diff is TS-only: agent-sdk/typescript/src/service.ts, the integration test, and the CHANGELOG. No Python files changed — consistent with the PR description that Python parity already landed in PRs #101 + #102.

service.ts

The implementation is correct:

  • ackBytes is declared before the keepalive timer setup (line 576) and reused by the setInterval callback (line 596) — no duplication.
  • msg.respond(ackBytes) fires synchronously at line 578, before await handler(envelope, response) at line 614. This satisfies §6.4 unconditionally.
  • The 400-error/malformed-envelope path at lines 550–562 exits before the ack is emitted — correct, since a decode failure means the request was never accepted.
  • try/catch around the leading msg.respond() (lines 577–583) is best-effort and consistent with the keepalive loop and PromptResponse.send() error-handling pattern.
  • The comment block (lines 566–574) accurately describes §6.4, §6.6, the nats req --wait-for-empty observability note, and Python parity.
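
The relationship between the leading ack and the periodic keep-alive described in these bullets can be sketched as below. This is an assumption-laden illustration rather than the code in service.ts: `Responder` and `startAcks` are hypothetical names, and chunks are encoded as plain JSON strings for simplicity.

```typescript
// Hypothetical minimal message shape for this sketch.
interface Responder {
  respond(payload: string): void;
}

// Sends the leading ack, then optionally starts a periodic keep-alive that
// reuses the same encoded bytes. Returns a function that stops the timer.
function startAcks(msg: Responder, keepaliveIntervalS: number | null): () => void {
  const ackBytes = JSON.stringify({ type: "status", status: "ack" });

  try {
    msg.respond(ackBytes); // leading ack: sent regardless of the interval setting
  } catch {
    // Best effort, as in the keep-alive callback below.
  }

  if (keepaliveIntervalS === null) {
    return () => {}; // periodic cadence disabled; nothing to stop
  }

  const timer = setInterval(() => {
    try {
      msg.respond(ackBytes);
    } catch {
      // Best effort.
    }
  }, keepaliveIntervalS * 1000);

  return () => clearInterval(timer);
}
```

A caller would typically invoke the returned stop function in a `finally` around the handler so the interval never outlives the request.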

Integration test ✅

  • startService() defaults keepaliveIntervalS: null (line 48 in the describe block), so the only ack before handler completion is the mandatory leading one — no keepalive-interval acks race with the assertion. The test is deterministic.
  • try/finally at lines 124–133 guarantees releaseHandler!() is called even on assertion failure — resource leak concern from the prior review is resolved.
  • services.push(service) inside startService() (line 51) means afterEach cleanly stops the service. No teardown leak.
  • Full stream shape is verified after releasing the handler: messages[0] is ack, a response with "late response" is present, and messages.at(-1) is status:done.
  • Minor cosmetic: messages[0] is asserted twice (line 129 inside try, and line 139 after await iter). Harmless — the second is a post-release sanity check that the array head didn't shift.

CHANGELOG ✅

  • Entry is under [Unreleased], in the ### Changed section. Correct placement.
  • Continuation lines are 2-space indented relative to the bullet — indentation drift noted in the first review is fixed.
  • Spec commit hash linked, Python parity noted, wire-compatibility explained. No issues.

Verdict

Ready to merge. All three issues from the initial review are resolved (Python parity via PRs #101/#102 is confirmed in main; TS test resource leak fixed; silent-ack-failure acknowledged as consistent with existing pattern). The scope is correctly narrowed to TS-only.

@M64GitHub M64GitHub merged commit 0d6835b into main May 11, 2026
14 checks passed