add LocalProcessIntrospectTool for incident response #2518
Wraps app/agents/probe.py + tail buffer + error signals as a first-class OpenSRE investigation tool. The planner calls this to diagnose stuck or misbehaving local agents from the opensre interactive shell.
- @tool decorator with surfaces=('investigation',)
- Returns psutil snapshot + last 50 stdout lines + error/retry rates
- Registered in telemetry allowlist (_TOOLS_WITHOUT_DELIBERATE_CATCH)
- 13 tests including contract, edge cases (missing PID, no stdout, etc.)
Greptile Summary
This PR adds
Confidence Score: 5/5
Safe to merge; the tool is additive, the bounded seek-before-read correctly prevents OOM on large log files, and all edge cases have test coverage.
The change is purely additive: no existing behaviour is modified beyond a symbol rename that is fully reflected across all call-sites and tests. The previously identified unbounded-read risk is addressed by the seek-offset approach, and the 13 new tests (including the large-file bound test) give good confidence in the implementation.
No files require special attention beyond the minor inconsistency in LocalProcessIntrospectTool/__init__.py noted in the review comments.
Important Files Changed
Sequence Diagram
sequenceDiagram
participant Planner as Investigation Planner
participant Tool as local_process_introspect
participant Probe as probe()
participant Tail as _read_stdout_tail()
participant RT as resolve_target()
participant FS as Filesystem (stdout file)
participant ES as ErrorSignals
Planner->>Tool: "local_process_introspect(pid=1234)"
Tool->>Probe: probe(pid) [blocks 100ms for cpu%]
Probe-->>Tool: "ProcessSnapshot | None"
Tool->>Tail: _read_stdout_tail(pid)
Tail->>RT: resolve_target(pid)
RT-->>Tail: ResolvedTarget(path) or raises AttachUnsupported/OSError
alt resolve succeeds
Tail->>FS: fstat → seek(size - DEFAULT_MAX_BYTES) → read()
FS-->>Tail: ≤4 MiB bytes
Tail-->>Tool: "last 50 lines as str | None"
else resolve fails
Tail-->>Tool: None
end
alt stdout_tail is truthy
Tool->>ES: ErrorSignals().observe(stdout_tail)
ES-->>Tool: rate_per_minute() → error_counts dict
end
Tool-->>Planner: "{snapshot, stdout_tail, error_counts}"
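The flow in the diagram can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the PR's implementation: `introspect`, `count_error_signals`, and `ERROR_MARKERS` are hypothetical names, the probe and tail reader are passed in as callables so the sketch runs without psutil, and returning an empty `error_counts` when there is no stdout tail is a guess at the behaviour.

```python
from collections import Counter
from typing import Callable, Optional

# Hypothetical markers; the real ErrorSignals patterns are not shown on this page.
ERROR_MARKERS = ("error", "retry", "timeout")


def count_error_signals(stdout_tail: str) -> dict:
    """Count lines that contain each marker (a stand-in for ErrorSignals)."""
    counts = Counter()
    for line in stdout_tail.splitlines():
        lowered = line.lower()
        for marker in ERROR_MARKERS:
            if marker in lowered:
                counts[marker] += 1
    return dict(counts)


def introspect(
    pid: int,
    probe: Callable[[int], Optional[dict]],
    read_stdout_tail: Callable[[int], Optional[str]],
) -> dict:
    """Combine snapshot, stdout tail, and error counts into one result."""
    snapshot = probe(pid)                # None when the PID is missing
    stdout_tail = read_stdout_tail(pid)  # None when stdout cannot be resolved
    # Only scan for error signals when there is a non-empty tail to scan.
    error_counts = count_error_signals(stdout_tail) if stdout_tail else {}
    return {
        "snapshot": snapshot,
        "stdout_tail": stdout_tail,
        "error_counts": error_counts,
    }
```

Injecting the two callables keeps each branch of the diagram (missing PID, no stdout) testable with simple lambdas.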
Reviews (4): Last reviewed commit: "docs: merge local_process_introspect int..."
…vent OOM
- Rename _resolve_target -> resolve_target (public API, added to __all__)
- Replace unbounded f.read() with seek-based bounded read in _read_stdout_tail
- Use os.fstat + DEFAULT_MAX_BYTES to read only the last 4 MB of stdout
c5e537e to 17f9416
@greptile review
found a couple of things:
- looks like the new tool is missing its docs page and docs/docs.json entry.
  AGENTS.md says new tools should ship with docs/<tool_name>.mdx plus a matching docs/docs.json entry in the same pr.
  this pr adds the tool source, tail rename, and tests, but nothing under docs/, and i don't see a local_process_introspect entry in docs/docs.json.
  since the other tools in app/tools/ have matching docs pages, the @tool decorator text is currently the only user-visible documentation for this one.
- the 4 mib seek-based bounded read path doesn't seem to be tested.
  the commit says it replaced the unbounded f.read() with a bounded read using os.fstat + DEFAULT_MAX_BYTES, but the current _read_stdout_tail test only writes a small sample file. because of that, offset is always 0 and this branch never runs:
      if offset > 0:
          f.seek(offset)
  so the oom regression this was meant to prevent is not really covered yet.
  could we add a test with a stdout file larger than 4 mib, assert that the read starts at st_size - DEFAULT_MAX_BYTES, and also check that the returned tail still includes the file's last line?
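A rough sketch of the kind of test requested above, assuming a 4 MiB cap and a 50-line tail. `read_tail_bounded` and `check_large_file_is_bounded` are hypothetical stand-ins, not the PR's `_read_stdout_tail` or its test; the reader returns its start offset only so the bound can be asserted without mocking.

```python
import os
import tempfile

DEFAULT_MAX_BYTES = 4 * 1024 * 1024  # assumed value: 4 MiB
TAIL_LINES = 50                      # the tool reports the last 50 stdout lines


def read_tail_bounded(path: str, max_bytes: int = DEFAULT_MAX_BYTES) -> tuple:
    """Return (start_offset, tail) while reading at most max_bytes from the end."""
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        offset = max(0, size - max_bytes)
        if offset > 0:
            f.seek(offset)  # skip everything before the final max_bytes
        data = f.read()
    # The first decoded line may be partial after a seek; it is dropped by the slice
    # whenever the window holds more than TAIL_LINES lines.
    lines = data.decode("utf-8", errors="replace").splitlines()
    return offset, "\n".join(lines[-TAIL_LINES:])


def check_large_file_is_bounded() -> tuple:
    """Write a file larger than the cap, then read its tail."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "stdout.log")
        with open(path, "wb") as f:
            f.write(b"filler line\n" * 400_000)  # 4,800,000 bytes, above 4 MiB
            f.write(b"FINAL LINE\n")
        size = os.path.getsize(path)
        offset, tail = read_tail_bounded(path)
    return size, offset, tail
```

With a file this size the `offset > 0` branch runs, so the seek path is exercised rather than skipped.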
…ad test
- error_signals key renamed to error_counts since rate_per_minute() returns raw counts when all events share a single timestamp
- add docs/local_process_introspect.mdx and docs.json entry
- add test_read_stdout_tail_bounded_read_large_file covering the offset > 0 / f.seek() path for files exceeding DEFAULT_MAX_BYTES
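To illustrate why the key name matters, here is one way a per-minute rate could fall back to raw counts when every event carries the same timestamp. This is an assumption about the behaviour described in the commit message, not the actual ErrorSignals code; the function signature and the event tuple shape are hypothetical.

```python
from collections import Counter


def rate_per_minute(events: list) -> dict:
    """events is a list of (timestamp_seconds, signal_name) pairs (hypothetical shape)."""
    if not events:
        return {}
    counts = Counter(name for _, name in events)
    timestamps = [ts for ts, _ in events]
    span_minutes = (max(timestamps) - min(timestamps)) / 60.0
    if span_minutes <= 0:
        # All events share one timestamp, so there is no window to divide by:
        # the result is a raw count per signal, not a rate.
        return {name: float(n) for name, n in counts.items()}
    return {name: n / span_minutes for name, n in counts.items()}
```

When a whole stdout tail is observed in a single call, all events would plausibly be stamped at the same moment, which is the case where the values are counts rather than rates.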
@greptile review
@kespineira I have implemented the changes you wanted. A review would help.
@X1Vi small docs follow-up (non-blocking): since local_process_introspect is conceptually part of the agents subsystem, i think the cleanest home for this content would be a section inside docs/agents.mdx.
it imports resolve_target, probe, and ErrorSignals from app/agents/, and exposes that same machinery as a planner tool. docs/agents.mdx is also where /agents trace is already described, so it feels like the most natural place for this content.
as a smaller change, moving the standalone page from Observability and incidents to Integrations > Overview, next to agents, would already fix the current SaaS-vs-local grouping mismatch.
@kespineira Done
Done @kespineira
Thanks for your effort. Just two small long-term thoughts:
@cerencamkiran I am assuming these are points of concern rather than instructions for next steps.
@greptile review
@X1Vi demo is missing, it doesn't make sense. please run a proper investigation scenario with LocalProcessIntrospectTool
Wraps app/agents/probe.py + tail buffer + error signals as a first-class OpenSRE investigation tool. The planner calls this to diagnose stuck or misbehaving local agents from the opensre interactive shell.
Fixes #1506
Code Understanding and AI Usage
Did you use AI assistance (ChatGPT, Claude, Copilot, etc.) to write any part of this code?
If you used AI assistance:
Explain your implementation approach:
I built an inspection tool that takes a PID. We pass the PID into the function, which snapshots the process, tails its stdout log, and then checks the tail for error signals. The output combines all three: the process snapshot, the error signals, and the tailed log lines.
The tool can be run via the functions shown in the demo.
I have also added tests for this in telemetry.
Checklist before requesting a review
Demo
Screencast from 2026-05-25 19-02-32.webm
Note: Please check Allow edits from maintainers if you would like us to assist in the PR.