fix(debugger): use bytecode/sys.monitoring instead of setattr wrapper#772
fix(debugger): use bytecode/sys.monitoring instead of setattr wrapper#772liustve wants to merge 25 commits into
Conversation
…capture Surgical rebuild of PR aws-observability#770 that fixes the named throwable-capture regression without inheriting the archive's 13-commit churn or its 12 documented correctness regressions. Logical changes (single commit; granularity preserved in tests / structure): 1. SysMonitoringEngine (Python 3.12+): add PROBE / function-level BREAKPOINT support via PY_START + PY_RETURN + PY_UNWIND. PY_UNWIND is set globally (not local-event-capable on 3.12-3.14). New methods: enable_function_level_instrumentation, disable_function_level_instrumentation, _function_start_event_handler, _function_return_event_handler, _function_unwind_event_handler, _handle_function_level_instrumentation, _build_throwable. Reentrancy guard via threading.local. Three critical leak fixes vs first draft: - PY_RETURN/PY_UNWIND pop _call_start_ns BEFORE reentrancy/membership checks - disable_function_level_instrumentation sweeps in-flight stamps across all threads - disable_breakpoints_for_function explicitly clears _instrumentation_types and sweeps _call_start_ns 2. BytecodeInjectionEngine (Python 3.9/3.10): SETUP_FINALLY-based bytecode rewrite. New module-level _FUNCTION_HANDLER_NAME / _REFUSAL_MASK (inspect.CO_GENERATOR | CO_COROUTINE | CO_ASYNC_GENERATOR | CO_ITERABLE_COROUTINE). InjectionState gains optional function_metadata bag (zero new dicts). Generator/coroutine/async-generator refused. 3.11 deferred to a follow-up commit (high-level bytecode API forbids nested TryBegin; needs ConcreteBytecode + exception_table). 3. _undecorate.py: BFS through __wrapped__ / functools.partial / closure cells with co_filename disambiguation. Fixes the silent "wrong location" bug for @login_required / @cache_page / any @functools.wraps decorator. Without this, instrumenting `module.my_view` rewrites the decorator's wrapper, not my_view. 4. Manager wiring (instrumentation_manager.py): - _instrument_function_level helper calls engine first - _apply_function: engine path for line=0 BPs, setattr wrapper fallback only when engine refuses (preserves async support via wrapper) - _remove_function: also calls disable_function_level_instrumentation - _build_resource: deployment.environment -> deployment.environment.name (the engines' _get_environment helpers already read the corrected key) - _snapshot_emitter.initialize() called eagerly at __init__ to avoid RuntimeError("cannot schedule new futures after interpreter shutdown") when BatchLogRecordProcessor's daemon worker tries to spawn from a sys.monitoring callback thread 5. SnapshotOtlpEmitter.initialize(): public wrapper around _ensure_initialized so the manager can eagerly initialize on a normal user thread. Tests: - 10 new function-level tests in test_bytecode_injection_engine.py (basic return, throwable capture, generator/coroutine/async-gen refusals, disable-restores-original, kwonly+varargs, recursion, throwable-inside- user-try, decorator unwrap) - test_sys_monitoring_engine.py: assert all 4 callbacks register (LINE + PY_START + PY_RETURN + PY_UNWIND) and PY_UNWIND set_events - test_instrumentation_manager.py: deployment.environment.name key Validation: - Python 3.9: 526 passed, 35 skipped - Python 3.10: 526 passed, 35 skipped - Python 3.11: 516 passed, 45 skipped (10 function-level deferred) - Python 3.12: 520 passed, 41 skipped (uses SysMonitoringEngine) - black, isort, flake8: clean across 47 files - Regression harness vs origin/main: 0 unexpected deltas (12 scenarios) Things deliberately NOT carried over from the archive: - _bytecode_templates.py (430 LOC) — inlined into BIE - _snapshot_factory.py — Snapshot built inline like SME does - _function_entries dict — duplicates _injection_states - _tls = threading.local() — dead state in archive - super().__init__() — ABC has no __init__ - _LocalSlots / _InjectionInstrs dataclasses — ceremony - disable_function_entry fallback scan-by-code — would match closure siblings - start_ns == 0 skip sentinel — collides with serializer-error fallback - 12 other regressions documented in the deep-analysis workflow output
Claude PR Review
|
…into initialize The original split (public initialize() wrapping private _ensure_initialized()) was redundant — both did the exact same thing and once initialize() was a public API, _ensure_initialized stopped being internal-only. One method now serves both eager startup-time init and lazy on-first-emit init via the existing double-checked locking.
Ports the conftest fix from PR aws-observability#770. The schema base URL now points at a versioned raw.githubusercontent.com path instead of opentelemetry.io/docs/specs/semconv, making schema-validation tests deterministic when the docs site drifts.
|
Claude finished @liustve's task in 18m 3s —— View job Claude PR ReviewReviewed the engine rebuild end-to-end against Critical1.
2. SME The BIE engine sidesteps this by stashing
A SME-side recursion test would have caught this — please add one alongside the fix. Medium3. 4. Global PY_UNWIND fires on every Python exception in the process. Minor / nits
Tests
Things that look right
|
…ice-walk Adds 3.11 BytecodeInjectionEngine support so PROBE / function-level BREAKPOINT no longer falls through to the setattr wrapper on 3.11 (restoring throwable capture parity with 3.9/3.10/3.12+). Why this needs a different path than 3.9/3.10: - 3.11 removed SETUP_FINALLY (the opcode the existing path uses) - 3.11 introduced PEP 657 exception_table — protected regions live out-of-band as metadata, not in the instruction stream - bytecode==0.18.1 forbids nested TryBegin pseudo-instructions Splice-walk pattern (no nesting): - Single shared exception handler label at end of function - Outer protected region opens after RESUME 0 / closure preamble - On every user TryBegin: emit TryEnd before it (closing outer) - On matching user TryEnd: open fresh outer TryBegin after it - Result: N adjacent (never nested) outer regions, all targeting the same handler — bytecode==0.18.1 accepts this and writes the correct co_exceptiontable Critical 3.11 details: - Prologue must come AFTER RESUME 0 / COPY_FREE_VARS / MAKE_CELL so closures and free-vars are armed when the entry hook fires - LOAD_GLOBAL takes a tuple oparg `(push_null: bool, name: str)` - PRECALL/CALL must agree on argcount - Handler is entered with push_lasti=True → stack is [..., lasti, exc] - co_exceptiontable handlers do NOT auto-set sys.exc_info(); we use PUSH_EXC_INFO at handler entry to make sys.exc_info() see the live exception so _build_throwable can populate the snapshot - RERAISE 0 (not 1) at the foot — we discard lasti via PUSH_EXC_INFO's side effect, then reraise the top-of-stack exception New methods (alongside existing 3.9/3.10 helpers): - _build_function_prologue_311 - _build_function_entry_call_311 - _build_function_exit_call_311 - _build_function_unwind_foot_311 - _create_code_with_function_wrap_311 (the splice-walk driver) Test gate widened: TestBytecodeInjectionEngineFunctionLevel now runs on 3.9-3.11 (all 10 tests pass), where it previously skipped 3.11. Cross-version validation: 3.9: 526 passed 3.10: 526 passed 3.11: 526 passed (was 516 — +10 function-level tests no longer skipped) 3.12: 520 passed - black / isort / flake8: clean - Regression harness vs origin/main on 3.11: PASS, 0 unexpected deltas
Claude PR Review
|
…e leaks, recursion, perf) Four targeted fixes from the PR review: CRITICAL: BIE disable_function_level_instrumentation no-op after undecorate redirect. Enable redirects through @functools.wraps to instrument the inner function and keys state by id(real_func), but the manager passes the wrapper back on disable → id(wrapper) != id(real_func) → state lookup misses, inner __code__ stays patched until interpreter exit. Disable now scans by (file_path, qualified_name, __globals__) when the direct id lookup misses, finding the same record the enable path created. Regression test added: test_function_level_disable_after_ decorated_enable_restores_inner. CRITICAL: SME _call_start_ns clobbered on recursion. Replaced Dict[(code_id, thread_id), int] with threading.local() LIFO stack of (code_id, start_ns) tuples. PY_START pushes; PY_RETURN/PY_UNWIND walks back to find the matching code_id and pops. Recursive calls on the same thread now preserve outer-frame stamps, so every nested call reports a correct duration. No cross-thread sweep needed on disable: per-thread state dies with the thread or gets popped on the next event. Two regression tests added. MEDIUM: BIE _handle_function_event linear scan for instrumentation_type. Replaced an O(N) iteration over _injection_states.values() with a parallel Dict[(function_key, 0), str] keyed lookup, mirroring SME's existing _instrumentation_types dict. Populated on enable, cleared on disable/disable_breakpoints_for_function/cleanup. MEDIUM: SME PY_UNWIND check ordering. PY_UNWIND fires globally on every Python exception in the process. Reordered the handler to do the cheap _function_keys membership check FIRST, then pop the start_ns stack only for our codes. Exception-heavy apps with no instrumented functions now pay one dict miss per unwind frame instead of one dict miss + a stack walk. Cross-version validation: 3.9: 527 passed, 37 skipped (was 526) 3.10: 527 passed, 37 skipped (was 526) 3.11: 527 passed, 37 skipped (was 526) 3.12: 522 passed, 42 skipped (was 520) — +2 SME recursion tests black/isort/flake8: clean Regression harness vs origin/main: PASS, 0 unexpected deltas
The bytecode-injection engine declines to rewrite generators, coroutines, async generators, and iterable coroutines because in-place rewriting would corrupt .send()/.throw()/await semantics. But that's not the end of the story — the manager observes the engine's False return and routes through FunctionWrapper.instrument_function, which has a working coroutine path via _create_async_wrapper. So async functions ARE instrumented end-to-end; just not by this engine. The previous naming and comments suggested async was unsupported full-stop. Renames and clarifications: - _REFUSAL_MASK -> _DECLINE_BYTECODE_REWRITE_MASK - _is_unsupported_code -> _should_decline_bytecode_rewrite - Test names updated: test_function_level_refuses_* -> test_engine_declines_*_so_manager_can_fall_back_to_wrapper - Log message updated to clarify the manager will route through wrapper New test: TestApplyFunction.test_engine_decline_falls_back_to_setattr_wrapper. Locks down the wrapper-fallback half of the contract that the engine tests only verify on the engine half (asserting the engine returns False). Confirms that when enable_function_level_instrumentation returns False, _apply_function calls FunctionWrapper.instrument_function so the async-wrapper path can wire up async PROBEs. Cross-version validation: 3.9: 528 passed (was 527) 3.10: 528 passed (was 527) 3.11: 528 passed (was 527) 3.12: 523 passed (was 522) black/isort/flake8: clean
Claude PR Review
|
The setattr-wrapper path needed _patch_flask_view_functions because
replacing module.foo via setattr does NOT update Flask's app.view_functions
dict — Flask continues calling the old reference, bypassing instrumentation.
The bytecode-injection / sys.monitoring engines mutate the function's
__code__ in place. Any reference Flask is holding (or any other framework)
sees the rewritten code on the next call. No view_functions patch needed.
Removed from _function_wrapper.py (~139 LOC):
- _patch_framework_references (dispatcher, only had one call)
- _patch_flask_view_functions (Flask app scan + identity/name match)
- _patch_single_flask_app (per-app patcher)
- Call sites in _replace_function_in_module and restore_function
test_function_wrapper_flask.py rewritten in place — same 7 test names,
same intent ("Flask view_functions stays consistent under DI"), just
exercising the engine path instead of the deleted patch helpers. Each
test now:
1. Sets up a Flask app with a route handler
2. Arms the handler via engine.enable_function_level_instrumentation
3. Invokes through Flask's view_functions dict
4. Asserts the snapshot fired (proving __code__ mutation flowed through
Flask's stored reference without any explicit patching)
test_function_wrapper_branches.py: dropped TestPatchFlaskViewFunctionsErrors
class (66 LOC) — it tested error branches in the deleted Flask methods.
Cross-version validation:
3.9: 529 passed
3.10: 529 passed
3.11: 529 passed
3.12: 524 passed
black/isort/flake8: clean
….environment too Replace hardcoded "service.name" / "deployment.environment" / "deployment.environment.name" strings with their OTel constants: - ResourceAttributes.SERVICE_NAME - ResourceAttributes.DEPLOYMENT_ENVIRONMENT - DEPLOYMENT_ENVIRONMENT_NAME (from semconv._incubating, the only path the stable name is exposed at today) _build_resource now writes BOTH deployment.environment and deployment.environment.name when an environment is set, so consumers reading either key see the value. Engine _get_environment helpers also read both (prefer .name, fall back to bare). All inline OTel imports hoisted to module top: - instrumentation_manager.py: Resource, ResourceAttributes, DEPLOYMENT_ENVIRONMENT_NAME - _function_wrapper.py: otel_trace, ResourceAttributes, DEPLOYMENT_ENVIRONMENT_NAME - _sys_monitoring_engine.py: TracerProvider, ResourceAttributes, DEPLOYMENT_ENVIRONMENT_NAME - _bytecode_injection_engine.py: same as SME - _debugger_client.py: + DEPLOYMENT_ENVIRONMENT_NAME Tests: - test_returns_resource_when_attrs_present: now asserts both keys - test_resource_import_failure_returns_none: dropped (premise was an inline import that no longer exists; replaced by Resource.create failure test) Cross-version validation: 3.9: 528 passed 3.10: 528 passed 3.11: 528 passed 3.12: 523 passed black/isort/flake8: clean
…ffix The 3.9/3.10 builders previously had no version suffix while their 3.11 counterparts did, making the version split harder to scan. Add _39_310 suffixes to the four 3.9/3.10 builders so each family is explicit, and group the family pairs together (prologue 39_310 + 311, then entry_call, then exit_call, then unwind_call) instead of listing all 39_310 first followed by all 311. Also rename _build_function_unwind_foot_311 -> _build_function_unwind_call_311 for symmetry with the rest.
- pylint E1101 on inspect.CO_GENERATOR / CO_COROUTINE / CO_ASYNC_GENERATOR
/ CO_ITERABLE_COROUTINE: pylint's static analyzer doesn't see these CPython
runtime constants. Suppress with a per-block no-member disable.
- pylint W0404 / W0621 redefined-outer-name on `from bytecode import Label`:
Label is already imported at module top inside the IS_BYTECODE_INSTALLED
guard. Removed the redundant inline import in
_create_code_with_function_wrap.
- pylint W0404 / W0621 on inline `from amazon...debugger._function_wrapper
import get_snapshot_emitter`: removed two stale inline re-imports in BIE
/ SME emit-snapshot paths; the engines use the top-level import.
- Engine unit tests retargeted from
`amazon...debugger._function_wrapper.get_snapshot_emitter` to
`amazon...debugger.instrumentation_engine._{bytecode_injection,sys_monitoring}_engine.get_snapshot_emitter`
so mock.patch hits the consumer module's name binding now that the import
is hoisted. (15 patch-site renames across the two engine test files.)
- codespell 'asend' false positive: added to ignore-words-list in
.codespellrc. asend is a real Python AsyncGenerator method; codespell
thinks it's a typo for "ascend".
Cross-version validation:
3.9: 528 passed
3.10: 528 passed
3.11: 528 passed
3.12: 523 passed
black / isort / flake8 / pylint (10.00/10) / codespell: clean
CI surfaced more pylint refactor warnings (C0302 too-many-lines, R0912/14/15 too-many-branches/locals/statements, C0103 invalid-name on short-loop-counter `i`, R1710 inconsistent-return-statements). Address each minimally: - BIE / manager: file-level `# pylint: disable=too-many-lines` (both files exceed the 1000-line default and the engine is one cohesive unit). - BIE.enable_function_level_instrumentation, BIE.disable_function_level_instrumentation: per-method too-many-branches. - BIE._unique_local_name: rename loop counter `i` -> `suffix`. - SME._pop_matching_start_ns: rename loop counter `i` -> `idx`. - SME._function_start_event_handler / _function_return_event_handler: add explicit `return None` on guard paths so all returns are consistent (R1710). - SME._handle_function_level_instrumentation: per-method too-many-locals,too-many-statements (snapshot construction is unavoidably wide). - _undecorate.undecorated: per-function too-many-locals,too-many-branches (BFS through wrappings is one cohesive routine). - manager._apply_function: too-many-locals (added). Pylint now 10.00/10. All other linters and 528/528/528/523 tests unchanged.
|
Claude finished @liustve's task in 1m 7s —— View job Claude PR ReviewRe-reviewed the single merge commit since the prior review ( The merge is mechanical and brings in only one upstream change:
New findingsNone — bias toward silence. The merge is upstream sync only with no behavioral impact on this PR's scope.
|
|
Keeping it as draft to give more time to evaluate. |
Resolve conflicts by keeping the interpreter-layer instrumentation engine (bytecode / sys.monitoring) and dropping the framework-specific patching code that landed on main after this branch forked. The engine mutates a function's __code__ in place, so any reference a framework holds (Flask view_functions, Django URLPattern.callback, FastAPI route table / Depends sub-dependencies, `from x import f` aliases) sees the rewritten code on the next call — no per-framework patcher needed. Production code removed (reverted from main): - _patch_framework_references / _patch_flask_view_functions / _patch_single_flask_app - _patch_django_url_patterns / _patch_single_resolver / _maybe_patch_pattern - _patch_fastapi_routes / _patch_single_fastapi_app (incl. Depends rebind) - _patch_module_aliases / _is_application_module / _stdlib_lib_dirs (aws-observability#776, aws-observability#780, aws-observability#785, aws-observability#786) Tests: - Behavioral contract tests (di/django_test.py, di/fastapi_test.py, di/flask_test.py) kept as-is — they validate end-to-end that everything still works through the engine with no patching. - White-box unit tests that called the deleted _patch_* methods directly (test_function_wrapper_django.py, test_function_wrapper_fastapi.py, TestModuleAliasRedirect) rewritten 1:1 to drive the engine — same scenarios, only the operative call changed (mirrors how this branch already rewrote test_function_wrapper_flask.py). - Dropped only the tests of deleted internals with no engine equivalent: the _patch_framework_references dispatcher tests, the resolver-walk error-branch tests, and TestIsApplicationModule. Also adopt main's "drop Python 3.9" change (aws-observability#762): engine and manager gate on >= 3.10; function-level builders renamed _39_310 -> _310; 3.9 references removed across src and tests. Kept main's non-patching debugger improvements: capture-limit default constants (aws-observability#771), instrumentation_name field removal (aws-observability#775), OTel 1.42.1 bump (aws-observability#762), and the capture-limits test.
…vability#787 (engine-driven) PR aws-observability#787 adds DI framework-dispatch fixes on top of main's per-framework patchers (Starlette route.app rebuild, aiohttp handler rebind). This branch removed those patchers in favor of in-place __code__ mutation, so port the NEW framework test coverage from aws-observability#787 in the same engine-driving style this branch already uses for Flask/Django/FastAPI: arm the handler via the engine, then invoke through the framework's stored reference and assert it fires. - test_function_wrapper_starlette.py: route.app closure (pure-Starlette) fires via the engine; name+module fallback; FastAPI-subclass handled by its own route table; no-Starlette-app no-op. The two ASGI-driving tests assert on the handler's own invocation counter rather than the shared snapshot list, since on the global sys.monitoring (3.12) path that list can also collect snapshots for unrelated code armed by other in-process tests. - test_function_wrapper_aiohttp.py: route.handler identity, cross-route isolation, no-app no-op. (aiohttp is not in dev-requirements, so these SKIP in CI — same as on aws-observability#787's branch.) Dropped only aws-observability#787's pure white-box tests of deleted patcher internals (_patch_starlette_routes / _patch_aiohttp_routes / _starlette_app_wraps_endpoint, middleware-skip), consistent with how this branch handled the Flask/Django/ FastAPI patcher tests. aws-observability#787's production-only fixes (functools.partial keying, LINE_NOT_EXECUTABLE status) are not ported — those are code changes, not tests. Verified under the exact CI env (tox 4.30.3, 3.12-test-aws-opentelemetry-distro, distro from built wheel): 2276 passed, 38 skipped, 0 failed.
…apture_arguments
Root cause of the DI contract-test failures on Python 3.10/3.11 (the
bytecode-injection engine path; 3.12 sys.monitoring was unaffected).
BytecodeInjectionEngine._build_entry_context read capture_config.capture_locals
to decide what to capture at function entry, but a function-level PROBE/
BREAKPOINT carries its argument list in capture_arguments (capture_locals stays
None — that field is for line-level breakpoints). So for any PROBE configured
with capture_arguments=["x"], the entry handler short-circuited on
`capture_locals is None` and returned None, dropping ALL argument capture. The
snapshot still fired with a return value, but captures.entry was absent — which
is exactly what the contract tests caught:
- 'items' not found in {} (test_probe_captures_arguments)
- Captured string value should not be None / 0 != 20 (capture-limits tests)
- 'price_cents' not found in {} (from-import alias test)
The 3.12 sys.monitoring engine reads capture_arguments correctly
(_sys_monitoring_engine.py), as does the legacy wrapper path
(_function_wrapper._capture_entry_context via inspect.signature().bind) — only
the bytecode entry path conflated the two fields. Verified locally on real
CPython 3.10.15 and 3.11.10: the injected locals() dict already contains the
arguments at entry (items=[1,2,3]); the defect was purely field selection, not
bytecode/locals timing.
Fix: _build_entry_context reads capture_config.capture_arguments (None = none,
[] = all, [names] = named), mirroring the sys.monitoring engine; also use the
DEFAULT_MAX_* constants for clamp defaults (were stray literals 3/10).
Tests:
- New regression tests test_function_level_captures_arguments and
test_function_level_captures_all_arguments_when_empty_list assert
captures.entry.arguments is populated on the bytecode path. They fail without
this fix and pass with it. (Each target is defined in its own module
namespace so the injected handler global doesn't alias across tests.)
- test_function_wrapper_starlette.py: the two ASGI-driving tests now use SYNC
handlers — the bytecode engine on 3.10/3.11 declines coroutine bodies by
design (async routes go through the FunctionWrapper fallback), so an async
handler armed directly via the engine returns False there. Starlette wraps a
sync endpoint via functools.partial(run_in_threadpool, endpoint), still
closing over the same function object, so the test premise holds uniformly
across 3.10–3.12.
Verified: full debugger suite green on 3.10 (536), 3.11 (563), 3.12 (558);
broader distro suite 2166 passed on 3.11. black/isort/flake8 clean.
… in the same module
Root cause of the DI contract-test "Found 0 snapshots" failures on Python
3.10/3.11 (the bytecode-injection engine; 3.12 sys.monitoring unaffected).
disable_function_level_instrumentation has a fallback used when id(func) isn't a
direct key (the decorator-wrapper case). The old fallback scanned ALL injection
states and matched on:
md.file_path == code.co_filename
AND candidate.original_code.co_name == candidate.qualified_name.rsplit(".")[-1]
The second clause compares the candidate against ITSELF (always true for a
module-level function), so the scan degenerated to "the first function-level
state whose function shares this module's globals/file." Every function in a
module shares co_filename and __globals__, so disabling one function (e.g. a
line-only breakpoint on `shared_function` during a config update) mis-resolved
to and DISARMED an unrelated function-level PROBE (`compute_total`) — which was
then never re-armed (it reported "unchanged, skipping" on later polls). The
PROBE silently stopped firing => contract tests timed out waiting for snapshots.
This reproduced end-to-end locally (di-flask contract test on 3.11:
DIFlaskProbeTest 6 failed / 1 passed, matching CI) and as a focused unit repro.
Fix: anchor the fallback to the TARGET's identity, mirroring how enable keys the
state. Resolve the real inner function via undecorated(func, name=func.__name__
or code.co_name, path=code.co_filename) and look it up by id(); only if that
misses, fall back to an EXACT original-code identity match. No module-wide
co_name/globals scan, so a sibling can never be mis-disarmed. Preserves the
decorator-unwrap disable contract (test_function_level_disable_after_decorated_
enable_restores_inner still passes).
Tests: new regression test_disable_one_function_does_not_disarm_sibling_in_same_
module — arms one function-level PROBE + one line BP in the same module, disables
the line-BP function, asserts the PROBE stays armed and still captures args. Fails
without this fix, passes with it.
Verified:
- di-flask DIFlaskProbeTest contract class: 7 passed (was 6 failed/1 passed),
run locally on the rebuilt 3.11 image via Docker.
- Full debugger unit suite: 3.10 (537), 3.11 (564), 3.12 (558), 0 failures.
- black/isort/flake8 clean.
…y; disable async route-handler contract test Two changes: 1) disable_function_level_instrumentation now resolves the keyed state by the TARGET function's identity instead of a co_name/globals heuristic. enable stamps the outer symbol the manager holds with a direct reference to the resolved (undecorated) function it keyed state under; disable reads that reference back and looks the state up by id(). This replaces the refactor in the previous commit with a pure identity lookup — no scan over unrelated functions that merely share a module (co_filename/__globals__), which was mis-disarming a sibling PROBE when another function in the same module was updated. The decorator-unwrap disable path is preserved. 2) Temporarily disable DIFastAPIRouteHandlerTest (__test__ = False). Its target route_handler_target is an `async def` handler: the bytecode engine declines coroutine bodies, so instrumentation falls back to the setattr wrapper, which replaces only the module-level name and does not reach the reference FastAPI captured in its route table at decoration time — so the async route handler produces no snapshot. The sync variant (DIFastAPISyncRouteHandlerTest) is instrumented in place by the engine and still passes. Tracked as a known gap; re-enable when async route-handler instrumentation reaches framework-held refs. Verified locally on the rebuilt 3.11 image: di-flask DIFlaskProbeTest 7 passed; full debugger unit suite green on 3.10/3.11/3.12 (537/564/558). black/flake8 clean.
…ability#787; wire into CI Adds the pure-Starlette DI contract app (di_starlette_server.py, Dockerfile, mock_di_api.py, requirements.txt) and starlette_test.py, and adds "starlette" to the di-contract-tests CI matrix (the setup script auto-discovers the di-starlette image dir, so no script change is needed). The app has two DI targets, mirroring aws-observability#787: - process_data (sync control) — instrumented in place by the engine; proves DI is active in a pure-Starlette process. - starlette_handler (async route handler) — held in Starlette's request_response closure on route.app. DIStarletteFunctionLevelTest (process_data) is active and passes. The DIStarletteRouteHandlerTest class is disabled (__test__ = False) for the same reason as the FastAPI async route handler: the bytecode engine declines coroutine bodies, so an async handler falls back to the setattr wrapper which cannot reach Starlette's route.app closure -> no snapshot. Re-enable once async route-handler instrumentation reaches framework-held references. Verified locally on the rebuilt 3.11 image (Docker): starlette_test.py -> 1 passed (the sync control; async class deselected). black/isort/flake8 clean.
123da06 to
2941c31
Compare
…wrap The bytecode engine declined generators, coroutines, async generators and iterable coroutines, returning False so the manager fell back to the setattr FunctionWrapper. For a framework-held async handler (a Starlette route in a request_response closure, or a FastAPI APIRoute.endpoint reference) setattr only rebinds the module name and never reaches the stored reference, so async route handlers produced no snapshot. Mirror dd-trace-py's debugger: wrap the function body (after GEN_START / RESUME) rather than declining. Remove the decline mask and route every function kind through the entry/return/unwind body-wrap. _bytecode_injection_engine.py: - Remove _DECLINE_BYTECODE_REWRITE_MASK and _should_decline_bytecode_rewrite. - 3.10: keep GEN_START first, splice the prologue after it. - 3.11: detect the preamble as a contiguous run from the top (was scanning to the last RESUME, which in a generator/coroutine is the post-yield RESUME); include RETURN_GENERATOR / POP_TOP. - Suppress GeneratorExit / asyncio.CancelledError on the unwind path so closing a suspended generator or cancelling a task is not reported as an exception. - cleanup() also pops the injected _function_event_handler global. _sys_monitoring_engine.py: same GeneratorExit / CancelledError suppression on the PY_UNWIND path. Replace the engine decline unit tests with acceptance/correctness coverage for generators, coroutines, async generators, send/throw fidelity, exception capture, and GeneratorExit suppression. Unit tests: 569 pass on 3.10/3.11, 557 on 3.12. black/isort/flake8/codespell clean.
…PI + Starlette) Extend the di-fastapi and di-starlette contract apps with instrumented targets modeled on canonical FastAPI/Starlette async constructs, and add test classes asserting DI captures each end-to-end: - number_stream: a StreamingResponse async-generator body. - sse_event_stream: a Server-Sent Events async-generator body that breaks on client disconnect (await request.is_disconnected()). - db_session_dependency (FastAPI only): a yield-based async dependency (DB-session setup/teardown); Starlette has no Depends. - fetch_remote_value: an async function awaiting real I/O before returning. - lookup_or_404: an async function raising HTTPException on the not-found path; the test asserts the snapshot body carries the captured throwable and the request still returns 404. Each target is wired to a route (/stream, /sse, /with-db, /await-io, /lookup) and configured as a DI BREAKPOINT; the test classes hit the route and assert a method-level snapshot for that target. Re-enable the previously-disabled async route-handler tests (DIStarletteRouteHandlerTest, DIFastAPIRouteHandlerTest). The di-contract-tests CI matrix already runs fastapi + starlette on 3.10/3.11/3.12. Verified locally: both app servers import and expose the new routes/configs, and the real bytecode (3.10/3.11) and sys.monitoring (3.12) engines each emit one snapshot per target (throwable captured on the raise path). black/isort/flake8/codespell clean.
add5bf9 to
a19002a
Compare

Problem
The current Dynamic Instrumentation path replaces a target function with a Python-level wrapper installed via
setattr(module, name, wrapper). This has two issues, one architectural and one downstream of it.1. setattr only updates the module's name binding. Frameworks register handler functions at import time and store direct references in their own internal data structures (route registries, dispatcher tables, ORM hooks, etc.). When a request comes in, the framework calls those stored references, not
module.foo. So setattr-replacingmodule.fooleaves DI silently bypassed for every framework-registered entry point — the wrapper never runs, no snapshot, no instrumentation. The previous workaround was a per-framework patcher that scanned for known frameworks and rewrote their internal dicts. This is fundamentally a losing approach: every new framework needs its own patcher, the matching heuristics break under decorator wrappers, and any framework we don't know about is invisible.2. The wrapper's exception path drops the throwable. When a wrapped function raises, the resulting snapshot's
Captures.return_context.throwablefield is empty — no captured exception type, message, or stacktrace. The user gets a snapshot that says the function failed but provides none of the data they need to diagnose it. The cause is that the wrapper'stry/exceptsits on top of the user's frame, and exception-state lifecycle differences between Python versions (especially 3.11+ with PEP 657) made the capture logic unreliable.Both problems share a root cause: instrumenting at the Python layer instead of at the interpreter layer.
Fix
Instrument at the interpreter layer. The function's
__code__is mutated in place — any reference any framework or library is holding sees the rewritten code on the next call, with no per-framework patching. The interpreter's own exception flow runs the unwind hook, sosys.exc_info()is reliably the live exception and the throwable is always captured.Each Python version uses the engine that fits it best, and all three engines share the same handler contract: entry, return, unwind.
sys.monitoring): PY_START / PY_RETURN / PY_UNWINDSETUP_FINALLY-based body wrap, RERAISE in handlerco_exceptiontablewrap via splice walk (see below)The unwind handler reads
sys.exc_info()directly, builds aCapturedThrowablewith type/message/stacktrace (in-function frames merged with caller frames), and emits the snapshot.Architectural change
3.12+: sys.monitoring
Native PEP 669 events. Per-thread LIFO stack of
(code_id, start_ns)tuples preserves correct durations across recursion. PY_UNWIND is set globally (it isn't local-event-capable on 3.12-3.14); a cheap_function_keysmembership check filters to instrumented codes before any allocation. Reentrancy guard viathreading.localso a snapshot built from the serializer cannot itself trigger another snapshot on the same thread.3.9 / 3.10: bytecode rewrite with SETUP_FINALLY
Standard
SETUP_FINALLYbody wrap: prologue calls the entry hook, everyRETURN_VALUEis replaced with a sequence that calls the exit hook then returns, and the SETUP_FINALLY block calls the unwind hook thenRERAISEs. Local slots forstart_ns/entry_ctx/retvalare pre-initialized toNoneso the unwind hook'sLOAD_FASTis safe even if the entry hook crashed before storing them.POP_BLOCKis emitted before eachRETURN_VALUEso the block stack is balanced on the normal path.3.11: bytecode rewrite with PEP 657 exception_table — splice walk
3.11 removed
SETUP_FINALLY. Protected regions live in out-of-band metadata (co_exceptiontable). The bytecode library exposes this viaTryBegin/TryEndpseudo-instructions, but it forbids nestedTryBegin— and any function that contains usertry/exceptalready has one. Wrapping the whole body with anotherTryBeginraisesRuntimeError: TryBegin pseudo instructions cannot be nested.The fix is a splice walk that produces N adjacent (never nested) outer regions, all targeting one shared exception handler. Walking the user's instructions:
The outer protected regions are flat (one closes before each user try, the next opens after) so the library accepts them, and they all funnel uncaught exceptions to the same handler. Other 3.11-specific details:
RESUME 0/COPY_FREE_VARS/MAKE_CELLso closures and free-vars are armed when the entry hook fires (otherwiseframe.f_localsis incomplete).PUSH_NULL+LOAD_GLOBAL((push_null: bool, name: str))+PRECALL n+CALL n(the tuple-formLOAD_GLOBALarg is mandatory; passing a bare string silently miswritesnamei).push_lasti=Trueis[..., lasti, exc]. 3.11 does NOT auto-setsys.exc_info()forco_exceptiontablehandlers, so the foot usesPUSH_EXC_INFOto expose the live exception to the unwind hook, thenRERAISE 0to propagate.Decorator unwrapping
@functools.wraps-style decorators rebindmodule.nameto a wrapper whose__code__is the decorator's check, not the user's body. Without unwrapping, the engine would rewrite the decorator's code and snapshots would report the wrapper's location and fire on every decorator invocation instead of the actual handler. A new_undecorate.undecorated()BFSes through__wrapped__/functools.partial/ closure cells, withco_filenamedisambiguation, to find the real user function. The engine arms that one and keys state by its id.Async / generator fallback
Engines decline to rewrite generators, coroutines, async generators, and iterable coroutines (rewriting their bodies would corrupt
.send()/.throw()/awaitsemantics). On decline, the manager routes throughFunctionWrapper.instrument_function, which has dedicated coroutine support via_create_async_wrapper. Soasync deffunctions remain instrumented end-to-end — just not by the bytecode engine.Framework-specific patcher removed
The
__code__mutation is in-place: every reference any framework holds sees the rewritten code on the next call. No framework-specific scanning, no per-framework patcher, no fragile name+module identity matching. The previously-needed patcher was deleted and its tests rewritten in place to exercise the engine path and prove the framework's stored reference flows through without explicit patching.State leak fixes
_call_start_nsis popped unconditionally in PY_RETURN/PY_UNWIND before the reentrancy/membership guards (otherwise a disable-mid-flight or serializer-triggered short-circuit orphans the stamp).disable_function_level_instrumentationclears all per-code state including_instrumentation_typesand the per-thread stack.disable_breakpoints_for_function(the existing line-BP path) also clears function-level state on the same code, so coexisting instrumentation tears down cleanly.Other changes in this PR
SnapshotOtlpEmitter.initialize()is called eagerly during manager init. Lazy init from asys.monitoringcallback at shutdown could hitRuntimeError(\"cannot schedule new futures after interpreter shutdown\")becauseBatchLogRecordProcessorspawns a daemon worker on construction._build_resourcewrites bothdeployment.environment(older convention) anddeployment.environment.name(OTel-stable) so consumers reading either key see the value. Engines'_get_environmenthelpers prefer the.namevariant and fall back to the bare key.\"service.name\"/\"deployment.environment(.name)\"strings replaced with constants fromopentelemetry.semconv.resource.ResourceAttributesandopentelemetry.semconv._incubating.attributes.deployment_attributes.DEPLOYMENT_ENVIRONMENT_NAME. All inline OTel imports hoisted to module top.v1.41.0raw GitHub path so schema-validation tests are stable when the docs site drifts.Validation
Test plan