Merged
4 changes: 3 additions & 1 deletion README.md
@@ -128,8 +128,10 @@ The Lab decision surface now also exposes `policy_version`, `triggered_rules`, a
`agent-runtime-report` is an additive reliable edge agent runtime report path.
It bundles Orchestrator scheduling evidence and AIGuard runtime reliability `guard_analysis` into a Lab-owned agent deployment decision context without changing existing Runtime result or compare contracts.
The current bundled evidence is a synthetic/dummy sustained high-load 3-agent scenario.
The report preserves sustained queue-depth, worker health, Runtime result health/error/event evidence, runtime event summary/timeline, policy decision reason, and `sustained_overload_risk` evidence as local-first deployment review context.
The report preserves sustained queue-depth, worker health, Runtime result health/error/event evidence, optional remote dispatch worker-selection context, runtime event summary/timeline, policy decision reason, and `sustained_overload_risk` evidence as local-first deployment review context.
When a Runtime result JSON with `runtime_health_snapshot` / `runtime_events` is available, add `--runtime-result <path>` to include Runtime-side operation context in the same Lab report.
When an InferEdgeOrchestrator `inferedge-remote-dispatch-result-v1` JSON is available, add `--remote-dispatch <path>` to include file-based worker selection, retry/fallback plan, and plan-only remote execution context.
This is remote dispatch evidence for local-first review; it does not claim production remote execution.
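Before passing a file to `--remote-dispatch`, a reviewer may want to confirm that it declares the schema named above. The following is a minimal sketch, assuming the JSON carries a top-level `schema_version` field (the report builder in this change reads that key). The helper name `looks_like_remote_dispatch_result` is hypothetical and is not part of InferEdgeLab.

```python
from __future__ import annotations

import json
from pathlib import Path

# Schema identifier quoted in the README text above.
REMOTE_DISPATCH_SCHEMA_VERSION = "inferedge-remote-dispatch-result-v1"


def looks_like_remote_dispatch_result(path: str | Path) -> bool:
    """Hypothetical pre-check: True if the file is a JSON object that
    declares the expected remote dispatch schema_version."""
    try:
        payload = json.loads(Path(path).read_text(encoding="utf-8"))
    except (OSError, ValueError):
        # Missing file, unreadable bytes, or invalid JSON.
        return False
    return (
        isinstance(payload, dict)
        and payload.get("schema_version") == REMOTE_DISPATCH_SCHEMA_VERSION
    )
```

A check like this only guards against passing the wrong file; it says nothing about whether the dispatch evidence inside is complete.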

![InferEdge Local Studio demo evidence](assets/images/local-studio-demo-evidence.png)

11 changes: 11 additions & 0 deletions docs/portfolio/agent_runtime_reliability_report.md
@@ -34,6 +34,7 @@ Generate a Markdown report:
poetry run inferedgelab agent-runtime-report \
--orchestration-summary examples/agent_runtime/agent_3_orchestration_summary.json \
--guard-analysis examples/agent_runtime/aiguard_runtime_guard_analysis.json \
--remote-dispatch /tmp/inferedge_agent_runtime_e2e/06_remote_dispatch_result.json \
--format markdown \
--output reports/agent_runtime_reliability_report.md
```
@@ -70,6 +71,9 @@ runtime operation review:
- Optional Runtime result operation evidence through `--runtime-result`,
including `runtime_health_snapshot`, `runtime_error_classification`, and
`runtime_events`.
- Optional Orchestrator remote dispatch evidence through `--remote-dispatch`,
including file-based worker selection, selected worker id, plan-only remote
execution context, and retry/fallback plan fields.
- Runtime timeout observation context, including `timeout_policy`,
`timeout_budget_ms`, and `runtime_timeout_observed`. A timeout observation is
treated as Lab `review_required` evidence because it means the configured
@@ -80,10 +84,15 @@ These fields make the report path explicit:

```text
Runtime result operation evidence + Orchestrator operation evidence
-> optional remote worker-selection context
-> AIGuard reliability explanation
-> Lab-owned deployment risk context
```

Remote dispatch remains a starter contract. It records worker-selection and
fallback-plan evidence for review, but it does not claim production SSH/HTTP
execution, secure tunnel operation, or long-lived remote worker readiness.
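The plan-only claim above can be spot-checked against a dispatch result. The following is a minimal sketch, assuming the three flags rendered by the Markdown report in this change (`production_remote_execution`, `network_execution_performed`, `execution_performed`) are booleans in the source JSON. The helper name `plan_only_violations` is hypothetical and is not part of InferEdgeLab.

```python
from __future__ import annotations

from typing import Any

# (section, flag) pairs taken from the fields the Markdown report renders.
# Assumption: each flag is a JSON boolean when present.
PLAN_ONLY_FLAGS = [
    ("remote_execution", "production_remote_execution"),
    ("remote_execution_plan", "network_execution_performed"),
    ("retry_fallback_plan", "execution_performed"),
]


def plan_only_violations(remote_dispatch: dict[str, Any]) -> list[str]:
    """Hypothetical review helper: return the dotted names of flags that
    indicate real remote execution rather than a plan."""
    violations: list[str] = []
    for section, flag in PLAN_ONLY_FLAGS:
        block = remote_dispatch.get(section)
        # Missing sections or flags are treated as "nothing was executed".
        if isinstance(block, dict) and block.get(flag) is True:
            violations.append(f"{section}.{flag}")
    return violations
```

An empty list is consistent with the starter contract described above; any entry would suggest the evidence file goes beyond plan-only dispatch and deserves a closer look.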

## Lab Decision Context

Expected decision:
@@ -113,6 +122,8 @@ Triggered rules:

- Orchestrator records scheduling and policy evidence.
- Orchestrator operation-health fields are displayed as local runtime evidence.
- Orchestrator remote dispatch result fields are displayed as plan-only worker
selection evidence when provided.
- AIGuard explains runtime reliability risk.
- Lab remains the final deployment decision owner.
- This report is an additive agent-runtime path and does not change existing
9 changes: 9 additions & 0 deletions inferedgelab/commands/agent_runtime_report.py
@@ -28,13 +28,19 @@ def agent_runtime_report_cmd(
"--runtime-result",
help="Optional InferEdge-Runtime result JSON with runtime_health_snapshot/runtime_events",
),
remote_dispatch: str = typer.Option(
"",
"--remote-dispatch",
help="Optional InferEdgeOrchestrator remote dispatch result JSON",
),
format: str = typer.Option("text", "--format", "-f", help="text/json/markdown"),
output: str = typer.Option("", "--output", "-o", help="Optional output path"),
) -> None:
report = load_agent_runtime_reliability_bundle(
orchestration_summary_path=orchestration_summary,
guard_analysis_path=guard_analysis or None,
runtime_result_path=runtime_result or None,
remote_dispatch_path=remote_dispatch or None,
)
normalized_format = format.strip().lower()
if normalized_format == "json":
@@ -60,6 +66,7 @@ def _text_summary(report: dict) -> str:
decision = report["agent_deployment_decision"]
guard = report["guard_summary"]
runtime_context = report["agent_runtime_summary"].get("runtime_result_context") or {}
remote_context = report["agent_runtime_summary"].get("remote_dispatch_context") or {}
health = runtime_context.get("runtime_health_snapshot") or {}
error = runtime_context.get("runtime_error_classification") or {}
lines = [
@@ -74,6 +81,8 @@ def _text_summary(report: dict) -> str:
f"deadline_miss_rate: {metrics['deadline_miss_rate']:.6g}",
f"runtime_health_status: {health.get('status')}",
f"runtime_error_category: {error.get('category')}",
f"remote_dispatch_status: {remote_context.get('dispatch_status')}",
f"remote_selected_worker_id: {remote_context.get('selected_worker_id')}",
"triggered_rules:",
]
lines.extend(f"- {rule}" for rule in decision["triggered_rules"])
110 changes: 110 additions & 0 deletions inferedgelab/services/agent_runtime_report.py
@@ -17,6 +17,7 @@
AGENT_RUNTIME_POLICY_VERSION = "inferedge-lab-agent-runtime-policy-v1"
ORCHESTRATION_SCHEMA_VERSION = "inferedge-orchestration-summary-v1"
AIGUARD_DIAGNOSIS_SCHEMA_VERSION = "inferedge-aiguard-diagnosis-v1"
REMOTE_DISPATCH_SCHEMA_VERSION = "inferedge-remote-dispatch-result-v1"

DEFAULT_AGENT_RUNTIME_THRESHOLDS = {
"deadline_miss_rate_review": 0.05,
@@ -95,6 +96,7 @@ def build_agent_runtime_reliability_report(
orchestration_summary: dict[str, Any],
guard_analysis: dict[str, Any] | None = None,
runtime_result: dict[str, Any] | None = None,
remote_dispatch: dict[str, Any] | None = None,
source: dict[str, Any] | None = None,
thresholds: dict[str, float] | None = None,
) -> dict[str, Any]:
@@ -104,6 +106,7 @@ def build_agent_runtime_reliability_report(
metrics = compute_agent_runtime_metrics(orchestration_summary)
runtime_summary = _agent_runtime_summary(orchestration_summary)
runtime_result_context = _runtime_result_operation_context(runtime_result)
remote_dispatch_context = _remote_dispatch_context(remote_dispatch)
decision = build_agent_runtime_deployment_decision(
metrics=metrics,
guard_analysis=guard_analysis,
@@ -131,6 +134,11 @@ def build_agent_runtime_reliability_report(
if isinstance(runtime_result, dict)
else None
),
"remote_dispatch": (
remote_dispatch.get("schema_version")
if isinstance(remote_dispatch, dict)
else None
),
"source_contracts": runtime_summary.get("source_contracts", {}),
},
"agent_runtime_summary": {
@@ -140,6 +148,7 @@ def build_agent_runtime_reliability_report(
"timeline_summary": _timeline_summary(orchestration_summary, metrics),
"operation_context": _operation_context(orchestration_summary, metrics),
"runtime_result_context": runtime_result_context,
"remote_dispatch_context": remote_dispatch_context,
"policy_decision_reasons": metrics["policy_decision_reasons"],
"policy_decision_log_count": len(_policy_log(orchestration_summary)),
},
@@ -349,14 +358,19 @@ def load_agent_runtime_reliability_bundle(
orchestration_summary_path: str | Path,
guard_analysis_path: str | Path | None = None,
runtime_result_path: str | Path | None = None,
remote_dispatch_path: str | Path | None = None,
) -> dict[str, Any]:
orchestration_summary = _load_json_dict(orchestration_summary_path)
guard_analysis = _load_json_dict(guard_analysis_path) if guard_analysis_path else None
runtime_result = _load_json_dict(runtime_result_path) if runtime_result_path else None
remote_dispatch = (
_load_json_dict(remote_dispatch_path) if remote_dispatch_path else None
)
return build_agent_runtime_reliability_report(
orchestration_summary=orchestration_summary,
guard_analysis=guard_analysis,
runtime_result=runtime_result,
remote_dispatch=remote_dispatch,
source={
"orchestration_summary_path": str(orchestration_summary_path),
"guard_analysis_path": str(guard_analysis_path)
@@ -365,6 +379,9 @@ def load_agent_runtime_reliability_bundle(
"runtime_result_path": str(runtime_result_path)
if runtime_result_path
else None,
"remote_dispatch_path": str(remote_dispatch_path)
if remote_dispatch_path
else None,
},
)

@@ -375,9 +392,14 @@ def build_agent_runtime_reliability_markdown(report: dict[str, Any]) -> str:
decision = report["agent_deployment_decision"]
guard = report["guard_summary"]
runtime_result_context = runtime.get("runtime_result_context") or {}
remote_dispatch_context = runtime.get("remote_dispatch_context") or {}
runtime_health = runtime_result_context.get("runtime_health_snapshot") or {}
runtime_error = runtime_result_context.get("runtime_error_classification") or {}
runtime_event_summary = runtime_result_context.get("runtime_event_summary") or {}
remote_execution = remote_dispatch_context.get("remote_execution") or {}
remote_execution_plan = remote_dispatch_context.get("remote_execution_plan") or {}
retry_fallback_plan = remote_dispatch_context.get("retry_fallback_plan") or {}
worker_selection = remote_dispatch_context.get("worker_selection") or {}

lines = [
"# InferEdge Agent Runtime Reliability Report",
@@ -521,6 +543,37 @@ def build_agent_runtime_reliability_markdown(report: dict[str, Any]) -> str:
)
],
"",
"## Remote Dispatch Context",
"",
"| Field | Value |",
"|---|---|",
f"| remote_dispatch_schema | {remote_dispatch_context.get('source_schema_version') or '-'} |",
f"| dispatch_status | {remote_dispatch_context.get('dispatch_status') or '-'} |",
f"| selected_worker_id | {remote_dispatch_context.get('selected_worker_id') or '-'} |",
f"| decision_reason | {remote_dispatch_context.get('decision_reason') or '-'} |",
f"| production_remote_execution | {remote_execution.get('production_remote_execution', '-')} |",
f"| execution_plan_mode | {remote_execution_plan.get('mode') or '-'} |",
f"| network_execution_performed | {remote_execution_plan.get('network_execution_performed', '-')} |",
f"| planned_transport | {remote_execution_plan.get('transport') or '-'} |",
f"| fallback_worker_ids | {', '.join(worker_selection.get('fallback_worker_ids') or []) or '-'} |",
f"| retry_max_attempts | {_fmt_number(retry_fallback_plan.get('max_attempts'))} |",
f"| retry_execution_performed | {retry_fallback_plan.get('execution_performed', '-')} |",
"",
"Remote worker selection sample:",
"",
"| Worker | Eligible | Status | Health | Endpoint | Reason |",
"|---|---|---|---|---|---|",
*[
"| "
f"{item.get('worker_id') or '-'} | "
f"{item.get('eligible')} | "
f"{item.get('status') or '-'} | "
f"{item.get('health_state') or '-'} | "
f"{item.get('endpoint_type') or '-'} | "
f"{item.get('decision_reason') or '-'} |"
for item in remote_dispatch_context.get("worker_evaluations", [])
],
"",
"## AIGuard Runtime Reliability Evidence",
"",
f"- guard_status: `{guard.get('status')}`",
@@ -819,6 +872,63 @@ def _runtime_result_operation_context(
}


def _remote_dispatch_context(
remote_dispatch: dict[str, Any] | None,
) -> dict[str, Any]:
if not isinstance(remote_dispatch, dict):
return {
"source_schema_version": None,
"dispatch_status": None,
"selected_worker_id": None,
"decision_reason": None,
"remote_execution": {},
"remote_execution_plan": {},
"worker_selection": {
"schema_version": None,
"selected_worker_id": None,
"candidate_worker_ids": [],
"fallback_worker_ids": [],
"evaluations": [],
},
"retry_fallback_plan": {},
"worker_evaluations": [],
"runtime_event_sample": [],
}

worker_selection = remote_dispatch.get("worker_selection")
if not isinstance(worker_selection, dict):
worker_selection = {
"schema_version": None,
"selected_worker_id": remote_dispatch.get("selected_worker_id"),
"candidate_worker_ids": [],
"fallback_worker_ids": [],
"evaluations": [],
}
retry_fallback_plan = remote_dispatch.get("retry_fallback_plan")
remote_execution_plan = remote_dispatch.get("remote_execution_plan")
remote_execution = remote_dispatch.get("remote_execution")
runtime_events = _dict_list(remote_dispatch.get("runtime_events"))
evaluations = _dict_list(worker_selection.get("evaluations"))
return {
"source_schema_version": remote_dispatch.get("schema_version"),
"dispatch_status": remote_dispatch.get("dispatch_status"),
"selected_worker_id": remote_dispatch.get("selected_worker_id"),
"decision_reason": remote_dispatch.get("decision_reason"),
"remote_execution": dict(remote_execution)
if isinstance(remote_execution, dict)
else {},
"remote_execution_plan": dict(remote_execution_plan)
if isinstance(remote_execution_plan, dict)
else {},
"worker_selection": dict(worker_selection),
"retry_fallback_plan": dict(retry_fallback_plan)
if isinstance(retry_fallback_plan, dict)
else {},
"worker_evaluations": evaluations[:8],
"runtime_event_sample": runtime_events[:8],
}
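The sampling behaviour of `_remote_dispatch_context` can be illustrated in isolation. The following is a simplified, self-contained sketch, not the function from this change: `_dict_list` is not shown in the diff, so `dict_list` below assumes it keeps only the dict items of a list, and `sample_worker_evaluations` is a hypothetical name that reproduces only the `worker_evaluations` part of the context.

```python
from __future__ import annotations

from typing import Any

SAMPLE_LIMIT = 8  # mirrors the [:8] slices in the function above


def dict_list(value: Any) -> list[dict[str, Any]]:
    """Assumed behaviour of `_dict_list`: keep only dict items of a list."""
    if not isinstance(value, list):
        return []
    return [item for item in value if isinstance(item, dict)]


def sample_worker_evaluations(
    remote_dispatch: dict[str, Any] | None,
) -> list[dict[str, Any]]:
    """Return at most SAMPLE_LIMIT worker evaluations, or [] when the
    dispatch result or its worker_selection block is missing or malformed."""
    if not isinstance(remote_dispatch, dict):
        return []
    worker_selection = remote_dispatch.get("worker_selection")
    if not isinstance(worker_selection, dict):
        return []
    return dict_list(worker_selection.get("evaluations"))[:SAMPLE_LIMIT]
```

Capping the sample keeps the Markdown table short for review; the full `worker_selection` dict is still carried in the context, so nothing is dropped from the JSON output.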


def _queue_state_summary(orchestration_summary: dict[str, Any]) -> dict[str, Any]:
value = orchestration_summary.get("queue_state_summary")
if isinstance(value, dict):