This file is for Codex, Claude Code, Cursor, Aider, and other agentic
programming tools that modify this repository. It is not a tutorial for
benchmark diagnosis agents. If you want to build a troubleshooting agent that
NetOpsBench evaluates, start with
docs/content/docs/build-your-agent/custom-agents.mdx and
examples/agents/README.md.
Keep changes small, preserve benchmark semantics, verify behavior with focused tests, and avoid changing the public SDK/API surface unless the task explicitly asks for a breaking change.
- netopsbench/sdk/: stable public Python SDK. User-facing sessions, runtimes, agents, MCP config, reports, scenario helpers, faults, and evaluators live here.
- netopsbench/platform/: internal runtime implementation. This includes topology/runtime orchestration, scenario execution, worker pools, faults, traffic, Pingmesh, toolkit, and observability internals. Changes here should include focused tests.
- netopsbench/platform/toolkit/: direct toolkit methods and FastMCP wrappers exposed to troubleshooting agents.
- netopsbench/platform/scenario/: scenario parsing, models, validation, execution, observation collection, and episode handling.
- netopsbench/platform/session/: session orchestration, runtime dispatch, diagnosis callbacks, scoring handoff, and reporting.
- examples/: public runnable examples. Treat these as part of the user experience; keep commands, docs, and imports aligned.
- scenarios/generated/<scale>/: generated benchmark scenarios used by examples and suite runs.
- observability/ and scripts/runtime/: Docker, Containerlab, Telegraf, Pingmesh, and BGP runtime support.
- tests/: lightweight unit and contract tests. Real Containerlab tests are usually marked real or named *_real.
The stable external boundary is netopsbench.sdk, the documented CLI commands,
the docs under docs/, and public examples under examples/. Treat
netopsbench.platform.* as internal unless a task explicitly asks to expose or
document it.
The normal path is:
- A user calls an example, CLI command, or SDK session API.
- The session layer provisions or attaches to a runtime worker pool.
- The scenario executor runs episodes and manages traffic/fault lifecycle.
- Baseline and fault-window observations are collected.
- A troubleshooting agent receives DiagnosticContext and returns DiagnosisResult.
- The evaluator scores detection, localization, fault type, runtime, tool usage, and token usage.
- Reports and raw artifacts are written for the run.
Do not confuse coding agents editing this repository with troubleshooting
agents evaluated by the benchmark. Troubleshooting agents implement
diagnose(context) -> DiagnosisResult; coding agents should preserve that
contract while modifying the codebase.
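
The contract can be pictured with a minimal sketch. The two dataclasses below are hypothetical stand-ins, not the real SDK types: field names such as scenario_id, observations, verdict, and suspected_devices, and the lossy_devices observation key, are assumptions made for illustration. Only the diagnose(context) -> DiagnosisResult shape and the verdict strings come from this file.

```python
from dataclasses import dataclass, field


@dataclass
class DiagnosticContext:
    """Hypothetical stand-in for the SDK context; real fields may differ."""
    scenario_id: str
    observations: dict = field(default_factory=dict)


@dataclass
class DiagnosisResult:
    """Hypothetical stand-in for the SDK result; real fields may differ."""
    verdict: str
    suspected_devices: list = field(default_factory=list)


class EchoAgent:
    """Toy troubleshooting agent that honors diagnose(context) -> DiagnosisResult."""

    def diagnose(self, context: DiagnosticContext) -> DiagnosisResult:
        # Assumed observation key: a list of device names showing packet loss.
        lossy = context.observations.get("lossy_devices", [])
        if lossy:
            return DiagnosisResult(verdict="fault_detected",
                                   suspected_devices=list(lossy))
        return DiagnosisResult(verdict="network_healthy")
```

A coding agent refactoring the session or scenario layers should leave a caller like this working unchanged: one context in, one result out.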
Important files for this path:
- netopsbench/platform/session/orchestrator.py
- netopsbench/platform/session/dispatch.py
- netopsbench/platform/scenario/executor.py
- netopsbench/platform/scenario/episode_runner.py
- netopsbench/platform/session/scoring.py
- netopsbench/evaluator/scorer.py
Pingmesh tools must query the episode window, not an arbitrary rolling window, when scenario diagnosis is running. Preserve this precedence:
- Explicit tool args: start_time and end_time
- The current AgentToolkit default Pingmesh window
- NETOPSBENCH_PINGMESH_CONTEXT_FILE, a JSON file with {"start_time": "...Z", "end_time": "...Z"}
- NETOPSBENCH_PINGMESH_START_TIME and NETOPSBENCH_PINGMESH_END_TIME
- Rolling fallback via time_range_minutes
Do not remove or reorder this behavior without updating tests and docs. Relevant files:
- netopsbench/platform/toolkit/_core/observability/pingmesh_scope.py
- netopsbench/platform/toolkit/_core/observability/pingmesh_ops.py
- netopsbench/platform/toolkit/mcp/observability.py
- netopsbench/platform/session/orchestrator.py
- tests/test_pingmesh_time_scope.py
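
The Pingmesh window precedence described above can be sketched as a single resolver. This is an illustration only: resolve_pingmesh_window, its parameters, and its (source, window) return shape are hypothetical, and the real logic in pingmesh_scope.py may validate or structure things differently. Only the environment variable names, the JSON keys, and the ordering come from this file.

```python
import json
import os


def resolve_pingmesh_window(start_time=None, end_time=None,
                            toolkit_default=None, time_range_minutes=5,
                            env=None):
    """Return (source, window) following the documented precedence.

    Hypothetical helper for illustration; not the repository's API.
    """
    env = os.environ if env is None else env

    # 1. Explicit tool arguments win.
    if start_time and end_time:
        return "args", (start_time, end_time)

    # 2. Default window on the current toolkit (assumed to be a pair).
    if toolkit_default:
        return "toolkit", tuple(toolkit_default)

    # 3. JSON context file named by the environment variable.
    path = env.get("NETOPSBENCH_PINGMESH_CONTEXT_FILE")
    if path and os.path.exists(path):
        with open(path, encoding="utf-8") as fh:
            data = json.load(fh)
        if data.get("start_time") and data.get("end_time"):
            return "context_file", (data["start_time"], data["end_time"])

    # 4. Start/end environment variables.
    env_start = env.get("NETOPSBENCH_PINGMESH_START_TIME")
    env_end = env.get("NETOPSBENCH_PINGMESH_END_TIME")
    if env_start and env_end:
        return "env", (env_start, env_end)

    # 5. Rolling fallback, expressed in minutes rather than a fixed window.
    return "rolling", time_range_minutes
```

Returning the source alongside the window makes the ordering easy to assert in a test, which is the kind of check a change to this behavior should keep passing.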
get_pingmesh_hotspots is intentionally leaf-pair aggregated. Do not change it
to client-path granularity unless the benchmark design changes.
Healthy scenarios are represented by scenario metadata:
metadata:
  negative_sample: true

For negative samples, the scenario runner should diagnose a representative
middle fault_type: none episode instead of skipping it. That episode waits,
collects healthy observations, and calls the agent. Ordinary fault_type: none
episodes that are not selected for diagnosis may still be skipped quickly.
Scoring intent:
- Positive cases: main score is localization-oriented; fault type is a separate KPI.
- Negative cases: score is 1 only when the verdict is network_healthy; fault_detected and inconclusive score 0.
- Detection summaries include positive and negative cases.
- Localization and fault-type summaries aggregate only positive cases.
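
The scoring intent above can be illustrated with a small sketch. The function names and the case dictionary keys (is_negative, verdict, localization) are hypothetical, and the rule that a positive case counts as detected when the verdict is fault_detected is an assumption; the actual scorer may weigh things differently.

```python
def negative_case_score(verdict: str) -> int:
    """Score a healthy scenario: 1 only for network_healthy, else 0."""
    return 1 if verdict == "network_healthy" else 0


def summarize(cases):
    """Aggregate hypothetical case dicts into detection and localization means."""
    detection = [
        negative_case_score(c["verdict"]) if c["is_negative"]
        else int(c["verdict"] == "fault_detected")  # assumed positive-case rule
        for c in cases
    ]
    # Localization aggregates positive cases only.
    localization = [c["localization"] for c in cases if not c["is_negative"]]
    return {
        "detection": sum(detection) / len(detection) if detection else None,
        "localization": (sum(localization) / len(localization)
                         if localization else None),
    }
```

Keeping negative cases out of the localization mean avoids rewarding or penalizing an agent for "localizing" a fault that does not exist.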
Relevant tests:
- tests/test_negative_sample_scenarios.py
- tests/test_scenario_schema.py
- tests/test_e2e.py
The runtime currently relies on host-side helper scripts for Telegraf, Pingmesh, and BGP snapshots. Preserve these behaviors:
- BGP line protocol includes topology_id.
- Telegraf tails /var/lib/netopsbench/bgp_neighbors.lp from the beginning and uses watch_method = "poll".
- Worker Telegraf config and BGP line protocol files are readable by the Telegraf container.
Relevant files:
- scripts/observability/start_worker_telegraf.sh
- scripts/runtime/run_bgp_collector.py
- observability/telegraf.conf.template
- tests/test_bgp_collector.py
- tests/test_runtime_config_consistency.py
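
To make the topology_id requirement concrete, here is a minimal sketch of writing one BGP neighbor record in InfluxDB line protocol. Everything except topology_id is an assumption: the measurement name bgp_neighbors, the device and peer tags, the state field, and carrying topology_id as a tag rather than a field are all hypothetical, and the real collector in run_bgp_collector.py may emit a different schema.

```python
def escape_tag(value: str) -> str:
    """Escape the characters line protocol reserves in tag values."""
    return value.replace(",", "\\,").replace("=", "\\=").replace(" ", "\\ ")


def bgp_neighbor_line(topology_id, device, peer, state, timestamp_ns):
    """Build one line-protocol record (hypothetical schema)."""
    tags = ",".join([
        f"topology_id={escape_tag(topology_id)}",  # required by this repo
        f"device={escape_tag(device)}",            # assumed tag
        f"peer={escape_tag(peer)}",                # assumed tag
    ])
    # String field values are double-quoted; escape backslashes and quotes.
    safe_state = state.replace("\\", "\\\\").replace('"', '\\"')
    return f'bgp_neighbors,{tags} state="{safe_state}" {timestamp_ns}'
```

A test for the collector can then assert that every emitted line carries the topology_id key, independent of the other tags.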
- Prefer small, targeted changes over broad rewrites.
- Preserve public SDK imports, public example behavior, and documented CLI commands unless the task explicitly asks for a breaking change.
- Keep examples/, docs, and tests aligned. If a command or public API changes, update the example and the relevant docs in the same change.
- Do not treat generated runtime artifacts as source. Do not commit caches, .env files, .pytest_cache/, __pycache__/, scenario_results/, lab-topology/, .netopsbench*/, or local virtual environments.
- Do not run real Containerlab tests by default. Run them only when the user asks or when the environment is confirmed to support Docker, Containerlab, and non-interactive privileged commands.
- If you touch netopsbench/platform/*, add or update focused tests for the behavior. If you touch netopsbench/sdk/*, preserve the public API contract or document the migration.
- Use structured parsers and existing helpers instead of ad hoc string handling when the codebase already provides them.
Lightweight tests for Pingmesh, negative samples, SDK contracts, scenarios, and examples:
python -m pytest \
tests/test_pingmesh_time_scope.py \
tests/test_negative_sample_scenarios.py \
tests/test_runtime_agent_input_contract.py \
tests/test_api_sessions.py \
tests/test_scenario_schema.py \
tests/test_e2e.py \
tests/test_scenario_commands.py \
tests/test_example_agents.py

If you use a local virtual environment, replace python with that interpreter.
Run one public example from the repository root:
netopsbench benchmark prepare --scales xs
export OPENAI_API_KEY=...
PYTHONPATH=. python examples/01_run_scenario.py --vendor openai

Run the XS benchmark suite with a provider of your choice:
BENCH_VENDOR=openai BENCH_SCALES=xs bash scripts/run_all_benchmarks.sh

Before committing or handing off changes:
git diff --check
git status --short --branch

- Do not revert user changes unless explicitly requested.
- Treat untracked files as user-owned unless the task clearly asks to add them.
- Avoid merging or checking out feature branches unless the user asks for that branch.
- Keep public examples and SDK docs aligned with code changes.