- A `<testcase>` with a `<failure>` child is a failed test. The `message` attribute and text content describe the failure.
- A `<testcase>` without `<failure>` is a pass.
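
The two rules above map directly onto a short parser. The following is a minimal sketch, assuming Python's standard `xml.etree.ElementTree`; the `split_results` helper and the sample XML are illustrative and not part of this PR. It applies only the two rules stated here, so `<error>` and `<skipped>` children get no special handling.

```python
import xml.etree.ElementTree as ET

# Illustrative sample, not taken from the repository.
SAMPLE = """<testsuite name="ospf" tests="2" failures="1">
  <testcase classname="ospf" name="test_route_present"/>
  <testcase classname="ospf" name="test_adjacency_full">
    <failure message="adjacency not FULL">Neighbor state is INIT, not FULL.</failure>
  </testcase>
</testsuite>"""


def split_results(xml_text):
    """Return (passed_names, failures) following the two rules above."""
    root = ET.fromstring(xml_text)
    passed, failed = [], []
    for case in root.iter("testcase"):
        failure = case.find("failure")
        if failure is None:
            # No <failure> child: the test passed.
            passed.append(case.get("name"))
        else:
            # The message attribute and the text content describe the failure.
            failed.append({
                "name": case.get("name"),
                "message": failure.get("message", ""),
                "detail": (failure.text or "").strip(),
            })
    return passed, failed


passed, failed = split_results(SAMPLE)
```

The explicit `is None` check matters here: an `Element` with no children is falsy, so `if not failure` could misclassify a `<failure>` element that has only text.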
@@ -73,6 +73,6 @@ Produce a concise report for the investigated failure:
5. **RFC basis**: the protocol rule that explains the failure
6. **Recovery status**: is the network still broken or has it been fixed?

-If there are remaining uninvestigated failures, re-present the list (without the one just investigated) and ask the user to pick the next one. Repeat until all failures are investigated or the user stops.
+**IMPORTANT — always loop back.** After the report, if there are remaining uninvestigated failures, you MUST immediately re-present the remaining failure list and ask the user to pick the next one — do not wait for the user to ask. The user acknowledging a fix ("ok", "got it", "I'll do that") is NOT a signal to stop. Only stop looping if the user explicitly declines (e.g. "that's all", "no more", "skip the rest") or all failures have been investigated.

After investigating a failure, if its root cause likely explains other failures still on the list, say so — the user may choose to skip those.
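
The loop-back rule can be read as a small control loop. The sketch below is only an illustration under stated assumptions: `investigation_loop`, `is_explicit_decline`, the substring matching, and the scripted `replies` are all hypothetical, and the decline phrases are just the examples quoted in the rule. The real agent interprets replies in natural language rather than by string matching.

```python
# Decline phrases copied from the rule's examples; matching by substring is an assumption.
DECLINE_PHRASES = ("that's all", "no more", "skip the rest")


def is_explicit_decline(reply):
    """True only when the reply contains one of the explicit decline phrases."""
    text = reply.strip().lower()
    return any(phrase in text for phrase in DECLINE_PHRASES)


def investigation_loop(failures, replies):
    """Simulate the loop with scripted user replies; return failures in the order investigated."""
    remaining = list(failures)
    investigated = []
    replies = iter(replies)
    while remaining:
        # The remaining list is re-presented here on every pass.
        reply = next(replies, "that's all")  # running out of scripted input ends the simulation
        if is_explicit_decline(reply):
            break
        if reply.isdigit() and 1 <= int(reply) <= len(remaining):
            investigated.append(remaining.pop(int(reply) - 1))
        # Any other reply ("ok", "got it") is an acknowledgement, so the loop simply continues.
    return investigated


order = investigation_loop(["a", "b", "c"], ["2", "ok", "1", "that's all"])
```

In this run the user picks the second failure, acknowledges the fix, picks again, and then declines, so one failure is left uninvestigated by explicit choice.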
CLAUDE.md: 4 additions & 0 deletions
@@ -66,6 +66,10 @@ State clearly:
- What the root cause is (or what further information is needed)
- What the recommended fix is (configuration direction only — never push changes)

+### Multi-failure investigations
+
+When investigating multiple failures (e.g. via `/qa`), always loop back after each finding. Present remaining uninvestigated failures and ask the user to pick the next one. The user acknowledging a fix is not a signal to stop — only stop when the user explicitly declines or all failures are covered.
+
## Constraints

- **Read-only.** Never suggest commands that change device configuration. Diagnosis and direction only.
metadata/workflow/WORKFLOW.md: 8 additions & 113 deletions
@@ -48,85 +48,13 @@ See [OPTIMIZATIONS.md](../scalability/OPTIMIZATIONS.md) for the full RAG optimiz

---

-## Interactive Investigation
-
-The user asks a question in Claude Code. The agent follows the diagnostic workflow defined in `CLAUDE.md`:
-
-### Step 0 — Preflight
-
-```
-get_status()
-```
-
-Confirms which backends are active: inventory (device count), intent (router count), and ChromaDB availability. Displayed as a table before any investigation begins.
-
-### Step 1 — Load the Protocol Skill
-
-The agent reads the relevant skill file before starting. Skill files contain decision trees and query sequences — the agent follows them, it does not improvise.
-
-| When to use | Skill file |
-|-------------|-----------|
-| Adjacency, neighbor state, LSDB, area type | `skills/ospf/SKILL.md` |
-| Path selection, PBR, route-maps, prefix-lists, AD conflicts | `skills/routing/SKILL.md` |
-| Reachability ("can't reach X from Y") | Start with `traceroute` to find the breaking hop, then load the appropriate skill |
-
-### Step 2 — Search the Knowledge Base
-
-```
-search_knowledge_base(query="OSPF neighbor stuck in INIT", topic="rfc", protocol="ospf")
-```
-
-Returns RFC text and vendor documentation relevant to the issue. The `protocol` filter eliminates cross-protocol noise. The embedding model maps the question to nearby chunks even when the exact words differ.
-
-### Step 3 — Query Live Devices
-
-The agent queries the devices involved in the issue:
-
-```
-query_intent(device="D1C")  # what SHOULD the network look like?
-get_ospf("D1C", "neighbors")  # what DOES it look like?
-traceroute("E1C", "192.168.42.1")  # where does the path break?
-```
-
-The skill file dictates which queries to run and in what order. For OSPF adjacency issues, the checklist is: timers → area type → network type → auth → passive → MTU → interface state. Stop at the first mismatch.
-
-### Step 4 — Synthesize
-
-The agent combines knowledge base context with live data. When they conflict, live data wins. The report states:
-
-- What the data shows
-- Root cause with RFC citation
-- Fix direction (configuration guidance only — YANA never pushes config)
-
-### Example
-
-```
-User: "Why can't E1C reach A2A's loopback?"
-
-Agent:
-1. get_status() → inventory, intent, ChromaDB all active
-2. Reads skills/ospf/SKILL.md
-3. get_routing("E1C", "ip_route") → 192.168.42.1 missing from VRF1
-4. get_ospf("E1C", "database") → No Type 3 LSA for 192.168.42.1
-5. query_intent() → A2A should be in Area 1 (stub), connected via D1C/D2B
-6. get_ospf("D1C", "neighbors") → D1C has no adjacency with A2A
-7. get_ospf("A2A", "interfaces") → A2A's Area 1 is "normal", not stub
-causes Hellos to be silently discarded. Fix: add stub config to A2A.
-```
-
----
-
## QA Investigation

Run your tests with any framework. When something fails, YANA investigates.

### Test Results

-YANA reads JUnit XML results from `results/`. JUnit XML is the de facto standard — produced by pytest (`--junitxml`), pyATS (`--xunit`), Robot Framework (`--xunit`), Ansible (junit callback), and most other test runners.
+YANA reads JUnit XML results from `results/`. JUnit XML is the de facto standard — produced by pytest (`--junitxml`), pyATS (`--xunit`), Robot Framework (`--xunit`), and most other test runners.

Place your test results in `results/` as `.xml` files. YANA doesn't care how the tests were run — it only needs the results.

@@ -142,52 +70,19 @@ When tests fail, the user runs `/qa` in Claude Code. The skill (`.claude/skills/
4. Present numbered failure list to the user
5. User picks a failure to investigate
6. Agent reads test context from <properties> (device, rfc_ref, description)
-7. Agent runs the same diagnostic workflow as interactive mode:
+7. Agent investigates:
+   - get_status() → confirm backends are active
+   - Load protocol skill (skills/ospf/SKILL.md or skills/routing/SKILL.md)
   - query_intent() → expected state
-   - get_ospf/get_routing/get_interfaces → live state
-   - Follows skill decision trees to trace the root cause
+   - get_ospf/get_routing/get_interfaces/traceroute → live state
+   - Follow skill decision trees to trace the root cause
9. Re-presents remaining failures — user picks next, or stops
```
If multiple failures share a root cause, the agent says so after investigating the first one — the user can skip the rest.
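
Step 6 of the workflow above reads per-test context from `<properties>`. The following is a minimal sketch of that lookup, assuming Python's standard `xml.etree.ElementTree` and a `<properties>` element nested inside `<testcase>`. The `read_context` helper, the test name, and the `rfc_ref` value are illustrative placeholders, not taken from the repository.

```python
import xml.etree.ElementTree as ET

# Illustrative fixture; the rfc_ref value and the test name are placeholders.
SAMPLE = """<testcase classname="routing" name="test_route_to_a2a_loopback">
  <properties>
    <property name="device" value="E1C"/>
    <property name="rfc_ref" value="RFC 2328"/>
    <property name="description" value="Verify E1C has route to A2A loopback 192.168.42.1 in VRF1"/>
  </properties>
  <failure message="route missing">192.168.42.1/32 is not present in the RIB.</failure>
</testcase>"""


def read_context(testcase):
    """Collect the three context properties; a missing property maps to None."""
    props = {p.get("name"): p.get("value")
             for p in testcase.iterfind("properties/property")}
    return {key: props.get(key) for key in ("device", "rfc_ref", "description")}


context = read_context(ET.fromstring(SAMPLE))
```

Returning `None` for an absent property lets the caller decide whether to fall back to the failure message or ask the user which device to query.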

----
-
-## Architecture Summary
+### Interactive Mode

-```
-┌─────────────────────────────────────────┐
-│ Claude Code (UI) │
-│ │
-│ Interactive: User asks a question │
-│ QA: User runs /qa after tests │
-└──────────────┬──────────────────────────┘
-│ MCP protocol
-┌──────────────▼──────────────────────────┐
-│ YANA MCP Server │
-│ server/MCPServer.py │
-│ │
-│ 8 tools registered via FastMCP │
-└──┬───────┬───────┬───────┬──────────────┘
-│ │ │ │
-┌────────▼──┐ ┌──▼────┐ ┌▼─────┐ ┌▼──────────┐
-│ SSH tools │ │ RAG │ │Intent│ │ Status │
-│ get_ospf │ │search │ │query │ │ get_status│
-│ get_routing│ │_kb │ │_intent││ list_dev │
-│ get_intf │ │ │ │ │ │ │
-│ traceroute │ │ │ │ │ │ │
-└─────┬──────┘ └───┬───┘ └──┬───┘ └───────────┘
-│ │ │
-┌──────────▼──┐ ┌─────▼───┐ ┌──▼──────────┐
-│ Scrapli SSH │ │ChromaDB │ │ JSON files │
-│ 6 vendors │ │ + MiniLM│ │ data/*.json │
-│ env creds │ │ │ │ │
-└──────────────┘ └─────────┘ └─────────────┘
-
-Test runners (separate process, not MCP):
-pytest, pyATS, Ansible, Robot Framework, etc.
-→ JUnit XML results in results/
-→ Consumed by /qa skill in Claude
-```
+YANA also handles ad-hoc questions outside the QA workflow. The user asks a question directly (e.g. "Why can't E1C reach A2A's loopback?") and the agent follows the same diagnostic process: preflight check via `get_status()`, load the relevant protocol skill, query live devices, search the knowledge base, and synthesize a report with root cause and RFC citation. The full interactive workflow is defined in `CLAUDE.md`.
+<property name="description" value="Verify E1C has route to A2A loopback 192.168.42.1 in VRF1"/>
+</properties>
+<failure message="Verify E1C has route to A2A loopback 192.168.42.1 in VRF1">Assertion route_exists("192.168.42.1") returned False. NETCONF response contains VRF1 routing entries but 192.168.42.1/32 is not present in the RIB.</failure>
+<property name="description" value="Verify E1C has FULL OSPF adjacency with C1J (router-id 22.22.22.11)"/>
+</properties>
+<failure message="Verify E1C has FULL OSPF adjacency with C1J (router-id 22.22.22.11)">Assertion ospf_neighbor_full("22.22.22.11") returned False. Neighbor 22.22.22.11 found but adjacency state is INIT, not FULL.</failure>