nancy[ALP-487]: Log analyzer — parse, classify, compare experiment logs #1
Conversation
New `src/analyze/` Python module for experiment log analysis:
- parse_logs.py: Parses raw NDJSON logs into structured events (tool calls, token usage, session boundaries). Handles multi-iteration logs.
- classify.py: Classifies tool calls into navigation/fmm_navigation/task_work/boilerplate categories. Identifies the "first edit" boundary. Computes per-phase summaries with token counts and unique-file tracking.
- compare.py: Takes two experiment directories, analyzes all iterations, and outputs a comparison table (human-readable + JSON). Matches the EXPERIMENT_HARNESS.md format.
- analyze.sh: Bash wrapper for nancy integration.
- __main__.py: Entry point via `python3 -m src.analyze`

Token counting de-duplicates by message_id to avoid double-counting between stream_event and assistant message events.

Tested against ALP-479 (fmm) and ALP-480 (no fmm) experiment logs.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
message_delta stream events carry output_tokens but have no message_id, so they bypass the dedup logic in compute_phase_summary and inflate output token counts. Since the assistant event already reports the same output_tokens with a proper message_id, message_delta is redundant.

Also removed the unused col_w variable in compare.py.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
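The dedup behavior described above can be sketched as follows — a minimal illustration, not the PR's actual code; the helper name `total_output_tokens` and the exact event dict shape are assumptions based on the commit message:

```python
def total_output_tokens(events):
    """Sum output tokens, de-duplicating token_usage events by message_id."""
    seen_message_ids = set()
    total = 0
    for event in events:
        if event.get("type") != "token_usage":
            continue
        msg_id = event.get("message_id")
        if msg_id is None:
            # message_delta events carry output_tokens but no message_id;
            # the assistant event reports the same tokens with a proper id,
            # so counting these would double-count.
            continue
        if msg_id in seen_message_ids:
            continue
        seen_message_ids.add(msg_id)
        total += event.get("output_tokens", 0)
    return total
```

Events without a message_id are skipped entirely rather than counted, which is the fix this commit describes.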
ALP-492 — Review cleanup (4 items):
- Remove unused vars total_events_before_first_edit/total_events_overall in compare.py
- Remove unnecessary f-string prefix in parse_logs.py _print_summary
- Quote shell args in analyze.sh for paths with spaces
- Add encoding='utf-8' to open() in parse_logs.py

ALP-493 — pytest test suite (90 tests):
- tests/test_parse_logs.py: NDJSON parsing, file path extraction, token events, malformed-JSON resilience, message_delta skip, multi-tool messages
- tests/test_classify.py: all tool categories, bash pattern matching, first-edit boundary detection, excluded file patterns, phase assignment, token dedup
- tests/test_compare.py: iteration aggregation, comparison table formatting, JSON serialization, end-to-end integration test
- Pre-compiled regex in classify.py for EXCLUDED_EDIT, NAV_BASH, TASK_BASH, BOILERPLATE_BASH patterns (efficiency cleanup per AC)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
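The pre-compiled regex cleanup above might look like the sketch below. The pattern names come from the commit message; the pattern bodies and the `classify_bash` helper are illustrative assumptions, not the PR's actual regexes:

```python
import re

# Module-level pre-compiled patterns — compiled once at import time rather
# than re-compiled on every classification call. Pattern bodies here are
# illustrative guesses, not the PR's real expressions.
NAV_BASH = re.compile(r"^\s*(ls|find|grep|rg|tree|cat)\b")
TASK_BASH = re.compile(r"^\s*(python3?|pytest|make)\b")

def classify_bash(command: str) -> str:
    """Classify a bash command into a coarse category (hypothetical helper)."""
    if NAV_BASH.search(command):
        return "navigation"
    if TASK_BASH.search(command):
        return "task_work"
    return "other"
```

Hoisting `re.compile` to module level is the standard way to avoid paying regex compilation cost inside a per-event loop.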
When an iteration has edits but zero total tokens (e.g. all token events deduped away), first_edit_pct stays None. The comparison table would render "iter1 @ None%" instead of just "iter1".

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
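The guard implied by this fix can be sketched as below — the function name and signature are hypothetical, standing in for wherever compare.py builds the iteration label:

```python
def format_iteration_label(name: str, first_edit_pct) -> str:
    """Render an iteration label for the comparison table (hypothetical helper)."""
    # first_edit_pct is None when an iteration has edits but zero counted
    # tokens (e.g. every token event was deduped away); fall back to the
    # bare name instead of rendering something like "iter1 @ None%".
    if first_edit_pct is None:
        return name
    return f"{name} @ {first_edit_pct:.0f}%"
```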
The test suite was converted to pytest (90 tests in tests/) but the CI workflow still expected tests/run.sh + bash unit/integration scripts.
- Add Python 3.12 + pytest setup
- Conditionally run bash tests only if tests/run.sh exists
- Conditionally run pytest only if tests/conftest.py exists

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Pull request overview
This PR adds a Python-based log analysis pipeline for Nancy experiment logs, including parsing NDJSON logs into structured events, classifying events into navigation/work phases, aggregating metrics across iterations, and comparing control vs treatment runs, with a pytest suite covering the core logic.
Changes:
- Introduces `src/analyze/parse_logs.py`, `classify.py`, and `compare.py` to parse raw NDJSON logs, classify tool calls and token usage, compute per-iteration summaries, and generate comparison tables/JSON.
- Adds a CLI entry point for the analyzer via `python3 -m src.analyze` and a Bash wrapper `src/analyze/analyze.sh` intended for a `nancy analyze` subcommand.
- Adds a pytest test suite (`tests/test_parse_logs.py`, `tests/test_classify.py`, `tests/test_compare.py`, and shared fixtures) plus `.gitignore` entries for Python artifacts.
Reviewed changes
Copilot reviewed 10 out of 13 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| `src/analyze/parse_logs.py` | Parses raw NDJSON logs into ordered events (init, tool calls, token usage, user messages, assistant text) and provides `parse_experiment_logs` and a small CLI; note the docstring for `parse_experiment_logs` mentions skipping "review iterations" but the implementation does not currently do this. |
| `src/analyze/classify.py` | Classifies tool calls into categories (navigation, fmm_navigation, task_work, boilerplate, other), finds the first real edit, and computes per-phase token/tool/file summaries, including FMM vs raw navigation counts; a comment in `compute_phase_summary` about preferring assistant token usage over stream events does not match the current "first seen wins" dedup behavior. |
| `src/analyze/compare.py` | Aggregates per-iteration summaries across conditions, computes navigation and sidecar metrics, formats a human-readable comparison table and JSON output, and exposes a `main()` CLI for `python3 -m src.analyze`. |
| `src/analyze/analyze.sh` | Bash wrapper defining `analyze::run` that calls `python3 -m src.analyze`, documenting a `nancy analyze <control_log_dir> <treatment_log_dir> [--json]` CLI, but this function is not yet wired into the main nancy dispatcher or any `cmd::` wrapper. |
| `src/analyze/__main__.py` | Package `__main__` that forwards to `compare.main`, enabling `python3 -m src.analyze <control_dir> <treatment_dir> [--json]`. |
| `src/analyze/__init__.py` | Marks `src.analyze` as a Python package; currently empty. |
| `tests/conftest.py` | Provides shared fixtures and helpers for constructing synthetic NDJSON log events (init, assistant with usage and tool blocks, user, stream events) and writing them to disk. |
| `tests/test_parse_logs.py` | Unit tests for `parse_log_file`, `_extract_file_path`, `_extract_command`, and `parse_experiment_logs`, covering empty/malformed inputs, assistant/tool/stream/user events, and log file filtering/sorting semantics. |
| `tests/test_classify.py` | Tests for tool-call classification, exclusion rules for "first edit", bash pattern matching, event phase assignment, and phase summary computation (including FMM vs raw nav counts and token deduplication). |
| `tests/test_compare.py` | Tests `_aggregate_iterations`, comparison table formatting (labels, headers, token formatting), JSON output structure and sorting, and an end-to-end `analyze_condition` flow over a synthetic iter1 log. |
| `tests/__init__.py` | Empty module to make `tests` a package (enables relative imports where needed). |
| `.gitignore` | Adds ignores for Python bytecode (`__pycache__/`, `*.pyc`) alongside existing patterns. |
src/analyze/parse_logs.py
Outdated
| """Parse all iteration logs in a directory. | ||
|
|
||
| Returns dict of {iteration_name: [events]} sorted by iteration order. | ||
| Skips review iterations and non-.log files. |
There was a problem hiding this comment.
The docstring for parse_experiment_logs says it "Skips review iterations and non-.log files", but the implementation currently only filters out *.formatted.log, token-alerts.log, and watcher.log while including all other *.log files. Either update the docstring to match the actual behavior or add logic to skip whatever "review iterations" are supposed to be so the documentation stays accurate.
| Skips review iterations and non-.log files. | |
| Skips formatted logs (*.formatted.log) and special logs (token-alerts.log, watcher.log). |
src/analyze/classify.py
Outdated
```python
# (message_start and assistant message can report same message)
seen_message_ids = set()

for event in events:
    if event["type"] == "token_usage":
        msg_id = event.get("message_id")
        # Prefer assistant message token data over stream_event
        # stream_message_start comes first, assistant event comes later with same id
        # We'll take whichever we see first per message_id
```

The comment in `compute_phase_summary` says we "Prefer assistant message token data over stream_event" where stream_message_start comes first and the assistant event comes later, but the deduplication logic keeps the first token_usage event per message_id and ignores subsequent ones. To actually prefer the assistant message (which arrives later according to the comment), the implementation would need to override or replace earlier stream-based counts rather than skipping later events, or the comment should be updated to describe the current "first seen" behavior.

Suggested change:

```diff
-# (message_start and assistant message can report same message)
+# (stream_message_start and assistant message can both report the same message)
 seen_message_ids = set()
 for event in events:
     if event["type"] == "token_usage":
         msg_id = event.get("message_id")
-        # Prefer assistant message token data over stream_event
-        # stream_message_start comes first, assistant event comes later with same id
-        # We'll take whichever we see first per message_id
+        # Deduplicate by message_id: we keep the first token_usage event we see
+        # for a given message_id (typically the stream_message_start) and ignore
+        # any later events with the same id, including assistant message updates.
```
src/analyze/analyze.sh

```bash
# Nancy log analyzer — compare experiment conditions
# Usage: nancy analyze <control_log_dir> <treatment_log_dir> [--json]
# ------------------------------------------------------------------------------

set -euo pipefail

analyze::run() {
  local control_dir="${1:-}"
  local treatment_dir="${2:-}"
  shift 2 || true

  if [[ -z "$control_dir" || -z "$treatment_dir" ]]; then
    echo "Usage: nancy analyze <control_log_dir> <treatment_log_dir> [--json]" >&2
    exit 1
  fi

  python3 -m src.analyze "${control_dir}" "${treatment_dir}" "$@"
}
```
analyze::run documents a nancy analyze <control_log_dir> <treatment_log_dir> [--json] CLI, but this function is not referenced from the main nancy dispatcher or any cmd:: wrapper, so that subcommand is not actually exposed. If this is intended to be a first-class nancy analyze command, it should be wired into the main CLI (e.g., via a cmd::analyze wrapper and a case branch) or the usage comment should be adjusted to reflect how it is meant to be invoked.
tests/conftest.py
Outdated
```python
import json
import pytest
from pathlib import Path
```

Import of 'Path' is not used.

Suggested change:

```diff
-from pathlib import Path
```
tests/test_compare.py
Outdated
| """Tests for src/analyze/compare.py — aggregation, formatting, JSON output.""" | ||
|
|
||
| import json | ||
| from pathlib import Path |
There was a problem hiding this comment.
Import of 'Path' is not used.
| from pathlib import Path |
tests/test_parse_logs.py
Outdated
```python
"""Tests for src/analyze/parse_logs.py — NDJSON parsing and event extraction."""

import json
```

Import of 'json' is not used.

Suggested change:

```diff
-import json
```
tests/test_parse_logs.py
Outdated
| """Tests for src/analyze/parse_logs.py — NDJSON parsing and event extraction.""" | ||
|
|
||
| import json | ||
| from pathlib import Path |
There was a problem hiding this comment.
Import of 'Path' is not used.
| from pathlib import Path |
- parse_logs.py: Fix docstring — describe actual filtering behavior
- classify.py: Fix misleading comment — describe first-seen dedup, not "prefer assistant" which contradicts the implementation
- analyze.sh: Correct usage comment — not wired into nancy CLI dispatcher
- tests: Remove unused imports (Path, json) in conftest, test_compare, test_parse_logs

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Summary
New `src/analyze/` Python module for experiment log analysis — parses raw NDJSON Nancy logs, classifies events (navigation/fmm/task-work/boilerplate), and generates comparison tables between control and treatment runs.

Key Findings
Treatment was 33% more token-efficient but zero sidecar usage — the agent completely ignored fmm tools. The instruction compliance problem is confirmed.
Files
- `src/analyze/parse_logs.py` — NDJSON parser → structured events
- `src/analyze/classify.py` — Event classification + phase boundary detection
- `src/analyze/compare.py` — Two-condition comparison table generator
- `tests/` — 90 tests (conftest fixtures + 3 test modules)

Test plan
🤖 Generated with Claude Code