nancy[ALP-487]: Log analyzer — parse, classify, compare experiment logs #1
Conversation
New `src/analyze/` Python module for experiment log analysis:
- parse_logs.py: Parses raw NDJSON logs into structured events (tool calls, token usage, session boundaries). Handles multi-iteration logs.
- classify.py: Classifies tool calls into navigation/fmm_navigation/task_work/boilerplate categories. Identifies the "first edit" boundary. Computes per-phase summaries with token counts and unique-file tracking.
- compare.py: Takes two experiment directories, analyzes all iterations, and outputs a comparison table (human-readable + JSON). Matches the EXPERIMENT_HARNESS.md format.
- analyze.sh: Bash wrapper for nancy integration.
- __main__.py: Entry point via `python3 -m src.analyze`

Token counting de-duplicates by message_id to avoid double-counting between stream_event and assistant message events.

Tested against ALP-479 (fmm) and ALP-480 (no fmm) experiment logs.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
message_delta stream events carry output_tokens but have no message_id, so they bypass the dedup logic in compute_phase_summary and inflate output token counts. Since the assistant event already reports the same output_tokens with a proper message_id, message_delta is redundant.

Also removed the unused col_w variable in compare.py.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
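The dedup behavior described above can be sketched as follows — a minimal illustration, not the PR's actual code; the helper name `total_output_tokens` and the exact event dict shape are assumptions based on the commit message:

```python
def total_output_tokens(events):
    """Sum output tokens, de-duplicating token_usage events by message_id."""
    seen_message_ids = set()
    total = 0
    for event in events:
        if event.get("type") != "token_usage":
            continue
        msg_id = event.get("message_id")
        if msg_id is None:
            # message_delta events carry output_tokens but no message_id;
            # the assistant event reports the same tokens with a proper id,
            # so counting these would double-count.
            continue
        if msg_id in seen_message_ids:
            continue
        seen_message_ids.add(msg_id)
        total += event.get("output_tokens", 0)
    return total
```

Events without a message_id are skipped entirely rather than counted, which is the fix this commit describes.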
ALP-492 — Review cleanup (4 items):
- Remove unused vars total_events_before_first_edit/total_events_overall in compare.py
- Remove unnecessary f-string prefix in parse_logs.py _print_summary
- Quote shell args in analyze.sh for paths with spaces
- Add encoding='utf-8' to open() in parse_logs.py

ALP-493 — pytest test suite (90 tests):
- tests/test_parse_logs.py: NDJSON parsing, file path extraction, token events, malformed-JSON resilience, message_delta skip, multi-tool messages
- tests/test_classify.py: all tool categories, bash pattern matching, first-edit boundary detection, excluded file patterns, phase assignment, token dedup
- tests/test_compare.py: iteration aggregation, comparison table formatting, JSON serialization, end-to-end integration test
- Pre-compiled regex in classify.py for EXCLUDED_EDIT, NAV_BASH, TASK_BASH, BOILERPLATE_BASH patterns (efficiency cleanup per AC)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
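The pre-compiled regex cleanup above might look like the sketch below. The pattern names come from the commit message; the pattern bodies and the `classify_bash` helper are illustrative assumptions, not the PR's actual regexes:

```python
import re

# Module-level pre-compiled patterns — compiled once at import time rather
# than re-compiled on every classification call. Pattern bodies here are
# illustrative guesses, not the PR's real expressions.
NAV_BASH = re.compile(r"^\s*(ls|find|grep|rg|tree|cat)\b")
TASK_BASH = re.compile(r"^\s*(python3?|pytest|make)\b")

def classify_bash(command: str) -> str:
    """Classify a bash command into a coarse category (hypothetical helper)."""
    if NAV_BASH.search(command):
        return "navigation"
    if TASK_BASH.search(command):
        return "task_work"
    return "other"
```

Hoisting `re.compile` to module level is the standard way to avoid paying regex compilation cost inside a per-event loop.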
When an iteration has edits but zero total tokens (e.g. all token events deduped away), first_edit_pct stays None. The comparison table would render "iter1 @ None%" instead of just "iter1".

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
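The guard implied by this fix can be sketched as below — the function name and signature are hypothetical, standing in for wherever compare.py builds the iteration label:

```python
def format_iteration_label(name: str, first_edit_pct) -> str:
    """Render an iteration label for the comparison table (hypothetical helper)."""
    # first_edit_pct is None when an iteration has edits but zero counted
    # tokens (e.g. every token event was deduped away); fall back to the
    # bare name instead of rendering something like "iter1 @ None%".
    if first_edit_pct is None:
        return name
    return f"{name} @ {first_edit_pct:.0f}%"
```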
The test suite was converted to pytest (90 tests in tests/) but the CI workflow still expected tests/run.sh + bash unit/integration scripts.
- Add Python 3.12 + pytest setup
- Conditionally run bash tests only if tests/run.sh exists
- Conditionally run pytest only if tests/conftest.py exists

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Pull request overview
This PR adds a Python-based log analysis pipeline for Nancy experiment logs, including parsing NDJSON logs into structured events, classifying events into navigation/work phases, aggregating metrics across iterations, and comparing control vs treatment runs, with a pytest suite covering the core logic.
Changes:
- Introduces `src/analyze/parse_logs.py`, `classify.py`, and `compare.py` to parse raw NDJSON logs, classify tool calls and token usage, compute per-iteration summaries, and generate comparison tables/JSON.
- Adds a CLI entry point for the analyzer via `python3 -m src.analyze` and a Bash wrapper `src/analyze/analyze.sh` intended for a `nancy analyze` subcommand.
- Adds a pytest test suite (`tests/test_parse_logs.py`, `tests/test_classify.py`, `tests/test_compare.py`, and shared fixtures) plus `.gitignore` entries for Python artifacts.
Reviewed changes
Copilot reviewed 10 out of 13 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| `src/analyze/parse_logs.py` | Parses raw NDJSON logs into ordered events (init, tool calls, token usage, user messages, assistant text) and provides `parse_experiment_logs` and a small CLI; note the docstring for `parse_experiment_logs` mentions skipping "review iterations" but the implementation does not currently do this. |
| `src/analyze/classify.py` | Classifies tool calls into categories (navigation, fmm_navigation, task_work, boilerplate, other), finds the first real edit, and computes per-phase token/tool/file summaries, including FMM vs raw navigation counts; a comment in `compute_phase_summary` about preferring assistant token usage over stream events does not match the current "first seen wins" dedup behavior. |
| `src/analyze/compare.py` | Aggregates per-iteration summaries across conditions, computes navigation and sidecar metrics, formats a human-readable comparison table and JSON output, and exposes a `main()` CLI for `python3 -m src.analyze`. |
| `src/analyze/analyze.sh` | Bash wrapper defining `analyze::run` that calls `python3 -m src.analyze`, documenting a `nancy analyze <control_log_dir> <treatment_log_dir> [--json]` CLI, but this function is not yet wired into the main nancy dispatcher or any `cmd::` wrapper. |
| `src/analyze/__main__.py` | Package `__main__` that forwards to `compare.main`, enabling `python3 -m src.analyze <control_dir> <treatment_dir> [--json]`. |
| `src/analyze/__init__.py` | Marks `src.analyze` as a Python package; currently empty. |
| `tests/conftest.py` | Provides shared fixtures and helpers for constructing synthetic NDJSON log events (init, assistant with usage and tool blocks, user, stream events) and writing them to disk. |
| `tests/test_parse_logs.py` | Unit tests for `parse_log_file`, `_extract_file_path`, `_extract_command`, and `parse_experiment_logs`, covering empty/malformed inputs, assistant/tool/stream/user events, and log file filtering/sorting semantics. |
| `tests/test_classify.py` | Tests for tool-call classification, exclusion rules for "first edit", bash pattern matching, event phase assignment, and phase summary computation (including FMM vs raw nav counts and token deduplication). |
| `tests/test_compare.py` | Tests `_aggregate_iterations`, comparison table formatting (labels, headers, token formatting), JSON output structure and sorting, and an end-to-end `analyze_condition` flow over a synthetic iter1 log. |
| `tests/__init__.py` | Empty module to make `tests` a package (enables relative imports where needed). |
| `.gitignore` | Adds ignores for Python bytecode (`__pycache__/`, `*.pyc`) alongside existing patterns. |
src/analyze/parse_logs.py
Outdated
| """Parse all iteration logs in a directory. | ||
|
|
||
| Returns dict of {iteration_name: [events]} sorted by iteration order. | ||
| Skips review iterations and non-.log files. |
There was a problem hiding this comment.
The docstring for parse_experiment_logs says it "Skips review iterations and non-.log files", but the implementation currently only filters out *.formatted.log, token-alerts.log, and watcher.log while including all other *.log files. Either update the docstring to match the actual behavior or add logic to skip whatever "review iterations" are supposed to be so the documentation stays accurate.
| Skips review iterations and non-.log files. | |
| Skips formatted logs (*.formatted.log) and special logs (token-alerts.log, watcher.log). |
src/analyze/classify.py
Outdated
```python
# (message_start and assistant message can report same message)
seen_message_ids = set()

for event in events:
    if event["type"] == "token_usage":
        msg_id = event.get("message_id")
        # Prefer assistant message token data over stream_event
        # stream_message_start comes first, assistant event comes later with same id
        # We'll take whichever we see first per message_id
```

The comment in `compute_phase_summary` says we "Prefer assistant message token data over stream_event" where stream_message_start comes first and the assistant event comes later, but the deduplication logic keeps the first token_usage event per message_id and ignores subsequent ones. To actually prefer the assistant message (which arrives later according to the comment), the implementation would need to override or replace earlier stream-based counts rather than skipping later events, or the comment should be updated to describe the current "first seen" behavior.

Suggested change:

```diff
-# (message_start and assistant message can report same message)
+# (stream_message_start and assistant message can both report the same message)
 seen_message_ids = set()
 for event in events:
     if event["type"] == "token_usage":
         msg_id = event.get("message_id")
-        # Prefer assistant message token data over stream_event
-        # stream_message_start comes first, assistant event comes later with same id
-        # We'll take whichever we see first per message_id
+        # Deduplicate by message_id: we keep the first token_usage event we see
+        # for a given message_id (typically the stream_message_start) and ignore
+        # any later events with the same id, including assistant message updates.
```
src/analyze/analyze.sh

```bash
# Nancy log analyzer — compare experiment conditions
# Usage: nancy analyze <control_log_dir> <treatment_log_dir> [--json]
# ------------------------------------------------------------------------------

set -euo pipefail

analyze::run() {
  local control_dir="${1:-}"
  local treatment_dir="${2:-}"
  shift 2 || true

  if [[ -z "$control_dir" || -z "$treatment_dir" ]]; then
    echo "Usage: nancy analyze <control_log_dir> <treatment_log_dir> [--json]" >&2
    exit 1
  fi

  python3 -m src.analyze "${control_dir}" "${treatment_dir}" "$@"
}
```
analyze::run documents a nancy analyze <control_log_dir> <treatment_log_dir> [--json] CLI, but this function is not referenced from the main nancy dispatcher or any cmd:: wrapper, so that subcommand is not actually exposed. If this is intended to be a first-class nancy analyze command, it should be wired into the main CLI (e.g., via a cmd::analyze wrapper and a case branch) or the usage comment should be adjusted to reflect how it is meant to be invoked.
tests/conftest.py
Outdated
```python
import json
import pytest
from pathlib import Path
```

Import of 'Path' is not used.

Suggested change:

```diff
-from pathlib import Path
```
tests/test_compare.py
Outdated
| """Tests for src/analyze/compare.py — aggregation, formatting, JSON output.""" | ||
|
|
||
| import json | ||
| from pathlib import Path |
There was a problem hiding this comment.
Import of 'Path' is not used.
| from pathlib import Path |
tests/test_parse_logs.py
Outdated
```python
"""Tests for src/analyze/parse_logs.py — NDJSON parsing and event extraction."""

import json
```

Import of 'json' is not used.

Suggested change:

```diff
-import json
```
tests/test_parse_logs.py
Outdated
| """Tests for src/analyze/parse_logs.py — NDJSON parsing and event extraction.""" | ||
|
|
||
| import json | ||
| from pathlib import Path |
There was a problem hiding this comment.
Import of 'Path' is not used.
| from pathlib import Path |
- parse_logs.py: Fix docstring — describe actual filtering behavior
- classify.py: Fix misleading comment — describe first-seen dedup, not "prefer assistant" which contradicts the implementation
- analyze.sh: Correct usage comment — not wired into nancy CLI dispatcher
- tests: Remove unused imports (Path, json) in conftest, test_compare, test_parse_logs

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Summary
New `src/analyze/` Python module for experiment log analysis — parses raw NDJSON Nancy logs, classifies events (navigation/fmm/task-work/boilerplate), and generates comparison tables between control and treatment runs.

Key Findings
Treatment was 33% more token-efficient but zero sidecar usage — the agent completely ignored fmm tools. The instruction compliance problem is confirmed.
Files
- `src/analyze/parse_logs.py` — NDJSON parser → structured events
- `src/analyze/classify.py` — Event classification + phase boundary detection
- `src/analyze/compare.py` — Two-condition comparison table generator
- `tests/` — 90 tests (conftest fixtures + 3 test modules)

Test plan
🤖 Generated with Claude Code