
nancy[ALP-487]: Log analyzer — parse, classify, compare experiment logs#1

Merged
srobinson merged 6 commits into main from
nancy/ALP-487
Feb 2, 2026

Conversation

@srobinson
Owner

Summary

  • New src/analyze/ Python module for experiment log analysis — parses raw NDJSON Nancy logs, classifies events (navigation/fmm/task-work/boilerplate), generates comparison tables between control and treatment runs
  • 90 tests covering parsing, classification, and comparison logic
  • First experiment results: ALP-479 (with fmm) vs ALP-480 (without fmm) analyzed and committed to docs repo

Key Findings

                           Control (no fmm)    Treatment (fmm)
Reads before first edit:   25                  40
Navigation tokens:         ~30,292             ~15,884
Navigation % of total:     52%                 41%
Sidecar lookups:           n/a                 0/55
Total tokens:              ~58,310             ~39,111

The treatment run was 33% more token-efficient, but sidecar usage was zero: the agent completely ignored the fmm tools. This confirms the instruction-compliance problem.
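The headline efficiency figure can be reproduced directly from the totals in the table above:

```python
# Totals taken from the comparison table (control vs treatment).
control_total = 58_310
treatment_total = 39_111

savings = 1 - treatment_total / control_total
print(f"Treatment used {savings:.0%} fewer tokens")  # Treatment used 33% fewer tokens
```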

Files

  • src/analyze/parse_logs.py — NDJSON parser → structured events
  • src/analyze/classify.py — Event classification + phase boundary detection
  • src/analyze/compare.py — Two-condition comparison table generator
  • tests/ — 90 tests (conftest fixtures + 3 test modules)
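As a rough sketch of the parsing approach (the event schema here is assumed for illustration, not the module's actual one), an NDJSON reader tolerant of malformed lines might look like:

```python
import json

def parse_ndjson_lines(lines):
    """Parse NDJSON lines into event dicts, skipping blank or malformed lines."""
    events = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            events.append(json.loads(line))
        except json.JSONDecodeError:
            # Malformed lines are tolerated, matching the "malformed JSON
            # resilience" behavior exercised by the test suite.
            continue
    return events

raw = [
    '{"type": "tool_call", "name": "Read"}',
    'not valid json',
    '{"type": "token_usage", "output_tokens": 12}',
]
print([e["type"] for e in parse_ndjson_lines(raw)])  # ['tool_call', 'token_usage']
```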

Test plan

  • All 90 pytest tests pass
  • Ran against real experiment data (ALP-479 vs ALP-480)
  • JSON and text output modes verified
  • Edge case: empty log directories
  • CLI entry point tests (main functions)

🤖 Generated with Claude Code

srobinson and others added 4 commits February 2, 2026 17:29
New `src/analyze/` Python module for experiment log analysis:

- parse_logs.py: Parses raw NDJSON logs into structured events (tool calls,
  token usage, session boundaries). Handles multi-iteration logs.
- classify.py: Classifies tool calls into navigation/fmm_navigation/task_work/
  boilerplate categories. Identifies "first edit" boundary. Computes per-phase
  summaries with token counts and unique file tracking.
- compare.py: Takes two experiment directories, analyzes all iterations, outputs
  comparison table (human-readable + JSON). Matches EXPERIMENT_HARNESS.md format.
- analyze.sh: Bash wrapper for nancy integration.
- __main__.py: Entry point via `python3 -m src.analyze`

Token counting de-duplicates by message_id to avoid double-counting between
stream_event and assistant message events.

Tested against ALP-479 (fmm) and ALP-480 (no fmm) experiment logs.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
message_delta stream events carry output_tokens but have no message_id,
so they bypass the dedup logic in compute_phase_summary and inflate
output token counts. Since the assistant event already reports the same
output_tokens with a proper message_id, message_delta is redundant.

Also removed unused col_w variable in compare.py.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
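Taken together, the two commit messages above imply dedup logic along these lines (a sketch with assumed field names, not the actual compute_phase_summary code):

```python
def total_output_tokens(events):
    """Count each message's output tokens once, keyed by message_id."""
    seen_ids = set()
    total = 0
    for ev in events:
        if ev.get("type") != "token_usage":
            continue
        msg_id = ev.get("message_id")
        # message_delta events carry no message_id; skipping id-less events
        # avoids the inflation bug, since the assistant event reports the
        # same output_tokens under a proper message_id.
        if msg_id is None or msg_id in seen_ids:
            continue
        seen_ids.add(msg_id)
        total += ev.get("output_tokens", 0)
    return total

events = [
    {"type": "token_usage", "message_id": "m1", "output_tokens": 10},  # stream_event
    {"type": "token_usage", "message_id": "m1", "output_tokens": 10},  # assistant, same id
    {"type": "token_usage", "output_tokens": 10},                      # message_delta, no id
]
print(total_output_tokens(events))  # 10
```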
ALP-492 — Review cleanup (4 items):
- Remove unused vars total_events_before_first_edit/total_events_overall in compare.py
- Remove unnecessary f-string prefix in parse_logs.py _print_summary
- Quote shell args in analyze.sh for paths with spaces
- Add encoding='utf-8' to open() in parse_logs.py

ALP-493 — pytest test suite (90 tests):
- tests/test_parse_logs.py: NDJSON parsing, file path extraction, token events,
  malformed JSON resilience, message_delta skip, multi-tool messages
- tests/test_classify.py: All tool categories, bash pattern matching, first-edit
  boundary detection, excluded file patterns, phase assignment, token dedup
- tests/test_compare.py: Iteration aggregation, comparison table formatting,
  JSON serialization, end-to-end integration test
- Pre-compiled regex in classify.py for EXCLUDED_EDIT, NAV_BASH, TASK_BASH,
  BOILERPLATE_BASH patterns (efficiency cleanup per AC)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
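The pre-compiled regex cleanup mentioned above follows the usual pattern of hoisting re.compile to module scope so patterns compile once. The patterns below are illustrative stand-ins, not the real NAV_BASH/TASK_BASH lists from classify.py:

```python
import re

# Illustrative patterns only; the real lists live in classify.py.
NAV_BASH = re.compile(r"^\s*(ls|grep|rg|find|cat)\b")
TASK_BASH = re.compile(r"^\s*(pytest|python3?|make)\b")

def classify_bash(command: str) -> str:
    """Bucket a bash command using module-level pre-compiled patterns."""
    if NAV_BASH.match(command):
        return "navigation"
    if TASK_BASH.match(command):
        return "task_work"
    return "other"

print(classify_bash("grep -rn TODO src/"))  # navigation
print(classify_bash("pytest tests/ -q"))    # task_work
```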
When an iteration has edits but zero total tokens (e.g. all token events
deduped away), first_edit_pct stays None. The comparison table would
render "iter1 @ None%" instead of just "iter1".

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
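A guard in the formatting layer along these lines keeps the None percentage out of the table (a sketch; the real label helper in compare.py may differ):

```python
def iteration_label(name, first_edit_pct):
    """Render 'iter1 @ 43%', or just 'iter1' when the percentage is unknown."""
    if first_edit_pct is None:
        return name
    return f"{name} @ {first_edit_pct:.0f}%"

print(iteration_label("iter1", 43.2))  # iter1 @ 43%
print(iteration_label("iter1", None))  # iter1
```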
Copilot AI review requested due to automatic review settings February 2, 2026 11:02
The test suite was converted to pytest (90 tests in tests/) but the CI
workflow still expected tests/run.sh + bash unit/integration scripts.

- Add Python 3.12 + pytest setup
- Conditionally run bash tests only if tests/run.sh exists
- Conditionally run pytest only if tests/conftest.py exists

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Copilot AI left a comment


Pull request overview

This PR adds a Python-based log analysis pipeline for Nancy experiment logs, including parsing NDJSON logs into structured events, classifying events into navigation/work phases, aggregating metrics across iterations, and comparing control vs treatment runs, with a pytest suite covering the core logic.

Changes:

  • Introduces src/analyze/parse_logs.py, classify.py, and compare.py to parse raw NDJSON logs, classify tool calls and token usage, compute per-iteration summaries, and generate comparison tables/JSON.
  • Adds a CLI entry point for the analyzer via python3 -m src.analyze and a Bash wrapper src/analyze/analyze.sh intended for a nancy analyze subcommand.
  • Adds a pytest test suite (tests/test_parse_logs.py, tests/test_classify.py, tests/test_compare.py, and shared fixtures) plus .gitignore entries for Python artifacts.

Reviewed changes

Copilot reviewed 10 out of 13 changed files in this pull request and generated 7 comments.

Summary per file:

  • src/analyze/parse_logs.py — Parses raw NDJSON logs into ordered events (init, tool calls, token usage, user messages, assistant text) and provides parse_experiment_logs and a small CLI; note the docstring for parse_experiment_logs mentions skipping “review iterations” but the implementation does not currently do this.
  • src/analyze/classify.py — Classifies tool calls into categories (navigation, fmm_navigation, task_work, boilerplate, other), finds the first real edit, and computes per-phase token/tool/file summaries, including FMM vs raw navigation counts; a comment in compute_phase_summary about preferring assistant token usage over stream events does not match the current “first seen wins” dedup behavior.
  • src/analyze/compare.py — Aggregates per-iteration summaries across conditions, computes navigation and sidecar metrics, formats a human-readable comparison table and JSON output, and exposes a main() CLI for python3 -m src.analyze.
  • src/analyze/analyze.sh — Bash wrapper defining analyze::run that calls python3 -m src.analyze, documenting a nancy analyze <control_log_dir> <treatment_log_dir> [--json] CLI, but this function is not yet wired into the main nancy dispatcher or any cmd:: wrapper.
  • src/analyze/__main__.py — Package __main__ that forwards to compare.main, enabling python3 -m src.analyze <control_dir> <treatment_dir> [--json].
  • src/analyze/__init__.py — Marks src.analyze as a Python package; currently empty.
  • tests/conftest.py — Provides shared fixtures and helpers for constructing synthetic NDJSON log events (init, assistant with usage and tool blocks, user, stream events) and writing them to disk.
  • tests/test_parse_logs.py — Unit tests for parse_log_file, _extract_file_path, _extract_command, and parse_experiment_logs, covering empty/malformed inputs, assistant/tool/stream/user events, and log file filtering/sorting semantics.
  • tests/test_classify.py — Tests for tool-call classification, exclusion rules for “first edit”, bash pattern matching, event phase assignment, and phase summary computation (including FMM vs raw nav counts and token deduplication).
  • tests/test_compare.py — Tests _aggregate_iterations, comparison table formatting (labels, headers, token formatting), JSON output structure and sorting, and an end-to-end analyze_condition flow over a synthetic iter1 log.
  • tests/__init__.py — Empty module to make tests a package (enables relative imports where needed).
  • .gitignore — Adds ignores for Python bytecode (__pycache__/, *.pyc) alongside existing patterns.

"""Parse all iteration logs in a directory.

Returns dict of {iteration_name: [events]} sorted by iteration order.
Skips review iterations and non-.log files.

Copilot AI Feb 2, 2026


The docstring for parse_experiment_logs says it "Skips review iterations and non-.log files", but the implementation currently only filters out *.formatted.log, token-alerts.log, and watcher.log while including all other *.log files. Either update the docstring to match the actual behavior or add logic to skip whatever "review iterations" are supposed to be so the documentation stays accurate.

Suggested change:
- Skips review iterations and non-.log files.
+ Skips formatted logs (*.formatted.log) and special logs (token-alerts.log, watcher.log).

Comment on lines +190 to +198
    # (message_start and assistant message can report same message)
    seen_message_ids = set()

    for event in events:
        if event["type"] == "token_usage":
            msg_id = event.get("message_id")
            # Prefer assistant message token data over stream_event
            # stream_message_start comes first, assistant event comes later with same id
            # We'll take whichever we see first per message_id

Copilot AI Feb 2, 2026


The comment in compute_phase_summary says we "Prefer assistant message token data over stream_event" where stream_message_start comes first and the assistant event comes later, but the deduplication logic keeps the first token_usage event per message_id and ignores subsequent ones. To actually prefer the assistant message (which arrives later according to the comment), the implementation would need to override or replace earlier stream-based counts rather than skipping later events, or the comment should be updated to describe the current "first seen" behavior.

Suggested change:
- # (message_start and assistant message can report same message)
+ # (stream_message_start and assistant message can both report the same message)
  seen_message_ids = set()

  for event in events:
      if event["type"] == "token_usage":
          msg_id = event.get("message_id")
-         # Prefer assistant message token data over stream_event
-         # stream_message_start comes first, assistant event comes later with same id
-         # We'll take whichever we see first per message_id
+         # Deduplicate by message_id: we keep the first token_usage event we see
+         # for a given message_id (typically the stream_message_start) and ignore
+         # any later events with the same id, including assistant message updates.

Comment on lines +3 to +20
# Nancy log analyzer — compare experiment conditions
# Usage: nancy analyze <control_log_dir> <treatment_log_dir> [--json]
# ------------------------------------------------------------------------------

set -euo pipefail

analyze::run() {
  local control_dir="${1:-}"
  local treatment_dir="${2:-}"
  shift 2 || true

  if [[ -z "$control_dir" || -z "$treatment_dir" ]]; then
    echo "Usage: nancy analyze <control_log_dir> <treatment_log_dir> [--json]" >&2
    exit 1
  fi

  python3 -m src.analyze "${control_dir}" "${treatment_dir}" "$@"
}

Copilot AI Feb 2, 2026


analyze::run documents a nancy analyze <control_log_dir> <treatment_log_dir> [--json] CLI, but this function is not referenced from the main nancy dispatcher or any cmd:: wrapper, so that subcommand is not actually exposed. If this is intended to be a first-class nancy analyze command, it should be wired into the main CLI (e.g., via a cmd::analyze wrapper and a case branch) or the usage comment should be adjusted to reflect how it is meant to be invoked.


import json
import pytest
from pathlib import Path

Copilot AI Feb 2, 2026


Import of 'Path' is not used.

Suggested change:
- from pathlib import Path

"""Tests for src/analyze/compare.py — aggregation, formatting, JSON output."""

import json
from pathlib import Path

Copilot AI Feb 2, 2026


Import of 'Path' is not used.

Suggested change:
- from pathlib import Path

"""Tests for src/analyze/parse_logs.py — NDJSON parsing and event extraction."""

import json

Copilot AI Feb 2, 2026


Import of 'json' is not used.

Suggested change:
- import json

"""Tests for src/analyze/parse_logs.py — NDJSON parsing and event extraction."""

import json
from pathlib import Path

Copilot AI Feb 2, 2026


Import of 'Path' is not used.

Suggested change:
- from pathlib import Path

- parse_logs.py: Fix docstring — describe actual filtering behavior
- classify.py: Fix misleading comment — describe first-seen dedup, not
  "prefer assistant" which contradicts the implementation
- analyze.sh: Correct usage comment — not wired into nancy CLI dispatcher
- tests: Remove unused imports (Path, json) in conftest, test_compare,
  test_parse_logs

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@srobinson srobinson merged commit 2a03180 into main Feb 2, 2026
3 checks passed
