Refactor: Metrics Exporter #53

Open
philippkaser wants to merge 5 commits into virtUOS:main from philippkaser:historic-analysis-refactor

@philippkaser

Summary

Major refactor of the metrics export script with improved architecture, better error handling, and enhanced token attribution logic.


✨ Improvements

Architecture & Code Quality

  • ✅ Modular class-based design (MongoDBManager, MariaDBManager, MetricsCalculator, etc.)
  • ✅ Proper logging with configurable levels (replaces print statements)
  • ✅ Environment variable configuration via .env file
  • ✅ MySQL/MariaDB connection retry logic
  • ✅ Comprehensive CLI with argparse (help text, examples, etc.)
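
For reviewers who want a picture of the connection retry logic, here is a minimal sketch. It is not the PR's code: `connect_with_retry`, its parameters, and the linear backoff are assumptions made for illustration, and `connect` stands in for whatever callable opens the MariaDB connection.

```python
import logging
import time

logger = logging.getLogger("metrics_exporter")  # logger name is an assumption


def connect_with_retry(connect, attempts=3, delay_seconds=1.0, sleep=time.sleep):
    """Call `connect()` until it succeeds or `attempts` tries have failed."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return connect()
        except Exception as exc:  # a real script would catch the driver's error type
            last_error = exc
            logger.warning("connection attempt %d/%d failed: %s", attempt, attempts, exc)
            if attempt < attempts:
                sleep(delay_seconds * attempt)  # linear backoff between tries
    raise ConnectionError(f"gave up after {attempts} attempts") from last_error
```

Passing `sleep` as a parameter keeps the helper testable without real delays.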

Functionality

  • Incremental sync: Only processes missing dates by default
  • Force mode: --force flag to re-process existing dates
  • Retention management: --cleanup flag to remove data outside range
  • Memory optimization: Single bulk fetch instead of per-day queries
  • Better date handling: Flexible date range options (--days, --start-date/--end-date)
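
The flags above could fit together roughly as follows. This is a sketch under assumptions: the flag names come from this description, but the default of 30 days, the helper names (`build_parser`, `resolve_range`, `dates_to_process`), and the way existing dates are passed in are hypothetical.

```python
import argparse
from datetime import date, timedelta


def build_parser():
    parser = argparse.ArgumentParser(description="Export daily metrics (sketch).")
    parser.add_argument("--days", type=int, default=30,  # default is an assumption
                        help="number of days to cover, ending today")
    parser.add_argument("--start-date", type=date.fromisoformat, help="YYYY-MM-DD")
    parser.add_argument("--end-date", type=date.fromisoformat, help="YYYY-MM-DD")
    parser.add_argument("--force", action="store_true",
                        help="re-process dates that already exist")
    parser.add_argument("--cleanup", action="store_true",
                        help="remove data outside the selected range")
    return parser


def resolve_range(args, today):
    """Prefer an explicit start/end pair, else fall back to the last N days."""
    if args.start_date and args.end_date:
        return args.start_date, args.end_date
    return today - timedelta(days=args.days - 1), today


def dates_to_process(start, end, existing, force=False):
    """Incremental sync: skip dates already stored unless force is set."""
    wanted = [start + timedelta(days=i) for i in range((end - start).days + 1)]
    if force:
        return wanted
    return [d for d in wanted if d not in existing]
```

With `--start-date 2026-04-01 --end-date 2026-04-03` and 2026-04-02 already stored, the default run would process only April 1 and April 3, while `--force` would process all three days.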

Changes & Behavioral Differences

1. Token Attribution Logic (MAJOR CHANGE)

Old behavior:

  • Simple dict split (prompt/completion) or 50/50 for integer tokens
  • Counted tokens once per message

New behavior:

  • Parent-child message relationship tracking
  • User message tokens attributed as INPUT to all AI responses (including regenerations)
  • AI response tokens counted as OUTPUT
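
To make the new attribution rule concrete, here is a minimal sketch. It assumes each message is a dict with `messageId`, `parentMessageId`, `isCreatedByUser`, and an integer `tokenCount`; the `tokenCount` field name and the function name are assumptions, and the script itself may treat dict-shaped token counts differently.

```python
from collections import defaultdict


def attribute_tokens(messages):
    """Sum INPUT and OUTPUT tokens using parent-child message links."""
    users_by_id = {}
    ai_by_parent = defaultdict(list)  # parentMessageId -> AI replies, built once
    for msg in messages:
        message_id = msg.get("messageId")
        if message_id is None:
            continue  # skip malformed messages instead of raising KeyError
        if msg.get("isCreatedByUser"):
            users_by_id[message_id] = msg
        else:
            ai_by_parent[msg.get("parentMessageId")].append(msg)

    totals = {"input": 0, "output": 0}
    for parent_id, replies in ai_by_parent.items():
        parent = users_by_id.get(parent_id)
        prompt_tokens = parent.get("tokenCount", 0) if parent else 0
        for reply in replies:
            # The prompt is charged once per AI reply, so N regenerations
            # of the same prompt count its tokens N times as input.
            totals["input"] += prompt_tokens
            totals["output"] += reply.get("tokenCount", 0)
    return totals
```

For a 10-token user message with two AI replies of 5 and 7 tokens, this yields 20 input tokens and 12 output tokens.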

2. Unique User Calculation

Old: Counted all messages (user + AI)
New: Only counts messages where isCreatedByUser=true
Impact: User counts will be more accurate but numerically different
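
A small sketch of the new rule, assuming each message dict carries the author in a `user` field (that field name and the function name are assumptions, not confirmed by this PR):

```python
def unique_users(messages):
    """Count distinct users, looking only at messages created by a user."""
    return len({
        msg["user"]
        for msg in messages
        if msg.get("isCreatedByUser") and msg.get("user") is not None
    })
```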

3. Messages by Model

Old: Counted all messages (user messages as "unknown")
New: Only counts AI response messages with token counts
Impact: Message counts exclude user messages entirely
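
Read literally, the new rule could be sketched as below. The `model` and `tokenCount` field names are assumptions, as is the reading of "with token counts" as "has a non-zero token count".

```python
from collections import Counter


def messages_by_model(messages):
    """Count AI responses per model, ignoring user messages."""
    counts = Counter()
    for msg in messages:
        if msg.get("isCreatedByUser"):
            continue  # user messages are no longer counted
        if not msg.get("tokenCount"):
            continue  # assumed: responses without a token count are skipped
        counts[msg.get("model", "unknown")] += 1
    return counts
```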

4. MongoDB Read Preference

Old: secondary (always read from replica secondaries)
New: primaryPreferred (read from primary, fallback to secondary)
Impact: Better for single-instance deployments, but may increase primary load

5. Data Clearing Strategy

Old: Always truncates all tables before processing
New: Incremental by default, only clears with --force flag
Impact: Safer, but use --force --cleanup to match old behavior

@Odrec
Collaborator

Odrec commented Apr 21, 2026

Thanks for the PR. Sorry it took so long to review.

Some remarks from the review:

  1. Can you add a one-paragraph "migration notes" to the README change: "if you were running a previous version, run --force --cleanup once; daily_messages_by_model no longer includes user messages; daily user counts may shift slightly"?
  2. Replace the nested for ai_msg in ai_responses loop with a parentMessageId → [ai] index built once per sync, so that both performance and correctness hold up over large windows.
  3. Clarify the regeneration attribution choice in code comments + docs: "input tokens are counted once per AI reply, so N regenerations multiply input token cost N×."
  4. Guard message['messageId'] → message.get('messageId'); skip if absent.
  5. Drop commented-out print lines and the unused delete_future_dates.
  6. Decide: should readPreference be configurable via env var? That way secondary-only replica-set users don't regress.
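
For item 6, one possible shape is sketched below. The variable name `MONGO_READ_PREFERENCE` and the function name are hypothetical; the five accepted values are MongoDB's standard read preference modes, and the default mirrors the `primaryPreferred` choice in this PR.

```python
import os

VALID_READ_PREFERENCES = {
    "primary",
    "primaryPreferred",
    "secondary",
    "secondaryPreferred",
    "nearest",
}


def read_preference_from_env(environ=os.environ, default="primaryPreferred"):
    """Return the configured read preference, validated against known modes."""
    value = environ.get("MONGO_READ_PREFERENCE", default)  # hypothetical variable name
    if value not in VALID_READ_PREFERENCES:
        raise ValueError(f"unsupported read preference: {value!r}")
    return value


# The result could then be passed to the client, for example as
# MongoClient(uri, readPreference=read_preference_from_env()) with pymongo.
```

Replica-set users who need the old behavior would set the variable to `secondary`; everyone else gets the new default without any configuration.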

None of these are blockers architecturally, but merging as-is will surface in someone's Grafana as "numbers changed on April XX, 2026, and nobody knows why."

Thx

2 participants