feat(extract): add --scope session and --slim to batch extract #222
Open
elronbandel wants to merge 2 commits into
batch extract today only emits run-level rows with all fields including
JSON-encoded blobs (session_results, accumulated_*_report,
*_session_ids) — the resulting CSV is ~70 MB on a moderate run tree
and isn't useful for downstream analysis. There's also no built-in way
to get per-session results in tabular form; users were walking session
dirs with ad-hoc scripts.
Two new options on `batch extract`:

  --scope run|session   (default: run)
      run:     existing behavior, one row per config
      session: one row per task; iterates each run's embedded
               session_results and prepends config_path + run_id

  --slim
      Drop bulky/redundant fields. For run scope: session_results,
      *_session_ids, accumulated_*_report, etc. For session scope:
      details and cost_reports (the two JSON-blob fields).
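
A minimal sketch of the two transformations described above, not the code in this PR. It assumes each run is an in-memory dict whose `session_results` is a list of per-task dicts; the helper names `slim_run_row` and `session_rows` are hypothetical, and the pattern list covers only the fields named here (the real drop list ends in "etc.").

```python
import fnmatch

# Patterns copied from the description above; illustrative, not exhaustive.
RUN_SLIM_PATTERNS = ["session_results", "*_session_ids", "accumulated_*_report"]
# The two JSON-blob fields named for session scope.
SESSION_SLIM_FIELDS = {"details", "cost_reports"}


def slim_run_row(row):
    """Return a copy of a run-level row without the bulky fields."""
    return {
        key: value
        for key, value in row.items()
        if not any(fnmatch.fnmatchcase(key, pat) for pat in RUN_SLIM_PATTERNS)
    }


def session_rows(config_path, run, slim=False):
    """Yield one row per embedded session, led by config_path and run_id."""
    for session in run.get("session_results", []):
        row = {"config_path": config_path, "run_id": run["run_id"]}
        for key, value in session.items():
            if slim and key in SESSION_SLIM_FIELDS:
                continue  # skip JSON-blob fields when --slim is requested
            row[key] = value
        yield row
```

Under these assumptions, `slim_run_row` corresponds to `--scope run --slim` and `session_rows(..., slim=True)` to `--scope session --slim`; field names other than those listed in the option text are made up for illustration.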
Combined effect on a ~150-config / 12k-session tree:

  --scope run             → 70 MB
  --scope run --slim      → 82 KB
  --scope session         → 70 MB
  --scope session --slim  → 2.4 MB
Tests added (3 new):
- test_batch_extract_slim_drops_bulky_fields
- test_batch_extract_scope_session_emits_one_row_per_task
- test_batch_extract_scope_session_slim_drops_json_blobs
Existing 6 tests in test_cli_batch.py pass unchanged.
Signed-off-by: Elron Bandel <elron.bandel@ibm.com>
Code already self-documents from option help text + tests. The README will carry the user-facing how-to.
Signed-off-by: Elron Bandel <elron.bandel@ibm.com>
Why
`batch extract` today emits only run-level rows, with every field included, even the JSON-encoded blobs (`session_results`, `accumulated_*_report`, `*_session_ids`). The resulting CSV is ~70 MB on a moderate run tree and is not useful for downstream statistical analysis. There is also no built-in way to get per-session results in tabular form; users have been walking session dirs with ad-hoc scripts.
What
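Two new options on `batch extract`:
- `--scope run|session` (default: `run`). `run` keeps the existing behavior, one row per config. `session` emits one row per task: it iterates each run's embedded `session_results` and prepends `config_path` and `run_id`.
- `--slim` drops bulky or redundant fields. For run scope: `session_results`, `*_session_ids`, `accumulated_*_report`, etc. For session scope: `details` and `cost_reports` (the two JSON-blob fields).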
Impact
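Output size on the same ~150-config / 12k-session tree:
- `--scope run`: 70 MB
- `--scope run --slim`: 82 KB
- `--scope session`: 70 MB
- `--scope session --slim`: 2.4 MB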
For a repo-tracked snapshot or a HuggingFace dataset:
```bash
exgentic batch extract --config "experiments///*/config.json" \
--scope run --slim --output runs.csv
exgentic batch extract --config "experiments///*/config.json" \
--scope session --slim --output sessions.csv
```
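
As a hedged illustration of the downstream analysis the slim session CSV is meant to enable, the sketch below counts session rows per run using only the standard library. It relies on the `run_id` column that session scope prepends; the helper name `sessions_per_run` is hypothetical, and `sessions.csv` is the output file from the command above.

```python
import csv
from collections import Counter


def sessions_per_run(csv_lines):
    """Count session rows per run_id in a session-scope CSV.

    csv_lines is any iterable of CSV text lines, e.g. an open file.
    """
    reader = csv.DictReader(csv_lines)
    return Counter(row["run_id"] for row in reader)


# Example usage (assumes sessions.csv was produced by the command above):
#
#     with open("sessions.csv", newline="") as f:
#         for run_id, n in sessions_per_run(f).most_common():
#             print(run_id, n)
```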
Tests
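Three new tests:
- `test_batch_extract_slim_drops_bulky_fields`
- `test_batch_extract_scope_session_emits_one_row_per_task`
- `test_batch_extract_scope_session_slim_drops_json_blobs`

All 6 existing tests in `test_cli_batch.py` pass unchanged.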
Compatibility
Default behavior (no flags) is byte-identical to before. `--scope` defaults to `run` and `--slim` defaults to off.