fix: streaming output not clobbered by post-extraction output handler#9
Conversation
In streaming mode, `_do_streaming_extract` writes data directly to the output file and returns an `ExtractionResult` with an intentionally empty `tables` dict (data lives in the file, not in memory). The CLI then called `_generate_and_output_sql` unconditionally, which regenerated SQL from the empty `tables` and wrote the resulting `BEGIN; ... COMMIT;` shell back to the same path — clobbering the streamed content. The bug is silent: dbslice logs `Wrote <N> rows to <path>` and exits 0, but the file is left as a ~400-byte empty shell. Fix: add `was_streamed: bool` to `ExtractionResult`, set it to True in `_do_streaming_extract`, and short-circuit the SQL/JSON/CSV output handlers when the flag is set. The streamed file is preserved as-is. Closes nabroleonx#8.
| # Streaming mode wrote data directly to the output file during extraction. | ||
| # result.tables is empty in that case; regenerating output from it would | ||
| # clobber the streamed content. See _generate_and_output_sql for context. | ||
| if result.was_streamed: |
There was a problem hiding this comment.
Streaming is SQL-only today, so this JSON guard should not preserve a streamed SQL file under a JSON output path. Please add validation before extraction that rejects --stream/auto-streaming when output_format != SQL, and keep the JSON handler on its normal generation path for non-streaming results.
There was a problem hiding this comment.
Agreed — removed this guard. Streaming is SQL-only and is now rejected for non-SQL output before extraction: the CLI errors on an explicit --stream with a non-SQL --output, and ExtractionEngine._should_use_streaming() returns False for any non-SQL format (covering the auto-threshold path too). So the JSON handler stays on its normal generation path and can never preserve a streamed SQL file under a .json name. Fixed in a2c623d.
| # Streaming mode wrote data directly to the output file during extraction. | ||
| # result.tables is empty in that case; regenerating output from it would | ||
| # clobber the streamed content. See _generate_and_output_sql for context. | ||
| if result.was_streamed: |
There was a problem hiding this comment.
Same here this CSV guard should not preserve a streamed SQL file under a CSV output path. Please use the same validation as JSON: reject streaming when output_format != SQL until streaming CSV is deliberately implemented.
There was a problem hiding this comment.
Same fix as the JSON guard — removed. Non-SQL streaming is rejected up front (CLI validation + _should_use_streaming returning False for non-SQL), so the CSV handler keeps its normal generation path and a streamed SQL file can never be preserved under a .csv name. Fixed in a2c623d.
| result.used_deferred_cycle_strategy = used_deferred_cycle_strategy | ||
| # Flag so downstream output handlers know not to re-write the file. | ||
| # Data has already been streamed to disk; result.tables is empty. | ||
| result.was_streamed = True |
There was a problem hiding this comment.
Put the streamed marker at the source of truth. StreamingExtractionEngine.stream_to_file() should return ExtractionResult(..., was_streamed=True) directly, so direct callers and the CLI wrapper see consistent metadata. The wrapper assignment can then be removed or left only as redundant safety.
There was a problem hiding this comment.
Done. StreamingExtractionEngine.stream_to_file() now sets was_streamed=True directly in the returned ExtractionResult, so direct callers and the CLI wrapper see consistent metadata. Removed the assignment in _do_streaming_extract. Fixed in a2c623d.
| assert out_file.read_text() == streamed_content | ||
|
|
||
| # Construct a streaming-mode result: tables empty, stats populated, flag set. | ||
| result = ExtractionResult( |
There was a problem hiding this comment.
This test manually constructs was_streamed=True, so it does not prove the real streaming path sets the flag. Please add a regression test that runs _do_streaming_extract() or extract() through _handle_output_format() and asserts the streamed SQL file is not clobbered.
There was a problem hiding this comment.
Rewritten. The test now drives the real path: it runs StreamingExtractionEngine.stream_to_file() (which is what sets was_streamed) and feeds the result through _handle_output_format(), then asserts the streamed SQL file is byte-for-byte unchanged. It also asserts result.was_streamed is True and result.tables == {}, so the flag is proven on the actual streaming path. Added test_streaming_disabled_for_non_sql_output to cover the format gate. Fixed in a2c623d.
| # Streaming mode wrote data directly to the output file during extraction. | ||
| # result.tables is empty in that case; regenerating SQL from it would | ||
| # produce an empty BEGIN/COMMIT shell and clobber the streamed content. | ||
| if result.was_streamed: |
There was a problem hiding this comment.
Please move the streamed-result skip into a single helper or a single guard in _handle_output_format(). The shared guard should only apply to SQL streaming; JSON/CSV should be rejected before streaming starts rather than short-circuited here.
There was a problem hiding this comment.
Consolidated into a single guard at the top of _handle_output_format(); the three per-handler skips are gone. The guard only fires for streamed (SQL) results — non-SQL streaming is rejected before extraction rather than short-circuited here. Fixed in a2c623d.
| was_streamed=True, | ||
| ) | ||
|
|
||
| quiet = rich.console.Console(file=open(os.devnull, "w"), force_terminal=False) |
There was a problem hiding this comment.
open(os.devnull, "w") leaves the file handle open. Please use a with open(...) as devnull: block around the output-handler call.
There was a problem hiding this comment.
Fixed — wrapped in with open(os.devnull, "w") as devnull:.
…t guard Streaming writes SQL directly to disk; it has no JSON/CSV path. Instead of defensively short-circuiting every output handler (which would silently preserve SQL content under a .json/.csv name), reject the unsupported combo: - _should_use_streaming() returns False for any non-SQL output_format, so the auto-threshold path never streams SQL into a JSON/CSV file. (This was a latent bug: the gate keyed only on row count + output_file, not format.) - CLI rejects an explicit --stream for non-SQL output before extraction. - Consolidate the three per-handler streamed-result skips into one guard in _handle_output_format(); only SQL reaches it now. - Set was_streamed=True at the source of truth (StreamingExtractionEngine .stream_to_file) so direct callers and the CLI see consistent metadata; drop the redundant wrapper assignment. - Regression test now drives the real streaming engine through _handle_output_format and asserts the streamed file is byte-for-byte unchanged; add _should_use_streaming format-gate test; fix devnull leak. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
Thanks for the review — all six points addressed in a2c623d. Reshape summary:
Worth flagging: the format gate also closes a latent pre-existing bug — All 704 unit tests pass ( |
Fixes #8.
Summary
In streaming mode,
_do_streaming_extractwrites data directly to the output file and returns anExtractionResultwith an intentionally emptytablesdict (data lives in the file, not in memory). The CLI then calls_generate_and_output_sqlunconditionally, which regenerates SQL from the emptytablesand writes the resultingBEGIN; ... COMMIT;shell back to the same path — clobbering the streamed content.The bug is silent: dbslice logs
Wrote <N> rows to <path>and exits 0, but the file is left as a ~400-byte empty shell. Reproducing it just requires an extraction large enough to trigger streaming (default threshold 50k rows, default--stream-threshold).Fix
was_streamed: bool = Falsefield toExtractionResult(core/engine.py).was_streamed=Truein_do_streaming_extractwhen constructing the return value.cli._generate_and_output_sql, short-circuit whenresult.was_streamed— the file is already on disk; print the success line and return the path without regenerating output._generate_and_output_jsonand_generate_and_output_csv. Streaming is SQL-only today, but the defensive guard means JSON/CSV won't silently clobber if streaming is ever extended.Test
Added
test_streaming_output_not_clobbered_by_output_handlerintests/test_streaming.py:ExtractionResultwithwas_streamed=Trueand emptytables._generate_and_output_sqldirectly.All 703 unit tests pass (
just test).What the user sees after the fix
Before (silent corruption):
After (preserved):