fix: streaming output not clobbered by post-extraction output handler by james-toji-leung · Pull Request #9 · nabroleonx/dbslice

james-toji-leung · 2026-05-20T16:35:03Z

Fixes #8.

Summary

In streaming mode, _do_streaming_extract writes data directly to the output file and returns an ExtractionResult with an intentionally empty tables dict (data lives in the file, not in memory). The CLI then calls _generate_and_output_sql unconditionally, which regenerates SQL from the empty tables and writes the resulting BEGIN; ... COMMIT; shell back to the same path — clobbering the streamed content.

The bug is silent: dbslice logs Wrote <N> rows to <path> and exits 0, but the file is left as a ~400-byte empty shell. Reproducing it just requires an extraction large enough to trigger streaming (the default --stream-threshold is 50k rows).
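The failure mode described in this summary can be modelled in a few lines. This is a minimal sketch, not dbslice's code: ExtractionResult here is a stripped-down stand-in, and stream_extract, generate_and_output_sql, and reproduce are hypothetical names chosen for illustration.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class ExtractionResult:
    # Hypothetical stand-in: only the field relevant to the bug.
    tables: dict = field(default_factory=dict)


def stream_extract(path: Path, rows: list) -> ExtractionResult:
    """Write INSERTs straight to disk and keep no rows in memory."""
    with path.open("w") as f:
        f.write("BEGIN;\n")
        for row in rows:
            f.write(f"INSERT INTO users (id) VALUES ({row['id']});\n")
        f.write("COMMIT;\n")
    return ExtractionResult()  # tables is intentionally empty


def generate_and_output_sql(result: ExtractionResult, path: Path) -> None:
    """Regenerate SQL from result.tables and overwrite the output path."""
    lines = ["BEGIN;"]
    for table, rows in result.tables.items():
        lines += [f"INSERT INTO {table} (id) VALUES ({r['id']});" for r in rows]
    lines.append("COMMIT;")
    path.write_text("\n".join(lines) + "\n")


def reproduce() -> tuple:
    """Return (line count after streaming, line count after the handler runs)."""
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp) / "out.sql"
        result = stream_extract(out, [{"id": i} for i in range(1000)])
        streamed_lines = len(out.read_text().splitlines())
        generate_and_output_sql(result, out)  # unconditional call: the bug
        final_lines = len(out.read_text().splitlines())
    return streamed_lines, final_lines


if __name__ == "__main__":
    print(reproduce())
```

In this model the second write sees no tables, so only the BEGIN/COMMIT wrapper survives, which matches the empty shell reported above.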

Fix

Add was_streamed: bool = False field to ExtractionResult (core/engine.py).
Set was_streamed=True in _do_streaming_extract when constructing the return value.
In cli._generate_and_output_sql, short-circuit when result.was_streamed — the file is already on disk; print the success line and return the path without regenerating output.
Apply the same short-circuit to _generate_and_output_json and _generate_and_output_csv. Streaming is SQL-only today, but the defensive guard means JSON/CSV won't silently clobber if streaming is ever extended.

Test

Added test_streaming_output_not_clobbered_by_output_handler in tests/test_streaming.py:

Pre-populates an "already-streamed" file with known content.
Builds an ExtractionResult with was_streamed=True and empty tables.
Calls _generate_and_output_sql directly.
Asserts the file content is unchanged afterward.
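The fix and the test steps above could look roughly like the following. It is a sketch under stated assumptions: the ExtractionResult below is a simplified stand-in for the class in core/engine.py, the handler body is invented for illustration, and check_streamed_file_is_preserved is a hypothetical helper rather than the test added in tests/test_streaming.py.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class ExtractionResult:
    # Simplified stand-in; the real class carries more fields.
    tables: dict = field(default_factory=dict)
    was_streamed: bool = False  # new field, defaults to the non-streaming case


def generate_and_output_sql(result: ExtractionResult, path: Path) -> Path:
    if result.was_streamed:
        # The file is already on disk; do not regenerate from empty tables.
        return path
    body = [f"-- {len(rows)} rows for {name}" for name, rows in result.tables.items()]
    path.write_text("\n".join(["BEGIN;", *body, "COMMIT;"]) + "\n")
    return path


def check_streamed_file_is_preserved() -> bool:
    """Mirror the four test steps: pre-populate, build result, call, compare."""
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp) / "out.sql"
        streamed = "BEGIN;\nINSERT INTO users (id) VALUES (1);\nCOMMIT;\n"
        out.write_text(streamed)
        result = ExtractionResult(tables={}, was_streamed=True)
        generate_and_output_sql(result, out)
        return out.read_text() == streamed


if __name__ == "__main__":
    print(check_streamed_file_is_preserved())
```

Defaulting was_streamed to False keeps every existing construction site of ExtractionResult working without changes.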

All 703 unit tests pass (just test).

What the user sees after the fix

Before (silent corruption):

$ dbslice extract ... -f out.sql --stream
... extraction runs ~20 min, out.sql streams to 290 MB ...
Wrote 800000 rows to out.sql
$ wc -l out.sql
6 out.sql   # 400 bytes, no INSERTs

After (preserved):

$ dbslice extract ... -f out.sql --stream
... extraction runs ~20 min, out.sql streams to 290 MB ...
Wrote 800000 rows to out.sql
$ wc -l out.sql
364000 out.sql   # all streamed INSERTs intact

In streaming mode, `_do_streaming_extract` writes data directly to the output file and returns an `ExtractionResult` with an intentionally empty `tables` dict (data lives in the file, not in memory). The CLI then called `_generate_and_output_sql` unconditionally, which regenerated SQL from the empty `tables` and wrote the resulting `BEGIN; ... COMMIT;` shell back to the same path — clobbering the streamed content. The bug is silent: dbslice logs `Wrote <N> rows to <path>` and exits 0, but the file is left as a ~400-byte empty shell. Fix: add `was_streamed: bool` to `ExtractionResult`, set it to True in `_do_streaming_extract`, and short-circuit the SQL/JSON/CSV output handlers when the flag is set. The streamed file is preserved as-is. Closes nabroleonx#8.

nabroleonx · 2026-06-04T16:18:50Z

+    # Streaming mode wrote data directly to the output file during extraction.
+    # result.tables is empty in that case; regenerating output from it would
+    # clobber the streamed content. See _generate_and_output_sql for context.
+    if result.was_streamed:


Streaming is SQL-only today, so this JSON guard should not preserve a streamed SQL file under a JSON output path. Please add validation before extraction that rejects --stream/auto-streaming when output_format != SQL, and keep the JSON handler on its normal generation path for non-streaming results.

Agreed — removed this guard. Streaming is SQL-only and is now rejected for non-SQL output before extraction: the CLI errors on an explicit --stream with a non-SQL --output, and ExtractionEngine._should_use_streaming() returns False for any non-SQL format (covering the auto-threshold path too). So the JSON handler stays on its normal generation path and can never preserve a streamed SQL file under a .json name. Fixed in a2c623d.

nabroleonx · 2026-06-04T16:18:50Z

+    # Streaming mode wrote data directly to the output file during extraction.
+    # result.tables is empty in that case; regenerating output from it would
+    # clobber the streamed content. See _generate_and_output_sql for context.
+    if result.was_streamed:


Same here: this CSV guard should not preserve a streamed SQL file under a CSV output path. Please use the same validation as JSON: reject streaming when output_format != SQL until streaming CSV is deliberately implemented.

Same fix as the JSON guard — removed. Non-SQL streaming is rejected up front (CLI validation + _should_use_streaming returning False for non-SQL), so the CSV handler keeps its normal generation path and a streamed SQL file can never be preserved under a .csv name. Fixed in a2c623d.
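A format gate of the kind described in the two replies above might be sketched as follows. The OutputFormat enum, both function signatures, the >= comparison, and the error message are assumptions for illustration; the thread only states that streaming is refused for non-SQL formats, both for an explicit --stream and for the auto-threshold path.

```python
from enum import Enum
from typing import Optional


class OutputFormat(Enum):
    # Hypothetical enum; the real project may name these differently.
    SQL = "sql"
    JSON = "json"
    CSV = "csv"


def should_use_streaming(
    estimated_rows: int,
    threshold: int,
    output_file: Optional[str],
    output_format: OutputFormat,
) -> bool:
    """Auto-threshold path: never stream unless the output is SQL."""
    if output_format is not OutputFormat.SQL:
        return False  # the new gate: format is checked before row count
    if output_file is None:
        return False  # streaming needs a file to write to
    return estimated_rows >= threshold


def validate_stream_flag(stream: bool, output_format: OutputFormat) -> None:
    """Explicit path: reject --stream with non-SQL output before extraction."""
    if stream and output_format is not OutputFormat.SQL:
        raise ValueError("--stream is only supported for SQL output")
```

Checking the format first means a large JSON or CSV extraction falls back to the normal in-memory path instead of writing SQL into a file with a non-SQL name.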

nabroleonx · 2026-06-04T16:18:51Z

        result.used_deferred_cycle_strategy = used_deferred_cycle_strategy
+        # Flag so downstream output handlers know not to re-write the file.
+        # Data has already been streamed to disk; result.tables is empty.
+        result.was_streamed = True


Put the streamed marker at the source of truth. StreamingExtractionEngine.stream_to_file() should return ExtractionResult(..., was_streamed=True) directly, so direct callers and the CLI wrapper see consistent metadata. The wrapper assignment can then be removed or left only as redundant safety.

Done. StreamingExtractionEngine.stream_to_file() now sets was_streamed=True directly in the returned ExtractionResult, so direct callers and the CLI wrapper see consistent metadata. Removed the assignment in _do_streaming_extract. Fixed in a2c623d.

nabroleonx · 2026-06-04T16:18:51Z

+    assert out_file.read_text() == streamed_content
+
+    # Construct a streaming-mode result: tables empty, stats populated, flag set.
+    result = ExtractionResult(


This test manually constructs was_streamed=True, so it does not prove the real streaming path sets the flag. Please add a regression test that runs _do_streaming_extract() or extract() through _handle_output_format() and asserts the streamed SQL file is not clobbered.

Rewritten. The test now drives the real path: it runs StreamingExtractionEngine.stream_to_file() (which is what sets was_streamed) and feeds the result through _handle_output_format(), then asserts the streamed SQL file is byte-for-byte unchanged. It also asserts result.was_streamed is True and result.tables == {}, so the flag is proven on the actual streaming path. Added test_streaming_disabled_for_non_sql_output to cover the format gate. Fixed in a2c623d.

nabroleonx · 2026-06-04T16:18:51Z

+    # Streaming mode wrote data directly to the output file during extraction.
+    # result.tables is empty in that case; regenerating SQL from it would
+    # produce an empty BEGIN/COMMIT shell and clobber the streamed content.
+    if result.was_streamed:


Please move the streamed-result skip into a single helper or a single guard in _handle_output_format(). The shared guard should only apply to SQL streaming; JSON/CSV should be rejected before streaming starts rather than short-circuited here.

Consolidated into a single guard at the top of _handle_output_format(); the three per-handler skips are gone. The guard only fires for streamed (SQL) results — non-SQL streaming is rejected before extraction rather than short-circuited here. Fixed in a2c623d.
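The consolidated shape could be sketched like this: one guard in the dispatcher, and handlers that know nothing about streaming. All names below (ExtractionResult as a stand-in, write_sql, write_json, HANDLERS, handle_output_format, demo) are hypothetical, and the SQL value formatting is deliberately naive.

```python
import json
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class ExtractionResult:
    # Simplified stand-in for the real result type.
    tables: dict = field(default_factory=dict)
    was_streamed: bool = False


def write_sql(result: ExtractionResult, path: Path) -> Path:
    lines = ["BEGIN;"]
    for table, rows in result.tables.items():
        for row in rows:
            cols = ", ".join(row)
            vals = ", ".join(repr(v) for v in row.values())  # naive quoting
            lines.append(f"INSERT INTO {table} ({cols}) VALUES ({vals});")
    lines.append("COMMIT;")
    path.write_text("\n".join(lines) + "\n")
    return path


def write_json(result: ExtractionResult, path: Path) -> Path:
    path.write_text(json.dumps(result.tables, indent=2))
    return path


HANDLERS = {"sql": write_sql, "json": write_json}


def handle_output_format(result: ExtractionResult, fmt: str, path: Path) -> Path:
    # Single guard: a streamed result is already on disk, so skip regeneration.
    # Only SQL can reach this branch if non-SQL streaming is rejected earlier.
    if result.was_streamed:
        return path
    return HANDLERS[fmt](result, path)


def demo() -> tuple:
    """Return (streamed file preserved?, parsed JSON from the normal path)."""
    with tempfile.TemporaryDirectory() as tmp:
        sql_path = Path(tmp) / "out.sql"
        sql_path.write_text("BEGIN;\n-- streamed rows\nCOMMIT;\n")
        before = sql_path.read_bytes()
        handle_output_format(ExtractionResult(was_streamed=True), "sql", sql_path)
        preserved = sql_path.read_bytes() == before

        json_path = Path(tmp) / "out.json"
        normal = ExtractionResult(tables={"users": [{"id": 1}]})
        handle_output_format(normal, "json", json_path)
        return preserved, json.loads(json_path.read_text())
```

Keeping the guard in one place means a future handler cannot forget it, and the JSON handler stays on its normal generation path.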

nabroleonx · 2026-06-04T16:18:51Z

+        was_streamed=True,
+    )
+
+    quiet = rich.console.Console(file=open(os.devnull, "w"), force_terminal=False)


open(os.devnull, "w") leaves the file handle open. Please use a with open(...) as devnull: block around the output-handler call.

Fixed — wrapped in with open(os.devnull, "w") as devnull:.
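The handle-scoping pattern requested here can be shown with the standard library alone. In this sketch call_with_quiet_sink and noisy_handler are hypothetical; in the actual test, the rich Console would presumably be constructed inside the with block using its file argument.

```python
import os
from typing import Callable, TextIO


def noisy_handler(out: TextIO) -> str:
    """Stand-in for an output handler that prints a success line."""
    print("Wrote 3 rows to out.sql", file=out)
    return "out.sql"


def call_with_quiet_sink(handler: Callable[[TextIO], str]) -> tuple:
    """Run handler with its output discarded, closing the sink afterwards."""
    with open(os.devnull, "w") as devnull:
        value = handler(devnull)
    # Leaving the with block closes the handle, even if handler raised.
    return value, devnull


if __name__ == "__main__":
    value, handle = call_with_quiet_sink(noisy_handler)
    print(value, handle.closed)
```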

…t guard

Streaming writes SQL directly to disk; it has no JSON/CSV path. Instead of defensively short-circuiting every output handler (which would silently preserve SQL content under a .json/.csv name), reject the unsupported combo:

- _should_use_streaming() returns False for any non-SQL output_format, so the auto-threshold path never streams SQL into a JSON/CSV file. (This was a latent bug: the gate keyed only on row count + output_file, not format.)
- CLI rejects an explicit --stream for non-SQL output before extraction.
- Consolidate the three per-handler streamed-result skips into one guard in _handle_output_format(); only SQL reaches it now.
- Set was_streamed=True at the source of truth (StreamingExtractionEngine.stream_to_file) so direct callers and the CLI see consistent metadata; drop the redundant wrapper assignment.
- Regression test now drives the real streaming engine through _handle_output_format and asserts the streamed file is byte-for-byte unchanged; add _should_use_streaming format-gate test; fix devnull leak.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

james-toji-leung · 2026-06-15T13:56:28Z

Thanks for the review — all six points addressed in a2c623d. Reshape summary:

Reject non-SQL streaming up front instead of defensively guarding each output handler. The CLI errors on --stream with a non-SQL --output, and _should_use_streaming() now returns False for any non-SQL format.
Single guard in _handle_output_format(); removed the per-handler JSON/CSV/SQL skips.
was_streamed set at the source of truth in stream_to_file().
Regression test drives the real engine through _handle_output_format() and asserts the file is byte-for-byte unchanged; added a format-gate test; fixed the devnull handle leak.

Worth flagging: the format gate also closes a latent pre-existing bug — _should_use_streaming() keyed only on row count + output file, never the format, so a large JSON/CSV extraction would already auto-stream SQL into the output file regardless of --output. That path is now blocked.

All 704 unit tests pass (just test).

nabroleonx requested changes Jun 4, 2026

nabroleonx mentioned this pull request Jun 4, 2026

--stream + --out-file produces empty output after successful extraction #8

Open

james-toji-leung requested a review from nabroleonx June 15, 2026 14:16
