fix: streaming output not clobbered by post-extraction output handler #9

Open
james-toji-leung wants to merge 2 commits into nabroleonx:main from james-toji-leung:fix/streaming-output-clobbered-on-finalize

@james-toji-leung

Fixes #8.

Summary

In streaming mode, _do_streaming_extract writes data directly to the output file and returns an ExtractionResult with an intentionally empty tables dict (data lives in the file, not in memory). The CLI then calls _generate_and_output_sql unconditionally, which regenerates SQL from the empty tables and writes the resulting BEGIN; ... COMMIT; shell back to the same path — clobbering the streamed content.

The bug is silent: dbslice logs Wrote <N> rows to <path> and exits 0, but the file is left as a ~400-byte empty shell. Reproducing it only requires an extraction large enough to trigger streaming (the default --stream-threshold is 50k rows).

Fix

  • Add was_streamed: bool = False field to ExtractionResult (core/engine.py).
  • Set was_streamed=True in _do_streaming_extract when constructing the return value.
  • In cli._generate_and_output_sql, short-circuit when result.was_streamed — the file is already on disk; print the success line and return the path without regenerating output.
  • Apply the same short-circuit to _generate_and_output_json and _generate_and_output_csv. Streaming is SQL-only today, but the defensive guard means JSON/CSV won't silently clobber if streaming is ever extended.
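
The flag and the short-circuit can be sketched as below. This is a simplified model rather than the actual dbslice code: the `ExtractionResult` fields other than `was_streamed`, the `generate_sql` helper, and the handler signature are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from pathlib import Path
import tempfile


@dataclass
class ExtractionResult:
    # Simplified stand-in; the real class carries more fields.
    tables: dict = field(default_factory=dict)
    row_count: int = 0
    was_streamed: bool = False  # the field this PR adds


def generate_sql(tables: dict) -> str:
    # Hypothetical generator: one INSERT per row, wrapped in a transaction.
    lines = ["BEGIN;"]
    for name, rows in tables.items():
        lines += [f"INSERT INTO {name} VALUES ({row!r});" for row in rows]
    lines.append("COMMIT;")
    return "\n".join(lines) + "\n"


def generate_and_output_sql(result: ExtractionResult, path: Path) -> Path:
    if result.was_streamed:
        # Data is already on disk; regenerating from empty tables would clobber it.
        return path
    path.write_text(generate_sql(result.tables))
    return path


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "out.sql"
    streamed = "BEGIN;\nINSERT INTO users VALUES (1);\nCOMMIT;\n"

    # Without the flag, the empty result overwrites the streamed file.
    out.write_text(streamed)
    generate_and_output_sql(ExtractionResult(row_count=1), out)
    clobbered = out.read_text()

    # With the flag, the streamed file is left untouched.
    out.write_text(streamed)
    generate_and_output_sql(ExtractionResult(row_count=1, was_streamed=True), out)
    preserved = out.read_text()
```

In this model `clobbered` ends up as the empty transaction shell, while `preserved` still holds the streamed content.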

Test

Added test_streaming_output_not_clobbered_by_output_handler in tests/test_streaming.py:

  • Pre-populates an "already-streamed" file with known content.
  • Builds an ExtractionResult with was_streamed=True and empty tables.
  • Calls _generate_and_output_sql directly.
  • Asserts the file content is unchanged afterward.

All 703 unit tests pass (just test).

What the user sees after the fix

Before (silent corruption):

$ dbslice extract ... -f out.sql --stream
... extraction runs ~20 min, out.sql streams to 290 MB ...
Wrote 800000 rows to out.sql
$ wc -l out.sql
6 out.sql   # 400 bytes, no INSERTs

After (preserved):

$ dbslice extract ... -f out.sql --stream
... extraction runs ~20 min, out.sql streams to 290 MB ...
Wrote 800000 rows to out.sql
$ wc -l out.sql
364000 out.sql   # all streamed INSERTs intact

In streaming mode, `_do_streaming_extract` writes data directly to the
output file and returns an `ExtractionResult` with an intentionally empty
`tables` dict (data lives in the file, not in memory). The CLI then called
`_generate_and_output_sql` unconditionally, which regenerated SQL from the
empty `tables` and wrote the resulting `BEGIN; ... COMMIT;` shell back to
the same path — clobbering the streamed content.

The bug is silent: dbslice logs `Wrote <N> rows to <path>` and exits 0,
but the file is left as a ~400-byte empty shell.

Fix: add `was_streamed: bool` to `ExtractionResult`, set it to True in
`_do_streaming_extract`, and short-circuit the SQL/JSON/CSV output
handlers when the flag is set. The streamed file is preserved as-is.

Closes nabroleonx#8.
Comment thread src/dbslice/cli.py Outdated
# Streaming mode wrote data directly to the output file during extraction.
# result.tables is empty in that case; regenerating output from it would
# clobber the streamed content. See _generate_and_output_sql for context.
if result.was_streamed:

Owner

Streaming is SQL-only today, so this JSON guard should not preserve a streamed SQL file under a JSON output path. Please add validation before extraction that rejects --stream/auto-streaming when output_format != SQL, and keep the JSON handler on its normal generation path for non-streaming results.

Author

Agreed — removed this guard. Streaming is SQL-only and is now rejected for non-SQL output before extraction: the CLI errors on an explicit --stream with a non-SQL --output, and ExtractionEngine._should_use_streaming() returns False for any non-SQL format (covering the auto-threshold path too). So the JSON handler stays on its normal generation path and can never preserve a streamed SQL file under a .json name. Fixed in a2c623d.
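
A minimal sketch of the two checks described in this reply, under stated assumptions: the `OutputFormat` enum, both function signatures, the `>=` comparison against the threshold, and the `ValueError` are hypothetical and may not match the real `ExtractionEngine._should_use_streaming()` or the CLI validation.

```python
from enum import Enum
from typing import Optional


class OutputFormat(Enum):
    SQL = "sql"
    JSON = "json"
    CSV = "csv"


def should_use_streaming(
    row_count: int,
    output_file: Optional[str],
    output_format: OutputFormat,
    threshold: int = 50_000,
) -> bool:
    # Streaming writes SQL only, so any other format disables it outright.
    if output_format is not OutputFormat.SQL:
        return False
    # Assumed shape of the pre-existing gate: row count plus an output file.
    return output_file is not None and row_count >= threshold


def validate_stream_flag(stream: bool, output_format: OutputFormat) -> None:
    # Reject an explicit --stream before any extraction work starts.
    if stream and output_format is not OutputFormat.SQL:
        raise ValueError(
            f"--stream is only supported for SQL output, got {output_format.value}"
        )
```

The first function covers the auto-threshold path and the second covers the explicit flag, so neither route can start streaming SQL into a JSON or CSV file.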

Comment thread src/dbslice/cli.py Outdated
# Streaming mode wrote data directly to the output file during extraction.
# result.tables is empty in that case; regenerating output from it would
# clobber the streamed content. See _generate_and_output_sql for context.
if result.was_streamed:

Owner

Same here: this CSV guard should not preserve a streamed SQL file under a CSV output path. Please use the same validation as JSON: reject streaming when output_format != SQL until streaming CSV is deliberately implemented.

Author

Same fix as the JSON guard — removed. Non-SQL streaming is rejected up front (CLI validation + _should_use_streaming returning False for non-SQL), so the CSV handler keeps its normal generation path and a streamed SQL file can never be preserved under a .csv name. Fixed in a2c623d.

Comment thread src/dbslice/core/engine.py Outdated
result.used_deferred_cycle_strategy = used_deferred_cycle_strategy
# Flag so downstream output handlers know not to re-write the file.
# Data has already been streamed to disk; result.tables is empty.
result.was_streamed = True

Owner

Put the streamed marker at the source of truth. StreamingExtractionEngine.stream_to_file() should return ExtractionResult(..., was_streamed=True) directly, so direct callers and the CLI wrapper see consistent metadata. The wrapper assignment can then be removed or left only as redundant safety.

Author

Done. StreamingExtractionEngine.stream_to_file() now sets was_streamed=True directly in the returned ExtractionResult, so direct callers and the CLI wrapper see consistent metadata. Removed the assignment in _do_streaming_extract. Fixed in a2c623d.

Comment thread tests/test_streaming.py Outdated
assert out_file.read_text() == streamed_content

# Construct a streaming-mode result: tables empty, stats populated, flag set.
result = ExtractionResult(

Owner

This test manually constructs was_streamed=True, so it does not prove the real streaming path sets the flag. Please add a regression test that runs _do_streaming_extract() or extract() through _handle_output_format() and asserts the streamed SQL file is not clobbered.

Author

Rewritten. The test now drives the real path: it runs StreamingExtractionEngine.stream_to_file() (which is what sets was_streamed) and feeds the result through _handle_output_format(), then asserts the streamed SQL file is byte-for-byte unchanged. It also asserts result.was_streamed is True and result.tables == {}, so the flag is proven on the actual streaming path. Added test_streaming_disabled_for_non_sql_output to cover the format gate. Fixed in a2c623d.
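
The shape of such a regression test might look like the sketch below. It does not use the real dbslice engine: `stream_to_file` and `handle_output_format` here are simplified stand-ins with assumed signatures, and `unittest` is used only to keep the example dependency-free.

```python
import tempfile
import unittest
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class ExtractionResult:
    tables: dict = field(default_factory=dict)
    was_streamed: bool = False


def stream_to_file(path: Path) -> ExtractionResult:
    # Stand-in for the streaming engine: writes rows to disk, keeps none in memory.
    path.write_text("BEGIN;\nINSERT INTO users VALUES (1);\nCOMMIT;\n")
    return ExtractionResult(tables={}, was_streamed=True)


def handle_output_format(result: ExtractionResult, path: Path) -> Path:
    # Single guard: a streamed result is already complete on disk.
    if result.was_streamed:
        return path
    path.write_text("BEGIN;\nCOMMIT;\n")  # stand-in for regenerated output
    return path


class StreamingNotClobberedTest(unittest.TestCase):
    def test_streamed_file_survives_output_handler(self):
        with tempfile.TemporaryDirectory() as tmp:
            out = Path(tmp) / "out.sql"
            result = stream_to_file(out)
            before = out.read_bytes()

            handle_output_format(result, out)

            # The flag comes from the streaming path itself, not from the test.
            self.assertTrue(result.was_streamed)
            self.assertEqual(result.tables, {})
            self.assertEqual(out.read_bytes(), before)
```

Comparing `read_bytes()` before and after the handler call is what makes the check byte-for-byte rather than a comparison of decoded text.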

Comment thread src/dbslice/cli.py Outdated
# Streaming mode wrote data directly to the output file during extraction.
# result.tables is empty in that case; regenerating SQL from it would
# produce an empty BEGIN/COMMIT shell and clobber the streamed content.
if result.was_streamed:

Owner

Please move the streamed-result skip into a single helper or a single guard in _handle_output_format(). The shared guard should only apply to SQL streaming; JSON/CSV should be rejected before streaming starts rather than short-circuited here.

Author

Consolidated into a single guard at the top of _handle_output_format(); the three per-handler skips are gone. The guard only fires for streamed (SQL) results — non-SQL streaming is rejected before extraction rather than short-circuited here. Fixed in a2c623d.

Comment thread tests/test_streaming.py Outdated
was_streamed=True,
)

quiet = rich.console.Console(file=open(os.devnull, "w"), force_terminal=False)

Owner

open(os.devnull, "w") leaves the file handle open. Please use a with open(...) as devnull: block around the output-handler call.

Author

Fixed — wrapped in with open(os.devnull, "w") as devnull:.
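
For illustration, the pattern looks roughly like this. The `emit_success` function is a hypothetical stand-in for the output-handler call, and it takes a plain text stream instead of a `rich` console so the sketch has no third-party dependency.

```python
import os


def emit_success(stream, rows: int, path: str) -> None:
    # Hypothetical stand-in for the output handler's success message.
    print(f"Wrote {rows} rows to {path}", file=stream)


# The handle is closed when the block exits, even if the call raises.
with open(os.devnull, "w") as devnull:
    emit_success(devnull, 800000, "out.sql")

handle_closed = devnull.closed
```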

…t guard

Streaming writes SQL directly to disk; it has no JSON/CSV path. Instead of
defensively short-circuiting every output handler (which would silently
preserve SQL content under a .json/.csv name), reject the unsupported combo:

- _should_use_streaming() returns False for any non-SQL output_format, so the
  auto-threshold path never streams SQL into a JSON/CSV file. (This was a
  latent bug: the gate keyed only on row count + output_file, not format.)
- CLI rejects an explicit --stream for non-SQL output before extraction.
- Consolidate the three per-handler streamed-result skips into one guard in
  _handle_output_format(); only SQL reaches it now.
- Set was_streamed=True at the source of truth (StreamingExtractionEngine
  .stream_to_file) so direct callers and the CLI see consistent metadata;
  drop the redundant wrapper assignment.
- Regression test now drives the real streaming engine through
  _handle_output_format and asserts the streamed file is byte-for-byte
  unchanged; add _should_use_streaming format-gate test; fix devnull leak.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@james-toji-leung

Author

Thanks for the review — all six points addressed in a2c623d. Reshape summary:

  • Reject non-SQL streaming up front instead of defensively guarding each output handler. The CLI errors on --stream with a non-SQL --output, and _should_use_streaming() now returns False for any non-SQL format.
  • Single guard in _handle_output_format(); removed the per-handler JSON/CSV/SQL skips.
  • was_streamed set at the source of truth in stream_to_file().
  • Regression test drives the real engine through _handle_output_format() and asserts the file is byte-for-byte unchanged; added a format-gate test; fixed the devnull handle leak.

Worth flagging: the format gate also closes a latent pre-existing bug. _should_use_streaming() keyed only on row count + output file, never the format, so a large JSON/CSV extraction would already auto-stream SQL into the output file regardless of --output. That path is now blocked.

All 704 unit tests pass (just test).

Development

Successfully merging this pull request may close these issues.

--stream + --out-file produces empty output after successful extraction
