Native engine: ~1M rows/sec SELECT by suppressing GC during materialization #142
Merged
…zation

A fetch over the Native engine allocates ~100k GC-tracked containers (one tuple plus one Record per row), which repeatedly trips the generational cyclic collector mid-fetch, even though none of those objects are garbage: they are all about to be returned. That collection, not the decode or the row build, was the dominant cost.

- New Cython `build_records(columns, names)` transposes the decoded columns into Record rows in one C loop (replacing the Python `zip` + comprehension).
- `build_records` and the per-column `decode_column_with_prefix` now disable the cyclic GC for their (synchronous, await-free) sections and restore the prior state afterwards, so no other coroutine ever runs with GC off and the collector still runs between blocks and during IO.

Native SELECT goes ~730k -> ~1.0M rows/sec on the mixed-type benchmark (50k rows, M1 Pro, CH 26.5), now ahead of clickhouse-connect while staying numpy-free.

Iteration notes in perf_iterations/. Full suite 909 passed on cython, pure fallback green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
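The save-and-restore pattern described in this commit can be sketched in pure Python as follows. This is only an illustration, not the PR's Cython code: the `gc_suppressed` helper is a hypothetical name, and rows are plain dicts standing in for the library's Record type.

```python
import gc
from contextlib import contextmanager


@contextmanager
def gc_suppressed():
    """Disable the cyclic GC for a synchronous block, then restore the prior state.

    Hypothetical helper for illustration; the PR does this inside Cython.
    """
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        yield
    finally:
        # Re-enable only if the collector was on when we entered, so a
        # caller who disabled GC on purpose keeps it disabled.
        if was_enabled:
            gc.enable()


def build_records(columns, names):
    """Transpose decoded columns into one row per record.

    Assumption: plain dicts stand in for the library's Record type.
    """
    # No await inside this block, so no other coroutine can run with GC off.
    with gc_suppressed():
        return [dict(zip(names, row)) for row in zip(*columns)]
```

Because the block is a plain `with` around await-free code, the collector is off only for the duration of the row build, and the `finally` clause restores the earlier state even if the build raises.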
Add three TestNative cases pinning the GC-suppression contract: a fetch leaves the collector enabled, it does not re-enable a caller-disabled collector, and the state is restored even when decoding raises mid-fetch.

Refresh the README and BENCHMARK_RESULTS.md headline figures (Native SELECT ~730k -> ~1,000k rows/sec, now the fastest in the client comparison) and note that absolute throughput is sensitive to background CPU load. perf_iterations/ records the full iteration log, including the load-sensitivity A/B.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

A workflow_dispatch-only job that builds the Cython extension, starts ClickHouse with both the HTTP (8123) and native (9000) ports, installs the comparison clients, runs benchmarks_vs_libs.py and benchmarks.py, and publishes the output to the run's job summary. Runs on an isolated GitHub runner: lower absolute numbers than a dev machine but consistent, for tracking relative regressions.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

It was already excluded (not a package, not grafted), but an explicit prune documents the intent and guards against future MANIFEST changes. Verified: the benchmark iteration log ships in neither the wheel nor the sdist.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
A focused performance pass on the Native read engine: SELECT throughput goes from ~730k to ~1,000k rows/sec on the mixed-type benchmark (50k rows, idle M1 Pro, CH 26.5), making it the fastest client in the comparison, ahead of clickhouse-connect, while staying numpy-free.
The real cost was the cyclic garbage collector: a fetch allocates ~100k GC-tracked containers (one tuple + one Record per row), which repeatedly trips generational GC mid-fetch, even though none of those objects are garbage; they are all about to be returned. A controlled `gc.disable()` test jumped the headline from ~693k to ~1.24M, confirming it.

The riskier "disable GC across the whole fetch (spanning `await`s)" variant, which reached ~1.24M, was deliberately not taken; it would be a footgun under concurrency.

No public API change; behavior is identical, just faster.
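To illustrate why the variant spanning `await`s is risky, here is a minimal asyncio sketch. It is not taken from the PR: `fetch_with_gc_off_across_await`, `bystander`, and `run_demo` are hypothetical names, and `asyncio.sleep(0)` stands in for a socket read. It shows an unrelated coroutine observing the collector switched off while the fetch is suspended.

```python
import asyncio
import gc


async def fetch_with_gc_off_across_await():
    """The rejected variant: GC stays off while the coroutine is suspended."""
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        # Stand-in for network IO; control returns to the event loop here
        # while the process-wide collector is still disabled.
        await asyncio.sleep(0)
    finally:
        if was_enabled:
            gc.enable()


async def bystander(observed):
    """An unrelated coroutine that records the GC state it runs under."""
    observed.append(gc.isenabled())


async def _main():
    observed = []
    # gather() schedules the fetch first; it suspends at the await,
    # and the bystander then runs before the fetch resumes.
    await asyncio.gather(fetch_with_gc_off_across_await(), bystander(observed))
    return observed


def run_demo():
    gc.enable()  # start from a known state for the demonstration
    return asyncio.run(_main())
```

Calling `run_demo()` returns the GC states the bystander saw. Because `gc.disable()` is process-wide, any coroutine scheduled during the suspended fetch runs without cycle collection, which is the hazard the await-free sections avoid.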