Native engine: ~1M rows/sec SELECT by suppressing GC during materialization #142

Merged
maximdanilchenko merged 4 commits into master from native-perf-1m on Jun 8, 2026

@maximdanilchenko

Owner

A focused performance pass on the Native read engine: SELECT throughput goes from ~730k to ~1,000k rows/sec on the mixed-type benchmark (50k rows, idle M1 Pro, CH 26.5), making it the fastest client in the comparison, ahead of clickhouse-connect, while staying numpy-free.

The real cost was the cyclic garbage collector: a fetch allocates ~100k GC-tracked containers (one tuple + one Record per row), which repeatedly trips generational GC mid-fetch - even though none of those objects are garbage, they are all about to be returned. A controlled gc.disable() test jumped the headline from ~693k to ~1.24M, confirming it.
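The effect described above can be observed with a small experiment. This is a minimal sketch, not the project's benchmark: `Record` and `count_gc_runs` are hypothetical stand-ins, and the row shape is an assumption chosen only to mimic "one tuple plus one Record per row". It counts how many automatic collector passes start while the rows are being built.

```python
import gc


class Record:
    """Hypothetical stand-in for the driver's row type."""
    __slots__ = ("values",)

    def __init__(self, values):
        self.values = values


def count_gc_runs(n_rows, gc_on):
    """Build n_rows rows and count the automatic GC passes started meanwhile."""
    runs = 0

    def on_gc(phase, info):
        # gc invokes registered callbacks with phase "start" and "stop".
        nonlocal runs
        if phase == "start":
            runs += 1

    was_enabled = gc.isenabled()
    gc.callbacks.append(on_gc)
    try:
        if gc_on:
            gc.enable()
        else:
            gc.disable()
        # Every row stays alive, so none of these allocations is garbage.
        rows = [Record((i, str(i), float(i))) for i in range(n_rows)]
    finally:
        gc.callbacks.remove(on_gc)
        # Restore whatever state the caller had.
        if was_enabled:
            gc.enable()
        else:
            gc.disable()
    return runs, len(rows)
```

Calling `count_gc_runs(50_000, True)` and `count_gc_runs(50_000, False)` side by side shows the difference: with the collector enabled, passes start repeatedly during the build even though nothing is collectable; with it disabled, no automatic pass starts.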

The riskier "disable GC across the whole fetch (spanning awaits)" variant - which reached ~1.24M - was deliberately not taken; it would be a footgun under concurrency.
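To illustrate why suppression that spans an `await` is a hazard, here is a hedged sketch using plain `asyncio`. The function names are hypothetical and `asyncio.sleep(0)` stands in for a network read; none of this is the project's code. It shows an unrelated coroutine running while the collector is still switched off.

```python
import asyncio
import gc


async def fetch_with_gc_off_across_await():
    """The variant that was NOT taken: GC stays off while awaiting IO."""
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        # Stand-in for a network read; control returns to the event loop here.
        await asyncio.sleep(0)
    finally:
        if was_enabled:
            gc.enable()


async def bystander(seen):
    """Unrelated coroutine that records the GC state it happens to run under."""
    seen.append(gc.isenabled())


async def demo():
    seen = []
    await asyncio.gather(fetch_with_gc_off_across_await(), bystander(seen))
    return seen


def run_demo():
    """Run the demo from a known state and restore the caller's GC setting."""
    was_enabled = gc.isenabled()
    gc.enable()
    try:
        return asyncio.run(demo())
    finally:
        if not was_enabled:
            gc.disable()
```

In this sketch the bystander observes the collector as disabled, because it is scheduled while the fetch is suspended at its `await`. Keeping the suppressed section synchronous avoids that window entirely.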

No public API change; behavior is identical, just faster.

maximdanilchenko and others added 4 commits June 8, 2026 17:39
…zation

A fetch over the Native engine allocates ~100k GC-tracked containers (one tuple
plus one Record per row), which repeatedly trips the generational cyclic
collector mid-fetch — even though none of those objects are garbage, they are
all about to be returned. That collection, not the decode or the row build, was
the dominant cost.

- New Cython `build_records(columns, names)` transposes the decoded columns into
  Record rows in one C loop (replacing the Python `zip` + comprehension).
- `build_records` and the per-column `decode_column_with_prefix` now disable the
  cyclic GC for their (synchronous, await-free) sections and restore the prior
  state afterwards, so no other coroutine ever runs with GC off and the
  collector still runs between blocks and during IO.
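The save/disable/restore pattern from the bullets above might look roughly like this in pure Python. This is a sketch under assumptions: the real `build_records` is a Cython C loop, whereas this version uses `zip` for brevity, and the `Record` class and the `build_records_sketch` name are hypothetical.

```python
import gc


class Record:
    """Hypothetical row type: shared column names plus one row of values."""
    __slots__ = ("names", "values")

    def __init__(self, names, values):
        self.names = names
        self.values = values


def build_records_sketch(columns, names):
    """Transpose decoded columns into rows with the cyclic GC paused."""
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        # Synchronous and await-free: no other coroutine can run in here.
        return [Record(names, row) for row in zip(*columns)]
    finally:
        # Restore the prior state; a collector the caller disabled stays disabled.
        if was_enabled:
            gc.enable()
```

For example, `build_records_sketch([[1, 2], ["a", "b"]], ("id", "name"))` yields two records whose `values` are `(1, "a")` and `(2, "b")`. Because the `finally` clause only re-enables the collector when it was enabled on entry, the function is safe to call from code that manages GC itself.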

Native SELECT goes ~730k -> ~1.0M rows/sec on the mixed-type benchmark (50k
rows, M1 Pro, CH 26.5), now ahead of clickhouse-connect while staying
numpy-free. Iteration notes in perf_iterations/. Full suite 909 passed on
cython, pure fallback green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Add three TestNative cases pinning the GC-suppression contract: a fetch leaves
the collector enabled, it does not re-enable a caller-disabled collector, and
the state is restored even when decoding raises mid-fetch.
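The three contract checks could be expressed along these lines. This is an illustrative sketch only: the actual TestNative cases run against a real fetch, while `decode_with_gc_paused` here is a hypothetical helper standing in for the suppressed section. Each check returns `True` when the contract holds and leaves the collector enabled afterwards.

```python
import gc


def decode_with_gc_paused(decode):
    """Hypothetical helper: run a synchronous decode step with GC paused."""
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        return decode()
    finally:
        if was_enabled:
            gc.enable()


def check_leaves_collector_enabled():
    """A fetch that starts with GC on must finish with GC on."""
    gc.enable()
    decode_with_gc_paused(lambda: [1, 2, 3])
    return gc.isenabled()


def check_keeps_caller_disabled():
    """A fetch must not re-enable a collector the caller turned off."""
    gc.disable()
    try:
        decode_with_gc_paused(lambda: [1, 2, 3])
        return not gc.isenabled()
    finally:
        gc.enable()


def check_restores_on_error():
    """The prior state must come back even when decoding raises."""
    gc.enable()

    def failing_decode():
        raise ValueError("corrupt block")

    try:
        decode_with_gc_paused(failing_decode)
    except ValueError:
        pass
    return gc.isenabled()
```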

Refresh the README and BENCHMARK_RESULTS.md headline figures (Native SELECT
~730k -> ~1,000k rows/sec, now the fastest in the client comparison) and note
that absolute throughput is sensitive to background CPU load. perf_iterations/
records the full iteration log, including the load-sensitivity A/B.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

A workflow_dispatch-only job that builds the Cython extension, starts ClickHouse
with both the HTTP (8123) and native (9000) ports, installs the comparison
clients, runs benchmarks_vs_libs.py and benchmarks.py, and publishes the output
to the run's job summary. Runs on an isolated GitHub runner — lower absolute
numbers than a dev machine but consistent, for tracking relative regressions.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

perf_iterations/ was already excluded (not a package, not grafted), but an explicit prune
documents the intent and guards against future MANIFEST changes. Verified: the
benchmark iteration log ships in neither the wheel nor the sdist.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@maximdanilchenko maximdanilchenko merged commit 859ecdf into master Jun 8, 2026
17 checks passed