feat(calcite): vortex-calcite SQL adapter with zone-map aggregate push-down #158
Demo numbers
Captured from
Aggregate push-down vs Calcite full scan
Push-down time is flat vs row count (stats only). The ~4 ms still includes the
Phase 1: filter push-down (zone-map chunk pruning)
One chunk read instead of 100, exact result.
Phase 2: aggregate rule rewrite (no scan at all)
Real SQL (full-scan today; Phase-1/2 push-down targets)
Summary: 23× on the same five aggregates; 99% of chunks skipped on a selective filter; no decode at all for |
| with: | ||
| path: ~/.m2/repository | ||
| key: ${{ runner.os }}-maven-${{ hashFiles('**/pom.xml') }} | ||
| key: ${{ runner.os }}-maven-v2-${{ hashFiles('**/pom.xml') }} |
| /// Same shape as the integration-test generator: 30 tickers, a random walk per ticker, one | ||
| /// row per ticker per day. Written with the native Java writer (no JNI) and a small chunk size | ||
| /// so the file carries several zones — making the zone-map push-down visible. | ||
| final class OhlcGenerator { |
This is duplicated several times already.
Maybe it should live in the integration module and be reused here?
| @BeforeAll | ||
| static void writeFile() throws Exception { | ||
| // One 1M-row OHLC file (100 zones at CHUNK rows each), shared by every test. |
Don't repeat in a comment what the code already says ;)
| @@ -0,0 +1,220 @@ | |||
| # ADR 0018: Apache Calcite SQL adapter — be a push-down source, not an engine | |||
| - **Status:** Proposed — Phases 0–2 prototyped on `feat/vortex-calcite-demo`; productionisation | |||
Just keep "Proposed" as the status.
New `vortex-calcite` module (Apache Calcite 1.40) exposing a Vortex file as a SQL table, with filter/projection/aggregate push-down into the reader's existing zone-map primitives. ADR 0018 records the decision: be a push-down source, not a query engine.

- VortexTable (ProjectableFilterableTable): DType.Struct -> SQL row type; projection prunes columns; Calcite predicates (=,<>,<,<=,>,>=,AND,BETWEEN,IN via RexUtil.expandSearch) translate to a reader RowFilter for zone-map chunk skipping (pushed, not consumed; pruning is approximate, so Calcite re-checks rows).
- VortexAggregatePushDownRule: rewrites a whole-table MIN/MAX/COUNT over a VortexTable into a single-row Values computed from footer stats, with no scan and no decode. Registered end-to-end on the JDBC planner via Hook.PLANNER.
- VortexAggregates / VortexSchema: stats-backed helpers; sum is exact Long for integer columns (no double precision loss).
- Demos (OhlcSqlDemoTest, AggregatePushDownTest): 1M-row OHLC, MIN/MAX/COUNT ~44x vs full scan; date-range filter prunes 99% of chunks; EXPLAIN shows the rewrite.
- CalciteSmokeTest gates Calcite/Janino runtime codegen on JDK 25.

Heavy Calcite deps quarantined in this module; core/reader/writer stay clean. OHLC test data is single-sourced in core.testing.OhlcData (core test-jar), reused by calcite and (via a thin adapter) integration.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
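To illustrate why a pushed filter can skip chunks and still needs a row-level re-check, here is a minimal Java sketch of zone-map pruning for a closed range predicate. It is not the module's code: `Zone`, `RangeFilter`, `prune`, and `scan` are hypothetical names invented for this example, and the zone is modelled as a plain `long[]` rather than an encoded Vortex chunk.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of zone-map pruning; none of these types are the vortex-calcite API. */
final class ZonePruningSketch {

    /** Per-zone min/max for one column, plus the zone's values (stand-in for a chunk). */
    record Zone(long min, long max, long[] values) {}

    /** Closed range predicate lo <= value <= hi, roughly what a BETWEEN translates to. */
    record RangeFilter(long lo, long hi) {
        /** True when [min, max] cannot overlap [lo, hi], so the zone is never read. */
        boolean canSkip(Zone z) {
            return z.max() < lo || z.min() > hi;
        }

        boolean matches(long v) {
            return v >= lo && v <= hi;
        }
    }

    /** Source side: keep only zones whose stats say they might hold a match. */
    static List<Zone> prune(List<Zone> zones, RangeFilter f) {
        List<Zone> survivors = new ArrayList<>();
        for (Zone z : zones) {
            if (!f.canSkip(z)) {
                survivors.add(z);
            }
        }
        return survivors;
    }

    /** Engine side: surviving zones may still hold non-matching rows, so re-check each one. */
    static List<Long> scan(List<Zone> zones, RangeFilter f) {
        List<Long> out = new ArrayList<>();
        for (Zone z : prune(zones, f)) {
            for (long v : z.values()) {
                if (f.matches(v)) {
                    out.add(v);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Zone> zones = List.of(
                new Zone(1, 10, new long[] {1, 5, 10}),
                new Zone(11, 20, new long[] {11, 15, 20}),
                new Zone(21, 30, new long[] {21, 25, 30}));
        RangeFilter f = new RangeFilter(14, 22);
        System.out.println(prune(zones, f).size()); // 2 zones survive, 1 is skipped
        System.out.println(scan(zones, f));         // [15, 20, 21]
    }
}
```

In the example the second zone survives pruning but still contains 11, which falls outside the range. That is the sense in which pruning is approximate and the filter is pushed but not consumed.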
…t<Object[]>

VortexTable.scan() built the entire result as a List<Object[]> (one fresh array + boxed cell per row) before returning. For a 1M-row full scan an async-profiler run showed ~72% of CPU in G1 GC: every row was promoted into the old gen and the whole result stayed live at once. Column decode itself was ~0.5%.

Replace it with a streaming Enumerator that advances chunk by chunk, decoding each requested column once per chunk and yielding one row per moveNext(). Rows are no longer retained, so the working set is one chunk and rows die in the young gen. Fresh array per row is kept (correct for ORDER BY / joins that retain rows).

Measured (CalciteDemo, 1M rows, MIN/MAX/COUNT full scan): GC 71% -> 3%, ~52 ms/query -> ~28 ms/query.

CalciteDemo is a profiling harness, disabled unless -Ddemo.profile=true; run under async-profiler by attaching to the forked test JVM (argLine is owned by the byte-buddy agent goal, so attach by PID rather than -agentpath).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
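The streaming shape described in this commit can be sketched roughly as below. This is an assumption-laden illustration, not the actual change: it uses `java.util.Iterator` as a stand-in for Calcite's `Enumerator` so the example has no dependencies, and it models a decoded chunk as a hypothetical `Object[][]` indexed as `columns[column][row]`.

```java
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/**
 * Hypothetical sketch: yields one row at a time while holding only the current chunk.
 * A chunk is modelled as Object[column][row]; the real reader types are not shown here.
 */
final class ChunkedRowIterator implements Iterator<Object[]> {
    private final Iterator<Object[][]> chunks;
    private Object[][] current; // decoded columns of the chunk being iterated
    private int rowsInChunk;
    private int row;

    ChunkedRowIterator(Iterator<Object[][]> chunks) {
        this.chunks = chunks;
    }

    @Override
    public boolean hasNext() {
        // Advance to the next non-empty chunk once the current one is exhausted.
        while (current == null || row >= rowsInChunk) {
            if (!chunks.hasNext()) {
                return false;
            }
            current = chunks.next(); // each column is decoded once per chunk
            rowsInChunk = current.length == 0 ? 0 : current[0].length;
            row = 0;
        }
        return true;
    }

    @Override
    public Object[] next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        // Fresh array per row, so callers that retain rows (sorts, joins) stay correct.
        Object[] out = new Object[current.length];
        for (int c = 0; c < current.length; c++) {
            out[c] = current[c][row];
        }
        row++;
        return out;
    }

    public static void main(String[] args) {
        List<Object[][]> chunks = List.of(
                new Object[][] {{"AAA", "BBB"}, {1L, 2L}},
                new Object[][] {{"CCC"}, {3L}});
        Iterator<Object[]> rows = new ChunkedRowIterator(chunks.iterator());
        while (rows.hasNext()) {
            Object[] r = rows.next();
            System.out.println(r[0] + " " + r[1]);
        }
    }
}
```

Only the current chunk is referenced by the iterator, so rows that the consumer drops become garbage immediately instead of accumulating in one large list.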
What
New `vortex-calcite` module: query a Vortex file with SQL via Apache Calcite, plus a push-down aggregate helper that answers `MIN`/`MAX`/`COUNT` from footer zone-map statistics without decoding data segments.
Why
Vortex's advantage is the scan, not a query engine — an external engine only ever sees decoded values, so it has already paid the cost Vortex can skip (zone-map chunk-skip, encoded-domain compare, stats-table aggregates). This module makes Vortex a push-down SQL source for Calcite, which owns parse/plan/optimise/join. Recorded as ADR 0018.
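As a rough picture of what a stats-table aggregate means, the Java sketch below folds per-zone statistics into a whole-table `MIN`/`MAX`/`COUNT` without touching any row data. The `ZoneStats` and `MinMaxCount` records and the `fold` method are hypothetical names for this example only; it also assumes a non-null integer column, so null handling for `COUNT(col)` is deliberately left out.

```java
import java.util.List;

/** Hypothetical sketch of answering MIN/MAX/COUNT from per-zone stats alone. */
final class ZoneStatsAggregates {

    /** Footer-style statistics for one zone of one column (assumed non-null values). */
    record ZoneStats(long min, long max, long rowCount) {}

    /** Whole-table result, the single row an aggregate rewrite would return. */
    record MinMaxCount(long min, long max, long count) {}

    /** Combines zone stats; cost depends on the number of zones, not the number of rows. */
    static MinMaxCount fold(List<ZoneStats> zones) {
        if (zones.isEmpty()) {
            throw new IllegalArgumentException("no zones: MIN/MAX are undefined");
        }
        long min = Long.MAX_VALUE;
        long max = Long.MIN_VALUE;
        long count = 0;
        for (ZoneStats z : zones) {
            min = Math.min(min, z.min());
            max = Math.max(max, z.max());
            count = Math.addExact(count, z.rowCount()); // fail loudly on overflow
        }
        return new MinMaxCount(min, max, count);
    }

    public static void main(String[] args) {
        MinMaxCount r = fold(List.of(
                new ZoneStats(5, 9, 3),
                new ZoneStats(1, 4, 2),
                new ZoneStats(7, 12, 4)));
        System.out.println(r.min() + " " + r.max() + " " + r.count()); // 1 12 9
    }
}
```

Because the loop visits zones rather than rows, this kind of aggregate stays roughly flat as the row count per zone grows.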
Contents
- `VortexTable` (`ScannableTable`): `DType.Struct` → SQL row type, chunk → `Object[]`.
- `VortexSchema`: SQL table name → `.vortex` file.
- `VortexAggregates`: `MIN`/`MAX`/`COUNT` from zone stats (push-down); `SUM`/`AVG` via scan (no per-zone `SUM` stat emitted yet; that is the ADR 0013 §6 next increment).
- `CalciteSmokeTest`: gates Calcite Enumerable + Janino runtime codegen on JDK 25 (passes → no interpreter fallback needed).
- `OhlcSqlDemoTest`: 120k-row OHLC file; asserts push-down equals Calcite full-scan ground truth.
Demo result (120k rows, 12 zones)
Notes
- `ScannableTable` only: full-scan rows still cross `Object[]`. Filter/project push-down (`ProjectableFilterableTable`) and aggregate push-down as a `RelOptRule` (`Aggregate(TableScan)`) are Phases 1–2 in ADR 0018.
- `date` is reserved; `AVG(BIGINT)` does integer division.
🤖 Generated with Claude Code