feat(calcite): vortex-calcite SQL adapter with zone-map aggregate push-down by dfa1 · Pull Request #158 · dfa1/vortex-java

dfa1 · 2026-06-24T19:23:51Z

What

New vortex-calcite module: query a Vortex file with SQL via Apache Calcite, plus a push-down aggregate helper that answers MIN/MAX/COUNT from footer zone-map statistics without decoding data segments.

Why

Vortex's advantage is the scan, not a query engine — an external engine only ever sees decoded values, so it has already paid the cost Vortex can skip (zone-map chunk-skip, encoded-domain compare, stats-table aggregates). This module makes Vortex a push-down SQL source for Calcite, which owns parse/plan/optimise/join. Recorded as ADR 0018.

Contents

VortexTable (ScannableTable): DType.Struct → SQL row type, chunk → Object[].
VortexSchema: SQL table name → .vortex file.
VortexAggregates: MIN/MAX/COUNT from zone stats (push-down); SUM/AVG via scan (no per-zone SUM stat emitted yet; that is the ADR 0013 §6 next increment).
CalciteSmokeTest: gates Calcite Enumerable + Janino runtime codegen on JDK 25 (passes → no interpreter fallback needed).
OhlcSqlDemoTest: 120k-row OHLC file; asserts push-down equals Calcite full-scan ground truth.
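
To illustrate why MIN/MAX/COUNT need no data decode, here is a minimal sketch of folding per-zone statistics into whole-table aggregates. `ZoneStats` and `ZoneStatsFold` are hypothetical names for illustration, not the types in this PR, and the sketch assumes each zone's stats are exact and that null handling is done elsewhere.

```java
import java.util.List;

/** Hypothetical stand-in for one zone's footer statistics; not the vortex-java type. */
record ZoneStats(double min, double max, long rowCount) {}

final class ZoneStatsFold {
    /** Whole-table MIN is the smallest per-zone min. */
    static double min(List<ZoneStats> zones) {
        return zones.stream().mapToDouble(ZoneStats::min).min().orElseThrow();
    }

    /** Whole-table MAX is the largest per-zone max. */
    static double max(List<ZoneStats> zones) {
        return zones.stream().mapToDouble(ZoneStats::max).max().orElseThrow();
    }

    /** Whole-table COUNT(*) is the sum of per-zone row counts. */
    static long count(List<ZoneStats> zones) {
        return zones.stream().mapToLong(ZoneStats::rowCount).sum();
    }
}
```

The work is proportional to the number of zones rather than the number of rows, which is presumably why the push-down path stays cheap as the file grows.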

Demo result (120k rows, 12 zones)

AGGREGATE     VALUE           SOURCE
MIN(low)      3.65            ZONE_STATS_PUSHDOWN
MAX(high)     1048.1          ZONE_STATS_PUSHDOWN
COUNT(*)      120000          ZONE_STATS_PUSHDOWN
SUM(volume)   120023864741    FULL_SCAN
AVG(volume)   1000198.87      FULL_SCAN
full scan (Calcite): ~577 ms | push-down (min/max/count from stats): ~5.9 ms   (~100x)

Notes

Calcite's dependency tree (Avatica, Guava, Janino) is quarantined in this module; core/reader/writer stay dependency-light.
Baseline ScannableTable only — full-scan rows still cross Object[]. Filter/project push-down (ProjectableFilterableTable) and aggregate push-down as a RelOptRule(Aggregate(TableScan)) are Phases 1–2 in ADR 0018.
SQL-semantics gotchas surfaced and carried into the ADR: date is reserved; AVG(BIGINT) does integer division.
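
The AVG(BIGINT) gotcha can be reproduced in plain Java, since long division truncates the same way. This is only a sketch of the arithmetic using the demo's SUM(volume) and COUNT(*) values; `AvgDivision` is a hypothetical helper, not code from this PR.

```java
final class AvgDivision {
    /** long / long truncates the fraction, as an integer-typed AVG does. */
    static long integerAvg(long sum, long count) {
        return sum / count;
    }

    /** Widening to double before dividing keeps the fraction. */
    static double doubleAvg(long sum, long count) {
        return (double) sum / count;
    }
}
```

With sum = 120023864741 and count = 120000, `integerAvg` yields 1000198 while `doubleAvg` yields roughly 1000198.87, the value shown in the demo table.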

🤖 Generated with Claude Code

dfa1 · 2026-06-24T20:48:18Z

Demo numbers

Captured from OhlcSqlDemoTest + AggregatePushDownTest (1M-row synthetic OHLC, 100 zones of 10k rows, warmed, 10 repeated queries).

Aggregate push-down vs Calcite full scan

AGGREGATE     VALUE           SOURCE
MIN(low)      0.06            ZONE_STATS_PUSHDOWN
MAX(high)     5347.11         ZONE_STATS_PUSHDOWN
COUNT(*)      1000000         ZONE_STATS_PUSHDOWN
SUM(volume)   1000028790323   FULL_SCAN
AVG(volume)   1000028.79      FULL_SCAN
per-query avg — full scan 98.56 ms | push-down 4.24 ms | 23x

Push-down time is flat vs row count (stats only). The ~4 ms still includes the volume SUM scan — MIN/MAX/COUNT alone are sub-millisecond. SUM/AVG join the no-decode tier once the writer emits a per-zone SUM stat (ADR 0013 §6).

Phase 1 — filter push-down (zone-map chunk pruning)

WHERE date BETWEEN 28263 AND 28363
rows matched: 3,030 | max(high): 2719.91
chunks decoded: 1 of 100 — 99% skipped by zone maps

One chunk read instead of 100, exact result. EXPLAIN shows filters/projects folded into the scan.
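
A minimal sketch of the zone-map check behind a BETWEEN filter, assuming each chunk records the min and max of the filtered column. `Zone` and `ZonePruning` are hypothetical names, not the reader's RowFilter API. A chunk is decoded only if its [min, max] range overlaps the requested range; rows inside a kept chunk still have to be re-checked.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical per-chunk zone map for one column: the chunk's min and max value. */
record Zone(int chunkIndex, long min, long max) {}

final class ZonePruning {
    /** Returns the chunks whose [min, max] overlaps [lo, hi]; the rest cannot match. */
    static List<Integer> chunksToDecode(List<Zone> zones, long lo, long hi) {
        List<Integer> keep = new ArrayList<>();
        for (Zone z : zones) {
            boolean overlaps = z.max() >= lo && z.min() <= hi;
            if (overlaps) {
                keep.add(z.chunkIndex());
            }
        }
        return keep;
    }
}
```

Pruning of this kind is most effective when the filtered column is roughly sorted across chunks, so that each chunk covers a narrow value range.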

Phase 2 — aggregate rule rewrite (no scan at all)

Hep (logical):  LogicalValues(tuples=[[{ 1.42, 1099.88, 200000 }]])
JDBC (Volcano): EnumerableValues(tuples=[[{ 1.42, 1099.88, 200000 }]])

MIN/MAX/COUNT answered from footer stats end-to-end over jdbc:calcite: — chosen plan has zero TableScan, zero Aggregate, zero data decode.
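
The shape of the rewrite can be sketched with a toy plan model. The node types below (`Plan`, `TableScan`, `Aggregate`, `Values`) and the `Stats` lookup are hypothetical stand-ins written for this example; they are not Calcite's RelNode classes or the rule in this PR. The idea is that an Aggregate directly over a TableScan becomes a single-row Values when every call has a footer stat, and is left alone otherwise.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy plan nodes; hypothetical stand-ins, not Calcite's RelNode classes. */
interface Plan {}
record TableScan(String table) implements Plan {}
record Aggregate(Plan input, List<String> calls) implements Plan {}
record Values(List<Object> row) implements Plan {}

final class AggregateRewrite {
    /** Hypothetical footer-stats lookup: a value for the call, or null if no stat exists. */
    interface Stats {
        Object lookup(String table, String call);
    }

    /** Rewrites Aggregate(TableScan) to Values when all calls are answerable from stats. */
    static Plan rewrite(Plan plan, Stats stats) {
        if (plan instanceof Aggregate agg && agg.input() instanceof TableScan scan) {
            List<Object> row = new ArrayList<>();
            for (String call : agg.calls()) {
                Object value = stats.lookup(scan.table(), call);
                if (value == null) {
                    return plan; // e.g. SUM without a per-zone stat: keep the scan
                }
                row.add(value);
            }
            return new Values(List.copyOf(row));
        }
        return plan; // any other shape is left untouched
    }
}
```

Bailing out as soon as one call lacks a stat keeps the rewrite all-or-nothing, so the resulting plan never mixes stat-derived and scanned values in one row.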

Real SQL (full-scan today; Phase-1/2 push-down targets)

top-5 tickers by SUM(volume): PG 33.39B, JPM 33.38B, ABBV 33.38B, AAPL 33.38B, MRK 33.37B
up-days (close > open): 363,274 / 1,000,000

Summary: 23× on the same five aggregates; 99% of chunks skipped on a selective filter; no decode at all for MIN/MAX/COUNT.

dfa1 · 2026-06-25T05:25:55Z

        with:
          path: ~/.m2/repository
-          key: ${{ runner.os }}-maven-${{ hashFiles('**/pom.xml') }}
+          key: ${{ runner.os }}-maven-v2-${{ hashFiles('**/pom.xml') }}


please revert this

dfa1 · 2026-06-25T05:26:03Z

        with:
          path: ~/.m2/repository
-          key: ${{ runner.os }}-maven-${{ hashFiles('**/pom.xml') }}
+          key: ${{ runner.os }}-maven-v2-${{ hashFiles('**/pom.xml') }}


please revert this

dfa1 · 2026-06-25T05:26:12Z

        with:
          path: ~/.m2/repository
-          key: ${{ runner.os }}-maven-${{ hashFiles('**/pom.xml') }}
+          key: ${{ runner.os }}-maven-v2-${{ hashFiles('**/pom.xml') }}


please revert this

dfa1 · 2026-06-25T05:28:11Z

+/// Same shape as the integration-test generator: 30 tickers, a random walk per ticker, one
+/// row per ticker per day. Written with the native Java writer (no JNI) and a small chunk size
+/// so the file carries several zones — making the zone-map push-down visible.
+final class OhlcGenerator {


this is duplicated several times already

maybe it should be in the integration module and reused here?

dfa1 · 2026-06-25T05:28:43Z

+
+    @BeforeAll
+    static void writeFile() throws Exception {
+        // One 1M-row OHLC file (100 zones at CHUNK rows each), shared by every test.


don't repeat as comment what the code is already saying ;)

dfa1 · 2026-06-25T05:29:51Z

@@ -0,0 +1,220 @@
+# ADR 0018: Apache Calcite SQL adapter — be a push-down source, not an engine
+
+- **Status:** Proposed — Phases 0–2 prototyped on `feat/vortex-calcite-demo`; productionisation


just keep "Proposed" as status

New `vortex-calcite` module (Apache Calcite 1.40) exposing a Vortex file as a SQL table, with filter/projection/aggregate push-down into the reader's existing zone-map primitives. ADR 0018 records the decision: be a push-down source, not a query engine.

- VortexTable (ProjectableFilterableTable): DType.Struct -> SQL row type; projection prunes columns; Calcite predicates (=, <>, <, <=, >, >=, AND, BETWEEN, IN via RexUtil.expandSearch) translate to a reader RowFilter for zone-map chunk skipping (pushed, not consumed: pruning is approximate, Calcite re-checks rows).
- VortexAggregatePushDownRule: rewrites a whole-table MIN/MAX/COUNT over a VortexTable into a single-row Values computed from footer stats, with no scan and no decode. Registered end-to-end on the JDBC planner via Hook.PLANNER.
- VortexAggregates / VortexSchema: stats-backed helpers; sum is exact Long for integer columns (no double precision loss).
- Demos (OhlcSqlDemoTest, AggregatePushDownTest): 1M-row OHLC, MIN/MAX/COUNT ~44x vs full scan; date-range filter prunes 99% of chunks; EXPLAIN shows the rewrite.
- CalciteSmokeTest gates Calcite/Janino runtime codegen on JDK 25.

Heavy Calcite deps quarantined in this module; core/reader/writer stay clean. OHLC test data is single-sourced in core.testing.OhlcData (core test-jar), reused by calcite and (via a thin adapter) integration.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
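
The "exact Long" point in the commit message above can be illustrated with a small comparison. This is a sketch only: `ExactSum` is a hypothetical helper, and the use of `Math.addExact` to surface overflow is an assumption, not necessarily what VortexAggregates does.

```java
final class ExactSum {
    /** Sums as long; throws ArithmeticException on overflow instead of wrapping. */
    static long sumExact(long[] values) {
        long total = 0;
        for (long v : values) {
            total = Math.addExact(total, v);
        }
        return total;
    }

    /** Sums as double; integers above 2^53 are no longer all representable. */
    static double sumAsDouble(long[] values) {
        double total = 0;
        for (long v : values) {
            total += v;
        }
        return total;
    }
}
```

For the input {2^53, 1}, the long sum is 9007199254740993 while the double accumulator stays at 9007199254740992, because the odd value has no double representation.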

…t<Object[]>

VortexTable.scan() built the entire result as a List<Object[]> (one fresh array + boxed cell per row) before returning. For a 1M-row full scan an async-profiler run showed ~72% of CPU in G1 GC: every row was promoted into the old gen and the whole result stayed live at once. Column decode itself was ~0.5%.

Replace it with a streaming Enumerator that advances chunk by chunk, decoding each requested column once per chunk and yielding one row per moveNext(). Rows are no longer retained, so the working set is one chunk and rows die in the young gen. Fresh array per row is kept (correct for ORDER BY / joins that retain rows).

Measured (CalciteDemo, 1M rows, MIN/MAX/COUNT full scan): GC 71% -> 3%, ~52 ms/query -> ~28 ms/query.

CalciteDemo is a profiling harness, disabled unless -Ddemo.profile=true; run under async-profiler by attaching to the forked test JVM (argLine is owned by the byte-buddy agent goal, so attach by PID rather than -agentpath).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
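
A minimal sketch of the streaming approach described in the commit message above. `ChunkRowCursor` is a hypothetical, self-contained class: its moveNext()/current() shape mirrors Calcite's Enumerator, but it does not implement that interface, and it assumes every column is already decoded to a `long[]` per chunk.

```java
import java.util.Iterator;

/** Streams rows chunk by chunk so that only one chunk is held at a time. */
final class ChunkRowCursor {
    private final Iterator<long[][]> chunks; // each chunk: one decoded long[] per column
    private long[][] columns;                // columns of the chunk currently being read
    private int rowInChunk;
    private Object[] current;

    ChunkRowCursor(Iterable<long[][]> chunks) {
        this.chunks = chunks.iterator();
    }

    /** Advances to the next row, pulling the next chunk when the current one is exhausted. */
    boolean moveNext() {
        while (columns == null || columns.length == 0 || rowInChunk >= columns[0].length) {
            if (!chunks.hasNext()) {
                return false;
            }
            columns = chunks.next(); // one fetch per chunk, not per row
            rowInChunk = 0;
        }
        Object[] row = new Object[columns.length]; // fresh array, safe for callers to retain
        for (int c = 0; c < columns.length; c++) {
            row[c] = columns[c][rowInChunk];
        }
        rowInChunk++;
        current = row;
        return true;
    }

    Object[] current() {
        return current;
    }
}
```

Because the cursor holds a reference only to the current chunk and the last row, earlier rows become unreachable as soon as the consumer drops them, which matches the working-set reduction described in the commit.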

dfa1 force-pushed the feat/vortex-calcite-demo branch from 223fc9c to 9b2d117 Compare June 24, 2026 20:42


dfa1 force-pushed the feat/vortex-calcite-demo branch from 85b756d to b7fd87c Compare June 25, 2026 05:45

dfa1 and others added 2 commits June 25, 2026 13:25

dfa1 force-pushed the feat/vortex-calcite-demo branch from c8921c1 to 888698f Compare June 25, 2026 11:26

dfa1 merged commit 173e089 into main Jun 25, 2026
6 checks passed

dfa1 deleted the feat/vortex-calcite-demo branch June 25, 2026 14:07
