feat(calcite): vortex-calcite SQL adapter with zone-map aggregate push-down #158

Merged
dfa1 merged 2 commits into main from feat/vortex-calcite-demo on Jun 25, 2026

dfa1 commented Jun 24, 2026

What

New vortex-calcite module: query a Vortex file with SQL via Apache Calcite, plus a push-down aggregate helper that answers MIN/MAX/COUNT from footer zone-map statistics without decoding data segments.
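A minimal sketch of the stats-only aggregation described above, assuming each zone's footer entry exposes a minimum, a maximum and a row count for a non-null integer column. The `ZoneStat` record and the `ZoneStatAggregates` helper are hypothetical names for illustration, not the module's actual types.

```java
import java.util.List;
import java.util.OptionalLong;

/** Hypothetical per-zone footer statistics for one integer column (assumed non-null). */
record ZoneStat(long min, long max, long rowCount) {}

final class ZoneStatAggregates {
    private ZoneStatAggregates() {}

    /** MIN over the column is the smallest per-zone minimum; empty when there are no zones. */
    static OptionalLong min(List<ZoneStat> zones) {
        return zones.stream().mapToLong(ZoneStat::min).min();
    }

    /** MAX over the column is the largest per-zone maximum; empty when there are no zones. */
    static OptionalLong max(List<ZoneStat> zones) {
        return zones.stream().mapToLong(ZoneStat::max).max();
    }

    /** COUNT(*) is the sum of the per-zone row counts. */
    static long count(List<ZoneStat> zones) {
        return zones.stream().mapToLong(ZoneStat::rowCount).sum();
    }
}
```

The cost depends on the number of zones rather than the number of rows, since no data segment is read.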

Why

Vortex's advantage is the scan, not a query engine — an external engine only ever sees decoded values, so it has already paid the cost Vortex can skip (zone-map chunk-skip, encoded-domain compare, stats-table aggregates). This module makes Vortex a push-down SQL source for Calcite, which owns parse/plan/optimise/join. Recorded as ADR 0018.

Contents

  • VortexTable (ScannableTable) — DType.Struct → SQL row type, chunk → Object[].
  • VortexSchema — SQL table name → .vortex file.
  • VortexAggregates: MIN/MAX/COUNT from zone stats (push-down); SUM/AVG via scan (no per-zone SUM stat is emitted yet; that is the next increment in ADR 0013 §6).
  • CalciteSmokeTest — gates Calcite Enumerable + Janino runtime codegen on JDK 25 (passes → no interpreter fallback needed).
  • OhlcSqlDemoTest — 120k-row OHLC file; asserts push-down equals Calcite full-scan ground truth.

Demo result (120k rows, 12 zones)

AGGREGATE     VALUE           SOURCE
MIN(low)      3.65            ZONE_STATS_PUSHDOWN
MAX(high)     1048.1          ZONE_STATS_PUSHDOWN
COUNT(*)      120000          ZONE_STATS_PUSHDOWN
SUM(volume)   120023864741    FULL_SCAN
AVG(volume)   1000198.87      FULL_SCAN
full scan (Calcite): ~577 ms | push-down (min/max/count from stats): ~5.9 ms   (~100x)

Notes

  • Calcite's dependency tree (Avatica, Guava, Janino) is quarantined in this module; core/reader/writer stay dependency-light.
  • Baseline ScannableTable only — full-scan rows still cross Object[]. Filter/project push-down (ProjectableFilterableTable) and aggregate push-down as a RelOptRule(Aggregate(TableScan)) are Phases 1–2 in ADR 0018.
  • SQL-semantics gotchas surfaced and carried into the ADR: date is reserved; AVG(BIGINT) does integer division.
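The `AVG(BIGINT)` gotcha can be illustrated with plain Java arithmetic, used here only as an analogy for the SQL behaviour. The figures are the `SUM(volume)` and `COUNT(*)` values from the demo table above; `AvgDivisionSketch` is a hypothetical name. In SQL, casting the argument first (for example `AVG(CAST(volume AS DOUBLE))`) is the usual way to get the fractional result.

```java
final class AvgDivisionSketch {
    private AvgDivisionSketch() {}

    /** Integer division, as AVG over a BIGINT column behaves: the fraction is truncated. */
    static long integerAvg(long sum, long count) {
        return sum / count;
    }

    /** Widening to double before dividing keeps the fractional part. */
    static double fractionalAvg(long sum, long count) {
        return (double) sum / count;
    }
}
```

With `sum = 120_023_864_741L` and `count = 120_000L`, the first method yields `1000198` while the second yields roughly `1000198.87`, which matches the `AVG(volume)` row in the demo table.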

🤖 Generated with Claude Code

dfa1 force-pushed the feat/vortex-calcite-demo branch from 223fc9c to 9b2d117 on June 24, 2026 20:42

dfa1 commented Jun 24, 2026

Demo numbers

Captured from OhlcSqlDemoTest + AggregatePushDownTest (1M-row synthetic OHLC, 100 zones of 10k rows, warmed, 10 repeated queries).

Aggregate push-down vs Calcite full scan

AGGREGATE     VALUE           SOURCE
MIN(low)      0.06            ZONE_STATS_PUSHDOWN
MAX(high)     5347.11         ZONE_STATS_PUSHDOWN
COUNT(*)      1000000         ZONE_STATS_PUSHDOWN
SUM(volume)   1000028790323   FULL_SCAN
AVG(volume)   1000028.79      FULL_SCAN
per-query avg — full scan 98.56 ms | push-down 4.24 ms | 23x

Push-down time is flat vs row count (stats only). The ~4 ms still includes the volume SUM scan — MIN/MAX/COUNT alone are sub-millisecond. SUM/AVG join the no-decode tier once the writer emits a per-zone SUM stat (ADR 0013 §6).
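A sketch of how SUM and AVG could join the no-decode tier, assuming the writer emitted a per-zone SUM alongside the row count. That stat does not exist yet according to this PR, so `ZoneSumStat` and `ZoneSumAggregates` are purely hypothetical.

```java
import java.util.List;

/** Hypothetical zone stat carrying a per-zone SUM; the PR states the writer does not emit this yet. */
record ZoneSumStat(long sum, long rowCount) {}

final class ZoneSumAggregates {
    private ZoneSumAggregates() {}

    /** SUM is the sum of per-zone sums; addExact fails loudly instead of wrapping on overflow. */
    static long sum(List<ZoneSumStat> zones) {
        long total = 0;
        for (ZoneSumStat zone : zones) {
            total = Math.addExact(total, zone.sum());
        }
        return total;
    }

    /** Total row count across all zones. */
    static long count(List<ZoneSumStat> zones) {
        long total = 0;
        for (ZoneSumStat zone : zones) {
            total = Math.addExact(total, zone.rowCount());
        }
        return total;
    }

    /** AVG as total sum over total count; SQL would return NULL for zero rows, this sketch throws. */
    static double avg(List<ZoneSumStat> zones) {
        long rows = count(zones);
        if (rows == 0) {
            throw new IllegalArgumentException("AVG over zero rows");
        }
        return (double) sum(zones) / rows;
    }
}
```

Averaging the per-zone averages would be wrong when zones differ in size, which is why the sketch carries sums and counts separately.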

Phase 1 — filter push-down (zone-map chunk pruning)

WHERE date BETWEEN 28263 AND 28363
rows matched: 3,030 | max(high): 2719.91
chunks decoded: 1 of 100 — 99% skipped by zone maps

One chunk read instead of 100, exact result. EXPLAIN shows filters/projects folded into the scan.
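The chunk pruning above comes down to an interval-overlap test against each zone's min/max. The sketch below assumes a made-up layout of 100 zones that each cover 334 consecutive day numbers; it is not the demo file's real layout, and `DateZone` and `ZonePruningSketch` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical zone-map entry for the date column: the min and max day number in one chunk. */
record DateZone(int index, int minDate, int maxDate) {
    /** True when the zone's [minDate, maxDate] range overlaps the query range [lo, hi]. */
    boolean mayContain(int lo, int hi) {
        return maxDate >= lo && minDate <= hi;
    }
}

final class ZonePruningSketch {
    private ZonePruningSketch() {}

    /** Keeps only the zones that might hold matching rows; all others are skipped undecoded. */
    static List<DateZone> zonesToDecode(List<DateZone> zones, int lo, int hi) {
        List<DateZone> kept = new ArrayList<>();
        for (DateZone zone : zones) {
            if (zone.mayContain(lo, hi)) {
                kept.add(zone);
            }
        }
        return kept;
    }

    /** Made-up layout: 100 zones, each covering 334 consecutive day numbers. */
    static List<DateZone> syntheticZones() {
        List<DateZone> zones = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            zones.add(new DateZone(i, i * 334, i * 334 + 333));
        }
        return zones;
    }
}
```

Under that assumed layout, `zonesToDecode(syntheticZones(), 28263, 28363)` keeps a single zone. The test is conservative: a kept zone may still contain non-matching rows, so the engine has to re-check each row after decoding.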

Phase 2 — aggregate rule rewrite (no scan at all)

Hep (logical):  LogicalValues(tuples=[[{ 1.42, 1099.88, 200000 }]])
JDBC (Volcano): EnumerableValues(tuples=[[{ 1.42, 1099.88, 200000 }]])

MIN/MAX/COUNT answered from footer stats end-to-end over jdbc:calcite: — chosen plan has zero TableScan, zero Aggregate, zero data decode.
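The shape of that rewrite can be shown with a toy plan model. This is deliberately not Calcite's `RelOptRule` API: `PlanNode`, `TableScanNode`, `AggregateNode`, `ValuesNode` and `AggregateRewriteSketch` are hypothetical stand-ins, and the footer stats are passed in as a plain map keyed by the aggregate call text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Toy plan nodes; stand-ins for a real planner's scan, aggregate and values operators. */
interface PlanNode {}

record TableScanNode(String table) implements PlanNode {}

record AggregateNode(List<String> calls, PlanNode input) implements PlanNode {}

record ValuesNode(List<Object> row) implements PlanNode {}

final class AggregateRewriteSketch {
    private AggregateRewriteSketch() {}

    /**
     * Rewrites Aggregate(TableScan) into a single-row Values when every aggregate call
     * can be answered from the supplied footer stats; otherwise returns the plan unchanged.
     */
    static PlanNode rewrite(PlanNode plan, Map<String, Object> footerStats) {
        if (plan instanceof AggregateNode agg
                && agg.input() instanceof TableScanNode
                && footerStats.keySet().containsAll(agg.calls())) {
            List<Object> row = new ArrayList<>();
            for (String call : agg.calls()) {
                row.add(footerStats.get(call));
            }
            return new ValuesNode(List.copyOf(row));
        }
        return plan;
    }
}
```

An aggregate list containing a call without a matching stat, such as `SUM(volume)` today, leaves the plan untouched, so the scan-based path still applies.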

Real SQL (full-scan today; Phase-1/2 push-down targets)

  • top-5 tickers by SUM(volume): PG 33.39B, JPM 33.38B, ABBV 33.38B, AAPL 33.38B, MRK 33.37B
  • up-days (close > open): 363,274 / 1,000,000

Summary: 23× on the same five aggregates; 99% of chunks skipped on a selective filter; no decode at all for MIN/MAX/COUNT.

Comment thread .github/workflows/ci.yml Outdated
  with:
  path: ~/.m2/repository
- key: ${{ runner.os }}-maven-${{ hashFiles('**/pom.xml') }}
+ key: ${{ runner.os }}-maven-v2-${{ hashFiles('**/pom.xml') }}

please revert this

Comment thread .github/workflows/load.yml Outdated
  with:
  path: ~/.m2/repository
- key: ${{ runner.os }}-maven-${{ hashFiles('**/pom.xml') }}
+ key: ${{ runner.os }}-maven-v2-${{ hashFiles('**/pom.xml') }}

please revert this

Comment thread .github/workflows/sonar.yml Outdated
  with:
  path: ~/.m2/repository
- key: ${{ runner.os }}-maven-${{ hashFiles('**/pom.xml') }}
+ key: ${{ runner.os }}-maven-v2-${{ hashFiles('**/pom.xml') }}

please revert this

/// Same shape as the integration-test generator: 30 tickers, a random walk per ticker, one
/// row per ticker per day. Written with the native Java writer (no JNI) and a small chunk size
/// so the file carries several zones — making the zone-map push-down visible.
final class OhlcGenerator {

this is duplicated several times already

maybe it should be in the integration module and reused here?


@BeforeAll
static void writeFile() throws Exception {
// One 1M-row OHLC file (100 zones at CHUNK rows each), shared by every test.

don't repeat as comment what the code is already saying ;)

Comment thread docs/adr/0018-calcite-sql-adapter.md Outdated
@@ -0,0 +1,220 @@
# ADR 0018: Apache Calcite SQL adapter — be a push-down source, not an engine

- **Status:** Proposed — Phases 0–2 prototyped on `feat/vortex-calcite-demo`; productionisation

just keep "Proposed" as status

dfa1 force-pushed the feat/vortex-calcite-demo branch from 85b756d to b7fd87c on June 25, 2026 05:45
dfa1 and others added 2 commits June 25, 2026 13:25
New `vortex-calcite` module (Apache Calcite 1.40) exposing a Vortex file as a SQL
table, with filter/projection/aggregate push-down into the reader's existing
zone-map primitives. ADR 0018 records the decision: be a push-down source, not a
query engine.

- VortexTable (ProjectableFilterableTable): DType.Struct -> SQL row type; projection
  prunes columns; Calcite predicates (=,<>,<,<=,>,>=,AND,BETWEEN,IN via
  RexUtil.expandSearch) translate to a reader RowFilter for zone-map chunk skipping
  (pushed, not consumed — pruning is approximate, Calcite re-checks rows).
- VortexAggregatePushDownRule: rewrites a whole-table MIN/MAX/COUNT over a VortexTable
  into a single-row Values computed from footer stats — no scan, no decode. Registered
  end-to-end on the JDBC planner via Hook.PLANNER.
- VortexAggregates / VortexSchema: stats-backed helpers; sum is exact Long for integer
  columns (no double precision loss).
- Demos (OhlcSqlDemoTest, AggregatePushDownTest): 1M-row OHLC, MIN/MAX/COUNT ~44x vs
  full scan; date-range filter prunes 99% of chunks; EXPLAIN shows the rewrite.
- CalciteSmokeTest gates Calcite/Janino runtime codegen on JDK 25.

Heavy Calcite deps quarantined in this module; core/reader/writer stay clean.
OHLC test data is single-sourced in core.testing.OhlcData (core test-jar), reused
by calcite and (via a thin adapter) integration.
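The point about exact `Long` sums can be seen in a small comparison between a `long` accumulator and a `double` accumulator. `ExactSumSketch` is a hypothetical helper for illustration, not the module's `VortexAggregates`.

```java
final class ExactSumSketch {
    private ExactSumSketch() {}

    /** Exact integer sum; addExact throws ArithmeticException rather than wrapping on overflow. */
    static long exactSum(long[] values) {
        long total = 0;
        for (long value : values) {
            total = Math.addExact(total, value);
        }
        return total;
    }

    /** Same sum through a double accumulator; increments are lost once the total reaches 2^53. */
    static double doubleSum(long[] values) {
        double total = 0;
        for (long value : values) {
            total += value;
        }
        return total;
    }
}
```

For the input `{1L << 53, 1, 1}` the `long` version returns 2^53 + 2, while the `double` version stays at 2^53 because each `+ 1` rounds away.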

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…t<Object[]>

VortexTable.scan() built the entire result as a List<Object[]> (one fresh array +
boxed cell per row) before returning. For a 1M-row full scan an async-profiler run
showed ~72% of CPU in G1 GC: every row was promoted into the old gen and the whole
result stayed live at once. Column decode itself was ~0.5%.

Replace it with a streaming Enumerator that advances chunk by chunk, decoding each
requested column once per chunk and yielding one row per moveNext(). Rows are no
longer retained, so the working set is one chunk and rows die in the young gen.
Fresh array per row is kept (correct for ORDER BY / joins that retain rows).
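The streaming approach described above might look roughly like the sketch below. `RowCursor` is a local stand-in that mirrors the `moveNext()`/`current()`/`close()` shape of Calcite's `Enumerator`, and `DecodedChunk` is a hypothetical holder for one chunk's already-decoded columns; neither is the PR's actual code.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Local stand-in mirroring the moveNext/current/close shape of an enumerator. */
interface RowCursor extends AutoCloseable {
    boolean moveNext();

    Object[] current();

    @Override
    void close();
}

/** Hypothetical decoded chunk: one array per requested column, each of length rowCount. */
record DecodedChunk(Object[][] columns, int rowCount) {}

final class ChunkStreamingCursor implements RowCursor {
    private final Iterator<DecodedChunk> chunks;
    private DecodedChunk chunk;      // only the current chunk is kept alive
    private int rowInChunk = -1;
    private Object[] row;

    ChunkStreamingCursor(Iterator<DecodedChunk> chunks) {
        this.chunks = chunks;
    }

    @Override
    public boolean moveNext() {
        // Advance to the next chunk that still has rows; empty chunks are skipped.
        while (chunk == null || rowInChunk + 1 >= chunk.rowCount()) {
            if (!chunks.hasNext()) {
                row = null;
                return false;
            }
            chunk = chunks.next();
            rowInChunk = -1;
        }
        rowInChunk++;
        // A fresh array per row, so callers that retain rows (sorts, joins) stay correct.
        Object[] fresh = new Object[chunk.columns().length];
        for (int c = 0; c < fresh.length; c++) {
            fresh[c] = chunk.columns()[c][rowInChunk];
        }
        row = fresh;
        return true;
    }

    @Override
    public Object[] current() {
        if (row == null) {
            throw new NoSuchElementException("no current row");
        }
        return row;
    }

    @Override
    public void close() {
        chunk = null;
        row = null;
    }
}
```

If the supplied iterator decodes lazily, at most one chunk is resident at a time, and rows that the consumer does not retain become garbage almost immediately.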

Measured (CalciteDemo, 1M rows, MIN/MAX/COUNT full scan): GC 71% -> 3%,
~52 ms/query -> ~28 ms/query.

CalciteDemo is a profiling harness, disabled unless -Ddemo.profile=true; run under
async-profiler by attaching to the forked test JVM (argLine is owned by the
byte-buddy agent goal, so attach by PID rather than -agentpath).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
dfa1 force-pushed the feat/vortex-calcite-demo branch from c8921c1 to 888698f on June 25, 2026 11:26
dfa1 merged commit 173e089 into main on Jun 25, 2026
6 checks passed
dfa1 deleted the feat/vortex-calcite-demo branch on June 25, 2026 14:07