[WIP][DNR] Staging branch for tpcds#110

Open
shrshi wants to merge 1544 commits into rapidsai:IBM-techpreview from shrshi:tpcds-staging

@shrshi shrshi commented May 15, 2026

Branch used for VeloxCon 2026 results. Aggregates open PRs tracked in facebookincubator#15772

Dario Pavlovic and others added 30 commits April 17, 2026 16:54
…SizeStats (facebookincubator#17231)

Summary:
Pull Request resolved: facebookincubator#17231

CONTEXT: RowVector::estimateFlatSize() recursively walks all column vectors to
estimate batch memory size. For wide schemas this is very expensive — profiling
shows it consuming ~33% of total CPU in certain data pipelines, called
unconditionally twice per batch in TableScan::getOutput.

WHAT: Gate the estimateFlatSize call in TableScan::getOutput behind the existing
enableOperatorBatchSizeStats config flag, matching the pattern already used in
Driver::runInternal. When disabled (the default), both addInputVector and
RECORD_METRIC_VALUE receive 0 bytes — row counts are still tracked. Also
deduplicates the computation from two calls to one.
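
The gating pattern described above can be sketched as follows. This is a minimal illustration, not the Velox code: `ToyBatch`, `estimateCalls`, and `bytesForStats` are hypothetical names, and the toy `estimateFlatSize` only stands in for the real recursive walk.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Toy stand-in for a batch (assumption); the real RowVector is far richer.
struct ToyBatch {
  std::vector<std::vector<int64_t>> columns;
};

// Counts how often the expensive walk runs, for illustration only.
inline int& estimateCalls() {
  static int calls = 0;
  return calls;
}

// Hypothetical analogue of an expensive per-column size walk.
inline uint64_t estimateFlatSize(const ToyBatch& batch) {
  ++estimateCalls();
  uint64_t bytes = 0;
  for (const auto& column : batch.columns) {
    bytes += column.size() * sizeof(int64_t);
  }
  return bytes;
}

// Returns the byte count to record. The walk runs at most once per batch and
// only when batch-size stats are enabled; otherwise 0 bytes are reported.
inline uint64_t bytesForStats(const ToyBatch& batch, bool enableBatchSizeStats) {
  return enableBatchSizeStats ? estimateFlatSize(batch) : 0;
}
```

A caller would compute `bytesForStats` once and pass the same value to every consumer of the byte count, while row counts are recorded regardless of the flag.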

Reviewed By: Yuhta

Differential Revision: D98855710

fbshipit-source-id: 0fcc0f6d4193e468e20b053478c030cd4559c3c6
…ebookincubator#17210)

Summary:
In debug mode, CUDA errors from GPU operations may go undetected until much later, making them hard to attribute. This adds cudaGetLastError checks at the boundary of each operator method to surface errors as early as possible.

Pull Request resolved: facebookincubator#17210

Reviewed By: pratikpugalia

Differential Revision: D101388224

Pulled By: peterenescu

fbshipit-source-id: 0e29b9acd39e676fb7b2f6626606c88828cf2186
- CudfHashAggregation.cpp: Add missing stream parameter to
  StddevSampAggregator::addGroupbyRequest, replace veloxToCudfTypeId
  with veloxToCudfDataType
- CudfHashJoin.cpp: Remove orphaned initializeFilter() calls
- CudfHiveDataSource: Add getTableRowType() declaration, fix variable
  name typo, remove dead code
- CudfConfig.h: Add missing functionEngine member used by AstUtils.h
- ExpressionEvaluator.cpp: Remove merge conflict markers
…cebookincubator#17139)

Summary:
Pull Request resolved: facebookincubator#17139

When IndexLookupJoin has `needsIndexSplit=true`, its index TableScan node ID is included in `groupedExecutionLeafNodeIds` for coordinator-side split scheduling. However, this node is NOT a separate pipeline leaf in Velox — IndexLookupJoin manages the index source internally. Without this change, `validateGroupedExecutionLeafNodes` rejects the plan because it cannot find the index source node ID in any driver factory.

Add `collectIndexLookupSourceIds` to collect index lookup source node IDs and skip them during leaf validation. Fix `noMoreSplitsForGroup` to handle missing split stores (creates a store with noMoreSplits already set). Fix `getSplitOrFuture` to propagate global `noMoreSplits` to newly created per-group stores.

Reviewed By: xiaoxmeng

Differential Revision: D100372349

fbshipit-source-id: fb4093f87e9d665bb3d4cd697b2f6365851c34dc
…cebookincubator#17235)

Summary:
Pull Request resolved: facebookincubator#17235

Add DWRF file format support to the Iceberg data sink. Velox already supports the DWRF file format, so the Iceberg connector can use it.

Part of prestodb/presto#27198

Reviewed By: srsuryadev

Differential Revision: D100061159

fbshipit-source-id: f87d7bb7a3743d3bdb88285343ea61a6e5f5bcfb
…facebookincubator#17157) (facebookincubator#17157)

Summary:
Add three new IP address utility functions:
- ip_version(IPADDRESS) -> BIGINT
- ip_version(IPPREFIX) -> BIGINT
- ip_prefix_masklen(IPPREFIX) -> BIGINT

Without these functions, users who need IP version or prefix length
must cast to VARCHAR and parse the string representation. This forces
them to either defer the conversion to typed IPADDRESS/IPPREFIX and
carry raw VARCHAR throughout their dataflow, or create redundant
columns for metadata that should be derivable from the type itself.

This has three costs:

Correctness — String-based workarounds are error-prone. IPv4-mapped
IPv6 addresses like ::ffff:1.2.3.4 contain ":" but are IPv4.
Different string representations of the same IP can cause silent
mismatches. Dedicated functions operate on the native binary
representation, eliminating these classes of bugs.

Performance — String interaction is significantly slower than
operating on the typed representation. ip_version is a single bit
check on the 128-bit address; ip_prefix_masklen reads the stored
prefix length byte directly — no VARCHAR cast, no allocation, no
parsing.

Completeness — Presto already has a rich set of IP functions
(ip_prefix, ip_subnet_min, ip_subnet_max, ip_prefix_collapse,
is_subnet_of, is_private_ip), but lacks these two basic introspection
primitives. Adding them promotes the use of typed IPADDRESS/IPPREFIX
at the entry point of data pipelines.
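
To illustrate why working on the binary representation is cheap, here is a rough sketch of a version check. It assumes the address is held as 16 big-endian bytes and that IPv4 values are stored in IPv4-mapped form (`::ffff:a.b.c.d`). `IpAddressBytes` and `ipVersion` are hypothetical names, and the actual implementation in the PR may check the address differently.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// 128-bit address in network (big-endian) byte order (assumption).
using IpAddressBytes = std::array<uint8_t, 16>;

// Returns 4 when the address carries the IPv4-mapped prefix ::ffff:0:0/96,
// otherwise 6. No string conversion or allocation is involved.
inline int64_t ipVersion(const IpAddressBytes& address) {
  for (int i = 0; i < 10; ++i) {
    if (address[i] != 0) {
      return 6;
    }
  }
  return (address[10] == 0xff && address[11] == 0xff) ? 4 : 6;
}
```

With this layout, `::ffff:1.2.3.4` is classified as IPv4 even though its text form contains ":", which is the case the string-based workaround gets wrong.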

Existing art: BigQuery has NET.IP_VERSION(). PostgreSQL's inet type
supports family() and masklen().

Presto docs: https://prestodb.io/docs/current/functions/ip.html
Discussion: https://fb.workplace.com/groups/presto.dev/permalink/31395723200049562/

Pulled By: tc25898

Pull Request resolved: facebookincubator#17157

Reviewed By: kaikalur

Differential Revision: D100568584

fbshipit-source-id: 2ceac57ca514a178135261e6270557b54c04295d
Implement CudfGroupId operator to replace the CPU GroupId operator on GPU
for SQL GROUPING SETS, CUBE, and ROLLUP operations.

- Add CudfGroupId class inheriting from CudfOperatorBase
- Cycle through grouping sets one at a time (matching CPU behavior)
- Create all-null columns for keys not in current grouping set
- Create constant group_id column for each grouping set
- Optimize column ownership with usage counting (move vs copy)
- Register GroupIdAdapter in OperatorAdapters
- Add comprehensive tests matching core Velox test patterns
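
The expansion that a GroupId-style operator performs can be sketched on the CPU as below. This is only an illustration of the semantics listed above (one pass per grouping set, nulls for keys outside the set, a constant group_id column); it does not use libcudf, and `Row` and `expandGroupingSets` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <set>
#include <utility>
#include <vector>

// Key columns followed by the group_id value; nullopt models SQL NULL.
using Row = std::vector<std::optional<int64_t>>;

// Emits every input row once per grouping set. Keys that are not part of the
// current set become null, and the set's index is appended as group_id.
inline std::vector<Row> expandGroupingSets(
    const std::vector<std::vector<int64_t>>& inputRows,
    const std::vector<std::set<size_t>>& groupingSets) {
  std::vector<Row> output;
  for (size_t setIndex = 0; setIndex < groupingSets.size(); ++setIndex) {
    for (const auto& inputRow : inputRows) {
      Row row;
      for (size_t key = 0; key < inputRow.size(); ++key) {
        if (groupingSets[setIndex].count(key) > 0) {
          row.emplace_back(inputRow[key]);
        } else {
          row.emplace_back(std::nullopt);
        }
      }
      row.emplace_back(static_cast<int64_t>(setIndex));
      output.push_back(std::move(row));
    }
  }
  return output;
}
```

For a ROLLUP over two keys, the grouping sets would be {0, 1}, {0}, and {}, so each input row appears three times with progressively more null keys.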
…GSEGV (facebookincubator#17247)

Summary:
Pull Request resolved: facebookincubator#17247

D100855055 introduced TDigestAccumulator.h with layout {double compression, TDigest digest} and updated TDigestAggregate.cpp to use it, but did NOT update MergeTDigestAggregate.h. This left two conflicting TDigestAccumulator structs in the same namespace (facebook::velox::aggregate::prestosql) with different memory layouts.

Both translation units instantiate the template Aggregate::destroyAccumulators<TDigestAccumulator> as weak symbols. The linker deduplicates to one instantiation, which uses the wrong sizeof(T) for one of the two callers. The memset in destroyAccumulators writes past the end of the accumulator, corrupting adjacent memory.

Crash chain in production:
1. Query uses merge(CAST(tdigest_... AS tdigest(double))) triggering MergeTDigestAggregate
2. TDigest compression_ reads as 0 (should be 100) due to wrong struct layout offsets
3. VELOX_USER_CHECK_EQ throws: "100 vs. 0 Cannot merge TDigests with different compression parameters"
4. During cleanup, ~TDigest() calls HashStringAllocator::freeToPool() which hits SIGSEGV — memory corrupted by wrong-sized memset

Fix: Remove the duplicate TDigestAccumulator from MergeTDigestAggregate.h and include TDigestAccumulator.h instead. Update member access from digest_ (private) to digest (public).

Reviewed By: natashasehgal, talgalili

Differential Revision: D101577660

fbshipit-source-id: 79ef12d38d45a08128f43d3c8ec2611c307edc12
…r hierarchy (facebookincubator#17246)

Summary:
X-link: facebookincubator/nimble#662

Pull Request resolved: facebookincubator#17246

Rename the protected member `memoryPool_` to `pool_` across the `SelectiveColumnReader` class hierarchy. The public accessor `memoryPool()` retains its name so external callers are unaffected.

This is a mechanical rename across 21 files spanning Velox common, DWRF selective readers, Parquet readers, and Nimble selective readers. The DWRF non-selective `ColumnReader` hierarchy has its own separate `memoryPool_` member and is intentionally left unchanged.

Reviewed By: xiaoxmeng

Differential Revision: D101539025

fbshipit-source-id: 91a0742c43602b3ea3df727ecd0d4b5e490a6109
Summary: Pull Request resolved: facebookincubator#16987

Reviewed By: srsuryadev, rjaber

Differential Revision: D97773692

fbshipit-source-id: 7e910b042ad5248feaaf0438552a944fe8f8596c
…ubator#17225)

Summary:
After PR facebookincubator#15511, Velox + GEO requires `absl` as a dependency but does not resolve it when `VELOX_BUILD_TESTING=OFF`. This patch fixes the issue.

Fixes build error:

```
CMake Error at build/_deps/s2geometry-src/CMakeLists.txt:54 (find_package):
  By not providing "Findabsl.cmake" in CMAKE_MODULE_PATH this project has
  asked CMake to find a package configuration file provided by "absl", but
  CMake did not find one.

  Could not find a package configuration file provided by "absl" with any of
  the following names:

    abslConfig.cmake
    absl-config.cmake

  Add the installation prefix of "absl" to CMAKE_PREFIX_PATH or set
  "absl_DIR" to a directory containing one of the above files.  If "absl"
  provides a separate development package or SDK, be sure it has been
  installed.
```

Pull Request resolved: facebookincubator#17225

Reviewed By: kgpai

Differential Revision: D101581081

Pulled By: peterenescu

fbshipit-source-id: d5348c99aad81023542a0fb31ea1995466d10e6d
…kincubator#17240)

Summary: Pull Request resolved: facebookincubator#17240

Reviewed By: srsuryadev

Differential Revision: D98150572

fbshipit-source-id: 0bbb1ee8e05f7808e1b2dc9c15b0e14d0db0f352
…on (facebookincubator#17236)

Summary:
Check ast support before pushExprToTree adds the next expression node.

Removed except arg from createCudfExpression because it shouldn't be needed now.

(cherry picked from commit 9dbc09c)

Pull Request resolved: facebookincubator#17236

Reviewed By: pratikpugalia

Differential Revision: D101650152

Pulled By: peterenescu

fbshipit-source-id: 03a81246d39bbc048e3da9d78478476deed803d7
…cubator#17216)

Summary:
Pull Request resolved: facebookincubator#17216

When converting exclusive bounds to inclusive bounds for BigintRange, HugeintRange, and TimestampRange filters, the code unconditionally increments/decrements the boundary value. This overflows when the value is at the type limit (e.g., greaterThan(INT64_MAX) computes INT64_MAX + 1, which wraps to INT64_MIN, creating a range that matches everything instead of nothing). Guard against overflow by returning AlwaysFalse (or IsNull when nulls are allowed) when the boundary is at the type limit. This fixes incorrect query results for filters like WHERE col > 9223372036854775807.
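
A minimal sketch of the guard for the 64-bit case follows. The helper names are hypothetical, and `std::nullopt` stands in for the "no value can match" outcome that the real code maps to AlwaysFalse or IsNull.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <optional>

// Converts "x > bound" into the inclusive lower bound "x >= result".
// Returns std::nullopt when bound is already INT64_MAX: bound + 1 would
// overflow, and no int64_t value can satisfy the filter.
inline std::optional<int64_t> exclusiveLowerToInclusive(int64_t bound) {
  if (bound == std::numeric_limits<int64_t>::max()) {
    return std::nullopt;
  }
  return bound + 1;
}

// Mirror image for "x < bound", guarding against INT64_MIN - 1.
inline std::optional<int64_t> exclusiveUpperToInclusive(int64_t bound) {
  if (bound == std::numeric_limits<int64_t>::min()) {
    return std::nullopt;
  }
  return bound - 1;
}
```

Checking the limit before the arithmetic matters because signed overflow is undefined behavior in C++, so the wraparound cannot be detected reliably after the fact.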

Reviewed By: Yuhta

Differential Revision: D101039167

fbshipit-source-id: fdaddc66c7fb91c079c3dab136b76fc1eafead6e
…kincubator#17206)

Summary:
- Migrate `CudfMarkDistinct` from `exec::Operator` + `NvtxHelper` to the unified `CudfOperatorBase` introduced in facebookincubator#16934
- `CudfMarkDistinct` (facebookincubator#16974) was merged before the base class unification, so it still used the old pattern
- Rename `addInput`/`getOutput` to `doAddInput`/`doGetOutput` (protected template methods)
- Remove manual `VELOX_NVTX_OPERATOR_FUNC_RANGE()` calls — base class handles NVTX profiling uniformly

Pull Request resolved: facebookincubator#17206

Test Plan:
- [x] Built `velox_cudf_mark_distinct_test` successfully
- [x] All 11 existing tests pass
- [x] Pre-commit checks (clang-format, clang-tidy, license headers) pass

Reviewed By: srsuryadev

Differential Revision: D101650187

Pulled By: peterenescu

fbshipit-source-id: c845abdb24274e021118ecafca02cdcc20ced888
facebookincubator#17179)

Summary:
The Memory Arbitration Fuzzer intermittently crashes in CI with `SIGABRT` when a task fails to complete within the 5-second default timeout under heavy concurrent load. `readCursorAsync()` throws `VELOX_FAIL("Failed to wait for task to complete after 5.00s, ...")` with error code `kInvalidState` when a task remains in `Running` state past the timeout.

Under heavy contention (4 fuzzer instances × 72 threads on a 4-core CI runner), threads get starved and tasks legitimately cannot complete within 5 seconds — but this crashes the fuzzer because its error handler assumes any `kInvalidState` must be from an injected fault (spill FS fault or task abort request).

## Fix

Extend the task-completion timeout in the fuzzer to **1 hour** instead of trying to recognize and recover from the 5s default. This avoids false `SIGABRT` crashes from CI thread starvation while still catching real deadlocks.

To make this configurable, add a `maxWaitMicros(uint64_t)` setter and a private `maxWaitMicros_{5'000'000}` member to `AssertQueryBuilder`. Default behavior is unchanged for all existing callers. The fuzzer calls `builder.maxWaitMicros(kOneHourUs)` once per query before reading the cursor.
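
The shape of such a setter can be sketched as below. `ToyQueryBuilder` is a hypothetical stand-in for the test builder, and returning a reference for chaining is an assumption; only the default value and the `kOneHourUs` name come from the description above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical test-only builder; only the timeout knob is modeled.
class ToyQueryBuilder {
 public:
  // Overrides the wait timeout; returns *this so calls can be chained
  // (chaining is an assumption of this sketch).
  ToyQueryBuilder& maxWaitMicros(uint64_t value) {
    maxWaitMicros_ = value;
    return *this;
  }

  uint64_t maxWaitMicros() const {
    return maxWaitMicros_;
  }

 private:
  // Default of 5 seconds, so existing callers see no change.
  uint64_t maxWaitMicros_{5'000'000};
};

// One hour expressed in microseconds.
constexpr uint64_t kOneHourUs = 60ULL * 60 * 1'000'000;
```

Keeping the default inside the builder means only callers that opt in, such as the fuzzer, see the longer timeout.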

`CursorParameters` (in the public `velox/exec/Cursor.h`) is left untouched — `TaskCursor::create` never reads this field, so it belongs in the test-only builder, not the public struct.

## Root cause analysis

Examined 3 CI failures over 2 weeks (Apr 7, 9, 11) — all had the identical signature:
- `Expression: injectedSpillFsFault || injectedTaskAbortRequest`
- `injectedSpillFsFault: false, injectedTaskAbortRequest: false`
- `Failed to wait for task to complete after 5.00s`
- Different plan types (HashJoin, OrderBy, TopNRowNumber) — not operator-specific
- All drivers were stuck in `enqueued` state, waiting for CPU or memory arbitration

Pull Request resolved: facebookincubator#17179

Test Plan:
- [x] Reproduced the original crash with 4 parallel fuzzer instances under elevated memory pressure (`--arbitrator_capacity 128MB --allocator_capacity 512MB --num_batches 64 --global_arbitration_pct 20 --task_abort_interval_ms 500`). Instance 3 crashed with `exit 134 (SIGABRT)` and the exact same error as CI.
- [x] Applied fix, rebuilt, re-ran the same 4-instance stress test — all 4 instances passed.
- [x] **Latest validation (2026-04-17, gpu1, 8 CPU/30GB RAM):** built incrementally with `ninja -j 4`, ran **4 concurrent instances × 15 min** (seeds 101–104). All 4 exited cleanly. **0 FATAL crashes, 0 "Failed to wait" errors** across all logs; 408 expected `MEM_ABORTED` events confirm the arbitrator was actively exercised. Memory peaked at ~5GB / 30GB.

Reviewed By: pratikpugalia

Differential Revision: D100872383

Pulled By: kgpai

fbshipit-source-id: ea56b6563f1f2bfa05d0854fe4cefadbd06a4173
…s flag (facebookincubator#17232)

Summary:
Pull Request resolved: facebookincubator#17232

CONTEXT: When the same RowVector is evaluated across many ExprSet instances (e.g. ~1800 times in some cases), the EvalCtx constructor's per-column isFlatEncoding/mayHaveNulls loop runs redundantly for each new instance of EvalCtx, even though the data is the same — 39% of CPU in production.

WHAT: Add a new EvalCtx constructor overload that accepts a pre-computed inputFlatNoNulls flag, plus a static computeInputFlatNoNulls() helper. This lets callers compute the flag once and reuse it across all EvalCtx instances for the same RowVector. The original constructor now delegates to computeInputFlatNoNulls() to avoid code duplication.
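
The compute-once pattern can be sketched as follows. `ToyColumn` and `ToyEvalCtx` are hypothetical stand-ins, and the flag is assumed to mean "every column is flat and has no nulls"; the real EvalCtx constructor takes more arguments.

```cpp
#include <cassert>
#include <vector>

// Toy column metadata (assumption); real vectors expose this via methods.
struct ToyColumn {
  bool isFlat;
  bool mayHaveNulls;
};

// Scans every column once. This is the loop that is worth hoisting when the
// same input is evaluated by many expression sets.
inline bool computeInputFlatNoNulls(const std::vector<ToyColumn>& columns) {
  for (const auto& column : columns) {
    if (!column.isFlat || column.mayHaveNulls) {
      return false;
    }
  }
  return true;
}

class ToyEvalCtx {
 public:
  // Original style: computes the flag on every construction.
  explicit ToyEvalCtx(const std::vector<ToyColumn>& columns)
      : ToyEvalCtx(columns, computeInputFlatNoNulls(columns)) {}

  // New overload: accepts a flag the caller computed once and reuses.
  ToyEvalCtx(const std::vector<ToyColumn>& /*columns*/, bool inputFlatNoNulls)
      : inputFlatNoNulls_(inputFlatNoNulls) {}

  bool inputFlatNoNulls() const {
    return inputFlatNoNulls_;
  }

 private:
  bool inputFlatNoNulls_;
};
```

A caller that evaluates one input many times would call `computeInputFlatNoNulls` once and pass the result to each new context, turning a per-context column scan into a single scan.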

Reviewed By: Yuhta

Differential Revision: D101418306

fbshipit-source-id: 68e78b5d47d1b9e36b5c3532dffaa3a5206f1a05
…r async RPC batch mode (facebookincubator#17227)

Summary:
Pull Request resolved: facebookincubator#17227

Batch dispatch chunking and AIMD congestion control for async RPC operators in batch mode.

Problem: When using batch mode, dispatch_batch_size was not actually splitting rows — all rows were sent as a single RPC call, potentially exceeding the server's concurrent request limit. Additionally, there was no backpressure mechanism to throttle dispatch when the server was overloaded.

Changes:
1. Batch dispatch chunking: flushBatch(maxRows) drains only maxRows from pending rows instead of all. RPCOperator loops flushBatchRequests(dispatchBatchSize_) to flush in chunks. AsyncRPCFunction.h updated with maxRows parameter.

2. Backpressure check in addInput: while loop checks isUnderBackpressure() between flushes to prevent overshooting maxPendingBatches.

3. AIMD congestion control: RPCState tracks effectiveMaxPendingBatches_ (starts at maxPendingBatches_=2). On success: +1 (additive increase). On error (>50% real errors or _rpc_retried signal): /2 (multiplicative decrease, floor 1). Null input responses are excluded from error counting. Suppresses redundant "decreased from 1 to 1" log messages.
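
The chunking and AIMD rules above can be sketched in isolation. This is a simplified model with hypothetical names (`flushBatch` here is a free function over a vector of ints, and `AimdWindow` is not the real `RPCState`). It applies no upper cap on the window because none is stated above, which may differ from the real operator.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Removes and returns at most maxRows rows from the front of pending,
// leaving the remainder for the next flush.
inline std::vector<int> flushBatch(std::vector<int>& pending, size_t maxRows) {
  const auto count =
      static_cast<std::ptrdiff_t>(std::min(maxRows, pending.size()));
  std::vector<int> chunk(pending.begin(), pending.begin() + count);
  pending.erase(pending.begin(), pending.begin() + count);
  return chunk;
}

// Additive-increase / multiplicative-decrease limit on in-flight batches.
class AimdWindow {
 public:
  explicit AimdWindow(size_t initial)
      : limit_(std::max<size_t>(initial, 1)) {}

  // Additive increase: one more batch may be in flight.
  void onSuccess() {
    ++limit_;
  }

  // Multiplicative decrease: halve the limit, never dropping below 1.
  void onCongestion() {
    limit_ = std::max<size_t>(limit_ / 2, 1);
  }

  size_t limit() const {
    return limit_;
  }

 private:
  size_t limit_;
};
```

A dispatch loop would keep calling `flushBatch` with the configured chunk size while the number of in-flight batches stays below `limit()`, and stop as soon as backpressure is detected.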

Reviewed By: Yuhta

Differential Revision: D101062260

fbshipit-source-id: cc0f809cabdbf8c9b0abe53305cabaa10ba3b645
…acebookincubator#17221)

Summary:
When the build side of a hash join has no data (e.g. anti-join with
empty result), inputs_ is empty. In a debug build, the logging in
`noMoreInput()` accesses `inputs_[0]` without bounds checking,
which can result in a `SIGSEGV`.

This fixes Q16 (TPC-H) which uses NOT IN (anti-join) where the build
side can be empty for some partitions.

Pull Request resolved: facebookincubator#17221

Reviewed By: kagamiori

Differential Revision: D101671299

Pulled By: peterenescu

fbshipit-source-id: 0971d85671e67810789d8148ce3891e757faf4f7
…r#16037)

Summary:
Replaces previous PR facebookincubator#15014, which erroneously also included changes in some `experimental/cudf` files.

This experimental feature is targeted at a multi-node setup of Prestissimo when running with the Cudf extension. It provides an exchange mechanism to transfer CudfVectors between GPU device memory without the need to first copy data to host memory. This mechanism is in addition to the HTTP-based exchange used in Prestissimo. The main goal is to exploit fast hardware interconnects that are available between GPUs.

The cudf exchange builds on top of the existing experimental Cudf extension. The approach follows the same design as the HTTP based exchange. Its components operate on CudfVectors instead of SerializedPages and include:

* CudfOutputQueueManager and CudfQueues instead of the OutputBufferManager and OutputBuffers
* CudfExchangeClient and CudfExchangeSource instead of ExchangeClient and ExchangeSource/PrestoExchangeSource
* CudfExchange and CudfExchangeQueue instead of Exchange and ExchangeQueue
* CudfExchangeServer instead of Prestissimo's HTTP resource

The code in this PR does not contain the necessary hooks needed to initialize Cudf exchange.

Pull Request resolved: facebookincubator#16037

Reviewed By: pratikpugalia

Differential Revision: D101650220

Pulled By: peterenescu

fbshipit-source-id: 207b7964d0039f1885055ee4dae302a200480159
Summary:
This is Part 2 of 3 of the GPU Decimal implementation. It adds decimal functions and some other tidying-up.

More explanation and annotation to follow.

Pull Request resolved: facebookincubator#16750

Reviewed By: kgpai

Differential Revision: D101837619

Pulled By: kagamiori

fbshipit-source-id: 4e7971aba5f3f5668d19bb2ff53571aa28b74d72
…#17283)

Summary:
Add a benchmark so that the cuDF-to-Velox conversion can be optimized, as in facebookincubator#16760 and facebookincubator#16859.

Pull Request resolved: facebookincubator#17283

Reviewed By: kgpai

Differential Revision: D101865735

Pulled By: kagamiori

fbshipit-source-id: 98f7bfa4b73ef8190793c570c3554d953f07394b
Summary:
- register `startswith` in the cuDF function evaluator for Spark string expressions
- support `startswith(column, constant)`, `startswith(column, column)`, and `startswith(constant, column)` on the GPU path
- preserve Spark null semantics for column-pattern evaluation and keep `startswith(constant, constant)` off the GPU path until row-count support exists

Pull Request resolved: facebookincubator#17205

Test Plan:
- [x] `velox_cudf_expression_selection_test --gtest_filter='*Startswith*'`
- [x] `velox_cudf_spark_filter_project_test --gtest_filter='*startswith*'`
- [x] `velox_cudf_hash_join_test --gtest_filter='*innerJoinWithMixedFilterPrecomputation*:*leftJoinWithStringFunctionFilter*'`
- [x] `velox_cudf_table_scan_test --gtest_filter='TableScanTest.filterPushdown:TableScanTest.remainingFilterExtraction'`

Reviewed By: kgpai

Differential Revision: D101898882

Pulled By: kagamiori

fbshipit-source-id: 606efdb38563fe5ac0043b83a4d822e41ac2cb53
…stoSerializerEstimationUtils (facebookincubator#17288)

Summary:
Pull Request resolved: facebookincubator#17288

Replace all usages of `folly::grow_capacity_by` with `std::vector::reserve` in `PrestoSerializerEstimationUtils.cpp`. Since the vectors start empty in both `estimateWrapperSerializedSize` and `expandRepeatedRanges`, `reserve` is functionally identical and simpler. This removes the dependency on `folly/container/Reserve.h` from this file, supporting Velox's ongoing effort to reduce Folly dependencies.

Addresses post-land review comment on D98150572.

Reviewed By: mbasmanova

Differential Revision: D101866359

fbshipit-source-id: 0651baca0567af00a1ed009b745ad37868a7b07e
Summary:
Pull Request resolved: facebookincubator#17237

Operator stats have gaps and are not suitable for determining where time went in a task.
Add driver timing stats per pipeline to help identify a query's bottlenecks.

The current version does not handle grouped execution well.
Support for grouped execution is in progress.

Reviewed By: bikramSingh91

Differential Revision: D101395498

fbshipit-source-id: 1bd6e5311844f851109fd694afe37d101efe90a7
…p shuffle compression on small pages (facebookincubator#17306)

Summary:
Pull Request resolved: facebookincubator#17306

The minimum compression ratio alone is not expressive enough to avoid wasteful compression. For production workloads, analysis revealed that the page size distribution skews very small (P50 < 1K), so even best-case compression is unlikely to reduce the number of packets transmitted over the network, yet it adds serialization overhead and CPU consumption. As a result, exchange compression tends to degrade latency even in a network-bound environment.

However, if we disable compression up to some multiple of the network MTU, we can limit compression to cases where it is likely to be beneficial. There are two reasons this may benefit execution latency:

1. queries exchanging lots of data directly benefit from a reduced network transfer bottleneck
2. the noisy neighbor effect from exchange-heavy queries on well-behaved queries is reduced, since they will chew up less network bandwidth

This new option to disable page compression when the page size is below some threshold is propagated similarly to the minimum compression parameter. The new size-based skip uses arena size at flush time and is independent of the existing probe-based heuristic in PrestoIterativeVectorSerializer; it survives the PartitionedOutput flow that recreates the serializer between flushes (which resets the probe counter).
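
The size-based skip reduces to a threshold comparison at flush time. The sketch below uses hypothetical names, and deriving the threshold as a multiple of the MTU is an assumption taken from the motivation above rather than from the actual option.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: expresses the threshold as a multiple of the MTU.
inline size_t thresholdFromMtu(size_t mtuBytes, size_t multiple) {
  return mtuBytes * multiple;
}

// Compress only pages that are at least as large as the threshold; smaller
// pages are sent uncompressed because compression is unlikely to save packets.
inline bool shouldCompress(size_t pageBytes, size_t minCompressionPageBytes) {
  return pageBytes >= minCompressionPageBytes;
}
```

With a threshold of 0 every page remains eligible for compression, so the existing ratio-based heuristic is the only filter.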

Reviewed By: Yuhta

Differential Revision: D100420559

fbshipit-source-id: 0f0fd55c4ef73e065be49eb90fbcff7b177802ae
…17239)

Summary:
Rename EnumsDeclare.h to EnumDeclare.h and Enums.h to EnumDefine.h for symmetric naming that makes clear these are a matched pair — EnumDeclare.h in every .h, EnumDefine.h in the corresponding .cpp. The old naming (Enums.h / EnumsDeclare.h) made Enums.h look like the main header and EnumsDeclare.h look auxiliary, tempting people to just include Enums.h.

The declare macros (VELOX_DECLARE_ENUM_NAME, VELOX_DECLARE_EMBEDDED_ENUM_NAME) live in the lightweight EnumDeclare.h that only needs <iosfwd>, <optional>, <string_view>. The define macros stay in EnumDefine.h with their heavy transitive includes (folly/container/F14Map.h and velox/common/base/Exceptions.h, together ~8M preprocessed). The two headers are independent.

All velox headers include EnumDeclare.h for the DECLARE macros. The corresponding .cpp files include EnumDefine.h for the DEFINE macros. BUCK targets updated accordingly: enums_declare is an exported dep for headers, enums is a private dep for .cpp files.

ConfigProperty.h includes EnumDeclare.h instead of Enums.h, cutting ~8M of preprocessed includes from every consumer.

Pull Request resolved: facebookincubator#17239

Reviewed By: mbasmanova

Differential Revision: D101462826

fbshipit-source-id: c439cba5a8db21d830dce61d6901164bfa48fa51
…ubator#17198)

Summary:
Pull Request resolved: facebookincubator#17198

Add getExtractionSizeValues() and getExtractionValues() to map/list
readers.  Add kField handling in struct reader (direct child getValues
with lazy loading support).  Add TransformColumnLoader for mixed
multi-extraction lazy path.  Add text reader transform support in
RowReader::projectColumns().  Reader checks deltaUpdate() to bypass
extraction when delta updates are active.

Reviewed By: maniloya

Differential Revision: D97671736

fbshipit-source-id: c1610538656ddc3af515dc7cc9a8c7c0610e3ccb
Summary:
Introduce Axiom, a C++ library for building composable query engines
on top of Velox. The post covers the motivation (fragmented landscape
of monolithic engines), architecture (pluggable frontends, optimizer,
runtime, connectors), current status (TPC-H, production workloads,
CLI), and a call for community involvement.

Pull Request resolved: facebookincubator#17316

Reviewed By: kKPulla

Differential Revision: D102156907

Pulled By: mbasmanova

fbshipit-source-id: 5bcc1cb2c1a46da8d0631c85edc931a393b003bc
…small batches (facebookincubator#17203)

Summary:
Pull Request resolved: facebookincubator#17203

For small batches, peeling dictionary encoding from inputs before calling vector functions can create more overhead than it saves. This diff adds a configurable `expression.min_rows_for_peeling` threshold (default: 0) that suppresses peeling when the number of selected rows (to process) falls below it.

When peeling is suppressed, some VectorFunctions still expect flat or constant inputs. To preserve this guarantee:
- A single constant-encoded input continues to be peeled. This is cheap, and it is required because some UDFs rely on this expectation; running the fuzzer exposed this gap.
- A single dictionary-encoded input (possibly alongside constant inputs, e.g. in-predicate) is flattened before evaluation.
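
One way to read the rules above is as a small decision function. The sketch below is an interpretation, not the Velox logic: the enum and function names are hypothetical, and the "evaluate as is" fallback for every other combination is an assumption of this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

enum class Encoding { kFlat, kConstant, kDictionary };
enum class Action { kPeel, kFlattenDictionary, kEvaluateAsIs };

// Decides how to prepare inputs for a vector function given the number of
// selected rows and the configured threshold.
inline Action chooseAction(
    const std::vector<Encoding>& inputs,
    size_t selectedRows,
    size_t minRowsForPeeling) {
  // At or above the threshold, peeling proceeds as before. A threshold of 0
  // (the default) therefore never suppresses peeling.
  if (selectedRows >= minRowsForPeeling) {
    return Action::kPeel;
  }
  // Peeling is suppressed; keep the flat-or-constant guarantee.
  if (inputs.size() == 1 && inputs[0] == Encoding::kConstant) {
    return Action::kPeel;
  }
  const auto dictionaries =
      std::count(inputs.begin(), inputs.end(), Encoding::kDictionary);
  const auto flats = std::count(inputs.begin(), inputs.end(), Encoding::kFlat);
  if (dictionaries == 1 && flats == 0) {
    // One dictionary input, possibly alongside constants: flatten it.
    return Action::kFlattenDictionary;
  }
  return Action::kEvaluateAsIs;
}
```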

Added fuzzer coverage:
Extended the expression fuzzer to exercise partial selectivity by randomly deselecting rows, and set the peeling threshold to 25 (1/4 of the default batch size). Together, these changes exercise the functionality exposed via the new expression.min_rows_for_peeling config.

Additional fixes to fuzzer:
- Fixed retryWithTry to avoid evaluating uninitialized rows.
- Fixed a bug in ExpressionRunner where the query config diverged from the one used in fuzzer tests.
- Fixed inputs being double-wrapped in lazy vectors when using verify mode in ExpressionRunner.
- Fixed ExpressionRunner failing to load a repro when no selectivity vector files were provided.

Unit tests:
Extended existing tests in ExprTest to exercise with peeling enabled and disabled.

Additional Context:
Single-arg VectorFunctions are expected to receive only flat or constant inputs. Fuzz testing revealed latent bugs in several UDFs: some had constant-input optimizations that mishandled errors under TRY (e.g. reporting the error only for the representative row instead of all selected rows), and others assumed constant-encoded complex types would always be peeled, so they asserted flat encoding unconditionally. These bugs were masked because peeling always ran for constant-encoded inputs. To keep this change low risk, the constant-peeling guarantee is preserved explicitly in the disabled-peeling path for single-arg functions.

Reviewed By: Yuhta

Differential Revision: D101074993

fbshipit-source-id: 2aec51f90f063ed0ad2e114a50211f19baa474db
zacw7 and others added 21 commits May 12, 2026 16:20
…tor#17485)

Summary:
Pull Request resolved: facebookincubator#17485

Accept kPartitionKey column handles in HiveIndexSource init() (previously crashed on any non-regular handle). Track partition column handles in partitionKeyHandles_, include partition columns in readerOutputType_, and synthesize partition column values from split metadata via scan-spec setConstantValue().

Key changes:
- init(): Accept kPartitionKey handles alongside kRegular. Populate partitionKeyHandles_. Skip subfield/postProcessor checks for partition columns.
- setPartitionValues(): New method that sets partition constants on scanSpec_ children from the first split's partition keys. Iterates scan spec children (same pattern as FileSplitReader::adaptColumns) and delegates to setPartitionValue() for each partition column. Validates all splits share the same partition values.
- setPartitionValue(): Extracted helper matching FileSplitReader::setPartitionValue signature.
- addSplits(): Calls setPartitionValues() before creating readers.
- Pass partitionKeyHandles_ to makeScanSpec() so partition column children are included in the spec.

Reviewed By: xiaoxmeng

Differential Revision: D104739908

fbshipit-source-id: cb96fc3d9005df3f0cb91d04a002ca7918e9ace9
…bator#17493)

Summary:
Pull Request resolved: facebookincubator#17493

Adds support for writing Iceberg tables in NIMBLE format via a new
`NimbleWriterOptionsAdapter` mirroring the existing
`DwrfWriterOptionsAdapter` in the anonymous namespace of
`WriterOptionsAdapter.cpp`, plus a `case dwio::common::FileFormat::NIMBLE`
arm in `createWriterOptionsAdapter()`.

## Why

Before this diff, `isSupportedFileFormat(NIMBLE)` returned `false` because
`createWriterOptionsAdapter()` only handled `PARQUET` and `DWRF`. Any
INSERT into a NIMBLE-formatted Iceberg table failed at the
`IcebergDataSink` ctor with:

```
isSupportedFileFormat(tableStorageFormat)
Unsupported file format for writing Iceberg tables: nimble
```

NIMBLE is otherwise fully supported in Velox's dwio layer (reader,
writer, vector serde) -- only the Iceberg connector's
`WriterOptionsAdapter` dispatch was missing.

## Manifest format string choice

The new adapter reports `manifestFormatString() == "ORC"`, matching the
convention already used by `DwrfWriterOptionsAdapter`. Iceberg's manifest
file-format vocabulary has no NIMBLE enum, and the cross-engine
convention established by Java
`com.facebook.presto.iceberg.FileFormat.NIMBLE.toIceberg()` (in
presto-facebook-iceberg) reports NIMBLE as `"ORC"` so downstream
consumers (coordinator, catalog, snapshot tooling) can interpret the
commit message without a NIMBLE-aware enum extension.

The actual on-disk format is identified at read time via the file
extension and the Nimble magic bytes (`0xa1fa` little-endian footer),
not via the manifest string. Writing `"ORC"` here is therefore safe and
preserves cross-engine round-trip compatibility with Java planners.
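
The format-string convention can be sketched as a plain mapping. The enum and function below are hypothetical simplifications of the adapter classes; the `"ORC"` values for DWRF and NIMBLE come from the description above, while `"PARQUET"` for Parquet is an assumption of this sketch.

```cpp
#include <cassert>
#include <string>

// Simplified stand-in for the writer file formats (assumption).
enum class ToyFileFormat { kParquet, kDwrf, kNimble };

// Returns the string recorded in the Iceberg manifest for a writer format.
inline std::string manifestFormatString(ToyFileFormat format) {
  switch (format) {
    case ToyFileFormat::kParquet:
      return "PARQUET"; // assumed value for this sketch
    case ToyFileFormat::kDwrf:
    case ToyFileFormat::kNimble:
      // Neither format has its own manifest enum, so both report ORC.
      return "ORC";
  }
  return "UNKNOWN";
}
```

Because readers identify the on-disk format from the file itself, the manifest string only needs to be a value that every engine can parse.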

## What this does NOT do

- Does not register a `NimbleWriterFactory` with the dwio writer
  registry -- that registration happens at the Prestissimo server
  bootstrap (covered in the stacked `presto_cpp` diff).
- Does not change DWRF or Parquet adapter behavior.
- Does not touch the read path.

Reviewed By: srsuryadev

Differential Revision: D104838011

fbshipit-source-id: 2afee19569eea267c7104cc5831a0486d1c53ecd
… left semi project support (facebookincubator#17113)

Summary:
- Add `CudfNestedLoopJoinBuild`, `CudfNestedLoopJoinProbe`, and `CudfNestedLoopJoinBridge` GPU operators that accelerate nested loop joins using libcudf APIs
- Support inner, left, right, full outer, and left semi project join types with optional filter conditions
- Register `NestedLoopJoinBuildAdapter` and `NestedLoopJoinProbeAdapter` in `OperatorAdapters.cpp`
- Fix pre-existing bug in `CudfFromVelox::getOutput()` that returned a 0-row `CudfVector` instead of `nullptr`

Closes facebookincubator#17112
Part of facebookincubator#15772
Supersedes facebookincubator#16942

### Design

**Two-path approach** for optimal performance:
- **No filter (cross join)**: uses `cudf::cross_join(probe, build)` for full cartesian product
- **With filter (conditional join)**: uses `cudf::conditional_inner_join(probe, build, ast)` to evaluate the filter on GPU, returning only matching row index pairs, then gathers actual data using indices

**Batched mismatch tracking** for outer joins: since build data is processed in batches, per-batch left/right join APIs cannot be used directly (a row unmatched in one batch may match a later batch). Instead, `conditional_inner_join` is always used per-batch and GPU-side BOOL8 flag columns track which rows were matched:
- Left join: `probeMatchedFlags_` tracks per-probe-batch mismatches; after all build batches, unmatched probe rows are emitted with null build columns
- Right join: `buildMatchedFlags_` tracks cross-probe mismatches; after all probes finish, the last driver merges flags from all peers and emits unmatched build rows

**Left semi project**: uses `cudf::conditional_left_semi_join` to find matching probe indices, then builds a BOOL8 match column via `cudf::contains`.
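
The batched mismatch tracking for left joins can be illustrated on the CPU. This is a minimal sketch using plain `std::vector` data in place of cuDF tables; `batchedLeftJoin`, `JoinedRow`, and the local `probeMatched` vector are hypothetical stand-ins for the operator and its `probeMatchedFlags_` column, not the PR's code.

```cpp
#include <cstddef>
#include <functional>
#include <optional>
#include <utility>
#include <vector>

// One output row: probe value plus build value (nullopt models a null build side).
using JoinedRow = std::pair<int, std::optional<int>>;

// Left join of one probe batch against a build side that arrives in batches.
// A per-probe-row flag records whether any build batch produced a match, so
// unmatched rows are emitted only after the last build batch has been seen.
std::vector<JoinedRow> batchedLeftJoin(
    const std::vector<int>& probe,
    const std::vector<std::vector<int>>& buildBatches,
    const std::function<bool(int, int)>& filter) {
  std::vector<JoinedRow> out;
  std::vector<bool> probeMatched(probe.size(), false);

  for (const auto& build : buildBatches) {
    // Per-batch inner join: emit matching pairs and mark the probe row.
    for (std::size_t p = 0; p < probe.size(); ++p) {
      for (int b : build) {
        if (filter(probe[p], b)) {
          out.emplace_back(probe[p], b);
          probeMatched[p] = true;
        }
      }
    }
  }

  // All build batches processed: emit unmatched probe rows with a null build side.
  for (std::size_t p = 0; p < probe.size(); ++p) {
    if (!probeMatched[p]) {
      out.emplace_back(probe[p], std::nullopt);
    }
  }
  return out;
}
```

With probe `{1, 5, 9}`, build batches `{2}` and `{6}`, and filter `probe > build`, probe row `5` matches only the first batch. A per-batch left join would wrongly emit it with nulls for the second batch; the flag vector avoids that.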

### Known limitation

Zero-column build side is not yet supported — `cudf::table` with zero columns reports `num_rows() == 0`, causing the operator to treat a non-empty build as empty. Tracked as a TODO.

Pull Request resolved: facebookincubator#17113

Test Plan:
- [x] 53 GPU unit tests pass (`velox_cudf_nested_loop_join_test`)
  - All 5 join types: inner, left, right, full, left semi project
  - With and without filter conditions
  - Empty build/probe per join type
  - Multi-batch build and probe
  - Multi-driver execution (2 drivers)
  - NULL handling in data and filter conditions
  - Output column reordering
  - Large cross join (100×50 = 5000 rows)

 - [x] Run all TPC-DS queries that contain NLJ on g7e.4xlarge (NVIDIA RTX PRO 6000 Blackwell Server Edition, 16 vCPUs):
  All queries run except two (Q61 and Q90 at SF100), which fail for an unrelated reason (marked "Decimal N/S" in the table).

  | Query | NLJ nodes | GPU cold | CPU cold | GPU warm ± std | CPU warm ± std | GPU vs CPU warm time (negative = GPU faster) | t-stat  | 95% CI of difference (s) | Significant? |
  |-------|-----------|----------|----------|----------------|----------------|---------|---------|------------------|--------------|
  | Q9    | 15        | 26.90s   | 19.96s   | 22.91 ± 8.40s  | 16.16 ± 7.54s  | +41.7%  | +1.337  | [-3.35, +16.84]  | NO (n=5)     |
  | Q14   | 3         | 58.94s   | 11.99s   | 12.27 ± 0.16s  | 12.36 ± 0.39s  | -0.7%   | -0.451  | [-0.47, +0.30]   | NO (n=5)     |
  | Q23   | 2         | 32.29s   | 24.93s   | 23.12 ± 0.15s  | 24.89 ± 0.29s  | **-7.1%** | -12.126 | [-2.06, -1.48] | YES (n=5)|
  | Q24   | 1         | -        | -        | 5.48 ± 0.36s   | 5.47 ± 0.23s   | +0.1%   | +0.031  | [-0.35, +0.36]   | NO (n=5/8)   |
  | Q28   | 5         | 13.17s   | 13.52s   | 13.17 ± 0.08s  | 13.36 ± 0.08s  | **-1.5%** | -3.883 | [-0.29, -0.09]  | YES (n=5)|
  | Q44   | 2         | 13.77s   | 7.27s    | 7.24 ± 0.17s   | 7.20 ± 0.12s   | +0.5%   | +0.412  | [-0.15, +0.22]   | NO (n=5)     |
  | Q54   | 2         | 15.58s   | 4.49s    | 4.15 ± 0.08s   | 4.07 ± 0.02s   | +1.9%   | +2.056  | [+0.00, +0.15]   | YES (n=5)|
  | Q61   | -         | -        | -        | FAIL           | FAIL           | -       | -       | -                | Decimal N/S  |
  | Q77   | 1         | 28.98s   | 7.06s    | 6.94 ± 0.25s   | 7.00 ± 0.13s   | -0.8%   | -0.452  | [-0.30, +0.19]   | NO (n=5)     |
  | Q88   | 7         | 8.02s    | 7.37s    | 7.45 ± 0.22s   | 7.37 ± 0.15s   | +1.1%   | +0.676  | [-0.16, +0.32]   | NO (n=5)     |
  | Q90   | -         | -        | -        | FAIL           | FAIL           | -       | -       | -                | Decimal N/S  |
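
The table does not state how the t-statistic and confidence interval were computed. As a rough cross-check, the sketch below assumes an unpaired two-sample (Welch) statistic and a normal critical value of 1.96; both are assumptions, and `WarmStats`, `Comparison`, and `compare` are hypothetical names. Under those assumptions the Q23 row is reproduced to within rounding of the printed means and deviations.

```cpp
#include <cmath>

struct WarmStats {
  double mean;    // mean warm time, seconds
  double stddev;  // sample standard deviation, seconds
  int n;          // number of warm runs
};

struct Comparison {
  double diff;    // GPU mean minus CPU mean, seconds (negative = GPU faster)
  double tStat;   // diff divided by its standard error
  double ciLow;   // lower bound of the interval on diff
  double ciHigh;  // upper bound of the interval on diff
};

// Assumed method: Welch standard error, normal critical value 1.96.
Comparison compare(const WarmStats& gpu, const WarmStats& cpu) {
  const double diff = gpu.mean - cpu.mean;
  const double se = std::sqrt(
      gpu.stddev * gpu.stddev / gpu.n + cpu.stddev * cpu.stddev / cpu.n);
  const double z = 1.96;
  return {diff, diff / se, diff - z * se, diff + z * se};
}
```

For Q23, `compare({23.12, 0.15, 5}, {24.89, 0.29, 5})` gives a difference of -1.77 s with an interval that excludes zero, which agrees with the "YES" in the last column.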

  ## Synthetic NLJ Benchmark — SF100 Results

  | Q# | Description | Probe | Build | Join Condition | Output | Time | Status |
  |----|-------------|-------|-------|----------------|--------|------|--------|
  | 1 | store_sales × item (range join) | 288M | 204K | `ss_list_price BETWEEN (i_current_price - 1.0) AND (i_current_price + 1.0)` | — | — | Host OOM |
  | 2 | store_sales × date_dim (inequality) | 288M | 365 | `ss_sold_date_sk > d_date_sk` (build filtered: `d_year = 2000`) | — | — | Host OOM |
  | 3 | catalog_sales × item (multi-condition) | 144M | 204K | `cs_list_price > i_current_price AND cs_wholesale_cost < i_wholesale_cost` | — | — | Host OOM |
  | 4 | store × item (baseline) | 402 | 204K | `i_current_price > 50.0` | 4.6M rows | 324ms | Pass |
  | 5 | customer × customer_address | 2M | 50K | `c_current_addr_sk > ca_address_sk` (probe filtered: `c_birth_year > 1970`) | — | — | GPU OOM |
  | 6 | web_sales × store_sales (filtered) | 72M | ~30K | `ws_ext_sales_price > ss_ext_sales_price` (build filtered: `ss_store_sk = 1`) | — | — | GPU OOM |
  | 7 | store × promotion (date range) | 402 | 1K | `p_start_date_sk BETWEEN (s_closed_date_sk - 100) AND (s_closed_date_sk + 100)` | 6.7K rows | 186ms | Pass |
  | 8 | date_dim × store (inequality) | 366 | 402 | `d_date_sk > s_store_sk` (probe filtered: `d_year = 2000`) | 147K rows | 150ms | Pass |
  | 9 | web_page × catalog_page (multi-AND) | 2K | 20K | `wp_web_page_sk > cp_catalog_page_sk AND wp_char_count > cp_catalog_page_number` | 2.0M rows | 166ms | Pass |
  | 10 | item × household_demo (BETWEEN+AND) | 20K | 7.2K | `i_current_price BETWEEN 10 AND 50 AND hd_dep_count > 0` (probe filtered: `i_category_id = 1`) | 6.2M rows | 309ms | Pass |

  ### Query Definitions

  ```sql
  -- Q1: Range join (Host OOM on SF100)
  SELECT ss_item_sk, ss_list_price, ss_sales_price, i_item_sk, i_current_price
  FROM store_sales INNER JOIN item
    ON ss_list_price BETWEEN (i_current_price - 1.0) AND (i_current_price + 1.0)

  -- Q2: Inequality + filtered build (Host OOM on SF100)
  SELECT ss_sold_date_sk, ss_ext_sales_price, d_date_sk, d_year
  FROM store_sales INNER JOIN date_dim
    ON ss_sold_date_sk > d_date_sk
  WHERE d_year = 2000

  -- Q3: Multi-condition (Host OOM on SF100)
  SELECT cs_item_sk, cs_list_price, cs_wholesale_cost, i_item_sk, i_current_price, i_wholesale_cost
  FROM catalog_sales INNER JOIN item
    ON cs_list_price > i_current_price AND cs_wholesale_cost < i_wholesale_cost

  -- Q4: Small baseline (Pass)
  SELECT s_store_sk, s_store_name, i_item_sk, i_current_price
  FROM store INNER JOIN item
    ON i_current_price > 50.0

  -- Q5: Medium cross-product (GPU OOM on SF100)
  SELECT c_customer_sk, c_current_addr_sk, ca_address_sk, ca_state
  FROM customer INNER JOIN customer_address
    ON c_current_addr_sk > ca_address_sk
  WHERE c_birth_year > 1970

  -- Q6: Fact-to-fact theta (GPU OOM on SF100)
  SELECT ws_item_sk, ws_ext_sales_price, ss_item_sk, ss_ext_sales_price
  FROM web_sales INNER JOIN store_sales
    ON ws_ext_sales_price > ss_ext_sales_price
  WHERE ss_store_sk = 1

  -- Q7: Date range overlap (Pass)
  SELECT s_store_sk, s_store_name, p_promo_sk, p_promo_name, p_cost
  FROM store INNER JOIN promotion
    ON p_start_date_sk BETWEEN (s_closed_date_sk - 100) AND (s_closed_date_sk + 100)

  -- Q8: Filtered probe + inequality (Pass)
  SELECT d_date_sk, d_day_name, s_store_sk, s_store_name
  FROM date_dim INNER JOIN store
    ON d_date_sk > s_store_sk
  WHERE d_year = 2000

  -- Q9: Multi-condition AND (Pass)
  SELECT wp_web_page_sk, wp_char_count, cp_catalog_page_sk, cp_catalog_page_number
  FROM web_page INNER JOIN catalog_page
    ON wp_web_page_sk > cp_catalog_page_sk AND wp_char_count > cp_catalog_page_number

  -- Q10: BETWEEN + AND (Pass)
  SELECT i_item_sk, i_current_price, hd_demo_sk, hd_dep_count
  FROM item INNER JOIN household_demographics
    ON i_current_price BETWEEN 10 AND 50 AND hd_dep_count > 0
  WHERE i_category_id = 1
  ```

Reviewed By: kKPulla

Differential Revision: D104904497

Pulled By: mbasmanova

fbshipit-source-id: b848952dbb4461ac21dae81d0d88404a4ac59c62
…or#17494)

Summary:
Pull Request resolved: facebookincubator#17494

Add an optional per-batch scan statistics callback to Velox's TableScan
operator via QueryCtx. The callback fires after each non-empty batch with
the completed row count delta, wall time, and table name.

This enables accurate per-batch scan statistics reporting without polling,
matching the push-model semantics of existing scan callbacks in the
execution stack.

Changes:
- Add ScanBatchEvent struct and ScanBatchCb callback to QueryCtx
- Fire callback in TableScan::getOutput() with completed rows delta
  (from getCompletedRows()), wall time (microseconds from existing
  MicrosecondTimer), and table name (from tableHandle_->name())
- Wire callback adapter in VeloxBatchCursor for HiveConnector path with
  mutex for kParallel thread safety

Follows existing Velox callback patterns (CallbackSink::Consumer,
UpdateAndCheckTraceLimitCB, HiveColumnHandle::postProcessor).
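
The push model described above could look roughly like the following. This is a sketch only: the summary names `ScanBatchEvent` and `ScanBatchCb` but not their fields or signatures, so the field names, `ScanStatsCollector`, and `reportBatch` are assumptions made for illustration.

```cpp
#include <cstdint>
#include <functional>
#include <map>
#include <mutex>
#include <string>

// Assumed event shape: the three values the summary says each event carries.
struct ScanBatchEvent {
  std::string tableName;
  uint64_t completedRowsDelta;
  uint64_t wallTimeMicros;
};

using ScanBatchCb = std::function<void(const ScanBatchEvent&)>;

// Hypothetical consumer: accumulates totals per table. The mutex covers the
// case where several scan threads invoke the callback concurrently.
class ScanStatsCollector {
 public:
  ScanBatchCb callback() {
    return [this](const ScanBatchEvent& event) {
      std::lock_guard<std::mutex> guard(mutex_);
      auto& totals = totals_[event.tableName];
      totals.rows += event.completedRowsDelta;
      totals.wallMicros += event.wallTimeMicros;
    };
  }

  uint64_t rows(const std::string& table) {
    std::lock_guard<std::mutex> guard(mutex_);
    auto it = totals_.find(table);
    return it == totals_.end() ? 0 : it->second.rows;
  }

 private:
  struct Totals {
    uint64_t rows{0};
    uint64_t wallMicros{0};
  };
  std::mutex mutex_;
  std::map<std::string, Totals> totals_;
};

// Stand-in for the operator side: fire only when a callback is set and the
// batch is non-empty.
void reportBatch(
    const ScanBatchCb& cb,
    const std::string& table,
    uint64_t rowsDelta,
    uint64_t wallMicros) {
  if (cb && rowsDelta > 0) {
    cb(ScanBatchEvent{table, rowsDelta, wallMicros});
  }
}
```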

Reviewed By: Yuhta

Differential Revision: D104889992

fbshipit-source-id: 11a8abd33a036477850eb0da7f3e3795dc014db3
…ookincubator#16626)

Summary:
On Ubuntu 22 with GCC 11.4, the build fails with:
/root/velox/velox/exec/fuzzer/SpatialJoinFuzzer.cpp:82:12: error: ‘x’ is used uninitialized [-Werror=uninitialized]
   82 |     double x, y;
      |            ^
/root/velox/velox/exec/fuzzer/SpatialJoinFuzzer.cpp:82:15: error: ‘y’ is used uninitialized [-Werror=uninitialized]
   82 |     double x, y;

Pull Request resolved: facebookincubator#16626

Reviewed By: mbasmanova

Differential Revision: D104691989

Pulled By: kgpai

fbshipit-source-id: a23edaf3a2331a3da5e350a1d6b8efe005d43105
…te copy (facebookincubator#17457)

Summary:
For snappy and zstd compressed Parquet pages, decompress directly into the
target buffer instead of going through the stream-based PagedInputStream path.

## Problem

The current path creates a `SeekableArrayInputStream`, wraps it in a
`PagedInputStream` (which allocates an internal `outputBuffer_`), decompresses
into that buffer, then copies to the final destination via `readFully()`
(`std::copy`). The intermediate buffer allocation and copy add unnecessary
overhead for codecs where the full compressed page is already available in
memory. Profiling on TPC-H SF100 with snappy-compressed Parquet showed
`memmove` (from `std::copy` in `readFully`) as a visible cost in the
decompression path.

## Fix

Call `snappy::RawUncompress` / `ZSTD_decompress` directly into the destination
buffer, eliminating the intermediate allocation and copy. Other codecs (gzip,
lz4, lzo) fall back to the original stream-based path.

Both `Snappy::snappy` and `zstd::zstd` are already linked in the parquet reader
CMakeLists.

Pull Request resolved: facebookincubator#17457

Reviewed By: kKPulla

Differential Revision: D104904218

Pulled By: mbasmanova

fbshipit-source-id: eac16c85b335e2106cbb4a8c1a58bb1277042b2a
…r#17337)

Summary:
Pull Request resolved: facebookincubator#17337

Skip creating child readers for unneeded streams based on ExtractionType
(kKeys skips values, kValues skips keys, kSize skips all children).
Add needsKeyReader()/needsElementReader() helpers with deltaUpdate
awareness.  Add VELOX_CHECK in Parquet reader that ExtractionType is
kNone.  Add DWRF reader-level extraction tests including IO reduction
validation and nested ScanSpec verification.
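
The skip rules can be summarized as two predicates. This is a simplified sketch: the enum values come from the summary, but the free-function signatures are hypothetical and the `deltaUpdate` handling is left out because the summary does not describe it.

```cpp
// Extraction modes named in the summary.
enum class ExtractionType { kNone, kKeys, kValues, kSize };

// The key stream is needed for a full read or a keys-only extraction.
constexpr bool needsKeyReader(ExtractionType type) {
  return type == ExtractionType::kNone || type == ExtractionType::kKeys;
}

// The element (value) stream is needed for a full read or a values-only
// extraction. kSize needs neither child, only the lengths.
constexpr bool needsElementReader(ExtractionType type) {
  return type == ExtractionType::kNone || type == ExtractionType::kValues;
}

static_assert(!needsKeyReader(ExtractionType::kSize));
static_assert(!needsElementReader(ExtractionType::kSize));
```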

Reviewed By: maniloya

Differential Revision: D97671745

fbshipit-source-id: 661dec9b00b1f2dc1b2035f41b992ca9a83849ec
Summary:
Avoid allocating a contiguous temporary buffer in ABFS preadv.

Read the requested Azure range once and stream bytes directly into the
caller-provided buffers. For null gap ranges, consume the skipped bytes
with a reusable thread-local discard buffer so later buffers stay aligned
without changing the number of remote reads.

Add a fake Azure client unit test to verify preadv issues a single
download for buffers with gaps and fills the non-null ranges correctly.

Pull Request resolved: facebookincubator#17370

Reviewed By: amitkdutta

Differential Revision: D105014405

Pulled By: mbasmanova

fbshipit-source-id: b7cbf6af5c173025ade02477f4b36c0e86b1e722
…16240)

Summary:
`insertTableHandle_->writerOptions()` returns a shared `WriterOptions` object. Previously, `memoryPool` was only set when null, so after writer 0 initialized it, later writers reused writer 0's pool. As a result, all writers' memory gets attributed to writer 0's pool while other writers' pools show no usage, making it impossible to identify which writer is consuming memory.

Fix: set `memoryPool` unconditionally, matching the existing pattern for `nonReclaimableSection`.
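
The effect of the conditional assignment can be reproduced with a toy model. All types here are simplified stand-ins (not Velox classes), and the `onlyIfNull` flag exists only to contrast the old and new behavior.

```cpp
#include <memory>
#include <string>

struct MemoryPool {
  std::string name;
};

// Shared by all writers, like the object returned by writerOptions().
struct WriterOptions {
  std::shared_ptr<MemoryPool> memoryPool;
};

struct Writer {
  std::shared_ptr<MemoryPool> pool;
};

// onlyIfNull == true models the old behavior; false models the fix.
Writer makeWriter(WriterOptions& shared, int index, bool onlyIfNull) {
  auto writerPool =
      std::make_shared<MemoryPool>(MemoryPool{"writer" + std::to_string(index)});
  if (!onlyIfNull || shared.memoryPool == nullptr) {
    shared.memoryPool = writerPool;
  }
  // The writer is built from the shared options, so it uses whatever pool
  // they hold at this point.
  return Writer{shared.memoryPool};
}
```

With `onlyIfNull == true`, the second writer ends up with the pool named `writer0`; with `false`, each writer gets its own pool.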

Pull Request resolved: facebookincubator#16240

Reviewed By: jagill

Differential Revision: D105008445

Pulled By: mbasmanova

fbshipit-source-id: 2987e8239325fcff6b25e7c159682c0f5a88a11b
…ookincubator#17505)

Summary:
Pull Request resolved: facebookincubator#17505

## Context
Velox PRs fail Netlify deploy preview builds (e.g.
https://app.netlify.com/projects/meta-velox/deploys/69fd4318582cb600086ecbd9)
because `fbcode/velox/public_tld/website/yarn.lock` contains URLs pointing
at `registry.facebook.net`. That host is Meta's internal Metaccio proxy
and is not reachable from external CI runners (Netlify, GitHub Actions,
external contributors), so `yarn install` fails before the docs site can
build.

## Motivation
D104280800 added bidirectional yarn lockfile URL rewriting hooks. Projects
opt in by dropping a `.rewrite-lockfile.fb` marker file in the directory
where `yarn install` runs. Once opted in, yarn:

- WRITE: rewrites 3P resolved URLs from registry.facebook.net to
  registry.yarnpkg.com before saving yarn.lock (lockfile on disk is
  portable to OSS).
- READ: rewrites them back to registry.facebook.net in memory so internal
  fetches still go through Metaccio.

1P scoped packages (rootfoo/*, nest/*, etc.) are never rewritten. The
Velox website only depends on 3P packages so this cleanly applies.

## This diff
- Adds `fbcode/velox/public_tld/website/.rewrite-lockfile.fb` opt-in
  marker (path-mapped to `website/.rewrite-lockfile.fb` in the
  facebookincubator/velox GitHub mirror).
- Re-runs `yarn install` from that directory, which triggers the
  experimentalLockfileWriteHook and replaces 1175 `registry.facebook.net`
  URLs with `registry.yarnpkg.com` URLs in `yarn.lock`. The reformatted
  layout (unquoted field names, sorted keys) is yarn 1.22.21's standard
  re-serialization output.

After ShipIt syncs to GitHub, Netlify (and any external `yarn install`)
will resolve packages from the public yarn registry and CI will pass.
Internal devs continue to fetch via Metaccio thanks to the read hook.

Reviewed By: pratikpugalia

Differential Revision: D105031443

fbshipit-source-id: 09135f7ee605c4d294a1e0df097b4cd814348dab
…7339)

Summary:
Pull Request resolved: facebookincubator#17339

Add comprehensive DWRF and Nimble end-to-end extraction tests through
HiveDataSource table scan pipeline.  Covers MapKeys, MapValues, Size,
MapKeyFilter, StructField, ArraySize, nested chains, multiple
extractions, multi-format splits (DWRF+TEXT+DWRF), and IO reduction
validation.

Reviewed By: apurva-meta

Differential Revision: D97671762

fbshipit-source-id: c3c08b61cb8b8cd8c92d798d673ce2ee1eabdb25
…ubator#17486)

Summary:
Pull Request resolved: facebookincubator#17486

Add non-index join condition support to HiveIndexSource: join conditions on columns that are neither index columns nor partition columns (e.g., bucket columns in colocated joins) are now applied as post-read equality filters.

Key changes:
- Rename initIndexLookupConditions() to initConditions(). Extend it to categorize all join conditions in a single pass: index conditions are pushed to indexLookupConditions_, and non-index equality conditions are validated, resolved to column indices, and stored in nonIndexConditions_ for post-read filtering. Non-index condition columns are added to the reader output type if not already projected.
- Add NonIndexCondition struct and applyNonIndexConditions() which compares reader output column values against probe-side values using SQL null semantics (either null means not equal).
- Add applyNonIndexCondition() wrapper in HiveLookupIterator, called before evaluateRemainingFilter() in getOutput().
- Store original non-index conditions as nonIndexLookupConditions_ for inspection.
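
The null semantics of the post-read equality filter can be sketched with `std::optional` standing in for nullable column values. `Column`, `sqlEquals`, and `rowsPassingCondition` are hypothetical names; the real code works on Velox vectors.

```cpp
#include <algorithm>
#include <cstddef>
#include <optional>
#include <vector>

using Column = std::vector<std::optional<int>>;

// SQL equality for a join condition: a null on either side never matches.
// Note that std::optional's own operator== would treat two nullopts as equal.
bool sqlEquals(const std::optional<int>& a, const std::optional<int>& b) {
  return a.has_value() && b.has_value() && *a == *b;
}

// Returns the indices of rows whose reader value equals the probe value.
std::vector<std::size_t> rowsPassingCondition(
    const Column& readerValues,
    const Column& probeValues) {
  std::vector<std::size_t> kept;
  const std::size_t rows = std::min(readerValues.size(), probeValues.size());
  for (std::size_t i = 0; i < rows; ++i) {
    if (sqlEquals(readerValues[i], probeValues[i])) {
      kept.push_back(i);
    }
  }
  return kept;
}
```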

Reviewed By: xiaoxmeng

Differential Revision: D104772751

fbshipit-source-id: 0f95d02cdf66925063aa363ea315a08cd601c0c8
Summary:
Pull Request resolved: facebookincubator#17510

Add velox::serializer::KeyDecoder to decode KeyEncoder-compatible composite keys for all supported scalar key column types. Wire the decoder into Buck/CMake and add round-trip plus malformed-input coverage.

Reviewed By: xiaoxmeng

Differential Revision: D101708729

fbshipit-source-id: 4d5b2e39d99df7c287bba58f1f4a9a01bcac3079
Summary:
Pull Request resolved: facebookincubator#17480

TypedDistinctAggregations::extractValues() loops over every group to extract distinct rows from the accumulator and add them to the aggregation function. It currently creates a new vector to hold the distinct rows for every group, which can cause a large number of memory allocations when the grouping keys have high cardinality. This diff makes it reuse the vector across groups.
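
The reuse pattern, shown with standard containers rather than Velox vectors. `GroupAccumulator` and `sumDistinctPerGroup` are hypothetical stand-ins, and summation stands in for the aggregation function.

```cpp
#include <set>
#include <vector>

// Stand-in for the per-group accumulator holding distinct values.
using GroupAccumulator = std::set<int>;

// Sums the distinct values of every group. The scratch vector is created once
// and cleared per group; clear() keeps the capacity, so once the largest
// group has been seen no further allocations are needed for it.
std::vector<long> sumDistinctPerGroup(const std::vector<GroupAccumulator>& groups) {
  std::vector<long> sums;
  sums.reserve(groups.size());
  std::vector<int> scratch;  // reused across groups
  for (const auto& group : groups) {
    scratch.clear();
    scratch.insert(scratch.end(), group.begin(), group.end());
    long sum = 0;
    for (int value : scratch) {
      sum += value;
    }
    sums.push_back(sum);
  }
  return sums;
}
```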

Reviewed By: Yuhta

Differential Revision: D104498389

fbshipit-source-id: f9af8ccec13ed4752e878e80dc70b7e1c74daaca
…r#17482)

Summary:
This fix adds a call to the base Operator::initialize method, which sets up the reclaimer, tracer, and initialization state. This is consistent with how other operators implement their initialize method.

Pull Request resolved: facebookincubator#17482

Reviewed By: kgpai

Differential Revision: D105084573

Pulled By: mbasmanova

fbshipit-source-id: bd7a005fa9558df782c2fd6ccc7d27985aba4cd9
…bookincubator#16349)

Summary:
closes facebookincubator#16309

Pull Request resolved: facebookincubator#16349

Reviewed By: amitkdutta

Differential Revision: D105075676

Pulled By: mbasmanova

fbshipit-source-id: ef1a57d7007218e6183ac06aab2709fd2022d4ae
…n::apply calls (facebookincubator#17508)

Summary:
Pull Request resolved: facebookincubator#17508

Introduces a global listener registry for observing VectorFunction::apply calls during expression evaluation, following the SplitListeners pattern (velox/exec/Task.h).

Listener factories are registered globally via registerVectorFunctionListenerFactory(). During expression compilation, ExprCompiler iterates all registered factories, calling create() once per resolved scalar function with the function name, VectorFunctionMetadata, and QueryConfig. Each factory independently decides whether to observe that function by returning a VectorFunctionListeners struct (containing pre and/or post listeners) or std::nullopt to skip. Returned listeners are stored on the Expr node and invoked via invokeApplyWithListeners() in both applyFunction() and evalSimplifiedImpl(). Special forms (AND, OR, CAST, etc.) are not subject to listening.

Key design decisions:
- Global static registry (not per-query): multiple factories can be registered independently without coordination, each observing different concerns (monitoring, access control, auditing).
- Factory receives QueryConfig: per-query behavior control without per-query factory instances. A factory can return std::nullopt based on config flags.
- Listeners are shared_ptrs so a single listener instance can be shared across multiple Expr nodes.
- Post-listener has finally semantics: always runs after apply, even if apply throws. Receives std::exception_ptr (nullptr on success) enabling RAII-like cleanup.
- Listener exceptions propagate as-is (not wrapped as user errors). Only VectorFunction::apply non-Velox exceptions are wrapped via VELOX_USER_FAIL.
- Both listeners receive the expression name as the first argument for attribution.

Reviewed By: kevinwilfong

Differential Revision: D104325777

fbshipit-source-id: 5f2461e92a6ae241acf49fefad7d8225942ee8b1
…#17511)

Summary:
Fix the deletion vector writer and reader to always write/read the offset header for non-run bitmaps, not only when numContainers >= 4, consistent with the [roaring bitmap spec](https://github.com/RoaringBitmap/RoaringFormatSpec#3-offset-header).
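
The rule from the format spec can be written as a single predicate. The cookie values and the threshold of 4 come from the Roaring format specification; the function and constant names are hypothetical and this is not the Velox reader code.

```cpp
#include <cstdint>

// Values defined by the Roaring format specification.
constexpr uint32_t kSerialCookieNoRunContainer = 12346;
constexpr uint32_t kSerialCookie = 12347;
constexpr uint32_t kNoOffsetThreshold = 4;

// Returns true when the serialized bitmap carries an offset header.
// - No-run cookie: offsets are always present.
// - Run cookie (stored in the low 16 bits): offsets are present only when
//   there are at least kNoOffsetThreshold containers.
constexpr bool hasOffsetHeader(uint32_t cookie, uint32_t numContainers) {
  if (cookie == kSerialCookieNoRunContainer) {
    return true;
  }
  return (cookie & 0xFFFF) == kSerialCookie &&
      numContainers >= kNoOffsetThreshold;
}

static_assert(hasOffsetHeader(kSerialCookieNoRunContainer, 1));
static_assert(!hasOffsetHeader(kSerialCookie, 3));
```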

Pull Request resolved: facebookincubator#17511

Reviewed By: pratikpugalia

Differential Revision: D105096426

Pulled By: mbasmanova

fbshipit-source-id: d5f227a29d918c5f2314d97ae42130d480e66a41
Resolve merge conflicts in ExpressionEvaluator.cpp and align the
function registry to main's model (flat CudfFunctionSpec, bool overwrite
parameter). Remove duplicate LogicalFunction class and drop the
now-invalid DateTruncFunction::canEvaluate argument.

Also fix: veloxToCudfTypeId -> veloxToCudfDataType in CudfReduce,
CudfGroupby, and CudfNestedLoopJoin; finalMask/finalNullCount ->
nullMask/nullCount in DecimalExpressionKernels.cu; and remove the
undeclared expressionSpansBothSides call in CudfHashJoin.
@shrshi shrshi requested review from a team, bdice, devavret, karthikeyann and mhaseeb123 as code owners May 15, 2026 18:15
copy-pr-bot Bot commented May 15, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@shrshi shrshi changed the base branch from velox-cudf to IBM-techpreview May 15, 2026 18:19