[WIP][DNR] Staging branch for tpcds by shrshi · Pull Request #110 · rapidsai/velox

shrshi · 2026-05-15T18:15:57Z

Branch used for VeloxCon 2026 results. Aggregates open PRs tracked in facebookincubator#15772

…SizeStats (facebookincubator#17231) Summary: Pull Request resolved: facebookincubator#17231 CONTEXT: RowVector::estimateFlatSize() recursively walks all column vectors to estimate batch memory size. For wide schemas this is very expensive — profiling shows it consuming ~33% of total CPU in certain data pipelines, called unconditionally twice per batch in TableScan::getOutput. WHAT: Gate the estimateFlatSize call in TableScan::getOutput behind the existing enableOperatorBatchSizeStats config flag, matching the pattern already used in Driver::runInternal. When disabled (the default), both addInputVector and RECORD_METRIC_VALUE receive 0 bytes — row counts are still tracked. Also deduplicates the computation from two calls to one. Reviewed By: Yuhta Differential Revision: D98855710 fbshipit-source-id: 0fcc0f6d4193e468e20b053478c030cd4559c3c6
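The gating pattern described in this commit can be sketched roughly as follows. This is only an illustration: `Batch` and `outputBytesForStats` are hypothetical stand-ins, not Velox types or functions, and the flag is passed as a plain `bool` rather than read from a real config.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a RowVector: one entry per column holding that
// column's byte size. Not a Velox type.
struct Batch {
  std::vector<uint64_t> columnBytes;
  mutable int estimateCalls = 0; // counts how often the expensive walk runs

  // Stand-in for the recursive RowVector::estimateFlatSize() walk.
  uint64_t estimateFlatSize() const {
    ++estimateCalls;
    uint64_t total = 0;
    for (uint64_t bytes : columnBytes) {
      total += bytes;
    }
    return total;
  }
};

// Returns the byte count to report to stats: the estimate when the flag is
// enabled, otherwise 0. The estimate is computed at most once per call, so a
// caller can reuse the result for every stat it records.
inline uint64_t outputBytesForStats(const Batch& batch, bool enableBatchSizeStats) {
  return enableBatchSizeStats ? batch.estimateFlatSize() : 0;
}
```

A caller would compute the value once per batch and pass the same number to each stats sink, so the expensive walk never runs when the flag is off.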

…ebookincubator#17210) Summary: In debug mode, CUDA errors from GPU operations may go undetected until much later, making them hard to attribute. This adds cudaGetLastError checks at the boundary of each operator method to surface errors as early as possible. Pull Request resolved: facebookincubator#17210 Reviewed By: pratikpugalia Differential Revision: D101388224 Pulled By: peterenescu fbshipit-source-id: 0e29b9acd39e676fb7b2f6626606c88828cf2186

- CudfHashAggregation.cpp: Add missing stream parameter to StddevSampAggregator::addGroupbyRequest, replace veloxToCudfTypeId with veloxToCudfDataType - CudfHashJoin.cpp: Remove orphaned initializeFilter() calls - CudfHiveDataSource: Add getTableRowType() declaration, fix variable name typo, remove dead code - CudfConfig.h: Add missing functionEngine member used by AstUtils.h - ExpressionEvaluator.cpp: Remove merge conflict markers

…cebookincubator#17139) Summary: Pull Request resolved: facebookincubator#17139 When IndexLookupJoin has `needsIndexSplit=true`, its index TableScan node ID is included in `groupedExecutionLeafNodeIds` for coordinator-side split scheduling. However, this node is NOT a separate pipeline leaf in Velox — IndexLookupJoin manages the index source internally. Without this change, `validateGroupedExecutionLeafNodes` rejects the plan because it cannot find the index source node ID in any driver factory. Add `collectIndexLookupSourceIds` to collect index lookup source node IDs and skip them during leaf validation. Fix `noMoreSplitsForGroup` to handle missing split stores (creates a store with noMoreSplits already set). Fix `getSplitOrFuture` to propagate global `noMoreSplits` to newly created per-group stores. Reviewed By: xiaoxmeng Differential Revision: D100372349 fbshipit-source-id: fb4093f87e9d665bb3d4cd697b2f6365851c34dc

…cebookincubator#17235) Summary: Pull Request resolved: facebookincubator#17235 Add DWRF file format support for the Iceberg data sink. Velox supports the DWRF file format, so this adds support to the Iceberg connector to leverage it. Part of prestodb/presto#27198 Reviewed By: srsuryadev Differential Revision: D100061159 fbshipit-source-id: f87d7bb7a3743d3bdb88285343ea61a6e5f5bcfb

…facebookincubator#17157) (facebookincubator#17157) Summary: Add three new IP address utility functions: - ip_version(IPADDRESS) -> BIGINT - ip_version(IPPREFIX) -> BIGINT - ip_prefix_masklen(IPPREFIX) -> BIGINT Without these functions, users who need IP version or prefix length must cast to VARCHAR and parse the string representation. This forces them to either defer the conversion to typed IPADDRESS/IPPREFIX and carry raw VARCHAR throughout their dataflow, or create redundant columns for metadata that should be derivable from the type itself. This has three costs: Correctness — String-based workarounds are error-prone. IPv4-mapped IPv6 addresses like ::ffff:1.2.3.4 contain ":" but are IPv4. Different string representations of the same IP can cause silent mismatches. Dedicated functions operate on the native binary representation, eliminating these classes of bugs. Performance — String interaction is significantly slower than operating on the typed representation. ip_version is a single bit check on the 128-bit address; ip_prefix_masklen reads the stored prefix length byte directly — no VARCHAR cast, no allocation, no parsing. Completeness — Presto already has a rich set of IP functions (ip_prefix, ip_subnet_min, ip_subnet_max, ip_prefix_collapse, is_subnet_of, is_private_ip), but lacks these two basic introspection primitives. Adding them promotes the use of typed IPADDRESS/IPPREFIX at the entry point of data pipelines. Existing art: BigQuery has NET.IP_VERSION(). PostgreSQL's inet type supports family() and masklen(). Presto docs: https://prestodb.io/docs/current/functions/ip.html Discussion: https://fb.workplace.com/groups/presto.dev/permalink/31395723200049562/ Pulled By: tc25898 Pull Request resolved: facebookincubator#17157 tc25898 Reviewed By: kaikalur Differential Revision: D100568584 fbshipit-source-id: 2ceac57ca514a178135261e6270557b54c04295d
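To illustrate why operating on the binary representation avoids the string pitfalls mentioned above, here is a small sketch. It assumes (this is an assumption, not taken from the commit) that an address is held as 16 bytes in network byte order with IPv4 addresses stored in IPv4-mapped form (`::ffff:a.b.c.d`). The helper name `ipVersion` is hypothetical, and it does a prefix comparison rather than the single-bit check the commit describes.

```cpp
#include <array>
#include <cstdint>

// Assumed representation: 16 bytes, network byte order, with IPv4 addresses
// stored in IPv4-mapped form (::ffff:a.b.c.d).
using IpAddressBytes = std::array<uint8_t, 16>;

// Hypothetical helper: returns 4 for IPv4-mapped addresses and 6 otherwise.
inline int64_t ipVersion(const IpAddressBytes& addr) {
  // The first 10 bytes of an IPv4-mapped address are all zero.
  for (int i = 0; i < 10; ++i) {
    if (addr[i] != 0) {
      return 6;
    }
  }
  // Bytes 10 and 11 must both be 0xff for the mapped prefix.
  return (addr[10] == 0xff && addr[11] == 0xff) ? 4 : 6;
}
```

Under this assumed layout, `::ffff:1.2.3.4` is classified as IPv4 even though its text form contains ":", which is the mismatch a string-based check would get wrong.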

Implement CudfGroupId operator to replace the CPU GroupId operator on GPU for SQL GROUPING SETS, CUBE, and ROLLUP operations. - Add CudfGroupId class inheriting from CudfOperatorBase - Cycle through grouping sets one at a time (matching CPU behavior) - Create all-null columns for keys not in current grouping set - Create constant group_id column for each grouping set - Optimize column ownership with usage counting (move vs copy) - Register GroupIdAdapter in OperatorAdapters - Add comprehensive tests matching core Velox test patterns

…GSEGV (facebookincubator#17247) Summary: Pull Request resolved: facebookincubator#17247 D100855055 introduced TDigestAccumulator.h with layout {double compression, TDigest digest} and updated TDigestAggregate.cpp to use it, but did NOT update MergeTDigestAggregate.h. This left two conflicting TDigestAccumulator structs in the same namespace (facebook::velox::aggregate::prestosql) with different memory layouts. Both translation units instantiate the template Aggregate::destroyAccumulators<TDigestAccumulator> as weak symbols. The linker deduplicates to one instantiation, which uses the wrong sizeof(T) for one of the two callers. The memset in destroyAccumulators writes past the end of the accumulator, corrupting adjacent memory. Crash chain in production: 1. Query uses merge(CAST(tdigest_... AS tdigest(double))) triggering MergeTDigestAggregate 2. TDigest compression_ reads as 0 (should be 100) due to wrong struct layout offsets 3. VELOX_USER_CHECK_EQ throws: "100 vs. 0 Cannot merge TDigests with different compression parameters" 4. During cleanup, ~TDigest() calls HashStringAllocator::freeToPool() which hits SIGSEGV — memory corrupted by wrong-sized memset Fix: Remove the duplicate TDigestAccumulator from MergeTDigestAggregate.h and include TDigestAccumulator.h instead. Update member access from digest_ (private) to digest (public). Reviewed By: natashasehgal, talgalili Differential Revision: D101577660 fbshipit-source-id: 79ef12d38d45a08128f43d3c8ec2611c307edc12

…r hierarchy (facebookincubator#17246) Summary: X-link: facebookincubator/nimble#662 Pull Request resolved: facebookincubator#17246 Rename the protected member `memoryPool_` to `pool_` across the `SelectiveColumnReader` class hierarchy. The public accessor `memoryPool()` retains its name so external callers are unaffected. This is a mechanical rename across 21 files spanning Velox common, DWRF selective readers, Parquet readers, and Nimble selective readers. The DWRF non-selective `ColumnReader` hierarchy has its own separate `memoryPool_` member and is intentionally left unchanged. Reviewed By: xiaoxmeng Differential Revision: D101539025 fbshipit-source-id: 91a0742c43602b3ea3df727ecd0d4b5e490a6109

Summary: Pull Request resolved: facebookincubator#16987 Reviewed By: srsuryadev, rjaber Differential Revision: D97773692 fbshipit-source-id: 7e910b042ad5248feaaf0438552a944fe8f8596c

…ubator#17225) Summary: After PR facebookincubator#15511, Velox + GEO now requires `absl` as dependency but doesn't resolve it when `VELOX_BUILD_TESTING=OFF`. The patch fixes the issue. Fixes build error:
```
CMake Error at build/_deps/s2geometry-src/CMakeLists.txt:54 (find_package):
  By not providing "Findabsl.cmake" in CMAKE_MODULE_PATH this project has
  asked CMake to find a package configuration file provided by "absl", but
  CMake did not find one.

  Could not find a package configuration file provided by "absl" with any of
  the following names:

    abslConfig.cmake
    absl-config.cmake

  Add the installation prefix of "absl" to CMAKE_PREFIX_PATH or set
  "absl_DIR" to a directory containing one of the above files. If "absl"
  provides a separate development package or SDK, be sure it has been
  installed.
```
Pull Request resolved: facebookincubator#17225 Reviewed By: kgpai Differential Revision: D101581081 Pulled By: peterenescu fbshipit-source-id: d5348c99aad81023542a0fb31ea1995466d10e6d

…kincubator#17240) Summary: Pull Request resolved: facebookincubator#17240 Reviewed By: srsuryadev Differential Revision: D98150572 fbshipit-source-id: 0bbb1ee8e05f7808e1b2dc9c15b0e14d0db0f352

…on (facebookincubator#17236) Summary: Check ast support before pushExprToTree adds the next expression node. Removed except arg from createCudfExpression because it shouldn't be needed now. (cherry picked from commit 9dbc09c) Pull Request resolved: facebookincubator#17236 Reviewed By: pratikpugalia Differential Revision: D101650152 Pulled By: peterenescu fbshipit-source-id: 03a81246d39bbc048e3da9d78478476deed803d7

…cubator#17216) Summary: Pull Request resolved: facebookincubator#17216 When converting exclusive bounds to inclusive bounds for BigintRange, HugeintRange, and TimestampRange filters, the code unconditionally increments/decrements the boundary value. This overflows when the value is at the type limit (e.g., greaterThan(INT64_MAX) computes INT64_MAX + 1, which wraps to INT64_MIN, creating a range that matches everything instead of nothing). Guard against overflow by returning AlwaysFalse (or IsNull when nulls are allowed) when the boundary is at the type limit. This fixes incorrect query results for filters like WHERE col > 9223372036854775807. Reviewed By: Yuhta Differential Revision: D101039167 fbshipit-source-id: fdaddc66c7fb91c079c3dab136b76fc1eafead6e
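The overflow guard can be illustrated with a minimal sketch for the `int64_t` case. The helper names are hypothetical and the `std::nullopt` return is a stand-in for "build an always-false filter"; the actual Velox filter construction is not shown here.

```cpp
#include <cstdint>
#include <limits>
#include <optional>

// Converts the exclusive lower bound of `col > value` into an inclusive one.
// Returns std::nullopt when no int64_t can satisfy the predicate, so the
// caller can produce an always-false filter instead of wrapping around.
inline std::optional<int64_t> exclusiveLowerToInclusive(int64_t value) {
  if (value == std::numeric_limits<int64_t>::max()) {
    return std::nullopt; // value + 1 would overflow
  }
  return value + 1;
}

// Same idea for the exclusive upper bound of `col < value`.
inline std::optional<int64_t> exclusiveUpperToInclusive(int64_t value) {
  if (value == std::numeric_limits<int64_t>::min()) {
    return std::nullopt; // value - 1 would overflow
  }
  return value - 1;
}
```

With this shape, `WHERE col > 9223372036854775807` maps to the empty case rather than to a range starting at `INT64_MIN`.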

…kincubator#17206) Summary: - Migrate `CudfMarkDistinct` from `exec::Operator` + `NvtxHelper` to the unified `CudfOperatorBase` introduced in facebookincubator#16934 - `CudfMarkDistinct` (facebookincubator#16974) was merged before the base class unification, so it still used the old pattern - Rename `addInput`/`getOutput` to `doAddInput`/`doGetOutput` (protected template methods) - Remove manual `VELOX_NVTX_OPERATOR_FUNC_RANGE()` calls — base class handles NVTX profiling uniformly Pull Request resolved: facebookincubator#17206 Test Plan: - [x] Built `velox_cudf_mark_distinct_test` successfully - [x] All 11 existing tests pass - [x] Pre-commit checks (clang-format, clang-tidy, license headers) pass Reviewed By: srsuryadev Differential Revision: D101650187 Pulled By: peterenescu fbshipit-source-id: c845abdb24274e021118ecafca02cdcc20ced888

facebookincubator#17179) Summary: The Memory Arbitration Fuzzer intermittently crashes in CI with `SIGABRT` when a task fails to complete within the 5-second default timeout under heavy concurrent load. `readCursorAsync()` throws `VELOX_FAIL("Failed to wait for task to complete after 5.00s, ...")` with error code `kInvalidState` when a task remains in `Running` state past the timeout. Under heavy contention (4 fuzzer instances × 72 threads on a 4-core CI runner), threads get starved and tasks legitimately cannot complete within 5 seconds — but this crashes the fuzzer because its error handler assumes any `kInvalidState` must be from an injected fault (spill FS fault or task abort request). ## Fix Extend the task-completion timeout in the fuzzer to **1 hour** instead of trying to recognize and recover from the 5s default. This avoids false `SIGABRT` crashes from CI thread starvation while still catching real deadlocks. To make this configurable, add a `maxWaitMicros(uint64_t)` setter and a private `maxWaitMicros_{5'000'000}` member to `AssertQueryBuilder`. Default behavior is unchanged for all existing callers. The fuzzer calls `builder.maxWaitMicros(kOneHourUs)` once per query before reading the cursor. `CursorParameters` (in the public `velox/exec/Cursor.h`) is left untouched — `TaskCursor::create` never reads this field, so it belongs in the test-only builder, not the public struct. 
## Root cause analysis Examined 3 CI failures over 2 weeks (Apr 7, 9, 11) — all had the identical signature: - `Expression: injectedSpillFsFault || injectedTaskAbortRequest` - `injectedSpillFsFault: false, injectedTaskAbortRequest: false` - `Failed to wait for task to complete after 5.00s` - Different plan types (HashJoin, OrderBy, TopNRowNumber) — not operator-specific - All drivers were stuck in `enqueued` state, waiting for CPU or memory arbitration Pull Request resolved: facebookincubator#17179 Test Plan: - [x] Reproduced the original crash with 4 parallel fuzzer instances under elevated memory pressure (`--arbitrator_capacity 128MB --allocator_capacity 512MB --num_batches 64 --global_arbitration_pct 20 --task_abort_interval_ms 500`). Instance 3 crashed with `exit 134 (SIGABRT)` and the exact same error as CI. - [x] Applied fix, rebuilt, re-ran the same 4-instance stress test — all 4 instances passed. - [x] **Latest validation (2026-04-17, gpu1, 8 CPU/30GB RAM):** built incrementally with `ninja -j 4`, ran **4 concurrent instances × 15 min** (seeds 101–104). All 4 exited cleanly. **0 FATAL crashes, 0 "Failed to wait" errors** across all logs; 408 expected `MEM_ABORTED` events confirm the arbitrator was actively exercised. Memory peaked at ~5GB / 30GB. Reviewed By: pratikpugalia Differential Revision: D100872383 Pulled By: kgpai fbshipit-source-id: ea56b6563f1f2bfa05d0854fe4cefadbd06a4173
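The configurable timeout can be pictured as a fluent setter with a default-initialized member. The class below is a simplified, hypothetical stand-in for the test-only builder, not the real `AssertQueryBuilder`; only the 5-second default and the one-hour value come from the summary above.

```cpp
#include <cstdint>

// Hypothetical, simplified stand-in for a test-only query builder.
class QueryBuilderSketch {
 public:
  // Fluent setter: returns *this so calls can be chained.
  QueryBuilderSketch& maxWaitMicros(uint64_t micros) {
    maxWaitMicros_ = micros;
    return *this;
  }

  uint64_t maxWaitMicros() const {
    return maxWaitMicros_;
  }

 private:
  // 5 s default, so existing callers see no behavior change.
  uint64_t maxWaitMicros_{5'000'000};
};

// One hour expressed in microseconds.
constexpr uint64_t kOneHourUs = 3'600'000'000ULL;
```

A fuzzer-style caller would invoke `builder.maxWaitMicros(kOneHourUs)` once before reading results, while every other caller keeps the default.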

…s flag (facebookincubator#17232) Summary: Pull Request resolved: facebookincubator#17232 CONTEXT: When the same RowVector is evaluated across many ExprSet instances (e.g. ~1800 times in some cases), the EvalCtx constructor's per-column isFlatEncoding/mayHaveNulls loop runs redundantly for each new instance of EvalCtx, even though the data is the same — 39% of CPU in production. WHAT: Add a new EvalCtx constructor overload that accepts a pre-computed inputFlatNoNulls flag, plus a static computeInputFlatNoNulls() helper. This lets callers compute the flag once and reuse it across all EvalCtx instances for the same RowVector. The original constructor now delegates to computeInputFlatNoNulls() to avoid code duplication. Reviewed By: Yuhta Differential Revision: D101418306 fbshipit-source-id: 68e78b5d47d1b9e36b5c3532dffaa3a5206f1a05
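The "compute once, reuse many times" idea can be sketched as below. `ColumnInfo` and `EvalContextSketch` are hypothetical simplifications; the free function reuses the name `computeInputFlatNoNulls` from the summary only for readability and is not the real static helper.

```cpp
#include <vector>

// Hypothetical per-column summary of encoding and nullability.
struct ColumnInfo {
  bool isFlat;
  bool mayHaveNulls;
};

// True only if every column is flat and none may contain nulls.
inline bool computeInputFlatNoNulls(const std::vector<ColumnInfo>& columns) {
  for (const auto& column : columns) {
    if (!column.isFlat || column.mayHaveNulls) {
      return false;
    }
  }
  return true;
}

// Stand-in for an evaluation context with two construction paths.
class EvalContextSketch {
 public:
  // Original path: scans the columns, delegating to the shared helper.
  explicit EvalContextSketch(const std::vector<ColumnInfo>& columns)
      : EvalContextSketch(computeInputFlatNoNulls(columns)) {}

  // New path: the caller supplies a flag it computed earlier.
  explicit EvalContextSketch(bool inputFlatNoNulls)
      : inputFlatNoNulls_(inputFlatNoNulls) {}

  bool inputFlatNoNulls() const {
    return inputFlatNoNulls_;
  }

 private:
  bool inputFlatNoNulls_;
};
```

A caller that evaluates the same input many times would call the helper once and pass the resulting `bool` to each new context, skipping the per-column loop on every construction after the first.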

…r async RPC batch mode (facebookincubator#17227) Summary: Pull Request resolved: facebookincubator#17227 Batch dispatch chunking and AIMD congestion control for async RPC operators in batch mode. Problem: When using batch mode, dispatch_batch_size was not actually splitting rows — all rows were sent as a single RPC call, potentially exceeding the server's concurrent request limit. Additionally, there was no backpressure mechanism to throttle dispatch when the server was overloaded. Changes: 1. Batch dispatch chunking: flushBatch(maxRows) drains only maxRows from pending rows instead of all. RPCOperator loops flushBatchRequests(dispatchBatchSize_) to flush in chunks. AsyncRPCFunction.h updated with maxRows parameter. 2. Backpressure check in addInput: while loop checks isUnderBackpressure() between flushes to prevent overshooting maxPendingBatches. 3. AIMD congestion control: RPCState tracks effectiveMaxPendingBatches_ (starts at maxPendingBatches_=2). On success: +1 (additive increase). On error (>50% real errors or _rpc_retried signal): /2 (multiplicative decrease, floor 1). Null input responses are excluded from error counting. Suppresses redundant "decreased from 1 to 1" log messages. Reviewed By: Yuhta Differential Revision: D101062260 fbshipit-source-id: cc0f809cabdbf8c9b0abe53305cabaa10ba3b645
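The chunking and AIMD rules described above can be sketched in a few lines. `AimdLimit` and `chunkSizes` are hypothetical names; the sketch leaves out the error-rate threshold, the retry signal, and logging, and it assumes no upper cap on the limit because the summary does not mention one.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical AIMD limit on the number of in-flight batches.
class AimdLimit {
 public:
  explicit AimdLimit(size_t initial) : limit_(std::max<size_t>(initial, 1)) {}

  // Additive increase: allow one more in-flight batch after a success.
  void onSuccess() {
    ++limit_;
  }

  // Multiplicative decrease: halve the limit, never dropping below 1.
  void onOverload() {
    limit_ = std::max<size_t>(limit_ / 2, 1);
  }

  size_t limit() const {
    return limit_;
  }

 private:
  size_t limit_;
};

// Splits `pendingRows` into chunks of at most `maxRows` rows each, which is
// the effect of flushing repeatedly with a per-flush row cap.
inline std::vector<size_t> chunkSizes(size_t pendingRows, size_t maxRows) {
  assert(maxRows > 0);
  std::vector<size_t> chunks;
  while (pendingRows > 0) {
    const size_t rows = std::min(pendingRows, maxRows);
    chunks.push_back(rows);
    pendingRows -= rows;
  }
  return chunks;
}
```

Starting from a limit of 2, one success raises it to 3 and one overload signal drops it to 1; 10 pending rows with a cap of 4 are dispatched as chunks of 4, 4, and 2.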

…acebookincubator#17221) Summary: When the build side of a hash join has no data (e.g. anti-join with empty result), inputs_ is empty. In a debug build, the logging in `noMoreInput()` accesses `inputs_[0]` without bounds checking, which can result in a `SIGSEGV`. This fixes Q16 (TPC-H), which uses NOT IN (anti-join) where the build side can be empty for some partitions. Pull Request resolved: facebookincubator#17221 Reviewed By: kagamiori Differential Revision: D101671299 Pulled By: peterenescu fbshipit-source-id: 0971d85671e67810789d8148ce3891e757faf4f7

…r#16037) Summary: Replaces previous PR facebookincubator#15014 that erroneously also included changes in some `experimental/cudf` files. This experimental feature is targeted at a multi-node setup of Prestissimo when running with the Cudf extension. It provides an exchange mechanism to transfer CudfVectors between GPU device memory without the need to first copy data to host memory. This mechanism is in addition to the HTTP-based exchange used in Prestissimo. The main goal is to exploit fast hardware interconnects that are available between GPUs. The cudf exchange builds on top of the existing experimental Cudf extension. The approach follows the same design as the HTTP based exchange. Its components operate on CudfVectors instead of SerializedPages and include: * CudfOutputQueueManager and CudfQueues instead of the OutputBufferManager and OutputBuffers * CudfExchangeClient and CudfExchangeSource instead of ExchangeClient and ExchangeSource/PrestoExchangeSource * CudfExchange and CudfExchangeQueue instead of Exchange and ExchangeQueue * CudfExchangeServer instead of Prestissimo's HTTP resource The code in this PR does not contain the necessary hooks needed to initialize Cudf exchange. Pull Request resolved: facebookincubator#16037 Reviewed By: pratikpugalia Differential Revision: D101650220 Pulled By: peterenescu fbshipit-source-id: 207b7964d0039f1885055ee4dae302a200480159

Summary: This is Part 2 of 3 of the GPU Decimal implementation. It adds decimal functions and some other tidying-up. More explanation and annotation to follow. Pull Request resolved: facebookincubator#16750 Reviewed By: kgpai Differential Revision: D101837619 Pulled By: kagamiori fbshipit-source-id: 4e7971aba5f3f5668d19bb2ff53571aa28b74d72

…#17283) Summary: Add a benchmark so that the cudf-to-velox conversion can then be optimized, as in facebookincubator#16760 and facebookincubator#16859. Pull Request resolved: facebookincubator#17283 Reviewed By: kgpai Differential Revision: D101865735 Pulled By: kagamiori fbshipit-source-id: 98f7bfa4b73ef8190793c570c3554d953f07394b

Summary: - register `startswith` in the cuDF function evaluator for Spark string expressions - support `startswith(column, constant)`, `startswith(column, column)`, and `startswith(constant, column)` on the GPU path - preserve Spark null semantics for column-pattern evaluation and keep `startswith(constant, constant)` off the GPU path until row-count support exists Pull Request resolved: facebookincubator#17205 Test Plan: - [x] `velox_cudf_expression_selection_test --gtest_filter='*Startswith*'` - [x] `velox_cudf_spark_filter_project_test --gtest_filter='*startswith*'` - [x] `velox_cudf_hash_join_test --gtest_filter='*innerJoinWithMixedFilterPrecomputation*:*leftJoinWithStringFunctionFilter*'` - [x] `velox_cudf_table_scan_test --gtest_filter='TableScanTest.filterPushdown:TableScanTest.remainingFilterExtraction'` Reviewed By: kgpai Differential Revision: D101898882 Pulled By: kagamiori fbshipit-source-id: 606efdb38563fe5ac0043b83a4d822e41ac2cb53

…stoSerializerEstimationUtils (facebookincubator#17288) Summary: Pull Request resolved: facebookincubator#17288 Replace all usages of `folly::grow_capacity_by` with `std::vector::reserve` in `PrestoSerializerEstimationUtils.cpp`. Since the vectors start empty in both `estimateWrapperSerializedSize` and `expandRepeatedRanges`, `reserve` is functionally identical and simpler. This removes the dependency on `folly/container/Reserve.h` from this file, supporting Velox's ongoing effort to reduce Folly dependencies. Addresses post-land review comment on D98150572. Reviewed By: mbasmanova Differential Revision: D101866359 fbshipit-source-id: 0651baca0567af00a1ed009b745ad37868a7b07e

Summary: Pull Request resolved: facebookincubator#17237 Operator stats have gaps and are not suitable for determining where time went in a task. This adds driver timing stats per pipeline to help identify a query's bottlenecks. The current version does not handle grouped execution well; support for grouped execution is in progress. Reviewed By: bikramSingh91 Differential Revision: D101395498 fbshipit-source-id: 1bd6e5311844f851109fd694afe37d101efe90a7

…p shuffle compression on small pages (facebookincubator#17306) Summary: Pull Request resolved: facebookincubator#17306 The minimum compression ratio is not fully expressive enough to avoid wasteful compression. For production workloads, analysis revealed that the page size distribution skews very small (P50 < 1K) such that even best case compression is not likely to reduce the number of packets transmitted over the network but does introduce additional serialization overhead and CPU consumption. As a result of this, exchange compression will tend to degrade latency even in a network bound environment. However, if we disable compression up to some multiple of the network MTU, we can limit compression to cases where it is likely to be beneficial. There are two reasons this may benefit execution latency: 1. queries exchanging lots of data directly benefit from a reduced network transfer bottleneck 2. the noisy neighbor effect from exchange-heavy queries on well-behaved queries is reduced, since they will chew up less network bandwidth This new option to disable page compression when the page size is below some threshold is propagated similarly to the minimum compression parameter. The new size-based skip uses arena size at flush time and is independent of the existing probe-based heuristic in PrestoIterativeVectorSerializer; it survives the PartitionedOutput flow that recreates the serializer between flushes (which resets the probe counter). Reviewed By: Yuhta Differential Revision: D100420559 fbshipit-source-id: 0f0fd55c4ef73e065be49eb90fbcff7b177802ae
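The size-based skip reduces to a threshold comparison at flush time. The sketch below uses hypothetical helper names and treats the threshold as a multiple of the MTU, as the summary suggests; the real option name and how it is propagated are not shown.

```cpp
#include <cstddef>

// Hypothetical helper: derives the size threshold as a multiple of the MTU.
inline size_t minCompressionPageBytes(size_t mtuBytes, size_t mtuMultiple) {
  return mtuBytes * mtuMultiple;
}

// Returns true when a page is large enough for compression to be attempted.
// Pages below the threshold are sent uncompressed.
inline bool shouldCompressPage(size_t pageBytes, size_t minPageBytes) {
  return pageBytes >= minPageBytes;
}
```

For example, with an assumed 1500-byte MTU and a multiple of 4, a 900-byte page (typical of the small-page skew described above) skips compression, while a page of 6000 bytes or more is still compressed.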

…17239) Summary: Rename EnumsDeclare.h to EnumDeclare.h and Enums.h to EnumDefine.h for symmetric naming that makes clear these are a matched pair — EnumDeclare.h in every .h, EnumDefine.h in the corresponding .cpp. The old naming (Enums.h / EnumsDeclare.h) made Enums.h look like the main header and EnumsDeclare.h look auxiliary, tempting people to just include Enums.h. The declare macros (VELOX_DECLARE_ENUM_NAME, VELOX_DECLARE_EMBEDDED_ENUM_NAME) live in the lightweight EnumDeclare.h that only needs <iosfwd>, <optional>, <string_view>. The define macros stay in EnumDefine.h with their heavy transitive includes (folly/container/F14Map.h and velox/common/base/Exceptions.h, together ~8M preprocessed). The two headers are independent. All velox headers include EnumDeclare.h for the DECLARE macros. The corresponding .cpp files include EnumDefine.h for the DEFINE macros. BUCK targets updated accordingly: enums_declare is an exported dep for headers, enums is a private dep for .cpp files. ConfigProperty.h includes EnumDeclare.h instead of Enums.h, cutting ~8M of preprocessed includes from every consumer. Pull Request resolved: facebookincubator#17239 Reviewed By: mbasmanova Differential Revision: D101462826 fbshipit-source-id: c439cba5a8db21d830dce61d6901164bfa48fa51

…ubator#17198) Summary: Pull Request resolved: facebookincubator#17198 Add getExtractionSizeValues() and getExtractionValues() to map/list readers. Add kField handling in struct reader (direct child getValues with lazy loading support). Add TransformColumnLoader for mixed multi-extraction lazy path. Add text reader transform support in RowReader::projectColumns(). Reader checks deltaUpdate() to bypass extraction when delta updates are active. Reviewed By: maniloya Differential Revision: D97671736 fbshipit-source-id: c1610538656ddc3af515dc7cc9a8c7c0610e3ccb

Summary: Introduce Axiom, a C++ library for building composable query engines on top of Velox. The post covers the motivation (fragmented landscape of monolithic engines), architecture (pluggable frontends, optimizer, runtime, connectors), current status (TPC-H, production workloads, CLI), and a call for community involvement. Pull Request resolved: facebookincubator#17316 Reviewed By: kKPulla Differential Revision: D102156907 Pulled By: mbasmanova fbshipit-source-id: 5bcc1cb2c1a46da8d0631c85edc931a393b003bc

…small batches (facebookincubator#17203) Summary: Pull Request resolved: facebookincubator#17203 For small batches, peeling dictionary encoding from inputs before calling vector functions can create more overhead than it saves. This diff adds a configurable `expression.min_rows_for_peeling` threshold (default: 0) that suppresses peeling when the number of selected rows (to process) falls below it. When peeling is suppressed, some VectorFunctions still expect flat or constant inputs. To preserve this guarantee: - A single constant-encoded input continues to be peeled (cheap and required since some UDFs have started to rely on this expectation. Running the fuzzer exposed this gap) - A single dictionary-encoded input (possibly alongside constant inputs, e.g. in-predicate) is flattened before evaluation. Added fuzzer coverage: Extended the expression fuzzer to exercise partial selectivity by randomly deselecting rows. Set the peeling threshold to 25 (1/4 of the default batch size). Both these now allow us to exercise the functionality exposed via the new expression.min_rows_for_peeling config. Additional fixes to fuzzer: - Fixed retryWithTry to avoid evaluating uninitialized rows. - Fixed a bug in ExpressionRunner where the query config diverged from the one used in fuzzer tests. - Fixed inputs being double-wrapped in lazy vectors when using verify mode in ExpressionRunner. - Fixed ExpressionRunner failing to load a repro when no selectivity vector files were provided. Unit tests Extended existing tests in ExprTest to exercise with peeling enabled and disabled. Additional Context: Single-arg VectorFunctions are expected to receive only flat or constant inputs. Fuzz testing revealed latent bugs in several UDFs: some had constant-input optimizations that mishandled errors under TRY (e.g. 
reporting the error only for the representative row instead of all selected rows), and others assumed constant-encoded complex types would always be peeled, so they asserted flat encoding unconditionally. These bugs were masked because peeling always ran for constant-encoded inputs. To keep this change low risk, the constant-peeling guarantee is preserved explicitly in the disabled-peeling path for single-arg functions. Reviewed By: Yuhta Differential Revision: D101074993 fbshipit-source-id: 2aec51f90f063ed0ad2e114a50211f19baa474db

…tor#17485) Summary: Pull Request resolved: facebookincubator#17485 Accept kPartitionKey column handles in HiveIndexSource init() (previously crashed on any non-regular handle). Track partition column handles in partitionKeyHandles_, include partition columns in readerOutputType_, and synthesize partition column values from split metadata via scan-spec setConstantValue(). Key changes: - init(): Accept kPartitionKey handles alongside kRegular. Populate partitionKeyHandles_. Skip subfield/postProcessor checks for partition columns. - setPartitionValues(): New method that sets partition constants on scanSpec_ children from the first split's partition keys. Iterates scan spec children (same pattern as FileSplitReader::adaptColumns) and delegates to setPartitionValue() for each partition column. Validates all splits share the same partition values. - setPartitionValue(): Extracted helper matching FileSplitReader::setPartitionValue signature. - addSplits(): Calls setPartitionValues() before creating readers. - Pass partitionKeyHandles_ to makeScanSpec() so partition column children are included in the spec. Reviewed By: xiaoxmeng Differential Revision: D104739908 fbshipit-source-id: cb96fc3d9005df3f0cb91d04a002ca7918e9ace9

…bator#17493) Summary: Pull Request resolved: facebookincubator#17493 Adds support for writing Iceberg tables in NIMBLE format via a new `NimbleWriterOptionsAdapter` mirroring the existing `DwrfWriterOptionsAdapter` in the anonymous namespace of `WriterOptionsAdapter.cpp`, plus a `case dwio::common::FileFormat::NIMBLE` arm in `createWriterOptionsAdapter()`. ## Why Before this diff, `isSupportedFileFormat(NIMBLE)` returned `false` because `createWriterOptionsAdapter()` only handled `PARQUET` and `DWRF`. Any INSERT into a NIMBLE-formatted Iceberg table failed at the `IcebergDataSink` ctor with: ``` isSupportedFileFormat(tableStorageFormat) Unsupported file format for writing Iceberg tables: nimble ``` NIMBLE is otherwise fully supported in Velox's dwio layer (reader, writer, vector serde) -- only the Iceberg connector's `WriterOptionsAdapter` dispatch was missing. ## Manifest format string choice The new adapter reports `manifestFormatString() == "ORC"`, matching the convention already used by `DwrfWriterOptionsAdapter`. Iceberg's manifest file-format vocabulary has no NIMBLE enum, and the cross-engine convention established by Java `com.facebook.presto.iceberg.FileFormat.NIMBLE.toIceberg()` (in presto-facebook-iceberg) reports NIMBLE as `"ORC"` so downstream consumers (coordinator, catalog, snapshot tooling) can interpret the commit message without a NIMBLE-aware enum extension. The actual on-disk format is identified at read time via the file extension and the Nimble magic bytes (`0xa1fa` little-endian footer), not via the manifest string. Writing `"ORC"` here is therefore safe and preserves cross-engine round-trip compatibility with Java planners. ## What this does NOT do - Does not register a `NimbleWriterFactory` with the dwio writer registry -- that registration happens at the Prestissimo server bootstrap (covered in the stacked `presto_cpp` diff). - Does not change DWRF or Parquet adapter behavior. - Does not touch the read path. 
Reviewed By: srsuryadev Differential Revision: D104838011 fbshipit-source-id: 2afee19569eea267c7104cc5831a0486d1c53ecd

… left semi project support (facebookincubator#17113) Summary: - Add `CudfNestedLoopJoinBuild`, `CudfNestedLoopJoinProbe`, and `CudfNestedLoopJoinBridge` GPU operators that accelerate nested loop joins using libcudf APIs - Support inner, left, right, full outer, and left semi project join types with optional filter conditions - Register `NestedLoopJoinBuildAdapter` and `NestedLoopJoinProbeAdapter` in `OperatorAdapters.cpp` - Fix pre-existing bug in `CudfFromVelox::getOutput()` that returned a 0-row `CudfVector` instead of `nullptr` Closes facebookincubator#17112 Part of facebookincubator#15772 Supersedes facebookincubator#16942 ### Design **Two-path approach** for optimal performance: - **No filter (cross join)**: uses `cudf::cross_join(probe, build)` for full cartesian product - **With filter (conditional join)**: uses `cudf::conditional_inner_join(probe, build, ast)` to evaluate the filter on GPU, returning only matching row index pairs, then gathers actual data using indices **Batched mismatch tracking** for outer joins: since build data is processed in batches, per-batch left/right join APIs cannot be used directly (a row unmatched in one batch may match a later batch). Instead, `conditional_inner_join` is always used per-batch and GPU-side BOOL8 flag columns track which rows were matched: - Left join: `probeMatchedFlags_` tracks per-probe-batch mismatches; after all build batches, unmatched probe rows are emitted with null build columns - Right join: `buildMatchedFlags_` tracks cross-probe mismatches; after all probes finish, the last driver merges flags from all peers and emits unmatched build rows **Left semi project**: uses `cudf::conditional_left_semi_join` to find matching probe indices, then builds a BOOL8 match column via `cudf::contains`. ### Known limitation Zero-column build side is not yet supported — `cudf::table` with zero columns reports `num_rows() == 0`, causing the operator to treat a non-empty build as empty. Tracked as a TODO. 
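The batched mismatch tracking for a left join can be illustrated on the CPU with plain vectors. This is a sketch of the idea only: `leftJoinBatched`, `Row`, and `JoinedRow` are hypothetical, a `std::vector<bool>` stands in for the GPU-side BOOL8 flag column, and the nested loops stand in for the per-batch conditional inner join.

```cpp
#include <cstddef>
#include <functional>
#include <optional>
#include <utility>
#include <vector>

using Row = int;
// A probe row paired with a build row, or with null when nothing matched.
using JoinedRow = std::pair<Row, std::optional<Row>>;

// Left join of one probe batch against several build batches. Each build
// batch is joined with inner-join semantics; a flag per probe row records
// whether it matched in any batch.
inline std::vector<JoinedRow> leftJoinBatched(
    const std::vector<Row>& probe,
    const std::vector<std::vector<Row>>& buildBatches,
    const std::function<bool(Row, Row)>& condition) {
  std::vector<JoinedRow> out;
  std::vector<bool> probeMatched(probe.size(), false);

  for (const auto& build : buildBatches) {
    for (size_t p = 0; p < probe.size(); ++p) {
      for (Row b : build) {
        if (condition(probe[p], b)) {
          out.emplace_back(probe[p], b);
          probeMatched[p] = true;
        }
      }
    }
  }

  // Only after all build batches are seen is it safe to emit unmatched rows.
  for (size_t p = 0; p < probe.size(); ++p) {
    if (!probeMatched[p]) {
      out.emplace_back(probe[p], std::nullopt);
    }
  }
  return out;
}
```

A probe row that matches in the first build batch but not in the second must not be emitted with nulls after the second batch, which is why the unmatched rows are produced only once every batch has been processed.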
Pull Request resolved: facebookincubator#17113 Test Plan: - [x] 53 GPU unit tests pass (`velox_cudf_nested_loop_join_test`) - All 5 join types: inner, left, right, full, left semi project - With and without filter conditions - Empty build/probe per join type - Multi-batch build and probe - Multi-driver execution (2 drivers) - NULL handling in data and filter conditions - Output column reordering - Large cross join (100×50 = 5000 rows) - [x] Run all TPC-DS queries that contain NLJ of g7e.4xlarge (NVIDIA RTX PRO 6000 Blackwell Server Edition, 16vCPUs): All queries run but two (SF100) that fail due to a different reason | Query | NLJ nodes | GPU cold | CPU cold | GPU warm ± std | CPU warm ± std | Speedup | t-stat | 95% CI (s) | Significant? | |-------|-----------|----------|----------|----------------|----------------|---------|---------|------------------|--------------| | Q9 | 15 | 26.90s | 19.96s | 22.91 ± 8.40s | 16.16 ± 7.54s | +41.7% | +1.337 | [-3.35, +16.84] | NO (n=5) | | Q14 | 3 | 58.94s | 11.99s | 12.27 ± 0.16s | 12.36 ± 0.39s | -0.7% | -0.451 | [-0.47, +0.30] | NO (n=5) | | Q23 | 2 | 32.29s | 24.93s | 23.12 ± 0.15s | 24.89 ± 0.29s | **-7.1%** | -12.126 | [-2.06, -1.48] | YES (n=5)| | Q24 | 1 | - | - | 5.48 ± 0.36s | 5.47 ± 0.23s | +0.1% | +0.031 | [-0.35, +0.36] | NO (n=5/8) | | Q28 | 5 | 13.17s | 13.52s | 13.17 ± 0.08s | 13.36 ± 0.08s | **-1.5%** | -3.883 | [-0.29, -0.09] | YES (n=5)| | Q44 | 2 | 13.77s | 7.27s | 7.24 ± 0.17s | 7.20 ± 0.12s | +0.5% | +0.412 | [-0.15, +0.22] | NO (n=5) | | Q54 | 2 | 15.58s | 4.49s | 4.15 ± 0.08s | 4.07 ± 0.02s | +1.9% | +2.056 | [+0.00, +0.15] | YES (n=5)| | Q61 | - | - | - | FAIL | FAIL | - | - | - | Decimal N/S | | Q77 | 1 | 28.98s | 7.06s | 6.94 ± 0.25s | 7.00 ± 0.13s | -0.8% | -0.452 | [-0.30, +0.19] | NO (n=5) | | Q88 | 7 | 8.02s | 7.37s | 7.45 ± 0.22s | 7.37 ± 0.15s | +1.1% | +0.676 | [-0.16, +0.32] | NO (n=5) | | Q90 | - | - | - | FAIL | FAIL | - | - | - | Decimal N/S | ## Synthetic NLJ Benchmark — SF100 Results | 
Q# | Description | Probe | Build | Join Condition | Output | Time | Status | |----|-------------|-------|-------|----------------|--------|------|--------| | 1 | store_sales × item (range join) | 288M | 204K | `ss_list_price BETWEEN (i_current_price - 1.0) AND (i_current_price + 1.0)` | — | — | Host OOM | | 2 | store_sales × date_dim (inequality) | 288M | 365 | `ss_sold_date_sk > d_date_sk` (build filtered: `d_year = 2000`) | — | — | Host OOM | | 3 | catalog_sales × item (multi-condition) | 144M | 204K | `cs_list_price > i_current_price AND cs_wholesale_cost < i_wholesale_cost` | — | — | Host OOM | | 4 | store × item (baseline) | 402 | 204K | `i_current_price > 50.0` | 4.6M rows | 324ms | Pass | | 5 | customer × customer_address | 2M | 50K | `c_current_addr_sk > ca_address_sk` (probe filtered: `c_birth_year > 1970`) | — | — | GPU OOM | | 6 | web_sales × store_sales (filtered) | 72M | ~30K | `ws_ext_sales_price > ss_ext_sales_price` (build filtered: `ss_store_sk = 1`) | — | — | GPU OOM | | 7 | store × promotion (date range) | 402 | 1K | `p_start_date_sk BETWEEN (s_closed_date_sk - 100) AND (s_closed_date_sk + 100)` | 6.7K rows | 186ms | Pass | | 8 | date_dim × store (inequality) | 366 | 402 | `d_date_sk > s_store_sk` (probe filtered: `d_year = 2000`) | 147K rows | 150ms | Pass | | 9 | web_page × catalog_page (multi-AND) | 2K | 20K | `wp_web_page_sk > cp_catalog_page_sk AND wp_char_count > cp_catalog_page_number` | 2.0M rows | 166ms | Pass | | 10 | item × household_demo (BETWEEN+AND) | 20K | 7.2K | `i_current_price BETWEEN 10 AND 50 AND hd_dep_count > 0` (probe filtered: `i_category_id = 1`) | 6.2M rows | 309ms | Pass | ### Query Definitions ```sql -- Q1: Range join (Host OOM on SF100) SELECT ss_item_sk, ss_list_price, ss_sales_price, i_item_sk, i_current_price FROM store_sales INNER JOIN item ON ss_list_price BETWEEN (i_current_price - 1.0) AND (i_current_price + 1.0) -- Q2: Inequality + filtered build (Host OOM on SF100) SELECT ss_sold_date_sk, ss_ext_sales_price, 
d_date_sk, d_year FROM store_sales INNER JOIN date_dim ON ss_sold_date_sk > d_date_sk WHERE d_year = 2000 -- Q3: Multi-condition (Host OOM on SF100) SELECT cs_item_sk, cs_list_price, cs_wholesale_cost, i_item_sk, i_current_price, i_wholesale_cost FROM catalog_sales INNER JOIN item ON cs_list_price > i_current_price AND cs_wholesale_cost < i_wholesale_cost -- Q4: Small baseline (Pass) SELECT s_store_sk, s_store_name, i_item_sk, i_current_price FROM store INNER JOIN item ON i_current_price > 50.0 -- Q5: Medium cross-product (GPU OOM on SF100) SELECT c_customer_sk, c_current_addr_sk, ca_address_sk, ca_state FROM customer INNER JOIN customer_address ON c_current_addr_sk > ca_address_sk WHERE c_birth_year > 1970 -- Q6: Fact-to-fact theta (GPU OOM on SF100) SELECT ws_item_sk, ws_ext_sales_price, ss_item_sk, ss_ext_sales_price FROM web_sales INNER JOIN store_sales ON ws_ext_sales_price > ss_ext_sales_price WHERE ss_store_sk = 1 -- Q7: Date range overlap (Pass) SELECT s_store_sk, s_store_name, p_promo_sk, p_promo_name, p_cost FROM store INNER JOIN promotion ON p_start_date_sk BETWEEN (s_closed_date_sk - 100) AND (s_closed_date_sk + 100) -- Q8: Filtered probe + inequality (Pass) SELECT d_date_sk, d_day_name, s_store_sk, s_store_name FROM date_dim INNER JOIN store ON d_date_sk > s_store_sk WHERE d_year = 2000 -- Q9: Multi-condition AND (Pass) SELECT wp_web_page_sk, wp_char_count, cp_catalog_page_sk, cp_catalog_page_number FROM web_page INNER JOIN catalog_page ON wp_web_page_sk > cp_catalog_page_sk AND wp_char_count > cp_catalog_page_number -- Q10: BETWEEN + AND (Pass) SELECT i_item_sk, i_current_price, hd_demo_sk, hd_dep_count FROM item INNER JOIN household_demographics ON i_current_price BETWEEN 10 AND 50 AND hd_dep_count > 0 WHERE i_category_id = 1 Reviewed By: kKPulla Differential Revision: D104904497 Pulled By: mbasmanova fbshipit-source-id: b848952dbb4461ac21dae81d0d88404a4ac59c62
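The batched mismatch tracking described in the Design section can be illustrated on the CPU. The following is only a sketch under stated assumptions: `batchedLeftJoin`, `LeftJoinRow`, and the `matches` predicate are hypothetical names, plain `int` rows stand in for cudf tables, and a `std::vector<bool>` plays the role that the commit message assigns to the GPU-side `probeMatchedFlags_` column. It is not the operator's code.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <optional>
#include <utility>
#include <vector>

// One left-join output row: a probe value plus an optional build value
// (std::nullopt stands in for NULL build columns).
using LeftJoinRow = std::pair<int, std::optional<int>>;

// Left join of one probe batch against a build side that arrives in batches.
// `matches` is a hypothetical stand-in for the join filter.
std::vector<LeftJoinRow> batchedLeftJoin(
    const std::vector<int>& probe,
    const std::vector<std::vector<int>>& buildBatches,
    const std::function<bool(int, int)>& matches) {
  assert(matches && "join filter must be callable");
  std::vector<LeftJoinRow> out;
  // CPU analogue of the matched-flag column: one flag per probe row.
  std::vector<bool> probeMatched(probe.size(), false);

  for (const auto& build : buildBatches) {
    // Per batch, only an inner join is evaluated.
    for (std::size_t p = 0; p < probe.size(); ++p) {
      for (int b : build) {
        if (matches(probe[p], b)) {
          out.emplace_back(probe[p], b);
          probeMatched[p] = true;
        }
      }
    }
  }

  // "Unmatched" is only final once every build batch has been seen.
  for (std::size_t p = 0; p < probe.size(); ++p) {
    if (!probeMatched[p]) {
      out.emplace_back(probe[p], std::nullopt);
    }
  }
  return out;
}
```

A probe row that finds no match in the first build batch but does match in a later one is emitted only as a matched row, which is exactly the case a per-batch left join would get wrong.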

…or#17494)

Summary:
Pull Request resolved: facebookincubator#17494

Add an optional per-batch scan statistics callback to Velox's TableScan operator via QueryCtx. The callback fires after each non-empty batch with the completed row count delta, wall time, and table name. This enables accurate per-batch scan statistics reporting without polling, matching the push-model semantics of existing scan callbacks in the execution stack.

Changes:
- Add ScanBatchEvent struct and ScanBatchCb callback to QueryCtx
- Fire callback in TableScan::getOutput() with completed rows delta (from getCompletedRows()), wall time (microseconds from existing MicrosecondTimer), and table name (from tableHandle_->name())
- Wire callback adapter in VeloxBatchCursor for HiveConnector path with mutex for kParallel thread safety

Follows existing Velox callback patterns (CallbackSink::Consumer, UpdateAndCheckTraceLimitCB, HiveColumnHandle::postProcessor).

Reviewed By: Yuhta

Differential Revision: D104889992

fbshipit-source-id: 11a8abd33a036477850eb0da7f3e3795dc014db3
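A push-model callback of this kind might look roughly like the sketch below. The struct and alias reuse the names `ScanBatchEvent` and `ScanBatchCb` from the commit message, but their fields, the `ScanStatsCollector` class, and the `reportBatch` helper are assumptions made for illustration; the actual definitions in QueryCtx may differ.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <mutex>
#include <string>

// Assumed event shape; field names are illustrative only.
struct ScanBatchEvent {
  std::string tableName;
  std::uint64_t completedRowsDelta;
  std::uint64_t wallTimeMicros;
};

using ScanBatchCb = std::function<void(const ScanBatchEvent&)>;

// Hypothetical consumer. The mutex lets several driver threads report
// batches concurrently.
class ScanStatsCollector {
 public:
  ScanBatchCb callback() {
    return [this](const ScanBatchEvent& event) {
      std::lock_guard<std::mutex> guard(mutex_);
      totalRows_ += event.completedRowsDelta;
      totalMicros_ += event.wallTimeMicros;
      ++batches_;
    };
  }

  std::uint64_t totalRows() const {
    std::lock_guard<std::mutex> guard(mutex_);
    return totalRows_;
  }
  std::uint64_t totalMicros() const {
    std::lock_guard<std::mutex> guard(mutex_);
    return totalMicros_;
  }
  std::uint64_t batches() const {
    std::lock_guard<std::mutex> guard(mutex_);
    return batches_;
  }

 private:
  mutable std::mutex mutex_;
  std::uint64_t totalRows_{0};
  std::uint64_t totalMicros_{0};
  std::uint64_t batches_{0};
};

// Hypothetical operator-side helper: fires only for non-empty batches and
// only when a callback was installed.
void reportBatch(
    const ScanBatchCb& cb,
    const std::string& table,
    std::uint64_t rowsBefore,
    std::uint64_t rowsAfter,
    std::uint64_t micros) {
  assert(rowsAfter >= rowsBefore);
  const std::uint64_t delta = rowsAfter - rowsBefore;
  if (cb && delta > 0) {
    cb(ScanBatchEvent{table, delta, micros});
  }
}
```

Reporting a delta rather than a running total keeps the consumer stateless with respect to the operator: it can sum events from any number of scans without knowing their starting counts.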

…ookincubator#16626)

Summary:
On ubuntu22, with gcc 11.4

/root/velox/velox/exec/fuzzer/SpatialJoinFuzzer.cpp:82:12: error: ‘x’ is used uninitialized [-Werror=uninitialized]
   82 | double x, y;
      | ^
/root/velox/velox/exec/fuzzer/SpatialJoinFuzzer.cpp:82:15: error: ‘y’ is used uninitialized [-Werror=uninitialized]
   82 | double x, y;

Pull Request resolved: facebookincubator#16626

Reviewed By: mbasmanova

Differential Revision: D104691989

Pulled By: kgpai

fbshipit-source-id: a23edaf3a2331a3da5e350a1d6b8efe005d43105

…te copy (facebookincubator#17457)

Summary:
For snappy and zstd compressed Parquet pages, decompress directly into the target buffer instead of going through the stream-based PagedInputStream path.

## Problem

The current path creates a `SeekableArrayInputStream`, wraps it in a `PagedInputStream` (which allocates an internal `outputBuffer_`), decompresses into that buffer, then copies to the final destination via `readFully()` (`std::copy`). The intermediate buffer allocation and copy add unnecessary overhead for codecs where the full compressed page is already available in memory.

Profiling on TPC-H SF100 with snappy-compressed Parquet showed `memmove` (from `std::copy` in `readFully`) as a visible cost in the decompression path.

## Fix

Call `snappy::RawUncompress` / `ZSTD_decompress` directly into the destination buffer, eliminating the intermediate allocation and copy. Other codecs (gzip, lz4, lzo) fall back to the original stream-based path.

Both `Snappy::snappy` and `zstd::zstd` are already linked in the parquet reader CMakeLists.

Pull Request resolved: facebookincubator#17457

Reviewed By: kKPulla

Differential Revision: D104904218

Pulled By: mbasmanova

fbshipit-source-id: eac16c85b335e2106cbb4a8c1a58bb1277042b2a

…r#17337) Summary: Pull Request resolved: facebookincubator#17337 Skip creating child readers for unneeded streams based on ExtractionType (kKeys skips values, kValues skips keys, kSize skips all children). Add needsKeyReader()/needsElementReader() helpers with deltaUpdate awareness. Add VELOX_CHECK in Parquet reader that ExtractionType is kNone. Add DWRF reader-level extraction tests including IO reduction validation and nested ScanSpec verification. Reviewed By: maniloya Differential Revision: D97671745 fbshipit-source-id: 661dec9b00b1f2dc1b2035f41b992ca9a83849ec
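The skip rules listed above (kKeys skips values, kValues skips keys, kSize skips all children) reduce to two small predicates. This is a minimal sketch that assumes a four-value enum and deliberately leaves out the deltaUpdate handling the commit mentions, since its semantics are not described here; the real helpers may take more inputs.

```cpp
// Assumed enum; only the four values named in the commit message are modeled.
enum class ExtractionType { kNone, kKeys, kValues, kSize };

// A key reader is needed for a full read or a keys-only extraction.
constexpr bool needsKeyReader(ExtractionType type) {
  return type == ExtractionType::kNone || type == ExtractionType::kKeys;
}

// An element (value) reader is needed for a full read or a values-only
// extraction. kSize needs neither child.
constexpr bool needsElementReader(ExtractionType type) {
  return type == ExtractionType::kNone || type == ExtractionType::kValues;
}
```

Because both predicates are `constexpr`, the truth table can be checked at compile time with `static_assert`.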

Summary: Avoid allocating a contiguous temporary buffer in ABFS preadv. Read the requested Azure range once and stream bytes directly into the caller-provided buffers. For null gap ranges, consume the skipped bytes with a reusable thread-local discard buffer so later buffers stay aligned without changing the number of remote reads. Add a fake Azure client unit test to verify preadv issues a single download for buffers with gaps and fills the non-null ranges correctly. Pull Request resolved: facebookincubator#17370 Reviewed By: amitkdutta Differential Revision: D105014405 Pulled By: mbasmanova fbshipit-source-id: b7cbf6af5c173025ade02477f4b36c0e86b1e722
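The scatter-read with gap handling can be sketched as follows. A `std::istream` stands in for the single Azure download, and `Range` and `scatterRead` are hypothetical names; the 4096-byte discard size is also an arbitrary choice for the example.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// A destination range; data == nullptr marks a gap whose bytes are skipped.
struct Range {
  char* data;
  std::size_t size;
};

// Consumes consecutive ranges from one already-opened stream and returns the
// number of bytes read from it. No contiguous temporary buffer is allocated.
std::size_t scatterRead(std::istream& in, const std::vector<Range>& ranges) {
  assert(!ranges.empty());
  // Reused per thread, so skipping a gap never allocates.
  thread_local std::array<char, 4096> discard;
  std::size_t consumed = 0;
  for (const Range& range : ranges) {
    if (range.data != nullptr) {
      in.read(range.data, static_cast<std::streamsize>(range.size));
      consumed += static_cast<std::size_t>(in.gcount());
    } else {
      std::size_t left = range.size;
      while (left > 0 && in) {
        const std::size_t chunk = std::min(left, discard.size());
        in.read(discard.data(), static_cast<std::streamsize>(chunk));
        const auto got = static_cast<std::size_t>(in.gcount());
        consumed += got;
        left -= got;
      }
    }
    if (!in) {
      break; // Short read: stop instead of misaligning later buffers.
    }
  }
  return consumed;
}
```

With a `std::istringstream` over `"abcdefghij"` and ranges of 3 bytes, a 4-byte gap, and 2 bytes, the two buffers receive `abc` and `hi` from a single pass over the stream.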

…16240) Summary: `insertTableHandle_->writerOptions()` returns a shared `WriterOptions` object. Previously, `memoryPool` was only set when null, so after writer 0 initialized it, later writers reused writer 0's pool. As a result, all writers' memory gets attributed to writer 0's pool while other writers' pools show no usage, making it impossible to identify which writer is consuming memory. Fix: set `memoryPool` unconditionally, matching the existing pattern for `nonReclaimableSection`. Pull Request resolved: facebookincubator#16240 Reviewed By: jagill Differential Revision: D105008445 Pulled By: mbasmanova fbshipit-source-id: 2987e8239325fcff6b25e7c159682c0f5a88a11b
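The attribution problem comes from mutating a shared options object only when the field is still null. The sketch below reproduces it with stand-in types; `MemoryPool`, `WriterOptions`, and the three functions are simplified, hypothetical versions, not the Velox classes.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Simplified stand-ins for the real types.
struct MemoryPool {
  std::string name;
};
struct WriterOptions {
  std::shared_ptr<MemoryPool> memoryPool;
};

// Old behavior: the first writer's pool sticks on the shared options.
void configureIfNull(WriterOptions& shared, std::shared_ptr<MemoryPool> pool) {
  if (shared.memoryPool == nullptr) {
    shared.memoryPool = std::move(pool);
  }
}

// Fixed behavior: every writer installs its own pool.
void configureAlways(WriterOptions& shared, std::shared_ptr<MemoryPool> pool) {
  assert(pool != nullptr);
  shared.memoryPool = std::move(pool);
}

// Returns the pool name each writer ends up using when all writers go
// through the same shared options object.
template <typename Configure>
std::vector<std::string> poolsSeenByWriters(Configure configure, int numWriters) {
  WriterOptions shared;
  std::vector<std::string> seen;
  for (int i = 0; i < numWriters; ++i) {
    configure(
        shared,
        std::make_shared<MemoryPool>(MemoryPool{"writer" + std::to_string(i)}));
    seen.push_back(shared.memoryPool->name);
  }
  return seen;
}
```

With `configureIfNull`, three writers all report `writer0`; with `configureAlways`, they report `writer0`, `writer1`, and `writer2`.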

…ookincubator#17505)

Summary:
Pull Request resolved: facebookincubator#17505

## Context

Velox PRs fail Netlify deploy preview builds (e.g. https://app.netlify.com/projects/meta-velox/deploys/69fd4318582cb600086ecbd9) because `fbcode/velox/public_tld/website/yarn.lock` contains URLs pointing at `registry.facebook.net`. That host is Meta's internal Metaccio proxy and is not reachable from external CI runners (Netlify, GitHub Actions, external contributors), so `yarn install` fails before the docs site can build.

## Motivation

D104280800 added bidirectional yarn lockfile URL rewriting hooks. Projects opt in by dropping a `.rewrite-lockfile.fb` marker file in the directory where `yarn install` runs. Once opted in, yarn:
- WRITE: rewrites 3P resolved URLs from registry.facebook.net to registry.yarnpkg.com before saving yarn.lock (lockfile on disk is portable to OSS).
- READ: rewrites them back to registry.facebook.net in memory so internal fetches still go through Metaccio.

1P scoped packages (rootfoo/*, nest/*, etc.) are never rewritten. The Velox website only depends on 3P packages so this cleanly applies.

## This diff

- Adds `fbcode/velox/public_tld/website/.rewrite-lockfile.fb` opt-in marker (path-mapped to `website/.rewrite-lockfile.fb` in the facebookincubator/velox GitHub mirror).
- Re-runs `yarn install` from that directory, which triggers the experimentalLockfileWriteHook and replaces 1175 `registry.facebook.net` URLs with `registry.yarnpkg.com` URLs in `yarn.lock`. The reformatted layout (unquoted field names, sorted keys) is yarn 1.22.21's standard re-serialization output.

After ShipIt syncs to GitHub, Netlify (and any external `yarn install`) will resolve packages from the public yarn registry and CI will pass. Internal devs continue to fetch via Metaccio thanks to the read hook.

Reviewed By: pratikpugalia

Differential Revision: D105031443

fbshipit-source-id: 09135f7ee605c4d294a1e0df097b4cd814348dab

…7339) Summary: Pull Request resolved: facebookincubator#17339 Add comprehensive DWRF and Nimble end-to-end extraction tests through HiveDataSource table scan pipeline. Covers MapKeys, MapValues, Size, MapKeyFilter, StructField, ArraySize, nested chains, multiple extractions, multi-format splits (DWRF+TEXT+DWRF), and IO reduction validation. Reviewed By: apurva-meta Differential Revision: D97671762 fbshipit-source-id: c3c08b61cb8b8cd8c92d798d673ce2ee1eabdb25

…ubator#17486)

Summary:
Pull Request resolved: facebookincubator#17486

Add non-index join condition support to HiveIndexSource: join conditions on columns that are neither index columns nor partition columns (e.g., bucket columns in colocated joins) are now applied as post-read equality filters.

Key changes:
- Rename initIndexLookupConditions() to initConditions(). Extend it to categorize all join conditions in a single pass: index conditions are pushed to indexLookupConditions_, and non-index equality conditions are validated, resolved to column indices, and stored in nonIndexConditions_ for post-read filtering. Non-index condition columns are added to the reader output type if not already projected.
- Add NonIndexCondition struct and applyNonIndexConditions() which compares reader output column values against probe-side values using SQL null semantics (either null means not equal).
- Add applyNonIndexCondition() wrapper in HiveLookupIterator, called before evaluateRemainingFilter() in getOutput().
- Store original non-index conditions as nonIndexLookupConditions_ for inspection.

Reviewed By: xiaoxmeng

Differential Revision: D104772751

fbshipit-source-id: 0f95d02cdf66925063aa363ea315a08cd601c0c8
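The "either null means not equal" rule is easy to get wrong in C++, because `std::optional`'s own `operator==` treats two empty optionals as equal. A minimal sketch of a post-read equality filter with SQL null semantics, using hypothetical names (`sqlEquals`, `applyEqualityFilter`) and `std::optional<std::int64_t>` as a stand-in for a nullable column value:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

using NullableInt = std::optional<std::int64_t>;

// SQL equality for filtering: a NULL on either side never matches.
// Note that plain `a == b` would report two NULLs as equal.
bool sqlEquals(const NullableInt& a, const NullableInt& b) {
  return a.has_value() && b.has_value() && *a == *b;
}

// Returns the indices of rows whose reader value equals the probe-side value.
std::vector<std::size_t> applyEqualityFilter(
    const std::vector<NullableInt>& readerValues,
    const std::vector<NullableInt>& probeValues) {
  assert(readerValues.size() == probeValues.size());
  std::vector<std::size_t> kept;
  for (std::size_t i = 0; i < readerValues.size(); ++i) {
    if (sqlEquals(readerValues[i], probeValues[i])) {
      kept.push_back(i);
    }
  }
  return kept;
}
```

For reader values `{1, NULL, 3, NULL}` against probe values `{1, 2, NULL, NULL}`, only row 0 survives; the NULL/NULL pair in the last row is dropped.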

Summary: Pull Request resolved: facebookincubator#17510 Add velox::serializer::KeyDecoder to decode KeyEncoder-compatible composite keys for all supported scalar key column types. Wire the decoder into Buck/CMake and add round-trip plus malformed-input coverage. Reviewed By: xiaoxmeng Differential Revision: D101708729 fbshipit-source-id: 4d5b2e39d99df7c287bba58f1f4a9a01bcac3079

Summary: Pull Request resolved: facebookincubator#17480 TypedDistinctAggregations::extractValues() loops over every group to extract distinct rows from the accumulator and add them to the aggregation function. It currently creates a new vector to hold the distinct rows on every iteration, which can cause a large number of memory allocations when the grouping keys have high cardinality. This diff makes it reuse the vector across groups. Reviewed By: Yuhta Differential Revision: D104498389 fbshipit-source-id: f9af8ccec13ed4752e878e80dc70b7e1c74daaca
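The effect of hoisting the scratch buffer out of the per-group loop can be shown with plain `std::vector`. This is an analogy rather than the Velox code: `countBufferGrowths` is a hypothetical helper that counts how often the buffer's capacity changes, which serves as a proxy for heap allocations.

```cpp
#include <cstddef>
#include <vector>

// Copies each group into a scratch buffer and counts capacity changes.
// With reuse == true, one buffer is shared across groups and only cleared;
// with reuse == false, a fresh buffer is created for every group.
std::size_t countBufferGrowths(
    const std::vector<std::vector<int>>& groups,
    bool reuse) {
  std::size_t growths = 0;
  std::vector<int> shared; // Hoisted out of the loop.
  for (const auto& group : groups) {
    std::vector<int> fresh;
    std::vector<int>& buffer = reuse ? shared : fresh;
    buffer.clear(); // Drops the elements but keeps the capacity.
    const std::size_t before = buffer.capacity();
    buffer.insert(buffer.end(), group.begin(), group.end());
    if (buffer.capacity() != before) {
      ++growths;
    }
  }
  return growths;
}
```

For three groups of sizes 4, 2, and 3, the shared buffer grows once while the fresh-per-group variant grows three times. The gap widens with the number of groups, which is the high-cardinality case the commit targets.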

…r#17482) Summary: This fix adds a call to the base Operator::initialize method, which sets up the reclaimer, tracer, and initialization state. This is consistent with how other operators implement their initialize method. Pull Request resolved: facebookincubator#17482 Reviewed By: kgpai Differential Revision: D105084573 Pulled By: mbasmanova fbshipit-source-id: bd7a005fa9558df782c2fd6ccc7d27985aba4cd9

…bookincubator#16349) Summary: closes facebookincubator#16309 Pull Request resolved: facebookincubator#16349 Reviewed By: amitkdutta Differential Revision: D105075676 Pulled By: mbasmanova fbshipit-source-id: ef1a57d7007218e6183ac06aab2709fd2022d4ae

…n::apply calls (facebookincubator#17508)

Summary:
Pull Request resolved: facebookincubator#17508

Introduces a global listener registry for observing VectorFunction::apply calls during expression evaluation, following the SplitListeners pattern (velox/exec/Task.h).

Listener factories are registered globally via registerVectorFunctionListenerFactory(). During expression compilation, ExprCompiler iterates all registered factories, calling create() once per resolved scalar function with the function name, VectorFunctionMetadata, and QueryConfig. Each factory independently decides whether to observe that function by returning a VectorFunctionListeners struct (containing pre and/or post listeners) or std::nullopt to skip. Returned listeners are stored on the Expr node and invoked via invokeApplyWithListeners() in both applyFunction() and evalSimplifiedImpl(). Special forms (AND, OR, CAST, etc.) are not subject to listening.

Key design decisions:
- Global static registry (not per-query): multiple factories can be registered independently without coordination, each observing different concerns (monitoring, access control, auditing).
- Factory receives QueryConfig: per-query behavior control without per-query factory instances. A factory can return std::nullopt based on config flags.
- Listeners are shared_ptrs so a single listener instance can be shared across multiple Expr nodes.
- Post-listener has finally semantics: always runs after apply, even if apply throws. Receives std::exception_ptr (nullptr on success) enabling RAII-like cleanup.
- Listener exceptions propagate as-is (not wrapped as user errors). Only VectorFunction::apply non-Velox exceptions are wrapped via VELOX_USER_FAIL.
- Both listeners receive the expression name as the first argument for attribution.

Reviewed By: kevinwilfong

Differential Revision: D104325777

fbshipit-source-id: 5f2461e92a6ae241acf49fefad7d8225942ee8b1
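The registry, the opt-out via `std::nullopt`, and the finally-style post-listener can be sketched in a few dozen lines. Everything below is a simplified illustration with hypothetical names (`Listeners`, `registerListenerFactory`, `collectListeners`, `invokeWithListeners`); the factory here receives only the function name, whereas the commit describes one that also receives metadata and the query config.

```cpp
#include <exception>
#include <functional>
#include <memory>
#include <optional>
#include <string>
#include <utility>
#include <vector>

using PreListener = std::function<void(const std::string& name)>;
using PostListener =
    std::function<void(const std::string& name, std::exception_ptr error)>;

// shared_ptr members so one listener instance can serve many expressions.
struct Listeners {
  std::shared_ptr<PreListener> pre;
  std::shared_ptr<PostListener> post;
};

// A factory returns std::nullopt to skip a function.
using ListenerFactory =
    std::function<std::optional<Listeners>(const std::string& functionName)>;

std::vector<ListenerFactory>& listenerFactories() {
  static std::vector<ListenerFactory> registry; // Global, not per-query.
  return registry;
}

void registerListenerFactory(ListenerFactory factory) {
  listenerFactories().push_back(std::move(factory));
}

// "Compile time": ask every factory once per resolved function.
std::vector<Listeners> collectListeners(const std::string& functionName) {
  std::vector<Listeners> out;
  for (const auto& factory : listenerFactories()) {
    if (auto listeners = factory(functionName)) {
      out.push_back(*listeners);
    }
  }
  return out;
}

// "Evaluation time": post-listeners always run, then any error is rethrown.
void invokeWithListeners(
    const std::string& name,
    const std::vector<Listeners>& listeners,
    const std::function<void()>& apply) {
  for (const auto& l : listeners) {
    if (l.pre) {
      (*l.pre)(name);
    }
  }
  std::exception_ptr error;
  try {
    apply();
  } catch (...) {
    error = std::current_exception();
  }
  for (const auto& l : listeners) {
    if (l.post) {
      (*l.post)(name, error);
    }
  }
  if (error) {
    std::rethrow_exception(error);
  }
}
```

Capturing the exception as a `std::exception_ptr` and rethrowing it after the post-listeners gives the same ordering as a `finally` block while still letting each listener see whether the call failed.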

…#17511) Summary: Fix the deletion vector writer and reader to always write/read the offset header for bitmaps without run containers, rather than only when numContainers >= 4, consistent with the [roaring bitmap spec](https://github.com/RoaringBitmap/RoaringFormatSpec#3-offset-header). Pull Request resolved: facebookincubator#17511 Reviewed By: pratikpugalia Differential Revision: D105096426 Pulled By: mbasmanova fbshipit-source-id: d5f227a29d918c5f2314d97ae42130d480e66a41
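The rule from the linked format spec can be written as a single predicate. The constant values below (12346, 12347, and the threshold of 4) are taken from the Roaring format specification as I understand it; the function and constant names are hypothetical and are not the ones used in the Velox reader or writer.

```cpp
#include <cstdint>

// Cookie values and threshold as given in the Roaring format specification.
constexpr std::uint32_t kSerialCookieNoRunContainer = 12346;
constexpr std::uint32_t kSerialCookie = 12347;
constexpr std::uint32_t kNoOffsetThreshold = 4;

// Whether a serialized bitmap carries the offset header.
// - No-run format: always.
// - Run format (low 16 bits of the cookie equal 12347): only when there are
//   at least kNoOffsetThreshold containers.
constexpr bool hasOffsetHeader(std::uint32_t cookie, std::uint32_t numContainers) {
  return cookie == kSerialCookieNoRunContainer ||
      ((cookie & 0xFFFF) == kSerialCookie && numContainers >= kNoOffsetThreshold);
}
```

Applying the threshold to the no-run format, which is the behavior the commit corrects, would make a small bitmap unreadable by any reader that follows the spec, since the reader would expect offsets that the writer never emitted.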

Resolve merge conflicts in ExpressionEvaluator.cpp and align the function registry to main's model (flat CudfFunctionSpec, bool overwrite parameter). Remove duplicate LogicalFunction class and drop the now-invalid DateTruncFunction::canEvaluate argument. Also fix: veloxToCudfTypeId -> veloxToCudfDataType in CudfReduce, CudfGroupby, and CudfNestedLoopJoin; finalMask/finalNullCount -> nullMask/nullCount in DecimalExpressionKernels.cu; and remove the undeclared expressionSpansBothSides call in CudfHashJoin.

copy-pr-bot · 2026-05-15T18:16:05Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Dario Pavlovic and others added 30 commits April 17, 2026 16:54

fbcode/velox/expression/ComplexViewTypes.h (facebookincubator#16987)

f666b03

Summary: Pull Request resolved: facebookincubator#16987 Reviewed By: srsuryadev, rjaber Differential Revision: D97773692 fbshipit-source-id: 7e910b042ad5248feaaf0438552a944fe8f8596c

fbcode/velox/serializers/PrestoSerializerEstimationUtils.cpp (faceboo…

bd52780

…kincubator#17240) Summary: Pull Request resolved: facebookincubator#17240 Reviewed By: srsuryadev Differential Revision: D98150572 fbshipit-source-id: 0bbb1ee8e05f7808e1b2dc9c15b0e14d0db0f352

zacw7 and others added 21 commits May 12, 2026 16:20

Merged main into cudf-window-operator

3aac072

Merged main into tpcds-staging

d92c38b

shrshi requested review from a team, bdice, devavret, karthikeyann and mhaseeb123 as code owners May 15, 2026 18:15

shrshi changed the base branch from velox-cudf to IBM-techpreview May 15, 2026 18:19

shrshi added 2 commits May 15, 2026 18:58

Merged cudf-window-operator into tpcds-staging

7460d18

Fix build issue

e80dc6f

shrshi wants to merge 1544 commits into rapidsai:IBM-techpreview from shrshi:tpcds-staging


20 participants
