Update velox-cudf from upstream main by bdice · Pull Request #112 · rapidsai/velox

bdice · 2026-05-19T18:14:57Z

Summary

Merge facebookincubator/velox main into rapidsai/velox velox-cudf
Prefer upstream cuDF implementation, dependency, and setup-script changes during conflict resolution
Preserve RAPIDS CI runner customizations while updating Linux CI to split cuDF build and GPU test execution

DO NOT SQUASH-MERGE

Validation

pre-commit run --all-files
yamllint and action-validator passed for active workflows
CI will validate the RAPIDS runner split build/test flow

…16920) Summary: Implements GPU version of EnforceSingleRow to maintain GPU pipeline continuity for scalar subqueries. Validates row count using GPU metadata without host↔device data transfer. related to: facebookincubator#15772 closing: facebookincubator#16888 ### Performance Benchmarks (SF100, 5 iterations) All queries show **no significant performance difference** between GPU and CPU implementations, which is expected for this lightweight operator. The benefit is maintaining GPU pipeline continuity (avoiding GPU↔CPU transfers), not faster execution of the check itself. | Query | GPU mean±std | CPU mean±std | Diff | t-stat | 95% CI | Significant? | |-------|--------------|--------------|------|--------|--------|--------------| | Q6 (1 occ) | 1.738±0.048s | 1.736±0.034s | +0.1% | 0.076 | [-0.051, +0.055]s | NO | | Q14 (3 occ) | 11.490±0.309s | 11.190±0.165s | +2.7% | 1.914 | [-0.013, +0.613]s | NO | | Q44 (2 occ) | 7.294±0.358s | 7.102±0.135s | +2.7% | 1.121 | [-0.151, +0.535]s | NO | | Q54 (2 occ) | 4.008±0.443s | 3.818±0.036s | +5.0% | 0.956 | [-0.208, +0.588]s | NO | | Q58 (3 occ) | 3.806±0.123s | 3.750±0.053s | +1.5% | 0.936 | [-0.064, +0.176]s | NO | **Methodology**: Welch t-test with 95% confidence intervals. "Not significant" means |t| < 2.0 (p ≥ 0.05), indicating performance differences are within statistical noise. **Test environment**: SF100 INT32 data (~43GB), local NVMe storage, NVIDIA RTX PRO 6000 Blackwell, 5 independent runs per mode. Pull Request resolved: facebookincubator#16920 Reviewed By: tanjialiang Differential Revision: D100439821 Pulled By: xiaoxmeng fbshipit-source-id: 1a982e43a4aca83bb025e2f181ddbc3544346aff

…acebookincubator#17090) Summary: This makes S3Config available for use in downstream user such as FileWrite and ReadFile to support options needed to support various S3 backends and allow more granular configuration. Pull Request resolved: facebookincubator#17090 Reviewed By: tanjialiang Differential Revision: D100439661 Pulled By: xiaoxmeng fbshipit-source-id: 36c5cf19e4d2cc8831b1d4e39629bf845602c0d8

…facebookincubator#16732) Summary: ## Description This PR re-adds the use of `enqueueForDevice` when cudf buffered input data source is being used with the experimental (hybrid scan) reader. Tests have also been updated and verified locally to ensure the `enqueueForDevice` is indeed used. ## Checklist - [x] All new and existing `velox_cudf_xx` tests pass - [x] I am familiar with the contribution guide Pull Request resolved: facebookincubator#16732 Reviewed By: tanjialiang Differential Revision: D100439856 Pulled By: xiaoxmeng fbshipit-source-id: f6fb5afa74ef01d3221df9c512bafa2dbc162307

Summary: ## Description This PR updates the `update-cudf-deps.sh` to not silently exit without any message or effect when, for example, `read` fails due to GH rate limiting and throws a verbose message such as ```bash Error: Failed to fetch VERSION file from branch main Try: `gh auth login` (or set GH_TOKEN) before running this script Create a GitHub personal access token at https://github.com/settings/tokens/new (Select scope: public_repo) ``` Pull Request resolved: facebookincubator#16332 Reviewed By: tanjialiang Differential Revision: D100439867 Pulled By: xiaoxmeng fbshipit-source-id: 467a7b6c637a70c7f5a4bd3343e06efd10d518fb

Summary: ### Problem Build error for T = double: ``` /home/vpcuser/velox/./velox/experimental/cudf/expression/AstUtils.h: In function ‘std::unique_ptr<cudf::scalar> facebook::velox::cudf_velox::makeScalarFromValue(const facebook::velox::TypePtr&, T, bool, std::optional<cudf::type_id>) [with T = double]’: /home/vpcuser/velox/./velox/experimental/cudf/expression/AstUtils.h:79:41: error: call to ‘cudf::get_default_stream’ declared with attribute error: cudf default stream argument used. Pass stream explicitly. 79 | auto stream = cudf::get_default_stream(); | ~~~~~~~~~~~~~~~~~~~~~~~~^~ /home/vpcuser/velox/./velox/experimental/cudf/expression/AstUtils.h:80:8: error: call to ‘cudf::get_current_device_resource_ref’ declared with attribute error: cudf default memory resource argument used. Pass mr explicitly. 80 | auto mr = cudf::get_current_device_resource_ref(); ``` `makeScalarFromValue()` in `AstUtils.h` calls `cudf::get_default_stream()` and `cudf::get_current_device_resource_ref()`. When any `.cpp` file includes both `CudfNoDefaults.h` and `AstUtils.h` (e.g. `CudfHashJoin.cpp`), these calls hit the `__attribute__((error))` redeclarations that guard against accidental default argument usage. In Release builds (`-O2`), GCC's dead-code elimination removes unused template instantiations before the error fires. In Debug builds (`-O0`), all instantiations are preserved, triggering a hard compile error. ### Solution Replaces the two calls with existing approved alternatives: - `cudf::get_default_stream()` → `cudf::get_default_stream(cudf::allow_default_stream)` (from `CudfDefaultStreamOverload.h`) - `cudf::get_current_device_resource_ref()` → `rmm::mr::get_current_device_resource()` (direct RMM call, same pattern as `get_temp_mr()` in `CudfNoDefaults.h`) Both return identical values at runtime — **zero performance impact**. ### Changes - `velox/experimental/cudf/expression/AstUtils.h`: Update `makeScalarFromValue()` to use poison-safe alternatives Pull Request resolved: facebookincubator#17011 Reviewed By: tanjialiang Differential Revision: D100439740 Pulled By: xiaoxmeng fbshipit-source-id: f1b0462dcad4645adf33ff29b6a810ffc725e290

…acebookincubator#17035) Summary: - Fixes Count aggregation condition to run all count aggregations on GPU - Adds unit tests to check if all count aggregations run on GPU - Fixes CI build break too. Pull Request resolved: facebookincubator#17035 Reviewed By: tanjialiang Differential Revision: D100439695 Pulled By: xiaoxmeng fbshipit-source-id: 2f05cb59e65e0d98bdaea492bf0c6576b8a9cb6d

…acebookincubator#17134) Summary: Pull Request resolved: facebookincubator#17134 Introduces the PaimonSplitReader framework for reading Paimon tables in Velox/Presto. **PaimonSplitReader** (extends FileSplitReader): Multi-file reader for rawConvertible Paimon splits. A single Paimon split (one partition × bucket) can contain multiple data files across LSM levels. PaimonSplitReader receives the full PaimonConnectorSplit, builds FileConnectorSplits internally, and handles sequential file iteration transparently — when a file is exhausted, it resets the base reader and advances to the next non-empty file. Deletion vectors and changelog files are validated upfront and throw VELOX_NYI. **PaimonDataSource** (extends FileDataSource): Refactored to be a pure orchestrator. addSplit() validates the split and selects the read path (rawConvertible vs merge-on-read NYI). createSplitReader() simply passes the PaimonConnectorSplit to PaimonSplitReader. Does not override next() — FileDataSource::next() drives the read loop, getting filter eval, column projection, and stats tracking for free. **PaimonMergeReader**: NYI placeholder for merge-on-read (primary-key tables with rawConvertible=false). Includes MergeEngine interface stub for future deduplicate/partial-update strategies. Currently working: append-only and primary-key tables with rawConvertible=true, single and multi-file splits, partition keys. NYI: deletion vectors, changelog files, merge-on-read, _rowkind column, streaming reads. Reviewed By: xiaoxmeng Differential Revision: D99660568 fbshipit-source-id: 632ef8d0cfff39867d9fe95633d2e0f877f2a8e1

…17133) Summary: Pull Request resolved: facebookincubator#17133 CONTEXT: The IndexLookupJoin operator uses a complex stat splitting mechanism with special-cased lazy vector stat handling. The nimble index reader creates lazy vectors but immediately loads them, wasting the wrap/unwrap overhead. WHAT: 1. Rename IndexLazyStatWriter to IndexStatWriter and simplify it to accumulate all index source related stats into a local synchronized map (indexSourceRuntimeStats_). The stat splitter now simply dumps this map as the IndexSource node's runtime stats, removing the complex stat name matching and erasing logic. 2. Move all index source stat recording (input/output stats, connector stats) to go through the IndexStatWriter's map, keeping them cleanly separated from the join operator's own stats. 3. Add generateLazyChildren parameter to nimble StructColumnReaderBase/StructColumnReader/buildColumnReader, and pass false in SelectiveNimbleIndexReader. This avoids creating lazy vectors that are immediately loaded, eliminating the wrap/unwrap overhead. The loadedVector() call is also removed since it's no longer needed. Reviewed By: zacw7 Differential Revision: D100443040 fbshipit-source-id: 6d87c7dceea2b9f756e2c70d3abad24fbaa743a8

Summary: Pull Request resolved: facebookincubator#17138 CONTEXT: Index lookup needs to distinguish between point lookups (exact key match) and range lookups (lower/upper bounds). Dense index only supports point lookups while cluster index supports both. WHAT: Add isPointLookup() method to EncodedKeyBounds that returns true when both bounds are present and equal. Add unit test coverage. Reviewed By: tanjialiang, han-yan01 Differential Revision: D100556652 fbshipit-source-id: e478c1ee7062f2f4fee2c3f1553c6e1bb39676a0

…or#17135) Summary: Follow up of facebookincubator#17011, as the comment facebookincubator#17011 (comment) Pull Request resolved: facebookincubator#17135 Reviewed By: xiaoxmeng Differential Revision: D100559104 Pulled By: pratikpugalia fbshipit-source-id: 51adc7ed44ee1b4bcecbbad0f6940481ec9073eb

…emantics (facebookincubator#17137) Summary: Pull Request resolved: facebookincubator#17137 Add a `nullAsValue` boolean flag to `HashJoinNode` that makes join keys use IS NOT DISTINCT FROM semantics where NULL equals NULL. This is different from the existing `nullAware` flag which implements IN/NOT IN three-valued logic. The two flags are mutually exclusive. When `nullAsValue` is true: - HashBuild stores rows with NULL keys in the hash table (uses `HashTable<false>`) - HashBuild does not filter out NULL key rows during `addInput` - HashProbe does not filter out NULL key rows during probe The hash table comparison logic (`HashTable<false>::compareKeys`) already treats NULLs as equal values via `CompareFlags::NullHandlingMode::kNullAsValue`, so no changes are needed in the comparison path. This flag is needed to correctly implement SQL set operations (EXCEPT, INTERSECT, EXCEPT ALL, INTERSECT ALL) which require NULL=NULL per the SQL standard. Part of facebookincubator/axiom#1235 Reviewed By: xiaoxmeng Differential Revision: D100545193 fbshipit-source-id: f5cfdf3ef975a1fa8a57c0cb129a426f8cf6fcd1

Summary: Pull Request resolved: facebookincubator#17084 Introduce stub for Avro format Reviewed By: raymondlin1 Differential Revision: D100118684 fbshipit-source-id: 3280d099d5954a710b83623fb84b8f6c29d20d9e

…acebookincubator#17123) Summary: Pull Request resolved: facebookincubator#17123 CONTEXT: D71324408 introduced processedStrides_ but incremented it once per stripe load in loadCurrentStripe(). This undercounts when a stripe has multiple strides, producing incorrect stride-level statistics. WHAT: Move processedStrides_ increment from loadCurrentStripe() (once per stripe) to nextRowNumber() (once per stride read, guarded by strideSize > 0 and stride boundary alignment). This gives an accurate count of strides actually processed. Reviewed By: Yuhta Differential Revision: D100236453 fbshipit-source-id: 3e970cfa6c6b0d59335920df0b30882700b90236

…Registry lookup (facebookincubator#17151) Summary: Pull Request resolved: facebookincubator#17151 TableScan, TableWriter, and IndexLookupJoin all called ConnectorRegistry::tryGet(connectorId) (global-only) in their constructors, ignoring the query-scoped API that was already implemented and tested. This meant per-query connector registry overrides attached to QueryCtx were silently bypassed during execution. Change each operator to use ConnectorRegistry::tryGet(queryCtx, connectorId), which checks for a per-query registry override first, then falls back to the global registry. The QueryCtx is available via driverCtx->task->queryCtx() in all three constructors. When no override is set (the default), behavior is identical to the old global-only call. Reviewed By: mbasmanova Differential Revision: D100662333 fbshipit-source-id: c939b60ddcebb7e9a9674a6d08d37a488f67f68d

Summary: This PR adds `CudfMarkDistinct`, a GPU implementation of the Velox `MarkDistinct` operator, as part of the cuDF GPU acceleration module (`velox/experimental/cudf/`). `MarkDistinct` annotates each row with a boolean indicating whether it is the first occurrence of its key(s) within the partition. It is used in query plans that require deduplication markers (e.g. `SELECT DISTINCT` rewrites). Related to: facebookincubator#15772 ### Files changed - `velox/experimental/cudf/exec/CudfMarkDistinct.h` — operator declaration - `velox/experimental/cudf/exec/CudfMarkDistinct.cpp` — operator implementation - `velox/experimental/cudf/exec/OperatorAdapters.cpp` — registration of `MarkDistinctAdapter` - `velox/experimental/cudf/CudfConfig.h` — `enableMarkDistinct` and `markDistinctMaxKeys` config fields - `velox/experimental/cudf/tests/MarkDistinctTest.cpp` — 12 test cases covering single/multi-batch, composite keys, nulls, strings, empty batches, and memory limit ### Benchmark Here are the current results for plans for which MarkDistinct appears (SF100), these are run using karthikeyann's branch here: https://github.com/karthikeyann/velox/tree/fea-TPCDS_plans that enables running TPC-DS workloads: | Query | GPU cold | CPU cold | GPU warm±std | CPU warm±std | Speedup | t-stat | 95% CI (GPU-CPU) | Significant? | |-------|----------|----------|--------------|--------------|---------|--------|------------------|--------------| | Q16 | 3.940s | 3.810s | 3.786±0.037s | 3.788±0.081s | -0.1% | -0.050 | [-0.082, +0.078]s | NO (n=5) | | Q28 | 14.620s | 12.670s | 13.196±0.079s | 12.804±0.042s | +3.1% | 9.825 | [+0.312, +0.472]s | YES (n=5) | | Q94 | 1.940s | 1.960s | 1.932±0.031s | 1.910±0.022s | +1.2% | 1.283 | [-0.012, +0.056]s | NO (n=5) | | Q95 | 2.400s | 2.400s | 2.300±0.021s | 2.334±0.056s | -1.5% | -1.262 | [-0.088, +0.020]s | NO (n=5) | With these configs: ```json { "benchmark_config": { "scale_factor": "SF100", "data_path": "/velox/data_local/tpcds/sf100/tpcds_sf100", "plan_path": "/velox/VeloxPlans/presto/tpcds/sf100/plans", "data_format": "parquet", "stats_runs": 5, "warmup_runs": 1, "execution": { "num_drivers": 4, "num_splits_per_file": 10, "num_io_threads": 8 }, "cudf": { "cudf_enabled": true, "cudf_memory_resource": "pool", "cudf_memory_percent": 85, "cudf_gpu_batch_size_rows": 1000000 }, "hardware": { "gpu": "NVIDIA RTX PRO 6000 Blackwell Server Edition", "gpu_memory_mib": 97887, "cuda": "12.8" }, "operator": "mark_distinct", "queries": ["Q16", "Q28", "Q94", "Q95"] } } ``` ### Known limitation: residual overhead on key-growth batches The current implementation rebuilds `seenFilter_` and copies all `seenKeys_` via `cudf::concatenate` on every batch that introduces new distinct keys. This is O(D) per such batch, where D is the total distinct keys seen so far. For queries with many MarkDistinct nodes operating on a large key space (e.g. Q28 with 6 nodes on SF100), this accounts for the remaining +2.8% gap vs CPU. #### Possible optimization: amortized doubling with a pending buffer Accumulate new keys in a small `pendingKeys_` staging table and only consolidate into `seenKeys_` (and rebuild `seenFilter_`) when `pendingKeys_` reaches the size of `seenKeys_`. This reduces total copy work from O(D × B_new) to O(D) amortized. Correctness requirement: the anti-join must probe **both** `seenFilter_` and a secondary `pendingFilter_` built over `pendingKeys_`. Without this, keys in the pending buffer between merges are invisible to the filter and can be falsely marked as new. For example: batch 1 sees key A (→ seenKeys), batch 2 sees key B (→ pendingKeys, no merge yet), batch 3 sees B again — if only `seenFilter_` is probed, B gets marked as first-occurrence twice. Trade-offs: - Extra probe per batch against `pendingFilter_` (small table, cheap) - Peak memory at merge time is ~3D (seen + pending + merged result) vs ~2D with the current approach — higher OOM risk on GPU - In cudf-Velox, batches are typically large (up to max int rows), so B_new is small and the current approach may be sufficient for most workloads Not implemented in this PR to keep scope contained. Documented here so the optimization can be picked up if key-growth-heavy queries become a pain point. See discussion [here](facebookincubator#16974 (comment)). Pull Request resolved: facebookincubator#16974 Reviewed By: pratikpugalia Differential Revision: D100670746 Pulled By: peterenescu fbshipit-source-id: dafef1412818db97b0d12e3405b68abaf1c49d88

…ubator#17050) Summary: Pull Request resolved: facebookincubator#17050 Relax the format checks in FileIndexReader::createIndexReader() and HiveIndexSource::addSplits() to accept SST file format alongside NIMBLE and FLUX. The Nimble-specific RowReaderOptions (setIndexEnabled, setEagerFirstStripeLoad) are now guarded by != SST rather than == NIMBLE, so new stripe-based formats get the correct defaults automatically. Also adds 9 end-to-end tests exercising SST index join through the full HiveIndexSource -> FileIndexReader -> SstIndexReader -> rodos stack: pointLookup, pointLookupWithMisses, prefixLookup, rangeScan, multipleProbeBatches, outputColumnSelection, remainingFilter, remainingFilterAllFiltered, allMisses. Reviewed By: zacw7, xiaoxmeng Differential Revision: D97697248 fbshipit-source-id: 5002e5b8ee59276c11a7d89c9f85dc62ef7e4412

facebookincubator#17150) Summary: Fix operator hang caused by missing `std::move()` in `getOutput()`. `CudfEnforceSingleRow::getOutput()` returned `input_` without `std::move()`, leaving `input_` non-null after returning. Since `isFinished()` checks `noMoreInput_ && input_ == nullptr`, it never returned true, causing the driver to poll the operator indefinitely. Pull Request resolved: facebookincubator#17150 Reviewed By: pratikpugalia Differential Revision: D100671193 Pulled By: peterenescu fbshipit-source-id: 08832cd2238a2aa71e1cae0e7522c78ee8b5fee6

Summary: Pull Request resolved: facebookincubator#17131 Add s2geometry 0.12.0 installation to Velox's CI setup scripts so that Docker images can be rebuilt with s2geometry as a system package. Changes: - Add install_s2geometry function to setup-common.sh (downloads, patches, and builds s2geometry from source) - Add s2geometry install step to all platform setup scripts (Ubuntu, CentOS 9, Fedora, macOS) - Add S2GEOMETRY_VERSION=0.12.0 to setup-versions.sh Reviewed By: kgpai, pratikpugalia Differential Revision: D100426816 fbshipit-source-id: 4295cb63bda184444e0cbd837bc32713a329e6ef

…ookincubator#17094) Summary: Pull Request resolved: facebookincubator#17094 Refactor `makeFieldNotFoundErrorMessage` into a `formatAvailableFields` helper and pass the template string and arguments directly to VELOX_USER_FAIL, so that `messageTemplate()` properly captures the static template for programmatic error categorization. Previously, the error message was fully pre-formatted into a single string before being passed to the macro, which meant the template was the formatted message itself. Reviewed By: peterenescu Differential Revision: D100205606 fbshipit-source-id: 86b2a1eeb5d7ebff56920e2ace5ad6ba843ef360

…acebookincubator#17096) Summary: Pull Request resolved: facebookincubator#17096 Fix callsites that pre-format error messages (via `fmt::format`, `std::stringstream`, or string concatenation) before passing them to VELOX_FAIL/VELOX_USER_FAIL macros. Pre-formatting causes the macro to take the single-arg path where the formatted message becomes the template, defeating template-based error categorization. The fix passes format strings and arguments directly to the macros so that `messageTemplate()` properly captures the static template. Reviewed By: peterenescu Differential Revision: D99974595 fbshipit-source-id: 868cea2cfac1dc2283e5f6eb3354c932b8fb4282

…ctions (facebookincubator#17097) Summary: Pull Request resolved: facebookincubator#17097 Fix callsites that pre-format error messages (via `fmt::format`, `std::stringstream`, or string concatenation) before passing them to VELOX_FAIL/VELOX_USER_FAIL/ VELOX_UNSUPPORTED macros. Pre-formatting causes the macro to take the single-arg path where the formatted message becomes the template, defeating template-based error categorization. The fix passes format strings and arguments directly to the macros so that `messageTemplate()` properly captures the static template. Reviewed By: peterenescu Differential Revision: D100205508 fbshipit-source-id: 1556904f44e3bcd6fc5c53b3569db7285ae9fe57

facebookincubator#17066) Summary: Pull Request resolved: facebookincubator#17066 DistinctAggregations used to assume all arguments are columns and throw when there are literal arguments. This assumption does not always hold. This diff fixes this issue by allowing DistinctAggregations to handle literal arguments. Literal arguments are omitted during the deduplication and added back before passing the deduplicated inputs to the aggregation function. This diff introduces a new ConstantDistinctAggregations class to handle the case when all distinct arguments are constants. In this situation, deduplication is unnecessary since there is at most one distinct value. ConstantDistinctAggregations tracks a boolean flag per group to record whether the constant has been seen, and defers the actual feeding of the constant value to the aggregation function until extraction time, where the actual aggregation result is computed once and broadcast to all groups that have seen inputs. These aggregations cannot always be constant-folded by the planner because they may depend on per-group conditions such as FILTER (WHERE ...) clauses. This diff fixes facebookincubator#17057. Reviewed By: mbasmanova Differential Revision: D99911616 fbshipit-source-id: d1dfeda614b7ecd4631e4958997ee81d9253be46

…acebookincubator#17153) Summary: Pull Request resolved: facebookincubator#17153 The Window Fuzzer's `generateKRowsFrameBound()` function previously used `[INT64_MIN, INT64_MAX]` as the range for random k values 30% of the time. This produced negative frame offsets (e.g. `-5 PRECEDING`) which are semantically invalid — both Presto and Velox reject them. When the fuzzer verifies results against a Presto reference database, these invalid queries cause Presto errors, inflating the 'reference DB failed' count and pushing the verification ratio below the 50% threshold. Analysis of 8 recent Window Fuzzer failures on GitHub CI showed that 75 out of 105 total Presto errors (71%) were caused by negative frame offsets. This was the dominant source of flakiness (8% failure rate on push-to-main). Fix: Change the distribution of generated k values to: - 70%: valid, small range [1, 10] - 20%: valid, full positive range [1, INT64_MAX] - 10%: invalid, full range [INT64_MIN, INT64_MAX] to exercise negative/zero offsets and verify both Velox and the reference DB reject them consistently. This retains coverage of invalid inputs at a low enough rate to avoid pushing the verification ratio below threshold, while ensuring the fuzzer verifies error behavior against the reference DB (requires the error verification in the follow-up diff D100732002). Reviewed By: kagamiori Differential Revision: D100657400 fbshipit-source-id: 1063cc96a764cc15d5a66528dd220ce5870ebdf6

Summary: Pull Request resolved: facebookincubator#17159 Avoid setting weight to 0, as it effectively gives a signal to the scheduler that split has 0 data to process Reviewed By: xiaoxmeng Differential Revision: D100745378 fbshipit-source-id: dcbe770456dc92a1fa180f7b6c184b2c88614162

…acebookincubator#17147) Summary: The clone currently only carries the existing `ReadFileInputStream` and not all original constructor inputs, so rebuilding from ReadFile would not be behavior- equivalent. This PR makes it protected to allow reuse of `input_` (ReadFileInputStream), which retains the internal `fileIoContext_`. Follow-up for: facebookincubator@084f222. Pull Request resolved: facebookincubator#17147 Reviewed By: kgpai Differential Revision: D100708786 Pulled By: peterenescu fbshipit-source-id: 123a6de4c1a7567519762d12e315023aa818bc3f

…gregate.rst (facebookincubator#17001) Summary: Pull Request resolved: facebookincubator#17001 Reordered all function entries in array.rst and aggregate.rst to be in strict alphabetical order, following the approach of D96488148. array.rst: Moved none_match, array_split_into_chunks, array_max_by, arrays_overlap, arrays_union, array_position, l2_norm, array_sum, sequence, and remove_nulls to their correct alphabetical positions. aggregate.rst: Fixed ordering within sections: - General: any_value before arbitrary, geometric_mean before histogram, max/max(n) before max_by, min/min(n) before min_by - Statistical: var_pop and var_samp before variance ___ overriding_review_checks_triggers_an_audit_and_retroactive_review Oncall Short Name: presto_batch Differential Revision: D99121480 fbshipit-source-id: 1f6382aa62f8052236dedc7e868d99b7a4693a01

…ookincubator#17152) Summary: Fixes facebookincubator#17141 Two bugs prevented build error extraction from working: 1. **ANSI escape codes**: The build uses `-fdiagnostics-color=always` which embeds ANSI escape sequences in the log. The grep for `: error:` fails because invisible codes sit between `:` and `error`. Fix: strip codes with `sed` before grepping. 2. **Wrong grep pattern**: The pattern `'^.*:(error|fatal error):'` expected `:error:` with no space, but GCC format is `file:line:col: error: msg` (space before `error`). This means the grep never matched even without ANSI codes. Fix: use `' (error|fatal error):'` instead. Applied to both the adapters and ubuntu-debug extract steps. Pull Request resolved: facebookincubator#17152 Test Plan: - [x] Verified sed strips ANSI codes correctly - [x] Verified grep pattern matches `error:` and `fatal error:` in GCC format - [x] Verified normal build lines don't match - [ ] Validate on next PR with a build failure that BUILD_FAILURE_DETAILS is populated Reviewed By: mbasmanova Differential Revision: D100832340 Pulled By: kgpai fbshipit-source-id: 4d9fc625dfc3eb8983a5bf99c98a8d3b3cc622b3

…CudfTable (facebookincubator#17164) Summary: Fixes facebookincubator#17163. `toCudfTable()` frees Arrow host buffers immediately after `cudf::from_arrow()` returns, but `from_arrow()` uses `cudaMemcpyBatchAsync` (CUDA 13.0+) with `cudaMemcpySrcAccessOrderStream` which defers reading source buffers until the stream reaches each copy. The freed host memory is then read by the CUDA driver's event handler thread, causing a SIGSEGV. Move `stream.synchronize()` into `toCudfTable()` before the Arrow resource release so that all async H2D copies complete while source buffers are still valid. Remove the now-redundant synchronize from `CudfFromVelox::getOutput()`. ## Testing Ran `velox_cudf_tpch_benchmark` Q11 SF100 with `--num_repeats=100 --num_drivers=4` — crashes immediately without the fix, completes all iterations with the fix (verified across multiple runs). All existing Velox-cuDF tests are passing. Pull Request resolved: facebookincubator#17164 Reviewed By: kKPulla Differential Revision: D100832047 Pulled By: peterenescu fbshipit-source-id: eece785d5c69f748e7531801cf0eed525a3167da

…acebookincubator#17185) Summary: The CI Failure Comment workflow (`ci-failure-comment.yml`) fails on fork PRs because the `claude-code-action` OIDC token exchange checks that the triggering actor has write access to the repository. Fork PR authors don't have write access, causing a `401 Unauthorized` error. Fix: pass `github_token: ${{ github.token }}` explicitly so the action uses the workflow's own GITHUB_TOKEN (with the declared permissions) instead of the OIDC flow. **Security:** No change in exposure — the same token is already available to Claude via the `GH_TOKEN` env var. The workflow's token is scoped to exactly the permissions declared in the file (`contents: read`, `pull-requests: write`, `issues: write`, `actions: read`) and is automatically revoked after the job completes. Example failure: https://github.com/facebookincubator/velox/actions/runs/24424455191/job/71354986089 Pull Request resolved: facebookincubator#17185 Test Plan: - Verified fix by pushing branch to upstream and dispatching `workflow_dispatch` against a failed fork PR run (run ID `24422420081`, PR facebookincubator#16048) - Run succeeded: https://github.com/facebookincubator/velox/actions/runs/24426241515 Reviewed By: pratikpugalia Differential Revision: D100885246 Pulled By: kgpai fbshipit-source-id: 134386016611fae1f31e287d1adc3d1d5d8a9dca

…cture (facebookincubator#16934) Summary: Corresponding PR for facebookincubator#16885. Pull Request resolved: facebookincubator#16934 Reviewed By: kKPulla Differential Revision: D100680297 Pulled By: peterenescu fbshipit-source-id: 7e091be403fbc81e74c485c88e2f3d622018a8bb

…ookincubator#17513) Summary: Pull Request resolved: facebookincubator#17513 Add partition key information to the per-batch scan stats callback. Changes: - Add `getPartitionKeys()` virtual method to `DataSource` interface with default empty implementation. `FileDataSource` overrides to return the current split's partition keys. - Add `partitionKeys` pointer field to `ScanBatchEvent`, populated from `dataSource_->getPartitionKeys()` in `TableScan::getOutput()`. - New test `scanBatchCallbackPartitionKeys` verifying partition keys propagate through the callback. Reviewed By: Yuhta Differential Revision: D105112445 fbshipit-source-id: 7655311b32017703b40627b82028545ffe99b92c

…5871) Summary: This PR adds support for the `TZDIR` environment variable in the `tzdb` implementation. Previously, the timezone database discovery logic was limited to standard system paths (e.g., `/usr/share/zoneinfo`, `/usr/share/zoneinfo/uclibc`). This limitation caused issues in large scale data center since if we want to upgrade tzdata dependency for velox, we have to upgrade the system's. With this change, `__libcpp_tzdb_directory` now checks for `TZDIR` first. This aligns Velox's behavior with other standard C++ libraries (like abseil-cpp) and provides users with greater flexibility to configure the timezone database location. Pull Request resolved: facebookincubator#15871 Reviewed By: apurva-meta Differential Revision: D105270088 Pulled By: pedroerp fbshipit-source-id: 1f90acb7f31cd394177b67c9ba6b68e7e4e13f95

) Summary: Added simd::any, simd::none, and simd::all in SimdUtil.h for boolean-only SIMD mask checks. On AArch64, these helpers use xsimd::any/none/all directly. On x86, they preserve the previous behavior by reducing through simd::toBitMask so code generation stays aligned with the original path. On Arm Neon, xsimd::any is one instruction: vmaxvq_u64, whereas simd::toBitMask is two instructions: vandq_u64 + vaddvq_u64. On x86 AVX2, both are only one instruction, and our experiment shows simd::toBitMask gives a better performance on AVX2 due to better further optimization by the compiler. So, this PR only changes what's used outside x86. We have verified that the instructions generated on x86 remain unchanged. Switch the boolean-only toBitMask call sites in: MmapAllocator HashTable VectorHasher-inl KeyEncoder Filter Leaving toBitMask unchanged at sites that still need exact lane masks for bit operations such as ctz, popcount, filtering, or packed-bit writes. This improves performance on Arm for the relevant boolean-only reduction paths, while preserving the previous x86 behavior. Pull Request resolved: facebookincubator#17257 Reviewed By: jagill Differential Revision: D105303951 Pulled By: mbasmanova fbshipit-source-id: b220cbb5d555e9512fc5c4dc394a60c87355822c

…acebookincubator#17525) Summary: Pull Request resolved: facebookincubator#17525 Strobelight profiling (https://fburl.com/scuba/strobelight_services/oeoxrdzz) shows that `__folly_memcpy` via `AlignedBuffer::reallocate` is the top CPU consumer in dai_data/waas, costing $3.6M/yr (16.12% of all memcpy CPU). The root cause is that `MemoryPoolImpl::reallocate` always does alloc + memcpy + free, even when jemalloc could expand the allocation in-place via `::realloc()`. This diff adds a `reallocateBytes` method to `MemoryAllocator` that tries in-place reallocation first. `MallocAllocator` overrides `reallocateBytesWithoutRetry` to use `::realloc()`, which jemalloc can often service without moving data (avoiding the expensive memcpy). Other allocators (e.g., MmapAllocator) fall back to the existing alloc + memcpy + free path via the default implementation. The new path is gated by `--velox_enable_inplace_realloc`, which **defaults to true** so all Velox-using services get the win out of the box. The legacy alloc + memcpy + free path is kept behind the gflag in `MemoryPoolImpl::reallocate` as a per-service safety knob — any service that needs to opt out can override the gflag locally. Once the in-place path has baked across the fleet, the gflag and the legacy branch can be deleted in a follow-up. The change is backward-compatible at the API level: when `::realloc()` cannot expand in-place, the implementation falls back to allocating new memory + memcpy + free (same behavior as today). Reviewed By: tanjialiang Differential Revision: D101689101 fbshipit-source-id: 84e7c9c4005c3efb904ad5d0ca3af07ee663c074

…17532) Summary: Fix ``` /__w/velox/velox/velox/experimental/cudf/connectors/hive/CudfSplitReader.cpp: In constructor 'facebook::velox::cudf_velox::connector::hive::CudfSplitReader::CudfSplitReader(std::shared_ptr<facebook::velox::cudf_velox::connector::hive::CudfHiveConnectorSplit>, std::shared_ptr<const facebook::velox::connector::hive::HiveTableHandle>, const facebook::velox::RowTypePtr&, const std::vector<std::__cxx11::basic_string<char> >&, facebook::velox::FileHandleFactory*, folly::Executor*, const facebook::velox::connector::ConnectorQueryCtx*, const std::shared_ptr<facebook::velox::cudf_velox::connector::hive::CudfHiveConfig>&, const std::shared_ptr<facebook::velox::io::IoStatistics>&, const std::shared_ptr<facebook::velox::IoStats>&, bool, const cudf::ast::expression*)': /__w/velox/velox/velox/experimental/cudf/connectors/hive/CudfSplitReader.cpp:77:7: error: no matching function for call to 'facebook::velox::dwio::common::ReaderOptions::ReaderOptions(facebook::velox::memory::MemoryPool*&, std::__shared_ptr<facebook::velox::io::IoStatistics, __gnu_cxx::_S_atomic>::element_type*, std::__shared_ptr<facebook::velox::io::IoStatistics, __gnu_cxx::_S_atomic>::element_type*)' 77 | baseReaderOpts_(pool_, ioStatistics_.get(), ioStatistics_.get()), | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ In file included from /__w/velox/velox/./velox/experimental/cudf/connectors/hive/CudfHiveConnectorSplit.h:20, from /__w/velox/velox/./velox/experimental/cudf/connectors/hive/CudfSplitReader.h:20, from /__w/velox/velox/velox/experimental/cudf/connectors/hive/CudfSplitReader.cpp:18: /__w/velox/velox/./velox/dwio/common/Options.h:579:12: note: candidate: 'facebook::velox::dwio::common::ReaderOptions::ReaderOptions(facebook::velox::memory::MemoryPool*)' 579 | explicit ReaderOptions(velox::memory::MemoryPool* pool) | ^~~~~~~~~~~~~ /__w/velox/velox/./velox/dwio/common/Options.h:579:12: note: candidate expects 1 argument, 3 provided /__w/velox/velox/./velox/dwio/common/Options.h:572:7: note: candidate: 'facebook::velox::dwio::common::ReaderOptions::ReaderOptions(const facebook::velox::dwio::common::ReaderOptions&)' 572 | class ReaderOptions : public io::ReaderOptions { ``` Refs: https://github.com/facebookincubator/velox/actions/runs/25913198690/job/76163332652?pr=17500 Pull Request resolved: facebookincubator#17532 Reviewed By: ximyu, duxiao1212 Differential Revision: D105332599 Pulled By: xiaoxmeng fbshipit-source-id: fbe60e23a6c1277a083937a97d0cbd077dd6fabb

Summary: Pull Request resolved: facebookincubator#17519 Velox's coercion rules don't match Presto. Presto allows `BIGINT -> REAL`, Velox doesn't. So `select typeof(real '1' / bigint '2')` returns `real` in Presto but `double` in Axiom (which uses Velox's coercion). The same root cause affects all REAL/BIGINT arithmetic, comparisons, `IN`, `CASE`, `COALESCE`, set operations, and `JOIN USING`. Before this change, Velox's `TypeCoercer` had a single hard-coded rule set with no per-dialect customization. This change makes coercion rules customizable: a SQL dialect can ship its own complete rule set that matches its semantics, without modifying Velox's defaults and without colliding with other dialects in the same process. `TypeCoercer` is now an instance class that owns a complete `CoercionEntry` rule set. Velox ships `TypeCoercer::defaults()` holding today's lossless built-in rules. Each SQL dialect ships its own complete instance. This change adds Presto's: `velox::functions::prestosql::typeCoercer()`, the Velox defaults plus `BIGINT -> REAL`. Rule sets are flat per-dialect (no live mixing, no layering): cost numbers are only meaningful inside a single rule set. Constructor validates that: - entries are non-null; - both source and target are built-in types (custom-type coercions remain in `CastRulesRegistry`); - source DECIMAL is the canonical placeholder `DECIMAL(1, 0)`; - no `DECIMAL -> DECIMAL` entries (handled by the type system); - cost is unique per source type; - no duplicate `(fromName, toName)` keys. The Velox-side resolution API (`SignatureBinder`, `resolveFunction*WithCoercions`, `SwitchExpr`/`CoalesceExpr` planning helpers) gains `const TypeCoercer& coercer = TypeCoercer::defaults()` as a tail parameter so Velox-internal callers compile unchanged. The downstream Axiom integration -- threading the dialect coercer through `PlanBuilder::Context` and wiring `presto::typeCoercer()` in `PrestoParser` -- lands as a follow-up. Coercion is a planning-time concern. Runtime expression evaluators (`Expr` ctors, `ExprCompiler`) do not consult `TypeCoercer` and are unchanged. `velox/docs/develop/types.rst` adds a new "Type Coercion" section covering the `TypeCoercer` mechanism, customization scope, `coerceTypeBase` vs `coercible`, the default-rules table, the Presto-specific additions, and the 2-step DECIMAL handling per case. # Breaking Changes There are no static back-compat wrappers. Consumers that called the old static methods (`velox::TypeCoercer::coercible(...)`, `velox::TypeCoercer::leastCommonSuperType(...)`, etc.) will fail to compile until they are updated to invoke the instance API on a `TypeCoercer&` -- typically `TypeCoercer::defaults().coercible(...)` for callers that don't have a dialect coercer. bypass-github-export-checks Reviewed By: xiaoxmeng Differential Revision: D105206605 fbshipit-source-id: 107f83e5757689950c9a78d264194dc23cb32306

…#17527) Summary: Pull Request resolved: facebookincubator#17527 Pull Request resolved: facebookincubator#17521 When a HashBuild builder task fails (e.g., OOM in finishHashBuild()) before calling HashTableCache::put(), the cache entry persists with buildComplete=false and unfulfilled ContinuePromises. This causes all subsequent tasks landing on the same executor to hang forever in kWaitForBuild state with 0B allocated memory, until the driver times out (exit_code=254). This was observed in spark3_full_shadow verifier jobs where entire jobs got stuck for 9+ hours. The fix has two parts: 1. HashBuild::close(): When the builder task closes without completing the build, call HashTableCache::drop() to remove the poisoned cache entry. This is gated on: caching is enabled, this is the builder task, and buildComplete is still false. 2. HashTableCache::drop(): Fulfill any pending buildPromises before erasing the entry, so waiting tasks are unblocked instead of hanging forever. Waiters wake up and fail with a clear error via the existing VELOX_CHECK in receivedCachedHashTable(). Also updates the error message in receivedCachedHashTable() to include the cache key and mention possible OOM cause, and converts the builderFailureLeavesWaitersStuck test to verify the fix instead of documenting the bug. Reviewed By: xiaoxmeng Differential Revision: D105165613 fbshipit-source-id: d83e54371d182b2c8f6ebd7560286221dce33282

…tor#17535) Summary: Pin `tzdata` (and `tzdata-java` where applicable) in all four velox-dev dockerfiles. Closes / mitigates facebookincubator#17522. Without an explicit pin, every docker rebuild — triggered by any change to `scripts/docker/*.dockerfile` or `scripts/setup-*.sh` — silently picks up the latest `tzdata` from the distro repos. tzdata `2026b`'s encoding of British Columbia's permanent-PDT change broke Velox's bundled libc++ chrono::tzdb parser and caused the Presto-SOT fuzzer regression in facebookincubator#17522: - 4 fuzzers (Expression / Window / Aggregation / Join, all "with Presto as source of truth") fail on any plan touching `TIMESTAMP WITH TIME ZONE`. - The mismatch shape is exactly `(3,600,000 ms) << 12 = 14,745,600,000` — i.e., 1-hour offset, same `tz_id`. Velox's tzdb returns the wrong offset for the affected timezones; Presto Java (reading `tzdata-java`) is correct. The trigger was a routine docker rebuild on 2026-05-14 that bumped `tzdata` from `2026a-1.el9` → `2026b-1.el9`. With this PR the version is explicit and reproducible. ## Per-image pin | Image | Pinned to | Notes | |---|---|---| | `centos9`, `pyvelox` | `tzdata-2026a-1.el9` | pinned in `base-image` stage | | `adapters` | `tzdata-2026a-1.el9` + `tzdata-java-2026a-1.el9` | inherits from `centos9`; `tzdata-java` re-pinned after Java install | | `presto-java`, `spark-server` | `tzdata-2026a-1.el9` + `tzdata-java-2026a-1.el9` | defensive re-pin in `java-base` so it builds even before `centos9` is rebuilt with the pin | | `fedora` | `tzdata-2025c-1.fc42` (current) | pinned in `base-build` and `fedora` stages | | `ubuntu` (22.04) | `tzdata=2026a-0ubuntu0.22.04.1n` (current) | `apt-get install --allow-downgrades` | ## Mechanism Pin uses an install-or-downgrade pattern that works whether the base image ships an older, equal, or newer tzdata than the pinned version: ```dockerfile ARG CENTOS_TZDATA_VERSION=2026a-1.el9 RUN dnf -y install "tzdata-${CENTOS_TZDATA_VERSION}.noarch" || \ dnf -y downgrade "tzdata-${CENTOS_TZDATA_VERSION}.noarch" ``` Verified manually on a fresh `quay.io/centos/centos:stream9` (currently shipping `2026b`): `dnf install tzdata-2026a-1.el9.noarch` performs the downgrade as expected, so the `|| dnf downgrade` fallback is defensive. ## Bumping intentionally Change the `{CENTOS,FEDORA,UBUNTU}_TZDATA_VERSION` ARG default at the top of each dockerfile, rebuild the docker images, and re-run the Presto-SOT fuzzers locally to confirm no regression before merging. Pull Request resolved: facebookincubator#17535 Test Plan: - [ ] Docker images rebuild successfully (workflow `docker.yml` will fire automatically on this PR since it touches `scripts/docker/*.dockerfile`). - [ ] After the rebuilt images land on `:presto-java`, the four Presto-SOT fuzzer jobs (Expression, Window, Aggregation, Join) go green again on `main`. - [ ] `rpm -q tzdata` inside the rebuilt `velox-dev:presto-java` shows `tzdata-2026a-1.el9.noarch` (and `tzdata-java` matches). Reviewed By: kevinwilfong, pratikpugalia Differential Revision: D105375476 Pulled By: kgpai fbshipit-source-id: 9c46190a1d848b42b0c9a1806bbff377d638b984

…okincubator#17517) Summary: Follow-up to facebookincubator#17238 (comment) Alphabetize source file lists, library dependencies, `add_subdirectory` calls, and test blocks across all 9 CMake files in `velox/experimental/cudf/`. - Dependencies are grouped by type (cudf/arrow, velox\_\*, third-party), then alphabetized within each group - Test blocks are kept together (`add_executable`, `add_test`, `set_tests_properties`, `target_link_libraries`) and ordered alphabetically by test name - Add `velox_add_cudf_test()` helper and `CUDF_TEST_DEFAULT_LIBS` variable to eliminate repeated boilerplate across all test targets, remove unused link dependencies, and register the custom command with gersemi. Pull Request resolved: facebookincubator#17517 Reviewed By: pratikpugalia Differential Revision: D105390344 Pulled By: mbasmanova fbshipit-source-id: 5549a3db449dec4bb05658d3d8b5fe053fda8ea9

…acebookincubator#17534) Summary: Pull Request resolved: facebookincubator#17534 X-link: facebookincubator/nimble#731 Remove the `defaultIoStats_` fallback in `TabletReader` and instead enforce that callers always supply `ioOptions` with a non-null `metadataIoStats()`. The constructor now does an early `NIMBLE_CHECK` / `NIMBLE_CHECK_NOT_NULL` in its init list, so any caller that forgets to set the field fails fast at construction with a clear message. Reviewed By: xiaoxmeng Differential Revision: D105338687 fbshipit-source-id: 53d2b708ea5ea96c4ca27dc4b50b4d7720e0afa8

Summary: Streamline the PR review workflow by reducing the number of manual commands and permission prompts needed to fetch PR data and post reviews. The style guide captures review conventions developed over many review iterations. Add scripts and a style guide for PR reviews: - `scripts/review/fetch.py` — fetches PR metadata, diff, comments, and reviews in one shot - `scripts/review/post.py` — posts a review from a file (handles multiline bodies cleanly) - `scripts/review/REVIEW_GUIDE.md` — review style guide covering tone, structure, and what to check Pull Request resolved: facebookincubator#17524 Reviewed By: xiaoxmeng Differential Revision: D105240290 Pulled By: mbasmanova fbshipit-source-id: 36df968e1131412cb5e5b0143d5b1c95e3989974

…ator#17463) Summary: **Summary**: when the leaf node (Scan operator) in one child of the Join operator is rolled back to Vanilla Spark, while the leaf node (Scan operator) in the other child still uses NativeScan, and the join key used by Join contains the decimal type, Spark ORC reader will prioritize using the decimal type in the table schema (e.g. DECIMAL (20,0)) over the decimal type in the ORC file (e.g. DECIMAL (38,18)) but Native ORC reader only using the decimal type in the ORC file, which ultimately leads to incorrect matching of the join key, resulting in the output of 0 rows after join. **Root cause**: Hive ORC fixed the DECIMAL (38,18) in footer, and the true scale of each row is written in the SECONDRY stream. The original ORC reader used footer's The use of scale instead of the scale of the meta store (table schema) results in inconsistency between the output of NativeScan and Vanilla Spark. Fixes: facebookincubator#17462 This PR proposes to pass the request type into `SelectiveColumnReader`. This PR could help to fix apache/gluten#11980 Pull Request resolved: facebookincubator#17463 Reviewed By: pratikpugalia Differential Revision: D105464139 Pulled By: mbasmanova fbshipit-source-id: b1348d9be467c2ec74ee84387d4b3fb0d30deda9

…folder (facebookincubator#17540) Summary: Pull Request resolved: facebookincubator#17540 X-link: facebookincubator/nimble#741 Moves encoding selection files (`EncodingSelection.h`, `EncodingSelectionPolicy.h`, `EncodingSizeEstimation.h`, `Statistics.h/.cpp`, `EncodingIdentifier.h`) into a new `encodings/selection/` subfolder for better code organization. The files remain in the same BUCK target to preserve the existing dependency structure between `EncodingSelection.h` and `EncodingFactory.h`. All include paths across ~55 consumer files are updated accordingly. Reviewed By: pratikpugalia Differential Revision: D105446634 fbshipit-source-id: d5665c130c8118dbe0da1f633b14546f6cb9403f

…kincubator#17541) Summary: `CudfToVelox` and `CudfFromVelox` adapter operators use synthetic planNodeIds (e.g. 4-to-velox) and are not part of the plan tree. Their stats were previously invisible in `printPlanWithStats` output, making it difficult to observe adapter operator performance (e.g. GPU-CPU conversion overhead). Using `setStatSplitter` to remap their stats to the parent plan node at collection time, they now appear as operator-type breakdown lines under their parent node — the same mechanism used by `HashBuild/HashProbe` under `HashJoinNode`. ## Changes - `CudfConversion.cpp`: `setStatSplitter` in CudfFromVelox/CudfToVelox constructors remaps stats planNodeId to parent. - `AdapterOperatorTest.cpp`: Verifies stats merge into parent plan node. - `operators.rst`: GPU Operators (cuDF) section — plan rewriting, adapter operators, naming convention. - `print-plan-with-stats.rst`: GPU Operator Stats section — example output with multi-operator breakdown. ## Example output ### GPU mode output (TPC-H Q6 SF10) ``` Execution time: 346ms Splits total: 1, finished: 1 -- Aggregation[4][FINAL a0 := sum("a0")] -> a0:DOUBLE Output: 2 rows (16B, 2 batches), Cpu time: 576.15us, Wall time: 632.42us, Blocked wall time: 0ns, Peak memory: 0B, Memory allocations: 0, CPU breakdown: B/I/O/F (32.88us/4.81us/524.85us/13.61us) CudfToVelox: Input: 1 rows (8B, 1 batches), Output: 1 rows (8B, 1 batches), Cpu time: 223.65us, Wall time: 254.26us, Blocked wall time: 0ns, Peak memory: 0B, Memory allocations: 0, Threads: 1, CPU breakdown: B/I/O/F (18.46us/2.35us/195.12us/7.72us) CudfReduceFINAL: Input: 1 rows (8B, 1 batches), Output: 1 rows (8B, 1 batches), Cpu time: 352.50us, Wall time: 378.16us, Blocked wall time: 0ns, Peak memory: 0B, Memory allocations: 0, Threads: 1, CPU breakdown: B/I/O/F (14.42us/2.46us/329.73us/5.89us) -- LocalPartition[3][GATHER] -> a0:DOUBLE Output: 2 rows (16B, 2 batches), Cpu time: 107.75us, Wall time: 139.80us, Blocked wall time: 343.93ms, Peak memory: 0B, Memory allocations: 0, CPU breakdown: B/I/O/F (18.36us/46.05us/23.23us/20.11us) CudfLocalPartition: Input: 1 rows (8B, 1 batches), Output: 1 rows (8B, 1 batches), Cpu time: 88.15us, Wall time: 109.59us, Blocked wall time: 0ns, Peak memory: 0B, Memory allocations: 0, Threads: 1, CPU breakdown: B/I/O/F (11.34us/46.05us/11.59us/19.17us) LocalExchange: Input: 1 rows (8B, 1 batches), Output: 1 rows (8B, 1 batches), Cpu time: 19.60us, Wall time: 30.21us, Blocked wall time: 343.93ms, Peak memory: 0B, Memory allocations: 0, Threads: 1, CPU breakdown: B/I/O/F (7.02us/0ns/11.64us/940ns) -- Aggregation[2][PARTIAL a0 := sum(ROW["p0"])] -> a0:DOUBLE Output: 1 rows (8B, 1 batches), Cpu time: 10.82ms, Wall time: 10.84ms, Blocked wall time: 0ns, Peak memory: 0B, Memory allocations: 0, Threads: 1, CPU breakdown: B/I/O/F (13.44us/4.65us/10.80ms/5.03us) -- Project[1][expressions: (p0:DOUBLE, multiply(ROW["l_extendedprice"],ROW["l_discount"]))] -> p0:DOUBLE Output: 1139264 rows (8.83MB, 1 batches), Cpu time: 44.62ms, Wall time: 44.64ms, Blocked wall time: 0ns, Peak memory: 0B, Memory allocations: 0, Threads: 1, CPU breakdown: B/I/O/F (8.92us/6.32us/44.60ms/5.13us) -- TableScan[0][table: lineitem, range filters: [(l_discount, DoubleRange: [0.050000, 0.070000] no nulls), (l_quantity, DoubleRange: (-inf, 24.000000) no nulls), (l_shipdate, BigintRange: [8766, 9130] no nulls)]] -> l_shipdate:DATE, l_extendedprice:DOUBLE, l_quantity:DOUBLE, l_discount:DOUBLE Input: 1139264 rows (30.97MB, 1 batches), Output: 1139264 rows (30.97MB, 1 batches), Cpu time: 147.16ms, Wall time: 288.43ms, Blocked wall time: 0ns, Peak memory: 1.72MB, Memory allocations: 1957, Threads: 1, Splits: 1, CPU breakdown: B/I/O/F (4.16us/0ns/147.15ms/1.03us) ``` Pull Request resolved: facebookincubator#17541 Reviewed By: srsuryadev Differential Revision: D105490119 Pulled By: mbasmanova fbshipit-source-id: 07abd9724c6785d4bb2669e9c01ddd36fb3d5ea1

…bookincubator#16591) (facebookincubator#17353) Summary: ## Problem When the Parquet reader detects a schema mismatch between the file footer and the requested schema, the error reports the converted file type and the requested type but not the field name: ``` Converted type VARCHAR is not allowed for requested type INTEGER ``` For tables with many columns, the reader has to cross-reference the file footer manually to figure out which column triggered the mismatch. ## Change Append the file column name (`schemaElement.name`) to `kTypeMappingErrorFmtStr` in `ReaderBase::convertType`: ``` Converted type VARCHAR is not allowed for requested type INTEGER for file column 'name' ``` The name is added at the end so the existing static prefix is preserved, matching the project style guide that puts runtime information after the static description. All 19 throw sites that share `kTypeMappingErrorFmtStr` in `convertType` are updated; the existing `fileColumnVarcharToMetadataColumnMismatchTest` is extended to assert the column name appears in the message. ## Verification - Existing `VELOX_ASSERT_THROW` substring matchers in `ParquetReaderWideningTest` and `ParquetTableScanTest` still match — they check for `"is not allowed for requested type"`, which remains a substring of the new message. - The updated assertion in `fileColumnVarcharToMetadataColumnMismatchTest` would have failed before this change (the column name `"name"` was not in the error) and passes after. Resolves facebookincubator#16591. Pull Request resolved: facebookincubator#17353 Reviewed By: srsuryadev Differential Revision: D105506987 Pulled By: mbasmanova fbshipit-source-id: d4fff9657d1a7a843b3794dc1a9d54f03c50ada0

…zer (facebookincubator#17536) Summary: Pull Request resolved: facebookincubator#17536 Reviewed By: srsuryadev Differential Revision: D105506016 Pulled By: mbasmanova fbshipit-source-id: 63cbbd8e0edcb9ad2ea56e50b1909cafab71f6d2

Summary: Pull Request resolved: facebookincubator#17383 This change was generated by pointing an AI agent at hot functions showing up in our internal profiles. It was reviewed by a human (outside the velox team): Added `keyValuesMap_.reserve(inputArray.size())` after `keyValuesMap_.clear()` in `MultimapFromEntriesFunction::call`. This pre-allocates bucket space in the `folly::F14FastMap` before the insertion loop, avoiding costly rehashes as elements are inserted. The upper bound `inputArray.size()` is the maximum possible number of unique keys and is a cheap O(1) estimate. The `uniqueKeys_` vector already had a similar `reserve` call. Reviewed By: IosifSpulber Differential Revision: D103133759 fbshipit-source-id: 6e672e81d2a6dccdc514a630be7ff5d8697294a0

…acebookincubator#17543) Summary: Pull Request resolved: facebookincubator#17543 `EnforceSingleRowNode` validates that its input produces exactly one row. With multiple drivers, each driver independently sees a slice of the input and may emit its own row (or NULL on empty input), violating the single-row contract. Mark it `requiresSingleThread() = true` so the runtime collapses the pipeline to one driver, matching the existing treatment of `LimitNode`, `OrderByNode`, and other operators with the same constraint. Reviewed By: srsuryadev Differential Revision: D105507323 fbshipit-source-id: 55feb7f63f5d4fbbf94ea1fccfedb44c0157f2d7

…ubator#17487) Summary: CONTEXT: Spark 3.4+ introduced bitmap aggregate functions (`bitmap_construct_agg`, `bitmap_or_agg`, `bitmap_bucket_number`, `bitmap_bit_position`, `bitmap_count`) for efficient count-distinct approximation using sub-bitmap splitting. `bitmap_construct_agg` is the core aggregation primitive — it builds a fixed-size 4096-byte bitmap (32768 bits) from input bit positions computed by `bitmap_bit_position`. WHAT: Add the `bitmap_construct_agg` aggregate function for Spark. The function takes BIGINT input positions in [0, 32767] and produces a VARBINARY bitmap with the corresponding bits set. Key design points: - Uses `SimpleAggregateAdapter` with `default_null_behavior_ = false` for non-nullable output - Fixed 4096-byte bitmap allocated lazily via `HashStringAllocator` (zero cost for empty groups) - Out-of-range positions raise a descriptive error with plain English message - NULL inputs are skipped; the aggregate is non-nullable (returns all-zeros bitmap for empty/all-null input) - Partial aggregation supported: intermediate bitmaps serialized as VARBINARY, merged via 64-bit OR - Uses `bits::setBit` for bitmap manipulation Tests cover: basic, null handling, group-by, boundaries, invalid positions, partial+final merge, invalid intermediates, overlapping merges. Ref: `org.apache.spark.sql.catalyst.expressions.BitmapConstructAgg` Pull Request resolved: facebookincubator#17487 Reviewed By: srsuryadev Differential Revision: D105567436 Pulled By: mbasmanova fbshipit-source-id: e1c2d5f7a26d9af8f7ed5078122a7f1ae6b83430

…oned partial aggregation (facebookincubator#17547) Summary: The patch replaces the existing boolean stat `kAbandonedPartialAggregation` with a new integer stat `kAbandonedPartialAggregationRows`, in the HashAggregation operator so users can measure how many rows passed through after partial aggregation was abandoned. This makes performance analysis of partial hash aggregation easier. Pull Request resolved: facebookincubator#17547 Reviewed By: apurva-meta Differential Revision: D105576520 Pulled By: mbasmanova fbshipit-source-id: 1c15e41cce920f592c82849c75278214aead31ab

…nge hangs (facebookincubator#17506) Summary: - Fix `CudfNestedLoopJoinProbe` to consume probe input when build is empty, matching CPU `NestedLoopJoinProbe` behavior. The original code set `finished_=true` in `isBlocked()` when detecting an empty build side for inner/right joins. This caused the probe operator to finish before consuming probe input, which can hang upstream exchanges waiting for their output buffers to be drained. Addresses review comment from facebookincubator#17113: facebookincubator#17113 (comment) Pull Request resolved: facebookincubator#17506 Reviewed By: kgpai Differential Revision: D105583207 Pulled By: mbasmanova fbshipit-source-id: bee4c19ad6effa20100b5bf2b30cbe28aadbe858

…r#17369) (facebookincubator#17369) Summary: Extends the Velox tracing framework with expression-level tracing, enabling callers to trace output batches from named expression functions (e.g., UDFs). Changes: - TraceCtx: Added virtual methods shouldTraceExpr() and createExprOutputTracer(op, functionName, instanceIndex) for expression-level tracing decisions and writer creation - Expr: Added maybeSetupTracer() to create and cache a tracer during operator initialization, stored as a unique_ptr member. Uses a visited set to avoid redundant traversal of shared CSE nodes, and an instance counter to distinguish multiple Expr nodes with the same function name within a single operator - ExprSet: Added maybeSetupTracers() to walk all expressions and set up tracers - FilterProject: Calls ExprSet::maybeSetupTracers() during initialize(), scoped to operators where shouldTrace() returns true — expression tracing is only set up inside traced operators - EvalCtx: Removed traceCtx plumbing (no longer needed since tracers are owned by Expr nodes) Performance: Tracers are created once at initialization and owned by each Expr node as a unique_ptr, following the same pattern as Operator::inputTracer_. The hot path uses FOLLY_UNLIKELY on a null pointer check. Examples: 1. Single instance: A FilterProject operator "proj-5" with projection `a * 10 as b` and tracing configured for "multiply" creates one tracer identified as ("proj-5", "multiply", 0). 2. Multiple instances of the same function: Projections `a * 10 as b, c * 20 as d` with tracing configured for "multiply" creates two tracers: ("proj-5", "multiply", 0) and ("proj-5", "multiply", 1), each producing a distinct output stream. 3. Selective tracing: Projections `a * 10 as b, a + 5 as c` with tracing configured for "multiply" only traces the multiply expression. The plus expression is skipped since it is not in the trace configuration. 4. Operator scoping: If operator "proj-5" is not in the traced operators list, none of its UDFs are traced regardless of the UDF trace configuration. Pull Request resolved: facebookincubator#17369 Test Plan: buck2 test @//mode/dev-nosan fbcode//velox/exec/tests:test -- CustomTraceTest.exprOutputTrace CustomTraceTest.exprTraceOnlyMatchingFunctions Reviewed By: bikramSingh91 Differential Revision: D102552818 Pulled By: patrickstuedi fbshipit-source-id: 16780bc675d4bd12277de66ec193898398ba5cff

Summary: Pull Request resolved: facebookincubator#17542 This diff adds FLUX as a valid index source into Velox and adds roundtrip serializatino. Reviewed By: xiaoxmeng Differential Revision: D105505038 fbshipit-source-id: 6af038feb9d5c401d13e4f5873f04a59ba1c94cb

copy-pr-bot · 2026-05-19T18:15:04Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

bdice · 2026-05-21T14:18:04Z

/ok to test c7d0e59

perlitz and others added 30 commits April 11, 2026 01:58

feat: Introduce Avro format (facebookincubator#17084)

ccedd30

Summary: Pull Request resolved: facebookincubator#17084 Introduce stub for Avro format Reviewed By: raymondlin1 Differential Revision: D100118684 fbshipit-source-id: 3280d099d5954a710b83623fb84b8f6c29d20d9e

Raymond Lin and others added 24 commits May 15, 2026 03:14

Merge upstream main into velox-cudf

c7d0e59

bdice requested review from a team, devavret, karthikeyann and mhaseeb123 as code owners May 19, 2026 18:14

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Update velox-cudf from upstream main#112

Update velox-cudf from upstream main#112
bdice wants to merge 1672 commits into
rapidsai:velox-cudffrom
bdice:velox-update-2026-05-18

bdice commented May 19, 2026

Uh oh!

copy-pr-bot Bot commented May 19, 2026

Uh oh!

bdice commented May 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

20 participants

Conversation

bdice commented May 19, 2026

Summary

Validation

Uh oh!

copy-pr-bot Bot commented May 19, 2026

Uh oh!

bdice commented May 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

20 participants