Skip to content

Update velox-cudf from upstream main#112

Open
bdice wants to merge 1672 commits into
rapidsai:velox-cudffrom
bdice:velox-update-2026-05-18
Open

Update velox-cudf from upstream main#112
bdice wants to merge 1672 commits into
rapidsai:velox-cudffrom
bdice:velox-update-2026-05-18

Conversation

@bdice
Copy link
Copy Markdown

@bdice bdice commented May 19, 2026

Summary

  • Merge facebookincubator/velox main into rapidsai/velox velox-cudf
  • Prefer upstream cuDF implementation, dependency, and setup-script changes during conflict resolution
  • Preserve RAPIDS CI runner customizations while updating Linux CI to split cuDF build and GPU test execution

DO NOT SQUASH-MERGE

Validation

  • pre-commit run --all-files
  • yamllint and action-validator passed for active workflows
  • CI will validate the RAPIDS runner split build/test flow

perlitz and others added 30 commits April 11, 2026 01:58
…16920)

Summary:
Implements GPU version of EnforceSingleRow to maintain GPU pipeline continuity for scalar subqueries. Validates row count using GPU metadata without host↔device data transfer.

related to: facebookincubator#15772
closing: facebookincubator#16888

### Performance Benchmarks (SF100, 5 iterations)

  All queries show **no significant performance difference** between GPU and CPU implementations, which is expected for
  this lightweight operator. The benefit is maintaining GPU pipeline continuity (avoiding GPU↔CPU transfers), not faster
  execution of the check itself.

  | Query | GPU mean±std | CPU mean±std | Diff | t-stat | 95% CI | Significant? |
  |-------|--------------|--------------|------|--------|--------|--------------|
  | Q6 (1 occ) | 1.738±0.048s | 1.736±0.034s | +0.1% | 0.076 | [-0.051, +0.055]s | NO |
  | Q14 (3 occ) | 11.490±0.309s | 11.190±0.165s | +2.7% | 1.914 | [-0.013, +0.613]s | NO |
  | Q44 (2 occ) | 7.294±0.358s | 7.102±0.135s | +2.7% | 1.121 | [-0.151, +0.535]s | NO |
  | Q54 (2 occ) | 4.008±0.443s | 3.818±0.036s | +5.0% | 0.956 | [-0.208, +0.588]s | NO |
  | Q58 (3 occ) | 3.806±0.123s | 3.750±0.053s | +1.5% | 0.936 | [-0.064, +0.176]s | NO |

  **Methodology**: Welch t-test with 95% confidence intervals. "Not significant" means |t| < 2.0 (p ≥ 0.05), indicating
  performance differences are within statistical noise.

  **Test environment**: SF100 INT32 data (~43GB), local NVMe storage, NVIDIA RTX PRO 6000 Blackwell, 5 independent runs
  per mode.

Pull Request resolved: facebookincubator#16920

Reviewed By: tanjialiang

Differential Revision: D100439821

Pulled By: xiaoxmeng

fbshipit-source-id: 1a982e43a4aca83bb025e2f181ddbc3544346aff
…acebookincubator#17090)

Summary:
This makes S3Config available for use in downstream user such as FileWrite and ReadFile to support options needed to support various  S3 backends and allow more granular configuration.

Pull Request resolved: facebookincubator#17090

Reviewed By: tanjialiang

Differential Revision: D100439661

Pulled By: xiaoxmeng

fbshipit-source-id: 36c5cf19e4d2cc8831b1d4e39629bf845602c0d8
…facebookincubator#16732)

Summary:
## Description

This PR re-adds the use of `enqueueForDevice` when cudf buffered input data source is being used with the experimental (hybrid scan) reader. Tests have also been updated and verified locally to ensure the `enqueueForDevice` is indeed used.

## Checklist
- [x] All new and existing `velox_cudf_xx` tests pass
- [x] I am familiar with the contribution guide

Pull Request resolved: facebookincubator#16732

Reviewed By: tanjialiang

Differential Revision: D100439856

Pulled By: xiaoxmeng

fbshipit-source-id: f6fb5afa74ef01d3221df9c512bafa2dbc162307
Summary:
## Description

This PR updates the `update-cudf-deps.sh` to not silently exit without any message or effect when, for example, `read` fails due to GH rate limiting and throws a verbose message such as

```bash
Error: Failed to fetch VERSION file from branch main
  Try: `gh auth login` (or set GH_TOKEN) before running this script
  Create a GitHub personal access token at https://github.com/settings/tokens/new
  (Select scope: public_repo)
```

Pull Request resolved: facebookincubator#16332

Reviewed By: tanjialiang

Differential Revision: D100439867

Pulled By: xiaoxmeng

fbshipit-source-id: 467a7b6c637a70c7f5a4bd3343e06efd10d518fb
Summary:
### Problem

Build error for T = double:
```
/home/vpcuser/velox/./velox/experimental/cudf/expression/AstUtils.h: In function ‘std::unique_ptr<cudf::scalar> facebook::velox::cudf_velox::makeScalarFromValue(const facebook::velox::TypePtr&, T, bool, std::optional<cudf::type_id>) [with T = double]’:
/home/vpcuser/velox/./velox/experimental/cudf/expression/AstUtils.h:79:41: error: call to ‘cudf::get_default_stream’ declared with attribute error: cudf default stream argument used. Pass stream explicitly.
   79 |   auto stream = cudf::get_default_stream();
      |                 ~~~~~~~~~~~~~~~~~~~~~~~~^~
/home/vpcuser/velox/./velox/experimental/cudf/expression/AstUtils.h:80:8: error: call to ‘cudf::get_current_device_resource_ref’ declared with attribute error: cudf default memory resource argument used. Pass mr explicitly.
   80 |   auto mr = cudf::get_current_device_resource_ref();
```

`makeScalarFromValue()` in `AstUtils.h` calls `cudf::get_default_stream()` and `cudf::get_current_device_resource_ref()`. When any `.cpp` file includes both `CudfNoDefaults.h` and `AstUtils.h` (e.g. `CudfHashJoin.cpp`), these calls hit the `__attribute__((error))` redeclarations that guard against accidental default argument usage.

In Release builds (`-O2`), GCC's dead-code elimination removes unused template instantiations before the error fires. In Debug builds (`-O0`), all instantiations are preserved, triggering a hard compile error.

### Solution
Replaces the two calls with existing approved alternatives:
- `cudf::get_default_stream()` → `cudf::get_default_stream(cudf::allow_default_stream)` (from `CudfDefaultStreamOverload.h`)
- `cudf::get_current_device_resource_ref()` → `rmm::mr::get_current_device_resource()` (direct RMM call, same pattern as `get_temp_mr()` in `CudfNoDefaults.h`)

Both return identical values at runtime — **zero performance impact**.

### Changes
- `velox/experimental/cudf/expression/AstUtils.h`: Update `makeScalarFromValue()` to use poison-safe alternatives

Pull Request resolved: facebookincubator#17011

Reviewed By: tanjialiang

Differential Revision: D100439740

Pulled By: xiaoxmeng

fbshipit-source-id: f1b0462dcad4645adf33ff29b6a810ffc725e290
…acebookincubator#17035)

Summary:
- Fixes Count aggregation condition to run all count aggregations on GPU
- Adds unit tests to check if all count aggregations run on GPU
- Fixes CI build break too.

Pull Request resolved: facebookincubator#17035

Reviewed By: tanjialiang

Differential Revision: D100439695

Pulled By: xiaoxmeng

fbshipit-source-id: 2f05cb59e65e0d98bdaea492bf0c6576b8a9cb6d
…acebookincubator#17134)

Summary:
Pull Request resolved: facebookincubator#17134

Introduces the PaimonSplitReader framework for reading Paimon tables in Velox/Presto.

**PaimonSplitReader** (extends FileSplitReader): Multi-file reader for rawConvertible Paimon splits. A single Paimon split (one partition × bucket) can contain multiple data files across LSM levels. PaimonSplitReader receives the full PaimonConnectorSplit, builds FileConnectorSplits internally, and handles sequential file iteration transparently — when a file is exhausted, it resets the base reader and advances to the next non-empty file. Deletion vectors and changelog files are validated upfront and throw VELOX_NYI.

**PaimonDataSource** (extends FileDataSource): Refactored to be a pure orchestrator. addSplit() validates the split and selects the read path (rawConvertible vs merge-on-read NYI). createSplitReader() simply passes the PaimonConnectorSplit to PaimonSplitReader. Does not override next() — FileDataSource::next() drives the read loop, getting filter eval, column projection, and stats tracking for free.

**PaimonMergeReader**: NYI placeholder for merge-on-read (primary-key tables with rawConvertible=false). Includes MergeEngine interface stub for future deduplicate/partial-update strategies.

Currently working: append-only and primary-key tables with rawConvertible=true, single and multi-file splits, partition keys.
NYI: deletion vectors, changelog files, merge-on-read, _rowkind column, streaming reads.

Reviewed By: xiaoxmeng

Differential Revision: D99660568

fbshipit-source-id: 632ef8d0cfff39867d9fe95633d2e0f877f2a8e1
…17133)

Summary:
Pull Request resolved: facebookincubator#17133

CONTEXT: The IndexLookupJoin operator uses a complex stat splitting mechanism with special-cased lazy vector stat handling. The nimble index reader creates lazy vectors but immediately loads them, wasting the wrap/unwrap overhead.

WHAT:
1. Rename IndexLazyStatWriter to IndexStatWriter and simplify it to accumulate all index source related stats into a local synchronized map (indexSourceRuntimeStats_). The stat splitter now simply dumps this map as the IndexSource node's runtime stats, removing the complex stat name matching and erasing logic.

2. Move all index source stat recording (input/output stats, connector stats) to go through the IndexStatWriter's map, keeping them cleanly separated from the join operator's own stats.

3. Add generateLazyChildren parameter to nimble StructColumnReaderBase/StructColumnReader/buildColumnReader, and pass false in SelectiveNimbleIndexReader. This avoids creating lazy vectors that are immediately loaded, eliminating the wrap/unwrap overhead. The loadedVector() call is also removed since it's no longer needed.

Reviewed By: zacw7

Differential Revision: D100443040

fbshipit-source-id: 6d87c7dceea2b9f756e2c70d3abad24fbaa743a8
Summary:
Pull Request resolved: facebookincubator#17138

CONTEXT: Index lookup needs to distinguish between point lookups
(exact key match) and range lookups (lower/upper bounds). Dense index
only supports point lookups while cluster index supports both.

WHAT: Add isPointLookup() method to EncodedKeyBounds that returns
true when both bounds are present and equal. Add unit test coverage.

Reviewed By: tanjialiang, han-yan01

Differential Revision: D100556652

fbshipit-source-id: e478c1ee7062f2f4fee2c3f1553c6e1bb39676a0
…or#17135)

Summary:
Follow up of facebookincubator#17011, as the comment facebookincubator#17011 (comment)

Pull Request resolved: facebookincubator#17135

Reviewed By: xiaoxmeng

Differential Revision: D100559104

Pulled By: pratikpugalia

fbshipit-source-id: 51adc7ed44ee1b4bcecbbad0f6940481ec9073eb
…emantics (facebookincubator#17137)

Summary:
Pull Request resolved: facebookincubator#17137

Add a `nullAsValue` boolean flag to `HashJoinNode` that makes join keys use IS NOT DISTINCT FROM semantics where NULL equals NULL. This is different from the existing `nullAware` flag which implements IN/NOT IN three-valued logic. The two flags are mutually exclusive.

When `nullAsValue` is true:
- HashBuild stores rows with NULL keys in the hash table (uses `HashTable<false>`)
- HashBuild does not filter out NULL key rows during `addInput`
- HashProbe does not filter out NULL key rows during probe

The hash table comparison logic (`HashTable<false>::compareKeys`) already treats NULLs as equal values via `CompareFlags::NullHandlingMode::kNullAsValue`, so no changes are needed in the comparison path.

This flag is needed to correctly implement SQL set operations (EXCEPT, INTERSECT, EXCEPT ALL, INTERSECT ALL) which require NULL=NULL per the SQL standard.

Part of facebookincubator/axiom#1235

Reviewed By: xiaoxmeng

Differential Revision: D100545193

fbshipit-source-id: f5cfdf3ef975a1fa8a57c0cb129a426f8cf6fcd1
Summary:
Pull Request resolved: facebookincubator#17084

Introduce stub for Avro format

Reviewed By: raymondlin1

Differential Revision: D100118684

fbshipit-source-id: 3280d099d5954a710b83623fb84b8f6c29d20d9e
…acebookincubator#17123)

Summary:
Pull Request resolved: facebookincubator#17123

CONTEXT: D71324408 introduced processedStrides_ but incremented it once per stripe load in loadCurrentStripe(). This undercounts when a stripe has multiple strides, producing incorrect stride-level statistics.

WHAT: Move processedStrides_ increment from loadCurrentStripe() (once per stripe) to nextRowNumber() (once per stride read, guarded by strideSize > 0 and stride boundary alignment). This gives an accurate count of strides actually processed.

Reviewed By: Yuhta

Differential Revision: D100236453

fbshipit-source-id: 3e970cfa6c6b0d59335920df0b30882700b90236
…Registry lookup (facebookincubator#17151)

Summary:
Pull Request resolved: facebookincubator#17151

TableScan, TableWriter, and IndexLookupJoin all called ConnectorRegistry::tryGet(connectorId) (global-only) in their constructors, ignoring the query-scoped API that was already implemented and tested. This meant per-query connector registry overrides attached to QueryCtx were silently bypassed during execution.

Change each operator to use ConnectorRegistry::tryGet(queryCtx, connectorId), which checks for a per-query registry override first, then falls back to the global registry. The QueryCtx is available via driverCtx->task->queryCtx() in all three constructors. When no override is set (the default), behavior is identical to the old global-only call.

Reviewed By: mbasmanova

Differential Revision: D100662333

fbshipit-source-id: c939b60ddcebb7e9a9674a6d08d37a488f67f68d
Summary:
This PR adds `CudfMarkDistinct`, a GPU implementation of the Velox `MarkDistinct` operator, as part of the cuDF GPU acceleration module (`velox/experimental/cudf/`).

`MarkDistinct` annotates each row with a boolean indicating whether it is the first occurrence of its key(s) within the partition. It is used in query plans that require deduplication markers (e.g. `SELECT DISTINCT` rewrites).

Related to: facebookincubator#15772

### Files changed

- `velox/experimental/cudf/exec/CudfMarkDistinct.h` — operator declaration
- `velox/experimental/cudf/exec/CudfMarkDistinct.cpp` — operator implementation
- `velox/experimental/cudf/exec/OperatorAdapters.cpp` — registration of `MarkDistinctAdapter`
- `velox/experimental/cudf/CudfConfig.h` — `enableMarkDistinct` and `markDistinctMaxKeys` config fields
- `velox/experimental/cudf/tests/MarkDistinctTest.cpp` — 12 test cases covering single/multi-batch, composite keys, nulls, strings, empty batches, and memory limit

### Benchmark

Here are the current results for plans for which MarkDistinct appears (SF100), these are run using karthikeyann's branch here: https://github.com/karthikeyann/velox/tree/fea-TPCDS_plans that enables running TPC-DS workloads:

| Query | GPU cold | CPU cold | GPU warm±std | CPU warm±std | Speedup | t-stat | 95% CI (GPU-CPU) | Significant? |
|-------|----------|----------|--------------|--------------|---------|--------|------------------|--------------|
| Q16 | 3.940s | 3.810s | 3.786±0.037s | 3.788±0.081s | -0.1% | -0.050 | [-0.082, +0.078]s | NO (n=5) |
| Q28 | 14.620s | 12.670s | 13.196±0.079s | 12.804±0.042s | +3.1% | 9.825 | [+0.312, +0.472]s | YES (n=5) |
| Q94 | 1.940s | 1.960s | 1.932±0.031s | 1.910±0.022s | +1.2% | 1.283 | [-0.012, +0.056]s | NO (n=5) |
| Q95 | 2.400s | 2.400s | 2.300±0.021s | 2.334±0.056s | -1.5% | -1.262 | [-0.088, +0.020]s | NO (n=5) |

With these configs:

```json
  {
    "benchmark_config": {
      "scale_factor": "SF100",
      "data_path": "/velox/data_local/tpcds/sf100/tpcds_sf100",
      "plan_path": "/velox/VeloxPlans/presto/tpcds/sf100/plans",
      "data_format": "parquet",
      "stats_runs": 5,
      "warmup_runs": 1,
      "execution": {
        "num_drivers": 4,
        "num_splits_per_file": 10,
        "num_io_threads": 8
      },
      "cudf": {
        "cudf_enabled": true,
        "cudf_memory_resource": "pool",
        "cudf_memory_percent": 85,
        "cudf_gpu_batch_size_rows": 1000000
      },
      "hardware": {
        "gpu": "NVIDIA RTX PRO 6000 Blackwell Server Edition",
        "gpu_memory_mib": 97887,
        "cuda": "12.8"
      },
      "operator": "mark_distinct",
      "queries": ["Q16", "Q28", "Q94", "Q95"]
    }
  }
  ```

  ### Known limitation: residual overhead on key-growth batches

The current implementation rebuilds `seenFilter_` and copies all `seenKeys_` via `cudf::concatenate` on every batch that introduces new distinct keys. This is O(D) per such batch, where D is the total distinct keys seen so far. For queries with many MarkDistinct nodes operating on a large key space (e.g. Q28 with 6 nodes on SF100), this accounts for the remaining +2.8% gap vs CPU.

#### Possible optimization: amortized doubling with a pending buffer

Accumulate new keys in a small `pendingKeys_` staging table and only consolidate into `seenKeys_` (and rebuild `seenFilter_`) when `pendingKeys_` reaches the size of `seenKeys_`. This reduces total copy work from O(D × B_new) to O(D) amortized.

Correctness requirement: the anti-join must probe **both** `seenFilter_` and a secondary `pendingFilter_` built over `pendingKeys_`. Without this, keys in the pending buffer between merges are invisible to the filter and can be falsely marked as new. For example: batch 1 sees key A (→ seenKeys), batch 2 sees key B (→ pendingKeys, no merge yet), batch 3 sees B again — if only `seenFilter_` is probed, B gets marked as first-occurrence twice.

Trade-offs:
- Extra probe per batch against `pendingFilter_` (small table, cheap)
- Peak memory at merge time is ~3D (seen + pending + merged result) vs ~2D with the current approach — higher OOM risk on GPU
- In cudf-Velox, batches are typically large (up to max int rows), so B_new is small and the current approach may be sufficient for most workloads

Not implemented in this PR to keep scope contained. Documented here so the optimization can be picked up if key-growth-heavy queries become a pain point. See discussion [here](facebookincubator#16974 (comment)).

Pull Request resolved: facebookincubator#16974

Reviewed By: pratikpugalia

Differential Revision: D100670746

Pulled By: peterenescu

fbshipit-source-id: dafef1412818db97b0d12e3405b68abaf1c49d88
…ubator#17050)

Summary:
Pull Request resolved: facebookincubator#17050

Relax the format checks in FileIndexReader::createIndexReader() and
HiveIndexSource::addSplits() to accept SST file format alongside NIMBLE
and FLUX. The Nimble-specific RowReaderOptions (setIndexEnabled,
setEagerFirstStripeLoad) are now guarded by != SST rather than == NIMBLE,
so new stripe-based formats get the correct defaults automatically.

Also adds 9 end-to-end tests exercising SST index join through the full
HiveIndexSource -> FileIndexReader -> SstIndexReader -> rodos stack:
pointLookup, pointLookupWithMisses, prefixLookup, rangeScan,
multipleProbeBatches, outputColumnSelection, remainingFilter,
remainingFilterAllFiltered, allMisses.

Reviewed By: zacw7, xiaoxmeng

Differential Revision: D97697248

fbshipit-source-id: 5002e5b8ee59276c11a7d89c9f85dc62ef7e4412
facebookincubator#17150)

Summary:
Fix operator hang caused by missing `std::move()` in `getOutput()`.

`CudfEnforceSingleRow::getOutput()` returned `input_` without `std::move()`, leaving `input_` non-null after returning. Since `isFinished()` checks `noMoreInput_ && input_ == nullptr`, it never returned true, causing the driver to poll the operator indefinitely.

Pull Request resolved: facebookincubator#17150

Reviewed By: pratikpugalia

Differential Revision: D100671193

Pulled By: peterenescu

fbshipit-source-id: 08832cd2238a2aa71e1cae0e7522c78ee8b5fee6
Summary:
Pull Request resolved: facebookincubator#17131

Add s2geometry 0.12.0 installation to Velox's CI setup scripts so that Docker images can be rebuilt with s2geometry as a system package.

Changes:
- Add install_s2geometry function to setup-common.sh (downloads, patches, and builds s2geometry from source)
- Add s2geometry install step to all platform setup scripts (Ubuntu, CentOS 9, Fedora, macOS)
- Add S2GEOMETRY_VERSION=0.12.0 to setup-versions.sh

Reviewed By: kgpai, pratikpugalia

Differential Revision: D100426816

fbshipit-source-id: 4295cb63bda184444e0cbd837bc32713a329e6ef
…ookincubator#17094)

Summary:
Pull Request resolved: facebookincubator#17094

Refactor `makeFieldNotFoundErrorMessage` into a `formatAvailableFields` helper
and pass the template string and arguments directly to VELOX_USER_FAIL, so that
`messageTemplate()` properly captures the static template for programmatic error
categorization.

Previously, the error message was fully pre-formatted into a single string before
being passed to the macro, which meant the template was the formatted message itself.

Reviewed By: peterenescu

Differential Revision: D100205606

fbshipit-source-id: 86b2a1eeb5d7ebff56920e2ace5ad6ba843ef360
…acebookincubator#17096)

Summary:
Pull Request resolved: facebookincubator#17096

Fix callsites that pre-format error messages (via `fmt::format`, `std::stringstream`,
or string concatenation) before passing them to VELOX_FAIL/VELOX_USER_FAIL macros.
Pre-formatting causes the macro to take the single-arg path where the formatted
message becomes the template, defeating template-based error categorization.

The fix passes format strings and arguments directly to the macros so that
`messageTemplate()` properly captures the static template.

Reviewed By: peterenescu

Differential Revision: D99974595

fbshipit-source-id: 868cea2cfac1dc2283e5f6eb3354c932b8fb4282
…ctions (facebookincubator#17097)

Summary:
Pull Request resolved: facebookincubator#17097

Fix callsites that pre-format error messages (via `fmt::format`, `std::stringstream`,
or string concatenation) before passing them to VELOX_FAIL/VELOX_USER_FAIL/
VELOX_UNSUPPORTED macros. Pre-formatting causes the macro to take the single-arg
path where the formatted message becomes the template, defeating template-based
error categorization.

The fix passes format strings and arguments directly to the macros so that
`messageTemplate()` properly captures the static template.

Reviewed By: peterenescu

Differential Revision: D100205508

fbshipit-source-id: 1556904f44e3bcd6fc5c53b3569db7285ae9fe57
facebookincubator#17066)

Summary:
Pull Request resolved: facebookincubator#17066

DistinctAggregations used to assume all arguments are columns and
throw when there are literal arguments. This assumption does not
always hold. This diff fixes this issue by allowing DistinctAggregations
to handle literal arguments. Literal arguments are omitted during the
deduplication and added back before passing the deduplicated
inputs to the aggregation function.

This diff introduces a new ConstantDistinctAggregations class to handle
the case when all distinct arguments are constants. In this situation,
deduplication is unnecessary since there is at most one distinct value.
ConstantDistinctAggregations tracks a boolean flag per group to record
whether the constant has been seen, and defers the actual feeding of
the constant value to the aggregation function until extraction time,
where the actual aggregation result is computed once and broadcast to
all groups that have seen inputs. These aggregations cannot always be
constant-folded by the planner because they may depend on per-group
conditions such as FILTER (WHERE ...) clauses.

This diff fixes facebookincubator#17057.

Reviewed By: mbasmanova

Differential Revision: D99911616

fbshipit-source-id: d1dfeda614b7ecd4631e4958997ee81d9253be46
…acebookincubator#17153)

Summary:
Pull Request resolved: facebookincubator#17153

The Window Fuzzer's `generateKRowsFrameBound()` function previously used `[INT64_MIN, INT64_MAX]` as the range for random k values 30% of the time. This produced negative frame offsets (e.g. `-5 PRECEDING`) which are semantically invalid — both Presto and Velox reject them. When the fuzzer verifies results against a Presto reference database, these invalid queries cause Presto errors, inflating the 'reference DB failed' count and pushing the verification ratio below the 50% threshold.

Analysis of 8 recent Window Fuzzer failures on GitHub CI showed that 75 out of 105 total Presto errors (71%) were caused by negative frame offsets. This was the dominant source of flakiness (8% failure rate on push-to-main).

Fix: Change the distribution of generated k values to:
- 70%: valid, small range [1, 10]
- 20%: valid, full positive range [1, INT64_MAX]
- 10%: invalid, full range [INT64_MIN, INT64_MAX] to exercise negative/zero offsets and verify both Velox and the reference DB reject them consistently.

This retains coverage of invalid inputs at a low enough rate to avoid pushing the verification ratio below threshold, while ensuring the fuzzer verifies error behavior against the reference DB (requires the error verification in the follow-up diff D100732002).

Reviewed By: kagamiori

Differential Revision: D100657400

fbshipit-source-id: 1063cc96a764cc15d5a66528dd220ce5870ebdf6
Summary:
Pull Request resolved: facebookincubator#17159

Avoid setting weight to 0, as it effectively gives a signal to the scheduler
that split has 0 data to process

Reviewed By: xiaoxmeng

Differential Revision: D100745378

fbshipit-source-id: dcbe770456dc92a1fa180f7b6c184b2c88614162
…acebookincubator#17147)

Summary:
The clone currently only carries the existing `ReadFileInputStream` and not all
original constructor inputs, so rebuilding from ReadFile would not be behavior-
equivalent. This PR makes it protected to allow reuse of `input_`
(ReadFileInputStream), which retains the internal `fileIoContext_`.

Follow-up for: facebookincubator@084f222.

Pull Request resolved: facebookincubator#17147

Reviewed By: kgpai

Differential Revision: D100708786

Pulled By: peterenescu

fbshipit-source-id: 123a6de4c1a7567519762d12e315023aa818bc3f
…gregate.rst (facebookincubator#17001)

Summary:
Pull Request resolved: facebookincubator#17001

Reordered all function entries in array.rst and aggregate.rst to be in strict alphabetical order, following the approach of D96488148.

array.rst: Moved none_match, array_split_into_chunks, array_max_by, arrays_overlap, arrays_union, array_position, l2_norm, array_sum, sequence, and remove_nulls to their correct alphabetical positions.

aggregate.rst: Fixed ordering within sections:
- General: any_value before arbitrary, geometric_mean before histogram, max/max(n) before max_by, min/min(n) before min_by
- Statistical: var_pop and var_samp before variance

___

overriding_review_checks_triggers_an_audit_and_retroactive_review
Oncall Short Name: presto_batch

Differential Revision: D99121480

fbshipit-source-id: 1f6382aa62f8052236dedc7e868d99b7a4693a01
…ookincubator#17152)

Summary:
Fixes facebookincubator#17141

Two bugs prevented build error extraction from working:

1. **ANSI escape codes**: The build uses `-fdiagnostics-color=always` which embeds ANSI escape sequences in the log. The grep for `: error:` fails because invisible codes sit between `:` and `error`. Fix: strip codes with `sed` before grepping.

2. **Wrong grep pattern**: The pattern `'^.*:(error|fatal error):'` expected `:error:` with no space, but GCC format is `file:line:col: error: msg` (space before `error`). This means the grep never matched even without ANSI codes. Fix: use `' (error|fatal error):'` instead.

Applied to both the adapters and ubuntu-debug extract steps.

Pull Request resolved: facebookincubator#17152

Test Plan:
- [x] Verified sed strips ANSI codes correctly
- [x] Verified grep pattern matches `error:` and `fatal error:` in GCC format
- [x] Verified normal build lines don't match
- [ ] Validate on next PR with a build failure that BUILD_FAILURE_DETAILS is populated

Reviewed By: mbasmanova

Differential Revision: D100832340

Pulled By: kgpai

fbshipit-source-id: 4d9fc625dfc3eb8983a5bf99c98a8d3b3cc622b3
…CudfTable (facebookincubator#17164)

Summary:
Fixes facebookincubator#17163.

`toCudfTable()` frees Arrow host buffers immediately after `cudf::from_arrow()` returns, but `from_arrow()` uses `cudaMemcpyBatchAsync` (CUDA 13.0+) with `cudaMemcpySrcAccessOrderStream` which defers reading source buffers until the stream reaches each copy. The freed host memory is then read by the CUDA driver's event handler thread, causing a SIGSEGV.

Move `stream.synchronize()` into `toCudfTable()` before the Arrow resource release so that all async H2D copies complete while source buffers are still valid. Remove the now-redundant synchronize from `CudfFromVelox::getOutput()`.

## Testing

Ran `velox_cudf_tpch_benchmark` Q11 SF100 with `--num_repeats=100 --num_drivers=4` — crashes immediately without the fix, completes all iterations with the fix (verified across multiple runs). All existing Velox-cuDF tests are passing.

Pull Request resolved: facebookincubator#17164

Reviewed By: kKPulla

Differential Revision: D100832047

Pulled By: peterenescu

fbshipit-source-id: eece785d5c69f748e7531801cf0eed525a3167da
…acebookincubator#17185)

Summary:
The CI Failure Comment workflow (`ci-failure-comment.yml`) fails on fork PRs because the `claude-code-action` OIDC token exchange checks that the triggering actor has write access to the repository. Fork PR authors don't have write access, causing a `401 Unauthorized` error.

Fix: pass `github_token: ${{ github.token }}` explicitly so the action uses the workflow's own GITHUB_TOKEN (with the declared permissions) instead of the OIDC flow.

**Security:** No change in exposure — the same token is already available to Claude via the `GH_TOKEN` env var. The workflow's token is scoped to exactly the permissions declared in the file (`contents: read`, `pull-requests: write`, `issues: write`, `actions: read`) and is automatically revoked after the job completes.

Example failure: https://github.com/facebookincubator/velox/actions/runs/24424455191/job/71354986089

Pull Request resolved: facebookincubator#17185

Test Plan:
- Verified fix by pushing branch to upstream and dispatching `workflow_dispatch` against a failed fork PR run (run ID `24422420081`, PR facebookincubator#16048)
- Run succeeded: https://github.com/facebookincubator/velox/actions/runs/24426241515

Reviewed By: pratikpugalia

Differential Revision: D100885246

Pulled By: kgpai

fbshipit-source-id: 134386016611fae1f31e287d1adc3d1d5d8a9dca
…cture (facebookincubator#16934)

Summary:
Corresponding PR for facebookincubator#16885.

Pull Request resolved: facebookincubator#16934

Reviewed By: kKPulla

Differential Revision: D100680297

Pulled By: peterenescu

fbshipit-source-id: 7e091be403fbc81e74c485c88e2f3d622018a8bb
Raymond Lin and others added 24 commits May 15, 2026 03:14
…ookincubator#17513)

Summary:
Pull Request resolved: facebookincubator#17513

Add partition key information to the per-batch scan stats callback.

Changes:
- Add `getPartitionKeys()` virtual method to `DataSource` interface with default empty implementation. `FileDataSource` overrides to return the current split's partition keys.
- Add `partitionKeys` pointer field to `ScanBatchEvent`, populated from `dataSource_->getPartitionKeys()` in `TableScan::getOutput()`.
- New test `scanBatchCallbackPartitionKeys` verifying partition keys propagate through the callback.

Reviewed By: Yuhta

Differential Revision: D105112445

fbshipit-source-id: 7655311b32017703b40627b82028545ffe99b92c
…5871)

Summary:
This PR adds support for the `TZDIR` environment variable in the `tzdb` implementation.

Previously, the timezone database discovery logic was limited to standard system paths (e.g., `/usr/share/zoneinfo`, `/usr/share/zoneinfo/uclibc`). This limitation caused issues in large scale data center since if we want to upgrade tzdata dependency for velox, we have to upgrade the system's.

With this change, `__libcpp_tzdb_directory` now checks for `TZDIR` first. This aligns Velox's behavior with other standard C++ libraries (like abseil-cpp) and provides users with greater flexibility to configure the timezone database location.

Pull Request resolved: facebookincubator#15871

Reviewed By: apurva-meta

Differential Revision: D105270088

Pulled By: pedroerp

fbshipit-source-id: 1f90acb7f31cd394177b67c9ba6b68e7e4e13f95
)

Summary:
Added simd::any, simd::none, and simd::all in SimdUtil.h for boolean-only SIMD mask checks.

On AArch64, these helpers use xsimd::any/none/all directly. On x86, they preserve the
previous behavior by reducing through simd::toBitMask so code generation stays aligned
with the original path.

On Arm Neon, xsimd::any is one instruction: vmaxvq_u64, whereas simd::toBitMask is two
instructions: vandq_u64 + vaddvq_u64. On x86 AVX2, both are only one instruction, and
our experiment shows simd::toBitMask gives a better performance on AVX2 due to better
further optimization by the compiler. So, this PR only changes what's used outside x86.
We have verified that the instructions generated on x86 remain unchanged.

Switch the boolean-only toBitMask call sites in:

MmapAllocator
HashTable
VectorHasher-inl
KeyEncoder
Filter

Leaving toBitMask unchanged at sites that still need exact lane masks for bit operations such
as ctz, popcount, filtering, or packed-bit writes.

This improves performance on Arm for the relevant boolean-only reduction paths, while
preserving the previous x86 behavior.

Pull Request resolved: facebookincubator#17257

Reviewed By: jagill

Differential Revision: D105303951

Pulled By: mbasmanova

fbshipit-source-id: b220cbb5d555e9512fc5c4dc394a60c87355822c
…acebookincubator#17525)

Summary:
Pull Request resolved: facebookincubator#17525

Strobelight profiling (https://fburl.com/scuba/strobelight_services/oeoxrdzz) shows that `__folly_memcpy` via `AlignedBuffer::reallocate` is the top CPU consumer in dai_data/waas, costing $3.6M/yr (16.12% of all memcpy CPU). The root cause is that `MemoryPoolImpl::reallocate` always does alloc + memcpy + free, even when jemalloc could expand the allocation in-place via `::realloc()`.

This diff adds a `reallocateBytes` method to `MemoryAllocator` that tries in-place reallocation first. `MallocAllocator` overrides `reallocateBytesWithoutRetry` to use `::realloc()`, which jemalloc can often service without moving data (avoiding the expensive memcpy). Other allocators (e.g., MmapAllocator) fall back to the existing alloc + memcpy + free path via the default implementation.

The new path is gated by `--velox_enable_inplace_realloc`, which **defaults to true** so all Velox-using services get the win out of the box. The legacy alloc + memcpy + free path is kept behind the gflag in `MemoryPoolImpl::reallocate` as a per-service safety knob — any service that needs to opt out can override the gflag locally. Once the in-place path has baked across the fleet, the gflag and the legacy branch can be deleted in a follow-up.

The change is backward-compatible at the API level: when `::realloc()` cannot expand in-place, the implementation falls back to allocating new memory + memcpy + free (same behavior as today).

Reviewed By: tanjialiang

Differential Revision: D101689101

fbshipit-source-id: 84e7c9c4005c3efb904ad5d0ca3af07ee663c074
…17532)

Summary:
Fix
```
/__w/velox/velox/velox/experimental/cudf/connectors/hive/CudfSplitReader.cpp: In constructor 'facebook::velox::cudf_velox::connector::hive::CudfSplitReader::CudfSplitReader(std::shared_ptr<facebook::velox::cudf_velox::connector::hive::CudfHiveConnectorSplit>, std::shared_ptr<const facebook::velox::connector::hive::HiveTableHandle>, const facebook::velox::RowTypePtr&, const std::vector<std::__cxx11::basic_string<char> >&, facebook::velox::FileHandleFactory*, folly::Executor*, const facebook::velox::connector::ConnectorQueryCtx*, const std::shared_ptr<facebook::velox::cudf_velox::connector::hive::CudfHiveConfig>&, const std::shared_ptr<facebook::velox::io::IoStatistics>&, const std::shared_ptr<facebook::velox::IoStats>&, bool, const cudf::ast::expression*)':
/__w/velox/velox/velox/experimental/cudf/connectors/hive/CudfSplitReader.cpp:77:7: error: no matching function for call to 'facebook::velox::dwio::common::ReaderOptions::ReaderOptions(facebook::velox::memory::MemoryPool*&, std::__shared_ptr<facebook::velox::io::IoStatistics, __gnu_cxx::_S_atomic>::element_type*, std::__shared_ptr<facebook::velox::io::IoStatistics, __gnu_cxx::_S_atomic>::element_type*)'
   77 |       baseReaderOpts_(pool_, ioStatistics_.get(), ioStatistics_.get()),
      |       ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from /__w/velox/velox/./velox/experimental/cudf/connectors/hive/CudfHiveConnectorSplit.h:20,
                 from /__w/velox/velox/./velox/experimental/cudf/connectors/hive/CudfSplitReader.h:20,
                 from /__w/velox/velox/velox/experimental/cudf/connectors/hive/CudfSplitReader.cpp:18:
/__w/velox/velox/./velox/dwio/common/Options.h:579:12: note: candidate: 'facebook::velox::dwio::common::ReaderOptions::ReaderOptions(facebook::velox::memory::MemoryPool*)'
  579 |   explicit ReaderOptions(velox::memory::MemoryPool* pool)
      |            ^~~~~~~~~~~~~
/__w/velox/velox/./velox/dwio/common/Options.h:579:12: note:   candidate expects 1 argument, 3 provided
/__w/velox/velox/./velox/dwio/common/Options.h:572:7: note: candidate: 'facebook::velox::dwio::common::ReaderOptions::ReaderOptions(const facebook::velox::dwio::common::ReaderOptions&)'
  572 | class ReaderOptions : public io::ReaderOptions {
```
Refs: https://github.com/facebookincubator/velox/actions/runs/25913198690/job/76163332652?pr=17500

Pull Request resolved: facebookincubator#17532

Reviewed By: ximyu, duxiao1212

Differential Revision: D105332599

Pulled By: xiaoxmeng

fbshipit-source-id: fbe60e23a6c1277a083937a97d0cbd077dd6fabb
Summary:
Pull Request resolved: facebookincubator#17519

Velox's coercion rules don't match Presto. Presto allows `BIGINT -> REAL`, Velox doesn't. So `select typeof(real '1' / bigint '2')` returns `real` in Presto but `double` in Axiom (which uses Velox's coercion). The same root cause affects all REAL/BIGINT arithmetic, comparisons, `IN`, `CASE`, `COALESCE`, set operations, and `JOIN USING`.

Before this change, Velox's `TypeCoercer` had a single hard-coded rule set with no per-dialect customization. This change makes coercion rules customizable: a SQL dialect can ship its own complete rule set that matches its semantics, without modifying Velox's defaults and without colliding with other dialects in the same process.

`TypeCoercer` is now an instance class that owns a complete `CoercionEntry` rule set. Velox ships `TypeCoercer::defaults()` holding today's lossless built-in rules. Each SQL dialect ships its own complete instance. This change adds Presto's: `velox::functions::prestosql::typeCoercer()`, the Velox defaults plus `BIGINT -> REAL`. Rule sets are flat per-dialect (no live mixing, no layering): cost numbers are only meaningful inside a single rule set.

Constructor validates that:

- entries are non-null;
- both source and target are built-in types (custom-type coercions remain in `CastRulesRegistry`);
- source DECIMAL is the canonical placeholder `DECIMAL(1, 0)`;
- no `DECIMAL -> DECIMAL` entries (handled by the type system);
- cost is unique per source type;
- no duplicate `(fromName, toName)` keys.

The Velox-side resolution API (`SignatureBinder`, `resolveFunction*WithCoercions`, `SwitchExpr`/`CoalesceExpr` planning helpers) gains `const TypeCoercer& coercer = TypeCoercer::defaults()` as a tail parameter so Velox-internal callers compile unchanged. The downstream Axiom integration -- threading the dialect coercer through `PlanBuilder::Context` and wiring `presto::typeCoercer()` in `PrestoParser` -- lands as a follow-up.

Coercion is a planning-time concern. Runtime expression evaluators (`Expr` ctors, `ExprCompiler`) do not consult `TypeCoercer` and are unchanged.

`velox/docs/develop/types.rst` adds a new "Type Coercion" section covering the `TypeCoercer` mechanism, customization scope, `coerceTypeBase` vs `coercible`, the default-rules table, the Presto-specific additions, and the 2-step DECIMAL handling per case.

# Breaking Changes

There are no static back-compat wrappers. Consumers that called the old static methods (`velox::TypeCoercer::coercible(...)`, `velox::TypeCoercer::leastCommonSuperType(...)`, etc.) will fail to compile until they are updated to invoke the instance API on a `TypeCoercer&` -- typically `TypeCoercer::defaults().coercible(...)` for callers that don't have a dialect coercer.

bypass-github-export-checks

Reviewed By: xiaoxmeng

Differential Revision: D105206605

fbshipit-source-id: 107f83e5757689950c9a78d264194dc23cb32306
…#17527)

Summary:
Pull Request resolved: facebookincubator#17527

Pull Request resolved: facebookincubator#17521

When a HashBuild builder task fails (e.g., OOM in finishHashBuild()) before calling HashTableCache::put(), the cache entry persists with buildComplete=false and unfulfilled ContinuePromises. This causes all subsequent tasks landing on the same executor to hang forever in kWaitForBuild state with 0B allocated memory, until the driver times out (exit_code=254). This was observed in spark3_full_shadow verifier jobs where entire jobs got stuck for 9+ hours.

The fix has two parts:

1. HashBuild::close(): When the builder task closes without completing the build, call HashTableCache::drop() to remove the poisoned cache entry. This is gated on: caching is enabled, this is the builder task, and buildComplete is still false.

2. HashTableCache::drop(): Fulfill any pending buildPromises before erasing the entry, so waiting tasks are unblocked instead of hanging forever. Waiters wake up and fail with a clear error via the existing VELOX_CHECK in receivedCachedHashTable().

Also updates the error message in receivedCachedHashTable() to include the cache key and mention possible OOM cause, and converts the builderFailureLeavesWaitersStuck test to verify the fix instead of documenting the bug.

Reviewed By: xiaoxmeng

Differential Revision: D105165613

fbshipit-source-id: d83e54371d182b2c8f6ebd7560286221dce33282
…tor#17535)

Summary:
Pin `tzdata` (and `tzdata-java` where applicable) in all four velox-dev dockerfiles. Closes / mitigates facebookincubator#17522.

Without an explicit pin, every docker rebuild — triggered by any change to `scripts/docker/*.dockerfile` or `scripts/setup-*.sh` — silently picks up the latest `tzdata` from the distro repos. tzdata `2026b`'s encoding of British Columbia's permanent-PDT change broke Velox's bundled libc++ chrono::tzdb parser and caused the Presto-SOT fuzzer regression in facebookincubator#17522:

- 4 fuzzers (Expression / Window / Aggregation / Join, all "with Presto as source of truth") fail on any plan touching `TIMESTAMP WITH TIME ZONE`.
- The mismatch shape is exactly `(3,600,000 ms) << 12 = 14,745,600,000` — i.e., 1-hour offset, same `tz_id`. Velox's tzdb returns the wrong offset for the affected timezones; Presto Java (reading `tzdata-java`) is correct.

The trigger was a routine docker rebuild on 2026-05-14 that bumped `tzdata` from `2026a-1.el9` → `2026b-1.el9`. With this PR the version is explicit and reproducible.

## Per-image pin

| Image | Pinned to | Notes |
|---|---|---|
| `centos9`, `pyvelox` | `tzdata-2026a-1.el9` | pinned in `base-image` stage |
| `adapters` | `tzdata-2026a-1.el9` + `tzdata-java-2026a-1.el9` | inherits from `centos9`; `tzdata-java` re-pinned after Java install |
| `presto-java`, `spark-server` | `tzdata-2026a-1.el9` + `tzdata-java-2026a-1.el9` | defensive re-pin in `java-base` so it builds even before `centos9` is rebuilt with the pin |
| `fedora` | `tzdata-2025c-1.fc42` (current) | pinned in `base-build` and `fedora` stages |
| `ubuntu` (22.04) | `tzdata=2026a-0ubuntu0.22.04.1n` (current) | `apt-get install --allow-downgrades` |

## Mechanism

Pin uses an install-or-downgrade pattern that works whether the base image ships an older, equal, or newer tzdata than the pinned version:

```dockerfile
ARG CENTOS_TZDATA_VERSION=2026a-1.el9
RUN dnf -y install "tzdata-${CENTOS_TZDATA_VERSION}.noarch" || \
      dnf -y downgrade "tzdata-${CENTOS_TZDATA_VERSION}.noarch"
```

Verified manually on a fresh `quay.io/centos/centos:stream9` (currently shipping `2026b`): `dnf install tzdata-2026a-1.el9.noarch` performs the downgrade as expected, so the `|| dnf downgrade` fallback is defensive.

## Bumping intentionally

Change the `{CENTOS,FEDORA,UBUNTU}_TZDATA_VERSION` ARG default at the top of each dockerfile, rebuild the docker images, and re-run the Presto-SOT fuzzers locally to confirm no regression before merging.

Pull Request resolved: facebookincubator#17535

Test Plan:
- [ ] Docker images rebuild successfully (workflow `docker.yml` will fire automatically on this PR since it touches `scripts/docker/*.dockerfile`).
- [ ] After the rebuilt images land on `:presto-java`, the four Presto-SOT fuzzer jobs (Expression, Window, Aggregation, Join) go green again on `main`.
- [ ] `rpm -q tzdata` inside the rebuilt `velox-dev:presto-java` shows `tzdata-2026a-1.el9.noarch` (and `tzdata-java` matches).

Reviewed By: kevinwilfong, pratikpugalia

Differential Revision: D105375476

Pulled By: kgpai

fbshipit-source-id: 9c46190a1d848b42b0c9a1806bbff377d638b984
…okincubator#17517)

Summary:
Follow-up to facebookincubator#17238 (comment)
Alphabetize source file lists, library dependencies, `add_subdirectory` calls, and test blocks across all 9 CMake files in `velox/experimental/cudf/`.

- Dependencies are grouped by type (cudf/arrow, velox\_\*, third-party), then alphabetized within each group
- Test blocks are kept together (`add_executable`, `add_test`, `set_tests_properties`, `target_link_libraries`) and ordered alphabetically by test name
- Add `velox_add_cudf_test()` helper and `CUDF_TEST_DEFAULT_LIBS` variable to eliminate repeated boilerplate across all test targets, remove unused link dependencies, and register the custom command with gersemi.

Pull Request resolved: facebookincubator#17517

Reviewed By: pratikpugalia

Differential Revision: D105390344

Pulled By: mbasmanova

fbshipit-source-id: 5549a3db449dec4bb05658d3d8b5fe053fda8ea9
…acebookincubator#17534)

Summary:
Pull Request resolved: facebookincubator#17534

X-link: facebookincubator/nimble#731

Remove the `defaultIoStats_` fallback in `TabletReader` and instead enforce that callers always supply `ioOptions` with a non-null `metadataIoStats()`. The constructor now does an early `NIMBLE_CHECK` / `NIMBLE_CHECK_NOT_NULL` in its init list, so any caller that forgets to set the field fails fast at construction with a clear message.

Reviewed By: xiaoxmeng

Differential Revision: D105338687

fbshipit-source-id: 53d2b708ea5ea96c4ca27dc4b50b4d7720e0afa8
Summary:
Streamline the PR review workflow by reducing the number of manual commands and permission prompts needed to fetch PR data and post reviews. The style guide captures review conventions developed over many review iterations.

Add scripts and a style guide for PR reviews:

- `scripts/review/fetch.py` — fetches PR metadata, diff, comments, and reviews in one shot
- `scripts/review/post.py` — posts a review from a file (handles multiline bodies cleanly)
- `scripts/review/REVIEW_GUIDE.md` — review style guide covering tone, structure, and what to check

Pull Request resolved: facebookincubator#17524

Reviewed By: xiaoxmeng

Differential Revision: D105240290

Pulled By: mbasmanova

fbshipit-source-id: 36df968e1131412cb5e5b0143d5b1c95e3989974
…ator#17463)

Summary:
**Summary**: when the leaf node (Scan operator) in one child of the Join operator is rolled back to Vanilla Spark, while the leaf node (Scan operator) in the other child still uses NativeScan, and the join key used by Join contains the decimal type, Spark ORC reader will prioritize using the decimal type in the table schema (e.g. DECIMAL (20,0)) over the decimal type in the ORC file (e.g. DECIMAL (38,18)) but Native ORC reader only using the decimal type in the ORC file, which ultimately leads to incorrect matching of the join key, resulting in the output of 0 rows after join.

**Root cause**:
Hive ORC fixed the DECIMAL (38,18) in footer, and the true scale of each row is written in the SECONDRY stream. The original ORC reader used footer's
The use of scale instead of the scale of the meta store (table schema) results in inconsistency between the output of NativeScan and Vanilla Spark.

Fixes: facebookincubator#17462

This PR proposes to pass the request type into `SelectiveColumnReader`.

This PR could help to fix apache/gluten#11980

Pull Request resolved: facebookincubator#17463

Reviewed By: pratikpugalia

Differential Revision: D105464139

Pulled By: mbasmanova

fbshipit-source-id: b1348d9be467c2ec74ee84387d4b3fb0d30deda9
…folder (facebookincubator#17540)

Summary:
Pull Request resolved: facebookincubator#17540

X-link: facebookincubator/nimble#741

Moves encoding selection files (`EncodingSelection.h`, `EncodingSelectionPolicy.h`, `EncodingSizeEstimation.h`, `Statistics.h/.cpp`, `EncodingIdentifier.h`) into a new `encodings/selection/` subfolder for better code organization. The files remain in the same BUCK target to preserve the existing dependency structure between `EncodingSelection.h` and `EncodingFactory.h`. All include paths across ~55 consumer files are updated accordingly.

Reviewed By: pratikpugalia

Differential Revision: D105446634

fbshipit-source-id: d5665c130c8118dbe0da1f633b14546f6cb9403f
…kincubator#17541)

Summary:
`CudfToVelox` and `CudfFromVelox` adapter operators use synthetic planNodeIds (e.g. 4-to-velox) and are not part of the plan tree. Their stats were previously invisible in `printPlanWithStats` output, making it difficult to observe adapter operator performance (e.g. GPU-CPU conversion overhead). Using `setStatSplitter` to remap their stats to the parent plan node at collection time, they now appear as operator-type breakdown lines under their parent node — the same mechanism used by `HashBuild/HashProbe` under `HashJoinNode`.

## Changes

- `CudfConversion.cpp`: `setStatSplitter` in CudfFromVelox/CudfToVelox constructors remaps stats planNodeId to parent.
- `AdapterOperatorTest.cpp`: Verifies stats merge into parent plan node.
- `operators.rst`: GPU Operators (cuDF) section — plan rewriting, adapter operators, naming convention.
- `print-plan-with-stats.rst`: GPU Operator Stats section — example output with multi-operator breakdown.

## Example output
### GPU mode output (TPC-H Q6 SF10)
```
Execution time: 346ms
Splits total: 1, finished: 1
-- Aggregation[4][FINAL a0 := sum("a0")] -> a0:DOUBLE
   Output: 2 rows (16B, 2 batches), Cpu time: 576.15us, Wall time: 632.42us, Blocked wall time: 0ns, Peak memory: 0B, Memory allocations: 0, CPU breakdown: B/I/O/F (32.88us/4.81us/524.85us/13.61us)
   CudfToVelox: Input: 1 rows (8B, 1 batches), Output: 1 rows (8B, 1 batches), Cpu time: 223.65us, Wall time: 254.26us, Blocked wall time: 0ns, Peak memory: 0B, Memory allocations: 0, Threads: 1, CPU breakdown: B/I/O/F (18.46us/2.35us/195.12us/7.72us)
   CudfReduceFINAL: Input: 1 rows (8B, 1 batches), Output: 1 rows (8B, 1 batches), Cpu time: 352.50us, Wall time: 378.16us, Blocked wall time: 0ns, Peak memory: 0B, Memory allocations: 0, Threads: 1, CPU breakdown: B/I/O/F (14.42us/2.46us/329.73us/5.89us)
  -- LocalPartition[3][GATHER] -> a0:DOUBLE
     Output: 2 rows (16B, 2 batches), Cpu time: 107.75us, Wall time: 139.80us, Blocked wall time: 343.93ms, Peak memory: 0B, Memory allocations: 0, CPU breakdown: B/I/O/F (18.36us/46.05us/23.23us/20.11us)
     CudfLocalPartition: Input: 1 rows (8B, 1 batches), Output: 1 rows (8B, 1 batches), Cpu time: 88.15us, Wall time: 109.59us, Blocked wall time: 0ns, Peak memory: 0B, Memory allocations: 0, Threads: 1, CPU breakdown: B/I/O/F (11.34us/46.05us/11.59us/19.17us)
     LocalExchange: Input: 1 rows (8B, 1 batches), Output: 1 rows (8B, 1 batches), Cpu time: 19.60us, Wall time: 30.21us, Blocked wall time: 343.93ms, Peak memory: 0B, Memory allocations: 0, Threads: 1, CPU breakdown: B/I/O/F (7.02us/0ns/11.64us/940ns)
    -- Aggregation[2][PARTIAL a0 := sum(ROW["p0"])] -> a0:DOUBLE
       Output: 1 rows (8B, 1 batches), Cpu time: 10.82ms, Wall time: 10.84ms, Blocked wall time: 0ns, Peak memory: 0B, Memory allocations: 0, Threads: 1, CPU breakdown: B/I/O/F (13.44us/4.65us/10.80ms/5.03us)
      -- Project[1][expressions: (p0:DOUBLE, multiply(ROW["l_extendedprice"],ROW["l_discount"]))] -> p0:DOUBLE
         Output: 1139264 rows (8.83MB, 1 batches), Cpu time: 44.62ms, Wall time: 44.64ms, Blocked wall time: 0ns, Peak memory: 0B, Memory allocations: 0, Threads: 1, CPU breakdown: B/I/O/F (8.92us/6.32us/44.60ms/5.13us)
        -- TableScan[0][table: lineitem, range filters: [(l_discount, DoubleRange: [0.050000, 0.070000] no nulls), (l_quantity, DoubleRange: (-inf, 24.000000) no nulls), (l_shipdate, BigintRange: [8766, 9130] no nulls)]] -> l_shipdate:DATE, l_extendedprice:DOUBLE, l_quantity:DOUBLE, l_discount:DOUBLE
           Input: 1139264 rows (30.97MB, 1 batches), Output: 1139264 rows (30.97MB, 1 batches), Cpu time: 147.16ms, Wall time: 288.43ms, Blocked wall time: 0ns, Peak memory: 1.72MB, Memory allocations: 1957, Threads: 1, Splits: 1, CPU breakdown: B/I/O/F (4.16us/0ns/147.15ms/1.03us)
```

Pull Request resolved: facebookincubator#17541

Reviewed By: srsuryadev

Differential Revision: D105490119

Pulled By: mbasmanova

fbshipit-source-id: 07abd9724c6785d4bb2669e9c01ddd36fb3d5ea1
…bookincubator#16591) (facebookincubator#17353)

Summary:
## Problem

When the Parquet reader detects a schema mismatch between the file footer
and the requested schema, the error reports the converted file type and the
requested type but not the field name:

```
Converted type VARCHAR is not allowed for requested type INTEGER
```

For tables with many columns, the reader has to cross-reference the file
footer manually to figure out which column triggered the mismatch.

## Change

Append the file column name (`schemaElement.name`) to
`kTypeMappingErrorFmtStr` in `ReaderBase::convertType`:

```
Converted type VARCHAR is not allowed for requested type INTEGER for file column 'name'
```

The name is added at the end so the existing static prefix is preserved,
matching the project style guide that puts runtime information after the
static description. All 19 throw sites that share `kTypeMappingErrorFmtStr`
in `convertType` are updated; the existing
`fileColumnVarcharToMetadataColumnMismatchTest` is extended to assert the
column name appears in the message.

## Verification

- Existing `VELOX_ASSERT_THROW` substring matchers in
  `ParquetReaderWideningTest` and `ParquetTableScanTest` still match —
  they check for `"is not allowed for requested type"`, which remains a
  substring of the new message.
- The updated assertion in `fileColumnVarcharToMetadataColumnMismatchTest`
  would have failed before this change (the column name `"name"` was not
  in the error) and passes after.

Resolves facebookincubator#16591.

Pull Request resolved: facebookincubator#17353

Reviewed By: srsuryadev

Differential Revision: D105506987

Pulled By: mbasmanova

fbshipit-source-id: d4fff9657d1a7a843b3794dc1a9d54f03c50ada0
…zer (facebookincubator#17536)

Summary: Pull Request resolved: facebookincubator#17536

Reviewed By: srsuryadev

Differential Revision: D105506016

Pulled By: mbasmanova

fbshipit-source-id: 63cbbd8e0edcb9ad2ea56e50b1909cafab71f6d2
Summary:
Pull Request resolved: facebookincubator#17383

This change was generated by pointing an AI agent at hot functions showing up in our internal profiles. It was reviewed by a human (outside the velox team):

Added `keyValuesMap_.reserve(inputArray.size())` after `keyValuesMap_.clear()` in `MultimapFromEntriesFunction::call`. This pre-allocates bucket space in the `folly::F14FastMap` before the insertion loop, avoiding costly rehashes as elements are inserted. The upper bound `inputArray.size()` is the maximum possible number of unique keys and is a cheap O(1) estimate. The `uniqueKeys_` vector already had a similar `reserve` call.

Reviewed By: IosifSpulber

Differential Revision: D103133759

fbshipit-source-id: 6e672e81d2a6dccdc514a630be7ff5d8697294a0
…acebookincubator#17543)

Summary:
Pull Request resolved: facebookincubator#17543

`EnforceSingleRowNode` validates that its input produces exactly one row. With multiple drivers, each driver independently sees a slice of the input and may emit its own row (or NULL on empty input), violating the single-row contract. Mark it `requiresSingleThread() = true` so the runtime collapses the pipeline to one driver, matching the existing treatment of `LimitNode`, `OrderByNode`, and other operators with the same constraint.

Reviewed By: srsuryadev

Differential Revision: D105507323

fbshipit-source-id: 55feb7f63f5d4fbbf94ea1fccfedb44c0157f2d7
…ubator#17487)

Summary:
CONTEXT: Spark 3.4+ introduced bitmap aggregate functions (`bitmap_construct_agg`, `bitmap_or_agg`, `bitmap_bucket_number`, `bitmap_bit_position`, `bitmap_count`) for efficient count-distinct approximation using sub-bitmap splitting. `bitmap_construct_agg` is the core aggregation primitive — it builds a fixed-size 4096-byte bitmap (32768 bits) from input bit positions computed by `bitmap_bit_position`.

WHAT: Add the `bitmap_construct_agg` aggregate function for Spark. The function takes BIGINT input positions in [0, 32767] and produces a VARBINARY bitmap with the corresponding bits set.

Key design points:
- Uses `SimpleAggregateAdapter` with `default_null_behavior_ = false` for non-nullable output
- Fixed 4096-byte bitmap allocated lazily via `HashStringAllocator` (zero cost for empty groups)
- Out-of-range positions raise a descriptive error with plain English message
- NULL inputs are skipped; the aggregate is non-nullable (returns all-zeros bitmap for empty/all-null input)
- Partial aggregation supported: intermediate bitmaps serialized as VARBINARY, merged via 64-bit OR
- Uses `bits::setBit` for bitmap manipulation

Tests cover: basic, null handling, group-by, boundaries, invalid positions, partial+final merge, invalid intermediates, overlapping merges.

Ref: `org.apache.spark.sql.catalyst.expressions.BitmapConstructAgg`

Pull Request resolved: facebookincubator#17487

Reviewed By: srsuryadev

Differential Revision: D105567436

Pulled By: mbasmanova

fbshipit-source-id: e1c2d5f7a26d9af8f7ed5078122a7f1ae6b83430
…oned partial aggregation (facebookincubator#17547)

Summary:
The patch replaces the existing boolean stat `kAbandonedPartialAggregation` with a new integer stat `kAbandonedPartialAggregationRows`, in the HashAggregation operator so users can measure how many rows passed through after partial aggregation was abandoned. This makes performance analysis of partial hash aggregation easier.

Pull Request resolved: facebookincubator#17547

Reviewed By: apurva-meta

Differential Revision: D105576520

Pulled By: mbasmanova

fbshipit-source-id: 1c15e41cce920f592c82849c75278214aead31ab
…nge hangs (facebookincubator#17506)

Summary:
- Fix `CudfNestedLoopJoinProbe` to consume probe input when build is empty, matching CPU `NestedLoopJoinProbe` behavior. The original code set `finished_=true` in `isBlocked()` when detecting an empty build side for inner/right joins. This caused the probe operator to finish before consuming probe input, which can hang upstream exchanges waiting for their output buffers to be drained.
Addresses review comment from facebookincubator#17113: facebookincubator#17113 (comment)

Pull Request resolved: facebookincubator#17506

Reviewed By: kgpai

Differential Revision: D105583207

Pulled By: mbasmanova

fbshipit-source-id: bee4c19ad6effa20100b5bf2b30cbe28aadbe858
…r#17369) (facebookincubator#17369)

Summary:
Extends the Velox tracing framework with expression-level tracing, enabling callers to trace output batches from named expression functions (e.g., UDFs).

Changes:
- TraceCtx: Added virtual methods shouldTraceExpr() and createExprOutputTracer(op, functionName, instanceIndex) for expression-level tracing decisions and writer creation
- Expr: Added maybeSetupTracer() to create and cache a tracer during operator initialization, stored as a unique_ptr member. Uses a visited set to avoid redundant traversal of shared CSE nodes, and an instance counter to distinguish multiple Expr nodes with the same function name within a single operator
- ExprSet: Added maybeSetupTracers() to walk all expressions and set up tracers
- FilterProject: Calls ExprSet::maybeSetupTracers() during initialize(), scoped to operators where shouldTrace() returns true — expression tracing is only set up inside traced operators
- EvalCtx: Removed traceCtx plumbing (no longer needed since tracers are owned by Expr nodes)

Performance: Tracers are created once at initialization and owned by each Expr node as a unique_ptr, following the same pattern as Operator::inputTracer_. The hot path uses FOLLY_UNLIKELY on a null pointer check.

Examples:

1. Single instance: A FilterProject operator "proj-5" with projection `a * 10 as b` and tracing configured for "multiply" creates one tracer identified as ("proj-5", "multiply", 0).

2. Multiple instances of the same function: Projections `a * 10 as b, c * 20 as d` with tracing configured for "multiply" creates two tracers: ("proj-5", "multiply", 0) and ("proj-5", "multiply", 1), each producing a distinct output stream.

3. Selective tracing: Projections `a * 10 as b, a + 5 as c` with tracing configured for "multiply" only traces the multiply expression. The plus expression is skipped since it is not in the trace configuration.

4. Operator scoping: If operator "proj-5" is not in the traced operators list, none of its UDFs are traced regardless of the UDF trace configuration.

Pull Request resolved: facebookincubator#17369

Test Plan: buck2 test @//mode/dev-nosan fbcode//velox/exec/tests:test -- CustomTraceTest.exprOutputTrace CustomTraceTest.exprTraceOnlyMatchingFunctions

Reviewed By: bikramSingh91

Differential Revision: D102552818

Pulled By: patrickstuedi

fbshipit-source-id: 16780bc675d4bd12277de66ec193898398ba5cff
Summary:
Pull Request resolved: facebookincubator#17542

This diff adds FLUX as a valid index source into Velox and adds roundtrip serializatino.

Reviewed By: xiaoxmeng

Differential Revision: D105505038

fbshipit-source-id: 6af038feb9d5c401d13e4f5873f04a59ba1c94cb
@bdice bdice requested review from a team, devavret, karthikeyann and mhaseeb123 as code owners May 19, 2026 18:14
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented May 19, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@bdice
Copy link
Copy Markdown
Author

bdice commented May 21, 2026

/ok to test c7d0e59

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.