From e5b6eeb210d9a294a3787d0b51c36c0465b67c0a Mon Sep 17 00:00:00 2001 From: Sebastian Fischer Date: Tue, 5 May 2026 11:01:41 +0200 Subject: [PATCH 1/4] docs: add gotchas vignette --- pkgdown/_pkgdown.yml | 2 + vignettes/differences-from-base-r.Rmd | 119 ++++++++++++++++++++++++++ 2 files changed, 121 insertions(+) create mode 100644 vignettes/differences-from-base-r.Rmd diff --git a/pkgdown/_pkgdown.yml b/pkgdown/_pkgdown.yml index 2d3f4455..72a9b34e 100644 --- a/pkgdown/_pkgdown.yml +++ b/pkgdown/_pkgdown.yml @@ -39,6 +39,8 @@ navbar: href: articles/random-numbers.html - text: Type Promotion href: articles/type-promotion.html + - text: Differences from base R + href: articles/differences-from-base-r.html - text: Efficiency href: articles/efficiency.html - text: FAQ diff --git a/vignettes/differences-from-base-r.Rmd b/vignettes/differences-from-base-r.Rmd new file mode 100644 index 00000000..b3984a23 --- /dev/null +++ b/vignettes/differences-from-base-r.Rmd @@ -0,0 +1,119 @@ +--- +title: "Differences from base R" +output: rmarkdown::html_vignette +vignette: > + %\VignetteIndexEntry{Differences from base R} + %\VignetteEngine{knitr::rmarkdown} + %\VignetteEncoding{UTF-8} +--- + +This vignette lists major behavioral differences between {anvl} and base R that R users should be aware of when working with `AnvlArray`s. + +```{r} +library(anvl) +``` + +## Row-major vs column-major ordering + +R stores matrices and arrays in *column-major* order, while {anvl} (following XLA) uses *row-major* order. +This makes no difference when you only use shape-aware operations (subsetting, matrix multiplication, etc.) -- the indices are the same in both. +The difference shows up when you flatten an array, because the underlying data is then traversed in a different order. 
+ +Consider the 2x2 matrix below: + +```{r} +m <- matrix(1:4, nrow = 2) +m +``` + +In base R, `as.vector()` flattens it column-by-column, so we get `1, 2, 3, 4`: + +```{r} +as.vector(m) +``` + +In {anvl}, reshaping to a length-4 vector traverses the data row-by-row, so we get `1, 3, 2, 4`: + +```{r} +nv_reshape(m, shape = 4) +``` + +If you need column-major flattening in {anvl}, transpose first: + +```{r} +nv_reshape(nv_transpose(m), shape = 4) +``` + +## No recycling + +Base R *recycles* the shorter operand when two vectors of different lengths are combined elementwise: + +```{r} +c(1, 2, 3, 4) + c(1, 2) +``` + +{anvl} only auto-broadcasts *scalars* (operands with shape `integer()`). +Adding a scalar to an array works as you would expect: + +```{r} +nv_array(1:4) + 10L +``` + +But combining two non-scalar arrays of different shapes errors, even when one shape is a "tile" of the other: + +```{r, error = TRUE} +nv_array(1:4) + nv_array(1:2) +``` + +When two non-scalar arrays differ only by size-1 dimensions (numpy-style broadcasting, e.g. shape `(2, 3)` and `(1, 3)`), use `nv_broadcast_arrays()` to align them explicitly first: + +```{r} +a <- nv_array(matrix(1:6, nrow = 2)) +b <- nv_array(matrix(c(10, 20, 30), nrow = 1)) +xs <- nv_broadcast_arrays(a, b) +xs[[1]] + xs[[2]] +``` + +Note that even `nv_broadcast_arrays()` cannot replicate R's recycling for shapes like `(4)` and `(2)` -- the shapes must be broadcast-compatible in the numpy sense. + +## No `NA`s + +R has a dedicated missing-value marker (`NA`) for every atomic type. +{anvl} arrays do not -- there is no representation of "missing" at the XLA level. +When you convert R values containing `NA` into an `AnvlArray`, the `NA`s are silently turned into the closest available value of the target dtype. 
+For floating-point dtypes, that value is `NaN`: + +```{r} +nv_array(NA_real_) +``` + +```{r} +nv_array(c(1, NA, 3)) +``` + +Round-tripping back to R produces `NaN`, not `NA`: + +```{r} +as_array(nv_array(c(1, NA, 3))) +``` + +Integer dtypes have no `NaN`, but `NA_integer_` does *appear* to round-trip: + +```{r} +nv_scalar(NA_integer_) |> as.integer() +``` + +This is misleading. +R represents `NA_integer_` by reserving one specific 32-bit integer value (`-2^31 = -2147483648`) as a sentinel for missingness, leaving only `2^32 - 1` valid integers. +{anvl} has no notion of missing values and just stores that bit pattern as a regular `i32`. +The round-trip "works" only because R interprets the same bit pattern back as `NA` -- but inside {anvl} the value behaves like the integer `-2147483648`, and any computation on it (e.g. addition, comparison) will treat it as such rather than propagating missingness. +The same caveat applies in reverse: if a genuine {anvl} computation produces `-2147483648`, converting back to R will silently turn it into `NA`. + +For logical (`bool`) dtype the situation is worst: there is no spare bit pattern at all, so a bare `NA` (which is logical) is silently turned into `TRUE`: + +```{r} +nv_scalar(NA) +as.logical(nv_scalar(NA)) +``` + +If your data contains missing values, decide how to handle them *before* converting to an `AnvlArray`. From c6d47855fb4c44153028ebc3946513f151a2a208 Mon Sep 17 00:00:00 2001 From: Sebastian Fischer Date: Tue, 5 May 2026 16:32:02 +0200 Subject: [PATCH 2/4] ... 
--- NEWS.md | 3 + R/array.R | 57 +++++++++++++++++-- man/AnvlArray.Rd | 19 ++++++- man/as_array.Rd | 14 ++++- pkgdown/_pkgdown.yml | 4 +- ...ifferences-from-base-r.Rmd => gotchas.Rmd} | 34 ++++++++++- 6 files changed, 117 insertions(+), 14 deletions(-) rename vignettes/{differences-from-base-r.Rmd => gotchas.Rmd} (70%) diff --git a/NEWS.md b/NEWS.md index c8df8a76..adb9c281 100644 --- a/NEWS.md +++ b/NEWS.md @@ -34,6 +34,9 @@ * `nv_select()` to select a slice along a dimension by index. * `mean()` and `median()` now error when called with `na.rm = TRUE`, since anvl arrays do not carry `NA`s. `mean()` also rejects non-zero `trim`. +* `nv_array()`, `nv_scalar()`, and `as_array()` gained a `scan_na` argument + that opts into checking for `NA` values during host -> device and + device -> host transfers. See the "Differences from base R" vignette. ## Other diff --git a/R/array.R b/R/array.R index abae4a43..18f275a5 100644 --- a/R/array.R +++ b/R/array.R @@ -44,6 +44,12 @@ #' Backend to use (`"xla"` or `"quickr"`). #' Defaults to `default_backend()`. #' Must not be specified inside [`jit()`]. +#' @param scan_na (`logical(1)`)\cr +#' If `TRUE`, error when `data` contains any `NA` values. XLA has no +#' representation for missing values, so they are otherwise silently +#' coerced to the closest available value of the target dtype (e.g. `NaN` +#' for floats, the bit pattern `-2147483648` for `i32`, `TRUE` for +#' `bool`). Defaults to `FALSE`. #' @return ([`AnvlArray`]) #' @examplesIf pjrt::plugins_downloaded() #' # A 1-d array (vector) with shape (4). 
Default type for integers is `i32` @@ -82,7 +88,23 @@ NULL #' @rdname AnvlArray #' @export -nv_array <- function(data, dtype = NULL, device = NULL, shape = NULL, ambiguous = NULL, backend = NULL) { +nv_array <- function( + data, + dtype = NULL, + device = NULL, + shape = NULL, + ambiguous = NULL, + backend = NULL, + scan_na = FALSE +) { + assert_flag(scan_na) + if (scan_na && !is_anvl_array(data) && anyNA(data)) { + n_na <- sum(is.na(data)) + cli_abort(c( + "Input {.arg data} contains {n_na} {.val NA} value{?s}, which {?has/have} no representation at the XLA level.", + i = "Replace or drop missing values before transferring, or set {.code scan_na = FALSE} to skip this check." + )) + } if (is_anvl_array(data)) { if (!is.null(device) && !eq_device(device(data), nv_device(device, backend))) { cli_abort("Cannot change device of existing AnvlArray from {.val {device(data)}} to {.val {device}}") @@ -241,8 +263,16 @@ unwrap_if_array <- function(x) { #' @rdname AnvlArray #' @export -nv_scalar <- function(data, dtype = NULL, device = NULL, ambiguous = NULL, backend = NULL) { - nv_array(data, dtype = dtype, device = device, shape = integer(), ambiguous = ambiguous, backend = backend) +nv_scalar <- function(data, dtype = NULL, device = NULL, ambiguous = NULL, backend = NULL, scan_na = FALSE) { + nv_array( + data, + dtype = dtype, + device = device, + shape = integer(), + ambiguous = ambiguous, + backend = backend, + scan_na = scan_na + ) } #' @rdname AnvlArray @@ -298,9 +328,24 @@ shape.AnvlArray <- function(x, ...) { globals$backends[[x$backend]]$shape(x) } -#' @export -as_array.AnvlArray <- function(x, ...) { - globals$backends[[x$backend]]$as_array(x) +#' @rdname as_array +#' @param scan_na (`logical(1)`)\cr +#' If `TRUE` and the array's dtype is `i32`, error when the materialized +#' R integer vector contains any `NA_integer_` values. 
R reserves the bit +#' pattern `-2147483648` as the `NA_integer_` sentinel, so a genuine +#' device-side `i32` value of `-2147483648` is silently turned into `NA` +#' on transfer. No-op for other dtypes. Defaults to `FALSE`. +#' @export +as_array.AnvlArray <- function(x, scan_na = FALSE, ...) { + assert_flag(scan_na) + result <- globals$backends[[x$backend]]$as_array(x) + if (scan_na && (dtype(x) == as_dtype("i32")) && anyNA(result)) { + cli_abort(c( + "Materialized R integer vector contains {.val NA} values from device-side {.val -2147483648}.", + i = "This collision is irrecoverable: the device value and {.val NA} are indistinguishable in R. Set {.code scan_na = FALSE} to skip this check." + )) + } + result } #' @export diff --git a/man/AnvlArray.Rd b/man/AnvlArray.Rd index c458155d..ba9e6e3e 100644 --- a/man/AnvlArray.Rd +++ b/man/AnvlArray.Rd @@ -16,10 +16,18 @@ nv_array( device = NULL, shape = NULL, ambiguous = NULL, - backend = NULL + backend = NULL, + scan_na = FALSE ) -nv_scalar(data, dtype = NULL, device = NULL, ambiguous = NULL, backend = NULL) +nv_scalar( + data, + dtype = NULL, + device = NULL, + ambiguous = NULL, + backend = NULL, + scan_na = FALSE +) nv_empty(dtype, shape, device = NULL, ambiguous = FALSE) @@ -80,6 +88,13 @@ Backend to use (\code{"xla"} or \code{"quickr"}). Defaults to \code{default_backend()}. Must not be specified inside \code{\link[=jit]{jit()}}.} +\item{scan_na}{(\code{logical(1)})\cr +If \code{TRUE}, error when \code{data} contains any \code{NA} values. XLA has no +representation for missing values, so they are otherwise silently +coerced to the closest available value of the target dtype (e.g. \code{NaN} +for floats, the bit pattern \code{-2147483648} for \code{i32}, \code{TRUE} for +\code{bool}). Defaults to \code{FALSE}.} + \item{like}{(\code{\link{AnvlArray}})\cr An existing array. 
Any of \code{dtype}, \code{device}, \code{shape}, \code{ambiguous}, and \code{backend} that are \code{NULL} (the default) are taken from \code{like}.} diff --git a/man/as_array.Rd b/man/as_array.Rd index 0b00f6e8..b4dff65e 100644 --- a/man/as_array.Rd +++ b/man/as_array.Rd @@ -1,15 +1,25 @@ % Generated by roxygen2: do not edit by hand -% Please edit documentation in R/reexports.R -\name{as_array} +% Please edit documentation in R/array.R, R/reexports.R +\name{as_array.AnvlArray} +\alias{as_array.AnvlArray} \alias{as_array} \title{Convert to an R array} \usage{ +\method{as_array}{AnvlArray}(x, scan_na = FALSE, ...) + as_array(x, ...) } \arguments{ \item{x}{(\code{\link{arrayish}})\cr An array-like object.} +\item{scan_na}{(\code{logical(1)})\cr +If \code{TRUE} and the array's dtype is \code{i32}, error when the materialized +R integer vector contains any \code{NA_integer_} values. R reserves the bit +pattern \code{-2147483648} as the \code{NA_integer_} sentinel, so a genuine +device-side \code{i32} value of \code{-2147483648} is silently turned into \code{NA} +on transfer. No-op for other dtypes. 
Defaults to \code{FALSE}.} + \item{...}{Additional arguments passed to methods (unused).} } \value{ diff --git a/pkgdown/_pkgdown.yml b/pkgdown/_pkgdown.yml index 72a9b34e..d58f0b22 100644 --- a/pkgdown/_pkgdown.yml +++ b/pkgdown/_pkgdown.yml @@ -39,8 +39,8 @@ navbar: href: articles/random-numbers.html - text: Type Promotion href: articles/type-promotion.html - - text: Differences from base R - href: articles/differences-from-base-r.html + - text: Gotchas + href: articles/gotchas.html - text: Efficiency href: articles/efficiency.html - text: FAQ diff --git a/vignettes/differences-from-base-r.Rmd b/vignettes/gotchas.Rmd similarity index 70% rename from vignettes/differences-from-base-r.Rmd rename to vignettes/gotchas.Rmd index b3984a23..ffde8d67 100644 --- a/vignettes/differences-from-base-r.Rmd +++ b/vignettes/gotchas.Rmd @@ -1,5 +1,5 @@ --- -title: "Differences from base R" +title: "Gotchas" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Differences from base R} @@ -7,7 +7,7 @@ vignette: > %\VignetteEncoding{UTF-8} --- -This vignette lists major behavioral differences between {anvl} and base R that R users should be aware of when working with `AnvlArray`s. +This vignette lists various things to be aware of, specifically in relation to base R. ```{r} library(anvl) @@ -117,3 +117,33 @@ as.logical(nv_scalar(NA)) ``` If your data contains missing values, decide how to handle them *before* converting to an `AnvlArray`. +To opt into a runtime check, pass `scan_na = TRUE` to `nv_array()` / `nv_scalar()`, which errors if the input contains any `NA`: + +```{r, error = TRUE} +nv_array(c(1, NA, 3), scan_na = TRUE) +``` + +The same flag is available on `as_array()` for the `i32` round-trip case, where it errors if the materialized integer vector contains any `NA` (i.e. 
any `-2147483648`): + +```{r, error = TRUE} +as_array(nv_scalar(NA_integer_), scan_na = TRUE) +``` + +## No unsigned integers + +R's `integer` type is signed 32-bit (range `-2147483648` to `2147483647`). +{anvl} also exposes unsigned integer dtypes (`ui8`, `ui16`, `ui32`, `ui64`) backed by XLA, but R has no native counterpart. +For values that fit into R's signed integer range, the round-trip works as expected: + +```{r} +as_array(nv_array(c(0L, 200L, 255L), dtype = "ui8")) +``` + +For larger device-side values, however, materialization back into an R integer vector silently produces `NA`: + +```{r} +big <- nv_array(2147483647L, dtype = "ui32") + 1L +as_array(big) +``` + +The device-side value is `2147483648` -- a perfectly valid `ui32` -- but it falls outside R's signed integer range, so it collides with the `NA_integer_` sentinel on materialization. The same caveat applies to all values `>= 2^31` in any unsigned dtype, including the much larger range of `ui64`. If you need to consume large unsigned values in R, convert the dtype on the device side first (e.g. `nv_convert(x, "f64")`). From 38382bb4ae7363b1a89394f552f68327ee3daa82 Mon Sep 17 00:00:00 2001 From: Sebastian Fischer Date: Mon, 18 May 2026 17:50:44 +0200 Subject: [PATCH 3/4] ... --- R/api.R | 2 +- R/array.R | 22 ++++++--------- R/backend-quickr.R | 2 +- R/backend-xla.R | 2 +- R/backend.R | 7 +++-- man/AnvlBackend.Rd | 5 +++- man/as_array.Rd | 13 +++++---- vignettes/gotchas.Rmd | 64 +++++++++++++++++++++++-------------------- 8 files changed, 61 insertions(+), 56 deletions(-) diff --git a/R/api.R b/R/api.R index 441d1aa2..3238fe7c 100644 --- a/R/api.R +++ b/R/api.R @@ -101,7 +101,7 @@ nv_broadcast_scalars <- function(...) 
{ target_shape <- non_scalar_shapes[[1L]] if (!all(vapply(non_scalar_shapes, identical, logical(1L), target_shape))) { - shapes <- paste0(sapply(shapes, shape2string), sep = ", ") + shapes <- paste0(sapply(shapes, shape2string), collapse = ", ") cli_abort( "All non-scalar arrays must have the same shape, but got {shapes}. Use {.fn nv_broadcast_arrays} for general broadcasting." # nolint ) diff --git a/R/array.R b/R/array.R index 381f4cac..a9489bfe 100644 --- a/R/array.R +++ b/R/array.R @@ -426,23 +426,17 @@ shape.AnvlArray <- function(x, ...) { #' @rdname as_array #' @param check (`logical(1)`)\cr -#' If `TRUE` and the array's dtype is `i32`, error when the materialized -#' R integer vector contains any `NA_integer_` values. R reserves the bit -#' pattern `-2147483648` as the `NA_integer_` sentinel, so a genuine -#' device-side `i32` value of `-2147483648` is silently turned into `NA` -#' on transfer. No-op for other dtypes. Defaults to `FALSE`. See the -#' "Gotchas" vignette. +#' If `TRUE`, sanity-check the materialized R vector against losing +#' information across the device-to-host boundary, and abort if any +#' problematic value is detected. Forwarded to the backend; for the +#' `xla` backend the relevant cases are `i32`/`i64` values colliding +#' with the `NA` bit pattern and `ui64` values `>= 2^63` wrapping +#' through `bit64::integer64`. See [`pjrt::as_array.PJRTBuffer()`] for +#' the full list. Defaults to `FALSE`. See the "Gotchas" vignette. #' @export as_array.AnvlArray <- function(x, check = FALSE, ...) { assert_flag(check) - result <- globals$backends[[x$backend]]$as_array(x) - if (check && (dtype(x) == as_dtype("i32")) && anyNA(result)) { - cli_abort(c( - "Materialized R integer vector contains {.val NA} values from device-side {.val -2147483648}.", - i = "This collision is irrecoverable: the device value and {.val NA} are indistinguishable in R. Set {.code check = FALSE} to skip this check." 
- )) - } - result + globals$backends[[x$backend]]$as_array(x, check = check) } #' @export diff --git a/R/backend-quickr.R b/R/backend-quickr.R index 73c28631..b26348fe 100644 --- a/R/backend-quickr.R +++ b/R/backend-quickr.R @@ -162,7 +162,7 @@ AnvlBackendQuickr <- function() { dtype = function(x) x$dtype, shape = function(x) x$shape, ambiguous = function(x) x$ambiguous, - as_array = function(x) x$data, + as_array = function(x, check) x$data, as_raw = function(x, row_major) as.raw(x$data), platform = function(x) "cpu", device = function(x) quickr_device("cpu"), diff --git a/R/backend-xla.R b/R/backend-xla.R index 45b02270..b781ce74 100644 --- a/R/backend-xla.R +++ b/R/backend-xla.R @@ -319,7 +319,7 @@ AnvlBackendXla <- function() { dtype = function(x) tengen::dtype(x$data), shape = function(x) tengen::shape(x$data), ambiguous = function(x) x$ambiguous, - as_array = function(x) tengen::as_array(x$data), + as_array = function(x, check) tengen::as_array(x$data, check = check), as_raw = function(x, row_major) tengen::as_raw(x$data, row_major = row_major), platform = function(x) pjrt::platform(x$data), device = function(x) device(x$data), diff --git a/R/backend.R b/R/backend.R index 112552b5..4550ec97 100644 --- a/R/backend.R +++ b/R/backend.R @@ -6,7 +6,10 @@ #' @param dtype (`function`)\cr Extracts the dtype from an AnvlArray. #' @param shape (`function`)\cr Extracts the shape from an AnvlArray. #' @param ambiguous (`function`)\cr Extracts the ambiguous flag from an AnvlArray. -#' @param as_array (`function`)\cr Converts an AnvlArray to an R array. +#' @param as_array (`function(x, check)`)\cr Converts an AnvlArray to an R +#' array. The `check` flag is forwarded from [`as_array()`]; backends may use +#' it to abort when materialization would lose information (e.g. ui64 values +#' wrapping through `bit64::integer64`). See [`pjrt::as_array.PJRTBuffer()`]. #' @param as_raw (`function`)\cr Converts an AnvlArray to raw bytes. 
#' @param platform (`function`)\cr Returns the platform name (e.g. `"cpu"`). #' @param device (`function`)\cr Returns the device object for an AnvlArray. @@ -141,7 +144,7 @@ register_backend( dtype = function(x) x$dtype, shape = function(x) x$shape, ambiguous = function(x) x$ambiguous, - as_array = function(x) x$data, + as_array = function(x, check) x$data, as_raw = function(x, row_major) cli_abort("as_raw not supported for plain backend"), platform = function(x) "cpu", device = function(x) PlainDeviceCpu(), diff --git a/man/AnvlBackend.Rd b/man/AnvlBackend.Rd index ec55487b..b1e68a38 100644 --- a/man/AnvlBackend.Rd +++ b/man/AnvlBackend.Rd @@ -30,7 +30,10 @@ underlying data (\code{PJRTBuffer} for \code{"xla"} backend, \code{array()} for \item{ambiguous}{(\code{function})\cr Extracts the ambiguous flag from an AnvlArray.} -\item{as_array}{(\code{function})\cr Converts an AnvlArray to an R array.} +\item{as_array}{(\verb{function(x, check)})\cr Converts an AnvlArray to an R +array. The \code{check} flag is forwarded from \code{\link[=as_array]{as_array()}}; backends may use +it to abort when materialization would lose information (e.g. ui64 values +wrapping through \code{bit64::integer64}). See \code{\link[pjrt:as_array.PJRTBuffer]{pjrt::as_array.PJRTBuffer()}}.} \item{as_raw}{(\code{function})\cr Converts an AnvlArray to raw bytes.} diff --git a/man/as_array.Rd b/man/as_array.Rd index 33d53782..ecacd553 100644 --- a/man/as_array.Rd +++ b/man/as_array.Rd @@ -14,12 +14,13 @@ as_array(x, ...) An array-like object.} \item{check}{(\code{logical(1)})\cr -If \code{TRUE} and the array's dtype is \code{i32}, error when the materialized -R integer vector contains any \code{NA_integer_} values. R reserves the bit -pattern \code{-2147483648} as the \code{NA_integer_} sentinel, so a genuine -device-side \code{i32} value of \code{-2147483648} is silently turned into \code{NA} -on transfer. No-op for other dtypes. Defaults to \code{FALSE}. 
See the -"Gotchas" vignette.} +If \code{TRUE}, sanity-check the materialized R vector against losing +information across the device-to-host boundary, and abort if any +problematic value is detected. Forwarded to the backend; for the +\code{xla} backend the relevant cases are \code{i32}/\code{i64} values colliding +with the \code{NA} bit pattern and \code{ui64} values \verb{>= 2^63} wrapping +through \code{bit64::integer64}. See \code{\link[pjrt:as_array.PJRTBuffer]{pjrt::as_array.PJRTBuffer()}} for +the full list. Defaults to \code{FALSE}. See the "Gotchas" vignette.} \item{...}{Additional arguments passed to methods (unused).} } diff --git a/vignettes/gotchas.Rmd b/vignettes/gotchas.Rmd index 1bd5d6a8..256a310d 100644 --- a/vignettes/gotchas.Rmd +++ b/vignettes/gotchas.Rmd @@ -9,15 +9,15 @@ vignette: > This vignette lists various things to be aware of, specifically in relation to base R. -```{r} +```{r, include = FALSE} library(anvl) ``` ## Row-major vs column-major ordering R stores matrices and arrays in *column-major* order, while {anvl} (following XLA) uses *row-major* order. -This makes no difference when you only use shape-aware operations (subsetting, matrix multiplication, etc.) -- the indices are the same in both. -The difference shows up when you flatten an array, because the underlying data is then traversed in a different order. +For most operations, this is an internal implementation detail that does not change the semantics. +However, for reshaping operations such as `nv_flatten()` there is a difference. 
Consider the 2x2 matrix below: @@ -35,13 +35,13 @@ as.vector(m) In {anvl}, reshaping to a length-4 vector traverses the data row-by-row, so we get `1, 3, 2, 4`: ```{r} -nv_reshape(m, shape = 4) +nv_flatten(m) ``` If you need column-major flattening in {anvl}, transpose first: ```{r} -nv_reshape(nv_transpose(m), shape = 4) +nv_flatten(t(m)) ``` ## No recycling @@ -68,9 +68,12 @@ nv_array(1:4) + nv_array(1:2) When two non-scalar arrays differ only by size-1 dimensions (numpy-style broadcasting, e.g. shape `(2, 3)` and `(1, 3)`), use `nv_broadcast_arrays()` to align them explicitly first: ```{r} -a <- nv_array(matrix(1:6, nrow = 2)) -b <- nv_array(matrix(c(10, 20, 30), nrow = 1)) +a <- nv_matrix(1:6, nrow = 2) +shape(a) +b <- nv_matrix(c(10, 20, 30), nrow = 1) +shape(b) xs <- nv_broadcast_arrays(a, b) +lapply(xs, shape) xs[[1]] + xs[[2]] ``` @@ -79,9 +82,8 @@ Note that even `nv_broadcast_arrays()` cannot replicate R's recycling for shapes ## No `NA`s R has a dedicated missing-value marker (`NA`) for every atomic type. -{anvl} arrays do not -- there is no representation of "missing" at the XLA level. -When you convert R values containing `NA` into an `AnvlArray`, the `NA`s are silently turned into the closest available value of the target dtype. -For floating-point dtypes, that value is `NaN`: +{anvl} arrays do not -- there is no representation of "missing" at the XLA level, only `NaN` for floating-point numbers. +When you convert floating-point R values containing `NA` into an `AnvlArray`, the `NA`s are silently turned into `NaN`s.
```{r} nv_array(NA_real_) @@ -91,50 +93,44 @@ nv_array(NA_real_) nv_array(c(1, NA, 3)) ``` -Round-tripping back to R produces `NaN`, not `NA`: +Round-tripping back to R is not guaranteed to produce `NA`, but can also yield `NaN`: ```{r} as_array(nv_array(c(1, NA, 3))) ``` -Integer dtypes have no `NaN`, but `NA_integer_` does *appear* to round-trip: +For other data types, the situation is even worse, especially for integers, where R uses the smallest possible value to represent missingness: ```{r} -nv_scalar(NA_integer_) |> as.integer() +nv_scalar(NA_integer_) ``` -This is misleading. -R represents `NA_integer_` by reserving one specific 32-bit integer value (`-2^31 = -2147483648`) as a sentinel for missingness, leaving only `2^32 - 1` valid integers. -{anvl} has no notion of missing values and just stores that bit pattern as a regular `i32`. -The round-trip "works" only because R interprets the same bit pattern back as `NA` -- but inside {anvl} the value behaves like the integer `-2147483648`, and any computation on it (e.g. addition, comparison) will treat it as such rather than propagating missingness. -The same caveat applies in reverse: if a genuine {anvl} computation produces `-2147483648`, converting back to R will silently turn it into `NA`. +However, when you convert it back, you get a missing value again: + +```{r} +as.integer(nv_scalar(NA_integer_)) +``` -For logical (`bool`) dtype the situation is worst: there is no spare bit pattern at all, so a bare `NA` (which is logical) is silently turned into `TRUE`: +When creating logicals, `NA` will be interpreted as `TRUE`: ```{r} nv_scalar(NA) as.logical(nv_scalar(NA)) ``` -If your data contains missing values, decide how to handle them *before* converting to an `AnvlArray`. 
-To opt into a runtime check, pass `check = TRUE` to `nv_array()` / `nv_scalar()`, which errors if the input contains any `NA`: +In order to avoid these pitfalls, array creators such as `nv_array()` have a `check` argument to prevent the above problems. +It is `FALSE` by default, because it needs to scan the complete data. ```{r, error = TRUE} nv_array(c(1, NA, 3), check = TRUE) ``` -The same flag is available on `as_array()` for the `i32` round-trip case, where it errors if the materialized integer vector contains any `NA` (i.e. any `-2147483648`): +The same flag is available for converters like `as_array()`: ```{r, error = TRUE} as_array(nv_scalar(NA_integer_), check = TRUE) ``` -It is also forwarded by the `as.integer()` / `as.double()` / `as.logical()` / `as.vector()` methods for `AnvlArray`, so the same scan is available when coercing directly to a bare R vector: - -```{r, error = TRUE} -as.integer(nv_scalar(NA_integer_), check = TRUE) -``` - ## No unsigned integers R's `integer` type is signed 32-bit (range `-2147483648` to `2147483647`). @@ -145,11 +141,19 @@ For values that fit into R's signed integer range, the round-trip works as expec as_array(nv_array(c(0L, 200L, 255L), dtype = "ui8")) ``` -For larger device-side values, however, materialization back into an R integer vector silently produces `NA`: +Because `ui32` does not fit into R's native integer type, it will be converted to the `bit64::integer64` data type: + ```{r} big <- nv_array(2147483647L, dtype = "ui32") + 1L as_array(big) ``` -The device-side value is `2147483648` -- a perfectly valid `ui32` -- but it falls outside R's signed integer range, so it collides with the `NA_integer_` sentinel on materialization. The same caveat applies to all values `>= 2^31` in any unsigned dtype, including the much larger range of `ui64`. If you need to consume large unsigned values in R, convert the dtype on the device side first (e.g. `nv_convert(x, "f64")`).
+For `ui64`, the result is also an `integer64`, which cannot represent values `>= 2^63`, so larger values wrap around on conversion; the `check` flag detects this: + +```{r, error = TRUE} +big <- nv_array(0L, dtype = "ui64") - 1L +big +as_array(big) +as_array(big, check = TRUE) +``` From 8cd657c840c722a33054945ae39ba7c856d84984 Mon Sep 17 00:00:00 2001 From: Sebastian Fischer Date: Mon, 18 May 2026 19:17:57 +0200 Subject: [PATCH 4/4] remove unsupported argument from as.vector --- R/array.R | 4 ++-- man/as-AnvlArray.Rd | 2 +- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/R/array.R b/R/array.R index a9489bfe..170259e9 100644 --- a/R/array.R +++ b/R/array.R @@ -516,8 +516,8 @@ as.logical.AnvlArray <- function(x, check = FALSE, ...) { #' @rdname as-AnvlArray #' @method as.vector AnvlArray #' @export -as.vector.AnvlArray <- function(x, mode = "any", check = FALSE) { - as.vector(as_array(x, check = check), mode = mode) +as.vector.AnvlArray <- function(x, mode = "any") { + as.vector(as_array(x), mode = mode) } #' @rdname platform diff --git a/man/as-AnvlArray.Rd b/man/as-AnvlArray.Rd index 74f030af..2b7e4f73 100644 --- a/man/as-AnvlArray.Rd +++ b/man/as-AnvlArray.Rd @@ -14,7 +14,7 @@ \method{as.logical}{AnvlArray}(x, check = FALSE, ...) -\method{as.vector}{AnvlArray}(x, mode = "any", check = FALSE) +\method{as.vector}{AnvlArray}(x, mode = "any") } \arguments{ \item{x}{(\code{\link{AnvlArray}})\cr