Plumb a bulk_open_max_parallelism option. #7

Open
folded wants to merge 3 commits into google-deepmind:main from folded:fix/bulk-open-parallelism

folded commented May 1, 2026

Plumbs a bulk_open_max_parallelism option for bulk shard opens through the API to the filesystem
drivers.

Drivers can set their own defaults. GCS uses this to lower its bulk-open default to 32 while
keeping read parallelism at 100 threads. Empirically, 32 was appropriate for a within-zone bulk
open (1334 shards, 40M records).

On accesses with a longer RTT there is still a benefit to higher parallelism, but it comes at
the cost of heavy DNS lookup churn within libcurl. On macOS, this raises peak fd usage, leading
to opaque curl address-resolution failures, many retries with slow backoff, and stochastic
failures if the retry budget is exhausted.

folded added 3 commits April 30, 2026 22:03

The open phase has different concurrency economics from the read phase:

* Each file open issues a fresh metadata request that requires DNS
  resolution.  On macOS, libcurl uses a per-call threaded resolver
  (CURLRES_THREADED) — every getaddrinfo call spawns a fresh pthread for
  the lookup.  Under bursts of concurrent opens at N=100 (the previous
  hard-coded fan-out), pthread_create() can return EAGAIN, libcurl emits
  "getaddrinfo() thread failed to start" and surfaces the failure to
  callers as CURLE_FAILED_INIT (curl error 2).  The error is stochastic
  but reproducible at N >= ~117 shards on a default-configured macOS host.

* The opens are GCS-latency bound; past ~16-32 in flight there is no
  further wall-clock benefit, only resolver-thread pressure.

* The read phase reuses connections from the GCS client's pool, so it
  does not create resolver threads at the same rate and can safely run
  with the existing max_parallelism=100 default.

This commit splits the two by:

* Adding `Options::bulk_open_max_parallelism` (default 16) alongside the
  existing `max_parallelism` (default 100, retained for the read path).
* Plumbing the new parameter through `file::BulkOpenPRead`,
  `FileSystem::BulkOpenPRead`, and the GCS / POSIX overrides + mock.
* Routing `BagzReader::Open` to pass `options.bulk_open_max_parallelism`
  to `file::BulkOpenPRead`.
* Exposing the option in the Python binding (kwargs and attribute).

Values <= 0 mean "use the file-system default" (100), preserving the
prior behaviour for any caller not passing the option.
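
The split described above can be illustrated with a minimal Python sketch (the project itself is C++). It caps the number of opens in flight with a value separate from the read-path cap. The names `bulk_open`, `open_one`, `READ_MAX_PARALLELISM`, and `BULK_OPEN_MAX_PARALLELISM` are hypothetical and are not part of the bagz API; only the numbers 16 and 100 come from this commit message.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for illustration; not the bagz or GCS driver API.
READ_MAX_PARALLELISM = 100      # existing default, retained for the read path
BULK_OPEN_MAX_PARALLELISM = 16  # bulk-open default introduced by this commit


def bulk_open(paths, open_one, max_parallelism=BULK_OPEN_MAX_PARALLELISM):
    """Open every path with at most `max_parallelism` opens in flight.

    `open_one` is any callable that opens a single path. Results are
    returned in the same order as `paths`.
    """
    if max_parallelism <= 0:
        # Values <= 0 mean "use the file-system default" (100 in this commit).
        max_parallelism = READ_MAX_PARALLELISM
    # Never start more workers than there are paths, and always at least one.
    workers = max(1, min(max_parallelism, len(paths)))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(open_one, paths))
```

A caller would pass its own opener, for example `bulk_open(shard_paths, open_shard)`. The worker pool bounds how many metadata requests, and therefore how many DNS lookups, can be outstanding at once.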

Trace evidence of the failure mode (libcurl 8.19.0, macOS arm64,
captured via DYLD_INSERT_LIBRARIES interpose to force CURLOPT_VERBOSE):

    * getaddrinfo() thread failed to start
    * Could not resolve host: storage.googleapis.com
    * closing connection #N

A single run showed 15 thread-spawn failures, each followed by cascading
retries until google-cloud-cpp's retry budget is exhausted and one
operation surfaces as a permanent error to the caller.

A parallelism sweep on a same-region GCE n2-standard-8 (1334 shards,
5 iters per setting) found:

    p     min     p50     p95
    16   2.23s   2.25s   2.34s
    32   1.67s   1.69s   1.71s   <- floor
    64   2.21s   2.23s   2.26s
   100   2.31s   2.33s   4.09s

p=32 is ~25% faster than p=16 and ~25% faster than p=64.  Beyond 32 the
curve regresses — connection-pool / libcurl-cache contention dominates
the residual RTT savings.  16 was a conservative first guess; the data
says we have headroom.
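
As a quick check of the percentages quoted above, the p50 column of the sweep table can be compared directly. This is only a sketch of the arithmetic; the helper name `saving_vs` is hypothetical.

```python
# p50 wall-clock times in seconds, copied from the sweep table above.
p50 = {16: 2.25, 32: 1.69, 64: 2.23, 100: 2.33}

# The setting with the lowest median open time.
best = min(p50, key=p50.get)


def saving_vs(p):
    """Fractional reduction in p50 time when moving from setting p to the best one."""
    return (p50[p] - p50[best]) / p50[p]


for p in sorted(p50):
    if p != best:
        print(f"p={best} vs p={p}: {saving_vs(p):.1%} less wall-clock time")
```

With the table's numbers this gives roughly 24.9% against p=16, 24.2% against p=64, and 27.5% against p=100, consistent with the "~25%" figures.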

Cross-region clients (~140ms RTT macOS->australia-southeast1) still
prefer higher parallelism (the latency masks the worker-overhead cost),
but 32 is within ~1s of optimum on a 1334-shard open and stays clear of
the macOS pthread_create EAGAIN window that fires around p=64+.

The 32-thread cap is justified by GCS-specific behaviour (DNS-resolution
saturation past ~32 in flight, and fd pressure under burst load), so it
belongs on the GCS backend rather than as a bagz-level default that
silently constrains the posix path too.

Bagz `bulk_open_max_parallelism` now defaults to 0 ("use the file-system
default"), and `GcsFileSystem` picks 32 via a new
`kDefaultBulkOpenMaxParallelism` alongside the existing
`kDefaultMaxParallelism = 100` for reads.
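
The resulting default resolution can be sketched in Python as follows. This is an illustration under stated assumptions, not the C++ implementation: `FileSystemDefaults`, `GCS_DEFAULTS`, and `effective_bulk_open_parallelism` are hypothetical names, and only the GCS values (32 for bulk opens, 100 for reads) come from the commit message.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileSystemDefaults:
    """Hypothetical per-backend defaults; only the GCS numbers come from the PR."""
    max_parallelism: int
    bulk_open_max_parallelism: int


# Mirrors kDefaultMaxParallelism = 100 and kDefaultBulkOpenMaxParallelism = 32.
GCS_DEFAULTS = FileSystemDefaults(max_parallelism=100, bulk_open_max_parallelism=32)


def effective_bulk_open_parallelism(requested: int, fs: FileSystemDefaults) -> int:
    """Return the caller's value if it is positive, else the backend's default."""
    return requested if requested > 0 else fs.bulk_open_max_parallelism
```

With the bagz-level default of 0, a GCS open resolves to 32, while an explicit positive value from the caller is passed through unchanged. Other backends would supply their own defaults in the same way.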