Plumb a bulk_open_max_parallelism option. #7
Open
folded wants to merge 3 commits into
The open phase has different concurrency economics from the read phase:
* Each file open issues a fresh metadata request that requires DNS
resolution. On macOS, libcurl uses a per-call threaded resolver
(CURLRES_THREADED) — every getaddrinfo call spawns a fresh pthread for
the lookup. Under bursts of concurrent opens at N=100 (the previous
hard-coded fan-out), pthread_create() can return EAGAIN, libcurl emits
"getaddrinfo() thread failed to start" and surfaces the failure to
callers as CURLE_FAILED_INIT (curl error 2). The error is stochastic
but reproducible at N >= ~117 shards on a default-configured macOS host.
* The opens are GCS-latency bound; past ~16-32 in flight there is no
further wall-clock benefit, only resolver-thread pressure.
* The read phase reuses connections from the GCS client's pool, so it
does not create resolver threads at the same rate and can safely run
with the existing max_parallelism=100 default.
This commit splits the two by:
* Adding `Options::bulk_open_max_parallelism` (default 16) alongside the
existing `max_parallelism` (default 100, retained for the read path).
* Plumbing the new parameter through `file::BulkOpenPRead`,
`FileSystem::BulkOpenPRead`, and the GCS / POSIX overrides + mock.
* Routing `BagzReader::Open` to pass `options.bulk_open_max_parallelism`
to `file::BulkOpenPRead`.
* Exposing the option in the Python binding (kwargs and attribute).
Values <= 0 mean "use the file-system default" (100), preserving the
prior behaviour for any caller not passing the option.
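As a rough illustration of what a bounded open fan-out looks like, the sketch below runs a batch of tasks with at most `max_parallelism` worker threads. `BoundedParallelFor` is a hypothetical helper written for this note, not the bagz or GCS implementation, and it assumes the caller has already resolved values <= 0 to a concrete default.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical helper (not the bagz implementation): runs task(i) for every
// i in [0, n) using at most max_parallelism worker threads. Non-positive
// values are clamped to a single worker here; resolving them to a
// file-system default is assumed to happen in the caller.
void BoundedParallelFor(std::size_t n, int max_parallelism,
                        const std::function<void(std::size_t)>& task) {
  if (n == 0) return;
  const std::size_t workers = std::min<std::size_t>(
      n, static_cast<std::size_t>(std::max(1, max_parallelism)));
  std::atomic<std::size_t> next{0};
  std::vector<std::thread> pool;
  pool.reserve(workers);
  for (std::size_t w = 0; w < workers; ++w) {
    pool.emplace_back([&] {
      // Each worker claims the next unprocessed index until none remain,
      // so no more than `workers` tasks are ever in flight at once.
      for (std::size_t i = next.fetch_add(1); i < n; i = next.fetch_add(1)) {
        task(i);
      }
    });
  }
  for (std::thread& t : pool) t.join();
}
```

With this shape, lowering the cap from 100 to 16 changes only how many opens (and therefore resolver threads) can be outstanding at once; every shard is still opened.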
Trace evidence of the failure mode (libcurl 8.19.0, macOS arm64,
captured via DYLD_INSERT_LIBRARIES interpose to force CURLOPT_VERBOSE):
* getaddrinfo() thread failed to start
* Could not resolve host: storage.googleapis.com
* closing connection #N
A single run produced 15 thread-spawn failures. Each one triggers
cascading retries until google-cloud-cpp's retry budget is exhausted and
one operation surfaces to the caller as a permanent error.
A parallelism sweep on a same-region GCE n2-standard-8 (1334 shards,
5 iters per setting) found:
  p    min     p50     p95
 16    2.23s   2.25s   2.34s
 32    1.67s   1.69s   1.71s   <- floor
 64    2.21s   2.23s   2.26s
100    2.31s   2.33s   4.09s
p=32 is ~25% faster than p=16 and ~25% faster than p=64. Beyond 32 the
curve regresses — connection-pool / libcurl-cache contention dominates
the residual RTT savings. 16 was a conservative first guess; the data
says we have headroom.
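The "~25%" figures can be checked against the p50 column of the sweep. The helper below is an illustrative sketch (its name is made up for this note) that computes the fractional reduction in wall-clock time between two settings.

```cpp
// Fractional reduction in wall-clock time when moving from a baseline
// setting to a candidate setting (0.25 means the candidate takes 25% less
// time). Hypothetical helper for checking the sweep numbers by hand.
constexpr double RelativeTimeSaved(double baseline_s, double candidate_s) {
  return (baseline_s - candidate_s) / baseline_s;
}

// p50 values taken from the sweep table above.
constexpr double kP16 = 2.25;
constexpr double kP32 = 1.69;
constexpr double kP64 = 2.23;

// (2.25 - 1.69) / 2.25 is about 0.249, and (2.23 - 1.69) / 2.23 is about
// 0.242, so both comparisons round to roughly 25%.
constexpr double kSavedVs16 = RelativeTimeSaved(kP16, kP32);
constexpr double kSavedVs64 = RelativeTimeSaved(kP64, kP32);
```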
Cross-region clients (~140ms RTT macOS->australia-southeast1) still
prefer higher parallelism (the latency masks the worker-overhead cost),
but 32 is within ~1s of optimum on a 1334-shard open and stays clear of
the macOS pthread_create EAGAIN window that fires around p=64+.
The 32-thread cap is justified by GCS-specific behaviour (DNS-resolution
saturation past ~32 in flight, and fd pressure under burst load), so it
belongs on the GCS backend rather than as a bagz-level default that
silently constrains the posix path too.
Bagz `bulk_open_max_parallelism` now defaults to 0 ("use the file-system
default"), and `GcsFileSystem` picks 32 via a new
`kDefaultBulkOpenMaxParallelism` alongside the existing
`kDefaultMaxParallelism = 100` for reads.
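A minimal sketch of how a backend-owned default might be resolved is shown below. Only the two `GcsFileSystem` constants and their values come from this commit message; the virtual hook, the `EffectiveBulkOpenParallelism` helper, the base-class fallback of 100, and the simplified class shapes are assumptions made for illustration.

```cpp
// Simplified stand-ins for the real classes; method names are hypothetical.
class FileSystem {
 public:
  virtual ~FileSystem() = default;

  // Assumed hook: the bulk-open fan-out a backend prefers. The fallback of
  // 100 mirrors the previous hard-coded value and is an assumption here.
  virtual int DefaultBulkOpenMaxParallelism() const { return 100; }

  // Values <= 0 mean "use the file-system default".
  int EffectiveBulkOpenParallelism(int requested) const {
    return requested > 0 ? requested : DefaultBulkOpenMaxParallelism();
  }
};

class GcsFileSystem : public FileSystem {
 public:
  static constexpr int kDefaultMaxParallelism = 100;         // read path
  static constexpr int kDefaultBulkOpenMaxParallelism = 32;  // open path

  int DefaultBulkOpenMaxParallelism() const override {
    return kDefaultBulkOpenMaxParallelism;
  }
};

// The POSIX backend keeps the generic default, so it is not constrained by
// the GCS-specific cap.
class PosixFileSystem : public FileSystem {};
```

Under these assumptions, a caller that leaves `bulk_open_max_parallelism` at 0 gets 32 on GCS, while an explicit positive value always wins.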
Plumbs a bulk_open_max_parallelism option for bulk shard opens through the API to filesystem
drivers.
Drivers can set their own defaults, and GCS uses this to lower its bulk-open default to 32 (while
keeping read parallelism at 100 threads). Empirically, 32 was appropriate for a within-zone bulk
open (1334 shards, 40M records).
On accesses with a longer RTT there is still a benefit to higher parallelism, but it comes at
the cost of a lot of DNS lookup churn within libcurl. On macOS, this causes higher peak fd
usage, leading to opaque curl address resolution failures, lots of retries with slow backoff,
and stochastic failures if the retry budget is exhausted.