fix(firestore): retry transient connection errors in streaming RPCs #14620

Draft
bhshkh wants to merge 2 commits into googleapis:main from bhshkh:fs-fix-transient-errors

Conversation

bhshkh (Contributor) commented May 20, 2026

Retry transient connection errors (specifically syscall.ECONNRESET and syscall.ECONNREFUSED) in Firestore streaming RPCs.

Fixes: #10350

product-auto-label (bot) added the "api: firestore" label (Issues related to the Firestore API.) on May 20, 2026

bhshkh (Contributor, Author) commented May 20, 2026

Unary RPCs already retry automatically on Unavailable errors (the status code to which gRPC maps connection resets and refusals) because of the default gax retry configuration. Streaming RPCs, however, do not automatically retry failures that occur while the stream is being consumed (Recv()).

This PR implements retry and resumption logic for the following streaming RPCs:

  1. BatchGetDocuments (used in GetAll and DocumentRef.Get):
    • Retries the stream on transient errors.
    • Tracks remaining documents to only request what has not been successfully received yet.
    • Adds strict duplicate detection in the Recv() loop, failing if the same document is received more than once.
  2. RunQuery (used in Query.Documents iteration):
    • Retries the stream on transient errors.
    • Resumes the query using StartAfter(lastDoc) if some documents were already returned to the user.
    • Adjusts the limit and clears the offset for the resumed query.
    • Resets backoff state on successful reads.
  3. RunAggregationQuery (used in AggregationQuery.GetResponse):
    • Retries the stream on transient errors.
    • Since it returns a single accumulated result, it retries the entire query from the beginning.
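
The BatchGetDocuments behavior described above (reopen the stream, request only what is still missing, reject duplicates) can be sketched roughly as follows. This is a minimal, stdlib-only illustration and not the code in this PR: docStream, openFunc, fetchAll, drain, and flakyStream are hypothetical names, only syscall.ECONNRESET is treated as transient, and backoff between attempts is omitted for brevity.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"syscall"
)

// docStream is a hypothetical stand-in for a server-streaming RPC client.
type docStream interface {
	Recv() (id string, err error)
}

// openFunc starts a new stream that requests only the given document IDs.
type openFunc func(ids []string) docStream

// drain reads one stream until EOF or an error, recording each received ID
// and failing if an ID arrives that is no longer (or was never) outstanding.
func drain(s docStream, remaining map[string]bool, got *[]string) error {
	for {
		id, err := s.Recv()
		if errors.Is(err, io.EOF) {
			return nil
		}
		if err != nil {
			return err
		}
		if !remaining[id] {
			return fmt.Errorf("document %q was not requested or was received twice", id)
		}
		delete(remaining, id)
		*got = append(*got, id)
	}
}

// fetchAll reopens the stream on ECONNRESET, asking only for missing IDs.
func fetchAll(ids []string, open openFunc, maxAttempts int) ([]string, error) {
	remaining := make(map[string]bool, len(ids))
	for _, id := range ids {
		remaining[id] = true
	}
	var got []string
	for attempt := 1; ; attempt++ {
		// Rebuild the request from the IDs that have not arrived yet.
		var want []string
		for _, id := range ids {
			if remaining[id] {
				want = append(want, id)
			}
		}
		err := drain(open(want), remaining, &got)
		if err == nil {
			return got, nil
		}
		if !errors.Is(err, syscall.ECONNRESET) || attempt >= maxAttempts {
			return got, err
		}
	}
}

// flakyStream yields its IDs in order; if failAfter >= 0 it returns
// ECONNRESET once that many IDs have been delivered.
type flakyStream struct {
	ids       []string
	pos       int
	failAfter int
}

func (f *flakyStream) Recv() (string, error) {
	if f.failAfter >= 0 && f.pos == f.failAfter {
		return "", syscall.ECONNRESET
	}
	if f.pos >= len(f.ids) {
		return "", io.EOF
	}
	id := f.ids[f.pos]
	f.pos++
	return id, nil
}

func main() {
	calls := 0
	open := func(ids []string) docStream {
		calls++
		if calls == 1 {
			return &flakyStream{ids: ids, failAfter: 2} // first stream breaks after two docs
		}
		return &flakyStream{ids: ids, failAfter: -1}
	}
	got, err := fetchAll([]string{"a", "b", "c", "d"}, open, 3)
	fmt.Println(got, err, calls)
}
```

In this sketch the second stream is opened with only "c" and "d", so a document delivered before the reset is never requested again, and any repeat is reported as an error instead of being returned twice.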

Other changes:

  • Introduced a unified isRetryable helper function in client.go that checks gRPC status codes and explicit syscall errors, and falls back to string matching ("connection reset by peer", "connection refused") for resilience.
  • Refactored isPermanentWatchError in watch.go to use the new isRetryable helper.
  • Updated existing tests and added new unit tests in client_test.go, query_test.go, and transaction_test.go to verify the retry behaviors.

Retries are skipped when the call runs inside a transaction (tid != nil or tx != nil), when ExplainOptions are set (to avoid corrupting the reported metrics), and when a query uses LimitToLast (because resuming such a query is complex).
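
The skip conditions above amount to a guard evaluated before any retry is attempted. A rough sketch, assuming a hypothetical streamRequest struct that summarizes the relevant request properties (the actual fields and checks in the PR may differ):

```go
package main

import "fmt"

// streamRequest is a hypothetical summary of the properties that decide
// whether a failed stream may be retried.
type streamRequest struct {
	inTransaction bool // stands in for tid != nil or tx != nil
	explain       bool // ExplainOptions are set
	limitToLast   bool // the query uses LimitToLast
}

// retryAllowed reports whether a retry may be attempted and, if not, why.
func retryAllowed(r streamRequest) (bool, string) {
	switch {
	case r.inTransaction:
		return false, "inside a transaction"
	case r.explain:
		return false, "ExplainOptions set"
	case r.limitToLast:
		return false, "LimitToLast query"
	}
	return true, ""
}

func main() {
	ok, reason := retryAllowed(streamRequest{explain: true})
	fmt.Println(ok, reason)
	ok, reason = retryAllowed(streamRequest{})
	fmt.Println(ok, reason)
}
```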

gemini-code-assist (bot, Contributor) left a comment

Code Review

This pull request implements retry logic for several Firestore operations, including BatchGetDocuments, RunQuery, and RunAggregationQuery. It introduces a centralized isRetryable helper function to identify transient errors—such as specific gRPC status codes and connection resets—and utilizes exponential backoff for retries. The queryDocumentIterator was updated to resume streaming from the last received document upon retry. Corresponding unit tests were added to verify the retry behavior across these methods. I have no feedback to provide.

Development

Successfully merging this pull request may close these issues.

firestore: handle transient connection errors
