Add retries by teqwve · Pull Request #16 · near/near-api-py

teqwve · 2022-08-15T18:10:44Z

Hi,

this PR add rpc requests retrying on various errors, analogous to retrying done by near-api-js. I tested it manually by injecting faulty nonces or making my network very unstable with iptables.

I wasn't 100% sure if it's always safe to retry requests, especially in cases when the request has made it to the server (e.g. response read timeouts). Could you please help here? Near is quite new to me and I'm not sure :) (that's why I was so hard on near-api-js).

The only relevant fragment of documentation I've found so far (description of synchronous transaction errors) suggests so:

Retrying done by near-api-js:

teqwve · 2022-08-15T18:13:48Z

            signer: 'near_api.signer.Signer',
-            account_id: Optional[str] = None
+            account_id: Optional[str] = None,
+            tx_nonce_retry_number: int = 12,


I copied these default number of retries from near-api-js, that's why they're so huge... (~30 retries in total with delay from 1.5s to 1min, theoretically a request can take ~3 minutes).

With the current code request can take up to 6 minutes. Note that JS code multiplies delay by 500 milliseconds which is not done here on line 63.

teqwve · 2022-08-15T18:19:34Z

+        if session is None:
+            # Loosely based on https://findwork.dev/blog/advanced-usage-python-requests-timeouts-retries-hooks/
+            adapter = requests.adapters.HTTPAdapter(
+                max_retries=urllib3.util.Retry(
+                    total=http_retry_number,
+                    backoff_factor=http_retry_backoff_factor,
+                    status_forcelist=[502, 503],
+                    allowed_methods=["GET", "POST"],
+                ),
+            )
+
+            session = requests.Session()
+            session.mount("https://", adapter)
+            session.mount("http://", adapter)


I used sessions instead of manual retrying because a built-in mechanism seem more straightforward to me and it allows great customisation (user can pass their own Session object).

If you prefer I can change it to manual exception catching and handling.

This is a bit specious to me. There are now three nested retry loops. We have

a loop in _sign_and_submit_tx which retries up to 12 times if we had wrong nonce or block,

another loop here in json_rpc which retries up to 10 times if we had a timeout and

final loop in the Retry adapter which retries up to 10 times on server errors.

It may be somewhat contrived, but in pathological case we may end up retrying 1200 times. I’d say we should at least get rid of the Retry adapter and to the looping logic for 5xx errors in json_rpc. Something like:

while True: try: return self._json_rpc_once(method, params, timeout) except HTTPError as e: # whatever the exception type is if attempt >= self._tx_timeout_retry_number: raise if e.status_code // 100 == 5: log.warning("Retrying request to %s as it returned server error: %s", method, e) else: raise except JsonProviderError as e: if attempt >= self._tx_timeout_retry_number: raise if e.get_type() == "HANDLER_ERROR" and e.get_cause() == "TIMEOUT_ERROR": log.warning("Retrying request to %s as it has timed out: %s", method, e) else: raise attempt += 1 time.sleep(self._tx_timeout_retry_backoff_factor ** attempt * 0.5)

Yeah, that's what is done by near-api-js ;p

Ok, I'll refactor this to have only two levels of retries

Yeah, I’ve noticed. I’d argue it’s a bug in JS implementation.

teqwve · 2022-08-15T18:24:55Z

+                and self.args[0]["data"]["TxExecutionError"]["InvalidTxError"] == "Expired"
+            )
+        except (IndexError, KeyError):
+            return False


Here it would probably be better to do the same thing as near-api-js does, but I gave up the moment I looked at code responsible for that (traversing whole returned object tree and comparing with json scheme just to get the error name?) I don't feel competent enough to implement this in python since I've found nothing about this in docs.

Although If you think it would be much better to use this json schema in python as well (and maybe can point me at some docs or example error responses) I can implement this here.

[1] https://github.com/near/near-api-js/blob/master/packages/near-api-js/src/utils/rpc_errors.ts#L55
[2] https://github.com/near/near-api-js/blob/master/packages/near-api-js/src/generated/rpc_error_schema.json

mina86 · 2022-08-16T15:21:04Z

+        if session is None:
+            # Loosely based on https://findwork.dev/blog/advanced-usage-python-requests-timeouts-retries-hooks/
+            adapter = requests.adapters.HTTPAdapter(
+                max_retries=urllib3.util.Retry(
+                    total=http_retry_number,
+                    backoff_factor=http_retry_backoff_factor,
+                    status_forcelist=[502, 503],
+                    allowed_methods=["GET", "POST"],
+                ),
+            )
+
+            session = requests.Session()
+            session.mount("https://", adapter)
+            session.mount("http://", adapter)


This is a bit specious to me. There are now three nested retry loops. We have

a loop in _sign_and_submit_tx which retries up to 12 times if we had wrong nonce or block,

another loop here in json_rpc which retries up to 10 times if we had a timeout and

final loop in the Retry adapter which retries up to 10 times on server errors.

It may be somewhat contrived, but in pathological case we may end up retrying 1200 times. I’d say we should at least get rid of the Retry adapter and to the looping logic for 5xx errors in json_rpc. Something like:

while True: try: return self._json_rpc_once(method, params, timeout) except HTTPError as e: # whatever the exception type is if attempt >= self._tx_timeout_retry_number: raise if e.status_code // 100 == 5: log.warning("Retrying request to %s as it returned server error: %s", method, e) else: raise except JsonProviderError as e: if attempt >= self._tx_timeout_retry_number: raise if e.get_type() == "HANDLER_ERROR" and e.get_cause() == "TIMEOUT_ERROR": log.warning("Retrying request to %s as it has timed out: %s", method, e) else: raise attempt += 1 time.sleep(self._tx_timeout_retry_backoff_factor ** attempt * 0.5)

mina86 · 2022-08-16T15:21:56Z

+            tx_timeout_retry_backoff_factor: float = 1.5,
+            http_retry_number: int = 10,
+            http_retry_backoff_factor: float = 1.5,
+            session: Optional[requests.Session] = None,


Is this argument ever used? I’d rather we get rid of it. With it, the prototype of this constructor is a bit weird. If session is given, majority of the arguments are just ignored.

Since you prefer to not use built-in retry mechanism then it's not very useful at the moment. I'll remove it from this PR.

The main reason for adding this here was to make it possible to configure HTTP connection pooling(*) which is very useful when doing more than couple of requests to near api per second.

I intended to do something like:

session = requests.Session() sessions.mount( "https://", requests.HTTPAdapter( pool_maxsize=1, pool_connections=10, pool_block=True, ... ) ) provider = near_api.providers.JsonProvider(..., session=session)

I'll remove this argument here and create a separate PR which will allow me to do:

provider = near_api.providers.JsonProvider(..., http_pool_maxsize=1, http_pool_block=...)

under the hood it will create session with relevant pool size and retries=0. So the client won't be staring a new TLS connection each time and client will be handling retires itself. Sounds ok?

(*) relevant docs: https://requests.readthedocs.io/en/latest/api/#requests.adapters.HTTPAdapter

I mean it’s fine to have optional session argument. My issue was about having mutually exclusive set of arguments.

mina86 · 2022-08-16T15:27:04Z

            signer: 'near_api.signer.Signer',
-            account_id: Optional[str] = None
+            account_id: Optional[str] = None,
+            tx_nonce_retry_number: int = 12,


With the current code request can take up to 6 minutes. Note that JS code multiplies delay by 500 milliseconds which is not done here on line 63.

mina86 · 2022-08-16T15:37:08Z

-        r = requests.get("%s/status" % self.rpc_addr(), timeout=timeout)
+        r = self._session.get("%s/status" % self.rpc_addr(), timeout=timeout)


This should go through the retry loop as well. Currently this only retries on server errors but I don’t see why we wouldn’t want to retry on timeouts if we’re retrying other requests on timeout.

Do you mean the same TIMEOUT_ERROR that is handled in json_rpc? Is it possible to get one here? In docs it among broadcast_tx_commit errors and it's not mentioned among /stats endpoint errors (at least here).

But yeah, since we'll be handling all timeouts ourselves then retry code might be the same and we can add handling of these error here as well :)

Ah, I see. It’s a different timeout that what I was thinking of. What happens if requsets.get’s timeout runs out?

Also, at this point maybe we should move TIMEOUT_ERROR handling to _sign_and_submit_tx? JsonProvider would handle 5xx and Account would handle TIMEOUT_ERROR and any other errors of this kind?

What happens if requsets.get’s timeout runs out?

I guess that requests's retry logic triggers and once number of attempts is exceeded an exception like requests.Timeout is thrown.

Also, at this point maybe ...

Sounds great, I'll also be able to stick with built-in mechanisms for network/http error retries.

ConfidenceYobo · 2022-09-01T14:34:47Z

Isn't it better to utilize the broadcast tx async function instead of retrying the transaction, which can take a long time and cause a horrible user experience in some applications?

teqwve added 5 commits August 15, 2022 19:42

add retries on network-level and http-level errors

d8ed3ec

add simple error details parsing

338b722

add retries on near timeout errors

0e0d8c2

add retries on invalid transaction errors

1b33227

add logging using standard python module

9395cbb

teqwve commented Aug 15, 2022

View reviewed changes

fixup! add logging using standard python module

2c85592

mm-near requested a review from mina86 August 16, 2022 13:29

mina86 reviewed Aug 16, 2022

View reviewed changes

		r = requests.get("%s/status" % self.rpc_addr(), timeout=timeout)
		r = self._session.get("%s/status" % self.rpc_addr(), timeout=timeout)

Conversation

teqwve commented Aug 15, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

teqwve Aug 15, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

teqwve-cosmose Aug 16, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ConfidenceYobo commented Sep 1, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

teqwve commented Aug 15, 2022 •

edited

Loading

teqwve Aug 15, 2022 •

edited

Loading

teqwve-cosmose Aug 16, 2022 •

edited

Loading