Don't cache if the schema registry is unavailable #143
Conversation
jpcosal
left a comment
LG from infra POV
One minor comment
Co-authored-by: jpcosal <42937033+jpcosal@users.noreply.github.com>
```go
// We can't import avroregistry to compare the error, so we look at the
// error message to see if the error is of type `UnavailableError`
// (avroregistry/errors.go).
if err != nil && strings.HasPrefix(err.Error(), "schema registry unavailability caused by") {
	return nil, err
}
```
Can it lead to an overload of the registry if we don't cache anything? The other open PR that addresses the same issue caches the error for 1 minute: #127
Can we just merge that one, or maybe decrease the duration a bit if necessary?
Merging the mentioned PR would be nice.
1 minute might indeed be a bit too much in our case. How about making that duration configurable (e.g. with a default that can be overridden through an env var)?
When the schema registry is unavailable, the request doesn't reach the schema registry pod at all, so it's not stressed.
If I'm understanding this correctly, this should only be called once per pod per topic (unless there's a race condition where it's called for each partition on the same pod at the same time). In any case, the number of requests to the schema registry is very low.
When the schema registry is unavailable, the request doesn't reach the schema registry pod at all, so it's not stressed.
If the registry is unavailable for 10 minutes, and in that timeframe one pod after another is started (e.g. autoscaling, node consolidation) and runs into the error, then the registry comes back up, won't all pods send their requests to the registry at the same time? Could that lead to issues?
If yes, the error caching could spread that initial load a bit if I'm not mistaken (assuming the pods with the errors didn't all start at the same time, but spread over those 10 minutes of registry downtime).
Ideally we could check the exact issue:
- unavailable registry issue: do not cache the error
- registry response issue (invalid schema, data store error, etc): cache the error for X minutes
But this current PR is good enough as a first step IMO
won't all pods send their requests to the registry at the same time? Could that lead to issues?
This is already the current behavior when a deploy happens; the registry should accommodate this (as Adrian mentioned, the number of requests per service should be low, basically 1 per topic consumed/produced per instance).
My PR explicitly skips caching only for avroregistry.UnavailableError.
Looking at the code, that can only happen if:
- the schema registry isn't available at all (which should be fixed by an open PR in universe)
- or if the schema registry returns a 5xx response code
I believe that it would be a mistake to cache the error in either of those cases. If we want to limit the cache time for other types of errors, that would be beyond the scope of this PR.
This is already the current behavior when a deploy happens
Indeed. Then no blocker, and we can consider improving it in the future if necessary.
fixes #39
When the schema registry is unavailable, we should not cache the error, to allow the client to retry.
Because of a cyclical import, I can't import the actual error type and compare it, so I'm checking the error message instead; I'm open to suggestions on a better way to do this check.
I've also done some gardening on this package: upgraded the Go version, bumped packages with vulnerabilities, updated the GitHub Actions, and fixed a failing test.