-
Notifications
You must be signed in to change notification settings - Fork 438
RATIS-2341. Introduce LocalLease for roughly up-to-date check #1296
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@symious , Even for old data, the data could be changed from time to time. It may work if the data is immutable. However, Ratis does not aware if the data is immutable. Could you give more details how the LocalLease algorithm works? Since the feature allows reading old data, is it similar to staleRead? |
|
@szetszwo Thanks for the review. Yes, updated the description that most of the queries are on unchanged old date. The idea for the LocalLease is for the use of FollowerRead, so that the follower can decide if it's updated with the leader and then read local data for clients. |
|
For follower reads, if there is no communication with the leader, it is impossible to ensure linearizable. If you want to optimize the RTT delay in this round, you can enable lease reads and disable follower reads, ensuring that queries are only made on the leader. Additionally, you can utilize cluster resources through sharding and multiple Raft groups. This is the lowest-latency safe solution. The essence of follower reads is to optimize the throughput of a single consensus group (since adding multiple followers allows for horizontal scaling of the read state machine throughput). Sacrificing correctness to optimize follower-read latency seems counterproductive. |
|
@OneSizeFitsQuorum Thank you for the review. In high write load situations, readIndex perform can not fulfill requirements, and the good point of LocalLease is that users can choose the threshold to read from leader or follower. And it's relative correctness, since with ReadIndex you can not have absolute correctness. |
@symious, we should make it configurable and emphasize that it can return stale data.
What do you mean by absolute correctness? Is it the same as linearizable read? ReadIndex does provide linearizable read but LocalLease does not. In other words, ReadIndex won't return stale data. |
|
@szetszwo Thank you for the review. I think for the ReadIndex implementation, stale read could still happen. |
|
Stale read means reading the old value of already changed data. In your example, the data is updated after read. Let's add commit indices: leader (c6) and follower (c4).
This is the same as sending the request directly to the leader:
|
|
Yes, direclty read from leader could also have this issue, only read from follower will amptify this issue, since additional [t0, t2) time range have large latency. In my view, this is a relative outcome — it mainly depends on how the threshold is set. Therefore, the purpose of LocalLease is to allow the application to pass this threshold in as a parameter, |
|
@OneSizeFitsQuorum Thank you for the info. GPT’s answers can sometimes be wrong too, but from the client’s perspective, if at T1 I send a balance inquiry to the bank, at T2 my account is debited by 100 yuan, and at T3 I receive a response showing that I still have 100 yuan, then the result is clearly incorrect from the client’s point of view. In the case of ReadIndex, this kind of situation is even more likely to occur because of higher latency — especially under high write load conditions. Returning to the original question, LocalLease allows the user to decide whether to trust the results read from a follower within certain limits — limitLag and limitTime. |
That's why they usually have a transaction number in the record. Our transaction number is the log index. The banks won't know when the customers see the information but they know when they have retrieved the information. BTW, this case does happen in real life. The log index L (transaction number) means that the client request was received at index L. The server can reply any values on or after index L, not necessarily the latest value. (revised) Raft is to make a single server state machine running in multiple servers. So, any behaviors happening in a single server are acceptable. Now, I may understand why you talk about absolute/relative -- it is the same as the Theory of Special Relativity from Albert Einstein. 😀 Suppose we have a high precision clock which can display nano seconds.
The "current" time is "relative" to the observers since the distances are different. |
@symious , we need to make it configurable (global or per request) as mentioned.
Agree. We already support staleRead with I do not oppose adding a new stale read with time limit. We just should not claim that it is the same as linearizable read. |
@szetszwo Currently the API accepts a limitLag and limitTime, applications can configure the inputs. So for Ratis side, it's more like a BiPredicate<Integer, Long>. |
|
How to use it? Could you add a test with |
Such situations are typical in the presence of concurrent requests. Above thesis reveals that linearizability provides visibility guarantees only for operations that can be organized into a partial order. The model requires all operations to be arranged in a single linear extension of that order. In particular, when a request at time t2 is initiated only after the response of a request at time t1 has been received, the execution order is well defined, and t2 must occur after t1. Conversely, if both requests t1 and t2 are issued before any response is obtained, their execution order on the server side remains indeterminate. Either t1 or t2 may be executed first, and such nondeterminism does not violate the property of linearizability. Moreover, the example under discussion is independent of the Raft protocol. The same behavior may arise even in a single-machine system. If one interprets this case as a violation of linearizability, it would imply that a single-machine system could not satisfy linearizability either, which is clearly untenable. Certainly, providing an interface for stale reads is feasible. However, it is important to explicitly note that such an interface does not guarantee linearizability even others can. |
|
@szetszwo Added test case to show the usage of the method. PTAL. |
Yes, this should be able to configure on application/User side if using this method. |
| boolean isFollowerUptoDate = d1.okForLocalReadBounded(1000, 100); | ||
| // Clients can choose to query from local StateMachine or leader |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How could the client call okForLocalReadBounded?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Client need to build the RaftServer first.
In Ozone, Clients will send message to OzoneManager, and OzoneManager will do the check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Client need to build the RaftServer first.
Client builds a new server? How would it work? The Follower is already running. The server built by the client will be different from the Follower. Could you show it in the test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@szetszwo Add a new API for client to call this method directly. PTAL.
szetszwo
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@symious , if we are adding a new sendStaleRead method, we should not change ratis-server-api anymore since all the calls are internal calls.
| * @param limitTimeMs if server is follower, limitTimeMs will be used to decide read from follower or not | ||
| * @return the reply. | ||
| */ | ||
| RaftClientReply sendReadOnly(Message message, RaftPeerId server, int limitLog, long limitTimeMs) throws IOException; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should call it sendStaleRead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@szetszwo The addition of "okForLocalReadBounded" in RaftServer is for the follower OzoneManager to decide to use it's local data, because in Ozone, clients are not using ratis' client IO API.
The latest commit is for your last question: "Client builds a new server? How would it work".
Basically this feature is for OzoneManager as a RaftServer it to provide follower read based on the result of "okForLocalReadBounded", if it's for direct ratis' client call, the original sendStaleRead should be enough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The addition of "okForLocalReadBounded" in RaftServer is for the follower OzoneManager to decide to use it's local data, because in Ozone, clients are not using ratis' client IO API.
Could you add an example to show exactly how Ozone will use it?
The latest commit is for your last question: "Client builds a new server? How would it work".
My question actually is how to use okForLocalReadBounded.
If Ozone does not need this new sendReadOnly(..., long limitTimeMs) method. Please don't add it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@szetszwo The usage on OzoneManager side is as follows:
if (ozoneManager.isFollowerReadLocalLeaseEnabled() &&
omRatisServer.getServerDivision().okForLocalReadBounded(
ozoneManager.getFollowerReadLocalLeaseLagLimit(),
ozoneManager.getFollowerReadLocalLeaseTimeMs())) {
return handler.handleReadRequest(request);
} else if (omRatisServer.isLinearizableRead()) {
return submitReadRequestToRatis(request);
}
@symious , The current code already supports it. We may add the following method to Ozone: boolean allowFollowerReadLocalLease(RaftServer.Division ratisDivision, long leaseTimeMsLimit, long leaseLogLimit) {
final DivisionInfo divisionInfo = ratisDivision.getInfo();
final FollowerInfoProto followerInfo = divisionInfo.getRoleInfoProto().getFollowerInfo();
if (followerInfo == null) {
return false; // not follower
}
final ServerRpcProto leaderInfo = followerInfo.getLeaderInfo();
if (leaderInfo == null) {
return false; // no leader
}
if (Time.monotonicNow() - leaderInfo.getLastRpcElapsedTimeMs() > leaseTimeMsLimit) {
return false; // lease time expired
}
final RaftPeerId leaderId = divisionInfo.getLeaderId();
Long leaderCommit = null;
if (leaderId != null) {
for (CommitInfoProto i : ratisDivision.getCommitInfos()) {
if (i.getServer().getId().equals(leaderId.toByteString())) {
leaderCommit = i.getCommitIndex();
}
}
}
if (leaderCommit == null) {
return false;
}
return divisionInfo.getLastAppliedIndex() + leaseLogLimit >= leaderCommit;
} |
|
@szetszwo Thanks for the proposal! The current approach works functionally, but it might introduce overhead under high QPS scenarios — every request would traverse multiple layers (getInfo() → getRoleInfoProto() → getFollowerInfo()), iterate through getCommitInfos(), and perform repeated time checks. These protobuf object accesses and list scans could become a performance hotspot. The introduce of LocalLease seems a more efficient way, This would reduce per-request cost, stabilize the lease semantics across requests, and make it easier to add metrics or future extensions (like RTT-based adjustments). What do you think? |
This PR also introduces overhead to appendEntries and stores duplicated information. The overhead is added to all StateMachine implementations, even if they are not using this new feature. Moreover, the new API I agree that the |
|
@szetszwo Tested with 3 Client Servers (each 300 client threads) reading 3 OzoneManagers.
The first result is due to some bugs of the code. The result are similar, overhead can be ignore then. And it is better to implement on Ratis' application side, rather than Ratis itself. I'll close this ticket then. @szetszwo @OneSizeFitsQuorum Thank you again for the review. |
|
@symious , thanks for comparing the performance! One potential improvement is to add a non-proto version of |
What changes were proposed in this pull request?
The current logic of ReadIndex limits the performance of FollowerRead.
In production, the vast majority of requests are for unchanged old data, so running the readIndex logic for every request becomes somewhat wasteful.
This ticket is to introduce LocalLease for followers, so that followers can decide if it's catched up with Leader, and application can reasonably make use of this information.
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/RATIS-2341
How was this patch tested?
Local performance test.