Skip to content

Conversation

@symious
Copy link
Contributor

@symious symious commented Oct 1, 2025

What changes were proposed in this pull request?

The current logic of ReadIndex limits the performance of FollowerRead.

In production, the vast majority of requests are for unchanged old data, so running the readIndex logic for every request becomes somewhat wasteful.

This ticket is to introduce LocalLease for followers, so that followers can decide if it's catched up with Leader, and application can reasonably make use of this information.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/RATIS-2341

How was this patch tested?

Local performance test.

@szetszwo
Copy link
Contributor

szetszwo commented Oct 1, 2025

In production, the vast majority of requests are for old data, ...

@symious , Even for old data, the data could be changed from time to time. It may work if the data is immutable. However, Ratis does not aware if the data is immutable.

Could you give more details how the LocalLease algorithm works? Since the feature allows reading old data, is it similar to staleRead?

@symious
Copy link
Contributor Author

symious commented Oct 2, 2025

@szetszwo Thanks for the review. Yes, updated the description that most of the queries are on unchanged old date.

The idea for the LocalLease is for the use of FollowerRead, so that the follower can decide if it's updated with the leader and then read local data for clients.

@OneSizeFitsQuorum
Copy link
Contributor

OneSizeFitsQuorum commented Oct 2, 2025

For follower reads, if there is no communication with the leader, it is impossible to ensure linearizable.

If you want to optimize the RTT delay in this round, you can enable lease reads and disable follower reads, ensuring that queries are only made on the leader. Additionally, you can utilize cluster resources through sharding and multiple Raft groups. This is the lowest-latency safe solution.

The essence of follower reads is to optimize the throughput of a single consensus group (since adding multiple followers allows for horizontal scaling of the read state machine throughput). Sacrificing correctness to optimize follower-read latency seems counterproductive.

@symious
Copy link
Contributor Author

symious commented Oct 3, 2025

@OneSizeFitsQuorum Thank you for the review. In high write load situations, readIndex perform can not fulfill requirements, and the good point of LocalLease is that users can choose the threshold to read from leader or follower. And it's relative correctness, since with ReadIndex you can not have absolute correctness.

@szetszwo
Copy link
Contributor

szetszwo commented Oct 3, 2025

... the good point of LocalLease is that users can choose the threshold to read from leader or follower. And it's relative correctness, ...

@symious, we should make it configurable and emphasize that it can return stale data.

... since with ReadIndex you can not have absolute correctness.

What do you mean by absolute correctness? Is it the same as linearizable read? ReadIndex does provide linearizable read but LocalLease does not. In other words, ReadIndex won't return stale data.

@symious
Copy link
Contributor Author

symious commented Oct 6, 2025

@szetszwo Thank you for the review.

I think for the ReadIndex implementation, stale read could still happen.
At time T0, the leader returns an index to the follower.
At time T1, the follower receives this index.
At time T2, the follower’s commitIndex reaches that index.
At time T3, the follower finishes the read query and returns the result to the client —
however, by client receiving the response, the leader may have already made changes, such as deleting that file.

@szetszwo
Copy link
Contributor

szetszwo commented Oct 6, 2025

Stale read means reading the old value of already changed data. In your example, the data is updated after read.

Let's add commit indices: leader (c6) and follower (c4).

  • At time T0, the leader returns an index (c6) to the follower.
  • At time T1, the follower (c4) receives this index (c6).
  • At time T2, the follower’s commitIndex (c4) reaches that index (c6).
  • At time T3, the follower (c6) finishes the read query and returns the result to the client —
  • however, by client receiving the response, the leader may have already made changes to c7, such as deleting that file.

This is the same as sending the request directly to the leader:

  • At time T0, the client sends a read request to the leader (c6).
  • The leader (c6) finishes the read query.
  • While the packet is stuck in the network due to congestion, the leader has changed to c7.
  • Client receives the result (c6).

@symious
Copy link
Contributor Author

symious commented Oct 6, 2025

Yes, direclty read from leader could also have this issue, only read from follower will amptify this issue, since additional [t0, t2) time range have large latency.
Checked with GPT if these situations are "linearize", it said no.

In my view, this is a relative outcome — it mainly depends on how the threshold is set.
If the threshold is very strict, latency can indeed lead to an incorrect result.
However, with a larger threshold, the result is acceptable.

Therefore, the purpose of LocalLease is to allow the application to pass this threshold in as a parameter,
so that the client can decide how to proceed based on it.

@OneSizeFitsQuorum
Copy link
Contributor

Hi, @symious
The example you mentioned is linearizable, next follower or leader read after t3 will see the changes the leader just made.
see thesis for detailed defination

@symious
Copy link
Contributor Author

symious commented Oct 6, 2025

@OneSizeFitsQuorum Thank you for the info.

GPT’s answers can sometimes be wrong too, but from the client’s perspective, if at T1 I send a balance inquiry to the bank, at T2 my account is debited by 100 yuan, and at T3 I receive a response showing that I still have 100 yuan, then the result is clearly incorrect from the client’s point of view.

In the case of ReadIndex, this kind of situation is even more likely to occur because of higher latency — especially under high write load conditions.

Returning to the original question, LocalLease allows the user to decide whether to trust the results read from a follower within certain limits — limitLag and limitTime.
Essentially, it’s the same issue described above:
if the limits are small, the user can still go through the ReadIndex path;
but if the user can tolerate some staleness, they may choose to read directly from the follower, even though the value might be slightly outdated or inaccurate.

@szetszwo
Copy link
Contributor

szetszwo commented Oct 6, 2025

... if at T1 I send a balance inquiry to the bank, at T2 my account is debited by 100 yuan, and at T3 I receive a response showing that I still have 100 yuan, then the result is clearly incorrect from the client’s point of view.

That's why they usually have a transaction number in the record. Our transaction number is the log index. The banks won't know when the customers see the information but they know when they have retrieved the information. BTW, this case does happen in real life.

The log index L (transaction number) means that the client request was received at index L. The server can reply any values on or after index L, not necessarily the latest value. (revised)

Raft is to make a single server state machine running in multiple servers. So, any behaviors happening in a single server are acceptable.

Now, I may understand why you talk about absolute/relative -- it is the same as the Theory of Special Relativity from Albert Einstein. 😀 Suppose we have a high precision clock which can display nano seconds.

  • T0: we look at the time at the clock
  • T1: the light travels from the clock to our eyes
  • T2: the signal is transmitted from our eyes to our brain.
  • T3: we learn that the "current" time is T0. However, the "absolute" time actually is T3.

The "current" time is "relative" to the observers since the distances are different.

@szetszwo
Copy link
Contributor

szetszwo commented Oct 6, 2025

... LocalLease allows the user to decide whether to trust the results read from a follower within certain limits — limitLag and limitTime. ...

@symious , we need to make it configurable (global or per request) as mentioned.

... if the user can tolerate some staleness, they may choose to read directly from the follower, even though the value might be slightly outdated or inaccurate.

Agree. We already support staleRead with minIndex. We could support another staleRead API with time limit.

I do not oppose adding a new stale read with time limit. We just should not claim that it is the same as linearizable read.

@symious
Copy link
Contributor Author

symious commented Oct 7, 2025

we need to make it configurable (global or per request) as mentioned.

@szetszwo Currently the API accepts a limitLag and limitTime, applications can configure the inputs. So for Ratis side, it's more like a BiPredicate<Integer, Long>.

@szetszwo
Copy link
Contributor

szetszwo commented Oct 7, 2025

How to use it? Could you add a test with CounterStateMachine to show it?

@OneSizeFitsQuorum
Copy link
Contributor

from the client’s perspective, if at T1 I send a balance inquiry to the bank, at T2 my account is debited by 100 yuan, and at T3 I receive a response showing that I still have 100 yuan, then the result is clearly incorrect from the client’s point of view.

Such situations are typical in the presence of concurrent requests. Above thesis reveals that linearizability provides visibility guarantees only for operations that can be organized into a partial order. The model requires all operations to be arranged in a single linear extension of that order. In particular, when a request at time t2 is initiated only after the response of a request at time t1 has been received, the execution order is well defined, and t2 must occur after t1. Conversely, if both requests t1 and t2 are issued before any response is obtained, their execution order on the server side remains indeterminate. Either t1 or t2 may be executed first, and such nondeterminism does not violate the property of linearizability.

Moreover, the example under discussion is independent of the Raft protocol. The same behavior may arise even in a single-machine system. If one interprets this case as a violation of linearizability, it would imply that a single-machine system could not satisfy linearizability either, which is clearly untenable.

Certainly, providing an interface for stale reads is feasible. However, it is important to explicitly note that such an interface does not guarantee linearizability even others can.

@symious
Copy link
Contributor Author

symious commented Oct 8, 2025

@szetszwo Added test case to show the usage of the method. PTAL.

@symious
Copy link
Contributor Author

symious commented Oct 8, 2025

it is important to explicitly note that such an interface does not guarantee linearizability even others can.

Yes, this should be able to configure on application/User side if using this method.

Comment on lines +85 to +86
boolean isFollowerUptoDate = d1.okForLocalReadBounded(1000, 100);
// Clients can choose to query from local StateMachine or leader
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How could the client call okForLocalReadBounded?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Client need to build the RaftServer first.

In Ozone, Clients will send message to OzoneManager, and OzoneManager will do the check.

Copy link
Contributor

@szetszwo szetszwo Oct 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Client need to build the RaftServer first.

Client builds a new server? How would it work? The Follower is already running. The server built by the client will be different from the Follower. Could you show it in the test?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@szetszwo Add a new API for client to call this method directly. PTAL.

Copy link
Contributor

@szetszwo szetszwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@symious , if we are adding a new sendStaleRead method, we should not change ratis-server-api anymore since all the calls are internal calls.

* @param limitTimeMs if server is follower, limitTimeMs will be used to decide read from follower or not
* @return the reply.
*/
RaftClientReply sendReadOnly(Message message, RaftPeerId server, int limitLog, long limitTimeMs) throws IOException;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should call it sendStaleRead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@szetszwo The addition of "okForLocalReadBounded" in RaftServer is for the follower OzoneManager to decide to use it's local data, because in Ozone, clients are not using ratis' client IO API.

The latest commit is for your last question: "Client builds a new server? How would it work".

Basically this feature is for OzoneManager as a RaftServer it to provide follower read based on the result of "okForLocalReadBounded", if it's for direct ratis' client call, the original sendStaleRead should be enough.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The addition of "okForLocalReadBounded" in RaftServer is for the follower OzoneManager to decide to use it's local data, because in Ozone, clients are not using ratis' client IO API.

Could you add an example to show exactly how Ozone will use it?

The latest commit is for your last question: "Client builds a new server? How would it work".

My question actually is how to use okForLocalReadBounded.

If Ozone does not need this new sendReadOnly(..., long limitTimeMs) method. Please don't add it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@szetszwo The usage on OzoneManager side is as follows:

      if (ozoneManager.isFollowerReadLocalLeaseEnabled() &&
              omRatisServer.getServerDivision().okForLocalReadBounded(
                      ozoneManager.getFollowerReadLocalLeaseLagLimit(),
                      ozoneManager.getFollowerReadLocalLeaseTimeMs())) {
        return handler.handleReadRequest(request);
      } else if (omRatisServer.isLinearizableRead()) {
        return submitReadRequestToRatis(request);
      }

@szetszwo
Copy link
Contributor

szetszwo commented Oct 17, 2025

... The usage on OzoneManager side is as follows:

      if (ozoneManager.isFollowerReadLocalLeaseEnabled() &&
              omRatisServer.getServerDivision().okForLocalReadBounded(
                      ozoneManager.getFollowerReadLocalLeaseLagLimit(),
                      ozoneManager.getFollowerReadLocalLeaseTimeMs())) {
        return handler.handleReadRequest(request);
      } else if (omRatisServer.isLinearizableRead()) {
        return

@symious , The current code already supports it. We may add the following method to Ozone:

  boolean allowFollowerReadLocalLease(RaftServer.Division ratisDivision, long leaseTimeMsLimit, long leaseLogLimit) {
    final DivisionInfo divisionInfo = ratisDivision.getInfo();
    final FollowerInfoProto followerInfo = divisionInfo.getRoleInfoProto().getFollowerInfo();
    if (followerInfo == null) {
      return false; // not follower
    }
    final ServerRpcProto leaderInfo = followerInfo.getLeaderInfo();
    if (leaderInfo == null) {
      return false; // no leader
    }

    if (Time.monotonicNow() - leaderInfo.getLastRpcElapsedTimeMs() > leaseTimeMsLimit) {
      return false; // lease time expired
    }

    final RaftPeerId leaderId = divisionInfo.getLeaderId();
    Long leaderCommit = null;
    if (leaderId != null) {
      for (CommitInfoProto i : ratisDivision.getCommitInfos()) {
        if (i.getServer().getId().equals(leaderId.toByteString())) {
          leaderCommit = i.getCommitIndex();
        }
      }
    }
    if (leaderCommit == null) {
      return false;
    }

    return divisionInfo.getLastAppliedIndex() + leaseLogLimit >= leaderCommit;
  }

@symious
Copy link
Contributor Author

symious commented Oct 17, 2025

@szetszwo Thanks for the proposal!

The current approach works functionally, but it might introduce overhead under high QPS scenarios — every request would traverse multiple layers (getInfo() → getRoleInfoProto() → getFollowerInfo()), iterate through getCommitInfos(), and perform repeated time checks. These protobuf object accesses and list scans could become a performance hotspot.

The introduce of LocalLease seems a more efficient way, This would reduce per-request cost, stabilize the lease semantics across requests, and make it easier to add metrics or future extensions (like RTT-based adjustments). What do you think?

@szetszwo
Copy link
Contributor

szetszwo commented Oct 18, 2025

The current approach works functionally, but it might introduce overhead under high QPS scenarios — ...

This PR also introduces overhead to appendEntries and stores duplicated information. The overhead is added to all StateMachine implementations, even if they are not using this new feature.

Moreover, the new API okForLocalReadBounded seems inflexible -- it puts the logic in Ratis instead of Ozone. Ozone cannot change it later on.

I agree that the allowFollowerReadLocalLease(..) has overhead. How about using it as the first implementation? Then we can measure the performance impact and seek improvement. The overhead may or may not be significant.

@symious
Copy link
Contributor Author

symious commented Oct 21, 2025

@szetszwo Tested with 3 Client Servers (each 300 client threads) reading 3 OzoneManagers.

Feature on Ozone Side: 80135 calls/second.
Feature on Ozone Side: 202727 calls/second.
Feature on Ratis Side: 205023 calls/second.

The first result is due to some bugs of the code.

The result are similar, overhead can be ignore then. And it is better to implement on Ratis' application side, rather than Ratis itself. I'll close this ticket then.

@szetszwo @OneSizeFitsQuorum Thank you again for the review.

@symious symious closed this Oct 21, 2025
@szetszwo
Copy link
Contributor

@symious , thanks for comparing the performance!

One potential improvement is to add a non-proto version of getRoleInfoProto(). Then, it can avoid the serialization/deserialization cost.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants