RATIS-2341. Introduce LocalLease for roughly up-to-date check #1296

symious · 2025-10-01T07:37:21Z

What changes were proposed in this pull request?

The current logic of ReadIndex limits the performance of FollowerRead.

In production, the vast majority of requests are for unchanged old data, so running the readIndex logic for every request becomes somewhat wasteful.

This ticket is to introduce LocalLease for followers, so that followers can decide if it's catched up with Leader, and application can reasonably make use of this information.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/RATIS-2341

How was this patch tested?

Local performance test.

szetszwo · 2025-10-01T18:04:58Z

In production, the vast majority of requests are for old data, ...

@symious , Even for old data, the data could be changed from time to time. It may work if the data is immutable. However, Ratis does not aware if the data is immutable.

Could you give more details how the LocalLease algorithm works? Since the feature allows reading old data, is it similar to staleRead?

symious · 2025-10-02T08:01:07Z

@szetszwo Thanks for the review. Yes, updated the description that most of the queries are on unchanged old date.

The idea for the LocalLease is for the use of FollowerRead, so that the follower can decide if it's updated with the leader and then read local data for clients.

OneSizeFitsQuorum · 2025-10-02T13:03:02Z

For follower reads, if there is no communication with the leader, it is impossible to ensure linearizable.

If you want to optimize the RTT delay in this round, you can enable lease reads and disable follower reads, ensuring that queries are only made on the leader. Additionally, you can utilize cluster resources through sharding and multiple Raft groups. This is the lowest-latency safe solution.

The essence of follower reads is to optimize the throughput of a single consensus group (since adding multiple followers allows for horizontal scaling of the read state machine throughput). Sacrificing correctness to optimize follower-read latency seems counterproductive.

symious · 2025-10-03T05:50:44Z

@OneSizeFitsQuorum Thank you for the review. In high write load situations, readIndex perform can not fulfill requirements, and the good point of LocalLease is that users can choose the threshold to read from leader or follower. And it's relative correctness, since with ReadIndex you can not have absolute correctness.

szetszwo · 2025-10-03T18:07:03Z

... the good point of LocalLease is that users can choose the threshold to read from leader or follower. And it's relative correctness, ...

@symious, we should make it configurable and emphasize that it can return stale data.

... since with ReadIndex you can not have absolute correctness.

What do you mean by absolute correctness? Is it the same as linearizable read? ReadIndex does provide linearizable read but LocalLease does not. In other words, ReadIndex won't return stale data.

symious · 2025-10-06T03:29:54Z

@szetszwo Thank you for the review.

I think for the ReadIndex implementation, stale read could still happen.
At time T0, the leader returns an index to the follower.
At time T1, the follower receives this index.
At time T2, the follower’s commitIndex reaches that index.
At time T3, the follower finishes the read query and returns the result to the client —
however, by client receiving the response, the leader may have already made changes, such as deleting that file.

szetszwo · 2025-10-06T05:03:55Z

Stale read means reading the old value of already changed data. In your example, the data is updated after read.

Let's add commit indices: leader (c6) and follower (c4).

At time T0, the leader returns an index (c6) to the follower.
At time T1, the follower (c4) receives this index (c6).
At time T2, the follower’s commitIndex (c4) reaches that index (c6).
At time T3, the follower (c6) finishes the read query and returns the result to the client —
however, by client receiving the response, the leader may have already made changes to c7, such as deleting that file.

This is the same as sending the request directly to the leader:

At time T0, the client sends a read request to the leader (c6).
The leader (c6) finishes the read query.
While the packet is stuck in the network due to congestion, the leader has changed to c7.
Client receives the result (c6).

symious · 2025-10-06T05:40:06Z

Yes, direclty read from leader could also have this issue, only read from follower will amptify this issue, since additional [t0, t2) time range have large latency.
Checked with GPT if these situations are "linearize", it said no.

In my view, this is a relative outcome — it mainly depends on how the threshold is set.
If the threshold is very strict, latency can indeed lead to an incorrect result.
However, with a larger threshold, the result is acceptable.

Therefore, the purpose of LocalLease is to allow the application to pass this threshold in as a parameter,
so that the client can decide how to proceed based on it.

OneSizeFitsQuorum · 2025-10-06T06:15:10Z

Hi, @symious
The example you mentioned is linearizable, next follower or leader read after t3 will see the changes the leader just made.
see thesis for detailed defination

symious · 2025-10-06T06:51:27Z

@OneSizeFitsQuorum Thank you for the info.

GPT’s answers can sometimes be wrong too, but from the client’s perspective, if at T1 I send a balance inquiry to the bank, at T2 my account is debited by 100 yuan, and at T3 I receive a response showing that I still have 100 yuan, then the result is clearly incorrect from the client’s point of view.

In the case of ReadIndex, this kind of situation is even more likely to occur because of higher latency — especially under high write load conditions.

Returning to the original question, LocalLease allows the user to decide whether to trust the results read from a follower within certain limits — limitLag and limitTime.
Essentially, it’s the same issue described above:
if the limits are small, the user can still go through the ReadIndex path;
but if the user can tolerate some staleness, they may choose to read directly from the follower, even though the value might be slightly outdated or inaccurate.

szetszwo · 2025-10-06T15:21:05Z

... if at T1 I send a balance inquiry to the bank, at T2 my account is debited by 100 yuan, and at T3 I receive a response showing that I still have 100 yuan, then the result is clearly incorrect from the client’s point of view.

That's why they usually have a transaction number in the record. Our transaction number is the log index. The banks won't know when the customers see the information but they know when they have retrieved the information. BTW, this case does happen in real life.

The log index L (transaction number) means that the client request was received at index L. The server can reply any values on or after index L, not necessarily the latest value. (revised)

Raft is to make a single server state machine running in multiple servers. So, any behaviors happening in a single server are acceptable.

Now, I may understand why you talk about absolute/relative -- it is the same as the Theory of Special Relativity from Albert Einstein. 😀 Suppose we have a high precision clock which can display nano seconds.

T0: we look at the time at the clock
T1: the light travels from the clock to our eyes
T2: the signal is transmitted from our eyes to our brain.
T3: we learn that the "current" time is T0. However, the "absolute" time actually is T3.

The "current" time is "relative" to the observers since the distances are different.

szetszwo · 2025-10-06T16:11:40Z

... LocalLease allows the user to decide whether to trust the results read from a follower within certain limits — limitLag and limitTime. ...

@symious , we need to make it configurable (global or per request) as mentioned.

... if the user can tolerate some staleness, they may choose to read directly from the follower, even though the value might be slightly outdated or inaccurate.

Agree. We already support staleRead with minIndex. We could support another staleRead API with time limit.

I do not oppose adding a new stale read with time limit. We just should not claim that it is the same as linearizable read.

symious · 2025-10-07T05:30:17Z

we need to make it configurable (global or per request) as mentioned.

@szetszwo Currently the API accepts a limitLag and limitTime, applications can configure the inputs. So for Ratis side, it's more like a BiPredicate<Integer, Long>.

szetszwo · 2025-10-07T15:54:06Z

How to use it? Could you add a test with CounterStateMachine to show it?

OneSizeFitsQuorum · 2025-10-08T06:45:00Z

from the client’s perspective, if at T1 I send a balance inquiry to the bank, at T2 my account is debited by 100 yuan, and at T3 I receive a response showing that I still have 100 yuan, then the result is clearly incorrect from the client’s point of view.

Such situations are typical in the presence of concurrent requests. Above thesis reveals that linearizability provides visibility guarantees only for operations that can be organized into a partial order. The model requires all operations to be arranged in a single linear extension of that order. In particular, when a request at time t2 is initiated only after the response of a request at time t1 has been received, the execution order is well defined, and t2 must occur after t1. Conversely, if both requests t1 and t2 are issued before any response is obtained, their execution order on the server side remains indeterminate. Either t1 or t2 may be executed first, and such nondeterminism does not violate the property of linearizability.

Moreover, the example under discussion is independent of the Raft protocol. The same behavior may arise even in a single-machine system. If one interprets this case as a violation of linearizability, it would imply that a single-machine system could not satisfy linearizability either, which is clearly untenable.

Certainly, providing an interface for stale reads is feasible. However, it is important to explicitly note that such an interface does not guarantee linearizability even others can.

symious · 2025-10-08T07:03:40Z

@szetszwo Added test case to show the usage of the method. PTAL.

symious · 2025-10-08T07:05:52Z

it is important to explicitly note that such an interface does not guarantee linearizability even others can.

Yes, this should be able to configure on application/User side if using this method.

szetszwo · 2025-10-08T15:46:24Z

ratis-examples/src/test/java/org/apache/ratis/examples/counter/TestCounter.java

+      boolean isFollowerUptoDate = d1.okForLocalReadBounded(1000, 100);
+      // Clients can choose to query from local StateMachine or leader


How could the client call okForLocalReadBounded?

Client need to build the RaftServer first.

In Ozone, Clients will send message to OzoneManager, and OzoneManager will do the check.

Client need to build the RaftServer first.

Client builds a new server? How would it work? The Follower is already running. The server built by the client will be different from the Follower. Could you show it in the test?

@szetszwo Add a new API for client to call this method directly. PTAL.

szetszwo

@symious , if we are adding a new sendStaleRead method, we should not change ratis-server-api anymore since all the calls are internal calls.

szetszwo · 2025-10-12T23:03:36Z

ratis-client/src/main/java/org/apache/ratis/client/api/BlockingApi.java

+   * @param limitTimeMs if server is follower, limitTimeMs will be used to decide read from follower or not
+   * @return the reply.
+   */
+  RaftClientReply sendReadOnly(Message message, RaftPeerId server, int limitLog, long limitTimeMs) throws IOException;


We should call it sendStaleRead.

@szetszwo The addition of "okForLocalReadBounded" in RaftServer is for the follower OzoneManager to decide to use it's local data, because in Ozone, clients are not using ratis' client IO API.

The latest commit is for your last question: "Client builds a new server? How would it work".

Basically this feature is for OzoneManager as a RaftServer it to provide follower read based on the result of "okForLocalReadBounded", if it's for direct ratis' client call, the original sendStaleRead should be enough.

The addition of "okForLocalReadBounded" in RaftServer is for the follower OzoneManager to decide to use it's local data, because in Ozone, clients are not using ratis' client IO API.

Could you add an example to show exactly how Ozone will use it?

The latest commit is for your last question: "Client builds a new server? How would it work".

My question actually is how to use okForLocalReadBounded.

If Ozone does not need this new sendReadOnly(..., long limitTimeMs) method. Please don't add it.

@szetszwo The usage on OzoneManager side is as follows:

if (ozoneManager.isFollowerReadLocalLeaseEnabled() && omRatisServer.getServerDivision().okForLocalReadBounded( ozoneManager.getFollowerReadLocalLeaseLagLimit(), ozoneManager.getFollowerReadLocalLeaseTimeMs())) { return handler.handleReadRequest(request); } else if (omRatisServer.isLinearizableRead()) { return submitReadRequestToRatis(request); }

szetszwo · 2025-10-17T17:42:20Z

... The usage on OzoneManager side is as follows:

      if (ozoneManager.isFollowerReadLocalLeaseEnabled() &&
              omRatisServer.getServerDivision().okForLocalReadBounded(
                      ozoneManager.getFollowerReadLocalLeaseLagLimit(),
                      ozoneManager.getFollowerReadLocalLeaseTimeMs())) {
        return handler.handleReadRequest(request);
      } else if (omRatisServer.isLinearizableRead()) {
        return

@symious , The current code already supports it. We may add the following method to Ozone:

  boolean allowFollowerReadLocalLease(RaftServer.Division ratisDivision, long leaseTimeMsLimit, long leaseLogLimit) {
    final DivisionInfo divisionInfo = ratisDivision.getInfo();
    final FollowerInfoProto followerInfo = divisionInfo.getRoleInfoProto().getFollowerInfo();
    if (followerInfo == null) {
      return false; // not follower
    }
    final ServerRpcProto leaderInfo = followerInfo.getLeaderInfo();
    if (leaderInfo == null) {
      return false; // no leader
    }

    if (Time.monotonicNow() - leaderInfo.getLastRpcElapsedTimeMs() > leaseTimeMsLimit) {
      return false; // lease time expired
    }

    final RaftPeerId leaderId = divisionInfo.getLeaderId();
    Long leaderCommit = null;
    if (leaderId != null) {
      for (CommitInfoProto i : ratisDivision.getCommitInfos()) {
        if (i.getServer().getId().equals(leaderId.toByteString())) {
          leaderCommit = i.getCommitIndex();
        }
      }
    }
    if (leaderCommit == null) {
      return false;
    }

    return divisionInfo.getLastAppliedIndex() + leaseLogLimit >= leaderCommit;
  }

symious · 2025-10-17T23:37:10Z

@szetszwo Thanks for the proposal!

The current approach works functionally, but it might introduce overhead under high QPS scenarios — every request would traverse multiple layers (getInfo() → getRoleInfoProto() → getFollowerInfo()), iterate through getCommitInfos(), and perform repeated time checks. These protobuf object accesses and list scans could become a performance hotspot.

The introduce of LocalLease seems a more efficient way, This would reduce per-request cost, stabilize the lease semantics across requests, and make it easier to add metrics or future extensions (like RTT-based adjustments). What do you think?

szetszwo · 2025-10-18T15:48:57Z

The current approach works functionally, but it might introduce overhead under high QPS scenarios — ...

This PR also introduces overhead to appendEntries and stores duplicated information. The overhead is added to all StateMachine implementations, even if they are not using this new feature.

Moreover, the new API okForLocalReadBounded seems inflexible -- it puts the logic in Ratis instead of Ozone. Ozone cannot change it later on.

I agree that the allowFollowerReadLocalLease(..) has overhead. How about using it as the first implementation? Then we can measure the performance impact and seek improvement. The overhead may or may not be significant.

symious · 2025-10-21T07:31:26Z

@szetszwo Tested with 3 Client Servers (each 300 client threads) reading 3 OzoneManagers.

~~Feature on Ozone Side: 80135 calls/second.~~
Feature on Ozone Side: 202727 calls/second.
Feature on Ratis Side: 205023 calls/second.

The first result is due to some bugs of the code.

The result are similar, overhead can be ignore then. And it is better to implement on Ratis' application side, rather than Ratis itself. I'll close this ticket then.

@szetszwo @OneSizeFitsQuorum Thank you again for the review.

szetszwo · 2025-10-21T18:36:51Z

@symious , thanks for comparing the performance!

One potential improvement is to add a non-proto version of getRoleInfoProto(). Then, it can avoid the serialization/deserialization cost.

RATIS-2341. Introduce LocalLease for roughly up-to-date check

2063194

RATIS-2341. Fix checkstyle

4f830ef

RATIS-2341. Fix checkstyle

e4251a4

RATIS-2341. Add test for new API usage

3b867e9

szetszwo reviewed Oct 8, 2025

View reviewed changes

symious added 4 commits October 10, 2025 18:11

RATIS-2341-1. test new API

beaf98c

Fix test case

2a26f29

Fix checkstyle

83e3672

fix checkstyle

9bee8ec

szetszwo reviewed Oct 12, 2025

View reviewed changes

Revert new API.

47f7445

symious closed this Oct 21, 2025

		boolean isFollowerUptoDate = d1.okForLocalReadBounded(1000, 100);
		// Clients can choose to query from local StateMachine or leader

RATIS-2341. Introduce LocalLease for roughly up-to-date check #1296

RATIS-2341. Introduce LocalLease for roughly up-to-date check #1296

Conversation

symious commented Oct 1, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes were proposed in this pull request?

What is the link to the Apache JIRA

How was this patch tested?

Uh oh!

szetszwo commented Oct 1, 2025

Uh oh!

symious commented Oct 2, 2025

Uh oh!

OneSizeFitsQuorum commented Oct 2, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

symious commented Oct 3, 2025

Uh oh!

szetszwo commented Oct 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

symious commented Oct 6, 2025

Uh oh!

szetszwo commented Oct 6, 2025

Uh oh!

symious commented Oct 6, 2025

Uh oh!

OneSizeFitsQuorum commented Oct 6, 2025

Uh oh!

symious commented Oct 6, 2025

Uh oh!

szetszwo commented Oct 6, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

szetszwo commented Oct 6, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

symious commented Oct 7, 2025

Uh oh!

szetszwo commented Oct 7, 2025

Uh oh!

OneSizeFitsQuorum commented Oct 8, 2025

Uh oh!

symious commented Oct 8, 2025

Uh oh!

symious commented Oct 8, 2025

Uh oh!

szetszwo Oct 8, 2025

Choose a reason for hiding this comment

Uh oh!

symious Oct 9, 2025

Choose a reason for hiding this comment

Uh oh!

szetszwo Oct 9, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

symious Oct 11, 2025

Choose a reason for hiding this comment

Uh oh!

szetszwo left a comment

Choose a reason for hiding this comment

Uh oh!

szetszwo Oct 12, 2025

Choose a reason for hiding this comment

Uh oh!

symious Oct 13, 2025

Choose a reason for hiding this comment

Uh oh!

szetszwo Oct 15, 2025

Choose a reason for hiding this comment

Uh oh!

symious Oct 16, 2025

Choose a reason for hiding this comment

Uh oh!

szetszwo commented Oct 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

symious commented Oct 17, 2025

Uh oh!

symious commented Oct 1, 2025 •

edited

Loading

OneSizeFitsQuorum commented Oct 2, 2025 •

edited

Loading

szetszwo commented Oct 3, 2025 •

edited

Loading

szetszwo commented Oct 6, 2025 •

edited

Loading

szetszwo commented Oct 6, 2025 •

edited

Loading

szetszwo Oct 9, 2025 •

edited

Loading

szetszwo commented Oct 17, 2025 •

edited

Loading

szetszwo commented Oct 18, 2025 •

edited

Loading

symious commented Oct 21, 2025 •

edited

Loading