secondsBehindMaster / replication lag spuriously drops to 0 between DML batches #12580

@mengxian-li

What did you do?

Run a DM task that replicates from MySQL to TiDB while it has significant replication lag (e.g. DM
is processing binlog events from several minutes ago). Then run query-status repeatedly at high
frequency (every ~100ms or faster).

Minimal reproduction steps:

  1. Start a DM task with a significant upstream lag (e.g. insert a large batch of rows ahead of DM's
    checkpoint).
  2. While DM is actively catching up (DML batches flowing through syncer workers), run query-status
    in a loop.
  3. Observe the secondsBehindMaster field in the output.
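
To make step 3 less dependent on eyeballing the output, the sampled values can be scanned for isolated zeros. The sketch below is only an illustration: spuriousZeros and the sample values are hypothetical, and it assumes the secondsBehindMaster readings have already been collected into a slice by whatever polling loop is used.

```go
package main

import "fmt"

// spuriousZeros returns the indices of samples that read 0 while both
// neighbouring samples are non-zero, i.e. a momentary drop rather than
// a genuine catch-up. The samples are hypothetical secondsBehindMaster
// readings collected by polling query-status.
func spuriousZeros(samples []int64) []int {
	var idx []int
	for i := 1; i+1 < len(samples); i++ {
		if samples[i] == 0 && samples[i-1] > 0 && samples[i+1] > 0 {
			idx = append(idx, i)
		}
	}
	return idx
}

func main() {
	// Made-up readings: lag of ~300s with two momentary drops to 0.
	samples := []int64{300, 300, 0, 299, 299, 0, 298}
	fmt.Println(spuriousZeros(samples)) // prints [2 5]
}
```

A run of zeros at the end of the series is not flagged, since that pattern is what a genuine catch-up looks like.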

What did you expect to see?

secondsBehindMaster should remain at the actual lag value (e.g. ~300s if 5 minutes behind). It
should only drop to 0 when DM has genuinely caught up to the upstream.

What did you see instead?

secondsBehindMaster intermittently and spuriously drops to 0, then jumps back to the real lag value
a moment later. This happens repeatedly during active replication, even though DM is significantly
behind the upstream. The false 0 reading disappears on the next metric update cycle (~100ms).

Root cause: updateReplicationLagMetric() runs every 100ms and computes lag from the minimum
non-zero entry in workerJobTSArray. After each DML batch completes, successFunc calls
updateReplicationJobTS(nil, idx), which resets that worker's slot to 0. In the brief window
between one batch finishing and the next starting, every worker's slot can be 0 at the same time,
so there is no non-zero entry to take the minimum of. If the 100ms cron fires in that window,
minTS == 0, lag is computed as 0, and secondsBehindMaster is stored as 0, which is a false reading.
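
The window described above can be modelled in a few lines. This is a simplified sketch, not the DM source: minNonZero and computeLag are hypothetical names, the slots stand in for workerJobTSArray, and lag is assumed to be simply "now minus the oldest in-flight timestamp".

```go
package main

import "fmt"

// minNonZero returns the smallest non-zero timestamp in slots,
// or 0 when every slot is 0 (no job in flight on any worker).
func minNonZero(slots []int64) int64 {
	var minTS int64
	for _, ts := range slots {
		if ts != 0 && (minTS == 0 || ts < minTS) {
			minTS = ts
		}
	}
	return minTS
}

// computeLag models the periodic metric update. When all slots are 0
// it reports 0, which cannot be told apart from "caught up".
func computeLag(slots []int64, now int64) int64 {
	minTS := minNonZero(slots)
	if minTS == 0 {
		return 0
	}
	return now - minTS
}

func main() {
	now := int64(1_000_300)

	// Two workers are processing events that are ~300s old.
	slots := []int64{1_000_000, 1_000_005}
	fmt.Println(computeLag(slots, now)) // prints 300

	// Both batches finish and each slot is reset to 0.
	for i := range slots {
		slots[i] = 0
	}
	fmt.Println(computeLag(slots, now)) // prints 0 (the false reading)

	// The next batch starts and the real lag reappears.
	slots[0] = 1_000_010
	fmt.Println(computeLag(slots, now)) // prints 290
}
```

In this model the 0 appears only if the update happens to run while every slot is reset, which matches the intermittent nature of the drop.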

Versions of the cluster

(affects all versions with workerJobTSArray-based lag computation)
DM version (run dmctl -V or dm-worker -V or dm-master -V):

Release Version: v8.5.2-18-g27e2ad16b

Downstream TiDB cluster version (execute SELECT tidb_version(); in a MySQL client):

Release Version: v8.5.2-57-gad8d7f1c03

Current status of DM cluster (execute query-status <task-name> in dmctl)

(task running normally, but secondsBehindMaster oscillates between 0 and the real lag value) 

Labels: area/dm, contribution, first-time-contributor, type/bug
