Skip to content

[DM] checkpoint may be flushed after downstream BeginTx failure and skip unapplied DML on resume #12626

@GMHDBJD

Description

@GMHDBJD

What did you do?

While investigating a DM incremental replication consistency issue, I found that checkpoint flushing does not treat downstream BeginTx failures as an execution error that should block checkpoint advancement.

A minimal local reproduction can be added on top of pingcap/tiflow upstream/master commit 43647d57a1d03070cd57344c4be9e4900d9e6cea:

go test ./dm/syncer -run TestCheckpointFlushWorkerSkipsCheckpointOnBeginError -count=1

The test sets the syncer's checkpoint flush worker execError to a downstream begin failure:

execError.Store(terror.ErrDBExecuteFailedBegin.Delegate(sql.ErrConnDone))

and then triggers a checkpoint flush. The test expects checkpoint flush to be skipped because the downstream DML transaction did not successfully begin, so pending DMLs may not be durable downstream.

Relevant code paths:

  • dm/pkg/conn/baseconn.go: downstream BeginTx errors are wrapped as terror.ErrDBExecuteFailedBegin.
  • dm/syncer/checkpoint_flush_worker.go: checkpoint flush is skipped only when execError matches terror.ErrDBExecuteFailed or terror.ErrDBUnExpect.
  • dm/syncer/syncer.go: sync/async checkpoint flush uses the same skip predicate.

What did you expect to see?

Checkpoint flush should be skipped when a downstream SQL execution path fails in a way that means DMLs may not be durable downstream, including BeginTx failures such as sql.ErrConnDone / sql: connection is already closed.

After restart/resume, DM should replay from a checkpoint that is not beyond the unapplied DMLs.

What did you see instead?

The checkpoint flush worker still calls FlushPointsExcept after execError is set to terror.ErrDBExecuteFailedBegin.Delegate(sql.ErrConnDone).

The local regression test fails as follows:

--- FAIL: TestCheckpointFlushWorkerSkipsCheckpointOnBeginError (0.00s)
    checkpoint_flush_worker_repro_test.go:129:
        Error:          Not equal:
                        expected: 0
                        actual  : 1
        Messages:       checkpoint flush must be skipped after downstream BeginTx failure; flushing here can persist a checkpoint past non-durable DML
FAIL
FAIL github.com/pingcap/tiflow/dm/syncer 0.085s

This means a checkpoint can be persisted past DML jobs that failed before the downstream transaction was even created. On resume, DM can then start from the advanced checkpoint and skip those DMLs.

Impact

This is a data correctness risk for DM incremental replication. If the downstream connection is closed during BeginTx, some DMLs may not be applied to the downstream, but the checkpoint can still advance. After resume, the unapplied DMLs may not be replayed.

There is also a related diagnostic issue: judgeKeyNotFound can run after ExecuteSQL returns an error, which may produce misleading no matching record warnings for batches that did not actually execute successfully.

Suggested fix

  • Treat terror.ErrDBExecuteFailedBegin as a checkpoint-blocking execution error in all checkpoint flush guards, together with terror.ErrDBExecuteFailed and terror.ErrDBUnExpect.
  • Consider treating sql.ErrConnDone / sql: connection is already closed during BeginTx as retryable where safe.
  • Avoid running key-not-found diagnostics after ExecuteSQL has already returned an execution error unless the execution result is valid.
  • Add a regression test covering ErrDBExecuteFailedBegin(sql.ErrConnDone) and checkpoint flush behavior.

Versions of the cluster

DM version:

pingcap/tiflow upstream/master @ 43647d57a1d03070cd57344c4be9e4900d9e6cea

Upstream MySQL/MariaDB server version:

N/A for the minimal code-level reproduction

Downstream TiDB cluster version:

N/A for the minimal code-level reproduction

How did you deploy DM:

N/A for the minimal code-level reproduction

Other interesting information:

Observed when downstream transaction begin returned sql: connection is already closed.

Current status of DM cluster

N/A for the minimal code-level reproduction

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions