What did you do?
While investigating a DM incremental replication consistency issue, I found that checkpoint flushing does not treat downstream BeginTx failures as execution errors that should block checkpoint advancement.
A minimal local reproduction can be added on top of pingcap/tiflow upstream/master commit 43647d57a1d03070cd57344c4be9e4900d9e6cea:
go test ./dm/syncer -run TestCheckpointFlushWorkerSkipsCheckpointOnBeginError -count=1
The test sets the syncer's checkpoint flush worker execError to a downstream begin failure:
execError.Store(terror.ErrDBExecuteFailedBegin.Delegate(sql.ErrConnDone))
and then triggers a checkpoint flush. The test expects checkpoint flush to be skipped because the downstream DML transaction did not successfully begin, so pending DMLs may not be durable downstream.
Relevant code paths:
- dm/pkg/conn/baseconn.go: downstream BeginTx errors are wrapped as terror.ErrDBExecuteFailedBegin.
- dm/syncer/checkpoint_flush_worker.go: checkpoint flush is skipped only when execError matches terror.ErrDBExecuteFailed or terror.ErrDBUnExpect.
- dm/syncer/syncer.go: the sync/async checkpoint flush paths use the same skip predicate.
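The current guard can be sketched as follows. The sentinel errors are stand-ins for the real terror codes (which carry more structure than plain errors); the point is only that the predicate matches two codes and lets the BeginTx code through:

```go
package main

import (
	"errors"
	"fmt"
)

// Stand-ins for the terror codes named above; the real types live in
// dm/pkg/terror and are richer than plain sentinel errors.
var (
	errExecuteFailed      = errors.New("ErrDBExecuteFailed")
	errUnExpect           = errors.New("ErrDBUnExpect")
	errExecuteFailedBegin = errors.New("ErrDBExecuteFailedBegin")
)

// skipCheckpointFlush mirrors the current guard: only the two listed
// codes block a flush, so a downstream BeginTx failure slips through.
func skipCheckpointFlush(execErr error) bool {
	return errors.Is(execErr, errExecuteFailed) || errors.Is(execErr, errUnExpect)
}

func main() {
	// A downstream BeginTx failure does NOT block the flush today.
	fmt.Println(skipCheckpointFlush(errExecuteFailedBegin)) // false
}
```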
What did you expect to see?
Checkpoint flush should be skipped when a downstream SQL execution path fails in a way that means DMLs may not be durable downstream, including BeginTx failures such as sql.ErrConnDone (sql: connection is already closed).
After restart/resume, DM should replay from a checkpoint that is not beyond the unapplied DMLs.
What did you see instead?
The checkpoint flush worker still calls FlushPointsExcept after execError is set to terror.ErrDBExecuteFailedBegin.Delegate(sql.ErrConnDone).
The local regression test fails as follows:
--- FAIL: TestCheckpointFlushWorkerSkipsCheckpointOnBeginError (0.00s)
checkpoint_flush_worker_repro_test.go:129:
Error: Not equal:
expected: 0
actual : 1
Messages: checkpoint flush must be skipped after downstream BeginTx failure; flushing here can persist a checkpoint past non-durable DML
FAIL
FAIL github.com/pingcap/tiflow/dm/syncer 0.085s
This means a checkpoint can be persisted past DML jobs that failed before the downstream transaction was even created. On resume, DM can then start from the advanced checkpoint and skip those DMLs.
Impact
This is a data correctness risk for DM incremental replication. If the downstream connection is closed during BeginTx, some DMLs may not be applied to the downstream, but the checkpoint can still advance. After resume, the unapplied DMLs may not be replayed.
There is also a related diagnostic issue: judgeKeyNotFound can run after ExecuteSQL returns an error, which may produce misleading "no matching record" warnings for batches that did not actually execute successfully.
Suggested fix
- Treat terror.ErrDBExecuteFailedBegin as a checkpoint-blocking execution error in all checkpoint flush guards, together with terror.ErrDBExecuteFailed and terror.ErrDBUnExpect.
- Consider treating sql.ErrConnDone / sql: connection is already closed during BeginTx as retryable where safe.
- Avoid running key-not-found diagnostics after ExecuteSQL has already returned an execution error unless the execution result is valid.
- Add a regression test covering ErrDBExecuteFailedBegin(sql.ErrConnDone) and checkpoint flush behavior.
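The first suggestion amounts to widening the skip predicate. Sketched here with sentinel errors standing in for the real terror codes (the real guard matches terror types, not plain errors):

```go
package main

import (
	"errors"
	"fmt"
)

// Stand-ins for the terror codes; names follow this report, not the
// real dm/pkg/terror API.
var (
	errExecuteFailed      = errors.New("ErrDBExecuteFailed")
	errUnExpect           = errors.New("ErrDBUnExpect")
	errExecuteFailedBegin = errors.New("ErrDBExecuteFailedBegin")
)

// skipCheckpointFlush with the suggested fix: a downstream BeginTx
// failure also blocks checkpoint advancement.
func skipCheckpointFlush(execErr error) bool {
	return errors.Is(execErr, errExecuteFailed) ||
		errors.Is(execErr, errUnExpect) ||
		errors.Is(execErr, errExecuteFailedBegin)
}

func main() {
	fmt.Println(skipCheckpointFlush(errExecuteFailedBegin)) // true
}
```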
Versions of the cluster
DM version:
pingcap/tiflow upstream/master @ 43647d57a1d03070cd57344c4be9e4900d9e6cea
Upstream MySQL/MariaDB server version:
N/A for the minimal code-level reproduction
Downstream TiDB cluster version:
N/A for the minimal code-level reproduction
How did you deploy DM:
N/A for the minimal code-level reproduction
Other interesting information:
Observed when downstream transaction begin returned sql: connection is already closed.
Current status of DM cluster
N/A for the minimal code-level reproduction