What did you do?
While investigating a DM incremental replication consistency issue, I found that checkpoint flushing does not treat downstream BeginTx failures as execution errors that should block checkpoint advancement.
A minimal local reproduction can be added on top of pingcap/tiflow upstream/master commit 43647d57a1d03070cd57344c4be9e4900d9e6cea:
go test ./dm/syncer -run TestCheckpointFlushWorkerSkipsCheckpointOnBeginError -count=1
The test sets the syncer's checkpoint flush worker execError to a downstream begin failure:
execError.Store(terror.ErrDBExecuteFailedBegin.Delegate(sql.ErrConnDone))
and then triggers a checkpoint flush. The test expects checkpoint flush to be skipped because the downstream DML transaction did not successfully begin, so pending DMLs may not be durable downstream.
Relevant code paths:
- dm/pkg/conn/baseconn.go: downstream BeginTx errors are wrapped as terror.ErrDBExecuteFailedBegin.
- dm/syncer/checkpoint_flush_worker.go: checkpoint flush is skipped only when execError matches terror.ErrDBExecuteFailed or terror.ErrDBUnExpect.
- dm/syncer/syncer.go: the sync/async checkpoint flush paths use the same skip predicate.
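The current guard can be sketched as follows. The sentinel errors are stand-ins for the real terror codes (which carry more structure than plain errors); the point is only that the predicate matches two codes and lets the BeginTx code through:

```go
package main

import (
	"errors"
	"fmt"
)

// Stand-ins for the terror codes named above; the real types live in
// dm/pkg/terror and are richer than plain sentinel errors.
var (
	errExecuteFailed      = errors.New("ErrDBExecuteFailed")
	errUnExpect           = errors.New("ErrDBUnExpect")
	errExecuteFailedBegin = errors.New("ErrDBExecuteFailedBegin")
)

// skipCheckpointFlush mirrors the current guard: only the two listed
// codes block a flush, so a downstream BeginTx failure slips through.
func skipCheckpointFlush(execErr error) bool {
	return errors.Is(execErr, errExecuteFailed) || errors.Is(execErr, errUnExpect)
}

func main() {
	// A downstream BeginTx failure does NOT block the flush today.
	fmt.Println(skipCheckpointFlush(errExecuteFailedBegin)) // false
}
```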
What did you expect to see?
Checkpoint flush should be skipped when a downstream SQL execution path fails in a way that means DMLs may not be durable downstream, including BeginTx failures such as sql.ErrConnDone (sql: connection is already closed).
After restart/resume, DM should replay from a checkpoint that is not beyond the unapplied DMLs.
What did you see instead?
The checkpoint flush worker still calls FlushPointsExcept after execError is set to terror.ErrDBExecuteFailedBegin.Delegate(sql.ErrConnDone).
The local regression test fails as follows:
--- FAIL: TestCheckpointFlushWorkerSkipsCheckpointOnBeginError (0.00s)
checkpoint_flush_worker_repro_test.go:129:
Error: Not equal:
expected: 0
actual : 1
Messages: checkpoint flush must be skipped after downstream BeginTx failure; flushing here can persist a checkpoint past non-durable DML
FAIL
FAIL github.com/pingcap/tiflow/dm/syncer 0.085s
This means a checkpoint can be persisted past DML jobs that failed before the downstream transaction was even created. On resume, DM can then start from the advanced checkpoint and skip those DMLs.
Impact
This is a data correctness risk for DM incremental replication. If the downstream connection is closed during BeginTx, some DMLs may not be applied to the downstream, but the checkpoint can still advance. After resume, the unapplied DMLs may not be replayed.
There is also a related diagnostic issue: judgeKeyNotFound can run after ExecuteSQL returns an error, which may produce misleading "no matching record" warnings for batches that did not actually execute successfully.
Suggested fix
- Treat terror.ErrDBExecuteFailedBegin as a checkpoint-blocking execution error in all checkpoint flush guards, together with terror.ErrDBExecuteFailed and terror.ErrDBUnExpect.
- Consider treating sql.ErrConnDone / sql: connection is already closed during BeginTx as retryable where safe.
- Avoid running key-not-found diagnostics after ExecuteSQL has already returned an execution error unless the execution result is valid.
- Add a regression test covering ErrDBExecuteFailedBegin(sql.ErrConnDone) and checkpoint flush behavior.
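The first suggestion amounts to widening the skip predicate. Sketched here with sentinel errors standing in for the real terror codes (the real guard matches terror types, not plain errors):

```go
package main

import (
	"errors"
	"fmt"
)

// Stand-ins for the terror codes; names follow this report, not the
// real dm/pkg/terror API.
var (
	errExecuteFailed      = errors.New("ErrDBExecuteFailed")
	errUnExpect           = errors.New("ErrDBUnExpect")
	errExecuteFailedBegin = errors.New("ErrDBExecuteFailedBegin")
)

// skipCheckpointFlush with the suggested fix: a downstream BeginTx
// failure also blocks checkpoint advancement.
func skipCheckpointFlush(execErr error) bool {
	return errors.Is(execErr, errExecuteFailed) ||
		errors.Is(execErr, errUnExpect) ||
		errors.Is(execErr, errExecuteFailedBegin)
}

func main() {
	fmt.Println(skipCheckpointFlush(errExecuteFailedBegin)) // true
}
```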
Versions of the cluster
DM version:
pingcap/tiflow upstream/master @ 43647d57a1d03070cd57344c4be9e4900d9e6cea
Upstream MySQL/MariaDB server version:
N/A for the minimal code-level reproduction
Downstream TiDB cluster version:
N/A for the minimal code-level reproduction
How did you deploy DM:
N/A for the minimal code-level reproduction
Other interesting information:
Observed when downstream transaction begin returned sql: connection is already closed.
Current status of DM cluster
N/A for the minimal code-level reproduction