test(dm): add MariaDB source smoke test and next-gen integration test #12599

Open
joechenrh wants to merge 35 commits into pingcap:master from joechenrh:mariadb-source-smoke-dm

@joechenrh joechenrh commented Apr 9, 2026

What problem does this PR solve?

Issue Number: close #12615

What is changed and how it works?

Enable DM integration tests to run on next-gen TiDB (Cloud Storage Engine edition) alongside classic TiDB. All 13 test groups (G00–G11 + TLS_GROUP) pass on next-gen CI.

  • Add MariaDB source smoke integration test case (mariadb_source) in G10
  • Add full next-gen cluster startup (MinIO + PD + TiKV + tikv-worker + SYSTEM TiDB)
  • Adapt test scripts for next-gen compatibility (see table below)
  • Simplify cluster lifecycle with shared functions in cluster_lib.sh
  • Fix pre-existing flaky tests (see below)

Infrastructure changes

  • cluster_lib.sh (new): centralizes cluster lifecycle ops — cleanup_tidb_server, cleanup_downstream_cluster, run_tidb_server, run_downstream_cluster, run_downstream_cluster_with_tls
  • run_downstream_cluster_nextgen (new): starts MinIO + PD + TiKV + tikv-worker + SYSTEM TiDB on port 4000 (tests use SYSTEM keyspace directly — no separate User keyspace TiDB needed)
  • run_downstream_cluster_with_tls_nextgen (new): restarts TiDB with client-facing TLS
  • run_downstream_cluster_classic (renamed from run_downstream_cluster): classic PD + TiKV + TiDB
  • env_variables: centralized next-gen vars (PD_ADDR, TIKV_WORKER_ADDR, KEYSPACE_NAME=SYSTEM, TIDB_EXTRA_ARGS) under NEXT_GEN=1 guard
  • run_tidb_server: unified startup — unistore/tikv via PD_ADDR, next-gen keyspace + tikv-worker-url + dxf_service config, TLS detection
  • cleanup_tidb_server: port-4000 targeted, removes temp-storage lock
  • cleanup_process: sequential dm-master kill (SIGHUP one-at-a-time, 30s timeout, SIGKILL escalation) to maintain etcd quorum; SIGKILL for workers (stuck in Lightning loads)
  • ha_cases_lib.sh: move print_debug_status from ha_cases2 (fix command-not-found in ha_cases3); serialize dmctl_start_task calls (parallel runs corrupt shared log file)
  • test_prepare: add shared normalize_session_block() function; guard cleanup_data against empty target_db
  • Don't set tidb_ddl_enable_fast_reorg=0 / tidb_enable_dist_task=0 on next-gen (breaks DXF-based DDL)
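
The sequential dm-master shutdown described for cleanup_process can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: stop_gracefully and stop_one_by_one are hypothetical names, and the 30s default only mirrors the timeout mentioned above.

```shell
# Hypothetical sketch; not the implementation in cluster_lib.sh.

# Send SIGHUP to one PID, wait up to $2 seconds for it to exit,
# then escalate to SIGKILL if it is still alive.
stop_gracefully() {
    local pid="$1" timeout="${2:-30}" waited=0
    kill -HUP "$pid" 2>/dev/null || return 0   # process is already gone
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$timeout" ]; then
            kill -KILL "$pid" 2>/dev/null || true
            break
        fi
        sleep 1
        waited=$((waited + 1))
    done
}

# Stop the given PIDs strictly one after another, so the remaining
# members can keep quorum while each one shuts down.
stop_one_by_one() {
    local timeout="$1" pid
    shift
    for pid in "$@"; do
        stop_gracefully "$pid" "$timeout"
    done
}
```

A caller would pass the timeout followed by the PIDs, for example `stop_one_by_one 30 "$master1_pid" "$master2_pid" "$master3_pid"` (variable names are placeholders).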

MariaDB source smoke test

  • New mariadb_source test case: full + incremental replication from MariaDB to TiDB
  • Added to G10 group (before import_into_mode to avoid TiDB cleanup conflict)
  • mariadb_source/run.sh: sets RESET_MASTER=false before sourcing test_prepare (MariaDB-only test)
  • db1.prepare.sql: explicit utf8mb4_general_ci collation — MariaDB 11.4+ defaults to utf8mb4_uca1400_ai_ci which TiDB does not support
  • run.sh: MariaDB health check and set_default_variables with graceful fallback (MariaDB sidecar may not be available in all CI pods, e.g. compatibility test)
  • CI sidecar: MariaDB 11.3 (1C/2G), added via PingCAP-QE/ci#4496 (feat(pipelines/pingcap/tiflow): add MariaDB source for DM integration test)

Note: CI must use MariaDB 11.3. MariaDB 11.4+ writes DML row events (Annotate_rows, Table_map, Write_rows_v1) with End_log_pos=0 instead of the correct next-event offset. DM relay relies on End_log_pos to advance the binlog position, so it gets stuck at the first DML event. Verified locally: 11.3 works; 11.4 and 12.2 reproduce the issue. This is a DM code issue and is tracked separately.

Test adaptations for next-gen (by group)

| Group | Test | Change |
|---|---|---|
| G02 | check_task | Replace GRANT ALL with specific privileges + CONFIG |
| G03 | dmctl_basic config diff | Session block normalization (next-gen omits tidb_txn_mode) |
| G05 | many_tables Phase 2 | import-into mode + existing MinIO instead of Lightning physical |
| G07 | shardddl1 DML merge | Threshold relaxed (>2 instead of >5) |
| G09 | openapi test_delete_task | cleanup_tidb_server (port-4000 targeted) |
| G10 | new_relay config export/import | Session normalization + patch before config import |
| G10 | new_relay / all_mode | cleanup_tidb_server instead of pkill tidb-server |
| G10 | import_into_mode | PID-targeted MinIO kill (preserve cluster MinIO) |
| G11 | sync_collation | Explicit COLLATE utf8_general_ci (next-gen defaults utf8 to utf8_bin) |
| G11 | sql_mode | Remove NO_AUTO_CREATE_USER (not in MySQL 8.0 / next-gen) |
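
The difference between a name-based kill (pkill) and the PID-targeted kill listed for import_into_mode can be illustrated with a small sketch. It is not taken from the test scripts: stop_by_pid is a hypothetical helper and `sleep` stands in for the real server binary.

```shell
# Hypothetical sketch: remember the PID of the process the test started,
# then stop exactly that process instead of matching every process by name.

# Stop one process by PID and reap it if it is a child of this shell.
stop_by_pid() {
    local pid="$1"
    kill "$pid" 2>/dev/null || true
    wait "$pid" 2>/dev/null || true
}

# The test-local instance runs in the background and its PID is saved.
sleep 300 &
test_local_pid=$!

# Later, only the saved PID is stopped; other instances keep running.
stop_by_pid "$test_local_pid"
```

Compared with matching on the process name, this leaves any other instance of the same binary (such as one that belongs to the shared cluster) untouched.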

Tests skipped on next-gen

| Group | Test | Reason |
|---|---|---|
| G09 | new_collation_off | Next-gen can't disable new collation framework |
| G09 | s3_dumpling_lightning | Lightning physical mode version gate (26.x > max 10.0.0) |
| G09 | openapi test_tls | Lightning assumes HTTPS on status port; needs cluster-ssl on PD/TiKV |

Flaky test fixes (pre-existing)

| Group | Test | Issue | Fix |
|---|---|---|---|
| G01 | ha_cases3_1 | Parallel dmctl_start_task & writes same log file when both run in the same second | Serialize calls |
| G06 | metrics | replication_lag_sum for worker2 is 0 because syncer hasn't processed events yet when metric is checked | Move check_sync_diff before metric checks (_sum is cumulative, order doesn't affect result) |
| G09 | openapi | delete_master_with_retry_success: first DELETE returns etcdserver: server stopped, retries get not exists (non-204) | Accept "not exists" as success |
| G09 | s3_dumpling_lightning | Checks data count before both sources finish Lightning physical import | Wait for both sources to enter Sync before checking |
| G10 | print_status | ((i++)) with i=0 under set -e returns exit code 1 | Use i=$((i + 1)) |
| G10 | import_into_mode | HA failover retry too tight (10×2s=20s) for re-dump + IMPORT INTO after keepalive-loss failover | Increase to 30 retries (60s) |
| G10 | all_mode | test_query_timeout threshold 10s too tight on loaded CI nodes | Relax to 30s |
| All | cleanup_process | Killing all 3 dm-masters simultaneously causes etcd quorum loss, blocking session.Close() Revoke up to 60s | Sequential kill (SIGHUP one-at-a-time with 30s timeout + SIGKILL escalation) |
| All | cleanup_process | dm-worker stuck in Lightning load, doesn't respond to SIGHUP | SIGKILL directly for workers |
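
The print_status fix depends on how bash reports the status of arithmetic commands: `(( expr ))` returns 1 when the expression evaluates to 0, and a post-increment yields the old value. A small sketch that reproduces this in isolated bash subprocesses (the function names are illustrative only):

```shell
# Post-increment from 0: the expression value is 0, so (( )) returns
# status 1 and `set -e` aborts before echo runs.
count_with_postincrement() {
    bash -c 'set -e; i=0; ((i++)); echo "$i"'
}

# Assignment with arithmetic expansion always returns status 0.
count_with_assignment() {
    bash -c 'set -e; i=0; i=$((i + 1)); echo "$i"'
}

count_with_postincrement || echo "aborted with status $?"   # prints: aborted with status 1
count_with_assignment                                        # prints: 1
```

The failure only appears on the first iteration, when the counter starts at 0; from 1 onward `((i++))` returns status 0.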

Check List

Tests

  • Integration test

Questions

Will it cause performance regression or break compatibility?

No. This only affects DM integration test infrastructure. No production code changes.

Do you need to update user documentation, design documentation or monitoring documentation?

No.

Release note

None

ti-chi-bot Bot commented Apr 9, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@ti-chi-bot ti-chi-bot Bot added the labels do-not-merge/work-in-progress (Indicates that a PR should not merge because it is a work in progress), do-not-merge/release-note-label-needed (Indicates that a PR should not merge because it's missing one of the release note labels), area/dm (Issues or PRs related to DM), size/L (Denotes a PR that changes 100-499 lines, ignoring generated files), and release-note-none (Denotes a PR that doesn't merit a release note), and removed the labels do-not-merge/needs-linked-issue and do-not-merge/release-note-label-needed, Apr 9, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces integration tests for MariaDB as a data source in DM. It includes environment variable updates, configuration files, test data, and a new test runner script. The main test entry point was also updated to conditionally manage MySQL and MariaDB services. Feedback focuses on improving test isolation and maintainability, specifically by disabling automatic master resets in MariaDB-only environments, ensuring consistent SQL modes and server variables for MariaDB, and using wildcard patterns for test case matching.

Comment thread dm/tests/mariadb_source/run.sh
Comment thread dm/tests/run.sh Outdated
Comment thread dm/tests/run.sh Outdated
Comment thread dm/tests/run.sh Outdated
@joechenrh
Contributor Author

/test ?

ti-chi-bot Bot commented Apr 10, 2026

@joechenrh: The following commands are available to trigger required jobs:

/test pull-build
/test pull-cdc-integration-kafka-test
/test pull-cdc-integration-mysql-test
/test pull-cdc-integration-pulsar-test
/test pull-cdc-integration-storage-test
/test pull-check
/test pull-dm-compatibility-test
/test pull-dm-integration-test
/test pull-error-log-review
/test pull-syncdiff-integration-test
/test pull-unit-test-cdc
/test pull-verify
/test wip-pull-unit-test-dm
/test wip-pull-unit-test-engine

The following commands are available to trigger optional jobs:

/test pull-dm-integration-test-next-gen

Use /test all to run the following jobs that were automatically triggered:

pingcap/tiflow/ghpr_verify
pingcap/tiflow/pull_dm_compatibility_test
pingcap/tiflow/pull_dm_integration_test
pingcap/tiflow/pull_dm_integration_test_next_gen
pingcap/tiflow/pull_syncdiff_integration_test
pull-build
pull-check
pull-error-log-review
pull-unit-test-cdc

@joechenrh
Contributor Author

/test pull-dm-integration-test-next-gen

@joechenrh joechenrh marked this pull request as ready for review April 10, 2026 06:17
@ti-chi-bot ti-chi-bot Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 10, 2026
@joechenrh
Contributor Author

/test pull-dm-integration-test-next-gen

@joechenrh
Contributor Author

/test pull-dm-integration-test-next-gen

3 similar comments

@joechenrh joechenrh force-pushed the mariadb-source-smoke-dm branch 3 times, most recently from 222df5d to 1907227 Compare April 14, 2026 06:23

codecov Bot commented Apr 14, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (master@9fbde6e). Learn more about missing BASE report.
⚠️ Report is 4 commits behind head on master.
✅ All tests successful. No failed tests found.

Additional details and impacted files
| Components | Coverage Δ |
|---|---|
| cdc | 57.3652% <ø> (?) |
| dm | 49.1598% <ø> (?) |
| engine | 50.7110% <ø> (?) |

| Flag | Coverage Δ |
|---|---|
| cdc | 57.3652% <ø> (?) |
| unit | 53.4109% <ø> (?) |

Flags with carried forward coverage won't be shown. Click here to find out more.

@@             Coverage Diff             @@
##             master     #12599   +/-   ##
===========================================
  Coverage          ?   53.4109%           
===========================================
  Files             ?       1011           
  Lines             ?     139975           
  Branches          ?          0           
===========================================
  Hits              ?      74762           
  Misses            ?      59592           
  Partials          ?       5621           

@ti-chi-bot ti-chi-bot Bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 14, 2026
@joechenrh joechenrh force-pushed the mariadb-source-smoke-dm branch from 9314f56 to 4293341 Compare April 15, 2026 09:07
@ti-chi-bot ti-chi-bot Bot removed the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Apr 15, 2026
@joechenrh
Contributor Author

/retest

2 similar comments

…king data

The test checked data count immediately after task start without
waiting for both sources to finish Lightning physical import. On
loaded CI nodes, one source's import may still be running, resulting
in partial data (e.g. 8 rows instead of 25).

Fix: wait for both sources to enter Sync mode (via query-status)
before inserting increment data and checking results. Also move
increment SQL inserts after the Sync wait to ensure clean separation
between full load and incremental replication.
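
The wait described in this commit message follows a common polling pattern, sketched below under stated assumptions. wait_until and all_sources_in_sync are hypothetical helpers, and the status file is a simplified stand-in for captured query-status output (one stage name per line), not the real dmctl output format.

```shell
# Retry a command until it succeeds or the attempts run out.
# Usage: wait_until <attempts> <interval-seconds> <command> [args...]
wait_until() {
    local attempts="$1" interval="$2" n=0
    shift 2
    while [ "$n" -lt "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        n=$((n + 1))
        sleep "$interval"
    done
    return 1
}

# Succeeds only if the file is non-empty and every line is exactly "Sync",
# i.e. no source is still in an earlier stage.
all_sources_in_sync() {
    local status_file="$1"
    [ -s "$status_file" ] && ! grep -qv '^Sync$' "$status_file"
}
```

A test would then call something like `wait_until 30 2 all_sources_in_sync "$status_file"` before inserting incremental data, refreshing the file on each attempt.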

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@joechenrh joechenrh force-pushed the mariadb-source-smoke-dm branch from 6d84d34 to 338d4d5 Compare April 23, 2026 06:56
@joechenrh
Contributor Author

/retest

2 similar comments

Task enters Sync/Running but data check fails. Add root/test user
queries and SHOW DATABASES to CI log for debugging.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@joechenrh joechenrh force-pushed the mariadb-source-smoke-dm branch from 9989242 to a1d83b3 Compare April 23, 2026 09:44
@joechenrh
Contributor Author

/retest

1 similar comment

start_multi_tasks_cluster ran two dmctl_start_task in parallel (&).
run_dm_ctl uses $workdir/dmctl.$ts.log where $ts is second-precision
timestamp. When both run in the same second, they write the same log
file, corrupting output and causing "result count mismatch" failures.

Run them sequentially instead.
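
A minimal sketch of why two calls in the same second collide. dmctl_log_path is a hypothetical stand-in for the naming scheme described above and is not the actual run_dm_ctl code; /tmp/dm_test is a placeholder directory.

```shell
# One log file per call, keyed only by a second-precision timestamp.
dmctl_log_path() {
    local workdir="$1" ts="$2"
    echo "$workdir/dmctl.$ts.log"
}

ts=$(date +%s)
first=$(dmctl_log_path /tmp/dm_test "$ts")
second=$(dmctl_log_path /tmp/dm_test "$ts")   # a parallel call in the same second
if [ "$first" = "$second" ]; then
    echo "both calls would write to $first"
fi
```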

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@joechenrh
Contributor Author

/retest

- db1.prepare.sql: explicit utf8mb4_general_ci collation for CREATE
  DATABASE. MariaDB 11.4+ defaults to utf8mb4_uca1400_ai_ci which
  TiDB does not support. Verified locally: 11.4 with explicit collation
  passes dump + load but fails at syncer (binlog position empty filename).
  CI stays on 11.3 where this is not needed, but the fix future-proofs
  for when DM adds 11.4+ support.
- Remove NEXT_GEN skip: MariaDB sidecar is now available in next-gen CI
  (PingCAP-QE/ci#4496).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@joechenrh
Contributor Author

/retest

Comment thread dm/tests/_utils/run_downstream_cluster_nextgen Outdated

ti-chi-bot Bot commented Apr 24, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: D3Hunter

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot Bot added the needs-1-more-lgtm Indicates a PR needs 1 more LGTM. label Apr 24, 2026

ti-chi-bot Bot commented Apr 24, 2026

[LGTM Timeline notifier]

Timeline:

  • 2026-04-24 07:39:54 UTC: ☑️ agreed by D3Hunter.

@ti-chi-bot ti-chi-bot Bot added the approved label Apr 24, 2026
Remove the separate User keyspace TiDB (port 4000) and let SYSTEM TiDB
serve as the downstream directly. DM tests don't need keyspace
isolation — SYSTEM keyspace works as a normal TiDB.

- env_variables: KEYSPACE_NAME=SYSTEM (was dm_test), remove
  TIDB_SYSTEM_PORT/TIDB_SYSTEM_STATUS_PORT
- run_downstream_cluster_nextgen: start SYSTEM TiDB on port 4000 via
  run_tidb_server with dxf_service config, remove separate User TiDB
- cluster_lib.sh: update comments
- run_downstream_cluster_with_tls_nextgen: update comments

DXF (tidb_service_scope=dxf_service) is only needed for IMPORT INTO
and ADD INDEX. Tests that kill-restart TiDB to verify DM auto-resume
operate in Sync mode which doesn't use DXF.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@joechenrh joechenrh force-pushed the mariadb-source-smoke-dm branch from 7d49868 to 4f45cad Compare April 24, 2026 09:43
@joechenrh
Contributor Author

/retest

joechenrh and others added 2 commits April 27, 2026 01:43
After killing and restarting SYSTEM TiDB via run_tidb_server (e.g.
many_tables Phase 2), the restarted TiDB lacked tidb_service_scope=
dxf_service. IMPORT INTO tasks then found no DXF node and imported
0 rows.

Fix: run_tidb_server on next-gen always writes [instance]
tidb_service_scope="dxf_service" alongside keyspace-name and
tikv-worker-url. This makes every TiDB restart DXF-capable.

Also remove the now-redundant tidb-system.toml from
run_downstream_cluster_nextgen.
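
As a rough illustration of the fix, a startup helper could regenerate the config on every start so that a restarted TiDB never loses the DXF scope. This is a sketch only: write_nextgen_tidb_config is a hypothetical name, the key names are the ones mentioned above, and the placement of tikv-worker-url at the top level of the file is an assumption.

```shell
# Hypothetical helper: write a next-gen TiDB config that always carries
# the keyspace, the tikv-worker URL and the dxf_service scope.
write_nextgen_tidb_config() {
    local path="$1" keyspace="$2" worker_url="$3"
    cat >"$path" <<EOF
keyspace-name = "$keyspace"
tikv-worker-url = "$worker_url"

[instance]
tidb_service_scope = "dxf_service"
EOF
}
```

For example, `write_nextgen_tidb_config "$WORK_DIR/tidb.toml" "$KEYSPACE_NAME" "$TIKV_WORKER_ADDR"` would be called before each start (the WORK_DIR path is a placeholder).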

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
check_metric for replication_lag_sum failed because worker2's syncer
hadn't processed any events yet (metric=0). The log-based check
([ShowLagInLog]) passed but the Prometheus metric endpoint lagged
behind.

Fix: move check_sync_diff before the metric checks. Once sync_diff
passes, both syncers have processed events and updated their lag
counters.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@joechenrh
Contributor Author

/test ?

ti-chi-bot Bot commented May 8, 2026

@joechenrh: The following commands are available to trigger required jobs:

/test pull-build
/test pull-cdc-integration-kafka-test
/test pull-cdc-integration-mysql-test
/test pull-cdc-integration-pulsar-test
/test pull-cdc-integration-storage-test
/test pull-check
/test pull-dm-compatibility-test
/test pull-dm-integration-test
/test pull-error-log-review
/test pull-syncdiff-integration-test
/test pull-unit-test-cdc
/test pull-verify
/test wip-pull-unit-test-dm
/test wip-pull-unit-test-engine

The following commands are available to trigger optional jobs:

/test pull-dm-integration-test-next-gen

Use /test all to run the following jobs that were automatically triggered:

pingcap/tiflow/ghpr_verify
pingcap/tiflow/pull_cdc_integration_kafka_test
pingcap/tiflow/pull_cdc_integration_pulsar_test
pingcap/tiflow/pull_cdc_integration_storage_test
pingcap/tiflow/pull_cdc_integration_test
pingcap/tiflow/pull_dm_compatibility_test
pingcap/tiflow/pull_dm_integration_test
pingcap/tiflow/pull_dm_integration_test_next_gen
pingcap/tiflow/pull_syncdiff_integration_test
pull-build
pull-check
pull-error-log-review
pull-unit-test-cdc

@joechenrh
Contributor Author

/test pull-dm-integration-test

Labels

approved area/dm Issues or PRs related to DM. needs-1-more-lgtm Indicates a PR needs 1 more LGTM. release-note-none Denotes a PR that doesn't merit a release note. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Development

Successfully merging this pull request may close these issues.

add MariaDB source smoke test and next-gen integration test support

2 participants