test(dm): add MariaDB source smoke test and next-gen integration test #12599
joechenrh wants to merge 35 commits
Code Review
This pull request introduces integration tests for MariaDB as a data source in DM. It includes environment variable updates, configuration files, test data, and a new test runner script. The main test entry point was also updated to conditionally manage MySQL and MariaDB services. Feedback focuses on improving test isolation and maintainability, specifically by disabling automatic master resets in MariaDB-only environments, ensuring consistent SQL modes and server variables for MariaDB, and using wildcard patterns for test case matching.
/test pull-dm-integration-test-next-gen
222df5d to 1907227
Codecov Report
✅ All modified and coverable lines are covered by tests.

@@            Coverage Diff            @@
##             master    #12599   +/-   ##
===========================================
  Coverage          ?   53.4109%
===========================================
  Files             ?       1011
  Lines             ?     139975
  Branches          ?          0
===========================================
  Hits              ?      74762
  Misses            ?      59592
  Partials          ?       5621
9314f56 to 4293341
/retest
…king data

The test checked data count immediately after task start without waiting for both sources to finish Lightning physical import. On loaded CI nodes, one source's import may still be running, resulting in partial data (e.g. 8 rows instead of 25).

Fix: wait for both sources to enter Sync mode (via query-status) before inserting increment data and checking results. Also move increment SQL inserts after the Sync wait to ensure clean separation between full load and incremental replication.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
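The wait described in this commit can be sketched as a bounded polling loop. This is only an illustration under stated assumptions, not the helper the PR actually uses: `wait_until` and `all_sources_in_sync` are hypothetical names, and the stub below stands in for inspecting real query-status output.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: poll a condition until it holds or retries run out.

# wait_until <max_retries> <interval_seconds> <command...>
wait_until() {
	local max_retries=$1 interval=$2
	shift 2
	local attempt=0
	while [ "$attempt" -lt "$max_retries" ]; do
		if "$@"; then
			return 0
		fi
		attempt=$((attempt + 1))
		sleep "$interval"
	done
	return 1
}

# Stub standing in for a real status check (assumption): it succeeds only
# on the third poll, imitating a source that finishes its full load late.
poll_count=0
all_sources_in_sync() {
	poll_count=$((poll_count + 1))
	[ "$poll_count" -ge 3 ]
}

if wait_until 10 0 all_sources_in_sync; then
	wait_result="synced after $poll_count polls"
else
	wait_result="timed out"
fi
echo "$wait_result"
```

Incremental inserts and data checks would go only after the loop reports success, which keeps the full load and incremental phases separate.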
6d84d34 to 338d4d5
Task enters Sync/Running but data check fails. Add root/test user queries and SHOW DATABASES to CI log for debugging.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
9989242 to a1d83b3
start_multi_tasks_cluster ran two dmctl_start_task in parallel (&). run_dm_ctl uses $workdir/dmctl.$ts.log where $ts is second-precision timestamp. When both run in the same second, they write the same log file, corrupting output and causing "result count mismatch" failures.

Run them sequentially instead.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
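The collision is easy to reproduce in isolation. The sketch below is hypothetical: `log_path_for` only mirrors the `$workdir/dmctl.$ts.log` naming pattern quoted in the commit message and is not the real run_dm_ctl code, and the overwrite-on-write behaviour is an assumption made for the demonstration.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: a log path that depends only on a second-precision
# timestamp is identical for two calls made within the same second.

workdir=$(mktemp -d)

# log_path_for <timestamp>: mirrors the $workdir/dmctl.$ts.log pattern.
log_path_for() {
	echo "$workdir/dmctl.$1.log"
}

ts=$(date +%s)                  # second precision
first=$(log_path_for "$ts")
second=$(log_path_for "$ts")    # a parallel call in the same second

if [ "$first" = "$second" ]; then
	collision="yes"
else
	collision="no"
fi
echo "same log file for both calls: $collision"

# Run sequentially: each call writes and reads its log before the next
# one starts, so the shared name no longer mixes output (assumed model).
for task in task-a task-b; do
	echo "start-task $task" >"$(log_path_for "$ts")"
	last_logged=$(cat "$(log_path_for "$ts")")
done
echo "$last_logged"

rm -rf "$workdir"
```

Serializing the calls, as the commit does, sidesteps the shared name without changing the naming scheme; a unique suffix per call would be an alternative design, though the PR does not take that route.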
- db1.prepare.sql: explicit utf8mb4_general_ci collation for CREATE DATABASE. MariaDB 11.4+ defaults to utf8mb4_uca1400_ai_ci which TiDB does not support. Verified locally: 11.4 with explicit collation passes dump + load but fails at syncer (binlog position empty filename). CI stays on 11.3 where this is not needed, but the fix future-proofs for when DM adds 11.4+ support.
- Remove NEXT_GEN skip: MariaDB sidecar is now available in next-gen CI (PingCAP-QE/ci#4496).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove the separate User keyspace TiDB (port 4000) and let SYSTEM TiDB serve as the downstream directly. DM tests don't need keyspace isolation; SYSTEM keyspace works as a normal TiDB.

- env_variables: KEYSPACE_NAME=SYSTEM (was dm_test), remove TIDB_SYSTEM_PORT/TIDB_SYSTEM_STATUS_PORT
- run_downstream_cluster_nextgen: start SYSTEM TiDB on port 4000 via run_tidb_server with dxf_service config, remove separate User TiDB
- cluster_lib.sh: update comments
- run_downstream_cluster_with_tls_nextgen: update comments

DXF (tidb_service_scope=dxf_service) is only needed for IMPORT INTO and ADD INDEX. Tests that kill-restart TiDB to verify DM auto-resume operate in Sync mode, which doesn't use DXF.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
7d49868 to 4f45cad
After killing and restarting SYSTEM TiDB via run_tidb_server (e.g. many_tables Phase 2), the restarted TiDB lacked tidb_service_scope=dxf_service. IMPORT INTO tasks then found no DXF node and imported 0 rows.

Fix: run_tidb_server on next-gen always writes [instance] tidb_service_scope="dxf_service" alongside keyspace-name and tikv-worker-url. This makes every TiDB restart DXF-capable.

Also remove the now-redundant tidb-system.toml from run_downstream_cluster_nextgen.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
check_metric for replication_lag_sum failed because worker2's syncer hadn't processed any events yet (metric=0). The log-based check ([ShowLagInLog]) passed but the Prometheus metric endpoint lagged behind.

Fix: move check_sync_diff before the metric checks. Once sync_diff passes, both syncers have processed events and updated their lag counters.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
/test pull-dm-integration-test |
What problem does this PR solve?
Issue Number: close #12615
What is changed and how it works?
Enable DM integration tests to run on next-gen TiDB (Cloud Storage Engine edition) alongside classic TiDB. All 13 test groups (G00–G11 + TLS_GROUP) pass on next-gen CI.
mariadb_source) in G10
cluster_lib.sh

Infrastructure changes
- cluster_lib.sh (new): centralizes cluster lifecycle ops (cleanup_tidb_server, cleanup_downstream_cluster, run_tidb_server, run_downstream_cluster, run_downstream_cluster_with_tls)
- run_downstream_cluster_nextgen (new): starts MinIO + PD + TiKV + tikv-worker + SYSTEM TiDB on port 4000 (tests use SYSTEM keyspace directly; no separate User keyspace TiDB needed)
- run_downstream_cluster_with_tls_nextgen (new): restarts TiDB with client-facing TLS
- run_downstream_cluster_classic (renamed from run_downstream_cluster): classic PD + TiKV + TiDB
- env_variables: centralized next-gen vars (PD_ADDR, TIKV_WORKER_ADDR, KEYSPACE_NAME=SYSTEM, TIDB_EXTRA_ARGS) under NEXT_GEN=1 guard
- run_tidb_server: unified startup: unistore/tikv via PD_ADDR, next-gen keyspace + tikv-worker-url + dxf_service config, TLS detection
- cleanup_tidb_server: port-4000 targeted, removes temp-storage lock
- cleanup_process: sequential dm-master kill (SIGHUP one-at-a-time, 30s timeout, SIGKILL escalation) to maintain etcd quorum; SIGKILL for workers (stuck in Lightning loads)
- ha_cases_lib.sh: move print_debug_status from ha_cases2 (fix command-not-found in ha_cases3); serialize dmctl_start_task calls (parallel runs corrupt shared log file)
- test_prepare: add shared normalize_session_block() function; guard cleanup_data against empty target_db
- tidb_ddl_enable_fast_reorg=0 / tidb_enable_dist_task=0 on next-gen (breaks DXF-based DDL)

MariaDB source smoke test
- mariadb_source test case: full + incremental replication from MariaDB to TiDB
- import_into_mode to avoid TiDB cleanup conflict)
- mariadb_source/run.sh: sets RESET_MASTER=false before sourcing test_prepare (MariaDB-only test)
- db1.prepare.sql: explicit utf8mb4_general_ci collation; MariaDB 11.4+ defaults to utf8mb4_uca1400_ai_ci, which TiDB does not support
- run.sh: MariaDB health check and set_default_variables with graceful fallback (MariaDB sidecar may not be available in all CI pods, e.g. compatibility test)

Note: CI must use MariaDB 11.3. MariaDB 11.4+ changed DML row events (Annotate_rows, Table_map, Write_rows_v1) to set End_log_pos=0 instead of the correct next-event offset. DM relay relies on End_log_pos to advance the binlog position, causing it to get stuck at the first DML event. Verified locally: 11.3 works, 11.4/12.2 reproduce the issue. This is a DM code issue tracked separately.

Test adaptations for next-gen (by group)
Tests skipped on next-gen
Flaky test fixes (pre-existing)
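One of the fixes in this section replaces `((i++))` with `i=$((i + 1))`. The reason is a bash detail: `(( expr ))` returns exit status 1 when the expression evaluates to 0, and post-increment yields the old value, so the first increment from 0 aborts a script running under `set -e`. The snippet below is a standalone illustration rather than code from the PR; it runs each variant in a child bash so that `set -e` takes effect there.

```shell
#!/usr/bin/env bash
# ((i++)) with i=0 evaluates to 0, so the command's exit status is 1 and
# set -e stops the child shell before the echo runs.
unsafe_status=0
unsafe_output=$(bash -c 'set -e; i=0; ((i++)); echo "reached i=$i"') || unsafe_status=$?

# A plain assignment with arithmetic expansion always returns status 0,
# so the child shell continues past it under set -e.
safe_status=0
safe_output=$(bash -c 'set -e; i=0; i=$((i + 1)); echo "reached i=$i"') || safe_status=$?

echo "((i++))      -> status=$unsafe_status output='$unsafe_output'"
echo "i=\$((i + 1)) -> status=$safe_status output='$safe_output'"
```

The failure appears only when the counter starts at 0, which may explain why such a loop can look correct in some scripts and break in others.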
- dmctl_start_task & writes same log file when both run in the same second
- replication_lag_sum for worker2 is 0 because syncer hasn't processed events yet when metric is checked
- check_sync_diff before metric checks (_sum is cumulative, order doesn't affect result)
- delete_master_with_retry_success: first DELETE returns etcdserver: server stopped, retries get not exists (non-204)
- ((i++)) with i=0 under set -e returns exit code 1
- i=$((i + 1))
- test_query_timeout threshold 10s too tight on loaded CI nodes
- session.Close() Revoke up to 60s

Check List
Tests
Questions
Will it cause performance regression or break compatibility?
No. This only affects DM integration test infrastructure. No production code changes.
Do you need to update user documentation, design documentation or monitoring documentation?
No.
Release note