From bc7fd5c8b7d9cccf2a856fc43ea7dea3f9c66ee8 Mon Sep 17 00:00:00 2001 From: hanjianqiao Date: Wed, 20 May 2026 06:47:07 +0000 Subject: [PATCH] update doc --- README.md | 7 +++++-- doc/docs/en/architecture/ddl_replication.md | 2 +- .../non_native_datatype_handling.md | 4 ++-- doc/docs/en/getting-started/configuration.md | 6 +++--- doc/docs/en/getting-started/installation.md | 2 +- .../en/tutorial/mysql_cdc_to_postgresql.md | 4 ++-- .../tutorial/native_olr_cdc_to_postgresql.md | 8 ++++---- .../tutorial/postgresql_cdc_to_postgresql.md | 20 +++++++++---------- .../tutorial/sqlserver_cdc_to_postgresql.md | 6 +++--- .../en/user-guide/object_mapping_rules.md | 2 +- 10 files changed, 32 insertions(+), 29 deletions(-) diff --git a/README.md b/README.md index 918deb3..ca5690c 100644 --- a/README.md +++ b/README.md @@ -28,12 +28,12 @@ SynchDB extension consists of these major components: ## Build Requirement The following software is required to build and run SynchDB. The versions listed are the versions tested during development. Older versions may still work. +* Unix based operating system like Ubuntu 22.04 or MacOS * Java Development Kit 17 or later. Download [here](https://www.oracle.com/ca-en/java/technologies/downloads/) * Apache Maven 3.6.3 or later. Download [here](https://maven.apache.org/download.cgi) * PostgreSQL source or build environment. Git clone [here](https://github.com/postgres/postgres). Refer to this [wiki](https://wiki.postgresql.org/wiki/Compile_and_Install_from_source_code) to build PostgreSQL from source or this [page](https://www.postgresql.org/download/linux/) to install PostgreSQL via packages + * If PostgreSQL is installed via a package manager, the corresponding devel package needs to be installed as well. * Docker compose 2.28.1 (for testing). 
Refer to [here](https://docs.docker.com/compose/install/linux/) -* Unix based operating system like Ubuntu 22.04 or MacOS - **The following is required if Openlog Replicator Connector is enabled in build** * libprotobuf-c v1.5.2. Refer to [here](https://github.com/protobuf-c/protobuf-c.git) to build from source. @@ -50,6 +50,9 @@ If you already have PostgreSQL installed, you can build and install Default Sync ``` BASH USE_PGXS=1 make PG_CONFIG=$(which pg_config) + +# Using Maven to build Debezium +export PATH=${YOUR_MAVEN_PATH}/bin/:$PATH USE_PGXS=1 make build_dbz PG_CONFIG=$(which pg_config) sudo USE_PGXS=1 make PG_CONFIG=$(which pg_config) install diff --git a/doc/docs/en/architecture/ddl_replication.md b/doc/docs/en/architecture/ddl_replication.md index 840e083..ebee3b0 100644 --- a/doc/docs/en/architecture/ddl_replication.md +++ b/doc/docs/en/architecture/ddl_replication.md @@ -85,7 +85,7 @@ Supported modifications: Other properties that can be specified during ALTER TABLE ALTER COLUMN are not supported at the moment. -Please note that SynchDB only supports basic data type change on an existing column. For example, `INT` → `BIGINT` or `VARCHAR` → `TEXT`. Complex data type changes such as `TEXT` → `INT` or `INT` → `TIMESTAMP` are not currently supported. This is because PostgreSQL requires the user to additioanlly supply a type casting function to perform the type casting as the result of complex data type change. SynchDB currently has to knowledge what type casting functions to use for specific type conversion. In the future, We may allow user to supply his or her own casting functions to use for specific type conversions via the rule file, but for now, it is not supported. +Please note that SynchDB only supports basic data type change on an existing column. For example, `INT` → `BIGINT` or `VARCHAR` → `TEXT`. Complex data type changes such as `TEXT` → `INT` or `INT` → `TIMESTAMP` are not currently supported. 
This is because PostgreSQL requires the user to additionally supply a type casting function to perform a complex data type change. SynchDB currently has limited knowledge of which type casting functions to use for a specific type conversion. In the future, we may allow users to supply their own casting functions for specific type conversions via the rule file, but for now this is not supported. ## **Database-Specific Behavior** diff --git a/doc/docs/en/architecture/non_native_datatype_handling.md b/doc/docs/en/architecture/non_native_datatype_handling.md index 189fc44..7d9c59f 100644 --- a/doc/docs/en/architecture/non_native_datatype_handling.md +++ b/doc/docs/en/architecture/non_native_datatype_handling.md @@ -1,4 +1,4 @@ -# None-native Data Type Handling +# Non-native Data Type Handling ## **Handling Non-Native Data Types** @@ -24,4 +24,4 @@ It is possible that a table contains a column data type that is custom created b ``` -The category tells DML Converter about the nature of the data type (numeric? string? datetime? ...etc) to help the converter select the right routine to process. For most cases, using type category paired with the DBZ metadata that describes how the input data payload is formatted is sufficient to select the right routine to process the data. However, in some cases, it may not be sufficient. For example, custom DATE, TIME, TIMESTAMP date types could all be categorized under `TYPCATEGORY_DATETIME`, so the converter does not know if it is working with a DATE, TIME or TIMESTAMP as each would produce different time formats. Currently, the covnerter looks for certain keywords from the data type name to identify. In the future, we may expose this part to let the user tell the converter exactly which routine to use should there be an ambiguity. Another example would be `TYPCATEGORY_USER` and `TYPCATEGORY_GEOMETRIC` which does not clearly indicate the data format. 
For these categories, the converter currently does not perform any further processing as it simply leaves the data payload as is. PostgreSQL may or may not reject such unprocessed data. This is why the transform feature next is important to give the DML converter a final chance to correct its data payload. \ No newline at end of file +The category tells the DML Converter about the nature of the data type (numeric? string? datetime? ...etc) to help the converter select the right routine to process. For most cases, using type category paired with the DBZ metadata that describes how the input data payload is formatted is sufficient to select the right routine to process the data. However, in some cases, it may not be sufficient. For example, custom DATE, TIME, TIMESTAMP data types could all be categorized under `TYPCATEGORY_DATETIME`, so the converter does not know if it is working with a DATE, TIME or TIMESTAMP as each would produce different time formats. Currently, the converter looks for certain keywords in the data type name to identify which one it is. In the future, we may expose this part to let the user tell the converter exactly which routine to use should there be an ambiguity. Another example would be `TYPCATEGORY_USER` and `TYPCATEGORY_GEOMETRIC` which do not clearly indicate the data format. For these categories, the converter currently does not perform any further processing as it simply leaves the data payload as is. PostgreSQL may or may not reject such unprocessed data. This is why the transform feature next is important to give the DML converter a final chance to correct its data payload. 
\ No newline at end of file diff --git a/doc/docs/en/getting-started/configuration.md b/doc/docs/en/getting-started/configuration.md index 6553997..0f25b68 100644 --- a/doc/docs/en/getting-started/configuration.md +++ b/doc/docs/en/getting-started/configuration.md @@ -3,7 +3,7 @@ weight: 40 --- # SynchDB Configuration -SynchDB supports the following GUC variables in postgresql.conf. These are common parameters that apply the all connectors managed by SynchDB: +SynchDB supports the following GUC variables in postgresql.conf. These are common parameters that apply to all connectors managed by SynchDB: | GUC Variable| Type | Default Value | Description | |-|-|-|-| @@ -14,7 +14,7 @@ SynchDB supports the following GUC variables in postgresql.conf. These are commo | synchdb.dbz_queue_size | integer | 8192 | The maximum size (measured in number of change events) of Debezium embedded engine's change event queue. It should be set at least twice of `synchdb.dbz_batch_size` | | synchdb.dbz_connect_timeout_ms | integer | 30000 | The timeout value in milliseconds for Debezium embedded engine to established an initial connection to a remote database | | synchdb.dbz_query_timeout_ms | integer | 600000 | The timeout value in milliseconds for Debezium embedded engine to execute a query on a remote database | -| synchdb.dbz_skipped_oeprations | string | "t" | A comma-separated list of operations Debezium shall skip when processing change events. "c" is for inserts, "u" is for updates, "d" is for deletes, "t" is for truncates | +| synchdb.dbz_skipped_operations | string | "t" | A comma-separated list of operations Debezium shall skip when processing change events. "c" is for inserts, "u" is for updates, "d" is for deletes, "t" is for truncates | | synchdb.jvm_max_heap_size | integer | 1024 | The maximum heap size in MB to be allocated to Java Virtual Machine (JVM) when starting a connector. 
| | synchdb.dbz_snapshot_thread_num | integer | 2 | The number of threads Debezium embedded connector should spawn during initial snapshot. Please note that according to Debezium, multi-threaded snapshot is an `incubating feature` | | synchdb.dbz_snapshot_fetch_size | integer | 0 | The number of rows Debezium embedded connector should fetch at a time during initial snapshot. Set it to 0 to let the engine choose automatically | @@ -30,7 +30,7 @@ SynchDB supports the following GUC variables in postgresql.conf. These are commo | synchdb.jvm_max_direct_buffer_size | integer | 1024 | The maximum direct buffer size in MB to be allocated to hold JSON change events | | synchdb.dbz_logminer_stream_mode | enum | "uncommitted" | The streaming mode for Debezium based Oracle connector. The default is uncommitted, which means all the changes streamed from Oracle via Debezium is uncommitted. This indicates Debezium has to do some work to ensure the integrity of transactions and all associated changes. Setting to "committed" shifts this work on Oralce side | | synchdb.olr_connect_timeout_ms | integer | 5000 | (affects OLR connector only) the connect timeout in milliseconds when connecting to openlog replicator service | -| synchdb.olr_read_timeout_m | integer | 5000 | (affects OLR connector only) the read timeout in milliseconds when reading from a socket | +| synchdb.olr_read_timeout_ms | integer | 5000 | (affects OLR connector only) the read timeout in milliseconds when reading from a socket | | synchdb.olr_snapshot_engine | enum | "debezium" | the underlining engine to complete the initial snapshot process. Could be "debezium" or "fdw". If "fdw" is selected, you need to ensure the corresponding FDW is installed prior. For example, for Oracle connector, ensure "oracle_fdw" is preinstalled. | | synchdb.cdc_start_delay_ms | integer | 0 | a delay waited after initial snapshot completes and before CDC streaming begins. 
| | synchdb.fdw_migrate_with_subtx | boolean | true | option to use sub transactions to migrate a table during FDW based snapshot | diff --git a/doc/docs/en/getting-started/installation.md b/doc/docs/en/getting-started/installation.md index 72f5273..183355e 100644 --- a/doc/docs/en/getting-started/installation.md +++ b/doc/docs/en/getting-started/installation.md @@ -206,7 +206,7 @@ JDK_LIB_PATH=${JDK_HOME_PATH}/lib echo $JDK_LIB_PATH echo $JDK_LIB_PATH/server -sudo echo "$JDK_LIB_PATH" | sudo tee -a /etc/ld.so.conf.d/x86_64-linux-gnu.conf +sudo echo "$JDK_LIB_PATH" | sudo tee -a /etc/ld.so.conf.d/x86_64-linux-gnu.conf sudo echo "$JDK_LIB_PATH/server" | sudo tee -a /etc/ld.so.conf.d/x86_64-linux-gnu.conf ``` Note, for mac with M1/M2 chips, the linker file is located in /etc/ld.so.conf.d/aarch64-linux-gnu.conf diff --git a/doc/docs/en/tutorial/mysql_cdc_to_postgresql.md b/doc/docs/en/tutorial/mysql_cdc_to_postgresql.md index 8f9cd74..10e794b 100644 --- a/doc/docs/en/tutorial/mysql_cdc_to_postgresql.md +++ b/doc/docs/en/tutorial/mysql_cdc_to_postgresql.md @@ -16,7 +16,7 @@ SELECT synchdb_add_conninfo( ## **Initial Snapshot** "Initial snapshot" (or table snapshot) in SynchDB means to copy table schema plus initial data for all designated tables. This is similar to the term "table sync" in PostgreSQL logical replication. When a connector is started using the default `initial` mode, it will automatically perform the initial snapshot before going to Change Data Capture (CDC) stage. This can be omitted entirely with mode `never` or partially omitted with mode `no_data`. See [here](../../user-guide/start_stop_connector/) for all snapshot options. -Once the initial snapshot is completed, the connector will not do it again upon subsequent restarts and will just resume with CDC since the last incomplete offset. This behavior is controled by the metadata files managed by Debezium engine. See [here](../../architecture/metadata_files/) for more about metadata files. 
+Once the initial snapshot is completed, the connector will not do it again upon subsequent restarts and will just resume with CDC since the last incomplete offset. This behavior is controlled by the metadata files managed by Debezium engine. See [here](../../architecture/metadata_files/) for more about metadata files. ## **Different Connector Launch Modes** @@ -122,7 +122,7 @@ SELECT synchdb_start_engine_bgw('mysqlconn', 'never'); Restarting the connector in `never` mode will resume CDC since the last successful point. -### **Always do Initial Snapahot + CDC** +### **Always do Initial Snapshot + CDC** Start the connector using `always` mode will always capture the schemas of capture tables, always redo the initial snapshot and then go to CDC. This is similar to a reset button because everything will be rebuilt using this mode. Use it with caution especially when you have large number of tables being captured, which could take a long time to finish. After the rebuild, CDC resumes as normal. diff --git a/doc/docs/en/tutorial/native_olr_cdc_to_postgresql.md b/doc/docs/en/tutorial/native_olr_cdc_to_postgresql.md index 4adc18e..4d8c9d1 100644 --- a/doc/docs/en/tutorial/native_olr_cdc_to_postgresql.md +++ b/doc/docs/en/tutorial/native_olr_cdc_to_postgresql.md @@ -1,6 +1,6 @@ # Native Openlog Replicator Connector -## **Prepare MySQL Database for SynchDB** +## **Prepare Oracle Database for SynchDB** Before SynchDB can be used to replicate from Native Openlog Replicator (OLR) Connector, Both OLR and Oracle database itself need to be configured according to the procedure outlined [here](../../getting-started/remote_database_setups/) @@ -32,7 +32,7 @@ SELECT synchdb_add_olr_conninfo( ## **Initial Snapshot** "Initial snapshot" (or table snapshot) in SynchDB means to copy table schema plus initial data for all designated tables. This is similar to the term "table sync" in PostgreSQL logical replication. 
When a connector is started using the default `initial` mode, it will automatically perform the initial snapshot before going to Change Data Capture (CDC) stage. This can be partially omitted with mode `no_data`. See [here](../../user-guide/start_stop_connector/) for all snapshot options. -Once the initial snapshot is completed, the connector will not do it again upon subsequent restarts and will just resume with CDC since the last incomplete offset. This behavior is controled by the metadata files managed by Debezium engine. See [here](../../architecture/metadata_files/) for more about metadata files. +Once the initial snapshot is completed, the connector will not do it again upon subsequent restarts and will just resume with CDC since the last incomplete offset. This behavior is controlled by the metadata files managed by Debezium engine. See [here](../../architecture/metadata_files/) for more about metadata files. ## **Different Connector Launch Modes** @@ -113,7 +113,7 @@ SELECT synchdb_start_engine_bgw('olrconn', 'never'); Restarting the connector in `never` mode will resume CDC since the last successful point. -### **Always do Initial Snapahot + CDC** +### **Always do Initial Snapshot + CDC** Start the connector using `always` mode will always capture the schemas of capture tables, always redo the initial snapshot and then go to CDC. This is similar to a reset button because everything will be rebuilt using this mode. Use it with caution especially when you have large number of tables being captured, which could take a long time to finish. After the rebuild, CDC resumes as normal. @@ -124,7 +124,7 @@ SELECT synchdb_start_engine_bgw('olrconn', 'always'); After the initial snapshot, CDC will begin. Restarting a connector in `always` mode will repeat the same process described above. 
-## **Possible Snapshot Modes for MySQL Connector** +## **Possible Snapshot Modes for Openlog Replicator Connector** * initial (default) * initial_only diff --git a/doc/docs/en/tutorial/postgresql_cdc_to_postgresql.md b/doc/docs/en/tutorial/postgresql_cdc_to_postgresql.md index 15dbfbb..a409172 100644 --- a/doc/docs/en/tutorial/postgresql_cdc_to_postgresql.md +++ b/doc/docs/en/tutorial/postgresql_cdc_to_postgresql.md @@ -18,7 +18,7 @@ SELECT ## **Initial Snapshot** "Initial snapshot" (or table snapshot) in SynchDB means to copy table schema plus initial data for all designated tables. This is similar to the term "table sync" in PostgreSQL logical replication. When a connector is started using the default `initial` mode, it will automatically perform the initial snapshot before going to Change Data Capture (CDC) stage. This can be partially omitted with mode `no_data`. See [here](../../user-guide/start_stop_connector/) for all snapshot options. -Once the initial snapshot is completed, the connector will not do it again upon subsequent restarts and will just resume with CDC since the last incomplete offset. This behavior is controled by the metadata files managed by Debezium engine. See [here](../../architecture/metadata_files/) for more about metadata files. +Once the initial snapshot is completed, the connector will not do it again upon subsequent restarts and will just resume with CDC since the last incomplete offset. This behavior is controlled by the metadata files managed by Debezium engine. See [here](../../architecture/metadata_files/) for more about metadata files. PostgreSQL connector's initial snapshot is a little different. Debezium engine does not build the initial table schema like other connectors do. This is because PostgreSQL does not explicitly emits DDL WAL events. PostgreSQL's native logical replication also behaves the same. The user must pre-create the table schema at the destination before launching logical replication. 
So, when launching the Debezium based PostgreSQL connector for the first time, it assumes you have already created the designated table schemas and their initial data, and would enter CDC streaming mode immediately, without actually doing initial snapshot. @@ -114,7 +114,7 @@ SELECT synchdb_start_engine_bgw('pgconn', 'no_data'); Restarting the connector in `no_data` mode will not rebuild the schema again, and it will resume CDC since the last successful point. -### **Always do Initial Snapahot + CDC** +### **Always do Initial Snapshot + CDC** **with synchdb.olr_snapshot_engine = 'debezium':** @@ -131,7 +131,7 @@ SELECT synchdb_start_engine_bgw('pgconn', 'always'); However, it is possible to select partial tables to redo the initial snapshot by using the `snapshottable` option of the connector. Tables matching the criteria in `snapshottable` will redo the inital snapshot, if not, their initial snapshot will be skipped. If `snapshottable` is null or empty, by default, all the tables specified in `table` option of the connector will redo the initial snapshot under `always` mode. -This example makes the connector only redo the initial snapshot of `inventory.customers` table. All other tables will have their snapshot skipped. +This example makes the connector only redo the initial snapshot of `public.customers` table. All other tables will have their snapshot skipped. ```sql UPDATE synchdb_conninfo SET data = jsonb_set(data, '{snapshottable}', '"public.customers"') @@ -338,16 +338,16 @@ Once the snapshot is complete, the connector will continue capturing subsequent ### **Add More Tables to Replicate During Run Time.** -The `mysqlconn` from previous section has already completed the initial snapshot and obtained the table schemas of the selected table. If we would like to add more tables to replicate from, we will need to notify the Debezium engine about the updated table section and perform the initial snapshot again. 
Here's how it is done: +The `pgconn` from the previous section has already completed the initial snapshot and obtained the table schemas of the selected table. If we would like to add more tables to replicate from, we will need to notify the Debezium engine about the updated table section and perform the initial snapshot again. Here's how it is done: 1. Update the `synchdb_conninfo` table to include additional tables. -2. In this example, we add the `inventory.customers` table to the sync list: +2. In this example, we add the `public.customers` table to the sync list: ```sql UPDATE synchdb_conninfo SET data = jsonb_set(data, '{table}', '"public.orders,public.customers"') WHERE name = 'pgconn'; ``` -3. Configure the snapshot table parameter to include only the new table `inventory.customers` to that SynchDB does not try to rebuild the 2 tables that have already finished the snapshot. +3. Configure the snapshot table parameter to include only the new table `public.customers` so that SynchDB does not try to rebuild the 2 tables that have already finished the snapshot. 
```sql UPDATE synchdb_conninfo SET data = jsonb_set(data, '{snapshottable}', '"public.customers"') @@ -364,12 +364,12 @@ SELECT synchdb_start_engine_bgw('pgconn', 'always'); Now, we can examine our tables again: ```sql -postgres=# \dt inventory.* +postgres=# \dt public.* List of tables Schema | Name | Type | Owner -----------+-----------+-------+-------- - inventory | customers | table | ubuntu - inventory | orders | table | ubuntu - inventory | products | table | ubuntu + public | customers | table | ubuntu + public | orders | table | ubuntu + public | products | table | ubuntu ``` \ No newline at end of file diff --git a/doc/docs/en/tutorial/sqlserver_cdc_to_postgresql.md b/doc/docs/en/tutorial/sqlserver_cdc_to_postgresql.md index 19f9721..85970e3 100644 --- a/doc/docs/en/tutorial/sqlserver_cdc_to_postgresql.md +++ b/doc/docs/en/tutorial/sqlserver_cdc_to_postgresql.md @@ -29,8 +29,8 @@ SELECT ## **Initial Snapshot** "Initial snapshot" (or table snapshot) in SynchDB means to copy table schema plus initial data for all designated tables. This is similar to the term "table sync" in PostgreSQL logical replication. When a connector is started using the default `initial` mode, it will automatically perform the initial snapshot before going to Change Data Capture (CDC) stage. This can be partially omitted with mode `no_data`. See [here](../../user-guide/start_stop_connector/) for all snapshot options. -Once the initial snapshot is completed, the connector will not do it again upon subsequent restarts and will just resume with CDC since the last incomplete offset. This behavior is controled by the metadata files managed by Debezium engine. See [here](../../architecture/metadata_files/) for more about metadata files. -** +Once the initial snapshot is completed, the connector will not do it again upon subsequent restarts and will just resume with CDC since the last incomplete offset. This behavior is controlled by the metadata files managed by Debezium engine. 
See [here](../../architecture/metadata_files/) for more about metadata files. + ## **Different Connector Launch Modes** ### **Initial Snapshot + CDC** @@ -111,7 +111,7 @@ SELECT synchdb_start_engine_bgw('sqlserverconn', 'no_data'); Restarting the connector in `no_data` mode will not rebuild the schema again, and it will resume CDC since the last successful point. -### **Always do Initial Snapahot + CDC** +### **Always do Initial Snapshot + CDC** Start the connector using `always` mode will always capture the schemas of capture tables, always redo the initial snapshot and then go to CDC. This is similar to a reset button because everything will be rebuilt using this mode. Use it with caution especially when you have large number of tables being captured, which could take a long time to finish. After the rebuild, CDC resumes as normal. diff --git a/doc/docs/en/user-guide/object_mapping_rules.md b/doc/docs/en/user-guide/object_mapping_rules.md index f975f08..ca4d14f 100644 --- a/doc/docs/en/user-guide/object_mapping_rules.md +++ b/doc/docs/en/user-guide/object_mapping_rules.md @@ -28,7 +28,7 @@ SELECT synchdb_add_objmap('mysqlconn','table','inventory.customers','schema1.peo * `source object` represents the column in fully-qualified name in remote database * `destination object` represents the column name in PostgreSQL. No need to format it as fully-qualified column name. -This example maps `inventory.customers.emaiL` column in the source table to `contact` in PostgreSQL. +This example maps `inventory.customers.email` column in the source table to `contact` in PostgreSQL. ```sql SELECT synchdb_add_objmap('mysqlconn','column','inventory.customers.email','contact'); ```