Merged
7 changes: 5 additions & 2 deletions README.md
@@ -28,12 +28,12 @@ SynchDB extension consists of these major components:

## Build Requirement
The following software is required to build and run SynchDB. The versions listed are the versions tested during development. Older versions may still work.
* Unix based operating system like Ubuntu 22.04 or MacOS
* Java Development Kit 17 or later. Download [here](https://www.oracle.com/ca-en/java/technologies/downloads/)
* Apache Maven 3.6.3 or later. Download [here](https://maven.apache.org/download.cgi)
* PostgreSQL source or build environment. Git clone [here](https://github.com/postgres/postgres). Refer to this [wiki](https://wiki.postgresql.org/wiki/Compile_and_Install_from_source_code) to build PostgreSQL from source or this [page](https://www.postgresql.org/download/linux/) to install PostgreSQL via packages
* If PostgreSQL is installed via a package manager, the corresponding devel package needs to be installed as well.
* Docker compose 2.28.1 (for testing). Refer to [here](https://docs.docker.com/compose/install/linux/)
* Unix based operating system like Ubuntu 22.04 or MacOS

**The following is required if Openlog Replicator Connector is enabled in build**

* libprotobuf-c v1.5.2. Refer to [here](https://github.com/protobuf-c/protobuf-c.git) to build from source.
@@ -50,6 +50,9 @@ If you already have PostgreSQL installed, you can build and install Default Sync

``` BASH
USE_PGXS=1 make PG_CONFIG=$(which pg_config)

# Using Maven to build Debezium
export PATH=${YOUR_MAVEN_PATH}/bin/:$PATH
USE_PGXS=1 make build_dbz PG_CONFIG=$(which pg_config)

sudo USE_PGXS=1 make PG_CONFIG=$(which pg_config) install
2 changes: 1 addition & 1 deletion doc/docs/en/architecture/ddl_replication.md
@@ -85,7 +85,7 @@ Supported modifications:

Other properties that can be specified during ALTER TABLE ALTER COLUMN are not supported at the moment.

Please note that SynchDB only supports basic data type change on an existing column. For example, `INT` → `BIGINT` or `VARCHAR` → `TEXT`. Complex data type changes such as `TEXT` → `INT` or `INT` → `TIMESTAMP` are not currently supported. This is because PostgreSQL requires the user to additioanlly supply a type casting function to perform the type casting as the result of complex data type change. SynchDB currently has to knowledge what type casting functions to use for specific type conversion. In the future, We may allow user to supply his or her own casting functions to use for specific type conversions via the rule file, but for now, it is not supported.
Please note that SynchDB only supports basic data type changes on an existing column, for example `INT` → `BIGINT` or `VARCHAR` → `TEXT`. Complex data type changes such as `TEXT` → `INT` or `INT` → `TIMESTAMP` are not currently supported. This is because PostgreSQL requires the user to additionally supply a type casting function to perform the conversion for a complex data type change. SynchDB currently has limited knowledge of which type casting functions to use for a specific type conversion. In the future, we may allow users to supply their own casting functions for specific type conversions via the rule file, but for now this is not supported.

## **Database-Specific Behavior**

4 changes: 2 additions & 2 deletions doc/docs/en/architecture/non_native_datatype_handling.md
@@ -1,4 +1,4 @@
# None-native Data Type Handling
# Non-native Data Type Handling

## **Handling Non-Native Data Types**

@@ -24,4 +24,4 @@ It is possible that a table contains a column data type that is custom created b

```

The category tells DML Converter about the nature of the data type (numeric? string? datetime? ...etc) to help the converter select the right routine to process. For most cases, using type category paired with the DBZ metadata that describes how the input data payload is formatted is sufficient to select the right routine to process the data. However, in some cases, it may not be sufficient. For example, custom DATE, TIME, TIMESTAMP date types could all be categorized under `TYPCATEGORY_DATETIME`, so the converter does not know if it is working with a DATE, TIME or TIMESTAMP as each would produce different time formats. Currently, the covnerter looks for certain keywords from the data type name to identify. In the future, we may expose this part to let the user tell the converter exactly which routine to use should there be an ambiguity. Another example would be `TYPCATEGORY_USER` and `TYPCATEGORY_GEOMETRIC` which does not clearly indicate the data format. For these categories, the converter currently does not perform any further processing as it simply leaves the data payload as is. PostgreSQL may or may not reject such unprocessed data. This is why the transform feature next is important to give the DML converter a final chance to correct its data payload.
The category tells the DML Converter about the nature of the data type (numeric? string? datetime? etc.) to help the converter select the right routine to process it. In most cases, the type category paired with the DBZ metadata that describes how the input data payload is formatted is sufficient to select the right routine. However, in some cases, it may not be sufficient. For example, custom DATE, TIME and TIMESTAMP data types could all be categorized under `TYPCATEGORY_DATETIME`, so the converter does not know whether it is working with a DATE, TIME or TIMESTAMP, each of which produces a different time format. Currently, the converter looks for certain keywords in the data type name to identify which one it is. In the future, we may expose this part to let the user tell the converter exactly which routine to use should there be an ambiguity. Another example would be `TYPCATEGORY_USER` and `TYPCATEGORY_GEOMETRIC`, which do not clearly indicate the data format. For these categories, the converter currently does not perform any further processing and simply leaves the data payload as is. PostgreSQL may or may not reject such unprocessed data. This is why the transform feature described next is important: it gives the DML converter a final chance to correct its data payload.
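
The keyword lookup on the data type name can be pictured with a small shell sketch. This is only an illustration of the idea: the function name `guess_datetime_kind` and the keyword list are assumptions made for this example and are not taken from the SynchDB source.

```bash
# Hypothetical helper: guess the datetime flavour from a custom type name.
# The keywords below are assumed for illustration, not SynchDB's actual list.
guess_datetime_kind() {
    local name
    # Lowercase the name so that matching is case-insensitive.
    name=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    case "$name" in
        *timestamp*|*datetime*) echo "timestamp" ;;  # checked before date/time
        *date*)                 echo "date" ;;
        *time*)                 echo "time" ;;
        *)                      echo "ambiguous" ;;  # no keyword matched
    esac
}

guess_datetime_kind "MY_TIMESTAMP_TZ"
guess_datetime_kind "birth_date"
guess_datetime_kind "wallclock"
```

A name such as `wallclock` matches no keyword, which is the kind of ambiguity that would need either user input or the transform feature to resolve.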
6 changes: 3 additions & 3 deletions doc/docs/en/getting-started/configuration.md
@@ -3,7 +3,7 @@ weight: 40
---
# SynchDB Configuration

SynchDB supports the following GUC variables in postgresql.conf. These are common parameters that apply the all connectors managed by SynchDB:
SynchDB supports the following GUC variables in postgresql.conf. These are common parameters that apply to all connectors managed by SynchDB:

| GUC Variable| Type | Default Value | Description |
|-|-|-|-|
@@ -14,7 +14,7 @@ SynchDB supports the following GUC variables in postgresql.conf. These are commo
| synchdb.dbz_queue_size | integer | 8192 | The maximum size (measured in number of change events) of Debezium embedded engine's change event queue. It should be set to at least twice the value of `synchdb.dbz_batch_size` |
| synchdb.dbz_connect_timeout_ms | integer | 30000 | The timeout value in milliseconds for Debezium embedded engine to establish an initial connection to a remote database |
| synchdb.dbz_query_timeout_ms | integer | 600000 | The timeout value in milliseconds for Debezium embedded engine to execute a query on a remote database |
| synchdb.dbz_skipped_oeprations | string | "t" | A comma-separated list of operations Debezium shall skip when processing change events. "c" is for inserts, "u" is for updates, "d" is for deletes, "t" is for truncates |
| synchdb.dbz_skipped_operations | string | "t" | A comma-separated list of operations Debezium shall skip when processing change events. "c" is for inserts, "u" is for updates, "d" is for deletes, "t" is for truncates |
| synchdb.jvm_max_heap_size | integer | 1024 | The maximum heap size in MB to be allocated to Java Virtual Machine (JVM) when starting a connector. |
| synchdb.dbz_snapshot_thread_num | integer | 2 | The number of threads Debezium embedded connector should spawn during initial snapshot. Please note that according to Debezium, multi-threaded snapshot is an `incubating feature` |
| synchdb.dbz_snapshot_fetch_size | integer | 0 | The number of rows Debezium embedded connector should fetch at a time during initial snapshot. Set it to 0 to let the engine choose automatically |
@@ -30,7 +30,7 @@ SynchDB supports the following GUC variables in postgresql.conf. These are commo
| synchdb.jvm_max_direct_buffer_size | integer | 1024 | The maximum direct buffer size in MB to be allocated to hold JSON change events |
| synchdb.dbz_logminer_stream_mode | enum | "uncommitted" | The streaming mode for Debezium based Oracle connector. The default is "uncommitted", which means all the changes streamed from Oracle via Debezium are uncommitted, so Debezium has to do some work to ensure the integrity of transactions and all associated changes. Setting it to "committed" shifts this work to the Oracle side |
| synchdb.olr_connect_timeout_ms | integer | 5000 | (affects OLR connector only) the connect timeout in milliseconds when connecting to openlog replicator service |
| synchdb.olr_read_timeout_m | integer | 5000 | (affects OLR connector only) the read timeout in milliseconds when reading from a socket |
| synchdb.olr_read_timeout_ms | integer | 5000 | (affects OLR connector only) the read timeout in milliseconds when reading from a socket |
| synchdb.olr_snapshot_engine | enum | "debezium" | the underlying engine used to complete the initial snapshot process. Could be "debezium" or "fdw". If "fdw" is selected, you need to ensure the corresponding FDW is installed beforehand. For example, for Oracle connector, ensure "oracle_fdw" is preinstalled. |
| synchdb.cdc_start_delay_ms | integer | 0 | a delay waited after initial snapshot completes and before CDC streaming begins. |
| synchdb.fdw_migrate_with_subtx | boolean | true | option to use sub transactions to migrate a table during FDW based snapshot |
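
The sizing rule for `synchdb.dbz_queue_size` (at least twice `synchdb.dbz_batch_size`) can be checked before editing postgresql.conf. The following is a minimal sketch: the helper name `queue_size_ok` is hypothetical, and the batch size of 2048 is an assumed value used only for the example.

```bash
# Hypothetical check: succeeds (exit status 0) when queue_size is at least
# twice batch_size, and fails (exit status 1) otherwise.
queue_size_ok() {
    local batch_size=$1 queue_size=$2
    [ "$queue_size" -ge $((batch_size * 2)) ]
}

# 8192 is the documented default queue size; 2048 is an assumed batch size.
if queue_size_ok 2048 8192; then
    echo "queue size is large enough"
else
    echo "queue size is too small for this batch size"
fi
```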
2 changes: 1 addition & 1 deletion doc/docs/en/getting-started/installation.md
@@ -206,7 +206,7 @@ JDK_LIB_PATH=${JDK_HOME_PATH}/lib
echo $JDK_LIB_PATH
echo $JDK_LIB_PATH/server

sudo echo "$JDK_LIB_PATH" sudo tee -a /etc/ld.so.conf.d/x86_64-linux-gnu.conf
sudo echo "$JDK_LIB_PATH" | sudo tee -a /etc/ld.so.conf.d/x86_64-linux-gnu.conf
sudo echo "$JDK_LIB_PATH/server" | sudo tee -a /etc/ld.so.conf.d/x86_64-linux-gnu.conf
```
Note: on Macs with M1/M2 chips, the linker file is located at /etc/ld.so.conf.d/aarch64-linux-gnu.conf
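
Choosing between the two linker files can be scripted from the machine architecture. This is a sketch under stated assumptions: the helper name `ld_conf_for_arch` is hypothetical, only the two paths mentioned above are covered, and treating `arm64` the same as `aarch64` is an assumption.

```bash
# Hypothetical helper: map a machine architecture to the linker config file.
# Only the two paths mentioned in this guide are covered.
ld_conf_for_arch() {
    case "$1" in
        x86_64)        echo "/etc/ld.so.conf.d/x86_64-linux-gnu.conf" ;;
        aarch64|arm64) echo "/etc/ld.so.conf.d/aarch64-linux-gnu.conf" ;;  # arm64 mapping is assumed
        *)             return 1 ;;  # unknown architecture
    esac
}

# Report the file that matches the current machine, if it is a known one.
if conf_file=$(ld_conf_for_arch "$(uname -m)"); then
    echo "linker config: $conf_file"
else
    echo "unsupported architecture: $(uname -m)"
fi
```

The resulting path can then be used as the `tee -a` target in place of the hard-coded file name.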
4 changes: 2 additions & 2 deletions doc/docs/en/tutorial/mysql_cdc_to_postgresql.md
@@ -16,7 +16,7 @@ SELECT synchdb_add_conninfo(
## **Initial Snapshot**
"Initial snapshot" (or table snapshot) in SynchDB means to copy table schema plus initial data for all designated tables. This is similar to the term "table sync" in PostgreSQL logical replication. When a connector is started using the default `initial` mode, it will automatically perform the initial snapshot before going to Change Data Capture (CDC) stage. This can be omitted entirely with mode `never` or partially omitted with mode `no_data`. See [here](../../user-guide/start_stop_connector/) for all snapshot options.

Once the initial snapshot is completed, the connector will not do it again upon subsequent restarts and will just resume with CDC since the last incomplete offset. This behavior is controled by the metadata files managed by Debezium engine. See [here](../../architecture/metadata_files/) for more about metadata files.
Once the initial snapshot is completed, the connector will not do it again upon subsequent restarts and will just resume with CDC since the last incomplete offset. This behavior is controlled by the metadata files managed by Debezium engine. See [here](../../architecture/metadata_files/) for more about metadata files.

## **Different Connector Launch Modes**

@@ -122,7 +122,7 @@ SELECT synchdb_start_engine_bgw('mysqlconn', 'never');

Restarting the connector in `never` mode will resume CDC since the last successful point.

### **Always do Initial Snapahot + CDC**
### **Always do Initial Snapshot + CDC**

Starting the connector in `always` mode will always capture the schemas of the captured tables, redo the initial snapshot, and then go to CDC. This is similar to a reset button because everything will be rebuilt using this mode. Use it with caution, especially when a large number of tables is being captured, as the rebuild could take a long time to finish. After the rebuild, CDC resumes as normal.

8 changes: 4 additions & 4 deletions doc/docs/en/tutorial/native_olr_cdc_to_postgresql.md
@@ -1,6 +1,6 @@
# Native Openlog Replicator Connector

## **Prepare MySQL Database for SynchDB**
## **Prepare Oracle Database for SynchDB**

Before SynchDB can be used to replicate from Native Openlog Replicator (OLR) Connector, both OLR and the Oracle database itself need to be configured according to the procedure outlined [here](../../getting-started/remote_database_setups/)

@@ -32,7 +32,7 @@ SELECT synchdb_add_olr_conninfo(
## **Initial Snapshot**
"Initial snapshot" (or table snapshot) in SynchDB means to copy table schema plus initial data for all designated tables. This is similar to the term "table sync" in PostgreSQL logical replication. When a connector is started using the default `initial` mode, it will automatically perform the initial snapshot before going to Change Data Capture (CDC) stage. This can be partially omitted with mode `no_data`. See [here](../../user-guide/start_stop_connector/) for all snapshot options.

Once the initial snapshot is completed, the connector will not do it again upon subsequent restarts and will just resume with CDC since the last incomplete offset. This behavior is controled by the metadata files managed by Debezium engine. See [here](../../architecture/metadata_files/) for more about metadata files.
Once the initial snapshot is completed, the connector will not do it again upon subsequent restarts and will just resume with CDC since the last incomplete offset. This behavior is controlled by the metadata files managed by Debezium engine. See [here](../../architecture/metadata_files/) for more about metadata files.

## **Different Connector Launch Modes**

@@ -113,7 +113,7 @@ SELECT synchdb_start_engine_bgw('olrconn', 'never');

Restarting the connector in `never` mode will resume CDC since the last successful point.

### **Always do Initial Snapahot + CDC**
### **Always do Initial Snapshot + CDC**

Starting the connector in `always` mode will always capture the schemas of the captured tables, redo the initial snapshot, and then go to CDC. This is similar to a reset button because everything will be rebuilt using this mode. Use it with caution, especially when a large number of tables is being captured, as the rebuild could take a long time to finish. After the rebuild, CDC resumes as normal.

@@ -124,7 +124,7 @@ SELECT synchdb_start_engine_bgw('olrconn', 'always');

After the initial snapshot, CDC will begin. Restarting a connector in `always` mode will repeat the same process described above.

## **Possible Snapshot Modes for MySQL Connector**
## **Possible Snapshot Modes for Openlog Replicator Connector**

* initial (default)
* initial_only
20 changes: 10 additions & 10 deletions doc/docs/en/tutorial/postgresql_cdc_to_postgresql.md
@@ -18,7 +18,7 @@ SELECT
## **Initial Snapshot**
"Initial snapshot" (or table snapshot) in SynchDB means to copy table schema plus initial data for all designated tables. This is similar to the term "table sync" in PostgreSQL logical replication. When a connector is started using the default `initial` mode, it will automatically perform the initial snapshot before going to Change Data Capture (CDC) stage. This can be partially omitted with mode `no_data`. See [here](../../user-guide/start_stop_connector/) for all snapshot options.

Once the initial snapshot is completed, the connector will not do it again upon subsequent restarts and will just resume with CDC since the last incomplete offset. This behavior is controled by the metadata files managed by Debezium engine. See [here](../../architecture/metadata_files/) for more about metadata files.
Once the initial snapshot is completed, the connector will not do it again upon subsequent restarts and will just resume with CDC since the last incomplete offset. This behavior is controlled by the metadata files managed by Debezium engine. See [here](../../architecture/metadata_files/) for more about metadata files.

PostgreSQL connector's initial snapshot is a little different. Debezium engine does not build the initial table schema like other connectors do. This is because PostgreSQL does not explicitly emit DDL WAL events. PostgreSQL's native logical replication behaves the same way: the user must pre-create the table schema at the destination before launching logical replication. So, when the Debezium based PostgreSQL connector is launched for the first time, it assumes you have already created the designated table schemas and their initial data, and enters CDC streaming mode immediately, without actually doing an initial snapshot.

@@ -114,7 +114,7 @@ SELECT synchdb_start_engine_bgw('pgconn', 'no_data');

Restarting the connector in `no_data` mode will not rebuild the schema again, and it will resume CDC since the last successful point.

### **Always do Initial Snapahot + CDC**
### **Always do Initial Snapshot + CDC**

**with synchdb.olr_snapshot_engine = 'debezium':**

@@ -131,7 +131,7 @@ SELECT synchdb_start_engine_bgw('pgconn', 'always');

However, it is possible to redo the initial snapshot for only a subset of tables by using the `snapshottable` option of the connector. Tables matching the criteria in `snapshottable` will redo the initial snapshot; all other tables will have their initial snapshot skipped. If `snapshottable` is null or empty, by default, all the tables specified in the `table` option of the connector will redo the initial snapshot under `always` mode.

This example makes the connector only redo the initial snapshot of `inventory.customers` table. All other tables will have their snapshot skipped.
This example makes the connector only redo the initial snapshot of `public.customers` table. All other tables will have their snapshot skipped.
```sql
UPDATE synchdb_conninfo
SET data = jsonb_set(data, '{snapshottable}', '"public.customers"')
@@ -338,16 +338,16 @@ Once the snapshot is complete, the connector will continue capturing subsequent

### **Add More Tables to Replicate During Run Time.**

The `mysqlconn` from previous section has already completed the initial snapshot and obtained the table schemas of the selected table. If we would like to add more tables to replicate from, we will need to notify the Debezium engine about the updated table section and perform the initial snapshot again. Here's how it is done:
The `pgconn` from previous section has already completed the initial snapshot and obtained the table schemas of the selected table. If we would like to add more tables to replicate from, we will need to notify the Debezium engine about the updated table section and perform the initial snapshot again. Here's how it is done:

1. Update the `synchdb_conninfo` table to include additional tables.
2. In this example, we add the `inventory.customers` table to the sync list:
2. In this example, we add the `public.customers` table to the sync list:
```sql
UPDATE synchdb_conninfo
SET data = jsonb_set(data, '{table}', '"public.orders,public.customers"')
WHERE name = 'pgconn';
```
3. Configure the snapshot table parameter to include only the new table `inventory.customers` to that SynchDB does not try to rebuild the 2 tables that have already finished the snapshot.
3. Configure the snapshot table parameter to include only the new table `public.customers` so that SynchDB does not try to rebuild the 2 tables that have already finished the snapshot.
```sql
UPDATE synchdb_conninfo
SET data = jsonb_set(data, '{snapshottable}', '"public.customers"')
@@ -364,12 +364,12 @@ SELECT synchdb_start_engine_bgw('pgconn', 'always');

Now, we can examine our tables again:
```sql
postgres=# \dt inventory.*
postgres=# \dt public.*
List of tables
Schema | Name | Type | Owner
-----------+-----------+-------+--------
inventory | customers | table | ubuntu
inventory | orders | table | ubuntu
inventory | products | table | ubuntu
public | customers | table | ubuntu
public | orders | table | ubuntu
public | products | table | ubuntu

```