diff --git a/DB-FORMATS.md b/DB-FORMATS.md new file mode 100644 index 00000000..f59abc62 --- /dev/null +++ b/DB-FORMATS.md @@ -0,0 +1,134 @@ +# duc DB Backend Formats + +Reference for all database backends supported across duc versions, derived from +the source implementations in `src/libduc/db-*.c` and `configure.ac`. + + +## Summary Table + +| Backend | File/Dir | Single file | Compression | Version key | Default in | +|----------------|----------|-------------|-------------------|-------------|-------------| +| Tokyo Cabinet | File | Yes | Optional (deflate)| Yes | 1.4.6 | +| Kyoto Cabinet | File | Yes | Always (kct opts) | Yes | — | +| LevelDB | Dir | **No** | Always (Snappy) | No | — | +| SQLite3 | File | Yes | None | No | — | +| LMDB | File | Yes | None | No | — | +| Tkrzw | File | Yes | Optional (ZSTD) | Yes | 1.5.0-rc2 | + +--- + +## Tokyo Cabinet (`tokyocabinet`) + +- **Introduced:** ≤ 1.4.6 +- **Default in:** 1.4.6 +- **Storage layout:** Single file on disk +- **Magic header (first bytes):** `ToKyO CaBiNeT` +- **Internal type:** `TCBDB` — B+ Tree Database +- **Compression:** Optional deflate (`BDBTDEFLATE`), enabled via `--compress` flag +- **Tuning:** `tcbdbtune(hdb, 256, 512, 131072, 9, 11, BDBTLARGE [| BDBTDEFLATE])` +- **Version check:** Stores and validates `duc_db_version` key on open +- **Notes:** + - The `BDBTLARGE` flag is always set, allowing the file to exceed 2 GB. + - `DUC_OPEN_FORCE` triggers `BDBOTRUNC`, which truncates and recreates the file. + +--- + +## Kyoto Cabinet (`kyotocabinet`) + +- **Introduced:** ≤ 1.4.6 +- **Default in:** — +- **Storage layout:** Single file on disk +- **Magic header (first bytes):** `Kyoto CaBiNeT` +- **Internal type:** KCT (Tree Cabinet), opened with `#type=kct#opts=c` +- **Compression:** Enabled unconditionally via `opts=c` in the open string +- **Version check:** Stores and validates `duc_db_version` key on open +- **Notes:** + - Error mapping is incomplete; all backend errors map to `DUC_E_UNKNOWN`. 
+ - The `DUC_OPEN_COMPRESS` flag is accepted but has no additional effect since + compression is always on via the open string. + +--- + +## LevelDB (`leveldb`) + +- **Introduced:** ≤ 1.4.6 +- **Default in:** — +- **Storage layout:** **Directory** (not a single file); LevelDB stores multiple + SSTable (`.ldb`/`.sst`) and manifest files inside a directory. +- **Magic header:** N/A — detected as a directory by `duc_db_type_check()` +- **Compression:** Snappy compression is always enabled + (`leveldb_snappy_compression`); the `DUC_OPEN_COMPRESS` flag has no effect. +- **Version check:** None — does not store or check `duc_db_version` +- **Notes:** + - Because the path is a directory, it behaves differently from all other + backends when specifying `--database`. + - `leveldb_options_set_create_if_missing` is always set; the DB is created + automatically if it does not exist. + +--- + +## SQLite3 (`sqlite3`) + +- **Introduced:** ≤ 1.4.6 +- **Default in:** — +- **Storage layout:** Single file on disk +- **Magic header (first bytes):** `SQLite format 3` +- **Internal schema:** Single table `blobs(key UNIQUE PRIMARY KEY, value)` with + an additional index `keys` on the `key` column. +- **Compression:** None — no compression support +- **Version check:** None — does not store or check `duc_db_version` +- **Notes:** + - All writes are batched inside a single `BEGIN`/`COMMIT` transaction that + spans the lifetime of the open database (committed on `db_close`). + - On open, a deliberate bogus query (`select bogus from bogus`) is run to + detect corrupt files that `sqlite3_open()` would otherwise accept silently. + - `insert or replace` semantics are used, so re-indexing a path overwrites the + previous entry cleanly. 
+ +--- + +## LMDB (`lmdb`) + +- **Introduced:** ≤ 1.4.6 +- **Default in:** — +- **Storage layout:** Single file on disk (opened with `MDB_NOSUBDIR`) +- **Magic header:** Standard LMDB file header (not checked by duc's type + detector; falls through to `unknown`) +- **Compression:** None — no compression support +- **Version check:** None — does not store or check `duc_db_version` +- **Memory map size:** + - 32-bit platforms: 1 GB (`1024 * 1024 * 1024`) + - 64-bit platforms: 256 GB (`1024 * 1024 * 1024 * 256`) +- **Notes:** + - Uses a single write transaction (`MDB_txn`) for all puts, committed on + `db_close`. A write error in `db_put` calls `exit(1)` immediately. + - The large pre-allocated map size is a virtual address reservation only; + actual disk usage grows on demand. + +--- + +## Tkrzw (`tkrzw`) + +- **Introduced:** 1.5.0-rc2 +- **Default in:** 1.5.0-rc2 +- **Storage layout:** Single file on disk +- **Magic header:** Tkrzw-specific header (not yet checked by duc's type + detector) +- **Internal type:** `HashDBM` with `StdFile` file driver +- **Base open options:** `dbm=HashDBM,file=StdFile,offset_width=5` +- **Compression:** Optional ZSTD record compression (`record_comp_mode=RECORD_COMP_ZSTD`), + enabled at compile time via `--with-tkrzw-zstd` and at runtime via the + `DUC_OPEN_COMPRESS` flag. Falls back to `NONE` if not compiled in. +- **Version check:** Stores and validates `duc_db_version` key on open +- **Filesystem size hints:** The `num_buckets` tuning parameter is scaled via + new `DUC_FS_*` flags: + | Flag | `num_buckets` | + |-------------------|---------------| + | `DUC_FS_BIG` | 100,000,000 | + | `DUC_FS_BIGGER` | 1,000,000,000 | + | `DUC_FS_BIGGEST` | 10,000,000,000| +- **Notes:** + - `DUC_OPEN_FORCE` appends `,truncate=true` to the options string, recreating + the file. + - Tkrzw is a successor/spiritual replacement for both Tokyo Cabinet and Kyoto + Cabinet, providing a modern hash-based store with better compression options. 
diff --git a/migrator/Makefile b/migrator/Makefile
new file mode 100644
index 00000000..b9bbdef3
--- /dev/null
+++ b/migrator/Makefile
@@ -0,0 +1,94 @@
+CC = gcc
+CFLAGS = -O2 -Wall -Wextra -std=c11
+LDFLAGS =
+
+# ---------------------------------------------------------------------------
+# Auto-detect available backends via pkg-config (or direct lib checks).
+# Each enabled backend appends -DHAVE_<BACKEND> plus its compile/link flags.
+# ---------------------------------------------------------------------------
+
+# --- Tokyo Cabinet ----------------------------------------------------------
+TC_LIBS := $(shell pkg-config --libs tokyocabinet 2>/dev/null)
+TC_CFLAGS := $(shell pkg-config --cflags tokyocabinet 2>/dev/null)
+ifneq ($(TC_LIBS),)
+  CFLAGS += -DHAVE_TOKYOCABINET $(TC_CFLAGS)
+  LDFLAGS += $(TC_LIBS)
+  $(info [+] Tokyo Cabinet detected)
+else
+  $(info [-] Tokyo Cabinet not found (install: libtokyocabinet-dev))
+endif
+
+# --- Kyoto Cabinet ----------------------------------------------------------
+KC_LIBS := $(shell pkg-config --libs kyotocabinet 2>/dev/null)
+KC_CFLAGS := $(shell pkg-config --cflags kyotocabinet 2>/dev/null)
+ifneq ($(KC_LIBS),)
+  CFLAGS += -DHAVE_KYOTOCABINET $(KC_CFLAGS)
+  LDFLAGS += $(KC_LIBS)
+  $(info [+] Kyoto Cabinet detected)
+else
+  $(info [-] Kyoto Cabinet not found (install: libkyotocabinet-dev))
+endif
+
+# --- LevelDB ----------------------------------------------------------------
+LDB_LIBS := $(shell pkg-config --libs leveldb 2>/dev/null)
+LDB_CFLAGS := $(shell pkg-config --cflags leveldb 2>/dev/null)
+ifneq ($(LDB_LIBS),)
+  CFLAGS += -DHAVE_LEVELDB $(LDB_CFLAGS)
+  LDFLAGS += $(LDB_LIBS)
+  $(info [+] LevelDB detected)
+else
+  # Fallback: try direct link (common on Debian/Ubuntu without .pc file)
+  LDB_DIRECT := $(shell echo 'int main(){}' | $(CC) -x c - -lleveldb -o /dev/null 2>/dev/null && echo yes)
+  ifeq ($(LDB_DIRECT),yes)
+    CFLAGS += -DHAVE_LEVELDB
+    LDFLAGS += -lleveldb
+    $(info [+] LevelDB detected (direct link))
+  else
$(info [-] LevelDB not found (install: libleveldb-dev)) + endif +endif + +# --- SQLite3 ---------------------------------------------------------------- +SQ_LIBS := $(shell pkg-config --libs sqlite3 2>/dev/null) +SQ_CFLAGS := $(shell pkg-config --cflags sqlite3 2>/dev/null) +ifneq ($(SQ_LIBS),) + CFLAGS += -DHAVE_SQLITE3 $(SQ_CFLAGS) + LDFLAGS += $(SQ_LIBS) + $(info [+] SQLite3 detected) +else + $(info [-] SQLite3 not found (install: libsqlite3-dev)) +endif + +# --- LMDB ------------------------------------------------------------------- +MDB_DIRECT := $(shell echo 'int main(){}' | $(CC) -x c - -llmdb -o /dev/null 2>/dev/null && echo yes) +ifeq ($(MDB_DIRECT),yes) + CFLAGS += -DHAVE_LMDB + LDFLAGS += -llmdb + $(info [+] LMDB detected) +else + $(info [-] LMDB not found (install: liblmdb-dev)) +endif + +# --- Tkrzw ------------------------------------------------------------------ +TKRZW_DIRECT := $(shell echo 'int main(){}' | $(CC) -x c - -ltkrzw -o /dev/null 2>/dev/null && echo yes) +ifeq ($(TKRZW_DIRECT),yes) + CFLAGS += -DHAVE_TKRZW + LDFLAGS += -ltkrzw + $(info [+] Tkrzw detected) +else + $(info [-] Tkrzw not found (install: libtkrzw-dev)) +endif + +# --------------------------------------------------------------------------- + +TARGET = migrator + +.PHONY: all clean + +all: $(TARGET) + +$(TARGET): migrator.c + $(CC) $(CFLAGS) -o $@ $< $(LDFLAGS) + +clean: + rm -f $(TARGET) diff --git a/migrator/README.md b/migrator/README.md new file mode 100644 index 00000000..3aa9eee1 --- /dev/null +++ b/migrator/README.md @@ -0,0 +1,156 @@ +# duc Database Migrator + +A standalone command-line tool that converts a duc index database from any +supported backend format to any other, without losing data. + +For a detailed description of each backend's on-disk format, internal +structure, and quirks see **[DB-FORMATS.md](../DB-FORMATS.md)**. + +--- + +## Overview + +duc stores its index as a simple key-value database. 
The backend is chosen +at compile time; all backends share the same logical schema but differ in +file format, compression, and performance characteristics. + +`migrator` links every available backend into a single binary and performs a +raw KV copy between them — all duc-internal keys (`duc_db_version`, +`duc_index_reports`, path records, …) are transferred verbatim. + +Typical use cases: + +- Upgrading from the 1.4.6 default (`tokyocabinet`) to the 1.5.0 default (`tkrzw`) +- Converting a `leveldb` directory-based database to a single-file format +- Switching to `sqlite3` for inspection with standard SQL tooling + +--- + +## Supported Backends + +| Backend | Format | Compression | Default in | +|-----------------|---------|-------------------|-------------| +| `tokyocabinet` | File | Optional (deflate)| 1.4.6 | +| `kyotocabinet` | File | Always (kct) | — | +| `leveldb` | **Dir** | Always (Snappy) | — | +| `sqlite3` | File | None | — | +| `lmdb` | File | None | — | +| `tkrzw` | File | Optional (ZSTD) | 1.5.0-rc2 | + +All backends listed above are compiled into one binary if the corresponding +library is present at build time. The Makefile reports which ones were +detected. + +> **Note on LevelDB:** the `path` for a LevelDB database is a **directory**, +> not a file. Pass the directory path to `--from` or `--to` accordingly. + +--- + +## Building + +Dependencies are auto-detected via `pkg-config` (and direct linker probes for +LMDB and Tkrzw, which often lack `.pc` files). + +```sh +cd migrator +make +``` + +Example output showing which backends were found: + +``` +[+] Tokyo Cabinet detected +[-] Kyoto Cabinet not found (install: libkyotocabinet-dev) +[+] LevelDB detected (direct link) +[+] SQLite3 detected +[+] LMDB detected +[+] Tkrzw detected +``` + +At least two backends must be compiled in to perform a migration. 
+
+### Manual flags
+
+If auto-detection fails you can pass flags directly:
+
+```sh
+make CFLAGS="-DHAVE_TOKYOCABINET -DHAVE_TKRZW" \
+     LDFLAGS="-ltokyocabinet -ltkrzw"
+```
+
+---
+
+## Usage
+
+```
+./migrator --from <format>:<path> --to <format>:<path>
+```
+
+`format` is one of the backend names in the table above; `path` is the
+filesystem path to the database file (or directory for LevelDB).
+
+### Examples
+
+**Tokyo Cabinet → Tkrzw** (the common 1.4.6 → 1.5.0 upgrade path):
+
+```sh
+./migrator \
+    --from tokyocabinet:~/.cache/duc/duc.db \
+    --to tkrzw:~/.cache/duc/duc.tkrzw.db
+```
+
+**Tokyo Cabinet → SQLite3** (for ad-hoc SQL inspection):
+
+```sh
+./migrator \
+    --from tokyocabinet:/var/cache/duc/duc.db \
+    --to sqlite3:/tmp/duc-inspect.sqlite
+# Then: sqlite3 /tmp/duc-inspect.sqlite "select key from blobs"
+```
+
+**LevelDB directory → LMDB single file:**
+
+```sh
+./migrator \
+    --from leveldb:/var/cache/duc/duc-leveldb/ \
+    --to lmdb:/var/cache/duc/duc.lmdb
+```
+
+### Testing
+
+See [testing/README.md](../testing/README.md) and
+[testing/test-migrator.sh](../testing/test-migrator.sh), which exercises the
+migrator against all backend combinations.
+
+---
+
+## How It Works
+
+1. The source database is opened **read-only**.
+2. A full cursor scan iterates every key-value record in storage order.
+3. Each record is written verbatim to the destination database.
+4. Both databases are flushed and closed cleanly on completion.
+
+Progress is printed every 100 records; the final line reports the total
+count and any write errors.
+
+Because the copy is raw (below the duc abstraction layer), the destination
+database is immediately usable by duc without re-indexing.
+
+---
+
+## Caveats
+
+- **`duc_db_version`** is copied as-is. Backends that do not normally store
+  this key (LevelDB, SQLite3, LMDB) will have it present after migration,
+  which is harmless.
Backends that validate it on open (Tokyo Cabinet, Kyoto
+  Cabinet, Tkrzw) will accept it as long as the version string matches the
+  compiled duc version.
+
+- **LevelDB** stores its data in a directory; make sure the destination
+  directory either does not exist or is empty before migrating into it.
+
+- **LMDB** pre-allocates a large virtual address range (1 GB on 32-bit, 256 GB
+  on 64-bit). Actual disk usage is much smaller; the reservation is virtual
+  memory only.
+
+- The migrator does **not** validate the integrity of the source database
+  before copying. Run `duc info` on the source first if in doubt.
diff --git a/migrator/migrator.c b/migrator/migrator.c
new file mode 100644
index 00000000..cbfefa45
--- /dev/null
+++ b/migrator/migrator.c
@@ -0,0 +1,804 @@
+/*
+ * migrator.c - duc database backend converter
+ *
+ * Copies every raw key-value record from a duc database stored in one backend
+ * format into a new database using a different backend. All six backends are
+ * compiled into a single binary (guarded by HAVE_* macros), so any source /
+ * destination pairing is possible without multiple build variants.
+ *
+ * Usage:
+ *   ./migrator --from <format>:<path> --to <format>:<path>
+ *
+ * Supported formats (enabled at compile time via HAVE_* flags):
+ *   tokyocabinet, kyotocabinet, leveldb, sqlite3, lmdb, tkrzw
+ *
+ * The migration is a raw KV copy (below the duc abstraction layer), so every
+ * key is transferred verbatim, including duc_db_version and duc_index_reports.
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+
+
+/* ============================================================
+ * Generic backend interface
+ * ============================================================ */
+
+typedef struct {
+	const char *name;
+
+	/* Open the database at path. readonly=1 for source, 0 for destination.
+	 * Returns an opaque handle on success, NULL on failure. */
+	void *(*open)(const char *path, int readonly);
+
+	/* Flush and close. */
+	void (*close)(void *handle);
+
+	/* Write one record.
Returns 0 on success, -1 on error. */
+	int (*put)(void *handle,
+	           const void *key, size_t klen,
+	           const void *val, size_t vlen);
+
+	/* Iteration ---------------------------------------------------
+	 * iter_new()  – create an iterator positioned before the first record.
+	 * iter_next() – advance and fill *key, *val with malloc'd buffers;
+	 *               caller must free() both. Returns 1, or 0 when done.
+	 * iter_free() – destroy the iterator.
+	 */
+	void *(*iter_new)(void *handle);
+	int (*iter_next)(void *iter,
+	                 void **key, size_t *klen,
+	                 void **val, size_t *vlen);
+	void (*iter_free)(void *iter);
+
+	/* Return total number of records, or 0 if not cheaply available. */
+	size_t (*count)(void *handle);
+} backend_ops_t;
+
+
+/* ============================================================
+ * Tokyo Cabinet (TCBDB – B+ tree)
+ * ============================================================ */
+#ifdef HAVE_TOKYOCABINET
+#include <tcutil.h>
+#include <tcbdb.h>
+
+typedef struct { TCBDB *hdb; BDBCUR *cur; } tc_iter_t;
+
+static void *tc_open(const char *path, int readonly)
+{
+	TCBDB *hdb = tcbdbnew();
+	tcbdbtune(hdb, 256, 512, 131072, 9, 11, BDBTLARGE);
+	uint32_t mode = readonly
+		? (BDBONOLCK | BDBOREADER)
+		: (BDBOWRITER | BDBOCREAT);
+	if (!tcbdbopen(hdb, path, mode)) {
+		fprintf(stderr, "tokyocabinet: cannot open '%s': %s\n",
+		        path, tcbdberrmsg(tcbdbecode(hdb)));
+		tcbdbdel(hdb);
+		return NULL;
+	}
+	return hdb;
+}
+
+static void tc_close(void *h)
+{
+	tcbdbclose((TCBDB *)h);
+	tcbdbdel((TCBDB *)h);
+}
+
+static int tc_put(void *h, const void *k, size_t kl, const void *v, size_t vl)
+{
+	return tcbdbput((TCBDB *)h, k, (int)kl, v, (int)vl) ?
0 : -1; }
+
+static void *tc_iter_new(void *h)
+{
+	tc_iter_t *it = malloc(sizeof *it);
+	it->hdb = (TCBDB *)h;
+	it->cur = tcbdbcurnew(it->hdb);
+	tcbdbcurfirst(it->cur);
+	return it;
+}
+
+static int tc_iter_next(void *iter,
+                        void **key, size_t *klen,
+                        void **val, size_t *vlen)
+{
+	tc_iter_t *it = iter;
+	int ks, vs;
+	/* tcbdbcurkey / tcbdbcurval each return a malloc'd buffer */
+	*key = tcbdbcurkey(it->cur, &ks);
+	if (!*key) return 0;
+	*klen = (size_t)ks;
+	*val = tcbdbcurval(it->cur, &vs);
+	*vlen = (size_t)vs;
+	tcbdbcurnext(it->cur);
+	return 1;
+}
+
+static void tc_iter_free(void *iter)
+{
+	tc_iter_t *it = iter;
+	tcbdbcurdel(it->cur);
+	free(it);
+}
+
+static size_t tc_count(void *h) { return (size_t)tcbdbrnum((TCBDB *)h); }
+
+static const backend_ops_t tc_ops = {
+	"tokyocabinet",
+	tc_open, tc_close, tc_put,
+	tc_iter_new, tc_iter_next, tc_iter_free,
+	tc_count
+};
+#endif /* HAVE_TOKYOCABINET */
+
+
+/* ============================================================
+ * Kyoto Cabinet (KCT – tree cabinet with compression)
+ * ============================================================ */
+#ifdef HAVE_KYOTOCABINET
+#include <kclangc.h>
+
+typedef struct { KCCUR *cur; } kc_iter_t;
+
+static void *kc_open(const char *path, int readonly)
+{
+	KCDB *kdb = kcdbnew();
+	char fname[4096];
+	snprintf(fname, sizeof fname, "%s#type=kct#opts=c", path);
+	uint32_t mode = readonly ? KCOREADER : (KCOWRITER | KCOCREATE);
+	if (!kcdbopen(kdb, fname, mode)) {
+		fprintf(stderr, "kyotocabinet: cannot open '%s'\n", path);
+		kcdbdel(kdb);
+		return NULL;
+	}
+	return kdb;
+}
+
+static void kc_close(void *h)
+{
+	kcdbclose((KCDB *)h);
+	kcdbdel((KCDB *)h);
+}
+
+static int kc_put(void *h, const void *k, size_t kl, const void *v, size_t vl)
+{
+	return kcdbset((KCDB *)h, k, kl, v, vl) ?
0 : -1; }
+
+static void *kc_iter_new(void *h)
+{
+	kc_iter_t *it = malloc(sizeof *it);
+	it->cur = kcdbcursor((KCDB *)h);
+	kccurjump(it->cur);
+	return it;
+}
+
+static int kc_iter_next(void *iter,
+                        void **key, size_t *klen,
+                        void **val, size_t *vlen)
+{
+	kc_iter_t *it = iter;
+	size_t ks, vs;
+	const char *vp;
+	/* kccurget packs key+value in one allocation; step=1 advances the cursor. */
+	char *k = kccurget(it->cur, &ks, &vp, &vs, 1);
+	if (!k) return 0;
+	*key = k;
+	*klen = ks;
+	/* Copy the value into a fresh buffer so the caller can always free() it.
+	 * vp points into the same allocation as k (kccurget packs key+value in
+	 * one buffer), so do NOT kcfree(vp) — freeing k via the caller's
+	 * free(*key) releases the whole block. */
+	*val = malloc(vs);
+	memcpy(*val, vp, vs);
+	*vlen = vs;
+	return 1;
+}
+
+static void kc_iter_free(void *iter)
+{
+	kc_iter_t *it = iter;
+	kccurdel(it->cur);
+	free(it);
+}
+
+static size_t kc_count(void *h) { return (size_t)kcdbcount((KCDB *)h); }
+
+static const backend_ops_t kc_ops = {
+	"kyotocabinet",
+	kc_open, kc_close, kc_put,
+	kc_iter_new, kc_iter_next, kc_iter_free,
+	kc_count
+};
+#endif /* HAVE_KYOTOCABINET */
+
+
+/* ============================================================
+ * LevelDB (SSTable directory, Snappy compression)
+ * ============================================================ */
+#ifdef HAVE_LEVELDB
+#include <leveldb/c.h>
+
+typedef struct {
+	leveldb_t *db;
+	leveldb_options_t *options;
+	leveldb_readoptions_t *roptions;
+	leveldb_writeoptions_t *woptions;
+} ldb_handle_t;
+
+typedef struct {
+	leveldb_iterator_t *it;
+	leveldb_readoptions_t *roptions;
+} ldb_iter_t;
+
+static void *ldb_open(const char *path, int readonly)
+{
+	ldb_handle_t *h = malloc(sizeof *h);
+	char *err = NULL;
+	h->options = leveldb_options_create();
+	h->roptions = leveldb_readoptions_create();
+	h->woptions = leveldb_writeoptions_create();
+	leveldb_options_set_create_if_missing(h->options, !readonly);
leveldb_options_set_compression(h->options, leveldb_snappy_compression); + h->db = leveldb_open(h->options, path, &err); + if (err) { + fprintf(stderr, "leveldb: cannot open '%s': %s\n", path, err); + leveldb_free(err); + leveldb_options_destroy(h->options); + leveldb_readoptions_destroy(h->roptions); + leveldb_writeoptions_destroy(h->woptions); + free(h); + return NULL; + } + return h; +} + +static void ldb_close(void *handle) +{ + ldb_handle_t *h = handle; + leveldb_close(h->db); + leveldb_options_destroy(h->options); + leveldb_readoptions_destroy(h->roptions); + leveldb_writeoptions_destroy(h->woptions); + free(h); +} + +static int ldb_put(void *handle, + const void *k, size_t kl, + const void *v, size_t vl) +{ + ldb_handle_t *h = handle; + char *err = NULL; + leveldb_put(h->db, h->woptions, k, kl, v, vl, &err); + if (err) { leveldb_free(err); return -1; } + return 0; +} + +static void *ldb_iter_new(void *handle) +{ + ldb_handle_t *h = handle; + ldb_iter_t *it = malloc(sizeof *it); + it->roptions = leveldb_readoptions_create(); + it->it = leveldb_create_iterator(h->db, it->roptions); + leveldb_iter_seek_to_first(it->it); + return it; +} + +static int ldb_iter_next(void *iter, + void **key, size_t *klen, + void **val, size_t *vlen) +{ + ldb_iter_t *it = iter; + if (!leveldb_iter_valid(it->it)) return 0; + /* LevelDB returns pointers into its internal buffer; must copy */ + const char *k = leveldb_iter_key(it->it, klen); + const char *v = leveldb_iter_value(it->it, vlen); + *key = malloc(*klen); memcpy(*key, k, *klen); + *val = malloc(*vlen); memcpy(*val, v, *vlen); + leveldb_iter_next(it->it); + return 1; +} + +static void ldb_iter_free(void *iter) +{ + ldb_iter_t *it = iter; + leveldb_iter_destroy(it->it); + leveldb_readoptions_destroy(it->roptions); + free(it); +} + +static size_t ldb_count(void *h) { (void)h; return 0; } + +static const backend_ops_t ldb_ops = { + "leveldb", + ldb_open, ldb_close, ldb_put, + ldb_iter_new, ldb_iter_next, ldb_iter_free, + 
ldb_count
+};
+#endif /* HAVE_LEVELDB */
+
+
+/* ============================================================
+ * SQLite3 (single-file, table: blobs(key, value))
+ * ============================================================ */
+#ifdef HAVE_SQLITE3
+#include <sqlite3.h>
+
+typedef struct { sqlite3 *s; } sq_handle_t;
+typedef struct { sqlite3_stmt *stmt; } sq_iter_t;
+
+static void *sq_open(const char *path, int readonly)
+{
+	sq_handle_t *h = malloc(sizeof *h);
+	int flags = readonly
+		? SQLITE_OPEN_READONLY
+		: (SQLITE_OPEN_READWRITE | SQLITE_OPEN_CREATE);
+	if (sqlite3_open_v2(path, &h->s, flags, NULL) != SQLITE_OK) {
+		fprintf(stderr, "sqlite3: cannot open '%s': %s\n",
+		        path, sqlite3_errmsg(h->s));
+		sqlite3_close(h->s);
+		free(h);
+		return NULL;
+	}
+	if (!readonly) {
+		sqlite3_exec(h->s,
+			"create table if not exists blobs"
+			"(key unique primary key, value)", 0, 0, 0);
+		sqlite3_exec(h->s,
+			"create index if not exists keys on blobs(key)", 0, 0, 0);
+		sqlite3_exec(h->s, "begin", 0, 0, 0);
+	}
+	return h;
+}
+
+static void sq_close(void *handle)
+{
+	sq_handle_t *h = handle;
+	sqlite3_exec(h->s, "commit", 0, 0, 0);
+	sqlite3_close(h->s);
+	free(h);
+}
+
+static int sq_put(void *handle,
+                  const void *k, size_t kl,
+                  const void *v, size_t vl)
+{
+	sq_handle_t *h = handle;
+	sqlite3_stmt *stmt;
+	sqlite3_prepare_v2(h->s,
+		"insert or replace into blobs(key,value) values(?,?)",
+		-1, &stmt, 0);
+	sqlite3_bind_text(stmt, 1, k, (int)kl, SQLITE_STATIC);
+	sqlite3_bind_blob(stmt, 2, v, (int)vl, SQLITE_STATIC);
+	sqlite3_step(stmt);
+	sqlite3_finalize(stmt);
+	return 0;
+}
+
+static void *sq_iter_new(void *handle)
+{
+	sq_handle_t *h = handle;
+	sq_iter_t *it = malloc(sizeof *it);
+	sqlite3_prepare_v2(h->s,
+		"select key, value from blobs",
+		-1, &it->stmt, 0);
+	return it;
+}
+
+static int sq_iter_next(void *iter,
+                        void **key, size_t *klen,
+                        void **val, size_t *vlen)
+{
+	sq_iter_t *it = iter;
+	if (sqlite3_step(it->stmt) != SQLITE_ROW) return 0;
+	/* Pointers are only valid
until the next sqlite3_step(); copy them */
+	*klen = (size_t)sqlite3_column_bytes(it->stmt, 0);
+	*vlen = (size_t)sqlite3_column_bytes(it->stmt, 1);
+	*key = malloc(*klen);
+	memcpy(*key, sqlite3_column_blob(it->stmt, 0), *klen);
+	*val = malloc(*vlen);
+	memcpy(*val, sqlite3_column_blob(it->stmt, 1), *vlen);
+	return 1;
+}
+
+static void sq_iter_free(void *iter)
+{
+	sq_iter_t *it = iter;
+	sqlite3_finalize(it->stmt);
+	free(it);
+}
+
+static size_t sq_count(void *h)
+{
+	sq_handle_t *sh = h;
+	sqlite3_stmt *stmt;
+	size_t n = 0;
+	if (sqlite3_prepare_v2(sh->s, "select count(*) from blobs", -1, &stmt, 0) == SQLITE_OK) {
+		if (sqlite3_step(stmt) == SQLITE_ROW) n = (size_t)sqlite3_column_int64(stmt, 0);
+		sqlite3_finalize(stmt);
+	}
+	return n;
+}
+
+static const backend_ops_t sq_ops = {
+	"sqlite3",
+	sq_open, sq_close, sq_put,
+	sq_iter_new, sq_iter_next, sq_iter_free,
+	sq_count
+};
+#endif /* HAVE_SQLITE3 */
+
+
+/* ============================================================
+ * LMDB (memory-mapped single file, MDB_NOSUBDIR)
+ * ============================================================ */
+#ifdef HAVE_LMDB
+#include <lmdb.h>
+
+typedef struct { MDB_env *env; MDB_dbi dbi; MDB_txn *txn; } mdb_handle_t;
+typedef struct { MDB_cursor *cur; int started; } mdb_iter_t;
+
+static void *mdb_be_open(const char *path, int readonly)
+{
+	mdb_handle_t *h = malloc(sizeof *h); h->env = NULL; /* NULL keeps the err path's mdb_env_close() safe */
+	unsigned int env_flags = MDB_NOSUBDIR;
+	unsigned int open_flags = 0;
+	unsigned int txn_flags = 0;
+
+	if (readonly) {
+		env_flags |= MDB_RDONLY;
+		txn_flags |= MDB_RDONLY;
+	} else {
+		open_flags |= MDB_CREATE;
+	}
+
+	int rc;
+	if ((rc = mdb_env_create(&h->env)) != MDB_SUCCESS) goto err;
+	if (!readonly) {
+		/* For write: give a large virtual address space so large DBs fit.
*/ + size_t map_size = 1024u * 1024u * 1024u; + if (sizeof(size_t) == 8) map_size *= 256u; + if ((rc = mdb_env_set_mapsize(h->env, map_size)) != MDB_SUCCESS) goto err; + } + /* For readonly: use size=0 — LMDB adopts the mapsize from the file header. */ + if ((rc = mdb_env_open(h->env, path, env_flags, 0664)) != MDB_SUCCESS) goto err; + if ((rc = mdb_txn_begin(h->env, NULL, txn_flags, &h->txn)) != MDB_SUCCESS) goto err; + if ((rc = mdb_open(h->txn, NULL, open_flags, &h->dbi)) != MDB_SUCCESS) goto err; + return h; +err: + fprintf(stderr, "lmdb: cannot open '%s': %s\n", path, mdb_strerror(rc)); + mdb_env_close(h->env); + free(h); + return NULL; +} + +static void mdb_be_close(void *handle) +{ + mdb_handle_t *h = handle; + mdb_txn_commit(h->txn); + mdb_dbi_close(h->env, h->dbi); + mdb_env_close(h->env); + free(h); +} + +static int mdb_be_put(void *handle, + const void *k, size_t kl, + const void *v, size_t vl) +{ + mdb_handle_t *h = handle; + MDB_val mk = { kl, (void *)k }; + MDB_val mv = { vl, (void *)v }; + return mdb_put(h->txn, h->dbi, &mk, &mv, 0) == MDB_SUCCESS ? 0 : -1; +} + +static void *mdb_iter_new(void *handle) +{ + mdb_handle_t *h = handle; + mdb_iter_t *it = malloc(sizeof *it); + mdb_cursor_open(h->txn, h->dbi, &it->cur); + it->started = 0; + return it; +} + +static int mdb_iter_next(void *iter, + void **key, size_t *klen, + void **val, size_t *vlen) +{ + mdb_iter_t *it = iter; + MDB_val mk, mv; + MDB_cursor_op op = it->started ? 
MDB_NEXT : MDB_FIRST;
+	it->started = 1;
+	if (mdb_cursor_get(it->cur, &mk, &mv, op) != MDB_SUCCESS) return 0;
+	/* LMDB data lives in the memory-mapped file; copy before txn ends */
+	*klen = mk.mv_size; *key = malloc(mk.mv_size); memcpy(*key, mk.mv_data, mk.mv_size);
+	*vlen = mv.mv_size; *val = malloc(mv.mv_size); memcpy(*val, mv.mv_data, mv.mv_size);
+	return 1;
+}
+
+static void mdb_iter_free(void *iter)
+{
+	mdb_iter_t *it = iter;
+	mdb_cursor_close(it->cur);
+	free(it);
+}
+
+static size_t mdb_count(void *h)
+{
+	mdb_handle_t *mh = h;
+	MDB_stat st;
+	if (mdb_stat(mh->txn, mh->dbi, &st) == MDB_SUCCESS) return (size_t)st.ms_entries;
+	return 0;
+}
+
+static const backend_ops_t mdb_ops = {
+	"lmdb",
+	mdb_be_open, mdb_be_close, mdb_be_put,
+	mdb_iter_new, mdb_iter_next, mdb_iter_free,
+	mdb_count
+};
+#endif /* HAVE_LMDB */
+
+
+/* ============================================================
+ * Tkrzw (HashDBM, StdFile; new default in 1.5.0-rc2)
+ * ============================================================ */
+#ifdef HAVE_TKRZW
+#include <tkrzw_langc.h>
+
+static void *tkrzw_be_open(const char *path, int readonly)
+{
+	TkrzwDBM *hdb = tkrzw_dbm_open(
+		path, !readonly,
+		"dbm=HashDBM,file=StdFile,num_buckets=131072,record_comp_mode=RECORD_COMP_ZSTD");
+	if (!hdb) {
+		TkrzwStatus s = tkrzw_get_last_status();
+		fprintf(stderr, "tkrzw: cannot open '%s': %s\n", path, s.message);
+		return NULL;
+	}
+	return hdb;
+}
+
+static void tkrzw_be_close(void *h)
+{
+	tkrzw_dbm_close((TkrzwDBM *)h);
+}
+
+static int tkrzw_be_put(void *h,
+                        const void *k, size_t kl,
+                        const void *v, size_t vl)
+{
+	return tkrzw_dbm_set((TkrzwDBM *)h,
+	                     k, (int32_t)kl,
+	                     v, (int32_t)vl,
+	                     1 /* overwrite */) ?
0 : -1; +} + +typedef struct { TkrzwDBMIter *it; } tkrzw_iter_t; + +static void *tkrzw_iter_new(void *h) +{ + tkrzw_iter_t *it = malloc(sizeof *it); + it->it = tkrzw_dbm_make_iterator((TkrzwDBM *)h); + tkrzw_dbm_iter_first(it->it); + return it; +} + +static int tkrzw_iter_next(void *iter, + void **key, size_t *klen, + void **val, size_t *vlen) +{ + tkrzw_iter_t *it = iter; + int32_t ks, vs; + char *k, *v; + /* tkrzw_dbm_iter_get returns malloc'd key and value; step separately */ + if (!tkrzw_dbm_iter_get(it->it, &k, &ks, &v, &vs)) return 0; + *key = k; *klen = (size_t)ks; + *val = v; *vlen = (size_t)vs; + tkrzw_dbm_iter_next(it->it); + return 1; +} + +static void tkrzw_iter_free(void *iter) +{ + tkrzw_iter_t *it = iter; + tkrzw_dbm_iter_free(it->it); + free(it); +} + +static size_t tkrzw_count(void *h) { return (size_t)tkrzw_dbm_count((TkrzwDBM *)h); } + +static const backend_ops_t tkrzw_ops = { + "tkrzw", + tkrzw_be_open, tkrzw_be_close, tkrzw_be_put, + tkrzw_iter_new, tkrzw_iter_next, tkrzw_iter_free, + tkrzw_count +}; +#endif /* HAVE_TKRZW */ + + +/* ============================================================ + * Backend registry + * ============================================================ */ + +static const backend_ops_t * const backends[] = { +#ifdef HAVE_TOKYOCABINET + &tc_ops, +#endif +#ifdef HAVE_KYOTOCABINET + &kc_ops, +#endif +#ifdef HAVE_LEVELDB + &ldb_ops, +#endif +#ifdef HAVE_SQLITE3 + &sq_ops, +#endif +#ifdef HAVE_LMDB + &mdb_ops, +#endif +#ifdef HAVE_TKRZW + &tkrzw_ops, +#endif + NULL +}; + +static const backend_ops_t *find_backend(const char *name) +{ + for (int i = 0; backends[i]; i++) + if (strcmp(backends[i]->name, name) == 0) + return backends[i]; + return NULL; +} + +static void list_backends(FILE *out) +{ + fprintf(out, "Compiled-in backends:"); + for (int i = 0; backends[i]; i++) + fprintf(out, " %s", backends[i]->name); + fprintf(out, "\n"); +} + + +/* ============================================================ + * Argument parsing 
helpers
+ * ============================================================ */
+
+static void usage(const char *argv0)
+{
+	fprintf(stderr,
+		"Usage: %s --from <format>:<path> --to <format>:<path>\n\n"
+		"Copies every key-value record from a duc database in one backend\n"
+		"format to a new database using a different backend. The migration\n"
+		"is a raw KV copy (below the duc abstraction layer), so all internal\n"
+		"duc keys (duc_db_version, duc_index_reports, …) are transferred too.\n\n",
+		argv0);
+	list_backends(stderr);
+}
+
+/* Split "format:path" on the first colon.
+ * Writes the format name into fmt (NUL-terminated) and sets *path.
+ * Returns 0 on success, -1 if no colon is present. */
+static int split_spec(const char *arg,
+                      char *fmt, size_t fmtsz,
+                      const char **path)
+{
+	const char *colon = strchr(arg, ':');
+	if (!colon) return -1;
+	size_t flen = (size_t)(colon - arg);
+	if (flen >= fmtsz) flen = fmtsz - 1;
+	memcpy(fmt, arg, flen);
+	fmt[flen] = '\0';
+	*path = colon + 1;
+	return 0;
+}
+
+
+/* ============================================================
+ * main
+ * ============================================================ */
+
+int main(int argc, char **argv)
+{
+	const char *from_arg = NULL;
+	const char *to_arg = NULL;
+
+	for (int i = 1; i < argc; i++) {
+		if (!strcmp(argv[i], "--from") && i + 1 < argc) from_arg = argv[++i];
+		else if (!strcmp(argv[i], "--to") && i + 1 < argc) to_arg = argv[++i];
+		else { usage(argv[0]); return 1; }
+	}
+
+	if (!from_arg || !to_arg) { usage(argv[0]); return 1; }
+
+	char from_fmt[64], to_fmt[64];
+	const char *from_path, *to_path;
+
+	if (split_spec(from_arg, from_fmt, sizeof from_fmt, &from_path) < 0) {
+		fprintf(stderr, "error: --from argument must be <format>:<path>\n");
+		return 1;
+	}
+	if (split_spec(to_arg, to_fmt, sizeof to_fmt, &to_path) < 0) {
+		fprintf(stderr, "error: --to argument must be <format>:<path>\n");
+		return 1;
+	}
+
+	const backend_ops_t *src_ops = find_backend(from_fmt);
+	const backend_ops_t *dst_ops = find_backend(to_fmt);
+
+	if
(!src_ops) { + fprintf(stderr, "error: unknown source backend '%s'\n", from_fmt); + list_backends(stderr); + return 1; + } + if (!dst_ops) { + fprintf(stderr, "error: unknown destination backend '%s'\n", to_fmt); + list_backends(stderr); + return 1; + } + + fprintf(stderr, "Migrating: %s:%s -> %s:%s\n", + from_fmt, from_path, to_fmt, to_path); + + void *src = src_ops->open(from_path, 1 /* read-only */); + if (!src) return 1; + + void *dst = dst_ops->open(to_path, 0 /* read-write */); + if (!dst) { src_ops->close(src); return 1; } + + size_t total = src_ops->count(src); + + fprintf(stderr, "Scanning..."); + fflush(stderr); + void *iter = src_ops->iter_new(src); + fprintf(stderr, "\r"); + fflush(stderr); + + void *key, *val; + size_t klen, vlen; + unsigned long done = 0, errors = 0; + const int BAR = 40; + + while (src_ops->iter_next(iter, &key, &klen, &val, &vlen)) { + if (dst_ops->put(dst, key, klen, val, vlen) != 0) + errors++; + free(key); + free(val); + done++; + if (done % 100 == 0 || done == 1) { + if (total > 0) { + int filled = (int)((double)done / total * BAR); + fprintf(stderr, "\r ["); + for (int i = 0; i < BAR; i++) + fputc(i < filled ? '=' : (i == filled ? '>' : ' '), stderr); + fprintf(stderr, "] %lu/%zu (%d%%)", + done, total, (int)((double)done / total * 100)); + } else { + fprintf(stderr, "\r %lu records", done); + } + fflush(stderr); + } + } + + /* Final completed bar */ + if (total > 0) { + fprintf(stderr, "\r ["); + for (int i = 0; i < BAR; i++) fputc('=', stderr); + fprintf(stderr, "] %lu/%lu (100%%)\n", done, done); + } else { + fprintf(stderr, "\r %lu records\n", done); + } + + src_ops->iter_free(iter); + src_ops->close(src); + dst_ops->close(dst); + + if (errors) + fprintf(stderr, " %lu write error(s)\n", errors); + fprintf(stderr, "Done: %lu records copied.\n", done); + + return errors ? 
1 : 0; +} diff --git a/testing/.gitignore b/testing/.gitignore new file mode 100644 index 00000000..cdd7f79d --- /dev/null +++ b/testing/.gitignore @@ -0,0 +1,3 @@ +*.log +duc-* +dbs \ No newline at end of file diff --git a/testing/README.md b/testing/README.md new file mode 100644 index 00000000..ed13ac9e --- /dev/null +++ b/testing/README.md @@ -0,0 +1,95 @@ +# duc — multi-backend testing + +This directory contains scripts for building, cross-testing, and testing migration of `duc` databases across all supported backends. + +## Backends + +| Backend | Binary | DB file / directory | |----------------|----------------------|----------------------------| | tkrzw | `duc-tkrzw` | `*.db` | | tokyocabinet | `duc-tokyocabinet` | `*.db` | | sqlite3 | `duc-sqlite3` | `*.db` | | lmdb | `duc-lmdb` | `*.db` | | leveldb | `duc-leveldb` | `*.dir/` (directory) | | kyotocabinet | `duc-kyotocabinet` | `*.db` | + +## Scripts + +### `build-all-backends.sh` + +Builds a separate `duc-<backend>` binary for every supported database backend. + +**Must be run from the `testing/` directory.** + +```bash +cd testing +bash build-all-backends.sh +``` + +- Runs `autoreconf -i` if `configure` is missing or older than `configure.ac`. +- For each backend: runs `./configure --with-db-backend=<backend>`, then `make`. +- Copies the resulting binary to `testing/duc-<backend>`. +- Saves full build output to `testing/build-<backend>.log`. +- Exits with a non-zero status if any backend fails to build. + +The number of parallel make jobs can be controlled via the `JOBS` environment +variable (defaults to `nproc`): + +```bash +JOBS=4 bash build-all-backends.sh +``` + +### `test-compare-backends.sh` + +Indexes a filesystem path with every available `duc-<backend>` binary, dumps +the result as JSON, and performs a pairwise comparison to verify that all +backends produce identical output. + +```bash +bash test-compare-backends.sh [PATH] +``` + +- `PATH` defaults to `/usr/share/doc` if not specified.
+- Skips any backend whose binary is not present in `testing/`. +- Database files are written to `testing/dbs/` and **kept after the run** for + further inspection. +- Exits with a non-zero status if any pair of backends produces different JSON. + +### `test-migrator.sh` + +Migrates every database in `testing/dbs/` to every other backend format using +the `migrator` binary, producing output databases in `testing/dbs/migrated/`. + +```bash +bash test-migrator.sh [--include-tkrzw-as-source] [PATH] +``` + +- Requires `../migrator/migrator` to be built (`cd ../migrator && make`). +- Requires source databases in `dbs/` (run `test-compare-backends.sh` first). +- Output files are named `<src>-to-<dst>.<ext>` (e.g. `sqlite3-to-lmdb.db`). +- LevelDB outputs use a `.dir` directory instead of a file. +- Per-migration stdout/stderr is saved to `dbs/migrated/logs/<src>-to-<dst>.log`. +- Exits with a non-zero status if any migration fails or times out. +- Each migration is time-limited; set `TIMEOUT` to override (default: 300 s). + +**tkrzw as source is disabled by default** because iterating over a tkrzw +database is extremely slow (several minutes per destination backend). tkrzw is +always available as a *destination*. To also migrate from tkrzw: + +```bash +bash test-migrator.sh --include-tkrzw-as-source +``` + +## Dependencies + +The following development libraries must be installed before building: + +```bash +sudo apt-get install \ + libtokyocabinet-dev \ + libkyotocabinet-dev \ + libleveldb-dev \ + liblmdb-dev \ + libsqlite3-dev \ + libtkrzw-dev +``` diff --git a/testing/build-all-backends.sh b/testing/build-all-backends.sh new file mode 100755 index 00000000..ce094e52 --- /dev/null +++ b/testing/build-all-backends.sh @@ -0,0 +1,80 @@ +#!/usr/bin/env bash +# +# Copyright (c) 2026 George Ruinelli +# +# build-all-backends.sh — Build duc for every supported database backend.
+# +# For each backend (tkrzw, tokyocabinet, sqlite3, lmdb, leveldb, kyotocabinet) +# this script runs ./configure --with-db-backend=<backend>, compiles duc, and +# copies the resulting binary as testing/duc-<backend>. Build output for each +# backend is saved to testing/build-<backend>.log. +# +# Usage: +# cd testing && bash build-all-backends.sh +# +# Environment: +# JOBS — number of parallel make jobs (default: nproc) +# +set -euo pipefail + +SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)" +ROOT_DIR="$(cd "$SCRIPT_DIR/.." && pwd)" +JOBS="${JOBS:-$(nproc)}" + +# Ensure the script is executed from within the testing/ directory +if [[ "$(pwd)" != "$SCRIPT_DIR" ]]; then + echo "error: must be run from the testing/ directory" >&2 + echo " cd $(basename "$SCRIPT_DIR") && bash $(basename "$0")" >&2 + exit 1 +fi + +BACKENDS=(tkrzw tokyocabinet sqlite3 lmdb leveldb kyotocabinet) + +cd "$ROOT_DIR" + +# Regenerate build system if configure is missing or older than configure.ac +if [[ ! -f configure || configure.ac -nt configure ]]; then + echo "==> Running autoreconf -i ..." + autoreconf -i +fi + +failed=() + +for backend in "${BACKENDS[@]}"; do + echo "" + echo "==> Building duc-$backend ..." + + if ! ./configure --with-db-backend="$backend" > "$SCRIPT_DIR/build-$backend.log" 2>&1; then + echo " configure FAILED (see testing/build-$backend.log)" + failed+=("$backend") + continue + fi + + if !
make -j"$JOBS" >> "$SCRIPT_DIR/build-$backend.log" 2>&1; then + echo " make FAILED (see testing/build-$backend.log)" + failed+=("$backend") + continue + fi + + cp duc "$SCRIPT_DIR/duc-$backend" + echo " -> $SCRIPT_DIR/duc-$backend OK" +done + +echo "" +echo "=== Build summary ===" +for backend in "${BACKENDS[@]}"; do + bin="$SCRIPT_DIR/duc-$backend" + if [[ " ${failed[*]:-} " == *" $backend "* ]]; then + echo " FAIL $backend" + elif [[ -x "$bin" ]]; then + echo " OK $backend ($("$bin" --version 2>&1 | head -1))" + else + echo " MISS $backend" + fi +done + +if [[ ${#failed[@]} -gt 0 ]]; then + echo "" + echo "Some backends failed: ${failed[*]}" + exit 1 +fi diff --git a/testing/test-compare-backends.sh b/testing/test-compare-backends.sh new file mode 100755 index 00000000..c09a49e1 --- /dev/null +++ b/testing/test-compare-backends.sh @@ -0,0 +1,111 @@ +#!/usr/bin/env bash +# +# Copyright (c) 2026 George Ruinelli +# +# test-compare-backends.sh — Index a path with every duc backend and compare JSON output. +# +# For each duc-<backend> binary found in the same directory, this script: +# 1. Indexes the given path into a persistent database in testing/dbs/. +# 2. Dumps the database content as JSON. +# 3. Performs a pairwise diff of all JSON outputs and reports any differences. +# +# Database files and the JSON outputs are kept in testing/dbs/ after the run for further inspection. +# +# Requirements: +# Run build-all-backends.sh first to compile the duc-<backend> binaries that +# this script expects to find in the same directory.
+# +# Usage: +# bash test-compare-backends.sh [PATH] +# +# Arguments: +# PATH — filesystem path to index (default: /usr/share/doc) +# +set -euo pipefail + +SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)" +INDEX_PATH="${1:-/usr/share/doc}" + +BACKENDS=(tkrzw tokyocabinet sqlite3 lmdb leveldb kyotocabinet) +DBDIR="$SCRIPT_DIR/dbs" + +mkdir -p "$DBDIR" + +echo "Indexing path: $INDEX_PATH" +echo "DB dir: $DBDIR" +echo "" + +# Index and dump JSON for each backend +for backend in "${BACKENDS[@]}"; do + bin="$SCRIPT_DIR/duc-$backend" + if [[ ! -x "$bin" ]]; then + echo "[$backend] SKIP — binary not found: $bin" + continue + fi + + # leveldb uses a directory as DB path + if [[ "$backend" == "leveldb" ]]; then + db="$DBDIR/$backend.dir" + else + db="$DBDIR/$backend.db" + fi + + json_file="$DBDIR/$backend.json" + + rm -rf "$db" + + echo -n "[$backend] indexing ... " + if "$bin" index -q -d "$db" "$INDEX_PATH" 2>&1; then + echo -n "done. dumping json ... " + "$bin" json -d "$db" "$INDEX_PATH" > "$json_file" 2>&1 + echo "done. ($(wc -c < "$json_file") bytes)" + else + echo "FAILED" + continue + fi +done + +echo "" +echo "=== Pairwise JSON comparison ===" +echo "" + +# Collect successfully produced JSON files +successful=() +for backend in "${BACKENDS[@]}"; do + f="$DBDIR/$backend.json" + [[ -s "$f" ]] && successful+=("$backend") +done + +if [[ ${#successful[@]} -lt 2 ]]; then + echo "Need at least 2 successful backends to compare." + exit 1 +fi + +all_match=true +for ((i = 0; i < ${#successful[@]}; i++)); do + for ((j = i + 1; j < ${#successful[@]}; j++)); do + a="${successful[$i]}" + b="${successful[$j]}" + fa="$DBDIR/$a.json" + fb="$DBDIR/$b.json" + if diff -q "$fa" "$fb" > /dev/null 2>&1; then + echo " $a == $b [identical]" + else + echo " $a != $b [DIFFER]" + all_match=false + diff --unified=3 "$fa" "$fb" | head -40 || true + echo " ..." + fi + done +done + +# Remove lock files left behind by backends (e.g. 
lmdb creates a .db-lock) +rm -f "$DBDIR"/*.lock "$DBDIR"/*.db-lock + +echo "" +if $all_match; then + echo "Result: all backends produce identical JSON output." +else + echo "Result: differences found between backends (see above)." + exit 1 +fi diff --git a/testing/test-migrator.sh b/testing/test-migrator.sh new file mode 100755 index 00000000..121f2bab --- /dev/null +++ b/testing/test-migrator.sh @@ -0,0 +1,223 @@ +#!/usr/bin/env bash +# +# Copyright (c) 2026 George Ruinelli +# +# test-migrator.sh — Migrate every duc database in dbs/ to every other backend format. +# +# For each source database found in testing/dbs/ the script invokes the migrator +# binary for every other backend, producing a converted database in +# testing/dbs/migrated/. Output files are named <src>-to-<dst>.<ext> +# (or <src>-to-<dst>.dir for LevelDB). Per-migration logs are written to +# testing/dbs/migrated/logs/. +# +# Any existing output file/directory for a given pair is removed before +# migrating so the run is always clean and reproducible. +# +# Usage: +# bash test-migrator.sh [--include-tkrzw-as-source] [PATH] +# +# Arguments: +# PATH — filesystem path that was indexed (default: /usr/share/doc) +# Must match the path used when running test-compare-backends.sh. +# +# Options: +# --include-tkrzw-as-source +# Also migrate FROM the tkrzw database. Disabled by default because +# tkrzw source iteration is extremely slow (several minutes per +# destination). tkrzw is always available as a migration destination.
+# +# Environment: +# TIMEOUT — seconds allowed per migration before it is killed (default: 300) +# +# Requirements: +# - ../migrator/migrator must be built (cd ../migrator && make) +# - Source databases and JSON files must exist in dbs/ +# (run test-compare-backends.sh first) +# +set -euo pipefail + +SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)" + +INCLUDE_TKRZW_SOURCE=0 +POSITIONAL=() +for _arg in "$@"; do + case "$_arg" in + --include-tkrzw-as-source) INCLUDE_TKRZW_SOURCE=1 ;; + *) POSITIONAL+=("$_arg") ;; + esac +done +INDEX_PATH="${POSITIONAL[0]:-/usr/share/doc}" +DBDIR="$SCRIPT_DIR/dbs" +OUTDIR="$DBDIR/migrated" +MIGRATOR="$SCRIPT_DIR/../migrator/migrator" +LOGDIR="$OUTDIR/logs" +TIMEOUT="${TIMEOUT:-300}" + +if [[ ! -x "$MIGRATOR" ]]; then + echo "error: migrator binary not found: $MIGRATOR" >&2 + echo " cd ../migrator && make" >&2 + exit 1 +fi + +mkdir -p "$OUTDIR" "$LOGDIR" + +# Map each backend to its source path and file extension +declare -A DB_PATH +declare -A DB_EXT +DB_PATH[tkrzw]="$DBDIR/tkrzw.db" +DB_EXT[tkrzw]="db" +DB_PATH[tokyocabinet]="$DBDIR/tokyocabinet.db" +DB_EXT[tokyocabinet]="db" +DB_PATH[sqlite3]="$DBDIR/sqlite3.db" +DB_EXT[sqlite3]="db" +DB_PATH[lmdb]="$DBDIR/lmdb.db" +DB_EXT[lmdb]="db" +DB_PATH[leveldb]="$DBDIR/leveldb.dir" +DB_EXT[leveldb]="dir" +DB_PATH[kyotocabinet]="$DBDIR/kyotocabinet.db" +DB_EXT[kyotocabinet]="db" + +BACKENDS=(tokyocabinet kyotocabinet sqlite3 lmdb leveldb tkrzw) + +migrate_failed=() +migrate_ok=() +skipped=() +json_failed=() +json_ok=() +diff_fail=() +diff_ok=() + +# ============================================================ +# Phase 1 — Migrate all databases +# ============================================================ +echo "=== Phase 1: Migrate ===" +echo "" + +for src in "${BACKENDS[@]}"; do + src_path="${DB_PATH[$src]}" + if [[ ! 
-e "$src_path" ]]; then + echo " [$src] SKIP — source DB not found: $src_path" + skipped+=("$src") + continue + fi + + if [[ "$src" == "tkrzw" ]]; then + if [[ "$INCLUDE_TKRZW_SOURCE" != "1" ]]; then + echo " [tkrzw] SKIP as source (pass --include-tkrzw-as-source to enable; iteration is very slow)" + skipped+=("tkrzw-as-source") + continue + fi + echo " [tkrzw] WARNING: tkrzw source iteration is very slow — this may take several minutes per destination" + fi + + for dst in "${BACKENDS[@]}"; do + [[ "$src" == "$dst" ]] && continue + + dst_ext="${DB_EXT[$dst]}" + out_path="$OUTDIR/${src}-to-${dst}.${dst_ext}" + log="$LOGDIR/${src}-to-${dst}.log" + + rm -rf "$out_path" + + printf " %-14s -> %-14s ... " "$src" "$dst" + if timeout "$TIMEOUT" "$MIGRATOR" --from "${src}:${src_path}" --to "${dst}:${out_path}" > "$log" 2>&1; then + echo "ok" + migrate_ok+=("${src}-to-${dst}") + else + rc=$? + [[ $rc -eq 124 ]] && echo "TIMEOUT (>${TIMEOUT}s)" || echo "FAILED (rc=$rc)" + migrate_failed+=("${src}-to-${dst}") + fi + done +done + +# ============================================================ +# Phase 2 — Export each migrated database to JSON +# ============================================================ +echo "" +echo "=== Phase 2: Export JSON ===" +echo "" + +for pair in "${migrate_ok[@]}"; do + src="${pair%%-to-*}" + dst="${pair##*-to-}" + dst_ext="${DB_EXT[$dst]}" + out_path="$OUTDIR/${pair}.${dst_ext}" + migrated_json="$OUTDIR/${pair}.json" + dst_bin="$SCRIPT_DIR/duc-$dst" + + printf " %-30s ... " "$pair" + if [[ ! 
-x "$dst_bin" ]]; then + echo "SKIP (duc-$dst not found)" + continue + fi + if "$dst_bin" json -d "$out_path" "$INDEX_PATH" > "$migrated_json" 2>&1; then + echo "ok ($(wc -c < "$migrated_json") bytes)" + json_ok+=("$pair") + else + echo "FAILED" + json_failed+=("$pair") + fi +done + +# ============================================================ +# Phase 3 — Compare each migrated JSON against source JSON +# ============================================================ +echo "" +echo "=== Phase 3: Compare JSON ===" +echo "" + +for pair in "${json_ok[@]}"; do + src="${pair%%-to-*}" + src_json="$DBDIR/${src}.json" + migrated_json="$OUTDIR/${pair}.json" + + printf " %-30s ... " "$pair" + if [[ ! -s "$src_json" ]]; then + echo "SKIP (no source JSON for $src)" + continue + fi + if diff -q "$src_json" "$migrated_json" > /dev/null 2>&1; then + echo "match" + diff_ok+=("$pair") + else + echo "DIFFER" + diff_fail+=("$pair") + fi +done + +# ============================================================ +# Summary +# ============================================================ +echo "" +echo "=== Summary ===" +total=$(( ${#BACKENDS[@]} * (${#BACKENDS[@]} - 1) )) +skipped_pairs=$(( ${#skipped[@]} * (${#BACKENDS[@]} - 1) )) +attempted=$(( total - skipped_pairs )) +echo " Migrations possible : $total" +echo " Migrations skipped : $skipped_pairs" +echo " Migrations attempted : $attempted" +echo " Migration failed : ${#migrate_failed[@]}" +echo " JSON export failed : ${#json_failed[@]}" +echo " JSON match : ${#diff_ok[@]}" +echo " JSON differ : ${#diff_fail[@]}" + +if [[ ${#diff_fail[@]} -gt 0 ]]; then + echo "" + echo "Migrations with JSON differences:" + for f in "${diff_fail[@]}"; do + echo " --- $f ---" + diff --unified=3 "$DBDIR/${f%%-to-*}.json" "$OUTDIR/$f.json" | head -20 || true + echo "" + done +fi + +if [[ ${#migrate_failed[@]} -gt 0 ]]; then + echo "" + echo "Failed migrations:" + for f in "${migrate_failed[@]}"; do echo " $f"; done +fi + +if [[ ${#migrate_failed[@]} -gt 0 
|| ${#diff_fail[@]} -gt 0 || ${#json_failed[@]} -gt 0 ]]; then + exit 1 +fi
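
The `--from`/`--to` arguments taken by the migrator use a `format:path` spec that is split on the *first* colon, which is what lets paths containing colons survive. For quick experimentation outside the C tool, that parsing can be sketched in a few lines of shell; `split_spec` here is a hypothetical stand-in mirroring the C function of the same name, not part of the scripts above:

```shell
#!/usr/bin/env bash
# Sketch of the migrator's spec parsing (assumption: same first-colon
# semantics as the C split_spec): print "<format> <path>" for a valid
# spec, fail for a spec with no colon.
split_spec() {
    local arg="$1"
    [[ "$arg" == *:* ]] || return 1          # no colon -> invalid spec
    # %%:* keeps everything before the FIRST colon; #*: keeps the rest,
    # so a path that itself contains colons is preserved intact.
    printf '%s %s\n' "${arg%%:*}" "${arg#*:}"
}

split_spec "sqlite3:/tmp/duc.db"    # -> sqlite3 /tmp/duc.db
split_spec "lmdb:/data/a:b.db"      # -> lmdb /data/a:b.db
```

Splitting on the first colon rather than the last is the safer convention here, since the format name can never contain a colon but a filesystem path can.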