36 commits
dd820e1
adds DB documentation and an initial migrator
Apr 3, 2026
da87055
add tool to build DUC for any of the available backends.
Apr 3, 2026
39d9d29
added script to test indexing with the different backends
Apr 3, 2026
d08836e
added readme
Apr 3, 2026
d7b6a90
fix migrator build errors: kccurfirst→kccurjump, mdb_open macro confl…
Apr 3, 2026
9922c4b
migrator: open tkrzw with RECORD_COMP_ZSTD to match duc's default com…
Apr 3, 2026
2a1e56d
added migration testscript
Apr 3, 2026
6dc59d8
added readme
Apr 3, 2026
f2caf81
migrator: fix tkrzw offset_width mismatch and kyotocabinet double-fre…
Apr 3, 2026
b9e73b6
renamed file
Apr 3, 2026
2a1bd32
updated script
Apr 3, 2026
0864a62
migrator: fix sqlite3 key binding — use bind_text to match duc's db-s…
Apr 3, 2026
84331a9
.
Apr 3, 2026
8d23215
updated timeout
Apr 3, 2026
c6e3be0
tkrzw: cap num_buckets=131072 to avoid 100M-bucket default (476 MB sp…
Apr 3, 2026
3d4285e
revert: db-tkrzw.c num_buckets change — do not touch duc core
Apr 3, 2026
cb52d3d
update test script
Apr 3, 2026
bdcb20f
migrator: add progress output — scan/copy status lines with flush
Apr 3, 2026
b20f3ed
.
Apr 3, 2026
5a9af1a
migrator: add count() per backend; replace verbose output with ASCII …
Apr 3, 2026
ade47d2
migrator: warn (not error) if fix-crashes-on-indexing not in ancestry…
Apr 3, 2026
e68e211
migrator: remove branch ancestry check from Makefile
Apr 3, 2026
d4679be
fix migrator
Apr 3, 2026
48a8319
show warning
Apr 3, 2026
4224709
.
Apr 3, 2026
774bd94
revert: restore all duc source files to master state — add-migrator-t…
Apr 3, 2026
47b73e1
added copyright to scripts
Apr 3, 2026
20622ae
update readme
Apr 3, 2026
6f5741d
docs: move db-formats.md to repo root; update references in migrator/…
Apr 3, 2026
fb705e1
docs: rename db-formats.md to DB-FORMATS.md; update all references
Apr 3, 2026
3b3486a
docs: remove DB-FORMATS.md reference from root README
Apr 3, 2026
cf2d51f
.
Apr 3, 2026
67c1bc3
update doc
Apr 3, 2026
0b02563
update doc
Apr 3, 2026
bf88d0c
update doc
Apr 3, 2026
0b9b0d3
revert: restore duc core files to v1.5.0-rc2 state — add-migrator-too…
Apr 4, 2026
134 changes: 134 additions & 0 deletions DB-FORMATS.md
@@ -0,0 +1,134 @@
# duc DB Backend Formats

Reference for all database backends supported across duc versions, derived from
the source implementations in `src/libduc/db-*.c` and `configure.ac`.


## Summary Table

| Backend | File/Dir | Single file | Compression | Version key | Default in |
|----------------|----------|-------------|-------------------|-------------|-------------|
| Tokyo Cabinet | File | Yes | Optional (deflate)| Yes | 1.4.6 |
| Kyoto Cabinet | File | Yes | Always (kct opts) | Yes | — |
| LevelDB | Dir | **No** | Always (Snappy) | No | — |
| SQLite3 | File | Yes | None | No | — |
| LMDB | File | Yes | None | No | — |
| Tkrzw | File | Yes | Optional (ZSTD) | Yes | 1.5.0-rc2 |

---

## Tokyo Cabinet (`tokyocabinet`)

- **Introduced:** ≤ 1.4.6
- **Default in:** 1.4.6
- **Storage layout:** Single file on disk
- **Magic header (first bytes):** `ToKyO CaBiNeT`
- **Internal type:** `TCBDB` — B+ Tree Database
- **Compression:** Optional deflate (`BDBTDEFLATE`), enabled via `--compress` flag
- **Tuning:** `tcbdbtune(hdb, 256, 512, 131072, 9, 11, BDBTLARGE [| BDBTDEFLATE])`
- **Version check:** Stores and validates `duc_db_version` key on open
- **Notes:**
  - The `BDBTLARGE` flag is always set, allowing the file to exceed 2 GB.
  - `DUC_OPEN_FORCE` triggers `BDBOTRUNC`, which truncates and recreates the file.

---

## Kyoto Cabinet (`kyotocabinet`)

- **Introduced:** ≤ 1.4.6
- **Default in:** —
- **Storage layout:** Single file on disk
- **Magic header (first bytes):** `Kyoto CaBiNeT`
- **Internal type:** KCT (Tree Cabinet), opened with `#type=kct#opts=c`
- **Compression:** Enabled unconditionally via `opts=c` in the open string
- **Version check:** Stores and validates `duc_db_version` key on open
- **Notes:**
  - Error mapping is incomplete; all backend errors map to `DUC_E_UNKNOWN`.
  - The `DUC_OPEN_COMPRESS` flag is accepted but has no additional effect since
    compression is always on via the open string.

---

## LevelDB (`leveldb`)

- **Introduced:** ≤ 1.4.6
- **Default in:** —
- **Storage layout:** **Directory** (not a single file); LevelDB stores multiple
SSTable (`.ldb`/`.sst`) and manifest files inside a directory.
- **Magic header:** N/A — detected as a directory by `duc_db_type_check()`
- **Compression:** Snappy compression is always enabled
(`leveldb_snappy_compression`); the `DUC_OPEN_COMPRESS` flag has no effect.
- **Version check:** None — does not store or check `duc_db_version`
- **Notes:**
  - Because the path is a directory, it behaves differently from all other
    backends when specifying `--database`.
  - `leveldb_options_set_create_if_missing` is always set; the DB is created
    automatically if it does not exist.

---

## SQLite3 (`sqlite3`)

- **Introduced:** ≤ 1.4.6
- **Default in:** —
- **Storage layout:** Single file on disk
- **Magic header (first bytes):** `SQLite format 3`
- **Internal schema:** Single table `blobs(key UNIQUE PRIMARY KEY, value)` with
an additional index `keys` on the `key` column.
- **Compression:** None — no compression support
- **Version check:** None — does not store or check `duc_db_version`
- **Notes:**
  - All writes are batched inside a single `BEGIN`/`COMMIT` transaction that
    spans the lifetime of the open database (committed on `db_close`).
  - On open, a deliberate bogus query (`select bogus from bogus`) is run to
    detect corrupt files that `sqlite3_open()` would otherwise accept silently.
  - `insert or replace` semantics are used, so re-indexing a path overwrites the
    previous entry cleanly.
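
The magic headers listed so far (Tokyo Cabinet, Kyoto Cabinet, SQLite3) lend themselves to a plain prefix comparison. The following is a minimal sketch of that idea, not duc's `duc_db_type_check()`: `guess_backend` is a hypothetical helper, the returned names are simply the backend identifiers used in this document, and anything without a listed magic string is reported as `unknown`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: map the first bytes of a database file to a backend
 * name, using only the magic strings listed in this document. */
const char *guess_backend(const unsigned char *buf, size_t len)
{
    static const struct { const char *magic; const char *name; } known[] = {
        { "ToKyO CaBiNeT",   "tokyocabinet" },
        { "Kyoto CaBiNeT",   "kyotocabinet" },
        { "SQLite format 3", "sqlite3"      },
    };

    assert(buf != NULL || len == 0);

    for (size_t i = 0; i < sizeof known / sizeof known[0]; i++) {
        size_t n = strlen(known[i].magic);
        /* Only compare when enough bytes were read to hold the magic. */
        if (len >= n && memcmp(buf, known[i].magic, n) == 0)
            return known[i].name;
    }
    return "unknown";   /* no listed magic matched, or the file is too short */
}
```

A caller would read the first 16 or so bytes of a regular file and pass them in; a path that turns out to be a directory would be treated as LevelDB before this check is reached.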

---

## LMDB (`lmdb`)

- **Introduced:** ≤ 1.4.6
- **Default in:** —
- **Storage layout:** Single file on disk (opened with `MDB_NOSUBDIR`)
- **Magic header:** Standard LMDB file header (not checked by duc's type
detector; falls through to `unknown`)
- **Compression:** None — no compression support
- **Version check:** None — does not store or check `duc_db_version`
- **Memory map size:**
  - 32-bit platforms: 1 GB (`1024 * 1024 * 1024`)
  - 64-bit platforms: 256 GB (`1024 * 1024 * 1024 * 256`)
- **Notes:**
  - Uses a single write transaction (`MDB_txn`) for all puts, committed on
    `db_close`. A write error in `db_put` calls `exit(1)` immediately.
  - The large pre-allocated map size is a virtual address reservation only;
    actual disk usage grows on demand.
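
One detail worth keeping in mind when reading the map-size expressions above: written with plain `int` literals, the 256 GB product would not fit in a 32-bit `int`. How duc's source types these constants is not shown here, so the sketch below is only an illustration of doing the same arithmetic in a 64-bit type. `lmdb_map_size` is a hypothetical name, and the pointer width is passed in as a parameter so both cases can be exercised on any machine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative only: return the map size described above for a given
 * pointer width (4 or 8 bytes), computed in 64-bit arithmetic. */
uint64_t lmdb_map_size(size_t pointer_bytes)
{
    const uint64_t gib = UINT64_C(1024) * 1024 * 1024;   /* 1 GiB */

    assert(pointer_bytes == 4 || pointer_bytes == 8);

    return pointer_bytes == 8 ? gib * 256 : gib;
}
```

A typical call would be `lmdb_map_size(sizeof(void *))`, with the result cast to `size_t` before being handed to `mdb_env_set_mapsize()`.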

---

## Tkrzw (`tkrzw`)

- **Introduced:** 1.5.0-rc2
- **Default in:** 1.5.0-rc2
- **Storage layout:** Single file on disk
- **Magic header:** Tkrzw-specific header (not yet checked by duc's type
detector)
- **Internal type:** `HashDBM` with `StdFile` file driver
- **Base open options:** `dbm=HashDBM,file=StdFile,offset_width=5`
- **Compression:** Optional ZSTD record compression (`record_comp_mode=RECORD_COMP_ZSTD`),
enabled at compile time via `--with-tkrzw-zstd` and at runtime via the
`DUC_OPEN_COMPRESS` flag. Falls back to `NONE` if not compiled in.
- **Version check:** Stores and validates `duc_db_version` key on open
- **Filesystem size hints:** The `num_buckets` tuning parameter is scaled via
  new `DUC_FS_*` flags:
  | Flag | `num_buckets` |
  |-------------------|---------------|
  | `DUC_FS_BIG` | 100,000,000 |
  | `DUC_FS_BIGGER` | 1,000,000,000 |
  | `DUC_FS_BIGGEST` | 10,000,000,000|
- **Notes:**
  - `DUC_OPEN_FORCE` appends `,truncate=true` to the options string, recreating
    the file.
  - Tkrzw is a successor/spiritual replacement for both Tokyo Cabinet and Kyoto
    Cabinet, providing a modern hash-based store with better compression options.
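
To make the option handling above concrete, here is a rough sketch of how the pieces could be assembled into one Tkrzw option string. It is written under several assumptions: the `SK_*` flag bits are hypothetical stand-ins (the real values of `DUC_OPEN_*` and `DUC_FS_*` are not shown in this document), the order of the appended options is arbitrary, and passing the bucket count as a `num_buckets=` option is a guess at how the tuning parameter reaches Tkrzw.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical flag bits for this sketch only. */
enum {
    SK_OPEN_COMPRESS = 1 << 0,
    SK_OPEN_FORCE    = 1 << 1,
    SK_FS_BIG        = 1 << 2,
    SK_FS_BIGGER     = 1 << 3,
    SK_FS_BIGGEST    = 1 << 4
};

/* Append s to buf (current length *len); returns -1 if it does not fit. */
static int append(char *buf, size_t size, size_t *len, const char *s)
{
    size_t n = strlen(s);
    if (n >= size - *len)
        return -1;
    memcpy(buf + *len, s, n + 1);   /* copies the terminating NUL too */
    *len += n;
    return 0;
}

/* Build the option string into buf. have_zstd says whether ZSTD support was
 * compiled in. Returns 0 on success, -1 if buf is too small. */
int tkrzw_options(char *buf, size_t size, int flags, int have_zstd)
{
    const char *buckets = "";
    size_t len = 0;

    assert(buf != NULL && size > 0);
    buf[0] = '\0';

    /* Bucket counts taken from the table above; largest hint wins. */
    if (flags & SK_FS_BIGGEST)     buckets = ",num_buckets=10000000000";
    else if (flags & SK_FS_BIGGER) buckets = ",num_buckets=1000000000";
    else if (flags & SK_FS_BIG)    buckets = ",num_buckets=100000000";

    if (append(buf, size, &len, "dbm=HashDBM,file=StdFile,offset_width=5") != 0 ||
        append(buf, size, &len, buckets) != 0)
        return -1;
    /* Compression is requested only when asked for AND compiled in. */
    if ((flags & SK_OPEN_COMPRESS) && have_zstd &&
        append(buf, size, &len, ",record_comp_mode=RECORD_COMP_ZSTD") != 0)
        return -1;
    if ((flags & SK_OPEN_FORCE) &&
        append(buf, size, &len, ",truncate=true") != 0)
        return -1;
    return 0;
}
```

With no flags set the result is just the base option string; with compression requested but ZSTD not compiled in, the compression option is simply left out, which mirrors the documented fallback to no compression.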
94 changes: 94 additions & 0 deletions migrator/Makefile
@@ -0,0 +1,94 @@
CC = gcc
CFLAGS = -O2 -Wall -Wextra -std=c11
LDFLAGS =

# ---------------------------------------------------------------------------
# Auto-detect available backends via pkg-config (or direct lib checks).
# Each enabled backend appends -DHAVE_<NAME> plus its compile/link flags.
# ---------------------------------------------------------------------------

# --- Tokyo Cabinet ----------------------------------------------------------
TC_LIBS := $(shell pkg-config --libs tokyocabinet 2>/dev/null)
TC_CFLAGS := $(shell pkg-config --cflags tokyocabinet 2>/dev/null)
ifneq ($(TC_LIBS),)
CFLAGS += -DHAVE_TOKYOCABINET $(TC_CFLAGS)
LDFLAGS += $(TC_LIBS)
$(info [+] Tokyo Cabinet detected)
else
$(info [-] Tokyo Cabinet not found (install: libtokyocabinet-dev))
endif

# --- Kyoto Cabinet ----------------------------------------------------------
KC_LIBS := $(shell pkg-config --libs kyotocabinet 2>/dev/null)
KC_CFLAGS := $(shell pkg-config --cflags kyotocabinet 2>/dev/null)
ifneq ($(KC_LIBS),)
CFLAGS += -DHAVE_KYOTOCABINET $(KC_CFLAGS)
LDFLAGS += $(KC_LIBS)
$(info [+] Kyoto Cabinet detected)
else
$(info [-] Kyoto Cabinet not found (install: libkyotocabinet-dev))
endif

# --- LevelDB ----------------------------------------------------------------
LDB_LIBS := $(shell pkg-config --libs leveldb 2>/dev/null)
LDB_CFLAGS := $(shell pkg-config --cflags leveldb 2>/dev/null)
ifneq ($(LDB_LIBS),)
CFLAGS += -DHAVE_LEVELDB $(LDB_CFLAGS)
LDFLAGS += $(LDB_LIBS)
$(info [+] LevelDB detected)
else
# Fallback: try direct link (common on Debian/Ubuntu without .pc file)
LDB_DIRECT := $(shell echo 'int main(){}' | $(CC) -x c - -lleveldb -o /dev/null 2>/dev/null && echo yes)
ifeq ($(LDB_DIRECT),yes)
CFLAGS += -DHAVE_LEVELDB
LDFLAGS += -lleveldb
$(info [+] LevelDB detected (direct link))
else
$(info [-] LevelDB not found (install: libleveldb-dev))
endif
endif

# --- SQLite3 ----------------------------------------------------------------
SQ_LIBS := $(shell pkg-config --libs sqlite3 2>/dev/null)
SQ_CFLAGS := $(shell pkg-config --cflags sqlite3 2>/dev/null)
ifneq ($(SQ_LIBS),)
CFLAGS += -DHAVE_SQLITE3 $(SQ_CFLAGS)
LDFLAGS += $(SQ_LIBS)
$(info [+] SQLite3 detected)
else
$(info [-] SQLite3 not found (install: libsqlite3-dev))
endif

# --- LMDB -------------------------------------------------------------------
MDB_DIRECT := $(shell echo 'int main(){}' | $(CC) -x c - -llmdb -o /dev/null 2>/dev/null && echo yes)
ifeq ($(MDB_DIRECT),yes)
CFLAGS += -DHAVE_LMDB
LDFLAGS += -llmdb
$(info [+] LMDB detected)
else
$(info [-] LMDB not found (install: liblmdb-dev))
endif

# --- Tkrzw ------------------------------------------------------------------
TKRZW_DIRECT := $(shell echo 'int main(){}' | $(CC) -x c - -ltkrzw -o /dev/null 2>/dev/null && echo yes)
ifeq ($(TKRZW_DIRECT),yes)
CFLAGS += -DHAVE_TKRZW
LDFLAGS += -ltkrzw
$(info [+] Tkrzw detected)
else
$(info [-] Tkrzw not found (install: libtkrzw-dev))
endif

# ---------------------------------------------------------------------------

TARGET = migrator

.PHONY: all clean

all: $(TARGET)

$(TARGET): migrator.c
	$(CC) $(CFLAGS) -o $@ $< $(LDFLAGS)

clean:
	rm -f $(TARGET)
156 changes: 156 additions & 0 deletions migrator/README.md
@@ -0,0 +1,156 @@
# duc Database Migrator

A standalone command-line tool that converts a duc index database from any
supported backend format to any other, without losing data.

For a detailed description of each backend's on-disk format, internal
structure, and quirks see **[DB-FORMATS.md](../DB-FORMATS.md)**.

---

## Overview

duc stores its index as a simple key-value database. The backend is chosen
at compile time; all backends share the same logical schema but differ in
file format, compression, and performance characteristics.

`migrator` links every available backend into a single binary and performs a
raw KV copy between them — all duc-internal keys (`duc_db_version`,
`duc_index_reports`, path records, …) are transferred verbatim.

Typical use cases:

- Upgrading from the 1.4.6 default (`tokyocabinet`) to the 1.5.0 default (`tkrzw`)
- Converting a `leveldb` directory-based database to a single-file format
- Switching to `sqlite3` for inspection with standard SQL tooling

---

## Supported Backends

| Backend | Format | Compression | Default in |
|-----------------|---------|-------------------|-------------|
| `tokyocabinet` | File | Optional (deflate)| 1.4.6 |
| `kyotocabinet` | File | Always (kct) | — |
| `leveldb` | **Dir** | Always (Snappy) | — |
| `sqlite3` | File | None | — |
| `lmdb` | File | None | — |
| `tkrzw` | File | Optional (ZSTD) | 1.5.0-rc2 |

All backends listed above are compiled into one binary if the corresponding
library is present at build time. The Makefile reports which ones were
detected.

> **Note on LevelDB:** the `path` for a LevelDB database is a **directory**,
> not a file. Pass the directory path to `--from` or `--to` accordingly.

---

## Building

Dependencies are auto-detected via `pkg-config`. LMDB and Tkrzw, which often
lack `.pc` files, are probed with a direct test link instead, and LevelDB
falls back to the same probe when `pkg-config` does not find it.

```sh
cd migrator
make
```

Example output showing which backends were found:

```
[+] Tokyo Cabinet detected
[-] Kyoto Cabinet not found (install: libkyotocabinet-dev)
[+] LevelDB detected (direct link)
[+] SQLite3 detected
[+] LMDB detected
[+] Tkrzw detected
```

At least two backends must be compiled in to perform a migration.
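
The `-DHAVE_<NAME>` macros set by the Makefile are what decide which backends end up in the binary. As a hedged illustration of how a program can turn those macros into a runtime list and enforce the two-backend minimum, consider the sketch below. The table and the function names are hypothetical; `migrator.c` may organise this differently.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Backends compiled in, selected by the HAVE_* macros from the Makefile.
 * NULL-terminated so the table stays valid even when nothing is enabled. */
static const char *const compiled_backends[] = {
#ifdef HAVE_TOKYOCABINET
    "tokyocabinet",
#endif
#ifdef HAVE_KYOTOCABINET
    "kyotocabinet",
#endif
#ifdef HAVE_LEVELDB
    "leveldb",
#endif
#ifdef HAVE_SQLITE3
    "sqlite3",
#endif
#ifdef HAVE_LMDB
    "lmdb",
#endif
#ifdef HAVE_TKRZW
    "tkrzw",
#endif
    NULL
};

/* Number of backends that were enabled at compile time. */
size_t count_backends(void)
{
    size_t n = 0;
    while (compiled_backends[n] != NULL)
        n++;
    return n;
}

/* 1 if the named backend was compiled in, 0 otherwise. */
int has_backend(const char *name)
{
    assert(name != NULL);
    for (size_t i = 0; compiled_backends[i] != NULL; i++)
        if (strcmp(compiled_backends[i], name) == 0)
            return 1;
    return 0;
}

/* A migration needs a source and a destination backend. */
int can_migrate(void)
{
    return count_backends() >= 2;
}
```

Checking `has_backend()` for both the `--from` and `--to` formats before opening anything gives a clear error message when a requested backend was not detected at build time.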

### Manual flags

If auto-detection fails you can pass flags directly:

```sh
make CFLAGS="-DHAVE_TOKYOCABINET -DHAVE_TKRZW" \
LDFLAGS="-ltokyocabinet -ltkrzw"
```

---

## Usage

```
./migrator --from <format>:<path> --to <format>:<path>
```

`format` is one of the backend names in the table above; `path` is the
filesystem path to the database file (or directory for LevelDB).
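
Since paths may themselves contain colons, a `<format>:<path>` argument is most safely split at the first colon only. The sketch below shows one way to do that; `split_spec` is a hypothetical helper and is not taken from `migrator.c`, whose actual parsing is not shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Split "format:path" at the first colon.
 * On success the format is copied into fmt (NUL-terminated) and a pointer to
 * the path inside spec is returned. Returns NULL if the colon is missing,
 * either side is empty, or the format does not fit into fmt. */
const char *split_spec(const char *spec, char *fmt, size_t fmt_size)
{
    const char *colon;
    size_t n;

    assert(spec != NULL && fmt != NULL);

    colon = strchr(spec, ':');
    if (colon == NULL)
        return NULL;

    n = (size_t)(colon - spec);
    if (n == 0 || n >= fmt_size || colon[1] == '\0')
        return NULL;

    memcpy(fmt, spec, n);
    fmt[n] = '\0';
    return colon + 1;
}
```

For example, `sqlite3:/tmp/a:b` yields the format `sqlite3` and the path `/tmp/a:b`, while `lmdb:` and `:path` are both rejected.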

### Examples

**Tokyo Cabinet → Tkrzw** (the common 1.4.6 → 1.5.0 upgrade path):

```sh
# "~" is only expanded at the start of a word, so use $HOME after "<format>:".
./migrator \
--from "tokyocabinet:$HOME/.cache/duc/duc.db" \
--to "tkrzw:$HOME/.cache/duc/duc.tkrzw.db"
```

**Tokyo Cabinet → SQLite3** (for ad-hoc SQL inspection):

```sh
./migrator \
--from tokyocabinet:/var/cache/duc/duc.db \
--to sqlite3:/tmp/duc-inspect.sqlite
# Then: sqlite3 /tmp/duc-inspect.sqlite "select key from blobs"
```

**LevelDB directory → LMDB single file:**

```sh
./migrator \
--from leveldb:/var/cache/duc/duc-leveldb/ \
--to lmdb:/var/cache/duc/duc.lmdb
```

### Testing

See [testing/README.md](../testing/README.md) and [testing/test-migrator.sh](../testing/test-migrator.sh); the script exercises the migrator against all backend combinations.

---

## How It Works

1. The source database is opened **read-only**.
2. A full cursor scan iterates every key-value record in storage order.
3. Each record is written verbatim to the destination database.
4. Both databases are flushed and closed cleanly on completion.

Progress is printed every 10 000 records; the final line reports the total
count and any write errors.

Because the copy is raw (below the duc abstraction layer), the destination
database is immediately usable by duc without re-indexing.
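
The scan-and-copy loop described above can be sketched independently of any real backend. Everything in the following block is an assumption made for illustration: the `struct source` / `struct sink` callback interface, the `copy_all` name, and the in-memory `counter_*` stand-ins are hypothetical and are not taken from `migrator.c`. Only the overall shape (iterate, write verbatim, report every 10 000 records, count write errors) follows the description.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* One raw record, passed through untouched. */
struct kv { const void *key; size_t klen; const void *val; size_t vlen; };

/* Hypothetical backend-neutral interface. */
struct source {
    int (*next)(void *ctx, struct kv *out);        /* 1 = record, 0 = end */
    void *ctx;
};
struct sink {
    int (*put)(void *ctx, const struct kv *rec);   /* 0 = ok */
    void *ctx;
};

/* Copy every record from src to dst, printing progress every 10 000 records.
 * Returns the number of records written; *errors counts failed writes. */
unsigned long copy_all(struct source *src, struct sink *dst,
                       unsigned long *errors)
{
    struct kv rec;
    unsigned long copied = 0;

    assert(src != NULL && dst != NULL && errors != NULL);
    *errors = 0;

    while (src->next(src->ctx, &rec) == 1) {
        if (dst->put(dst->ctx, &rec) != 0) {
            (*errors)++;
            continue;
        }
        copied++;
        if (copied % 10000 == 0) {
            fprintf(stderr, "copied %lu records\n", copied);
            fflush(stderr);
        }
    }
    fprintf(stderr, "done: %lu records, %lu write errors\n", copied, *errors);
    return copied;
}

/* Tiny in-memory stand-ins so the loop can run without any database. */
struct counter { unsigned long left; unsigned long seen; };

int counter_next(void *ctx, struct kv *out)
{
    struct counter *c = ctx;
    if (c->left == 0)
        return 0;
    c->left--;
    out->key = "k"; out->klen = 1;
    out->val = "v"; out->vlen = 1;
    return 1;
}

int counter_put(void *ctx, const struct kv *rec)
{
    struct counter *c = ctx;
    (void)rec;          /* a real sink would store rec->key / rec->val */
    c->seen++;
    return 0;
}
```

In a real tool each backend would supply its own `next` (wrapping a cursor) and `put` (wrapping its write call), and the loop itself would stay the same for every source and destination pair.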

---

## Caveats

- **`duc_db_version`** is copied as-is. Backends that do not normally store
this key (LevelDB, SQLite3, LMDB) will have it present after migration,
which is harmless. Backends that validate it on open (Tokyo Cabinet, Kyoto
Cabinet, Tkrzw) will accept it as long as the version string matches the
compiled duc version.

- **LevelDB** stores its data in a directory; make sure the destination
directory either does not exist or is empty before migrating into it.

- **LMDB** pre-allocates a large virtual address range (1 GB on 32-bit, 256 GB
on 64-bit). Actual disk usage is much smaller; the reservation is virtual
memory only.

- The migrator does **not** validate the integrity of the source database
before copying. Run `duc info` on the source first if in doubt.
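
For the LevelDB caveat above, a pre-flight check on the destination directory is easy to write with POSIX directory calls. This is a sketch of such a check, assuming a POSIX system; `dir_missing_or_empty` is a hypothetical helper and nothing here claims the migrator performs this test itself.

```c
#include <assert.h>
#include <dirent.h>
#include <errno.h>
#include <string.h>

/* Returns 1 if path does not exist or is an empty directory,
 * 0 if it is a directory with entries,
 * -1 on any other error (for example, path is a regular file). */
int dir_missing_or_empty(const char *path)
{
    DIR *d;
    struct dirent *e;
    int empty = 1;

    assert(path != NULL);

    d = opendir(path);
    if (d == NULL)
        return errno == ENOENT ? 1 : -1;

    while ((e = readdir(d)) != NULL) {
        /* "." and ".." are always present and do not count as content. */
        if (strcmp(e->d_name, ".") != 0 && strcmp(e->d_name, "..") != 0) {
            empty = 0;
            break;
        }
    }
    closedir(d);
    return empty;
}
```

Running a check like this on the `--to` path before a migration into `leveldb` makes it possible to refuse to write into a directory that already holds another database.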