Skip to content

math-hiyoko/fm-index

Repository files navigation

FM Index

CI codecov PyPI - Version PyPI - License PyPI - PythonVersion PyPI - Implementation PyPI - Types PyPI Downloads PyPI - Format Rust Unsafe GitHub Repo stars

High-performance FM-index implementation powered by Rust,
designed for fast substring search on large texts and collections

Features:

  • Fast count / locate substring queries
  • Data-parallel optimizations across index construction and queries
  • Supports single text and multiple documents
  • Pickle serialization support for efficient index persistence
  • Safe Rust (no unsafe)

Installation

pip install fm-index

FMIndex (Single Document)

What is FMIndex?

FMIndex builds a compressed index over a single string,
allowing fast substring search without scanning the original data.

Construction Complexity

  • Time / Space: O(|data|)

Example

from fm_index import FMIndex

genome = "ACGTACGTTGACCTGACTGACTGACTGACGATCGATCGATCGATCGATCG"
fm = FMIndex(data=genome)

Count Substring Occurrences

Counts how many times a pattern appears.
Time complexity is independent of data size.

fm.count(pattern="GACTGACT")
# 2

Locate Substring Positions

Returns all starting offsets where the pattern occurs.

To improve throughput for high-frequency patterns,
FMIndex applies parallel execution to parts of the locate pipeline.

fm.locate(pattern="GACTGACT")
# [18, 14]

Iterative Locate (Streaming)

For large result sets, iter_locate provides a memory-efficient
iterator interface that yields positions lazily.

for pos in fm.iter_locate(pattern="GACTGACT"):
    print(pos)
# 18
# 14
  • Same results as locate
  • Does not allocate a result list
  • Suitable for streaming and early termination

MultiFMIndex (Multiple Documents)

MultiFMIndex extends FMIndex to support multiple documents
while keeping query time independent of corpus size

Query processing is internally parallelized where possible,
making multi-document search efficient in practice.

Construction Complexity

  • Time / Space: O(|''.join(data)| + len(data) log (len(data)))
from fm_index import MultiFMIndex

documents = [
    "政府はAI研究の支援を強化すると発表した。",
    "政府は新たなデータ活用方針を発表した。",
    "政府はサイバーセキュリティ対策を発表した。",
    "専門家はAI検索技術の進化に注目している。",
    "研究者は高速な検索アルゴリズムに注目している。",
    "オープンソース界隈では全文検索ライブラリに注目している。"
]

mfm = MultiFMIndex(data=documents)

Count Across All Documents

mfm.count_all(pattern="検索")
# 3

Count Per Document

mfm.count(pattern="検索")
# {3: 1, 4: 1, 5: 1}

# Count within a specific document
mfm.count(pattern="検索", doc_id=3)
# 1

Locate Per Document

mfm.locate(pattern="検索")
# {5: [13], 4: [7], 3: [6]}

# Locate within a specific document
mfm.locate(pattern="検索", doc_id=3)
# [6]

Iterative Locate (Streaming)

# Iterate across all documents
for doc_id, pos in mfm.iter_locate(pattern="検索"):
    print(doc_id, pos)
# 4 7
# 5 13
# 3 6

# Iterate within a specific document
for doc_id, pos in mfm.iter_locate(pattern="検索", doc_id=3):
    print(doc_id, pos)
# 6

Prefix / Suffix Search

mfm.startswith(prefix="政府は")
mfm.endswith(suffix="注目している。")

Serialization (Pickle Support)

Both FMIndex and MultiFMIndex support Python's pickle protocol, allowing you to save and load pre-built indices efficiently.

The internal data structures are serialized directly in binary format, making deserialization much faster than rebuilding the index from scratch.

Save Index to File

import pickle
from fm_index import FMIndex, MultiFMIndex

# Build and save FMIndex
fm = FMIndex("large genome sequence..." * 10000)
with open("genome.fmindex", "wb") as f:
    pickle.dump(fm, f)

# Build and save MultiFMIndex
mfm = MultiFMIndex(["document1", "document2", ...])
with open("documents.mfmindex", "wb") as f:
    pickle.dump(mfm, f)

Load Index from File

# Load FMIndex
with open("genome.fmindex", "rb") as f:
    fm = pickle.load(f)

# Load MultiFMIndex
with open("documents.mfmindex", "rb") as f:
    mfm = pickle.load(f)

# Use immediately without reconstruction
result = fm.locate("ACGT")

This is particularly useful when:

  • Working with large datasets where index construction is expensive
  • Deploying pre-built indices in production environments
  • Sharing indices across different processes or machines

Safety

  • Powered by safe Rust
  • Memory-safe by design

Development & Testing

Run Tests

pip install -e ".[test]"
cargo test --all --release
pytest

Formating

pip install -e ".[dev]"
cargo fmt --all
cargo clippy --all-targets --all-features
ruff format

Generating Docs

pdoc fm_index \
      --output-directory docs \
      --no-search \
      --docformat markdown \
      --template-directory pdoc_templates

References