FM Index

High-performance FM-index implementation powered by Rust,
designed for fast substring search on large texts and collections

PyPI: https://pypi.org/project/fm-index
Document: https://math-hiyoko.github.io/fm-index
Repository: https://github.com/math-hiyoko/fm-index

Features:

Fast count / locate substring queries
Data-parallel optimizations across index construction and queries
Supports single text and multiple documents
Pickle serialization support for efficient index persistence
Safe Rust (no unsafe)

Installation

pip install fm-index

FMIndex (Single Document)

What is FMIndex?

FMIndex builds a compressed index over a single string,
allowing fast substring search without scanning the original data.

Construction Complexity

Time / Space: O(|data|)

Example

from fm_index import FMIndex

genome = "ACGTACGTTGACCTGACTGACTGACTGACGATCGATCGATCGATCGATCG"
fm = FMIndex(data=genome)

Count Substring Occurrences

Counts how many times a pattern appears.
Time complexity is independent of data size.

fm.count(pattern="GACTGACT")
# 2

Locate Substring Positions

Returns all starting offsets where the pattern occurs.

To improve throughput for high-frequency patterns,
FMIndex applies parallel execution to parts of the locate pipeline.

fm.locate(pattern="GACTGACT")
# [18, 14]

Iterative Locate (Streaming)

For large result sets, iter_locate provides a memory-efficient
iterator interface that yields positions lazily.

for pos in fm.iter_locate(pattern="GACTGACT"):
    print(pos)
# 18
# 14

Same results as locate
Does not allocate a result list
Suitable for streaming and early termination

MultiFMIndex (Multiple Documents)

MultiFMIndex extends FMIndex to support multiple documents
while keeping query time independent of corpus size

Query processing is internally parallelized where possible,
making multi-document search efficient in practice.

Construction Complexity

Time / Space: O(|''.join(data)| + len(data) log (len(data)))

from fm_index import MultiFMIndex

documents = [
    "政府はAI研究の支援を強化すると発表した。",
    "政府は新たなデータ活用方針を発表した。",
    "政府はサイバーセキュリティ対策を発表した。",
    "専門家はAI検索技術の進化に注目している。",
    "研究者は高速な検索アルゴリズムに注目している。",
    "オープンソース界隈では全文検索ライブラリに注目している。"
]

mfm = MultiFMIndex(data=documents)

Count Across All Documents

mfm.count_all(pattern="検索")
# 3

Count Per Document

mfm.count(pattern="検索")
# {3: 1, 4: 1, 5: 1}

# Count within a specific document
mfm.count(pattern="検索", doc_id=3)
# 1

Locate Per Document

mfm.locate(pattern="検索")
# {5: [13], 4: [7], 3: [6]}

# Locate within a specific document
mfm.locate(pattern="検索", doc_id=3)
# [6]

Iterative Locate (Streaming)

# Iterate across all documents
for doc_id, pos in mfm.iter_locate(pattern="検索"):
    print(doc_id, pos)
# 4 7
# 5 13
# 3 6

# Iterate within a specific document
for doc_id, pos in mfm.iter_locate(pattern="検索", doc_id=3):
    print(doc_id, pos)
# 6

Prefix / Suffix Search

mfm.startswith(prefix="政府は")
mfm.endswith(suffix="注目している。")

Serialization (Pickle Support)

Both FMIndex and MultiFMIndex support Python's pickle protocol, allowing you to save and load pre-built indices efficiently.

The internal data structures are serialized directly in binary format, making deserialization much faster than rebuilding the index from scratch.

Save Index to File

import pickle
from fm_index import FMIndex, MultiFMIndex

# Build and save FMIndex
fm = FMIndex("large genome sequence..." * 10000)
with open("genome.fmindex", "wb") as f:
    pickle.dump(fm, f)

# Build and save MultiFMIndex
mfm = MultiFMIndex(["document1", "document2", ...])
with open("documents.mfmindex", "wb") as f:
    pickle.dump(mfm, f)

Load Index from File

# Load FMIndex
with open("genome.fmindex", "rb") as f:
    fm = pickle.load(f)

# Load MultiFMIndex
with open("documents.mfmindex", "rb") as f:
    mfm = pickle.load(f)

# Use immediately without reconstruction
result = fm.locate("ACGT")

This is particularly useful when:

Working with large datasets where index construction is expensive
Deploying pre-built indices in production environments
Sharing indices across different processes or machines

Safety

Powered by safe Rust
Memory-safe by design

Development & Testing

Run Tests

pip install -e ".[test]"
cargo test --all --release
pytest

Formating

pip install -e ".[dev]"
cargo fmt --all
cargo clippy --all-targets --all-features
ruff format

Generating Docs

pdoc fm_index \
      --output-directory docs \
      --no-search \
      --docformat markdown \
      --template-directory pdoc_templates

References

P. Ferragina and G. Manzini,
Opportunistic data structures with applications,
Proceedings 41st Annual Symposium on Foundations of Computer Science,
Redondo Beach, CA, USA,
2000,
pp. 390-398,
https://doi.org/10.1109/SFCS.2000.892127.
FM Indexを使うとWikipedia全文検索みたいなことができる
https://qiita.com/math-hiyoko/items/10d50527504914e00388
A Wikipedia-scale search index, built in one line.
https://medium.com/@koki.watanabe.56/a-wikipedia-scale-search-index-built-in-one-line-1847bb05198b

Name		Name	Last commit message	Last commit date
Latest commit History 79 Commits
.github/workflows		.github/workflows
benchmarks		benchmarks
docs		docs
pdoc_templates		pdoc_templates
src		src
tests		tests
.gitignore		.gitignore
Cargo.toml		Cargo.toml
LICENSE		LICENSE
README.md		README.md
fm_index.pyi		fm_index.pyi
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

FM Index

Features:

Installation

FMIndex (Single Document)

What is FMIndex?

Construction Complexity

Example

Count Substring Occurrences

Locate Substring Positions

Iterative Locate (Streaming)

MultiFMIndex (Multiple Documents)

Construction Complexity

Count Across All Documents

Count Per Document

Locate Per Document

Iterative Locate (Streaming)

Prefix / Suffix Search

Serialization (Pickle Support)

Save Index to File

Load Index from File

Safety

Development & Testing

Run Tests

Formating

Generating Docs

References

About

Uh oh!

Releases 16

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

FM Index

Features:

Installation

FMIndex (Single Document)

What is FMIndex?

Construction Complexity

Example

Count Substring Occurrences

Locate Substring Positions

Iterative Locate (Streaming)

MultiFMIndex (Multiple Documents)

Construction Complexity

Count Across All Documents

Count Per Document

Locate Per Document

Iterative Locate (Streaming)

Prefix / Suffix Search

Serialization (Pickle Support)

Save Index to File

Load Index from File

Safety

Development & Testing

Run Tests

Formating

Generating Docs

References

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 16

Uh oh!

Contributors

Uh oh!

Languages