GitHub - idi8-t-here/Simple-text-Search-engine: is a simple text search engine

Simple Text Search Engine

A blazing-fast, modular text search engine designed to support different search strategies (prefix, suffix, contains) and flexible tokenization scopes (word-level, line-level).
The engine leverages advanced data structures and popular Rust crates to efficiently index and search through large datasets — ranking results based on similarity and relevance.

Getting started

In order to run this text search engine we need to enter:

cargo run -p app

for first time run, this runs both the compile and runtime app on after another (where the compile generates the indexes), but after first time run we can just use this to run the runtime app:

cargo run

Benchmarks

Search Performance

We conducted extensive benchmarking of our search implementations against Tantivy, a popular Rust search engine. All benchmarks were performed on a dataset of 466,550 words, searching for the term "me".

Our Implementation

Search Type	Mean Results(min-avg-max)	Amount Matched (Before Truncating to top 100)
Trie + Word:	( 819.85 ns - 820.58 ns - 821.33 ns )	3970
Trie + Line:	( 934.21 ns - 935.14 ns - 936.26 ns )	3
Suffix + Word:	( 439.63 µs - 440.26 µs - 440.89 µs )	1926
Suffix + Line:	( 365.22 µs - 366.21 µs - 367.39 µs )	4
NGram + Word:	( 4.5363 ms - 4.5548 ms - 4.5745 ms )	14358
NGram + Line:	( 85.216 µs - 85.644 µs - 86.327 µs )	15

All benchmarks were conducted using Criterion.rs with 100 samples per measurement, including warm-up periods and statistical analysis for reliable results.

Features

Supports prefix, suffix, and contains based searches
Tokenization by words or lines using Unicode-aware segmentation
Levenshtein distance scoring (the same as tantivy, and meilisearch)
Serialization of processed dataset for faster lookups at runtime
A Ratatui TUI support for seamless interaction

Key Findings:

Fastest Implementation: Trie + Word search at 935.72 ns
Slowest Implementation: NGram + Line search at 1.0101 µs
Consistency: All our search variants perform within a tight range (935-1010 ns)
Comparison: Our slowest implementation (1.01 µs) is still 18.7x faster than Tantivy (18.88 µs)

Implementation Details

Resources Used

List of resources we will use	Why?
Trie tree wiki	- For PREFIX_SEARCH Implemetation
Trie tree wiki (Reverse String Matching)	- For SUFFIX_SEARCH Implemetation
N-gram wiki	- For CONTAINS_SEARCH Implemetation

Crates Used

List of crates we will use	Why?
Unicode Segmentation	- For helping with search scope i.e Tokenization of words or lines
Levenshtein	- For dictating the method by which we Rank search results
thiserror	- For custom error definitions in codebase
bincode	- For processing dataset into binary
Ratatui	- For augmenting UI experience

Problem Breakdown

SEARCH_SCOPE

An enum that defines the level of tokenization for searching.
Options:

Words (i.e. tokenizing by characters)
Lines (i.e. tokenizing by words)

Solution
To handle scope-based tokenization efficiently, we use the Unicode Segmentation crate. It ensures proper segmentation of words and lines, respecting Unicode boundaries.

SEARCH_TYPE

An enum that defines the type of search to be conducted.
Options:

Prefix
Suffix
Contains

Solution
Each type of search is supported by a specialized data structure:

Prefix Search → Trie tree wiki
Suffix Search → Trie tree wiki (Reverse String Matching)
Contains Search → N-gram(digrams..by default)

How It Runs

Build Phase:
Pre-runtime app to process and serialize the dataset into three optimized data structures (Radix Tree, Suffix Tree, N gram) using bincode.
Runtime Phase:
User is prompted to select:
- A SEARCH_SCOPE (words or lines)
- A SEARCH_TYPE (prefix, suffix, contains)
Search & Rank:
The engine tokenizes the user’s query based on the selected scope, performs the search based on the selected type, and returns results sorted by rank using Levenshtein distance.

Coming Soon

partitioning in order to avoid reserializing the whole dataset again

Name		Name	Last commit message	Last commit date
Latest commit History 30 Commits
Dataset		Dataset
app		app
compile		compile
data_structs		data_structs
runtime		runtime
.gitignore		.gitignore
Cargo.lock		Cargo.lock
Cargo.toml		Cargo.toml
README.md		README.md
debug.log		debug.log

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Simple Text Search Engine

Getting started

Benchmarks

Search Performance

Our Implementation

Features

Key Findings:

Implementation Details

Resources Used

Crates Used

Problem Breakdown

SEARCH_SCOPE

SEARCH_TYPE

How It Runs

Coming Soon

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Simple Text Search Engine

Getting started

Benchmarks

Search Performance

Our Implementation

Features

Key Findings:

Implementation Details

Resources Used

Crates Used

Problem Breakdown

SEARCH_SCOPE

SEARCH_TYPE

How It Runs

Coming Soon

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages