
Conversation


@Sanjana-m55 Sanjana-m55 commented Feb 8, 2026

This PR introduces a minimal, local-first semantic search pipeline for markdown notes.

It includes:

  • Markdown chunking
  • Local embeddings using Sentence Transformers (MiniLM)
  • FAISS-based vector similarity search
  • Simple CLI-based semantic querying

The implementation runs fully offline after the initial model download and is intended as a foundational building block for the Smart Notes project’s semantic search and future RAG pipeline.

No existing code paths are modified.

Summary by CodeRabbit

  • New Features

    • Introduced semantic search functionality enabling intelligent querying of markdown document collections
    • Implemented an offline-first approach using locally hosted language models, eliminating the dependency on external APIs
  • Documentation

    • Added semantic search module documentation with complete setup instructions and command-line usage examples
  • Chores

    • Added project configuration files and Python package dependencies for semantic search


coderabbitai bot commented Feb 8, 2026

📝 Walkthrough

This pull request introduces a complete offline-first semantic search module to the codebase. It includes markdown file ingestion and chunking, local embedding generation via SentenceTransformer, FAISS-based vector storage, and a CLI tool for searching markdown content through semantic similarity.

Changes

| Cohort / File(s) | Summary |
|---|---|
| **Setup & Configuration**<br>`.gitignore`, `semantic_search/requirements.txt` | Added gitignore patterns for Python artifacts, virtual environments, and caches. Introduced dependencies: sentence-transformers, faiss-cpu, and markdown. |
| **Core Pipeline Components**<br>`semantic_search/chunker.py`, `semantic_search/embedder.py`, `semantic_search/vector_store.py` | Implemented markdown chunking with paragraph-based accumulation (max 800 chars), SentenceTransformer-based embedding generation with the MiniLM model, and a FAISS IndexFlatIP vector store with metadata tracking and similarity search. |
| **Pipeline Orchestration**<br>`semantic_search/search.py` | Added an entry point with CLI argument parsing, markdown file discovery, an end-to-end chunking-embedding-indexing workflow, and query-time search with formatted result output including sources and snippets. |
| **Documentation**<br>`semantic_search/README.md` | Added a module README describing the offline-first pipeline architecture and usage instructions for invoking via `python -m semantic_search.search`. |
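The vector store's IndexFlatIP scores by raw inner product, which equals cosine similarity once vectors are L2-normalized. A dependency-free sketch of that ranking (illustrative only — the module itself uses FAISS; the function name here is an assumption, not from the PR):

```python
from math import sqrt
from typing import List, Tuple

def cosine_top_k(query: List[float], corpus: List[List[float]], k: int = 5) -> List[Tuple[int, float]]:
    """Rank corpus vectors by cosine similarity to the query.

    Mirrors what FAISS IndexFlatIP computes when all vectors are
    L2-normalized before indexing.
    """
    def norm(v: List[float]) -> float:
        return sqrt(sum(x * x for x in v)) or 1.0

    def cos(a: List[float], b: List[float]) -> float:
        return sum(x * y for x, y in zip(a, b)) / (norm(a) * norm(b))

    scored = [(i, cos(query, v)) for i, v in enumerate(corpus)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```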

Sequence Diagram

```mermaid
sequenceDiagram
    actor User
    participant CLI as search.py
    participant FileSystem
    participant Chunker as chunker.py
    participant Embedder as embedder.py
    participant VectorStore as vector_store.py
    participant FAISS

    User->>CLI: python -m semantic_search.search notes_folder "query"
    CLI->>FileSystem: load_markdown_files(notes_folder)
    FileSystem-->>CLI: list of .md file paths

    loop For each markdown file
        CLI->>Chunker: chunk_markdown_file(file_path)
        Chunker->>FileSystem: read file
        FileSystem-->>Chunker: markdown content
        Chunker->>Chunker: _clean_markdown() → plain text
        Chunker->>Chunker: split & accumulate into chunks
        Chunker-->>CLI: List[Dict(text, source)]
    end

    CLI->>Embedder: embed_texts(all_chunks)
    Embedder->>Embedder: SentenceTransformer.encode()
    Embedder-->>CLI: List[List[float]] embeddings

    CLI->>VectorStore: __init__(embedding_dim)
    VectorStore->>FAISS: create IndexFlatIP

    CLI->>VectorStore: add(embeddings, metadatas)
    VectorStore->>FAISS: add vectors

    CLI->>Embedder: embed_texts([query])
    Embedder-->>CLI: query_embedding

    CLI->>VectorStore: search(query_embedding, k=5)
    VectorStore->>FAISS: search(query_vector)
    FAISS-->>VectorStore: top-k indices & scores
    VectorStore-->>CLI: List[Dict(text, source, score)]

    CLI->>User: print results with sources & snippets
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 A search through notes with local grace,
No clouds needed in this place—
Embeddings dance, chunks align,
FAISS finds semantics so fine!
Offline wisdom, hop and gleam!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning: docstring coverage is 55.56%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Description Check — ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed: the title 'Add local semantic search MVP for Smart Notes' directly and clearly describes the main change: introduction of a local semantic search minimum viable product.



@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 5

🤖 Fix all issues with AI agents
In `@semantic_search/chunker.py`:
- Around line 43-45: The loop building chunks can exceed max_chars by one
because it always concatenates " " + para but checks len(current_chunk) +
len(para) <= max_chars; update the logic in the loop that iterates over
paragraphs to account for the extra space only when current_chunk is non-empty
(e.g., check len(current_chunk) + (1 if current_chunk else 0) + len(para) <=
max_chars), and change the concatenation to add the space conditionally (append
para directly if current_chunk is empty, otherwise append " " + para) so you
also avoid leading spaces.
- Around line 5-16: The _clean_markdown function only strips a few specific
tags, leaving other HTML from markdown.markdown() in the output; update
_clean_markdown(md_text: str) to fully remove all HTML tags and decode HTML
entities before returning plain text (e.g., run the markdown output through an
HTML-to-text step such as using an HTML parser or a robust regex to strip tags
and html.unescape to decode entities) so that headings, lists, links, emphasis,
code tags, etc. are removed and the returned text is clean for embedding; ensure
you modify the function containing markdown.markdown(md_text) and return the
fully cleaned plain text.

In `@semantic_search/README.md`:
- Around line 17-18: The README has an unclosed fenced code block around the
example command; close the triple-backtick fence after the shown command (the
code block that starts before the line containing "python -m
semantic_search.search semantic_search/test_notes \"your query here\"") so the
rest of the document renders correctly, and optionally add a short installation
note above or below the example such as a pip install instruction (e.g.,
reference to requirements.txt) to show how to install dependencies before
running the command.

In `@semantic_search/search.py`:
- Around line 46-48: The loop printing search results always appends "..." even
for short texts; update the block that iterates over results (the for res in
results loop, using res['text'], res['source'], res['score']) to compute a
snippet = res['text'] if len(res['text']) <= 200 else res['text'][:200] + "..."
and print snippet instead of always printing res['text'][:200] + "...", so the
ellipsis is only added when the original text was truncated.
- Around line 34-39: The current code assumes embeddings exist and does
len(embeddings[0]) which raises IndexError when all_chunks/texts are empty;
after calling Embedder().embed_texts(texts) check if embeddings is empty and
handle it: either return early (skip building the VectorStore) or instantiate
VectorStore using a safe default embedding dimension obtained from the embedder
(e.g. add/use Embedder.embedding_dim or Embedder.get_embedding_dim()) instead of
indexing embeddings[0]; ensure you never call len(embeddings[0]) without first
verifying embeddings is non-empty and update the VectorStore creation site to
use the safe dimension or the early-return path.
🧹 Nitpick comments (2)
semantic_search/requirements.txt (1)

1-3: Consider pinning dependency versions or adding upper bounds for reproducibility.

Using only >= constraints risks breaking changes from future major releases. For a project meant to be a foundational building block, consider pinning to compatible ranges (e.g., ~=2.2.2 or >=2.2.2,<3).

semantic_search/vector_store.py (1)

34-40: Consider strict=True on zip() for defensive coding.

Per Ruff B905, adding strict=True to zip(scores[0], indices[0]) would catch unexpected length mismatches, though in practice FAISS guarantees equal-length arrays.

Proposed fix

```diff
-        for score, idx in zip(scores[0], indices[0]):
+        for score, idx in zip(scores[0], indices[0], strict=True):
```

Comment on lines +5 to +16

```python
def _clean_markdown(md_text: str) -> str:
    """
    Convert markdown text to plain text.
    """
    html = markdown.markdown(md_text)
    # Very lightweight HTML tag removal
    text = (
        html.replace("<p>", "")
        .replace("</p>", "\n")
        .replace("<br />", "\n")
    )
    return text
```
⚠️ Potential issue | 🟠 Major

Incomplete HTML tag stripping will pollute embeddings with HTML markup.

Only <p>, </p>, and <br /> are stripped. All other tags produced by markdown.markdown() — headings (<h1><h6>), lists (<ul>, <li>), emphasis (<strong>, <em>), links (<a href="...">), code (<pre>, <code>), etc. — will remain in the text fed to the embedder, degrading search quality.

Proposed fix using regex-based tag stripping

```diff
 import os
+import re
 from typing import List, Dict
 import markdown

 def _clean_markdown(md_text: str) -> str:
     """
     Convert markdown text to plain text.
     """
     html = markdown.markdown(md_text)
-    # Very lightweight HTML tag removal
-    text = (
-        html.replace("<p>", "")
-        .replace("</p>", "\n")
-        .replace("<br />", "\n")
-    )
+    # Strip all HTML tags
+    text = re.sub(r"<[^>]+>", "", html)
     return text
```

Comment on lines +43 to +45

```python
for para in paragraphs:
    if len(current_chunk) + len(para) <= max_chars:
        current_chunk += " " + para
```
⚠️ Potential issue | 🟡 Minor

Off-by-one: the joining space is not accounted for in the length check.

Line 44 checks len(current_chunk) + len(para) <= max_chars, but Line 45 concatenates with " " + para, adding an extra character. This means chunks can exceed max_chars by 1.

Proposed fix

```diff
     for para in paragraphs:
-        if len(current_chunk) + len(para) <= max_chars:
-            current_chunk += " " + para
+        separator = " " if current_chunk else ""
+        if len(current_chunk) + len(separator) + len(para) <= max_chars:
+            current_chunk += separator + para
         else:
```

This also avoids the leading space on the first paragraph of each chunk.
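Applied to the whole loop, the corrected accumulation looks roughly like this (function name and signature are illustrative; the PR's chunker also carries a `source` field per chunk):

```python
from typing import List

def accumulate_chunks(paragraphs: List[str], max_chars: int = 800) -> List[str]:
    chunks: List[str] = []
    current = ""
    for para in paragraphs:
        sep = " " if current else ""  # no joining space before the first paragraph
        if len(current) + len(sep) + len(para) <= max_chars:
            current += sep + para
        else:
            if current:
                chunks.append(current)
            current = para  # start a fresh chunk with this paragraph
    if current:
        chunks.append(current)
    return chunks
```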


Comment on lines +17 to +18

```bash
python -m semantic_search.search semantic_search/test_notes "your query here"
```
⚠️ Potential issue | 🟡 Minor

Unclosed code fence will break README rendering.

The fenced code block opened on Line 18 is never closed with a matching ```. This will cause the rest of the README (and any content added below) to render as a code block on GitHub. Also consider adding installation instructions (pip install -r requirements.txt).


Comment on lines +34 to +39

```python
texts = [c["text"] for c in all_chunks]

embedder = Embedder()
embeddings = embedder.embed_texts(texts)

vector_store = VectorStore(embedding_dim=len(embeddings[0]))
```
⚠️ Potential issue | 🔴 Critical

IndexError crash when markdown files exist but yield no text chunks.

If all .md files are empty or whitespace-only, all_chunks will be [], texts will be [], embedder.embed_texts([]) returns [], and len(embeddings[0]) on Line 39 raises an IndexError.

Proposed fix

```diff
     texts = [c["text"] for c in all_chunks]

+    if not texts:
+        print("No text chunks extracted from markdown files.")
+        sys.exit(1)
+
     embedder = Embedder()
     embeddings = embedder.embed_texts(texts)
```
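The alternative the prompt mentions — a safe dimension instead of an early exit — can be sketched as a small guard (the helper name is illustrative; the 384 fallback is the output dimension of all-MiniLM-L6-v2, an assumption about the PR's default model):

```python
from typing import List

def safe_embedding_dim(embeddings: List[List[float]], fallback: int = 384) -> int:
    # Never index embeddings[0] without checking emptiness first;
    # all-MiniLM-L6-v2 emits 384-dimensional vectors, hence the fallback.
    return len(embeddings[0]) if embeddings else fallback
```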

Comment on lines +46 to +48

```python
for res in results:
    print(f"- {res['source']} (score: {res['score']:.3f})")
    print(f"  {res['text'][:200]}...\n")
```
⚠️ Potential issue | 🟡 Minor

Trailing "..." is always appended, even for short texts.

Line 48 always appends ... after truncation, even when the text is shorter than 200 characters.

Proposed fix

```diff
-        print(f"  {res['text'][:200]}...\n")
+        snippet = res['text'][:200]
+        ellipsis = "..." if len(res['text']) > 200 else ""
+        print(f"  {snippet}{ellipsis}\n")
```
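Equivalently, as a small helper (name and default are illustrative):

```python
def make_snippet(text: str, limit: int = 200) -> str:
    # Append the ellipsis only when the text was actually truncated.
    return text if len(text) <= limit else text[:limit] + "..."
```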
