
Integrate embedding & semantic search pipeline #33

@justinmadison

Summary

Compute vector embeddings for each normalized article and provide a semantic search interface to find similar articles by content.

Motivation

  • Enables “More like this” recommendations and clustering of related stories.
  • Powers a search experience that goes beyond keyword matching, surfacing semantically relevant results.

Scope

In scope: implementation, tests, and README documentation

Acceptance Criteria

  • compute_embedding(text) returns a fixed-length float vector.
  • embed_task(article_id) stores the embedding for the article in the vector index.
  • search_similar(query, top_k) returns the top K most semantically similar articles.
  • CLI commands embed and search run without errors and print expected output.
  • All tests pass in CI, and the README clearly documents the embedding and search workflows.

Additional Context

  1. Add dependencies
    • Add sentence-transformers and faiss-cpu (or equivalent) to /nlp/requirements.txt.
  2. Core function signatures (/nlp/core.py)
    • def compute_embedding(text: str) -> List[float]
    • def index_article(article_id: str, embedding: List[float]) -> None
    • def search_similar(query: str, top_k: int = 5) -> List[Dict]
  3. Celery task hook (/nlp/tasks.py)
    • Register as a Celery task:
      @app.task
      def embed_task(article_id: str) -> List[float]
    • The task should call compute_embedding on the article's text, then index_article with the result.
  4. CLI entrypoints (/nlp/cli.py)
    • python -m nlp.cli embed --article-id=<id>
    • python -m nlp.cli search --query="..." --top-k=5
  5. Tests & documentation
    • Create /nlp/tests/test_core_embedding.py to:
      • Assert compute_embedding() returns a vector of the expected dimension.
      • Assert that search_similar() returns a non-empty list for a sample query.
    • Create /nlp/tests/test_embed_task.py to:
      • Mock the DB and vector store, and verify that embed_task() calls both core functions (compute_embedding and index_article).
    • Update /nlp/README.md with:
      • Installation steps
      • How to run embed_task via Celery
      • CLI usage examples for embed and search
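
Steps 3 to 5 could be wired together roughly as follows. This is a sketch under assumptions, not the final code: embed_article is a hypothetical helper that embed_task could delegate to, load_text is a hypothetical lookup for the article body (the issue does not say how the text is fetched), and the @app.task decorator is left out so the example runs without Celery or a broker. The parser mirrors the two CLI commands listed above.

```python
import argparse
from typing import Callable, List
from unittest import mock


def embed_article(
    article_id: str,
    load_text: Callable[[str], str],
    compute_embedding: Callable[[str], List[float]],
    index_article: Callable[[str, List[float]], None],
) -> List[float]:
    """Hypothetical body for embed_task: embed first, then index."""
    text = load_text(article_id)  # assumed DB lookup, injected for testing
    embedding = compute_embedding(text)
    index_article(article_id, embedding)
    return embedding


def build_parser() -> argparse.ArgumentParser:
    """Parser for the `embed` and `search` subcommands."""
    parser = argparse.ArgumentParser(prog="nlp.cli")
    sub = parser.add_subparsers(dest="command", required=True)

    embed = sub.add_parser("embed", help="embed and index one article")
    embed.add_argument("--article-id", required=True)

    search = sub.add_parser("search", help="find similar articles")
    search.add_argument("--query", required=True)
    search.add_argument("--top-k", type=int, default=5)
    return parser


def check_embed_article_calls_both() -> None:
    """Shape of the mock-based test described for test_embed_task.py."""
    load_text = mock.Mock(return_value="some article body")
    compute = mock.Mock(return_value=[0.1, 0.2])
    index = mock.Mock()

    result = embed_article("42", load_text, compute, index)

    # Both core functions are called once, in the documented order of data flow.
    compute.assert_called_once_with("some article body")
    index.assert_called_once_with("42", [0.1, 0.2])
    assert result == [0.1, 0.2]
```

Passing the collaborators in as arguments is one way to keep the task testable without patching module globals. The real embed_task may instead import the core functions directly and be tested with mock.patch.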
