Summary
Compute vector embeddings for each normalized article and provide a semantic search interface to find similar articles by content.
Motivation
- Enables “More like this” recommendations and clustering of related stories.
- Powers a search experience that goes beyond keyword matching, surfacing semantically relevant results.
Scope
In scope: implementation, tests
Acceptance Criteria
- `compute_embedding(text)` returns a fixed-length float vector.
- `embed_task(article_id)` stores the embedding for the article in the vector index.
- `search_similar(query, top_k)` returns the top K most semantically similar articles.
- CLI commands `embed` and `search` run without errors and print expected output.
- All tests pass in CI and README clearly documents embedding & search workflows.
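A minimal sketch of how the three core functions named in the criteria above might fit together, assuming the encoder is injectable so tests do not need to download a model. The `set_encoder` helper, the `all-MiniLM-L6-v2` model name, the `article_id`/`score` result keys, and the in-memory dict (a brute-force cosine search standing in for a FAISS index) are all assumptions for illustration, not something the issue specifies.

```python
import math
from typing import Callable, Dict, List, Optional

Encoder = Callable[[str], List[float]]

_encoder: Optional[Encoder] = None
# Assumption: a plain dict of unit-length vectors stands in for the vector index.
_index: Dict[str, List[float]] = {}


def set_encoder(encoder: Encoder) -> None:
    """Install a text encoder (hypothetical helper, mainly useful in tests)."""
    global _encoder
    _encoder = encoder


def _load_default_encoder() -> Encoder:
    # Assumption: sentence-transformers with the all-MiniLM-L6-v2 model.
    # Imported lazily so the module loads without the dependency installed.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")
    return lambda text: model.encode(text).tolist()


def _normalize(vec: List[float]) -> List[float]:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def compute_embedding(text: str) -> List[float]:
    """Return a fixed-length float vector for the given text."""
    global _encoder
    if _encoder is None:
        _encoder = _load_default_encoder()
    return [float(x) for x in _encoder(text)]


def index_article(article_id: str, embedding: List[float]) -> None:
    """Store the normalized embedding under the article id."""
    _index[article_id] = _normalize(embedding)


def search_similar(query: str, top_k: int = 5) -> List[Dict]:
    """Return up to top_k indexed articles ranked by cosine similarity."""
    q = _normalize(compute_embedding(query))
    hits = [
        {"article_id": aid, "score": sum(a * b for a, b in zip(q, vec))}
        for aid, vec in _index.items()
    ]
    hits.sort(key=lambda hit: hit["score"], reverse=True)
    return hits[:top_k]
```

Because stored vectors and the query are both normalized, the dot product equals cosine similarity. A FAISS inner-product index over normalized vectors would rank results the same way, which should make a later swap straightforward.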
Additional Context
- Add dependencies
  - Add `sentence-transformers` and `faiss-cpu` (or equivalent) to `/nlp/requirements.txt`.
- Core function signatures (`/nlp/core.py`)
  - `def compute_embedding(text: str) -> List[float]`
  - `def index_article(article_id: str, embedding: List[float]) -> None`
  - `def search_similar(query: str, top_k: int = 5) -> List[Dict]`
- Celery task hook (`/nlp/tasks.py`)
  - Register: `@app.task def embed_task(article_id: str) -> List[float]`
  - Should call `compute_embedding`, then `index_article`.
- CLI entrypoints (`/nlp/cli.py`)
  - `python -m nlp.cli embed --article-id=<id>`
  - `python -m nlp.cli search --query="..." --top-k=5`
- Tests & documentation
  - Create `/nlp/tests/test_core_embedding.py` to:
    - Assert `compute_embedding()` returns a vector of the expected dimension.
    - Assert that `search_similar()` returns a non-empty list for a sample query.
  - Create `/nlp/tests/test_embed_task.py` to:
    - Mock DB and vector store, verify `embed_task()` calls both core functions.
  - Update `/nlp/README.md` with:
    - Installation steps
    - How to run `embed_task` via Celery
    - CLI usage examples for `embed` and `search`
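The task hook, CLI entrypoints, and mock-based test described above could be sketched roughly as follows. This is only an illustration under stated assumptions: `run_embed` and `fetch_text` are hypothetical names (the issue does not say how article text is loaded), the collaborators are passed in explicitly so the example runs without Celery or a database, and the `argparse` layout is one possible way to match the two command lines listed.

```python
import argparse
from typing import Callable, List
from unittest import mock


def run_embed(
    article_id: str,
    fetch_text: Callable[[str], str],              # hypothetical DB lookup
    compute_embedding: Callable[[str], List[float]],
    index_article: Callable[[str, List[float]], None],
) -> List[float]:
    """Body of the task: embed the article text, then index the vector."""
    text = fetch_text(article_id)
    embedding = compute_embedding(text)
    index_article(article_id, embedding)
    return embedding


# In /nlp/tasks.py this would presumably be registered along these lines:
#
#   @app.task
#   def embed_task(article_id: str) -> List[float]:
#       return run_embed(article_id, fetch_text, compute_embedding, index_article)


def build_parser() -> argparse.ArgumentParser:
    """Parser for `python -m nlp.cli embed ...` and `... search ...`."""
    parser = argparse.ArgumentParser(prog="nlp.cli")
    sub = parser.add_subparsers(dest="command", required=True)

    embed = sub.add_parser("embed", help="embed and index one article")
    embed.add_argument("--article-id", required=True)

    search = sub.add_parser("search", help="find similar articles")
    search.add_argument("--query", required=True)
    search.add_argument("--top-k", type=int, default=5)
    return parser


def test_embed_calls_both_core_functions() -> None:
    """Mock the DB and vector store, then check the call order's inputs."""
    fetch = mock.Mock(return_value="normalized article text")
    compute = mock.Mock(return_value=[0.1, 0.2, 0.3])
    index = mock.Mock()

    result = run_embed("42", fetch, compute, index)

    fetch.assert_called_once_with("42")
    compute.assert_called_once_with("normalized article text")
    index.assert_called_once_with("42", [0.1, 0.2, 0.3])
    assert result == [0.1, 0.2, 0.3]
```

Keeping the task body in a plain function means the test needs no Celery worker or broker; the decorated `embed_task` stays a thin wrapper around it.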