add tag-based indexing #3

Merged: Ashex merged 3 commits into main from feat/tag-indexing on Feb 22, 2026

@Ashex Ashex commented Feb 22, 2026

Owner

Documents are now indexed with structured tags (source, content_type,
domain, topic, namespace, lexicon_type, language) using a hybrid
vocabulary: controlled core enums plus generated tags derived from
content metadata. Tags are stored in txtai's tags column and queried
via SQL WHERE clauses, replacing the previous post-retrieval filtering
that frequently returned empty results when top-k ANN candidates were
dominated by other sources.
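
The empty-result failure mode of post-retrieval filtering can be illustrated with a toy sketch. This is not the project's code: the corpus, the source names, the scores, and both function names are hypothetical, and a plain sort stands in for the ANN search.

```python
from typing import NamedTuple


class Doc(NamedTuple):
    doc_id: str
    source: str   # hypothetical tag value
    score: float  # stand-in for a similarity score


# Hypothetical corpus: one source holds all of the highest scores.
DOCS = [
    Doc("a", "bsky", 0.95),
    Doc("b", "bsky", 0.93),
    Doc("c", "bsky", 0.90),
    Doc("d", "atproto", 0.70),
    Doc("e", "atproto", 0.65),
]


def post_filter(docs, source, k):
    """Take the top-k candidates first, then filter by source."""
    top_k = sorted(docs, key=lambda d: d.score, reverse=True)[:k]
    return [d.doc_id for d in top_k if d.source == source]


def pre_filter(docs, source, k):
    """Filter by source first, then take the top-k of what remains."""
    matching = [d for d in docs if d.source == source]
    ranked = sorted(matching, key=lambda d: d.score, reverse=True)
    return [d.doc_id for d in ranked[:k]]


print(post_filter(DOCS, "atproto", k=3))  # []
print(pre_filter(DOCS, "atproto", k=3))   # ['d', 'e']
```

With `k=3` the top candidates all come from the other source, so filtering afterwards leaves nothing, while filtering first still returns the two matching documents. Applying the tag condition inside the query plays the role of `pre_filter` here.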

Key changes:
- parser: add tags field to ContentChunk, tag builder functions per
  source, encode_tags() for pipe-delimited txtai storage
- indexer: write tags at index time, persist in chunk_meta.json,
  SQL-filtered _filtered_search() with fallback, backward-compat
  for old indexes without tags
- tools: add content_type filter to search_atproto_docs
- tests: update regression tests for tag-aware fake embeddings,
  add test_tags.py (37 tests covering encoding, builders, filtered
  search, metadata round-trip), add compare_search_quality.py
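
For readers unfamiliar with pipe-delimited tag storage, a minimal sketch of what such an encoding might look like follows. The exact format used by `encode_tags()` is not shown in this PR, so the `key:value` layout, the leading and trailing pipes, and the helpers `decode_tags` and `tag_where_clause` are all assumptions made for illustration.

```python
def encode_tags(tags: dict[str, str]) -> str:
    """Encode tags as '|key:value|key:value|' (assumed format, keys sorted)."""
    parts = []
    for key, value in sorted(tags.items()):
        if not value:
            continue  # skip empty tag values
        if "|" in key or "|" in value:
            raise ValueError("tag keys and values must not contain '|'")
        parts.append(f"{key}:{value}")
    if not parts:
        return ""
    # Leading and trailing pipes let a LIKE pattern match whole tags only.
    return "|" + "|".join(parts) + "|"


def decode_tags(encoded: str) -> dict[str, str]:
    """Invert encode_tags(); values may themselves contain ':'."""
    pairs = (part.split(":", 1) for part in encoded.split("|") if part)
    return {key: value for key, value in pairs}


def tag_where_clause(filters: dict[str, str]) -> str:
    """Build LIKE conditions over a 'tags' column (hypothetical helper).

    Single quotes are doubled for SQL; LIKE wildcards ('%', '_') inside
    keys or values are not escaped in this sketch.
    """
    conditions = []
    for key, value in sorted(filters.items()):
        needle = f"|{key}:{value}|".replace("'", "''")
        conditions.append(f"tags LIKE '%{needle}%'")
    return " AND ".join(conditions)


encoded = encode_tags({"source": "atproto", "content_type": "lexicon"})
print(encoded)                                  # |content_type:lexicon|source:atproto|
print(decode_tags(encoded))
print(tag_where_clause({"source": "atproto"}))  # tags LIKE '%|source:atproto|%'
```

Wrapping every tag in pipes means a filter for `source:at` cannot accidentally match `source:atproto`, which is one reason a delimiter on both sides is a common choice for this kind of column.
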
@Ashex Ashex merged commit b7cae5b into main Feb 22, 2026
1 check passed
@Ashex Ashex deleted the feat/tag-indexing branch February 22, 2026 19:15