Overview
In practice, Semantic Filters should be able to use vector databases to accelerate their operations. Given a set of documents to filter, a naive solution could involve:
- Embedding the query
- Embedding each document and ingesting it into a vector database
- Retrieving the top-k documents or all documents with a similarity score greater than some threshold
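The three steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not PZ code: `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model, a plain Python list stands in for the vector database, and `naive_semantic_filter` is an illustrative name.

```python
import math
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def naive_semantic_filter(query, documents, k=None, threshold=None):
    """Return documents ranked by similarity, cut by top-k and/or a score threshold."""
    # Step 1: embed the query.
    query_vec = toy_embed(query)
    # Step 2: embed each document and "ingest" it (a list stands in for the vector DB).
    store = [(doc, toy_embed(doc)) for doc in documents]
    # Step 3: score every document, then keep the top-k or those above the threshold.
    scored = sorted(
        ((cosine(query_vec, vec), doc) for doc, vec in store),
        key=lambda pair: pair[0],
        reverse=True,
    )
    if threshold is not None:
        scored = [pair for pair in scored if pair[0] > threshold]
    if k is not None:
        scored = scored[:k]
    return [doc for _, doc in scored]


docs = ["red apple pie recipe", "stock market report", "apple orchard tour"]
top_one = naive_semantic_filter("apple pie", docs, k=1)
above_threshold = naive_semantic_filter("apple pie", docs, threshold=0.3)
```

With a real embedding model and vector database, steps 1 and 2 would be calls to those services; the selection logic in step 3 stays the same.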
While we have implemented support for using vector database(s) as indices in our Semantic Top-K operator, this support is currently limited to chromadb and two embedding models (text-embedding-3-small for text-only queries and the CLIP model for text / image queries).
The primary goals of this issue are two-fold:
- Implement a BaseIndex class within PZ which provides an abstraction / interface that can be implemented for any vector database and/or embedding model
- Create a physical implementation of PZ's semantic filter operator which can construct an index on-the-fly and use it to efficiently execute a semantic filter.
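One possible shape for these two goals is sketched below. Everything here is an assumption made for illustration: the `add` / `search` method names, the `InMemoryIndex` backend, the `index_backed_filter` helper, and the toy embedding function are hypothetical and do not reflect PZ's existing starter code or its operator interfaces.

```python
from __future__ import annotations

import math
from abc import ABC, abstractmethod
from collections import Counter
from typing import Callable


class BaseIndex(ABC):
    """Hypothetical index interface; a backend (e.g. a vector DB) would subclass it."""

    @abstractmethod
    def add(self, ids: list[str], items: list[str]) -> None:
        """Embed and store the given items under the given ids."""

    @abstractmethod
    def search(self, query: str, k: int) -> list[tuple[str, float]]:
        """Return up to k (id, score) pairs, best match first."""


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class InMemoryIndex(BaseIndex):
    """Illustrative backend that keeps vectors in a dict instead of a vector DB."""

    def __init__(self, embed: Callable[[str], Counter]) -> None:
        self._embed = embed
        self._vectors: dict[str, Counter] = {}

    def add(self, ids: list[str], items: list[str]) -> None:
        for item_id, item in zip(ids, items):
            self._vectors[item_id] = self._embed(item)

    def search(self, query: str, k: int) -> list[tuple[str, float]]:
        query_vec = self._embed(query)
        scored = [(i, _cosine(query_vec, v)) for i, v in self._vectors.items()]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]


def index_backed_filter(records, query, threshold, embed):
    """Build an index on the fly, then keep records scoring above the threshold."""
    index = InMemoryIndex(embed)
    ids = [str(i) for i in range(len(records))]
    index.add(ids, records)
    hits = index.search(query, k=len(records))
    keep = {item_id for item_id, score in hits if score > threshold}
    # Preserve the input order of the surviving records.
    return [r for item_id, r in zip(ids, records) if item_id in keep]


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model."""
    return Counter(text.lower().split())


records = ["red apple pie recipe", "stock market report", "apple orchard tour"]
kept = index_backed_filter(records, "apple pie", threshold=0.3, embed=toy_embed)
```

Keeping the embedding function as a constructor argument is one way to let the same interface cover different embedding models; a real backend would likely also need configuration for the underlying database.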
Secondary goals of this issue include:
- Refactor the semantic top-k operator to use the new index abstraction
- Implement the index abstraction for a few standard vector database(s) and embedding models
- Implement the index abstraction such that we can support any combination of text / image / audio queries. (There may be some fundamental limitations for queries involving text + image + audio, image + audio, and even text + audio, but for every combination where we can compute embeddings, we should aim to have an index implemented.)
Acceptance Criteria
- Implement a BaseIndex class within PZ (some starter code may already exist here).
- Refactor the Semantic Top-K physical operator to use this BaseIndex class
- Create a physical operator for Semantic Filter which constructs an index on-the-fly and uses it to perform the filter
- Modify the sem_filter() and sem_topk() functions in pz.Dataset to accept an index if the user has already constructed one outside of their PZ program.
- Aim to support as many multimodal queries with indices as possible (i.e. not just text-only and text-image queries).