Create PZ Index Class which Can Be Used by Semantic Filter and Semantic Top-K Operators  #137

@mdr223

Overview

In practice, Semantic Filters should be able to use vector databases to accelerate their operations. Given a set of documents to filter, a naive solution could involve:

  1. Embedding the query
  2. Embedding each document and ingesting it into a vector database
  3. Retrieving the top-k documents or all documents with a similarity score greater than some threshold
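
The three steps above can be sketched roughly as follows. This is a minimal, standard-library-only illustration, not PZ code: `embed` is a toy bag-of-words stand-in for a real embedding model, the in-memory list stands in for a vector database, and the names `embed`, `cosine`, and `naive_filter` are all hypothetical.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a sparse term-count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


def naive_filter(query, docs, k=None, threshold=None):
    """Rank docs by similarity to the query, then keep the top-k and/or
    those whose score is strictly greater than the threshold."""
    q = embed(query)                                # step 1: embed the query
    index = [(doc, embed(doc)) for doc in docs]     # step 2: embed and "ingest" each document
    scored = sorted(                                # step 3: retrieve by similarity
        ((cosine(q, vec), doc) for doc, vec in index),
        key=lambda pair: pair[0],
        reverse=True,
    )
    if threshold is not None:
        scored = [(s, d) for s, d in scored if s > threshold]
    if k is not None:
        scored = scored[:k]
    return [doc for _, doc in scored]
```

With a real embedding model and vector database, step 2 would typically be done once and reused across queries, which is where most of the speedup would come from.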

While our Semantic Top-K operator already supports using a vector database as an index, that support is currently limited to chromadb and two embedding models (text-embedding-3-small for text-only queries and the CLIP model for text / image queries).

The primary goals of this issue are twofold:

  1. Implement a BaseIndex class within PZ which provides an abstraction / interface that can be implemented for any vector database and/or embedding model
  2. Create a physical implementation of PZ's semantic filter operator which can construct an index on-the-fly and use it to efficiently execute a semantic filter.
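
To make the first goal concrete, here is one possible shape for the abstraction. This is only a sketch under assumptions: the method names `add` and `search`, the `InMemoryIndex` class, and the `embed_fn` parameter are hypothetical and are not taken from existing PZ code. A real implementation would wrap a vector database and an embedding model behind the same two methods.

```python
import math
from abc import ABC, abstractmethod
from typing import Callable, List, Sequence, Tuple


class BaseIndex(ABC):
    """Hypothetical interface that any vector database / embedding model pair could implement."""

    @abstractmethod
    def add(self, ids: Sequence[str], items: Sequence[str]) -> None:
        """Embed the items and store them under the given ids."""

    @abstractmethod
    def search(self, query: str, k: int) -> List[Tuple[str, float]]:
        """Return up to k (id, similarity) pairs, most similar first."""


def _cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class InMemoryIndex(BaseIndex):
    """Brute-force reference implementation; embed_fn is an assumed, user-supplied embedder."""

    def __init__(self, embed_fn: Callable[[str], Sequence[float]]):
        self._embed = embed_fn
        self._rows: List[Tuple[str, Sequence[float]]] = []

    def add(self, ids, items):
        for id_, item in zip(ids, items):
            self._rows.append((id_, self._embed(item)))

    def search(self, query, k):
        q = self._embed(query)
        scored = [(id_, _cosine(q, vec)) for id_, vec in self._rows]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]
```

Keeping the embedder as a constructor argument is one way to let the same index class serve different embedding models; whether PZ should couple or decouple the two is left open here.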

Secondary goals of this issue include:

  1. Refactor the semantic top-k operator to use the new index abstraction
  2. Implement the index abstraction for a few standard vector database(s) and embedding models
  3. Implement the index abstraction such that we can support any combination of text / image / audio queries. There may be fundamental limitations for queries involving text + image + audio, image + audio, and even text + audio, but for every combination where we can compute embeddings, we should aim to have an index implemented.

Acceptance Criteria

  • Implement a BaseIndex class within PZ (some starter code may already exist here).
  • Refactor the Semantic Top-K physical operator to use this BaseIndex class.
  • Create a physical operator for Semantic Filter which constructs an index on-the-fly and uses it to perform the filter.
  • Modify the sem_filter() and sem_topk() functions in pz.Dataset to accept an index if the user has already constructed one outside of their PZ program.
  • Aim to support as many multimodal queries with indices as possible (i.e. not just text-only and text-image queries).
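
As an illustration of the third and fourth criteria, the sketch below shows one plausible way a filter could either build an index on-the-fly or accept one the user constructed earlier. Everything here is an assumption for illustration: `sem_filter_sketch`, `WordOverlapIndex`, the `judge` callback (standing in for the expensive model call), and the `min_score` cutoff are hypothetical and do not reflect the actual signature of `sem_filter()` in pz.Dataset.

```python
from typing import Callable, List, Optional, Sequence, Tuple


class WordOverlapIndex:
    """Toy index: scores each record by the fraction of query words it contains."""

    def __init__(self, records: Sequence[str]):
        self._words = [set(r.lower().split()) for r in records]

    def search(self, query: str, k: int) -> List[Tuple[int, float]]:
        q = set(query.lower().split())
        scored = [
            (i, (len(q & words) / len(q)) if q else 0.0)
            for i, words in enumerate(self._words)
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]


def sem_filter_sketch(
    records: Sequence[str],
    predicate: str,
    judge: Callable[[str, str], bool],
    index: Optional[WordOverlapIndex] = None,
    min_score: float = 0.0,
) -> List[str]:
    """Use the index to prune candidates, then run the expensive judge only on survivors."""
    if index is None:
        # No prebuilt index was supplied, so construct one on-the-fly.
        index = WordOverlapIndex(records)
    hits = index.search(predicate, k=len(records))
    candidates = sorted(i for i, score in hits if score > min_score)
    return [records[i] for i in candidates if judge(records[i], predicate)]
```

Note the trade-off in this design: pruning by similarity reduces the number of `judge` calls, but any record the index scores at or below `min_score` is never judged, so a cutoff that is too aggressive can lower recall.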

Labels

enhancement (New feature or request)
