Implement Optimized Semantic Equijoin Operator #269

@mdr223

Description

The goal for this issue is to create an optimized version of an equijoin in join.py: https://github.com/mitdbg/palimpzest/blob/main/src/palimpzest/query/operators/join.py

Currently join.py contains two join implementations:

  • NestedLoopsJoin, which incurs O(N*M) LLM invocations
  • EmbeddingJoin, which incurs O(N*M/c) LLM invocations (where c is a constant factor accounting for the fact that a fraction of the input pairs will be filtered out or joined automatically based on their embedding similarity)
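To make the gap concrete, here is a rough back-of-the-envelope sketch comparing the number of LLM invocations for a pairwise join against a strategy that calls the LLM once per record. The function names are hypothetical and are not part of palimpzest; the EmbeddingJoin cost is left out because the constant c depends on the data.

```python
def nested_loops_calls(n_left: int, n_right: int) -> int:
    """One LLM call per (left, right) pair: O(N*M)."""
    return n_left * n_right


def per_record_calls(n_left: int, n_right: int) -> int:
    """One LLM call per input record: O(N+M)."""
    return n_left + n_right


for n, m in [(5, 5), (1_000, 1_000)]:
    print(
        f"N={n}, M={m}: "
        f"pairwise={nested_loops_calls(n, m)}, "
        f"per-record={per_record_calls(n, m)}"
    )
```

For two inputs of 1,000 records each, that is 1,000,000 pairwise calls versus 2,000 per-record calls.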

While the issue is open-ended, a good first approach would be an O(N+M) strategy in which a join key is extracted from each left and right record exactly once using an LLM. All subsequent join evaluations are then performed by checking whether left_record.join_key == right_record.join_key, with no further LLM calls.

A key issue is that the extracted value(s) for the join key may differ in subtle ways. For instance, an LLM extracting the animal in an image might extract "Tiger" for one image of a tiger and "Bengal Tiger" for another. Some experimentation will be needed to come up with a robust strategy that accounts for such differences.
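As a starting point, the key-based strategy could be sketched as a hash join over normalized keys. This is a minimal illustration under stated assumptions, not the palimpzest operator interface: records are plain dicts, `key_equijoin` and `normalize_key` are hypothetical names, and `fake_extract_key` stands in for the single LLM extraction call per record.

```python
import re
from collections import defaultdict
from typing import Callable


def normalize_key(raw: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9\s]", " ", raw.lower())
    return " ".join(cleaned.split())


def key_equijoin(left: list[dict], right: list[dict], extract_key: Callable[[dict], str]):
    """Pair up records whose normalized keys are equal.

    extract_key is called exactly once per record, so N + M times in total.
    """
    # Build a hash index over the right side.
    index = defaultdict(list)
    for r in right:
        index[normalize_key(extract_key(r))].append(r)

    # Probe the index with each left record.
    matches = []
    for l in left:
        for r in index.get(normalize_key(extract_key(l)), []):
            matches.append((l, r))
    return matches


# Stand-in for the LLM call (assumption): the key is already stored on the record.
call_log: list[str] = []


def fake_extract_key(record: dict) -> str:
    call_log.append(record["id"])
    return record["key"]


left = [{"id": "img1", "key": "Tiger"}, {"id": "img2", "key": "Bengal Tiger"}]
right = [{"id": "doc1", "key": " tiger."}, {"id": "doc2", "key": "Lion"}]

pairs = key_equijoin(left, right, fake_extract_key)
matched_ids = [(l["id"], r["id"]) for l, r in pairs]
print(matched_ids, len(call_log))
```

Simple normalization handles differences in case, whitespace, and punctuation ("Tiger" vs. " tiger."), but "Bengal Tiger" still fails to match "tiger". Closing that gap (for example by canonicalizing keys or comparing them more loosely) is the part that needs experimentation.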

In order to evaluate an implementation of a novel join algorithm, a script like the following may be used:

from pydantic import BaseModel, Field
import pandas as pd
import palimpzest as pz
from palimpzest.query.operators.join import YourJoinOp

# datasets of movie reviews and actor descriptions
movie_reviews = [
    {"review": "Inception is a mind-bending thriller that blurs the lines between dreams and reality. A must-watch!"},
    {"review": "The Devil Wears Prada is a sharp and witty look into the fashion industry, with standout performances."},
    {"review": "Titanic is a heartbreaking love story set against the backdrop of a historic tragedy. Truly moving."},
    {"review": "The Dark Knight redefined the superhero genre with its intense action and complex characters."},
    {"review": "Training Day is a gritty crime drama that keeps you on the edge of your seat from start to finish."},
]
actor_descriptions = [
    {"actor": "Tom Cruise is an American actor and producer known for his roles in action films such as 'Top Gun' and the 'Mission: Impossible' series."},
    {"actor": "Meryl Streep is an acclaimed American actress recognized for her versatility and roles in films like 'The Devil Wears Prada' and 'Sophie's Choice'."},
    {"actor": "Leonardo DiCaprio is an American actor and film producer known for his performances in 'Titanic', 'Inception', and 'The Revenant'."},
    {"actor": "Scarlett Johansson is an American actress and singer, famous for her roles in 'Lost in Translation' and as Black Widow in the Marvel Cinematic Universe."},
    {"actor": "Denzel Washington is an American actor and director known for his powerful performances in films like 'Training Day' and 'Malcolm X'."},
]

# create DataRecords for movie reviews
movie_ds = pz.MemoryDataset(id="movies", vals=movie_reviews)
output = movie_ds.run()
left_candidates = [dr for dr in output]

# create DataRecords for actor descriptions
actor_ds = pz.MemoryDataset(id="actors", vals=actor_descriptions)
output = actor_ds.run()
right_candidates = [dr for dr in output]

# execute semantic join with your operator
class JoinSchema(BaseModel):
    review: str = Field(description="A movie review")
    actor: str = Field(description="A sentence about an actor")

join_op = YourJoinOp(
    input_schema=JoinSchema,
    output_schema=JoinSchema,
    model=pz.Model.GPT_4o_MINI,
    condition="The actor appears in the movie being reviewed",
)
output, _ = join_op(left_candidates, right_candidates, final=True)
output_df = pd.DataFrame([dr.to_dict() for dr in output])
print(output_df)
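To turn the printed output into a score, one option is to compare the joined pairs against the pairs that the actor descriptions above support. The helper below is a sketch with a hypothetical name (`precision_recall`); the expected pairs are read off the films named in each actor description, and the predicted pairs are made up purely to show the calculation. Mapping the operator's output records to (movie, actor) pairs is left out, since it depends on the output format.

```python
def precision_recall(predicted, expected) -> tuple[float, float]:
    """Precision and recall of predicted join pairs against expected pairs."""
    predicted, expected = set(predicted), set(expected)
    true_positives = len(predicted & expected)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall


# Pairs supported by the actor descriptions in the script above.
expected_pairs = {
    ("Inception", "Leonardo DiCaprio"),
    ("The Devil Wears Prada", "Meryl Streep"),
    ("Titanic", "Leonardo DiCaprio"),
    ("Training Day", "Denzel Washington"),
}

# Hypothetical operator output (illustrative only): one miss, one false positive.
predicted_pairs = {
    ("Inception", "Leonardo DiCaprio"),
    ("Titanic", "Leonardo DiCaprio"),
    ("Training Day", "Denzel Washington"),
    ("The Dark Knight", "Tom Cruise"),
}

precision, recall = precision_recall(predicted_pairs, expected_pairs)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```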
