Implement Optimized Semantic Equijoin Operator

The goal of this issue is to implement an optimized equijoin operator in `join.py`: https://github.com/mitdbg/palimpzest/blob/main/src/palimpzest/query/operators/join.py

Currently, `join.py` contains two join implementations:
- `NestedLoopsJoin`, which incurs `O(N*M)` LLM invocations
- `EmbeddingJoin`, which incurs `O(N*M/c)` LLM invocations (where `c` is a constant factor accounting for the fraction of input pairs that are filtered out or joined automatically based on their embedding similarity)
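To make the `O(N*M)` cost concrete, here is a minimal, self-contained sketch of a nested-loops join that counts predicate evaluations. It is not the `NestedLoopsJoin` implementation from `join.py`; `nested_loops_join` and `fake_llm_predicate` are hypothetical names, and the predicate is a plain substring check standing in for an LLM call.

```python
from itertools import product


def nested_loops_join(left, right, llm_predicate):
    """Evaluate the predicate on every (left, right) pair and count the calls."""
    calls = 0
    matches = []
    for l_rec, r_rec in product(left, right):
        calls += 1  # one (simulated) LLM invocation per candidate pair
        if llm_predicate(l_rec, r_rec):
            matches.append((l_rec, r_rec))
    return matches, calls


def fake_llm_predicate(movie, actor_sentence):
    # Stand-in for an LLM call: a simple case-insensitive substring check.
    return movie.lower() in actor_sentence.lower()


left = ["Inception", "Titanic", "Training Day"]
right = [
    "DiCaprio starred in Titanic and Inception",
    "Washington starred in Training Day",
]

matches, calls = nested_loops_join(left, right, fake_llm_predicate)
print(f"{calls} predicate calls for {len(left)} x {len(right)} inputs")
```

The call count always equals `len(left) * len(right)`, regardless of how many pairs actually match.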

While the issue is open-ended, a good first approach would be an `O(N+M)` strategy in which a join key is extracted from each left and right record exactly once using an LLM. Every join evaluation is then performed by checking whether `left_record.join_key == right_record.join_key`, with no further LLM calls. A key difficulty is that the extracted join-key values may differ in subtle ways. For instance, an LLM extracting the animal in an image might return `"Tiger"` for one image of a tiger and `"Bengal Tiger"` for another. Some experimentation will be needed to come up with a robust strategy that accounts for such differences.
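The key-extraction strategy described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not Palimpzest code: `key_equijoin`, `normalize_key`, and `extract_key` are hypothetical names, the records are plain dicts, and the extractor is a dictionary lookup standing in for a single LLM call per record.

```python
from collections import defaultdict


def normalize_key(key: str) -> str:
    """Cheap normalization (assumption): lowercase and collapse whitespace."""
    return " ".join(key.lower().split())


def key_equijoin(left, right, extract_key):
    """Hash join on an extracted key; extract_key runs once per record."""
    calls = 0
    index = defaultdict(list)
    for r_rec in right:
        calls += 1  # one (simulated) LLM extraction per right record
        index[normalize_key(extract_key(r_rec))].append(r_rec)

    matches = []
    for l_rec in left:
        calls += 1  # one (simulated) LLM extraction per left record
        for r_rec in index.get(normalize_key(extract_key(l_rec)), []):
            matches.append((l_rec, r_rec))
    return matches, calls


# Stand-in for LLM key extraction: read a precomputed field.
def extract_key(record):
    return record["animal"]


left = [
    {"id": "img1", "animal": "Tiger"},
    {"id": "img2", "animal": "  bengal  tiger"},
]
right = [
    {"id": "fact1", "animal": "tiger"},
    {"id": "fact2", "animal": "Bengal Tiger"},
]

matches, calls = key_equijoin(left, right, extract_key)
matched_ids = [(l["id"], r["id"]) for l, r in matches]
print(matched_ids, calls)
```

Extraction runs `len(left) + len(right)` times in total. Note that simple normalization fixes casing and whitespace but still treats `"tiger"` and `"bengal tiger"` as different keys, so `img1` is not joined with `fact2`; this is exactly the kind of mismatch a robust strategy would need to handle.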

To evaluate an implementation of a new join algorithm, a script like the following may be used:
```python
from pydantic import BaseModel, Field
import pandas as pd
import palimpzest as pz
# NOTE: YourJoinOp is a placeholder; replace it with the name of your join operator class
from palimpzest.query.operators.join import YourJoinOp

# datasets of movie reviews and actor descriptions
movie_reviews = [
    {"review": "Inception is a mind-bending thriller that blurs the lines between dreams and reality. A must-watch!"},
    {"review": "The Devil Wears Prada is a sharp and witty look into the fashion industry, with standout performances."},
    {"review": "Titanic is a heartbreaking love story set against the backdrop of a historic tragedy. Truly moving."},
    {"review": "The Dark Knight redefined the superhero genre with its intense action and complex characters."},
    {"review": "Training Day is a gritty crime drama that keeps you on the edge of your seat from start to finish."},
]
actor_descriptions = [
    {"actor": "Tom Cruise is an American actor and producer known for his roles in action films such as 'Top Gun' and the 'Mission: Impossible' series."},
    {"actor": "Meryl Streep is an acclaimed American actress recognized for her versatility and roles in films like 'The Devil Wears Prada' and 'Sophie's Choice'."},
    {"actor": "Leonardo DiCaprio is an American actor and film producer known for his performances in 'Titanic', 'Inception', and 'The Revenant'."},
    {"actor": "Scarlett Johansson is an American actress and singer, famous for her roles in 'Lost in Translation' and as Black Widow in the Marvel Cinematic Universe."},
    {"actor": "Denzel Washington is an American actor and director known for his powerful performances in films like 'Training Day' and 'Malcolm X'."},
]

# create DataRecords for movie reviews
movie_ds = pz.MemoryDataset(id="movies", vals=movie_reviews)
output = movie_ds.run()
left_candidates = [dr for dr in output]

# create DataRecords for actor descriptions
actor_ds = pz.MemoryDataset(id="actors", vals=actor_descriptions)
output = actor_ds.run()
right_candidates = [dr for dr in output]

# execute semantic join with your operator
class JoinSchema(BaseModel):
    review: str = Field(description="A movie review")
    actor: str = Field(description="A sentence about an actor")

join_op = YourJoinOp(
    input_schema=JoinSchema,
    output_schema=JoinSchema,
    model=pz.Model.GPT_4o_MINI,
    condition="The actor appears in the movie being reviewed",
)
output, _ = join_op(left_candidates, right_candidates, final=True)
output_df = pd.DataFrame([dr.to_dict() for dr in output])
print(output_df)
```

Implement Optimized Semantic Equijoin Operator #269
