Skip to content

Conversation

@rileyok-ons
Copy link
Collaborator

✨ Summary

Resolves #130
Adds partial "starts_with" matching to the reverse searches to check for hierarchical queries

📜 Changes Introduced

  • Rewrites the join call to be join_with, to see if the doc label starts with query label
  • Feature implementation (feat:) / bug fix (fix:) / refactoring (chore:) / documentation (docs:) / testing (test:)
  • Updates to tests and/or documentation
  • Terraform changes (if applicable)

✅ Checklist

Please confirm you've completed these checks before requesting a review.

  • Code passes linting with Ruff
  • Security checks pass using Bandit
  • API and Unit tests are written and pass using pytest
  • Terraform files (if applicable) follow best practices and have been validated (terraform fmt & terraform validate)
  • DocStrings follow Google-style and are added as per Pylint recommendations
  • Documentation has been updated if needed

🔍 How to Test

Attempt using in both api and reverse search demos, should preserve order and do hierarchical matching on demo codes

@rileyok-ons rileyok-ons linked an issue Feb 11, 2026 that may be closed by this pull request
@lukeroantreeONS lukeroantreeONS self-requested a review February 12, 2026 10:45
Copy link
Collaborator

@lukeroantreeONS lukeroantreeONS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Behaves as described, and implemented well. Happy to approve.

Tested based on DEMO notebook;

Image

Couple of small points:

  • could you update the documentation (README, Demo general_workflow notebook, reverse_search docstring) to mention this functionality so it's visible to users?
  • currently the dataframe we return is sorted by doc_id - I think it would be more intuitive if it was returned sorted first by id, then by doc_id. Do you agree? If not, happy to approve this now and talk later, and if so up to you whether you think it would be good to add to this PR or list as a separate task for a later date.

@rileyok-ons
Copy link
Collaborator Author

Behaves as described, and implemented well. Happy to approve.

Tested based on DEMO notebook;

Image One point though - currently the dataframe we return is sorted by `doc_id` - I think it would be more intuitive if it was returned sorted first by `id`, then by `doc_id`. Do you agree? If not, happy to approve this now and talk later, and if so up to you whether you think it would be good to add to this PR or list as a separate task for a later date.

This might be because of the join being performed right to left, will check if it makes a difference, something to note is there's a lot less explicit control for the matching and order than in a standard join, will look into if the paradigm is different

@lukeroantreeONS
Copy link
Collaborator

This might be because of the join being performed right to left, will check if it makes a difference, something to note is there's a lot less explicit control for the matching and order than in a standard join, will look into if the paradigm is different

I think your join is great, I was thinking more of a possible final sorting step on the final_table variable, e.g.

final_table = final_table.sort(
    by=["id", "doc_id"],
    descending=[False, False]
)

@rileyok-ons
Copy link
Collaborator Author

This might be because of the join being performed right to left, will check if it makes a difference, something to note is there's a lot less explicit control for the matching and order than in a standard join, will look into if the paradigm is different

I think your join is great, I was thinking more of a possible final sorting step on the final_table variable, e.g.

final_table = final_table.sort(
    by=["id", "doc_id"],
    descending=[False, False]
)

I see what you mean, will put it in as default and can can set as optional later ideas change

@rileyok-ons rileyok-ons marked this pull request as ready for review February 12, 2026 16:14
Copy link
Contributor

@frayle-ons frayle-ons left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, like the code and agree that it should be sorted by 'id' first and the sub-sorted by 'doc_id'

only suggestion is that we add a boolean parameter to the method to switch between exact matching and partial matching


return results_df

def reverse_search(self, query: VectorStoreReverseSearchInput, n_results=100) -> VectorStoreReverseSearchOutput:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be appropriate to add a new method parameter 'do_partial' (or other name) and have it default to false?

so by default only full id matching results are returned, but if true then this partial matching code activates?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add Partial matching to reverse search

3 participants