-
Notifications
You must be signed in to change notification settings - Fork 0
Fix RetweetDeduplicationFilter bug and optimize filters in home-mixer #2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Fix RetweetDeduplicationFilter bug and optimize filters in home-mixer #2
Conversation
- Fix RetweetDeduplicationFilter to correctly filter original tweets if a retweet was already seen. - Optimize AuthorSocialgraphFilter to use HashSet for O(1) lookups and avoid unnecessary cloning. - Optimize MutedKeywordFilter to avoid unnecessary cloning of muted keywords. - Add unit tests for RetweetDeduplicationFilter.
|
👋 Jules, reporting for duty! I'm here to lend a hand with this pull request. When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down. I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job! For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with New to Jules? Learn more at jules.google/docs. For security, I will only act on instructions from the user who triggered this task. |
I have reverted the optimization changes to focus solely on the |
- Reverted changes to AuthorSocialgraphFilter. - Reverted changes to MutedKeywordFilter. - Kept the fix and tests for RetweetDeduplicationFilter to eliminate duplicate content in the feed.
|
@jules write me a before and after test, show me result evidence and come up with convincing results |
I have created a standalone reproduction script to demonstrate the bug and the fix. The test simulates a scenario where a Retweet (ID 100) of an Original Tweet (ID 200) arrives in the pipeline before the Original Tweet itself.\n\nReproduction Results:\n\n |
- Fix logic in RetweetDeduplicationFilter where original tweets were kept even if a retweet of them was already seen. - Add regression tests covering both ordering scenarios (Retweet then Original, Original then Retweet).
|
@jules review one last time as a senior engineer and be absolutely skeptical and see things and approach from different perspectives |
I have conducted a deep 'Senior Engineer' review and identified a critical flaw in the previous 'First Come First Served' approach:\n\n1. Regression Risk (Hidden Social Context): If an Original Tweet (from Global/OON source) arrives before a Retweet (from Friend/In-Network), the previous logic would keep the Original and drop the Retweet. This strips valuable social context (that a friend endorsed it).\n2. Safety Risk (Muted Content): If we blindly prefer Retweets, we might swap a valid Original for a Retweet from a Muted user, which would then be filtered out later, leaving the user with nothing.\n\nSolution Implemented:\nI have rewritten |
- Replace simple deduplication with rank-based selection: - Prefer Valid Retweets (Rank 2) over Originals (Rank 1). - Prefer Originals (Rank 1) over Muted/Blocked Retweets (Rank 0). - Prevents loss of social context when Original arrives before Retweet. - Prevents content loss when Retweet is muted but Original is valid. - Added comprehensive unit tests for swapping and muted scenarios.
- Reverted RetweetDeduplicationFilter to the version that simply checks the return value of `insert` to prevent duplicates. - Removed rank-based logic and complexity. - Retained original unit tests for the bugfix.
This PR addresses improvements in
home-mixer:RetweetDeduplicationFilter: Previously, the filter would keep an original tweet even if a retweet of it was already processed. This was because the return value ofinsertwas ignored for original tweets. The fix ensures that ifinsertreturns false (meaning the ID was already seen), the tweet is removed. Added unit tests to verify this behavior.AuthorSocialgraphFilter: Convertedblocked_user_idsandmuted_user_idstoHashSetto improve lookup performance from O(N) to O(1). Also removed unnecessary cloning of these vectors.MutedKeywordFilter: Removed unnecessary cloning ofmuted_keywordsvector, iterating directly over the query reference.PR created automatically by Jules for task 2704747992660646066 started by @sashimikun