Hi LOTUS team—thank you for the excellent work and for releasing the code! I’m trying to reproduce the Biomedical Multi-label Classification evaluation (Tables 3 & 4) and I may be misconfiguring something.
Bug Description
My run triggers many more sem_filter LLM calls than reported, and I’d be grateful for your guidance.
My setup
Data: 250 articles × ~24,300 reactions ≈ ~6.0M pairs
Operator: sem_join with optimization (§3.2)
LM: meta-llama/llama-3.3-70b-instruct
RM (embeddings): intfloat/e5-base-v2
map_instruction: default (not provided)
join_instruction: "The article {medical_article} indicates that the patient is experiencing the {reaction}"
cascade_args: CascadeArgs(recall_target=0.7, precision_target=0.7, sampling_percentage=0.0001)
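For completeness, here is a minimal sketch of how I'm invoking the join. The file names and provider routing for the Llama endpoint are placeholders, and I may be holding the API wrong, so please treat this as my assumptions rather than a verified repro:

```python
import pandas as pd
import lotus
from lotus.models import LM, SentenceTransformersRM
from lotus.types import CascadeArgs

# Placeholder inputs: articles_df has a "medical_article" column,
# reactions_df has a "reaction" column.
articles_df = pd.read_csv("articles.csv")
reactions_df = pd.read_csv("reactions.csv")

# LM for the join/filter calls, embedding model as the cascade helper.
lm = LM(model="meta-llama/llama-3.3-70b-instruct")
rm = SentenceTransformersRM(model="intfloat/e5-base-v2")
lotus.settings.configure(lm=lm, rm=rm)

cascade_args = CascadeArgs(
    recall_target=0.7,
    precision_target=0.7,
    sampling_percentage=0.0001,
)

# map_instruction is left at its default.
joined, stats = articles_df.sem_join(
    reactions_df,
    "The article {medical_article} indicates that the patient is experiencing the {reaction}",
    cascade_args=cascade_args,
    return_stats=True,
)
```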
I first ran sem_join with optimization and completed threshold learning.
After thresholding, 24,852 low-confidence pairs remained, which must then be resolved via sem_filter.
This implies 24,852 sem_filter LLM calls, whereas the paper reports 5,290 for the complete join process. I likely missed a configuration or optimization detail.
Expected Behavior
My understanding is that, after threshold learning and the approximation step, the number of low-confidence pairs should be much smaller than 24,852. Is that correct?
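To make sure we're on the same page, here is the thresholding step as I understand it from §3.2, paraphrased in code. This is my mental model, not LOTUS internals, and all names here are mine:

```python
import numpy as np

def partition_by_thresholds(proxy_scores: np.ndarray, t_pos: float, t_neg: float):
    """Split candidate pairs by the learned cascade thresholds."""
    high_conf_pos = proxy_scores >= t_pos        # accepted with no LLM call
    high_conf_neg = proxy_scores <= t_neg        # rejected with no LLM call
    low_conf = ~(high_conf_pos | high_conf_neg)  # escalated to sem_filter
    return high_conf_pos, high_conf_neg, low_conf
```

If that reading is right, the 24,852 figure is simply the size of the middle band between the two thresholds, which is why I expected tighter thresholds (or an additional pruning step) to be the lever here.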
Questions
Could you please share the exact experimental settings used to reproduce the results in Tables 3, 4, and 5? I’d also appreciate any guidance on what I might be doing incorrectly, and whether there are additional steps to further reduce LLM calls in sem_filter.
Further context
"msf_t_pos": 0.8734673261642456,
"msf_t_neg": 0.8317583799362183,
"num_helper_high_conf_neg": 6043330,
"join_optimization_cost": 857
Python Version:
3.10
Package Versions:
Checklist
Checked