Summary
_vendor_build_dataloader (openadmet/models/features/chemprop.py, lines 84–88) sets drop_last=True whenever len(dataset) % batch_size == 1. It does so whether the dataloader is used for training or inference, and whether or not batch normalization is actually enabled (batch_norm=False is the default in ChemPropModel). This causes three distinct failure modes (Bugs A–C below). A fourth, unrelated logic bug in the sampler code is reported at the end.
Bug A — Hard Crash in Inference
File: openadmet/models/inference/inference.py line 276
Trigger: any inference call where N % batch_size == 1
ChemPropFeaturizer.featurize() always returns indices = np.arange(len(smiles)) — length N — but when drop_last=True the DataLoader only processes N−1 molecules. model.predict() therefore returns shape (N−1, n_tasks). When the inference pipeline constructs the output Series:
# inference.py line 276
data[predictions_tag] = pd.Series(predictions[:, j], index=X_indices)
#                                 ↑ N-1 values       ↑ N indices
this raises:
ValueError: Length of values (N-1) does not match length of index (N)
The same crash also affects the std column (line 277) and all acquisition function results (line 290).
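The length mismatch can be reproduced without ChemProp. The sketch below is a minimal stand-in, assuming N = 129 and a single task. The `predictions` and `X_indices` arrays are placeholders that only mimic the shapes described above; they are not output from the real pipeline.

```python
import numpy as np
import pandas as pd

n, batch_size, n_tasks = 129, 128, 1  # 129 % 128 == 1 triggers the heuristic

# Placeholder for featurize(): the index array covers all N molecules.
X_indices = np.arange(n)

# Placeholder for model.predict(): the trailing one-item batch was dropped.
predictions = np.zeros((n - 1, n_tasks))

try:
    pd.Series(predictions[:, 0], index=X_indices)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

Any fix that keeps the two lengths equal (no dropped batch, or an index array trimmed to match) avoids the exception.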
Affected dataset sizes (first several for each common batch size):
| batch_size | Affected N |
|---|---|
| 128 (default) | 129, 257, 385, 513, 641, 769, 897, … |
| 64 | 65, 129, 193, 257, 321, 385, 449, … |
| 32 | 33, 65, 97, 129, 161, 193, 225, … |
For the default batch_size=128, one in every 128 dataset sizes (approximately 0.78%) triggers this crash.
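The affected sizes and the trigger rate follow directly from the condition N % batch_size == 1. The helpers below are hypothetical names written for this report, not part of openadmet; they only restate the arithmetic.

```python
def triggers_drop_last(n: int, batch_size: int) -> bool:
    """Condition described in the summary: the trailing batch would hold one item."""
    return n % batch_size == 1


def affected_sizes(batch_size: int, count: int) -> list[int]:
    """First `count` dataset sizes larger than batch_size that satisfy the condition."""
    return [k * batch_size + 1 for k in range(1, count + 1)]


def trigger_fraction(batch_size: int, max_n: int) -> float:
    """Share of dataset sizes in 1..max_n that satisfy the condition."""
    hits = sum(triggers_drop_last(n, batch_size) for n in range(1, max_n + 1))
    return hits / max_n


print(affected_sizes(128, 7))
print(affected_sizes(64, 7))
print(f"{trigger_fraction(128, 128 * 100):.4%}")
```

Note that N = 1 also satisfies the condition when batch_size > 1, so a single-molecule call would lose its only row.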
Bug B — Silent Metric Corruption in Cross-Validation
Files: openadmet/models/eval/cross_validation.py lines 628, 648–661; openadmet/models/eval/eval_base.py lines 110–136
Trigger: any validation fold where N_val % batch_size == 1
y_pred_fold has shape (N_val−1, n_tasks) but y_val has shape (N_val, n_tasks). This hits the shape-mismatch branch in get_t_true_and_t_pred:
# eval_base.py line 110
if y_true.shape[0] != y_pred.shape[0]:
    # Silently reinterprets the task as pairwise ranking
    t_true = np.array([
        y_true[i, task_id] - y_true[j, task_id]
        for i in range(N)
        for j in range(N)
    ])
    t_pred = y_pred[:, task_id]
All metrics (MSE, MAE, R², Kendall τ, Spearman ρ) are then computed over pairwise differences of ground-truth values rather than absolute predictions. The resulting numbers are meaningless and incomparable to any other evaluation run. No warning is emitted at the metric computation level.
Bug C — Silent Training Data Truncation
Trigger: training set where N_train % batch_size == 1
One molecule is silently excluded from every training epoch:
- With shuffle=False (the default): always the same last molecule, which never contributes to any gradient update
- With shuffle=True (used in ensemble bootstrap paths): a different molecule is excluded each epoch, causing each ensemble member to train on an inconsistent N−1 subset
No warning or log message is emitted in either case.
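The two cases can be illustrated with a small simulation. This is a sketch under stated assumptions: `dropped_indices` is a hypothetical helper that models batching over a plain index list with Python's `random` module, not the actual DataLoader or sampler used in training.

```python
import random


def dropped_indices(n: int, batch_size: int, shuffle: bool, seed: int, epochs: int) -> list[int]:
    """Return the index left out of each epoch when the incomplete trailing batch is dropped."""
    rng = random.Random(seed)
    dropped = []
    for _ in range(epochs):
        order = list(range(n))
        if shuffle:
            rng.shuffle(order)
        n_kept = (n // batch_size) * batch_size  # drop_last=True keeps full batches only
        dropped.extend(order[n_kept:])
    return dropped


# N = 129, batch_size = 128: exactly one index is dropped per epoch.
print(dropped_indices(129, 128, shuffle=False, seed=0, epochs=3))
print(dropped_indices(129, 128, shuffle=True, seed=0, epochs=3))
```

Without shuffling the dropped index is always N−1; with shuffling it is whichever index happens to land last in that epoch's ordering.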
Bonus Bug — Sampler Logic Is Inverted
File: openadmet/models/features/chemprop.py lines 71–77
if sampler is not None:  # ← condition is inverted; should be: if sampler is None
    if class_balance:
        sampler = ClassBalanceSampler(dataset.Y, seed, shuffle)
    elif shuffle and seed is not None:
        sampler = SeededSampler(len(dataset), seed)
    else:
        sampler = None
dataset_to_dataloader always passes sampler=None (the default), so the block is never entered in normal usage. ClassBalanceSampler and SeededSampler are completely dead code. When a custom sampler is passed, the block overwrites or nullifies it. This was introduced during vendoring — the upstream chemprop source correctly uses if sampler is None:.
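A sketch of the intended behavior with the condition corrected is shown here. The two sampler classes are hypothetical stand-ins defined only so the example runs on its own (they are not the chemprop classes), and `choose_sampler` is an illustrative name rather than a function in the vendored code.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class ClassBalanceSampler:  # hypothetical stand-in, not the chemprop class
    Y: Any
    seed: Optional[int]
    shuffle: bool


@dataclass
class SeededSampler:  # hypothetical stand-in, not the chemprop class
    n: int
    seed: int


def choose_sampler(sampler, targets, class_balance, shuffle, seed):
    """Build a default sampler only when the caller did not supply one."""
    if sampler is None:  # corrected condition
        if class_balance:
            sampler = ClassBalanceSampler(targets, seed, shuffle)
        elif shuffle and seed is not None:
            sampler = SeededSampler(len(targets), seed)
    return sampler


custom = object()
print(choose_sampler(custom, [0, 1], True, True, 0) is custom)
print(choose_sampler(None, [0, 1], False, True, 7))
```

With this condition a caller-supplied sampler passes through untouched, and the two built-in samplers become reachable again.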
Test Coverage
No existing test exercises ChemPropFeaturizer with a dataset size where N % batch_size == 1. All integration test datasets happen to be safe: N=30, N=999, N=2419 (none satisfy N ≡ 1 mod 128).
Recommended Fixes
- Remove the automatic drop_last heuristic: expose it as an explicit drop_last=False parameter on _vendor_build_dataloader and dataset_to_dataloader. Callers that need it for batch norm can opt in explicitly and only during training.
- Fix featurize() to return correct indices: if drop_last=True is ever retained, return indices[:-1] so the returned index array always matches what the DataLoader actually processes.
- Fix the inverted sampler condition: change line 71 from if sampler is not None: to if sampler is None:.
- Add regression tests covering N ≡ 1 (mod batch_size) for the featurize, inference, and cross-validation paths.
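A rough shape for the explicit opt-in and for a regression check is sketched below. It is a pure-Python model of the batching rule only; `batch_indices`, `processed_count`, and the test name are hypothetical, and a real test would go through ChemPropFeaturizer and the actual DataLoader.

```python
def batch_indices(n: int, batch_size: int, drop_last: bool = False) -> list[list[int]]:
    """Split range(n) into batches; drop the incomplete trailing batch only on request."""
    batches = [list(range(i, min(i + batch_size, n))) for i in range(0, n, batch_size)]
    if drop_last and batches and len(batches[-1]) < batch_size:
        batches.pop()
    return batches


def processed_count(n: int, batch_size: int, drop_last: bool = False) -> int:
    """Number of rows that actually reach the model."""
    return sum(len(batch) for batch in batch_indices(n, batch_size, drop_last))


def test_no_rows_lost_when_n_is_one_mod_batch_size():
    for batch_size in (32, 64, 128):
        for k in (1, 2, 3):
            n = k * batch_size + 1
            # Default: every row is processed, so indices and predictions align.
            assert processed_count(n, batch_size) == n
            # Explicit opt-in: exactly one row is dropped.
            assert processed_count(n, batch_size, drop_last=True) == n - 1


test_no_rows_lost_when_n_is_one_mod_batch_size()
```

Keeping drop_last=False as the default means inference and evaluation paths see all N rows unless a training caller asks otherwise.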