Skip to content

[BUG] drop_last in ChemPropFeaturizer causes inference crash and silent metric corruption #517

@smcolby

Description

@smcolby

Summary

_vendor_build_dataloader (openadmet/models/features/chemprop.py lines 84–88) unconditionally sets drop_last=True whenever len(dataset) % batch_size == 1, regardless of whether the dataloader is used for training or inference, and regardless of whether batch normalization is actually enabled (batch_norm=False is the default in ChemPropModel). This causes three distinct failure modes and a fourth unrelated logic bug in the sampler code.


Bug A — Hard Crash in Inference

File: openadmet/models/inference/inference.py line 276
Trigger: any inference call where N % batch_size == 1

ChemPropFeaturizer.featurize() always returns indices = np.arange(len(smiles)) — length N — but when drop_last=True the DataLoader only processes N−1 molecules. model.predict() therefore returns shape (N−1, n_tasks). When the inference pipeline constructs the output Series:

# inference.py line 276
data[predictions_tag] = pd.Series(predictions[:, j], index=X_indices)
#                                  ↑ N-1 values         ↑ N indices

this raises:

ValueError: Length of values (N-1) does not match length of index (N)

The same crash also affects the std column (line 277) and all acquisition function results (line 290).

Affected dataset sizes (first several for each common batch size):

batch_size Affected N
128 (default) 129, 257, 385, 513, 641, 769, 897, …
64 65, 129, 193, 257, 321, 385, 449, …
32 33, 97, 161, 225, 289, 353, 417, …

For the default batch_size=128, approximately 0.78% of all possible dataset sizes trigger this crash.


Bug B — Silent Metric Corruption in Cross-Validation

Files: openadmet/models/eval/cross_validation.py lines 628, 648–661; openadmet/models/eval/eval_base.py lines 110–136
Trigger: any validation fold where N_val % batch_size == 1

y_pred_fold has shape (N_val−1, n_tasks) but y_val has shape (N_val, n_tasks). This hits the shape-mismatch branch in get_t_true_and_t_pred:

# eval_base.py line 110
if y_true.shape[0] != y_pred.shape[0]:
    # Silently reinterprets the task as pairwise ranking
    t_true = np.array([
        y_true[i, task_id] - y_true[j, task_id]
        for i in range(N)
        for j in range(N)
    ])
    t_pred = y_pred[:, task_id]

All metrics (MSE, MAE, R², Kendall τ, Spearman ρ) are then computed over pairwise differences of ground-truth values rather than absolute predictions. The resulting numbers are meaningless and incomparable to any other evaluation run. No warning is emitted at the metric computation level.


Bug C — Silent Training Data Truncation

Trigger: training set where N_train % batch_size == 1

One molecule is silently excluded from every training epoch:

  • With shuffle=False (the default): always the same last molecule, which never contributes to any gradient update
  • With shuffle=True (used in ensemble bootstrap paths): a different molecule is excluded each epoch, causing each ensemble member to train on an inconsistent N−1 subset

No warning or log message is emitted in either case.


Bonus Bug — Sampler Logic Is Inverted

File: openadmet/models/features/chemprop.py lines 71–77

if sampler is not None:   # ← condition is inverted; should be: if sampler is None
    if class_balance:
        sampler = ClassBalanceSampler(dataset.Y, seed, shuffle)
    elif shuffle and seed is not None:
        sampler = SeededSampler(len(dataset), seed)
    else:
        sampler = None

dataset_to_dataloader always passes sampler=None (the default), so the block is never entered in normal usage. ClassBalanceSampler and SeededSampler are completely dead code. When a custom sampler is passed, the block overwrites or nullifies it. This was introduced during vendoring — the upstream chemprop source correctly uses if sampler is None:.


Test Coverage

No existing test exercises ChemPropFeaturizer with a dataset size where N % batch_size == 1. All integration test datasets happen to be safe: N=30, N=999, N=2419 (none satisfy N ≡ 1 mod 128).


Recommended Fixes

  1. Remove the automatic drop_last heuristic — expose it as an explicit drop_last=False parameter on _vendor_build_dataloader and dataset_to_dataloader. Callers that need it for batch norm can opt in explicitly and only during training.

  2. Fix featurize() to return correct indices — if drop_last=True is ever retained, return indices[:-1] so the returned index array always matches what the DataLoader actually processes.

  3. Fix the inverted sampler condition — change line 71 from if sampler is not None: to if sampler is None:.

  4. Add regression tests covering N ≡ 1 (mod batch_size) for featurize, inference, and cross-validation paths.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions