[BUG] `drop_last` in `ChemPropFeaturizer` causes inference crash and silent metric corruption

## Summary

`_vendor_build_dataloader` (`openadmet/models/features/chemprop.py` lines 84–88) unconditionally sets `drop_last=True` whenever `len(dataset) % batch_size == 1`, regardless of whether the dataloader is used for training or inference, and regardless of whether batch normalization is actually enabled (`batch_norm=False` is the default in `ChemPropModel`). This causes three distinct failure modes and a fourth unrelated logic bug in the sampler code.

---

## Bug A — Hard Crash in Inference

**File:** `openadmet/models/inference/inference.py` line 276  
**Trigger:** any inference call where `N % batch_size == 1`

`ChemPropFeaturizer.featurize()` always returns `indices = np.arange(len(smiles))` — length **N** — but when `drop_last=True` the DataLoader only processes **N−1** molecules. `model.predict()` therefore returns shape `(N−1, n_tasks)`. When the inference pipeline constructs the output Series:

```python
# inference.py line 276
data[predictions_tag] = pd.Series(predictions[:, j], index=X_indices)
#                                  ↑ N-1 values         ↑ N indices
```

this raises:

```
ValueError: Length of values (N-1) does not match length of index (N)
```

The same crash also affects the `std` column (line 277) and all acquisition function results (line 290).

**Affected dataset sizes** (first several for each common batch size):

| `batch_size` | Affected N |
|---|---|
| 128 (default) | 129, 257, 385, 513, 641, 769, 897, … |
| 64 | 65, 129, 193, 257, 321, 385, 449, … |
| 32 | 33, 97, 161, 225, 289, 353, 417, … |

For the default `batch_size=128`, approximately 0.78% of all possible dataset sizes trigger this crash.

---

## Bug B — Silent Metric Corruption in Cross-Validation

**Files:** `openadmet/models/eval/cross_validation.py` lines 628, 648–661; `openadmet/models/eval/eval_base.py` lines 110–136  
**Trigger:** any validation fold where `N_val % batch_size == 1`

`y_pred_fold` has shape `(N_val−1, n_tasks)` but `y_val` has shape `(N_val, n_tasks)`. This hits the shape-mismatch branch in `get_t_true_and_t_pred`:

```python
# eval_base.py line 110
if y_true.shape[0] != y_pred.shape[0]:
    # Silently reinterprets the task as pairwise ranking
    t_true = np.array([
        y_true[i, task_id] - y_true[j, task_id]
        for i in range(N)
        for j in range(N)
    ])
    t_pred = y_pred[:, task_id]
```

All metrics (MSE, MAE, R², Kendall τ, Spearman ρ) are then computed over pairwise *differences* of ground-truth values rather than absolute predictions. The resulting numbers are meaningless and incomparable to any other evaluation run. No warning is emitted at the metric computation level.

---

## Bug C — Silent Training Data Truncation

**Trigger:** training set where `N_train % batch_size == 1`

One molecule is silently excluded from every training epoch:

- With `shuffle=False` (the default): always the same last molecule, which never contributes to any gradient update
- With `shuffle=True` (used in ensemble bootstrap paths): a different molecule is excluded each epoch, causing each ensemble member to train on an inconsistent N−1 subset

No warning or log message is emitted in either case.

---

## Bonus Bug — Sampler Logic Is Inverted

**File:** `openadmet/models/features/chemprop.py` lines 71–77

```python
if sampler is not None:   # ← condition is inverted; should be: if sampler is None
    if class_balance:
        sampler = ClassBalanceSampler(dataset.Y, seed, shuffle)
    elif shuffle and seed is not None:
        sampler = SeededSampler(len(dataset), seed)
    else:
        sampler = None
```

`dataset_to_dataloader` always passes `sampler=None` (the default), so the block is **never entered** in normal usage. `ClassBalanceSampler` and `SeededSampler` are completely dead code. When a custom sampler *is* passed, the block overwrites or nullifies it. This was introduced during vendoring — the upstream chemprop source correctly uses `if sampler is None:`.

---

## Test Coverage

No existing test exercises `ChemPropFeaturizer` with a dataset size where `N % batch_size == 1`. All integration test datasets happen to be safe: N=30, N=999, N=2419 (none satisfy N ≡ 1 mod 128).

---

## Recommended Fixes

1. **Remove the automatic `drop_last` heuristic** — expose it as an explicit `drop_last=False` parameter on `_vendor_build_dataloader` and `dataset_to_dataloader`. Callers that need it for batch norm can opt in explicitly and only during training.

2. **Fix `featurize()` to return correct indices** — if `drop_last=True` is ever retained, return `indices[:-1]` so the returned index array always matches what the DataLoader actually processes.

3. **Fix the inverted sampler condition** — change line 71 from `if sampler is not None:` to `if sampler is None:`.

4. **Add regression tests** covering N ≡ 1 (mod batch_size) for featurize, inference, and cross-validation paths.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] `drop_last` in `ChemPropFeaturizer` causes inference crash and silent metric corruption #517

Summary

Bug A — Hard Crash in Inference

Bug B — Silent Metric Corruption in Cross-Validation

Bug C — Silent Training Data Truncation

Bonus Bug — Sampler Logic Is Inverted

Test Coverage

Recommended Fixes

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

`batch_size`	Affected N
128 (default)	129, 257, 385, 513, 641, 769, 897, …
64	65, 129, 193, 257, 321, 385, 449, …
32	33, 97, 161, 225, 289, 353, 417, …

[BUG] drop_last in ChemPropFeaturizer causes inference crash and silent metric corruption #517

Description

Summary

Bug A — Hard Crash in Inference

Bug B — Silent Metric Corruption in Cross-Validation

Bug C — Silent Training Data Truncation

Bonus Bug — Sampler Logic Is Inverted

Test Coverage

Recommended Fixes

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

[BUG] `drop_last` in `ChemPropFeaturizer` causes inference crash and silent metric corruption #517