feat: add from_foundation flag to ChemPropLightningModule

smcolby · Copilot · smcolby · commit 90c46ade5508 · 2026-05-22T15:06:10.000-08:00
Adds a from_foundation: str | bool parameter (default 'chemeleon') that
controls how the ChemProp message-passing encoder is initialised:

- 'chemeleon': downloads CheMeleon weights from Zenodo (existing behavior)
- '/path/to/weights.pt': loads a local checkpoint in the same
  {hyper_parameters, state_dict} format as CheMeleon
- False: builds BondMessagePassing() with default ChemProp architecture
  and random weights; no checkpoint required
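
The three modes above can be sketched as a small path-resolution helper. This is a minimal illustration, not the code in the commit: `resolve_checkpoint_path` and `CHEMELEON_CACHE` are hypothetical names, and only the cache location and the dispatch order are taken from the diff below. It performs no validation and no download.

```python
from __future__ import annotations

from pathlib import Path

# Cache location as shown in the diff; the constant name is hypothetical.
CHEMELEON_CACHE = Path.home() / ".chemprop" / "chemeleon_mp.pt"


def resolve_checkpoint_path(from_foundation: str | bool) -> Path | None:
    """Return the checkpoint path for a mode, or None for random weights."""
    if from_foundation is False:
        # No checkpoint: the encoder is built with default architecture
        # and random weights.
        return None
    if from_foundation == "chemeleon":
        # The real code ensures the file is downloaded to this path first.
        return CHEMELEON_CACHE
    # Any other string is treated as a local filesystem path.
    return Path(str(from_foundation))


for mode in ("chemeleon", "/path/to/weights.pt", False):
    print(repr(mode), "->", resolve_checkpoint_path(mode))
```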

Validation is performed at construction time by _validate_from_foundation(),
which accepts False, any name in _KNOWN_FOUNDATION_MODELS, or a string for
which Path(value).exists() is true. Any other value (True, an unknown name, or
a non-existent path) raises ValueError with a message listing the known names.
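
The validation rules can be exercised in isolation. The sketch below mirrors the logic of `_validate_from_foundation` from the diff, but it is a standalone copy with shortened error messages and public names (`validate_from_foundation`, `KNOWN_FOUNDATION_MODELS`) chosen for illustration, so it runs without ChemProp or Lightning installed.

```python
from pathlib import Path

# Registry of named foundation models; mirrors _KNOWN_FOUNDATION_MODELS.
KNOWN_FOUNDATION_MODELS = frozenset({"chemeleon"})


def validate_from_foundation(value):
    """Raise ValueError unless value is False, a known name, or an existing path."""
    if value is False:
        return
    if not isinstance(value, str):
        # Catches True and any other non-string value.
        raise ValueError(
            f"from_foundation must be False, a known model name, or a path; "
            f"got {value!r}. Known names: {sorted(KNOWN_FOUNDATION_MODELS)}"
        )
    if value in KNOWN_FOUNDATION_MODELS or Path(value).exists():
        return
    raise ValueError(
        f"from_foundation={value!r} is neither a known name nor an existing path. "
        f"Known names: {sorted(KNOWN_FOUNDATION_MODELS)}."
    )


# Try each kind of input and report whether it is accepted.
for candidate in (False, "chemeleon", True, "no/such/checkpoint.pt"):
    try:
        validate_from_foundation(candidate)
        print(repr(candidate), "accepted")
    except ValueError as exc:
        print(repr(candidate), "rejected:", exc)
```

Note that `True` is rejected by the `isinstance` check before any name or path lookup takes place.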

Changes:
- moal/config.py: add from_foundation field to ModelConfig
- moal/model.py: _KNOWN_FOUNDATION_MODELS, _validate_from_foundation,
  updated __init__ / _build_model dispatch, _load_foundation_weights
  (replaces _get_chemeleon_mp)
- moal/cli.py: forward from_foundation in both model builders
- examples/default_config.yaml: document all three modes
- tests/test_model.py: update hparam assertions; add TestFromFoundation
  class (7 new tests)
- README.md: generalize encoder description; add Foundation Model design note
- .github/copilot-instructions.md: update ModelConfig table; add
  Foundation model section; retire outdated _CHEMPELEON_ATOM_FDIM note

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
diff --git a/.github/copilot-instructions.md b/.github/copilot-instructions.md
@@ -151,7 +151,7 @@ All campaign parameters live in `moal/config.py` as frozen dataclasses. The YAML
 | YAML key | Dataclass | Notable fields |
 |---|---|---|
 | `oracle:` | `OracleConfig` | `cost_ps`, `cost_drc`, `ps_threshold`, `upper_bound`, `activity_threshold` |
-| `model:` | `ModelConfig` | `hidden_size`, `depth`, `ffn_hidden_size`, `ffn_num_layers`, `freeze_epochs`, `lr_encoder`, `lr_head`, `sigma`, `w_drc`, `w_ps`, `learnable_sigma`, `reset_weights_on_refit`, **`fast`**, **`initial_error`**, **`final_error`** |
+| `model:` | `ModelConfig` | `hidden_size`, `depth`, `ffn_hidden_size`, `ffn_num_layers`, `freeze_epochs`, `lr_encoder`, `lr_head`, `sigma`, `w_drc`, `w_ps`, `learnable_sigma`, `reset_weights_on_refit`, **`fast`**, **`initial_error`**, **`final_error`**, **`from_foundation`** |
 | `acquisition:` | `AcquisitionConfig` | `ps_threshold`, `target_threshold`, **`tau`** |
 | `trainer:` | `TrainerConfig` | `max_epochs`, `accelerator`, `enable_progress_bar`, `enable_model_summary`, `val_fraction`, `split_seed`, `num_workers`, `log_every_n_steps` |
 | `dashboard:` | `DashboardConfig` | `enabled`, `model_metric`, `port`, `export_width`, `export_height`, `theme` |
@@ -213,7 +213,7 @@ All modules use `logger = logging.getLogger(__name__)`. The `suppress_noisy_logg
 
 ### Freeze/unfreeze schedule
 
-`ChemPropLightningModule` freezes the CheMeleon encoder for the first `freeze_epochs` training epochs, then unfreezes and adds a second optimizer for the encoder at `lr_encoder`. The epoch counter resets on every `trainer.fit()` call (every AL iteration). This is intentional — early iterations have tiny labeled pools where encoder fine-tuning would overfit.
+`ChemPropLightningModule` freezes the message-passing encoder for the first `freeze_epochs` training epochs, then unfreezes and adds a second optimizer for the encoder at `lr_encoder`. The epoch counter resets on every `trainer.fit()` call (every AL iteration). This is intentional — early iterations have tiny labeled pools where encoder fine-tuning would overfit.
 
 ### Scaffold split
 
@@ -227,6 +227,14 @@ All modules use `logger = logging.getLogger(__name__)`. The `suppress_noisy_logg
 
 `CliRunner.invoke()` does **not** sandbox file I/O by default. Any test that triggers CLI output-directory creation must pass `--output-dir str(tmp_path / "out")` (or use `runner.isolated_filesystem()`) to avoid leaking `results/` into the pytest CWD.
 
-### CheMeleon feature dimensions
+### Foundation model (`from_foundation`)
 
-`_CHEMPELEON_ATOM_FDIM = 72` and `_CHEMPELEON_BOND_FDIM = 14` in `model.py` are hardcoded to match the CheMeleon pretraining feature spec. These are verified at model initialization. Do not change them without updating the checkpoint.
+`ChemPropLightningModule` accepts a `from_foundation: str | bool` constructor parameter (default `"chemeleon"`) that controls encoder initialisation:
+
+- `"chemeleon"` — downloads the CheMeleon checkpoint from Zenodo, caches it at `~/.chemprop/chemeleon_mp.pt`, and loads its weights.
+- Any other string — treated as a filesystem path; the checkpoint must have `{"hyper_parameters": ..., "state_dict": ...}` format (same as CheMeleon).
+- `False` — builds `BondMessagePassing()` with default ChemProp architecture and random weights; no checkpoint is required.
+
+The known-name registry lives in `_KNOWN_FOUNDATION_MODELS: frozenset[str]` at module level. `_validate_from_foundation(value)` is called at `__init__` time; it raises `ValueError` for unknown names and non-existent paths. `from_foundation` is included in `save_hyperparameters()`.
+
+`True` is not a valid value — the validator catches it because `True` is not in `_KNOWN_FOUNDATION_MODELS` and `Path(True)` is not a valid path expression.
diff --git a/README.md b/README.md
@@ -7,7 +7,7 @@ A Python pipeline for maximizing the discovery of **active compounds** (pEC50 >
 - **Primary Screen (PS):** Returns an inequality label (`< T` or `>= T`) at a configurable threshold. Cheap. A hit (`>= T`) is an INTERVAL-censored label — eligible for a DRC upgrade in a later iteration.
 - **Dose-Response Curve (DRC):** Returns the exact continuous pEC50 value. Expensive. Can be run as a first-pass query *or* as a follow-up upgrade on a PS hit.
 
-The underlying predictive model is **ChemProp** initialized with **CheMeleon** pretrained weights, trained with a **Tobit (censored regression) loss** that correctly handles both label types.
+The underlying predictive model is **ChemProp** fine-tuned with a **Tobit (censored regression) loss** that correctly handles both label types. By default the ChemProp encoder is initialised with **CheMeleon** pretrained weights (see `model.from_foundation` below).
 
 ## Installation
 
@@ -49,7 +49,8 @@ data:
 ```
 
 CheMeleon pretrained weights are downloaded automatically from Zenodo on first
-run and cached at `~/.chemprop/chemeleon_mp.pt` for subsequent use.
+run and cached at `~/.chemprop/chemeleon_mp.pt` for subsequent use (only when
+`model.from_foundation: chemeleon`, the default).
 
 ## Commands
 
@@ -167,6 +168,14 @@ The campaign emits a rich progress bar with `n_iterations × 3` discrete steps:
 
 ## Key Design Notes
 
+**Foundation model (`model.from_foundation`):** Controls which weights initialise the ChemProp message-passing encoder. Three values are accepted:
+
+- `"chemeleon"` (default) — downloads the CheMeleon checkpoint from Zenodo and loads it.
+- A filesystem path string — loads a local checkpoint in the same `{hyper_parameters, state_dict}` format as CheMeleon.
+- `false` — builds the encoder with default ChemProp architecture and random weights; no checkpoint required. Useful for ablation studies or environments without network access.
+
+Unknown named strings and non-existent paths raise `ValueError` at model construction. The `from_foundation` value is recorded in Lightning checkpoints alongside all other hyperparameters.
+
 **Unified input format:** All three CSV inputs — `data.simulate.input_csv`, `data.simulate.pretrain.input_csv`, and `data.plan.input_csv` — use the same campaign state schema (`smiles`, `relation`, `value`).  For `moal simulate`, only `==` rows are loaded as oracle ground truth; PS and blank rows are skipped.  For `moal plan` and the pretrain input, all labeled rows (`<`, `>=`, `==`) become training records; unqueried rows (empty) are inference targets or skipped with a warning, respectively.
 
 **Pretrain warm-starting:** `moal simulate` accepts a pretrain CSV (`data.simulate.pretrain.input_csv`) in the same mixed-fidelity format. Pretrain records are merged with oracle-acquired records before each `model.refit()` call. Oracle records always win on a same-fidelity duplicate; pretrain PS INTERVAL records are automatically dropped when the oracle upgrades that compound to DRC. See `data.simulate.pretrain.*` in the config reference for all fields.
diff --git a/examples/default_config.yaml b/examples/default_config.yaml
@@ -104,6 +104,9 @@ model:
   initial_error: 1.2                  # Fast mode: Starting noise magnitude (pEC50 units) for the error ramp in fast mode
   final_error: 0.65                   # Fast mode: Ending noise magnitude for the ramp 
                                       # Set equal to initial_error for constant noise
+  from_foundation: chemeleon          # Encoder initialisation: "chemeleon" (download CheMeleon from Zenodo),
+                                      # a local path to a {hyper_parameters, state_dict} checkpoint file,
+                                      # or false for random ChemProp weights (no checkpoint required)
 
 # --------------------------------------------------------------------------
 # PyTorch Lightning trainer
diff --git a/moal/cli.py b/moal/cli.py
@@ -710,6 +710,7 @@ def _build_simulation_model(
         w_drc=cfg.model.w_drc,
         w_ps=cfg.model.w_ps,
         learnable_sigma=cfg.model.learnable_sigma,
+        from_foundation=cfg.model.from_foundation,
     )
 
 
@@ -748,6 +749,7 @@ def _build_plan_model(cfg: PipelineConfig) -> ChemPropLightningModule:
         w_drc=cfg.model.w_drc,
         w_ps=cfg.model.w_ps,
         learnable_sigma=cfg.model.learnable_sigma,
+        from_foundation=cfg.model.from_foundation,
     )
 
 
diff --git a/moal/config.py b/moal/config.py
@@ -82,6 +82,12 @@ class ModelConfig:
         mode. The ramp linearly interpolates from initial_error to final_error
         over all iterations. Set equal to initial_error for a constant noise
         level.
+    from_foundation : str or bool
+        Controls encoder initialisation. ``"chemeleon"`` (default) downloads
+        and loads CheMeleon pretrained weights. A filesystem path string loads
+        a local checkpoint in the same ``{hyper_parameters, state_dict}``
+        format. ``False`` builds the encoder with default ChemProp architecture
+        and random weights (no checkpoint required).
     """
 
     ffn_hidden_size: int = 300
@@ -97,6 +103,7 @@ class ModelConfig:
     fast: bool = False
     initial_error: float = 0.7
     final_error: float = 0.5
+    from_foundation: str | bool = "chemeleon"
 
 
 @dataclass(frozen=True)
diff --git a/moal/model.py b/moal/model.py
@@ -34,6 +34,41 @@
 
 logger = logging.getLogger(__name__)
 
+_KNOWN_FOUNDATION_MODELS: frozenset[str] = frozenset({"chemeleon"})
+
+
+def _validate_from_foundation(value: str | bool) -> None:
+    """Validate the ``from_foundation`` parameter value.
+
+    Parameters
+    ----------
+    value : str or bool
+        The value to validate.
+
+    Raises
+    ------
+    ValueError
+        If ``value`` is not ``False``, a known named model, or an existing
+        file path.
+    """
+    if value is False:
+        return
+    if not isinstance(value, str):
+        raise ValueError(
+            f"from_foundation must be False, a known model name, or a filesystem path; "
+            f"got {value!r}. Known names: {sorted(_KNOWN_FOUNDATION_MODELS)}"
+        )
+    if value in _KNOWN_FOUNDATION_MODELS:
+        return
+    if Path(value).exists():
+        return
+    raise ValueError(
+        f"from_foundation={value!r} is not a recognised foundation model name "
+        f"and does not resolve to an existing file path. "
+        f"Known names: {sorted(_KNOWN_FOUNDATION_MODELS)}. "
+        "Pass False to use random ChemProp weights."
+    )
+
 
 def download_chemeleon() -> None:
     """Download the CheMeleon checkpoint if not already cached locally.
@@ -67,7 +102,7 @@ def download_chemeleon() -> None:
 
 
 class ChemPropLightningModule(L.LightningModule):
-    """ChemProp MPNN fine-tuned from CheMeleon pretrained weights.
+    """ChemProp MPNN with configurable foundation-model encoder initialisation.
 
     Parameters
     ----------
@@ -91,6 +126,12 @@ class ChemPropLightningModule(L.LightningModule):
         Primary screen loss weight. Default is 0.3.
     learnable_sigma : bool, optional
         If True, σ is a learned parameter. Default is False.
+    from_foundation : str or bool, optional
+        Controls encoder initialisation. ``"chemeleon"`` (default) downloads
+        and loads CheMeleon pretrained weights. A filesystem path string loads
+        a local checkpoint in ``{hyper_parameters, state_dict}`` format.
+        ``False`` builds the encoder with default ChemProp architecture and
+        random weights.
     """
 
     def __init__(
@@ -104,8 +145,11 @@ def __init__(
         w_drc: float = 1.0,
         w_ps: float = 0.3,
         learnable_sigma: bool = False,
+        from_foundation: str | bool = "chemeleon",
     ) -> None:
         super().__init__()
+        _validate_from_foundation(from_foundation)
+        self._from_foundation = from_foundation
         self.save_hyperparameters()
 
         self.freeze_epochs = freeze_epochs
@@ -132,7 +176,7 @@ def _build_model(
         ffn_hidden_size: int,
         ffn_num_layers: int,
     ) -> nn.Module:
-        """Construct the MPNN with CheMeleon message-passing weights.
+        """Construct the MPNN, dispatching on ``self._from_foundation``.
 
         Parameters
         ----------
@@ -144,42 +188,43 @@ def _build_model(
         Returns
         -------
         nn.Module
-            Fully assembled ``chemprop.models.MPNN`` with pretrained
-            message-passing weights and a freshly initialised FFN head.
+            Fully assembled ``chemprop.models.MPNN``.
         """
-        chemeleon_weights = self._get_chemeleon_mp()
+        if self._from_foundation is False:
+            logger.info("Building ChemProp encoder with random weights (from_foundation=False).")
+            mp: nn.Module = BondMessagePassing()
+        else:
+            foundation_weights = self._load_foundation_weights()
+            mp = BondMessagePassing(**foundation_weights["hyper_parameters"])
+            mp.load_state_dict(foundation_weights["state_dict"])
 
-        # Mean aggregation
         agg = MeanAggregation()
-
-        # Message passing
-        mp = BondMessagePassing(**chemeleon_weights["hyper_parameters"])
-        mp.load_state_dict(chemeleon_weights["state_dict"])
-
-        # FFN predictor head
         ffn = RegressionFFN(
-            input_dim=mp.output_dim,  # Infer input dim from mp output
+            input_dim=cast(BondMessagePassing, mp).output_dim,
             hidden_dim=ffn_hidden_size,
             n_layers=ffn_num_layers,
         )
         return cast(nn.Module, MPNN(message_passing=mp, agg=agg, predictor=ffn))
 
-    def _get_chemeleon_mp(self) -> dict:
-        """Load and return the CheMeleon pretrained message-passing weights.
+    def _load_foundation_weights(self) -> dict:
+        """Load pretrained message-passing weights from a named model or local path.
 
-        Calls :func:`download_chemeleon` to ensure the checkpoint exists at
-        ``~/.chemprop/chemeleon_mp.pt``, then loads it with
-        ``weights_only=True``.
+        When ``self._from_foundation == "chemeleon"`` the checkpoint is
+        downloaded from Zenodo if not already cached.  For any other string
+        value it is treated as a local filesystem path.
 
         Returns
         -------
         dict
             Checkpoint dictionary with ``hyper_parameters`` and
             ``state_dict`` keys.
         """
-        # Ensure the CheMeleon checkpoint is downloaded
-        download_chemeleon()
-        ckpt_path = Path().home() / ".chemprop" / "chemeleon_mp.pt"
+        if self._from_foundation == "chemeleon":
+            download_chemeleon()
+            ckpt_path = Path().home() / ".chemprop" / "chemeleon_mp.pt"
+        else:
+            ckpt_path = Path(str(self._from_foundation))
+            logger.info("Loading foundation weights from local path: %s", ckpt_path)
         return cast(dict[str, Any], torch.load(ckpt_path, weights_only=True))
 
     # ------------------------------------------------------------------
diff --git a/tests/test_model.py b/tests/test_model.py
