fireflyframework · ancongui · Jun 25, 2026 · Jun 25, 2026 · Jun 25, 2026
diff --git a/docs/security.md b/docs/security.md
@@ -101,14 +101,15 @@ The post-conditions are enforced in order: a non-`DataFrame` result raises `Feat
 
 !!! warning "Implementation status — what is enforced today"
 
-    Layers 1–2 (static analysis + the restricted in-process namespace) are **enforced now** and are
-    what protects you today. The sandbox *tiers* below (`docker`, `e2b`), `execution.timeout_seconds`,
-    and the `require_approval` HITL gate are currently **declared, validated configuration** — their
-    routing and enforcement are on the roadmap (a `CodeExecutorPort` with per-tier adapters and an
-    approval gate). Until that ships, the real isolation is the in-process `monty` / `local` path:
-    **do not run genuinely untrusted data through GenAI expecting container/microVM isolation yet.**
+    Layers 1–2 (static analysis + the restricted in-process namespace) are **enforced**.
+    `execution.timeout_seconds` is **enforced** (a wall-clock guard on the executor), and
+    `require_approval` is **enforced** when an approver is wired — fail-closed: code runs only if the
+    approver grants it. The container/microVM tiers (`docker`, `e2b`) are **not yet implemented and now
+    fail fast** rather than silently running in-process, so you can't mistake in-process for isolation.
+    The real isolation today remains the in-process `monty` / `local` path; for genuinely untrusted
+    data, wait for the container adapters (or supply your own `CodeExecutorPort`).
 
-Layers 1 and 2 run **in-process**. They block the obvious capabilities, but a determined escape against a CPython process is never something to bet sensitive data on. The configuration surface below lets you *declare* stronger isolation for untrusted data via `execution.sandbox` in `ExecutionConfig` (enforcement is roadmap, per the note above):
+Layers 1 and 2 run **in-process**. They block the obvious capabilities, but a determined escape against a CPython process is never something to bet sensitive data on. `execution.sandbox` selects the tier; the in-process tiers run the restricted executor below, while `docker` / `e2b` raise until their adapters land:
 
 ```python
 from fireflyframework_datascience.core.config import FireflyDataScienceConfig
@@ -151,7 +152,7 @@ The literal type for `sandbox` is exactly `Literal["monty", "docker", "e2b", "lo
 
     Profile overlays outrank the base `firefly-datascience.yaml`, so a `prod` profile can tighten isolation without touching the base file. See [Configuration](configuration.md) for the full precedence order.
 
-Beyond the strongest sandbox sits **HITL** (human-in-the-loop): `execution.require_approval` defaults to `True`, and the design's final tier is a person — not a policy — signing off on generated code before it runs. (Per the status note above, the approval-gate wiring is on the roadmap; today the field is declared and validated.)
+Beyond the strongest sandbox sits **HITL** (human-in-the-loop): `execution.require_approval` defaults to `True`, and the executor **enforces it** — when an approver is wired, generated code runs only if the approver grants it (and fail-closed if `require_approval` is set but no approver is available). A person, not a policy, signs off.
 
 !!! note "Defaults are the safe end of every axis"
     Out of the box, `sandbox = "monty"` (in-process restricted interpreter), `timeout_seconds = 60`, and `require_approval = True`. You loosen these deliberately — and only `local` removes isolation entirely.
@@ -163,7 +164,7 @@ The subtle attack is not the model going rogue on its own; it is a **column valu
 1. **Static analysis is content-blind.** It rejects `os`, `subprocess`, `socket`, dunder access, and `eval`/`exec`/`open` regardless of *why* the model wrote them — so a successful injection still produces code that gets rejected.
 2. **The restricted namespace** means even "clever" injected code has no I/O, no imports, no host reach.
 3. **The numeric-new-column contract** means injected code that tries to do anything other than add a numeric feature fails the post-conditions.
-4. **Sandboxing + HITL** are the *intended* outer tiers for genuinely untrusted data (route to `docker`/`e2b`, require approval). Their enforcement is on the roadmap (see the Layer 3 status note) — today, rely on points 1–3, which are enforced in-process.
+4. **HITL approval** is enforced for untrusted data (require an approver to sign off, fail-closed). The container/microVM tiers (`docker`/`e2b`) are not yet implemented and fail fast rather than running in-process — so points 1–3 plus approval are what protect you today.
 
 !!! warning "The framework does not read your data's meaning"
     Firefly cannot inspect or sanitize the *semantics* of your data. Prompt-injection defense rests on capability restriction and sandboxing, not on detecting malicious text. Treat data of unknown provenance as untrusted input: raise `execution.sandbox` and keep `require_approval` on.

diff --git a/src/fireflyframework_datascience/features/auto_configuration.py b/src/fireflyframework_datascience/features/auto_configuration.py
@@ -23,8 +23,19 @@ class FeaturesAutoConfiguration:
     @bean(name="genai_feature_engineer", primary=True)
     def feature_engineer(self, config: FireflyDataScienceConfig) -> FeatureEngineerPort:
         from fireflyframework_datascience.features import CostBenefitGate
+        from fireflyframework_datascience.features.executor import FeatureCodeExecutor
         from fireflyframework_datascience.features.genai import AgentFeatureProposer, GenAIFeatureEngineer
 
         proposer = AgentFeatureProposer(model=config.genai.default_model)
         gate = CostBenefitGate(min_gain=0.0)
-        return GenAIFeatureEngineer(proposer, gate=gate)
+        # Honor the FULL declared execution config — do not silently drop a control. Timeout +
+        # require_approval (fail-closed HITL) + sandbox tier (in-process 'monty'/'local' run the
+        # restricted executor; 'docker'/'e2b' fail until those adapters land). With the default
+        # require_approval=True and no approver wired, automated GenAI fail-closes: set
+        # execution.require_approval=False (or wire an approver) to run it unattended.
+        executor = FeatureCodeExecutor(
+            timeout_seconds=config.execution.timeout_seconds,
+            require_approval=config.execution.require_approval,
+            sandbox=config.execution.sandbox,
+        )
+        return GenAIFeatureEngineer(proposer, gate=gate, executor=executor)
diff --git a/src/fireflyframework_datascience/features/executor.py b/src/fireflyframework_datascience/features/executor.py
@@ -14,10 +14,14 @@
 
 from __future__ import annotations
 
+from collections.abc import Callable, Iterator
+from contextlib import contextmanager
 from typing import Any
 
 from fireflyframework_datascience.core.exceptions import FireflyDataScienceError
 
+_IN_PROCESS_SANDBOXES = {"local", "monty"}
+
 # A deliberately small builtins allowlist — enough for arithmetic/aggregation feature code, nothing else.
 _SAFE_BUILTINS = {
     name: __builtins__[name] if isinstance(__builtins__, dict) else getattr(__builtins__, name)
@@ -54,9 +58,20 @@ class FeatureExecutionError(FireflyDataScienceError):
 class FeatureCodeExecutor:
     """Vets and executes a feature-engineering snippet against a copy of the dataframe."""
 
-    def __init__(self) -> None:
+    def __init__(
+        self,
+        *,
+        timeout_seconds: int | None = None,
+        require_approval: bool = False,
+        approver: Callable[[str], bool] | None = None,
+        sandbox: str = "local",
+    ) -> None:
         from fireflyframework_agentic.execution import SafetyPolicy
 
+        self._timeout_seconds = timeout_seconds
+        self._require_approval = require_approval
+        self._approver = approver
+        self._sandbox = sandbox
         self._policy = SafetyPolicy(
             denied_modules=frozenset(
                 {"os", "sys", "subprocess", "shutil", "socket", "pathlib", "importlib", "builtins"}
@@ -86,6 +101,20 @@ def execute(self, code: str, X: Any) -> Any:
         """
         from fireflyframework_agentic.execution import analyze_code
 
+        # Tier routing: in-process tiers run the restricted exec below; container/microVM tiers are not
+        # yet implemented and must FAIL rather than silently run in-process pretending to be isolated.
+        if self._sandbox not in _IN_PROCESS_SANDBOXES:
+            raise FeatureExecutionError(
+                f"Execution sandbox {self._sandbox!r} is not yet available; use 'monty' or 'local' "
+                "(restricted in-process) — container/microVM adapters are on the roadmap."
+            )
+        # HITL approval gate: when required, code runs only if an approver grants it (fail-closed).
+        if self._require_approval and (self._approver is None or not self._approver(code)):
+            raise FeatureExecutionError(
+                "Feature code requires approval but it was not granted "
+                "(wire an approver or set execution.require_approval=False)."
+            )
+
         report = analyze_code(code, self._policy)
         if not report.safe:
             reasons = "; ".join(v.message for v in report.violations)
@@ -96,10 +125,14 @@ def execute(self, code: str, X: Any) -> Any:
 
         before = set(X.columns)
         namespace: dict[str, Any] = {"df": X.copy(), "pd": pd, "np": np}
-        try:
-            exec(compile(code, "<feature>", "exec"), {"__builtins__": _SAFE_BUILTINS}, namespace)  # noqa: S102
-        except Exception as exc:  # noqa: BLE001 - surface any runtime error as a typed rejection
-            raise FeatureExecutionError(f"Feature code failed at runtime: {exc}") from exc
+        compiled = compile(code, "<feature>", "exec")
+        with self._time_limit():
+            try:
+                exec(compiled, {"__builtins__": _SAFE_BUILTINS}, namespace)  # noqa: S102
+            except TimeoutError as exc:
+                raise FeatureExecutionError(f"Feature code timed out after {self._timeout_seconds}s") from exc
+            except Exception as exc:  # noqa: BLE001 - surface any runtime error as a typed rejection
+                raise FeatureExecutionError(f"Feature code failed at runtime: {exc}") from exc
 
         result = namespace["df"]
         if not isinstance(result, pd.DataFrame):
@@ -114,5 +147,26 @@ def execute(self, code: str, X: Any) -> Any:
             result[col] = result[col].replace([np.inf, -np.inf], np.nan)
         return result
 
+    @contextmanager
+    def _time_limit(self) -> Iterator[None]:
+        """Wall-clock guard via SIGALRM. No-op if no timeout is set or not on the main thread."""
+        import signal
+        import threading
+
+        if not self._timeout_seconds or threading.current_thread() is not threading.main_thread():
+            yield
+            return
+
+        def _on_timeout(signum: int, frame: Any) -> None:
+            raise TimeoutError
+
+        previous = signal.signal(signal.SIGALRM, _on_timeout)
+        signal.alarm(self._timeout_seconds)
+        try:
+            yield
+        finally:
+            signal.alarm(0)
+            signal.signal(signal.SIGALRM, previous)
+
 
 __all__ = ["FeatureCodeExecutor", "FeatureExecutionError"]
diff --git a/tests/features/test_executor_enforcement.py b/tests/features/test_executor_enforcement.py
@@ -0,0 +1,98 @@
+# Copyright 2026 Firefly Software Foundation.
+"""The executor must actually enforce execution.{timeout_seconds, require_approval, sandbox}.
+
+These were declared config fields read by zero lines (an integrity gap). Real enforcement, real data.
+"""
+
+from __future__ import annotations
+
+import pandas as pd
+import pytest
+
+from fireflyframework_datascience.features.executor import FeatureCodeExecutor, FeatureExecutionError
+
+
+def test_executor_enforces_timeout() -> None:
+    executor = FeatureCodeExecutor(timeout_seconds=1)
+    with pytest.raises(FeatureExecutionError, match="timed out"):
+        # an infinite loop never adds the column; the timeout must interrupt it
+        executor.execute("df['x'] = 0\nwhile True:\n    pass", pd.DataFrame({"a": [1, 2, 3]}))
+
+
+def test_executor_rejects_unimplemented_sandbox() -> None:
+    # docker/e2b are not implemented — they must ERROR, not silently run in-process pretending isolation
+    executor = FeatureCodeExecutor(sandbox="docker")
+    with pytest.raises(FeatureExecutionError, match="docker"):
+        executor.execute("df['b'] = df['a'] + 1", pd.DataFrame({"a": [1, 2]}))
+
+
+def test_executor_in_process_sandboxes_still_run() -> None:
+    for sandbox in ("local", "monty"):
+        executor = FeatureCodeExecutor(sandbox=sandbox)
+        out = executor.execute("df['b'] = df['a'] + 1", pd.DataFrame({"a": [1, 2]}))
+        assert "b" in out.columns
+
+
+def test_executor_enforces_approval() -> None:
+    # denying approver -> rejected
+    deny = FeatureCodeExecutor(require_approval=True, approver=lambda code: False)
+    with pytest.raises(FeatureExecutionError, match="approval"):
+        deny.execute("df['b'] = df['a'] + 1", pd.DataFrame({"a": [1, 2]}))
+
+    # require_approval with NO approver -> fail closed
+    no_approver = FeatureCodeExecutor(require_approval=True)
+    with pytest.raises(FeatureExecutionError, match="approval"):
+        no_approver.execute("df['b'] = df['a'] + 1", pd.DataFrame({"a": [1, 2]}))
+
+    # approving approver -> runs
+    allow = FeatureCodeExecutor(require_approval=True, approver=lambda code: True)
+    out = allow.execute("df['b'] = df['a'] + 1", pd.DataFrame({"a": [1, 2]}))
+    assert "b" in out.columns
+
+
+def test_executor_defaults_unchanged() -> None:
+    # default executor: no timeout, no approval, in-process — preserves existing behaviour
+    out = FeatureCodeExecutor().execute("df['b'] = df['a'] * 2", pd.DataFrame({"a": [1, 2]}))
+    assert list(out["b"]) == [2, 4]
+
+
+def test_auto_config_wires_full_execution_config_into_executor() -> None:
+    # The GenAI auto-config must not silently drop any execution control: timeout, sandbox AND
+    # require_approval are threaded from config into the wired executor.
+    from fireflyframework_datascience import FireflyDataScienceApplication
+    from fireflyframework_datascience.core.config import FireflyDataScienceConfig
+    from fireflyframework_datascience.features import FeatureEngineerPort
+
+    config = FireflyDataScienceConfig()
+    config.genai.enabled = True
+    config.execution.require_approval = False  # explicit unattended opt-out
+    config.execution.sandbox = "local"
+    config.execution.timeout_seconds = 30
+    app = FireflyDataScienceApplication.run(config=config, print_output=False)
+
+    engineer = app.container.resolve_optional(FeatureEngineerPort)
+    assert engineer is not None
+    executor = engineer._executor  # type: ignore[attr-defined]
+    assert executor._require_approval is False
+    assert executor._sandbox == "local"
+    assert executor._timeout_seconds == 30
+
+
+def test_auto_config_default_is_fail_closed_hitl() -> None:
+    # With the default execution config (require_approval=True) and no approver wired, the wired
+    # executor fail-closes — the secure-by-default posture the docs describe.
+    from fireflyframework_datascience import FireflyDataScienceApplication
+    from fireflyframework_datascience.core.config import FireflyDataScienceConfig
+    from fireflyframework_datascience.features import FeatureEngineerPort
+
+    config = FireflyDataScienceConfig()
+    config.genai.enabled = True
+    app = FireflyDataScienceApplication.run(config=config, print_output=False)
+
+    engineer = app.container.resolve_optional(FeatureEngineerPort)
+    assert engineer is not None
+    executor = engineer._executor  # type: ignore[attr-defined]
+    assert executor._require_approval is True
+    assert executor._approver is None  # no approver shipped → fail-closed until one is wired
+    with pytest.raises(FeatureExecutionError, match="approval"):
+        executor.execute("df['b'] = df['a'] + 1", pd.DataFrame({"a": [1, 2]}))