diff --git a/docs/security.md b/docs/security.md index 21257da..6715a6d 100644 --- a/docs/security.md +++ b/docs/security.md @@ -101,14 +101,15 @@ The post-conditions are enforced in order: a non-`DataFrame` result raises `Feat !!! warning "Implementation status — what is enforced today" - Layers 1–2 (static analysis + the restricted in-process namespace) are **enforced now** and are - what protects you today. The sandbox *tiers* below (`docker`, `e2b`), `execution.timeout_seconds`, - and the `require_approval` HITL gate are currently **declared, validated configuration** — their - routing and enforcement are on the roadmap (a `CodeExecutorPort` with per-tier adapters and an - approval gate). Until that ships, the real isolation is the in-process `monty` / `local` path: - **do not run genuinely untrusted data through GenAI expecting container/microVM isolation yet.** + Layers 1–2 (static analysis + the restricted in-process namespace) are **enforced**. + `execution.timeout_seconds` is **enforced** (a wall-clock guard on the executor), and + `require_approval` is **enforced** when an approver is wired — fail-closed: code runs only if the + approver grants it. The container/microVM tiers (`docker`, `e2b`) are **not yet implemented and now + fail fast** rather than silently running in-process, so you can't mistake in-process for isolation. + The real isolation today remains the in-process `monty` / `local` path; for genuinely untrusted + data, wait for the container adapters (or supply your own `CodeExecutorPort`). -Layers 1 and 2 run **in-process**. They block the obvious capabilities, but a determined escape against a CPython process is never something to bet sensitive data on. The configuration surface below lets you *declare* stronger isolation for untrusted data via `execution.sandbox` in `ExecutionConfig` (enforcement is roadmap, per the note above): +Layers 1 and 2 run **in-process**. They block the obvious capabilities, but a determined escape against a CPython process is never something to bet sensitive data on. `execution.sandbox` selects the tier; the in-process tiers run the restricted executor below, while `docker` / `e2b` raise until their adapters land: ```python from fireflyframework_datascience.core.config import FireflyDataScienceConfig @@ -151,7 +152,7 @@ The literal type for `sandbox` is exactly `Literal["monty", "docker", "e2b", "lo Profile overlays outrank the base `firefly-datascience.yaml`, so a `prod` profile can tighten isolation without touching the base file. See [Configuration](configuration.md) for the full precedence order. -Beyond the strongest sandbox sits **HITL** (human-in-the-loop): `execution.require_approval` defaults to `True`, and the design's final tier is a person — not a policy — signing off on generated code before it runs. (Per the status note above, the approval-gate wiring is on the roadmap; today the field is declared and validated.) +Beyond the strongest sandbox sits **HITL** (human-in-the-loop): `execution.require_approval` defaults to `True`, and the executor **enforces it** — when an approver is wired, generated code runs only if the approver grants it (and fail-closed if `require_approval` is set but no approver is available). A person, not a policy, signs off. !!! note "Defaults are the safe end of every axis" Out of the box, `sandbox = "monty"` (in-process restricted interpreter), `timeout_seconds = 60`, and `require_approval = True`. You loosen these deliberately — and only `local` removes isolation entirely. @@ -163,7 +164,7 @@ The subtle attack is not the model going rogue on its own; it is a **column valu 1. **Static analysis is content-blind.** It rejects `os`, `subprocess`, `socket`, dunder access, and `eval`/`exec`/`open` regardless of *why* the model wrote them — so a successful injection still produces code that gets rejected. 2. **The restricted namespace** means even "clever" injected code has no I/O, no imports, no host reach. 3. **The numeric-new-column contract** means injected code that tries to do anything other than add a numeric feature fails the post-conditions. -4. **Sandboxing + HITL** are the *intended* outer tiers for genuinely untrusted data (route to `docker`/`e2b`, require approval). Their enforcement is on the roadmap (see the Layer 3 status note) — today, rely on points 1–3, which are enforced in-process. +4. **HITL approval** is enforced for untrusted data (require an approver to sign off, fail-closed). The container/microVM tiers (`docker`/`e2b`) are not yet implemented and fail fast rather than running in-process — so points 1–3 plus approval are what protect you today. !!! warning "The framework does not read your data's meaning" Firefly cannot inspect or sanitize the *semantics* of your data. Prompt-injection defense rests on capability restriction and sandboxing, not on detecting malicious text. Treat data of unknown provenance as untrusted input: raise `execution.sandbox` and keep `require_approval` on. diff --git a/src/fireflyframework_datascience/features/auto_configuration.py b/src/fireflyframework_datascience/features/auto_configuration.py index 1e32a56..39e55f5 100644 --- a/src/fireflyframework_datascience/features/auto_configuration.py +++ b/src/fireflyframework_datascience/features/auto_configuration.py @@ -23,8 +23,19 @@ class FeaturesAutoConfiguration: @bean(name="genai_feature_engineer", primary=True) def feature_engineer(self, config: FireflyDataScienceConfig) -> FeatureEngineerPort: from fireflyframework_datascience.features import CostBenefitGate + from fireflyframework_datascience.features.executor import FeatureCodeExecutor from fireflyframework_datascience.features.genai import AgentFeatureProposer, GenAIFeatureEngineer proposer = AgentFeatureProposer(model=config.genai.default_model) gate = CostBenefitGate(min_gain=0.0) - return GenAIFeatureEngineer(proposer, gate=gate) + # Honor the FULL declared execution config — do not silently drop a control. Timeout + + # require_approval (fail-closed HITL) + sandbox tier (in-process 'monty'/'local' run the + # restricted executor; 'docker'/'e2b' fail until those adapters land). With the default + # require_approval=True and no approver wired, automated GenAI fail-closes: set + # execution.require_approval=False (or wire an approver) to run it unattended. + executor = FeatureCodeExecutor( + timeout_seconds=config.execution.timeout_seconds, + require_approval=config.execution.require_approval, + sandbox=config.execution.sandbox, + ) + return GenAIFeatureEngineer(proposer, gate=gate, executor=executor) diff --git a/src/fireflyframework_datascience/features/executor.py b/src/fireflyframework_datascience/features/executor.py index e029d95..964851e 100644 --- a/src/fireflyframework_datascience/features/executor.py +++ b/src/fireflyframework_datascience/features/executor.py @@ -14,10 +14,14 @@ from __future__ import annotations +from collections.abc import Callable, Iterator +from contextlib import contextmanager from typing import Any from fireflyframework_datascience.core.exceptions import FireflyDataScienceError +_IN_PROCESS_SANDBOXES = {"local", "monty"} + # A deliberately small builtins allowlist — enough for arithmetic/aggregation feature code, nothing else. _SAFE_BUILTINS = { name: __builtins__[name] if isinstance(__builtins__, dict) else getattr(__builtins__, name) @@ -54,9 +58,20 @@ class FeatureExecutionError(FireflyDataScienceError): class FeatureCodeExecutor: """Vets and executes a feature-engineering snippet against a copy of the dataframe.""" - def __init__(self) -> None: + def __init__( + self, + *, + timeout_seconds: int | None = None, + require_approval: bool = False, + approver: Callable[[str], bool] | None = None, + sandbox: str = "local", + ) -> None: from fireflyframework_agentic.execution import SafetyPolicy + self._timeout_seconds = timeout_seconds + self._require_approval = require_approval + self._approver = approver + self._sandbox = sandbox self._policy = SafetyPolicy( denied_modules=frozenset( {"os", "sys", "subprocess", "shutil", "socket", "pathlib", "importlib", "builtins"} @@ -86,6 +101,20 @@ def execute(self, code: str, X: Any) -> Any: """ from fireflyframework_agentic.execution import analyze_code + # Tier routing: in-process tiers run the restricted exec below; container/microVM tiers are not + # yet implemented and must FAIL rather than silently run in-process pretending to be isolated. + if self._sandbox not in _IN_PROCESS_SANDBOXES: + raise FeatureExecutionError( + f"Execution sandbox {self._sandbox!r} is not yet available; use 'monty' or 'local' " + "(restricted in-process) — container/microVM adapters are on the roadmap." + ) + # HITL approval gate: when required, code runs only if an approver grants it (fail-closed). + if self._require_approval and (self._approver is None or not self._approver(code)): + raise FeatureExecutionError( + "Feature code requires approval but it was not granted " + "(wire an approver or set execution.require_approval=False)." + ) + report = analyze_code(code, self._policy) if not report.safe: reasons = "; ".join(v.message for v in report.violations) @@ -96,10 +125,14 @@ def execute(self, code: str, X: Any) -> Any: before = set(X.columns) namespace: dict[str, Any] = {"df": X.copy(), "pd": pd, "np": np} - try: - exec(compile(code, "", "exec"), {"__builtins__": _SAFE_BUILTINS}, namespace) # noqa: S102 - except Exception as exc: # noqa: BLE001 - surface any runtime error as a typed rejection - raise FeatureExecutionError(f"Feature code failed at runtime: {exc}") from exc + compiled = compile(code, "", "exec") + with self._time_limit(): + try: + exec(compiled, {"__builtins__": _SAFE_BUILTINS}, namespace) # noqa: S102 + except TimeoutError as exc: + raise FeatureExecutionError(f"Feature code timed out after {self._timeout_seconds}s") from exc + except Exception as exc: # noqa: BLE001 - surface any runtime error as a typed rejection + raise FeatureExecutionError(f"Feature code failed at runtime: {exc}") from exc result = namespace["df"] if not isinstance(result, pd.DataFrame): @@ -114,5 +147,26 @@ def execute(self, code: str, X: Any) -> Any: result[col] = result[col].replace([np.inf, -np.inf], np.nan) return result + @contextmanager + def _time_limit(self) -> Iterator[None]: + """Wall-clock guard via SIGALRM. No-op if no timeout is set or not on the main thread.""" + import signal + import threading + + if not self._timeout_seconds or threading.current_thread() is not threading.main_thread(): + yield + return + + def _on_timeout(signum: int, frame: Any) -> None: + raise TimeoutError + + previous = signal.signal(signal.SIGALRM, _on_timeout) + signal.alarm(self._timeout_seconds) + try: + yield + finally: + signal.alarm(0) + signal.signal(signal.SIGALRM, previous) + __all__ = ["FeatureCodeExecutor", "FeatureExecutionError"] diff --git a/tests/features/test_executor_enforcement.py b/tests/features/test_executor_enforcement.py new file mode 100644 index 0000000..bcec225 --- /dev/null +++ b/tests/features/test_executor_enforcement.py @@ -0,0 +1,98 @@ +# Copyright 2026 Firefly Software Foundation. +"""The executor must actually enforce execution.{timeout_seconds, require_approval, sandbox}. + +These were declared config fields read by zero lines (an integrity gap). Real enforcement, real data. +""" + +from __future__ import annotations + +import pandas as pd +import pytest + +from fireflyframework_datascience.features.executor import FeatureCodeExecutor, FeatureExecutionError + + +def test_executor_enforces_timeout() -> None: + executor = FeatureCodeExecutor(timeout_seconds=1) + with pytest.raises(FeatureExecutionError, match="timed out"): + # an infinite loop never adds the column; the timeout must interrupt it + executor.execute("df['x'] = 0\nwhile True:\n pass", pd.DataFrame({"a": [1, 2, 3]})) + + +def test_executor_rejects_unimplemented_sandbox() -> None: + # docker/e2b are not implemented — they must ERROR, not silently run in-process pretending isolation + executor = FeatureCodeExecutor(sandbox="docker") + with pytest.raises(FeatureExecutionError, match="docker"): + executor.execute("df['b'] = df['a'] + 1", pd.DataFrame({"a": [1, 2]})) + + +def test_executor_in_process_sandboxes_still_run() -> None: + for sandbox in ("local", "monty"): + executor = FeatureCodeExecutor(sandbox=sandbox) + out = executor.execute("df['b'] = df['a'] + 1", pd.DataFrame({"a": [1, 2]})) + assert "b" in out.columns + + +def test_executor_enforces_approval() -> None: + # denying approver -> rejected + deny = FeatureCodeExecutor(require_approval=True, approver=lambda code: False) + with pytest.raises(FeatureExecutionError, match="approval"): + deny.execute("df['b'] = df['a'] + 1", pd.DataFrame({"a": [1, 2]})) + + # require_approval with NO approver -> fail closed + no_approver = FeatureCodeExecutor(require_approval=True) + with pytest.raises(FeatureExecutionError, match="approval"): + no_approver.execute("df['b'] = df['a'] + 1", pd.DataFrame({"a": [1, 2]})) + + # approving approver -> runs + allow = FeatureCodeExecutor(require_approval=True, approver=lambda code: True) + out = allow.execute("df['b'] = df['a'] + 1", pd.DataFrame({"a": [1, 2]})) + assert "b" in out.columns + + +def test_executor_defaults_unchanged() -> None: + # default executor: no timeout, no approval, in-process — preserves existing behaviour + out = FeatureCodeExecutor().execute("df['b'] = df['a'] * 2", pd.DataFrame({"a": [1, 2]})) + assert list(out["b"]) == [2, 4] + + +def test_auto_config_wires_full_execution_config_into_executor() -> None: + # The GenAI auto-config must not silently drop any execution control: timeout, sandbox AND + # require_approval are threaded from config into the wired executor. + from fireflyframework_datascience import FireflyDataScienceApplication + from fireflyframework_datascience.core.config import FireflyDataScienceConfig + from fireflyframework_datascience.features import FeatureEngineerPort + + config = FireflyDataScienceConfig() + config.genai.enabled = True + config.execution.require_approval = False # explicit unattended opt-out + config.execution.sandbox = "local" + config.execution.timeout_seconds = 30 + app = FireflyDataScienceApplication.run(config=config, print_output=False) + + engineer = app.container.resolve_optional(FeatureEngineerPort) + assert engineer is not None + executor = engineer._executor # type: ignore[attr-defined] + assert executor._require_approval is False + assert executor._sandbox == "local" + assert executor._timeout_seconds == 30 + + +def test_auto_config_default_is_fail_closed_hitl() -> None: + # With the default execution config (require_approval=True) and no approver wired, the wired + # executor fail-closes — the secure-by-default posture the docs describe. + from fireflyframework_datascience import FireflyDataScienceApplication + from fireflyframework_datascience.core.config import FireflyDataScienceConfig + from fireflyframework_datascience.features import FeatureEngineerPort + + config = FireflyDataScienceConfig() + config.genai.enabled = True + app = FireflyDataScienceApplication.run(config=config, print_output=False) + + engineer = app.container.resolve_optional(FeatureEngineerPort) + assert engineer is not None + executor = engineer._executor # type: ignore[attr-defined] + assert executor._require_approval is True + assert executor._approver is None # no approver shipped → fail-closed until one is wired + with pytest.raises(FeatureExecutionError, match="approval"): + executor.execute("df['b'] = df['a'] + 1", pd.DataFrame({"a": [1, 2]}))