Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
19 changes: 10 additions & 9 deletions docs/security.md
Original file line number Diff line number Diff line change
Expand Up @@ -101,14 +101,15 @@ The post-conditions are enforced in order: a non-`DataFrame` result raises `Feat

!!! warning "Implementation status — what is enforced today"

Layers 1–2 (static analysis + the restricted in-process namespace) are **enforced now** and are
what protects you today. The sandbox *tiers* below (`docker`, `e2b`), `execution.timeout_seconds`,
and the `require_approval` HITL gate are currently **declared, validated configuration** — their
routing and enforcement are on the roadmap (a `CodeExecutorPort` with per-tier adapters and an
approval gate). Until that ships, the real isolation is the in-process `monty` / `local` path:
**do not run genuinely untrusted data through GenAI expecting container/microVM isolation yet.**
Layers 1–2 (static analysis + the restricted in-process namespace) are **enforced**.
`execution.timeout_seconds` is **enforced** (a wall-clock guard on the executor), and
`require_approval` is **enforced** when an approver is wired — fail-closed: code runs only if the
approver grants it. The container/microVM tiers (`docker`, `e2b`) are **not yet implemented and now
fail fast** rather than silently running in-process, so you can't mistake in-process for isolation.
The real isolation today remains the in-process `monty` / `local` path; for genuinely untrusted
data, wait for the container adapters (or supply your own `CodeExecutorPort`).

Layers 1 and 2 run **in-process**. They block the obvious capabilities, but a determined escape against a CPython process is never something to bet sensitive data on. The configuration surface below lets you *declare* stronger isolation for untrusted data via `execution.sandbox` in `ExecutionConfig` (enforcement is roadmap, per the note above):
Layers 1 and 2 run **in-process**. They block the obvious capabilities, but a determined escape against a CPython process is never something to bet sensitive data on. `execution.sandbox` selects the tier; the in-process tiers run the restricted executor below, while `docker` / `e2b` raise until their adapters land:

```python
from fireflyframework_datascience.core.config import FireflyDataScienceConfig
Expand Down Expand Up @@ -151,7 +152,7 @@ The literal type for `sandbox` is exactly `Literal["monty", "docker", "e2b", "lo

Profile overlays outrank the base `firefly-datascience.yaml`, so a `prod` profile can tighten isolation without touching the base file. See [Configuration](configuration.md) for the full precedence order.

Beyond the strongest sandbox sits **HITL** (human-in-the-loop): `execution.require_approval` defaults to `True`, and the design's final tier is a person — not a policy — signing off on generated code before it runs. (Per the status note above, the approval-gate wiring is on the roadmap; today the field is declared and validated.)
Beyond the strongest sandbox sits **HITL** (human-in-the-loop): `execution.require_approval` defaults to `True`, and the executor **enforces it** — when an approver is wired, generated code runs only if the approver grants it (and fail-closed if `require_approval` is set but no approver is available). A person, not a policy, signs off.

!!! note "Defaults are the safe end of every axis"
Out of the box, `sandbox = "monty"` (in-process restricted interpreter), `timeout_seconds = 60`, and `require_approval = True`. You loosen these deliberately — and only `local` removes isolation entirely.
Expand All @@ -163,7 +164,7 @@ The subtle attack is not the model going rogue on its own; it is a **column valu
1. **Static analysis is content-blind.** It rejects `os`, `subprocess`, `socket`, dunder access, and `eval`/`exec`/`open` regardless of *why* the model wrote them — so a successful injection still produces code that gets rejected.
2. **The restricted namespace** means even "clever" injected code has no I/O, no imports, no host reach.
3. **The numeric-new-column contract** means injected code that tries to do anything other than add a numeric feature fails the post-conditions.
4. **Sandboxing + HITL** are the *intended* outer tiers for genuinely untrusted data (route to `docker`/`e2b`, require approval). Their enforcement is on the roadmap (see the Layer 3 status note) — today, rely on points 1–3, which are enforced in-process.
4. **HITL approval** is enforced for untrusted data (require an approver to sign off, fail-closed). The container/microVM tiers (`docker`/`e2b`) are not yet implemented and fail fast rather than running in-process — so points 1–3 plus approval are what protect you today.

!!! warning "The framework does not read your data's meaning"
Firefly cannot inspect or sanitize the *semantics* of your data. Prompt-injection defense rests on capability restriction and sandboxing, not on detecting malicious text. Treat data of unknown provenance as untrusted input: raise `execution.sandbox` and keep `require_approval` on.
Expand Down
13 changes: 12 additions & 1 deletion src/fireflyframework_datascience/features/auto_configuration.py
Original file line number Diff line number Diff line change
Expand Up @@ -23,8 +23,19 @@ class FeaturesAutoConfiguration:
@bean(name="genai_feature_engineer", primary=True)
def feature_engineer(self, config: FireflyDataScienceConfig) -> FeatureEngineerPort:
from fireflyframework_datascience.features import CostBenefitGate
from fireflyframework_datascience.features.executor import FeatureCodeExecutor
from fireflyframework_datascience.features.genai import AgentFeatureProposer, GenAIFeatureEngineer

proposer = AgentFeatureProposer(model=config.genai.default_model)
gate = CostBenefitGate(min_gain=0.0)
return GenAIFeatureEngineer(proposer, gate=gate)
# Honor the FULL declared execution config — do not silently drop a control. Timeout +
# require_approval (fail-closed HITL) + sandbox tier (in-process 'monty'/'local' run the
# restricted executor; 'docker'/'e2b' fail until those adapters land). With the default
# require_approval=True and no approver wired, automated GenAI fail-closes: set
# execution.require_approval=False (or wire an approver) to run it unattended.
executor = FeatureCodeExecutor(
timeout_seconds=config.execution.timeout_seconds,
require_approval=config.execution.require_approval,
sandbox=config.execution.sandbox,
)
return GenAIFeatureEngineer(proposer, gate=gate, executor=executor)
64 changes: 59 additions & 5 deletions src/fireflyframework_datascience/features/executor.py
Original file line number Diff line number Diff line change
Expand Up @@ -14,10 +14,14 @@

from __future__ import annotations

from collections.abc import Callable, Iterator
from contextlib import contextmanager
from typing import Any

from fireflyframework_datascience.core.exceptions import FireflyDataScienceError

_IN_PROCESS_SANDBOXES = {"local", "monty"}

# A deliberately small builtins allowlist — enough for arithmetic/aggregation feature code, nothing else.
_SAFE_BUILTINS = {
name: __builtins__[name] if isinstance(__builtins__, dict) else getattr(__builtins__, name)
Expand Down Expand Up @@ -54,9 +58,20 @@ class FeatureExecutionError(FireflyDataScienceError):
class FeatureCodeExecutor:
"""Vets and executes a feature-engineering snippet against a copy of the dataframe."""

def __init__(self) -> None:
def __init__(
self,
*,
timeout_seconds: int | None = None,
require_approval: bool = False,
approver: Callable[[str], bool] | None = None,
sandbox: str = "local",
) -> None:
from fireflyframework_agentic.execution import SafetyPolicy

self._timeout_seconds = timeout_seconds
self._require_approval = require_approval
self._approver = approver
self._sandbox = sandbox
self._policy = SafetyPolicy(
denied_modules=frozenset(
{"os", "sys", "subprocess", "shutil", "socket", "pathlib", "importlib", "builtins"}
Expand Down Expand Up @@ -86,6 +101,20 @@ def execute(self, code: str, X: Any) -> Any:
"""
from fireflyframework_agentic.execution import analyze_code

# Tier routing: in-process tiers run the restricted exec below; container/microVM tiers are not
# yet implemented and must FAIL rather than silently run in-process pretending to be isolated.
if self._sandbox not in _IN_PROCESS_SANDBOXES:
raise FeatureExecutionError(
f"Execution sandbox {self._sandbox!r} is not yet available; use 'monty' or 'local' "
"(restricted in-process) — container/microVM adapters are on the roadmap."
)
# HITL approval gate: when required, code runs only if an approver grants it (fail-closed).
if self._require_approval and (self._approver is None or not self._approver(code)):
raise FeatureExecutionError(
"Feature code requires approval but it was not granted "
"(wire an approver or set execution.require_approval=False)."
)

report = analyze_code(code, self._policy)
if not report.safe:
reasons = "; ".join(v.message for v in report.violations)
Expand All @@ -96,10 +125,14 @@ def execute(self, code: str, X: Any) -> Any:

before = set(X.columns)
namespace: dict[str, Any] = {"df": X.copy(), "pd": pd, "np": np}
try:
exec(compile(code, "<feature>", "exec"), {"__builtins__": _SAFE_BUILTINS}, namespace) # noqa: S102
except Exception as exc: # noqa: BLE001 - surface any runtime error as a typed rejection
raise FeatureExecutionError(f"Feature code failed at runtime: {exc}") from exc
compiled = compile(code, "<feature>", "exec")
with self._time_limit():
try:
exec(compiled, {"__builtins__": _SAFE_BUILTINS}, namespace) # noqa: S102
except TimeoutError as exc:
raise FeatureExecutionError(f"Feature code timed out after {self._timeout_seconds}s") from exc
except Exception as exc: # noqa: BLE001 - surface any runtime error as a typed rejection
raise FeatureExecutionError(f"Feature code failed at runtime: {exc}") from exc

result = namespace["df"]
if not isinstance(result, pd.DataFrame):
Expand All @@ -114,5 +147,26 @@ def execute(self, code: str, X: Any) -> Any:
result[col] = result[col].replace([np.inf, -np.inf], np.nan)
return result

@contextmanager
def _time_limit(self) -> Iterator[None]:
"""Wall-clock guard via SIGALRM. No-op if no timeout is set or not on the main thread."""
import signal
import threading

if not self._timeout_seconds or threading.current_thread() is not threading.main_thread():
yield
return

def _on_timeout(signum: int, frame: Any) -> None:
raise TimeoutError

previous = signal.signal(signal.SIGALRM, _on_timeout)
signal.alarm(self._timeout_seconds)
try:
yield
finally:
signal.alarm(0)
signal.signal(signal.SIGALRM, previous)


__all__ = ["FeatureCodeExecutor", "FeatureExecutionError"]
98 changes: 98 additions & 0 deletions tests/features/test_executor_enforcement.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,98 @@
# Copyright 2026 Firefly Software Foundation.
"""The executor must actually enforce execution.{timeout_seconds, require_approval, sandbox}.

These were declared config fields read by zero lines (an integrity gap). Real enforcement, real data.
"""

from __future__ import annotations

import pandas as pd
import pytest

from fireflyframework_datascience.features.executor import FeatureCodeExecutor, FeatureExecutionError


def test_executor_enforces_timeout() -> None:
executor = FeatureCodeExecutor(timeout_seconds=1)
with pytest.raises(FeatureExecutionError, match="timed out"):
# an infinite loop never adds the column; the timeout must interrupt it
executor.execute("df['x'] = 0\nwhile True:\n pass", pd.DataFrame({"a": [1, 2, 3]}))


def test_executor_rejects_unimplemented_sandbox() -> None:
# docker/e2b are not implemented — they must ERROR, not silently run in-process pretending isolation
executor = FeatureCodeExecutor(sandbox="docker")
with pytest.raises(FeatureExecutionError, match="docker"):
executor.execute("df['b'] = df['a'] + 1", pd.DataFrame({"a": [1, 2]}))


def test_executor_in_process_sandboxes_still_run() -> None:
for sandbox in ("local", "monty"):
executor = FeatureCodeExecutor(sandbox=sandbox)
out = executor.execute("df['b'] = df['a'] + 1", pd.DataFrame({"a": [1, 2]}))
assert "b" in out.columns


def test_executor_enforces_approval() -> None:
# denying approver -> rejected
deny = FeatureCodeExecutor(require_approval=True, approver=lambda code: False)
with pytest.raises(FeatureExecutionError, match="approval"):
deny.execute("df['b'] = df['a'] + 1", pd.DataFrame({"a": [1, 2]}))

# require_approval with NO approver -> fail closed
no_approver = FeatureCodeExecutor(require_approval=True)
with pytest.raises(FeatureExecutionError, match="approval"):
no_approver.execute("df['b'] = df['a'] + 1", pd.DataFrame({"a": [1, 2]}))

# approving approver -> runs
allow = FeatureCodeExecutor(require_approval=True, approver=lambda code: True)
out = allow.execute("df['b'] = df['a'] + 1", pd.DataFrame({"a": [1, 2]}))
assert "b" in out.columns


def test_executor_defaults_unchanged() -> None:
# default executor: no timeout, no approval, in-process — preserves existing behaviour
out = FeatureCodeExecutor().execute("df['b'] = df['a'] * 2", pd.DataFrame({"a": [1, 2]}))
assert list(out["b"]) == [2, 4]


def test_auto_config_wires_full_execution_config_into_executor() -> None:
# The GenAI auto-config must not silently drop any execution control: timeout, sandbox AND
# require_approval are threaded from config into the wired executor.
from fireflyframework_datascience import FireflyDataScienceApplication
from fireflyframework_datascience.core.config import FireflyDataScienceConfig
from fireflyframework_datascience.features import FeatureEngineerPort

config = FireflyDataScienceConfig()
config.genai.enabled = True
config.execution.require_approval = False # explicit unattended opt-out
config.execution.sandbox = "local"
config.execution.timeout_seconds = 30
app = FireflyDataScienceApplication.run(config=config, print_output=False)

engineer = app.container.resolve_optional(FeatureEngineerPort)
assert engineer is not None
executor = engineer._executor # type: ignore[attr-defined]
assert executor._require_approval is False
assert executor._sandbox == "local"
assert executor._timeout_seconds == 30


def test_auto_config_default_is_fail_closed_hitl() -> None:
# With the default execution config (require_approval=True) and no approver wired, the wired
# executor fail-closes — the secure-by-default posture the docs describe.
from fireflyframework_datascience import FireflyDataScienceApplication
from fireflyframework_datascience.core.config import FireflyDataScienceConfig
from fireflyframework_datascience.features import FeatureEngineerPort

config = FireflyDataScienceConfig()
config.genai.enabled = True
app = FireflyDataScienceApplication.run(config=config, print_output=False)

engineer = app.container.resolve_optional(FeatureEngineerPort)
assert engineer is not None
executor = engineer._executor # type: ignore[attr-defined]
assert executor._require_approval is True
assert executor._approver is None # no approver shipped → fail-closed until one is wired
with pytest.raises(FeatureExecutionError, match="approval"):
executor.execute("df['b'] = df['a'] + 1", pd.DataFrame({"a": [1, 2]}))
Loading