Skip to content

framework__eval — sandboxed scratchpad against the live container #79

@tonydspaniard

Description

@tonydspaniard

Goal

Build framework__eval — a sandboxed scratchpad MCP tool (and bin/altair eval CLI equivalent) that executes a short PHP snippet inside the project's live container, returns the structured result, and tears down without persisting state. The agent's "let me check" primitive.

Why

When an agent forms a hypothesis — "does UserRepository::findByEmail return null or throw when no match?", "what does formatNegotiator->getContentTypeByFormat('json') actually return?" — the cheapest way to validate it is to run a few lines of PHP. Today that requires writing a temporary script, running it via php -r, knowing the autoloader path, and bootstrapping the container. Agents don't do that well.

A scratchpad tool collapses all of this:

// agent calls framework__eval with this snippet
$users = container(UserRepository::class);
return $users->findByEmail('does-not-exist@example.com');

Returns null. Hypothesis validated in 200ms. No file written, no test created.

How it works

The tool spawns a fresh PHP process (sandboxed; not the agent's session) with the project's autoloader and the container booted. The snippet runs inside a wrapper that:

  1. Provides a container() helper (resolves from the framework Container)
  2. Captures return value, stdout, and any throwables
  3. Times out at a configurable wall clock (default 5s)
  4. Returns structured output as JSON
{
  "result": {
    "type": "null",
    "value": null
  },
  "stdout": "",
  "stderr": "",
  "duration_ms": 187,
  "memory_peak_bytes": 4194304,
  "exception": null
}

Or on exception:

{
  "result": null,
  "exception": {
    "class": "App\\User\\UserNotFoundException",
    "message": "No user with email 'x'",
    "file": "src/App/User/UserRepository.php",
    "line": 42,
    "stack_trace": ["..."]
  },
  "duration_ms": 134
}

The agent reads result.value or exception.class and learns. Without breaking the workspace.

Guard rails

This is the most dangerous tool in the MCP palette — eval is eval. The guardrails:

  1. Always runs in a fresh subprocess. No state contamination between calls. No persistent container.
  2. Read-only DB connection by default. A separate --writes flag enables writes; the MCP server respects the same --allow-writes boot flag from univeros/mcp — Model Context Protocol server for agent-native workflows #69.
  3. Filesystem sandbox. The subprocess inherits a chdir to a temp directory; file_put_contents to absolute paths outside the project is blocked at the subprocess level via open_basedir.
  4. No network by default. A separate --network flag enables outbound HTTP.
  5. Time limit. Default 5s wall clock; max 60s. Hard kill via SIGKILL.
  6. Memory limit. Default 128MB; max 512MB.
  7. No exec/shell_exec/passthru available. Disabled via disable_functions.
  8. No eval available inside the snippet. Disabled via disable_functions.

The --unsafe flag lifts every guard rail simultaneously, for cases where the agent (with explicit user confirmation) needs to do something the sandbox forbids. Logs to .altair/events.jsonl (#76) every time it's used.

API surface

CLI

bin/altair eval 'return container(UserRepository::class)->count();'
bin/altair eval --file=snippet.php
bin/altair eval --timeout=10s 'return ...;'
bin/altair eval --writes 'container(EntityManager::class)->flush();'
bin/altair eval --network 'return file_get_contents("https://...");'
bin/altair eval --json                # JSON output (default for MCP)

Default output is pretty: a heredoc-ish summary with the result type, duration, peak memory, and (if exception) the trace. JSON output is the MCP variant.

MCP tool

{
  "name": "framework__eval",
  "description": "Execute a PHP snippet against the live container, return structured result",
  "inputSchema": {
    "snippet": "string",
    "timeout_ms": "integer (default 5000, max 60000)",
    "allow_writes": "boolean (default false)",
    "allow_network": "boolean (default false)"
  }
}

Capturing the return value

PHP doesn't easily let you capture the return of a script. The wrapper inlines the snippet inside a function:

// generated wrapper
require __DIR__ . '/../vendor/autoload.php';
$result = (function () use (&$ctx) {
    // user snippet inserted here
    return container(UserRepository::class)->count();
})();
file_put_contents('php://stderr', json_encode([
    'result' => Altair\Eval\Encoder::encode($result),
    'memory_peak_bytes' => memory_get_peak_usage(true),
]));

Stderr is reserved for the structured result so stdout remains usable for echo/print capture.

Altair\Eval\Encoder produces the typed JSON form:

{ "type": "object", "class": "App\\User\\User", "id": 42 }
{ "type": "array", "value": [...] }
{ "type": "string", "value": "..." }
{ "type": "null", "value": null }
{ "type": "iterable", "preview": [...], "exhausted": false, "size_hint": 100 }

Objects render their __debugInfo() if available; otherwise public-property snapshot, capped at depth 3. Iterables show the first 50 items (more on request).

Shape

src/Altair/Eval/
├── Cli/
│   └── EvalCommand.php
├── Mcp/
│   └── EvalTool.php
├── Runner/
│   ├── SubprocessRunner.php       # spawns the wrapper, captures result
│   ├── WrapperBuilder.php         # generates the PHP wrapper file
│   └── SecurityProfile.php        # encodes the guardrail flags into php.ini / open_basedir
├── Encoder/
│   ├── ValueEncoder.php           # converts return value to structured JSON
│   └── ExceptionEncoder.php
└── composer.json

Acceptance criteria

  • bin/altair eval 'return 1 + 1;' returns 2 with a clean structured output
  • Container helper works: bin/altair eval 'return container(SomeBoundInterface::class);' resolves correctly
  • Exception is captured cleanly: bin/altair eval 'throw new \RuntimeException("nope");' returns the exception JSON, exit code 1
  • Timeout kills runaway snippets after the limit; clean error message, no zombies
  • Memory limit enforced; clean error rather than OOM kill
  • Network disabled by default; --network toggles
  • DB writes disabled by default; --writes toggles
  • disable_functions prevents exec, shell_exec, passthru, eval, assert, system
  • open_basedir confines filesystem writes to the project tree
  • framework__eval MCP tool produces the structured response shape
  • --unsafe mode emits a mutation event (Examples library — .altair/examples/ + MCP tools for idiomatic patterns #76) every time it's used
  • Tests:
    • Golden cases for ValueEncoder (scalars, arrays, objects, iterables, recursion)
    • Exception encoding (with and without previous chain)
    • Subprocess timeout test (snippet that loops, assert it's killed)
    • Security profile test (verifies disable_functions actually disables what we expect)
    • End-to-end MCP test

Out of scope

Dependencies

No new external deps — uses proc_open from stdlib.

Why this matters

Agents form hypotheses constantly. Without eval, every hypothesis becomes a write-a-test-and-run-it loop. With it, the loop collapses to milliseconds. The single biggest productivity multiplier for the agent's middle-loop reasoning.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions