Feat/basilica by distributedstatemachine · Pull Request #1 · one-covenant/ridges

distributedstatemachine · 2026-01-12T21:49:23Z

Summary by CodeRabbit

New Features
- Added Basilica Sandbox integration, enabling containerized script execution with file mounting, input/output handling, and comprehensive result logging.
Tests
- Added test suite covering Basilica Sandbox integration, sandbox manager operations, problem evaluation, and concurrent execution scenarios.

_{✏️ Tip: You can customize this high-level summary in your review settings.}

- Add BasilicaSandboxManager as thin wrapper around basilica-sdk-python - Add comprehensive test script with concurrent evaluation support - Remove basilica-sdk from optional dependencies (use local uv install) - Supports both warm pool (fast) and cold start sandbox creation Setup: cd ridges && uv pip install -e ../basilica/crates/basilica-sdk-python export BASILICA_API_URL=http://localhost:9080 export BASILICA_API_TOKEN=dev-token Usage: python test_basilica_sandbox.py --concurrent 50 --eval

- Reduce from 587 to 207 lines (~65% smaller) - Move imports to top level - Consolidate duplicate worker functions into single worker() - Add header() helper for consistent formatting - Simplify result tracking and error handling

- Add --dx flag for DX-focused showcase test - Use context managers for automatic sandbox cleanup - Use namespaced API (sandbox.files, sandbox.process) - Use python_sandbox() factory function - Demonstrate global configuration with basilica.configure() - Use improved concurrent test with context managers

- Reorganize tests around SDK capabilities, not test numbers - Showcase modern API patterns throughout (context managers, namespaced API) - Simplify CLI: --full for all tests, --scale N for stress test - Remove redundant DX showcase section (entire file is now the showcase) - Cleaner output with section headers and consistent formatting - Global basilica.configure() at top for cleaner test code

- Focus on testing ridges + Basilica integration, not SDK demos - Clean structure: SDK, SandboxManager, Polyglot, Concurrent - Simple CLI: --full for Polyglot, --scale N for stress test - Uses new SDK conventions (context managers, namespaced API) throughout

- Rename test_concurrent to test_concurrent_evals - Run real Polyglot evaluations instead of simple computations - Show test pass/fail breakdown for each evaluation - Include Polyglot test as standard (not just with --full) - --full now includes concurrent evals, --scale N for custom count

- Use basilica.configure() for global SDK configuration - Use python_sandbox() factory function - Use namespaced API (sandbox.files, sandbox.process) - Cleaner code with better comments

- basilica_sandbox_manager.py: 156 → 76 lines (-51%) - test_basilica_sandbox.py: 318 → 130 lines (-59%) - Same functionality, less boilerplate

coderabbitai · 2026-01-12T21:49:33Z

Walkthrough

Introduces a Basilica Sandbox adapter (BasilicaSandboxManager) that wraps the Basilica SDK's Sandbox to implement a standardized interface. The implementation includes sandbox initialization with file mounting and script execution, result capture with logs, and a factory function for backend selection. A comprehensive test suite validates SDK integration, manager functionality, and concurrent evaluation scenarios.

Changes

Cohort / File(s)	Summary
Core Sandbox Adapter `evaluator/sandbox/basilica_sandbox_manager.py`	Introduces `BasilicaSandboxManager` class wrapping basilica.Sandbox; `SandboxHandle` dataclass to represent sandbox instances; `initialize_sandbox()` method configuring container-based Python sandbox with optional file mounting via `on_mount` callback; `run_sandbox()` method executing scripts and capturing stdout/stderr/output.json; `get_sandbox_manager()` factory function selecting backend based on configuration.
Configuration `pyproject.toml`	Adds 4 commented lines documenting installation and local linking of basilica-sdk-python for sandbox support.
Integration Tests `test_basilica_sandbox.py`	Introduces test module with `test_sdk()` validating SDK connection and file I/O, `test_manager()` exercising initialization and execution, `test_polyglot()` validating problem evaluation, `test_concurrent()` running parallel evals with statistics; Click-based CLI entry point with `--full`, `--scale`, `--problem`, `--quiet` options; enforces `BASILICA_API_TOKEN` presence.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Manager as BasilicaSandboxManager
    participant Basilica as basilica.Sandbox
    participant Container as Container FS

    Client->>Manager: initialize_sandbox(script_path, input_data, on_mount)
    Manager->>Basilica: create sandbox instance
    Basilica->>Container: allocate container
    
    opt on_mount callback
        Manager->>Manager: call on_mount(target_path)
        Manager->>Container: mount files via callback
    end
    
    Manager->>Container: write script to /sandbox
    Manager->>Container: write input.json to /sandbox
    Manager->>Client: return SandboxHandle
    
    Client->>Manager: run_sandbox(handle)
    Manager->>Basilica: execute script in sandbox
    Basilica->>Container: run Python script
    Container-->>Basilica: stdout/stderr/exit code
    Manager->>Container: read output.json from /sandbox
    Manager->>Basilica: cleanup sandbox
    Manager->>Client: return SandboxResultWithLogs

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

In the sandbox realm where Basilica dwells,
Scripts execute safely in containerized shells,
Files mounted and mounted, results collected with care,
The evaluator spreads its concurrent wings in the air! 🐰✨

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check	❓ Inconclusive	The pull request title 'Feat/basilica' is vague and generic, using a conventional prefix without describing the actual feature being implemented.	Use a more descriptive title that explains the feature, such as 'Add BasilicaSandboxManager adapter for basilica SDK integration' or 'Implement basilica sandbox backend support'.

✅ Passed checks (1 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing touches

📝 Generate docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 3

🤖 Fix all issues with AI agents

In @evaluator/sandbox/basilica_sandbox_manager.py:
- Around line 49-52: The current try/except around sandbox.files.write (using
open(local).read() and os.path.relpath) silently drops files on
UnicodeDecodeError/IOError; update the logic in basilica_sandbox_manager.py to
detect and handle binary/unreadable files instead of passing: attempt a text
read first but on UnicodeDecodeError fall back to binary read and call
sandbox.files.write with bytes if the API supports it, and always log a warning
(including the file path from os.path.relpath(local, tmp) and the exception)
when a file is skipped or an IO error occurs so users can see which files were
not mounted.

In @test_basilica_sandbox.py:
- Around line 29-32: Module-level basilica.configure is called with a possibly
empty BASILICA_API_TOKEN at import time, causing silent invalid configuration
for tests; instead, create a helper like _configure_basilica() that reads
BASILICA_API_TOKEN and BASILICA_API_URL and only calls basilica.configure when
the token is present (or raise/exit if required), then remove the top-level
basilica.configure call and invoke _configure_basilica() at the start of main()
and at the start of each test that needs Basilica so imports no longer configure
with an empty token.

🧹 Nitpick comments (4)

evaluator/sandbox/basilica_sandbox_manager.py (4)

35-35: Avoid mutable default argument.

Using {} as a default argument is a well-known Python pitfall—the same dict instance is shared across all calls.

♻️ Proposed fix

     def initialize_sandbox(
         self, *, name: str, script_path: str, input_data: Any = None,
-        env_vars: Dict[str, str] = {}, on_mount: Callable[[str], None] = None,
-        timeout_seconds: int = None
+        env_vars: Dict[str, str] | None = None, on_mount: Callable[[str], None] | None = None,
+        timeout_seconds: int | None = None
     ) -> SandboxHandle:
         script_name = os.path.basename(script_path)
-        sandbox = python_sandbox(runtime="container", env={**env_vars, "PYTHONUNBUFFERED": "1"},
+        sandbox = python_sandbox(runtime="container", env={**(env_vars or {}), "PYTHONUNBUFFERED": "1"},
                                   timeout_seconds=timeout_seconds or 3600)

49-55: Use context managers for file I/O to avoid resource leaks.

open(local).read() and open(script_path).read() leave file handles unclosed until garbage collection. In concurrent scenarios, this could exhaust file descriptors.

♻️ Proposed fix

                     local = os.path.join(root, f)
                     try:
-                        sandbox.files.write(f"/sandbox/{os.path.relpath(local, tmp)}", open(local).read())
+                        with open(local) as fh:
+                            sandbox.files.write(f"/sandbox/{os.path.relpath(local, tmp)}", fh.read())
                     except (UnicodeDecodeError, IOError):
                         pass
             shutil.rmtree(tmp, ignore_errors=True)
         
-        sandbox.files.write(f"/sandbox/{script_name}", open(script_path).read())
+        with open(script_path) as fh:
+            sandbox.files.write(f"/sandbox/{script_name}", fh.read())

81-87: Consider explicit None type hints for PEP 484 compliance.

The function works correctly, but explicit type hints improve clarity and static analysis.

♻️ Proposed fix

-def get_sandbox_manager(inference_gateway_url: str = None, backend: str = None):
+def get_sandbox_manager(inference_gateway_url: str | None = None, backend: str | None = None):
     """Factory: returns BasilicaSandboxManager or SandboxManager."""

26-31: Document or utilize the unused inference_gateway_url parameter.

The parameter exists for interface parity with SandboxManager but is silently ignored. Consider adding a docstring note or using it as a fallback for BASILICA_API_URL.

📝 Option: Add documentation

     def __init__(self, inference_gateway_url: str = None):
+        """Initialize BasilicaSandboxManager.
+        
+        Args:
+            inference_gateway_url: Unused; kept for interface parity with SandboxManager.
+                                   Basilica uses BASILICA_API_URL env var instead.
+        """
         api_url = os.environ.get("BASILICA_API_URL", "http://localhost:9080")

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 84b6a6a and 61d2eea.

📒 Files selected for processing (3)

evaluator/sandbox/basilica_sandbox_manager.py
pyproject.toml
test_basilica_sandbox.py

🧰 Additional context used

🧬 Code graph analysis (2)

evaluator/sandbox/basilica_sandbox_manager.py (2)

evaluator/models.py (2)

Sandbox (9-15)

SandboxResultWithLogs (27-28)

evaluator/sandbox/sandbox_manager.py (1)

SandboxManager (21-188)

test_basilica_sandbox.py (3)

evaluator/sandbox/basilica_sandbox_manager.py (3)

BasilicaSandboxManager (23-78)

initialize_sandbox (33-59)

run_sandbox (61-78)

models/problem.py (1)

ProblemTestResultStatus (18-21)

evaluator/problem_suites/problem_suite.py (2)

has_problem_name (36-37)

get_problem (39-40)

🪛 Ruff (0.14.10)

evaluator/sandbox/basilica_sandbox_manager.py

26-26: Unused method argument: inference_gateway_url

(ARG002)

26-26: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)

30-30: Avoid specifying long messages outside the exception class

(TRY003)

35-35: Do not use mutable data structures for argument defaults

Replace with None; initialize within function

(B006)

35-35: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)

36-36: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)

69-69: Do not use bare except

(E722)

77-77: Multiple statements on one line (colon)

(E701)

78-78: Do not use bare except

(E722)

78-78: try-except-pass detected, consider logging the exception

(S110)

78-78: Multiple statements on one line (colon)

(E701)

81-81: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)

81-81: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)

test_basilica_sandbox.py

34-34: Do not assign a lambda expression, use a def

Rewrite ok as a def

(E731)

35-35: Do not assign a lambda expression, use a def

Rewrite fail as a def

(E731)

36-36: Do not assign a lambda expression, use a def

Rewrite section as a def

(E731)

51-51: Consider moving this statement to an else block

(TRY300)

52-52: Do not catch blind exception: Exception

(BLE001)

53-53: Multiple statements on one line (semicolon)

(E702)

53-53: Multiple statements on one line (semicolon)

(E702)

62-62: Multiple statements on one line (semicolon)

(E702)

70-70: Consider moving this statement to an else block

(TRY300)

71-71: Do not catch blind exception: Exception

(BLE001)

72-72: Multiple statements on one line (semicolon)

(E702)

72-72: Multiple statements on one line (semicolon)

(E702)

81-81: Multiple statements on one line (semicolon)

(E702)

93-93: Consider moving this statement to an else block

(TRY300)

94-94: Do not catch blind exception: Exception

(BLE001)

95-95: Multiple statements on one line (semicolon)

(E702)

95-95: Multiple statements on one line (semicolon)

(E702)

110-110: Multiple statements on one line (colon)

(E701)

113-113: Multiple statements on one line (colon)

(E701)

122-122: Multiple statements on one line (colon)

(E701)

123-123: Do not catch blind exception: Exception

(BLE001)

125-125: Multiple statements on one line (semicolon)

(E702)

126-126: Multiple statements on one line (colon)

(E701)

149-149: Multiple statements on one line (semicolon)

(E702)

157-157: Multiple statements on one line (colon)

(E701)

🔇 Additional comments (7)

pyproject.toml (1)

28-30: LGTM - Clear setup instructions.

The comments provide helpful guidance for developers who need to set up the Basilica SDK locally. Consider adding these instructions to a README or CONTRIBUTING.md for better discoverability.

evaluator/sandbox/basilica_sandbox_manager.py (1)

15-20: LGTM - Clean dataclass design.

The SandboxHandle provides a clear abstraction for holding sandbox state between initialization and execution.

test_basilica_sandbox.py (5)

39-53: LGTM - Good SDK integration test.

The test covers connection, code execution, and file I/O with proper cleanup via context manager.

56-74: LGTM - Exercises manager end-to-end.

The test validates initialization, script execution, and result extraction. The finally block ensures cleanup of the temporary file.

77-95: LGTM - Good integration with problem suite.

The test validates the full evaluation pipeline using the existing POLYGLOT_PY_SUITE infrastructure.

98-138: LGTM - Solid concurrent test with proper synchronization.

The thread-safe stats collection and bounded thread pool are well-implemented. The 80% success threshold is a reasonable heuristic for integration tests.

One minor readability note: the list comprehension on line 101 is quite dense; consider extracting to a separate line for clarity.

141-165: LGTM - Clean CLI implementation.

The Click-based CLI provides good options for full test runs, scaling, and verbosity control. The final summary and exit code handling are appropriate.

coderabbitai · 2026-01-12T21:54:09Z

evaluator/sandbox/basilica_sandbox_manager.py

+                    try:
+                        sandbox.files.write(f"/sandbox/{os.path.relpath(local, tmp)}", open(local).read())
+                    except (UnicodeDecodeError, IOError):
+                        pass


⚠️ Potential issue | 🟡 Minor

Silent failure on binary/unreadable files may hide issues.

Silently passing on UnicodeDecodeError/IOError means binary files or files with encoding issues won't be mounted, and users won't know. Consider logging a warning or supporting binary file uploads.

🔧 Option: Log skipped files

try: - sandbox.files.write(f"/sandbox/{os.path.relpath(local, tmp)}", open(local).read()) + with open(local) as fh: + sandbox.files.write(f"/sandbox/{os.path.relpath(local, tmp)}", fh.read()) except (UnicodeDecodeError, IOError): - pass + # Consider logging: logger.warning(f"Skipped non-text file: {local}") + pass

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

try:

sandbox.files.write(f"/sandbox/{os.path.relpath(local, tmp)}", open(local).read())

except (UnicodeDecodeError, IOError):

pass

try:

with open(local) as fh:

sandbox.files.write(f"/sandbox/{os.path.relpath(local, tmp)}", fh.read())

except (UnicodeDecodeError, IOError):

# Consider logging: logger.warning(f"Skipped non-text file: {local}")

pass

🤖 Prompt for AI Agents

In @evaluator/sandbox/basilica_sandbox_manager.py around lines 49 - 52, The current try/except around sandbox.files.write (using open(local).read() and os.path.relpath) silently drops files on UnicodeDecodeError/IOError; update the logic in basilica_sandbox_manager.py to detect and handle binary/unreadable files instead of passing: attempt a text read first but on UnicodeDecodeError fall back to binary read and call sandbox.files.write with bytes if the API supports it, and always log a warning (including the file path from os.path.relpath(local, tmp) and the exception) when a file is skipped or an IO error occurs so users can see which files were not mounted.

coderabbitai · 2026-01-12T21:54:10Z

evaluator/sandbox/basilica_sandbox_manager.py

+            try:
+                output = json.loads(handle.sandbox.files.read("/sandbox/output.json"))
+            except:
+                output = {"success": False, "error": "Failed to read output.json"}


⚠️ Potential issue | 🟡 Minor

Avoid bare except: — catch specific exceptions.

Bare except: catches SystemExit, KeyboardInterrupt, and other exceptions that typically shouldn't be swallowed. This also masks the actual failure reason when output.json is missing or malformed.

🐛 Proposed fix

try: output = json.loads(handle.sandbox.files.read("/sandbox/output.json")) - except: + except (json.JSONDecodeError, FileNotFoundError, Exception) as e: output = {"success": False, "error": "Failed to read output.json"} + # Consider logging e for debugging

🧰 Tools

🪛 Ruff (0.14.10)

69-69: Do not use bare except

(E722)

coderabbitai · 2026-01-12T21:54:10Z

test_basilica_sandbox.py

+basilica.configure(
+    api_url=os.environ.get("BASILICA_API_URL", "http://localhost:9080"),
+    api_key=os.environ.get("BASILICA_API_TOKEN", ""),
+)


⚠️ Potential issue | 🟡 Minor

Module-level basilica.configure with empty token may cause confusing failures.

If BASILICA_API_TOKEN is unset, this configures basilica with an empty API key at import time. The main() function checks for the token later (line 148), but tests imported without running main() would silently use an invalid configuration.

🐛 Option: Move configuration inside test functions or guard it

-basilica.configure( - api_url=os.environ.get("BASILICA_API_URL", "http://localhost:9080"), - api_key=os.environ.get("BASILICA_API_TOKEN", ""), -) +def _configure_basilica(): + api_key = os.environ.get("BASILICA_API_TOKEN") + if not api_key: + raise ValueError("BASILICA_API_TOKEN required") + basilica.configure( + api_url=os.environ.get("BASILICA_API_URL", "http://localhost:9080"), + api_key=api_key, + )

Then call _configure_basilica() at the start of each test or in main().

🤖 Prompt for AI Agents

In @test_basilica_sandbox.py around lines 29 - 32, Module-level basilica.configure is called with a possibly empty BASILICA_API_TOKEN at import time, causing silent invalid configuration for tests; instead, create a helper like _configure_basilica() that reads BASILICA_API_TOKEN and BASILICA_API_URL and only calls basilica.configure when the token is present (or raise/exit if required), then remove the top-level basilica.configure call and invoke _configure_basilica() at the start of main() and at the start of each test that needs Basilica so imports no longer configure with an empty token.

Test User added 8 commits January 12, 2026 14:51

refactor: simplify test_basilica_sandbox.py

fa0b555

- Reduce from 587 to 207 lines (~65% smaller) - Move imports to top level - Consolidate duplicate worker functions into single worker() - Add header() helper for consistent formatting - Simplify result tracking and error handling

refactor: use improved SDK interface in BasilicaSandboxManager

0853bd6

- Use basilica.configure() for global SDK configuration - Use python_sandbox() factory function - Use namespaced API (sandbox.files, sandbox.process) - Cleaner code with better comments

refactor: slim down basilica_sandbox_manager and test file

61d2eea

- basilica_sandbox_manager.py: 156 → 76 lines (-51%) - test_basilica_sandbox.py: 318 → 130 lines (-59%) - Same functionality, less boilerplate

coderabbitai bot reviewed Jan 12, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feat/basilica#1

Feat/basilica#1
distributedstatemachine wants to merge 8 commits intomainfrom
feat/basilica

distributedstatemachine commented Jan 12, 2026 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Jan 12, 2026 •

edited

Loading

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot Jan 12, 2026

Uh oh!

coderabbitai bot Jan 12, 2026

Uh oh!

coderabbitai bot Jan 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

distributedstatemachine commented Jan 12, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Jan 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Poem

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Jan 12, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Jan 12, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Jan 12, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

distributedstatemachine commented Jan 12, 2026 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Jan 12, 2026 •

edited

Loading