
Feat/basilica#1

Open
distributedstatemachine wants to merge 8 commits into main from
feat/basilica

Conversation

@distributedstatemachine
Member

@distributedstatemachine distributedstatemachine commented Jan 12, 2026

Summary by CodeRabbit

  • New Features

    • Added Basilica Sandbox integration, enabling containerized script execution with file mounting, input/output handling, and comprehensive result logging.
  • Tests

    • Added test suite covering Basilica Sandbox integration, sandbox manager operations, problem evaluation, and concurrent execution scenarios.

✏️ Tip: You can customize this high-level summary in your review settings.

Test User added 8 commits January 12, 2026 14:51
- Add BasilicaSandboxManager as thin wrapper around basilica-sdk-python
- Add comprehensive test script with concurrent evaluation support
- Remove basilica-sdk from optional dependencies (use local uv install)
- Supports both warm pool (fast) and cold start sandbox creation

Setup:
  cd ridges && uv pip install -e ../basilica/crates/basilica-sdk-python
  export BASILICA_API_URL=http://localhost:9080
  export BASILICA_API_TOKEN=dev-token

Usage:
  python test_basilica_sandbox.py --concurrent 50 --eval
- Reduce from 587 to 207 lines (~65% smaller)
- Move imports to top level
- Consolidate duplicate worker functions into single worker()
- Add header() helper for consistent formatting
- Simplify result tracking and error handling
- Add --dx flag for DX-focused showcase test
- Use context managers for automatic sandbox cleanup
- Use namespaced API (sandbox.files, sandbox.process)
- Use python_sandbox() factory function
- Demonstrate global configuration with basilica.configure()
- Use improved concurrent test with context managers
- Reorganize tests around SDK capabilities, not test numbers
- Showcase modern API patterns throughout (context managers, namespaced API)
- Simplify CLI: --full for all tests, --scale N for stress test
- Remove redundant DX showcase section (entire file is now the showcase)
- Cleaner output with section headers and consistent formatting
- Global basilica.configure() at top for cleaner test code
- Focus on testing ridges + Basilica integration, not SDK demos
- Clean structure: SDK, SandboxManager, Polyglot, Concurrent
- Simple CLI: --full for Polyglot, --scale N for stress test
- Uses new SDK conventions (context managers, namespaced API) throughout
- Rename test_concurrent to test_concurrent_evals
- Run real Polyglot evaluations instead of simple computations
- Show test pass/fail breakdown for each evaluation
- Include Polyglot test as standard (not just with --full)
- --full now includes concurrent evals, --scale N for custom count
- Use basilica.configure() for global SDK configuration
- Use python_sandbox() factory function
- Use namespaced API (sandbox.files, sandbox.process)
- Cleaner code with better comments
- basilica_sandbox_manager.py: 156 → 76 lines (-51%)
- test_basilica_sandbox.py: 318 → 130 lines (-59%)
- Same functionality, less boilerplate
@coderabbitai

coderabbitai bot commented Jan 12, 2026

Walkthrough

Introduces a Basilica Sandbox adapter (BasilicaSandboxManager) that wraps the Basilica SDK's Sandbox to implement a standardized interface. The implementation includes sandbox initialization with file mounting and script execution, result capture with logs, and a factory function for backend selection. A comprehensive test suite validates SDK integration, manager functionality, and concurrent evaluation scenarios.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Core Sandbox Adapter: evaluator/sandbox/basilica_sandbox_manager.py | Introduces BasilicaSandboxManager class wrapping basilica.Sandbox; SandboxHandle dataclass to represent sandbox instances; initialize_sandbox() method configuring container-based Python sandbox with optional file mounting via on_mount callback; run_sandbox() method executing scripts and capturing stdout/stderr/output.json; get_sandbox_manager() factory function selecting backend based on configuration. |
| Configuration: pyproject.toml | Adds 4 commented lines documenting installation and local linking of basilica-sdk-python for sandbox support. |
| Integration Tests: test_basilica_sandbox.py | Introduces test module with test_sdk() validating SDK connection and file I/O, test_manager() exercising initialization and execution, test_polyglot() validating problem evaluation, test_concurrent() running parallel evals with statistics; Click-based CLI entry point with --full, --scale, --problem, --quiet options; enforces BASILICA_API_TOKEN presence. |

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Manager as BasilicaSandboxManager
    participant Basilica as basilica.Sandbox
    participant Container as Container FS

    Client->>Manager: initialize_sandbox(script_path, input_data, on_mount)
    Manager->>Basilica: create sandbox instance
    Basilica->>Container: allocate container
    
    opt on_mount callback
        Manager->>Manager: call on_mount(target_path)
        Manager->>Container: mount files via callback
    end
    
    Manager->>Container: write script to /sandbox
    Manager->>Container: write input.json to /sandbox
    Manager->>Client: return SandboxHandle
    
    Client->>Manager: run_sandbox(handle)
    Manager->>Basilica: execute script in sandbox
    Basilica->>Container: run Python script
    Container-->>Basilica: stdout/stderr/exit code
    Manager->>Container: read output.json from /sandbox
    Manager->>Basilica: cleanup sandbox
    Manager->>Client: return SandboxResultWithLogs

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

In the sandbox realm where Basilica dwells,
Scripts execute safely in containerized shells,
Files mounted and mounted, results collected with care,
The evaluator spreads its concurrent wings in the air! 🐰✨

🚥 Pre-merge checks | ✅ 1 | ❌ 2
❌ Failed checks (1 warning, 1 inconclusive)
| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Title check | ❓ Inconclusive | The pull request title 'Feat/basilica' is vague and generic, using a conventional prefix without describing the actual feature being implemented. | Use a more descriptive title that explains the feature, such as 'Add BasilicaSandboxManager adapter for basilica SDK integration' or 'Implement basilica sandbox backend support'. |

✅ Passed checks (1 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🤖 Fix all issues with AI agents
In @evaluator/sandbox/basilica_sandbox_manager.py:
- Around line 49-52: The current try/except around sandbox.files.write (using
open(local).read() and os.path.relpath) silently drops files on
UnicodeDecodeError/IOError; update the logic in basilica_sandbox_manager.py to
detect and handle binary/unreadable files instead of passing: attempt a text
read first but on UnicodeDecodeError fall back to binary read and call
sandbox.files.write with bytes if the API supports it, and always log a warning
(including the file path from os.path.relpath(local, tmp) and the exception)
when a file is skipped or an IO error occurs so users can see which files were
not mounted.

In @test_basilica_sandbox.py:
- Around line 29-32: Module-level basilica.configure is called with a possibly
empty BASILICA_API_TOKEN at import time, causing silent invalid configuration
for tests; instead, create a helper like _configure_basilica() that reads
BASILICA_API_TOKEN and BASILICA_API_URL and only calls basilica.configure when
the token is present (or raise/exit if required), then remove the top-level
basilica.configure call and invoke _configure_basilica() at the start of main()
and at the start of each test that needs Basilica so imports no longer configure
with an empty token.
🧹 Nitpick comments (4)
evaluator/sandbox/basilica_sandbox_manager.py (4)

35-35: Avoid mutable default argument.

Using {} as a default argument is a well-known Python pitfall—the same dict instance is shared across all calls.

♻️ Proposed fix
     def initialize_sandbox(
         self, *, name: str, script_path: str, input_data: Any = None,
-        env_vars: Dict[str, str] = {}, on_mount: Callable[[str], None] = None,
-        timeout_seconds: int = None
+        env_vars: Dict[str, str] | None = None, on_mount: Callable[[str], None] | None = None,
+        timeout_seconds: int | None = None
     ) -> SandboxHandle:
         script_name = os.path.basename(script_path)
-        sandbox = python_sandbox(runtime="container", env={**env_vars, "PYTHONUNBUFFERED": "1"},
+        sandbox = python_sandbox(runtime="container", env={**(env_vars or {}), "PYTHONUNBUFFERED": "1"},
                                   timeout_seconds=timeout_seconds or 3600)
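
A minimal sketch of the pitfall, independent of the Basilica SDK. The function names `shared_default` and `fresh_default` are hypothetical and exist only for illustration. Note that the code under review only reads `env_vars` through dict unpacking, so it does not appear to trigger the bug today; the change mainly guards against a later edit that mutates the argument.

```python
from typing import Dict, Optional


def shared_default(env_vars: Dict[str, str] = {}) -> Dict[str, str]:
    # The default dict is created once, when the function is defined,
    # and reused by every call that omits the argument.
    env_vars["PYTHONUNBUFFERED"] = "1"
    return env_vars


def fresh_default(env_vars: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    # A new dict is built on every call, so callers never share state.
    return {**(env_vars or {}), "PYTHONUNBUFFERED": "1"}


first = shared_default()
first["LEAKED"] = "yes"      # mutates the shared default object
second = shared_default()

same_object = second is first
leaked_into_second = "LEAKED" in second
leaked_into_fresh = "LEAKED" in fresh_default()
print(same_object, leaked_into_second, leaked_into_fresh)
```

The second call to `shared_default()` sees the key written by the first caller, while `fresh_default()` does not.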

49-55: Use context managers for file I/O to avoid resource leaks.

open(local).read() and open(script_path).read() leave file handles unclosed until garbage collection. In concurrent scenarios, this could exhaust file descriptors.

♻️ Proposed fix
                     local = os.path.join(root, f)
                     try:
-                        sandbox.files.write(f"/sandbox/{os.path.relpath(local, tmp)}", open(local).read())
+                        with open(local) as fh:
+                            sandbox.files.write(f"/sandbox/{os.path.relpath(local, tmp)}", fh.read())
                     except (UnicodeDecodeError, IOError):
                         pass
             shutil.rmtree(tmp, ignore_errors=True)
         
-        sandbox.files.write(f"/sandbox/{script_name}", open(script_path).read())
+        with open(script_path) as fh:
+            sandbox.files.write(f"/sandbox/{script_name}", fh.read())
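
For comparison, a small self-contained sketch of the two patterns, using a throwaway temp file rather than the real script path. It also shows `pathlib.Path.read_text`, a standard-library call that opens and closes the file itself and could be used in place of the `with` block; which form fits the module better is a style choice.

```python
import os
import tempfile
from pathlib import Path

# Throwaway file so the example is self-contained.
fd, path = tempfile.mkstemp(suffix=".py")
os.close(fd)
Path(path).write_text("print('hello')\n", encoding="utf-8")

# Pattern in the code under review: nothing closes the handle explicitly.
leaky = open(path, encoding="utf-8")
content_leaky = leaky.read()
still_open = not leaky.closed   # the handle is still open at this point
leaky.close()

# Pattern in the proposed fix: the handle is closed when the block exits.
with open(path, encoding="utf-8") as fh:
    content_with = fh.read()
closed_after_with = fh.closed

# Alternative: Path.read_text opens and closes the file internally.
content_path = Path(path).read_text(encoding="utf-8")

os.remove(path)
```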

81-87: Consider explicit None type hints for PEP 484 compliance.

The function works correctly, but explicit type hints improve clarity and static analysis.

♻️ Proposed fix
-def get_sandbox_manager(inference_gateway_url: str = None, backend: str = None):
+def get_sandbox_manager(inference_gateway_url: str | None = None, backend: str | None = None):
     """Factory: returns BasilicaSandboxManager or SandboxManager."""

26-31: Document or utilize the unused inference_gateway_url parameter.

The parameter exists for interface parity with SandboxManager but is silently ignored. Consider adding a docstring note or using it as a fallback for BASILICA_API_URL.

📝 Option: Add documentation
     def __init__(self, inference_gateway_url: str = None):
+        """Initialize BasilicaSandboxManager.
+        
+        Args:
+            inference_gateway_url: Unused; kept for interface parity with SandboxManager.
+                                   Basilica uses BASILICA_API_URL env var instead.
+        """
         api_url = os.environ.get("BASILICA_API_URL", "http://localhost:9080")
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 84b6a6a and 61d2eea.

📒 Files selected for processing (3)
  • evaluator/sandbox/basilica_sandbox_manager.py
  • pyproject.toml
  • test_basilica_sandbox.py
🧰 Additional context used
🧬 Code graph analysis (2)
evaluator/sandbox/basilica_sandbox_manager.py (2)
evaluator/models.py (2)
  • Sandbox (9-15)
  • SandboxResultWithLogs (27-28)
evaluator/sandbox/sandbox_manager.py (1)
  • SandboxManager (21-188)
test_basilica_sandbox.py (3)
evaluator/sandbox/basilica_sandbox_manager.py (3)
  • BasilicaSandboxManager (23-78)
  • initialize_sandbox (33-59)
  • run_sandbox (61-78)
models/problem.py (1)
  • ProblemTestResultStatus (18-21)
evaluator/problem_suites/problem_suite.py (2)
  • has_problem_name (36-37)
  • get_problem (39-40)
🪛 Ruff (0.14.10)
evaluator/sandbox/basilica_sandbox_manager.py

26-26: Unused method argument: inference_gateway_url

(ARG002)


26-26: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


30-30: Avoid specifying long messages outside the exception class

(TRY003)


35-35: Do not use mutable data structures for argument defaults

Replace with None; initialize within function

(B006)


35-35: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


36-36: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


69-69: Do not use bare except

(E722)


77-77: Multiple statements on one line (colon)

(E701)


78-78: Do not use bare except

(E722)


78-78: try-except-pass detected, consider logging the exception

(S110)


78-78: Multiple statements on one line (colon)

(E701)


81-81: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


81-81: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)

test_basilica_sandbox.py

34-34: Do not assign a lambda expression, use a def

Rewrite ok as a def

(E731)


35-35: Do not assign a lambda expression, use a def

Rewrite fail as a def

(E731)


36-36: Do not assign a lambda expression, use a def

Rewrite section as a def

(E731)


51-51: Consider moving this statement to an else block

(TRY300)


52-52: Do not catch blind exception: Exception

(BLE001)


53-53: Multiple statements on one line (semicolon)

(E702)


53-53: Multiple statements on one line (semicolon)

(E702)


62-62: Multiple statements on one line (semicolon)

(E702)


70-70: Consider moving this statement to an else block

(TRY300)


71-71: Do not catch blind exception: Exception

(BLE001)


72-72: Multiple statements on one line (semicolon)

(E702)


72-72: Multiple statements on one line (semicolon)

(E702)


81-81: Multiple statements on one line (semicolon)

(E702)


93-93: Consider moving this statement to an else block

(TRY300)


94-94: Do not catch blind exception: Exception

(BLE001)


95-95: Multiple statements on one line (semicolon)

(E702)


95-95: Multiple statements on one line (semicolon)

(E702)


110-110: Multiple statements on one line (colon)

(E701)


113-113: Multiple statements on one line (colon)

(E701)


122-122: Multiple statements on one line (colon)

(E701)


123-123: Do not catch blind exception: Exception

(BLE001)


125-125: Multiple statements on one line (semicolon)

(E702)


126-126: Multiple statements on one line (colon)

(E701)


149-149: Multiple statements on one line (semicolon)

(E702)


157-157: Multiple statements on one line (colon)

(E701)

🔇 Additional comments (7)
pyproject.toml (1)

28-30: LGTM - Clear setup instructions.

The comments provide helpful guidance for developers who need to set up the Basilica SDK locally. Consider adding these instructions to a README or CONTRIBUTING.md for better discoverability.

evaluator/sandbox/basilica_sandbox_manager.py (1)

15-20: LGTM - Clean dataclass design.

The SandboxHandle provides a clear abstraction for holding sandbox state between initialization and execution.

test_basilica_sandbox.py (5)

39-53: LGTM - Good SDK integration test.

The test covers connection, code execution, and file I/O with proper cleanup via context manager.


56-74: LGTM - Exercises manager end-to-end.

The test validates initialization, script execution, and result extraction. The finally block ensures cleanup of the temporary file.


77-95: LGTM - Good integration with problem suite.

The test validates the full evaluation pipeline using the existing POLYGLOT_PY_SUITE infrastructure.


98-138: LGTM - Solid concurrent test with proper synchronization.

The thread-safe stats collection and bounded thread pool are well-implemented. The 80% success threshold is a reasonable heuristic for integration tests.

One minor readability note: the list comprehension on line 101 is quite dense; consider extracting to a separate line for clarity.
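
For readers without the test file open, the pattern being praised can be sketched roughly as follows. This is an illustration under assumptions, not the code in test_basilica_sandbox.py: `run_concurrent`, `fake_eval`, and the default of 8 workers are hypothetical, and a plain function stands in for a real sandbox evaluation.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_concurrent(task, count, max_workers=8, threshold=0.8):
    """Run task(i) for i in range(count) on a bounded pool.

    Returns (passed, stats), where passed is True when the success
    ratio reaches the threshold.
    """
    stats = {"ok": 0, "failed": 0}
    lock = threading.Lock()

    def worker(i):
        try:
            succeeded = bool(task(i))
        except Exception:
            succeeded = False
        # Only one thread at a time may update the shared counters.
        with lock:
            stats["ok" if succeeded else "failed"] += 1

    # max_workers caps how many evaluations run at once.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(worker, i) for i in range(count)]
        for future in as_completed(futures):
            future.result()

    passed = count > 0 and stats["ok"] / count >= threshold
    return passed, stats


def fake_eval(i):
    # Hypothetical stand-in for a sandbox evaluation: every tenth run fails.
    return i % 10 != 0


passed, stats = run_concurrent(fake_eval, count=50)
print(passed, stats)
```

With 50 runs and every tenth one failing, 45 of 50 succeed, which clears the 80% bar.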


141-165: LGTM - Clean CLI implementation.

The Click-based CLI provides good options for full test runs, scaling, and verbosity control. The final summary and exit code handling are appropriate.

Comment on lines +49 to +52
try:
    sandbox.files.write(f"/sandbox/{os.path.relpath(local, tmp)}", open(local).read())
except (UnicodeDecodeError, IOError):
    pass

⚠️ Potential issue | 🟡 Minor

Silent failure on binary/unreadable files may hide issues.

Silently passing on UnicodeDecodeError/IOError means binary files or files with encoding issues won't be mounted, and users won't know. Consider logging a warning or supporting binary file uploads.

🔧 Option: Log skipped files
                     try:
-                        sandbox.files.write(f"/sandbox/{os.path.relpath(local, tmp)}", open(local).read())
+                        with open(local) as fh:
+                            sandbox.files.write(f"/sandbox/{os.path.relpath(local, tmp)}", fh.read())
                     except (UnicodeDecodeError, IOError):
-                        pass
+                        # Consider logging: logger.warning(f"Skipped non-text file: {local}")
+                        pass
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
try:
    sandbox.files.write(f"/sandbox/{os.path.relpath(local, tmp)}", open(local).read())
except (UnicodeDecodeError, IOError):
    pass
try:
    with open(local) as fh:
        sandbox.files.write(f"/sandbox/{os.path.relpath(local, tmp)}", fh.read())
except (UnicodeDecodeError, IOError):
    # Consider logging: logger.warning(f"Skipped non-text file: {local}")
    pass
🤖 Prompt for AI Agents
In @evaluator/sandbox/basilica_sandbox_manager.py around lines 49 - 52, The
current try/except around sandbox.files.write (using open(local).read() and
os.path.relpath) silently drops files on UnicodeDecodeError/IOError; update the
logic in basilica_sandbox_manager.py to detect and handle binary/unreadable
files instead of passing: attempt a text read first but on UnicodeDecodeError
fall back to binary read and call sandbox.files.write with bytes if the API
supports it, and always log a warning (including the file path from
os.path.relpath(local, tmp) and the exception) when a file is skipped or an IO
error occurs so users can see which files were not mounted.
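
One way the text-then-binary fallback described above might look, as a sketch only. `upload_file` is a hypothetical helper, and its `write` parameter is a stand-in for `sandbox.files.write`; whether the real SDK call accepts `bytes` is an assumption that would need checking before adopting this.

```python
import logging
import os
import shutil
import tempfile
from typing import Callable, Union

logger = logging.getLogger(__name__)


def upload_file(local: str, remote: str,
                write: Callable[[str, Union[str, bytes]], None]) -> bool:
    """Upload one file; return True if written, False if skipped.

    `write` is a hypothetical stand-in for sandbox.files.write.
    """
    try:
        try:
            # First attempt: treat the file as UTF-8 text.
            with open(local, encoding="utf-8") as fh:
                payload: Union[str, bytes] = fh.read()
        except UnicodeDecodeError:
            # Fallback: send the raw bytes (assumes the API supports bytes).
            with open(local, "rb") as fh:
                payload = fh.read()
        write(remote, payload)
        return True
    except OSError as exc:
        # Make skipped files visible instead of silently passing.
        logger.warning("Skipped %s: %s", remote, exc)
        return False


# Demo with an in-memory dict playing the role of the sandbox filesystem.
uploaded = {}
tmp = tempfile.mkdtemp()
text_path = os.path.join(tmp, "a.txt")
bin_path = os.path.join(tmp, "b.bin")
with open(text_path, "w", encoding="utf-8") as fh:
    fh.write("hello")
with open(bin_path, "wb") as fh:
    fh.write(b"\xff\xfe\x00")   # not valid UTF-8

results = [
    upload_file(p, f"/sandbox/{os.path.relpath(p, tmp)}", uploaded.__setitem__)
    for p in (text_path, bin_path, os.path.join(tmp, "missing.txt"))
]
shutil.rmtree(tmp, ignore_errors=True)
```

In this demo the text file is stored as `str`, the binary file as `bytes`, and the missing file is logged and skipped.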

Comment on lines +67 to +70
try:
    output = json.loads(handle.sandbox.files.read("/sandbox/output.json"))
except:
    output = {"success": False, "error": "Failed to read output.json"}

⚠️ Potential issue | 🟡 Minor

Avoid bare except: — catch specific exceptions.

Bare except: catches SystemExit, KeyboardInterrupt, and other exceptions that typically shouldn't be swallowed. This also masks the actual failure reason when output.json is missing or malformed.

🐛 Proposed fix
             try:
                 output = json.loads(handle.sandbox.files.read("/sandbox/output.json"))
-            except:
+            except Exception as e:
                 output = {"success": False, "error": "Failed to read output.json"}
+                # Consider logging e for debugging
🧰 Tools
🪛 Ruff (0.14.10)

69-69: Do not use bare except

(E722)
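
A short standalone demonstration of the difference, with no SDK involved. The helper names are hypothetical. `parse_output` additionally sketches how the failure reason could be kept in the error payload; it only covers the JSON parsing step, since the exceptions raised by the SDK's file read are not known here.

```python
import json


def swallows_everything(fn):
    try:
        fn()
    except:  # noqa: E722 - deliberately bare, to show the problem
        return "swallowed"
    return "ok"


def catches_exception_only(fn):
    try:
        fn()
    except Exception:
        return "handled"
    return "ok"


def interrupt():
    # Simulates the user pressing Ctrl+C while the call is running.
    raise KeyboardInterrupt


def parse_output(raw: str) -> dict:
    """Parse the contents of output.json, keeping the failure reason."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError as exc:
        return {"success": False, "error": f"Failed to parse output.json: {exc}"}


bare_result = swallows_everything(interrupt)

try:
    catches_exception_only(interrupt)
    propagated = False
except KeyboardInterrupt:
    propagated = True

print(bare_result, propagated)
```

The bare handler hides the interrupt, while `except Exception` lets it propagate because `KeyboardInterrupt` derives from `BaseException`, not `Exception`.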

Comment on lines +29 to +32
basilica.configure(
    api_url=os.environ.get("BASILICA_API_URL", "http://localhost:9080"),
    api_key=os.environ.get("BASILICA_API_TOKEN", ""),
)

⚠️ Potential issue | 🟡 Minor

Module-level basilica.configure with empty token may cause confusing failures.

If BASILICA_API_TOKEN is unset, this configures basilica with an empty API key at import time. The main() function checks for the token later (line 148), but tests imported without running main() would silently use an invalid configuration.

🐛 Option: Move configuration inside test functions or guard it
-basilica.configure(
-    api_url=os.environ.get("BASILICA_API_URL", "http://localhost:9080"),
-    api_key=os.environ.get("BASILICA_API_TOKEN", ""),
-)
+def _configure_basilica():
+    api_key = os.environ.get("BASILICA_API_TOKEN")
+    if not api_key:
+        raise ValueError("BASILICA_API_TOKEN required")
+    basilica.configure(
+        api_url=os.environ.get("BASILICA_API_URL", "http://localhost:9080"),
+        api_key=api_key,
+    )

Then call _configure_basilica() at the start of each test or in main().

🤖 Prompt for AI Agents
In @test_basilica_sandbox.py around lines 29 - 32, Module-level
basilica.configure is called with a possibly empty BASILICA_API_TOKEN at import
time, causing silent invalid configuration for tests; instead, create a helper
like _configure_basilica() that reads BASILICA_API_TOKEN and BASILICA_API_URL
and only calls basilica.configure when the token is present (or raise/exit if
required), then remove the top-level basilica.configure call and invoke
_configure_basilica() at the start of main() and at the start of each test that
needs Basilica so imports no longer configure with an empty token.
