
Phase D — migrate aegis-oss to @stackbilt/llm-providers (remove all bolted-in LLM logic) #24

@stackbilt-admin

Description


Summary

Multi-session epic to remove all bolted-in LLM inference logic from aegis-oss and the downstream AEGIS daemon, consuming `@stackbilt/llm-providers` as the canonical routing layer instead. Per the architectural rule, no Stackbilt repo (public or private) may contain its own bolted-in LLM inference, routing, failover, or provider-specific orchestration logic; llm-providers is the source of truth (SoT).

This is a four-session epic. aegis-oss is the contract repo per `project_dependency_model.md` — migration lands here first, then the daemon inherits via `@stackbilt/aegis-core`.

Current migration state (as of 2026-04-10, daemon v1.96.2)

Already on llm-providers (done):

  • `kernel/executors/cerebras.ts` — uses `CerebrasProvider`
  • `kernel/executors/groq.ts` — uses `GroqProvider` (both plain and tool-use variants)
  • `kernel/resilience.ts` — re-exports `CircuitBreakerManager`, `CostTracker`, `CreditLedger`, `ExhaustionRegistry` from llm-providers

Not yet migrated (the real Phase D scope):

  1. Anthropic — `web/src/claude.ts` / `executeClaudeChat` still uses raw Anthropic SDK. `AnthropicProvider` exists in llm-providers.
  2. Workers AI / GPT-OSS — `executeGptOss` / `executeWorkersAi` use raw `env.ai?.run()`. `CloudflareProvider` exists in llm-providers.
  3. Dispatch routing policy — the daemon downstream has a Cerebras remap (`if plan.executor === 'claude' → cerebras_mid`) that intercepts semantic executor names inside the dispatch switch. This is policy, not LLM logic, and should become a thin routing adapter above the llm-providers factory instead of an intercept inside the executor switch.

Architectural concern: executor naming abstraction

Current dispatch uses semantic executor names (`claude_opus`, `cerebras_reasoning`, `cerebras_mid`, `claude_code`) that encode strategy + tier + capability. llm-providers routes by provider + model + fallback chain. Phase D needs a lightweight routing policy adapter in aegis-oss that maps semantic names → (provider, model, fallback chain) tuples. This adapter becomes part of the canonical contract; the daemon inherits it and can optionally supply custom presets for daemon-specific strategies.

This is the design work that makes Phase D non-mechanical.
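
The adapter described above can be sketched as a plain lookup from semantic executor names to routing tuples. Everything below — `RoutingTarget`, the preset table, the provider and model strings — is an illustrative assumption, not the actual llm-providers API:

```typescript
// Sketch of the executor routing adapter: semantic names → routing tuples.
// RoutingTarget and all preset values are hypothetical; the real contract
// would live in aegis-oss and feed the llm-providers factory.
interface RoutingTarget {
  provider: string;   // llm-providers provider id, e.g. "anthropic" (assumed)
  model: string;      // provider-specific model id (placeholder values)
  fallback: string[]; // ordered fallback chain of semantic executor names
}

const DEFAULT_PRESETS: Record<string, RoutingTarget> = {
  claude_opus:        { provider: "anthropic", model: "claude-opus-model",  fallback: ["cerebras_reasoning"] },
  cerebras_reasoning: { provider: "cerebras",  model: "reasoning-model",    fallback: ["cerebras_mid"] },
  cerebras_mid:       { provider: "cerebras",  model: "mid-tier-model",     fallback: [] },
};

// Resolve a semantic executor name to its routing target, throwing on
// unknown names so dispatch misconfiguration surfaces early.
function resolveExecutor(
  name: string,
  presets: Record<string, RoutingTarget> = DEFAULT_PRESETS,
): RoutingTarget {
  const target = presets[name];
  if (!target) throw new Error(`Unknown executor: ${name}`);
  return target;
}
```

With this shape, the daemon inherits `DEFAULT_PRESETS` and can pass its own table to `resolveExecutor`, replacing the in-switch Cerebras remap with data rather than control flow.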

Dependencies on llm-providers

  • `llm-providers#26` — streaming support; Session D.1 streaming preservation is gated on it.

Dependencies on edge-auth

  • `Stackbilt-dev/edge-auth#82` — canonical `ResourceQuotaProvider` contract. Independent of Phase D execution, but the QuotaHook wiring in Session D.3+ will consume it.

Plan

Session D.1 — aegis-oss: routing adapter + Anthropic migration

  • Design the executor routing adapter (semantic names → provider/model/fallback tuples)
  • Port `executeClaudeChat` to use `AnthropicProvider` from llm-providers
  • Preserve: MCP integration, streaming (gated on llm-providers#26), cost tracking, circuit breakers
  • Update canonical dispatch tests to pass against the new adapter
  • Publish `@stackbilt/aegis-core` with the migrated code

Session D.2 — aegis-oss: Workers AI migration

  • Port `executeGptOss` / `executeWorkersAi` to `CloudflareProvider`
  • Retire raw `env.ai?.run()` usage
  • Update tests
  • Publish new `@stackbilt/aegis-core` version
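
The port above can be reduced to a dependency-injection seam: the executor stops touching `env.ai?.run()` and only sees a provider abstraction. The `ChatProvider` interface and model id below are hypothetical stand-ins; the real `CloudflareProvider` API in `@stackbilt/llm-providers` may differ:

```typescript
// Hypothetical shape of a llm-providers chat provider; the real
// CloudflareProvider contract may expose a different method signature.
interface ChatProvider {
  chat(opts: { model: string; prompt: string }): Promise<string>;
}

// Before: executeWorkersAi called env.ai?.run() directly.
// After: it only depends on the provider abstraction, so circuit
// breakers, cost tracking, and failover stay inside llm-providers.
async function executeWorkersAi(
  provider: ChatProvider,
  prompt: string,
  model = "@cf/openai/gpt-oss-120b", // illustrative model id
): Promise<string> {
  return provider.chat({ model, prompt });
}
```

A stubbed `ChatProvider` also makes the canonical dispatch tests trivial to run without a live binding.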

Session D.3 — daemon (Stackbilt-dev/aegis): inherit via dependency model

  • Remove daemon's Cerebras remap intercept
  • Delete daemon's `web/src/claude.ts` and `web/src/kernel/executors/workers-ai.ts`
  • Keep daemon-specific Cerebras tier presets if they differ from aegis-oss defaults (inject as custom presets on the factory)
  • Restore canonical dispatch tests — they should start passing once the bolted-in logic is gone
  • Bump daemon to consume the new `@stackbilt/aegis-core`
  • Deploy
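
The "inject as custom presets" bullet above can be as small as a spread-merge over the canonical defaults, so the daemon declares only what differs. All names and model ids here are illustrative:

```typescript
// Daemon-side preset override: start from the canonical aegis-oss
// defaults and overwrite only daemon-specific tiers. Values are placeholders.
type Presets = Record<string, { provider: string; model: string }>;

const defaults: Presets = {
  cerebras_mid: { provider: "cerebras",  model: "mid-tier-model" },
  claude_opus:  { provider: "anthropic", model: "claude-opus-model" },
};

// Object spread: later keys win, so overrides shadow defaults per entry.
function withOverrides(base: Presets, overrides: Presets): Presets {
  return { ...base, ...overrides };
}

const daemonPresets = withOverrides(defaults, {
  cerebras_mid: { provider: "cerebras", model: "daemon-tuned-mid-model" },
});
```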

Session D.4 — validation + policy enforcement

  • Integration test end-to-end: chat streaming, tool-use, failover scenarios
  • Grep across aegis-oss + daemon + foodfiles + img-forge for direct `@anthropic-ai/sdk` / `groq-sdk` / raw `env.ai?.run()` imports — delete any that remain
  • Add a lint rule (or CI check) that fails on raw provider SDK imports outside of `@stackbilt/llm-providers`
  • Close this epic
  • Close corresponding daemon kernel shadow (`dispatch.ts +366L` in `project_daemon_kernel_shadow`)
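
The lint/CI check could start as a simple source scan before graduating to an ESLint `no-restricted-imports` rule. The pattern list below mirrors the grep targets named above; the function name is a placeholder:

```typescript
// Minimal CI check: flag raw provider SDK usage outside
// @stackbilt/llm-providers. Patterns mirror the Session D.4 grep targets.
const FORBIDDEN: RegExp[] = [
  /from\s+["']@anthropic-ai\/sdk["']/, // raw Anthropic SDK import
  /from\s+["']groq-sdk["']/,           // raw Groq SDK import
  /env\.ai\?\.run\(/,                  // raw Workers AI binding call
];

// Returns the offending pattern sources found in one file's text;
// a CI wrapper would walk the repo and fail on any non-empty result.
function findForbiddenImports(source: string): string[] {
  return FORBIDDEN.filter((re) => re.test(source)).map((re) => re.source);
}
```

Wiring this into CI (walk the tree, skip `node_modules` and the llm-providers package itself, exit nonzero on a match) keeps the architectural rule enforced after Phase D closes.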

Related daemon work

  • 1.96.1 (2026-04-10) — Phase A kernel shadow cleanup. Closed 4 of ~10 shadows. Phase D.3 closes the dispatch.ts shadow as a side effect.
  • 1.96.2 (2026-04-10) — AI Gateway account-ID shadow collapse (5th shadow). Fixed a latent bug where the wrong CF account ID was hardcoded; now pulls from `env.CF_ACCOUNT_ID` inherited via Phase A.

Internal memory references (AEGIS context)

  • `project_phase_d_llm_providers.md` — full scoping with gap analysis and 4-session breakdown
  • `project_resource_quota_seam.md` — fractal multi-tenant quota architecture
  • `project_daemon_kernel_shadow.md` — broader daemon→aegis-oss shadow cleanup context
  • `feedback_no_bolted_llm_logic.md` — the architectural rule triggering this work

Priority

Design-heavy, multi-session. Not a cc-taskrunner candidate — this needs dedicated Claude Code sessions for each phase. Track here, execute as sessions become available.

🤖 Filed by AEGIS during Phase D scoping session
