I built this to model a governed multi-agent control plane rather than another chatbot.
The key challenge I wanted to capture was the part that usually gets hidden in simple demos: how data, signals, decisions, constraints, evidence, and operating risk move through a system that someone else could inspect and run locally.
I intentionally kept this version local and synthetic because the goal is to make the architecture and tradeoffs reviewable without external services, private data, paid APIs, or cloud setup.
Enterprises will run many specialized AI agents, and the hard problem is safe orchestration: tools, permissions, handoffs, conflicts, approvals, traces, and auditability.
This matters because production teams do not only need outputs. They need evidence, ownership, repeatable validation, failure modes, and a path from local prototype to governed production system.
- multi-agent orchestration
- tool governance
- policy enforcement
- red-team evaluation
- trace-style observability
- human-in-the-loop workflow
- production-style data pipeline design
- synthetic but realistic data modeling
- scorecard generation
- API/dashboard serving
- testable architecture
- honest limitation framing
Synthetic enterprise tasks are routed to domain agents, checked against tool policies, handed off across specialists, red-teamed, traced, evaluated, and pushed through approval/audit workflows.
The important pattern is that inputs are not just transformed into outputs. They are turned into scored, documented artifacts that can be reviewed by operators, analysts, engineers, and business stakeholders.
- Synthetic data keeps the repo safe to run and share publicly.
- Deterministic local logic makes validation repeatable without paid APIs.
- DuckDB or local artifacts provide warehouse-style inspection without cloud setup.
- FastAPI shows how the system could be served as a service layer.
- Streamlit gives reviewers a fast way to inspect the outputs visually.
- Scorecards make quality, risk, reliability, or readiness measurable.
- Tests and Ruff keep the repo from being only documentation.
- Docker/CI files show the intended deployment shape without claiming production readiness.
See docs/design-decisions.md for the detailed tradeoff record.
Latest validation run: 2026-06-02.
- Pipeline: passed
- Pytest: passed (145 tests)
- Ruff: passed
- Repository quality docs check: passed
- Detailed command output is recorded in docs/validation-log.md.
- agent/tool registries
- task routing decisions
- policy decisions
- handoff graphs
- conflict reports
- red-team results
- traces
- approval queue
- scorecards
Recruiter / hiring manager:
- Read this README first.
- Review docs/recruiter-summary.md if present.
- Check docs/validation-log.md.
- Use docs/repo-review-guide.md for the quickest path.
Senior engineer:
- Review the architecture docs.
- Inspect the src/ modules.
- Inspect tests and generated scorecards.
- Read docs/design-decisions.md and docs/tradeoffs-and-simplifications.md.
Interview path:
- Run the pipeline command from the validation log.
- Launch the dashboard or API if this repo includes them.
- Explain one design decision and one simplification honestly.
- Synthetic data only.
- Local prototype rather than deployed production system.
- Deterministic rules or simulations where a production system may use live models, streaming data, or enterprise integrations.
- No real sensitive data is used.
- No authentication, RBAC, secrets management, or production security boundary unless explicitly stated elsewhere in the repo.
- External systems are simulated instead of connected live.
- connect real agent SDKs
- use OPA/RBAC
- export OpenTelemetry traces
- integrate Jira/ServiceNow approvals
- add secrets, auth, and production tool adapters
See docs/production-roadmap.md for the staged roadmap.
Agentic Enterprise Runtime is my flagship AI infrastructure project. It simulates a governed enterprise multi-agent runtime where specialized agents route tasks, request tools, hand off work, resolve conflicts, evaluate risk, trigger approval workflows, generate audit trails, and produce executive-ready decisions.
This is not a chatbot or single-agent demo. It is a control plane for enterprise AI agents.
Validation:
- 12 deterministic domain agents
- 41 governed tools
- 600 synthetic enterprise tasks
- 8 probability scenarios
- 145 tests passing
- Ruff checks passing
- FastAPI and Streamlit launched locally
- End-to-end runtime pipeline validated
- Flagship demo validated
This project simulates a future enterprise AI operating layer: a governed multi-agent runtime.
A basic AI app asks: "Can an assistant answer a question?"
This project asks: "Which agent should handle the task, which tools can it safely use, what happens if agents disagree, what is the probability of a bad outcome, and how do we audit every decision?"
Large enterprises are unlikely to rely on a single AI assistant. They are more likely to deploy specialized AI agents across finance, fraud, compliance, data engineering, customer support, supply chain, security, MLOps, analytics, and executive operations.
This runtime simulates how those agents can be coordinated safely: task routing, tool access governance, handoffs, confidence scoring, conflict resolution, probability-based simulation, approval workflows, audit trails, human escalation, and executive briefings.
Positioning: This project demonstrates future-facing AI infrastructure: governed multi-agent orchestration, policy-controlled tool access, probability-aware decision simulation, and audit-ready enterprise automation.
- Multi-agent orchestration across enterprise domains.
- Policy-controlled tool access with deterministic governance authority.
- Agent handoff design with evidence and acceptance records.
- Conflict arbitration for contradictory agent recommendations.
- Red-team testing for prompt injection, tool abuse, approval bypass, and unsafe actions.
- Trace-style observability for task, tool, handoff, guardrail, and decision spans.
- Repeatable evaluation harness for routing, policy, approvals, red-team detection, lineage, and audit completeness.
- Human-in-the-loop approval workflow with action escrow and SLA reporting.
- Audit-ready decision lineage and executive/operator briefings.
Enterprise AI adoption creates a new infrastructure challenge. When multiple AI agents can access tools, data, workflows, and business systems, companies must answer:
- Which agent is allowed to use which tool?
- Which actions require approval?
- What if one agent recommends approve and another recommends block?
- What if an agent is tricked by prompt injection?
- What if a support agent tries to access finance data?
- What if a fraud agent wants to freeze an account but confidence is low?
- What if two low-risk actions combine into a high-risk enterprise outcome?
- When should the system simulate before acting?
- When should it escalate to a human?
- How should every decision be logged, explained, and audited?
Without a governed runtime, multi-agent systems can become untraceable, over-permissioned, unsafe, inconsistent, expensive, difficult to audit, and risky for regulated workflows.
This repo does not require an LLM API. The deterministic runtime remains the system of record so reviewers can inspect orchestration, policy, safety, and audit logic directly. V0.2 adds optional live-agent adapters, but live recommendations are advisory only and cannot bypass policy enforcement, audit logging, approval workflow, or safety checks.
V0.2 turns the V0.1 governed deterministic runtime into a more production-shaped flagship system:
- Optional live-agent adapter with deterministic fallback by default.
- Trace-style observability for tasks, tools, handoffs, guardrails, approvals, and final decisions.
- Repeatable offline evaluation harness for routing, policy, handoffs, conflicts, approvals, red-team detection, lineage, and audit completeness.
- Red-team scenario pack covering prompt injection, tool abuse, approval bypass, memory contamination, confidence without evidence, and unsafe irreversible actions.
- Interactive approval workflow with decision history, action escrow status, and SLA reporting.
- Flagship demo mode for a support refund scenario with fraud review and prompt injection containment.
- V0.2 scorecards for runtime maturity, observability, evaluation, red-team, approvals, and live-agent readiness.
Deterministic mode is the default. No API key is required. Tests do not call external APIs. Live-agent mode is optional, disabled by default, and gracefully falls back when optional dependencies or credentials are missing.
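The advisory-only rule above can be sketched as follows. This is a minimal illustration, not the repo's adapter: `Recommendation`, `recommend`, `policy_allows`, `decide`, and the allow-list are hypothetical names and values chosen for the example. The point it shows is that a live agent's output is treated as a suggestion, any failure falls back to the deterministic path, and the policy check runs on every recommendation.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Recommendation:
    action: str
    source: str      # "deterministic" or "live"
    advisory: bool   # live output never carries authority on its own


def deterministic_recommend(task: str) -> Recommendation:
    # Hypothetical rule: refund tasks get reviewed, everything else is logged.
    action = "review_refund" if "refund" in task.lower() else "log_only"
    return Recommendation(action, source="deterministic", advisory=False)


def recommend(task: str,
              live_agent: Optional[Callable[[str], str]] = None) -> Recommendation:
    """Use the live agent if one is configured; otherwise fall back."""
    if live_agent is None:
        return deterministic_recommend(task)
    try:
        return Recommendation(live_agent(task), source="live", advisory=True)
    except Exception:
        # Missing dependency, missing credentials, timeout: fall back.
        return deterministic_recommend(task)


def policy_allows(action: str) -> bool:
    # Hypothetical allow-list owned by the deterministic policy layer.
    return action in {"review_refund", "log_only"}


def decide(task: str,
           live_agent: Optional[Callable[[str], str]] = None) -> str:
    rec = recommend(task, live_agent)
    # Policy runs on every recommendation, advisory or not.
    if not policy_allows(rec.action):
        return deterministic_recommend(task).action
    return rec.action
```

In this sketch, a live agent that suggests `freeze_account` for a refund task is overruled, and `decide` returns the deterministic `review_refund` instead. Reverting to the deterministic action on a policy rejection is one possible design; blocking or escalating would be equally valid.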
V0.3 adds the public launch layer around the runtime:
- screenshot capture guides
- five-minute demo script
- ten-minute technical walkthrough
- recruiter, executive, technical reviewer, and architecture one-pagers
- interview talking points and STAR stories
- technical blog draft
- LinkedIn launch sequence
- resume bullets and ATS keywords
- GitHub profile/release setup notes
- portfolio landing repo update snippet
flowchart LR
A["Synthetic Domain Data"] --> B["Enterprise Tasks"]
C["Agent Registry"] --> D["Task Router"]
E["Tool Registry"] --> F["Tool Policy Engine"]
B --> D
D --> G["Agent Recommendations"]
G --> F
F --> H["Tool Execution Simulation"]
G --> I["Handoff Engine"]
G --> J["Conflict Detector"]
H --> K["Probability Simulator"]
K --> L["Safety Gates"]
L --> M["Decision Engine"]
M --> N["Approval Queue / Action Escrow"]
M --> O["Audit Trail"]
M --> P["Shared Memory"]
M --> Q["Briefings + Scorecards"]
Q --> R["DuckDB"]
R --> S["FastAPI"]
R --> T["Streamlit"]
M --> U["Trace Recorder"]
U --> V["Evaluation Harness"]
V --> W["Red-Team Reports"]
flowchart TD
A["Task Arrives"] --> B["Classify Intent + Domain"]
B --> C["Select Primary Agent"]
C --> D["Recommend Tools"]
D --> E["Evaluate Policies"]
E --> F{"Allowed?"}
F -- "yes" --> G["Simulate Tool Result"]
F -- "no" --> H["Block or Escalate"]
G --> I["Score Confidence + Risk"]
I --> J["Final Decision"]
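The routing flow above can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the keyword map, the agent and tool names, and the `route_task` function are hypothetical and far smaller than the repo's registries. Confidence and risk scoring are left out to keep the sketch short.

```python
from dataclasses import dataclass

# Hypothetical keyword map and registries, for illustration only.
DOMAIN_KEYWORDS = {"refund": "support", "invoice": "finance", "transaction": "fraud"}
AGENT_FOR_DOMAIN = {
    "support": "support_agent",
    "finance": "finance_agent",
    "fraud": "fraud_agent",
}
TOOLS_FOR_AGENT = {
    "support_agent": ["lookup_order", "issue_refund"],
    "finance_agent": ["read_ledger"],
    "fraud_agent": ["score_transaction"],
}
BLOCKED_TOOLS = {"issue_refund"}  # assumed to need approval before use


@dataclass
class RoutingResult:
    domain: str
    agent: str
    allowed_tools: list
    blocked_tools: list
    decision: str


def classify_domain(text: str) -> str:
    lowered = text.lower()
    for keyword, domain in DOMAIN_KEYWORDS.items():
        if keyword in lowered:
            return domain
    return "support"  # default domain when no keyword matches


def route_task(text: str) -> RoutingResult:
    domain = classify_domain(text)           # classify intent + domain
    agent = AGENT_FOR_DOMAIN[domain]         # select primary agent
    requested = TOOLS_FOR_AGENT[agent]       # recommend tools
    allowed = [t for t in requested if t not in BLOCKED_TOOLS]
    blocked = [t for t in requested if t in BLOCKED_TOOLS]
    # Any blocked tool turns the final decision into an escalation.
    decision = "escalate" if blocked else "proceed"
    return RoutingResult(domain, agent, allowed, blocked, decision)
```

With these assumed registries, a refund request is routed to `support_agent` and escalated because `issue_refund` is blocked, while an invoice question goes to `finance_agent` and proceeds.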
flowchart TD
A["Agent Tool Request"] --> B["Allowed Tool Check"]
B --> C["Domain Boundary Check"]
C --> D["Approval Policy"]
D --> E["Prompt Attack Check"]
E --> F{"Safe?"}
F -- "yes" --> G["Execute / Shadow Execute"]
F -- "no" --> H["Block + Audit"]
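A minimal sketch of the policy checks in the diagram above, assuming small hypothetical registries. The tool names, attack markers, and the `evaluate_tool_request` function are illustrative and not taken from the repo. Mapping approval-required tools to shadow execution is also an assumption about how the "Execute / Shadow Execute" branch might be split.

```python
from dataclasses import dataclass, field

# Hypothetical registry entries, for illustration only.
TOOL_DOMAIN = {
    "read_ledger": "finance",
    "issue_refund": "support",
    "freeze_account": "fraud",
}
AGENT_TOOLS = {
    "support_agent": {"issue_refund", "read_ledger"},
    "fraud_agent": {"freeze_account"},
}
AGENT_DOMAIN = {"support_agent": "support", "fraud_agent": "fraud"}
NEEDS_APPROVAL = {"freeze_account"}
ATTACK_MARKERS = ("ignore previous instructions", "disable policy")


@dataclass
class PolicyDecision:
    outcome: str  # "execute", "shadow_execute", or "block"
    reasons: list = field(default_factory=list)


def evaluate_tool_request(agent: str, tool: str, prompt: str) -> PolicyDecision:
    reasons = []
    # Allowed tool check.
    if tool not in AGENT_TOOLS.get(agent, set()):
        reasons.append("tool_not_allowed_for_agent")
    # Domain boundary check.
    if TOOL_DOMAIN.get(tool) != AGENT_DOMAIN.get(agent):
        reasons.append("domain_boundary_violation")
    # Prompt attack check (naive substring match, for illustration).
    if any(marker in prompt.lower() for marker in ATTACK_MARKERS):
        reasons.append("prompt_attack_suspected")
    if reasons:
        # Block, and keep the reasons so they can be written to the audit log.
        return PolicyDecision("block", reasons)
    # Approval policy: approval-required tools only run in shadow mode.
    if tool in NEEDS_APPROVAL:
        return PolicyDecision("shadow_execute", ["approval_required"])
    return PolicyDecision("execute")
```

In this sketch a support agent asking for `read_ledger` is blocked on the domain boundary even though the tool appears in its allow-list, which mirrors the "support agent tries to access finance data" question raised earlier.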
flowchart LR
A["Primary Agent"] --> B{"Needs Specialist?"}
B -- "yes" --> C["Secondary Agent"]
C --> D["Accepted Handoff"]
D --> E["Combined Recommendation"]
B -- "no" --> F["Continue"]
flowchart TD
A["Agent Recommendations"] --> B{"Contradiction?"}
B -- "yes" --> C["Arbitration Engine"]
C --> D["Authority + Evidence + Risk"]
D --> E["Winning Recommendation"]
E --> F{"Human Review?"}
F --> G["Decision"]
B -- "no" --> G
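One way to picture the arbitration step is a weighted score over authority, evidence, and risk. The weights, the review margin, and the `AgentRecommendation` and `arbitrate` names below are assumptions for illustration, not the repo's actual scoring.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentRecommendation:
    agent: str
    action: str       # e.g. "approve" or "block"
    authority: float  # 0..1, how much say the agent has in this domain
    evidence: float   # 0..1, strength of supporting evidence
    risk: float       # 0..1, risk of acting on this recommendation


def score(rec: AgentRecommendation) -> float:
    # Assumed weights; higher risk lowers the score.
    return 0.4 * rec.authority + 0.4 * rec.evidence - 0.2 * rec.risk


def arbitrate(recs: list, review_margin: float = 0.1) -> dict:
    ranked = sorted(recs, key=score, reverse=True)
    winner = ranked[0]
    if len({r.action for r in recs}) == 1:
        # No contradiction: every agent recommends the same action.
        return {"action": winner.action, "conflict": False, "human_review": False}
    # Compare the winner with the best recommendation for a different action.
    runner_up = next(r for r in ranked if r.action != winner.action)
    margin = score(winner) - score(runner_up)
    return {
        "action": winner.action,
        "conflict": True,
        "winner": winner.agent,
        # A narrow win is sent to a human instead of being trusted.
        "human_review": margin < review_margin,
    }
```

Sending narrow wins to human review is a deliberate choice in this sketch: when two agents disagree and their scores are close, the arbitration result is treated as a recommendation to a reviewer, not a final decision.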
flowchart LR
A["Scenario Assumptions"] --> B["Cost of Action"]
A --> C["Cost of Inaction"]
A --> D["False Positive / Negative"]
B --> E["Expected Outcome"]
C --> E
D --> E
E --> F["Safe-to-Execute Decision"]
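The expected-outcome step can be illustrated with a simple expected-cost comparison. The formula and field names below are an assumed model chosen for the example, not the repo's simulator: acting costs a fixed amount plus a false-positive penalty when the issue was not real, while waiting costs the inaction loss when the issue was real.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    p_issue_real: float         # probability the flagged issue is genuine
    cost_of_action: float       # fixed cost of acting (ops time, friction)
    false_positive_cost: float  # extra cost if we act and the issue was not real
    cost_of_inaction: float     # loss if we do nothing and the issue was real


def expected_cost_act(s: Scenario) -> float:
    # Acting always pays the fixed cost, plus the false-positive penalty
    # weighted by the chance the issue was not real.
    return s.cost_of_action + (1 - s.p_issue_real) * s.false_positive_cost


def expected_cost_wait(s: Scenario) -> float:
    # Waiting only costs something when the issue turns out to be real
    # (a false negative).
    return s.p_issue_real * s.cost_of_inaction


def safe_to_execute(s: Scenario) -> bool:
    # Act only when acting is cheaper in expectation than waiting.
    return expected_cost_act(s) < expected_cost_wait(s)
```

For example, with an 80% chance the issue is real, an action cost of 50, a false-positive cost of 400, and an inaction cost of 1000, acting costs about 130 in expectation against 800 for waiting, so the action passes. At a 10% chance, the same costs give 410 against 100, and it does not.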
| Domain | Scenario examples |
|---|---|
| Finance | revenue anomaly investigation, invoice collection risk, margin impact simulation |
| Fraud and Payments | suspicious transaction review, account freeze recommendation, false-positive risk |
| Healthcare / Insurance Operations | claim triage, missing documentation, policy eligibility review |
| Retail / Supply Chain | inventory reroute, stockout prevention, supplier delay response |
| Customer Support | refund recommendation, churn escalation, sensitive-data masking |
| Data Platform Reliability | pipeline incident triage, SLA breach handling, backfill recommendation |
| AI Governance and Security | prompt injection detection, unsafe tool blocking, policy escalation |
| Executive Operations | cross-domain briefing, decision memo, risk summary consolidation |
- Action Escrow: high-impact actions are staged for approval.
- Shadow Mode Simulation: risky actions are simulated before execution.
- Agent Quorum: multiple agents must agree for high-risk actions.
- Tool Least Privilege: agents can only use tools required for domain and task.
- Confidence-Risk Gate: high confidence cannot override governance safety.
- Contradiction Arbitration: conflicts are resolved using evidence, authority, and risk.
- Prompt Attack Containment: malicious prompts are isolated before tool execution.
- Reversible Action Preference: moderate confidence prefers reversible actions.
- Human Escalation Threshold: high risk requires human review.
- Decision Lineage: recommendations link back to agents, tools, policies, evidence, and assumptions.
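Three of the patterns above (Confidence-Risk Gate, Action Escrow, and Reversible Action Preference) can be combined in one small gate. This is a sketch under assumed thresholds; `ProposedAction`, `Escrow`, `gate`, and the default values are hypothetical and not the repo's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class ProposedAction:
    name: str
    confidence: float  # 0..1
    risk: float        # 0..1
    reversible: bool


@dataclass
class Escrow:
    staged: list = field(default_factory=list)

    def stage(self, action: ProposedAction) -> None:
        # Held here until a human approves or rejects the action.
        self.staged.append(action)


def gate(action: ProposedAction, escrow: Escrow,
         risk_threshold: float = 0.7, confidence_floor: float = 0.6) -> str:
    # High risk always goes to a human, however confident the agent is.
    if action.risk >= risk_threshold:
        escrow.stage(action)
        return "escrow_for_human_review"
    # Below the confidence floor, only reversible actions run directly.
    if action.confidence < confidence_floor and not action.reversible:
        escrow.stage(action)
        return "escrow_for_human_review"
    return "execute"
```

The order of the checks matters: the risk check comes first, so a 0.99-confidence account freeze with high risk is still staged in escrow.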
- runtime_health_scorecard.json/csv
- agent_performance_report.json/csv
- tool_usage_report.json/csv
- governance_compliance_report.json/csv
- probability_simulation_report.json/csv
- agent_safety_report.json/csv
- handoff_quality_report.json/csv
- conflict_resolution_report.json/csv
- v02_runtime_upgrade_summary.json/csv
- red_team_scorecard.json/csv
- evaluation_scorecard.json/csv
- trace_observability_scorecard.json/csv
- approval_workflow_scorecard.json/csv
- live_agent_adapter_scorecard.json/csv
Run the flagship multi-agent scenario:
python -m src.demo.run_flagship_demo
Scenario: support_refund_with_fraud_and_prompt_injection
The demo routes a refund task through support, fraud, security, governance, and executive agents. It detects prompt injection, blocks direct account freeze, stages the action for human approval, and writes trace, audit, briefing, and scorecard evidence.
Flow: support refund request -> fraud review -> governance review -> security detects prompt injection -> conflict arbitration -> approval queue -> executive briefing -> traces/audit/scorecards.
Start with docs/screenshots/README.md.
Recommended captures:
- Executive Overview
- Tool Governance
- Multi-Agent Handoffs
- Red-Team Results
- Trace Explorer
- Evaluation Harness
- Approval Queue
- Flagship Demo Summary
- Scorecards
Recruiter path:
- README executive summary
- screenshot guide
- recruiter one-pager
- LinkedIn flagship post
Senior engineer path:
- architecture one-pager
- technical deep dive
- V0.2 design docs
- evaluation and red-team scorecards
AI platform path:
- live-agent adapter design
- tool governance design
- tracing and evaluation docs
- approval workflow and red-team docs
python -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install -r requirements.txt
python -m src.data_generation.generate_domain_data
python -m src.data_generation.generate_tasks
python -m src.data_generation.generate_probability_scenarios
python -m src.pipeline.run_all
python -m src.demo.run_flagship_demo
python -m pytest
python -m ruff check .
uvicorn src.api.main:app --reload
Endpoints include /health, /runtime-summary, /agents, /tools, /tasks, /decisions, /handoffs, /conflicts, /approval-queue, /approval-history, /audit-log, /scorecards, /briefings, /traces, /traces/summary, /evaluations, /red-team-scenarios, /red-team-results, /live-agent-status, /flagship-demo-summary, /v02-summary, /route-task, /evaluate-tool-access, /simulate-scenario, /resolve-conflict, /submit-approval-decision, /run-flagship-demo, /run-red-team-scenario, and /run-evaluation-suite.
streamlit run src/dashboard/app.py
Dashboard sections include Executive Overview, V0.2 Runtime Upgrade Overview, Trace Explorer, Evaluation Harness, Red-Team Scenario Results, Live Agent Adapter Status, Approval Workflow, Flagship Demo, Guardrail / Policy Span View, Decision Lineage Completeness, Runtime Regression Summary, Agent Registry, Tool Registry, Task Routing, Multi-Agent Handoffs, Tool Governance, Probability Scenarios, Agent Conflicts, Human Approval Queue, Safety Incidents, Audit Trail, Runtime Scorecards, and Executive Briefings.
Current validation target:
- domain data generation passes
- task generation passes
- probability scenario generation passes
- full pipeline passes
- flagship demo command passes
- 145 tests pass
- ruff passes
- API and dashboard launch locally
- synthetic data only
- deterministic agents instead of live LLM agents
- optional live-agent adapter is conceptual unless a live framework dependency and credentials are supplied
- local DuckDB instead of enterprise warehouse
- simulated tools instead of real business systems
- no cloud deployment
- no authentication
- no real identity provider
- no live approval system
- no real OpenAI/Anthropic/LangGraph/LlamaIndex integration yet
- no external API calls in tests
- not production security software
This is a portfolio-grade simulation, not production security software.
- real OpenAI Agents SDK adapter implementation
- LangGraph multi-agent workflow
- OpenTelemetry collector export
- LlamaIndex tool orchestration
- AutoGen/CrewAI comparison
- OpenPolicyAgent policy engine
- vector/RAG tool integration
- real identity provider integration
- Slack/Jira/ServiceNow approvals
- real business tool adapters
- Kafka event streaming
- Snowflake/Databricks deployment
- observability with OpenTelemetry
- cloud deployment
- role-based access control
I added a small readiness scorecard so the production roadmap is not just prose. The check reads config/future_enhancements.json, verifies the repo has the expected roadmap/review artifacts, and writes:
- data/scorecards/future_enhancement_readiness.json
- data/scorecards/future_enhancement_readiness.csv
Run it with:
python scripts/generate_future_enhancement_scorecard.py
This is a local planning signal, not a claim that the repository is production-ready.
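As a rough picture of what a readiness check like this might do, the sketch below reads a config dictionary, checks whether each listed artifact exists, and renders the result as JSON and CSV text. The config schema (`enhancements`, `name`, `artifacts`) and the function names are assumptions made for the example; the actual layout of config/future_enhancements.json and the script's internals are not shown in this README.

```python
import csv
import io
import json
from pathlib import Path


def build_rows(config: dict, repo_root: Path) -> list:
    """Return one row per enhancement, noting any expected artifacts that are missing."""
    rows = []
    for item in config.get("enhancements", []):  # assumed schema
        missing = [
            path for path in item.get("artifacts", [])
            if not (repo_root / path).exists()
        ]
        rows.append({
            "name": item["name"],
            "ready": not missing,
            "missing": ";".join(missing),
        })
    return rows


def to_json(rows: list) -> str:
    return json.dumps(rows, indent=2)


def to_csv(rows: list) -> str:
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["name", "ready", "missing"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

The two strings could then be written out with `Path.write_text`, for example to the two scorecard paths listed above.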