Add AWS production architecture planning doc #70

Open
amalbet wants to merge 4 commits into main from docs/aws-production-architecture

Conversation

amalbet commented Apr 16, 2026


Summary

Planning document for the target AWS production deployment architecture. Covers OpenEMS Backend + MBE on AWS with OpenVPN for Raspberry Pi edge devices at microgrid sites.

This is a draft for team review — we're iterating on the architecture over the next few days before committing to specific tickets. Please leave line-by-line comments on anything that needs refinement.

What's covered

  • Current state: existing Terraform IaC, CI/CD, gaps to address
  • Target architecture: VPC layout, OpenVPN server (EC2), ECS Fargate for Backend + MBE
  • OpenVPN details: PKI structure, VPN subnet, certificate per edge site
  • Two edge types: simulation edges (Docker on ECS, testing) vs production edges (Raspberry Pi, OpenVPN tunnel)
  • Security model: network segmentation table, credential management, TLS coverage
  • Environments: dev / stage / prod isolation, promotion flow, Terraform workspaces
  • Implementation phases
  • 8 open questions for team input
  • Cost estimate (~$280/mo per environment)

Open questions (need team input)

  1. ECS Fargate vs EC2 for the backend stack (Fargate can't run the OpenVPN server, and InfluxDB would need EFS or another external volume for persistent storage)
  2. InfluxDB persistence strategy (EFS vs EC2 vs Timestream)
  3. Domain name strategy
  4. Pi provisioning process (manual vs SD card image)
  5. Monitoring (CloudWatch vs Grafana)
  6. Cost optimization (Reserved Instances?)
  7. OpenVPN HA (single point of failure acceptable for MVP?)
  8. MBE deployment target (ECS vs keep on Vercel with API Gateway)

Test plan

🤖 Generated with Claude Code

Documents the target deployment architecture for OpenEMS + MBE on AWS
with OpenVPN for Raspberry Pi edge devices. Covers VPC layout, security
model, environment strategy (dev/stage/prod), data flow, cost estimates,
and open questions for team iteration.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Alejandro Malbet <amalbet@gmail.com>

Guidelines for the IT admin to provision a dedicated dev AWS account
isolated from production, with safe scoping for Claude Code agent
sessions. Covers account structure, IAM strategy, Service Control
Policies, cost controls, and a provisioning checklist.

Complements the production architecture doc by establishing the
foundation needed before we can start provisioning dev infrastructure.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Alejandro Malbet <amalbet@gmail.com>

Split out to #72 so each doc can be reviewed independently.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Alejandro Malbet <amalbet@gmail.com>

tushabe left a comment

Thanks for documenting this. It gives me some visibility into the decisions made here and why they were made. I especially appreciate the cost estimate for running the entire stack, because it helps facilitate conversations around business feasibility post-pilot.

amalbet commented Apr 21, 2026

Author

Discussion: Do we need MBE in the same VPC as OpenEMS?

When we drafted this architecture, we defaulted to putting MBE (billing engine) in the same VPC as OpenEMS so the B2B REST communication stays on a private network. On reflection, we missed an important tradeoff discussion — we should consider whether MBE can stay on Vercel/Supabase (where it already works) and connect to OpenEMS over a secured public endpoint instead.

Why this matters

MBE is already deployed and working on Vercel + cloud Supabase. Moving it to ECS means:

  • Migrating off Vercel (new deployment pipeline, ALB, health checks, log management)
  • Either self-hosting Supabase on AWS (significant complexity — we've seen this firsthand) or continuing to use cloud Supabase from ECS (which means we're moving to AWS only for the OpenEMS connection)
  • Ongoing ops burden that Vercel currently handles for us (auto-scaling, TLS, CDN, zero-downtime deploys)
  • Losing the free tier: both Vercel and Supabase free tiers are sufficient for MBE's current and near-term usage. Moving to ECS adds ~$38-71/mo in AWS costs (itemized in the comparison footnotes) for infrastructure that Vercel provides at $0.

The only reason to move MBE into the VPC is to keep the OpenEMS B2B endpoint private. So the real question is: can we secure that connection without migrating MBE?

How MBE uses the OpenEMS APIs

MBE connects to the OpenEMS Backend B2B REST endpoint (port 8082, /jsonrpc) from Next.js API routes — server-side only. The browser never talks to OpenEMS directly. The calls are:

  • getEdgesStatus — check if edges are online/offline
  • getEdgeConfig — discover meters on an edge
  • queryHistoricTimeseriesEnergy — pull energy readings for billing
  • getEdgesChannelsValues — read current channel values

The data flowing through is energy consumption readings — not financial data, not PII, not credentials.
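
For reviewers less familiar with the B2B endpoint, here is a minimal sketch of what one of these calls looks like on the wire, assuming the endpoint accepts the standard JSON-RPC 2.0 envelope. The helper name `build_jsonrpc_request` is hypothetical, and the params are left empty because this thread does not show each method's parameter shape. It is written in Python for brevity; MBE's real calls live in Next.js API routes.

```python
import json
import uuid


def build_jsonrpc_request(method: str, params: dict) -> str:
    """Serialize one JSON-RPC 2.0 request body for the /jsonrpc endpoint."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": str(uuid.uuid4()),  # lets the caller match the response to this request
        "method": method,
        "params": params,  # shape depends on the method; not specified in this PR
    })


# Example: the edge online/offline check listed above, with placeholder params.
print(build_jsonrpc_request("getEdgesStatus", {}))
```

The same envelope would carry the other three methods; only `method` and `params` change.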


Three options

Option 1: MBE on Vercel + API Gateway (public endpoint)

Expose a single API Gateway endpoint in front of the OpenEMS B2B REST API.

MBE (Vercel, server-side) ──HTTPS + API key──▶ API Gateway (public)
                                                     │
                                              TLS termination
                                              API key validation
                                              Rate limiting
                                                     │
                                                     ▼
                                            OpenEMS Backend (private EC2)
                                              Basic auth on /jsonrpc

Security layers: TLS + API key + Basic auth + rate limiting + CloudWatch logging + optional WAF.

What stays private: OpenEMS UI (:4200), edge websocket (:8081), Odoo (:10016), InfluxDB (:8086), SSH/SSM — all IP-allowlisted or VPN-only. Only the API Gateway URL is public.

Risk profile:

  • The API Gateway endpoint is discoverable and scannable on the internet
  • If API key + Basic auth credentials leak, an attacker can query energy data
  • Mitigation: credentials are rotatable, API Gateway access logging lets us spot unusual patterns, and rate limiting slows brute-force attempts
  • Data sensitivity is low (kWh readings, edge status — not financial/PII)

Option 2: Move MBE to ECS (same VPC)

Move MBE off Vercel into ECS Fargate within the same VPC. OpenEMS B2B stays fully private.

Browser ──HTTPS──▶ ALB ──▶ MBE (ECS Fargate) ──private──▶ OpenEMS Backend
                                  │
                                  │ internet (HTTPS)
                                  ▼
                           Cloud Supabase (auth + DB)

Security: Zero public surface for OpenEMS. MBE-to-backend communication is private network only.

Cost: ~$38/mo (ALB + Fargate in a public subnet), rising to ~$71/mo if a NAT Gateway is needed for a private subnet. This replaces what Vercel and Supabase provide for free today. For a pilot-stage startup, this is meaningful spend on infrastructure that adds operational burden without adding user value.

Effort: ~3 days. No MBE code changes — Dockerfile already exists and works. All infrastructure: ECS task definition, ALB, ACM cert, CI/CD pipeline, CloudWatch logging, DNS cutover.

Ongoing ops: Every deploy, log investigation, scaling decision, and certificate renewal that Vercel handles today becomes our responsibility.


Option 3: Lambda VPC proxy (recommended middle ground)

Keep MBE on Vercel. Move only the OpenEMS API calls to a thin Lambda function inside the VPC. OpenEMS stays fully private — no public endpoint.

Browser ──HTTPS──▶ MBE (Vercel)
                      │
                      │ MBE API routes call the Lambda
                      ▼
              Lambda Function URL (HTTPS, IAM auth)
              ┌─────────────────────────────┐
              │  VPC private subnet         │
              │                             │
              │  Lambda (passthrough proxy) │
              │       │                     │
              │       │ private network     │
              │       ▼                     │
              │  OpenEMS Backend :8082      │
              └─────────────────────────────┘

How it works: The Lambda is a ~10-line function — no business logic, no dependencies, no state. It receives the JSON-RPC request from Vercel, forwards it to OpenEMS on the private IP, and returns the response. The billing logic, meter discovery, and kWh calculations stay in MBE on Vercel.
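
A minimal sketch of such a passthrough, not the code in the implementation PR. It assumes a Python runtime and the Function URL event shape (`body`, `isBase64Encoded`). The env var names `OPENEMS_URL` and `OPENEMS_BASIC_AUTH`, the default private IP, and the injectable `urlopen` parameter (there only so the handler can be exercised without a network) are all hypothetical.

```python
import base64
import json
import os
import urllib.request

# Hypothetical configuration; neither variable name appears in the PR.
OPENEMS_URL = os.environ.get("OPENEMS_URL", "http://10.0.1.10:8082/jsonrpc")
OPENEMS_BASIC_AUTH = os.environ.get("OPENEMS_BASIC_AUTH", "")  # "user:password"


def handler(event, context, urlopen=urllib.request.urlopen):
    """Forward the JSON-RPC body to OpenEMS unchanged and relay the reply."""
    body = event.get("body") or ""
    if event.get("isBase64Encoded"):
        payload = base64.b64decode(body)
    else:
        payload = body.encode("utf-8")

    headers = {"Content-Type": "application/json"}
    if OPENEMS_BASIC_AUTH:
        token = base64.b64encode(OPENEMS_BASIC_AUTH.encode("utf-8")).decode("ascii")
        headers["Authorization"] = f"Basic {token}"

    request = urllib.request.Request(
        OPENEMS_URL, data=payload, headers=headers, method="POST"
    )
    # Non-2xx replies raise urllib.error.HTTPError; this sketch lets them propagate.
    with urlopen(request, timeout=10) as response:
        return {
            "statusCode": response.status,
            "headers": {"Content-Type": "application/json"},
            "body": response.read().decode("utf-8"),
        }
```

The handler never parses the JSON-RPC payload, which is what keeps it free of business logic: any method MBE adds later passes through without a Lambda redeploy.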

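What changes in MBE: OPENEMS_B2B_URL points to the Lambda Function URL instead of directly to OpenEMS. Because the Function URL uses IAM auth, the outbound call from the API routes also has to be SigV4-signed, so Vercel needs the IAM access key as additional env vars and the HTTP call gains a signing step. The billing logic itself does not change.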

Security posture — better than Option 1:

| | Option 1 (API Gateway) | Option 3 (Lambda proxy) |
|---|---|---|
| OpenEMS public surface | API Gateway URL (discoverable, scannable) | None: OpenEMS is fully private |
| Auth to reach OpenEMS | API key + Basic auth (secrets that can leak) | IAM auth: the Lambda Function URL uses AWS SigV4. No shared secrets over the wire. Vercel calls with an IAM access key that only authorizes this specific Lambda. |
| Attack surface | Public HTTPS endpoint on the internet | Lambda Function URL can be scoped to IAM auth only, so it is not callable without valid AWS credentials. Not useful even if discovered. |
| Credential compromise impact | Attacker can query energy data from anywhere | Attacker needs AWS IAM credentials (not just an API key), and the Lambda still enforces Basic auth to OpenEMS internally |
| Network path | Internet → API Gateway → Backend | Internet → Lambda Function URL → private network → Backend |
| What's exposed if breached | B2B REST API (read-only energy data) | Same, but harder to breach (IAM >> API key) |

Why this is more secure than Option 1: The key difference is authentication strength. Option 1 uses an API key (a static string anyone can replay if leaked). Option 3 uses AWS IAM authentication (SigV4-signed requests): the secret key never travels over the wire, each signature is valid only for a short time window, and access is revocable per credential and logged in CloudTrail. The IAM access key held by Vercel is itself long-lived, so it still needs to be scoped to this one Lambda and rotated periodically. The OpenEMS backend itself remains on a private network with no public-facing port.
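
To make the "no shared secrets over the wire" point concrete, here is a sketch of SigV4 signing for a POST to a Function URL, using only the Python standard library. It assumes the request path is `/` with no query string and signs three headers. The function name `sigv4_headers` and the host in the usage line are hypothetical, and a real caller would more likely use an AWS SDK signer than hand-rolled code like this.

```python
import datetime
import hashlib
import hmac


def _hmac(key: bytes, msg: str) -> bytes:
    return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()


def sigv4_headers(host, body, access_key, secret_key, region, now, service="lambda"):
    """Return the headers for a SigV4-signed JSON POST to https://<host>/."""
    amz_date = now.strftime("%Y%m%dT%H%M%SZ")
    date_stamp = now.strftime("%Y%m%d")

    # Canonical request: method, path, query, headers, signed header list, body hash.
    canonical_headers = (
        f"content-type:application/json\nhost:{host}\nx-amz-date:{amz_date}\n"
    )
    signed_headers = "content-type;host;x-amz-date"
    canonical_request = "\n".join(
        ["POST", "/", "", canonical_headers, signed_headers,
         hashlib.sha256(body).hexdigest()]
    )

    scope = f"{date_stamp}/{region}/{service}/aws4_request"
    string_to_sign = "\n".join(
        ["AWS4-HMAC-SHA256", amz_date, scope,
         hashlib.sha256(canonical_request.encode("utf-8")).hexdigest()]
    )

    # The secret key only seeds this HMAC chain; it is never sent.
    key = _hmac(("AWS4" + secret_key).encode("utf-8"), date_stamp)
    for part in (region, service, "aws4_request"):
        key = _hmac(key, part)
    signature = hmac.new(key, string_to_sign.encode("utf-8"), hashlib.sha256).hexdigest()

    return {
        "Content-Type": "application/json",
        "X-Amz-Date": amz_date,
        "Authorization": (
            f"AWS4-HMAC-SHA256 Credential={access_key}/{scope}, "
            f"SignedHeaders={signed_headers}, Signature={signature}"
        ),
    }


# Usage with placeholder values (host and keys are made up for illustration).
headers = sigv4_headers(
    "example-id.lambda-url.us-east-1.on.aws", b"{}", "AKIDEXAMPLE", "not-a-real-secret",
    "us-east-1", datetime.datetime.now(datetime.timezone.utc),
)
```

Only the access key ID and the derived signature appear in the request. The timestamp is part of the signed material, which is why a captured request stops being replayable once the timestamp falls outside the accepted window.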

Cost: Essentially $0 at MBE's request volume (the Lambda free tier covers 1M requests/month), and attaching the Lambda to the VPC adds no charge of its own. Combined with the Vercel and Supabase free tiers that MBE already runs within, the entire MBE stack costs $0/mo, versus ~$38-71/mo in AWS infrastructure for Option 2.

Effort: ~1 day. Write the Lambda (10 lines), Terraform for Lambda + VPC config + Function URL + IAM role, update one env var in Vercel.

Stability: The Lambda is a stateless passthrough — no business logic, no dependencies, nothing to update. Deploy once, forget about it.


Comparison

| | Option 1: API Gateway | Option 2: MBE to ECS | Option 3: Lambda proxy |
|---|---|---|---|
| OpenEMS stays private | ❌ (API Gateway is public) | ✅ | ✅ |
| MBE stays on Vercel | ✅ | ❌ | ✅ |
| Supabase stays cloud | ✅ | ✅ | ✅ |
| Auth strength | API key + Basic auth | N/A (private network) | IAM SigV4 + Basic auth |
| Effort | ~1 day | ~3 days | ~1 day |
| Ongoing ops | API Gateway only | ECS + ALB + CI/CD + logs | Lambda only (stateless) |
| MBE monthly cost | $0 (Vercel free) | ~$38-71/mo ¹ | ~$0 ² (Vercel free + Lambda free) |
| Public attack surface | 1 endpoint | 0 | 0 |

¹ Option 2 cost breakdown (us-east-1, on-demand pricing):

| Resource | Calculation | Monthly |
|---|---|---|
| ALB (fixed) | $0.0225/hr × 730 hrs | ~$16 |
| ALB LCU (usage) | Minimal traffic | ~$2 |
| Fargate vCPU | 0.5 vCPU × $0.04048/hr × 730 hrs | ~$15 |
| Fargate memory | 1 GB × $0.004445/hr × 730 hrs | ~$3 |
| CloudWatch Logs | Minimal ingestion | ~$2 |
| Subtotal (public subnet) | | ~$38/mo |
| NAT Gateway (if private subnet) | $0.045/hr × 730 hrs + data processing | ~$33 |
| Subtotal (private subnet) | | ~$71/mo |

NAT Gateway is required if MBE runs in a private subnet — it needs outbound internet to reach cloud Supabase. Placing MBE in a public subnet avoids this cost but exposes the container to inbound internet (mitigated by security group rules).
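
The subtotals can be re-derived from the listed rates. The sketch below assumes the two "minimal" line items (ALB LCU, CloudWatch Logs) are flat $2 each, as in the breakdown, and counts only the NAT Gateway hourly charge, not data processing. The function name is hypothetical.

```python
HOURS_PER_MONTH = 730


def option2_monthly_cost(private_subnet: bool) -> float:
    """Sum the Option 2 line items using the us-east-1 rates quoted above."""
    alb_fixed = 0.0225 * HOURS_PER_MONTH
    alb_lcu = 2.0  # flat estimate for minimal traffic
    fargate_vcpu = 0.5 * 0.04048 * HOURS_PER_MONTH
    fargate_memory = 1 * 0.004445 * HOURS_PER_MONTH
    cloudwatch_logs = 2.0  # flat estimate for minimal ingestion
    total = alb_fixed + alb_lcu + fargate_vcpu + fargate_memory + cloudwatch_logs
    if private_subnet:
        total += 0.045 * HOURS_PER_MONTH  # NAT hourly charge only
    return total


print(round(option2_monthly_cost(private_subnet=False)))
print(round(option2_monthly_cost(private_subnet=True)))
```

Rounded to whole dollars, these reproduce the two subtotals in the breakdown, so the NAT Gateway alone accounts for nearly half of the private-subnet figure.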

² Option 3 cost breakdown (us-east-1, on-demand pricing):

MBE makes ~100-500 OpenEMS API calls/month (billing generation, meter discovery, status checks). Even at 100× that volume the cost is negligible.

| Component | Rate | 500 req/mo | 50,000 req/mo |
|---|---|---|---|
| Lambda requests | $0.20 per 1M | $0.0001 | $0.01 |
| Lambda compute | $0.0000166667/GB-sec | $0.0005 (32 GB-sec) | $0.05 (3,200 GB-sec) |
| Data transfer (Lambda → OpenEMS) | $0 (intra-VPC) | $0 | $0 |
| Data transfer (Lambda → Vercel) | $0.09/GB after 100 GB free | $0 (~2.5 MB) | $0 (~250 MB) |
| Total | | < $0.01/mo | $0.06/mo |

Lambda free tier includes 1M requests/month and 400,000 GB-seconds/month — MBE won't approach either limit. Vercel free tier (100 GB bandwidth, 100K serverless function invocations) and Supabase free tier (500 MB database, 50K auth users) are similarly unconstrained at MBE's current and near-term scale.
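
The Lambda totals follow from the two quoted rates. This sketch ignores the free tier (so it overstates the bill) and assumes 0.064 GB-seconds per request, which is what the table's 32 GB-sec for 500 requests implies. The function name is hypothetical.

```python
def option3_lambda_cost(requests_per_month: int,
                        gb_seconds_per_request: float = 0.064) -> float:
    """Monthly Lambda cost before free tier, at the us-east-1 rates quoted above."""
    request_cost = requests_per_month / 1_000_000 * 0.20
    compute_cost = requests_per_month * gb_seconds_per_request * 0.0000166667
    return request_cost + compute_cost


for volume in (500, 50_000):
    print(volume, f"${option3_lambda_cost(volume):.4f}")
```

Even with no free tier applied, the 100× scenario stays in the cents range, so the comparison with Option 2 does not depend on free-tier eligibility.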

What we'd like to hear from the team

  1. Is Option 3 (Lambda proxy) acceptable? It keeps OpenEMS fully private while avoiding the ECS migration — and preserves the $0/mo cost of MBE running on Vercel/Supabase free tiers.
  2. Is there a compliance or policy requirement that prevents any intermediary (even IAM-authenticated) from bridging Vercel to the VPC?
  3. Does the team prefer the simplicity of Option 1 (accept the small risk of a public endpoint) over the zero-surface guarantee of Option 3?

If Option 3 works for the team, we can implement it in a day and move on to higher-value work.

cc @aidan-barnes-axm @tushabe @axmsoftware

amalbet commented Apr 21, 2026

Author

Note that the Claude Code effort estimates assume a human doing the work. In practice it is probably a few minutes, so don't take them literally.

amalbet commented Apr 21, 2026

Author

Update: PR #75 implements Option 3 (Lambda VPC proxy). Ready for review alongside this architecture discussion. (Replaces #74 which was closed to remove infrastructure metadata from the public repo.)

tushabe commented Apr 21, 2026

  • Limit security by VPN: I am not sure that a VPN is a good long-term option to secure our data in transit, so I like the options that limit that security approach to OpenEMS for now.
  • No compliance requirement that I am aware of prevents intermediaries, at least for the customers and the regulatory bodies in Uganda.
  • I like the simplicity of Option 1. In another thread, I would like to understand why OpenEMS endpoints are public and why we can't add IAM to them.

@amalbet I'd say Godspeed with Option 3.

Captures the explicit PR #71 scope decision (UI/Odoo/B2B/WS over plain
HTTP, IP-allowlisted) so it's not silently normalized, and opens the
team discussion for HTTPS options (self-signed, Caddy+LE, ALB+ACM)
before the dev env becomes persistent.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Alejandro Malbet <amalbet@gmail.com>
2 participants