This page covers operational guidance for running the AI-Q blueprint in production environments.
The default compose stack includes a PostgreSQL container, but for production workloads consider a managed database service:
- Amazon RDS for PostgreSQL
- Google Cloud SQL for PostgreSQL
- Azure Database for PostgreSQL
Set the following environment variables to point to your managed database:
| Variable | Driver | Example |
|---|---|---|
| NAT_JOB_STORE_DB_URL | asyncpg | postgresql+asyncpg://<user>:<pw>@rds-host:5432/aiq_jobs |
| AIQ_CHECKPOINT_DB | psycopg2 | postgresql://<user>:<pw>@rds-host:5432/aiq_checkpoints |
| AIQ_SUMMARY_DB | psycopg | postgresql+psycopg://<user>:<pw>@rds-host:5432/aiq_jobs |
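
Passwords that contain characters such as `@`, `/`, or `:` break URL parsing unless they are percent-encoded. The following is a minimal sketch of generating the three values from one set of credentials; the helper name `build_db_env` and the sample credentials are hypothetical, and the scheme and database mapping is copied from the table above.

```python
from urllib.parse import quote

# Scheme prefix for each variable, taken from the table above.
SCHEMES = {
    "NAT_JOB_STORE_DB_URL": "postgresql+asyncpg",
    "AIQ_CHECKPOINT_DB": "postgresql",
    "AIQ_SUMMARY_DB": "postgresql+psycopg",
}

# Database each variable points at, also taken from the table above.
DATABASES = {
    "NAT_JOB_STORE_DB_URL": "aiq_jobs",
    "AIQ_CHECKPOINT_DB": "aiq_checkpoints",
    "AIQ_SUMMARY_DB": "aiq_jobs",
}


def build_db_env(user: str, password: str, host: str, port: int = 5432) -> dict[str, str]:
    """Return the three connection URLs with credentials percent-encoded."""
    # safe="" also encodes "/" so it cannot be mistaken for a URL delimiter.
    creds = f"{quote(user, safe='')}:{quote(password, safe='')}"
    return {
        name: f"{SCHEMES[name]}://{creds}@{host}:{port}/{DATABASES[name]}"
        for name in SCHEMES
    }


if __name__ == "__main__":
    # Hypothetical credentials and host, for illustration only.
    for name, url in build_db_env("aiq", "p@ss/word", "rds-host").items():
        print(f"{name}={url}")
```

The printed lines can be pasted into the environment file used by the deployment. Depending on how that file is parsed, characters such as `$` may need additional escaping.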
When using a managed database, you must run the initialization SQL manually (or as a migration step) since the init-db.sql Docker entrypoint script only executes on a fresh PostgreSQL container volume. The script:
- Creates the aiq_checkpoints database.
- Grants permissions to the application user.
- Creates the job_info table with performance indices in aiq_jobs.
Refer to deploy/compose/init-db.sql for the full schema.
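
One way to run the script as a migration step is to invoke `psql` from a small wrapper. This is a sketch under assumptions: the function names are hypothetical, `psql` must be on the PATH, and the database to connect to (`dbname`) depends on how init-db.sql switches between databases, so check the script before choosing it. Managed services may also restrict some statements that work in the container.

```python
import subprocess


def psql_init_command(host: str, admin_user: str, dbname: str,
                      sql_path: str, port: int = 5432) -> list[str]:
    """Build a psql invocation that applies the script and stops on the first error."""
    return [
        "psql",
        "--host", host,
        "--port", str(port),
        "--username", admin_user,
        "--dbname", dbname,
        "--set", "ON_ERROR_STOP=1",  # abort instead of continuing past a failed statement
        "--file", sql_path,
    ]


def run_init(host: str, admin_user: str, dbname: str,
             sql_path: str = "deploy/compose/init-db.sql") -> None:
    # psql reads the password from PGPASSWORD or ~/.pgpass; it is not passed as an argument.
    subprocess.run(psql_init_command(host, admin_user, dbname, sql_path), check=True)
```

With `check=True`, a failed statement raises `subprocess.CalledProcessError`, so a deployment pipeline stops rather than starting the backend against a half-initialized database.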
Back up the following databases regularly:
- aiq_jobs: Contains the job_info table (job metadata) and the job_events table (event stream). This is the critical operational data store.
- aiq_checkpoints: Contains LangGraph agent state checkpoints. These allow resumption of interrupted research workflows.
For managed databases, enable automated daily backups with at least 7 days of retention. For self-managed PostgreSQL, use pg_dump on a schedule:

    pg_dump -U aiq -d aiq_jobs > aiq_jobs_$(date +%Y%m%d).sql
    pg_dump -U aiq -d aiq_checkpoints > aiq_checkpoints_$(date +%Y%m%d).sql

The backend is stateless apart from database connections, so it can be horizontally scaled behind a load balancer.
Docker Compose: Run multiple backend containers by scaling the service and using a reverse proxy (such as Traefik or NGINX) in front:

    docker compose --env-file ../.env -f docker-compose.yaml up -d --scale aiq-agent=3

Note that each scaled instance starts its own embedded Dask scheduler and worker. For a shared Dask cluster, deploy Dask separately and set NAT_DASK_SCHEDULER_ADDRESS to point to the external scheduler.
Each backend container runs an embedded Dask scheduler with a configurable number of workers and threads:
| Variable | Default | Guidance |
|---|---|---|
| DASK_NWORKERS | 1 | Increase for higher job throughput. Each worker consumes memory proportional to the research workflow depth. |
| DASK_NTHREADS | 4 | Increase for I/O-bound workloads (web searches, API calls). |
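
Because every backend replica runs its own embedded scheduler, the total number of worker threads grows with the replica count. The sketch below estimates that total; it assumes DASK_NTHREADS is a per-worker setting (the table does not say), and the function name is hypothetical.

```python
import os


def dask_thread_slots(replicas: int, env=os.environ) -> int:
    """Estimate total Dask worker threads across all backend replicas."""
    # Defaults mirror the table above.
    workers = int(env.get("DASK_NWORKERS", "1"))
    # Assumption: DASK_NTHREADS counts threads per worker, not per container.
    threads = int(env.get("DASK_NTHREADS", "4"))
    return replicas * workers * threads


if __name__ == "__main__":
    print(dask_thread_slots(replicas=3))
```

With the defaults and three replicas this gives 12 threads. Memory, not thread count, is usually the limit to watch when raising DASK_NWORKERS, since each worker holds its own workflow state.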
Deep research workflows are memory- and compute-intensive due to multi-phase LLM calls. Recommended minimums:
| Component | CPU | Memory | Notes |
|---|---|---|---|
| Backend | 2 cores | 4 GB | Increase for deep research or multiple concurrent users. |
| Frontend | 0.5 cores | 512 MB | Lightweight Next.js server. |
| PostgreSQL | 1 core | 2 GB | Increase for high write throughput. |
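
As a rough planning aid, the minimums above can be summed for a given number of backend replicas. This is only a sketch: it assumes one frontend and one PostgreSQL instance, and it gives a floor rather than a recommendation for deep research workloads.

```python
# (CPU cores, memory in GB) minimums from the table above.
MINIMUMS = {
    "backend": (2.0, 4.0),
    "frontend": (0.5, 0.5),  # 512 MB
    "postgresql": (1.0, 2.0),
}


def minimum_footprint(backend_replicas: int = 1) -> tuple[float, float]:
    """Sum the recommended minimums, counting the backend once per replica."""
    counts = {"backend": backend_replicas, "frontend": 1, "postgresql": 1}
    cpu = sum(MINIMUMS[name][0] * n for name, n in counts.items())
    mem = sum(MINIMUMS[name][1] * n for name, n in counts.items())
    return cpu, mem


if __name__ == "__main__":
    cores, gigabytes = minimum_footprint(backend_replicas=3)
    print(f"{cores} cores, {gigabytes} GB")
```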
The Docker image runs as a non-root user (aiq, UID 1000) in both dev and release targets. The NVIDIA distroless base image has no shell and no package manager, reducing the attack surface.
The compose stack mounts configs/ as read-only (:ro), preventing the application from modifying its own configuration at runtime.
Store API keys in deploy/.env and ensure the file is not committed to version control (it is listed in .gitignore). Never embed keys in configuration files or Dockerfiles.
The backend exposes a health endpoint at /health for liveness and readiness probes.
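
Where a probe has to be scripted, for example to hold a deployment step until the backend is up, a polling helper along these lines could wrap the endpoint. It is a sketch that treats any 2xx response as healthy; the response body format is not documented here, so it is not inspected, and the function names are hypothetical.

```python
import time
import urllib.error
import urllib.request


def is_healthy(url: str = "http://localhost:8000/health", timeout: float = 2.0) -> bool:
    """Return True if the endpoint answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False


def wait_until_healthy(url: str, attempts: int = 30, delay: float = 2.0) -> bool:
    """Poll the endpoint until it is healthy or the attempts run out."""
    for _ in range(attempts):
        if is_healthy(url):
            return True
        time.sleep(delay)
    return False
```

A one-off manual check from the host needs only curl: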

    curl http://localhost:8000/health

Backend logs show agent execution, tool calls, LLM interactions, and job lifecycle events.

    docker logs aiq-agent -f

Set LOG_LEVEL=DEBUG for verbose output during troubleshooting. Use LOG_LEVEL=WARNING in production to reduce log volume.
The backend supports OpenTelemetry-compatible tracing. See Observability for setup guides covering Phoenix, LangSmith, Weave, and the OTEL Collector with privacy redaction.
If you are deploying the aiq_api front-end and want request correlation on
NAT-exported spans, set the relevant environment variables at deploy time rather
than hardcoding them in code:
- AIQ_TRACE_USER_IDENTITY_MODE
- AIQ_TRACE_USER_IDENTITY_HMAC_SECRET
- AIQ_TRACE_CLIENT_ID_MODE
- AIQ_TRACE_CLIENT_ID_HMAC_SECRET
- AIQ_TRACE_CLIENT_IP_HEADERS
| Metric | Source | What to look for |
|---|---|---|
| Backend response time | Health endpoint, access logs | Increasing latency indicates resource pressure or LLM API slowdowns. |
| Job queue depth | job_info table (status='pending') | Growing backlog means Dask workers cannot keep up. |
| Database connections | PostgreSQL pg_stat_activity | Connection exhaustion from too many backend replicas. |
| Container restarts | Docker | Frequent restarts indicate OOM kills or startup failures. |
| Dask worker memory | Dask dashboard (port 8787) | Memory growth in workers during deep research. |
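
The job queue depth row can be turned into a periodic check with a single query. The sketch below works with any DB-API connection; SQLite stands in for PostgreSQL so it runs without a server, and the two-column table in `demo` is a placeholder rather than the real job_info schema (only the `status` column and the `'pending'` value come from the table above).

```python
import sqlite3


def pending_job_count(conn) -> int:
    """Count rows in job_info whose status is 'pending', using a DB-API connection."""
    cur = conn.cursor()
    try:
        cur.execute("SELECT COUNT(*) FROM job_info WHERE status = 'pending'")
        return cur.fetchone()[0]
    finally:
        cur.close()


def demo() -> int:
    # In-memory SQLite with a placeholder schema, for illustration only.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE job_info (job_id TEXT, status TEXT)")
    conn.executemany(
        "INSERT INTO job_info VALUES (?, ?)",
        [("a", "pending"), ("b", "running"), ("c", "pending")],
    )
    return pending_job_count(conn)


if __name__ == "__main__":
    print(demo())
```

Against the real database, pass a connection opened with a PostgreSQL driver instead. Sampling the count at a fixed interval and alerting on sustained growth is more useful than alerting on any single value.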