> **Note**
> 🔧 **Beta** — This project is under active development. APIs are stabilizing but may still change. Feedback and contributions are welcome.
A Kubernetes operator for managing Open Job Spec clusters and workers using custom resources.
The OJS Kubernetes Operator automates the deployment and lifecycle management of OJS server clusters on Kubernetes. It provides two Custom Resource Definitions:
- OJSCluster — Manages OJS server deployments with backend configuration, auto-scaling, and monitoring
- OJSWorker — Manages worker deployments that process jobs from an OJSCluster
- Declarative cluster management — Define OJS clusters as Kubernetes custom resources
- All 6 backend types — Redis, PostgreSQL, NATS, Kafka, SQS, and Lite backends
- Embedded backends — Auto-deploy Redis alongside the OJS server
- Worker management — Deploy and scale workers per job type and queue
- HPA-based auto-scaling — Automatic HorizontalPodAutoscaler creation when worker autoscaling is enabled
- Queue-based auto-scaling — Scale workers based on queue depth metrics
- Webhook validation — Validating admission webhooks for OJSCluster and OJSWorker CRDs
- Monitoring — Prometheus ServiceMonitor and Grafana dashboard support
- Production-ready — Health checks, graceful shutdown, leader election, RBAC
```
┌─────────────────────────────────────────────────────────┐
│                   Kubernetes Cluster                    │
│                                                         │
│   ┌──────────────────┐                                  │
│   │   OJS Operator   │                                  │
│   │ (controller-mgr) │                                  │
│   └────────┬─────────┘                                  │
│            │ watches & reconciles                       │
│            ▼                                            │
│   ┌────────────────┐       ┌─────────────────┐          │
│   │ OJSCluster CR  │──────▶│ OJS Server Pods │          │
│   │ (desired state)│       │ (Deployment)    │          │
│   └────────────────┘       │ + Service       │          │
│                            │ + ConfigMap     │          │
│                            └────────┬────────┘          │
│                                     │                   │
│   ┌────────────────┐       ┌────────▼────────┐          │
│   │ OJSWorker CR   │──────▶│ Worker Pods     │          │
│   │ (desired state)│       │ (Deployment)    │          │
│   └────────────────┘       └─────────────────┘          │
│                                                         │
│   ┌─────────────────────────────────────────────┐       │
│   │ Backend: Redis / PostgreSQL / NATS /        │       │
│   │          Kafka / SQS / Lite                 │       │
│   │ (external or embedded)                      │       │
│   └─────────────────────────────────────────────┘       │
└─────────────────────────────────────────────────────────┘
```
The operator follows the standard controller-runtime pattern:
- `OJSClusterReconciler` watches `OJSCluster` CRs and reconciles:
  - A Deployment for the OJS server pods
  - A Service for HTTP and metrics access
  - A ConfigMap with backend configuration
  - Optionally an embedded Redis deployment and service
  - Status updates with phase, replica counts, and conditions
- `OJSWorkerReconciler` watches `OJSWorker` CRs and reconciles:
  - A Deployment for worker pods connected to the referenced OJSCluster
  - Status updates with phase, replica counts, and queue depth
  - Periodic re-queue for auto-scaling when enabled
- Kubernetes cluster (v1.26+)
- `kubectl` configured to access the cluster
- `helm` v3 (for Helm-based installation)
- cert-manager (optional, for webhook certificates)
Using `make`:

```shell
# Install CRDs
make install

# Deploy the operator
make deploy
```

Using Helm:

```shell
# Install from local chart
helm install ojs-operator charts/ojs-operator \
  --create-namespace \
  --namespace ojs-system

# Or with custom values
helm install ojs-operator charts/ojs-operator \
  --namespace ojs-system \
  --set image.tag=v0.2.0 \
  --set replicaCount=2
```

Using OLM:

```shell
# If your cluster has OLM installed
kubectl apply -f https://raw.githubusercontent.com/openjobspec/ojs-k8s-operator/main/config/olm/catalogsource.yaml
kubectl apply -f https://raw.githubusercontent.com/openjobspec/ojs-k8s-operator/main/config/olm/subscription.yaml
```

`OJSCluster` — Defines an OJS server cluster deployment.
| Field | Type | Default | Description |
|---|---|---|---|
| `backend.type` | string | required | Backend type: `redis`, `postgres`, `nats`, `kafka`, `sqs`, or `lite` |
| `backend.url` | string | | Backend connection URL |
| `backend.urlSecretRef.name` | string | | Secret name containing the connection URL |
| `backend.urlSecretRef.key` | string | | Key within the Secret |
| `backend.embedded` | bool | `false` | Auto-deploy the backend (Redis only) |
| `replicas` | int32 | `2` | Number of OJS server replicas |
| `image` | string | `ghcr.io/openjobspec/ojs-server:latest` | OJS server container image |
| `resources` | ResourceRequirements | | Pod resource requests/limits |
| `autoScaling.enabled` | bool | | Enable auto-scaling |
| `autoScaling.minReplicas` | int32 | | Minimum replica count |
| `autoScaling.maxReplicas` | int32 | | Maximum replica count |
| `autoScaling.targetQueueDepth` | int64 | | Target queue depth per replica |
| `autoScaling.targetJobsPerWorker` | int64 | | Target active jobs per worker |
| `autoScaling.scaleUpCooldown` | string | | Cooldown after scale-up (e.g., `60s`) |
| `autoScaling.scaleDownCooldown` | string | | Cooldown after scale-down (e.g., `300s`) |
| `monitoring.enabled` | bool | | Enable Prometheus metrics |
| `monitoring.serviceMonitor` | bool | | Create ServiceMonitor resource |
| `monitoring.grafanaDashboard` | bool | | Create Grafana dashboard ConfigMap |
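As a sketch of how `backend.urlSecretRef` is typically wired up, the referenced Secret could be created like this (the name, key, and connection URL below are illustrative placeholders, not values the operator requires):

```yaml
# Illustrative Secret consumed via backend.urlSecretRef.
# The name, key, and URL are placeholders — use your own values.
apiVersion: v1
kind: Secret
metadata:
  name: redis-credentials
  namespace: ojs-system
type: Opaque
stringData:
  url: redis://:changeme@redis.example.svc.cluster.local:6379/0
```

The OJSCluster then points at it with `urlSecretRef: {name: redis-credentials, key: url}`, keeping credentials out of the custom resource itself.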
Status fields:

| Field | Description |
|---|---|
| `phase` | Cluster phase: `Pending`, `Running`, `Scaling`, `Error` |
| `replicas` | Total server pod count |
| `readyReplicas` | Ready server pod count |
| `queueDepth` | Current queued jobs |
| `activeJobs` | Current active jobs |
| `conditions` | Standard Kubernetes conditions (`Ready`, `BackendReady`) |
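For illustration, the status block of a healthy reconciled cluster might look like this (all values are hypothetical):

```yaml
# Hypothetical OJSCluster status — field values are illustrative only.
status:
  phase: Running
  replicas: 3
  readyReplicas: 3
  queueDepth: 42
  activeJobs: 7
  conditions:
    - type: Ready
      status: "True"
    - type: BackendReady
      status: "True"
```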
`OJSWorker` — Defines a worker deployment that processes jobs from an OJSCluster.
| Field | Type | Default | Description |
|---|---|---|---|
| `clusterRef` | string | required | Name of the OJSCluster to connect to |
| `jobTypes` | []string | required | Job types this worker handles |
| `queues` | []string | `["default"]` | Queues to process |
| `concurrency` | int32 | | Concurrent jobs per pod |
| `replicas` | int32 | `1` | Number of worker pods |
| `image` | string | required | Worker container image |
| `command` | []string | | Command override |
| `env` | []EnvVar | | Additional environment variables |
| `resources` | ResourceRequirements | | Pod resource requests/limits |
| `autoScaling.enabled` | bool | | Enable queue-based auto-scaling |
| `autoScaling.minReplicas` | int32 | | Minimum worker replicas |
| `autoScaling.maxReplicas` | int32 | | Maximum worker replicas |
| `autoScaling.targetJobsPerWorker` | int64 | | Desired pending jobs per worker |
| `autoScaling.scaleUpThreshold` | int64 | | Queue depth to trigger scale-up |
| `autoScaling.scaleDownDelay` | string | | Delay before scale-down (e.g., `5m`) |
| `autoScaling.pollingInterval` | string | | Queue metrics polling interval (e.g., `30s`) |
| `gracefulShutdown.timeoutSeconds` | int32 | `30` | Max time to wait for active jobs |
| `gracefulShutdown.drainBeforeShutdown` | bool | | Wait for active jobs to complete |
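As a sketch, the `gracefulShutdown` fields from the table above could be combined in a worker spec like this (the worker name, job type, and image are hypothetical):

```yaml
# Illustrative OJSWorker fragment showing graceful-shutdown settings.
# Name, clusterRef, jobTypes, and image are placeholders.
apiVersion: ojs.openjobspec.dev/v1alpha1
kind: OJSWorker
metadata:
  name: batch-worker
spec:
  clusterRef: ojs-production
  jobTypes: [batch.process]
  image: myapp/batch-worker:latest
  gracefulShutdown:
    timeoutSeconds: 60          # wait up to 60s for active jobs on shutdown
    drainBeforeShutdown: true   # finish in-flight jobs before the pod exits
```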
Status fields:

| Field | Description |
|---|---|
| `phase` | Worker phase: `Pending`, `Running`, `Scaling`, `Draining`, `Error` |
| `replicas` | Total worker pod count |
| `readyReplicas` | Ready worker pod count |
| `activeJobs` | Jobs being processed |
| `queueDepth` | Pending jobs for this worker's queues |
| `lastScaleTime` | Last scaling event timestamp |
| `conditions` | Standard Kubernetes conditions |
The simplest OJSCluster with an embedded Redis backend:

```yaml
apiVersion: ojs.openjobspec.dev/v1alpha1
kind: OJSCluster
metadata:
  name: ojs-minimal
spec:
  backend:
    type: redis
    embedded: true
  replicas: 1
```

A production-ready cluster with external Redis, resource limits, and monitoring:
```yaml
apiVersion: ojs.openjobspec.dev/v1alpha1
kind: OJSCluster
metadata:
  name: ojs-production
  namespace: ojs-system
spec:
  backend:
    type: redis
    urlSecretRef:
      name: redis-credentials
      key: url
  replicas: 3
  image: ghcr.io/openjobspec/ojs-backend-redis:v0.1.0
  resources:
    requests:
      cpu: "250m"
      memory: "256Mi"
    limits:
      cpu: "1"
      memory: "512Mi"
  autoScaling:
    enabled: true
    minReplicas: 3
    maxReplicas: 10
    targetQueueDepth: 100
  monitoring:
    enabled: true
    serviceMonitor: true
    grafanaDashboard: true
```

An OJSCluster with multiple specialized workers:
```yaml
apiVersion: ojs.openjobspec.dev/v1alpha1
kind: OJSCluster
metadata:
  name: ojs-multi
spec:
  backend:
    type: redis
    embedded: true
  replicas: 2
---
apiVersion: ojs.openjobspec.dev/v1alpha1
kind: OJSWorker
metadata:
  name: email-worker
spec:
  clusterRef: ojs-multi
  jobTypes: [email.send, email.digest]
  queues: [emails]
  concurrency: 10
  replicas: 2
  image: myapp/email-worker:latest
  autoScaling:
    enabled: true
    minReplicas: 1
    maxReplicas: 10
    targetJobsPerWorker: 5
---
apiVersion: ojs.openjobspec.dev/v1alpha1
kind: OJSWorker
metadata:
  name: data-processor
spec:
  clusterRef: ojs-multi
  jobTypes: [data.transform, report.generate]
  queues: [data-processing]
  concurrency: 50
  replicas: 3
  image: myapp/data-worker:latest
  resources:
    requests:
      cpu: "500m"
      memory: "512Mi"
    limits:
      cpu: "2"
      memory: "2Gi"
```

More examples are available in `config/samples/`.
```shell
make build   # Build operator binary to bin/manager
make test    # Run tests with race detection and coverage
make lint    # Run go vet
```

```shell
# Install CRDs into cluster
make install

# Run the operator locally (requires kubeconfig)
make run
```

```shell
make docker-build   # Build container image
make docker-push    # Push to registry
```

```shell
make helm-package   # Package Helm chart
make helm-install   # Install via Helm into ojs-system namespace
make helm-uninstall # Uninstall Helm release
```

```shell
make generate   # Run go generate (deepcopy, etc.)
```

```shell
# Check operator logs
kubectl logs -n ojs-system -l control-plane=controller-manager

# Check events
kubectl get events -n ojs-system --sort-by='.lastTimestamp'
```

```shell
# Check the cluster status and conditions
kubectl describe ojscluster <name>

# If using embedded Redis, verify the Redis pod is running
kubectl get pods -l app.kubernetes.io/component=backend
```

This typically means the referenced OJSCluster was not found:

```shell
# Verify the clusterRef matches an existing OJSCluster in the same namespace
kubectl get ojsclusters

# Check worker conditions
kubectl describe ojsworker <name>
```

```shell
# Verify the ClusterRole has the necessary permissions
kubectl get clusterrole ojs-operator-manager-role -o yaml

# Check if the ServiceAccount is bound correctly
kubectl get clusterrolebinding ojs-operator-manager-rolebinding -o yaml
```

```shell
# Remove the finalizer to allow deletion
kubectl patch ojscluster <name> -p '{"metadata":{"finalizers":null}}' --type=merge
```

- Set resource limits on all OJS server and worker pods
- Configure `PodDisruptionBudget` for high availability
- Enable Prometheus metrics collection (`monitoring.enabled: true`)
- Set up Grafana dashboards (`monitoring.grafanaDashboard: true`)
- Use `urlSecretRef` instead of plaintext URLs for backend credentials
- Configure graceful shutdown on all workers
- Review and tighten RBAC permissions
- Test failover and recovery procedures
- Configure backup strategy for the backend database
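A `PodDisruptionBudget` for the OJS server pods might look like the sketch below. The label selector is an assumption — match it to the labels the operator actually sets on the server Deployment in your cluster:

```yaml
# Illustrative PodDisruptionBudget; selector labels are assumptions.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: ojs-production-pdb
  namespace: ojs-system
spec:
  minAvailable: 2               # keep at least 2 server pods during voluntary disruptions
  selector:
    matchLabels:
      app.kubernetes.io/name: ojs-server          # assumed label
      app.kubernetes.io/instance: ojs-production  # assumed label
```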
Apache License 2.0 — see LICENSE for details.