> **Note**
> 🔧 **Beta** — This project is under active development. APIs are stabilizing but may still change. Feedback and contributions are welcome.
A Kubernetes operator for managing Open Job Spec clusters and workers using custom resources.
The OJS Kubernetes Operator automates the deployment and lifecycle management of OJS server clusters on Kubernetes. It provides two Custom Resource Definitions:
- OJSCluster — Manages OJS server deployments with backend configuration, auto-scaling, and monitoring
- OJSWorker — Manages worker deployments that process jobs from an OJSCluster
- Declarative cluster management — Define OJS clusters as Kubernetes custom resources
- All 6 backend types — Redis, PostgreSQL, NATS, Kafka, SQS, and Lite backends
- Embedded backends — Auto-deploy Redis alongside the OJS server
- Worker management — Deploy and scale workers per job type and queue
- HPA-based auto-scaling — Automatic HorizontalPodAutoscaler creation when worker autoscaling is enabled
- Queue-based auto-scaling — Scale workers based on queue depth metrics
- Webhook validation — Validating admission webhooks for OJSCluster and OJSWorker CRDs
- Monitoring — Prometheus ServiceMonitor and Grafana dashboard support
- Production-ready — Health checks, graceful shutdown, leader election, RBAC
```
┌─────────────────────────────────────────────────────────┐
│                   Kubernetes Cluster                    │
│                                                         │
│   ┌──────────────────┐                                  │
│   │   OJS Operator   │                                  │
│   │ (controller-mgr) │                                  │
│   └────────┬─────────┘                                  │
│            │ watches & reconciles                       │
│            ▼                                            │
│   ┌────────────────┐       ┌─────────────────┐          │
│   │ OJSCluster CR  │──────▶│ OJS Server Pods │          │
│   │ (desired state)│       │ (Deployment)    │          │
│   └────────────────┘       │ + Service       │          │
│                            │ + ConfigMap     │          │
│                            └────────┬────────┘          │
│                                     │                   │
│   ┌────────────────┐       ┌────────▼────────┐          │
│   │ OJSWorker CR   │──────▶│ Worker Pods     │          │
│   │ (desired state)│       │ (Deployment)    │          │
│   └────────────────┘       └─────────────────┘          │
│                                                         │
│   ┌─────────────────────────────────────────────┐       │
│   │ Backend: Redis / PostgreSQL / NATS /        │       │
│   │          Kafka / SQS / Lite                 │       │
│   │ (external or embedded)                      │       │
│   └─────────────────────────────────────────────┘       │
└─────────────────────────────────────────────────────────┘
```
The operator follows the standard controller-runtime pattern:
- `OJSClusterReconciler` watches `OJSCluster` CRs and reconciles:
  - A Deployment for the OJS server pods
  - A Service for HTTP and metrics access
  - A ConfigMap with backend configuration
  - Optionally an embedded Redis deployment and service
  - Status updates with phase, replica counts, and conditions
- `OJSWorkerReconciler` watches `OJSWorker` CRs and reconciles:
  - A Deployment for worker pods connected to the referenced OJSCluster
  - Status updates with phase, replica counts, and queue depth
  - Periodic re-queue for auto-scaling when enabled
- Kubernetes cluster (v1.26+)
- `kubectl` configured to access the cluster
- `helm` v3 (for Helm-based installation)
- cert-manager (optional, for webhook certificates)
Using `make`:

```shell
# Install CRDs
make install

# Deploy the operator
make deploy
```

Using Helm:

```shell
# Install from local chart
helm install ojs-operator charts/ojs-operator \
  --create-namespace \
  --namespace ojs-system

# Or with custom values
helm install ojs-operator charts/ojs-operator \
  --namespace ojs-system \
  --set image.tag=v0.2.0 \
  --set replicaCount=2
```

Using OLM:

```shell
# If your cluster has OLM installed
kubectl apply -f https://raw.githubusercontent.com/openjobspec/ojs-k8s-operator/main/config/olm/catalogsource.yaml
kubectl apply -f https://raw.githubusercontent.com/openjobspec/ojs-k8s-operator/main/config/olm/subscription.yaml
```

`OJSCluster` — Defines an OJS server cluster deployment.
| Field | Type | Default | Description |
|---|---|---|---|
| `backend.type` | string | required | Backend type: `redis`, `postgres`, `nats`, `kafka`, `sqs`, or `lite` |
| `backend.url` | string | | Backend connection URL |
| `backend.urlSecretRef.name` | string | | Secret name containing the connection URL |
| `backend.urlSecretRef.key` | string | | Key within the Secret |
| `backend.embedded` | bool | `false` | Auto-deploy the backend (Redis only) |
| `replicas` | int32 | `2` | Number of OJS server replicas |
| `image` | string | `ghcr.io/openjobspec/ojs-server:latest` | OJS server container image |
| `resources` | ResourceRequirements | | Pod resource requests/limits |
| `autoScaling.enabled` | bool | | Enable auto-scaling |
| `autoScaling.minReplicas` | int32 | | Minimum replica count |
| `autoScaling.maxReplicas` | int32 | | Maximum replica count |
| `autoScaling.targetQueueDepth` | int64 | | Target queue depth per replica |
| `autoScaling.targetJobsPerWorker` | int64 | | Target active jobs per worker |
| `autoScaling.scaleUpCooldown` | string | | Cooldown after scale-up (e.g., `60s`) |
| `autoScaling.scaleDownCooldown` | string | | Cooldown after scale-down (e.g., `300s`) |
| `monitoring.enabled` | bool | | Enable Prometheus metrics |
| `monitoring.serviceMonitor` | bool | | Create ServiceMonitor resource |
| `monitoring.grafanaDashboard` | bool | | Create Grafana dashboard ConfigMap |
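As a sketch of how `backend.urlSecretRef` is typically wired up, the referenced Secret could be created like this (the name, key, and connection URL below are illustrative placeholders, not values the operator requires):

```yaml
# Illustrative Secret consumed via backend.urlSecretRef.
# The name, key, and URL are placeholders — use your own values.
apiVersion: v1
kind: Secret
metadata:
  name: redis-credentials
  namespace: ojs-system
type: Opaque
stringData:
  url: redis://:changeme@redis.example.svc.cluster.local:6379/0
```

The OJSCluster then points at it with `urlSecretRef: {name: redis-credentials, key: url}`, keeping credentials out of the custom resource itself.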
Status fields:

| Field | Description |
|---|---|
| `phase` | Cluster phase: `Pending`, `Running`, `Scaling`, `Error` |
| `replicas` | Total server pod count |
| `readyReplicas` | Ready server pod count |
| `queueDepth` | Current queued jobs |
| `activeJobs` | Current active jobs |
| `conditions` | Standard Kubernetes conditions (`Ready`, `BackendReady`) |
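For illustration, the status block of a healthy reconciled cluster might look like this (all values are hypothetical):

```yaml
# Hypothetical OJSCluster status — field values are illustrative only.
status:
  phase: Running
  replicas: 3
  readyReplicas: 3
  queueDepth: 42
  activeJobs: 7
  conditions:
    - type: Ready
      status: "True"
    - type: BackendReady
      status: "True"
```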
`OJSWorker` — Defines a worker deployment that processes jobs from an OJSCluster.
| Field | Type | Default | Description |
|---|---|---|---|
| `clusterRef` | string | required | Name of the OJSCluster to connect to |
| `jobTypes` | []string | required | Job types this worker handles |
| `queues` | []string | `["default"]` | Queues to process |
| `concurrency` | int32 | | Concurrent jobs per pod |
| `replicas` | int32 | `1` | Number of worker pods |
| `image` | string | required | Worker container image |
| `command` | []string | | Command override |
| `env` | []EnvVar | | Additional environment variables |
| `resources` | ResourceRequirements | | Pod resource requests/limits |
| `autoScaling.enabled` | bool | | Enable queue-based auto-scaling |
| `autoScaling.minReplicas` | int32 | | Minimum worker replicas |
| `autoScaling.maxReplicas` | int32 | | Maximum worker replicas |
| `autoScaling.targetJobsPerWorker` | int64 | | Desired pending jobs per worker |
| `autoScaling.scaleUpThreshold` | int64 | | Queue depth to trigger scale-up |
| `autoScaling.scaleDownDelay` | string | | Delay before scale-down (e.g., `5m`) |
| `autoScaling.pollingInterval` | string | | Queue metrics polling interval (e.g., `30s`) |
| `gracefulShutdown.timeoutSeconds` | int32 | `30` | Max time to wait for active jobs |
| `gracefulShutdown.drainBeforeShutdown` | bool | | Wait for active jobs to complete |
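As a sketch, the `gracefulShutdown` fields from the table above could be combined in a worker spec like this (the worker name, job type, and image are hypothetical):

```yaml
# Illustrative OJSWorker fragment showing graceful-shutdown settings.
# Name, clusterRef, jobTypes, and image are placeholders.
apiVersion: ojs.openjobspec.dev/v1alpha1
kind: OJSWorker
metadata:
  name: batch-worker
spec:
  clusterRef: ojs-production
  jobTypes: [batch.process]
  image: myapp/batch-worker:latest
  gracefulShutdown:
    timeoutSeconds: 60          # wait up to 60s for active jobs on shutdown
    drainBeforeShutdown: true   # finish in-flight jobs before the pod exits
```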
Status fields:

| Field | Description |
|---|---|
| `phase` | Worker phase: `Pending`, `Running`, `Scaling`, `Draining`, `Error` |
| `replicas` | Total worker pod count |
| `readyReplicas` | Ready worker pod count |
| `activeJobs` | Jobs being processed |
| `queueDepth` | Pending jobs for this worker's queues |
| `lastScaleTime` | Last scaling event timestamp |
| `conditions` | Standard Kubernetes conditions |
The simplest OJSCluster with an embedded Redis backend:

```yaml
apiVersion: ojs.openjobspec.dev/v1alpha1
kind: OJSCluster
metadata:
  name: ojs-minimal
spec:
  backend:
    type: redis
    embedded: true
  replicas: 1
```

A production-ready cluster with external Redis, resource limits, and monitoring:
```yaml
apiVersion: ojs.openjobspec.dev/v1alpha1
kind: OJSCluster
metadata:
  name: ojs-production
  namespace: ojs-system
spec:
  backend:
    type: redis
    urlSecretRef:
      name: redis-credentials
      key: url
  replicas: 3
  image: ghcr.io/openjobspec/ojs-backend-redis:v0.1.0
  resources:
    requests:
      cpu: "250m"
      memory: "256Mi"
    limits:
      cpu: "1"
      memory: "512Mi"
  autoScaling:
    enabled: true
    minReplicas: 3
    maxReplicas: 10
    targetQueueDepth: 100
  monitoring:
    enabled: true
    serviceMonitor: true
    grafanaDashboard: true
```

An OJSCluster with multiple specialized workers:
```yaml
apiVersion: ojs.openjobspec.dev/v1alpha1
kind: OJSCluster
metadata:
  name: ojs-multi
spec:
  backend:
    type: redis
    embedded: true
  replicas: 2
---
apiVersion: ojs.openjobspec.dev/v1alpha1
kind: OJSWorker
metadata:
  name: email-worker
spec:
  clusterRef: ojs-multi
  jobTypes: [email.send, email.digest]
  queues: [emails]
  concurrency: 10
  replicas: 2
  image: myapp/email-worker:latest
  autoScaling:
    enabled: true
    minReplicas: 1
    maxReplicas: 10
    targetJobsPerWorker: 5
---
apiVersion: ojs.openjobspec.dev/v1alpha1
kind: OJSWorker
metadata:
  name: data-processor
spec:
  clusterRef: ojs-multi
  jobTypes: [data.transform, report.generate]
  queues: [data-processing]
  concurrency: 50
  replicas: 3
  image: myapp/data-worker:latest
  resources:
    requests:
      cpu: "500m"
      memory: "512Mi"
    limits:
      cpu: "2"
      memory: "2Gi"
```

More examples are available in `config/samples/`.
```shell
make build   # Build operator binary to bin/manager
make test    # Run tests with race detection and coverage
make lint    # Run go vet
```

```shell
# Install CRDs into cluster
make install

# Run the operator locally (requires kubeconfig)
make run
```

```shell
make docker-build   # Build container image
make docker-push    # Push to registry
```

```shell
make helm-package   # Package Helm chart
make helm-install   # Install via Helm into ojs-system namespace
make helm-uninstall # Uninstall Helm release
```

```shell
make generate   # Run go generate (deepcopy, etc.)
```

```shell
# Check operator logs
kubectl logs -n ojs-system -l control-plane=controller-manager

# Check events
kubectl get events -n ojs-system --sort-by='.lastTimestamp'
```

```shell
# Check the cluster status and conditions
kubectl describe ojscluster <name>

# If using embedded Redis, verify the Redis pod is running
kubectl get pods -l app.kubernetes.io/component=backend
```

This typically means the referenced OJSCluster was not found:

```shell
# Verify the clusterRef matches an existing OJSCluster in the same namespace
kubectl get ojsclusters

# Check worker conditions
kubectl describe ojsworker <name>
```

```shell
# Verify the ClusterRole has the necessary permissions
kubectl get clusterrole ojs-operator-manager-role -o yaml

# Check if the ServiceAccount is bound correctly
kubectl get clusterrolebinding ojs-operator-manager-rolebinding -o yaml
```

```shell
# Remove the finalizer to allow deletion
kubectl patch ojscluster <name> -p '{"metadata":{"finalizers":null}}' --type=merge
```

- Set resource limits on all OJS server and worker pods
- Configure `PodDisruptionBudget` for high availability
- Enable Prometheus metrics collection (`monitoring.enabled: true`)
- Set up Grafana dashboards (`monitoring.grafanaDashboard: true`)
- Use `urlSecretRef` instead of plaintext URLs for backend credentials
- Configure graceful shutdown on all workers
- Review and tighten RBAC permissions
- Test failover and recovery procedures
- Configure backup strategy for the backend database
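A `PodDisruptionBudget` for the OJS server pods might look like the sketch below. The label selector is an assumption — match it to the labels the operator actually sets on the server Deployment in your cluster:

```yaml
# Illustrative PodDisruptionBudget; selector labels are assumptions.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: ojs-production-pdb
  namespace: ojs-system
spec:
  minAvailable: 2               # keep at least 2 server pods during voluntary disruptions
  selector:
    matchLabels:
      app.kubernetes.io/name: ojs-server          # assumed label
      app.kubernetes.io/instance: ojs-production  # assumed label
```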
Apache License 2.0 — see LICENSE for details.