A Redis-backed HTTP service for managing recurring scrape jobs. Workers long-poll for jobs, execute them, then report success or failure. The scheduler handles timing, jitter, and rescheduling.
```
┌─────────────┐  GET /jobs/next     ┌───────────┐
│   Workers   │ ◄─────────────────► │  HTTP API │
│ (scrapers)  │  POST /jobs/*/      └─────┬─────┘
└─────────────┘  complete|error           │
                                    ┌─────▼─────┐
                                    │ Scheduler │
                                    │   (Go)    │
                                    └─────┬─────┘
                                    ┌─────▼─────┐
                                    │   Redis   │
                                    │  3 keys   │
                                    └───────────┘
```
A background goroutine polls Redis every poll_interval_s seconds and promotes due jobs from the schedule (sorted set) to the ready queue (list). Workers block on GET /jobs/next and receive jobs as they become available.
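The promotion step can be modeled in plain Go. The sketch below is an in-memory stand-in for what the promoter does against Redis (selecting due members of the `scraper:schedule` sorted set and pushing them onto the `scraper:ready` list); the function name `promoteDue` and the map-based model are illustrative, not the real implementation.

```go
package main

import (
	"fmt"
	"sort"
)

// promoteDue models one promoter tick: job IDs whose next-run timestamp
// is <= now move from the schedule (sorted set) to the ready queue
// (list), capped at batch entries per tick.
func promoteDue(schedule map[string]int64, now int64, batch int) []string {
	type entry struct {
		id    string
		score int64
	}
	var due []entry
	for id, ts := range schedule {
		if ts <= now {
			due = append(due, entry{id, ts})
		}
	}
	// Sorted-set ordering: earliest next-run time first.
	sort.Slice(due, func(i, j int) bool { return due[i].score < due[j].score })
	if len(due) > batch {
		due = due[:batch]
	}
	ready := make([]string, 0, len(due))
	for _, e := range due {
		delete(schedule, e.id) // a promoted job leaves the schedule
		ready = append(ready, e.id)
	}
	return ready
}

func main() {
	schedule := map[string]int64{"a": 100, "b": 200, "c": 300}
	fmt.Println(promoteDue(schedule, 250, 100)) // [a b]
}
```

The batch cap corresponds to `promote_batch_size` in the configuration.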
- Go 1.22+
- Redis 6+
```
go build -o scheduler .
```

Configuration is loaded from `config.yaml` in the current directory or `$HOME/.scheduler/config.yaml`. Every value can be overridden with an environment variable using the `SCHED_` prefix (dots become underscores).
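The override naming rule can be expressed as a one-liner; `envKey` is an illustrative helper, not a function from the codebase:

```go
package main

import (
	"fmt"
	"strings"
)

// envKey maps a dotted config path to its override variable:
// SCHED_ prefix, dots become underscores, upper-cased.
func envKey(path string) string {
	return "SCHED_" + strings.ToUpper(strings.ReplaceAll(path, ".", "_"))
}

func main() {
	fmt.Println(envKey("redis.addr"))             // SCHED_REDIS_ADDR
	fmt.Println(envKey("scheduler.max_jitter_s")) // SCHED_SCHEDULER_MAX_JITTER_S
}
```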
```yaml
server:
  addr: ":8081"
  read_timeout_s: 5
  write_timeout_s: 35     # must exceed scheduler.ready_pop_timeout_s

redis:
  addr: "localhost:6379"
  password: ""
  db: 0

scheduler:
  poll_interval_s: 5      # how often the promoter checks for due jobs
  max_jitter_s: 30        # random offset added to every scheduled time
  default_interval_s: 300
  ready_pop_timeout_s: 30 # how long GET /jobs/next blocks
  promote_batch_size: 100 # max jobs promoted per poll tick

logging:
  format: "text"
  log_poll_stats: true    # log when jobs are promoted
```

Environment variable examples:
```
SCHED_REDIS_ADDR=redis:6379
SCHED_SERVER_ADDR=:9000
SCHED_SCHEDULER_MAX_JITTER_S=60
```

Run the server:

```
./scheduler serve
```

Add a single job:
```
./scheduler job add --url https://example.com/feed --interval 300
./scheduler job add --url https://example.com/feed --interval 60 --meta region=us --meta priority=high
```

Import jobs from a JSON file:
```
./scheduler job import --file jobs.json
```

`jobs.json` format:
```json
[
  { "url": "https://example.com/a", "interval_s": 300 },
  { "url": "https://example.com/b", "interval_s": 60, "meta": { "region": "eu" } }
]
```

Bulk imports stagger first-run times evenly across the interval to avoid a thundering herd.
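One plausible reading of "stagger first-run times evenly across the interval" is to give job *i* of *n* a first-run offset of `i * interval / n`. A sketch under that assumption (`staggerOffsets` is illustrative, not the real function):

```go
package main

import "fmt"

// staggerOffsets spreads n first runs evenly across one interval.
// Job i would first run at now + offsets[i], so no two imported jobs
// fire at the same instant.
func staggerOffsets(n int, intervalS int64) []int64 {
	offsets := make([]int64, n)
	for i := 0; i < n; i++ {
		offsets[i] = int64(i) * intervalS / int64(n)
	}
	return offsets
}

func main() {
	fmt.Println(staggerOffsets(4, 300)) // [0 75 150 225]
}
```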
List jobs:

```
./scheduler job list
./scheduler job list --limit 100
```

Delete a job:

```
./scheduler job delete a1b2c3d4e5f6
```

`GET /jobs/next` long-polls for the next ready job, blocking for up to `ready_pop_timeout_s` seconds.
| Status | Meaning |
|---|---|
| `200 OK` | Job returned as JSON |
| `204 No Content` | No job became ready within the timeout |
| `503 Service Unavailable` | Server is shutting down |
Response body (200):

```json
{
  "id": "a1b2c3d4e5f6",
  "url": "https://example.com/feed",
  "interval_s": 300,
  "meta": { "region": "us" }
}
```

`POST /jobs/{id}/complete` reports a successful execution. The job is rescheduled at `now + interval + jitter`.
Request body:
```json
{
  "scheduled_at": 1710423000,
  "picked_up_at": 1710423005
}
```

Both fields are Unix-second timestamps and are optional (use 0 if unavailable). They are used for telemetry only: the scheduler logs `duration_ms` and `wait_ms`.
Response: 204 No Content
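The rescheduling formula `now + interval + jitter` can be sketched as below. The assumption that jitter is drawn uniformly from `[0, max_jitter_s]` is mine; the config only says it is a random offset bounded by `max_jitter_s`.

```go
package main

import (
	"fmt"
	"math/rand"
)

// nextRun computes the next scheduled run: now + interval plus a
// random jitter in [0, maxJitterS] seconds (all values Unix seconds).
func nextRun(now, intervalS, maxJitterS int64) int64 {
	jitter := rand.Int63n(maxJitterS + 1)
	return now + intervalS + jitter
}

func main() {
	next := nextRun(1710423000, 300, 30)
	fmt.Println(next >= 1710423300 && next <= 1710423330) // true
}
```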
`POST /jobs/{id}/error` reports a failed execution. The job is rescheduled using the same formula as `/complete`.
Request body:
```json
{
  "error": "connection timeout",
  "scheduled_at": 1710423000,
  "picked_up_at": 1710423005
}
```

Response: `204 No Content`
Both POST endpoints reject bodies larger than 4 KB.
A job's ID is derived deterministically from its URL (truncated SHA-1, 12 hex characters). Importing the same URL twice is safe — ZADD NX ensures the schedule entry is not overwritten.
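The derivation described above can be sketched as follows; whether the real code hashes the raw URL bytes exactly like this is an assumption, and `jobID` is an illustrative name:

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
)

// jobID derives a deterministic job ID from a URL: hex-encoded SHA-1
// of the URL, truncated to 12 characters.
func jobID(url string) string {
	sum := sha1.Sum([]byte(url))
	return hex.EncodeToString(sum[:])[:12]
}

func main() {
	id := jobID("https://example.com/feed")
	fmt.Println(len(id))                                 // 12
	fmt.Println(id == jobID("https://example.com/feed")) // true
}
```

Because the ID is a pure function of the URL, re-importing a URL maps to the same schedule entry, which is what makes `ZADD NX` a safe idempotent upsert.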
| Key | Type | Contents |
|---|---|---|
| `scraper:jobs` | Hash | `job_id` → JSON (all job definitions) |
| `scraper:schedule` | Sorted Set | `job_id`, scored by next-run Unix timestamp |
| `scraper:ready` | List | Job IDs ready for immediate execution |
Run the tests:

```
go test ./...
```

Structured logs are emitted via `log/slog`. Key log events:
| Event | Level | Fields |
|---|---|---|
| `job_complete` | INFO | `job_id`, `duration_ms`, `wait_ms`, `next_run_at` |
| `job_error` | WARN | `job_id`, `error`, `duration_ms`, `wait_ms`, `will_retry_at` |
| `poll_stats` | INFO | `promoted_count`, `poll_duration_ms` (only when jobs are promoted) |
| `server starting` | INFO | `addr` |
| `promoter error` | ERROR | `error` |
- The ready queue can grow without bound if workers stop consuming jobs.