This guide covers how to run integration tests that create/delete accounts and seed “real-world” data at meaningful scale—while still being safe for the backend and useful for finding latency, contention, and scalability bugs that lightweight tests miss.
Goal: move from “does it work?” to “does it work under realistic pressure?”
Many integration tests validate correctness with minimal data:
- a single user
- a handful of entities
- no concurrency
- happy-path timings
That’s great for correctness—but it often fails to uncover:
- N+1 queries and expensive joins that only show up with large tables
- index bloat / missing indexes
- queue backlogs and async worker starvation
- lock contention (row/table locks, distributed locks)
- cache stampedes and thundering herds
- rate limiting / circuit breaking behaviors
- slow downstream dependencies (email, payments, search indexing)
- p99/p999 latency spikes that are invisible at small scale
“Induction” (“if it works for 10 rows it will work for 10M”) is often wrong because cost curves are rarely linear.
You generally should not run heavy-load integration tests against production. Even read-heavy tests can trigger:
- cache churn
- autoscaling events and cost spikes
- noisy-neighbor effects
- false alarms in monitoring
- data pollution
Instead, design a pipeline that isolates production from the load while still representing production reality.
The safest default is a separate environment:
- Staging/Pre-prod with production-like infrastructure sizing
- Shadow environment (sometimes called “perf” or “soak”) used only for load/soak testing
- Ephemeral environments for each CI run (more expensive, but very clean)
Best practices
- Ensure configuration matches production in all the ways that matter:
- DB engine/version, indexes, connection pool sizes
- queue/worker sizing
- caches and TTLs
- rate limits
- search index settings
- feature flags
- Use production-like anonymized datasets (details below).
Maintain a pre-seeded dataset in your test environment:
- A curated set of accounts and data relationships
- “Small/Medium/Large” variants
- Reset via DB snapshot restore, or schema re-creation plus data restore
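A minimal sketch of the restore step, assuming a Postgres backend, `pg_restore` on the PATH, and pre-built dump files (paths and environment variable names here are illustrative):

```python
import os
import subprocess

import pytest

# Illustrative locations for the Small/Medium/Large dataset dumps.
SNAPSHOTS = {
    "small": "/snapshots/seed_small.dump",
    "medium": "/snapshots/seed_medium.dump",
    "large": "/snapshots/seed_large.dump",
}


@pytest.fixture(scope="session")
def seeded_database():
    """Restore the chosen pre-seeded dataset once per test session."""
    dump = SNAPSHOTS[os.environ.get("SEED_SIZE", "medium")]
    subprocess.run(
        [
            "pg_restore",
            "--clean",        # drop existing objects before restoring
            "--if-exists",
            "--no-owner",
            "-d", os.environ["TEST_DATABASE_URL"],
            dump,
        ],
        check=True,
    )
    yield
```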
Pros
- Very fast test setup
- Highly repeatable
- Great for regression tracking
Cons
- Requires snapshot tooling
- Must keep dataset current with schema changes
When to use
- Daily CI regressions
- p95/p99 latency trend monitoring
- reproducible bug hunts
Generate realistic data during test setup:
- seeded RNG for determinism
- schema-aware factories
- relationship graphs that match real usage
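A minimal sketch of a seeded factory, using hypothetical `Account`/`Project` models; the key point is that the same seed reproduces the same dataset, so a failing run can be replayed exactly:

```python
import random
import uuid
from dataclasses import dataclass, field


@dataclass
class Project:
    id: str
    name: str
    item_count: int


@dataclass
class Account:
    id: str
    email: str
    projects: list[Project] = field(default_factory=list)


def make_account(rng: random.Random) -> Account:
    """Generate one account with a plausible project/item graph."""
    account_id = str(uuid.UUID(int=rng.getrandbits(128)))
    projects = [
        Project(
            id=str(uuid.UUID(int=rng.getrandbits(128))),
            name=f"project-{i}",
            item_count=rng.randint(10, 500),
        )
        for i in range(rng.randint(1, 10))
    ]
    return Account(
        id=account_id,
        email=f"user-{account_id[:8]}@example.test",
        projects=projects,
    )


def seed_accounts(seed: int, count: int) -> list[Account]:
    """Same seed -> identical dataset, which keeps failures reproducible."""
    rng = random.Random(seed)
    return [make_account(rng) for _ in range(count)]
```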
Pros
- Can model new scenarios quickly
- Can scale data size gradually
Cons
- Can be slow to generate at high volumes unless optimized
- Needs careful realism (distributions, relationships)
When to use
- New feature testing
- Exploring what-if scaling behaviors
Export production data, anonymize it, and import it into the test environment.
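As a rough sketch (assuming a CSV export and illustrative column names), the anonymization pass replaces direct identifiers with stable pseudonyms and drops free text, which is typically the riskiest field:

```python
import csv
import hashlib


def pseudonymize_email(email: str) -> str:
    """Replace a real address with a stable pseudonym (add a secret salt in practice)."""
    digest = hashlib.sha256(email.lower().encode()).hexdigest()[:12]
    return f"user-{digest}@example.test"


def anonymize_accounts(src_path: str, dst_path: str) -> None:
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            row["email"] = pseudonymize_email(row["email"])
            row["display_name"] = "Redacted User"  # drop direct identifiers
            row["notes"] = ""                      # free text often hides PII
            writer.writerow(row)
```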
Pros
- Highest realism (shape of data, relationship density, edge cases)
Cons
- Requires careful privacy/security
- Needs ongoing data governance
When to use
- Perf/scaling validation for critical paths
- Query tuning + index verification
Instead of creating accounts and hard-deleting them, use:
- a `test_run_id` field / tag
- a tenant/namespace
- a dedicated “test organization” root entity
- soft-deletes with TTL-based cleanup
This avoids expensive delete cascades, reduces vacuum/compaction churn, and makes cleanup safer.
Anti-pattern
- Running large cascaded deletes in the same DB used by others
- Deleting rows with heavy foreign key graphs frequently
Better
- Partition test data by run ID and delete by partition
- Use time-based partitioning and drop partitions (super cheap)
Have a backend-supported cleanup mechanism:
- Any entity is created with `expires_at` or `ttl_seconds`
- A scheduled cleanup job removes expired test data
- Rate limited + chunked deletes (safe, backpressure-aware)
This lets tests create data at realistic volumes without fear of long-term DB growth.
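A sketch of the chunked, rate-limited cleanup job; SQLite is used here only to keep the example self-contained, and the table/column names are illustrative:

```python
import sqlite3
import time


def cleanup_expired_test_data(conn, batch_size: int = 500, pause_s: float = 0.2) -> int:
    """Delete expired test rows in small chunks so the DB never sees one huge delete."""
    total = 0
    while True:
        cur = conn.execute(
            # Illustrative schema: rows carry an expires_at epoch timestamp.
            "DELETE FROM items WHERE rowid IN ("
            "  SELECT rowid FROM items WHERE expires_at < ? LIMIT ?)",
            (time.time(), batch_size),
        )
        conn.commit()
        if cur.rowcount == 0:
            break
        total += cur.rowcount
        time.sleep(pause_s)  # crude backpressure: give the DB room to breathe
    return total


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, expires_at REAL)")
    conn.executemany("INSERT INTO items (expires_at) VALUES (?)", [(0,)] * 1200)
    print(cleanup_expired_test_data(conn), "rows removed")
```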
Correctness integration tests
- small data
- run on every PR
- strict pass/fail
Load-style integration tests
- larger datasets, concurrency
- run on schedule (nightly) or per-release
- trend-based evaluation (latency budget regressions)
This prevents PR validation from turning into a denial-of-service attack on the backend.
If your backend supports it, add a controlled test profile:
- uses dedicated resources (DB/schema/cluster)
- can bypass external side effects:
- no emails/SMS
- no billing
- no irreversible third-party calls
- still executes core logic (auth, persistence, business rules)
Important: don’t make test mode “too fake”—you still want real query shapes and queue behavior.
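One way to express such a profile, with hypothetical setting names, is a single config object that turns off irreversible side effects but leaves persistence and queues real:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    send_emails: bool
    charge_cards: bool
    call_third_parties: bool
    database_url: str
    queue_namespace: str


PROFILES = {
    # Production keeps every side effect on.
    "production": Profile(True, True, True, "postgres://prod", "jobs"),
    # Test mode disables irreversible side effects but still uses real persistence
    # and queues, so query shapes and worker behavior stay representative.
    "load-test": Profile(False, False, False, "postgres://loadtest", "jobs-loadtest"),
}
```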
Your test runner should obey a budget:
- max requests/sec
- max concurrency
- max accounts created per minute
- max rows created per run
- max runtime
And it should adapt based on backend signals:
- HTTP 429/503
- queue depth
- p95 latency spikes
Technique: adaptive load
- Start at low concurrency
- Gradually ramp up (step function)
- Stop ramp if error rates or latency exceed thresholds
This is safer than blast-and-pray.
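A sketch of that adaptive ramp; `run_step` is an assumed stand-in for whatever drives your load scenario and reports latencies and errors back:

```python
import statistics
from typing import Callable

# run_step(concurrency) is assumed to drive the scenario at the given concurrency
# for one hold period and return (latencies_ms, error_count, request_count).
StepRunner = Callable[[int], tuple]


def ramp(run_step: StepRunner,
         steps=(1, 5, 20, 50),
         p95_budget_ms: float = 500.0,
         max_error_rate: float = 0.01) -> int:
    """Ramp concurrency step by step; return the last level that stayed within budget."""
    last_safe = 0
    for concurrency in steps:
        latencies, errors, requests = run_step(concurrency)
        p95 = statistics.quantiles(latencies, n=100)[94] if latencies else float("inf")
        error_rate = errors / max(requests, 1)
        if p95 > p95_budget_ms or error_rate > max_error_rate:
            print(f"stopping ramp at {concurrency}: p95={p95:.0f}ms err={error_rate:.2%}")
            break
        last_safe = concurrency
    return last_safe
```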
You can often find load-related bugs without billions of rows by ensuring you have:
- realistic relationship depth (e.g., 1 account → 10 projects → 500 items each)
- realistic distributions (many small accounts, few giant ones)
- realistic access patterns (read hot sets, write bursts)
Example
- 90% of accounts: 1–5 projects
- 9% of accounts: 6–50 projects
- 1% of accounts: 200–1000 projects (the “whales”)
Simulating “whales” is often where the bugs hide.
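A small sketch of sampling account sizes from that 90/9/1 split with a seeded RNG:

```python
import random


def sample_project_count(rng: random.Random) -> int:
    """Draw an account size from the 90/9/1 distribution described above."""
    roll = rng.random()
    if roll < 0.90:
        return rng.randint(1, 5)      # typical accounts
    if roll < 0.99:
        return rng.randint(6, 50)     # mid-size accounts
    return rng.randint(200, 1000)     # the "whales"


rng = random.Random(42)
sizes = [sample_project_count(rng) for _ in range(10_000)]
print(max(sizes), sum(s >= 200 for s in sizes))  # expect a handful of whales
```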
Measure how long it takes to load an account’s dashboard when:
- caches are cold
- related tables are large
- the user has “whale” scale
What to measure
- p50/p95/p99 response time
- DB query count
- bytes returned
- cache hit ratio
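A sketch of the measurement loop, assuming the `requests` library and hypothetical dashboard and cache-flush endpoints:

```python
import statistics
import time

import requests  # third-party HTTP client, assumed available

BASE = "https://staging.example.test"
WHALE_ACCOUNT_ID = "acct-whale-001"  # hypothetical pre-seeded whale account


def measure_dashboard(samples: int = 50) -> dict:
    # Hypothetical internal endpoint to force a cold-cache run.
    requests.post(f"{BASE}/internal/test/flushCache", timeout=10)
    latencies = []
    for _ in range(samples):
        start = time.perf_counter()
        resp = requests.get(f"{BASE}/accounts/{WHALE_ACCOUNT_ID}/dashboard", timeout=30)
        resp.raise_for_status()
        latencies.append((time.perf_counter() - start) * 1000)
    cuts = statistics.quantiles(latencies, n=100)
    return {"p50_ms": cuts[49], "p95_ms": cuts[94], "p99_ms": cuts[98]}
```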
Simulate multiple clients updating the same “hot” entity:
- shared project, shared playlist, shared cart
- concurrent writes and reads
This catches:
- transaction isolation issues
- race conditions
- lock contention
- retry storms
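A sketch of the hot-entity scenario using a thread pool to overlap writes; the endpoints and the final count check are hypothetical:

```python
from concurrent.futures import ThreadPoolExecutor

import requests  # assumed HTTP client

BASE = "https://staging.example.test"
PROJECT_ID = "proj-hot-001"  # hypothetical shared "hot" entity


def add_item(i: int) -> int:
    resp = requests.post(f"{BASE}/projects/{PROJECT_ID}/items",
                         json={"name": f"item-{i}"}, timeout=30)
    return resp.status_code


def test_concurrent_writes_do_not_lose_items():
    with ThreadPoolExecutor(max_workers=20) as pool:
        statuses = list(pool.map(add_item, range(200)))
    assert all(s in (200, 201) for s in statuses)
    # The final count should reflect every successful write: lost updates,
    # deadlock-driven 500s, and retry storms all show up here.
    count = requests.get(f"{BASE}/projects/{PROJECT_ID}/items/count",
                         timeout=30).json()["count"]
    assert count >= 200
```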
Send bursts that enqueue background work:
- notifications
- search indexing
- analytics pipelines
- media processing jobs
Validate:
- queue depth stabilizes
- workers keep up
- “eventual consistency” completes within SLA
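A sketch of the burst-then-poll pattern, with hypothetical event and search endpoints and an assumed 120-second SLA:

```python
import time

import requests  # assumed HTTP client

BASE = "https://staging.example.test"
SLA_SECONDS = 120


def test_burst_is_indexed_within_sla():
    run_id = f"burst-{int(time.time())}"
    # Fire a burst that fans out to async workers (indexing, notifications, ...).
    for i in range(500):
        requests.post(f"{BASE}/events", json={"run_id": run_id, "seq": i}, timeout=30)

    deadline = time.monotonic() + SLA_SECONDS
    while time.monotonic() < deadline:
        hits = requests.get(f"{BASE}/search", params={"q": run_id}, timeout=30).json()["total"]
        if hits >= 500:
            return  # eventual consistency completed within the SLA
        time.sleep(5)
    raise AssertionError(f"only {hits} of 500 events indexed within {SLA_SECONDS}s")
```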
Test endpoints that paginate deep into history with large datasets.
Common bugs:
- OFFSET pagination slowdowns
- missing composite indexes
- unbounded sorts
Prefer keyset pagination where possible.
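For contrast, a small sketch of both styles against an illustrative `events` table (SQLite syntax for self-containment):

```python
import sqlite3


def fetch_page_keyset(conn, after_id, page_size: int = 100):
    """Keyset pagination: roughly constant cost per page, even deep into history."""
    if after_id is None:
        sql = "SELECT id, created_at FROM events ORDER BY id LIMIT ?"
        params = (page_size,)
    else:
        # Seeks directly via the index on id; no rows are scanned and discarded.
        sql = "SELECT id, created_at FROM events WHERE id > ? ORDER BY id LIMIT ?"
        params = (after_id, page_size)
    return conn.execute(sql, params).fetchall()


def fetch_page_offset(conn, page: int, page_size: int = 100):
    """OFFSET pagination: the DB walks and throws away `page * page_size` rows each call."""
    return conn.execute(
        "SELECT id, created_at FROM events ORDER BY id LIMIT ? OFFSET ?",
        (page_size, page * page_size),
    ).fetchall()
```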
Instead of thousands of normal API calls, provide internal-only endpoints like:
- `POST /internal/test/seedAccount`
- `POST /internal/test/seedWhaleAccount`
- `POST /internal/test/cleanup?runId=...`
These endpoints can:
- insert in batches
- enforce budgets
- tag data with run IDs
- avoid expensive side effects
- record seed metadata for debugging
Security
- available only in non-prod
- requires service auth + IP allowlists
- audited
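A sketch of a harness client for the endpoints above; the auth header and environment variable names are assumptions:

```python
import os
import uuid

import requests  # assumed HTTP client

BASE = os.environ.get("TEST_ENV_URL", "https://staging.example.test")
HEADERS = {"Authorization": f"Bearer {os.environ.get('SEED_SERVICE_TOKEN', '')}"}


def seed_run(whales: int = 3, normal: int = 200) -> str:
    """Seed a mixed population and tag every entity with one run ID."""
    run_id = str(uuid.uuid4())
    for _ in range(normal):
        requests.post(f"{BASE}/internal/test/seedAccount",
                      json={"runId": run_id}, headers=HEADERS, timeout=60).raise_for_status()
    for _ in range(whales):
        requests.post(f"{BASE}/internal/test/seedWhaleAccount",
                      json={"runId": run_id}, headers=HEADERS, timeout=300).raise_for_status()
    return run_id


def cleanup_run(run_id: str) -> None:
    requests.post(f"{BASE}/internal/test/cleanup", params={"runId": run_id},
                  headers=HEADERS, timeout=600).raise_for_status()
```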
If internal seed endpoints are not possible, provide batch endpoints:
- `POST /items/batch`
- `POST /events/batch`
- bulk upsert patterns
This reduces request overhead and is closer to real throughput.
When you generate synthetic data, ensure:
- valid foreign keys
- valid state machines (draft → published → archived)
- realistic timestamp distributions
- realistic text lengths and blobs (don’t use “aaa” everywhere)
A “factory” system should allow:
- scale knobs (items per project, projects per account)
- distribution knobs (many small + few large)
- feature knobs (enable attachments, comments, tags)
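One way to express those knobs, with illustrative names and values, is a single profile object that CI jobs select by name:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SeedProfile:
    """All the factory knobs in one place so jobs can pick Small/Medium/Large by name."""
    accounts: int
    projects_p50: int
    projects_p99: int
    items_p50: int
    items_p95: int
    attachment_rate: float   # fraction of items with binary attachments
    comment_depth: int       # nesting depth of comment threads
    enable_tags: bool


PROFILES = {
    "small": SeedProfile(10, 3, 20, 50, 500, 0.0, 0, False),
    "medium": SeedProfile(100, 3, 100, 50, 2_000, 0.05, 2, True),
    "large": SeedProfile(1_000, 3, 500, 50, 5_000, 0.25, 20, True),
}
```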
- Separate DB cluster, or at least separate schema + resource quotas
- Separate cache namespace
- Separate queue topics
- When the backend is struggling, tests should fail fast (or ramp down), not amplify load.
Tests are useless if you can’t see why they’re slow.
You want:
- request tracing (trace IDs propagated)
- DB query timing, lock waits
- queue depth, worker utilization
- cache hit rate, eviction rate
- p95/p99 latency charts by endpoint
Track per scenario:
- success rate
- error types (4xx vs 5xx)
- p50/p95/p99 latency
- throughput (RPS)
- DB time per request
- queue time to completion (for async flows)
Instead of strict single-value assertions, use budgets like:
- p95 latency must be within X% of last baseline
- error rate must be < Y%
- DB queries per request must not exceed N
Store baselines per release tag to compare across versions.
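A sketch of a budget check against a stored baseline file; the metric keys and file layout are assumptions:

```python
import json


def check_against_baseline(metrics: dict, baseline_path: str,
                           p95_slack: float = 0.10,
                           max_error_rate: float = 0.01) -> list:
    """Return a list of budget violations instead of asserting one exact number."""
    with open(baseline_path) as f:
        baseline = json.load(f)  # e.g. baselines/v1.42.0.json

    failures = []
    for scenario, current in metrics.items():
        allowed_p95 = baseline[scenario]["p95_ms"] * (1 + p95_slack)
        if current["p95_ms"] > allowed_p95:
            failures.append(
                f"{scenario}: p95 {current['p95_ms']:.0f}ms exceeds {allowed_p95:.0f}ms budget")
        if current["error_rate"] > max_error_rate:
            failures.append(f"{scenario}: error rate {current['error_rate']:.2%} over budget")
        if current["db_queries_per_request"] > baseline[scenario]["db_queries_per_request"]:
            failures.append(f"{scenario}: DB query count regression")
    return failures
```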
- PR Gate: small integration tests, deterministic, fast
- Nightly: medium dataset, moderate concurrency
- Pre-release: large dataset + ramp + soak
- Post-deploy: canary “smoke + latency budget” checks
Tag and retain previous releases so you can:
- run the same scenarios against old versions
- compare performance and correctness
- bisect regressions
This is especially valuable when you suspect a backend change introduced a regression.
- Restore or select a known dataset (Small/Med/Large) OR seed synthetic data tagged with `test_run_id`.
- Warm up critical caches (optional and measured).
- Ramp concurrency: 1 → 5 → 20 → 50 (example) while observing 429/503/p95.
- Run scenario suite:
- whale dashboard hydrate
- deep pagination query
- burst writes + async completion
- concurrent updates to shared resources
- Record metrics + traces per scenario.
- Cleanup:
- prefer partition drop or TTL cleanup
- otherwise chunked deletes by `test_run_id`
- Produce a report:
- deltas vs baseline
- top slow endpoints
- top slow DB queries
- queue backlog behavior
- Too much load in PR CI → creates flaky tests and angry backend teams
- Unrealistic data → green tests, real users still suffer
- No observability → you detect slowness but can’t diagnose it
- Hard deletes with cascades → test suite slowly destroys DB performance
- Ignoring p99 → p50 looks fine, users still complain
- No isolation → tests compete with other staging users and become meaningless
- Separate environment or isolated resources for load testing
- Data seeded with realistic distributions and relationships
- Test harness enforces rate and concurrency budgets
- Scenarios target latency tail, contention, and async processing
- Baselines per release + regression budgets
- Cleanup via TTL or partitions, not massive cascade deletes
- Deep observability: traces, DB metrics, queue metrics
- accounts: 10 / 100 / 1,000
- projects per account: distribution (p50=3, p95=20, p99=500)
- items per project: distribution (p50=50, p95=5,000)
- attachment rate: 0% / 5% / 25%
- comment depth: 0 / 2 / 20
- concurrency: 1 / 5 / 20 / 50
- ramp schedule: step size + hold time
- budget thresholds: p95, error rate, DB time
- “Correctness Integration Suite”
- “Performance Integration Suite”
- “Ramped Load Scenario Pack”
- “Soak + Tail Latency Pack”
Portions of drafting and editorial refinement in this repository were accelerated using large language models (including ChatGPT, Claude, and Gemini) under direct human design, validation, and final approval. All technical decisions, code, and architectural conclusions are authored and verified by the repository maintainer.