Skip to content

fix(providers): harden postgres major upgrades#151

Open
bherila wants to merge 1 commit into
gotempsh:mainfrom
bherila:fix/postgres-upgrade-handoff
Open

fix(providers): harden postgres major upgrades#151
bherila wants to merge 1 commit into
gotempsh:mainfrom
bherila:fix/postgres-upgrade-handoff

Conversation

@bherila

@bherila bherila commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

Summary

Extracts the Postgres major-upgrade hardening that was originally discovered while validating MariaDB PITR E2E coverage in #138.

This PR keeps the scope to Postgres upgrade correctness only:

  • seed real s3_sources and backups rows in the Postgres upgrade test fixture so postgres_major_upgrades.pre_upgrade_backup_id satisfies the real FK constraint;
  • wait for a usable Postgres database with psql SELECT 1, not just Docker health / pg_isready, and create the configured DB if entrypoint initialization has not finished that race yet;
  • require snapshot and rollback handoffs to remove stale live volumes before recreating them, so pg18 containers cannot reopen pg17-era PGDATA after a supposedly clean copy;
  • make volume copy sidecars explicit, wait for their removal, and fail with context when Docker cannot release them.

Why this is required

The pg17 to pg18 upgrade path depends on a clean data-volume handoff. During CI validation, the old behavior could leave the live volume or copy sidecar around and then start the target pg18 container against stale pg17 data. That produces misleading readiness states and can make an upgrade appear to advance while the target database is not actually usable.

The fixture update is also required because the tests now run against the real schema: the fake backup id used by the upgrade tests violated the backups FK once the orchestrator persisted it to postgres_major_upgrades.pre_upgrade_backup_id.

Together these changes make the Postgres upgrade tests exercise the real control-plane constraints and make the runtime upgrade path fail early instead of silently reusing incompatible data.

Verification

Local:

  • cargo fmt --all -- --check
  • cargo check -p temps-providers --features docker-tests --lib
  • cargo test -p temps-providers --features docker-tests --lib orchestrator_phase_new_container_completes_under_timeout -- --nocapture --test-threads=1 (passes locally; Docker-dependent body skipped because local Docker is unavailable)

I also kicked off the fork GitHub Actions workflow for this branch and will attach the run once it completes.

@dviejokfs

Copy link
Copy Markdown
Contributor

could you handle the conflicts @bherila ? I merged the other PR #152 and now CHANGELOG.md has conflicts

@dviejokfs

Copy link
Copy Markdown
Contributor

Thanks for this @bherila — the volume-handoff hardening is the right direction, and these are genuinely dangerous paths (a bad handoff = pg18 opening stale pg17 PGDATA), so the "fail early instead of silently reusing incompatible data" framing is exactly right. The auto_remove: true → false + explicit remove/poll/exit-code check on the copy sidecar, and remove_volume_best_effort → remove_volume_or_fail, both look correct. I confirmed wait_for_container_healthy is safe here because the container defines a HEALTHCHECK programmatically (postgres_lifecycle.rs:242, pg_isready with a 30s start_period), so state.health will be populated and it won't false-timeout.

A few items to address before merge (CHANGELOG conflict aside):

1. Route the new exec through exec_util::run_exec instead of the new exec_in_container helper.
crates/temps-providers/src/externalsvc/exec_util.rs already exists specifically to make docker exec resilient, and its module docs call out the exact failure modes the new helper is exposed to:

  • drain-or-stall on stdout backpressure (the new code does drain — good), and
  • inspect_exec.running/exit-code transiently returning None, which the naive inspect_exec read in exec_in_container doesn't tolerate.

The code being replaced was already a bespoke inline loop, so this isn't introducing the first duplication — but consolidating onto run_exec (it takes cmd: Vec<String> + env + a timeout and returns ExecResult { exit_code, output }) means a single fix benefits every engine, and avoids re-treading the bugs exec_util was written for.

2. Copy-sidecar leak on the wait_container error path.
Now that the sidecar is auto_remove: false, the explicit remove_container only runs after wait_container().await?. If wait_container itself errors, we return early and leak a stopped container. Impact is cosmetic (stopped container, not data; the next run force-removes same-named containers), but it'd be cleaner to remove the sidecar on that error path too.

3. Real pg17→pg18 validation before merge (the important one).
The behavioral changes — remove_volume_or_fail hard-fail, the copy exit-code check, the healthcheck+psql SELECT 1+ensure-database readiness — are all Docker-dependent integration paths. The postgres-upgrades check is green, but on a fork PR these heavy bodies may skip rather than actually exercise the new handoff logic, so the riskiest paths aren't proven by the CI run. Given this is data-loss-class code, it'd be good to see one real pg17→pg18 upgrade against a DB with actual data observed green before merging. I know that's the same constraint that came up on #138 (no large instance to test against) — if you can't, I (or another maintainer) can run it on a seeded instance; just flag it.

Test changes themselves look good — swapping the Ok(42) stub for real s3_sources/backups rows and asserting against the persisted id is a strengthening, not a loosening. 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants