Problem statement
The artifact that users run as Home Assistant is the Home Assistant Core image, with the frontend bundled in. The Core image is built and published on every release. The same image is pulled directly by Home Assistant Container users and is run by the Supervisor on Home Assistant Operating System and Supervised installations. One Core artifact, multiple install types consuming it.
Releases follow a public schedule. A new major release ships on the first Wednesday of each month, with the last week of the cycle focused on beta testing as documented in the release FAQ. Patch releases ship on a weekly cadence, typically on Fridays, between majors. The release tags confirm this pattern: 2026.5.0 published Wednesday May 6, preceded by 2026.5.0b0 on Wednesday April 29 (the one-week beta), and the previous cycle's patches at 2026.4.1 (Fri April 3), 2026.4.2 (April 11), 2026.4.3 (Fri April 17), 2026.4.4 (Fri April 24). Patches do not go through a beta period.
The release pipeline does a lot of testing on the source side. home-assistant/core .github/workflows/ runs unit tests, type checks, lint, license audits, and CodeQL on every PR and push. The frontend has its own CI. None of this exercises the built artifact: the Home Assistant image that actually ships. Once the image is built, it is pushed to ghcr.io/home-assistant/homeassistant, mirrored to Docker Hub, and tagged latest and stable, without ever being booted and exercised end-to-end in a real browser.
The closest pre-existing artifact-level check lives outside Core, in home-assistant/operating-system tests/smoke_test/. That suite boots the OS image in QEMU and asserts OS-layer health: container startup, RAUC bootloader state, systemd units, network connectivity, swap behavior. The single check that touches the Core layer is test_landing_page curling localhost:8123 and asserting the response contains </html>. That confirms an HTTP server is up. It does not confirm the frontend bundle loaded, that onboarding can complete, that login works, or that the WebSocket and REST APIs respond correctly. The OS smoke test is valuable for what it covers, but it is OS-layer coverage, not Core-layer coverage. Inspiration, not substitute.
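To make concrete how little that existing check asserts, here is a minimal Python sketch (illustrative, not the actual OS test code) of the assertion test_landing_page performs:

```python
# Sketch of the assertion the OS smoke test makes today (illustrative,
# not the actual test code). Any HTML response at all satisfies it.

def landing_page_check(body: str) -> bool:
    """The current OS-level assertion: the response contains </html>."""
    return "</html>" in body

# A server-side error page passes exactly as a working frontend would,
# which is why this check cannot catch a broken frontend bundle:
error_page = "<html><body>500 Internal Server Error</body></html>"
assert landing_page_check(error_page)
```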
- The Core artifact is built but never exercised. The Docker image we ship every release is published without anyone (or anything) ever booting it and walking through the user-visible flows. Source-side tests catch source bugs. They cannot catch packaging, asset bundling, frontend build, or runtime configuration issues that appear only in the built image.
- Patch releases ship weekly without a beta period. Major releases get caught by beta testers. Patch releases do not. A regression introduced in 2026.5.1 reaches every up-to-date instance the moment the Friday tag is published.
- The first user to hit a regression is a real user. Today, the canary for an artifact-level regression is the first user who upgrades after the tag lands. That is the wrong place for the canary.
This pattern is not new. architecture #262 (July 2019) is the canonical record of a release-time regression that escaped because the platform unit tests did not run on the implementation. The fix at the time was a stricter pylint rule. The structural problem, that source-side checks alone do not guarantee the shipped artifact works, is still with us.
This opportunity is also the blocking prerequisite for Apply patch updates automatically, the sibling opportunity that proposes turning patch updates into a default-on, hands-off behavior for new installations. We cannot ship that bet without first guaranteeing that the Core artifact we publish boots and works.
A new user who hits a broken onboarding flow on a fresh install, or an existing user whose Friday patch upgrade silently breaks their setup, has no context for what went wrong, no way to roll back, and no patience to dig in. Closing that gap maps to Make Home Assistant More Approachable.
Community signals
Internal architectural signals on release-time regressions and update flow exist in the home-assistant/architecture repo:
- architecture #262, "Checks for non-implemented methods" (July 2019). Documents a release-time regression in the climate component that escaped because unit tests did not run on the implementation. The exact failure mode this opportunity is designed to catch.
- architecture #202, "Decouple integration releases from core" (April 2019). Adjacent prior art on rollback-on-failure update flows.
The artifact-level evidence lives in the workflows themselves:
- home-assistant/core .github/workflows/. The Core Docker image is built by builder.yml, signed, and published. There is no workflow that boots the resulting image and tests it.
- home-assistant/operating-system tests/smoke_test/test_basic.py. Useful prior art for booting an artifact in CI and running pytest against it. The OS team has solved the harness problem; the Core team has not adopted the equivalent for the Core image.
External signals (forum threads where users report regressions slipping through the Friday patch tag, and Month of WTH posts asking for stronger pre-release verification) are real but not yet sampled here. Two or three representative annotated links should be added before submission.
Scope & Boundaries
In scope
- A test layer that boots the freshly-built Home Assistant Core Docker image in CI and drives it end-to-end with a real browser. The artifact under test is the Core image, not the OS image and not the install types.
- First-run coverage. Onboarding, account creation, initial sign-in, default dashboard render. Onboarding is the highest-leverage flow because a failure there has no user workaround.
- Core navigation hotspots. Settings, Devices and services, Automations and scenes. The shortlist every user touches in their first session.
- Release and nightly CI integration. The new test layer runs against every release candidate, every patch tag, and every nightly build. A failure on a release candidate or patch tag blocks publication of the Core image to ghcr.io and Docker Hub.
- Coverage of both REST and WebSocket API code paths. A frontend smoke test that drives the UI exercises both naturally; explicit assertions on the API surface are welcome where they sharpen failure attribution.
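Where explicit API assertions are wanted, a minimal REST liveness check could look like the sketch below. The /api/ response shape follows the documented Home Assistant REST API; the token handling and the injectable fetch parameter are illustrative choices, not a proposed implementation.

```python
# Hedged sketch: an explicit REST-surface assertion to sharpen failure
# attribution alongside the browser-driven test. The injectable `fetch`
# parameter is an illustrative seam, not part of any real harness.
import json
import urllib.request

def _fetch_json(url: str, headers: dict) -> dict:
    req = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)

def check_rest_alive(base="http://localhost:8123", token="TEST_TOKEN",
                     fetch=_fetch_json) -> dict:
    """GET /api/ answers {"message": "API running."} on a healthy instance."""
    payload = fetch(f"{base}/api/", {"Authorization": f"Bearer {token}"})
    assert payload.get("message") == "API running.", payload
    return payload
```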
Not in scope
- Replacing the existing source-side unit test suites. Coverage in core and frontend stays where it is. This is additive.
- Replacing or absorbing the HAOS QEMU smoke test. That suite covers OS-layer health and stays where it is. The Core test layer is independent of it.
- Per-install-type variation. The Core image is the same artifact in Container, OS, Supervised, and (via PyPI) Core install types. We test the Core image once. Install-type-specific concerns (boot, networking, supervisor orchestration, OS upgrade) are different bets.
- Deep per-integration testing. Verifying that every integration onboards and works against a live device is a much larger bet.
- Performance benchmarking and long-running stability. Useful, but a separate workstream.
- App and HACS coverage. Out of scope for the core release pipeline.
- Sibling bet: Apply patch updates automatically. This bet is the prerequisite, not the implementation.
Foreseen solution
Phase 1: Harness and minimum viable suite. Adopt Playwright as the browser-automation toolchain. It is the natural fit: cross-browser, well-maintained, with strong debug tooling and good Python and TypeScript bindings. Build the workflow that pulls the freshly-built Core image, runs it, and drives a Playwright browser through the first-run flow: onboarding, account creation, login, default dashboard render. The HAOS smoke test is useful prior art for the boot-and-pytest pattern; the Core layer is browser-driven on top of docker run, not QEMU.
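A minimal sketch of the boot-and-wait half of that harness, before Playwright takes over. The image tag, timeout values, and the injectable probe are illustrative assumptions, not the real workflow's values:

```python
# Hypothetical harness sketch: boot the freshly built Core image and wait
# for the HTTP server before handing off to the browser-driven suite.
# Image name, port, and timeouts are illustrative.
import subprocess
import time
import urllib.error
import urllib.request

def start_core_container(image="ghcr.io/home-assistant/homeassistant:dev"):
    """Run the Core image detached, exposing the default web port."""
    result = subprocess.run(
        ["docker", "run", "-d", "--rm", "-p", "8123:8123", image],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()  # container id

def wait_for_ready(url="http://localhost:8123", timeout=120.0, probe=None):
    """Poll until the server answers, or fail after `timeout` seconds."""
    if probe is None:
        def probe(u):
            try:
                return urllib.request.urlopen(u, timeout=5).status == 200
            except (urllib.error.URLError, OSError):
                return False
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if probe(url):
            return True
        time.sleep(2)
    raise TimeoutError(f"Core image did not answer at {url} within {timeout}s")
```

Once wait_for_ready returns, the Playwright script would take the same URL and walk onboarding, account creation, login, and the dashboard render.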
Phase 2: Hotspot expansion. Extend coverage past first-run into Settings, the Integrations list, and automation creation. Selection should be informed by which areas appear most often in regression reports against home-assistant/core and the related repos. Keep the suite tight; this is a smoke test, not a full regression suite.
Phase 3: Release CI integration as a blocking gate. Wire the test layer into core/builder.yml (or a sibling workflow) so that a failure prevents publication of the image to the public registries. This is the step that turns the test layer from a useful side-channel into a guarantee.
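The gate itself would be expressed in the workflow YAML, but the control flow it encodes is simple. A hedged Python sketch, with placeholder command names:

```python
# Illustrative control flow for the blocking gate. In practice this logic
# lives in the CI workflow; the commands here are placeholders.
import subprocess
import sys

def gate_publish(test_cmd, publish_cmd, run=subprocess.run) -> bool:
    """Run the artifact smoke suite; publish the image only on success."""
    if run(test_cmd).returncode != 0:
        print("smoke suite failed; image not published", file=sys.stderr)
        return False
    run(publish_cmd)
    return True
```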
Phases 1 and 2 are sequential. Phase 3 should land alongside or shortly after Phase 1 to avoid building coverage that nobody is gated on.
Risks & open questions
- Run time. Booting the Core image and driving a browser through onboarding adds wall-clock time to release CI on every release candidate, every Friday patch tag, and every nightly build. CI cost is not the concern. Run time is. The gating question is how long the new layer takes and whether that delay is acceptable on the critical path to publication.
- Flakiness budget. Browser-driven tests are notoriously flaky. A flaky blocking gate undermines trust. We need a clear policy: retry behavior, quarantine for unstable tests, ownership of fixing flakes fast.
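One shape such a policy could take (illustrative names, not an adopted mechanism): retries are granted only to tests explicitly quarantined as flaky, so a retry budget never silently papers over a new regression.

```python
# Hypothetical flake-policy sketch: only explicitly quarantined tests get
# a retry; everything else fails on the first attempt, so new regressions
# are never masked by blanket retries. Names are illustrative.
def run_with_flake_policy(test, name, quarantined, retries=1):
    """Run `test`; allow `retries` extra attempts only if quarantined."""
    attempts = 1 + (retries if name in quarantined else 0)
    last_err = None
    for _ in range(attempts):
        try:
            test()
            return True
        except AssertionError as err:
            last_err = err
    raise last_err
```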
- Hotspot selection methodology. Which flows count as "the minimum"? A defensible list needs input from frontend, core, and the patterns in our own bug tracker. Risk: scope creep turns the smoke test into a full regression suite.
- Coverage illusion. A passing E2E test says "this image boots and a user can sign in and see a dashboard". It does not say "this image is correct". We need to communicate that distinction so the new gate does not crowd out other quality investments.
- Maintenance ownership. Who owns the suite when a flow changes? Frontend team, release engineering, or a shared rotation? Without clear ownership, the suite rots.
- Open question: PR-level coverage. Should the layer also run on PR merges into dev, in addition to release candidates, patch tags, and nightly builds? PR-level catches regressions earliest but multiplies the run time. Nightly plus release-candidate-and-patch-tag is the minimum viable answer.
Appetite
Medium. Phase 1 is the first time we have a browser-driven harness against the Core image, so there is real plumbing to build. Phase 2 is incremental. Phase 3 is integration work that is small in code but high in risk if the gate is unreliable. Together, roughly one to two cycles of focused work spanning release engineering, core, and frontend. Ongoing maintenance and expansion is a continuing investment that should sit with a clearly-named owner.
Execution issues
To be populated once a bet is approved.
Decision log
| Date | Decision | Outcome |
| --- | --- | --- |
| 2026-05-07 | Created opportunity for betting table consideration. | Initial creation |