[gNOI] Add sonic.gnoi.oras.v1.Oras service (Pull RPC) + design doc#692
hdwhdw wants to merge 14 commits into
Draft RFC for a new sonic.gnoi.oras.v1.Oras service that lets an orchestrator instruct a switch to pull an OCI/ORAS artifact from a registry into local staging, decoupled from install. Tracks ADO Feature #37984064. Signed-off-by: Dawei Huang <daweihuang@microsoft.com>
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
SONiC's TransferToRemote actually performs a download (HTTP GET into local_path), not an upload as upstream openconfig defines. Update §1 to describe the real current state and enumerate the concrete limitations that block its use for ACR/ORAS. Signed-off-by: Dawei Huang <daweihuang@microsoft.com>
PoC subset of the ORAS Pull design (doc/oras-pull-design.md): a single streaming Pull RPC with anonymous + basic auth and an optional http_proxy field. List/Delete and richer features are deferred. Generated oras.pb.go is checked in following the existing precedent (proto/sonic.pb.go, proto/gnoi/sonic_debug.pb.go). Makefile wires the new binding into PROTO_GO_BINDINGS so make can regenerate it. Signed-off-by: Dawei Huang <daweihuang@microsoft.com>
Streaming server implementation of sonic.gnoi.oras.v1.Oras.Pull:
* Resolves manifest by tag or digest against an OCI registry.
* Requires single-layer artifacts (PoC scope per the design doc;
SONiC OS images are single layer).
* Stages the layer into a temp dir next to local_path and renames
into place on success, so a failed pull never leaves a partial
file at local_path.
* Emits PullStarted once the manifest is resolved, PullProgress at
most once per second, and a final PullResult with elapsed time
and per-layer digest.
* Reuses the file-server path allowlist (/tmp, /var/tmp, /host).
* Supports anonymous and basic auth (ACR admin user); workload
identity and bearer modes are stubbed out for v1.
* http_proxy field plumbed into the HTTP transport so testbeds
where the registry is not reachable via the default route (e.g.
sonic-vs vlabs behind a host tinyproxy) can still pull.
* Best-effort registry-error to gRPC status mapping.
Adds oras.land/oras-go/v2 v2.6.0 and github.com/opencontainers/image-spec
v1.1.1 to go.mod.
Signed-off-by: Dawei Huang <daweihuang@microsoft.com>
Add OrasServer wrapper and wire it into registerAllServices behind the existing EnableTranslibWrite || EnableNativeWrite gate, alongside the other gNOI services. Pull authenticates the caller and then delegates to pkg/gnoi/oras.HandlePull. Signed-off-by: Dawei Huang <daweihuang@microsoft.com>
Drop the EnableTranslibWrite/EnableNativeWrite gate for sonic.gnoi.oras.v1.Oras. The other services inside that gate are there because gNMI write paths (translib / native YANG) should not be exposed unless the operator opted in to writes. Oras Pull does not touch any YANG datastore — it writes only into an allowlisted staging area inside the gnmi container — so the gate is not meaningful here. Keep the service available on every build, mirroring the unconditional registration of system / factory_reset. Signed-off-by: Dawei Huang <daweihuang@microsoft.com>
Introduce handlePullWithRepo as a seam so tests can drive the pull loop
against an httptest-backed fake registry (PlainHTTP) without reaching
the real network. HandlePull keeps the same public signature, validates
the request, constructs the repository, then delegates.
New tests cover:
- validatePullRequest (12 cases)
- validateLocalPath (allowlist + traversal)
- pullReference (tag vs digest precedence)
- pickSingleLayer (0/1/2 layers, malformed JSON)
- mapRegistryError (401/host/timeout/404/ENOSPC/default)
- countingReader, copyAndRemove, jsonUnmarshalStrict
- newRepository wires basic-auth credentials + leaves Credential nil
for anonymous; rejects invalid registry refs
- HandlePull happy path and auth-failure path against a fake registry
Package coverage now 81.8%.
Signed-off-by: Dawei Huang <daweihuang@microsoft.com>
Signed-off-by: Dawei Huang <daweihuang@microsoft.com>
pkg/gnoi/oras has no CGO or SONiC dependencies, so its tests can run in the pure-test stage. Registering it here makes the pipeline pick up the unit tests and include them in the diff-coverage gate. Signed-off-by: Dawei Huang <daweihuang@microsoft.com>
Add tests for HandlePull wrapper (E2E + bad registry ref), multi-layer manifest rejection, MkdirTemp failure, blob fetch 500, Send error on PullStarted, and copyAndRemove dst-open error. Lifts statement coverage from 81.8% to 89.9% so the pipeline diff-coverage gate (>=80% lines) clears with comfortable margin. Signed-off-by: Dawei Huang <daweihuang@microsoft.com>
Pull request overview
Adds a new SONiC-specific gNOI service sonic.gnoi.oras.v1.Oras with a single server-streaming Pull RPC that fetches an OCI/ORAS artifact (e.g. a SONiC .bin image) from a registry into an allowlisted local staging directory (/tmp, /var/tmp, /host). Pull resolves a manifest, validates a single-layer artifact, streams the blob to disk with progress events, and reports the final layer/manifest digests. The service is registered unconditionally on the gNMI gRPC server (it does not touch YANG/Redis), and shares the existing gNMI authn hook. A companion design doc covers the longer-term shape (multi-layer, workload-identity auth, List/Delete).
Changes:
- New proto + generated bindings for `sonic.gnoi.oras.v1.Oras` (`PullRequest`, `AuthConfig`, `PullStarted`/`Progress`/`Result`).
- New `pkg/gnoi/oras` server implementation on top of `oras.land/oras-go/v2`, with path allowlisting and a streaming progress loop.
- gNMI server wiring: `OrasServer`, `Pull` shim, registration in `registerAllServices`; Makefile/pure.mk entries; new module deps (`oras-go/v2`, `image-spec`, `go-digest`).
Reviewed changes
Copilot reviewed 10 out of 12 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| proto/gnoi/oras/oras.proto | New service & message definitions for the Pull RPC. |
| proto/gnoi/oras/oras.pb.go | Generated bindings for the new proto. |
| pkg/gnoi/oras/oras.go | Server implementation: validation, repo construction, manifest fetch, layer copy, progress streaming. |
| pkg/gnoi/oras/json.go | Tiny jsonUnmarshalStrict wrapper used by manifest parsing. |
| pkg/gnoi/oras/oras_test.go | Unit tests covering validation, error mapping, and end-to-end pulls against an in-process fake registry. |
| gnmi_server/gnoi_oras.go | OrasServer.Pull shim that authenticates and delegates to pkg/gnoi/oras. |
| gnmi_server/server.go | Defines OrasServer, threads it through registerAllServices, and registers Oras unconditionally. |
| Makefile | Adds the new proto to the generated-bindings list. |
| pure.mk | Marks pkg/gnoi/oras as a pure-Go package. |
| go.mod / go.sum | Adds oras.land/oras-go/v2, opencontainers/image-spec, go-digest, and transitive deps. |
| doc/oras-pull-design.md | Design doc describing rationale, RPC shape, staging policy, auth tiers, and open questions. |
Files not reviewed (1)
- proto/gnoi/oras/oras.pb.go: Language not supported
- Serialize stream.Send through a safeStream mutex; close progressDone before final Send (and wait for the progress goroutine to exit) so the progress goroutine can never race with PullResult.
- Restrict os.Rename copy-and-delete fallback to EXDEV only; surface permission / target-is-dir / etc. errors as-is instead of silently masking them.
- Replace substring-based mapRegistryError with errors.As inspection of errcode.ErrorResponse, errdef.ErrNotFound, net.OpError/DNSError, net.Error.Timeout(), and syscall.ECONNREFUSED / ENOSPC. Adds PermissionDenied for 403.
- validateLocalPath: walk path components for literal '..' segments instead of strings.Contains, which over-rejected names like 'a..b'.
- Drop the dead '_ = oras.Copy' line and the oras-go top-level import.
- Rename jsonUnmarshalStrict -> parseManifest; comment matches behavior.
- proto: redact registry hostname example; move PoC-subset scope notes from the file-level comment (which protoc-gen-go places in front of the DO NOT EDIT marker) to the Oras service comment; regen pb.go.
- Tests: switch hard-coded /tmp/oras-test-<pid>.bin paths to a per-test MkdirTemp helper under /tmp; rename proto_clone -> protoClone; add TestIsCrossDeviceError; rework TestMapRegistryError to use typed errors.
Signed-off-by: Dawei Huang <daweihuang@microsoft.com>
http_proxy on the wire mixes ops policy into the RPC contract. Go's http.DefaultTransport already honors the standard HTTP_PROXY / HTTPS_PROXY / NO_PROXY env vars via http.ProxyFromEnvironment, and lab testbeds can inject those on the gnmi process (e.g. via /usr/bin/gnmi-native.sh). Production switches with a default route to the registry need no configuration. Also trims the design doc's PullRequest example: removes media_type_filter, source_address, source_vrf, skip_if_exists, expected_manifest_digest, all speculative v2+ knobs that don't belong in the first cut. Lists them in a 'deliberately deferred' section so the rationale is preserved. Regenerates oras.pb.go to drop field #7. Signed-off-by: Dawei Huang <daweihuang@microsoft.com>
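As a rough illustration of the environment-based proxy behavior this commit relies on, the sketch below builds a transport with the same proxy hook that http.DefaultTransport uses. The function name `newRegistryTransport` and the registry URL are placeholders, not identifiers from this PR.

```go
package main

import (
	"fmt"
	"net/http"
)

// newRegistryTransport (hypothetical name) returns a transport that picks
// its proxy from HTTP_PROXY / HTTPS_PROXY / NO_PROXY, the same hook that
// http.DefaultTransport is configured with.
func newRegistryTransport() *http.Transport {
	return &http.Transport{Proxy: http.ProxyFromEnvironment}
}

func main() {
	tr := newRegistryTransport()

	// Placeholder URL; a real pull would target the configured registry.
	req, err := http.NewRequest(http.MethodGet, "https://registry.example.com/v2/", nil)
	if err != nil {
		fmt.Println("build request:", err)
		return
	}

	// A nil URL means the request would go direct (no proxy configured).
	proxyURL, err := tr.Proxy(req)
	fmt.Println("proxy:", proxyURL, "err:", err)
}
```

Go reads these variables once per process and caches the result, so changing them requires restarting the process that performs the pull.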
New /tmp/, /var/tmp/, /host/ allowlist + /mnt/host bind-mount translation lives in pkg/hostfs. pkg/gnoi/oras switches over so its writes land on the host filesystem (sonic-installer reads from the host /tmp, not the container's tmpfs). internal/diskspace and pkg/gnoi/file still have their own private copies of this logic; migrating them is a follow-up to keep this change focused. Also picks up an existing go.mod entry (opencontainers/go-digest is used directly by the oras tests; promote it from indirect). Signed-off-by: Dawei Huang <daweihuang@microsoft.com>
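The allowlist plus bind-mount translation can be pictured with the sketch below. It is an assumption-laden illustration rather than the pkg/hostfs code: the function name `translate`, the use of `path.Clean`, and the rule of prefixing every allowed path with `/mnt/host` are all hypothetical. Only the `/var/tmp/...` to `/mnt/host/var/tmp/...` mapping is stated in this PR.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// hostRoot is where the host root is assumed to be bind-mounted inside
// the container (per the PR text); the rest of this file is hypothetical.
const hostRoot = "/mnt/host"

// allowedPrefixes mirrors the staging allowlist named in the PR.
var allowedPrefixes = []string{"/tmp/", "/var/tmp/", "/host/"}

// translate maps a caller-supplied local_path to the path the server
// would write to, or returns an error if it is outside the allowlist.
// Assumption: every allowed path is simply re-rooted under hostRoot.
func translate(localPath string) (string, error) {
	clean := path.Clean(localPath)
	if !path.IsAbs(clean) {
		return "", fmt.Errorf("path %q is not absolute", localPath)
	}
	for _, prefix := range allowedPrefixes {
		if strings.HasPrefix(clean, prefix) {
			return path.Join(hostRoot, clean), nil
		}
	}
	return "", fmt.Errorf("path %q is outside the staging allowlist", localPath)
}

func main() {
	for _, p := range []string{"/var/tmp/image.bin", "/etc/passwd"} {
		out, err := translate(p)
		fmt.Println(p, "->", out, err)
	}
}
```

Note that this sketch normalizes the path before checking it, whereas the commit above describes rejecting literal `..` segments outright; the two approaches are not equivalent.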
Documents how HandlePull drives oras-go through Resolve → Fetch → file.Store, where progress comes from (countingReader + 1s ticker), and why the staging dir is created next to the destination (same-fs rename). Captures hostfs.Translate as the container→host path seam. Signed-off-by: Dawei Huang <daweihuang@microsoft.com>
114a9e9 to 6153efd
Why I did it
Adds a new `sonic.gnoi.oras.v1.Oras` gNOI service so an orchestrator can instruct a switch to pull an OCI/ORAS artifact (e.g. a SONiC `.bin` image) from a registry into local staging, decoupled from install.

Existing gNOI surfaces don't fit:
- `gnoi.file.TransferToRemote` is upload (target → remote), wrong direction.
- `gnoi.system.SetPackage` + `RemoteDownload` only models a single URL + flat creds; can't express registry/repo/tag-or-digest, multi-layer manifests, or richer auth.
- `gnoi.containerz.Deploy` is client-streamed only.

Full rationale and alternatives are in the design doc shipped in this PR.
How I did it
- Design doc under `doc/` covering rationale, RPC shape, auth tiers, staging policy, and open questions.
- `proto/gnoi/oras/oras.proto` defining `sonic.gnoi.oras.v1.Oras` with a streaming `Pull` RPC.
- `pkg/gnoi/oras` implementing `Pull` on top of `oras.land/oras-go/v2`, with an allowlisted staging-path policy (`/tmp/`, `/var/tmp/`, `/host/`).
- Server wiring in `gnmi_server/server.go`. Registered unconditionally (not behind the translib/native-write gate) because `Pull` only writes into allowlisted staging dirs; it doesn't touch YANG datastores. This mirrors how `gnoi.system.System` and `gnoi.factory_reset` are already registered on this server.

Scope of this PoC PR (matches what is actually wired up in the proto and server):
- `PullRequest` fields: `registry`, `repository`, `tag` or `digest` (oneof), `local_path`, `auth`, `http_proxy`.
- `AuthConfig`: `anonymous` or `basic` only.
- `PullResponse` stream events: `PullStarted` (with manifest digest + total bytes), `PullProgress`, `PullResult` (with `manifest_digest`, `layer_digest`, `bytes_written`, `local_path`, `elapsed`).
- `InvalidArgument`.

Items called out in the design doc but not implemented in this PR: `List` / `Delete` RPCs, bearer / workload-identity auth, multi-layer artifacts. See the design doc for the full list of deliberately deferred knobs.
How to verify it
PoC verified end-to-end against a SONiC KVM (single-node `vlab`):

- Build the deb (`.azure/templates/build-deb.yml`).
- Install `sonic-gnmi_0.1_amd64.deb` into the target's `gnmi` container and restart it.
- If the registry is only reachable through a proxy, set the proxy environment variables in `/usr/bin/gnmi-native.sh` inside the gnmi container and restart. `http.DefaultTransport` honors these via `http.ProxyFromEnvironment`.
- Sample `Pull` response stream:

    {"started": {"manifestDigest": "sha256:7c7a2eb5...", "totalBytes": "790570990"}}
    {"progress": {"bytesTransferred": "146837", "totalBytes": "790570990"}}
    {"progress": {"bytesTransferred": "205290901", "totalBytes": "790570990"}}
    {"progress": {"bytesTransferred": "476921237", "totalBytes": "790570990"}}
    {"progress": {"bytesTransferred": "786136469", "totalBytes": "790570990"}}
    {"result": {"manifestDigest": "sha256:7c7a2eb5...", "layerDigest": "sha256:6f0923e8...", "bytesWritten": "790570990", "localPath": "/var/tmp/<image-tag>", "elapsed": "51.241704147s"}}

- `bytes_written` equals the layer descriptor size; `layer_digest` matches the manifest's `sha256:...`.
- `pkg/hostfs` translates `local_path: /var/tmp/<image-tag>` to `/mnt/host/var/tmp/<image-tag>` for the write so the artifact appears on the host root, not in the container's private tmpfs.
- Calls without valid credentials returned `Unauthenticated`; only the call with real creds in the `auth.basic` field succeeded.

Notes:
- `local_path` allowlist (`/tmp/`, `/var/tmp/`, `/host/`) is enforced server-side. Prefer `/var/tmp/` or `/host/` for large artifacts: `/tmp/` is host tmpfs, and pages there count against the gnmi container's cgroup; a 790 MB pull into `/tmp/` will trip SONiC's `memory_checker` and restart the container.

Unit tests for the new server live under `pkg/gnoi/oras/...` and run with the standard `go test ./pkg/gnoi/oras/...`.
Which release branch to backport (provide reason below if selected)
None — this is a new feature, not a fix.
Description for the changelog
Add `sonic.gnoi.oras.v1.Oras` gNOI service for pulling OCI/ORAS artifacts from a registry into on-device staging.
Link to config_db schema for YANG module changes
N/A — no YANG / config_db changes.
Open questions (feedback welcome)
- `/host` platforms.
- `gnoi.os.Install` to call `Pull` internally?
- `artifactType`?
- `List`/`Delete` on-device, or push inventory to the control plane?