[gNOI] Add sonic.gnoi.oras.v1.Oras service (Pull RPC) + design doc #692

Open
hdwhdw wants to merge 14 commits into sonic-net:master from hdwhdw:daweihuang/oras-pull-design

@hdwhdw hdwhdw commented May 30, 2026

Why I did it

Adds a new sonic.gnoi.oras.v1.Oras gNOI service so an orchestrator can instruct a switch to pull an OCI/ORAS artifact (e.g. a SONiC .bin image) from a registry into local staging, decoupled from install.

Existing gNOI surfaces don't fit:

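  • gnoi.file.TransferToRemote is an upload (target → remote) as upstream OpenConfig defines it, which is the wrong direction. SONiC's implementation actually performs an HTTP GET into local_path, but it has concrete limitations that block its use for ACR/ORAS (see §1 of the design doc).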
  • gnoi.system.SetPackage + RemoteDownload only models a single URL + flat creds; can't express registry/repo/tag-or-digest, multi-layer manifests, or richer auth.
  • gnoi.containerz.Deploy is client-streamed only.

Full rationale and alternatives are in the design doc shipped in this PR.

How I did it

  1. Design doc under doc/ covering rationale, RPC shape, auth tiers, staging policy, and open questions.
  2. Proto proto/gnoi/oras/oras.proto defining sonic.gnoi.oras.v1.Oras with a streaming Pull RPC.
  3. Server pkg/gnoi/oras implementing Pull on top of oras.land/oras-go/v2, with an allowlisted staging-path policy (/tmp/, /var/tmp/, /host/).
  4. Registration in gnmi_server/server.go. Registered unconditionally (not behind the translib/native-write gate) because Pull only writes into allowlisted staging dirs — it doesn't touch YANG datastores. This mirrors how gnoi.system.System and gnoi.factory_reset are already registered on this server.

Scope of this PoC PR (matches what is actually wired up in the proto and server):

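  • PullRequest fields: registry, repository, tag or digest (oneof), local_path, auth. The earlier http_proxy field was dropped from the wire; proxies are configured through the standard HTTP_PROXY / HTTPS_PROXY / NO_PROXY environment variables instead.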
  • AuthConfig: anonymous or basic only.
  • PullResponse stream events: PullStarted (with manifest digest + total bytes), PullProgress, PullResult (with manifest_digest, layer_digest, bytes_written, local_path, elapsed).
  • Single-layer artifacts only; multi-layer manifests are rejected with InvalidArgument.

Items called out in the design doc but not implemented in this PR: List / Delete RPCs, bearer / workload-identity auth, multi-layer artifacts. See the design doc for the full list of deliberately deferred knobs.

How to verify it

PoC verified end-to-end against a SONiC KVM (single-node vlab):

  1. Build the deb with the same flags upstream CI uses (see .azure/templates/build-deb.yml):
    ENABLE_TRANSLIB_WRITE=y ENABLE_NATIVE_WRITE=y \
      dpkg-buildpackage -rfakeroot -us -uc -b -j$(nproc)
    
  2. Install sonic-gnmi_0.1_amd64.deb into the target's gnmi container and restart it:
    docker cp sonic-gnmi_0.1_amd64.deb gnmi:/tmp/
    docker exec gnmi dpkg -i /tmp/sonic-gnmi_0.1_amd64.deb
    docker exec gnmi supervisorctl restart gnmi-native
    
  3. Confirm the service is exposed (no extra runtime flag required):
    grpcurl -insecure -cert client.cer -key client.key \
      <switch-ip>:50052 list | grep sonic.gnoi.oras.v1.Oras
    
  4. (Lab only — production switches with a route to the registry skip this step.) On a testbed that needs an HTTP proxy to reach the registry, inject the proxy env vars into the gnmi process — e.g. prepend two lines to /usr/bin/gnmi-native.sh inside the gnmi container and restart:
    export HTTPS_PROXY=http://<lab-proxy>:<port>
    export NO_PROXY=localhost,127.0.0.1,10.0.0.0/8
    
    Go's http.DefaultTransport honors these via http.ProxyFromEnvironment.
  5. Pull a SONiC image artifact from an OCI registry through the new RPC. The example below targets a ~790 MB single-layer artifact:
    grpcurl -insecure -cert client.cer -key client.key -max-time 300 \
      -d '{"registry":"<registry-host>",
           "repository":"<repository>",
           "tag":"<image-tag>",
           "local_path":"/var/tmp/<image-tag>",
           "auth":{"basic":{"username":"<user>","password":"<password>"}}}' \
      <switch-ip>:50052 sonic.gnoi.oras.v1.Oras/Pull
    
    Observed streamed response (progress messages trimmed for brevity):
    {"started": {"manifestDigest": "sha256:7c7a2eb5...", "totalBytes": "790570990"}}
    {"progress": {"bytesTransferred": "146837",    "totalBytes": "790570990"}}
    {"progress": {"bytesTransferred": "205290901", "totalBytes": "790570990"}}
    {"progress": {"bytesTransferred": "476921237", "totalBytes": "790570990"}}
    {"progress": {"bytesTransferred": "786136469", "totalBytes": "790570990"}}
    {"result":   {"manifestDigest": "sha256:7c7a2eb5...",
                  "layerDigest":    "sha256:6f0923e8...",
                  "bytesWritten":   "790570990",
                  "localPath":      "/var/tmp/<image-tag>",
                  "elapsed":        "51.241704147s"}}
    bytes_written equals the layer descriptor size; layer_digest matches the manifest's sha256:....
  6. Confirm the file landed on the host filesystem (not in the container's tmpfs). From the SONiC host shell:
    $ ls -la /var/tmp/<image-tag>
    -rw-r--r-- 1 root root 790570990 ... /var/tmp/<image-tag>
    $ sha256sum /var/tmp/<image-tag>
    6f0923e8f37f6288532d70bfda42579d87b082c5ae4d177c0e252f95c2b0f140  /var/tmp/<image-tag>
    
    The sha matches the layer digest above. Internally pkg/hostfs translates local_path: /var/tmp/<image-tag> to /mnt/host/var/tmp/<image-tag> for the write so the artifact appears on the host root, not in the container's private tmpfs.
  7. Verified credentials really come from the RPC payload (not ambient state): wiped any cred files inside the gnmi container and restarted it; calls with no auth or wrong auth returned Unauthenticated, only the call with real creds in the auth.basic field succeeded.

Notes:

  • The local_path allowlist (/tmp/, /var/tmp/, /host/) is enforced server-side. Prefer /var/tmp/ or /host/ for large artifacts: /tmp/ is host tmpfs, and pages there count against the gnmi container's cgroup; a 790 MB pull into /tmp/ will trip SONiC's memory_checker and restart the container.
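
For illustration, an allowlist check of this kind could look like the sketch below. It is an assumption about the shape of the check, not the code in this PR: validateLocalPath here is a stand-in, and only the three prefixes come from the description above.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// allowedPrefixes is the staging allowlist named in the PR description.
var allowedPrefixes = []string{"/tmp/", "/var/tmp/", "/host/"}

// validateLocalPath is a hypothetical stand-in for the server-side check.
// It requires an absolute path with no ".." segments that sits under one
// of the allowlisted staging directories.
func validateLocalPath(p string) error {
	if !path.IsAbs(p) {
		return fmt.Errorf("local_path %q must be absolute", p)
	}
	// Compare whole path segments so that names such as "a..b" stay legal.
	for _, seg := range strings.Split(p, "/") {
		if seg == ".." {
			return fmt.Errorf("local_path %q contains a '..' segment", p)
		}
	}
	clean := path.Clean(p)
	for _, prefix := range allowedPrefixes {
		if strings.HasPrefix(clean, prefix) {
			return nil
		}
	}
	return fmt.Errorf("local_path %q is outside the staging allowlist", p)
}

func main() {
	for _, p := range []string{"/var/tmp/sonic.bin", "/tmp/a..b", "/tmp/../etc/passwd", "/etc/passwd"} {
		fmt.Printf("%-22s -> %v\n", p, validateLocalPath(p))
	}
}
```

Matching on the trailing slash of each prefix means a sibling directory such as /tmpfoo/ is not mistaken for /tmp/.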

Unit tests for the new server live under pkg/gnoi/oras/... and run with the standard go test ./pkg/gnoi/oras/....

Which release branch to backport (provide reason below if selected)

None — this is a new feature, not a fix.

  • 201811
  • 201911
  • 202006
  • 202012
  • 202106
  • 202111

Description for the changelog

Add sonic.gnoi.oras.v1.Oras gNOI service for pulling OCI/ORAS artifacts from a registry into on-device staging.

Link to config_db schema for YANG module changes

N/A — no YANG / config_db changes.

Open questions (feedback welcome)

  1. Staging path policy on small-/host platforms.
  2. Should we extend gnoi.os.Install to call Pull internally?
  3. Server-side enforcement of a SONiC-specific artifactType?
  4. Keep List / Delete on-device, or push inventory to the control plane?

Draft RFC for a new sonic.gnoi.oras.v1.Oras service that lets an
orchestrator instruct a switch to pull an OCI/ORAS artifact from a
registry into local staging, decoupled from install.

Tracks ADO Feature #37984064.

Signed-off-by: Dawei Huang <daweihuang@microsoft.com>
@mssonicbld

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

SONiC's TransferToRemote actually performs a download (HTTP GET into
local_path), not an upload as upstream openconfig defines. Update §1 to
describe the real current state and enumerate the concrete limitations
that block its use for ACR/ORAS.

Signed-off-by: Dawei Huang <daweihuang@microsoft.com>

hdwhdw added 3 commits May 29, 2026 21:36
PoC subset of the ORAS Pull design (doc/oras-pull-design.md): a single
streaming Pull RPC with anonymous + basic auth and an optional
http_proxy field. List/Delete and richer features are deferred.

Generated oras.pb.go is checked in following the existing precedent
(proto/sonic.pb.go, proto/gnoi/sonic_debug.pb.go). Makefile wires the
new binding into PROTO_GO_BINDINGS so make can regenerate it.

Signed-off-by: Dawei Huang <daweihuang@microsoft.com>

Streaming server implementation of sonic.gnoi.oras.v1.Oras.Pull:

  * Resolves manifest by tag or digest against an OCI registry.
  * Requires single-layer artifacts (PoC scope per the design doc;
    SONiC OS images are single layer).
  * Stages the layer into a temp dir next to local_path and renames
    into place on success, so a failed pull never leaves a partial
    file at local_path.
  * Emits PullStarted once the manifest is resolved, PullProgress at
    most once per second, and a final PullResult with elapsed time
    and per-layer digest.
  * Reuses the file-server path allowlist (/tmp, /var/tmp, /host).
  * Supports anonymous and basic auth (ACR admin user); workload
    identity and bearer modes are stubbed out for v1.
  * http_proxy field plumbed into the HTTP transport so testbeds
    where the registry is not reachable via the default route (e.g.
    sonic-vs vlabs behind a host tinyproxy) can still pull.
  * Best-effort registry-error to gRPC status mapping.

Adds oras.land/oras-go/v2 v2.6.0 and github.com/opencontainers/image-spec
v1.1.1 to go.mod.

Signed-off-by: Dawei Huang <daweihuang@microsoft.com>
Add OrasServer wrapper and wire it into registerAllServices behind the
existing EnableTranslibWrite || EnableNativeWrite gate, alongside the
other gNOI services. Pull authenticates the caller and then delegates
to pkg/gnoi/oras.HandlePull.

Signed-off-by: Dawei Huang <daweihuang@microsoft.com>

Drop the EnableTranslibWrite/EnableNativeWrite gate for sonic.gnoi.oras.v1.Oras.
The other services inside that gate are there because gNMI write paths
(translib / native YANG) should not be exposed unless the operator opted in
to writes. Oras Pull does not touch any YANG datastore — it writes only into
an allowlisted staging area inside the gnmi container — so the gate is not
meaningful here. Keep the service available on every build, mirroring the
unconditional registration of system / factory_reset.

Signed-off-by: Dawei Huang <daweihuang@microsoft.com>

hdwhdw changed the title from "doc: design for gNOI ORAS Pull service [DRAFT — RFC]" to "[gNOI] Add sonic.gnoi.oras.v1.Oras service (Pull RPC) + design doc" on Jun 1, 2026
Introduce handlePullWithRepo as a seam so tests can drive the pull loop
against an httptest-backed fake registry (PlainHTTP) without reaching
the real network. HandlePull keeps the same public signature, validates
the request, constructs the repository, then delegates.

New tests cover:
  - validatePullRequest (12 cases)
  - validateLocalPath (allowlist + traversal)
  - pullReference (tag vs digest precedence)
  - pickSingleLayer (0/1/2 layers, malformed JSON)
  - mapRegistryError (401/host/timeout/404/ENOSPC/default)
  - countingReader, copyAndRemove, jsonUnmarshalStrict
  - newRepository wires basic-auth credentials + leaves Credential nil
    for anonymous; rejects invalid registry refs
  - HandlePull happy path and auth-failure path against a fake registry

Package coverage now 81.8%.

Signed-off-by: Dawei Huang <daweihuang@microsoft.com>

Signed-off-by: Dawei Huang <daweihuang@microsoft.com>

pkg/gnoi/oras has no CGO or SONiC dependencies, so its tests can run
in the pure-test stage. Registering it here makes the pipeline pick up
the unit tests and include them in the diff-coverage gate.

Signed-off-by: Dawei Huang <daweihuang@microsoft.com>

Add tests for HandlePull wrapper (E2E + bad registry ref), multi-layer
manifest rejection, MkdirTemp failure, blob fetch 500, Send error on
PullStarted, and copyAndRemove dst-open error. Lifts statement coverage
from 81.8% to 89.9% so the pipeline diff-coverage gate (>=80% lines)
clears with comfortable margin.

Signed-off-by: Dawei Huang <daweihuang@microsoft.com>

@hdwhdw hdwhdw marked this pull request as ready for review June 1, 2026 17:40
Copilot AI review requested due to automatic review settings June 1, 2026 17:40

Copilot AI left a comment

Pull request overview

Adds a new SONiC-specific gNOI service sonic.gnoi.oras.v1.Oras with a single server-streaming Pull RPC that fetches an OCI/ORAS artifact (e.g. a SONiC .bin image) from a registry into an allowlisted local staging directory (/tmp, /var/tmp, /host). Pull resolves a manifest, validates a single-layer artifact, streams the blob to disk with progress events, and reports the final layer/manifest digests. The service is registered unconditionally on the gNMI gRPC server (it does not touch YANG/Redis), and shares the existing gNMI authn hook. A companion design doc covers the longer-term shape (multi-layer, workload-identity auth, List/Delete).

Changes:

  • New proto + generated bindings for sonic.gnoi.oras.v1.Oras (PullRequest, AuthConfig, PullStarted/Progress/Result).
  • New pkg/gnoi/oras server implementation on top of oras.land/oras-go/v2, with path allowlisting and a streaming progress loop.
  • gNMI server wiring: OrasServer, Pull shim, registration in registerAllServices; Makefile/pure.mk entries; new module deps (oras-go/v2, image-spec, go-digest).

Reviewed changes

Copilot reviewed 10 out of 12 changed files in this pull request and generated 10 comments.

| File | Description |
| --- | --- |
| proto/gnoi/oras/oras.proto | New service & message definitions for the Pull RPC. |
| proto/gnoi/oras/oras.pb.go | Generated bindings for the new proto. |
| pkg/gnoi/oras/oras.go | Server implementation: validation, repo construction, manifest fetch, layer copy, progress streaming. |
| pkg/gnoi/oras/json.go | Tiny jsonUnmarshalStrict wrapper used by manifest parsing. |
| pkg/gnoi/oras/oras_test.go | Unit tests covering validation, error mapping, and end-to-end pulls against an in-process fake registry. |
| gnmi_server/gnoi_oras.go | OrasServer.Pull shim that authenticates and delegates to pkg/gnoi/oras. |
| gnmi_server/server.go | Defines OrasServer, threads it through registerAllServices, and registers Oras unconditionally. |
| Makefile | Adds the new proto to the generated-bindings list. |
| pure.mk | Marks pkg/gnoi/oras as a pure-Go package. |
| go.mod / go.sum | Adds oras.land/oras-go/v2, opencontainers/image-spec, go-digest, and transitive deps. |
| doc/oras-pull-design.md | Design doc describing rationale, RPC shape, staging policy, auth tiers, and open questions. |
Files not reviewed (1)
  • proto/gnoi/oras/oras.pb.go: Language not supported

hdwhdw commented Jun 1, 2026

/azp run

- Serialize stream.Send through a safeStream mutex; close progressDone
  before final Send (and wait for the progress goroutine to exit) so the
  progress goroutine can never race with PullResult.
- Restrict os.Rename copy-and-delete fallback to EXDEV only; surface
  permission / target-is-dir / etc. errors as-is instead of silently
  masking them.
- Replace substring-based mapRegistryError with errors.As inspection of
  errcode.ErrorResponse, errdef.ErrNotFound, net.OpError/DNSError,
  net.Error.Timeout(), and syscall.ECONNREFUSED / ENOSPC. Adds
  PermissionDenied for 403.
- validateLocalPath: walk path components for literal '..' segments
  instead of strings.Contains, which over-rejected names like 'a..b'.
- Drop the dead '_ = oras.Copy' line and the oras-go top-level import.
- Rename jsonUnmarshalStrict -> parseManifest; comment matches behavior.
- proto: redact registry hostname example; move PoC-subset scope notes
  from the file-level comment (which protoc-gen-go places in front of
  the DO NOT EDIT marker) to the Oras service comment; regen pb.go.
- Tests: switch hard-coded /tmp/oras-test-<pid>.bin paths to a per-test
  MkdirTemp helper under /tmp; rename proto_clone -> protoClone; add
  TestIsCrossDeviceError; rework TestMapRegistryError to use typed
  errors.

Signed-off-by: Dawei Huang <daweihuang@microsoft.com>


Copilot AI left a comment

Pull request overview

Copilot reviewed 10 out of 12 changed files in this pull request and generated no new comments.

Files not reviewed (1)
  • proto/gnoi/oras/oras.pb.go: Language not supported

http_proxy on the wire mixes ops policy into the RPC contract. Go's
http.DefaultTransport already honors the standard HTTP_PROXY / HTTPS_PROXY
/ NO_PROXY env vars via http.ProxyFromEnvironment, and lab testbeds can
inject those on the gnmi process (e.g. via /usr/bin/gnmi-native.sh).
Production switches with a default route to the registry need no
configuration.

Also trims the design doc's PullRequest example: removes media_type_filter,
source_address, source_vrf, skip_if_exists, expected_manifest_digest — all
speculative v2+ knobs that don't belong in the first cut. Lists them in a
'deliberately deferred' section so the rationale is preserved.

Regenerates oras.pb.go to drop field #7.

Signed-off-by: Dawei Huang <daweihuang@microsoft.com>

New /tmp/, /var/tmp/, /host/ allowlist + /mnt/host bind-mount translation
lives in pkg/hostfs. pkg/gnoi/oras switches over so its writes land on the
host filesystem (sonic-installer reads from the host /tmp, not the
container's tmpfs).

internal/diskspace and pkg/gnoi/file still have their own private copies
of this logic; migrating them is a follow-up to keep this change focused.

Also picks up an existing go.mod entry (opencontainers/go-digest is used
directly by the oras tests; promote it from indirect).

Signed-off-by: Dawei Huang <daweihuang@microsoft.com>
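
The container-to-host path translation can be pictured with the sketch below. It is not pkg/hostfs: the translate function is hypothetical, and treating all three allowlisted prefixes identically under /mnt/host is an assumption, since the PR text only shows the /var/tmp case.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// hostMount is the bind-mount root named in the PR description.
const hostMount = "/mnt/host"

// stagingPrefixes is the allowlist from the PR description.
var stagingPrefixes = []string{"/tmp/", "/var/tmp/", "/host/"}

// translate maps an allowlisted, caller-visible path to the location the
// container actually writes to. ASSUMPTION: every prefix is rebased under
// hostMount the same way; the real package may special-case some of them.
func translate(localPath string) (string, error) {
	clean := path.Clean(localPath)
	for _, prefix := range stagingPrefixes {
		if strings.HasPrefix(clean, prefix) {
			return path.Join(hostMount, clean), nil
		}
	}
	return "", fmt.Errorf("path %q is outside the staging allowlist", localPath)
}

func main() {
	for _, p := range []string{"/var/tmp/sonic.bin", "/etc/passwd"} {
		out, err := translate(p)
		fmt.Printf("%s -> %q (err: %v)\n", p, out, err)
	}
}
```

Keeping the translation in one place means callers keep using host-style paths in the RPC while only the write path knows about the bind mount.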

Documents how HandlePull drives oras-go through Resolve → Fetch →
file.Store, where progress comes from (countingReader + 1s ticker), and
why the staging dir is created next to the destination (same-fs rename).
Captures hostfs.Translate as the container→host path seam.

Signed-off-by: Dawei Huang <daweihuang@microsoft.com>
@hdwhdw hdwhdw force-pushed the daweihuang/oras-pull-design branch from 114a9e9 to 6153efd Compare June 1, 2026 20:56