Commit 8426fac

test(e2e): address gpu workload review feedback
Signed-off-by: Evan Lezar <elezar@nvidia.com>
1 parent 422695b commit 8426fac

3 files changed

Lines changed: 63 additions & 74 deletions

e2e/gpu/README.md

Lines changed: 20 additions & 59 deletions
@@ -3,8 +3,7 @@
 
 # GPU workload images
 
-This directory defines workload test images currently used by the OpenShell GPU
-e2e suite.
+This directory defines workload test images for OpenShell GPU validation.
 
 ## Contract
 
@@ -23,10 +22,11 @@ Each workload image must:
 command explicitly.
 
 OpenShell sandbox creation replaces the image entrypoint with the supervisor and
-does not run the OCI image `CMD`. E2e tests that use these images through
-OpenShell run the command from each manifest entry explicitly.
+does not run the OCI image `CMD`. When these images are used through OpenShell,
+the workload command from each manifest entry must be passed explicitly.
 
-The test harness is manifest-driven. Each workload entry carries:
+The image build task writes a local workload manifest. Each workload entry
+carries:
 
 - `name`
 - `image`
@@ -61,24 +61,27 @@ The build task uses `tasks/scripts/container-engine.sh`. Set
 `CONTAINER_ENGINE=docker` or `CONTAINER_ENGINE=podman` to choose an engine
 explicitly. When unset, the helper uses its existing auto-detection behavior.
 
-Local tags use the current commit short SHA. Dirty local trees append `-dirty`.
-Set `OPENSHELL_GPU_WORKLOAD_IMAGE_TAG=<tag>` to override the tag.
+Local tags use the current commit short SHA plus a short fingerprint of the
+external build inputs. Dirty local trees append `-dirty`. Set
+`OPENSHELL_GPU_WORKLOAD_IMAGE_TAG=<tag>` to override the tag.
 
 The task writes the latest build refs to:
 
 ```text
 e2e/gpu/images/.build/latest.env
 ```
 
-The task also writes the local workload manifest used by the Rust e2e runner:
+The task also writes a local workload manifest for downstream tooling and
+future workload-runner integration:
 
 ```text
 e2e/gpu/images/.build/workloads.yaml
 ```
 
 That local manifest is created by `mise run e2e:workloads:build`. It contains
 the full image reference, command, expected outcome, and requirements for each
-selected workload.
+selected workload. It also records the external build inputs used to produce
+the workload images.
 
 Use the env file in later commands:
 
@@ -87,7 +90,8 @@ source e2e/gpu/images/.build/latest.env
 ```
 
 That env file exports `OPENSHELL_E2E_WORKLOAD_MANIFEST` pointing at the local
-manifest. The per-image refs remain available as a convenience for direct
+manifest. The current checked-in Rust GPU e2e target does not consume this
+manifest yet. The per-image refs remain available as a convenience for direct
 container-engine validation.
 
 ## Direct Validation
@@ -120,57 +124,14 @@ where Podman CDI is configured.
 Direct container-engine validation catches image, CDI, CUDA, and host GPU setup
 issues before OpenShell sandbox behavior is involved.
 
-## Manifest-Driven Validation
+## OpenShell GPU E2E
 
-The Rust GPU validation target is:
+The current Rust GPU validation target is:
 
 ```shell
-cargo test --manifest-path e2e/rust/Cargo.toml --features e2e-docker-gpu --test gpu -- --nocapture
+mise run e2e:gpu
 ```
 
-The workload validation path reads:
-
-```text
-OPENSHELL_E2E_WORKLOAD_MANIFEST
-```
-
-When that variable is unset, the runner uses the default local manifest path:
-
-```text
-e2e/gpu/images/.build/workloads.yaml
-```
-
-If neither path exists, the workload validation test prints a clear skip
-message telling you to run:
-
-```shell
-mise run e2e:workloads:build
-```
-
-or to set `OPENSHELL_E2E_WORKLOAD_MANIFEST` to an external manifest.
-
-Each manifest entry supplies the sandbox image and command. OpenShell runs that
-command through `openshell sandbox create --gpu --from <image> -- <command>`.
-The test runner iterates all GPU-tagged workload entries and enforces each
-entry's declared expectation:
-
-- `expect: pass` requires `OPENSHELL_GPU_WORKLOAD_SUCCESS`
-- `expect: fail` requires `OPENSHELL_GPU_WORKLOAD_FAILURE`
-
-The current local manifest includes three workloads:
-
-- `smoke-pass` expected to pass
-- `smoke-fail` expected to fail
-- `cuda-basic` expected to pass
-
-## External Manifests
-
-External workload catalogs can use the same schema. Point the runner at one
-with:
-
-```shell
-export OPENSHELL_E2E_WORKLOAD_MANIFEST=/abs/path/to/workloads.yaml
-```
-
-That lets alternate workload manifests use the same test runner without
-introducing per-workload env vars.
+That target runs `gpu_device_selection`. It validates GPU request and device
+selection behavior against a Docker-backed gateway. It does not run the
+workload manifest generated by `mise run e2e:workloads:build`.
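
Editorial sketch, not part of the commit: the README change above describes local tags as the short SHA plus an input fingerprint, with `-dirty` appended for dirty trees and `OPENSHELL_GPU_WORKLOAD_IMAGE_TAG` as an override. The function below illustrates that rule in isolation. The name `compose_image_tag` and the sample fingerprint `1a2b3c4d` are hypothetical; the real logic lives inline in `tasks/scripts/e2e-gpu-build-images.sh`.

```shell
#!/usr/bin/env sh
set -eu

# Hypothetical helper: derive a local image tag from its parts.
#   $1 = short commit SHA
#   $2 = fingerprint of the external build inputs
#   $3 = "true" when the working tree is dirty
compose_image_tag() {
    local short_sha
    local fingerprint
    local dirty
    local tag
    short_sha="$1"
    fingerprint="$2"
    dirty="$3"

    # An explicit override wins over the derived tag.
    if [ -n "${OPENSHELL_GPU_WORKLOAD_IMAGE_TAG:-}" ]; then
        printf '%s\n' "${OPENSHELL_GPU_WORKLOAD_IMAGE_TAG}"
        return 0
    fi

    tag="${short_sha}-${fingerprint}"
    if [ "${dirty}" = "true" ]; then
        tag="${tag}-dirty"
    fi
    printf '%s\n' "${tag}"
}

# Placeholder values for illustration only.
compose_image_tag 8426fac 1a2b3c4d false
compose_image_tag 8426fac 1a2b3c4d true
```

Because the fingerprint is part of the tag, changing an external input such as `CUDA_SAMPLES_REF` should yield a different tag even when the commit is unchanged, so a stale image is less likely to be reused under the same name.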

tasks/scripts/e2e-gpu-build-images.sh

Lines changed: 42 additions & 14 deletions
@@ -16,6 +16,7 @@ BASE_IMAGE="${OPENSHELL_SANDBOX_BASE_IMAGE:-ghcr.io/nvidia/openshell-community/s
 CUDA_BUILD_IMAGE="${CUDA_BUILD_IMAGE:-nvcr.io/nvidia/cuda:12.8.1-base-ubuntu22.04}"
 CUDA_SAMPLES_REPO="${CUDA_SAMPLES_REPO:-https://github.com/NVIDIA/cuda-samples}"
 CUDA_SAMPLES_REF="${CUDA_SAMPLES_REF:-v12.8}"
+SUPPORTED_IMAGES=(smoke-pass smoke-fail cuda-basic)
 
 shell_quote() {
 local value=$1
@@ -39,22 +40,13 @@ yaml_quote() {
 }
 
 available_image_dirs() {
-local dockerfile
 local preferred
-local seen=" "
 
-for preferred in smoke-pass smoke-fail cuda-basic; do
+for preferred in "${SUPPORTED_IMAGES[@]}"; do
 if [[ -f "${IMAGES_ROOT}/${preferred}/Dockerfile" ]]; then
 echo "${preferred}"
-seen+="${preferred} "
 fi
 done
-
-find "${IMAGES_ROOT}" -mindepth 2 -maxdepth 2 -name Dockerfile -type f | sort | while IFS= read -r dockerfile; do
-name="$(basename "$(dirname "${dockerfile}")")"
-[[ "${seen}" == *" ${name} "* ]] && continue
-echo "${name}"
-done
 }
 
 contains_image() {
@@ -90,6 +82,19 @@ image_expectation() {
 esac
 }
 
+workload_input_fingerprint() {
+local -a names=("$@")
+
+{
+printf 'OPENSHELL_SANDBOX_BASE_IMAGE=%s\n' "${BASE_IMAGE}"
+if contains_image cuda-basic "${names[@]}"; then
+printf 'CUDA_BUILD_IMAGE=%s\n' "${CUDA_BUILD_IMAGE}"
+printf 'CUDA_SAMPLES_REPO=%s\n' "${CUDA_SAMPLES_REPO}"
+printf 'CUDA_SAMPLES_REF=%s\n' "${CUDA_SAMPLES_REF}"
+fi
+} | git -C "${ROOT}" hash-object --stdin | cut -c1-8
+}
+
 mapfile -t available < <(available_image_dirs)
 if [[ ${#available[@]} -eq 0 ]]; then
 echo "No GPU workload image Dockerfiles found under ${IMAGES_ROOT}" >&2
@@ -128,11 +133,13 @@ fi
 if [[ -n "${OPENSHELL_GPU_WORKLOAD_IMAGE_TAG:-}" ]]; then
 image_tag="${OPENSHELL_GPU_WORKLOAD_IMAGE_TAG}"
 else
-image_tag="${source_short_sha}"
+input_fingerprint="$(workload_input_fingerprint "${selected[@]}")"
+image_tag="${source_short_sha}-${input_fingerprint}"
 if [[ "${source_dirty}" == "true" ]]; then
 image_tag="${image_tag}-dirty"
 fi
 fi
+input_fingerprint="$(workload_input_fingerprint "${selected[@]}")"
 
 declare -A image_refs=()
 
@@ -148,12 +155,23 @@ for name in "${selected[@]}"; do
 build_args=(
 --build-arg "OPENSHELL_SANDBOX_BASE_IMAGE=${BASE_IMAGE}"
 )
+build_labels=(
+--label "com.nvidia.openshell.gpu-workload.source=${name}"
+--label "com.nvidia.openshell.gpu-workload.base-image=${BASE_IMAGE}"
+--label "com.nvidia.openshell.gpu-workload.input-fingerprint=${input_fingerprint}"
+--label "org.opencontainers.image.revision=${source_sha}"
+)
 if [[ "${name}" == "cuda-basic" ]]; then
 build_args+=(
 --build-arg "CUDA_BUILD_IMAGE=${CUDA_BUILD_IMAGE}"
 --build-arg "CUDA_SAMPLES_REPO=${CUDA_SAMPLES_REPO}"
 --build-arg "CUDA_SAMPLES_REF=${CUDA_SAMPLES_REF}"
 )
+build_labels+=(
+--label "com.nvidia.openshell.gpu-workload.cuda-build-image=${CUDA_BUILD_IMAGE}"
+--label "com.nvidia.openshell.gpu-workload.cuda-samples-repo=${CUDA_SAMPLES_REPO}"
+--label "com.nvidia.openshell.gpu-workload.cuda-samples-ref=${CUDA_SAMPLES_REF}"
+)
 fi
 
 echo
@@ -162,8 +180,7 @@ for name in "${selected[@]}"; do
 --load \
 --provenance=false \
 -t "${image_ref}" \
---label "com.nvidia.openshell.gpu-workload.source=${name}" \
---label "org.opencontainers.image.revision=${source_sha}" \
+"${build_labels[@]}" \
 "${build_args[@]}" \
 "${context}"
 
@@ -180,6 +197,11 @@ manifest_path="${BUILD_DIR}/workloads.yaml"
 write_env_var OPENSHELL_GPU_WORKLOAD_IMAGE_SOURCE_PATH "${IMAGES_ROOT}"
 write_env_var OPENSHELL_GPU_WORKLOAD_IMAGE_SOURCE_SHA "${source_sha}"
 write_env_var OPENSHELL_GPU_WORKLOAD_IMAGE_SOURCE_DIRTY "${source_dirty}"
+write_env_var OPENSHELL_GPU_WORKLOAD_IMAGE_INPUT_FINGERPRINT "${input_fingerprint}"
+write_env_var OPENSHELL_SANDBOX_BASE_IMAGE "${BASE_IMAGE}"
+write_env_var CUDA_BUILD_IMAGE "${CUDA_BUILD_IMAGE}"
+write_env_var CUDA_SAMPLES_REPO "${CUDA_SAMPLES_REPO}"
+write_env_var CUDA_SAMPLES_REF "${CUDA_SAMPLES_REF}"
 write_env_var OPENSHELL_GPU_WORKLOAD_CONTAINER_ENGINE "${CONTAINER_ENGINE}"
 write_env_var OPENSHELL_E2E_WORKLOAD_MANIFEST "${manifest_path}"
 for name in "${selected[@]}"; do
@@ -194,11 +216,17 @@ manifest_path="${BUILD_DIR}/workloads.yaml"
 echo " path: $(yaml_quote "${IMAGES_ROOT}")"
 echo " revision: $(yaml_quote "${source_sha}")"
 echo " dirty: ${source_dirty}"
+echo " input_fingerprint: $(yaml_quote "${input_fingerprint}")"
 echo " container_engine: $(yaml_quote "${CONTAINER_ENGINE}")"
+echo " inputs:"
+echo " openshell_sandbox_base_image: $(yaml_quote "${BASE_IMAGE}")"
+echo " cuda_build_image: $(yaml_quote "${CUDA_BUILD_IMAGE}")"
+echo " cuda_samples_repo: $(yaml_quote "${CUDA_SAMPLES_REPO}")"
+echo " cuda_samples_ref: $(yaml_quote "${CUDA_SAMPLES_REF}")"
 echo "workloads:"
 for name in "${selected[@]}"; do
 echo " - name: $(yaml_quote "${name}")"
-echo " image: $(yaml_quote "${image_refs[${name}]}" )"
+echo " image: $(yaml_quote "${image_refs[${name}]}")"
 echo " command:"
 echo " - $(yaml_quote "/usr/local/bin/openshell-gpu-workload")"
 echo " expect: $(yaml_quote "$(image_expectation "${name}")")"
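
Editorial sketch, not part of the commit: `workload_input_fingerprint` in the diff above pipes `KEY=VALUE` lines through `git hash-object --stdin` and keeps the first eight characters. The standalone version below shows the same idea so the behavior can be tried outside the build script. The names `hash_stdin` and `fingerprint_inputs`, the `sha256sum` fallback, and the `example.invalid` base image are assumptions made for illustration; the commit itself only uses `git hash-object`.

```shell
#!/usr/bin/env sh
set -eu

# Hash stdin to a hex digest. The commit uses `git hash-object --stdin`;
# the sha256sum branch is an assumed fallback for hosts without git and
# produces different (but equally stable) values.
hash_stdin() {
    if command -v git >/dev/null 2>&1; then
        git hash-object --stdin
    else
        sha256sum | cut -d' ' -f1
    fi
}

# Hypothetical helper: print an 8-character fingerprint for the given
# KEY=VALUE arguments. Argument order matters, as in the original pipeline.
fingerprint_inputs() {
    for pair in "$@"; do
        printf '%s\n' "${pair}"
    done | hash_stdin | cut -c1-8
}

# Same inputs give the same fingerprint; changing one value changes it.
fingerprint_inputs \
    "OPENSHELL_SANDBOX_BASE_IMAGE=example.invalid/sandbox-base:latest" \
    "CUDA_SAMPLES_REF=v12.8"
fingerprint_inputs \
    "OPENSHELL_SANDBOX_BASE_IMAGE=example.invalid/sandbox-base:latest" \
    "CUDA_SAMPLES_REF=v12.9"
```

Eight hex characters keep tags short while making accidental collisions between a handful of local input sets unlikely; the value is meant as a cache-busting label rather than a security boundary.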

tasks/test.toml

Lines changed: 1 addition & 1 deletion
@@ -26,7 +26,7 @@ description = "Run Docker GPU end-to-end tests"
 depends = ["e2e:docker:gpu"]
 
 ["e2e:workloads:build"]
-description = "Build local workload test images and manifest for e2e validation"
+description = "Build local GPU workload test images and manifest"
 run = "bash tasks/scripts/e2e-gpu-build-images.sh"
 
 ["e2e:k3s:gpu"]
