AI Model Serving on MicroShift is a single-model serving platform for AI models.
It includes a limited subset of Red Hat OpenShift AI (RHOAI): KServe in Raw Deployment mode and the namespace-specific ServingRuntime object type.
You can train your models in the cloud and serve them on the edge.
The following custom resources are supported:
- ServingRuntime (for more information about Serving Runtimes, refer to the upstream KServe documentation)
- InferenceService
Currently, AI Model Serving on MicroShift ships with the following model-serving runtimes:
- OpenVINO Model Server
- vLLM ServingRuntime for KServe
- Caikit Text Generation Inference Server (Caikit-TGIS) ServingRuntime for KServe
- Caikit Standalone ServingRuntime for KServe
- Text Generation Inference Server (TGIS) Standalone ServingRuntime for KServe
- vLLM ServingRuntime with Gaudi accelerators support for KServe
- vLLM ROCm ServingRuntime for KServe
Refer to the RHOAI documentation for details about the model-serving runtimes.
- Develop, train, test, and prepare the model for serving
- Configure the OS and MicroShift for the hardware (driver and device plugin)
- Install the microshift-ai-model-serving package (and restart MicroShift)
- Package the model into an OCI image (ModelCar)
- Select a suitable model-serving runtime (Model Server)
- Copy the ServingRuntime custom resource from redhat-ods-applications to your own namespace
- Create an InferenceService object
- Create a Route object
- Make requests against the model server
To enable GPU/hardware accelerators for MicroShift, follow the Partner's guidance on installing either an Operator or a driver for RHEL plus a device plugin for Kubernetes. Operators might be more convenient, but using only the driver and device plugin may be more resource efficient.
MicroShift cannot provide support for a Partner's procedure. For troubleshooting, consult the Partner's documentation or product support. The following links are examples and pointers only. These links might not include everything you need, but are a good place to start.
- NVIDIA:
- Intel Gaudi
- AMD
Below is an example of using AI Model Serving for MicroShift. It uses the OpenVINO Model Server (OVMS) and the ResNet-50 model. Note: OVMS can run on the CPU, so configuring an additional hardware accelerator is not included in this example.
The microshift-ai-model-serving RPM contains manifests that deploy kserve
with Raw Deployment mode enabled and ServingRuntimes objects in the redhat-ods-applications namespace.
To install AI Model Serving for MicroShift, run the following command:
sudo dnf install -y microshift-ai-model-serving
After installing the package and restarting MicroShift, there should be a new Pod running in the redhat-ods-applications namespace:
$ oc get pods -n redhat-ods-applications
NAME READY STATUS RESTARTS AGE
kserve-controller-manager-7fc9fc688-kttmm 1/1 Running 0 1h
You can also install the release info package. It contains a JSON file with image references, which is useful for offline procedures or for deploying a copy of a ServingRuntime to your namespace during a bootc image build:
sudo dnf install -y microshift-ai-model-serving-release-info
You can package your model into an OCI image and make use of what is known as the ModelCar approach. This can help you set up offline environments because the model can be embedded just like any other container image.
The exact directory structure depends on the model server. Below is an example Containerfile with a ResNet-50 model that is compatible with the OpenVINO Model Server (OVMS) and is used in the OVMS examples.
See the "How to build a ModelCar container" section of the "Build and deploy a ModelCar container in OpenShift AI" article for guidance on building an OCI image with a model from Hugging Face that is suitable for a vLLM model server.
FROM registry.access.redhat.com/ubi9/ubi-minimal:latest
RUN microdnf install -y wget && microdnf clean all
RUN mkdir -p /models/1 && chmod -R 755 /models/1
RUN wget -q -P /models/1 \
https://storage.openvinotoolkit.org/repositories/open_model_zoo/2022.1/models_bin/2/resnet50-binary-0001/FP32-INT1/resnet50-binary-0001.bin \
https://storage.openvinotoolkit.org/repositories/open_model_zoo/2022.1/models_bin/2/resnet50-binary-0001/FP32-INT1/resnet50-binary-0001.xml
You can build it and push it to your registry:
podman build -t IMAGE_REF .
podman push IMAGE_REF
For this example, we'll build the image locally and use it right away without pushing it to a registry first.
sudo is required to make the image part of root's container storage and usable by MicroShift, because CRI-O and Podman share the storage.
For offline use cases, be sure to use a tag other than latest.
If the latest tag is used, the container that fetches and sets up the model is configured with imagePullPolicy set to Always, and the local image is ignored.
If you use any other tag, imagePullPolicy is set to IfNotPresent.
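The tag rule above can be checked before a build with a small helper. The following Python sketch only mirrors the rule as stated in this guide and is not taken from KServe or Kubernetes code. The function name is hypothetical, treating an untagged reference as latest is an assumption, and digest references (name@sha256:...) are not handled.

```python
def expected_pull_policy(image_ref: str) -> str:
    """Return the pull policy this guide describes for a tagged image reference."""
    # Only the last path component can carry a tag; a registry host may
    # itself contain a colon (host:port), so split it off first.
    last_component = image_ref.rsplit("/", 1)[-1]
    if ":" in last_component:
        tag = last_component.split(":", 1)[1]
    else:
        tag = "latest"  # assumption: an untagged reference behaves like latest
    return "Always" if tag == "latest" else "IfNotPresent"


if __name__ == "__main__":
    for ref in ("localhost/ovms-resnet50:test", "localhost/ovms-resnet50:latest"):
        print(ref, "->", expected_pull_policy(ref))
```

A check like this could run in an image build pipeline to reject references that would force a pull on an offline host.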
$ sudo podman build -t ovms-resnet50:test .
STEP 1/4: FROM registry.access.redhat.com/ubi9/ubi-minimal:latest
Trying to pull registry.access.redhat.com/ubi9/ubi-minimal:latest...
Getting image source signatures
Checking if image destination supports signatures
Copying blob 533b69cfd644 done |
Copying blob 863e9a7e2102 done |
Copying config 098048e6f9 done |
Writing manifest to image destination
Storing signatures
STEP 2/4: RUN microdnf install -y wget && microdnf clean all
<< SNIP >>
--> 4c74352ad42e
STEP 3/4: RUN mkdir -p /models/1 && chmod -R 755 /models/1
--> bfd31acb1e81
STEP 4/4: RUN wget -q -P /models/1 https://storage.openvinotoolkit.org/repositories/open_model_zoo/2022.1/models_bin/2/resnet50-binary-0001/FP32-INT1/resnet50-binary-0001.bin https://storage.openvinotoolkit.org/repositories/open_model_zoo/2022.1/models_bin/2/resnet50-binary-0001/FP32-INT1/resnet50-binary-0001.xml
COMMIT ovms-resnet50:test
--> 375b265c1c4b
Successfully tagged localhost/ovms-resnet50:test
375b265c1c4bc6f0a059c8739fb2b3a46e1b563728f6d9c51f26f29bb2c87c3e
Run the following command to make sure the image exists:
$ sudo podman images | grep ovms-resnet50
localhost/ovms-resnet50 test 375b265c1c4b 3 minutes ago 136 MB
Use the following command to create a new namespace that will be used throughout this guide:
oc create ns ai-demo
First, you must select the ServingRuntime that supports the format of your model. Then, you need to create the ServingRuntime in your workload's namespace.
If the cluster is already running, you can export the desired ServingRuntime to a file and tweak it.
If the cluster is not running or you want to prepare a manifest, you can use the original definition on the disk.
For more information about ServingRuntimes, refer to the RHOAI documentation.
Using the on-disk definition does not require a live cluster, so it can be part of CI/CD automation.
Overview of the procedure:
- Install the microshift-ai-model-serving-release-info RPM
- Extract the image reference of a particular ServingRuntime from the release info file
- Make a copy of the chosen ServingRuntime YAML file
- Add the actual image reference to the image: parameter field value
- Create the object using the file or make it part of a manifest (kustomization)
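As a rough illustration of the extract-and-substitute steps in this procedure, the following Python sketch works on in-memory stand-ins rather than the real files. The sample JSON, the sample manifest text, the container name, and both function names are hypothetical. The only assumption about the release info file is that image references live under a top-level images object keyed by names such as ovms-image.

```python
import json


def image_from_release_info(release_info: dict, key: str) -> str:
    """Look up an image reference under the top-level "images" object."""
    return release_info["images"][key]


def set_runtime_image(manifest_text: str, placeholder: str, image_ref: str) -> str:
    """Replace 'image: <placeholder>' with 'image: <image_ref>' in the manifest text."""
    old = f"image: {placeholder}"
    if old not in manifest_text:
        # Fail loudly instead of producing a manifest with an unresolved image.
        raise ValueError(f"placeholder {old!r} not found in manifest")
    return manifest_text.replace(old, f"image: {image_ref}")


if __name__ == "__main__":
    # Hypothetical stand-ins for the release info JSON and the ServingRuntime YAML.
    release_info = json.loads(
        '{"images": {"ovms-image": "registry.example.com/ovms@sha256:0000"}}'
    )
    manifest = "containers:\n- name: kserve-container\n  image: ovms-image\n"
    image = image_from_release_info(release_info, "ovms-image")
    print(set_runtime_image(manifest, "ovms-image", image))
```

In a CI job, the two inputs would be read from the release info file and the packaged ServingRuntime YAML instead of the inline samples.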
The following example shows how to reuse the manifest files from the microshift-ai-model-serving package
to re-create the OVMS ServingRuntime in the workload's namespace:
# Get image reference for the 'ovms-image'
OVMS_IMAGE="$(jq -r '.images | with_entries(select(.key == "ovms-image")) | .[]' /usr/share/microshift/release/release-ai-model-serving-"$(uname -i)".json)"
# Duplicate the original ServingRuntime yaml
cp /usr/lib/microshift/manifests.d/050-microshift-ai-model-serving-runtimes/ovms-kserve.yaml ./ovms-kserve.yaml
# Update the image reference
sed -i "s,image: ovms-image,image: ${OVMS_IMAGE}," ./ovms-kserve.yaml
Then you can re-create the ServingRuntime in a custom namespace:
oc create -n ai-demo -f ./ovms-kserve.yaml
Alternatively, if the preceding procedure is part of a bootc Containerfile and the ServingRuntime ends up as part of a new manifest, the namespace can be set in the kustomization.yaml:
namespace: ai-demo
Next, we need to create an InferenceService custom resource (CR).
The InferenceService instructs KServe on how to create a Deployment for serving the model.
KServe selects the ServingRuntime based on the modelFormat specified in the InferenceService.
It's possible to add extra arguments that will be passed to the model server using .spec.predictor.model.args.
The following is an example of an InferenceService with a model in the openvino_ir format.
It features an additional argument, --layout=NHWC:NCHW to make OVMS accept the
request input data in a different layout than the model was originally exported with.
Extra args are passed through to the OVMS container.
For more information about the InferenceService CR, refer to the RHOAI documentation.
Example InferenceService object with an openvino_ir model format
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: ovms-resnet50
spec:
  predictor:
    model:
      protocolVersion: v2
      modelFormat:
        name: openvino_ir
      storageUri: "oci://localhost/ovms-resnet50:test"
      args:
      - --layout=NHWC:NCHW
Save the InferenceService example to a file, then create it on the cluster:
$ oc create -n ai-demo -f ./FILE.yaml
inferenceservice.serving.kserve.io/ovms-resnet50 created
Soon, a Deployment and a Pod should appear in that namespace. Depending on the size of the ServingRuntime's image and the size of the ModelCar OCI image, it may take a while for the Pod to become ready.
$ oc get -n ai-demo deployment
NAME READY UP-TO-DATE AVAILABLE AGE
ovms-resnet50-predictor 1/1 1 1 72s
$ oc rollout status -n ai-demo deployment ovms-resnet50-predictor
deployment "ovms-resnet50-predictor" successfully rolled out
$ oc get -n ai-demo pod
NAME READY STATUS RESTARTS AGE
ovms-resnet50-predictor-6fdb566b7f-bc9k5 2/2 Running 1 (72s ago) 74s
KServe will also create a Service:
$ oc get svc -n ai-demo
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
ovms-resnet50-predictor ClusterIP None <none> 80/TCP 119s
An InferenceService can also include many other options.
For example, the CR can contain a resources section that is passed to the Deployment and then to the Pod,
so that the model server gets access to the hardware (thanks to the device plugin).
In this example, an NVIDIA device:
spec:
  predictor:
    model:
      resources:
        limits:
          nvidia.com/gpu: 1
        requests:
          nvidia.com/gpu: 1
For the complete InferenceService specification, refer to the KServe API reference.
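When manifests are generated by a script, the InferenceService fields used in this guide can be assembled programmatically. The following Python sketch builds the object as a plain dictionary and prints it as JSON. The function name and its parameters are hypothetical, and only the fields shown in this guide are covered. The resources argument is optional and would be left out for the CPU-only OVMS example.

```python
import json


def inference_service(name, model_format, storage_uri, args=None, resources=None):
    """Build an InferenceService object as a plain dict."""
    model = {
        "protocolVersion": "v2",
        "modelFormat": {"name": model_format},
        "storageUri": storage_uri,
    }
    if args:
        # Extra arguments passed through to the model server.
        model["args"] = list(args)
    if resources:
        # For example: {"limits": {"nvidia.com/gpu": 1}, "requests": {"nvidia.com/gpu": 1}}
        model["resources"] = resources
    return {
        "apiVersion": "serving.kserve.io/v1beta1",
        "kind": "InferenceService",
        "metadata": {"name": name},
        "spec": {"predictor": {"model": model}},
    }


if __name__ == "__main__":
    isvc = inference_service(
        "ovms-resnet50",
        "openvino_ir",
        "oci://localhost/ovms-resnet50:test",
        args=["--layout=NHWC:NCHW"],
    )
    print(json.dumps(isvc, indent=2))
```

Because oc accepts JSON manifests as well as YAML, the printed output can be saved to a file and passed to oc create -n ai-demo -f.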
Note: You don't need to wait for the model server Pod to become ready before creating the Route.
Create an OpenShift Route CR to expose the service.
You can either use the oc expose svc command or create a definition in a YAML file and apply it.
$ oc expose svc -n ai-demo ovms-resnet50-predictor
route.route.openshift.io/ovms-resnet50-predictor exposed
$ oc get route -n ai-demo
NAME HOST ADMITTED SERVICE TLS
ovms-resnet50-predictor ovms-resnet50-predictor-ai-demo.apps.example.com True ovms-resnet50-predictor
You can now check whether your model is ready for inference. We'll reuse the OVMS examples to test the inference.
Get the IP of the MicroShift cluster and assign it to the IP variable.
Use the HOST value of the Route and assign it to the DOMAIN variable.
Next, run the following curl command.
Alternatively, instead of using the --connect-to "${DOMAIN}::${IP}:" flag,
you can use real DNS, or add the IP and the domain to the /etc/hosts file.
DOMAIN=ovms-resnet50-predictor-ai-demo.apps.example.com
IP=192.168.0.10
curl -i "${DOMAIN}/v2/models/ovms-resnet50/ready" \
--connect-to "${DOMAIN}::${IP}:"
Response code 200 is expected. Example output:
HTTP/1.1 200 OK
content-type: application/json
date: Wed, 12 Mar 2025 16:01:32 GMT
content-length: 0
set-cookie: 56bb4b6df4f80f0b59f56aa0a5a91c1a=4af1408b4a1c40925456f73033d4a7d1; path=/; HttpOnly
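The same readiness check can be scripted, for example to wait for the model before running tests. The following Python sketch connects to the cluster IP directly and sends the Route host in the Host header, which approximates what the --connect-to flag does for curl. The function names, the default port 80, and the timeout are assumptions made for illustration.

```python
import http.client


def ready_path(model_name: str) -> str:
    """Path of the readiness endpoint used by the curl example."""
    return f"/v2/models/{model_name}/ready"


def model_is_ready(ip: str, domain: str, model_name: str,
                   port: int = 80, timeout: float = 5.0) -> bool:
    """Connect to `ip`, but ask for `domain` so the router can match the Route."""
    conn = http.client.HTTPConnection(ip, port, timeout=timeout)
    try:
        # An explicit Host header replaces the one http.client would derive from `ip`.
        conn.request("GET", ready_path(model_name), headers={"Host": domain})
        return conn.getresponse().status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, or a malformed reply all count as "not ready".
        return False
    finally:
        conn.close()
```

With the values from this example, the call would be model_is_ready("192.168.0.10", "ovms-resnet50-predictor-ai-demo.apps.example.com", "ovms-resnet50").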
We can also query the model's metadata:
curl "${DOMAIN}/v2/models/ovms-resnet50" \
--connect-to "${DOMAIN}::${IP}:"
Example output:
{"name":"ovms-resnet50","versions":["1"],"platform":"OpenVINO","inputs":[{"name":"0","datatype":"FP32","shape":[1,224,224,3]}],"outputs":[{"name":"1463","datatype":"FP32","shape":[1,1000]}]}
Let's try querying the actual model. The following example verifies whether the inference result is in accordance with the training data.
First, download an image of a bee from the OpenVINO examples:
curl -O https://raw.githubusercontent.com/openvinotoolkit/model_server/main/demos/common/static/images/bee.jpeg
Next, create the request data:
- Start with an inference header in JSON format.
- Get the size of the header. It needs to be passed to OVMS later in the form of an HTTP header.
- Append the size of the image to the request file. OVMS expects 4 bytes (little endian). The following command uses the xxd utility, which is part of the vim-common package.
- Append the image to the request file:
IMAGE=./bee.jpeg
REQ=./request.json
# Add an inference header
echo -n '{"inputs" : [{"name": "0", "shape": [1], "datatype": "BYTES"}]}' > "${REQ}"
# Get the size of the inference header
HEADER_LEN="$(stat -c %s "${REQ}")"
# Add size of the data (image) in binary format (4 bytes, little endian)
printf "%08X" $(stat --format=%s "${IMAGE}") | sed 's/\(..\)/\1\n/g' | tac | tr -d '\n' | xxd -r -p >> "${REQ}"
# Add the data, i.e. the image
cat "${IMAGE}" >> "${REQ}"
Now we can make an inference request against the model server that is using the ovms-resnet50 model.
curl \
--data-binary "@./request.json" \
--header "Inference-Header-Content-Length: ${HEADER_LEN}" \
"${DOMAIN}/v2/models/ovms-resnet50/infer" \
--connect-to "${DOMAIN}::${IP}:" > response.json
The response, saved to response.json, is a JSON object with the following structure.
The contents of .outputs[0].data are omitted from the example for brevity.
{
  "model_name": "ovms-resnet50",
  "model_version": "1",
  "outputs": [{
    "name": "1463",
    "shape": [1, 1000],
    "datatype": "FP32",
    "data": [ ....... ]
  }]
}
To verify the response, we'll use Python.
We need to obtain the index of the highest element in .outputs[0].data.
import json

# Load the inference response saved by the curl command
with open('response.json') as f:
    response = json.load(f)

# The predicted class is the index of the highest score
data = response["outputs"][0]["data"]
argmax = data.index(max(data))
print(argmax)
The output of the Python script we just ran should be 309.
We can validate it against resnet's input data:
../../../../demos/common/static/images/bee.jpeg 309
You can try querying the model using other images mentioned in the resnet's input data.
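When trying several images, the request-building steps from this section can also be expressed in Python instead of the stat, printf, and xxd pipeline. The following sketch produces the same layout: the JSON inference header, then the image size as 4 little-endian bytes, then the image bytes. The function name is hypothetical, and the default input name "0" is taken from the model metadata shown earlier.

```python
import json
import struct


def build_request(image_bytes: bytes, input_name: str = "0"):
    """Return (body, header_len) for a binary-data inference request."""
    header = json.dumps(
        {"inputs": [{"name": input_name, "shape": [1], "datatype": "BYTES"}]}
    ).encode("utf-8")
    # "<I" packs the image size as an unsigned 4-byte little-endian integer.
    body = header + struct.pack("<I", len(image_bytes)) + image_bytes
    return body, len(header)


if __name__ == "__main__":
    # Stand-in bytes; in the walkthrough this would be the contents of bee.jpeg.
    body, header_len = build_request(b"\xff\xd8\xff")
    print(header_len, len(body))
```

The body is sent as the POST payload, and header_len goes into the Inference-Header-Content-Length HTTP header, as in the curl command in this section.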
To obtain the Prometheus metrics of the model server, make a request to the /metrics endpoint:
curl "${DOMAIN}/metrics" \
--connect-to "${DOMAIN}::${IP}:"
Partial example output:
# HELP ovms_requests_success Number of successful requests to a model or a DAG.
# TYPE ovms_requests_success counter
ovms_requests_success{api="KServe",interface="REST",method="ModelReady",name="ovms-resnet50"} 4
ovms_requests_success{api="KServe",interface="REST",method="ModelMetadata",name="ovms-resnet50",version="1"} 1
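For a quick look at these counters without a full monitoring stack, the text output can be parsed directly. The following Python sketch sums ovms_requests_success by the method label, using the partial output above as sample input. It is a simplified parser written for this illustration: the function names are hypothetical, and label values containing escaped quotes are not handled.

```python
import re

SAMPLE = '''# HELP ovms_requests_success Number of successful requests to a model or a DAG.
# TYPE ovms_requests_success counter
ovms_requests_success{api="KServe",interface="REST",method="ModelReady",name="ovms-resnet50"} 4
ovms_requests_success{api="KServe",interface="REST",method="ModelMetadata",name="ovms-resnet50",version="1"} 1
'''

# metric_name, optional {labels}, then the sample value
LINE = re.compile(r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)')
LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text: str):
    """Yield (metric_name, labels_dict, value) for each sample line."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE.match(line)
        if match:
            labels = dict(LABEL.findall(match["labels"] or ""))
            yield match["name"], labels, float(match["value"])


def successes_by_method(text: str) -> dict:
    """Sum ovms_requests_success samples per value of the method label."""
    totals = {}
    for name, labels, value in parse_metrics(text):
        if name == "ovms_requests_success":
            method = labels.get("method", "")
            totals[method] = totals.get(method, 0.0) + value
    return totals


if __name__ == "__main__":
    print(successes_by_method(SAMPLE))
```
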
To scrape the model server's metrics with microshift-observability:
- Your OTEL configuration needs to include the prometheus receiver (see the opentelemetry-collector-large.yaml preset for an example)
- Your InferenceService CR needs to include the following annotation, which is passed through to the Pod:
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  annotations:
    prometheus.io/scrape: "true"
To learn more about KServe endpoints, see the upstream documentation.
If you wish to override KServe settings, you need to make a copy of the existing ConfigMap, tweak the desired settings, and overwrite the existing ConfigMap.
Settings are stored in a ConfigMap named inferenceservice-config in the redhat-ods-applications namespace.
Alternatively, you can copy the ConfigMap from /usr/lib/microshift/manifests.d/010-microshift-ai-model-serving-kserve/inferenceservice-config-microshift-patch.yaml.
After tweaking it, you must apply the ConfigMap and restart KServe (for example, by deleting the Pod or by scaling the Deployment down to 0 and back to 1).
For RHEL for Edge and RHEL Image Mode systems, create a new manifest and make sure it is applied after /usr/lib/microshift/manifests.d/010-microshift-ai-model-serving-kserve.
- AI Model Serving on MicroShift is only available on the x86_64 platform.
- AI Model Serving on MicroShift supports a very specific subset of the RHOAI Operator components.
- You must secure the exposed model server's endpoint (e.g. OAUTH2).
- Not all model servers support IPv6.
- Only OCI (ModelCar) model delivery system is tested and supported
- Because of a bug in kserve (to be ported to RHOAI), rebooting a MicroShift host can result in the model server failing if it was using ModelCar (a model in an OCI image).
- Because of MicroShift's architecture, installing the microshift-ai-model-serving RPM before running systemctl start microshift for the first time can cause MicroShift to fail to start. However, MicroShift will automatically restart and then start successfully. See OCPBUGS-51365.
- Currently, ClusterServingRuntimes are not supported by RHOAI, which means that you will need to copy the ServingRuntime shipped within the package to your workload's namespace.