AI Model Serving on MicroShift

AI Model Serving on MicroShift is a single-model serving platform for AI models. It includes a limited subset of Red Hat OpenShift AI (RHOAI): KServe in Raw Deployment mode and the namespace-specific ServingRuntime object type. This lets you train your models in the cloud and serve them on the edge.

Definitions

ServingRuntime
- For more information about ServingRuntimes, refer to the upstream KServe documentation.
InferenceService

Supported model-serving runtimes

AI Model Serving on MicroShift currently ships with the following model-serving runtimes:

OpenVINO Model Server
vLLM ServingRuntime for KServe
Caikit Text Generation Inference Server (Caikit-TGIS) ServingRuntime for KServe
Caikit Standalone ServingRuntime for KServe
Text Generation Inference Server (TGIS) Standalone ServingRuntime for KServe
vLLM ServingRuntime with Gaudi accelerators support for KServe
vLLM ROCm ServingRuntime for KServe

Refer to the RHOAI documentation for details about the model-serving runtimes.

General usage overview

Develop, train, test, and prepare the model for serving
Configure the OS and MicroShift for the hardware (driver and device plugin)
Install the microshift-ai-model-serving package and restart MicroShift
Package the model into an OCI image (ModelCar)
Select a suitable model-serving runtime (model server)
Copy the ServingRuntime custom resource from redhat-ods-applications to your own namespace
Create the InferenceService object
Create the Route object
Make requests against the model server

Setting up hardware - drivers and device plugins

To enable GPU/hardware accelerators for MicroShift, follow the Partner's guidance on installing either an Operator or a driver for RHEL plus a device plugin for Kubernetes. Operators might be more convenient, but using only the driver and device plugin may be more resource efficient.

MicroShift cannot provide support for a Partner's procedure. For troubleshooting, consult the Partner's documentation or product support. The following links are examples and pointers only. These links might not include everything you need, but are a good place to start.

Step-by-step guide

Below is an example usage of AI Model Serving for MicroShift. It uses the OpenVINO Model Server (OVMS) and the ResNet-50 model. Note: OVMS can run on the CPU, so configuring an additional hardware accelerator is not included in this example.

Installing AI Model Serving for MicroShift

The microshift-ai-model-serving RPM contains manifests that deploy KServe with Raw Deployment mode enabled and ServingRuntime objects in the redhat-ods-applications namespace.

To install AI Model Serving for MicroShift, run the following command:

sudo dnf install -y microshift-ai-model-serving

After installing the package and restarting MicroShift, a new Pod should be running in the redhat-ods-applications namespace:

$ oc get pods -n redhat-ods-applications
NAME                                        READY   STATUS    RESTARTS   AGE
kserve-controller-manager-7fc9fc688-kttmm   1/1     Running   0          1h

You can also install the release info package. It contains a JSON file with image references, which is useful for offline procedures or for deploying a copy of a ServingRuntime to your namespace during a bootc image build:

sudo dnf install -y microshift-ai-model-serving-release-info

Packaging the AI model into an OCI image (ModelCar)

You can package your model into an OCI image and make use of what is known as the ModelCar approach. This can help you set up offline environments because the model can be embedded just like any other container image.

The exact directory structure depends on the model server. Below is an example Containerfile with a ResNet-50 model that is compatible with the OpenVINO Model Server (OVMS) and is used in the OVMS examples.

See the "How to build a ModelCar container" section of the article Build and deploy a ModelCar container in OpenShift AI for guidance on building an OCI image with a model from Hugging Face that is suitable for a vLLM model server.

FROM registry.access.redhat.com/ubi9/ubi-minimal:latest
RUN microdnf install -y wget && microdnf clean all
RUN mkdir -p /models/1 && chmod -R 755 /models/1
RUN wget -q -P /models/1 \
  https://storage.openvinotoolkit.org/repositories/open_model_zoo/2022.1/models_bin/2/resnet50-binary-0001/FP32-INT1/resnet50-binary-0001.bin \
  https://storage.openvinotoolkit.org/repositories/open_model_zoo/2022.1/models_bin/2/resnet50-binary-0001/FP32-INT1/resnet50-binary-0001.xml

You can build it and push it to your registry:

podman build -t IMAGE_REF .
podman push IMAGE_REF

For this example, we'll build the image locally and use it right away without pushing it to a registry first. sudo is required so that the image is stored in root's container storage, which CRI-O and Podman share, making it usable by MicroShift.

For offline use cases, be sure to include a tag other than latest. If the latest tag is used, the container that fetches and sets up the model is configured with imagePullPolicy set to Always, and the local image is ignored. If you use any other tag, imagePullPolicy is set to IfNotPresent.

$ sudo podman build -t ovms-resnet50:test .
STEP 1/4: FROM registry.access.redhat.com/ubi9/ubi-minimal:latest
Trying to pull registry.access.redhat.com/ubi9/ubi-minimal:latest...
Getting image source signatures
Checking if image destination supports signatures
Copying blob 533b69cfd644 done   |
Copying blob 863e9a7e2102 done   |
Copying config 098048e6f9 done   |
Writing manifest to image destination
Storing signatures
STEP 2/4: RUN microdnf install -y wget && microdnf clean all
<< SNIP >>
--> 4c74352ad42e
STEP 3/4: RUN mkdir -p /models/1 && chmod -R 755 /models/1
--> bfd31acb1e81
STEP 4/4: RUN wget -q -P /models/1   https://storage.openvinotoolkit.org/repositories/open_model_zoo/2022.1/models_bin/2/resnet50-binary-0001/FP32-INT1/resnet50-binary-0001.bin   https://storage.openvinotoolkit.org/repositories/open_model_zoo/2022.1/models_bin/2/resnet50-binary-0001/FP32-INT1/resnet50-binary-0001.xml
COMMIT ovms-resnet50:test
--> 375b265c1c4b
Successfully tagged localhost/ovms-resnet50:test
375b265c1c4bc6f0a059c8739fb2b3a46e1b563728f6d9c51f26f29bb2c87c3e

Run the following command to make sure the image exists:

$ sudo podman images | grep ovms-resnet50
localhost/ovms-resnet50          test          375b265c1c4b  3 minutes ago  136 MB
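
The tag rule described earlier is easy to get wrong in automation, so it can help to check an image reference before deploying. The following is a minimal Python sketch of that rule. The function name expected_pull_policy is hypothetical, and the handling of untagged and digest-pinned references follows the usual Kubernetes defaulting, which is an assumption here because the text only covers tagged images.

```python
def expected_pull_policy(image_ref: str) -> str:
    """Guess the imagePullPolicy that results from an image reference.

    Covers the rule from the text (latest -> Always, other tag -> IfNotPresent).
    ASSUMPTION: untagged references behave like ':latest' and digest-pinned
    references behave like a fixed tag, as in Kubernetes defaulting.
    """
    # A digest pins the image content, so it is never treated as "latest".
    if "@" in image_ref:
        return "IfNotPresent"
    # Only look at the final path component, so that a registry port
    # (for example localhost:5000/image) is not mistaken for a tag.
    last_component = image_ref.rsplit("/", 1)[-1]
    _name, separator, tag = last_component.partition(":")
    if not separator or tag == "latest":
        return "Always"
    return "IfNotPresent"


print(expected_pull_policy("localhost/ovms-resnet50:test"))
print(expected_pull_policy("localhost/ovms-resnet50:latest"))
```

The function also accepts the oci:// form used in storageUri, because only the last path component is inspected.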

Creating your namespace

Use the following command to create a new namespace that will be used throughout this guide:

oc create ns ai-demo

Deploying ServingRuntime to the workload's namespace

First, you must select the ServingRuntime that supports the format of your model. Then, you need to create the ServingRuntime in your workload's namespace.

If the cluster is already running, you can export the desired ServingRuntime to a file and tweak it. If the cluster is not running or you want to prepare a manifest, you can use the original definition on the disk.

For more information about ServingRuntimes, refer to the RHOAI documentation.

Create ServingRuntime based on installed manifests and release info

This approach does not require a live cluster, so it can be part of CI/CD automation.

Overview of the procedure:

Install microshift-ai-model-serving-release-info RPM
Extract the image reference of a particular ServingRuntime from the release info file
Make a copy of the chosen ServingRuntime YAML file
Add the actual image reference to the image: parameter field value
Create the object using the file or make it part of a manifest (kustomization)

The following example shows how to reuse the manifest files from the microshift-ai-model-serving package to re-create the OVMS ServingRuntime in the workload's namespace:

# Get image reference for the 'ovms-image'
OVMS_IMAGE="$(jq -r '.images | with_entries(select(.key == "ovms-image")) | .[]' /usr/share/microshift/release/release-ai-model-serving-"$(uname -i)".json)"

# Duplicate the original ServingRuntime yaml
cp /usr/lib/microshift/manifests.d/050-microshift-ai-model-serving-runtimes/ovms-kserve.yaml ./ovms-kserve.yaml

# Update the image reference
sed -i "s,image: ovms-image,image: ${OVMS_IMAGE}," ./ovms-kserve.yaml

Then you can re-create the ServingRuntime in a custom namespace:

oc create -n ai-demo -f ./ovms-kserve.yaml
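
For pipelines that prefer a script over jq and sed, the same steps can be sketched in Python. This is a rough equivalent, not part of the packaged tooling: the function name render_serving_runtime is hypothetical, and the only assumptions about the files are the ones visible in the commands above (a top-level images object in the release info JSON and an image: ovms-image placeholder in the manifest). The image reference in the demo is made up.

```python
import json


def render_serving_runtime(release_info_json: str, manifest_yaml: str,
                           image_key: str = "ovms-image") -> str:
    """Replace the 'image: <image_key>' placeholder with the real reference."""
    images = json.loads(release_info_json)["images"]
    image_ref = images[image_key]  # raises KeyError if the key is absent

    placeholder = f"image: {image_key}"
    if placeholder not in manifest_yaml:
        raise ValueError(f"placeholder {placeholder!r} not found in manifest")
    # Unlike the sed command, this replaces every occurrence in the file.
    return manifest_yaml.replace(placeholder, f"image: {image_ref}")


# Demo with in-memory stand-ins for the two files (hypothetical content).
release_info = '{"images": {"ovms-image": "registry.example.com/ovms:1.0"}}'
manifest = "containers:\n- name: kserve-container\n  image: ovms-image\n"
print(render_serving_runtime(release_info, manifest))
```

Reading the two inputs from the paths used in the shell example and writing the result to ./ovms-kserve.yaml reproduces the procedure without requiring jq on the build host.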

Alternatively, if the preceding procedure is part of a bootc Containerfile and the ServingRuntime ends up as part of a new manifest, the namespace can be set in the kustomization.yaml:

namespace: ai-demo

Creating the InferenceService custom resource

Next, we need to create an InferenceService custom resource (CR). The InferenceService instructs KServe on how to create a Deployment for serving the model. KServe selects the ServingRuntime based on the modelFormat specified in the InferenceService.

It's possible to add extra arguments that will be passed to the model server using .spec.predictor.model.args.

The following is an example of an InferenceService with a model in the openvino_ir format. It features an additional argument, --layout=NHWC:NCHW to make OVMS accept the request input data in a different layout than the model was originally exported with. Extra args are passed through to the OVMS container.

For more information about the InferenceService CR, refer to the RHOAI documentation.

Example InferenceService object with an openvino_ir model format

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: ovms-resnet50
spec:
  predictor:
    model:
      protocolVersion: v2
      modelFormat:
        name: openvino_ir
      storageUri: "oci://localhost/ovms-resnet50:test"
      args:
      - --layout=NHWC:NCHW

Save the InferenceService example to a file, then create it on the cluster:

$ oc create -n ai-demo -f ./FILE.yaml
inferenceservice.serving.kserve.io/ovms-resnet50 created
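
When several models need the same kind of manifest, generating it can be less error-prone than copying YAML. The sketch below builds the same InferenceService as the example above and prints it as JSON, which oc create -f also accepts. The helper name build_inference_service is hypothetical; the field names are taken from the example manifest.

```python
import json


def build_inference_service(name, model_format, storage_uri,
                            args=None, protocol_version="v2"):
    """Return an InferenceService manifest as a plain dictionary."""
    model = {
        "protocolVersion": protocol_version,
        "modelFormat": {"name": model_format},
        "storageUri": storage_uri,
    }
    if args:
        # Extra arguments are passed through to the model server container.
        model["args"] = list(args)
    return {
        "apiVersion": "serving.kserve.io/v1beta1",
        "kind": "InferenceService",
        "metadata": {"name": name},
        "spec": {"predictor": {"model": model}},
    }


manifest = build_inference_service(
    name="ovms-resnet50",
    model_format="openvino_ir",
    storage_uri="oci://localhost/ovms-resnet50:test",
    args=["--layout=NHWC:NCHW"],
)
print(json.dumps(manifest, indent=2))
```

Redirecting the output to a file gives a manifest that can be created in the ai-demo namespace in the same way as the YAML version.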

Soon, a Deployment and a Pod should appear in that namespace. Depending on the size of ServingRuntime's image and the size of the ModelCar OCI image, it may take a while for the Pod to become ready.

$ oc get -n ai-demo deployment
NAME                      READY   UP-TO-DATE   AVAILABLE   AGE
ovms-resnet50-predictor   1/1     1            1           72s

$ oc rollout status -n ai-demo deployment ovms-resnet50-predictor
deployment "ovms-resnet50-predictor" successfully rolled out

$ oc get -n ai-demo pod
NAME                                       READY   STATUS    RESTARTS      AGE
ovms-resnet50-predictor-6fdb566b7f-bc9k5   2/2     Running   1 (72s ago)   74s

KServe also creates a Service:

$ oc get svc -n ai-demo
NAME                      TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
ovms-resnet50-predictor   ClusterIP   None         <none>        80/TCP    119s

Specifying hardware accelerators

An InferenceService can include many other options. For example, the CR can contain a resources section that is passed to the Deployment and then to the Pod, so that the model server gets access to the hardware (thanks to the device plugin). This example requests an NVIDIA device:

spec:
  predictor:
    model:
      resources:
        limits:
          nvidia.com/gpu: 1
        requests:
          nvidia.com/gpu: 1
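
If InferenceService manifests are generated or post-processed by a script, the resources section can be merged in programmatically. The following sketch assumes the manifest is already loaded as a Python dictionary; the helper name with_accelerator is hypothetical, and the resource name depends on the device plugin in use.

```python
import copy


def with_accelerator(isvc: dict, resource_name: str, count: int = 1) -> dict:
    """Return a copy of the manifest that requests `count` accelerator devices."""
    patched = copy.deepcopy(isvc)  # leave the caller's manifest untouched
    model = (patched.setdefault("spec", {})
                    .setdefault("predictor", {})
                    .setdefault("model", {}))
    resources = model.setdefault("resources", {})
    # Set limits and requests to the same value, as in the YAML fragment above.
    for section in ("limits", "requests"):
        resources.setdefault(section, {})[resource_name] = count
    return patched


base = {"spec": {"predictor": {"model": {"storageUri": "oci://localhost/ovms-resnet50:test"}}}}
print(with_accelerator(base, "nvidia.com/gpu")["spec"]["predictor"]["model"]["resources"])
```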

For the complete InferenceService specification, refer to the KServe API Reference.

Creating a Route

Note: You don't need to wait for the model server Pod to become ready before creating the Route.

Create an OpenShift Route CR to expose the service. You can either use the oc expose svc command or create a definition in a YAML file and apply it.

$ oc expose svc -n ai-demo ovms-resnet50-predictor
route.route.openshift.io/ovms-resnet50-predictor exposed

$ oc get route -n ai-demo
NAME                      HOST                                               ADMITTED   SERVICE                   TLS
ovms-resnet50-predictor   ovms-resnet50-predictor-ai-demo.apps.example.com   True       ovms-resnet50-predictor

Querying the model server

You can now check whether your model is ready for inference. We'll reuse the OVMS examples to test the inference.

Get the IP of the MicroShift cluster and assign it to the IP variable. Use the HOST value of the Route and assign it to the DOMAIN variable. Next, run the following curl command. Alternatively, instead of using the --connect-to "${DOMAIN}::${IP}:" flag, you can use real DNS, or add the IP and the domain to the /etc/hosts file.

DOMAIN=ovms-resnet50-predictor-ai-demo.apps.example.com
IP=192.168.0.10
curl -i "${DOMAIN}/v2/models/ovms-resnet50/ready" \
    --connect-to "${DOMAIN}::${IP}:"

Response code 200 is expected. Example output:

HTTP/1.1 200 OK
content-type: application/json
date: Wed, 12 Mar 2025 16:01:32 GMT
content-length: 0
set-cookie: 56bb4b6df4f80f0b59f56aa0a5a91c1a=4af1408b4a1c40925456f73033d4a7d1; path=/; HttpOnly
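
The readiness check can also be scripted, for example in a deployment test. The sketch below approximates curl's --connect-to behaviour with Python's standard library by connecting to the cluster IP while sending the Route's hostname in the Host header. It assumes plain HTTP on port 80, as in this example; the function name model_is_ready is hypothetical.

```python
import http.client


def model_is_ready(ip: str, domain: str, model: str,
                   port: int = 80, timeout: float = 5.0) -> bool:
    """Return True if GET /v2/models/<model>/ready answers with HTTP 200."""
    # Connect to the IP directly; the Host header lets the router match the Route.
    conn = http.client.HTTPConnection(ip, port, timeout=timeout)
    try:
        conn.request("GET", f"/v2/models/{model}/ready", headers={"Host": domain})
        return conn.getresponse().status == 200
    finally:
        conn.close()
```

With the values from this guide, the call would be model_is_ready("192.168.0.10", "ovms-resnet50-predictor-ai-demo.apps.example.com", "ovms-resnet50"). A Route with TLS would need additional handling that this sketch does not cover.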

We can also query the model's metadata:

curl "${DOMAIN}/v2/models/ovms-resnet50" \
    --connect-to "${DOMAIN}::${IP}:"

Example output:

{"name":"ovms-resnet50","versions":["1"],"platform":"OpenVINO","inputs":[{"name":"0","datatype":"FP32","shape":[1,224,224,3]}],"outputs":[{"name":"1463","datatype":"FP32","shape":[1,1000]}]}

Let's try querying the actual model - the following example verifies whether the inference is in accordance with the training data.

First, download an image of a bee from the OpenVINO examples:

curl -O https://raw.githubusercontent.com/openvinotoolkit/model_server/main/demos/common/static/images/bee.jpeg

Next, create the request data:

Start with an inference header in JSON format.
Get the size of the header. It must be passed to OVMS later in the form of an HTTP header.
Append the size of the image to the request file. OVMS expects 4 bytes (little endian). The following command uses the xxd utility, which is part of the vim-common package.
Append the image to the request file:

IMAGE=./bee.jpeg
REQ=./request.json

# Add an inference header
echo -n '{"inputs" : [{"name": "0", "shape": [1], "datatype": "BYTES"}]}' > "${REQ}"

# Get the size of the inference header
HEADER_LEN="$(stat -c %s "${REQ}")"

# Add size of the data (image) in binary format (4 bytes, little endian)
printf "%08X" $(stat --format=%s "${IMAGE}") | sed 's/\(..\)/\1\n/g' | tac | tr -d '\n' | xxd -r -p >> "${REQ}"

# Add the data, i.e. the image
cat "${IMAGE}" >> "${REQ}"
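
The same request layout (JSON inference header, then a 4-byte little-endian length, then the raw image) can be assembled in Python without xxd. This is a sketch of the layout described above; the function name build_binary_infer_request is hypothetical, and the input name "0" is taken from the model metadata shown earlier.

```python
import json
import struct
from typing import Tuple


def build_binary_infer_request(image_bytes: bytes,
                               input_name: str = "0") -> Tuple[bytes, int]:
    """Return (request_body, header_length) for a binary inference request."""
    header = json.dumps(
        {"inputs": [{"name": input_name, "shape": [1], "datatype": "BYTES"}]}
    ).encode("utf-8")
    # "<I" packs an unsigned 32-bit integer in little-endian byte order.
    length_prefix = struct.pack("<I", len(image_bytes))
    return header + length_prefix + image_bytes, len(header)


body, header_len = build_binary_infer_request(b"\x01\x02\x03")
print(header_len, body[header_len:header_len + 4])
```

The returned header length is the value to send in the Inference-Header-Content-Length HTTP header. The JSON whitespace may differ from the shell version, which is why the length is computed from the encoded header rather than hard-coded.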

Now we can make an inference request against the model server that is using the ovms-resnet50 model.

curl \
    --data-binary "@./request.json" \
    --header "Inference-Header-Content-Length: ${HEADER_LEN}" \
    "${DOMAIN}/v2/models/ovms-resnet50/infer" \
    --connect-to "${DOMAIN}::${IP}:" > response.json

The response, saved to response.json, is a JSON object with the following structure. The contents of .outputs[0].data are omitted from the example for brevity.

{
    "model_name": "ovms-resnet50",
    "model_version": "1",
    "outputs": [{
            "name": "1463",
            "shape": [1, 1000],
            "datatype": "FP32",
            "data": [ ....... ]
        }]
}

To verify the response, we'll use Python. We need to obtain the index of the largest element in .outputs[0].data.

import json

with open('response.json') as f:
    response = json.load(f)

data = response["outputs"][0]["data"]
argmax = data.index(max(data))
print(argmax)

The output of the Python script we just ran should be 309. We can validate it against resnet's input data:

../../../../demos/common/static/images/bee.jpeg 309

You can try querying the model using other images mentioned in the resnet's input data.
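
When trying other images, it can be useful to see more than the single best class. The following sketch lists the k highest-scoring indices from .outputs[0].data; the helper name top_k is hypothetical, and the scores in the demo are made up.

```python
import heapq


def top_k(scores, k=5):
    """Return the k (index, score) pairs with the highest scores, best first."""
    return heapq.nlargest(k, enumerate(scores), key=lambda pair: pair[1])


# Made-up scores; with a real response, pass response["outputs"][0]["data"].
print(top_k([0.1, 0.7, 0.2], k=2))
```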

Getting the model server's metrics

To obtain the model server's Prometheus metrics, make a request to the /metrics endpoint:

curl "${DOMAIN}/metrics" \
    --connect-to "${DOMAIN}::${IP}:"

Partial example output:

# HELP ovms_requests_success Number of successful requests to a model or a DAG.
# TYPE ovms_requests_success counter
ovms_requests_success{api="KServe",interface="REST",method="ModelReady",name="ovms-resnet50"} 4
ovms_requests_success{api="KServe",interface="REST",method="ModelMetadata",name="ovms-resnet50",version="1"} 1
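
For a quick check without a full monitoring stack, the text output can be summarized with a short script. The sketch below totals one counter per value of a chosen label. It handles only the simple sample lines shown above, not the entire Prometheus exposition format, and the function name sum_by_label is hypothetical.

```python
import re
from collections import defaultdict

# metric name, optional {labels}, then the sample value
_SAMPLE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
_LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def sum_by_label(metrics_text: str, metric: str, label: str) -> dict:
    """Sum the samples of `metric`, grouped by the value of `label`."""
    totals = defaultdict(float)
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE.match(line)
        if not match or match.group("name") != metric:
            continue
        labels = dict(_LABEL.findall(match.group("labels") or ""))
        totals[labels.get(label, "")] += float(match.group("value"))
    return dict(totals)


sample = '''# TYPE ovms_requests_success counter
ovms_requests_success{api="KServe",interface="REST",method="ModelReady",name="ovms-resnet50"} 4
ovms_requests_success{api="KServe",interface="REST",method="ModelMetadata",name="ovms-resnet50",version="1"} 1
'''
print(sum_by_label(sample, "ovms_requests_success", "method"))
```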

Scraping model server's metrics with microshift-observability (OTEL)

To scrape the model server's metrics with microshift-observability:

Your OTEL configuration needs to include the prometheus receiver (see the opentelemetry-collector-large.yaml preset for an example).

Your InferenceService CR needs to include the following annotation, which is passed through to the Pod:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  annotations:
    prometheus.io/scrape: "true"

Other Inference Protocol endpoints

To learn more about KServe endpoints, see the upstream documentation.

Appendix

Overriding kserve configuration

To override KServe settings, make a copy of the existing ConfigMap, tweak the desired settings, and overwrite the existing ConfigMap.

Settings are stored in a ConfigMap named inferenceservice-config in the redhat-ods-applications namespace. Alternatively, you can copy the ConfigMap from /usr/lib/microshift/manifests.d/010-microshift-ai-model-serving-kserve/inferenceservice-config-microshift-patch.yaml.

After tweaking it, you must apply the ConfigMap and restart KServe (for example, by deleting the Pod or by scaling the Deployment down to 0 and back to 1). For RHEL for Edge and RHEL Image Mode systems, create a new manifest and make sure it is applied after /usr/lib/microshift/manifests.d/010-microshift-ai-model-serving-kserve.

Limitations

AI Model Serving on MicroShift is only available on the x86_64 platform.
AI Model Serving on MicroShift supports a very specific subset of the RHOAI Operator components.
You must secure the exposed model server's endpoint (e.g. OAUTH2).
Not all model servers support IPv6.
Only the OCI (ModelCar) model delivery system is tested and supported.

Known issues

Because of a bug in KServe (the fix is yet to be ported to RHOAI), rebooting a MicroShift host can result in the model server failing if it was using ModelCar (a model in an OCI image).
Because of MicroShift's architecture, installing the microshift-ai-model-serving RPM before running systemctl start microshift for the first time can cause MicroShift to fail to start. However, MicroShift will automatically restart and then start successfully. See OCPBUGS-51365.
Currently, ClusterServingRuntimes are not supported by RHOAI, which means that you will need to copy the ServingRuntime shipped within the package to your workload's namespace.
