
Examples

Concise walkthroughs for each deployment configuration. Every example follows the same pattern: generate, build, deploy, test. For a full end-to-end tutorial, see Getting Started.

HTTP: sklearn + Flask

Deploy a scikit-learn model with Flask serving on a CPU instance.

yo @aws/ml-container-creator sklearn-flask-demo \
  --deployment-config=http-flask \
  --engine=sklearn \
  --model-format=pkl \
  --include-sample-model \
  --deployment-target=managed-inference \
  --instance-type=ml.m5.large \
  --build-target=codebuild \
  --region=us-east-1 \
  --skip-prompts

Copy your own model into the project (or use the generated sample):

cp /path/to/model.pkl code/model.pkl

Build, push, deploy:

./do/build
./do/push
./do/deploy
./do/test
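./do/test exercises the deployed endpoint for you. To invoke it by hand, a minimal sketch with the AWS CLI looks like the following; the endpoint name and payload shape are placeholders, and the real values depend on your project name and your serving code's input schema:

# Hypothetical endpoint name and payload; adjust to your project and model input format.
aws sagemaker-runtime invoke-endpoint \
  --endpoint-name sklearn-flask-demo \
  --content-type application/json \
  --cli-binary-format raw-in-base64-out \
  --body '{"instances": [[1.0, 2.0, 3.0, 4.0]]}' \
  response.json
cat response.json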

HTTP: XGBoost + FastAPI

Deploy an XGBoost model with FastAPI serving.

yo @aws/ml-container-creator xgboost-fastapi-demo \
  --deployment-config=http-fastapi \
  --engine=xgboost \
  --model-format=json \
  --include-sample-model \
  --deployment-target=managed-inference \
  --instance-type=ml.m5.large \
  --build-target=codebuild \
  --region=us-east-1 \
  --skip-prompts
./do/build && ./do/push && ./do/deploy && ./do/test

HTTP: TensorFlow + Flask

Deploy a TensorFlow SavedModel with Flask serving.

yo @aws/ml-container-creator tf-flask-demo \
  --deployment-config=http-flask \
  --engine=tensorflow \
  --model-format=SavedModel \
  --include-sample-model \
  --deployment-target=managed-inference \
  --instance-type=ml.m5.large \
  --build-target=codebuild \
  --region=us-east-1 \
  --skip-prompts
./do/build && ./do/push && ./do/deploy && ./do/test

Transformers: vLLM

Deploy an LLM with vLLM. GPU instance required.

yo @aws/ml-container-creator vllm-demo \
  --deployment-config=transformers-vllm \
  --model-name=openai/gpt-oss-20b \
  --deployment-target=managed-inference \
  --instance-type=ml.g6.12xlarge \
  --build-target=codebuild \
  --region=us-east-1 \
  --skip-prompts

LLM containers are large. Use CodeBuild for the image build:

./do/submit    # Build and push via CodeBuild
./do/deploy
./do/test

For gated models (e.g., Llama), add --hf-token='$HF_TOKEN' and export the token in your environment. See HuggingFace Authentication.
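A sketch of the gated-model flow, mirroring the vLLM example above (the project name, model, and token value are placeholders):

export HF_TOKEN='hf_xxxxxxxxxxxxxxxx'    # placeholder; use your own Hugging Face access token

yo @aws/ml-container-creator llama-vllm-demo \
  --deployment-config=transformers-vllm \
  --model-name=meta-llama/Llama-3.2-3B-Instruct \
  --hf-token='$HF_TOKEN' \
  --deployment-target=managed-inference \
  --instance-type=ml.g6.12xlarge \
  --build-target=codebuild \
  --region=us-east-1 \
  --skip-prompts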

Transformers: SGLang

Deploy an LLM with SGLang. Same workflow as vLLM with a different deployment config.

yo @aws/ml-container-creator sglang-demo \
  --deployment-config=transformers-sglang \
  --model-name=openai/gpt-oss-20b \
  --deployment-target=managed-inference \
  --instance-type=ml.g6.12xlarge \
  --build-target=codebuild \
  --region=us-east-1 \
  --skip-prompts
./do/submit && ./do/deploy && ./do/test

Transformers: TensorRT-LLM

Deploy an LLM with NVIDIA TensorRT-LLM. Requires NGC authentication for the base image and A10G or newer GPUs (ml.g5 instances, not ml.g6).

yo @aws/ml-container-creator trtllm-demo \
  --deployment-config=transformers-tensorrt-llm \
  --model-name=meta-llama/Llama-3.2-3B-Instruct \
  --deployment-target=managed-inference \
  --instance-type=ml.g5.12xlarge \
  --build-target=codebuild \
  --region=us-east-1 \
  --skip-prompts

Set your NGC API key before building:

export NGC_API_KEY='your-ngc-api-key'
./do/submit && ./do/deploy && ./do/test

The generated container runs TensorRT-LLM on port 8081 behind an Nginx reverse proxy on port 8080 for SageMaker compatibility. Key environment variables for tuning:

Variable                Default   Description
TRTLLM_TP_SIZE          1         Tensor parallelism (set to GPU count)
TRTLLM_MAX_BATCH_SIZE   256       Maximum batch size
TRTLLM_MAX_INPUT_LEN    2048      Maximum input token length
TRTLLM_MAX_OUTPUT_LEN   512       Maximum output token length
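These are container-level environment variables. For a quick local smoke test of the built image you could override them with standard docker -e flags; the image tag below is a placeholder, and how the deploy scripts set them on the endpoint may differ:

# Local smoke test only; requires a local GPU and the image built above.
docker run --rm --gpus all -p 8080:8080 \
  -e TRTLLM_TP_SIZE=4 \
  -e TRTLLM_MAX_INPUT_LEN=4096 \
  <your-trtllm-image:tag>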

Transformers: LMI (Large Model Inference)

Deploy an LLM with AWS Large Model Inference (DJL-based). Uses serving.properties for configuration instead of environment variables.

yo @aws/ml-container-creator lmi-demo \
  --deployment-config=transformers-lmi \
  --model-name=openai/gpt-oss-20b \
  --deployment-target=managed-inference \
  --instance-type=ml.g5.12xlarge \
  --build-target=codebuild \
  --region=us-east-1 \
  --skip-prompts
./do/submit && ./do/deploy && ./do/test
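As noted above, LMI reads its settings from serving.properties rather than environment variables. A hypothetical sketch using common DJL-Serving LMI keys (the file the generator actually emits may differ):

engine=Python
option.model_id=openai/gpt-oss-20b
option.tensor_parallel_degree=4
option.rolling_batch=vllm
option.max_rolling_batch_size=32
option.dtype=fp16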

Transformers: DJL

Deploy an LLM with Deep Java Library serving.

yo @aws/ml-container-creator djl-demo \
  --deployment-config=transformers-djl \
  --model-name=openai/gpt-oss-20b \
  --deployment-target=managed-inference \
  --instance-type=ml.g5.12xlarge \
  --build-target=codebuild \
  --region=us-east-1 \
  --skip-prompts
./do/submit && ./do/deploy && ./do/test

Triton: FIL (Tree Models)

Deploy XGBoost or LightGBM models on NVIDIA Triton Inference Server using the Forest Inference Library backend.

yo @aws/ml-container-creator triton-fil-demo \
  --deployment-config=triton-fil \
  --model-format=json \
  --deployment-target=managed-inference \
  --instance-type=ml.g5.xlarge \
  --build-target=codebuild \
  --region=us-east-1 \
  --skip-prompts

The generator creates a Triton model repository layout with config.pbtxt. Place your model file in the generated model_repository/ directory before building.
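A typical repository layout looks like this; the model directory name and exact model filename are placeholders, so check the generated config.pbtxt for the names it expects:

model_repository/
  model/
    config.pbtxt
    1/
      model.json    # your XGBoost/LightGBM file; exact name per config.pbtxt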

./do/build && ./do/push && ./do/deploy && ./do/test

Triton: ONNX Runtime

Deploy ONNX models on Triton.

yo @aws/ml-container-creator triton-onnx-demo \
  --deployment-config=triton-onnxruntime \
  --deployment-target=managed-inference \
  --instance-type=ml.g5.xlarge \
  --build-target=codebuild \
  --region=us-east-1 \
  --skip-prompts
./do/build && ./do/push && ./do/deploy && ./do/test

Triton: Python Backend

Deploy custom Python models on Triton. The Python backend gives full control over preprocessing, inference, and postprocessing logic.

yo @aws/ml-container-creator triton-python-demo \
  --deployment-config=triton-python \
  --deployment-target=managed-inference \
  --instance-type=ml.g5.xlarge \
  --build-target=codebuild \
  --region=us-east-1 \
  --skip-prompts

Edit the generated model.py in the model repository to implement your inference logic, then build and deploy.
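As with the other Triton configurations:

./do/build && ./do/push && ./do/deploy && ./do/test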

CodeBuild CI/CD

Any of the examples above can use CodeBuild for the image build instead of building locally. Set --build-target=codebuild during generation, then use ./do/submit instead of ./do/build + ./do/push:

./do/submit    # Creates CodeBuild project, uploads source, builds, pushes to ECR
./do/deploy    # Deploy the CodeBuild-built image
./do/test      # Validate the endpoint

./do/submit automatically creates the CodeBuild project, IAM service role, and S3 source bucket on first run. All projects share a single ECR repository (ml-container-creator) with project-specific image tags.

HyperPod EKS Deployment

Any of the examples above can target HyperPod EKS instead of managed inference. Set --deployment-target=hyperpod-eks and provide your cluster details:

yo @aws/ml-container-creator hyperpod-demo \
  --deployment-config=transformers-vllm \
  --model-name=openai/gpt-oss-20b \
  --deployment-target=hyperpod-eks \
  --hyperpod-cluster=my-cluster \
  --hyperpod-namespace=ml-serving \
  --instance-type=ml.g5.12xlarge \
  --build-target=codebuild \
  --region=us-east-1 \
  --skip-prompts
./do/submit && ./do/deploy && ./do/test hyperpod
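./do/test hyperpod validates the deployment. To inspect the workload on the cluster directly, a standard kubectl check against the namespace from the flags above would look like this (assumes your kubeconfig already points at the HyperPod EKS cluster; the resource name is a guess based on the project name):

kubectl get pods -n ml-serving
kubectl logs -n ml-serving deploy/hyperpod-demo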

Cleanup

Tear down resources when done to stop incurring charges:

./do/clean endpoint   # Delete SageMaker endpoint, config, and inference component
./do/clean ecr        # Delete ECR images
./do/clean codebuild  # Delete CodeBuild project and IAM role
./do/clean all        # All of the above