Concise walkthroughs for each deployment configuration. Every example follows the same pattern: generate, build, deploy, test. For a full end-to-end tutorial, see Getting Started.
Deploy a scikit-learn model with Flask serving on a CPU instance.
```bash
yo @aws/ml-container-creator sklearn-flask-demo \
--deployment-config=http-flask \
--engine=sklearn \
--model-format=pkl \
--include-sample-model \
--deployment-target=managed-inference \
--instance-type=ml.m5.large \
--build-target=codebuild \
--region=us-east-1 \
--skip-prompts
```

Copy your own model into the project (or use the generated sample).
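If you trained the model yourself, a dump along these lines produces the pickle file; this is only a sketch, and it assumes plain pickle serialization since the actual loading code lives in the generated project:

```python
# Illustrative only: train a small scikit-learn model and pickle it.
# The generated serving code determines how the file is really loaded.
import pickle

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200).fit(X, y)

with open("model.pkl", "wb") as f:
    pickle.dump(model, f)
```

Then copy the file into place: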
```bash
cp /path/to/model.pkl code/model.pkl
```

Build, push, deploy, and test:

```bash
./do/build
./do/push
./do/deploy
./do/test
```

Deploy an XGBoost model with FastAPI serving.
```bash
yo @aws/ml-container-creator xgboost-fastapi-demo \
--deployment-config=http-fastapi \
--engine=xgboost \
--model-format=json \
--include-sample-model \
--deployment-target=managed-inference \
--instance-type=ml.m5.large \
--build-target=codebuild \
--region=us-east-1 \
--skip-prompts
```

```bash
./do/build && ./do/push && ./do/deploy && ./do/test
```
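If you are bringing your own model instead of relying on --include-sample-model, XGBoost's native save_model call produces the JSON file that --model-format=json expects. A sketch, with an illustrative dataset and output path:

```python
# Illustrative only: train a small XGBoost model and save it in the
# native JSON format selected by --model-format=json.
import xgboost as xgb
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
model = xgb.XGBClassifier(n_estimators=20)
model.fit(X, y)
model.save_model("model.json")  # copy into the generated project before building
```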
Deploy a TensorFlow SavedModel with Flask serving.

```bash
yo @aws/ml-container-creator tf-flask-demo \
--deployment-config=http-flask \
--engine=tensorflow \
--model-format=SavedModel \
--include-sample-model \
--deployment-target=managed-inference \
--instance-type=ml.m5.large \
--build-target=codebuild \
--region=us-east-1 \
--skip-prompts
```

```bash
./do/build && ./do/push && ./do/deploy && ./do/test
```

Deploy an LLM with vLLM. A GPU instance is required.
```bash
yo @aws/ml-container-creator vllm-demo \
--deployment-config=transformers-vllm \
--model-name=openai/gpt-oss-20b \
--deployment-target=managed-inference \
--instance-type=ml.g6.12xlarge \
--build-target=codebuild \
--region=us-east-1 \
--skip-prompts
```

LLM containers are large. Use CodeBuild for the image build:

```bash
./do/submit # Build and push via CodeBuild
./do/deploy
./do/test
```

For gated models (e.g., Llama), add --hf-token='$HF_TOKEN' to the generate command and export the token in your environment. See HuggingFace Authentication.
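For example, reusing the vLLM flags from above with a gated Llama model (the project name and token value are placeholders):

```bash
export HF_TOKEN='hf_xxxxxxxxxxxxxxxxxxxx'   # your HuggingFace access token

yo @aws/ml-container-creator llama-vllm-demo \
--deployment-config=transformers-vllm \
--model-name=meta-llama/Llama-3.2-3B-Instruct \
--hf-token='$HF_TOKEN' \
--deployment-target=managed-inference \
--instance-type=ml.g6.12xlarge \
--build-target=codebuild \
--region=us-east-1 \
--skip-prompts
```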
Deploy an LLM with SGLang. Same workflow as vLLM with a different deployment config.
```bash
yo @aws/ml-container-creator sglang-demo \
--deployment-config=transformers-sglang \
--model-name=openai/gpt-oss-20b \
--deployment-target=managed-inference \
--instance-type=ml.g6.12xlarge \
--build-target=codebuild \
--region=us-east-1 \
--skip-prompts
```

```bash
./do/submit && ./do/deploy && ./do/test
```

Deploy an LLM with NVIDIA TensorRT-LLM. Requires NGC authentication for the base image and A10G or newer GPUs (ml.g5 instances, not ml.g6).
```bash
yo @aws/ml-container-creator trtllm-demo \
--deployment-config=transformers-tensorrt-llm \
--model-name=meta-llama/Llama-3.2-3B-Instruct \
--deployment-target=managed-inference \
--instance-type=ml.g5.12xlarge \
--build-target=codebuild \
--region=us-east-1 \
--skip-prompts
```

Set your NGC API key before building:

```bash
export NGC_API_KEY='your-ngc-api-key'
./do/submit && ./do/deploy && ./do/test
```

The generated container runs TensorRT-LLM on port 8081 behind an Nginx reverse proxy on port 8080 for SageMaker compatibility. Key environment variables for tuning:
| Variable | Default | Description |
|---|---|---|
| `TRTLLM_TP_SIZE` | 1 | Tensor parallelism (set to GPU count) |
| `TRTLLM_MAX_BATCH_SIZE` | 256 | Maximum batch size |
| `TRTLLM_MAX_INPUT_LEN` | 2048 | Maximum input token length |
| `TRTLLM_MAX_OUTPUT_LEN` | 512 | Maximum output token length |
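These are container-level environment variables, so you can also override them when smoke-testing the built image locally. A sketch only: the image URI and values are illustrative, and a GPU host with the NVIDIA container toolkit is assumed:

```bash
# Illustrative local run with 4-way tensor parallelism; replace the image URI
# with the one your build actually pushed to ECR.
docker run --rm --gpus all -p 8080:8080 \
  -e TRTLLM_TP_SIZE=4 \
  -e TRTLLM_MAX_BATCH_SIZE=128 \
  <account-id>.dkr.ecr.<region>.amazonaws.com/ml-container-creator:<tag>
```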
Deploy an LLM with AWS Large Model Inference (DJL-based). Uses serving.properties for configuration instead of environment variables.
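The generated serving.properties is where the DJL/LMI engine is configured; as a rough orientation, a minimal file built from common LMI options might look like the sketch below (the keys and values here are illustrative, not necessarily what the generator emits):

```properties
# Illustrative LMI configuration; the generated file may differ.
engine=Python
option.model_id=openai/gpt-oss-20b
option.tensor_parallel_degree=4
option.rolling_batch=vllm
option.max_rolling_batch_size=32
```

Generate the project as usual: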
```bash
yo @aws/ml-container-creator lmi-demo \
--deployment-config=transformers-lmi \
--model-name=openai/gpt-oss-20b \
--deployment-target=managed-inference \
--instance-type=ml.g5.12xlarge \
--build-target=codebuild \
--region=us-east-1 \
--skip-prompts
```

```bash
./do/submit && ./do/deploy && ./do/test
```

Deploy an LLM with Deep Java Library serving.
```bash
yo @aws/ml-container-creator djl-demo \
--deployment-config=transformers-djl \
--model-name=openai/gpt-oss-20b \
--deployment-target=managed-inference \
--instance-type=ml.g5.12xlarge \
--build-target=codebuild \
--region=us-east-1 \
--skip-prompts
```

```bash
./do/submit && ./do/deploy && ./do/test
```

Deploy XGBoost or LightGBM models on NVIDIA Triton Inference Server using the Forest Inference Library (FIL) backend.
```bash
yo @aws/ml-container-creator triton-fil-demo \
--deployment-config=triton-fil \
--model-format=json \
--deployment-target=managed-inference \
--instance-type=ml.g5.xlarge \
--build-target=codebuild \
--region=us-east-1 \
--skip-prompts
```

The generator creates a Triton model repository layout with config.pbtxt. Place your model file in the generated model_repository/ directory before building.
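The repository follows Triton's standard layout; an illustrative example for an XGBoost JSON model (the model directory name and expected file name are set by the generated config.pbtxt, so check it before placing your file):

```
model_repository/
└── fil/
    ├── config.pbtxt
    └── 1/
        └── xgboost.json
```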
```bash
./do/build && ./do/push && ./do/deploy && ./do/test
```

Deploy ONNX models on Triton.
```bash
yo @aws/ml-container-creator triton-onnx-demo \
--deployment-config=triton-onnxruntime \
--deployment-target=managed-inference \
--instance-type=ml.g5.xlarge \
--build-target=codebuild \
--region=us-east-1 \
--skip-prompts
```

```bash
./do/build && ./do/push && ./do/deploy && ./do/test
```

Deploy custom Python models on Triton. The Python backend gives full control over preprocessing, inference, and postprocessing logic.
```bash
yo @aws/ml-container-creator triton-python-demo \
--deployment-config=triton-python \
--deployment-target=managed-inference \
--instance-type=ml.g5.xlarge \
--build-target=codebuild \
--region=us-east-1 \
--skip-prompts
```

Edit the generated model.py in the model repository to implement your inference logic, then build and deploy.
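The file must implement Triton's Python backend interface. A minimal sketch, where the tensor names INPUT0 and OUTPUT0 are placeholders that must match the generated config.pbtxt:

```python
# Minimal Triton Python backend sketch; tensor names are placeholders.
import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        # Load your model artifact here (path and format depend on your project).
        self.model = None

    def execute(self, requests):
        # Triton passes a batch of requests; return one response per request.
        responses = []
        for request in requests:
            data = pb_utils.get_input_tensor_by_name(request, "INPUT0").as_numpy()
            result = data  # replace with real inference, e.g. self.model.predict(data)
            out = pb_utils.Tensor("OUTPUT0", result.astype(np.float32))
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        return responses
```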
Any of the examples above can use CodeBuild for the image build instead of building locally. Set --build-target=codebuild during generation, then use ./do/submit instead of ./do/build + ./do/push:
```bash
./do/submit # Creates CodeBuild project, uploads source, builds, pushes to ECR
./do/deploy # Deploy the CodeBuild-built image
./do/test # Validate the endpoint
```

./do/submit automatically creates the CodeBuild project, IAM service role, and S3 source bucket on first run. All projects share a single ECR repository (ml-container-creator) with project-specific image tags.
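To check what a build pushed, for example, you can list the tags in that shared repository with the AWS CLI (region shown as an example):

```bash
aws ecr list-images --repository-name ml-container-creator --region us-east-1
```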
Any of the examples above can target HyperPod EKS instead of managed inference. Set --deployment-target=hyperpod-eks and provide your cluster details:
```bash
yo @aws/ml-container-creator hyperpod-demo \
--deployment-config=transformers-vllm \
--model-name=openai/gpt-oss-20b \
--deployment-target=hyperpod-eks \
--hyperpod-cluster=my-cluster \
--hyperpod-namespace=ml-serving \
--instance-type=ml.g5.12xlarge \
--build-target=codebuild \
--region=us-east-1 \
--skip-prompts
```

```bash
./do/submit && ./do/deploy && ./do/test hyperpod
```

Tear down resources when done to stop incurring charges:

```bash
./do/clean endpoint # Delete SageMaker endpoint, config, and inference component
./do/clean ecr # Delete ECR images
./do/clean codebuild # Delete CodeBuild project and IAM role
./do/clean all # All of the above
```