A minimal, runnable example of launching a SageMaker training job for a PyTorch classifier on FashionMNIST. Two launch paths are covered:
- Inside AWS — a SageMaker Studio / notebook instance (role auto-resolved).
- Outside AWS — a laptop with the AWS CLI configured.
| File | Purpose |
|---|---|
training.py |
Training entry-point that runs inside the training container. |
inference.py |
Loads the produced model.tar.gz and predicts on held-out samples. |
launch_training.py |
Submits the job using the SageMaker SDK v3 ModelTrainer API. |
prepare_data.py |
Downloads FashionMNIST locally and uploads it to S3 once. |
requirements.txt |
Extra pip packages installed in the container before training runs. |
sagemaker_training_quotas.csv |
Snapshot of account-level SageMaker training-related Service Quotas. |
sagemaker_training_quotas.csv is a quick
reference of available SageMaker training quotas for choosing
--instance-type in launch_training.py.
- Columns:
QuotaName,Value,Unit - Includes quota families for:
- Training job usage (on-demand)
- Spot training job usage
- Training warm pool usage
- Reserved-capacity training plan quotas (
Number of <instance> instances in reserved capacity)
- Covers
ml.c*,ml.m*,ml.r*,ml.g*,ml.p*, andml.trn1*instance families
This file is a point-in-time snapshot (not live). Refresh with:
aws service-quotas list-service-quotas --service-code sagemakerEverything lives under
s3://dsw-melax-dev-s3/data/hxjeong/sagemaker-tutorial/:
.../sagemaker-tutorial/
├── input/ <-- dataset (you upload this once)
│ └── FashionMNIST/raw/*.gz
└── output/ <-- SageMaker writes here
└── fashion-mnist-<ts>/
├── output/model.tar.gz <-- contents of SM_MODEL_DIR, tarred
└── output/output.tar.gz <-- contents of SM_OUTPUT_DATA_DIR
Why these paths:
input/— matches thetraininginput channel declared inlaunch_training.py. SageMaker mounts this S3 prefix at/opt/ml/input/data/traininginside the container, which is whattorchvision.datasets.FashionMNISTpoints at (viaSM_CHANNEL_TRAINING).output/— passed asoutput_path=to the estimator. SageMaker tars/opt/ml/modelintomodel.tar.gzand/opt/ml/output/dataintooutput.tar.gzunder this prefix when the job finishes. The training code never uploads to S3 directly — it just writes to those local paths.
If the job uses checkpoints, add a third prefix (e.g.
.../sagemaker-tutorial/checkpoints/) and wire it via checkpoint_s3_uri=
so SageMaker keeps local /opt/ml/checkpoints in sync with S3.
pip install -r requirements.txt
aws configure # only needed outside AWSThe launcher uses the SageMaker SDK v3
ModelTrainerAPI (sagemaker.train.model_trainer). If you're stuck on v2, use the legacysagemaker.pytorch.PyTorchestimator instead — the inputs (source_dir,entry_point,hyperparameters,fit({"training": ...})) are roughly equivalent.
You also need an IAM role that SageMaker can assume, with
AmazonSageMakerFullAccess and read/write on the S3 bucket above.
python prepare_data.py \
--s3-uri s3://dsw-melax-dev-s3/data/hxjeong/sagemaker-tutorial/inputThe SDK picks up the notebook's execution role automatically.
python launch_training.py --epochs 5 --instance-type ml.g4dn.xlargePass the role ARN explicitly (or export SM_TRAINING_ROLE_ARN):
python launch_training.py \
--role-arn arn:aws:iam::<account-id>:role/<SageMakerExecutionRole> \
--epochs 5 --instance-type ml.g4dn.xlargeEither way, the SDK:
- Tars the current directory (
training.py+requirements.txt) and uploads it to the default SageMaker bucket. - Starts an
ml.g4dn.xlargecontainer with the PyTorch 2.3 DLC image (resolved viasagemaker.core.image_uris.retrieve). - Mounts
s3://.../input/at/opt/ml/input/data/training. - Runs
pip install -r requirements.txt && python training.py …inside the container. - Streams stdout to CloudWatch and, on success, uploads
model.tar.gz+output.tar.gztooutput_path.
If you'd rather skip the Python SDK entirely, you can call
aws sagemaker create-training-job directly. You'll need to (a) build/push
your own image or reference an AWS-provided DLC URI for your region, and (b)
upload your code tarball yourself. A minimal call looks like:
aws sagemaker create-training-job \
--training-job-name fashion-mnist-$(date +%s) \
--role-arn arn:aws:iam::<account-id>:role/<SageMakerExecutionRole> \
--algorithm-specification '{
"TrainingImage": "763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:2.3.0-gpu-py311-cu121-ubuntu20.04-sagemaker",
"TrainingInputMode": "File"
}' \
--input-data-config '[{
"ChannelName": "training",
"DataSource": {"S3DataSource": {
"S3DataType": "S3Prefix",
"S3Uri": "s3://dsw-melax-dev-s3/data/hxjeong/sagemaker-tutorial/input",
"S3DataDistributionType": "FullyReplicated"
}}
}]' \
--output-data-config '{"S3OutputPath": "s3://dsw-melax-dev-s3/data/hxjeong/sagemaker-tutorial/output"}' \
--resource-config '{"InstanceType": "ml.g4dn.xlarge", "InstanceCount": 1, "VolumeSizeInGB": 30}' \
--stopping-condition '{"MaxRuntimeInSeconds": 3600}' \
--hyper-parameters '{"epochs": "5", "batch-size": "128",
"sagemaker_program": "training.py",
"sagemaker_submit_directory": "s3://<your-bucket>/code/sourcedir.tar.gz"}'The Python SDK is the preferred path — it handles the code tarball, picks the right DLC image for the region, and wires up metric definitions for you.
After the job finishes, launch_training.py prints the model artifact URI.
Fetch and run inference locally:
aws s3 cp <model_data_uri> model.tar.gz
mkdir -p model && tar -xzf model.tar.gz -C model
python inference.py --model-dir model --n 16SageMaker communicates with the entry-point purely through paths and environment variables — no SageMaker SDK call inside the container:
| In container | What it is | In this example |
|---|---|---|
SM_CHANNEL_TRAINING |
Mount point of the training input channel |
Loaded by datasets.FashionMNIST |
SM_MODEL_DIR |
Files written here get packaged into model.tar.gz |
model.pt, classes.json |
SM_OUTPUT_DATA_DIR |
Files written here get packaged into output.tar.gz |
metrics.json |
| stdout | Lines matching metric_definitions regex become CloudWatch metrics |
test_acc=0.91; |
Because of the local fallbacks at the top of main(), the same training.py
also runs on a laptop with python training.py --data-dir ./data, which makes
iterating much cheaper than doing every change through a SageMaker job.
| Check | Result |
|---|---|
py_compile on all 4 files |
PASS |
| Import each module | PASS |
--help for every CLI |
PASS |
training.py 1 epoch on CPU |
PASS — test_acc=0.844 |
inference.py on the saved artifact |
PASS — 13/16 correct |
ModelTrainer dry-construct + DLC image resolve |
PASS |
scripts/smoke_test.sh runs everything locally in ~1 minute and prints a
PASS/FAIL summary. It does not submit a SageMaker job — it only exercises
the code paths you can verify without AWS compute, and it doesn't require
AWS credentials (the DLC image URI is a public ECR path):
pip install -r requirements.txt # host-side deps, run once
bash scripts/smoke_test.shWhat it checks: a preflight that the host-side deps from requirements.txt
are installed (fails fast with an install hint if not), py_compile on
every .py file, importing each module, --help on every CLI, a 1-epoch
CPU training run against a freshly downloaded FashionMNIST, that model.pt
classes.jsonwere written, aninference.pyround-trip against the produced artifact, and a dry-construct of the SageMakerModelTrainer(imports + DLC image resolution, no API call). Non-zero exit on any failure.
The script defaults AWS_DEFAULT_REGION to us-east-1 when nothing is
configured, so the image-URI lookup works on a laptop with no aws configure. Export a different region to override.
- Run
training.pylocally first — theSM_*env-var fallbacks let you iterate in ~30 seconds instead of waiting 5 minutes for a container to spin up. - Upload the dataset once with
prepare_data.py. Every training job just mounts that S3 prefix. - Pick the right instance.
ml.g4dn.xlarge(~$0.75/hr) is enough for FashionMNIST. Avoidml.p3.*/ml.p4d.*for toy examples. - Use
--no-waitonce it's working. The launcher returns immediately and you watch progress in the SageMaker console.