This guide helps users migrate their machine learning applications from smaller-scale computing environments to the LUMI supercomputer. It walks through a detailed example: training an image classification model, a Vision Transformer (ViT) implemented in PyTorch, on the ImageNet dataset.
All Python and Bash scripts referenced in this guide are available in this GitHub repository. We start with a basic Python script, visiontransformer.py, that could run on your local machine, and modify it step by step over the following chapters so that it runs efficiently on LUMI.
Although this guide uses PyTorch, most of the topics covered are independent of the machine learning framework used. We therefore expect it to be helpful for all new ML users on LUMI, while also providing a concrete example that runs on LUMI.
Important
In the future, PyTorch containers on LUMI will be provided by the LUMI AI Factory, and this guide will soon be updated to use these new containers. The containers currently referenced in this guide remain available on LUMI but will no longer receive updates. All examples included in this guide will nevertheless continue to work as they currently do. For more information about the new containers, refer to the LUMI AI Factory AI Software Environment documentation.
Before proceeding, please ensure you meet the following prerequisites:
- A basic understanding of machine learning concepts and Python programming. This guide will focus primarily on aspects specific to training models on LUMI.
- An active user account on LUMI and familiarity with its basic operations.
- To run the included examples, membership in a LUMI project with allocated GPU hours.
The guide is structured into the following sections: