Paper (Korean, coming soon) | Slides (EN) | Presentation Video (EN)
Official repository of 'Implementation and Performance Evaluation of Vision Transformer Model Based on MCU-NPU' (KICS Winter Conference 2026).

Keywords: NPU, STM32N6, MCU, Transformer, Vision Transformer
```
├─ firmware
│  ├─ app_config.h
│  └─ main.c
├─ make_model
│  ├─ transformer_npu.py
│  └─ transformer_npu_v2.py
└─ prebuilt_binary
   ├─ v1
   │  ├─ custom_vit_ln_im160_attdim144_depth6_head4_ff576.tflite
   │  ├─ network_data.hex
   │  └─ STM32N6570-DK_GettingStarted_ImageClassification_sign.bin
   └─ v2
      ├─ v2_custom_vit_ln_im160_attdim144_depth6_head4_ff576.tflite
      ├─ network_data.hex
      └─ STM32N6570-DK_GettingStarted_ImageClassification_sign.bin
```
- Run STM32CubeProgrammer
- Connect the STM32N6-DK (Port: SWD, Frequency: 8000 kHz, Mode: Hot Plug, Access Port: 1, Reset mode: Hardware reset)
- Flash `ai_fsbl.hex` (from https://github.com/STMicroelectronics/STM32N6-GettingStarted-ImageClassification/releases/tag/v2.1.1)
- Flash `network_data.hex`
- Flash `STM32N6570-DK_GettingStarted_ImageClassification_sign.bin` at `0x70100000`
- kagglehub
- pillow
- tensorflow==2.7.0
- STEdgeAI v2.2.0
- STM32CubeProgrammer v2.19.0
- STM32CubeIDE v1.19.0
Generate a Vision Transformer with input image size 160, patch size 16, 5 classes, hidden size D 144, 6 layers, 4 heads, MLP size 576, dropout 0.1, and LayerNorm, and train it for 5 epochs:

```
python transformer_npu.py --img_size 160 --patch_size 16 --num_classes 5 --d_model 144 --num_blocks 6 --num_heads 4 --d_ff 576 --dropout 0.1 --use_layernorm --epochs 5
python transformer_npu_v2.py --img_size 160 --patch_size 16 --num_classes 5 --d_model 144 --num_blocks 6 --num_heads 4 --d_ff 576 --dropout 0.1 --use_layernorm --epochs 5
```
- Install STEdgeAI, STM32CubeIDE, and STM32CubeProgrammer.
- Add `/c/<InstallFolder>/Utilities/windows/` to your PATH so that `stedgeai` is known to your bash (in my case, `C:\ST\STEdgeAI\2.2\Utilities\windows`).
- Add `/c/<InstallFolder>/STM32CubeIDE_<X.X.X>/STM32CubeIDE/plugins/com.st.stm32cube.ide.mcu.externaltools.gnu-tools-for-stm32.<X.X.X>/bin/` to your PATH so that `arm-none-eabi-objcopy` is known to your bash (in my case, `C:\ST\STM32CubeIDE_1.19.0\STM32CubeIDE\plugins\com.st.stm32cube.ide.mcu.externaltools.gnu-tools-for-stm32.13.3.rel1.win32_1.0.0.202411081344`).
- Download the ImageClassification source code from https://github.com/STMicroelectronics/STM32N6-GettingStarted-ImageClassification/releases/tag/v2.1.1 and unzip it.
- Put your tflite model in the `Model` folder.
- Generate the C files:

```
cd Model
stedgeai generate --model Model_File.tflite --target stm32n6 --st-neural-art default@user_neuralart_STM32N6570-DK.json
cp st_ai_output/network.c STM32N6570-DK/
cp st_ai_output/network_ecblobs.h STM32N6570-DK/
cp st_ai_output/network_atonbuf.xSPI2.raw STM32N6570-DK/network_data.xSPI2.bin
arm-none-eabi-objcopy -I binary STM32N6570-DK/network_data.xSPI2.bin --change-addresses 0x70380000 -O ihex STM32N6570-DK/network_data.hex
```

Then `network.c`, `network_data.hex`, `network_data.xSPI2.bin`, and `network_ecblobs.h` will be generated in `./Model/STM32N6570-DK`.
- Run STM32CubeIDE
- Click File - Open Projects from File System
- Click Directory and select `STM32N6_GettingStarted_ImageClassification-v2.1.1/Application/STM32N6570-DK`, then click Finish
- Replace `app_config.h` and `main.c` with the ones in the `firmware` folder of this repository
- Modify the `NN_WIDTH` and `NN_HEIGHT` values to your input image size, and change `PATCH_SIZE`, `NB_CLASSES`, `classes_table`, and the welcome messages to match your model. The code below is compatible with a model with input image size 160, patch size 16, and 5 classes:
```c
#define NN_WIDTH      (160) // you can modify
#define NN_HEIGHT     (160) // you can modify
#define NN_BPP        3

#define PATCH_SIZE    16    // you can modify

#define COLOR_BGR     (0)
#define COLOR_RGB     (1)
#define COLOR_MODE    COLOR_RGB

/* Classes */
#define NB_CLASSES    (5)   // you can modify
#define CLASSES_TABLE const char* classes_table[NB_CLASSES] = {\
  "daisy", "dandelion", "rose", "sunflower", "tulip"} // you can modify

/* Display */
#define WELCOME_MSG_1 "Vision Transformer.tflite" // you can modify
#define WELCOME_MSG_2 "Model running on NPU"      // you can modify

#endif
```

- Click Project - Build All

Then the bin file is generated at `STM32N6_GettingStarted_ImageClassification-v2.1.1/Application/STM32N6570-DK/STM32CubeIDE/Debug`.
- Run STM32CubeProgrammer
- Connect the STM32N6-DK (Port: SWD, Frequency: 8000 kHz, Mode: Hot Plug, Access Port: 1, Reset mode: Hardware reset)
- Flash `ai_fsbl.hex` located at `STM32N6_GettingStarted_ImageClassification-v2.1.1/Binary`
- Flash `network_data.hex`
- Sign the bin file:

```
STM32_SigningTool_CLI -bin STM32N6570-DK_GettingStarted_ImageClassification.bin -nk -t ssbl -hv 2.3 -o STM32N6570-DK_GettingStarted_ImageClassification_sign.bin
```

- Flash `STM32N6570-DK_GettingStarted_ImageClassification_sign.bin` at address `0x70100000`
Notes: the arguments of `transformer_npu.py` and `transformer_npu_v2.py` are the same:

```
python transformer_npu.py [-h] [--img_size IMG_SIZE] [--patch_size PATCH_SIZE] [--num_classes NUM_CLASSES] [--d_model D_MODEL] [--num_blocks NUM_BLOCKS] [--num_heads NUM_HEADS] [--d_ff D_FF] [--dropout DROPOUT] [--use_layernorm] [--epochs EPOCHS] [--input_type INPUT_TYPE] [--output_type OUTPUT_TYPE]
```
- `--img_size`: input image size (default: 224)
- `--patch_size`: patch size (default: 16)
- `--num_classes`: number of classes (default: 5)
- `--d_model`: hidden size D (default: 128)
- `--num_blocks`: number of Transformer layers (default: 4)
- `--num_heads`: number of attention heads (default: 4)
- `--d_ff`: MLP size (feed-forward network dimension) (default: 256)
- `--dropout`: dropout rate (default: 0.1)
- `--use_layernorm`: use LayerNorm in the Transformer blocks
- `--epochs`: number of training epochs (default: 5)
- `--input_type`: TFLite input type, one of float32, int8, uint8 (default: uint8)
- `--output_type`: TFLite output type, one of float32, int8, uint8 (default: float32)
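As an illustration of the interface documented above, here is a minimal argparse sketch. This is not the actual script, just a reconstruction of the documented flags and defaults; the real scripts may differ in details:

```python
import argparse

def build_parser():
    # Mirrors the documented CLI of transformer_npu.py / transformer_npu_v2.py.
    # Illustrative sketch only; defaults are taken from the argument list above.
    p = argparse.ArgumentParser(description="Train a ViT for the STM32N6 NPU")
    p.add_argument("--img_size", type=int, default=224, help="Input image size")
    p.add_argument("--patch_size", type=int, default=16, help="Patch size")
    p.add_argument("--num_classes", type=int, default=5, help="Number of classes")
    p.add_argument("--d_model", type=int, default=128, help="Hidden size D")
    p.add_argument("--num_blocks", type=int, default=4, help="Number of Transformer layers")
    p.add_argument("--num_heads", type=int, default=4, help="Number of attention heads")
    p.add_argument("--d_ff", type=int, default=256, help="Feed-forward MLP dimension")
    p.add_argument("--dropout", type=float, default=0.1, help="Dropout rate")
    p.add_argument("--use_layernorm", action="store_true", help="Use LayerNorm in blocks")
    p.add_argument("--epochs", type=int, default=5, help="Training epochs")
    p.add_argument("--input_type", default="uint8", choices=["float32", "int8", "uint8"])
    p.add_argument("--output_type", default="float32", choices=["float32", "int8", "uint8"])
    return p

# Parse the same configuration used in the example commands above.
args = build_parser().parse_args(
    "--img_size 160 --d_model 144 --num_blocks 6 --d_ff 576 --use_layernorm".split()
)
```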
- The image size must be divisible by the patch size.
- `d_model` (hidden size D) must be divisible by `num_heads` (number of attention heads).
- We recommend the `--use_layernorm` argument.
- We recommend setting `--input_type` to `uint8`.
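The two divisibility constraints can be checked up front before training. A small sketch (`check_vit_config` is a hypothetical helper, not part of the repository's scripts):

```python
def check_vit_config(img_size: int, patch_size: int, d_model: int, num_heads: int) -> int:
    """Validate the ViT constraints above and return the number of patch tokens."""
    if img_size % patch_size != 0:
        raise ValueError(f"img_size {img_size} is not divisible by patch_size {patch_size}")
    if d_model % num_heads != 0:
        raise ValueError(f"d_model {d_model} is not divisible by num_heads {num_heads}")
    # A square image yields (img_size / patch_size)^2 non-overlapping patches.
    return (img_size // patch_size) ** 2

# The example configuration: 160/16 = 10 patches per side -> 100 tokens.
num_patches = check_vit_config(img_size=160, patch_size=16, d_model=144, num_heads=4)
```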
- Patch Embedding (PE) is performed as a pre-processing step before the model input.
  - If the model contains both PE and self-attention, STEdgeAI cannot interpret the model structure.
- The bias parameter is removed from the Fully Connected layers.
  - At compile time, STEdgeAI removes the existing batch dimension from the three-dimensional input and treats the patch dimension as the batch dimension; broadcasting a bias over the batch dimension is not supported on the NPU.
- The ReLU activation function is used in the feed-forward Multi-Layer Perceptron.
  - TFLite GELU is not supported.
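To illustrate what moving PE into pre-processing involves, the patch flattening itself is just a reshape/transpose over the camera frame. A NumPy sketch (my own illustration under the 160/16/RGB configuration, not the repository's actual pre-processing code):

```python
import numpy as np

IMG, PATCH, BPP = 160, 16, 3   # matches NN_WIDTH/NN_HEIGHT, PATCH_SIZE, NN_BPP
N = IMG // PATCH               # 10 patches per side -> 100 tokens

img = np.zeros((IMG, IMG, BPP), dtype=np.uint8)  # dummy camera frame (H, W, C)

# Split into non-overlapping PATCH x PATCH tiles and flatten each tile:
# (H, W, C) -> (N, PATCH, N, PATCH, C) -> (N, N, PATCH, PATCH, C) -> (N*N, PATCH*PATCH*C)
patches = (
    img.reshape(N, PATCH, N, PATCH, BPP)
       .transpose(0, 2, 1, 3, 4)
       .reshape(N * N, PATCH * PATCH * BPP)
)
```

Each of the 100 rows is one flattened 16x16x3 patch (768 values), ready to be projected to `d_model` before entering the Transformer blocks.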
- Patch Embedding (PE) is performed as a pre-processing step before the model input.
  - If the model contains both PE and self-attention, STEdgeAI cannot interpret the model structure.
- Fully Connected layers are changed to 1x1 Conv2D.
  - As a result, the v2 model keeps the bias parameters in the QKV projection, the feed-forward layers, etc.
- The ReLU activation function is used in the feed-forward Multi-Layer Perceptron.
  - TFLite GELU is not supported.
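The FC → 1x1 Conv2D change is numerically a no-op: a 1x1 convolution over a (tokens, 1) spatial grid applies the same weight matrix to every token, while letting the bias ride along as the convolution's per-channel bias. A NumPy sketch of the equivalence (illustrative only; shapes borrow the v2 feed-forward expansion 144 → 576):

```python
import numpy as np

rng = np.random.default_rng(0)
tokens, d_in, d_out = 100, 144, 576
x = rng.standard_normal((tokens, d_in))
W = rng.standard_normal((d_in, d_out))
b = rng.standard_normal(d_out)

# Fully Connected: one matmul over the channel dimension, bias broadcast per token.
y_fc = x @ W + b

# 1x1 Conv2D: lay the tokens out as a (1, tokens, 1, d_in) feature map; a 1x1 kernel
# of shape (1, 1, d_in, d_out) reduces to the same per-token matmul.
fmap = x.reshape(1, tokens, 1, d_in)
kernel = W.reshape(1, 1, d_in, d_out)
y_conv = np.einsum("nhwc,ijco->nhwo", fmap, kernel) + b
y_conv = y_conv.reshape(tokens, d_out)
```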
(Table 1. Architecture of Vision Transformer for the STM32N6 NPU)

| | ViT-T | ViT-S | ViT-B | ViT-L |
|---|---|---|---|---|
| Image | 96 | 160 | 160 | 224 |
| Patch | 16 | 16 | 16 | 16 |
| Hidden size D | 128 | 128 | 144 | 176 |
| Layers | 4 | 6 | 6 | 6 |
| Heads | 4 | 4 | 4 | 8 |
| MLP size | 128 | 256 | 576 | 704 |
| Params (w/o bias) (M) | 0.494 | 0.889 | 1.608 | 2.371 |
| MACs (w/o bias) (M) | 20.3 | 113.4 | 188.0 | 609.5 |
| Params (w/ bias) (M) | 0.498 | 0.894 | 1.616 | 2.380 |
| MACs (w/ bias) (M) | 20.4 | 113.6 | 188.4 | 610.5 |
(Table 2. Comparison of model inference time between STM32N6 (MCU+NPU) and STM32H753ZI (MCU))

| | Board | ViT-T | ViT-S | ViT-B | ViT-L |
|---|---|---|---|---|---|
| Inference time (ms) (w/o bias) | STM32N6 | 12 | 72 | 82 | 417 |
| Inference time (ms) (w/o bias) | STM32H7 | 142 | 748 | 968 | - |
| Inference time (ms) (w/ bias) | STM32N6 | 9 | 61 | 71 | 373 |
| Inference time (ms) (w/ bias) | STM32H7 | 108 | 632 | 826 | - |
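A quick check of the speedup implied by Table 2 (w/o-bias rows, models measured on both boards): the NPU is roughly an order of magnitude faster than the Cortex-M7 MCU.

```python
# Inference times in ms, copied from Table 2 (w/o bias rows).
stm32n6 = {"ViT-T": 12, "ViT-S": 72, "ViT-B": 82}
stm32h7 = {"ViT-T": 142, "ViT-S": 748, "ViT-B": 968}

# Speedup of STM32N6 (MCU+NPU) over STM32H7 (MCU only) per model.
speedup = {m: stm32h7[m] / stm32n6[m] for m in stm32n6}
# ViT-T ~11.8x, ViT-S ~10.4x, ViT-B ~11.8x
```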
CPU and NPU allocation of the operators used in the ViT model: the PE part is a pre-processing step; operators marked in orange run on the NPU, and operators marked in light green run on the CPU. In the model including bias, every FC layer except the classification FC is changed to a 1x1 Conv2D with a bias parameter.
If you find stm32n6-transformer useful in your research and wish to cite it, please use the following BibTeX entry:
- for code:
@software{kim2026stm32n6transformer,
author = {Kim, Tae Hun},
title = {stm32n6-transformer: Apply Vision Transformer on STM32N6 NPU},
url = {https://github.com/minchoCoin/stm32n6-transformer},
version = {0.0},
year = {2026},
note = {Accessed: 2026-03-16}
}
- for paper (NOT YET PUBLISHED!):
@inproceedings{kim2026npu,
title = {Implementation and Performance Evaluation of Vision Transformer Model Based on MCU-NPU},
author = {Kim, Tae Hun and Kim, Tae Gu and Cho, Yong Hun and Shin, Ki Hun and Kwak, Do Gyun and Han, Koo Dong and Na, Sang Jin and Cho, Eun Young and Hwang, Woo Chan and Baek, Yun Ju},
booktitle = {Proceedings of the Symposium of the Korean Institute of Communications and Information Sciences},
pages = {348--349},
year = {2026},
month = {2},
publisher = {Korea Institute of Communications and Information Sciences},
volume = {89},
issn = {2383-8302}
}
This work was supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) under the ITRC (Information Technology Research Center) support program funded by the Korea government (MSIT) (IITP-2026-RS-2023-00260098).

This research was conducted as a result of the Artificial Intelligence Convergence Innovation Human Resources Development program supported by the Ministry of Science and ICT (MSIT) and IITP (IITP-2026-RS-2023-00254177).
