Skip to content
This repository was archived by the owner on Sep 19, 2022. It is now read-only.
This repository was archived by the owner on Sep 19, 2022. It is now read-only.

Pytorch version may have an effect on the training reproduction #355

@Shuai-Xie

Description

@Shuai-Xie

I try to figure out why Bare Metal (BM) and PytorchJob (PJ) have different training results in #354 (comment).

And now I find that PytorchJon v1.8.0 and 1.9.0 have different training results both on BM and PJ.

Experiment settings

  • Two V100 GPU machines 48/49. Each has 4 cards. We have 8 GPUs in total.
  • DDP training resnet18 on mnist dataset with batchsize=256 and epochs=1
  • set random seed=1

BM

# torch             1.8.0+cu111
# torchvision       0.9.0+cu111
Train Epoch: 0 [0/30]   loss=2.5691
Train Epoch: 0 [10/30]  loss=2.2320
Train Epoch: 0 [20/30]  loss=0.8108
Test Epoch: 0 [0/40]    acc=33.5938
Test Epoch: 0 [10/40]   acc=35.5469
Test Epoch: 0 [20/40]   acc=34.7098
Test Epoch: 0 [30/40]   acc=35.0302
Test Epoch: 0, acc=35.7200
test acc: 35.72, best acc: 35.72
training seconds: 19.506625175476074
best_acc: 35.72

# torch             1.9.0+cu111
# torchvision       0.10.0+cu111
Train Epoch: 0 [0/30]   loss=2.5137
Train Epoch: 0 [10/30]  loss=2.4295
Train Epoch: 0 [20/30]  loss=0.9048
Test Epoch: 0 [0/40]    acc=63.2812
Test Epoch: 0 [10/40]   acc=64.9858
Test Epoch: 0 [20/40]   acc=63.8021
Test Epoch: 0 [30/40]   acc=63.9365
Test Epoch: 0, acc=64.1200
test acc: 64.12, best acc: 64.12
training seconds: 18.64181399345398
best_acc: 64.12

PJ

I build docker images from different versions of the PyTorch base images.

# FROM pytorch/pytorch:1.8.1-cuda11.1-cudnn8-runtime
Train Epoch: 0 [0/30]   loss=2.5691
Train Epoch: 0 [10/30]  loss=2.5132
Train Epoch: 0 [20/30]  loss=0.7198
Test Epoch: 0 [0/40]    acc=38.2812
Test Epoch: 0 [10/40]   acc=40.9091
Test Epoch: 0 [20/40]   acc=39.8996
Test Epoch: 0 [30/40]   acc=40.4738
Test Epoch: 0, acc=40.9600
test acc: 40.96, best acc: 40.96
training seconds: 20.630347967147827
best_acc: 40.96

# FROM pytorch/pytorch:1.9.0-cuda11.1-cudnn8-runtime
Train Epoch: 0 [0/30]   loss=2.5137
Train Epoch: 0 [10/30]  loss=2.3939
Train Epoch: 0 [20/30]  loss=0.6989
Test Epoch: 0 [0/40]    acc=67.5781
Test Epoch: 0 [10/40]   acc=69.2827
Test Epoch: 0 [20/40]   acc=68.4152
Test Epoch: 0 [30/40]   acc=67.8805
Test Epoch: 0, acc=67.9700
test acc: 67.97, best acc: 67.97
training seconds: 26.458710193634033
best_acc: 67.97

Please let me know if I write the wrong code. I've posted my code here: https://github.com/Shuai-Xie/mnist-pytorchjob-example.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions