[Fix] Fix issue #6 #10

Open
HarmonJiang wants to merge 1 commit into MIT-SPARK:master from HarmonJiang:dev

@HarmonJiang

Introduction

Issue #6 is caused by a bug: the check os.path.exists(complete_path_logs_current_training_job) always evaluates to True.
When we start a new training job, this code

# Create the checkpoint subfolder if nonexistent.
self.__checkpoint_subfolder = os.path.join(self.__log_folder,
                                           self.__training_job_name,
                                           'checkpoints')
if (not os.path.exists(self.__checkpoint_subfolder)):
    try:
        os.makedirs(self.__checkpoint_subfolder)
    except OSError:
        raise OSError("Error while trying to create folder "
                      f"'{self.__checkpoint_subfolder}'. Exiting.")

will create the folder $PD_MESH_NET_ROOT/training_logs/new_job/checkpoints. Because os.makedirs also creates any missing parent folders, the folder $PD_MESH_NET_ROOT/training_logs/new_job already exists at this point.
Then,

complete_path_logs_current_training_job = os.path.join(

complete_path_logs_current_training_job is defined as $PD_MESH_NET_ROOT/training_logs/new_job, a folder that has already been created before this definition is reached. As a result, os.path.exists(complete_path_logs_current_training_job) is always True, and the program tries to load a previous checkpoint even though we launched a new training job.
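
The behavior described above can be reproduced in isolation. The following is a minimal sketch, not the repository's code: the function name and the temporary log folder are hypothetical, and only the ordering (create the checkpoint subfolder first, check the job folder afterwards) mirrors the snippet quoted in this description.

```python
import os
import tempfile


def job_folder_exists_after_checkpoint_setup(log_folder, training_job_name):
    """Mimic the buggy ordering (hypothetical helper, for illustration only)."""
    # Step 1: create <log_folder>/<job>/checkpoints, as the quoted code does.
    # os.makedirs also creates the intermediate <log_folder>/<job> folder.
    checkpoint_subfolder = os.path.join(log_folder, training_job_name,
                                        'checkpoints')
    if not os.path.exists(checkpoint_subfolder):
        os.makedirs(checkpoint_subfolder)

    # Step 2: only now check whether the job folder exists.
    job_folder = os.path.join(log_folder, training_job_name)
    return os.path.exists(job_folder)


with tempfile.TemporaryDirectory() as log_folder:
    # 'new_job' has never been launched, yet the check reports it as existing.
    buggy_check_result = job_folder_exists_after_checkpoint_setup(
        log_folder, 'new_job')
    print(buggy_check_result)
```

Since the check runs after the parent folder has been created as a side effect, it cannot distinguish a fresh job from a resumed one.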

Changes

To fix it, move the checkpoint-creation code after the if (not os.path.exists(complete_path_logs_current_training_job)) statement.
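
As an illustration of the reordering, here is a minimal sketch under stated assumptions: prepare_training_job and its return value are hypothetical and do not appear in the repository, and the actual change in this pull request may be structured differently. The point is only that the existence check happens before anything is created under the job folder.

```python
import os
import tempfile


def prepare_training_job(log_folder, training_job_name):
    """Return True if the job is new, False if it should resume.

    Hypothetical helper, for illustration only.
    """
    job_folder = os.path.join(log_folder, training_job_name)

    # Check first, while the folder's existence still reflects whether a
    # previous run of this job left logs behind.
    is_new_job = not os.path.exists(job_folder)

    # Only afterwards create the checkpoint subfolder (and its parents).
    checkpoint_subfolder = os.path.join(job_folder, 'checkpoints')
    os.makedirs(checkpoint_subfolder, exist_ok=True)

    return is_new_job


with tempfile.TemporaryDirectory() as log_folder:
    first_launch_is_new = prepare_training_job(log_folder, 'new_job')
    second_launch_is_new = prepare_training_job(log_folder, 'new_job')
    print(first_launch_is_new, second_launch_is_new)
```

With this ordering, the first launch is treated as a new job and only a later launch with the same name would attempt to load a previous checkpoint.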
