Running Legacy TF1.x Code on a Modern CUDA 11.8 Server
This guide outlines the exact steps required to run the legacy ACDC Challenge segmentation codebase (originally written for TensorFlow 1.x) on the University of Twente (UT) JupyterLab servers.
The Conflict: The original codebase is old and expects an older version of TensorFlow (1.x) and an older GPU driver. However, the UT server is equipped with modern hardware running CUDA 11.8. Because this is a shared university server, we do not have the administrator rights to downgrade or change the server's CUDA version.
CUDA (Compute Unified Device Architecture) is a software layer and programming interface created by NVIDIA.
- What it does: It allows deep learning frameworks (like TensorFlow and PyTorch) to communicate directly with the NVIDIA GPU. Instead of doing math slowly one-by-one on the CPU, CUDA tells the GPU to perform thousands of matrix multiplications simultaneously (parallel processing), which is essential for training neural networks.
- Why it caused an issue: Every version of TensorFlow is hardcoded to work with a specific version of CUDA. TensorFlow 1.x does not understand CUDA 11.8. If we tried to install TF1.x, the GPU would simply not be recognized.
Our Solution: We install TensorFlow 2.13 (which perfectly matches the server's CUDA 11.8) and utilize its built-in compat.v1 module to "trick" the modern TensorFlow into executing our legacy TF 1.x code.
Before installing any packages, we must create an isolated workspace. We use Conda for this, which ensures our specific Python version and packages do not interfere with other projects on the server.
Open a terminal in JupyterLab and run the following commands to create a new environment named acdc-tf2 using Python 3.10 (which is highly stable for this setup):
# Create the environment with Python 3.10
conda create --name acdc-tf2 python=3.10 -y
# Activate the environment
conda activate acdc-tf2
(Note: You must run conda activate acdc-tf2 every time you open a new terminal before running your training scripts).
The original requirements.txt contains outdated packages that conflict with modern Python environments and the new TensorFlow version. We must manually install specific versions to stabilize the environment.
Run the following commands in your activated acdc-tf2 environment:
# 1. Install TF compatible with the UT server's CUDA 11.8
pip install tensorflow==2.13.0
# 2. Install NumPy 1.x and specific typing-extensions to prevent TF crashes
# (TF 2.13 will crash if the newer NumPy 2.0 is installed)
pip install numpy==1.24.3 typing-extensions==4.5.0
# 3. Downgrade nibabel to ensure compatibility with older NumPy (used for .nii.gz files)
pip install "nibabel<5.0"
# 4. Install OpenCV (Headless version prevents missing GUI library errors on Linux servers)
pip install opencv-python-headless
# 5. Install the remaining required dependencies
pip install pandas scipy scikit-image matplotlib SimpleITK h5py tqdm networkx
Note: You can safely ignore any pip warnings related to PyTorch if they appear, as we are strictly using TensorFlow for this pipeline.
Because the codebase uses TF1 syntax (tf.variable_scope, tf.get_variable), we must force TensorFlow 2 to behave like TensorFlow 1.
In every executable Python file (train.py, model.py, unet2D_bn_xent.py, system.py, etc.), you must change the TensorFlow import.
Find:
import tensorflow as tfReplace with:
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()The old tf.contrib module was completely removed in TF2. We must replace the old initializers and batch normalization functions with their modern Keras/Compat equivalents.
At the top of layers.py, replace the variance_scaling_initializer and xavier_initializer imports with these drop-in wrapper functions:
# --- Fix for TF1 -> TF2 Keras initializers ---
def variance_scaling_initializer(uniform=False, factor=2.0, mode='FAN_IN', dtype=tf.float32):
distribution = 'uniform' if uniform else 'truncated_normal'
return tf.keras.initializers.VarianceScaling(scale=factor, mode=mode.lower(), distribution=distribution)
def xavier_initializer(uniform=True, dtype=tf.float32):
if uniform:
return tf.keras.initializers.GlorotUniform()
else:
return tf.keras.initializers.GlorotNormal()
# -------------------------------------------------Furthermore, update the batch_normalisation_layer function in layers.py to use the v1 compat module:
def batch_normalisation_layer(bottom, name, training):
# TF2 Compat fix: replacing tf.contrib.layers.batch_norm
h_bn = tf.compat.v1.layers.batch_normalization(
inputs=bottom, momentum=0.99, epsilon=1e-3, training=training,
name=name, center=True, scale=True
)
return h_bnThe file unet2D_bn_xent.py (located in your main project folder ~/Deep_learning_project/acdc_challenge/acdc_segmenter/) acts as the "control panel" for your experiment. You can duplicate this file to create different experiments.
Inside this file, you can tweak various hyperparameters and swap out core components of the training pipeline.
-
Adjusting Hyperparameters: You can directly change values like
learning_rate = 0.01to0.001, changeloss_type = 'crossentropy'to'dice', or toggle data augmentations likedo_rotations = True. -
Swapping the Network Architecture: The config file defines the model via
model_handle = model_zoo.unet2D_bn. -
How to find other models: Open the
model_zoo.pyfile in your repository. Look at the functions defined inside (e.g.,unet2D,unet3D). You can change your config to use any of these by updating the handle (e.g.,model_handle = model_zoo.unet2D). -
Swapping the Optimizer: The config file defines the optimizer via
optimizer_handle = tf.train.AdamOptimizer. -
How to find other optimizers: Because we are using the TF1 compatibility bridge, you can swap this with other standard TF1 optimizers by searching the TensorFlow documentation. For example, you could change it to
tf.train.GradientDescentOptimizerortf.train.RMSPropOptimizer.
When running the U-Net model with batch_size = 24, the GPU will run out of VRAM (resulting in a RESOURCE_EXHAUSTED error).
Solution: Open your configuration file (unet2D_bn_xent.py) and lower the batch size so it fits inside the GPU's memory.
# Training settings
batch_size = 16 # Reduced from 24 to prevent OOMTo prevent the training from aborting when you close your laptop or lose connection to the UT JupyterLab, run the script using nohup (no hangup). This ensures the process continues running even if your terminal session ends.
Start the training in the background:
nohup python train.py --config unet2D_bn_xent.py > training_output.log 2>&1 &- This runs the script in the background, redirects all output to
training_output.log, and appends the process ID to the log.
You can monitor the training progress by checking the log file:
tail -f training_output.log(Press Ctrl + C to exit the tail command without stopping the training.)
To verify that the GPU is actively processing your data and to monitor VRAM usage, open a terminal and run:
watch -n 1 nvidia-smiWhat is TensorBoard? TensorBoard is a built-in visualization tool. It reads the log files generated by our training script and creates live, interactive graphs of our metrics (like Training Loss and Validation Dice scores).
Modern Python environments often have setuptools versions that recently removed the pkg_resources module, which TensorBoard 2.13 requires to boot.
Fix:
pip install "setuptools<70.0.0"
Open a new terminal (while your training runs in the background via tmux), activate your environment (conda activate acdc-tf2), and start TensorBoard:
tensorboard --logdir ~/Deep_learning_project/acdc_challenge/acdc_logdir --bind_all
(Ensure you use the correct, case-sensitive path to your log directory).
Because we are behind the university's firewall, we cannot simply surf to localhost:6006. We must use the Jupyter proxy trick:
- Look at your current JupyterLab URL. It likely looks like:
https://jupyter.utwente.nl/user/[your-username]/lab/tree/... - Change the URL to route directly to port 6006 by replacing
/lab/tree/...with/proxy/6006/:https://jupyter.utwente.nl/user/[your-username]/proxy/6006/ - Hit Enter. You should now see the TensorBoard dashboard tracking your metrics in real-time.
As the model trains, it automatically saves files to ~/Deep_learning_project/acdc_challenge/acdc_logdir/unet2D_bn_xent.
You will find:
events.out.tfevents...: The raw metric data read by TensorBoard.model_best_dice.ckpt: The network weights at the epoch where it achieved the highest Dice score on the validation set. (Use this for your final testing/inference).model_best_xent.ckpt: The weights where it achieved the lowest Cross-Entropy Loss.unet2D_bn_xent.py: A backup copy of the config file used for this specific run, ensuring total reproducibility of the experiment.