DEEP-GAP: Deep-learning Evaluation of Execution Parallelism in GPU Architectural Performance

This repository provides the exact benchmarking scripts used in the DEEP-GAP study to evaluate GPU inference performance across FP32, FP16, and INT8 precision modes.

1. Overview

This benchmark measures:

Throughput (images/sec)
Median latency (ms)
P99 latency (ms)
Precision scaling effects (FP32 vs FP16 vs INT8)
GPU memory usage via NVML
Batch size scaling behavior
TensorRT optimization benefits
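
As an illustration of how the first three metrics relate to each other, the sketch below derives median latency, P99 latency, and throughput from a list of per-iteration latencies. It is not taken from the benchmark scripts: the function name `summarize_latencies`, the nearest-rank P99 definition, and the choice to compute throughput from the median latency are all assumptions made for this example.

```python
import statistics


def summarize_latencies(latencies_ms, batch_size):
    """Summarize per-iteration latencies (in ms) measured at one batch size.

    Hypothetical helper for illustration; the DEEP-GAP scripts may use
    different percentile and throughput definitions.
    """
    if not latencies_ms:
        raise ValueError("latencies_ms must not be empty")

    ordered = sorted(latencies_ms)
    median_ms = statistics.median(ordered)

    # Nearest-rank P99: the smallest sample such that at least 99% of all
    # samples are less than or equal to it. Integer ceil(0.99 * n).
    rank = -(-99 * len(ordered) // 100)
    p99_ms = ordered[rank - 1]

    # Assumption: throughput is images per batch divided by the median
    # time per batch, expressed in images per second.
    throughput_img_s = batch_size * 1000.0 / median_ms

    return {
        "median_ms": median_ms,
        "p99_ms": p99_ms,
        "throughput_img_s": throughput_img_s,
    }
```

Comparing the `throughput_img_s` values of the same model and batch size across FP32, FP16, and INT8 runs gives the precision scaling effect as a simple ratio.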

2. Models Evaluated

ResNet-18
ResNet-50
ResNet-101

Pretrained ImageNet weights are used.

3. Precision Modes

FP32 (PyTorch eager): run_benchmark_resnet_18_50_101_FP32_NVML.py

FP16 (PyTorch half-precision): run_benchmark_resnet_18_50_101_FP16_NVML.py

INT8 (TensorRT quantized inference): run_benchmark_resnet_18_50_101_INT8_TensorRT_NVML.py

4. Requirements

NVIDIA GPU with CUDA support (tested on T4 and L4)
Linux (Ubuntu recommended)
Python 3.8+

Install dependencies:

pip install torch torchvision numpy pynvml

For INT8 TensorRT:

pip install tensorrt pycuda

5. Benchmark Configuration

Warmup iterations: 20
Timed iterations: 100
Repeats per config: 3
Sweeps per config: 10

Batch sizes: 1,2,4,8,16,32,64,128,256,384,512

Input size: 3x224x224
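
A minimal sketch of how the warmup and timed iteration counts above could drive a measurement loop follows. It is not the code from the benchmark scripts: `time_callable`, `run_once`, and `sync` are hypothetical names, and how the scripts nest repeats and sweeps around this inner loop is not stated here, so that part is left out.

```python
import time

# Values taken from the configuration listed above.
WARMUP_ITERS = 20
TIMED_ITERS = 100
BATCH_SIZES = [1, 2, 4, 8, 16, 32, 64, 128, 256, 384, 512]
INPUT_SHAPE = (3, 224, 224)  # channels, height, width


def time_callable(run_once, warmup=WARMUP_ITERS, iters=TIMED_ITERS, sync=None):
    """Return a list of per-iteration latencies in milliseconds.

    run_once: zero-argument callable that performs one inference step.
    sync:     optional zero-argument callable that blocks until queued
              device work has finished (an assumption for GPU use).
    """
    def barrier():
        if sync is not None:
            sync()

    # Warmup iterations are executed but not timed.
    for _ in range(warmup):
        run_once()
    barrier()

    latencies_ms = []
    for _ in range(iters):
        start = time.perf_counter()
        run_once()
        barrier()  # wait for asynchronous work before stopping the clock
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return latencies_ms
```

With PyTorch, one plausible usage is `time_callable(lambda: model(x), sync=torch.cuda.synchronize)`, where `x` is a tensor of shape `(batch_size, *INPUT_SHAPE)`. The synchronization matters because CUDA kernels are launched asynchronously, so timing without it would mostly measure launch overhead. With three models and eleven batch sizes, each precision mode covers 33 model and batch-size combinations.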

6. Running Benchmark

Run all:

chmod +x run_all_benchmarks.sh
./run_all_benchmarks.sh

Or run each script individually, starting from the repository root each time:

cd FP32
python run_benchmark_resnet_18_50_101_FP32_NVML.py

cd FP16
python run_benchmark_resnet_18_50_101_FP16_NVML.py

cd INT8-TensorRT
python run_benchmark_resnet_18_50_101_INT8_TensorRT_NVML.py

7. Output

Each script produces:

benchmark_results_final.csv

inside its respective folder.

Columns include:

timestamp
model
batch size
median latency
p99 latency
throughput
gpu memory usage
system metadata
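
The results file can be post-processed with the standard library. The sketch below finds, for each model, the batch size with the highest throughput. The column header names `model`, `batch_size`, and `throughput` are assumptions made for this example; check the header row of your own benchmark_results_final.csv and pass the actual names.

```python
import csv


def peak_throughput_by_model(csv_file, model_col="model",
                             batch_col="batch_size",
                             throughput_col="throughput"):
    """Map each model to (batch size, throughput) at its highest throughput.

    csv_file is an open text file or any iterable of CSV lines with a
    header row. The default column names are hypothetical.
    """
    best = {}
    for row in csv.DictReader(csv_file):
        model = row[model_col]
        batch = int(row[batch_col])
        throughput = float(row[throughput_col])
        # Keep the row only if it beats the best seen so far for this model.
        if model not in best or throughput > best[model][1]:
            best[model] = (batch, throughput)
    return best
```

A typical call would open the file first, for example `with open("FP16/benchmark_results_final.csv", newline="") as f: print(peak_throughput_by_model(f))`.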

8. GPU telemetry script

start_gpu_telemetry.sh

Continuously records GPU hardware metrics using nvidia-smi during benchmark execution.

The script captures:

GPU temperature
power consumption (W)
GPU utilization (%)
memory utilization (%)
total GPU memory used (MB)
SM clock frequency
memory clock frequency
GPU performance state (P-state)

Data is logged once per second and written to:

gpu_telemetry_full_run.csv

Start telemetry before running benchmarks:

./start_gpu_telemetry.sh
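
For readers who prefer to start logging from Python, the sketch below builds an nvidia-smi query covering the metrics listed above and streams it to a CSV file once per second. It is not the contents of the telemetry script, which is not reproduced here; the helper names `build_telemetry_command` and `start_telemetry` are hypothetical, and the field list is one plausible mapping of the listed metrics onto nvidia-smi `--query-gpu` properties.

```python
import subprocess

# nvidia-smi --query-gpu properties corresponding to the metrics above.
QUERY_FIELDS = [
    "timestamp",
    "temperature.gpu",
    "power.draw",
    "utilization.gpu",
    "utilization.memory",
    "memory.used",
    "clocks.sm",
    "clocks.mem",
    "pstate",
]


def build_telemetry_command(interval_s=1):
    """Return the nvidia-smi argument list for periodic CSV logging."""
    return [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv",
        "-l", str(interval_s),  # repeat the query every interval_s seconds
    ]


def start_telemetry(log_path="gpu_telemetry_full_run.csv", interval_s=1):
    """Start nvidia-smi in the background, writing CSV rows to log_path.

    Returns (process, log_file). The caller should call
    process.terminate() and log_file.close() when the benchmarks finish.
    """
    log_file = open(log_path, "w")
    process = subprocess.Popen(build_telemetry_command(interval_s),
                               stdout=log_file)
    return process, log_file
```

Starting the logger before the benchmarks and terminating it afterwards keeps the telemetry window aligned with the benchmark run.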

9. TensorRT artifacts

The INT8 script generates:

trt_artifacts/

This directory includes:

ONNX model
TensorRT engine
calibration cache

10. Reproducibility

To reproduce results consistently, use the same batch sizes and iteration counts, ensure the GPU is not shared with other workloads, and allow a cooling interval between runs.

11. Citation

DEEP-GAP: Deep-learning Evaluation of Execution Parallelism in GPU Architectural Performance
