CUDA Path Tracer

University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 3

Lijun Qu
- LinkedIn, personal website.
Tested on: Windows 11, i7-14700HX (2.10 GHz) 32GB, Nvidia GeForce RTX 4060 Laptop

- Introduction
- Features
- Performance Analysis
  - Russian Roulette Path Termination
  - BVH
- Extras and Bloopers

Example Renders:

Source

Introduction

I built out a CUDA path tracer into a feature-complete renderer in which each feature can be toggled. Core work includes:

- BSDF shading (diffuse, perfect specular),
- stochastic anti-aliasing,
- stream-compacted path termination,
- material-based sorting.

I also added:

- physically based refraction,
- depth of field,
- HDRI environment lighting,
- PBR texture mapping on glTF meshes (albedo, normal, metallic-roughness).

For performance, I implemented:

- Russian Roulette path termination,
- a CPU-built BVH with iterative GPU traversal,
- Intel Open Image Denoise (OIDN) integration for cleaner images at low spp.

I profiled with Nsight and reported rays-per-bounce and per-kernel stacked bars, showing compaction/RR benefits (especially in closed scenes), reduced intersection time with BVH on heavy meshes, and clear quality gains from DoF, refraction, and denoising.

Features

BSDFs

Diffuse

Specular

Refraction

I implemented specular transmission for dielectrics (glass/water) as a delta BSDF. At a surface hit, I first detect whether the ray is entering or exiting using cosThetaI = dot(-wi, n). Based on the sign, I flip the shading normal if needed and set the index-of-refraction pair (ηi, ηt) accordingly. I then try to compute the transmitted direction with Snell’s law (glm::refract(wi, n, ηi/ηt)). If refraction is impossible (total internal reflection), I fall back to perfect mirror reflection.

For the energy split, I evaluate the Fresnel term (Schlick approximation) to get the reflectance F. I stochastically choose between reflection and transmission (probability F vs. 1−F), treating the chosen lobe as a delta event (pdf = 1). When transmitting, I scale the path throughput by (1−F) * transmissionColor * (ηt/ηi)^2 (solid-angle change), and when reflecting by F * specularColor. The new ray origin is offset by an epsilon along the chosen direction to avoid self-intersections, and the path continues with one fewer bounce. (Rough/microfacet transmission is not used; this is perfect, smooth glass.)
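
The direction-selection steps above can be sketched on the CPU as follows. This is a minimal illustration, not the renderer's actual kernel: `Vec3`, `DeltaSample`, `schlick`, and `sampleDielectric` are hypothetical names, the surrounding medium is assumed to be air (IOR 1.0), and the refraction formula is written out by hand to match what `glm::refract` computes. Throughput scaling is deliberately left out.

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };
inline Vec3 operator+(Vec3 a, Vec3 b) { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
inline Vec3 operator*(Vec3 a, float s) { return {a.x * s, a.y * s, a.z * s}; }
inline float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// Schlick's approximation of the Fresnel reflectance.
inline float schlick(float cosTheta, float etaI, float etaT) {
    float r0 = (etaI - etaT) / (etaI + etaT);
    r0 *= r0;
    float m = 1.0f - cosTheta;
    return r0 + (1.0f - r0) * m * m * m * m * m;
}

struct DeltaSample {
    Vec3 dir;        // sampled outgoing direction
    bool reflected;  // true if the reflection lobe was chosen
    float F;         // Fresnel reflectance at this hit
};

// wi:  unit direction of the incoming ray (pointing toward the surface)
// n:   unit outward surface normal
// ior: index of refraction of the dielectric (outside is assumed to be air)
// u:   uniform random number in [0, 1)
inline DeltaSample sampleDielectric(Vec3 wi, Vec3 n, float ior, float u) {
    assert(ior > 0.0f);
    float cosI = dot(wi * -1.0f, n);
    float etaI = 1.0f, etaT = ior;
    if (cosI < 0.0f) {            // ray is exiting: flip normal, swap indices
        n = n * -1.0f;
        cosI = -cosI;
        etaI = ior;
        etaT = 1.0f;
    }
    float eta = etaI / etaT;
    float k = 1.0f - eta * eta * (1.0f - cosI * cosI);
    Vec3 reflectDir = wi + n * (2.0f * cosI);
    if (k < 0.0f) {               // total internal reflection
        return {reflectDir, true, 1.0f};
    }
    // When leaving the denser medium, use the transmitted angle for Schlick.
    float cosForSchlick = (etaI > etaT) ? std::sqrt(k) : cosI;
    float F = schlick(cosForSchlick, etaI, etaT);
    if (u < F) {
        return {reflectDir, true, F};
    }
    Vec3 refractDir = wi * eta + n * (eta * cosI - std::sqrt(k));
    return {refractDir, false, F};
}
```

At normal incidence on glass (IOR 1.5) the reflectance is about 4%, so almost every sample takes the transmission branch; a steep ray travelling inside the glass falls into the total-internal-reflection branch instead.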

Reference: PBRv4 9.3

Physically-based Depth-of-field


Renders, left to right: No DOF; Lens Radius: 0.15, Focal Dist: 10.0; Lens Radius: 0.3, Focal Dist: 12.0; Lens Radius: 0.8, Focal Dist: 12.0

I use a thin-lens camera with two parameters: lensRadius (aperture) and focalDist. For each pixel sample, I first form the usual pinhole ray to a jittered sensor sample. I then compute the focal point by intersecting that ray with a plane at focalDist along the camera forward axis. Next, I sample a point on the circular lens using concentric-disk sampling: lensPos = eye + lensRadius * (x * camRight + y * camUp).

The final primary ray is origin = lensPos, direction = normalize(focalPoint - origin). Throughput is unchanged (camera sampling only); total blur increases with lensRadius, and setting lensRadius = 0 reduces to a pinhole camera. I offset the origin by a small epsilon along the direction to avoid self-intersection. (A circular aperture is assumed; no polygonal bokeh yet.)
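
The thin-lens construction described above can be sketched as follows. This is an illustrative CPU version under stated assumptions: `thinLensRay` and `concentricDisk` are hypothetical names, the camera basis vectors are assumed to be unit length and orthogonal, and the pinhole direction is assumed to point into the forward hemisphere.

```cpp
#include <cassert>
#include <cmath>

struct Vec2 { float x, y; };
struct Vec3 { float x, y, z; };
inline Vec3 operator+(Vec3 a, Vec3 b) { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
inline Vec3 operator-(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
inline Vec3 operator*(Vec3 a, float s) { return {a.x * s, a.y * s, a.z * s}; }
inline float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
inline Vec3 normalize(Vec3 a) { return a * (1.0f / std::sqrt(dot(a, a))); }

// Concentric mapping from the unit square [0,1)^2 to the unit disk.
inline Vec2 concentricDisk(float u1, float u2) {
    const float kPi = 3.14159265358979f;
    float ox = 2.0f * u1 - 1.0f, oy = 2.0f * u2 - 1.0f;
    if (ox == 0.0f && oy == 0.0f) return {0.0f, 0.0f};
    float r, theta;
    if (std::fabs(ox) > std::fabs(oy)) {
        r = ox;
        theta = (kPi / 4.0f) * (oy / ox);
    } else {
        r = oy;
        theta = kPi / 2.0f - (kPi / 4.0f) * (ox / oy);
    }
    return {r * std::cos(theta), r * std::sin(theta)};
}

struct Ray { Vec3 origin, dir; };

// pinholeDir: unit direction of the usual pinhole ray through the jittered
// sensor sample. u1, u2: uniform random numbers in [0, 1).
inline Ray thinLensRay(Vec3 eye, Vec3 forward, Vec3 right, Vec3 up,
                       Vec3 pinholeDir, float lensRadius, float focalDist,
                       float u1, float u2) {
    if (lensRadius <= 0.0f) return {eye, pinholeDir};  // pinhole camera
    float cosToAxis = dot(pinholeDir, forward);
    assert(cosToAxis > 0.0f);
    // Intersect the pinhole ray with the plane at focalDist along forward.
    Vec3 focalPoint = eye + pinholeDir * (focalDist / cosToAxis);
    Vec2 d = concentricDisk(u1, u2);
    Vec3 lensPos = eye + (right * d.x + up * d.y) * lensRadius;
    return {lensPos, normalize(focalPoint - lensPos)};
}
```

Every lens sample for a given pixel aims at the same focal point, so geometry on the focal plane stays sharp while everything nearer or farther is blurred in proportion to `lensRadius`.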

Reference: PBRv4 5.2

Mesh Loading

This path tracer supports glTF 3D scene loading and rendering, implemented by wrapping the tinygltf library. Supported capabilities:

- Triangular mesh loading
- Material loading
- Albedo texture loading and sampling
- Object-space normal map loading and sampling

Materials do not need to be mapped manually in your path tracer .json file. That is, if your glTF file has 4 unique materials, you only need to include the glTF as a mesh in your .json file; you do not need to list the materials separately for all 4 to appear in the render.

There is one restriction, however:

- The mesh must be triangulated. Only triangles are supported currently.

Texture and Normal Mapping + Metallic Mapping


Renders, left to right: Loaded all mappings; Only loaded base color; Only loaded normal

I use glTF 2.0 (tinygltf) for textured meshes (OBJ loads geometry only). At a triangle hit, I barycentrically interpolate UVs and sample bound textures on the GPU (bilinear, repeat/clamp per material).
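
The UV interpolation and bilinear lookup described above can be sketched on the CPU as follows. This is only an illustration of the technique: on the GPU the lookup may well go through a CUDA texture object instead, and `Texture`, `Wrap`, `interpolateUV`, and `sampleBilinear` are hypothetical names, not identifiers from the renderer.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

struct Vec2 { float x, y; };
struct Vec3 { float x, y, z; };

enum class Wrap { Repeat, Clamp };

struct Texture {
    int width = 0, height = 0;
    std::vector<Vec3> texels;  // row-major, width * height entries
    Wrap wrap = Wrap::Repeat;
};

// Barycentric interpolation of per-vertex UVs; b1 and b2 are the weights of
// vertices 1 and 2, and vertex 0 gets the remainder.
inline Vec2 interpolateUV(Vec2 uv0, Vec2 uv1, Vec2 uv2, float b1, float b2) {
    float b0 = 1.0f - b1 - b2;
    return {b0 * uv0.x + b1 * uv1.x + b2 * uv2.x,
            b0 * uv0.y + b1 * uv1.y + b2 * uv2.y};
}

inline int wrapIndex(int i, int n, Wrap mode) {
    if (mode == Wrap::Clamp) return i < 0 ? 0 : (i >= n ? n - 1 : i);
    int m = i % n;
    return m < 0 ? m + n : m;
}

inline Vec3 fetch(const Texture& t, int x, int y) {
    return t.texels[wrapIndex(y, t.height, t.wrap) * t.width +
                    wrapIndex(x, t.width, t.wrap)];
}

inline Vec3 lerp(Vec3 a, Vec3 b, float w) {
    return {a.x + (b.x - a.x) * w, a.y + (b.y - a.y) * w, a.z + (b.z - a.z) * w};
}

// Bilinear filtering with texel centers at (i + 0.5) / size.
inline Vec3 sampleBilinear(const Texture& t, Vec2 uv) {
    float fx = uv.x * t.width - 0.5f;
    float fy = uv.y * t.height - 0.5f;
    int x0 = static_cast<int>(std::floor(fx));
    int y0 = static_cast<int>(std::floor(fy));
    float tx = fx - x0, ty = fy - y0;
    Vec3 top = lerp(fetch(t, x0, y0), fetch(t, x0 + 1, y0), tx);
    Vec3 bottom = lerp(fetch(t, x0, y0 + 1), fetch(t, x0 + 1, y0 + 1), tx);
    return lerp(top, bottom, ty);
}
```

With a 2x1 black/white texture, sampling halfway between the two texel centers returns mid-gray, and a lookup at u = 1.25 lands on the black texel under repeat but on the white one under clamp.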

Base Color (Albedo): Sample baseColorTexture (sRGB → linear). Multiply by baseColorFactor. This becomes the diffuse albedo used by my BSDF and is also written to the denoiser’s albedo AOV.

Normal Mapping: If tangents are provided in glTF, I use them; otherwise I build a per-triangle TBN from position/UV derivatives. I decode the normal map from [0,1]→[-1,1], apply optional normalScale, transform from tangent space → world, and renormalize. The resulting world normal feeds shading and the normal AOV for OIDN.

Metallic-Roughness: I sample the metallicRoughnessTexture and factors; roughness from the G channel, metallic from the B channel (clamped [0,1]). I compute F0 = mix(0.04, baseColor, metallic) and use metallic/roughness to modulate my specular vs diffuse energy split (smooth conductor/insulator behavior). (No full microfacet BRDF yet—roughness currently influences lobe weighting; perfect specular/diffuse are used for scattering.)
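
The three decode steps above can be sketched as follows. This is a minimal CPU illustration with hypothetical function names (`srgbToLinear`, `decodeNormal`, `decodeMetalRough`, `computeF0`); the TBN vectors are assumed to be supplied by the caller, and the normal scale is applied to the X and Y components as the glTF 2.0 specification describes.

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };
inline Vec3 operator+(Vec3 a, Vec3 b) { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
inline Vec3 operator*(Vec3 a, float s) { return {a.x * s, a.y * s, a.z * s}; }
inline float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
inline Vec3 normalize(Vec3 a) { return a * (1.0f / std::sqrt(dot(a, a))); }
inline float clamp01(float v) { return v < 0.0f ? 0.0f : (v > 1.0f ? 1.0f : v); }

// Standard sRGB transfer function, applied per channel to base color texels.
inline float srgbToLinear(float c) {
    return c <= 0.04045f ? c / 12.92f
                         : std::pow((c + 0.055f) / 1.055f, 2.4f);
}

// Decode a tangent-space normal texel from [0,1] to [-1,1], apply the
// optional scale to X/Y, and rotate it into world space with the TBN basis.
inline Vec3 decodeNormal(Vec3 texel, float normalScale, Vec3 T, Vec3 B, Vec3 N) {
    float nx = (2.0f * texel.x - 1.0f) * normalScale;
    float ny = (2.0f * texel.y - 1.0f) * normalScale;
    float nz = 2.0f * texel.z - 1.0f;
    return normalize(T * nx + B * ny + N * nz);
}

struct MetalRough { float roughness, metallic; };

// glTF packs roughness in the G channel and metallic in the B channel.
inline MetalRough decodeMetalRough(Vec3 texel, float roughnessFactor,
                                   float metallicFactor) {
    return {clamp01(texel.y * roughnessFactor), clamp01(texel.z * metallicFactor)};
}

// F0 = mix(0.04, baseColor, metallic), evaluated per channel.
inline Vec3 computeF0(Vec3 baseColor, float metallic) {
    const float d = 0.04f;
    return {d + (baseColor.x - d) * metallic,
            d + (baseColor.y - d) * metallic,
            d + (baseColor.z - d) * metallic};
}
```

A texel of (0.5, 0.5, 1.0) decodes to the unperturbed surface normal, and `computeF0` returns 0.04 for a pure dielectric and the base color itself for a pure metal.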

Reference: PBRv4 10.4

Mesh Loading

This path tracer supports glTF 2.0 scene loading via a lightweight wrapper around tinygltf. It handles real-world, multi-node glTF files where a single asset can contain multiple meshes/primitives, per-node transforms, and distinct texture sets.

Supported capabilities

- Triangular mesh loading: positions, indices, normals, UVs (triangulated primitives).
- Scene graph & transforms: applies each node’s TRS (with hierarchy) so one glTF can instance the same mesh with different transforms.
- Material binding: preserves per-primitive material indices; you map these to your renderer’s materials in the scene JSON.
- Texture maps (PBR Metallic-Roughness) for glTF:
  - Base Color (Albedo): sampled in sRGB → linear, multiplied by baseColorFactor.
  - Normal Map (tangent-space): uses glTF tangents when available; otherwise builds a per-triangle TBN from geometry/UVs.
  - Metallic-Roughness: reads Roughness = G, Metallic = B and their factors; drives dielectric/metal behavior and lobe weighting.
- Samplers & wrap modes: respects glTF sampler repeat/clamp and texCoord set 0 (UV scale/offset when provided).
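
Applying each node's TRS with hierarchy amounts to composing local matrices from the root down. The sketch below shows one way to do that, assuming column-major 4x4 matrices and (x, y, z, w) unit quaternions as in glTF; `Node`, `fromTRS`, and `worldMatrix` are hypothetical names, and the flat parent-index layout is an assumption made for brevity rather than the loader's actual data structure.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

struct Vec3 { float x, y, z; };
struct Quat { float x, y, z, w; };  // unit quaternion, glTF component order
struct Mat4 { float m[16]; };       // column-major: element(row, col) = m[col * 4 + row]

inline Mat4 multiply(const Mat4& a, const Mat4& b) {
    Mat4 r{};
    for (int c = 0; c < 4; ++c)
        for (int row = 0; row < 4; ++row) {
            float s = 0.0f;
            for (int k = 0; k < 4; ++k) s += a.m[k * 4 + row] * b.m[c * 4 + k];
            r.m[c * 4 + row] = s;
        }
    return r;
}

// Local matrix M = T * R * S.
inline Mat4 fromTRS(Vec3 t, Quat q, Vec3 s) {
    float xx = q.x * q.x, yy = q.y * q.y, zz = q.z * q.z;
    float xy = q.x * q.y, xz = q.x * q.z, yz = q.y * q.z;
    float wx = q.w * q.x, wy = q.w * q.y, wz = q.w * q.z;
    Mat4 m{};
    m.m[0] = (1.0f - 2.0f * (yy + zz)) * s.x;
    m.m[1] = 2.0f * (xy + wz) * s.x;
    m.m[2] = 2.0f * (xz - wy) * s.x;
    m.m[4] = 2.0f * (xy - wz) * s.y;
    m.m[5] = (1.0f - 2.0f * (xx + zz)) * s.y;
    m.m[6] = 2.0f * (yz + wx) * s.y;
    m.m[8] = 2.0f * (xz + wy) * s.z;
    m.m[9] = 2.0f * (yz - wx) * s.z;
    m.m[10] = (1.0f - 2.0f * (xx + yy)) * s.z;
    m.m[12] = t.x; m.m[13] = t.y; m.m[14] = t.z; m.m[15] = 1.0f;
    return m;
}

inline Vec3 transformPoint(const Mat4& a, Vec3 p) {
    return {a.m[0] * p.x + a.m[4] * p.y + a.m[8] * p.z + a.m[12],
            a.m[1] * p.x + a.m[5] * p.y + a.m[9] * p.z + a.m[13],
            a.m[2] * p.x + a.m[6] * p.y + a.m[10] * p.z + a.m[14]};
}

struct Node {
    int parent;  // index of the parent node, or -1 for a root
    Vec3 translation;
    Quat rotation;
    Vec3 scale;
};

// World matrix = parent's world matrix * local TRS matrix.
inline Mat4 worldMatrix(const std::vector<Node>& nodes, int i) {
    const Node& n = nodes[i];
    Mat4 local = fromTRS(n.translation, n.rotation, n.scale);
    return n.parent < 0 ? local : multiply(worldMatrix(nodes, n.parent), local);
}
```

Two nodes that reference the same mesh simply get two different world matrices, which is what lets a single glTF instance one mesh at several placements.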

What this enables

- You can import complex glTFs that include many parts (chairs, floors, props, etc.), each with its own transform and texture set, all in one file.
- Multiple nodes referencing the same mesh are instanced with different transforms.
- Works seamlessly with my BVH (CPU build, iterative GPU traversal) for large triangle counts.

Restrictions / current limits

- Primitives must be triangles (glTF triangles are supported; quads/lines are not).
- OBJ is geometry-only in my renderer (no texture mapping for OBJ).
- In the scene JSON, you still define materials that correspond to the glTF material slots (e.g., if the glTF has 4 materials, define 4 entries and map indices → names).
- Single UV set (TEXCOORD_0) assumed. Occlusion/emissive maps are not wired yet.
- Normal maps are tangent-space (object-space normals are not supported).

OIDN


Renders, left to right: No OIDN applied; OIDN applied

I integrated Intel Open Image Denoise (OIDN) as a post-process on my path-traced output. OIDN is an open-source, CPU-based filter designed specifically for Monte Carlo noise.

Reference: Intel OIDN

Performance Analysis

Russian Roulette Path Termination


Renders, left to right: No RR applied; RR applied

Russian Roulette (RR) probabilistically stops low-contribution paths to save work while keeping the estimator unbiased. When a path survives, its throughput is scaled by 1/p (the survival probability) so the expected contribution remains the same.
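
A minimal sketch of the termination test follows. The survival probability used here (the largest throughput channel, clamped to [0.05, 1]) and the `minDepth` warm-up are common choices assumed for illustration, not necessarily the ones in my kernel; `russianRoulette` is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };

// Returns false if the path is terminated. On survival, the throughput is
// divided by the survival probability p so the estimator stays unbiased:
// E[contribution] = p * (throughput / p) + (1 - p) * 0 = throughput.
// u is a uniform random number in [0, 1).
inline bool russianRoulette(Vec3& throughput, int depth, int minDepth, float u) {
    if (depth < minDepth) return true;  // never cut the first few bounces
    float maxChannel = std::max(throughput.x, std::max(throughput.y, throughput.z));
    if (maxChannel <= 0.0f) return false;  // path can no longer contribute
    float p = std::min(std::max(maxChannel, 0.05f), 1.0f);
    if (u >= p) return false;  // terminated
    float inv = 1.0f / p;
    throughput = {throughput.x * inv, throughput.y * inv, throughput.z * inv};
    return true;
}
```

For a throughput of (0.5, 0.25, 0.1) past the warm-up depth, the path survives with probability 0.5 and, when it does, continues with throughput (1.0, 0.5, 0.2).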

Reference: PBRv3 13.7

Why RR improves FPS. By probabilistically terminating low-contribution paths after a few bounces, RR reduces the average path length—fewer intersections, fewer shading evals, and less memory traffic per frame. It also removes “straggler” rays, improving warp coherence and making stream compaction more effective. The speedup is largest in closed scenes (many long, dim bounces) and smaller—but still positive—in open scenes.

BVH

I use a BVH (bounding volume hierarchy) to accelerate ray–triangle tests by organizing the mesh into a tree of tight AABB nodes. Rays traverse the tree, intersecting a handful of boxes and testing triangles only in the leaves they hit, which turns a naive O(N) scan into something closer to O(log N) per ray. The BVH is built on the CPU and traversed iteratively on the GPU with early-out using the current closest tMax, which cuts intersection work and improves cache/warp coherence. This made a practical difference: before BVH, even the Suzanne (~3,936 tris) mesh was sluggish enough that I set its material to emissive just to verify it loaded; after BVH, it renders smoothly with normal materials.
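
The iterative, stack-based traversal with tMax early-out can be sketched as follows. This is a CPU illustration under stated assumptions: the flat node layout (`BvhNode` with child indices and a leaf triangle range), the fixed stack depth of 64, and the function names are all hypothetical, and the slab test relies on IEEE infinities when a ray direction component is zero.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

struct Vec3 { float x, y, z; };
inline Vec3 operator-(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
inline float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
inline Vec3 cross(Vec3 a, Vec3 b) {
    return {a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x};
}

struct Aabb { Vec3 lo, hi; };
struct Tri { Vec3 v0, v1, v2; };
struct BvhNode {
    Aabb box;
    int left, right;         // child indices (interior nodes)
    int firstTri, triCount;  // triCount > 0 marks a leaf
};

// Slab test, clipped to [0, tMax] so boxes beyond the closest hit are skipped.
inline bool hitAabb(const Aabb& b, Vec3 o, Vec3 invD, float tMax) {
    float tx0 = (b.lo.x - o.x) * invD.x, tx1 = (b.hi.x - o.x) * invD.x;
    float ty0 = (b.lo.y - o.y) * invD.y, ty1 = (b.hi.y - o.y) * invD.y;
    float tz0 = (b.lo.z - o.z) * invD.z, tz1 = (b.hi.z - o.z) * invD.z;
    float tEnter = std::max({std::min(tx0, tx1), std::min(ty0, ty1), std::min(tz0, tz1), 0.0f});
    float tExit = std::min({std::max(tx0, tx1), std::max(ty0, ty1), std::max(tz0, tz1), tMax});
    return tEnter <= tExit;
}

// Moller-Trumbore ray-triangle test; accepts hits in (1e-4, tMax).
inline bool hitTri(const Tri& t, Vec3 o, Vec3 d, float tMax, float& tHit) {
    Vec3 e1 = t.v1 - t.v0, e2 = t.v2 - t.v0;
    Vec3 p = cross(d, e2);
    float det = dot(e1, p);
    if (std::fabs(det) < 1e-8f) return false;
    float inv = 1.0f / det;
    Vec3 s = o - t.v0;
    float u = dot(s, p) * inv;
    if (u < 0.0f || u > 1.0f) return false;
    Vec3 q = cross(s, e1);
    float v = dot(d, q) * inv;
    if (v < 0.0f || u + v > 1.0f) return false;
    float tt = dot(e2, q) * inv;
    if (tt <= 1e-4f || tt >= tMax) return false;
    tHit = tt;
    return true;
}

// Returns the index of the closest triangle hit, or -1. tClosest is the
// running tMax: pass in the maximum ray distance, read back the hit distance.
inline int traverse(const std::vector<BvhNode>& nodes, const std::vector<Tri>& tris,
                    Vec3 o, Vec3 d, float& tClosest) {
    Vec3 invD = {1.0f / d.x, 1.0f / d.y, 1.0f / d.z};
    int stack[64];
    int sp = 0;
    stack[sp++] = 0;  // root
    int hitIndex = -1;
    while (sp > 0) {
        const BvhNode& n = nodes[stack[--sp]];
        if (!hitAabb(n.box, o, invD, tClosest)) continue;  // early-out
        if (n.triCount > 0) {
            for (int i = n.firstTri; i < n.firstTri + n.triCount; ++i) {
                float t;
                if (hitTri(tris[i], o, d, tClosest, t)) { tClosest = t; hitIndex = i; }
            }
        } else {
            stack[sp++] = n.left;
            stack[sp++] = n.right;
        }
    }
    return hitIndex;
}
```

Because `tClosest` shrinks as hits are found, subtrees whose boxes start beyond the current closest hit are rejected by the box test without touching their triangles.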

Reference: BVH

Extras and Bloopers

Bloopers

Here are some bloopers I encountered:

Future Work!

Forsyth Triangle Reordering (from my Qualcomm internship). I previously analyzed the Forsyth index-buffer reordering algorithm (CPU pre-process that improves vertex-cache locality). I want to port a variant into this path tracer to test whether triangle order inside BVH leaves (and mesh buffers) improves memory locality/L2 hit rate during ray–triangle tests. Plan: build a CPU pass that reorders indices with Forsyth (or a cache-friendly heuristic), rebuild the BVH, and compare FPS, intersect kernel time, global load transactions, and L2 hit rate vs. baseline.

Why FPS isn’t monotonic with triangle count. In my measurements, higher-triangle meshes sometimes run faster than lower-triangle ones. I suspect factors beyond triangle count dominate: spatial distribution & scale (how much of the BVH the rays traverse), camera framing (on-screen coverage), material/texture cost, ray coherence (affected by DoF/refraction), and the leaf sizes and splits chosen by the BVH build. I plan to run controlled studies that normalize scene bounds and camera, vary mesh world-scale, and log nodes visited per ray, leaf hits, and BSDF/texture time to isolate which factors drive FPS.
