University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 3
-
Lijun Qu
-
Tested on: Windows 11, i7-14700HX (2.10 GHz) 32GB, Nvidia GeForce RTX 4060 Laptop
I built a CUDA path tracer into a feature-complete, toggleable renderer. Core work includes
- BSDF shading (diffuse, perfect specular),
- stochastic AA,
- stream-compacted path termination,
- material-based sorting.
I added
- physically-based refraction,
- depth of field,
- HDRI environment lighting,
- PBR texture mapping on glTF meshes (albedo/normal/metal-rough).
For performance, I implemented
- Russian Roulette,
- a CPU-built BVH with iterative GPU traversal,
- Intel Open Image Denoise (OIDN) integration for cleaner images at low spp.
I profiled with Nsight and reported rays-per-bounce and per-kernel stacked bars, showing compaction/RR benefits (especially in closed scenes), reduced intersection time with BVH on heavy meshes, and clear quality gains from DoF, refraction, and denoising.
| ![]() | ![]() |
|---|---|

| ![]() | ![]() | ![]() | ![]() |
|---|---|---|---|
I implemented specular transmission for dielectrics (glass/water) as a delta BSDF. At a surface hit, I first detect whether the ray is entering or exiting using cosThetaI = dot(-wi, n). Based on the sign, I flip the shading normal if needed and set the index-of-refraction pair (ηi, ηt) accordingly. I then try to compute the transmitted direction with Snell’s law (glm::refract(wi, n, ηi/ηt)). If refraction is impossible (total internal reflection), I fall back to perfect mirror reflection.
For the energy split, I evaluate the Fresnel term (Schlick's approximation) to get the reflectance F. I stochastically choose between reflection and transmission (probability F vs. 1−F), treating the chosen lobe as a delta event (pdf = 1). When transmitting, I scale the path throughput by (1−F) * transmissionColor * (ηt/ηi)^2 (solid-angle change), and when reflecting by F * specularColor. The new ray origin is offset by an epsilon along the chosen direction to avoid self-intersections, and the path continues with one fewer bounce. (Rough/microfacet transmission is not used; this is perfect, smooth glass.)
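The direction choice described above can be sketched on the host in plain C++. This is a minimal illustration, not the project's kernel code: `Vec3`, `schlick`, `DielectricSample`, and `sampleDielectric` are hypothetical names, the outside medium is assumed to be air (index 1.0), and the refraction formula is written out by hand to mirror what `glm::refract` computes. The sketch returns only the direction and F, and leaves throughput weighting to the caller. As an extra assumption, when the ray leaves the denser medium it evaluates Schlick with the transmitted angle, a common refinement that the text above does not mention.

```cpp
#include <cassert>
#include <cmath>
#include <utility>

struct Vec3 { float x, y, z; };
inline Vec3 operator+(Vec3 a, Vec3 b) { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
inline Vec3 operator*(float s, Vec3 v) { return {s * v.x, s * v.y, s * v.z}; }
inline float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// Schlick's approximation of the Fresnel reflectance.
inline float schlick(float cosTheta, float etaI, float etaT) {
    float r0 = (etaI - etaT) / (etaI + etaT);
    r0 *= r0;
    float m = 1.0f - cosTheta;
    return r0 + (1.0f - r0) * m * m * m * m * m;
}

struct DielectricSample {
    Vec3 dir;        // sampled outgoing direction
    bool reflected;  // true for mirror reflection (including total internal reflection)
    float F;         // Fresnel reflectance used for the stochastic choice
};

// wi: unit ray direction travelling toward the surface; n: unit surface normal;
// ior: index of refraction of the object; u: uniform random number in [0, 1).
inline DielectricSample sampleDielectric(Vec3 wi, Vec3 n, float ior, float u) {
    assert(ior > 0.0f);
    float cosI = dot(-1.0f * wi, n);
    float etaI = 1.0f, etaT = ior;            // entering: air -> object (assumed)
    if (cosI < 0.0f) {                        // exiting: flip normal, swap indices
        n = -1.0f * n;
        cosI = -cosI;
        std::swap(etaI, etaT);
    }
    float eta = etaI / etaT;
    float k = 1.0f - eta * eta * (1.0f - cosI * cosI);
    Vec3 reflectDir = wi + (2.0f * cosI) * n;
    if (k < 0.0f) return {reflectDir, true, 1.0f};   // total internal reflection
    float cosT = std::sqrt(k);
    float F = schlick(etaI > etaT ? cosT : cosI, etaI, etaT);
    if (u < F) return {reflectDir, true, F};
    return {eta * wi + (eta * cosI - cosT) * n, false, F};
}
```

At normal incidence on glass with index 1.5, F evaluates to about 0.04, so roughly 4% of samples reflect and the rest pass straight through.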
Reference: PBRv4 9.3
| ![]() | ![]() | ![]() | ![]() |
|---|---|---|---|
| No DOF | Lens Radius: 0.15, Focal Dist: 10.0 | Lens Radius: 0.3, Focal Dist: 12.0 | Lens Radius: 0.8, Focal Dist: 12.0 |
I use a thin-lens camera with two parameters: lensRadius (aperture) and focalDist. For each pixel sample, I first form the usual pinhole ray to a jittered sensor sample. I then compute the focal point by intersecting that ray with a plane at focalDist along the camera forward axis. Next, I sample a point on the circular lens using concentric-disk sampling: lensPos = eye + lensRadius * (x * camRight + y * camUp).
The final primary ray is origin = lensPos, direction = normalize(focalPoint - origin). Throughput is unchanged (camera sampling only); total blur increases with lensRadius, and setting lensRadius = 0 reduces to a pinhole camera. I offset the origin by a small epsilon along the direction to avoid self-intersection. (A circular aperture is assumed; no polygonal bokeh yet.)
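The thin-lens construction can be illustrated with a small host-side C++ sketch. It assumes an orthonormal camera basis and a unit pinhole direction; `Camera`, `Ray`, `thinLensRay`, and `concentricSampleDisk` are hypothetical names chosen for this example, and the epsilon origin offset mentioned above is left out for brevity.

```cpp
#include <cassert>
#include <cmath>

struct Vec2 { float x, y; };
struct Vec3 { float x, y, z; };
inline Vec3 operator+(Vec3 a, Vec3 b) { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
inline Vec3 operator-(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
inline Vec3 operator*(float s, Vec3 v) { return {s * v.x, s * v.y, s * v.z}; }
inline float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
inline Vec3 normalize(Vec3 v) { return (1.0f / std::sqrt(dot(v, v))) * v; }

// Shirley-Chiu concentric mapping from the unit square to the unit disk.
inline Vec2 concentricSampleDisk(float u1, float u2) {
    const float kPi = 3.14159265358979f;
    float ox = 2.0f * u1 - 1.0f, oy = 2.0f * u2 - 1.0f;
    if (ox == 0.0f && oy == 0.0f) return {0.0f, 0.0f};
    float r, theta;
    if (std::fabs(ox) > std::fabs(oy)) {
        r = ox;
        theta = (kPi / 4.0f) * (oy / ox);
    } else {
        r = oy;
        theta = kPi / 2.0f - (kPi / 4.0f) * (ox / oy);
    }
    return {r * std::cos(theta), r * std::sin(theta)};
}

struct Camera {
    Vec3 eye, forward, right, up;  // orthonormal basis (assumed)
    float lensRadius, focalDist;
};
struct Ray { Vec3 origin, dir; };

// pinholeDir: unit direction of the jittered pinhole ray; u1, u2: uniform in [0, 1).
inline Ray thinLensRay(const Camera& cam, Vec3 pinholeDir, float u1, float u2) {
    assert(dot(pinholeDir, cam.forward) > 0.0f);
    if (cam.lensRadius <= 0.0f) return {cam.eye, pinholeDir};  // pinhole camera
    // Intersect the pinhole ray with the plane at focalDist along the forward axis.
    float t = cam.focalDist / dot(pinholeDir, cam.forward);
    Vec3 focalPoint = cam.eye + t * pinholeDir;
    // Sample the lens and aim the new ray at the focal point.
    Vec2 d = concentricSampleDisk(u1, u2);
    Vec3 lensPos = cam.eye + cam.lensRadius * (d.x * cam.right + d.y * cam.up);
    return {lensPos, normalize(focalPoint - lensPos)};
}
```

Every ray generated for the same pixel sample passes through the same focal point, so geometry on the focal plane stays sharp while everything else is smeared over the lens disk.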
Reference: PBRv4 5.2
This path tracer supports .glTF 3D scene loading and rendering. This was done through wrapping the tinyGLTF library. Here are the supported capabilities:
- Triangular Mesh Loading
- Material Loading
- Albedo Texture Loading and Sampling
- Tangent-Space Normal Map Loading and Sampling
- Materials do not need to be mapped manually in your Path Tracer .json file. That is, if your glTF file has 4 unique materials, then you just include the gltf as a mesh in your .json file. You don't need to include the materials accordingly to allow for the 4 materials to appear in the render.
There are a few restrictions however:
- The mesh must be triangulated. Only triangles are supported currently.
| ![]() | ![]() | ![]() |
|---|---|---|
| Loaded all mappings | Only loaded base color | Only loaded normal |
I use glTF 2.0 (tinygltf) for textured meshes (OBJ loads geometry only). At a triangle hit, I barycentrically interpolate UVs and sample bound textures on the GPU (bilinear, repeat/clamp per material).
Base Color (Albedo): Sample baseColorTexture (sRGB → linear). Multiply by baseColorFactor. This becomes the diffuse albedo used by my BSDF and is also written to the denoiser’s albedo AOV.
Normal Mapping: If tangents are provided in glTF, I use them; otherwise I build a per-triangle TBN from position/UV derivatives. I decode the normal map from [0,1]→[-1,1], apply optional normalScale, transform from tangent space → world, and renormalize. The resulting world normal feeds shading and the normal AOV for OIDN.
Metallic-Roughness: I sample the metallicRoughnessTexture and factors; roughness from the G channel, metallic from the B channel (clamped [0,1]). I compute F0 = mix(0.04, baseColor, metallic) and use metallic/roughness to modulate my specular vs diffuse energy split (smooth conductor/insulator behavior). (No full microfacet BRDF yet—roughness currently influences lobe weighting; perfect specular/diffuse are used for scattering.)
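The per-texel math in the three steps above can be sketched in a few lines of C++. This is an illustration under assumptions: the helper names (`srgbToLinear`, `decodeNormal`, `tangentToWorld`, `computeF0`) are hypothetical, the sRGB decode uses the exact piecewise transfer function (a renderer may use a gamma 2.2 approximation instead), and normalScale is applied to the x and y components as the glTF specification describes.

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };

inline Vec3 normalize(Vec3 v) {
    float len = std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z);
    assert(len > 0.0f);
    return {v.x / len, v.y / len, v.z / len};
}

// sRGB-encoded channel in [0, 1] -> linear value.
inline float srgbToLinear(float c) {
    return c <= 0.04045f ? c / 12.92f : std::pow((c + 0.055f) / 1.055f, 2.4f);
}

// Normal-map texel in [0, 1]^3 -> unit tangent-space normal in [-1, 1]^3.
inline Vec3 decodeNormal(Vec3 texel, float normalScale) {
    return normalize({(2.0f * texel.x - 1.0f) * normalScale,
                      (2.0f * texel.y - 1.0f) * normalScale,
                      2.0f * texel.z - 1.0f});
}

// Tangent-space normal -> world space using the TBN basis vectors.
inline Vec3 tangentToWorld(Vec3 n, Vec3 T, Vec3 B, Vec3 N) {
    return normalize({n.x * T.x + n.y * B.x + n.z * N.x,
                      n.x * T.y + n.y * B.y + n.z * N.y,
                      n.x * T.z + n.y * B.z + n.z * N.z});
}

// F0 = mix(0.04, baseColor, metallic), evaluated per channel.
inline Vec3 computeF0(Vec3 baseColor, float metallic) {
    auto mix = [](float a, float b, float t) { return a + (b - a) * t; };
    return {mix(0.04f, baseColor.x, metallic),
            mix(0.04f, baseColor.y, metallic),
            mix(0.04f, baseColor.z, metallic)};
}
```

A flat normal-map texel (0.5, 0.5, 1.0) decodes to (0, 0, 1), so an untextured region leaves the interpolated surface normal unchanged after the TBN transform.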
Reference: PBRv4 10.4
This path tracer supports glTF 2.0 scene loading via a lightweight wrapper around tinygltf. It handles real-world, multi-node glTF files where a single asset can contain multiple meshes/primitives, per-node transforms, and distinct texture sets.
Supported capabilities
- Triangular mesh loading: positions, indices, normals, UVs (triangulated primitives).
- Scene graph & transforms: applies each node’s TRS (with hierarchy) so one glTF can instance the same mesh with different transforms.
- Material binding: preserves per-primitive material indices; you map these to your renderer’s materials in the scene JSON.
- Texture maps (PBR Metallic-Roughness) for glTF:
  - Base Color (Albedo): sampled in sRGB → linear, multiplied by baseColorFactor.
  - Normal Map (tangent-space): uses glTF tangents when available; otherwise builds a per-triangle TBN from geometry/UVs.
  - Metallic-Roughness: reads Roughness = G, Metallic = B and their factors; drives dielectric/metal behavior and lobe weighting.
- Samplers & wrap modes: respects glTF sampler repeat/clamp and texCoord set 0 (UV scale/offset when provided).
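As an illustration of the sampler behavior listed above, here is a minimal C++ sketch of bilinear lookup with repeat and clamp wrapping. It is a simplified stand-in rather than the renderer's GPU sampler: `Wrap`, `wrapIndex`, and `sampleBilinear` are hypothetical names, the texture is single-channel, texel centers are assumed to sit at half-integer coordinates, and glTF's mirrored-repeat mode is not handled.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

enum class Wrap { Repeat, Clamp };

// Map an arbitrary integer texel index into [0, n - 1] according to the wrap mode.
inline int wrapIndex(int i, int n, Wrap mode) {
    if (mode == Wrap::Repeat) return ((i % n) + n) % n;
    return std::min(std::max(i, 0), n - 1);
}

// Bilinear lookup in a row-major, single-channel texture of size w x h.
inline float sampleBilinear(const std::vector<float>& texels, int w, int h,
                            float u, float v, Wrap mode) {
    assert(w > 0 && h > 0 && texels.size() == static_cast<std::size_t>(w) * h);
    // Shift by half a texel so that texel centers land on integer coordinates.
    float x = u * w - 0.5f, y = v * h - 0.5f;
    int x0 = static_cast<int>(std::floor(x));
    int y0 = static_cast<int>(std::floor(y));
    float fx = x - x0, fy = y - y0;
    auto at = [&](int xi, int yi) {
        return texels[wrapIndex(yi, h, mode) * w + wrapIndex(xi, w, mode)];
    };
    float top = at(x0, y0) * (1.0f - fx) + at(x0 + 1, y0) * fx;
    float bot = at(x0, y0 + 1) * (1.0f - fx) + at(x0 + 1, y0 + 1) * fx;
    return top * (1.0f - fy) + bot * fy;
}
```

With a two-texel texture holding 0 and 1, sampling at u = 0 returns 0 under clamp but 0.5 under repeat, because the lookup blends across the seam with the texel on the opposite edge.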
What this enables
- You can import complex glTFs that include many parts (chairs, floors, props, etc.), each with its own transform and texture set—all in one file.
- Multiple nodes referencing the same mesh are instanced with different transforms.
- Works seamlessly with my BVH (CPU build, iterative GPU traversal) for large triangle counts.
Restrictions / current limits
- Primitives must be triangles (glTF triangles are supported; quads/lines are not).
- OBJ is geometry-only in my renderer (no texture mapping for OBJ).
- In the scene JSON, you still define materials that correspond to the glTF material slots (e.g., if the glTF has 4 materials, define 4 entries and map indices → names).
- Single UV set (TEXCOORD_0) assumed. Occlusion/emissive maps are not wired yet.
- Normal maps are tangent-space (object-space normals are not supported).
| ![]() | ![]() |
|---|---|
| No OIDN applied | OIDN applied |
I integrated Intel Open Image Denoise (OIDN) as a post-process on my path-traced output, feeding it the albedo and normal AOVs described above alongside the noisy color buffer. OIDN is an open-source filter library designed specifically for Monte Carlo noise.
Reference: Intel OIDN
| ![]() | ![]() |
|---|---|
| No RR applied | RR applied |
Russian Roulette (RR) probabilistically stops low-contribution paths to save work while keeping the estimator unbiased. When a path survives, its throughput is scaled by 1/p (the survival probability) so the expected contribution remains the same.
Reference: PBRv3 13.7
Why RR improves FPS. By probabilistically terminating low-contribution paths after a few bounces, RR reduces the average path length—fewer intersections, fewer shading evals, and less memory traffic per frame. It also removes “straggler” rays, improving warp coherence and making stream compaction more effective. The speedup is largest in closed scenes (many long, dim bounces) and smaller—but still positive—in open scenes.
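The survive-and-rescale step can be sketched as follows. This is a minimal C++ illustration, not the project's kernel: `RRResult` and `russianRoulette` are hypothetical names, and the survival probability (largest throughput channel, clamped to [0.05, 1]) and the `minDepth` cutoff are assumptions, since the text above does not say how p is chosen.

```cpp
#include <algorithm>
#include <cassert>

struct Vec3 { float x, y, z; };

struct RRResult {
    bool alive;       // false if the path was terminated
    Vec3 throughput;  // rescaled throughput if the path survived
};

// depth: current bounce index; minDepth: first bounce at which RR may act;
// u: uniform random number in [0, 1).
inline RRResult russianRoulette(Vec3 throughput, int depth, int minDepth, float u) {
    assert(u >= 0.0f && u < 1.0f);
    if (depth < minDepth) return {true, throughput};  // too early to terminate
    // Assumed heuristic: survive with probability equal to the largest
    // throughput channel, clamped so dim paths still survive occasionally.
    float maxC = std::max(throughput.x, std::max(throughput.y, throughput.z));
    float p = std::min(1.0f, std::max(0.05f, maxC));
    if (u >= p) return {false, {0.0f, 0.0f, 0.0f}};
    // Survivors are scaled by 1/p so the expected contribution is unchanged.
    float inv = 1.0f / p;
    return {true, {throughput.x * inv, throughput.y * inv, throughput.z * inv}};
}
```

With throughput (0.5, 0.25, 0.1), the path survives half the time and carries (1.0, 0.5, 0.2) when it does, so its expected contribution matches the original throughput.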
I use a BVH (Bounding Volume Hierarchy) to accelerate ray–triangle tests by organizing the mesh into a tree of tight AABB nodes. Rays traverse the tree, intersecting a handful of boxes and testing triangles only in the leaves they hit, which turns a naïve O(N) scan into something closer to O(log N) per ray. The BVH is built on the CPU and traversed iteratively on the GPU with early-out against the current closest hit distance (tMax), which cuts intersection work and improves cache/warp coherence. This made a practical difference: before BVH, even the Suzanne (~3,936 tris) mesh was sluggish enough that I set its material to emissive just to verify it loaded; after BVH, it renders smoothly with normal materials.
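A stack-based traversal with tMax early-out can be sketched in host-side C++ as below. The node layout (`BVHNode` with child indices and a leaf primitive range), the fixed stack depth of 64, and the `primHit` callback standing in for the ray-triangle test are all assumptions made for this example; the actual GPU kernel and node format may differ. The slab test relies on IEEE infinities for axis-parallel rays.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Vec3 { float x, y, z; };
struct Ray { Vec3 o, d; };

struct BVHNode {
    Vec3 lo, hi;               // AABB corners
    int left, right;           // child indices (internal nodes only)
    int firstPrim, primCount;  // primCount > 0 marks a leaf
};

// Slab test: does the ray overlap the box somewhere in [0, tMax]?
inline bool hitAABB(const Ray& r, const Vec3& lo, const Vec3& hi, float tMax) {
    const float o[3] = {r.o.x, r.o.y, r.o.z};
    const float d[3] = {r.d.x, r.d.y, r.d.z};
    const float bl[3] = {lo.x, lo.y, lo.z};
    const float bh[3] = {hi.x, hi.y, hi.z};
    float t0 = 0.0f, t1 = tMax;
    for (int a = 0; a < 3; ++a) {
        float inv = 1.0f / d[a];  // +/-inf for axis-parallel rays (IEEE 754 assumed)
        float tn = (bl[a] - o[a]) * inv;
        float tf = (bh[a] - o[a]) * inv;
        if (tn > tf) std::swap(tn, tf);
        t0 = std::max(t0, tn);
        t1 = std::min(t1, tf);
        if (t0 > t1) return false;
    }
    return true;
}

// Returns the index of the closest primitive hit, or -1. tMax is tightened as
// hits are found, so later boxes beyond the closest hit are skipped.
// primHit(i, ray) returns the hit distance of primitive i, or a value <= 0 on a miss.
template <class PrimHit>
int closestHit(const std::vector<BVHNode>& nodes, const Ray& ray, float& tMax,
               PrimHit primHit) {
    if (nodes.empty()) return -1;
    int best = -1;
    int stack[64];
    int sp = 0;
    stack[sp++] = 0;  // root
    while (sp > 0) {
        const BVHNode& n = nodes[stack[--sp]];
        if (!hitAABB(ray, n.lo, n.hi, tMax)) continue;  // early-out
        if (n.primCount > 0) {
            for (int i = n.firstPrim; i < n.firstPrim + n.primCount; ++i) {
                float t = primHit(i, ray);
                if (t > 0.0f && t < tMax) { tMax = t; best = i; }
            }
        } else {
            assert(sp + 2 <= 64);
            stack[sp++] = n.left;
            stack[sp++] = n.right;
        }
    }
    return best;
}
```

An explicit stack avoids recursion, which suits GPU kernels where per-thread call stacks are costly; a production traversal would typically also visit the nearer child first so that tMax shrinks sooner.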
Reference: BVH
Here are some bloopers I encountered:
| ![]() | ![]() | ![]() | ![]() |
|---|---|---|---|
Forsyth Triangle Reordering (from my Qualcomm internship). I previously analyzed the Forsyth index-buffer reordering algorithm (CPU pre-process that improves vertex-cache locality). I want to port a variant into this path tracer to test whether triangle order inside BVH leaves (and mesh buffers) improves memory locality/L2 hit rate during ray–triangle tests. Plan: build a CPU pass that reorders indices with Forsyth (or a cache-friendly heuristic), rebuild the BVH, and compare FPS, intersect kernel time, global load transactions, and L2 hit rate vs. baseline.
Why FPS isn’t monotonic with triangle count. In my measurements, higher-triangle meshes sometimes run faster than lower-triangle ones. I suspect factors beyond triangle count dominate: spatial distribution and scale (how much of the BVH the rays traverse), camera framing (on-screen coverage), material/texture cost, ray coherence (affected by DoF/refraction), and leaf sizes/splits from the BVH build. I plan to run controlled studies that normalize scene bounds and camera, vary mesh world-scale, and log nodes visited per ray, leaf hits, and BSDF/texture time to isolate which factors drive FPS.