University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 3
Avi Serebrenik
* Tested on: Windows 11, i7-13620H @ 2.4GHz 32GB, RTX 4070 8GB (Laptop)
"Time" rendered at 2000x1600 with 5000 iterations and a bounce depth of 12
This Path Tracer was built with C++ and CUDA, with the code having the following top-level structure, where the first two and the last are done once per scene; 3, 4, and 9 are called once per iteration; and 4-8 are called once per bounce:
- Import scene information from a JSON file, which includes basic materials, their positions and transformation matrices, and the paths for any external gLTF meshes with attached transformation matrices.
- Convert this scene information into CUDA-readable formats and structs.
- Call any needed information in the main path tracing function.
- Start the path tracing loop by first calculating a ray from the camera through each pixel via a kernel.
- Using a computeIntersect kernel, check for the closest intersection from that ray and set the intersection information accordingly.
- If enabled, sort intersections and rays by material information.
- Calculate the color information and the next bounce direction of each ray in separate or one unified kernel.
- If enabled, use stream compaction to do away with rays that have terminated their bounces and repeat the path tracing loop if we have bounces and rays left.
- Post loop, use this loop's information to add to the information in the scene.
- Post tracing, clean up the memory, and save the final image to a file.
Beyond these features, I also added the following, which I will describe and analyze the results of below:
- Supporting diffuse, specular, transmissive, dielectric, and PBR (using a GGX Microfacet model).
- Stochastic Sampled anti-aliasing.
- Supporting JSON primitives (spheres and boxes), and gLTF mesh imports with external (non-binary) textures using the tinygLTF library.
- Physically accurate Depth of Field
- .hdr environment maps
- Bounding Volume Hierarchies using a Surface Area Heuristic for much faster runtime improvements and binned construction for much faster pre-process building.
- Stream Compaction
- Material Sorting
- Various ImGui features to toggle all of these above, display runtime information, and toggle channels for swapping between the scene, its albedo, its metal-roughness, its BVHs, its normals, or its depth.
The scenes, gLTFs, and environment maps I used and a few more are in the scenes folder if you would like to recreate these on your own machine. I believe they should all work after setting up the project with Cmake and building it, but I only tested it on my personal laptop. The images in this readme and more (including a 4k version of the image above, which took approx. 2 hours to render at 0.7fps) are found in the img folder. The sources for everything are at the bottom of this readme in the credits section.
"Cornell Showcase" showing off supported materials. Rendered at 800x800 with 5000 iterations and a bounce depth of 8.
This one is based on my earlier path tracer in OpenGL, and uses much of the same code ported over to CUDA, so I will be focusing on their implementation over the theory. In order to easily support the sorted and unsorted versions, each material has a kernel, but the kernels all just call their respective inline functions, so the one big unsorted kernel can share that too. These kernels and their inline functions can be found in utils.cuh and utils.cu.
Diffuse uses cosine weighted hemisphere sampling to give us a random bounce direction and easy calculations.
Specular simply gives us perfect refraction along the intersection's surface normal and doesn't add material color to the path.
Transmissive took a while to fully port here, which turned out to be mainly the fault of a given sphere intersection code that I had to then modify, but relies on Snell's Law by using glm::refract. One of the indices of reflection is presumed to be air aka 1.0, and the other needs to be encoded in the material. This means I don't support transmission between two non-air materials. I made this choice because this would only be a small fraction of all cases and would need some extra information to be stored for each ray, slowing down everything else.
Dielectric combines specular and transmissive materials. I decide which material to use via a random number generator and a probability of chosing specular over transmissive which is encoded in the material information. I then attenuate the results by 2 to account for this, and a Fresnel Dielectric Evaluation helper function.
PBR materials use a GGX/Trowbridge-Reitz Microfacet model. This alone took up multiple times the work as the others combined, as simply "porting it over to CUDA" as I mentioned above didn't work... After getting all the encoded material information and checking for material maps, I first determine the incoming ray direction (wi) by using a similar probability-based split between purely diffuse materials (which I handle with a simple hemisphere check) and the rest, which I handle with GGX importance sampling. Next, I calculate the BRDF with a helper function that uses a Trowbridge-Reitz function to calculate the Distribution term, a Schlick approximation for the F term, and the Smith approximation for the Geometry term for the specular component. The diffuse component is quite standard and incorporates the metallic factor, and I simply lerp them together using the F term (which becomes kS). For the PDF, I use the same Trowbridge-Reitz function for the D term to calculate the specular component, and I use the Fresnel Schlick approximation to come up with a specular probability value that I use to simply lerp between this specular and a diffuse component, similarly to the BRDF. Importantly, I found that the roughness needs to be clamped between 0.05 and 1 or 0.95, depending on the situation, to avoid some floating-point errors and weird outputs. However, as can be seen in the image below, the final result doesn't suffer from this clamping.
"PBR and Dielectric Showcase". The dielectric spheres have a varying IOR, starting at 1 (air) and increasing by 0.25 until 2.5 at the end. Rendered at 800x800 with 5000 iterations and a bounce depth of 8.
This method is quite short and easy to implement and offers great results and basically 0 performance cost. Instead of shooting our rays through the very center of each pixel, leaving a pixel's distance between each neighboring one, we can simply randomly offset--jitter--them within these pixels to create variable distances that do away with these harsh pixel edges. To implement this, I just created a random x and y offset in the range [-0.5, 0.5] and added them to my coordinates before translating them into pixel space. The results speak for themselves:
Edge of sphere with no anti-aliasing |
Edge of sphere with stochastic anti-aliasing |
The JSON parsing relies on the nlohmann JSON library to easily access the information, and then this is parsed in the Scene class's loadFromJSON function. Each material needs to have some specific basic tags, and all of my JSON scenes can be viewed in the scenes folder. For a good example of multiple different materials and objects, I would recommend looking at "cornellShowcase.json," which is the scene shown at the top of the material evaluation section.
Most of this functionality was given in the base code for this project, and I only added the specific materials and their parameters, and the GLTF array field. This array holds a string for a gLTF file path to read from and a rotation, translation, and scale vector to properly fit these into a scene. In the loadFromJSON function, these are simply all loaded in, and then later I call the loadFromGLTF function. This latter function uses two different classes, which are both in GLTFManager.h and GLTFManager.cu, called GLTFLoader and GLTFManager.
GLTFLoader is mainly for calling the tinygLTF library and getting all the content in a nice format for me to read. It stores temporary things so I can delete all of them by the time I actually pathtrace. This allows for less memory use at runtime and better encapsulation at the cost of not being able to switch scenes at runtime, but that's not something I wanted to support.
GLTFManager, on the other hand, does the more in-depth processes of creating proper triangles, creating the material array that these triangles will reference, and managing my BVH nodes (more on those later) by calling functions from the BVH class that's also in this file. It also handles copying everything to the device from the host and clearing all that at the end of the process. Since I support multiple gLTFs, this is done sequentially, one by one, to make sure nothing gets tangled up, then I create the BVHs from every triangle, before everything gets copied to the device.
Textures are also read in at this point with a TextureLoader class, which is separate because it also handles environment map loading. Both these functionalities use the cudaTextureObject_t class for copying textures to the device and easy access from there. The output of each of these texture channels (for the Time scene) can be seen here:
Albedo |
Metal Rough Mask |
Normal |
"PBR and Dielectric Showcase" Metal Rough Mask
My implementation of Depth of Field follows the physically based depth of field PBRT book article on it very closely. I simply added a focus distance and a lens radius variable to my camera, which can be set at runtime in the ImGui window, and instead of using a pinhole camera source, I jitter my ray origin to be on the lens. The performance overhead of this is negligible, as it is only a few extra calculations, but we get some very nice-looking results:
Depth of Field - Near |
Depth of Field - Mid |
Depth of Field - Far |
No Depth of Field |
Environment maps can be swapped during runtime from an ImGui dropdown menu that just changes where the path is. If it is new, it frees up the current map and loads in the new one as a cudaTextureObject_t, so that at any given time we only need to store one environment map in memory. I decided on this approach because, as shown in the GIF below, the swap is quite fast, and I already has a lot of assets in memory. Loading them in is done with the Texture Loader class, same as the textures, but the actual function is a little different, since they are in High Dynamic Range and need some extra transformations. Unfortunately, I was unable to implement importance sampling or MIS for this project due to the limited time frame, so some of these environment maps create quite a few fireflies, but this is something that I will attempt to fix in the future.
"Dragons" in varying environment maps
As most graphics people can tell, the GIF above has 8 Stanford Dragons in it, yet the scene still runs at about 20-30fps depending on the viewing angle. I'm tooting my own horn here, but that's pretty dang good for close to 7 million triangles! This is accomplished by bounding volumes in the shape of squares. Basically, we create a hierarchical bounding volume structure that encompasses the triangles in the scene. Each box, because I'm using boxes for the volumes, has two children, and only leaf BVH nodes contain actual triangles. Thus, when computing ray intersections, instead of checking against each triangle, we simply traverse down the BVH tree until we hit a leaf node, and then we only need to check the triangles within. Box intersections are very easy to compute, as each node stores a maximum and a minimum corner value, so we only need to check against a few planes. On the CPU, this intersection check would be recursive, but I employ a stack-based method as described in this YouTube video.
How does one construct a BVH? I used a blog by "Jacco" for this part of the code, but the basic naive implementation simply splits each bounding box along the largest axis to get each child. It subdivides recursively and terminates when there are two or fewer triangles in the box. This is a good starting point, but it does not get us ~25fps for 7million triangles, so I implemented some later parts of the blog to speed things up. First, the BVH nodes themselves become smaller, only storing the necessary information of their min and max values, the triangle count of this node, and a final value which is either the index of their left child node in the BVH array (knowing that the right is that +1) if the triangle count is 0 (i.e. if this is not a leaf node), or the index of the first child triangle if this is a leaf node. This optimization means that we need to sort the triangles, but that's easily done and allows these nodes to be only 32 bytes, allowing for a pair of sibling nodes to sit cozily on one 64-byte cache line. The speedup is worth it.
Next, we move on from the naive subdivision to instead divide more evenly based on the probability of hitting a triangle in the box. We get this probability by checking the area of the triangles, since a larger area is the best predictor. However, this comes with a lot more calculations at build time (O(n^2) construction) to the point where my larger scenes like Time or Dragons never even built - or maybe they would eventually, but I got impatient. Thus, instead of checking every triangle when subdividing, we bin them (I stuck with 8 bins) using evenly spaced dividing lines, and then only checking within these bins. I believe this means that we won't get the fully optimal subdivisions, but instead of waiting forever, I now only need to wait a few seconds to construct these BVHs.For a quick comparison of what a Naive vs a Surface Area Heuristic (SAH) based assembly looks like, here is an image of each using the Stanford Dragon model. Can you tell which is which before looking at their description?
Stanford Dragon with Naive BVH construction |
Stanford Dragon with SAH BVH construction |
The differences seem minimal, but they're there in crucial spots. For example, notice the big box spanning the left side of the dragon in the naive version that isn't there in SAH. Similarly, notice the multiple overlapping boxes above the top part of the tail in the naive. These have massive empty areas that the SAH already does away with. Also, notice how on the left side, the two boxes are split higher in the naive version than in SAH. This is because there is much more detail and triangles in the feet of the dragon, so a more even split should be lower along the neck, as opposed to the naive, more distance-based split. These might seem like my nitpicking, but as I discuss in the Analysis section, the runtime difference is drastic.
Finally, as I mentioned above, and as one can see in the GIF, the framerate of the Dragons scene varies based on viewing angle. This is because these BVHs are constructed before runtime and thus optimize for a general case. However, we have different triangle overlaps and setups from different angles. To optimize for this, one can traverse the BVH in a distance-based order when finding intersections, but my implementation of this was buggy, and I never got around to fixing it. This is something that I hope to do in the future. For a funny bug that I got while trying to implement this, please look at the Bloopers section.
Since we loop over our paths multiple times (usually 8 for me), we want to get rid of paths that have already terminated. To do this, I simply partition my intersection and path arrays using thrust::partition and set the new number of paths accordingly. This usually gives a decent performance boost, as discussed in the Analysis section below.
"Cornell Showcase" with the material output option showing the differing materials on my code's side
The first ImGUI tab with Environment Map and Draw Mode options |
The second tab with Depth of Field and AA options |
The third tab with performance toggles |
The fourth tab with camera information |
I added a variety of ImGUI features for debugging purposes and to assist me with creating this Readme, which turned out to be quite helpful. The window constantly displays performance and scene statistics, such as the number of triangles and the current frame rate and iteration, has a button at the bottom to save the current display as an image, and has 4 tabs for different options/data. They are shown right above, and their functions have all been discussed before, besides the draw mode dropdown. This helped me greatly with debugging and provided some very neat visuals by displaying the following channels:
Before jumping in, here are the scenes and their stats that I am using for analysis. Most of these have popped up above in the showcase, since they display a variety of features, so I thought they would also be a varied enough collection for good performance analysis.
Cornell Showcase |
Time |
PBR and Dielectric Showcase |
Dragons |
Dragons Box |
| Name | Dimension | Triangles | SAH BVH Nodes | Naive BVH Nodes | Traced Depth |
|---|---|---|---|---|---|
| Cornell Showcase (CS) | 800x800 | 1,315,991 | 2,609,273 | 1,423,557 | 8 |
| Time (T) | 2000x1600 | 479,113 | 948,991 | 44,707 | 12 |
| PBR and Dielectric Showcase (PBR) | 800x800 | 501,776 | 1,001,021 | 524,093 | 8 |
| Dragons (D) | 800x800 | 6,970,448 | 13,772,339 | 7,277,053 | 8 |
| Dragons Box (DB) | 800x800 | 6,970,448 | 13,772,339 | 7,277,053 | 8 |
When plotting the timing of each step in the path tracer for one iteration, we get this graph:
Comparison chart of everything
This doesn't tell us much, since intersection computation takes up by far the most time, with material sorting and stream compaction coming in right after it. However, simply removing this section and charting it would be unfair, since it is precisely the intersection computation that stream compaction (and material sorting to some extent) targets. Thus, my comparisons will be as follows: first, we will compare the timing of ray generation, information setup, and rendering for each scene, showing how different materials and scene complexities affect each part; next, we will compare the effect of material sorting and stream compaction on intersection calculations; and lastly we will compare the time it takes to build a BVH using Naive vs SAH and their intersection times.
My method for getting this data was creating a CUDA timer that I start and stop before and after each section, and then checking the output. The timings themselves should be pretty accurate, but there is always variance between each frame and also each run of the scene due to my computer doing other processes in the background. Thus, I will focus more on general trends in the results rather than specific values.
Comparison chart of everything
These outputs are pretty much what we'd expect. Setup takes about the same time for each one, and the higher resolution scene takes much more time to generate the rays, while the others are pretty much the same. However, I do find the rendering timings quite interesting. Even though we have more rays and bounces in the Time scene, it takes about the same time to render the paths, meaning it is actually faster. PBR and Dielectric Showcase is also seemingly significantly faster than the others. But how do these differ from the rest? What could explain this? Honestly, I don't know. I believe that the similarity is probably due to the kernel call being the big bottleneck, and the actual calculations within are relatively the same speed. However, closed scenes should, in theory, be slower than open ones, since we'd still have more rays to color in each bounce. This is supported by the PBR and Dielectric Showcase scene, which is very open with only a few spheres that would shoot rays into the void in 1-2 bounces. However, the two dragon scenes have pretty much the same runtime, which is quite baffling; thus, I think these differences are most probably within the error rate due to the general fluctuations in performance. However, we see that all of these are quite fast, even though there is a lot of branching, especially in the Cornell Showcase scene, due to its high number of differing materials.
When taking a look at the chart right below, we can start to see the general differences between the rendering times for each scene. The slowest by far is Time, which makes sense since it is larger, has three different complex gLTFs, has the most lights by far (10), and is mostly closed. Cornell Showcase comes right in between the two dragon scenes, which shows just how much of an effect a closed scene has, since PBR has many more triangles that are more complex, but render much faster. PBR is also only two materials with a lot of space in between each sphere, so paths can be terminated quite early. In fact, I doubt we ever even reach more than 3-4 bounces for any ray, since that would mean the spheres bounce a ray back and forth between them. The two dragon scenes also have a drastic difference in time, even though the only difference is 4 walls and a light source. From a naive standpoint, this should mean nothing compared to the nearly 7 million triangles, but we are not naive and know that these walls bounce rays between them and the light source creates a more complex scenario, slowing this down by nearly 5x. Thus, in general, due to the BVH optimizations that we'll see a bit later, the true predictor for rendering time is how many times we think rays will bounce and the complexity of the lighting.
Comparison chart of intersection timings
Stream compaction gives us the predictable result: it reduces intersection time at the cost of a fixed overhead, making it useful for larger scenes, but bad for smaller scenes where this overhead is more time than the time it saves. Material sorting, on the other hand, is an interesting one. In theory, this should also show some improvement, especially since we get stream compaction with it for free, but none of the scenes benefit from it. This was a bit odd to me, since there must be some reason why we would want to do this. To investigate, I set up two new scenarios based on my hypothesis that the reason for this slowdown was either that my extra start and end index checking was taking up too much time or that launching a higher number of kernels was. The first scenario is letting thrust do the only sorting, and I launch everything in one kernel afterwards, making the same material paths be grouped together in memory, but not doing any extra sorting. However, we lose my stream compaction here, so we keep re-checking worthless paths, which makes this incredibly slow. Thus, I next tested the same scenario with added stream compaction.
Comparison chart for Material Sorting methods
While stream compaction adds back some performance, I was relieved to find that my extra overhead method is still actually faster, and the real bottleneck was the original thrust::sort_by_key. Some comparisons with thrust algorithms can be seen in my previous project on stream compaction, but I trust that they have very optimized methods, so I believe that material sorting is simply not worth it unless, for some reason, you have many more materials than I have.
Comparison chart for intersection time based on BVH algorithm
I wanted the time axis to be linear here, but had to opt for logarithmic so the SAH columns wouldn't disappear completely. Needless to say, the SAH method is an incredible speedup. 16% in the worst case, 99.2% in the best - aka a 118x speedup. Using this is an absolute no-brainer. The cost? A more complex algorithm to implement (which is well documented online) and a longer build time. How much longer?
Comparison chart for BVH build times
About 10x more build time in the worst case and 4.5x in the best. However, we only need to do this once (unless we're doing an animated scene, for which there would ideally be extra optimizations), so again, using this is a must. Nvidia also showed off their new BVH algorithm at SIGGRAPH this past summer to support "Mega-Geometry" in real time, which is very exciting and something I would love to explore in the future.
On a last note, these charts also reveal an interesting difference in intersection time that isn't fully dependent on the number of BVH nodes or triangles. As I alluded to a few times before, the sparseness of the objects also correlates with intersection time. Dragons Box has about 14x more BVH nodes than Time, but renders over 2x as fast. Some of this is due to the number of lights in Time increasing the complexity of the scene, but it is generally more open than Dragons Box. However, the key difference is the distribution of triangles. While the Stanford dragon itself is quite complex, having 8 of them means that they are way more spread out than the models in Time, which are overlapping quite a bit. This means that we can filter out a lot of the triangles very early on in Dragons Box, while we still need to consider a lot in Time, since while we try to have bounding boxes be discrete, they can still easily have some overlap, since the binning part sorts them based on centroid position, not max edge distance. I believe the depth traversal-based algorithm would have helped even this out, but again, I couldn't implement it in time, so I can't do the thorough comparison. We can see this discrete difference in the BVH visualization of these scenes, and in PBR and Dielectric Showcase, which has a similar situation.
Dragons scene BVH visualization |
PBR and Dielectric Showcase BVH visualization |
Time BVH visualization |
Code:
- UPenn CIS 565 GPU Programming for the base framework.
- tinygLTF for help with gltf loading.
- stb_image for image saving and texture importing help.
- nlohmann for JSON parsing.
- jacco, as mentioned above, for their BVH implementation blog.
Models:
- Machine
- Space Marine
- Hand Monster
- Astronaut
- PBR Spheres, Cesium Man, and Cesium Box are from the Khronos gLTF samples library
- Stanford Dragon
Congrats! You made it to the end of the readme! Sorry for all my rambles! As a reward, here are some funny bugs and bloopers from the strenuous and sleepless two weeks of making this path tracer.
Cesium Man in Space!! (or just in front of a noisy diffuse surface)
I started with going full specular for everything in the Cornell box...
Not even sure what I did wrong here
Cool looking lens from dielectric testing |
Another classic dielectric bug |
Somehow my ball got a lil' cap (ignore the gLTF box test) |
BVH machine REALLY broke - aka why my distance-based algorithm doesn't work and the inspo for BVH visualizing
My first good gLTF import...except the light was above it, not in front




























