Description:

Tooling

C++17, GCC 11.4.0 (Release -O3), OpenMP 2.0

Linux Ubuntu 22.04

model name: AMD Ryzen 5 4500U with Radeon Graphics

CPU(s): 6

Thread(s) per core: 1

Description:

Task

Given a piriform figure with parameter a = 2. It is required to calculate the volume of this figure. In this lab, it is suggested to do this using the Monte Carlo method, using functions that check whether a given coordinate is inside the figure and find the parallelepiped (bounding box) in which the figure completely lies (the implementation of these functions is in hit.cpp). Specifically, we need to generate random points inside the parallelepiped and find the ratio of points inside versus outside the figure. By multiplying this ratio by the volume of the parallelepiped, we will obtain the approximate volume of the figure.

The equation is available on Wolfram. I created its graph in Desmos, which I used to determine the domain.

Random Number Generators

Xor Shift Random

Implements a generator based on XOR operations and shifts (Xorshift).
Has two internal states (seed_a and seed_b), which are updated via the roll_2_7_3 function. I made a version with two seeds instead of the classic one-seed version. Fellow mathematicians say this improves the quality of the randomness.
Possesses average statistical properties.
Relatively fast (here, it's average in speed).

LCG Random

Implements a Linear Congruential Generator (LCG).
Operates using the formula: state = 1664525 * state + 1013904223.
Generates numbers by shifting bits and scaling them to obtain a fractional value in the range [0, 1).
Fast, but has flaws in statistical properties (short period, correlation between consecutive values).

mt19937_random

Uses the Mersenne Twister random number generator (std::mt19937).
Initialized with a starting value (seed).
Generates numbers with a uniform distribution in a given range.
This generator has excellent statistical properties, a massive period (about 2^19937-1), and allows for higher precision.

All three generators are appropriate for this task; empirically, they manage to calculate the volume with precision up to the hundredths.

In the lab, I was asked to leave the fastest variant based on the research results. According to my research, LCG is indeed the fastest. If desired, you can edit the calculate function in main.cpp by changing the generator type when calling the template function for volume calculation. Alas, in real life, one constantly has to find compromises—in this case, between speed and accuracy.

Performance and Optimization Analysis

Final Result: 75.1 ms (Static schedule, 6 threads, LCG)

Through deep data distribution optimization and intentional algorithmic selection, the following performance metrics were achieved relative to the control baselines:

1. Parallel Efficiency: 4.26x Speedup 🚀

Compared to sequential code running on a single core, the optimized multi-threaded implementation demonstrated high scalability. The speedup ratio of over 4x represents excellent efficiency (~71% utilization) for a random number generation-intensive algorithm on a 6-core processor.

Calculation

Baseline time (LCG, no omp): 320.2 ms. Best time (LCG, 6 threads): 75.1 ms. 320.2 / 75.1 = 4.26.

2. Algorithmic Superiority: 12.7x Faster 🚀

Choosing a lightweight LCG generator over the standard library's Mersenne Twister (std::mt19937) reduced execution time by more than tenfold. This highlights the critical importance of selecting tools specific to the task: universal solutions significantly underperform compared to specialized ones in high-performance computing.

Calculation

Standard Library time (Mersenne, no omp): 956.2 ms. Best time (LCG, 6 threads): 75.1 ms. 956.2 / 75.1 = 12.73.

3. Eliminating Overheads: 25.1x Performance Boost 🚀

Comparison with a "naive" parallel implementation (Dynamic schedule, chunk_size=1) revealed a massive disparity. The final version is 25 times faster due to the elimination of False Sharing and the minimization of OpenMP scheduling overhead.

Calculation

Naive parallelism time (LCG, Dynamic, chunk=1): 1887.4 ms. Best time (LCG, 6 threads, Static): 75.1 ms. 1887.4 / 75.1 = 25.13.

The combination of an optimal scheduling mode (Static), cache coherence management, and a computationally lightweight generator allowed the system to reach a new performance tier, maximizing CPU resource utilization.

Optimization

I set the mode to Release -O3.

In the random point generation part of the code, I only changed one coordinate relative to the previous iteration. For a large number of points, this is acceptable.

False sharing

Most high-performance processors insert a cache buffer between slow memory and high-speed processor registers. Accessing a memory location causes a chunk of real memory (a cache line) containing the requested memory location to be copied into the cache. Subsequent references to the same memory location or those near it are likely to be satisfied from the cache until the system decides it must maintain consistency between the cache and memory.

However, simultaneous updates of individual elements in the same cache line coming from different processors cause the entire cache line to be invalidated, even if those updates are logically independent of each other. Every update of an individual element in a cache line marks the line as invalid. Other processors accessing another element in the same line find the line marked invalid. They are forced to fetch a more recent copy of the line from memory or elsewhere, even though the element they are accessing has not been modified. This is because cache consistency is maintained on a cache-line basis, not for individual elements. This results in increased interconnect traffic and overhead. Additionally, while the cache line update is in progress, access to elements in that line is blocked.

This situation is called false sharing. If it occurs frequently, the performance and scalability of an OpenMP application will drop significantly.

False sharing degrades performance if all of the following conditions are met.

Shared data is modified by multiple processors.

Multiple processors update data in the same cache line.

This update occurs very frequently (for example, in a tight loop).

Oracle Documentation "What Is False Sharing?"

In simple terms: all threads want to use a shared cache, which results in it having to be constantly rewritten from memory because what one thread needs differs from what another thread needs. This is very bad because memory access takes a lot of time. In our case, the coordinates of points are updated in a loop within a multi-threaded program, and this is exactly the case where one should pay attention to false sharing (see the last paragraph of the quote).

Careful analysis of those parallel loops that play a major role in application performance can reveal scalability issues caused by false sharing. In general, false sharing can be reduced by

making use of private data as much as possible;

using compiler optimization features to eliminate memory loads and stores.

In certain cases, the impact of false sharing may be less noticeable when solving large-scale problems, as false sharing may be less frequent in such cases.

Methods to combat false sharing largely depend on the specific application. In some cases, changing the data distribution method can reduce false sharing. In other cases, changing the mapping of iterations to threads, providing each thread with more work per chunk (by changing the chunksize value), can also lead to a reduction in false sharing.

Oracle Documentation "Reducing False Sharing"

We will try to devise a reasonable data distribution method and pay attention to the chunk_size parameter during analysis.

Race condition

A race condition occurs when two threads access the same memory without synchronization. This can cause incorrect results in a multi-threaded program.

Multi-threaded Program Memory Organization

To avoid false sharing, I packed the point coordinates into a single structure.
To avoid race conditions, I created local instances of the generator and the starting point. Otherwise, it could happen that different threads work asynchronously with a single instance. That is, I ensured that each thread has its own.

I attempted to simulate cross-platform compatibility and account for processors where the cache line size is not 64. On Linux, this can be checked; for Windows, I set it to 64 (or if the cache line size query on Linux somehow fails). This isn't an error; the speed will just be worse.

What if you want to run this on your specific hardware? Then you can uncomment this code in CMake and comment out the constant:

// constexpr size_t CACHE_LINE_SIZE = 64

set(CACHE_LINE_SIZE 64)

if (UNIX)
    execute_process(
            COMMAND getconf LEVEL1_DCACHE_LINESIZE
            OUTPUT_VARIABLE CACHE_LINE_SIZE
            OUTPUT_STRIP_TRAILING_WHITESPACE
            ERROR_VARIABLE GETCONF_ERROR
    )
endif ()

OpenMP Overview

#pragma omp parallel defines a parallel region, which is a region of the program that should be executed by multiple threads in parallel. This is a fundamental construct that starts parallel execution.

OpenMP Documentation p.8

That is, using this directive, I define the section where parallel execution begins, and using num_threads, I set the required number of threads (as requested via the command line, default, or any other value passed in the function arguments).

#pragma omp parallel num_threads(threads)

#pragma omp for schedule is used to distribute the iterations of a for loop among threads in a parallel region.

dynamic means that the loop iterations are not distributed in advance. Instead, when a thread finished executing its current iteration, it receives the next available iteration.

static means that loop iterations are distributed among threads before execution begins. Each thread receives a pre-defined number of iterations.

these options should be chosen depending on the task. Dynamic distribution helps avoid situations where some threads remain idle while others are overloaded, whereas static distribution avoids the overhead of distributing iterations during runtime. In other words, static is useful if iterations take roughly the same amount of time, while dynamic is better if they don't.

chunk_size defines the number of iterations that will be passed to a thread at one time.

If the optional chunk_size is set, the value must be positive. If chunk_size is not set, a value of 1 is assumed, except in the case of a static schedule. For a static schedule, the default chunk size is set to the loop iteration space divided by the number of threads applied to the loop.

OpenMP Documentation p.48

I use this, for example, here, when calculating a new coordinate and performing the hit test for the piriform figure inside the loop:

#pragma omp for schedule(static)
        for (long long i = 0; i < num_points; ++i) {
            if (hit_test(point.x, point.y, point.z)) {
                ++local_count_inside;
            }
            if (i % 3 == 0) { point.x = local_rt.next_float(axis_range[0], axis_range[1]); }
            if (i % 3 == 1) { point.y = local_rt.next_float(axis_range[2], axis_range[3]); }
            if (i % 3 == 2) { point.z = local_rt.next_float(axis_range[4], axis_range[5]); }
        }

In addition to the previously mentioned fight against race conditions, it is worth using atomic when calculating the sum of hit points.

The atomic directive ensures that a specific memory location is updated atomically, rather than exposing it to the possibility of multiple, simultaneous writing threads.

OpenMP Documentation p.19

To avoid race conditions, all updates of the location in parallel should be protected with the atomic directive, except those that are known to be free of race conditions.

OpenMP Documentation p.20

Selection of the Optimal Variant

All my efforts are located in the research directory.

I created two functions that can be called in main(): the fastest variant according to the research (calculate) and the variant described in the ~~training hyperparameters~~ config.

I ran the program via a bash script a pre-set number of times. I could have run the solver function in a loop within C++, but that's not quite correct because within a single program run, the compiler might inline function calls and thereby reduce execution time (especially since I set the aggressive -O3 mode). That is not the same as what is asked in the assignment. All results are written to txt files, which I then analyze in Python.

I performed 100 measurements for each configuration; all results are left in the research directory. In the table, I recorded the arithmetic mean and plotted graphs based on the table.

Analysis

Important! The resulting graphs make sense specifically for my platform (moreover, maybe it's always raining outside when I run a certain configuration, and this rain affects my hardware without me noticing).

Execution time decreases (approximately linearly) as the number of threads increases—up to the number of cores. What happens next? I don't have hyper-threading, because:

> lscpu
Thread(s) per core:   1

Hyper-threading can slightly affect the result. This technology allows one physical processor core to execute two threads simultaneously. From the operating system's perspective, such a core looks like two logical cores, allowing the processor to handle more threads. But this doesn't mean, of course, that the speed will double exactly, because physically these threads compete for shared resources. However, I don't have hyper-threading, so it all stops at the number of cores.

By chunk_size=0, I mean the absence of the chunk_size parameter.

Xor Shift Random

XOR SHIFT RANDOM	Dynamic	Static	no omp
			742.1
threads=1 chunk_size=0	959.2	647.5
threads=1 chunk_size=1	977.4	738.5
threads=1 chunk_size=2	799.0	693.8
threads=1 chunk_size=4	706.8	668.0
threads=1 chunk_size=8	662.4	669.9
threads=1 chunk_size=16	656.9	672.0
threads=1 chunk_size=1024	616.2	608.1
threads=1 chunk_size=65536	616.3	607.5

XOR SHIFT RANDOM	Dynamic	Static	no omp
			742.1
threads=2 chunk_size=0	3063.8	375.6
threads=2 chunk_size=1	3087.7	425.8
threads=2 chunk_size=2	1872.4	403.2
threads=2 chunk_size=4	1199.1	387.6
threads=2 chunk_size=8	827.1	377.3
threads=2 chunk_size=16	578.1	386.9
threads=2 chunk_size=1024	311.3	306.9
threads=2 chunk_size=65536	314.2	308.7

XOR SHIFT RANDOM	Dynamic	Static	no omp
			742.1
threads=3 chunk_size=0	2305.8	271.8
threads=3 chunk_size=1	2377.2	429.8
threads=3 chunk_size=2	1485.7	342.5
threads=3 chunk_size=4	927.9	307.4
threads=3 chunk_size=8	606.9	292.3
threads=3 chunk_size=16	431.6	298.6
threads=3 chunk_size=1024	213.5	212.3
threads=3 chunk_size=65536	217.8	213.8

XOR SHIFT RANDOM	Dynamic	Static	no omp
			742.1
threads=4 chunk_size=0	2396.8	231.6
threads=4 chunk_size=1	2620.6	429.8
threads=4 chunk_size=2	1432.2	342.5
threads=4 chunk_size=4	808.5	307.4
threads=4 chunk_size=8	486.9	292.3
threads=4 chunk_size=16	350.5	298.6
threads=4 chunk_size=1024	175.4	175.1
threads=4 chunk_size=65536	174.0	173.6

XOR SHIFT RANDOM	Dynamic	Static	no omp
			742.1
threads=5 chunk_size=0	1791.8	194.6
threads=5 chunk_size=1	1942.3	225.4
threads=5 chunk_size=2	1069.5	212.7
threads=5 chunk_size=4	625.1	198.6
threads=5 chunk_size=8	428.3	195.6
threads=5 chunk_size=16	298.4	201.3
threads=5 chunk_size=1024	146.3	151.0
threads=5 chunk_size=65536	148.6	148.6

XOR SHIFT RANDOM	Dynamic	Static	no omp
			742.1
threads=6 chunk_size=0	1717.6	174.4
threads=6 chunk_size=1	1924.5	262.6
threads=6 chunk_size=2	1122.6	202.5
threads=6 chunk_size=4	630.3	190.3
threads=6 chunk_size=8	402.1	174.9
threads=6 chunk_size=16	275.0	184.6
threads=6 chunk_size=1024	132.4	129.8
threads=6 chunk_size=65536	138.3	131.5

XOR SHIFT RANDOM	Dynamic	Static	no omp
—	—	—	742.1
threads=1 chunk_size=0	959.2	647.5
threads=2 chunk_size=0	3063.8	375.6
threads=3 chunk_size=0	2305.8	271.8
threads=4 chunk_size=0	2396.8	231.6
threads=5 chunk_size=0	1791.8	194.6
threads=6 chunk_size=0	1717.6	174.4

Fig. 1. Execution time vs number of threads. Without chunk_size, Xor Shift Random

XOR SHIFT RANDOM	Dynamic	Static
threads=1	616.2	608.1
threads=2	311.3	306.9
threads=3	213.5	212.3
threads=4	174.0	173.6
threads=5	146.3	148.6
threads=6	132.4	129.8

Fig. 2. Execution time vs number of threads. Best chunk_size, Xor Shift Random

XOR SHIFT RANDOM	Dynamic	Static	no omp
—	—	—	742.1
threads=1 chunk_size=0	959.2	647.5
threads=1 chunk_size=1	977.4	738.5
threads=1 chunk_size=2	799.0	693.8
threads=1 chunk_size=4	706.8	668.0
threads=1 chunk_size=8	662.4	669.9
threads=1 chunk_size=16	656.9	672.0
threads=1 chunk_size=1024	616.2	608.1
threads=1 chunk_size=65536	616.3	607.5

Fig. 3. Execution time with OpenMP disabled and with one thread enabled, Xor Shift Random

Fig. 4. Execution time with OpenMP disabled and with one thread enabled, Xor Shift Random

XOR SHIFT RANDOM	Dynamic	Static	no omp
*	*	*	742.1
*	132.4	*	*
*	*	129.8	*

Execution time in the best configuration, Xor Shift Random

LCG

LCG	Dynamic	Static	no omp
			320.2
threads=1 chunk_size=0	720.3	363.3
threads=1 chunk_size=1	721.4	438.8
threads=1 chunk_size=2	529.4	401.9
threads=1 chunk_size=4	434.2	381.6
threads=1 chunk_size=8	399.5	372.4
threads=1 chunk_size=16	376.5	368.4
threads=1 chunk_size=1024	357.1	363.0
threads=1 chunk_size=65536	360.1	363.7

LCG	Dynamic	Static	no omp
			320.2
threads=2 chunk_size=0	2471.2	208.9
threads=2 chunk_size=1	2575.8	208.9
threads=2 chunk_size=2	1221.5	188.9
threads=2 chunk_size=4	797.6	182.0
threads=2 chunk_size=8	567.7	169.8
threads=2 chunk_size=16	431.1	175.4
threads=2 chunk_size=1024	186.0	168.7
threads=2 chunk_size=65536	184.9	166.0

LCG	Dynamic	Static	no omp
—	—	—	320.2
threads=3 chunk_size=0	2224.3	120.3
threads=3 chunk_size=1	2202.0	150.9
threads=3 chunk_size=2	1230.0	132.6
threads=3 chunk_size=4	750.7	125.0
threads=3 chunk_size=8	499.2	122.4
threads=3 chunk_size=16	287.3	119.5
threads=3 chunk_size=1024	130.2	120.9
threads=3 chunk_size=65536	127.8	121.4

LCG	Dynamic	Static	no omp
—	—	—	320.2
threads=4 chunk_size=0	2485.2	101.2
threads=4 chunk_size=1	2538.8	122.0
threads=4 chunk_size=2	1359.5	111.5
threads=4 chunk_size=4	743.9	106.0
threads=4 chunk_size=8	450.9	104.5
threads=4 chunk_size=16	262.3	102.0
threads=4 chunk_size=1024	141.1	101.1
threads=4 chunk_size=65536	138.9	100.1

LCG	Dynamic	Static	no omp
—	—	—	320.2
threads=5 chunk_size=0	1900.0	86.3
threads=5 chunk_size=1	1905.3	104.4
threads=5 chunk_size=2	1002.7	95.6
threads=5 chunk_size=4	677.9	90.9
threads=5 chunk_size=8	393.0	88.7
threads=5 chunk_size=16	241.0	87.7
threads=5 chunk_size=1024	116.7	88.0
threads=5 chunk_size=65536	116.5	86.6

LCG	Dynamic	Static	no omp
			320.2
threads=6 chunk_size=0	1883.1092000000003	75.05852899999995
threads=6 chunk_size=1	1887.4066999999998	100.74184199999998
threads=6 chunk_size=2	1011.23479	89.44294300000001
threads=6 chunk_size=4	612.3865099999997	83.60841299999998
threads=6 chunk_size=8	328.4400100000001	81.92129200000001
threads=6 chunk_size=16	224.26999	79.43605899999997
threads=6 chunk_size=1024	99.36119700000003	80.245992
threads=6 chunk_size=65536	105.26176199999998	81.105628

LCG	Dynamic	Static	no omp
—	—	—	320.2
threads=6 chunk_size=0	1883.1	75.1
threads=6 chunk_size=1	1887.4	100.7
threads=6 chunk_size=2	1011.2	89.4
threads=6 chunk_size=4	612.4	83.6
threads=6 chunk_size=8	328.4	81.9
threads=6 chunk_size=16	224.3	79.4
threads=6 chunk_size=1024	99.4	80.2
threads=6 chunk_size=65536	105.3	81.1

Fig. 5. Execution time vs number of threads. Without chunk_size, LCG

LCG	Dynamic	Static	no omp
—	—	—	320.2
threads=1 chunk_size=0	720.3	363.3
threads=2 chunk_size=0	2471.2	208.9
threads=3 chunk_size=0	2224.3	120.3
threads=4 chunk_size=0	2485.2	101.2
threads=5 chunk_size=0	1900.0	86.3
threads=6 chunk_size=0	1883.1	75.1

Fig. 6. Execution time vs number of threads. Best chunk_size, LCG

LCG	Dynamic	Static	no omp
			320.2
threads=1 chunk_size=0	720.3	363.3
threads=1 chunk_size=1	721.4	438.8
threads=1 chunk_size=2	529.4	401.9
threads=1 chunk_size=4	434.2	381.6
threads=1 chunk_size=8	399.5	372.4
threads=1 chunk_size=16	376.5	368.4
threads=1 chunk_size=1024	357.1	363.0
threads=1 chunk_size=65536	360.1	363.7

Fig. 7. Execution time with OpenMP disabled and with one thread enabled, LCG

Fig. 8. Execution time with OpenMP disabled and with one thread enabled, LCG

LCG	Dynamic	Static	no omp
*	*	*	320.2
*	99.4	*	*
*	*	75.1	*

Execution time in the best configuration, LCG

Mersenne

MERSENNE	Dynamic	Static	no omp
			956.2
threads=1 chunk_size=0	1459.2	979.7
threads=1 chunk_size=1	1457.8	1155.5
threads=1 chunk_size=2	1228.1	1038.6
threads=1 chunk_size=4	1094.0	1002.3
threads=1 chunk_size=8	1038.7	991.4
threads=1 chunk_size=16	1012.1	983.5
threads=1 chunk_size=1024	981.8	977.5
threads=1 chunk_size=65536	980.8	977.5

MERSENNE	Dynamic	Static	no omp
—	—	—	956.2
threads=2 chunk_size=0	2454.3	491.4
threads=2 chunk_size=1	2528.4	579.8
threads=2 chunk_size=2	1112.4	519.2
threads=2 chunk_size=4	896.7	506.5
threads=2 chunk_size=8	878.2	510.7
threads=2 chunk_size=16	667.4	498.9
threads=2 chunk_size=1024	497.7	493.8
threads=2 chunk_size=65536	496.8	491.0

MERSENNE	Dynamic	Static	no omp
			956.2
threads=3 chunk_size=0	2324.0	333.5
threads=3 chunk_size=1	2309.7	389.1
threads=3 chunk_size=2	1360.4	360.8
threads=3 chunk_size=4	893.6	359.2
threads=3 chunk_size=8	594.2	340.9
threads=3 chunk_size=16	443.9	336.7
threads=3 chunk_size=1024	336.0	333.7
threads=3 chunk_size=65536	333.3	331.6

MERSENNE	Dynamic	Static	no omp
—	—	—	956.2
threads=4 chunk_size=0	2427.3	269.2
threads=4 chunk_size=1	2428.9	317.3
threads=4 chunk_size=2	1388.7	279.7
threads=4 chunk_size=4	810.2	277.3
threads=4 chunk_size=8	505.9	275.5
threads=4 chunk_size=16	354.4	271.9
threads=4 chunk_size=1024	266.4	267.7
threads=4 chunk_size=65536	334.2	269.0

MERSENNE	Dynamic	Static	no omp
			956.2
threads=5 chunk_size=0	2072.5	228.4
threads=5 chunk_size=1	2090.7	258.4
threads=5 chunk_size=2	1144.2	238.5
threads=5 chunk_size=4	708.6	235.4
threads=5 chunk_size=8	400.7	233.6
threads=5 chunk_size=16	299.6	228.9
threads=5 chunk_size=1024	225.8	228.2
threads=5 chunk_size=65536	226.1	228.2

MERSENNE	Dynamic	Static	no omp
			956.2
threads=6 chunk_size=0	1770.0	198.3
threads=6 chunk_size=1	1787.9	230.8
threads=6 chunk_size=2	1042.1	214.6
threads=6 chunk_size=4	641.5	217.4
threads=6 chunk_size=8	353.4	202.1
threads=6 chunk_size=16	264.3	201.3
threads=6 chunk_size=1024	204.3	209.7
threads=6 chunk_size=65536	205.0	208.2

MERSENNE	Dynamic	Static	no omp
			956.2
threads=1 chunk_size=0	1459.2	979.7
threads=2 chunk_size=0	2454.3	491.4
threads=3 chunk_size=0	2324.0	333.5
threads=4 chunk_size=0	2427.3	269.2
threads=5 chunk_size=0	2072.5	228.4
threads=6 chunk_size=0	1770.0	198.3

Fig. 9. Execution time vs number of threads. Without chunk_size, Mersenne

MERSENNE	Dynamic	Static	no omp
			956.2
threads=1	980.8	977.5
threads=2	496.8	491.0
threads=3	333.3	331.6
threads=4	266.4	267.7
threads=5	225.8	228.2
threads=6	204.3	198.3

Fig. 10. Execution time vs number of threads. Best chunk_size, Mersenne

MERSENNE	Dynamic	Static	no omp
			956.2
threads=1 chunk_size=0	1459.2	979.7
threads=1 chunk_size=1	1457.8	1155.5
threads=1 chunk_size=2	1228.1	1038.6
threads=1 chunk_size=4	1094.0	1002.3
threads=1 chunk_size=8	1038.7	991.4
threads=1 chunk_size=16	1012.1	983.5
threads=1 chunk_size=1024	981.8	977.5
threads=1 chunk_size=65536	980.8	977.5

Fig. 11. Execution time with OpenMP disabled and with one thread enabled, Mersenne

Fig. 12. Execution time with OpenMP disabled and with one thread enabled, Mersenne

MERSENNE	Dynamic	Static	no omp
			956.2
*	*	198.3
*	204.3	*

As expected, static generally showed better results since it doesn't waste time on distributing computations between threads—each thread is allocated a pre-known amount. Since the operations in the loop are identical, this is the optimal method. But as I already mentioned, this isn't an absolute rule; in other cases, dynamic might be better.

Also, as expected from the Oracle documentation, the graphs show that increasing chunk_size speeds up the program. Reducing false sharing is a good thing:

In other cases, changing the mapping of iterations to threads, providing each thread with more work per chunk (by changing the chunksize value), can also lead to a reduction in false sharing.

Oracle Documentation "Reducing False Sharing"

At the beginning of the graph with low thread counts and small chunk_sizes, one can see that the multi-threaded programs performed poorly: this is due to the overhead of managing tasks within threads. Furthermore, it can be seen that between 1024 and 65536 chunks, the changes are already insignificant and the optimum is usually achieved at the maximum number of threads.

From the table data, it can also be noted that in the case of static, not specifying chunks shows good results compared to specifying a small number of chunks. This is because static defaults to a chunk size equal to the number of loop iterations divided by the number of threads applied to the loop. Meanwhile, for dynamic, there is a clear correlation: more chunks are better, as its default size is 1.

If we compare the generators, they can be ranked by speed:

LCG
Xor Shift Random
Mersenne

This is because LCG contains the minimum amount of computation, Xor Shift Random slightly more, but beyond a 19-iteration loop and bitwise shifts, it has nothing, whereas Mersenne calls library functions, which is slow.

Best average result: 75 ms with static, threads=6, without chunk_size

With the specified number of chunks, the minimum result is 80 ms, which is quite close.

Инструментарий

C++17, GCC 11.4.0 (Release -O3), OpenMP 2.0

Linux Ubuntu 22.04

model name: AMD Ryzen 5 4500U with Radeon Graphics

CPU(s): 6

Thread(s) per core: 1

Описание:

Задача

Дана фигура piriform с параметром a = 2.
Необходимо посчитать объём этой фигуры. В лабораторной работе это предлагают сделать методом Монте-Карло, используя функции, которые проверяют вхождение данной координаты в фигуру и находят параллелепипед, в котором полностью лежит фигура (реализация этих функций лежит в hit.cpp). Конкретно нам нужно сгенерировать случайные точки внутри параллелепипеда и найти соотношение точек внутри и вне фигуры. Домножив это на объём параллелепипеда, мы получим примерный объём фигуры

Уравнение есть в Wolfram. Я сделала её график в Десмосе, с помощью которого я определила область

Генератор случайных чисел

Xor Shift Random

Реализует генератор на основе операции XOR и сдвигов (Xorshift).
Имеет два внутренних состояния (seed_a и seed_b), которые обновляются через функцию roll_2_7_3. Я сделала версию от двух сидов вместо классической от одного. Товарищи математики говорят, что это улучшает качество рандома
Обладает средними статистическими свойствами
Относительно быстрый (здесь средний по скорости)

LCG Random

Реализует линейный конгруэнтный генератор (LCG)
Работает по формуле: state = 1664525 * state + 1013904223
Генерирует числа путем сдвига битов и масштабирования их для получения дробного значения в диапазоне [0, 1)
Быстрый, но с недостатками в статистических свойствах (небольшой период, корреляция между последовательными значениями)

mt19937_random

Использует генератор случайных чисел Mersenne Twister (std::mt19937)
Инициализируется начальным значением (seed)
Генерирует числа с равномерным распределением в заданном диапазоне
Этот генератор обладает хорошими статистическими свойствами, имеет большой период (около 2^19937-1), и за счёт этого можно добиться более высокой точности

Все три генератора уместно использовать в этой задаче, эмпирически они справляются считать объём с точностью до сотых

В лабораторной просили оставить самый быстрый вариант по итогам исследований. По моим исследованиям LCG действительно самый быстрый. При желании вы можете отредактировать функцию calculate в main.cpp, поменяв тип генератора при вызове шаблонной функции вычисления объёма. Увы, в реальной жизни постоянно приходится искать компромиссы. В данном случае между скоростью и точностью

Анализ производительности и оптимизации

Итоговый результат: 75,1 мс (Static schedule, 6 потоков, LCG)

Благодаря глубокой оптимизации распределения данных и осознанному выбору алгоритмов, были достигнуты следующие показатели эффективности относительно контрольных точек (бейзлайнов):

1. Эффективность параллелизма: прирост в 4,26 раза
По сравнению с последовательным кодом на одном ядре, оптимизированная многопоточная реализация показала высокую масштабируемость. Коэффициент ускорения составил более 4, что является отличным показателем для алгоритмов с интенсивной генерацией случайных чисел на 6-ядерном процессоре (КПД ~71%).

Расчет

Базовое время (LCG, no omp): 320,2 мс. Лучшее время (LCG, threads=6): 75,1 мс. 320,2 / 75,1 = 4,26.

2. Алгоритмическое превосходство: прирост в 12,7 раз
Выбор легковесного LCG-генератора вместо стандартной библиотечной реализации Mersenne Twister (std::mt19937) позволил сократить время выполнения более чем в 10 раз. Это подчеркивает важность подбора инструментов под конкретную задачу: универсальные решения значительно проигрывают специализированным.

Расчет

Время стандартной библиотеки (Mersenne, no omp): 956,2 мс. Лучшее время (LCG, threads=6): 75,1 мс. 956,2 / 75,1 = 12,73.

3. Победа над накладными расходами: прирост в 25,1 раз
Сравнение с «наивной» параллельной реализацией (Dynamic schedule, chunk_size=1) показало колоссальную разницу. Финальный вариант работает в 25 раз быстрее за счет устранения эффекта False Sharing и минимизации издержек планировщика OpenMP на управление потоками.

Расчет

Время наивного параллелизма (LCG, Dynamic, chunk=1): 1887,4 мс. Лучшее время (LCG, threads=6, Static): 75,1 мс. 1887,4 / 75,1 = 25,13.

Вывод: Сочетание правильного режима планирования (Static), контроля когерентности кэш-памяти и выбора вычислительно легкого генератора позволило перевести производительность системы на качественно новый уровень, максимально утилизировав ресурсы CPU.

Оптимизация

Я выставила режим release -O3

В части кода с генерацией рандомной точки я меняла только одну координату относительно предыдущей итерации. Для большого количества точек это ок

False sharing

Большинство высокопроизводительных процессоров вставляют кэш-буфер между медленной памятью и высокоскоростными регистрами процессора. Обращение к ячейке памяти приводит к тому, что фрагмент реальной памяти (строка кэша), содержащий запрашиваемую ячейку памяти, копируется в кэш. Последующие ссылки на ту же ячейку памяти или на те, что находятся рядом с ней, вероятно, могут выполняться из кэша до тех пор, пока система не решит, что необходимо поддерживать согласованность между кэшем и памятью.

Однако одновременное обновление отдельных элементов одной строки кэша, поступающее от разных процессоров, приводит к аннулированию целых строк кэша, даже если эти обновления логически независимы друг от друга. Каждое обновление отдельного элемента строки кэша помечает строку как недействительную. Другие процессоры, обращающиеся к другому элементу в той же строке, видят, что строка помечена как недействительная. Они вынуждены извлекать более свежую копию строки из памяти или из другого места, даже если элемент, к которому они обращаются, не был изменен. Это происходит потому, что согласованность кэша поддерживается на основе кэш-линий, а не для отдельных элементов. В результате увеличивается трафик межсоединений и накладные расходы. Кроме того, пока идет обновление кэш-линии, доступ к элементам в этой линии заблокирован.

Такая ситуация называется false sharing. Если это происходит часто, производительность и масштабируемость OpenMP-приложения значительно снижаются.

False sharing снижает производительность, если имеют место все следующие условия.

Общие данные изменяются несколькими процессорами.

Несколько процессоров обновляют данные в одной и той же строке кэша.

Это обновление происходит очень часто (например, в узком цикле).

Документация Oracle "What Is False Sharing?"

Простыми словами: все потоки хотят использовать общий кэш, из-за чего его приходится постоянно переписывать из памяти, поскольку то что нужно для одного потока отличается от того что нужно другому, и это очень плохо, потому что обращение в память тратит много времени. В нашем случае в цикле обновляются координаты точек в многопоточной программе, и это как раз тот случай, когда следует обратить внимание на false sharing (см. последний абзац цитаты)

Тщательный анализ тех параллельных циклов, которые играют основную роль в выполнении приложения, может выявить проблемы масштабируемости производительности, вызванные false sharing. В общем случае false sharing можно уменьшить путём

максимально возможного использования приватных данных;

использования оптимизационных функций компилятора для устранения загрузки и сохранения памяти.

В отдельных случаях влияние false sharing может быть менее заметным при решении задач больших размеров, так как в этом случае false sharing может быть меньше.

Методы борьбы с false sharing во многом зависят от конкретного приложения. В некоторых случаях изменение способа распределения данных может уменьшить false sharing. В других случаях изменение отображения итераций на потоки, предоставление каждому потоку большего объема работы на чанк (путем изменения значения chunksize) также может привести к уменьшению false sharing.

Документация Oracle "Reducing False Sharing"

Постараемся придумать разумный способ распределения данных и при анализе обратим внимание на параметр chunk_size

Race condition

Race condition возникает, когда два потока обращаются к одной и той же памяти без синхронизации. Это может вызвать некорректные результаты многопоточной программы

Организация памяти многопоточной программы

Для избежания false sharing я упаковала координаты точки в одну структуру
Для избежания race condition я сделала локальные экземпляры генератора и первой точки. Иначе может получиться так, что разные потоки будут асинхронно работать с одним экземпляром. То есть я сделала так, чтобы он был свой у каждого потока

Я попыталась изобразить кросс-платформенность и учесть процессоры, где размер кэш-линии не 64. В случае с линуксом это можно проверить, на винде поставила 64 (или если на линуксе обращение к размеру кэш-линии почему-то не сработает). Это не ошибка, просто скорость будет хуже

Вдруг вы захотите запустить на своём специфическом железе? Тогда можно раскомментировать этот код в CMake и закомментировать константу

// constexpr size_t CACHE_LINE_SIZE = 64

set(CACHE_LINE_SIZE 64)

if (UNIX)
    execute_process(
            COMMAND getconf LEVEL1_DCACHE_LINESIZE
            OUTPUT_VARIABLE CACHE_LINE_SIZE
            OUTPUT_STRIP_TRAILING_WHITESPACE
            ERROR_VARIABLE GETCONF_ERROR
    )
endif ()

Обзор OpenMP

#pragma omp parallel определяет параллельную область, которая является областью программы, которая должна выполняться несколькими потоками параллельно. Это фундаментальная конструкция, которая запускает параллельное выполнение.

Документация OpenMP p.8

То есть с помощью этой директивы я определяю участок, откуда начнётся параллельное выполнение, и с помощью num_threads выставляю необходимое число поток (как попросили в командной строке, дефолтное или любое другое, которое будет выставлено в аргументах функции)

#pragma omp parallel num_threads(threads)

#pragma omp for schedule используется для распределения итераций цикла for между потоками в параллельной области

dynamic значит что итерации цикла не распределяются заранее. Вместо этого, когда поток завершает выполнение своей текущей итерации, он получает следующую доступную итерацию

static значит что итерации цикла распределяются между потоками до начала выполнения. Каждый поток получает заранее определенное количество итераций,

Эти варианты следует выбирать в зависимости от задачи. Динамическое распределение помогает избежать ситуации, когда некоторые потоки остаются без работы, в то время как другие перегружены, зато статическое распределение позволяет избежать накладных расходов на распределение итераций во время выполнения. То есть static полезен, если итерации выполняются примерно одинаковое время, а dynamic если наоборот - непохожее

chunk_size определяет количество итераций, которые будут переданы потоку за один раз

If the optional chunk_size is set, the value must be positive. If chunk_size is not set, a value of 1 is assumed, except in the case of a static schedule. For a static schedule, the default chunk size is set to the loop iteration space divided by the number of threads applied to the loop.

Докуметация OpenMP p.48

Я это использую, например, здесь, когда считаю новую координату и делаю проверку попадания точки в piriform внутри цикла

#pragma omp for schedule(static)
        for (long long i = 0; i < num_points; ++i) {
            if (hit_test(point.x, point.y, point.z)) {
                ++local_count_inside;
            }
            if (i % 3 == 0) { point.x = local_rt.next_float(axis_range[0], axis_range[1]); }
            if (i % 3 == 1) { point.y = local_rt.next_float(axis_range[2], axis_range[3]); }
            if (i % 3 == 2) { point.z = local_rt.next_float(axis_range[4], axis_range[5]); }
        }

Кроме указанной ранее борьбы с race conditition стоит выставить atomic при подсчёте суммы попавших точек

The atomic directive ensures that a specific memory location is updated atomically, rather than exposing it to the possibility of multiple, simultaneous writing threads.

Документация OpenMP p.19

To avoid race conditions, all updates of the location in parallel should be protected with the atomic directive, except those that are known to be free of race conditions

Документация OpenMP p.20

Подбор оптимального варианта

Все мои потуги лежат в директории research

Я сделала две функции, которые можно вызвать в main(): самый быстрый вариант по итогам исследования (calculate) и вариант, описанный в ~~гиперпарметрах обучения~~ конфиге

Я запускала программу через bash-скрипт заранее выставленное число раз. Я могла бы запускать функцию решения задачи в цикле на C++, но это не совсем правильно, потому что в рамках одного запуска программы компилятор может проинлайнить вызовы функций и тем самым уменьшить время выполнения (тем более что я выставила агрессивный режим -O3). То есть это не то же самое что просят в задании. Все результаты пишутся в txt-файлы, которые я потом анализирую в питоне

Я делала по 100 измерений на каждой конфигурации, все результаты я оставила в директории research. В таблицу записывала среднее арифметическое и строила графики по таблице

Анализ

Важно! Полученные графики имеют смысл именно для моей платформы (более того, может у меня всегда идёт дождь за окном, когда я запускаю определённую конфигурацию, и этот дождь влияет на моё железо, только я этого не замечаю)

Время выполнения (приблизительно линейно) уменьшается при возрастании числа потоков - до количества ядер. А что дальше? У меня нет гипертрединга, потому что

> lscpu
Thread(s) per core:   1

Гипертрединг может немного влиять на результат. Эта технология позволяет одному физическому ядру процессора выполнять два потока одновременно. С точки зрения операционной системы, такое ядро выглядит как два логических ядра, что позволяет процессору обрабатывать больше потоков. Но это, конечно, не значит, что скорость увеличится ровно в два раза, потому что физически эти потоки конкурируют за общие ресурсы. Впрочем, у меня нет гипертрединга, поэтому на количестве ядер всё и остановится

Под chunk_size=0 я подразумеваю отсутствие параметра chunk_size

Xor Shift Random

XOR SHIFT RANDOM	Dynamic	Static	no omp
			742.1273400000001
threads=1 chunk_size=0	959.1870700000001	647.5335600000001
threads=1 chunk_size=1	977.4376000000001	738.4600399999997
threads=1 chunk_size=2	798.97356	693.8203500000001
threads=1 chunk_size=4	706.7744299999999	667.9802800000001
threads=1 chunk_size=8	662.3845100000002	669.9193300000001
threads=1 chunk_size=16	656.89602	671.99949
threads=1 chunk_size=1024	616.1506799999999	608.1179800000003
threads=1 chunk_size=65536	616.2857799999998	607.48229
threads=2 chunk_size=0	3063.832100000001	375.62092999999993
threads=2 chunk_size=1	3087.6666999999993	425.81378000000007
threads=2 chunk_size=2	1872.4175	403.21542999999986
threads=2 chunk_size=4	1199.14824	387.57298000000014
threads=2 chunk_size=8	827.06555	377.26754999999997
threads=2 chunk_size=16	578.0699399999999	386.9205300000002
threads=2 chunk_size=1024	311.30982	306.94826000000006
threads=2 chunk_size=65536	314.162	308.7052300000001
threads=3 chunk_size=0	2305.7704	271.76511
threads=3 chunk_size=1	2377.2187	429.7632899999998
threads=3 chunk_size=2	1485.7054999999998	342.46197
threads=3 chunk_size=4	927.9267	307.3826700000002
threads=3 chunk_size=8	606.9004800000001	292.34443999999996
threads=3 chunk_size=16	431.55013999999983	298.5801100000001
threads=3 chunk_size=1024	213.46932000000007	212.34223999999986
threads=3 chunk_size=65536	217.80307000000002	213.75945000000004
threads=4 chunk_size=0	2396.8017	231.64429
threads=4 chunk_size=1	2620.621200000001	429.7632899999998
threads=4 chunk_size=2	1432.1845999999998	342.46197
threads=4 chunk_size=4	808.5284399999999	307.3826700000002
threads=4 chunk_size=8	486.8980600000001	292.34443999999996
threads=4 chunk_size=16	350.4992300000002	298.5801100000001
threads=4 chunk_size=1024	175.40966999999998	175.07930000000005
threads=4 chunk_size=65536	174.02857999999992	173.56111999999996
threads=5 chunk_size=0	1791.7743000000005	194.55849
threads=5 chunk_size=1	1942.3265000000008	225.41342000000003
threads=5 chunk_size=2	1069.51909	212.69293
threads=5 chunk_size=4	625.14891	198.58798999999988
threads=5 chunk_size=8	428.2980999999999	195.64558999999997
threads=5 chunk_size=16	298.40604999999994	201.25531000000007
threads=5 chunk_size=1024	146.30154000000002	150.98498000000006
threads=5 chunk_size=65536	148.59708999999998	148.56091000000004
threads=6 chunk_size=0	1717.5671000000016	174.36667999999997
threads=6 chunk_size=1	1924.5363999999993	262.64434999999986
threads=6 chunk_size=2	1122.6137000000003	202.47624
threads=6 chunk_size=4	630.2659500000001	190.31953000000004
threads=6 chunk_size=8	402.14745000000005	174.93008999999998
threads=6 chunk_size=16	275.02544	184.60392
threads=6 chunk_size=1024	132.39290000000003	129.79215
threads=6 chunk_size=65536	138.29397000000006	131.54746

XOR SHIFT RANDOM	Dynamic	Static	no omp
			742.1273400000001
threads=1 chunk_size=0	959.1870700000001	647.5335600000001
threads=2 chunk_size=0	3063.832100000001	375.62092999999993
threads=3 chunk_size=0	2305.7704	271.76511
threads=4 chunk_size=0	2396.8017	231.64429
threads=5 chunk_size=0	1791.7743000000005	194.55849
threads=6 chunk_size=0	1717.5671000000016	174.36667999999997

Рис. 1. Время выполнения от числа потоков. Без chunk_size, Xor Shift Random

XOR SHIFT RANDOM	Dynamic	Static
threads=1	616.1506799999999	608.1179800000003
threads=2	311.30982	306.94826000000006
threads=3	213.46932000000007	212.34223999999986
threads=4	174.02857999999992	173.56111999999996
threads=5	146.30154000000002	148.56091000000004
threads=6	132.39290000000003	129.79215

Рис. 2. Время выполнения от числа потоков. Лучший chunk_size, Xor Shift Random

XOR SHIFT RANDOM	Dynamic	Static	no omp
			742.1273400000001
threads=1 chunk_size=0	959.1870700000001	647.5335600000001
threads=1 chunk_size=1	977.4376000000001	738.4600399999997
threads=1 chunk_size=2	798.97356	693.8203500000001
threads=1 chunk_size=4	706.7744299999999	667.9802800000001
threads=1 chunk_size=8	662.3845100000002	669.9193300000001
threads=1 chunk_size=16	656.89602	671.99949
threads=1 chunk_size=1024	616.1506799999999	608.1179800000003
threads=1 chunk_size=65536	616.2857799999998	607.48229

Рис. 3. Время выполнения с отключённым OpenMP и с включённым одним потоком, Xor Shift Random

Рис. 4. Время выполнения с отключённым OpenMP и с включённым одним потоком, Xor Shift Random

XOR SHIFT RANDOM	Dynamic	Static	no omp
*	*	*	742.1273400000001
*	132.39290000000003	*	*
*	*	129.79215	*

Время выполнения в лучшей конфигурации, Xor Shift Random

LCG

LCG	Dynamic	Static	no omp
			320.21758999999986
threads=1 chunk_size=0	720.32038	363.2665499999998
threads=1 chunk_size=1	721.4270099999999	438.8332000000001
threads=1 chunk_size=2	529.3780300000002	401.88604
threads=1 chunk_size=4	434.1517700000001	381.61602000000005
threads=1 chunk_size=8	399.45808000000005	372.3880299999999
threads=1 chunk_size=16	376.54825000000005	368.36753
threads=1 chunk_size=1024	357.08664	363.04133
threads=1 chunk_size=65536	360.12431999999995	363.7118000000001
threads=2 chunk_size=0	2471.1630000000005	208.9217299999999
threads=2 chunk_size=1	2575.7833999999993	208.93125000000003
threads=2 chunk_size=2	1221.4701699999996	188.90817000000004
threads=2 chunk_size=4	797.6244300000002	182.03506999999993
threads=2 chunk_size=8	567.7238500000002	169.82162
threads=2 chunk_size=16	431.08428999999984	175.37787000000003
threads=2 chunk_size=1024	186.0359300000001	168.70204
threads=2 chunk_size=65536	184.91643999999997	166.0188900000001
threads=3 chunk_size=0	2224.300499999999	120.28897999999995
threads=3 chunk_size=1	2201.985899999999	150.89968999999996
threads=3 chunk_size=2	1230.0265000000002	132.64675
threads=3 chunk_size=4	750.6885999999998	125.04092999999997
threads=3 chunk_size=8	499.1914499999998	122.43509999999993
threads=3 chunk_size=16	287.27097	119.50589000000005
threads=3 chunk_size=1024	130.20472999999998	120.94409
threads=3 chunk_size=65536	127.81734999999995	121.37797999999995
threads=4 chunk_size=0	2485.205099999999	101.20989799999998
threads=4 chunk_size=1	2538.766599999999	122.02588
threads=4 chunk_size=2	1359.5470999999995	111.48121799999998
threads=4 chunk_size=4	743.9374599999999	106.04796499999999
threads=4 chunk_size=8	450.92199	104.52965400000001
threads=4 chunk_size=16	262.336	102.01833800000003
threads=4 chunk_size=1024	141.11465000000007	101.10425599999995
threads=4 chunk_size=65536	138.87098999999995	100.127789
threads=5 chunk_size=0	1900.0151999999998	86.25729800000003
threads=5 chunk_size=1	1905.2843000000005	104.41016199999996
threads=5 chunk_size=2	1002.7346199999998	95.590499
threads=5 chunk_size=4	677.9435499999998	90.91517100000003
threads=5 chunk_size=8	392.9684800000003	88.73768799999999
threads=5 chunk_size=16	240.95204999999996	87.66330700000005
threads=5 chunk_size=1024	116.69940699999998	88.02471699999998
threads=5 chunk_size=65536	116.467575	86.61003499999998
threads=6 chunk_size=0	1883.1092000000003	75.05852899999995
threads=6 chunk_size=1	1887.4066999999998	100.74184199999998
threads=6 chunk_size=2	1011.23479	89.44294300000001
threads=6 chunk_size=4	612.3865099999997	83.60841299999998
threads=6 chunk_size=8	328.4400100000001	81.92129200000001
threads=6 chunk_size=16	224.26999	79.43605899999997
threads=6 chunk_size=1024	99.36119700000003	80.245992
threads=6 chunk_size=65536	105.26176199999998	81.105628

LCG	Dynamic	Static	no omp
			320.21758999999986
threads=1 chunk_size=0	720.32038	363.2665499999998
threads=2 chunk_size=0	2471.1630000000005	208.9217299999999
threads=3 chunk_size=0	2224.300499999999	120.28897999999995
threads=4 chunk_size=0	2485.205099999999	101.20989799999998
threads=5 chunk_size=0	1900.0151999999998	86.25729800000003
threads=6 chunk_size=0	1883.1092000000003	75.05852899999995

Рис. 5. Время выполнения от числа потоков. Без chunk_size, LCG

LCG	Dynamic	Static	no omp
			320.21758999999986
threads=1	357.08664	363.04133
threads=2	184.91643999999997	166.0188900000001
threads=3	127.81734999999995	119.50589000000005
threads=4	138.87098999999995	100.127789
threads=5	116.467575	86.25729800000003
threads=6	99.36119700000003	75.05852899999995

Рис. 6. Время выполнения от числа потоков. Лучший chunk_size, LCG

LCG	Dynamic	Static	no omp
			320.21758999999986
threads=1 chunk_size=0	720.32038	363.2665499999998
threads=1 chunk_size=1	721.4270099999999	438.8332000000001
threads=1 chunk_size=2	529.3780300000002	401.88604
threads=1 chunk_size=4	434.1517700000001	381.61602000000005
threads=1 chunk_size=8	399.45808000000005	372.3880299999999
threads=1 chunk_size=16	376.54825000000005	368.36753
threads=1 chunk_size=1024	357.08664	363.04133
threads=1 chunk_size=65536	360.12431999999995	363.7118000000001

Рис. 7. Время выполнения с отключённым OpenMP и с включённым одним потоком, LCG

Рис. 8. Время выполнения с отключённым OpenMP и с включённым одним потоком, LCG

LCG	Dynamic	Static	no omp
*	*	*	320.21758999999986
*	99.36119700000003	*	*
*	*	75.05852899999995	*

Время выполнения в лучшей конфигурации, LCG

Mersenne

MERSENNE	Dynamic	Static	no omp
			956.23506
threads=1 chunk_size=0	1459.2138000000004	979.6839999999996
threads=1 chunk_size=1	1457.8219000000004	1155.4581000000005
threads=1 chunk_size=2	1228.0756	1038.6444999999994
threads=1 chunk_size=4	1094.0227	1002.3167000000001
threads=1 chunk_size=8	1038.6502	991.3661299999997
threads=1 chunk_size=16	1012.1187000000001	983.51205
threads=1 chunk_size=1024	981.8386800000001	977.4677800000002
threads=1 chunk_size=65536	980.7897299999997	977.4954800000005
threads=2 chunk_size=0	2454.2849	491.3553899999998
threads=2 chunk_size=1	2528.3521000000005	579.8482200000001
threads=2 chunk_size=2	1112.3551	519.2280599999999
threads=2 chunk_size=4	896.71071	506.45614
threads=2 chunk_size=8	878.1847200000001	510.72994000000006
threads=2 chunk_size=16	667.4337100000001	498.85919
threads=2 chunk_size=1024	497.7303399999999	493.83408000000003
threads=2 chunk_size=65536	496.75344999999993	491.04488
threads=3 chunk_size=0	2324.007899999999	333.45305
threads=3 chunk_size=1	2309.6744444444444	389.06447999999983
threads=3 chunk_size=2	1360.4386000000002	360.83767000000006
threads=3 chunk_size=4	893.5689899999999	359.1863900000001
threads=3 chunk_size=8	594.2066099999998	340.91036999999983
threads=3 chunk_size=16	443.92572	336.71448
threads=3 chunk_size=1024	336.041	333.66865000000007
threads=3 chunk_size=65536	333.34274000000005	331.6086
threads=4 chunk_size=0	2427.310400000001	269.15664999999996
threads=4 chunk_size=1	2428.913600000001	317.30904999999996
threads=4 chunk_size=2	1388.6960000000004	279.70716
threads=4 chunk_size=4	810.2486000000001	277.3220599999999
threads=4 chunk_size=8	505.85546	275.49273999999997
threads=4 chunk_size=16	354.39224	271.93818000000016
threads=4 chunk_size=1024	266.44991999999996	267.71892
threads=4 chunk_size=65536	334.1695	268.95080999999993
threads=5 chunk_size=0	2072.469200000001	228.36536
threads=5 chunk_size=1	2090.6867999999995	258.4376400000001
threads=5 chunk_size=2	1144.2234999999996	238.52624
threads=5 chunk_size=4	708.6009300000001	235.39597
threads=5 chunk_size=8	400.67623	233.64247000000006
threads=5 chunk_size=16	299.5613299999999	228.9460399999999
threads=5 chunk_size=1024	225.77935999999997	228.24191999999988
threads=5 chunk_size=65536	226.09871000000007	228.16843
threads=6 chunk_size=0	1770.0350999999998	198.31024000000005
threads=6 chunk_size=1	1787.8928999999991	230.82468
threads=6 chunk_size=2	1042.1054999999997	214.63073000000003
threads=6 chunk_size=4	641.47724	217.35527000000005
threads=6 chunk_size=8	353.3799500000001	202.06297000000006
threads=6 chunk_size=16	264.3303200000001	201.3278
threads=6 chunk_size=1024	204.2944299999999	209.72317999999984
threads=6 chunk_size=65536	205.00369000000012	208.20205000000013

MERSENNE	Dynamic	Static	no omp
			956.23506
threads=1 chunk_size=0	1459.2138000000004	979.6839999999996
threads=2 chunk_size=0	2454.2849	491.3553899999998
threads=3 chunk_size=0	2324.007899999999	333.45305
threads=4 chunk_size=0	2427.310400000001	269.15664999999996
threads=5 chunk_size=0	2072.469200000001	228.36536
threads=6 chunk_size=0	1770.0350999999998	198.31024000000005

Рис. 9. Время выполнения от числа потоков. Без chunk_size, Mersenne

MERSENNE	Dynamic	Static	no omp
			956.23506
threads=1	980.7897299999997	977.4954800000005
threads=2	496.75344999999993	491.04488
threads=3	333.34274000000005	331.6086
threads=4	266.44991999999996	267.71892
threads=5	225.77935999999997	228.16843
threads=6	204.2944299999999	198.31024000000005

Рис. 10. Время выполнения от числа потоков. Лучший chunk_size, Mersenne

MERSENNE	Dynamic	Static	no omp
			956.23506
threads=1 chunk_size=0	1459.2138000000004	979.6839999999996
threads=1 chunk_size=1	1457.8219000000004	1155.4581000000005
threads=1 chunk_size=2	1228.0756	1038.6444999999994
threads=1 chunk_size=4	1094.0227	1002.3167000000001
threads=1 chunk_size=8	1038.6502	991.3661299999997
threads=1 chunk_size=16	1012.1187000000001	983.51205
threads=1 chunk_size=1024	981.8386800000001	977.4677800000002
threads=1 chunk_size=65536	980.7897299999997	977.4954800000005

Рис. 11. Время выполнения с отключённым OpenMP и с включённым одним потоком, Mersenne

Рис. 12. Время выполнения с отключённым OpenMP и с включённым одним потоком, Mersenne

MERSENNE	Dynamic	Static	no omp
			956.23506
*	*	198.31024000000005
*	204.2944299999999	*

Как и ожидалось, static в целом показал лучшие результаты, поскольку он не тратит время на распределение вычислений между потоками - каждому потоку выделено заранее известное количество. Поскольку операции в цикле однотипные, это оптимальный способ. Но как я уже упоминала, это верно не абсолютно всегда, в ином случае может быть лучше dynamic

Также, как и ожидалось из документации Oracle, по графикам можно видеть, что увеличение chunk_size ускоряет программу. Уменьшение false sharing - это хорошо

В других случаях изменение отображения итераций на потоки, предоставление каждому потоку большего объема работы на чанк (путем изменения значения chunksize) также может привести к уменьшению false sharing.

Документация Oracle "Reducing False Sharing"

В начале графика при малых потоках и малых chunk_size можно видеть, что многопоточные программы отработали плохо: это связано с накладными расходами на управление задачами в потоках. Далее можно видеть, что между 1024 и 65536 чанками изменения уже несущественные и оптимум, как правило, достигается при максимальном числе потоков

По табличным данным также можно заметить, что в случае static отсутствие указания чанков показывает хорошие результаты по сравнению с указанием небольшого количества чанков. Это связано с тем что static по умолчанию ставит размер не 1, а число итераций цикла, деленное на количество потоков, примененных к циклу. В то время как для dynamic есть чёткая взаимосвязь: больше чанков - лучше, поскольку у него размер по умолчанию 1

Если сравнивать генераторы, то их можно расставить по скорости:

LCG
Xor Shift Random
Mersenne

Это связано с тем что LCG содержит минимум вычислений, Xor Shift Random чуть побольше, но кроме цикла на 19 итераций и побитовых сдвигов в нём ничего нет, а Mersenne вызывает библиотечные функции, что долго

Лучший усреднённый результат 75 мс при static, threads=6, без chunk_size

При указанном количестве чанков минимальный результат 80 мс, что довольно близко

Name		Name	Last commit message	Last commit date
Latest commit History 18 Commits
pictures		pictures
research		research
test_data		test_data
CMakeLists.txt		CMakeLists.txt
README.md		README.md
hit.cpp		hit.cpp
hit.h		hit.h
lcg.h		lcg.h
main.cpp		main.cpp
mersenne.h		mersenne.h
monte_carlo.h		monte_carlo.h
parser_lib.h		parser_lib.h
xor_shift_random.h		xor_shift_random.h

Folders and files

Latest commit

History

Repository files navigation

Tooling

Description:

Task

Random Number Generators

Performance and Optimization Analysis

Optimization

False sharing

Race condition

Multi-threaded Program Memory Organization

OpenMP Overview

Selection of the Optimal Variant

Analysis

Xor Shift Random

LCG

Mersenne

Best average result: 75 ms with static, threads=6, without chunk_size

Инструментарий

Описание:

Задача

Генератор случайных чисел

Анализ производительности и оптимизации

Оптимизация

False sharing

Race condition

Организация памяти многопоточной программы

Обзор OpenMP

Подбор оптимального варианта

Анализ

Xor Shift Random

LCG

Mersenne

Лучший усреднённый результат 75 мс при static, threads=6, без chunk_size

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages