Working, fast, parallel version of HVM3-Strict. ~735MIPS w/10 threads. #18
mikemcqueen wants to merge 1 commit into
./bench.sh 100
Running cabal run . -- run examples/bench_parallel_sum_range.hvms -s for 100 iterations...
--------
Min MIPS: 703
Max MIPS: 786
Avg MIPS: 738.49
Main changes:
* Memory layout - Adopted HVM2's memory layout for RBAG and RNOD, by first
dividing all available heap space into separate RNOD and RBAG spaces, then
sub-dividing each of those into thread-"owned" chunks of the same size.
* Reads and writes to thread-owned RBAG space are non-atomic.
* Writes to thread-owned RNOD space via expand_ref() are non-atomic; all
other reads and writes to RNOD space are atomic.
* Work stealing - A small buffer at the beginning of each thread-owned RBAG
space is called the "booty bag" or BBAG. This is filled with newly pushed
redexes before any redexes are pushed to the "normal" RBAG.
* Threads (including the owning thread, if necessary) can steal the entire
bag, and when that happens, it is prioritized as a redex pop source until
it becomes empty.
* Deferred redexes - Two small buffers at the end of each thread-owned RBAG
space are known as deferred bags, or DFER. They contain only APPLAM redexes
that result from combining the APP and the root LAM term of an expanded APPREF
redex.
* There are two bags because they alternate being pushed-into and popped-from.
* When one of these bags reaches a certain threshold in size, memory written
to them is synchronized (on ARM), and they are prioritized as a redex pop
source.
* This was done as an optimization to the race condition fix. It resulted in
better performance than synchronizing memory after every APPREF.
* Race condition fix - pushing the APPLAM redex that results from an APPREF
interaction "publishes" an address internal to that LAM node's "address space"
via the Loc field of the LAM term. There was no guarantee that the prior writes
to locations within that space would be visible to other threads to read, or
that once one location in that space became visible, all other locations
in that space would also be visible.
* This was solved (hopefully) with a `dmb ishst` (Inner SHareable domain STore
barrier) inline assembly instruction on ARM, which synchronizes all prior
writes before one of those APPLAM redexes is pushed.
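The memory layout described in the list above could be sketched roughly as follows. This is a minimal sketch, not the PR's code: the type `ThreadMem`, the function `thread_mem`, the sizes `BBAG_LEN` and `DFER_LEN`, and the even RNOD/RBAG split are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t Term; // 64-bit term, as in HVM3-Strict

// Hypothetical region sizes, measured in terms.
enum { BBAG_LEN = 64, DFER_LEN = 64 };

// Hypothetical view of the memory one thread "owns".
typedef struct {
  Term  *rnod;     // this thread's node chunk
  size_t rnod_len;
  Term  *bbag;     // "booty bag" at the start of the thread's RBAG chunk
  Term  *rbag;     // "normal" redex bag, after the booty bag
  size_t rbag_len;
  Term  *dfer[2];  // two deferred bags at the end of the RBAG chunk
} ThreadMem;

// Split the heap into an RNOD space and an RBAG space, then carve out
// the equally sized chunk owned by thread `tid`.
static ThreadMem thread_mem(Term *heap, size_t heap_len,
                            size_t threads, size_t tid) {
  assert(threads > 0 && tid < threads);
  size_t rnod_total = heap_len / 2;          // assumption: even split
  size_t rbag_total = heap_len - rnod_total;
  size_t rnod_chunk = rnod_total / threads;
  size_t rbag_chunk = rbag_total / threads;
  assert(rbag_chunk > BBAG_LEN + 2 * DFER_LEN);

  Term *rbag_base = heap + rnod_total + tid * rbag_chunk;
  ThreadMem m;
  m.rnod     = heap + tid * rnod_chunk;
  m.rnod_len = rnod_chunk;
  m.bbag     = rbag_base;
  m.rbag     = rbag_base + BBAG_LEN;
  m.rbag_len = rbag_chunk - BBAG_LEN - 2 * DFER_LEN;
  m.dfer[0]  = rbag_base + rbag_chunk - 2 * DFER_LEN;
  m.dfer[1]  = rbag_base + rbag_chunk - DFER_LEN;
  return m;
}
```

Because every region is derived from `tid` alone, a thread can locate another thread's booty bag for stealing without any shared bookkeeping.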
Other changes:
* Bind threads to specific cores on ARM. Threads 0-3 to Pcores, 4-N to Ecores.
* Number of Pcores is defined by enum value PCOR_TOT, and currently must be
manually changed for architectures with more than 4 Pcores.
* Improved work-stealing victim identification on ARM. It uses a Pcore-biased
round-robin approach.
* Added a rudimentary bench.sh script to root of tree to make benchmarking
multiple iterations easier.
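The PR does not spell out how the Pcore-biased round-robin works, so the following is only one possible reading, written as a sketch. `PCOR_TOT` is the enum named above; the `Thief` struct, `pick_victim`, `next_skipping`, and the 2:1 bias toward Pcore threads are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum { PCOR_TOT = 4 }; // number of performance cores (threads 0..3)

// Hypothetical per-thread stealing state.
typedef struct {
  uint32_t self;    // id of the stealing thread
  uint32_t threads; // total number of threads
  uint32_t p_next;  // round-robin cursor over Pcore threads
  uint32_t a_next;  // round-robin cursor over all threads
  uint32_t tick;    // number of steal attempts so far
} Thief;

// Return the thread at the cursor, skipping `self`, and advance the cursor.
// The caller guarantees that [0, n) contains a thread other than `self`.
static uint32_t next_skipping(uint32_t *cursor, uint32_t n, uint32_t self) {
  uint32_t v = *cursor % n;
  if (v == self) v = (v + 1) % n;
  *cursor = (v + 1) % n;
  return v;
}

// Pick a victim: two of every three attempts go to a Pcore thread
// (assumed bias), the third goes round-robin over all threads.
static uint32_t pick_victim(Thief *t) {
  assert(t->threads >= 2);
  uint32_t pcount = t->threads < (uint32_t)PCOR_TOT ? t->threads
                                                    : (uint32_t)PCOR_TOT;
  int p_usable = pcount >= 2 || t->self >= pcount;
  uint32_t tick = t->tick++;
  if (p_usable && tick % 3 != 2)
    return next_skipping(&t->p_next, pcount, t->self);
  return next_skipping(&t->a_next, t->threads, t->self);
}
```

With 10 threads, a thief on thread 5 would try threads 0, 1, 0, 2, 3, 1, ... under this scheme, visiting Pcore threads more often than Ecore threads.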
Notes:
* Technically, it is undefined behavior to access a memory location using both
atomic and non-atomic loads and/or stores. However, a) it works, and b) it is
my understanding that there is significant legacy code that depends on it
continuing to work, so it's my belief that it is "safe" to do.
* Only the following 7 interaction types have actually been tested, because
these are the only interactions used by the parallel sum benchmark:
APPLAM, APPREF, DUPNUM, MATNUM, MATREF, OPXNUM, OPYNUM
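The mixed access pattern from the first note can be illustrated with a small sketch. It assumes the GCC/Clang `__atomic` builtins, which operate on ordinary (non-`_Atomic`) objects; whether the PR uses these builtins or C11 `_Atomic` casts is not stated, and the helper names `node_init`, `node_load`, and `node_swap` are hypothetical. Using the builtins does not by itself make concurrent plain and atomic accesses to the same location race-free.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t Term;

// Owner-only write with a plain store, as when a thread fills in
// freshly expanded nodes inside its own RNOD chunk.
static inline void node_init(Term *slot, Term value) {
  *slot = value;
}

// Shared read of the same storage using an atomic load.
static inline Term node_load(Term *slot) {
  return __atomic_load_n(slot, __ATOMIC_RELAXED);
}

// Shared read-modify-write of the same storage using an atomic exchange.
static inline Term node_swap(Term *slot, Term value) {
  return __atomic_exchange_n(slot, value, __ATOMIC_RELAXED);
}
```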
@Lorenzobattistela @developedby
Victor: so, last year I wrote and published HVM2, which was a strict Interaction Combinators evaluator, and compiled it to CUDA, focusing on the RTX 4090 for a demo. we also had Bend as a friendly user-facing language for it. IIRC, HVM2 achieved ~30 billion interactions per second on RTX 4090, and 2 billion or so (?) on Apple M3 CPU, with all cores

Me: Also important: HVM2 used 32-bit terms with 29-bit values. A pair was 64 bits.

Victor: after that, I realized there is a simpler atomic linker that I could use. so, the next goal would be to rewrite it again using the new atomic linker, and bring it to the HVM3 architecture (Haskell parser, C/CUDA runtime with FFI). so, in the middle of this, I believe you stepped in and started working on HVM3-Strict

Me: Back in March, @lorenzo at HOC published a separate HVM3-Strict repo on GitHub. This was a 64-bit term implementation that included the atomic linker and Haskell parser. No CUDA. Single-threaded only. It used a sliding-queue RBAG implementation that differed significantly from HVM2's implementation. It was really hard to benchmark this because there was only 1 example (parallel_sum_range) and it choked after @height=7 due to the default RBAG "starting position" configuration: node ref-expansion was overwriting the beginning of the RBAG queue. I ran that early commit just now on an M4 and got about 7 MIPS, but it only took 0.7 secs, so not a great dataset to draw conclusions from.
Victor: so if you can report everything that I missed, including motivation, what has been done, results, what we have and what must be done in your fork

Me: One of our members here, @wyattgill9 (@ zen1 on discord), started making some optimizations and PRs that were getting accepted. In a conversation on discord I saw the choke @height=8 discussion, and decided to take a look. I root-caused that problem, and that helped achieve higher @height values, longer runtimes, and more reliable benchmark results. Wyatt/zen1 continued to make changes to the single-threaded code that got it above 100MIPS.

Wyatt/zen1 was also making some attempts at parallel support, but it didn't seem to be working, so I decided to take a look. The main thing for the early attempts at parallel HVM3-Strict was 128-bit atomics. It was pretty much a requirement. That's mostly what my first parallel version did. Scaling sucked. Couldn't beat 2x single-threaded perf regardless of thread/core count.

@nicolas. kept yelling at us to go back to HVM2's RNOD/RBAG memory layout instead of the sliding-queue implementation in HVM3-Strict. Finally I listened to him and did it. Very nice perf increase: ~550MIPS or so with 10 threads on an M4. Note that 550 MIPS and 2B (your claim above) are very different. I believe this is largely due to 64-bit terms vs. 32-bit terms: double the memory moving around now.

Implemented a new work-stealing algo and got it up to ~850MIPS. This also got rid of 128-bit atomics. RBAGs are private per-thread, no atomics required.

I discovered there was a race condition, however. It occurs on ARM only, in one very specific condition (and possibly others that don't get exposed in this particular benchmark), due to relaxed memory ordering everywhere. This race condition exists in every commit and has nothing to do with any changes I made. Frankly, I suspect this same race condition occurs in HVM2 as well.

Spent a couple weeks tracking this down and came up with a solution. I have never seen the race condition occur since the fix, and that's been several hundred thousand executions. That fix required some occasional memory barriers/synchronizations, and that dropped perf back down to ~700MIPS. A few more tweaks and I have it back up to ~735 or so now. I have some ideas related to missed branch prediction that I think could possibly get it back up to 850MIPS. And I suspect there are many ideas I haven't had yet that could help push it even further.
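A minimal sketch of the fix as this PR describes it: APPLAM redexes produced by APPREF go into one of two alternating deferred bags, and a single store barrier is issued when a bag reaches a threshold, before that bag becomes a pop source. Only the `dmb ishst` instruction comes from the PR. The names `Dfer`, `DFER_CAP`, `DFER_THRESHOLD`, `dfer_push`, `dfer_pop`, and `store_barrier`, the sizes, and the non-ARM fallback fence are assumptions.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t Term;
typedef struct { Term neg, pos; } Pair; // hypothetical redex: a pair of terms

enum { DFER_CAP = 64, DFER_THRESHOLD = 32 }; // hypothetical sizes

typedef struct {
  Pair   bag[2][DFER_CAP]; // two alternating deferred bags
  size_t len[2];
  int    push_ix;          // bag currently being pushed into
} Dfer;

// Order all prior stores before any later store becomes visible.
static inline void store_barrier(void) {
#if defined(__aarch64__)
  __asm__ volatile("dmb ishst" ::: "memory");
#else
  atomic_thread_fence(memory_order_release); // assumed fallback elsewhere
#endif
}

// Defer an APPLAM redex. Returns 1 when the bag was just synchronized and
// handed over as a pop source, 0 otherwise.
static int dfer_push(Dfer *d, Pair redex) {
  int ix = d->push_ix;
  assert(d->len[ix] < DFER_CAP);
  d->bag[ix][d->len[ix]++] = redex;
  if (d->len[ix] >= DFER_THRESHOLD && d->len[ix ^ 1] == 0) {
    store_barrier();     // one barrier covers every redex in the bag
    d->push_ix = ix ^ 1; // swap roles: the filled bag is now popped from
    return 1;
  }
  return 0;
}

// Pop from the bag that is not being pushed into. Returns 0 if it is empty.
static int dfer_pop(Dfer *d, Pair *out) {
  int ix = d->push_ix ^ 1;
  if (d->len[ix] == 0) return 0;
  *out = d->bag[ix][--d->len[ix]];
  return 1;
}
```

Batching this way pays for one barrier per `DFER_THRESHOLD` redexes instead of one per APPREF, which matches the motivation given in the PR description.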
So, my original motivation was to help the guy (Wyatt/zen1) doing optimizations, by getting the parallel-sum benchmark running longer. Then once that was done, my motivation was to get parallel working. Then there was a period where my motivation was to fix the race condition. I'm currently pursuing possible node-reuse/recycling implementations (hvmvis was a child of this motivation).

The PR I submitted for HVM3-Strict is a huge (set of) change(s). It can be broken up into separate PRs if that will help make it easier to digest and/or provide history.

128-bit Atomics and HVM2-style RBAG/RNOD memory organization. This will have a race condition. ~550MIPS.
On the subject of what else needs to be done for HVM3-Strict: CUDA support, obviously. I haven't really looked at HVM2's implementation, but HVM3-Strict follows HVM2's memory structure very closely, so my intuition is that a port should not be too difficult. Also, more examples that use all of the available interaction types. The one existing example only uses 7 interaction types. I suspect there may be other interactions with a race condition like the one in the APPLAM-following-APPREF case, whose redexes currently get deferred.