Working, fast, parallel version of HVM3-Strict. ~735MIPS w/10 threads. #18

Open
mikemcqueen wants to merge 1 commit into HigherOrderCO-archive:main from mikemcqueen:parallel-rebase

Conversation

@mikemcqueen

./bench.sh 100
Running cabal run . -- run examples/bench_parallel_sum_range.hvms -s for 100 iterations...
--------
Min MIPS: 703
Max MIPS: 786
Avg MIPS: 738.49

Main changes:

* Memory layout - Adopted HVM2's memory layout for RBAG and RNOD, by first
  dividing all available heap space into separate RNOD and RBAG spaces, then
  sub-dividing each of those into thread-"owned" chunks of the same size.
  * Reads and writes to thread-owned RBAG space are non-atomic.
  * Writes to thread-owned RNOD space via expand_ref() are non-atomic; all
    other reads and writes to RNOD space are atomic.

* Work stealing - A small buffer at the beginning of each thread-owned RBAG
  space is called the "booty bag" or BBAG. This is filled with newly pushed
  redexes before any redexes are pushed to the "normal" RBAG.
  * Threads (including the owning thread, if necessary) can steal the entire
    bag, and when that happens, it is prioritized as a redex pop source until
    it becomes empty.

* Deferred redexes - Two small buffers at the end of each thread-owned RBAG
  space are known as deferred bags, or DFER. They contain only APPLAM redexes
  that result from combining the APP and the root LAM term of an expanded APPREF
  redex.
  * There are two bags because they alternate being pushed-into and popped-from.
  * When one of these bags reaches a certain threshold in size, memory written
    to them is synchronized (on ARM), and they are prioritized as a redex pop
    source.
  * This was done as an optimization to the race condition fix. It resulted in
    better performance than synchronizing memory after every APPREF.

* Race condition fix - pushing the APPLAM redex that results from an APPREF
  interaction "publishes" an address internal to that LAM node's "address space"
  via the Loc field of the LAM term. There was no guarantee that the prior writes
  to locations within that space would be visible to other threads to read, or
  that once one location in that space became visible, all other locations
  in that space would also be visible.
  * This was solved (hopefully) with a `dmb ishst` Inner SHareable domain STore
    barrier inline assembly instruction on ARM to synchronize all prior writes,
    before pushing one of those APPLAM redexes.
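
For illustration, the deferred-bag scheme and the store barrier described above could be sketched roughly as follows. This is a minimal single-thread sketch, not the PR's code: the `Dfer` struct, `DFER_CAP`, `DFER_FLUSH`, and the function names are hypothetical, and the non-ARM fallback fence is an assumption. Only the `dmb ishst` instruction and the two-bag alternation come from the description.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t Term;                     // HVM3-Strict terms are 64-bit
typedef struct { Term neg, pos; } Redex;   // hypothetical redex representation

enum { DFER_CAP = 64, DFER_FLUSH = 32 };   // hypothetical capacity and threshold

typedef struct {
  Redex  buf[2][DFER_CAP];  // two bags: one receives pushes, the other is drained
  size_t len[2];            // number of redexes currently in each bag
  int    push_ix;           // index of the bag currently receiving pushes
} Dfer;

// Order all prior stores before any store that follows (store-store barrier).
static inline void store_barrier(void) {
#if defined(__aarch64__)
  __asm__ volatile("dmb ishst" ::: "memory");
#else
  atomic_thread_fence(memory_order_release);  // assumed fallback elsewhere
#endif
}

// Defer an APPLAM redex. Returns true when the threshold was reached and the
// bags swapped roles, i.e. one barrier was paid for the whole batch.
static bool dfer_push(Dfer *d, Redex r) {
  int p = d->push_ix;
  assert(d->len[p] < DFER_CAP);
  d->buf[p][d->len[p]++] = r;
  // Keep accumulating until the threshold is hit and the other bag is drained.
  if (d->len[p] < DFER_FLUSH || d->len[1 - p] != 0) return false;
  store_barrier();       // synchronize node writes for every deferred redex
  d->push_ix = 1 - p;    // the full bag becomes the prioritized pop source
  return true;
}

// Pop from the bag that is currently being drained, if it has anything.
static bool dfer_pop(Dfer *d, Redex *out) {
  int q = 1 - d->push_ix;
  if (d->len[q] == 0) return false;
  *out = d->buf[q][--d->len[q]];
  return true;
}
```

The point of the batching is that a redex is only handed out by `dfer_pop` after a barrier has been issued for its whole batch, so the barrier cost is amortized over `DFER_FLUSH` APPREF expansions instead of being paid on each one.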

Other changes:

* Bind threads to specific cores on ARM. Threads 0-3 to Pcores, 4-N to Ecores.
  * Number of Pcores is defined by enum value PCOR_TOT, and currently must be
    manually changed for architectures with more than 4 Pcores.

* Improved work-stealing victim identification on ARM. It uses a Pcore-biased
  round-robin approach.

* Added a rudimentary bench.sh script to root of tree to make benchmarking
  over multiple iterations easier.
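
The description does not spell out what "Pcore-biased round-robin" means in detail. One possible reading, sketched below purely as an assumption, is to alternate between a round-robin cursor over the Pcore threads and a round-robin cursor over all threads, so Pcore threads are visited more often. `PCOR_TOT` is the enum named above; `Picker`, `next_skipping_self`, and `pick_victim` are hypothetical names.

```c
#include <assert.h>

enum { PCOR_TOT = 4 };  // Pcore count, as named in this PR

typedef struct {
  unsigned self;      // id of the stealing thread
  unsigned threads;   // total worker threads (must be >= 2)
  unsigned p_next;    // round-robin cursor over Pcore threads
  unsigned all_next;  // round-robin cursor over all threads
  unsigned tick;      // attempt counter used to alternate the two cursors
} Picker;

// Return the thread at *cursor (mod n), skipping `self`, and advance the cursor.
static unsigned next_skipping_self(unsigned *cursor, unsigned n, unsigned self) {
  unsigned v = *cursor % n;
  if (v == self) v = (v + 1) % n;
  *cursor = (v + 1) % n;
  return v;
}

// Even attempts probe only Pcore threads; odd attempts probe every thread.
static unsigned pick_victim(Picker *p) {
  assert(p->threads >= 2 && p->self < p->threads);
  unsigned pcores = p->threads < PCOR_TOT ? p->threads : PCOR_TOT;
  if (p->tick++ % 2 == 0)
    return next_skipping_self(&p->p_next, pcores, p->self);
  return next_skipping_self(&p->all_next, p->threads, p->self);
}
```

With 10 threads, thread 1 would probe victims in the order 0, 0, 2, 2, 3, 3, 0, 4, and so on, never selecting itself.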

Notes:

* Technically, it is undefined behavior to access a memory location using both
  atomic and non-atomic loads and/or stores. However, a) it works, and b) it is
  my understanding that there is significant legacy code that depends on it
  continuing to work, so it's my belief that it is "safe" to do.

* Only the following 7 interaction types have actually been tested, because
  these are the only interactions used by the parallel sum benchmark:

  APPLAM, APPREF, DUPNUM, MATNUM, MATREF, OPXNUM, OPYNUM
@wyattgill9

Contributor

@Lorenzobattistela @developedby

@mikemcqueen

Author

Victor:

so, last year I wrote and published HVM2, which was a strict Interaction Combinators evaluator, and compiled it to CUDA, focusing on the RTX 4090 for a demo. we also had Bend as a friendly user-facing language for it

IIRC, HVM2 achieved ~30 billion interactions per second on RTX 4090, and 2 billion or so (?) on Apple M3 CPU, with all cores

Me:

Also important:

HVM2 used 32-bit terms with 29-bit values. A pair was 64-bits.
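
As a rough picture of that layout, a 32-bit term with a 29-bit value leaves 3 bits for a tag, and two such terms pack into one 64-bit pair. The sketch below assumes the tag sits in the low 3 bits; that bit placement and all helper names are illustrative assumptions, not HVM2's actual API.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t Term32;  // one HVM2-style term: 3-bit tag + 29-bit value
typedef uint64_t Pair;    // two 32-bit terms side by side

enum { TAG_BITS = 3, VAL_BITS = 29 };
#define TAG_MASK ((1u << TAG_BITS) - 1u)
#define VAL_MASK ((1u << VAL_BITS) - 1u)

// Pack a tag and a value into one 32-bit term (assumed: tag in the low bits).
static inline Term32 term_new(uint32_t tag, uint32_t val) {
  return ((val & VAL_MASK) << TAG_BITS) | (tag & TAG_MASK);
}
static inline uint32_t term_tag(Term32 t) { return t & TAG_MASK; }
static inline uint32_t term_val(Term32 t) { return t >> TAG_BITS; }

// Pack two terms into a 64-bit pair: first term low, second term high.
static inline Pair pair_new(Term32 fst, Term32 snd) {
  return ((Pair)snd << 32) | fst;
}
static inline Term32 pair_fst(Pair p) { return (Term32)(p & 0xFFFFFFFFu); }
static inline Term32 pair_snd(Pair p) { return (Term32)(p >> 32); }
```

A pair here is 8 bytes, whereas a pair of 64-bit terms as used in HVM3-Strict is 16 bytes.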

Victor:

after that, I realized there is a simpler atomic linker that I could use so, the next goal would be to rewrite it again using the new atomic linker, and bring it to the HVM3 architecture (Haskell parser, C/CUDA runtime with FFI)

so, in the middle of this, I believe you stepped in and started working on HVM3-Strict

Me:

Back in March, @lorenzo at HOC published a separate HVM3-Strict repo on GitHub.

4450da3

This was a 64-bit term implementation that included the atomic linker and Haskell parser. No CUDA. Single threaded only. It used a sliding-queue RBAG implementation that differed significantly from HVM2's implementation.

It was really hard to benchmark this because there was only 1 example (parallel_sum_range) and it choked after @height=7 due to the default RBAG "starting position" configuration: node ref-expansion was overwriting the beginning of the RBAG queue. I ran that early commit just now on an M4 and got about 7 MIPS, but it only took 0.7 secs, so that's not a great dataset to draw conclusions from.

@mikemcqueen

mikemcqueen commented Jun 17, 2025

Author

Victor:

so if you can report everything that I missed, including motivation, what has been done, results, what we have and what must be done in your fork
that would be very helpful so I can decide how to proceed

Me:

One of our members here, @wyattgill9 (@ zen1 on discord), started making some optimizations and PRs that were getting accepted. On discord I saw the discussion about the choke at @height=8 and decided to take a look. I root-caused that problem, and that helped achieve higher @height values, longer runtimes, and more reliable benchmark results. Wyatt/zen1 continued to make changes to the single-threaded code that got it above 100MIPS.

Wyatt/zen1 was also making some attempts at parallel support, but it didn't seem to be working, so I decided to take a look.

The main thing for the early attempts at parallel HVM3-Strict was 128-bit atomics. It was pretty much a requirement. That's mostly what my first parallel version did. Scaling sucked. Couldn't beat 2x single threaded perf regardless of thread/core count.

@nicolas. kept yelling at us to go back to HVM2's RNOD/RBAG memory layout instead of the sliding-queue implementation in HVM3-Strict. Finally I listened to him and did it. Very nice perf increase: ~550MIPS or so with 10 threads on an M4. Note that 550M and 2B (your claim above) are very different. I believe this is largely due to 64-bit terms vs. 32-bit terms: double the memory moving around now.
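
To make the layout concrete, the split described in the PR summary (heap divided into RNOD and RBAG spaces, each sub-divided into equal thread-owned chunks) can be sketched as simple index arithmetic. This is a hedged sketch: the `Layout` and `Span` types, the function names, and the choice to place RNOD before RBAG in the heap are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { size_t start, end; } Span;  // half-open range [start, end)

typedef struct {
  size_t heap_len;  // total heap size, in 64-bit words
  size_t rnod_len;  // words reserved for RNOD; the remainder is RBAG
  size_t threads;   // number of worker threads
} Layout;

// Chunk of RNOD space owned by thread `tid` (assumed: RNOD occupies the
// start of the heap).
static Span rnod_chunk(const Layout *l, size_t tid) {
  assert(tid < l->threads);
  size_t per = l->rnod_len / l->threads;
  return (Span){ tid * per, (tid + 1) * per };
}

// Chunk of RBAG space owned by thread `tid` (assumed: RBAG follows RNOD).
static Span rbag_chunk(const Layout *l, size_t tid) {
  assert(tid < l->threads);
  size_t per = (l->heap_len - l->rnod_len) / l->threads;
  size_t base = l->rnod_len + tid * per;
  return (Span){ base, base + per };
}
```

Because every thread can compute its own chunk bounds from its id alone, no shared allocator state is needed to find where a thread's nodes and redexes live.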

Implemented a new work-stealing algo and got it up to ~850MIPS. This also got rid of 128-bit atomics: RBAGs are private per-thread, so no atomics are required.
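
A rough sketch of how a "booty bag" in front of a private RBAG might behave is shown below. The PR does not describe how a steal is synchronized, so the three-state handshake on `state`, the sizes, the rule that only a full bag can be stolen, and every name here are assumptions made for illustration. What it tries to mirror from the description is that pushes fill the BBAG first, a whole bag is stolen at once, a stolen bag is drained before anything else, and the private RBAG itself is only touched non-atomically by its owner.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct { uint64_t neg, pos; } Redex;
enum { BBAG_CAP = 8, RBAG_CAP = 1024 };    // hypothetical sizes
enum { BB_FILLING, BB_FULL, BB_TAKEN };    // hypothetical handshake states

typedef struct {
  _Atomic int state;      // who may touch bbag right now
  Redex  bbag[BBAG_CAP];  // small stealable buffer, filled first
  size_t bbag_len;
  Redex  rbag[RBAG_CAP];  // private bag, touched only by its owner
  size_t rbag_len;
  Redex  loot[BBAG_CAP];  // a whole bag this thread has stolen
  size_t loot_len;
} Worker;

// Owner push: fill the booty bag first, then fall back to the private RBAG.
static void push_redex(Worker *w, Redex r) {
  if (atomic_load_explicit(&w->state, memory_order_acquire) == BB_FILLING) {
    w->bbag[w->bbag_len++] = r;
    if (w->bbag_len == BBAG_CAP)  // publish the full bag to thieves
      atomic_store_explicit(&w->state, BB_FULL, memory_order_release);
    return;
  }
  assert(w->rbag_len < RBAG_CAP);
  w->rbag[w->rbag_len++] = r;
}

// Steal the victim's entire full bag. The victim may be the thief itself.
static bool steal_bbag(Worker *thief, Worker *victim) {
  int want = BB_FULL;
  if (thief->loot_len != 0) return false;  // still draining earlier loot
  if (!atomic_compare_exchange_strong_explicit(&victim->state, &want, BB_TAKEN,
        memory_order_acquire, memory_order_relaxed)) return false;
  memcpy(thief->loot, victim->bbag, sizeof victim->bbag);
  thief->loot_len = BBAG_CAP;
  victim->bbag_len = 0;
  atomic_store_explicit(&victim->state, BB_FILLING, memory_order_release);
  return true;
}

// Pop priority: stolen bag, then private RBAG, then the owner's own bag.
static bool pop_redex(Worker *w, Redex *out) {
  if (w->loot_len) { *out = w->loot[--w->loot_len]; return true; }
  if (w->rbag_len) { *out = w->rbag[--w->rbag_len]; return true; }
  if (steal_bbag(w, w)) return pop_redex(w, out);  // reclaim own full bag
  // While the bag is still filling, only the owner can touch it.
  if (atomic_load(&w->state) == BB_FILLING && w->bbag_len) {
    *out = w->bbag[--w->bbag_len];
    return true;
  }
  return false;
}
```

In this sketch the only atomic operations are on the single `state` word per worker; every redex load and store is a plain access.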

I discovered there was a race condition, however: on ARM only, in one very specific condition (and possibly others that don't get exposed in this particular benchmark), due to relaxed memory ordering everywhere. This race condition exists in every commit and has nothing to do with any changes I made. Frankly, I suspect this same race condition occurs in HVM2 as well.

Spent a couple weeks tracking this down, came up with a solution, and I have never seen the race condition occur since the fix, and that's been several hundred thousand executions.

That fix required some occasional memory barriers/synchronization, and that dropped perf back down to ~700MIPS. A few more tweaks and I had it back up to ~735 or so now. I have some ideas related to missed branch-prediction that I think could possibly get it back up to 850MIPS. And I suspect there are many ideas I haven't had yet that could help push it even further.

@mikemcqueen

mikemcqueen commented Jun 17, 2025

Author

So, my original motivation was to help the guy (Wyatt/zen1) doing optimizations, by getting the parallel-sum benchmark running longer. Then once that was done, my motivation was to get parallel working. Then there was a period where my motivation was to fix the race condition. I'm currently pursuing possible node-reuse/recycling implementations (hvmvis was a child of this motivation).

The PR I submitted for HVM3-Strict is a huge (set of) change(s). It can be broken up into separate PRs if that will help make it easier to digest and/or provide history.

1. 128-bit atomics and HVM2-style RBAG/RNOD memory organization. This will have a race condition. ~550MIPS.
2. Add "booty bag" work-stealing implementation. 128-bit atomics removed, RBAG accessed non-atomically. Also has race condition. ~850MIPS.
3. Add "deferred bag" and memory barrier/syncing to fix race condition, and some other small optimizations. ~735MIPS.

@mikemcqueen

Author

On the subject of what else needs to be done for HVM3-Strict: CUDA support, obviously. I haven't really looked at HVM2's implementation, but HVM3-Strict follows HVM2's memory structure very closely, so my intuition is that a port should not be too difficult.

And more examples that use all of the available interaction types. The one existing example only uses 7 interaction types. I suspect there may be other interactions that have a race condition like the APPLAM-following-APPREF case, which currently gets deferred.
