Performance in recursive fork-join is very bad

I maintain a set of benchmarks to compare the performance of executors. https://github.com/tzcnt/runtime-benchmarks currently it mostly tracks fork-join performance. I have recently added HPX to this benchmark suite in PR https://github.com/tzcnt/runtime-benchmarks/pull/4.

I was shocked to discover that HPX on average is 900x slower than the fastest runtime. See the summary table on the repo README, or the full dataset chart here: https://fleetcode.com/runtime-benchmarks/ Note that I have only run this on one machine so far (with 128GB ram) - because on a different machine, it would take literally hours to complete the benchmark run, or run out of memory... whereas the competing runtimes can complete a benchmark run in a few minutes.

You can read the PR description above for a complete description of all the things that I tried, but the summary is that I wasn't able to achieve reasonable performance despite spending many hours trying different approaches.

I also discovered that the `abp-priority-lifo` scheduler is broken and doesn't actually use the correct scheduling policy. I had to manually edit [this line](https://github.com/STEllAR-GROUP/hpx/blob/f23dba500a506bea052bde15755c223d8d2ba7c0/libs/core/threadmanager/src/threadmanager.cpp#L437) to reference `hpx::threads::policies::lockfree_abp_lifo` instead, after which I was able to I verify that the correct scheduler was being used.

I expected that this would solve the problem as the memory blowup that I observed is usually what happens when runtimes use strictly FIFO scheduling, which can result in e.g. for Skynet, 10^8 = 100,000,000 active tasks maximum, rather than 10*8 = 80 active tasks maximum when using LIFO scheduling for the local task queue. I thought that the `lockfree_abp_lifo` policy would prevent this problem, but it didn't make any impact.

I see this issue https://github.com/STEllAR-GROUP/hpx/issues/3348 and this PR https://github.com/STEllAR-GROUP/hpx/pull/4744 but I'm not sure if this would resolve this problem.

I'd also like to note that 3 of the benchmarks I'm running have nearly equivalent implementations in your examples, so I'd expect that it would be possible to make these run in a performant manner.
https://github.com/STEllAR-GROUP/hpx/blob/master/examples/quickstart/fibonacci_one.cpp
https://github.com/STEllAR-GROUP/hpx/blob/master/tests/performance/local/skynet.cpp
https://github.com/STEllAR-GROUP/hpx/tree/master/examples/nqueen

I'd like to give you an opportunity to present your library in the best light, so I'm open to any feedback on how I can optimize my implementation / build flags to make HPX perform better.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Performance in recursive fork-join is very bad #6793

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

Performance in recursive fork-join is very bad #6793

Description

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions