Skip to content

Performance in recursive fork-join is very bad #6793

@tzcnt

Description

@tzcnt

I maintain a set of benchmarks to compare the performance of executors. https://github.com/tzcnt/runtime-benchmarks currently it mostly tracks fork-join performance. I have recently added HPX to this benchmark suite in PR tzcnt/runtime-benchmarks#4.

I was shocked to discover that HPX on average is 900x slower than the fastest runtime. See the summary table on the repo README, or the full dataset chart here: https://fleetcode.com/runtime-benchmarks/ Note that I have only run this on one machine so far (with 128GB ram) - because on a different machine, it would take literally hours to complete the benchmark run, or run out of memory... whereas the competing runtimes can complete a benchmark run in a few minutes.

You can read the PR description above for a complete description of all the things that I tried, but the summary is that I wasn't able to achieve reasonable performance despite spending many hours trying different approaches.

I also discovered that the abp-priority-lifo scheduler is broken and doesn't actually use the correct scheduling policy. I had to manually edit this line to reference hpx::threads::policies::lockfree_abp_lifo instead, after which I was able to I verify that the correct scheduler was being used.

I expected that this would solve the problem as the memory blowup that I observed is usually what happens when runtimes use strictly FIFO scheduling, which can result in e.g. for Skynet, 10^8 = 100,000,000 active tasks maximum, rather than 10*8 = 80 active tasks maximum when using LIFO scheduling for the local task queue. I thought that the lockfree_abp_lifo policy would prevent this problem, but it didn't make any impact.

I see this issue #3348 and this PR #4744 but I'm not sure if this would resolve this problem.

I'd also like to note that 3 of the benchmarks I'm running have nearly equivalent implementations in your examples, so I'd expect that it would be possible to make these run in a performant manner.
https://github.com/STEllAR-GROUP/hpx/blob/master/examples/quickstart/fibonacci_one.cpp
https://github.com/STEllAR-GROUP/hpx/blob/master/tests/performance/local/skynet.cpp
https://github.com/STEllAR-GROUP/hpx/tree/master/examples/nqueen

I'd like to give you an opportunity to present your library in the best light, so I'm open to any feedback on how I can optimize my implementation / build flags to make HPX perform better.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions