Add NUMA-aware allocator support (Linux: libnuma, cross-platform strategy) #466

@Twon

Summary

Introduce a NUMA-aware allocator for Morpheus so that memory placement can follow the machine's Non-Uniform Memory Access (NUMA) topology. The initial implementation will target Linux using libnuma, behind a cross-platform abstraction that also supports Windows and falls back gracefully on macOS.

Motivation

Modern multi-socket and multi-core systems frequently exhibit NUMA characteristics, where memory access latency depends on the proximity of memory to the executing CPU core.

Without NUMA awareness, allocations may occur on remote nodes, leading to:

  • Increased memory access latency
  • Reduced cache locality
  • Cross-node memory traffic (QPI/UPI or Infinity Fabric penalties)
  • Unpredictable performance in multi-threaded workloads

For Morpheus use cases (e.g. high-throughput pipelines, parsing, transformation, and low-latency systems), these effects can be significant.

A NUMA-aware allocator enables:

  • Allocation on the NUMA node local to the calling thread
  • Explicit placement strategies (bind, interleave, preferred node)
  • Improved scalability under parallel workloads
  • Better alignment with CPU affinity strategies

Linux Implementation (libnuma)

Leverage libnuma to implement a NUMA-aware allocator:

Features

  • Allocate memory on a specific NUMA node (numa_alloc_onnode)
  • Interleaved allocation across nodes (numa_alloc_interleaved)
  • Preferred node allocation (numa_set_preferred)
  • Query system topology (numa_available, numa_num_configured_nodes)
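
A minimal sketch of how these calls might be wrapped, assuming libnuma is an optional dependency. The MORPHEUS_HAS_LIBNUMA macro and the numa_backend namespace are hypothetical names invented for illustration; the libnuma functions used (numa_available, numa_num_configured_nodes, numa_alloc_onnode, numa_alloc_interleaved, numa_free) are the real API. Without the macro the wrappers fall back to std::malloc, so the code still builds on hosts that lack libnuma.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

#if defined(MORPHEUS_HAS_LIBNUMA)  // hypothetical macro, set by the build system
#include <numa.h>                  // link with -lnuma
#endif

namespace numa_backend  // hypothetical namespace, for illustration only
{

// True when libnuma is compiled in and the kernel reports NUMA support.
inline bool available() noexcept
{
#if defined(MORPHEUS_HAS_LIBNUMA)
    return numa_available() != -1;
#else
    return false;
#endif
}

// Number of configured nodes; reports a single node when NUMA is unavailable.
inline int node_count() noexcept
{
#if defined(MORPHEUS_HAS_LIBNUMA)
    if (available()) return numa_num_configured_nodes();
#endif
    return 1;
}

// Allocate bytes on the given node, or from the heap without NUMA support.
inline void* alloc_on_node(std::size_t bytes, int node) noexcept
{
    assert(node >= 0);
#if defined(MORPHEUS_HAS_LIBNUMA)
    if (available()) return numa_alloc_onnode(bytes, node);
#endif
    (void)node;
    return std::malloc(bytes);
}

// Allocate bytes with pages interleaved across nodes, or from the heap.
inline void* alloc_interleaved(std::size_t bytes) noexcept
{
#if defined(MORPHEUS_HAS_LIBNUMA)
    if (available()) return numa_alloc_interleaved(bytes);
#endif
    return std::malloc(bytes);
}

// Release memory from either function above; numa_free needs the size.
inline void release(void* p, std::size_t bytes) noexcept
{
#if defined(MORPHEUS_HAS_LIBNUMA)
    if (available()) { numa_free(p, bytes); return; }
#endif
    (void)bytes;
    std::free(p);
}

}  // namespace numa_backend
```

The numa_alloc_* functions round requests up to whole pages and are relatively slow compared to malloc, so a real allocator would probably carve small allocations out of larger node-bound regions rather than call them per object.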

Example API (conceptual)

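#include <cstddef>

// Placement policy applied to each allocation.
enum class numa_policy
{
    local,       // allocate on the node of the calling thread
    preferred,   // try the requested node first, fall back to other nodes
    interleave,  // spread pages round-robin across nodes
    bind         // restrict allocation to the requested node
};

template <typename T>
class numa_allocator
{
public:
    using value_type = T;

    numa_allocator(int node, numa_policy policy) noexcept;

    // Converting constructor so containers can rebind the allocator.
    template <typename U>
    numa_allocator(const numa_allocator<U>& other) noexcept;

    [[nodiscard]] T* allocate(std::size_t n);
    void deallocate(T* p, std::size_t n) noexcept;

    // Allocators compare equal when memory from one can be freed by the other.
    template <typename U>
    bool operator==(const numa_allocator<U>& other) const noexcept;
    template <typename U>
    bool operator!=(const numa_allocator<U>& other) const noexcept;

private:
    template <typename U>
    friend class numa_allocator;

    int node_;
    numa_policy policy_;
};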
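
To make the conceptual API concrete, here is a sketch of an allocator that satisfies the standard Allocator requirements and can be plugged into std::vector. It is not the proposed implementation: the detail::backend_allocate and detail::backend_deallocate helpers are hypothetical placeholders that ignore the node and policy and use the global heap, standing in for a platform backend. The node() and policy() accessors and the free comparison operators are also assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <new>
#include <vector>

enum class numa_policy { local, preferred, interleave, bind };

namespace detail  // hypothetical helpers standing in for a platform backend
{
inline void* backend_allocate(std::size_t bytes, int /*node*/, numa_policy /*policy*/)
{
    return ::operator new(bytes);  // placeholder: ignores node and policy
}

inline void backend_deallocate(void* p, std::size_t /*bytes*/) noexcept
{
    ::operator delete(p);
}
}  // namespace detail

template <typename T>
class numa_allocator
{
public:
    using value_type = T;

    numa_allocator(int node, numa_policy policy) noexcept
        : node_(node), policy_(policy)
    {
        assert(node >= 0);
    }

    // Rebinding constructor: containers convert the allocator to other types.
    template <typename U>
    numa_allocator(const numa_allocator<U>& other) noexcept
        : node_(other.node()), policy_(other.policy())
    {
    }

    T* allocate(std::size_t n)
    {
        // Guard against overflow in n * sizeof(T).
        if (n > std::numeric_limits<std::size_t>::max() / sizeof(T))
            throw std::bad_array_new_length();
        return static_cast<T*>(detail::backend_allocate(n * sizeof(T), node_, policy_));
    }

    void deallocate(T* p, std::size_t n) noexcept
    {
        detail::backend_deallocate(p, n * sizeof(T));
    }

    int node() const noexcept { return node_; }
    numa_policy policy() const noexcept { return policy_; }

private:
    int node_;
    numa_policy policy_;
};

// Two allocators are interchangeable when they target the same node and policy.
template <typename T, typename U>
bool operator==(const numa_allocator<T>& a, const numa_allocator<U>& b) noexcept
{
    return a.node() == b.node() && a.policy() == b.policy();
}

template <typename T, typename U>
bool operator!=(const numa_allocator<T>& a, const numa_allocator<U>& b) noexcept
{
    return !(a == b);
}
```

A container would then be declared as std::vector<int, numa_allocator<int>> v(numa_allocator<int>(0, numa_policy::bind)). The placeholder backend only guarantees default new alignment, so over-aligned types would need extra handling.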

Integration

  • Works with Morpheus allocator-aware types
  • Can be wrapped in std::pmr::memory_resource for runtime polymorphism

Windows Support

Windows does not provide libnuma, but exposes NUMA functionality via the Win32 API:

Relevant APIs

  • VirtualAllocExNuma
  • GetNumaNodeProcessorMaskEx
  • GetCurrentProcessorNumberEx
  • SetThreadGroupAffinity

Strategy

  • Implement a Windows-specific backend using VirtualAllocExNuma

  • Map Morpheus numa_policy to:

    • Preferred node allocation
    • Explicit node binding where possible

Limitations

  • Less flexible than Linux libnuma (e.g. interleaving has no direct equivalent and would need to be emulated by placing regions on individual nodes)
  • Requires careful handling of processor groups on large systems
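
A rough sketch of what the Windows backend's allocation path could look like, assuming a thin wrapper around VirtualAllocExNuma. The win_numa namespace and function names are hypothetical; VirtualAllocExNuma, VirtualFree, GetCurrentProcess and GetNumaHighestNodeNumber are real Win32 calls. On other platforms the sketch falls back to std::malloc so that it compiles everywhere.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

#if defined(_WIN32)
#include <windows.h>
#endif

namespace win_numa  // hypothetical namespace, for illustration only
{

// Highest node number reported by the OS; 0 elsewhere or on failure.
inline unsigned long highest_node() noexcept
{
#if defined(_WIN32)
    ULONG highest = 0;
    if (GetNumaHighestNodeNumber(&highest)) return highest;
#endif
    return 0;
}

// Reserve and commit bytes with the given node as the preferred node.
// The node is a preference, not a strict binding.
inline void* alloc_preferred(std::size_t bytes, unsigned long node) noexcept
{
    assert(bytes > 0);
#if defined(_WIN32)
    return VirtualAllocExNuma(GetCurrentProcess(), nullptr, bytes,
                              MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE,
                              static_cast<DWORD>(node));
#else
    (void)node;
    return std::malloc(bytes);
#endif
}

// Release a region obtained from alloc_preferred.
inline void release(void* p) noexcept
{
#if defined(_WIN32)
    if (p) VirtualFree(p, 0, MEM_RELEASE);
#else
    std::free(p);
#endif
}

}  // namespace win_numa
```

Because the node passed to VirtualAllocExNuma is only a preference, the bind policy would likely have to be documented as best-effort on this backend.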

macOS (Apple Silicon / Intel)

macOS does not expose NUMA APIs in a meaningful or controllable way:

  • Apple Silicon uses a unified memory architecture (UMA)
  • Intel macOS systems abstract NUMA details away from user-space

Strategy

  • Provide a no-op / fallback allocator that behaves like a standard allocator while preserving API compatibility

  • Optionally:

    • Use thread affinity hints (limited impact)
    • Document that NUMA policies are ignored

Cross-Platform Abstraction

Introduce a unified interface:

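#include <cstddef>
#include <memory_resource>

class numa_memory_resource : public std::pmr::memory_resource
{
public:
    numa_memory_resource(int node, numa_policy policy);

private:
    // Overrides of the std::pmr::memory_resource customisation points.
    void* do_allocate(std::size_t bytes, std::size_t alignment) override;
    void do_deallocate(void* p, std::size_t bytes, std::size_t alignment) override;
    bool do_is_equal(const std::pmr::memory_resource& other) const noexcept override;

    int node_;
    numa_policy policy_;
};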
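
As an illustration of how the three overrides might be filled in, here is a sketch that compiles on any C++17 platform. The allocation path is a placeholder that uses aligned operator new and ignores the node and policy; a real backend would call into libnuma or the Win32 APIs at that point. Treating two resources as equal when their node and policy match is an assumption of this sketch, as are the node() and policy() accessors.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory_resource>
#include <new>
#include <vector>

enum class numa_policy { local, preferred, interleave, bind };

class numa_memory_resource : public std::pmr::memory_resource
{
public:
    numa_memory_resource(int node, numa_policy policy) noexcept
        : node_(node), policy_(policy)
    {
        assert(node >= 0);
    }

    int node() const noexcept { return node_; }
    numa_policy policy() const noexcept { return policy_; }

private:
    void* do_allocate(std::size_t bytes, std::size_t alignment) override
    {
        // Placeholder: honours alignment but not node_ or policy_.
        return ::operator new(bytes, std::align_val_t(alignment));
    }

    void do_deallocate(void* p, std::size_t /*bytes*/, std::size_t alignment) override
    {
        ::operator delete(p, std::align_val_t(alignment));
    }

    bool do_is_equal(const std::pmr::memory_resource& other) const noexcept override
    {
        // Assumption: resources with the same node and policy are interchangeable.
        const auto* rhs = dynamic_cast<const numa_memory_resource*>(&other);
        return rhs != nullptr && rhs->node_ == node_ && rhs->policy_ == policy_;
    }

    int node_;
    numa_policy policy_;
};
```

Usage follows the normal std::pmr pattern: construct the resource, then pass its address to a container such as std::pmr::vector<int> v(&resource). The resource must outlive every container that uses it.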

Backend Selection

  • Linux → libnuma
  • Windows → Win32 NUMA APIs
  • macOS → fallback (standard allocation)
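
One way the selection above could be expressed is with predefined platform macros resolved at compile time. This is only a sketch: the numa_backend_kind enum and the three helper functions are hypothetical names, and a real build would probably also check that libnuma is actually installed rather than rely on the platform alone.

```cpp
#include <string_view>

// Hypothetical tag identifying which backend was compiled in.
enum class numa_backend_kind { libnuma, win32, fallback };

// Pick the backend from the compiler's predefined platform macros.
constexpr numa_backend_kind selected_backend() noexcept
{
#if defined(__linux__)
    return numa_backend_kind::libnuma;
#elif defined(_WIN32)
    return numa_backend_kind::win32;
#else
    return numa_backend_kind::fallback;  // macOS and anything else
#endif
}

// Whether the backend can act on a requested node or policy at all.
constexpr bool honors_placement(numa_backend_kind kind) noexcept
{
    return kind != numa_backend_kind::fallback;
}

// Human-readable name, e.g. for diagnostics or benchmark reports.
constexpr std::string_view backend_name(numa_backend_kind kind) noexcept
{
    switch (kind)
    {
    case numa_backend_kind::libnuma:  return "libnuma";
    case numa_backend_kind::win32:    return "Win32 NUMA APIs";
    case numa_backend_kind::fallback: return "fallback (standard allocation)";
    }
    return "unknown";
}
```

Exposing honors_placement lets callers detect at compile time that policies are ignored on the fallback backend instead of discovering it through benchmarks.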

Benefits

  • Improved locality and reduced latency in multi-threaded workloads
  • Better scalability on multi-socket systems
  • Alignment with thread pinning / CPU affinity strategies
  • Enables advanced users to tune performance-critical paths

Risks / Considerations

  • Portability complexity
    Requires multiple platform-specific implementations

  • Testing difficulty
    NUMA effects are hardware-dependent; CI coverage may be limited

  • Misuse potential
    Incorrect node selection can degrade performance

  • API design
    Needs to balance flexibility with ease of use
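
The testing concern could be partly addressed by putting a seam between the allocator and the platform calls, so CI can verify that the right node and policy are requested even on hardware with a single node. The sketch below assumes such a seam; placement_backend, recording_backend and placed_buffer are hypothetical types invented for the example, not part of Morpheus.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <vector>

enum class numa_policy { local, preferred, interleave, bind };

// Hypothetical seam between allocator logic and the platform.
struct placement_backend
{
    virtual ~placement_backend() = default;
    virtual void* allocate(std::size_t bytes, int node, numa_policy policy) = 0;
    virtual void deallocate(void* p, std::size_t bytes) noexcept = 0;
};

// Test double: records each request and serves it from the global heap.
struct recording_backend final : placement_backend
{
    struct request { std::size_t bytes; int node; numa_policy policy; };

    std::vector<request> requests;
    std::size_t live = 0;  // allocations not yet returned

    void* allocate(std::size_t bytes, int node, numa_policy policy) override
    {
        requests.push_back({bytes, node, policy});
        ++live;
        return ::operator new(bytes);
    }

    void deallocate(void* p, std::size_t /*bytes*/) noexcept override
    {
        assert(live > 0);
        --live;
        ::operator delete(p);
    }
};

// Stand-in for the code under test: an RAII buffer placed through a backend.
class placed_buffer
{
public:
    placed_buffer(placement_backend& backend, std::size_t bytes, int node, numa_policy policy)
        : backend_(backend), bytes_(bytes), data_(backend.allocate(bytes, node, policy))
    {
    }

    ~placed_buffer() { backend_.deallocate(data_, bytes_); }

    placed_buffer(const placed_buffer&) = delete;
    placed_buffer& operator=(const placed_buffer&) = delete;

    void* data() const noexcept { return data_; }

private:
    placement_backend& backend_;
    std::size_t bytes_;
    void* data_;
};
```

A test can then construct a placed_buffer with a recording_backend and inspect requests and live afterwards. This checks dispatch logic only; actual latency effects still need NUMA hardware.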

Alternatives Considered

  • Ignore NUMA entirely
    → Leaves significant performance on the table for target use cases

  • Rely on OS default policies
    → Often suboptimal for tightly controlled workloads

Next Steps

  1. Implement Linux prototype using libnuma
  2. Design abstraction layer for cross-platform support
  3. Add Windows backend using VirtualAllocExNuma
  4. Provide macOS fallback implementation
  5. Integrate with Morpheus allocator framework
  6. Benchmark impact on representative workloads

Open Questions

  • Should NUMA policy be compile-time or runtime configurable?
  • Do we expose low-level controls or provide higher-level presets?
  • Should thread affinity utilities be included alongside allocator support?
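
To make the first question more concrete, the two options could look roughly like this. Both types and the place helper are hypothetical, and place is a placeholder that ignores the policy; the sketch only contrasts where the policy lives, not how it is applied.

```cpp
#include <cassert>
#include <cstddef>
#include <new>

enum class numa_policy { local, preferred, interleave, bind };

// Hypothetical stand-in for a platform backend; ignores the policy.
inline void* place(std::size_t bytes, int node, numa_policy /*policy*/)
{
    assert(node >= 0);
    (void)node;
    return ::operator new(bytes);
}

// Compile-time option: the policy is part of the type, so it costs no
// storage and can be branched on with if constexpr, but it cannot change
// without recompiling.
template <numa_policy Policy>
struct static_policy_source
{
    static constexpr numa_policy policy = Policy;
    int node;

    void* allocate(std::size_t bytes) const { return place(bytes, node, Policy); }
};

// Runtime option: the policy is data, so it can come from configuration
// or be changed while the program runs.
struct runtime_policy_source
{
    int node;
    numa_policy policy;

    void* allocate(std::size_t bytes) const { return place(bytes, node, policy); }
};
```

The runtime form fits naturally with std::pmr::memory_resource, while the compile-time form fits the allocator template; the two are not mutually exclusive.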
