Title: Add NUMA-aware allocator support (Linux: libnuma, cross-platform strategy)
Summary
Introduce a NUMA-aware allocator implementation for Morpheus, enabling memory allocation policies that are aware of Non-Uniform Memory Access (NUMA) topologies. The initial implementation will target Linux using libnuma, behind a cross-platform abstraction that also supports Windows and falls back gracefully on macOS.
Motivation
Modern multi-socket and multi-core systems frequently exhibit NUMA characteristics, where memory access latency depends on the proximity of memory to the executing CPU core.
Without NUMA awareness, allocations may occur on remote nodes, leading to:
- Increased memory access latency
- Reduced cache locality
- Cross-node memory traffic (e.g. QPI/UPI or Infinity Fabric penalties)
- Unpredictable performance in multi-threaded workloads
For Morpheus use cases (e.g. high-throughput pipelines, parsing, transformation, and low-latency systems), these effects can be significant.
A NUMA-aware allocator enables:
- Thread-local allocation on the correct NUMA node
- Explicit placement strategies (bind, interleave, preferred node)
- Improved scalability under parallel workloads
- Better alignment with CPU affinity strategies
Linux Implementation (libnuma)
Leverage libnuma to implement a NUMA-aware allocator:
Features
- Allocate memory on a specific NUMA node (numa_alloc_onnode)
- Interleaved allocation across nodes (numa_alloc_interleaved)
- Preferred node allocation
- Query system topology (numa_available, numa_num_configured_nodes)
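The features above map onto a handful of libnuma calls. The following is a minimal sketch, not a proposed final API. It assumes a hypothetical MORPHEUS_HAS_LIBNUMA build flag that enables the libnuma path (link with -lnuma); without it the helpers fall back to std::malloc so the code still builds on machines that lack the library. The helper names (highest_node, alloc_on_node, alloc_interleaved, free_block) are illustrative only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// MORPHEUS_HAS_LIBNUMA is a hypothetical build flag, not defined by libnuma
// or by Morpheus today. When it is set, link with -lnuma.
#if defined(MORPHEUS_HAS_LIBNUMA)
#include <numa.h>
#endif

// Highest NUMA node number usable by this process; 0 when NUMA is unavailable.
inline int highest_node() {
#if defined(MORPHEUS_HAS_LIBNUMA)
    if (numa_available() != -1) {
        return numa_max_node();
    }
#endif
    return 0;
}

// Allocate `bytes` on `node`; plain std::malloc when libnuma is not in use.
inline void* alloc_on_node(std::size_t bytes, int node) {
    assert(node >= 0 && node <= highest_node());
    (void)node;  // unused on the fallback path in release builds
#if defined(MORPHEUS_HAS_LIBNUMA)
    if (numa_available() != -1) {
        return numa_alloc_onnode(bytes, node);
    }
#endif
    return std::malloc(bytes);
}

// Allocate `bytes` with pages interleaved across nodes (fallback: std::malloc).
inline void* alloc_interleaved(std::size_t bytes) {
#if defined(MORPHEUS_HAS_LIBNUMA)
    if (numa_available() != -1) {
        return numa_alloc_interleaved(bytes);
    }
#endif
    return std::malloc(bytes);
}

// Release a block from either helper. numa_free needs the original size.
inline void free_block(void* p, std::size_t bytes) {
#if defined(MORPHEUS_HAS_LIBNUMA)
    if (numa_available() != -1) {
        numa_free(p, bytes);
        return;
    }
#endif
    (void)bytes;
    std::free(p);
}
```

Note that the libnuma allocation functions work at page granularity, so in practice they are better suited to large blocks or to feeding a pooling layer than to many small allocations.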
Example API (conceptual)
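    #include <cstddef>

    // Placement policy applied by the allocator.
    enum class numa_policy
    {
        local,
        preferred,
        interleave,
        bind
    };

    template <typename T>
    class numa_allocator
    {
    public:
        using value_type = T;

        numa_allocator(int node, numa_policy policy);

        T* allocate(std::size_t n);            // storage for n objects of type T
        void deallocate(T* p, std::size_t n);  // n must match the allocate() call

    private:
        int node_;            // target NUMA node
        numa_policy policy_;  // placement policy
    };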
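To be usable with standard containers, the conceptual class above also needs a rebinding constructor and equality operators. The sketch below fills those in under stated assumptions: the default arguments, the node() and policy() accessors, and the use of ::operator new as a placeholder backend are all illustrative choices, not part of the proposal. A real backend would dispatch on the policy to libnuma or another platform API.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <new>
#include <vector>

enum class numa_policy { local, preferred, interleave, bind };

template <typename T>
class numa_allocator {
public:
    using value_type = T;

    // Default arguments are an assumption so the type is default-constructible.
    explicit numa_allocator(int node = 0,
                            numa_policy policy = numa_policy::local) noexcept
        : node_(node), policy_(policy) {}

    // Rebinding constructor: containers may allocate types other than T.
    template <typename U>
    numa_allocator(const numa_allocator<U>& other) noexcept
        : node_(other.node()), policy_(other.policy()) {}

    T* allocate(std::size_t n) {
        if (n > std::numeric_limits<std::size_t>::max() / sizeof(T)) {
            throw std::bad_array_new_length();
        }
        // Placeholder backend: ignores node_ and policy_ and uses the global
        // heap. Over-aligned types are not handled in this sketch.
        return static_cast<T*>(::operator new(n * sizeof(T)));
    }

    void deallocate(T* p, std::size_t /*n*/) noexcept { ::operator delete(p); }

    int node() const noexcept { return node_; }
    numa_policy policy() const noexcept { return policy_; }

private:
    int node_;
    numa_policy policy_;
};

// Conservative equality: same node and same policy.
template <typename T, typename U>
bool operator==(const numa_allocator<T>& a, const numa_allocator<U>& b) noexcept {
    return a.node() == b.node() && a.policy() == b.policy();
}

template <typename T, typename U>
bool operator!=(const numa_allocator<T>& a, const numa_allocator<U>& b) noexcept {
    return !(a == b);
}
```

With these additions, a declaration such as std::vector<int, numa_allocator<int>> v(numa_allocator<int>{0, numa_policy::bind}) compiles and carries the node and policy with the container.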
Integration
- Works with Morpheus allocator-aware types
- Can be wrapped in std::pmr::memory_resource for runtime polymorphism
Windows Support
Windows does not provide libnuma, but exposes NUMA functionality via the Win32 API:
Relevant APIs
- VirtualAllocExNuma
- GetNumaNodeProcessorMaskEx
- GetCurrentProcessorNumberEx
- SetThreadGroupAffinity
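Strategy
- Implement a Windows-specific backend using VirtualAllocExNuma
- Map Morpheus numa_policy values to their Win32 equivalents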
Limitations
- Less flexible than Linux libnuma (e.g. interleaving is less direct)
- Requires careful handling of processor groups on large systems
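For the Windows backend, node-preferred allocation with VirtualAllocExNuma might look roughly like the sketch below. The function names (win_alloc_on_node, win_free, win_highest_node) are hypothetical, and the non-Windows branch exists only so the snippet compiles elsewhere. Processor groups and interleaving are not addressed here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

#if defined(_WIN32)
#include <windows.h>
#endif

// Reserve and commit `bytes` with `node` as the preferred NUMA node.
// On Windows the physical pages are assigned when first touched.
inline void* win_alloc_on_node(std::size_t bytes, unsigned node) {
    assert(bytes > 0);
#if defined(_WIN32)
    return VirtualAllocExNuma(GetCurrentProcess(), nullptr, bytes,
                              MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE, node);
#else
    (void)node;  // no NUMA placement on the stand-in path
    return std::malloc(bytes);
#endif
}

// Release a block obtained from win_alloc_on_node.
inline void win_free(void* p) {
#if defined(_WIN32)
    VirtualFree(p, 0, MEM_RELEASE);
#else
    std::free(p);
#endif
}

// Highest NUMA node number reported by the OS; 0 if the query fails
// or when not running on Windows.
inline unsigned win_highest_node() {
#if defined(_WIN32)
    ULONG highest = 0;
    return GetNumaHighestNodeNumber(&highest) ? static_cast<unsigned>(highest) : 0u;
#else
    return 0u;
#endif
}
```

Because VirtualAllocExNuma works in whole pages, a backend built on it would most likely serve as the upstream of a pooling layer rather than handle small allocations directly.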
macOS (Apple Silicon / Intel)
macOS does not expose NUMA APIs in a meaningful or controllable way:
- Apple Silicon uses a unified memory architecture (UMA)
- Intel macOS systems abstract NUMA details away from user-space
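Strategy
- Provide a no-op / fallback allocator
- Behaves like a standard allocator while preserving API compatibility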
Cross-Platform Abstraction
Introduce a unified interface:
    #include <cstddef>
    #include <memory_resource>

    class numa_memory_resource : public std::pmr::memory_resource
    {
    public:
        numa_memory_resource(int node, numa_policy policy);

    private:
        void* do_allocate(std::size_t bytes, std::size_t alignment) override;
        void do_deallocate(void* p, std::size_t bytes, std::size_t alignment) override;
        bool do_is_equal(const std::pmr::memory_resource& other) const noexcept override;
    };
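As a rough illustration of how the unified interface could behave on a platform without NUMA control, the sketch below forwards every request to an upstream resource. The upstream parameter, the accessors, and identity-based equality are assumptions made for this example. A platform backend would replace the forwarding calls with node-aware allocation.

```cpp
#include <cassert>
#include <cstddef>
#include <memory_resource>
#include <vector>

enum class numa_policy { local, preferred, interleave, bind };

class numa_memory_resource : public std::pmr::memory_resource {
public:
    // `upstream` is an assumed extra parameter used by the fallback path.
    numa_memory_resource(int node, numa_policy policy,
                         std::pmr::memory_resource* upstream =
                             std::pmr::new_delete_resource())
        : node_(node), policy_(policy), upstream_(upstream) {}

    int node() const noexcept { return node_; }
    numa_policy policy() const noexcept { return policy_; }

private:
    // Fallback behaviour: node_ and policy_ are recorded but not acted on.
    void* do_allocate(std::size_t bytes, std::size_t alignment) override {
        return upstream_->allocate(bytes, alignment);
    }

    void do_deallocate(void* p, std::size_t bytes, std::size_t alignment) override {
        upstream_->deallocate(p, bytes, alignment);
    }

    // Identity comparison: only the same resource object compares equal.
    bool do_is_equal(const std::pmr::memory_resource& other) const noexcept override {
        return this == &other;
    }

    int node_;
    numa_policy policy_;
    std::pmr::memory_resource* upstream_;
};
```

A container then opts in at construction time, for example std::pmr::vector<int> v(&res), and the resource must outlive every container that uses it.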
Backend Selection
- Linux → libnuma
- Windows → Win32 NUMA APIs
- macOS → fallback (standard allocation)
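One way to express the backend mapping above is a compile-time switch on predefined platform macros. This is a sketch only: the numa_backend enum, both function names, and the MORPHEUS_HAS_LIBNUMA flag are hypothetical, with the flag standing in for whatever the build system uses to signal that libnuma was found.

```cpp
#include <string_view>

enum class numa_backend { libnuma, win32, fallback };

// Pick a backend at compile time. MORPHEUS_HAS_LIBNUMA is a hypothetical
// build flag; Linux builds without it use the fallback.
constexpr numa_backend selected_backend() noexcept {
#if defined(__linux__) && defined(MORPHEUS_HAS_LIBNUMA)
    return numa_backend::libnuma;
#elif defined(_WIN32)
    return numa_backend::win32;
#else
    return numa_backend::fallback;  // macOS and anything unrecognised
#endif
}

// Human-readable name, e.g. for logging or benchmark reports.
constexpr std::string_view backend_name(numa_backend b) noexcept {
    switch (b) {
        case numa_backend::libnuma:  return "libnuma";
        case numa_backend::win32:    return "win32";
        case numa_backend::fallback: return "fallback";
    }
    return "unknown";
}
```

Keeping the choice in a constexpr function lets the rest of the code branch with if constexpr rather than scattering platform macros across the allocator.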
Benefits
- Improved locality and reduced latency in multi-threaded workloads
- Better scalability on multi-socket systems
- Alignment with thread pinning / CPU affinity strategies
- Enables advanced users to tune performance-critical paths
Risks / Considerations
- Portability complexity: requires multiple platform-specific implementations
- Testing difficulty: NUMA effects are hardware-dependent; CI coverage may be limited
- Misuse potential: incorrect node selection can degrade performance
- API design: needs to balance flexibility with ease of use
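Alternatives Considered
- Ignore NUMA entirely → leaves significant performance on the table for target use cases
- Rely on OS default policies → often suboptimal for tightly controlled workloads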
Next Steps
- Implement Linux prototype using libnuma
- Design abstraction layer for cross-platform support
- Add Windows backend using VirtualAllocExNuma
- Provide macOS fallback implementation
- Integrate with Morpheus allocator framework
- Benchmark impact on representative workloads
Open Questions
- Should NUMA policy be compile-time or runtime configurable?
- Do we expose low-level controls or provide higher-level presets?
- Should thread affinity utilities be included alongside allocator support?
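To make the first question concrete, the two options can be compared side by side. In the sketch below, static_policy and dynamic_policy are hypothetical names. The first bakes the policy into the type, the second stores it per object, and generic code can accept either.

```cpp
enum class numa_policy { local, preferred, interleave, bind };

// Compile-time option: the policy is part of the type, so it costs no
// storage, but each policy yields a distinct, incompatible type.
template <numa_policy Policy>
struct static_policy {
    constexpr numa_policy get() const noexcept { return Policy; }
};

// Runtime option: the policy is stored per object and can be chosen from
// configuration, at the cost of one field and a runtime branch.
struct dynamic_policy {
    numa_policy value;
    constexpr numa_policy get() const noexcept { return value; }
};

// Generic code can be written against either source.
template <typename PolicySource>
constexpr bool wants_interleave(const PolicySource& p) noexcept {
    return p.get() == numa_policy::interleave;
}
```

A hybrid is also possible, where a runtime policy is the default and a compile-time variant is offered for hot paths. Whether that added surface is worthwhile depends on the answer to the second question.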
Examples