As you know, gallocy is trying to provide a "transparent" or "implicit" interface to the application. This necessarily means that we interpose replacement memory and threading interfaces into the process. This is extremely difficult to do correctly, and in some cases we've found it impossible (e.g., in libc's case). The root problem is that we need to use the system allocator/threads at the very same time we're replacing them! This turns into a never-ending war against the standard library, where gallocy is constantly trying to hide its existence from the actual running process, but periodically corrupting heaps in doing so.
One way I think we can get around this is by moving as much code as possible out of the runtime API and into a separate process. This separate process is a standard C++ application: it can use libraries, use the system allocator, etc., and owns state that is truly local to its node. After all, it's a heavyweight process. This daemon would be responsible for maintaining the distributed vmm, consensus, networking, etc., and would participate in no function interposition black magic. It would also expose an explicit interface as a library: think gallocy_malloc, gallocy_free, gallocy_pthread_create, gallocy_pthread_join, etc. We would have some serious freedom and could rule out entire classes of potential errors.
At this point the runtime API is just a library that does little more than signal handling and function interposition. It still conducts black magic, but it does so without the worry that it is going to screw up its internal state (the primary reason why we're maintaining two allocators, custom types, custom threading symbols, and more). It would talk to the explicit interface over a custom IPC protocol that we would need to develop.
- E.g., an allocation flow might be: 1) target process calls mmap, 2) runtime API intercepts and notifies daemon by IPC, 3) daemon synchronizes global address space, 4) runtime API returns.
- E.g., a page transfer might be: 1) target application segfaults, 2) runtime API intercepts and notifies daemon by IPC, 3) daemon fetches page from proper owner, 4) runtime API returns with data.
- E.g., a request for a page might be: 1) cluster contacts a daemon, 2) daemon notifies runtime by IPC, 3) runtime API reads request for memory 0xffff0000, 4) runtime API sets read-only on 0xffff0000 and sends contents to daemon by IPC, 5) daemon sends memory to cluster and synchronizes global address space.
This decision would make a few show-stopping problems, like the libc issue, tractable: this design allows us to maintain a single system allocator, so no memory allocation mismatch is possible. As long as we can implement the runtime interface such that it doesn't use the allocator (otherwise we'd recurse infinitely), we simply add an IPC sync step to every allocation or fault. That sounds possible.
This design would add ~5 microseconds to any transaction against the daemon if we chose a fast IPC like domain sockets or shared memory. I got these numbers by running a few tests using https://github.com/rigtorp/ipc-bench.
Am I missing anything super major, or would this design work and make implementing the system substantially easier (and a lot, lot, lot safer)? First order of business would be an investigation into whether IPC uses the allocator, right? I can put together a few experiments.
Let's think of potential problems with this design and record them here.