## Address space sharing
For address space sharing, RPAL partitions the entire userspace virtual
address space and allocates non-overlapping memory ranges to each process.
On x86_64, RPAL uses a memory range covering one entire PUD (Page Upper
Directory) table, i.e., the 512GB spanned by a single top-level PGD entry.
This restricts each process's virtual address space to 512GB on x86_64,
which is sufficient for most applications in our scenario. The rationale is
straightforward: address space sharing can then be achieved simply by
copying the PGD entry, and with it the PUD table it points to, from one
process's page table to another's, so one process can directly dereference
a data pointer into another process's memory.
|------------| <- 0
|------------| <- 512 GB
| Process A  |
|------------| <- 2*512 GB
|------------| <- n*512 GB
| Process B  |
|------------| <- (n+1)*512 GB
|------------| <- STACK_TOP
|   Kernel   |
|------------|
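As a rough illustration of that PUD-copy idea (a minimal sketch under the
assumptions above, not the posted RPAL code; rpal_share_slot() is a made-up
name), sharing one 512GB slot amounts to making the destination page
table's top-level entry point at the same PUD table as the source's:

```c
/*
 * Minimal sketch, not the actual RPAL implementation: make dst_mm's
 * top-level entry for 'addr' point at the same PUD table as src_mm's,
 * so the whole 512GB slot becomes visible in both page tables.
 * Assumes x86_64 with 4-level paging (p4d folded into pgd) and that
 * locking and lifetime of the shared PUD table are handled elsewhere.
 */
#include <linux/mm.h>
#include <linux/pgtable.h>

static void rpal_share_slot(struct mm_struct *dst_mm,
			    struct mm_struct *src_mm,
			    unsigned long addr)
{
	pgd_t *src_pgd = pgd_offset(src_mm, addr);
	pgd_t *dst_pgd = pgd_offset(dst_mm, addr);

	/* Both processes now resolve addresses in this slot through the
	 * same PUD table, so a pointer handed across the RPAL boundary
	 * dereferences the same pages in either process. */
	set_pgd(dst_pgd, *src_pgd);
}
```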
Oh my.
It reminds me a bit of mshare -- just that mshare tries to do it in a
less hacky way ...
## RPAL call
We refer to this lightweight userspace context-switching mechanism as an
RPAL call. It enables the caller (or sender) thread of one process to
switch directly to the callee (or receiver) thread of another process.
When Process A’s caller thread initiates an RPAL call to Process B’s
callee thread, the CPU saves the caller’s context and loads the callee’s
context. This enables direct userspace control flow transfer from the
caller to the callee. After the callee finishes data processing, the CPU
saves Process B’s callee context and switches back to Process A’s caller
context, completing a full IPC cycle.
|------------|                    |---------------------|
| Process A  |                    | Process B           |
| |-------|  |                    |  |-------|          |
| | caller|  | --- RPAL call -->  |  | callee|  handle  |
| | thread|  | <----------------  |  | thread|  -> event|
| |-------|  |                    |  |-------|          |
|------------|                    |---------------------|
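By way of analogy only -- this is not the RPAL interface -- the control
flow of an RPAL call has the same shape as a swapcontext(3)-style userspace
switch, except that in RPAL the two contexts belong to different processes
sharing one address space:

```c
/*
 * Analogy only: shows the caller -> callee -> caller control-flow shape
 * within a single process using ucontext; it is not the RPAL mechanism.
 */
#include <stdio.h>
#include <ucontext.h>

static ucontext_t caller_ctx, callee_ctx;

static void callee_thread(void)
{
	printf("callee: handling request\n");
	/* Hand control back to the caller, completing the IPC cycle. */
	swapcontext(&callee_ctx, &caller_ctx);
}

int main(void)
{
	static char callee_stack[64 * 1024];

	getcontext(&callee_ctx);
	callee_ctx.uc_stack.ss_sp = callee_stack;
	callee_ctx.uc_stack.ss_size = sizeof(callee_stack);
	callee_ctx.uc_link = &caller_ctx;
	makecontext(&callee_ctx, callee_thread, 0);

	printf("caller: issuing the call\n");
	swapcontext(&caller_ctx, &callee_ctx);	/* caller -> callee */
	printf("caller: callee handed control back\n");
	return 0;
}
```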
# Security and compatibility with kernel subsystems
## Memory protection between processes
Since processes using RPAL share the address space, an unintended
cross-process memory access may occur and corrupt another process's data.
To mitigate this, we leverage Memory Protection Keys (MPK) on x86
architectures.
MPK assigns 4 bits in each page table entry to a "protection key", which
is paired with a userspace register (PKRU). The PKRU register defines
access permissions for memory regions protected by specific keys (for
detailed implementation, refer to the kernel documentation "Memory
Protection Keys"). With MPK, even though the address space is shared
among processes, cross-process access is restricted: a process can only
access the memory protected by a key if its PKRU register is configured
with the corresponding permission. This ensures that processes cannot
access each other’s memory unless an explicit PKRU configuration is set.
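For reference, the underlying primitive is already exposed to userspace; a
minimal sketch with the stock pkeys API on pkeys-capable hardware (this
only shows the primitive, not RPAL's cross-process key and PKRU
management):

```c
/* Minimal MPK sketch using the standard Linux pkeys syscall wrappers. */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 4096;
	char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return 1;

	/* Tag the region with a key whose default PKRU rights deny access. */
	int pkey = pkey_alloc(0, PKEY_DISABLE_ACCESS);
	if (pkey < 0 || pkey_mprotect(buf, len, PROT_READ | PROT_WRITE, pkey))
		return 1;

	/* Loads/stores to buf fault until this thread's PKRU grants access. */
	pkey_set(pkey, 0);
	strcpy(buf, "accessible after PKRU update");
	printf("%s\n", buf);
	return 0;
}
```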
## Page fault handling and TLB flushing
Due to the shared address space architecture, both page fault handling and
TLB flushing require careful consideration. For instance, when Process A
accesses Process B’s memory, a page fault may occur in Process A's
context, but the faulting address belongs to Process B. In this case, we
must pass Process B's mm_struct to the page fault handler.
In an mshare region, all faults would be rerouted to the mshare MM
either way.
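To make the mm selection concrete, here is a sketch under the assumption of
512GB-aligned per-process slots; rpal_slot_owner[] and rpal_fault_mm() are
hypothetical names, not the posted code:

```c
/*
 * Sketch under assumptions: with 512GB-aligned slots, the top-level page
 * table index of a faulting address identifies the process that owns it,
 * so the fault can be handled against that process's mm instead of
 * current->mm.
 */
#include <linux/mm_types.h>
#include <linux/pgtable.h>
#include <linux/sched.h>

/* Hypothetical per-slot owner table, indexed by top-level table index. */
extern struct mm_struct *rpal_slot_owner[PTRS_PER_PGD];

static struct mm_struct *rpal_fault_mm(unsigned long address)
{
	struct mm_struct *mm = rpal_slot_owner[pgd_index(address)];

	/* Fall back to the faulting task's own mm for unshared slots. */
	return mm ? mm : current->mm;
}
```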
TLB flushing is more complex. When a thread flushes the TLB, the affected
memory may be accessed not only by other threads of the current process
but also, because the address space is shared, by threads of the other
processes sharing it. Therefore, the CPU mask used for the TLB flush must
be the union of the mm_cpumasks of all processes that share the address
space.
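A sketch of the cpumask handling this implies (rpal_build_flush_mask() is a
made-up helper, not the posted code):

```c
/*
 * Sketch under assumptions: build the set of CPUs that must receive the
 * TLB flush as the union of the mm_cpumasks of every process sharing the
 * address space.
 */
#include <linux/cpumask.h>
#include <linux/mm_types.h>

static void rpal_build_flush_mask(struct cpumask *flush_mask,
				  struct mm_struct **sharing_mms, int nr)
{
	int i;

	cpumask_clear(flush_mask);
	for (i = 0; i < nr; i++)
		cpumask_or(flush_mask, flush_mask,
			   mm_cpumask(sharing_mms[i]));
	/* flush_mask then replaces the single mm_cpumask() when sending
	 * the flush IPIs. */
}
```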
Oh my.
It all reminds me of mshare, just the context switch handling is
different (and significantly ... more problematic).
Maybe something could be built on top of mshare, but I'm afraid the real
magic is the address space sharing combined with the context switching
... which sounds like a big can of worms.
So in the current form, I understand all the NACKs.
--
Cheers,
David / dhildenb