Bo,

You have outstanding feedback on your v1 from me and Dave Hansen. I'm
not quite sure why you're sending a v2 without responding to that.

This isn't how the upstream kernel works...

Thanks,
Lorenzo

On Fri, May 30, 2025 at 05:27:28PM +0800, Bo Li wrote:
> Changelog:
>
> v2:
> - Port the RPAL functions to the latest v6.15 kernel.
> - Add a supplementary introduction to the application scenarios and
>   security considerations of RPAL.
>
> link to v1:
> https://lore.kernel.org/lkml/CAP2HCOmAkRVTci0ObtyW=3v6GFOrt9zCn2NwLUbZ+Di49xkBiw@xxxxxxxxxxxxxx/
>
> --------------------------------------------------------------------------
>
> # Introduction
>
> We mainly apply RPAL to the service mesh architecture widely adopted
> in modern cloud-native data centers. Before the rise of the service
> mesh, network functions were usually integrated into monolithic
> applications as libraries, and the main business programs invoked them
> through function calls. To let the main business programs and the
> network functions be developed and maintained independently, the
> service mesh moved the network functions out of the main business
> programs and into separate processes called sidecars. The main
> business program and its sidecar now interact through inter-process
> communication (IPC), and this added IPC has sharply increased resource
> consumption in cloud-native data centers; it can occupy more than 10%
> of the CPU of an entire microservice cluster.
>
> To regain the efficient function-call mechanism of the monolithic
> architecture under the service mesh, we introduce the RPAL (Run
> Process As Library) architecture, which shares one virtual address
> space among processes and switches threads in user mode. Our analysis
> of the service mesh shows that memory isolation between the main
> business program and the sidecar is not particularly important: the
> two are split from a single application and were an integral part of
> the original monolith. What matters more is that the two processes
> remain independent of each other, so that they can be developed and
> maintained separately and the architectural advantages of the service
> mesh are preserved. RPAL therefore breaks the isolation between
> processes while preserving their independence. We believe RPAL can
> also be applied to other scenarios with sidecar-like architectures,
> such as distributed file storage systems in LLM infrastructure.
>
> In the RPAL architecture, multiple processes share one virtual address
> space, so the architecture can be regarded as an advanced version of
> the Linux shared memory mechanism:
>
> 1. Traditional shared memory requires two processes to negotiate so
> that they map the same piece of memory. Likewise, two RPAL processes
> must reach a consensus before they can successfully invoke the RPAL
> system calls that share the virtual address space.
> 2. Traditional shared memory shares only part of the data. In the RPAL
> architecture, processes that have established an RPAL communication
> relationship share a virtual address space, and all user memory (such
> as data segments and code segments) of each RPAL process is shared
> among these processes. A process cannot, however, access another
> process's memory at arbitrary times.
> We use the MPK mechanism to ensure that the memory of other processes
> can be accessed only while special RPAL functions are being called;
> any other access triggers a page fault.
> 3. In the RPAL architecture, to keep the execution context of the
> shared code consistent (such as the stack and thread-local storage),
> we further implement thread context switching in user mode on top of
> the shared virtual address space. This lets threads of different
> processes switch to each other directly and quickly in user mode,
> without falling into kernel mode for a slow switch.
>
> # Background
>
> In traditional inter-process communication (IPC) scenarios, Unix
> domain sockets are commonly used in conjunction with the epoll()
> family for event multiplexing. IPC operations involve system calls on
> both the data and control planes, thereby imposing a non-trivial
> overhead on the interacting processes. Even when shared memory is
> employed to optimize the data plane, two data copies still remain:
> data is first copied from one process's private memory into the shared
> memory area, and then from the shared memory into the private memory
> of the other process.
>
> This raises a question: is it possible to reduce the overhead of IPC
> with only minimal modifications at the application level? We observed
> that the functionality of IPC, which encompasses data transfer and
> invocation of the target thread, resembles a function call, where
> arguments are passed and the callee is invoked to process them.
> Inspired by this analogy, we introduce RPAL (Run Process As Library),
> a framework designed to let one process invoke another as if making a
> local function call, all without going through the kernel.
>
> # Design
>
> First, let's formalize RPAL's core objectives:
>
> 1. Data-plane efficiency: reduce the number of data copies from two
> (in the shared memory solution) to one.
> 2. Control-plane optimization: eliminate the overhead of system calls
> and the kernel's thread switches.
> 3. Application compatibility: minimize the modifications to existing
> applications that use Unix domain sockets and the epoll() family.
>
> To attain the first objective, processes that use RPAL share the same
> virtual address space, so one process can access another's data
> directly through a data pointer. Data can therefore be transferred
> from one process to another with a single copy operation.
>
> To meet the second goal, RPAL relies on the shared address space to do
> lightweight context switching in user space, which we call an "RPAL
> call". This allows one process to execute another process's code just
> like a local function call.
>
> To achieve the third target, RPAL stays compatible with the epoll
> family of functions, like epoll_create(), epoll_wait(), and
> epoll_ctl(). If an application uses epoll for IPC, developers can
> switch to RPAL with just a few small changes; for instance, replacing
> epoll_wait() with rpal_epoll_wait(), as sketched below. The basic
> epoll procedure, in which a process uses an epoll file descriptor to
> wait for another process to write to a monitored descriptor, still
> works with RPAL.
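>
> The event loop below shows the change. This is a purely illustrative
> sketch: rpal_epoll_wait() is assumed here to have the same prototype
> as epoll_wait(), and handle_event() is a placeholder for application
> logic; the real declarations live in the user-mode library under
> samples/rpal/librpal.
>
> #include <sys/epoll.h>
>
> /* Assumed to match epoll_wait()'s prototype. */
> int rpal_epoll_wait(int epfd, struct epoll_event *events,
>                     int maxevents, int timeout);
>
> extern void handle_event(int fd);       /* application logic */
>
> int run_event_loop(int data_fd)
> {
>         struct epoll_event ev, events[16];
>         int epfd, n, i;
>
>         epfd = epoll_create1(0);                        /* unchanged */
>         if (epfd < 0)
>                 return -1;
>
>         ev.events = EPOLLIN;
>         ev.data.fd = data_fd;
>         if (epoll_ctl(epfd, EPOLL_CTL_ADD, data_fd, &ev) < 0)
>                 return -1;                              /* unchanged */
>
>         for (;;) {
>                 /* was: n = epoll_wait(epfd, events, 16, -1); */
>                 n = rpal_epoll_wait(epfd, events, 16, -1);
>                 for (i = 0; i < n; i++)
>                         handle_event(events[i].data.fd);
>         }
> }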
>
> ## Address space sharing
>
> For address space sharing, RPAL partitions the entire userspace
> virtual address space and assigns non-overlapping memory ranges to
> each process. On x86_64, RPAL uses the memory range covered by a
> single PUD (Page Upper Directory) entry, which is 512 GB. This
> restricts each process's virtual address space to 512 GB on x86_64,
> which is sufficient for most applications in our scenario. The
> rationale is straightforward: address space sharing can then be
> achieved simply by copying the PUD entry from one process's page
> table into another's, after which one process can use a plain data
> pointer to access another's memory.
>
> |------------| <- 0
> |------------| <- 512 GB
> | Process A  |
> |------------| <- 2*512 GB
> |------------| <- n*512 GB
> | Process B  |
> |------------| <- (n+1)*512 GB
> |------------| <- STACK_TOP
> | Kernel     |
> |------------|
>
> ## RPAL call
>
> We refer to the lightweight userspace context-switching mechanism as
> an RPAL call. It enables the caller (or sender) thread of one process
> to switch directly to the callee (or receiver) thread of another
> process.
>
> When Process A's caller thread initiates an RPAL call to Process B's
> callee thread, the CPU saves the caller's context and loads the
> callee's context, transferring control flow from the caller to the
> callee entirely in userspace. After the callee finishes processing
> the data, the CPU saves Process B's callee context and switches back
> to Process A's caller context, completing a full IPC cycle.
>
> |------------|                       |---------------------|
> | Process A  |                       | Process B           |
> | |--------| |                       | |--------|          |
> | | caller | ------ RPAL call -----> | | callee | handle   |
> | | thread | <---------------------- | | thread | -> event |
> | |--------| |                       | |--------|          |
> |------------|                       |---------------------|
>
> # Security and compatibility with kernel subsystems
>
> ## Memory protection between processes
>
> Since processes using RPAL share the address space, unintended
> cross-process memory access may occur and corrupt the data of another
> process. To mitigate this, we leverage Memory Protection Keys (MPK)
> on x86 architectures.
>
> MPK assigns 4 bits in each page table entry to a "protection key",
> which is paired with a userspace register (PKRU). The PKRU register
> defines access permissions for memory regions protected by specific
> keys (for detailed implementation, refer to the kernel documentation
> "Memory Protection Keys"). With MPK, even though the address space is
> shared among processes, cross-process access is restricted: a process
> can only access memory protected by a key if its PKRU register is
> configured with the corresponding permission. This ensures that
> processes cannot access each other's memory unless an explicit PKRU
> configuration is set.
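>
> RPAL programs the protection keys and each thread's PKRU register
> itself. Purely to illustrate the underlying mechanism, the
> stand-alone sketch below uses the stock Linux pkey API (no RPAL
> involved) to revoke and re-grant access to an already-mapped page:
>
> #define _GNU_SOURCE
> #include <sys/mman.h>
> #include <stdio.h>
>
> int main(void)
> {
>         /* One page standing in for another process's memory. */
>         char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
>                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>         int pkey = pkey_alloc(0, 0);
>
>         if (p == MAP_FAILED || pkey < 0)
>                 return 1;
>
>         /* Tag the page with the protection key. */
>         pkey_mprotect(p, 4096, PROT_READ | PROT_WRITE, pkey);
>
>         /* Drop access rights in this thread's PKRU: the page stays
>          * mapped, but any load or store would now fault. */
>         pkey_set(pkey, PKEY_DISABLE_ACCESS);
>
>         /* Re-grant access before touching the memory again. */
>         pkey_set(pkey, 0);
>         p[0] = 'x';
>         printf("%c\n", p[0]);
>         return 0;
> }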
>
> ## Page fault handling and TLB flushing
>
> Due to the shared address space, both page fault handling and TLB
> flushing require careful consideration. For instance, when Process A
> accesses Process B's memory, a page fault may occur in Process A's
> context while the faulting address belongs to Process B. In this
> case, we must pass Process B's mm_struct to the page fault handler.
>
> TLB flushing is more complex. When a thread flushes the TLB, the
> memory affected by the flush may be accessed not only by other
> threads of the current process but, because the address space is
> shared, also by the other processes sharing that address space.
> Therefore, the cpumask used for TLB flushing should be the union of
> the mm_cpumasks of all processes that share the address space.
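>
> The sketch below illustrates the idea. It is illustrative only:
> struct rpal_service and its fields are hypothetical names invented
> for this cover letter, not the names used in the patches.
>
> #include <linux/cpumask.h>
> #include <linux/list.h>
> #include <linux/mm_types.h>
>
> /* Hypothetical per-process bookkeeping, for illustration only. */
> struct rpal_service {
>         struct mm_struct *mm;
>         struct list_head list;  /* links processes sharing the space */
> };
>
> /*
>  * Build the target mask for a TLB flush: the union of the
>  * mm_cpumasks of every process sharing the address space, not just
>  * the mask of the current mm.
>  */
> static void rpal_build_flush_cpumask(struct list_head *group,
>                                      struct cpumask *mask)
> {
>         struct rpal_service *peer;
>
>         cpumask_clear(mask);
>         list_for_each_entry(peer, group, list)
>                 cpumask_or(mask, mask, mm_cpumask(peer->mm));
>
>         /* Flush IPIs (flush_tlb_multi()-style) then target "mask". */
> }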
>
> ## Lazy switch of kernel context
>
> In RPAL, a mismatch may arise between the user context and the kernel
> context. The RPAL call is designed to switch only the user context,
> leaving the kernel context unchanged. For instance, when an RPAL call
> takes place, transitioning from the caller thread to the callee
> thread, and a system call is subsequently initiated within the callee
> thread, the kernel would incorrectly use the caller's kernel context
> (such as the kernel stack) to process the system call.
>
> To resolve such context mismatches, a kernel context switch is
> triggered at the kernel entry point when the callee initiates a
> syscall or an exception/interrupt occurs. This mechanism ensures
> context consistency before system calls, interrupts, or exceptions
> are processed. We refer to this kernel context switch as a "lazy
> switch" because it defers the switching operation from the
> traditional thread switch point to the next kernel entry point.
>
> Lazy switches should be minimized as much as possible, since they
> significantly degrade performance. We currently use RPAL in an RPC
> framework, in which the RPC sender thread relies on the RPAL call to
> invoke the RPC receiver thread entirely in user space. In most cases,
> the receiver thread makes no system calls and its code execution time
> is relatively short, which effectively reduces the probability of a
> lazy switch.
>
> ## Time slice correction
>
> After an RPAL call, the callee's user-mode code executes, but the
> kernel incorrectly attributes this CPU time to the caller because the
> kernel context is unchanged.
>
> To correct this, we use the Time Stamp Counter (TSC) register to
> measure the CPU time consumed by the callee thread in user space. The
> kernel then uses this user-reported timing data to adjust the CPU
> accounting for both the caller and the callee thread, similar to how
> CPU steal time is implemented.
>
> ## Process recovery
>
> Since processes can access each other's memory, there is a risk that
> the target process's memory becomes invalid at access time (e.g., if
> the target process has exited unexpectedly). The kernel must handle
> such cases; otherwise, the accessing process could be terminated by a
> failure originating in another process.
>
> To address this, each thread pre-establishes a recovery point before
> accessing the memory of other processes. When such an invalid access
> occurs, the thread traps into the kernel, and inside the page fault
> handler the kernel restores the thread's user context to the recovery
> point. This mechanism keeps processes mutually independent and
> prevents cascading failures caused by cross-process memory issues.
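>
> In RPAL the kernel performs this unwinding from its page fault
> handler. As a userspace analogy only (this is not how RPAL is
> implemented), the recovery-point idea resembles a sigsetjmp()-based
> guard around the remote access:
>
> #include <setjmp.h>
> #include <signal.h>
> #include <string.h>
>
> static __thread sigjmp_buf recovery_point;
>
> static void segv_handler(int sig)
> {
>         /* Unwind to the recovery point instead of dying. */
>         siglongjmp(recovery_point, 1);
> }
>
> /* Copy from memory that may belong to an exited process. */
> int read_remote(const char *remote, char *buf, size_t len)
> {
>         struct sigaction sa = { .sa_handler = segv_handler };
>
>         sigemptyset(&sa.sa_mask);
>         sigaction(SIGSEGV, &sa, NULL);
>
>         if (sigsetjmp(recovery_point, 1))
>                 return -1;      /* the remote memory was gone */
>
>         memcpy(buf, remote, len);
>         return 0;
> }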
>
> # Performance
>
> To quantify the performance improvements brought by RPAL, we measured
> latency both before and after its deployment. Experiments were
> conducted on a server equipped with two Intel(R) Xeon(R) Platinum
> 8336C CPUs (2.30 GHz) and 1 TB of memory. Latency was defined as the
> duration from when the client thread initiates a message to when the
> server thread is invoked and receives it.
>
> During testing, the client transmitted 1 million 32-byte messages,
> and we computed the per-message average latency. The results are as
> follows:
>
> *****************
> Without RPAL: Message length: 32 bytes, Total TSC cycles:
> 19616222534, Message count: 1000000, Average latency: 19616 cycles
> With RPAL: Message length: 32 bytes, Total TSC cycles: 1703459326,
> Message count: 1000000, Average latency: 1703 cycles
> *****************
>
> These results confirm that RPAL delivers substantial latency
> improvements over the current epoll implementation: a 17,913-cycle
> reduction (~91.3%) for 32-byte messages.
>
> We have applied RPAL to an RPC framework that is widely used in our
> data center and achieved up to a 15.5% reduction in the CPU
> utilization of processes in real-world microservice scenarios. The
> gains primarily stem from minimizing control-plane overhead through
> userspace context switches; in addition, address space sharing
> significantly reduces the number of memory copies.
>
> # Future Work
>
> Currently, RPAL requires the MPK (Memory Protection Keys) hardware
> feature, which is supported by a range of Intel CPUs. On AMD, MPK is
> supported only on recent processors, specifically 3rd Generation AMD
> EPYC™ Processors and subsequent generations. Patch sets that extend
> RPAL support to systems lacking MPK hardware will be provided later.
>
> Accompanying test programs are provided in the samples/rpal/
> directory, and the user-mode RPAL library, which implements the
> user-space RPAL call, is in the samples/rpal/librpal directory.
>
> We hope to get community discussion and feedback on RPAL's
> optimization approaches and architecture.
>
> We look forward to your comments.
>
> Bo Li (35):
>   Kbuild: rpal support
>   RPAL: add struct rpal_service
>   RPAL: add service registration interface
>   RPAL: add member to task_struct and mm_struct
>   RPAL: enable virtual address space partitions
>   RPAL: add user interface
>   RPAL: enable shared page mmap
>   RPAL: enable sender/receiver registration
>   RPAL: enable address space sharing
>   RPAL: allow service enable/disable
>   RPAL: add service request/release
>   RPAL: enable service disable notification
>   RPAL: add tlb flushing support
>   RPAL: enable page fault handling
>   RPAL: add sender/receiver state
>   RPAL: add cpu lock interface
>   RPAL: add a mapping between fsbase and tasks
>   sched: pick a specified task
>   RPAL: add lazy switch main logic
>   RPAL: add rpal_ret_from_lazy_switch
>   RPAL: add kernel entry handling for lazy switch
>   RPAL: rebuild receiver state
>   RPAL: resume cpumask when fork
>   RPAL: critical section optimization
>   RPAL: add MPK initialization and interface
>   RPAL: enable MPK support
>   RPAL: add epoll support
>   RPAL: add rpal_uds_fdmap() support
>   RPAL: fix race condition in pkru update
>   RPAL: fix pkru setup when fork
>   RPAL: add receiver waker
>   RPAL: fix unknown nmi on AMD CPU
>   RPAL: enable time slice correction
>   RPAL: enable fast epoll wait
>   samples/rpal: add RPAL samples
>
>  arch/x86/Kbuild                              |    2 +
>  arch/x86/Kconfig                             |    2 +
>  arch/x86/entry/entry_64.S                    |  160 ++
>  arch/x86/events/amd/core.c                   |   14 +
>  arch/x86/include/asm/pgtable.h               |   25 +
>  arch/x86/include/asm/pgtable_types.h         |   11 +
>  arch/x86/include/asm/tlbflush.h              |   10 +
>  arch/x86/kernel/asm-offsets.c                |    3 +
>  arch/x86/kernel/cpu/common.c                 |    8 +-
>  arch/x86/kernel/fpu/core.c                   |    8 +-
>  arch/x86/kernel/nmi.c                        |   20 +
>  arch/x86/kernel/process.c                    |   25 +-
>  arch/x86/kernel/process_64.c                 |  118 +
>  arch/x86/mm/fault.c                          |  271 ++
>  arch/x86/mm/mmap.c                           |   10 +
>  arch/x86/mm/tlb.c                            |  172 ++
>  arch/x86/rpal/Kconfig                        |   21 +
>  arch/x86/rpal/Makefile                       |    6 +
>  arch/x86/rpal/core.c                         |  477 ++++
>  arch/x86/rpal/internal.h                     |   69 +
>  arch/x86/rpal/mm.c                           |  426 +++
>  arch/x86/rpal/pku.c                          |  196 ++
>  arch/x86/rpal/proc.c                         |  279 ++
>  arch/x86/rpal/service.c                      |  776 ++++++
>  arch/x86/rpal/thread.c                       |  313 +++
>  fs/binfmt_elf.c                              |   98 +-
>  fs/eventpoll.c                               |  320 +++
>  fs/exec.c                                    |   11 +
>  include/linux/mm_types.h                     |    3 +
>  include/linux/rpal.h                         |  633 +++++
>  include/linux/sched.h                        |   21 +
>  init/init_task.c                             |    6 +
>  kernel/exit.c                                |    5 +
>  kernel/fork.c                                |   32 +
>  kernel/sched/core.c                          |  676 +++++
>  kernel/sched/fair.c                          |  109 +
>  kernel/sched/sched.h                         |    8 +
>  mm/mmap.c                                    |   16 +
>  mm/mprotect.c                                |  106 +
>  mm/rmap.c                                    |    4 +
>  mm/vma.c                                     |   18 +
>  samples/rpal/Makefile                        |   17 +
>  samples/rpal/asm_define.c                    |   14 +
>  samples/rpal/client.c                        |  178 ++
>  samples/rpal/librpal/asm_define.h            |    6 +
>  samples/rpal/librpal/asm_x86_64_rpal_call.S  |   57 +
>  samples/rpal/librpal/debug.h                 |   12 +
>  samples/rpal/librpal/fiber.c                 |  119 +
>  samples/rpal/librpal/fiber.h                 |   64 +
>  .../rpal/librpal/jump_x86_64_sysv_elf_gas.S  |   81 +
>  .../rpal/librpal/make_x86_64_sysv_elf_gas.S  |   82 +
>  .../rpal/librpal/ontop_x86_64_sysv_elf_gas.S |   84 +
>  samples/rpal/librpal/private.h               |  341 +++
>  samples/rpal/librpal/rpal.c                  | 2351 +++++++++++++++++
>  samples/rpal/librpal/rpal.h                  |  149 ++
>  samples/rpal/librpal/rpal_pkru.h             |   78 +
>  samples/rpal/librpal/rpal_queue.c            |  239 ++
>  samples/rpal/librpal/rpal_queue.h            |   55 +
>  samples/rpal/librpal/rpal_x86_64_call_ret.S  |   45 +
>  samples/rpal/offset.sh                       |    5 +
>  samples/rpal/server.c                        |  249 ++
>  61 files changed, 9710 insertions(+), 4 deletions(-)
>  create mode 100644 arch/x86/rpal/Kconfig
>  create mode 100644 arch/x86/rpal/Makefile
>  create mode 100644 arch/x86/rpal/core.c
>  create mode 100644 arch/x86/rpal/internal.h
>  create mode 100644 arch/x86/rpal/mm.c
>  create mode 100644 arch/x86/rpal/pku.c
>  create mode 100644 arch/x86/rpal/proc.c
>  create mode 100644 arch/x86/rpal/service.c
>  create mode 100644 arch/x86/rpal/thread.c
>  create mode 100644 include/linux/rpal.h
>  create mode 100644 samples/rpal/Makefile
>  create mode 100644 samples/rpal/asm_define.c
>  create mode 100644 samples/rpal/client.c
>  create mode 100644 samples/rpal/librpal/asm_define.h
>  create mode 100644 samples/rpal/librpal/asm_x86_64_rpal_call.S
>  create mode 100644 samples/rpal/librpal/debug.h
>  create mode 100644 samples/rpal/librpal/fiber.c
>  create mode 100644 samples/rpal/librpal/fiber.h
>  create mode 100644 samples/rpal/librpal/jump_x86_64_sysv_elf_gas.S
>  create mode 100644 samples/rpal/librpal/make_x86_64_sysv_elf_gas.S
>  create mode 100644 samples/rpal/librpal/ontop_x86_64_sysv_elf_gas.S
>  create mode 100644 samples/rpal/librpal/private.h
>  create mode 100644 samples/rpal/librpal/rpal.c
>  create mode 100644 samples/rpal/librpal/rpal.h
>  create mode 100644 samples/rpal/librpal/rpal_pkru.h
>  create mode 100644 samples/rpal/librpal/rpal_queue.c
>  create mode 100644 samples/rpal/librpal/rpal_queue.h
>  create mode 100644 samples/rpal/librpal/rpal_x86_64_call_ret.S
>  create mode 100755 samples/rpal/offset.sh
>  create mode 100644 samples/rpal/server.c
>
> --
> 2.20.1
>