Re: [RFC v2 00/35] optimize cost of inter-process communication

Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> · Fri, 30 May 2025 15:42:50 -0700

On Fri, 30 May 2025 17:27:28 +0800 Bo Li <libo.gcs85@xxxxxxxxxxxxx> wrote:

> During testing, the client transmitted 1 million 32-byte messages, and we
> computed the per-message average latency. The results are as follows:
> 
> *****************
> Without RPAL: Message length: 32 bytes, Total TSC cycles: 19616222534,
>  Message count: 1000000, Average latency: 19616 cycles
> With RPAL: Message length: 32 bytes, Total TSC cycles: 1703459326,
>  Message count: 1000000, Average latency: 1703 cycles
> *****************
> 
> These results confirm that RPAL delivers substantial latency improvements
> over the current epoll implementation—achieving a 17,913-cycle reduction
> (an ~91.3% improvement) for 32-byte messages.
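
For anyone checking the arithmetic in the quoted results, here is a minimal
sketch that recomputes the averages and the reduction from the reported
totals. The totals and message count are copied from the quote; the variable
and function names are illustrative only, and the use of integer division is
an assumption that happens to reproduce the reported averages.

```python
# Figures copied verbatim from the quoted benchmark output.
MESSAGE_COUNT = 1_000_000
TOTAL_CYCLES_EPOLL = 19_616_222_534  # "Without RPAL"
TOTAL_CYCLES_RPAL = 1_703_459_326    # "With RPAL"


def average_latency(total_cycles: int, count: int) -> int:
    """Per-message latency in whole TSC cycles (integer division is assumed)."""
    return total_cycles // count


avg_epoll = average_latency(TOTAL_CYCLES_EPOLL, MESSAGE_COUNT)
avg_rpal = average_latency(TOTAL_CYCLES_RPAL, MESSAGE_COUNT)

# Absolute and relative reduction in per-message latency.
reduction = avg_epoll - avg_rpal
reduction_pct = 100.0 * reduction / avg_epoll

# The same result expressed as a speedup factor.
speedup = avg_epoll / avg_rpal

print(f"epoll: {avg_epoll} cycles, RPAL: {avg_rpal} cycles")
print(f"reduction: {reduction} cycles ({reduction_pct:.1f}%), speedup: {speedup:.1f}x")
```

The 17,913-cycle and ~91.3% figures follow directly from the two averages;
put another way, the RPAL path is roughly 11.5 times faster per message in
this test.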

Noted ;)

Quick question:

>  arch/x86/Kbuild                               |    2 +
>  arch/x86/Kconfig                              |    2 +
>  arch/x86/entry/entry_64.S                     |  160 ++
>  arch/x86/events/amd/core.c                    |   14 +
>  arch/x86/include/asm/pgtable.h                |   25 +
>  arch/x86/include/asm/pgtable_types.h          |   11 +
>  arch/x86/include/asm/tlbflush.h               |   10 +
>  arch/x86/kernel/asm-offsets.c                 |    3 +
>  arch/x86/kernel/cpu/common.c                  |    8 +-
>  arch/x86/kernel/fpu/core.c                    |    8 +-
>  arch/x86/kernel/nmi.c                         |   20 +
>  arch/x86/kernel/process.c                     |   25 +-
>  arch/x86/kernel/process_64.c                  |  118 +
>  arch/x86/mm/fault.c                           |  271 ++
>  arch/x86/mm/mmap.c                            |   10 +
>  arch/x86/mm/tlb.c                             |  172 ++
>  arch/x86/rpal/Kconfig                         |   21 +
>  arch/x86/rpal/Makefile                        |    6 +
>  arch/x86/rpal/core.c                          |  477 ++++
>  arch/x86/rpal/internal.h                      |   69 +
>  arch/x86/rpal/mm.c                            |  426 +++
>  arch/x86/rpal/pku.c                           |  196 ++
>  arch/x86/rpal/proc.c                          |  279 ++
>  arch/x86/rpal/service.c                       |  776 ++++++
>  arch/x86/rpal/thread.c                        |  313 +++

The changes are very x86-heavy.  Is that a necessary thing?  Would
another architecture need to implement a similar amount to enable RPAL?
IOW, how much of the above could be made arch-neutral?