On 7/28/2025 12:38 PM, David Laight wrote:
>>> ...
>>>
>>> Or just write a byte copy loop in C with (eg) barrier() inside it
>>> to stop gcc converting it to memcpy().
>>>
>>> 	David
>>
>> Great. It's rep movsb without any of the performance.
>
> And without the massive setup overhead that dominates short copies.
> Given the rest of the code I'm sure a byte copy loop won't make
> any difference to the overall performance.
>

Wouldn't it be better to introduce a generic mechanism than something
customized for this scenario?

PeterZ had suggested that inline memcpy could have more usages:
https://lore.kernel.org/lkml/20241029113611.GS14555@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/

Is there a concern that the inline versions might get optimized into
standard memcpy/memset calls by GCC? Wouldn't the volatile keyword
prevent that?

static __always_inline void *__inline_memcpy(void *to, const void *from, size_t len)
{
	void *ret = to;

	asm volatile("rep movsb"
		     : "+D" (to), "+S" (from), "+c" (len)
		     : : "memory");
	return ret;
}

static __always_inline void *__inline_memset(void *s, int v, size_t n)
{
	void *ret = s;

	asm volatile("rep stosb"
		     : "+D" (s), "+c" (n)
		     : "a" ((uint8_t)v)
		     : "memory");
	return ret;
}
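
For comparison, the byte copy loop David describes might look roughly
like the sketch below. This is only an illustration and not code from
the series: byte_copy() is a hypothetical name, and barrier() is
defined locally as a userspace stand-in for the kernel macro (an empty
asm statement with a "memory" clobber) so the snippet builds on its
own. The assumption is that the opaque asm in the loop body is enough
to keep GCC's loop idiom recognition from replacing the loop with a
memcpy() call.

```c
#include <stddef.h>

/*
 * Userspace stand-in for the kernel's barrier(): an empty asm statement
 * with a "memory" clobber. The compiler must assume it may read or
 * write any memory, so it cannot reason across it.
 */
#define barrier() __asm__ __volatile__("" : : : "memory")

/* Hypothetical helper: copy len bytes one at a time, returning to. */
static inline void *byte_copy(void *to, const void *from, size_t len)
{
	unsigned char *d = to;
	const unsigned char *s = from;

	while (len--) {
		*d++ = *s++;
		/*
		 * Assumed to be sufficient to stop the compiler from
		 * recognising the loop as a memcpy() pattern.
		 */
		barrier();
	}
	return to;
}
```

Unlike the rep movsb version, this has no string-instruction setup
cost, which is the trade-off being argued for short copies above.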