On Mon, Jun 02, 2025 at 05:16:42PM +0100, Richard Earnshaw (lists) via Gcc-help wrote:
> On 02/06/2025 15:49, Jonathan Wakely wrote:
> > On Mon, 2 Jun 2025 at 14:24, Richard Earnshaw (lists) wrote:
> >> $ /work/rearnsha/scratch/gnu/gcc/aarch64/master/gcc/xgcc -B /work/rearnsha/scratch/gnu/gcc/aarch64/master/gcc/ -I ~/gnusrc/newlib/master/newlib/libc/include/ -O2 -march=armv8-a+mops -o - -S /tmp/mem.c
> >>         .arch armv8-a+mops
> >> f:
> >>         cpyfp   [x0]!, [x1]!, x2!
> >>         cpyfm   [x0]!, [x1]!, x2!
> >>         cpyfe   [x0]!, [x1]!, x2!
> >>         ret
> >
> > Ah, thanks for the correction!
> >
> > For x86_64 both gcc and clang emit a call to memcpy:
> >
> > https://godbolt.org/z/hGvbM4df8
>
> AArch64 would do as well if you don't have the MOPS extension.  As I
> said, the limit, if any, is a target implementation choice; it's
> generally driven by the amount of code bloat that picking the best
> strategy would require.

Same on Power.  On most architectures it is possible to write faster
memcpy routines if you can spend as much code on them as you want.  But
with something like MOPS on Arm you need just a few small insns, and on
e.g. more embedded targets there is no way to go faster than a loop
that does a word per cycle, and you can write pretty good code for that
(you can then implement the libc memcpy() as just a __builtin_memcpy(),
great fun!)


Segher
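
For illustration, here is a minimal sketch of the kind of "word per
iteration" copy loop mentioned above.  It is not taken from the thread
or from any libc: the name word_copy is hypothetical, and it assumes
GCC or Clang, where a __builtin_memcpy() with a small constant size is
normally expanded inline into a plain load or store rather than a
library call.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from the thread): copy n bytes from src to
   dst, moving one machine word per loop iteration where possible.
   The regions must not overlap, as with memcpy(). */
void *word_copy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Whole words first.  Going through a local variable with a
       constant-size __builtin_memcpy() sidesteps alignment and
       aliasing problems; assumption: the compiler turns each of these
       into a single load or store. */
    while (n >= sizeof(uintptr_t)) {
        uintptr_t w;
        __builtin_memcpy(&w, s, sizeof w);
        __builtin_memcpy(d, &w, sizeof w);
        d += sizeof w;
        s += sizeof w;
        n -= sizeof w;
    }

    /* Then the remaining tail, one byte at a time. */
    while (n > 0) {
        *d++ = *s++;
        n--;
    }

    return dst;
}
```

Whether this is actually the fastest choice depends on the target, as
discussed above.  Note also that an optimizing compiler may recognize
such a loop and replace it with a call to memcpy, which matters if you
use it to implement memcpy itself.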