On Tue 08-07-25 21:08:00, Baokun Li wrote:
> Sorry for getting to this so late – I've been totally overloaded
> with stuff recently.
>
> Anyway, back to what we were discussing. I managed to test
> the performance difference between READ_ONCE / WRITE_ONCE and
> smp_load_acquire / smp_store_release on an ARM64 server.
> Here's the results:
>
> CPU: Kunpeng 920
> Memory: 512GB
> Disk: 960GB SSD (~500M/s)
>
>         | mb_optimize_scan  |       0        |       1        |
>         |-------------------|----------------|----------------|
>         | Num. containers   |  P80  |   P1   |  P80  |   P1   |
> --------|-------------------|-------|--------|-------|--------|
>         | acquire/release   | 9899  | 290260 | 5005  | 307361 |
>  single | [READ|WRITE]_ONCE | 9636  | 337597 | 4834  | 341440 |
>  goal   |-------------------|-------|--------|-------|--------|
>         |                   | -2.6% | +16.3% | -3.4% | +11.0% |
> --------|-------------------|-------|--------|-------|--------|
>         | acquire/release   | 19931 | 290348 | 7365  | 311717 |
>  multi  | [READ|WRITE]_ONCE | 19628 | 320885 | 7129  | 321275 |
>  goal   |-------------------|-------|--------|-------|--------|
>         |                   | -1.5% | +10.5% | -3.2% | +3.0%  |
>
> So, my tests show that READ_ONCE / WRITE_ONCE gives us better
> single-threaded performance. That's because it skips the mandatory
> CPU-to-CPU syncing. This also helps explain why x86 has double the
> disk bandwidth (~1000MB/s) of Arm64, but surprisingly, single
> containers run much worse on x86.

Interesting! Thanks for measuring the data!

> However, in multi-threaded scenarios, not consistently reading
> the latest goal has these implications:
>
> * ext4_get_group_info() calls increase, as ext4_mb_good_group_nolock()
>   is invoked more often on incorrect groups.
>
> * ext4_mb_load_buddy() calls increase due to repeated group accesses
>   leading to more folio_mark_accessed calls.
>
> * ext4_mb_prefetch() calls increase with more frequent prefetch_grp
>   access. (I suspect the current mb_prefetch mechanism has some inherent
>   issues we could optimize later.)
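
For anyone following along, here is a rough userspace sketch of the two
variants being compared, written with C11 atomics rather than the kernel
primitives. The struct and function names are made up for illustration, this
is not the ext4 code, and relaxed atomics are only an approximation of
READ_ONCE / WRITE_ONCE:

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for the shared allocation goal; not the ext4 struct. */
struct goal_state {
	_Atomic unsigned int goal_group;
};

/*
 * Relaxed accesses: roughly what READ_ONCE() / WRITE_ONCE() provide.
 * No tearing, but no ordering against surrounding loads and stores.
 */
static unsigned int goal_load_relaxed(struct goal_state *s)
{
	return atomic_load_explicit(&s->goal_group, memory_order_relaxed);
}

static void goal_store_relaxed(struct goal_state *s, unsigned int grp)
{
	atomic_store_explicit(&s->goal_group, grp, memory_order_relaxed);
}

/*
 * Acquire/release accesses: roughly smp_load_acquire() /
 * smp_store_release(). Stores made before the release store are visible
 * to a thread whose acquire load observes the stored value.
 */
static unsigned int goal_load_acquire(struct goal_state *s)
{
	return atomic_load_explicit(&s->goal_group, memory_order_acquire);
}

static void goal_store_release(struct goal_state *s, unsigned int grp)
{
	atomic_store_explicit(&s->goal_group, grp, memory_order_release);
}

/* Single-threaded round trip through both variants; returns the last goal read. */
unsigned int goal_demo(void)
{
	struct goal_state s;

	atomic_init(&s.goal_group, 0);
	goal_store_relaxed(&s, 7);
	assert(goal_load_relaxed(&s) == 7);
	goal_store_release(&s, 42);
	return goal_load_acquire(&s);
}
```

Single-threaded, both variants read back the same values. They differ only
in the ordering they guarantee and, on arm64, typically in the instructions
emitted (ldar/stlr instead of plain ldr/str), which is where the extra cost
in the single-container numbers would come from.
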
>
> At this point, I believe either approach is acceptable.
>
> What are your thoughts?

Yes, apparently both approaches have their pros and cons. I'm actually
surprised the impact of additional barriers on ARM is so big for the single
container case. A 10% gain for the single container case looks nice; OTOH
realistic workloads will have more containers, so maybe that's not worth
optimizing for. Ted, do you have any opinion?

								Honza
-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR