On 11/07/2025 17:08, David Hildenbrand wrote:
> CCing Ryan, who recently fiddled with readahead.
>
>
> On 11.07.25 07:55, Youling Tang wrote:
>> From: Youling Tang <tangyouling@xxxxxxxxxx>
>>
>> On XFS systems with pagesize=4K, blocksize=16K, and
>> CONFIG_TRANSPARENT_HUGEPAGE enabled, we observed the following
>> readahead behaviors:
>>
>> # echo 3 > /proc/sys/vm/drop_caches
>> # dd if=test of=/dev/null bs=64k count=1
>> # ./tools/mm/page-types -r -L -f /mnt/xfs/test
>> foffset  offset  flags
>> 0        136d4c  __RU_l_________H______t_________________F_1
>> 1        136d4d  __RU_l__________T_____t_________________F_1
>> 2        136d4e  __RU_l__________T_____t_________________F_1
>> 3        136d4f  __RU_l__________T_____t_________________F_1
>> ...
>> c        136bb8  __RU_l_________H______t_________________F_1
>> d        136bb9  __RU_l__________T_____t_________________F_1
>> e        136bba  __RU_l__________T_____t_________________F_1
>> f        136bbb  __RU_l__________T_____t_________________F_1 <-- first read
>> 10       13c2cc  ___U_l_________H______t______________I__F_1 <-- readahead flag
>> 11       13c2cd  ___U_l__________T_____t______________I__F_1
>> 12       13c2ce  ___U_l__________T_____t______________I__F_1
>> 13       13c2cf  ___U_l__________T_____t______________I__F_1
>> ...
>> 1c       1405d4  ___U_l_________H______t_________________F_1
>> 1d       1405d5  ___U_l__________T_____t_________________F_1
>> 1e       1405d6  ___U_l__________T_____t_________________F_1
>> 1f       1405d7  ___U_l__________T_____t_________________F_1
>> [ra_size = 32, req_count = 16, async_size = 16]
>>
>> # echo 3 > /proc/sys/vm/drop_caches
>> # dd if=test of=/dev/null bs=60k count=1
>> # ./page-types -r -L -f /mnt/xfs/test
>> foffset  offset  flags
>> 0        136048  __RU_l_________H______t_________________F_1
>> ...
>> c        110a40  __RU_l_________H______t_________________F_1
>> d        110a41  __RU_l__________T_____t_________________F_1
>> e        110a42  __RU_l__________T_____t_________________F_1 <-- first read
>> f        110a43  __RU_l__________T_____t_________________F_1 <-- first readahead flag
>> 10       13e7a8  ___U_l_________H______t_________________F_1
>> ...
>> 20       137a00  ___U_l_________H______t_______P______I__F_1 <-- second readahead flag (20 - 2f)
>> 21       137a01  ___U_l__________T_____t_______P______I__F_1
>> ...
>> 3f       10d4af  ___U_l__________T_____t_______P_________F_1
>> [first readahead: ra_size = 32, req_count = 15, async_size = 17]
>>
>> When reading 64k of data (the same holds for the 61-63k range, where
>> last_index is page-aligned in filemap_get_pages()), 128k of readahead
>> is triggered via page_cache_sync_ra() and the PG_readahead flag is set
>> on the next folio (the one containing the 0x10 page).
>>
>> When reading 60k of data, 128k of readahead is also triggered via
>> page_cache_sync_ra(). However, in this case the readahead flag is set
>> on the 0xf page. Although the requested read size (req_count) is 60k,
>> the actual read is aligned up to the folio size (64k), so it hits the
>> readahead flag and initiates asynchronous readahead via
>> page_cache_async_ra(). This results in two readahead operations
>> totaling 256k.
>>
>> The root cause is that asynchronous readahead is triggered whenever
>> the requested size is smaller than the actual read size (due to folio
>> alignment). By changing the last_index alignment from page size to
>> folio size, we ensure the requested size matches the actual read size,
>> preventing a single read operation from triggering two readahead
>> operations.

I recently fiddled with the mmap readahead paths, doing similar-ish
things. I haven't looked at the non-mmap paths, so I don't consider
myself an expert here. But what you are saying makes sense and
superficially the solution looks good to me, so:

Reviewed-by: Ryan Roberts <ryan.roberts@xxxxxxx>

with one nit below...
>>
>> After applying the patch:
>> # echo 3 > /proc/sys/vm/drop_caches
>> # dd if=test of=/dev/null bs=60k count=1
>> # ./page-types -r -L -f /mnt/xfs/test
>> foffset  offset  flags
>> 0        136d4c  __RU_l_________H______t_________________F_1
>> 1        136d4d  __RU_l__________T_____t_________________F_1
>> 2        136d4e  __RU_l__________T_____t_________________F_1
>> 3        136d4f  __RU_l__________T_____t_________________F_1
>> ...
>> c        136bb8  __RU_l_________H______t_________________F_1
>> d        136bb9  __RU_l__________T_____t_________________F_1
>> e        136bba  __RU_l__________T_____t_________________F_1 <-- first read
>> f        136bbb  __RU_l__________T_____t_________________F_1
>> 10       13c2cc  ___U_l_________H______t______________I__F_1 <-- readahead flag
>> 11       13c2cd  ___U_l__________T_____t______________I__F_1
>> 12       13c2ce  ___U_l__________T_____t______________I__F_1
>> 13       13c2cf  ___U_l__________T_____t______________I__F_1
>> ...
>> 1c       1405d4  ___U_l_________H______t_________________F_1
>> 1d       1405d5  ___U_l__________T_____t_________________F_1
>> 1e       1405d6  ___U_l__________T_____t_________________F_1
>> 1f       1405d7  ___U_l__________T_____t_________________F_1
>> [ra_size = 32, req_count = 16, async_size = 16]
>>
>> The same phenomenon occurs when reading anywhere from 49k to 64k: the
>> readahead flag is set on the next folio.
>>
>> Because the minimum folio order of the address_space corresponds to
>> the block size (at least in xfs and bcachefs, which already support
>> bs > ps), aligning request_count to the block size will not cause
>> overread.
>>
>> Co-developed-by: Chi Zhiling <chizhiling@xxxxxxxxxx>
>> Signed-off-by: Chi Zhiling <chizhiling@xxxxxxxxxx>
>> Signed-off-by: Youling Tang <tangyouling@xxxxxxxxxx>
>> ---
>>  include/linux/pagemap.h | 6 ++++++
>>  mm/filemap.c            | 5 +++--
>>  2 files changed, 9 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
>> index e63fbfbd5b0f..447bb264fd94 100644
>> --- a/include/linux/pagemap.h
>> +++ b/include/linux/pagemap.h
>> @@ -480,6 +480,12 @@ mapping_min_folio_nrpages(struct address_space *mapping)
>>  	return 1UL << mapping_min_folio_order(mapping);
>>  }
>>
>> +static inline unsigned long
>> +mapping_min_folio_nrbytes(struct address_space *mapping)
>> +{
>> +	return mapping_min_folio_nrpages(mapping) << PAGE_SHIFT;
>> +}
>> +
>>  /**
>>   * mapping_align_index() - Align index for this mapping.
>>   * @mapping: The address_space.
>> diff --git a/mm/filemap.c b/mm/filemap.c
>> index 765dc5ef6d5a..56a8656b6f86 100644
>> --- a/mm/filemap.c
>> +++ b/mm/filemap.c
>> @@ -2584,8 +2584,9 @@ static int filemap_get_pages(struct kiocb *iocb, size_t count,
>>  	unsigned int flags;
>>  	int err = 0;
>>
>> -	/* "last_index" is the index of the page beyond the end of the read */
>> -	last_index = DIV_ROUND_UP(iocb->ki_pos + count, PAGE_SIZE);
>> +	/* "last_index" is the index of the folio beyond the end of the read */

pedantic nit: I think you actually mean "the index of the first page
within the first minimum-sized folio beyond the end of the read"?

Thanks,
Ryan

>> +	last_index = round_up(iocb->ki_pos + count,
>> +			      mapping_min_folio_nrbytes(mapping));
>> +	last_index >>= PAGE_SHIFT;
>>  retry:
>>  	if (fatal_signal_pending(current))
>>  		return -EINTR;
>
>