Hi,

On 2025-07-11 13:55:09 +0800, Youling Tang wrote:
> From: Youling Tang <tangyouling@xxxxxxxxxx>
>
> On XFS systems with pagesize=4K, blocksize=16K, and CONFIG_TRANSPARENT_HUGEPAGE
> enabled, we observed the following readahead behaviors:
> # echo 3 > /proc/sys/vm/drop_caches
> # dd if=test of=/dev/null bs=64k count=1
> # ./tools/mm/page-types -r -L -f /mnt/xfs/test
> foffset  offset  flags
> 0        136d4c  __RU_l_________H______t_________________F_1
> 1        136d4d  __RU_l__________T_____t_________________F_1
> 2        136d4e  __RU_l__________T_____t_________________F_1
> 3        136d4f  __RU_l__________T_____t_________________F_1
> ...
> c        136bb8  __RU_l_________H______t_________________F_1
> d        136bb9  __RU_l__________T_____t_________________F_1
> e        136bba  __RU_l__________T_____t_________________F_1
> f        136bbb  __RU_l__________T_____t_________________F_1  <-- first read
> 10       13c2cc  ___U_l_________H______t______________I__F_1  <-- readahead flag
> 11       13c2cd  ___U_l__________T_____t______________I__F_1
> 12       13c2ce  ___U_l__________T_____t______________I__F_1
> 13       13c2cf  ___U_l__________T_____t______________I__F_1
> ...
> 1c       1405d4  ___U_l_________H______t_________________F_1
> 1d       1405d5  ___U_l__________T_____t_________________F_1
> 1e       1405d6  ___U_l__________T_____t_________________F_1
> 1f       1405d7  ___U_l__________T_____t_________________F_1
> [ra_size = 32, req_count = 16, async_size = 16]
>
> # echo 3 > /proc/sys/vm/drop_caches
> # dd if=test of=/dev/null bs=60k count=1
> # ./page-types -r -L -f /mnt/xfs/test
> foffset  offset  flags
> 0        136048  __RU_l_________H______t_________________F_1
> ...
> c        110a40  __RU_l_________H______t_________________F_1
> d        110a41  __RU_l__________T_____t_________________F_1
> e        110a42  __RU_l__________T_____t_________________F_1  <-- first read
> f        110a43  __RU_l__________T_____t_________________F_1  <-- first readahead flag
> 10       13e7a8  ___U_l_________H______t_________________F_1
> ...
> 20       137a00  ___U_l_________H______t_______P______I__F_1  <-- second readahead flag (20 - 2f)
> 21       137a01  ___U_l__________T_____t_______P______I__F_1
> ...
> 3f       10d4af  ___U_l__________T_____t_______P_________F_1
> [first readahead: ra_size = 32, req_count = 15, async_size = 17]
>
> When reading 64k of data (the same holds for the 61-63k range, where last_index
> is page-aligned in filemap_get_pages()), 128k of readahead is triggered via
> page_cache_sync_ra() and the PG_readahead flag is set on the next folio (the
> one containing the 0x10 page).
>
> When reading 60k of data, 128k of readahead is also triggered via
> page_cache_sync_ra(). However, in this case the readahead flag is set on the
> 0xf page. Although the requested read size (req_count) is 60k, the actual read
> is aligned up to the folio size (64k), which hits the readahead flag and
> initiates asynchronous readahead via page_cache_async_ra(). This results in
> two readahead operations totaling 256k.
>
> The root cause is that asynchronous readahead is triggered whenever the
> requested size is smaller than the actual read size (due to folio alignment).
> By changing the last_index alignment from page size to folio size, we ensure
> the requested size matches the actual read size, preventing a single read
> operation from triggering two readahead operations.
>
> After applying the patch:
> # echo 3 > /proc/sys/vm/drop_caches
> # dd if=test of=/dev/null bs=60k count=1
> # ./page-types -r -L -f /mnt/xfs/test
> foffset  offset  flags
> 0        136d4c  __RU_l_________H______t_________________F_1
> 1        136d4d  __RU_l__________T_____t_________________F_1
> 2        136d4e  __RU_l__________T_____t_________________F_1
> 3        136d4f  __RU_l__________T_____t_________________F_1
> ...
> c        136bb8  __RU_l_________H______t_________________F_1
> d        136bb9  __RU_l__________T_____t_________________F_1
> e        136bba  __RU_l__________T_____t_________________F_1
> f        136bbb  __RU_l__________T_____t_________________F_1  <-- first read
> 10       13c2cc  ___U_l_________H______t______________I__F_1  <-- readahead flag
> 11       13c2cd  ___U_l__________T_____t______________I__F_1
> 12       13c2ce  ___U_l__________T_____t______________I__F_1
> 13       13c2cf  ___U_l__________T_____t______________I__F_1
> ...
> 1c       1405d4  ___U_l_________H______t_________________F_1
> 1d       1405d5  ___U_l__________T_____t_________________F_1
> 1e       1405d6  ___U_l__________T_____t_________________F_1
> 1f       1405d7  ___U_l__________T_____t_________________F_1
> [ra_size = 32, req_count = 16, async_size = 16]
>
> The same phenomenon occurs when reading from 49k to 64k: the readahead flag
> is now set on the next folio.
>
> Because the minimum folio size in the address_space equals the block size (at
> least in xfs and bcachefs, which already support bs > ps), aligning req_count
> to the block size will not cause overread.
>
> Co-developed-by: Chi Zhiling <chizhiling@xxxxxxxxxx>
> Signed-off-by: Chi Zhiling <chizhiling@xxxxxxxxxx>
> Signed-off-by: Youling Tang <tangyouling@xxxxxxxxxx>

I bisected boot failures on two of my 32-bit systems to this patch.

> ---
>  include/linux/pagemap.h | 6 ++++++
>  mm/filemap.c            | 5 +++--
>  2 files changed, 9 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index e63fbfbd5b0f..447bb264fd94 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -480,6 +480,12 @@ mapping_min_folio_nrpages(struct address_space *mapping)
>  	return 1UL << mapping_min_folio_order(mapping);
>  }
>  
> +static inline unsigned long
> +mapping_min_folio_nrbytes(struct address_space *mapping)
> +{
> +	return mapping_min_folio_nrpages(mapping) << PAGE_SHIFT;
> +}
> +
>  /**
>   * mapping_align_index() - Align index for this mapping.
>   * @mapping: The address_space.
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 765dc5ef6d5a..56a8656b6f86 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2584,8 +2584,9 @@ static int filemap_get_pages(struct kiocb *iocb, size_t count,
>  	unsigned int flags;
>  	int err = 0;
>  
> -	/* "last_index" is the index of the page beyond the end of the read */
> -	last_index = DIV_ROUND_UP(iocb->ki_pos + count, PAGE_SIZE);
> +	/* "last_index" is the index of the folio beyond the end of the read */
> +	last_index = round_up(iocb->ki_pos + count, mapping_min_folio_nrbytes(mapping));

iocb->ki_pos is loff_t (long long) while pgoff_t is unsigned long, so on
32-bit the 64-bit result of round_up() is truncated when it is assigned to
last_index, before the shift. This overflow happens in practice and results
in last_index being before index. The following diff resolves the issue for
me:

diff --git a/mm/filemap.c b/mm/filemap.c
index 3c071307f40e..d2902be0b845 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2585,8 +2585,8 @@ static int filemap_get_pages(struct kiocb *iocb, size_t count,
 	int err = 0;
 
 	/* "last_index" is the index of the folio beyond the end of the read */
-	last_index = round_up(iocb->ki_pos + count, mapping_min_folio_nrbytes(mapping));
-	last_index >>= PAGE_SHIFT;
+	last_index = round_up(iocb->ki_pos + count,
+			      mapping_min_folio_nrbytes(mapping)) >> PAGE_SHIFT;
 retry:
 	if (fatal_signal_pending(current))
 		return -EINTR;

Regards,
Klara Modin

> +	last_index >>= PAGE_SHIFT;
>  retry:
>  	if (fatal_signal_pending(current))
>  		return -EINTR;
> -- 
> 2.34.1
>