On 06/05/2025 12:29, Jan Kara wrote: > On Tue 06-05-25 10:28:11, Ryan Roberts wrote: >> On 05/05/2025 10:37, Jan Kara wrote: >>> On Mon 05-05-25 11:13:26, Jan Kara wrote: >>>> On Wed 30-04-25 15:59:15, Ryan Roberts wrote: >>>>> Previously asynchonous readahead would read ra_pages (usually 128K) >>>>> directly after the end of the synchonous readahead and given the >>>>> synchronous readahead portion had no alignment guarantees (beyond page >>>>> boundaries) it is possible (and likely) that the end of the initial 128K >>>>> region would not fall on a natural boundary for the folio size being >>>>> used. Therefore smaller folios were used to align down to the required >>>>> boundary, both at the end of the previous readahead block and at the >>>>> start of the new one. >>>>> >>>>> In the worst cases, this can result in never properly ramping up the >>>>> folio size, and instead getting stuck oscillating between order-0, -1 >>>>> and -2 folios. The next readahead will try to use folios whose order is >>>>> +2 bigger than the folio that had the readahead marker. But because of >>>>> the alignment requirements, that folio (the first one in the readahead >>>>> block) can end up being order-0 in some cases. >>>>> >>>>> There will be 2 modifications to solve this issue: >>>>> >>>>> 1) Calculate the readahead size so the end is aligned to a folio >>>>> boundary. This prevents needing to allocate small folios to align >>>>> down at the end of the window and fixes the oscillation problem. >>>>> >>>>> 2) Remember the "preferred folio order" in the ra state instead of >>>>> inferring it from the folio with the readahead marker. This solves >>>>> the slow ramp up problem (discussed in a subsequent patch). >>>>> >>>>> This patch addresses (1) only. A subsequent patch will address (2). >>>>> >>>>> Worked example: >>>>> >>>>> The following shows the previous pathalogical behaviour when the initial >>>>> synchronous readahead is unaligned. We start reading at page 17 in the >>>>> file and read sequentially from there. I'm showing a dump of the pages >>>>> in the page cache just after we read the first page of the folio with >>>>> the readahead marker. >>> >>> <snip> >>> >>>> Looks good. When I was reading this code some time ago, I also felt we >>>> should rather do some rounding instead of creating small folios so thanks >>>> for working on this. Feel free to add: >>>> >>>> Reviewed-by: Jan Kara <jack@xxxxxxx> >>> >>> But now I've also remembered why what you do here isn't an obvious win. >>> There are storage devices (mostly RAID arrays) where optimum read size >>> isn't a power of 2. Think for example a RAID-0 device composed from three >>> disks. It will have max_pages something like 384 (512k * 3). Suppose we are >>> on x86 and max_order is 9. Then previously (if we were lucky with >>> alignment) we were alternating between order 7 and order 8 pages in the >>> page cache and do optimally sized IOs od 1536k. >> >> Sorry I'm struggling to follow some of this, perhaps my superficial >> understanding of all the readahead subtleties is starting to show... >> >> How is the 384 figure provided? I'd guess that comes from bdi->io_pages, and >> bdi->ra_pages would remain the usual 32 (128K)? > > Sorry, I have been probably too brief in my previous message :) > bdi->ra_pages is actually set based on optimal IO size reported by the > hardware (see blk_apply_bdi_limits() and how its callers are filling in > lim->io_opt). The 128K you speak about is just a last-resort value if > hardware doesn't provide one. And some storage devices do report optimal IO > size that is not power of two. Ahh, got it - thanks for the education! > > Also note that bdi->ra_pages can be tuned in sysfs and a lot of users > actually do this (usually from their udev rules). We don't have to perform > well when some odd value gets set but you definitely cannot assume > bdi->ra_pages is 128K :). > >> In which case, for mmap, won't >> we continue to be limited by ra_pages and will never get beyond order-5? (for >> mmap req_size is always set to ra_pages IIRC, so ractl_max_pages() always just >> returns ra_pages). Or perhaps ra_pages is set to 384 somewhere, but I'm not >> spotting it in the code... >> >> I guess you are also implicitly teaching me something about how the block layer >> works here too... if there are 2 read requests for an order-7 and order-8, then >> the block layer will merge those to a single read (upto the 384 optimal size?) > > Correct. In fact readahead code will already perform this merging when > submitting the IO. > >> but if there are 2 reads of order-8 then it won't merge because it would be >> bigger than the optimal size and it won't split the second one at the optimal >> size either? Have I inferred that correctly? > > With the code as you modify it, you would round down ra->size from 384 to > 256 and submit only one 1MB sized IO (with one order-8 page). And this will > cause regression in read throughput for such devices because they now don't > get buffer large enough to run at full speed. Ahha, yes, thanks - now it's clicking. > >>> Now you will allocate all >>> folios of order 8 (nice) but reads will be just 1024k and you'll see >>> noticeable drop in read throughput (not nice). Note that this is not just a >>> theoretical example but a real case we have hit when doing performance >>> testing of servers and for which I was tweaking readahead code in the past. >>> >>> So I think we need to tweak this logic a bit. Perhaps we should round_down >>> end to the minimum alignment dictated by 'order' and maxpages? Like: >>> >>> 1 << min(order, ffs(max_pages) + PAGE_SHIFT - 1) >> >> Sorry I'm staring at this and struggling to understand the "PAGE_SHIFT - >> 1" part? > > My bad. It should have been: > > 1 << min(order, ffs(max_pages) - 1) > >> I think what you are suggesting is that the patch becomes something like >> this: >> >> ---8<--- >> + end = ra->start + ra->size; >> + aligned_end = round_down(end, 1UL << min(order, ilog2(max_pages))); > > Not quite. ilog2() returns the most significant bit set but we really want > to align to the least significant bit set. So when max_pages is 384, we > want to align to at most order-7 (aligning the end more does not make sense > when you want to do IO 384 pages large). That's why I'm using ffs() and not > ilog2(). Yep got it now. > >> + if (aligned_end > ra->start) >> + ra->size -= end - aligned_end; >> + ra->async_size = ra->size; >> ---8<--- >> >> So if max_pages=384, then aligned_end will be aligned down to a maximum >> of the previous 1MB boundary? > > No, it needs to be aligned only to previous 512K boundary because we want > to do IOs 3*512K large. > > Hope things are a bit clearer now :) Yes, much! Thanks, Ryan > > Honza