On Tue 06-05-25 10:28:11, Ryan Roberts wrote:
> On 05/05/2025 10:37, Jan Kara wrote:
> > On Mon 05-05-25 11:13:26, Jan Kara wrote:
> >> On Wed 30-04-25 15:59:15, Ryan Roberts wrote:
> >>> Previously asynchronous readahead would read ra_pages (usually 128K)
> >>> directly after the end of the synchronous readahead and given the
> >>> synchronous readahead portion had no alignment guarantees (beyond page
> >>> boundaries) it is possible (and likely) that the end of the initial 128K
> >>> region would not fall on a natural boundary for the folio size being
> >>> used. Therefore smaller folios were used to align down to the required
> >>> boundary, both at the end of the previous readahead block and at the
> >>> start of the new one.
> >>>
> >>> In the worst cases, this can result in never properly ramping up the
> >>> folio size, and instead getting stuck oscillating between order-0, -1
> >>> and -2 folios. The next readahead will try to use folios whose order is
> >>> +2 bigger than the folio that had the readahead marker. But because of
> >>> the alignment requirements, that folio (the first one in the readahead
> >>> block) can end up being order-0 in some cases.
> >>>
> >>> There will be 2 modifications to solve this issue:
> >>>
> >>> 1) Calculate the readahead size so the end is aligned to a folio
> >>>    boundary. This prevents needing to allocate small folios to align
> >>>    down at the end of the window and fixes the oscillation problem.
> >>>
> >>> 2) Remember the "preferred folio order" in the ra state instead of
> >>>    inferring it from the folio with the readahead marker. This solves
> >>>    the slow ramp up problem (discussed in a subsequent patch).
> >>>
> >>> This patch addresses (1) only. A subsequent patch will address (2).
> >>>
> >>> Worked example:
> >>>
> >>> The following shows the previous pathological behaviour when the initial
> >>> synchronous readahead is unaligned. We start reading at page 17 in the
> >>> file and read sequentially from there. I'm showing a dump of the pages
> >>> in the page cache just after we read the first page of the folio with
> >>> the readahead marker.
> >
> > <snip>
> >
> >> Looks good. When I was reading this code some time ago, I also felt we
> >> should rather do some rounding instead of creating small folios, so thanks
> >> for working on this. Feel free to add:
> >>
> >> Reviewed-by: Jan Kara <jack@xxxxxxx>
> >
> > But now I've also remembered why what you do here isn't an obvious win.
> > There are storage devices (mostly RAID arrays) where the optimum read size
> > isn't a power of 2. Think for example of a RAID-0 device composed of three
> > disks. It will have max_pages something like 384 (512k * 3). Suppose we are
> > on x86 and max_order is 9. Then previously (if we were lucky with
> > alignment) we were alternating between order 7 and order 8 pages in the
> > page cache and doing optimally sized IOs of 1536k.
>
> Sorry, I'm struggling to follow some of this; perhaps my superficial
> understanding of all the readahead subtleties is starting to show...
>
> How is the 384 figure provided? I'd guess that comes from bdi->io_pages, and
> bdi->ra_pages would remain the usual 32 (128K)?

Sorry, I have probably been too brief in my previous message :) bdi->ra_pages
is actually set based on the optimal IO size reported by the hardware (see
blk_apply_bdi_limits() and how its callers are filling in lim->io_opt). The
128K you speak about is just a last-resort value if the hardware doesn't
provide one.
And some storage devices do report an optimal IO size that is not a power of
two. Also note that bdi->ra_pages can be tuned in sysfs and a lot of users
actually do this (usually from their udev rules). We don't have to perform
well when some odd value gets set, but you definitely cannot assume
bdi->ra_pages is 128K :).

> In which case, for mmap, won't
> we continue to be limited by ra_pages and will never get beyond order-5? (for
> mmap req_size is always set to ra_pages IIRC, so ractl_max_pages() always just
> returns ra_pages). Or perhaps ra_pages is set to 384 somewhere, but I'm not
> spotting it in the code...
>
> I guess you are also implicitly teaching me something about how the block layer
> works here too... if there are 2 read requests for an order-7 and order-8, then
> the block layer will merge those to a single read (up to the 384 optimal size?)

Correct. In fact the readahead code will already perform this merging when
submitting the IO.

> but if there are 2 reads of order-8 then it won't merge because it would be
> bigger than the optimal size and it won't split the second one at the optimal
> size either? Have I inferred that correctly?

With the code as you modify it, you would round down ra->size from 384 to 256
and submit only one 1MB sized IO (with one order-8 page). And this will cause
a regression in read throughput for such devices because they no longer get
buffers large enough to run at full speed.

> > Now you will allocate all
> > folios of order 8 (nice) but reads will be just 1024k and you'll see a
> > noticeable drop in read throughput (not nice). Note that this is not just a
> > theoretical example but a real case we have hit when doing performance
> > testing of servers and for which I was tweaking readahead code in the past.
> >
> > So I think we need to tweak this logic a bit. Perhaps we should round_down
> > end to the minimum alignment dictated by 'order' and maxpages? Like:
> >
> >	1 << min(order, ffs(max_pages) + PAGE_SHIFT - 1)
>
> Sorry, I'm staring at this and struggling to understand the "PAGE_SHIFT -
> 1" part?

My bad. It should have been:

	1 << min(order, ffs(max_pages) - 1)

> I think what you are suggesting is that the patch becomes something like
> this:
>
> ---8<---
> +	end = ra->start + ra->size;
> +	aligned_end = round_down(end, 1UL << min(order, ilog2(max_pages)));

Not quite. ilog2() returns the most significant bit set but we really want to
align to the least significant bit set. So when max_pages is 384, we want to
align to at most order-7 (aligning the end more does not make sense when you
want to do IO 384 pages large). That's why I'm using ffs() and not ilog2().

> +	if (aligned_end > ra->start)
> +		ra->size -= end - aligned_end;
> +	ra->async_size = ra->size;
> ---8<---
>
> So if max_pages=384, then aligned_end will be aligned down to a maximum
> of the previous 1MB boundary?

No, it needs to be aligned only to the previous 512K boundary because we want
to do IOs 3*512K large.

Hope things are a bit clearer now :)

								Honza
--
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR
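
To make the ffs() vs ilog2() difference above concrete, here is a minimal
standalone C sketch (not kernel code; my_ffs(), my_ilog2(), round_down_to()
and min_ul() are hand-rolled stand-ins for the kernel's ffs(), ilog2(),
round_down() and min(), and the start/order values are just picked for the
example) that works through the max_pages = 384 case discussed in this
thread:

---8<---
#include <stdio.h>

/* 1-based index of the least significant set bit, like the kernel's ffs(). */
static unsigned int my_ffs(unsigned long x)
{
	return x ? __builtin_ctzl(x) + 1 : 0;
}

/* Index of the most significant set bit, like the kernel's ilog2(). */
static unsigned int my_ilog2(unsigned long x)
{
	return (8 * sizeof(unsigned long) - 1) - __builtin_clzl(x);
}

/* Round x down to a multiple of align (align must be a power of two). */
static unsigned long round_down_to(unsigned long x, unsigned long align)
{
	return x & ~(align - 1);
}

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

int main(void)
{
	unsigned long max_pages = 384;	/* optimal IO size: 3 * 512K in 4K pages */
	unsigned long start = 0;	/* assume an aligned window start */
	unsigned long size = max_pages;
	unsigned long end = start + size;
	unsigned long order = 8;	/* preferred folio order for the example */

	/* ffs(384) - 1 == 7, so the alignment is capped at 128 pages (512K). */
	unsigned long a_ffs = 1UL << min_ul(order, my_ffs(max_pages) - 1);
	/* ilog2(384) == 8, so the alignment becomes 256 pages (1M). */
	unsigned long a_log = 1UL << min_ul(order, my_ilog2(max_pages));

	printf("ffs-based:   align %3lu pages -> end %lu, size %lu\n",
	       a_ffs, round_down_to(end, a_ffs),
	       round_down_to(end, a_ffs) - start);
	printf("ilog2-based: align %3lu pages -> end %lu, size %lu\n",
	       a_log, round_down_to(end, a_log),
	       round_down_to(end, a_log) - start);
	return 0;
}
---8<---

With the ffs()-based alignment the 384-page window is left intact (one 1536k
IO), while the ilog2()-based rounding trims it to 256 pages and caps the IO
at 1024k, which is exactly the regression described above.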