On Tue, Aug 05, 2025 at 03:33:49PM +0200, David Hildenbrand wrote:
> > David, there is another alternative to prevent this, simple though a
> > bit wasteful: just allocate a bit bigger to ensure the allocation
> > doesn't end on an exact PAGE_SIZE boundary?
>
> :/ in particular doing that through the memblock in sparse_init_nid(), I am
> not so sure that's a good idea.

It would probably be some work to make larger allocations to avoid
padding :\

> I prefer Linus' proposal, and it avoids the one nth_page(), unless any
> other approach can help us get rid of more nth_page() usage -- and I
> don't think your proposal could, right?

If the above were solved - so the struct page allocations could be
larger than a section, arguably just the entire range being plugged -
then I think you also solve the nth_page() problem too.

Effectively the nth_page() problem is that we allocate the struct page
arrays on an arbitrary section-by-section basis, and then the arch sets
MAX_ORDER so that a folio can cross sections, effectively guaranteeing
to virtually fragment the page *'s inside folios.

Doing a giant vmalloc at the start, so you could also cheaply add some
padding, would effectively also prevent the nth_page() problem, as we
can reasonably say that no folio should extend past an entire memory
region.

Maybe there is some reason we can't do a giant vmalloc on these systems
that also can't do SPARSEMEM_VMEMMAP :\ But perhaps we could get up to
MAX_ORDER at least? Or perhaps we could have those systems reduce
MAX_ORDER?

So, I think they are lightly linked problems.

I suppose this is also a limitation with Linus's suggestion. It doesn't
give the optimal answer for 1G pages on these older systems:

	for (size_t nr = 1; nr < nr_pages; nr++) {
		if (*pages++ != ++page)
			break;
	}

Since that will exit at every section boundary.

At least for scatterlist-like cases the point of this function is just
to speed things up. If it returns short, the calling code should still
be directly checking phys_addr contiguity anyhow. Something for the
kdoc I suppose.

Jason
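
To make the section-fragmentation point concrete, here is a small
userspace model of the classic-SPARSEMEM memmap layout. This is an
illustration only: struct page is reduced to a stub, and SECTION_PAGES,
NR_SECTIONS, and section_memmap[] are invented for the demo. Only the
nth_page() definition mirrors the kernel's !SPARSEMEM_VMEMMAP macro.

#include <stdio.h>
#include <stdlib.h>

#define SECTION_PAGES 4	/* pages per memory section (tiny, for demo) */
#define NR_SECTIONS   4

struct page { unsigned long pfn; };

/*
 * Each section's struct page array is a separate allocation, so
 * &section_memmap[s][SECTION_PAGES - 1] + 1 does NOT reach the first
 * page of section s + 1 -- the memmap is not virtually contiguous.
 */
static struct page *section_memmap[NR_SECTIONS];

static struct page *pfn_to_page(unsigned long pfn)
{
	return &section_memmap[pfn / SECTION_PAGES][pfn % SECTION_PAGES];
}

static unsigned long page_to_pfn(struct page *page)
{
	return page->pfn;
}

/* The kernel's !SPARSEMEM_VMEMMAP definition: go via the pfn. */
#define nth_page(page, n) pfn_to_page(page_to_pfn((page)) + (n))

int main(void)
{
	for (int s = 0; s < NR_SECTIONS; s++) {
		section_memmap[s] = malloc(sizeof(struct page) * SECTION_PAGES);
		for (int i = 0; i < SECTION_PAGES; i++)
			section_memmap[s][i].pfn = s * SECTION_PAGES + i;
	}

	/* Last page of section 0, i.e. right before a section boundary. */
	struct page *page = pfn_to_page(SECTION_PAGES - 1);

	/*
	 * Pointer arithmetic runs off the end of section 0's array, while
	 * nth_page() follows the pfn into section 1's separate array.
	 */
	printf("page + 1      -> %p (past the end of section 0's memmap)\n",
	       (void *)(page + 1));
	printf("nth_page(.,1) -> pfn %lu (correct first page of section 1)\n",
	       page_to_pfn(nth_page(page, 1)));
	return 0;
}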
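
And a sketch of how Linus' loop might be packaged, with the kdoc caveat
above spelled out. The name num_pages_contiguous(), the exact signature,
and the kdoc wording are assumptions here, not a settled API:

/**
 * num_pages_contiguous() - count leading virtually contiguous struct pages
 * @pages:    array of page pointers, at least one entry
 * @nr_pages: number of entries in @pages
 *
 * Returns the length of the run of virtually contiguous struct pages
 * starting at pages[0]. Note that on !SPARSEMEM_VMEMMAP configurations
 * the memmap is allocated per section, so this can return short at a
 * section boundary even when the underlying physical pages continue.
 * This is an optimization helper only: callers that care (e.g.
 * scatterlist builders) must still verify phys_addr contiguity for the
 * remainder themselves.
 */
static inline size_t num_pages_contiguous(struct page **pages,
					  size_t nr_pages)
{
	struct page *page = *pages++;	/* assumes nr_pages >= 1 */
	size_t nr;

	for (nr = 1; nr < nr_pages; nr++) {
		if (*pages++ != ++page)
			break;
	}
	return nr;
}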