On Wed, 19 Mar 2025 09:47:05 -0600
Keith Busch <kbusch@xxxxxxxxxx> wrote:

> On Mon, Mar 17, 2025 at 04:53:47PM -0600, Alex Williamson wrote:
> > On Mon, 17 Mar 2025 16:30:47 -0600
> > Keith Busch <kbusch@xxxxxxxxxx> wrote:
> >
> > > On Mon, Mar 17, 2025 at 03:44:17PM -0600, Alex Williamson wrote:
> > > > On Wed, 12 Mar 2025 15:52:55 -0700
> > > > > @@ -679,6 +679,7 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
> > > > >
> > > > >  		if (unlikely(disable_hugepages))
> > > > >  			break;
> > > > > +		cond_resched();
> > > > >  	}
> > > > >
> > > > >  out:
> > > >
> > > > Hey Keith, is this still necessary with:
> > > >
> > > > https://lore.kernel.org/all/20250218222209.1382449-1-alex.williamson@xxxxxxxxxx/
> > >
> > > Thank you for the suggestion. I'll try to fold this into a build and
> > > see what happens. But from what I can tell, I'm not sure it will
> > > help. We're simply not getting large folios in this path and are
> > > dealing with individual pages, though it is a large contiguous range
> > > (~60GB, not necessarily aligned). Should we expect to only be dealing
> > > with PUD and PMD levels with these kinds of mappings?
> >
> > IME with QEMU, PMD alignment basically happens without any effort and
> > gets 90+% of the way there; PUD alignment requires a bit of work[1].
> >
> > > > This is currently in linux-next from the vfio next branch and should
> > > > pretty much eliminate any stalls related to DMA mapping MMIO BARs.
> > > > Also the code here has been refactored in next, so this doesn't apply
> > > > anyway, and if there is a resched still needed, this location would
> > > > only affect DMA mapping of memory, not device BARs. Thanks,
> > >
> > > Thanks for the heads-up. Regardless, it doesn't look like a bad place
> > > for a cond_resched(), but it may not trigger any CPU stall indicator
> > > outside this vfio fault path.
> >
> > Note that we already have a cond_resched() in vfio_iommu_map(), which
> > we'll hit any time we get a break in a contiguous mapping. We may hit
> > that regularly enough that it's not an issue for RAM mapping, but I've
> > certainly seen soft lockups when we have many GiB of contiguous pfnmaps
> > prior to the series above. Thanks,
>
> So far adding the additional patches has not changed anything. We've
> ensured we are using an address and length aligned to 2MB, but it sure
> looks like vfio's fault handler is only getting order-0 faults. I'm not
> finding anything immediately obvious about what we can change to get
> the desired higher-order behavior, though. Any other hints or
> information I could provide?

Since you mention folding in the changes, are you working on an upstream
kernel or a downstream backport? Huge pfnmap support was added in v6.12
via [1]. Without that you'd never see better than an order-0 fault. I
hope that's it, because with all the kernel pieces in place it should
"Just work". Thanks,

Alex

[1] https://lore.kernel.org/all/20240826204353.2228736-1-peterx@xxxxxxxxxx/
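
For context on the patch quoted above: the change simply drops a
voluntary preemption point into the page-pinning loop, so that pinning
many gigabytes one order-0 page at a time inside a single ioctl doesn't
hold the CPU long enough to trip the soft-lockup or RCU stall
detectors. A minimal sketch of that pattern, not the real
vfio_pin_pages_remote() (pin_one_page() here is a hypothetical helper
standing in for the actual pinning logic):

/*
 * Illustrative sketch only -- not the real vfio code.  A loop that
 * pins a huge range one order-0 page at a time never blocks, so on a
 * non-preemptible kernel it can run long enough to trigger
 * soft-lockup/RCU stall warnings.  cond_resched() gives the scheduler
 * a chance to run something else on each iteration.
 */
extern long pin_one_page(unsigned long vaddr);	/* hypothetical helper */

static long pin_range(unsigned long vaddr, unsigned long npages)
{
	unsigned long pinned;

	for (pinned = 0; pinned < npages; pinned++) {
		long ret = pin_one_page(vaddr + pinned * PAGE_SIZE);

		if (ret < 0)
			return ret;

		cond_resched();	/* voluntary preemption point */
	}

	return pinned;
}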
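
On the alignment question: for the fault handler to ever see a
PMD-level fault (order 9 with 4K pages on x86-64), the userspace
virtual address of the mapping must itself be 2MB-aligned, not just
its length. One common way to force that from userspace is to
over-reserve address space and place the real mapping inside it with
MAP_FIXED. The sketch below assumes a map_bar_aligned() helper and
treats 'fd'/'offset' as placeholders for a vfio region fd and mmap
offset; trimming the slack reservation and error handling are omitted:

#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>

#define SZ_2M	(2UL << 20)

/*
 * Map 'len' bytes of a device region at a 2MB-aligned virtual address
 * by reserving extra address space with PROT_NONE, then placing the
 * real mapping inside it with MAP_FIXED.
 */
static void *map_bar_aligned(int fd, off_t offset, size_t len)
{
	size_t reserve = len + SZ_2M;
	uintptr_t base, aligned;

	base = (uintptr_t)mmap(NULL, reserve, PROT_NONE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if ((void *)base == MAP_FAILED)
		return MAP_FAILED;

	/* Round up to the next 2MB boundary inside the reservation. */
	aligned = (base + SZ_2M - 1) & ~(SZ_2M - 1);

	/* MAP_FIXED atomically replaces the PROT_NONE reservation. */
	return mmap((void *)aligned, len, PROT_READ | PROT_WRITE,
		    MAP_SHARED | MAP_FIXED, fd, offset);
}

Whether QEMU does exactly this is beside the point; any scheme that
yields a 2MB-aligned VA (and length) should allow huge pfnmap faults
on a kernel that has the v6.12 support Alex references in [1].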