On Tue, Mar 25, 2025 at 03:57:00PM +0000, Fuad Tabba wrote:
> Hi Vishal,
>
> On Fri, 21 Mar 2025 at 20:09, Vishal Annapurve <vannapurve@xxxxxxxxxx> wrote:
> >
> > On Tue, Mar 18, 2025 at 9:20 AM Fuad Tabba <tabba@xxxxxxxxxx> wrote:
> > > ...
> > > +/*
> > > + * Callback function for __folio_put(), i.e., called once all references by the
> > > + * host to the folio have been dropped. This allows gmem to transition the state
> > > + * of the folio to shared with the guest, and allows the hypervisor to continue
> > > + * transitioning its state to private, since the host cannot attempt to access
> > > + * it anymore.
> > > + */
> > >  void kvm_gmem_handle_folio_put(struct folio *folio)
> > >  {
> > > -	WARN_ONCE(1, "A placeholder that shouldn't trigger. Work in progress.");
> > > +	struct address_space *mapping;
> > > +	struct xarray *shared_offsets;
> > > +	struct inode *inode;
> > > +	pgoff_t index;
> > > +	void *xval;
> > > +
> > > +	mapping = folio->mapping;
> > > +	if (WARN_ON_ONCE(!mapping))
> > > +		return;
> > > +
> > > +	inode = mapping->host;
> > > +	index = folio->index;
> > > +	shared_offsets = &kvm_gmem_private(inode)->shared_offsets;
> > > +	xval = xa_mk_value(KVM_GMEM_GUEST_SHARED);
> > > +
> > > +	filemap_invalidate_lock(inode->i_mapping);
> >
> > As discussed in the guest_memfd upstream, folio_put can happen from
> > atomic context [1], so we need a way to either defer the work outside
> > kvm_gmem_handle_folio_put() (which is very likely needed to handle
> > hugepages and merge operations) or ensure the logic runs only with
> > synchronization primitives that will not sleep.
>
> Thanks for pointing this out. For now, rather than deferring (which
> we'll come to when hugepages come into play), I think this would be

FWIW, with SNP, it's only possible to unsplit an RMP entry if the guest
cooperates with re-validating/re-accepting the memory at a higher order.
Currently, this guest support is not implemented in Linux. So, if we
were to opportunistically unsplit hugepages, we'd zap the mappings in
KVM, let them fault back in at a higher order so we could reduce TLB
misses, and then KVM would (via kvm_x86_call(private_max_mapping_level)(kvm, pfn))
find that the RMP entry is still split to 4K, and remap everything right
back to the 4K granularity it was at to begin with.

TDX seems to have a bit more flexibility in being able to
'unsplit'/promote private ranges back up to higher orders, so it could
potentially benefit from doing things opportunistically...

However, ideally the guest would just avoid unnecessarily carving up
ranges to begin with and pack all its shared mappings into smaller GPA
ranges. Then, all this unsplitting of huge pages could be completely
avoided until cleanup/truncate time. So maybe even for hugepages we
should just plan to do things this way, at least as a start?

> possible to resolve by ensuring we have exclusive access* to the folio
> instead, and using that to ensure that we can access the
> shared_offsets maps.
>
> * By exclusive access I mean either holding the folio lock, or knowing
> that no one else has references to the folio (which is the case when
> kvm_gmem_handle_folio_put() is called).
>
> I'll try to respin something in time for folks to look at it before
> the next sync.

Thanks for posting. I was looking at how to get rid of
filemap_invalidate_lock() from the conversion path, and having that
separate rwlock seems to resolve a lot of the potential races I was
looking at. I'm working on rebasing SNP 2MB support on top of your v7
series now.
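As an aside, a non-sleeping variant along the lines you describe might
look roughly like the sketch below. This is purely illustrative, based
on the snippet quoted above rather than your actual respin; it assumes
exclusive access makes filemap_invalidate_lock()/folio_lock()
unnecessary here, and that a GFP_ATOMIC xa_store() is acceptable:

/*
 * Sketch only: kvm_gmem_handle_folio_put() runs after the last host
 * reference is dropped, so we have exclusive access to the folio and
 * can skip the sleeping locks entirely. xa_store() takes its own
 * spinlock internally, and GFP_ATOMIC keeps any node allocation safe
 * from atomic context.
 */
void kvm_gmem_handle_folio_put(struct folio *folio)
{
	struct address_space *mapping = folio->mapping;
	struct xarray *shared_offsets;
	struct inode *inode;
	void *xval;

	if (WARN_ON_ONCE(!mapping))
		return;

	inode = mapping->host;
	shared_offsets = &kvm_gmem_private(inode)->shared_offsets;
	xval = xa_mk_value(KVM_GMEM_GUEST_SHARED);

	/* Exclusive access: no one else can observe the transition. */
	kvm_gmem_restore_pending_folio(folio, inode);
	WARN_ON_ONCE(xa_err(xa_store(shared_offsets, folio->index, xval,
				     GFP_ATOMIC)));
}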
-Mike

> Cheers,
> /fuad
>
> > [1] https://elixir.bootlin.com/linux/v6.14-rc6/source/include/linux/mm.h#L1483
> >
> > > +	folio_lock(folio);
> > > +	kvm_gmem_restore_pending_folio(folio, inode);
> > > +	folio_unlock(folio);
> > > +	WARN_ON_ONCE(xa_err(xa_store(shared_offsets, index, xval, GFP_KERNEL)));
> > > +	filemap_invalidate_unlock(inode->i_mapping);
> > > }
> > > EXPORT_SYMBOL_GPL(kvm_gmem_handle_folio_put);
> > >
> > > --
> > > 2.49.0.rc1.451.g8f38331e32-goog