On Tue, 2025-09-02 at 10:33 -0700, Sean Christopherson wrote:
> > Besides, a cache flush after 2 can essentially cause a memory write to the
> > page.
> > Though we could invoke tdh_phymem_page_wbinvd_hkid() after the KVM_BUG_ON(),
> > the SEAMCALL itself can fail.
>
> I think this falls into the category of "don't screw up" flows. Failure to
> remove a private SPTE is a near-catastrophic error. Going out of our way to
> reduce the impact of such errors increases complexity without providing much
> in the way of value.
>
> E.g. if VMCLEAR fails, KVM WARNs but continues on and hopes for the best, even
> though there's a decent chance failure to purge the VMCS cache entry could be
> lead to UAF-like problems. To me, this is largely the same.
>
> If anything, we should try to prevent #2, e.g. by marking the entire
> guest_memfd as broken or something, and then deliberately leaking _all_ pages.

There was a marathon thread on this subject. We did discuss this option (link to
most relevant part I could find):
https://lore.kernel.org/kvm/a9affa03c7cdc8109d0ed6b5ca30ec69269e2f34.camel@xxxxxxxxx/

The high level summary is that pinning the pages wrinkles guestmemfd's plans to
use refcount for other tracking purposes. Dropping refcounts interferes with the
error handling safety.

I strongly agree that we should not optimize for the error path at all. If we
could bug the guestmemfd (kind of what we were discussing in that link) I think
it would be appropriate to use in these cases. I guess the question is are we ok
dropping the safety before we have a solution like that. In that thread I was
advocating for yes, partly to close it because the conversation was getting
stuck. But there is probably a long tail of potential issues or ways of looking
at it that could put it in the grey area.