Re: [PATCH v5 1/2] kvm: sev: Add SEV-SNP guest request throttling

Dionna Amalie Glaze <dionnaglaze@xxxxxxxxxx> · Wed, 28 May 2025 11:25:23 -0700

On Wed, May 21, 2025 at 11:19 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
>
> On Fri, May 16, 2025, Dionna Amalie Glaze wrote:
> > > > @@ -4015,6 +4042,12 @@ static int snp_handle_guest_req(struct vcpu_svm *svm, gpa_t req_gpa, gpa_t resp_
> > > >
> > > >       mutex_lock(&sev->guest_req_mutex);
> > > >
> > > > +     if (!__ratelimit(&sev->snp_guest_msg_rs)) {
> > > > +             svm_vmgexit_no_action(svm, SNP_GUEST_ERR(SNP_GUEST_VMM_ERR_BUSY, 0));
> > > > +             ret = 1;
> > > > +             goto out_unlock;
> > >
> > > Can you (or anyone) explain what a well-behaved guest will do in in response to
> > > BUSY?  And/or explain why KVM injecting an error into the guest is better than
> > > exiting to userspace.
> >
> > Ah, I missed this question. The guest is meant to back off and try again
> > after waiting a bit.  This is the behavior added in
> > https://lore.kernel.org/all/20230214164638.1189804-2-dionnaglaze@xxxxxxxxxx/
>
> Nice, it's already landed and considered legal VMM behavior.
>
> > If KVM returns to userspace with an exit type that the guest request was
> > throttled, then what is user space supposed to do with that?
>
> The userspace exit doesn't have to notify userspace that the guest was throttled,
> e.g. KVM could exit on _every_ request and let userspace do its own throttling.
>
> I have no idea whether or not that's sane/useful, which is why I'm asking.  The
> cover letter, changelog, and documentation are all painfully sparse with respect
> to explaining why *this* uAPI is the right uAPI.
>
> > It could wait a bit before trying KVM_RUN again, but with the enlightened
> > method, the guest could at least work on other kernel tasks while it waits
> > for its turn to get an attestation report.
>
> Nothing prevents KVM from providing userspace a way to communicate VMM_ERR_BUSY,
> e.g. as done for KVM_EXIT_SNP_REQ_CERTS:
>
> https://lore.kernel.org/all/20250428195113.392303-2-michael.roth@xxxxxxx
>
> > Perhaps this is me not understanding the preferred KVM way of doing things.
>
> The only real preference at play is to not end up with uAPI and ABI that doesn't
> fit "everyone's" needs.  It's impossible to fully future-proof KVM's ABI, but we
> can at least perform due diligence to ensure we didn't simply pick the the path
> of least resistance.
>
> The bar gets lowered a tiny bit if we go with a module param (which I think we
> should do), but I'd still like an explanation of why a fairly simple ratelimiting
> mechanism is the best overall approach.

Before I send out a revised patchset with changed commit text, what do
you think about the following

    The AMD-SP is a precious resource that doesn't have a scheduler other
    than a mutex lock queue. To avoid customers from causing a DoS, a
    kernel module parameter for rate limiting guest requests is added.
[Addition:]
    The kernel module parameter is a lower bound kernel-imposed rate limit
    for any SEV-SNP VM-initiated guest request. This does not preclude the
    addition of a new KVM exit type for SEV-SNP guest requests for
    userspace to impose any additional throttling logic. The default value of
    0 maintains the previous behavior that there is no imposed rate limit on
    guest requests.

We could still ask Michael to change KVM_EXIT_SNP_REQ_CERTS  to
KVM_EXIT_SNP_GUEST_REQ
and for the exit structure to include msg_type as well as the
gfn+npages when the kind is an extended request for an attestation
report so that we don't need to have two exit types.

Regardless of that change for additional throttling opportunities, I
think the system-wide imposed lower bound is important for quelling
noisy neighbors to some degree.

-- 
-Dionna Glaze, PhD, CISSP, CCSP (she/her)