On 10/04/2025 16:45, Sean Christopherson wrote:
> On Wed, Apr 09, 2025, Joao Martins wrote:
>> On 04/04/2025 20:39, Sean Christopherson wrote:
>> I would suggest holding off on this and the next one, while progressing
>> with the rest of the series.
>
> Agreed, though I think there's a "pure win" alternative that can be safely
> implemented (but it definitely should be done separately).
>
> If HLT-exiting is disabled for the VM, and the VM doesn't have access to
> the various paravirtual features that can put it into a synthetic HLT
> state (PV async #PF and/or Xen support), then I'm pretty sure GALogIntr
> can be disabled entirely, i.e. disabled during the initial
> irq_set_vcpu_affinity() and never enabled.  KVM doesn't emulate HLT via
> its full emulator for AMD (just non-unrestricted Intel guests), so I'm
> pretty sure there would be no need for KVM to ever wake a vCPU in
> response to a device interrupt.
>

A significant portion of the IRTE is already updated via IRQ affinity
changes, and that is already a slowpath that performs an invalidation, so
doing it via irq_set_vcpu_affinity() is definitely safe.

But even with HLT exits disabled, there's still preemption, though? I guess
that's a bit rarer if it's made conditional on whether HLT exiting is
enabled and whether there's only a single task running.

>>> diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
>>> index 2e016b98fa1b..27b03e718980 100644
>>> --- a/drivers/iommu/amd/iommu.c
>>> +++ b/drivers/iommu/amd/iommu.c
>>> -static void __amd_iommu_update_ga(struct irte_ga *entry, int cpu)
>>> +static void __amd_iommu_update_ga(struct irte_ga *entry, int cpu,
>>> +				  bool ga_log_intr)
>>>  {
>>>  	if (cpu >= 0) {
>>>  		entry->lo.fields_vapic.destination =
>>> @@ -3783,12 +3784,14 @@ static void __amd_iommu_update_ga(struct irte_ga *entry, int cpu)
>>>  		entry->hi.fields.destination =
>>>  					APICID_TO_IRTE_DEST_HI(cpu);
>>>  		entry->lo.fields_vapic.is_run = true;
>>> +		entry->lo.fields_vapic.ga_log_intr = false;
>>>  	} else {
>>>  		entry->lo.fields_vapic.is_run = false;
>>> +		entry->lo.fields_vapic.ga_log_intr = ga_log_intr;
>>>  	}
>>>  }
>>>
>>
>> IsRun, Destination and GATag are not cached. Quoting the update from a
>> few years back (page 93 of the IOMMU spec dated Feb 2025):
>>
>> | When virtual interrupts are enabled by setting MMIO Offset 0018h[GAEn]
>> | and IRTE[GuestMode=1], IRTE[IsRun], IRTE[Destination], and if present
>> | IRTE[GATag], are not cached by the IOMMU. Modifications to these fields
>> | do not require an invalidation of the Interrupt Remapping Table.
>
> Ooh, that's super helpful info.  Any objections to me adding verbose
> comments to explain the effective rules for amd_iommu_update_ga()?
>

That's a great addition; it should have been there from the beginning, when
we added the cacheability of guest-mode IRTEs into the mix.

>> This is the reason we were able to get rid of the IOMMU invalidation in
>> amd_iommu_update_ga(), which sped up the vmexit/vmenter flow with IOMMU
>> AVIC. Besides the lock contention that was observed at the time, we were
>> seeing stalls in this path with enough vCPUs, IIRC; CCing Alejandro to
>> keep me honest.
>>
>> Now, this change above is incorrect as is. To make it correct, you will
>> need to XOR against the previous content of IRTE::ga_log_intr and, if it
>> changed, re-add an invalidation command via
>> iommu_flush_irt_and_complete(). The latter is what I am worried will
>> reintroduce the problems above :(
>
> Ya, the need to flush definitely changes things.
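
To illustrate the fix being described, a rough, untested sketch (field
names and APICID_TO_IRTE_DEST_HI() are taken from the quoted diff; the
matching ..._LO() macro and the flush_needed out-param are assumed for
illustration):

static void __amd_iommu_update_ga(struct irte_ga *entry, int cpu,
				  bool ga_log_intr, bool *flush_needed)
{
	if (cpu >= 0) {
		entry->lo.fields_vapic.destination =
					APICID_TO_IRTE_DEST_LO(cpu);
		entry->hi.fields.destination =
					APICID_TO_IRTE_DEST_HI(cpu);
		entry->lo.fields_vapic.is_run = true;
		/* A running vCPU never needs a GA log interrupt. */
		ga_log_intr = false;
	} else {
		entry->lo.fields_vapic.is_run = false;
	}

	/*
	 * Per the spec quote above, IsRun, Destination and GATag are not
	 * cached, so the updates above require no IRT invalidation.
	 * GALogIntr has no such carve-out, so flush iff the bit changes.
	 */
	if (entry->lo.fields_vapic.ga_log_intr != ga_log_intr) {
		entry->lo.fields_vapic.ga_log_intr = ga_log_intr;
		*flush_needed = true;
	}
}

The caller would then issue iommu_flush_irt_and_complete() only when
flush_needed is set, i.e. only on an actual GALogIntr transition; that
flush on the vCPU load/put path is exactly the cost being debated below.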
>
>> The invalidation command (which has a completion barrier to serialize
>> invalidation execution) takes some time in hardware, and will make all
>> your vCPUs contend on the IRQ table lock (as is). Even assuming you
>> somehow move the invalidation outside the lock, you will contend on the
>> IOMMU lock (for the command queue), or, best case, assuming no locks
>> (which I am not sure is possible), you will need to wait for the command
>> to complete before you can progress with entering/exiting.
>>
>> Unless the GALogIntr bit is somehow not cached either, in which case the
>> invalidation command wouldn't be needed (which would be good news!).
>> Adding Suravee/Vasant here.
>>
>> It's a nice trick how you would leverage this in SVM, but do you have
>> measurements that corroborate its introduction? How many unnecessary
>> GALog entries were you able to avoid with this trick, say on a workload
>> that would exercise this (like a netperf 1-byte RR test, which sleeps
>> and wakes up a lot)?
>
> I didn't do any measurements; I assumed the opportunistic toggling of
> GALogIntr would be "free".
>
> There might be optimizations that could be done, but I think the better
> solution is to simply disable GALogIntr when it's not needed.  That'd
> limit the benefits to select setups, but trying to optimize IRQ bypass
> for VMs that are CPU-overcommitted, i.e. can't do native HLT, seems
> rather pointless.
>

/me nods
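
For completeness, the shape of the check Sean proposes, again as a rough
sketch only: kvm_hlt_in_guest() is the existing accessor for HLT-exiting
being disabled, while the two PV predicates are hypothetical placeholders
for the synthetic-HLT sources he lists (PV async #PF, Xen support):

/*
 * Sketch: decide once, at the initial irq_set_vcpu_affinity(), whether
 * the VM can ever need a GA log interrupt.
 */
static bool kvm_vm_needs_ga_log_intr(struct kvm *kvm)
{
	/*
	 * With HLT exits enabled, vCPUs block in the host, and device
	 * interrupts must be able to wake them via the GA log.
	 */
	if (!kvm_hlt_in_guest(kvm))
		return true;

	/*
	 * Hypothetical predicates: PV async #PF and Xen support can halt
	 * a vCPU in a synthetic HLT state, which also needs GA log wakeups.
	 */
	if (vm_can_use_pv_async_pf(kvm) || vm_can_use_xen_halt(kvm))
		return true;

	return false;
}

If that returns false, GALogIntr would be left clear from the initial
irq_set_vcpu_affinity() and never set, sidestepping the flush-on-toggle
cost entirely for those setups.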