On 10/04/2025 16:45, Sean Christopherson wrote:
> On Wed, Apr 09, 2025, Joao Martins wrote:
>> On 04/04/2025 20:39, Sean Christopherson wrote:
>> I would suggest holding off on this and the next one, while progressing
>> with the rest of the series.
>
> Agreed, though I think there's a "pure win" alternative that can be safely
> implemented (but it definitely should be done separately).
>
> If HLT-exiting is disabled for the VM, and the VM doesn't have access to
> the various paravirtual features that can put it into a synthetic HLT
> state (PV async #PF and/or Xen support), then I'm pretty sure GALogIntr
> can be disabled entirely, i.e. disabled during the initial
> irq_set_vcpu_affinity() and never enabled.  KVM doesn't emulate HLT via
> its full emulator for AMD (just non-unrestricted Intel guests), so I'm
> pretty sure there would be no need for KVM to ever wake a vCPU in
> response to a device interrupt.
>

A significant portion of the IRTE is already updated via IRQ affinity
changes, and that is already a slowpath that performs an invalidation, so
doing it via irq_set_vcpu_affinity() is definitely safe.

But even with HLT exits disabled, there's still preemption, though? I guess
that's a bit rarer if it's made conditional on whether HLT exiting is
enabled and whether there's only a single task running.

>>> diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
>>> index 2e016b98fa1b..27b03e718980 100644
>>> --- a/drivers/iommu/amd/iommu.c
>>> +++ b/drivers/iommu/amd/iommu.c
>>> -static void __amd_iommu_update_ga(struct irte_ga *entry, int cpu)
>>> +static void __amd_iommu_update_ga(struct irte_ga *entry, int cpu,
>>> +				  bool ga_log_intr)
>>>  {
>>>  	if (cpu >= 0) {
>>>  		entry->lo.fields_vapic.destination =
>>> @@ -3783,12 +3784,14 @@ static void __amd_iommu_update_ga(struct irte_ga *entry, int cpu)
>>>  		entry->hi.fields.destination =
>>>  					APICID_TO_IRTE_DEST_HI(cpu);
>>>  		entry->lo.fields_vapic.is_run = true;
>>> +		entry->lo.fields_vapic.ga_log_intr = false;
>>>  	} else {
>>>  		entry->lo.fields_vapic.is_run = false;
>>> +		entry->lo.fields_vapic.ga_log_intr = ga_log_intr;
>>>  	}
>>>  }
>>>
>>
>> IsRun, Destination and GATag are not cached. Quoting the update from a
>> few years back (page 93 of the IOMMU spec dated Feb 2025):
>>
>> | When virtual interrupts are enabled by setting MMIO Offset 0018h[GAEn]
>> | and IRTE[GuestMode=1], IRTE[IsRun], IRTE[Destination], and if present
>> | IRTE[GATag], are not cached by the IOMMU. Modifications to these fields
>> | do not require an invalidation of the Interrupt Remapping Table.
>
> Ooh, that's super helpful info.  Any objections to me adding verbose
> comments to explain the effective rules for amd_iommu_update_ga()?
>

That's a great addition; it should have been there from the beginning, when
we added the cacheability of guest-mode IRTEs into the mix.

>> This is the reason we were able to get rid of the IOMMU invalidation in
>> amd_iommu_update_ga(), which sped up the vmexit/vmenter flow with IOMMU
>> AVIC. Besides the lock contention that was observed at the time, we were
>> seeing stalls in this path with enough vCPUs, IIRC; CCing Alejandro to
>> keep me honest.
>>
>> Now, this change above is incorrect as is. To make it correct, you will
>> need to XOR against the previous content of IRTE::ga_log_intr and, if it
>> changed, re-add an invalidation command via
>> iommu_flush_irt_and_complete(). The latter is what I am worried will
>> reintroduce the problems above :(
>
> Ya, the need to flush definitely changes things.
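
To illustrate the fix being described, a rough, untested sketch (field
names and APICID_TO_IRTE_DEST_HI() are taken from the quoted diff; the
matching ..._LO() macro and the flush_needed out-param are assumed for
illustration):

static void __amd_iommu_update_ga(struct irte_ga *entry, int cpu,
				  bool ga_log_intr, bool *flush_needed)
{
	if (cpu >= 0) {
		entry->lo.fields_vapic.destination =
					APICID_TO_IRTE_DEST_LO(cpu);
		entry->hi.fields.destination =
					APICID_TO_IRTE_DEST_HI(cpu);
		entry->lo.fields_vapic.is_run = true;
		/* A running vCPU never needs a GA log interrupt. */
		ga_log_intr = false;
	} else {
		entry->lo.fields_vapic.is_run = false;
	}

	/*
	 * Per the spec quote above, IsRun, Destination and GATag are not
	 * cached, so the updates above require no IRT invalidation.
	 * GALogIntr has no such carve-out, so flush iff the bit changes.
	 */
	if (entry->lo.fields_vapic.ga_log_intr != ga_log_intr) {
		entry->lo.fields_vapic.ga_log_intr = ga_log_intr;
		*flush_needed = true;
	}
}

The caller would then issue iommu_flush_irt_and_complete() only when
flush_needed is set, i.e. only on an actual GALogIntr transition; that
flush on the vCPU load/put path is exactly the cost being debated below.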
>
>> The invalidation command (which has a completion barrier to serialize
>> invalidation execution) takes some time in hardware, and will make all
>> your vCPUs contend on the IRQ table lock (as is). Even assuming you
>> somehow move the invalidation outside the lock, you will contend on the
>> IOMMU lock (for the command queue), or, best case, assuming no locks
>> (which I am not sure is possible), you will need to wait for the command
>> to complete before you can progress with entering/exiting.
>>
>> Unless the GALogIntr bit is somehow not cached either, in which case the
>> invalidation command wouldn't be needed (which would be good news!).
>> Adding Suravee/Vasant here.
>>
>> It's a nice trick how you would leverage this in SVM, but do you have
>> measurements that corroborate its introduction? How many unnecessary
>> GALog entries were you able to avoid with this trick, say on a workload
>> that would exercise this (like a netperf 1-byte RR test, which sleeps
>> and wakes up a lot)?
>
> I didn't do any measurements; I assumed the opportunistic toggling of
> GALogIntr would be "free".
>
> There might be optimizations that could be done, but I think the better
> solution is to simply disable GALogIntr when it's not needed.  That'd
> limit the benefits to select setups, but trying to optimize IRQ bypass
> for VMs that are CPU-overcommitted, i.e. can't do native HLT, seems
> rather pointless.
>

/me nods
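
For completeness, the shape of the check Sean proposes, again as a rough
sketch only: kvm_hlt_in_guest() is the existing accessor for HLT-exiting
being disabled, while the two PV predicates are hypothetical placeholders
for the synthetic-HLT sources he lists (PV async #PF, Xen support):

/*
 * Sketch: decide once, at the initial irq_set_vcpu_affinity(), whether
 * the VM can ever need a GA log interrupt.
 */
static bool kvm_vm_needs_ga_log_intr(struct kvm *kvm)
{
	/*
	 * With HLT exits enabled, vCPUs block in the host, and device
	 * interrupts must be able to wake them via the GA log.
	 */
	if (!kvm_hlt_in_guest(kvm))
		return true;

	/*
	 * Hypothetical predicates: PV async #PF and Xen support can halt
	 * a vCPU in a synthetic HLT state, which also needs GA log wakeups.
	 */
	if (vm_can_use_pv_async_pf(kvm) || vm_can_use_xen_halt(kvm))
		return true;

	return false;
}

If that returns false, GALogIntr would be left clear from the initial
irq_set_vcpu_affinity() and never set, sidestepping the flush-on-toggle
cost entirely for those setups.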