Thanks for the review, Sean.  Apologies for not replying sooner; I
hope you still have some context in mind.  I have a question for you
below, but other than that, I'll send out a new rev next week.

On Mon, 2024-12-02 at 10:30 -0800, Sean Christopherson wrote:
> KVM: SVM:
>
> Please go through the relevant maintainer handbooks, there are warts
> all over.
>
> Documentation/process/maintainer-kvm-x86.rst
> Documentation/process/maintainer-tip.rst
>
> And the shortlog is wrong.  The patch itself is also broken.  KVM
> should (a) add support for virtualizing ERAPS and (b) advertise
> support to *userspace*.  The userspace VMM ultimately decides what
> to expose/enable for the guest.

Point taken - I interpreted it the way I wanted to, but I can see
what you mean.  I'll reword.

[...]

> > While the new default RSB size is used on the host without any
> > software modification necessary, the RSB size for guests is
> > limited to the older
>
> Please describe hardware behavior, and make it abundantly clear
> that the changelog is talking about hardware behavior.  One of my
> pet peeves (understatement) with the APM is that it does a shit job
> of explaining the actual architectural behavior.

OK, more rewording.  The hardware behaviour is to use 64 entries on
the host, and to limit any guest to 32 entries.  If the right VMCB
bit is set, as in this patch, the hardware uses all 64 entries for
the guest as well.

> > value (32 entries) for backwards compatibility.
>
> Backwards compatibility with what?  And how far back?  E.g. have
> CPUs with a RAP always had 32 entries?

Yes, all CPUs up to Zen5 had 32 entries -- that number is even
hard-coded in the definition of RSB_CLEAR_LOOPS, which is used for
RSB stuffing.  The wording is borrowed a bit from the APM --
especially the "default" value.  With Zen5, the default hardware RSB
size is 64.  So to expose the hardware-default size to the guest,
this VMCB update is necessary; otherwise guests use the non-default,
trimmed-down 32-entry RSB.
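Concretely, for the non-nested case I have something like the sketch
below in mind.  ALLOW_LARGER_RAP is the APM's name for the bit; the
bit position, the "erap_ctl" VMCB field and the function name are
illustrative stand-ins, not what the APM or the patch actually
defines:

	/* Illustrative bit position; the real one comes from the APM. */
	#define ALLOW_LARGER_RAP	BIT(3)

	static void svm_enable_larger_rap(struct vcpu_svm *svm)
	{
		/*
		 * Let the guest use the full 64-entry hardware RSB
		 * instead of the backwards-compatible 32-entry view.
		 * Only safe with NPT, since guest task switches must
		 * reach the CPU's CR3 for the hardware to flush guest
		 * RSB entries.
		 */
		if (boot_cpu_has(X86_FEATURE_ERAPS) && npt_enabled)
			svm->vmcb->control.erap_ctl |= ALLOW_LARGER_RAP;
	}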
> > to also use the default number of entries by setting
>
> "default" is clearly wrong, since the *default* behavior is to use

Yeah, I get that this is confusing.  The hardware default is 64, but
the guest default is 32.  My intention with that statement was to say
"use the native default" as opposed to "the guest default".  Will
reword.

> > the new ALLOW_LARGER_RAP bit in the VMCB.
>
> I detest the "ALLOW_LARGER" name.  "Allow" implies the guest
> somehow has a choice.  And "Larger" implies there's an even larger
> size.

I agree, but this is the name used in the APM.  What's the preference
- deviate from the APM naming, or stick to it, despite the confusion
it causes?

> > The two cases for backward compatibility that need special
> > handling are nested guests, and guests using shadow paging
>
> Guests don't use shadow paging, *KVM* uses
>
> > (or when NPT is disabled):
>
> "i.e.", not "or".  "Or" makes it sound like "NPT is disabled" is a
> separate case from shadow paging.
>
> > For nested guests: the ERAPS feature adds host/guest tagging to
> > entries in the RSB, but does not distinguish between ASIDs.  On a
> > nested exit, the L0 hypervisor instructs the hardware (via
> > another new VMCB bit,
>
> I strongly suspect this was copied from the APM.  Please don't do
> that.  State what change is being made for *KVM*, not for "the L0
> hypervisor".  This verbiage mixes hardware behavior with software
> behavior, which again is why I hate much of the APM's wording.

This isn't necessarily from the APM; this is me describing what the
hardware does, to set up why KVM needs to do a few other things.  I
want to say that the hardware is in control of host/guest tagging for
the RSB entries, but it does not tag ASIDs.  So KVM running as the L0
hypervisor needs to take certain steps; KVM running as an L1
hypervisor does not.

> > FLUSH_RAP_ON_VMRUN) to flush the RSB on the next VMRUN to prevent
> > RSB poisoning attacks from an L2 guest to an L1 guest.  With that
> > in place, this feature can be exposed to guests.
>
> ERAPS can also be advertised if nested virtualization is disabled,
> no?  I think it makes sense to first add support for ERAPS if
> "!nested", and then in a separate patch, add support for ERAPS when
> nested virtualization is enabled.  Partly so that it's easier for
> readers to understand why nested VMs are special, but mainly
> because the nested virtualization support is sorely lacking.

Yeah, it makes sense to split these up, I'll try doing that.  Nested
virt disabled is the common case, so it makes sense to make that more
obvious as well.

> > For shadow paging guests: do not expose this feature to guests;
> > only expose if nested paging is enabled, to ensure a context
> > switch within a guest triggers a context switch on the CPU --
> > thereby ensuring guest context switches flush guest RSB entries.
>
> Huh?

OK, this is where I'm slightly confused as well - so tell me if I'm
wrong: when EPT/NPT is disabled and the shadow MMU is used by KVM,
the CR3 register on the CPU holds the PGD of the qemu process.  So if
a task switch happens within the guest, the CR3 on the CPU is not
updated, but KVM's shadow MMU routines change the page tables pointed
to by that CR3.  Contrast that with the NPT case: the CPU's CR3 holds
the guest PGD directly, and task switches within the guest cause an
update to the CPU's CR3.  Am I misremembering and misreading the
code?

> > For shadow paging, the CPU's CR3 is not used for guest processes,
> > and hence cannot benefit from this feature.
>
> What does that have to do with anything?

If what I wrote above is true: since CR3 does not change when guest
tasks switch, the RSB will not be flushed by hardware on guest task
switches, so exposing ERAPS to such a guest is wrong -- the guest
would relax its own FILL_RETURN_BUFFER routines (via patch 1), which
is not what we want when the hardware isn't actually doing those
flushes for it.
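To make that concrete, the advertisement gating I have in mind is
roughly the following, modelled on the existing feature checks in
svm_set_cpu_caps() -- a sketch under the assumption above, not the
final patch:

	/* In svm_set_cpu_caps(): */

	/*
	 * Advertise ERAPS only when NPT is enabled: with shadow
	 * paging, the CPU's CR3 doesn't change on guest task
	 * switches, so the hardware never flushes guest RSB entries
	 * and the guest must keep stuffing the RSB itself.
	 */
	if (npt_enabled)
		kvm_cpu_cap_check_and_set(X86_FEATURE_ERAPS);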
[...]

> > @@ -1362,10 +1364,22 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
> >  	case 0x80000020:
> >  		entry->eax = entry->ebx = entry->ecx = entry->edx = 0;
> >  		break;
> > -	case 0x80000021:
> > -		entry->ebx = entry->ecx = entry->edx = 0;
> > +	case 0x80000021: {
> > +		unsigned int ebx_mask = 0;
> > +
> > +		entry->ecx = entry->edx = 0;
> >  		cpuid_entry_override(entry, CPUID_8000_0021_EAX);
> > +
> > +		/*
> > +		 * Bits 23:16 in EBX indicate the size of the RSB.
>
> Is this enumeration explicitly tied to ERAPS?

Not tied to ERAPS, but it happens to be defined at the same time as
ERAPS is being defined, and there's no other CPUID bit that says "the
RSB size is now available in these bits".  So yes, the ERAPS CPUID
bit is being overloaded here.  Existence of the ERAPS CPUID bit
confirms that the default RSB size is now different, and that the
hardware flushes the RSB (as in patch 1).

> > +		 * Expose the value in the hardware to the guest.
>
> __do_cpuid_func() is used to advertise KVM's supported CPUID to
> host userspace, not to the guest.
>
> Side topic, what happens when Zen6 adds EVEN_LARGER_RAP?
> Enumerating the size of the RAP suggests it's likely to change in
> the future.

It's informational-only for now, and there aren't any plans to extend
the RAP further.  But you're right - it could change in the future.

> > +		 */
> > +		if (kvm_cpu_cap_has(X86_FEATURE_ERAPS))
> > +			ebx_mask |= GENMASK(23, 16);
> > +
> > +		entry->ebx &= ebx_mask;
>
> This is a waste of code and makes it unnecessarily difficult to
> read.  Just do:
>
> 	if (kvm_cpu_cap_has(X86_FEATURE_ERAPS))
> 		entry->ebx &= GENMASK(23, 16);
> 	else
> 		entry->ebx = 0;

Well, I left it that way to make it easier in the future to add more
EBX bits here, but sure - I'll just shorten it now.
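i.e. the whole case would end up roughly as below, with your
suggestion folded in (a sketch, not the final diff):

	case 0x80000021:
		entry->ecx = entry->edx = 0;
		cpuid_entry_override(entry, CPUID_8000_0021_EAX);
		/*
		 * Bits 23:16 of EBX enumerate the RSB size.  Expose
		 * them to userspace only when ERAPS is supported, and
		 * clear every other EBX bit unconditionally.
		 */
		if (kvm_cpu_cap_has(X86_FEATURE_ERAPS))
			entry->ebx &= GENMASK(23, 16);
		else
			entry->ebx = 0;
		break;

[...]

Amit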