On Thu, 2025-08-28 at 16:40 -0700, Sean Christopherson wrote:
> On Wed, Aug 27, 2025, David Woodhouse wrote:
> > when there's an *existing* hypervisor leaf which just gives the information
> > directly, which is implemented in QEMU and EC2, as well as various guests.
>
> Can we just have the VMM do the work then? I.e. carve out the bit and the
> leaf in KVM's ABI, but leave it to the VMM to fill in? I'd strongly prefer not
> to hook kvm_cpuid(), as I don't like overriding userspace's CPUID entries, and
> I especially don't like that hooking kvm_cpuid() means the value can change
> throughout the lifetime of the VM, at least in theory, but in practice will only
> ever be checked by the guest during early boot.

The problem is that VMM doesn't know what TSC frequency the guest
actually gets. VMM only knows what it *asked* for, not what KVM
actually ended up configuring, which depends on the capabilities of
the hardware and the host's idea of what its actual TSC frequency is.

Hence https://git.kernel.org/torvalds/c/f422f853af036 in which we
allowed KVM to populate the value in the Xen TSC info CPUID leaves. I
was just following that precedent.

I am not *entirely* averse to ripping that out, and doing things
differently. We would have to:

 • Declare that exposing the TSC frequency to guests via CPUID is
   nonsense on crappy old hardware where it actually varies at runtime
   anyway. Partly because the guest will only check it at boot, and
   partly because that TSC has to be advertised as unreliable anyway.

 • Add a new API for the VMM to extract the actual effective
   frequency, only on 'sane' hosts.

 • Declare that we don't care that it's strictly an ABI change, and
   VMMs which used to just populate the leaf and let KVM fill it in
   for Xen guests now *have* to use the new API.

I'm actually OK with that, even the last one, because I've just noticed
that KVM is updating the *wrong* Xen leaf.
0x40000x03/2 EAX is supposed to be the *host* TSC frequency, and the
guest frequency is supposed to be in 0x40000x03/0 ECX. And Linux as a
Xen guest doesn't even use it anyway, AFAICT.

Paul, it was your code originally; are you happy with removing it?

As we look at a new API for exposing the precise TSC scaling, I'd like
to make sure it works for VMClock (for which I am still working on
writing up proper documentation but in the meantime
https://gitlab.com/qemu-project/qemu/-/commit/3634039b93cc5 serves as a
decent reference).

In short, VMClock allows the hypervisor to provide a pvclock-style
clock with microsecond accuracy to its guests, solving the problems of

 • All guests using external precision clocks to repeat the *same*
   work of calibrating the *same* underlying oscillator

 • ...badly, experiencing imprecision due to steal time as they do so.

 • Live migration completely disrupting the clock and causing actual
   data corruption, where precision timestamps are required for e.g.
   distributed database coherency.

In its initial implementation, the VMClock in QEMU (and EC2) only
resolves the last issue, by advertising a 'disruption' on live
migration so that the guest can know that its clock is hosed until it
manages to resync.

Now I'm trying to plumb in the actual clock information from the host,
so that migrated guests can have precision time from the moment they
arrive on the new host.

There are two major use cases to consider...

1. Dedicated hosting setups will calibrate the host TSC *directly*
   against the external clock, and maybe feed it into the host kernel's
   adjtimex() almost as an afterthought. So userspace will be able to
   produce a system-wide VMClock data structure which can then be
   advertised to each guest with the appropriate TSC offset and scaling
   factor. For this I think we want the *actual* scaling factor to be
   exposed by KVM to userspace, not just the resulting estimated
   frequency.
   Unless we allow userspace just to provide the host's view and let
   KVM apply the offset/scale. Which maybe doesn't make as much sense
   in *this* setup but we might end up wanting that anyway for...

2. More traditional hosts just running Chrony/ntpd to feed the host's
   CLOCK_REALTIME with adjtimex(). For this case, there is probably
   more of an argument for letting the kernel generate the vmclock
   data; KVM already has the gtod notifier which is invoked every time
   the apparent frequency changes, and userspace has none of what it
   needs.

So... if we need KVM to be able to apply the per-VM scaling/offset
because we're going to do it all in-kernel in that second case, then we
might as well let KVM apply the per-VM scaling/offset even in the
dedicated hosting case. And then the API we use for the original CPUID
problem only needs to expose the actual effective frequency.

But if we want userspace to do more for itself, we'd need to expose the
scaling factors directly.

I think...
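
To make the gap between the frequency that was *asked* for and the one actually configured a bit more concrete, here is a minimal sketch of fixed-point TSC scaling. It is an illustrative model only, not KVM's code: the function names are hypothetical, the 48-bit fractional width is an assumption picked for the example, and the frequencies are made up. It models just the truncation in the multiplier, and ignores hosts without scaling support and any error in the host's own idea of its TSC frequency.

```python
from fractions import Fraction

FRAC_BITS = 48  # assumed fractional width of the multiplier (illustrative)


def scaling_ratio(host_khz: int, requested_khz: int,
                  frac_bits: int = FRAC_BITS) -> int:
    """Fixed-point multiplier approximating requested/host, truncated."""
    return (requested_khz << frac_bits) // host_khz


def scale_tsc(host_tsc: int, ratio: int, frac_bits: int = FRAC_BITS) -> int:
    """Guest-visible counter delta for a given host counter delta."""
    return (host_tsc * ratio) >> frac_bits


def effective_khz(host_khz: int, ratio: int,
                  frac_bits: int = FRAC_BITS) -> Fraction:
    """Exact frequency implied by the truncated multiplier."""
    return Fraction(host_khz * ratio, 1 << frac_bits)


if __name__ == "__main__":
    host, asked = 3_000_000, 2_500_000  # kHz, made-up values
    ratio = scaling_ratio(host, asked)
    got = effective_khz(host, ratio)
    print(f"asked for {asked} kHz, multiplier implies {float(got):.9f} kHz")
    print(f"shortfall: {float(asked - got) * 1000:.9f} Hz")
```

In this model the effective frequency is never above the requested one, and matches it exactly only when the ratio happens to be representable. That is one reason the multiplier itself carries more information than a frequency rounded to kHz.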