Re: [PATCH v2 0/3] Support "generic" CPUID timing leaf as KVM guest and host

David Woodhouse <dwmw2@xxxxxxxxxxxxx> · Fri, 29 Aug 2025 10:50:01 +0100

On Thu, 2025-08-28 at 16:40 -0700, Sean Christopherson wrote:
> On Wed, Aug 27, 2025, David Woodhouse wrote:
> > when there's an *existing* hypervisor leaf which just gives the information
> > directly, which is implemented in QEMU and EC2, as well as various guests.
> 
> Can we just have the VMM do the work then?  I.e. carve out the bit and the
> leaf in KVM's ABI, but leave it to the VMM to fill in?  I'd strongly prefer not
> to hook kvm_cpuid(), as I don't like overriding userspace's CPUID entries, and
> I especially don't like that hooking kvm_cpuid() means the value can change
> throughout the lifetime of the VM, at least in theory, but in practice will only
> ever be checked by the guest during early boot.

The problem is that the VMM doesn't know what TSC frequency the guest
actually gets. The VMM only knows what it *asked* for, not what KVM
actually ended up configuring, which depends on the capabilities of
the hardware and the host's idea of what its actual TSC frequency is.
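
To make the gap between "asked for" and "configured" concrete, here is a
minimal userspace sketch of the fixed-point arithmetic involved when
hardware TSC scaling is in use. It assumes the ratio is computed as
floor(2^frac_bits * guest_khz / host_khz), with 48 fractional bits on
VMX and 32 on SVM. The function names are made up for illustration and
this is not actual KVM code. It relies on the GCC/Clang __int128
extension.

```c
#include <stdint.h>

/*
 * Hypothetical helper: fixed-point scaling ratio, ASSUMED to be
 * floor(2^frac_bits * guest_khz / host_khz).
 */
uint64_t tsc_scaling_ratio(uint32_t guest_khz, uint32_t host_khz,
                           unsigned int frac_bits)
{
    unsigned __int128 num = (unsigned __int128)1 << frac_bits;

    num *= guest_khz;
    return (uint64_t)(num / host_khz);
}

/*
 * Hypothetical helper: frequency in Hz that the guest effectively
 * observes for a given host frequency and scaling ratio.
 */
uint64_t effective_guest_hz(uint64_t host_hz, uint64_t ratio,
                            unsigned int frac_bits)
{
    return (uint64_t)(((unsigned __int128)host_hz * ratio) >> frac_bits);
}
```

Because the ratio is truncated, the effective frequency can land a hair
below what was requested. That is before accounting for any error in
the host's own estimate of its TSC frequency, or for hardware with no
scaling support at all.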

Hence https://git.kernel.org/torvalds/c/f422f853af036 in which we
allowed KVM to populate the value in the Xen TSC info CPUID leaves. I
was just following that precedent.

I am not *entirely* averse to ripping that out, and doing things
differently. We would have to:

 • Declare that exposing the TSC frequency to guests via CPUID is
   nonsense on crappy old hardware where it actually varies at runtime
   anyway. Partly because the guest will only check it at boot, and
   partly because that TSC has to be advertised as unreliable anyway.

 • Add a new API for the VMM to extract the actual effective frequency,
   only on 'sane' hosts.

 • Declare that we don't care that it's strictly an ABI change, and
   VMMs which used to just populate the leaf and let KVM fill it in
   for Xen guests now *have* to use the new API.

I'm actually OK with that, even the last one, because I've just noticed
that KVM is updating the *wrong* Xen leaf. 0x40000x03/2 EAX is supposed
to be the *host* TSC frequency, and the guest frequency is supposed to
be in 0x40000x03/0 ECX. And Linux as a Xen guest doesn't even use it
anyway, AFAICT.

Paul, it was your code originally; are you happy with removing it?
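
For reference, the Xen TSC info leaf layout described above can be
sketched as follows. This only models the two fields mentioned in this
mail (guest frequency in sub-leaf 0 ECX, host frequency in sub-leaf 2
EAX); the struct and function names are hypothetical, and the other
registers of the leaf are deliberately left untouched.

```c
#include <stdint.h>

struct cpuid_regs {
    uint32_t eax, ebx, ecx, edx;
};

/*
 * 'base' is the Xen CPUID base, i.e. the "x" in 0x40000x03
 * (e.g. 0x40000000, or 0x40000100 if the Xen leaves are offset).
 */
uint32_t xen_tsc_leaf(uint32_t base)
{
    return base + 3;
}

/* Hypothetical helper: fill in only the frequency fields of the leaf. */
void xen_fill_tsc_info(uint32_t subleaf, uint32_t guest_khz,
                       uint32_t host_khz, struct cpuid_regs *regs)
{
    switch (subleaf) {
    case 0:
        regs->ecx = guest_khz;  /* guest TSC frequency in kHz */
        break;
    case 2:
        regs->eax = host_khz;   /* host TSC frequency in kHz */
        break;
    default:
        break;                  /* other sub-leaves not modelled here */
    }
}
```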

As we look at a new API for exposing the precise TSC scaling, I'd like
to make sure it works for VMClock. I am still writing up proper
documentation for that, but in the meantime
https://gitlab.com/qemu-project/qemu/-/commit/3634039b93cc5 serves as a
decent reference. In short, VMClock allows the hypervisor to provide a
pvclock-style clock with microsecond accuracy to its guests, solving
the problems of
 • All guests using external precision clocks to repeat the *same* work
   of calibrating the *same* underlying oscillator
 • ...badly, experiencing imprecision due to steal time as they do so.
 • Live migration completely disrupting the clock and causing actual
   data corruption, where precision timestamps are required for e.g.
   distributed database coherency.

In its initial implementation, the VMClock in QEMU (and EC2) only
resolves the last issue, by advertising a 'disruption' on live
migration so that the guest can know that its clock is hosed until it
manages to resync.
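
On the guest side, the disruption handling can be sketched roughly as
below. This is a heavily simplified illustration: the struct is NOT the
real VMClock ABI layout, the 'disruption_marker' field name is an
assumption based on the QEMU commit above, and the sequence counter
that protects concurrent updates is omitted entirely.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified view of the page shared by the hypervisor. */
struct vmclock_view {
    uint64_t disruption_marker; /* assumed to change on live migration */
};

/* Hypothetical guest-side bookkeeping. */
struct guest_clock_state {
    uint64_t last_marker;       /* marker seen at the last check */
    bool synced;                /* has the guest resynced since then? */
};

/*
 * Returns true if the guest may still trust its clock. A changed
 * marker invalidates the clock until the guest resyncs and sets
 * 'synced' again.
 */
bool guest_check_disruption(const struct vmclock_view *v,
                            struct guest_clock_state *st)
{
    if (v->disruption_marker != st->last_marker) {
        st->last_marker = v->disruption_marker;
        st->synced = false;
    }
    return st->synced;
}
```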

Now I'm trying to plumb in the actual clock information from the host,
so that migrated guests can have precision time from the moment they
arrive on the new host. There are two major use cases to consider...

1. Dedicated hosting setups will calibrate the host TSC *directly*
   against the external clock, and maybe feed it into the host kernel's
   adjtimex() almost as an afterthought. So userspace will be able to
   produce a system-wide VMClock data structure which can then be
   advertised to each guest with the appropriate TSC offset and scaling
   factor.

   For this I think we want the *actual* scaling factor to be exposed
   by KVM to userspace, not just the resulting estimated frequency.
   Unless we allow userspace just to provide the host's view and let
   KVM apply the offset/scale. Which maybe doesn't make as much sense
   in *this* setup but we might end up wanting that anyway for...

2. More traditional hosts just running Chrony/ntpd to feed the host's
   CLOCK_REALTIME with adjtimex(). For this case, there is probably
   more of an argument for letting the kernel generate the VMClock
   data: KVM already has the gtod notifier which is invoked every time
   the apparent frequency changes, and userspace has none of what it
   needs.

So... if we need KVM to be able to apply the per-VM scaling/offset
because we're going to do it all in-kernel in that second case, then we
might as well let KVM apply the per-VM scaling/offset even in the
dedicated hosting case. And then the API we use for the original CPUID
problem only needs to expose the actual effective frequency.
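
As a rough sketch of what "apply the per-VM scaling/offset" means here:
given the host's view of the clock (a reference counter value and a
period per tick), the guest's view follows from the TSC scaling ratio
and offset. Everything below is illustrative. The struct, the 32.32
nanosecond period format and the function names are assumptions, not
the VMClock ABI or a KVM interface, and the code relies on the
GCC/Clang __int128 extension.

```c
#include <stdint.h>

/* Hypothetical clock description: a reference point plus tick period. */
struct clock_view {
    uint64_t counter_ref;   /* counter value at the reference time */
    uint64_t period_frac;   /* ns per tick, scaled by 2^32 (assumed) */
};

/* guest_tsc = ((host_tsc * ratio) >> frac_bits) + offset */
uint64_t host_to_guest_tsc(uint64_t host_tsc, uint64_t ratio,
                           unsigned int frac_bits, uint64_t tsc_offset)
{
    unsigned __int128 scaled = (unsigned __int128)host_tsc * ratio;

    return (uint64_t)(scaled >> frac_bits) + tsc_offset;
}

/* Translate the host's view into the view a particular guest needs. */
struct clock_view host_to_guest_view(struct clock_view host, uint64_t ratio,
                                     unsigned int frac_bits,
                                     uint64_t tsc_offset)
{
    struct clock_view guest;

    /* The reference point moves into the guest's counter domain. */
    guest.counter_ref = host_to_guest_tsc(host.counter_ref, ratio,
                                          frac_bits, tsc_offset);
    /* Guest ticks at host_freq * ratio / 2^frac_bits, so each guest
     * tick lasts host_period * 2^frac_bits / ratio. */
    guest.period_frac = (uint64_t)
        (((unsigned __int128)host.period_frac << frac_bits) / ratio);
    return guest;
}
```

If KVM does this translation, userspace only ever needs the host's view
plus the effective frequency. If userspace does it, it needs 'ratio',
'frac_bits' and the offset themselves, which is the trade-off discussed
above.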

But if we want userspace to do more for itself, we'd need to expose the
scaling factors directly. I think...
