On Thu, Mar 27, 2025 at 1:10 AM David Woodhouse <dwmw2@xxxxxxxxxxxxx> wrote:
>
> On Wed, 2025-03-26 at 08:54 -0700, Ming Lin wrote:
> > I applied the patch series on top of 6.9 cleanly and tested it with my
> > debug tool patch.
> > But it seems the time drift still increased monotonically.
> >
> > Would you help take a look if the tool patch makes sense?
> > https://github.com/minggr/linux/commit/5284a211b6bdc9f9041b669539558a6a858e88d0
> >
> > The tool patch adds a KVM debugfs entry to trigger time calculations
> > and print the results.
> > See my first email for more detail.
>
> Your first message seemed to say that the problem occurred with live
> migration. This message says "the time drift still increased
> monotonically".

Yes, we discovered this issue in our production environment, where the
time inside the guest OS fell behind by more than 2 seconds. The problem
occurred both during local live upgrades and remote live migrations.
However, it is only noticeable after the guest OS has been running for a
long time, typically over 30 days.

Since 30 days is too long to wait, I wrote a debugfs tool to reproduce
the original issue quickly, but now I'm not sure whether the tool is
working correctly.

> Trying to make sure I fully understand... the time drift between the
> host's CLOCK_MONOTONIC_RAW and the guest's kvmclock increases
> monotonically *but* the guest only observes the change when its
> master_kernel_ns/master_cycle_now are updated (e.g. on live migration)
> and its kvmclock is reset back to the host's CLOCK_MONOTONIC_RAW?

Yes. We are using the 5.4 kernel and have verified that the guest OS
time remains correct after live upgrades/migrations as long as
master_kernel_ns/master_cycle_now are not updated (i.e., if the old
master_kernel_ns/master_cycle_now values are retained).

> Is this live migration from one VMM to another on the same host, so we
> don't have to worry about the accuracy of the TSC itself? The guest TSC
> remains consistent? And presumably your host does *have* a stable TSC,
> and the guest's test case really ought to be checking the
> PVCLOCK_TSC_STABLE_BIT to make sure of that?

The live migration is from one VMM to another on a remote host, but we
have also observed the same issue during live upgrades on the same host.

> If all the above assumptions/interpretations of mine are true, I still
> think it's expected that your clock will jump on live migration
> *unless* you also taught your VMM to use the new KVM_[GS]ET_CLOCK_GUEST
> ioctls which were added in my patch series, specifically to preserve
> the mathematical relationship between guest TSC and kvmclock across a
> migration.

We are planning to test the patches on a 6.9 kernel (where they apply
cleanly) and modify the live upgrade/migration code to use the new
KVM_[GS]ET_CLOCK_GUEST ioctls, roughly as in the sketch below.

BTW, what is the plan for upstreaming these patches?

Thanks,
Ming
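
P.S. For reference, here is a rough sketch of how I currently plan to
wire the new ioctls into our migration/upgrade path. This is only my
reading of your series: the ioctl names come from the patches, but the
payload type (I'm assuming the kvmclock pvclock_vcpu_time_info
structure) and whether these are vCPU or VM ioctls are guesses on my
part, so the real code will follow whatever the patched uapi headers
define.

#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/types.h>
#include <linux/kvm.h>  /* must come from a kernel with the series applied */

/*
 * kvmclock ABI structure, as I understand the payload. In a real build
 * this should come from the patched kernel headers; remove this local
 * definition if the headers already provide it.
 */
struct pvclock_vcpu_time_info {
	__u32 version;
	__u32 pad0;
	__u64 tsc_timestamp;
	__u64 system_time;
	__u32 tsc_to_system_mul;
	__s8  tsc_shift;
	__u8  flags;
	__u8  pad[2];
};

/* Source side: capture the current guest TSC <-> kvmclock relationship. */
static void save_guest_clock(int vcpu_fd, struct pvclock_vcpu_time_info *ti)
{
	if (ioctl(vcpu_fd, KVM_GET_CLOCK_GUEST, ti) < 0) {
		perror("KVM_GET_CLOCK_GUEST");
		exit(1);
	}
	/* ... serialize *ti into the migration/upgrade stream ... */
}

/*
 * Destination side: after the guest TSC state has been restored,
 * reinstall the saved relationship so kvmclock continues from the same
 * (TSC, time) pairing instead of being re-derived from the new host's
 * CLOCK_MONOTONIC_RAW.
 */
static void restore_guest_clock(int vcpu_fd,
				const struct pvclock_vcpu_time_info *ti)
{
	if (ioctl(vcpu_fd, KVM_SET_CLOCK_GUEST, ti) < 0) {
		perror("KVM_SET_CLOCK_GUEST");
		exit(1);
	}
}

If I've misunderstood how the ioctls are meant to be used, please let
me know.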