> On 22. Jul 2025, at 12:27, Xiaoyao Li <xiaoyao.li@xxxxxxxxx> wrote: > > On 7/22/2025 5:21 PM, Mathias Krause wrote: >> On 22.07.25 05:45, Xiaoyao Li wrote: >>> On 6/20/2025 3:42 AM, Mathias Krause wrote: >>>> KVM has a weird behaviour when a guest executes VMCALL on an AMD system >>>> or VMMCALL on an Intel CPU. Both naturally generate an invalid opcode >>>> exception (#UD) as they are just the wrong instruction for the CPU >>>> given. But instead of forwarding the exception to the guest, KVM tries >>>> to patch the guest instruction to match the host's actual hypercall >>>> instruction. That is doomed to fail as read-only code is rather the >>>> standard these days. But, instead of letting go the patching attempt and >>>> falling back to #UD injection, KVM injects the page fault instead. >>>> >>>> That's wrong on multiple levels. Not only isn't that a valid exception >>>> to be generated by these instructions, confusing attempts to handle >>>> them. It also destroys guest state by doing so, namely the value of CR2. >>>> >>>> Sean attempted to fix that in KVM[1] but the patch was never applied. >>>> >>>> Later, Oliver added a quirk bit in [2] so the behaviour can, at least, >>>> conceptually be disabled. Paolo even called out to add this very >>>> functionality to disable the quirk in QEMU[3]. So lets just do it. >>>> >>>> A new property 'hypercall-patching=on|off' is added, for the very >>>> unlikely case that there are setups that really need the patching. >>>> However, these would be vulnerable to memory corruption attacks freely >>>> overwriting code as they please. So, my guess is, there are exactly 0 >>>> systems out there requiring this quirk. >>> >>> The default behavior is patching the hypercall for many years. >>> >>> If you desire to change the default behavior, please at least keep it >>> unchanged for old machine version. i.e., introduce compat_property, >>> which sets KVMState->hypercall_patching_enabled to true. >> Well, the thing is, KVM's patching is done with the effective >> permissions of the guest which means, if the code in question isn't >> writable from the guest's point of view, KVM's attempt to modify it will >> fail. This failure isn't transparent for the guest as it sees a #PF >> instead of a #UD, and that's what I'm trying to fix by disabling the quirk. >> The hypercall patching was introduced in Linux commit 7aa81cc04781 >> ("KVM: Refactor hypercall infrastructure (v3)") in v2.6.25. Until then >> it was based on a dedicated hypercall page that was handled by KVM to >> use the proper instruction of the KVM module in use (VMX or SVM). >> Patching code was fine back then, but the introduction of DEBUG_RO_DATA >> made the patching attempts fail and, ultimately, lead to Paolo handle >> this with commit c1118b3602c2 ("x86: kvm: use alternatives for VMCALL >> vs. VMMCALL if kernel text is read-only"). >> However, his change still doesn't account for the cross-vendor live >> migration case (Intel<->AMD), which will still be broken, causing the >> before mentioned bogus #PF, which will just lead to misleading Oops >> reports, confusing the poor souls, trying to make sense of it. >> IMHO, there is no valid reason for still having the patching in place as >> the .text of non-ancient kernel's will be write-protected, making >> patching attempts fail. And, as they fail with a #PF instead of #UD, the >> guest cannot even handle them appropriately, as there was no memory >> write attempt from its point of view. Therefore the default should be to >> disable it, IMO. This won't prevent guests making use of the wrong >> instruction from trapping, but, at least, now they'll get the correct >> exception vector and can handle it appropriately. > > But you don't accout for the case that guest kernel is built without CONFIG_STRICT_KERNEL_RWX enabled, or without CONFIG_DEBUG_RO_DATA, or for whatever reason the guest's text is not readonly, and the VM needs to be migrated among different vendors (Intel <-> AMD). > > Before this patch, the above usecase works well. But with this patch, the guest will gets #UD after migrated to different vendors. > > I heard from some small CSPs that they do want to the ability to live migrate VMs among Intel and AMD host. > Is there a particular reason to not handle that #UD as a hypercall to support that scenario? Especially with some guest OS kernels having kernel patch protection with periodic scrubbing of r/o memory (ie Windows), doesn’t sound great to override anything in a way the guest OS kernel can notice… -Mohamed >>> >>>> [1] https://lore.kernel.org/kvm/20211210222903.3417968-1- >>>> seanjc@xxxxxxxxxx/ >>>> [2] https://lore.kernel.org/kvm/20220316005538.2282772-2- >>>> oupton@xxxxxxxxxx/ >>>> [3] https://lore.kernel.org/kvm/80e1f1d2-2d79-22b7-6665- >>>> c00e4fe9cb9c@xxxxxxxxxx/ >>>> >>>> Cc: Oliver Upton <oliver.upton@xxxxxxxxx> >>>> Cc: Sean Christopherson <seanjc@xxxxxxxxxx> >>>> Cc: Paolo Bonzini <pbonzini@xxxxxxxxxx> >>>> Signed-off-by: Mathias Krause <minipli@xxxxxxxxxxxxxx> >>>> --- >>>> include/system/kvm_int.h | 1 + >>>> qemu-options.hx | 10 ++++++++++ >>>> target/i386/kvm/kvm.c | 38 ++++++++++++++++++++++++++++++++++++++ >>>> 3 files changed, 49 insertions(+) >>>> >>>> diff --git a/include/system/kvm_int.h b/include/system/kvm_int.h >>>> index 756a3c0a250e..fd7129824429 100644 >>>> --- a/include/system/kvm_int.h >>>> +++ b/include/system/kvm_int.h >>>> @@ -159,6 +159,7 @@ struct KVMState >>>> uint64_t kvm_eager_split_size; /* Eager Page Splitting chunk >>>> size */ >>>> struct KVMDirtyRingReaper reaper; >>>> struct KVMMsrEnergy msr_energy; >>>> + bool hypercall_patching_enabled; >>> >>> IMHO, we can just name it "hypercall_patching". >>> >>> Since it's a boolean type, true means enabled and false means disabled. >> Ok, makes sense. >>> >>>> NotifyVmexitOption notify_vmexit; >>>> uint32_t notify_window; >>>> uint32_t xen_version; >>>> diff --git a/qemu-options.hx b/qemu-options.hx >>>> index 1f862b19a676..c2e232649c19 100644 >>>> --- a/qemu-options.hx >>>> +++ b/qemu-options.hx >>>> @@ -231,6 +231,7 @@ DEF("accel", HAS_ARG, QEMU_OPTION_accel, >>>> " dirty-ring-size=n (KVM dirty ring GFN count, >>>> default 0)\n" >>>> " eager-split-size=n (KVM Eager Page Split chunk >>>> size, default 0, disabled. ARM only)\n" >>>> " notify-vmexit=run|internal-error| >>>> disable,notify-window=n (enable notify VM exit and set notify window, >>>> x86 only)\n" >>>> + " hypercall-patching=on|off (enable KVM's VMCALL/ >>>> VMMCALL hypercall patching quirk, x86 only)\n" >>>> " thread=single|multi (enable multi-threaded TCG)\n" >>>> " device=path (KVM device path, default /dev/ >>>> kvm)\n", QEMU_ARCH_ALL) >>>> SRST >>>> @@ -313,6 +314,15 @@ SRST >>>> open up for a specified of time (i.e. notify-window). >>>> Default: notify-vmexit=run,notify-window=0. >>>> + ``hypercall-patching=on|off`` >>>> + KVM tries to recover from the wrong hypercall instruction >>>> being used by >>>> + a guest by attempting to rewrite it to the one supported >>>> natively by >>>> + the host CPU (VMCALL on Intel, VMMCALL for AMD systems). >>>> However, this >>>> + patching may fail if the guest memory is write protected, >>>> leading to a >>>> + page fault getting propagated to the guest instead of an illegal >>>> + instruction exception. As this may confuse guests, it gets >>>> disabled by >>>> + default (x86 only). >>>> + >>>> ``device=path`` >>>> Sets the path to the KVM device node. Defaults to ``/dev/ >>>> kvm``. This >>>> option can be used to pass the KVM device to use via a file >>>> descriptor >>>> diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c >>>> index 56a6b9b6381a..6f5f3b95e553 100644 >>>> --- a/target/i386/kvm/kvm.c >>>> +++ b/target/i386/kvm/kvm.c >>>> @@ -3224,6 +3224,19 @@ static int kvm_vm_enable_energy_msrs(KVMState *s) >>>> return 0; >>>> } >>>> +static int kvm_vm_disable_hypercall_patching(KVMState *s) >>>> +{ >>>> + int valid_quirks = kvm_vm_check_extension(s, >>>> KVM_CAP_DISABLE_QUIRKS2); >>>> + >>>> + if (valid_quirks & KVM_X86_QUIRK_FIX_HYPERCALL_INSN) { >>>> + return kvm_vm_enable_cap(s, KVM_CAP_DISABLE_QUIRKS2, 0, >>>> + KVM_X86_QUIRK_FIX_HYPERCALL_INSN); >>>> + } >>>> + >>>> + warn_report("kvm: disabling hypercall patching not supported"); >>> >>> It's not clear it's 1) KVM doesn't support/has FIX_HYPERCALL_INSN quirk >>> or 2) KVM has FIX_HYPERCALL_INSN quirk but doesn't allow it to be >>> disabled, when KVM_X86_QUIRK_FIX_HYPERCALL_INSN is not returned in >>> KVM_CAP_DISABLE_QUIRKS2. >>> >>> If it's case 1), it can be treated as hypercall patching is disabled >>> thus no warning is expected. >>> >>> So, I think it requires a new cap in KVM to return the enabled quirks. >> KVM_CAP_DISABLE_QUIRKS2 fixes that bug of KVM_CAP_DISABLE_QUIRKS by >> doing just that: returning the mask of supported quirks when queried via >> KVM_CHECK_EXTENSION. So if KVM_X86_QUIRK_FIX_HYPERCALL_INSN is in that >> mask, it can also be disabled. If that attempt fails (for whatever >> reason), it's an error, which makes kvm_vm_enable_cap() return a >> non-zero value, triggering the warn_report("kvm: failed to disable >> hypercall patching quirk") in the caller. >> If KVM_X86_QUIRK_FIX_HYPERCALL_INSN is missing in the >> KVM_CAP_DISABLE_QUIRKS2 mask, it may either be that KVM is too old to >> even have the hypercall patching (pre-v2.6.25) or does do the patching, >> just doesn't have KVM_X86_QUIRK_FIX_HYPERCALL_INSN yet, which came in >> Linux commit f1a9761fbb00 ("KVM: x86: Allow userspace to opt out of >> hypercall patching"), which is v5.19. >> Ignoring pre-v2.6.25 kernels for a moment, we can assume that KVM will >> do the patching. So the lack of KVM_X86_QUIRK_FIX_HYPERCALL_INSN but >> having 'hypercall_patching_enabled == false' indicates that the user >> wants to disable it but QEMU cannot do so, because KVM lacks the >> extension to do so. This, IMO, legitimizes the warn_report("kvm: >> disabling hypercall patching not supported") -- as it's not supported. > > The minimum supported kernel version is 4.5 (see commit f180e367fce4). > So pre-v2.6.25 kernels is not the case. > > Surely we can list all the cases of different versions of KVM starting from v4.5 and draw the conclusion that the semantics of "valid_quirks & KVM_X86_QUIRK_FIX_HYPERCALL_INSN == 0" means KVM enables the hypercall patching quirk but don't provide the interface for userspace to disable it. So the code logic is correct. > > My statement of "I think it requires a new cap in KVM to return the enabled quirks" is more for generic consideration. i.e., QEMU can know whether a quirk is enabled or not without analysing the detailed history of KVM. > > Of course, it's more of the requirement on KVM to provide new interface and Current QEMU can do nothing on it. That is, current implementation of this PATCH is OK to me. > >>> >>>> + return 0; >>> >>> I think return 0 here is to avoid the warn_report() in the caller. But >>> for the correct semantics, we need to return -1 to indicate that it >>> fails to disable the hypercall patching? >> No, returning 0 here is very much on purpose, as you noticed, to avoid >> the warn_report() in the caller. The already issued warn_report() is the >> correct one for this case. > > We can use @Error to pass the error log instead of the trick on return value. > > e.g., > > --- a/target/i386/kvm/kvm.c > +++ b/target/i386/kvm/kvm.c > @@ -3228,17 +3228,24 @@ static int kvm_vm_enable_energy_msrs(KVMState *s) > return 0; > } > > -static int kvm_vm_disable_hypercall_patching(KVMState *s) > +static int kvm_vm_disable_hypercall_patching(KVMState *s, Error **errp) > { > int valid_quirks = kvm_vm_check_extension(s, KVM_CAP_DISABLE_QUIRKS2); > + int ret = -1; > > if (valid_quirks & KVM_X86_QUIRK_FIX_HYPERCALL_INSN) { > - return kvm_vm_enable_cap(s, KVM_CAP_DISABLE_QUIRKS2, 0, > - KVM_X86_QUIRK_FIX_HYPERCALL_INSN); > + ret = kvm_vm_enable_cap(s, KVM_CAP_DISABLE_QUIRKS2, 0, > + KVM_X86_QUIRK_FIX_HYPERCALL_INSN); > + if (ret) { > + error_setg_errno(errp, -ret, "kvm: failed to disable " > + "hypercall patching quirk: %s", > + strerror(-ret)); > + } > + } else { > + error_setg(errp, "kvm: disabling hypercall patching not supported"); > } > > - warn_report("kvm: disabling hypercall patching not supported"); > - return 0; > + return ret; > } > > int kvm_arch_init(MachineState *ms, KVMState *s) > @@ -3381,8 +3388,8 @@ int kvm_arch_init(MachineState *ms, KVMState *s) > } > > if (s->hypercall_patching_enabled == false) { > - if (kvm_vm_disable_hypercall_patching(s)) { > - warn_report("kvm: failed to disable hypercall patching quirk"); > + if (kvm_vm_disable_hypercall_patching(s, &local_err)) { > + error_report_err(local_err); > } > } > >> I guess, it's a question of if disabling hypercall patching is a hard >> requirement or can soft-fail. I decided for the latter, as hypercall >> patching shouldn't be needed in most cases, so if it cannot be disabled, >> it's mostly fine to start the VM still. >>> >>>> +} >>>> + >>>> int kvm_arch_init(MachineState *ms, KVMState *s) >>>> { >>>> int ret; >>>> @@ -3363,6 +3376,12 @@ int kvm_arch_init(MachineState *ms, KVMState *s) >>>> } >>>> } >>>> + if (s->hypercall_patching_enabled == false) { >>>> + if (kvm_vm_disable_hypercall_patching(s)) { >>>> + warn_report("kvm: failed to disable hypercall patching >>>> quirk"); >>>> + } >>>> + } >>>> + >>>> return 0; >>>> } >>>> @@ -6456,6 +6475,19 @@ void kvm_request_xsave_components(X86CPU >>>> *cpu, uint64_t mask) >>>> } >>>> } >>>> +static bool kvm_arch_get_hypercall_patching(Object *obj, Error **errp) >>>> +{ >>>> + KVMState *s = KVM_STATE(obj); >>>> + return s->hypercall_patching_enabled; >>>> +} >>>> + >>>> +static void kvm_arch_set_hypercall_patching(Object *obj, bool value, >>>> + Error **errp) >>>> +{ >>>> + KVMState *s = KVM_STATE(obj); >>>> + s->hypercall_patching_enabled = value; >>>> +} >>>> + >>>> static int kvm_arch_get_notify_vmexit(Object *obj, Error **errp) >>>> { >>>> KVMState *s = KVM_STATE(obj); >>>> @@ -6589,6 +6621,12 @@ static void >>>> kvm_arch_set_xen_evtchn_max_pirq(Object *obj, Visitor *v, >>>> void kvm_arch_accel_class_init(ObjectClass *oc) >>>> { >>>> + object_class_property_add_bool(oc, "hypercall-patching", >>>> + kvm_arch_get_hypercall_patching, >>>> + kvm_arch_set_hypercall_patching); >>>> + object_class_property_set_description(oc, "hypercall-patching", >>>> + "Enable hypercall patching >>>> quirk"); >>>> + >>>> object_class_property_add_enum(oc, "notify-vmexit", >>>> "NotifyVMexitOption", >>>> &NotifyVmexitOption_lookup, >>>> kvm_arch_get_notify_vmexit, >>> >> Thanks, >> Mathias > >