On Fri, May 9, 2025 at 5:49 PM Andrew Jones <ajones@xxxxxxxxxxxxxxxx> wrote: > > On Fri, May 09, 2025 at 05:33:49PM +0530, Anup Patel wrote: > > On Fri, May 9, 2025 at 2:16 PM Radim Krčmář <rkrcmar@xxxxxxxxxxxxxxxx> wrote: > > > > > > 2025-05-09T12:25:24+05:30, Anup Patel <anup@xxxxxxxxxxxxxx>: > > > > On Thu, May 8, 2025 at 8:01 PM Radim Krčmář <rkrcmar@xxxxxxxxxxxxxxxx> wrote: > > > >> > > > >> Add a toggleable VM capability to modify several reset related code > > > >> paths. The goals are to > > > >> 1) Allow userspace to reset any VCPU. > > > >> 2) Allow userspace to provide the initial VCPU state. > > > >> > > > >> (Right now, the boot VCPU isn't reset by KVM and KVM sets the state for > > > >> VCPUs brought up by sbi_hart_start while userspace for all others.) > > > >> > > > >> The goals are achieved with the following changes: > > > >> * Reset the VCPU when setting MP_STATE_INIT_RECEIVED through IOCTL. > > > > > > > > Rather than using separate MP_STATE_INIT_RECEIVED ioctl(), we can > > > > define a capability which when set, the set_mpstate ioctl() will reset the > > > > VCPU upon changing VCPU state from RUNNABLE to STOPPED state. > > > > > > Yeah, I started with that and then realized it has two drawbacks: > > > > > > * It will require larger changes in userspaces, because for > > > example QEMU now first loads the initial state and then toggles the > > > mp_state, which would incorrectly reset the state. > > > > > > * It will also require an extra IOCTL if a stopped VCPU should be > > > reset > > > 1) STOPPED -> RUNNING (= reset) > > > 2) RUNNING -> STOPPED (VCPU should be stopped) > > > or if the current state of a VCPU is not known. > > > 1) ??? -> STOPPED > > > 2) STOPPED -> RUNNING > > > 3) RUNNING -> STOPPED > > > > > > I can do that for v3 if you think it's better. > > > > Okay, for now keep the MP_STATE_INIT_RECEIVED ioctl() > > > > > > > > >> * Preserve the userspace initialized VCPU state on sbi_hart_start. > > > >> * Return to userspace on sbi_hart_stop. > > > > > > > > There is no userspace involvement required when a Guest VCPU > > > > stops itself using SBI HSM stop() call so STRONG NO to this change. > > > > > > Ok, I'll drop it from v3 -- it can be handled by future patches that > > > trap SBI calls to userspace. > > > > > > The lack of userspace involvement is the issue. KVM doesn't know what > > > the initial state should be. > > > > The SBI HSM virtualization does not need any KVM userspace > > involvement. > > > > When a VCPU stops itself using SBI HSM stop(), the Guest itself > > provides the entry address and argument when starting the VCPU > > using SBI HSM start() without any KVM userspace involvement. > > > > In fact, even at Guest boot time all non-boot VCPUs are brought-up > > using SBI HSM start() by the boot VCPU where the Guest itself > > provides entry address and argument without any KVM userspace > > involvement. > > HSM only specifies the state of a few registers and the ISA only a few > more. All other registers have IMPDEF reset values. Zeros, like KVM > selects, are a good choice and the best default, but if the VMM disagrees, > then it should be allowed to select what it likes, as the VMM/user is the > policy maker and KVM is "just" the accelerator. Till now there are no such IMPDEF reset values expected. We will cross that bridge when needed. Although, I doubt we will ever need it. > > > > > > > > > >> * Don't make VCPU reset request on sbi_system_suspend. > > > > > > > > The entry state of initiating VCPU is already available on SBI system > > > > suspend call. The initiating VCPU must be resetted and entry state of > > > > initiating VCPU must be setup. > > > > > > Userspace would simply call the VCPU reset and set the complete state, > > > because the userspace exit already provides all the sbi information. > > > > > > I'll drop this change. It doesn't make much sense if we aren't fixing > > > the sbi_hart_start reset. > > > > > > >> The patch is reusing MP_STATE_INIT_RECEIVED, because we didn't want to > > > >> add a new IOCTL, sorry. :) > > > >> > > > >> Signed-off-by: Radim Krčmář <rkrcmar@xxxxxxxxxxxxxxxx> > > > >> --- > > > >> If you search for cap 7.42 in api.rst, you'll see that it has a wrong > > > >> number, which is why we're 7.43, in case someone bothers to fix ARM. > > > >> > > > >> I was also strongly considering creating all VCPUs in RUNNABLE state -- > > > >> do you know of any similar quirks that aren't important, but could be > > > >> fixed with the new userspace toggle? > > > > > > > > Upon creating a VM, only one VCPU should be RUNNABLE and all > > > > other VCPUs must remain in OFF state. This is intentional because > > > > imagine a large number of VCPUs entering Guest OS at the same > > > > time. We have spent a lot of effort in the past to get away from this > > > > situation even in the host boot flow. We can't expect user space to > > > > correctly set the initial MP_STATE of all VCPUs. We can certainly > > > > think of some mechanism using which user space can specify > > > > which VCPU should be runnable upon VM creation. > > > > > > We already do have the mechanism -- the userspace will set MP_STATE of > > > VCPU 0 to STOPPED and whatever VCPUs it wants as boot with to RUNNABLE > > > before running all the VCPUs for the first time. > > > > Okay, nothing to be done on this front. > > > > > > > > The userspace must correctly set the initial MP state anyway, because a > > > resume will want a mp_state that a fresh boot. > > > > > > > The current approach is to do HSM state management in kernel > > > > space itself and not rely on user space. Allowing userspace to > > > > resetting any VCPU is fine but this should not affect the flow for > > > > SBI HSM, SBI System Reset, and SBI System Suspend. > > > > > > Yes, that is the design I was trying to change. I think userspace > > > should have control over all aspects of the guest it executes in KVM. > > > > For SBI HSM, the kernel space should be the only entity managing. > > The VMM should always be the only manager. KVM can provide defaults in > order to support simple VMMs that don't have opinions, but VMMs concerned > with managing all state on behalf of their users' choices and to ensure > successful migrations, will want to be involved. I think you misunderstood my comment. VMM is still the over manager but the VCPU SBI HSM states are managed by the kernel KVM. Regards, Anup > > Thanks, > drew > > > > > > > > > Accelerating SBI in KVM is good, but userspace should be able to say how > > > the unspecified parts are implemented. Trapping to userspace is the > > > simplest option. (And sufficient for ecalls that are not a hot path.) > > > > > > > For the unspecified parts, we have KVM exits at appropriate places > > e.g. SBI system reset, SBI system suspend, etc. > > > > Regards, > > Anup