Re: [PATCH v2 2/2] RISC-V: KVM: add KVM_CAP_RISCV_MP_STATE_RESET

Anup Patel <anup@xxxxxxxxxxxxxx> · Fri, 9 May 2025 17:59:28 +0530

On Fri, May 9, 2025 at 5:49 PM Andrew Jones <ajones@xxxxxxxxxxxxxxxx> wrote:
>
> On Fri, May 09, 2025 at 05:33:49PM +0530, Anup Patel wrote:
> > On Fri, May 9, 2025 at 2:16 PM Radim Krčmář <rkrcmar@xxxxxxxxxxxxxxxx> wrote:
> > >
> > > 2025-05-09T12:25:24+05:30, Anup Patel <anup@xxxxxxxxxxxxxx>:
> > > > On Thu, May 8, 2025 at 8:01 PM Radim Krčmář <rkrcmar@xxxxxxxxxxxxxxxx> wrote:
> > > >>
> > > >> Add a toggleable VM capability to modify several reset related code
> > > >> paths.  The goals are to
> > > >>  1) Allow userspace to reset any VCPU.
> > > >>  2) Allow userspace to provide the initial VCPU state.
> > > >>
> > > >> (Right now, the boot VCPU isn't reset by KVM and KVM sets the state for
> > > >>  VCPUs brought up by sbi_hart_start while userspace for all others.)
> > > >>
> > > >> The goals are achieved with the following changes:
> > > >>  * Reset the VCPU when setting MP_STATE_INIT_RECEIVED through IOCTL.
> > > >
> > > > Rather than using separate MP_STATE_INIT_RECEIVED ioctl(), we can
> > > > define a capability which when set, the set_mpstate ioctl() will reset the
> > > > VCPU upon changing VCPU state from RUNNABLE to STOPPED state.
> > >
> > > Yeah, I started with that and then realized it has two drawbacks:
> > >
> > >  * It will require larger changes in userspaces, because for
> > >    example QEMU now first loads the initial state and then toggles the
> > >    mp_state, which would incorrectly reset the state.
> > >
> > >  * It will also require an extra IOCTL if a stopped VCPU should be
> > >    reset
> > >     1) STOPPED -> RUNNING (= reset)
> > >     2) RUNNING -> STOPPED (VCPU should be stopped)
> > >    or if the current state of a VCPU is not known.
> > >     1) ???     -> STOPPED
> > >     2) STOPPED -> RUNNING
> > >     3) RUNNING -> STOPPED
> > >
> > > I can do that for v3 if you think it's better.
> >
> > Okay, for now keep the MP_STATE_INIT_RECEIVED ioctl()
> >
> > >
> > > >>  * Preserve the userspace initialized VCPU state on sbi_hart_start.
> > > >>  * Return to userspace on sbi_hart_stop.
> > > >
> > > > There is no userspace involvement required when a Guest VCPU
> > > > stops itself using SBI HSM stop() call so STRONG NO to this change.
> > >
> > > Ok, I'll drop it from v3 -- it can be handled by future patches that
> > > trap SBI calls to userspace.
> > >
> > > The lack of userspace involvement is the issue.  KVM doesn't know what
> > > the initial state should be.
> >
> > The SBI HSM virtualization does not need any KVM userspace
> > involvement.
> >
> > When a VCPU stops itself using SBI HSM stop(), the Guest itself
> > provides the entry address and argument when starting the VCPU
> > using SBI HSM start() without any KVM userspace involvement.
> >
> > In fact, even at Guest boot time all non-boot VCPUs are brought-up
> > using SBI HSM start() by the boot VCPU where the Guest itself
> > provides entry address and argument without any KVM userspace
> > involvement.
>
> HSM only specifies the state of a few registers and the ISA only a few
> more. All other registers have IMPDEF reset values. Zeros, like KVM
> selects, are a good choice and the best default, but if the VMM disagrees,
> then it should be allowed to select what it likes, as the VMM/user is the
> policy maker and KVM is "just" the accelerator.

Till now there are no such IMPDEF reset values expected. We will
cross that bridge when needed. Although, I doubt we will ever need it.

>
> >
> > >
> > > >>  * Don't make VCPU reset request on sbi_system_suspend.
> > > >
> > > > The entry state of initiating VCPU is already available on SBI system
> > > > suspend call. The initiating VCPU must be resetted and entry state of
> > > > initiating VCPU must be setup.
> > >
> > > Userspace would simply call the VCPU reset and set the complete state,
> > > because the userspace exit already provides all the sbi information.
> > >
> > > I'll drop this change.  It doesn't make much sense if we aren't fixing
> > > the sbi_hart_start reset.
> > >
> > > >> The patch is reusing MP_STATE_INIT_RECEIVED, because we didn't want to
> > > >> add a new IOCTL, sorry. :)
> > > >>
> > > >> Signed-off-by: Radim Krčmář <rkrcmar@xxxxxxxxxxxxxxxx>
> > > >> ---
> > > >> If you search for cap 7.42 in api.rst, you'll see that it has a wrong
> > > >> number, which is why we're 7.43, in case someone bothers to fix ARM.
> > > >>
> > > >> I was also strongly considering creating all VCPUs in RUNNABLE state --
> > > >> do you know of any similar quirks that aren't important, but could be
> > > >> fixed with the new userspace toggle?
> > > >
> > > > Upon creating a VM, only one VCPU should be RUNNABLE and all
> > > > other VCPUs must remain in OFF state. This is intentional because
> > > > imagine a large number of VCPUs entering Guest OS at the same
> > > > time. We have spent a lot of effort in the past to get away from this
> > > > situation even in the host boot flow. We can't expect user space to
> > > > correctly set the initial MP_STATE of all VCPUs. We can certainly
> > > > think of some mechanism using which user space can specify
> > > > which VCPU should be runnable upon VM creation.
> > >
> > > We already do have the mechanism -- the userspace will set MP_STATE of
> > > VCPU 0 to STOPPED and whatever VCPUs it wants as boot with to RUNNABLE
> > > before running all the VCPUs for the first time.
> >
> > Okay, nothing to be done on this front.
> >
> > >
> > > The userspace must correctly set the initial MP state anyway, because a
> > > resume will want a mp_state that a fresh boot.
> > >
> > > > The current approach is to do HSM state management in kernel
> > > > space itself and not rely on user space. Allowing userspace to
> > > > resetting any VCPU is fine but this should not affect the flow for
> > > > SBI HSM, SBI System Reset, and SBI System Suspend.
> > >
> > > Yes, that is the design I was trying to change.  I think userspace
> > > should have control over all aspects of the guest it executes in KVM.
> >
> > For SBI HSM, the kernel space should be the only entity managing.
>
> The VMM should always be the only manager. KVM can provide defaults in
> order to support simple VMMs that don't have opinions, but VMMs concerned
> with managing all state on behalf of their users' choices and to ensure
> successful migrations, will want to be involved.

I think you misunderstood my comment. VMM is still the over manager
but the VCPU SBI HSM states are managed by the kernel KVM.

Regards,
Anup

>
> Thanks,
> drew
>
> >
> > >
> > > Accelerating SBI in KVM is good, but userspace should be able to say how
> > > the unspecified parts are implemented.  Trapping to userspace is the
> > > simplest option.  (And sufficient for ecalls that are not a hot path.)
> > >
> >
> > For the unspecified parts, we have KVM exits at appropriate places
> > e.g. SBI system reset, SBI system suspend, etc.
> >
> > Regards,
> > Anup