On Mon, Jun 02, 2025 at 12:12:51PM -0700, Chang S. Bae wrote: >== Preempt Case == > >To illustrate how the XFD MSR state becomes incorrect in this scenario: > > task #1 (fpstate->xfd=0) task #2 (fpstate->xfd=0x80000) > ======================== ============================== > handle_signal() > -> setup_rt_frame() > -> get_siframe() > -> copy_fpstate_to_sigframe() > -> fpregs_unlock() > ... > ... > switch_fpu_return() > -> fpregs_restore_userregs() > -> restore_fpregs_from_fpstate() > -> xfd_write_state() > ^ IA32_XFD_MSR = 0 > ... > ... > -> fpu__clear_user_states() > -> fpregs_lock() > -> restore_fpregs_from_init_fpstate() > -> os_rstor() > -> xfd_validate_state() > ^ IA32_XFD_MSR != fpstate->xfd > -> fpregs_mark_active() > -> fpregs_unlock() > >Since fpu__clear_user_states() marks the FPU state as valid in the end, an >XFD MSR sync-up was clearly missing. Thanks for this analysis. It makes sense. > >== Return-to-Userspace Path == > >Both tasks at that moment are on the return-to-userspace path, but at >different points in IRQ state: > > * task #2 is inside handle_signal() and already re-enabled IRQs. > * task #1 is after IRQ is disabled again when calling > switch_fpu_return(). > > local_irq_disable_exit_to_user() > exit_to_user_mode_prepare() > -> exit_to_user_mode_loop() > -> local_irq_enable_exit_to_user() > -> arch_do_signal_or_restart() > -> handle_signal() > -> local_irq_disable_exit_to_user() > -> arch_exit_user_mode_prepare() > -> arch_exit_work() > -> switch_fpu_return() > >This implies that fpregs_lock()/fpregs_unlock() is necessary inside >handle_signal() when XSAVE instructions are invoked. > >But, it should be okay for switch_fpu_return() to call >fpregs_restore_userregs() without fpregs_lock(). > >== XFD Sanity Checker == > >The XFD sanity checker -- xfd_op_valid() -- correctly caught this issue in >the test case. However, it may have a false negative when AMX usage was >flipped between the two tasks. > >Despite that, I don't think extending its coverage is worthwhile, as it would >complicate the logic. The current logic and documentation seem sound. > >== Fix Consideration == > >I think the fix is straightforward: resynchronize the IA32_XFD MSR in >fpu__clear_user_states(). This fix sounds good. Btw, what do you think the impact of this issue might be? Aside from the splat, task #2 could execute AMX instructions without requesting permissions, but its AMX state would be discarded during the next FPU switch, as RFBM[18] is cleared when executing XSAVES. And, in the "flipped" scenario you mentioned, task #2 might receive an extra #NM, after which its fpstate would be re-allocated (although the size won't increase further). So, for well-behaved tasks that never use AMX, there is no impact; tasks that use AMX may receive extra #NM. There won't be any unexpected #PF, memory corruption, or kernel panic.