On Tue, Apr 01, 2025 at 01:20:37PM +0200, Jan Kara wrote:
> On Mon 31-03-25 21:13:20, James Bottomley wrote:
> > On Tue, 2025-04-01 at 01:32 +0200, Christian Brauner wrote:
> > > On Mon, Mar 31, 2025 at 03:51:43PM -0400, James Bottomley wrote:
> > > > On Thu, 2025-03-27 at 10:06 -0400, James Bottomley wrote:
> > > > [...]
> > > > > -static void percpu_rwsem_wait(struct percpu_rw_semaphore *sem, bool reader)
> > > > > +static void percpu_rwsem_wait(struct percpu_rw_semaphore *sem, bool reader,
> > > > > +                              bool freeze)
> > > > >  {
> > > > >      DEFINE_WAIT_FUNC(wq_entry, percpu_rwsem_wake_function);
> > > > >      bool wait;
> > > > > @@ -156,7 +157,8 @@ static void percpu_rwsem_wait(struct percpu_rw_semaphore *sem, bool reader)
> > > > >      spin_unlock_irq(&sem->waiters.lock);
> > > > >
> > > > >      while (wait) {
> > > > > -        set_current_state(TASK_UNINTERRUPTIBLE);
> > > > > +        set_current_state(TASK_UNINTERRUPTIBLE |
> > > > > +                          freeze ? TASK_FREEZABLE : 0);
> > > >
> > > > This is a bit embarrassing, the bug I've been chasing is here: the ?
> > > > operator is lower in precedence than | meaning this expression always
> > > > evaluates to TASK_FREEZABLE and nothing else (which is why the process
> > > > goes into R state and never wakes up).
> > > >
> > > > Let me fix that and redo all the testing.
> > >
> > > I don't think that's it. I think you're missing making pagefault writers such
> > > as systemd-journald freezable:
> > >
> > > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > > index b379a46b5576..528e73f192ac 100644
> > > --- a/include/linux/fs.h
> > > +++ b/include/linux/fs.h
> > > @@ -1782,7 +1782,8 @@ static inline void __sb_end_write(struct super_block *sb, int level)
> > >  static inline void __sb_start_write(struct super_block *sb, int level)
> > >  {
> > >      percpu_down_read_freezable(sb->s_writers.rw_sem + level - 1,
> > > -                               level == SB_FREEZE_WRITE);
> > > +                               (level == SB_FREEZE_WRITE ||
> > > +                                level == SB_FREEZE_PAGEFAULT));
> > >  }
> >
> > Yes, I was about to tell Jan that the condition here simply needs to be
> > true. All our rwsem levels need to be freezable to avoid a hibernation
> > failure.
>
> So there is one snag with this. SB_FREEZE_PAGEFAULT level is acquired under
> mmap_sem, SB_FREEZE_INTERNAL level is possibly acquired under some other
> filesystem locks. So if you freeze the filesystem, a task can block on
> frozen filesystem with e.g. mmap_sem held and if some other task then

Yeah, I wondered about that yesterday.

> blocks on grabbing that mmap_sem, hibernation fails because we'll be unable
> to hibernate the task waiting for mmap_sem. So if you'd like to completely
> avoid these hibernation failures, you'd have to make a slew of filesystem
> related locks use freezable sleeping. I don't think that's feasible.
>
> I was hoping that failures due to SB_FREEZE_PAGEFAULT level not being
> freezable would be rare enough but you've proven they are quite frequent.
> We can try making SB_FREEZE_PAGEFAULT level (or even SB_FREEZE_INTERNAL)
> freezable and see whether that works good enough...

I think that's fine and we'll see whether this causes a lot of issues.
I've got the patchset written in a way now that userspace can just
enable or disable freeze during migration.
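
For reference, the precedence issue in the quoted percpu_rwsem_wait() hunk
can be reproduced in plain userspace C. This is only a sketch: the DEMO_*
values and both function names are made up for illustration and are not
the kernel's definitions. It shows that `A | freeze ? B : 0` parses as
`(A | freeze) ? B : 0`, so with a non-zero A the result is always B, and
that parenthesising the conditional gives the combination the hunk appears
to intend.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the task state bits; illustrative only. */
#define DEMO_UNINTERRUPTIBLE 0x0002u
#define DEMO_FREEZABLE       0x2000u

/*
 * Same shape as the expression in the quoted hunk. Because | binds
 * tighter than ?:, this is (DEMO_UNINTERRUPTIBLE | freeze) ? ... : 0,
 * and the condition is non-zero for either value of freeze.
 * (Compilers may warn here with -Wparentheses; that is the point.)
 */
unsigned int demo_state_as_posted(bool freeze)
{
	return DEMO_UNINTERRUPTIBLE | freeze ? DEMO_FREEZABLE : 0;
}

/* Parenthesised form: OR the freezable bit in only when requested. */
unsigned int demo_state_parenthesised(bool freeze)
{
	return DEMO_UNINTERRUPTIBLE | (freeze ? DEMO_FREEZABLE : 0);
}

/* Compile-time check of the parse described above. */
static_assert((DEMO_UNINTERRUPTIBLE | 0 ? DEMO_FREEZABLE : 0) == DEMO_FREEZABLE,
	      "| is evaluated before ?:");
```

Calling demo_state_as_posted() with either argument yields only the
freezable bit, while demo_state_parenthesised(false) yields only the
uninterruptible bit and demo_state_parenthesised(true) yields both.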