On Tue, 2025-04-01 at 13:20 +0200, Jan Kara wrote: > On Mon 31-03-25 21:13:20, James Bottomley wrote: > > On Tue, 2025-04-01 at 01:32 +0200, Christian Brauner wrote: [...] > > > diff --git a/include/linux/fs.h b/include/linux/fs.h > > > index b379a46b5576..528e73f192ac 100644 > > > --- a/include/linux/fs.h > > > +++ b/include/linux/fs.h > > > @@ -1782,7 +1782,8 @@ static inline void __sb_end_write(struct > > > super_block *sb, int level) > > > static inline void __sb_start_write(struct super_block *sb, int > > > level) > > > { > > > percpu_down_read_freezable(sb->s_writers.rw_sem + level - > > > 1, > > > - level == SB_FREEZE_WRITE); > > > + (level == SB_FREEZE_WRITE || > > > + level == > > > SB_FREEZE_PAGEFAULT)); > > > } > > > > Yes, I was about to tell Jan that the condition here simply needs > > to be true. All our rwsem levels need to be freezable to avoid a > > hibernation failure. > > So there is one snag with this. SB_FREEZE_PAGEFAULT level is acquired > under mmap_sem, SB_FREEZE_INTERNAL level is possibly acquired under > some other filesystem locks. Just for SB_FREEZE_INTERNAL, I think there's no case of sb_start_intwrite() that can ever hold in D wait because by the time we acquire the semaphore for write, the internal freeze_fs should have been called and the filesystem should have quiesced itself. On the other hand, if that theory itself is true, there's no real need for sb_start_intwrite() at all because it can never conflict. > So if you freeze the filesystem, a task can block on frozen > filesystem with e.g. mmap_sem held and if some other task then blocks > on grabbing that mmap_sem, hibernation fails because we'll be unable > to hibernate the task waiting for mmap_sem. So if you'd like to > completely avoid these hibernation failures, you'd have to make a > slew of filesystem related locks use freezable sleeping. I don't > think that's feasible. I wouldn't see that because I'm on x86_64 and that takes the vma_lock in page faults not the mmap_lock. The granularity of all these locks is process level, so it's hard to see what they'd be racing with ... even if I conjecture two threads trying to write to something, they'd have to have some internal co-ordination which would likely prevent the second one from writing if the first got stuck on the page fault. > I was hoping that failures due to SB_FREEZE_PAGEFAULT level not being > freezable would be rare enough but you've proven they are quite > frequent. We can try making SB_FREEZE_PAGEFAULT level (or even > SB_FREEZE_INTERNAL) freezable and see whether that works good > enough... I'll try to construct a more severe test than systemd-journald ... it looks to be single threaded in its operation. Regards, James