On Sat, Mar 08, 2025 at 06:10:14PM +0800, Baokun Li wrote:
> On 2025/3/8 17:58, Ojaswin Mujoo wrote:
> > On Sat, Mar 08, 2025 at 01:48:44PM +0530, Ojaswin Mujoo wrote:
> > > On Sat, Mar 08, 2025 at 10:57:16AM +0800, Zhang Yi wrote:
> > > > On 2025/3/8 1:26, Ojaswin Mujoo wrote:
> > > > > On Fri, Mar 07, 2025 at 08:36:08PM +0800, Zhang Yi wrote:
> > > > > > On 2025/3/7 18:27, Ojaswin Mujoo wrote:
> > > > > > > On Fri, Mar 07, 2025 at 04:43:24PM +0800, Zhang Yi wrote:
> > > > > > > > On 2025/3/7 16:13, Ojaswin Mujoo wrote:
> > > > > > > > > On Fri, Mar 07, 2025 at 12:04:26PM +0530, Ojaswin Mujoo wrote:
> > > > > > > > > > On Fri, Mar 07, 2025 at 10:49:28AM +0800, Zhang Yi wrote:
> > > > > > > > > > > On 2025/3/6 22:28, Ojaswin Mujoo wrote:
> > > > > > > > > > > > Presently we always BUG_ON if trying to start a transaction on a
> > > > > > > > > > > > journal marked with JBD2_UNMOUNT, since this should never happen.
> > > > > > > > > > > > However, while running ltp stress tests, it was observed that in
> > > > > > > > > > > > some error handling paths it is possible for update_super_work to
> > > > > > > > > > > > start a transaction after the journal is destroyed, eg:
> > > > > > > > > > > >
> > > > > > > > > > > > (umount)
> > > > > > > > > > > > ext4_kill_sb
> > > > > > > > > > > >   kill_block_super
> > > > > > > > > > > >     generic_shutdown_super
> > > > > > > > > > > >       sync_filesystem /* commits all txns */
> > > > > > > > > > > >       evict_inodes
> > > > > > > > > > > >         /* might start a new txn */
> > > > > > > > > > > >       ext4_put_super
> > > > > > > > > > > >         flush_work(&sbi->s_sb_upd_work) /* flush the workqueue */
> > > > > > > > > > > >         jbd2_journal_destroy
> > > > > > > > > > > >           journal_kill_thread
> > > > > > > > > > > >             journal->j_flags |= JBD2_UNMOUNT;
> > > > > > > > > > > >           jbd2_journal_commit_transaction
> > > > > > > > > > > >             jbd2_journal_get_descriptor_buffer
> > > > > > > > > > > >               jbd2_journal_bmap
> > > > > > > > > > > >                 ext4_journal_bmap
> > > > > > > > > > > >                   ext4_map_blocks
> > > > > > > > > > > >                     ...
> > > > > > > > > > > >                     ext4_inode_error
> > > > > > > > > > > >                       ext4_handle_error
> > > > > > > > > > > >                         schedule_work(&sbi->s_sb_upd_work)
> > > > > > > > > > > >
> > > > > > > > > > > >                         /* work queue kicks in */
> > > > > > > > > > > >                         update_super_work
> > > > > > > > > > > >                           jbd2_journal_start
> > > > > > > > > > > >                             start_this_handle
> > > > > > > > > > > >                               BUG_ON(journal->j_flags &
> > > > > > > > > > > >                                      JBD2_UNMOUNT)
> > > > > > > > > > > >
> > > > > > > > > > > > Hence, introduce a new sbi flag s_journal_destroying to indicate the
> > > > > > > > > > > > journal is being destroyed, and only do a journaled (and deferred)
> > > > > > > > > > > > update of the sb if this flag is not set. Otherwise, just fall back
> > > > > > > > > > > > to an un-journaled commit.
> > > > > > > > > > > >
> > > > > > > > > > > > We set sbi->s_journal_destroying = true only after all the FS updates
> > > > > > > > > > > > are done during ext4_put_super() (except a running transaction that
> > > > > > > > > > > > will get committed during jbd2_journal_destroy()). After this point,
> > > > > > > > > > > > it is safe to commit the sb outside the journal as it won't race with
> > > > > > > > > > > > a journaled update (refer 2d01ddc86606).
> > > > > > > > > > > >
> > > > > > > > > > > > Also, we don't need a similar check in ext4_grp_locked_error since it
> > > > > > > > > > > > is only called from mballoc and AFAICT it would always be valid to
> > > > > > > > > > > > schedule work here.
> > > > > > > > > > > >
> > > > > > > > > > > > Fixes: 2d01ddc86606 ("ext4: save error info to sb through journal if available")
> > > > > > > > > > > > Reported-by: Mahesh Kumar <maheshkumar657g@xxxxxxxxx>
> > > > > > > > > > > > Suggested-by: Jan Kara <jack@xxxxxxx>
> > > > > > > > > > > > Signed-off-by: Ojaswin Mujoo <ojaswin@xxxxxxxxxxxxx>
> > > > > > > > > > > > ---
> > > > > > > > > > > >  fs/ext4/ext4.h      | 2 ++
> > > > > > > > > > > >  fs/ext4/ext4_jbd2.h | 8 ++++++++
> > > > > > > > > > > >  fs/ext4/super.c     | 4 +++-
> > > > > > > > > > > >  3 files changed, 13 insertions(+), 1 deletion(-)
> > > > > > > > > > > >
> > > > > > > > > > > > diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> > > > > > > > > > > > index 2b7d781bfcad..d48e93bd5690 100644
> > > > > > > > > > > > --- a/fs/ext4/ext4.h
> > > > > > > > > > > > +++ b/fs/ext4/ext4.h
> > > > > > > > > > > > @@ -1728,6 +1728,8 @@ struct ext4_sb_info {
> > > > > > > > > > > >  	 */
> > > > > > > > > > > >  	struct work_struct s_sb_upd_work;
> > > > > > > > > > > >
> > > > > > > > > > > > +	bool s_journal_destorying;
> > > > > > > > > > > > +
> > > > > > > > > > > >  	/* Atomic write unit values in bytes */
> > > > > > > > > > > >  	unsigned int s_awu_min;
> > > > > > > > > > > >  	unsigned int s_awu_max;
> > > > > > > > > > > > diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
> > > > > > > > > > > > index 9b3c9df02a39..6bd3ca84410d 100644
> > > > > > > > > > > > --- a/fs/ext4/ext4_jbd2.h
> > > > > > > > > > > > +++ b/fs/ext4/ext4_jbd2.h
> > > > > > > > > > > > @@ -437,6 +437,14 @@ static inline int ext4_journal_destroy(struct ext4_sb_info *sbi, journal_t *jour
> > > > > > > > > > > >  {
> > > > > > > > > > > >  	int err = 0;
> > > > > > > > > > > >
> > > > > > > > > > > > +	/*
> > > > > > > > > > > > +	 * At this point all pending FS updates should be done except a possible
> > > > > > > > > > > > +	 * running transaction (which will commit in jbd2_journal_destroy). It
> > > > > > > > > > > > +	 * is now safe for any new errors to directly commit superblock rather
> > > > > > > > > > > > +	 * than going via journal.
> > > > > > > > > > > > +	 */
> > > > > > > > > > > > +	sbi->s_journal_destorying = true;
> > > > > > > > > > > > +
> > > > > > > > > > > Hi, Ojaswin!
> > > > > > > > > > >
> > > > > > > > > > > I'm afraid you still need to flush the superblock update work here,
> > > > > > > > > > > otherwise I guess the race condition you mentioned in v1 could still
> > > > > > > > > > > occur.
> > > > > > > > > > >
> > > > > > > > > > > ext4_put_super()
> > > > > > > > > > >   flush_work(&sbi->s_sb_upd_work)
> > > > > > > > > > >
> > > > > > > > > > >                 **kjournald2**
> > > > > > > > > > >                 jbd2_journal_commit_transaction()
> > > > > > > > > > >                 ...
> > > > > > > > > > >                 ext4_inode_error()
> > > > > > > > > > >                   /* JBD2_UNMOUNT not set */
> > > > > > > > > > >                   schedule_work(s_sb_upd_work)
> > > > > > > > > > >
> > > > > > > > > > >                                 **workqueue**
> > > > > > > > > > >                                 update_super_work
> > > > > > > > > > >                                 /* s_journal_destorying is not set */
> > > > > > > > > > >                                 if (journal && !s_journal_destorying)
> > > > > > > > > > >
> > > > > > > > > > >   ext4_journal_destroy()
> > > > > > > > > > >     /* set s_journal_destorying */
> > > > > > > > > > >     sbi->s_journal_destorying = true;
> > > > > > > > > > >     jbd2_journal_destroy()
> > > > > > > > > > >       journal->j_flags |= JBD2_UNMOUNT;
> > > > > > > > > > >
> > > > > > > > > > >                                 jbd2_journal_start()
> > > > > > > > > > >                                   start_this_handle()
> > > > > > > > > > >                                     BUG_ON(JBD2_UNMOUNT)
> > > > > > > > > > >
> > > > > > > > > > > Thanks,
> > > > > > > > > > > Yi.
> > > > > > > > > > Hi Yi,
> > > > > > > > > >
> > > > > > > > > > Yes, you are right, I somehow missed this edge case :(
> > > > > > > > > >
> > > > > > > > > > Alright then, we have to move sbi->s_journal_destroying out of the
> > > > > > > > > > helper. Just wondering if I should still let it be in
> > > > > > > > > > ext4_journal_destroy and just add an extra s_journal_destroying = false
> > > > > > > > > > before schedule_work(s_sb_upd_work), because it makes sense.
> > > > > > > > > >
> > > > > > > > > > Okay, let me give it some thought, but thanks for pointing this out!
> > > > > > > > > >
> > > > > > > > > > Regards,
> > > > > > > > > > ojaswin
> > > > > > > > > Okay, so thinking about it a bit more, I see you also suggested to flush
> > > > > > > > > the work after marking sbi->s_journal_destroying. But will that solve
> > > > > > > > > it?
> > > > > > > > >
> > > > > > > > > ext4_put_super()
> > > > > > > > >   flush_work(&sbi->s_sb_upd_work)
> > > > > > > > >                 **kjournald2**
> > > > > > > > >                 jbd2_journal_commit_transaction()
> > > > > > > > >                 ...
> > > > > > > > >                 ext4_inode_error()
> > > > > > > > >                   /* JBD2_UNMOUNT not set */
> > > > > > > > >                   schedule_work(s_sb_upd_work)
> > > > > > > > >                                 **workqueue**
> > > > > > > > >                                 update_super_work
> > > > > > > > >                                 /* s_journal_destorying is not set */
> > > > > > > > >                                 if (journal && !s_journal_destorying)
> > > > > > > > >   ext4_journal_destroy()
> > > > > > > > >     /* set s_journal_destorying */
> > > > > > > > >     sbi->s_journal_destorying = true;
> > > > > > > > >     flush_work(&sbi->s_sb_upd_work)
> > > > > > > > >                                 schedule_work()
> > > > > > > >                                 ^^^^^^^^^^^^^^^
> > > > > > > > where does this come from?
> > > > > > > >
> > > > > > > > After this flush_work, we can guarantee that the running s_sb_upd_work
> > > > > > > > finishes before we set JBD2_UNMOUNT. Additionally, the journal will
> > > > > > > > not commit a transaction or call schedule_work() again because it has
> > > > > > > > been aborted due to the previous error. Am I missing something?
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Yi.
> > > > > > > Hmm, so I am thinking of a corner case in ext4_handle_error() where
> > > > > > >
> > > > > > >   if (journal && !is_journal_destroying)
> > > > > > >
> > > > > > > is computed but schedule_work() is not called yet, which is possible
> > > > > > > because the compare followed by the jump is not atomic in nature. If
> > > > > > > the schedule_work is only called after we have done the flush then we
> > > > > > > end up with this:
> > > > > > >
> > > > > > > if (journal && !s_journal_destorying)
> > > > > > >                 ext4_journal_destroy()
> > > > > > >                   /* set s_journal_destorying */
> > > > > > >                   sbi->s_journal_destorying = true;
> > > > > > >                   flush_work(&sbi->s_sb_upd_work)
> > > > > > > schedule_work()
> > > > > > >
> > > > > > > Which is possible IMO, although the window is tiny.
> > > > > > Yeah, right!
> > > > > > Sorry for misreading the location where you add the "!s_journal_destorying"
> > > > > > check, the graph I provided was in update_super_work(), which was wrong.
> > > > > Oh right, I also misread your trace, but yes, as discussed, even
> > > > >
> > > > >   sbi->s_journal_destorying = true;
> > > > >   flush_work()
> > > > >   jbd2_journal_destroy()
> > > > >
> > > > > doesn't work.
> > > > >
> > > > > > The right one should be:
> > > > > >
> > > > > > ext4_put_super()
> > > > > >   flush_work(&sbi->s_sb_upd_work)
> > > > > >
> > > > > >                 **kjournald2**
> > > > > >                 jbd2_journal_commit_transaction()
> > > > > >                 ...
> > > > > >                 ext4_inode_error()
> > > > > >                   /* s_journal_destorying is not set */
> > > > > >                   if (journal && !s_journal_destorying)
> > > > > >                   (schedule_work(s_sb_upd_work)) //can be here
> > > > > >
> > > > > >   ext4_journal_destroy()
> > > > > >     /* set s_journal_destorying */
> > > > > >     sbi->s_journal_destorying = true;
> > > > > >     jbd2_journal_destroy()
> > > > > >       journal->j_flags |= JBD2_UNMOUNT;
> > > > > >
> > > > > >                   (schedule_work(s_sb_upd_work)) //also can be here
> > > > > >
> > > > > >                                 **workqueue**
> > > > > >                                 update_super_work()
> > > > > >                                 journal = sbi->s_journal //get journal
> > > > > >       kfree(journal)
> > > > > >                                 jbd2_journal_start(journal) //journal UAF
> > > > > >                                   start_this_handle()
> > > > > >                                     BUG_ON(JBD2_UNMOUNT) //bugon here
> > > > > >
> > > > > >
> > > > > > So there are two problems here: the first one is the 'journal' UAF,
> > > > > > the second one is triggering the JBD2_UNMOUNT BUG_ON.
> > > > > Indeed, there's a possible UAF here as well.
> > > > > > > > >
> > > > > > > > > As for the fix, how about we do something like this:
> > > > > > > > >
> > > > > > > > > ext4_put_super()
> > > > > > > > >
> > > > > > > > >   flush_work(&sbi->s_sb_upd_work)
> > > > > > > > >   destroy_workqueue(sbi->rsv_conversion_wq);
> > > > > > > > >
> > > > > > > > >   ext4_journal_destroy()
> > > > > > > > >     /* set s_journal_destorying */
> > > > > > > > >     sbi->s_journal_destorying = true;
> > > > > > > > >
> > > > > > > > >     /* trigger a commit and wait for it to complete */
> > > > > > > > >
> > > > > > > > >     flush_work(&sbi->s_sb_upd_work)
> > > > > > > > >
> > > > > > > > >     jbd2_journal_destroy()
> > > > > > > > >       journal->j_flags |= JBD2_UNMOUNT;
> > > > > > > > >                                 jbd2_journal_start()
> > > > > > > > >                                   start_this_handle()
> > > > > > > > >                                     BUG_ON(JBD2_UNMOUNT)
> > > > > > > > >
> > > > > > > > > Still giving this codepath some thought but seems like this might just
> > > > > > > > > be enough to fix the race. Thoughts on this?
> > > > > > > > >
> > > > > > I think this solution should work, the forced commit and flush_work()
> > > > > > should ensure that the last transaction is committed and that the
> > > > > > potential work is done.
> > > > > >
> > > > > > Besides, the s_journal_destorying flag is set and checked concurrently
> > > > > > now, so we need WRITE_ONCE() and READ_ONCE() for it. Also, what about
> > > > > > adding a new flag to sbi->s_mount_state instead of adding a new
> > > > > > s_journal_destorying?
> > > > > Right, that makes sense. I will incorporate these changes in the next
> > > > > revision.
> > > > >
> > > > Thinking about this again, it seems that we no longer need the destroying
> > > > flag, because we force a commit and wait for the **last** transaction to
> > > > complete, and the flush work should also ensure that the last sb_update
> > > > work completes. Regardless of whether it starts a new handle in the last
> > > > update_super_work(), it will not be committed since the journal should
> > > > have been aborted. What are your thoughts?
> > > >
> > > > ext4_put_super()
> > > >   flush_work(&sbi->s_sb_upd_work)
> > > >   destroy_workqueue(sbi->rsv_conversion_wq)
> > > >
> > > >   ext4_journal_destroy()
> > > >     /* trigger a commit (it will commit the last transaction) */
> > > >
> > > >                 **kjournald2**
> > > >                 jbd2_journal_commit_transaction()
> > > >                 ...
> > > >                 ext4_inode_error()
> > > >                   schedule_work(s_sb_upd_work)
> > > >
> > > >                                 **workqueue**
> > > >                                 update_super_work()
> > > >                                   jbd2_journal_start(journal)
> > > >                                     start_this_handle()
> > > >                                     //This new trans will
> > > >                                     //not be committed.
> > > >
> > > >                 jbd2_journal_abort()
> > > >
> > > >     /* wait for it to complete */
> > > >
> > > >     flush_work(&sbi->s_sb_upd_work)
> > > >     jbd2_journal_destroy()
> > > >       journal->j_flags |= JBD2_UNMOUNT;
> > > >       jbd2_journal_commit_transaction() //it will commit nothing
> > > >
> > > > Thanks,
> > > > Yi.
> > > Hi Yi,
> > >
> > > There's one more path for which we need the flag:
> > >
> > > ext4_journal_destroy()
> > >   /* trigger a commit (it will commit the last transaction) */
> > >                 **kjournald2**
> > >                 jbd2_journal_commit_transaction()
> > >                   journal->j_commit_callback()
> > >                     ext4_journal_commit_callback()
> > >                       ext4_maybe_update_superblock()
> > >                         schedule_work()
> > >   /* start a transaction here */
> > >   flush_work()
> > >   jbd2_journal_destroy()
> > >     journal_kill_thread
> > >       flags |= JBD2_UNMOUNT
> > >     jbd2_journal_commit_transaction()
> > >       ...
> > >       ext4_inode_error()
> > >         schedule_work(s_sb_upd_work)
> > >   /* update_super_work tries to start the txn */
> > >   BUG_ON()
> > Oops, the formatting is wrong, here's the trace:
> >
> > ext4_journal_destroy()
> >   /* trigger a commit (it will commit the last transaction) */
> >
> >                 **kjournald2**
> >                 jbd2_journal_commit_transaction()
> >                   journal->j_commit_callback()
> >                     ext4_journal_commit_callback()
> >                       ext4_maybe_update_superblock()
> >                         schedule_work()
> At this point, SB_ACTIVE should have been cleared,
> so ext4_maybe_update_superblock() should do nothing.
>
> With this in mind, it could be the case that an
> additional flag is no longer needed.

Yes, got it now, all the backtraces are confusing me, haha.

Alright then, I feel we can take 2 approaches now.

- Without the flag (flushing the sb update wq 2 times):

  This approach looks like what Yi described here [1]; however, it relies on
  the fact that the last update_super_work is only invoked if the journal is
  aborted and hence will never start a txn. This works because we flush the
  sb update work 2 times:

  ext4_put_super()
    flush_work(s_sb_upd_work)    // 1st flush, clears all pending updates
    ext4_journal_destroy
      /* commit & wait for txn */
      flush_work(s_sb_upd_work)  // 2nd flush, only has work if journal is aborted

  The first flush flushes all pending updates, and the 2nd one only has work
  queued when the journal commit hits an error, in which case the journal is
  already aborted and update_super_work never starts a new txn. Thus
  everything works.

  [1] https://lore.kernel.org/all/cover.1741270780.git.ojaswin@xxxxxxxxxxxxx/T/#m3de5dcae6afa979d01281f64b0c088131e72fc92

- With the flag (single sb flush):

  This is an alternate approach where we remove the flush_work() from
  ext4_put_super() and only keep 1 in ext4_journal_destroy(). With this we
  need to explicitly rely on an sbi/mount flag because the flush can actually
  start a txn (a rough sketch follows below). WITHOUT the flag we can run
  into this:

  **kjournald2**
  jbd2_journal_commit_transaction()
    journal->j_commit_callback()
      ext4_journal_commit_callback()
        ext4_maybe_update_superblock()
          schedule_work()

  ext4_put_super()
    ext4_journal_destroy
      /* commit & wait for txn */
      flush_work(s_sb_upd_work)  // This will have sb updates even if journal not aborted
        /* update_super_work starts a txn */
      jbd2_journal_destroy
        flags |= JBD2_UNMOUNT

        jbd2_journal_commit_transaction
          /* error */
          ext4_handle_error
            schedule_work()

            /* update_super_work hits BUG_ON */

  Hence the flag is needed in this approach.
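For concreteness, here is a rough sketch (not a final patch) of what the
flag-based approach could look like, building on the ext4_journal_destroy()
helper from the posted diff and on Yi's WRITE_ONCE()/READ_ONCE() suggestion.
The flag's name and whether it lives in ext4_sb_info or in s_mount_state are
still open, so treat the identifiers and their placement below as
placeholders:

static inline int ext4_journal_destroy(struct ext4_sb_info *sbi,
				       journal_t *journal)
{
	/*
	 * All pending FS updates should be done by now, except a possibly
	 * running transaction that jbd2_journal_destroy() will commit. From
	 * here on it is safe for new errors to commit the sb directly.
	 * Pairs with the READ_ONCE() in ext4_handle_error().
	 */
	WRITE_ONCE(sbi->s_journal_destroying, true);

	/*
	 * Wait for any sb update work that the final commit may already
	 * have scheduled (e.g. via ext4_journal_commit_callback()), so it
	 * cannot race with JBD2_UNMOUNT being set below.
	 */
	flush_work(&sbi->s_sb_upd_work);

	/* rest of the helper as in the posted patch */
	return jbd2_journal_destroy(journal);
}

and the deferred journaled sb update would be skipped once the flag is
observed, roughly:

	if (journal && !READ_ONCE(EXT4_SB(sb)->s_journal_destroying))
		schedule_work(&EXT4_SB(sb)->s_sb_upd_work);
	else
		ext4_commit_super(sb);	/* un-journaled fallback */

(The exact placement of that check, ext4_handle_error() vs
update_super_work(), and whether the explicit commit-and-wait from the
earlier proposal is still needed before the flush, are open questions from
this thread.)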
I actually prefer the 2nd approach, with the flag, because it seems easier to
maintain in the long run than relying on the fragile flush-commit-flush
sequence, which might get silently broken as new code is added.

I hope I didn't miss something this time. Thoughts? :)

Regards,
ojaswin

>
> Regards,
> Baokun
> >
> > /* update_super_work starts a new txn here */
> > flush_work()
> > jbd2_journal_destroy()
> >   journal_kill_thread
> >     flags |= JBD2_UNMOUNT
> >   jbd2_journal_commit_transaction()
> >     ...
> >     ext4_inode_error()
> >       schedule_work(s_sb_upd_work)
> > /* update_super_work tries to start the txn */
> > BUG_ON()
> > > I think to protect against this path we do need a flag.
> > > Regards,
> > > ojaswin
> >