On 2025/3/8 1:26, Ojaswin Mujoo wrote: > On Fri, Mar 07, 2025 at 08:36:08PM +0800, Zhang Yi wrote: >> On 2025/3/7 18:27, Ojaswin Mujoo wrote: >>> On Fri, Mar 07, 2025 at 04:43:24PM +0800, Zhang Yi wrote: >>>> On 2025/3/7 16:13, Ojaswin Mujoo wrote: >>>>> On Fri, Mar 07, 2025 at 12:04:26PM +0530, Ojaswin Mujoo wrote: >>>>>> On Fri, Mar 07, 2025 at 10:49:28AM +0800, Zhang Yi wrote: >>>>>>> On 2025/3/6 22:28, Ojaswin Mujoo wrote: >>>>>>>> Presently we always BUG_ON if trying to start a transaction on a journal marked >>>>>>>> with JBD2_UNMOUNT, since this should never happen. However, while ltp running >>>>>>>> stress tests, it was observed that in case of some error handling paths, it is >>>>>>>> possible for update_super_work to start a transaction after the journal is >>>>>>>> destroyed eg: >>>>>>>> >>>>>>>> (umount) >>>>>>>> ext4_kill_sb >>>>>>>> kill_block_super >>>>>>>> generic_shutdown_super >>>>>>>> sync_filesystem /* commits all txns */ >>>>>>>> evict_inodes >>>>>>>> /* might start a new txn */ >>>>>>>> ext4_put_super >>>>>>>> flush_work(&sbi->s_sb_upd_work) /* flush the workqueue */ >>>>>>>> jbd2_journal_destroy >>>>>>>> journal_kill_thread >>>>>>>> journal->j_flags |= JBD2_UNMOUNT; >>>>>>>> jbd2_journal_commit_transaction >>>>>>>> jbd2_journal_get_descriptor_buffer >>>>>>>> jbd2_journal_bmap >>>>>>>> ext4_journal_bmap >>>>>>>> ext4_map_blocks >>>>>>>> ... >>>>>>>> ext4_inode_error >>>>>>>> ext4_handle_error >>>>>>>> schedule_work(&sbi->s_sb_upd_work) >>>>>>>> >>>>>>>> /* work queue kicks in */ >>>>>>>> update_super_work >>>>>>>> jbd2_journal_start >>>>>>>> start_this_handle >>>>>>>> BUG_ON(journal->j_flags & >>>>>>>> JBD2_UNMOUNT) >>>>>>>> >>>>>>>> Hence, introduce a new sbi flag s_journal_destroying to indicate journal is >>>>>>>> destroying only do a journaled (and deferred) update of sb if this flag is not >>>>>>>> set. Otherwise, just fallback to an un-journaled commit. >>>>>>>> >>>>>>>> We set sbi->s_journal_destroying = true only after all the FS updates are done >>>>>>>> during ext4_put_super() (except a running transaction that will get commited >>>>>>>> during jbd2_journal_destroy()). After this point, it is safe to commit the sb >>>>>>>> outside the journal as it won't race with a journaled update (refer >>>>>>>> 2d01ddc86606). >>>>>>>> >>>>>>>> Also, we don't need a similar check in ext4_grp_locked_error since it is only >>>>>>>> called from mballoc and AFAICT it would be always valid to schedule work here. >>>>>>>> >>>>>>>> Fixes: 2d01ddc86606 ("ext4: save error info to sb through journal if available") >>>>>>>> Reported-by: Mahesh Kumar <maheshkumar657g@xxxxxxxxx> >>>>>>>> Suggested-by: Jan Kara <jack@xxxxxxx> >>>>>>>> Signed-off-by: Ojaswin Mujoo <ojaswin@xxxxxxxxxxxxx> >>>>>>>> --- >>>>>>>> fs/ext4/ext4.h | 2 ++ >>>>>>>> fs/ext4/ext4_jbd2.h | 8 ++++++++ >>>>>>>> fs/ext4/super.c | 4 +++- >>>>>>>> 3 files changed, 13 insertions(+), 1 deletion(-) >>>>>>>> >>>>>>>> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h >>>>>>>> index 2b7d781bfcad..d48e93bd5690 100644 >>>>>>>> --- a/fs/ext4/ext4.h >>>>>>>> +++ b/fs/ext4/ext4.h >>>>>>>> @@ -1728,6 +1728,8 @@ struct ext4_sb_info { >>>>>>>> */ >>>>>>>> struct work_struct s_sb_upd_work; >>>>>>>> >>>>>>>> + bool s_journal_destorying; >>>>>>>> + >>>>>>>> /* Atomic write unit values in bytes */ >>>>>>>> unsigned int s_awu_min; >>>>>>>> unsigned int s_awu_max; >>>>>>>> diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h >>>>>>>> index 9b3c9df02a39..6bd3ca84410d 100644 >>>>>>>> --- a/fs/ext4/ext4_jbd2.h >>>>>>>> +++ b/fs/ext4/ext4_jbd2.h >>>>>>>> @@ -437,6 +437,14 @@ static inline int ext4_journal_destroy(struct ext4_sb_info *sbi, journal_t *jour >>>>>>>> { >>>>>>>> int err = 0; >>>>>>>> >>>>>>>> + /* >>>>>>>> + * At this point all pending FS updates should be done except a possible >>>>>>>> + * running transaction (which will commit in jbd2_journal_destroy). It >>>>>>>> + * is now safe for any new errors to directly commit superblock rather >>>>>>>> + * than going via journal. >>>>>>>> + */ >>>>>>>> + sbi->s_journal_destorying = true; >>>>>>>> + >>>>>>> >>>>>>> Hi, Ojaswin! >>>>>>> >>>>>>> I'm afraid you still need to flush the superblock update work here, >>>>>>> otherwise I guess the race condition you mentioned in v1 could still >>>>>>> occur. >>>>>>> >>>>>>> ext4_put_super() >>>>>>> flush_work(&sbi->s_sb_upd_work) >>>>>>> >>>>>>> **kjournald2** >>>>>>> jbd2_journal_commit_transaction() >>>>>>> ... >>>>>>> ext4_inode_error() >>>>>>> /* JBD2_UNMOUNT not set */ >>>>>>> schedule_work(s_sb_upd_work) >>>>>>> >>>>>>> **workqueue** >>>>>>> update_super_work >>>>>>> /* s_journal_destorying is not set */ >>>>>>> if (journal && !s_journal_destorying) >>>>>>> >>>>>>> ext4_journal_destroy() >>>>>>> /* set s_journal_destorying */ >>>>>>> sbi->s_journal_destorying = true; >>>>>>> jbd2_journal_destroy() >>>>>>> journal->j_flags |= JBD2_UNMOUNT; >>>>>>> >>>>>>> jbd2_journal_start() >>>>>>> start_this_handle() >>>>>>> BUG_ON(JBD2_UNMOUNT) >>>>>>> >>>>>>> Thanks, >>>>>>> Yi. >>>>>> Hi Yi, >>>>>> >>>>>> Yes you are right, somehow missed this edge case :( >>>>>> >>>>>> Alright then, we have to move out sbi->s_journal_destroying outside the >>>>>> helper. Just wondering if I should still let it be in >>>>>> ext4_journal_destroy and just add an extra s_journal_destroying = false >>>>>> before schedule_work(s_sb_upd_work), because it makes sense. >>>>>> >>>>>> Okay let me give it some thought but thanks for pointing this out! >>>>>> >>>>>> Regards, >>>>>> ojaswin >>>>> >>>>> Okay so thinking about it a bit more, I see you also suggested to flush >>>>> the work after marking sbi->s_journal_destroying. But will that solve >>>>> it? >>>>> >>>>> ext4_put_super() >>>>> flush_work(&sbi->s_sb_upd_work) >>>>> >>>>> **kjournald2** >>>>> jbd2_journal_commit_transaction() >>>>> ... >>>>> ext4_inode_error() >>>>> /* JBD2_UNMOUNT not set */ >>>>> schedule_work(s_sb_upd_work) >>>>> >>>>> **workqueue** >>>>> update_super_work >>>>> /* s_journal_destorying is not set */ >>>>> if (journal && !s_journal_destorying) >>>>> >>>>> ext4_journal_destroy() >>>>> /* set s_journal_destorying */ >>>>> sbi->s_journal_destorying = true; >>>>> flush_work(&sbi->s_sb_upd_work) >>>>> schedule_work() >>>> ^^^^^^^^^^^^^^^ >>>> where does this come from? >>>> >>>> After this flush_work, we can guarantee that the running s_sb_upd_work >>>> finishes before we set JBD2_UNMOUNT. Additionally, the journal will >>>> not commit transaction or call schedule_work() again because it has >>>> been aborted due to the previous error. Am I missing something? >>>> >>>> Thanks, >>>> Yi. >>> >>> Hmm, so I am thinking of a corner case in ext4_handle_error() where >>> >>> if(journal && !is_journal_destroying) >>> >>> is computed but schedule_work() not called yet, which is possible cause >>> the cmp followed by jump is not atomic in nature. If the schedule_work >>> is only called after we have done the flush then we end up with this: >>> >>> if (journal && !s_journal_destorying) >>> ext4_journal_destroy() >>> /* set s_journal_destorying */ >>> sbi->s_journal_destorying = true; >>> flush_work(&sbi->s_sb_upd_work) >>> schedule_work() >>> >>> Which is possible IMO, although the window is tiny. >> >> Yeah, right! >> Sorry for misread the location where you add the "!s_journal_destorying" >> check, the graph I provided was in update_super_work(), which was wrong. > > Oh right, I also misread your trace but yes as discussed, even > > sbi->s_journal_destorying = true; > flush_work() > jbd2_journal_destroy() > > doesn't work. > >> The right one should be: >> >> ext4_put_super() >> flush_work(&sbi->s_sb_upd_work) >> >> **kjournald2** >> jbd2_journal_commit_transaction() >> ... >> ext4_inode_error() >> /* s_journal_destorying is not set */ >> if (journal && !s_journal_destorying) >> (schedule_work(s_sb_upd_work)) //can be here >> >> ext4_journal_destroy() >> /* set s_journal_destorying */ >> sbi->s_journal_destorying = true; >> jbd2_journal_destroy() >> journal->j_flags |= JBD2_UNMOUNT; >> >> (schedule_work(s_sb_upd_work)) //also can be here >> >> **workqueue** >> update_super_work() >> journal = sbi->s_journal //get journal >> kfree(journal) >> jbd2_journal_start(journal) //journal UAF >> start_this_handle() >> BUG_ON(JBD2_UNMOUNT) //bugon here >> >> >> So there are two problems here, the first one is the 'journal' UAF, >> the second one is triggering JBD2_UNMOUNT flag BUGON. > > Indeed, there's a possible UAF here as well. > >> >>>>> >>>>> As for the fix, how about we do something like this: >>>>> >>>>> ext4_put_super() >>>>> >>>>> flush_work(&sbi->s_sb_upd_work) >>>>> destroy_workqueue(sbi->rsv_conversion_wq); >>>>> >>>>> ext4_journal_destroy() >>>>> /* set s_journal_destorying */ >>>>> sbi->s_journal_destorying = true; >>>>> >>>>> /* trigger a commit and wait for it to complete */ >>>>> >>>>> flush_work(&sbi->s_sb_upd_work) >>>>> >>>>> jbd2_journal_destroy() >>>>> journal->j_flags |= JBD2_UNMOUNT; >>>>> >>>>> jbd2_journal_start() >>>>> start_this_handle() >>>>> BUG_ON(JBD2_UNMOUNT) >>>>> >>>>> Still giving this codepath some thought but seems like this might just >>>>> be enough to fix the race. Thoughts on this? >>>>> >> >> I think this solution should work, the forced commit and flush_work() >> should ensure that the last transaction is committed and that the >> potential work is done. >> >> Besides, the s_journal_destorying flag is set and check concurrently >> now, so we need WRITE_ONCE() and READ_ONCE() for it. Besides, what >> about adding a new flag into sbi->s_mount_state instead of adding >> new s_journal_destorying? > > Right, that makes sence. I will incorporate these changes in the next > revision. > Think about this again, it seems that we no longer need the destroying flag. Because we force to commit and wait for the **last** transaction to complete, and the flush work should also ensure that the last sb_update work to complete. Regardless of whether it starts a new handle in the last update_super_work(), it will not commit since the journal should have aborted. What are your thoughts? ext4_put_super() flush_work(&sbi->s_sb_upd_work) destroy_workqueue(sbi->rsv_conversion_wq) ext4_journal_destroy() /* trigger a commit (it will commit the last trnasaction) */ **kjournald2** jbd2_journal_commit_transaction() ... ext4_inode_error() schedule_work(s_sb_upd_work)) **workqueue** update_super_work() jbd2_journal_start(journal) start_this_handle() //This new trans will //not be committed. jbd2_journal_abort() /* wait for it to complete */ flush_work(&sbi->s_sb_upd_work) jbd2_journal_destroy() journal->j_flags |= JBD2_UNMOUNT; jbd2_journal_commit_transaction() //it will commit nothing Thanks, Yi.