Zhang Yi <yi.zhang@xxxxxxxxxxxxxxx> writes:

> On 2025/3/8 1:26, Ojaswin Mujoo wrote:
>> On Fri, Mar 07, 2025 at 08:36:08PM +0800, Zhang Yi wrote:
>>> On 2025/3/7 18:27, Ojaswin Mujoo wrote:
>>>> On Fri, Mar 07, 2025 at 04:43:24PM +0800, Zhang Yi wrote:
>>>>> On 2025/3/7 16:13, Ojaswin Mujoo wrote:
>>>>>> On Fri, Mar 07, 2025 at 12:04:26PM +0530, Ojaswin Mujoo wrote:
>>>>>>> On Fri, Mar 07, 2025 at 10:49:28AM +0800, Zhang Yi wrote:
>>>>>>>> On 2025/3/6 22:28, Ojaswin Mujoo wrote:
>>>>>>>>> Presently we always BUG_ON if trying to start a transaction on a
>>>>>>>>> journal marked with JBD2_UNMOUNT, since this should never happen.
>>>>>>>>> However, while running LTP stress tests, it was observed that in
>>>>>>>>> some error handling paths it is possible for update_super_work to
>>>>>>>>> start a transaction after the journal is destroyed, eg:
>>>>>>>>>
>>>>>>>>> (umount)
>>>>>>>>> ext4_kill_sb
>>>>>>>>>   kill_block_super
>>>>>>>>>     generic_shutdown_super
>>>>>>>>>       sync_filesystem /* commits all txns */
>>>>>>>>>       evict_inodes
>>>>>>>>>         /* might start a new txn */
>>>>>>>>>       ext4_put_super
>>>>>>>>>         flush_work(&sbi->s_sb_upd_work) /* flush the workqueue */
>>>>>>>>>         jbd2_journal_destroy
>>>>>>>>>           journal_kill_thread
>>>>>>>>>             journal->j_flags |= JBD2_UNMOUNT;
>>>>>>>>>           jbd2_journal_commit_transaction
>>>>>>>>>             jbd2_journal_get_descriptor_buffer
>>>>>>>>>               jbd2_journal_bmap
>>>>>>>>>                 ext4_journal_bmap
>>>>>>>>>                   ext4_map_blocks
>>>>>>>>>                     ...
>>>>>>>>>                     ext4_inode_error
>>>>>>>>>                       ext4_handle_error
>>>>>>>>>                         schedule_work(&sbi->s_sb_upd_work)
>>>>>>>>>
>>>>>>>>>                         /* work queue kicks in */
>>>>>>>>>                         update_super_work
>>>>>>>>>                           jbd2_journal_start
>>>>>>>>>                             start_this_handle
>>>>>>>>>                               BUG_ON(journal->j_flags &
>>>>>>>>>                                      JBD2_UNMOUNT)
>>>>>>>>>
>>>>>>>>> Hence, introduce a new sbi flag s_journal_destroying to indicate
>>>>>>>>> the journal is being destroyed, and only do a journaled (and
>>>>>>>>> deferred) update of the sb if this flag is not set.
>>>>>>>>> Otherwise, just fall back to an un-journaled commit.
>>>>>>>>>
>>>>>>>>> We set sbi->s_journal_destroying = true only after all the FS
>>>>>>>>> updates are done during ext4_put_super() (except a running
>>>>>>>>> transaction that will get committed during jbd2_journal_destroy()).
>>>>>>>>> After this point, it is safe to commit the sb outside the journal
>>>>>>>>> as it won't race with a journaled update (refer 2d01ddc86606).
>>>>>>>>>
>>>>>>>>> Also, we don't need a similar check in ext4_grp_locked_error since
>>>>>>>>> it is only called from mballoc and AFAICT it would always be valid
>>>>>>>>> to schedule work here.
>>>>>>>>>
>>>>>>>>> Fixes: 2d01ddc86606 ("ext4: save error info to sb through journal if available")
>>>>>>>>> Reported-by: Mahesh Kumar <maheshkumar657g@xxxxxxxxx>
>>>>>>>>> Suggested-by: Jan Kara <jack@xxxxxxx>
>>>>>>>>> Signed-off-by: Ojaswin Mujoo <ojaswin@xxxxxxxxxxxxx>
>>>>>>>>> ---
>>>>>>>>>  fs/ext4/ext4.h      | 2 ++
>>>>>>>>>  fs/ext4/ext4_jbd2.h | 8 ++++++++
>>>>>>>>>  fs/ext4/super.c     | 4 +++-
>>>>>>>>>  3 files changed, 13 insertions(+), 1 deletion(-)
>>>>>>>>>
>>>>>>>>> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
>>>>>>>>> index 2b7d781bfcad..d48e93bd5690 100644
>>>>>>>>> --- a/fs/ext4/ext4.h
>>>>>>>>> +++ b/fs/ext4/ext4.h
>>>>>>>>> @@ -1728,6 +1728,8 @@ struct ext4_sb_info {
>>>>>>>>>  	 */
>>>>>>>>>  	struct work_struct s_sb_upd_work;
>>>>>>>>>
>>>>>>>>> +	bool s_journal_destorying;
>>>>>>>>> +
>>>>>>>>>  	/* Atomic write unit values in bytes */
>>>>>>>>>  	unsigned int s_awu_min;
>>>>>>>>>  	unsigned int s_awu_max;
>>>>>>>>> diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
>>>>>>>>> index 9b3c9df02a39..6bd3ca84410d 100644
>>>>>>>>> --- a/fs/ext4/ext4_jbd2.h
>>>>>>>>> +++ b/fs/ext4/ext4_jbd2.h
>>>>>>>>> @@ -437,6 +437,14 @@ static inline int ext4_journal_destroy(struct ext4_sb_info *sbi, journal_t *jour
>>>>>>>>>  {
>>>>>>>>>  	int err = 0;
>>>>>>>>>
>>>>>>>>> +	/*
>>>>>>>>> +	 * At this point all pending FS updates should be done except a
>>>>>>>>> +	 * possible running transaction (which will commit in
>>>>>>>>> +	 * jbd2_journal_destroy). It is now safe for any new errors to
>>>>>>>>> +	 * directly commit superblock rather than going via journal.
>>>>>>>>> +	 */
>>>>>>>>> +	sbi->s_journal_destorying = true;
>>>>>>>>> +
>>>>>>>>
>>>>>>>> Hi, Ojaswin!
>>>>>>>>
>>>>>>>> I'm afraid you still need to flush the superblock update work here,
>>>>>>>> otherwise I guess the race condition you mentioned in v1 could still
>>>>>>>> occur.
>>>>>>>>
>>>>>>>> ext4_put_super()
>>>>>>>>   flush_work(&sbi->s_sb_upd_work)
>>>>>>>>
>>>>>>>>             **kjournald2**
>>>>>>>>             jbd2_journal_commit_transaction()
>>>>>>>>             ...
>>>>>>>>               ext4_inode_error()
>>>>>>>>                 /* JBD2_UNMOUNT not set */
>>>>>>>>                 schedule_work(s_sb_upd_work)
>>>>>>>>
>>>>>>>>                             **workqueue**
>>>>>>>>                             update_super_work
>>>>>>>>                             /* s_journal_destorying is not set */
>>>>>>>>                             if (journal && !s_journal_destorying)
>>>>>>>>
>>>>>>>>   ext4_journal_destroy()
>>>>>>>>     /* set s_journal_destorying */
>>>>>>>>     sbi->s_journal_destorying = true;
>>>>>>>>     jbd2_journal_destroy()
>>>>>>>>       journal->j_flags |= JBD2_UNMOUNT;
>>>>>>>>
>>>>>>>>                             jbd2_journal_start()
>>>>>>>>                               start_this_handle()
>>>>>>>>                                 BUG_ON(JBD2_UNMOUNT)
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Yi.
>>>>>>>
>>>>>>> Hi Yi,
>>>>>>>
>>>>>>> Yes you are right, somehow missed this edge case :(
>>>>>>>
>>>>>>> Alright then, we have to move sbi->s_journal_destroying out of the
>>>>>>> helper. Just wondering if I should still let it be in
>>>>>>> ext4_journal_destroy and just add an extra s_journal_destroying = false
>>>>>>> before schedule_work(s_sb_upd_work), because it makes sense.
>>>>>>>
>>>>>>> Okay let me give it some thought but thanks for pointing this out!
>>>>>>>
>>>>>>> Regards,
>>>>>>> ojaswin
>>>>>>
>>>>>> Okay so thinking about it a bit more, I see you also suggested to flush
>>>>>> the work after marking sbi->s_journal_destroying. But will that solve
>>>>>> it?
>>>>>>
>>>>>> ext4_put_super()
>>>>>>   flush_work(&sbi->s_sb_upd_work)
>>>>>>
>>>>>>             **kjournald2**
>>>>>>             jbd2_journal_commit_transaction()
>>>>>>             ...
>>>>>>               ext4_inode_error()
>>>>>>                 /* JBD2_UNMOUNT not set */
>>>>>>                 schedule_work(s_sb_upd_work)
>>>>>>
>>>>>>                             **workqueue**
>>>>>>                             update_super_work
>>>>>>                             /* s_journal_destorying is not set */
>>>>>>                             if (journal && !s_journal_destorying)
>>>>>>
>>>>>>   ext4_journal_destroy()
>>>>>>     /* set s_journal_destorying */
>>>>>>     sbi->s_journal_destorying = true;
>>>>>>     flush_work(&sbi->s_sb_upd_work)
>>>>>>
>>>>>>                             schedule_work()
>>>>>                             ^^^^^^^^^^^^^^^
>>>>>                             where does this come from?
>>>>>
>>>>> After this flush_work, we can guarantee that the running s_sb_upd_work
>>>>> finishes before we set JBD2_UNMOUNT. Additionally, the journal will
>>>>> not commit a transaction or call schedule_work() again because it has
>>>>> been aborted due to the previous error. Am I missing something?
>>>>>
>>>>> Thanks,
>>>>> Yi.
>>>>
>>>> Hmm, so I am thinking of a corner case in ext4_handle_error() where
>>>>
>>>>   if (journal && !is_journal_destroying)
>>>>
>>>> is computed but schedule_work() is not called yet, which is possible
>>>> because the compare followed by the jump is not atomic in nature. If
>>>> schedule_work() is only called after we have done the flush then we
>>>> end up with this:
>>>>
>>>>   if (journal && !s_journal_destorying)
>>>>                             ext4_journal_destroy()
>>>>                               /* set s_journal_destorying */
>>>>                               sbi->s_journal_destorying = true;
>>>>                               flush_work(&sbi->s_sb_upd_work)
>>>>   schedule_work()
>>>>
>>>> Which is possible IMO, although the window is tiny.
>>>
>>> Yeah, right!
>>> Sorry for misreading the location where you add the
>>> "!s_journal_destorying" check; the graph I provided was in
>>> update_super_work(), which was wrong.
>>
>> Oh right, I also misread your trace, but yes as discussed, even
>>
>>   sbi->s_journal_destorying = true;
>>   flush_work()
>>   jbd2_journal_destroy()
>>
>> doesn't work.
>>
>>> The right one should be:
>>>
>>> ext4_put_super()
>>>   flush_work(&sbi->s_sb_upd_work)
>>>
>>>             **kjournald2**
>>>             jbd2_journal_commit_transaction()
>>>             ...
>>>               ext4_inode_error()
>>>                 /* s_journal_destorying is not set */
>>>                 if (journal && !s_journal_destorying)
>>>                 (schedule_work(s_sb_upd_work)) //can be here
>>>
>>>   ext4_journal_destroy()
>>>     /* set s_journal_destorying */
>>>     sbi->s_journal_destorying = true;
>>>     jbd2_journal_destroy()
>>>       journal->j_flags |= JBD2_UNMOUNT;
>>>
>>>                 (schedule_work(s_sb_upd_work)) //also can be here
>>>
>>>                             **workqueue**
>>>                             update_super_work()
>>>                             journal = sbi->s_journal //get journal
>>>     kfree(journal)
>>>                             jbd2_journal_start(journal) //journal UAF
>>>                               start_this_handle()
>>>                                 BUG_ON(JBD2_UNMOUNT) //bugon here
>>>
>>> So there are two problems here: the first one is the 'journal' UAF,
>>> the second one is triggering the JBD2_UNMOUNT BUG_ON.
>>
>> Indeed, there's a possible UAF here as well.
>>
>>>
>>>>>> As for the fix, how about we do something like this:
>>>>>>
>>>>>> ext4_put_super()
>>>>>>
>>>>>>   flush_work(&sbi->s_sb_upd_work)
>>>>>>   destroy_workqueue(sbi->rsv_conversion_wq);
>>>>>>
>>>>>>   ext4_journal_destroy()
>>>>>>     /* set s_journal_destorying */
>>>>>>     sbi->s_journal_destorying = true;
>>>>>>
>>>>>>     /* trigger a commit and wait for it to complete */
>>>>>>
>>>>>>     flush_work(&sbi->s_sb_upd_work)
>>>>>>
>>>>>>     jbd2_journal_destroy()
>>>>>>       journal->j_flags |= JBD2_UNMOUNT;
>>>>>>
>>>>>>             jbd2_journal_start()
>>>>>>               start_this_handle()
>>>>>>                 BUG_ON(JBD2_UNMOUNT)
>>>>>>
>>>>>> Still giving this codepath some thought but seems like this might just
>>>>>> be enough to fix the race. Thoughts on this?
>>>
>>> I think this solution should work; the forced commit and flush_work()
>>> should ensure that the last transaction is committed and that the
>>> potential work is done.
>>>
>>> Besides, the s_journal_destorying flag is set and checked concurrently
>>> now, so we need WRITE_ONCE() and READ_ONCE() for it.
>>> Besides, what about adding a new flag into sbi->s_mount_state instead
>>> of adding a new s_journal_destorying?
>>
>> Right, that makes sense. I will incorporate these changes in the next
>> revision.
>
> Thinking about this again, it seems that we no longer need the
> destroying flag. Because we force a commit and wait for the **last**
> transaction to complete, and the flush work should also ensure that the
> last sb_update work completes. Regardless of whether it starts a new
> handle in the last update_super_work(), it will not commit since the
> journal should have aborted. What are your thoughts?

I think the confusion may be coming from the fact that the v2 patch
doesn't set s_journal_destroying to true where we discussed, in this
thread [1].

[1]: https://lore.kernel.org/linux-ext4/jnxpphuradrsf73cxfmohfu7wwwckihtulw6ovsitddgt5pqkg@2uoejkr66qnl/

> ext4_put_super()
+	sbi->s_journal_destroying = true;

We should set s_journal_destroying to true before calling flush_work.

> flush_work(&sbi->s_sb_upd_work)

After the above flush work is complete, we will always check whether
s_journal_destroying is set. If yes, then we should never schedule the
sb update work.

> destroy_workqueue(sbi->rsv_conversion_wq)
>
> ext4_journal_destroy()
>   /* trigger a commit (it will commit the last transaction) */
>
>             **kjournald2**
>             jbd2_journal_commit_transaction()
>             ...
>               ext4_inode_error()
>                 schedule_work(s_sb_upd_work)

Then this schedule_work will never happen, since it will check whether
the s_journal_destroying flag is set.

>
>                             **workqueue**
>                             update_super_work()
>                             jbd2_journal_start(journal)
>                               start_this_handle()
>                                 //This new trans will
>                                 //not be committed.
>
>             jbd2_journal_abort()
>
>   /* wait for it to complete */
>
>   flush_work(&sbi->s_sb_upd_work)

No need to call flush_work again here, since there is no new work that
will be scheduled, right? Am I missing something?
-ritesh

> jbd2_journal_destroy()
>   journal->j_flags |= JBD2_UNMOUNT;
>   jbd2_journal_commit_transaction() //it will commit nothing

Thanks,
Yi.