Re: [PATCH v2 2/3] ext4: avoid journaling sb update on error if journal is destroying

Zhang Yi <yi.zhang@xxxxxxxxxxxxxxx> · Sat, 8 Mar 2025 10:57:16 +0800

On 2025/3/8 1:26, Ojaswin Mujoo wrote:
> On Fri, Mar 07, 2025 at 08:36:08PM +0800, Zhang Yi wrote:
>> On 2025/3/7 18:27, Ojaswin Mujoo wrote:
>>> On Fri, Mar 07, 2025 at 04:43:24PM +0800, Zhang Yi wrote:
>>>> On 2025/3/7 16:13, Ojaswin Mujoo wrote:
>>>>> On Fri, Mar 07, 2025 at 12:04:26PM +0530, Ojaswin Mujoo wrote:
>>>>>> On Fri, Mar 07, 2025 at 10:49:28AM +0800, Zhang Yi wrote:
>>>>>>> On 2025/3/6 22:28, Ojaswin Mujoo wrote:
>>>>>>>> Presently we always BUG_ON if trying to start a transaction on a journal marked
>>>>>>>> with JBD2_UNMOUNT, since this should never happen. However, while ltp running
>>>>>>>> stress tests, it was observed that in case of some error handling paths, it is
>>>>>>>> possible for update_super_work to start a transaction after the journal is
>>>>>>>> destroyed eg:
>>>>>>>>
>>>>>>>> (umount)
>>>>>>>> ext4_kill_sb
>>>>>>>>   kill_block_super
>>>>>>>>     generic_shutdown_super
>>>>>>>>       sync_filesystem /* commits all txns */
>>>>>>>>       evict_inodes
>>>>>>>>         /* might start a new txn */
>>>>>>>>       ext4_put_super
>>>>>>>> 	flush_work(&sbi->s_sb_upd_work) /* flush the workqueue */
>>>>>>>>         jbd2_journal_destroy
>>>>>>>>           journal_kill_thread
>>>>>>>>             journal->j_flags |= JBD2_UNMOUNT;
>>>>>>>>           jbd2_journal_commit_transaction
>>>>>>>>             jbd2_journal_get_descriptor_buffer
>>>>>>>>               jbd2_journal_bmap
>>>>>>>>                 ext4_journal_bmap
>>>>>>>>                   ext4_map_blocks
>>>>>>>>                     ...
>>>>>>>>                     ext4_inode_error
>>>>>>>>                       ext4_handle_error
>>>>>>>>                         schedule_work(&sbi->s_sb_upd_work)
>>>>>>>>
>>>>>>>>                                                /* work queue kicks in */
>>>>>>>>                                                update_super_work
>>>>>>>>                                                  jbd2_journal_start
>>>>>>>>                                                    start_this_handle
>>>>>>>>                                                      BUG_ON(journal->j_flags &
>>>>>>>>                                                             JBD2_UNMOUNT)
>>>>>>>>
>>>>>>>> Hence, introduce a new sbi flag s_journal_destroying to indicate journal is
>>>>>>>> destroying only do a journaled (and deferred) update of sb if this flag is not
>>>>>>>> set. Otherwise, just fallback to an un-journaled commit.
>>>>>>>>
>>>>>>>> We set sbi->s_journal_destroying = true only after all the FS updates are done
>>>>>>>> during ext4_put_super() (except a running transaction that will get commited
>>>>>>>> during jbd2_journal_destroy()). After this point, it is safe to commit the sb
>>>>>>>> outside the journal as it won't race with a journaled update (refer
>>>>>>>> 2d01ddc86606).
>>>>>>>>
>>>>>>>> Also, we don't need a similar check in ext4_grp_locked_error since it is only
>>>>>>>> called from mballoc and AFAICT it would be always valid to schedule work here.
>>>>>>>>
>>>>>>>> Fixes: 2d01ddc86606 ("ext4: save error info to sb through journal if available")
>>>>>>>> Reported-by: Mahesh Kumar <maheshkumar657g@xxxxxxxxx>
>>>>>>>> Suggested-by: Jan Kara <jack@xxxxxxx>
>>>>>>>> Signed-off-by: Ojaswin Mujoo <ojaswin@xxxxxxxxxxxxx>
>>>>>>>> ---
>>>>>>>>  fs/ext4/ext4.h      | 2 ++
>>>>>>>>  fs/ext4/ext4_jbd2.h | 8 ++++++++
>>>>>>>>  fs/ext4/super.c     | 4 +++-
>>>>>>>>  3 files changed, 13 insertions(+), 1 deletion(-)
>>>>>>>>
>>>>>>>> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
>>>>>>>> index 2b7d781bfcad..d48e93bd5690 100644
>>>>>>>> --- a/fs/ext4/ext4.h
>>>>>>>> +++ b/fs/ext4/ext4.h
>>>>>>>> @@ -1728,6 +1728,8 @@ struct ext4_sb_info {
>>>>>>>>  	 */
>>>>>>>>  	struct work_struct s_sb_upd_work;
>>>>>>>>  
>>>>>>>> +	bool s_journal_destorying;
>>>>>>>> +
>>>>>>>>  	/* Atomic write unit values in bytes */
>>>>>>>>  	unsigned int s_awu_min;
>>>>>>>>  	unsigned int s_awu_max;
>>>>>>>> diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
>>>>>>>> index 9b3c9df02a39..6bd3ca84410d 100644
>>>>>>>> --- a/fs/ext4/ext4_jbd2.h
>>>>>>>> +++ b/fs/ext4/ext4_jbd2.h
>>>>>>>> @@ -437,6 +437,14 @@ static inline int ext4_journal_destroy(struct ext4_sb_info *sbi, journal_t *jour
>>>>>>>>  {
>>>>>>>>  	int err = 0;
>>>>>>>>  
>>>>>>>> +	/*
>>>>>>>> +	 * At this point all pending FS updates should be done except a possible
>>>>>>>> +	 * running transaction (which will commit in jbd2_journal_destroy). It
>>>>>>>> +	 * is now safe for any new errors to directly commit superblock rather
>>>>>>>> +	 * than going via journal.
>>>>>>>> +	 */
>>>>>>>> +	sbi->s_journal_destorying = true;
>>>>>>>> +
>>>>>>>
>>>>>>> Hi, Ojaswin!
>>>>>>>
>>>>>>> I'm afraid you still need to flush the superblock update work here,
>>>>>>> otherwise I guess the race condition you mentioned in v1 could still
>>>>>>> occur.
>>>>>>>
>>>>>>>  ext4_put_super()
>>>>>>>   flush_work(&sbi->s_sb_upd_work)
>>>>>>>
>>>>>>>                     **kjournald2**
>>>>>>>                     jbd2_journal_commit_transaction()
>>>>>>>                     ...
>>>>>>>                     ext4_inode_error()
>>>>>>>                       /* JBD2_UNMOUNT not set */
>>>>>>>                       schedule_work(s_sb_upd_work)
>>>>>>>
>>>>>>>                                   **workqueue**
>>>>>>>                                    update_super_work
>>>>>>>                                    /* s_journal_destorying is not set */
>>>>>>>                             	   if (journal && !s_journal_destorying)
>>>>>>>
>>>>>>>   ext4_journal_destroy()
>>>>>>>    /* set s_journal_destorying */
>>>>>>>    sbi->s_journal_destorying = true;
>>>>>>>    jbd2_journal_destroy()
>>>>>>>     journal->j_flags |= JBD2_UNMOUNT;
>>>>>>>
>>>>>>>                                        jbd2_journal_start()
>>>>>>>                                         start_this_handle()
>>>>>>>                                           BUG_ON(JBD2_UNMOUNT)
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Yi.
>>>>>> Hi Yi,
>>>>>>
>>>>>> Yes you are right, somehow missed this edge case :(
>>>>>>
>>>>>> Alright then, we have to move out sbi->s_journal_destroying outside the
>>>>>> helper. Just wondering if I should still let it be in
>>>>>> ext4_journal_destroy and just add an extra s_journal_destroying = false
>>>>>> before schedule_work(s_sb_upd_work), because it makes sense.
>>>>>>
>>>>>> Okay let me give it some thought but thanks for pointing this out!
>>>>>>
>>>>>> Regards,
>>>>>> ojaswin
>>>>>
>>>>> Okay so thinking about it a bit more, I see you also suggested to flush
>>>>> the work after marking sbi->s_journal_destroying. But will that solve
>>>>> it?
>>>>>
>>>>>   ext4_put_super()
>>>>>    flush_work(&sbi->s_sb_upd_work)
>>>>>  
>>>>>                      **kjournald2**
>>>>>                      jbd2_journal_commit_transaction()
>>>>>                      ...
>>>>>                      ext4_inode_error()
>>>>>                        /* JBD2_UNMOUNT not set */
>>>>>                        schedule_work(s_sb_upd_work)
>>>>>  
>>>>>                                     **workqueue**
>>>>>                                     update_super_work
>>>>>                                     /* s_journal_destorying is not set */
>>>>>                              	      if (journal && !s_journal_destorying)
>>>>>  
>>>>>    ext4_journal_destroy()
>>>>>     /* set s_journal_destorying */
>>>>>     sbi->s_journal_destorying = true;
>>>>>     flush_work(&sbi->s_sb_upd_work)
>>>>>                                       schedule_work()
>>>>                                         ^^^^^^^^^^^^^^^
>>>>                                         where does this come from?
>>>>
>>>> After this flush_work, we can guarantee that the running s_sb_upd_work
>>>> finishes before we set JBD2_UNMOUNT. Additionally, the journal will
>>>> not commit transaction or call schedule_work() again because it has
>>>> been aborted due to the previous error. Am I missing something?
>>>>
>>>> Thanks,
>>>> Yi.
>>>
>>> Hmm, so I am thinking of a corner case in ext4_handle_error() where 
>>>
>>>  if(journal && !is_journal_destroying) 
>>>
>>> is computed but schedule_work() not called yet, which is possible cause
>>> the cmp followed by jump is not atomic in nature. If the schedule_work
>>> is only called after we have done the flush then we end up with this:
>>>
>>>                               	      if (journal && !s_journal_destorying)
>>>     ext4_journal_destroy()
>>>      /* set s_journal_destorying */
>>>      sbi->s_journal_destorying = true;
>>>      flush_work(&sbi->s_sb_upd_work)
>>>                                        schedule_work()
>>>
>>> Which is possible IMO, although the window is tiny.
>>
>> Yeah, right!
>> Sorry for misread the location where you add the "!s_journal_destorying"
>> check, the graph I provided was in update_super_work(), which was wrong.
> 
> Oh right, I also misread your trace but yes as discussed, even 
> 
>     sbi->s_journal_destorying = true;
> 		flush_work()
>     jbd2_journal_destroy()
> 
> doesn't work.
> 
>> The right one should be:
>>
>>  ext4_put_super()
>>   flush_work(&sbi->s_sb_upd_work)
>>
>>                     **kjournald2**
>>                     jbd2_journal_commit_transaction()
>>                     ...
>>                     ext4_inode_error()
>>                       /* s_journal_destorying is not set */
>>                       if (journal && !s_journal_destorying)
>>                         (schedule_work(s_sb_upd_work))  //can be here
>>
>>   ext4_journal_destroy()
>>    /* set s_journal_destorying */
>>    sbi->s_journal_destorying = true;
>>    jbd2_journal_destroy()
>>     journal->j_flags |= JBD2_UNMOUNT;
>>
>>                         (schedule_work(s_sb_upd_work))  //also can be here
>>
>>                                   **workqueue**
>>                                    update_super_work()
>>                                    journal = sbi->s_journal //get journal
>>     kfree(journal)
>>                                      jbd2_journal_start(journal) //journal UAF
>>                                        start_this_handle()
>>                                          BUG_ON(JBD2_UNMOUNT) //bugon here
>>
>>
>> So there are two problems here, the first one is the 'journal' UAF,
>> the second one is triggering JBD2_UNMOUNT flag BUGON.
> 
> Indeed, there's a possible UAF here as well.
> 
>>
>>>>>
>>>>> As for the fix, how about we do something like this:
>>>>>
>>>>>   ext4_put_super()
>>>>>
>>>>>    flush_work(&sbi->s_sb_upd_work)
>>>>>    destroy_workqueue(sbi->rsv_conversion_wq);
>>>>>
>>>>>    ext4_journal_destroy()
>>>>>     /* set s_journal_destorying */
>>>>>     sbi->s_journal_destorying = true;
>>>>>
>>>>>    /* trigger a commit and wait for it to complete */
>>>>>
>>>>>     flush_work(&sbi->s_sb_upd_work)
>>>>>
>>>>>     jbd2_journal_destroy()
>>>>>      journal->j_flags |= JBD2_UNMOUNT;
>>>>>  
>>>>>                                         jbd2_journal_start()
>>>>>                                          start_this_handle()
>>>>>                                            BUG_ON(JBD2_UNMOUNT)
>>>>>
>>>>> Still giving this codepath some thought but seems like this might just
>>>>> be enough to fix the race. Thoughts on this?
>>>>>
>>
>> I think this solution should work, the forced commit and flush_work()
>> should ensure that the last transaction is committed and that the
>> potential work is done.
>>
>> Besides, the s_journal_destorying flag is set and check concurrently
>> now, so we need WRITE_ONCE() and READ_ONCE() for it. Besides, what
>> about adding a new flag into sbi->s_mount_state instead of adding
>> new s_journal_destorying?
> 
> Right, that makes sence. I will incorporate these changes in the next 
> revision.
> 

Think about this again, it seems that we no longer need the destroying
flag. Because we force to commit and wait for the **last** transaction to
complete, and the flush work should also ensure that the last sb_update
work to complete. Regardless of whether it starts a new handle in the
last update_super_work(), it will not commit since the journal should
have aborted. What are your thoughts?

 ext4_put_super()
  flush_work(&sbi->s_sb_upd_work)
  destroy_workqueue(sbi->rsv_conversion_wq)

  ext4_journal_destroy()
   /* trigger a commit (it will commit the last trnasaction) */

                    **kjournald2**
                    jbd2_journal_commit_transaction()
                    ...
                     ext4_inode_error()
                      schedule_work(s_sb_upd_work))

                                     **workqueue**
                                      update_super_work()
                                        jbd2_journal_start(journal)
                                          start_this_handle()
                                          //This new trans will
                                          //not be committed.

                     jbd2_journal_abort()

   /* wait for it to complete */

   flush_work(&sbi->s_sb_upd_work)
   jbd2_journal_destroy()
    journal->j_flags |= JBD2_UNMOUNT;
   jbd2_journal_commit_transaction() //it will commit nothing

Thanks,
Yi.