Re: [BUG] regression from 974c5e6139db "xfs: flag as supporting FOP_DONTCACHE" (double free on page?)

Jens Axboe <axboe@xxxxxxxxx> · Mon, 26 May 2025 18:56:56 -0600

On 5/26/25 6:51 PM, Trond Myklebust wrote:
> On Mon, 2025-05-26 at 11:38 -0600, Jens Axboe wrote:
>> On 5/26/25 9:06 AM, Jens Axboe wrote:
>>> On 5/26/25 7:05 AM, Jens Axboe wrote:
>>>> On 5/25/25 1:12 PM, Vlastimil Babka wrote:
>>>>> On 5/25/25 8:06 PM, Al Viro wrote:
>>>>>> On Sun, May 25, 2025 at 09:32:09AM +0100, Al Viro wrote:
>>>>>>
>>>>>>> Breakage is still present in the current mainline ;-/
>>>>>>
>>>>>> With CONFIG_DEBUG_VM on top of pagealloc debugging:
>>>>>>
>>>>>> [ 1434.992817] run fstests generic/127 at 2025-05-25
>>>>>> 11:46:11g
>>>>>> [ 1448.956242] BUG: Bad page state in process kworker/2:1 
>>>>>> pfn:112cb0g
>>>>>> [ 1448.956846] page: refcount:0 mapcount:0
>>>>>> mapping:0000000000000000 index:0x3e pfn:0x112cb0g
>>>>>> [ 1448.957453] flags:
>>>>>> 0x800000000000000e(referenced|uptodate|writeback|zone=2)g
>>>>>
>>>>> It doesn't like the writeback flag.
>>>>>
>>>>>> [ 1448.957863] raw: 800000000000000e dead000000000100
>>>>>> dead000000000122 0000000000000000g
>>>>>> [ 1448.958303] raw: 000000000000003e 0000000000000000
>>>>>> 00000000ffffffff 0000000000000000g
>>>>>> [ 1448.958833] page dumped because: PAGE_FLAGS_CHECK_AT_FREE
>>>>>> flag(s) setg
>>>>>> [ 1448.959320] Modules linked in: xfs autofs4 fuse nfsd
>>>>>> auth_rpcgss nfs_acl nfs lockd grace sunrpc loop ecryptfs
>>>>>> 9pnet_virtio 9pnet netfs evdev pcspkr sg button ext4 jbd2
>>>>>> btrfs blake2b_generic xor zlib_deflate raid6_pq zstd_compress
>>>>>> sr_mod cdrom ata_generic ata_piix psmouse serio_raw i2c_piix4
>>>>>> i2c_smbus libata e1000g
>>>>>> [ 1448.960874] CPU: 2 UID: 0 PID: 2614 Comm: kworker/2:1 Not
>>>>>> tainted 6.14.0-rc1+ #78g
>>>>>> [ 1448.960878] Hardware name: QEMU Standard PC (i440FX +
>>>>>> PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014g
>>>>>> [ 1448.960879] Workqueue: xfs-conv/sdb1 xfs_end_io [xfs]g
>>>>>> [ 1448.960938] Call Trace:g
>>>>>> [ 1448.960939]  <TASK>g
>>>>>> [ 1448.960940]  dump_stack_lvl+0x4f/0x60g
>>>>>> [ 1448.960953]  bad_page+0x6f/0x100g
>>>>>> [ 1448.960957]  free_frozen_pages+0x471/0x640g
>>>>>> [ 1448.960958]  iomap_finish_ioend+0x196/0x3c0g
>>>>>> [ 1448.960963]  iomap_finish_ioends+0x83/0xc0g
>>>>>> [ 1448.960964]  xfs_end_ioend+0x64/0x140 [xfs]g
>>>>>> [ 1448.961003]  xfs_end_io+0x93/0xc0 [xfs]g
>>>>>> [ 1448.961036]  process_one_work+0x153/0x390g
>>>>>> [ 1448.961044]  worker_thread+0x2ab/0x3b0g
>>>>>> [ 1448.961045]  ? rescuer_thread+0x470/0x470g
>>>>>> [ 1448.961047]  kthread+0xf7/0x200g
>>>>>> [ 1448.961048]  ? kthread_use_mm+0xa0/0xa0g
>>>>>> [ 1448.961049]  ret_from_fork+0x2d/0x50g
>>>>>> [ 1448.961053]  ? kthread_use_mm+0xa0/0xa0g
>>>>>> [ 1448.961054]  ret_from_fork_asm+0x11/0x20g
>>>>>> [ 1448.961058]  </TASK>g
>>>>>> [ 1448.961155] Disabling lock debugging due to kernel taintg
>>>>>> [ 1448.969569] page: refcount:0 mapcount:0
>>>>>> mapping:0000000000000000 index:0x3e pfn:0x112cb0g
>>>>>
>>>>> same pfn, same struct page
>>>>>
>>>>>> [ 1448.970023] flags:
>>>>>> 0x800000000000000e(referenced|uptodate|writeback|zone=2)g
>>>>>> [ 1448.970651] raw: 800000000000000e dead000000000100
>>>>>> dead000000000122 0000000000000000g
>>>>>> [ 1448.971222] raw: 000000000000003e 0000000000000000
>>>>>> 00000000ffffffff 0000000000000000g
>>>>>> [ 1448.971812] page dumped because:
>>>>>> VM_BUG_ON_FOLIO(((unsigned int) folio_ref_count(folio) + 127u
>>>>>> <= 127u))g
>>>>>> [ 1448.972490] ------------[ cut here ]------------g
>>>>>> [ 1448.972841] kernel BUG at ./include/linux/mm.h:1455!g
>>>>>
>>>>> this is folio_get() noticing refcount is 0, so a use-after
>>>>> free, because
>>>>> we already tried to free the page above.
>>>>>
>>>>> I'm not familiar with this code too much, but I suspect problem
>>>>> was
>>>>> introduced by commit fb7d3bc414939 ("mm/filemap: drop
>>>>> streaming/uncached
>>>>> pages when writeback completes") and only (more) exposed here.
>>>>>
>>>>> so in folio_end_writeback() we have
>>>>>         if (__folio_end_writeback(folio))
>>>>>                 folio_wake_bit(folio, PG_writeback);
>>>>>
>>>>> but calling the folio_end_dropbehind_write() doesn't depend on
>>>>> the
>>>>> result of __folio_end_writeback()
>>>>> this seems rather suspicious
>>>>>
>>>>> I think if __folio_end_writeback() was true then PG_writeback
>>>>> would be
>>>>> cleared and thus we'd not see the PAGE_FLAGS_CHECK_AT_FREE
>>>>> failure.
>>>>> Instead we do a premature folio_end_dropbehind_write() dropping
>>>>> a page
>>>>> ref and then the final folio_put() in folio_end_writeback()
>>>>> frees the
>>>>> page and splats on the PG_writeback. Then the folio is
>>>>> processed again
>>>>> in the following iteration of iomap_finish_ioend() and splats
>>>>> on the
>>>>> refcount-already-zero.
>>>>>
>>>>> So I think folio_end_dropbehind_write() should only be done
>>>>> when
>>>>> __folio_end_writeback() was true. Most likely even the
>>>>> folio_test_clear_dropbehind() should be tied to that, or we
>>>>> clear it too
>>>>> early and then never act upon it later?
>>>>
>>>> Thanks for taking a look at this! I tried to reproduce this this
>>>> morning
>>>> and failed miserably. I then injected a delay for the above case,
>>>> and it
>>>> does indeed then trigger for me. So far, so good.
>>>>
>>>> I agree with your analysis, we should only be doing the
>>>> dropbehind for a
>>>> non-zero return from __folio_end_writeback(), and that includes
>>>> the
>>>> test_and_clear to avoid dropping the drop-behind state. But we
>>>> also need
>>>> to check/clear this state pre __folio_end_writeback(), which then
>>>> puts
>>>> us in a spot where it needs to potentially be re-set. Which fails
>>>> pretty
>>>> racy...
>>>>
>>>> I'll ponder this a bit. Good thing fsx got RWF_DONTCACHE support,
>>>> or I
>>>> suspect this would've taken a while to run into.
>>>
>>> Took a closer look... I may be smoking something good here, but I
>>> don't
>>> see what the __folio_end_writeback()() return value has to do with
>>> this
>>> at all. Regardless of what it returns, it should've cleared
>>> PG_writeback, and in fact the only thing it returns is whether or
>>> not we
>>> had anyone waiting on it. Which should have _zero_ bearing on
>>> whether or
>>> not we can clear/invalidate the range.
>>>
>>> To me, this smells more like a race of some sort, between dirty and
>>> invalidation. fsx does a lot of sub-page sized operations.
>>>
>>> I'll poke a bit more...
>>
>> I _think_ we're racing with the same folio being marked for writeback
>> again. Al, can you try the below?
>>
>>
>> diff --git a/mm/filemap.c b/mm/filemap.c
>> index 7b90cbeb4a1a..e95b184a2459 100644
>> --- a/mm/filemap.c
>> +++ b/mm/filemap.c
>> @@ -1604,7 +1604,7 @@ static void folio_end_dropbehind_write(struct
>> folio *folio)
>>  	 * invalidation in that case.
>>  	 */
>>  	if (in_task() && folio_trylock(folio)) {
>> -		if (folio->mapping)
>> +		if (folio->mapping && !folio_test_writeback(folio))
>>  			folio_unmap_invalidate(folio->mapping,
>> folio, 0);
>>  		folio_unlock(folio);
>>  	}
>>
> 
> I think we need to test for PG_dirty after retaking the folio lock as
> well. Nothing stops a second thread from redirtying the page once the
> folio lock is dropped, and while some filesystems may insist on waiting
> for PG_writeback before allowing redirtying to complete, that still
> ends up racing because folio_end_dropbehind_write() is called after the
> call to __folio_end_writeback().

Agree, local version actually has both as well.

> Note that the same set of races can happen in
> filemap_end_dropbehind_read(), so we need the same set of checks after
> taking the folio lock there too. The existing checks are insufficient,
> since they only happen before taking the folio lock.

Ah good catch. I'll send out the patch tomorrow.

-- 
Jens Axboe