On 5/26/25 6:51 PM, Trond Myklebust wrote: > On Mon, 2025-05-26 at 11:38 -0600, Jens Axboe wrote: >> On 5/26/25 9:06 AM, Jens Axboe wrote: >>> On 5/26/25 7:05 AM, Jens Axboe wrote: >>>> On 5/25/25 1:12 PM, Vlastimil Babka wrote: >>>>> On 5/25/25 8:06 PM, Al Viro wrote: >>>>>> On Sun, May 25, 2025 at 09:32:09AM +0100, Al Viro wrote: >>>>>> >>>>>>> Breakage is still present in the current mainline ;-/ >>>>>> >>>>>> With CONFIG_DEBUG_VM on top of pagealloc debugging: >>>>>> >>>>>> [ 1434.992817] run fstests generic/127 at 2025-05-25 >>>>>> 11:46:11g >>>>>> [ 1448.956242] BUG: Bad page state in process kworker/2:1 >>>>>> pfn:112cb0g >>>>>> [ 1448.956846] page: refcount:0 mapcount:0 >>>>>> mapping:0000000000000000 index:0x3e pfn:0x112cb0g >>>>>> [ 1448.957453] flags: >>>>>> 0x800000000000000e(referenced|uptodate|writeback|zone=2)g >>>>> >>>>> It doesn't like the writeback flag. >>>>> >>>>>> [ 1448.957863] raw: 800000000000000e dead000000000100 >>>>>> dead000000000122 0000000000000000g >>>>>> [ 1448.958303] raw: 000000000000003e 0000000000000000 >>>>>> 00000000ffffffff 0000000000000000g >>>>>> [ 1448.958833] page dumped because: PAGE_FLAGS_CHECK_AT_FREE >>>>>> flag(s) setg >>>>>> [ 1448.959320] Modules linked in: xfs autofs4 fuse nfsd >>>>>> auth_rpcgss nfs_acl nfs lockd grace sunrpc loop ecryptfs >>>>>> 9pnet_virtio 9pnet netfs evdev pcspkr sg button ext4 jbd2 >>>>>> btrfs blake2b_generic xor zlib_deflate raid6_pq zstd_compress >>>>>> sr_mod cdrom ata_generic ata_piix psmouse serio_raw i2c_piix4 >>>>>> i2c_smbus libata e1000g >>>>>> [ 1448.960874] CPU: 2 UID: 0 PID: 2614 Comm: kworker/2:1 Not >>>>>> tainted 6.14.0-rc1+ #78g >>>>>> [ 1448.960878] Hardware name: QEMU Standard PC (i440FX + >>>>>> PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014g >>>>>> [ 1448.960879] Workqueue: xfs-conv/sdb1 xfs_end_io [xfs]g >>>>>> [ 1448.960938] Call Trace:g >>>>>> [ 1448.960939] <TASK>g >>>>>> [ 1448.960940] dump_stack_lvl+0x4f/0x60g >>>>>> [ 1448.960953] bad_page+0x6f/0x100g >>>>>> [ 1448.960957] free_frozen_pages+0x471/0x640g >>>>>> [ 1448.960958] iomap_finish_ioend+0x196/0x3c0g >>>>>> [ 1448.960963] iomap_finish_ioends+0x83/0xc0g >>>>>> [ 1448.960964] xfs_end_ioend+0x64/0x140 [xfs]g >>>>>> [ 1448.961003] xfs_end_io+0x93/0xc0 [xfs]g >>>>>> [ 1448.961036] process_one_work+0x153/0x390g >>>>>> [ 1448.961044] worker_thread+0x2ab/0x3b0g >>>>>> [ 1448.961045] ? rescuer_thread+0x470/0x470g >>>>>> [ 1448.961047] kthread+0xf7/0x200g >>>>>> [ 1448.961048] ? kthread_use_mm+0xa0/0xa0g >>>>>> [ 1448.961049] ret_from_fork+0x2d/0x50g >>>>>> [ 1448.961053] ? kthread_use_mm+0xa0/0xa0g >>>>>> [ 1448.961054] ret_from_fork_asm+0x11/0x20g >>>>>> [ 1448.961058] </TASK>g >>>>>> [ 1448.961155] Disabling lock debugging due to kernel taintg >>>>>> [ 1448.969569] page: refcount:0 mapcount:0 >>>>>> mapping:0000000000000000 index:0x3e pfn:0x112cb0g >>>>> >>>>> same pfn, same struct page >>>>> >>>>>> [ 1448.970023] flags: >>>>>> 0x800000000000000e(referenced|uptodate|writeback|zone=2)g >>>>>> [ 1448.970651] raw: 800000000000000e dead000000000100 >>>>>> dead000000000122 0000000000000000g >>>>>> [ 1448.971222] raw: 000000000000003e 0000000000000000 >>>>>> 00000000ffffffff 0000000000000000g >>>>>> [ 1448.971812] page dumped because: >>>>>> VM_BUG_ON_FOLIO(((unsigned int) folio_ref_count(folio) + 127u >>>>>> <= 127u))g >>>>>> [ 1448.972490] ------------[ cut here ]------------g >>>>>> [ 1448.972841] kernel BUG at ./include/linux/mm.h:1455!g >>>>> >>>>> this is folio_get() noticing refcount is 0, so a use-after >>>>> free, because >>>>> we already tried to free the page above. >>>>> >>>>> I'm not familiar with this code too much, but I suspect problem >>>>> was >>>>> introduced by commit fb7d3bc414939 ("mm/filemap: drop >>>>> streaming/uncached >>>>> pages when writeback completes") and only (more) exposed here. >>>>> >>>>> so in folio_end_writeback() we have >>>>> if (__folio_end_writeback(folio)) >>>>> folio_wake_bit(folio, PG_writeback); >>>>> >>>>> but calling the folio_end_dropbehind_write() doesn't depend on >>>>> the >>>>> result of __folio_end_writeback() >>>>> this seems rather suspicious >>>>> >>>>> I think if __folio_end_writeback() was true then PG_writeback >>>>> would be >>>>> cleared and thus we'd not see the PAGE_FLAGS_CHECK_AT_FREE >>>>> failure. >>>>> Instead we do a premature folio_end_dropbehind_write() dropping >>>>> a page >>>>> ref and then the final folio_put() in folio_end_writeback() >>>>> frees the >>>>> page and splats on the PG_writeback. Then the folio is >>>>> processed again >>>>> in the following iteration of iomap_finish_ioend() and splats >>>>> on the >>>>> refcount-already-zero. >>>>> >>>>> So I think folio_end_dropbehind_write() should only be done >>>>> when >>>>> __folio_end_writeback() was true. Most likely even the >>>>> folio_test_clear_dropbehind() should be tied to that, or we >>>>> clear it too >>>>> early and then never act upon it later? >>>> >>>> Thanks for taking a look at this! I tried to reproduce this this >>>> morning >>>> and failed miserably. I then injected a delay for the above case, >>>> and it >>>> does indeed then trigger for me. So far, so good. >>>> >>>> I agree with your analysis, we should only be doing the >>>> dropbehind for a >>>> non-zero return from __folio_end_writeback(), and that includes >>>> the >>>> test_and_clear to avoid dropping the drop-behind state. But we >>>> also need >>>> to check/clear this state pre __folio_end_writeback(), which then >>>> puts >>>> us in a spot where it needs to potentially be re-set. Which fails >>>> pretty >>>> racy... >>>> >>>> I'll ponder this a bit. Good thing fsx got RWF_DONTCACHE support, >>>> or I >>>> suspect this would've taken a while to run into. >>> >>> Took a closer look... I may be smoking something good here, but I >>> don't >>> see what the __folio_end_writeback()() return value has to do with >>> this >>> at all. Regardless of what it returns, it should've cleared >>> PG_writeback, and in fact the only thing it returns is whether or >>> not we >>> had anyone waiting on it. Which should have _zero_ bearing on >>> whether or >>> not we can clear/invalidate the range. >>> >>> To me, this smells more like a race of some sort, between dirty and >>> invalidation. fsx does a lot of sub-page sized operations. >>> >>> I'll poke a bit more... >> >> I _think_ we're racing with the same folio being marked for writeback >> again. Al, can you try the below? >> >> >> diff --git a/mm/filemap.c b/mm/filemap.c >> index 7b90cbeb4a1a..e95b184a2459 100644 >> --- a/mm/filemap.c >> +++ b/mm/filemap.c >> @@ -1604,7 +1604,7 @@ static void folio_end_dropbehind_write(struct >> folio *folio) >> * invalidation in that case. >> */ >> if (in_task() && folio_trylock(folio)) { >> - if (folio->mapping) >> + if (folio->mapping && !folio_test_writeback(folio)) >> folio_unmap_invalidate(folio->mapping, >> folio, 0); >> folio_unlock(folio); >> } >> > > I think we need to test for PG_dirty after retaking the folio lock as > well. Nothing stops a second thread from redirtying the page once the > folio lock is dropped, and while some filesystems may insist on waiting > for PG_writeback before allowing redirtying to complete, that still > ends up racing because folio_end_dropbehind_write() is called after the > call to __folio_end_writeback(). Agree, local version actually has both as well. > Note that the same set of races can happen in > filemap_end_dropbehind_read(), so we need the same set of checks after > taking the folio lock there too. The existing checks are insufficient, > since they only happen before taking the folio lock. Ah good catch. I'll send out the patch tomorrow. -- Jens Axboe