On Tue, Apr 08, 2025 at 10:51:25AM -0700, Darrick J. Wong wrote:
> Hi everyone,
>
> I saw the following crash in 6.15-rc1 when running xfs/032 from fstests
> for-next.  I don't see it in 6.14.  I'll try to bisect, but in the
> meantime does this look familiar to anyone?  The XFS configuration is
> pretty boring:
>
> MKFS_OPTIONS="-m autofsck=1, -n size=8192"
> MOUNT_OPTIONS="-o uquota,gquota,pquota"
>
> (4k fsblocks, x64 host, directory blocks are 8k)
>
> From the stack trace, it looks like the null pointer dereference is in
> this call to bdev_nr_sectors:
>
> void guard_bio_eod(struct bio *bio)
> {
> 	sector_t maxsector = bdev_nr_sectors(bio->bi_bdev);
>
> because bio->bi_bdev is NULL for some reason.  The crash itself seems to
> be from do_mpage_readpage around line 304:
>
> alloc_new:
> 	if (args->bio == NULL) {
> 		args->bio = bio_alloc(bdev, bio_max_segs(args->nr_pages), opf,
> 				      gfp);
>
> bdev is NULL here ^^^^
>
> 		if (args->bio == NULL)
> 			goto confused;
> 		args->bio->bi_iter.bi_sector = first_block << (blkbits - 9);
> 	}
>
> 	length = first_hole << blkbits;
> 	if (!bio_add_folio(args->bio, folio, length, 0)) {
> 		args->bio = mpage_bio_submit_read(args->bio);
> 		goto alloc_new;
> 	}
>
> 	relative_block = block_in_file - args->first_logical_block;
> 	nblocks = map_bh->b_size >> blkbits;
> 	if ((buffer_boundary(map_bh) && relative_block == nblocks) ||
> 	    (first_hole != blocks_per_folio))
> 		args->bio = mpage_bio_submit_read(args->bio);
>
> My guess is that there was no previous call to ->get_block and that
> blocks_per_folio == 0, so nobody ever actually set the local @bdev
> variable to a non-NULL value.  blocks_per_folio is perhaps zero because
> xfs/032 tried formatting with a sector size of 64k, which causes the
> bdev inode->i_blkbits to be set to 16, but for some reason we got a
> folio that wasn't 64k in size:
>
> 	const unsigned blkbits = inode->i_blkbits;
> 	const unsigned blocks_per_folio = folio_size(folio) >> blkbits;
>
> <shrug> That's just my conjecture for now.
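
For anyone who wants to check the arithmetic behind that conjecture, here is
a minimal userspace sketch of the blocks_per_folio computation.  The helper
name toy_blocks_per_folio is made up for illustration and is not a kernel
function; it only models the right shift quoted above.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/*
 * Userspace model of the computation quoted from do_mpage_readpage():
 *   blocks_per_folio = folio_size(folio) >> inode->i_blkbits
 * The name and signature are illustrative only, not kernel API.
 */
static size_t toy_blocks_per_folio(size_t folio_bytes, unsigned blkbits)
{
	/* Shifting by the type width or more is undefined in C. */
	assert(blkbits < sizeof(size_t) * CHAR_BIT);

	/* Any folio smaller than one block shifts down to zero. */
	return folio_bytes >> blkbits;
}
```

With a 4096-byte folio, blkbits of 13 or 16 both give zero blocks per folio,
while a 65536-byte folio with blkbits of 16 gives one.  That is consistent
with the guess that a too-small folio leaves the block loop with nothing to
map.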

Ok so overnight my debugging patch confirmed this hypothesis:

XFS (sda4): Mounting V5 Filesystem 8cf3c461-57b0-4bba-86ab-6dc13b8cdab0
XFS (sda4): Ending clean mount
XFS (sda4): Quotacheck needed: Please wait.
XFS (sda4): Quotacheck: Done.
XFS (sda4): Unmounting Filesystem 8cf3c461-57b0-4bba-86ab-6dc13b8cdab0
FARK bio_alloc with NULL bdev?! blkbits 13 fsize 4096 blocks_per_folio 0

willy told me to set CONFIG_DEBUG_VM=y and rerun xfs/032.  That didn't
turn anything up, so I decided to race it with:

while sleep 0.1; do blkid -c /dev/null; done

to simulate udev calling libblkid.  That produced a debugging assertion
within 40 seconds:

page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x4f3bc4 pfn:0x43da4
head: order:1 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
memcg:ffff8880446b4800
flags: 0x4fff80000000041(locked|head|node=1|zone=1|lastcpupid=0xfff)
raw: 04fff80000000041 0000000000000000 dead000000000122 0000000000000000
raw: 00000000004f3bc4 0000000000000000 00000001ffffffff ffff8880446b4800
head: 04fff80000000041 0000000000000000 dead000000000122 0000000000000000
head: 00000000004f3bc4 0000000000000000 00000001ffffffff ffff8880446b4800
head: 04fff80000000201 ffffea00010f6901 00000000ffffffff 00000000ffffffff
head: ffffffffffffffff 0000000000000000 00000000ffffffff 0000000000000002
page dumped because: VM_BUG_ON_FOLIO(index & (folio_nr_pages(folio) - 1))
------------[ cut here ]------------
kernel BUG at mm/filemap.c:871!
Oops: invalid opcode: 0000 [#1] SMP
CPU: 3 UID: 0 PID: 26689 Comm: (udev-worker) Not tainted 6.15.0-rc1-djwx #rc1 PREEMPT(lazy) 8c302df0300eabbbd3cdc47fd812690b8d635c39
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
RIP: 0010:__filemap_add_folio+0x4ae/0x540
Code: 40 49 89 d4 0f b6 c1 49 d3 ec 81 e1 c0 00 00 00 0f 84 e0 fb ff ff e9 92 b6 d3 ff 48 c7 c6 68 57 ec 81 4c 89 ef e8 82 6e 05 00 <0f> 0b 49 89 d4 e9 c2 fb ff ff 48 c7 c6 9
RSP: 0018:ffffc900016e3a70 EFLAGS: 00010246
RAX: 0000000000000049 RBX: 0000000000112cc0 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000001 RDI: 00000000ffffffff
RBP: 0000000000000001 R08: 0000000000000000 R09: 205d313431343737
R10: 0000000000000729 R11: 6d75642065676170 R12: 00000000004f3ba8
R13: ffffea00010f6900 R14: ffff88804076a530 R15: ffff88804076a530
FS: 00007f8863b788c0(0000) GS:ffff8880fb952000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055cf459d5000 CR3: 000000000d96f003 CR4: 00000000001706f0
Call Trace:
 <TASK>
 ? memcg_list_lru_alloc+0x2d0/0x2d0
 filemap_add_folio+0x7f/0xd0
 page_cache_ra_unbounded+0x147/0x260
 force_page_cache_ra+0x92/0xb0
 filemap_get_pages+0x13b/0x7b0
 ? current_time+0x3b/0x110
 filemap_read+0x106/0x4c0
 ? _raw_spin_unlock+0x14/0x30
 blkdev_read_iter+0x64/0x120
 vfs_read+0x290/0x390
 ksys_read+0x6f/0xe0
 do_syscall_64+0x47/0x100
 entry_SYSCALL_64_after_hwframe+0x4b/0x53
RIP: 0033:0x7f886428025d
Code: 31 c0 e9 c6 fe ff ff 50 48 8d 3d a6 53 0a 00 e8 59 ff 01 00 66 0f 1f 84 00 00 00 00 00 80 3d 81 23 0e 00 00 74 17 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 5b c3 66 2e 0f 1f c
RSP: 002b:00007fff5ce76228 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
RAX: ffffffffffffffda RBX: 000055cf45839640 RCX: 00007f886428025d
RDX: 0000000000040000 RSI: 000055cf45996908 RDI: 000000000000000f
RBP: 00000004f3b80000 R08: 00007f886435add0 R09: 00007f886435add0
R10: 0000000000000000 R11: 0000000000000246 R12: 000055cf459968e0
R13: 0000000000040000 R14: 000055cf45839698 R15: 000055cf459968f8
 </TASK>
Modules linked in: xfs ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_tcpudp ip_set_hash_ip ip_set_hash_net xt_set nft_compat ip_set_hash_mac ip_set nf_tables nfnet]
Dumping ftrace buffer:
   (ftrace buffer empty)
---[ end trace 0000000000000000 ]---
RIP: 0010:__filemap_add_folio+0x4ae/0x540
Code: 40 49 89 d4 0f b6 c1 49 d3 ec 81 e1 c0 00 00 00 0f 84 e0 fb ff ff e9 92 b6 d3 ff 48 c7 c6 68 57 ec 81 4c 89 ef e8 82 6e 05 00 <0f> 0b 49 89 d4 e9 c2 fb ff ff 48 c7 c6 9
RSP: 0018:ffffc900016e3a70 EFLAGS: 00010246
RAX: 0000000000000049 RBX: 0000000000112cc0 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000001 RDI: 00000000ffffffff
RBP: 0000000000000001 R08: 0000000000000000 R09: 205d313431343737
R10: 0000000000000729 R11: 6d75642065676170 R12: 00000000004f3ba8
R13: ffffea00010f6900 R14: ffff88804076a530 R15: ffff88804076a530
FS: 00007f8863b788c0(0000) GS:ffff8880fb952000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055cf459d5000 CR3: 000000000d96f003 CR4: 00000000001706f0

Digging into the VM, I noticed that mount is stuck in D state:

/proc/44312/task/44312/stack :
[<0>] folio_wait_bit_common+0x144/0x350
[<0>] truncate_inode_pages_range+0x4df/0x5b0
[<0>] set_blocksize+0x10b/0x130
[<0>] xfs_setsize_buftarg+0x1f/0x50 [xfs]
[<0>] xfs_setup_devices+0x1a/0xc0 [xfs]
[<0>] xfs_fs_fill_super+0x423/0xb20 [xfs]
[<0>] get_tree_bdev_flags+0x132/0x1d0
[<0>] vfs_get_tree+0x17/0xa0
[<0>] path_mount+0x721/0xa90
[<0>] __x64_sys_mount+0x10c/0x140
[<0>] do_syscall_64+0x47/0x100
[<0>] entry_SYSCALL_64_after_hwframe+0x4b/0x53

Regrettably the udev worker is gone, but my guess is that the process
exited with the folio locked, so now truncate_inode_pages_range can't
lock it to get rid of it.

Then it occurred to me to look at set_blocksize again:

	/* Don't change the size if it is same as current */
	if (inode->i_blkbits != blksize_bits(size)) {
		sync_blockdev(bdev);
		inode->i_blkbits = blksize_bits(size);
		mapping_set_folio_order_range(inode->i_mapping,
				get_order(size), get_order(size));
		kill_bdev(bdev);
	}

(Note that I changed mapping_set_folio_min_order here to
mapping_set_folio_order_range to shut up a folio migration bug that I
reported elsewhere on fsdevel yesterday, and willy suggested forcing the
max order as a temporary workaround.)

The update of i_blkbits and the order bits of mapping->flags are
performed before kill_bdev truncates the pagecache, which means there's
a window where there can be a !uptodate order-0 folio in the pagecache
but i_blkbits > PAGE_SHIFT (in this case, 13).  The debugging assertion
above is from someone trying to install a too-small folio into the
pagecache.  I think the "FARK" message I captured overnight is from
readahead trying to bring in contents from disk for this too-small folio
and failing.

So I think the answer is that set_blocksize needs to lock out folio_add,
flush the dirty folios, invalidate the entire bdev pagecache, set
i_blkbits and the folio order, and only then allow new additions to the
pagecache.

But then, which lock(s)?
Were this a file on XFS I'd say that one has to take i_rwsem and
mmap_invalidate_lock before truncating the pagecache but by my
recollection bdev devices don't take either lock in their IO paths.

--D
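
To make the ordering argument concrete, here is a toy single-threaded
userspace model, not kernel code.  Every toy_ name is hypothetical, and the
"frozen" flag is only a stand-in for whichever lock ends up excluding
pagecache additions; which lock that should be is still an open question.
The model checks one invariant (every cached folio is at least one block in
size) at the point where a concurrent reader could observe the mapping.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_PAGE_SHIFT	12
#define TOY_MAX_FOLIOS	8

/* Toy stand-in for the bdev pagecache; not a kernel structure. */
struct toy_mapping {
	unsigned blkbits;			/* models inode->i_blkbits */
	unsigned folio_order[TOY_MAX_FOLIOS];	/* order of each cached folio */
	size_t nr_folios;
	bool frozen;				/* hypothetical "no new folios" gate */
};

/* Invariant: every cached folio is large enough to hold one block. */
static bool toy_mapping_consistent(const struct toy_mapping *m)
{
	for (size_t i = 0; i < m->nr_folios; i++)
		if (TOY_PAGE_SHIFT + m->folio_order[i] < m->blkbits)
			return false;
	return true;
}

/* Models a reader (e.g. blkid) adding a folio; refused while frozen. */
static bool toy_add_folio(struct toy_mapping *m, unsigned order)
{
	if (m->frozen || m->nr_folios == TOY_MAX_FOLIOS)
		return false;
	m->folio_order[m->nr_folios++] = order;
	return true;
}

/* Current order: geometry changes first, the cache is emptied afterwards. */
static bool toy_set_blocksize_racy(struct toy_mapping *m, unsigned new_bits)
{
	bool ok;

	m->blkbits = new_bits;
	ok = toy_mapping_consistent(m);	/* what a racing reader could see */
	m->nr_folios = 0;		/* models kill_bdev() */
	return ok;
}

/* Proposed order: gate additions, empty the cache, then change geometry. */
static bool toy_set_blocksize_gated(struct toy_mapping *m, unsigned new_bits)
{
	bool ok;

	m->frozen = true;		/* lock out additions */
	m->nr_folios = 0;		/* models flush + invalidate */
	m->blkbits = new_bits;
	ok = toy_mapping_consistent(m);
	m->frozen = false;		/* allow additions again */
	return ok;
}
```

Starting from blkbits 12 with one order-0 folio cached, the racy variant
reports an inconsistent state when switching to blkbits 13, and the gated
variant does not.  The model says nothing about how to implement the gate
in the real bdev IO paths.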