On Mon 26-05-25 11:13:34, Thadeu Lima de Souza Cascardo wrote: > inline data handling has a race between writing and writing to a memory > map. > > When ext4_page_mkwrite is called, it calls ext4_convert_inline_data, which > destroys the inline data, but if block allocation fails, restores the > inline data. In that process, we could have: > > CPU1 CPU2 > destroy_inline_data > write_begin (does not see inline data) > restory_inline_data > write_end (sees inline data) > > This leads to bugs like the one below, as write_begin did not prepare for > the case of inline data, which is expected by the write_end side of it. > > ------------[ cut here ]------------ > kernel BUG at fs/ext4/inline.c:235! > Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI > CPU: 1 UID: 0 PID: 5838 Comm: syz-executor110 Not tainted 6.13.0-rc3-syzkaller-00209-g499551201b5f #0 > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024 > RIP: 0010:ext4_write_inline_data fs/ext4/inline.c:235 [inline] > RIP: 0010:ext4_write_inline_data_end+0xdc7/0xdd0 fs/ext4/inline.c:774 > Code: 47 1d 8c e8 4b 3a 91 ff 90 0f 0b e8 63 7a 47 ff 48 8b 7c 24 10 48 c7 c6 e0 47 1d 8c e8 32 3a 91 ff 90 0f 0b e8 4a 7a 47 ff 90 <0f> 0b 0f 1f 80 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 > RSP: 0018:ffffc900031c7320 EFLAGS: 00010293 > RAX: ffffffff8257f9a6 RBX: 000000000000005a RCX: ffff888012968000 > RDX: 0000000000000000 RSI: 000000000000005a RDI: 000000000000005b > RBP: ffffc900031c7448 R08: ffffffff8257ef87 R09: 1ffff11006806070 > R10: dffffc0000000000 R11: ffffed1006806071 R12: 000000000000005a > R13: dffffc0000000000 R14: ffff888076b65bd8 R15: 000000000000005b > FS: 00007f5c6bacf6c0(0000) GS:ffff8880b8700000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: 0000000020000a00 CR3: 0000000073fb6000 CR4: 0000000000350ef0 > Call Trace: > <TASK> > generic_perform_write+0x6f8/0x990 mm/filemap.c:4070 > ext4_buffered_write_iter+0xc5/0x350 fs/ext4/file.c:299 > ext4_file_write_iter+0x892/0x1c50 > iter_file_splice_write+0xbfc/0x1510 fs/splice.c:743 > do_splice_from fs/splice.c:941 [inline] > direct_splice_actor+0x11d/0x220 fs/splice.c:1164 > splice_direct_to_actor+0x588/0xc80 fs/splice.c:1108 > do_splice_direct_actor fs/splice.c:1207 [inline] > do_splice_direct+0x289/0x3e0 fs/splice.c:1233 > do_sendfile+0x564/0x8a0 fs/read_write.c:1363 > __do_sys_sendfile64 fs/read_write.c:1424 [inline] > __se_sys_sendfile64+0x17c/0x1e0 fs/read_write.c:1410 > do_syscall_x64 arch/x86/entry/common.c:52 [inline] > do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 > entry_SYSCALL_64_after_hwframe+0x77/0x7f > RIP: 0033:0x7f5c6bb18d09 > Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 b1 18 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48 > RSP: 002b:00007f5c6bacf218 EFLAGS: 00000246 ORIG_RAX: 0000000000000028 > RAX: ffffffffffffffda RBX: 00007f5c6bba0708 RCX: 00007f5c6bb18d09 > RDX: 0000000000000000 RSI: 0000000000000005 RDI: 0000000000000004 > RBP: 00007f5c6bba0700 R08: 0000000000000000 R09: 0000000000000000 > R10: 000080001d00c0d0 R11: 0000000000000246 R12: 00007f5c6bb6d620 > R13: 00007f5c6bb6d0c0 R14: 0031656c69662f2e R15: 8088e3ad122bc192 > </TASK> > Modules linked in: > ---[ end trace 0000000000000000 ]--- > > This happens because ext4_page_mkwrite is not protected by the inode_lock. > The xattr semaphore is not sufficient to protect inline data handling in a > sane way, so we need to rely on the inode_lock. Adding the inode_lock to > ext4_page_mkwrite is not an option, otherwise lock-ordering problems with > mmap_lock may arise. > > The conversion inside ext4_page_mkwrite was introduced at commit > 7b4cc9787fe3 ("ext4: evict inline data when writing to memory map"). This > fixes a documented bug in the commit message, which suggests some > alternative fixes. > > Convert inline data when mmap is called, instead of doing it only when the > mmapped page is written to. Using the inode_lock there does not lead to > lock-ordering issues. > > The drawback is that inline conversion will happen when the file is > mmapped, even though the page will not be written to. > > Fixes: 7b4cc9787fe3 ("ext4: evict inline data when writing to memory map") > Reported-by: syzbot+0c89d865531d053abb2d@xxxxxxxxxxxxxxxxxxxxxxxxx > Closes: https://syzkaller.appspot.com/bug?extid=0c89d865531d053abb2d > Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@xxxxxxxxxx> > Cc: stable@xxxxxxxxxxxxxxx > --- > Changes in v2: > - Convert inline data at mmap time, avoiding data loss. > - Link to v1: https://lore.kernel.org/r/20250519-ext4_inline_page_mkwrite-v1-1-865d9a62b512@xxxxxxxxxx > --- > fs/ext4/file.c | 6 ++++++ > fs/ext4/inode.c | 4 ---- > 2 files changed, 6 insertions(+), 4 deletions(-) > > diff --git a/fs/ext4/file.c b/fs/ext4/file.c > index beb078ee4811d6092e362e37307e7d87e5276cbc..f2380471df5d99500e49fdc639fa3e56143c328f 100644 > --- a/fs/ext4/file.c > +++ b/fs/ext4/file.c > @@ -819,6 +819,12 @@ static int ext4_file_mmap(struct file *file, struct vm_area_struct *vma) > if (!daxdev_mapping_supported(vma, dax_dev)) > return -EOPNOTSUPP; > > + inode_lock(inode); > + ret = ext4_convert_inline_data(inode); > + inode_unlock(inode); > + if (ret) > + return ret; > + > file_accessed(file); > if (IS_DAX(file_inode(file))) { > vma->vm_ops = &ext4_dax_vm_ops; So I would *love* to do this and was thinking about this as well. But the trouble is that this causes lock inversion as well because ->mmap callback is called with mmap_lock held and so we cannot acquire inode_lock here either. Recent changes which switch from ->mmap to ->mmap_prepare callback are actually going in a suitable direction but we'd need a rather larger rewrite to get from under mmap_lock and I'm not sure that's justified. Anyway, thanks for having a look! Honza -- Jan Kara <jack@xxxxxxxx> SUSE Labs, CR