On Wed 02-04-25 19:04:23, Luis Chamberlain wrote: > On Wed, Apr 02, 2025 at 02:58:28AM +0100, Matthew Wilcox wrote: > > On Tue, Apr 01, 2025 at 02:49:51PM -0700, Davidlohr Bueso wrote: > > > So the below could be tucked in for norefs only (because this is about the addr > > > space i_private_lock), but this also shortens the hold time; if that matters > > > at all, of course, vs changing the migration semantics. > > > > I like this approach a lot better. One wrinkle is that it doesn't seem > > that we need to set the BH_Migrate bit on every buffer; we could define > > that it's only set on the head BH, right? > > Yes, we are also only doing this for block devices, and for migration > purposes. Even though a bit from one buffer may be desirable it makes > no sense to allow for that in case migration is taking place. So indeed > we have no need to add the flag for all buffers. > > I think the remaining question is what users of __find_get_block_slow() > can really block, and well I've started trying to determine that with > coccinelle [0], its gonna take some more time. > > Perhaps its easier to ask, why would a block device mapping want to > allow __find_get_block_slow() to not block? So I've audited all callers of __find_get_block_slow() (there aren't that many) and most of them are actually fine with sleeping these days. Analysis: __find_get_block_slow() is only used from __find_get_block(). __find_get_block() is used from: write_boundary_block() - locks the buffer so can sleep bdev_getblk() - allocates buffers with 'gfp' mask. We use GFP_NOWAIT mask from some places (generally doing readahead). For callers where !gfpflags_allow_blocking() we should bail rather than block on migration. Callers are currently fine with this, we should probably document that bdev_getblk() with restrictive gfp mask may fail even if bh is present - or perhaps make this even more explicit in the API by providing bdev_try_getblk() and make bdev_getblk() assert gfp mask allows sleeping. __getblk_slow() - only called from bdev_getblk(). Probably should fold there. ocfs2_force_read_journal() - allows sleeping as it does IO jbd2_journal_revoke() - can sleep (has might_sleep() in the beginning) jbd2_journal_cancel_revoke() - only used from do_get_write_access() and do_get_create_access() which do sleep. So can sleep. jbd2_clear_buffer_revoked_flags() - only called from journal commit code which sleeps. So can sleep. The last user is sb_find_get_block() which is used from: hpfs_prefetch_sectors() - prefers bail rather than blocking fat_dir_readahead() - prefers bail rather than blocking exfat_dir_readahead() - prefers bail rather than blocking ext4_free_blocks() - can sleep ext4_getblk() - depending on EXT4_GET_BLOCKS_CACHED_NOWAIT flag either can sleep or must bail (and is fine with it) rather than sleeping fs/ext4/ialloc.c:recently_deleted() - this one is the most problematic place. It must bail rather than sleeping (called under a spinlock) but it depends on the fact that if bh is not returned, then the data has been written out and evicted from memory. Luckily, the usage of recently_deleted() is mostly an optimization to reduce damage in case of crash so rare false failure should be OK. Ted, what is your opinion? And this is actually all. So it seems that if we give possibility to callers to tell whether they want to bail or wait for migration, things should work out fine. Honza -- Jan Kara <jack@xxxxxxxx> SUSE Labs, CR