On 3/26/25 1:47 PM, Jan Kara wrote:
> On Wed 26-03-25 16:19:32, Matthew Wilcox wrote:
>> On Wed, Mar 26, 2025 at 11:55:22AM -0400, Theodore Ts'o wrote:
>>> On Wed, Mar 26, 2025 at 03:25:07PM +0000, Matthew Wilcox wrote:
>>>>
>>>> We've got three reports now (two are syzkaller kiddie stuff, but
>>>> one's a real workload) of a warning in the page allocator from
>>>> filesystems doing reclaim. Essentially they're using GFP_NOFAIL from
>>>> reclaim context. This got me thinking about bs>PS, and I realised
>>>> that if we fix this, then we're going to end up trying to do
>>>> high-order GFP_NOFAIL allocations in the memory reclaim path, and
>>>> that is really no bueno.
>>>>
>>>> https://lore.kernel.org/linux-mm/20250326105914.3803197-1-matt@xxxxxxxxxxxxxxxx/
>>>>
>>>> I'll prepare a better explainer of the problem in advance of this.
>>>
>>> Thanks for proposing this as a last-minute LSF/MM topic!
>>>
>>> I was looking at this myself, and was going to reply to the mail
>>> thread above, but I'll do it here.
>>>
>>> From my perspective, the problem is that as part of memory reclaim,
>>> there is an attempt to shrink the inode cache, and there are cases
>>> where an inode's refcount was elevated (for example, because it was
>>> referenced by a dentry), and when the dentry gets flushed, the inode
>>> can then get evicted. But if the inode is one that has been deleted,
>>> then at eviction time the file system will try to release the blocks
>>> associated with the deleted file. This operation will require memory
>>> allocation, potential I/O, and perhaps waiting for a journal
>>> transaction to complete.
>>>
>>> So basically, there is a class of inodes that, if we are in reclaim,
>>> we should probably skip trying to evict, because there are very
>>> likely other inodes whose eviction is more likely to release memory
>>> expeditiously.
>>> And if we take a look at inode_lru_isolate(), there's logic there
>>> already about when inodes should be skipped for eviction. It's
>>> probably just a matter of adding some additional conditions there.
>>
>> This is a helpful way of looking at the problem. I was looking at the
>> problem further down, where we've already entered evict_inode(). At
>> that point we can't fail. My proposal was going to be that the
>> filesystem pin the metadata that it would need to modify in order to
>> evict the inode. But avoiding entering evict_inode() is even better.
>>
>> However, I can't see how inode_lru_isolate() can know whether (looking
>> at the three reports):
>>
>> - the ext4 inode table has been reclaimed, and ext4 would need to
>>   allocate memory in order to reload the table from disc in order to
>>   evict this inode
>> - the ext4 block bitmap has been reclaimed, and ext4 would need to
>>   allocate memory in order to reload the bitmap from disc to discard
>>   the preallocation
>> - the fat cluster information has been reclaimed, and fat would need
>>   to allocate memory in order to reload the cluster from disc to
>>   update the cluster information
>
> Well, I think Ted was speaking about a more "big hammer" approach, like
> adding:
>
> 	if (current->flags & PF_MEMALLOC && !inode->i_nlink) {
> 		spin_unlock(&inode->i_lock);
> 		return LRU_SKIP;
> 	}
>
> to inode_lru_isolate(). But the problem isn't with inode_lru_isolate()
> here, as far as I'm reading the stacktrace. We are scanning the
> *dentry* LRU list, killing the dentry, which drops the last reference
> to the inode, and iput() then ends up doing all the deletion work. So
> we would have to avoid dropping a dentry from the LRU if
> dentry->d_inode->i_nlink == 0, and that frankly seems a bit silly to
> me.
>
>> So maybe it makes sense for ->evict_inode() to change from void to
>> being able to return an errno, and then change the filesystems to not
>> set GFP_NOFAIL, and instead just decline to evict the inode.
>
> So this would help somewhat, but inode deletion is a *heavy* operation
> (you can be freeing gigabytes of blocks), so you may end up doing a lot
> of metadata IO through the journal, and deep in the bowels of the
> filesystem we are doing GFP_NOFAIL allocations anyway, because there's
> just no sane way to unroll what we've started. So I'm afraid that
> ->evict() doing GFP_NOFAIL allocations for inodes with
> inode->i_nlink == 0 is a fact of life that is very hard to change.