On 3/26/25 1:47 PM, Jan Kara wrote:
> On Wed 26-03-25 16:19:32, Matthew Wilcox wrote:
>> On Wed, Mar 26, 2025 at 11:55:22AM -0400, Theodore Ts'o wrote:
>>> On Wed, Mar 26, 2025 at 03:25:07PM +0000, Matthew Wilcox wrote:
>>>>
>>>> We've got three reports now (two are syzkaller kiddie stuff, but
>>>> one's a real workload) of a warning in the page allocator from
>>>> filesystems doing reclaim. Essentially they're using GFP_NOFAIL from
>>>> reclaim context. This got me thinking about bs>PS, and I realised
>>>> that if we fix this, then we're going to end up trying to do
>>>> high-order GFP_NOFAIL allocations in the memory reclaim path, and
>>>> that is really no bueno.
>>>>
>>>> https://lore.kernel.org/linux-mm/20250326105914.3803197-1-matt@xxxxxxxxxxxxxxxx/
>>>>
>>>> I'll prepare a better explainer of the problem in advance of this.
>>>
>>> Thanks for proposing this as a last-minute LSF/MM topic!
>>>
>>> I was looking at this myself, and was going to reply to the mail
>>> thread above, but I'll do it here.
>>>
>>> From my perspective, the problem is that as part of memory reclaim,
>>> there is an attempt to shrink the inode cache, and there are cases
>>> where an inode's refcount was elevated (for example, because it was
>>> referenced by a dentry), and when the dentry gets flushed, the inode
>>> can then get evicted. But if the inode is one that has been deleted,
>>> then at eviction time the file system will try to release the blocks
>>> associated with the deleted file. This operation will require memory
>>> allocation, potential I/O, and perhaps waiting for a journal
>>> transaction to complete.
>>>
>>> So basically, there is a class of inodes that, if we are in reclaim,
>>> we should probably skip trying to evict, because there are very
>>> likely other inodes whose eviction is more likely to release memory
>>> expeditiously.
>>> And if we take a look at inode_lru_isolate(), there's logic there
>>> already about when inodes should be skipped for eviction. It's
>>> probably just a matter of adding some additional conditions there.
>>
>> This is a helpful way of looking at the problem. I was looking at the
>> problem further down, where we've already entered evict_inode(). At
>> that point we can't fail. My proposal was going to be that the
>> filesystem pin the metadata that it would need to modify in order to
>> evict the inode. But avoiding entering evict_inode() is even better.
>>
>> However, I can't see how inode_lru_isolate() can know whether (looking
>> at the three reports):
>>
>> - the ext4 inode table has been reclaimed, and ext4 would need to
>>   allocate memory in order to reload the table from disc in order to
>>   evict this inode
>> - the ext4 block bitmap has been reclaimed, and ext4 would need to
>>   allocate memory in order to reload the bitmap from disc to discard
>>   the preallocation
>> - the fat cluster information has been reclaimed, and fat would need
>>   to allocate memory in order to reload the cluster from disc to
>>   update the cluster information
>
> Well, I think Ted was speaking about a more "big hammer" approach, like
> adding:
>
> 	if (current->flags & PF_MEMALLOC && !inode->i_nlink) {
> 		spin_unlock(&inode->i_lock);
> 		return LRU_SKIP;
> 	}
>
> to inode_lru_isolate(). But the problem isn't with inode_lru_isolate()
> here, as far as I'm reading the stacktrace. We are scanning the
> *dentry* LRU list, killing the dentry, which drops the last reference
> to the inode, and iput() then ends up doing all the deletion work. So
> we would have to avoid dropping a dentry from the LRU if
> dentry->d_inode->i_nlink == 0, and that frankly seems a bit silly to
> me.
>
>> So maybe it makes sense for ->evict_inode() to change from void to
>> being able to return an errno, and then change the filesystems to not
>> set GFP_NOFAIL, and instead just decline to evict the inode.
>
> So this would help somewhat, but inode deletion is a *heavy* operation
> (you can be freeing gigabytes of blocks), so you may end up doing a lot
> of metadata IO through the journal, and deep in the bowels of the
> filesystem we are doing GFP_NOFAIL allocations anyway, because there's
> just no sane way to unroll what we've started. So I'm afraid that
> ->evict() doing GFP_NOFAIL allocations for inodes with
> inode->i_nlink == 0 is a fact of life that is very hard to change.