Re: [RFC] Another take at restarting FUSE servers

Bernd Schubert <bernd@xxxxxxxxxxx> · Fri, 12 Sep 2025 17:20:58 +0200

On 9/12/25 16:58, Darrick J. Wong wrote:
> On Fri, Sep 12, 2025 at 02:29:03PM +0200, Bernd Schubert wrote:
>>
>>
>> On 9/12/25 13:41, Amir Goldstein wrote:
>>> On Fri, Sep 12, 2025 at 12:31 PM Bernd Schubert <bernd@xxxxxxxxxxx> wrote:
>>>>
>>>>
>>>>
>>>> On 8/1/25 12:15, Luis Henriques wrote:
>>>>> On Thu, Jul 31 2025, Darrick J. Wong wrote:
>>>>>
>>>>>> On Thu, Jul 31, 2025 at 09:04:58AM -0400, Theodore Ts'o wrote:
>>>>>>> On Tue, Jul 29, 2025 at 04:38:54PM -0700, Darrick J. Wong wrote:
>>>>>>>>
>>>>>>>> Just speaking for fuse2fs here -- that would be kinda nifty if libfuse
>>>>>>>> could restart itself.  It's unclear if doing so will actually enable us
>>>>>>>> to clear the condition that caused the failure in the first place, but I
>>>>>>>> suppose fuse2fs /does/ have e2fsck -fy at hand.  So maybe restarts
>>>>>>>> aren't totally crazy.
>>>>>>>
>>>>>>> I'm trying to understand what the failure scenario is here.  Is this
>>>>>>> if the userspace fuse server (i.e., fuse2fs) has crashed?  If so, what
>>>>>>> is supposed to happen with respect to open files, metadata and data
>>>>>>> modifications which were in transit, etc.?  Sure, fuse2fs could run
>>>>>>> e2fsck -fy, but if there are dirty inode on the system, that's going
>>>>>>> potentally to be out of sync, right?
>>>>>>>
>>>>>>> What are the recovery semantics that we hope to be able to provide?
>>>>>>
>>>>>> <echoing what we said on the ext4 call this morning>
>>>>>>
>>>>>> With iomap, most of the dirty state is in the kernel, so I think the new
>>>>>> fuse2fs instance would poke the kernel with FUSE_NOTIFY_RESTARTED, which
>>>>>> would initiate GETATTR requests on all the cached inodes to validate
>>>>>> that they still exist; and then resend all the unacknowledged requests
>>>>>> that were pending at the time.  It might be the case that you have to
>>>>>> that in the reverse order; I only know enough about the design of fuse
>>>>>> to suspect that to be true.
>>>>>>
>>>>>> Anyhow once those are complete, I think we can resume operations with
>>>>>> the surviving inodes.  The ones that fail the GETATTR revalidation are
>>>>>> fuse_make_bad'd, which effectively revokes them.
>>>>>
>>>>> Ah! Interesting, I have been playing a bit with sending LOOKUP requests,
>>>>> but probably GETATTR is a better option.
>>>>>
>>>>> So, are you currently working on any of this?  Are you implementing this
>>>>> new NOTIFY_RESTARTED request?  I guess it's time for me to have a closer
>>>>> look at fuse2fs too.
>>>>
>>>> Sorry for joining the discussion late, I was totally occupied, day and
>>>> night. Added Kevin to CC, who is going to work on recovery on our
>>>> DDN side.
>>>>
>>>> Issue with GETATTR and LOOKUP is that they need a path, but on fuse
>>>> server restart we want kernel to recover inodes and their lookup count.
>>>> Now inode recovery might be hard, because we currently only have a
>>>> 64-bit node-id - which is used my most fuse application as memory
>>>> pointer.
>>>>
>>>> As Luis wrote, my issue with FUSE_NOTIFY_RESEND is that it just re-sends
>>>> outstanding requests. And that ends up in most cases in sending requests
>>>> with invalid node-IDs, that are casted and might provoke random memory
>>>> access on restart. Kind of the same issue why fuse nfs export or
>>>> open_by_handle_at doesn't work well right now.
>>>>
>>>> So IMHO, what we really want is something like FUSE_LOOKUP_FH, which
>>>> would not return a 64-bit node ID, but a max 128 byte file handle.
>>>> And then FUSE_REVALIDATE_FH on server restart.
>>>> The file handles could be stored into the fuse inode and also used for
>>>> NFS export.
>>>>
>>>> I *think* Amir had a similar idea, but I don't find the link quickly.
>>>> Adding Amir to CC.
>>>
>>> Or maybe it was Miklos' idea. Hard to keep track of this rolling thread:
>>> https://lore.kernel.org/linux-fsdevel/CAJfpegvNZ6Z7uhuTdQ6quBaTOYNkAP8W_4yUY4L2JRAEKxEwOQ@xxxxxxxxxxxxxx/
>>
>> Thanks for the reference Amir! I even had been in that thread.
>>
>>>
>>>>
>>>> Our short term plan is to add something like FUSE_NOTIFY_RESTART, which
>>>> will iterate over all superblock inodes and mark them with fuse_make_bad.
>>>> Any objections against that?
> 
> What if you actually /can/ reuse a nodeid after a restart?  Consider
> fuse4fs, where the nodeid is the on-disk inode number.  After a restart,
> you can reconnect the fuse_inode to the ondisk inode, assuming recovery
> didn't delete it, obviously.
> 
> I suppose you could just ask for refreshed stat information and either
> the server gives it to you and the fuse_inode lives; or the server
> returns ENOENT and then we mark it bad.  But I'd have to see code
> patches to form a real opinion.
> 
> It's very nice of fuse to have implemented revoke() ;)

Assuming you would run with an attr cache timeout equal 0 the existing
NOTIFY_RESEND would be enough for fuse4fs? 

Thanks,
Bernd