On Fri, Sep 12, 2025 at 02:29:03PM +0200, Bernd Schubert wrote:
> 
> 
> On 9/12/25 13:41, Amir Goldstein wrote:
> > On Fri, Sep 12, 2025 at 12:31 PM Bernd Schubert <bernd@xxxxxxxxxxx> wrote:
> >>
> >>
> >>
> >> On 8/1/25 12:15, Luis Henriques wrote:
> >>> On Thu, Jul 31 2025, Darrick J. Wong wrote:
> >>>
> >>>> On Thu, Jul 31, 2025 at 09:04:58AM -0400, Theodore Ts'o wrote:
> >>>>> On Tue, Jul 29, 2025 at 04:38:54PM -0700, Darrick J. Wong wrote:
> >>>>>>
> >>>>>> Just speaking for fuse2fs here -- that would be kinda nifty if libfuse
> >>>>>> could restart itself.  It's unclear if doing so will actually enable us
> >>>>>> to clear the condition that caused the failure in the first place, but I
> >>>>>> suppose fuse2fs /does/ have e2fsck -fy at hand.  So maybe restarts
> >>>>>> aren't totally crazy.
> >>>>>
> >>>>> I'm trying to understand what the failure scenario is here.  Is this
> >>>>> for when the userspace fuse server (i.e., fuse2fs) has crashed?  If so,
> >>>>> what is supposed to happen with respect to open files, and the metadata
> >>>>> and data modifications which were in transit, etc.?  Sure, fuse2fs
> >>>>> could run e2fsck -fy, but if there are dirty inodes on the system,
> >>>>> that's potentially going to be out of sync, right?
> >>>>>
> >>>>> What are the recovery semantics that we hope to be able to provide?
> >>>>
> >>>> <echoing what we said on the ext4 call this morning>
> >>>>
> >>>> With iomap, most of the dirty state is in the kernel, so I think the new
> >>>> fuse2fs instance would poke the kernel with FUSE_NOTIFY_RESTARTED, which
> >>>> would initiate GETATTR requests on all the cached inodes to validate
> >>>> that they still exist; and then resend all the unacknowledged requests
> >>>> that were pending at the time.  It might be the case that you have to do
> >>>> that in the reverse order; I only know enough about the design of fuse
> >>>> to suspect that to be true.
> >>>>
> >>>> Anyhow, once those are complete, I think we can resume operations with
> >>>> the surviving inodes.  The ones that fail the GETATTR revalidation are
> >>>> fuse_make_bad'd, which effectively revokes them.
> >>>
> >>> Ah!  Interesting, I have been playing a bit with sending LOOKUP requests,
> >>> but GETATTR is probably a better option.
> >>>
> >>> So, are you currently working on any of this?  Are you implementing this
> >>> new NOTIFY_RESTARTED request?  I guess it's time for me to have a closer
> >>> look at fuse2fs too.
> >>
> >> Sorry for joining the discussion late, I was totally occupied, day and
> >> night.  I've added Kevin to CC, who is going to work on recovery on our
> >> DDN side.
> >>
> >> The issue with GETATTR and LOOKUP is that they need a path, but on fuse
> >> server restart we want the kernel to recover inodes and their lookup
> >> counts.  Inode recovery might be hard, because we currently only have a
> >> 64-bit node-id -- which most fuse applications use as a memory pointer.
> >>
> >> As Luis wrote, my issue with FUSE_NOTIFY_RESEND is that it just re-sends
> >> outstanding requests.  In most cases that ends up sending requests with
> >> invalid node-IDs, which get cast to pointers and might provoke random
> >> memory accesses after a restart.  It's much the same issue that keeps
> >> fuse nfs export and open_by_handle_at from working well right now.
> >>
> >> So IMHO, what we really want is something like FUSE_LOOKUP_FH, which
> >> would return not a 64-bit node ID but a file handle of at most 128
> >> bytes, and then a FUSE_REVALIDATE_FH on server restart.  The file
> >> handles could be stored in the fuse inode and also used for NFS
> >> export.
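
Just to make sure I understand the shape of that -- is the wire format
you have in mind roughly the following?  A sketch only; every struct
and field name below is invented, none of this is in the fuse uapi
today:

	/* Hypothetical sketch only -- no such uapi exists yet. */

	#include <linux/fuse.h>		/* struct fuse_entry_out */

	#define FUSE_FH_MAX_SIZE	128

	/* Reply to FUSE_LOOKUP_FH: entry info plus an opaque handle. */
	struct fuse_lookup_fh_out {
		struct fuse_entry_out	entry;
		uint32_t		handle_len;	/* bytes used in handle[] */
		uint32_t		padding;
		uint8_t			handle[FUSE_FH_MAX_SIZE];
	};

	/*
	 * FUSE_REVALIDATE_FH request body: after a restart the kernel
	 * hands the stored handle back, and the server either reconnects
	 * its object (replying with fresh attrs) or fails with -ESTALE.
	 */
	struct fuse_revalidate_fh_in {
		uint64_t		nodeid;
		uint32_t		handle_len;
		uint32_t		padding;
		uint8_t			handle[FUSE_FH_MAX_SIZE];
	};

The nice property being that the kernel never interprets the handle;
it just stores the blob in the fuse inode and hands it back.
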
> >>
> >> I *think* Amir had a similar idea, but I can't find the link quickly.
> >> Adding Amir to CC.
> > 
> > Or maybe it was Miklos' idea.  Hard to keep track of this rolling thread:
> > https://lore.kernel.org/linux-fsdevel/CAJfpegvNZ6Z7uhuTdQ6quBaTOYNkAP8W_4yUY4L2JRAEKxEwOQ@xxxxxxxxxxxxxx/
> 
> Thanks for the reference, Amir!  I had even been in that thread.
> 
> >>
> >> Our short term plan is to add something like FUSE_NOTIFY_RESTART, which
> >> will iterate over all superblock inodes and mark them with fuse_make_bad.
> >> Any objections against that?

What if you actually /can/ reuse a nodeid after a restart?  Consider
fuse4fs, where the nodeid is the on-disk inode number.  After a
restart, you can reconnect the fuse_inode to the ondisk inode --
assuming recovery didn't delete it, obviously.

I suppose you could just ask for refreshed stat information, and
either the server gives it to you and the fuse_inode lives, or the
server returns ENOENT and then we mark it bad.  But I'd have to see
code patches to form a real opinion.  It's very nice of fuse to have
implemented revoke() ;)

--D

> > IDK, it seems much more ugly than implementing LOOKUP_HANDLE,
> > and I am not sure that LOOKUP_HANDLE is that hard to implement
> > compared to this alternative.
> > 
> > I mean, a restartable server is going to be a new implementation
> > anyway, right?  So it makes sense to start with a cleaner and more
> > adequate protocol, does it not?
> 
> Definitely -- if we agree on the LOOKUP_HANDLE approach and on using
> it for recovery, adding that op seems simple.  Reading through the
> thread you posted above, only the implementation was missing.  So
> let's go ahead with this approach.
> 
> 
> Thanks,
> Bernd
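
FWIW, the nodeid-reuse revalidation I was handwaving about above would
look roughly like this on the server side.  An untested sketch against
the libfuse lowlevel API; lookup_ondisk_inode() is a made-up stand-in
for whatever the filesystem really uses (e.g. ext2fs_read_inode() in
fuse2fs):

	#define FUSE_USE_VERSION 34

	#include <errno.h>
	#include <sys/stat.h>
	#include <fuse_lowlevel.h>

	/* Made-up helper: map an on-disk inode number to fresh attrs. */
	extern int lookup_ondisk_inode(fuse_ino_t ino, struct stat *st);

	/*
	 * GETATTR handler for a server whose nodeids are stable on-disk
	 * inode numbers (the fuse4fs case).  After a restart, the kernel
	 * re-sends GETATTR for every cached inode; we either confirm the
	 * inode survived recovery or return ENOENT so the kernel can
	 * fuse_make_bad() it.
	 */
	static void myfs_getattr(fuse_req_t req, fuse_ino_t ino,
				 struct fuse_file_info *fi)
	{
		struct stat st;

		(void)fi;
		if (lookup_ondisk_inode(ino, &st) != 0) {
			/* e2fsck -fy (or the crash) took it away; revoke. */
			fuse_reply_err(req, ENOENT);
			return;
		}

		st.st_ino = ino;
		fuse_reply_attr(req, &st, 1.0 /* attr timeout, seconds */);
	}

The same handler serves steady-state GETATTRs, of course; the restart
case just means the first wave of them doubles as revalidation.
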