On 9/12/25 16:58, Darrick J. Wong wrote:
> On Fri, Sep 12, 2025 at 02:29:03PM +0200, Bernd Schubert wrote:
>>
>>
>> On 9/12/25 13:41, Amir Goldstein wrote:
>>> On Fri, Sep 12, 2025 at 12:31 PM Bernd Schubert <bernd@xxxxxxxxxxx> wrote:
>>>>
>>>>
>>>>
>>>> On 8/1/25 12:15, Luis Henriques wrote:
>>>>> On Thu, Jul 31 2025, Darrick J. Wong wrote:
>>>>>
>>>>>> On Thu, Jul 31, 2025 at 09:04:58AM -0400, Theodore Ts'o wrote:
>>>>>>> On Tue, Jul 29, 2025 at 04:38:54PM -0700, Darrick J. Wong wrote:
>>>>>>>>
>>>>>>>> Just speaking for fuse2fs here -- that would be kinda nifty if libfuse
>>>>>>>> could restart itself.  It's unclear if doing so will actually enable us
>>>>>>>> to clear the condition that caused the failure in the first place, but I
>>>>>>>> suppose fuse2fs /does/ have e2fsck -fy at hand.  So maybe restarts
>>>>>>>> aren't totally crazy.
>>>>>>>
>>>>>>> I'm trying to understand what the failure scenario is here.  Is this
>>>>>>> if the userspace fuse server (i.e., fuse2fs) has crashed?  If so, what
>>>>>>> is supposed to happen with respect to open files, metadata and data
>>>>>>> modifications which were in transit, etc.?  Sure, fuse2fs could run
>>>>>>> e2fsck -fy, but if there are dirty inodes on the system, that's
>>>>>>> potentially going to be out of sync, right?
>>>>>>>
>>>>>>> What are the recovery semantics that we hope to be able to provide?
>>>>>>
>>>>>> <echoing what we said on the ext4 call this morning>
>>>>>>
>>>>>> With iomap, most of the dirty state is in the kernel, so I think the new
>>>>>> fuse2fs instance would poke the kernel with FUSE_NOTIFY_RESTARTED, which
>>>>>> would initiate GETATTR requests on all the cached inodes to validate
>>>>>> that they still exist; and then resend all the unacknowledged requests
>>>>>> that were pending at the time.  It might be the case that you have to do
>>>>>> that in the reverse order; I only know enough about the design of fuse
>>>>>> to suspect that to be true.
>>>>>>
>>>>>> Anyhow, once those are complete, I think we can resume operations with
>>>>>> the surviving inodes.  The ones that fail the GETATTR revalidation are
>>>>>> fuse_make_bad'd, which effectively revokes them.
>>>>>
>>>>> Ah! Interesting, I have been playing a bit with sending LOOKUP requests,
>>>>> but probably GETATTR is a better option.
>>>>>
>>>>> So, are you currently working on any of this?  Are you implementing this
>>>>> new NOTIFY_RESTARTED request?  I guess it's time for me to have a closer
>>>>> look at fuse2fs too.
>>>>
>>>> Sorry for joining the discussion late, I was totally occupied, day and
>>>> night.  Added Kevin to CC, who is going to work on recovery on our
>>>> DDN side.
>>>>
>>>> The issue with GETATTR and LOOKUP is that they need a path, but on fuse
>>>> server restart we want the kernel to recover inodes and their lookup
>>>> count.  Inode recovery might be hard, because we currently only have a
>>>> 64-bit node ID - which is used by most fuse applications as a memory
>>>> pointer.
>>>>
>>>> As Luis wrote, my issue with FUSE_NOTIFY_RESEND is that it just re-sends
>>>> outstanding requests.  In most cases that ends up sending requests with
>>>> invalid node IDs, which are cast back to pointers and might provoke
>>>> random memory access after a restart.  It is basically the same issue
>>>> that keeps fuse NFS export and open_by_handle_at from working well
>>>> right now.
>>>>
>>>> So IMHO, what we really want is something like FUSE_LOOKUP_FH, which
>>>> would not return a 64-bit node ID, but a file handle of at most 128
>>>> bytes.  And then FUSE_REVALIDATE_FH on server restart.
>>>> The file handles could be stored in the fuse inode and also used for
>>>> NFS export.
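
To flesh that proposal out a bit, this is roughly the wire format I have
in mind -- purely a sketch, all names and sizes are hypothetical and
nothing of this exists in the fuse ABI yet:

#include <stdint.h>
#include <linux/fuse.h>		/* struct fuse_entry_out */

/* Hypothetical: maximum size of a server-defined file handle. */
#define FUSE_MAX_FH_SIZE 128

/*
 * Hypothetical reply to FUSE_LOOKUP_FH: the usual lookup reply, plus an
 * opaque handle that stays valid across server restarts, unlike a
 * nodeid that is typically a cast memory pointer.
 */
struct fuse_entry_fh_out {
	struct fuse_entry_out	entry;		/* existing lookup reply */
	uint32_t		fh_len;		/* bytes used in fh[] */
	uint32_t		padding;
	uint8_t			fh[FUSE_MAX_FH_SIZE];
};

/*
 * Hypothetical FUSE_REVALIDATE_FH request: after a restart the kernel
 * hands the stored handle back, so that the new server instance can
 * re-establish the inode - or fail, so that the kernel can revoke it.
 */
struct fuse_revalidate_fh_in {
	uint32_t		fh_len;
	uint32_t		padding;
	uint8_t			fh[FUSE_MAX_FH_SIZE];
};

Since the handle would live in the fuse inode anyway, the same bytes
could later back NFS export and open_by_handle_at as well.
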
>>>>
>>>> I *think* Amir had a similar idea, but I can't find the link quickly.
>>>> Adding Amir to CC.
>>>
>>> Or maybe it was Miklos' idea.  Hard to keep track of this rolling thread:
>>> https://lore.kernel.org/linux-fsdevel/CAJfpegvNZ6Z7uhuTdQ6quBaTOYNkAP8W_4yUY4L2JRAEKxEwOQ@xxxxxxxxxxxxxx/
>>
>> Thanks for the reference, Amir!  I had even been in that thread.
>>
>>>
>>>> Our short term plan is to add something like FUSE_NOTIFY_RESTART, which
>>>> will iterate over all superblock inodes and mark them with
>>>> fuse_make_bad.  Any objections against that?
>
> What if you actually /can/ reuse a nodeid after a restart?  Consider
> fuse4fs, where the nodeid is the on-disk inode number.  After a restart,
> you can reconnect the fuse_inode to the on-disk inode, assuming recovery
> didn't delete it, obviously.
>
> I suppose you could just ask for refreshed stat information and either
> the server gives it to you and the fuse_inode lives; or the server
> returns ENOENT and then we mark it bad.  But I'd have to see code
> patches to form a real opinion.
>
> It's very nice of fuse to have implemented revoke() ;)

Assuming you run with an attr cache timeout of 0, would the existing
NOTIFY_RESEND be enough for fuse4fs?

Thanks,
Bernd
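
P.S. For the short-term FUSE_NOTIFY_RESTART plan quoted above, this is
roughly what I have in mind on the kernel side.  Untested sketch against
the fs/fuse internals (fuse_i.h); for simplicity it assumes a single
fuse_mount per connection and skips the I_FREEING/I_WILL_FREE handling
real code would need:

/* Called when the server announces FUSE_NOTIFY_RESTART: mark every
 * cached inode bad, so that stale nodeids are never sent to (and
 * possibly misinterpreted by) the restarted server. */
static void fuse_notify_restart(struct fuse_conn *fc)
{
	struct fuse_mount *fm;
	struct inode *inode;

	/* Assumption: one mount; real code would walk fc->mounts. */
	fm = list_first_entry(&fc->mounts, struct fuse_mount, fc_entry);

	spin_lock(&fm->sb->s_inode_list_lock);
	list_for_each_entry(inode, &fm->sb->s_inodes, i_sb_list) {
		if (get_node_id(inode) == FUSE_ROOT_ID)
			continue;	/* keep the mount point usable */
		fuse_make_bad(inode);	/* sets FUSE_I_BAD, i.e. revoke */
	}
	spin_unlock(&fm->sb->s_inode_list_lock);
}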
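
And regarding reusing nodeids in fuse4fs: agreed, the server side of
that revalidation model is essentially just the normal getattr handler.
A minimal libfuse low-level sketch, where lookup_ondisk_inode() is a
made-up placeholder for reading the inode from disk:

#define FUSE_USE_VERSION 35

#include <errno.h>
#include <sys/stat.h>
#include <fuse_lowlevel.h>

/* Hypothetical helper: fills *st from the on-disk inode, or fails. */
int lookup_ondisk_inode(fuse_ino_t ino, struct stat *st);

/* nodeid == on-disk inode number: after a restart, GETATTR on a
 * surviving inode simply succeeds; ENOENT lets the kernel mark the
 * cached inode bad (fuse_make_bad), i.e. revoke it. */
static void fs_getattr(fuse_req_t req, fuse_ino_t ino,
		       struct fuse_file_info *fi)
{
	struct stat st;

	(void)fi;
	if (lookup_ondisk_inode(ino, &st) != 0) {
		fuse_reply_err(req, ENOENT);	/* gone after recovery */
		return;
	}
	fuse_reply_attr(req, &st, 0.0);	/* attr timeout 0: always revalidate */
}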