Re: [nfsv4] Is NFSv4.2's clone_blksize per-file or per-file-system?

Rick Macklem <rick.macklem@xxxxxxxxx> · Sun, 10 Aug 2025 08:38:10 -0700

On Sun, Aug 10, 2025 at 8:27 AM Rick Macklem <rick.macklem@xxxxxxxxx> wrote:
>
> On Sun, Aug 10, 2025 at 7:52 AM Rick Macklem <rick.macklem@xxxxxxxxx> wrote:
> >
> > On Sun, Aug 10, 2025 at 7:32 AM Rick Macklem <rick.macklem@xxxxxxxxx> wrote:
> > >
> > > On Sun, Aug 10, 2025 at 6:58 AM David Noveck <davenoveck@xxxxxxxxx> wrote:
> > > >
> > > >
> > > >
> > > > On Sat, Aug 9, 2025 at 5:02 PM Rick Macklem <rick.macklem@xxxxxxxxx> wrote:
> > > >>
> > > >> On Sat, Aug 9, 2025 at 1:12 PM David Noveck <davenoveck@xxxxxxxxx> wrote:
> > > >> >
> > > >> >
> > > >> >
> > > >> > On Friday, August 8, 2025, Rick Macklem <rick.macklem@xxxxxxxxx> wrote:
> > > >> >>
> > > >> >> On Fri, Aug 8, 2025 at 8:38 PM Trond Myklebust <trondmy@xxxxxxxxx> wrote:
> > > >> >> >
> > > >> >> >
> > > >> >> >
> > > >> >> > On Fri, Aug 8, 2025 at 9:47 PM Rick Macklem <rick.macklem@xxxxxxxxx> wrote:
> > > >> >> >>
> > > >> >> >> Hi,
> > > >> >> >>
> > > >> >> >> I'm looking at RFC7862 and I cannot find where it
> > > >> >> >> states if the clone_blksize attribute is per-file or
> > > >> >> >> per-file-system.
> > > >> >> >>
> > > >> >> >> If it is not in the RFC, which do others think it is?
> > > >> >
> > > >> >
> > > >> > Before you told us about ZFS, I would have assumed per-fs.
> > > >> >
> > > >> > Given the uncertainty in the spec, you may wind up dealing with clients that assume it is per-fs.
> > > >> >
> > > >> > Although this is not a catastrophe, you might want to file an errata report explaining the negative consequences of assuming this is per-fs. It won't get into a spec for a long while, but it provides as much warning as you can give right now.
> > > >> >
> > > >> >
> > > >> >
> > > >> >>
> > > >> >> >> (Or maybe, if you have implemented CLONE,
> > > >> >> >> which does your implementation assume?)
> > > >> >> >>
> > > >> >> >> In case you are wondering why I am asking,
> > > >> >> >> it turns out that files in a ZFS volume can have
> > > >> >> >> different block sizes. (It can be changed after the
> > > >> >> >> file system is created.)
> > > >> >
> > > >> >
> > > >> > The guy who allowed that probably thinks it's a helpful feature.  Sigh!
> > > >> It's not just a feature change after creation, it turns out to be based
> > > >> on file size as well.  A small file gets 512 and a larger one gets a full record
> > > >> (128K on my test system).
> > > >>
> > > >> And, yes, block cloning requires alignment to 512 bytes or 128 Kbytes,
> > > >> depending on the file.
> > > >>
> > > >> I can return 128K for clone_blksize and that will (sub-optimally) handle
> > > >> the 512-byte case, but I think it is also possible to increase the record
> > > >> size from 128K-> after the file system has files in it.
> > > >>
> > > >> I'll take a look at the Linux client to try and see if/how it uses
> > > >> clone_blksize.  I need to decide if I should always return 128K
> > > >> (or whatever the full recordsize is) or 512 for the small files.
> > > >
> > > >
> > > > I don't see the point of returning anything but 128K given what you said above.
> > > > If a file has to be smaller than 512 to merit the 512 block size, it could still be cloned with a 128K clone_blksize. The spec makes an exception for the last block of a file being shorter than the block size, so returning a 512-byte clone_blksize is unnecessary.
> > > I'll be experimenting with it soon.
> > > What I do not know (you could write what I know about ZFS on a
> > > postage stamp;-) is whether the blksize for a file changes as it
> > > grows.
> > > --> So the problem is a file might get 512 because it is small when
> > >      first created and then grow large. Again, I do not currently know
> > >      what determines the blksize. Whether it is the first write being less
> > >      than a record size when created or maybe it does switch to recordsize
> > >      (128K in my case) when it grows beyond 128K or ???
> > >      - I do know that ZFS allocates new blocks whenever data is written
> > >        to a file, even if the file is not growing. (Which is why it cannot
> > >        support ALLOCATE at this time and probably never will.)
> > >
> > > I'll be poking at it. For now, I just do not know, rick
> > I should have done a scan before posting.
> > I just ran a little program that printed out the blksize of every
> > regular file in a ZFS file system.
> > It turns out that the blksize is any exact multiple of 512 up to
> > 128K (the record size for the volume).
> > Since most are C sources or objects, most are less than 128K.
> >
> > If I return 128K, then most files would not be CLONEable unless
> > the CLONE is for the entire file.
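
A scan like the one described above could be sketched roughly as follows. This is not the program that was actually run; it is a minimal Python sketch that assumes the per-file block size is what stat() reports in st_blksize, and the function name blksize_histogram is hypothetical.

```python
import os
import stat
import sys
from collections import Counter


def blksize_histogram(root):
    """Count regular files under root, grouped by their reported st_blksize."""
    counts = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat so that symlinks are not followed and not counted
                st = os.lstat(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if stat.S_ISREG(st.st_mode):
                # Assumption: st_blksize reflects the file's own block size
                counts[st.st_blksize] += 1
    return counts


if __name__ == "__main__":
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    for blksize, nfiles in sorted(blksize_histogram(root).items()):
        print(f"{blksize:>8} {nfiles}")
```

Run against the root of a file system, it prints one line per distinct block size with the number of regular files reporting it.
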
> It appears that your suggestion of 128K is correct for ZFS.
> I am still not sure, but it appears that, for files up to 128K,
> the files are a single block (which is any multiple of 512).
> --> As such, only the entire small file can be cloned.
>
> So, returning 128K for all files in the file system seems like
> it will be the correct choice.
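
To make the consequence of returning 128K concrete, here is a hedged Python sketch of the alignment check as I read the CLONE rules in RFC 7862: both offsets must be multiples of clone_blksize, and the byte count must be a multiple too unless the range ends exactly at the end of the source file. The function name and parameters are hypothetical, and the sketch only models an explicit positive count.

```python
def clone_range_ok(src_offset, dst_offset, count, src_size, clone_blksize):
    """Return True if a CLONE of this range satisfies the alignment rules.

    Assumption: the rules are as summarized in the lead-in; this is an
    illustration, not a server or client implementation.
    """
    if clone_blksize <= 0 or count <= 0:
        raise ValueError("clone_blksize and count must be positive")
    # Both offsets must sit on a clone block boundary.
    if src_offset % clone_blksize or dst_offset % clone_blksize:
        return False
    # The range must lie within the source file.
    if src_offset + count > src_size:
        return False
    # The length must be block aligned, except when the range ends at EOF.
    return count % clone_blksize == 0 or src_offset + count == src_size


BLK = 128 * 1024
print(clone_range_ok(0, 0, 3000, 3000, BLK))  # whole small file
print(clone_range_ok(0, 0, 1024, 3000, BLK))  # partial range of a small file
```

With a 128K clone_blksize, the first call passes because the range covers the whole 3000-byte file, while the second fails because 1024 bytes is neither block aligned nor ending at EOF.
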
>
> It still leaves the per-filesystem vs per-server question
> since (if I read it correctly) the Linux client uses clone_blksize
> per-server (and not per-server file system).
Actually, there's a good chance I got this wrong. I recall that
the Linux client creates a separate "mount" that shows up in
places like "df" for every server file system.
So, it is fairly likely that the Linux client tracks clone_blksize per file system.

Maybe someone like Trond can clarify this w.r.t. the Linux client?

rick

>
> I do not think per-server is the correct choice, since different
> file systems on a server could have different block sizes.
>
> rick
>
> > Of course, I do not currently know how clients actually use
> > clone_blksize either. (Do they check alignment using it before
> > doing a CLONE or ???)
> >
> > I'll be playing around with CLONE for both FreeBSD and Linux
> > in the coming days.
> > I'll post if/when I have useful info, rick
> >
> > >
> > >
> > > >>
> > > >>
> > > >> Thanks for the comments, rick
> > > >>
> > > >> >
> > > >> >> >>
> > > >> >
> > > >> >
> > > >> >>
> > > >> >> >> Thanks, rick
> > > >> >> >>
> > > >> >> >
> > > >> >> > Yes, but since ZFS only supports filesystem level snapshots, and not actual file cloning, does that matter to anything?
> > > >> >> ZFS now has a feature it calls block cloning, which does clone file ranges.
> > > >> >> (It was only added recently. I do not know if the Linux port uses it yet?)
> > > >> >>
> > > >> >> rick
> > > >> >>
> > > >> >> >
> > > >> >> > Cheers
> > > >> >> >   Trond
> > > >> >>