Re: [nfsv4] Is NFSv4.2's clone_blksize per-file or per-file-system?

Rick Macklem <rick.macklem@xxxxxxxxx> · Sun, 10 Aug 2025 07:32:16 -0700

On Sun, Aug 10, 2025 at 6:58 AM David Noveck <davenoveck@xxxxxxxxx> wrote:
>
>
>
> On Sat, Aug 9, 2025 at 5:02 PM Rick Macklem <rick.macklem@xxxxxxxxx> wrote:
>>
>> On Sat, Aug 9, 2025 at 1:12 PM David Noveck <davenoveck@xxxxxxxxx> wrote:
>> >
>> >
>> >
>> > On Friday, August 8, 2025, Rick Macklem <rick.macklem@xxxxxxxxx> wrote:
>> >>
>> >> On Fri, Aug 8, 2025 at 8:38 PM Trond Myklebust <trondmy@xxxxxxxxx> wrote:
>> >> >
>> >> >
>> >> >
>> >> > On Fri, Aug 8, 2025 at 9:47 PM Rick Macklem <rick.macklem@xxxxxxxxx> wrote:
>> >> >>
>> >> >> Hi,
>> >> >>
>> >> >> I'm looking at RFC7862 and I cannot find where it
>> >> >> states if the clone_blksize attribute is per-file or
>> >> >> per-file-system.
>> >> >>
>> >> >> If it is not in the RFC, which do others think it is?
>> >
>> >
>> > Before you told us about ZFS, I would have assumed per-fs.
>> >
>> > Given the uncertainty in the spec, you may wind up dealing with clients that assume it is per-fs.
>> >
>> > Although this is not a catastrophe, you might want to file an errata report explaining the negative consequences of assuming this is per-fs. It won't get into a spec for a long while, but it does provide as much warning as you can right now.
>> >
>> >
>> >
>> >>
>> >> >> (Or maybe, if you have implemented CLONE,
>> >> >> which does your implementation assume?)
>> >> >>
>> >> >> In case you are wondering why I am asking,
>> >> >> it turns out that files in a ZFS volume can have
>> >> >> different block sizes. (It can be changed after the
>> >> >> file system is created.)
>> >
>> >
>> > The guy who allowed that probably thinks it's a helpful feature.  Sigh!
>> It's not just that it can be changed after creation; it turns out to be based
>> on file size as well.  A small file gets 512 and a larger one gets a full record
>> (128K on my test system).
>>
>> And, yes, block cloning requires alignment to 512 bytes or 128 Kbytes,
>> depending on the file.
>>
>> I can return 128K for clone_blksize and that will (sub-optimally) handle
>> the 512-byte case, but I think it is also possible to increase the record
>> size from 128K-> after the file system has files in it.
>>
>> I'll take a look at the Linux client to try and see if/how it uses
>> clone_blksize.  I need to decide if I should always return 128K
>> (or whatever the full recordsize is) or 512 for the small files.
>
>
> I don't see the point of returning anything but 128K given what you said above.
> If a file has to be smaller than 512 to merit the 512 block size, it could still be cloned with a 128K clone_blksize.  The spec makes an exception for the last block of a file being shorter than the block size, so there is no need to return a 512-byte clone_blksize.
I'll be experimenting with it soon.
What I do not know (you could write what I know about ZFS on a
postage stamp;-) is whether the blksize for a file changes as it
grows.
--> So the problem is a file might get 512 because it is small when
     first created and then grow large. Again, I do not currently know
     what determines the blksize: whether it is the first write being less
     than a record size when the file is created, or whether it switches to
     recordsize (128K in my case) when it grows beyond 128K, or ???
     - I do know that ZFS allocates new blocks whenever data is written
       to a file, even if the file is not growing. (Which is why it cannot
       support ALLOCATE at this time and probably never will.)
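
A rough sketch of the alignment argument from the quoted discussion, in case it helps. It assumes the rule as described in this thread: offsets must be aligned to clone_blksize, and the length must be too, except when the range ends at the end of the source file. The function name clone_range_ok() and the constants are made up for illustration; they are not taken from ZFS or from any NFS server, and the special meaning of a zero count is not modelled.

```python
KIB = 1024
FULL_RECORD = 128 * KIB   # recordsize on the test system mentioned in the thread
SMALL_BLOCK = 512         # block size reported for small files


def clone_range_ok(src_off, dst_off, count, src_size, clone_blksize):
    """Check a clone range against clone_blksize (hypothetical helper).

    Both offsets must be multiples of clone_blksize. The count must be a
    multiple as well, unless the range ends exactly at the end of the
    source file (the "short last block" exception).
    """
    if clone_blksize <= 0:
        raise ValueError("clone_blksize must be positive")
    if src_off % clone_blksize or dst_off % clone_blksize:
        return False          # misaligned start
    if src_off + count > src_size:
        return False          # range runs past the end of the source
    if count % clone_blksize == 0:
        return True           # whole blocks only
    return src_off + count == src_size   # short final block is allowed


if __name__ == "__main__":
    # A 300-byte file cloned in full while the server advertises 128K.
    print(clone_range_ok(0, 0, 300, 300, FULL_RECORD))
    # A 128K-aligned range also satisfies a 512-byte requirement.
    print(clone_range_ok(FULL_RECORD, FULL_RECORD, FULL_RECORD,
                         1024 * KIB, SMALL_BLOCK))
    # A 512-aligned range in the middle of a file fails against 128K.
    print(clone_range_ok(512, 512, 512, 1024 * KIB, FULL_RECORD))
```

Under these assumptions, advertising the larger value never produces a range that is misaligned for a 512-byte file, since 128K is a multiple of 512. The cost is that 512-aligned ranges in the middle of a file are rejected, which is the sub-optimal part.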

I'll be poking at it. For now, I just do not know, rick

>>
>>
>> Thanks for the comments, rick
>>
>> >
>> >> >>
>> >
>> >
>> >>
>> >> >> Thanks, rick
>> >> >>
>> >> >
>> >> > Yes, but since ZFS only supports filesystem level snapshots, and not actual file cloning, does that matter to anything?
>> >> ZFS now has a feature it calls block cloning, which does clone file ranges.
>> >> (It was only added recently. I do not know if the Linux port uses it yet.)
>> >>
>> >> rick
>> >>
>> >> >
>> >> > Cheers
>> >> >   Trond
>> >>
>> >> _______________________________________________
>> >> nfsv4 mailing list -- nfsv4@xxxxxxxx
>> >> To unsubscribe send an email to nfsv4-leave@xxxxxxxx