Re: [PATCH v2 7/7] Use simple_start_creating() in various places.

Al Viro <viro@xxxxxxxxxxxxxxxxxx> · Wed, 10 Sep 2025 19:28:37 +0100

On Wed, Sep 10, 2025 at 07:54:25AM -0400, Jeff Layton wrote:

> I'm having a hard time finding where that is defined in the POSIX
> specs. The link count for normal files is fairly well-defined. The link
> count for directories has always been more nebulous.

Try UNIX textbooks...  <pulls some out> in Stevens that would be 4.14
and I'm pretty sure that it's covered in other standard ones.

History of that thing: on traditional filesystem "." and ".." are real
directory entries, "." referring to the inode of directory itself and
".." to that of parent (if it's not a root directory) or to directory
itself (if it is root).  Link count is literally the number of directory
entries pointing to given inode.  For directories that boils down to 2
if empty ("." + reference in parent or ".." for non-root and root resp.),
then each subdirectory adds 1 (".." in it points back to ours).

That goes for any local UNIX filesystem, and there hadn't been anything
else until RFS and NFS, both keeping the same rules (both coming from
the underlying filesystem layout).  Try mkdir and ln -s on NFS, watch
what's happening to stat of parent.

The things got murkier for FAT, when its support got added - on-disk
layout has nothing like a link count.  It has only one directory entry
for any non-directory *and* directory entries have object type of the
thing they are pointing to, so one can match the behaviour of normal UNIX
filesystem by going through the directory contents when reading an inode;
the cost is not particularly high, so that's what everyone did.

For original isofs (i.e. no rockridge extensions, no ownership, etc.)
that was considerably harder; I don't remember what SunOS hsfs had done
there (thankfully), our implementation started with "just slap 2 for
directories if RR is not there", but that switched to "we don't know the
exact answer, so use 1 to indicate that" pretty early <checks> 1.1.40 -
Aug '94; original isofs implementation went into the tree in Dec '92.

More complications came when... odd people came complaining about the
overflows when they got 65534 subdirectories in the same directory.
That had two sides to it - on-disk inode layout and userland ABI.
For the latter the long-term solution was to make st_nlink 32bit
(in newer variant of stat(2) if needed) and fail with EOVERFLOW if the
value doesn't fit into the ABI you are trying to use.  For the former...
some weird kludges followed, with the things eventually settling down
on "if you can't manage the expected value, at least report something
that couldn't be confused for it".  Since 1 is normally impossible for
a directory, that turned into "can't tell you how many links are there".
That covered both "we don't have enough bits in the on-disk field" and
"we don't have that field on disk at all and can't be bothered calculating
it" (as in iso9660 case above).
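
A minimal sketch of that "report 1 if the real value can't be stored"
policy, purely for illustration.  The function name and the field_max
parameter are made up here; this is not lifted from any particular
filesystem:

```c
#include <stdint.h>

/*
 * Hypothetical helper: pick the link count to report for a directory
 * with 'subdirs' subdirectories when the field that has to hold it
 * tops out at 'field_max'.
 */
uint32_t dir_nlink_to_report(uint64_t subdirs, uint32_t field_max)
{
    uint64_t exact = subdirs + 2;   /* "." plus the reference to us */

    /* 1 is never a valid exact count, so it can stand for "unknown" */
    return exact <= field_max ? (uint32_t)exact : 1;
}
```

With a 16bit field (field_max of 65535) 65533 subdirectories still fit,
and the 65534th is where the exact value stops being representable.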

Of course, for e.g. NFS the value we report is whatever the server
tells us; nobody is going to have the client readdirplus the entire
directory and count subdirectories in it just to check if server lies
and is inconsistent at that.  But that's not really different from the
situation with local filesystem - we assume that the count in on-disk
inode matches the number of directory entries pointing to it.

The find(1) (well, tree-walkers in general, really) thing Neil has
mentioned is that on filesystems where readdir(3) gives you no reliable
dirent->d_type you need to stat every entry in order to decide whether
it's a subdirectory you would need to walk into.  Being able to tell
"this directory has no subdirectories" allows skipping those stat(2) calls
when going through it.  Same for "I've already seen 5 subdirectories,
stat on our directory has reported st_nlink being 7, so we'd already
seen all subdirectories here; no need to stat(2) further entries", for
that matter...  On a sane filesystem you'd just look for entries with
->d_type == DT_DIR and skip all those stat(2).
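
Roughly what such a tree-walker does for a single directory, as a sketch
only.  count_subdirs() and all_subdirs_seen() are names invented for
this example, and real find(1)/fts(3) code is considerably more careful:

```c
#define _DEFAULT_SOURCE         /* d_type and DT_* on glibc */
#include <dirent.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>

/*
 * Hypothetical helper: have we already found every subdirectory the
 * parent's st_nlink promised?  Only meaningful for st_nlink >= 2.
 */
int all_subdirs_seen(nlink_t nlink, long seen)
{
    return nlink >= 2 && seen >= (long)(nlink - 2);
}

/*
 * Count subdirectories of 'path'.  *stat_calls is set to the number of
 * per-entry stat calls that could not be avoided.  Returns -1 on error.
 */
long count_subdirs(const char *path, long *stat_calls)
{
    struct stat dir_st, st;
    struct dirent *de;
    long seen = 0;
    DIR *d = opendir(path);

    *stat_calls = 0;
    if (!d)
        return -1;
    if (fstat(dirfd(d), &dir_st) < 0) {
        closedir(d);
        return -1;
    }
    while ((de = readdir(d)) != NULL) {
        if (!strcmp(de->d_name, ".") || !strcmp(de->d_name, ".."))
            continue;
        if (de->d_type != DT_UNKNOWN) {
            /* readdir told us the type, no stat needed */
            if (de->d_type == DT_DIR)
                seen++;
            continue;
        }
        if (all_subdirs_seen(dir_st.st_nlink, seen))
            break;              /* whatever is left can't be a directory */
        (*stat_calls)++;
        if (fstatat(dirfd(d), de->d_name, &st, AT_SYMLINK_NOFOLLOW) == 0 &&
            S_ISDIR(st.st_mode))
            seen++;
    }
    closedir(d);
    return seen;
}
```

Note that with st_nlink equal to 1 the shortcut never triggers, so every
entry of unknown type still gets its stat.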

IOW, the real rules are
	* st_nlink >= 2: st_nlink - 2 subdirectories
	* st_nlink = 1: refused to report the number of subdirectories
	* st_nlink = 0: fstat on something that had been removed
In case of a corrupted filesystem, bullshitting server, etc. the result
might have no relationship to reality, of course.
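
Spelled out in C, as a sketch; the function name and the two negative
return codes are made up for the example and the argument is assumed to
be st_nlink of a directory:

```c
#include <sys/types.h>

#define SUBDIRS_UNKNOWN (-1L)   /* st_nlink == 1: filesystem won't say */
#define DIR_REMOVED     (-2L)   /* st_nlink == 0: already unlinked */

/*
 * Hypothetical helper: turn a directory's st_nlink into the number of
 * subdirectories, or one of the negative codes above.
 */
long subdirs_from_nlink(nlink_t nlink)
{
    if (nlink == 0)
        return DIR_REMOVED;
    if (nlink == 1)
        return SUBDIRS_UNKNOWN;
    return (long)(nlink - 2);   /* drop "." and the reference to us */
}
```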

I don't know of any case where creation of symlinks in a directory would
affect the parent's link count.  Frankly, I thought that was just
an accidental cut'n'paste from __nfsd_mkdir()...  As long as nothing
in the userland is playing odd games with that st_nlink value, I'd say
we should remove the temptation to start doing that and return to the
usual semantics.