Re: [GIT PULL] bcachefs fixes for 6.15-rc4

Kent Overstreet <kent.overstreet@xxxxxxxxx> · Sun, 27 Apr 2025 23:01:20 -0400

On Sun, Apr 27, 2025 at 07:39:46PM -0700, Linus Torvalds wrote:
> On Sun, 27 Apr 2025 at 19:22, Eric Biggers <ebiggers@xxxxxxxxxx> wrote:
> >
> > I suspect that all that was really needed was case-insensitivity of ASCII a-z.
> 
> Yes. That's my argument. I think anything else ends up being a
> mistake. MAYBE extend it to the first 256 characters in Unicode (aka
> "Latin1").
> 
> Case folding on a-z is the only thing you could really effectively
> rely on in user space even in the DOS times, because different
> codepages would make for different rules for the upper 128 characters
> anyway, and you could be in a situation where you literally couldn't
> copy files from one floppy to another, because two files that had
> distinct names on one floppy would have the *same* name on another
> one.
> 
> Of course, that was mostly a weird corner case that almost nobody ever
> actually saw in practice, because very few people even used anything
> else than the default codepage.
> 
> And the same is afaik still true on NT, although practically speaking
> I suspect it went from "unusual" to "really doesn't happen EVER in
> practice".

I'm having trouble finding anything authoritative, but what I'm seeing
indicates that NTFS does do Unicode casefolding (and their own
incompatible version, at that).

> Extending those mistakes to full unicode and mixing in things like
> nonprinting codes and other things have only made things worse.
> 
> And dealing with things like ß and ss and trying to make those compare
> as equal is a *horrible* mistake. People who really need to do that
> (usually for some legalistic local reason) tend to have very specific
> rules for sorting anyway, and they are rules specific to particular
> situations, not something that the filesystem should even try to work
> with.

Well, casefolding is something that's directly exposed to users. So I do
think that if casefolding is going to exist at all, there is a strong
argument for it to be Unicode and to handle things like folding ß to ss.

(Can you imagine being the user that gets used to typing in filenames
and ignoring capitalization, except whenever an accented letter is part
of the filename, and then your muscle memory breaks? That sort of thing
is maddening.)
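
To make the difference concrete, here is a rough userspace sketch in Python
(not kernel code). ascii_fold() is a hypothetical helper that folds only A-Z,
and Python's str.casefold() stands in for full Unicode case folding; neither
is what any filesystem actually implements.

```python
def ascii_fold(name: str) -> str:
    # Hypothetical helper: fold only ASCII A-Z to a-z, leave everything else alone.
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in name)


def unicode_fold(name: str) -> str:
    # Stand-in for full Unicode case folding (Python's built-in str.casefold()).
    return name.casefold()


def same_name(fold, a: str, b: str) -> bool:
    # Two names refer to the same file if they fold to the same string.
    return fold(a) == fold(b)


# Plain ASCII names match under both rules.
print(same_name(ascii_fold, "README", "readme"))
print(same_name(unicode_fold, "README", "readme"))

# An accented capital (U+00C9) only matches under the Unicode rule.
print(same_name(ascii_fold, "\u00c9cole", "\u00e9cole"))
print(same_name(unicode_fold, "\u00c9cole", "\u00e9cole"))

# The sharp s (U+00DF) folds to "ss" only under the Unicode rule.
print(same_name(ascii_fold, "Stra\u00dfe", "STRASSE"))
print(same_name(unicode_fold, "Stra\u00dfe", "STRASSE"))
```

With ASCII-only folding the user gets case-insensitivity for "README" but not
for the accented name, which is the inconsistency described above.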

BUT:

I'm becoming more and more convinced that I want more separation between
casefolded lookups and non-casefolded lookups; the potential for
casefolding rule changes to break case-sensitive lookups is just bad.

If we do a "casefolding version 2" in bcachefs, we'll just have a
separate btree for casefolded dirents, and casefolded directories will
have their dirents indexed twice.

That's trivially extensible to multiple versions if - god forbid - we
ever end up needing to support multiple "locales", and more importantly
it'd let us support a mode where it's only certain pids that get
casefolded lookups, so you don't e.g. get casefolding dependencies
creeping into your makefiles as can happen today.
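
A toy model of the double-indexing idea, in Python rather than anything
resembling the real btree code. Every name here (Directory, fold_version, the
dict-based indexes) is made up for illustration, and rejecting a create that
collides in any folded index is my assumption, not a settled design.

```python
class Directory:
    """Toy directory: one exact-name index plus one index per folding version."""

    def __init__(self, folders):
        # folders maps a (hypothetical) folding version number to a fold function.
        self.folders = folders
        self.exact = {}                           # exact name -> inode number
        self.folded = {v: {} for v in folders}    # version -> {folded name -> exact name}

    def create(self, name, inode):
        if name in self.exact:
            raise FileExistsError(name)
        keys = {v: fold(name) for v, fold in self.folders.items()}
        # Assumption: refuse names that collide under any folding version.
        for v, key in keys.items():
            if key in self.folded[v]:
                raise FileExistsError(name)
        self.exact[name] = inode
        for v, key in keys.items():
            self.folded[v][key] = name

    def lookup(self, name, fold_version=None):
        # Case-sensitive lookups never touch the folded indexes, so a change
        # to the folding rules cannot break them.
        if fold_version is None:
            return self.exact.get(name)
        key = self.folders[fold_version](name)
        stored = self.folded[fold_version].get(key)
        return None if stored is None else self.exact[stored]


def lookup_for_pid(directory, pid, name, casefold_pids, fold_version):
    # Only processes that opted in (hypothetical set of pids) get folded lookups.
    version = fold_version if pid in casefold_pids else None
    return directory.lookup(name, version)


# Version 1 folds with str.lower, version 2 with full str.casefold.
d = Directory({1: str.lower, 2: str.casefold})
d.create("Makefile", 10)
d.create("Stra\u00dfe", 11)

print(d.lookup("makefile"))        # exact index only
print(d.lookup("makefile", 1))     # folded index, version 1
print(d.lookup("STRASSE", 1))      # str.lower keeps the sharp s
print(d.lookup("STRASSE", 2))      # str.casefold folds it to "ss"
print(lookup_for_pid(d, 4242, "makefile", casefold_pids={1000}, fold_version=2))
```

The point of the split is visible in lookup(): a build tool doing
case-sensitive lookups sees only the exact index, so a misspelled "makefile"
fails for it even inside a casefolded directory.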