On Sun, Apr 27, 2025 at 07:39:46PM -0700, Linus Torvalds wrote:
> On Sun, 27 Apr 2025 at 19:22, Eric Biggers <ebiggers@xxxxxxxxxx> wrote:
> >
> > I suspect that all that was really needed was case-insensitivity of ASCII a-z.
>
> Yes. That's my argument. I think anything else ends up being a
> mistake. MAYBE extend it to the first 256 characters in Unicode (aka
> "Latin1").
>
> Case folding on a-z is the only thing you could really effectively
> rely on in user space even in the DOS times, because different
> codepages would make for different rules for the upper 128 characters
> anyway, and you could be in a situation where you literally couldn't
> copy files from one floppy to another, because two files that had
> distinct names on one floppy would have the *same* name on another
> one.
>
> Of course, that was mostly a weird corner case that almost nobody ever
> actually saw in practice, because very few people even used anything
> else than the default codepage.
>
> And the same is afaik still true on NT, although practically speaking
> I suspect it went from "unusual" to "really doesn't happen EVER in
> practice".

I'm having trouble finding anything authoritative, but what I'm seeing
indicates that NTFS does do Unicode casefolding (and their own
incompatible version, at that).

> Extending those mistakes to full unicode and mixing in things like
> nonprinting codes and other things have only made things worse.
>
> And dealing with things like ß and ss and trying to make those compare
> as equal is a *horrible* mistake. People who really need to do that
> (usually for some legalistic local reason) tend to have very specific
> rules for sorting anyway, and they are rules specific to particular
> situations, not something that the filesystem should even try to work
> with.

Well, casefolding is something that's directly exposed to users.
So I do think that if casefolding is going to exist at all, there is a
strong argument for it to be Unicode and to handle things like ß to ss.
(Can you imagine being the user that gets used to typing in filenames
and ignoring capitalization, except whenever an accented letter is part
of the filename, and then your muscle memory breaks? That sort of thing
is maddening.)

BUT: I'm becoming more and more convinced that I want more separation
between casefolded lookups and non-casefolded lookups; the potential for
casefolding rule changes to break case-sensitive lookups is just bad.

If we do a "casefolding version 2" in bcachefs, we'll just have a
separate btree for casefolded dirents, and casefolded directories will
have their dirents indexed twice. That's trivially extensible to
multiple versions if - god forbid - we ever end up needing to support
multiple "locales", and more importantly it'd let us support a mode
where it's only certain pids that get casefolded lookups, so you don't
e.g. get casefolding dependencies creeping into your makefiles as can
happen today.
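
To make the "indexed twice" idea concrete, here is a rough userspace
sketch in Python. It is not bcachefs code: the two dicts merely stand in
for the two btrees, Python's str.casefold() stands in for whatever
folding rule a given version would pin down (it is not the kernel's
table), and Directory, ascii_fold and want_casefold are names made up
for illustration.

```python
# Illustrative model only, not bcachefs code. Two dicts play the role of
# the two btrees; str.casefold() is a stand-in for a versioned folding rule.

def ascii_fold(name):
    """Hypothetical ASCII-only rule: lowercase A-Z, leave everything else."""
    return "".join(c.lower() if c.isascii() else c for c in name)


class Directory:
    def __init__(self, casefolded=False, fold=str.casefold):
        self.casefolded = casefolded
        self.fold = fold
        self.exact = {}   # exact name  -> inode number (always maintained)
        self.folded = {}  # folded name -> exact name   (casefolded dirs only)

    def create(self, name, inum):
        if name in self.exact:
            raise FileExistsError(name)
        if self.casefolded:
            key = self.fold(name)
            if key in self.folded:
                # A different spelling already folds to the same key.
                raise FileExistsError(name)
            self.folded[key] = name
        self.exact[name] = inum

    def lookup(self, name, want_casefold=True):
        # want_casefold models "only certain pids get casefolded lookups".
        if self.casefolded and want_casefold:
            exact_name = self.folded.get(self.fold(name))
            return None if exact_name is None else self.exact[exact_name]
        # The exact index never consults the folding rule, so a rule
        # change cannot break case-sensitive lookups.
        return self.exact.get(name)


d = Directory(casefolded=True)
d.create("Straße.txt", 42)
```

With full folding, d.lookup("STRASSE.TXT") finds inode 42, while
d.lookup("STRASSE.TXT", want_casefold=False) misses and the exact
spelling still hits. Passing fold=ascii_fold instead gives the a-z-only
behaviour from earlier in the thread, where "STRASSE.TXT" no longer
matches "Straße.txt".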