On Mon, Apr 28, 2025 at 03:05:19AM +0100, Autumn Ashton wrote:
>
>
> On 4/28/25 2:43 AM, Kent Overstreet wrote:
> > On Sun, Apr 27, 2025 at 06:30:59PM -0700, Eric Biggers wrote:
> > > On Sun, Apr 27, 2025 at 08:55:30PM -0400, Kent Overstreet wrote:
> > > > The thing is, that's exactly what we're doing. ext4 and bcachefs both
> > > > refer to a specific revision of the folding rules: for ext4 it's
> > > > specified in the superblock, for bcachefs it's hardcoded for the moment.
> > > >
> > > > I don't think this is the ideal approach, though.
> > > >
> > > > That means the folding rules are "whatever you got when you mkfs'd".
> > > > Think about what that means if you've got a fleet of machines, of
> > > > different ages, but all updated in sync: that's a really annoying way
> > > > for gremlins of the "why does this machine act differently" variety to
> > > > creep in.
> > > >
> > > > What I'd prefer is for the unicode folding rules to be transparently and
> > > > automatically updated when the kernel is updated, so that behaviour
> > > > stays in sync. That would behave more the way users would expect.
> > > >
> > > > But I only gave this real thought just over the past few days, and doing
> > > > this safely and correctly would require some fairly significant changes
> > > > to the way casefolding works.
> > > >
> > > > We'd have to ensure that lookups via the case sensitive name always
> > > > works, even if the casefolding table the dirent was created with give
> > > > different results that the currently active casefolding table.
> > > >
> > > > That would require storing two different "dirents" for each real dirent,
> > > > one normalized and one un-normalized, because we'd have to do an
> > > > un-normalized lookup if the normalized lookup fails (and vice versa).
> > > > Which should be completely fine from a performance POV, assuming we have
> > > > working negative dentries.
> > > >
> > > > But, if the unicode folding rules are stable enough (and one would hope
> > > > they are), hopefully all this is a non-issue.
> > > >
> > > > I'd have to gather more input from users of casefolding on other
> > > > filesystems before saying what our long term plans (if any) will be.
> > >
> > > Wouldn't lookups via the case-sensitive name keep working even if the
> > > case-insensitivity rules change? It's lookups via a case-insensitive name that
> > > could start producing different results. Applications can depend on
> > > case-insensitive lookups being done in a certain way, so changing the
> > > case-insensitivity rules can be risky.
> >
> > No, because right now on a case-insensitive filesystem we _only_ do the
> > lookup with the normalized name.
> >
> > > Regardless, the long-term plan for the case-insensitivity rules should be to
> > > deprecate the current set of rules, which does Unicode normalization which is
> > > way overkill. It should be replaced with a simple version of case-insensitivity
> > > that matches what FAT does. And *possibly* also a version that matches what
> > > NTFS does (a u16 upcase_table[65536] indexed by UTF-16 coding units), if someone
> > > really needs that.
> > >
> > > As far as I know, that was all that was really needed in the first place.
> > >
> > > People misunderstood the problem as being about language support, rather than
> > > about compatibility with legacy filesystems. And as a result they incorrectly
> > > decided they should do Unicode normalization, which is way too complex and has
> > > all sorts of weird properties.
> >
> > Believe me, I do see the appeal of that.
> >
> > One of the things I should really float with e.g. Valve is the
> > possibility of providing tooling/auditing to make it easy to fix
> > userspace code that's doing lookups that only work with casefolding.
>
> This is not really about fixing userspace code that expects casefolding, or
> providing some form of stopgap there.
>
> The main need there is Proton/Wine, which is a compat layer for Windows
> apps, which needs to pretend it's on NTFS and everything there expects
> casefolding to work.
>
> No auditing/tooling required, we know the problem. It is unavoidable.

Does this boil all the way up to e.g. savegames?

I was imagining predetermined assets, where the name of the file would
be present in a compiled binary, and it's little more than a search and
replace. But would only work if it's present as a string literal.

> I agree with the calling about Unicode normalization being odd though, when
> I was implementing casefolding for bcachefs, I immediately thought it was a
> huge hammer to do full normalization for the intended purpose, and not just
> a big table...

Samba's historically wanted casefolding, and Windows casefolding is
Unicode (and it's full, not simple - mostly), so I'd expect that was the
other main driver.

I'm sure there's other odd corners besides just Samba where Windows
compatibility comes up, people cook up all kinds of strange things.
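
To make the "full, not simple" distinction concrete, here is a rough userspace sketch in Python. It is only an illustration: `build_simple_upcase_table`, `simple_fold_key` and `full_fold_key` are hypothetical names, the table is derived from whatever Unicode data the Python interpreter ships, and it is neither the real NTFS upcase table nor the kernel's utf8 casefolding code. The simple variant is a 65536-entry table indexed by UTF-16 code unit that keeps only one-to-one mappings; the full variant uses `str.casefold()`, which may expand one character into several.

```python
def build_simple_upcase_table():
    """Build a 65536-entry table indexed by UTF-16 code unit (approximation)."""
    table = list(range(0x10000))  # default: every unit maps to itself
    for unit in range(0x10000):
        if 0xD800 <= unit <= 0xDFFF:
            continue  # surrogate code units are left unchanged
        upper = chr(unit).upper()
        # Keep only mappings that stay a single UTF-16 code unit; anything
        # that would expand (e.g. one character to two) is dropped.
        if len(upper) == 1 and ord(upper) < 0x10000:
            table[unit] = ord(upper)
    return table


UPCASE_TABLE = build_simple_upcase_table()


def simple_fold_key(name):
    """Comparison key from a per-code-unit table lookup (simple folding)."""
    data = name.encode("utf-16-le")
    units = (int.from_bytes(data[i:i + 2], "little")
             for i in range(0, len(data), 2))
    return tuple(UPCASE_TABLE[u] for u in units)


def full_fold_key(name):
    """Comparison key from full Unicode case folding (may change length)."""
    return name.casefold()
```

Under these assumptions, "README.txt" and "readme.TXT" compare equal with either key, while "straße" and "STRASSE" compare equal only with `full_fold_key`, because the one-to-two mapping of "ß" cannot be expressed in a one-entry-per-code-unit table.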