Re: [GIT PULL] bcachefs fixes for 6.15-rc4

Kent Overstreet <kent.overstreet@xxxxxxxxx> · Sun, 27 Apr 2025 22:16:37 -0400

On Mon, Apr 28, 2025 at 03:05:19AM +0100, Autumn Ashton wrote:
> 
> 
> On 4/28/25 2:43 AM, Kent Overstreet wrote:
> > On Sun, Apr 27, 2025 at 06:30:59PM -0700, Eric Biggers wrote:
> > > On Sun, Apr 27, 2025 at 08:55:30PM -0400, Kent Overstreet wrote:
> > > > The thing is, that's exactly what we're doing. ext4 and bcachefs both
> > > > refer to a specific revision of the folding rules: for ext4 it's
> > > > specified in the superblock, for bcachefs it's hardcoded for the moment.
> > > > 
> > > > I don't think this is the ideal approach, though.
> > > > 
> > > > That means the folding rules are "whatever you got when you mkfs'd".
> > > > Think about what that means if you've got a fleet of machines, of
> > > > different ages, but all updated in sync: that's a really annoying way
> > > > for gremlins of the "why does this machine act differently" variety to
> > > > creep in.
> > > > 
> > > > What I'd prefer is for the unicode folding rules to be transparently and
> > > > automatically updated when the kernel is updated, so that behaviour
> > > > stays in sync. That would behave more the way users would expect.
> > > > 
> > > > But I only gave this real thought just over the past few days, and doing
> > > > this safely and correctly would require some fairly significant changes
> > > > to the way casefolding works.
> > > > 
> > > > We'd have to ensure that lookups via the case sensitive name always
> > > > work, even if the casefolding table the dirent was created with gives
> > > > different results than the currently active casefolding table.
> > > > 
> > > > That would require storing two different "dirents" for each real dirent,
> > > > one normalized and one un-normalized, because we'd have to do an
> > > > un-normalized lookup if the normalized lookup fails (and vice versa).
> > > > Which should be completely fine from a performance POV, assuming we have
> > > > working negative dentries.
> > > > 
> > > > But, if the unicode folding rules are stable enough (and one would hope
> > > > they are), hopefully all this is a non-issue.
> > > > 
> > > > I'd have to gather more input from users of casefolding on other
> > > > filesystems before saying what our long term plans (if any) will be.
> > > 
> > > Wouldn't lookups via the case-sensitive name keep working even if the
> > > case-insensitivity rules change?  It's lookups via a case-insensitive name that
> > > could start producing different results.  Applications can depend on
> > > case-insensitive lookups being done in a certain way, so changing the
> > > case-insensitivity rules can be risky.
> > 
> > No, because right now on a case-insensitive filesystem we _only_ do the
> > lookup with the normalized name.
> > 
> > > Regardless, the long-term plan for the case-insensitivity rules should be to
> > > deprecate the current set of rules, which does Unicode normalization which is
> > > way overkill.  It should be replaced with a simple version of case-insensitivity
> > > that matches what FAT does.  And *possibly* also a version that matches what
> > > NTFS does (a u16 upcase_table[65536] indexed by UTF-16 coding units), if someone
> > > really needs that.
> > > 
> > > As far as I know, that was all that was really needed in the first place.
> > > 
> > > People misunderstood the problem as being about language support, rather than
> > > about compatibility with legacy filesystems.  And as a result they incorrectly
> > > decided they should do Unicode normalization, which is way too complex and has
> > > all sorts of weird properties.
> > 
> > Believe me, I do see the appeal of that.
> > 
> > One of the things I should really float with e.g. Valve is the
> > possibility of providing tooling/auditing to make it easy to fix
> > userspace code that's doing lookups that only work with casefolding.
> 
> This is not really about fixing userspace code that expects casefolding, or
> providing some form of stopgap there.
> 
> The main need there is Proton/Wine, which is a compat layer for Windows
> apps, which needs to pretend it's on NTFS and everything there expects
> casefolding to work.
> 
> No auditing/tooling required, we know the problem. It is unavoidable.

Does this boil all the way up to e.g. savegames?

I was imagining predetermined assets, where the name of the file would
be present in a compiled binary, and it's little more than a search and
replace. But that would only work if the name is present as a string
literal.

> I agree with the point about Unicode normalization being odd though; when
> I was implementing casefolding for bcachefs, I immediately thought it was a
> huge hammer to do full normalization for the intended purpose, and not just
> a big table...

Samba's historically wanted casefolding, and Windows casefolding is
Unicode (and it's full, not simple - mostly), so I'd expect that was the
other main driver.

I'm sure there are other odd corners besides just Samba where Windows
compatibility comes up; people cook up all kinds of strange things.
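
For reference, the "big table" approach discussed above (a u16
upcase_table[65536] indexed by UTF-16 code units) can be sketched
roughly as below. This is only an illustration under stated
assumptions, not bcachefs or NTFS code: the function names are
hypothetical, and the table is filled with an identity mapping plus
ASCII a-z, where a real table would be loaded from the filesystem or
generated from Unicode data.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 64K-entry table, indexed by UTF-16 code unit. */
static uint16_t upcase_table[65536];

/*
 * Assumption for this sketch: identity mapping everywhere except
 * ASCII a-z. A real table would carry many more mappings.
 */
static void upcase_table_init(void)
{
	for (uint32_t i = 0; i < 65536; i++)
		upcase_table[i] = (uint16_t)i;
	for (uint16_t c = 'a'; c <= 'z'; c++)
		upcase_table[c] = (uint16_t)(c - 'a' + 'A');
}

/*
 * Compare two UTF-16 names, one code unit at a time, through the table.
 * The mapping is one-to-one per code unit, so names of different
 * lengths can never compare equal.
 */
static bool names_equal_ci(const uint16_t *a, size_t a_len,
			   const uint16_t *b, size_t b_len)
{
	if (a_len != b_len)
		return false;
	for (size_t i = 0; i < a_len; i++)
		if (upcase_table[a[i]] != upcase_table[b[i]])
			return false;
	return true;
}
```

The point of the sketch is the contrast with full Unicode case folding,
where a single character can fold to several (U+00DF folds to "ss"), so
the folded name can differ in length from the original and a
per-code-unit table like this is not enough.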