On 2025-04-24 at 15:53 -0400 Gabriel Krisman Bertazi sent off:
> The big problem is that each of the big OS vendors chose specific
> semantics of what to casefold. APFS does NFD + full casefolding[1],
> right? except for "some code-points". I'm not sure what they do with ß,
> tbh. I could never find any documentation on the specific code-points
> they add/ignore.

Apple basically stores the files in NFD and does casefolding, but not
with those lossy folding rules that make "ß" and "ss" equal. I have an
overview of filesystems and their encodings written up at
https://www.j3e.de/linux/convmv/man/#Filesystem-issues - that might also
be interesting for this discussion.

> In ext4, we decided to have no exceptions. Just do plain NFD + CF. That
> means we do C+F from the table below:
>
> https://www.unicode.org/Public/12.1.0/ucd/CaseFolding.txt
>
> Which includes ß->SS. We could argue forever whether that doesn't make
> sense for language X, such as German. I'm not a German speaker but
> friends said it would be common to see straße uppercased to STRASSE there,
> even though the 2017 agreement abolished it in favor of ẞ. So what is
> the right way?

I am a German speaker, so I can shed light on that. "ß" and "ss" are
definitely not equal. If your name is "Groß", this is a different name
than "Gross". The word "Maße" exists and the word "Masse" exists; they
mean completely different things. The only thing to say here is that
people without that letter on their keyboard often use "ss" as a
fallback, just like writing "ae" is a common fallback for "ä". In a
filesystem they should not be mapped to the same file.

The main mistake made when casefolding was introduced in the Linux
kernel was to use all of the cases listed in
https://www.unicode.org/Public/12.1.0/ucd/CaseFolding.txt
If you grep for all the F flagged cases there (grep " F;") you will get
104 "casefold" rules, which are essentially bogus for filesystem
casefolding.
They mainly reduce the number of valid codepoints for filenames. Apart
from the German "ß" they also contain ligatures and combinations of
Greek letters, which are being "equalized". All of those reduced
codepoints can be unique characters of filenames on case-insensitive
Windows or Apple filesystems; they are not considered for casefolding in
any way, except for the "simple" (S flagged) casefolding of the
corresponding codepoint. The F flagged casefoldings make sense for cases
like CTRL-F in browsers: there you want to find places where an "fi"
ligature (ﬁ) is used if you search for "fi", but in filenames you need
to be able to use both. At least this is what all operating systems with
case-insensitive filesystems do (except for Linux till now).

> My point is we can't rely fully on languages to argue the right
> semantics. There are no right semantics. And Languages are also alive
> and changing. There are many other examples where full casefold will
> look stupid; for instance, one would argue we should also translate the
> T column (i.e non-Turkish languages).

The Turkish language with the dotted/dotless i/I is a very special and
exceptional case; no other case-insensitive filesystem implementation
does case-insensitive matching for it. The i/I case doesn't really
matter in this discussion.

> It is not useless. Android and Wine emulators have been using it just
> fine for years. We also cannot break compatibility for them.

I understand that we can't break compatibility with it, but we should
try to find a way to improve the current situation, which is far from
good.

> > Can this be changed without causing too much hassle?
>
> We attempted to do a much smaller change recently in commit
> 5c26d2f1d3f5, because we assumed no one would be trying to create files
> with silly stuff like ZWSP (U+200B). Turns out there is a reasonable
> use-case for that with Variation Selectors, and we had to revert it.
> So we need to be very careful with any changes here, so people don't
> lose access to their files on a kernel update. Even with that, more
> casefolding flavor will cause all sorts of compatibility issues when
> moving data across volumes, so I'd be very wary of having more than one
> flavor.

Especially because files should also be movable from other platforms, we
should stay very close to what other platforms do here. The fact that
our casefolding significantly reduces the number of possible codepoints
(the 104 F flagged ones) causes a major interoperability problem.

> What are the exact requirements for samba? Do you only fold the C
> column? Do you need stuff like compatibility normalization?

For Samba it's required that we don't have a reduced set of valid
Unicode characters. That means that the F flagged mappings must not be
used. The Turkish T mapping should not be used either.

Mappings we should use:
- the "C"ommon and
- the "S"imple flagged mappings
from the Unicode mapping table only.

I understand that it's difficult to change this, as we store hashes of
the current lowercase version of the filenames. I'm not enough of an
expert in the filesystem code to come up with a good idea of how to
solve this, though. Maybe we could use different versions of the
casefolding tables and store in the filesystem which version to use?
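
To make the difference between the two flavors concrete, here is a
minimal Python sketch. It is not kernel or Samba code: the names
(EXCERPT, load_table, fold, FULL, SIMPLE) are made up for illustration,
the table is a small hand-copied excerpt in the CaseFolding.txt line
format, and NFD normalization is left out entirely.

```python
# Hand-copied excerpt in the CaseFolding.txt line format
# ("code; status; mapping; # name"). The real file has many more entries.
EXCERPT = """\
0047; C; 0067; # LATIN CAPITAL LETTER G
004D; C; 006D; # LATIN CAPITAL LETTER M
00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
1E9E; F; 0073 0073; # LATIN CAPITAL LETTER SHARP S
1E9E; S; 00DF; # LATIN CAPITAL LETTER SHARP S
FB01; F; 0066 0069; # LATIN SMALL LIGATURE FI
"""


def load_table(text, statuses):
    """Build a {char: folded_string} map from the lines whose status is wanted."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not line:
            continue
        code, status, mapping = [f.strip() for f in line.split(";")[:3]]
        if status in statuses:
            folded = "".join(chr(int(c, 16)) for c in mapping.split())
            table[chr(int(code, 16))] = folded
    return table


def fold(name, table):
    """Fold a name one codepoint at a time; unmapped codepoints stay as-is."""
    return "".join(table.get(ch, ch) for ch in name)


FULL = load_table(EXCERPT, {"C", "F"})    # the C+F flavor described for ext4
SIMPLE = load_table(EXCERPT, {"C", "S"})  # the C+S flavor proposed above

for a, b in [("Maße", "Masse"), ("Groß", "Gross"), ("\ufb01le", "file")]:
    print(a, b,
          "C+F collide:", fold(a, FULL) == fold(b, FULL),
          "C+S collide:", fold(a, SIMPLE) == fold(b, SIMPLE))
```

With C+F each of these pairs folds to the same string, so only one of
the two names can exist in a directory. With C+S they stay distinct,
while plain upper/lower case pairs (and ẞ/ß) still match.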