# Review of draft-bray-unichars-12

I didn't have time to review this document during the IETF last-call period, so please accept my apologies for being late in these notes.

The observations here are mainly from the perspective of a protocol designer who employs Unicode in most of their protocols.

It is nice to see that this document finally found a purpose, which is listing three ABNF productions. (I don't understand why these aren't properly marked for extraction, though.)

I'm not entirely sure I can find a Proposed Standard in this document. (The Shepherd writeup does not tell me why it is suggested to be "Proposed Standard", either.)

The main innovation of the document is the invention of the "problematic" character. The document uses this broad brush both for constructions that are simply syntax errors (where it would be sufficient to point that out), and for certain characters that were not included in XML (specifically for a set of control characters). Using the same term both for broken (non-)Unicode and for characters that have well-established uses is bound to confuse protocol designers who want to make use of the reference information provided by this document.

Since classifying some of the control characters as universally "useful" and the others as universally "legacy" seems to be the main new thing, let's have a closer look at the control characters in question.

The character LF is widely used as the newline character and is probably the only control character that has a defined, non-empty meaning. LF is actually "problematic" when the text in question is not intended to be structured into lines, which in the era of structured data representation formats is now the predominant use of text in protocols (*). If this document had a normative intent, it should say that the decision to include LF in the repertoire for a data item MUST always be explicit.
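
To illustrate what "explicit" could look like in practice, here is a minimal sketch (not taken from the draft). The function name `disallowed_controls` and its `allowed` parameter are hypothetical; the sketch only looks at the C0 controls and DEL, and assumes that a data item admits no control characters unless its definition opts in to specific ones.

```python
# Hypothetical helper, not from the draft: each data item has to opt in to
# every control character it accepts; the default is to accept none.
C0_AND_DEL = frozenset(chr(cp) for cp in range(0x20)) | {"\x7f"}


def disallowed_controls(text, allowed=frozenset()):
    """Return (index, code point) pairs for C0 controls or DEL not in `allowed`."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if ch in C0_AND_DEL and ch not in allowed
    ]


# A data item that is not line-structured allows no controls, not even LF.
print(disallowed_controls("one line\n"))
# -> [(8, 'U+000A')]

# A line-structured data item opts in to LF, and only LF.
print(disallowed_controls("two\nlines\r\n", allowed=frozenset({"\n"})))
# -> [(9, 'U+000D')]
```

With a check shaped like this, the repertoire decision is visible at each place a data item is defined, instead of being inherited from a blanket "useful controls" set.
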

The character CR is just noise when preceding an LF (and has no defined meaning when occurring anywhere else). It is required by some older standards such as certain mail formats, but is vestigial. It is not wrong to allow it for line-structured text, but the protocol probably has to state how it is ignored; a document like this could provide boilerplate that can simply be copied or referenced.

The character HT has no defined meaning and is still used today in a variety of ways. Like all other ASCII control characters, it cannot be used unless its meaning in the specific protocol is defined. HT certainly should not be included in a recommended repertoire as a blanket item. The use of HT in environments that do not define what it means leads to untold pain; one of the authors should remember the RFC 7386 incident, which led to a quick republishing as RFC 7396.

Characters such as FF or RS are in limited use for providing additional structure to text that goes beyond a line structure. They are not "problematic" if their meaning is defined in the application protocol. (Compare the usage of RS in RFC 8142, as it is defined in RFC 7464, or the way the publication form of RFCs up to RFC 8649 employed FF to provide pagination.)

NUL probably is the only "legacy control" character that actually should earn the term "problematic", not in the sense of a syntax error, but because it triggers a common implementation error on certain platforms.

DEL (U+007F) is mentioned once, but is interestingly included in the ABNF rule xml-character but not in unicode-assignable. (This maybe is the most obvious indicator that xml-character is useful mainly as historical information about a legacy construct, which also needs to be read in conjunction with the rules XML provides for tolerating, but ultimately ignoring, CR, as it always should be.)

The document does not even mention BOMs, realized by another character that is actually "problematic": U+FEFF.
The semi-automatic insertion of BOMs into UTF-8 items in certain cases BOMs weren't designed for also generates untold pain. The Unicode standard has an unequivocal position on the use of BOMs with UTF-8. STD 63 (which, by the way, needs to be cited) is a bit more feeble, but does make the point as well. Clearly, a JSON text that precedes all member names with U+FEFF would lead to interoperability problems; some guidance may be advised.

One question with which the document does not help is how much our protocol definitions should import fine details of Unicode, such as the definition of non-characters. Handling them like unassigned assignables will work well in most protocols.

Grüße, Carsten

(*) Not capturing this important shift in the use of textual elements is certainly one of the surprising points of a document intended to be published in 2025.