[Last-Call] Re: Last Call: <draft-bray-unichars-10.txt> (Unicode Character Repertoire Subsets) to Proposed Standard

Carsten Bormann <cabo@xxxxxxx> · Sun, 6 Apr 2025 17:47:28 +0200

# Review of draft-bray-unichars-12

I didn't have time to review this document during the IETF last-call
period, so please accept my apologies for being late in these notes.
The observations here are mainly from the perspective of a protocol
designer who employs Unicode in most protocols.

It is nice to see that this document finally found a purpose, which is
listing three ABNF productions.
(I don't understand why these aren't properly marked for extraction,
though.)

I'm not entirely sure I can find a Proposed Standard in this document.
(The Shepherd writeup does not tell me why it is suggested to be
"Proposed Standard", either.)

The main innovation of the document is the invention of the
"problematic" character.
The document uses this wide brush both for constructions that are
simply syntax errors (where it would be sufficient to point that out),
and for certain characters that were not included in XML (specifically
for a set of control characters).
Using the same term both for broken (non-)Unicode and for characters
that have well-established uses is bound to confuse protocol designers
that want to make use of the reference information provided by this
document.

Since classifying some of the control characters as universally
"useful" and the others as universally "legacy" seems to be the main
new thing, let's have a closer look at the control characters in
question.

The character LF is widely used as the newline character and is
probably the only control character that has a defined, non-empty
meaning.
LF is in fact "problematic" when the text in question is not intended
to be structured into lines, which in the era of structured data
representation formats is now the predominant use of text in
protocols (*).
If this document had a normative intent, it should say that the
decision to include LF in the repertoire for a data item MUST always
be explicit.
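
As a minimal sketch of what "explicit" could look like in an
implementation (the helper name and its parameter are hypothetical and
do not come from the draft), a checker can reject every control
character that the data item's definition has not opted into:

```python
# Hypothetical helper; nothing here is taken from the draft.
def check_repertoire(text, allowed_controls=frozenset()):
    """Return (index, char) for each C0/C1 control or DEL that the
    caller has not explicitly allowed for this data item."""
    violations = []
    for i, ch in enumerate(text):
        cp = ord(ch)
        is_control = cp < 0x20 or 0x7F <= cp <= 0x9F
        if is_control and ch not in allowed_controls:
            violations.append((i, ch))
    return violations

# A single-line item (say, a member name) allows no controls at all.
single_line = check_repertoire("title\nwith newline")

# A line-structured item opts in to LF, and only LF, explicitly.
multi_line = check_repertoire("line 1\nline 2",
                              allowed_controls=frozenset("\n"))
```

The default is the empty set, so LF is only accepted where the
definition of the data item says so.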

The character CR is just noise when preceding an LF (and has no
defined meaning when occurring anywhere else).
It is required by some older standards such as certain mail formats,
but is vestigial.
It is not wrong to allow it for line-structured text, but the protocol
probably has to state how it is ignored; a document like this could
provide boilerplate that can simply be copied or referenced.
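
One possible shape for such a rule, as a sketch only (the function name
is hypothetical, and a real protocol may well choose differently for a
CR that is not followed by LF):

```python
# Hypothetical helper illustrating one way to "ignore" CR.
def normalize_line_ends(text):
    """Drop the CR of each CR LF pair; a CR that is not directly
    followed by LF is left in place for the protocol to judge."""
    return text.replace("\r\n", "\n")

normalized = normalize_line_ends("a\r\nb\rc\r\n")
```

The lone CR between "b" and "c" survives here; whether to reject it or
pass it through is exactly the kind of decision the protocol has to
state.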

The character HT has no defined meaning and is still used today in a
variety of ways.
Like all other ASCII control characters, it cannot be used unless its
meaning in the specific protocol is defined.
HT certainly should not be included in a recommended repertoire as a
blanket item.
The use of HT in environments that do not define what it means leads
to untold pain; one of the authors should remember the RFC 7386
incident which led to a quick republishing as RFC 7396.

Characters such as FF or RS are in limited use for providing
additional structure to text that goes beyond a line structure.
They are not "problematic" if their meaning is defined in the
application protocol.
(Compare the usage of RS in RFC 8142, as it is defined in RFC 7464, or
the way the publication form of RFCs up to RFC 8649 employed FF to
provide pagination.)

NUL probably is the only "legacy control" character that actually
should earn the term "problematic", not in the sense of a syntax
error, but because it triggers a common implementation error on
certain platforms.
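
A small illustration of that class of error, assuming the platform in
question hands text to code that uses NUL-terminated C strings (the
sample payload is made up):

```python
import ctypes

payload = b"admin\x00.example"

# c_char_p models a NUL-terminated C string: reading .value stops at
# the first NUL, so everything after it is silently lost.
as_c_string = ctypes.c_char_p(payload).value

print(len(payload), len(as_c_string))
```

Two layers that disagree about where the string ends is what makes NUL
worth singling out.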

DEL (U+007F) is mentioned once; interestingly, it is included in the
ABNF rule xml-character but not in unicode-assignable.
(This maybe is the most obvious indicator that xml-character is useful
mainly as historical information about a legacy construct, which also
needs to be read in conjunction with the rules XML provides for
tolerating, but ultimately ignoring, CR, as it always should be.)
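
The difference can be checked mechanically.
The range tables below are an assumption about the draft's ABNF,
transcribed by hand, and need to be verified against the draft text;
the helper name is hypothetical:

```python
# ASSUMPTION: hand-transcribed ranges; verify against the draft's ABNF.
XML_CHARACTER = [
    (0x9, 0x9), (0xA, 0xA), (0xD, 0xD),
    (0x20, 0xD7FF), (0xE000, 0xFFFD), (0x10000, 0x10FFFF),
]

UNICODE_ASSIGNABLE = [
    (0x9, 0x9), (0xA, 0xA), (0xD, 0xD),
    (0x20, 0x7E), (0xA0, 0xD7FF),
    (0xE000, 0xFDCF), (0xFDF0, 0xFFFD),
] + [
    # Planes 1 to 16, each without its last two code points.
    (plane << 16, (plane << 16) | 0xFFFD) for plane in range(1, 0x11)
]

def in_ranges(cp, ranges):
    """True if code point cp falls into any inclusive (lo, hi) range."""
    return any(lo <= cp <= hi for lo, hi in ranges)

DEL = 0x7F
del_in_xml = in_ranges(DEL, XML_CHARACTER)
del_in_assignable = in_ranges(DEL, UNICODE_ASSIGNABLE)
```

Under these assumed ranges, DEL matches xml-character and does not
match unicode-assignable.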

The document does not even mention BOMs, which are realized by another
character that is actually "problematic": U+FEFF.
The semi-automatic insertion of BOMs into UTF-8 items, in cases that
BOMs weren't designed for, also generates untold pain.
The Unicode standard has an unequivocal position on the use of BOMs
with UTF-8.
STD 63 (which, by the way, needs to be cited) is a bit more feeble,
but does make the point as well.
Clearly, a JSON text that precedes all member names with U+FEFF would
lead to interoperability problems; some guidance may be advisable.
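
As a sketch of the kind of guidance that could be given (function names
are hypothetical; dropping one leading BOM and flagging any other
U+FEFF is just one possible policy):

```python
BOM = "\ufeff"

def strip_leading_bom(text):
    """Drop a single U+FEFF at the very start of the text, if present."""
    return text[1:] if text.startswith(BOM) else text

def has_embedded_bom(text):
    """True if U+FEFF occurs anywhere after the first position."""
    return BOM in strip_leading_bom(text)

raw = b"\xef\xbb\xbf{}"

# The plain UTF-8 codec keeps the BOM as a character in the result;
# Python's "utf-8-sig" codec removes a leading one while decoding.
kept = raw.decode("utf-8")
dropped = raw.decode("utf-8-sig")
```

A receiver that applies neither step ends up with U+FEFF as part of the
first token, which is where the interoperability problems start.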

One question with which the document does not help is how much our
protocol definitions should import fine details of Unicode, such as
the definition of non-characters.  Handling them like unassigned
assignables will work well in most protocols.
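
For scale, this is roughly what a protocol would be importing if it
did want to single out non-characters (a sketch; the function name is
hypothetical, the ranges are the Unicode definition of the 66
noncharacters):

```python
def is_noncharacter(cp):
    """True for U+FDD0..U+FDEF and for the last two code points of
    each of the 17 planes (U+xxFFFE and U+xxFFFF)."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

noncharacter_count = sum(is_noncharacter(cp) for cp in range(0x110000))
```

A protocol that treats these like any unassigned code point simply
never needs this function.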

Grüße, Carsten

(*) Not capturing this important shift in the use of textual elements
is certainly one of the surprising points of a document intended to be
published in 2025.
