[Last-Call] Re: Bormann: "The observations here are mainly from the perspective of a protocol designer"

John C Klensin <john-ietf@xxxxxxx> · Tue, 08 Apr 2025 23:28:08 -0400

I know this was primarily addressed to Carsten, but...

--On Tuesday, April 8, 2025 09:53 -0700 Tim Bray
<tbray@xxxxxxxxxxxxxx> wrote:

> Orie asked for discussion of this contribution and one from John,
> but the ensuing walls of text have wrecked my input buffers, so in
> this note I'm going back to the referenced contribution and
> address the specific issues raised in it.
> 
> On Sun, Apr 6, 2025 at 8:47 AM Carsten Bormann <cabo@xxxxxxx>
> wrote:
> 
>> I didn't have time to review this document during the IETF
>> last-call period, so please accept my apologies for being late in
>> these notes. The observations here are mainly from the perspective
>> of a protocol designer that employs Unicode in most of the
>> protocols.
>> 
> 
> I think it would be helpful to bear in mind the context: Carsten has
> repeatedly and explicitly argued that Unichars is wrong, damaging,
> and should be replaced by his modern-network-unicode draft. Having
> said that, I will address these issues on the assumption that they
> are offered in good faith.

FWIW, I disagree with Carsten even if a few of the comments below
lead back to a document more like his.  If Unichars is focused on
language and transport use, rather than the needs of specific
protocols, then it is supplemental to UTF-.  As soon as one starts to
make the sorts of differentiations in sections 2 - 5 of the
network-unicode draft, I think it ventures into PRECIS territory,
overlapping with PRECIS in some cases and filling in things that it
skipped over in others.

>...
>> I'm not entirely sure I can find a Proposed Standard in this
>> document.
>> (The Shepherd writeup does not tell me why it is suggested to be
>> "Proposed Standard", either.)

> This document is explicitly designed to be referenced by
> data-format and protocol designers working on other documents. Thus
> the three named subsets, each isolated in a referenceable
> subsection provided with ABNF. It is my perception that publishing
> it as an informational RFC would add friction to the process of
> referencing it and thus reduce its usefulness.

See separate note(s) today.

>> The main innovation of the document is the invention of the
>> "problematic" character.

Only the terminology, because the concept is clearly present in parts
of PRECIS and in IDNA2008's "Disallowed" as well as parts of Unicode.

> ...
> 
>> Using the same term both for broken (non-)Unicode and for
>> characters that have well-established uses is bound to confuse
>> protocol designers that want to make use of the reference
>> information provided by this document.

> I disagree that any code point Unichars classes as "problematic" has
> "well-established" interoperable uses.
> 
> [Discussion of the problems that careless use of \n, \r, and \t can
> lead to elided.]

> There are plenty of characters whose use can lead to problems; an
> exhaustive discussion would produce a thousand-page document. The
> fact remains that these three characters have been widely used in
> protocols and data formats, and (a) they empirically can be used
> interoperably and (b) any attempt to exclude them would be useless.
> Yes, it is annoying that the notions of "line break", "record
> separator", and "indentation level" are messy and not well-mapped
> to Unicode code points, but this seems not to be a terribly
> difficult problem in practice for protocol designers.

Again, FWIW, agree with all of that.

>> Characters such as FF or RS are in limited use for providing
>> additional structure to text that goes beyond a line structure.
>> They are not "problematic" if their meaning is defined in the
>> application protocol.

> I think the claim that FF and RS are not problematic in 2025 is
> just wrong. I repeat that argument for every other C0 and C1
> control character.

I think they, or at least FF, are not quite as clear as that, citing
pagination in plain text I-Ds as an interesting example.  But that
takes us down the slippery slope toward application-specific (or
application-dependent) profiles.

>> (Compare the usage of RS in RFC 8142, as it is defined in RFC
>> 8091, or the way the publication form of RFCs up to RFC8649
>> employed FF to provide pagination.)

> 8142 (2017) inherits this from 7464 (2015) and I'm pretty sure that
> it would not be deemed acceptable today. A private-use code point
> would be much more appropriate. 8091 correctly points out problems
> that can be caused by using RS.

RFCs in plain text aside, there are not only the I-D cases but also,
IIRC, the PDF forms of RFCs, which are, AFAICT, still paginated.

>> NUL probably is the only "legacy control" character that actually
>> should earn the term "problematic", but not in the meaning of a
>> syntax error, but because it triggers a common implementation
>> error in certain platforms.

> Are you arguing for another subset that allows the use of the C0
> and C1 controls on the premise that they are somehow interoperable?
> I don't think this would be a good idea but I could be wrong.

If I were arguing for an additional subset (I hope I'm not), it would
allow LF (referring to U+000A as "newline" is a different source of
confusion, inconsistent with both Unicode terminology and that of RFC
20), CR, HT, and FF.  "RS" is problematic for another reason, which
is that its definition in ordinary text strings has never been
precisely and consistently described.  The fact that Unicode's
preferred name for the code point (U+001E) is "Information Separator
Two" is actually symptomatic of the problem -- it is probably easier
to make an argument for including codepoints like ESC (U+001B) --if
only because of ISO/IEC 2022 control sequences-- and DC1-DC4 (U+0011
through U+0014) but it is easy to argue that those are not "text
strings" but device controls or for use with structured data.  See
separate note titled "draft-bray-unichars-13 nits du jour".
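
As a minimal sketch of what such a hypothetical subset would mean in
practice: the check below treats C0 controls (U+0000-U+001F), DEL
(U+007F), and C1 controls (U+0080-U+009F) as controls and lets only
HT, LF, FF, and CR through.  The names ALLOWED_CONTROLS, is_control,
and disallowed_controls are invented for illustration; this is not a
subset defined by Unichars or by any other document.

```python
# Hypothetical allow-list: HT, LF, FF, CR. Not defined by any specification.
ALLOWED_CONTROLS = {0x0009, 0x000A, 0x000C, 0x000D}


def is_control(cp: int) -> bool:
    """True for C0 controls (U+0000-U+001F), DEL (U+007F), and C1 controls (U+0080-U+009F)."""
    return cp <= 0x1F or 0x7F <= cp <= 0x9F


def disallowed_controls(text: str) -> list[tuple[int, str]]:
    """Return (index, 'U+XXXX') for every control code point outside the allow-list."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if is_control(ord(ch)) and ord(ch) not in ALLOWED_CONTROLS
    ]


if __name__ == "__main__":
    print(disallowed_controls("page one\x0cpage two\n"))  # FF and LF pass
    print(disallowed_controls("a\x1eb\x1b"))              # RS and ESC are reported
```

Under these assumptions, RS, ESC, DC1-DC4, and DEL are all reported
alike; the sketch makes no attempt to distinguish device controls from
separators.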

Whatever is said about ESC should also give due consideration to DEL,
as Carsten mentions below.

>> DEL (U+007F) is mentioned once, but is interestingly included in
>> the ABNF rule xml-character but not in unicode-assignable.
>> (This maybe is the most obvious indicator that xml-character is
>> useful mainly as historical information about a legacy construct,
>> which also requires to be read in conjunction with the rules XML
>> provides for tolerating, but ultimately ignoring, CR, as it always
>> should be.)

> DEL is classified by Unicode as a Control Character, see
> https://www.unicode.org/charts/PDF/U0000.pdf. I suspect that if I
> knew in 1996 what I know now I would have tried to exclude it from
> XML as we did the other controls.
> 
>> The document does not even mention BOMs, realized by another
>> character
>> that is actually "problematic": U+FEFF.

> This proved valuable in XML, working exactly as designed, with the
> result being the quote from Larry Wall: "An XML document knows what
> encoding it's in". It's highly interoperable. It's probably not
> useful any more in our (thankfully) all-UTF-8 world, but I don't
> see it as problematic.

>> The semi-automatic insertion of BOMs into UTF-8 items in certain
>> cases BOMs weren't designed for also generates untold pain.

> Really? I've never experienced a problem, but I guess there is some
> evidence on your side, given that RFC8259 explicitly says "don't do
> this" but blesses the use of Postel's law by parsers when people do
> it anyhow. Does it happen any more?

I think so but I've said about as much as I can usefully contribute
about BOMs in another recent note.  However,...

> If there is consensus that U+FEFF should be classified as
> "problematic" I would disagree but could live with that.

The problem with making it "problematic" is that U+FEFF has a
perfectly good, sometimes necessary, and non-problematic use as ZERO
WIDTH NO-BREAK SPACE.  It would be hard to ban that one without also
banning U+00A0, NO-BREAK SPACE, which is now not problematic.  So, if
you want to classify it as "problematic", it should be problematic
only when it appears as the first character in a file or clearly
delimited sequence of codes.  That takes you into the space of
talking about strings (and whatever "delimited sequence of codes"
means) and raises other problems: just not how the document works
right now.
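
A rough sketch of why that is a different kind of rule: a
position-dependent treatment of U+FEFF has to look at a whole string,
whereas a subset of code points can be checked one code point at a
time.  The function name feff_positions and its return shape are
assumptions made for illustration only, and "string" here simply means
an already-decoded Python str.

```python
FEFF = "\ufeff"  # BOM when leading; ZERO WIDTH NO-BREAK SPACE elsewhere


def feff_positions(text: str) -> dict:
    """Separate a leading U+FEFF from occurrences in the interior of the string."""
    return {
        # A leading U+FEFF is the case usually read as a byte order mark.
        "leading": text.startswith(FEFF),
        # Interior occurrences are the ZERO WIDTH NO-BREAK SPACE use.
        "interior": [i for i, ch in enumerate(text) if ch == FEFF and i > 0],
    }


if __name__ == "__main__":
    print(feff_positions("\ufeffhello"))
    print(feff_positions("foo\ufeffbar"))
```

Note that the result depends on where the string boundaries are drawn,
which is exactly the "delimited sequence of codes" question.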

>...

best,
   john
