[Last-Call] Re: draft-bray-unichars-14 ietf last call Secdir review

Addison Phillips <addisoni18n@xxxxxxxxx> · Fri, 2 May 2025 13:12:45 -0700

The problem with disallowing unassigned code points is that it disadvantages languages whose code points are assigned later. Such languages can go many years with support gaps and barriers.
Confusables *are* a problem, but most new assignments don't represent new confusables. Perhaps better coordination between Unicode and IETF is called for to prevent gaps and better document problem vectors?
Addison

On Fri, May 2, 2025, 12:45 John C Klensin <john-ietf@xxxxxxx> wrote:

--On Thursday, May 1, 2025 14:25 -0700 Tim Hollebeek via Datatracker
<noreply@xxxxxxxx> wrote:

> Document: draft-bray-unichars
> Title: Unicode Character Repertoire Subsets
> Reviewer: Tim Hollebeek
> Review result: Ready
>
> This is a very important and useful document. I found it useful and
> will recommend it to others once published.
>
> The only thing I'd point out is the opportunity to perhaps add a
> sentence opining on the intersection between "confusables" and
> "unassigned code points", and point out that if "confusables" is in
> your threat model, you have to admit you've signed up for reviewing
> and/or consuming a new list of valid code points every new unicode
> release.

And, of course, that assumes there is an entity that will create such
lists and do so accurately (presumably reflecting broad consensus)
and on a timely basis.  That is where the scope of this document
slides toward those of, e.g., PRECIS and IDNA2008.  To put the
concern into sharper perspective, PRECIS has not been updated since
2017 (Unicode 10.0) and IDNA2008 since 2022 (Unicode 12.0.0).  Draft
updates to both are floating around, but neither has been queued for
community review and action and, at this point, could be outdated
before being approved and published.

FWIW, the problem Tim points out is exactly the reason why IDNA2008
and PRECIS disallow the use of unassigned code points -- there is no
way to know what might end up assigned to them in some future Unicode
release.  They might not only get assigned to characters that create
confusability problems but could possibly end up being assigned to
noncharacters or device or presentation controls not covered by the
current spec (although, unlike confusables, those other uses are
unlikely).  The confusable issue is not theoretical: we've had real
examples in which assignment of previously unassigned characters has
created the potential for confusion with code points assigned earlier
and versioning issues with IDNA2008.
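
To make the version dependence concrete, here is a minimal Python
sketch of a "reject unassigned code points" check.  It is an
illustration only, not the algorithm of IDNA2008, PRECIS, or the I-D;
the helper names are hypothetical.  It relies on the Unicode data
bundled with the Python runtime, so the same string can pass on one
runtime and fail on an older one.  Noncharacters are tested separately
because they also carry General_Category Cn.

```python
import unicodedata


def is_noncharacter(cp: int) -> bool:
    """The 66 noncharacters: U+FDD0..U+FDEF plus the last two code points of each plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def is_unassigned(cp: int) -> bool:
    """True if cp has General_Category Cn and is not a noncharacter,
    according to the Unicode data bundled with this Python build."""
    return unicodedata.category(chr(cp)) == "Cn" and not is_noncharacter(cp)


def first_unassigned(text: str):
    """Return (index, code point) of the first unassigned code point, or None."""
    for i, ch in enumerate(text):
        if is_unassigned(ord(ch)):
            return i, ord(ch)
    return None


# The answer is only meaningful relative to this version of the Unicode data.
print("Unicode data version:", unicodedata.unidata_version)
print(first_unassigned("abc"))
# U+40000 lies in a plane with no assignments in current Unicode versions.
print(first_unassigned("ab\U00040000"))
```

A validator built this way has to be rerun, or its table regenerated,
for each Unicode release it claims to support.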

That set of issues associated with future code point assignments puts
the statement at the end of the introduction to Section 2 of the I-D:

        "Since unassigned code points regularly become assigned when
        new characters are added to Unicode, it is usually not a good
        practice to specify that unassigned code points should be
        avoided."

in direct contradiction to the IDNA2008 and PRECIS specs, which
consider the use of unassigned code points to be very bad practice.

So while I think some variation on the sentence suggested by Tim
would be useful, it may not be sufficient.  In particular, the last
paragraph of the introductory part of Section 1, which was intended
to deal with uses of character strings in context, might reasonably
be altered to include an explicit statement about confusable
characters.  And the statements that unassigned code points should
be allowed (as above) and that Private Use ones should too (again,
absent context that specifically identifies the private use and its
conventions, there is no way to know what those code points
represent) perhaps should be reviewed once again.
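
As a small illustration of the Private Use point, the following Python
sketch (the function name is hypothetical, and nothing here is taken
from the I-D) locates Private Use code points in a string by their
General_Category, Co.  It can say where they are; it cannot say what
they mean, which is the problem described above.

```python
import unicodedata


def private_use_positions(text: str):
    """Return (index, code point) for every Private Use (Co) code point in text."""
    return [
        (i, ord(ch))
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Co"
    ]


# U+E000 is the first code point of the BMP Private Use Area.
for index, cp in private_use_positions("a\ue000b"):
    print(f"index {index}: U+{cp:04X} is Private Use; meaning depends on private agreement")
```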

   john

-- 
last-call mailing list -- last-call@xxxxxxxx
To unsubscribe send an email to last-call-leave@xxxxxxxx
