[Last-Call] Re: [Roll] Iotdir telechat review of draft-ietf-roll-rnfd-05

Konrad Iwanicki <iwanicki@xxxxxxxxxxxx> · Sun, 16 Feb 2025 16:18:21 +0100

Dear Peter,

Thank you for your review. Please find my answers below.

On 14/02/2025 12:09, Peter Van der Stok via Datatracker wrote:
Reviewer: Peter Van der Stok
Review result: Not Ready

This will be a short high-level review, because I am really confused by the
contents; and also my knowledge of RPL is ancient. Below the sources of my
misunderstanding.

I will try to clarify each issue inline, so apologies if the formatting 
of your text got disrupted.

The network configuration is not clear. The LBR, DODAG root,
usually has at least two interfaces, one wireless LLN interface and a possibly
wired, internet link. Only the LLN seems to be discussed, monitoring the LBR
via the Internet link looks more efficient to me than the proposal of this
document.

Your understanding of the scope is correct: the document, like the 
original RPL, focuses on the LLN part, that is, the resource-constrained 
nodes and the LBR, and not the network on the other side of the LBR. For 
this reason, we are concerned with the LLN interface of the LBR and its 
failures. Note, however, that in this view, any failure of an LBR that 
leads to its LLN interface being down from the constrained nodes' 
perspective (e.g., a "total" crash of the entire LBR device due to a 
power outage, mentioned in Section 1.3) also falls within our scope of 
interest.

The assumed configuration of the network is thus simply the same as in 
the documents regarding RPL, notably the original RFC 6550. This fact is 
stated explicitly in Section 2 (Terminology). In particular, the term 
"DODAG root" describes the node corresponding to the LLN interface of 
the LBR.

Monitoring the LBR via the Internet link is an approach orthogonal to 
the RNFD algorithm proposed in the document and does not solve the 
problems that RNFD solves. In particular, as explained in Section 1, 
"detecting" a failure of a DODAG root by RNFD implies that each LLN node 
considers the root as down; otherwise, the emergent behaviors described 
in Section 1.1 occur. In contrast, with monitoring the LBR via the 
Internet link, one still needs a solution for communicating the 
observations of the Internet node that detects a problem with the LBR to 
every LLN node. This may be problematic, especially given that a failure 
of an LBR may effectively disconnect the LLN nodes from the Internet.

The fault model of the LLN is unclear. Is it a crash of the LBR CPU
or a failure of the wireless interface? The failure of the wireless interface
can be electronic, but can also be caused by reflections, etc.

RNFD targets any failures that make an LBR's LLN interface appear as 
being down from the perspective of the LLN nodes, and hence prevent the 
LBR from acting as a DODAG root. In particular, communication failures 
that have similar effects will be handled by the algorithm.

The sentinels,
seemingly all the children of the LBR, are supposed to come to an agreement.
The subject of agreement between nodes in a failing communication environment
is discussed in the "Byzantine Generals algorithm", and its many derivatives.
However, no mention is made of this problem and how its known solutions compare
to the solution used here.

One of the components of the RNFD algorithm is indeed reaching consensus 
between nodes on whether the DODAG root is down. However, the Byzantine 
generals algorithm solves a different consensus problem, in which some 
of the nodes participating in the algorithm can be malicious and can 
collude. In RNFD, and for that matter RPL as well, any malicious nodes 
could disrupt the entire system in many (sometimes subtle) ways. RPL 
(and hence RNFD) thus assumes collaborating nodes.

Nevertheless, let us consider using the Byzantine generals 
algorithm as a replacement for the present consensus solution in RNFD. To 
combat malicious nodes, the algorithm would require multiple rounds of 
direct peer-to-peer communication between each pair of Sentinels. Such 
communication would be hard to ensure in a multi-hop LLN and would be 
overkill given the resource constraints of the LLN.
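
To give a rough feel for that cost, here is a back-of-the-envelope 
sketch. It is only an illustration under simplifying assumptions that 
are not taken from the document: a "round" means every Sentinel sending 
one message to every other Sentinel, f + 1 rounds are used to tolerate 
f malicious nodes, and more than 3f Sentinels are needed. The function 
name and the example numbers are hypothetical, and the classical 
oral-messages algorithm would send considerably more messages than this 
estimate.

```python
def pairwise_message_estimate(num_sentinels: int, max_faulty: int) -> int:
    """Rough count of messages for all-to-all rounds (illustrative only).

    Assumptions (not from the draft): more than 3 * max_faulty Sentinels,
    max_faulty + 1 rounds, one message per ordered pair in each round.
    """
    if max_faulty < 0 or num_sentinels <= 3 * max_faulty:
        raise ValueError("need more than 3 * max_faulty Sentinels")
    rounds = max_faulty + 1
    messages_per_round = num_sentinels * (num_sentinels - 1)
    return rounds * messages_per_round


if __name__ == "__main__":
    # Hypothetical example: 10 Sentinels, up to 3 of them malicious.
    print(pairwise_message_estimate(10, 3))
```

Even under these optimistic assumptions, the traffic grows quadratically 
with the number of Sentinels, and each of those messages may have to 
traverse multiple hops.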

In general, there is a plethora of consensus algorithms crafted for 
specific problem variants. The one in the RNFD algorithm is well suited 
for the problem that RNFD aims to solve, the constraints it faces in the 
process, and the operation of RPL for which RNFD is designed.

The document claims many improvements in detecting
LBR failure over the already existing techniques in the RPL network. However,
no numbers are cited as function of the network configuration, for example: a
small network with routes of only two hops, or a large network with many hop
routes.

For numbers, please refer to [Iwanicki16], which in the document is the 
second informative reference and is cited in Section 1.1.

The document seems to claim that removing of routes and reconstructing
routes is no longer needed. I don't understand from the text how that is
possible.

Given how general this claim is, I believe it is not stated in the 
document. I am guessing that you refer to Section 1, which describes the 
problem that the RNFD algorithm addresses.

What the section claims is that when the DODAG root is down, 
reconstructing upward routes in the DODAG by RPL does not make much 
sense: no physical path that such a route could use exists, simply because 
the last (i.e., destination) node, which is the root, is nonexistent. 
Consequently, any attempt to reconstruct the routes is bound to fail 
anyway and only causes unnecessary work and control traffic.

Removing the routes, in turn, is the right thing to do. However, achieving this by 
repeated failed attempts to reconstruct them by all nodes, as is done by 
RPL without RNFD, triggers various undesirable emergent behaviors and is 
suboptimal, not to mention that some implementations have problems doing 
this correctly.

One thing strikes me as missing, the standardization of the traffic
patterns provoked by RNFD. It is recommended that "care is taken". That means
ending up with as many traffic patterns as there are manufacturers, possibly
leading to implementations that hinder each other instead of collaborating. I
would expect a number of sending delays and counters with standardized values.

Again, I am guessing that you are referring to the following sentence 
from Section 5.2 (Detecting and Verifying Problems with the DODAG Root): 
"Care SHOULD be taken not to overload the DODAG root with traffic due to 
simultaneous probes, for instance, random backoffs can be employed to 
this end."

To give more context, this part of the section describes how a Sentinel 
can confirm its observation that the DODAG root is down, for example, by 
sending IPv6 ping messages. This section was deliberately left without 
specific values (or even ranges) of timeouts and retry counts for such 
probes, because the performance of RPL with respect to those two aspects 
is heavily impacted by the underlying L2: for one L2 implementation, 
a timeout of a few tens of milliseconds with multiple retries would be 
sufficient, whereas for others (e.g., with radio duty cycling), the 
timeout can be tens of seconds with no retries. Put differently, the 
choice was among the following options:

1. forcing implementers of the document to use configurations that are 
clearly suboptimal in their particular protocol stacks, or

2. providing in the document broad ranges that virtually do not 
standardize anything (notably with respect to the resulting traffic you 
mentioned), or

3. leaving the issue open but emphasizing to the implementers its 
potential performance impact.

Of those, option 3 was selected and the entire verification step was 
made optional. The "Care SHOULD be taken" is precisely this emphasis. 
Perhaps it could be formulated differently.
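
For illustration only, the random backoff mentioned in the quoted 
sentence could be sketched as below. This is not text from the document 
and not a recommended configuration: the function name, its parameters, 
and the uniform backoff distribution are assumptions, and `send_probe` 
is a hypothetical callback standing in for whatever probe (e.g., an IPv6 
ping to the DODAG root) a given stack provides.

```python
import random
import time
from typing import Callable, Optional


def probe_root_with_backoff(
    send_probe: Callable[[], bool],
    max_backoff_s: float,
    retries: int,
    sleep: Callable[[float], None] = time.sleep,
    rng: Optional[random.Random] = None,
) -> bool:
    """Probe the DODAG root, waiting a random time before each attempt.

    send_probe is a hypothetical callback returning True if the root
    answered. max_backoff_s and retries are deliberately left to the
    caller, since suitable values depend on the underlying L2.
    """
    rng = rng or random.Random()
    for _ in range(retries + 1):
        # Random delay so that Sentinels do not probe the root simultaneously.
        sleep(rng.uniform(0.0, max_backoff_s))
        if send_probe():
            return True  # root responded, so it is not considered down
    return False  # no response to any probe
```

With an always-on L2, one might call it with a backoff of a few tens of 
milliseconds and several retries; with radio duty cycling, with tens of 
seconds and no retries.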

I hope that the answers above clarify at least some of your concerns. If 
you have further suggestions where and how the text of the document 
could be improved, I would be glad to hear them.

Thank you again for your time!

Best regards,
--
- Konrad Iwanicki.
