Dear Peter,

Thank you for your review. Please find my answers below.

On 14/02/2025 12:09, Peter Van der Stok via Datatracker wrote:
Reviewer: Peter Van der Stok
Review result: Not Ready

This will be a short high-level review, because I am really confused by the contents; and also my knowledge of RPL is ancient. Below the sources of my misunderstanding.
I will try to clarify each issue inline, so apologies if the formatting of your text got disrupted.
The network configuration is not clear. The LBR, DODAG root, usually has at least two interfaces, one wireless LLN interface and a possibly wired, internet link. Only the LLN seems to be discussed, monitoring the LBR via the Internet link looks more efficient to me than the proposal of this document.
Your understanding of the scope is correct: the document, like the original RPL, focuses on the LLN part, that is, the resource-constrained nodes and the LBR, and not the network on the other side of the LBR. For this reason, we are concerned with the LLN interface of the LBR and its failures. Note, however, that in this view, any failure of an LBR that leads to its LLN interface being down from the constrained nodes' perspective (e.g., a "total" crash of the entire LBR device due to a power outage, mentioned in Section 1.3) also falls within our scope of interest.
The assumed configuration of the network is thus simply the same as in the documents regarding RPL, notably the original RFC 6550. This fact is stated explicitly in Section 2 (Terminology). In particular, the term "DODAG root" describes the node corresponding to the LLN interface of the LBR.
Monitoring the LBR via the Internet link is an approach orthogonal to the RNFD algorithm proposed in the document and does not solve the problems that RNFD solves. In particular, as explained in Section 1, "detecting" a failure of a DODAG root by RNFD implies that each LLN node considers the root as down; otherwise, the emergent behaviors described in Section 1.1 occur. In contrast, when the LBR is monitored via the Internet link, one still needs a way to communicate the observations of the Internet node that detects a problem with the LBR to every LLN node. This may be problematic, especially given that a failure of an LBR may effectively disconnect the LLN nodes from the Internet.
The fault model of the LLN is unclear. Is it a crash of the LBR CPU or a failure of the wireless interface? The failure of the wireless interface can be electronic, but can also be caused by reflections, etc.
RNFD targets any failures that make an LBR's LLN interface appear as being down from the perspective of the LLN nodes, and hence prevent the LBR from acting as a DODAG root. In particular, communication failures that have similar effects will be handled by the algorithm.
The sentinels, seemingly all the children of the LBR, are supposed to come to an agreement. The subject of agreement between nodes in a failing communication environment is discussed in the "Byzantine Generals algorithm", and its many derivatives. However, no mention is made of this problem and how its known solutions compare to the solution used here.
One of the components of the RNFD algorithm is indeed reaching consensus between nodes on whether the DODAG root is down. However, the Byzantine generals algorithm solves a different consensus problem, in which some of the nodes participating in the algorithm can be malicious and can collude. In RNFD, and for that matter in RPL as well, any malicious node could disrupt the entire system in many (sometimes subtle) ways. RPL (and hence RNFD) thus assumes collaborating nodes.
Nevertheless, let us consider using the Byzantine generals algorithm as a replacement for the present consensus solution in RNFD. To combat malicious nodes, the algorithm would require multiple rounds of direct peer-to-peer communication between each pair of Sentinels. Such communication would be hard to ensure in a multi-hop LLN and would be overkill given the resource constraints of the LLN.
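As a rough illustration of that cost (a back-of-the-envelope sketch, not an analysis taken from the document): if every Sentinel had to exchange a message with every other Sentinel in each round, a single round would already take n(n-1) unicast messages for n Sentinels, and each message may traverse several hops in the LLN. The function name, the number of rounds, and the average hop count below are hypothetical parameters chosen only for illustration.

```python
def pairwise_transmissions(num_sentinels: int, rounds: int, avg_hops: float) -> float:
    """Estimate link-layer transmissions for all-to-all message exchanges.

    Assumption (hypothetical, for illustration only): in every round each
    Sentinel sends one unicast message to every other Sentinel, and each
    message needs `avg_hops` hop-by-hop transmissions to be delivered.
    """
    if num_sentinels < 0 or rounds < 0 or avg_hops < 0:
        raise ValueError("all arguments must be non-negative")
    # Every ordered pair of distinct Sentinels exchanges one message per round.
    messages_per_round = num_sentinels * (num_sentinels - 1)
    return messages_per_round * rounds * avg_hops
```

For example, under these assumptions, 10 Sentinels running 2 rounds over paths of 3 hops on average would need 10 * 9 * 2 * 3 = 540 transmissions, before counting any retransmissions.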
In general, there is a plethora of consensus algorithms crafted for specific problem variants. The one in the RNFD algorithm is well suited for the problem that RNFD aims to solve, the constraints it faces in the process, and the operation of RPL for which RNFD is designed.
The document claims many improvements in detecting LBR failure over the already existing techniques in the RPL network. However, no numbers are cited as function of the network configuration, for example: a small network with routes of only two hops, or a large network with many hop routes.
For numbers, please refer to [Iwanicki16], which in the document is the second informative reference and is cited in Section 1.1.
The document seems to claim that removing of routes and reconstructing routes is no longer needed. I don't understand from the text how that is possible.
Given how general this claim is, I believe it is not stated in the document. I am guessing that you refer to Section 1, which describes the problem that the RNFD algorithm addresses.
What the section claims is that when the DODAG root is down, reconstructing upward routes in the DODAG by RPL does not make much sense: no physical path that such a route could employ exists, simply because the last (i.e., destination) node, which is the root, is nonexistent. Consequently, any attempt to reconstruct the routes is bound to fail anyway and only causes unnecessary work and control traffic.
Removing the routes is in turn the way to go. However, achieving this by repeated failed attempts to reconstruct them by all nodes, as is done by RPL without RNFD, triggers various undesirable emergent behaviors and is suboptimal, not to mention that some implementations have problems doing this correctly.
One thing strikes me as missing, the standardization of the traffic patterns provoked by RNFD. It is recommended that "care is taken". That means ending up with as many traffic patterns as there are manufacturers, possibly leading to implementations that hinder each other instead of collaborating. I would expect a number of sending delays and counters with standardized values.
Again, I am guessing that you are referring to the following sentence from Section 5.2 (Detecting and Verifying Problems with the DODAG Root): "Care SHOULD be taken not to overload the DODAG root with traffic due to simultaneous probes, for instance, random backoffs can be employed to this end."
To give more context, this part of the section describes how a Sentinel can confirm its observation that the DODAG root is down, for example, by sending IPv6 ping messages. This section was deliberately left without specific values (or even ranges) of timeouts and retry counts for such probes, because the performance of RPL with respect to those two aspects is heavily impacted by the underlying L2: for one implementation of L2, a timeout of a few tens of milliseconds with multiple retries would be sufficient; for others (e.g., with radio duty cycling), the timeouts can be tens of seconds with no retries. Put differently, a choice was considered among the following options:
1. forcing implementers of the document to use configurations that are clearly suboptimal in their particular protocol stacks, or
2. providing in the document broad ranges that virtually do not standardize anything (notably with respect to the resulting traffic you mentioned), or
3. leaving the issue open but emphasizing to the implementers its potential performance impact.
Of those, option 3 was selected and the entire verification step was made optional. The "Care SHOULD be taken" is precisely this emphasis. Perhaps it could be formulated differently.
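For illustration only, the random backoff mentioned in the quoted sentence could look like the minimal sketch below. It is not part of the document: the function names and the `max_delay_s` parameter are hypothetical, and, as explained above, a suitable value for the backoff window depends on the underlying L2.

```python
import random
from typing import Dict, Iterable, Optional


def probe_backoff(max_delay_s: float, rng: Optional[random.Random] = None) -> float:
    """Return a random delay in [0, max_delay_s] to wait before probing the root.

    `max_delay_s` is a hypothetical, deployment-specific window; it is not a
    value taken from the document.
    """
    if max_delay_s < 0:
        raise ValueError("max_delay_s must be non-negative")
    rng = rng or random.Random()
    return rng.uniform(0.0, max_delay_s)


def schedule_probes(sentinel_ids: Iterable[str], max_delay_s: float,
                    rng: Optional[random.Random] = None) -> Dict[str, float]:
    """Simulate each Sentinel independently drawing its own probe delay.

    In a real network every Sentinel would call probe_backoff() locally;
    this helper only shows how the probes get spread over the window.
    """
    rng = rng or random.Random()
    return {sid: probe_backoff(max_delay_s, rng) for sid in sentinel_ids}
```

With independently drawn delays, Sentinels that notice a problem at the same moment are unlikely to probe the DODAG root simultaneously, which is the intent behind "Care SHOULD be taken".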
I hope that the answers above clarify at least some of your concerns. If you have further suggestions where and how the text of the document could be improved, I would be glad to hear them.
Thank you again for your time!

Best regards,
--
- Konrad Iwanicki.