TIPC Working Group                                              J. Maloy
Work-in-Progress                                                Ericsson
                                                             A. Stephens
                                                              Wind River
                                                        October 21, 2010

         TIPC: Transparent Inter Process Communication Protocol

Status of this Memo

This document is a "work-in-progress" edition of the specification for version 2 of the TIPC protocol, and has NOT yet been approved by the TIPC Working Group. Chapters 7, 8, 9, 10, 11, and 12 have been recently updated and are believed to be accurate; earlier chapters are still in the process of being updated.

This document reflects the capabilities of TIPC 2.0 as implemented by the Open Source TIPC project (see http://tipc.sf.net).

Copyright Notice

Copyright (C) TIPC Working Group (2010).

This document and the information contained herein are provided on an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS OR IS SPONSORED BY (IF ANY), AND THE MULTICORE ASSOCIATION DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Conventions used in this document

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119].

Abstract

This document describes TIPC, a protocol specially designed for efficient communication within clusters of loosely coupled nodes. TIPC is a reliable transport protocol, typically operating on top of L2 packet networks. It should also work well on higher-level protocols, such as DCCP, TCP, or SCTP.

TIPC offers the following services to applications:

Maloy & Stephens [Page 1] TIPC October 2010

   o  A functional addressing scheme providing full addressing transparency over the whole cluster.
   o  Address scoping that can optionally restrict communication to a designated subset of the network.
   o  A topology subscription service providing up-to-date information about functional and physical network topology.
   o  A lightweight connection service that reports errors or destination unreachability within a fraction of a second.
   o  A reliable datagram service for connectionless communication.
   o  A reliable multicast service, based on functional addressing, that uses the underlying network multicast service when possible.
   o  Acknowledged, loss-free, error-free, non-duplicated transfer of application data, in both connection-based and connectionless modes.
   o  Data fragmentation conforming to discovered carrier MTU size.
   o  Bundling of multiple messages into a single packet to minimize the impact of congestion when messages cannot be sent immediately.
   o  Configurable congestion control at the bearer, link, and connection levels.
   o  Transparent, link-level load sharing and redundancy, through support of heterogeneous multi-homing.
   o  A slim, non-layered protocol header allowing efficient protocol implementations.

Apart from common process-to-process communication, the design of TIPC permits the exchange of messages process-to-kernel and kernel-to-kernel, with full addressing and interface transparency.

Table of Contents

   1.  Introduction
     1.1.  Motivation
       1.1.1.  Existing Protocols
       1.1.2.  Assumptions
     1.2.  Architectural View
     1.3.  Functional View
       1.3.1.  API Adapters
       1.3.2.  Address Subscription
       1.3.3.  Address Distribution
       1.3.4.  Address Translation
       1.3.5.  Multicast
       1.3.6.  Connection Supervision
       1.3.7.  Routing and Link Selection
       1.3.8.  Neighbour Detection
       1.3.9.  Link Establishment/Supervision
       1.3.10. Link Failover
       1.3.11. Fragmentation/Defragmentation
       1.3.12. Bundling
       1.3.13. Congestion Control
       1.3.14. Sequence and Retransmission Control
       1.3.15. Bearer Layer
     1.4.  Fault Handling
       1.4.1.  Fault Avoidance
       1.4.2.  Fault Detection
       1.4.3.  Fault Recovery
       1.4.4.  Overload Protection
     1.5.  Terminology
     1.6.  Abbreviations
   2.  TIPC Features
     2.1.  Network Topology
       2.1.1.  Network
       2.1.2.  Zone
       2.1.3.  Cluster
       2.1.4.  Node
     2.2.  Links
     2.3.  Ports
     2.4.  Messages
       2.4.1.  Taxonomy
       2.4.2.  Format
     2.5.  Addressing
       2.5.1.  Location Transparency
       2.5.2.  Network Address
       2.5.3.  Port Identity
       2.5.4.  Port Name
       2.5.5.  Port Name Sequence
       2.5.6.  Multicast Addressing
       2.5.7.  Publishing Scope
       2.5.8.  Lookup Policies
   3.  Port-Based Communication
     3.1.  Payload Messages
       3.1.1.  Payload Message Types
       3.1.2.  Payload Message Format
       3.1.3.  Payload Message Delivery
     3.2.  Connectionless Communication
     3.3.  Connection-based Communication
       3.3.1.  Connection Setup
       3.3.2.  Connection Shutdown
       3.3.3.  Connection Abortion
       3.3.4.  Connection Supervision
       3.3.5.  Flow Control
       3.3.6.  Sequentiality Check
     3.4.  Multicast Communication
   4.  Name Table
     4.1.  Distributed Name Table Protocol Overview
     4.2.  Name Distributor Message Processing
     4.3.  Name Distributor Message Format
     4.4.  Name Publication Descriptor Format
   5.  Links
     5.1.  TIPC Internal Header
       5.1.1.  Internal Message Header Format
       5.1.2.  Internal Message Header Fields Description
     5.2.  Link Creation
       5.2.1.  Intra-Cluster Link Setup
       5.2.2.  Inter-Cluster Link Setup
     5.3.  Link Activation
     5.4.  Link MTU Negotiation
     5.5.  Link Continuity Check
     5.6.  Sequence Control and Retransmission
     5.7.  Message Bundling
     5.8.  Message Fragmentation
     5.9.  Link Congestion Control
     5.10. Bearer Congestion Control
     5.11. Link Load Sharing vs Active/Standby
     5.12. Link Changeover
     5.13. Link Deletion
     5.14. Message Bundler Protocol
     5.15. Link State Maintenance Protocol
     5.16. Link Changeover Protocol
     5.17. Message Fragmentation Protocol
   6.  Broadcast Link
     6.1.  Broadcast Protocol
     6.2.  Piggybacked Acknowledge
     6.3.  Coordinated Acknowledge Interval
     6.4.  Coordinated Broadcast of Negative Acknowledges
     6.5.  Replicated Delivery
     6.6.  Congestion Control
   7.  Neighbor Detection
     7.1.  Neighbor Detection Protocol Overview
     7.2.  Link Request Message Processing
     7.3.  Link Response Message Processing
     7.4.  Link Configuration Message Format
   8.  Topology Service
     8.1.  Topology Service Semantics
     8.2.  Topology Service Protocol
       8.2.1.  Subscription Message Format
       8.2.2.  Event Message Format
     8.3.  Monitoring Functional Topology
     8.4.  Monitoring Physical Topology
   9.  Configuration Service
     9.1.  Configuration Service Semantics
     9.2.  Configuration Service Protocol
       9.2.1.  Command Message Format
       9.2.2.  Command Argument TLV Descriptions
     9.3.  Command Message Descriptions
   10. Security Considerations
   11. IANA Considerations
   12. References
   Authors' Addresses

1. Introduction

This section explains the rationale behind the development of the Transparent Inter Process Communication (TIPC) protocol. It also gives a brief introduction to the services provided by this protocol, as well as the basic concepts needed to understand the further description of the protocol in this document.

1.1. Motivation

There are no standard protocols available today that fully satisfy the special needs of application programs working within highly available, dynamic cluster environments. Clusters may grow or shrink by orders of magnitude; member nodes may crash and restart, routers may fail and be replaced, and functionality may be moved around for load-balancing reasons.
All this must be handled without significant disturbance of the service(s) offered by the cluster.

To minimize the effort required of application programmers to deal with such situations, and to maximize the chance that they are handled in a correct and optimal way, the cluster internal communication service should provide special support helping the applications to adapt to changes in the cluster. It should also, when possible, leverage the special conditions present within cluster environments to present a more efficient and more fault-tolerant communication service than more general protocols are capable of. This is the purpose of TIPC.

Version 1 of TIPC has been widely deployed in customer networks. This document describes version 2 of TIPC. An open source implementation of version 2 is available at [TIPC].

1.1.1. Existing Protocols

TCP [RFC793] has the advantage of being ubiquitous, stable, and well known by most programmers. Its most significant shortcomings in a real-time cluster environment are the following:

   o  It lacks any notion of functional addressing and addressing transparency. Mechanisms exist (DNS, CORBA Naming Service) for transparent and dynamic lookup of the correct IP address of a destination, but these are in general too static and expensive to use.
   o  TCP has non-optimal performance, especially for intra-node communication and for short messages in general. For intra-node communication there are other and more efficient mechanisms available, at least on Unix, but then the location of the destination process has to be assumed, and cannot be changed. It is desirable to have a protocol working efficiently for both intra-node and inter-node messaging, without forcing the user to distinguish between these cases in application code.
   o  The rather heavy connection setup/shutdown scheme of TCP is a disadvantage in a dynamic environment.
      The minimum number of packets exchanged for even the shortest TCP transaction is nine (SYN, SYNACK etc.), while with TIPC this can be reduced to two, or even to one if connectionless mode is used.
   o  The connection-oriented nature of TCP makes it impossible to support true multicast.

SCTP [RFC2960] is message oriented; it provides some level of user connection supervision, message bundling, loss-free changeover, and a few more features that may make it more suitable than TCP as an intra-cluster protocol. Otherwise, it has all the drawbacks of TCP already listed above.

Apart from these weaknesses, neither TCP nor SCTP provides any topology information/subscription service, something that has proven very useful both for applications and for management functionality operating within cluster environments.

Both TCP and SCTP are general purpose protocols, in the sense that they can be used safely over the Internet as well as within a closed cluster. This apparent advantage is also their major weakness: they require functionality and header space to deal with situations that will never happen, or only infrequently, within clusters.

1.1.2. Assumptions

TIPC has been designed based on the following assumptions, empirically known to be valid within most clusters.

   o  Most messages cross only one direct hop.
   o  Transfer time for most messages is short.
   o  Most messages are passed over intra-cluster connections.
   o  Packet loss rate is normally low; retransmission is infrequent.
   o  Available bandwidth and memory volume are normally high.
   o  For all relevant bearers packets are check-summed by hardware.
   o  The number of inter-communicating nodes is relatively static and limited at any moment in time.
   o  Security is a less crucial issue in closed clusters than on the Internet.
These assumptions allow TIPC to use a simple, traffic-driven, fixed-size sliding window protocol located at the signalling link level, rather than a timer-driven transport level protocol. This in turn leads to other benefits, such as earlier release of transmission buffers, earlier packet loss detection and retransmission, and earlier detection of node unavailability, to mention but a few. Of course, situations with long transfer delays, high loss rates, long messages, security issues, etc. must also be dealt with, but as exceptions rather than as the general rule.

1.2. Architectural View

TIPC should be seen as a layer between an application using TIPC and a packet transport service such as Ethernet, ATM, DCCP, UDP, TCP, or SCTP. The latter are denoted by the generic term "bearer service", or simply "bearer", throughout this document.

TIPC provides reliable transfer of user messages between TIPC users, or more specifically between two TIPC ports, which are the endpoints of all TIPC communication. A TIPC user normally means a user-level process, but may also be a kernel-level function or a driver.

In standard terminology, TIPC spans the transport, network, and signalling link layers, although this does not prevent it from using another transport-level protocol as bearer, so that, e.g., an SCTP association may serve as bearer for a TIPC signalling link.

        Node A                                              Node B
    -------------                                      -------------
   |    TIPC     |                                    |    TIPC     |
   | Application |                                    | Application |
   |-------------|                                    |-------------|
   |             |                                    |             |
   |    TIPC     |TIPC address            TIPC address|    TIPC     |
   |             |                                    |             |
   |-------------|                                    |-------------|
   |  L2 Bearer  |Bearer address   \/   Bearer address|  L2 Bearer  |
   |   Service   |                 /\                 |   Service   |
    -------------                                      -------------
          |                                                  |
          |---------------- Bearer Transport ----------------|

Figure 1: Architectural view of TIPC

1.3. Functional View

Functionally TIPC can be described as consisting of several layers performing different tasks, as shown in Figure 2. It must be emphasized that this layering reflects a functional model, not the way TIPC should be (or actually is) implemented.

                       TIPC Application
   ----------------------------------------------------------
       -------------      -----------      -------------
      |   Socket    |    |   Port    |    |  Other API  |
      |   Adapter   |    |  Adapter  |    |  Adapters   |
       -------------      -----------      -------------
   =========================================================
    ----------------------------
   | Address      | Address     |
   | Subscription | Resolution  |
   |--------------+----------------------------------------
   | Address Table| Connection Supervision                 |
   | Distribution | Routing/Link Selection                 |
    -----------------------------------------------------+-
   |           | Neighbour Detection                 |   |     Node
   | Multicast | Link Establish/Supervision          |   ---------->
   |           | Link Failover                       |       Internal
    -----------------------------------------------+-
   |      Fragmentation/Defragmentation      |     |
   |                                         |     |
    -----------------------------------------      |
   |                Bundling                 |     |
   |           Congestion Control            |     |
    -----------------------------------+-----      |
   |  Sequence/Retransmission   |      |           |
   |          Control           |      |           |
    -------+--------------+-----       |           |
   ========|==============|============|===========|========
           |              |            |           |
      -----V-----    -----V----    ----V-----    --V-------
    -| Ethernet  |  |   DCCP   |  |   SCTP   |  | Mirrored |
   | |           |  |          |  |          |  |  Memory  |
   |  ---------+-    ----------    ----------    ----------
    -----------+

Figure 2: Functional view of TIPC

1.3.1. API Adapters

TIPC makes no assumptions about which APIs should be used, except that they must allow access to the TIPC services. It is possible to provide all functionality via a standard socket interface, an asynchronous port API, and any other form of dedicated interface that can be justified. In these layers there is also support for transport-level congestion and overload protection control.
1.3.2. Address Subscription

The service "Topology Information and Subscription" provides the ability to interrogate and, if necessary, subscribe to the availability of a functional address, and thereby determine the availability of an associated physical or virtual resource. This can be used by a distributed application to synchronize its startup, and may even serve as a simple, distributed event channel if used with care.

1.3.3. Address Distribution

Functional addresses and their associated physical addresses must be equally available within the whole cluster. For performance and fault tolerance reasons it is not acceptable to keep the necessary address tables in one node; instead, TIPC must ensure that they are distributed to all nodes in the cluster, and that they are kept consistent at all times. This is the task of the Address Distribution Service, also called Name Distribution Service.

1.3.4. Address Translation

The translation from a functional address to a physical address is performed on-the-fly during message sending by this functional layer. This step must use an efficient algorithm, and multiple translations of a functional address should be avoided where possible.

It is possible to bypass address translation altogether when sending messages if the sender is able to use a physical address as the destination address. For example, this can be done when a server responds to a connection setup request, or when communication between two applications occurs over an already established connection.

1.3.5. Multicast

This layer, supported by the underlying three layers, provides a reliable intra-cluster broadcast service, typically defined as a semi-static multicast group over the underlying bearer. It also provides many of the same capabilities as an ordinary unicast link, such as message fragmentation, message bundling, and congestion control.

1.3.6. Connection Supervision

Several mechanisms ensure immediate detection and reporting of connection failure.

1.3.7. Routing and Link Selection

This is the step of finding the correct destination node, and, if applicable, the correct next-hop router node, plus selecting the right link to use for reaching that node. If the destination node turns out to be the sending node itself, the rest of the stack is bypassed, and the message is delivered directly to the receiving port.

1.3.8. Neighbour Detection

When a node is started it must make the rest of the cluster aware of its existence, and itself learn the topology of the cluster. By default this is done by use of broadcast, but there are other methods available.

1.3.9. Link Establishment/Supervision

Once a neighbouring node has been detected on a bearer, a signalling link is established towards it. The functional state of that link has to be supervised continuously, and proper action taken if it fails.

1.3.10. Link Failover

TIPC on a node will establish one link per destination node and functional bearer instance, typically one per configured Ethernet interface. Normally these will run in parallel and share load equally, but special care has to be taken during the transition period when a link comes up or goes down, to ensure the guaranteed cardinality and sequentiality of the message delivery. This is done by this layer.

1.3.11. Fragmentation/Defragmentation

When necessary TIPC fragments and reassembles messages that cannot be contained within one MTU-size packet.

1.3.12. Bundling

Whenever there is some kind of congestion situation, i.e. when a bearer or a link cannot immediately send a packet as requested, TIPC starts to bundle messages into packets already waiting to be sent. When the congestion abates the waiting packets are sent immediately, and unbundled at the receiving node.

1.3.13. Congestion Control

When a bearer instance becomes congested, i.e. it is unable to accept more outgoing packets, all links on that bearer are marked as congested, and no further attempt is made to send messages over those links until the bearer opens up again for traffic. During this transition time messages are queued or bundled on the links, and then sent whenever the congestion has abated. A similar mechanism is used when the send window of a link becomes full, but affects only that particular link.

1.3.14. Sequence and Retransmission Control

This layer ensures the cardinality and sequentiality of packets over a link.

1.3.15. Bearer Layer

This layer adapts to some connectionless or connection-oriented transport service, providing the necessary information and services to enable the upper layers to perform their tasks.

1.4. Fault Handling

Most functions for improving system fault tolerance are described elsewhere, under the respective functions, but some aspects deserve being mentioned separately.

1.4.1. Fault Avoidance

Strict source address check
After the neighbour detection phase, a message arriving at a node must not only have a valid Previous Node address, but this address must also belong to one of the nodes known to have a direct link to the destination. The node may in practice be aware of at most a few hundred such nodes, while a network address is 32 bits long. The risk of accepting a garbled message having a valid address within that range, a sequence number that fits into the reception window, and otherwise valid header fields, is extremely small, no doubt less than one in several billion.

Sparse port address space
As an extra measure, TIPC uses a 32-bit pseudo-random number as the first part of a port identity. This gives an extra protection against corrupted messages, or against obsolete messages arriving at a node after long delays.
Such messages will not find any destination port, and TIPC will attempt to return them to the sender port. If there is no valid sender port, the message should be quietly discarded.

Name Table Keys
When the NAME TABLE is updated with a new publication, the publication is qualified with a Key field that is known only to the publishing port. This key must be presented and verified when the publication is withdrawn, in all instances of the name table. If the key does not match, the withdrawal is refused.

Link Selectors
Whenever a message/packet is sent or routed, the link used for the next-hop transport is always selected in a deterministic way, based on the sender port's random number. The risk of having packets arriving in disorder is hence non-existent for single-hop messages, and extremely low for multi-hop messages.

Repeated Name Lookups
If a lookup in the NAME TABLE has returned a port identity that later turns out to be invalid, TIPC performs up to 6 new lookups before giving up and rejecting the message.

Routing Counter
To eliminate the risk of having messages roaming around in the network, a routing counter is present in the TIPC header. This counter is updated for each inter-node hop and for each NAME TABLE lookup the message is subject to. If this counter reaches the upper limit, seven, the message is rejected back to the sender port.

1.4.2. Fault Detection

The mechanisms for fault detection are described in other sections, but some of them are briefly repeated here:

   o  Transport Level Sequence Number, to detect disordered multi-hop packets.
   o  Connection Supervision and Abortion mechanism.
   o  Link Supervision and Continuity control.

1.4.3. Fault Recovery

When a failure has been detected, several mechanisms are used to eliminate the impact of the problem, or when that is impossible, to help the application to recover from it:

Link Changeover
When a link fails, its traffic is directed over to the redundant link, if any, in such a way that message sequentiality and cardinality are preserved. This feature is described in Section 5.12.

Returning Messages to Sender
When no destination is found for a message, the first 1024 bytes of it are returned to the sender port, along with an appropriate error code. This helps the application to identify the exact instant of failure, and if possible, to find a new destination for the failed call. The complete list of error codes and their significance is described in Section 2.4.

1.4.4. Overload Protection

To overcome situations where the congestion/flow control mechanisms described earlier in this section are inadequate or insufficient, TIPC must provide two additional overload protection services:

Node Overload Protection
TIPC must maintain a global counter on each node, keeping track of the total number of pending, unhandled payload messages on the node. When this counter reaches a critical value, which should be configurable, TIPC must selectively reject new incoming messages. Which messages to reject must be based on the following criteria:

   *  Message importance. LOW_IMPORTANCE messages should be rejected at a lower threshold than MEDIUM_IMPORTANCE messages, which should be rejected before HIGH_IMPORTANCE messages. CRITICAL_IMPORTANCE messages should not be rejected at all.
   *  Message type. Connectionless messages should be rejected earlier than connection-oriented messages. Rejecting such messages normally means rejecting a service request from the beginning, causing less disturbance than interrupting a transaction already in progress, e.g. an ongoing phone call.
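
The node overload criteria above can be illustrated with a small sketch. The following Python fragment is an illustration under stated assumptions only, not part of the protocol: the base threshold of 1000 pending messages, the linear scaling per importance level, the 3/4 factor applied to connectionless messages, and all function names are hypothetical values chosen for the example. The text above only requires that the thresholds be ordered and that CRITICAL_IMPORTANCE messages not be rejected.

```python
from enum import IntEnum
from typing import Optional


class Importance(IntEnum):
    LOW = 0
    MEDIUM = 1
    HIGH = 2
    CRITICAL = 3


# Hypothetical, configurable base threshold (pending messages on the node).
BASE_THRESHOLD = 1000


def reject_threshold(importance: Importance, connectionless: bool,
                     base: int = BASE_THRESHOLD) -> Optional[int]:
    """Return the pending-message count at which a message of this kind
    is rejected, or None if it is never rejected."""
    if importance == Importance.CRITICAL:
        return None  # CRITICAL_IMPORTANCE should not be rejected at all
    # Assumed scaling: LOW < MEDIUM < HIGH.
    limit = base * (int(importance) + 1)
    if connectionless:
        # Assumed factor: connectionless messages are rejected earlier
        # than connection-oriented messages of the same importance.
        limit = limit * 3 // 4
    return limit


def should_reject(pending: int, importance: Importance,
                  connectionless: bool) -> bool:
    """Decide whether a new incoming message is rejected, given the
    node-global count of pending, unhandled payload messages."""
    limit = reject_threshold(importance, connectionless)
    return limit is not None and pending >= limit
```

With these assumed values, a node holding 800 pending messages would reject a connectionless LOW_IMPORTANCE message (threshold 750) but still accept a connection-oriented one (threshold 1000).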
Process Overload Protection
TIPC must maintain a counter for each process, or if this is impossible, for each port, keeping track of the total number of pending, unhandled payload messages on that process or port. When this counter reaches a critical value, which should be configurable, TIPC must selectively reject new incoming messages. Which messages to reject should be based on the same criteria as for the node overload protection mechanism, but all thresholds must be set significantly lower. Empirically, a ratio of 2:1 between the node-global thresholds and the port-local thresholds has worked well.

1.5. Terminology

This section defines terms whose meaning may otherwise be unclear or ambiguous.

Application: A user-written program that directly utilizes TIPC for communication.

Bearer: An instance of a physical or logical transport medium, such as Ethernet, ATM/AAL or DCCP, over which messages can be sent.

Broadcast: The sending of a message to all other nodes in the sender's cluster, each of which receives a copy of the message. Note that what is considered a broadcast from the TIPC viewpoint may be mapped onto a multicast at the bearer (Ethernet or DCCP) level.

Connection: A logical channel for passing messages between two ports. Once a connection is established no address need be indicated when sending a message from either of the endpoints.

Cluster: A collection of nodes that are directly interconnected (i.e. fully meshed). All nodes in a cluster have network addresses that differ only in their node identifier.

Domain: A subset of topologically related nodes in a TIPC network, normally designated by a network address. For example, <Z.C.N> designates a specific node, <Z.C.0> designates any node within the specified cluster, <Z.0.0> designates any node within the specified zone, and <0.0.0> designates any node within the network.

Internal Message: A message that is generated and consumed by an internal TIPC subsystem.
Link: A communication channel connecting two nodes, performing tasks such as message transfer, sequence ordering, retransmission, etc. A pair of nodes may be interconnected by one link on a single bearer, or by a pair of links on two bearers in either a load sharing or an active-plus-standby configuration.

Link Changeover: The act of moving all traffic from a failing link in a link pair to the remaining link, while retaining the original sequence order and cardinality of messages.

Link Endpoint: A communication endpoint, used in pairs by a link to send and receive TIPC messages between two nodes.

Location Transparency: The ability of an application within a cluster to communicate with another application without knowing the physical location of the latter. (This term is sometimes called "addressing transparency".)

Message: The fundamental unit of information exchanged between TIPC ports or between TIPC subsystems. Consists of a TIPC message header, followed by 0 to 66,000 bytes of data.

Message Bundling: The act of aggregating several messages into one packet (typically an Ethernet frame) to minimize the impact of congestion when messages cannot be sent immediately.

Message Fragmentation: The act of dividing a long message into several packets during transmission and later reassembling the fragments into the original message at the receiving end.

Multicast: The sending of a message to multiple TIPC ports, each of which receives a copy of the message.

Name: An alias for Port Name.

Name Sequence: An alias for Port Name Sequence.

Name Table: A TIPC-internal table existing on each node which keeps track of the mapping between port names and port identities.

Network: A collection of nodes that can communicate with one another via TIPC. The network may consist of a single node, a single cluster, a single zone, or a group of inter-connected zones.
Network Address: An integer that identifies a node, or set of nodes, within a TIPC network. It is a 32-bit integer, subdivided into three fields (8/12/12), representing a zone, cluster and node identifier, respectively; normally denoted as <Z.C.N>.

Network Identity: An integer that uniquely identifies a TIPC network. Used to keep traffic from different TIPC networks separated from each other when a common bearer is being used; for example, when multiple networks are running on a LAN in a lab environment.

Node: A computer within a TIPC network, uniquely identified by a network address.

Packet: The unit of data sent over a bearer. It may contain one or more complete TIPC messages, or a fragment of one TIPC message.

Payload Message: A message that carries application-related content between applications, or between an application and a service.

Port: A communication endpoint, capable of sending and receiving TIPC messages. Once created, a TIPC port persists until it is deleted by its owner, either explicitly or implicitly.

Port Identity: A physical address that uniquely identifies a TIPC port within a network; normally denoted as <Z.C.N:ref>. Once a port is deleted its identity will not be reissued for a very long time.

Port Name: A functional address that identifies a TIPC port as being capable of providing a specific service; normally denoted as {type,instance}. For load sharing and redundancy purposes several ports may bind to the same name; likewise, a single port may bind to multiple names if it provides multiple services.

Port Name Sequence: A mechanism for specifying a range of contiguous port names; normally denoted as {type,lower-instance,upper-instance}.

Service: A TIPC subsystem that communicates with applications or other TIPC subsystems using TIPC ports.

Scope: A shorthand form for expressing the domain that contains a node, as seen by that node; that is, own-node, own-cluster, or own-zone.
Unicast: The sending of a message to a single node in the network.

Zone: A "super-cluster" of clusters that are directly interconnected (i.e. fully meshed). All nodes in a zone have network addresses that share a common zone identifier.

1.6. Abbreviations

o API - Application Programming Interface

o MAC - Message Authentication Code [RFC2104]

o MTU - Maximum Transmission Unit

2. TIPC Features

2.1. Network Topology

From a TIPC viewpoint the network is organized in a five-layer structure:

 ------------------------------------------------------   ----------
|  Zone <1>                                            | | Zone <2> |
|   -----------------------    ----------------------  | |          |
|  | Cluster <1.1>         |  | Cluster <1.2>        | | |          |
|  |                       |  |                      | | |          |
|  |  -------              |  |  -------    -------  | | |          |
|  | |       |             |  | |       |  |       | | | |          |
|  | | Node  |             |  | | Node  +--+ Node  | | | |          |
|  | |<1.1.1>|    -------  |  | |<1.2.1>|  |<1.2.2>| | | |          |
|  | |       +---+       | |  | |       |  |       | | | |          |
|  |  ---+---    | Node  | |  |  --+----    -------  | | |          |
|  |     |       |<1.1.3>| |  |    |                 | | |          |
|  |  ---+---    |       | |  |  --+--               | | |          |
|  | |       +---+       | |  | |Seco.|              | | |          |
|  | | Node  |    -------  |  | |<1.2.|              | | |          |
|  | |<1.1.2>|             |  | |3333>|              | | |          |
|  | |       |             |  |  -----               | | |          |
|  |  -------              |  |                      | | |          |
|   -----------------------    ----------------------  | |          |
|                                                      | |          |
 ------------------------------------------------------   ----------

Figure 3: TIPC network topology

2.1.1. Network

The top level is the TIPC network as such. This is the ensemble of all zones interconnected via TIPC, i.e. the domain where any node can reach any other node by using a TIPC network address. The zones within such a network must be directly interconnected all-to-all via TIPC links, since there is no zone-level routing, i.e. a message cannot pass from one zone to another via an intermediate zone. Any number of links between two zones is permitted, and normally there will be more than one for redundancy reasons.
It is possible to create distinct, isolated networks, even on the same LAN, reusing the same network addresses, by assigning each network a Network Identity. This identity is not an address, and only serves the purpose of isolating networks from each other. Networks with different identities cannot communicate with each other via TIPC.

2.1.2. Zone

The next level in the hierarchy is the zone. This is the largest scope of location transparency within a network, i.e. the domain where a programmer does not need to worry about network addresses. Zone identities must be unique and within the numeric range [1,255]. The actual maximum number of zones within a network may be implementation dependent, and should be configurable. Description of how zones are interconnected inside a network falls outside the scope of this document.

2.1.3. Cluster

The third level is the cluster. Cluster identities must be unique and within the numeric range [1,4095]. The actual maximum number of clusters within a zone may be implementation dependent, and should be configurable. Description of how clusters are interconnected, inside a zone and across zones, falls outside the scope of this document.

2.1.4. Node

The fourth level is the individual node. Node identities must be unique and within the numeric range [1,2097]. The actual maximum number of nodes within a cluster may be implementation dependent, and should be configurable. Usage of the value range [2098,4095], made possible by the 12-bit format of the node identity, remains to be defined, and falls outside the scope of this document. Nodes within a cluster must be interconnected all-to-all. Description of how nodes are interconnected across clusters and zones falls outside the scope of this document.

2.2. Links

2.3. Ports

2.4. Messages

The "message" is the fundamental unit of information exchanged between TIPC ports or between TIPC subsystems.

2.4.1.
Taxonomy

TIPC messages fall into two main classes.

o A "payload message" carries application-specified content between applications, or between applications and TIPC services.

o An "internal message" carries TIPC-specified content between TIPC subsystems.

Messages are further categorized based on their use, as indicated below:

User  User Name            Purpose                         Class
----  ---------            -------                         -----
0     LOW_IMPORTANCE       Low Importance Data             payload
1     MEDIUM_IMPORTANCE    Medium Importance Data          payload
2     HIGH_IMPORTANCE      High Importance Data            payload
3     CRITICAL_IMPORTANCE  Critical Importance Data        payload
4     USER_TYPE_4          Reserved for future use         n/a
5     BCAST_PROTOCOL       Broadcast Link Protocol         internal
6     MSG_BUNDLER          Message Bundler Protocol        internal
7     LINK_PROTOCOL        Link State Protocol             internal
8     CONN_MANAGER         Connection Manager              internal
9     USER_TYPE_9          Reserved for future use         n/a
10    CHANGEOVER_PROTOCOL  Link Changeover Protocol        internal
11    NAME_DISTRIBUTOR     Name Table Update Protocol      internal
12    MSG_FRAGMENTER       Message Fragmentation Protocol  internal
13    LINK_CONFIG          Neighbor Detection Protocol     internal
14    USER_TYPE_14         Reserved for future use         n/a
15    USER_TYPE_15         Reserved for future use         n/a

2.4.2. Format

Every TIPC message consists of a message header and a data part.

The message header format is user-dependent, and ranges in length from 6 to 11 words. The content of each word in the header is stored as a single 32-bit integer coded in network byte order. A small number of fields are common to all message header formats; the remaining fields are either unique to a single user or utilized by multiple users.

The format of the data part of a message is user-dependent, and ranges in length from 0 to 66,000 bytes.

The message header format and data format for each message user are described in detail in the section describing the message's use.

2.5. Addressing

2.5.1.
Location Transparency

TIPC provides two functional address types, Port Name and Port Name Sequence, to support location transparency, and two physical address types, Network Address and Port Identity, to be used when physical location knowledge is necessary for the user.

2.5.2. Network Address

A physical entity within a network is identified internally by a TIPC Network Address. This address is a 32-bit integer, structured into three fields: zone (8 MSB), cluster (12 bits), and node (12 LSB). The address is only filled in with as much information as is relevant for the entity concerned, e.g. a zone may be identified as 0x03000000 (<3.0.0>), a cluster as 0x03001000 (<3.1.0>), and a node as 0x03001005 (<3.1.5>). Any of these formats is sufficient for the TIPC routing function to find a valid destination for a message.

2.5.3. Port Identity

This address is produced internally by TIPC when a port is created, and is only valid as long as that physical instance of the port exists. It consists of two 32-bit integers. The first one is a random number with a period of 2^31-1; the second one is a fully qualified network address with the internal format described earlier. A port identity may be used the same way as a port name, for connectionless communication or connection setup, as long as the user is aware of its limitations. The main advantage of using this address type over a port name is that it avoids the potentially expensive binding operation in the destination port, something which may be desirable for performance reasons.

2.5.4. Port Name

A port name is a persistent address typically used for connectionless communication and for setting up connections. Binding a port name to a port roughly corresponds to binding a socket to a port number in TCP, except that the port name is unique and has validity for the whole publishing scope indicated in the bind operation, not only for a specific node.
This means that no network address has to be given by the caller when setting up a connection, unless the caller explicitly wants to reach a certain node, cluster or zone. A port name consists of two 32-bit integers. The first integer is called the Name Type, and typically identifies a certain service type or functionality. The second integer is called the Name Instance, and is used as a key for accessing a certain instance of the requested service. The type/instance structure of a port name helps support both service partitioning and service load sharing.

When a port name is used as destination address for a message, it must be translated by TIPC to a port identity before it can reach its destination. This translation is performed on a node within the lookup scope indicated along with the port name.

2.5.5. Port Name Sequence

To give further support for service partitioning, TIPC also provides an address type called Port Name Sequence, or just Name Sequence. This is a three-integer structure defining a range of port names, i.e. a name type plus the lower and upper boundaries of the range. By allowing a port to bind to a sequence, instead of just an individual port name, it is possible to partition the service's range of responsibility into sub-ranges, without having to create a vast number of ports to do so.

There are very few limitations on how name sequences may be bound to ports. One may bind many different sequences, or many instances of the same sequence, to the same port, to different ports on the same node, or to different ports anywhere in the cluster or zone. The only restriction, in reality imposed by the implementation complexity it would involve, is that no partially overlapping sequences of the same name type may exist within the same publishing scope.
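The address formats of Sections 2.5.2 and 2.5.5 can be illustrated with a short Python sketch. It is not part of the protocol definition. The names pack_addr, unpack_addr, NameSeq and partially_overlaps are hypothetical, and the overlap test reflects one reading of the restriction above: any two ranges of the same name type that intersect without being identical (including one nested inside the other) are treated as partially overlapping.

```python
from typing import NamedTuple, Tuple


def pack_addr(zone: int, cluster: int, node: int) -> int:
    """Pack <Z.C.N> into a 32-bit network address (8/12/12 bits)."""
    if not (0 <= zone <= 0xFF and 0 <= cluster <= 0xFFF and 0 <= node <= 0xFFF):
        raise ValueError("zone, cluster or node out of range")
    return (zone << 24) | (cluster << 12) | node


def unpack_addr(addr: int) -> Tuple[int, int, int]:
    """Split a 32-bit network address into (zone, cluster, node)."""
    return (addr >> 24) & 0xFF, (addr >> 12) & 0xFFF, addr & 0xFFF


class NameSeq(NamedTuple):
    """Port name sequence {type,lower-instance,upper-instance}."""
    name_type: int
    lower: int
    upper: int

    def covers(self, name_type: int, instance: int) -> bool:
        """True if the port name {name_type,instance} lies in this range."""
        return self.name_type == name_type and self.lower <= instance <= self.upper


def partially_overlaps(a: NameSeq, b: NameSeq) -> bool:
    """Assumed rule: same type, intersecting ranges, but not identical."""
    if a.name_type != b.name_type:
        return False
    intersects = a.lower <= b.upper and b.lower <= a.upper
    identical = (a.lower, a.upper) == (b.lower, b.upper)
    return intersects and not identical
```

With these helpers, pack_addr(3, 1, 5) yields 0x03001005, matching the node address example in Section 2.5.2, and a bind of {17,5,14} would be refused in a scope where {17,0,9} is already published, while a second bind of {17,0,9} would be accepted.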
                                   ---------------
                                  | Partition B   |
                                  |               |
                                  O bind(type: 17 |
 -----------------                |      lower:10 |
|                 |               |      upper:19)|
|send(type: 17    |                ---------------
|     instance:7) O------+
|                 |      |         ---------------
|                 |      |        | Partition A   |
 -----------------       |        |               |
                         +------->O bind(type: 17 |
                                  |      lower:0  |
                                  |      upper:9) |
                                   ---------------

Figure 4: Functional addressing, using port name and port name sequence

When a port name is used as a destination address it is never used alone, contrary to what is indicated in Figure 4. It has to be accompanied by a network address stating the scope and policy for the lookup of the port name. This will be described later.

2.5.6. Multicast Addressing

The concept of functional addressing is also used to provide multicast functionality. If the sender of a message indicates a port name sequence instead of a port name, a replica of the message will be sent to all ports bound to a name sequence fully or partially overlapping with the sequence indicated.

                                   ---------------
                                  | Partition B   |
                                  |               |
                         +------->O bind(type: 17 |
 -----------------       |        |      lower:10 |
|                 |      |        |      upper:19)|
|send(type: 17    |      |         ---------------
|     lower:7     O------+
|     upper:13)   |      |         ---------------
|                 |      |        | Partition A   |
 -----------------       |        |               |
                         +------->O bind(type: 17 |
                                  |      lower:0  |
                                  |      upper:9) |
                                   ---------------

Figure 5: Functional multicast, using port name sequence

Only one replica of the message will be sent to each identified target port, even if it is bound to more than one overlapping name sequence.

Whenever possible and considered advantageous, this function makes use of the reliable cluster broadcast service also supported by TIPC.

2.5.7. Publishing Scope

The default visibility scope of a published (bound) port name is the local cluster. If the publication issuer wants to give it some other visibility, this must be indicated explicitly when binding the port.
The scopes available are:

Value  Meaning
-----  -------
1      Visibility within whole own zone
2      Visibility within whole own cluster
3      Visibility limited to own node

2.5.8. Lookup Policies

When a port name is looked up in the TIPC internal NAME TABLE for translation to a port identity the following rules apply:

If the indicated lookup domain is <Z.C.N>, the lookup algorithm must choose a matching publication from that particular node. If nothing is found on the given node, it must give up and reject the request, even if other matching publications exist within the zone.

If the lookup domain is <Z.C.0>, the algorithm must select round-robin among all matching publications within that cluster, treating node local publications no differently from the others. If nothing is found within the given cluster, it must give up and reject the request, even if other matching publications exist within the zone.

If the lookup domain is <Z.0.0>, the algorithm must select round-robin among all matching publications within that zone, treating node or cluster local publications no differently from the others. If nothing is found, it must give up and reject the request.

A lookup domain of <0.0.0> means that the nearest found publication must be selected. First a lookup with scope <own zone.own cluster.own node> is attempted. If that fails, a lookup with the scope <own zone.own cluster.0> is tried, and finally, if that fails, a lookup with the scope <own zone.0.0>. If that fails the request must be rejected.

Round-robin based lookup means that the algorithm must select equally among all the matching publications within the given scope. In practice this may mean stepping forward in a circular list referring to those publications between each lookup.

3. Port-Based Communication

[NEED INTRO TO THIS SECTION]

3.1. Payload Messages

3.1.1. Payload Message Types

The header is organized so that certain parts of it can be omitted whenever the information they carry is dispensable.
The following header sizes are used:

Cluster Internal Connection Based Non-Routed Messages: Such messages by definition make only one hop over an inherently reliable link, so all fields from word 6 onwards are redundant or irrelevant. The message header can be limited to 24 bytes. By ensuring that no other messages have this particular header size, the size itself can be used as a test that we are dealing with that kind of message, and some code optimization can be done based on this knowledge.

Direct Addressed Messages: These are connectionless messages containing a port identity as destination address, i.e. the fields 'destination port' and 'destination node' are filled in and non-zero. All fields from word 8 onwards are irrelevant, and the header size can be set to 32.

Connection Based Potentially Routed Messages: Inter-cluster connection based messages, and intra-cluster messages between cluster nodes, may need to be routed via intermediate nodes if there is no direct link between the two. 'Originating node' may hence differ from 'previous node', so this field must be present. Since there is now a small, but not negligible, risk that messages may be lost or arrive out of order (the intermediate node may crash), a transport level connection sequence number is needed for problem detection. A header size of 36 bytes is required.

Port Name Addressed Messages: These are connectionless messages containing a port name as destination address, i.e. the fields 'name type' and 'name instance' have valid values, while 'destination port' is zero before the name table lookup, and non-zero after a successful lookup. 'Destination node' may be zero or have a valid value before lookup, but has a valid value after a successful lookup. The header size is set to 40.
Multicast Messages: Multicast messages are similar to port name addressed messages, except that the destination address is a range (port name sequence) rather than a port name. An extra word, the 'upper' part of the port name sequence, must be present, so the header size ends up at 44.

3.1.2. Payload Message Format

The format of the various types of payload message is shown in [TBD].

    3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
    1 0 9 8 7 6 5 4|3 2 1 0 9 8 7 6|5 4 3 2 1 0 9 8|7 6 5 4 3 2 1 0
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ -+
w0:| Ver | User  | Hsize |N|D|S|R|          Message Size           |  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+  | r
w1:|Mtype| Error |Reroute|Lsc| RES |     Broadcast Acknowledge     |  | e
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+  | q
w2:|       Link Acknowledge        |         Link Sequence         |  | u
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+  | i
w3:|                         Previous Node                         |  | r
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+  | e
w4:|                       Originating Port                        |  | d
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+  |
w5:|            Destination Port / Destination Network             |  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ -+ c
w6:|                       Originating Node                        |  | o
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+  | n
w7:|                       Destination Node                        |  | d
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+  | i
w8:|                           Name Type                           |  | t
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+  | i
w9:|              Name Instance / Name Sequence Lower              |  | o
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+  | n
wA:|                      Name Sequence Upper                      |  | a
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ -+ l
   \                                                               \  |
   /                                                               /  | o
   \                                                               \  | p
   /                                                               /  | t
   \                             Data                              \  | i
   /                                                               /  | o
   \                                                               \  | n
   /                                                               /  | a
   \                                                               \  | l
   /                                                               /  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ -+

Figure 6: TIPC payload message format

The fields of the message are interpreted as follows:

R, RES, RESERVED:
Reserved area. These bits must be zero.

Ver: 3 bits
Message protocol version number; currently 2. This field is present to facilitate future upgrades of the TIPC protocol.

User: 4 bits
Message user, as described in [TBD]. Payload messages utilize one of LOW_IMPORTANCE, MEDIUM_IMPORTANCE, HIGH_IMPORTANCE, and CRITICAL_IMPORTANCE, which denotes the importance of the message to the application.

Hsize: 4 bits
The size of the message header, in words. Payload messages can have message headers ranging from 6 to 11 words in length, depending on the value of Mtype.

N: 1 bit
Non-sequenced message flag. If this bit is clear the message is part of the normal flow of messages between link endpoints; if this bit is set the message is not part of the normal flow of messages between link endpoints. Payload messages that are sent to all cluster nodes using the broadcast link set this bit.

D: 1 bit
Destination droppable flag. If this bit is set the message should be silently dropped if it cannot be delivered to the specified destination port; if clear, the message should be returned to the originating port.

S: 1 bit
Source droppable flag. If this bit is set the message should be silently dropped if the sending application is unable to send the message due to congestion; if clear, the sending application should be notified that the send operation was unsuccessful.

Message Size: 17 bits
The size of the message, including both header and data parts, in bytes. The maximum data part of a payload message is 66,000 bytes, which makes it possible to tunnel maximum-size IP messages through TIPC.

Mtype: 3 bits
Message type. A User-specific value that indicates the exact nature of the message.
Payload messages utilize the following values:

Mtype  Mtype Name  Purpose
-----  ----------  -------
0      CONN_MSG    Data sent over an established connection
1      MCAST_MSG   Data sent to a port name sequence address using multicasting
2      NAMED_MSG   Data sent to a port name address
3      DIRECT_MSG  Data sent to a port id address

Error: 4 bits
The error status of the message. The following values apply:

Error  Error Name     Meaning
-----  ----------     -------
0      OK             No error
1      ERR_NO_NAME    Destination port name is unknown
2      ERR_NO_PORT    Destination port id does not exist
3      ERR_NO_NODE    Destination node is unreachable
4      ERR_OVERLOAD   Destination is congested
5      CONN_SHUTDOWN  Normal connection shutdown occurred

Reroute: 4 bits
The number of routing operations performed on the message.

[NEED TO FIX UP THIS TEXT, SINCE IT ISN'T REALLY ACCURATE AND PROBABLY DOESN'T BELONG HERE ANYWAY ...] This counter has the purpose of stopping messages from roaming around in the system. This may, at least theoretically, happen in case of temporary NAME TABLE or routing table inconsistency. The counter is incremented each time a lookup is done in the NAME TABLE, and each time the message makes an inter-node hop. When the counter reaches a limit (seven in the current implementation), the counter is reset and the message is rejected with the appropriate error code.

Lsc: 2 bits
Lookup Scope. Indicates the scope of the lookup domain that was used during translation from name to port identity in a NAMED_MSG or MCAST_MSG. This information enables subsequent retranslation of the original name, if necessary. The following values apply:

Lsc  Meaning
---  -------
1    Zone Scope
2    Cluster Scope
3    Node Scope

Broadcast Acknowledge: 16 bits
The sequence number of the last in-sequence packet received by the sending node from the recipient node using the broadcast link.
This allows the recipient node to release buffers associated with messages successfully transferred over the broadcast link. Messages sent to all cluster nodes over the broadcast link set this value to zero.

Link Acknowledge: 16 bits
The sequence number of the last in-sequence packet received by the sending node from the recipient node using the link over which the packet is carried. This allows the recipient node to release buffers associated with messages successfully transferred over the link. Messages sent to all cluster nodes over the broadcast link set this value to zero.

Link Sequence: 16 bits
The sequence number of the packet being transferred from the sending node to the recipient node on the associated link or the broadcast link. This allows the recipient node to detect lost packets, duplicate packets, or out-of-sequence packets.

Previous Node: 32 bits
The network address of the last node visited by the message. In the case of intra-cluster messages this is usually, but not always, identical to Originating Node.

Originating Port: 32 bits
The reference part of the port identifier of the originating port from which the message was sent.

Destination Port: 32 bits
The reference part of the port identifier to which the message was sent; present only in messages sent to a single node over a link. For NAMED_MSG and MCAST_MSG messages this field is set to zero until name lookup has been completed.

Destination Network: 32 bits
The network identity of the sender; present only in messages sent to all cluster nodes over the broadcast link.

Originating Node: 32 bits
The network address of the node from which the message originally was sent.

Destination Node: 32 bits
The network address of the final destination node for a message. Messages sent to all cluster nodes over the broadcast link set this value to <0.0.0>.
Name Type: 32 bits
The type part of the port name or port name sequence to which a message of type NAMED_MSG or MCAST_MSG was sent.

Name Instance: 32 bits
The instance part of the port name to which a message of type NAMED_MSG was sent.

Name Sequence Lower: 32 bits
The lower boundary of the port name sequence to which a message of type MCAST_MSG was sent.

Name Sequence Upper: 32 bits
The upper boundary of the port name sequence to which a message of type MCAST_MSG was sent.

Data: 0 to 66,000 bytes
The content and format of this region is specified by the application or service that sends the message.

3.1.3. Payload Message Delivery

[INCLUDE INFO ON NAME RESOLUTION]

As long as the value remains zero new lookups will be performed until a destination is found, or until 'Reroute Counter' reaches the upper limit.

When a port name has been successfully translated to a port identity, the field "Destination Node" is filled with a complete node address. This also means that the scope of the original lookup domain is lost, since this is indicated by the value of this field before the lookup. Sometimes, e.g. because of temporary inconsistency of the NAME TABLE during update, the destination port turns out not to exist, and one or more new lookups must be performed. In order to do this correctly, the original lookup scope must be preserved in the message, and that is done in this field. The lookup domain is recreated based on the complete destination node address and the lookup scope.

3.2. Connectionless Communication

[TBD]

3.3. Connection-based Communication

User connections must be kept as lightweight as possible because of their potentially huge number, and because it must be possible to establish and shut down thousands of connections per second on a node.

3.3.1. Connection Setup

How connections are established and terminated is not defined by the protocol, only how they are supervised and, if necessary, aborted.
Instead, this is left to the end user to define, or to the actual implementation of the user API-adapter. The following figures show two examples of this.

 -------------------                ----------------------
| Client            |              | Server               |
|                   |              |                      |
| (3)create(cport)  |              | (1)create(suport)    |
| (4)send(type:17,  |------------->0 (2)bind(type: 17,    |
|         inst: 7)  0<------+      |\        lower:0      |
| (8)lconnect(sport)|       |      | \       upper:9)     |
|                   |       |      | /                    |
|                   |       |      |/(5)create(sport)     |
|                   |       +------0 (6)lconnect(cport)   |
|                   |              | (7)send()            |
 -------------------                ----------------------

Figure 7: Example of user defined establishment of a connection

Figure 7 shows an example where the user defines how to set up the connection. In this case, the client starts by sending one payload-carrying NAMED_MSG message to the setup port (suport)(4). The setup server receives the message, and reads its contents and the client port (cport) identity. It then creates a new port (sport)(5), and connects it to the client port's identity(6). The lconnect() call is a purely node local operation in this case, and the connection is not fully established until the server has fulfilled the request and sent a response payload-carrying CONN_MSG message back to the client port(7). Upon reception of the response message the client reads the server port's identity and performs an lconnect() on it(8). This way, a connection has been established without sending a single protocol message.
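The sequence in Figure 7 can be traced with a small Python model. This is only an illustrative sketch and not a TIPC API: the Port class, the send() helper and user_defined_setup() are hypothetical names, the bind in step (2) is not modelled (the client is simply handed the setup port), and messages are delivered by appending to an in-memory list.

```python
from typing import List, Optional, Tuple


class Port:
    """Toy stand-in for a TIPC port (hypothetical model, not a real API)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.peer: Optional["Port"] = None  # set by lconnect(); None if unconnected
        self.inbox: List[Tuple[str, "Port", bytes]] = []

    def lconnect(self, peer: "Port") -> None:
        # Node-local operation: records the peer and sends nothing.
        self.peer = peer


def send(log: List[str], mtype: str, src: Port, dst: Port, data: bytes) -> None:
    """Deliver one payload message and record its Mtype in the log."""
    log.append(mtype)
    dst.inbox.append((mtype, src, data))


def user_defined_setup() -> Tuple[Port, Port, List[str]]:
    """Walk through steps (1) and (3)-(8) of Figure 7."""
    log: List[str] = []
    suport = Port("suport")                              # (1) setup port
    cport = Port("cport")                                # (3)
    send(log, "NAMED_MSG", cport, suport, b"request")    # (4)
    _, origin, _ = suport.inbox.pop(0)                   # server reads cport identity
    sport = Port("sport")                                # (5)
    sport.lconnect(origin)                               # (6)
    send(log, "CONN_MSG", sport, cport, b"response")     # (7)
    _, origin, _ = cport.inbox.pop(0)                    # client reads sport identity
    cport.lconnect(origin)                               # (8)
    return cport, sport, log
```

After user_defined_setup() returns, each port records the other as its peer, and the log holds only the two payload-carrying messages (one NAMED_MSG and one CONN_MSG), which is the point made above: no dedicated protocol message is needed.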
 --------------------               ----------------------
| Client             |             | Server               |
|                    |             | (1)create(suport)    |
| (4)create(cport)   |    "SYN"    | (2)bind(type: 17,    |
| (5)connect(type:17,|------------>0         lower:0      |
| (9)        inst: 7)0<------+    /|         upper:9)     |
|                    |       |   / | (3)accept()          |
|                    |    (7)|   \ | (8)                  |
|                    |       | (6)\|                      |
|                    |       +-----0 (9)recv()            |
|                    |    "SYN"    |                      |
 --------------------               ----------------------

Figure 8: TCP-style connection setup

Figure 8 shows an example where the user API-adapter supports a TCP-style connection setup, using hidden protocol messages to complete the connection. The client starts by calling connect()(5), causing the API to send an empty NAMED_MSG message ("SYN" in TCP terminology) to the setup port. Upon reception, the API-adapter at the server side creates the server port, performs a local lconnect()(6) on it towards the client port, and sends an empty CONN_MSG ("SYN") back to the client port (7). The accept() call in the server then returns, and the server can start waiting for messages (8). When the second SYN message arrives in the client, the API-adapter performs a node local lconnect() and lets the original connect() call return (9).

Note the difference between this protocol and the real TCP connection setup protocol. In our case there is no need for SYN_ACK messages, because the transport medium between the client and the server (the node-to-node link) is reliable.

Also note from these examples that it is possible to retain full compatibility between these two very different ways of establishing a connection. Before the connection is established, a TCP-style client or server should interpret a payload message from a user-controlled counterpart as an implicit SYN, and perform an lconnect() before queueing the message for reading by the user. The other way around, a user-controlled client or server must perform an lconnect() when receiving the empty message from its TCP-style counterpart.

3.3.2.
Connection Shutdown

 -------------------               -------------------
| Client            |             | Server            |
|                   |             |                   |
|                   |             |                   |
|          lclose() 0             0 lclose()          |
|                   |             |                   |
|                   |             |                   |
|                   |             |                   |
 -------------------               -------------------

Figure 9: Example of user defined shutdown of a connection

Figure 9 shows the simplest possible user defined connection shutdown scheme. If it is inherent in the user protocol when the connection should be closed, both parties will know the right moment to perform a "node local close" (lclose()) and no protocol messages need to be involved.

 --------------------               -------------------
| Client             |             | Server            |
|                    |    "FIN"    |                   |
|          (1)close()0------------>0(2)close()         |
|                    |             |                   |
|                    |             |                   |
|                    |             |                   |
 --------------------               -------------------

Figure 10: TCP-style connection shutdown

In Figure 10 a TCP-style connection close() is illustrated. This is simpler than the connection setup case, because the built-in connection abortion mechanism of TIPC can be used. When the client calls close() (1) TIPC must delete the client port. As will be described later, deleting a connected port has the effect that a CRITICAL_IMPORTANCE/CONN_MSG ("FIN" in TCP terminology) with error code NO_REMOTE_PORT is sent to the other end. Reception of such a message means that TIPC at the receiving side must shut down the connection, and this must be done before the server is notified. The server must then call close() (2), not to close the connection, but to delete the port. TIPC does not send any "FIN" this time: the server port is already disconnected, and the client port is gone anyway. If both endpoints call close() simultaneously, two "FIN" messages will cross each other, but on reception they will have no effect, since there is no destination port, and they must be discarded by TIPC.
Note here too the automatic compatibility between a user-defined peer and a TCP-style one: any user, whatever the user API, must at any moment be ready to receive a "connection aborted" indication, and that is what actually happens here.

3.3.3. Connection Abortion

When a connected port receives an indication from the TIPC link layer that it has lost contact with its peer node, it must immediately disconnect itself and send an empty CONN_MSG/NO_REMOTE_NODE to its owner process.

When a connected port is deleted without a preceding disconnect() call from the user, it must immediately disconnect itself and send an empty CONN_MSG/NO_REMOTE_PORT to its peer port. This may happen when the owner process crashes and the OS reclaims its resources.

When a connected port receives a timeout call, and is still in CONNECTED/PROBING state since the previous timer expiration, it must immediately disconnect itself and send an empty CONN_MSG/NO_REMOTE_PORT to its owner process.

When a connected port receives a previously sent message that has been rejected (a CONN_MSG with an error code), it must immediately disconnect itself and deliver the message, with data contents, to its owner process.

When a port participating in a multi-hop connection receives a CONN_MSG where the connection level sequence number does not fit, it must immediately disconnect itself, send an empty CONN_MSG/COMM_ERROR to its owner process, and another empty CONN_MSG/COMM_ERROR to its peer port.

When a connected port receives a CONN_MSG from a port other than its peer port, it must immediately send an empty CONN_MSG/NO_CONNECTION to the originating port.

When TIPC in a node receives a CONN_MSG/TIPC_OK for which it finds no destination port, it must immediately send an empty CONN_MSG/NO_REMOTE_PORT back to the originating port.

When a bound port receives a CONN_MSG from any port, it must immediately send an empty CONN_MSG/NO_CONNECTION to the originating port.

3.3.4.
Connection Supervision

[BROUGHT FROM TERMINOLOGY SECTION]

A connection also implies automatic supervision of the existence and state of the endpoints. In almost all practical cases the mechanisms for resource cleanup after process failure, rejection of messages when no destination port is found, and supervision of peer nodes at the link level are sufficient for immediate failure detection and abortion of connections.

However, because of the non-specified connection setup procedure of TIPC, there exist cases where a connection may remain incomplete. This may happen if the client in a user-defined setup/shutdown scheme forgets to call lconnect() (see Figure 9), and then deletes itself, or if one of the parties fails to call lclose() (see Figure 9). These cases are considered very rare, and should normally have no serious consequences for the availability of the system, so a slow background timer is judged sufficient to discover such situations.

When a connection is established each port starts a timer, whose purpose is to check the status of the connection. It does this by regularly (a typical configured interval is once an hour) sending a CONN_PROBE message to the peer port of the connection. The probe has two tasks: first, to inform the peer that the sender is still alive and connected; second, to inquire about the state of the recipient.
A CONN_PROBE or CONN_PROBE_REPLY message MUST be immediately responded to according to the following scheme:

---------------------------------------------------------------------
|                              |       Received Message Type        |
|                              |-----------------+------------------|
|                              |   CONN_PROBE    | CONN_PROBE_REPLY |
|                              |                 |                  |
|==============================|====================================|
|     |            Multi-hop   |        CRITICAL_IMPORTANCE+        |
|     |            seqno wrong |        TIPC_COMM_ERROR             |
|     |            ------------|-----------------+------------------|
|     | Connected  Multi-hop   |                 |                  |
|     | to sender  seqno ok    |                 |                  |
|     | port       ------------|                 |                  |
|     |            Single hop  |CONN_PROBE_REPLY |   No Response    |
|     |------------------------|                 |                  |
|     | Not connected,         |                 |                  |
|Rece-| not bound,             |                 |                  |
|ving |------------------------|-----------------+------------------|
|Port | Connected to           |                                    |
|State| other port             |        CRITICAL_IMPORTANCE+        |
|     |------------------------|        TIPC_NOT_CONNECTED          |
|     | Bound to               |                                    |
|     | port name sequence     |                                    |
|     |------------------------|------------------------------------|
|     | Recv. node available,  |        CRITICAL_IMPORTANCE+        |
|     | recv. port non-existent|        TIPC_NO_REMOTE_PORT         |
|     |------------------------|------------------------------------|
|     | Receiving node         |        CRITICAL_IMPORTANCE+        |
|     | unavailable            |        TIPC_NO_REMOTE_NODE         |
---------------------------------------------------------------------

Figure 11: Response to probe/probe replies vs port state.

If everything is well then the receiving port will answer with a probe reply message, and the probing port will go to rest for another interval. It is inherent in the protocol that one of the ports - the one connected last - normally will remain passive in this relationship. Each time its timer expires it will find that it has just received and replied to a probe, so it will never have any reason to explicitly send a probe itself.
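
As an illustration only, the response scheme of Figure 11 can be read as a small decision function. The sketch below is not part of the protocol definition: the PortState names, the probe_response() function and its return convention are hypothetical, and only the message and error code names are taken from the figure.

```python
from enum import Enum, auto


class PortState(Enum):
    """Receiving-side situations distinguished in Figure 11 (illustrative names)."""
    CONNECTED_TO_SENDER = auto()
    UNCONNECTED_UNBOUND = auto()
    CONNECTED_TO_OTHER = auto()
    BOUND_TO_NAME_SEQUENCE = auto()
    PORT_NONEXISTENT = auto()    # receiving node available, port missing
    NODE_UNAVAILABLE = auto()


# Rows of Figure 11 that answer both message types with the same error code.
_ERRORS = {
    PortState.CONNECTED_TO_OTHER: "TIPC_NOT_CONNECTED",
    PortState.BOUND_TO_NAME_SEQUENCE: "TIPC_NOT_CONNECTED",
    PortState.PORT_NONEXISTENT: "TIPC_NO_REMOTE_PORT",
    PortState.NODE_UNAVAILABLE: "TIPC_NO_REMOTE_NODE",
}


def probe_response(msg_type, state, multi_hop=False, seqno_ok=True):
    """Return ("reply", "CONN_PROBE_REPLY"), ("error", code), or None (no response)."""
    if msg_type not in ("CONN_PROBE", "CONN_PROBE_REPLY"):
        raise ValueError(f"not a probe message: {msg_type}")
    if state in _ERRORS:
        return ("error", _ERRORS[state])
    # A wrong sequence number only matters on a multi-hop connection to the sender.
    if state is PortState.CONNECTED_TO_SENDER and multi_hop and not seqno_ok:
        return ("error", "TIPC_COMM_ERROR")
    # Remaining rows: a probe is answered, a probe reply is silently accepted.
    if msg_type == "CONN_PROBE":
        return ("reply", "CONN_PROBE_REPLY")
    return None


print(probe_response("CONN_PROBE", PortState.CONNECTED_TO_SENDER))
print(probe_response("CONN_PROBE_REPLY", PortState.CONNECTED_TO_OTHER))
```

Every "error" result corresponds to an empty message sent with critical importance, as the figure indicates; how that message is built and delivered is outside the scope of this sketch.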
When an error is encountered, one or two empty CONN_MSG messages are generated: one to indicate a connection abortion in the receiving port, if it exists, and one to do the same in the sending port.

The state machine for a port during this message exchange is described in Section 3.3.

3.3.4.1. Connection Manager

Although a TIPC internal user, Connection Manager is special, because it uses the 36-byte header format of CONN_MSG payload messages instead of the 40-byte internal format. This is because those messages must contain a destination port and an originating port. The following message types are valid for Connection Manager:

User: 8 (CONN_MANAGER).

Message Types:

   ID Value    Meaning
   --------    ----------
   0           Probe to test existence of peer (CONN_PROBE)
   1           Reply to probe, confirming existence (CONN_PROBE_REPLY)
   2           Acknowledge N Messages (MSG_ACK)

MSG_ACK messages are used for transport-level congestion control, and carry one 32-bit integer in network byte order as data. This indicates the number of messages acknowledged, i.e. actually read by the port sending the acknowledge. This information makes it possible for the other port to keep track of the number of messages sent but not yet received and handled, and to take action if this value exceeds a certain threshold. The details about why and when these messages are sent are described in Section 3.3.5.

3.3.5. Flow Control

The mechanism for end-to-end flow control at the connection level has as its primary purpose to stop a sending process from overrunning a slower receiving process. Other tasks, such as bearer, link, network, and node congestion control, are handled by other mechanisms in TIPC. Because of this, the algorithm can be kept very simple. It works as follows:

1. The message sender (the API-adapter) keeps one counter, SENT_CNT, to count messages it has sent but which have not yet been acknowledged. The counter is incremented for each sent message.

2.
The receiver counts the number of messages it delivers to the user, ignoring any messages pending in the process in-queue. For every N messages, it sends back a CONN_MANAGER/MSG_ACK containing this number in its data part.

3. When the sender receives the acknowledge message, it subtracts N from SENT_CNT, and stores the new value.

4. When the sender wants to send a new message it must first check the value of SENT_CNT, and if this exceeds a certain limit, it must abstain from sending the message. A typical measure to take when this happens is to block the sending process until SENT_CNT is under the limit again, but this will be API-dependent.

The recommended value for the send window N is at least 200 messages, and the limit for SENT_CNT should be at least 2*N.

3.3.6. Sequentiality Check

Inter-cluster connection-based messages, and intra-cluster messages between cluster nodes, may need to be routed via intermediate nodes if there is no direct link between the two. This implies a small, but not negligible, risk that messages may be lost or re-ordered. For example, an intermediate node may crash, or it may have changed its routing table in the interval between the messages. A connection level sequence number is used to detect such problems, and this must be checked for each message received on the connection. If a message arrives out of sequence, no attempt at re-sequencing should be made. The port discovering the sequence error must immediately abort the connection by sending one empty CONN_MSG/COMM_ERROR message to itself, and one to the peer port.

The sequence number must not be checked on single-hop connections, where the link protocol guarantees that no such errors can occur.

The first message sent on a connection has the sequence number 42.

3.4. Multicast Communication

[TBD]

4. Name Table

[MISSING TEXT HERE]

4.1.
Distributed Name Table Protocol Overview

The TIPC internal NAME TABLE is used for translation from a port name to a corresponding port identity, or from a port name sequence to a corresponding set of port identities. In order to achieve acceptable translation times and fault tolerance, an instance of this table must exist on each node. Each instance of the table must be kept consistent with all other instances within the same zone, and there must be no unnecessary delays in updating the neighbouring table instances when a port name sequence is published or withdrawn. Inconsistencies are only tolerated for the short timespan it takes for update messages to reach the neighbouring nodes, or for the time it takes for a node to detect that a neighbouring node has disappeared.

4.2. Name Distributor Message Processing

When a node establishes contact with a new node in the cluster or the zone, it must immediately send out the necessary number of NAME_DISTRIBUTOR/PUBLICATION messages to that node, in order to let it update its local NAME TABLE instance.

When a node loses contact with another node, it must immediately clean its NAME TABLE of all entries pertaining to that node.

When a port name sequence is published on a node, TIPC must immediately send out a NAME_DISTRIBUTOR/PUBLICATION message to all nodes within the publishing scope, in order to have them update their tables.

When a port name sequence is withdrawn on a node, TIPC must immediately send out a NAME_DISTRIBUTOR/WITHDRAWAL message to all nodes within the publishing scope, in order to have them remove the corresponding entry from their tables.

Temporary table inconsistencies may occur, despite the above, and are handled as follows: If a successful lookup on one node leads to a non-existing port on another node, the lookup is repeated on that node. If this lookup succeeds, but again leads to a non-existing port, another lookup is done.
This procedure can be repeated up to six times before giving up and rejecting the message.

4.3. Name Distributor Message Format

The format of the name distribution message used to update remote name tables is shown in Figure 12.

    3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
    1 0 9 8 7 6 5 4|3 2 1 0 9 8 7 6|5 4 3 2 1 0 9 8|7 6 5 4 3 2 1 0
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w0:| Ver | User  | Hsize |N|R|R|R|          Message Size           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w1:|Mtype|        RESERVED         |     Broadcast Acknowledge     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w2:|       Link Acknowledge        |         Link Sequence         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w3:|                         Previous Node                         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w4:|                       Originating Port                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w5:|            Destination Port / Destination Network             |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w6:|                       Originating Node                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w7:|                       Destination Node                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w8:|                           RESERVED                            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w9:|   Item Size   |                   RESERVED                    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   \                                                               \
   /                   Data (list of name items)                   /
   \                                                               \
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Figure 12: Name Distributor message format

The interpretation of the fields of the message is as follows:

R, RESERVED:
Defined in Payload Message.

Ver: 3 bits
Defined in Payload Message.

User: 4 bits
Defined in Payload Message. A NAME_DISTRIBUTOR message is identified by the value 11.

Hsize: 4 bits
Defined in Payload Message. A NAME_DISTRIBUTOR message header is 40 bytes.

N: 1 bit
Defined in Payload Message.

Message Size: 17 bits
Defined in Payload Message.

Mtype: 3 bits
Defined in Payload Message. A NAME_DISTRIBUTOR message specifies 0 for a name publication message or 1 for a name withdrawal message.

Broadcast Acknowledge: 16 bits
Defined in Payload Message.

Link Acknowledge: 16 bits
Defined in Payload Message.

Link Sequence: 16 bits
Defined in Payload Message.

Previous Node: 32 bits
Defined in Payload Message.

Originating Port: 32 bits
Defined in Payload Message. A NAME_DISTRIBUTOR message sets this field to zero as the message originates with TIPC's name table subsystem.

Destination Port: 32 bits
Defined in Payload Message. A NAME_DISTRIBUTOR message sets this field to zero as the message is destined for TIPC's name table subsystem.

Destination Network: 32 bits
Defined in Payload Message.

Originating Node: 32 bits
Defined in Payload Message.

Destination Node: 32 bits
Defined in Payload Message.

Item Size: 32 bits
The size, in words, of each name publication descriptor contained in Data. A value of zero indicates that Item Size is not specified by the sender, in which case a 5 word descriptor size may be assumed.

Data: up to 66,000 bytes
A list of one or more name publication descriptors. The total number of descriptors in the message is equal to (Message Size - Hsize)/(Item Size * 4).

4.4. Name Publication Descriptor Format

The format of a name publication descriptor is shown in [TBD]. The full seven word format MUST be used by nodes in multi-cluster TIPC networks; nodes in single-cluster TIPC networks MAY use the shorter five word format. All fields of the descriptor MUST be stored in network byte order.
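
The descriptor count formula given for the Data field can be sketched as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the function names are hypothetical, Hsize is taken here to mean the header length in bytes (40 for NAME_DISTRIBUTOR), and each descriptor is treated simply as a sequence of 32-bit words in network byte order.

```python
import struct

HEADER_BYTES = 40        # NAME_DISTRIBUTOR header size stated in Section 4.3
DEFAULT_ITEM_WORDS = 5   # descriptor size assumed when Item Size is zero


def descriptor_count(message_size, item_size_words, header_bytes=HEADER_BYTES):
    """Apply (Message Size - Hsize) / (Item Size * 4), sizes in bytes."""
    words = item_size_words or DEFAULT_ITEM_WORDS
    item_bytes = words * 4
    data_bytes = message_size - header_bytes
    if data_bytes < 0 or data_bytes % item_bytes:
        raise ValueError("data area is not a whole number of descriptors")
    return data_bytes // item_bytes


def split_descriptors(data, item_size_words):
    """Split the Data area into tuples of 32-bit words (network byte order)."""
    words = item_size_words or DEFAULT_ITEM_WORDS
    item_bytes = words * 4
    if len(data) % item_bytes:
        raise ValueError("data area is not a whole number of descriptors")
    fmt = f"!{words}I"   # '!' selects network byte order, 'I' a 32-bit word
    return [struct.unpack_from(fmt, data, off)
            for off in range(0, len(data), item_bytes)]


# Two five-word descriptors; the values are made up for the example.
data = struct.pack("!5I", 17, 0, 9, 1234, 1) * 2
print(descriptor_count(HEADER_BYTES + len(data), 0))
print(split_descriptors(data, 0))
```

Rejecting a data area that is not a whole multiple of the descriptor size is a choice made for this sketch; the text does not say how a receiver should treat such a message.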

    3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
    1 0 9 8 7 6 5 4|3 2 1 0 9 8 7 6|5 4 3 2 1 0 9 8|7 6 5 4 3 2 1 0
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ -+
w0:|                             Type                              |  | r
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+  | e
w1:|                          Lower Bound                          |  | q
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+  | u
w2:|                          Upper Bound                          |  | i
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+  | r
w3:|                           Reference                           |  | e
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+  | d
w4:|                              Key                              |  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ -+
w5:|                             Node                              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w6:|                   RESERVED                    | Dist  | Scope |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Type:
The type part of the published port name sequence.

Lower:
The lower part of the published port name sequence.

Upper:
The upper part of the published port name sequence.

Reference:
The reference part of the publishing port's identity.

Key:
A created by the publishing port.

Node:
The node part of the publishing port's identity. If this field is not present it can be assumed to be the same as Originating Node.

Dist:
A bitmask indicating what other portions of the network the receiver must notify about a change in the status of name publication. The following values are used:

   Dist   Dist Name         Meaning
   ----   ---------         -------
   0x1    DIST_TO_CLUSTER   Send to all other nodes in receiver's cluster
   0x2    DIST_TO_ZONE      Send to all other clusters in receiver's zone

If this field is not present it can be assumed to be zero, indicating that no other nodes need to be notified.

Scope:
The distribution scope of the published port name sequence. If this field is not present then it can be assumed to be cluster-wide.

5.
Links

This section discusses the operation of unicast links that carry messages from the originating node to a single destination node to which it has a direct path. The operation of TIPC's broadcast link is described in Section 6.

5.1. TIPC Internal Header

5.1.1. Internal Message Header Format

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w0:|vers |msg usr|hdr sz |n|resrv|           packet size           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w1:|m typ|      sequence gap       |       broadcast ack no        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w2:|link level ack no/bc gap after | link level/bc seqno/bc gap to |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w3:|                         previous node                         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w4:| last sent broadcast/fragm no  |  next sent pkt/ fragm msg no  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w5:|          session no           | res |r|berid|link prio|netpl|p|
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w6:|                       originating node                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w7:|                       destination node                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w8:|                   transport sequence number                   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w9:|     msg count/max packet      |        link tolerance         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   \                                                               \
   /                      User Specific Data                       /
   \                                                               \
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Figure 13: TIPC Internal Message Header Format

The internal header has one format and one size, 40 bytes. Some fields are only relevant to some users, but for simplicity in understanding and implementation we present it as a single header format.

5.1.2.
Internal Message Header Fields Description

The first four words are almost identical to the corresponding part of the data message header. The differences are described next.

Sequence Gap: 13 bits. Used by: LINK_PROTOCOL
The 'Error Code', 'Reroute Count', 'Lookup Scope' and 'Options Position' fields have no relevance for LINK_PROTOCOL/STATE_MSG messages, so these 13 bits can be recycled in such messages. 'Sequence Gap' informs the recipient about the size of a gap detected in the sender's received packet sequence, from 'Link Level Acknowledge Number' onwards. The receiver of this information must immediately retransmit the missing packets.

Broadcast Gap After: 16 bits. Used by: BCAST_PROTOCOL
This field occupies the same physical space as Link Level Acknowledge Number, but is defined only for BCAST_PROTOCOL Negative Acknowledge messages. It indicates that a gap has been detected in the received sequence of broadcast data packets. The last packet received in sequence is indicated here.

Broadcast Gap To: 16 bits. Used by: BCAST_PROTOCOL
This field occupies the same physical space as the link level and broadcast sequence numbers (word 2 of Figure 13), but is defined only for BCAST_PROTOCOL Negative Acknowledge messages. It indicates that a gap has been detected in the received sequence of broadcast data packets. The first packet received out-of-sequence is indicated here.

Last Sent Broadcast: 16 bits. Used by: LINK_PROTOCOL
In order to speed up detection of lost broadcast packets, all LINK_PROTOCOL/STATE_MSG messages contain this information from the sender node. If the receiver finds that this is not in accordance with what it has received, it immediately broadcasts a BCAST_PROTOCOL/STATE_MSG back to the sender, with bc_gap_after and bcast_gap_to set appropriately.

Fragment Number: 16 bits. Used by: MSG_FRAGMENTER
Occupying the same space as 'Last Sent Broadcast', this value indicates the number of a message fragment within a fragmented message, starting from 1.

Next Sent Packet: 16 bits. Used by: LINK_PROTOCOL
Link protocol messages bypass all other packets in order to maintain link integrity, and hence cannot have sequence numbers valid for the ordinary packet stream. But all receivers depend on this information to detect packet losses, and cannot completely rely on the assumption that a sequenced packet will arrive within acceptable time. To guarantee a worst case packet loss detection time, even on low-traffic links, the equivalent of a valid sequence number has to be conveyed by the link continuity check (STATE_MSG) messages, and that is the purpose of this field.

Fragment Message Number: 16 bits. Used by: MSG_FRAGMENTER
Occupying the same space as 'Next Sent Packet', this value identifies a fragmented message on the particular link where it is sent.

Session Number: 16 bits. Used by: LINK_PROTOCOL
The risk of packets being reordered by the router is particularly elevated at the moment of first contact between nodes, so a check of sequentiality is needed even for LINK_PROTOCOL/RESET_MSG messages. The session number starts from a random value, and is incremented each time a link comes up. This way, redundant RESET_MSG messages, delayed by the router and arriving after the link has been brought to a working state, can be identified and ignored.

Redundant Link: 1 bit
When set, this bit informs the other endpoint that the sender thinks it has a second working link towards the destination. This information is needed by the recipient in order to know whether it should initiate a changeover procedure or not in case of link failure.

Bearer Identity: 3 bits
When a bearer is registered with the link layer of TIPC in a node, it is assigned a unique identifying number in the range [0,7]. This number will not necessarily be the same in different nodes, so a link endpoint needs to know the other endpoint's assigned identity for the same bearer. This is needed during the link changeover procedure, in order to identify the destination bearer instance of a tunneled packet.

Link Priority: 5 bits. Used by: LINK_PROTOCOL
When there is more than one link between two nodes, one may want to use them in load sharing or active/standby mode. Equal priority between links means load sharing; different priorities mean that the link with the higher numerical value will take all traffic. By offering a value range of 32 one can build in a default relation between different bearer types (e.g. DCCP is given lower priority than Ethernet), and no manual configuration of these values should normally be needed.

Network Plane: 3 bits. Used by: LINK_PROTOCOL
When multiple parallel routers and multiple network interfaces are used it is useful, although not strictly needed by the protocol, to have a network pervasive identifier telling which interfaces are connected to which routers. This relieves system managers from the burden of manually keeping track of the actual physical connectivity. Typically, the identifier 0 would be presented to the operator as 'Network A', identity 1 as 'Network B' etc. This identity must be agreed upon in the whole network, and therefore this field is present and valid in the header of all LINK_PROTOCOL messages. The 'negotiation' consists of letting the node with the lowest numerical value of its network address, typically node <1.1.1>, decide the identities. All others must strictly update their identities to the value received from any lower node.

Probe: 1 bit.
Used by: LINK_PROTOCOL
This one-bit field is used only by messages of type LINK_PROTOCOL/STATE_MSG. When set it instructs the receiving link endpoint to immediately respond with a STATE_MSG. The Probe bit MUST NOT be set in the responding message.

Message Count: 16 bits. Used by: MSG_BUNDLER, CHANGEOVER_PROTOCOL
This field is used for two different purposes. First, the message bundling function uses it to indicate how many packets are bundled in a bundle packet. Second, when a link goes down, the endpoint detecting the failure must send an ORIG_MSG to the other endpoint (tunneled through the remaining link) informing it about how many tunneled packets to expect. This gives the other endpoint a chance to know when the changeover is finished, so it can return to the normal link setup procedure.

Max Packet: 16 bits. Used by: LINK_PROTOCOL
Occupying the same space as 'Message Count', this field is used by a link endpoint during MTU negotiation to tell its peer the size of the largest link state message it has received. The size value is specified as a count of 4 byte words, rather than bytes, to allow the largest possible message size (66060 bytes) to be represented using 16 bits. A value of 0 indicates that no information about the largest link state message is provided. Semantically, this field has two meanings, depending on the message type carrying it. When present in a RESET or ACTIVATE message, this field indicates the theoretical MTU of the sender, in practice the MTU allowed by the sender's local media interface. When present in STATE messages, the field serves as a confirmation that packets with that MTU have been received. This is useful for the MTU negotiation described in Section 5.4.

Link Tolerance: 16 bits. Used by: LINK_PROTOCOL
Each link endpoint must have a limit for how long it can wait for packets from the other end before it declares the link failed.
Initially this time may differ between the two endpoints, and must be negotiated. At link setup all RESET_MSG messages in both directions carry the sender's configured value in this field, and the highest numerical value will be the one chosen by both endpoints. In STATE_MSG messages this field is normally zero, but if the value is explicitly changed at one endpoint, e.g. by a configuration command, it will be carried by the next STATE_MSG and force the other endpoint to also change its value. Subsequent STATE_MSG messages return to the zero value. The unit of the value is [ms].

5.2. Link Creation

5.2.1. Intra-Cluster Link Setup

TIPC automatically detects the neighbouring cluster nodes that can be reached through an interface and automatically configures a link to each of the nodes that the interface's configuration permits it to.

Note: This automatic configuration requires that TIPC be able to send Link Request messages to all possible receivers on that interface. This is easily done when the media type used by the interface supports some form of broadcast capability (e.g. Ethernet); other media types might require the use of a "replicast" facility. Support for manual configuration of links on interfaces that cannot support automatic neighbour discovery in any form is left for future study.

Whenever TIPC detects that a new interface has become active, it periodically broadcasts Link Request messages from that interface to other prospective members of the network, informing them of the node's existence. If a node that receives such a request determines that a link to the sender is required, it creates a new link endpoint and returns a unicast Link Response message to the sending node, which causes that node to create a corresponding link endpoint (see Figure 14). The two link endpoints then begin the link activation process described in Section 5.3.
The structure and semantics of Link Request and Link Response messages are described in Section 7.

                                                   -------------
                                                  |  <1.1.3>    |
                                                  |             |
                 ucast(dest:<1.1.1>,orig:<1.1.3>) |             |
                 <------------------------------- |             |
                                                  |             |
                                                   -------------
 --------------
|   <1.1.1>    |
|              | bcast(orig:<1.1.1>,dest:<1.1.0>)
|              |-------------------------------->
|              |
|              |
 --------------
                                                   -------------
                 ucast(dest:<1.1.1>,orig:<1.1.2>) |  <1.1.2>    |
                 <------------------------------- |             |
                                                  |             |
                                                  |             |
                                                  |             |
                                                   -------------

Figure 14: Neighbour Detection

There are two reasons for the on-going broadcasting described above. First, it allows two nodes to discover each other even if the communication media between them is initially non-functional. (For example, in a dual-switch system one of the cables may be faulty or disconnected at start up time, while the cluster is still fully connected and functional via the other switch.) The continuous discovery mechanism allows the missing links to be created once a working cable is inserted, without requiring a restart of any of the nodes. Second, it allows users to replace (hot-swap) an interface card with one having a different media address (e.g. a MAC address for Ethernet), again without having to restart the node. When a node receives a Link Request message its originating media address is compared with the one previously stored for that destination, and if they differ the old one is replaced, allowing the link activation process to begin using the new address.

Link Request broadcasting begins 125 msec after an interface is enabled, then repeats at an interval that doubles after each transmission until it reaches an upper limit of 2000 msec; thereafter, broadcasts occur every 2000 msec if there are no active links on the interface, or every 600,000 msec if there is at least one active link. The broadcasts continue at these rates as long as the node is up.
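
The broadcast timing just described can be sketched as a small schedule function. This is an illustrative reading of the text, not the reference implementation: the names are hypothetical, and the sketch assumes that the first delay is 125 msec and that one broadcast is sent at the 2000 msec interval before the slow rate applies.

```python
INITIAL_MS = 125      # delay before the first Link Request broadcast
FAST_MS = 2000        # upper limit of the doubling phase
SLOW_MS = 600_000     # rate used once at least one link is active


def next_interval(current_ms, has_active_link):
    """Return the delay before the next broadcast, given the delay just used."""
    if current_ms < FAST_MS:
        # Doubling phase: double the interval, capped at the upper limit.
        return min(current_ms * 2, FAST_MS)
    # Steady state: slow down only when the interface already has a link.
    return SLOW_MS if has_active_link else FAST_MS


def schedule(count, has_active_link=False):
    """List the first `count` delays for an interface in a fixed link state."""
    delays = []
    delay = INITIAL_MS
    for _ in range(count):
        delays.append(delay)
        delay = next_interval(delay, has_active_link)
    return delays


print(schedule(7))
print(schedule(7, has_active_link=True))
```

In a real node the link state can change between broadcasts, so next_interval() would be called with the current state each time the timer fires rather than with a fixed value as in schedule().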
This pattern of broadcasts ensures that a node broadcasts frequently when an interface is first enabled or when there is no connectivity on the interface, and very slowly once some amount of connectivity exists. Such an approach places the bulk of the burden of neighbour discovery on the node that is increasing its connectivity to the TIPC network, allowing nodes that are already fully connected to take a more passive role.

Note: This algorithm does not allow for rapid neighbour discovery in the event that a cluster is initially partitioned into two or more multi-node sections that later become able to communicate, as it can take up to 10 minutes for the partitions to discover one another. Further investigation is required to address this issue.

Each Link Request message contains a destination domain that indicates which neighbouring nodes are permitted to establish links to the transmitting interface; this value should be configurable on a per-interface basis. Typical settings include <0.0.0>, which permits connection to any node in the network, and , which permits connection to any node within the cluster.

A node receiving a Link Request message ensures that it belongs to the destination domain stated in the message, and that the Network Identity of the message is equal to its own. If so, and if a link does not already exist, it creates its end of the link and returns a unicast Link Response message back to the requesting node. This message then triggers the requesting node to create the other end of the link (if there is not one already), and the link activation phase then begins.

5.2.2. Inter-Cluster Link Setup

This section will be defined when multi-cluster support is added to TIPC.

5.3. Link Activation

Link activation and supervision are completely handled by the generic part of the protocol, in contrast to the partially media-dependent neighbour detection protocol.
The following FSM describes how a link is activated and supervised.

 ---------------                               ---------------
|               |<--(CHECKPOINT == LAST_REC)--|               |
|               |                             |               |
|Working-Unknown|----TRAFFIC/ACTIVATE_MSG---->|Working-Working|
|               |                             |               |
|               |-------+      +-ACTIVATE_MSG>|               |
 ---------------         \    /                ------------A--
    |                     \  /                   |         |
    | NO TRAFFIC/          \/                RESET_MSG TRAFFIC/
    | NO PROBE             /\                    |   ACTIVATE_MSG
    | REPLY               /  \                   |         |
 ---V-----------         /    \                --V------------
|               |-------+      +--RESET_MSG-->|               |
|               |                             |               |
| Reset-Unknown |                             |  Reset-Reset  |
|               |----------RESET_MSG--------->|               |
|               |                             |               |
 -------------A-                               ---------------
  |           |
  | BLOCK/    | UNBLOCK/
  | CHANGEOVER| CHANGEOVER END
  | ORIG_MSG  |
 -V-------------
|               |
|               |
|    Blocked    |
|               |
|               |
 ---------------

Figure 15: Link finite state machine

A link endpoint's state is defined by its own state, combined with what is known about the peer endpoint's state. The following states exist:

Reset-Unknown
Own link endpoint reset, i.e. queues are emptied and sequence numbers are set back to their initial values. The state of the peer endpoint is unknown. LINK_PROTOCOL/RESET_MSG messages are sent periodically at CONTINUITY_INTERVAL to inform the peer about the own endpoint's state, and to force it to reset its own endpoint, if this has not already been done. If the peer endpoint is rebooting, or has reset for some other reason, it will sooner or later also reach the state Reset-Unknown, and start sending its own RESET_MSG messages periodically. At least one of the endpoints, and often both, will eventually receive a RESET_MSG and transfer to state Reset-Reset. If the peer is still active, i.e. in one of the states Working-Working or Working-Unknown, and has not yet detected the disturbance causing this endpoint to reset, it will sooner or later receive a RESET_MSG, and transfer directly to state Reset-Reset.
If a LINK_PROTOCOL/ACTIVATE_MSG message is received in this state, the link endpoint knows that the peer is already in state Reset-Reset, and can itself move directly on to state Working-Working. Any other messages are ignored in this state. CONTINUITY_INTERVAL is calculated as the smaller of LINK_TOLERANCE/4 and 0.5 sec.

Reset-Reset
Own link endpoint reset, peer endpoint known to be reset, since the only way to reach this state is through receiving a RESET_MSG from the peer. The link endpoint sends ACTIVATE_MSG messages periodically at CONTINUITY_INTERVAL. This will eventually cause the peer to transfer to state Working-Working. The own endpoint will also transfer to state Working-Working as soon as any message which is not a RESET_MSG is received.

Working-Working
Own link endpoint working. Peer link endpoint known to be working, i.e. both can send and receive traffic messages. A periodic timer with the interval CONTINUITY_INTERVAL checks if anything has been received from the peer during the last interval. If not, state transfers to state Working-Unknown.

Working-Unknown
Own link endpoint working. Peer link endpoint in unknown state. LINK_PROTOCOL/STATE_MSG messages with the PROBE bit set are sent at an interval of CONTINUITY_INTERVAL/4 to force a response from the peer. If a calculated number of probes (LINK_TOLERANCE/(CONTINUITY_INTERVAL/4)) remain unanswered, state transfers to Reset-Unknown. Own link endpoint is reset, and the link is considered lost. If, instead, any kind of message, except LINK_PROTOCOL/RESET_MSG and LINK_PROTOCOL/ACTIVATE_MSG, is received, state transfers back to Working-Working. Reception of a RESET_MSG in this situation brings the link to state Reset-Reset. ACTIVATE_MSG will never be received in this state.

Blocked
The link endpoint is blocked from accepting any packets in either direction, except incoming, tunneled CHANGEOVER_PROTOCOL/ORIG_MSG.
This state is entered upon the arrival of the first such message, and left when the last has been counted in and delivered. See the description of the changeover procedure later in this section. The Blocked state may also be entered and left through the management commands BLOCK and UNBLOCK. This is also described later.

A newly created link endpoint starts from the state Reset-Unknown. The recommended default value for LINK_TOLERANCE is 0.8 sec.

5.4. Link MTU Negotiation

The actual MTU used by a link may vary with the media used. The two endpoints of a link may disagree on the allowed MTU (e.g. one using Ethernet jumbo frames and the other not), and intermediate switches may impose a stricter limit on the MTU size than what is visible from the endpoints. Therefore, TIPC implements an interval-halving MTU negotiation algorithm that attempts to find the largest MTU that can be used between the two link endpoints. This is done for each direction separately, so in theory a link could end up with one MTU in one direction and a different one in the opposite direction.

The algorithm works as follows:

1. A link endpoint starts out with an MTU of 1500 bytes, or the MTU reported by the bearer media, whichever is smaller (CURR_MTU). It also registers a wanted MTU (TARGET_MTU), which is equal to the one reported by the local interface. TARGET_MTU is sent along in the Max Packet field of all RESET and ACTIVATE messages to the other end, to let it know about the target to negotiate for. The other end will update its own TARGET_MTU to be the smaller of the one received and the one registered locally.

2. When the link has been established, using very short RESET and ACTIVATE messages, the endpoint lets its first STATE messages have the size of CURR_MTU + (TARGET_MTU - CURR_MTU)/2.

3. If any of those messages are received, the other endpoint responds with a STATE message where Max Packet confirms that the size is usable.
CURR_MTU is updated to the new size, and the algorithm goes back to step 2.

4. After a number of trials (e.g. 10) with the attempted MTU without any confirmation from the other end, TARGET_MTU is decremented by 4, and the algorithm goes back to step 2. If the link state moves to WORKING_UNKNOWN during this negotiation, due to lost STATE messages, the link moves temporarily back to using CURR_MTU as packet size. However, as soon as the link is back in WORKING_WORKING state, the negotiation continues from where it was suspended.

5. After a number of iterations CURR_MTU is equal to TARGET_MTU, and the negotiation is over.

5.5. Link Continuity Check

During normal traffic both link endpoints are in state Working-Working. At each expiration point, the background timer checkpoints the value of the Last Received Sequence Number. Before doing this, it compares the checkpoint from the previous expiration with the current value of Last Received Sequence Number, and if they differ, it takes the new checkpoint and goes back to sleep. If the two values do not differ, it means that nothing was received during the last interval, and the link endpoint must start probing, i.e. move to state Working-Unknown.

Note that even LINK_PROTOCOL messages are counted as received traffic, although they do not contain valid sequence numbers. When a LINK_PROTOCOL message is received, the checkpoint value is moved instead of Last Received Sequence Number, and hence the next comparison gives the desired result.

5.6. Sequence Control and Retransmission

Each packet eligible to be sent on a link is assigned a Link Level Sequence Number, and appended to a send queue associated with the link endpoint. At the moment the packet is sent, its field Link Level Acknowledge Number is set to the value of Last Received Sequence Number.
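The send-side bookkeeping described in the paragraph above can be sketched as follows. This is a minimal illustration in Python and not the reference implementation: the names Packet, LinkEndpoint, enqueue, and transmit are hypothetical, and both the 16-bit sequence space and the initial sequence number of 1 are assumptions made for the example.

```python
from collections import deque
from dataclasses import dataclass, field

SEQ_MOD = 1 << 16  # assumption: sequence numbers wrap within 16 bits


@dataclass
class Packet:
    payload: bytes
    seqno: int = 0  # Link Level Sequence Number
    ackno: int = 0  # Link Level Acknowledge Number


@dataclass
class LinkEndpoint:
    next_out_no: int = 1    # next sequence number to assign (assumed start value)
    last_received: int = 0  # Last Received Sequence Number
    send_queue: deque = field(default_factory=deque)

    def enqueue(self, payload: bytes) -> Packet:
        """Assign a sequence number and append the packet to the send queue."""
        pkt = Packet(payload, seqno=self.next_out_no)
        self.next_out_no = (self.next_out_no + 1) % SEQ_MOD
        self.send_queue.append(pkt)
        return pkt

    def transmit(self, pkt: Packet) -> Packet:
        """Stamp the acknowledge number at the moment the packet is sent."""
        pkt.ackno = self.last_received
        return pkt
```

Stamping the acknowledge number at transmission time, rather than when the packet is queued, means that a packet which has waited in the send queue still carries the most recent acknowledgement.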
When a packet is received in a link endpoint, its send queue is scanned, and all packets with a sequence number lower than the arriving packet's acknowledge number (modulo 2^16-1) are released.

If the packet's sequence number is equal to Last Received Sequence Number + 1 (mod 2^16-1), the counter is updated, and the packet is delivered upwards in the stack. A counter, Non Acknowledged Packets, is incremented for each message received, and if it reaches the value 10, a LINK_PROTOCOL/STATE_MSG is sent back to the sender. For any message sent, except BCAST_PROTOCOL messages, the Non Acknowledged Packets counter is set to zero.

Otherwise, if the sequence number is lower, the packet is considered a duplicate, and is silently discarded.

Otherwise, if a gap is found in the sequence, the packet is sorted into the Deferred Incoming Packets Queue associated with the link endpoint, to be re-sequenced and delivered upwards when the missing packets arrive. If that queue is empty, the gap is calculated and immediately transferred in a LINK_PROTOCOL/STATE_MSG back to the sending node. That node must immediately retransmit the missing packets. Also, for each 8 subsequent received out-of-sequence packets, such a message must be sent.

5.7. Message Bundling

Sometimes a packet cannot be sent immediately over a bearer, due to network or recipient congestion (link level send window overflow), or due to bearer congestion. In such situations it is important to utilize the network and bearer as efficiently as possible, and not to stop important users from sending messages until this is absolutely unavoidable. To achieve this, messages which cannot be transmitted immediately are bundled into already waiting packets whenever possible, i.e. when there are unsent packets in the send queue of a link. When the packet finally arrives at the receiving node it is split up into its individual messages again.
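The bundling behaviour described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual implementation: Bundle, queue_message, and unbundle are hypothetical names, the bundle header size is ignored, and messages are kept as a Python list rather than being serialized. The 4-byte alignment of bundled messages follows the Message Bundler Protocol description later in this chapter.

```python
from dataclasses import dataclass, field


@dataclass
class Bundle:
    """A waiting, not yet sent packet that may carry several messages."""
    mtu: int
    messages: list = field(default_factory=list)
    used: int = 0  # bytes occupied so far (bundle header ignored here)

    def try_add(self, msg: bytes) -> bool:
        """Append msg at the next 4-byte aligned offset if it still fits."""
        start = (self.used + 3) & ~3
        if start + len(msg) > self.mtu:
            return False
        self.messages.append(msg)
        self.used = start + len(msg)
        return True


def queue_message(unsent: list, msg: bytes, mtu: int) -> None:
    """Bundle msg into the last unsent packet if possible, else queue a new one."""
    if unsent and unsent[-1].try_add(msg):
        return
    bundle = Bundle(mtu)
    if not bundle.try_add(msg):
        # Assumption: fragmentation has already cut messages down to the MTU.
        raise ValueError("message larger than MTU")
    unsent.append(bundle)


def unbundle(bundle: Bundle) -> list:
    """Receiver side: split a bundle back into its individual messages."""
    return list(bundle.messages)
```

Note that queue_message is only meant to be called when a message cannot be sent immediately; nothing in the sketch delays a message that could go out at once.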
Since the bundling layer is located below the fragmentation layer in the functional model of the stack, even message fragments may be bundled with other messages this way, but this can only happen to the last fragment of a message, the only one normally not filling an entire packet by itself.

It must be emphasized that message transmissions are never delayed in order to obtain this effect. In contrast to TCP's Nagle Algorithm, the only goal of the TIPC bundling mechanism is to overcome congestion situations as quickly and efficiently as possible.

5.8. Message Fragmentation

When a message is longer than the identified MTU of the link it will use, it is split up into fragments, each being sent in a separate packet to the destination node. Each fragment is wrapped into a packet headed by a TIPC internal header (see Section 5.1). The User field of the header is set to MSG_FRAGMENTER, and each fragment is assigned a Fragment Number relative to the first fragment of the message. Each fragmented message is also assigned a Fragmented Message Number, to be present in all fragments. Fragmented Message Number must be a sequence number with the period of 2^16-1. At reception the fragments are reassembled so that the original message is recreated, and then delivered upwards to the destination port.

5.9. Link Congestion Control

TIPC uses a common sliding window protocol to handle traffic flow at the signalling link level. When the send queue associated with each link endpoint reaches a configurable limit, the Send Window Limit, TIPC stops sending messages over that link. Packets may still be appended to or bundled into waiting packets in the queue, but only after having been subject to a filtering function, selecting or rejecting user calls according to the sent message's importance priority. LOW_IMPORTANCE messages are not accepted at all in this situation.
MEDIUM_IMPORTANCE messages are still accepted, up to a configurable limit set for that user. All other users also have their individually configurable limits, recommended to be assigned values in the following ascending order: LOW_IMPORTANCE, MEDIUM_IMPORTANCE, HIGH_IMPORTANCE, CRITICAL_IMPORTANCE, CONNECTION_MANAGER, BCAST_PROTOCOL, ROUTE_DISTRIBUTOR, NAME_DISTRIBUTOR, MSG_FRAGMENTER. MSG_BUNDLER messages are not filtered this way, since those are packets created at a later stage. Whether to accept a message due for fragmentation or not is decided on its original importance, set before the fragmentation is done. Once such a message has been accepted, its individual fragments must be handled as being more important than the original message.

When the part of the queue containing sent packets is again under the Send Window Limit, the waiting packets must immediately be sent, but only until the Send Window Limit is reached again.

5.10. Bearer Congestion Control

When the local bearer media becomes overloaded, e.g. when an Ethernet circuit runs out of send buffers, the Bearer Congestion Control function may be activated. This function keeps track of the current state of the bearer, and stops accepting any packet send calls until the bearer is ready for them again. During this interval TIPC users may still perform send calls, and packets will be accumulated in the affected links' send queues according to the same rules as for Link Congestion Control, but all actual transmission is stopped.

When the congestion has abated, the bearer opens up for traffic again, and the links having packets waiting to be sent are scheduled round-robin for sending their unsent packets. This level of congestion control is optional, and its activation should be configurable.

5.11. Link Load Sharing vs Active/Standby

When a link is created it is assigned a Link Priority, used to determine its relation to a possible parallel link to the same node.
There are two possible relations between parallel working links.

Load Sharing
Load Sharing is used when the links have the same priority value. Payload traffic is shared equally over the two links, in order to take full advantage of available bandwidth. The selection of which link to use must be done in a deterministic way, so that message sequentiality can be preserved for each individual sender port. To obtain this a Link Selector is used. This must be a value correlated with the sender in such a way that all messages from that sender choose the same link, while guaranteeing a statistically equal probability for both links to be selected for the overall traffic between the nodes. A simple example of a link selector with the right properties is the last two bits of the random number part of the originating port's identity; another is the same bits in Fragmented Message Number in message fragments.

Active/Standby
When the priority of one link has a higher numerical value than that of the other, all traffic will go through that link, denoted the Active Link. The other link is kept up and working with the help of the continuity timer and probe messages, and is called the Standby Link. The task of this link is to take over traffic in case the active link fails.

Link Priority has a value within the range [1,31]. When a link is created it inherits a default priority from its corresponding bearer, and this should normally not need to be changed thereafter. However, Link Priority must be reconfigurable at run time.

5.12. Link Changeover

When the link configuration between two nodes changes, the moving of traffic from one link to another must be performed in such a way that message sequentiality and cardinality per sender is preserved.
The following situations may occur:

Active Link Failure
Before opening the remaining link for messages with the failing link's selector, all packets in the failing link's send queue must be wrapped into messages (tunneling messages) to be sent over the remaining link, irrespective of whether this is a load sharing active link or a standby link. These messages are headed by a TIPC Internal Header, the User field set to CHANGEOVER_PROTOCOL, Message Type set to ORIG_MSG. On the tunneling link the messages are subject to congestion control, fragmentation and bundling, like any other messages. Upon arrival at the receiving node, the tunneled packets are unwrapped, and moved over to the failing link's receiving endpoint. This link endpoint must now be reset, if it has not already been done, and itself initiate tunneling of its own queued packets in the opposite direction. The unwrapped packets' original sequence numbers are compared to Last Received Sequence Number of the failed link's receiving endpoint, and are delivered upwards or dropped according to their relation to this value. There is no need for the failing link to consider packet sequentiality or possible losses in this case; the tunneling link must be considered a reliable medium guaranteeing all the necessary properties. The header of the first ORIG_MSG sent in each direction must contain a valid number in the Message Count field, in order to let the receiver know how many packets to expect. During the whole changeover procedure both link endpoints must be blocked for any normal message reception, to prevent the link from being inadvertently activated again before the changeover is finished. When the expected number of packets has been received, the link endpoint is unblocked, and can go back to the normal activation procedure.

Standby Link Failure
This case is trivial, as there is no traffic to redirect.
Second Link With Same Priority Comes Up
When a link is active, and a second link with the same priority comes up, half of the traffic from the first link must be taken over by the new link. Before opening the new link for new user messages, the packets in the existing link's send queue must be transmitted over that link. This is done by wrapping copies of these packets into messages (tunnel messages) to be sent over the new link. The tunnel messages are headed by a TIPC Internal Header, the User field set to CHANGEOVER_PROTOCOL, Message Type set to DUPLICATE_MSG. On the tunneling link the messages are subject to congestion control, fragmentation and bundling, just like any other messages. Upon arrival at the receiving node, the tunneled packets are unwrapped, and delivered to the original link's receiving endpoint, just like any other packet arriving over that link's own bearer. If the original packet has already arrived over that bearer, the tunneled packet is dropped as a duplicate; otherwise the tunneled packet will be accepted, and the original packet dropped as a duplicate when it arrives.

Second Link With Higher Priority Comes Up
When a link is active, and a second link with a higher numerical priority comes up, all traffic from the first link must be taken over by the new link. The handling of this case is identical to the case when a link with the same priority comes up. After the traffic takeover has finished, no more senders will select the old link, but this does not affect the takeover procedure.

5.13. Link Deletion

Once created, a link endpoint continues to exist as long as its associated interface continues to exist.

Note: The persistence of a link endpoint whose peer cannot be reached for a significant period of time requires further study. It may be desirable for TIPC to reclaim the resources associated with such an endpoint by automatically deleting the endpoint after a suitable interval.

5.14.
Message Bundler Protocol

User: 6 (MSG_BUNDLER)

Message Types: None

A MSG_BUNDLER packet contains as many bundled packets as indicated in Message Count. All bundled messages start at a 4-byte aligned position in the packet. Each bundled packet is a complete packet, including header, but with the fields Broadcast Acknowledge Number, Link Level Sequence Number and Link Level Acknowledge Number left undefined. Any kind of packets, except LINK_PROTOCOL and MSG_BUNDLER packets, may be bundled.

5.15. Link State Maintenance Protocol

User: 7 (LINK_PROTOCOL)

   ID  Value           Meaning
   --  -----           -------
   0   STATE_MSG       Detailed state of a working link endpoint
   1   RESET_MSG       Reset receiving endpoint, sender is RESET_UNKNOWN
   2   ACTIVATE_MSG    Sender in RESET_RESET, ready to receive traffic

RESET_MSG messages must have a data part that must be a zero-terminated string. This string is the name of the bearer instance used by the sender node for this link. Examples of such names are "eth0", "vmnet1" or "udp". Those messages must also contain valid values in the fields Session Number, Link Priority and Link Tolerance.

ACTIVATE_MSG messages do not need to contain any valid fields except Message User and Message Type.

STATE_MSG messages may leave bearer name and Session Number undefined, but Link Priority and Link Tolerance must be set to zero in the normal case. If any of these values are non-zero, it implies an order to the receiver to change its local value to the one in the message. This must be done when a management command has changed the corresponding value at one link endpoint, in order to enforce the same change at the other endpoint. Network Identity must be valid in all messages.

Link protocol messages must always be sent immediately, disregarding any traffic messages queued in the link. Hence, they cannot follow the ordinary packet sequence, and their sequence number must be ignored at the receiving endpoint.
To facilitate this, these messages should be given a sequence number guaranteed not to fit in sequence. The recommended way to do this is to give such messages the next unassigned Link Level Sequence Number + 32768. This way, at reception the test for the user LINK_PROTOCOL needs to be performed only once, after the sequentiality check has failed, and we never need to reverse the Next Received Link Level Sequence Number.

5.16. Link Changeover Protocol

User: 10 (CHANGEOVER_PROTOCOL)

   ID  Value           Meaning
   --  -----           -------
   0   DUPLICATE_MSG   Tunneled duplicate of packet
   1   ORIGINAL_MSG    Tunneled original of packet

DUPLICATE_MSG messages contain no extra information in the header apart from the first three words.

The first ORIGINAL_MSG message sent out MUST contain a valid value in the Message Count field, in order to inform the recipient about how many such messages, including the first one, to expect. If this field is zero in the first message, it means that there are no packets wrapped in that message, and none to expect.

5.17. Message Fragmentation Protocol

User: 12 (MSG_FRAGMENTER)

   ID  Value           Meaning
   --  -----           -------
   0   FIRST_FRAGMENT  First fragment of message
   1   FRAGMENT        Body fragment of message
   2   LAST_FRAGMENT   Last fragment of message

All packets contain a dedicated identifier, Fragmented Message Number, to distinguish them from packets belonging to other messages from the same node. All packets also contain a sequence number within their respective message, the Fragment Number field, in order to, if necessary, reorder the packets when they arrive at the destination node. Both these sequence numbers must be incremented modulo 2^16-1.

6. Broadcast Link

To effectively support the functional multicast service described in a previous section, a reliable cluster broadcast service is provided by TIPC.
Although seen as a broadcast service from a TIPC viewpoint, at the bearer level this service is typically implemented as a multicast group comprising all nodes in the cluster.

At the multicast/broadcast sending node a sequence of actions is followed:

o When a functional multicast is requested, TIPC first looks up all matching destinations in its name translation table.

o If any node-external port is on the destination list, the message is sent to the multicast link for broadcast transport off node.

o If the own node is on the list, a replica is sent to the functional multicast receive function in the own node.

6.1. Broadcast Protocol

User: 5 (BCAST_PROTOCOL).

There is only one type of BCAST_PROTOCOL message: STATE_MSG. This is equivalent to a Broadcast Negative Acknowledge, and is sent as a broadcast visible to all other nodes according to the rules stated in the section Multicast Protocol.

6.2. Piggybacked Acknowledge

All packets, without exception, passed from one node to another, contain a valid value in the field Acknowledged Bcast Number. Since there is always some traffic going on between all nodes in the cluster (in the worst case only link supervision messages), the receiving node can trust that the Last Acknowledged Bcast counter it has for each node is kept well up-to-date. This value will under no circumstances be older than one CONTINUITY_INTERVAL, so it will inhibit a lot of unnecessary retransmissions of packets which in reality have already been received at the other end.

6.3. Coordinated Acknowledge Interval

If the received packet fits in sequence as described above, AND if the last four bits of the sequence number of the packet received are equal to the last four bits of the own node's network address, a LINK_PROTOCOL/STATE_MSG is generated and sent back as unicast to the sending node, acknowledging the packet, and implicitly all previously received packets. This means that e.g.
node <Z.C.1> will only explicitly acknowledge packet number 1, 17, 33, and so on, node <Z.C.2> will acknowledge packet number 2, 18, 34, etc. This condition significantly reduces the number of explicit acknowledges needing to be sent, taking advantage of the normally ongoing traffic over each link.

6.4. Coordinated Broadcast of Negative Acknowledges

If the Last Sent Broadcast field of a LINK_PROTOCOL/STATE_MSG differs from the registered last received broadcast data packet, or if a broadcast data packet is received out of sequence, a BCAST_PROTOCOL/STATE_MSG ("NACK") packet MAY be broadcast back to the node in question. It is RECOMMENDED that such NACKs are not sent every time a gap is detected, to avoid possible overload of the sender node.

It is RECOMMENDED that a node always looks into NACKs being broadcast from other nodes, so it can identify if these report the same sequence gap as registered locally for that node. In that case, the node SHOULD delay the sending of its own corresponding NACK until a later occasion.

6.5. Replicated Delivery

When an in-sequence functional multicast is delivered upwards in the stack, TIPC looks up in the NAME TABLE and finds all node-local destination ports. The destination list created this way is stripped of all duplicates, so that only one message replica is sent to each identified destination port.

6.6. Congestion Control

Messages sent over the "broadcast link" are subject to the same congestion control mechanisms as point-to-point links, with prioritized transmission queue appending, message bundling, and as a last resort a return value to the sender indicating the congestion. Typically this return value is taken care of by the socket layer code, blocking the sending process until the congestion abates. Hence, the sending application should never notice the congestion at all.

7.
Neighbor Detection

TIPC supports the automatic discovery of the physical network topology and the establishment of links between neighboring nodes through the use of a neighbor detection protocol.

7.1. Neighbor Detection Protocol Overview

A node initiates neighbor detection by sending a "link request" message to all of its potential neighbors over each bearer that the node has been configured to use. This message identifies the requesting node and specifies both the subset of network nodes the node is willing to establish links to and the media address to be used by such links.

A node that receives a link request message and determines that a new link between the nodes must be established must return a "link response" message to the requesting node; this message identifies the receiving node and specifies the receiving node's own media address.

The exchange of messages permits each node to create a link endpoint which has the necessary information to begin communicating with its peer.

The conditions under which a node sends link request messages are not specified in this document. For example, implementations may send messages periodically as long as a node is operational, and may suspend the sending of requests whenever a node has working links to all of its potential neighbors. In contrast, the conditions under which a node sends link response messages are specified.

7.2. Link Request Message Processing

A link request message SHOULD be sent to all potential neighbors simultaneously using multicasting or broadcasting if a bearer's media type supports this capability; otherwise, separate link request messages SHOULD be sent to all potential neighbors individually.

A node that receives a link request message MUST ignore the message if it is not supposed to communicate with the requesting node on the associated bearer.
Conditions that prohibit communication include the following:

o The requesting node has a different TIPC network identifier than the receiving node.

o The receiving node has the same TIPC network address as the requesting node (i.e. a node must ignore a message from itself).

o The requesting node does not lie within the network domain that the receiving node is authorized to communicate with over the associated bearer.

o The receiving node does not lie within the network domain that the requesting node has specified in its request.

In addition, a node that receives a link request message MUST ignore the message if it would interfere with existing communication with the requesting node. (Request messages of this nature can arise if network nodes are not configured correctly, resulting in two or more nodes having the same network address.) Conditions that cause interference include the following:

o The receiving node currently has a working link to the requesting node on the associated bearer.

o The receiving node has a working link to the requesting node on another bearer that was established using a different node signature.

A node that receives a link request message that is not ignored SHOULD establish a link endpoint capable of communicating with the requesting node. If the receiving node currently has a (non-operational) link endpoint to the requesting node on the associated bearer it MUST delete or reconfigure the link endpoint to preclude the existence of two parallel links to the same node on the same bearer. If the receiving node currently has one or more (non-operational) link endpoints to the requesting node on other bearers that were established using a different node signature it MUST delete or reconfigure those link endpoints to preclude the existence of links to two different nodes having the same network address.
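The rules above for ignoring a link request can be sketched as a single predicate. This is an illustrative sketch only: the names addr, in_domain, Request, Receiver, and must_ignore are hypothetical, the receiver's link state is reduced to two fields describing its relationship with one requester, and the 32-bit address layout (8-bit zone, 12-bit cluster, 12-bit node) is an assumption that this section does not itself define.

```python
from dataclasses import dataclass
from typing import Optional


def addr(zone: int, cluster: int, node: int) -> int:
    """Build a network address <Z.C.N> (assumed 8/12/12-bit layout)."""
    return (zone << 24) | (cluster << 12) | node


def in_domain(address: int, domain: int) -> bool:
    """True if address lies within domain."""
    if domain == 0:
        return True                                  # <0.0.0>: whole network
    if domain & 0x00000FFF:
        return address == domain                     # <Z.C.N>: one node
    if domain & 0x00FFF000:
        return (address & 0xFFFFF000) == domain      # <Z.C.0>: cluster
    return (address & 0xFF000000) == domain          # <Z.0.0>: zone


@dataclass
class Request:
    net_id: int
    node: int         # requesting node's network address
    signature: int    # Node Signature
    dest_domain: int  # Destination Domain


@dataclass
class Receiver:
    net_id: int
    node: int
    bearer_domain: int            # domain allowed on the associated bearer
    working_here: bool = False    # working link to requester on this bearer
    other_bearer_sig: Optional[int] = None  # signature of a working link on
                                            # another bearer, if one exists


def must_ignore(rx: Receiver, req: Request) -> bool:
    """Apply the prohibition and interference rules to a link request."""
    if req.net_id != rx.net_id:
        return True
    if req.node == rx.node:
        return True
    if not in_domain(req.node, rx.bearer_domain):
        return True
    if not in_domain(rx.node, req.dest_domain):
        return True
    if rx.working_here:
        return True
    if rx.other_bearer_sig is not None and rx.other_bearer_sig != req.signature:
        return True
    return False
```

A request that passes this predicate would then proceed to the endpoint creation and clean-up of non-operational endpoints described in the preceding paragraph.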
Once the receiving node has established the required link endpoint it MUST send a link response message to the requesting node on the associated bearer. The link response message MUST be directed only to the requesting node; if possible, it SHOULD be sent without using multicasting or broadcasting.

7.3. Link Response Message Processing

A node that receives a link response message MUST ignore the message if it is not supposed to communicate with the responding node on the associated bearer. Conditions that prohibit communication include the following:

o The responding node has a different TIPC network identifier than the receiving node.

o The receiving node has the same TIPC network address as the responding node (i.e. a node must ignore a message from itself).

o The responding node does not lie within the network domain that the receiving node is authorized to communicate with over the associated bearer.

o The receiving node does not lie within the network domain that the responding node has specified in its response.

In addition, a node that receives a link response message MUST ignore the message if it would interfere with existing communication with the responding node. Conditions that cause interference include the following:

o The receiving node currently has a working link to the responding node on the associated bearer.

o The receiving node has a working link to the responding node on another bearer that was established using a different node signature.

A node that receives a link response message that is not ignored SHOULD establish a link endpoint capable of communicating with the responding node. If the receiving node currently has a (non-operational) link endpoint to the responding node on the associated bearer it MUST delete or reconfigure the link endpoint to preclude the existence of two parallel links to the same node on the same bearer.
If the receiving node currently has one or more (non-operational) link endpoints to the responding node on other bearers that were established using a different node signature it MUST delete or reconfigure those link endpoints to preclude the existence of links to two different nodes having the same network address.

Once the receiving node has established the required link endpoint it MUST NOT send a link configuration message (either a request or a response) to the responding node.

7.4. Link Configuration Message Format

The format of the link configuration message used to exchange link requests and link responses is shown in Figure 16.

    3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
    1 0 9 8 7 6 5 4|3 2 1 0 9 8 7 6|5 4 3 2 1 0 9 8|7 6 5 4 3 2 1 0
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w0:|Ver  | User  | Hsize |N|R|R|R|          Message Size           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w1:|Mtype|       Node Flags        |        Node Signature         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w2:|                      Destination Domain                       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w3:|                         Previous Node                         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w4:|                          Network Id                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w5:|                   RESERVED                    |   Media Id    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w6:|                                                               |
   +-+-+-+-+-+-+-                                    +-+-+-+-+-+-+-+
w7:|                                                               |
   +-+-+-+-+-+-+-           Media Address            +-+-+-+-+-+-+-+
w8:|                                                               |
   +-+-+-+-+-+-+-                                    +-+-+-+-+-+-+-+
w9:|                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Figure 16: Link Configuration message format

The interpretation of the fields of the message is as follows:

R, RESERVED:
Defined in Payload Message.

Ver: 3 bits
Defined in Payload Message.

User: 4 bits
Defined in Payload Message.
   A LINK_CONFIGURATION message is identified by the value 13.

Hsize: 4 bits
   Defined in Payload Message. A LINK_CONFIGURATION message header is 40 bytes.

N: 1 bit
   Defined in Payload Message. A LINK_CONFIGURATION message sets this bit, as it is not part of a normal flow of messages over a link.

Message Size: 17 bits
   Defined in Payload Message. A LINK_CONFIGURATION message is 40 bytes in length.

Mtype: 3 bits
   Defined in Payload Message. A LINK_CONFIGURATION message specifies 0 for a link request message or 1 for a link response message.

Node Flags: 13 bits
   A bitmap indicating capabilities of the sender that receivers may need to be aware of. Reserved for future use; MUST be zero.

Node Signature: 16 bits
   A bit pattern chosen by the sender as a means of distinguishing itself from another node that may have been incorrectly configured with the same network address. The value MAY be selected at random upon node initialization. The value MUST be the same in all link configuration messages sent by the node, regardless of the bearer used to send each message.

Destination Domain: 32 bits
   The network domain to which the message is directed. <Z.C.N> denotes that the sender desires a link to a specific node; <Z.C.0>, <Z.0.0>, and <0.0.0> denote that the message can be processed by any node in the sender's cluster, zone, and network, respectively.

Previous Node: 32 bits
   Defined in Payload Message.

Network Id: 32 bits
   The network identity of the sender.

Media Id: 8 bits
   A value that identifies the type of media address in the following area. Currently, the only specified value is for Ethernet, which uses 1.

Media Address: 20 bytes
   The media address of the sender, the format of which is media-specific. Currently, the only specified value is for Ethernet, and consists of a MAC address in the first 6 bytes of the area.

8. Topology Service

TIPC provides a message-based mechanism for an application to learn about the port names that are visible to its node.
This is achieved by communicating with a Topology Service that has knowledge of the contents of the node's name table.

8.1. Topology Service Semantics

A "subscription" is a request by a subscriber to TIPC, telling TIPC to indicate when a port name sequence overlapping the requested range is published or withdrawn. Subscription for an individual port name is requested by specifying a port name sequence whose lower and upper instance values are identical.

An "event" is a response by TIPC to a subscriber, telling the subscriber about a change in availability of the port name(s) specified by a subscription, or in the status of the subscription itself. Each event associated with the availability of port names indicates the portion of the requested port name sequence that has changed its availability, as well as identifying the physical address involved in the change. A subscription may cause zero, one, or more events during its lifetime.

8.2. Topology Service Protocol

An application subscribing for the availability of port name sequences must follow these steps:

1. Establish a TIPC connection to the Topology Server, using the port name {1,1}.

2. Send a subscription message on the new connection for each port name sequence to be monitored.

3. Wait for arrival of event messages indicating status changes for the requested port name sequence(s).

After a subscription has been received and registered by the Topology Server, the subscriber will immediately receive zero or more events, in accordance with the state of the name table at the time of registration, and the flags in the subscription message. Thereafter, the subscriber will receive an event for each change in the name table corresponding to the subscription.

Each subscription issued by an application remains registered until one of the following conditions arises:

1. The time limit specified for the subscription expires.
   (This results in the Topology Server issuing a final event to the application, indicating that the subscription has timed out.)

2. The subscription is cancelled by the application. (This is achieved by resending the original subscription message with a cancellation bit set; no acknowledgement is provided by the Topology Server.)

3. The application's connection to the Topology Server is terminated.

8.2.1. Subscription Message Format

The format of a subscription message is shown in Figure 17. The first five words are integers, while the format of the final two words is unspecified. The words of a subscription message may be sent in network byte order or host byte order; however, all words MUST use the same ordering. (The byte ordering used in a specific subscription message can be deduced by examining the high-order and low-order bytes of the fifth word of the message, exactly one of which will be non-zero.)

    3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
    1 0 9 8 7 6 5 4|3 2 1 0 9 8 7 6|5 4 3 2 1 0 9 8|7 6 5 4 3 2 1 0
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w0 |                             Type                              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w1 |                             Lower                             |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w2 |                             Upper                             |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w3 |                            Timeout                            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w4 |                        RESERVED                         |C|S|P|
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w5 |                          User Handle                          |
w6 |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Figure 17: Format for subscription message

The interpretation of the fields of the message is as follows:

o Type: The type of the port name sequence subscribed for.

o Lower: Lower bound of the port name sequence subscribed for.

o Upper: Upper bound of the port name sequence subscribed for.
o Timeout: The time before the subscription expires, in milliseconds. A timeout of zero means that the subscription expires immediately, but the Topology Server MUST still respond with all events reflecting the state of the requested sequence at the time of the subscription's arrival; this enables an application to perform a one-shot inquiry into the name table to obtain a result immediately, regardless of whether or not the desired names are present. A timeout of 0xffffffff means the subscription will never expire.

o Filter: Describes the semantics of the subscription (word w4 in Figure 17). All bits must be zero, except for the following:

   Name  Description
   ----  -----------
   P     When set, the S-bit MUST NOT be set. The Topology Server MUST send an event for each publication or withdrawal of a sequence overlapping the requested one. When clear, the S-bit MUST be set.

   S     When set, the P-bit MUST NOT be set. The Topology Server MUST send an event only when the number of sequences overlapping the requested one goes from zero to non-zero, or vice versa. When clear, the P-bit MUST be set.

   C     When clear, the Topology Server MUST register the subscription specified by the message. When set, the Topology Server MUST cancel a registered subscription corresponding to the one indicated in this message, if one exists. 'Corresponding' means that all the fields (except the C-bit itself) have the same value as in the original subscription message, and the message is submitted via the same connection.

o User Handle: An opaque 8-byte character sequence, to be used by the subscriber for its own purposes. The Topology Server MUST NOT interpret or alter this field in any way, and must return it, along with the rest of the original subscription, in all event messages.

8.2.2. Event Message Format

The format of an event message is shown in Figure 18. The first five words in the message are integers; the remainder of the message is specified in Figure 17.
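As an illustration of the subscription message described in Section 8.2.1, the following Python sketch packs the seven words of Figure 17 and recovers the byte order from the fifth word. The function and constant names are invented for this example; the bit positions of P, S, and C are read off Figure 17, with P taken to be the least significant bit.

```python
import struct

SUB_PORTS = 0x1          # P-bit (constant names chosen for this sketch)
SUB_SERVICE = 0x2        # S-bit
SUB_CANCEL = 0x4         # C-bit
NO_EXPIRY = 0xFFFFFFFF   # timeout value meaning "never expires"


def pack_subscription(name_type, lower, upper, timeout_ms, filter_bits,
                      user_handle=b"", order="!"):
    """Build the 28-byte subscription message.

    order is a struct byte-order prefix: '!' or '>' for network order,
    '<' for little-endian.
    """
    if bool(filter_bits & SUB_PORTS) == bool(filter_bits & SUB_SERVICE):
        raise ValueError("exactly one of the P-bit and S-bit must be set")
    if filter_bits & ~(SUB_PORTS | SUB_SERVICE | SUB_CANCEL):
        raise ValueError("reserved filter bits must be zero")
    if len(user_handle) > 8:
        raise ValueError("user handle is limited to 8 bytes")
    # '8s' pads a short handle with zero bytes; the handle is opaque, so it
    # is copied unchanged whatever byte order the integers use.
    return struct.pack(order + "5I8s", name_type, lower, upper,
                       timeout_ms, filter_bits, user_handle)


def detect_byte_order(message):
    """Return '>' or '<' by inspecting the fifth word (the filter)."""
    high, low = message[16], message[19]
    if (high == 0) == (low == 0):
        raise ValueError("cannot deduce byte order from the filter word")
    return ">" if high == 0 else "<"
```

For example, pack_subscription(1000, 0, 99, NO_EXPIRY, SUB_PORTS) yields a message asking for an event on every publication or withdrawal overlapping {1000, 0, 99}; resending the same fields with SUB_CANCEL added to the filter would cancel it.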
All words of an event message MUST be sent using the same byte order used by the subscription message that registered the subscription. (The byte ordering used in a specific event message can be deduced by examining the high-order and low-order bytes of the tenth word of the message, exactly one of which will be non-zero.)

    3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
    1 0 9 8 7 6 5 4|3 2 1 0 9 8 7 6|5 4 3 2 1 0 9 8|7 6 5 4 3 2 1 0
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w0 |                             Event                             |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w1 |                          Found Lower                          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w2 |                          Found Upper                          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w3 |                          Port Number                          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w4 |                         Node Address                          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w5 /                                                               /
   \                         Subscription                          \
w11/                                                               /
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Figure 18: Format for Port Name Sequence subscription event message

The interpretation of the fields of the message is as follows:

o Event: Identifies the status change relating to a subscription. The value MUST be one of the following:

   Value  Description
   -----  -----------
   1      A sequence overlapping with the requested range was published.
   2      A sequence overlapping with the requested range was withdrawn.
   3      The Timeout limit specified by the subscription has been reached.

o Found Lower: The lower bound of the actually published or withdrawn sequence that overlaps the requested sequence. In timeout events this field is the Lower value of the associated subscription message.

o Found Upper: The upper bound of the actually published or withdrawn sequence that overlaps the requested sequence. In timeout events this field is the Upper value of the associated subscription message.
o Port Number: The reference portion of the port identifier associated with the published or withdrawn sequence. In timeout events this field is zero.

o Node Address: The network address portion of the port identifier associated with the published or withdrawn sequence. In timeout events this field is zero.

o Subscription: An exact copy of the subscription message (as described in Figure 17) which triggered the sending of the event message.

8.3. Monitoring Functional Topology

The functional topology of the network can be continuously monitored by subscribing for the relevant port names or name sequences corresponding to the services of interest to an application.

8.4. Monitoring Physical Topology

The physical topology of the network can be considered a special case of the functional topology, and can be monitored in the same way. To track the availability or disappearance of a specific node or group of nodes, an application running on these node(s) can publish a port name representing this "function"; this name can then be subscribed to by other applications. TIPC's Topology Service can then notify subscribing applications whenever it discovers or loses contact with a node publishing that name.

TIPC enables an application to easily monitor the availability of the nodes within its cluster by having each node automatically publish the reserved name {0,<Z.C.N>} with cluster scope, where <Z.C.N> is the network address of the node. The port identifier associated with this name identifies the node's Configuration Service.

9. Configuration Service

TIPC provides a message-based mechanism for an application to inquire about the configuration and status of a TIPC network and, in some instances, to alter the configuration. This is achieved by communicating with a Configuration Service that implements a variety of network management-style commands.

9.1.
Configuration Service Semantics

A "configuration command" is an operation supported by TIPC's Configuration Service that alters the configuration of a network node or returns information about the current configuration or state of the network. There are three classes of configuration command defined by TIPC:

o "Public commands" are operations that can be issued by any application and executed by the Configuration Service on any network node. These operations are typically non-intrusive and MUST NOT impact other applications running on the affected node.

o "Protected commands" are operations that can only be issued by an application that has network administration privileges on its node and executed by the Configuration Service on any network node. These operations are potentially intrusive and MAY impact other applications running on the affected node.

o "Private commands" are operations that can only be issued by an application that has network administration privileges on its node and executed by the Configuration Service on its node only. These operations are typically intrusive and MAY impact other applications running on the affected node.

A "command message" is a message exchanged by an application and TIPC. There are two classes of command message defined by TIPC:

o A "command request" is a request by an application to the Configuration Service, asking it to perform a specific configuration command.

o A "command reply" is a response by the Configuration Service to an application that acknowledges that a command request has been acted upon, and returns any requested information.

9.2. Configuration Service Protocol

Command messages may be sent over any protocol (e.g. Netlink [RFC 3549]), and may have different formats, to be decided by the particular implementation. Definition of such formats falls outside the scope of this document.
Here, we only define the formats that MUST be used when the command messages are carried over TIPC.

An application that interacts with the Configuration Service uses TIPC payload messages containing command requests and replies. The application MUST follow these steps:

1. Send a connectionless command request to a Configuration Server using the port name {0,<Z.C.N>}, where <Z.C.N> is the network address of the node to be queried or manipulated.

2. Wait for the arrival of a command reply from the Configuration Server corresponding to the previously issued command request.

After a command request is received by the Configuration Server, the server will attempt to perform the requested operation and return a command reply indicating the results of the operation.

9.2.1. Command Message Format

The data portion of a command message consists of a command descriptor followed by zero or more command arguments.

9.2.1.1. Command Descriptor

The format of a command descriptor is shown in Figure 19. All fields of the command descriptor MUST be stored in network byte order.

    3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
    1 0 9 8 7 6 5 4|3 2 1 0 9 8 7 6|5 4 3 2 1 0 9 8|7 6 5 4 3 2 1 0
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w0:|                            Length                             |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w1:|            Command            |             Flags             |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w2:|                                                               |
   +                           RESERVED                            +
w3:|                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Figure 19: Command descriptor format

The interpretation of the fields of the descriptor is as follows:

Length: 32 bits
   The total length of the command, including the command descriptor and its arguments.

Command: 16 bits
   Identifies a specific configuration command.

Flags: 16 bits
   Describes the semantics of the configuration command.
   Two values are defined:

   Flags  Meaning
   -----  -------
   0      Message is a command reply
   1      Message is a command request

RESERVED: 8 bytes
   Defined in Payload Message.

9.2.1.2. Command Arguments

A command message contains zero or more Type-Length-Value (TLV) triplets that provide details about the associated request or reply. The set of TLVs associated with a command request may be different from the set of TLVs associated with its reply.

The format of a command argument TLV is shown in Figure 20. The first two fields of the TLV MUST be stored in network byte order; the order used in the value field that follows depends on the TLV's type. TLV triplets MUST begin on a 32-bit word boundary offset from the start of the command message; thus, it may be necessary to include one, two, or three bytes of padding between adjacent TLVs in a command message.

    3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
    1 0 9 8 7 6 5 4|3 2 1 0 9 8 7 6|5 4 3 2 1 0 9 8|7 6 5 4 3 2 1 0
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w0:|            Length             |             Type              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w1:\                                                               \
   /                             Value                             /
wN:\                                                               \
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Figure 20: Command argument TLV format

The interpretation of the fields of a command argument TLV is as follows:

Length: 16 bits
   The length of the TLV, in bytes.

Type: 16 bits
   Identifies the data type encoded in Value.

Value: 0 to 65,531 bytes
   The data portion of the TLV.

9.2.2. Command Argument TLV Descriptions

The TLVs defined for TIPC's Configuration Service are described in this section.

9.2.2.1. VOID

VOID (type 1) is a zero-byte TLV type that can be used as a placeholder in command messages. Currently, no command messages utilize this type.

9.2.2.2. UNSIGNED

UNSIGNED (type 2) is a TLV type designating a generic unsigned integer.
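The command descriptor and TLV layouts of Section 9.2.1 can be combined as in the following Python sketch. The helper names are invented for this example, and two details are assumptions rather than statements of this specification: the TLV Length field is taken to cover the 4-byte TLV header plus the value but not any trailing padding (which is consistent with the 65,531-byte value limit), and the descriptor Length is taken to cover the whole message, padding included.

```python
import struct

DESCRIPTOR_SIZE = 16   # Length (4) + Command (2) + Flags (2) + RESERVED (8)
FLAG_REQUEST = 1       # Flags value for a command request
TLV_UNSIGNED = 2
TLV_STRING = 3


def encode_tlv(tlv_type, value=b""):
    """Return one TLV, zero-padded so that the next TLV starts on a
    32-bit boundary."""
    length = 4 + len(value)           # header plus value, padding excluded
    if length > 0xFFFF:
        raise ValueError("TLV value too large")
    padding = -length % 4             # 0 to 3 bytes
    return struct.pack("!HH", length, tlv_type) + value + b"\x00" * padding


def encode_command(command, tlvs=(), flags=FLAG_REQUEST):
    """Return a command message: the descriptor followed by its TLVs."""
    body = b"".join(tlvs)
    total = DESCRIPTOR_SIZE + len(body)
    # '8x' emits the eight zeroed RESERVED bytes.
    return struct.pack("!IHH8x", total, command, flags) + body
```

Under these assumptions, a request carrying a single UNSIGNED argument of zero would be built as encode_command(cmd, [encode_tlv(TLV_UNSIGNED, struct.pack("!I", 0))]), where cmd is the 16-bit command code, and would occupy 24 bytes.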
It is represented by a 32-bit integer, which MUST be stored in network byte order.

9.2.2.3. STRING

STRING (type 3) is a TLV type designating a moderately-sized character string. It is represented by a zero-terminated sequence of characters, which may range from one byte to 128 bytes, including the terminating zero character.

9.2.2.4. LARGE_STRING

LARGE_STRING (type 4) is a TLV type designating a large-sized character string. It is represented by a zero-terminated sequence of characters, which may range from one byte to 2048 bytes, including the terminating zero character.

9.2.2.5. ULTRA_STRING

ULTRA_STRING (type 5) is a TLV type designating a very large-sized character string. It is represented by a zero-terminated sequence of characters, which may range from one byte to 32768 bytes, including the terminating zero character.

9.2.2.6. ERROR_STRING

ERROR_STRING (type 16) is a TLV type designating the reason for the failure of a command request. It is represented by a zero-terminated sequence of characters, which may range from one byte to 128 bytes, including the terminating zero character.

The first character of an ERROR_STRING may be a special error code character, lying in the range 0x80 to 0xFF, which corresponds to one of the following pre-defined reasons:

   Value  Meaning
   -----  -------
   0x80   The request contains incorrect TLV(s)
   0x81   The request requires network administrator privileges
   0x83   The designated node does not permit requests from off-node
   0x84   The request is not supported
   0x85   The request has invalid argument values

9.2.2.7. NET_ADDR

NET_ADDR (type 17) is a TLV type designating a TIPC network address. It is represented by a 32-bit integer denoting zone, cluster, and node identifiers (using 8, 12, and 12 bits, respectively), with the zone identifier occupying the most significant bits and the node identifier occupying the least significant bits. This value MUST be stored in network byte order.

9.2.2.8.
MEDIA_NAME

MEDIA_NAME (type 18) is a TLV type designating a media type usable for TIPC messages. It is represented by a zero-terminated sequence of characters, which may range from one byte to 16 bytes, including the terminating zero character. As an example, the media type for Ethernet bearers is "eth".

9.2.2.9. BEARER_NAME

BEARER_NAME (type 19) is a TLV type designating a TIPC bearer. It is represented by a zero-terminated sequence of characters, which may range from one byte to 32 bytes, including the terminating zero character. The resulting string MUST have the form "medianame:interfacename". For example, an Ethernet bearer may have the name "eth:eth0".

9.2.2.10. LINK_NAME

LINK_NAME (type 20) is a TLV type designating a TIPC link endpoint. It is represented by a zero-terminated sequence of characters, which may range from one byte to 60 bytes, including the terminating zero character. The resulting string MUST have the form "Z.C.N:interfacename-Z.C.N:interfacename". For example, an Ethernet link endpoint may have the name "1.1.7:eth0-1.1.12:eth0".

9.2.2.11. NODE_INFO

NODE_INFO (type 21) is a TLV type designating the reachability status (up/down) of a neighboring node. It is represented by the 8-byte structure shown in Figure 21. All fields of this structure MUST be stored in network byte order.

    3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
    1 0 9 8 7 6 5 4|3 2 1 0 9 8 7 6|5 4 3 2 1 0 9 8|7 6 5 4 3 2 1 0
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w0 |                         Node Address                          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w1 |                              Up                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Figure 21: Value field of NODE_INFO TLV

The interpretation of the fields of the structure is as follows:

Node Address: 32 bits
   The network address of a neighboring node.

Up: 32 bits
   Non-zero if there is a working link to the specified node.

9.2.2.12.
LINK_INFO

LINK_INFO (type 22) is a TLV type designating the status (up/down) of a link endpoint. It is represented by the 68-byte structure shown in Figure 22. The first two fields of this structure MUST be stored in network byte order.

    3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
    1 0 9 8 7 6 5 4|3 2 1 0 9 8 7 6|5 4 3 2 1 0 9 8|7 6 5 4 3 2 1 0
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w0 |                         Node Address                          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w1 |                              Up                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w2 \                                                               \
   /                           Link Name                           /
w16\                                                               \
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Figure 22: Value field of LINK_INFO TLV

The interpretation of the fields of the structure is as follows:

Node Address: 32 bits
   The network address of a neighboring node.

Up: 32 bits
   Non-zero if the specified link is working.

Link Name: 60 bytes
   Zero-terminated string identifying a local link endpoint. MUST have format "Z.C.N:interfacename-Z.C.N:interfacename".

9.2.2.13. BEARER_CONFIG

BEARER_CONFIG (type 23) is a TLV type used to enable a bearer. It is represented by the 40-byte structure shown in Figure 23. The first two fields of this structure MUST be stored in network byte order.

    3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
    1 0 9 8 7 6 5 4|3 2 1 0 9 8 7 6|5 4 3 2 1 0 9 8|7 6 5 4 3 2 1 0
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w0 |                           Priority                            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w1 |                       Discovery Domain                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w2 \                                                               \
   /                          Bearer Name                          /
w9 \                                                               \
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Figure 23: Value field of BEARER_CONFIG TLV

The interpretation of the fields of the structure is as follows:

Priority: 32 bits
   Desired priority for bearer.
Discovery Domain: 32 bits
   Network domain whose nodes the bearer will establish links to. This MUST be a domain containing the node itself.

Bearer Name: 32 bytes
   Zero-terminated string designating the name of a bearer. MUST have format "medianame:interfacename".

9.2.2.14. LINK_CONFIG

LINK_CONFIG (type 24) is a TLV type used to change the properties of a link. It is represented by the 64-byte structure shown in Figure 24. The first field of this structure MUST be stored in network byte order.

    3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
    1 0 9 8 7 6 5 4|3 2 1 0 9 8 7 6|5 4 3 2 1 0 9 8|7 6 5 4 3 2 1 0
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w0 |                             Value                             |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w1 \                                                               \
   /                           Link Name                           /
w15\                                                               \
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Figure 24: Value field of LINK_CONFIG TLV

The interpretation of the fields of the structure is as follows:

Value: 32 bits
   Desired value of the link property being set.

Link Name: 60 bytes
   Zero-terminated string designating a local link endpoint. MUST have format "Z.C.N:interfacename-Z.C.N:interfacename".

9.2.2.15. NAME_TBL_QUERY

NAME_TBL_QUERY (type 25) is a TLV type used when requesting name table information. It is represented by the 16-byte structure shown in Figure 25. All fields of this structure MUST be stored in network byte order.
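As a rough illustration, the 16-byte NAME_TBL_QUERY value could be encoded along the following lines in Python. The function name and its parameters are invented for this sketch; the A flag is taken to be the most significant bit of the first word, as drawn in Figure 25, and the permitted Depth values are those listed with that figure.

```python
import struct

ALL_TYPES = 0x80000000   # the A flag: most significant bit of the first word


def encode_name_tbl_query(depth, name_type=0, lower=0, upper=0,
                          all_types=False):
    """Return the 16-byte value field of a NAME_TBL_QUERY TLV."""
    if depth not in (1, 2, 3, 4):
        raise ValueError("depth must be 1, 2, 3, or 4")
    if all_types and (name_type or lower or upper):
        raise ValueError("Type and both bounds must be zero when A is set")
    first_word = depth | (ALL_TYPES if all_types else 0)
    # All four words are stored in network byte order.
    return struct.pack("!4I", first_word, name_type, lower, upper)
```

A query for port name instances 0 through 99 of type 1000, with instance information included, would then be encode_name_tbl_query(2, 1000, 0, 99).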
    3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
    1 0 9 8 7 6 5 4|3 2 1 0 9 8 7 6|5 4 3 2 1 0 9 8|7 6 5 4 3 2 1 0
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w0 |A|                            Depth                            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w1 |                             Type                              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w2 |                          Lower Bound                          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w3 |                          Upper Bound                          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Figure 25: Value field of NAME_TBL_QUERY TLV

The interpretation of the fields of the structure is as follows:

A: 1 bit
   All types flag. If this bit is set the Configuration Server examines all entries in the name table, rather than just the name sequence specified by {Type, Lower Bound, Upper Bound}; in such cases Type, Lower Bound, and Upper Bound MUST be set to zero. If this bit is clear only names having a type value of Type are examined.

Depth: 31 bits
   Amount of information to be displayed. The value MUST be one of the following:

   Depth  Meaning
   -----  -------
   1      Display port name type info only
   2      As 1, but also display port name instance info
   3      As 2, but also display port identity info
   4      As 3, but also display any additional info available

Type: 32 bits
   The type value in the set of desired port names.

Lower Bound: 32 bits
   The lowest instance value in the set of desired port names.

Upper Bound: 32 bits
   The upper instance value in the set of desired port names.

9.3. Command Message Descriptions

The set of commands that MAY be supported by the Configuration Service is described in this section. The description of the command reply message for each command assumes that the associated command request is executed successfully.
If an error occurs during the processing of a request the Configuration Service MUST include a TLV of type ERROR_STRING as part of the command reply returned to the requesting application.

9.3.1. NOOP

NOOP (command 0x0000) is a public command that performs no action. This command may be useful for demonstrating that an application can interact successfully with the Configuration Service. The command request contains no TLV. The command reply contains no TLV.

9.3.2. GET_NODES

GET_NODES (command 0x0001) is a public command that is used to obtain information about the status of a node's neighbors. The command request contains a single TLV of type NET_ADDR, which represents a network domain. The command reply contains zero or more TLVs of type NODE_INFO, one for each node within the specified domain that this node has a direct link to (even if it is not currently operational).

9.3.3. GET_MEDIA_NAMES

GET_MEDIA_NAMES (command 0x0002) is a public command that is used to obtain the names of all media types currently configured on a node. The command request contains no TLV. The command reply contains zero or more TLVs of type MEDIA_NAME.

9.3.4. GET_BEARER_NAMES

GET_BEARER_NAMES (command 0x0003) is a public command that is used to obtain the names of all bearers currently configured on a node. The command request contains no TLV. The command reply contains zero or more TLVs of type BEARER_NAME.

9.3.5. GET_LINKS

GET_LINKS (command 0x0004) is a public command that is used to obtain information about the status of a node's link endpoints. The command request contains a single TLV of type NET_ADDR, which specifies a network domain. The command reply contains zero or more TLVs of type LINK_INFO, corresponding to the node's own broadcast link endpoint and any link endpoint whose peer node lies within the specified network domain.

9.3.6.
SHOW_NAME_TABLE

SHOW_NAME_TABLE (command 0x0005) is a public command that is used to obtain information about the contents of a node's name table. The command request contains a single TLV of type NAME_TBL_QUERY. The command reply contains a single TLV of type ULTRA_STRING, whose content is unspecified.

9.3.7. SHOW_PORTS

SHOW_PORTS (command 0x0006) is a public command that is used to obtain status and statistics information about a node's ports. The command request contains no TLV. The command reply contains a single TLV of type ULTRA_STRING, whose content is unspecified.

9.3.8. SHOW_LINK_STATS

SHOW_LINK_STATS (command 0x000B) is a public command that is used to obtain status and statistics information about a link endpoint. The command request contains a single TLV of type LINK_NAME. The command reply contains a single TLV of type ULTRA_STRING, whose content is unspecified.

9.3.9. SHOW_STATS

SHOW_STATS (command 0x000F) is a public command that is used to obtain status and statistics information about TIPC for a node. The command request contains a single TLV of type UNSIGNED, which indicates the information to be obtained; a value of zero returns all available information, while no other values are currently defined. The command reply contains a single TLV of type ULTRA_STRING, whose content is unspecified.

9.3.10. GET_REMOTE_MNG

GET_REMOTE_MNG (command 0x4003) is a private command that is used to determine whether a node can be remotely managed by another node in the TIPC network. The command request contains no TLV. The command reply contains a single TLV of type UNSIGNED; a value of zero indicates that the node's Configuration Service is unable to process command requests issued by another node, while any other value indicates that processing of off-node command requests is enabled.

9.3.11.
GET_MAX_PORTS

GET_MAX_PORTS (command 0x4004) is a private command that is used to obtain the maximum number of ports that can be supported simultaneously by a node. The command request contains no TLV. The command reply contains a single TLV of type UNSIGNED.

9.3.12. GET_MAX_PUBL

GET_MAX_PUBL (command 0x4005) is a private command that is used to obtain the maximum number of publications that can be supported simultaneously by a node's name table. The command request contains no TLV. The command reply contains a single TLV of type UNSIGNED.

9.3.13. GET_MAX_SUBSCR

GET_MAX_SUBSCR (command 0x4006) is a private command that is used to obtain the maximum number of subscriptions that can be supported simultaneously by a node's Topology Service. The command request contains no TLV. The command reply contains a single TLV of type UNSIGNED.

9.3.14. GET_MAX_ZONES

GET_MAX_ZONES (command 0x4007) is a private command that is used to obtain the maximum number of zones a node can support in its own network. The command request contains no TLV. The command reply contains a single TLV of type UNSIGNED.

9.3.15. GET_MAX_CLUSTERS

GET_MAX_CLUSTERS (command 0x4008) is a private command that is used to obtain the maximum number of clusters a node can support in its own zone. The command request contains no TLV. The command reply contains a single TLV of type UNSIGNED.

9.3.16. GET_MAX_NODES

GET_MAX_NODES (command 0x4009) is a protected command that is used to obtain the maximum number of nodes a node can support in its own cluster. The command request contains no TLV. The command reply contains a single TLV of type UNSIGNED.

9.3.17. GET_NETID

GET_NETID (command 0x400B) is a protected command that is used to obtain the TIPC network identifier used by a node. The command request contains no TLV. The command reply contains a single TLV of type UNSIGNED.

9.3.18.
ENABLE_BEARER

ENABLE_BEARER (command 0x4101) is a protected command that is used to initiate a node's use of the specified bearer for TIPC messaging. The node will respond to requests from neighboring nodes to establish new links if the nodes lie within the specified discovery domain. The command request contains a single TLV of type BEARER_CONFIG. The command reply contains no TLV.

9.3.19. DISABLE_BEARER

DISABLE_BEARER (command 0x4102) is a protected command that is used to terminate a node's use of the specified bearer for TIPC messaging. The node deletes all existing link endpoints that utilize that bearer and will ignore all requests from neighboring nodes to establish new links. The command request contains a single TLV of type BEARER_NAME. The command reply contains no TLV.

9.3.20. SET_LINK_TOL

SET_LINK_TOL (command 0x4107) is a protected command that is used to configure the tolerance attribute of a link endpoint. (The tolerance attribute of the link's peer endpoint will be configured to match automatically.) The command request contains a single TLV of type LINK_CONFIG. The command reply contains no TLV.

9.3.21. SET_LINK_PRI

SET_LINK_PRI (command 0x4108) is a protected command that is used to configure the priority attribute of a link endpoint. (The priority attribute of the link's peer endpoint will be configured to match automatically.) The command request contains a single TLV of type LINK_CONFIG. The command reply contains no TLV.

9.3.22. SET_LINK_WINDOW

SET_LINK_WINDOW (command 0x4109) is a protected command that is used to configure the message window attribute of a link endpoint. (The message window attribute of the link's peer endpoint will not be configured to match automatically.) The command request contains a single TLV of type LINK_CONFIG. The command reply contains no TLV.

9.3.23.
SET_LOG_SIZE

SET_LOG_SIZE (command 0x410A) is a protected command that is used to configure the maximum number of characters that a node's log can contain, including the terminating zero character. The command request contains a single TLV of type UNSIGNED. The command reply contains no TLV.

9.3.24. DUMP_LOG

DUMP_LOG (command 0x410B) is a protected command that is used to retrieve the contents of a node's log and to reset the log to empty. The command request contains no TLV. The command reply contains a single TLV of type ULTRA_STRING, whose content is unspecified.

9.3.25. RESET_LINK_STATS

RESET_LINK_STATS (command 0x410C) is a protected command that is used to reset the statistics counters for a link endpoint. The command request contains a single TLV of type LINK_NAME. The command reply contains no TLV.

9.3.26. SET_NODE_ADDR

SET_NODE_ADDR (command 0x8001) is a private command that is used to configure the network address of a node. The command request contains a single TLV of type NET_ADDR, indicating the desired network address. The command reply contains no TLV.

9.3.27. SET_REMOTE_MNG

SET_REMOTE_MNG (command 0x8003) is a private command that is used to configure whether a node can be remotely managed by another node in the TIPC network. The command request contains a single TLV of type UNSIGNED; a value of zero disables the node's Configuration Service from processing command requests issued by another node, while any other value enables processing of off-node command requests. The command reply contains no TLV.

9.3.28. SET_MAX_PORTS

SET_MAX_PORTS (command 0x8004) is a private command that is used to configure the maximum number of ports that can be supported simultaneously by a node. The command request contains a single TLV of type UNSIGNED. The command reply contains no TLV.

9.3.29.
SET_MAX_PUBL

SET_MAX_PUBL (command 0x8005) is a private command that is used to configure the maximum number of publications that can be supported simultaneously by a node's name table. The command request contains a single TLV of type UNSIGNED. The command reply contains no TLV.

9.3.30. SET_MAX_SUBSCR

SET_MAX_SUBSCR (command 0x8006) is a private command that is used to configure the maximum number of subscriptions that can be supported simultaneously by a node's Topology Service. The command request contains a single TLV of type UNSIGNED. The command reply contains no TLV.

9.3.31. SET_MAX_ZONES

SET_MAX_ZONES (command 0x8007) is a private command that is used to configure the maximum number of zones a node can support in its own network. The command request contains a single TLV of type UNSIGNED. The command reply contains no TLV.

9.3.32. SET_MAX_CLUSTERS

SET_MAX_CLUSTERS (command 0x8008) is a private command that is used to configure the maximum number of clusters a node can support in its own zone. The command request contains a single TLV of type UNSIGNED. The command reply contains no TLV.

9.3.33. SET_MAX_NODES

SET_MAX_NODES (command 0x8009) is a private command that is used to configure the maximum number of nodes a node can support in its own cluster. The command request contains a single TLV of type UNSIGNED. The command reply contains no TLV.

9.3.34. SET_NETID

SET_NETID (command 0x800B) is a private command that is used to configure the TIPC network identifier used by a node. The command request contains a single TLV of type UNSIGNED. The command reply contains no TLV.

10. Security Considerations

TIPC is a special-purpose transport protocol designed for operation within a secure, closed network of interconnecting nodes within a cluster. TIPC does not possess any native security features, and relies on the properties of the selected bearer protocol (e.g.
IPsec) when such features are needed.

11. IANA Considerations

As TIPC is not an Internet protocol, this document has no IANA actions.

[RFC Editor: please do not remove this section.]

12. References

[RFC2026] Bradner, S., "The Internet Standards Process -- Revision 3", RFC 2026, BCP 9, October 1996.

[RFC2104] Krawczyk, H., Bellare, M., and R. Canetti, "HMAC: Keyed-Hashing for Message Authentication", RFC 2104, February 1997.

[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", RFC 2119, March 1997.

[RFC2406] Kent, S. and R. Atkinson, "IP Encapsulating Security Payload (ESP)", RFC 2406, November 1998.

[RFC2408] Maughan, D., Schertler, M., Schneider, M., and J. Turner, "Internet Security Association and Key Management Protocol", RFC 2408, November 1998.

[RFC2434] Narten, T. and H. Alvestrand, "Guidelines for Writing an IANA Considerations Section in RFCs", RFC 2434, BCP 26, October 1998.

[RFC2460] Deering, S. and R. Hinden, "Internet Protocol, Version 6 (IPv6) Specification", RFC 2460, December 1998.

[RFC2581] Allman, M., Paxson, V., and W. Stevens, "TCP Congestion Control", RFC 2581, April 1999.

[RFC2960] Stewart, R., Xie, Q., Morneault, K., Sharp, C., Schwarzbauer, H., Taylor, T., Rytina, I., Kalla, M., Zhang, L., and V. Paxson, "Stream Control Transmission Protocol", RFC 2960, October 2000.

[RFC768] Postel, J., "User Datagram Protocol", RFC 768, STD 6, August 1980.

[RFC793] Postel, J., "Transmission Control Protocol", RFC 793, STD 7, September 1981.

[TIPC] "TIPC Project Home Page", January 2003.

Authors' Addresses

Jon Paul Maloy
Ericsson Research Canada
8400, boul. Decarie
Ville Mont-Royal, Quebec H4P 2N2
Canada

Phone: +1 514 576-2150
Email: jon.maloy@ericsson.com

Allan Stephens
Wind River
350 Terry Fox Drive, Suite 200
Kanata, ON K2K 2W5
Canada

Phone: +1 613 270-2259
Email: allan.stephens@windriver.com