Forces WG                                                      J. Maloy
Internet-Draft                                                 Ericsson
Intended status: Informational                              A. Stephens
Expires: August 17, 2014                                     Wind River
                                                      February 13, 2014


TIPC: Transparent Inter Process Communication Protocol

Abstract

This document describes TIPC, a protocol specially designed for efficient communication within clusters of loosely coupled nodes.

TIPC provides two types of services to its applications:

An "all-in-one" L2 or L3 based message transport service.

A service and topology tracking function.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as “work in progress.”

This Internet-Draft will expire on August 17, 2014.

Copyright Notice

Copyright (c) 2014 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.



Table of Contents

1.  Introduction
2.  Conventions
3.  Overview
    3.1.  Background
        3.1.1.  Existing Protocols
        3.1.2.  Assumptions
    3.2.  Architectural Overview
    3.3.  Functional Overview
        3.3.1.  API Adapters
        3.3.2.  Address Subscription
        3.3.3.  Address Distribution
        3.3.4.  Address Translation
        3.3.5.  Multicast
        3.3.6.  Connection Supervision
        3.3.7.  Routing and Link Selection
        3.3.8.  Neighbour Detection
        3.3.9.  Link Establishment/Supervision
        3.3.10.  Link Failover
        3.3.11.  Fragmentation/Reassembly
        3.3.12.  Bundling
        3.3.13.  Congestion Control
        3.3.14.  Sequence and Retransmission Control
        3.3.15.  Bearer Layer
    3.4.  Fault Handling
        3.4.1.  Fault Avoidance
        3.4.2.  Fault Detection
        3.4.3.  Fault Recovery
        3.4.4.  Overload Protection
    3.5.  Terminology
    3.6.  Abbreviations
4.  TIPC Features
    4.1.  Network Topology
        4.1.1.  Network
        4.1.2.  Zone
        4.1.3.  Cluster
        4.1.4.  Node
    4.2.  Link
    4.3.  Port
    4.4.  Message
        4.4.1.  Taxonomy
        4.4.2.  Format
    4.5.  Addressing
        4.5.1.  Location Transparency
        4.5.2.  Network Address
        4.5.3.  Port Identity
        4.5.4.  Port Name
        4.5.5.  Port Name Sequence
        4.5.6.  Multicast Addressing
        4.5.7.  Publishing Scope
        4.5.8.  Lookup Policies
5.  Port-Based Communication
    5.1.  Payload Messages
        5.1.1.  Payload Message Types
        5.1.2.  Payload Message Header Sizes
        5.1.3.  Payload Message Format
        5.1.4.  Payload Message Delivery
    5.2.  Connectionless Communication
    5.3.  Connection-based Communication
        5.3.1.  Connection Setup
        5.3.2.  Connection Shutdown
        5.3.3.  Connection Abortion
        5.3.4.  Connection Supervision
        5.3.5.  Flow Control
        5.3.6.  Sequentiality Check
    5.4.  Multicast Communication
6.  Name Table
    6.1.  Distributed Name Table Protocol Overview
    6.2.  Name Distributor Message Processing
    6.3.  Name Distributor Message Format
    6.4.  Name Publication Descriptor Format
7.  Links
    7.1.  TIPC Internal Header
        7.1.1.  Internal Message Header Format
        7.1.2.  Internal Message Header Fields Description
    7.2.  Link Creation
        7.2.1.  Link Setup
        7.2.2.  Link Activation
        7.2.3.  Link MTU Negotiation
        7.2.4.  Link Continuity Check
        7.2.5.  Sequence Control and Retransmission
        7.2.6.  Message Bundling
        7.2.7.  Message Fragmentation
        7.2.8.  Link Congestion Control
        7.2.9.  Link Load Sharing vs Active/Standby
        7.2.10.  Load Sharing
        7.2.11.  Active/Standby
    7.3.  Link Failover
        7.3.1.  Active Link Failure
        7.3.2.  Standby Link Failure
        7.3.3.  Second Link With Same Priority Comes Up
        7.3.4.  Second Link With Higher Priority Comes Up
        7.3.5.  Link Deletion
        7.3.6.  Message Bundler Protocol
        7.3.7.  Link State Maintenance Protocol
        7.3.8.  Link Changeover Protocol
        7.3.9.  Message Fragmentation Protocol
8.  Broadcast Link
    8.1.  Broadcast Protocol
    8.2.  Piggybacked Acknowledge
    8.3.  Coordinated Acknowledge Interval
    8.4.  Coordinated Broadcast of Negative Acknowledges
    8.5.  Replicated Delivery
    8.6.  Congestion Control
9.  Neighbor Detection
    9.1.  Neighbor Detection Protocol Overview
        9.1.1.  Link Request Message Processing
        9.1.2.  Link Response Message Processing
        9.1.3.  Link Discovery Message Format
        9.1.4.  Media Address Formats
10.  Topology Service
    10.1.  Topology Service Semantics
    10.2.  Topology Service Protocol
        10.2.1.  Subscription Message Format
        10.2.2.  Event Message Format
    10.3.  Monitoring Service Topology
    10.4.  Monitoring Physical Topology
11.  Configuration Service
    11.1.  Configuration Service Semantics
    11.2.  Configuration Service Protocol
        11.2.1.  Command Message Format
12.  Security Considerations
13.  IANA Considerations
14.  Contributors
15.  Acknowledgements
16.  References
    16.1.  Normative References
    16.2.  Informative References
Appendix A.  Change Log
Appendix B.  Remaining Issues





1.  Introduction

This section explains the rationale behind the development of the Transparent Inter Process Communication (TIPC) protocol. It also gives a brief introduction to each service provided by this protocol, as well as the basic concepts needed to understand the further description of the protocol in this document.




2.  Conventions

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [RFC2119] (Bradner, S., “Key words for use in RFCs to Indicate Requirement Levels,” March 1997.).




3.  Overview

This section explains the rationale behind the development of the Transparent Inter Process Communication (TIPC) protocol. It also gives a brief introduction to the services provided by this protocol, as well as the basic concepts needed to understand the further description of the protocol in this document.




3.1.  Background

There are no standard protocols available today that fully satisfy the special needs of application programs working within highly available, dynamic cluster environments. Clusters may grow or shrink by orders of magnitude, having member nodes crashing and restarting, having routers failing and replaced, having functionality moved around due to load balancing considerations, etc. All this must be handled without significant disturbances of the service(s) offered by the cluster. To minimize the effort by the application programmers to deal with such situations, and to maximize the chance that they are handled in a correct and optimal way, the cluster internal communication service should provide special support helping the applications to adapt to changes in the cluster. It should also, when possible, leverage the special conditions present within cluster environments to present a more efficient and more fault-tolerant communication service than more general protocols are capable of. This is the purpose of TIPC.

Version 1 of the TIPC protocol was proprietary, and has been widely deployed in Ericsson's customer networks. This document describes version 2 of the protocol. An open source implementation of version 2 is available as part of the standard Linux kernel at www.kernel.org.




3.1.1.  Existing Protocols

TCP [RFC0793] (Postel, J., “Transmission Control Protocol,” September 1981.) has the advantage of being ubiquitous, stable, and well known by most programmers. Its most significant shortcomings in a real-time cluster environment are the following:

SCTP [RFC2960] (Stewart, R., Xie, Q., Morneault, K., Sharp, C., Schwarzbauer, H., Taylor, T., Rytina, I., Kalla, M., Zhang, L., and V. Paxson, “Stream Control Transmission Protocol,” October 2000.) is message oriented, it provides some level of user connection supervision, message bundling, loss-free changeover, and a few more features that may make it more suitable than TCP as an intra-cluster protocol. Otherwise, it has all the drawbacks of TCP already listed above.

Apart from these weaknesses, neither TCP nor SCTP provide any topology information/subscription service, something that has proven very useful both for applications and for management functionality operating within cluster environments.

Both TCP and SCTP are general purpose protocols, in the sense that they can be used safely over the Internet as well as within a closed cluster. This apparent advantage is also their major weakness: they require functionality and header space to deal with situations that happen rarely, if ever, within clusters.




3.1.2.  Assumptions

TIPC has been designed based on the following assumptions, empirically known to be valid within most clusters.

These assumptions allow TIPC to use a simple, traffic-driven, fixed-size sliding window protocol located at the signalling link level, rather than a timer-driven transport level protocol. This in turn leads to other benefits, such as earlier release of transmission buffers, earlier packet loss detection and retransmission, earlier detection of node unavailability, to mention but some. Of course, situations with long transfer delays, high loss rates, long messages, security issues, etc. must also be dealt with, but from the viewpoint of being exceptions rather than as the general rule.




3.2.  Architectural Overview

TIPC should be seen as a layer between an application using TIPC and a packet transport service such as Ethernet, InfiniBand, UDP, TCP, or SCTP. The latter are denoted by the generic term "bearer service", or simply "bearer", throughout this document.

TIPC provides reliable transfer of user messages between TIPC users, or more specifically between two TIPC ports, which are the endpoints of all TIPC communication. A TIPC user normally means a user process, but may also be a kernel-level function or a driver.

Described in standard terminology, TIPC spans the transport, network, and signalling link layers, although this does not prevent it from using another transport level protocol as bearer, so that e.g. a TCP connection may serve as bearer for a TIPC signalling link.



      Node A                                             Node B
 -----------------                                 -----------------
|      TIPC       |                               |      TIPC       |
|   Application   |                               |   Application   |
|-----------------|                               |-----------------|
|                 |                               |                 |
|      TIPC       |TIPC address       TIPC address|      TIPC       |
|                 |                               |                 |
|-----------------|                               |-----------------|
| L2 or L3 Bearer |Bearer address   Bearer address| L2 or L3 Bearer |
|     Service     |                               |     Service     |
 -----------------                                 -----------------
        |                                                  |
        +---------------- Bearer Transport ----------------+

 Figure 1: Architectural view of TIPC 




3.3.  Functional Overview

Functionally TIPC can be described as consisting of several layers performing different tasks, as shown in Figure 2 (Functional view of TIPC).



   TIPC User

  ----------------------------------------------------------
  -------------      -------------
  |   Socket    |    |  Other API  |
  |   Adapter   |    |  Adapters.. |
  -------------      -------------
  =========================================================
  ----------------------------
  | Address      |  Address    |
  | Subscription |  Resolution |
  |--------------+----------------------------------------
  | Address Table|        Connection Supervision          |
  | Distribution |        Routing/Link Selection          |
  -----------------------------------------------------+-
  |                   |  Neighbour Detection        |   | Node
  |     Multicast     |  Link Establish/Supervision |    ---------->
  |                   |  Link Failover              |     Internal
  -----------------------------------------------+-
  |      Fragmentation/Defragmentation      |     |
  |                                         |     |
  ------------------------------------------      |
  |               Bundling                  |     |
  |          Congestion Control             |     |
  ------------------------------------+-----      |
  |   Sequence/Retransmission  |      |           |
  |         Control            |      |           |
  -------+---------------+-----       |           |
  =========|==============|============|===========|========
           |              |            |           |
       ----V-----    -----V----    ----V-----    --V-------
      |  Ethernet | |   UDP    |  |   TCP    |  | Mirrored |
      |           | |          |  |          |  | Memory   |
       -----------   ----------    ----------    ----------

 Figure 2: Functional view of TIPC 




3.3.1.  API Adapters

TIPC makes no assumptions about which APIs should be used, except that they must allow access to the TIPC services. It is possible to provide all functionality via a standard socket interface, an asynchronous port API, or any other form of dedicated interface that may be warranted. These layers MUST support transport-level congestion control and overload protection.




3.3.2.  Address Subscription

The service "Topology Information and Subscription" provides the ability to interrogate, and if necessary subscribe to, the availability of a service address, and thereby determine the availability of an associated physical/virtual resource or service.

This can be used by a distributed application to synchronize its startup, and may even serve as a simple, distributed event channel.




3.3.3.  Address Distribution

Service addresses and their associated physical addresses must be equally available within the whole cluster. For performance and fault tolerance reasons it is not acceptable to keep the necessary address tables in one node; instead, TIPC must ensure that they are distributed to all nodes in the cluster, and that they are kept consistent at all times. This is the task of the Address Distribution Service, also called Name Distribution Service.




3.3.4.  Address Translation

The translation from a service address to a physical address is performed on-the-fly during message sending by this functional layer. This step must use an efficient algorithm, and multiple translations of a service address should be avoided where possible.

It is possible to bypass address translation altogether when sending messages if the sender is able to use a physical address as the destination address. For example, this can be done when a server responds to a connection setup request, or when communication between two applications occurs over an already established connection.




3.3.5.  Multicast

This layer, supported by the underlying three layers, provides a reliable intra-cluster broadcast service, typically defined as a semi-static multicast group over the underlying bearer. It also provides the same features as an ordinary unicast link, such as message fragmentation, message bundling, and congestion control.




3.3.6.  Connection Supervision

There are several mechanisms to ensure immediate detection and report of connection failure.




3.3.7.  Routing and Link Selection

This is the step of finding the correct destination node, plus selecting the right link to use for reaching that node. If the destination node turns out to be the sending node itself, the rest of the stack is bypassed, and the message is delivered directly to the receiving port.




3.3.8.  Neighbour Detection

When a node is started it must make the rest of the cluster aware of its existence, and itself learn the topology of the cluster. By default this is done by use of broadcast, but there are other methods available.




3.3.9.  Link Establishment/Supervision

Once a neighbouring node has been detected on a bearer, a signalling link is established towards it. The functional state of that link has to be supervised continuously, and proper action taken if it fails.




3.3.10.  Link Failover

TIPC on a node will establish one link per destination node and functional bearer instance, typically one per configured Ethernet interface. Normally these run in parallel and share load equally, but special care has to be taken during the transition period when a link comes up or goes down, to preserve the guaranteed cardinality and sequentiality of message delivery. This is done by this layer.




3.3.11.  Fragmentation/Reassembly

When necessary, TIPC fragments and reassembles messages that can not be contained within one MTU-size packet.
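The fragmentation step above can be sketched as plain arithmetic over the message buffer. This is an illustrative sketch only, not the actual TIPC fragmentation algorithm or header layout; the function names and the simple ceiling-division scheme are assumptions made for clarity:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical sketch: a message larger than the per-packet payload
 * is cut into fragments, which the receiver concatenates again in
 * sequence order. */

/* Number of fragments needed for msg_len bytes, given 'payload'
 * bytes of user data per packet (ceiling division). */
static unsigned int frag_count(unsigned int msg_len, unsigned int payload)
{
    return (msg_len + payload - 1) / payload;
}

/* Copy fragment 'index' (0-based) of 'msg' into 'out'; returns the
 * fragment's length (the last fragment may be shorter). */
static unsigned int frag_get(const char *msg, unsigned int msg_len,
                             unsigned int payload, unsigned int index,
                             char *out)
{
    unsigned int off = index * payload;
    unsigned int len = msg_len - off < payload ? msg_len - off : payload;

    memcpy(out, msg + off, len);
    return len;
}
```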




3.3.12.  Bundling

Whenever there is some kind of congestion situation, i.e. when a bearer or a link can not immediately send a packet as requested, TIPC starts to bundle messages into packets already waiting to be sent. When the congestion abates the waiting packets are sent immediately, and unbundled at the receiving node.
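The bundling and unbundling described above can be illustrated with a simple length-prefixed packing scheme. This is a hedged sketch under assumed conventions (a 2-byte length prefix per bundled message and a fixed 1500-byte packet), not the real TIPC bundler, which carries the bundled messages' own headers:

```c
#include <assert.h>
#include <string.h>

struct bundle {
    unsigned char buf[1500];   /* one MTU-sized packet */
    unsigned int used;
};

/* Try to append one message; returns 0 if the packet is full. */
static int bundle_add(struct bundle *b, const void *msg, unsigned int len)
{
    if (b->used + 2 + len > sizeof(b->buf))
        return 0;
    b->buf[b->used] = len >> 8;         /* big-endian length prefix */
    b->buf[b->used + 1] = len & 0xff;
    memcpy(b->buf + b->used + 2, msg, len);
    b->used += 2 + len;
    return 1;
}

/* Extract the message at byte offset *off into 'out'; returns its
 * length and advances *off to the next bundled message. */
static unsigned int bundle_next(const struct bundle *b, unsigned int *off,
                                void *out)
{
    unsigned int len = (b->buf[*off] << 8) | b->buf[*off + 1];

    memcpy(out, b->buf + *off + 2, len);
    *off += 2 + len;
    return len;
}
```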




3.3.13.  Congestion Control

When a bearer instance becomes congested, e.g. it is unable to accept more outgoing packets, all links on that bearer are marked as congested, and no more messages are attempted to be sent over those links until the bearer opens up again for traffic. During this transition time messages are queued or bundled on the links, and then sent whenever the congestion has abated. A similar mechanism is used when the send window of a link becomes full, but affects only that particular link.




3.3.14.  Sequence and Retransmission Control

This layer ensures the cardinality and sequentiality of packets over a link.
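The sequence check on a link amounts to modular arithmetic over wrapping sequence numbers. The sketch below assumes 16-bit sequence numbers wrapping modulo 2^16; it illustrates the comparison logic only, not TIPC's actual retransmission machinery:

```c
#include <assert.h>

/* True if sequence number a is "less than" b, modulo 2^16.
 * The unsigned subtraction handles wrap-around for free. */
static int seq_less(unsigned short a, unsigned short b)
{
    return (unsigned short)(b - a) != 0 &&
           (unsigned short)(b - a) < 0x8000;
}

/* True if 'seq' lies within the reception window of size 'win'
 * starting at the next expected sequence number. */
static int seq_in_window(unsigned short seq, unsigned short expected,
                         unsigned short win)
{
    return (unsigned short)(seq - expected) < win;
}
```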




3.3.15.  Bearer Layer

This layer adapts to some connectionless or connection-oriented transport service, providing the necessary information and services to enable the upper layers to perform their tasks.




3.4.  Fault Handling

Most functions for improving system fault tolerance are described elsewhere, under the respective functions, but some aspects deserve to be mentioned separately.




3.4.1.  Fault Avoidance

Strict Source Address Check : After the neighbour detection phase, a message arriving at a node must not only have a valid Previous Node address, but that address must belong to one of the nodes known to have a direct link to the destination. The node may in practice be aware of at most a few hundred such nodes, while a network address is 32 bits long. The risk of accepting a garbled message having a valid address within that range, a sequence number that fits into the reception window, and otherwise valid header fields, is extremely small, no doubt less than one in several billion.

Sparse Port Address Space : As an extra measure, TIPC uses a 32-bit pseudo-random number as the first part of a port identity. This gives extra protection against corrupted messages, or against obsolete messages arriving at a node after long delays. Such messages will not find any destination port, and an attempt is made to return them to the sender port. If there is no valid sender port, the message is quietly discarded.

Name Table Keys : When a NAME TABLE is updated with a new publication, each publication is qualified with a Key field that is known only by the publishing port. This key must be presented and verified, in all instances of the name table, when the publication is withdrawn. If the key does not match, the withdrawal is refused.

Link Selectors : Whenever a message/packet is sent or routed, the link used for the next-hop transport is always selected in a deterministic way, based on the sender port's random number. The risk of having packets arriving in disorder is hence non-existent.
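The deterministic link selection above reduces to a fixed mapping from the sender's pseudo-random port reference to a link index. A minimal sketch, assuming a simple modulo mapping (the function name is illustrative):

```c
#include <assert.h>

/* All packets from a given port always take the same link out of
 * 'num_links' parallel links, so they cannot be reordered. */
static unsigned int select_link(unsigned int port_ref,
                                unsigned int num_links)
{
    return port_ref % num_links;
}
```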

Repeated Name Lookups : If a lookup in the NAME TABLE has returned a port identity that later turns out to be false, TIPC performs up to 6 new lookups before giving up and rejecting the message.




3.4.2.  Fault Detection

The mechanisms for fault detection have been described in previous sections, but some of them will be briefly repeated here:

Transport Level Sequence Number, to detect disordered multi-hop packets.

Connection Supervision and Abortion mechanism.

Link Supervision and Continuation control.




3.4.3.  Fault Recovery

When a failure has been detected, several mechanisms are used to eliminate the impact from the problem, or when that is impossible, to help the application to recover from it:

Link Failover: When a link fails, its traffic is directed over to the redundant link, if any, in such a way that message sequentiality and cardinality is preserved. This feature is described in Section 7.3 (Link Failover).

Returning Messages to Sender : When no destination is found for a message, the first 1024 bytes of it are returned to the sender port, along with a descriptive error code. This helps the application to identify the exact instant of failure, and if possible, to find a new destination for the failed call. The complete list of error codes and their significance is described in Figure 9 (TIPC Error Codes).




3.4.4.  Overload Protection

To overcome situations where the congestion/flow control mechanisms described earlier in this section are inadequate or insufficient, TIPC must provide an additional overload protection service:

Process Overload Protection

TIPC must maintain a counter for each process, or if this is impossible, for each port, keeping track of the total number of pending, unhandled payload messages on that process or port. When this counter reaches a critical value, which should be configurable, TIPC must selectively reject new incoming messages. Which messages to reject should be based on the same criteria as for the node overload protection mechanism, but all thresholds must be set significantly lower. Empirically, a ratio of 2:1 between the node global thresholds and the port local thresholds has worked well.
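The two-level check described above can be sketched as follows. The threshold values and function name are illustrative assumptions; only the structure (a node-global limit with a port-local limit at half of it, the empirical 2:1 ratio) reflects the text:

```c
#include <assert.h>

#define NODE_LIMIT 40000u              /* configurable in a real system */
#define PORT_LIMIT (NODE_LIMIT / 2)    /* 2:1 node:port threshold ratio */

/* Returns 1 if a new incoming payload message may be accepted, given
 * the pending message counts for the node and the destination port. */
static int accept_msg(unsigned int node_pending, unsigned int port_pending)
{
    if (node_pending >= NODE_LIMIT)
        return 0;                      /* node overloaded: reject */
    if (port_pending >= PORT_LIMIT)
        return 0;                      /* this port overloaded: reject */
    return 1;
}
```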




3.5.  Terminology

This section defines terms whose meaning may otherwise be unclear or ambiguous.

Application: A user-written program that directly utilizes TIPC for communication.

Bearer: An instance of a physical or logical transport medium, such as Ethernet, ATM/AAL or DCCP, over which messages can be sent.

Broadcast: The sending of a message to all other nodes in the sender's cluster, each of which receives a copy of the message. Note that what is considered a broadcast from the TIPC viewpoint may be mapped onto a multicast at the bearer (Ethernet or DCCP) level.

Connection: A logical channel for passing messages between two ports. Once a connection is established no address need be indicated when sending a message from either of the endpoints.

Cluster: A collection of nodes that are directly interconnected (i.e. fully meshed). All nodes in a cluster have network addresses that differ only in their node identifier.

Domain: A subset of topologically related nodes in a TIPC network, normally designated by a network address. For example, <Z.C.N> designates a specific node, <Z.C.0> designates any node within the specified cluster, <Z.0.0> designates any node within the specified zone, and <0.0.0> designates any node within the network.

Functional Address: Synonym for Service Address.

Internal Message: A message that is generated and consumed by an internal TIPC subsystem.

Link: A communication channel connecting two nodes, performing tasks such as message transfer, sequence ordering, retransmission, etc. A pair of nodes may be interconnected by one link on a single bearer, or by a pair of links on two bearers in either a load sharing or an active-plus-standby configuration.

Link Changeover: The act of moving all traffic from a failing link in a link pair to the remaining link, while retaining the original sequence order and cardinality of messages.

Link Endpoint: A communication endpoint, used in pairs by a link to send and receive TIPC messages between two nodes.

Location Transparency: The ability of an application within a cluster to communicate with another application without knowing the physical location of the latter. (This is sometimes called "addressing transparency".)

Message: The fundamental unit of information exchanged between TIPC ports or between TIPC subsystems. Consists of a TIPC message header, followed by from 0 to 66,000 bytes of data.

Message Bundling: The act of aggregating several messages into one packet (typically an Ethernet frame) to minimize the impact of congestion when messages cannot be sent immediately.

Message Fragmentation: The act of dividing a long message into several packets during transmission and later reassembling the fragments into the original message at the receiving end.

Multicast: The sending of a message to multiple TIPC ports, each of which receives a copy of the message.

Name: An alias for Port Name.

Name Sequence: An alias for Port Name Sequence.

Name Table: A TIPC-internal table, existing on each node, which keeps track of the mapping between port names and port identities.

Network: A collection of nodes that can communicate with one another via TIPC. The network may consist of a single node, a single cluster, a single zone, or a group of inter-connected zones.

Network Address: An integer that identifies a node, or set of nodes, within a TIPC network. It is a 32 bit integer, subdivided into three fields (8/12/12), representing a zone, cluster and node identifier, respectively; normally denoted as <Z.C.N>.

Network Identity: An integer that uniquely identifies a TIPC network. Used to keep traffic from different TIPC networks separated from each other when a common bearer is being used; for example, when multiple networks are running on a LAN in a lab environment.

Node: A computer within a TIPC network, uniquely identified by a network address.

Packet: The unit of data sent over a bearer. It may contain one or more complete TIPC messages, or a fragment of one TIPC message.

Payload Message: A message that carries application-related content between applications, or between an application and a service.

Port: A communication endpoint, capable of sending and receiving TIPC messages. Once created, a TIPC port persists until it is deleted by its owner, either explicitly or implicitly. In practice, a TIPC port is embedded in a POSIX socket in all existing implementations, and there is no need for the user application to distinguish between the two concepts.

Port Identity: A physical address that uniquely identifies a TIPC port within a network; normally denoted as <Z.C.N:reference>. Once a port is deleted, its identity will not be reissued for a very long time.

Port Name: A service address that identifies a TIPC port as being capable of providing a specific service; normally denoted as {type,instance}. For load sharing and redundancy purposes several ports may bind to the same name; likewise, a single port may bind to multiple names if it provides multiple services.

Port Name Sequence: A mechanism for specifying a range of contiguous port names; normally denoted as {type,lower-instance,upper-instance}.

Service: A TIPC subsystem that communicates with applications or other TIPC subsystems using TIPC ports.

Service Address: A location independent address, identifying a port. Manifested as a Port Name in TIPC.

Scope: A shorthand form for expressing the domain that contains a node, as seen by that node; that is, own-node, own-cluster, or own-zone.

Unicast: The sending of a message to a single node in the network.

Zone: A "super-cluster" of clusters that are directly interconnected (i.e. fully meshed). All nodes in a zone have network addresses that share a common zone identifier.
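The Domain notation defined above (<Z.C.N>, <Z.C.0>, <Z.0.0>, <0.0.0>) can be sketched as a matching function over the 8/12/12-bit network address layout. This is an illustrative sketch of the wildcard semantics, not the implementation's actual lookup code; the function names are assumptions:

```c
#include <assert.h>
#include <stdint.h>

/* Build a <Z.C.N> network address: 8-bit zone, 12-bit cluster,
 * 12-bit node, packed into one 32-bit word. */
static uint32_t tipc_addr(unsigned z, unsigned c, unsigned n)
{
    return ((uint32_t)z << 24) | ((uint32_t)c << 12) | (uint32_t)n;
}

/* True if 'addr' falls within 'domain'. A zero field in the domain
 * acts as a wildcard: <Z.C.0> matches every node in the cluster,
 * <Z.0.0> every node in the zone, <0.0.0> the whole network. */
static int in_domain(uint32_t addr, uint32_t domain)
{
    if (domain == 0)
        return 1;                          /* <0.0.0>: any node   */
    if ((domain & 0x00ffffff) == 0)        /* <Z.0.0>: zone match */
        return (addr & 0xff000000) == domain;
    if ((domain & 0x00000fff) == 0)        /* <Z.C.0>: cluster    */
        return (addr & 0xfffff000) == domain;
    return addr == domain;                 /* <Z.C.N>: exact node */
}
```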




3.6.  Abbreviations

API - Application Programming Interface

MAC - Message Authentication Code [RFC2104] (Krawczyk, H., Bellare, M., and R. Canetti, “HMAC: Keyed-Hashing for Message Authentication,” February 1997.)

MTU - Maximum Transmission Unit

RTT - Round Trip Time, the elapsed time from the moment a message is sent to a destination to the moment it arrives back at the sender, provided the message is immediately bounced back by the receiver. Typically on the order of 100 usecs, process-to-process, between 2 GHz CPUs via a 100 Mbps Ethernet switch.




4.  TIPC Features




4.1.  Network Topology

From a TIPC addressing viewpoint the network is organized in a four-level hierarchy:




+----------------------------------------------------+ +------------+
| Zone <1>                                           | |Zone <2>    |
|  ----------------------    ----------------------  | |            |
| | Cluster <1.1>        |  | Cluster <1.2>        | | |            |
| |                      |  |                      | | |            |
| |  -------             |  |  -------    -------  | | |   -------  |
| | |       |            |  | |       |  |       | | | |  |       | |
| | | Node  |   -------  |  | | Node  +--+ Node  | | | |  | Node  | |
| | |<1.1.1>|  |       | |  | |<1.2.1>|  |<1.2.2>| | | |  |<2.1.1>| |
| | |       +--+       +------+       |  |       +--------+       | |
| | |       |  |       | |  | |       |  |       | | | |  |       | |
| |  ---+---   | Node  | |  |  --------   ---+---  | | |   -------  |
| |     |      |<1.1.3>| |  |                |     | | |            |
| |  ---+---   |       | |  |             ---+---  | | |   -------  |
| | |       |  |       | |  |            |       | | | |  |       | |
| | | Node  +--+       | |  |            | Node  +--------+ Node  | |
| | |<1.1.2>|  |       | |  |            |<1.2.3>| | | |  |<2.1.2>| |
| | |       |   -------  |  |            |       | | | |  |       | |
| | |       |            |  |            |       | | | |  |       | |
| |  -------             |  |             -------  | | |   -------- |
|  ----------------------    ----------------------  | |            |
|                                                    | |            |
+----------------------------------------------------+ +------------+

 Figure 3: TIPC Network Topology 




4.1.1.  Network

The top level is the TIPC network as such. This is the ensemble of all nodes interconnected via TIPC, i.e. the domain where a node can reach another node by using a TIPC network address. A node wanting to communicate with another node within the network, irrespective of its location in the network hierarchy, must have a direct link to that node. There is no routing in TIPC, i.e., a message can not pass from one node to another via an intermediate node.

It is possible to create distinct, isolated networks, even on the same LAN, reusing the same network addresses, by assigning each network a Network Identity. This identity is not an address, and only serves the purpose of isolating networks from each other. Networks with different identities cannot communicate with each other via TIPC.




4.1.2.  Zone

It may be convenient for a system administrator to subdivide the nodes in a network into groups, or Zones, by assigning each zone a Zone Identity. A zone identity must be unique and within the numeric range [1,255].




4.1.3.  Cluster

The nodes within a zone may further be grouped into Clusters, by assigning them a Cluster Identity. A cluster identity must be unique within the zone, and within the numeric range [1,4095].




4.1.4.  Node

A cluster consists of individual Nodes, each having a unique Node Identity within the cluster. A node identity must be within the numeric range [1,4095].




4.2.  Link

The communication channel between a pair of nodes is called a Link. A link delivers units of data between the nodes with guaranteed and ordered delivery. A link is also actively supervised, and will be declared faulty if no traffic has been received from the other endpoint within a configurable amount of time.

There may be many working links between a pair of nodes, but only two links may be actively used for data transport at any moment in time.




4.3.  Port

The endpoint of all data traffic inside each node is called a Port, typically accessible to its users via a standard socket API.




4.4.  Message

The fundamental unit of information exchanged between TIPC ports or TIPC subsystems is called a Message.




4.4.1.  Taxonomy

TIPC messages fall into two main classes.

A "payload message" carries application-specified content between applications, or between applications and TIPC services.

An "internal message" carries TIPC-specified content between TIPC subsystems.

Messages are further categorized based on their use, as indicated below:


User   User Name             Purpose                         Class
----   ---------             -------                         -----
0      LOW_IMPORTANCE        Low Importance Data             payload
1      MEDIUM_IMPORTANCE     Medium Importance Data          payload
2      HIGH_IMPORTANCE       High Importance Data            payload
3      CRITICAL_IMPORTANCE   Critical Importance Data        payload
4      USER_TYPE_4           Reserved for future use         n/a
5      BCAST_PROTOCOL        Broadcast Link Protocol         internal
6      MSG_BUNDLER           Message Bundler Protocol        internal
7      LINK_PROTOCOL         Link State Protocol             internal
8      CONN_MANAGER          Connection Manager              internal
9      USER_TYPE_9           Reserved for future use         n/a
10     CHANGEOVER_PROTOCOL   Link Changeover Protocol        internal
11     NAME_DISTRIBUTOR      Name Table Update Protocol      internal
12     MSG_FRAGMENTER        Message Fragmentation Protocol  internal
13     LINK_DISCOVER         Neighbor Detection Protocol     internal
14     USER_TYPE_14          Reserved for future use         n/a
15     USER_TYPE_15          Reserved for future use         n/a

 Figure 4: TIPC Message Types 




4.4.2.  Format

Every TIPC message consists of a message header and a data part.

The message header format is user-dependent, and ranges in length from 6 to 11 words. The content of each word in the header is stored as a single 32-bit integer coded in network byte order. A small number of fields are common to all message header formats; the remaining fields are either unique to a single user or utilized by multiple users.

The format of the data part of a message is user-dependent, and ranges in length from 0 to 66,000 bytes.

The message header format and data format for each message user are described in detail in the section describing the message's use.




4.5.  Addressing




4.5.1.  Location Transparency

TIPC provides two service address types, Port Name and Port Name Sequence, to support location transparency, and two physical address types, Network Address and Port Identity, to be used when physical location knowledge is necessary for the user.




4.5.2.  Network Address

A physical entity within a network is identified internally by a TIPC Network Address. This address is a 32-bit integer, structured into three fields: zone (8 MSB), cluster (12 bits), and node (12 LSB). The address is only filled in with as much information as is relevant for the entity concerned, e.g. a zone may be identified as 0x03000000 (<3.0.0>), a cluster as 0x03001000 (<3.1.0>), and a node as 0x03001005 (<3.1.5>). Any of these formats is sufficient for the TIPC routing function to find a valid destination for a message.
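As an illustration, the address layout above can be sketched with a few helpers (the helper names are hypothetical; the bit layout itself is as specified):

```python
def tipc_addr(zone, cluster, node):
    # Pack <Z.C.N> into the 32-bit layout described above: zone in the
    # 8 most significant bits, cluster in the middle 12 bits, and node
    # in the 12 least significant bits.
    return (zone << 24) | (cluster << 12) | node

def tipc_zone(addr):
    return (addr >> 24) & 0xff

def tipc_cluster(addr):
    return (addr >> 12) & 0xfff

def tipc_node(addr):
    return addr & 0xfff
```

For example, tipc_addr(3, 1, 5) yields 0x03001005, i.e. the node address <3.1.5> mentioned above.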




4.5.3.  Port Identity

A Port Identity is produced internally by TIPC when a port is created, and is only valid as long as that physical instance of the port exists. It consists of two 32-bit integers. The first is a random number with a period of 2^31-1; the second is a fully qualified network address with the internal format described earlier. A port identity may be used in the same way as a port name, for connectionless communication or connection setup, as long as the user is aware of its limitations. The main advantage of using this address type over a port name is that it avoids the potentially expensive binding operation in the destination port, something which may be desirable for performance reasons.




4.5.4.  Port Name

A port name is a persistent address typically used for connectionless communication and for setting up connections. Binding a port name to a port roughly corresponds to binding a socket to a port number in TCP, except that the port name is unique and has validity for the whole publishing scope indicated in the bind operation, not only for a specific node. This means that no network address has to be given by the caller when setting up a connection, unless the caller explicitly wants to reach a certain node, cluster or zone.

A port name consists of two 32-bit integers. The first integer is called the Name Type, and typically identifies a certain service type or functionality. The second integer is called the Name Instance, and is used as a key for accessing a certain instance of the requested service.

The type/instance structure of a port name helps support both service partitioning and service load sharing.

When a port name is used as destination address for a message, it must be translated by TIPC to a port identity before it can reach its destination. This translation is performed on a node within the lookup scope indicated along with the port name.




4.5.5.  Port Name Sequence

To give further support for service partitioning, TIPC also provides an address type called Port Name Sequence, or just Name Sequence. This is a three-integer structure defining a range of port names, i.e. a name type plus the lower and upper bounds of the range. By allowing a port to bind to a sequence, instead of just an individual port name, it is possible to partition the service's range of responsibility into sub-ranges, without having to create a vast number of ports to do so.

There are very few limitations on how name sequences may be bound to ports. One may bind many different sequences, or many instances of the same sequence, to the same port, to different ports on the same node, or to different ports anywhere in the cluster or zone. The only restriction, in reality imposed by the implementation complexity it would involve, is that no partially overlapping sequences of the same name type may exist within the same publishing scope.
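The overlap restriction can be sketched as a simple predicate (the function name is hypothetical; range bounds are inclusive):

```python
def sequences_conflict(seq_a, seq_b):
    # seq_a, seq_b: (lower, upper) bounds of two name sequences of the
    # same name type within the same publishing scope. Identical and
    # fully disjoint ranges are permitted; partial overlap is not.
    (la, ua), (lb, ub) = seq_a, seq_b
    overlapping = la <= ub and lb <= ua
    identical = seq_a == seq_b
    return overlapping and not identical
```

So (0,9) and (10,19) may coexist, as may two instances of (0,9), while (0,9) and (5,14) may not.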




                                    ---------------
                                   | Partition B   |
                                   |               |
                                   O bind(type: 17 |
 -----------------                 |      lower:10 |
|                 |                |      upper:19)|
|send(type:    17 |                 ---------------
|     instance:7) O------+
|                 |      |          ---------------
|                 |      |         | Partition A   |
 -----------------       |         |               |
                         +-------->O bind(type: 17 |
                                   |      lower:0  |
                                   |      upper:9  |
                                    ---------------

 Figure 5: Service addressing, using port name and port name sequence 

When a port name is used as a destination address it is never used alone, contrary to what is indicated in Figure 5 (Service addressing, using port name and port name sequence). It has to be accompanied by a network address stating the scope and policy for the lookup of the port name. This will be described later.




4.5.6.  Multicast Addressing

The concept of service addressing is also used to provide multicast functionality. If the sender of a message indicates a port name sequence instead of a port name, a replica of the message will be sent to all ports bound to a name sequence fully or partially overlapping with the sequence indicated.



                                     ---------------
                                    | Partition B   |
                                    |               |
                          +-------->O bind(type: 17 |
  -----------------       |         |      lower:10 |
 |                 |      |         |      upper:19)|
 |send(type: 17    |      |          ---------------
 |     lower:7     O------+
 |     upper 13)   |      |          ---------------
 |                 |      |         | Partition A   |
  -----------------       |         |               |
                          +-------->O bind(type: 17 |
                                    |      lower:0  |
                                    |      upper:9  |
                                     ---------------

 Figure 6: service multicast, using port name sequence 

Only one replica of the message will be sent to each identified target port, even if it is bound to more than one overlapping name sequence.

Whenever possible and advantageous, this function makes use of the reliable cluster broadcast service also supported by TIPC.
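The target selection described above can be sketched as follows (the function name is hypothetical; bindings are (port, lower, upper) tuples for one name type):

```python
def multicast_targets(msg_lower, msg_upper, bindings):
    # One replica per port: a port bound to several overlapping name
    # sequences is still selected only once (hence the set).
    return {port for (port, lower, upper) in bindings
            if lower <= msg_upper and msg_lower <= upper}
```

With the bindings of Figure 6, a message sent to the sequence (17, 7, 13) reaches both partitions: multicast_targets(7, 13, [("A", 0, 9), ("B", 10, 19)]) yields {"A", "B"}.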




4.5.7.  Publishing Scope

The default visibility scope of a published (bound) port name is the local cluster. If the publication issuer wants to give it some other visibility, this must be indicated explicitly when binding the port. The scopes available are:

Value     Meaning
-----     -------
1         Visibility within whole own zone
2         Visibility within whole own cluster
3         Visibility limited to own node




4.5.8.  Lookup Policies

When a port name is looked up in the TIPC internal naming table for translation to a port identity, the following rules apply:

If the indicated lookup domain is <Z.C.N>, the lookup algorithm must choose a matching publication from that particular node. If nothing is found on the given node, it must give up and reject the request, even if other matching publications exist within the zone.

If the lookup domain is <Z.C.0>, the algorithm must select round-robin among all matching publications within that cluster, treating node local publications no different than the others. If nothing is found within the given cluster, it must give up and reject the request, even if other matching publications exist within the zone. Note here that if the sender node is not part of the lookup domain, there may be cases where the message is redirected to a third node after lookup. E.g., if a node <Z.C1.N> sends a message with lookup domain <Z.C2.0>, the first lookup will happen on a node in cluster C2, which may quite well redirect the message to a third node in that cluster.

If the lookup domain is <Z.0.0>, the algorithm must select round-robin among all concerned publications within that zone, treating node or cluster local publications no different than the others. If nothing is found, it must give up and reject the request.

A lookup domain of <0.0.0> means that the nearest found publication must be selected. First a lookup with scope <own zone.own cluster.own node> is attempted. If that fails, a lookup with the scope <own zone.own cluster.0> is tried, and finally, if that fails, a lookup with the scope <own zone.0.0>. If that fails the request must be rejected.

Round-robin based lookup means that the algorithm must select equally among all the matching publications within the given scope. In practice this means stepping forward in a circular list referring to those publications between each lookup.
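The circular-list stepping can be sketched like this (the class name is hypothetical):

```python
class RoundRobinPublications:
    # Circular list of matching publications; each lookup returns the
    # next entry, so all matches are selected equally over time.
    def __init__(self, publications):
        self.pubs = list(publications)
        self.next = 0

    def lookup(self):
        if not self.pubs:
            return None          # nothing matches: reject the message
        pub = self.pubs[self.next]
        self.next = (self.next + 1) % len(self.pubs)
        return pub
```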




5.  Port-Based Communication

All application communication through TIPC is done by passing data ("payload") messages between a sender port and a receiver port.




5.1.  Payload Messages




5.1.1.  Payload Message Types

TIPC supports four different payload message types:

Figure 8 (TIPC Data Message Types) presents the message types and their corresponding type identifiers in the message header.




5.1.2.  Payload Message Header Sizes

The header is organized so that parts of it can be omitted whenever the corresponding information is dispensable. The following header sizes are used:




5.1.3.  Payload Message Format

All payload messages share a common base header format. The only difference between the message types is how many bytes of the base header they need to use, as described in Section 5.1.2 (Payload Message Header Sizes).




    3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
    1 0 9 8 7 6 5 4|3 2 1 0 9 8 7 6|5 4 3 2 1 0 9 8|7 6 5 4 3 2 1 0
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w0:| Ver | User  | Hsize |N|D|S|R|          Message Size           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w1:|Mtype| Error |Reroute|Lsc| RES |     Broadcast Acknowledge     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w2:|        Link Acknowledge       |        Link Sequence          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w3:|                         Previous Node                         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w4:|                        Originating Port                       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w5:|              Destination Port / Destination Network           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w6:|                        Originating Node                       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w7:|                        Destination Node                       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w8:|                            Name Type                          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w9:|                 Name Instance / Name Sequence Lower           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
wA:|                      Name Sequence Upper                      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   \                                                               \
   /                                                               /
   \                                                               \
   /                                                               /
   \                             Data                              \
   /                                                               /
   \                                                               \
   /                                                               /
   \                                                               \
   /                                                               /
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

      w0-w5: Required. w6-wA: Conditional. Data: Optional

 Figure 7: TIPC Payload Message Format 

The interpretation of the fields of the message is as follows:
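As an illustration, word 0 (w0) can be packed as shown below, assuming the bit widths given in Figure 7: a 3-bit version, 4-bit user, 4-bit header size, the four flag bits N/D/S/R, and a 17-bit message size (the helper name is hypothetical):

```python
def pack_w0(version, user, hsize, n, d, s, r, msg_size):
    # Bit layout from Figure 7, bit 31 leftmost. The resulting 32-bit
    # integer is transmitted in network byte order.
    assert version < 8 and user < 16 and hsize < 16 and msg_size < 2**17
    return (version << 29) | (user << 25) | (hsize << 21) | \
           (n << 20) | (d << 19) | (s << 18) | (r << 17) | msg_size
```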




5.1.4.  Payload Message Delivery

CONN_MSG and DIRECT_MSG messages are delivered directly to the destination port. If this is impossible, e.g. because the destination port or node has disappeared, the message is dropped or rejected back to the sender, depending on the setting of the 'dest_droppable' bit.

NAMED_MSG messages are subject to a name table lookup before the final destination can be determined. The following procedure applies for finding the correct destination:

  1. Initially, the 'destination port' and 'destination node' fields of the message header are empty.
  2. If the sender node is within the lookup domain of the destination address (e.g., <1.1.1> is within domain <1.1.0>), a lookup is performed on that node. Only publications which have been published from a node that is also located within the requested domain, and which have a publication scope comprising the sender node are considered. (E.g., a publication with scope 'cluster' or 'zone' from node <1.1.3> is considered. A publication with scope 'node' from the same node is not seen. A publication with scope 'zone' from <1.2.1> is seen, but ignored.).
  3. If a matching publication is found, the destination port number and node address are added to the header, and the message is sent to the destination node for delivery. If no matching publication is found, the message is dropped or rejected back to the sender.
  4. If the sender node is outside the lookup domain, the message is forwarded to a node within that domain for further lookup. On the receiving node, the lookup procedure described in the previous steps is performed.
  5. If lookup was successful, and the message has reached the found destination node, it is now delivered to the found destination port, if it is still there.
  6. If the destination port has disappeared, the destination port and destination node fields are cleared from the header, and a new lookup is performed on the current node, still considering the originally indicated lookup domain, and following the steps described above. The original lookup domain is recreated based on the complete current node address and the 2-bit 'lookup scope', which was conveyed in the message header. At the same time the 'reroute' counter is incremented. Up to six such lookup attempts will be made before the message is dropped or rejected.
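The retry rule of steps 3 and 6 can be sketched like this ('lookups' is a hypothetical iterator of name table results, where None means the previously found destination port had disappeared):

```python
MAX_REROUTES = 6   # step 6: up to six lookup attempts in total

def resolve_named(lookups):
    # Returns a (port, node) destination, or "rejected" once the
    # reroute counter runs out or no match exists at all.
    reroutes = 0
    for dest in lookups:
        if dest is not None:
            return dest
        reroutes += 1
        if reroutes >= MAX_REROUTES:
            break
    return "rejected"
```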

MCAST_MSG messages are also subject to name table lookups before the final destinations can be determined. The following procedure applies for finding the correct destinations:

  1. Initially, the Destination Port and Destination Node fields of the message header are empty.
  2. A first lookup is performed, unconditionally, on the sending node. Here, all node local matching destination ports are identified, and a copy of the message is sent to each of them.
  3. At the same time, the lookup identifies if there are any publications from external, cluster local, nodes. If so, a copy of the message is sent via the broadcast link to all nodes in the cluster.
  4. At each destination node, a final lookup is made, once again to identify node local destination ports. A copy of the message is sent to each of them.
  5. If any of the found destination ports have disappeared, or are overloaded, the corresponding message copy is silently dropped.




5.2.  Connectionless Communication

There are two types of unicast connectionless communication, name addressed and direct addressed message transport. Both these have already been described.

The main advantage of this type of communication is that there is no need for a connection setup procedure. Another is that messages can be sent from one to many destinations, as well as from many to one. A third is that a destination port can easily move around, without the potential senders needing to be aware.

The main disadvantage of this kind of communication is the lack of flow control. It is very easy to overwhelm a single destination from many sources.




5.3.  Connection-based Communication

User Connections are designed to be lightweight because of their potentially huge number, and because it must be possible to establish and shut down thousands of connections per second on a node.




5.3.1.  Connection Setup

How a connection is established and terminated is not defined by the protocol; only how connections are supervised and, if necessary, aborted. The rest is left to the implementation to define. The following figures show two examples of this.




 -------------------                -------------------
| Client            |              | Server            |
|                   |              |                   |
| (3)create(cport)  |              | (1)create(suport) |
| (4)send(type:17,  |------------->0 (2)bind(type: 17, |
|         inst: 7)  0<------+      |\        lower:0   |
| (8)lconnect(sport)|       |      | \       upper:9)  |
|                   |       |      | /                 |
|                   |       |      |/(5)create(sport)  |
|                   |       +------0 (6)lconnect(cport)|
|                   |              | (7)send()         |
 -------------------                -------------------

 Figure 11: Example of user defined establishment of a connection 

In the example illustrated above the user defines how to set up the connection. In this case, the client starts by sending one payload-carrying NAMED_MSG message to the setup port (suport)(4). The setup server receives the message, and reads its contents and the client port (cport) identity. It then creates a new port (sport)(5), and connects it to the client port's identity(6). The lconnect() call is a purely node local operation in this case, and the connection is not fully established until the server has fulfilled the request and sent a response payload-carrying CONN_MSG message back to the client port(7). Upon reception of the response message the client reads the server port's identity and performs an lconnect() on it(8). This way, a connection has been established without exchanging a single protocol message.




 --------------------                -------------------
| Client             |              | Server            |
|                    |              | (1)create(suport) |
| (4)create(cport)   |   "SYN"      | (2)bind(type: 17, |
| (5)connect(type:17,|------------->0         lower:0   |
| (9)        inst: 7)0<------+     /|         upper:9)  |
|                    |       |    / | (3)accept()       |
|                    |    (7)|    \ | (8)               |
|                    |       |  (6)\|                   |
|                    |       +------0 (9)recv()         |
|                    |      "SYN"   |                   |
 --------------------                -------------------

 Figure 12: TCP-style connection setup 

The figure above shows an example where the user API-adapter supports a TCP-style connection setup, using hidden protocol messages to fulfil the connection. The client starts by calling connect()(5), causing the API to send an empty NAMED_MSG message ("SYN" in TCP terminology) to the setup port. Upon reception, the API-adapter at the server side creates the server port, performs a local lconnect()(6) on it towards the client port, and sends an empty CONN_MSG ("SYN") back to the client port (7). The accept() call in the server then returns, and the server can start waiting for messages (8). When the second SYN message arrives in the client, the API-adapter performs a node local lconnect() and lets the original connect() call return (9).

Note the difference between this protocol and the real TCP connection setup protocol. In our case there is no need for SYN_ACK messages, because the transport medium between the client and the server (the node-to-node link) is reliable.

Also note from these examples that it is possible to retain full compatibility between these two very different ways of establishing a connection. Before the connection is established, a TCP-style client or server should interpret a payload message from a user-controlled counterpart as an implicit SYN, and perform an lconnect() before queueing the message for reading by the user. The other way around, a user-controlled client or server must perform an lconnect() when receiving the empty message from its TCP-style counterpart.




5.3.2.  Connection Shutdown




       -------------------                -------------------
      | Client            |              | Server            |
      |                   |              |                   |
      |                   |              |                   |
      |          lclose() 0              0 lclose()          |
      |                   |              |                   |
      |                   |              |                   |
      |                   |              |                   |
       -------------------                -------------------

 Figure 13: Example of user defined shutdown of a connection 

The figure above shows the simplest possible user defined connection shutdown scheme. If it is inherent in the user protocol when the connection should be closed, both parties will know the right moment to perform a "node local close" (lclose()) and no protocol messages need to be involved.




       --------------------                -------------------
      | Client             |              | Server            |
      |                    |    "FIN"     |                   |
      |          (1)close()0------------->0(2)close()         |
      |                    |              |                   |
      |                    |              |                   |
      |                    |              |                   |
       --------------------                -------------------

 Figure 14: TCP-style shutdown of a connection 

In the figure above a TCP-style connection close() is illustrated. This is simpler than the connection setup case, because the built-in connection abortion mechanism of TIPC can be used. When the client calls close() (1) TIPC must delete the client port. As will be described later, deleting a connected port has the effect that a CRITICAL_IMPORTANCE/CONN_MSG ("FIN" in TCP terminology) with error code NO_REMOTE_PORT is sent to the other end. Reception of such a message means that TIPC at the receiving side must shut down the connection, and this must happen even before the server is notified. The server must then call close() (2), not to close the connection, but to delete the port. TIPC does not send any "FIN" this time; the server port is already disconnected, and the client port is gone anyway. If both endpoints call close() simultaneously, two "FIN" messages will cross each other, but at reception they will have no effect, since there is no destination port, and they must be discarded by TIPC.

Note even here the automatic compatibility between a user-defined peer and a TCP-style one: any user, no matter the user API, must at any moment be ready to receive a "connection aborted" indication, and this is what in reality happens here.




5.3.3.  Connection Abortion

When a connected port receives an indication from the TIPC link layer that it has lost contact with its peer node, it must immediately disconnect itself and send an empty CONN_MSG/NO_REMOTE_NODE to its owner process.

When a connected port is deleted without a preceding disconnect() call from the user it must immediately disconnect itself and send an empty CONN_MSG/NO_REMOTE_PORT to its peer port. This may happen when the owner process crashes, and the OS is reclaiming its resources.

When a connected port receives a timeout call, and has remained in CONNECTED/PROBING state since the previous timer expiration, it must immediately disconnect itself and send an empty CONN_MSG/NO_REMOTE_PORT to its owner process.

When a connected port receives a CONN_MSG with error code, it must immediately disconnect itself and deliver the message to its owner process.

When a connected port receives a CONN_MSG from a port other than its peer port, it must immediately send an empty CONN_MSG/NO_CONNECTION to the originating port of that message.

When TIPC in a node receives a CONN_MSG/TIPC_OK for which it finds no destination port, it must immediately send an empty CONN_MSG/NO_REMOTE_PORT back to the originating port.

When a bound port receives a CONN_MSG from any port, it must immediately send an empty CONN_MSG/NO_CONNECTION to the originating port.




5.3.4.  Connection Supervision

A connection also implies automatic supervision of the existence and state of the endpoints.

In almost all practical cases the mechanisms for resource cleanup after process failure, and for supervision of peer nodes at the link level, are sufficient for immediate failure detection and abortion of connections.

However, because of the non-specified connection setup procedure of TIPC, there exist cases where a connection may remain incomplete. This may happen if the client in a user-defined setup/shutdown scheme forgets to call lconnect() (see "Example of user defined establishment of a connection"), and then deletes itself, or if one of the parties fails to call lclose() (see "Example of user defined shutdown of a connection"). These cases are considered very rare, and should normally have no serious consequences for the availability of the system, so a slow background timer is deemed sufficient to discover such situations.

When a connection is established each port starts a timer, whose purpose is to check the status of the connection. It does this by regularly (a typical configured interval is once an hour) sending a CONN_PROBE message to the peer port of the connection. The probe has two tasks: first, to inform that the sender is still alive and connected; second, to inquire about the state of the recipient.

A received CONN_PROBE or CONN_PROBE_REPLY MUST be immediately responded to according to the following scheme:




---------------------------------------------------------------------
|                              |        Received Message Type       |
|                              |-----------------+------------------|
|                              |   CONN_PROBE    | CONN_PROBE_REPLY |
|                              |                 |                  |
|==============================|====================================|
|     |             Multi-hop  |        CRITICAL_IMPORTANCE+        |
|     |             seqno wrong|        TIPC_COMM_ERROR             |
|     |            ------------|-----------------+------------------|
|     | Connected   Multi-hop  |                 |                  |
|     | to sender   seqno ok   |                 |                  |
|     | port       ------------|                 |                  |
|     |             Single hop | CONN_PROBE_REPLY|  No Response     |
|     |------------------------|                 |                  |
|     | Not connected,         |                 |                  |
|Rece-| not bound,             |                 |                  |
|ving |------------------------|-----------------+------------------|
|Port | Connected to           |                                    |
|State| other port             |        CRITICAL_IMPORTANCE+        |
|     |------------------------|        TIPC_NOT_CONNECTED          |
|     | Bound to               |                                    |
|     | port name sequence     |                                    |
|     |------------------------|------------------------------------|
|     | Recv. node available,  |        CRITICAL_IMPORTANCE+        |
|     | recv. port non-existent|        TIPC_NO_REMOTE_PORT         |
|     |------------------------|------------------------------------|
|     | Receiving node         |        CRITICAL_IMPORTANCE+        |
|     | unavailable            |        TIPC_NO_REMOTE_NODE         |
---------------------------------------------------------------------

 Figure 15: Response to probe/probe replies vs port state 

If all is well, the receiving port will answer with a probe reply message, and the probing port will rest for another interval. It is inherent in the protocol that one of the ports - the one that connected last - will normally remain passive in this relationship. Each time its timer expires it will find that it has just received and replied to a probe, so it will never have any reason to send a probe of its own.

When an error is encountered, one or two empty CONN_MSG messages are generated: one to indicate a connection abortion in the receiving port, if it exists, and one to do the same in the sending port.

The state machine for a port during this message exchange is described in "Connection-based Communication".
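The response scheme of Figure 15 can be sketched as a simple dispatch function. The following is an illustrative sketch only; the PortState enumeration and the function name are inventions of this example, not part of the protocol.

```python
from enum import Enum, auto

class PortState(Enum):                  # receiving port's state (Figure 15 rows)
    CONNECTED_TO_SENDER = auto()
    NOT_CONNECTED = auto()              # not connected, not bound
    CONNECTED_TO_OTHER = auto()
    BOUND_TO_NAME_SEQUENCE = auto()
    PORT_NON_EXISTENT = auto()          # node available, port is not
    NODE_UNAVAILABLE = auto()

def respond_to_probe(state, msg_type, multi_hop=False, seqno_ok=True):
    """Return the message to send back, or None for no response.

    msg_type is 'CONN_PROBE' or 'CONN_PROBE_REPLY'; all error responses
    are sent at CRITICAL_IMPORTANCE.
    """
    if state in (PortState.CONNECTED_TO_SENDER, PortState.NOT_CONNECTED):
        if (state is PortState.CONNECTED_TO_SENDER
                and multi_hop and not seqno_ok):
            return 'CONN_MSG/TIPC_COMM_ERROR'
        return 'CONN_PROBE_REPLY' if msg_type == 'CONN_PROBE' else None
    if state in (PortState.CONNECTED_TO_OTHER,
                 PortState.BOUND_TO_NAME_SEQUENCE):
        return 'CONN_MSG/TIPC_NOT_CONNECTED'
    if state is PortState.PORT_NON_EXISTENT:
        return 'CONN_MSG/TIPC_NO_REMOTE_PORT'
    return 'CONN_MSG/TIPC_NO_REMOTE_NODE'
```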



 TOC 

5.3.4.1.  Connection Manager

Although a TIPC internal user, the Connection Manager is special because it uses the 36-byte payload message header format instead of the 40-byte internal format. This is because its messages must contain a destination port and an originating port.

The following message types are valid for Connection Manager:

User: 8 (CONN_MANAGER).

Message Types:




ID Value   Meaning
--------   ----------
0          Probe to test existence of peer      (CONN_PROBE)
1          Reply to probe, confirming existence (CONN_PROBE_REPLY)
2          Acknowledge N Messages               (MSG_ACK)
 Figure 16: Connection Supervision Message Types 

MSG_ACK messages are used for transport-level congestion control, and carry one 32-bit integer in network byte order as data. This indicates the number of messages acknowledged, i.e. actually read by the port sending the acknowledge. This information makes it possible for the other port to keep track of the number of sent, but not yet received and handled, messages, and to take action if this value surpasses a certain threshold.

The details about why and when these messages are sent are described in "Connection Supervision".



 TOC 

5.3.5.  Flow Control

The primary purpose of the end-to-end flow control mechanism at the connection level is to stop a sending process from overrunning a slower receiving process. Other tasks, such as bearer, link, network, and node congestion control, are handled by other mechanisms in TIPC. Because of this, the algorithm can be kept very simple. It works as follows:

The message sender (the API adapter) keeps one counter, SENT_CNT, counting messages he has sent but which have not yet been acknowledged. The counter is incremented for each sent message.

The receiver counts the number of messages he delivers to the user, ignoring any messages pending in the process in-queue. For every N messages, he sends back a CONN_MANAGER/MSG_ACK containing this number in its data part.

When the sender receives the acknowledge message, he subtracts N from SENT_CNT, and stores the new value.

When the sender wants to send a new message he must first check the value of SENT_CNT, and if this exceeds a certain limit, he must abstain from sending the message. A typical measure to take when this happens is to block the sending process until SENT_CNT is under the limit again, but this will be API-dependent.

The recommended value for the send window N is at least 200 messages, and the limit for SENT_CNT should be at least 2*N.
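The counter handling above can be sketched as follows. This is an illustrative sketch under the recommended values; the Sender and Receiver classes are inventions of this example, standing in for the API adapters on each side.

```python
N = 200                  # receiver acknowledges every N delivered messages
SENT_LIMIT = 2 * N       # sender stops when SENT_CNT reaches this limit

class Sender:
    def __init__(self):
        self.sent_cnt = 0               # sent but not yet acknowledged

    def can_send(self):
        return self.sent_cnt < SENT_LIMIT

    def on_send(self):
        self.sent_cnt += 1              # incremented for each sent message

    def on_msg_ack(self, n_acked):
        self.sent_cnt -= n_acked        # subtract N on CONN_MANAGER/MSG_ACK

class Receiver:
    def __init__(self, sender):
        self.delivered = 0              # messages handed to the user
        self.sender = sender

    def on_deliver(self):
        self.delivered += 1
        if self.delivered == N:
            # send CONN_MANAGER/MSG_ACK carrying the count as its data
            self.sender.on_msg_ack(self.delivered)
            self.delivered = 0
```

What a sender does when can_send() returns False (block the process, return an error, etc.) is API-dependent, as noted above.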



 TOC 

5.3.6.  Sequentiality Check

Inter-cluster connection-based messages, and intra-cluster messages between cluster nodes, may need to be routed via intermediate nodes if there is no direct link between the two. This implies a small, but not negligible, risk that messages may be lost or re-ordered; for example, an intermediate node may crash, or it may have changed its routing table in the interval between the messages. A connection level sequence number is used to detect such problems, and this must be checked for each message received on the connection. If a message arrives out of sequence, no attempt at re-sequencing should be made. The port discovering the sequence error must immediately abort the connection by sending one empty CONN_MSG/COMM_ERROR message to itself, and one to the peer port.

The sequence number must not be checked on single-hop connections, where the link protocol guarantees that no such errors can occur.

The first message sent on a connection has the sequence number 42.
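The receive-side check can be sketched as below. This is an illustrative sketch only: the Connection class is an invention of this example, the initial value 42 is from the text, and 16-bit wrap-around arithmetic is an assumption here.

```python
FIRST_CONN_SEQNO = 42    # the first message on a connection carries this

class Connection:
    def __init__(self):
        self.expected = FIRST_CONN_SEQNO
        self.aborted = False

    def on_multi_hop_msg(self, seqno):
        """Check the connection level sequence number of a received
        multi-hop message; abort the connection on any mismatch."""
        if seqno != self.expected:
            # No re-sequencing is attempted: one empty CONN_MSG/COMM_ERROR
            # goes to the own port and one to the peer port (not shown).
            self.aborted = True
            return False
        self.expected = (self.expected + 1) % 0x10000   # 16-bit assumption
        return True
```

On single-hop connections this check is skipped entirely, since the link protocol already guarantees ordered, loss-free delivery.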



 TOC 

5.4.  Multicast Communication

Section 4.5.6 (Multicast Addressing) describes the concept of multicast addressing in TIPC. Section 5.1.4 (Payload Message Delivery) describes in detail how multicast address lookup is performed.

Two additional details should be noted:

The addressing concept makes it both logical and easy to permit such 'zone multicast' and 'redirected multicast' features, but at least for now it is deemed too risky to implement this.



 TOC 

6.  Name Table

The Name Table is a distributed database that keeps all port names that have been published in the network. It is used for translation from a port name to a corresponding port identity, or from a port name sequence to a corresponding set of port identities. In order to achieve acceptable translation times and fault tolerance, a replica of the table must exist on each node. The table replicas are not exactly identical; instead each replica keeps exactly those publications that have been published in the domain to which the replica belongs, provided that it is directly reachable (via a link) from the publishing node.



 TOC 

6.1.  Distributed Name Table Protocol Overview

The replicas of the table must be kept consistent with the other instances within the same domain, and there must be no unnecessary delays in the synchronization between neighbouring table instances when a port name sequence is published or withdrawn. Inconsistencies are only tolerated for the short timespan it takes for update messages to reach the neighbouring nodes, or for the time it takes for a node to detect that a neighbouring node has disappeared.



 TOC 

6.2.  Name Distributor Message Processing

When a node establishes contact with a new node in the cluster or the zone, it must immediately send out the necessary number of NAME_DISTRIBUTOR/PUBLICATION messages to that node, in order to let it update its local Name Table instance.

When a node loses contact with another node, it must immediately purge its Name Table of all entries pertaining to that node.

When a port name sequence is published on a node, TIPC must immediately send out a NAME_DISTRIBUTOR/PUBLICATION message to all nodes within the publishing scope, in order to have them update their tables.

When a port name sequence is withdrawn on a node, TIPC must immediately send out a NAME_DISTRIBUTOR/WITHDRAWAL message to all nodes within the publishing scope, in order to have them remove the corresponding entry from their tables.

Brief, transient table inconsistencies may occur, despite the above, and are handled as follows: If a successful lookup on one node leads to a non-existing port on another node, the lookup is repeated on that node. If this lookup succeeds, but again leads to a non-existing port, another lookup is done. This procedure can be repeated up to six times before giving up and rejecting the message.
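The bounded retry procedure above can be sketched as follows. The lookup and send callables here are hypothetical stand-ins for the real name translation and message transfer operations; only the retry bound of six attempts comes from the text.

```python
MAX_LOOKUPS = 6          # attempts before the message is rejected

def send_by_name(name, lookup, send):
    """Deliver a message addressed by port name, retrying the lookup when
    a transient table inconsistency points at a vanished port.

    lookup(name) -> port identity, or None if no match exists.
    send(port)   -> True if the port existed and the message was delivered.
    """
    for _ in range(MAX_LOOKUPS):
        port = lookup(name)
        if port is None:
            break                       # no translation at all: give up
        if send(port):
            return True                 # delivered successfully
        # destination port turned out to be non-existent: look up again
    return False                        # reject the message
```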



 TOC 

6.3.  Name Distributor Message Format

The format of the name distribution message used to update remote name tables is shown in Figure 17 (Name Table Distributor Message Format).




    3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
    1 0 9 8 7 6 5 4|3 2 1 0 9 8 7 6|5 4 3 2 1 0 9 8|7 6 5 4 3 2 1 0
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w0:| Ver | User  | Hsize |N|R|R|R|          Message Size           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w1:|Mtype|     RESERVED            |     Broadcast Acknowledge     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w2:|        Link Acknowledge       |        Link Sequence          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w3:|                         Previous Node                         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w4:|                        Originating Port                       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w5:|              Destination Port / Destination Network           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w6:|                        Originating Node                       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w7:|                        Destination Node                       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w8:|                            RESERVED                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w9:|   Item Size   |M|           RESERVED                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   \                                                               \
   /                  Data (list of name items)                    /
   \                                                               \
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

 Figure 17: Name Table Distributor Message Format 

The interpretation of the fields of the message is as follows:



 TOC 

6.4.  Name Publication Descriptor Format

The format of a name publication descriptor is shown in Figure 18 (Name Table Distribution Items). The full seven word format MUST be used by nodes in multi-cluster TIPC networks; nodes in single-cluster TIPC networks MAY use the shorter five word format. All fields of the descriptor MUST be stored in network byte order.




    3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
    1 0 9 8 7 6 5 4|3 2 1 0 9 8 7 6|5 4 3 2 1 0 9 8|7 6 5 4 3 2 1 0
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w0:|                              Type                             |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w1:|                          Lower Bound                          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w2:|                          Upper Bound                          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w3:|                            Reference                          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w4:|                              Key                              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w5:|                              Node                             |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w6:|                            RESERVED                   | Scope |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

 Figure 18: Name Table Distribution Items 

Type: The type part of the published port name sequence.

Lower: The lower part of the published port name sequence.

Upper: The upper part of the published port name sequence.

Reference: The reference part of the publishing port's identity.

Key: A value created by the publishing port.

Node: The node part of the publishing port's identity. If this field is not present it can be assumed to be the same as Originating Node.

Scope: The distribution scope of the published port name sequence. If this field is not present then it can be assumed to be cluster-wide.
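The packing of a descriptor per Figure 18 can be sketched with Python's struct module in network byte order. This is an illustrative sketch: the function name is invented, and placing Scope in the low-order bits of the final word follows the figure's layout but the exact field width is an assumption of this example.

```python
import struct

def pack_publication(ptype, lower, upper, reference, key,
                     node=None, scope=None):
    """Pack a name publication descriptor, network byte order throughout.

    The five word form omits Node and Scope (single-cluster networks MAY
    use it); multi-cluster networks MUST use the seven word form.
    """
    words = [ptype, lower, upper, reference, key]
    if node is not None:
        # Scope is assumed here to occupy the low 4 bits of the last word
        words += [node, scope & 0xF]
    return struct.pack('!%dI' % len(words), *words)
```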



 TOC 

7.  Links

This section discusses the operation of unicast links that carry messages from the originating node to a single destination node to which it has a direct path.

The operation of TIPC's broadcast link is described in Section 8 (Broadcast Link).



 TOC 

7.1.  TIPC Internal Header



 TOC 

7.1.1.  Internal Message Header Format




    3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w0:|vers |msg usr|hdr sz |n|resrv|            packet size          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w1:|m typ|   sequence gap          |       broadcast ack no        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w2:|link level ack no/bc gap after | link level/bc seqno/bc gap to |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w3:|                       previous node                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w4:|  last sent broadcast/fragm no | next sent pkt/ fragm msg no   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w5:|          session no           | res |r|berid|link prio|netpl|p|
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w6:|                      originating node                         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w7:|                      destination node                         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w8:|                  transport sequence number                    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w9:|   msg count/max packet        |       link tolerance          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   \                                                               \
   /                     User Specific Data                        /
   \                                                               \
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

 Figure 19: TIPC Internal Message Header Format 

The internal header has one format and one size, 40 bytes. Some fields are only relevant to some users, but for simplicity of understanding and presentation we show it as a single header format.



 TOC 

7.1.2.  Internal Message Header Fields Description

The first four words are almost identical to the corresponding part of the data message header. The differences are described next.

Next come the fields which are unique for the internal header, from word 4 and onwards.



 TOC 

7.2.  Link Creation



 TOC 

7.2.1.  Link Setup

TIPC automatically detects all neighbouring nodes that can be reached through an interface, and automatically establishes a link to each of those nodes, provided that the sender's bearer is configured to permit this.

This automatic configuration requires that TIPC be able to send Link Request messages to all possible receivers on that interface. This is easily done when the media type used by the interface supports some form of broadcast capability (e.g. Ethernet); other media types might require the use of a "replicast" facility. Support for manual configuration of links on interfaces that cannot support automatic neighbour discovery in any form is left for future study.

Whenever TIPC detects that a new interface has become active, it periodically broadcasts Link Request messages from that interface to other prospective members of the network, informing them of the node's existence. If a node that receives such a request determines that a link to the sender is required, it creates a new link endpoint and returns a unicast Link Response message to the sending node, which causes that node to create a corresponding link endpoint. The two link endpoints then begin the link activation process described in Section 7.2.2 (Link Activation).

The structure and semantics of Link Request and Link Response messages are described in Section 9 (Neighbor Detection).




                                                  -------------
                                                 | <1.1.3>     |
                                                 |             |
                 ucast(dest:<1.1.1>,orig:<1.1.3>)|             |
                <------------------------------- |             |
                                                 |             |
                                                  -------------
 --------------
| <1.1.1>      |
|              | bcast(orig:<1.1.1>,dest:<1.1.0>)
|              |-------------------------------->
|              |
|              |
 --------------
                                                  -------------
                 ucast(dest:<1.1.1>,orig:<1.1.2>)| <1.1.2>     |
                <------------------------------- |             |
                                                 |             |
                                                 |             |
                                                 |             |
                                                  -------------

 Figure 20: Neighbor Detection 

There are two reasons for the on-going broadcasting described above. First, it allows two nodes to discover each other even if the communication media between them is initially non-functional. (For example, in a dual-switch system one of the cables may be faulty or disconnected at start-up time, while the cluster is still fully connected and functional via the other switch.) The continuous discovery mechanism allows the missing links to be created once a working cable is inserted, without requiring a restart of any of the nodes. Second, it allows users to replace (hot-swap) an interface card with one having a different media address (e.g. a MAC address for Ethernet), again without having to restart the node. When a node receives a Link Request message its originating media address is compared with the one previously stored for that destination, and if they differ the old one is replaced, allowing the link activation process to begin using the new address.

Link Request broadcasting begins 125 msec after an interface is enabled, then repeats at an interval that doubles after each transmission until it reaches an upper limit of 2000 msec; thereafter, broadcasts occur every 2000 msec if there are no active links on the interface, or every 600,000 msec if there is at least one active link. The broadcasts continue at these rates as long as the node is up. This pattern of broadcasts ensures that a node broadcasts frequently when an interface is first enabled or when there is no connectivity on the interface, and very slowly once some amount of connectivity exists. Such an approach places the bulk of the burden of neighbour discovery on the node that is increasing its connectivity to the TIPC network, allowing nodes that are already fully connected to take a more passive role.
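The broadcast timing above can be sketched as a generator of successive delays. The intervals are those given in the text; the generator itself and the has_active_link callable are inventions of this example.

```python
def discovery_intervals(has_active_link):
    """Yield successive delays (in msec) before each Link Request
    broadcast on an interface, starting when the interface is enabled."""
    delay = 125
    while delay < 2000:
        yield delay
        delay *= 2                      # doubles after each transmission
    while True:
        # steady state: every 2 s without active links on the interface,
        # every 600 s once at least one link is active
        yield 600_000 if has_active_link() else 2000
```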

Note: This algorithm does not allow for rapid neighbour discovery in the event that a cluster is initially partitioned into two or more multi-node sections that later become able to communicate, as it can take up to 10 minutes for the partitions to discover one another. Further investigation is required to address this issue.

Each Link Request message contains a destination domain that indicates which neighbouring nodes are permitted to establish links to the transmitting interface; this value should be configurable on a per-interface basis. Typical settings include <0.0.0>, which permits connection to any node in the network, and <own_zone.own_cluster.0>, which permits connection to any node within the cluster.

A node receiving a Link Request message ensures that it belongs to the destination domain stated in the message, and that the Network Identity of the message is equal to its own. If so, and if a link does not already exist, it creates its end of the link and returns a unicast Link Response message back to the requesting node. This message then triggers the requesting node to create the other end of the link (if there is not one already), and the link activation phase then begins.



 TOC 

7.2.2.  Link Activation

Link activation and supervision are handled entirely by the generic part of the protocol, in contrast to the partially media-dependent neighbour detection protocol.

The following FSM describes how a link is activated and supervised.




 ---------------                               ---------------
|               |<--(CHECKPOINT == LAST_REC)--|               |
|               |                             |               |
|Working-Unknown|----TRAFFIC/ACTIVATE_MSG---->|Working-Working|
|               |                             |               |
|               |-------+      +-ACTIVATE_MSG>|               |
 ---------------         \    /                ------------A--
    |                     \  /                   |         |
    | NO TRAFFIC/          \/                 RESET_MSG  TRAFFIC/
    | NO PROBE             /\                    |     ACTIVATE_MSG
    | REPLY               /  \                   |         |
 ---V-----------         /    \                --V------------
|               |-------+      +--RESET_MSG-->|               |
|               |                             |               |
| Reset-Unknown |                             |  Reset-Reset  |
|               |----------RESET_MSG--------->|               |
|               |                             |               |
 -------------A-                               ---------------
  |           |
  | BLOCK/    | UNBLOCK/
  | CHANGEOVER| CHANGEOVER END
  | ORIG_MSG  |
 -V-------------
|               |
|               |
|    Blocked    |
|               |
|               |
 ---------------

 Figure 21: Link finite state machine 

A link endpoint's state is defined by its own state, combined with what is known about the peer endpoint's state. The following states exist:

Reset-Unknown

Own link endpoint reset, i.e. queues are emptied and sequence numbers are set back to their initial values. The state of the peer endpoint is unknown. LINK_PROTOCOL/RESET_MSG messages are sent periodically at CONTINUITY_INTERVAL to inform the peer about the own endpoint's state, and to force it to reset its own endpoint, if this has not already been done. If the peer endpoint is rebooting, or has reset for some other reason, it will sooner or later also reach the state Reset-Unknown, and start sending its own RESET_MSG messages periodically. At least one of the endpoints, and often both, will eventually receive a RESET_MSG and transfer to state Reset-Reset. If the peer is still active, i.e. in one of the states Working-Working or Working-Unknown, and has not yet detected the disturbance causing this endpoint to reset, it will sooner or later receive a RESET_MSG and transfer directly to state Reset-Reset. If a LINK_PROTOCOL/ACTIVATE_MSG message is received in this state, the link endpoint knows that the peer is already in state Reset-Reset, and can itself move directly on to state Working-Working. Any other messages are ignored in this state. CONTINUITY_INTERVAL is calculated as the smaller of LINK_TOLERANCE/4 and 0.5 s.

Reset-Reset

Own link endpoint reset, peer endpoint known to be reset, since the only way to reach this state is through receiving a RESET_MSG from the peer. The link endpoint periodically, at CONTINUITY_INTERVAL, sends ACTIVATE_MSG messages. This will eventually cause the peer to transfer to state Working-Working. The own endpoint will also transfer to state Working-Working as soon as any message which is not a RESET_MSG is received.

Working-Working

Own link endpoint working. Peer link endpoint known to be working, i.e. both can send and receive traffic messages. A periodic timer with the interval CONTINUITY_INTERVAL checks whether anything has been received from the peer during the last interval. If not, the state transfers to Working-Unknown.

Working-Unknown

Own link endpoint working. Peer link endpoint in unknown state. LINK_PROTOCOL/STATE_MSG messages with the PROBE bit set are sent at an interval of CONTINUITY_INTERVAL/4 to force a response from the peer. If a calculated number of probes (LINK_TOLERANCE/(CONTINUITY_INTERVAL/4)) go unanswered, the state transfers to Reset-Unknown. The own link endpoint is reset, and the link is considered lost. If, instead, any kind of message except LINK_PROTOCOL/RESET_MSG and LINK_PROTOCOL/ACTIVATE_MSG is received, the state transfers back to Working-Working. Reception of a RESET_MSG in this situation brings the link to state Reset-Reset. An ACTIVATE_MSG will never be received in this state.

Blocked

The link endpoint is blocked from accepting any packets in either direction, except incoming, tunneled CHANGEOVER_PROTOCOL/ORIG_MSG. This state is entered upon the arrival of the first such message, and left when the last has been counted in and delivered. See description about the changeover procedure later in this section. The Blocked state may also be entered and left through the management commands BLOCK and UNBLOCK. This is also described later.

A newly created link endpoint starts from the state Reset-Unknown. The recommended default value for LINK_TOLERANCE is 0.8 s.
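The transitions of Figure 21 (excluding the Blocked state) can be condensed into the following sketch. The Link class and its event-handler names are inventions of this example; the state names match the figure.

```python
RESET_UNKNOWN, RESET_RESET, WORKING_WORKING, WORKING_UNKNOWN = (
    'Reset-Unknown', 'Reset-Reset', 'Working-Working', 'Working-Unknown')

class Link:
    def __init__(self):
        self.state = RESET_UNKNOWN      # a new endpoint starts here

    def on_msg(self, msg):
        """Handle a received message; msg is e.g. 'RESET_MSG',
        'ACTIVATE_MSG', or 'TRAFFIC' for any traffic message."""
        s = self.state
        if s == RESET_UNKNOWN:
            if msg == 'RESET_MSG':
                self.state = RESET_RESET
            elif msg == 'ACTIVATE_MSG':     # peer already in Reset-Reset
                self.state = WORKING_WORKING
        elif s == RESET_RESET:
            if msg != 'RESET_MSG':          # any other message activates
                self.state = WORKING_WORKING
        elif s == WORKING_UNKNOWN:
            if msg == 'RESET_MSG':
                self.state = RESET_RESET
            else:
                self.state = WORKING_WORKING

    def on_silence(self):
        # continuity timer found nothing received in the last interval
        if self.state == WORKING_WORKING:
            self.state = WORKING_UNKNOWN

    def on_probes_failed(self):
        # the calculated number of probes went unanswered
        if self.state == WORKING_UNKNOWN:
            self.state = RESET_UNKNOWN      # link considered lost
```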



 TOC 

7.2.3.  Link MTU Negotiation

The actual MTU used by a link may vary with the media used. The two endpoints of a link may disagree on the allowed MTU (e.g. one using Ethernet jumbo frames and the other not), and intermediate switches may impose a stricter limit on the MTU size than what is visible from the endpoints. Therefore, TIPC implements an interval-halving MTU negotiation algorithm that intends to find the biggest possible MTU that can be used between the two link endpoints. This is done for each direction separately, so in theory we could end up with one MTU in one direction, and a different one in the opposite direction.

The algorithm works as follows:

1. A link endpoint starts out with an MTU of 1500 bytes, or the MTU reported by the bearer media, whichever is smaller (CURR_MTU). It also registers a wanted MTU (TARGET_MTU), equal to the one reported by the local interface. TARGET_MTU is sent along in the Max Packet field of all RESET and ACTIVATE messages to the other end, to let it know which target to negotiate for. The other end updates its own TARGET_MTU to the smaller of the one received and the one registered locally.

2. When the link has been established, using very short RESET and ACTIVATE messages, the endpoint lets its first STATE messages have the size CURR_MTU + (TARGET_MTU - CURR_MTU)/2.

3. If any of those messages is received, the other endpoint responds with a STATE message where Max Packet confirms that the size is usable. CURR_MTU is updated to the new size, and the algorithm goes back to step 2.

4. After a number of trials (e.g. 10) with the attempted MTU without any confirmation from the other end, TARGET_MTU is decremented by 4, and the algorithm goes back to step 2. If the link state moves to WORKING_UNKNOWN during this negotiation, due to lost STATE messages, the link moves temporarily back to using CURR_MTU as packet size. However, as soon as the link is back in the WORKING_WORKING state, the negotiation continues from where it was suspended.

5. After a number of iterations CURR_MTU equals TARGET_MTU, and the negotiation is over.



 TOC 

7.2.4.  Link Continuity Check

During normal traffic both link endpoints are in state Working-Working. At each expiration point, the background timer checkpoints the value of the Last Received Sequence Number. Before doing this, it compares the checkpoint from the previous expiration with the current value of Last Received Sequence Number, and if they differ, it takes the new checkpoint and goes back to sleep. If the two values do not differ, nothing was received during the last interval, and the link endpoint must start probing, i.e. move to state Working-Unknown.

Note here that even LINK_PROTOCOL messages count as received traffic, although they do not contain valid sequence numbers. When a LINK_PROTOCOL message is received, the checkpoint value is moved instead of Last Received Sequence Number, and hence the next comparison gives the desired result.



 TOC 

7.2.5.  Sequence Control and Retransmission

Each packet eligible to be sent on a link is assigned a Link Level Sequence Number, and appended to a send queue associated with the link endpoint. At the moment the packet is sent, its field Link Level Acknowledge Number is set to the value of Last Received Sequence Number.

When a packet is received in a link endpoint, its send queue is scanned, and all packets with a sequence number lower than the arriving packet's acknowledge number (modulo 2^16-1) are released.

If the packet's sequence number is equal to Last Received Sequence Number + 1 (mod 2^16-1), the counter is updated, and the packet is delivered upwards in the stack. A counter, Non Acknowledged Packets, is incremented for each message received, and if it reaches the value 10, a LINK_PROTOCOL/STATE_MSG is sent back to the sender. For any message sent, except BCAST_PROTOCOL messages, the Non Acknowledged Packets counter is set to zero.

Otherwise, if the sequence number is lower, the packet is considered a duplicate, and is silently discarded.

Otherwise, if a gap is found in the sequence, the packet is sorted into the Deferred Incoming Packets Queue associated with the link endpoint, to be re-sequenced and delivered upwards when the missing packets arrive. If that queue is empty, the gap is calculated and immediately transferred in a LINK_PROTOCOL/STATE_MSG back to the sending node. That node must immediately retransmit the missing packets. In addition, such a message must be sent for every 8 subsequent out-of-sequence packets received.
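The receive-side processing above can be sketched as follows. This is an illustrative sketch: the RxLink class is an invention of this example, 16-bit wrap-around arithmetic is assumed, and sending a STATE_MSG is modeled as recording the gap or counting an acknowledge.

```python
class RxLink:
    MOD = 1 << 16               # 16-bit sequence numbers assumed

    def __init__(self):
        self.last_rcv = 0       # Last Received Sequence Number
        self.deferred = {}      # Deferred Incoming Packets Queue
        self.delivered = []     # packets delivered upwards in the stack
        self.gap_reports = []   # gaps sent in LINK_PROTOCOL/STATE_MSG
        self.oos_count = 0      # out-of-sequence packets seen
        self.non_acked = 0      # Non Acknowledged Packets counter
        self.acks_sent = 0      # pure acknowledging STATE_MSGs sent

    def on_packet(self, seqno, pkt):
        nxt = (self.last_rcv + 1) % self.MOD
        if seqno == nxt:
            self.last_rcv = nxt
            self.delivered.append(pkt)
            # drain deferred packets that are now in sequence
            while (self.last_rcv + 1) % self.MOD in self.deferred:
                self.last_rcv = (self.last_rcv + 1) % self.MOD
                self.delivered.append(self.deferred.pop(self.last_rcv))
            self.non_acked += 1
            if self.non_acked >= 10:    # acknowledge every 10 messages
                self.acks_sent += 1
                self.non_acked = 0
        elif (nxt - seqno) % self.MOD < self.MOD // 2:
            pass                        # duplicate: silently discarded
        else:
            first = not self.deferred   # queue was empty: report the gap
            self.deferred[seqno] = pkt
            self.oos_count += 1
            if first or self.oos_count % 8 == 0:
                self.gap_reports.append((nxt, seqno))
```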



 TOC 

7.2.6.  Message Bundling

Sometimes a packet can not be sent immediately over a bearer, due to network or recipient congestion (link level send window overflow), or due to bearer congestion. In such situations it is important to utilize the network and bearer as efficiently as possible, and not to stop important users from sending messages before this is absolutely unavoidable. To achieve this, messages which can not be transmitted immediately are bundled into already waiting packets whenever possible, i.e. when there are unsent packets in the send queue of a link. When the packet finally arrives at the receiving node it is split up into its individual messages again. Since the bundling layer is located below the fragmentation layer in the functional model of the stack, even message fragments may be bundled with other messages this way, but this can only happen to the last fragment of a message, the only one that does not normally fill an entire packet by itself.

It must be emphasized that message transmissions are never delayed in order to obtain this effect. In contrast to TCP's Nagle Algorithm, the only goal of the TIPC bundling mechanism is to overcome congestion situations as quickly and efficiently as possible.



 TOC 

7.2.7.  Message Fragmentation

When a message is longer than the identified MTU of the link it will use, it is split up into fragments, each being sent in a separate packet to the destination node. Each fragment is wrapped in a packet headed by a TIPC internal header (see Figure 19 (TIPC Internal Message Header Format)). The User field of the header is set to MSG_FRAGMENTER, and each fragment is assigned a Fragment Number relative to the first fragment of the message. Each fragmented message is also assigned a Fragmented Message Number, present in all fragments. The Fragmented Message Number must be a sequence number with the period 2^16-1. At reception the fragments are reassembled to recreate the original message, which is then delivered upwards to the destination port.
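The fragment/reassemble cycle can be sketched as below. This sketch abstracts the real 40-byte internal header into a plain (fragm_msg_no, fragment_no, payload) tuple; the function names are inventions of this example.

```python
def fragment(msg, mtu_payload, fragm_msg_no):
    """Split msg into (fragm_msg_no, fragment_no, payload) tuples, each
    payload at most mtu_payload bytes; fragm_msg_no tags all fragments
    of one message."""
    return [(fragm_msg_no, i, msg[off:off + mtu_payload])
            for i, off in enumerate(range(0, len(msg), mtu_payload))]

def reassemble(fragments):
    """Recreate the original message; fragments are ordered by their
    Fragment Number relative to the first fragment."""
    frags = sorted(fragments, key=lambda f: f[1])
    return b''.join(payload for _, _, payload in frags)
```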




7.2.8.  Link Congestion Control

TIPC uses a common sliding window protocol to handle traffic flow at the signalling link level. When the send queue associated with a link endpoint reaches a configurable limit, the Send Window Limit, TIPC stops sending messages over that link. Packets may still be appended to or bundled into waiting packets in the queue, but only after having been subject to a filtering function, selecting or rejecting user calls according to the sent message's importance priority. LOW_IMPORTANCE messages are not accepted at all in this situation. MEDIUM_IMPORTANCE messages are still accepted, up to a configurable limit set for that user. All other users also have their individually configurable limits, recommended to be assigned values in the following ascending order: LOW_IMPORTANCE, MEDIUM_IMPORTANCE, HIGH_IMPORTANCE, CRITICAL_IMPORTANCE, CONNECTION_MANAGER, BCAST_PROTOCOL, ROUTE_DISTRIBUTOR, NAME_DISTRIBUTOR, MSG_FRAGMENTER. MSG_BUNDLER messages are not filtered this way, since those are packets created at a later stage. Whether or not to accept a message due for fragmentation is decided on its original importance, set before the fragmentation is done. Once such a message has been accepted, its individual fragments must be handled as being more important than the original message.

When the part of the queue containing sent packets falls below the Send Window Limit again, the waiting packets must immediately be sent, but only until the Send Window Limit is reached again.
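The importance filtering can be sketched as follows (non-normative; the limit values and the subset of users shown are illustrative assumptions, as the text above only requires ascending limits):

```python
# Sketch of importance-based acceptance at link congestion. Below the
# Send Window Limit everything is accepted; above it, each importance
# level has its own (ascending) queue-length limit, and LOW_IMPORTANCE
# traffic is not accepted at all.

LIMITS = {
    "LOW_IMPORTANCE": 0,          # rejected outright during congestion
    "MEDIUM_IMPORTANCE": 200,     # illustrative values only
    "HIGH_IMPORTANCE": 300,
    "CRITICAL_IMPORTANCE": 400,
    "CONNECTION_MANAGER": 500,
    "NAME_DISTRIBUTOR": 600,
    "MSG_FRAGMENTER": 700,
}

def accept(queue_len, send_window_limit, importance):
    if queue_len < send_window_limit:
        return True                        # link not congested
    return queue_len < LIMITS[importance]  # per-user limit applies
```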




7.2.9.  Link Load Sharing vs Active/Standby

When a link is created it is assigned a Link Priority, used to determine its relation to a possible parallel link to the same node. There are two possible relations between parallel working links.




7.2.10.  Load Sharing

Load Sharing is used when the links have the same priority value. Payload traffic is shared equally over the two links, in order to take full advantage of available bandwidth. The selection of which link to use must be done in a deterministic way, so that message sequentiality can be preserved for each individual sender port. To obtain this a Link Selector is used. This must be a value correlated to the sender in such a way that all messages from that sender choose the same link, while guaranteeing a statistically equal probability for both links to be selected for the overall traffic between the nodes. A simple example of a link selector with the right properties is the last two bits of the random number part of the originating port's identity; another is the same bits of the Fragmented Message Number in message fragments.
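A non-normative sketch of such a selection, using the "last two bits" example from the text (the helper name is hypothetical):

```python
# Sketch of deterministic link selection: the selector is derived from
# the sender (e.g. the random part of the port identity), so the same
# sender always maps to the same link, while different senders spread
# statistically evenly across the links.

def select_link(links, selector):
    """Pick a link deterministically from a per-sender selector value."""
    return links[(selector & 3) % len(links)]
```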




7.2.11.  Active/Standby

When the priority of one link has a higher numerical value than that of the others, all traffic will go through that link, denoted the Active Link. The other links will be kept up and working with the help of the continuity timer and probe messages, and are called Standby Links. The task of these links is to take over traffic in case the active link fails.

Link Priority has a value within the range [1,31]. When a link is created it inherits a default priority from its corresponding bearer, and this should normally not need to be changed thereafter. However, Link Priority must be reconfigurable in run-time.




7.3.  Link Failover

When the link configuration between two nodes changes, the moving of traffic from one link to another must be performed in such a way that message sequentiality and cardinality per sender are preserved. The following situations may occur:




7.3.1.  Active Link Failure

Before opening the remaining link for messages with the failing link's selector, all packets in the failing link's send queue must be wrapped into messages (tunneling messages) to be sent over the remaining link, irrespective of whether this is a load sharing active link or a standby link. These messages are headed by a TIPC Internal Header, with the User field set to CHANGEOVER_PROTOCOL and Message Type set to ORIG_MSG. On the tunneling link the messages are subject to congestion control, fragmentation and bundling, like any other messages. Upon arrival in the receiving node, the tunneled packets are unwrapped and moved over to the failing link's receiving endpoint. This link endpoint must now be reset, if this has not already been done, and itself initiate tunneling of its own queued packets in the opposite direction. The unwrapped packets' original sequence numbers are compared to the Last Received Sequence Number of the failed link's receiving endpoint, and are delivered upwards or dropped according to their relation to this value. There is no need for the failing link to consider packet sequentiality or possible losses in this case; the tunneling link must be considered a reliable medium guaranteeing all the necessary properties. The header of the first ORIG_MSG sent in each direction must contain a valid number in the Message Count field, in order to let the receiver know how many packets to expect. During the whole changeover procedure both link endpoints must be blocked for any normal message reception, to avoid the link being inadvertently activated again before the changeover is finished. When the expected number of packets has been received, the link endpoint is unblocked, and can go back to the normal activation procedure.
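The receiver's deliver-or-drop decision for unwrapped ORIG_MSG packets can be sketched as follows (non-normative; 16-bit sequence arithmetic, hypothetical helper names):

```python
# Sketch of filtering tunneled packets against the failed link's
# Last Received Sequence Number: only packets whose original sequence
# number lies ahead of that value are delivered upwards.

def seq_after(a, b):
    """True if sequence number `a` comes after `b` modulo 2^16."""
    diff = (a - b) & 0xFFFF
    return diff != 0 and diff < 0x8000

def filter_tunneled(packets, last_received):
    """Return the packets to deliver upwards, in order, dropping the rest."""
    return [p for p in packets if seq_after(p["seqno"], last_received)]
```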




7.3.2.  Standby Link Failure

This case is trivial, as there is no traffic to redirect.




7.3.3.  Second Link With Same Priority Comes Up

When a link is active, and a second link with the same priority comes up, half of the traffic from the first link must be taken over by the new link. Before opening the new link for new user messages, copies of the packets in the existing link's send queue must be transmitted over the new link. This is done by wrapping these copies into messages (tunnel messages) to be sent over the new link. The tunnel messages are headed by a TIPC Internal Header, with the User field set to CHANGEOVER_PROTOCOL and Message Type set to DUPLICATE_MSG. On the tunneling link the messages are subject to congestion control, fragmentation and bundling, just like any other messages. Upon arrival in the receiving node, the tunneled packets are unwrapped and delivered to the original link's receiving endpoint, just like any other packet arriving over that link's own bearer. If the original packet has already arrived over that bearer, the tunneled packet is dropped as a duplicate; otherwise the tunneled packet is accepted, and the original packet dropped as a duplicate when it arrives.




7.3.4.  Second Link With Higher Priority Comes Up

When a link is active, and a second link with a higher numerical priority comes up, all traffic from the first link must be taken over by the new link. The handling of this case is identical to the case when a link with same priority comes up. After the traffic takeover has finished, no more senders will select the old link, but this does not affect the takeover procedure.




7.3.5.  Link Deletion

Once created, a link endpoint continues to exist as long as its associated interface continues to exist.

Note: The persistence of a link endpoint whose peer cannot be reached for a significant period of time requires further study. It may be desirable for TIPC to reclaim the resources associated with such an endpoint by automatically deleting the endpoint after a suitable interval.




7.3.6.  Message Bundler Protocol

User: 6 (MSG_BUNDLER)

Message Types: None

A MSG_BUNDLER packet contains as many bundled packets as indicated in the Message Count field. Each bundled packet starts at a 4-byte aligned position in the carrying packet. Each bundled packet is a complete packet, including header, but with the fields Broadcast Acknowledge Number, Link Level Sequence Number and Link Level Acknowledge Number left undefined. Any kind of packet, except LINK_PROTOCOL and MSG_BUNDLER packets, may be bundled.
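A non-normative sketch of the packing rule (the field and helper names are hypothetical):

```python
# Sketch of MSG_BUNDLER packing: each bundled packet starts at the next
# 4-byte-aligned offset, and Message Count records how many packets
# were bundled into the carrying packet.

def bundle(packets):
    body = bytearray()
    for p in packets:
        if len(body) % 4:                          # pad to 4-byte boundary
            body += b"\x00" * (4 - len(body) % 4)
        body += p
    return {"message_count": len(packets), "body": bytes(body)}
```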




7.3.7.  Link State Maintenance Protocol

User: 7 (LINK_PROTOCOL)




ID Value   Meaning
--------   ----------
0          Detailed state of a working link endpoint (STATE_MSG)
1          Reset receiving endpoint                  (RESET_MSG)
2          Sender in RESET_RESET, ready to receive   (ACTIVATE_MSG)

 Figure 22: Link Maintenance Protocol Messages 

RESET_MSG messages must have a data part consisting of a zero-terminated string. This string is the name of the bearer instance used by the sender node for this link. Examples of such names are "eth0", "vmnet1" or "udp". These messages must also contain valid values in the fields Session Number, Link Priority and Link Tolerance.

ACTIVATE_MSG messages do not need to contain any valid fields except Message User and Message Type.

STATE_MSG messages may leave bearer name and Session Number undefined, but Link Priority and Link Tolerance must be set to zero in the normal case. If any of these values are non-zero, it implies an order to the receiver to change its local value to the one in the message. This must be done when a management command has changed the corresponding value at one link endpoint, in order to enforce the same change at the other endpoint. Network Identity must be valid in all messages.

Link protocol messages must always be sent immediately, disregarding any traffic messages queued in the link. Hence, they cannot follow the ordinary packet sequence, and their sequence number must be ignored at the receiving endpoint. To facilitate this, these messages should be given a sequence number guaranteed not to fit in sequence. The recommended way to do this is to give such messages the next unassigned Link Level Sequence Number + 32768, i.e. half the sequence number space away. This way, at reception the test for the user LINK_PROTOCOL needs to be performed only once, after the sequentiality check has failed, and we never need to reverse the Next Received Link Level Sequence Number.
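The numbering trick can be sketched as follows (non-normative; function names are hypothetical):

```python
# Sketch of out-of-sequence numbering for LINK_PROTOCOL messages:
# adding half the 16-bit sequence space guarantees the ordinary
# sequentiality check fails, so the LINK_PROTOCOL test only runs on
# that failure path.

def protocol_msg_seqno(next_unassigned):
    """Sequence number that can never fit in the ordinary sequence."""
    return (next_unassigned + 32768) & 0xFFFF

def fits_in_sequence(seqno, next_expected):
    """The ordinary (simplified) in-sequence check at the receiver."""
    return seqno == next_expected
```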




7.3.8.  Link Changeover Protocol

User: 10 (CHANGEOVER_PROTOCOL)




ID Value    Meaning
--------    ----------
0           Tunneled duplicate of packet            (DUPLICATE_MSG)
1           Tunneled failed over original of packet (ORIGINAL_MSG)

 Figure 23: Changeover Message Types 

DUPLICATE_MSG messages contain no extra information in the header apart from the first three words. The first ORIGINAL_MSG message sent out MUST contain a valid value in the Message Count field, in order to inform the recipient about how many such messages, including the first one, to expect. If this field is zero in the first message, it means that there are no packets wrapped in that message, and none to expect.




7.3.9.  Message Fragmentation Protocol

User: 12 (MSG_FRAGMENTER)




ID Value    Meaning
--------    ----------
0           First fragment of message (FIRST_FRAGMENT)
1           Body fragment of message  (FRAGMENT)
2           Last fragment of message  (LAST_FRAGMENT)

 Figure 24: Fragmentation Message Types 

All packets contain a dedicated identifier, Fragmented Message Number, to distinguish them from packets belonging to other messages from the same node. All packets also contain a sequence number within their respective message, the Fragment Number field, in order to, if necessary, reorder the packets when they arrive at the destination node. Both these sequence numbers must be incremented modulo 2^16-1.
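For clarity, a modulo 2^16-1 counter cycles through the values 0..65534, as in this non-normative sketch (the helper name is hypothetical):

```python
# Sketch of the fragment counters: both Fragmented Message Number and
# Fragment Number are incremented modulo 2^16 - 1, so they wrap from
# 65534 back to 0 rather than using the full 16-bit range.

def next_seqno(n):
    return (n + 1) % (2**16 - 1)
```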




8.  Broadcast Link

To effectively support the service multicast feature described in Section 5.4 (Multicast Communication), a reliable cluster broadcast service is provided by TIPC.

Although seen as a broadcast service from a TIPC viewpoint, at the bearer level this service may be implemented as a multicast group comprising all nodes in the cluster.

At the multicast/broadcast sending node a sequence of actions is followed:

  1. When a service multicast is requested, TIPC first looks up all matching destinations in its name translation table.
  2. If any node-external port is on the destination list, the message is sent to the multicast link for broadcast transport off node.
  3. If the own node is on the list, a replica is sent to the service multicast receive function in the own node.




8.1.  Broadcast Protocol

User: 5 (BCAST_PROTOCOL).

There is only one type of BCAST_PROTOCOL message, but it is still used for two very different purposes:

Note that the receiver is still not allowed to start accepting broadcast messages. It can do so only when it knows that the initial bulk update of the name table is finished, i.e., when it sees a NAME_DISTRIBUTOR packet with the M bit unset, as described in Section 6.3 (Name Distributor Message Format).




8.2.  Piggybacked Acknowledge

All packets passed from one node to another, without exception, contain a valid value in the field Acknowledged Bcast Number. Since there is always some traffic going on between all nodes in the cluster (in the worst case only link supervision messages), the receiving node can trust that the Last Acknowledged Bcast counter it keeps for each node is well up-to-date. This value will under no circumstances be older than one CONTINUITY_INTERVAL, so it inhibits a lot of unnecessary retransmissions of packets which in reality have already been received at the other end.




8.3.  Coordinated Acknowledge Interval

If the received packet fits in sequence as described above, AND the last four bits of the sequence number of the received packet are equal to the last four bits of the own node's network address, a LINK_PROTOCOL/STATE_MSG is generated and sent back as unicast to the sending node, acknowledging the packet, and implicitly all previously received packets. This means that e.g. node <Z.C.1> will only explicitly acknowledge packet numbers 1, 17, 33, and so on, while node <Z.C.2> will acknowledge packet numbers 2, 18, 34, etc. This condition significantly reduces the number of explicit acknowledgments that need to be sent, taking advantage of the normally ongoing traffic over each link.
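The acknowledge condition itself reduces to a four-bit comparison, as in this non-normative sketch (the function name is hypothetical):

```python
# Sketch of the coordinated acknowledge condition: a node explicitly
# acknowledges a broadcast packet only when the low four bits of the
# packet's sequence number match the low four bits of the node's own
# network address, so acknowledges are spread across the cluster.

def should_ack(seqno, own_node_addr):
    return (seqno & 0xF) == (own_node_addr & 0xF)
```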




8.4.  Coordinated Broadcast of Negative Acknowledges

If the Last Sent Broadcast field of a LINK_PROTOCOL/STATE_MSG differs from the registered last received broadcast data packet, or if a broadcast data packet is received out of sequence, a BCAST_PROTOCOL/STATE_MSG ("NACK") packet MAY be broadcast back to the node in question. It is RECOMMENDED that such NACKs not be sent every time a gap is detected, to avoid possible overload of the sender node. It is also RECOMMENDED that a node always look into NACKs broadcast from other nodes, so it can identify whether these report the same sequence gap as registered locally for that node. In that case, the node SHOULD delay sending its own corresponding NACK until a later occasion.
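A non-normative sketch of this suppression (the tuple structure and function name are hypothetical):

```python
# Sketch of coordinated NACK suppression: before broadcasting its own
# NACK, a node checks whether an overheard NACK from a peer already
# reports the same sequence gap for the same broadcasting node, and if
# so defers its own NACK to a later occasion.

def should_send_nack(local_gap, overheard_nacks):
    """Each gap is (broadcasting_node, gap_from_seq, gap_to_seq)."""
    return not any(nack == local_gap for nack in overheard_nacks)
```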




8.5.  Replicated Delivery

When an in-sequence service multicast is delivered upwards in the stack, TIPC looks up all node-local destination ports in the name table. The destination list created this way is stripped of all duplicates, so that only one message replica is sent to each identified destination port.




8.6.  Congestion Control

Messages sent over the "broadcast link" are subject to the same congestion control mechanisms as point-to-point links, with prioritized transmission queue appending, message bundling, and as a last resort a return value to the sender indicating the congestion. Typically this return value is handled by the socket layer code, which blocks the sending process until the congestion abates. Hence, the sending application should never notice the congestion at all.




9.  Neighbor Detection

TIPC supports the automatic discovery of the physical network topology and the establishment of links between neighboring nodes through the use of a neighbor detection protocol.




9.1.  Neighbor Detection Protocol Overview

A node initiates neighbor detection by sending a "link request" message to all of its potential neighbors over each bearer that the node has been configured to use. This message identifies the requesting node and specifies both the subset of network nodes the node is willing to establish links to and the media address to be used by such links. A node that receives a link request message and determines that a new link between the nodes must be established must return a "link response" message to the requesting node; this message identifies the receiving node and specifies the receiving node's own media address. The exchange of messages permits each node to create a link endpoint which has the necessary information to begin communicating with its peer.

The conditions under which a node sends link request messages are not specified in this document. For example, implementations may send messages periodically as long as a node is operational, and may suspend the sending of requests whenever a node has working links to all of its potential neighbors. In contrast, the conditions under which a node sends link response messages are specified.




9.1.1.  Link Request Message Processing

A link request message SHOULD be sent to all potential neighbors simultaneously using multicasting or broadcasting if a bearer's media type supports this capability; otherwise, separate link request messages SHOULD be sent to each potential neighbor individually.

A node that receives a link request message MUST ignore the message if it is not supposed to communicate with the requesting node on the associated bearer. Conditions that prohibit communication include the following:

The requesting node has a different TIPC network identifier than the receiving node.

The receiving node has the same TIPC network address as the requesting node (i.e. a node must ignore a message from itself).

The requesting node does not lie within the network domain that the receiving node is authorized to communicate with over the associated bearer.

The receiving node does not lie within the network domain that the requesting node has specified in its request.

In addition, a node that receives a link request message MUST ignore the message if it would interfere with existing communication with the requesting node. (Request messages of this nature can arise if network nodes are not configured correctly, resulting in two or more nodes having the same network address.) Conditions that cause interference include the following:

The receiving node currently has a working link to the requesting node on the associated bearer.

The receiving node has a working link to the requesting node on another bearer that was established using a different node signature.

A node that receives a link request message that is not ignored SHOULD establish a link endpoint capable of communicating with the requesting node. If the receiving node currently has a (non-operational) link endpoint to the requesting node on the associated bearer it MUST delete or reconfigure the link endpoint to preclude the existence of two parallel links to the same node on the same bearer. If the receiving node currently has one or more (non-operational) link endpoints to the requesting node on other bearers that were established using a different node signature it MUST delete or reconfigure those link endpoints to preclude the existence of links to two different nodes having the same network address.

Once the receiving node has established the required link endpoint it MUST send a link response message to the requesting node on the associated bearer. The link response message MUST be directed only to the requesting node; if possible, it SHOULD be sent without using multicasting or broadcasting.
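The acceptance checks above can be sketched as follows (non-normative; field names are hypothetical, and `in_domain` stands in for the <Z.C.N> domain-containment test):

```python
# Sketch of the link request filtering: a request is ignored if any of
# the "prohibit communication" or "interference" conditions hold.

def ignore_request(req, own, in_domain, working_links):
    if req["net_id"] != own["net_id"]:
        return True   # different TIPC network identifier
    if req["node_addr"] == own["node_addr"]:
        return True   # a node must ignore a message from itself
    if not in_domain(req["node_addr"], own["accept_domain"]):
        return True   # requester outside the authorized domain
    if not in_domain(own["node_addr"], req["dest_domain"]):
        return True   # we are outside the requested destination domain
    if (req["bearer"], req["node_addr"]) in working_links:
        return True   # already a working link on this bearer
    return False
```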




9.1.2.  Link Response Message Processing

A node that receives a link response message MUST ignore the message if it is not supposed to communicate with the responding node on the associated bearer. Conditions that prohibit communication include the following:

The responding node has a different TIPC network identifier than the receiving node.

The receiving node has the same TIPC network address as the responding node (i.e. a node must ignore a message from itself).

The responding node does not lie within the network domain that the receiving node is authorized to communicate with over the associated bearer.

The receiving node does not lie within the network domain that the responding node has specified in its response.

In addition, a node that receives a link response message MUST ignore the message if it would interfere with existing communication with the responding node. Conditions that cause interference include the following:

The receiving node currently has a working link to the responding node on the associated bearer.

The receiving node has a working link to the responding node on another bearer that was established using a different node signature.

A node that receives a link response message that is not ignored SHOULD establish a link endpoint capable of communicating with the responding node. If the receiving node currently has a (non-operational) link endpoint to the responding node on the associated bearer it MUST delete or reconfigure the link endpoint to preclude the existence of two parallel links to the same node on the same bearer. If the receiving node currently has one or more (non-operational) link endpoints to the responding node on other bearers that were established using a different node signature it MUST delete or reconfigure those link endpoints to preclude the existence of links to two different nodes having the same network address.

Once the receiving node has established the required link endpoint it MUST NOT send a link configuration message (either a request or a response) to the responding node.




9.1.3.  Link Discovery Message Format

The format of the link discovery message used to exchange link requests and link responses is shown in Figure 25 (Neighbor Discovery Message Format).




     3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
     1 0 9 8 7 6 5 4|3 2 1 0 9 8 7 6|5 4 3 2 1 0 9 8|7 6 5 4 3 2 1 0
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w0: | Ver | User  | Hsize |N|R|R|R|           Message Size          |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w1: |Mtype|        Capabilities       |      Node Signature         |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w2: |                      Destination Domain                       |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w3: |                         Previous Node                         |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w4: |                          Network Id                           |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w5: |                                                               |
    +-+-+-+-+-+-+-                                    +-+-+-+-+-+-+-+
w6: |                                                               |
    +-+-+-+-+-+-+-                                    +-+-+-+-+-+-+-+
w7: |                                                               |
    +-+-+-+-+-+-+-                                    +-+-+-+-+-+-+-+
w8: |                                                               |
    +-+-+-+-+-+-+-            Media Address           +-+-+-+-+-+-+-+
w9: |                                                               |
    +-+-+-+-+-+-+-                                    +-+-+-+-+-+-+-+
w10:|                                                               |
    +-+-+-+-+-+-+-                                    +-+-+-+-+-+-+-+
w11:|                                                               |
    +-+-+-+-+-+-+-                                    +-+-+-+-+-+-+-+
w12:|                                                               |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w13:|                           Reserved                            |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w14:|                           Reserved                            |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w15:|                           Reserved                            |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

 Figure 25: Neighbor Discovery Message Format 

The interpretation of the fields of the message is as follows:

R, RESERVED: Defined in Payload Message.

Ver: 3 bits: Defined in Payload Message.

User: 4 bits: Defined in Payload Message. A LINK_DISCOVERY message is identified by the value 13.

Hsize: 4 bits: Defined in Payload Message. A LINK_DISCOVERY message header is 64 bytes.

N: 1 bit: Defined in Payload Message. A LINK_DISCOVERY message sets this bit, as it is not part of a normal flow of messages over a link.

Message Size: 17 bits: Defined in Payload Message. A LINK_DISCOVERY message is 64 bytes in length.

Mtype: 3 bits: Defined in Payload Message. A LINK_DISCOVERY message specifies 0 for a link request message or 1 for a link response message.

Capabilities: 13 bits: A bitmap indicating capabilities of the sender node that receivers may need to be aware of. Currently only one bit is defined: LSB of the field (bit 15 in word 1) indicates that the sender node will send out SYN messages with the SYN bit set. All other bits are reserved for future use, and MUST be zero.

Destination Domain: 32 bits: The network domain to which the message is directed. <Z.C.N> denotes that the sender desires a link to a specific node; <Z.C.0>, <Z.0.0>, and <0.0.0> denote that the message can be processed by any node in the sender's cluster, zone, and network, respectively.

Previous Node: 32 bits: Defined in Payload Message.

Network Id: 32 bits: The network identity of the sender.

Media Address: 32 bytes: The media address of the sender. This has a media-specific format, described in Section 9.1.4 (Media Address Formats).




9.1.4.  Media Address Formats

The Media Address field carries the media address of the sender; its format is media-specific. Currently, the following formats are defined:




9.1.4.1.  Ethernet Address Format




     3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
     1 0 9 8 7 6 5 4|3 2 1 0 9 8 7 6|5 4 3 2 1 0 9 8|7 6 5 4 3 2 1 0
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w5: |                        Zero                   | Addr Type = 1 |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w6: |                        Ethernet MAC Address                   |
    +                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w7: |                               |                               |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                 +-+-+-+-+-+-+-+
w8: |                                                               |
    +-+-+-+-+-+-+-                                    +-+-+-+-+-+-+-+
w9: |                                                               |
    +-+-+-+-+-+-+-                                    +-+-+-+-+-+-+-+
w10:|                              Zero                             |
    +-+-+-+-+-+-+-                                    +-+-+-+-+-+-+-+
w11:|                                                               |
    +-+-+-+-+-+-+-                                    +-+-+-+-+-+-+-+
w12:|                                                               |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

 Figure 26: Ethernet Address Format 




9.1.4.2.  Infiniband Address Format




     3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
     1 0 9 8 7 6 5 4|3 2 1 0 9 8 7 6|5 4 3 2 1 0 9 8|7 6 5 4 3 2 1 0
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w5: |                                                               |
    +-+-+-+-+-+-+                                    -+-+-+-+-+-+-+-+
w6: |                                                               |
    +-+-+-+-+-+-+                                    -+-+-+-+-+-+-+-+
w7: |                      Infiniband Address                       |
    +-+-+-+-+-+-+                                    -+-+-+-+-+-+-+-+
w8: |                                                               |
    +-+-+-+-+-+-+-                                    +-+-+-+-+-+-+-+
w9: |                                                               |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w10:|                                                               |
    +-+-+-+-+-+-+-                                    +-+-+-+-+-+-+-+
w11:|                              Zero                             |
    +-+-+-+-+-+-+-                                    +-+-+-+-+-+-+-+
w12:|                                                               |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

 Figure 27: Infiniband Address Format 

Note that there is no address type field in the Infiniband address format




9.1.4.3.  UDP/IPv4 Address




     3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
     1 0 9 8 7 6 5 4|3 2 1 0 9 8 7 6|5 4 3 2 1 0 9 8|7 6 5 4 3 2 1 0
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w5: |                      Zero                     | Addr Type = 3 |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w6: |                           IPv4 Address                        |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w7: |        Port Number            |           Zero                |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-++-+-+-+-+-+-+-++-+-+-+-+-+-+-++-+-+-
w8: |                                                               |
    +-+-+-+-+-+-+-                                    +-+-+-+-+-+-+-+
w9: |                                                               |
    +-+-+-+-+-+-+-               Zero                 +-+-+-+-+-+-+-+
w10:|                                                               |
    +-+-+-+-+-+-+-                                    +-+-+-+-+-+-+-+
w11:|                                                               |
    +-+-+-+-+-+-+-                                    +-+-+-+-+-+-+-+
w12:|                                                               |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

 Figure 28: UDP/IPv4 Address Format 




10.  Topology Service

TIPC provides a message-based mechanism for an application to learn about the port names that are visible to its node. This is achieved by communicating with a Topology Service that has knowledge of the contents of the node's name table.




10.1.  Topology Service Semantics

A "topology subscription" is a request by a subscriber to TIPC, telling TIPC to indicate when a port name sequence overlapping the requested range is published or withdrawn. Subscription for an individual port name is requested by specifying a port name sequence whose lower and upper instance values are identical.

An "event" is a response by TIPC to a subscriber, telling the subscriber about a change in availability of the port name(s) specified by a subscription, or in the status of the subscription itself. Each event associated with the availability of port names indicates the portion of the requested port name sequence that has changed its availability, as well as identifying the physical address involved in the change. A subscription may cause zero, one, or more events during its lifetime.




10.2.  Topology Service Protocol

An application subscribing for the availability of port name sequences must follow these steps:

  1. Establish a TIPC connection to the Topology Server, using the port name {1,1}.
  2. Send a subscription message on the new connection for each port name sequence to be monitored.
  3. Wait for arrival of event messages indicating status changes for the requested port name sequence(s).

After a subscription has been received and registered by the Topology Server, the subscriber will immediately receive zero or more events, in accordance with the state of the name table at the time of registration, and the flags in the subscription message. Thereafter, the subscriber will receive an event for each change in the name table corresponding to the subscription.

Each subscription issued by an application remains registered until one of the following conditions arises:

  1. The time limit specified for the subscription expires. (This results in the Topology Server issuing a final event to the application, indicating that the subscription has timed out.)
  2. The subscription is cancelled by the application. (This is achieved by resending the original subscription message with a cancellation bit set; no acknowledgement is provided by the Topology Server.)
  3. The application's connection to the Topology Server is terminated.




10.2.1.  Subscription Message Format

The format of a subscription message is shown in Figure 29 (Format of topology subscription message). The first five words are integers, while the format of the final two words is unspecified. The words of a subscription message may be sent in network byte order or host byte order; however, all words MUST utilize the same ordering. (The byte ordering used in a specific subscription message can be deduced by examining the high-order and low-order bytes of the fifth word of the message, exactly one of which will be non-zero.)




    3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
    1 0 9 8 7 6 5 4|3 2 1 0 9 8 7 6|5 4 3 2 1 0 9 8|7 6 5 4 3 2 1 0
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w0 |                          Type                                 |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w1 |                          Lower                                |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w2 |                          Upper                                |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w3 |                          Timeout                              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w4 |                          RESERVED                       |C|S|P|
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w5 |                          User Reference                       |
w6 |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

 Figure 29: Format of topology subscription message 

The interpretation of the fields of the message is as follows:

Type: The type of the port name sequence subscribed for.

Lower: Lower bound of the port name sequence subscribed for.

Upper: Upper bound of the port name sequence subscribed for.

Timeout: The time before the subscription expires, in milliseconds. A timeout of zero means that the subscription expires immediately, but the Topology Server MUST still respond with all events reflecting the state of the requested sequence at the time of the subscription's arrival; this enables an application to perform a one-shot inquiry into the name table to obtain a result immediately, regardless of whether or not the desired names are present. A timeout of 0xffffffff means the subscription will never expire.

Filter: Describes the semantics of the subscription. All bits must be zero, except for the following:



Name     Description
----     -----------
P        When set, the S-bit MUST NOT be set. The Topology Server
         MUST send an event for each publication or withdrawal
         of a sequence overlapping the requested one.
         When clear, the S-bit MUST be set.

S        When set, the P-bit MUST NOT be set. The Topology Server
         MUST send an event only when the number of sequences
         overlapping the requested one goes from zero to non-zero,
         or vice versa.
         When clear, the P-bit MUST be set.

C        When clear, the Topology Server MUST register the
         subscription specified by the message.
         When set, the Topology Server MUST cancel a registered
         subscription corresponding to the one indicated in this
         message, if one exists.
         'Corresponding' means that all the fields (except the
         C-bit itself) have the same value as in the original
         subscription message, and the message is submitted via
         the same connection.

 Figure 30: Definition of bits in a subscription message 

User Reference: An opaque 8-byte character sequence, to be used by the subscriber for its own purposes. The Topology Server MUST NOT interpret or alter this field in any way, and MUST return it, along with the rest of the original subscription, in all event messages.
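As a non-normative illustration, the 28-byte subscription message of Figure 29 can be composed with Python's struct module. The bit values for P, S, and C are read from the "|C|S|P|" layout of Figure 29, taking P as the least significant bit; the helper and constant names are this sketch's own, not defined by this document.

```python
import struct

# Filter bits in word w4, per the "|C|S|P|" layout of Figure 29
# (P assumed to be the least significant bit):
SUB_PORTS   = 0x01  # P: one event per publication/withdrawal
SUB_SERVICE = 0x02  # S: events only on zero/non-zero transitions
SUB_CANCEL  = 0x04  # C: cancel a previously registered subscription

def pack_subscription(stype, lower, upper, timeout_ms, filter_bits,
                      usr_ref=b'\x00' * 8, big_endian=True):
    """Build the 28-byte subscription message of Figure 29:
    five integer words followed by the opaque user reference."""
    assert len(usr_ref) == 8, "user reference is an opaque 8-byte field"
    endian = '>' if big_endian else '<'
    return struct.pack(endian + '5I', stype, lower, upper,
                       timeout_ms, filter_bits) + usr_ref

def subscription_is_network_order(msg):
    """Deduce the byte order from the fifth word (w4): exactly one
    of its high-order and low-order bytes is non-zero."""
    return msg[19] != 0  # low-order byte of w4 set => network order
```

The byte-order check mirrors the rule stated above: because exactly one of P and S must be set and the reserved bits are zero, word w4 always has exactly one non-zero byte, and its position reveals the ordering used.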




10.2.2.  Event Message Format

The format of an event message is shown in Figure 31 (Format of topology event message). The first five words in the message are integers; the remainder of the message is specified in Figure 29 (Format of topology subscription message). All words of an event message MUST be sent using the same byte order used by the subscription message that registered the subscription. (The byte ordering used in a specific event message can be deduced by examining the high-order and low-order bytes of the tenth word of the message, exactly one of which will be non-zero.)




    3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
    1 0 9 8 7 6 5 4|3 2 1 0 9 8 7 6|5 4 3 2 1 0 9 8|7 6 5 4 3 2 1 0
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w0 |                         Event                                 |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w1 |                         Found Lower                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w2 |                         Found Upper                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w3 |                         Port Number                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w4 |                         Node Address                          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w5 /                                                               /
   \                         Subscription                          \
w11/                                                               /
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

 Figure 31: Format of topology event message 
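A matching, non-normative parsing sketch for the 48-byte event message follows; the excerpt does not define the event code values, so none are assumed here. The tenth word (w9) is the filter word of the echoed subscription, so its single non-zero byte reveals the byte order:

```python
import struct
from collections import namedtuple

Event = namedtuple('Event', 'event found_lower found_upper '
                            'port_number node_address subscription')

def unpack_event(msg):
    """Parse the event message of Figure 31: five integer words
    followed by a copy of the 28-byte subscription message."""
    assert len(msg) == 48
    endian = '>' if msg[39] != 0 else '<'  # low-order byte of w9
    words = struct.unpack(endian + '5I', msg[:20])
    return Event(*words, subscription=msg[20:])
```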

The interpretation of the fields of the message is as follows:




10.3.  Monitoring Service Topology

The service topology of the network can be continuously monitored by subscribing for the relevant port names or name sequences corresponding to the services of interest to the application.




10.4.  Monitoring Physical Topology

The physical topology of the network can be considered a special case of the functional topology, and can be monitored in the same way. To track the availability or disappearance of a specific node or group of nodes, an application running on these node(s) can publish a port name representing this "function"; this name can then be subscribed to by other applications. TIPC's Topology Service can then notify subscribing applications whenever it discovers or loses contact with a node publishing that name.

TIPC enables an application to easily monitor the availability of the nodes within its cluster by having each node automatically publish the reserved name {0,<Z.C.N>} with cluster scope, where <Z.C.N> is the network address of the node. The port identifier associated with this name identifies the node's Configuration Service.




11.  Configuration Service

TIPC provides a message-based mechanism for an application to inquire about the configuration and status of a TIPC network and, in some instances, to alter the configuration. This is achieved by communicating with a Configuration Service that implements a variety of network management-style commands.




11.1.  Configuration Service Semantics

A "configuration command" is an operation supported by TIPC's Configuration Service that alters the configuration of a network node or returns information about the current configuration or state of the network. There are three classes of configuration command defined by TIPC: public, protected, and private.

A "command message" is a message exchanged by an application and TIPC. There are two classes of command message defined by TIPC: command requests and command replies.




11.2.  Configuration Service Protocol

Command messages may be sent over any protocol (e.g. Netlink [RFC 3549]), and may have different formats, to be decided by the particular implementation. The definition of such formats falls outside the scope of this document. Here, we only define the formats that MUST be used when the command messages are carried over TIPC itself.

An application that interacts with the Configuration Service uses TIPC payload messages containing command requests and replies. The application MUST follow these steps:

  1. Send a connectionless command request to a Configuration Server using the port name {0,<Z.C.N>}, where <Z.C.N> is the network address of the node to be queried or manipulated.
  2. Wait for the arrival of a command reply from the Configuration Server corresponding to the previously issued command request.

After a command request is received by the Configuration Server, the server will attempt to perform the requested operation and return a command reply indicating the results of the operation.




11.2.1.  Command Message Format

The data portion of a command message consists of a command descriptor followed by zero or more command arguments.




11.2.1.1.  Command Descriptor

The format of a command descriptor is shown in Figure 33. All fields of the command descriptor MUST be stored in network byte order.




    3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
    1 0 9 8 7 6 5 4|3 2 1 0 9 8 7 6|5 4 3 2 1 0 9 8|7 6 5 4 3 2 1 0
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w0:|                             Length                            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w1:|        Command                |           Flags               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w2:|                                                               |
   +                           RESERVED                            +
w3:|                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+


 Figure 33 
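As a non-normative sketch, the 16-byte descriptor can be composed as follows; the exact semantics of the Length and Flags fields are not detailed in this excerpt, so their use here is an assumption:

```python
import struct

def pack_descriptor(length, command, flags=0):
    """Build the 16-byte command descriptor of Figure 33 in network
    byte order: a 32-bit Length, a 16-bit Command, a 16-bit Flags
    field, and two reserved (zero) words."""
    return struct.pack('>IHH8x', length, command, flags)

GET_NODES = 0x0001  # public command code from Section 11.2.1.3.2
```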

The interpretation of the fields of the descriptor is as follows:




11.2.1.1.1.  Command Arguments

A command message contains zero or more Type-Length-Value (TLV) triplets that provide details about the associated request or reply. The set of TLVs associated with a command request may be different than the set of TLVs associated with its reply.

The format of a command argument TLV is shown in Figure 35 (Command argument TLV format). The first two fields of the TLV MUST be stored in network byte order; the order used in the value field that follows depends on the TLV's type. Each TLV triplet MUST begin at a 32-bit word boundary, measured from the start of the command message; thus, it may be necessary to include one, two, or three bytes of padding between adjacent TLVs in a command message.




    3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
    1 0 9 8 7 6 5 4|3 2 1 0 9 8 7 6|5 4 3 2 1 0 9 8|7 6 5 4 3 2 1 0
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w0:|            Length             |           Type                |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w1:\                                                               \
   /                            Value                              /
wN:\                                                               \
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

 Figure 35: Command argument TLV format 
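A non-normative sketch of TLV encoding with the required word alignment; the assumption that Length counts the 4-byte header plus the value (excluding padding) is this sketch's, not stated in the excerpt:

```python
import struct

UNSIGNED = 2    # TLV type codes from Section 11.2.1.2
NET_ADDR = 17

def pack_tlv(tlv_type, value):
    """Encode one TLV per Figure 35: a 16-bit Length and a 16-bit
    Type in network byte order, followed by the value.  Zero bytes
    are appended so that the next TLV starts on a 32-bit word
    boundary."""
    tlv = struct.pack('>HH', 4 + len(value), tlv_type) + value
    return tlv + b'\x00' * ((-len(tlv)) % 4)
```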

The interpretation of the fields of a command argument TLV is as follows:




11.2.1.2.  Command Argument TLV Descriptions

The TLVs defined for TIPC's Configuration Service are described in this section.




11.2.1.2.1.  VOID

VOID (type 1) is a zero-byte TLV type that can be used as a placeholder in command messages. Currently, no command messages utilize this type.




11.2.1.2.2.  UNSIGNED

UNSIGNED (type 2) is a TLV type designating a generic unsigned integer. It is represented by a 32-bit integer, which MUST be stored in network byte order.




11.2.1.2.3.  STRING

STRING (type 3) is a TLV type designating a moderately-sized character string. It is represented by a zero-terminated sequence of characters, which may range from one byte to 128 bytes, including the terminating zero character.




11.2.1.2.4.  LARGE_STRING

LARGE_STRING (type 4) is a TLV type designating a large-sized character string. It is represented by a zero-terminated sequence of characters, which may range from one byte to 2048 bytes, including the terminating zero character.




11.2.1.2.5.  ULTRA_STRING

ULTRA_STRING (type 5) is a TLV type designating a very large-sized character string. It is represented by a zero-terminated sequence of characters, which may range from one byte to 32768 bytes, including the terminating zero character.




11.2.1.2.6.  ERROR_STRING

ERROR_STRING (type 16) is a TLV type designating the reason for the failure of a command request. It is represented by a zero-terminated sequence of characters, which may range from one byte to 128 bytes, including the terminating zero character.

The first character of an ERROR_STRING may be a special error code character, lying in the range 0x80 to 0xFF, which corresponds to one of the following pre-defined reasons:




Value    Meaning
-----    -------
0x80     The request contains incorrect TLV(s)
0x81     The request requires network administrator privileges
0x83     The designated node does not permit requests from off-node
0x84     The request is not supported
0x85     The request has invalid argument values

 Figure 36: Command Error Codes 
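The table above can be applied mechanically when displaying a failed reply. A non-normative helper (the function name is this sketch's own):

```python
ERROR_REASONS = {
    0x80: "The request contains incorrect TLV(s)",
    0x81: "The request requires network administrator privileges",
    0x83: "The designated node does not permit requests from off-node",
    0x84: "The request is not supported",
    0x85: "The request has invalid argument values",
}

def decode_error_string(value):
    """Split an ERROR_STRING value into (reason, text).  A first
    byte in the range 0x80-0xFF is an error code character; the
    remainder is a zero-terminated string."""
    reason = None
    if value and value[0] >= 0x80:
        reason = ERROR_REASONS.get(value[0], "Unrecognized error code")
        value = value[1:]
    return reason, value.split(b'\x00', 1)[0].decode('ascii', 'replace')
```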




11.2.1.2.7.  NET_ADDR

NET_ADDR (type 17) is a TLV type designating a TIPC network address. It is represented by a 32-bit integer denoting zone, cluster, and node identifiers (using 8, 12, and 12 bits, respectively), with the zone identifier occupying the most significant bits and the node identifier occupying the least significant bits. This value MUST be stored in network byte order.
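The zone/cluster/node packing described above can be expressed directly; a non-normative helper:

```python
import struct

def tipc_addr(zone, cluster, node):
    """Compose a <Z.C.N> network address: 8-bit zone, 12-bit cluster,
    and 12-bit node identifiers, with zone in the most significant
    bits and node in the least significant bits."""
    assert zone < (1 << 8) and cluster < (1 << 12) and node < (1 << 12)
    return (zone << 24) | (cluster << 12) | node

def tipc_addr_split(addr):
    """Recover (zone, cluster, node) from a 32-bit address."""
    return addr >> 24, (addr >> 12) & 0xfff, addr & 0xfff

def pack_net_addr(addr):
    """The 4-byte NET_ADDR value field, network byte order."""
    return struct.pack('>I', addr)
```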




11.2.1.2.8.  MEDIA_NAME

MEDIA_NAME (type 18) is a TLV type designating a media type usable for TIPC messages. It is represented by a zero-terminated sequence of characters, which may range from one byte to 16 bytes, including the terminating zero character.

As an example, the media type for Ethernet bearers is "eth".




11.2.1.2.9.  BEARER_NAME

BEARER_NAME (type 19) is a TLV type designating a TIPC bearer. It is represented by a zero-terminated sequence of characters, which may range from one byte to 32 bytes, including the terminating zero character.

The resulting string MUST have the form "medianame:interfacename". For example, an Ethernet bearer may have the name "eth:eth0".




11.2.1.2.10.  LINK_NAME

LINK_NAME (type 20) is a TLV type designating a TIPC link endpoint. It is represented by a zero-terminated sequence of characters, which may range from one byte to 60 bytes, including the terminating zero character.

The resulting string MUST have the form "Z.C.N:own_side_interfacename-Z.C.N:peer_side_interfacename". For example, an Ethernet link endpoint may have the name "1.1.7:eth0-1.1.12:eth0".




11.2.1.2.11.  NODE_INFO

NODE_INFO (type 21) is a TLV type designating the reachability status (up/down) of a neighboring node. It is represented by the 8-byte structure shown in Figure 37 (Node Availability Info). All fields of this structure MUST be stored in network byte order.




    3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
    1 0 9 8 7 6 5 4|3 2 1 0 9 8 7 6|5 4 3 2 1 0 9 8|7 6 5 4 3 2 1 0
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w0 |                           Node Address                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w1 |                           Up                                  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

 Figure 37: Node Availability Info 

The interpretation of the fields of the structure is as follows:




11.2.1.2.12.  LINK_INFO

LINK_INFO (type 22) is a TLV type designating the status (up/down) of a link endpoint. It is represented by the 68-byte structure shown in Figure 38. The first two fields of this structure MUST be stored in network byte order.




    3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
    1 0 9 8 7 6 5 4|3 2 1 0 9 8 7 6|5 4 3 2 1 0 9 8|7 6 5 4 3 2 1 0
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w0 |                           Node Address                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w1 |                           Up                                  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w2 \                                                               \
   /                           Link Name                           /
w16\                                                               \
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

 Figure 38 

The interpretation of the fields of the structure is as follows:




11.2.1.2.13.  BEARER_CONFIG

BEARER_CONFIG (type 23) is a TLV type used to enable a bearer. It is represented by the 40-byte structure shown in Figure 39 (Value field of BEARER_CONFIG TLV). The first two fields of this structure MUST be stored in network byte order.




    3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
    1 0 9 8 7 6 5 4|3 2 1 0 9 8 7 6|5 4 3 2 1 0 9 8|7 6 5 4 3 2 1 0
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w0 |                           Priority                            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w1 |                           Discovery Domain                    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w2 \                                                               \
   /                           Bearer Name                         /
w9 \                                                               \
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

 Figure 39: Value field of BEARER_CONFIG TLV 

The interpretation of the fields of the structure is as follows:




11.2.1.2.14.  LINK_CONFIG

LINK_CONFIG (type 24) is a TLV type used to change the properties of a link. It is represented by the 64-byte structure shown in Figure 40 (Link Configuration Command Format). The first field of this structure MUST be stored in network byte order.




    3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
    1 0 9 8 7 6 5 4|3 2 1 0 9 8 7 6|5 4 3 2 1 0 9 8|7 6 5 4 3 2 1 0
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w0 |                           Value                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w1 \                                                               \
   /                           Link Name                           /
w15\                                                               \
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

 Figure 40: Link Configuration Command Format 

The interpretation of the fields of the structure is as follows:




11.2.1.2.15.  NAME_TBL_QUERY

NAME_TBL_QUERY (type 25) is a TLV type used when requesting name table information. It is represented by the 16-byte structure shown in Figure 41 (Value field of NAME_TBL_QUERY TLV). All fields of this structure MUST be stored in network byte order.




    3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
    1 0 9 8 7 6 5 4|3 2 1 0 9 8 7 6|5 4 3 2 1 0 9 8|7 6 5 4 3 2 1 0
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w0 |A|                         Depth                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w1 |                           Type                                |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w2 |                           Lower Bound                         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w3 |                           Upper Bound                         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

 Figure 41: Value field of NAME_TBL_QUERY TLV 
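A non-normative sketch of composing this value field; since the excerpt does not define the A flag or the Depth field, the parameter names below are assumptions:

```python
import struct

def pack_name_tbl_query(depth, ntype, lower, upper, a_flag=False):
    """Build the 16-byte NAME_TBL_QUERY value of Figure 41 in
    network byte order.  The A flag occupies the most significant
    bit of w0 and Depth the remaining 31 bits."""
    w0 = (0x80000000 if a_flag else 0) | (depth & 0x7fffffff)
    return struct.pack('>4I', w0, ntype, lower, upper)
```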

The interpretation of the fields of the structure is as follows:




11.2.1.3.  Command Message Descriptions

The set of commands that MAY be supported by the Configuration Service is described in this section.

The description of the command reply message for each command assumes that the associated command request is executed successfully. If an error occurs during the processing of a request the Configuration Service MUST include a TLV of type ERROR_STRING as part of the command reply returned to the requesting application.




11.2.1.3.1.  NOOP

NOOP (command 0x0000) is a public command that performs no action. This command may be useful for demonstrating that an application can interact successfully with the Configuration Service.

The command request contains no TLV. The command reply contains no TLV.




11.2.1.3.2.  GET_NODES

GET_NODES (command 0x0001) is a public command that is used to obtain information about the status of a node's neighbors.

The command request contains a single TLV of type NET_ADDR, which represents a network domain. The command reply contains zero or more TLVs of type NODE_INFO, one for each node within the specified domain that this node has a direct link to (even if it is not currently operational).




11.2.1.3.3.  GET_MEDIA_NAMES

GET_MEDIA_NAMES (command 0x0002) is a public command that is used to obtain the names of all media types currently configured on a node.

The command request contains no TLV. The command reply contains zero or more TLVs of type MEDIA_NAME.




11.2.1.3.4.  GET_BEARER_NAMES

GET_BEARER_NAMES (command 0x0003) is a public command that is used to obtain the names of all bearers currently configured on a node.

The command request contains no TLV. The command reply contains zero or more TLVs of type BEARER_NAME.




11.2.1.3.5.  GET_LINKS

GET_LINKS (command 0x0004) is a public command that is used to obtain information about the status of a node's link endpoints.

The command request contains a single TLV of type NET_ADDR, which specifies a network domain. The command reply contains zero or more TLVs of type LINK_INFO, corresponding to the node's own broadcast link endpoint and any link endpoint whose peer node lies within the specified network domain.




11.2.1.3.6.  SHOW_NAME_TABLE

SHOW_NAME_TABLE (command 0x0005) is a public command that is used to obtain information about the contents of a node's name table.

The command request contains a single TLV of type NAME_TBL_QUERY. The command reply contains a single TLV of type ULTRA_STRING, whose content is unspecified.




11.2.1.3.7.  SHOW_PORTS

SHOW_PORTS (command 0x0006) is a public command that is used to obtain status and statistics information about a node's ports.

The command request contains no TLV. The command reply contains a single TLV of type ULTRA_STRING, whose content is unspecified.




11.2.1.3.8.  SHOW_LINK_STATS

SHOW_LINK_STATS (command 0x000B) is a public command that is used to obtain status and statistics information about a link endpoint.

The command request contains a single TLV of type LINK_NAME. The command reply contains a single TLV of type ULTRA_STRING, whose content is unspecified.




11.2.1.3.9.  SHOW_STATS

SHOW_STATS (command 0x000F) is a public command that is used to obtain status and statistics information about TIPC for a node.

The command request contains a single TLV of type UNSIGNED, which indicates the information to be obtained; a value of zero returns all available information, while no other values are currently defined. The command reply contains a single TLV of type ULTRA_STRING, whose content is unspecified.




11.2.1.3.10.  GET_REMOTE_MNG

GET_REMOTE_MNG (command 0x4003) is a private command that is used to determine whether a node can be remotely managed by another node in the TIPC network.

The command request contains no TLV. The command reply contains a single TLV of type UNSIGNED; a value of zero indicates that the node's Configuration Service is unable to process command requests issued by another node, while any other value indicates that processing of off-node command requests is enabled.




11.2.1.3.11.  GET_MAX_PORTS

GET_MAX_PORTS (command 0x4004) is a private command that is used to obtain the maximum number of ports that can be supported simultaneously by a node.

The command request contains no TLV. The command reply contains a single TLV of type UNSIGNED.




11.2.1.3.12.  GET_NETID

GET_NETID (command 0x400B) is a protected command that is used to obtain the TIPC network identifier used by a node.

The command request contains no TLV. The command reply contains a single TLV of type UNSIGNED.




11.2.1.3.13.  ENABLE_BEARER

ENABLE_BEARER (command 0x4101) is a protected command that is used to initiate a node's use of the specified bearer for TIPC messaging. The node will respond to requests from neighboring nodes to establish new links if the nodes lie within the specified discovery domain.

The command request contains a single TLV of type BEARER_CONFIG. The command reply contains no TLV.




11.2.1.3.14.  DISABLE_BEARER

DISABLE_BEARER (command 0x4102) is a protected command that is used to terminate a node's use of the specified bearer for TIPC messaging. The node deletes all existing link endpoints that utilize that bearer and will ignore all requests from neighboring nodes to establish new links.

The command request contains a single TLV of type BEARER_NAME. The command reply contains no TLV.




11.2.1.3.15.  SET_LINK_TOL

SET_LINK_TOL (command 0x4107) is a protected command that is used to configure the tolerance attribute of a link endpoint. (The tolerance attribute of the link's peer endpoint will be configured to match automatically.)

The command request contains a single TLV of type LINK_CONFIG. The command reply contains no TLV.




11.2.1.3.16.  SET_LINK_PRI

SET_LINK_PRI (command 0x4108) is a protected command that is used to configure the priority attribute of a link endpoint. (The priority attribute of the link's peer endpoint will be configured to match automatically.)

The command request contains a single TLV of type LINK_CONFIG. The command reply contains no TLV.




11.2.1.3.17.  SET_LINK_WINDOW

SET_LINK_WINDOW (command 0x4109) is a protected command that is used to configure the message window attribute of a link endpoint. (Unlike the tolerance and priority attributes, the message window attribute of the link's peer endpoint is not automatically configured to match.)

The command request contains a single TLV of type LINK_CONFIG. The command reply contains no TLV.




11.2.1.3.18.  RESET_LINK_STATS

RESET_LINK_STATS (command 0x410C) is a protected command that is used to reset the statistics counters for a link endpoint.

The command request contains a single TLV of type LINK_NAME. The command reply contains no TLV.




11.2.1.3.19.  SET_NODE_ADDR

SET_NODE_ADDR (command 0x8001) is a private command that is used to configure the network address of a node.

The command request contains a single TLV of type NET_ADDR, indicating the desired network address. The command reply contains no TLV.




11.2.1.3.20.  SET_REMOTE_MNG

SET_REMOTE_MNG (command 0x8003) is a private command that is used to configure whether a node can be remotely managed by another node in the TIPC network.

The command request contains a single TLV of type UNSIGNED; a value of zero disables the node's Configuration Service from processing command requests issued by another node, while any other value enables processing of off-node command requests. The command reply contains no TLV.




11.2.1.3.21.  SET_MAX_PORTS

SET_MAX_PORTS (command 0x8004) is a private command that is used to configure the maximum number of ports that can be supported simultaneously by a node.

The command request contains a single TLV of type UNSIGNED. The command reply contains no TLV.




11.2.1.3.22.  SET_NETID

SET_NETID (command 0x800B) is a private command that is used to configure the TIPC network identifier used by a node.

The command request contains a single TLV of type UNSIGNED. The command reply contains no TLV.




12.  Security Considerations

TIPC is a special-purpose transport protocol designed for operation within a secure, closed network of interconnected nodes within a cluster. TIPC does not possess any native security features, and relies on the properties of the selected bearer protocol (e.g. IPsec) when such features are needed.




13.  IANA Considerations

This memo includes no request to IANA.




14.  Contributors




15.  Acknowledgements

Thanks to Marshall Rose for developing the XML2RFC format.




16.  References




16.1. Normative References

[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels," BCP 14, RFC 2119, March 1997.



16.2. Informative References

[RFC0793] Postel, J., "Transmission Control Protocol," STD 7, RFC 793, September 1981.
[RFC2960] Stewart, R., Xie, Q., Morneault, K., Sharp, C., Schwarzbauer, H., Taylor, T., Rytina, I., Kalla, M., Zhang, L., and V. Paxson, "Stream Control Transmission Protocol," RFC 2960, October 2000.
[RFC2104] Krawczyk, H., Bellare, M., and R. Canetti, "HMAC: Keyed-Hashing for Message Authentication," RFC 2104, February 1997.



Appendix A.  Change Log

The following changes have been made from draft-spec-tipc-09.

  1. Rewritten abstract.
  2. Filled numerous remaining empty chapters and paragraphs.
  3. Updated many other chapters and paragraphs for greater accuracy.
  4. Stylistic and language changes.
  5. Updated the network hierarchy section to reflect that we now have looser requirements on intra-cluster and intra-zone connectivity.
  6. Removed all references to TIPC-level message routing, since this is not supported.
  7. Added description of the broadcast initial synchronization protocol.
  8. Introduced the SYN bit in the message header.
  9. Extended the neighbor discovery protocol header to 64 bytes.
  10. Added definition of a capability bit in the neighbor discovery protocol.
  11. Introduced two new bearers in the neighbor discovery protocol: InfiniBand and UDP/IPv4.



Appendix B.  Remaining Issues

This document is a "work-in-progress" edition of the specification for version 2 of the TIPC protocol. It is believed to be accurate and complete, although there is certainly potential for improvement in many chapters.

This document reflects the capabilities of TIPC 2.0 as implemented by the Open Source TIPC project (see http://tipc.sf.net).




Authors' Addresses

  Jon Paul Maloy
  Ericsson
  8400, boul. Decarie
  Ville Mont-Royal, Quebec H4P 2N2
  Canada
Phone:  +1 514 591-5578
EMail:  jon.maloy@ericsson.com
  
  Allan Stephens
  Wind River
  350 Terry Fox Drive, Suite 200
  Kanata, ON K2K 2W5
  Canada
Phone:  +1 613 270-2259
EMail:  allan.stephens@windriver.com