6 Failure Detection and Recovery

6.1 Failure Detection

The SCMP failure detection mechanism is based on two assumptions:

1. If a neighbor of an ST agent is up, and has been up without a
disruption, and has not notified the ST agent of a problem with
streams that pass through both, then the ST agent can assume that
there has not been any problem with those streams.

Delgrossi & Berger, Editors Experimental [Page 55]

RFC 1819 ST2+ Protocol Specification August 1995

2. A network through which an ST agent has routed a stream will notify
the ST agent if there is a problem that affects the stream data
packets but does not affect the control packets.

The purpose of the robustness protocol defined here is for ST agents
to determine that the streams through a neighbor have been broken by
the failure of the neighbor or the intervening network. This protocol
should detect the overwhelming majority of failures that can occur.
Once a failure is detected, the recovery procedures described in
Section 6.2 are initiated by the ST agents.

6.1.1 Network Failures

An ST agent can detect network failures by two mechanisms:

o the network can report a failure, or

o the ST agent can discover a failure by itself.

They differ in the amount of information that an ST agent has
available to it in order to make a recovery decision. For example, a
network may be able to report that reserved bandwidth has been lost
and the reason for the loss and may also report that connectivity to
the neighboring ST agent remains intact. On the other hand, an ST
agent may discover that communication with a neighboring ST agent has
ceased because it has not received any traffic from that neighbor in
some time period. If an ST agent detects a failure, it may not be
able to determine if the failure was in the network while the
neighbor remains available, or the neighbor has failed while the
network remains intact.

6.1.2 Detecting ST Agent Failures

Each ST agent periodically sends each neighbor with which it shares
one or more streams a HELLO message. This message exchange is between
ST agents, not entities representing streams or applications. That
is, an ST agent need only send a single HELLO message to a neighbor
regardless of the number of streams that flow between them. All ST
agents (host as well as intermediate) must participate in this
exchange. However, only ST agents that share active streams can
participate in this exchange and it is an error to send a HELLO
message to a neighbor ST agent with no streams in common, e.g., to
check whether it is active. STATUS messages can be used to poll the
status of neighbor ST agents, see Section 8.4.

For the purpose of HELLO message exchange, stream existence is
bounded by ACCEPT and DISCONNECT/REFUSE processing and is defined for
both the upstream and downstream case. A stream to a previous-hop is
defined to start once an ACCEPT message has been forwarded upstream.
A stream to a next-hop is defined to start once the received ACCEPT
message has been acknowledged. A stream is defined to terminate once
an acknowledgment is sent for a received DISCONNECT or REFUSE
message, and an acknowledgment for a sent DISCONNECT or REFUSE
message has been received.

The HELLO message has two fields:

o a HelloTimer field that is in units of milliseconds modulo the
maximum for the field size, and

o a Restarted-bit specifying that the ST agent has been restarted
recently.

The HelloTimer must appear to be incremented every millisecond
whether a HELLO message is sent or not. The HelloTimer wraps around
to zero after reaching the maximum value. Whenever an ST agent
suffers a catastrophic event that may result in it losing ST state
information, it must reset its HelloTimer to zero and must set the
Restarted-bit in all HELLO messages sent in the following
HelloTimerHoldDown seconds.
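
The HelloTimer and Restarted-bit rules above can be sketched as follows.
This is a minimal illustration, not part of the specification: the
32-bit field width, the 10-second HelloTimerHoldDown value, and the
names HelloClock, restart and hello_fields are assumptions made for the
example, and the caller is assumed to supply a monotonic millisecond
clock.

```python
from typing import Optional, Tuple

HELLO_TIMER_BITS = 32            # assumption: field width is not given in this section
HELLO_TIMER_MODULUS = 1 << HELLO_TIMER_BITS
HELLO_TIMER_HOLD_DOWN_S = 10     # assumption: illustrative value only


class HelloClock:
    """Tracks the HelloTimer value and Restarted-bit for one ST agent."""

    def __init__(self, now_ms: int) -> None:
        # now_ms comes from any monotonic millisecond clock.
        self._epoch_ms = now_ms
        self._restart_ms: Optional[int] = None

    def restart(self, now_ms: int) -> None:
        """Reset the timer to zero after an event that may have lost ST state."""
        self._epoch_ms = now_ms
        self._restart_ms = now_ms

    def hello_fields(self, now_ms: int) -> Tuple[int, bool]:
        """Return (HelloTimer, Restarted-bit) for a HELLO sent at now_ms."""
        # The timer advances with the clock whether or not HELLOs are sent,
        # and wraps to zero at the maximum value of the field.
        timer = (now_ms - self._epoch_ms) % HELLO_TIMER_MODULUS
        restarted = (
            self._restart_ms is not None
            and now_ms - self._restart_ms < HELLO_TIMER_HOLD_DOWN_S * 1000
        )
        return timer, restarted
```

Deriving the timer from a clock, rather than incrementing a counter,
is one way to make it "appear to be incremented every millisecond"
without a per-millisecond task.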

If an ST agent receives a HELLO message that has the Restarted-bit
set, it must assume that the sending ST agent has lost its state.
If it shares streams with that neighbor, it must initiate stream
recovery activity, see Section 6.2. If it does not share streams with
that neighbor, it should not attempt to create one until that bit is
no longer set. If an ST agent receives a CONNECT message from a
neighbor whose Restarted-bit is still set, the agent must respond
with an ERROR message with the appropriate ReasonCode
(RestartRemote). If an agent receives a CONNECT message while the
agent's own Restarted-bit is set, the agent must respond with an
ERROR message with the appropriate ReasonCode (RestartLocal).
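
The reactions to a set Restarted-bit can be summarized in a short
sketch. The function names and the string form of the ReasonCodes are
hypothetical. The text does not say which ReasonCode wins when both
agents have their Restarted-bit set; checking the local bit first is an
arbitrary choice made here.

```python
from typing import List, Optional


def streams_to_recover(restarted_bit: bool, shared_streams: List[str]) -> List[str]:
    """Streams for which recovery must start after a HELLO from a neighbor."""
    # A set Restarted-bit means the neighbor is assumed to have lost its
    # state, so every stream shared with it needs recovery (Section 6.2).
    return list(shared_streams) if restarted_bit else []


def connect_error_reason(local_restarted: bool, sender_restarted: bool) -> Optional[str]:
    """ReasonCode for the ERROR sent in reply to a CONNECT, or None to proceed."""
    if local_restarted:       # assumption: local check takes priority
        return "RestartLocal"
    if sender_restarted:
        return "RestartRemote"
    return None
```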

Each ST stream has an associated RecoveryTimeout value. This value is
assigned by the origin and carried in the CONNECT message, see
Section 4.5.10. Each agent checks to see if it can support the
requested value. If it cannot, it updates the value to the smallest
timeout interval it can support. The RecoveryTimeout used by a
particular stream is obtained from the ACCEPT message, see Section
4.5.10, and is the smallest value seen across all ACCEPT messages
from participating targets.
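
As an illustration of the two steps above, the following sketch shows
how the value might be adjusted hop by hop and then selected from the
ACCEPT messages. The function names and the use of milliseconds as the
unit are assumptions made for the example.

```python
from typing import List


def adjust_recovery_timeout(requested_ms: int, min_supported_ms: int) -> int:
    """Value an agent places in the CONNECT it forwards."""
    # An agent that cannot support the requested value replaces it with
    # the smallest timeout interval it can support.
    return max(requested_ms, min_supported_ms)


def stream_recovery_timeout(accept_values_ms: List[int]) -> int:
    """RecoveryTimeout used by the stream: smallest value over all ACCEPTs."""
    return min(accept_values_ms)
```

For example, a requested value of 200 passing through an agent that
supports no less than 500 leaves that agent as 500.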

An ST agent must send HELLO messages to its neighbor with a period
shorter than the smallest RecoveryTimeout of all the active streams
that pass between the two ST agents, regardless of direction. This
period must be smaller by a factor, called HelloLossFactor, which is
at least as large as the greatest number of consecutive HELLO
messages that could credibly be lost while the communication between
the two ST agents is still viable.

An ST agent may send simultaneous HELLO messages to all its neighbors
at the rate necessary to support the smallest RecoveryTimeout of any
active stream. Alternatively, it may send HELLO messages to different
neighbors independently at different rates corresponding to
RecoveryTimeouts of individual streams.
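
The relation between RecoveryTimeout, HelloLossFactor and the HELLO
period can be written out as a small calculation. The function names
are hypothetical, and the value returned is only the upper bound
implied by the text; an implementation would pick a period at or below
it.

```python
from typing import Dict, List


def max_hello_period_ms(recovery_timeouts_ms: List[int], hello_loss_factor: int) -> float:
    """Upper bound on the HELLO period toward one neighbor."""
    if hello_loss_factor < 1:
        raise ValueError("HelloLossFactor must be at least 1")
    # Smallest RecoveryTimeout of the shared streams, reduced by the factor.
    return min(recovery_timeouts_ms) / hello_loss_factor


def per_neighbor_periods(
    timeouts_by_neighbor: Dict[str, List[int]], hello_loss_factor: int
) -> Dict[str, float]:
    """Independent upper bounds per neighbor; neighbors without streams get no HELLOs."""
    return {
        neighbor: max_hello_period_ms(timeouts, hello_loss_factor)
        for neighbor, timeouts in timeouts_by_neighbor.items()
        if timeouts
    }
```

An agent that prefers to send HELLO messages to all neighbors
simultaneously would use the minimum of the per-neighbor values.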

An ST agent must expect to receive at least one new HELLO message
from each neighbor at least as frequently as the smallest
RecoveryTimeout of any active stream in common with that neighbor.
The agent can detect duplicate or delayed HELLO messages by comparing
the HelloTimer field of the most recent valid HELLO message from that
neighbor with the HelloTimer field of an incoming HELLO message.
Valid incoming HELLO messages will have a HelloTimer field that is
greater than the field contained in the previously received valid
HELLO message by the time elapsed since the previous message was
received. Actual evaluation of the elapsed time interval should take
into account the maximum likely delay variance from that neighbor.

If the ST agent does not receive a valid HELLO message within the
RecoveryTimeout period of a stream, it must assume that the
neighboring ST agent or the communication link between the two has
failed and it must initiate stream recovery activity, as described
below in Section 6.2.
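
A receiver-side check along these lines is sketched below. The class
and parameter names are hypothetical, the 32-bit modulus is an
assumption, and max_delay_variance_ms stands in for "the maximum likely
delay variance from that neighbor", whose value the text leaves to the
implementation.

```python
from typing import Optional

HELLO_TIMER_MODULUS = 1 << 32    # assumption: 32-bit HelloTimer field


class NeighborMonitor:
    """Validates HELLO messages from one neighbor and detects silence."""

    def __init__(self, recovery_timeout_ms: int, max_delay_variance_ms: int, now_ms: int) -> None:
        self.recovery_timeout_ms = recovery_timeout_ms
        self.max_delay_variance_ms = max_delay_variance_ms
        self.last_timer: Optional[int] = None
        self.last_valid_rx_ms = now_ms

    def on_hello(self, hello_timer: int, now_ms: int) -> bool:
        """Return True if the HELLO is valid, False if duplicate or delayed."""
        if self.last_timer is not None:
            elapsed = now_ms - self.last_valid_rx_ms
            # Modular difference copes with the timer wrapping to zero.
            advanced = (hello_timer - self.last_timer) % HELLO_TIMER_MODULUS
            if abs(advanced - elapsed) > self.max_delay_variance_ms:
                return False
        self.last_timer = hello_timer
        self.last_valid_rx_ms = now_ms
        return True

    def has_failed(self, now_ms: int) -> bool:
        """True once no valid HELLO has arrived within RecoveryTimeout."""
        return now_ms - self.last_valid_rx_ms > self.recovery_timeout_ms
```

In this sketch a duplicate shows up as a timer that has not advanced
while time has elapsed, and a delayed message as a timer that appears
to have moved backwards.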

6.2 Failure Recovery

If an intermediate ST agent fails or a network or part of a network
fails, the previous-hop ST agent and the various next-hop ST agents
will discover the fact by the failure detection mechanism described
in Section 6.1.

The recovery of an ST stream is a relatively complex and
time-consuming effort because it is designed in a general manner to
operate across a large number of networks with diverse
characteristics. Therefore, it may require information to be
distributed widely, and may require relatively long timers. On the
other hand, since a network is typically a homogeneous system,
failure recovery in the network may be a relatively faster and
simpler operation. Therefore an ST agent that detects a failure
should attempt to fix the network failure before attempting recovery
of the ST stream. If the stream that existed between two ST agents
before the failure cannot be reconstructed by network recovery
mechanisms alone, then the ST stream recovery mechanism must be
invoked.

If stream recovery is necessary, the different ST agents will need to
perform different functions, depending on their relation to the
failure:

o An ST agent that is a next-hop from a failure should first verify
that there was a failure. It can do this using STATUS messages to
query its upstream neighbor. If it cannot communicate with that
neighbor, then for each active stream from that neighbor it should
first send a REFUSE message upstream with the appropriate ReasonCode
(STAgentFailure). The REFUSE is sent toward the neighbor anyway to speed
up failure recovery in case the hop is unidirectional, i.e., the
neighbor can hear the ST agent but the ST agent cannot hear the
neighbor. The ST agent detecting the failure must then, for each
active stream from that neighbor, send DISCONNECT messages with the
same ReasonCode toward the targets. All downstream ST agents process
this DISCONNECT message just like the DISCONNECT that tears down the
stream. If recovery is successful, targets will receive new CONNECT
messages.

o An ST agent that is the previous-hop before the failed component
first verifies that there was a failure by querying the downstream
neighbor using STATUS messages. If the neighbor has lost its state
but is available, then the ST agent may try to reconstruct
(explained below) the affected streams, for those streams that do
not have the NoRecovery option selected. If it cannot communicate
with the next-hop, then the ST agent detecting the failure sends a
DISCONNECT message, for each affected stream, with the appropriate
ReasonCode (STAgentFailure) toward the affected targets. It does so
to speed up failure recovery in case the communication may be
unidirectional and this message might be delivered successfully.
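
The two roles described above can be sketched as functions that return
the messages to send. This is a simplified illustration: the Action
tuple, the function names, and the "RECONSTRUCT" marker (a placeholder
for the reconstruction procedure, not a protocol message) are
assumptions, and the outcome of the STATUS query is passed in as a
boolean.

```python
from typing import Dict, List, Tuple

# (stream id, message, direction, ReasonCode); a hypothetical representation.
Action = Tuple[str, str, str, str]


def next_hop_actions(upstream_answers_status: bool, stream_ids: List[str]) -> List[Action]:
    """Actions of an agent that is a next-hop from a suspected failure."""
    if upstream_answers_status:
        return []  # failure not confirmed; nothing is prescribed here
    actions: List[Action] = []
    for sid in stream_ids:
        # REFUSE first, in case the neighbor can still hear this agent.
        actions.append((sid, "REFUSE", "upstream", "STAgentFailure"))
        actions.append((sid, "DISCONNECT", "downstream", "STAgentFailure"))
    return actions


def previous_hop_actions(
    neighbor_reachable: bool, neighbor_lost_state: bool, no_recovery: Dict[str, bool]
) -> List[Action]:
    """Actions of the agent that is the previous-hop before the failure."""
    if not neighbor_reachable:
        return [(sid, "DISCONNECT", "downstream", "STAgentFailure") for sid in no_recovery]
    if neighbor_lost_state:
        # Only streams without the NoRecovery option may be reconstructed.
        return [(sid, "RECONSTRUCT", "downstream", "") for sid, nr in no_recovery.items() if not nr]
    return []
```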

Based on the NoRecovery option, the ST agent that is the previous-hop
before the failed component takes the following actions:

o If the NoRecovery option is selected, then the ST agent sends, per
affected stream, a REFUSE message with the appropriate ReasonCode
(STAgentFailure) to the previous-hop. The TargetList in these
messages contains all the targets that were reached through the
broken branch. As discussed in Section 5.1.2, multiple REFUSE
messages may be required if the PDU is too long for the MTU of the
intervening network. The REFUSE message is propagated all the way to
the origin. The application at the origin can attempt recovery of
the stream by sending a new CONNECT to the affected targets. For
established streams, the new CONNECT will be treated by intermediate
ST agents as an addition of new targets into the established stream.

o If the NoRecovery option is not selected, the ST agent can attempt
recovery of the affected streams. It does so on a stream-by-stream
basis by issuing a new CONNECT message to the affected targets. If
the ST agent cannot find new routes to some targets, or if the only
route to some targets is through the previous-hop, then it sends one
or more REFUSE messages to the previous-hop with the appropriate
ReasonCode (CantRecover) specifying the affected targets in the
TargetList. The previous-hop can then attempt recovery of the stream
by issuing a CONNECT to those targets. If it cannot find an
appropriate route, it will propagate the REFUSE message toward the
origin.
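
The NoRecovery decision for one affected stream might be sketched as
follows. The Action class and the routable_targets argument (the
targets for which routing has found a new route that does not lead back
through the previous-hop) are assumptions for the example. Splitting a
long TargetList across several REFUSE messages to fit the MTU is not
modeled.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set, Tuple


@dataclass(frozen=True)
class Action:
    message: str              # "REFUSE" or "CONNECT"
    direction: str            # "upstream" or "downstream"
    reason: str               # ReasonCode, empty for CONNECT
    targets: Tuple[str, ...]  # TargetList carried by the message


def previous_hop_recovery(
    no_recovery: bool, affected_targets: Iterable[str], routable_targets: Set[str]
) -> List[Action]:
    """Messages sent for one affected stream by the previous-hop agent."""
    affected = tuple(affected_targets)
    if no_recovery:
        # Recovery is left to the application at the origin.
        return [Action("REFUSE", "upstream", "STAgentFailure", affected)]
    reachable = tuple(t for t in affected if t in routable_targets)
    unreachable = tuple(t for t in affected if t not in routable_targets)
    actions: List[Action] = []
    if reachable:
        actions.append(Action("CONNECT", "downstream", "", reachable))
    if unreachable:
        actions.append(Action("REFUSE", "upstream", "CantRecover", unreachable))
    return actions
```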

Regardless of which ST agent attempts recovery of a damaged stream,
it will issue one or more CONNECT messages to the affected targets.
These CONNECT messages are treated by intermediate ST agents as
additions of new targets into the established stream. The FlowSpecs
of the new CONNECT messages are the same as the ones contained in the
most recent CONNECT or CHANGE messages that the ST agent had sent
toward the affected targets when the stream was operational.

Upon receiving an ACCEPT during stream recovery, the agent
reconstructing the stream must ensure that the FlowSpec and other
stream attributes (e.g., MaxMsgSize and RecoveryTimeout) of the
re-established stream are equal to, or less restrictive than, those
of the pre-failure stream. If they are more restrictive, the recovery
attempt must be aborted. If they are equal or less restrictive, the
recovery attempt is successful. When the attempt succeeds, ACCEPTs
related to failure recovery are not forwarded upstream by the
recovering agent.

Any ST agent that decides that enough recovery attempts have been
made, or that recovery attempts have no chance of succeeding, may
indicate that no further attempts at recovery should be made. This is
done by setting the N-bit in the REFUSE message, see Section 10.4.11.
This bit must be set by agents, including the target, that know that
there is no chance of recovery succeeding. An ST agent that receives
a REFUSE message with the N-bit set (1) will not attempt recovery,
regardless of the NoRecovery option, and it will set the N-bit when
propagating the REFUSE message upstream.
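
The effect of the N-bit on an agent that receives a REFUSE can be
sketched as a small decision function. The function name and the
local_gives_up flag (the agent's own judgment that further attempts are
pointless) are hypothetical.

```python
from typing import Tuple


def handle_refuse(n_bit_received: bool, no_recovery: bool, local_gives_up: bool) -> Tuple[bool, bool]:
    """Return (attempt_recovery, n_bit_if_propagated_upstream)."""
    if n_bit_received:
        # No recovery, whatever the NoRecovery option says; the bit is kept.
        return False, True
    attempt = not no_recovery and not local_gives_up
    # The agent sets the bit itself only when it has decided to give up.
    return attempt, local_gives_up
```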

6.2.1 Problems in Stream Recovery

The reconstruction of a broken stream may not proceed smoothly. Since
there may be some delay while the information concerning the failure
is propagated throughout an internet, routing errors may occur for
some time after a failure. As a result, the ST agent attempting the
recovery may receive ERROR messages for the new CONNECTs that are
caused by internet routing errors. The ST agent attempting the
recovery should be prepared to resend CONNECTs before it succeeds in
reconstructing the stream. If the failure partitions the internet and
a new set of routes cannot be found to the targets, the REFUSE
messages will eventually be propagated to the origin, which can then
inform the application so it can decide whether to terminate or to
continue to attempt recovery of the stream.

The new CONNECT may at some point reach an ST agent downstream of the
failure before the DISCONNECT does. In this case, the ST agent that
receives the CONNECT is not yet aware that the stream has suffered a
failure, and will interpret the new CONNECT as resulting from a
routing failure. It will respond with an ERROR message with the
appropriate ReasonCode (StreamExists). Since the timeouts used by the
ST agents immediately preceding the failure and immediately following
the failure are approximately the same, it is very likely that the
remnants of the broken stream will soon be torn down by a DISCONNECT
message. Therefore, the ST agent that receives the ERROR message with
ReasonCode (StreamExists) should retransmit the CONNECT message after
the ToConnect timeout expires. If this fails again, the request will
be retried up to NConnect times. Only if it still fails will the ST
agent send a REFUSE message with the appropriate ReasonCode
(RouteLoop) to its previous-hop. This message will be propagated back
to the ST agent that is attempting recovery of the damaged stream.
That ST agent can issue a new CONNECT message if it so chooses. The
REFUSE is matched to a CONNECT message created by a recovery
operation through the LnkReference field in the CONNECT.
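
The retry behavior for ERROR (StreamExists) can be sketched as a loop.
The callables send_connect and wait, and the string encoding of the
replies, are assumptions made so that the example is self-contained;
they are injected so the logic can be exercised without a network.

```python
from typing import Callable

STREAM_EXISTS = "ERROR(StreamExists)"   # hypothetical reply encoding


def connect_with_retry(
    send_connect: Callable[[], str],
    wait: Callable[[float], None],
    to_connect_s: float,
    n_connect: int,
) -> str:
    """Send a CONNECT, retrying on StreamExists up to NConnect times."""
    reply = send_connect()
    retries = 0
    while reply == STREAM_EXISTS and retries < n_connect:
        wait(to_connect_s)        # let the ToConnect timeout expire
        reply = send_connect()    # retransmit the CONNECT
        retries += 1
    if reply == STREAM_EXISTS:
        # Still failing: report upstream to the previous-hop.
        return "REFUSE(RouteLoop)"
    return reply
```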

ST agents that have propagated a CONNECT message and have received a
REFUSE message should maintain this information for some period of
time. If an ST agent receives a second CONNECT message for a target
that recently resulted in a REFUSE, that ST agent may respond with a
REFUSE immediately rather than attempting to propagate the CONNECT.
This has the effect of pruning the tree that is formed by the
propagation of CONNECT messages to a target that is not reachable by
the routes that are selected first. The tree will pass through any
given ST agent only once, and the stream setup phase will be
completed faster.
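
One way to keep this information is a small cache keyed by stream and
target. The class name, the key layout, and the hold time are
assumptions; the text says only that the information should be kept
"for some period of time".

```python
from typing import Dict, Tuple


class RefuseCache:
    """Remembers targets that recently resulted in a REFUSE."""

    def __init__(self, hold_ms: int) -> None:
        self._hold_ms = hold_ms   # assumption: retention period is implementation-defined
        self._refused_at: Dict[Tuple[str, str], int] = {}

    def record_refuse(self, stream_id: str, target: str, now_ms: int) -> None:
        self._refused_at[(stream_id, target)] = now_ms

    def should_refuse_immediately(self, stream_id: str, target: str, now_ms: int) -> bool:
        """True if a new CONNECT for this target may be refused without forwarding."""
        key = (stream_id, target)
        refused_at = self._refused_at.get(key)
        if refused_at is None:
            return False
        if now_ms - refused_at > self._hold_ms:
            del self._refused_at[key]   # entry expired; try the route again
            return False
        return True
```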

If a CONNECT message reaches a target, the target should, during
recovery of the stream, make as efficient use as possible of the state
that it saved before the stream failed. It will then issue
an ACCEPT message toward the origin. The ACCEPT message will be
intercepted by the ST agent that is attempting recovery of the
damaged stream, if not the origin. If the FlowSpec contained in the
ACCEPT specifies the same selection of parameters as were in effect
before the failure, then the ST agent that is attempting recovery
will not propagate the ACCEPT. FlowSpec comparison is done by the
LRM. If the selections of the parameters are different, then the ST
agent that is attempting recovery will send the origin a NOTIFY
message with the appropriate ReasonCode (FailureRecovery) that
contains a FlowSpec that specifies the new parameter values. The
origin may then have to change its data generation characteristics
and the stream's parameters with a CHANGE message to use the newly
recovered subtree.
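
The handling of an intercepted ACCEPT can be sketched as follows. The
FlowSpec is represented as a plain dictionary and compared by equality,
which only stands in for the comparison performed by the LRM; the
function name and tuple layout are hypothetical. The case in which
recovery is aborted (Section 6.2) is not modeled.

```python
from typing import Any, Dict, Optional, Tuple

FlowSpec = Dict[str, Any]   # hypothetical stand-in for the real FlowSpec


def on_recovery_accept(
    pre_failure: FlowSpec, accepted: FlowSpec
) -> Optional[Tuple[str, str, FlowSpec]]:
    """Return the NOTIFY to send to the origin, or None if nothing is sent."""
    if accepted == pre_failure:
        # Same parameter selection as before the failure: the ACCEPT is
        # absorbed by the recovering agent and not propagated.
        return None
    return ("NOTIFY", "FailureRecovery", accepted)
```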

6.3 Stream Preemption

As mentioned in Section 1.4.5, it is possible that the LRM decides to
break a stream intentionally. This is called stream preemption.
Streams are expected to be preempted in order to free resources for a
new stream which has a higher priority.

If the LRM decides that it is necessary to preempt one or more of the
streams traversing it, it has to decide which streams are to be
preempted. There are two ways for an application to influence such a
decision:

1. based on FlowSpec information. For instance, with the ST2+
FlowSpec, streams can be assigned a precedence value from 0
(least important) to 256 (most important). This value is
carried in the FlowSpec when the stream is set up, see Section
9.2, so that the LRM is informed about it.

2. with the group mechanism. An application may specify that a set
of streams are related to each other and that they are all
candidates for preemption if one of them gets preempted. This can
be done by using the fate-sharing relationship defined in
Section 7.1.2. This helps the LRM make a good choice when
more than one stream has to be preempted, because it leads to
breaking a single application as opposed to as many
applications as the number of preempted streams.
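
How an LRM might combine the two inputs is sketched below. The policy
shown (pick the stream with the lowest precedence, then preempt its
whole fate-sharing group) is only one possible choice and is an
assumption; the text leaves the actual decision to the LRM. The Stream
class and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass(frozen=True)
class Stream:
    stream_id: str
    precedence: int          # 0 is least important
    group: Optional[str]     # fate-sharing group name, if any


def preemption_set(streams: List[Stream]) -> Set[str]:
    """Ids of the streams to preempt when one stream must be broken."""
    victim = min(streams, key=lambda s: s.precedence)
    if victim.group is None:
        return {victim.stream_id}
    # Fate-sharing: break the whole group so only one application is hit.
    return {s.stream_id for s in streams if s.group == victim.group}
```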

If the LRM preempts a stream, it must notify the local ST agent. The
following actions are performed by the ST agent:

o The ST agent at the host where the stream was preempted sends
DISCONNECT messages with the appropriate ReasonCode
(StreamPreempted) toward the affected targets. It sends a REFUSE
message with the appropriate ReasonCode (StreamPreempted) to the
previous-hop.

o A previous-hop ST agent of the preempted stream acts as in the case
of failure recovery, see Section 6.2.

o A next-hop ST agent of the preempted stream acts as in the case of
failure recovery, see Section 6.2.

Note that, in contrast to failure recovery, there is no need to
verify that the failure actually occurred, because this is explicitly
indicated by the ReasonCode (StreamPreempted).
