SIP-I and SIP-T Challenge: Early Media
Posted by Adam Roach on Tue, Nov 03, 2009 @ 06:02 PM
This post continues the
series of posts on SIP-I and SIP-T deployment challenges. You may wish to read
the Introduction
to SIP-I and SIP-T post for some general background on these two protocols
before continuing.
This post deals with the
issues surrounding the establishment of an audio path before a call is
completely set up.
These problems stem from the fact that SIP and ISUP have rather
different models for the way the media is set up. These differences are rooted in a philosophical difference about where call progress information is generated. In the PSTN, it is typically generated by the called party's end office. So, if a media path isn't set up before the call is completed, the call progress tones can't be sent. By contrast, in a SIP network, call progress information is usually generated by the calling party's device -- so the media path doesn't matter until the called party answers.
Because it does not require a media path to convey call progress information, SIP’s design expects that
the session will be completely established before media begins to flow. There is one minor exception: as a means
to avoid clipping off the initial media travelling from the called party to the
calling party, SIP does specify that clients are supposed to play any media
received prior to the session being established. However, this provision was
designed to avoid a very specific corner case, not to carry long-lived media
sessions.
To further understand this
behavior, keep in mind that SIP uses an offer/answer model for establishing
session parameters. One endpoint sends a proposed session description (an “offer”)
containing the IP address at which the offerer wants to receive media, and a set of
acceptable session parameters. The other endpoint responds with an “answer”
session description; this “answer” selects final values for the various session
parameters that are to be used for the media, and also includes the IP address
at which the answerer wants to receive media.
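To make the offer/answer exchange concrete, here is a minimal Python sketch. The names `SessionDescription` and `make_answer`, the addresses, and the codec list are all hypothetical and chosen only for illustration; a real session description carries considerably more detail than this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SessionDescription:
    """A drastically simplified session description (illustrative only)."""
    media_address: str  # where this endpoint wants to receive media
    media_port: int
    codecs: tuple       # offer: acceptable codecs; answer: the selected codec


def make_answer(offer, local_address, local_port, supported_codecs):
    """Build an answer by picking the first offered codec that we support."""
    for codec in offer.codecs:
        if codec in supported_codecs:
            return SessionDescription(local_address, local_port, (codec,))
    raise ValueError("no codec in common with the offer")


# Hypothetical caller-side offer and called-side answer.
offer = SessionDescription("192.0.2.10", 49170, ("G729", "PCMU", "PCMA"))
answer = make_answer(offer, "198.51.100.20", 36000, {"PCMU", "PCMA"})

print(answer.codecs)         # ('PCMU',)
print(answer.media_address)  # 198.51.100.20
```

Note that each side learns the other's receive address only from the other side's session description, which is what drives the timing issues discussed next.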
Since the media session
negotiation does not typically complete until the called party answers, media sent
towards the caller before the session is completely established may or may not
work properly; and, since the IP address at which the called party receives media
only shows up in the answer, it is actually impossible to send media towards the
called party until that answer arrives. (Keep in mind that an RTP “session” is actually composed of two streams: one flowing towards the caller, and one flowing towards the called party. Once the offer is sent, the stream towards the caller can begin. However, the stream towards the called party cannot start until after the answer is received.)
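The two-stream point in the parenthetical above can be restated as a tiny Python sketch. The function name `sendable_streams` and its string labels are hypothetical; the only logic it encodes is that each stream needs the receive address carried in the corresponding session description.

```python
def sendable_streams(offer_delivered, answer_delivered):
    """Return the directions in which media can currently be sent."""
    streams = set()
    if offer_delivered:
        # The offer carries the caller's receive address.
        streams.add("towards caller")
    if answer_delivered:
        # The answer carries the called party's receive address.
        streams.add("towards called party")
    return streams


# Offer delivered, answer still outstanding: only one direction is possible.
print(sorted(sendable_streams(True, False)))  # ['towards caller']
# Both delivered: the full two-way session can flow.
print(sorted(sendable_streams(True, True)))   # ['towards called party', 'towards caller']
```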
By contrast, ISUP expects
the ability to send media on a circuit as soon as it is seized, which happens
as soon as the call attempt begins, so that it can send call progress and call error tones. To further complicate matters, many deployed
IVRs take advantage of this behavior by not triggering an ANM (thus
establishing the session and, typically, marking the start of charging) until a
human answers the call.
In other words, SIP
provides a “best effort” attempt at passing media before the call is established, while ISUP
has an absolute requirement for sending media before the call is completely
established.
The IETF began considering
this problem well before the current round of SIP specifications was published,
with significant impact on the final documents. In June of 1999, a proposed “183
Session Progress” response code was described [1] for instructing an ingress
gateway to suppress ringback, and to use the in-band media instead. Ultimately, this solution was put aside
due to a number of shortcomings, including the inability to ensure that the 183
response is actually received by the calling party. (A “183 Session Progress”
response code was later added to the core SIP specification, but with very
different semantics from those originally proposed.)
As a result of the work on
early media and PSTN interoperation, by the time the core set of SIP
specifications was published as RFCs 3261 through 3265, it contained a limited set
of tools that allowed the establishment of “early” media sessions. However,
exact procedures for combining these tools were not finally published until
December of 2004, in the form of RFC
3960.
At a high level, here’s
how PSTN gateways can set up media sessions prior to the final establishment of
a call:

This is pretty similar to
the diagram for a basic call setup that we looked at in the introduction post.
The key difference is that this diagram shows exactly when the media path is
set up between each component in the system. In the PSTN, these audio paths are
set up as soon as the network can – messages 2, 5, 12, and 21 all happen as
soon as possible. Because the SIP network doesn’t treat media quite the same
way as the PSTN, we need some extra signaling to set up the session. That’s
where messages 7 through 9 come in. Message 7 contains a provisional session
description “answer” for the session “offer” that was present in message 6. At
this point, both gateways have enough information to exchange audio. (Messages
8 and 9 simply acknowledge the receipt of message 7; this is necessary to
ensure that message 7 is delivered reliably.) In this scenario, because there
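The acknowledgement step (messages 8 and 9) can be sketched in Python. This assumes, beyond what the text above states, that message 7 is sent as a reliable provisional response in the style of RFC 3262, where the acknowledging request carries an RAck header built from the RSeq and CSeq values of the response it acknowledges. The function name `rack_for` and the header values shown are hypothetical.

```python
def rack_for(provisional_headers):
    """Build the RAck value that acknowledges a reliable provisional response.

    RAck is formed as "<RSeq number> <CSeq number> <CSeq method>".
    """
    rseq = provisional_headers["RSeq"].strip()
    cseq_number, method = provisional_headers["CSeq"].split()
    return f"{rseq} {cseq_number} {method}"


# Hypothetical headers for message 7 (a provisional response carrying the answer).
message_7 = {
    "Require": "100rel",       # marks the response as needing acknowledgement
    "RSeq": "1",
    "CSeq": "314159 INVITE",
}

# Header the sender of message 8 would include to acknowledge message 7.
print("RAck:", rack_for(message_7))  # RAck: 1 314159 INVITE
```

If the acknowledgement never arrives, the sender of message 7 retransmits it, which is how delivery of the provisional answer is made reliable.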
is an audio path all the way to the called party’s end office, the ringback
tone can be generated remotely (message 15) instead of being generated by the
caller’s end office. However, even with these procedures in place, the ingress
gateway cannot rely on the remote end
generating ringback tones. It must monitor the media stream, and locally
generate ringback information if there is none present in the audio. This can
lead to jarring transitions where a calling party hears ringback generated by
the ingress gateway, followed by an abrupt change to ringback generated by the
called party’s end office.
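The ingress gateway's fallback decision can be sketched roughly as follows. This is a simplified illustration, not a description of how any particular gateway works: the energy threshold, the frame size, and the names `frame_has_audio` and `ringback_source` are all assumptions, and a real implementation would need smoothing to avoid flapping between the two sources.

```python
def frame_has_audio(samples, threshold=500):
    """True if the mean absolute amplitude of 16-bit PCM samples exceeds
    the (arbitrarily chosen) threshold."""
    if not samples:
        return False
    return sum(abs(s) for s in samples) / len(samples) > threshold


def ringback_source(recent_frames):
    """Pass through far-end audio if any recent frame carries it;
    otherwise fall back to locally generated ringback."""
    if any(frame_has_audio(frame) for frame in recent_frames):
        return "remote"
    return "local"


silence = [[0] * 160 for _ in range(3)]   # three silent 20 ms frames at 8 kHz
tone = [[4000, -4000] * 80]               # one frame with clearly audible signal

print(ringback_source(silence))         # local
print(ringback_source(silence + tone))  # remote
```

The switch from "local" to "remote" in the second call is exactly the kind of mid-call transition that produces the jarring change described above.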
Further complicating
matters: even with the procedures defined in RFC 3960, the exact behavior
defined for local versus remote generation of call progress and error tones
remains a matter of local policy at the PSTN gateways.
However, the key problem
with this approach is that the additional procedures used to set up an early
session are not necessarily supported by native SIP terminals – which means
that early media tends not to work properly when calling from a native SIP
device. This is particularly troublesome in the case of IVRs that expect to
play information to a user before establishing the call. Similarly, when a call
is made to a SIP device, the gateway and end office must assume responsibility
for generating call progress and error tones that would typically come from the
remote end office.
There are some other
early-media complications that arise when more than one egress gateway is involved, but we’ll
save those for next time.
[1] http://tools.ietf.org/html/draft-donovan-mmusic-183-00