IP Knowledge Center

VoIP Deep Dive - Part I

How VoIP Works
Digitization and Encoding
Basic VoIP Transmission
Quality of Service
The VoIP Gateway
Part II


How VoIP Works

VoIP is a collection of digitally encoded voice transmissions carried over a network that is based on a single common language, or protocol: in this case, the Internet Protocol (IP).

Any two devices connected to the same IP network can communicate directly, so long as each device knows the other's IP address. By interconnecting networks, IP-based applications such as voice over IP (VoIP) can be transported along with other services, worldwide.

VoIP works as a peer-to-peer application, entailing handshaking and direct media exchange between two IP devices, for example, IP phones. To call someone, the user dials the telephone number and the handset translates that number into an IP address (e.g., 192.0.2.22). The calling device then sends data packets whose payloads contain messages conforming to a particular call-setup protocol, and the two devices establish a common connection for voice exchange. The called party's device rings, they pick up, and media packets flow in both directions.

VoIP call-setup and other signaling typically involve the exchange of relatively short messages -- single packets, instead of long packet trains. The call-setup protocols most commonly used -- SIP (Session Initiation Protocol) and H.323 -- exchange requests and acknowledgements sufficient to avoid confusion and trigger retries if packets are lost in transit. For this reason, VoIP signaling doesn't normally require TCP -- the robust "Transmission Control Protocol" used by many IP applications to "virtualize" and manage the network link -- though TCP is sometimes used for call setup in specialized wireless and other applications where link quality is dubious. More commonly, VoIP tops off signaling packets with IP and UDP (User Datagram Protocol) headers. Together these form a shorter header format, used for "send and forget" short messaging applications: the IP header identifies type of service and the source and destination IP addresses, and the UDP header identifies the logical ports.

Similarly, in VoIP media exchange, encoded audio is stuffed into the payloads of a sequence of short packets, each preceded by three protocol "headers": an IP (Internet Protocol) header, identifying the IP addresses of both telephones and carrying a Type of Service (TOS) field (now redefined as the 'DSCP' or 'Differentiated Services Code Point' field), set to identify this as priority traffic for expedited forwarding; a UDP (User Datagram Protocol) header, identifying the source and destination logical ports; and an RTP (Real-time Transport Protocol) header, noting packet sequence and carrying time-synchronization information.
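As a rough illustration of the innermost of these three headers, the sketch below packs and unpacks the 12-byte fixed RTP header. The function names are hypothetical, and the sketch assumes no CSRC list or header extension and the static payload type 0 (G.711 mu-law); the IP and UDP headers would normally be added by the operating system's network stack.

```python
import struct

RTP_VERSION = 2
PAYLOAD_TYPE_PCMU = 0  # static RTP payload type for G.711 mu-law


def build_rtp_header(seq, timestamp, ssrc,
                     payload_type=PAYLOAD_TYPE_PCMU, marker=False):
    """Pack the 12-byte fixed RTP header (assumes no CSRC list, no extension)."""
    byte0 = RTP_VERSION << 6  # version=2, padding=0, extension=0, CSRC count=0
    byte1 = (int(marker) << 7) | (payload_type & 0x7F)
    return struct.pack("!BBHII", byte0, byte1,
                       seq & 0xFFFF,            # packet sequence number
                       timestamp & 0xFFFFFFFF,  # sampling-clock timestamp
                       ssrc & 0xFFFFFFFF)       # identifier of the media source


def parse_rtp_header(packet):
    """Unpack the fixed RTP header from the start of a received packet."""
    byte0, byte1, seq, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    return {
        "version": byte0 >> 6,
        "marker": bool(byte1 >> 7),
        "payload_type": byte1 & 0x7F,
        "seq": seq,
        "timestamp": timestamp,
        "ssrc": ssrc,
    }
```

A receiver would use the "seq" field to restore packet order and the "timestamp" field to schedule playback.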

VoIP media packets are transported across the intervening IP network like other data traffic. Reaching the far end, they are re-sequenced, decoded and their payloads played back, reproducing the input audio signal.

Digitization and Encoding

To transmit voice across an IP network, you start with an analog audio signal -- a pattern of frequencies initially propagated in air, transduced by microphone into an analogous pattern of electrical voltage fluctuations. This signal is converted to a sequence of (usually eight-bit) binary numbers by an analog-to-digital converter circuit, which samples line voltage repeatedly, at a high rate of speed (usually 8,000 times per second) as it changes over time. The result is typically (though not always) equivalent to a standard telecom 64kbps DS0 signal stream -- the same sort of input signal used as the "voice channel quantum" (the end-to-end switched bandwidth reserved for a phone call) in conventional digital telephony.
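The arithmetic behind the 64kbps figure can be sketched in a few lines. This is a simplified illustration: it uses plain linear quantization of a voltage normalized to the range -1.0 to 1.0, whereas G.711 itself uses logarithmic companding, and the function names are made up for the example.

```python
import math

SAMPLE_RATE_HZ = 8000   # samples per second
BITS_PER_SAMPLE = 8     # bits per sample


def stream_bitrate_bps(sample_rate_hz=SAMPLE_RATE_HZ,
                       bits_per_sample=BITS_PER_SAMPLE):
    """Bitrate of the uncompressed digitized stream."""
    return sample_rate_hz * bits_per_sample


def quantize(voltage, bits=BITS_PER_SAMPLE):
    """Map a voltage in [-1.0, 1.0] to an unsigned integer code.

    Linear quantization is an assumption made for simplicity.
    """
    levels = 2 ** bits
    clipped = max(-1.0, min(1.0, voltage))
    return int((clipped + 1.0) / 2.0 * (levels - 1) + 0.5)


def sample_tone(freq_hz, duration_s, sample_rate_hz=SAMPLE_RATE_HZ):
    """Sample and quantize a pure tone, standing in for the microphone signal."""
    count = round(duration_s * sample_rate_hz)
    return [quantize(math.sin(2 * math.pi * freq_hz * i / sample_rate_hz))
            for i in range(count)]
```

Here stream_bitrate_bps() gives 8,000 x 8 = 64,000 bits per second, and sample_tone(440, 0.02) yields the 160 one-byte samples that make up 20 milliseconds of audio.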

Next, the digitized input signal is mathematically pre-processed -- filtering background noise and emphasizing speech band information. When a speakerphone is engaged (or when using IP audioconferencing gear), additional audio pre-processing stages may analyze and cancel room-reverberation, speaker-to-mic echo, and other undesirable signal components.

It's also common, at this point, to identify and mark periods of silence for later deletion from the transmission bitstream. VoIP systems frequently save bandwidth (up to 40%, all other things being equal) simply by not transmitting any data when a speaker isn't talking. The technique -- called 'silence suppression' -- is made tolerable to the listener by sampling and regenerating normal background noise locally, at the far end of the call, a strategy called 'comfort noise regeneration.'
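A minimal sketch of silence suppression, assuming a simple fixed energy threshold (real voice-activity detectors are considerably more elaborate, and the names and threshold here are illustrative only):

```python
def frame_energy(frame):
    """Mean squared amplitude of one frame of signed audio samples."""
    return sum(sample * sample for sample in frame) / len(frame)


def suppress_silence(frames, threshold):
    """Return (index, frame) pairs for the frames loud enough to transmit.

    Frames below the threshold are treated as silence and not sent; the
    far end would fill those gaps with locally generated comfort noise.
    """
    return [(index, frame) for index, frame in enumerate(frames)
            if frame_energy(frame) >= threshold]


def fraction_saved(frames, threshold):
    """Share of frames (and so, roughly, of bandwidth) not transmitted."""
    kept = len(suppress_silence(frames, threshold))
    return 1.0 - kept / len(frames)
```

Keeping the original frame index alongside each transmitted frame lets the receiver know where the silent gaps fall.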

Another important pre-processing step may involve identifying tone-based inband signals (e.g., DTMF/MF or "touchtones") in the input stream (or capturing keypresses on the endpoint device), and replacing these with short codes for out-of-band transmission (and far-end regeneration). This saves bandwidth while ensuring that legacy signals are transmitted reliably across the IP network (which -- in a hybrid network -- may constitute only part of the end-to-end call path).
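One widely used scheme for such short codes is the RTP "telephone-event" payload defined in RFC 4733; the text above does not name a particular scheme, so treat this choice as an assumption. The sketch below encodes a keypress as the four-byte event payload (event code, end flag, volume, duration); the function name and default volume are illustrative.

```python
import struct

# DTMF event codes as assigned in RFC 4733: digits 0-9, then *, #, A-D.
DTMF_EVENT_CODES = {str(digit): digit for digit in range(10)}
DTMF_EVENT_CODES.update({"*": 10, "#": 11, "A": 12, "B": 13, "C": 14, "D": 15})


def encode_dtmf_event(key, duration, volume=10, end=False):
    """Encode one keypress as a 4-byte telephone-event payload.

    duration is in RTP timestamp units (800 units = 100 ms at 8,000 Hz);
    volume is the tone level in -dBm0 (0 to 63); end marks the final
    packet of the event.
    """
    event = DTMF_EVENT_CODES[key]
    flags_and_volume = (int(end) << 7) | (volume & 0x3F)  # E bit, R=0, volume
    return struct.pack("!BBH", event, flags_and_volume, duration & 0xFFFF)
```

Four bytes per event packet is far smaller than the encoded audio needed to carry the tone itself, and the far end can regenerate a clean tone from the code.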

The pre-processed input signal is then divided into speech frames (short chunks) and encoded -- transformed mathematically into a sequence of packet payloads, with optional compression. VoIP codecs (COder/DECoder) come in various standard and proprietary flavors, engineered for different bitrates and overall connection conditions. Where network availability is highly variable, VoIP endpoint devices can renegotiate which codec to use in mid-call, though such renegotiation may be noticeable to call participants.

In the past (and still today, in special cases), VoIP solutions tended to emphasize compression, positing VoIP as a solution for multiline telephony across bandwidth-impoverished network segments and access loops. For example, the ITU G.729a codec uses only 8kbps of bandwidth, and the G.723.1 codec can communicate intelligible speech at rates down to about 5.3kbps (on codec bitrate alone, before packet-header overhead is counted, imagine a single T-1 carrying well over a hundred simultaneous conversations, instead of the standard 24). Extreme compression, however, adds latency (called 'algorithmic delay') which must be endured by listeners, or offset by increased processing power, adding a cost premium to VoIP equipment.

The advent of broadband and standard 100BaseT Ethernet, moreover, has made compression less of an issue in many VoIP implementations. So today, the focus has shifted: most VoIP calls employ a standard G.711 codec, which requires 64kbps of bandwidth, but incurs almost no algorithmic delay. In fact, where bandwidth is lavishly available, VoIP engineers are figuring out ways to use even more of it, providing a high-fidelity audio experience. The VoIP codec maker GIPS uses broad-spectrum digitization -- their codec is used by the Skype peer-to-peer VoIP client, among many others. Polycom, a maker of high-end VoIP-based audioconferencing gear, is also experimenting with "high-fi telephony." DiamondWare, another maker of VoIP audioconferencing software, uses multiple microphones and input signal-processing to achieve stereo-imaging and 'holophonic' effects (where the audio signal conveys the position of the speaker in a virtual '3D' space).

The audio-digitizing, signal processing and coding/decoding required for VoIP can be performed cost-effectively in several ways, depending on context. Today's wireline and wireless IP telephones employ monolithic "system on chip" components, capable of A/D conversion, signal processing, encoding, TCP/IP communications, keypad and display management, and sometimes system power management as well. Manufacturers of high-density VoIP gateways, which translate conventional analog or digital phone calls into VoIP packets and associated signaling and vice-versa, tend to use powerful, multi-core DSPs (digital signal processors) - putting dozens of these on a single cPCI board (Compact PCI - a PC board standard supporting "hot swap" for easy maintenance and reduced downtime) and managing 672 or more VoIP calls in a single chassis slot - thousands of calls on a single 6U rackmount system (or tens of thousands, when chassis are linked by "packet backplanes" like StarFabric and InfiniBand).

Today's standard PC CPUs (Pentium 66MHz or better) and the ARM RISC CPUs used in Blackberrys and WindowsCE palmtops can also muster up the horsepower to manage VoIP when computers are linked to wired or wireless LANs. This has encouraged programmers to create a plethora of "software only" VoIP solutions - "VoIP softphones" - some of which extend the functionality of other applications, such as Instant Messaging. Both Windows Messenger and AOL Instant Messenger (AIM) offer VoIP calling functionality. The free, proprietary, peer-to-peer VoIP client, Skype, offers a similar blend of IM and VoIP capabilities.

Basic VoIP Transmission

Ideally, a digital telephone call should proceed as if it were taking place across an unbroken wire between two endpoint devices. The signal-stream (bitstream) originating at one end should traverse the wire instantly (i.e., with zero latency) and isochronously - the time-spacing of bits on receipt should match their time-spacing on transmission. In a real-world digital telephone network, these rules are stretched - but not by very much, because end-to-end clock-synchronized TDM timeslots are dedicated to the call for its duration. The result is "a relatively close simulation of the behavior of a wire."

Sending a sequence of VoIP packets across a conventional IP network is very different. In a classic IP network, resources are not reserved to support point-to-point connections; instead, packets are routed opportunistically across the network, in response to local traffic conditions. VoIP packets in transit may be forced to traverse many routers (i.e., "take many hops") before arriving at their destination - each router-hop adds to end-to-end delay, called 'latency.' Packets may run up against congestion, and be forced to wait in router-buffers before they can be forwarded - causing additional latency. In some cases, packets may be dropped entirely, or may be routed around failure conditions over routes different from those traversed by earlier packets in the same sequence.

Depending on network and traffic conditions, therefore, surviving packets may arrive at their destination damaged, out of sequence, delayed, and at widely-varying time-intervals (jitter). Non-realtime IP applications deal with these issues by using TCP - the higher-level "Transmission Control Protocol" - which manages packet sequencing, signals retransmission requests, and otherwise ensures the integrity of data from source to destination. But TCP's approach to connection management is inappropriate for VoIP, because VoIP needs to preserve isochronicity between sender and receiver. Requesting retransmission of a dropped VoIP packet requires two network traversals (one to signal, one to retransmit). Such delays would cause unacceptable "breaks" in playback, or force the receiver to buffer large amounts of data against the possibility of retransmission, prior to commencing playback, thereby abandoning any pretension to "realtime" performance.

Instead, to approximate isochronicity, VoIP systems usually transmit an evenly-paced stream of short, same-length packets, each carrying a short slice of audio (typically 10 to 30 milliseconds). Short, same-length packets facilitate buffering and re-sequencing of received packets so they play back in proper order - buffers must be kept as short as possible (depending on network conditions), so that playback can be more or less "immediate." If packets are lost in transmission, they're simply skipped: VoIP decoders use "smoothing" techniques to make dropouts less obvious.
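The buffering and re-sequencing step might look something like the following sketch. It is deliberately simplified: the class name is hypothetical, sequence-number wraparound is ignored, and a real receiver would also delay the start of playback by a few packet intervals and apply loss concealment where this sketch merely reports a gap.

```python
class PlayoutBuffer:
    """Re-sequence received packets and skip over any that never arrive."""

    def __init__(self):
        self.pending = {}      # sequence number -> audio payload
        self.next_seq = None   # sequence number due for playback next

    def push(self, seq, payload):
        """Store a packet as it arrives from the network, in any order."""
        if self.next_seq is None:
            self.next_seq = seq
        if seq >= self.next_seq:       # packets arriving too late are dropped
            self.pending[seq] = payload

    def pop(self):
        """Called once per packet interval by the playback clock.

        Returns the payload due now, or None if that packet was lost, in
        which case the decoder would smooth over the dropout.
        """
        if self.next_seq is None:
            return None
        payload = self.pending.pop(self.next_seq, None)
        self.next_seq += 1
        return payload
```

For example, if packets 1, 3, 2 and 5 arrive in that order, successive calls to pop() return the payloads of 1, 2 and 3, then None for the missing packet 4, then the payload of 5; a copy of packet 4 that turns up afterwards is discarded rather than retransmitted or replayed.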

The need to send many short packets also militates against the use of TCP, whose longer packet headers would consume additional bandwidth on every packet. Instead, the engineers of VoIP systems adapted by combining three simpler protocols to manage transmission, sequencing, and synchronization with reduced overhead. These protocols are:

  • IP - Internet Protocol - Header is used to establish datagram size and format, type of service (TOS), and IP addresses of sending and receiving devices.
  • UDP - User Datagram Protocol - Header is used to denote which connection "ports" are being used by the sending and receiving applications.
  • RTP - Real-time Transport Protocol - Header is used to store packet sequence, timestamp and related synchronization and ordering information.

Each audio payload in a VoIP transmission is preceded by IP, UDP, and RTP headers.
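Because every short packet carries all three headers, the bandwidth a call consumes on the IP network is noticeably higher than the codec's nominal bitrate. The sketch below works through the arithmetic, assuming IPv4 (20-byte header), UDP (8 bytes), the fixed RTP header (12 bytes), and a 20-millisecond packet interval; the interval is an assumption, and link-layer framing such as Ethernet is not counted.

```python
IPV4_HEADER_BYTES = 20
UDP_HEADER_BYTES = 8
RTP_HEADER_BYTES = 12
HEADER_BYTES = IPV4_HEADER_BYTES + UDP_HEADER_BYTES + RTP_HEADER_BYTES  # 40


def payload_bytes(codec_bps, packet_ms):
    """Bytes of encoded audio carried in each packet."""
    return codec_bps * packet_ms / 1000 / 8


def ip_bandwidth_bps(codec_bps, packet_ms=20):
    """One-way IP bandwidth of a call, including IP/UDP/RTP headers."""
    packets_per_second = 1000 / packet_ms
    packet_size = payload_bytes(codec_bps, packet_ms) + HEADER_BYTES
    return packet_size * 8 * packets_per_second


def header_share(codec_bps, packet_ms=20):
    """Fraction of each packet taken up by headers."""
    return HEADER_BYTES / (payload_bytes(codec_bps, packet_ms) + HEADER_BYTES)
```

Under these assumptions a 64kbps G.711 stream occupies 80kbps on the wire (160 payload bytes plus 40 header bytes, 50 times per second), while an 8kbps G.729a stream occupies 24kbps, with headers making up two thirds of every packet.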

Quality of Service

VoIP systems have limited means for coping with network performance issues that cause excess packet loss, latency and jitter. Fast codecs (and low or no compression) can help, by reducing time spent encoding and decoding data for playback (thus reducing end-to-end latency). Adaptive buffering can offset some network performance issues - this is sometimes combined with use of RTCP (RTP Control Protocol) for ongoing assessment of network conditions. Inband-signal-transcoding and similar techniques play a supporting role. Still, it should come as no surprise that VoIP calls can sound terrible when sent across unmanaged, narrowband LANs, bandwidth-limited access links, or the open Internet.
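As one concrete example of such assessment, RTCP receiver reports carry an interarrival jitter estimate that RFC 3550 defines as a running average of how much packet spacing on arrival differs from packet spacing on transmission. A minimal sketch, with a hypothetical function name and all times expressed in the same units (milliseconds here):

```python
def interarrival_jitter(send_times, recv_times):
    """Running jitter estimate in the style of RFC 3550.

    For each consecutive pair of packets, D is the change in spacing
    between transmission and arrival; the estimate moves 1/16 of the
    way toward |D| with every packet.
    """
    jitter = 0.0
    for i in range(1, len(send_times)):
        send_gap = send_times[i] - send_times[i - 1]
        recv_gap = recv_times[i] - recv_times[i - 1]
        d = recv_gap - send_gap
        jitter += (abs(d) - jitter) / 16.0
    return jitter
```

A stream that arrives with exactly the spacing it was sent with yields an estimate of zero, whatever its absolute delay; an adaptive buffer could grow or shrink in response to changes in this value.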

The only real solution is to improve the network. Over the past several years, network service providers have explored various means to help enable high quality of service for realtime applications. The most obvious (and simplest) of these strategies is overprovisioning: building out the network so that its bandwidth capacity and node throughput vastly exceed those required by voice and other data traffic.

The reasoning is simple: a greatly-overprovisioned network shouldn't experience much congestion. This remains an attractive idea, particularly during an economic cycle where bandwidth (in the form of 'dark' or unused fiber) is plentiful and cheap.

The problem is that overprovisioning doesn't work. Or rather, it works at first - in the lab, and in the field as well - so long as the network is simple, traffic is consistent, and enters from only a few locations. But as customers are added, the network grows more complex, and traffic enters the network at more points and becomes more diverse, demanding, and bursty, overprovisioning starts hitting the wall. Without a scheme for prioritizing the forwarding of realtime packets over other traffic, even a bandwidth-rich, closely-managed network won't be able to minimize latency and jitter on VoIP packet-flows.

The best carrier class voice transmissions occur over a network backbone built on MPLS (Multiprotocol Label Switching) - a label-switching standard that forwards packets by short labels rather than by IP address lookups, offers many levels of prioritized forwarding for designated traffic flows, and lets us monitor each hop's contribution to overall packet loss, latency and jitter to ensure compliance with service level agreements.

MPLS also permits sophisticated traffic engineering, and our network architects use a range of simulation and modeling techniques to predict the impact of changing traffic conditions, equipment and link outages and other variables on network performance. This helps us stay well ahead of the curve, ensuring that QoS for realtime traffic is maintained, while also making the most economical and efficient use of network resources to reach the quality of service, reliability and other goals our customers demand. We presently offer three classes of service, including a "priority" class for realtime traffic such as VoIP and IP video. Our network's baseline performance, meanwhile, generally exceeds that demanded by even our most-stringent SLAs - for example, our priority SLA dictates that jitter will always be less than 15msecs, but actual jitter experienced by the network, edge-to-edge, is less than 5msecs.

The result? VoIP packets are identified at our network edge, by the TOS (Type Of Service) bits in the outer, IP packet header. They receive priority forwarding at each hop across the network core. And they arrive at the destination edge with minimal cumulative latency (less than 150msecs roundtrip latency in traversing our network's longest edge-to-edge pathway - from Asia to South America and back again) and infinitesimal (less than 5msecs) jitter.
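On the sending side, an application can request this treatment by marking its own packets. The sketch below sets the former TOS byte on a UDP socket using the standard IP_TOS socket option. The choice of the Expedited Forwarding code point (DSCP 46) is an assumption, as is the function naming; which markings a given network honors depends on the operator's policy, and some operating systems ignore or restrict this option.

```python
import socket

DSCP_EF = 46  # Expedited Forwarding code point (assumed marking for voice)


def dscp_to_tos(dscp):
    """The DSCP occupies the upper six bits of the former TOS byte."""
    return (dscp & 0x3F) << 2


def open_marked_udp_socket(dscp=DSCP_EF):
    """Open a UDP socket whose outgoing packets carry the given DSCP."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, dscp_to_tos(dscp))
    return sock
```

With DSCP 46, the byte written into each IP header is 0xB8, which is the value an edge router would match when classifying the packet as priority traffic.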

Unless QoS problems are introduced elsewhere in the call path (e.g., on the user's own LAN), voice calls traveling across our backbone are indistinguishable from the best conventional long distance phone service. Global Crossing has turned this spectacular backbone network performance to advantage in all our services that transport VoIP.

The VoIP Gateway

Basic VoIP can work as a peer-to-peer application between IP endpoints. But additional facilities are required to make VoIP useful in the real world -- particularly in a real world still dominated by the legacy public switched phone network, and by huge investments in legacy business telephone equipment.

To interface with such networks and equipment, a media gateway (often called simply a 'gateway') is required. The gateway has a data network interface on one side, and PSTN interfaces on the other, and its job is to translate media and signaling (i.e., "calls") between the two networks.

Gateways are versatile -- scale and specific properties are application-dependent. Single-port gateways, called 'terminal adaptors,' can interface a single conventional telephone or fax machine to VoIP services on a broadband IP network. A legacy PBX can use a multi-port gateway as an IP peripheral, letting it originate and terminate VoIP traffic, or connecting it to VoIP telephones on a LAN. In legacy "toll bypass" applications, gateways at either end of an IP WAN link carry telephone traffic across the link, connecting a PBX at one enterprise facility with stations -- or another PBX -- at another office.

IP PBXs, of course, also use gateways -- to interface with PSTN trunks and analog or digital stations (IP stations, of course, just plug into the LAN). The brains of the PBX (or 'softswitch,' discussed below) and the gateway are logically-discrete components, and can in principle scale separately. A single IP softswitch can interoperate with one or several gateways, deployed in various arrangements across the enterprise WAN, and/or in the carrier network. Configurations can be created that permit "fallback" from one gateway to another at different locations -- ensuring that PSTN access is always available. Alternatively, gateway services can be aggregated at one location, to take advantage of service provider discounting or trunking/service availability in a given location.

VoIP Deep Dive Part II
