Engineering IP Telephony

Update: Microsoft has sent out a clarification that the only thing they’ve centralized is handshake negotiation, which is a good thing, but doesn’t include creating any of the features I’ve listed below. Consider this article to be about how you could do things better.

Skype (aka Microsoft) decided to not have their peer directory information be run on untrusted peers, citing ‘security’. One might wonder how and if this improves security. The most immediate security benefit is that it keeps peers from being given the entire peer directory. Proxying all connections would go further and keep peers from accepting direct connections in from untrusted peers, which can possibly send exploits. Applications written in real languages don’t have such problems, but anything which does audio decoding is going to have to have a significant C component. On the flip side, compressed audio isn’t sanitized or reencoded by proxies, so exploits are still possible, but then again the server can easily add checks for new exploits, sanitize them, and use them to develop an MO for spammers. Of course, the central server, by having the ability to view everything, can do a much better job of stopping non-security-related spam as well. Big brother protects you from spammers. It’s a fact of life.

The deeper advantage of going through a central proxy is that it avoids giving away IP addresses. If an attacker who runs a countrywide firewall wants to know who its dissidents are, it can place a call to a journalist known to be talking to dissidents to find out the journalist’s IP address, then record all IP addresses within the country forming direct connections to that address, and have a very short list of potential dissidents. While this attack is extremely crude, it’s about what most countries are able to pull off given their level of sophistication and the huge amount of traffic they have to deal with, and idiot journalists have gotten people killed this way. Thanks, assholes. By proxying everything through their servers, Microsoft has basically set up a big single-hop anonymizer, which at least stops the very crude attack we’re worried about right now.

There are potential engineering advantages to going through proxies as well, although it’s doubtful that Microsoft is doing them yet. The engineering problem which completely dominates telephony is latency. Light in a fiber optic cable takes about 200 milliseconds to travel the earth’s circumference. By coincidence the generally accepted maximum acceptable round trip time for an audio call is the not much larger value of 330 milliseconds, which makes engineering a system which can handle phone calls from Argentina to Japan inherently difficult.
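As a rough check on that figure, here is a small sketch of the arithmetic. The refractive index of 1.47 is an assumed typical value for silica fiber, and the cable is imagined as following the equator exactly, which no real route does.

```python
# Rough check: how long does light in fiber take to circle the earth?
SPEED_OF_LIGHT_KM_S = 299_792.458   # in vacuum
FIBER_REFRACTIVE_INDEX = 1.47       # assumption: typical silica fiber
EARTH_CIRCUMFERENCE_KM = 40_075     # equatorial

def fiber_delay_ms(distance_km, refractive_index=FIBER_REFRACTIVE_INDEX):
    """One-way propagation delay over `distance_km` of fiber, in milliseconds."""
    speed_km_s = SPEED_OF_LIGHT_KM_S / refractive_index
    return distance_km / speed_km_s * 1000

around_the_world_ms = fiber_delay_ms(EARTH_CIRCUMFERENCE_KM)
print(f"{around_the_world_ms:.0f} ms")  # a little under 200 ms
```

Real routes are longer than the great circle and add switching delay, so this is a floor rather than an estimate.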

Despite the behavior of the telco industry, a 330ms round trip time is not a gold standard; it’s the absolute maximum above which people find it completely unacceptable. The telephony industry continues to pretend that it’s a gold standard anyway, letting call latencies get higher and higher with not a single company engineering a low-latency solution and bragging about it in their marketing materials, displaying an appalling lack of pride in their work, engineering prowess, and marketing acumen.

In any case, for the sake of argument I’m going to say that 330 ms is an ‘acceptable’ round trip time and 150 ms is a ‘good’ round trip time. Yes I know that for gaming anything above 100ms starts to suck, but people are more forgiving of audio calls.

Whenever you send data over the internet, the latencies add up as follows:

[box type="note"]congestion control delay -> upload queueing delay -> upload serialization delay -> propagation delay -> download serialization delay -> download queueing delay[/box]

In the case of audio, data is always sent as soon as it’s generated. Reducing the bitrate is an extreme measure which is only done rarely, and then it’s to another basically fixed bitrate, so there isn’t any of the sort of dynamic congestion control which TCP has, and no point in waiting to send anything. Congestion control delay is therefore effectively zero.

Upload queueing delay is hopefully usually zero. Unfortunately many routers have queues whose length can be measured in seconds, which TCP will happily completely fill continuously. Obviously if there’s any serious TCP-based cross traffic your audio calls are basically hosed, so we’ll just assume that there isn’t any. What size upload queues should be is a controversial topic which I’ll have a lot to say about at another time. For now I’ll just say that getting them below amounts which can potentially mess up audio calls is completely unworkable at current broadband connection rates, and anything over 500ms is completely broken.

Upload serialization delay is typically just a few milliseconds and not a major issue.
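Both upload delays come down to the same arithmetic: bytes waiting divided by the rate of the link. The sketch below uses made-up but plausible numbers, all of which are assumptions rather than measurements: a 1 Mbit/s uplink, a 200-byte audio packet (160 bytes of audio plus 40 bytes of IP, UDP, and RTP headers), and a 256 KiB router buffer kept full by TCP.

```python
def transmit_time_ms(num_bytes, link_bits_per_second):
    """Time to push `num_bytes` onto a link of the given rate, in milliseconds."""
    return num_bytes * 8 / link_bits_per_second * 1000

UPLINK_BPS = 1_000_000            # assumption: 1 Mbit/s consumer uplink
AUDIO_PACKET_BYTES = 200          # assumption: 160 bytes audio + 40 bytes headers
ROUTER_QUEUE_BYTES = 256 * 1024   # assumption: 256 KiB upload buffer, kept full

# Serialization delay: time to clock one audio packet onto the wire.
serialization_ms = transmit_time_ms(AUDIO_PACKET_BYTES, UPLINK_BPS)

# Queueing delay: time for everything ahead of the audio packet to drain.
full_queue_ms = transmit_time_ms(ROUTER_QUEUE_BYTES, UPLINK_BPS)

print(f"serialization: {serialization_ms:.1f} ms")
print(f"full queue:    {full_queue_ms / 1000:.1f} s")
```

With these numbers the packet itself costs a couple of milliseconds, while a full buffer costs about two seconds, which is why the queue and not the packet size is the thing to worry about.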

Propagation delay is dominated by the aforementioned speed of light problem. As a reasonable example, the one way latency between San Francisco and Amsterdam can’t be below 40ms just because of the laws of physics, and if you can get 80ms from one home net connection in the US to another one in Europe you’re doing very well. (If anyone can point out maps showing latencies between different parts of the internet which don’t suck I’d appreciate it. Most of the easy to find ones are focused on latencies between different ISPs in the same location, although some of the latencies they report make me want to cry.)
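The physical floor for that example can be estimated from the great-circle distance. This is only a sketch: the city coordinates are approximate, the fiber refractive index of 1.47 is an assumed typical value, and real cables never follow the great circle.

```python
import math

EARTH_RADIUS_KM = 6371.0                # mean radius
FIBER_SPEED_KM_S = 299_792.458 / 1.47   # assumption: refractive index of 1.47

def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

SAN_FRANCISCO = (37.77, -122.42)   # approximate city-centre coordinates
AMSTERDAM = (52.37, 4.90)

distance_km = great_circle_km(*SAN_FRANCISCO, *AMSTERDAM)
min_one_way_ms = distance_km / FIBER_SPEED_KM_S * 1000
print(f"{distance_km:.0f} km, at least {min_one_way_ms:.0f} ms one way")
```

Under these assumptions the distance comes out somewhat under 9,000 km and the floor a bit over 40ms, in line with the figure above.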

Download serialization delay is generally just a few milliseconds and not a major issue.

Download queueing delay generally can’t be caused by uploads from a peer net connection, because consumer download rates are much higher than upload rates, but if the downlink is saturated by TCP-based cross traffic you’re hosed, for the same reasons as with upload queueing delay.

That covers the basics of how the internet works from the standpoint of a layer 5 protocol developer. My apologies to people who work at the lower layers.

Where this becomes interesting for internet telephony is when it interacts with packet loss. On well behaved net connections the only source of packet loss should be packets getting dropped because queues are full, at which point phone calls are already hosed because of latency. Unfortunately lots of people have cross traffic on their wireless, or move their laptop too far from their base station, or have a shitty ISP, and have a lot of non-congestive packet loss. For the sake of argument, I’ll assume that this is a problem worth solving.

When data is lost, the only things you can do are to either have a skip, or send a rerequest. If you have a policy of sending rerequests, then you have to delay all traffic by the worst delay you can incur with a rerequest, because dynamically changing the latency produces a sucky end user experience. Let’s say that we have a simple direct connection which looks like this:

Alice ---- 80ms ---- Bob

In this case the direct round trip time will be 160ms, which is good. The speed of sound is about one foot per millisecond, so this will be about equivalent to yelling at someone 80 feet away from you. Unfortunately if you add in rerequests, you have to wait for a packet to (not) get there, a rerequest to get sent back, and a replacement packet to arrive, for a total of 240ms each way, or 480ms round trip, which is totally unacceptable. Let’s say that we try to improve this as follows:

Alice ---- 40ms ---- proxy server ---- 40 ms ---- Bob

If we’re now willing to assume that the packet loss rate is low enough that a single packet might need a rerequest on one end or the other but not both, our resends will add 80ms each way, for a total round trip time of 320ms, which is barely acceptable. Unfortunately that isn’t a terribly solid improvement, and it’s hard to colocate servers in the exact middle of the Atlantic or Pacific oceans, but it’s possible to do better:

Alice ---- 10ms ---- proxy server ---- 60ms ---- proxy server ---- 10ms ---- Bob

Now things look much better. Even with resends on both ends, the one way delay has only gone up to 120ms, for a total round trip time of 240ms, or 200ms if we want to be just a little aggressive, which is pretty good. And even better, the amount of latency this technique adds doesn’t increase as the distance between the peers increases, so long as there are proxy servers close to either end.
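The numbers for all three topologies can be reproduced with a small model. This is a sketch of the accounting used above, not of any real implementation: a resend on a hop is charged one rerequest back plus one replacement forward, loss-detection timers are ignored, and in the two-proxy case the server-to-server hop is assumed not to need resends. The function and parameter names are made up for illustration.

```python
def one_way_delay_ms(hops_ms, lossy_hops, max_resends):
    """Fixed one-way delay needed to absorb rerequests.

    hops_ms     -- one-way latency of each hop along the path
    lossy_hops  -- indexes of hops where packets may be lost
    max_resends -- how many of those hops we budget a resend for per packet
    """
    base = sum(hops_ms)
    # A resend on a hop costs a rerequest back plus a replacement forward.
    resend_costs = sorted((2 * hops_ms[i] for i in lossy_hops), reverse=True)
    # Budget for the most expensive resends first, since the delay is fixed.
    return base + sum(resend_costs[:max_resends])

def round_trip_ms(hops_ms, lossy_hops, max_resends):
    """Round trip time, assuming the path is symmetric."""
    return 2 * one_way_delay_ms(hops_ms, lossy_hops, max_resends)

direct       = round_trip_ms([80], [0], 1)              # 480
one_proxy    = round_trip_ms([40, 40], [0, 1], 1)       # 320
two_proxies  = round_trip_ms([10, 60, 10], [0, 2], 2)   # 240
aggressive   = round_trip_ms([10, 60, 10], [0, 2], 1)   # 200

# Added latency depends only on the edge hops, not on the long middle hop.
added_near = one_way_delay_ms([10, 60, 10], [0, 2], 2) - sum([10, 60, 10])
added_far  = one_way_delay_ms([10, 150, 10], [0, 2], 2) - sum([10, 150, 10])
print(direct, one_proxy, two_proxies, aggressive, added_near, added_far)
```

The last two values are both 40ms, which is the point of the final sentence above: stretching the middle hop from 60ms to 150ms leaves the cost of resends unchanged.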

14 thoughts on “Engineering IP Telephony”

  1. Arioch

    The article does not state that ‘proxying for all connections’ is made, but that “presence and discovery” was.
    I doubt even Microsoft has traffic capabilities enough to channel all video/audio chats. It’s like proxying all p2p file exchanges out there.

    While MS could in principle protect me from spam, reality differs. Skype by Skype did not annoy me. Skype by MS suddenly started playing repetitive tunes. It was a while until I found it was a video advertisement that the minimized Skype had decided to run and loop.

    It also would help to kill anonymity of BitTorrent users, count it good or bad.

    1. bramcohen

      It’s my understanding that they’re proxying everything, but the article doesn’t make it clear and any clarification from a reliable source of information would be welcome.

  2. Josh

    Current Skype “calls do not pass through supernodes” but the idea of a connection caching proxy is interesting and could work for all sorts of TCP connections.

    1. bramcohen

      Yes, the same technique is applicable to many other things, although for audio I’m specifically not talking about TCP’s congestion control, which throws all sorts of other craziness into the mix, most of it bad.

  3. Osvaldo Doederlein

    “The speed of sound is about one foot per second” – I guess you meant …about one thousand feet per second (or to be exact, ~1126)

  4. Marcus Brito

    And then things get great when you need to connect to someone close to you, but there isn’t a proxy server close by. This is likely to happen to a lot of people using Skype outside the usual North America or Europe datacenters reach.

    So now instead of

    Alice ---- Bob

    you get

    Alice ---- proxy ---- Bob

    That will be fun.

      1. Marcus Brito

        There are other problems with a proxy architecture. Due to its P2P nature, Skype has traditionally been very hard to block completely. Skype tries to connect on multiple ports, making port-based filtering impossible, and the fully encrypted traffic makes packet inspection impractical.

        Now, with all supernodes hosted in relatively few datacenters (and hence address ranges), the “attacker” running a countrywide firewall can easily block or filter all traffic to these supernodes.

        1. bramcohen

          Yes, it’s easier to block, although in practice surveillance is a bigger problem than blocking, and the traffic will be kept encrypted regardless. I’ll have other things to say about firewall evasion in the future.

  5. Kevin Marks

    “In the case of audio, data is always sent as soon as it’s generated” sadly isn’t true, which is another reason mobile phone call latency is so bad. There is a buffer size issue in capturing audio (you have to get enough samples to fill a packet), and there are also frame size issues in audio codecs, if you’re compressing it. These both add static overhead at each end. If you’re using some kind of ‘clever’ forward error concealment codec, it can hold several audio frames to interleave them.
    On the receive end you need a decode buffer the size of the frame, and a playout buffer set by the OS and audio hardware you’re using.

  6. spinkham

    Yes, this is all about end user security, and I’m sure centralization has *nothing* to do with the FBI’s requested wiretapping capabilities. Uh huh.

    1. Developer Dude

      Uh yes.

      Today’s announcement in that regard gives a whole different perspective on why MS went to centralized nodes.

