TCP Sucks

Uncategorized 7 May 2012 | 37 Comments

Nick Weaver gives me a shout-out.

there have been a couple of cases where the application vendors concluded they were causing too much damage and therefore started making changes. BitTorrent is the classic example. It is shifting to delay-based congestion control specifically to: (a) be friendlier to TCP because most of the data carried by BitTorrent really is lower-priority stuff; and (b) mitigate the “you can’t run BitTorrent and Warcraft at the same time” problem. So, there’s some hope.

It’s true. We occasionally take a break from drinking moonshine and shooting beer bottles to do real engineering.

Of course, I’ve always used TCP using exactly the API it provides, and even before I understood how TCP worked under the hood I went to great pains to keep the number of TCP connections down to the minimum which will reliably saturate the net connection and provide good piece diffusion. If TCP doesn’t handle that well, it isn’t my fault.

Now the intelligentsia have a plan, called RED, for how to fix the internet, because uTP (using the LEDBAT congestion control algorithm), coming from the likes of me, can’t be the real solution. (By the way, I’d like to thank Stanislav Shalunov for being the real brains behind uTP.) I don’t believe RED is a good idea, for several reasons.

First is it’s just plain unproven. It’s been years since RED was proposed, and to date no one’s come up with something where they can say ‘go ahead and deploy this, it’s mature’ with a straight face. Given that very smart people have worked on this, it stands to reason that the problems are just plain hard.

Second is it ain’t gonna happen. Deploying RED involves upgrading routers. To rely on it requires upgrading the entire infrastructure of the internet. The marketing plan is that because router vendors are unwilling to say ‘has less memory!’ as a marketing tactic, maybe they’d be willing to say ‘drops more packets!’ instead. That seems implausible.

Finally, RED is in an apples-to-apples sense a much cruder technique than uTP. With classic internet routing, a router will either pass along a packet immediately if it can, or add it to a queue to send later if it can’t. If the queue has become full, it drops the packet. With RED the router will instead have some probability of dropping the packet based on the size of the queue, going up to 100% if the queue is full. (Yes I know there are other schemes where packets already on the queue are dropped, I’m going to view all those things as variants on the same basic principle.) Since TCP only uses dropped packets as a signal to back off, this uses early packet dropping as a way of giving some information to TCP stacks that they need to back off before the queue gets full. The only information in use here is the size of the queue and the size of the buffer, with the size of the buffer becoming increasingly irrelevant due to buffer bloat, making its value be essentially ‘far too big’. RED makes dropped packets convey a little more meaning by having statistical gradations instead of a full/not full signal. uTP by contrast uses one way delays to get information for when to back off, which allows it to get very precise information about the size of the queue with every packet, with no packet loss happening under normal circumstances. That’s simply more information. You could in fact implement a ‘pretend the router’s using RED’ algorithm for uTP, with no router upgrades necessary.
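To make the comparison concrete, here is a minimal sketch of the two router policies described above: classic tail drop versus a RED-style probabilistic drop. It is deliberately simplified. Real RED variants work on an averaged queue length and cap the probability below 100% until a maximum threshold; the `min_fraction` parameter and the function names here are assumptions made for illustration.

```python
def tail_drop_probability(queue_len: int, capacity: int) -> float:
    """Classic behavior: drop only when the queue is completely full."""
    return 1.0 if queue_len >= capacity else 0.0


def red_drop_probability(queue_len: int, capacity: int,
                         min_fraction: float = 0.25) -> float:
    """Simplified RED-style drop probability for an arriving packet.

    Below min_fraction of capacity (an assumed threshold) nothing is
    dropped. Above it the probability rises linearly with queue size,
    reaching 100% when the queue is full.
    """
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    min_threshold = min_fraction * capacity
    if queue_len <= min_threshold:
        return 0.0
    if queue_len >= capacity:
        return 1.0
    return (queue_len - min_threshold) / (capacity - min_threshold)
```

Note that the only inputs are the queue size and the buffer size, which is the whole point of the complaint: a sender only learns about the queue through the occasional lost packet.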

Given that uTP can be implemented today, by any application, with no upgrades to any internet hardware being necessary, and that it solves that whole bufferbloat/latency problem, I think we should view the end to end approach as the solution to bufferbloat and just forget about changing router behavior.

We’ve already rolled out uTP as the default transfer algorithm for BitTorrent, which has changed the behavior of the internet so much that it’s changed how and whether ISPs need to upgrade their infrastructure.

‘But game theory!’ the naysayers will say ‘Game theory says that TCP will always win!’ Narrowly speaking this is true. TCP as a big giant SUV is very good at playing chicken. Whenever TCP goes up against a congestion control algorithm which makes an actual attempt to not have a completely full buffer, TCP will always fill the buffer and crowd out the other one. Of course, it will then stick the end user with all the latency of a completely full buffer, regardless of how bloated the buffer is, sometimes going into the seconds. For the end user to complain about how big the buffer is would be like them complaining to their credit card company for offering too high of a limit. ‘You should have known I’d spend too much!’ The solution is for the end user to intervene, and tell all their applications to not be such pigs, and use uTP instead of TCP. Then they’ll have the same transfer rates they started out with, plus have low latency when browsing the web and teleconferencing, and not screw up their ISP when they’re doing bulk data transfers. Even within a regime of everyone using uTP, it’s possible to have bulk transfers take lower priority than more important data by giving them a lower target delay, say 50 milliseconds instead of 100 milliseconds.
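The target delay idea can be sketched in a few lines. This is loosely modelled on the published LEDBAT window update, not on uTP's actual code; the function name, the default gain, and the MSS value are assumptions for illustration.

```python
def delay_based_window_update(cwnd: float, base_delay_ms: float,
                              current_delay_ms: float,
                              target_ms: float = 100.0,
                              gain: float = 1.0,
                              mss: int = 1452,
                              bytes_acked: int = 1452) -> float:
    """Return the new congestion window in bytes after one ack.

    The queuing delay is estimated as the current one way delay minus
    the lowest one way delay seen (the base delay). The window grows
    while the estimate is under the target and shrinks when it is over.
    """
    queuing_delay_ms = current_delay_ms - base_delay_ms
    off_target = (target_ms - queuing_delay_ms) / target_ms
    cwnd += gain * off_target * bytes_acked * mss / cwnd
    return max(cwnd, float(mss))  # never go below one packet
```

With 75 ms of measured queuing delay, a flow with a 100 ms target keeps growing its window while a flow with a 50 ms target backs off, which is how the lower target ends up meaning lower priority.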

If you really want to design routers to give more information to end nodes, the big problem for them to fix is the one where everyone is attempting to do congestion control based on one way delays, but no one can get an accurate base delay because the queue is always full, so one way delays are always exactly the same and it looks like there’s no queue but high non-congestive packet loss. The best way to solve that is for a router to notice when the queue has had too much data in it for too long, and respond by summarily dropping all data in the queue. That will allow the next bunch of packets let through to establish accurate minimum one way delays, and everything will get fixed. Of course, I’ve never seen that proposed anywhere…
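As a rough sketch of that router behavior, assuming a hypothetical `FlushingQueue` class with a made-up high-water mark and time limit (none of this corresponds to any real router's implementation):

```python
from collections import deque


class FlushingQueue:
    """Queue that empties itself after staying too full for too long."""

    def __init__(self, high_water: int, max_standing_seconds: float):
        self.high_water = high_water                    # assumed threshold, in packets
        self.max_standing_seconds = max_standing_seconds
        self.packets = deque()
        self._over_since = None                         # when the queue went over

    def enqueue(self, packet, now: float) -> None:
        self.packets.append(packet)
        self._check(now)

    def dequeue(self, now: float):
        packet = self.packets.popleft() if self.packets else None
        self._check(now)
        return packet

    def _check(self, now: float) -> None:
        if len(self.packets) <= self.high_water:
            self._over_since = None                     # queue drained on its own
        elif self._over_since is None:
            self._over_since = now                      # just crossed the mark
        elif now - self._over_since >= self.max_standing_seconds:
            self.packets.clear()                        # summarily drop everything
            self._over_since = None
```

After a flush, the next packets through see an empty queue, so endpoints measuring one way delays get a fresh reading of the true base delay.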

Engineering IP Telephony

Uncategorized 4 May 2012 | 12 Comments

Update: Microsoft has sent out a clarification that the only thing they’ve centralized is handshake negotiation, which is a good thing, but doesn’t include creating any of the features I’ve listed below. Consider this article to be about how you could do things better.

Skype (aka Microsoft) decided to use proxying for all connections and to not have their peer directory information be run on untrusted peers, citing ‘security’. One might wonder how and if this improves security. The most immediate security benefit is that it keeps peers from being given the entire peer directory and from accepting direct connections in from untrusted peers, which can possibly send exploits. Applications written in real languages don’t have such problems, but anything which does audio decoding is going to have to have a significant C component. On the flip side, compressed audio isn’t sanitized or reencoded by proxies, so exploits are still possible, but then again the server can easily add checks for new exploits and sanitize them and use them to develop an MO for spammers. Of course, the central server by having the ability to view everything can do a much better job of stopping non-security-related spam as well. Big brother protects you from spammers. It’s a fact of life.

The deeper advantage of going through a central proxy is that it avoids giving away IP addresses. If an attacker who runs a countrywide firewall wants to know who its dissidents are, it can place a call to a journalist known to be talking to dissidents to find out their IP address, then record all IP addresses within the country forming direct connections to the reporter’s IP address, and have a very short list of potential dissidents. While this attack is extremely crude, it’s about what most countries are able to pull off given their level of sophistication and the huge amount of traffic they have to deal with, and idiot journalists have gotten people killed this way. Thanks, assholes. By proxying everything through their servers, Microsoft has basically set up a big single-hop anonymizer, which at least stops the very crude attack we’re worried about right now.

There are potential engineering advantages to going through proxies as well, although it’s doubtful that Microsoft is doing them yet. The engineering problem which completely dominates telephony is latency. Light in a fiber optic cable takes about 200 milliseconds to travel the length of the earth’s circumference. By coincidence the generally accepted maximum acceptable round trip time for an audio call is the not much larger value of 330 milliseconds, which makes engineering a system which can handle phone calls from Argentina to Japan inherently difficult.

Despite the behavior of the telco industry, a 330ms round trip time is not a gold standard, it’s the absolute maximum above which people find it completely unacceptable. The telephony industry continues to pretend that it’s a gold standard anyway, letting call latencies get higher and higher with not a single company engineering a low-latency solution and bragging about it in their marketing materials, displaying an appalling lack of pride in one’s work, engineering prowess, and marketing acumen.

In any case, for the sake of argument I’m going to say that 330 ms is an ‘acceptable’ round trip time and 150 ms is a ‘good’ round trip time. Yes I know that for gaming anything above 100ms starts to suck, but people are more forgiving of audio calls.

Whenever you send data over the internet, the latencies add up as follows:

congestion control delay -> upload queueing delay -> upload serialization delay -> propagation delay -> download serialization delay -> download queueing delay

In the case of audio, data is always sent as soon as it’s generated. Reducing the bitrate is an extreme measure which is only done rarely, and then it’s to another basically fixed bitrate, so there isn’t any of the sort of dynamic congestion control which TCP has, and no point in waiting to send anything.

Upload queueing delay is hopefully usually zero. Unfortunately many routers have queues whose length can be measured in seconds, which TCP will happily completely fill continuously. Obviously if there’s any serious TCP-based cross traffic your audio calls are basically hosed, so we’ll just assume that there isn’t any. What size upload queues should be is a controversial topic which I’ll have a lot to say about at another time, for now I’ll just say that getting them below amounts which can potentially mess up audio calls is completely unworkable at current broadband connection rates, and anything over 500ms is completely broken.

Upload serialization delay is typically just a few milliseconds and not a major issue.

Propagation delay is dominated by the aforementioned speed of light problem. As a reasonable example, the one way latency between San Francisco and Amsterdam can’t be below 40ms just because of the laws of physics, and if you can get 80ms from one home net connection in the US to another one in Europe you’re doing very well. (If anyone can point out maps showing latencies between different parts of the internet which don’t suck I’d appreciate it. Most of the easy to find ones are focused on latencies between different ISPs in the same location, although some of the latencies they report make me want to cry.)
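The physics floor is easy to estimate. The sketch below assumes light in fiber travels at roughly 200,000 km/s (about two thirds of its speed in a vacuum), that the cable follows a perfect great circle, and approximate city coordinates; real routes are longer, so real latencies are higher.

```python
import math

FIBER_KM_PER_MS = 200.0    # assumed: ~200,000 km/s in glass
EARTH_RADIUS_KM = 6371.0   # mean radius


def great_circle_km(lat1: float, lon1: float,
                    lat2: float, lon2: float) -> float:
    """Haversine distance between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def min_one_way_ms(distance_km: float) -> float:
    """Lower bound on one way latency over an ideal fiber path."""
    return distance_km / FIBER_KM_PER_MS


# Approximate coordinates, for illustration only.
sf_to_amsterdam_km = great_circle_km(37.77, -122.42, 52.37, 4.90)
around_the_earth_km = 2 * math.pi * EARTH_RADIUS_KM
```

Under these assumptions `min_one_way_ms(sf_to_amsterdam_km)` lands a little above 40 ms and `min_one_way_ms(around_the_earth_km)` at about 200 ms, in line with the figures quoted in the text.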

Download serialization delay is generally just a few milliseconds and not a major issue.

Download queueing delay generally can’t be caused by uploads from a peer net connection, because consumer download rates are much higher than upload rates, but if the downlink is saturated by TCP-based cross traffic you’re hosed, for the same reasons as with upload queueing delay.
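Putting the pieces together, a one way latency budget is just the sum of the stages listed above. The sketch below is a back-of-the-envelope model; the packet size and link rates in the example are assumed values, not measurements.

```python
def serialization_delay_ms(packet_bytes: int, link_bits_per_second: float) -> float:
    """Time to clock one packet onto a link of the given rate."""
    return packet_bytes * 8 / link_bits_per_second * 1000.0


def one_way_latency_ms(packet_bytes: int,
                       uplink_bps: float,
                       downlink_bps: float,
                       propagation_ms: float,
                       upload_queue_ms: float = 0.0,
                       download_queue_ms: float = 0.0,
                       congestion_control_ms: float = 0.0) -> float:
    """Sum the delay stages in the order a packet encounters them."""
    return (congestion_control_ms
            + upload_queue_ms
            + serialization_delay_ms(packet_bytes, uplink_bps)
            + propagation_ms
            + serialization_delay_ms(packet_bytes, downlink_bps)
            + download_queue_ms)


# Assumed example: 200-byte audio packet, 1 Mbit/s up, 10 Mbit/s down,
# 80 ms propagation, empty queues.
quiet_link_ms = one_way_latency_ms(200, 1_000_000, 10_000_000, 80.0)

# Same call with a full 500 ms upload queue from TCP cross traffic.
hosed_link_ms = one_way_latency_ms(200, 1_000_000, 10_000_000, 80.0,
                                   upload_queue_ms=500.0)
```

On the quiet link the serialization terms add under two milliseconds to the 80 ms of propagation, while a single full upload queue swamps everything else.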

That covers the basics of how the internet works from the standpoint of a layer 5 protocol developer. My apologies to people who work at the lower layers.

Where this becomes interesting for internet telephony is when it interacts with packet loss. On well behaved net connections the only source of packet loss should be packets getting dropped because queues are full, at which point phone calls are already hosed because of latency. Unfortunately lots of people have cross traffic on their wireless, or move their laptop too far from their base station, or have a shitty ISP, and have a lot of non-congestive packet loss. For the sake of argument, I’ll assume that this is a problem worth solving.

When data is lost, the only things you can do are to either have a skip, or send a rerequest. If you have a policy of sending rerequests, then you have to delay all traffic by the worst delay you can incur with a rerequest, because dynamically changing the latency produces a sucky end user experience. Let’s say that we have a simple direct connection which looks like this:

Alice ---- 80ms ---- Bob

In this case the direct round trip time will be 160ms, which is good. The speed of sound is about one foot per millisecond, so this will be about equivalent to yelling at someone 80 feet away from you. Unfortunately if you add in rerequests, you have to wait for a packet to (not) get there, a rerequest to get sent back, and a replacement packet to arrive, for a total of 240ms each way, or 480ms round trip, which is totally unacceptable. Let’s say that we try to improve this as follows:

Alice ---- 40ms ---- proxy server ---- 40 ms ---- Bob

If we’re now willing to assume that the packet loss rate is low enough that a single packet might need a rerequest on one end or the other but not both, our resends will add 80ms each way, for a total round trip time of 320ms, which is barely acceptable. Unfortunately that isn’t a terribly solid improvement, and it’s hard to colocate servers in the exact middle of the Atlantic or Pacific oceans, but it’s possible to do better:

Alice ---- 10ms ---- proxy server ---- 60ms ---- proxy server ---- 10ms ---- Bob

Now things look much better. Even with resends on both ends, the one way delay has only gone up to 120ms, for a total round trip time of 240ms, or 200ms if we want to be just a little aggressive, which is pretty good. And even better, the amount of latency this technique adds doesn’t increase as the distance between the peers increases, so long as there are proxy servers close to either end.
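The arithmetic for all three topologies follows one rule: every hop that has to budget for a rerequest costs two extra traversals of that hop (the rerequest going back and the replacement coming forward). A small sketch, with a hypothetical function name, reproduces the numbers above:

```python
def one_way_delay_ms(hops_ms, resend_hops=()):
    """One way delay when the listed hops each budget for one rerequest.

    hops_ms     -- one way latency of each hop, in path order
    resend_hops -- indexes of the hops on which a resend is budgeted
    """
    base = sum(hops_ms)
    resend_cost = sum(2 * hops_ms[i] for i in resend_hops)
    return base + resend_cost


direct = one_way_delay_ms([80], resend_hops=[0])                   # 240
one_proxy = one_way_delay_ms([40, 40], resend_hops=[0])            # 160
two_proxies = one_way_delay_ms([10, 60, 10], resend_hops=[0, 2])   # 120
aggressive = one_way_delay_ms([10, 60, 10], resend_hops=[0])       # 100
```

Doubling each value gives the round trip times from the text: 480ms, 320ms, 240ms and 200ms. Since the long middle hop never appears in `resend_hops`, making it longer raises the base delay but not the cost of resends.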

Never Make Counter-Offers

business 4 December 2011 | 69 Comments

At many companies, the way you get a raise is to quit. As a matter of policy. I am not exaggerating.

The way it works is this: Management figures they’ll save money on salaries by leaving it up to the employees to negotiate for their own pay. So they don’t give raises until someone tries to negotiate for one. Naturally, anyone asking for a raise is viewed as having no negotiating stance unless they have a credible claim to quitting, so raises are only given as counter-offers, generally matching or slightly beating the offer the employee has elsewhere. Management figures as long as they always match what someone is offered elsewhere, the employee will always prefer to stay, because that’s easier to do, and salaries are kept to the absolute minimum they can be with no real risk.

This is completely insane.

Think about what this does to employees. The most devoted, upstanding employees are the least paid, and the most conniving, disinterested ones are paid the most. Sooner or later the lower paid employees are either going to get the feeling that all the necessary secrecy around salaries means they’re getting screwed (because, um, they are) or find out that someone got a raise by quitting, and go about doing it themselves. Maybe in the process they’ll find a job which they realize they actually like better. They have no reason to feel appreciated in their current job, after all. And there’s a good chance they won’t even listen to the counter-offer, much less take it. Soon the whole workplace is completely toxic, with everybody either underpaid, having one foot out the door, or being one of the assholes who views periodically finding a job offer you don’t plan to accept just to get a raise as normal and reasonable.

You should never make counter-offers. Ever. If an employee tells you they have a job offer, tell them that if they take it then you’ll wish them well, and stick to it. Don’t send the message to the rest of the employees that you’ll reward them for shopping themselves around, and don’t hang on to people who don’t want to work for you.

So how should you retain employees? Have clear and consistent salary guidelines, and regularly give raises to people who are outperforming their pay level. Don’t be a tight-fisted, short-sighted moron.