TCP Sucks

Uncategorized 7 May 2012 | 41 Comments

Nick Weaver gives me a shout-out.

there have been a couple of cases where the application vendors concluded they were causing too much damage and therefore started making changes. BitTorrent is the classic example. It is shifting to delay-based congestion control specifically to: (a) be friendlier to TCP because most of the data carried by BitTorrent really is lower-priority stuff; and (b) mitigate the “you can’t run BitTorrent and Warcraft at the same time” problem. So, there’s some hope.

It’s true. We occasionally take a break from drinking moonshine and shooting beer bottles to do real engineering.

Of course, I’ve always used TCP through exactly the API it provides, and even before I understood how TCP worked under the hood I went to great pains to keep the number of TCP connections down to the minimum which will reliably saturate the net connection and provide good piece diffusion. If TCP doesn’t handle that well, it isn’t my fault.

Now the intelligentsia have a plan for how to fix the internet, called RED, because uTP (using the LEDBAT congestion control algorithm), coming from the likes of me, can’t be the real solution. (By the way, I’d like to thank Stanislav Shulanov for being the real brains behind uTP.) I don’t believe RED is viable, for several reasons.

First, it’s just plain unproven. It’s been years since RED was proposed, and to date no one’s come up with something where they can say ‘go ahead and deploy this, it’s mature’ with a straight face. Given that very smart people have worked on this, it stands to reason that the problems are just plain hard.

Second, it ain’t gonna happen. Deploying RED involves upgrading routers. To rely on it requires upgrading the entire infrastructure of the internet. The marketing plan seems to be that because router vendors are unwilling to say ‘has less memory!’ as a selling point, maybe they’d be willing to say ‘drops more packets!’ instead. That seems implausible.

Finally, RED is in an apples-to-apples sense a much cruder technique than uTP. With classic internet routing, a router will either pass along a packet immediately if it can, or add it to a queue to send later if it can’t. If the queue has become full, it drops the packet. With RED the router will instead have some probability of dropping the packet based on the size of the queue, going up to 100% if the queue is full. (Yes I know there are other schemes where packets already on the queue are dropped, I’m going to view all those things as variants on the same basic principle.) Since TCP only uses dropped packets as a signal to back off, this uses early packet dropping as a way of giving some information to TCP stacks that they need to back off before the queue gets full. The only information in use here is the size of the queue and the size of the buffer, with the size of the buffer becoming increasingly irrelevant due to buffer bloat, making its value be essentially ‘far too big’. RED makes dropped packets convey a little more meaning by having statistical gradations instead of a full/not full signal. uTP by contrast uses one way delays to get information for when to back off, which allows it to get very precise information about the size of the queue with every packet, with no packet loss happening under normal circumstances. That’s simply more information. You could in fact implement a ‘pretend the router’s using RED’ algorithm for uTP, with no router upgrades necessary.
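The statistical gradation RED adds can be sketched in a few lines (the threshold names and the max_p value here are illustrative, not taken from any particular router implementation):

```python
import random

def red_drop(avg_queue, min_thresh, max_thresh, max_p=0.1):
    """Classic RED sketch: probabilistically drop based on queue size.

    Below min_thresh nothing is dropped; between the thresholds the drop
    probability ramps linearly up to max_p; at or above max_thresh
    everything is dropped, like a plain full queue would.
    """
    if avg_queue < min_thresh:
        return False
    if avg_queue >= max_thresh:
        return True
    p = max_p * (avg_queue - min_thresh) / (max_thresh - min_thresh)
    return random.random() < p
```

The point of the linear ramp is that senders start seeing occasional drops, and backing off, before the queue is anywhere near full.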

Given that uTP can be implemented today, by any application, with no upgrades to any internet hardware being necessary, and that it solves that whole bufferbloat/latency problem, I think we should view the end to end approach as the solution to bufferbloat and just forget about changing router behavior.

We’ve already rolled out uTP as the default transfer algorithm for BitTorrent, which has changed the behavior of the internet so much that it’s changed how and whether ISPs need to upgrade their infrastructure.

‘But game theory!’ the naysayers will say ‘Game theory says that TCP will always win!’ Narrowly speaking this is true. TCP as a big giant SUV is very good at playing chicken. Whenever TCP goes up against a congestion control algorithm which makes an actual attempt to not have a completely full buffer, TCP will always fill the buffer and crowd out the other one. Of course, it will then stick the end user with all the latency of a completely full buffer, regardless of how bloated the buffer is, sometimes going into the seconds. For the end user to complain about how big the buffer is would be like them complaining to their credit card company for offering too high of a limit. ‘You should have known I’d spend too much!’ The solution is for the end user to intervene, and tell all their applications to not be such pigs, and use uTP instead of TCP. Then they’ll have the same transfer rates they started out with, plus have low latency when browsing the web and teleconferencing, and not screw up their ISP when they’re doing bulk data transfers. Even within a regime of everyone using uTP, it’s possible to have bulk transfers take lower priority than more important data by giving them a lower target delay, say 50 milliseconds instead of 100 milliseconds.
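The target-delay mechanism in that last sentence can be sketched like so (this is a simplified illustration of the LEDBAT idea, not the exact update rule from the spec):

```python
def ledbat_cwnd_update(cwnd, queuing_delay, target, gain=1.0, mss=1500):
    """One delay-based window update, LEDBAT-style (a sketch).

    Grow the window while the measured one-way queuing delay is under
    the target; shrink it once the delay exceeds the target. A bulk
    transfer given target=0.05 (50 ms) yields to one with target=0.1
    (100 ms), because it starts backing off first as the queue builds.
    """
    off_target = (target - queuing_delay) / target
    cwnd += gain * off_target * mss * mss / cwnd
    return max(cwnd, mss)  # never shrink below one packet
```

Because the signal is continuous rather than drop/no-drop, the window settles near the rate that holds the queue at the target instead of oscillating between full and half-open.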

If you really want to design routers to give more information to end nodes, the big problem for them to fix is the one where everyone is attempting to do congestion control based on one way delays, but no one can get an accurate base delay because the queue is always full, so one way delays are always exactly the same and it looks like there’s no queue but high non-congestive packet loss. The best way to solve that is for a router to notice when the queue has too much data in it for too long, and respond by summarily dropping all data in the queue. That will allow the next bunch of packets let through to establish accurate minimum one way delays, and everything will get fixed. Of course, I’ve never seen that proposed anywhere…
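That queue-flushing policy is simple enough to sketch as a toy router queue (the high-water mark and patience values are made up for illustration):

```python
import collections
import time

class FlushingQueue:
    """Toy router queue for the policy described above (a sketch).

    If the queue stays above a high-water mark for longer than
    `patience` seconds, drop everything in it, so the next packets
    through can re-establish an accurate minimum one-way delay.
    """
    def __init__(self, high_water=100, patience=5.0, now=time.monotonic):
        self.q = collections.deque()
        self.high_water = high_water
        self.patience = patience
        self.now = now
        self.over_since = None  # when the queue first exceeded high_water

    def enqueue(self, pkt):
        self.q.append(pkt)
        if len(self.q) > self.high_water:
            if self.over_since is None:
                self.over_since = self.now()
            elif self.now() - self.over_since > self.patience:
                self.q.clear()  # summarily drop the whole queue
                self.over_since = None
        else:
            self.over_since = None
```

Because the flush fires on a timescale of seconds while the congestion controllers react within round trips, the two loops operate at well-separated frequencies.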

41 Responses on “TCP Sucks”

  1. Tony Finch says:

    It is a pity that TCP Vegas never made it. As I understand it, it can have fairness problems, where two flows sharing a link can get different ideas of the base latency. But perhaps that can be fixed?

    • bramcohen says:

      Vegas has a problem with not working well because it’s working off of round trip time. With uTP we’re using one way delay, which works a lot better.

      Fairness is a red herring. TCP’s fairness is actually extremely noisy, because everything’s based on the blunt instrument of sometimes knocking a transfer rate in half, and any congestion control algorithm which saturates the link reliably will tend to randomly shuffle upload capacity between the competing streams.
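      How the one-way delay measurement works can be outlined as follows (a sketch of the idea, not the uTP wire format): the sender stamps each packet with its local clock, the receiver subtracts its own clock, and although clock offset makes the absolute value meaningless, the minimum observed value serves as the base delay and anything above it is queuing delay.

```python
class OneWayDelayFilter:
    """One-way delay filtering, uTP-style (a sketch)."""
    def __init__(self):
        self.base = None

    def queuing_delay(self, send_stamp, recv_stamp):
        raw = recv_stamp - send_stamp   # includes unknown clock offset
        if self.base is None or raw < self.base:
            self.base = raw             # new minimum: best base estimate
        return raw - self.base          # the queuing component
```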

      • Paul Sutter says:

        Fairness is worse than a red herring. Rooted in the “fairness” discussion is the assumption that existing TCP is fair (noisy is a polite understatement), and by definition anything other than existing TCP is thus unfair, because anything else behaves differently.

      • Rscheff says:

        Wouldn’t it help to expose one-way-delay variance to the congestion controller, for starters?
        This could easily be accomplished with TCP timestamps, if the semantics (what timestamp to reflect when) and contents had been defined differently (or at all). Unfortunately, timestamps have but one real purpose (and that is not measuring RTT) today…

  2. Noel Grandin says:

    Your last suggestion has been tried and leads to nasty feedback loops and cycling behaviour in most protocols. Inevitably, because it just doesn’t provide enough information.

    Whether or not uTP is better than TCP is beside the point – we need some kind of RED algorithm to be deployed simply to keep buffers sized “reasonably” in the presence of varying available downstream bandwidth.

    • bramcohen says:

      My last suggestion is something which should only happen on a time scale of once every few seconds or once every few tens of seconds, which can’t possibly result in any feedback loops with the underlying congestion control, because that happens a lot faster.

      Trying to sell RED to router manufacturers ain’t going to work. Buffers will be bloated on the internet. Suck it up and deal with it.

  3. Simon Farnsworth says:

    Just a point: there’s no such thing as “TCP congestion control”. There are several different congestion control algorithms, some of which are delay-based.

    The problem with delay based congestion control, borne out in research into changing the default congestion control algorithm Internet-wide, is that delay based congestion control is never fair – if your instance of a delay-based algorithm detects 10ms as a congestion delay, mine just needs to not detect congestion until there’s 20ms of delay and I will eventually claim the entire throughput of the shared segment (because my TCP will not back off until it fills the router queue, while yours will back off as it sees my TCP fill the queue).

    Packet loss based algorithms (whether ECN-assisted or not) avoid this, as no algorithm detects congestion until the queue is filled.

    And I would note that conflating RED (a way to manage queues, even in the absence of TCP) with TCP Reno (the most common TCP congestion control algorithm) is a weird choice. RED is a way for a router to increase the chance of it dropping packets from the “greediest” flow on a shared link; Reno is a way for TCP to avoid oversaturating a shared link.

    Edit to add: I’ve now been and read the uTP specification. uTP is an implementation of TCP, carried in UDP packets instead of IP packets – it appears that its main claim to fame is that it makes it easy for you to choose a different congestion control algorithm to the OS defaults. However, any criticism of TCP also applies to uTP – you use the same basic mechanisms as TCP to convert a stream of bytes into frames, and to ensure reliable delivery.

    • bramcohen says:

      The problem with Vegas is that it tries to do congestion control based on RTT. For uTP we’re using one-way delays, which is a far better approach. We’re trying to get the uTP algorithms incorporated as the standard ones in TCP with the LEDBAT project, but it’s slow going.

      • Simon Farnsworth says:

        Vegas is not the only delay based algorithm considered – simulations of congestion control based on one-way delay show that they’re prone to unfairness issues as soon as there are bottlenecks in both directions (as the feedback loop extends into thousands of packet times). Fine for a CC algorithm aiming to scavenge all spare space in the queue (a better replacement for the TCP-LP congestion control algorithm), not so good for general traffic.

        RED actually helps in this case, as it shrinks the feedback loop automatically at the congested router – ECN would help even more, if we could get it deployed, as the congested router would start by marking packets instead of dropping them.

        It really sounds like your problem is not with TCP, as uTP is a TCP implementation atop UDP packets instead of raw IP, but with the interfaces offered to TCP congestion control – it sounds like you want mainstream OSes to provide a way to plug in new congestion control algorithms (an administrator action, as they’re kernel-side drivers effectively), and a way for applications to select a congestion control algorithm to suit them.

        TCP with the LEDBAT CC algorithms is, after all, equivalent to uTP but without the extra overhead of UDP headers as well as TCP-like headers.

        • bramcohen says:

          The whole point of doing delay-based congestion control is that the queue is usually nearly empty, so you don’t get huge delays.

  4. dmarti says:

    Instead of having the router drop all the packets in its queue, what about using the milk algorithm? (When you remove milk from your refrigerator, you check the date. If the date is within an acceptable range, you drink it, otherwise discard it.) When a router adds a packet to the queue, timestamp it. Check the timestamp on the way out. If it’s stale, discard it and check the next one.
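    The milk algorithm amounts to head-drop by age, which can be sketched like this (the tuple layout and max_age parameter are illustrative):

```python
import collections

def dequeue_fresh(queue, now, max_age):
    """The 'milk algorithm' sketch: packets are stored as
    (enqueue_time, payload) pairs; on dequeue, discard anything older
    than max_age and return the first fresh payload, or None if the
    queue runs dry.
    """
    while queue:
        stamped_at, payload = queue.popleft()
        if now - stamped_at <= max_age:
            return payload
    return None
```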

    • bramcohen says:

      That winds up being the exact same thing, it just uses more memory – you could instead check whether you’ll wind up throwing the milk out in the end beforehand, which you do by checking if your queue is full.

      • Joey says:

        I don’t see how it is the same. In your case the router throws out all the packets regardless of age, and dmarti’s only throws out those that are past due, and isn’t guaranteed to give the minimal delay baseline. Under high load it still yields uniform delay and high packet loss.

        The small risk of the drop-everything approach is that it may underutilize the channel for the short time there is an empty queue, but if you’re already dropping packets like crazy it isn’t going to be too bad, and you could always keep some number of the newest packets around, to avoid underflow. Combined with drop-oldest on overflow, these packets shouldn’t be too old.

        • bramcohen says:

          You can figure out before each packet is put on the queue whether it will eventually be thrown out or not. Waiting to expire it until send time serves no purpose whatsoever.

          • Rscheff says:

            Still not the same. Dropping at enqueue (tail of queue) vs. dequeue (head of queue) makes a huge difference for the control loop. One has a signal close to minimal delay; the other one, once the queues are filled, is artificially delayed and thus delivers an even more delayed reaction (sender transmission rate reduction).
            BTW: If TCP timestamps would carry proper timing information, wouldn’t LEDBAT congestion control also become applicable to TCP?

          • Wes Eddy says:

            It has already been applied to TCP though I believe using the RTT rather than OWD:
            http://www.opensource.apple.com/source/xnu/xnu-1699.24.23/bsd/netinet/tcp_ledbat.c

    • CodeInChaos says:

      Assuming the speed at which data is dequeued from the buffer is constant (and that’s a reasonable assumption), you can calculate how long the packet will wait at the time you enqueue it, simply by looking at how much of the buffer has been filled so far. So you can drop it at enqueue time.

      So your proposal amounts to: Use a smaller buffer, and drop new packets if the buffer is full.
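      The prediction in that argument is just one division, assuming (as the comment does) a constant drain rate; the numbers below are illustrative:

```python
def predicted_wait(queued_bytes, drain_rate_bps):
    """Predict how long a packet enqueued now will sit in the buffer:
    everything already queued must drain first at the constant rate.
    A router could drop at enqueue time any packet whose predicted
    wait exceeds its deadline, instead of expiring it at dequeue time.
    """
    return queued_bytes * 8 / drain_rate_bps
```

For example, with 125 kB already queued on a 10 Mbit/s link, a new packet is predicted to wait 0.1 seconds.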

      • Ryan Malayter says:

        Actually on wireless networks, the queues in access points, your OS and WNIC have no real idea of your effective egress rate. It can vary by a factor of 10 or more over the span of seconds based on radio conditions.

        Which is why hotel WiFi sucks, and why I see 1500 ms pings when using my 3G card on the train.

  5. Benjamin Lovell says:

    I have been thinking about this for a while now (ever since I heard of uTP). uTP’s use of one-way delay is brilliant. Seriously!! I think it’s a game changer if used right. Far superior to TCP congestion control, and while it accomplishes what it was intended for with BitTorrent, I would suggest it is incomplete as the new transport layer protocol to get us through the next 40-50 years. The primary failings of the current transport layer that I focus on are mobile scenarios.

    Below is something I wrote up a while back but have not posted much of anywhere yet other than a quick badly written rant on slashdot.

    ———

    TCP has its faults. Many who are much more skilled than I have laid them out better than I could. However, I would like to suggest that there are two faults that matter more than currently thought if we envision their role in a world of ubiquitous connectivity.

    1) Lack of multihoming.
    2) Congestion control based on packet loss.

    1) Multihoming. SCTP

    Mobile devices currently have just two interfaces that are likely to provide connectivity to the global internet: cell and WiFi. Switching between them, due to the nature of TCP, is an event that is noticeable to the application layer. A TCP connection is between two exact IP endpoints. Switching to an interface with a different L3 identity requires establishing a new TCP session. This problem will only become more apparent as connectivity options grow in the future. Connectivity could come from cellular, WiFi, whitespace, 60GHz options that are being explored, wireless optical, and many that are not currently envisioned. The network stack of the future must be able to intelligently switch between these options in a manner that is as transparent to the application layer as possible.

    2) Congestion control.

    Experience control is a horrible name (anyone got suggestions?) but this is what we want. Not just congestion control but a way to monitor the experience a network interface or path can provide. Congestion control based on packet loss does not provide a complete enough view of the environment to provide the desired user experience. uTP’s one-way latency provides far superior information about the state of the path than packet loss can. While the motivation of uTP was to lessen BitTorrent’s effect on ISP networks and be a better “net citizen” in congestion prone environments, it turns out that one-way latency gives us the information to provide a better user experience in a much more generalized way. Based on one-way latency we can now characterize a connection by potential packet loss due to congestion, as well as latency and jitter.

    Different applications have different requirements for a network connection.

    Bulk data transfers care very little about latency or jitter. They care about packet loss. Given a large enough in-flight window even very high latency connections can provide high throughput as long as packets are not lost.

    Streaming video or audio, like bulk data, does not care much about latency but does care about jitter, as de-jitter buffers will only help to a certain point. They care about throughput but only up to a point. Once the throughput crosses the bit rate of the stream, more throughput does not improve the user experience.

    Two-way interactive video or audio, like streaming, does care about jitter, but it also cares very much about latency. As latency increases past a certain point the user experience is degraded. Like streaming, once throughput matches the stream bit rate, more will not improve the user experience.

    To provide rich user experience in next generation networks the network stack must be able to accept input about the requirements of the nature of the connection. Applications should be able to open a socket and state that they want all the throughput they can get but care not about latency. Or open a socket and state they need just 1Mbit of throughput but the best latency and jitter that the stack can provide. Or other combinations I can’t even think of. The stack should be able to switch seamlessly between its interfaces to meet these requirements without action or knowledge of the upper layers.

    This is a much more complicated job and larger feature set than the network stack provides today, but the alternative is that applications will more and more have to provide these services for themselves, with varying results.

    • bramcohen says:

      Multihoming should be done at a higher layer.

      • Benjamin Lovell says:

        At a higher layer… Applications open sockets. You expect applications to handle multihoming??!! That is naive.

        While there technically is a session layer, I am not aware of any widespread non-application-specific usage other than SSL. The OS network stack must provide multihoming and latency info, or else mobile will be a horrible mix of constantly opening and closing TCP sessions as you switch interfaces, along with scenario/application-specific mediocre multihoming solutions. It’s already starting to happen.

        • bramcohen says:

          Yeah, applications, being made of nothing but gears and vacuum tubes, can’t possibly handle anything complicated.

      • Wes Eddy says:

        depending on what one means by “multihoming”, MPTCP is already dealing with this inside TCP:
        http://datatracker.ietf.org/wg/mptcp/charter/

    • rokkaku99 says:

      That’s pretty much motherhood and apple pie, but there have been primitive ways for sockets to choose lower-level handling schemes (ToS, DiffServ, etc.) for many, many years. Acknowledging that work and explaining why it isn’t enough for you (and it certainly could be improved) would probably be a good start.

  6. Gregory P. Smith says:

    Proportional Rate Reduction (PRR) is another thing to consider. http://research.google.com/pubs/pub37486.html

  7. Paul Sutter says:

    Bram, impressive work. I’d love to discuss. We did a lot of work with TCP at Orbital Data, so we’re very familiar with the comical limitations of TCP. Like the fact that they never envisioned bandwidth delay products would get large, or that routers would have queues of preposterous length.

    TCP the wire protocol is pretty independent of the congestion control. Have you thought about fixing TCP itself? Have you thought through whether a server-side TCP that implements a uTP-like congestion control would be useful when combined with a client side that implements traditional congestion control? (Apologies if this is obviously impossible, I haven’t looked at uTP at all.)

    • bramcohen says:

      That’s what we’re trying to do with LEDBAT.

      • Paul Sutter says:

        The move to SPDY could be a chance to get such a thing deployed. Since servers are about to get their bottleneck reworked, now could be the chance to implement such an option.

        SPDY is also important because the internet will actually start using congestion control. Existing HTTP bypasses congestion control in most cases, staying mostly in slow-start.

  8. Yoz says:

    What do you think of CoDel? http://queue.acm.org/detail.cfm?id=2209336
    I may be misreading, but it sounds like they’re also ignoring RTT in favour of what they call “queue-sojourn time” which sounds like one-way delays.

    • bramcohen says:

      Yes, it’s most definitely using one way delays. It does a fair amount of grappling with trying to let spikes happen, which would seem to be a second vote in favor of smoothing out the increase done during slow start over an RTT, this paper being the first vote in favor – http://research.google.com/pubs/pub37486.html

      There’s clearly more experimentation which can be done. I still think that it’s best to do everything with smarts at the end points rather than in the router though, because it’s more deployable and flexible.

  9. Zooko says:

    I think it’s Stanislaw Shalunov, not Shulanov: http://www.linkedin.com/in/shalunov

  10. Dave Taht says:

    Apparently Disqus ate my attempt at pointing to a link that responded to this piece, by Jim Gettys. Apologies if this is a repost: http://gettys.wordpress.com/2012/05/14/the-next-nightmare-is-coming/

  11. Nick P says:

    Bram, what do you think of UDT as a TCP alternative? I’ve promoted it plenty & built quite a few things on top of it. It solved most performance issues I had. I’m also into high assurance design and I’m considering using a variant of it for reliable delivery in a high assurance platform.

    (TCP’s complexity & behavioral traits make it too hard to use unless one is into handwaving arguments.)

  12. The original sin is “work conservation”, i.e. queues that always transmit when the onward link is idle. This over-saturates the system and necessarily causes instability. No form of control loop will ever save you.

  13. syntress says:

    Has there been any progress on this front since last year?

    • Nick P says:

      Yes, TCP use has expanded by millions of devices. The alternatives haven’t come close to overtaking it. So, status quo remains much to my non-delight.

      • syntress says:

        Kind of like the explosion of Windows devices into the market during the infancy of personal computing? ;-)

  14. Thank you for amazing info. Good read!
