Snapchat hasn’t been hacked, it just sucks

Uncategorized 3 January 2014 | 4 Comments

Snapchat’s recent ‘hack’ is a curious creature, because it is (mostly) a direct consequence of the semantics which Snapchat reveals in its UI to every user. Less of a security violation and more like pointing out that license plate numbers are a privacy issue because everybody can read them (with one exception, which I’ll get to). But unlike license plates, Snapchat can change their semantics whenever they want, and should do so. Let me explain.

When you first sign up for Snapchat it pulls in your entire phone book and lets you add them to your friends list (Note to everyone else making new messaging apps: People’s list of real friends is still their phone contacts). This requires no permission from the people you’re adding. Not only does it reveal to you their phone numbers, it also shows their Snapchat usernames, and lets you look at their ‘best friends’ list. That last one is especially curious. Best friends is the list of people you’ve been chatting with most recently. It defaults to having 3 spots, with the option to change it to 5 or 7, which nobody does, and there’s no obvious way of removing it completely. This is the bizarre irony of Snapchat’s privacy features. It meticulously destroys all images you’re trading with other people immediately, but lets anyone who has your phone number see that you’ve been doing it with hornyslut6969.

Allowing that much information to be seen without a user’s permission is especially bizarre in Snapchat because it has such a strong bidirectional concept of friendship. The default is to only allow people you’ve added to send you messages, and most people leave that setting alone. It even lets you know that a message to someone is ‘pending’ if you send it prior to them adding you back, meaning that they’ll get it when (or more likely if) they add you. (The UI really doesn’t make clear what ‘pending’ means. Snapchat feels like it’s succeeded more by accident than design.)

One could argue that the person whose information is being viewed can at least see that someone else has added them if the standard UI is being used, but there are no proactive notifications of that, and the viewer can simply remove the person before they notice. Aside from which, I’m not a fan of privacy through user interface obfuscation.

The ‘leak’ that we have now is an association of phone numbers, Snapchat logins, and approximate geographic locales. The first two parts, as I’ve said above, are already easy to look up, by design. But the locale information isn’t in any obvious place in the UI, and its being available via the API should be viewed as a real security issue.

So what should Snapchat do to fix the problem? As I said at the beginning, they need to change their actual semantics, including their basic UI. The logical way to do this is by making it so that your personal information is only sent to people you add as friends. This wouldn’t be an impediment to friend relationships getting formed, because when you add someone it tells them and encourages them to add you back. It also wouldn’t cause any UI confusion, because when you’ve added someone the main human readable name shown is the one pulled from your local phone book, which of course identifies them just fine. All sensitive info, including the user’s snapchat id and their best friends, should be hidden until and unless they reciprocate the friend add. I don’t think the fact that the person has a Snapchat account at all is a sensitive piece of information.
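The proposed semantics can be sketched in a few lines. Everything here (the user names, data, and field layout) is hypothetical, purely to illustrate gating sensitive fields on a reciprocated friend add:

```python
# A minimal sketch of the proposed semantics: sensitive profile fields are
# returned only when the friend add has been reciprocated. All names and data
# below are made up for illustration.

USERS = {
    "alice": {"username": "alice_nyc", "best_friends": ["bob", "carol"]},
    "bob":   {"username": "hornyslut6969", "best_friends": ["alice"]},
}

# Set of (adder, added) pairs; a friendship is reciprocal when both directions exist.
FRIEND_ADDS = {("alice", "bob")}  # alice added bob; bob hasn't added back yet

def visible_profile(viewer, target):
    """Return only what `viewer` is entitled to see about `target`."""
    profile = {"has_account": target in USERS}  # existence alone isn't sensitive
    reciprocal = (viewer, target) in FRIEND_ADDS and (target, viewer) in FRIEND_ADDS
    if reciprocal:
        profile.update(USERS[target])  # username and best friends unlocked
    return profile

print(visible_profile("alice", "bob"))  # {'has_account': True} -- not yet reciprocated
FRIEND_ADDS.add(("bob", "alice"))       # bob adds alice back
print(visible_profile("alice", "bob"))  # now includes username and best friends
```

The point of the sketch is that the check lives server-side in the data model, not in the UI, so no client can be modified to peek at unreciprocated profiles.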

Undoubtedly this would be a significant amount of work for Snapchat to do, but given the recent bad PR, and how generally icky the information leaks are to begin with, it would be clearly worth it.

The Case of the Missing Remote Reflog Command

Uncategorized 11 November 2013 | 0 Comments

Today we have an example of a massive disaster caused by push -f. Much of the discussion is less than productive, because it violates a fundamental tenet of user interaction design: You are not allowed to call the user stupid. If the user did something wrong, the tool either steered them in the direction of doing something they shouldn’t have, or didn’t allow them to undo something they wanted to undo.

In the search for ultimate rather than proximate solutions to problems like this, it’s important to remember that the software is not immovable. All aspects of how a tool works, in this case both the underlying functionality of git and the process flow on github, are in principle subject to change. Obviously reinventing the tool from scratch is a bad idea: the solution should require as little new development as possible, and ideally shouldn’t break anything which already exists, but all approaches should be considered.

Almost all the discussion of solutions is focusing on a feature request to github to allow admins to disallow push -f. That’s a reasonable enough approach, but it’s being emphasized because it’s the approach which is made possible by the local administration tools available today, not because it’s the best approach. Git’s push -f is a very dangerous command whose use is encouraged by standard git methodologies, but it’s also a very powerful and useful tool, so blocking its use comes with significant downsides.

There are means by which push -f could be made much less dangerous, and more convenient and useful, but that’s an involved discussion I’ll save for another day. What I’d like to encourage as the most immediate solution to this sort of problem is the alternate approach, where the user is given the ability to undo a mistake.

If you’re running git locally, there already is a facility for doing this, albeit not a very good one: The reflog command. What there should be is a remote (and better) version of the reflog command, which gives access to the history of every change to head which has been made, including when it was done and which user did it. Naturally this should be paired with another command for obliterating the history of past heads, because there are times when it makes sense to remove all trace of something which happened (for example, when someone added a password file) but completely obliterating history shouldn’t be the default, and it certainly shouldn’t be conflated with the functionality which push -f is often used for. The details of how reflog works feel accidental rather than designed – it doesn’t keep everything, nor does it allow you to immediately completely obliterate old things. Such functionality is important enough that it deserves to be designed on purpose and put into the standard toolset.
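A remote reflog designed on purpose might look something like this sketch. Nothing here is real git internals; the class name, command names, and data layout are all hypothetical, just to show head history as an append-only log with obliteration as a separate, explicit command:

```python
# Hypothetical sketch of a designed remote reflog: every change to a head is
# recorded with who made it and when, and removing all trace of old heads is
# a deliberate, separate operation rather than a side effect of push -f.
import time

class RemoteReflog:
    def __init__(self):
        self.history = []  # append-only list of (timestamp, user, old_head, new_head)
        self.head = None

    def push(self, user, new_head, force=False):
        # Even a force-push records the old head, so it stays recoverable
        self.history.append((time.time(), user, self.head, new_head))
        self.head = new_head

    def log(self):
        return list(self.history)

    def obliterate(self, commit):
        # The explicit command for removing all trace of a past head,
        # e.g. one that accidentally contained a password file
        self.history = [e for e in self.history
                        if e[2] != commit and e[3] != commit]

repo = RemoteReflog()
repo.push("alice", "c1")
repo.push("mallory", "c2", force=True)  # a destructive push -f...
old_head = repo.log()[-1][2]            # ...but the previous head is recoverable
print(old_head)                          # c1
```

The design choice being illustrated: recovery is the default and destruction is opt-in, which is the opposite of how push -f behaves today.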

(A funny aside, I’m told the command is pronounced ref-log, short for ‘reference log’. I always assumed it was ‘re-flog’ because looking at old history is beating a dead horse.)

TCP Sucks

Uncategorized 7 May 2012 | 41 Comments

Nick Weaver gives me a shout-out.

there have been a couple of cases where the application vendors concluded they were causing too much damage and therefore started making changes. BitTorrent is the classic example. It is shifting to delay-based congestion control specifically to: (a) be friendlier to TCP because most of the data carried by BitTorrent really is lower-priority stuff; and (b) mitigate the “you can’t run BitTorrent and Warcraft at the same time” problem. So, there’s some hope.

It’s true. We occasionally take a break from drinking moonshine and shooting beer bottles to do real engineering.

Of course, I’ve always used TCP using exactly the API it provides, and even before I understood how TCP worked under the hood I went to great pains to limit the number of TCP connections to the minimum which will reliably saturate the net connection and provide good piece diffusion. If TCP doesn’t handle that well, it isn’t my fault.

Now the intelligentsia have a plan, called RED, for how to fix the internet, because uTP (using the LEDBAT congestion control algorithm), coming from the likes of me, can’t be the real solution. (By the way, I’d like to thank Stanislav Shalunov for being the real brains behind uTP.) I don’t believe this is viable, for several reasons.

First is it’s just plain unproven. It’s been years since RED was proposed, and to date no one’s come up with something where they can say ‘go ahead and deploy this, it’s mature’ with a straight face. Given that very smart people have worked on this, it stands to reason that the problems are just plain hard.

Second is it ain’t gonna happen. Deploying RED involves upgrading routers. To rely on it requires upgrading the entire infrastructure of the internet. The marketing plan seems to be that because router vendors are unwilling to say ‘has less memory!’ as a marketing tactic, maybe they’d be willing to say ‘drops more packets!’ instead. That seems implausible.

Finally, RED is in an apples-to-apples sense a much cruder technique than uTP. With classic internet routing, a router will either pass along a packet immediately if it can, or add it to a queue to send later if it can’t. If the queue has become full, it drops the packet. With RED the router will instead have some probability of dropping the packet based on the size of the queue, going up to 100% if the queue is full. (Yes I know there are other schemes where packets already on the queue are dropped, I’m going to view all those things as variants on the same basic principle.) Since TCP only uses dropped packets as a signal to back off, this uses early packet dropping as a way of giving some information to TCP stacks that they need to back off before the queue gets full. The only information in use here is the size of the queue and the size of the buffer, with the size of the buffer becoming increasingly irrelevant due to buffer bloat, making its value be essentially ‘far too big’. RED makes dropped packets convey a little more meaning by having statistical gradations instead of a full/not full signal. uTP by contrast uses one way delays to get information for when to back off, which allows it to get very precise information about the size of the queue with every packet, with no packet loss happening under normal circumstances. That’s simply more information. You could in fact implement a ‘pretend the router’s using RED’ algorithm for uTP, with no router upgrades necessary.

Given that uTP can be implemented today, by any application, with no upgrades to any internet hardware being necessary, and that it solves that whole bufferbloat/latency problem, I think we should view the end to end approach as the solution to bufferbloat and just forget about changing router behavior.

We’ve already rolled out uTP as the default transfer algorithm for BitTorrent, which has changed the behavior of the internet so much that it’s changed how and whether ISPs need to upgrade their infrastructure.

‘But game theory!’ the naysayers will say ‘Game theory says that TCP will always win!’ Narrowly speaking this is true. TCP as a big giant SUV is very good at playing chicken. Whenever TCP goes up against a congestion control algorithm which makes an actual attempt to not have a completely full buffer, TCP will always fill the buffer and crowd out the other one. Of course, it will then stick the end user with all the latency of a completely full buffer, regardless of how bloated the buffer is, sometimes going into the seconds. For the end user to complain about how big the buffer is would be like them complaining to their credit card company for offering too high of a limit. ‘You should have known I’d spend too much!’ The solution is for the end user to intervene, and tell all their applications to not be such pigs, and use uTP instead of TCP. Then they’ll have the same transfer rates they started out with, plus have low latency when browsing the web and teleconferencing, and not screw up their ISP when they’re doing bulk data transfers. Even within a regime of everyone using uTP, it’s possible to have bulk transfers take lower priority than more important data by giving them a lower target delay, say 50 milliseconds instead of 100 milliseconds.

If you really want to design routers to give more information to end nodes, the big problem for them to fix is the one where everyone is attempting to do congestion control based on one way delays, but no one can get an accurate base delay because the queue is always full, so one way delays are always exactly the same and it looks like there’s no queue but high non-congestive packet loss. The best way to solve that is for a router to notice when the queue has too much data in it for too long, and respond by summarily dropping all data in the queue. That will allow the next bunch of packets let through to establish accurate minimum one way delays, and everything will get fixed. Of course, I’ve never seen that proposed anywhere…

Engineering IP Telephony

Uncategorized 4 May 2012 | 14 Comments

Update: Microsoft has sent out a clarification that the only thing they’ve centralized is handshake negotiation, which is a good thing, but doesn’t include creating any of the features I’ve listed below. Consider this article to be about how you could do things better.

Skype (aka Microsoft) decided to use proxying for all connections and to stop having their peer directory information run on untrusted peers, citing ‘security’. One might wonder how and if this improves security. The most immediate security benefit is that it keeps peers from being given the entire peer directory, and from accepting direct connections from untrusted peers, which can possibly send exploits. Applications written in real languages don’t have such problems, but anything which does audio decoding is going to have a significant C component. On the flip side, compressed audio isn’t sanitized or reencoded by proxies, so exploits are still possible, but then again the server can easily add checks for new exploits, sanitize them, and use them to develop an MO for spammers. Of course, the central server, by having the ability to view everything, can do a much better job of stopping non-security-related spam as well. Big brother protects you from spammers. It’s a fact of life.

The deeper advantage of going through a central proxy is that it avoids giving away IP addresses. If an attacker who runs a countrywide firewall wants to know who its dissidents are, it can place a call to a journalist known to be talking to dissidents to find out their IP address, then record all IP addresses within the country forming direct connections to the journalist’s IP address, and have a very short list of potential dissidents. While this attack is extremely crude, it’s about what most countries are able to pull off given their level of sophistication and the huge amount of traffic they have to deal with, and idiot journalists have gotten people killed this way. Thanks, assholes. By proxying everything through their servers, Microsoft has basically set up a big single-hop anonymizer, which at least stops the very crude attack we’re worried about right now.

There are potential engineering advantages to going through proxies as well, although it’s doubtful that Microsoft is doing them yet. The engineering problem which completely dominates telephony is latency. The speed of light in a fiber optic cable around the earth’s circumference is about 200 milliseconds. By coincidence the generally accepted maximum acceptable round trip time for an audio call is the not much larger value of 330 milliseconds, which makes engineering a system which can handle phone calls from Argentina to Japan inherently difficult.

Despite the behavior of the telco industry, a 330ms round trip time is not a gold standard, it’s the absolute maximum above which people find it completely unacceptable. The telephony industry continues to pretend that it’s a gold standard anyway, letting call latencies get higher and higher with not a single company engineering a low-latency solution and bragging about it in their marketing materials, displaying an appalling lack of pride in one’s work, engineering prowess, and marketing acumen.

In any case, for the sake of argument I’m going to say that 330 ms is an ‘acceptable’ round trip time and 150 ms is a ‘good’ round trip time. Yes I know that for gaming anything above 100ms starts to suck, but people are more forgiving of audio calls.

Whenever you send data over the internet, the latencies add up as follows:

congestion control delay -> upload queueing delay -> upload serialization delay -> propagation delay -> download serialization delay -> download queueing delay

In the case of audio, data is always sent as soon as it’s generated. Reducing the bitrate is an extreme measure which is only done rarely, and then it’s to another basically fixed bitrate, so there isn’t any of the sort of dynamic congestion control which TCP has, and no point in waiting to send anything.

Upload queueing delay is hopefully usually zero. Unfortunately many routers have queues whose length can be measured in seconds, which TCP will happily completely fill continuously. Obviously if there’s any serious TCP-based cross traffic your audio calls are basically hosed, so we’ll just assume that there isn’t any. What size upload queues should be is a controversial topic which I’ll have a lot to say about at another time; for now I’ll just say that getting them small enough that they can’t potentially mess up audio calls is completely unworkable at current broadband connection rates, and anything over 500ms is completely broken.

Upload serialization delay is typically just a few milliseconds and not a major issue.

Propagation delay is dominated by the aforementioned speed of light problem. As a reasonable example, the one way latency between San Francisco and Amsterdam can’t be below 40ms just because of the laws of physics, and if you can get 80ms from one home net connection in the US to another one in Europe you’re doing very well. (If anyone can point out maps showing latencies between different parts of the internet which don’t suck I’d appreciate it. Most of the easy to find ones are focused on latencies between different ISPs in the same location, although some of the latencies they report make me want to cry.)

Download serialization delay is generally just a few milliseconds and not a major issue.

Download queueing delay generally can’t be caused by uploads from a peer net connection, because consumer download rates are much higher than upload rates, but if the downlink is saturated by TCP-based cross traffic you’re hosed, for the same reasons as with upload queueing delay.
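As a back-of-the-envelope check, the components above can be tallied into a budget. The numbers are illustrative, roughly matching the US-to-Europe example:

```python
# Illustrative latency budget for the delay chain described above, assuming a
# good US-to-Europe home connection with no TCP cross traffic. All numbers
# are rough estimates, not measurements.

budget_ms = {
    "congestion control delay":    0,   # audio is sent as soon as it's generated
    "upload queueing delay":       0,   # assuming no TCP-based cross traffic
    "upload serialization delay":  3,   # typically just a few milliseconds
    "propagation delay":          80,   # doing very well, US home to Europe home
    "download serialization delay": 3,  # also just a few milliseconds
    "download queueing delay":     0,   # downlink not saturated
}

one_way_ms = sum(budget_ms.values())
round_trip_ms = 2 * one_way_ms
print(one_way_ms, round_trip_ms)  # 86 172 -- 'good' territory, well under the 330ms max
```

Under these assumptions a transatlantic call fits the ‘good’ 150ms-ish round trip only barely, and any queueing at all pushes it toward the 330ms ceiling, which is the point of everything that follows.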

That covers the basics of how the internet works from the standpoint of a layer 5 protocol developer. My apologies to people who work at the lower layers.

Where this becomes interesting for internet telephony is when it interacts with packet loss. On well behaved net connections the only source of packet loss should be packets getting dropped because queues are full, at which point phone calls are already hosed because of latency. Unfortunately lots of people have cross traffic on their wireless, or move their laptop too far from their base station, or have a shitty ISP, and have a lot of non-congestive packet loss. For the sake of argument, I’ll assume that this is a problem worth solving.

When data is lost, the only things you can do are to either have a skip, or send a rerequest. If you have a policy of sending rerequests, then you have to delay all traffic by the worst delay you can incur with a rerequest, because dynamically changing the latency produces a sucky end user experience. Let’s say that we have a simple direct connection which looks like this:

Alice ---- 80ms ---- Bob

In this case the direct round trip time will be 160ms, which is good. The speed of sound is about one foot per millisecond, so this will be about equivalent to yelling at someone 80 feet away from you. Unfortunately if you add in rerequests, you have to wait for a packet to (not) get there, a rerequest to get sent back, and a replacement packet to arrive, for a total of 240ms each way, or 480ms round trip, which is totally unacceptable. Let’s say that we try to improve this as follows:

Alice ---- 40ms ---- proxy server ---- 40 ms ---- Bob

If we’re now willing to assume that the packet rate is low enough that a single packet might need a rerequest on one end or the other but not both, our resends will add 80ms each way, for a total round trip time of 320ms, which is barely acceptable. Unfortunately that isn’t a terribly solid improvement, and it’s hard to colocate servers in the exact middle of the Atlantic or Pacific oceans, but it’s possible to do better:

Alice ---- 10ms ---- proxy server ---- 60ms ---- proxy server ---- 10ms ---- Bob

Now things look much better. Even with resends on both ends, the one way delay has only gone up to 120ms, for a total round trip time of 240ms, or 200ms if we want to be just a little aggressive, which is pretty good. And even better, the amount of latency this technique adds doesn’t increase as the distance between the peers increases, so long as there are proxy servers close to either end.
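The arithmetic for the three topologies can be checked mechanically. This sketch assumes, as in the text, that at most one rerequest is needed per lossy hop and that a rerequest costs two extra traversals of that hop (the rerequest back, then the resent packet forward):

```python
# Worked arithmetic for the three topologies above: base one-way delay is the
# sum of hop delays, and each hop where a rerequest may be needed adds two
# extra traversals of that hop.

def one_way_with_rerequest(hops_ms, lossy_hops):
    """One-way delay budget given per-hop delays and the set of hop
    indices where a single rerequest may be needed."""
    return sum(hops_ms) + sum(2 * hops_ms[i] for i in lossy_hops)

# Direct: Alice ---- 80ms ---- Bob, rerequest possible on the single hop
direct = 2 * one_way_with_rerequest([80], {0})
print(direct)        # 480 -- totally unacceptable

# One mid-ocean proxy, rerequest on one end only
one_proxy = 2 * one_way_with_rerequest([40, 40], {0})
print(one_proxy)     # 320 -- barely acceptable

# Proxies near each end, rerequests possible on both short access hops
two_proxies = 2 * one_way_with_rerequest([10, 60, 10], {0, 2})
print(two_proxies)   # 240 -- pretty good
```

Note how the penalty in the last case is proportional only to the short access hops, which is why it stays constant as the long middle hop grows.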

Never Make Counter-Offers

business 4 December 2011 | 71 Comments

At many companies, the way you get a raise is to quit. As a matter of policy. I am not exaggerating.

The way it works is this: Management figures they’ll save money on salaries by leaving it up to the employees to negotiate for their own pay. So they don’t give raises until someone tries to negotiate for one. Naturally, anyone asking for a raise is viewed as having no negotiating stance unless they have a credible claim to quitting, so raises are only given as counter-offers, generally matching or slightly beating the offer the employee has elsewhere. Management figures as long as they always match what someone is offered elsewhere, the employee will always prefer to stay, because that’s easier to do, and salaries are kept to the absolute minimum they can be with no real risk.

This is completely insane.

Think about what this does to employees. The most devoted, upstanding employees are the least paid, and the most conniving, disinterested ones are paid the most. Sooner or later the lower paid employees are either going to get the feeling that all the necessary secrecy around salaries means they’re getting screwed (because, um, they are) or find out that someone got a raise by quitting, and go about doing it themselves. Maybe in the process they’ll find a job which they realize they actually like better. They have no reason to feel appreciated in their current job, after all. And there’s a good chance they won’t even listen to the counter-offer, much less take it. Soon the whole workplace is completely toxic, with everybody either underpaid, having one foot out the door, or being one of the assholes who views periodically finding a job offer you don’t plan to accept just to get a raise as normal and reasonable.

You should never make counter-offers. Ever. If an employee tells you they have a job offer, tell them that if they take it then you’ll wish them well, and stick to it. Don’t send the message to the rest of the employees that you’ll reward them for shopping themselves around, and don’t hang on to people who don’t want to work for you.

So how should you retain employees? Have clear and consistent salary guidelines, and regularly give raises to people who are outperforming their pay level. Don’t be a tight-fisted, short-sighted moron.