The recent trend in proofs of work has been to use password hashing schemes. I gave a construction for these in my last post which I’m fond of, but there’s a problem: password hashing schemes aren’t quite a perfect match for proofs of work. The fundamental issue is that while password hashing schemes succeed in requiring both CPU and memory to compute, they unfortunately also require both CPU and memory to verify, unlike simple partial collisions of hash functions, which require essentially no memory or CPU to verify. In the case of password hashing this cost of verification is considered a feature. In the case of proofs of work it most definitely is not. The problem compounds: because there’s a limit to how long the proofs can take to verify, there’s a limit to how much memory they can require as well.
The question becomes: is there something which takes lots of memory to do, but can be quickly verified? Here’s my crazy idea: use collisions. Instead of a single value which is hashed, the result of the work is two values which are hashed, their results are xored together, and that xor should end in an appropriate number of zeros to demonstrate an appropriate amount of work. (A good argument can be made for using sum rather than xor as the combination function. I’d like to hear people’s thoughts on this, but for the sake of convenience this post will assume it’s xor.)
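To make the idea concrete, here’s a minimal sketch in Python. SHA-256 is a stand-in hash, not a prescribed choice, and the naive search loop is only there to show the goal — a real prover would sort, as discussed below. Note that two hashes agreeing on their low bits is exactly the same thing as their xor ending in that many zeros.

```python
import hashlib
import itertools

def trailing_zero_bits(n: int) -> int:
    """Count trailing zero bits of a nonnegative integer."""
    if n == 0:
        return 256  # an all-zero digest: treat as maximal
    return (n & -n).bit_length() - 1

def h(value: bytes) -> int:
    """Hash to an integer; SHA-256 is a stand-in, not a prescribed choice."""
    return int.from_bytes(hashlib.sha256(value).digest(), "big")

def verify(a: bytes, b: bytes, difficulty: int) -> bool:
    """Verification is two hashes, one xor, one count -- essentially free."""
    return a != b and trailing_zero_bits(h(a) ^ h(b)) >= difficulty

def find_collision(difficulty: int):
    """Naive search for two values whose xored hashes end in `difficulty`
    zero bits, by looking for a repeat among the hashes' low bits."""
    seen = {}  # maps (hash mod 2^difficulty) -> value
    for i in itertools.count():
        v = i.to_bytes(8, "big")
        low = h(v) & ((1 << difficulty) - 1)
        if low in seen:
            return seen[low], v
        seen[low] = v
```

The asymmetry is the whole point: `find_collision` does a lot of hashing and bookkeeping, while `verify` does two hashes and one xor.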
The fastest way to find duplicates in a list is to sort it, because sorting covers all n*n possible matches in only n*log(n) time. Trading memory for CPU by recalculating things in multiple places takes a massive hit from the asymptotics. Unfortunately, while sorting requires lots of memory, it doesn’t require unpredictable memory lookups, and I have no idea what could be done to require that. Also, a reasonable amount of fast-access memory and a few clever tricks can reduce it to only 2 or 3 passes through main memory instead of log(n). Still, the amount of memory used is basically unavoidable, and verification is essentially instantaneous.
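The sort-and-scan approach might look like the following sketch (again assuming SHA-256 as the hash): hash every candidate, sort by the low bits of the hash, and a single linear pass then finds every adjacent pair whose xored hashes end in the required number of zeros.

```python
import hashlib

def find_collisions_by_sorting(values, difficulty):
    """Hash each value, keep only the low `difficulty` bits as the sort
    key, sort, then scan adjacent entries for equal keys. Equal low bits
    means the xor of the full hashes ends in `difficulty` zeros."""
    mask = (1 << difficulty) - 1
    hashed = sorted(
        (int.from_bytes(hashlib.sha256(v).digest(), "big") & mask, v)
        for v in values
    )
    pairs = []
    for (ka, a), (kb, b) in zip(hashed, hashed[1:]):
        if ka == kb:
            pairs.append((a, b))
    return pairs
```

The memory cost is the sorted list itself: one entry per candidate hashed, which is exactly the property we want to force on the prover.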
This approach can suffer from working too well. There’s essentially no limit to how much memory can be used, which gives an advantage to anyone with the resources to set up a huge amount of storage to be sorted over. This can be corrected by putting a hard limit on the memory used based on structure. For example, instead of two completely arbitrary strings, a proof of work can consist of a base string plus two different extra strings of fixed length – 38 bits sounds reasonable – each of which is hashed in with the base string to produce the two values which are then xored together. Adding a bit to the extra strings will double the amount of memory/storage necessary to run the operation. Limiting the bits of difference will of course also reduce the sizes of the proofs of work.
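A sketch of this structured variant, using a toy width of 12 bits in place of the suggested 38 so it runs instantly (the base string and constants here are illustrative assumptions). Since the goal is the pair of hashes sharing the longest common *suffix*, sorting by the bit-reversed hash puts the best pair adjacent.

```python
import hashlib

K = 12  # bits per extra string; the post suggests 38, this toy uses 12

def h(base: bytes, extra: int) -> int:
    """Hash the base string together with one fixed-width extra string."""
    data = base + extra.to_bytes(2, "big")  # 2 bytes comfortably hold 12 bits
    return int.from_bytes(hashlib.sha256(data).digest(), "big")

def bit_reverse(n: int, bits: int = 256) -> int:
    """Reverse bit order, so sorting groups long common suffixes together."""
    return int(format(n, f"0{bits}b")[::-1], 2)

def trailing_zeros(n: int) -> int:
    return (n & -n).bit_length() - 1 if n else 256

def best_proof(base: bytes):
    """Hash all 2^K extras -- memory is capped by the structure itself --
    then sort by bit-reversed hash; the pair of hashes whose xor ends in
    the most zeros lands adjacent. Returns that pair and its zero count."""
    hashed = sorted((bit_reverse(h(base, e)), e) for e in range(1 << K))
    best, best_zeros = None, -1
    for (_, a), (_, b) in zip(hashed, hashed[1:]):
        zeros = trailing_zeros(h(base, a) ^ h(base, b))
        if zeros > best_zeros:
            best_zeros, best = zeros, (a, b)
    return best, best_zeros
```

With 2^K extras there are about 2^(2K) pairs, so the best proof from one pass through the table typically ends in roughly 2K zeros.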
In our case with 38 variable bits, the number of possible combinations of two is 2^38 * 2^38, or 2^76, so a typical run through would find a combination which ends in 76 zeros, and each additional zero beyond that results in on average twice as many run throughs having to be done.
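A quick sanity check of that arithmetic (counting ordered pairs, as the post does): each required zero bit halves the expected number of qualifying pairs per run through.

```python
# With k-bit extra strings there are 2^k choices for each of the two
# values, so 2^k * 2^k = 2^(2k) ordered pairs to draw from.
k = 38
pairs = (1 << k) * (1 << k)
assert pairs == 1 << 76

# Each required zero bit is hit by a random xor with probability 1/2, so
# demanding z zeros leaves 2^(76 - z) expected hits per run through.
def expected_hits(z: int) -> float:
    return pairs / (1 << z)

assert expected_hits(76) == 1.0   # one hit per typical run through
assert expected_hits(77) == 0.5   # one more zero: two run throughs on average
```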
To do this, every possible input to the hash function has to be tried separately, which can use a fair amount of CPU. One way to reduce CPU load is to switch from an xor of two values to an xor of four. In our previous example that means instead of 38 bits for each of two values it would be 19 bits for each of four values, each of which is hashed in with the base string, with the four hashes xored together. The fastest way to solve this one is to calculate the hashes of all 2^19 possible values, then make a list of the xors of every pair of them and sort it looking for duplicates. You might wonder why not go further and xor even more values together. That’s because other algorithms start applying once the number of things xored gets large, and going to four already reduces the CPU load so much that it doesn’t matter. You might also wonder why not parsimoniously xor three values instead of four. That’s because the savings in terms of size of proofs of work would be extremely small, and the 3SUM problem has interesting properties which make me not trust it.
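A sketch of the four-value variant, shrunk to 8 bits per extra string so it runs quickly (the widths, difficulty, and base string are illustrative assumptions). The CPU saving is that only 2^J hashes are ever computed; the memory cost comes from sorting the much larger list of pair xors, and two pairs agreeing on their low bits yields four values whose hashes xor to end in that many zeros.

```python
import hashlib
from itertools import combinations

J = 8            # bits per extra string; the post suggests 19, this toy uses 8
DIFFICULTY = 20  # required trailing zero bits in the four-way xor

def h(base: bytes, extra: int) -> int:
    """Hash the base string with one extra string (1 byte holds J=8 bits)."""
    data = base + extra.to_bytes(1, "big")
    return int.from_bytes(hashlib.sha256(data).digest(), "big")

def find_four(base: bytes):
    """Hash all 2^J extras once (the CPU saving), then sort the xors of
    all pairs and scan for two disjoint pairs agreeing on the low
    DIFFICULTY bits -- their four hashes xor to end in that many zeros."""
    mask = (1 << DIFFICULTY) - 1
    hashes = [h(base, e) for e in range(1 << J)]
    pair_xors = sorted(
        ((hashes[a] ^ hashes[b]) & mask, a, b)
        for a, b in combinations(range(1 << J), 2)
    )
    for (x1, a, b), (x2, c, d) in zip(pair_xors, pair_xors[1:]):
        if x1 == x2 and {a, b}.isdisjoint({c, d}):
            return a, b, c, d
    return None
```

Here 2^8 hashes produce about 2^15 pair xors, mirroring how 2^19 hashes would produce the full 2^38-entry list in the real parameters.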
How well this whole system works is of course an empirical question, and one I’m not sure of. It may be that it can be effectively accelerated in hardware, for example by using a hardware shell sort which sorts correctly most, but not all, of the time. That would only speed up the last bunch of sorting though, and it may be that a regular CPU can do that so fast that it will be mostly waiting for memory (or disk!) anyway. It’s certainly the case that any hardware or implementation speedup of this would be much more interesting than the sorts of acceleration done for simple partial collisions of hashes, and it’s also possible that the acceleration would come from high-throughput memory or disk, which is at least a useful thing to subsidize production of. It’s even possible that this approach works well enough to be the holy grail of proofs of work: one which is downright green, because the CPU is mostly sitting around bored waiting for something else to finish fetching data from systems which are already well optimized to do that, and custom hardware can’t improve performance in any meaningful way. People’s thoughts on this, and even more importantly test results, would be greatly appreciated.