4 Comments
Christopher

One technical correction: the Attention paper didn't make everything sublinear; it just made it much more parallelizable and practical.

Bram Cohen

It got rid of the recurrence, which was the last thing that could cause gradient explosion. But fair point: it was sort of groped toward over time rather than happening in one fell swoop; the attention paper is just what finished the job.
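
To make the gradient point concrete (a rough numerical sketch with made-up weights, not anything from the paper): backprop through a recurrent net chains one Jacobian per timestep, so the gradient scales roughly like the spectral radius of the recurrent weights raised to the sequence length, which is what blows up or vanishes. Attention has no such per-timestep chain.

# hypothetical illustration: chaining T Jacobians of a linear RNN
import numpy as np

rng = np.random.default_rng(0)
T, d = 100, 8
W = rng.normal(scale=1.2 / np.sqrt(d), size=(d, d))  # made-up recurrent weights

grad = np.eye(d)
for _ in range(T):
    grad = W.T @ grad  # one Jacobian factor per timestep

print(np.linalg.norm(grad))  # grows roughly like (spectral radius of W)**T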

George Coss

Also, training humans is slower, assuming you have the training data.

California Pirate Party

After twenty years I can finally follow what you say without looking things up.