In my last post (which this post is a superior rehashing of, after thinking things over more) I talked about ‘chaos’, which seemed to leave some people confused as to what that meant. Despite being a buzzword thrown around a lot in pop science, chaos is a real mathematical term with a very pedestrian definition: sensitive dependence on initial conditions. It’s a simultaneously banal and profound observation that the neural networks we have today critically depend on not having sensitive dependence on initial conditions in order for backpropagation to work properly.
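As a toy illustration of that definition (not something from the original post), here is the logistic map, a textbook chaotic system: two starting points that differ by one part in a billion end up nowhere near each other after a few dozen iterations.

```python
# Minimal illustration of sensitive dependence on initial conditions:
# the logistic map at r=4 is chaotic, so two trajectories that start
# a tiny distance apart diverge after a few dozen iterations.

def logistic(x, r=4.0):
    return r * x * (1.0 - x)

a, b = 0.3, 0.3 + 1e-9  # nearly identical starting points
for step in range(60):
    a, b = logistic(a), logistic(b)
    if step % 10 == 9:
        print(f"step {step + 1}: |a - b| = {abs(a - b):.3e}")
```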
It makes sense to refer to these as ‘sublinear’ functions, a subset of all nonlinear functions. It feels like the details of how sublinear functions are trained don’t really matter all that much: more data, more training, and bigger models will get you better results, but they still run into some inherent limitations. To get out of their known weaknesses you have to somehow include superlinear functions, and apply a number of them stacked deep to get the potential for chaotic behavior. LLMs happen to need to throw in a superlinear function because picking out one word among many possibilities is inherently superlinear. To maximize an LLM’s performance (or at least its superlinearity) you should make it output a buffer of as many relevant words as possible in between the question and where it gives an answer, to give it a chance to ‘think out loud’. Instead of asking it to simply give an answer, ask it to give several different answers, then make it give arguments for and against each of those, then give rebuttals to those arguments, then write several new answers taking all of that into account, repeat the exercise of arguments for and against with rebuttals, and finally pick out which of its answers is best, as in the sketch below. This is very much in line with the already known practical ways of getting better answers out of LLMs and is likely to work well. It also seems like a very human process, which raises the question of whether the human brain also consists of a lot of sublinear regions with superlinear controllers. We have no idea.
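To make the procedure concrete, here is a rough sketch of that loop. The `generate` callable is a hypothetical stand-in for whatever LLM call you actually use, and the prompts and number of rounds are arbitrary choices, not a tested recipe.

```python
def deliberate(question, generate, rounds=2):
    """Build up a long 'thinking out loud' buffer before the final answer.

    `generate` is any callable that takes a prompt string and returns text
    (a hypothetical stand-in for your actual LLM call).
    """
    transcript = f"Question: {question}\n"
    transcript += generate(transcript + "List several different candidate answers.\n")
    for _ in range(rounds):
        transcript += generate(transcript + "Give arguments for and against each candidate.\n")
        transcript += generate(transcript + "Give rebuttals to those arguments.\n")
        transcript += generate(transcript + "Write several new candidate answers taking all of the above into account.\n")
    return generate(transcript + "Pick which of the candidates is best and state it as the final answer.\n")
```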
What got me digging into the workings of LLMs was that I got wind that they use dot products in one place and wondered whether the spatial clustering I’ve been working on could be applied. It turns out it can’t, because it requires gradient descent, and gradient descent, on top of being expensive, is also extremely chaotic. But there is a very simple sublinear thing which can be tried: apply ReLU/GELU to the key and query vectors (or maybe just one of them; a few experiments can be done) before taking their dot product, as sketched below. You might call this the ‘pay attention to the man behind the curtain’ heuristic, because it works with the intuition that there can be reasons why you should pay special attention to something but not many reasons why you shouldn’t.
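Here is roughly what that tweak looks like in PyTorch: ordinary scaled dot-product attention with the activation applied to the queries and keys first. This is a sketch of the idea rather than a drop-in patch for any particular implementation; the function name and tensor shapes are my own illustration.

```python
import math
import torch
import torch.nn.functional as F

def rectified_attention(q, k, v, activation=F.relu):
    # q, k, v: (batch, heads, seq, dim)
    q, k = activation(q), activation(k)                    # the proposed change
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    return F.softmax(scores, dim=-1) @ v

# Toy usage with random tensors; try activation=F.gelu as the other variant,
# or apply the activation to only one of q and k.
q = torch.randn(1, 4, 16, 32)
k = torch.randn(1, 4, 16, 32)
v = torch.randn(1, 4, 16, 32)
out = rectified_attention(q, k, v)
```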
For image generation the main thing you need is some kind of superlinear function applied before each iteration of using a neural network to make the image better. With RGB values expressed as being between 0 and 1, it appears that the best function is to square everything. The reasoning here is that you want the second derivative to be as large as possible everywhere, and evenly spread out, while keeping the function monotonic and within the defined range. The math on that yields two quadratics: x^2 and its cousin -x^2+2x, which is the mirror image 1-(1-x)^2. There are a few reasons why this logical conclusion sounds insane. First, there are two functions for no apparent reason. Maybe it makes sense to alternate between them? Less silly is that it’s a weird bit of magic pixie dust, but then adding random noise is also magic pixie dust and seems completely legit. It also does something cognitively significant, but then it’s common for humans to make a faint version of an image and draw over it, and this seems very reminiscent of that. The point of making the image faint is to be information losing, and simply multiplying values (a sublinear operation) isn’t information losing, while squaring is, because if you do it enough times everything gets rounded down to zero.
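A minimal sketch of how the two quadratics might be alternated between refinement passes, assuming pixel values in [0, 1]; the `refine` argument is a hypothetical placeholder for whatever network improves the image, not a specific model.

```python
import numpy as np

def sharpen_low(x):
    # x^2: monotonic on [0, 1], pushes values toward 0
    return x ** 2

def sharpen_high(x):
    # -x^2 + 2x = 1 - (1 - x)^2: the mirror image, pushes values toward 1
    return -x ** 2 + 2 * x

def refine_loop(image, refine, steps=4):
    """Alternate the two quadratics between refinement passes,
    keeping values in [0, 1] throughout."""
    for i in range(steps):
        squash = sharpen_low if i % 2 == 0 else sharpen_high
        image = np.clip(refine(squash(image)), 0.0, 1.0)
    return image
```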
Frustratingly, image classification isn’t iterated and so doesn’t have an obvious place to throw in superlinear functions. Maybe it could be based on having a witness to a particular classification and having that be iteratively improved. Intuitively a witness traces over the part of the image which justifies the classification, sort of like circling the picture of Waldo. But image classifiers don’t produce witnesses, and it isn’t obvious how to make them do that, so a new idea is needed.