I’ve been looking into the inner workings of neural networks and have some thoughts about them. First and foremost, the technique of back propagation working at all is truly miraculous. This isn’t an accident of course: the functions used are painstakingly picked out so that this amazing back propagation can work. This puts a limitation on them: they have to be non-chaotic. It appears that non-chaotic functions as a group are something of a plateau, sort of like how linear functions are a plateau, but with a much harder to characterize set of capabilities and weaknesses. One of those weaknesses is that they’re inherently very easy to attack using white box techniques, and the obvious defenses against those attacks, very much including the ones I’ve proposed before, are unlikely to work. Harumph.
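To make the white box point concrete, here’s a toy sketch of the kind of gradient-sign attack that smoothness makes easy. The one-unit “network” and all the numbers are invented for illustration; the point is only that a smooth, non-chaotic model hands the attacker usable gradients with respect to its input.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a trained network: one logistic unit. W, b, x and the
# label y are all invented for illustration.
W = rng.normal(size=8)
b = 0.1
x = rng.normal(size=8)
y = 1.0

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(x):
    p = sigmoid(W @ x + b)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# Gradient of the loss with respect to the *input*, which a smooth,
# non-chaotic model happily hands over. For this model it's closed form.
grad_x = (sigmoid(W @ x + b) - y) * W

# Fast-gradient-sign style step: a small perturbation aimed along the gradient.
eps = 0.1
x_adv = x + eps * np.sign(grad_x)

print(loss(x), loss(x_adv))  # the loss on the perturbed input should come out higher
```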
To a first approximation the way to get deep neural networks to perform better is to fully embrace their non-chaotic nature. The most striking example of this is in LLMs, whose big advance was to dispense with recursive state and just use attention. The problem with recursiveness isn’t that it’s less capable. It’s trivially more general, so at first everyone naively assumed it was better. The problem is that recursiveness leads to exponentialness, which leads to chaos and to back propagation not working. This is a deep and insidious limitation, and trying to attack it head on tends to simply fail.
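Here’s a tiny numerical illustration of what I mean (the maps and the depth are arbitrary, just there to show the shape of the problem): by the chain rule, the gradient through n recursive applications of a step is a product of n per-step slopes, so it either withers away or blows up exponentially, and once it blows up it tells you nothing useful.

```python
import numpy as np

def iterated_gradient(f, df, x, n):
    # Chain rule for the n-fold composition f(f(...f(x))): the gradient is
    # the product of the per-step slopes along the trajectory.
    g = 1.0
    for _ in range(n):
        g *= df(x)
        x = f(x)
    return g

# A well-behaved saturating step: the product shrinks to essentially nothing.
print(iterated_gradient(lambda x: np.tanh(0.9 * x),
                        lambda x: 0.9 * (1 - np.tanh(0.9 * x) ** 2), 0.3, 100))

# A step that's allowed to be steep (the logistic map in its chaotic regime):
# the product is typically enormous and its sign is effectively random.
print(iterated_gradient(lambda x: 3.7 * x * (1 - x),
                        lambda x: 3.7 * (1 - 2 * x), 0.3, 100))
```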
At this point you’re probably expecting me to give one weird trick which fixes this problem, and I will, but be forewarned that this just barely gets outside of non-chaos. It isn’t about to lead to AGI or anything.
The trick is to apply the non-chaotic function iteratively with some kind of potentially chaos-inducing modification step thrown in between. Given how often chaos happens normally, this is a low bar. The functions within deep neural networks are painstakingly chosen so that their second derivative is working to keep their first derivative under control at all times. All the chaos-inducing functions have to do is let their second derivative’s freak flag fly.
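As a sketch of what I mean (the layer size, the tanh step, and the particular cubic are all stand-ins picked for illustration, not a recipe): the well-behaved network step gets applied repeatedly, and a simple monotonic cubic gets interleaved between applications to do the chaos-inducing.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for the painstakingly well-behaved network step: a small tanh
# layer, whose second derivative keeps its slope under control.
W = rng.normal(scale=0.3, size=(16, 16))

def smooth_step(h):
    return np.tanh(W @ h)

# The chaos-inducing interleave: a monotonic cubic whose second derivative
# is free to let its freak flag fly. The coefficient is arbitrary.
def cubic_kick(h):
    return h + 0.5 * h ** 3

h = rng.normal(size=16)
for _ in range(8):
    h = smooth_step(h)  # back propagation is perfectly happy inside this step
    h = cubic_kick(h)   # the seeds of chaos get sown in between
```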
LLMs do this by accident because they pick a word at a time, and the act of committing to a next word is inherently chaotic. But they have a limitation that their chaoticism only comes out a little bit at a time, so they have to think out loud to get anywhere. LLM performance may be improved by letting the model run and, once in a while, interjecting that now is the time to put together a summary of all the points and themes currently in play and to give the points and themes it intends to use in the upcoming section before it continues. Then at the end elide the notes. In addition to letting it think out loud, this also hacks around context window problems, because information from earlier can get carried forward in the summaries. This is very much in the vein of standard issue LLM hackery and has a fairly high chance of working. It may also be useful writing advice for humans, whose brains happen to be made out of neural networks.
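A rough sketch of that loop, with llm_generate standing in for whatever completion call you’re actually using; the prompts and the section count are mine, not a tested recipe.

```python
def llm_generate(prompt: str) -> str:
    # Placeholder for an actual model call.
    raise NotImplementedError("plug in your LLM API here")

def write_piece(task: str, sections: int = 5) -> str:
    summary = ""
    drafts = []
    for _ in range(sections):
        prompt = (
            f"Task: {task}\n"
            f"Summary of points and themes currently in play: {summary or '(none yet)'}\n"
            "First list the points and themes you intend to use in the upcoming "
            "section, then write the section."
        )
        drafts.append(llm_generate(prompt))
        # Carry information forward past the context window in the summary.
        summary = llm_generate(
            "Put together a summary of all the points and themes currently "
            "in play:\n\n" + "\n\n".join(drafts)
        )
    # In the end the notes get elided; in practice you'd also strip the
    # planning lists out of each draft before stitching the sections together.
    return "\n\n".join(drafts)
```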
Applying the same approach to image generation requires repeatedly iterating on an image, improving it with each stage. Diffusion sort of works this way, although it works off the intuition that further details are getting filled in each time. This analysis seems to indicate that the real benefit is that making a pixellated image is doing something chaotic, on the same order of crudeness as forcing an LLM to commit to a next word. Instead it may be better to make each step work on a detailed image and apply something chaos-inducing in between. It may be that adding Gaussian noise works, but as ridiculous as it sounds, in principle doing color enhancement using a cubic function should work far better. I have no idea whether this idea actually works. It sounds simultaneously on very sound mathematical footing and completely insane.
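For what it’s worth, here’s what that cubic color enhancement might look like as the thing interleaved between refinement steps. Here refine is a stand-in for whatever learned generator step is in play, and the particular curve is just one monotonic cubic among many.

```python
import numpy as np

def cubic_color_enhance(img: np.ndarray) -> np.ndarray:
    # Channel values assumed in [0, 1]. The curve is monotonic, so nothing
    # gets clobbered, but its second derivative is nonzero, which is the point.
    x = 2.0 * img - 1.0          # map [0, 1] onto [-1, 1]
    x = (x + x ** 3) / 2.0       # monotonic cubic, stays in [-1, 1]
    return (x + 1.0) / 2.0       # map back to [0, 1]

def iterate_image(refine, img: np.ndarray, steps: int = 10) -> np.ndarray:
    # `refine` is whatever smooth, learned per-step refinement you have;
    # the only new ingredient is the chaos-inducing curve in between.
    for _ in range(steps):
        img = refine(img)
        img = cubic_color_enhance(img)
    return img
```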
Annoyingly I don’t see a way of doing image classification as an iterative process with something chaos-inducing in between steps. Maybe there’s another silly trick there which would be able to make the white box attacks not work so well.
Side note: It seems like there should be a better term for a function which is ‘not non-chaotic’. They don’t have to be at all chaotic themselves, just contain the seeds of chaos. Even quadratic functions fit the bill, although cubic ones are a bit easier to throw in because they can be monotonic.
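For concreteness (my notation, nothing standard): a cubic can keep a live second derivative while staying monotonic, whereas a quadratic’s slope has to change sign somewhere.

```latex
f(x) = x + \alpha x^{3},\ \alpha > 0:\quad
  f'(x) = 1 + 3\alpha x^{2} > 0 \text{ (monotonic everywhere)},\quad
  f''(x) = 6\alpha x \neq 0 \text{ for } x \neq 0.

g(x) = x + \beta x^{2},\ \beta \neq 0:\quad
  g'(x) = 1 + 2\beta x \text{ changes sign at } x = -\tfrac{1}{2\beta},
  \text{ so } g \text{ is not monotonic on all of } \mathbb{R}.
```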
I assume you mean strictly chaotic, vs probabilistic?
Some of the new photonic processors accelerate AI inferencing by being probabilistic, but as you point out, you can't train on this class of processors. Possibly that's specifically because, with the current 'back-propagation' algorithms, you lose too much fine detail for gradients to propagate far enough back into the network.
Being able to do something similar to 'back-propagation' but in a probabilistic way would be huge, as it would open up photonics even further in the AI space.
I also think we are not using genetic algorithms nearly enough to solve this problem space.