Using Toroidal Space In Neural Networks
The higher dimensional geometry in this one isn't too bad
Let’s say that you’re making a deep neural network and want to use toroidal space. For those who don’t know, toroidal space for a given number of dimensions has one value in each dimension between zero and one which ‘wraps around’: when a value goes above one you subtract one from it, and when it goes below zero you add one to it. The distance formula in toroidal space is similar to what it is in open-ended space, but instead of the distance in each dimension being a-b it’s that value wrapped around to a value between -1/2 and 1/2, so for example 0.25 stays where it is but 0.75 changes to -0.25 and -0.7 changes to 0.3.
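To make that concrete, here is a small NumPy sketch of the wrap rule and the wrapped distance. The function names are my own choices, not anything standard:

```python
import numpy as np

def wrap(x):
    """Wrap coordinates into [0, 1), the 'wraps around' rule."""
    return np.asarray(x, float) % 1.0

def toroidal_delta(a, b):
    """Per-dimension difference a - b, wrapped into [-1/2, 1/2)."""
    d = (np.asarray(a, float) - np.asarray(b, float)) % 1.0  # now in [0, 1)
    return np.where(d >= 0.5, d - 1.0, d)                    # now in [-1/2, 1/2)

def toroidal_distance(a, b):
    """Euclidean distance on the torus, using the wrapped per-dimension deltas."""
    return np.linalg.norm(toroidal_delta(a, b))

# The examples from the text: 0.25 stays put, 0.75 wraps to -0.25, -0.7 wraps to 0.3.
assert np.allclose(toroidal_delta([0.25, 0.75, -0.7], [0.0, 0.0, 0.0]),
                   [0.25, -0.25, 0.3])
```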
Why would you want to do this? Well, it’s because a variant on toroidal space is probably much better at fitting data than conventional space is for the same number of dimensions. I’ll explain the details of that in a later post[1] but it’s similar enough that the techniques for using it in a neural network are the same. So I’m going to explain in this post how to use toroidal space, even though it’s probably comparable to, or only slightly better than, the conventional approach.
To move from conventional space to an esoteric one you need to define how positions in that space are represented and make analogues of the common operations. Specifically, you need to find analogues for dot product and matrix multiplication and define how back propagation is done across those.
Before we go there it’s necessary to get an intuitive notion of what a vector is and what dot product and matrix multiplication are doing. A vector consists of two things: a direction and a magnitude. A dot product is the cosine of the angle between two vectors times the product of their magnitudes. The angle in this case is acting as a type of distance. You might wonder what the intuitive justification for including the magnitudes is. There isn’t one; you’re better off normalizing them away, known in AI as cosine similarity. I’ll just pretend that that’s how it’s always done.
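For reference, this is the normalized dot product the rest of the post treats as the default. A minimal sketch, with a name of my choosing:

```python
import numpy as np

def cosine_similarity(u, v):
    """Dot product with both magnitudes normalized away, leaving only
    the cosine of the angle between the two vectors."""
    u, v = np.asarray(u, float), np.asarray(v, float)
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
```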
When a vector is multiplied by a matrix, that vector isn’t being treated as a position in space; it’s a list of scalars. Each scalar is assigned to one of the vectors in the matrix, which has a direction and a magnitude. That direction gets a weight equal to the scalar times the magnitude, and a weighted average of all the directions is then taken.
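Spelled out in code, that reading of matrix multiplication looks like the NumPy sketch below. It takes the weighted combination of the columns, which is exactly M @ x; since we’re pretending everything is magnitude-normalized anyway, the difference between a weighted sum and a weighted average is just an overall scale:

```python
import numpy as np

def matvec_as_weighted_sum(M, x):
    """Treat x as a list of scalars, one per column of M; each column is a
    direction scaled by a magnitude, and the output is their weighted combination."""
    M, x = np.asarray(M, float), np.asarray(x, float)
    return sum(x_i * col for x_i, col in zip(x, M.T))  # same result as M @ x
```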
The analogue of (normalized) dot product in toroidal space is simply distance. Back propagating over it works how you would expect. There’s a bit of funny business with the possibility of back propagation causing values to snap over the 1/2 threshold, but the amount of movement is small enough that that’s unusual, and AI is fundamentally handwavy enough that ignoring things like that doesn’t change the theory much.
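To make the “back propagating over it works how you would expect” part concrete, here is a sketch using PyTorch (my choice of framework, not something from the post). Autograd differentiates straight through the wrap; the only kink is exactly at the +/- 1/2 boundary, which a small gradient step almost never lands on:

```python
import torch

def toroidal_distance_torch(a, b):
    """Toroidal distance with autograd handling the back propagation."""
    d = torch.remainder(a - b, 1.0)           # wrap the difference into [0, 1)
    d = torch.where(d >= 0.5, d - 1.0, d)     # then into [-1/2, 1/2)
    return torch.linalg.vector_norm(d, dim=-1)

a = torch.rand(8, requires_grad=True)
b = torch.rand(8)
toroidal_distance_torch(a, b).backward()      # gradients flow into a as usual
```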
The analogue of a matrix in toroidal space is a list of positions and weights. (Unlike conventional space, toroidal space has a type distinction between ‘position’ and ‘position plus weight’, whereas in conventional space everything is just ‘direction and magnitude’.) To ‘multiply’ a vector by this ‘matrix’ you take a weighted average of all the positions, with each weight being the corresponding scalar times the given weight. At least, that’s what you would like to do. The problem is that due to the wrap-around nature of the space it isn’t clear which image of each position should be used.
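Here is what the naive version of that ‘multiplication’ looks like before dealing with images, just to pin down the types involved. Again my own sketch, reusing wrap() from the first snippet:

```python
def naive_toroidal_matvec(scalars, positions, weights):
    """A toroidal 'matrix' is a list of positions plus a weight for each.
    Naively, 'multiplying' averages the positions with weights equal to
    scalar * weight; this ignores which image of each position to use."""
    w = np.asarray(scalars, float) * np.asarray(weights, float)
    return wrap(np.average(np.asarray(positions, float), axis=0, weights=w))
```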
To get an intuition for what to do about the multiple images problem, let’s consider the case of only two points. In that case we can find the shortest path between them and simply declare that the weighted average lies along that line segment. If some dimension is close to the 1/2 flip-over then either image will at least do something reasonable for the other dimensions, and there isn’t much signal in that dimension anyway, so somewhat noisily picking one or the other is fine.
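In code, the two-point case looks like this (my sketch, reusing the helpers from the first snippet):

```python
def toroidal_average_two(p, q, w_p, w_q):
    """Weighted average of two toroidal points, taken along the shortest
    path between them rather than the naive straight line in [0, 1)."""
    delta = toroidal_delta(q, p)                 # shortest per-dimension offset from p to q
    return wrap(np.asarray(p, float) + delta * (w_q / (w_p + w_q)))
```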
This approach can be generalized to larger numbers of points as follows: First, pick an arbitrary point in space. We’ll think of this as a rough approximation of the eventual answer. Since it’s literally a random point it’s a bad approximation, but we’re going to improve it. Find the image of each point which is closest to the current approximation, and use those images when taking the weighted average. That yields a new approximate answer. Then repeat. In practical circumstances this most likely settles down after only a handful of iterations, and if it doesn’t there probably isn’t much improvement happening with each iteration anyway. There’s an interesting mathematical question as to whether this process must always hit a unique fixed point. I honestly don’t know the answer to that question. If you know the answer please let me know.
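Here is the successive approximation written out in NumPy. This is my sketch of the procedure described above; in the ‘matrix multiply’ setting the weights passed in would be the scalar-times-given-weight products, and the helpers come from the earlier snippets:

```python
def toroidal_weighted_average(points, weights, n_iters=10, rng=None):
    """Iteratively refined weighted average on the torus.

    Start from a random guess, then repeatedly: take the image of each point
    nearest the current guess, average those images with the given weights,
    and wrap the result back into [0, 1).
    """
    points = np.asarray(points, float)
    weights = np.asarray(weights, float)
    rng = np.random.default_rng() if rng is None else rng
    guess = rng.random(points.shape[1])                  # literally a random starting point
    for _ in range(n_iters):
        images = guess + toroidal_delta(points, guess)   # nearest image of each point
        new_guess = wrap(np.average(images, axis=0, weights=weights))
        if np.allclose(new_guess, guess):                # settled down
            return new_guess
        guess = new_guess
    return guess
```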
The way to back propagate over this operation is to assume that the answer you settled on via the successive approximation process is the ‘right’ one and look at how it marginally moves as the coefficients change. As with calculating simple distance, the snap-over effects are rarely hit by the small changes involved in individual back propagation adjustments, and the propagation doesn’t have to be perfect; it just has to produce improvement on average.
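A sketch of that in PyTorch (my own construction, not the author’s code): run the successive approximation with gradients turned off, freeze the image choices it settled on, then recompute the weighted average so that autograd only ever differentiates a plain weighted sum of positions and weights:

```python
import torch

def toroidal_matvec(scalars, positions, weights, n_iters=10):
    """'Multiply' scalars by a toroidal 'matrix' (positions plus weights),
    back propagating as if the settled image choices were fixed."""
    w = scalars * weights                                # effective weight per position
    with torch.no_grad():
        guess = torch.rand(positions.shape[1])
        for _ in range(n_iters):
            delta = torch.remainder(positions - guess, 1.0)
            delta = torch.where(delta >= 0.5, delta - 1.0, delta)
            guess = torch.remainder(guess + (w[:, None] * delta).sum(0) / w.sum(), 1.0)
        # Integer shifts that move each position to its image nearest the settled guess.
        delta = torch.remainder(positions - guess, 1.0)
        delta = torch.where(delta >= 0.5, delta - 1.0, delta)
        shifts = (guess + delta) - positions
    images = positions + shifts                          # gradients flow through positions
    avg = (w[:, None] * images).sum(0) / w.sum()         # and through scalars and weights via w
    return torch.remainder(avg, 1.0)

scalars = torch.randn(5, requires_grad=True)
positions = torch.rand(5, 8, requires_grad=True)
weights = torch.rand(5, requires_grad=True)
toroidal_matvec(scalars, positions, weights).sum().backward()
```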
[1] It involves adding ‘ghost images’ to each point which aren’t just at the wrap-around values but also correspond to other positions in a Barnes-Wall lattice, which is a way of packing spheres densely. Usually ‘the Barnes-Wall lattice’ refers specifically to 16 dimensions, but the construction generalizes straightforwardly to any power of 2.