(This post jumps directly into the heavy math. I’m just trying to get this down. You can follow the links for background.)
In this post an interesting math problem comes up: Can you make N^2 lists of N things out of a list of N^2 object such that each pair of objects occurs together in exactly one list? That post gives a construction when N is a prime number but for cleanliness purposes you’d really like to support N being a power of 2. From an engineering standpoint it isn’t clear how useful this is - primes are much less sparse than powers of 2 are - but it’s an interesting problem in its own right and I figured out how to do it and it’s a bit cleaner from a circuit complexity standpoint so I’m going to give the construction below. How GPU friendly either construction is is an interesting question. Shuffling objects is in some sense trivial to parallelize but has a lot of issues in real architectures.
As is usually the case when you’re trying to get off prime numbers to powers of two the trick is to use Galois fields using a handy polynomial. As before the inputs are arranged into a square and will be referred to as (P, Q) with 0 <= P < N and 0 <= Q < N. Each list will be specified by (A, B) with 0 <= A < N and 0 <= B < N, with one other special list of lists with 0 <= C < N.
(Quick summary of GF(2^N) operations: + is xor. Multiplying by x is multiplying by 2 and xoring by the reduction polynomial (the integer if 2 is put in for x) if the high bit has overrun. The most natural generator to start with is x, represented as 1.)
The list for a given (A, B) contain (0, A) and (D, A + x^(B+D)) for 1 <= D < N. The case of B = 0 is special and contains (D, A) for 0 <= D < N. The C lists are of the form (C, D) for 0 <= D < N.
This construction meets the requirement that every pair of elements occurs together in exactly one list. It has the quirk that rows and columns are handled a specially. The use of these in the motivating construction is that there’s a ‘second layer’ where all the lists which contain a particular element in the original set are combined to make a final output which includes that one element N+1 times but everything else exactly once. It may be that the specialness of the rows and columns looks less weird in the mapping to the second layer. It may even be that things can be arranged so the second layer is the exact same structure as the first layer. I haven’t worked out the details.