Well, I’m not one to MoNE1 even when a research paper is an incremental nudge in the direction of improvement. Mixture of Nested Experts is conceptually quite interesting. By using a modified form of expert choice routing, groups of neurons (experts) choose which tokens will be involved in computation in both the self-attention and feedforward layers. In practice, performance should be roughly on par with an equivalently sized network without any form of sparsity. This paper builds upon a body of research from Google, namely Mixture-of-Experts with Expert Choice Routing2, Mixture-of-Depths3, and MatFormer4. We will see that combining these ideas leads to significantly better performance than either MatFormer or Mixture-of-Depths.
When I say neurons in this case, I mean fractions of the embedding dimension used to project our hidden states. This means that for a given model we can activate a chosen fraction of the parameters in each layer (with four nested experts, for example, the fractions might be 1/8, 1/4, 1/2, and the full width). From a resource perspective, each expert group selects how many tokens it will process, and in aggregate this should reduce compute in proportion to how many tokens are allocated to each fraction of the parameters. I need to note that this is not fully dynamic: the parameter groups are fixed before training begins, in a way similar to MatFormer4.
Methodology
The models evaluated in this paper are limited to the vision modalities of images and video. This was probably done to stay consistent with MatFormer and to avoid the additional complexity of causal inference (which Mixture-of-Depths3 had to address).
In terms of structure, the MoNE model only modifies the self-attention blocks, preserving the overall structure of a standard ViT. This is nice since it makes comparison with the baseline ViT easier.
In the figure above, taken from the MoNE paper, we can see the fractional components that make up part of MatFormer (a). Extending this idea by adding conditional computation via routing based on Expert Choice, tokens are now chosen by groups of experts via Expert Preferred Routing. This is essentially the same as Expert Choice, except that the top-k token selection is performed for each expert rather than distributing tokens evenly across all experts. Below I provide an example of Expert Choice routing; applying the modification just described turns it into Expert Preferred Routing.
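To make the routing concrete, here is a minimal sketch of vanilla Expert Choice routing in JAX. The function name and shapes are my own; real implementations such as Flaxformer’s also handle batching and build the combine array.

```python
import jax
import jax.numpy as jnp

def expert_choice_routing(router_logits, capacity):
    """Vanilla Expert Choice: every expert picks its top-`capacity` tokens.

    router_logits: [num_tokens, num_experts] scores from a learned router.
    Returns a 0/1 dispatch mask [num_experts, capacity, num_tokens] and the
    router probabilities of the selected tokens [num_experts, capacity].
    """
    num_tokens = router_logits.shape[0]
    probs = jax.nn.softmax(router_logits, axis=-1)       # token-to-expert affinities
    gates, token_idx = jax.lax.top_k(probs.T, capacity)  # each expert ranks all tokens
    dispatch = jax.nn.one_hot(token_idx, num_tokens)     # [E, capacity, num_tokens]
    return dispatch, gates
```

Expert Preferred Routing keeps this structure but gives each expert its own capacity, which is exactly what the optimization described next solves for.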
In terms of expert capacity we see something that diverges from previous methodologies: in MoNE1 the authors use Sequential Least Squares Programming (SLSQP) to find the capacity distribution. This is fortunately readily available in SciPy and can be implemented in Python quite easily (a sketch follows the table below). SLSQP is an iterative method for solving constrained nonlinear optimization problems and is SciPy’s default for constrained optimization problems.
The main idea in this paper is to apply routing to fractions of the weights (or, equivalently, to groups of neurons). The routing method, as described earlier, is Expert Preferred Routing. In an effort to experiment with Mixture-of-Depths3 (MoD) on my own, I wrote up code for both the routing and its use in the model, with optional MoE functionality. The routing in the case of MoD is Expert Choice, and with some simple modifications we can realize Expert Preferred Routing.
If we assume we have 4 experts, we arrive at the following optimal capacity distribution:

| Expert | Capacity |
| ------ | -------- |
| C₁     | 24.43%   |
| C₂     | 17.64%   |
| C₃     | 20.79%   |
| C₄     | 37.15%   |

given the parameters E = 4, e_c = 0.5, δ = 2, β = 10.
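To show how readily SLSQP handles this kind of problem, here is a sketch using SciPy. The objective below is a stand-in of my own (a δ-weighted preference for larger experts plus a β-weighted entropy term), and the compute-budget constraint assumes the nested widths halve from one expert to the next, so it will not reproduce the exact numbers in the table; it only illustrates the mechanics.

```python
import numpy as np
from scipy.optimize import minimize

E, e_c, delta, beta = 4, 0.5, 2.0, 10.0
# Fraction of the full model dimension used by each nested expert (assumed halving).
fracs = np.array([1 / 2 ** (E - i) for i in range(1, E + 1)])    # [1/8, 1/4, 1/2, 1]

def objective(c):
    preference = np.sum(c * delta ** np.arange(E))               # prefer larger experts
    entropy = -np.sum(c * np.log(np.clip(c, 1e-9, None)))        # spread tokens around
    return -(preference + beta * entropy)                        # SLSQP minimizes

constraints = [
    {"type": "eq", "fun": lambda c: np.sum(c) - 1.0},            # capacities sum to 1
    {"type": "ineq", "fun": lambda c: e_c - np.sum(c * fracs)},  # stay within the budget
]

result = minimize(objective, x0=np.full(E, 1.0 / E), method="SLSQP",
                  bounds=[(0.0, 1.0)] * E, constraints=constraints)
print(result.x)   # capacity assigned to each nested expert
```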
Now that we can calculate the optimal capacity for each expert, we can implement the routing mechanism. The authors provide the following figure showing how the routing mechanism works.
My implementation of the Preferred Routing mechanism is basically a modification of the Expert Choice routing mechanism. The only real difference is that the top-k token selection now happens for each expert with its own capacity, rather than distributing tokens evenly across all experts. To avoid creating a separate dispatch mask for every expert, we can set a single maximum capacity. The original Expert Choice implementation was borrowed from the Flaxformer library5.
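Concretely, the change to the Expert Choice sketch from earlier can look like the following; the per-expert capacities come from the SLSQP step, every expert is padded to the maximum capacity, and the unused slots are masked out. What to do when the same token is picked by more than one expert is glossed over here.

```python
import jax
import jax.numpy as jnp

def expert_preferred_routing(router_logits, capacities):
    """Expert Preferred Routing sketch: per-expert top-k with unequal capacities.

    router_logits: [num_tokens, num_experts]
    capacities:    per-expert token counts, e.g. the SLSQP fractions times num_tokens.
    Returns dispatch and combine tensors of shape [num_experts, max_capacity, num_tokens].
    """
    num_tokens = router_logits.shape[0]
    max_cap = int(max(capacities))
    probs = jax.nn.softmax(router_logits, axis=-1)
    gates, token_idx = jax.lax.top_k(probs.T, max_cap)                   # [E, max_cap]
    # One dispatch tensor padded to the maximum capacity; slots beyond an
    # expert's own capacity are simply zeroed out.
    valid = jnp.arange(max_cap)[None, :] < jnp.asarray(capacities)[:, None]
    dispatch = jax.nn.one_hot(token_idx, num_tokens) * valid[..., None]
    combine = dispatch * (gates * valid)[..., None]                      # gate-weighted scatter-back
    return dispatch, combine
```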
Now that we have the routing mechanism, we can implement the MoNE model. Unlike typical transformer blocks, which do not deal with dynamic routing of nested experts, we have to give special consideration to how we project our inputs in the attention and feedforward layers. We can use the dispatch mask creatively to break apart our input and aggregate it, ensuring that tokens are in the proper order prior to the attention and feedforward computation. The authors provide the following figure for the overall structure of the model.
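In practice, that creative use of the dispatch mask boils down to a pair of einsums: one gathers each expert’s selected tokens into a contiguous slice, and one scatters the gate-weighted outputs back into token order. A minimal sketch, assuming the mask shapes produced by the routing function above:

```python
import jax.numpy as jnp

# x:        [num_tokens, d_model]                  layer input
# dispatch: [num_experts, capacity, num_tokens]    0/1 token selection mask
# combine:  [num_experts, capacity, num_tokens]    gate-weighted version of dispatch

def dispatch_tokens(x, dispatch):
    # Gather each expert's tokens: [num_experts, capacity, d_model]
    return jnp.einsum("ect,td->ecd", dispatch, x)

def combine_tokens(expert_outputs, combine):
    # Weighted sum back into token order: [num_tokens, d_model]
    return jnp.einsum("ecd,ect->td", expert_outputs, combine)
```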
We can abstract the projection of nested experts by creating a custom linear layer that works with dispatched tokens. This allows us to re-use the same class multiple times within an attention block. It also keeps the code relatively modular and preserves the structure of a standard transformer block. We do have to keep in mind, however, that the tensor passed between functions changes shape in order to implement the nesting of experts. The custom linear layer looks roughly like the following sketch.
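Here I assume each nested expert reads only the leading fraction of a token’s features and projects them with a single shared full-width weight; the class name NestedLinear and its interface are mine, not the paper’s.

```python
import jax
import jax.numpy as jnp

class NestedLinear:
    """Shared linear projection whose nested experts see a leading slice of the input.

    Expert j only uses the first dims[j] input features, so the smaller experts
    are literally nested inside the full-width weight and no parameters are added.
    """

    def __init__(self, key, d_model, num_experts=4):
        self.w = jax.random.normal(key, (d_model, d_model)) / jnp.sqrt(d_model)
        self.b = jnp.zeros(d_model)
        # Nested widths, assumed to halve from one expert to the next, e.g. D/8 ... D.
        self.dims = [d_model // 2 ** (num_experts - 1 - j) for j in range(num_experts)]

    def __call__(self, x, expert):
        # x: [..., d_model] tokens dispatched to `expert`; features past
        # dims[expert] are treated as padding and zeroed before the matmul,
        # which is equivalent to using only the first dims[expert] rows of w.
        keep = (jnp.arange(x.shape[-1]) < self.dims[expert]).astype(x.dtype)
        return (x * keep) @ self.w + self.b
```

Calling the same instance with a different `expert` index is what lets the attention and feedforward code treat all experts uniformly, while each token’s projection only depends on its routed width.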
You can view my full implementation here to find the code for the attention and feedforward layers. For the sake of brevity I will only sketch the attention block below. As noted earlier, this code resembles the standard attention block of a transformer, with a router integrated to choose which tokens each expert processes. The attention and feedforward layers use linear layers to project a fraction of the hidden dimension of the selected tokens to the model’s embedding dimension. We have to be careful about the shapes of intermediate tensors, which may differ for each expert. For residual connections we pad tokens back to the embedding dimension and sum them appropriately, as discussed in the paper. All tokens are then combined using a weighted sum with the combine array.
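What follows is a heavily simplified, single-head, batchless sketch of that block. It folds the dispatch-mask bookkeeping into a per-token width mask (the same trick as the NestedLinear sketch above), assumes each token already carries its routed expert index and router gate, and omits multi-head reshaping, layer norm, and dropout.

```python
import jax
import jax.numpy as jnp

def mone_self_attention(x, expert_of_token, gate_of_token, Wq, Wk, Wv, Wo, dims):
    """Sketch of a MoNE-style self-attention block (single head, no batch dim).

    x:               [num_tokens, d_model] hidden states, zero-padded to full width
    expert_of_token: [num_tokens] index of the nested expert each token was routed to
    gate_of_token:   [num_tokens] router weight used in the final weighted combine
    Wq, Wk, Wv, Wo:  [d_model, d_model] shared full-width projection weights
    dims:            nested width per expert, e.g. (d_model // 8, ..., d_model)
    """
    num_tokens, d_model = x.shape
    width = jnp.asarray(dims)[expert_of_token]                      # [num_tokens]
    keep = (jnp.arange(d_model)[None, :] < width[:, None]).astype(x.dtype)

    def nested_proj(h, w):
        # Only the leading `width` features of each token enter the projection,
        # i.e. each token effectively uses the first `width` rows of the weight.
        return (h * keep) @ w

    q = nested_proj(x, Wq)
    k = nested_proj(x, Wk)
    v = nested_proj(x, Wv)

    attn = jax.nn.softmax(q @ k.T / jnp.sqrt(d_model), axis=-1)     # [T, T]
    out = nested_proj(attn @ v, Wo)                                  # nested output projection

    # Residual connection: x is already padded back to d_model, and the
    # expert output is combined with the router gate as its weight.
    return x + gate_of_token[:, None] * out
```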