Supermasks in Superposition is a project by Mitchell Wortsman$^*$, Vivek Ramanujan$^*$, Rosanne Liu, Ani Kembhavi, Mohammad Rastegari, Jason Yosinski, and Ali Farhadi.

Learning many different tasks sequentially remains a challenge for Artificial Neural Networks (ANNs). When learning a new task, networks tend to catastrophically forget information from previous tasks. In this post we discuss Supermasks in Superposition (SupSup), a flexible model capable of learning thousands of tasks without forgetting (and without access to task identity info).

SupSup leverages the expressive power of neural network connectivity. Zhou et al. demonstrate that even when the weights of an ANN are random and fixed, learning can still occur by modifying the network connectivity. Instead of learning the weights, or jointly learning the weights and connectivity, good performance can be achieved by choosing whether each edge in the network is on or off. In our previous work (video) we provide a fast algorithm for finding these so-called supermasks (binary masks which specify whether each edge is on or off, exposing a subnetwork with good performance).

SupSup uses a neural network with weights that remain fixed and random. For each new task $i$ SupSup finds a supermask (subnetwork) which achieves good performance on task $i$. The underlying weights are not modified and so forgetting does not occur. Moreover, the underlying weights have minimal storage cost as you only need to save the random seed.
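Because the backbone weights are never trained, they can be regenerated on demand from the seed alone. A minimal sketch of this idea, where the layer shape and the Gaussian initialization are illustrative assumptions rather than the paper's exact setup:

```python
# Sketch: SupSup's backbone weights are fixed and random, so only the seed
# needs to be stored -- the full weight matrix can be regenerated on demand.
# The shape and Gaussian init here are illustrative assumptions.
import random

def backbone_weights(seed, n_in=784, n_out=300):
    rng = random.Random(seed)
    scale = n_in ** -0.5
    return [[rng.gauss(0.0, scale) for _ in range(n_out)]
            for _ in range(n_in)]

# Regenerating from the same seed yields identical weights.
assert backbone_weights(0) == backbone_weights(0)
```

Storing an integer seed instead of megabytes of floats is what keeps the per-model cost of the shared weights negligible.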

When the task identity of data is given during inference, the corresponding supermask can be used (similar to Mallya et al. but without the need for a pretrained backbone).

SupSup may also be used even when task identity is unknown during inference (i.e. the model does not have access to which task it is currently evaluating on). SupSup does so by first inferring task identity and then using the corresponding supermask (subnetwork).

We consider classification tasks, and so the network outputs form a probability distribution over the possible classes. The work of Hendrycks & Gimpel suggests that the supermask trained on task $i$ is more likely to be confident when given data from task $i$. Therefore, SupSup infers task identity by finding the supermask which produces the lowest entropy (highest confidence) output.
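The simplest version of this inference rule just tries each mask in turn. A sketch, where `forward` and the toy linear model in the test are stand-ins for the real fixed-weight network:

```python
# Naive task-inference sketch: evaluate every supermask and return the one
# whose output distribution has the lowest entropy (highest confidence).
# `forward` is assumed to map (input, masked weights) to class logits.
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def entropy(probs):
    return -sum(p * math.log(p + 1e-12) for p in probs)

def infer_task(x, weights, masks, forward):
    """Index of the supermask producing the most confident output."""
    entropies = []
    for mask in masks:
        masked = [[m * w for m, w in zip(mrow, wrow)]
                  for mrow, wrow in zip(mask, weights)]
        entropies.append(entropy(softmax(forward(x, masked))))
    return min(range(len(masks)), key=entropies.__getitem__)
```

This loop costs one forward pass per task, which motivates the superposition trick described next.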

As the number of tasks grows large (> 1000), it becomes impractical to iterate through each supermask and check the entropy of the output distribution. We therefore consider a network with all supermasks in weighted superposition: mask $i$ has coefficient $\alpha_i$ (where each $\alpha_i > 0$ and $\sum_i \alpha_i = 1$). We can then infer task identity using gradient-based optimization, minimizing the entropy of the outputs with respect to the coefficients $\alpha_i$. In practice we find that a single gradient computation can be sufficient, even for 2500 tasks. Moreover, a forward and backward pass (i.e. computing outputs and the entropy gradient) with the superimposed model is much faster than trying each supermask learned so far.

SupSup performs well in a variety of different settings for continual learning, where different tasks are learned sequentially. For clarity we characterize the continual learning scenarios with a useful taxonomy and mnemonic.

Continual Learning Scenarios

In our paper we provide a more complete characterization and justification of different continual learning scenarios. The key differences between scenarios are 1) whether the model has access to the identity of the task it is performing inference or training on and 2) whether tasks reuse labels. Consider, for example, learning 50 permutations of MNIST. Is the model evaluated on correctly identifying the digit, or must it identify both the digit and the permutation? The table below provides our taxonomy, which builds upon van de Ven & Tolias and Zeno et al.

| Scenario | Description |
| --- | --- |
| GG | Task Given during train and Given during inference |
| GNs | Task Given during train, Not inference; shared labels |
| GNu | Task Given during train, Not inference; unshared labels |
| NNs | Task Not given during train Nor inference; shared labels |

GG: Task Given during train, Given during inference

We consider fixed, random weights $W$ and for each task $i$ we learn a binary mask $M^i$ such that the network with outputs $\mathbf{p} = f(\mathbf{x}, M^i \odot W)$ achieves good performance. Here $\mathbf{x}$ denotes the input data and $\mathbf{p}$ the output class probabilities. We can control the storage cost of SupSup by changing the sparsity of the supermasks. Below we provide performance on the challenging SplitImageNet benchmark, where ImageNet is split into 100 different 10-way classification problems learned sequentially. For the Upper Bound baseline, a new set of weights is trained for each task; Bytes denotes the total storage cost.

| Algorithm | Avg Top-1 Accuracy (SplitImageNet) | Bytes |
| --- | --- | --- |
| Upper Bound | 92.55 | 10222.81M |
| SupSup (less sparse) | 89.58 | 195.18M |
| SupSup (more sparse) | 86.37 | 65.50M |
| BatchE | 81.50 | 124.99M |
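With task identity given, inference is just a forward pass through the masked network. A sketch with a one-layer linear "network"; the layer shape and the ~30% mask density are illustrative assumptions:

```python
# GG-scenario sketch: task identity is given at inference, so we apply that
# task's binary mask to the fixed random weights: p = f(x, M^i * W).
# The single linear layer and the mask density are illustrative assumptions.
import math
import random

rng = random.Random(0)
n_in, n_classes, keep = 4, 3, 0.3       # keep ~30% of edges per task
W = [[rng.gauss(0, 1) for _ in range(n_classes)] for _ in range(n_in)]
masks = {t: [[int(rng.random() < keep) for _ in range(n_classes)]
             for _ in range(n_in)]
         for t in range(2)}             # one binary supermask per task

def predict(x, task_id):
    """Forward pass through the network masked by task `task_id`."""
    M = masks[task_id]
    logits = [sum(x[i] * M[i][j] * W[i][j] for i in range(n_in))
              for j in range(n_classes)]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]
```

Sparser masks can be stored more compactly, which is the storage knob behind the two SupSup rows in the table above.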

GN: Task Given during train, Not inference

As mentioned above, SupSup can be used to infer the task identity of given data. We consider a network with all supermasks in weighted superposition: mask $i$ has coefficient $\alpha_i$ (initially each $\alpha_i = 1/k$, where $k$ is the number of tasks learned so far). Outputs with the superimposed model are then given by

$$\mathbf{p} = f\left(\mathbf{x}, \left(\sum_{i=1}^k \alpha_i M^i\right) \odot W\right)$$
where $\odot$ denotes an elementwise product. SupSup infers task identity by minimizing the entropy $\mathcal{H}$ of $\mathbf{p}$ with gradient-based optimization, as the correct supermask should produce a low-entropy output. In this blog post we only provide results for the least expensive setting, where $\mathbf{x}$ is a single data point and task identity is inferred with a single gradient computation via

$$\hat{i} = \arg\max_i \left(-\frac{\partial \mathcal{H}}{\partial \alpha_i}\right)$$
as entropy is decreasing maximally in this coordinate.
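The one-gradient rule can be sketched end to end. Here the gradient is taken by finite differences purely for clarity (the actual method uses backpropagation), and the tiny linear model is an assumption:

```python
# One-shot task-inference sketch: mix all supermasks with coefficients alpha,
# measure output entropy H, and pick the coordinate where entropy falls
# fastest, i.e. argmax_i of -dH/d(alpha_i). Finite differences stand in for
# backprop; the single linear layer is an illustrative assumption.
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def mix_entropy(alpha, x, W, masks):
    """Entropy of the output of the alpha-weighted superimposed network."""
    n_in, n_out = len(W), len(W[0])
    logits = [sum(x[i] * W[i][j] *
                  sum(a * mk[i][j] for a, mk in zip(alpha, masks))
                  for i in range(n_in))
              for j in range(n_out)]
    return -sum(p * math.log(p + 1e-12) for p in softmax(logits))

def infer_task(x, W, masks, eps=1e-5):
    k = len(masks)
    alpha = [1.0 / k] * k                 # uniform initialization
    base = mix_entropy(alpha, x, W, masks)
    neg_grad = []
    for i in range(k):
        bumped = list(alpha)
        bumped[i] += eps
        neg_grad.append(-(mix_entropy(bumped, x, W, masks) - base) / eps)
    return max(range(k), key=neg_grad.__getitem__)
```

One forward/backward pass through the superimposed network replaces the per-mask loop, which is what makes inference tractable at 2500 tasks.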

We show results below for LeNet-300-100 (left) and a fully connected network with two hidden layers of size 1024 (right) trained sequentially on 250 permutations of the MNIST pixels (1000 training batches per task). In this scenario SupSup outperforms methods such as BatchE and PSP which operate in scenario GG—they have access to the task during inference. Results are provided for both entropy $\mathcal{H}$ and an alternative metric $\mathcal{G}$ described in the paper.

NN: Task Not given during train Nor inference

SupSup can also be extended to the scenario where task information is not available even during training: the model does not know when tasks switch. We consider PermutedMNIST, where each task is a different permutation of the MNIST pixels. We train on each task for 1000 batches (the model does not have access to this iteration number). Every 100 batches the model must choose between allocating a new mask and picking an existing one. SupSup allocates a new mask when it is uncertain about task identity: if $\nabla_\alpha \mathcal{H}$ is approximately uniform, the data likely doesn't belong to any task learned so far.

We illustrate below that SupSup is able to sequentially learn 2500 permutations of MNIST with minimal forgetting.

And more…

See the paper for more thorough experiments and methods, which include storing the entire, growing set of supermasks in a constant-sized reservoir by implicitly encoding them as attractors in a fixed-sized Hopfield network.