The lottery ticket hypothesis - Finding sparse, trainable neural networks


Created: `=dateformat(this.file.ctime,"dd MMM yyyy, hh:mm a")` | Modified: `=dateformat(this.file.mtime,"dd MMM yyyy, hh:mm a")` Tags:

Annotations


  • Han et al. (2017) and Jin et al. (2016) restore pruned connections to increase network capacity after small weights have been pruned and surviving weights fine-tuned.

https://lilianweng.github.io/posts/2019-03-14-overfit/#the-lottery-ticket-hypothesis

The lottery ticket hypothesis (Frankle & Carbin, 2019) is another intriguing and inspiring discovery, supporting the idea that only a subset of network parameters impact model performance and thus the network is not overfitted. The lottery ticket hypothesis states that a randomly initialized, dense, feed-forward network contains a pool of subnetworks, and among them only a subset are “winning tickets” which can achieve optimal performance when trained in isolation.

The idea is motivated by network pruning techniques — removing unnecessary weights (i.e. tiny weights that are almost negligible) without harming the model performance. Although the final network size can be reduced dramatically, it is hard to train such a pruned network architecture successfully from scratch. It feels like in order to successfully train a neural network, we need a large number of parameters, but we don’t need that many parameters to keep the accuracy high once the model is trained. Why is that?
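To make the pruning step concrete, here is a minimal magnitude-pruning sketch. PyTorch is an assumption (the post names no framework), and `magnitude_prune_mask` is a hypothetical helper: it builds a binary mask that keeps only the largest-magnitude weights of a tensor.

```python
import torch

def magnitude_prune_mask(weights: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Return a 0/1 mask keeping the largest-magnitude weights.

    `sparsity` is the fraction of weights to remove, e.g. 0.8 prunes 80%.
    """
    k = int(sparsity * weights.numel())
    if k == 0:
        return torch.ones_like(weights)
    # Threshold = magnitude of the k-th smallest weight; everything at or
    # below it is pruned (set to zero by the mask).
    threshold = weights.abs().flatten().kthvalue(k).values
    return (weights.abs() > threshold).float()

# Example: prune 80% of a layer's weights by magnitude.
w = torch.randn(256, 128)
mask = magnitude_prune_mask(w, sparsity=0.8)
w_pruned = w * mask  # zeroed weights no longer contribute to the forward pass
```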

The lottery ticket hypothesis was tested with the following experiment:

1. Randomly initialize a dense feed-forward network with initialization values $\theta_0 \sim D_\theta$;
2. Train the network for multiple iterations to achieve good performance, yielding the parameter config $\theta$;
3. Run pruning on $\theta$ and create a mask $m$;
4. The “winning ticket” initialization config is $m \odot \theta_0$.

When training only the small “winning ticket” subset of parameters, with the initial values found in step 1, the model is able to achieve the same level of accuracy as in step 2 (see the sketch after this list). It turns out a large parameter space is not needed in the final solution representation, but it is needed for training, as it provides a big pool of initialization configs for many much smaller subnetworks.
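Below is a minimal sketch of this four-step procedure, again assuming PyTorch. `train_fn` is a hypothetical stand-in for any standard training loop, and for brevity every parameter tensor is masked per-tensor by magnitude (the paper itself prunes layer weights and has iterative variants).

```python
import copy
import torch
import torch.nn as nn

def find_winning_ticket(model: nn.Module, train_fn, sparsity: float):
    """One round of the pruning experiment sketched above.

    `train_fn(model)` is a hypothetical helper assumed to train the
    model in place. Returns the masks and the model rewound to its
    original initialization with the masks applied.
    """
    # Step 1: remember the random initialization theta_0.
    theta_0 = copy.deepcopy(model.state_dict())

    # Step 2: train to convergence, yielding parameters theta.
    train_fn(model)

    # Step 3: prune -- build a per-tensor mask m keeping the largest weights.
    masks = {}
    for name, p in model.named_parameters():
        k = int(sparsity * p.numel())
        if k > 0:
            threshold = p.detach().abs().flatten().kthvalue(k).values
            masks[name] = (p.detach().abs() > threshold).float()
        else:
            masks[name] = torch.ones_like(p)

    # Step 4: the winning ticket is m * theta_0 -- rewind to the original
    # initialization and zero out the pruned weights.
    model.load_state_dict(theta_0)
    with torch.no_grad():
        for name, p in model.named_parameters():
            p.mul_(masks[name])
    return masks, model
```

To then train the ticket “in isolation”, the masks would be re-applied after each optimizer step so pruned weights stay at zero; per the hypothesis, the subnetwork trained from $m \odot \theta_0$ should match the accuracy reached in step 2.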

The lottery ticket hypothesis opens a new perspective on interpreting and dissecting deep neural network results. Many interesting follow-up works are on the way.

  • dense, randomly-initialized, feed-forward networks contain subnetworks (winning tickets) that—when trained in isolation—reach test accuracy comparable to the original network in a similar number of iterations