DiSparse - Disentangled Sparsification for Multitask Model Compression
Created: =dateformat(this.file.ctime,"dd MMM yyyy, hh:mm a") | Modified: =dateformat(this.file.mtime,"dd MMM yyyy, hh:mm a")
Tags:
Annotations
neural network compression techniques can be categorized [9] into pruning [20, 30, 32], quantization [4, 37, 56], low-rank factorization [10, 33, 58], and knowledge distillation [25, 26, 36] * show annotation
A few MTL works have explored the problem of entangled features and showed disentangling representation into shared and task-private spaces will improve the model performance [34, 55] * show annotation
key to properly compressing a multitask model is correctly identifying saliency scores for each task in the shared space, therefore Sparsifying in a Disentangled manner (DiSparse) * show annotation
unanimous selection decisions among all tasks, which means that a parameter is removed only if it's shown to be not critical for any task * show annotation
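A minimal sketch of this unanimous-vote rule (my own PyTorch illustration, not the paper's code; `unanimous_keep_mask` is a hypothetical helper): a shared parameter is kept as soon as any single task marks it as critical, and pruned only when every task agrees it is not.

```python
import torch

def unanimous_keep_mask(task_keep_masks):
    """A shared weight is pruned only if every task agrees it is non-critical,
    i.e. it is kept as soon as any single task marks it as critical."""
    keep = torch.zeros_like(task_keep_masks[0], dtype=torch.bool)
    for mask in task_keep_masks:
        keep |= mask.bool()          # logical OR across per-task votes
    return keep

# Toy example: three tasks voting on five shared weights (1 = keep).
votes = [torch.tensor([1, 0, 0, 1, 0]),
         torch.tensor([0, 0, 1, 1, 0]),
         torch.tensor([1, 0, 0, 0, 0])]
print(unanimous_keep_mask(votes))    # tensor([ True, False,  True,  True, False])
```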
observed strikingly similar sparse network architecture identified by each task even before the training starts. This offers a glimpse of the transferable subnetwork architecture across domains. * show annotation
show that DiSparse does not only provide the compression community with the first-of-its-kind multitask sparsification scheme but also a powerful tool to the multitask learning community * show annotation
pruning and sparse training scheme for multitask network by disentangling the importance measurements among tasks * show annotation
task relatedness and multitask model architecture design with DiSparse * show annotation
Unstructured pruning methods [20, 30] drop less significant weights, regardless of where they occur * show annotation
structured pruning methods [32, 36] operate under structural constraints, for example removing convolutional filters or attention heads [40], thus enjoy immediate performance improvement without specialized hardware or library support. * show annotation
score criteria include: 1. Magnitude-based [20, 32], 2. Gradient-based [43, 44], 3. Hessian-based [21, 30], 4. Learning-based [11, 36]. * show annotation
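For intuition, a hedged sketch of the first two criteria only (magnitude-based |w| and a SNIP-style gradient-based |w · ∂L/∂w|); the helper names and the toy linear layer are my own illustration, not code from the cited works.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def magnitude_scores(layer: nn.Linear) -> torch.Tensor:
    """Magnitude-based criterion: importance = |w|."""
    return layer.weight.detach().abs()

def gradient_scores(layer: nn.Linear, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Gradient-based (SNIP-style) criterion: importance = |w * dL/dw|."""
    layer.zero_grad()
    loss = F.cross_entropy(layer(x), y)
    loss.backward()
    return (layer.weight * layer.weight.grad).detach().abs()

# Toy usage on random data.
layer = nn.Linear(16, 4)
x, y = torch.randn(8, 16), torch.randint(0, 4, (8,))
print(magnitude_scores(layer).shape, gradient_scores(layer, x, y).shape)
```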
Sparse training techniques * show annotation
Static Sparse Training * show annotation
Dynamic Sparse Training * show annotation
solving for $B_c$, the binary mask for the large number of commonly shared parameters, we can't simply apply typical methods which directly utilize $\mathcal{L}(\Theta, B; D)$ as guidance, because the shared parameters are entangled with multiple tasks. * show annotation
Figure 2. An overview of DiSparse. For task $T_k$, we feed weights $\Theta_c^k$ and their gradients w.r.t. the loss $\mathcal{L}_k(\cdot)$ into a saliency scoring function to get their importance scores. Later we generate an optimal binary mask $B_c^k$ for the model assuming that we're only training the network independently for task $T_k$. We directly assign $P(B_c^k)$, the task-private part, to $B^k$ used as the pruning or growing mask for the task-private parameters. For $C(B_c^k)$, the shared part, we feed all of $\{C(B_c^k), \forall k \in \{1, \ldots, T\}\}$ to an element-wise arbiter function $\mathcal{A}$ and take its output as $B_c$, the pruning or growing mask for the shared parameters. * show annotation
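The caption translates roughly into the sketch below, assuming per-task saliency scores are already computed (e.g. with a criterion like those above), a top-k selection per task, and a unanimous OR arbiter; the function names (`topk_mask`, `disparse_masks`), task names, and tensor shapes are illustrative placeholders, not the paper's actual implementation.

```python
import torch

def topk_mask(scores: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """Keep the top `keep_ratio` fraction of entries by saliency score."""
    k = max(1, int(keep_ratio * scores.numel()))
    threshold = torch.topk(scores.flatten(), k).values.min()
    return scores >= threshold

def disparse_masks(shared_scores, private_scores, keep_ratio, arbiter):
    """Hypothetical sketch of the disentangled mask generation.

    shared_scores  : dict task -> saliency scores of the *shared* weights,
                     computed as if training the network for that task alone.
    private_scores : dict task -> saliency scores of that task's private weights.
    arbiter        : element-wise function combining per-task shared masks.
    """
    # Per-task votes over the shared space (the C(B_c^k) part).
    shared_votes = {t: topk_mask(s, keep_ratio) for t, s in shared_scores.items()}
    # Task-private masks B^k come directly from each task's own scores.
    private_masks = {t: topk_mask(s, keep_ratio) for t, s in private_scores.items()}
    # The arbiter merges the per-task votes into one shared mask B_c.
    shared_mask = arbiter(list(shared_votes.values()))
    return shared_mask, private_masks

# Unanimous-vote arbiter: prune only weights that no task finds critical.
unanimous = lambda votes: torch.stack(votes).any(dim=0)

shared_scores  = {t: torch.rand(64, 64) for t in ("seg", "depth", "normal")}
private_scores = {t: torch.rand(32, 64) for t in ("seg", "depth", "normal")}
B_c, B_k = disparse_masks(shared_scores, private_scores, keep_ratio=0.3, arbiter=unanimous)
print(B_c.float().mean().item())   # kept fraction of shared weights (>= keep_ratio)
```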
We performed sparsification on the widely-used DeepLab-ResNet [3] with atrous convolutions as the backbone and the ASPP [3] architecture for task-specific heads. * show annotation
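One possible way to assemble such a shared-backbone, multi-head model from off-the-shelf torchvision parts (a dilated ResNet-50 trunk plus one DeepLab/ASPP head per task); the task names and channel counts are placeholders, and this is only an approximation of the paper's exact DeepLab-ResNet setup.

```python
import torch
from torchvision.models import resnet50
from torchvision.models.segmentation.deeplabv3 import DeepLabHead

# Shared backbone: ResNet-50 with atrous (dilated) convolutions in the last two stages.
backbone = resnet50(replace_stride_with_dilation=[False, True, True])
backbone = torch.nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool / fc

# One ASPP-style head per task (output channels are placeholders, not the paper's).
heads = torch.nn.ModuleDict({
    "segmentation": DeepLabHead(2048, 40),
    "depth":        DeepLabHead(2048, 1),
})

x = torch.randn(1, 3, 224, 224)
features = backbone(x)                                # shared representation
outputs = {t: head(features) for t, head in heads.items()}
print({t: o.shape for t, o in outputs.items()})
```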
Surprisingly, we observed strikingly high IoU among tasks, even in the 5-task Tiny-Taskonomy dataset. * show annotation
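IoU here is measured between the binary masks that different tasks select over the same shared parameters; a small sketch with a hypothetical `mask_iou` helper and random masks for illustration:

```python
import torch

def mask_iou(mask_a: torch.Tensor, mask_b: torch.Tensor) -> float:
    """Intersection-over-Union of two binary keep-masks."""
    a, b = mask_a.bool(), mask_b.bool()
    inter = (a & b).sum().item()
    union = (a | b).sum().item()
    return inter / union if union > 0 else 1.0

# Two hypothetical per-task masks over the same shared layer.
m1 = torch.rand(256, 256) > 0.7
m2 = torch.rand(256, 256) > 0.7
print(f"IoU = {mask_iou(m1, m2):.3f}")
```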
implies that even before training starts, different tasks tend to select the same architecture in the shared parameter space to facilitate training, suggesting potential for domain-independent sparse architecture exploration * show annotation
Several methods [5, 23, 24] start with single-task networks and gradually merge them into a unified one, using feature sharing and similarity maximization. However, these schemes are inapplicable to pre-designed multitask models. * show annotation