DiSparse - Disentangled Sparsification for Multitask Model Compression


Created: =dateformat(this.file.ctime,"dd MMM yyyy, hh:mm a") | Modified: =dateformat(this.file.mtime,"dd MMM yyyy, hh:mm a") Tags:

Annotations


  • neural network compression techniques can be categorized [9] into pruning [20, 30, 32], quantization [4, 37, 56], low-rank factorization [10, 33, 58], and knowledge distillation [25, 26, 36] * show annotation

  • A few MTL works have explored the problem of entangled features and showed disentangling representation into shared and task-private spaces will improve the model performance [34, 55] * show annotation

  • key to properly compressing a multitask model is correctly identifying saliency scores for each task in the shared space, therefore Sparsifying in a Disentangled manner (DiSparse) * show annotation

  • unanimous selection decisions among all tasks, which means that a parameter is removed only if it's shown to be not critical for any task * show annotation
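
A minimal sketch of this unanimous-removal rule, assuming the per-task decisions are available as binary keep masks (the function name `unanimous_keep_mask` is mine, not from the paper): a shared parameter survives if at least one task marks it as critical, so it is dropped only when every task agrees.

```python
import torch

def unanimous_keep_mask(task_keep_masks):
    """Combine per-task binary keep masks for the shared parameters.

    A parameter is pruned only if *no* task marks it as critical,
    i.e. the resulting keep mask is the element-wise OR (union)
    of the per-task keep masks.
    """
    combined = torch.zeros_like(task_keep_masks[0], dtype=torch.bool)
    for mask in task_keep_masks:
        combined |= mask.bool()
    return combined

# Example: three tasks voting on five shared weights.
masks = [torch.tensor([1, 0, 0, 1, 0]),
         torch.tensor([0, 0, 1, 1, 0]),
         torch.tensor([1, 0, 0, 0, 0])]
print(unanimous_keep_mask(masks))  # tensor([ True, False,  True,  True, False])
```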

  • observed strikingly similar sparse network architecture identified by each task even before the training starts. This offers a glimpse of the transferable subnetwork architecture across domains. * show annotation

  • show that DiSparse does not only provide the compression community with the first-of-its-kind multitask sparsification scheme but also a powerful tool to the multitask learning community * show annotation

  • pruning and sparse training scheme for multitask network by disentangling the importance measurements among tasks * show annotation

  • task relatedness and multitask model architecture design with DiSparse * show annotation

  • Unstructured pruning methods [20, 30] drop less significant weights, regardless of where they occur * show annotation

  • structured pruning methods [32, 36] operate under structural constraints, for example removing convolutional filters or attention heads [40], and thus enjoy immediate performance improvement without specialized hardware or library support. * show annotation
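
To make the structured case concrete, here is a generic sketch (not any of the cited methods) that ranks convolutional filters by their L1 norm and rebuilds a smaller dense layer; because whole output channels are removed, the pruned layer runs faster without sparse kernels or library support.

```python
import torch
import torch.nn as nn

def prune_filters_by_l1(conv: nn.Conv2d, keep_ratio: float) -> nn.Conv2d:
    """Generic structured-pruning sketch: keep the filters (output
    channels) with the largest L1 norm and rebuild a smaller dense layer."""
    weight = conv.weight.data                       # (out_ch, in_ch, kH, kW)
    scores = weight.abs().sum(dim=(1, 2, 3))        # L1 norm per filter
    n_keep = max(1, int(keep_ratio * weight.size(0)))
    keep_idx = scores.topk(n_keep).indices.sort().values

    pruned = nn.Conv2d(conv.in_channels, n_keep, conv.kernel_size,
                       stride=conv.stride, padding=conv.padding,
                       bias=conv.bias is not None)
    pruned.weight.data = weight[keep_idx].clone()
    if conv.bias is not None:
        pruned.bias.data = conv.bias.data[keep_idx].clone()
    return pruned

layer = nn.Conv2d(16, 32, 3, padding=1)
smaller = prune_filters_by_l1(layer, keep_ratio=0.5)
print(smaller.weight.shape)  # torch.Size([16, 16, 3, 3])
```

In a full network, the next layer's input channels would also have to be trimmed to match the kept filters.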

  • score criteria include: 1. Magnitude-based [20, 32], 2. Gradient-based [43, 44], 3. Hessian-based [21, 30], 4. Learning-based [11, 36]. * show annotation
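
A hedged illustration of the first two criteria (generic formulations under my own function names, not the exact scores from the cited works): magnitude scoring ranks weights by |w|, while a SNIP-style gradient criterion ranks connections by |w · ∂L/∂w| estimated from a single batch.

```python
import torch

def magnitude_scores(weight: torch.Tensor) -> torch.Tensor:
    """Magnitude-based criterion: larger |w| means more salient."""
    return weight.abs()

def gradient_scores(weight: torch.Tensor, loss: torch.Tensor) -> torch.Tensor:
    """Gradient-based (SNIP-style) criterion: |w * dL/dw| estimates how
    much the loss would change if the connection were removed."""
    grad, = torch.autograd.grad(loss, weight, retain_graph=True)
    return (weight * grad).abs()

# Toy example on a single linear map.
w = torch.randn(4, 3, requires_grad=True)
x, y = torch.randn(8, 3), torch.randn(8, 4)
loss = torch.nn.functional.mse_loss(x @ w.t(), y)
print(magnitude_scores(w).shape, gradient_scores(w, loss).shape)
```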

  • solving for B_c, the binary mask for the large number of commonly shared parameters, we can't simply apply typical methods which directly utilize L(Θ, B; D) as guidance, because the shared parameters are entangled with multiple tasks. * show annotation

  • Figure 2. An overview of DiSparse. For task T_k, we feed weights Θ_c^k and their gradients w.r.t. the loss L_k(·) into a saliency scoring function to get their importance scores. Later we generate an optimal binary mask B_c^k for the model assuming that we're only training the network independently for task T_k. We directly assign P(B_c^k), the task-private part, to B^k used as the pruning or growing mask for the task-private parameters. For C(B_c^k), the shared part, we feed all of {C(B_c^k), ∀k ∈ {1, . . . , T}} to an element-wise arbiter function A and take its output as B_c, the pruning or growing mask for the shared parameters. * show annotation
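
A rough sketch of that flow for a single shared tensor, under my own assumptions about the details (per-task masks taken as a top-k cut on saliency scores, and the unanimous-OR rule from above standing in for the arbiter A):

```python
import torch

def topk_mask(scores: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """Keep the top `keep_ratio` fraction of entries by saliency score."""
    k = max(1, int(keep_ratio * scores.numel()))
    threshold = scores.flatten().topk(k).values.min()
    return scores >= threshold

def disparse_style_masks(per_task_scores, keep_ratio, arbiter):
    """Sketch of the Figure-2 flow for one shared tensor: each task
    produces its own mask as if trained alone, then the element-wise
    arbiter merges them into the shared mask B_c."""
    per_task_masks = [topk_mask(s, keep_ratio) for s in per_task_scores]
    return arbiter(per_task_masks), per_task_masks

# Example: 3 tasks scoring the same shared weight tensor.
scores = [torch.rand(64, 64) for _ in range(3)]
arbiter = lambda masks: torch.stack(masks).any(dim=0)   # unanimous-removal rule
shared_mask, task_masks = disparse_style_masks(scores, keep_ratio=0.3, arbiter=arbiter)
print(shared_mask.float().mean())  # fraction of kept shared weights, >= keep_ratio
```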

  • We performed sparsification on the widely-used DeepLab-ResNet [3] with atrous convolutions as the backbone and the ASPP [3] architecture for task-specific heads. * show annotation

  • Surprisingly, we observed strikingly high IoU among tasks, even in the 5-task Tiny-Taskonomy dataset. * show annotation
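
The IoU here is presumably the intersection-over-union of the per-task binary masks; a minimal sketch of that measurement (my own helper, not the paper's code):

```python
import torch

def mask_iou(mask_a: torch.Tensor, mask_b: torch.Tensor) -> float:
    """IoU between two binary keep masks over the shared parameters."""
    a, b = mask_a.bool(), mask_b.bool()
    intersection = (a & b).sum().item()
    union = (a | b).sum().item()
    return intersection / union if union > 0 else 1.0

# Two tasks independently keeping 30% of a tensor at random would give
# IoU around 0.18, so values close to 1 indicate strong agreement.
a = torch.rand(10000) < 0.3
b = torch.rand(10000) < 0.3
print(mask_iou(a, b))
```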

  • implies that even before training starts, different tasks tend to select the same architecture in the shared parameter space to facilitate training, suggesting potential for domain-independent sparse architecture exploration * show annotation

  • Several methods [5, 23, 24] start with single-task networks and gradually merge them into a unified one, using feature sharing and similarity maximization. However, these schemes are inapplicable to pre-designed multitask models. * show annotation