MTL Prev Reports


Created: 18 Nov 2022, 02:24 PM | Modified: =dateformat(this.file.mtime,"dd MMM yyyy, hh:mm a") Tags: knowledge, MTL


From Research Direction Analysis Report:

4.1 - Overcomplete Layer Factorisation

  • Same as overcomplete knowledge distillation
  • In overcomplete knowledge distillation, some of the model parameters are over-parameterised via overcomplete factorisation during training; at inference, the factors are multiplied back together, yielding a compact inference model with no increase in parameter count (see the sketch below)
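
A minimal sketch of the train-time vs inference-time trick, assuming the factorisation replaces a conv weight W (out_ch × in_ch·k·k) with an overcomplete product U·V; the class name, the expansion rule and the `collapse()` helper are illustrative assumptions, not the report's implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OvercompleteConv2d(nn.Module):
    """Conv layer whose weight is an overcomplete product U @ V during training.

    The effective weight W (out_ch x in_ch*k*k) is factorised as U (out_ch x r)
    times V (r x in_ch*k*k); collapsing U @ V at inference recovers a plain conv
    weight with the original parameter count."""

    def __init__(self, in_ch, out_ch, k, expansion=4, padding=1):
        super().__init__()
        self.shape = (out_ch, in_ch, k, k)
        self.padding = padding
        r = expansion * out_ch                     # inner dim > out_ch: over-parameterised (expansion rule is an assumption)
        self.U = nn.Parameter(torch.randn(out_ch, r) * 0.01)
        self.V = nn.Parameter(torch.randn(r, in_ch * k * k) * 0.01)
        self.bias = nn.Parameter(torch.zeros(out_ch))

    def effective_weight(self):
        return (self.U @ self.V).view(self.shape)  # multiply the factors back

    def forward(self, x):
        return F.conv2d(x, self.effective_weight(), self.bias, padding=self.padding)

    def collapse(self):
        """Return an equivalent plain nn.Conv2d for compact inference."""
        out_ch, in_ch, k, _ = self.shape
        conv = nn.Conv2d(in_ch, out_ch, k, padding=self.padding)
        with torch.no_grad():
            conv.weight.copy_(self.effective_weight())
            conv.bias.copy_(self.bias)
        return conv
```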

4.2 - Knowledge Sharing Scheme

  • In this invention, the focus is on designing a knowledge-sharing scheme for deep multi-task learning using overparameterised convolutional layers, for better performance when training multiple tasks simultaneously
  • Design shared factors and task-specific factors to achieve a more effective MTL architecture for embedded vision applications with low memory, limited computational power and tight inference-time budgets, without increasing the inference model parameter size (see the sketch below)
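
A rough sketch of one way shared and task-specific factors could be combined; the rule "task weight = shared factor × task-specific factor" and all names below are assumptions about the scheme, not taken from the report:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedFactorConv2d(nn.Module):
    """Hypothetical knowledge-sharing conv: one shared factor plus per-task factors.

    During training the weight used for task t is U_shared @ V_task[t]; gradients
    from every task flow into U_shared, which is how knowledge is shared.  At
    inference each task's product collapses into a plain conv of the original size."""

    def __init__(self, in_ch, out_ch, k, num_tasks, expansion=2, padding=1):
        super().__init__()
        self.shape = (out_ch, in_ch, k, k)
        self.padding = padding
        r = expansion * out_ch                                   # over-parameterised inner dim
        self.U_shared = nn.Parameter(torch.randn(out_ch, r) * 0.01)
        self.V_task = nn.ParameterList(
            nn.Parameter(torch.randn(r, in_ch * k * k) * 0.01) for _ in range(num_tasks)
        )
        self.bias = nn.Parameter(torch.zeros(out_ch))

    def forward(self, x, task_id):
        w = (self.U_shared @ self.V_task[task_id]).view(self.shape)
        return F.conv2d(x, w, self.bias, padding=self.padding)
```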

4.3 - Model Fine-tuning and Knowledge Distillation

  • To further enhance model performance without increasing the inference model size, model fine-tuning and knowledge distillation are proposed as post-processing steps.
  • The ST model generates soft targets for the MT model during training
  • The MT model is fine-tuned after overcomplete factorisation (see the sketch below)
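
A minimal sketch of the distillation term, assuming standard temperature-scaled soft targets (Hinton-style KD); `kd_loss`, the temperature `T` and the loss weighting are assumptions, not the report's exact recipe:

```python
import torch.nn.functional as F

def kd_loss(mt_logits, st_logits, T=4.0):
    """Soft-target distillation: the ST model's temperature-softened outputs
    supervise the MT model's matching head (standard KD formulation, assumed)."""
    soft_targets = F.softmax(st_logits.detach() / T, dim=1)
    log_probs = F.log_softmax(mt_logits / T, dim=1)
    return F.kl_div(log_probs, soft_targets, reduction="batchmean") * (T * T)

# Fine-tuning sketch: after collapsing the overcomplete factors, continue training
# the MT model on its task loss plus the distillation term, e.g.
#   loss = task_loss(mt_logits, labels) + lambda_kd * kd_loss(mt_logits, st_logits)
```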

Basically, all three of these refer to Layer Overparameterisation in the next PDF:

Training Dynamics Optimization for Better Multitask Learning Model Generalisation

Layer Overparameterisation

  • Matrix Factorisation
    • Same as in the previous PDF
  • Element-wise Overparameterisation
    • Based on element-wise additions (see the sketch after this list)
    • This means that if a convolution layer originally has |θ| parameters, the element-wise overparameterised layer has α|θ| parameters, i.e. α times the original count.
    • RESULTS:
      • No significant performance increase
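
Sketch referenced above, assuming "element-wise overparameterisation" means the training-time weight is the element-wise sum of α same-shaped weight tensors (an interpretation, not confirmed by the report):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ElementwiseOverparamConv2d(nn.Module):
    """Conv whose training-time weight is the element-wise sum of alpha tensors,
    giving alpha * |theta| parameters; summing them restores |theta| at inference."""

    def __init__(self, in_ch, out_ch, k, alpha=3, padding=1):
        super().__init__()
        self.padding = padding
        self.weights = nn.ParameterList(
            nn.Parameter(torch.randn(out_ch, in_ch, k, k) * 0.01) for _ in range(alpha)
        )
        self.bias = nn.Parameter(torch.zeros(out_ch))

    def forward(self, x):
        w = torch.stack(list(self.weights)).sum(dim=0)   # element-wise addition of the alpha copies
        return F.conv2d(x, w, self.bias, padding=self.padding)
```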

Bayesian Model Averaging (BMA) and Stochastic Weight Averaging (SWA)

  • Bayesian Model Averaging (BMA)
    • Instead of betting everything on one hypothesis, i.e. a single setting of the neural network parameters w, predictions are averaged over multiple parameter settings weighted by their posterior probabilities, yielding better performance.
    • The posterior-averaging integral has no closed-form solution and must be approximated with Monte Carlo sampling (see the equation after this list)
    • A more compute-efficient alternative is SWA:
  • Stochastic Weight Averaging (SWA)
    • Sample model parameters by training the network with SGD, then average the parameters collected along the SGD trajectory to obtain a single model with better generalisation (see the sketch after this list)
    • Can be viewed as a local ensemble of NN models
    • RESULTS:
      • Models for averaging should be collected only after convergence; if collected earlier they are not guaranteed to lie in the same loss valley and the average will not perform better
      • Averaging only a selection of better-performing models gives a better final model, BUT they must share the same initialisation
      • Averaging models collected at every iteration instead of every epoch is very stable, but not necessarily the best
      • Averaging models trained with different combinations of task weightings in the MT loss function gives a better final model
      • SWA has more room for improvement if the model lands in a flatter basin of the loss landscape, which can be achieved with a better or wider model architecture, or a better training-dynamics optimiser (e.g. SAM)
      • Does not deteriorate model performance
  • Sharpness-aware Minimisation (SAM)
    • A recent optimisation method that minimises the loss value and the loss sharpness at the same time, i.e. seeks flat regions of the loss landscape
    • Does so by solving a min-max optimisation problem locally
    • Can be applied as a wrapper around a generic base optimiser (see the sketch after this list)
    • Reported to stably improve test performance across CV tasks and models
    • RESULTS:
      • Experiments done on Sphinx using SAM fine-tuning and SAM+SWA fine-tuning
      • 400 epochs with SAM improved test performance
      • 400 epochs with SAM+SWA gave even better test results
      • SAM improves model generalisation, improving test performance on SR and night OD
      • SAM is slower to train, but inference time is unchanged
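
For reference, the BMA predictive distribution and its Monte Carlo approximation mentioned above (standard form, not copied from the report):

```latex
p(y \mid x, \mathcal{D}) = \int p(y \mid x, w)\, p(w \mid \mathcal{D})\, dw
\;\approx\; \frac{1}{K} \sum_{k=1}^{K} p(y \mid x, w_k), \qquad w_k \sim p(w \mid \mathcal{D})
```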
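
The SWA sketch referenced above, using PyTorch's built-in `torch.optim.swa_utils`; the training-loop details, epoch counts and `swa_start` schedule are assumptions, not the report's setup:

```python
import torch
from torch.optim.swa_utils import AveragedModel, SWALR, update_bn

def train_with_swa(model, optimizer, train_loader, loss_fn, device,
                   total_epochs=400, swa_start=300, swa_lr=0.01):
    """Plain SGD until `swa_start`, then keep a running average of the weights."""
    swa_model = AveragedModel(model)
    swa_scheduler = SWALR(optimizer, swa_lr=swa_lr)

    for epoch in range(total_epochs):
        for x, y in train_loader:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()
            optimizer.step()
        if epoch >= swa_start:                    # collect only after convergence
            swa_model.update_parameters(model)    # average along the SGD trajectory
            swa_scheduler.step()

    update_bn(train_loader, swa_model, device=device)  # recompute BN statistics for the averaged model
    return swa_model
```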
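
The SAM sketch referenced above, written as a wrapper around a generic base optimiser in the style of common open-source SAM implementations; this is an illustrative reimplementation, not the report's code:

```python
import torch

class SAM(torch.optim.Optimizer):
    """Sharpness-Aware Minimisation as a wrapper around a base optimiser.

    Step 1 climbs to the worst-case weights within an L2 ball of radius rho
    (the inner max of the min-max problem); step 2 applies the base optimiser's
    update using gradients computed at that perturbed point."""

    def __init__(self, params, base_optimizer_cls, rho=0.05, **kwargs):
        defaults = dict(rho=rho, **kwargs)
        super().__init__(params, defaults)
        self.base_optimizer = base_optimizer_cls(self.param_groups, **kwargs)

    @torch.no_grad()
    def first_step(self):
        grad_norm = torch.norm(torch.stack([
            p.grad.norm(p=2) for group in self.param_groups
            for p in group["params"] if p.grad is not None
        ]), p=2)
        for group in self.param_groups:
            scale = group["rho"] / (grad_norm + 1e-12)
            for p in group["params"]:
                if p.grad is None:
                    continue
                e_w = p.grad * scale
                p.add_(e_w)                        # climb to the local worst case
                self.state[p]["e_w"] = e_w

    @torch.no_grad()
    def second_step(self):
        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is None:
                    continue
                p.sub_(self.state[p]["e_w"])       # return to the original weights
        self.base_optimizer.step()                 # descend with the SAM gradient

# Usage sketch (two forward/backward passes per batch):
#   opt = SAM(model.parameters(), torch.optim.SGD, rho=0.05, lr=0.01, momentum=0.9)
#   loss_fn(model(x), y).backward(); opt.first_step(); opt.zero_grad()
#   loss_fn(model(x), y).backward(); opt.second_step(); opt.zero_grad()
```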