MTL Prev Reports
Created: 18 Nov 2022, 02:24 PM | Modified: =dateformat(this.file.mtime,"dd MMM yyyy, hh:mm a")
Tags: knowledge, MTL
From Research Direction Analysis Report:

4.1 - Overcomplete Layer Factorisation
- Same as overcomplete knowledge distillation
- In overcomplete knowledge distillation, some model parameters are over-parameterised via overcomplete factorisation during training; at inference, the factors are multiplied back together, so the deployed model stays compact with no increase in parameter count (see the sketch below)
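A minimal sketch of the collapse step, assuming the overcomplete factorisation expands a k×k convolution into a 1×1 expansion convolution followed by a k×k convolution (shapes and variable names below are illustrative, not taken from the report):

```python
import torch
import torch.nn.functional as F

# Illustrative shapes (not from the report): merge a trained 1x1 expansion conv
# (c_in -> m, with m > c_in giving the overcompleteness) followed by a k x k conv
# (m -> c_out) back into a single k x k conv (c_in -> c_out) for inference.
c_in, m, c_out, k = 16, 64, 32, 3
w_expand = torch.randn(m, c_in, 1, 1)   # trained 1x1 expansion factor
w_conv = torch.randn(c_out, m, k, k)    # trained k x k factor

# Both ops are linear with no activation in between, so their composition is a
# single convolution whose kernel is the product of the two factors.
w_merged = torch.einsum('omkl,mc->ockl', w_conv, w_expand.squeeze(-1).squeeze(-1))

# Sanity check: the merged kernel reproduces the two-layer output.
x = torch.randn(1, c_in, 8, 8)
y_two_layer = F.conv2d(F.conv2d(x, w_expand), w_conv, padding=1)
y_merged = F.conv2d(x, w_merged, padding=1)
assert torch.allclose(y_two_layer, y_merged, rtol=1e-4, atol=1e-4)
```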
4.2 - Knowledge Sharing Scheme
- In this invention, we focus on designing a knowledge-sharing scheme for deep multi-task learning that uses overparameterised convolutional layers to train multiple tasks simultaneously with better performance
- Design shared factors and task-specific factors to achieve a more effective MTL architecture for embedded vision applications with low memory, limited computational power and tight inference-time budgets, without increasing the inference model parameter size (see the sketch below)
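The exact factor design is the subject of the invention and is not spelled out here, so the following is only one plausible scheme: a single factor shared across tasks plus a task-specific factor per task, each pair collapsed into an ordinary per-task kernel for deployment.

```python
import torch

# Hypothetical sharing scheme (not the report's actual design): one shared 1x1
# expansion factor plus a task-specific k x k factor per task. After training,
# each task collapses its pair into a plain k x k kernel, so the inference model
# carries no extra parameters compared to an ordinary per-task convolution.
c_in, m, c_out, k, n_tasks = 16, 64, 32, 3, 2
w_shared = torch.randn(m, c_in)                                   # shared across tasks
w_task = [torch.randn(c_out, m, k, k) for _ in range(n_tasks)]    # one factor per task

# Collapse for deployment: each task gets its own merged k x k kernel.
w_infer = [torch.einsum('omkl,mc->ockl', w_t, w_shared) for w_t in w_task]
print([tuple(w.shape) for w in w_infer])   # [(32, 16, 3, 3), (32, 16, 3, 3)]
```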
4.3 - Model Fine-tuning and Knowledge Distillation
- To further enhance model performance without increasing the inference model size, we propose model fine-tuning and knowledge distillation as post-processing
- ST (single-task) model ⇒ generates soft targets for the MT (multi-task) model during training
- MT model is fine-tuned after overcomplete factorisation (see the loss sketch below)
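A common form of the soft-target loss (Hinton-style distillation) that fits this description; the report's exact formulation is not given, so the temperature T and weighting alpha below are assumptions:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Hinton-style soft-target distillation (illustrative, not the report's exact
    loss). The ST teacher's logits supply soft targets for the MT student;
    T (temperature) and alpha (soft/hard mix) are hypothetical hyperparameters."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction='batchmean',
    ) * (T * T)                                      # T^2 keeps gradient scale comparable
    hard = F.cross_entropy(student_logits, labels)   # usual supervised term
    return alpha * soft + (1.0 - alpha) * hard
```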
All three of these refer to Layer Overparameterisation in the next PDF:
Training Dynamics Optimization for Better Multitask Learning Model Generalisation

Layer Overparameterisation
- Matrix Factorisation
- Same as the previous PDF
- Element-wise Overparameterisation
- Based on element-wise additions
- If a convolution layer originally has |θ| parameters, the element-wise overparameterised layer has α|θ| parameters, i.e. α times the original count (see the sketch after this list)
- RESULTS:
- No significant performance increase
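A minimal sketch of element-wise overparameterisation under the assumption that the layer holds α copies of its kernel and uses their element-wise sum as the effective weight:

```python
import torch

# Assumed form of element-wise overparameterisation: keep alpha copies of the
# kernel and use their element-wise sum as the effective weight. Training stores
# alpha * |theta| parameters; at inference the copies are summed back into a
# single kernel of the original size.
alpha, c_out, c_in, k = 4, 32, 16, 3
w_copies = torch.randn(alpha, c_out, c_in, k, k, requires_grad=True)

w_effective = w_copies.sum(dim=0)            # used in the forward pass during training
w_inference = w_copies.detach().sum(dim=0)   # collapsed weight stored for deployment
print(tuple(w_inference.shape))              # (32, 16, 3, 3)
```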
Bayesian Model Averaging (BMA) and Stochastic Weight Averaging (SWA)
- Bayesian Model Averaging (BMA)
- Instead of betting everything on one hypothesis with a single setting of the neural network parameters w, use multiple sets of parameters weighted by their posterior probabilities to get better performance
- But the BMA predictive integral has no closed-form solution ⇒ approximated with Monte Carlo sampling (see the formula after this list)
- A more compute-efficient method is SWA:
- Stochastic Weight Averaging (SWA)
- Sample model parameters by training the network with SGD, then average parameters from the SGD trajectory to obtain a single model with better generalisation (see the sketch after this list)
- Can be viewed as a local ensemble of NN models
- RESULTS:
- Models for averaging should be collected only after convergence; if collected earlier they are not guaranteed to sit in the same loss valley, and averaging will not improve performance
- Averaging a selection of the better-performing models ⇒ better final model performance, BUT they must share the same initialisation
- Averaging models collected at every iteration instead of every epoch ⇒ very stable, but not necessarily the best
- Averaging models trained with different combinations of task weightings in the MT loss function ⇒ better final model
- SWA has more room for improvement if training lands in a flatter basin of the loss landscape ⇒ achievable with a better or wider model architecture, or a better training-dynamics optimiser (e.g. SAM)
- Will not deteriorate model performance
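For reference, the BMA predictive integral and its Monte Carlo approximation in their standard form (not copied from the report):

$$
p(y \mid x, \mathcal{D}) = \int p(y \mid x, w)\, p(w \mid \mathcal{D})\, dw \;\approx\; \frac{1}{S} \sum_{s=1}^{S} p(y \mid x, w_s), \qquad w_s \sim p(w \mid \mathcal{D})
$$

And a minimal SWA sketch, assuming checkpoints saved along the SGD trajectory after convergence (checkpoint paths and format are hypothetical):

```python
import copy
import torch

def swa_average(model, checkpoint_paths):
    """Illustrative SWA: average parameters of checkpoints collected along the SGD
    trajectory after convergence (checkpoint paths/format are hypothetical)."""
    avg_state = None
    for i, path in enumerate(checkpoint_paths):
        state = torch.load(path, map_location='cpu')
        if avg_state is None:
            avg_state = {k: v.detach().clone() for k, v in state.items()}
        else:
            for k, v in state.items():
                if v.is_floating_point():
                    # running mean over checkpoints: avg += (x - avg) / (i + 1)
                    avg_state[k] += (v.detach() - avg_state[k]) / (i + 1)
    swa_model = copy.deepcopy(model)
    swa_model.load_state_dict(avg_state)
    return swa_model   # BatchNorm running stats should be recomputed on training data
```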
- Sharpness-aware Minimisation (SAM)
- Recent optimisation method that minimises the loss value and the loss sharpness (i.e. seeks a flat loss landscape) at the same time
- Does so by solving a min-max optimisation problem locally
- Can be applied as a wrapper around a generic base optimiser (see the sketch after this list)
- Shown to consistently improve test performance across CV tasks and models
- RESULTS:
- Experiments done on Sphinx using SAM fine-tuning and SAM+SWA fine-tuning
- 400 epochs with SAM → improved test performance
- 400 epochs with SAM+SWA → better test results
- SAM improves model generalisation, improving test performance on SR and night OD
- SAM is slower to train, but inference time is unchanged
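A compact sketch of one SAM update as described by Foret et al.: climb to the worst-case weights within an ℓ2 ball of radius ρ, take the gradient there, then step the base optimiser from the original weights. The wrapper-style function below is illustrative, not the report's implementation:

```python
import torch

def sam_step(model, loss_fn, base_optimizer, rho=0.05):
    """One SAM update (simplified). loss_fn() is assumed to re-run the forward pass
    on the current mini-batch; rho is the neighbourhood radius."""
    base_optimizer.zero_grad()

    # 1st pass: gradient g at the current weights w.
    loss = loss_fn()
    loss.backward()

    # Perturb: e = rho * g / ||g||, i.e. step to the locally worst-case weights w + e.
    grads = [p.grad for p in model.parameters() if p.grad is not None]
    grad_norm = torch.norm(torch.stack([g.norm(p=2) for g in grads]), p=2) + 1e-12
    eps = []
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                eps.append(None)
                continue
            e = rho * p.grad / grad_norm
            p.add_(e)
            eps.append(e)

    # 2nd pass: gradient at the perturbed weights, used for the actual update.
    base_optimizer.zero_grad()
    loss_fn().backward()

    # Undo the perturbation, then let the base optimiser (e.g. SGD) apply the
    # sharpness-aware gradient at the original weights.
    with torch.no_grad():
        for p, e in zip(model.parameters(), eps):
            if e is not None:
                p.sub_(e)
    base_optimizer.step()
    return loss.detach()
```

Each update needs two forward/backward passes, which matches the observation above that SAM is slower to train while inference time is unchanged.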