MTL Prev Reports
Created: 18 Nov 2022, 02:24 PM | Modified: =dateformat(this.file.mtime,"dd MMM yyyy, hh:mm a")
Tags: knowledge, MTL
From Research Direction Analysis Report:

4.1 - Overcomplete Layer Factorisation
- Same as overcomplete knowledge distillation
- In overcomplete knowledge distillation, some model parameters are over-parameterised via overcomplete factorisation during training; at inference, the factors are multiplied back together, so the deployed model stays compact with no increase in parameter count (see the sketch below)
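A minimal sketch of the collapse step, assuming the overcomplete factorisation expands a k×k convolution into a 1×1 expansion convolution followed by a k×k convolution (shapes and variable names below are illustrative, not taken from the report):

```python
import torch
import torch.nn.functional as F

# Illustrative shapes (not from the report): merge a trained 1x1 expansion conv
# (c_in -> m, with m > c_in giving the overcompleteness) followed by a k x k conv
# (m -> c_out) back into a single k x k conv (c_in -> c_out) for inference.
c_in, m, c_out, k = 16, 64, 32, 3
w_expand = torch.randn(m, c_in, 1, 1)   # trained 1x1 expansion factor
w_conv = torch.randn(c_out, m, k, k)    # trained k x k factor

# Both ops are linear with no activation in between, so their composition is a
# single convolution whose kernel is the product of the two factors.
w_merged = torch.einsum('omkl,mc->ockl', w_conv, w_expand.squeeze(-1).squeeze(-1))

# Sanity check: the merged kernel reproduces the two-layer output.
x = torch.randn(1, c_in, 8, 8)
y_two_layer = F.conv2d(F.conv2d(x, w_expand), w_conv, padding=1)
y_merged = F.conv2d(x, w_merged, padding=1)
assert torch.allclose(y_two_layer, y_merged, rtol=1e-4, atol=1e-4)
```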
4.2 - Knowledge Sharing Scheme
- In this invention, we focus on designing a knowledge-sharing scheme for deep multi-task learning that uses overparameterised convolutional layers to train multiple tasks simultaneously with better performance
- Design shared factors and task-specific factors to achieve a more effective MTL architecture for embedded vision applications with low memory, limited computational power and tight inference-time budgets, without increasing the inference model parameter size (see the sketch below)
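The exact factor design is the subject of the invention and is not spelled out here, so the following is only one plausible scheme: a single factor shared across tasks plus a task-specific factor per task, each pair collapsed into an ordinary per-task kernel for deployment.

```python
import torch

# Hypothetical sharing scheme (not the report's actual design): one shared 1x1
# expansion factor plus a task-specific k x k factor per task. After training,
# each task collapses its pair into a plain k x k kernel, so the inference model
# carries no extra parameters compared to an ordinary per-task convolution.
c_in, m, c_out, k, n_tasks = 16, 64, 32, 3, 2
w_shared = torch.randn(m, c_in)                                   # shared across tasks
w_task = [torch.randn(c_out, m, k, k) for _ in range(n_tasks)]    # one factor per task

# Collapse for deployment: each task gets its own merged k x k kernel.
w_infer = [torch.einsum('omkl,mc->ockl', w_t, w_shared) for w_t in w_task]
print([tuple(w.shape) for w in w_infer])   # [(32, 16, 3, 3), (32, 16, 3, 3)]
```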
4.3 - Model Fine-tuning and Knowledge Distillation
- To further enhance model performance without increasing the inference model size, we propose model fine-tuning and knowledge distillation as post-processing
- ST (single-task) model ⇒ generates soft targets for the MT (multi-task) model during training
- MT model is fine-tuned after overcomplete factorisation (see the loss sketch below)
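A common form of the soft-target loss (Hinton-style distillation) that fits this description; the report's exact formulation is not given, so the temperature T and weighting alpha below are assumptions:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Hinton-style soft-target distillation (illustrative, not the report's exact
    loss). The ST teacher's logits supply soft targets for the MT student;
    T (temperature) and alpha (soft/hard mix) are hypothetical hyperparameters."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction='batchmean',
    ) * (T * T)                                      # T^2 keeps gradient scale comparable
    hard = F.cross_entropy(student_logits, labels)   # usual supervised term
    return alpha * soft + (1.0 - alpha) * hard
```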
All three of these refer to Layer Overparameterisation in the next PDF:
Training Dynamics Optimization for Better Multitask Learning Model Generalisation

Layer Overparameterisation
- Matrix Factorisation
- Same as the previous PDF
- Element-wise Overparameterisation
- Based on element-wise additions
- If a convolution layer originally has |θ| parameters, the element-wise overparameterised layer has α|θ| parameters, i.e. α times the original count (see the sketch after this list)
- RESULTS:
- No significant performance increase
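A minimal sketch of element-wise overparameterisation under the assumption that the layer holds α copies of its kernel and uses their element-wise sum as the effective weight:

```python
import torch

# Assumed form of element-wise overparameterisation: keep alpha copies of the
# kernel and use their element-wise sum as the effective weight. Training stores
# alpha * |theta| parameters; at inference the copies are summed back into a
# single kernel of the original size.
alpha, c_out, c_in, k = 4, 32, 16, 3
w_copies = torch.randn(alpha, c_out, c_in, k, k, requires_grad=True)

w_effective = w_copies.sum(dim=0)            # used in the forward pass during training
w_inference = w_copies.detach().sum(dim=0)   # collapsed weight stored for deployment
print(tuple(w_inference.shape))              # (32, 16, 3, 3)
```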
Bayesian Model Averaging (BMA) and Stochastic Weight Averaging (SWA)
- Bayesian Model Averaging (BMA)
- Instead of betting everything on one hypothesis with a single setting of the neural network parameters w, use multiple sets of parameters weighted by their posterior probabilities to get better performance
- But the BMA predictive integral has no closed-form solution ⇒ approximated with Monte Carlo sampling (see the formula after this list)
- A more compute-efficient method is SWA:
- Stochastic Weight Averaging (SWA)
- Sample model parameters by training the network with SGD, then average parameters from the SGD trajectory to obtain a single model with better generalisation (see the sketch after this list)
- Can be viewed as a local ensemble of NN models
- RESULTS:
- Models for averaging should be collected only after convergence; if collected earlier they are not guaranteed to sit in the same loss valley, and averaging will not improve performance
- Averaging a selection of the better-performing models ⇒ better final model performance, BUT they must share the same initialisation
- Averaging models collected at every iteration instead of every epoch ⇒ very stable, but not necessarily the best
- Averaging models trained with different combinations of task weightings in the MT loss function ⇒ better final model
- SWA has more room for improvement if training lands in a flatter basin of the loss landscape ⇒ achievable with a better or wider model architecture, or a better training-dynamics optimiser (e.g. SAM)
- Will not deteriorate model performance
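For reference, the BMA predictive integral and its Monte Carlo approximation in their standard form (not copied from the report):

$$
p(y \mid x, \mathcal{D}) = \int p(y \mid x, w)\, p(w \mid \mathcal{D})\, dw \;\approx\; \frac{1}{S} \sum_{s=1}^{S} p(y \mid x, w_s), \qquad w_s \sim p(w \mid \mathcal{D})
$$

And a minimal SWA sketch, assuming checkpoints saved along the SGD trajectory after convergence (checkpoint paths and format are hypothetical):

```python
import copy
import torch

def swa_average(model, checkpoint_paths):
    """Illustrative SWA: average parameters of checkpoints collected along the SGD
    trajectory after convergence (checkpoint paths/format are hypothetical)."""
    avg_state = None
    for i, path in enumerate(checkpoint_paths):
        state = torch.load(path, map_location='cpu')
        if avg_state is None:
            avg_state = {k: v.detach().clone() for k, v in state.items()}
        else:
            for k, v in state.items():
                if v.is_floating_point():
                    # running mean over checkpoints: avg += (x - avg) / (i + 1)
                    avg_state[k] += (v.detach() - avg_state[k]) / (i + 1)
    swa_model = copy.deepcopy(model)
    swa_model.load_state_dict(avg_state)
    return swa_model   # BatchNorm running stats should be recomputed on training data
```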
- Sharpness-aware Minimisation (SAM)
- Recent optimisation method that minimises the loss value and the loss sharpness (i.e. seeks a flat loss landscape) at the same time
- Does so by solving a min-max optimisation problem locally
- Can be applied as a wrapper around a generic base optimiser (see the sketch after this list)
- Shown to consistently improve test performance across CV tasks and models
- RESULTS:
- Experiments done on Sphinx using SAM fine-tuning and SAM+SWA fine-tuning
- 400 epochs with SAM → improved test performance
- 400 epochs with SAM+SWA → better test results
- SAM improves model generalisation, improving test performance on SR and night OD
- SAM is slower to train, but inference time is unchanged
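A compact sketch of one SAM update as described by Foret et al.: climb to the worst-case weights within an ℓ2 ball of radius ρ, take the gradient there, then step the base optimiser from the original weights. The wrapper-style function below is illustrative, not the report's implementation:

```python
import torch

def sam_step(model, loss_fn, base_optimizer, rho=0.05):
    """One SAM update (simplified). loss_fn() is assumed to re-run the forward pass
    on the current mini-batch; rho is the neighbourhood radius."""
    base_optimizer.zero_grad()

    # 1st pass: gradient g at the current weights w.
    loss = loss_fn()
    loss.backward()

    # Perturb: e = rho * g / ||g||, i.e. step to the locally worst-case weights w + e.
    grads = [p.grad for p in model.parameters() if p.grad is not None]
    grad_norm = torch.norm(torch.stack([g.norm(p=2) for g in grads]), p=2) + 1e-12
    eps = []
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                eps.append(None)
                continue
            e = rho * p.grad / grad_norm
            p.add_(e)
            eps.append(e)

    # 2nd pass: gradient at the perturbed weights, used for the actual update.
    base_optimizer.zero_grad()
    loss_fn().backward()

    # Undo the perturbation, then let the base optimiser (e.g. SGD) apply the
    # sharpness-aware gradient at the original weights.
    with torch.no_grad():
        for p, e in zip(model.parameters(), eps):
            if e is not None:
                p.sub_(e)
    base_optimizer.step()
    return loss.detach()
```

Each update needs two forward/backward passes, which matches the observation above that SAM is slower to train while inference time is unchanged.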