Grokking, Deep Double Descent, Overfitting
Created: =dateformat(this.file.ctime,"dd MMM yyyy, hh:mm a") | Modified: =dateformat(this.file.mtime,"dd MMM yyyy, hh:mm a")
Tags: knowledge
Overview
Related fields
Introduction
Grokking
Double Descent
Deep double descent | OpenAI: with the data size held fixed, increasing model capacity makes test performance first improve, then degrade, then improve again (a minimal numerical sketch follows the notes below).
- With increased model capacity and a middling amount of labelled data, the model can "memorize and generalize" the basic patterns faster.
- This is where a freshly initialized model can perform better.
- Models then move past the "peak" of the double-descent curve into the regime where more complex patterns have to be learned to improve performance further (as the data size grows).
- Current LLMs sit mostly in the first regime of double descent, which is why they need large capacity to accommodate their large data size.
- Being in this first regime, current LLMs have not yet exploited the full potential of their capacity; most recent large LLMs do not even complete a single epoch of training.
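A minimal sketch of the capacity-wise double-descent curve, using a random-features least-squares toy model rather than anything from the OpenAI experiments; the data, the `random_features` helper, and the chosen feature counts are all illustrative assumptions. Sweeping the number of random features past the number of training samples should show test error fall, spike near the interpolation point (features ≈ samples), and fall again.

```python
# Sketch of model-wise double descent with a random-features regression
# (illustrative toy setup, not the OpenAI experiments).
import numpy as np

rng = np.random.default_rng(0)

# Fixed data size: a small 1-D regression problem with label noise.
n_train, n_test, noise = 40, 500, 0.2
x_train = rng.uniform(-1, 1, size=n_train)
x_test = rng.uniform(-1, 1, size=n_test)
target = lambda x: np.sin(2 * np.pi * x)
y_train = target(x_train) + noise * rng.standard_normal(n_train)
y_test = target(x_test)

def random_features(x, n_features, seed=1):
    """Random Fourier features: a cheap knob for dialing model capacity."""
    feat_rng = np.random.default_rng(seed)
    w = feat_rng.standard_normal(n_features) * 5.0
    b = feat_rng.uniform(0, 2 * np.pi, n_features)
    return np.cos(np.outer(x, w) + b)

print(f"{'features':>10} {'train MSE':>12} {'test MSE':>12}")
for n_features in [5, 10, 20, 35, 40, 45, 60, 100, 200, 400]:
    Phi_train = random_features(x_train, n_features)
    Phi_test = random_features(x_test, n_features)
    # lstsq returns the minimum-norm solution once the system is
    # underdetermined (features > samples), which drives the second descent.
    coef, *_ = np.linalg.lstsq(Phi_train, y_train, rcond=None)
    train_mse = np.mean((Phi_train @ coef - y_train) ** 2)
    test_mse = np.mean((Phi_test @ coef - y_test) ** 2)
    print(f"{n_features:>10} {train_mse:>12.4f} {test_mse:>12.4f}")
```

The second descent in this toy comes from the minimum-norm solution that `np.linalg.lstsq` returns in the overparameterized case, which is one common explanation for why extra capacity can help rather than hurt past the interpolation peak.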