Grokking, Deep Double Descent, Overfitting


Created: =dateformat(this.file.ctime,"dd MMM yyyy, hh:mm a") | Modified: =dateformat(this.file.mtime,"dd MMM yyyy, hh:mm a") Tags: knowledge


Overview

Introduction

Grokking

Double Descent

Deep double descent | OpenAI

As model capacity increases with the dataset size held fixed, model performance first improves, then degrades, and then improves again.

  • With increased model capacity and a moderate amount of labelled data, the model can “memorize and generalize” the basic patterns faster.
  • This is the regime where a freshly initialized model can perform better.
  • Beyond the “peak” of double descent, the model enters the regime where it learns more complex patterns that further improve performance (given a larger dataset).
  • Current LLMs are mostly in the first regime of double descent, which is why they need large capacity to accommodate large datasets.
    • Current LLMs remain in the first regime and have not yet explored the full potential of their capacity. Most recent large LLMs did not complete even one epoch of training.
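
The capacity sweep described above can be reproduced on a toy problem. The following is a minimal sketch, not the setup from the OpenAI post: it assumes a random ReLU feature model fitted by minimum-norm least squares on noisy linear data, and the function name `random_feature_errors` and all of its parameters are hypothetical choices made for illustration. In this kind of setup the test error typically peaks when the number of features is close to the number of training samples, and falls again as the width grows further.

```python
import numpy as np


def random_feature_errors(widths, n_train=40, n_test=500, d=10,
                          noise=0.5, trials=20, seed=0):
    """Average train/test MSE of min-norm least squares on random ReLU features.

    All names and defaults here are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    train_err = {p: 0.0 for p in widths}
    test_err = {p: 0.0 for p in widths}
    for _ in range(trials):
        # Noisy linear teacher: y = x . w_true + noise
        w_true = rng.normal(size=d) / np.sqrt(d)
        X_tr = rng.normal(size=(n_train, d))
        X_te = rng.normal(size=(n_test, d))
        y_tr = X_tr @ w_true + noise * rng.normal(size=n_train)
        y_te = X_te @ w_true  # noiseless targets for evaluation
        for p in widths:
            # Model capacity = number of random ReLU features p
            W = rng.normal(size=(d, p)) / np.sqrt(d)
            F_tr = np.maximum(X_tr @ W, 0.0)
            F_te = np.maximum(X_te @ W, 0.0)
            # lstsq returns the minimum-norm solution when p > n_train
            coef, *_ = np.linalg.lstsq(F_tr, y_tr, rcond=None)
            train_err[p] += np.mean((F_tr @ coef - y_tr) ** 2) / trials
            test_err[p] += np.mean((F_te @ coef - y_te) ** 2) / trials
    return train_err, test_err


if __name__ == "__main__":
    widths = [5, 10, 20, 30, 40, 50, 80, 160, 640]
    train_err, test_err = random_feature_errors(widths)
    for p in widths:
        print(f"p={p:4d}  train MSE={train_err[p]:.3e}  test MSE={test_err[p]:.3e}")
```

Plotting test MSE against `p` on a log scale makes the shape easiest to see. The interpolation threshold here is `p = n_train`, so changing `n_train` should shift the peak accordingly.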

Theoretical References

Papers

Articles

Courses


Code References

Methods

Tools, Frameworks