
Convergence Rate Analysis of the AdamW-Style Shampoo: Unifying One-sided and Two-Sided Preconditioning

A new theory, the Norm-Separation Delay Law, explains grokking: the phenomenon in which models generalize long after memorizing their training data. The researchers show that grokking is a representational phase transition driven by weight norms, and they establish a mathematical relationship between the delay and factors such as weight decay and learning rate. The work reframes grokking as a predictable outcome of norm separation and offers a predictive algorithm for grokking delay.


IMPACT Provides a theoretical framework for understanding and predicting model generalization delays, potentially enabling more efficient training.
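As context for the impact claim, the sketch below reproduces the standard setting in which grokking is observed: a small network trained on modular addition with weight decay, logging train and validation accuracy plus the overall weight norm, and measuring the memorization-to-generalization delay empirically. This is not the paper's predictive algorithm; the architecture, hyperparameters, seed, and 99% accuracy thresholds are illustrative assumptions, and whether grokking appears within the step budget is sensitive to them.

```python
# Illustrative only: the standard modular-addition setup in which grokking
# is observed (Power et al., 2022), not the paper's predictive algorithm.
import torch
import torch.nn as nn

P = 97
torch.manual_seed(0)

# Full (a, b) -> (a + b) mod P dataset, split 50/50 into train and validation.
pairs = torch.cartesian_prod(torch.arange(P), torch.arange(P))
labels = (pairs[:, 0] + pairs[:, 1]) % P
perm = torch.randperm(len(pairs))
train_idx, val_idx = perm[: len(pairs) // 2], perm[len(pairs) // 2 :]

model = nn.Sequential(
    nn.Embedding(P, 64),          # shared embedding for both operands
    nn.Flatten(),                 # (N, 2, 64) -> (N, 128)
    nn.Linear(128, 256), nn.ReLU(),
    nn.Linear(256, P),
)
# Sizable weight decay is the ingredient the Norm-Separation Delay Law ties
# to the length of the delay; weight_decay=1.0 and lr=1e-3 are assumed values.
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
loss_fn = nn.CrossEntropyLoss()

def accuracy(idx):
    with torch.no_grad():
        return (model(pairs[idx]).argmax(-1) == labels[idx]).float().mean().item()

memorized_at = grokked_at = None
for step in range(20_000):                     # full-batch training
    opt.zero_grad()
    loss_fn(model(pairs[train_idx]), labels[train_idx]).backward()
    opt.step()
    if step % 100 == 0:
        tr, va = accuracy(train_idx), accuracy(val_idx)
        norm = sum(p.norm() ** 2 for p in model.parameters()).sqrt().item()
        if memorized_at is None and tr > 0.99:
            memorized_at = step                # memorization: train acc saturates
        if grokked_at is None and va > 0.99:
            grokked_at = step                  # generalization: val acc jumps
        print(f"step {step:6d}  train {tr:.2f}  val {va:.2f}  ||W|| {norm:.1f}")

if memorized_at is not None and grokked_at is not None:
    print(f"empirical grokking delay: {grokked_at - memorized_at} steps")
```

The gap between `memorized_at` and `grokked_at` is the delay the paper's law aims to predict from weight decay, learning rate, and the separation of weight norms, rather than measure after the fact as done here.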

RANK_REASON The cluster contains two arXiv papers presenting theoretical analyses of machine learning optimization algorithms and phenomena.


COVERAGE [2]

  1. arXiv cs.LG TIER_1 · Truong Xuan Khanh, Truong Quynh Hoa, Luu Duc Trung, Phan Thanh Duc

    The Norm-Separation Delay Law of Grokking: A First-Principles Theory of Delayed Generalization

    arXiv:2603.13331v2 Announce Type: replace-cross Abstract: Grokking -- the sudden generalisation that appears long after a model has perfectly memorised its training data -- has been widely observed but lacks a quantitative theory explaining the length of the delay. We show that g…

  2. arXiv cs.LG TIER_1 · Huan Li, Yiming Dong, Zhouchen Lin

    Convergence Rate Analysis of the AdamW-Style Shampoo: Unifying One-sided and Two-Sided Preconditioning

    arXiv:2601.07326v2 Announce Type: replace-cross Abstract: This paper studies the AdamW-style Shampoo optimizer, an effective implementation of classical Shampoo that notably won the external tuning track of the AlgoPerf neural network training algorithm competition. Our analysis …
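To make the one-sided versus two-sided distinction in this abstract concrete, here is a minimal NumPy sketch of classical Shampoo preconditioning (Gupta et al., 2018), which the paper's AdamW-style variant builds on. It deliberately omits the AdamW-style components under analysis (moment estimates, decoupled weight decay, grafting); the eigenvalue damping, learning rate, and toy least-squares problem are illustrative assumptions.

```python
# Classical Shampoo preconditioning for a matrix parameter W (m x n),
# in both one-sided and two-sided modes. A sketch, not the paper's
# AdamW-style optimizer: no moments, no decoupled weight decay.
import numpy as np

def matrix_power(mat, power, eps=1e-6):
    """mat**power for a symmetric PSD matrix, via eigendecomposition."""
    vals, vecs = np.linalg.eigh(mat)
    vals = np.maximum(vals, 0.0) + eps          # damp tiny/negative eigenvalues
    return (vecs * vals ** power) @ vecs.T

class Shampoo:
    def __init__(self, shape, lr=0.1, two_sided=True):
        m, n = shape
        self.lr, self.two_sided = lr, two_sided
        self.L = np.zeros((m, m))               # left Gram statistic: sum of G G^T
        self.R = np.zeros((n, n))               # right Gram statistic: sum of G^T G

    def step(self, W, G):
        self.L += G @ G.T
        if self.two_sided:
            # Two-sided: W <- W - lr * L^{-1/4} G R^{-1/4}
            self.R += G.T @ G
            update = matrix_power(self.L, -0.25) @ G @ matrix_power(self.R, -0.25)
        else:
            # One-sided: precondition rows only, W <- W - lr * L^{-1/2} G
            update = matrix_power(self.L, -0.5) @ G
        return W - self.lr * update

# Toy usage: least squares on a random linear system.
rng = np.random.default_rng(0)
X, W_true = rng.normal(size=(256, 20)), rng.normal(size=(20, 5))
Y = X @ W_true
W = np.zeros((20, 5))
opt = Shampoo(W.shape, lr=0.5, two_sided=True)
for t in range(200):
    G = X.T @ (X @ W - Y) / len(X)              # gradient of the squared error
    W = opt.step(W, G)
print("final loss:", 0.5 * np.mean((X @ W - Y) ** 2))
```

The two-sided mode applies inverse fourth roots on both sides, so the combined preconditioner has the scale of an inverse square root; the one-sided mode preconditions only the row space with an inverse square root, which is cheaper because it maintains a single Gram statistic. Unifying the analysis of these two modes is the contribution the paper's title refers to.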