This paper investigates how preconditioned gradient descent (PGD) methods, such as Gauss-Newton, influence spectral bias and the phenomenon of grokking in neural networks. The authors propose that PGD can mitigate spectral bias, which typically causes networks to learn low frequencies first and can hinder the capture of fine-scale structure. The study further suggests that PGD can reduce the delays associated with grokking, a delayed-generalization effect hypothesized to occur during the transition from the Neural Tangent Kernel (NTK) regime to a feature-learning regime. Experimental results support the idea that grokking reflects this transitional behavior, with PGD enabling more uniform exploration of the parameter space.
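To make the preconditioning idea concrete, here is a minimal NumPy sketch of a damped Gauss-Newton update of the kind the summary refers to. The function name, damping value, and least-squares setting are illustrative assumptions, not the paper's exact method or preconditioner.

```python
import numpy as np

def gauss_newton_step(jac, residuals, damping=1e-3):
    """One damped Gauss-Newton (preconditioned) parameter update.

    jac       : (n_samples, n_params) Jacobian of model outputs w.r.t. parameters
    residuals : (n_samples,) model outputs minus targets
    damping   : Levenberg-Marquardt-style damping, assumed here for stability
    """
    grad = jac.T @ residuals                                  # gradient of 0.5 * ||r||^2
    curvature = jac.T @ jac + damping * np.eye(jac.shape[1])  # Gauss-Newton curvature approximation
    return np.linalg.solve(curvature, grad)                   # direction (J^T J + lambda I)^{-1} g

# Toy usage: linear least squares, where Gauss-Newton converges in one step.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))          # here X is also the Jacobian of predictions w.r.t. theta
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true
theta = np.zeros(3)
theta -= gauss_newton_step(X, X @ theta - y)
```

Whereas plain gradient descent scales each direction by the same learning rate (so low-curvature, high-frequency components are learned slowly), the solve against the curvature matrix rescales all directions comparably, which is the mechanism by which preconditioning is argued to counteract spectral bias.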
Summary written by gemini-2.5-flash-lite from 1 source.
IMPACT Deepens understanding of neural network training dynamics, potentially leading to more efficient learning algorithms for complex tasks.
RANK_REASON Academic paper presenting theoretical and empirical results on how preconditioned gradient descent affects neural network convergence behavior.