New research explores advanced optimization for machine learning
ByPulseAugur Editorial·
Summary by gemini-2.5-flash-lite
from 17 sources
Several recent research papers explore advanced optimization techniques for machine learning. One paper introduces a derivative-free consensus-based method for nonconvex bi-level optimization, demonstrating convergence guarantees for its mean-field and finite-particle approximations. Another study presents Curvature-Tuned Accelerated Gradient Descent (CT-AGD), which reduces training epochs by an average of 33% for deep learning tasks by capturing local curvature. Additionally, research investigates stochastic approximation algorithms under heavy-tailed noise, analyzing concentration bounds and the impact of noise on error tails. Other papers delve into stochastic gradient variational inference, global convergence of stochastic conic particle gradient descent, and the suboptimality of momentum SGD in nonstationary environments.
AI
In this paper, we study a consensus-based optimization method for nonconvex bi-level optimization, where the objective is to minimize an upper-level function over the set of global minimizers of a lower-level problem. The proposed approach is derivative-free, and constructs its c…
In this paper, we study a consensus-based optimization method for nonconvex bi-level optimization, where the objective is to minimize an upper-level function over the set of global minimizers of a lower-level problem. The proposed approach is derivative-free, and constructs its c…
In this paper, we present CT-AGD (Curvature-Tuned Accelerated Gradient Descent), an optimization method for non-convex optimization problems in deep learning training tasks. CT-AGD is a general boosting procedure that accelerates first-order methods by explicitly capturing the lo…
arXiv:2502.06178v5 Announce Type: replace-cross Abstract: Bayesian optimization is highly effective for optimizing expensive-to-evaluate black-box functions, but it faces significant computational challenges due to the cubic per-iteration cost of Gaussian processes, which results…
arXiv stat.ML
TIER_1Italiano(IT)·Fares El Khoury, Houssam Zenati, Nathan Kallus, Michael Arbel, Aur\'elien Bibaut·
arXiv:2605.21341v1 Announce Type: new Abstract: Functional bilevel methods estimate a lower-level function and plug it into a hypergradient, but this plug-in gradient can retain first-order bias when the lower-level problem is learned nonparametrically. To remove this bias, we de…
arXiv stat.ML
TIER_1·Shubhada Agrawal, Siva Theja Maguluri, Martin Zubeldia·
arXiv:2605.20999v1 Announce Type: cross Abstract: We establish maximal concentration bounds for the iterates generated by stochastic approximation algorithms with general step sizes, where the noise has a finite-state Markovian component plus a Martingale-difference component. Wh…
We establish maximal concentration bounds for the iterates generated by stochastic approximation algorithms with general step sizes, where the noise has a finite-state Markovian component plus a Martingale-difference component. When the Martingale-difference noise is bounded, we …
arXiv:2605.19784v1 Announce Type: cross Abstract: We investigate the global optimization of the objective function arising in continuous sparse regression, specifically the Beurling LASSO (BLASSO), over the space of measures. While Conic Particle Gradient Descent (CPGD) methods a…
arXiv stat.ML
TIER_1·Sharan Sahu, Cameron J. Hogan, Martin T. Wells·
arXiv:2601.12238v4 Announce Type: replace Abstract: In this paper, we provide a comprehensive theoretical analysis of Stochastic Gradient Descent (SGD) and its momentum variants (Polyak Heavy-Ball and Nesterov) for tracking time-varying optima under strong convexity and smoothnes…
arXiv stat.ML
TIER_1·Kyurae Kim, Qiang Fu, Yi-An Ma, Jacob R. Gardner, Trevor Campbell·
arXiv:2602.18718v2 Announce Type: replace Abstract: For approximating a target distribution given only its unnormalized log-density, stochastic gradient-based variational inference (VI) algorithms are a popular approach. For example, Wasserstein VI (WVI) and black-box VI (BBVI) p…
We investigate the global optimization of the objective function arising in continuous sparse regression, specifically the Beurling LASSO (BLASSO), over the space of measures. While Conic Particle Gradient Descent (CPGD) methods are computationally efficient, they may become trap…
arXiv:2406.09241v3 Announce Type: replace-cross Abstract: In this paper, we examine the long-run distribution of stochastic gradient descent (SGD) in general, non-convex problems. Specifically, we seek to understand which regions of the problem's state space are more likely to be…
arXiv:2602.05172v2 Announce Type: replace Abstract: We derive finite-particle rates for the regularized Stein variational gradient descent (R-SVGD) algorithm introduced by He et al. (2024) that corrects the constant-order bias of the SVGD by applying a resolvent-type precondition…
arXiv:2605.18694v1 Announce Type: cross Abstract: Many tasks in modern machine learning are observed to involve heavy-tailed gradient noise during the optimization process. To manage this realistic and challenging setting, new mechanisms, such as gradient clipping and gradient no…
arXiv:2512.23178v3 Announce Type: replace-cross Abstract: Optimization under heavy-tailed noise has become popular recently, since it better fits many modern machine learning tasks, as captured by empirical observations. Concretely, instead of a finite second moment on gradient n…
arXiv:2602.05742v2 Announce Type: replace Abstract: Weighted empirical risk minimization is a common approach to prediction under distribution drift. This article studies its out-of-sample prediction error under nonstationarity. We provide a general decomposition of the excess ri…
Many tasks in modern machine learning are observed to involve heavy-tailed gradient noise during the optimization process. To manage this realistic and challenging setting, new mechanisms, such as gradient clipping and gradient normalization, have been introduced to ensure the co…