PulseAugur
research · [11 sources]

Beyond Linearity in Attention Projections: The Case for Nonlinear Queries

Researchers are exploring the fundamental mechanisms behind transformer attention, with new papers analyzing its gradient flow structure and dynamics. One study interprets attention as a gradient flow on the unit sphere, identifying factors that influence token clustering and stability in multi-head settings. Another paper investigates critical training windows for complexity control, determining when transformers prioritize reasoning over memorization. Additional work traces the origins of geometric continuity in deep neural networks to residual connections and rotational symmetry breaking, and examines the structural causes of the "attention sink" phenomenon.

Summary written from 11 sources. How we write summaries →

IMPACT These theoretical analyses offer deeper insights into transformer behavior, potentially guiding future architectural improvements and training strategies for more efficient and capable models.

RANK_REASON Multiple arXiv papers published on theoretical aspects of transformer attention mechanisms and training dynamics.

Read on arXiv cs.LG →

COVERAGE [11]

  1. arXiv cs.LG TIER_1 · Siquan Li, Kaiqi Jiang, Jiacheng Sun, Tianyang Hu ·

    The Structural Origin of Attention Sink: Variance Discrepancy, Super Neurons, and Dimension Disparity

    arXiv:2605.06611v1 Announce Type: new Abstract: Despite the prevalence of the attention sink phenomenon in Large Language Models (LLMs), where initial tokens disproportionately monopolize attention scores, its structural origins remain elusive. This work provides a mechan… [a minimal sketch of measuring the sink effect follows the coverage list]

  2. arXiv cs.LG TIER_1 · Ayan Pendharkar ·

    Gradient Flow Structure and Quantitative Dynamics of Multi-Head Self-Attention

    arXiv:2605.04279v1 Announce Type: new Abstract: Transformer self-attention can be interpreted as a gradient flow on the unit sphere, in which tokens evolve under softmax interaction potentials and tend to form clusters. While prior work has established clustering behavior for sin… [a standard form of these sphere dynamics is sketched after the coverage list]

  3. arXiv cs.LG TIER_1 · Sarwan Ali ·

    Critical Windows of Complexity Control: When Transformers Decide to Reason or Memorize

    arXiv:2605.04396v1 Announce Type: new Abstract: Recent work has shown that Transformers' compositional generalization is governed by complexity control (initialization scale and weight decay), which steers training toward low-complexity reasoning solutions rather than high-…

  4. arXiv cs.LG TIER_1 · Kyungwon Jeong, Won-Gi Paeng, Honggyo Suh ·

    Why Geometric Continuity Emerges in Deep Neural Networks: Residual Connections and Rotational Symmetry Breaking

    arXiv:2605.04971v1 Announce Type: new Abstract: Weight matrices in deep networks exhibit geometric continuity -- principal singular vectors of adjacent layers point in similar directions. While this property has been widely observed, its origin remains unexplained. Through experi… [an illustrative singular-vector alignment check appears after the coverage list]

  5. arXiv cs.CL TIER_1 · Honggyo Suh ·

    Why Geometric Continuity Emerges in Deep Neural Networks: Residual Connections and Rotational Symmetry Breaking

    Weight matrices in deep networks exhibit geometric continuity -- principal singular vectors of adjacent layers point in similar directions. While this property has been widely observed, its origin remains unexplained. Through experiments on toy MLPs and small transformers, we ide…

  6. arXiv cs.LG TIER_1 · Zheng-An Chen, Pengxiao Lin, Zhi-Qin John Xu, Tao Luo ·

    Focus and Dilution: The Multi-stage Learning Process of Attention

    arXiv:2605.01199v1 Announce Type: new Abstract: Transformer-based models have achieved remarkable success across a wide range of domains, yet our understanding of their training dynamics remains limited. In this work, we identify a recurrent focus-dilution cycle in attention lear…

  7. arXiv cs.LG TIER_1 · Marko Karbevski ·

    Beyond Linearity in Attention Projections: The Case for Nonlinear Queries

    arXiv:2603.13381v2 Announce Type: replace Abstract: Recent algebraic analysis shows that in decoder-only and encoder-only transformers, the Query projection $W_Q$ may be set to identity without noticeable performance deterioration. This is possible because attention depends on $X… [the underlying re-parameterization identity is worked out after the coverage list]

  8. arXiv stat.ML TIER_1 · Tianyang Hu ·

    The Structural Origin of Attention Sink: Variance Discrepancy, Super Neurons, and Dimension Disparity

    Despite the prevalence of the attention sink phenomenon in Large Language Models (LLMs), where initial tokens disproportionately monopolize attention scores, its structural origins remain elusive. This work provides a mechanistic explanation for this phenomenon. First, w…

  9. arXiv stat.ML TIER_1 · Jerry Yao-Chieh Hu, Mingcheng Lu, Yi-Chen Lee, Han Liu ·

    Transformer Approximations from ReLUs

    arXiv:2604.24878v1 Announce Type: cross Abstract: We provide a systematic recipe for translating ReLU approximation results to softmax attention mechanism. This recipe covers many common approximation targets. Importantly, it yields target-specific, economic resource bounds beyon…

  10. arXiv stat.ML TIER_1 · Han Liu ·

    Transformer Approximations from ReLUs

    We provide a systematic recipe for translating ReLU approximation results to softmax attention mechanism. This recipe covers many common approximation targets. Importantly, it yields target-specific, economic resource bounds beyond universal approximation statements. We showcase …

  11. Mastodon — fosstodon.org TIER_1 · [email protected] ·

    What is attention actually paying attention to? A 10-minute Manim walkthrough of Query, Key, Value, softmax, multi-head attention, and why long context gets expensive.

    What is attention actually paying attention to? A 10-minute Manim walkthrough of Query, Key, Value, softmax, multi-head attention, and why long context gets expensive. Watch: https://youtu.be/nFyr1tx2C-E Mirror: https://attention-mechanism-20260430.vercel.app/attention_mechani… [a minimal scaled dot-product attention sketch appears after the coverage list]
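
SKETCHES

The notes below are keyed to the coverage items above; each gives a standard formulation or a minimal measurement of the idea in question, not the cited paper's own construction or results.

For the attention-sink paper (items 1 and 8), a minimal NumPy sketch of how the phenomenon can be observed: measure the average attention mass that causal softmax attention assigns to the first token, and compare a random baseline against a run in which the first key is given an outsized norm. The function names and the key-scaling trick are illustrative assumptions; the mechanisms the paper actually proposes (variance discrepancy, super neurons, dimension disparity) are not modeled here.

```python
import numpy as np

def causal_softmax_attention(q, k):
    """Causal scaled dot-product attention weights for one head.

    q, k: (n, d) arrays of query and key vectors.
    Returns an (n, n) row-stochastic matrix with zeros above the diagonal.
    """
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    scores = np.where(np.tril(np.ones((n, n), dtype=bool)), scores, -np.inf)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

def first_token_mass(attn):
    """Average attention mass that queries place on the first token.

    Under uniform causal attention this equals H_n / n (about 0.07 for
    n = 64), so values far above that indicate a sink.
    """
    return float(attn[:, 0].mean())

rng = np.random.default_rng(0)
n, d = 64, 32
q, k = rng.standard_normal((n, d)), rng.standard_normal((n, d))

# Random queries and keys: no sink. Inflating the norm of the first key
# (a crude stand-in for the outlier statistics the paper analyzes) makes
# the first column absorb a large share of the attention mass.
k_sink = k.copy()
k_sink[0] *= 5.0
print("first-token mass, random keys:", round(first_token_mass(causal_softmax_attention(q, k)), 3))
print("first-token mass, outlier key:", round(first_token_mass(causal_softmax_attention(q, k_sink)), 3))
```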
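
Item 2 reads self-attention as a gradient flow on the unit sphere. A common way of writing such dynamics, taken from the general interacting-particle view of attention rather than from the paper itself (its abstract is truncated above), is the following, where P_x = I - x x^T projects velocities onto the tangent space so tokens stay on the sphere and beta plays the role of the inverse temperature set by the query-key scaling:

```latex
\[
\dot{x}_i(t) \;=\; P_{x_i(t)}\!\left( \frac{1}{Z_i(t)} \sum_{j=1}^{n}
    e^{\beta \langle x_i(t),\, x_j(t) \rangle}\, x_j(t) \right),
\qquad
Z_i(t) \;=\; \sum_{k=1}^{n} e^{\beta \langle x_i(t),\, x_k(t) \rangle}.
\]
```

The interaction is attractive for beta > 0, which is why tokens tend to cluster; multi-head variants replace the single inner product with head-specific bilinear forms, which is where the stability questions mentioned in the summary arise.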
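
Items 4 and 5 report that principal singular vectors of adjacent layers point in similar directions. The truncated abstracts do not specify the alignment metric, so the pairing below is an assumption: it compares the left singular vectors of layer l's weight matrix with the right singular vectors of layer l+1's, which live in the same activation space. The sketch only establishes the random-matrix baseline one would compare a trained residual network against.

```python
import numpy as np

def principal_alignment(w_prev, w_next, k=3):
    """|cos| between the top-k left singular vectors of w_prev and the
    top-k right singular vectors of w_next.

    w_prev: (d_out, d_in) weights of layer l; its left singular vectors
            live in layer l's output space.
    w_next: (d_next, d_out) weights of layer l+1; its right singular
            vectors live in that same space.
    """
    u_prev, _, _ = np.linalg.svd(w_prev, full_matrices=False)
    _, _, vt_next = np.linalg.svd(w_next, full_matrices=False)
    # Cosines between the leading k singular-vector pairs.
    return np.abs(np.sum(u_prev[:, :k] * vt_next[:k, :].T, axis=0))

rng = np.random.default_rng(0)
d = 256
w1 = rng.standard_normal((d, d)) / np.sqrt(d)
w2 = rng.standard_normal((d, d)) / np.sqrt(d)

# Independent random layers: leading directions are essentially unaligned,
# so |cos| is on the order of 1/sqrt(d). Geometric continuity as described
# in the abstract would show up as values much closer to 1 when w1, w2 are
# adjacent layers of a trained residual network.
print(principal_alignment(w1, w2))
```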
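
Item 7 starts from the observation that the Query projection W_Q can be fixed to the identity without hurting performance. The algebra behind that claim, for a single head with no biases and no positional rotations applied between the projections and the dot product (an assumption; schemes such as RoPE complicate it), is:

```latex
% Pre-softmax scores depend on W_Q and W_K only through their product:
\[
S \;=\; \frac{(X W_Q)(X W_K)^{\top}}{\sqrt{d_k}}
  \;=\; \frac{X \,\bigl(W_Q W_K^{\top}\bigr)\, X^{\top}}{\sqrt{d_k}} .
\]
% So replacing (W_Q, W_K) by (I, W_K W_Q^{\top}) leaves S unchanged:
\[
X \, I \, \bigl(X \, W_K W_Q^{\top}\bigr)^{\top}
  \;=\; X \, W_Q W_K^{\top} X^{\top},
\]
% at the cost of the key dimension growing from d_k to the model width d.
% This linear re-parameterization is what motivates asking whether a
% nonlinear query map could add capacity that no choice of W_Q can.
```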
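
For the Manim walkthrough in item 11, a minimal single-head NumPy version of the same pipeline: project tokens to queries, keys, and values, softmax-normalize the score matrix, and mix the values. The (n, n) score matrix is the reason long context gets expensive; shapes and parameter names here are illustrative rather than taken from the video.

```python
import numpy as np

def attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product attention over a length-n sequence.

    x: (n, d_model) token embeddings; w_q, w_k, w_v: (d_model, d_head).
    The intermediate (n, n) score matrix is why compute and memory grow
    quadratically with context length n.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])       # (n, n)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                            # (n, d_head)

rng = np.random.default_rng(0)
d_model, d_head = 64, 16
w_q, w_k, w_v = (rng.standard_normal((d_model, d_head)) for _ in range(3))

for n in (128, 256, 512):
    x = rng.standard_normal((n, d_model))
    out = attention(x, w_q, w_k, w_v)
    # Doubling n quadruples the number of score entries.
    print(f"n={n:4d}  output shape={out.shape}  score entries={n * n}")
```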