Researchers have developed a new method for understanding the internal workings of large language models by decomposing MLP activations. This technique, semi-nonnegative matrix factorization (SNMF), identifies interpretable features that are sparse combinations of co-activated neurons and maps them to their activating inputs. Experiments on models like Llama 3.1, Gemma 2, and GPT-2 demonstrated that SNMF-derived features are more effective for causal steering than existing methods, revealing a hierarchical structure in the models' activation spaces.
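The summary does not specify how the factorization is computed, only that it is a semi-nonnegative matrix factorization of an activation matrix. As a rough illustration, here is a minimal NumPy sketch of generic semi-NMF using the standard alternating updates (a closed-form least-squares step for the unconstrained factor and a multiplicative step for the nonnegative factor). All names and parameters are hypothetical; this is not the paper's implementation.

```python
import numpy as np

def semi_nmf(X, k, n_iter=300, seed=0, eps=1e-9):
    """Sketch of semi-NMF: X ~ F @ G.T with F unconstrained (sign-free)
    and G >= 0. In the SNMF-for-interpretability setting, X would be a
    matrix of MLP activations and G's columns sparse neuron combinations.
    This is a generic textbook-style solver, not the paper's method."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    G = rng.random((m, k))  # nonnegative coefficient matrix

    def pos(A):  # elementwise positive part
        return (np.abs(A) + A) / 2

    def neg(A):  # elementwise negative part
        return (np.abs(A) - A) / 2

    for _ in range(n_iter):
        # F has a closed-form least-squares update for fixed G
        F = X @ G @ np.linalg.pinv(G.T @ G)
        # Multiplicative update keeps G nonnegative
        XtF = X.T @ F
        FtF = F.T @ F
        G *= np.sqrt((pos(XtF) + G @ neg(FtF)) /
                     (neg(XtF) + G @ pos(FtF) + eps))
    return F, G
```

Because F is unconstrained, semi-NMF fits data whose entries take both signs (as MLP activations generally do), while the nonnegativity of G encourages parts-based, additive combinations of neurons.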
Summary written by gemini-2.5-flash-lite from 1 source.
IMPACT Introduces a novel, interpretable method for dissecting LLM internals, potentially improving model understanding and debugging.
RANK_REASON This is a research paper detailing a new method for analyzing LLM activations.