Brief

last 24h

[50/169] 223 sources

Multi-source AI news clustered, deduplicated, and scored 0–100 across authority, cluster strength, headline signal, and time decay.

RESEARCH · arXiv cs.CV English(EN) · 1d · [2 sources]

vesselFM-CT: Segmenting All Blood Vessels in CT Images for System-Level Cardiovascular Analysis

Researchers have developed vesselFM-CT, a novel model designed to segment all blood vessels within CT images. This advancement aims to overcome the limitations of previous studies that focused on isolated vascular segments, enabling a more comprehensive analysis of the entire cardiovascular system. The model utilizes an iterative training process and a new TubeLoss function to handle the diverse structural variations of blood vessels, from large arteries to minuscule mesenteric vessels. AI

IMPACT Enables comprehensive cardiovascular system analysis from CT scans, potentially improving disease classification and understanding of vascular physiology.
- Bastian Wittmann
- vesselFM-CT
RESEARCH · arXiv cs.CV English(EN) · 1d · [2 sources]

CapRL++: Unified Reinforcement Learning with Verifiable Rewards for Dense Image and Video Captioning

Researchers have developed CapRL++, a novel framework for training image and video captioning models using reinforcement learning with verifiable rewards. This approach moves beyond traditional supervised fine-tuning by using a vision-free language model to assess caption quality based on its ability to answer questions about the visual content. Evaluations across numerous benchmarks demonstrate that CapRL++ enhances caption quality and pretraining, leading to significant downstream performance gains and enabling smaller models to match the capabilities of much larger ones. AI

IMPACT This new training framework could lead to more capable and efficient vision-language models, improving accessibility and downstream applications.
RESEARCH · arXiv cs.CV English(EN) · 1d · [2 sources]

Echo-DM: Ultrasound Marker Removal via Conditional Latent Diffusion and Region-Aware Fusion

Researchers have developed Echo-DM, a novel framework for removing artificial markers from clinical ultrasound images. This method utilizes a conditional latent diffusion model combined with region-aware fusion to restore images without relying on masks, preserving anatomical details. Experiments on the Echo-PAIR dataset show Echo-DM outperforms existing methods in marker removal and anatomical fidelity, offering efficient deployment options. AI

IMPACT This new method could improve the accuracy of automated analysis in clinical ultrasound imaging by removing distracting artificial markers.
- Echo-DM
- Echo-PAIR
RESEARCH · arXiv cs.CV English(EN) · 1d · [2 sources]

ExDet: Open-Domain Open-Vocabulary Detection with Cross-modal Extrapolation and Rectification

Researchers have introduced ExDet, a novel framework designed to improve open-domain open-vocabulary detection (ODOVD) capabilities. This lightweight system enhances the generalization of existing detectors to new categories and unseen domains without requiring training from scratch. ExDet utilizes text-guided extrapolation to infer visual prototypes and a detector-compatible rectification module to adjust representations, achieving state-of-the-art results on several benchmark datasets. AI

IMPACT Enhances generalization for object detection models, potentially improving performance in real-world applications with novel objects and diverse environments.
- Objects365
- MSOSB
- OV-LVIS
- OD-LVIS
- arXiv
- ExDet
RESEARCH · arXiv cs.LG English(EN) · 1d · [2 sources]

Machine-Learning Emulation of Satellite Greenhouse Gas Retrievals: Stability over Time

Researchers have investigated the temporal stability of machine learning models used to emulate satellite-based greenhouse gas retrievals. Their study, using data from the Greenhouse Gases Observing SATellite (GOSAT), found that prediction accuracy degrades over time when models are tested on data outside their training period. Incorporating time as a feature significantly improved methane predictions, with a simple Lasso model outperforming more complex neural networks and demonstrating greater stability. AI

IMPACT Highlights the need for temporal validation in ML models for scientific applications, potentially impacting climate monitoring systems.
RESEARCH · arXiv cs.LG English(EN) · 1d · [2 sources]

PRISM: Topology-Aware Cross-Modal Imputation for Modality-Deficient Federated Graph Learning

Researchers have introduced PRISM, a novel framework for federated graph learning that addresses the challenge of modality deficiency across different clients. PRISM enables collaborative learning from decentralized graphs containing text and images, even when individual clients lack complete multimodal data. The framework proactively retrieves and imputes missing modality semantics from the federation, integrating them into local graph propagation with topology-aware control. Experiments demonstrate PRISM's effectiveness, showing an average improvement of 4.48% over state-of-the-art baselines on six multimodal graph datasets. AI

IMPACT Enhances collaborative learning from decentralized multimodal data, potentially improving AI applications that rely on diverse data sources.
RESEARCH · arXiv cs.LG English(EN) · 1d · [2 sources]

Internalizing Geometric Law: Learning from Solver Residuals for Precision-Critical Generation

Researchers have developed a new method called Saturating Additive Rewards (SAR) to improve the precision of large language models in geometric tasks. This approach addresses a failure mode known as Outlier Gradient Masking, where a single constraint violation can hinder learning across all constraints. SAR decomposes rewards into bounded per-constraint terms, preserving partial progress and ensuring consistent gradients. An 8B parameter model using SAR achieved a 2.3x improvement in solving complex geometric problems compared to standard MSE-based rewards. AI

IMPACT Enhances LLM capabilities in precision-critical domains, potentially enabling more reliable AI-driven design and technical diagramming.
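
The contrast between an MSE-based reward and a sum of bounded per-constraint terms can be illustrated with a small sketch. This is not the paper's implementation: the `exp(-|r| / scale)` saturating shape, the `scale` parameter, and the example residuals are all assumptions chosen for illustration, and the sketch compares reward differences rather than gradients.

```python
import math


def mse_reward(residuals):
    """Baseline: negative mean squared residual (unbounded below)."""
    return -sum(r * r for r in residuals) / len(residuals)


def saturating_additive_reward(residuals, scale=1.0):
    """Sum of bounded per-constraint terms, each in (0, 1].

    The exp(-|r| / scale) shape is a hypothetical choice, not taken
    from the paper.
    """
    return sum(math.exp(-abs(r) / scale) for r in residuals)


# Two candidates that share one large violation on the last constraint.
# "better" satisfies the first three constraints exactly; "poor" does not.
poor = [0.5, 0.5, 0.5, 100.0]
better = [0.0, 0.0, 0.0, 100.0]

# Under MSE the outlier dominates, so the improvement is a tiny
# fraction of the total reward.
mse_gap = mse_reward(better) - mse_reward(poor)
mse_relative_gap = mse_gap / abs(mse_reward(poor))

# Under the additive bounded reward, progress on the three satisfied
# constraints stays visible because the outlier contributes at most 1.
sar_gap = saturating_additive_reward(better) - saturating_additive_reward(poor)

print(f"MSE relative gap: {mse_relative_gap:.6f}")
print(f"Saturating additive gap: {sar_gap:.3f}")
```

In this toy setup the single violated constraint swamps the MSE signal, while the bounded per-constraint sum still separates the two candidates clearly.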
RESEARCH · arXiv cs.CV English(EN) · 1d · [2 sources]

Reason Twice: Segmentation via Candidate Discovery and Comparative Reasoning

Researchers have developed a novel two-stage framework called Rea2Seg for image segmentation tasks that leverage multimodal large language models (MLLMs). This approach first identifies candidate masks from an MLLM's attention maps and then uses the MLLM to reason over these candidates and select the most accurate one. To further evaluate and advance these capabilities, a new benchmark, ReasonSeg-SGDR, has been introduced to assess perception, grounding, and reasoning abilities across various dimensions. AI

IMPACT Introduces a new method for improving MLLM-based image segmentation and a benchmark to evaluate these models.
RESEARCH · arXiv cs.CV English(EN) · 1d · [2 sources]

Self-supervised Learning Matters: A Simple Ensemble Solution for Micro-Gesture Recognition

Researchers from XInsight Lab have developed a novel ensemble framework for micro-gesture recognition, achieving a new state-of-the-art result in the 4th MiGA Challenge at IJCAI 2026. Their approach integrates a self-supervised RGB model, pre-trained on a large unlabeled video dataset, with existing supervised models. This self-supervised component significantly improved performance, reaching 74.419% top-1 accuracy and outperforming previous benchmarks by over 1.2 percentage points. AI

IMPACT Demonstrates the effectiveness of self-supervised learning for specialized visual recognition tasks, potentially improving performance in areas like human-computer interaction.
RESEARCH · arXiv cs.CV English(EN) · 1d · [2 sources]

LiteVSR: Lightweight Adaptation of Frozen Diffusion Transformers for Video Super-Resolution

Researchers have developed LiteVSR, a new framework for adapting pre-trained diffusion transformers for video super-resolution tasks. This approach uses a lightweight State-Aware Adapter that requires significantly fewer trainable parameters and less training time compared to existing methods. LiteVSR leverages flow matching to efficiently adapt the frozen transformer, enabling competitive restoration quality with minimal computational resources. AI

IMPACT Offers a more computationally efficient method for adapting large generative models to specific video enhancement tasks.
RESEARCH · arXiv cs.LG English(EN) · 1d · [2 sources]

Counterfactual Reasoning for Fine-Grained Evidence Disentanglement in VideoQA

Researchers have developed a new framework called CREDiT to improve the reliability of video question-answering systems. This framework uses counterfactual reasoning and structural causal models to disentangle causal evidence from spurious correlations in video data. By decomposing representations into causal and non-causal components and employing feature-level causal interventions, CREDiT aims to create more trustworthy AI systems that can accurately localize evidence. AI

IMPACT Enhances the trustworthiness and accuracy of AI systems in understanding and reasoning about video content.
- SportsQA
- CREDiT
- VideoQA
- NExT-GQA
- SPORTU-video
RESEARCH · arXiv stat.ML English(EN) · 1d · [2 sources]

INFUSER: Influence-Guided Self-Evolution Improves Reasoning

Researchers have developed INFUSER, a novel framework for self-evolving language models that enhances reasoning capabilities. This iterative co-training system features a Generator that creates questions and answers from documents, and a Solver that learns from them. The Generator is rewarded based on an influence score, ensuring it produces questions that genuinely improve the Solver's performance, rather than just difficult ones. INFUSER demonstrated significant improvements, with an 8B model outperforming a larger 32B model on math and coding tasks. AI

IMPACT Enhances LLM reasoning capabilities by creating adaptive training curricula, potentially leading to more capable AI agents.
- Olympiad
- DuGRPO
- Qwen3-8B-Base
- SuperGPQA
- GRPO
RESEARCH · arXiv cs.CV English(EN) · 1d · [2 sources]

OmniGen-AR: AutoRegressive Any-to-Image Generation

Researchers have introduced OmniGen-AR, a novel autoregressive framework designed for versatile image generation. This unified model can synthesize images from various inputs, including text, segmentation maps, depth information, and even existing images for editing or video prediction. To prevent condition tokens from influencing content tokens, the framework employs Disentangled Causal Attention (DCA), a technique that separates attention mechanisms during training. OmniGen-AR has demonstrated state-of-the-art performance on benchmarks like GenEval and VBench. AI

IMPACT Introduces a unified framework for multi-modal image generation, potentially simplifying complex visual synthesis tasks.
RESEARCH · arXiv cs.CV English(EN) · 1d · [2 sources]

Ultra Flash: Scaling Real-Time Streaming Video Generation to High Resolutions

Researchers have introduced Ultra Flash, a novel cascaded streaming framework designed to generate high-resolution video in real-time. This system overcomes the limitations of previous models that were restricted to lower resolutions. Ultra Flash achieves impressive frame rates at 1K and 2K resolutions on a single GPU by employing a unique super-resolution training paradigm and a causal streaming latent upsampler. AI

IMPACT Enables real-time high-resolution video generation, potentially impacting content creation and streaming services.
RESEARCH · Hugging Face Daily Papers English(EN) · 1d · [3 sources]

EditSSC: Toward Editable Semantic Occupancy Scenes with Unconditional Diffusion Models

Researchers have developed EditSSC, a new method for generating and editing 3D semantic scenes using 2D Bird's Eye View (BEV) representations. This approach repurposes components from Stable Diffusion, enabling training-free editing capabilities like sketch-guided generation, inpainting, and outpainting. EditSSC demonstrates superior performance on unconditional generation compared to existing 3D-specific methods, highlighting the potential of 2D diffusion models for 3D scene manipulation. AI

IMPACT Enables more accessible and flexible 3D scene generation for applications like autonomous driving.
RESEARCH · Hugging Face Daily Papers English(EN) · 1d · [3 sources]

Vision-Language Guided Hyperspectral Object Tracking via Semantics Fusion and Contextual Template Updating

Researchers have developed VLHTrack, a new framework for hyperspectral object tracking that integrates vision and language models. This approach uses language priors to guide band selection, reducing redundancy and highlighting key spectral features. The system also incorporates a dynamic template update mechanism using Mamba to handle appearance variations and deformations in long sequences. Experiments show VLHTrack surpasses current state-of-the-art methods on benchmark datasets. AI

IMPACT Introduces a method for improving hyperspectral object tracking accuracy by using language priors for spectral band selection and Mamba-based dynamic template updating.
RESEARCH · arXiv stat.ML English(EN) · 1d · [2 sources]

Backward Coherence and Hidden-State Stability in Recurrent Neural Networks: A Quasi-Reverse-Martingale Theory

Researchers have developed a new theoretical framework called backward coherence to analyze hidden-state stability in recurrent neural networks (RNNs). This approach treats the hidden-state sequence as a quasi-reverse-martingale, enabling more stable and interpretable representations. Simulations and real-world data studies demonstrate that this method can significantly improve stability, reduce tracking errors, and enhance forecasting accuracy, particularly under concept drift. AI

IMPACT Introduces a theoretical framework to enhance stability and interpretability in RNNs, potentially improving performance in time-series forecasting and data analysis tasks.
RESEARCH · arXiv cs.AI English(EN) · 1d · [2 sources]

SAGE: Shape-Adapting Gated Experts for Adaptive Histopathology Image Segmentation

Researchers have developed two novel frameworks, SAGE and SegMoTE, to improve medical image segmentation. SAGE utilizes a dynamic expert routing system to adapt to variations in cell size and shape, achieving high Dice scores on multiple datasets. SegMoTE, on the other hand, efficiently adapts general segmentation models like SAM to medical imaging tasks with minimal learnable parameters and reduced annotation costs. Both approaches aim to enhance the accuracy and practicality of AI in clinical diagnostics. AI

IMPACT These new segmentation models offer improved accuracy and efficiency for clinical diagnostics, potentially reducing annotation costs and enhancing the deployment of AI in healthcare.
- Yujie Lu
- SegMoTE
- MedSeg-HQ
- SAM
- SAGE
- ConvNeXt
- Vision Transformer UNet
- Nguyen Vu
RESEARCH · arXiv cs.LG English(EN) · 1d · [2 sources]

Latent Geometry Beyond Search: Amortizing Planning in World Models

Researchers have developed new methods for long-horizon planning in world models, addressing limitations of existing techniques. One approach, FF-JEPA, uses a hierarchical structure with two forward dynamics models, including an action-free latent planner to predict subgoals, thus removing the need for explicit goal images and enabling planning over extended periods. Another method, building on a pretrained LeWorldModel, amortizes planning into a latent inverse-dynamics mapping, replacing iterative optimization with a faster, goal-conditioned inverse dynamics model that significantly reduces computational cost while maintaining or exceeding performance. AI

IMPACT These advancements could enable more sophisticated AI agents capable of complex, multi-step tasks in real-world environments.
- CEM
- Xiaohao Xu
- LeWorldModel
- iCEM
- arXiv
- FF-JEPA
RESEARCH · arXiv cs.AI English(EN) · 1d · [2 sources]

Enhancing Video Representations with Spatiotemporal-Semantic Residual to Mitigate Hallucinations in Video Large Multimodal Models

Researchers have developed new methods to combat hallucinations in large vision-language models (LVLMs). One approach, ViSSRes, enhances video representations using a lightweight network to improve spatiotemporal and semantic consistency, significantly reducing hallucination rates on benchmarks like EventHallusion. Another method focuses on refining textual embeddings to encourage better integration of visual information, leading to more balanced multimodal reasoning and improved performance on benchmarks such as MMVP and POPE. AI

IMPACT These methods offer potential solutions for improving the reliability and accuracy of multimodal AI systems, crucial for applications requiring precise visual understanding.
RESEARCH · arXiv cs.CV English(EN) · 1d · [4 sources]

SwiftVR: Real-Time One-Step Generative Video Restoration

Researchers have developed SwiftVR, a novel framework for real-time generative video restoration that addresses key bottlenecks in existing diffusion-based models. By employing mask-free shifted-window self-attention and a lightweight autoencoder, SwiftVR achieves high frame rates at resolutions up to 4K on powerful hardware and real-time 1080p streaming on consumer-grade GPUs. This advancement makes high-quality video restoration more accessible and practical for live streaming applications. AI

IMPACT Enables practical real-time video restoration on consumer hardware, potentially improving live streaming quality and accessibility.
- SwiftVR
- arXiv
- RTX 5090
- Hugging Face
RESEARCH · Google DeepMind English(EN) · 1d · [3 sources]

Introducing Gemma 4 12B: a unified, encoder-free multimodal model

Researchers have introduced IMUG-Bench, a new benchmark designed to evaluate unified multimodal models (UMMs) in complex, multi-turn image-text dialogue scenarios. Existing benchmarks often fall short by focusing on static or single-turn interactions, failing to capture the nuances of real-world applications. IMUG-Bench addresses this by assessing both understanding and generation capabilities across three classes of dialogue, revealing limitations in current UMMs, particularly regarding exposure bias in generation. The study also explores strategies like Chain-of-Thought and Self-Verification to improve UMM performance and mitigate these biases. AI

IMPACT Provides a new evaluation standard for multimodal models, potentially driving improvements in their ability to handle complex, interactive dialogues.
RESEARCH · 36氪 (36Kr) 中文(ZH) · 20h

WestJet Airlines plans to put its first Boeing 737 MAX 10 aircraft into service in early 2027

ChatGPT is reportedly poised for its most significant upgrade, with a substantial overhaul said to be imminent. The update is expected to go beyond conversational enhancements, suggesting a fundamental shift in its capabilities. Separately, China's Gaokao (the national college entrance exam) will use AI proctors capable of automatically capturing video footage of abnormal behavior. AI

IMPACT This major ChatGPT upgrade could redefine user expectations and applications, while AI proctoring signals a new era of automated oversight in education.
- AI
- ChatGPT
RESEARCH · 36氪 (36Kr) 中文(ZH) · 20h

Morgan Stanley: Whether the dollar rally can continue depends on the Fed's interest rate path

ChatGPT is poised for its most significant update, moving beyond simple chat functionalities. This upgrade is expected to be the largest in the model's history. Concurrently, financial institutions like Goldman Sachs and JPMorgan Chase are exploring financial products tied to the cost of computing power, specifically focusing on GPUs, which are critical for AI development. AI

IMPACT This major ChatGPT upgrade could significantly enhance AI capabilities, while new GPU-based financial products may impact AI infrastructure investment.
RESEARCH · Engadget English(EN) · 7h

2027 Rivian R2 first drive: Rivian's second SUV is its best yet

Rivian has unveiled its new R2 SUV, designed to be a more accessible and volume-oriented model compared to its R1 predecessors. The R2 will launch with a dual-motor performance trim starting at $57,990, offering 656 horsepower and an estimated 330 miles of range. While smaller than the R1S, it retains significant off-road capabilities with 9.6 inches of ground clearance and adaptive dampers, aiming to appeal to a broader market. AI

IMPACT Niche tooling improvement; minimal industry-wide impact.
- R1T
- BMW iX3
- Tesla Model Y
- Rivian
RESEARCH · Hugging Face Daily Papers English(EN) · 1d · [3 sources]

Echo-Memory: A Controlled Study of Memory in Action World Models

Researchers have introduced Echo-Memory, a framework designed to rigorously study memory mechanisms within action-conditioned world models. These models, which generate videos based on initial frames, text prompts, and action sequences, often struggle with memory retention, leading to inconsistencies when scenes are revisited. Echo-Memory isolates memory components by keeping other model aspects constant, allowing for a direct comparison of different memory storage and retrieval strategies. The study found that raw context serves as a strong baseline for capacity, and that aggressive compression can degrade performance, while block-wise state-space recurrence proved most effective for long-term memory recall. AI

IMPACT Provides a standardized protocol for evaluating memory in video generation models, potentially leading to more robust and consistent AI-generated content.
- Hugging Face
- arXiv
RESEARCH · arXiv cs.CV English(EN) · 1d · [2 sources]

MotionGPT-2: A General-Purpose Motion-Language Model for Motion Generation and Understanding

Researchers have developed MotionGPT-2, a large motion-language model designed to generate and understand human movements from text descriptions. This model integrates multimodal inputs like text and poses into a unified prompt system, enabling it to handle various motion-related tasks. MotionGPT-2 utilizes a novel motion discretization framework to ensure fine-grained control over body and hand movements, demonstrating effectiveness in generation, captioning, and completion tasks. AI

IMPACT These models advance the state-of-the-art in generating realistic human motion from text, with potential applications in animation, gaming, and virtual reality.
- T2LM
- Taeryung Lee
- Yuan Wang
- arXiv
- MotionGPT-2
RESEARCH · TLDR AI Italiano(IT) · 20h

OpenAI S-1 🇺🇸, Siri AI 📱, Xiaomi Ultraspeed ⚡

OpenAI has confidentially filed an S-1 with the SEC, indicating a potential future IPO without a set timeline. Apple is enhancing Siri with AI

IMPACT OpenAI's potential IPO could reshape AI investment landscapes, while Apple's Siri upgrade and Xiaomi's speed claims signal competitive advancements.
- Grok
- OpenAI
- Apple
- Xiaomi
- MiMo-V2.5-Pro-UltraSpeed
- ChatGPT
- Claude
- Sam Altman
- Jakub Pachocki
- xAI
RESEARCH · Hugging Face Daily Papers English(EN) · 1d · [4 sources]

End-to-End Context Compression at Scale

Researchers have developed Latent Context Language Models (LCLMs), a new family of encoder-decoder compressors designed to address memory bottlenecks in long-context language model inference. Through extensive architecture search and pre-training on over 350 billion tokens, these models achieve compression ratios of 1:4, 1:8, and 1:16. LCLMs improve upon existing methods by enhancing general-task performance, compression speed, and reducing peak memory usage, making them efficient backbones for long-horizon agents. AI

IMPACT Introduces a new method for efficient long-context processing, potentially enabling more capable and less memory-intensive AI agents.
RESEARCH · arXiv stat.ML English(EN) · 2d · [2 sources]

Improving the sharpness in neural network-based parametric post-processing of ensemble forecasts

Researchers have developed a new method to improve the sharpness of neural network-based ensemble weather forecasts. By adding a penalty term to the network's loss function, they can reduce the width of prediction intervals without sacrificing forecast accuracy. This technique was demonstrated using 2m temperature forecasts from the European Centre for Medium-Range Weather Forecasts, showing a significant decrease in prediction interval width. AI

IMPACT Enhances accuracy and reliability of weather prediction models, potentially improving disaster preparedness and resource management.
- EUPPBench
- European Centre for Medium-Range Weather Forecasts
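
A loss with an added sharpness penalty can be sketched as follows. The summary does not state the base loss, the predictive distribution, or the exact penalty, so everything here is an assumption: a Gaussian predictive distribution, the closed-form Gaussian CRPS as the base loss, the width of the central 80% interval as the penalty, and a hypothetical `weight` of 0.1.

```python
import math


def gaussian_crps(mu, sigma, y):
    """Closed-form CRPS of a N(mu, sigma^2) forecast for observation y."""
    z = (y - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return sigma * (z * (2.0 * cdf - 1.0) + 2.0 * pdf - 1.0 / math.sqrt(math.pi))


def interval_width(sigma, z_quantile=1.2816):
    """Width of the central 80% Gaussian interval (0.9 quantile ~ 1.2816)."""
    return 2.0 * z_quantile * sigma


def penalized_loss(mu, sigma, y, weight=0.1):
    """Assumed form: base loss plus weight times prediction-interval width."""
    return gaussian_crps(mu, sigma, y) + weight * interval_width(sigma)


# Same forecast mean and observation, two different predicted spreads.
narrow = penalized_loss(mu=15.0, sigma=1.0, y=15.5)
wide = penalized_loss(mu=15.0, sigma=3.0, y=15.5)
print(f"narrow: {narrow:.3f}, wide: {wide:.3f}")
```

The penalty grows linearly with the predicted spread, so `weight` controls how strongly the training objective favors narrower intervals relative to the base score.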
RESEARCH · arXiv cs.AI English(EN) · 2d · [2 sources]

Reinforcement Learning for Flow-Matching Policies with Density Transport

Researchers have developed a new online reinforcement learning algorithm called RLDT for fine-tuning flow-matching policies in continuous-control problems. This method frames policy improvement as a density transport problem, aligning with flow matching models. RLDT constructs a transport field using Stein Variational Gradient Descent and then fine-tunes a pretrained policy to match this field, outperforming existing baselines in reward quality and convergence speed across various robotic manipulation tasks. AI

IMPACT This new algorithm could improve the efficiency and effectiveness of reinforcement learning in complex continuous-control tasks, potentially accelerating progress in robotics and AI-driven automation.
RESEARCH · arXiv cs.LG English(EN) · 2d · [2 sources]

How Much Capacity Does EEG Denoising Need? Ultra-Compact Networks reveal Benchmark Saturation and Metric-Utility Gap

A new research paper explores the capacity needed for deep learning models in EEG denoising, finding that performance saturates with models as small as 3-6.5K parameters. Despite this, current architectures often scale to tens of millions of parameters without significant gains. Crucially, the reconstruction metrics used to evaluate denoising do not predict how useful the denoised signals are for downstream tasks like motor-imagery classification, and denoising may even degrade downstream performance. AI

IMPACT Highlights that current EEG denoising models may be over-parameterized and that standard evaluation metrics are insufficient for real-world applications, suggesting a need for more task-aware benchmarks.
RESEARCH · arXiv cs.LG English(EN) · 2d · [2 sources]

Titans-as-a-Layer: Test-Time Memory for Conversational Speech Emotion Recognition

Researchers have developed a novel method called Titans-as-a-Layer (MAL) to enhance conversational speech emotion recognition. This plug-and-play adapter integrates test-time neural memory into large audio language models without altering their core structure. The MAL adapter writes dialogue history into a small memory and uses it to provide contextual updates, significantly improving SER performance across various metrics and datasets. AI

IMPACT Enhances conversational AI by enabling more nuanced understanding of user emotion through dialogue context.
RESEARCH · arXiv cs.LG English(EN) · 2d · [2 sources]

Physics-Guided Dual Decoding and Spectral Supervision for Global 3D Hydrometeor Prediction

Researchers have developed PredHydro-Net, a novel deep learning framework designed to improve 3D hydrometeor forecasting. This physics-guided model addresses the limitations of standard deep learning in predicting extreme weather events by employing a dual-decoding architecture and spectral supervision. PredHydro-Net demonstrates superior performance compared to existing deep learning models and operational systems in detecting extreme events and accurately representing spatial textures, while also showing strong consistency with satellite data. AI

IMPACT Improves accuracy and spatial fidelity in extreme weather event prediction, offering a more robust approach to long-tailed atmospheric forecasting.
RESEARCH · arXiv cs.AI English(EN) · 2d · [2 sources]

EinSort: Sorting is All We Need for Tensorizing LLM

Researchers have developed EinSort, a novel method for compressing large language models by identifying inherent low-rank structures within their weights. This technique utilizes index ordering to discover these structures, which are often obscured by the models' immense scale and unstructured distributions. Experiments show that EinSort improves reconstruction quality for both model weights and KV-cache compression compared to existing methods. AI

IMPACT This method could lead to more efficient deployment and use of large language models by reducing their memory and computational footprint.
RESEARCH · arXiv cs.AI English(EN) · 2d · [2 sources]

When Video Misreads: Closed-Loop Distillation of Reading Heuristics for Exploratory Manipulation Trace QA

Researchers have developed a new method called Closed-Loop Trace Distillation to improve the ability of vision-language models (VLMs) to interpret robot actions from video and sensor data. This technique distills a natural-language prompt, known as a Distilled Reading Heuristic (DRH), from labeled training traces. When used with a frozen VLM, the DRH significantly enhances the accuracy of predicting minimal-success action chains, outperforming raw-modality baselines by up to 0.47 across various robotic tasks. AI

IMPACT Enhances VLM interpretation of robotic actions, potentially improving robot autonomy and task completion accuracy.
RESEARCH · arXiv cs.AI English(EN) · 2d · [2 sources]

Scaffold Effects on GAIA: A Controlled Comparison

A new study published on arXiv reveals that the way AI models are prompted, or "scaffolded," significantly impacts their measured performance. Researchers found that the choice of scaffold alone could alter a model's accuracy by up to 28 percentage points. Contrary to expectations, more capable models were not necessarily less sensitive to scaffolding, with some advanced models showing greater gains from structured prompts. The findings suggest that current capability scores may be overly dependent on the specific prompting methods used, rather than solely reflecting inherent model abilities. AI

IMPACT Highlights the critical role of prompting techniques in evaluating AI capabilities, suggesting current benchmarks may not fully capture true model potential.
RESEARCH · arXiv cs.LG English(EN) · 2d · [2 sources]

Autonomous Aerial Manipulation via Contextual Contrastive Meta Reinforcement Learning

Researchers have developed a novel meta-reinforcement learning approach called Aco2 for autonomous aerial manipulation. This system enables quadrotors to pick up, transport, and deliver various objects without human intervention. Aco2 utilizes a contextual observation encoder and a contrastive objective to adapt to different payloads and their associated flight dynamics, allowing for direct deployment from simulation to physical robots. AI

IMPACT This research could advance autonomous logistics and service robotics by enabling drones to handle diverse objects.
RESEARCH · arXiv cs.AI English(EN) · 2d · [2 sources]

GEAR-VLA: Learning Geometry-Aware Action Representations for Generalizable Robotic Manipulation

Researchers have developed GEAR-VLA, a new framework designed to improve the generalizability of Vision-Language-Action (VLA) models in robotic manipulation tasks. This approach addresses limitations in current VLA models by learning unified, geometry-aware action representations. GEAR-VLA utilizes a coarse-to-fine learning strategy, integrating embodied pretraining with a continuous action expert and aligning a 3D spatial backbone with the VLA representation. The framework also incorporates embodiment canonicalization to enable cross-robot generalization, demonstrating state-of-the-art performance on several benchmarks and achieving high success rates in tasks involving unseen objects and different robotic embodiments. AI

IMPACT Enhances generalization for robotic manipulation tasks by improving VLA models' ability to handle unseen objects and different embodiments.
RESEARCH · arXiv cs.CL English(EN) · 2d · [2 sources]

SAEExplainer: Interpreting SAE Features with Activation-Guided Preference Optimization

Researchers have introduced SAEExplainer, a new framework designed to improve the interpretability of Sparse Autoencoder (SAE) features within large language models. The method uses activation scores as a reward signal, enabling self-correction and iterative refinement of explanations. In experiments, SAEExplainer outperforms existing methods by reducing explanation hallucinations and reinforcing causal patterns. AI

IMPACT Enhances understanding of LLM internal workings, potentially leading to more reliable and debuggable AI systems.
RESEARCH · arXiv stat.ML English(EN) · 2d · [2 sources]

When Are Neural Interaction Discoveries Real? Identifiability, Recoverability, and a Pre-Fit Diagnostic

Researchers have developed a new diagnostic tool to determine if interactions identified by neural time-series models are genuine or artifacts of model flexibility. The method focuses on the geometry of the input data's support rather than the specific neural architecture used. A pre-fit diagnostic, based on the effective rank of the joint lag-block covariance, can predict the feasibility of recovering interaction terms before model fitting. AI

IMPACT Provides a method to validate findings from neural time-series models, ensuring discovered interactions are data-driven and not model artifacts.
- GNAVAR
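
For readers unfamiliar with effective rank, the sketch below shows one common way to compute it for a covariance built from lagged inputs. It assumes the entropy-based definition (the exponential of the Shannon entropy of the normalized eigenvalues); the summary does not say which definition or threshold the paper uses, and the helper names `lag_block_matrix` and `effective_rank` are hypothetical.

```python
import numpy as np


def lag_block_matrix(x, n_lags):
    """Stack lags 1..n_lags of a (T, d) series into a (T - n_lags, d * n_lags) matrix."""
    x = np.asarray(x, dtype=float)
    T = x.shape[0]
    # Block k holds lag k + 1 for every target time t = n_lags, ..., T - 1.
    blocks = [x[n_lags - k - 1 : T - k - 1] for k in range(n_lags)]
    return np.hstack(blocks)


def effective_rank(cov, eps=1e-12):
    """Entropy-based effective rank of a symmetric PSD matrix (an assumed definition)."""
    eigvals = np.clip(np.linalg.eigvalsh(cov), 0.0, None)
    total = eigvals.sum()
    if total <= 0:
        return 0.0
    p = eigvals / total
    p = p[p > eps]  # drop numerically zero directions before taking logs
    return float(np.exp(-(p * np.log(p)).sum()))


# Illustrative use on synthetic data: 3 variables, 2 lags.
rng = np.random.default_rng(0)
series = rng.normal(size=(500, 3))
lagged = lag_block_matrix(series, n_lags=2)
cov = np.cov(lagged, rowvar=False)
print(effective_rank(cov), "out of", cov.shape[0], "lagged dimensions")
```

An effective rank well below the number of lagged dimensions means the inputs occupy only a few directions, which is the kind of situation in which distinct interaction terms become hard to tell apart.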
RESEARCH · arXiv cs.AI English(EN) · 2d · [2 sources]

TimpaTeks: Automatic In-place Text Sequence Modification via Diffusion Language Model Steering

Researchers have developed TimpaTeks, a new method for modifying text in-place using diffusion language models (DLMs). This technique allows for concept steering within existing text sequences without requiring instruction-tuned models. Experiments on sentiment analysis and concept modification demonstrated TimpaTeks' effectiveness in altering text while maintaining sentence structure and reducing perplexity, offering a more computationally efficient alternative to prompt-based steering. AI

IMPACT Introduces a novel, computationally cheaper method for in-place text modification using DLMs, potentially impacting content generation and editing tools.
RESEARCH · arXiv cs.LG English(EN) · 2d · [2 sources]

Few-step Cofolding with All-Atom Flow Maps

Researchers have developed DeCAF, a new framework that accelerates the generation of 3D biomolecular structures. The method distills existing all-atom cofolding models into more efficient flow maps, significantly reducing computational cost and inference time. DeCAF has demonstrated improved accuracy and physical validity in predicting protein-ligand poses compared to previous diffusion-based models, while using fewer computational steps. AI

IMPACT Accelerates biomolecular structure prediction, potentially speeding up drug discovery and protein design.
RESEARCH · arXiv cs.AI English(EN) · 2d · [2 sources]

Self-Supervised Vision Transformers for CBCT-Based Detection of Temporomandibular Joint Osteoarthritis

Researchers have explored the effectiveness of self-supervised vision transformers, specifically the DINO family, for detecting temporomandibular joint osteoarthritis (TMJ OA) from cone-beam CT (CBCT) scans. Their study found that partially unfreezing the final two transformer blocks raised the classification Area Under the Curve (AUC) from 0.671 to 0.902. This adaptation strategy mattered more than the choice of backbone model, offering practical guidance for applying these models in low-data medical imaging scenarios. AI

IMPACT Demonstrates a novel approach for adapting foundation models to medical imaging, potentially improving diagnostic accuracy in low-data settings.
RESEARCH · arXiv cs.AI English(EN) · 2d · [2 sources]

Pre-Intervention Prediction of Sparse Autoencoder Steering Side Effects

Researchers have developed a new framework to predict side effects of using sparse autoencoders (SAEs) to steer language models. This method analyzes feature statistics before intervention to forecast issues like inconsistent behavior or perturbation of unrelated features. The study evaluated this predictive capability across several models, including GPT-2, Pythia, Gemma, and Llama, demonstrating that certain statistical measures can forecast steering modularity with varying success depending on the model and SAE dictionary. AI

IMPACT This research offers a method to improve the reliability of AI model steering, potentially leading to more controlled and predictable AI behavior.
RESEARCH · Hugging Face Daily Papers English(EN) · 2d · [3 sources]

Trajectory-Refined Distillation

Researchers have introduced Trajectory-Refined Distillation (TRD), a new method to improve the post-training process for large language models. TRD addresses a problem called "prefix failure" in on-policy distillation, where dense per-token supervision leads to fragmented gradients. By correcting student model rollouts at the trajectory level before distillation, TRD mitigates this issue and enhances exploration. The method has demonstrated consistent performance improvements across various benchmarks and model scales. AI

IMPACT Enhances LLM reasoning and accuracy by refining distillation techniques.
RESEARCH · arXiv cs.CL English(EN) · 2d · [2 sources]

Tensorizing Engram: Sharing Latents Across N-Gram Embeddings is Beneficial in LLMs

Researchers have introduced Tensorized Engram (TN-gram), a novel memory module for large language models designed to improve how they handle multi-token patterns. Unlike previous methods that use separate memory structures for different n-gram orders, TN-gram employs shared factors in a Canonical Polyadic form. This approach allows for more efficient encoding of n-gram embeddings and has demonstrated comparable or superior performance to existing Engram modules with significantly fewer parameters. AI

IMPACT This new memory module could lead to more efficient and powerful LLMs by improving their ability to process and recall multi-token sequences.
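
To illustrate why a Canonical Polyadic (CP) form needs far fewer parameters than a dense n-gram table, here is a minimal sketch. It assumes one factor matrix per token position that is reused across n-gram orders, plus a single output factor; this is one plausible reading of "shared factors" and not the paper's confirmed design, and all sizes and names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, rank, dim, max_n = 50, 8, 16, 3

# One (vocab, rank) factor per token position, reused by bigrams and trigrams.
position_factors = [rng.normal(size=(vocab, rank)) for _ in range(max_n)]
# Maps the rank-sized mixing weights to the embedding dimension.
output_factor = rng.normal(size=(rank, dim))


def ngram_embedding(tokens):
    """CP-form embedding: sum_r prod_k A_k[t_k, r] * B[r, :]."""
    weights = np.ones(rank)
    for k, tok in enumerate(tokens):
        weights = weights * position_factors[k][tok]
    return weights @ output_factor


bigram_vec = ngram_embedding([3, 7])
trigram_vec = ngram_embedding([3, 7, 11])

cp_params = max_n * vocab * rank + rank * dim
dense_bigram_params = vocab**2 * dim
print(cp_params, "CP parameters vs", dense_bigram_params, "for a dense bigram table")
```

The dense table grows with the vocabulary raised to the n-gram order, while the CP form grows linearly in vocabulary size, which is the source of the parameter savings the summary describes.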
RESEARCH · arXiv cs.LG English(EN) · 2d · [2 sources]

GENERIC-FNO: Embedding Energy Conservation and Entropy Production into Fourier Neural Operators

Researchers have developed GENERIC-FNO, a novel neural operator designed to embed the principles of nonequilibrium thermodynamics directly into function space. This model uniquely integrates reversible, energy-conserving dynamics with irreversible, entropy-producing dynamics, a feat not previously achieved in neural operators. GENERIC-FNO learns energy and entropy functionals and enforces exact structural guarantees, demonstrating high precision and outperforming existing baselines on various physical dynamics. AI

IMPACT Advances fundamental AI capabilities for simulating complex physical systems with guaranteed thermodynamic consistency.
RESEARCH · arXiv cs.LG English(EN) · 3d · [2 sources]

Towards Graph Foundation Models for Dynamics in Complex Networked Systems: Lessons from Super-Spreader Identification in Multilayer Networks

Researchers have introduced a new framework for Graph Foundation Models (GFMs) designed to handle network dynamics across different systems. Their approach, demonstrated by a model called ts-net, shows zero-shot generalization capabilities on real-world multilayer networks without retraining. This work addresses the limitations of current transductive models and outlines key challenges for future GFM development in this area. AI

IMPACT Enables more generalizable AI models for analyzing complex networked systems like social networks or biological systems.
- ts-net
- Graph Foundation Models
RESEARCH · arXiv cs.LG English(EN) · 3d · [2 sources]

GeoGNN: Time Series Geo-Localization using Two-Tower Graph Neural Networks

Researchers have developed GeoGNN, a novel two-tower graph neural network architecture for time series geolocalization. The method infers the geographic origin of time series data by learning embeddings from both geographic adjacency graphs and the time series themselves. Experiments on electricity consumption datasets show that GeoGNN improves geolocalization accuracy by approximately 27% on average. AI

IMPACT Introduces a new method for adding spatial context to time series data, potentially enabling location-aware applications.