Brief

last 24h

[50/769] 223 sources

Multi-source AI news clustered, deduplicated, and scored 0–100 across authority, cluster strength, headline signal, and time decay.

TOOL · arXiv cs.LG English(EN) · 16h

TAMUNA: Doubly Accelerated Distributed Optimization under Partial Participation

Researchers have developed a new algorithm called TAMUNA designed to improve the efficiency of distributed optimization and federated learning. TAMUNA addresses the communication bottleneck by combining local training and data compression techniques, while also supporting partial client participation. This approach allows for doubly accelerated convergence rates, outperforming previous methods that required all clients to be active.

IMPACT Introduces a novel algorithm that could enhance the efficiency of distributed AI training by allowing for partial client participation.
- Laurent Condat
TOOL · arXiv cs.LG English(EN) · 16h

Neural Legendre-Fenchel transform with Hessian Preconditioning

Researchers have developed a new method for approximating the Legendre-Fenchel transform, a key tool in convex analysis and machine learning. Their approach utilizes neural networks and introduces a Hessian-based preconditioning strategy to improve accuracy, especially for ill-conditioned functions. This method involves an affine deformation around a function's minimizer, simplifying the conjugation map and allowing a residual network to learn it more effectively. Experiments show enhanced convergence rates and numerical accuracy, particularly for challenging problems, with minimal computational overhead.

IMPACT Enhances numerical methods for optimization problems, potentially improving performance in machine learning tasks that rely on convex analysis.
- Legendre-Fenchel transform
- neural networks
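
For readers unfamiliar with the transform in the item above: the Legendre-Fenchel conjugate of f is f*(y) = sup over x of (x*y - f(x)). The following is a minimal brute-force sketch in one dimension, not the neural method the paper describes. The function name `conjugate_on_grid` and the grid sizes are illustrative assumptions.

```python
def conjugate_on_grid(f, xs, ys):
    """Approximate f*(y) = max over grid points x of (x*y - f(x))."""
    return [max(x * y - f(x) for x in xs) for y in ys]


def quadratic(x):
    # f(x) = x^2 / 2 is its own conjugate, which makes it a handy sanity check.
    return 0.5 * x * x


xs = [i / 100.0 for i in range(-500, 501)]  # search grid on [-5, 5]
ys = [-2.0, -0.5, 0.0, 1.0, 2.0]            # slopes at which to evaluate f*

approx = conjugate_on_grid(quadratic, xs, ys)
exact = [0.5 * y * y for y in ys]
max_err = max(abs(a - e) for a, e in zip(approx, exact))
```

The grid search costs one pass over `xs` per slope, which is the kind of cost a learned approximation aims to avoid in higher dimensions.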
TOOL · arXiv cs.LG English(EN) · 16h

Disjoint Generation of Synthetic Data

Researchers have introduced a novel framework for creating synthetic tabular datasets using disjoint generative models. This approach partitions data into separate subsets, each processed by distinct generative models before being combined via a joining operation that doesn't require common identifiers. The method enhances privacy, improves computational feasibility, and allows for mixed-model synthesis, achieving competitive accuracy and utility while significantly reducing re-identification risk.

IMPACT Introduces a new method for generating synthetic data that improves privacy and utility, potentially impacting data sharing and model training.
- Anton Danholt Lautrup
TOOL · arXiv cs.CV English(EN) · 16h

TIDE: Task-Isolated Diffusion for Unified Video Editing and Generation

Researchers have developed TIDE, a novel framework designed to unify video editing and generation tasks within a single model. TIDE utilizes per-token task embeddings to differentiate between various conditioning inputs, such as target, source, and reference tokens. The framework also employs a dual-path conditioning scheme and a progressive multi-task training strategy to enhance its ability to handle diverse video manipulation objectives and achieve state-of-the-art results across multiple benchmarks.

IMPACT Introduces a unified framework for video editing and generation, potentially simplifying workflows and improving performance across diverse tasks.
TOOL · arXiv cs.CV English(EN) · 16h

Beyond Scalar Rewards by Internalizing Reasoning into Score Distributions

Researchers have developed a new framework called Z-Reward for improving text-to-image generation models. This system uses a teacher-student approach where a large vision-language model (VLM) acts as the teacher, inferring score distributions based on reasoning. A smaller student VLM is then trained to mimic these distributions, enabling efficient reward deployment without requiring explicit reasoning during inference. The Z-Reward framework demonstrated significant improvements in human preference accuracy compared to existing methods and enhanced text-to-image optimization.

IMPACT Introduces a novel reward modeling technique that could enhance the quality and controllability of text-to-image generation models.
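
To make the idea of a student matching a teacher's score distribution concrete, here is a small sketch. It assumes a discrete 1-to-5 score scale and a KL-divergence objective. Both are illustrative assumptions, since the summary does not state the scale or the loss, and the example distributions are made up.

```python
import math


def expected_score(probs, scores):
    """Collapse a score distribution to a single scalar reward."""
    return sum(p * s for p, s in zip(probs, scores))


def kl_divergence(teacher, student, eps=1e-12):
    """KL(teacher || student) over a discrete score scale."""
    return sum(
        t * math.log(t / max(s, eps))
        for t, s in zip(teacher, student)
        if t > 0
    )


scores = [1, 2, 3, 4, 5]                # hypothetical rating scale
teacher = [0.0, 0.0, 0.5, 0.5, 0.0]     # made-up teacher distribution
student = [0.05, 0.05, 0.4, 0.4, 0.1]   # made-up student distribution

teacher_mean = expected_score(teacher, scores)
mismatch = kl_divergence(teacher, student)
```

Driving `mismatch` toward zero makes the student reproduce the whole distribution, which carries more information than the scalar `teacher_mean` alone.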
TOOL · arXiv cs.CV English(EN) · 16h

Crayotter: Traceable Multi-Agent Workflows for Long-Form Video Editing

Researchers have developed Crayotter, an open-source system designed to streamline long-form video editing through a multi-agent approach. This system organizes the editing process into distinct phases, ensuring narrative intent is maintained and providing detailed artifacts for traceability and failure diagnosis. Evaluations show Crayotter outperforms existing tools in theme alignment, narrative coherence, and editing smoothness.

IMPACT Introduces a novel multi-agent system for video editing, potentially improving efficiency and quality in content creation workflows.
TOOL · arXiv cs.CV English(EN) · 16h

Harnessing Streaming Video in the Wild

Researchers have developed a new framework called Streaming Harness to enable Vision-Language Models (VLMs) to process unbounded video streams in real-time. This system enhances VLMs with proactive interaction, long-term memory retention up to 12 hours, and sub-second processing latency. To support this advancement, they also introduced a new streaming dataset, Streaming-Train-248K, and a benchmark, Streaming-Eval, to drive further progress in deployable streaming intelligence.

IMPACT Enables real-time analysis of live video feeds for applications like assistants and robotics, moving beyond offline video understanding.
TOOL · arXiv cs.LG English(EN) · 16h

CAAL: Contextual Bandits based Online Hand-Craft Active Learning Strategy Selection

Researchers have developed a new active learning strategy called CAAL, which uses contextual bandits to dynamically select the best hand-crafted strategy for labeling data. This approach addresses the challenge of uncertain data distributions by predicting rewards based on external context information. CAAL has demonstrated superior performance compared to existing adaptive strategies on public datasets, with results remaining consistent across different batch sizes.

IMPACT Introduces a novel method for improving data labeling efficiency in machine learning.
- Contextual Adaptive Active Learning
- CAAL
TOOL · arXiv cs.LG English(EN) · 16h

Fourier Neural Operators with rank-1 lattice points and hyperbolic cross

Researchers have developed a new approach to Fourier Neural Operators (FNOs) that improves their efficiency and accuracy. By replacing standard tensor product grids with rank-1 lattice points and using a hyperbolic cross frequency index set, the method requires fewer parameters and training samples. This lattice-based hyperbolic-cross FNO architecture simplifies the high-dimensional Fourier transform into a single one-dimensional fast Fourier transform, demonstrating benefits for solving partial differential equations.

IMPACT This research could lead to more efficient and accurate AI models for scientific simulations and complex problem-solving.
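
The reduction to a one-dimensional transform mentioned above follows from how rank-1 lattice points are built: with points x_j = (j * z / n) mod 1, the phase k·x_j depends only on (k·z) mod n, so each multi-dimensional Fourier sum is one entry of a 1D DFT of the samples. The sketch below illustrates this in two dimensions. The generating vector `z = (1, 5)`, the size `n = 13`, and the small frequency list are arbitrary choices for the demonstration, not values from the paper, and a naive DFT stands in for the FFT.

```python
import cmath
import math


def rank1_lattice(n, z):
    """Points x_j = (j * z / n) mod 1 for j = 0..n-1."""
    return [tuple((j * zi % n) / n for zi in z) for j in range(n)]


def dft_1d(values):
    """Naive O(n^2) 1D DFT; an FFT would replace this in practice."""
    n = len(values)
    return [
        sum(v * cmath.exp(-2j * math.pi * j * m / n) for j, v in enumerate(values))
        for m in range(n)
    ]


def lattice_coefficients(samples, z, freqs):
    """Read multi-dimensional Fourier coefficients off a single 1D DFT."""
    n = len(samples)
    spectrum = dft_1d(samples)
    return {
        k: spectrum[sum(ki * zi for ki, zi in zip(k, z)) % n] / n
        for k in freqs
    }


n, z = 13, (1, 5)
points = rank1_lattice(n, z)
# Test function with known coefficients: 0.5 at (1, 2) and at (-1, -2).
samples = [math.cos(2 * math.pi * (x1 + 2 * x2)) for x1, x2 in points]
freqs = [(0, 0), (1, 0), (1, 2), (-1, -2)]
coeffs = lattice_coefficients(samples, z, freqs)
```

The recovery is exact here only because the chosen frequencies map to distinct indices modulo `n`; picking the lattice so that this holds for the whole frequency index set is part of the design problem.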
TOOL · arXiv cs.LG English(EN) · 16h

Learning to Solve Generative ODEs Beyond the Linear Span

Researchers have developed SpanLift, a new neural solver designed to improve the efficiency of generative models. Current models integrate learned Ordinary Differential Equations (ODEs), but this process is slow due to the need for many sequential evaluations. SpanLift addresses this by augmenting standard updates with a spatial residual operator, allowing it to capture components beyond the linear span of buffered velocity evaluations. This method has demonstrated state-of-the-art few-step sampling across various applications, significantly improving metrics like FID scores on datasets such as CIFAR-10 and ImageNet with minimal model evaluations.

IMPACT Improves sampling efficiency for generative models, potentially reducing computational costs and enabling faster generation of high-quality outputs.
TOOL · arXiv cs.LG English(EN) · 16h

OTora: A Unified Red Teaming Framework for Reasoning-Level Denial-of-Service in LLM Agents

Researchers have developed OTora, a novel framework designed to test the resilience of large language model (LLM) agents against a specific type of attack known as Reasoning-Level Denial-of-Service (R-DoS). This attack method aims to degrade an agent's performance by artificially increasing its reasoning depth or tool usage, rather than by causing outright task failure. OTora employs a two-stage process, utilizing adversarial triggers and genetic search to amplify overthinking while maintaining task accuracy, demonstrating significant latency increases on various agent benchmarks.

IMPACT This research highlights a new vulnerability in LLM agents, potentially impacting the reliability and efficiency of deployed AI systems.
TOOL · arXiv cs.LG English(EN) · 16h

Pointwise Complexity for Gaussian Fields: Upper Envelopes, Algorithmic Lower Bounds, and Separation

Researchers have developed a new theorem for understanding Gaussian processes, offering a more precise high-probability envelope for the entire field rather than just a scalar quantity. This theorem refines existing generic chaining methods and provides a Gaussian process equivalent to pointwise empirical-process bounds used in deep neural networks. Additionally, the study introduces a Bayesian algorithmic lower envelope derived from the interactive Fano/data-processing principle, which offers local-geometric certificates of pointwise complexity for estimators in overparameterized classes.

IMPACT Provides theoretical underpinnings for understanding complexity in AI models, potentially improving estimator design.
TOOL · arXiv cs.LG English(EN) · 16h

ForcingDAS: Unified and Robust Data Assimilation via Diffusion Forcing

Researchers have developed ForcingDAS, a new framework for data assimilation that unifies filtering and smoothing approaches. This method uses Diffusion Forcing to learn a joint-trajectory prior, which helps in capturing long-horizon temporal dependencies and reducing error accumulation, unlike traditional frame-to-frame transition models. ForcingDAS has demonstrated competitive or superior performance compared to specialized baselines across various applications, including weather forecasting and atmospheric state estimation, by using a single trained model for the entire spectrum of inference tasks.
- Yixuan Jia
- ForcingDAS
TOOL · arXiv cs.LG English(EN) · 16h

Causal Representation Learning from Network Data

Researchers have developed GraCE-VAE, a novel graph-aware causal discrepancy variational autoencoder designed to improve causal disentanglement from soft interventions. This method leverages known interaction networks, such as biological pathways, as an auxiliary view to enhance inference. Experiments on CRISPR perturbation datasets show that incorporating structured biological context leads to better predictions of interventional outcomes, even for novel perturbation combinations.

IMPACT Enhances causal inference capabilities by integrating network structures, potentially improving predictive accuracy in complex systems.
- Jifan Zhang
- GraCE-VAE
TOOL · arXiv cs.LG English(EN) · 16h

Overcoming the Limits of Finite Difference Method; Physics-Informed Neural Network for Noisy High-Dimensional Heat Diffusion

Researchers have developed a Physics-Informed Neural Network (PINN) framework to address the limitations of traditional numerical methods like the Finite Difference Method (FDM) when dealing with noisy, high-dimensional heat diffusion problems. In simulations with 20% boundary noise in 3D, the PINN maintained approximately 91% accuracy, while FDM accuracy dropped to 36%. The PINN also demonstrated superior performance in a physical copper thermal system, reducing boundary reconstruction error by a factor of 3.3 under realistic noise conditions, and proved more efficient than FDM in 3D scenarios.

IMPACT PINN framework offers a more accurate and efficient solution for complex thermal simulations, potentially impacting engineering and scientific modeling.
TOOL · arXiv cs.LG English(EN) · 16h

DHAuDS: A Dynamic and Heterogeneous Audio Benchmark for Test-Time Adaptation

Researchers have introduced DHAuDS, a new benchmark suite designed to evaluate the robustness of test-time adaptation (TTA) in audio classification. Unlike existing benchmarks that use static and homogeneous corruption protocols, DHAuDS models realistic heterogeneous acoustic degradation under dynamic corruption severity. The goal is to provide a more accurate assessment of TTA algorithms' real-world performance by exposing limitations that are masked by conventional evaluation methods.

IMPACT Provides a more realistic evaluation framework for audio AI models, potentially leading to more robust real-world applications.
- DHAuDS
- Weichuang Shao
TOOL · arXiv cs.CV English(EN) · 16h

Vision-Language Work Zone Intelligence for Safety-Critical Speed Regulation of Mixed-Autonomy Vehicles in Dynamic Environments

Researchers have developed a new system to improve safety in work zones for both human drivers and autonomous vehicles. The system uses onboard perception to detect active work zones and recognize temporary speed limits, even when signage is inconsistent or missing from digital maps. It fuses object detection with semantic verification and temporal smoothing to ensure reliable operation in dynamic environments, running on low-cost embedded hardware.

IMPACT This system could significantly improve safety in dynamic work zones by providing real-time speed limit awareness to both human and autonomous drivers.
- ROADWork dataset
- Angel Martinez-Sanchez
TOOL · arXiv cs.CV English(EN) · 16h

IMAGINE: Adaptive Schema-Imagery Enhanced Composition for Composed Video Retrieval

Researchers have developed IMAGINE, a novel network designed for Composed Video Retrieval (CVR) and Composed Image Retrieval (CIR). This system addresses the limitation of existing methods by incorporating implicit semantic information, which is often conveyed through visually related cues rather than explicit representations. IMAGINE utilizes dynamic multimodal prototypes to capture these shared latent concepts, adaptively modulating visual features to guide the retrieval process more effectively. The approach has demonstrated state-of-the-art performance on three major benchmarks for both CVR and CIR tasks.

IMPACT Enhances video and image retrieval by incorporating implicit semantic understanding, potentially improving search accuracy in multimodal AI systems.
TOOL · arXiv cs.CV English(EN) · 16h

Less Is More: Training-Free Acceleration Framework of 3D Diffusion Models for Low-Count PET Denoising via Global-Local Trajectory Reduction

Researchers have developed a novel framework to accelerate 3D diffusion models for low-count PET image denoising. This training-free approach, called the Global-Local Skipping Strategy, significantly reduces inference latency without requiring model retraining. The method employs a global denoising step skipping strategy and a local feature reuse shortcut to achieve over an order of magnitude acceleration while maintaining or improving reconstruction quality. Blinded reader studies confirmed enhanced clinical confidence and diagnostic quality.

IMPACT Accelerates AI model inference for medical imaging, potentially enabling faster and more accurate diagnoses from lower-radiation PET scans.
- Global-Local Skipping Strategy
- 3D Diffusion Models
TOOL · Mastodon — fosstodon.org English(EN) · 7h

"Unraveling the Ai2 Asta Scholarly Research Assistant Citation System" 10 domain-specific queries were submitted to Asta's Summarise Literature feature, & 2 ind

A study examined the citation system of the AI2 Asta scholarly research assistant. Researchers found that Asta exhibits high citation intensity, moderate diversity in its bibliographic references, and significant instability when queries are repeated. These findings were based on 10 domain-specific queries and two rounds of data collection.

IMPACT This analysis of AI2 Asta's citation system highlights potential issues with stability and diversity in scholarly research tools.
- AI2 Asta
- Asta
TOOL · arXiv cs.CV English(EN) · 16h

Hyperspectral Smoke Segmentation via Mixture of Prototypes

Researchers have developed a new method for hyperspectral smoke segmentation, crucial for wildfire management and industrial safety. Existing visible-light methods struggle with semi-transparent smoke and cloud interference. The proposed Mixture of Prototypes (MoP) network addresses spectral contamination, limited pattern modeling, and complex weighting issues by employing band splitting, prototype-based spectral representation, and a dual-stage router for adaptive band weighting. This approach demonstrates superior performance on both hyperspectral and multispectral data, establishing a new standard for spectral-based smoke segmentation.

IMPACT This research could lead to more accurate wildfire detection and industrial safety monitoring systems.
TOOL · arXiv cs.CV English(EN) · 16h

Hummus: A Dataset of Humorous Multimodal Metaphor Use

Researchers have introduced the Hummus Dataset, a new collection of 1,000 image-caption pairs designed to evaluate multimodal large language models (MLLMs) on their understanding of humorous multimodal metaphors. The dataset, inspired by theories of humor and metaphor, was created using an expert-developed annotation scheme. Initial experiments using the Hummus Dataset revealed that current MLLMs struggle to effectively integrate visual and textual information to comprehend humorous multimodal metaphors.

IMPACT Highlights current limitations in AI's ability to understand nuanced humor and metaphor, indicating areas for future model development.
TOOL · arXiv cs.CV English(EN) · 16h

Embedded Graph Convolutional Networks for Real-Time Event Data Processing on SoC FPGAs

Researchers have developed an embedded graph convolutional network (EFGCN) specifically designed for real-time event data processing on System-on-Chip (SoC) FPGAs. This approach reduces model size by up to 100-fold compared with previous methods, while maintaining competitive accuracy on classification tasks. The EFGCN achieves high throughput and low latency, making it suitable for embedded systems, particularly in the automotive sector.

IMPACT Enables more efficient real-time AI processing on edge devices with limited resources.
- SoC FPGAs
- EFGCN
- PointNetConv
- AEGNN
- N-Caltech101
- ZCU104
- TinyML
TOOL · arXiv cs.CV English(EN) · 16h

Polaffini: A feature-based approach for robust affine and polyaffine image registration

Researchers have introduced Polaffini, a new framework for robust medical image registration that leverages deep learning advancements. This approach uses centroids of segmented anatomical regions to establish feature points, enabling efficient affine and polyaffine transformations. Polaffini demonstrates superior structural alignment and provides improved initialization for subsequent non-linear registration, outperforming traditional intensity-based methods in speed and accuracy.

IMPACT Enhances medical image processing pipelines with more accurate and efficient registration techniques.
- Polaffini
- Antoine Legouhy
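
As a rough illustration of how matched region centroids can determine an affine transform, the sketch below fits a 2D affine map by ordinary least squares. It is an assumed stand-in, not Polaffini's actual estimator: the helper names `solve_linear` and `fit_affine_2d`, the 2D setting, and the sample centroids are all hypothetical.

```python
def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_affine_2d(src, dst):
    """Least-squares affine map; returns [[a11, a12, t1], [a21, a22, t2]]."""
    rows = [[x, y, 1.0] for x, y in src]
    gram = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    params = []
    for axis in range(2):
        rhs = [sum(r[i] * d[axis] for r, d in zip(rows, dst)) for i in range(3)]
        params.append(solve_linear(gram, rhs))
    return params


# Hypothetical centroids in the moving image and their matches in the fixed image.
true_map = [[1.1, 0.2, 5.0], [-0.1, 0.9, -3.0]]
src = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (2.0, 3.0)]
dst = [tuple(a * x + b * y + t for a, b, t in true_map) for x, y in src]

estimate = fit_affine_2d(src, dst)
```

Because the fit only needs a handful of point pairs, it is cheap compared with optimizing an intensity-based similarity over whole images.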
TOOL · arXiv cs.CV English(EN) · 16h

Region-Wise Correspondence Prediction between Manga Line Art Images

Researchers have developed a novel Transformer-based framework to predict region-wise correspondences between manga line art images. This method addresses the challenge of aligning sparse black-and-white strokes, which lack the rich visual cues found in natural images. The system achieves high accuracy in patch-level feature alignment and robust region-level correspondence, demonstrating potential for applications in manga colorization and animation.

IMPACT This method could improve efficiency and quality in digital manga and animation production pipelines.
- Transformer
- Yingxuan Li
TOOL · arXiv cs.CV English(EN) · 16h

CardioMorphNet: Cardiac Motion Prediction Using a Shape-Guided Bayesian Recurrent Deep Network

Researchers have developed CardioMorphNet, a novel Bayesian recurrent deep learning framework for predicting cardiac motion from short-axis cardiac MRI images. This method utilizes a recurrent variational autoencoder and posterior models for segmentation and motion estimation, guiding the network to focus on anatomical regions without relying on intensity-based registration. CardioMorphNet has demonstrated superior performance in motion estimation and clinical index accuracy compared to existing state-of-the-art methods, while also providing uncertainty maps for its predictions.

IMPACT This new framework offers improved accuracy and uncertainty assessment for cardiac motion estimation, potentially aiding in earlier and more precise diagnosis of cardiac abnormalities.
TOOL · arXiv cs.CV English(EN) · 16h

Muses: Designing, Composing, Generating Nonexistent Fantasy 3D Creatures without Training

Researchers have developed Muses, a novel method for generating 3D fantasy creatures without requiring any training data. This approach utilizes a 3D skeleton to guide the composition and generation of diverse elements, ensuring a coherent structure and appearance. Muses integrates design, composition, and generation into a unified pipeline, starting with a graph-constrained reasoning process to create a well-structured skeleton, followed by a voxel-based assembly within a latent space, and concluding with appearance modeling for style-consistent texturing. The method demonstrates state-of-the-art performance in visual fidelity and alignment with textual descriptions.

IMPACT Introduces a training-free method for 3D asset generation, potentially simplifying content creation pipelines.
- Muses
- Hexiao Lu
TOOL · Mastodon — fosstodon.org English(EN) · 6h

AI is driving critical decisions, but complex models are often black boxes. In sectors like healthcare & finance, trust is the ultimate metric. How do we explai

Researchers have developed an interactive application to demystify complex AI models, particularly in sensitive fields like healthcare and finance where trust is paramount. The tool pairs XGBoost models with the explanation libraries ELI5 and SHAP to explain AI-driven decisions, focusing on methods like Permutation Importance and partial dependence plots (PDP) to ensure transparency and auditability.

IMPACT Enhances trust and auditability in AI applications, crucial for adoption in regulated industries like healthcare and finance.
- XGBoost
- ELI5
- SHAP
- Streamlit
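
Permutation importance, one of the methods named above, measures how much a model's error grows when a single feature column is shuffled. The sketch below is a hand-rolled illustration with a toy model. It does not use ELI5, SHAP, or XGBoost, and the function name, the toy model, and the data are assumptions made for the example.

```python
import random


def permutation_importance(predict, rows, targets, n_features, seed=0):
    """Increase in mean squared error when one feature column is shuffled."""
    rng = random.Random(seed)

    def mse(data):
        return sum((predict(r) - t) ** 2 for r, t in zip(data, targets)) / len(targets)

    baseline = mse(rows)
    importances = []
    for f in range(n_features):
        column = [r[f] for r in rows]
        rng.shuffle(column)  # break the link between feature f and the target
        shuffled = [r[:f] + [v] + r[f + 1:] for r, v in zip(rows, column)]
        importances.append(mse(shuffled) - baseline)
    return importances


def toy_model(row):
    # Hypothetical model that depends only on the first feature.
    return 3.0 * row[0]


data_rng = random.Random(1)
rows = [[data_rng.random(), data_rng.random()] for _ in range(200)]
targets = [toy_model(r) for r in rows]

scores = permutation_importance(toy_model, rows, targets, n_features=2)
```

Here `scores[0]` is positive and `scores[1]` is zero, matching the fact that the toy model ignores the second feature.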
TOOL · arXiv cs.CV English(EN) · 16h

COMPASS: Complete Multimodal Fusion via Proxy Tokens and Shared Spaces for Ubiquitous Sensing

Researchers have developed COMPASS, a novel framework designed to enhance multimodal sensing by addressing the challenge of missing data modalities. This system ensures a consistent fusion interface by using proxy tokens to fill in absent modalities with estimated representations derived from the observed ones. COMPASS demonstrates improved robustness across various datasets and missing modality scenarios, outperforming traditional imputation and translation-based methods.

IMPACT Enhances robustness in multimodal AI systems by providing a consistent method for handling missing data during fusion.
- Hao Wang
TOOL · arXiv cs.CV English(EN) · 16h

Coarse-to-Fine Hierarchical Alignment for UAV-based Human Detection using Diffusion Models

Researchers have developed a novel three-stage diffusion model framework called Coarse-to-Fine Hierarchical Alignment (CFHA) to improve human detection in drone imagery. This method addresses the challenge of domain gap between synthetic and real-world data by using diffusion models for style transfer and local refinement. CFHA aims to enhance the accuracy of object detectors trained on synthetic data, leading to significant improvements in detection performance on public benchmarks.

IMPACT Enhances drone-based human detection accuracy by bridging the synthetic-to-real data gap using diffusion models.
TOOL · arXiv cs.CV English(EN) · 16h

HiMat: DiT-based Ultra-High Resolution SVBRDF Generation

Researchers have developed HiMat, a new framework for generating ultra-high-resolution (4K) spatially varying bidirectional reflectance distribution functions (SVBRDFs). This method addresses the computational and memory challenges of creating detailed 3D content by operating in a compressed latent space and using a diffusion transformer with linear attention for efficiency. HiMat also incorporates a novel convolutional module called CrossStitch to ensure consistency across different reflectance maps without the overhead of global attention, outperforming prior methods in fidelity, efficiency, and diversity.

IMPACT Enables more efficient and detailed 3D content creation, potentially impacting real-time rendering and virtual environments.
- Zixiong Wang
TOOL · arXiv cs.CV English(EN) · 16h

STGBD-Net: Spatio-temporal Gradient Basis Decomposition Network for Infrared Small Target Detection

Researchers have developed a novel framework for infrared small target detection (IRSTD) called STGBD-Net, which utilizes Basis Decomposition Theory to improve feature fusion. This approach reformulates the process into an adaptive decomposition-and-reconstruction paradigm, employing Gradient Decomposition Modules (GDMs) to treat normalized gradient features as basis vectors. The resulting networks, including spatial and spatio-temporal variants, demonstrate state-of-the-art performance on multiple benchmarks with enhanced accuracy and computational efficiency.

IMPACT Introduces a novel approach to feature fusion for improved accuracy and efficiency in infrared small target detection.
- STGBD-Net
TOOL · arXiv cs.CV English(EN) · 16h

Chain of Flow: ECG-Conditioned 4D Cardiac Cine Generation from Patient-Specific Anatomical Anchor

Researchers have developed a new framework called Chain of Flow (COF) that generates 4D cardiac cine images using electrocardiography (ECG) and patient-specific MRI data. This method aims to provide functional cardiac assessment even when a complete cine sequence is not readily available. COF has demonstrated strong performance on the UK Biobank dataset, showing stable image quality and reliable downstream functional analysis, with potential applications in serial patient monitoring.

IMPACT Enables more accessible and comprehensive cardiac functional assessment through AI-driven image synthesis.
- UK Biobank
- Haofan Wu
TOOL · arXiv cs.CV English(EN) · 16h

GimmBO: Interactive Generative Image Model Merging via Bayesian Optimization

Researchers have developed GimmBO, a new method for interactively merging adapters in generative image models. This approach uses Preferential Bayesian Optimization (PBO) to navigate the complex design space created by combining multiple adapters, which is currently a manual and inefficient process. GimmBO aims to improve the efficiency and success rate of finding optimal adapter combinations, outperforming existing methods in user studies.

IMPACT This method could simplify the creation of custom image generation models by making adapter merging more efficient and accessible.
- Hugging Face
- GimmBO
- Chenxi Liu
- arXiv
TOOL · arXiv cs.CV English(EN) · 16h

Back to Point: Exploring Point-Language Models for Zero-Shot 3D Anomaly Detection

Researchers have developed a new framework called BTP for zero-shot 3D anomaly detection, which aims to identify defects in industrial products without needing prior examples of those defects. Unlike previous methods that convert 3D data to 2D images for analysis, BTP directly processes 3D point clouds using point-language models. This approach enhances sensitivity to local and structural anomalies by aligning 3D features with textual descriptions and incorporating geometric descriptors.

IMPACT This research could improve automated quality control in manufacturing by enabling defect detection without prior defect examples.
- Jin Wan
TOOL · arXiv cs.CV English(EN) · 16h

A Baseline Study and Benchmark for Few-Shot Open-Set Action Recognition with Feature Residual Discrimination

Researchers have introduced a new method for few-shot open-set action recognition in videos, addressing the limitations of existing closed-set assumptions. Their proposed Feature-Residual Discriminator (FR-Disc) architecture adapts previous skeletal data techniques to the more complex video domain. Experiments across five datasets show that FR-Disc significantly improves the ability to reject unknown actions without negatively impacting accuracy on known actions, establishing a new state-of-the-art for this task.

IMPACT Establishes a new state-of-the-art for few-shot open-set action recognition in videos, potentially improving surveillance and human-computer interaction systems.
- Stefano Berti
- Feature-Residual Discriminator
TOOL · arXiv cs.CV English(EN) · 16h

SciFlow-Bench: Evaluating Structure-Aware Scientific Diagram Generation via Inverse Parsing

Researchers have introduced SciFlow-Bench, a new benchmark designed to evaluate the structural accuracy of AI-generated scientific diagrams. Unlike previous benchmarks that focus on visual similarity or intermediate symbolic representations, SciFlow-Bench directly assesses the structural integrity of generated images by parsing them back into graphs. This method, utilizing a hierarchical multi-agent system, highlights that current text-to-image models struggle with preserving structural correctness, especially in complex diagrams.

IMPACT This benchmark will push AI models to generate scientifically accurate diagrams, improving the reliability of AI-generated visuals in research.
- SciFlow-Bench
- Tong Zhang
TOOL · arXiv cs.CV English(EN) · 16h

UniADC: A Unified Framework for Anomaly Detection and Classification

Researchers have introduced UniADC, a novel framework designed to simultaneously detect and classify anomalies within images. This approach addresses the limitations of existing methods that treat anomaly detection and classification as separate tasks. UniADC utilizes a training-free inpainting network for synthesizing anomaly images and an implicit-normal discriminator to model normal states, enabling precise detection and classification even with limited or no anomaly data. Experiments on multiple datasets show UniADC outperforming current methods in anomaly detection, localization, and classification.

IMPACT This unified approach could improve the accuracy and efficiency of anomaly detection systems in various applications.
- UniADC
- Xiuzhuang Zhou
TOOL · arXiv cs.CV English(EN) · 16h

AnyHand: A Large-Scale Synthetic Dataset for RGB(-D) Hand Pose Estimation

Researchers have introduced AnyHand, a large-scale synthetic dataset designed to improve 3D hand pose estimation. The dataset includes over 2.5 million single-hand and 4.1 million hand-object interaction RGB-D images, featuring rich geometric annotations and addressing limitations in existing real-world and synthetic datasets regarding occlusions and aligned depth information. Experiments show that incorporating AnyHand into training significantly boosts performance on benchmarks like FreiHAND and HO-3D, highlighting the critical role of data diversity and quality alongside scale.

IMPACT Enhances 3D hand pose estimation capabilities, potentially improving AR/VR and robotics applications.
- AnyHand
- HO-3D
- Chen Si
TOOL · arXiv cs.CV English(EN) · 16h

PicoSAM3: Real-Time In-Sensor Region-of-Interest Segmentation

Researchers have developed PicoSAM3, a new lightweight segmentation model designed for real-time execution on edge devices and even directly on image sensors. This model, with 1.3 million parameters, utilizes a dense CNN architecture and incorporates techniques like region of interest prompt encoding and knowledge distillation from larger models. PicoSAM3 achieves strong performance on benchmarks like COCO and LVIS, and its quantized version can perform inference in under 12 milliseconds on the Sony IMX500 sensor, meeting its operational constraints.

IMPACT Enables real-time, privacy-preserving visual processing directly on edge devices and sensors.
- Sony IMX500
- COCO
- SAM2
- SAM3
- Pietro Bonazzi
- LVIS
- PicoSAM3
TOOL · arXiv cs.CV English(EN) · 16h

Enhancing Adversarial Robustness with Signed Distance Fields for Harmonizing Geometric Invariance and Texture

Researchers have developed GeoTexPuri, a framework for improving the adversarial robustness of deep neural networks in computer vision. The method harmonizes geometric structure with textural features by using Signed Distance Fields to guide training, creating stable anchors against pixel noise. Experiments on ImageNet show GeoTexPuri achieves high clean and robust accuracy while functioning as a deterministic classifier at inference with no additional computational cost.

IMPACT This research could lead to more secure AI image recognition systems, reducing vulnerability to adversarial attacks in real-time applications.
TOOL · arXiv cs.CV English(EN) · 16h

IDDM: Identity-Decoupled Personalized Diffusion Models with a Tunable Privacy-Utility Trade-off

Researchers have developed Identity-Decoupled Personalized Diffusion Models (IDDM), a defense mechanism that addresses privacy concerns in personalized text-to-image generation. IDDM aims to reduce the linkability of generated images to real users while still allowing authorized personalization. It does so through an alternating optimization process that separates identity information from the generation pipeline, offering a tunable trade-off between privacy and utility.

IMPACT Introduces a novel defense mechanism for personalized diffusion models, balancing privacy with generation quality.
- Instagram
- Facebook
- Linyan Dai
- DreamBooth
- LoRA
TOOL · arXiv cs.CV English(EN) · 16h

Contour Field based Elliptical Shape Prior for the Segment Anything Model

Researchers have developed a method that enhances the Segment Anything Model (SAM) by incorporating an elliptical shape prior. The approach uses a parameterized elliptical contour field to guide segmentation, ensuring that the outputs are elliptical regions. It decomposes SAM into sub-problems and integrates image features with elliptical and spatial regularization priors, showing improved accuracy over the original SAM on specific image datasets.

IMPACT Enhances image segmentation accuracy for specific elliptical shapes, potentially improving medical and natural image analysis.
TOOL · arXiv cs.CV English(EN) · 16h

CoSeP: Complementary Separability Pruning via Class-Separability Clustering

Researchers have developed CoSeP, a neural network pruning technique that aims to compress models more effectively. Unlike existing methods that score components independently, CoSeP accounts for relationships between components by analyzing their class-separability profiles. It groups similar components and uses a knee-detection criterion to automatically determine how many components to retain, significantly reducing computational cost and inference time without sacrificing accuracy.

IMPACT This method could lead to more efficient deployment of neural networks on resource-constrained devices.
- David Levin
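
The knee-detection step mentioned in the summary can be illustrated with a generic heuristic. The sketch below is not taken from the paper: it assumes a decreasing curve of hypothetical cost values indexed by candidate count, and picks the point farthest from the straight line joining the first and last points. The function name `knee_index` and the sample values are illustrative only.

```python
def knee_index(values):
    """Return the index of the point farthest from the chord that joins
    the first and last points of the curve (a common knee heuristic)."""
    n = len(values)
    if n < 3:
        return 0  # too few points to have a knee
    dx = n - 1                      # horizontal extent of the chord
    dy = values[-1] - values[0]     # vertical extent of the chord
    best_i, best_d = 0, -1.0
    for i, y in enumerate(values):
        # Unnormalized perpendicular distance from (i, y) to the chord;
        # the constant normalizer does not change which point is farthest.
        d = abs(dy * i - dx * (y - values[0]))
        if d > best_d:
            best_i, best_d = i, d
    return best_i


# Hypothetical cost curve (made-up numbers, not results from the paper).
costs = [10.0, 6.0, 3.5, 2.0, 1.6, 1.4, 1.3, 1.2]
knee = knee_index(costs)
print(knee)
```

Because the distance depends on the relative scale of the two axes, in practice such curves are often normalized to a common range before the knee is located.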
TOOL · arXiv cs.CV English(EN) · 16h

A Camera-Native Talking-Head Video Dataset for Various Computer Vision Tasks

Researchers have released a dataset of talking-head videos captured natively by consumer webcams to support computer vision research. The dataset includes 847 recordings, each 15 seconds long, from over 800 participants using various webcam devices in natural settings. The recordings are preserved with lossless compression and annotated with quality scores, making them a resource for benchmarking video compression, super-resolution, and quality assessment models.

IMPACT Provides a large-scale, high-fidelity dataset to advance research in video compression, super-resolution, and quality assessment for real-time communication.
TOOL · arXiv cs.CV English(EN) · 16h

Uncertainty-Aware Hierarchical Re-Localization in OpenStreetMap via Semantic Alignment

Researchers have developed a framework that lets robots determine their location using OpenStreetMap (OSM) data, addressing the limitations of existing re-localization techniques that rely on dense maps or large image databases. The system uses object-centric DINO-ViT tokens to bridge the semantic gap between visual observations and OSM data, and employs a hierarchical search strategy with uncertainty control for improved accuracy and speed.

IMPACT Enhances robot navigation capabilities by enabling efficient and privacy-preserving localization using widely available map data.
- Yuchen Zou
- OpenStreetMap
TOOL · arXiv cs.CV English(EN) · 16h

CRAG: Can 3D Generative Models Help 3D Assembly?

Researchers have developed CRAG, an approach to 3D assembly that integrates generative modeling with pose estimation. Unlike previous methods that focus solely on rigid transformations, CRAG treats assembly and shape generation as mutually reinforcing processes. This allows it to synthesize plausible complete shapes and predict part poses even when some pieces are missing, achieving state-of-the-art performance on in-the-wild objects.

IMPACT This research advances 3D reconstruction by combining generative models with assembly, potentially improving applications in robotics and computer vision.
- Sihang Li
TOOL · arXiv cs.CV English(EN) · 16h

PEDRA: Evaluating the Realism of Pedestrian Dynamics in Video Generation

Researchers have developed PEDRA, an evaluation protocol for assessing the realism of pedestrian dynamics in AI-generated videos. It tests how well text-to-video and image-to-video models simulate multi-agent interactions, moving beyond single-subject realism. While leading models show promising capabilities in generating plausible crowd behavior, the evaluation also identified limitations in physical consistency, such as pedestrians that merge or disappear.

IMPACT This benchmark could drive improvements in AI's ability to simulate realistic human interactions in generated videos.
- Aaron Appelle
- PEDRA
TOOL · arXiv cs.CV English(EN) · 16h

No Modality Left Behind: Adapting to Missing Modalities via Knowledge Distillation for Brain Tumor Segmentation

Researchers have developed AdaMM, a framework that improves brain tumor segmentation from multi-modal MRI data even when some modalities are missing. It uses knowledge distillation and adaptive refinement modules to improve the model's handling of incomplete inputs. Experiments on benchmark datasets show AdaMM outperforms existing methods, particularly in scenarios with single or limited modalities, and the results offer practical guidance for future research.

IMPACT Enhances robustness of AI models in medical imaging for scenarios with incomplete data.
TOOL · arXiv cs.CV English(EN) · 16h

Dr. SHAP-AV: Decoding Relative Modality Contributions via Shapley Attribution in Audio-Visual Speech Recognition

Researchers have developed Dr. SHAP-AV, a framework that uses Shapley values to analyze how audio-visual speech recognition models balance acoustic and visual information. Experiments across six models and varying noise levels show that models rely more on visual input in noisy conditions, yet audio contributions remain significant. The analysis also found that modality balance shifts during speech generation and that signal-to-noise ratio is the primary driver of modality weighting, indicating a persistent audio bias in current models.

IMPACT Provides a diagnostic tool to understand and potentially improve the robustness of audio-visual AI systems.
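
To make the idea of Shapley attribution over modalities concrete, here is a minimal sketch of the textbook exact Shapley computation for two "players" (audio and video). It is not the paper's implementation. The `scores` table holds hypothetical recognition scores for each subset of available modalities, and the function name `shapley_values` is illustrative.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values for a small player set.
    `value` maps a frozenset of players to a numeric payoff."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(len(others) + 1):
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Standard Shapley weight for a coalition of size k.
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                # Marginal gain from adding player p to coalition s.
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi


# Hypothetical scores per modality subset (made-up, not from the paper).
scores = {
    frozenset(): 0.0,
    frozenset({"audio"}): 0.70,
    frozenset({"video"}): 0.30,
    frozenset({"audio", "video"}): 0.90,
}
phi = shapley_values(["audio", "video"], scores.__getitem__)
print(phi)
```

With these made-up scores the attributions come out to roughly 0.65 for audio and 0.25 for video, and they sum to the full-input score minus the empty-input score, which is the efficiency property that makes Shapley values usable as a relative contribution measure.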