multimodal large language model
PulseAugur coverage of multimodal large language models: every cluster mentioning multimodal large language models across labs, papers, and developer communities, ranked by signal.
-
HDRAgent uses LLMs for adaptive HDR imaging
Researchers have introduced HDRAgent, a novel framework for High Dynamic Range (HDR) imaging that utilizes an agent-driven approach to adaptively select reconstruction strategies. This method aims to mitigate ghosting a…
-
New SMART framework enhances video moment retrieval with audio and shot-aware compression
Researchers have developed SMART, a new framework for video moment retrieval that enhances multimodal understanding by integrating audio cues with visual information. This approach utilizes a Multimodal Large Language M…
-
New benchmark CoVEBench tests complex video editing AI
Researchers have introduced CoVEBench, a new benchmark designed to evaluate the capabilities of text-guided video editing models. This benchmark addresses the limitations of existing models that struggle with complex, m…
-
New benchmark tackles privacy blind spots in AI image editing
Researchers have introduced SPPE, a new benchmark for evaluating privacy-preserving image editing in Multimodal Large Language Models (MLLMs). This benchmark addresses the issue where standard privacy methods often resu…
-
New benchmark WebRISE tests MLLM-generated web artifacts
Researchers have developed WebRISE, a new benchmark for evaluating Multimodal Large Language Models (MLLMs) that generate web artifacts. Unlike previous methods, WebRISE focuses on requirement-induced states and transi…
-
New benchmark dataset and detection framework tackle AI-generated video forgery
Researchers have introduced CoCoVideo-26K, a new benchmark dataset designed to improve the detection of AI-generated videos, particularly those created by high-fidelity commercial models. The dataset features semantical…
-
ToolFG framework uses MLLMs and tools for image classification
Researchers have introduced ToolFG, a novel framework designed for fine-grained image classification that integrates multimodal large language models (MLLMs) with external tools. This approach allows MLLMs to autonomous…
-
New benchmarks test robot manipulation models for trustworthiness
Researchers have developed new benchmarks to evaluate the trustworthiness of video world models used in robotic manipulation. These benchmarks assess models across normal, constraint-sensitive, counterfactual, and adver…
-
Language models enhance deepfake detector generalization and interpretability
Researchers have developed a novel method for training deepfake detectors by leveraging multimodal large language models (MLLMs). This approach uses language as a regularization mechanism to improve both the generalizab…
-
New agentic framework uses MLLM to improve object detection
Researchers have introduced DetAS, an agentic framework for object detection that treats the task as a dynamic decision process. This framework utilizes a Multimodal Large Language Model (MLLM) to adaptively compose det…
-
FruitEnsemble uses MLLM to boost fruit classification accuracy
Researchers have developed FruitEnsemble, a novel framework for fine-grained fruit classification that addresses challenges like limited datasets and visual similarity between fruit types. The system utilizes a two-stag…
-
OSGNet and MLLM win Ego4D Episodic Memory Challenge
Researchers have developed a novel approach for the Ego4D Episodic Memory Challenge, achieving first place in both the Natural Language Queries and GoalStep tracks. Their method combines the OSGNet localization model wi…
-
New architectures enable real-time video understanding
Researchers are developing new methods for real-time video understanding, moving beyond traditional offline analysis. Several papers propose architectures that decouple visual perception from language generation to impr…
-
EndoGSim uses MLLMs for physics-aware surgical simulation
Researchers have developed EndoGSim, a new framework for simulating dynamic endoscopic scenes in robot-assisted surgery. This system uses Multimodal Large Language Models (MLLMs) to guide Gaussian Splatting, enabling p…
-
New MLLM framework unifies surgical scene understanding
Researchers have developed SurgMLLM, a novel framework that unifies surgical scene understanding by integrating high-level reasoning with low-level visual grounding. This multimodal large language model (MLLM) is fine-t…
-
AlphaGRPO framework boosts multimodal AI generation with self-reflection
Researchers have introduced AlphaGRPO, a new framework designed to improve multimodal generation in Unified Multimodal Models (UMMs). This approach uses Group Relative Policy Optimization (GRPO) to enable models to perf…
-
New MPerS method uses MLLMs for remote sensing scene segmentation
Researchers have developed MPerS, a novel approach for remote sensing scene segmentation that leverages multimodal large language models (MLLMs). This method generates high-quality captions for remote sensing images usi…
-
New MLLM WeatherSyn generates weather reports, outperforms existing models
Researchers have introduced WeatherSyn, a novel instruction-tuned multimodal large language model (MLLM) designed for generating weather forecast reports. This model is trained on a new dataset, which includes data f…
-
Motion-MLLM enhances 3D scene understanding with egomotion data
Researchers have developed Motion-MLLM, a new framework that integrates egomotion data from Inertial Measurement Units (IMUs) with video to enhance Multimodal Large Language Models (MLLMs) for 3D scene understanding. Th…
-
New benchmarks and models advance video understanding reward modeling
Researchers have developed new methods for training reward models for video understanding tasks, addressing a gap in current AI capabilities. One approach introduces a benchmark called VURB and a dataset VUP-35K, leadin…