PulseAugur
ENTITY vision-language model

PulseAugur coverage of vision-language model — every cluster mentioning vision-language model across labs, papers, and developer communities, ranked by signal.

Total · 30d: 44 (44 over 90d)
Releases · 30d: 0 (0 over 90d)
Papers · 30d: 43 (43 over 90d)
SENTIMENT · 30D: 5 days with sentiment data

RECENT · PAGE 2/4 · 64 TOTAL
  1. RESEARCH · CL_20275 ·

    PhysForge generates physics-grounded 3D assets for virtual worlds and embodied AI

    Researchers have introduced PhysForge, a novel framework designed to generate physics-grounded 3D assets for interactive virtual worlds and embodied AI. This system addresses the limitations of existing methods by focus…

  2. RESEARCH · CL_20307 ·

    New AI models InterMesh and Anny-Fit advance 3D human pose and shape recovery

    Researchers have developed InterMesh, a new framework for multi-person human mesh recovery that explicitly incorporates human-environment interaction information. This approach enhances pose and shape estimation by enri…

  3. RESEARCH · CL_18576 ·

    Researchers unveil new stealthy backdoor attacks on AI models using diffusion and style features

    Researchers have developed new methods for backdoor attacks on advanced AI models, specifically targeting Vision-Language Models (VLMs) and Diffusion Models (DMs). One approach, CBV, uses diffusion models to create natu…

  4. TOOL · CL_18874 ·

    VLM pipeline enables viewpoint-agnostic grasping for robots with partial observations

    Researchers have developed a new end-to-end pipeline for language-guided grasping that enhances the robustness of mobile manipulators in cluttered environments. This system uses vision-language models (VLMs) and partial…

  5. RESEARCH · CL_18299 ·

    New GLANCE framework enhances VLM agents with curiosity-driven visual-linguistic exploration

    Researchers have developed a new framework called GLANCE to enhance the exploration capabilities of vision-language model (VLM) agents. This framework aims to improve how these agents navigate complex and partially ob…

  6. TOOL · CL_15622 ·

    VISTA benchmark launched for advanced VLM spatio-temporal interaction analysis

    Researchers have introduced VISTA, a new benchmark designed to evaluate the spatio-temporal understanding capabilities of Vision-Language Models (VLMs). Unlike existing benchmarks that focus on simple actions and limite…

  7. TOOL · CL_15611 ·

    Chain of Evidence framework enables pixel-level visual attribution for retrieval-augmented generation

    Researchers have developed a new framework called Chain of Evidence (CoE) to improve iterative retrieval-augmented generation (iRAG) systems. CoE utilizes Vision-Language Models to directly analyze screenshots of retrie…

  8. TOOL · CL_15616 ·

    Researchers propose Gromov-Wasserstein distance for VLM vision encoder selection

    Researchers have developed a new method for selecting optimal vision encoders for Vision-Language Models (VLMs). Traditional approaches, like choosing encoders with high accuracy or large size, were found to be ineffect…
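
    The Gromov-Wasserstein distance compares two embedding spaces by their internal distance structure rather than their raw coordinates, which is why it can rank vision encoders without aligning their feature dimensions. The sketch below is not the paper's method: it is a crude plain-Python proxy that evaluates the GW objective under a fixed identity coupling (sample i in one space matched to sample i in the other), whereas the true distance optimizes over all couplings.

```python
import math

def pairwise_dists(X):
    # Euclidean distance matrix within one embedding space.
    n = len(X)
    return [[math.dist(X[i], X[j]) for j in range(n)] for i in range(n)]

def normalize(D):
    # Scale-normalize so encoders with different feature magnitudes compare fairly.
    m = max(max(row) for row in D) or 1.0
    return [[v / m for v in row] for row in D]

def structural_distortion(X, Y):
    # Proxy for the Gromov-Wasserstein objective under the identity coupling:
    # mean squared gap between the two normalized intra-space distance matrices.
    # The real GW distance minimizes this over all soft matchings of samples.
    D1, D2 = normalize(pairwise_dists(X)), normalize(pairwise_dists(Y))
    n = len(D1)
    return sum((D1[i][j] - D2[i][j]) ** 2
               for i in range(n) for j in range(n)) / (n * n)

# Two encoders embedding the same 3 images: Y is X uniformly scaled, so the
# relational structure is identical and the distortion is ~0.
X = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
Y = [(0.0, 0.0), (2.0, 0.0), (0.0, 4.0)]
print(structural_distortion(X, Y))
```

    An encoder that scrambles the relational geometry of the samples would score a strictly positive distortion, which is the signal the selection method exploits.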

  9. TOOL · CL_15782 ·

    New benchmark reveals video models forget long-term context

    Researchers have introduced SceneBench, a new benchmark designed to evaluate video understanding models' ability to retain context over long videos, particularly across different scenes. Their findings indicate that cur…

  10. RESEARCH · CL_16299 ·

    Coral and CoRAL systems optimize LLM serving and robotic control

    Researchers have developed two distinct systems named Coral and CoRAL. Coral is an adaptive system designed for cost-efficient serving of multiple large language models across heterogeneous cloud GPUs, aiming to optimiz…

  11. RESEARCH · CL_16304 ·

    Robots gain semantic understanding with VLM and adaptive memory

    Researchers have developed a "Semantic Autonomy Stack" to enable indoor mobile robots to understand natural language instructions, overcoming the latency and memory limitations of current Vision-Language Models (VLMs). …

  12. RESEARCH · CL_14362 ·

    GeoThinker framework actively integrates geometry for advanced spatial reasoning

    Researchers have developed GeoThinker, a novel framework that enhances spatial reasoning in multimodal large language models (MLLMs) by actively integrating geometric information. Unlike previous passive fusion methods,…

  13. RESEARCH · CL_13548 ·

    AI advancements span XQuery conversion, OCR pipelines, and China's benchmark challenges

    A new open-source pipeline called SGOCR 2026 has been released, designed to generate spatially-grounded OCR datasets for training vision-language models. This pipeline aims to separate text localization from semantic re…

  14. RESEARCH · CL_11793 ·

    OmniDrive-R1 enhances autonomous driving VLMs with reinforcement-driven visual grounding

    Researchers have introduced OmniDrive-R1, a novel framework for autonomous driving that integrates perception and reasoning using an interleaved Multi-modal Chain-of-Thought (iMCoT) mechanism. This approach addresses ob…

  15. RESEARCH · CL_11851 ·

    New framework uses VLM distillation for stable continual model adaptation

    Researchers have introduced Test-Time Distillation (TTD), a novel approach to address performance degradation in deep neural networks due to distribution shifts during deployment. Existing methods often suffer from pred…
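
    Distillation-based test-time adaptation of the kind described generally treats a frozen VLM's prediction on each incoming unlabeled sample as a soft target for the deployed model. A minimal sketch of the standard temperature-softened distillation loss (an assumption about the general technique, not TTD's actual objective):

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax; higher temperature yields softer targets.
    z = [l / temperature for l in logits]
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # KL(teacher || student) on softened distributions -- the classic
    # distillation objective. In a test-time setting, "teacher" would be a
    # frozen VLM's zero-shot prediction for the incoming sample, and this
    # loss would drive a small update to the deployed student model.
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

teacher = [2.0, 0.5, -1.0]   # hypothetical VLM logits for one sample
student = [0.0, 0.0, 0.0]    # uninformative student prediction
print(distillation_loss(student, teacher))
```

    The loss is zero when the student already matches the teacher, so a drifting student is pulled back toward the frozen VLM's predictions, which is one way such methods avoid the unstable self-reinforcing updates of pseudo-label-only adaptation.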

  16. RESEARCH · CL_11825 ·

    Vision-language models mistake head orientation for gaze direction

    Researchers have discovered that Vision-Language Models (VLMs) struggle to accurately infer human gaze direction, often mistaking head orientation for eye movement. In a study involving 1,360 real-world images, VLMs sho…

  17. RESEARCH · CL_11758 ·

    OpAgent achieves 71.6% success rate in web navigation tasks

    Researchers have developed OpAgent, a novel web navigation agent that utilizes online reinforcement learning to overcome the limitations of static datasets. The agent employs a hierarchical multi-task fine-tuning approa…

  18. RESEARCH · CL_22533 ·

    AI drafts boost audio description quality, but a quality threshold is key

    Researchers have developed methods to improve the quality and scalability of audio description (AD) generation and evaluation. One study introduces GenAD and RefineAD, a pipeline and interface that uses AI-generated dra…

  19. RESEARCH · CL_10151 ·

    ChartVerse framework synthesizes complex charts and reasoning data for VLMs

    Researchers have introduced ChartVerse, a new framework designed to generate complex charts and reliable question-answering data for Vision Language Models (VLMs). This system addresses limitations in existing datasets …

  20. RESEARCH · CL_10251 ·

    MARVIS system uses VLM reasoning over visualizations for predictive tasks

    Researchers have developed MARVIS, a novel system that enhances the reasoning capabilities of large language and vision-language models (VLMs) by converting their latent embeddings into visual representations. This appr…
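
    The general idea of reasoning over visualizations rather than raw embeddings can be illustrated with a toy rasterizer: project embeddings to 2D (e.g. via PCA, omitted here) and render the points as an image-like grid a VLM could inspect. `render_scatter` is a hypothetical sketch of that rendering step, not MARVIS code:

```python
def render_scatter(points, width=24, height=12):
    # Rasterize 2D points (e.g. PCA-projected embeddings) into an ASCII
    # grid -- a stand-in for the plot images a VLM would actually be shown.
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x0, y0 = min(xs), min(ys)
    sx = (max(xs) - x0) or 1.0
    sy = (max(ys) - y0) or 1.0
    grid = [['.'] * width for _ in range(height)]
    for x, y in points:
        c = int((x - x0) / sx * (width - 1))
        r = int((y - y0) / sy * (height - 1))
        grid[height - 1 - r][c] = '*'  # flip rows so larger y sits higher
    return '\n'.join(''.join(row) for row in grid)

# Two well-separated clusters become visually obvious in the rendering,
# which a VLM can then describe or classify.
cluster = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.15), (2.0, 2.1), (2.1, 2.0)]
print(render_scatter(cluster))
```

    Rendering embeddings as pictures trades numeric precision for the spatial pattern-recognition a vision encoder is already good at, which is the design bet behind this family of systems.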