Qwen2.5-VL
PulseAugur coverage of Qwen2.5-VL — every cluster mentioning Qwen2.5-VL across labs, papers, and developer communities, ranked by signal.
-
New benchmark reveals multilingual safety gaps in vision-language models
Researchers have developed MLingualFC, a new multilingual benchmark to test the safety vulnerabilities of vision-language models (VLMs). This benchmark uses flowchart images encoded with harmful instructions in five lan…
-
New CoCoA method boosts multimodal embedding quality
Researchers have introduced CoCoA, a novel pre-training paradigm designed to enhance multimodal embedding models. This method focuses on content reconstruction through collaborative attention, aiming to create more comp…
-
New methods boost video QA by compressing content and improving temporal reasoning
Researchers have developed new methods to improve video question answering (VQA) for long videos. One approach, MemoryCard, compresses video content into topic-aware "Memory Cards" to better capture event-level semantic…
-
llama.cpp releases add Vulkan, optimize matrix math, and improve server logging
The llama.cpp project has released several updates, including version b9580, which adds Vulkan support for matrix-matrix multiplication and Flash Attention, along with optimizations for FP16 dot2 extensions. Other recent…
-
New framework boosts VLM anomaly detection for self-driving cars
Researchers have developed SAVANT, a new framework designed to improve the detection of semantic anomalies in autonomous driving systems using Vision-Language Models (VLMs). SAVANT reformulates anomaly detection as a la…
-
UF Gators win AmericasNLP 2026 task with novel captioning system
A University of Florida team (UF Gators) has won the AmericasNLP 2026 shared task on cultural image captioning in Indigenous languages. Their two-stage system uses Qwen2.5-VL for an intermediate Spanish capti…
-
ByteDance releases Lance, a unified multimodal AI model
ByteDance has released Lance, an open-source multimodal AI model capable of understanding, generating, and editing both images and videos within a single framework. This lightweight model, with only 3 billion active par…
-
Video2GUI generates 12M GUI trajectories from unlabeled videos
Researchers have developed Video2GUI, an automated framework designed to generate large-scale interaction trajectories for training GUI agents. This system extracts data from unlabeled internet videos, converting them i…
-
New DICModel enhances ICT image captioning with multi-modal LLMs
Researchers have developed a novel Domain-specific Image Captioning Model (DICModel) designed for the ICT industry, utilizing a multi-stage progressive training strategy. This approach combines synthesized image-text pa…
-
Medical VLMs struggle with negated answers, new benchmark reveals
Researchers have developed CXR-ContraBench, a new benchmark designed to evaluate the performance of medical vision-language models (VLMs) in correctly interpreting negated statements within chest X-ray analyses. The ben…
-
DenseStep2M pipeline automates video annotation for improved understanding
Researchers have developed DenseStep2M, a novel pipeline that automatically extracts detailed procedural annotations from instructional videos without requiring training data. This system segments videos, filters irrele…
-
OcularChat MLLM accurately diagnoses age-related macular degeneration with interactive explanations
Researchers have developed OcularChat, a multimodal large language model (MLLM) fine-tuned from Qwen2.5-VL, designed to diagnose age-related macular degeneration (AMD) using color fundus photographs. The model was train…
-
Arcee AI moves to Together Endpoints for cost-efficient SLMs
Arcee AI has migrated its specialized small language models (SLMs) from AWS to Together Dedicated Endpoints, seeking improved cost, performance, and operational agility. The company focuses on training efficient models …
-
New research tackles LLM hallucinations with novel methods and benchmarks
Multiple research papers released on arXiv address the challenge of hallucinations in large language and vision-language models. One paper introduces In-Context Visual Contrastive Optimization (IC-VCO) to mitigate multi…