VideoKR: Towards Knowledge- and Reasoning-Intensive Video Understanding
arXiv cs.CV (Computer Vision)
75 items · Generative Image & Video Models
NIV: Neural Axis Variations for Variable Font Generation
Personal AI Agent for Camera Roll VQA
Do Models Share Safety Representations? Cross-Model Steering for Safe Visual Generation
TopoPult-SSL: Gland-Mask-Free Cross-Device Meibomian Gland Segmentation via Self-Distilled Weak Clinical Priors
LightVesselNet: An Ultra-Lightweight Sub-100K Parameter Network for Retinal Blood Vessel Segmentation
Recovering Physically Plausible Human-Object Interactions from Monocular Videos
Biomazon: A Multimodal Dataset for 3D Forest Structure and Biomass Modeling in the Amazon Basin
Three-Dimensional Retinal Microvasculature Restoration in OCT Angiography
Deep Learning-assisted AMD Staging based on OCT and OCT Angiography
UniPixie: Unified and Probabilistic 3D Physics Learning via Flow Matching
Would you still call this Dax? Novel Visual References in VLMs and Humans
Disentangled Fine-Grained Prototype Learning for Incomplete Image-Tabular Classification
Horse Eye Blink Detection and Classification for Equine Affective State Assessment
ORACLE-CT: Anatomy-Aware Support Pooling for CT Classification
Dive into the Scene: Breaking the Perceptual Bottleneck in Vision-Language Decision Making via Focus Plan Generation
Weakly Supervised Incremental Segmentation via Semantic Anchors and Spatial Arbitration
Intra-Modal Neighbors Never Lie: Rectifying Inter-Modal Noisy Correspondence via Graph-Based Intra-Modal Reasoning
Optimal Transport Flow Matching by Design
When Seeing Is Not Believing -- A Benchmark for Search-Grounded Video Misinformation Detection
Reflection Separation from a Single Image via Joint Latent Diffusion
Pinpoint: Grounded Worldwide Image Geolocation via Cross-Source Retrieval and Reranking
End-to-End Text Line Detection and Ordering
GroupToM-Bench: Benchmarking Group Theory of Mind and Nonlinear Social Emergence in MLLMs
Spatial Artifact Coherence Determines Codec Robustness in Patch-Based rPPG
Overview of the EReL@MIR 2025 Multimodal Document Retrieval Challenge (Track 1)
Prospective Dynamic 3D MRI Reconstruction via Latent-Space Motion Tracking from Single Measurement
SBP-Net: Learning Thin Structure Reconstruction with Sliding-Box Projections
UniCanvas: A Diffusion-base Unified Model for Text-in-Image Joint Generation
StandardE2E: A Unified Framework for End-to-End Autonomous Driving Datasets
COD10K-C: Benchmarking Robustness of Camouflaged Object Detection Under Natural Image Corruptions
AVTrack: Audio-Visual Tracking in Human-centric Complex Scenes
Consistent Yet Wrong: Evidence Insensitivity in Spatial Vision-Language Models
Plan2Map: A Multimodal Benchmark for Document-Grounded Geospatial Boundary Reconstruction from Planning Records
MetaWorld: Scaling Multi-Agent Video World Model from Single-view Video Data
From Local Training to Large-Scale Mapping: A Comparative Assessment of Machine Learning and Deep Learning for Transferable Satellite-Derived Bathymetry
GeoDrive-Bench: Benchmarking Region-Specific Multimodal Reasoning in Autonomous Driving
Diagnosis of Human Object Interaction Detectors for Real World Educational Applications
Cosmos 3: Omnimodal World Models for Physical AI
Automated Report-Derived Oncology VQA Benchmark for Evaluating Vision-Language Models on 3D Medical Imaging
Principled Reflection Separation via Nonlinear Superposition and Feature Interaction
Pathway-Structured Privileged Distillation for Deployable Computational Pathology
Tiny Collaborative Inference for Occlusion-Robust Object Detection
Any2Poster: Any-Source Poster Generation Across Modalities and Domains
Pixel Cube: Diffusion-based Portrait Video Relighting Through Realistic Lighting Reproduction
DefocusTrackerAI -- A Generalized Framework for the Automatic Detection of Defocused Particle Images
Improved Belief-Attention in Vision Task
Flow-Based Generative Modeling for Optimizing Sampling Policies in Compressed Sensing Applications
Planktonzilla: Multimodal dataset and models for understanding plankton ecosystems
Structured Visual Evidence Decomposition for Evidence-Grounded Multimodal Screening of Obstructive Sleep Apnea-Hypopnea Syndrome
Aligning Cellular Sheaves with Classifier Attention for Interpretable Weakly-Supervised Pathology Localization
Diffusion Image Generation with Explicit Modeling of Data Manifold Geometry
Bridging the 2D-3D Gap: A Hierarchical Semantic-Geometric Map for Vision Language Navigation
Diversity Over Frequency: Rethinking Tool Use in Visual Chain-of-Thought Agents
Segmentation-Guided Spatial Indexing for Generalizable and Explainable Deepfake Detection
CoilDrop-MRI: Self-supervised physics-guided MRI reconstruction with coil dropout
CoCoVideo: The High-Quality Commercial-Model-Based Contrastive Benchmark for AI-Generated Video Detection
Visual-Noise Guided In-Context Distillation for Multimodal Large Language Model Unlearning
VDSB-GWSyn: Diffusion Schrödinger Bridge for Controllable and Anatomically Feasible Guidewire Synthesis in Coronary Angiography
General Covariant Action Modeling: Constructing Generalized Manifolds via Spatio-Temporal Decoupling
Lightweight SAR Ship Detection via Contrastive Distillation
SANA-Streaming: Real-time Streaming Video Editing with Hybrid Diffusion Transformer
DTG-Restore: Training-Free Diffusion Refinement for Generative Video Super-Resolution
Mitigating Content Shift and Hallucination in GenAI Image Editing via Structural Refinement
Dex2HOI: Dexterous Bimanual Two-Object Interaction Generation
Clustering Guided Domain-Specific Pretrained Foundation Model Very High-Resolution Arctic Remote Sensing
A Novel Global Context-aware Deep Neural Network for Enhanced Brain Tumor Segmentation using Magnetic Resonance Images
OmniMem: Scalable and Adaptive Memory Retrieval for Long Video Generation
On-Device Generative AI for GDPR-Compliant Visual Monitoring: Natural Language Alerts from Local Object Detection
Seeing Isn't Knowing: Do VLMs Know When Not to Answer Spatial Questions (and Why)?
VLM3: Vision Language Models Are Native 3D Learners
Prior Availability in Industrial Visual Sim-to-Real: A Review of CAD-Guided and CAD-Unavailable Regimes
ReGuLaR: Relation-Grounded Latent Reasoning for Large Vision-Language Models
Crafter: A Multi-Agent Harness for Editable Scientific Figure Generation from Diverse Inputs
Controllable Lung Nodule Synthesis via Histogram-Regularized Latent Diffusion Models