2025-07-18

Title: MindJourney: Test-Time Scaling with World Models for Spatial Reasoning

Authors: Yuncong Yang, Jiageng Liu, Zheyuan Zhang, Siyuan Zhou, Reuben Tan, Jianwei Yang, Yilun Du, Chuang Gan
Subjects: cs.CV, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2507.12508
Pdf URL: https://arxiv.org/pdf/2507.12508
Copy Paste: [[2507.12508]] MindJourney: Test-Time Scaling with World Models for Spatial Reasoning(https://arxiv.org/abs/2507.12508)
Keywords: diffusion
Abstract: Spatial reasoning in 3D space is central to human cognition and indispensable for embodied tasks such as navigation and manipulation. However, state-of-the-art vision-language models (VLMs) struggle frequently with tasks as simple as anticipating how a scene will look after an egocentric motion: they perceive 2D images but lack an internal model of 3D dynamics. We therefore propose MindJourney, a test-time scaling framework that grants a VLM with this missing capability by coupling it to a controllable world model based on video diffusion. The VLM iteratively sketches a concise camera trajectory, while the world model synthesizes the corresponding view at each step. The VLM then reasons over this multi-view evidence gathered during the interactive exploration. Without any fine-tuning, our MindJourney achieves over an average 8% performance boost on the representative spatial reasoning benchmark SAT, showing that pairing VLMs with world models for test-time scaling offers a simple, plug-and-play route to robust 3D reasoning. Meanwhile, our method also improves upon the test-time inference VLMs trained through reinforcement learning, which demonstrates the potential of our method that utilizes world models for test-time scaling.

Title: Assay2Mol: large language model-based drug design using BioAssay context

Authors: Yifan Deng, Spencer S. Ericksen, Anthony Gitter
Subjects: cs.LG, cs.AI, q-bio.QM
Abstract URL: https://arxiv.org/abs/2507.12574
Pdf URL: https://arxiv.org/pdf/2507.12574
Copy Paste: [[2507.12574]] Assay2Mol: large language model-based drug design using BioAssay context(https://arxiv.org/abs/2507.12574)
Keywords: in-context
Abstract: Scientific databases aggregate vast amounts of quantitative data alongside descriptive text. In biochemistry, molecule screening assays evaluate the functional responses of candidate molecules against disease targets. Unstructured text that describes the biological mechanisms through which these targets operate, experimental screening protocols, and other attributes of assays offer rich information for new drug discovery campaigns but has been untapped because of that unstructured format. We present Assay2Mol, a large language model-based workflow that can capitalize on the vast existing biochemical screening assays for early-stage drug discovery. Assay2Mol retrieves existing assay records involving targets similar to the new target and generates candidate molecules using in-context learning with the retrieved assay screening data. Assay2Mol outperforms recent machine learning approaches that generate candidate ligand molecules for target protein structures, while also promoting more synthesizable molecule generation.

Title: Reconstruct, Inpaint, Finetune: Dynamic Novel-view Synthesis from Monocular Videos

Authors: Kaihua Chen, Tarasha Khurana, Deva Ramanan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.12646
Pdf URL: https://arxiv.org/pdf/2507.12646
Copy Paste: [[2507.12646]] Reconstruct, Inpaint, Finetune: Dynamic Novel-view Synthesis from Monocular Videos(https://arxiv.org/abs/2507.12646)
Keywords: diffusion, self-supervised
Abstract: We explore novel-view synthesis for dynamic scenes from monocular videos. Prior approaches rely on costly test-time optimization of 4D representations or do not preserve scene geometry when trained in a feed-forward manner. Our approach is based on three key insights: (1) covisible pixels (that are visible in both the input and target views) can be rendered by first reconstructing the dynamic 3D scene and rendering the reconstruction from the novel-views and (2) hidden pixels in novel views can be "inpainted" with feed-forward 2D video diffusion models. Notably, our video inpainting diffusion model (CogNVS) can be self-supervised from 2D videos, allowing us to train it on a large corpus of in-the-wild videos. This in turn allows for (3) CogNVS to be applied zero-shot to novel test videos via test-time finetuning. We empirically verify that CogNVS outperforms almost all prior art for novel-view synthesis of dynamic scenes from monocular videos.

Title: PinFM: Foundation Model for User Activity Sequences at a Billion-scale Visual Discovery Platform

Authors: Xiangyi Chen, Kousik Rajesh, Matthew Lawhon, Zelun Wang, Hanyu Li, Haomiao Li, Saurabh Vishwas Joshi, Pong Eksombatchai, Jaewon Yang, Yi-Ping Hsu, Jiajing Xu, Charles Rosenberg
Subjects: cs.LG, cs.IR
Abstract URL: https://arxiv.org/abs/2507.12704
Pdf URL: https://arxiv.org/pdf/2507.12704
Copy Paste: [[2507.12704]] PinFM: Foundation Model for User Activity Sequences at a Billion-scale Visual Discovery Platform(https://arxiv.org/abs/2507.12704)
Keywords: foundation model
Abstract: User activity sequences have emerged as one of the most important signals in recommender systems. We present a foundational model, PinFM, for understanding user activity sequences across multiple applications at a billion-scale visual discovery platform. We pretrain a transformer model with 20B+ parameters using extensive user activity data, then fine-tune it for specific applications, efficiently coupling it with existing models. While this pretraining-and-fine-tuning approach has been popular in other domains, such as Vision and NLP, its application in industrial recommender systems presents numerous challenges. The foundational model must be scalable enough to score millions of items every second while meeting tight cost and latency constraints imposed by these systems. Additionally, it should capture the interactions between user activities and other features and handle new items that were not present during the pretraining stage. We developed innovative techniques to address these challenges. Our infrastructure and algorithmic optimizations, such as the Deduplicated Cross-Attention Transformer (DCAT), improved our throughput by 600% on Pinterest internal data. We demonstrate that PinFM can learn interactions between user sequences and candidate items by altering input sequences, leading to a 20% increase in engagement with new items. PinFM is now deployed to help improve the experience of more than a half billion users across various applications.

Title: AudioJudge: Understanding What Works in Large Audio Model Based Speech Evaluation

Authors: Potsawee Manakul, Woody Haosheng Gan, Michael J. Ryan, Ali Sartaz Khan, Warit Sirichotedumrong, Kunat Pipatanakul, William Held, Diyi Yang
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2507.12705
Pdf URL: https://arxiv.org/pdf/2507.12705
Copy Paste: [[2507.12705]] AudioJudge: Understanding What Works in Large Audio Model Based Speech Evaluation(https://arxiv.org/abs/2507.12705)
Keywords: in-context
Abstract: Current speech evaluation suffers from two critical limitations: the need and difficulty of designing specialized systems targeting individual audio characteristics, and poor correlation between automatic evaluation methods and human preferences. This work presents a systematic study of Large Audio Model (LAM) as a Judge, AudioJudge, investigating whether it can provide a unified evaluation framework that addresses both challenges. We systematically explore AudioJudge across audio characteristic detection tasks, including pronunciation, speaking rate, speaker identification and speech quality, and system-level human preference simulation for automated benchmarking. We investigate different prompt engineering strategies, finding that audio concatenation combined with in-context learning significantly improves performance across both audio characteristic detection and human preference simulation tasks. We further introduce a multi-aspect ensemble AudioJudge to enable general-purpose multi-aspect audio evaluation. This method decomposes speech assessment into specialized judges for lexical content, speech quality, and paralinguistic features, achieving up to 0.91 Spearman correlation with human preferences on our system ranking benchmark. Robustness analysis reveals that while LAMs maintain strong performance under acoustic noise, they exhibit significant verbosity and positional biases that require careful mitigation.

Title: Multimodal-Guided Dynamic Dataset Pruning for Robust and Efficient Data-Centric Learning

Authors: Suorong Yang, Peijia Li, Yujie Liu, Zhiming Xu, Peng Ye, Wanli Ouyang, Furao Shen, Dongzhan Zhou
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2507.12750
Pdf URL: https://arxiv.org/pdf/2507.12750
Copy Paste: [[2507.12750]] Multimodal-Guided Dynamic Dataset Pruning for Robust and Efficient Data-Centric Learning(https://arxiv.org/abs/2507.12750)
Keywords: foundation model
Abstract: Modern deep models are trained on large real-world datasets, where data quality varies and redundancy is common. Data-centric approaches such as dataset pruning have shown promise in improving training efficiency and model performance. However, most existing methods rely on static heuristics or task-specific metrics, limiting their robustness and generalizability across domains. In this work, we introduce a dynamic dataset pruning framework that adaptively selects training samples based on both task-driven difficulty and cross-modality semantic consistency. By incorporating supervision from pretrained multimodal foundation models, our approach captures training dynamics while effectively filtering out uninformative samples. Our work highlights the potential of integrating cross-modality alignment for robust sample selection, advancing data-centric learning toward more efficient and robust practices across application domains.

Title: World Model-Based End-to-End Scene Generation for Accident Anticipation in Autonomous Driving

Authors: Yanchen Guan, Haicheng Liao, Chengyue Wang, Xingcheng Liu, Jiaxun Zhang, Zhenning Li
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2507.12762
Pdf URL: https://arxiv.org/pdf/2507.12762
Copy Paste: [[2507.12762]] World Model-Based End-to-End Scene Generation for Accident Anticipation in Autonomous Driving(https://arxiv.org/abs/2507.12762)
Keywords: generative
Abstract: Reliable anticipation of traffic accidents is essential for advancing autonomous driving systems. However, this objective is limited by two fundamental challenges: the scarcity of diverse, high-quality training data and the frequent absence of crucial object-level cues due to environmental disruptions or sensor deficiencies. To tackle these issues, we propose a comprehensive framework combining generative scene augmentation with adaptive temporal reasoning. Specifically, we develop a video generation pipeline that utilizes a world model guided by domain-informed prompts to create high-resolution, statistically consistent driving scenarios, particularly enriching the coverage of edge cases and complex interactions. In parallel, we construct a dynamic prediction model that encodes spatio-temporal relationships through strengthened graph convolutions and dilated temporal operators, effectively addressing data incompleteness and transient visual noise. Furthermore, we release a new benchmark dataset designed to better capture diverse real-world driving risks. Extensive experiments on public and newly released datasets confirm that our framework enhances both the accuracy and lead time of accident anticipation, offering a robust solution to current data and modeling limitations in safety-critical autonomous driving applications.

Title: AnyPos: Automated Task-Agnostic Actions for Bimanual Manipulation

Authors: Hengkai Tan, Yao Feng, Xinyi Mao, Shuhe Huang, Guodong Liu, Zhongkai Hao, Hang Su, Jun Zhu
Subjects: cs.CV, cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2507.12768
Pdf URL: https://arxiv.org/pdf/2507.12768
Copy Paste: [[2507.12768]] AnyPos: Automated Task-Agnostic Actions for Bimanual Manipulation(https://arxiv.org/abs/2507.12768)
Keywords: self-supervised
Abstract: Vision-language-action (VLA) models have shown promise on task-conditioned control in complex settings such as bimanual manipulation. However, the heavy reliance on task-specific human demonstrations limits their generalization and incurs high data acquisition costs. In this work, we present a new notion of task-agnostic action paradigm that decouples action execution from task-specific conditioning, enhancing scalability, efficiency, and cost-effectiveness. To address the data collection challenges posed by this paradigm -- such as low coverage density, behavioral redundancy, and safety risks -- we introduce ATARA (Automated Task-Agnostic Random Actions), a scalable self-supervised framework that accelerates collection by over $ 30\times $ compared to human teleoperation. To further enable effective learning from task-agnostic data, which often suffers from distribution mismatch and irrelevant trajectories, we propose AnyPos, an inverse dynamics model equipped with Arm-Decoupled Estimation and a Direction-Aware Decoder (DAD). We additionally integrate a video-conditioned action validation module to verify the feasibility of learned policies across diverse manipulation tasks. Extensive experiments show that the AnyPos-ATARA pipeline yields a 51% improvement in test accuracy and achieves 30-40% higher success rates in downstream tasks such as lifting, pick-and-place, and clicking, using replay-based video validation. Project Page: this https URL

Title: Local Representative Token Guided Merging for Text-to-Image Generation

Authors: Min-Jeong Lee, Hee-Dong Kim, Seong-Whan Lee
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2507.12771
Pdf URL: https://arxiv.org/pdf/2507.12771
Copy Paste: [[2507.12771]] Local Representative Token Guided Merging for Text-to-Image Generation(https://arxiv.org/abs/2507.12771)
Keywords: diffusion
Abstract: Stable diffusion is an outstanding image generation model for text-to-image, but its time-consuming generation process remains a challenge due to the quadratic complexity of attention operations. Recent token merging methods improve efficiency by reducing the number of tokens during attention operations, but often overlook the characteristics of attention-based image generation models, limiting their effectiveness. In this paper, we propose local representative token guided merging (ReToM), a novel token merging strategy applicable to any attention mechanism in image generation. To merge tokens based on various contextual information, ReToM defines local boundaries as windows within attention inputs and adjusts window sizes. Furthermore, we introduce a representative token, which represents the most representative token per window by computing similarity at a specific timestep and selecting the token with the highest average similarity. This approach preserves the most salient local features while minimizing computational overhead. Experimental results show that ReToM achieves a 6.2% improvement in FID and higher CLIP scores compared to the baseline, while maintaining comparable inference time. We empirically demonstrate that ReToM is effective in balancing visual quality and computational efficiency.

Title: A Comprehensive Survey of Electronic Health Record Modeling: From Deep Learning Approaches to Large Language Models

Authors: Weijieying Ren, Jingxi Zhu, Zehao Liu, Tianxiang Zhao, Vasant Honavar
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2507.12774
Pdf URL: https://arxiv.org/pdf/2507.12774
Copy Paste: [[2507.12774]] A Comprehensive Survey of Electronic Health Record Modeling: From Deep Learning Approaches to Large Language Models(https://arxiv.org/abs/2507.12774)
Keywords: self-supervised, foundation model
Abstract: Artificial intelligence (AI) has demonstrated significant potential in transforming healthcare through the analysis and modeling of electronic health records (EHRs). However, the inherent heterogeneity, temporal irregularity, and domain-specific nature of EHR data present unique challenges that differ fundamentally from those in vision and natural language tasks. This survey offers a comprehensive overview of recent advancements at the intersection of deep learning, large language models (LLMs), and EHR modeling. We introduce a unified taxonomy that spans five key design dimensions: data-centric approaches, neural architecture design, learning-focused strategies, multimodal learning, and LLM-based modeling systems. Within each dimension, we review representative methods addressing data quality enhancement, structural and temporal representation, self-supervised learning, and integration with clinical knowledge. We further highlight emerging trends such as foundation models, LLM-driven clinical agents, and EHR-to-text translation for downstream reasoning. Finally, we discuss open challenges in benchmarking, explainability, clinical alignment, and generalization across diverse clinical settings. This survey aims to provide a structured roadmap for advancing AI-driven EHR modeling and clinical decision support. For a comprehensive list of EHR-related methods, kindly refer to this https URL.

Title: ATL-Diff: Audio-Driven Talking Head Generation with Early Landmarks-Guide Noise Diffusion

Authors: Hoang-Son Vo, Quang-Vinh Nguyen, Seungwon Kim, Hyung-Jeong Yang, Soonja Yeom, Soo-Hyung Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.12804
Pdf URL: https://arxiv.org/pdf/2507.12804
Copy Paste: [[2507.12804]] ATL-Diff: Audio-Driven Talking Head Generation with Early Landmarks-Guide Noise Diffusion(https://arxiv.org/abs/2507.12804)
Keywords: diffusion
Abstract: Audio-driven talking head generation requires precise synchronization between facial animations and audio signals. This paper introduces ATL-Diff, a novel approach addressing synchronization limitations while reducing noise and computational costs. Our framework features three key components: a Landmark Generation Module converting audio to facial landmarks, a Landmarks-Guide Noise approach that decouples audio by distributing noise according to landmarks, and a 3D Identity Diffusion network preserving identity characteristics. Experiments on MEAD and CREMA-D datasets demonstrate that ATL-Diff outperforms state-of-the-art methods across all metrics. Our approach achieves near real-time processing with high-quality animations, computational efficiency, and exceptional preservation of facial nuances. This advancement offers promising applications for virtual assistants, education, medical communication, and digital platforms. The source code is available at: \href{this https URL}{this https URL}

Title: Semantic-guided Fine-tuning of Foundation Model for Long-tailed Visual Recognition

Authors: Yufei Peng, Yonggang Zhang, Yiu-ming Cheung
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.12807
Pdf URL: https://arxiv.org/pdf/2507.12807
Copy Paste: [[2507.12807]] Semantic-guided Fine-tuning of Foundation Model for Long-tailed Visual Recognition(https://arxiv.org/abs/2507.12807)
Keywords: foundation model
Abstract: The variance in class-wise sample sizes within long-tailed scenarios often results in degraded performance in less frequent classes. Fortunately, foundation models, pre-trained on vast open-world datasets, demonstrate strong potential for this task due to their generalizable representation, which promotes the development of adaptive strategies on pre-trained models in long-tailed learning. Advanced fine-tuning methods typically adjust visual encoders while neglecting the semantics derived from the frozen text encoder, overlooking the visual and textual alignment. To strengthen this alignment, we propose a novel approach, Semantic-guided fine-tuning of foundation model for long-tailed visual recognition (Sage), which incorporates semantic guidance derived from textual modality into the visual fine-tuning process. Specifically, we introduce an SG-Adapter that integrates class descriptions as semantic guidance to guide the fine-tuning of the visual encoder. The introduced guidance is passesed through the attention mechanism and enables the model to focus more on semantically relevant content, strengthening the alignment between the visual and textual modalities. Due to the inconsistent class-conditional distributions neglected by the existing loss function, the resulting prediction bias causes performance improvements for the tail class less than for the head class, even when the multi-modal alignment is enhanced. To address this challenge, we propose a novel distribution mismatch-aware compensation factor, which is specifically designed to rectify the prediction bias caused by the ignored inconsistent distribution based on our theoretical analysis, and is seamlessly integrated into the loss function. Extensive experiments on benchmark datasets demonstrate the effectiveness of the proposed Sage in enhancing performance in long-tailed learning.

Title: Large Language Models' Internal Perception of Symbolic Music

Authors: Andrew Shin, Kunitake Kaneko
Subjects: cs.CL, cs.AI, cs.LG, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2507.12808
Pdf URL: https://arxiv.org/pdf/2507.12808
Copy Paste: [[2507.12808]] Large Language Models' Internal Perception of Symbolic Music(https://arxiv.org/abs/2507.12808)
Keywords: generative
Abstract: Large language models (LLMs) excel at modeling relationships between strings in natural language and have shown promise in extending to other symbolic domains like coding or mathematics. However, the extent to which they implicitly model symbolic music remains underexplored. This paper investigates how LLMs represent musical concepts by generating symbolic music data from textual prompts describing combinations of genres and styles, and evaluating their utility through recognition and generation tasks. We produce a dataset of LLM-generated MIDI files without relying on explicit musical training. We then train neural networks entirely on this LLM-generated MIDI dataset and perform genre and style classification as well as melody completion, benchmarking their performance against established models. Our results demonstrate that LLMs can infer rudimentary musical structures and temporal relationships from text, highlighting both their potential to implicitly encode musical patterns and their limitations due to a lack of explicit musical context, shedding light on their generative capabilities for symbolic music.

Title: AnyCap Project: A Unified Framework, Dataset, and Benchmark for Controllable Omni-modal Captioning

Authors: Yiming Ren, Zhiqiang Lin, Yu Li, Gao Meng, Weiyun Wang, Junjie Wang, Zicheng Lin, Jifeng Dai, Yujiu Yang, Wenhai Wang, Ruihang Chu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.12841
Pdf URL: https://arxiv.org/pdf/2507.12841
Copy Paste: [[2507.12841]] AnyCap Project: A Unified Framework, Dataset, and Benchmark for Controllable Omni-modal Captioning(https://arxiv.org/abs/2507.12841)
Keywords: foundation model
Abstract: Controllable captioning is essential for precise multimodal alignment and instruction following, yet existing models often lack fine-grained control and reliable evaluation protocols. To address this gap, we present the AnyCap Project, an integrated solution spanning model, dataset, and evaluation. We introduce AnyCapModel (ACM), a lightweight plug-and-play framework that enhances the controllability of existing foundation models for omni-modal captioning without retraining the base model. ACM reuses the original captions from base models while incorporating user instructions and modality features to generate improved captions. To remedy the data scarcity in controllable multimodal captioning, we build AnyCapDataset (ACD), covering three modalities, 28 user-instruction types, and 300\,k high-quality data entries. We further propose AnyCapEval, a new benchmark that provides more reliable evaluation metrics for controllable captioning by decoupling content accuracy and stylistic fidelity. ACM markedly improves caption quality across a diverse set of base models on AnyCapEval. Notably, ACM-8B raises GPT-4oś content scores by 45\% and style scores by 12\%, and it also achieves substantial gains on widely used benchmarks such as MIA-Bench and VidCapBench.

Title: Generalist Bimanual Manipulation via Foundation Video Diffusion Models

Authors: Yao Feng, Hengkai Tan, Xinyi Mao, Guodong Liu, Shuhe Huang, Chendong Xiang, Hang Su, Jun Zhu
Subjects: cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2507.12898
Pdf URL: https://arxiv.org/pdf/2507.12898
Copy Paste: [[2507.12898]] Generalist Bimanual Manipulation via Foundation Video Diffusion Models(https://arxiv.org/abs/2507.12898)
Keywords: diffusion, foundation model
Abstract: Bimanual robotic manipulation, which involves the coordinated control of two robotic arms, is foundational for solving challenging tasks. Despite recent progress in general-purpose manipulation, data scarcity and embodiment heterogeneity remain serious obstacles to further scaling up in bimanual settings. In this paper, we introduce VIdeo Diffusion for Action Reasoning (VIDAR), a two-stage framework that leverages large-scale, diffusion-based video pre-training and a novel masked inverse dynamics model for action prediction. We pre-train the video diffusion model on 750K multi-view videos from three real-world bimanual robot platforms, utilizing a unified observation space that encodes robot, camera, task, and scene contexts. Our masked inverse dynamics model learns masks to extract action-relevant information from generated trajectories without requiring pixel-level labels, and the masks can effectively generalize to unseen backgrounds. Our experiments demonstrate that with only 20 minutes of human demonstrations on an unseen robot platform (only 1% of typical data requirements), VIDAR generalizes to unseen tasks and backgrounds with strong semantic understanding, surpassing state-of-the-art methods. Our findings highlight the potential of video foundation models, coupled with masked action prediction, to enable scalable and generalizable robotic manipulation in diverse real-world settings.

Title: Argus: Leveraging Multiview Images for Improved 3-D Scene Understanding With Large Language Models

Authors: Yifan Xu, Chao Zhang, Hanqi Jiang, Xiaoyan Wang, Ruifei Ma, Yiwei Li, Zihao Wu, Zeju Li, Xiangde Liu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2507.12916
Pdf URL: https://arxiv.org/pdf/2507.12916
Copy Paste: [[2507.12916]] Argus: Leveraging Multiview Images for Improved 3-D Scene Understanding With Large Language Models(https://arxiv.org/abs/2507.12916)
Keywords: foundation model
Abstract: Advancements in foundation models have made it possible to conduct applications in various downstream tasks. Especially, the new era has witnessed a remarkable capability to extend Large Language Models (LLMs) for tackling tasks of 3D scene understanding. Current methods rely heavily on 3D point clouds, but the 3D point cloud reconstruction of an indoor scene often results in information loss. Some textureless planes or repetitive patterns are prone to omission and manifest as voids within the reconstructed 3D point clouds. Besides, objects with complex structures tend to introduce distortion of details caused by misalignments between the captured images and the dense reconstructed point clouds. 2D multi-view images present visual consistency with 3D point clouds and provide more detailed representations of scene components, which can naturally compensate for these deficiencies. Based on these insights, we propose Argus, a novel 3D multimodal framework that leverages multi-view images for enhanced 3D scene understanding with LLMs. In general, Argus can be treated as a 3D Large Multimodal Foundation Model (3D-LMM) since it takes various modalities as input(text instructions, 2D multi-view images, and 3D point clouds) and expands the capability of LLMs to tackle 3D tasks. Argus involves fusing and integrating multi-view images and camera poses into view-as-scene features, which interact with the 3D features to create comprehensive and detailed 3D-aware scene embeddings. Our approach compensates for the information loss while reconstructing 3D point clouds and helps LLMs better understand the 3D world. Extensive experiments demonstrate that our method outperforms existing 3D-LMMs in various downstream tasks.

Title: DMQ: Dissecting Outliers of Diffusion Models for Post-Training Quantization

Authors: Dongyeun Lee, Jiwan Hur, Hyounguk Shon, Jae Young Lee, Junmo Kim
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2507.12933
Pdf URL: https://arxiv.org/pdf/2507.12933
Copy Paste: [[2507.12933]] DMQ: Dissecting Outliers of Diffusion Models for Post-Training Quantization(https://arxiv.org/abs/2507.12933)
Keywords: diffusion
Abstract: Diffusion models have achieved remarkable success in image generation but come with significant computational costs, posing challenges for deployment in resource-constrained environments. Recent post-training quantization (PTQ) methods have attempted to mitigate this issue by focusing on the iterative nature of diffusion models. However, these approaches often overlook outliers, leading to degraded performance at low bit-widths. In this paper, we propose a DMQ which combines Learned Equivalent Scaling (LES) and channel-wise Power-of-Two Scaling (PTS) to effectively address these challenges. Learned Equivalent Scaling optimizes channel-wise scaling factors to redistribute quantization difficulty between weights and activations, reducing overall quantization error. Recognizing that early denoising steps, despite having small quantization errors, crucially impact the final output due to error accumulation, we incorporate an adaptive timestep weighting scheme to prioritize these critical steps during learning. Furthermore, identifying that layers such as skip connections exhibit high inter-channel variance, we introduce channel-wise Power-of-Two Scaling for activations. To ensure robust selection of PTS factors even with small calibration set, we introduce a voting algorithm that enhances reliability. Extensive experiments demonstrate that our method significantly outperforms existing works, especially at low bit-widths such as W4A6 (4-bit weight, 6-bit activation) and W4A8, maintaining high image generation quality and model stability. The code is available at this https URL.

Title: LoViC: Efficient Long Video Generation with Context Compression

Authors: Jiaxiu Jiang, Wenbo Li, Jingjing Ren, Yuping Qiu, Yong Guo, Xiaogang Xu, Han Wu, Wangmeng Zuo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.12952
Pdf URL: https://arxiv.org/pdf/2507.12952
Copy Paste: [[2507.12952]] LoViC: Efficient Long Video Generation with Context Compression(https://arxiv.org/abs/2507.12952)
Keywords: diffusion
Abstract: Despite recent advances in diffusion transformers (DiTs) for text-to-video generation, scaling to long-duration content remains challenging due to the quadratic complexity of self-attention. While prior efforts -- such as sparse attention and temporally autoregressive models -- offer partial relief, they often compromise temporal coherence or scalability. We introduce LoViC, a DiT-based framework trained on million-scale open-domain videos, designed to produce long, coherent videos through a segment-wise generation process. At the core of our approach is FlexFormer, an expressive autoencoder that jointly compresses video and text into unified latent representations. It supports variable-length inputs with linearly adjustable compression rates, enabled by a single query token design based on the Q-Former architecture. Additionally, by encoding temporal context through position-aware mechanisms, our model seamlessly supports prediction, retradiction, interpolation, and multi-shot generation within a unified paradigm. Extensive experiments across diverse tasks validate the effectiveness and versatility of our approach.

Title: FantasyPortrait: Enhancing Multi-Character Portrait Animation with Expression-Augmented Diffusion Transformers

Authors: Qiang Wang, Mengchao Wang, Fan Jiang, Yaqi Fan, Yonggang Qi, Mu Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.12956
Pdf URL: https://arxiv.org/pdf/2507.12956
Copy Paste: [[2507.12956]] FantasyPortrait: Enhancing Multi-Character Portrait Animation with Expression-Augmented Diffusion Transformers(https://arxiv.org/abs/2507.12956)
Keywords: diffusion
Abstract: Producing expressive facial animations from static images is a challenging task. Prior methods relying on explicit geometric priors (e.g., facial landmarks or 3DMM) often suffer from artifacts in cross reenactment and struggle to capture subtle emotions. Furthermore, existing approaches lack support for multi-character animation, as driving features from different individuals frequently interfere with one another, complicating the task. To address these challenges, we propose FantasyPortrait, a diffusion transformer based framework capable of generating high-fidelity and emotion-rich animations for both single- and multi-character scenarios. Our method introduces an expression-augmented learning strategy that utilizes implicit representations to capture identity-agnostic facial dynamics, enhancing the model's ability to render fine-grained emotions. For multi-character control, we design a masked cross-attention mechanism that ensures independent yet coordinated expression generation, effectively preventing feature interference. To advance research in this area, we propose the Multi-Expr dataset and ExprBench, which are specifically designed datasets and benchmarks for training and evaluating multi-character portrait animations. Extensive experiments demonstrate that FantasyPortrait significantly outperforms state-of-the-art methods in both quantitative metrics and qualitative evaluations, excelling particularly in challenging cross reenactment and multi-character contexts. Our project page is this https URL.

Title: RGB Pre-Training Enhanced Unobservable Feature Latent Diffusion Model for Spectral Reconstruction

Authors: Keli Deng, Jie Nie, Yuntao Qian
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.12967
Pdf URL: https://arxiv.org/pdf/2507.12967
Copy Paste: [[2507.12967]] RGB Pre-Training Enhanced Unobservable Feature Latent Diffusion Model for Spectral Reconstruction(https://arxiv.org/abs/2507.12967)
Keywords: diffusion
Abstract: Spectral reconstruction (SR) is a crucial problem in image processing that requires reconstructing hyperspectral images (HSIs) from the corresponding RGB images. A key difficulty in SR is estimating the unobservable feature, which encapsulates significant spectral information not captured by RGB imaging sensors. The solution lies in effectively constructing the spectral-spatial joint distribution conditioned on the RGB image to complement the unobservable feature. Since HSIs share a similar spatial structure with the corresponding RGB images, it is rational to capitalize on the rich spatial knowledge in RGB pre-trained models for spectral-spatial joint distribution learning. To this end, we extend the RGB pre-trained latent diffusion model (RGB-LDM) to an unobservable feature LDM (ULDM) for SR. As the RGB-LDM and its corresponding spatial autoencoder (SpaAE) already excel in spatial knowledge, the ULDM can focus on modeling spectral structure. Moreover, separating the unobservable feature from the HSI reduces the redundant spectral information and empowers the ULDM to learn the joint distribution in a compact latent space. Specifically, we propose a two-stage pipeline consisting of spectral structure representation learning and spectral-spatial joint distribution learning to transform the RGB-LDM into the ULDM. In the first stage, a spectral unobservable feature autoencoder (SpeUAE) is trained to extract and compress the unobservable feature into a 3D manifold aligned with RGB space. In the second stage, the spectral and spatial structures are sequentially encoded by the SpeUAE and the SpaAE, respectively. The ULDM is then acquired to model the distribution of the coded unobservable feature with guidance from the corresponding RGB images. Experimental results on SR and downstream relighting tasks demonstrate that our proposed method achieves state-of-the-art performance.

Title: A Distributed Generative AI Approach for Heterogeneous Multi-Domain Environments under Data Sharing constraints

Authors: Youssef Tawfilis, Hossam Amer, Minar El-Aasser, Tallal Elshabrawy
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2507.12979
Pdf URL: https://arxiv.org/pdf/2507.12979
Copy Paste: [[2507.12979]] A Distributed Generative AI Approach for Heterogeneous Multi-Domain Environments under Data Sharing constraints(https://arxiv.org/abs/2507.12979)
Keywords: generative
Abstract: Federated Learning has gained increasing attention for its ability to enable multiple nodes to collaboratively train machine learning models without sharing their raw data. At the same time, Generative AI -- particularly Generative Adversarial Networks (GANs) -- have achieved remarkable success across a wide range of domains, such as healthcare, security, and Image Generation. However, training generative models typically requires large datasets and significant computational resources, which are often unavailable in real-world settings. Acquiring such resources can be costly and inefficient, especially when many underutilized devices -- such as IoT devices and edge devices -- with varying capabilities remain idle. Moreover, obtaining large datasets is challenging due to privacy concerns and copyright restrictions, as most devices are unwilling to share their data. To address these challenges, we propose a novel approach for decentralized GAN training that enables the utilization of distributed data and underutilized, low-capability devices while not sharing data in its raw form. Our approach is designed to tackle key challenges in decentralized environments, combining KLD-weighted Clustered Federated Learning to address the issues of data heterogeneity and multi-domain datasets, with Heterogeneous U-Shaped split learning to tackle the challenge of device heterogeneity under strict data sharing constraints -- ensuring that no labels or raw data, whether real or synthetic, are ever shared between nodes. Experimental results shows that our approach demonstrates consistent and significant improvements across key performance metrics, where it achieves 1.1x -- 2.2x higher image generation scores, an average 10% boost in classification metrics (up to 50% in multi-domain non-IID settings), in much lower latency compared to several benchmarks. Find our code at this https URL.

Title: Beyond Fully Supervised Pixel Annotations: Scribble-Driven Weakly-Supervised Framework for Image Manipulation Localization

Authors: Songlin Li, Guofeng Yu, Zhiqing Guo, Yunfeng Diao, Dan Ma, Gaobo Yang, Liejun Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.13018
Pdf URL: https://arxiv.org/pdf/2507.13018
Copy Paste: [[2507.13018]] Beyond Fully Supervised Pixel Annotations: Scribble-Driven Weakly-Supervised Framework for Image Manipulation Localization(https://arxiv.org/abs/2507.13018)
Keywords: self-supervised
Abstract: Deep learning-based image manipulation localization (IML) methods have achieved remarkable performance in recent years, but typically rely on large-scale pixel-level annotated datasets. To address the challenge of acquiring high-quality annotations, some recent weakly supervised methods utilize image-level labels to segment manipulated regions. However, the performance is still limited due to insufficient supervision signals. In this study, we explore a form of weak supervision that improves the annotation efficiency and detection performance, namely scribble annotation supervision. We re-annotated mainstream IML datasets with scribble labels and propose the first scribble-based IML (Sc-IML) dataset. Additionally, we propose the first scribble-based weakly supervised IML framework. Specifically, we employ self-supervised training with a structural consistency loss to encourage the model to produce consistent predictions under multi-scale and augmented inputs. In addition, we propose a prior-aware feature modulation module (PFMM) that adaptively integrates prior information from both manipulated and authentic regions for dynamic feature adjustment, further enhancing feature discriminability and prediction consistency in complex scenes. We also propose a gated adaptive fusion module (GAFM) that utilizes gating mechanisms to regulate information flow during feature fusion, guiding the model toward emphasizing potential tampered regions. Finally, we propose a confidence-aware entropy minimization loss (${\mathcal{L}}_{ {CEM }}$). This loss dynamically regularizes predictions in weakly annotated or unlabeled regions based on model uncertainty, effectively suppressing unreliable predictions. Experimental results show that our method outperforms existing fully supervised approaches in terms of average performance both in-distribution and out-of-distribution.

Title: Fault detection and diagnosis for the engine electrical system of a space launcher based on a temporal convolutional autoencoder and calibrated classifiers

Authors: Luis Basora, Louison Bocquet-Nouaille, Elinirina Robinson, Serge Le Gonidec
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.13022
Pdf URL: https://arxiv.org/pdf/2507.13022
Copy Paste: [[2507.13022]] Fault detection and diagnosis for the engine electrical system of a space launcher based on a temporal convolutional autoencoder and calibrated classifiers(https://arxiv.org/abs/2507.13022)
Keywords: anomaly
Abstract: In the context of the health monitoring for the next generation of reusable space launchers, we outline a first step toward developing an onboard fault detection and diagnostic capability for the electrical system that controls the engine valves. Unlike existing approaches in the literature, our solution is designed to meet a broader range of key requirements. This includes estimating confidence levels for predictions, detecting out-of-distribution (OOD) cases, and controlling false alarms. The proposed solution is based on a temporal convolutional autoencoder to automatically extract low-dimensional features from raw sensor data. Fault detection and diagnosis are respectively carried out using a binary and a multiclass classifier trained on the autoencoder latent and residual spaces. The classifiers are histogram-based gradient boosting models calibrated to output probabilities that can be interpreted as confidence levels. A relatively simple technique, based on inductive conformal anomaly detection, is used to identify OOD data. We leverage other simple yet effective techniques, such as cumulative sum control chart (CUSUM) to limit the false alarms, and threshold moving to address class imbalance in fault detection. The proposed framework is highly configurable and has been evaluated on simulated data, covering both nominal and anomalous operational scenarios. The results indicate that our solution is a promising first step, though testing with real data will be necessary to ensure that it achieves the required maturity level for operational use.

Title: Label-Consistent Dataset Distillation with Detector-Guided Refinement

Authors: Yawen Zou, Guang Li, Zi Wang, Chunzhi Gu, Chao Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.13074
Pdf URL: https://arxiv.org/pdf/2507.13074
Copy Paste: [[2507.13074]] Label-Consistent Dataset Distillation with Detector-Guided Refinement(https://arxiv.org/abs/2507.13074)
Keywords: diffusion
Abstract: Dataset distillation (DD) aims to generate a compact yet informative dataset that achieves performance comparable to the original dataset, thereby reducing demands on storage and computational resources. Although diffusion models have made significant progress in dataset distillation, the generated surrogate datasets often contain samples with label inconsistencies or insufficient structural detail, leading to suboptimal downstream performance. To address these issues, we propose a detector-guided dataset distillation framework that explicitly leverages a pre-trained detector to identify and refine anomalous synthetic samples, thereby ensuring label consistency and improving image quality. Specifically, a detector model trained on the original dataset is employed to identify anomalous images exhibiting label mismatches or low classification confidence. For each defective image, multiple candidates are generated using a pre-trained diffusion model conditioned on the corresponding image prototype and label. The optimal candidate is then selected by jointly considering the detector's confidence score and dissimilarity to existing qualified synthetic samples, thereby ensuring both label accuracy and intra-class diversity. Experimental results demonstrate that our method can synthesize high-quality representative images with richer details, achieving state-of-the-art performance on the validation set.

Title: DiffOSeg: Omni Medical Image Segmentation via Multi-Expert Collaboration Diffusion Model

Authors: Han Zhang, Xiangde Luo, Yong Chen, Kang Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.13087
Pdf URL: https://arxiv.org/pdf/2507.13087
Copy Paste: [[2507.13087]] DiffOSeg: Omni Medical Image Segmentation via Multi-Expert Collaboration Diffusion Model(https://arxiv.org/abs/2507.13087)
Keywords: diffusion
Abstract: Annotation variability remains a substantial challenge in medical image segmentation, stemming from ambiguous imaging boundaries and diverse clinical expertise. Traditional deep learning methods producing single deterministic segmentation predictions often fail to capture these annotator biases. Although recent studies have explored multi-rater segmentation, existing methods typically focus on a single perspective -- either generating a probabilistic ``gold standard'' consensus or preserving expert-specific preferences -- thus struggling to provide a more omni view. In this study, we propose DiffOSeg, a two-stage diffusion-based framework, which aims to simultaneously achieve both consensus-driven (combining all experts' opinions) and preference-driven (reflecting experts' individual assessments) segmentation. Stage I establishes population consensus through a probabilistic consensus strategy, while Stage II captures expert-specific preference via adaptive prompts. Demonstrated on two public datasets (LIDC-IDRI and NPC-170), our model outperforms existing state-of-the-art methods across all evaluated metrics. Source code is available at this https URL .

Title: Deep Learning-Based Fetal Lung Segmentation from Diffusion-weighted MRI Images and Lung Maturity Evaluation for Fetal Growth Restriction

Authors: Zhennan Xiao, Katharine Brudkiewicz, Zhen Yuan, Rosalind Aughwane, Magdalena Sokolska, Joanna Chappell, Trevor Gaunt, Anna L. David, Andrew P. King, Andrew Melbourne
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2507.13106
Pdf URL: https://arxiv.org/pdf/2507.13106
Copy Paste: [[2507.13106]] Deep Learning-Based Fetal Lung Segmentation from Diffusion-weighted MRI Images and Lung Maturity Evaluation for Fetal Growth Restriction(https://arxiv.org/abs/2507.13106)
Keywords: diffusion
Abstract: Fetal lung maturity is a critical indicator for predicting neonatal outcomes and the need for post-natal intervention, especially for pregnancies affected by fetal growth restriction. Intra-voxel incoherent motion analysis has shown promising results for non-invasive assessment of fetal lung development, but its reliance on manual segmentation is time-consuming, thus limiting its clinical applicability. In this work, we present an automated lung maturity evaluation pipeline for diffusion-weighted magnetic resonance images that consists of a deep learning-based fetal lung segmentation model and a model-fitting lung maturity assessment. A 3D nnU-Net model was trained on manually segmented images selected from the baseline frames of 4D diffusion-weighted MRI scans. The segmentation model demonstrated robust performance, yielding a mean Dice coefficient of 82.14%. Next, voxel-wise model fitting was performed based on both the nnU-Net-predicted and manual lung segmentations to quantify IVIM parameters reflecting tissue microstructure and perfusion. The results suggested no differences between the two. Our work shows that a fully automated pipeline is possible for supporting fetal lung maturity assessment and clinical decision-making.

Title: R^2MoE: Redundancy-Removal Mixture of Experts for Lifelong Concept Learning

Authors: Xiaohan Guo, Yusong Cai, Zejia Liu, Zhengning Wang, Lili Pan, Hongliang Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.13107
Pdf URL: https://arxiv.org/pdf/2507.13107
Copy Paste: [[2507.13107]] R^2MoE: Redundancy-Removal Mixture of Experts for Lifelong Concept Learning(https://arxiv.org/abs/2507.13107)
Keywords: generative
Abstract: Enabling large-scale generative models to continuously learn new visual concepts is essential for personalizing pre-trained models to meet individual user preferences. Existing approaches for continual visual concept learning are constrained by two fundamental challenges: catastrophic forgetting and parameter expansion. In this paper, we propose Redundancy-Removal Mixture of Experts (R^2MoE), a parameter-efficient framework for lifelong visual concept learning that effectively learns new concepts while incurring minimal parameter overhead. Our framework includes three key innovative contributions: First, we propose a mixture-of-experts framework with a routing distillation mechanism that enables experts to acquire concept-specific knowledge while preserving the gating network's routing capability, thereby effectively mitigating catastrophic forgetting. Second, we propose a strategy for eliminating redundant layer-wise experts that reduces the number of expert parameters by fully utilizing previously learned experts. Third, we employ a hierarchical local attention-guided inference approach to mitigate interference between generated visual concepts. Extensive experiments have demonstrated that our method generates images with superior conceptual fidelity compared to the state-of-the-art (SOTA) method, achieving an impressive 87.8\% reduction in forgetting rates and 63.3\% fewer parameters on the CustomConcept 101 dataset. Our code is available at {this https URL}

Title: 3DKeyAD: High-Resolution 3D Point Cloud Anomaly Detection via Keypoint-Guided Point Clustering

Authors: Zi Wang, Katsuya Hotta, Koichiro Kamide, Yawen Zou, Chao Zhang, Jun Yu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.13110
Pdf URL: https://arxiv.org/pdf/2507.13110
Copy Paste: [[2507.13110]] 3DKeyAD: High-Resolution 3D Point Cloud Anomaly Detection via Keypoint-Guided Point Clustering(https://arxiv.org/abs/2507.13110)
Keywords: anomaly
Abstract: High-resolution 3D point clouds are highly effective for detecting subtle structural anomalies in industrial inspection. However, their dense and irregular nature imposes significant challenges, including high computational cost, sensitivity to spatial misalignment, and difficulty in capturing localized structural differences. This paper introduces a registration-based anomaly detection framework that combines multi-prototype alignment with cluster-wise discrepancy analysis to enable precise 3D anomaly localization. Specifically, each test sample is first registered to multiple normal prototypes to enable direct structural comparison. To evaluate anomalies at a local level, clustering is performed over the point cloud, and similarity is computed between features from the test sample and the prototypes within each cluster. Rather than selecting cluster centroids randomly, a keypoint-guided strategy is employed, where geometrically informative points are chosen as centroids. This ensures that clusters are centered on feature-rich regions, enabling more meaningful and stable distance-based comparisons. Extensive experiments on the Real3D-AD benchmark demonstrate that the proposed method achieves state-of-the-art performance in both object-level and point-level anomaly detection, even using only raw features.

Title: A Computational Framework to Identify Self-Aspects in Text

Authors: Jaya Caporusso, Matthew Purver, Senja Pollak
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.13115
Pdf URL: https://arxiv.org/pdf/2507.13115
Copy Paste: [[2507.13115]] A Computational Framework to Identify Self-Aspects in Text(https://arxiv.org/abs/2507.13115)
Keywords: generative
Abstract: This Ph.D. proposal introduces a plan to develop a computational framework to identify Self-aspects in text. The Self is a multifaceted construct and it is reflected in language. While it is described across disciplines like cognitive science and phenomenology, it remains underexplored in natural language processing (NLP). Many of the aspects of the Self align with psychological and other well-researched phenomena (e.g., those related to mental health), highlighting the need for systematic NLP-based analysis. In line with this, we plan to introduce an ontology of Self-aspects and a gold-standard annotated dataset. Using this foundation, we will develop and evaluate conventional discriminative models, generative large language models, and embedding-based retrieval approaches against four main criteria: interpretability, ground-truth adherence, accuracy, and computational efficiency. Top-performing models will be applied in case studies in mental health and empirical phenomenology.

Title: NGTM: Substructure-based Neural Graph Topic Model for Interpretable Graph Generation

Authors: Yuanxin Zhuang, Dazhong Shen, Ying Sun
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.13133
Pdf URL: https://arxiv.org/pdf/2507.13133
Copy Paste: [[2507.13133]] NGTM: Substructure-based Neural Graph Topic Model for Interpretable Graph Generation(https://arxiv.org/abs/2507.13133)
Keywords: generative
Abstract: Graph generation plays a pivotal role across numerous domains, including molecular design and knowledge graph construction. Although existing methods achieve considerable success in generating realistic graphs, their interpretability remains limited, often obscuring the rationale behind structural decisions. To address this challenge, we propose the Neural Graph Topic Model (NGTM), a novel generative framework inspired by topic modeling in natural language processing. NGTM represents graphs as mixtures of latent topics, each defining a distribution over semantically meaningful substructures, which facilitates explicit interpretability at both local and global scales. The generation process transparently integrates these topic distributions with a global structural variable, enabling clear semantic tracing of each generated graph. Experiments demonstrate that NGTM achieves competitive generation quality while uniquely enabling fine-grained control and interpretability, allowing users to tune structural features or induce biological properties through topic-level adjustments.

Title: Assessing the Reliability of LLMs Annotations in the Context of Demographic Bias and Model Explanation

Authors: Hadi Mohammadi, Tina Shahedi, Pablo Mosteiro, Massimo Poesio, Ayoub Bagheri, Anastasia Giachanou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.13138
Pdf URL: https://arxiv.org/pdf/2507.13138
Copy Paste: [[2507.13138]] Assessing the Reliability of LLMs Annotations in the Context of Demographic Bias and Model Explanation(https://arxiv.org/abs/2507.13138)
Keywords: generative
Abstract: Understanding the sources of variability in annotations is crucial for developing fair NLP systems, especially for tasks like sexism detection where demographic bias is a concern. This study investigates the extent to which annotator demographic features influence labeling decisions compared to text content. Using a Generalized Linear Mixed Model, we quantify this inf luence, finding that while statistically present, demographic factors account for a minor fraction ( 8%) of the observed variance, with tweet content being the dominant factor. We then assess the reliability of Generative AI (GenAI) models as annotators, specifically evaluating if guiding them with demographic personas improves alignment with human judgments. Our results indicate that simplistic persona prompting often fails to enhance, and sometimes degrades, performance compared to baseline models. Furthermore, explainable AI (XAI) techniques reveal that model predictions rely heavily on content-specific tokens related to sexism, rather than correlates of demographic characteristics. We argue that focusing on content-driven explanations and robust annotation protocols offers a more reliable path towards fairness than potentially persona simulation.

Title: DINO-VO: A Feature-based Visual Odometry Leveraging a Visual Foundation Model

Authors: Maulana Bisyir Azhari, David Hyunchul Shim
Subjects: cs.CV, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2507.13145
Pdf URL: https://arxiv.org/pdf/2507.13145
Copy Paste: [[2507.13145]] DINO-VO: A Feature-based Visual Odometry Leveraging a Visual Foundation Model(https://arxiv.org/abs/2507.13145)
Keywords: foundation model
Abstract: Learning-based monocular visual odometry (VO) poses robustness, generalization, and efficiency challenges in robotics. Recent advances in visual foundation models, such as DINOv2, have improved robustness and generalization in various vision tasks, yet their integration in VO remains limited due to coarse feature granularity. In this paper, we present DINO-VO, a feature-based VO system leveraging DINOv2 visual foundation model for its sparse feature matching. To address the integration challenge, we propose a salient keypoints detector tailored to DINOv2's coarse features. Furthermore, we complement DINOv2's robust-semantic features with fine-grained geometric features, resulting in more localizable representations. Finally, a transformer-based matcher and differentiable pose estimation layer enable precise camera motion estimation by learning good matches. Against prior detector-descriptor networks like SuperPoint, DINO-VO demonstrates greater robustness in challenging environments. Furthermore, we show superior accuracy and generalization of the proposed feature descriptors against standalone DINOv2 coarse features. DINO-VO outperforms prior frame-to-frame VO methods on the TartanAir and KITTI datasets and is competitive on EuRoC dataset, while running efficiently at 72 FPS with less than 1GB of memory usage on a single GPU. Moreover, it performs competitively against Visual SLAM systems on outdoor driving scenarios, showcasing its generalization capabilities.

Title: MoTM: Towards a Foundation Model for Time Series Imputation based on Continuous Modeling

Authors: Etienne Le Naour, Tahar Nabil, Ghislain Agoua
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.13207
Pdf URL: https://arxiv.org/pdf/2507.13207
Copy Paste: [[2507.13207]] MoTM: Towards a Foundation Model for Time Series Imputation based on Continuous Modeling(https://arxiv.org/abs/2507.13207)
Keywords: foundation model
Abstract: Recent years have witnessed a growing interest for time series foundation models, with a strong emphasis on the forecasting task. Yet, the crucial task of out-of-domain imputation of missing values remains largely underexplored. We propose a first step to fill this gap by leveraging implicit neural representations (INRs). INRs model time series as continuous functions and naturally handle various missing data scenarios and sampling rates. While they have shown strong performance within specific distributions, they struggle under distribution shifts. To address this, we introduce MoTM (Mixture of Timeflow Models), a step toward a foundation model for time series imputation. Building on the idea that a new time series is a mixture of previously seen patterns, MoTM combines a basis of INRs, each trained independently on a distinct family of time series, with a ridge regressor that adapts to the observed context at inference. We demonstrate robust in-domain and out-of-domain generalization across diverse imputation scenarios (e.g., block and pointwise missingness, variable sampling rates), paving the way for adaptable foundation imputation models.

Title: Synthesizing Reality: Leveraging the Generative AI-Powered Platform Midjourney for Construction Worker Detection

Authors: Hongyang Zhao, Tianyu Liang, Sina Davari, Daeho Kim
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2507.13221
Pdf URL: https://arxiv.org/pdf/2507.13221
Copy Paste: [[2507.13221]] Synthesizing Reality: Leveraging the Generative AI-Powered Platform Midjourney for Construction Worker Detection(https://arxiv.org/abs/2507.13221)
Keywords: generative
Abstract: While recent advancements in deep neural networks (DNNs) have substantially enhanced visual AI's capabilities, the challenge of inadequate data diversity and volume remains, particularly in construction domain. This study presents a novel image synthesis methodology tailored for construction worker detection, leveraging the generative-AI platform Midjourney. The approach entails generating a collection of 12,000 synthetic images by formulating 3000 different prompts, with an emphasis on image realism and diversity. These images, after manual labeling, serve as a dataset for DNN training. Evaluation on a real construction image dataset yielded promising results, with the model attaining average precisions (APs) of 0.937 and 0.642 at intersection-over-union (IoU) thresholds of 0.5 and 0.5 to 0.95, respectively. Notably, the model demonstrated near-perfect performance on the synthetic dataset, achieving APs of 0.994 and 0.919 at the two mentioned thresholds. These findings reveal both the potential and weakness of generative AI in addressing DNN training data scarcity.

Title: Leveraging Pre-Trained Visual Models for AI-Generated Video Detection

Authors: Keerthi Veeramachaneni, Praveen Tirupattur, Amrit Singh Bedi, Mubarak Shah
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.13224
Pdf URL: https://arxiv.org/pdf/2507.13224
Copy Paste: [[2507.13224]] Leveraging Pre-Trained Visual Models for AI-Generated Video Detection(https://arxiv.org/abs/2507.13224)
Keywords: generative
Abstract: Recent advances in Generative AI (GenAI) have led to significant improvements in the quality of generated visual content. As AI-generated visual content becomes increasingly indistinguishable from real content, the challenge of detecting the generated content becomes critical in combating misinformation, ensuring privacy, and preventing security threats. Although there has been substantial progress in detecting AI-generated images, current methods for video detection are largely focused on deepfakes, which primarily involve human faces. However, the field of video generation has advanced beyond DeepFakes, creating an urgent need for methods capable of detecting AI-generated videos with generic content. To address this gap, we propose a novel approach that leverages pre-trained visual models to distinguish between real and generated videos. The features extracted from these pre-trained models, which have been trained on extensive real visual content, contain inherent signals that can help distinguish real from generated videos. Using these extracted features, we achieve high detection performance without requiring additional model training, and we further improve performance by training a simple linear classification layer on top of the extracted features. We validated our method on a dataset we compiled (VID-AID), which includes around 10,000 AI-generated videos produced by 9 different text-to-video models, along with 4,000 real videos, totaling over 7 hours of video content. Our evaluation shows that our approach achieves high detection accuracy, above 90% on average, underscoring its effectiveness. Upon acceptance, we plan to publicly release the code, the pre-trained models, and our dataset to support ongoing research in this critical area.

Title: VITA: Vision-to-Action Flow Matching Policy

Authors: Dechen Gao, Boqi Zhao, Andrew Lee, Ian Chuang, Hanchu Zhou, Hang Wang, Zhe Zhao, Junshan Zhang, Iman Soltani
Subjects: cs.CV, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2507.13231
Pdf URL: https://arxiv.org/pdf/2507.13231
Copy Paste: [[2507.13231]] VITA: Vision-to-Action Flow Matching Policy(https://arxiv.org/abs/2507.13231)
Keywords: diffusion, generative
Abstract: We present VITA, a Vision-To-Action flow matching policy that evolves latent visual representations into latent actions for visuomotor control. Traditional flow matching and diffusion policies sample from standard source distributions (e.g., Gaussian noise) and require additional conditioning mechanisms like cross-attention to condition action generation on visual information, creating time and space overheads. VITA proposes a novel paradigm that treats latent images as the flow source, learning an inherent mapping from vision to action while eliminating separate conditioning modules and preserving generative modeling capabilities. Learning flows between fundamentally different modalities like vision and action is challenging due to sparse action data lacking semantic structures and dimensional mismatches between high-dimensional visual representations and raw actions. We address this by creating a structured action latent space via an autoencoder as the flow matching target, up-sampling raw actions to match visual representation shapes. Crucially, we supervise flow matching with both encoder targets and final action outputs through flow latent decoding, which backpropagates action reconstruction loss through sequential flow matching ODE solving steps for effective end-to-end learning. Implemented as simple MLP layers, VITA is evaluated on challenging bi-manual manipulation tasks on the ALOHA platform, including 5 simulation and 2 real-world tasks. Despite its simplicity, MLP-only VITA outperforms or matches state-of-the-art generative policies while reducing inference latency by 50-130% compared to conventional flow matching policies requiring different conditioning mechanisms or complex architectures. To our knowledge, VITA is the first MLP-only flow matching policy capable of solving complex bi-manual manipulation tasks like those in ALOHA benchmarks.

Title: Enhancing Cross-task Transfer of Large Language Models via Activation Steering

Authors: Xinyu Tang, Zhihao Lv, Xiaoxue Cheng, Junyi Li, Wayne Xin Zhao, Zujie Wen, Zhiqiang Zhang, Jun Zhou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.13236
Pdf URL: https://arxiv.org/pdf/2507.13236
Copy Paste: [[2507.13236]] Enhancing Cross-task Transfer of Large Language Models via Activation Steering(https://arxiv.org/abs/2507.13236)
Keywords: in-context
Abstract: Large language models (LLMs) have shown impressive abilities in leveraging pretrained knowledge through prompting, but they often struggle with unseen tasks, particularly in data-scarce scenarios. While cross-task in-context learning offers a direct solution for transferring knowledge across tasks, it still faces critical challenges in terms of robustness, scalability, and efficiency. In this paper, we investigate whether cross-task transfer can be achieved via latent space steering without parameter updates or input expansion. Through an analysis of activation patterns in the latent space of LLMs, we observe that the enhanced activations induced by in-context examples have consistent patterns across different tasks. Inspired by these findings, we propose CAST, a novel Cross-task Activation Steering Transfer framework that enables effective transfer by manipulating the model's internal activation states. Our approach first selects influential and diverse samples from high-resource tasks, then utilizes their contrastive representation-enhanced activations to adapt LLMs to low-resource tasks. Extensive experiments across both cross-domain and cross-lingual transfer settings show that our method outperforms competitive baselines and demonstrates superior scalability and lower computational costs.

Title: DiffClean: Diffusion-based Makeup Removal for Accurate Age Estimation

Authors: Ekta Balkrishna Gavas, Chinmay Hegde, Nasir Memon, Sudipta Banerjee
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.13292
Pdf URL: https://arxiv.org/pdf/2507.13292
Copy Paste: [[2507.13292]] DiffClean: Diffusion-based Makeup Removal for Accurate Age Estimation(https://arxiv.org/abs/2507.13292)
Keywords: diffusion
Abstract: Accurate age verification can protect underage users from unauthorized access to online platforms and e-commerce sites that provide age-restricted services. However, accurate age estimation can be confounded by several factors, including facial makeup that can induce changes to alter perceived identity and age to fool both humans and machines. In this work, we propose DiffClean which erases makeup traces using a text-guided diffusion model to defend against makeup attacks. DiffClean improves age estimation (minor vs. adult accuracy by 4.8%) and face verification (TMR by 8.9% at FMR=0.01%) over competing baselines on digitally simulated and real makeup images.

Title: FashionPose: Text to Pose to Relight Image Generation for Personalized Fashion Visualization

Authors: Chuancheng Shi, Yixiang Chen, Burong Lei, Jichao Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.13311
Pdf URL: https://arxiv.org/pdf/2507.13311
Copy Paste: [[2507.13311]] FashionPose: Text to Pose to Relight Image Generation for Personalized Fashion Visualization(https://arxiv.org/abs/2507.13311)
Keywords: diffusion
Abstract: Realistic and controllable garment visualization is critical for fashion e-commerce, where users expect personalized previews under diverse poses and lighting conditions. Existing methods often rely on predefined poses, limiting semantic flexibility and illumination adaptability. To address this, we introduce FashionPose, the first unified text-to-pose-to-relighting generation framework. Given a natural language description, our method first predicts a 2D human pose, then employs a diffusion model to generate high-fidelity person images, and finally applies a lightweight relighting module, all guided by the same textual input. By replacing explicit pose annotations with text-driven conditioning, FashionPose enables accurate pose alignment, faithful garment rendering, and flexible lighting control. Experiments demonstrate fine-grained pose synthesis and efficient, consistent relighting, providing a practical solution for personalized virtual fashion display.

Title: Taming Diffusion Transformer for Real-Time Mobile Video Generation

Authors: Yushu Wu, Yanyu Li, Anil Kag, Ivan Skorokhodov, Willi Menapace, Ke Ma, Arpit Sahni, Ju Hu, Aliaksandr Siarohin, Dhritiman Sagar, Yanzhi Wang, Sergey Tulyakov
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2507.13343
Pdf URL: https://arxiv.org/pdf/2507.13343
Copy Paste: [[2507.13343]] Taming Diffusion Transformer for Real-Time Mobile Video Generation(https://arxiv.org/abs/2507.13343)
Keywords: diffusion
Abstract: Diffusion Transformers (DiT) have shown strong performance in video generation tasks, but their high computational cost makes them impractical for resource-constrained devices like smartphones, and real-time generation is even more challenging. In this work, we propose a series of novel optimizations to significantly accelerate video generation and enable real-time performance on mobile platforms. First, we employ a highly compressed variational autoencoder (VAE) to reduce the dimensionality of the input data without sacrificing visual quality. Second, we introduce a KD-guided, sensitivity-aware tri-level pruning strategy to shrink the model size to suit mobile platform while preserving critical performance characteristics. Third, we develop an adversarial step distillation technique tailored for DiT, which allows us to reduce the number of inference steps to four. Combined, these optimizations enable our model to achieve over 10 frames per second (FPS) generation on an iPhone 16 Pro Max, demonstrating the feasibility of real-time, high-quality video generation on mobile devices.

Title: Diffuman4D: 4D Consistent Human View Synthesis from Sparse-View Videos with Spatio-Temporal Diffusion Models

Authors: Yudong Jin, Sida Peng, Xuan Wang, Tao Xie, Zhen Xu, Yifan Yang, Yujun Shen, Hujun Bao, Xiaowei Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.13344
Pdf URL: https://arxiv.org/pdf/2507.13344
Copy Paste: [[2507.13344]] Diffuman4D: 4D Consistent Human View Synthesis from Sparse-View Videos with Spatio-Temporal Diffusion Models(https://arxiv.org/abs/2507.13344)
Keywords: diffusion
Abstract: This paper addresses the challenge of high-fidelity view synthesis of humans with sparse-view videos as input. Previous methods solve the issue of insufficient observation by leveraging 4D diffusion models to generate videos at novel viewpoints. However, the generated videos from these models often lack spatio-temporal consistency, thus degrading view synthesis quality. In this paper, we propose a novel sliding iterative denoising process to enhance the spatio-temporal consistency of the 4D diffusion model. Specifically, we define a latent grid in which each latent encodes the image, camera pose, and human pose for a certain viewpoint and timestamp, then alternately denoising the latent grid along spatial and temporal dimensions with a sliding window, and finally decode the videos at target viewpoints from the corresponding denoised latents. Through the iterative sliding, information flows sufficiently across the latent grid, allowing the diffusion model to obtain a large receptive field and thus enhance the 4D consistency of the output, while making the GPU memory consumption affordable. The experiments on the DNA-Rendering and ActorsHQ datasets demonstrate that our method is able to synthesize high-quality and consistent novel-view videos and significantly outperforms the existing approaches. See our project page for interactive demos and video results: this https URL .

Title: Hierarchical Rectified Flow Matching with Mini-Batch Couplings

Authors: Yichi Zhang, Yici Yan, Alex Schwing, Zhizhen Zhao
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2507.13350
Pdf URL: https://arxiv.org/pdf/2507.13350
Copy Paste: [[2507.13350]] Hierarchical Rectified Flow Matching with Mini-Batch Couplings(https://arxiv.org/abs/2507.13350)
Keywords: generative
Abstract: Flow matching has emerged as a compelling generative modeling approach that is widely used across domains. To generate data via a flow matching model, an ordinary differential equation (ODE) is numerically solved via forward integration of the modeled velocity field. To better capture the multi-modality that is inherent in typical velocity fields, hierarchical flow matching was recently introduced. It uses a hierarchy of ODEs that are numerically integrated when generating data. This hierarchy of ODEs captures the multi-modal velocity distribution just like vanilla flow matching is capable of modeling a multi-modal data distribution. While this hierarchy enables to model multi-modal velocity distributions, the complexity of the modeled distribution remains identical across levels of the hierarchy. In this paper, we study how to gradually adjust the complexity of the distributions across different levels of the hierarchy via mini-batch couplings. We show the benefits of mini-batch couplings in hierarchical rectified flow matching via compelling results on synthetic and imaging data. Code is available at this https URL.