2025-03-27

Title: Vanishing Depth: A Depth Adapter with Positional Depth Encoding for Generalized Image Encoders

Authors: Paul Koch, Jörg Krüger, Ankit Chowdhury, Oliver Heimann
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.19947
Pdf URL: https://arxiv.org/pdf/2503.19947
Copy Paste: [[2503.19947]] Vanishing Depth: A Depth Adapter with Positional Depth Encoding for Generalized Image Encoders(https://arxiv.org/abs/2503.19947)
Keywords: self-supervised
Abstract: Generalized metric depth understanding is critical for precise vision-guided robotics, which current state-of-the-art (SOTA) vision-encoders do not support. To address this, we propose Vanishing Depth, a self-supervised training approach that extends pretrained RGB encoders to incorporate and align metric depth into their feature embeddings. Based on our novel positional depth encoding, we enable stable depth density and depth distribution invariant feature extraction. We achieve performance improvements and SOTA results across a spectrum of relevant RGBD downstream tasks - without the necessity of finetuning the encoder. Most notably, we achieve 56.05 mIoU on SUN-RGBD segmentation, 88.3 RMSE on Void's depth completion, and 83.8 Top 1 accuracy on NYUv2 scene classification. In 6D-object pose estimation, we outperform our predecessors of DinoV2, EVA-02, and Omnivore and achieve SOTA results for non-finetuned encoders in several related RGBD downstream tasks.

Title: Self-Supervised Learning of Motion Concepts by Optimizing Counterfactuals

Authors: Stefan Stojanov, David Wendt, Seungwoo Kim, Rahul Venkatesh, Kevin Feigelis, Jiajun Wu, Daniel LK Yamins
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19953
Pdf URL: https://arxiv.org/pdf/2503.19953
Copy Paste: [[2503.19953]] Self-Supervised Learning of Motion Concepts by Optimizing Counterfactuals(https://arxiv.org/abs/2503.19953)
Keywords: self-supervised
Abstract: Estimating motion in videos is an essential computer vision problem with many downstream applications, including controllable video generation and robotics. Current solutions are primarily trained using synthetic data or require tuning of situation-specific heuristics, which inherently limits these models' capabilities in real-world contexts. Despite recent developments in large-scale self-supervised learning from videos, leveraging such representations for motion estimation remains relatively underexplored. In this work, we develop Opt-CWM, a self-supervised technique for flow and occlusion estimation from a pre-trained next-frame prediction model. Opt-CWM works by learning to optimize counterfactual probes that extract motion information from a base video model, avoiding the need for fixed heuristics while training on unrestricted video inputs. We achieve state-of-the-art performance for motion estimation on real-world videos while requiring no labeled data.

Title: Experience Replay Addresses Loss of Plasticity in Continual Learning

Authors: Jiuqi Wang, Rohan Chandra, Shangtong Zhang
Subjects: cs.LG, cs.AI, cs.NE
Abstract URL: https://arxiv.org/abs/2503.20018
Pdf URL: https://arxiv.org/pdf/2503.20018
Copy Paste: [[2503.20018]] Experience Replay Addresses Loss of Plasticity in Continual Learning(https://arxiv.org/abs/2503.20018)
Keywords: in-context
Abstract: Loss of plasticity is one of the main challenges in continual learning with deep neural networks, where neural networks trained via backpropagation gradually lose their ability to adapt to new tasks and perform significantly worse than their freshly initialized counterparts. The main contribution of this paper is to propose a new hypothesis that experience replay addresses the loss of plasticity in continual learning. Here, experience replay is a form of memory. We provide supporting evidence for this hypothesis. In particular, we demonstrate in multiple different tasks, including regression, classification, and policy evaluation, that by simply adding an experience replay and processing the data in the experience replay with Transformers, the loss of plasticity disappears. Notably, we do not alter any standard components of deep learning. For example, we do not change backpropagation. We do not modify the activation functions. And we do not use any regularization. We conjecture that experience replay and Transformers can address the loss of plasticity because of the in-context learning phenomenon.

Title: Deep Learning Approaches for Blood Disease Diagnosis Across Hematopoietic Lineages

Authors: Gabriel Bo, Justin Gu, Christopher Sun
Subjects: cs.LG, q-bio.QM
Abstract URL: https://arxiv.org/abs/2503.20049
Pdf URL: https://arxiv.org/pdf/2503.20049
Copy Paste: [[2503.20049]] Deep Learning Approaches for Blood Disease Diagnosis Across Hematopoietic Lineages(https://arxiv.org/abs/2503.20049)
Keywords: foundation model
Abstract: We present a foundation modeling framework that leverages deep learning to uncover latent genetic signatures across the hematopoietic hierarchy. Our approach trains a fully connected autoencoder on multipotent progenitor cells, reducing over 20,000 gene features to a 256-dimensional latent space that captures predictive information for both progenitor and downstream differentiated cells such as monocytes and lymphocytes. We validate the quality of these embeddings by training feed-forward, transformer, and graph convolutional architectures for blood disease diagnosis tasks. We also explore zero-shot prediction using a progenitor disease state classification model to classify downstream cell conditions. Our models achieve greater than 95% accuracy for multi-class classification, and in the zero-shot setting, we achieve greater than 0.7 F1-score on the binary classification task. Future work should improve embeddings further to increase robustness on lymphocyte classification specifically.

Title: Can Multi-modal (reasoning) LLMs work as deepfake detectors?

Authors: Simiao Ren, Yao Yao, Kidus Zewde, Zisheng Liang, Tsang (Dennis)Ng, Ning-Yau Cheng, Xiaoou Zhan, Qinzhe Liu, Yifei Chen, Hengwei Xu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.20084
Pdf URL: https://arxiv.org/pdf/2503.20084
Copy Paste: [[2503.20084]] Can Multi-modal (reasoning) LLMs work as deepfake detectors?(https://arxiv.org/abs/2503.20084)
Keywords: generative
Abstract: Deepfake detection remains a critical challenge in the era of advanced generative models, particularly as synthetic media becomes more sophisticated. In this study, we explore the potential of state of the art multi-modal (reasoning) large language models (LLMs) for deepfake image detection such as (OpenAI O1/4o, Gemini thinking Flash 2, Deepseek Janus, Grok 3, llama 3.2, Qwen 2/2.5 VL, Mistral Pixtral, Claude 3.5/3.7 sonnet) . We benchmark 12 latest multi-modal LLMs against traditional deepfake detection methods across multiple datasets, including recently published real-world deepfake imagery. To enhance performance, we employ prompt tuning and conduct an in-depth analysis of the models' reasoning pathways to identify key contributing factors in their decision-making process. Our findings indicate that best multi-modal LLMs achieve competitive performance with promising generalization ability with zero shot, even surpass traditional deepfake detection pipelines in out-of-distribution datasets while the rest of the LLM families performs extremely disappointing with some worse than random guess. Furthermore, we found newer model version and reasoning capabilities does not contribute to performance in such niche tasks of deepfake detection while model size do help in some cases. This study highlights the potential of integrating multi-modal reasoning in future deepfake detection frameworks and provides insights into model interpretability for robustness in real-world scenarios.

Title: Generative Linguistics, Large Language Models, and the Social Nature of Scientific Success

Authors: Sophie Hao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.20088
Pdf URL: https://arxiv.org/pdf/2503.20088
Copy Paste: [[2503.20088]] Generative Linguistics, Large Language Models, and the Social Nature of Scientific Success(https://arxiv.org/abs/2503.20088)
Keywords: generative
Abstract: Chesi's (forthcoming) target paper depicts a generative linguistics in crisis, foreboded by Piantadosi's (2023) declaration that "modern language models refute Chomsky's approach to language." In order to survive, Chesi warns, generativists must hold themselves to higher standards of formal and empirical rigor. This response argues that the crisis described by Chesi and Piantadosi actually has little to do with rigor, but is rather a reflection of generativists' limited social ambitions. Chesi ties the fate of generative linguistics to its intellectual merits, but the current success of language model research is social in nature as much as it is intellectual. In order to thrive, then, generativists must do more than heed Chesi's call for rigor; they must also expand their ambitions by giving outsiders a stake in their future success.

Title: Extendable Long-Horizon Planning via Hierarchical Multiscale Diffusion

Authors: Chang Chen, Hany Hamed, Doojin Baek, Taegu Kang, Yoshua Bengio, Sungjin Ahn
Subjects: cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2503.20102
Pdf URL: https://arxiv.org/pdf/2503.20102
Copy Paste: [[2503.20102]] Extendable Long-Horizon Planning via Hierarchical Multiscale Diffusion(https://arxiv.org/abs/2503.20102)
Keywords: diffusion
Abstract: This paper tackles a novel problem, extendable long-horizon planning-enabling agents to plan trajectories longer than those in training data without compounding errors. To tackle this, we propose the Hierarchical Multiscale Diffuser (HM-Diffuser) and Progressive Trajectory Extension (PTE), an augmentation method that iteratively generates longer trajectories by stitching shorter ones. HM-Diffuser trains on these extended trajectories using a hierarchical structure, efficiently handling tasks across multiple temporal scales. Additionally, we introduce Adaptive Plan Pondering and the Recursive HM-Diffuser, which consolidate hierarchical layers into a single model to process temporal scales recursively. Experimental results demonstrate the effectiveness of our approach, advancing diffusion-based planners for scalable long-horizon planning.

Title: "Is There Anything Else?'': Examining Administrator Influence on Linguistic Features from the Cookie Theft Picture Description Cognitive Test

Authors: Changye Li, Zhecheng Sheng, Trevor Cohen, Serguei Pakhomov
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.20104
Pdf URL: https://arxiv.org/pdf/2503.20104
Copy Paste: [[2503.20104]] "Is There Anything Else?'': Examining Administrator Influence on Linguistic Features from the Cookie Theft Picture Description Cognitive Test(https://arxiv.org/abs/2503.20104)
Keywords: generative
Abstract: Alzheimer's Disease (AD) dementia is a progressive neurodegenerative disease that negatively impacts patients' cognitive ability. Previous studies have demonstrated that changes in naturalistic language samples can be useful for early screening of AD dementia. However, the nature of language deficits often requires test administrators to use various speech elicitation techniques during spontaneous language assessments to obtain enough propositional utterances from dementia patients. This could lead to the ``observer's effect'' on the downstream analysis that has not been fully investigated. Our study seeks to quantify the influence of test administrators on linguistic features in dementia assessment with two English corpora the ``Cookie Theft'' picture description datasets collected at different locations and test administrators show different levels of administrator involvement. Our results show that the level of test administrator involvement significantly impacts observed linguistic features in patient speech. These results suggest that many of significant linguistic features in the downstream classification task may be partially attributable to differences in the test administration practices rather than solely to participants' cognitive status. The variations in test administrator behavior can lead to systematic biases in linguistic data, potentially confounding research outcomes and clinical assessments. Our study suggests that there is a need for a more standardized test administration protocol in the development of responsible clinical speech analytics frameworks.

Title: AIGC-assisted Federated Learning for Edge Intelligence: Architecture Design, Research Challenges and Future Directions

Authors: Xianke Qiang, Zheng Chang, Ying-Chang Liang
Subjects: cs.LG, cs.DC
Abstract URL: https://arxiv.org/abs/2503.20166
Pdf URL: https://arxiv.org/pdf/2503.20166
Copy Paste: [[2503.20166]] AIGC-assisted Federated Learning for Edge Intelligence: Architecture Design, Research Challenges and Future Directions(https://arxiv.org/abs/2503.20166)
Keywords: diffusion, generative
Abstract: Federated learning (FL) can fully leverage large-scale terminal data while ensuring privacy and security, and is considered as a distributed alternative for the centralized machine learning. However, the issue of data heterogeneity poses limitations on FL's performance. To address this challenge, artificial intelligence-generated content (AIGC) which is an innovative data synthesis technique emerges as one potential solution. In this article, we first provide an overview of the system architecture, performance metrics, and challenges associated with AIGC-assistant FL system design. We then propose the Generative federated learning (GenFL) architecture and present its workflow, including the design of aggregation and weight policy. Finally, using the CIFAR10 and CIFAR100 datasets, we employ diffusion models to generate dataset and improve FL performance. Experiments conducted under various non-independent and identically distributed (non-IID) data distributions demonstrate the effectiveness of GenFL on overcoming the bottlenecks in FL caused by data heterogeneity. Open research directions in the research of AIGC-assisted FL are also discussed.

Title: Guiding Human-Object Interactions with Rich Geometry and Relations

Authors: Mengqing Xue, Yifei Liu, Ling Guo, Shaoli Huang, Changxing Ding
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.20172
Pdf URL: https://arxiv.org/pdf/2503.20172
Copy Paste: [[2503.20172]] Guiding Human-Object Interactions with Rich Geometry and Relations(https://arxiv.org/abs/2503.20172)
Keywords: diffusion
Abstract: Human-object interaction (HOI) synthesis is crucial for creating immersive and realistic experiences for applications such as virtual reality. Existing methods often rely on simplified object representations, such as the object's centroid or the nearest point to a human, to achieve physically plausible motions. However, these approaches may overlook geometric complexity, resulting in suboptimal interaction fidelity. To address this limitation, we introduce ROG, a novel diffusion-based framework that models the spatiotemporal relationships inherent in HOIs with rich geometric detail. For efficient object representation, we select boundary-focused and fine-detail key points from the object mesh, ensuring a comprehensive depiction of the object's geometry. This representation is used to construct an interactive distance field (IDF), capturing the robust HOI dynamics. Furthermore, we develop a diffusion-based relation model that integrates spatial and temporal attention mechanisms, enabling a better understanding of intricate HOI relationships. This relation model refines the generated motion's IDF, guiding the motion generation process to produce relation-aware and semantically aligned movements. Experimental evaluations demonstrate that ROG significantly outperforms state-of-the-art methods in the realism and semantic accuracy of synthesized HOIs.

Title: Offline Reinforcement Learning with Discrete Diffusion Skills

Authors: RuiXi Qiao, Jie Cheng, Xingyuan Dai, Yonglin Tian, Yisheng Lv
Subjects: cs.LG, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2503.20176
Pdf URL: https://arxiv.org/pdf/2503.20176
Copy Paste: [[2503.20176]] Offline Reinforcement Learning with Discrete Diffusion Skills(https://arxiv.org/abs/2503.20176)
Keywords: diffusion
Abstract: Skills have been introduced to offline reinforcement learning (RL) as temporal abstractions to tackle complex, long-horizon tasks, promoting consistent behavior and enabling meaningful exploration. While skills in offline RL are predominantly modeled within a continuous latent space, the potential of discrete skill spaces remains largely underexplored. In this paper, we propose a compact discrete skill space for offline RL tasks supported by state-of-the-art transformer-based encoder and diffusion-based decoder. Coupled with a high-level policy trained via offline RL techniques, our method establishes a hierarchical RL framework where the trained diffusion decoder plays a pivotal role. Empirical evaluations show that the proposed algorithm, Discrete Diffusion Skill (DDS), is a powerful offline RL method. DDS performs competitively on Locomotion and Kitchen tasks and excels on long-horizon tasks, achieving at least a 12 percent improvement on AntMaze-v2 benchmarks compared to existing offline RL approaches. Furthermore, DDS offers improved interpretability, training stability, and online exploration compared to previous skill-based methods.

Title: Cross-Modal Prototype Allocation: Unsupervised Slide Representation Learning via Patch-Text Contrast in Computational Pathology

Authors: Yuxuan Chen, Jiawen Li, Jiali Hu, Xitong Ling, Tian Guan, Anjia Han, Yonghong He
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.20190
Pdf URL: https://arxiv.org/pdf/2503.20190
Copy Paste: [[2503.20190]] Cross-Modal Prototype Allocation: Unsupervised Slide Representation Learning via Patch-Text Contrast in Computational Pathology(https://arxiv.org/abs/2503.20190)
Keywords: foundation model
Abstract: With the rapid advancement of pathology foundation models (FMs), the representation learning of whole slide images (WSIs) attracts increasing attention. Existing studies develop high-quality patch feature extractors and employ carefully designed aggregation schemes to derive slide-level representations. However, mainstream weakly supervised slide representation learning methods, primarily based on multiple instance learning (MIL), are tailored to specific downstream tasks, which limits their generalizability. To address this issue, some studies explore unsupervised slide representation learning. However, these approaches focus solely on the visual modality of patches, neglecting the rich semantic information embedded in textual data. In this work, we propose ProAlign, a cross-modal unsupervised slide representation learning framework. Specifically, we leverage a large language model (LLM) to generate descriptive text for the prototype types present in a WSI, introducing patch-text contrast to construct initial prototype embeddings. Furthermore, we propose a parameter-free attention aggregation strategy that utilizes the similarity between patches and these prototypes to form unsupervised slide embeddings applicable to a wide range of downstream tasks. Extensive experiments on four public datasets show that ProAlign outperforms existing unsupervised frameworks and achieves performance comparable to some weakly supervised models.

Title: Maya: Optimizing Deep Learning Training Workloads using Emulated Virtual Accelerators

Authors: Srihas Yarlagadda, Amey Agrawal, Elton Pinto, Hakesh Darapaneni, Mitali Meratwal, Shivam Mittal, Pranavi Bajjuri, Srinivas Sridharan, Alexey Tumanov
Subjects: cs.LG, cs.DC
Abstract URL: https://arxiv.org/abs/2503.20191
Pdf URL: https://arxiv.org/pdf/2503.20191
Copy Paste: [[2503.20191]] Maya: Optimizing Deep Learning Training Workloads using Emulated Virtual Accelerators(https://arxiv.org/abs/2503.20191)
Keywords: foundation model
Abstract: Training large foundation models costs hundreds of millions of dollars, making deployment optimization critical. Current approaches require machine learning engineers to manually craft training recipes through error-prone trial-and-error on expensive compute clusters. To enable efficient exploration of training configurations, researchers have developed performance modeling systems. However, these systems force users to translate their workloads into custom specification languages, introducing a fundamental semantic gap between the actual workload and its representation. This gap creates an inherent tradeoff: systems must either support a narrow set of workloads to maintain usability, require complex specifications that limit practical adoption, or compromise prediction accuracy with simplified models. We present Maya, a performance modeling system that eliminates these tradeoffs through transparent device emulation. By operating at the narrow interface between training frameworks and accelerator devices, Maya can capture complete workload behavior without requiring code modifications or translations. Maya intercepts device API calls from unmodified training code to directly observe low-level operations, enabling accurate performance prediction while maintaining both ease of use and generality. Our evaluation shows Maya achieves less than 5% prediction error across diverse models and optimization strategies, identifying configurations that reduce training costs by up to 56% compared to existing approaches.

Title: GAPO: Learning Preferential Prompt through Generative Adversarial Policy Optimization

Authors: Zhouhong Gu, Xingzhou Chen, Xiaoran Shi, Tao Wang, Suhang Zheng, Tianyu Li, Hongwei Feng, Yanghua Xiao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.20194
Pdf URL: https://arxiv.org/pdf/2503.20194
Copy Paste: [[2503.20194]] GAPO: Learning Preferential Prompt through Generative Adversarial Policy Optimization(https://arxiv.org/abs/2503.20194)
Keywords: generative
Abstract: Recent advances in large language models have highlighted the critical need for precise control over model outputs through predefined constraints. While existing methods attempt to achieve this through either direct instruction-response synthesis or preferential response optimization, they often struggle with constraint understanding and adaptation. This limitation becomes particularly evident when handling fine-grained constraints, leading to either hallucination or brittle performance. We introduce Generative Adversarial Policy Optimization (GAPO), a novel framework that combines GAN-based training dynamics with an encoder-only reward model to progressively learn and adapt to increasingly complex constraints. GAPO leverages adversarial training to automatically generate training samples of varying difficulty while utilizing the encoder-only architecture to better capture prompt-response relationships. Extensive experiments demonstrate GAPO's superior performance across multiple benchmarks, particularly in scenarios requiring fine-grained constraint handling, where it significantly outperforms existing methods like PPO, DPO, and KTO. Our results suggest that GAPO's unique approach to preferential prompt learning offers a more robust and effective solution for controlling LLM outputs. Code is avaliable in this https URL.

Title: Beyond Words: Advancing Long-Text Image Generation via Multimodal Autoregressive Models

Authors: Alex Jinpeng Wang, Linjie Li, Zhengyuan Yang, Lijuan Wang, Min Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.20198
Pdf URL: https://arxiv.org/pdf/2503.20198
Copy Paste: [[2503.20198]] Beyond Words: Advancing Long-Text Image Generation via Multimodal Autoregressive Models(https://arxiv.org/abs/2503.20198)
Keywords: diffusion, generative
Abstract: Recent advancements in autoregressive and diffusion models have led to strong performance in image generation with short scene text words. However, generating coherent, long-form text in images, such as paragraphs in slides or documents, remains a major challenge for current generative models. We present the first work specifically focused on long text image generation, addressing a critical gap in existing text-to-image systems that typically handle only brief phrases or single sentences. Through comprehensive analysis of state-of-the-art autoregressive generation models, we identify the image tokenizer as a critical bottleneck in text generating quality. To address this, we introduce a novel text-focused, binary tokenizer optimized for capturing detailed scene text features. Leveraging our tokenizer, we develop \ModelName, a multimodal autoregressive model that excels in generating high-quality long-text images with unprecedented fidelity. Our model offers robust controllability, enabling customization of text properties such as font style, size, color, and alignment. Extensive experiments demonstrate that \ModelName~significantly outperforms SD3.5 Large~\cite{sd3} and GPT4o~\cite{gpt4o} with DALL-E 3~\cite{dalle3} in generating long text accurately, consistently, and flexibly. Beyond its technical achievements, \ModelName~opens up exciting opportunities for innovative applications like interleaved document and PowerPoint generation, establishing a new frontier in long-text image generating.

Title: Synthetic-to-Real Self-supervised Robust Depth Estimation via Learning with Motion and Structure Priors

Authors: Weilong Yan, Ming Li, Haipeng Li, Shuwei Shao, Robby T. Tan
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2503.20211
Pdf URL: https://arxiv.org/pdf/2503.20211
Copy Paste: [[2503.20211]] Synthetic-to-Real Self-supervised Robust Depth Estimation via Learning with Motion and Structure Priors(https://arxiv.org/abs/2503.20211)
Keywords: self-supervised
Abstract: Self-supervised depth estimation from monocular cameras in diverse outdoor conditions, such as daytime, rain, and nighttime, is challenging due to the difficulty of learning universal representations and the severe lack of labeled real-world adverse data. Previous methods either rely on synthetic inputs and pseudo-depth labels or directly apply daytime strategies to adverse conditions, resulting in suboptimal results. In this paper, we present the first synthetic-to-real robust depth estimation framework, incorporating motion and structure priors to capture real-world knowledge effectively. In the synthetic adaptation, we transfer motion-structure knowledge inside cost volumes for better robust representation, using a frozen daytime model to train a depth estimator in synthetic adverse conditions. In the innovative real adaptation, which targets to fix synthetic-real gaps, models trained earlier identify the weather-insensitive regions with a designed consistency-reweighting strategy to emphasize valid pseudo-labels. We introduce a new regularization by gathering explicit depth distributions to constrain the model when facing real-world data. Experiments show that our method outperforms the state-of-the-art across diverse conditions in multi-frame and single-frame evaluations. We achieve improvements of 7.5% and 4.3% in AbsRel and RMSE on average for nuScenes and Robotcar datasets (daytime, nighttime, rain). In zero-shot evaluation of DrivingStereo (rain, fog), our method generalizes better than the previous ones.

Title: Video Motion Graphs

Authors: Haiyang Liu, Zhan Xu, Fa-Ting Hong, Hsin-Ping Huang, Yi Zhou, Yang Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.20218
Pdf URL: https://arxiv.org/pdf/2503.20218
Copy Paste: [[2503.20218]] Video Motion Graphs(https://arxiv.org/abs/2503.20218)
Keywords: diffusion, generative
Abstract: We present Video Motion Graphs, a system designed to generate realistic human motion videos. Using a reference video and conditional signals such as music or motion tags, the system synthesizes new videos by first retrieving video clips with gestures matching the conditions and then generating interpolation frames to seamlessly connect clip boundaries. The core of our approach is HMInterp, a robust Video Frame Interpolation (VFI) model that enables seamless interpolation of discontinuous frames, even for complex motion scenarios like dancing. HMInterp i) employs a dual-branch interpolation approach, combining a Motion Diffusion Model for human skeleton motion interpolation with a diffusion-based video frame interpolation model for final frame generation. ii) adopts condition progressive training to effectively leverage identity strong and weak conditions, such as images and pose. These designs ensure both high video texture quality and accurate motion trajectory. Results show that our Video Motion Graphs outperforms existing generative- and retrieval-based methods for multi-modal conditioned human motion video generation. Project page can be found at this https URL

Title: DINeMo: Learning Neural Mesh Models with no 3D Annotations

Authors: Weijie Guo, Guofeng Zhang, Wufei Ma, Alan Yuille
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.20220
Pdf URL: https://arxiv.org/pdf/2503.20220
Copy Paste: [[2503.20220]] DINeMo: Learning Neural Mesh Models with no 3D Annotations(https://arxiv.org/abs/2503.20220)
Keywords: foundation model
Abstract: Category-level 3D/6D pose estimation is a crucial step towards comprehensive 3D scene understanding, which would enable a broad range of applications in robotics and embodied AI. Recent works explored neural mesh models that approach a range of 2D and 3D tasks from an analysis-by-synthesis perspective. Despite the largely enhanced robustness to partial occlusion and domain shifts, these methods depended heavily on 3D annotations for part-contrastive learning, which confines them to a narrow set of categories and hinders efficient scaling. In this work, we present DINeMo, a novel neural mesh model that is trained with no 3D annotations by leveraging pseudo-correspondence obtained from large visual foundation models. We adopt a bidirectional pseudo-correspondence generation method, which produce pseudo correspondence utilize both local appearance features and global context information. Experimental results on car datasets demonstrate that our DINeMo outperforms previous zero- and few-shot 3D pose estimation by a wide margin, narrowing the gap with fully-supervised methods by 67.3%. Our DINeMo also scales effectively and efficiently when incorporating more unlabeled images during training, which demonstrate the advantages over supervised learning methods that rely on 3D annotations. Our project page is available at this https URL.

Title: Unconditional Priors Matter! Improving Conditional Generation of Fine-Tuned Diffusion Models

Authors: Prin Phunyaphibarn, Phillip Y. Lee, Jaihoon Kim, Minhyuk Sung
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.20240
Pdf URL: https://arxiv.org/pdf/2503.20240
Copy Paste: [[2503.20240]] Unconditional Priors Matter! Improving Conditional Generation of Fine-Tuned Diffusion Models(https://arxiv.org/abs/2503.20240)
Keywords: diffusion
Abstract: Classifier-Free Guidance (CFG) is a fundamental technique in training conditional diffusion models. The common practice for CFG-based training is to use a single network to learn both conditional and unconditional noise prediction, with a small dropout rate for conditioning. However, we observe that the joint learning of unconditional noise with limited bandwidth in training results in poor priors for the unconditional case. More importantly, these poor unconditional noise predictions become a serious reason for degrading the quality of conditional generation. Inspired by the fact that most CFG-based conditional models are trained by fine-tuning a base model with better unconditional generation, we first show that simply replacing the unconditional noise in CFG with that predicted by the base model can significantly improve conditional generation. Furthermore, we show that a diffusion model other than the one the fine-tuned model was trained on can be used for unconditional noise replacement. We experimentally verify our claim with a range of CFG-based conditional models for both image and video generation, including Zero-1-to-3, Versatile Diffusion, DiT, DynamiCrafter, and InstructPix2Pix.

Title: LogicQA: Logical Anomaly Detection with Vision Language Model Generated Questions

Authors: Yejin Kwon, Daeun Moon, Youngje Oh, Hyunsoo Yoon
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.20252
Pdf URL: https://arxiv.org/pdf/2503.20252
Copy Paste: [[2503.20252]] LogicQA: Logical Anomaly Detection with Vision Language Model Generated Questions(https://arxiv.org/abs/2503.20252)
Keywords: anomaly
Abstract: Anomaly Detection (AD) focuses on detecting samples that differ from the standard pattern, making it a vital tool in process control. Logical anomalies may appear visually normal yet violate predefined constraints on object presence, arrangement, or quantity, depending on reasoning and explainability. We introduce LogicQA, a framework that enhances AD by providing industrial operators with explanations for logical anomalies. LogicQA compiles automatically generated questions into a checklist and collects responses to identify violations of logical constraints. LogicQA is training-free, annotation-free, and operates in a few-shot setting. We achieve state-of-the-art (SOTA) Logical AD performance on public benchmarks, MVTec LOCO AD, with an AUROC of 87.6 percent and an F1-max of 87.0 percent along with the explanations of anomalies. Also, our approach has shown outstanding performance on semiconductor SEM corporate data, further validating its effectiveness in industrial applications.

Title: Mamba-3D as Masked Autoencoders for Accurate and Data-Efficient Analysis of Medical Ultrasound Videos

Authors: Jiaheng Zhou, Yanfeng Zhou, Wei Fang, Yuxing Tang, Le Lu, Ge Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.20258
Pdf URL: https://arxiv.org/pdf/2503.20258
Copy Paste: [[2503.20258]] Mamba-3D as Masked Autoencoders for Accurate and Data-Efficient Analysis of Medical Ultrasound Videos(https://arxiv.org/abs/2503.20258)
Keywords: self-supervised
Abstract: Ultrasound videos are an important form of clinical imaging data, and deep learning-based automated analysis can improve diagnostic accuracy and clinical efficiency. However, the scarcity of labeled data and the inherent challenges of video analysis have impeded the advancement of related methods. In this work, we introduce E-ViM$^3$, a data-efficient Vision Mamba network that preserves the 3D structure of video data, enhancing long-range dependencies and inductive biases to better model space-time correlations. With our design of Enclosure Global Tokens (EGT), the model captures and aggregates global features more effectively than competing methods. To further improve data efficiency, we employ masked video modeling for self-supervised pre-training, with the proposed Spatial-Temporal Chained (STC) masking strategy designed to adapt to various video scenarios. Experiments demonstrate that E-ViM$^3$ performs as the state-of-the-art in two high-level semantic analysis tasks across four datasets of varying sizes: EchoNet-Dynamic, CAMUS, MICCAI-BUV, and WHBUS. Furthermore, our model achieves competitive performance with limited labels, highlighting its potential impact on real-world clinical applications.

Title: EGVD: Event-Guided Video Diffusion Model for Physically Realistic Large-Motion Frame Interpolation

Authors: Ziran Zhang, Xiaohui Li, Yihao Liu, Yujin Wang, Yueting Chen, Tianfan Xue, Shi Guo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.20268
Pdf URL: https://arxiv.org/pdf/2503.20268
Copy Paste: [[2503.20268]] EGVD: Event-Guided Video Diffusion Model for Physically Realistic Large-Motion Frame Interpolation(https://arxiv.org/abs/2503.20268)
Keywords: diffusion
Abstract: Video frame interpolation (VFI) in scenarios with large motion remains challenging due to motion ambiguity between frames. While event cameras can capture high temporal resolution motion information, existing event-based VFI methods struggle with limited training data and complex motion patterns. In this paper, we introduce Event-Guided Video Diffusion Model (EGVD), a novel framework that leverages the powerful priors of pre-trained stable video diffusion models alongside the precise temporal information from event cameras. Our approach features a Multi-modal Motion Condition Generator (MMCG) that effectively integrates RGB frames and event signals to guide the diffusion process, producing physically realistic intermediate frames. We employ a selective fine-tuning strategy that preserves spatial modeling capabilities while efficiently incorporating event-guided temporal information. We incorporate input-output normalization techniques inspired by recent advances in diffusion modeling to enhance training stability across varying noise levels. To improve generalization, we construct a comprehensive dataset combining both real and simulated event data across diverse scenarios. Extensive experiments on both real and simulated datasets demonstrate that EGVD significantly outperforms existing methods in handling large motion and challenging lighting conditions, achieving substantial improvements in perceptual quality metrics (27.4% better LPIPS on Prophesee and 24.1% on BSRGB) while maintaining competitive fidelity measures. Code and datasets available at: this https URL.

Title: Faster Parameter-Efficient Tuning with Token Redundancy Reduction

Authors: Kwonyoung Kim, Jungin Park, Jin Kim, Hyeongjun Kwon, Kwanghoon Sohn
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.20282
Pdf URL: https://arxiv.org/pdf/2503.20282
Copy Paste: [[2503.20282]] Faster Parameter-Efficient Tuning with Token Redundancy Reduction(https://arxiv.org/abs/2503.20282)
Keywords: foundation model
Abstract: Parameter-efficient tuning (PET) aims to transfer pre-trained foundation models to downstream tasks by learning a small number of parameters. Compared to traditional fine-tuning, which updates the entire model, PET significantly reduces storage and transfer costs for each task regardless of exponentially increasing pre-trained model capacity. However, most PET methods inherit the inference latency of their large backbone models and often introduce additional computational overhead due to additional modules (e.g. adapters), limiting their practicality for compute-intensive applications. In this paper, we propose Faster Parameter-Efficient Tuning (FPET), a novel approach that enhances inference speed and training efficiency while maintaining high storage efficiency. Specifically, we introduce a plug-and-play token redundancy reduction module delicately designed for PET. This module refines tokens from the self-attention layer using an adapter to learn the accurate similarity between tokens and cuts off the tokens through a fully-differentiable token merging strategy, which uses a straight-through estimator for optimal token reduction. Experimental results prove that our FPET achieves faster inference and higher memory efficiency than the pre-trained backbone while keeping competitive performance on par with state-of-the-art PET methods.

Title: RelTriple: Learning Plausible Indoor Layouts by Integrating Relationship Triples into the Diffusion Process

Authors: Kaifan Sun, Bingchen Yang, Peter Wonka, Jun Xiao, Haiyong Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.20289
Pdf URL: https://arxiv.org/pdf/2503.20289
Copy Paste: [[2503.20289]] RelTriple: Learning Plausible Indoor Layouts by Integrating Relationship Triples into the Diffusion Process(https://arxiv.org/abs/2503.20289)
Keywords: diffusion, generative
Abstract: The generation of indoor furniture layouts has significant applications in augmented reality, smart homes, and architectural design. Successful furniture arrangement requires proper physical relationships (e.g., collision avoidance) and spacing relationships between furniture and their functional zones to be respected. However, manually defined relationships are almost always incomplete and can produce unrealistic layouts. This work instead extracts spacing relationships automatically based on a hierarchical analysis and adopts the Delaunay Triangulation to produce important triple relationships. Compared to pairwise relationship modeling, triple relationships account for interactions and space utilization among multiple objects. To this end, we introduce RelTriple, a novel approach that enhances furniture distribution by learning spacing relationships between objects and regions. We formulate triple relationships as object-to-object (O2O) losses and object-to-region (O2R) losses and integrate them directly into the training process of generative diffusion. Our approach consistently improves over existing state-of-the-art methods in visual results evaluation metrics on unconditional layout generation, floorplan-conditioned layout generation, and scene rearrangement, achieving at least 12% on the introduced spatial relationship metric and superior spatial coherence and practical usability.

Title: Traversing Distortion-Perception Tradeoff using a Single Score-Based Generative Model

Authors: Yuhan Wang, Suzhi Bi, Ying-Jun Angela Zhang, Xiaojun Yuan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.20297
Pdf URL: https://arxiv.org/pdf/2503.20297
Copy Paste: [[2503.20297]] Traversing Distortion-Perception Tradeoff using a Single Score-Based Generative Model(https://arxiv.org/abs/2503.20297)
Keywords: diffusion, generative
Abstract: The distortion-perception (DP) tradeoff reveals a fundamental conflict between distortion metrics (e.g., MSE and PSNR) and perceptual quality. Recent research has increasingly concentrated on evaluating denoising algorithms within the DP framework. However, existing algorithms either prioritize perceptual quality by sacrificing acceptable distortion, or focus on minimizing MSE for faithful restoration. When the goal shifts or noisy measurements vary, adapting to different points on the DP plane needs retraining or even re-designing the model. Inspired by recent advances in solving inverse problems using score-based generative models, we explore the potential of flexibly and optimally traversing DP tradeoffs using a single pre-trained score-based model. Specifically, we introduce a variance-scaled reverse diffusion process and theoretically characterize the marginal distribution. We then prove that the proposed sample process is an optimal solution to the DP tradeoff for conditional Gaussian distribution. Experimental results on two-dimensional and image datasets illustrate that a single score network can effectively and flexibly traverse the DP tradeoff for general denoising problems.

Title: Wan: Open and Advanced Large-Scale Video Generative Models

Authors: WanTeam: Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fang, Tianxing Wang, Tianyi Gui, Tingyu Weng, Tong Shen, Wei Lin, Wei Wang, Wei Wang, Wenmeng Zhou, Wente Wang, Wenting Shen, Wenyuan Yu, Xianzhong Shi, Xiaoming Huang, Xin Xu, Yan Kou, Yangyu Lv, Yifei Li, Yijing Liu, Yiming Wang, Yingya Zhang, Yitong Huang, Yong Li, You Wu, Yu Liu, Yulin Pan, Yun Zheng, Yuntao Hong, Yupeng Shi, Yutong Feng, Zeyinzi Jiang, Zhen Han, Zhi-Fan Wu, Ziyu Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.20314
Pdf URL: https://arxiv.org/pdf/2503.20314
Copy Paste: [[2503.20314]] Wan: Open and Advanced Large-Scale Video Generative Models(https://arxiv.org/abs/2503.20314)
Keywords: diffusion, foundation model, generative
Abstract: This report presents Wan, a comprehensive and open suite of video foundation models designed to push the boundaries of video generation. Built upon the mainstream diffusion transformer paradigm, Wan achieves significant advancements in generative capabilities through a series of innovations, including our novel VAE, scalable pre-training strategies, large-scale data curation, and automated evaluation metrics. These contributions collectively enhance the model's performance and versatility. Specifically, Wan is characterized by four key features: Leading Performance: The 14B model of Wan, trained on a vast dataset comprising billions of images and videos, demonstrates the scaling laws of video generation with respect to both data and model size. It consistently outperforms the existing open-source models as well as state-of-the-art commercial solutions across multiple internal and external benchmarks, demonstrating a clear and significant performance superiority. Comprehensiveness: Wan offers two capable models, i.e., 1.3B and 14B parameters, for efficiency and effectiveness respectively. It also covers multiple downstream applications, including image-to-video, instruction-guided video editing, and personal video generation, encompassing up to eight tasks. Consumer-Grade Efficiency: The 1.3B model demonstrates exceptional resource efficiency, requiring only 8.19 GB VRAM, making it compatible with a wide range of consumer-grade GPUs. Openness: We open-source the entire series of Wan, including source code and all models, with the goal of fostering the growth of the video generation community. This openness seeks to significantly expand the creative possibilities of video production in the industry and provide academia with high-quality video foundation models. All the code and models are available at this https URL.

Title: VideoGEM: Training-free Action Grounding in Videos

Authors: Felix Vogel, Walid Bousselham, Anna Kukleva, Nina Shvetsova, Hilde Kuehne
Subjects: cs.CV, cs.AI, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2503.20348
Pdf URL: https://arxiv.org/pdf/2503.20348
Copy Paste: [[2503.20348]] VideoGEM: Training-free Action Grounding in Videos(https://arxiv.org/abs/2503.20348)
Keywords: foundation model
Abstract: Vision-language foundation models have shown impressive capabilities across various zero-shot tasks, including training-free localization and grounding, primarily focusing on localizing objects in images. However, leveraging those capabilities to localize actions and events in videos is challenging, as actions have less physical outline and are usually described by higher-level concepts. In this work, we propose VideoGEM, the first training-free spatial action grounding method based on pretrained image- and video-language backbones. Namely, we adapt the self-self attention formulation of GEM to spatial activity grounding. We observe that high-level semantic concepts, such as actions, usually emerge in the higher layers of the image- and video-language models. We, therefore, propose a layer weighting in the self-attention path to prioritize higher layers. Additionally, we introduce a dynamic weighting method to automatically tune layer weights to capture each layer`s relevance to a specific prompt. Finally, we introduce a prompt decomposition, processing action, verb, and object prompts separately, resulting in a better spatial localization of actions. We evaluate the proposed approach on three image- and video-language backbones, CLIP, OpenCLIP, and ViCLIP, and on four video grounding datasets, V-HICO, DALY, YouCook-Interactions, and GroundingYouTube, showing that the proposed training-free approach is able to outperform current trained state-of-the-art approaches for spatial video grounding.

Title: Consistency Trajectory Matching for One-Step Generative Super-Resolution

Authors: Weiyi You, Mingyang Zhang, Leheng Zhang, Kexuan Shi, Xingyu Zhou, Shuhang Gu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.20349
Pdf URL: https://arxiv.org/pdf/2503.20349
Copy Paste: [[2503.20349]] Consistency Trajectory Matching for One-Step Generative Super-Resolution(https://arxiv.org/abs/2503.20349)
Keywords: diffusion, generative
Abstract: Current diffusion-based super-resolution (SR) approaches achieve commendable performance at the cost of high inference overhead. Therefore, distillation techniques are utilized to accelerate the multi-step teacher model into one-step student model. Nevertheless, these methods significantly raise training costs and constrain the performance of the student model by the teacher model. To overcome these tough challenges, we propose Consistency Trajectory Matching for Super-Resolution (CTMSR), a distillation-free strategy that is able to generate photo-realistic SR results in one step. Concretely, we first formulate a Probability Flow Ordinary Differential Equation (PF-ODE) trajectory to establish a deterministic mapping from low-resolution (LR) images with noise to high-resolution (HR) images. Then we apply the Consistency Training (CT) strategy to directly learn the mapping in one step, eliminating the necessity of pre-trained diffusion model. To further enhance the performance and better leverage the ground-truth during the training process, we aim to align the distribution of SR results more closely with that of the natural images. To this end, we propose to minimize the discrepancy between their respective PF-ODE trajectories from the LR image distribution by our meticulously designed Distribution Trajectory Matching (DTM) loss, resulting in improved realism of our recovered HR images. Comprehensive experimental results demonstrate that the proposed methods can attain comparable or even superior capabilities on both synthetic and real datasets while maintaining minimal inference latency.

Title: CNN+Transformer Based Anomaly Traffic Detection in UAV Networks for Emergency Rescue

Authors: Yulu Han, Ziye Jia, Sijie He, Yu Zhang, Qihui Wu
Subjects: cs.LG, cs.NI
Abstract URL: https://arxiv.org/abs/2503.20355
Pdf URL: https://arxiv.org/pdf/2503.20355
Copy Paste: [[2503.20355]] CNN+Transformer Based Anomaly Traffic Detection in UAV Networks for Emergency Rescue(https://arxiv.org/abs/2503.20355)
Keywords: anomaly
Abstract: The unmanned aerial vehicle (UAV) network has gained significant attentions in recent years due to its various applications. However, the traffic security becomes the key threatening public safety issue in an emergency rescue system due to the increasing vulnerability of UAVs to cyber attacks in environments with high heterogeneities. Hence, in this paper, we propose a novel anomaly traffic detection architecture for UAV networks based on the software-defined networking (SDN) framework and blockchain technology. Specifically, SDN separates the control and data plane to enhance the network manageability and security. Meanwhile, the blockchain provides decentralized identity authentication and data security records. Beisdes, a complete security architecture requires an effective mechanism to detect the time-series based abnormal traffic. Thus, an integrated algorithm combining convolutional neural networks (CNNs) and Transformer (CNN+Transformer) for anomaly traffic detection is developed, which is called CTranATD. Finally, the simulation results show that the proposed CTranATD algorithm is effective and outperforms the individual CNN, Transformer, and LSTM algorithms for detecting anomaly traffic.

Title: FastFT: Accelerating Reinforced Feature Transformation via Advanced Exploration Strategies

Authors: Tianqi He, Xiaohan Huang, Yi Du, Qingqing Long, Ziyue Qiao, Min Wu, Yanjie Fu, Yuanchun Zhou, Meng Xiao
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.20394
Pdf URL: https://arxiv.org/pdf/2503.20394
Copy Paste: [[2503.20394]] FastFT: Accelerating Reinforced Feature Transformation via Advanced Exploration Strategies(https://arxiv.org/abs/2503.20394)
Keywords: generative
Abstract: Feature Transformation is crucial for classic machine learning that aims to generate feature combinations to enhance the performance of downstream tasks from a data-centric perspective. Current methodologies, such as manual expert-driven processes, iterative-feedback techniques, and exploration-generative tactics, have shown promise in automating such data engineering workflow by minimizing human involvement. However, three challenges remain in those frameworks: (1) It predominantly depends on downstream task performance metrics, as assessment is time-consuming, especially for large datasets. (2) The diversity of feature combinations will hardly be guaranteed after random exploration ends. (3) Rare significant transformations lead to sparse valuable feedback that hinders the learning processes or leads to less effective results. In response to these challenges, we introduce FastFT, an innovative framework that leverages a trio of advanced this http URL first decouple the feature transformation evaluation from the outcomes of the generated datasets via the performance predictor. To address the issue of reward sparsity, we developed a method to evaluate the novelty of generated transformation sequences. Incorporating this novelty into the reward function accelerates the model's exploration of effective transformations, thereby improving the search productivity. Additionally, we combine novelty and performance to create a prioritized memory buffer, ensuring that essential experiences are effectively revisited during exploration. Our extensive experimental evaluations validate the performance, efficiency, and traceability of our proposed framework, showcasing its superiority in handling complex feature transformation tasks.

Title: ITA-MDT: Image-Timestep-Adaptive Masked Diffusion Transformer Framework for Image-Based Virtual Try-On

Authors: Ji Woo Hong, Tri Ton, Trung X. Pham, Gwanhyeong Koo, Sunjae Yoon, Chang D. Yoo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.20418
Pdf URL: https://arxiv.org/pdf/2503.20418
Copy Paste: [[2503.20418]] ITA-MDT: Image-Timestep-Adaptive Masked Diffusion Transformer Framework for Image-Based Virtual Try-On(https://arxiv.org/abs/2503.20418)
Keywords: diffusion
Abstract: This paper introduces ITA-MDT, the Image-Timestep-Adaptive Masked Diffusion Transformer Framework for Image-Based Virtual Try-On (IVTON), designed to overcome the limitations of previous approaches by leveraging the Masked Diffusion Transformer (MDT) for improved handling of both global garment context and fine-grained details. The IVTON task involves seamlessly superimposing a garment from one image onto a person in another, creating a realistic depiction of the person wearing the specified garment. Unlike conventional diffusion-based virtual try-on models that depend on large pre-trained U-Net architectures, ITA-MDT leverages a lightweight, scalable transformer-based denoising diffusion model with a mask latent modeling scheme, achieving competitive results while reducing computational overhead. A key component of ITA-MDT is the Image-Timestep Adaptive Feature Aggregator (ITAFA), a dynamic feature aggregator that combines all of the features from the image encoder into a unified feature of the same size, guided by diffusion timestep and garment image complexity. This enables adaptive weighting of features, allowing the model to emphasize either global information or fine-grained details based on the requirements of the denoising stage. Additionally, the Salient Region Extractor (SRE) module is presented to identify complex region of the garment to provide high-resolution local information to the denoising model as an additional condition alongside the global information of the full garment image. This targeted conditioning strategy enhances detail preservation of fine details in highly salient garment regions, optimizing computational resources by avoiding unnecessarily processing entire garment image. Comparative evaluations confirms that ITA-MDT improves efficiency while maintaining strong performance, reaching state-of-the-art results in several metrics.

Title: Latent Beam Diffusion Models for Decoding Image Sequences

Authors: Guilherme Fernandes, Vasco Ramos, Regev Cohen, Idan Szpektor, João Magalhães
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.20429
Pdf URL: https://arxiv.org/pdf/2503.20429
Copy Paste: [[2503.20429]] Latent Beam Diffusion Models for Decoding Image Sequences(https://arxiv.org/abs/2503.20429)
Keywords: diffusion
Abstract: While diffusion models excel at generating high-quality images from text prompts, they struggle with visual consistency in image sequences. Existing methods generate each image independently, leading to disjointed narratives - a challenge further exacerbated in non-linear storytelling, where scenes must connect beyond adjacent frames. We introduce a novel beam search strategy for latent space exploration, enabling conditional generation of full image sequences with beam search decoding. Unlike prior approaches that use fixed latent priors, our method dynamically searches for an optimal sequence of latent representations, ensuring coherent visual transitions. To address beam search's quadratic complexity, we integrate a cross-attention mechanism that efficiently scores search paths and enables pruning, prioritizing alignment with both textual prompts and visual context. Human evaluations confirm that our approach outperforms baseline methods, producing full sequences with superior coherence, visual continuity, and textual alignment. By bridging advances in search optimization and latent space refinement, this work sets a new standard for structured image sequence generation.

Title: Dissecting and Mitigating Diffusion Bias via Mechanistic Interpretability

Authors: Yingdong Shi, Changming Li, Yifan Wang, Yongxiang Zhao, Anqi Pang, Sibei Yang, Jingyi Yu, Kan Ren
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.20483
Pdf URL: https://arxiv.org/pdf/2503.20483
Copy Paste: [[2503.20483]] Dissecting and Mitigating Diffusion Bias via Mechanistic Interpretability(https://arxiv.org/abs/2503.20483)
Keywords: diffusion
Abstract: Diffusion models have demonstrated impressive capabilities in synthesizing diverse content. However, despite their high-quality outputs, these models often perpetuate social biases, including those related to gender and race. These biases can potentially contribute to harmful real-world consequences, reinforcing stereotypes and exacerbating inequalities in various social contexts. While existing research on diffusion bias mitigation has predominantly focused on guiding content generation, it often neglects the intrinsic mechanisms within diffusion models that causally drive biased outputs. In this paper, we investigate the internal processes of diffusion models, identifying specific decision-making mechanisms, termed bias features, embedded within the model architecture. By directly manipulating these features, our method precisely isolates and adjusts the elements responsible for bias generation, permitting granular control over the bias levels in the generated content. Through experiments on both unconditional and conditional diffusion models across various social bias attributes, we demonstrate our method's efficacy in managing generation distribution while preserving image quality. We also dissect the discovered model mechanism, revealing different intrinsic features controlling fine-grained aspects of generation, boosting further research on mechanistic interpretability of diffusion models.

Title: Contrastive Learning Guided Latent Diffusion Model for Image-to-Image Translation

Authors: Qi Si, Bo Wang, Zhao Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.20484
Pdf URL: https://arxiv.org/pdf/2503.20484
Copy Paste: [[2503.20484]] Contrastive Learning Guided Latent Diffusion Model for Image-to-Image Translation(https://arxiv.org/abs/2503.20484)
Keywords: diffusion
Abstract: The diffusion model has demonstrated superior performance in synthesizing diverse and high-quality images for text-guided image translation. However, there remains room for improvement in both the formulation of text prompts and the preservation of reference image content. First, variations in target text prompts can significantly influence the quality of the generated images, and it is often challenging for users to craft an optimal prompt that fully captures the content of the input image. Second, while existing models can introduce desired modifications to specific regions of the reference image, they frequently induce unintended alterations in areas that should remain unchanged. To address these challenges, we propose pix2pix-zeroCon, a zero-shot diffusion-based method that eliminates the need for additional training by leveraging patch-wise contrastive loss. Specifically, we automatically determine the editing direction in the text embedding space based on the reference image and target prompts. Furthermore, to ensure precise content and structural preservation in the edited image, we introduce cross-attention guiding loss and patch-wise contrastive loss between the generated and original image embeddings within a pre-trained diffusion model. Notably, our approach requires no additional training and operates directly on a pre-trained text-to-image diffusion model. Extensive experiments demonstrate that our method surpasses existing models in image-to-image translation, achieving enhanced fidelity and controllability.

Title: VPO: Aligning Text-to-Video Generation Models with Prompt Optimization

Authors: Jiale Cheng, Ruiliang Lyu, Xiaotao Gu, Xiao Liu, Jiazheng Xu, Yida Lu, Jiayan Teng, Zhuoyi Yang, Yuxiao Dong, Jie Tang, Hongning Wang, Minlie Huang
Subjects: cs.CV, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2503.20491
Pdf URL: https://arxiv.org/pdf/2503.20491
Copy Paste: [[2503.20491]] VPO: Aligning Text-to-Video Generation Models with Prompt Optimization(https://arxiv.org/abs/2503.20491)
Keywords: in-context
Abstract: Video generation models have achieved remarkable progress in text-to-video tasks. These models are typically trained on text-video pairs with highly detailed and carefully crafted descriptions, while real-world user inputs during inference are often concise, vague, or poorly structured. This gap makes prompt optimization crucial for generating high-quality videos. Current methods often rely on large language models (LLMs) to refine prompts through in-context learning, but suffer from several limitations: they may distort user intent, omit critical details, or introduce safety risks. Moreover, they optimize prompts without considering the impact on the final video quality, which can lead to suboptimal results. To address these issues, we introduce VPO, a principled framework that optimizes prompts based on three core principles: harmlessness, accuracy, and helpfulness. The generated prompts faithfully preserve user intents and, more importantly, enhance the safety and quality of generated videos. To achieve this, VPO employs a two-stage optimization approach. First, we construct and refine a supervised fine-tuning (SFT) dataset based on principles of safety and alignment. Second, we introduce both text-level and video-level feedback to further optimize the SFT model with preference learning. Our extensive experiments demonstrate that VPO significantly improves safety, alignment, and video quality compared to baseline methods. Moreover, VPO shows strong generalization across video generation models. Furthermore, we demonstrate that VPO could outperform and be combined with RLHF methods on video generation models, underscoring the effectiveness of VPO in aligning video generation models. Our code and data are publicly available at this https URL.

Title: Small Object Detection: A Comprehensive Survey on Challenges, Techniques and Real-World Applications

Authors: Mahya Nikouei, Bita Baroutian, Shahabedin Nabavi, Fateme Taraghi, Atefe Aghaei, Ayoob Sajedi, Mohsen Ebrahimi Moghaddam
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.20516
Pdf URL: https://arxiv.org/pdf/2503.20516
Copy Paste: [[2503.20516]] Small Object Detection: A Comprehensive Survey on Challenges, Techniques and Real-World Applications(https://arxiv.org/abs/2503.20516)
Keywords: self-supervised
Abstract: Small object detection (SOD) is a critical yet challenging task in computer vision, with applications like spanning surveillance, autonomous systems, medical imaging, and remote sensing. Unlike larger objects, small objects contain limited spatial and contextual information, making accurate detection difficult. Challenges such as low resolution, occlusion, background interference, and class imbalance further complicate the problem. This survey provides a comprehensive review of recent advancements in SOD using deep learning, focusing on articles published in Q1 journals during 2024-2025. We analyzed challenges, state-of-the-art techniques, datasets, evaluation metrics, and real-world applications. Recent advancements in deep learning have introduced innovative solutions, including multi-scale feature extraction, Super-Resolution (SR) techniques, attention mechanisms, and transformer-based architectures. Additionally, improvements in data augmentation, synthetic data generation, and transfer learning have addressed data scarcity and domain adaptation issues. Furthermore, emerging trends such as lightweight neural networks, knowledge distillation (KD), and self-supervised learning offer promising directions for improving detection efficiency, particularly in resource-constrained environments like Unmanned Aerial Vehicles (UAV)-based surveillance and edge computing. We also review widely used datasets, along with standard evaluation metrics such as mean Average Precision (mAP) and size-specific AP scores. The survey highlights real-world applications, including traffic monitoring, maritime surveillance, industrial defect detection, and precision agriculture. Finally, we discuss open research challenges and future directions, emphasizing the need for robust domain adaptation techniques, better feature fusion strategies, and real-time performance optimization.

Title: MAR-3D: Progressive Masked Auto-regressor for High-Resolution 3D Generation

Authors: Jinnan Chen, Lingting Zhu, Zeyu Hu, Shengju Qian, Yugang Chen, Xin Wang, Gim Hee Lee
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.20519
Pdf URL: https://arxiv.org/pdf/2503.20519
Copy Paste: [[2503.20519]] MAR-3D: Progressive Masked Auto-regressor for High-Resolution 3D Generation(https://arxiv.org/abs/2503.20519)
Keywords: diffusion, generative
Abstract: Recent advances in auto-regressive transformers have revolutionized generative modeling across different domains, from language processing to visual generation, demonstrating remarkable capabilities. However, applying these advances to 3D generation presents three key challenges: the unordered nature of 3D data conflicts with sequential next-token prediction paradigm, conventional vector quantization approaches incur substantial compression loss when applied to 3D meshes, and the lack of efficient scaling strategies for higher resolution latent prediction. To address these challenges, we introduce MAR-3D, which integrates a pyramid variational autoencoder with a cascaded masked auto-regressive transformer (Cascaded MAR) for progressive latent upscaling in the continuous space. Our architecture employs random masking during training and auto-regressive denoising in random order during inference, naturally accommodating the unordered property of 3D latent tokens. Additionally, we propose a cascaded training strategy with condition augmentation that enables efficiently up-scale the latent token resolution with fast convergence. Extensive experiments demonstrate that MAR-3D not only achieves superior performance and generalization capabilities compared to existing methods but also exhibits enhanced scaling capabilities compared to joint distribution modeling approaches (e.g., diffusion transformers).

Title: GAIA-2: A Controllable Multi-View Generative World Model for Autonomous Driving

Authors: Lloyd Russell, Anthony Hu, Lorenzo Bertoni, George Fedoseev, Jamie Shotton, Elahe Arani, Gianluca Corrado
Subjects: cs.CV, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2503.20523
Pdf URL: https://arxiv.org/pdf/2503.20523
Copy Paste: [[2503.20523]] GAIA-2: A Controllable Multi-View Generative World Model for Autonomous Driving(https://arxiv.org/abs/2503.20523)
Keywords: diffusion, generative
Abstract: Generative models offer a scalable and flexible paradigm for simulating complex environments, yet current approaches fall short in addressing the domain-specific requirements of autonomous driving - such as multi-agent interactions, fine-grained control, and multi-camera consistency. We introduce GAIA-2, Generative AI for Autonomy, a latent diffusion world model that unifies these capabilities within a single generative framework. GAIA-2 supports controllable video generation conditioned on a rich set of structured inputs: ego-vehicle dynamics, agent configurations, environmental factors, and road semantics. It generates high-resolution, spatiotemporally consistent multi-camera videos across geographically diverse driving environments (UK, US, Germany). The model integrates both structured conditioning and external latent embeddings (e.g., from a proprietary driving model) to facilitate flexible and semantically grounded scene synthesis. Through this integration, GAIA-2 enables scalable simulation of both common and rare driving scenarios, advancing the use of generative world models as a core tool in the development of autonomous systems. Videos are available at this https URL.

Title: TD-BFR: Truncated Diffusion Model for Efficient Blind Face Restoration

Authors: Ziying Zhang, Xiang Gao, Zhixin Wang, Qiang hu, Xiaoyun Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.20537
Pdf URL: https://arxiv.org/pdf/2503.20537
Copy Paste: [[2503.20537]] TD-BFR: Truncated Diffusion Model for Efficient Blind Face Restoration(https://arxiv.org/abs/2503.20537)
Keywords: diffusion, generative
Abstract: Diffusion-based methodologies have shown significant potential in blind face restoration (BFR), leveraging their robust generative capabilities. However, they are often criticized for two significant problems: 1) slow training and inference speed, and 2) inadequate recovery of fine-grained facial details. To address these problems, we propose a novel Truncated Diffusion model for efficient Blind Face Restoration (TD-BFR), a three-stage paradigm tailored for the progressive resolution of degraded images. Specifically, TD-BFR utilizes an innovative truncated sampling method, starting from low-quality (LQ) images at low resolution to enhance sampling speed, and then introduces an adaptive degradation removal module to handle unknown degradations and connect the generation processes across different resolutions. Additionally, we further adapt the priors of pre-trained diffusion models to recover rich facial details. Our method efficiently restores high-quality images in a coarse-to-fine manner and experimental results demonstrate that TD-BFR is, on average, \textbf{4.75$\times$} faster than current state-of-the-art diffusion-based BFR methods while maintaining competitive quality.

Title: TerraTorch: The Geospatial Foundation Models Toolkit

Authors: Carlos Gomes, Benedikt Blumenstiel, Joao Lucas de Sousa Almeida, Pedro Henrique de Oliveira, Paolo Fraccaro, Francesc Marti Escofet, Daniela Szwarcman, Naomi Simumba, Romeo Kienzler, Bianca Zadrozny
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.20563
Pdf URL: https://arxiv.org/pdf/2503.20563
Copy Paste: [[2503.20563]] TerraTorch: The Geospatial Foundation Models Toolkit(https://arxiv.org/abs/2503.20563)
Keywords: foundation model
Abstract: TerraTorch is a fine-tuning and benchmarking toolkit for Geospatial Foundation Models built on PyTorch Lightning and tailored for satellite, weather, and climate data. It integrates domain-specific data modules, pre-defined tasks, and a modular model factory that pairs any backbone with diverse decoder heads. These components allow researchers and practitioners to fine-tune supported models in a no-code fashion by simply editing a training configuration. By consolidating best practices for model development and incorporating the automated hyperparameter optimization extension Iterate, TerraTorch reduces the expertise and time required to fine-tune or benchmark models on new Earth Observation use cases. Furthermore, TerraTorch directly integrates with GEO-Bench, allowing for systematic and reproducible benchmarking of Geospatial Foundation Models. TerraTorch is open sourced under Apache 2.0, available at this https URL, and can be installed via pip install terratorch.

Title: Diffusion Counterfactuals for Image Regressors

Authors: Trung Duc Ha, Sidney Bender
Subjects: cs.LG, cs.CV, stat.ML
Abstract URL: https://arxiv.org/abs/2503.20595
Pdf URL: https://arxiv.org/pdf/2503.20595
Copy Paste: [[2503.20595]] Diffusion Counterfactuals for Image Regressors(https://arxiv.org/abs/2503.20595)
Keywords: diffusion, generative
Abstract: Counterfactual explanations have been successfully applied to create human interpretable explanations for various black-box models. They are handy for tasks in the image domain, where the quality of the explanations benefits from recent advances in generative models. Although counterfactual explanations have been widely applied to classification models, their application to regression tasks remains underexplored. We present two methods to create counterfactual explanations for image regression tasks using diffusion-based generative models to address challenges in sparsity and quality: 1) one based on a Denoising Diffusion Probabilistic Model that operates directly in pixel-space and 2) another based on a Diffusion Autoencoder operating in latent space. Both produce realistic, semantic, and smooth counterfactuals on CelebA-HQ and a synthetic data set, providing easily interpretable insights into the decision-making process of the regression model and reveal spurious correlations. We find that for regression counterfactuals, changes in features depend on the region of the predicted value. Large semantic changes are needed for significant changes in predicted values, making it harder to find sparse counterfactuals than with classifiers. Moreover, pixel space counterfactuals are more sparse while latent space counterfactuals are of higher quality and allow bigger semantic changes.

Title: $β$-GNN: A Robust Ensemble Approach Against Graph Structure Perturbation

Authors: Haci Ismail Aslan, Philipp Wiesner, Ping Xiong, Odej Kao
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.20630
Pdf URL: https://arxiv.org/pdf/2503.20630
Copy Paste: [[2503.20630]] $β$-GNN: A Robust Ensemble Approach Against Graph Structure Perturbation(https://arxiv.org/abs/2503.20630)
Keywords: anomaly
Abstract: Graph Neural Networks (GNNs) are playing an increasingly important role in the efficient operation and security of computing systems, with applications in workload scheduling, anomaly detection, and resource management. However, their vulnerability to network perturbations poses a significant challenge. We propose $\beta$-GNN, a model enhancing GNN robustness without sacrificing clean data performance. $\beta$-GNN uses a weighted ensemble, combining any GNN with a multi-layer perceptron. A learned dynamic weight, $\beta$, modulates the GNN's contribution. This $\beta$ not only weights GNN influence but also indicates data perturbation levels, enabling proactive mitigation. Experimental results on diverse datasets show $\beta$-GNN's superior adversarial accuracy and attack severity quantification. Crucially, $\beta$-GNN avoids perturbation assumptions, preserving clean data structure and performance.

Title: MMGen: Unified Multi-modal Image Generation and Understanding in One Go

Authors: Jiepeng Wang, Zhaoqing Wang, Hao Pan, Yuan Liu, Dongdong Yu, Changhu Wang, Wenping Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.20644
Pdf URL: https://arxiv.org/pdf/2503.20644
Copy Paste: [[2503.20644]] MMGen: Unified Multi-modal Image Generation and Understanding in One Go(https://arxiv.org/abs/2503.20644)
Keywords: diffusion, generative
Abstract: A unified diffusion framework for multi-modal generation and understanding has the transformative potential to achieve seamless and controllable image diffusion and other cross-modal tasks. In this paper, we introduce MMGen, a unified framework that integrates multiple generative tasks into a single diffusion model. This includes: (1) multi-modal category-conditioned generation, where multi-modal outputs are generated simultaneously through a single inference process, given category information; (2) multi-modal visual understanding, which accurately predicts depth, surface normals, and segmentation maps from RGB images; and (3) multi-modal conditioned generation, which produces corresponding RGB images based on specific modality conditions and other aligned modalities. Our approach develops a novel diffusion transformer that flexibly supports multi-modal output, along with a simple modality-decoupling strategy to unify various tasks. Extensive experiments and applications demonstrate the effectiveness and superiority of MMGen across diverse tasks and conditions, highlighting its potential for applications that require simultaneous generation and understanding.

Title: Imitating Radiological Scrolling: A Global-Local Attention Model for 3D Chest CT Volumes Multi-Label Anomaly Classification

Authors: Theo Di Piazza, Carole Lazarus, Olivier Nempont, Loic Boussel
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.20652
Pdf URL: https://arxiv.org/pdf/2503.20652
Copy Paste: [[2503.20652]] Imitating Radiological Scrolling: A Global-Local Attention Model for 3D Chest CT Volumes Multi-Label Anomaly Classification(https://arxiv.org/abs/2503.20652)
Keywords: anomaly
Abstract: The rapid increase in the number of Computed Tomography (CT) scan examinations has created an urgent need for automated tools, such as organ segmentation, anomaly classification, and report generation, to assist radiologists with their growing workload. Multi-label classification of Three-Dimensional (3D) CT scans is a challenging task due to the volumetric nature of the data and the variety of anomalies to be detected. Existing deep learning methods based on Convolutional Neural Networks (CNNs) struggle to capture long-range dependencies effectively, while Vision Transformers require extensive pre-training, posing challenges for practical use. Additionally, these existing methods do not explicitly model the radiologist's navigational behavior while scrolling through CT scan slices, which requires both global context understanding and local detail awareness. In this study, we present CT-Scroll, a novel global-local attention model specifically designed to emulate the scrolling behavior of radiologists during the analysis of 3D CT scans. Our approach is evaluated on two public datasets, demonstrating its efficacy through comprehensive experiments and an ablation study that highlights the contribution of each model component.

Title: ARMO: Autoregressive Rigging for Multi-Category Objects

Authors: Mingze Sun, Shiwei Mao, Keyi Chen, Yurun Chen, Shunlin Lu, Jingbo Wang, Junting Dong, Ruqi Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.20663
Pdf URL: https://arxiv.org/pdf/2503.20663
Copy Paste: [[2503.20663]] ARMO: Autoregressive Rigging for Multi-Category Objects(https://arxiv.org/abs/2503.20663)
Keywords: diffusion, generative
Abstract: Recent advancements in large-scale generative models have significantly improved the quality and diversity of 3D shape generation. However, most existing methods focus primarily on generating static 3D models, overlooking the potentially dynamic nature of certain shapes, such as humanoids, animals, and insects. To address this gap, we focus on rigging, a fundamental task in animation that establishes skeletal structures and skinning for 3D models. In this paper, we introduce OmniRig, the first large-scale rigging dataset, comprising 79,499 meshes with detailed skeleton and skinning information. Unlike traditional benchmarks that rely on predefined standard poses (e.g., A-pose, T-pose), our dataset embraces diverse shape categories, styles, and poses. Leveraging this rich dataset, we propose ARMO, a novel rigging framework that utilizes an autoregressive model to predict both joint positions and connectivity relationships in a unified manner. By treating the skeletal structure as a complete graph and discretizing it into tokens, we encode the joints using an auto-encoder to obtain a latent embedding and an autoregressive model to predict the tokens. A mesh-conditioned latent diffusion model is used to predict the latent embedding for conditional skeleton generation. Our method addresses the limitations of regression-based approaches, which often suffer from error accumulation and suboptimal connectivity estimation. Through extensive experiments on the OmniRig dataset, our approach achieves state-of-the-art performance in skeleton prediction, demonstrating improved generalization across diverse object categories. The code and dataset will be made public for academic use upon acceptance.

Title: Flip Learning: Weakly Supervised Erase to Segment Nodules in Breast Ultrasound

Authors: Yuhao Huang, Ao Chang, Haoran Dou, Xing Tao, Xinrui Zhou, Yan Cao, Ruobing Huang, Alejandro F Frangi, Lingyun Bao, Xin Yang, Dong Ni
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.20685
Pdf URL: https://arxiv.org/pdf/2503.20685
Copy Paste: [[2503.20685]] Flip Learning: Weakly Supervised Erase to Segment Nodules in Breast Ultrasound(https://arxiv.org/abs/2503.20685)
Keywords: foundation model
Abstract: Accurate segmentation of nodules in both 2D breast ultrasound (BUS) and 3D automated breast ultrasound (ABUS) is crucial for clinical diagnosis and treatment planning. Therefore, developing an automated system for nodule segmentation can enhance user independence and expedite clinical analysis. Unlike fully-supervised learning, weakly-supervised segmentation (WSS) can streamline the laborious and intricate annotation process. However, current WSS methods face challenges in achieving precise nodule segmentation, as many of them depend on inaccurate activation maps or inefficient pseudo-mask generation algorithms. In this study, we introduce a novel multi-agent reinforcement learning-based WSS framework called Flip Learning, which relies solely on 2D/3D boxes for accurate segmentation. Specifically, multiple agents are employed to erase the target from the box to facilitate classification tag flipping, with the erased region serving as the predicted segmentation mask. The key contributions of this research are as follows: (1) Adoption of a superpixel/supervoxel-based approach to encode the standardized environment, capturing boundary priors and expediting the learning process. (2) Introduction of three meticulously designed rewards, comprising a classification score reward and two intensity distribution rewards, to steer the agents' erasing process precisely, thereby avoiding both under- and over-segmentation. (3) Implementation of a progressive curriculum learning strategy to enable agents to interact with the environment in a progressively challenging manner, thereby enhancing learning efficiency. Extensively validated on the large in-house BUS and ABUS datasets, our Flip Learning method outperforms state-of-the-art WSS methods and foundation models, and achieves comparable performance as fully-supervised learning algorithms.

Title: From Annotation to Adaptation: Metrics, Synthetic Data, and Aspect Extraction for Aspect-Based Sentiment Analysis with Large Language Models

Authors: Nikita Neveditsin, Pawan Lingras, Vijay Mago
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.20715
Pdf URL: https://arxiv.org/pdf/2503.20715
Copy Paste: [[2503.20715]] From Annotation to Adaptation: Metrics, Synthetic Data, and Aspect Extraction for Aspect-Based Sentiment Analysis with Large Language Models(https://arxiv.org/abs/2503.20715)
Keywords: generative
Abstract: This study examines the performance of Large Language Models (LLMs) in Aspect-Based Sentiment Analysis (ABSA), with a focus on implicit aspect extraction in a novel domain. Using a synthetic sports feedback dataset, we evaluate open-weight LLMs' ability to extract aspect-polarity pairs and propose a metric to facilitate the evaluation of aspect extraction with generative models. Our findings highlight both the potential and limitations of LLMs in the ABSA task.

Title: Learning Straight Flows by Learning Curved Interpolants

Authors: Shiv Shankar, Tomas Geffner
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.20719
Pdf URL: https://arxiv.org/pdf/2503.20719
Copy Paste: [[2503.20719]] Learning Straight Flows by Learning Curved Interpolants(https://arxiv.org/abs/2503.20719)
Keywords: generative
Abstract: Flow matching models typically use linear interpolants to define the forward/noise addition process. This, together with the independent coupling between noise and target distributions, yields a vector field which is often non-straight. Such curved fields lead to a slow inference/generation process. In this work, we propose to learn flexible (potentially curved) interpolants in order to learn straight vector fields to enable faster generation. We formulate this via a multi-level optimization problem and propose an efficient approximate procedure to solve it. Our framework provides an end-to-end and simulation-free optimization procedure, which can be leveraged to learn straight line generative trajectories.

Title: A weakly-supervised deep learning model for fast localisation and delineation of the skeleton, internal organs, and spinal canal on Whole-Body Diffusion-Weighted MRI (WB-DWI)

Authors: A. Candito (1), A. Dragan (1,2), R. Holbrey (1), A. Ribeiro (2), R. Donners (3), C. Messiou (1,2), N. Tunariu (1,2), D.-M. Koh (1,2), M. D. Blackledge (1), (1)The Institute of Cancer Research, London, United Kingdom (2)The Royal Marsden NHS Foundation Trust, London, United Kingdom (3)University Hospital Basel, Basel, Switzerland
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.20722
Pdf URL: https://arxiv.org/pdf/2503.20722
Copy Paste: [[2503.20722]] A weakly-supervised deep learning model for fast localisation and delineation of the skeleton, internal organs, and spinal canal on Whole-Body Diffusion-Weighted MRI (WB-DWI)(https://arxiv.org/abs/2503.20722)
Keywords: diffusion
Abstract: Background: Apparent Diffusion Coefficient (ADC) values and Total Diffusion Volume (TDV) from Whole-body diffusion-weighted MRI (WB-DWI) are recognized cancer imaging biomarkers. However, manual disease delineation for ADC and TDV measurements is unfeasible in clinical practice, demanding automation. As a first step, we propose an algorithm to generate fast and reproducible probability maps of the skeleton, adjacent internal organs (liver, spleen, urinary bladder, and kidneys), and spinal canal. Methods: We developed an automated deep-learning pipeline based on a 3D patch-based Residual U-Net architecture that localizes and delineates these anatomical structures on WB-DWI. The algorithm was trained using "soft-labels" (non-binary segmentations) derived from a computationally intensive atlas-based approach. For training and validation, we employed a multi-center WB-DWI dataset comprising 532 scans from patients with Advanced Prostate Cancer (APC) or Multiple Myeloma (MM), with testing on 45 patients. Results: Our weakly-supervised deep learning model achieved an average dice score/precision/recall of 0.66/0.6/0.73 for skeletal delineations, 0.8/0.79/0.81 for internal organs, and 0.85/0.79/0.94 for spinal canal, with surface distances consistently below 3 mm. Relative median ADC and log-transformed volume differences between automated and manual expert-defined full-body delineations were below 10% and 4%, respectively. The computational time for generating probability maps was 12x faster than the atlas-based registration algorithm (25 s vs. 5 min). An experienced radiologist rated the model's accuracy "good" or "excellent" on test datasets. Conclusion: Our model offers fast and reproducible probability maps for localizing and delineating body regions on WB-DWI, enabling ADC and TDV quantification, potentially supporting clinicians in disease staging and treatment response assessment.

Title: Dynamic Motion Blending for Versatile Motion Editing

Authors: Nan Jiang, Hongjie Li, Ziye Yuan, Zimo He, Yixin Chen, Tengyu Liu, Yixin Zhu, Siyuan Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.20724
Pdf URL: https://arxiv.org/pdf/2503.20724
Copy Paste: [[2503.20724]] Dynamic Motion Blending for Versatile Motion Editing(https://arxiv.org/abs/2503.20724)
Keywords: diffusion
Abstract: Text-guided motion editing enables high-level semantic control and iterative modifications beyond traditional keyframe animation. Existing methods rely on limited pre-collected training triplets, which severely hinders their versatility in diverse editing scenarios. We introduce MotionCutMix, an online data augmentation technique that dynamically generates training triplets by blending body part motions based on input text. While MotionCutMix effectively expands the training distribution, the compositional nature introduces increased randomness and potential body part incoordination. To model such a rich distribution, we present MotionReFit, an auto-regressive diffusion model with a motion coordinator. The auto-regressive architecture facilitates learning by decomposing long sequences, while the motion coordinator mitigates the artifacts of motion composition. Our method handles both spatial and temporal motion edits directly from high-level human instructions, without relying on additional specifications or Large Language Models. Through extensive experiments, we show that MotionReFit achieves state-of-the-art performance in text-guided motion editing.

Title: RecTable: Fast Modeling Tabular Data with Rectified Flow

Authors: Masane Fuchi, Tomohiro Takagi
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.20731
Pdf URL: https://arxiv.org/pdf/2503.20731
Copy Paste: [[2503.20731]] RecTable: Fast Modeling Tabular Data with Rectified Flow(https://arxiv.org/abs/2503.20731)
Keywords: diffusion
Abstract: Score-based or diffusion models generate high-quality tabular data, surpassing GAN-based and VAE-based models. However, these methods require substantial training time. In this paper, we introduce RecTable, which uses the rectified flow modeling, applied in such as text-to-image generation and text-to-video generation. RecTable features a simple architecture consisting of a few stacked gated linear unit blocks. Additionally, our training strategies are also simple, incorporating a mixed-type noise distribution and a logit-normal timestep distribution. Our experiments demonstrate that RecTable achieves competitive performance compared to the several state-of-the-art diffusion and score-based models while reducing the required training time. Our code is available at this https URL.

Title: High Quality Diffusion Distillation on a Single GPU with Relative and Absolute Position Matching

Authors: Guoqiang Zhang, Kenta Niwa, J.P. Lewis, Cedric Mesnage, W. Bastiaan Kleijn
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.20744
Pdf URL: https://arxiv.org/pdf/2503.20744
Copy Paste: [[2503.20744]] High Quality Diffusion Distillation on a Single GPU with Relative and Absolute Position Matching(https://arxiv.org/abs/2503.20744)
Keywords: diffusion
Abstract: We introduce relative and absolute position matching (RAPM), a diffusion distillation method resulting in high quality generation that can be trained efficiently on a single GPU. Recent diffusion distillation research has achieved excellent results for high-resolution text-to-image generation with methods such as phased consistency models (PCM) and improved distribution matching distillation (DMD2). However, these methods generally require many GPUs (e.g.~8-64) and significant batchsizes (e.g.~128-2048) during training, resulting in memory and compute requirements that are beyond the resources of some researchers. RAPM provides effective single-GPU diffusion distillation training with a batchsize of 1. The new method attempts to mimic the sampling trajectories of the teacher model by matching the relative and absolute positions. The design of relative positions is inspired by PCM. Two discriminators are introduced accordingly in RAPM, one for matching relative positions and the other for absolute positions. Experimental results on StableDiffusion (SD) V1.5 and SDXL indicate that RAPM with 4 timesteps produces comparable FID scores as the best method with 1 timestep under very limited computational resources.

Title: UniSTD: Towards Unified Spatio-Temporal Learning across Diverse Disciplines

Authors: Chen Tang, Xinzhu Ma, Encheng Su, Xiufeng Song, Xiaohong Liu, Wei-Hong Li, Lei Bai, Wanli Ouyang, Xiangyu Yue
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.20748
Pdf URL: https://arxiv.org/pdf/2503.20748
Copy Paste: [[2503.20748]] UniSTD: Towards Unified Spatio-Temporal Learning across Diverse Disciplines(https://arxiv.org/abs/2503.20748)
Keywords: foundation model
Abstract: Traditional spatiotemporal models generally rely on task-specific architectures, which limit their generalizability and scalability across diverse tasks due to domain-specific design requirements. In this paper, we introduce \textbf{UniSTD}, a unified Transformer-based framework for spatiotemporal modeling, which is inspired by advances in recent foundation models with the two-stage pretraining-then-adaption paradigm. Specifically, our work demonstrates that task-agnostic pretraining on 2D vision and vision-text datasets can build a generalizable model foundation for spatiotemporal learning, followed by specialized joint training on spatiotemporal datasets to enhance task-specific adaptability. To improve the learning capabilities across domains, our framework employs a rank-adaptive mixture-of-expert adaptation by using fractional interpolation to relax the discrete variables so that can be optimized in the continuous space. Additionally, we introduce a temporal module to incorporate temporal dynamics explicitly. We evaluate our approach on a large-scale dataset covering 10 tasks across 4 disciplines, demonstrating that a unified spatiotemporal model can achieve scalable, cross-task learning and support up to 10 tasks simultaneously within one model while reducing training costs in multi-domain applications. Code will be available at this https URL.

Title: Reliable algorithm selection for machine learning-guided design

Authors: Clara Fannjiang, Ji Won Park
Subjects: cs.LG, q-bio.QM, stat.ML
Abstract URL: https://arxiv.org/abs/2503.20767
Pdf URL: https://arxiv.org/pdf/2503.20767
Copy Paste: [[2503.20767]] Reliable algorithm selection for machine learning-guided design(https://arxiv.org/abs/2503.20767)
Keywords: generative
Abstract: Algorithms for machine learning-guided design, or design algorithms, use machine learning-based predictions to propose novel objects with desired property values. Given a new design task -- for example, to design novel proteins with high binding affinity to a therapeutic target -- one must choose a design algorithm and specify any hyperparameters and predictive and/or generative models involved. How can these decisions be made such that the resulting designs are successful? This paper proposes a method for design algorithm selection, which aims to select design algorithms that will produce a distribution of design labels satisfying a user-specified success criterion -- for example, that at least ten percent of designs' labels exceed a threshold. It does so by combining designs' predicted property values with held-out labeled data to reliably forecast characteristics of the label distributions produced by different design algorithms, building upon techniques from prediction-powered inference. The method is guaranteed with high probability to return design algorithms that yield successful label distributions (or the null set if none exist), if the density ratios between the design and labeled data distributions are known. We demonstrate the method's effectiveness in simulated protein and RNA design tasks, in settings with either known or estimated density ratios.

Title: Feature4X: Bridging Any Monocular Video to 4D Agentic AI with Versatile Gaussian Feature Fields

Authors: Shijie Zhou, Hui Ren, Yijia Weng, Shuwang Zhang, Zhen Wang, Dejia Xu, Zhiwen Fan, Suya You, Zhangyang Wang, Leonidas Guibas, Achuta Kadambi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.20776
Pdf URL: https://arxiv.org/pdf/2503.20776
Copy Paste: [[2503.20776]] Feature4X: Bridging Any Monocular Video to 4D Agentic AI with Versatile Gaussian Feature Fields(https://arxiv.org/abs/2503.20776)
Keywords: foundation model
Abstract: Recent advancements in 2D and multimodal models have achieved remarkable success by leveraging large-scale training on extensive datasets. However, extending these achievements to enable free-form interactions and high-level semantic operations with complex 3D/4D scenes remains challenging. This difficulty stems from the limited availability of large-scale, annotated 3D/4D or multi-view datasets, which are crucial for generalizable vision and language tasks such as open-vocabulary and prompt-based segmentation, language-guided editing, and visual question answering (VQA). In this paper, we introduce Feature4X, a universal framework designed to extend any functionality from 2D vision foundation model into the 4D realm, using only monocular video input, which is widely available from user-generated content. The "X" in Feature4X represents its versatility, enabling any task through adaptable, model-conditioned 4D feature field distillation. At the core of our framework is a dynamic optimization strategy that unifies multiple model capabilities into a single representation. Additionally, to the best of our knowledge, Feature4X is the first method to distill and lift the features of video foundation models (e.g. SAM2, InternVideo2) into an explicit 4D feature field using Gaussian Splatting. Our experiments showcase novel view segment anything, geometric and appearance scene editing, and free-form VQA across all time steps, empowered by LLMs in feedback loops. These advancements broaden the scope of agentic AI applications by providing a foundation for scalable, contextually and spatiotemporally aware systems capable of immersive dynamic 4D scene interaction.

Title: FB-4D: Spatial-Temporal Coherent Dynamic 3D Content Generation with Feature Banks

Authors: Jinwei Li, Huan-ang Gao, Wenyi Li, Haohan Chi, Chenyu Liu, Chenxi Du, Yiqian Liu, Mingju Gao, Guiyu Zhang, Zongzheng Zhang, Li Yi, Yao Yao, Jingwei Zhao, Hongyang Li, Yikai Wang, Hao Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.20784
Pdf URL: https://arxiv.org/pdf/2503.20784
Copy Paste: [[2503.20784]] FB-4D: Spatial-Temporal Coherent Dynamic 3D Content Generation with Feature Banks(https://arxiv.org/abs/2503.20784)
Keywords: diffusion
Abstract: With the rapid advancements in diffusion models and 3D generation techniques, dynamic 3D content generation has become a crucial research area. However, achieving high-fidelity 4D (dynamic 3D) generation with strong spatial-temporal consistency remains a challenging task. Inspired by recent findings that pretrained diffusion features capture rich correspondences, we propose FB-4D, a novel 4D generation framework that integrates a Feature Bank mechanism to enhance both spatial and temporal consistency in generated frames. In FB-4D, we store features extracted from previous frames and fuse them into the process of generating subsequent frames, ensuring consistent characteristics across both time and multiple views. To ensure a compact representation, the Feature Bank is updated by a proposed dynamic merging mechanism. Leveraging this Feature Bank, we demonstrate for the first time that generating additional reference sequences through multiple autoregressive iterations can continuously improve generation performance. Experimental results show that FB-4D significantly outperforms existing methods in terms of rendering quality, spatial-temporal consistency, and robustness. It surpasses all multi-view generation tuning-free approaches by a large margin and achieves performance on par with training-based methods.

Title: Free4D: Tuning-free 4D Scene Generation with Spatial-Temporal Consistency

Authors: Tianqi Liu, Zihao Huang, Zhaoxi Chen, Guangcong Wang, Shoukang Hu, Liao Shen, Huiqiang Sun, Zhiguo Cao, Wei Li, Ziwei Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.20785
Pdf URL: https://arxiv.org/pdf/2503.20785
Copy Paste: [[2503.20785]] Free4D: Tuning-free 4D Scene Generation with Spatial-Temporal Consistency(https://arxiv.org/abs/2503.20785)
Keywords: diffusion, foundation model
Abstract: We present Free4D, a novel tuning-free framework for 4D scene generation from a single image. Existing methods either focus on object-level generation, making scene-level generation infeasible, or rely on large-scale multi-view video datasets for expensive training, with limited generalization ability due to the scarcity of 4D scene data. In contrast, our key insight is to distill pre-trained foundation models for consistent 4D scene representation, which offers promising advantages such as efficiency and generalizability. 1) To achieve this, we first animate the input image using image-to-video diffusion models followed by 4D geometric structure initialization. 2) To turn this coarse structure into spatial-temporal consistent multiview videos, we design an adaptive guidance mechanism with a point-guided denoising strategy for spatial consistency and a novel latent replacement strategy for temporal coherence. 3) To lift these generated observations into consistent 4D representation, we propose a modulation-based refinement to mitigate inconsistencies while fully leveraging the generated information. The resulting 4D representation enables real-time, controllable rendering, marking a significant advancement in single-image-based 4D scene generation.