2025-01-16

Title: SCOT: Self-Supervised Contrastive Pretraining For Zero-Shot Compositional Retrieval

Authors: Bhavin Jawade, Joao V. B. Soares, Kapil Thadani, Deen Dayal Mohan, Amir Erfan Eshratifar, Benjamin Culpepper, Paloma de Juan, Srirangaraj Setlur, Venu Govindaraju
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2501.08347
Pdf URL: https://arxiv.org/pdf/2501.08347
Copy Paste: [[2501.08347]] SCOT: Self-Supervised Contrastive Pretraining For Zero-Shot Compositional Retrieval(https://arxiv.org/abs/2501.08347)
Keywords: self-supervised, generative
Abstract: Compositional image retrieval (CIR) is a multimodal learning task where a model combines a query image with a user-provided text modification to retrieve a target image. CIR finds applications in a variety of domains including product retrieval (e-commerce) and web search. Existing methods primarily focus on fully-supervised learning, wherein models are trained on datasets of labeled triplets such as FashionIQ and CIRR. This poses two significant challenges: (i) curating such triplet datasets is labor intensive; and (ii) models lack generalization to unseen objects and domains. In this work, we propose SCOT (Self-supervised COmpositional Training), a novel zero-shot compositional pretraining strategy that combines existing large image-text pair datasets with the generative capabilities of large language models to contrastively train an embedding composition network. Specifically, we show that the text embedding from a large-scale contrastively-pretrained vision-language model can be utilized as proxy target supervision during compositional pretraining, replacing the target image embedding. In zero-shot settings, this strategy surpasses SOTA zero-shot compositional retrieval methods as well as many fully-supervised methods on standard benchmarks such as FashionIQ and CIRR.

Title: Is Stochastic Gradient Descent Effective? A PDE Perspective on Machine Learning processes

Authors: Davide Barbieri, Matteo Bonforte, Peio Ibarrondo
Subjects: cs.LG, math.AP, math.PR
Abstract URL: https://arxiv.org/abs/2501.08425
Pdf URL: https://arxiv.org/pdf/2501.08425
Copy Paste: [[2501.08425]] Is Stochastic Gradient Descent Effective? A PDE Perspective on Machine Learning processes(https://arxiv.org/abs/2501.08425)
Keywords: diffusion
Abstract: In this paper we analyze the behaviour of the stochastic gradient descent (SGD), a widely used method in supervised learning for optimizing neural network weights via a minimization of non-convex loss functions. Since the pioneering work of E, Li and Tai (2017), the underlying structure of such processes can be understood via parabolic PDEs of Fokker-Planck type, which are at the core of our analysis. Even if Fokker-Planck equations have a long history and a extensive literature, almost nothing is known when the potential is non-convex or when the diffusion matrix is degenerate, and this is the main difficulty that we face in our analysis. We identify two different regimes: in the initial phase of SGD, the loss function drives the weights to concentrate around the nearest local minimum. We refer to this phase as the drift regime and we provide quantitative estimates on this concentration phenomenon. Next, we introduce the diffusion regime, where stochastic fluctuations help the learning process to escape suboptimal local minima. We analyze the Mean Exit Time (MET) and prove upper and lower bounds of the MET. Finally, we address the asymptotic convergence of SGD, for a non-convex cost function and a degenerate diffusion matrix, that do not allow to use the standard approaches, and require new techniques. For this purpose, we exploit two different methods: duality and entropy methods. We provide new results about the dynamics and effectiveness of SGD, offering a deep connection between stochastic optimization and PDE theory, and some answers and insights to basic questions in the Machine Learning processes: How long does SGD take to escape from a bad minimum? Do neural network parameters converge using SGD? How do parameters evolve in the first stage of training with SGD?

Title: Vchitect-2.0: Parallel Transformer for Scaling Up Video Diffusion Models

Authors: Weichen Fan, Chenyang Si, Junhao Song, Zhenyu Yang, Yinan He, Long Zhuo, Ziqi Huang, Ziyue Dong, Jingwen He, Dongwei Pan, Yi Wang, Yuming Jiang, Yaohui Wang, Peng Gao, Xinyuan Chen, Hengjie Li, Dahua Lin, Yu Qiao, Ziwei Liu
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2501.08453
Pdf URL: https://arxiv.org/pdf/2501.08453
Copy Paste: [[2501.08453]] Vchitect-2.0: Parallel Transformer for Scaling Up Video Diffusion Models(https://arxiv.org/abs/2501.08453)
Keywords: diffusion
Abstract: We present Vchitect-2.0, a parallel transformer architecture designed to scale up video diffusion models for large-scale text-to-video generation. The overall Vchitect-2.0 system has several key designs. (1) By introducing a novel Multimodal Diffusion Block, our approach achieves consistent alignment between text descriptions and generated video frames, while maintaining temporal coherence across sequences. (2) To overcome memory and computational bottlenecks, we propose a Memory-efficient Training framework that incorporates hybrid parallelism and other memory reduction techniques, enabling efficient training of long video sequences on distributed systems. (3) Additionally, our enhanced data processing pipeline ensures the creation of Vchitect T2V DataVerse, a high-quality million-scale training dataset through rigorous annotation and aesthetic evaluation. Extensive benchmarking demonstrates that Vchitect-2.0 outperforms existing methods in video quality, training efficiency, and scalability, serving as a suitable base for high-fidelity video generation.

Title: Time series forecasting for multidimensional telemetry data using GAN and BiLSTM in a Digital Twin

Authors: Joao Carmo de Almeida Neto, Claudio Miceli de Farias, Leandro Santiago de Araujo, Leopoldo Andre Dutra Lusquino Filho
Subjects: cs.LG, eess.SP
Abstract URL: https://arxiv.org/abs/2501.08464
Pdf URL: https://arxiv.org/pdf/2501.08464
Copy Paste: [[2501.08464]] Time series forecasting for multidimensional telemetry data using GAN and BiLSTM in a Digital Twin(https://arxiv.org/abs/2501.08464)
Keywords: generative
Abstract: The research related to digital twins has been increasing in recent years. Besides the mirroring of the physical word into the digital, there is the need of providing services related to the data collected and transferred to the virtual world. One of these services is the forecasting of physical part future behavior, that could lead to applications, like preventing harmful events or designing improvements to get better performance. One strategy used to predict any system operation it is the use of time series models like ARIMA or LSTM, and improvements were implemented using these algorithms. Recently, deep learning techniques based on generative models such as Generative Adversarial Networks (GANs) have been proposed to create time series and the use of LSTM has gained more relevance in time series forecasting, but both have limitations that restrict the forecasting results. Another issue found in the literature is the challenge of handling multivariate environments/applications in time series generation. Therefore, new methods need to be studied in order to fill these gaps and, consequently, provide better resources for creating useful digital twins. In this proposal, it is going to be studied the integration of a BiLSTM layer with a time series obtained by GAN in order to improve the forecasting of all the features provided by the dataset in terms of accuracy and, consequently, improving behaviour prediction.

Title: Selective Attention Merging for low resource tasks: A case study of Child ASR

Authors: Natarajan Balaji Shankar, Zilai Wang, Eray Eren, Abeer Alwan
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2501.08468
Pdf URL: https://arxiv.org/pdf/2501.08468
Copy Paste: [[2501.08468]] Selective Attention Merging for low resource tasks: A case study of Child ASR(https://arxiv.org/abs/2501.08468)
Keywords: foundation model
Abstract: While Speech Foundation Models (SFMs) excel in various speech tasks, their performance for low-resource tasks such as child Automatic Speech Recognition (ASR) is hampered by limited pretraining data. To address this, we explore different model merging techniques to leverage knowledge from models trained on larger, more diverse speech corpora. This paper also introduces Selective Attention (SA) Merge, a novel method that selectively merges task vectors from attention matrices to enhance SFM performance on low-resource tasks. Experiments on the MyST database show significant reductions in relative word error rate of up to 14%, outperforming existing model merging and data augmentation techniques. By combining data augmentation techniques with SA Merge, we achieve a new state-of-the-art WER of 8.69 on the MyST database for the Whisper-small model, highlighting the potential of SA Merge for improving low-resource ASR.

Title: Detecting Contextual Anomalies by Discovering Consistent Spatial Regions

Authors: Zhengye Yang, Richard J. Radke
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2501.08470
Pdf URL: https://arxiv.org/pdf/2501.08470
Copy Paste: [[2501.08470]] Detecting Contextual Anomalies by Discovering Consistent Spatial Regions(https://arxiv.org/abs/2501.08470)
Keywords: anomaly
Abstract: We describe a method for modeling spatial context to enable video anomaly detection. The main idea is to discover regions that share similar object-level activities by clustering joint object attributes using Gaussian mixture models. We demonstrate that this straightforward approach, using orders of magnitude fewer parameters than competing models, achieves state-of-the-art performance in the challenging spatial-context-dependent Street Scene dataset. As a side benefit, the high-resolution discovered regions learned by the model also provide explainable normalcy maps for human operators without the need for any pre-trained segmentation model.

Title: Benchmarking Classical, Deep, and Generative Models for Human Activity Recognition

Authors: Md Meem Hossain, The Anh Han, Safina Showkat Ara, Zia Ush Shamszaman
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2501.08471
Pdf URL: https://arxiv.org/pdf/2501.08471
Copy Paste: [[2501.08471]] Benchmarking Classical, Deep, and Generative Models for Human Activity Recognition(https://arxiv.org/abs/2501.08471)
Keywords: generative
Abstract: Human Activity Recognition (HAR) has gained significant importance with the growing use of sensor-equipped devices and large datasets. This paper evaluates the performance of three categories of models : classical machine learning, deep learning architectures, and Restricted Boltzmann Machines (RBMs) using five key benchmark datasets of HAR (UCI-HAR, OPPORTUNITY, PAMAP2, WISDM, and Berkeley MHAD). We assess various models, including Decision Trees, Random Forests, Convolutional Neural Networks (CNN), and Deep Belief Networks (DBNs), using metrics such as accuracy, precision, recall, and F1-score for a comprehensive comparison. The results show that CNN models offer superior performance across all datasets, especially on the Berkeley MHAD. Classical models like Random Forest do well on smaller datasets but face challenges with larger, more complex data. RBM-based models also show notable potential, particularly for feature learning. This paper offers a detailed comparison to help researchers choose the most suitable model for HAR tasks.

Title: Yuan: Yielding Unblemished Aesthetics Through A Unified Network for Visual Imperfections Removal in Generated Images

Authors: Zhenyu Yu, Chee Seng Chan
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2501.08505
Pdf URL: https://arxiv.org/pdf/2501.08505
Copy Paste: [[2501.08505]] Yuan: Yielding Unblemished Aesthetics Through A Unified Network for Visual Imperfections Removal in Generated Images(https://arxiv.org/abs/2501.08505)
Keywords: generative
Abstract: Generative AI presents transformative potential across various domains, from creative arts to scientific visualization. However, the utility of AI-generated imagery is often compromised by visual flaws, including anatomical inaccuracies, improper object placements, and misplaced textual elements. These imperfections pose significant challenges for practical applications. To overcome these limitations, we introduce \textit{Yuan}, a novel framework that autonomously corrects visual imperfections in text-to-image synthesis. \textit{Yuan} uniquely conditions on both the textual prompt and the segmented image, generating precise masks that identify areas in need of refinement without requiring manual intervention -- a common constraint in previous methodologies. Following the automated masking process, an advanced inpainting module seamlessly integrates contextually coherent content into the identified regions, preserving the integrity and fidelity of the original image and associated text prompts. Through extensive experimentation on publicly available datasets such as ImageNet100 and Stanford Dogs, along with a custom-generated dataset, \textit{Yuan} demonstrated superior performance in eliminating visual imperfections. Our approach consistently achieved higher scores in quantitative metrics, including NIQE, BRISQUE, and PI, alongside favorable qualitative evaluations. These results underscore \textit{Yuan}'s potential to significantly enhance the quality and applicability of AI-generated images across diverse fields.

Title: DynamicFace: High-Quality and Consistent Video Face Swapping using Composable 3D Facial Priors

Authors: Runqi Wang, Sijie Xu, Tianyao He, Yang Chen, Wei Zhu, Dejia Song, Nemo Chen, Xu Tang, Yao Hu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.08553
Pdf URL: https://arxiv.org/pdf/2501.08553
Copy Paste: [[2501.08553]] DynamicFace: High-Quality and Consistent Video Face Swapping using Composable 3D Facial Priors(https://arxiv.org/abs/2501.08553)
Keywords: diffusion
Abstract: Face swapping transfers the identity of a source face to a target face while retaining the attributes like expression, pose, hair, and background of the target face. Advanced face swapping methods have achieved attractive results. However, these methods often inadvertently transfer identity information from the target face, compromising expression-related details and accurate identity. We propose a novel method DynamicFace that leverages the power of diffusion model and plug-and-play temporal layers for video face swapping. First, we introduce four fine-grained face conditions using 3D facial priors. All conditions are designed to be disentangled from each other for precise and unique control. Then, we adopt Face Former and ReferenceNet for high-level and detailed identity injection. Through experiments on the FF++ dataset, we demonstrate that our method achieves state-of-the-art results in face swapping, showcasing superior image quality, identity preservation, and expression accuracy. Besides, our method could be easily transferred to video domain with temporal attention layer. Our code and results will be available on the project page: this https URL

Title: Densely Connected Parameter-Efficient Tuning for Referring Image Segmentation

Authors: Jiaqi Huang, Zunnan Xu, Ting Liu, Yong Liu, Haonan Han, Kehong Yuan, Xiu Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.08580
Pdf URL: https://arxiv.org/pdf/2501.08580
Copy Paste: [[2501.08580]] Densely Connected Parameter-Efficient Tuning for Referring Image Segmentation(https://arxiv.org/abs/2501.08580)
Keywords: foundation model
Abstract: In the domain of computer vision, Parameter-Efficient Tuning (PET) is increasingly replacing the traditional paradigm of pre-training followed by full fine-tuning. PET is particularly favored for its effectiveness in large foundation models, as it streamlines transfer learning costs and optimizes hardware utilization. However, the current PET methods are mainly designed for single-modal optimization. While some pioneering studies have undertaken preliminary explorations, they still remain at the level of aligned encoders (e.g., CLIP) and lack exploration of misaligned encoders. These methods show sub-optimal performance with misaligned encoders, as they fail to effectively align the multimodal features during fine-tuning. In this paper, we introduce DETRIS, a parameter-efficient tuning framework designed to enhance low-rank visual feature propagation by establishing dense interconnections between each layer and all preceding layers, which enables effective cross-modal feature interaction and adaptation to misaligned encoders. We also suggest using text adapters to improve textual features. Our simple yet efficient approach greatly surpasses state-of-the-art methods with 0.9% to 1.8% backbone parameter updates, evaluated on challenging benchmarks. Our project is available at \url{this https URL}.

Title: Watermarking in Diffusion Model: Gaussian Shading with Exact Diffusion Inversion via Coupled Transformations (EDICT)

Authors: Krishna Panthi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.08604
Pdf URL: https://arxiv.org/pdf/2501.08604
Copy Paste: [[2501.08604]] Watermarking in Diffusion Model: Gaussian Shading with Exact Diffusion Inversion via Coupled Transformations (EDICT)(https://arxiv.org/abs/2501.08604)
Keywords: diffusion
Abstract: This paper introduces a novel approach to enhance the performance of Gaussian Shading, a prevalent watermarking technique, by integrating the Exact Diffusion Inversion via Coupled Transformations (EDICT) framework. While Gaussian Shading traditionally embeds watermarks in a noise latent space, followed by iterative denoising for image generation and noise addition for watermark recovery, its inversion process is not exact, leading to potential watermark distortion. We propose to leverage EDICT's ability to derive exact inverse mappings to refine this process. Our method involves duplicating the watermark-infused noisy latent and employing a reciprocal, alternating denoising and noising scheme between the two latents, facilitated by EDICT. This allows for a more precise reconstruction of both the image and the embedded watermark. Empirical evaluation on standard datasets demonstrates that our integrated approach yields a slight, yet statistically significant improvement in watermark recovery fidelity. These results highlight the potential of EDICT to enhance existing diffusion-based watermarking techniques by providing a more accurate and robust inversion mechanism. To the best of our knowledge, this is the first work to explore the synergy between EDICT and Gaussian Shading for digital watermarking, opening new avenues for research in robust and high-fidelity watermark embedding and extraction.

Title: RLHS: Mitigating Misalignment in RLHF with Hindsight Simulation

Authors: Kaiqu Liang, Haimin Hu, Ryan Liu, Thomas L. Griffiths, Jaime Fernández Fisac
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2501.08617
Pdf URL: https://arxiv.org/pdf/2501.08617
Copy Paste: [[2501.08617]] RLHS: Mitigating Misalignment in RLHF with Hindsight Simulation(https://arxiv.org/abs/2501.08617)
Keywords: foundation model, generative
Abstract: Generative AI systems like foundation models (FMs) must align well with human values to ensure their behavior is helpful and trustworthy. While Reinforcement Learning from Human Feedback (RLHF) has shown promise for optimizing model performance using human judgments, existing RLHF pipelines predominantly rely on immediate feedback, which can fail to accurately reflect the downstream impact of an interaction on users' utility. We demonstrate that feedback based on evaluators' foresight estimates of downstream consequences systematically induces Goodhart's Law dynamics, incentivizing misaligned behaviors like sycophancy and deception and ultimately degrading user outcomes. To alleviate this, we propose decoupling evaluation from prediction by refocusing RLHF on hindsight feedback. Our theoretical analysis reveals that conditioning evaluator feedback on downstream observations mitigates misalignment and improves expected human utility, even when these observations are simulated by the AI system itself. To leverage this insight in a practical alignment algorithm, we introduce Reinforcement Learning from Hindsight Simulation (RLHS), which first simulates plausible consequences and then elicits feedback to assess what behaviors were genuinely beneficial in hindsight. We apply RLHS to two widely-employed online and offline preference optimization methods -- Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO) -- and show empirically that misalignment is significantly reduced with both methods. Through an online human user study, we show that RLHS consistently outperforms RLHF in helping users achieve their goals and earns higher satisfaction ratings, despite being trained solely with simulated hindsight feedback. These results underscore the importance of focusing on long-term consequences, even simulated ones, to mitigate misalignment in RLHF.

Title: Transformer-based Multivariate Time Series Anomaly Localization

Authors: Charalampos Shimillas, Kleanthis Malialis, Konstantinos Fokianos, Marios M. Polycarpou
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2501.08628
Pdf URL: https://arxiv.org/pdf/2501.08628
Copy Paste: [[2501.08628]] Transformer-based Multivariate Time Series Anomaly Localization(https://arxiv.org/abs/2501.08628)
Keywords: anomaly
Abstract: With the growing complexity of Cyber-Physical Systems (CPS) and the integration of Internet of Things (IoT), the use of sensors for online monitoring generates large volume of multivariate time series (MTS) data. Consequently, the need for robust anomaly diagnosis in MTS is paramount to maintaining system reliability and safety. While significant advancements have been made in anomaly detection, localization remains a largely underexplored area, though crucial for intelligent decision-making. This paper introduces a novel transformer-based model for unsupervised anomaly diagnosis in MTS, with a focus on improving localization performance, through an in-depth analysis of the self-attention mechanism's learning behavior under both normal and anomalous conditions. We formulate the anomaly localization problem as a three-stage process: time-step, window, and segment-based. This leads to the development of the Space-Time Anomaly Score (STAS), a new metric inspired by the connection between transformer latent representations and space-time statistical models. STAS is designed to capture individual anomaly behaviors and inter-series dependencies, delivering enhanced localization performance. Additionally, the Statistical Feature Anomaly Score (SFAS) complements STAS by analyzing statistical features around anomalies, with their combination helping to reduce false alarms. Experiments on real world and synthetic datasets illustrate the model's superiority over state-of-the-art methods in both detection and localization tasks.

Title: MAGNET: Augmenting Generative Decoders with Representation Learning and Infilling Capabilities

Authors: Savya Khosla, Kushal Kafle, Simon Jenni, Handong Zhao, John Collomosse, Jing Shi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2501.08648
Pdf URL: https://arxiv.org/pdf/2501.08648
Copy Paste: [[2501.08648]] MAGNET: Augmenting Generative Decoders with Representation Learning and Infilling Capabilities(https://arxiv.org/abs/2501.08648)
Keywords: self-supervised, generative
Abstract: While originally designed for unidirectional generative modeling, decoder-only large language models (LLMs) are increasingly being adapted for bidirectional modeling. However, unidirectional and bidirectional models are typically trained separately with distinct objectives (generation and representation learning, respectively). This separation overlooks the opportunity for developing a more versatile language model and for these objectives to complement each other. In this work, we introduce MAGNET, an adaptation of decoder-only LLMs that enhances their ability to generate robust representations and infill missing text spans, while preserving their knowledge and text generation capabilities. MAGNET employs three self-supervised training objectives and introduces an attention mechanism that combines bidirectional and causal attention, enabling unified training across all objectives. Our results demonstrate that LLMs adapted with MAGNET (1) surpass strong text encoders on token-level and sentence-level representation learning tasks, (2) generate contextually appropriate text infills by leveraging future context, (3) retain the ability for open-ended text generation without exhibiting repetition problem, and (4) preserve the knowledge gained by the LLM during pretraining.

Title: Joint Learning of Depth and Appearance for Portrait Image Animation

Authors: Xinya Ji, Gaspard Zoss, Prashanth Chandran, Lingchen Yang, Xun Cao, Barbara Solenthaler, Derek Bradley
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2501.08649
Pdf URL: https://arxiv.org/pdf/2501.08649
Copy Paste: [[2501.08649]] Joint Learning of Depth and Appearance for Portrait Image Animation(https://arxiv.org/abs/2501.08649)
Keywords: diffusion, generative
Abstract: 2D portrait animation has experienced significant advancements in recent years. Much research has utilized the prior knowledge embedded in large generative diffusion models to enhance high-quality image manipulation. However, most methods only focus on generating RGB images as output, and the co-generation of consistent visual plus 3D output remains largely under-explored. In our work, we propose to jointly learn the visual appearance and depth simultaneously in a diffusion-based portrait image generator. Our method embraces the end-to-end diffusion paradigm and introduces a new architecture suitable for learning this conditional joint distribution, consisting of a reference network and a channel-expanded diffusion backbone. Once trained, our framework can be efficiently adapted to various downstream applications, such as facial depth-to-image and image-to-depth generation, portrait relighting, and audio-driven talking head animation with consistent 3D output.

Title: StereoGen: High-quality Stereo Image Generation from a Single Image

Authors: Xianqi Wang, Hao Yang, Gangwei Xu, Junda Cheng, Min Lin, Yong Deng, Jinliang Zang, Yurui Chen, Xin Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.08654
Pdf URL: https://arxiv.org/pdf/2501.08654
Copy Paste: [[2501.08654]] StereoGen: High-quality Stereo Image Generation from a Single Image(https://arxiv.org/abs/2501.08654)
Keywords: diffusion
Abstract: State-of-the-art supervised stereo matching methods have achieved amazing results on various benchmarks. However, these data-driven methods suffer from generalization to real-world scenarios due to the lack of real-world annotated data. In this paper, we propose StereoGen, a novel pipeline for high-quality stereo image generation. This pipeline utilizes arbitrary single images as left images and pseudo disparities generated by a monocular depth estimation model to synthesize high-quality corresponding right images. Unlike previous methods that fill the occluded area in warped right images using random backgrounds or using convolutions to take nearby pixels selectively, we fine-tune a diffusion inpainting model to recover the background. Images generated by our model possess better details and undamaged semantic structures. Besides, we propose Training-free Confidence Generation and Adaptive Disparity Selection. The former suppresses the negative effect of harmful pseudo ground truth during stereo training, while the latter helps generate a wider disparity distribution and better synthetic images. Experiments show that models trained under our pipeline achieve state-of-the-art zero-shot generalization results among all published methods. The code will be available upon publication of the paper.

Title: FlexiClip: Locality-Preserving Free-Form Character Animation

Authors: Anant Khandelwal
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2501.08676
Pdf URL: https://arxiv.org/pdf/2501.08676
Copy Paste: [[2501.08676]] FlexiClip: Locality-Preserving Free-Form Character Animation(https://arxiv.org/abs/2501.08676)
Keywords: diffusion
Abstract: Animating clipart images with seamless motion while maintaining visual fidelity and temporal coherence presents significant challenges. Existing methods, such as AniClipart, effectively model spatial deformations but often fail to ensure smooth temporal transitions, resulting in artifacts like abrupt motions and geometric distortions. Similarly, text-to-video (T2V) and image-to-video (I2V) models struggle to handle clipart due to the mismatch in statistical properties between natural video and clipart styles. This paper introduces FlexiClip, a novel approach designed to overcome these limitations by addressing the intertwined challenges of temporal consistency and geometric integrity. FlexiClip extends traditional Bézier curve-based trajectory modeling with key innovations: temporal Jacobians to correct motion dynamics incrementally, continuous-time modeling via probability flow ODEs (pfODEs) to mitigate temporal noise, and a flow matching loss inspired by GFlowNet principles to optimize smooth motion transitions. These enhancements ensure coherent animations across complex scenarios involving rapid movements and non-rigid deformations. Extensive experiments validate the effectiveness of FlexiClip in generating animations that are not only smooth and natural but also structurally consistent across diverse clipart types, including humans and animals. By integrating spatial and temporal modeling with pre-trained video diffusion models, FlexiClip sets a new standard for high-quality clipart animation, offering robust performance across a wide range of visual content. Project Page: this https URL

Title: Investigating Parameter-Efficiency of Hybrid QuGANs Based on Geometric Properties of Generated Sea Route Graphs

Authors: Tobias Rohe, Florian Burger, Michael Kölle, Sebastian Wölckert, Maximilian Zorn, Claudia Linnhoff-Popien
Subjects: cs.LG, quant-ph
Abstract URL: https://arxiv.org/abs/2501.08678
Pdf URL: https://arxiv.org/pdf/2501.08678
Copy Paste: [[2501.08678]] Investigating Parameter-Efficiency of Hybrid QuGANs Based on Geometric Properties of Generated Sea Route Graphs(https://arxiv.org/abs/2501.08678)
Keywords: generative
Abstract: The demand for artificially generated data for the development, training and testing of new algorithms is omnipresent. Quantum computing (QC), does offer the hope that its inherent probabilistic functionality can be utilised in this field of generative artificial intelligence. In this study, we use quantum-classical hybrid generative adversarial networks (QuGANs) to artificially generate graphs of shipping routes. We create a training dataset based on real shipping data and investigate to what extent QuGANs are able to learn and reproduce inherent distributions and geometric features of this data. We compare hybrid QuGANs with classical Generative Adversarial Networks (GANs), with a special focus on their parameter efficiency. Our results indicate that QuGANs are indeed able to quickly learn and represent underlying geometric properties and distributions, although they seem to have difficulties in introducing variance into the sampled data. Compared to classical GANs of greater size, measured in the number of parameters used, some QuGANs show similar result quality. Our reference to concrete use cases, such as the generation of shipping data, provides an illustrative example and demonstrate the potential and diversity in which QC can be used.

Title: RealVVT: Towards Photorealistic Video Virtual Try-on via Spatio-Temporal Consistency

Authors: Siqi Li, Zhengkai Jiang, Jiawei Zhou, Zhihong Liu, Xiaowei Chi, Haoqian Wang
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2501.08682
Pdf URL: https://arxiv.org/pdf/2501.08682
Copy Paste: [[2501.08682]] RealVVT: Towards Photorealistic Video Virtual Try-on via Spatio-Temporal Consistency(https://arxiv.org/abs/2501.08682)
Keywords: foundation model
Abstract: Virtual try-on has emerged as a pivotal task at the intersection of computer vision and fashion, aimed at digitally simulating how clothing items fit on the human body. Despite notable progress in single-image virtual try-on (VTO), current methodologies often struggle to preserve a consistent and authentic appearance of clothing across extended video sequences. This challenge arises from the complexities of capturing dynamic human pose and maintaining target clothing characteristics. We leverage pre-existing video foundation models to introduce RealVVT, a photoRealistic Video Virtual Try-on framework tailored to bolster stability and realism within dynamic video contexts. Our methodology encompasses a Clothing & Temporal Consistency strategy, an Agnostic-guided Attention Focus Loss mechanism to ensure spatial consistency, and a Pose-guided Long Video VTO technique adept at handling extended video this http URL experiments across various datasets confirms that our approach outperforms existing state-of-the-art models in both single-image and video VTO tasks, offering a viable solution for practical applications within the realms of fashion e-commerce and virtual fitting environments.

Title: Self-supervised Transformation Learning for Equivariant Representations

Authors: Jaemyung Yu, Jaehyun Choi, Dong-Jae Lee, HyeongGwon Hong, Junmo Kim
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2501.08712
Pdf URL: https://arxiv.org/pdf/2501.08712
Copy Paste: [[2501.08712]] Self-supervised Transformation Learning for Equivariant Representations(https://arxiv.org/abs/2501.08712)
Keywords: self-supervised
Abstract: Unsupervised representation learning has significantly advanced various machine learning tasks. In the computer vision domain, state-of-the-art approaches utilize transformations like random crop and color jitter to achieve invariant representations, embedding semantically the same inputs despite transformations. However, this can degrade performance in tasks requiring precise features, such as localization or flower classification. To address this, recent research incorporates equivariant representation learning, which captures transformation-sensitive information. However, current methods depend on transformation labels and thus struggle with interdependency and complex transformations. We propose Self-supervised Transformation Learning (STL), replacing transformation labels with transformation representations derived from image pairs. The proposed method ensures transformation representation is image-invariant and learns corresponding equivariant transformations, enhancing performance without increased batch complexity. We demonstrate the approach's effectiveness across diverse classification and detection tasks, outperforming existing methods in 7 out of 11 benchmarks and excelling in detection. By integrating complex transformations like AugMix, unusable by prior equivariant methods, this approach enhances performance across tasks, underscoring its adaptability and resilience. Additionally, its compatibility with various base models highlights its flexibility and broad applicability. The code is available at this https URL.

Title: The Inherent Limits of Pretrained LLMs: The Unexpected Convergence of Instruction Tuning and In-Context Learning Capabilities

Authors: Irina Bigoulaeva, Harish Tayyar Madabushi, Iryna Gurevych
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.08716
Pdf URL: https://arxiv.org/pdf/2501.08716
Copy Paste: [[2501.08716]] The Inherent Limits of Pretrained LLMs: The Unexpected Convergence of Instruction Tuning and In-Context Learning Capabilities(https://arxiv.org/abs/2501.08716)
Keywords: in-context
Abstract: Large Language Models (LLMs), trained on extensive web-scale corpora, have demonstrated remarkable abilities across diverse tasks, especially as they are scaled up. Nevertheless, even state-of-the-art models struggle in certain cases, sometimes failing at problems solvable by young children, indicating that traditional notions of task complexity are insufficient for explaining LLM capabilities. However, exploring LLM capabilities is complicated by the fact that most widely-used models are also "instruction-tuned" to respond appropriately to prompts. With the goal of disentangling the factors influencing LLM performance, we investigate whether instruction-tuned models possess fundamentally different capabilities from base models that are prompted using in-context examples. Through extensive experiments across various model families, scales and task types, which included instruction tuning 90 different LLMs, we demonstrate that the performance of instruction-tuned models is significantly correlated with the in-context performance of their base counterparts. By clarifying what instruction-tuning contributes, we extend prior research into in-context learning, which suggests that base models use priors from pretraining data to solve tasks. Specifically, we extend this understanding to instruction-tuned models, suggesting that their pretraining data similarly sets a limiting boundary on the tasks they can solve, with the added influence of the instruction-tuning dataset.

Title: Transformed Low-rank Adaptation via Tensor Decomposition and Its Applications to Text-to-image Models

Authors: Zerui Tao, Yuhta Takida, Naoki Murata, Qibin Zhao, Yuki Mitsufuji
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2501.08727
Pdf URL: https://arxiv.org/pdf/2501.08727
Copy Paste: [[2501.08727]] Transformed Low-rank Adaptation via Tensor Decomposition and Its Applications to Text-to-image Models(https://arxiv.org/abs/2501.08727)
Keywords: diffusion
Abstract: Parameter-Efficient Fine-Tuning (PEFT) of text-to-image models has become an increasingly popular technique with many applications. Among the various PEFT methods, Low-Rank Adaptation (LoRA) and its variants have gained significant attention due to their effectiveness, enabling users to fine-tune models with limited computational resources. However, the approximation gap between the low-rank assumption and desired fine-tuning weights prevents the simultaneous acquisition of ultra-parameter-efficiency and better performance. To reduce this gap and further improve the power of LoRA, we propose a new PEFT method that combines two classes of adaptations, namely, transform and residual adaptations. In specific, we first apply a full-rank and dense transform to the pre-trained weight. This learnable transform is expected to align the pre-trained weight as closely as possible to the desired weight, thereby reducing the rank of the residual weight. Then, the residual part can be effectively approximated by more compact and parameter-efficient structures, with a smaller approximation error. To achieve ultra-parameter-efficiency in practice, we design highly flexible and effective tensor decompositions for both the transform and residual adaptations. Additionally, popular PEFT methods such as DoRA can be summarized under this transform plus residual adaptation scheme. Experiments are conducted on fine-tuning Stable Diffusion models in subject-driven and controllable generation. The results manifest that our method can achieve better performances and parameter efficiency compared to LoRA and several baselines.

Title: Few-Shot Learner Generalizes Across AI-Generated Image Detection

Authors: Shiyu Wu, Jing Liu, Jing Li, Yequan Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.08763
Pdf URL: https://arxiv.org/pdf/2501.08763
Copy Paste: [[2501.08763]] Few-Shot Learner Generalizes Across AI-Generated Image Detection(https://arxiv.org/abs/2501.08763)
Keywords: generative
Abstract: Current fake image detectors trained on large synthetic image datasets perform satisfactorily on limited studied generative models. However, they suffer a notable performance decline over unseen models. Besides, collecting adequate training data from online generative models is often expensive or infeasible. To overcome these issues, we propose Few-Shot Detector (FSD), a novel AI-generated image detector which learns a specialized metric space to effectively distinguish unseen fake images by utilizing very few samples. Experiments show FSD achieves state-of-the-art performance by $+7.4\%$ average ACC on GenImage dataset. More importantly, our method is better capable of capturing the intra-category common features in unseen images without further training.

Title: Enhanced Large Language Models for Effective Screening of Depression and Anxiety

Authors: June M. Liu, Mengxia Gao, Sahand Sabour, Zhuang Chen, Minlie Huang, Tatia M.C. Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.08769
Pdf URL: https://arxiv.org/pdf/2501.08769
Copy Paste: [[2501.08769]] Enhanced Large Language Models for Effective Screening of Depression and Anxiety(https://arxiv.org/abs/2501.08769)
Keywords: generative
Abstract: Depressive and anxiety disorders are widespread, necessitating timely identification and management. Recent advances in Large Language Models (LLMs) offer potential solutions, yet high costs and ethical concerns about training data remain challenges. This paper introduces a pipeline for synthesizing clinical interviews, resulting in 1,157 interactive dialogues (PsyInterview), and presents EmoScan, an LLM-based emotional disorder screening system. EmoScan distinguishes between coarse (e.g., anxiety or depressive disorders) and fine disorders (e.g., major depressive disorders) and conducts high-quality interviews. Evaluations showed that EmoScan exceeded the performance of base models and other LLMs like GPT-4 in screening emotional disorders (F1-score=0.7467). It also delivers superior explanations (BERTScore=0.9408) and demonstrates robust generalizability (F1-score of 0.67 on an external dataset). Furthermore, EmoScan outperforms baselines in interviewing skills, as validated by automated ratings and human evaluations. This work highlights the importance of scalable data-generative pipelines for developing effective mental health LLM tools.

Title: Admitting Ignorance Helps the Video Question Answering Models to Answer

Authors: Haopeng Li, Tom Drummond, Mingming Gong, Mohammed Bennamoun, Qiuhong Ke
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.08771
Pdf URL: https://arxiv.org/pdf/2501.08771
Copy Paste: [[2501.08771]] Admitting Ignorance Helps the Video Question Answering Models to Answer(https://arxiv.org/abs/2501.08771)
Keywords: foundation model
Abstract: Significant progress has been made in the field of video question answering (VideoQA) thanks to deep learning and large-scale pretraining. Despite the presence of sophisticated model structures and powerful video-text foundation models, most existing methods focus solely on maximizing the correlation between answers and video-question pairs during training. We argue that these models often establish shortcuts, resulting in spurious correlations between questions and answers, especially when the alignment between video and text data is suboptimal. To address these spurious correlations, we propose a novel training framework in which the model is compelled to acknowledge its ignorance when presented with an intervened question, rather than making guesses solely based on superficial question-answer correlations. We introduce methodologies for intervening in questions, utilizing techniques such as displacement and perturbation, and design frameworks for the model to admit its lack of knowledge in both multi-choice VideoQA and open-ended settings. In practice, we integrate a state-of-the-art model into our framework to validate its effectiveness. The results clearly demonstrate that our framework can significantly enhance the performance of VideoQA models with minimal structural modifications.

Title: Exploring ChatGPT for Face Presentation Attack Detection in Zero and Few-Shot in-Context Learning

Authors: Alain Komaty, Hatef Otroshi Shahreza, Anjith George, Sebastien Marcel
Subjects: cs.CV, cs.CR
Abstract URL: https://arxiv.org/abs/2501.08799
Pdf URL: https://arxiv.org/pdf/2501.08799
Copy Paste: [[2501.08799]] Exploring ChatGPT for Face Presentation Attack Detection in Zero and Few-Shot in-Context Learning(https://arxiv.org/abs/2501.08799)
Keywords: in-context
Abstract: This study highlights the potential of ChatGPT (specifically GPT-4o) as a competitive alternative for Face Presentation Attack Detection (PAD), outperforming several PAD models, including commercial solutions, in specific scenarios. Our results show that GPT-4o demonstrates high consistency, particularly in few-shot in-context learning, where its performance improves as more examples are provided (reference data). We also observe that detailed prompts enable the model to provide scores reliably, a behavior not observed with concise prompts. Additionally, explanation-seeking prompts slightly enhance the model's performance by improving its interpretability. Remarkably, the model exhibits emergent reasoning capabilities, correctly predicting the attack type (print or replay) with high accuracy in few-shot scenarios, despite not being explicitly instructed to classify attack types. Despite these strengths, GPT-4o faces challenges in zero-shot tasks, where its performance is limited compared to specialized PAD systems. Experiments were conducted on a subset of the SOTERIA dataset, ensuring compliance with data privacy regulations by using only data from consenting individuals. These findings underscore GPT-4o's promise in PAD applications, laying the groundwork for future research to address broader data privacy concerns and improve cross-dataset generalization. Code available here: this https URL

Title: MANTA: Diffusion Mamba for Efficient and Effective Stochastic Long-Term Dense Anticipation

Authors: Olga Zatsarynna, Emad Bahrami, Yazan Abu Farha, Gianpiero Francesca, Juergen Gall
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.08837
Pdf URL: https://arxiv.org/pdf/2501.08837
Copy Paste: [[2501.08837]] MANTA: Diffusion Mamba for Efficient and Effective Stochastic Long-Term Dense Anticipation(https://arxiv.org/abs/2501.08837)
Keywords: diffusion
Abstract: Our work addresses the problem of stochastic long-term dense anticipation. The goal of this task is to predict actions and their durations several minutes into the future based on provided video observations. Anticipation over extended horizons introduces high uncertainty, as a single observation can lead to multiple plausible future outcomes. To address this uncertainty, stochastic models are designed to predict several potential future action sequences. Recent work has further proposed to incorporate uncertainty modelling for observed frames by simultaneously predicting per-frame past and future actions in a unified manner. While such joint modelling of actions is beneficial, it requires long-range temporal capabilities to connect events across distant past and future time points. However, the previous work struggles to achieve such a long-range understanding due to its limited and/or sparse receptive field. To alleviate this issue, we propose a novel MANTA (MAmba for ANTicipation) network. Our model enables effective long-term temporal modelling even for very long sequences while maintaining linear complexity in sequence length. We demonstrate that our approach achieves state-of-the-art results on three datasets - Breakfast, 50Salads, and Assembly101 - while also significantly improving computational and memory efficiency.

Title: Generative Planning with 3D-vision Language Pre-training for End-to-End Autonomous Driving

Authors: Tengpeng Li, Hanli Wang, Xianfei Li, Wenlong Liao, Tao He, Pai Peng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.08861
Pdf URL: https://arxiv.org/pdf/2501.08861
Copy Paste: [[2501.08861]] Generative Planning with 3D-vision Language Pre-training for End-to-End Autonomous Driving(https://arxiv.org/abs/2501.08861)
Keywords: generative
Abstract: Autonomous driving is a challenging task that requires perceiving and understanding the surrounding environment for safe trajectory planning. While existing vision-based end-to-end models have achieved promising results, these methods are still facing the challenges of vision understanding, decision reasoning and scene generalization. To solve these issues, a generative planning with 3D-vision language pre-training model named GPVL is proposed for end-to-end autonomous driving. The proposed paradigm has two significant aspects. On one hand, a 3D-vision language pre-training module is designed to bridge the gap between visual perception and linguistic understanding in the bird's eye view. On the other hand, a cross-modal language model is introduced to generate holistic driving decisions and fine-grained trajectories with perception and navigation information in an auto-regressive manner. Experiments on the challenging nuScenes dataset demonstrate that the proposed scheme achieves excellent performances compared with state-of-the-art methods. Besides, the proposed GPVL presents strong generalization ability and real-time potential when handling high-level commands in various scenarios. It is believed that the effective, robust and efficient performance of GPVL is crucial for the practical application of future autonomous driving systems. Code is available at this https URL

Title: Enhanced Multi-Scale Cross-Attention for Person Image Generation

Authors: Hao Tang, Ling Shao, Nicu Sebe, Luc Van Gool
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.08900
Pdf URL: https://arxiv.org/pdf/2501.08900
Copy Paste: [[2501.08900]] Enhanced Multi-Scale Cross-Attention for Person Image Generation(https://arxiv.org/abs/2501.08900)
Keywords: diffusion, generative
Abstract: In this paper, we propose a novel cross-attention-based generative adversarial network (GAN) for the challenging person image generation task. Cross-attention is a novel and intuitive multi-modal fusion method in which an attention/correlation matrix is calculated between two feature maps of different modalities. Specifically, we propose the novel XingGAN (or CrossingGAN), which consists of two generation branches that capture the person's appearance and shape, respectively. Moreover, we propose two novel cross-attention blocks to effectively transfer and update the person's shape and appearance embeddings for mutual improvement. This has not been considered by any other existing GAN-based image generation work. To further learn the long-range correlations between different person poses at different scales and sub-regions, we propose two novel multi-scale cross-attention blocks. To tackle the issue of independent correlation computations within the cross-attention mechanism leading to noisy and ambiguous attention weights, which hinder performance improvements, we propose a module called enhanced attention (EA). Lastly, we introduce a novel densely connected co-attention module to fuse appearance and shape features at different stages effectively. Extensive experiments on two public datasets demonstrate that the proposed method outperforms current GAN-based methods and performs on par with diffusion-based methods. However, our method is significantly faster than diffusion-based methods in both training and inference.

Title: Applying General Turn-taking Models to Conversational Human-Robot Interaction

Authors: Gabriel Skantze, Bahar Irfan
Subjects: cs.CL, cs.RO
Abstract URL: https://arxiv.org/abs/2501.08946
Pdf URL: https://arxiv.org/pdf/2501.08946
Copy Paste: [[2501.08946]] Applying General Turn-taking Models to Conversational Human-Robot Interaction(https://arxiv.org/abs/2501.08946)
Keywords: self-supervised
Abstract: Turn-taking is a fundamental aspect of conversation, but current Human-Robot Interaction (HRI) systems often rely on simplistic, silence-based models, leading to unnatural pauses and interruptions. This paper investigates, for the first time, the application of general turn-taking models, specifically TurnGPT and Voice Activity Projection (VAP), to improve conversational dynamics in HRI. These models are trained on human-human dialogue data using self-supervised learning objectives, without requiring domain-specific fine-tuning. We propose methods for using these models in tandem to predict when a robot should begin preparing responses, take turns, and handle potential interruptions. We evaluated the proposed system in a within-subject study against a traditional baseline system, using the Furhat robot with 39 adults in a conversational setting, in combination with a large language model for autonomous response generation. The results show that participants significantly prefer the proposed system, and it significantly reduces response delays and interruptions.

Title: CityLoc: 6 DoF Localization of Text Descriptions in Large-Scale Scenes with Gaussian Representation

Authors: Qi Ma, Runyi Yang, Bin Ren, Ender Konukoglu, Luc Van Gool, Danda Pani Paudel
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.08982
Pdf URL: https://arxiv.org/pdf/2501.08982
Copy Paste: [[2501.08982]] CityLoc: 6 DoF Localization of Text Descriptions in Large-Scale Scenes with Gaussian Representation(https://arxiv.org/abs/2501.08982)
Keywords: diffusion
Abstract: Localizing text descriptions in large-scale 3D scenes is inherently an ambiguous task. This nonetheless arises while describing general concepts, e.g. all traffic lights in a city. To facilitate reasoning based on such concepts, text localization in the form of distribution is required. In this paper, we generate the distribution of the camera poses conditioned upon the textual description. To facilitate such generation, we propose a diffusion-based architecture that conditionally diffuses the noisy 6DoF camera poses to their plausible locations. The conditional signals are derived from the text descriptions, using the pre-trained text encoders. The connection between text descriptions and pose distribution is established through pretrained Vision-Language-Model, i.e. CLIP. Furthermore, we demonstrate that the candidate poses for the distribution can be further refined by rendering potential poses using 3D Gaussian splatting, guiding incorrectly posed samples towards locations that better align with the textual description, through visual reasoning. We demonstrate the effectiveness of our method by comparing it with both standard retrieval methods and learning-based approaches. Our proposed method consistently outperforms these baselines across all five large-scale datasets. Our source code and dataset will be made publicly available.

Title: CityDreamer4D: Compositional Generative Model of Unbounded 4D Cities

Authors: Haozhe Xie, Zhaoxi Chen, Fangzhou Hong, Ziwei Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.08983
Pdf URL: https://arxiv.org/pdf/2501.08983
Copy Paste: [[2501.08983]] CityDreamer4D: Compositional Generative Model of Unbounded 4D Cities(https://arxiv.org/abs/2501.08983)
Keywords: generative
Abstract: 3D scene generation has garnered growing attention in recent years and has made significant progress. Generating 4D cities is more challenging than 3D scenes due to the presence of structurally complex, visually diverse objects like buildings and vehicles, and heightened human sensitivity to distortions in urban environments. To tackle these issues, we propose CityDreamer4D, a compositional generative model specifically tailored for generating unbounded 4D cities. Our main insights are 1) 4D city generation should separate dynamic objects (e.g., vehicles) from static scenes (e.g., buildings and roads), and 2) all objects in the 4D scene should be composed of different types of neural fields for buildings, vehicles, and background stuff. Specifically, we propose Traffic Scenario Generator and Unbounded Layout Generator to produce dynamic traffic scenarios and static city layouts using a highly compact BEV representation. Objects in 4D cities are generated by combining stuff-oriented and instance-oriented neural fields for background stuff, buildings, and vehicles. To suit the distinct characteristics of background stuff and instances, the neural fields employ customized generative hash grids and periodic positional embeddings as scene parameterizations. Furthermore, we offer a comprehensive suite of datasets for city generation, including OSM, GoogleEarth, and CityTopia. The OSM dataset provides a variety of real-world city layouts, while the Google Earth and CityTopia datasets deliver large-scale, high-quality city imagery complete with 3D instance annotations. Leveraging its compositional design, CityDreamer4D supports a range of downstream applications, such as instance editing, city stylization, and urban simulation, while delivering state-of-the-art performance in generating realistic 4D cities.

Title: RepVideo: Rethinking Cross-Layer Representation for Video Generation

Authors: Chenyang Si, Weichen Fan, Zhengyao Lv, Ziqi Huang, Yu Qiao, Ziwei Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.08994
Pdf URL: https://arxiv.org/pdf/2501.08994
Copy Paste: [[2501.08994]] RepVideo: Rethinking Cross-Layer Representation for Video Generation(https://arxiv.org/abs/2501.08994)
Keywords: diffusion
Abstract: Video generation has achieved remarkable progress with the introduction of diffusion models, which have significantly improved the quality of generated videos. However, recent research has primarily focused on scaling up model training, while offering limited insights into the direct impact of representations on the video generation process. In this paper, we initially investigate the characteristics of features in intermediate layers, finding substantial variations in attention maps across different layers. These variations lead to unstable semantic representations and contribute to cumulative differences between features, which ultimately reduce the similarity between adjacent frames and negatively affect temporal coherence. To address this, we propose RepVideo, an enhanced representation framework for text-to-video diffusion models. By accumulating features from neighboring layers to form enriched representations, this approach captures more stable semantic information. These enhanced representations are then used as inputs to the attention mechanism, thereby improving semantic expressiveness while ensuring feature consistency across adjacent frames. Extensive experiments demonstrate that our RepVideo not only significantly enhances the ability to generate accurate spatial appearances, such as capturing complex spatial relationships between multiple objects, but also improves temporal consistency in video generation.

Title: VECT-GAN: A variationally encoded generative model for overcoming data scarcity in pharmaceutical science

Authors: Youssef Abdalla, Marrisa Taub, Eleanor Hilton, Priya Akkaraju, Alexander Milanovic, Mine Orlu, Abdul W. Basit, Michael T Cook, Tapabrata Chakraborty, David Shorthouse
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2501.08995
Pdf URL: https://arxiv.org/pdf/2501.08995
Copy Paste: [[2501.08995]] VECT-GAN: A variationally encoded generative model for overcoming data scarcity in pharmaceutical science(https://arxiv.org/abs/2501.08995)
Keywords: generative
Abstract: Data scarcity in pharmaceutical research has led to reliance on labour-intensive trial and error approaches for development rather than data driven methods. While Machine Learning offers a solution, existing datasets are often small and noisy, limiting their utility. To address this, we developed a Variationally Encoded Conditional Tabular Generative Adversarial Network (VECT GAN), a novel generative model specifically designed for augmenting small, noisy datasets. We introduce a pipeline where data is augmented before regression model development and demonstrate that this consistently and significantly improves performance over other state of the art tabular generative models. We apply this pipeline across six pharmaceutical datasets, and highlight its real-world applicability by developing novel polymers with medically desirable mucoadhesive properties, which we made and experimentally characterised. Additionally, we pre-train the model on the ChEMBL database of drug-like molecules, leveraging knowledge distillation to enhance its generalisability, making it readily available for use on pharmaceutical datasets containing small molecules, which is an extremely common pharmaceutical task. We demonstrate the power of synthetic data for regularising small tabular datasets, highlighting its potential to become standard practice in pharmaceutical model development, and make our method, including VECT GAN pretrained on ChEMBL available as a pip package.

Title: Aegis2.0: A Diverse AI Safety Dataset and Risks Taxonomy for Alignment of LLM Guardrails

Authors: Shaona Ghosh, Prasoon Varshney, Makesh Narsimhan Sreedhar, Aishwarya Padmakumar, Traian Rebedea, Jibin Rajan Varghese, Christopher Parisien
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.09004
Pdf URL: https://arxiv.org/pdf/2501.09004
Copy Paste: [[2501.09004]] Aegis2.0: A Diverse AI Safety Dataset and Risks Taxonomy for Alignment of LLM Guardrails(https://arxiv.org/abs/2501.09004)
Keywords: generative
Abstract: As Large Language Models (LLMs) and generative AI become increasingly widespread, concerns about content safety have grown in parallel. Currently, there is a clear lack of high-quality, human-annotated datasets that address the full spectrum of LLM-related safety risks and are usable for commercial applications. To bridge this gap, we propose a comprehensive and adaptable taxonomy for categorizing safety risks, structured into 12 top-level hazard categories with an extension to 9 fine-grained subcategories. This taxonomy is designed to meet the diverse requirements of downstream users, offering more granular and flexible tools for managing various risk types. Using a hybrid data generation pipeline that combines human annotations with a multi-LLM "jury" system to assess the safety of responses, we obtain Aegis 2.0, a carefully curated collection of 34,248 samples of human-LLM interactions, annotated according to our proposed taxonomy. To validate its effectiveness, we demonstrate that several lightweight models, trained using parameter-efficient techniques on Aegis 2.0, achieve performance competitive with leading safety models fully fine-tuned on much larger, non-commercial datasets. In addition, we introduce a novel training blend that combines safety with topic following this http URL approach enhances the adaptability of guard models, enabling them to generalize to new risk categories defined during inference. We plan to open-source Aegis 2.0 data and models to the research community to aid in the safety guardrailing of LLMs.

Title: SimGen: A Diffusion-Based Framework for Simultaneous Surgical Image and Segmentation Mask Generation

Authors: Aditya Bhat, Rupak Bose, Chinedu Innocent Nwoye, Nicolas Padoy
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.09008
Pdf URL: https://arxiv.org/pdf/2501.09008
Copy Paste: [[2501.09008]] SimGen: A Diffusion-Based Framework for Simultaneous Surgical Image and Segmentation Mask Generation(https://arxiv.org/abs/2501.09008)
Keywords: diffusion, generative
Abstract: Acquiring and annotating surgical data is often resource-intensive, ethical constraining, and requiring significant expert involvement. While generative AI models like text-to-image can alleviate data scarcity, incorporating spatial annotations, such as segmentation masks, is crucial for precision-driven surgical applications, simulation, and education. This study introduces both a novel task and method, SimGen, for Simultaneous Image and Mask Generation. SimGen is a diffusion model based on the DDPM framework and Residual U-Net, designed to jointly generate high-fidelity surgical images and their corresponding segmentation masks. The model leverages cross-correlation priors to capture dependencies between continuous image and discrete mask distributions. Additionally, a Canonical Fibonacci Lattice (CFL) is employed to enhance class separability and uniformity in the RGB space of the masks. SimGen delivers high-fidelity images and accurate segmentation masks, outperforming baselines across six public datasets assessed on image and semantic inception distance metrics. Ablation study shows that the CFL improves mask quality and spatial separation. Downstream experiments suggest generated image-mask pairs are usable if regulations limit human data release for research. This work offers a cost-effective solution for generating paired surgical images and complex labels, advancing surgical AI development by reducing the need for expensive manual annotations.

Title: Ouroboros-Diffusion: Exploring Consistent Content Generation in Tuning-free Long Video Diffusion

Authors: Jingyuan Chen, Fuchen Long, Jie An, Zhaofan Qiu, Ting Yao, Jiebo Luo, Tao Mei
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.09019
Pdf URL: https://arxiv.org/pdf/2501.09019
Copy Paste: [[2501.09019]] Ouroboros-Diffusion: Exploring Consistent Content Generation in Tuning-free Long Video Diffusion(https://arxiv.org/abs/2501.09019)
Keywords: diffusion
Abstract: The first-in-first-out (FIFO) video diffusion, built on a pre-trained text-to-video model, has recently emerged as an effective approach for tuning-free long video generation. This technique maintains a queue of video frames with progressively increasing noise, continuously producing clean frames at the queue's head while Gaussian noise is enqueued at the tail. However, FIFO-Diffusion often struggles to keep long-range temporal consistency in the generated videos due to the lack of correspondence modeling across frames. In this paper, we propose Ouroboros-Diffusion, a novel video denoising framework designed to enhance structural and content (subject) consistency, enabling the generation of consistent videos of arbitrary length. Specifically, we introduce a new latent sampling technique at the queue tail to improve structural consistency, ensuring perceptually smooth transitions among frames. To enhance subject consistency, we devise a Subject-Aware Cross-Frame Attention (SACFA) mechanism, which aligns subjects across frames within short segments to achieve better visual coherence. Furthermore, we introduce self-recurrent guidance. This technique leverages information from all previous cleaner frames at the front of the queue to guide the denoising of noisier frames at the end, fostering rich and contextual global information interaction. Extensive experiments of long video generation on the VBench benchmark demonstrate the superiority of our Ouroboros-Diffusion, particularly in terms of subject consistency, motion smoothness, and temporal consistency.