2025-01-13

Title: Tuning-Free Long Video Generation via Global-Local Collaborative Diffusion

Authors: Yongjia Ma, Junlin Chen, Donglin Di, Qi Xie, Lei Fan, Wei Chen, Xiaofei Gou, Na Zhao, Xun Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.05484
Pdf URL: https://arxiv.org/pdf/2501.05484
Copy Paste: [[2501.05484]] Tuning-Free Long Video Generation via Global-Local Collaborative Diffusion(https://arxiv.org/abs/2501.05484)
Keywords: generation
Abstract: Creating high-fidelity, coherent long videos is a sought-after aspiration. While recent video diffusion models have shown promising potential, they still grapple with spatiotemporal inconsistencies and high computational resource demands. We propose GLC-Diffusion, a tuning-free method for long video generation. It models the long video denoising process by establishing denoising trajectories through Global-Local Collaborative Denoising to ensure overall content consistency and temporal coherence between frames. Additionally, we introduce a Noise Reinitialization strategy which combines local noise shuffling with frequency fusion to improve global content consistency and visual diversity. Further, we propose a Video Motion Consistency Refinement (VMCR) module that computes the gradient of pixel-wise and frequency-wise losses to enhance visual consistency and temporal smoothness. Extensive experiments, including quantitative and qualitative evaluations on videos of varying lengths (e.g., 3× and 6× longer), demonstrate that our method effectively integrates with existing video diffusion models, producing coherent, high-fidelity long videos superior to previous approaches.

Title: FedSA: A Unified Representation Learning via Semantic Anchors for Prototype-based Federated Learning

Authors: Yanbing Zhou, Xiangmou Qu, Chenlong You, Jiyang Zhou, Jingyue Tang, Xin Zheng, Chunmao Cai, Yingbo Wu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2501.05496
Pdf URL: https://arxiv.org/pdf/2501.05496
Copy Paste: [[2501.05496]] FedSA: A Unified Representation Learning via Semantic Anchors for Prototype-based Federated Learning(https://arxiv.org/abs/2501.05496)
Keywords: generation
Abstract: Prototype-based federated learning has emerged as a promising approach that shares lightweight prototypes to transfer knowledge among clients with data heterogeneity in a model-agnostic manner. However, existing methods often collect prototypes directly from local models, which inevitably introduce inconsistencies into representation learning due to the biased data distributions and differing model architectures among clients. In this paper, we identify that both statistical and model heterogeneity create a vicious cycle of representation inconsistency, classifier divergence, and skewed prototype alignment, which negatively impacts the performance of clients. To break the vicious cycle, we propose a novel framework named Federated Learning via Semantic Anchors (FedSA) to decouple the generation of prototypes from local representation learning. We introduce a novel perspective that uses simple yet effective semantic anchors serving as prototypes to guide local models in learning consistent representations. By incorporating semantic anchors, we further propose anchor-based regularization with margin-enhanced contrastive learning and anchor-based classifier calibration to correct feature extractors and calibrate classifiers across clients, achieving intra-class compactness and inter-class separability of prototypes while ensuring consistent decision boundaries. We then update the semantic anchors with these consistent and discriminative prototypes, which iteratively encourage clients to collaboratively learn a unified data representation with robust generalization. Extensive experiments under both statistical and model heterogeneity settings show that FedSA significantly outperforms existing prototype-based FL methods on various classification tasks.

Title: Generative Flow Networks: Theory and Applications to Structure Learning

Authors: Tristan Deleu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2501.05498
Pdf URL: https://arxiv.org/pdf/2501.05498
Copy Paste: [[2501.05498]] Generative Flow Networks: Theory and Applications to Structure Learning(https://arxiv.org/abs/2501.05498)
Keywords: generation, generative
Abstract: Without any assumptions about data generation, multiple causal models may explain our observations equally well. To avoid selecting a single arbitrary model that could result in unsafe decisions if it does not match reality, it is therefore essential to maintain a notion of epistemic uncertainty about our possible candidates. This thesis studies the problem of structure learning from a Bayesian perspective, approximating the posterior distribution over the structure of a causal model, represented as a directed acyclic graph (DAG), given data. It introduces Generative Flow Networks (GFlowNets), a novel class of probabilistic models designed for modeling distributions over discrete and compositional objects such as graphs. They treat generation as a sequential decision making problem, constructing samples of a target distribution defined up to a normalization constant piece by piece. In the first part of this thesis, we present the mathematical foundations of GFlowNets, their connections to existing domains of machine learning and statistics such as variational inference and reinforcement learning, and their extensions beyond discrete problems. In the second part of this thesis, we show how GFlowNets can approximate the posterior distribution over DAG structures of causal Bayesian Networks, along with the parameters of its causal mechanisms, given observational and experimental data.

Title: Shrink the longest: improving latent space isotropy with symplicial geometry

Authors: Sergei Kudriashov, Olesya Karpik, Eduard Klyshinsky
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2501.05502
Pdf URL: https://arxiv.org/pdf/2501.05502
Copy Paste: [[2501.05502]] Shrink the longest: improving latent space isotropy with symplicial geometry(https://arxiv.org/abs/2501.05502)
Keywords: generation
Abstract: Although transformer-based models have been dominating the field of deep learning, various studies of their embedding space have shown that they suffer from the "representation degeneration problem": embeddings tend to be distributed in a narrow cone, making the latent space highly anisotropic. Increasing the isotropy has been shown to improve performance in downstream tasks in both static and contextual language models. However, most approaches either add inference overhead or require a substantial amount of data for model reparametrization. We propose a novel regularization technique based on simplicial geometry to improve the isotropy of latent representations. The core idea of our method is based on maximizing the persistent entropy of barcodes obtained using Vietoris-Rips filtration from contextual embeddings in the underlying latent space. We demonstrate that the method leads to an increase in downstream performance while significantly lowering the anisotropy during fine-tuning by exploiting existing geometric structures instead of reparametrization.
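
The quantity named in the abstract above, the persistent entropy of a Vietoris-Rips barcode, can be illustrated with a minimal sketch. It assumes only the 0-dimensional (connected-component) bars are used, whose finite lengths equal the edge weights of a Euclidean minimum spanning tree. The function names are hypothetical, and this is not presented as the authors' regularizer.

```python
import numpy as np

def h0_bar_lengths(points):
    """Finite H0 bar lengths of a Vietoris-Rips filtration.

    Every H0 bar is born at 0 and dies when two components merge, so the
    finite bar lengths equal the edge weights of a Euclidean minimum
    spanning tree (computed here with Prim's algorithm).
    """
    points = np.asarray(points, dtype=float)
    n = len(points)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True
    best = dist[0].copy()            # cheapest link from the tree to each point
    lengths = []
    for _ in range(n - 1):
        best[in_tree] = np.inf       # never re-select points already in the tree
        j = int(np.argmin(best))
        lengths.append(best[j])
        in_tree[j] = True
        best = np.minimum(best, dist[j])
    return np.array(lengths)

def persistent_entropy(lengths):
    """Shannon entropy of the bar lengths normalized to sum to one."""
    lengths = np.asarray(lengths, dtype=float)
    lengths = lengths[lengths > 0]
    p = lengths / lengths.sum()
    return float(-(p * np.log(p)).sum())

# Toy 1-D "embeddings": evenly spread versus tightly clustered with an outlier.
even = [[0.0], [1.0], [2.0], [3.0]]
skewed = [[0.0], [0.1], [0.2], [5.0]]
h_even = persistent_entropy(h0_bar_lengths(even))
h_skewed = persistent_entropy(h0_bar_lengths(skewed))
print(h_even, h_skewed)
```

In this toy case the evenly spread points give equal bar lengths and therefore the maximum entropy for three bars, while the clustered set with one long bar scores lower; this is the sense in which maximizing persistent entropy pushes points toward a more uniform spread.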

Title: OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?

Authors: Yifei Li, Junbo Niu, Ziyang Miao, Chunjiang Ge, Yuanhang Zhou, Qihao He, Xiaoyi Dong, Haodong Duan, Shuangrui Ding, Rui Qian, Pan Zhang, Yuhang Zang, Yuhang Cao, Conghui He, Jiaqi Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2501.05510
Pdf URL: https://arxiv.org/pdf/2501.05510
Copy Paste: [[2501.05510]] OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?(https://arxiv.org/abs/2501.05510)
Keywords: generation
Abstract: Temporal Awareness, the ability to reason dynamically based on the timestamp when a question is raised, is the key distinction between offline and online video LLMs. Unlike offline models, which rely on complete videos for static, post hoc analysis, online models process video streams incrementally and dynamically adapt their responses based on the timestamp at which the question is posed. Despite its significance, temporal awareness has not been adequately evaluated in existing benchmarks. To fill this gap, we present OVO-Bench (Online-VideO-Benchmark), a novel video benchmark that emphasizes the importance of timestamps for advanced online video understanding capability benchmarking. OVO-Bench evaluates the ability of video LLMs to reason and respond to events occurring at specific timestamps under three distinct scenarios: (1) Backward tracing: trace back to past events to answer the question. (2) Real-time understanding: understand and respond to events as they unfold at the current timestamp. (3) Forward active responding: delay the response until sufficient future information becomes available to answer the question accurately. OVO-Bench comprises 12 tasks, featuring 644 unique videos and approximately 2,800 human-curated, fine-grained meta-annotations with precise timestamps. We combine automated generation pipelines with human curation. With these high-quality samples, we further developed an evaluation pipeline to systematically query video LLMs along the video timeline. Evaluations of nine Video-LLMs reveal that, despite advancements on traditional benchmarks, current models struggle with online video understanding, showing a significant gap compared to human agents. We hope OVO-Bench will drive progress in video LLMs and inspire future research in online video reasoning. Our benchmark and code can be accessed at this https URL.

Title: HFMF: Hierarchical Fusion Meets Multi-Stream Models for Deepfake Detection

Authors: Anant Mehta, Bryant McArthur, Nagarjuna Kolloju, Zhengzhong Tu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.05631
Pdf URL: https://arxiv.org/pdf/2501.05631
Copy Paste: [[2501.05631]] HFMF: Hierarchical Fusion Meets Multi-Stream Models for Deepfake Detection(https://arxiv.org/abs/2501.05631)
Keywords: generative
Abstract: The rapid progress in deep generative models has led to the creation of incredibly realistic synthetic images that are becoming increasingly difficult to distinguish from real-world data. The widespread use of Variational Models, Diffusion Models, and Generative Adversarial Networks has made it easier to generate convincing fake images and videos, which poses significant challenges for detecting and mitigating the spread of misinformation. As a result, developing effective methods for detecting AI-generated fakes has become a pressing concern. In our research, we propose HFMF, a comprehensive two-stage deepfake detection framework that leverages both hierarchical cross-modal feature fusion and multi-stream feature extraction to enhance detection performance against imagery produced by state-of-the-art generative AI models. The first component of our approach integrates vision Transformers and convolutional nets through a hierarchical feature fusion mechanism. The second component of our framework combines object-level information and a fine-tuned convolutional net model. We then fuse the outputs from both components via an ensemble deep neural net, enabling robust classification performances. We demonstrate that our architecture achieves superior performance across diverse dataset benchmarks while maintaining calibration and interoperability.

Title: UniQ: Unified Decoder with Task-specific Queries for Efficient Scene Graph Generation

Authors: Xinyao Liao, Wei Wei, Dangyang Chen, Yuanyuan Fu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.05687
Pdf URL: https://arxiv.org/pdf/2501.05687
Copy Paste: [[2501.05687]] UniQ: Unified Decoder with Task-specific Queries for Efficient Scene Graph Generation(https://arxiv.org/abs/2501.05687)
Keywords: generation
Abstract: Scene Graph Generation (SGG) is a scene understanding task that aims at identifying object entities and reasoning about their relationships within a given image. In contrast to prevailing two-stage methods based on a large object detector (e.g., Faster R-CNN), one-stage methods integrate a fixed-size set of learnable queries to jointly reason relational triplets <subject, predicate, object>. This paradigm demonstrates robust performance with significantly reduced parameters and computational overhead. However, the challenge in one-stage methods stems from the issue of weak entanglement, wherein entities involved in relationships require both coupled features shared within triplets and decoupled visual features. Previous methods either adopt a single decoder for coupled triplet feature modeling or multiple decoders for separate visual feature extraction but fail to consider both. In this paper, we introduce UniQ, a Unified decoder with task-specific Queries architecture, where task-specific queries generate decoupled visual features for subjects, objects, and predicates respectively, and a unified decoder enables coupled feature modeling within relational triplets. Experimental results on the Visual Genome dataset demonstrate that UniQ has superior performance to both one-stage and two-stage methods.

Title: EmotiCrafter: Text-to-Emotional-Image Generation based on Valence-Arousal Model

Authors: Yi He, Shengqi Dang, Long Ling, Ziqing Qian, Nanxuan Zhao, Nan Cao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.05710
Pdf URL: https://arxiv.org/pdf/2501.05710
Copy Paste: [[2501.05710]] EmotiCrafter: Text-to-Emotional-Image Generation based on Valence-Arousal Model(https://arxiv.org/abs/2501.05710)
Keywords: generation
Abstract: Recent research shows that emotions can enhance users' cognition and influence information communication. While research on visual emotion analysis is extensive, limited work has been done on helping users generate emotionally rich image content. Existing work on emotional image generation relies on discrete emotion categories, making it challenging to capture complex and subtle emotional nuances accurately. Additionally, these methods struggle to control the specific content of generated images based on text prompts. In this work, we introduce the new task of continuous emotional image content generation (C-EICG) and present EmotiCrafter, an emotional image generation model that generates images based on text prompts and Valence-Arousal values. Specifically, we propose a novel emotion-embedding mapping network that embeds Valence-Arousal values into textual features, enabling the capture of specific emotions in alignment with intended input prompts. Additionally, we introduce a loss function to enhance emotion expression. The experimental results show that our method effectively generates images representing specific emotions with the desired content and outperforms existing techniques.

Title: Element-wise Attention Is All You Need

Authors: Guoxin Feng
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2501.05730
Pdf URL: https://arxiv.org/pdf/2501.05730
Copy Paste: [[2501.05730]] Element-wise Attention Is All You Need(https://arxiv.org/abs/2501.05730)
Keywords: generation
Abstract: The self-attention (SA) mechanism has demonstrated superior performance across various domains, yet it suffers from substantial complexity during both training and inference. The next-generation architecture, aiming at retaining the competitive performance of SA while achieving low-cost inference and efficient long-sequence training, primarily focuses on three approaches: linear attention, linear RNNs, and state space models. Although these approaches achieve lower complexity than SA, they all have built-in performance degradation factors, such as diminished “spikiness” and compression of historical information. In contrast to these approaches, we propose a novel element-wise attention mechanism, which uses the element-wise squared Euclidean distance, instead of the dot product operation, to compute similarity and approximates the quadratic complexity term $\exp(q_{ic}k_{jc})$ with a Taylor polynomial. This design achieves remarkable efficiency: during training, the element-wise attention has a complexity of $\mathcal{O}(tLD)$, making long-sequence training both computationally and memory efficient, where $L$ is the sequence length, $D$ is the feature dimension, and $t$ is the highest order of the polynomial; during inference, it can be reformulated as recurrent neural networks, achieving an inference complexity of $\mathcal{O}(tD)$. Furthermore, the element-wise attention circumvents the performance degradation factors present in these approaches and achieves performance comparable to SA in both causal and non-causal forms.
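
The complexity claim in the abstract above rests on a factorization that a small sketch can make concrete. It covers a single channel in the non-causal form and only the $\exp(q_{ic}k_{jc})$ term: replacing the exponential with its order-$t$ Taylor polynomial lets the sum over keys collapse into $t+1$ key-side moments shared by all queries. The function names and test values are assumptions made for illustration, and the code is not the paper's full formulation.

```python
import math
import numpy as np

def exact_channel_attention(q, k, v):
    """Normalized weights exp(q_i * k_j) for one channel: O(L^2) work."""
    w = np.exp(np.outer(q, k))              # (L, L) pairwise terms
    return (w @ v) / w.sum(axis=1)

def taylor_channel_attention(q, k, v, t=4):
    """Same quantity with exp(x) replaced by its order-t Taylor polynomial.

    exp(q_i k_j) ~= sum_n q_i^n k_j^n / n!, so the sums over j collapse into
    t + 1 key-side moments that every query reuses: O(t L) work.
    """
    num = np.zeros_like(q, dtype=float)
    den = np.zeros_like(q, dtype=float)
    for n in range(t + 1):
        coeff = 1.0 / math.factorial(n)
        num += coeff * q**n * np.sum(k**n * v)   # moment of keys times values
        den += coeff * q**n * np.sum(k**n)       # moment of keys alone
    return num / den

# Small-magnitude inputs keep the truncated series accurate.
rng = np.random.default_rng(0)
L = 256
q = rng.uniform(-0.5, 0.5, L)
k = rng.uniform(-0.5, 0.5, L)
v = rng.normal(size=L)
max_err = float(np.max(np.abs(exact_channel_attention(q, k, v)
                              - taylor_channel_attention(q, k, v))))
print(max_err)
```

Applying the same loop independently to each of the $D$ channels gives the $\mathcal{O}(tLD)$ training cost quoted in the abstract. The approximation degrades as $|q_{ic}k_{jc}|$ grows, which is why the example keeps the inputs small.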

Title: LLVD: LSTM-based Explicit Motion Modeling in Latent Space for Blind Video Denoising

Authors: Loay Rashid, Siddharth Roheda, Amit Unde
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2501.05744
Pdf URL: https://arxiv.org/pdf/2501.05744
Copy Paste: [[2501.05744]] LLVD: LSTM-based Explicit Motion Modeling in Latent Space for Blind Video Denoising(https://arxiv.org/abs/2501.05744)
Keywords: restoration
Abstract: Video restoration plays a pivotal role in revitalizing degraded video content by rectifying imperfections caused by various degradations introduced during capturing (sensor noise, motion blur, etc.), saving/sharing (compression, resizing, etc.) and editing. This paper introduces a novel algorithm designed for scenarios where noise is introduced during video capture, aiming to enhance the visual quality of videos by reducing unwanted noise artifacts. We propose the Latent space LSTM Video Denoiser (LLVD), an end-to-end blind denoising model. LLVD uniquely combines spatial and temporal feature extraction, employing Long Short Term Memory (LSTM) within the encoded feature domain. This integration of LSTM layers is crucial for maintaining continuity and minimizing flicker in the restored video. Moreover, processing frames in the encoded feature domain significantly reduces computations, resulting in a very lightweight architecture. LLVD's blind nature makes it versatile for real, in-the-wild denoising scenarios where prior information about noise characteristics is not available. Experiments reveal that LLVD demonstrates excellent performance for both synthetic and captured noise. Specifically, LLVD surpasses the current State-Of-The-Art (SOTA) in RAW denoising by 0.3 dB, while also achieving a 59% reduction in computational complexity.

Title: StarGen: A Spatiotemporal Autoregression Framework with Video Diffusion Model for Scalable and Controllable Scene Generation

Authors: Shangjin Zhai, Zhichao Ye, Jialin Liu, Weijian Xie, Jiaqi Hu, Zhen Peng, Hua Xue, Danpeng Chen, Xiaomeng Wang, Lei Yang, Nan Wang, Haomin Liu, Guofeng Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.05763
Pdf URL: https://arxiv.org/pdf/2501.05763
Copy Paste: [[2501.05763]] StarGen: A Spatiotemporal Autoregression Framework with Video Diffusion Model for Scalable and Controllable Scene Generation(https://arxiv.org/abs/2501.05763)
Keywords: generation, generative
Abstract: Recent advances in large reconstruction and generative models have significantly improved scene reconstruction and novel view generation. However, due to compute limitations, each inference with these large models is confined to a small area, making long-range consistent scene generation challenging. To address this, we propose StarGen, a novel framework that employs a pre-trained video diffusion model in an autoregressive manner for long-range scene generation. The generation of each video clip is conditioned on the 3D warping of spatially adjacent images and the temporally overlapping image from previously generated clips, improving spatiotemporal consistency in long-range scene generation with precise pose control. The spatiotemporal condition is compatible with various input conditions, facilitating diverse tasks, including sparse view interpolation, perpetual view generation, and layout-conditioned city generation. Quantitative and qualitative evaluations demonstrate StarGen's superior scalability, fidelity, and pose accuracy compared to state-of-the-art methods.

Title: StructSR: Refuse Spurious Details in Real-World Image Super-Resolution

Authors: Yachao Li, Dong Liang, Tianyu Ding, Sheng-Jun Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.05777
Pdf URL: https://arxiv.org/pdf/2501.05777
Copy Paste: [[2501.05777]] StructSR: Refuse Spurious Details in Real-World Image Super-Resolution(https://arxiv.org/abs/2501.05777)
Keywords: super-resolution, generation
Abstract: Diffusion-based models have shown great promise in real-world image super-resolution (Real-ISR), but often generate content with structural errors and spurious texture details due to the empirical priors and illusions of these models. To address this issue, we introduce StructSR, a simple, effective, and plug-and-play method that enhances structural fidelity and suppresses spurious details for diffusion-based Real-ISR. StructSR operates without the need for additional fine-tuning, external model priors, or high-level semantic knowledge. At its core is the Structure-Aware Screening (SAS) mechanism, which identifies the image with the highest structural similarity to the low-resolution (LR) input in the early inference stage, allowing us to leverage it as a historical structure knowledge to suppress the generation of spurious details. By intervening in the diffusion inference process, StructSR seamlessly integrates with existing diffusion-based Real-ISR models. Our experimental results demonstrate that StructSR significantly improves the fidelity of structure and texture, improving the PSNR and SSIM metrics by an average of 5.27% and 9.36% on a synthetic dataset (DIV2K-Val) and 4.13% and 8.64% on two real-world datasets (RealSR and DRealSR) when integrated with four state-of-the-art diffusion-based Real-ISR methods.
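
The screening step described in the abstract above, picking the candidate image that is structurally closest to the LR input, can be sketched as follows. The sketch makes two loud assumptions: the candidates are already at the LR resolution, and a single-window (global) SSIM stands in for the usual sliding-window SSIM. The function names are hypothetical, and this is not presented as the authors' SAS implementation.

```python
import numpy as np

def global_ssim(x, y, data_range=1.0):
    """Single-window SSIM over whole images (no sliding Gaussian window)."""
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx**2 + my**2 + c1) * (vx + vy + c2))

def screen_most_structural(candidates, lr_image):
    """Return the index of the candidate with the highest SSIM to lr_image."""
    scores = [global_ssim(c, lr_image) for c in candidates]
    return int(np.argmax(scores)), scores

# Toy data: three perturbed copies of a random "LR" image.
rng = np.random.default_rng(0)
lr = rng.random((16, 16))
candidates = [lr + s * rng.normal(size=lr.shape) for s in (0.3, 0.02, 0.6)]
best_index, scores = screen_most_structural(candidates, lr)
print(best_index, scores)
```

Here the least perturbed copy (index 1) scores highest. In a real pipeline the candidates would be intermediate predictions from the early inference steps, resized to match the LR input before scoring.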

Title: Alignment without Over-optimization: Training-Free Solution for Diffusion Models

Authors: Sunwoo Kim, Minkyu Kim, Dongmin Park
Subjects: cs.LG, cs.AI, cs.CV, math.ST
Abstract URL: https://arxiv.org/abs/2501.05803
Pdf URL: https://arxiv.org/pdf/2501.05803
Copy Paste: [[2501.05803]] Alignment without Over-optimization: Training-Free Solution for Diffusion Models(https://arxiv.org/abs/2501.05803)
Keywords: generative
Abstract: Diffusion models excel in generative tasks, but aligning them with specific objectives while maintaining their versatility remains challenging. Existing fine-tuning methods often suffer from reward over-optimization, while approximate guidance approaches fail to optimize target rewards effectively. Addressing these limitations, we propose a training-free sampling method based on Sequential Monte Carlo (SMC) to sample from the reward-aligned target distribution. Our approach, tailored for diffusion sampling and incorporating tempering techniques, achieves comparable or superior target rewards to fine-tuning methods while preserving diversity and cross-reward generalization. We demonstrate its effectiveness in single-reward optimization, multi-objective scenarios, and online black-box optimization. This work offers a robust solution for aligning diffusion models with diverse downstream objectives without compromising their general capabilities. Code is available at this https URL.
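
The general idea of tempered SMC sampling from a reward-tilted distribution can be shown on a one-dimensional toy problem. In this sketch a standard normal prior stands in for the pretrained model, the target is proportional to prior times exp(reward), and each tempering step reweights, resamples, and applies one Metropolis move. The prior, reward, schedule, and all names are assumptions chosen for illustration; the sketch does not reproduce the paper's diffusion-specific sampler.

```python
import numpy as np

def tempered_smc(reward, n_particles=4000, n_steps=10, step=0.5, seed=0):
    """Sample from p(x) * exp(reward(x)) with p = N(0, 1) via tempered SMC."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n_particles)              # particles from the prior
    betas = np.linspace(0.0, 1.0, n_steps + 1)    # tempering schedule

    def log_target(z, beta):
        return -0.5 * z**2 + beta * reward(z)

    for prev, beta in zip(betas[:-1], betas[1:]):
        # Reweight: incremental importance weight between tempered targets.
        log_w = (beta - prev) * reward(x)
        w = np.exp(log_w - log_w.max())
        w /= w.sum()
        # Resample particles in proportion to their weights.
        x = x[rng.choice(n_particles, size=n_particles, p=w)]
        # Move: one random-walk Metropolis step that leaves the current
        # tempered target invariant and restores particle diversity.
        proposal = x + step * rng.normal(size=n_particles)
        log_accept = log_target(proposal, beta) - log_target(x, beta)
        accept = np.log(rng.uniform(size=n_particles)) < log_accept
        x = np.where(accept, proposal, x)
    return x

# Quadratic reward centred at 2: the tilted target is Gaussian N(1, 0.5).
samples = tempered_smc(lambda z: -0.5 * (z - 2.0) ** 2)
print(samples.mean(), samples.var())
```

With this reward the tilted target is Gaussian with mean 1 and variance 0.5, so the sample mean and variance give a quick sanity check. Raising the temperature gradually, rather than applying the full reward weight at once, keeps the importance weights from collapsing onto a few particles.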

Title: Diffusion Models for Smarter UAVs: Decision-Making and Modeling

Authors: Yousef Emami, Hao Zhou, Luis Almeida, Kai Li
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2501.05819
Pdf URL: https://arxiv.org/pdf/2501.05819
Copy Paste: [[2501.05819]] Diffusion Models for Smarter UAVs: Decision-Making and Modeling(https://arxiv.org/abs/2501.05819)
Keywords: generation, generative
Abstract: Unmanned Aerial Vehicles (UAVs) are increasingly adopted in modern communication networks. However, challenges in decision-making and digital modeling continue to impede their rapid advancement. Reinforcement Learning (RL) algorithms face limitations such as low sample efficiency and limited data versatility, further magnified in UAV communication scenarios. Moreover, Digital Twin (DT) modeling introduces substantial decision-making and data management complexities. RL models, often integrated into DT frameworks, require extensive training data to achieve accurate predictions. In contrast to traditional approaches that focus on class boundaries, Diffusion Models (DMs), a new class of generative AI, learn the underlying probability distribution from the training data and can generate trustworthy new patterns based on this learned distribution. This paper explores the integration of DMs with RL and DT to effectively address these challenges. By combining the data generation capabilities of DMs with the decision-making framework of RL and the modeling accuracy of DT, the integration improves the adaptability and real-time performance of UAV communication. Moreover, the study shows how DMs can alleviate data scarcity, improve policy networks, and optimize dynamic modeling, providing a robust solution for complex UAV communication scenarios.

Title: PersonaHOI: Effortlessly Improving Personalized Face with Human-Object Interaction Generation

Authors: Xinting Hu, Haoran Wang, Jan Eric Lenssen, Bernt Schiele
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.05823
Pdf URL: https://arxiv.org/pdf/2501.05823
Copy Paste: [[2501.05823]] PersonaHOI: Effortlessly Improving Personalized Face with Human-Object Interaction Generation(https://arxiv.org/abs/2501.05823)
Keywords: generation
Abstract: We introduce PersonaHOI, a training- and tuning-free framework that fuses a general StableDiffusion model with a personalized face diffusion (PFD) model to generate identity-consistent human-object interaction (HOI) images. While existing PFD models have advanced significantly, they often overemphasize facial features at the expense of full-body coherence. To address this, PersonaHOI introduces an additional StableDiffusion (SD) branch guided by HOI-oriented text inputs. By incorporating cross-attention constraints in the PFD branch and spatial merging at both latent and residual levels, PersonaHOI preserves personalized facial details while ensuring interactive non-facial regions. Experiments, validated by a novel interaction alignment metric, demonstrate the superior realism and scalability of PersonaHOI, establishing a new standard for practical personalized face with HOI generation. Our code will be available at this https URL.
摘要：我们引入了 PersonaHOI，这是一个无需训练和调整的框架，它将通用的 StableDiffusion 模型与个性化面部扩散 (PFD) 模型融合在一起，以生成身份一致的人-物交互 (HOI) 图像。虽然现有的 PFD 模型已经取得了显着进步，但它们往往过分强调面部特征而牺牲了全身连贯性，但 PersonaHOI 引入了一个由 HOI 导向的文本输入引导的额外 StableDiffusion (SD) 分支。通过在 PFD 分支中结合交叉注意约束并在潜在和残差级别进行空间合并，PersonaHOI 保留了个性化面部细节，同时确保了可交互的非面部区域。通过新颖的交互对齐指标验证的实验证明了 PersonaHOI 的卓越真实性和可扩展性，为使用 HOI 生成的实用个性化面部建立了新标准。我们的代码将在此 https URL 上提供

Title: Poetry in Pixels: Prompt Tuning for Poem Image Generation via Diffusion Models

Authors: Sofia Jamil, Bollampalli Areen Reddy, Raghvendra Kumar, Sriparna Saha, K J Joseph, Koustava Goswami
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.05839
Pdf URL: https://arxiv.org/pdf/2501.05839
Copy Paste: [[2501.05839]] Poetry in Pixels: Prompt Tuning for Poem Image Generation via Diffusion Models(https://arxiv.org/abs/2501.05839)
Keywords: generation
Abstract: The task of text-to-image generation has encountered significant challenges when applied to literary works, especially poetry. Poems are a distinct form of literature, with meanings that frequently transcend the literal words. To address this shortcoming, we propose a PoemToPixel framework designed to generate images that visually represent the inherent meanings of poems. Our approach incorporates the concept of prompt tuning in our image generation framework to ensure that the resulting images closely align with the poetic content. In addition, we propose the PoeKey algorithm, which extracts three key elements in the form of emotions, visual elements, and themes from poems to form instructions which are subsequently provided to a diffusion model for generating corresponding images. Furthermore, to expand the diversity of the poetry dataset across different genres and ages, we introduce MiniPo, a novel multimodal dataset comprising 1001 children's poems and images. Leveraging this dataset alongside PoemSum, we conducted both quantitative and qualitative evaluations of image generation using our PoemToPixel framework. This paper demonstrates the effectiveness of our approach and offers a fresh perspective on generating images from literary sources.
摘要：文本到图像生成任务在应用于文学作品（尤其是诗歌）时遇到了重大挑战。诗歌是一种独特的文学形式，其含义往往超越字面意义。为了解决这一缺点，我们提出了一个 PoemToPixel 框架，旨在生成以视觉方式呈现诗歌内在含义的图像。我们的方法将提示调优（prompt tuning）的概念融入到图像生成框架中，以确保生成的图像与诗歌内容紧密结合。此外，我们提出了 PoeKey 算法，该算法从诗歌中提取情感、视觉元素和主题形式的三个关键元素，以形成指令，随后将其提供给扩散模型以生成相应的图像。此外，为了扩大不同流派和年龄的诗歌数据集的多样性，我们引入了 MiniPo，这是一个包含 1001 首儿童诗歌和图像的新型多模态数据集。利用这个数据集和 PoemSum，我们对使用 PoemToPixel 框架的图像生成进行了定量和定性评估。本文证明了我们方法的有效性并为从文学来源生成图像提供了新的视角。

Title: VideoRAG: Retrieval-Augmented Generation over Video Corpus

Authors: Soyeong Jeong, Kangsan Kim, Jinheon Baek, Sung Ju Hwang
Subjects: cs.CV, cs.AI, cs.CL, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2501.05874
Pdf URL: https://arxiv.org/pdf/2501.05874
Copy Paste: [[2501.05874]] VideoRAG: Retrieval-Augmented Generation over Video Corpus(https://arxiv.org/abs/2501.05874)
Keywords: generation
Abstract: Retrieval-Augmented Generation (RAG) is a powerful strategy to address the issue of generating factually incorrect outputs in foundation models by retrieving external knowledge relevant to queries and incorporating it into their generation process. However, existing RAG approaches have primarily focused on textual information, with some recent advancements beginning to consider images, and they largely overlook videos, a rich source of multimodal knowledge capable of representing events, processes, and contextual details more effectively than any other modality. While a few recent studies explore the integration of videos in the response generation process, they either predefine query-associated videos without retrieving them according to queries, or convert videos into textual descriptions without harnessing their multimodal richness. To tackle these, we introduce VideoRAG, a novel framework that not only dynamically retrieves relevant videos based on their relevance to queries but also utilizes both visual and textual information of videos in the output generation. Further, to operationalize this, our method revolves around the recent advance of Large Video Language Models (LVLMs), which enable the direct processing of video content to represent it for retrieval and seamless integration of the retrieved videos jointly with queries. We experimentally validate the effectiveness of VideoRAG, showcasing that it is superior to relevant baselines.
摘要：检索增强生成 (RAG) 是一种强大的策略，它通过检索与查询相关的外部知识并将其纳入生成过程来解决基础模型中生成事实错误输出的问题。然而，现有的 RAG 方法主要侧重于文本信息，最近的一些进展开始考虑图像，而它们在很大程度上忽略了视频，视频是一种丰富的多模态知识来源，能够比任何其他模态更有效地表示事件、过程和上下文细节。虽然最近有一些研究探索将视频整合到响应生成过程中，但它们要么预定义与查询相关的视频而不根据查询检索它们，要么将视频转换为文本描述而不利用其多模态丰富性。为了解决这些问题，我们引入了 VideoRAG，这是一个新颖的框架，它不仅可以根据与查询的相关性动态检索相关视频，而且还在输出生成中利用视频的视觉和文本信息。此外，为了实现这一目标，我们的方法围绕大型视频语言模型 (LVLM) 的最新进展展开，该模型可以直接处理视频内容以将其表示出来以供检索，并将检索到的视频与查询无缝集成。我们通过实验验证了 VideoRAG 的有效性，表明它优于相关基线。
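
The retrieval step described above can be sketched as a similarity search over precomputed embeddings. This is a minimal illustration only, not the authors' implementation: in the paper the query and video representations come from an LVLM, whereas here they are hand-written toy vectors, and the names `cosine`, `retrieve_top_k`, and `corpus` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(query_vec, video_vecs, k=2):
    """Return the ids of the k videos whose embeddings are most similar to the query."""
    scored = [(cosine(query_vec, vec), vid) for vid, vec in video_vecs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [vid for _, vid in scored[:k]]


# Toy embeddings (assumption): a real system would obtain these from a video-language model.
corpus = {
    "video_a": [1.0, 0.0, 0.0],
    "video_b": [0.9, 0.1, 0.0],
    "video_c": [0.0, 1.0, 0.0],
}
top = retrieve_top_k([1.0, 0.0, 0.0], corpus, k=2)
print(top)
```

The retrieved videos would then be passed to the generator together with the query; that stage depends on the specific LVLM and is left out here.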

Title: Beyond Flat Text: Dual Self-inherited Guidance for Visual Text Generation

Authors: Minxing Luo, Zixun Xia, Liaojun Chen, Zhenhang Li, Weichao Zeng, Jianye Wang, Wentao Cheng, Yaxing Wang, Yu Zhou, Jian Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.05892
Pdf URL: https://arxiv.org/pdf/2501.05892
Copy Paste: [[2501.05892]] Beyond Flat Text: Dual Self-inherited Guidance for Visual Text Generation(https://arxiv.org/abs/2501.05892)
Keywords: generation, generative
Abstract: In real-world images, slanted or curved texts, especially those on cans, banners, or badges, appear as frequently as flat texts, if not more so, due to artistic design or layout constraints. While high-quality visual text generation has become available with the advanced generative capabilities of diffusion models, these models often produce distorted text and inharmonious text backgrounds when given slanted or curved text layouts, due to training data limitations. In this paper, we introduce a new training-free framework, STGen, which accurately generates visual texts in challenging scenarios (e.g., slanted or curved text layouts) while harmonizing them with the text background. Our framework decomposes the visual text generation process into two branches: (i) Semantic Rectification Branch, which leverages the model's ability to generate flat but accurate visual texts to guide generation in challenging scenarios. The generated latent of flat text is abundant in accurate semantic information related both to the text itself and its background. By incorporating this, we rectify the semantic information of the texts and harmonize the integration of the text with its background in complex layouts. (ii) Structure Injection Branch, which reinforces the visual text structure during inference. We incorporate the latent information of the glyph image, rich in glyph structure, as a new condition to further strengthen the text structure. To enhance image harmony, we also apply an effective combination method to merge the priors, providing a solid foundation for generation. Extensive experiments across a variety of visual text layouts demonstrate that our framework achieves superior accuracy and outstanding quality.
摘要：在现实世界的图像中，由于艺术设计或布局限制，倾斜或弯曲的文本（尤其是罐头、横幅或徽章上的文本）出现的频率与平面文本一样高，甚至更高。虽然借助扩散模型的高级生成功能，可以生成高质量的视觉文本，但由于训练数据的限制，这些模型在给定倾斜或弯曲的文本布局时通常会生成扭曲的文本和不和谐的文本背景。在本文中，我们介绍了一种无需训练的新框架 STGen，它可以在具有挑战性的场景（例如倾斜或弯曲的文本布局）中准确生成视觉文本，同时使其与文本背景协调一致。我们的框架将视觉文本生成过程分解为两个分支：(i) 语义校正分支，它利用模型生成平面但准确的视觉文本的能力来指导具有挑战性的场景的生成。平面文本的生成潜能富含与文本本身及其背景相关的准确语义信息。通过结合这一点，我们纠正了文本的语义信息，并在复杂布局中协调了文本与背景的融合。(ii) 结构注入分支，在推理过程中强化了视觉文本结构。我们结合了字形图像的潜在信息，这些信息富含字形结构，作为进一步强化文本结构的新条件。为了增强图像的和谐性，我们还采用了一种有效的组合方法来合并先验，为生成提供了坚实的基础。在各种视觉文本布局中进行的大量实验表明，我们的框架实现了卓越的准确性和出色的质量。

Title: DiffuSETS: 12-lead ECG Generation Conditioned on Clinical Text Reports and Patient-Specific Information

Authors: Yongfan Lai, Jiabo Chen, Deyun Zhang, Yue Wang, Shijia Geng, Hongyan Li, Shenda Hong
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2501.05932
Pdf URL: https://arxiv.org/pdf/2501.05932
Copy Paste: [[2501.05932]] DiffuSETS: 12-lead ECG Generation Conditioned on Clinical Text Reports and Patient-Specific Information(https://arxiv.org/abs/2501.05932)
Keywords: generation, generative
Abstract: Heart disease remains a significant threat to human health. As a non-invasive diagnostic tool, the electrocardiogram (ECG) is one of the most widely used methods for cardiac screening. However, the scarcity of high-quality ECG data, driven by privacy concerns and limited medical resources, creates a pressing need for effective ECG signal generation. Existing approaches for generating ECG signals typically rely on small training datasets, lack comprehensive evaluation frameworks, and overlook potential applications beyond data augmentation. To address these challenges, we propose DiffuSETS, a novel framework capable of generating ECG signals with high semantic alignment and fidelity. DiffuSETS accepts various modalities of clinical text reports and patient-specific information as inputs, enabling the creation of clinically meaningful ECG signals. Additionally, to address the lack of standardized evaluation in ECG generation, we introduce a comprehensive benchmarking methodology to assess the effectiveness of generative models in this domain. Our model achieves excellent results in tests, proving its superiority in the task of ECG generation. Furthermore, we showcase its potential to mitigate data scarcity while exploring novel applications in cardiology education and medical knowledge discovery, highlighting the broader impact of our work.
摘要：心脏病仍然是对人类健康的重大威胁。作为一种非侵入性诊断工具，心电图 (ECG) 是心脏筛查最广泛使用的方法之一。然而，由于隐私问题和有限的医疗资源，高质量 ECG 数据的稀缺性迫切需要有效的 ECG 信号生成。现有的生成 ECG 信号的方法通常依赖于小型训练数据集，缺乏全面的评估框架，并且忽略了数据增强之外的潜在应用。为了应对这些挑战，我们提出了 DiffuSETS，这是一个能够生成具有高语义对齐和保真度的 ECG 信号的新框架。DiffuSETS 接受各种形式的临床文本报告和患者特定信息作为输入，从而能够创建具有临床意义的 ECG 信号。此外，为了解决 ECG 生成缺乏标准化评估的问题，我们引入了一种全面的基准测试方法来评估该领域生成模型的有效性。我们的模型在测试中取得了优异的成绩，证明了其在 ECG 生成任务中的优势。此外，我们展示了其缓解数据稀缺的潜力，同时探索了心脏病学教育和医学知识发现的新应用，突出了我们工作的更广泛影响。

Title: Model Inversion in Split Learning for Personalized LLMs: New Insights from Information Bottleneck Theory

Authors: Yunmeng Shu, Shaofeng Li, Tian Dong, Yan Meng, Haojin Zhu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2501.05965
Pdf URL: https://arxiv.org/pdf/2501.05965
Copy Paste: [[2501.05965]] Model Inversion in Split Learning for Personalized LLMs: New Insights from Information Bottleneck Theory(https://arxiv.org/abs/2501.05965)
Keywords: generative
Abstract: Personalized Large Language Models (LLMs) have become increasingly prevalent, showcasing the impressive capabilities of models like GPT-4. This trend has also catalyzed extensive research on deploying LLMs on mobile devices. Feasible approaches for such edge-cloud deployment include using split learning. However, previous research has largely overlooked the privacy leakage associated with intermediate representations transmitted from devices to servers. This work is the first to identify model inversion attacks in the split learning framework for LLMs, emphasizing the necessity of secure defense. For the first time, we introduce mutual information entropy to understand the information propagation of Transformer-based LLMs and assess privacy attack performance for LLM blocks. To address the issue of representations being sparser and containing less information than embeddings, we propose a two-stage attack system in which the first part projects representations into the embedding space, and the second part uses a generative model to recover text from these embeddings. This design breaks down the complexity and achieves attack scores of 38%-75% in various scenarios, with an over 60% improvement over the SOTA. This work comprehensively highlights the potential privacy risks during the deployment of personalized LLMs on the edge side.
摘要：个性化大型语言模型 (LLM) 变得越来越流行，展示了 GPT-4 等模型的强大功能。这一趋势也催化了在移动设备上部署 LLM 的广泛研究。这种边缘云部署的可行方法包括使用拆分学习。然而，以前的研究在很大程度上忽视了从设备传输到服务器的中间表示所带来的隐私泄露。这项工作首次在 LLM 的拆分学习框架中识别了模型反转攻击，强调了安全防御的必要性。我们首次引入了互信息熵来了解基于 Transformer 的 LLM 的信息传播并评估 LLM 块的隐私攻击性能。为了解决表示比嵌入更稀疏且包含的信息更少的问题，我们提出了一个两阶段攻击系统，其中第一部分将表示投影到嵌入空间中，第二部分使用生成模型从这些嵌入中恢复文本。该设计分解了复杂度，在各种场景下取得了38%-75%的攻击得分，比SOTA有60%以上的提升。该工作全面凸显了在边缘侧部署个性化LLM时潜在的隐私风险。

Title: Learning to generate feasible graphs using graph grammars

Authors: Stefan Mautner, Rolf Backofen, Fabrizio Costa
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2501.06003
Pdf URL: https://arxiv.org/pdf/2501.06003
Copy Paste: [[2501.06003]] Learning to generate feasible graphs using graph grammars(https://arxiv.org/abs/2501.06003)
Keywords: generative
Abstract: Generative methods for graphs need to be sufficiently flexible to model complex dependencies between sets of nodes. At the same time, the generated graphs need to satisfy domain-dependent feasibility conditions, that is, they should not violate certain constraints that would make their interpretation impossible within the given application domain (e.g. a molecular graph where an atom has a very large number of chemical bonds). Crucially, constraints can involve not only local but also long-range dependencies: for example, the maximal length of a cycle can be bounded. Currently, a large class of generative approaches for graphs, such as methods based on artificial neural networks, is based on message passing schemes. These approaches suffer from information 'dilution' issues that severely limit the maximal range of the dependencies that can be modeled. To address this problem, we propose a generative approach based on the notion of graph grammars. The key novel idea is to introduce a domain-dependent coarsening procedure to provide short-cuts for long-range dependencies. We show the effectiveness of our proposal in two domains: 1) small drugs and 2) RNA secondary structures. In the first case, we compare the quality of the generated molecular graphs via the Molecular Sets (MOSES) benchmark suite, which evaluates the distance between generated and real molecules, their lipophilicity, synthesizability, and drug-likeness. In the second case, we show that the approach can generate very large graphs (with hundreds of nodes) that are accepted as valid examples for a desired RNA family by the "Infernal" covariance model, a state-of-the-art RNA classifier. Our implementation is available on GitHub: this http URL
摘要：图的生成方法需要足够灵活，以便对节点集之间的复杂依赖关系进行建模。同时，生成的图需要满足领域相关的可行性条件，即它们不应违反某些约束，这些约束会使它们在给定的应用领域内无法解释（例如，分子图中的原子具有大量化学键）。至关重要的是，约束不仅可以涉及局部依赖关系，还可以涉及长距离依赖关系：例如，可以限制循环的最大长度。目前，大量图的生成方法（例如基于人工神经网络的方法）基于消息传递方案。这些方法存在信息“稀释”问题，严重限制了可以建模的依赖关系的最大范围。为了解决这个问题，我们提出了一种基于图语法概念的生成方法。关键的新想法是引入领域相关的粗化程序，为长距离依赖关系提供捷径。我们在两个领域展示了我们提案的有效性：1) 小分子药物和 2) RNA 二级结构。在第一种情况下，我们通过分子集 (MOSES) 基准套件比较生成的分子图的质量，该套件评估生成的分子与真实分子之间的距离、亲脂性、可合成性和药物相似性。在第二种情况下，我们表明该方法可以生成非常大的图（具有数百个节点），这些图被“Infernal”协方差模型（一种最先进的 RNA 分类器）接受为所需 RNA 家族的有效示例。我们的实现可在 github 上找到：此 http URL
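
The notion of a domain-dependent feasibility condition can be made concrete with a small check on a molecular graph. The sketch below only illustrates the local constraint mentioned in the abstract (an atom with too many bonds); it is not the graph-grammar method itself. The valence table is a simplified assumption for neutral atoms, and `MAX_VALENCE` and `is_feasible` are hypothetical names.

```python
# Simplified maximum valences for neutral atoms (assumption: charges and
# hypervalent species are ignored).
MAX_VALENCE = {"C": 4, "N": 3, "O": 2, "H": 1}


def is_feasible(atoms, bonds):
    """Check that no atom exceeds its maximum valence.

    atoms: dict mapping node id -> element symbol
    bonds: iterable of (node_u, node_v, bond_order) tuples
    """
    used = {node: 0 for node in atoms}
    for u, v, order in bonds:
        used[u] += order
        used[v] += order
    return all(used[node] <= MAX_VALENCE[atoms[node]] for node in atoms)


# Water: oxygen with two single bonds, which is feasible.
water = ({0: "O", 1: "H", 2: "H"}, [(0, 1, 1), (0, 2, 1)])
# Oxygen with three single bonds violates the valence constraint.
bad = ({0: "O", 1: "H", 2: "H", 3: "H"}, [(0, 1, 1), (0, 2, 1), (0, 3, 1)])

print(is_feasible(*water), is_feasible(*bad))
```

A long-range constraint such as a bound on cycle length would need a traversal of the whole graph rather than a per-node count, which is the kind of dependency the abstract says message passing handles poorly.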

Title: A Holistically Point-guided Text Framework for Weakly-Supervised Camouflaged Object Detection

Authors: Tsui Qin Mok, Shuyong Gao, Haozhe Xing, Miaoyang He, Yan Wang, Wenqiang Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.06038
Pdf URL: https://arxiv.org/pdf/2501.06038
Copy Paste: [[2501.06038]] A Holistically Point-guided Text Framework for Weakly-Supervised Camouflaged Object Detection(https://arxiv.org/abs/2501.06038)
Keywords: generation
Abstract: Weakly-Supervised Camouflaged Object Detection (WSCOD) has gained popularity for its promise to train models with weak labels to segment objects that visually blend into their surroundings. Recently, some methods using sparsely-annotated supervision have shown promising results through scribbling in WSCOD, while point-text supervision remains underexplored. Hence, this paper introduces a novel holistically point-guided text framework for WSCOD, decomposed into three phases: segment, choose, train. Specifically, we propose Point-guided Candidate Generation (PCG), where the point's foreground serves as a correction for the text path to explicitly correct and rejuvenate the loss detection object during the mask generation process (SEGMENT). We also introduce a Qualified Candidate Discriminator (QCD) to choose the optimal mask from a given text prompt using CLIP (CHOOSE), and employ the chosen pseudo mask for training with a self-supervised Vision Transformer (TRAIN). Additionally, we developed a new point-supervised dataset (P2C-COD) and a text-supervised dataset (T-COD). Comprehensive experiments on four benchmark datasets demonstrate that our method outperforms state-of-the-art methods by a large margin, and also outperforms some existing fully-supervised camouflaged object detection methods.
摘要：弱监督伪装物体检测 (WSCOD) 因其有望训练具有弱标签的模型来分割视觉上融入周围环境的物体而广受欢迎。最近，一些使用稀疏注释监督的方法通过在 WSCOD 中涂鸦显示出有希望的结果，而点文本监督仍未得到充分探索。因此，本文介绍了一种新颖的整体点引导文本框架，用于 WSCOD，分为三个阶段：分割、选择、训练。具体来说，我们提出了点引导候选生成 (PCG)，其中点的前景用作文本路径的校正，以在掩码生成过程 (SEGMENT) 期间明确纠正和恢复丢失检测对象。我们还引入了一个合格候选鉴别器 (QCD)，使用 CLIP (CHOOSE) 从给定的文本提示中选择最佳掩码，并使用所选伪掩码与自监督视觉变换器 (TRAIN) 一起进行训练。此外，我们还开发了新的点监督数据集（P2C-COD）和文本监督数据集（T-COD）。在四个基准数据集上的综合实验表明，我们的方法远远优于最先进的方法，并且也优于一些现有的全监督伪装物体检测方法。

Title: From discrete-time policies to continuous-time diffusion samplers: Asymptotic equivalences and faster training

Authors: Julius Berner, Lorenz Richter, Marcin Sendera, Jarrid Rector-Brooks, Nikolay Malkin
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2501.06148
Pdf URL: https://arxiv.org/pdf/2501.06148
Copy Paste: [[2501.06148]] From discrete-time policies to continuous-time diffusion samplers: Asymptotic equivalences and faster training(https://arxiv.org/abs/2501.06148)
Keywords: generative
Abstract: We study the problem of training neural stochastic differential equations, or diffusion models, to sample from a Boltzmann distribution without access to target samples. Existing methods for training such models enforce time-reversal of the generative and noising processes, using either differentiable simulation or off-policy reinforcement learning (RL). We prove equivalences between families of objectives in the limit of infinitesimal discretization steps, linking entropic RL methods (GFlowNets) with continuous-time objects (partial differential equations and path space measures). We further show that an appropriate choice of coarse time discretization during training allows greatly improved sample efficiency and the use of time-local objectives, achieving competitive performance on standard sampling benchmarks with reduced computational cost.
摘要：我们研究训练神经随机微分方程或扩散模型的问题，以便在无法获取目标样本的情况下从玻尔兹曼分布中采样。现有的训练此类模型的方法使用可微分模拟或离线策略强化学习 (RL) 来强制生成和噪声过程的时间反转。我们证明了无穷小离散化步骤极限下目标族之间的等价性，将熵 RL 方法 (GFlowNets) 与连续时间对象 (偏微分方程和路径空间度量) 联系起来。我们进一步表明，在训练过程中适当选择粗时间离散化可以大大提高样本效率和使用时间局部目标，从而以更低的计算成本在标准采样基准上实现具有竞争力的性能。

Title: GenMol: A Drug Discovery Generalist with Discrete Diffusion

Authors: Seul Lee, Karsten Kreis, Srimukh Prasad Veccham, Meng Liu, Danny Reidenbach, Yuxing Peng, Saee Paliwal, Weili Nie, Arash Vahdat
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2501.06158
Pdf URL: https://arxiv.org/pdf/2501.06158
Copy Paste: [[2501.06158]] GenMol: A Drug Discovery Generalist with Discrete Diffusion(https://arxiv.org/abs/2501.06158)
Keywords: generation, generative
Abstract: Drug discovery is a complex process that involves multiple scenarios and stages, such as fragment-constrained molecule generation, hit generation and lead optimization. However, existing molecular generative models can only tackle one or two of these scenarios and lack the flexibility to address various aspects of the drug discovery pipeline. In this paper, we present the Generalist Molecular generative model (GenMol), a versatile framework that addresses these limitations by applying discrete diffusion to the Sequential Attachment-based Fragment Embedding (SAFE) molecular representation. GenMol generates SAFE sequences through non-autoregressive bidirectional parallel decoding, which allows it to use a molecular context that does not rely on the specific token ordering and improves computational efficiency. Moreover, under the discrete diffusion framework, we introduce fragment remasking, a strategy that optimizes molecules by replacing fragments with masked tokens and regenerating them, enabling effective exploration of chemical space. GenMol significantly outperforms the previous GPT-based model trained on SAFE representations in de novo generation and fragment-constrained generation, and achieves state-of-the-art performance in goal-directed hit generation and lead optimization. These experimental results demonstrate that GenMol can tackle a wide range of drug discovery tasks, providing a unified and versatile approach for molecular design.
摘要：药物发现是一个复杂的过程，涉及多个场景和阶段，例如片段约束分子生成、命中生成和先导化合物优化。然而，现有的分子生成模型只能解决其中的一两种情况，缺乏解决药物发现流程各个方面的灵活性。在本文中，我们提出了通用分子生成模型 (GenMol)，这是一个多功能框架，通过将离散扩散应用于基于顺序连接的片段嵌入 (SAFE) 分子表示来解决这些限制。GenMol 通过非自回归双向并行解码生成 SAFE 序列，从而允许利用不依赖于特定标记排序的分子上下文并提高计算效率。此外，在离散扩散框架下，我们引入了片段重新掩蔽，这是一种通过用掩蔽标记替换片段并重新生成它们来优化分子的策略，从而能够有效地探索化学空间。 GenMol 在从头生成和片段约束生成方面明显优于之前基于 SAFE 表示训练的 GPT 模型，并在目标导向的命中生成和先导优化方面实现了最先进的性能。这些实验结果表明，GenMol 可以解决广泛的药物发现任务，为分子设计提供统一且通用的方法。
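
Fragment remasking, as described above, replaces one fragment with mask tokens and regenerates it. The following is a toy sketch under loud assumptions: the molecule is a plain list of token lists rather than a real SAFE string, the mask keeps the original fragment length, and `fill_fn` is a hypothetical stand-in for the discrete diffusion model. The names `MASK`, `remask_fragment`, and `regenerate` are not from the paper.

```python
MASK = "[MASK]"


def remask_fragment(fragments, index, mask_len=None):
    """Return a copy of `fragments` with fragment `index` replaced by MASK tokens.

    By default the mask has the same length as the original fragment
    (an assumption; the actual method may choose the length differently).
    """
    if not 0 <= index < len(fragments):
        raise IndexError("fragment index out of range")
    length = len(fragments[index]) if mask_len is None else mask_len
    masked = [list(fragment) for fragment in fragments]
    masked[index] = [MASK] * length
    return masked


def regenerate(fragments, fill_fn):
    """Fill every MASK token using fill_fn(fragment_index, token_index)."""
    return [
        [fill_fn(i, j) if token == MASK else token for j, token in enumerate(fragment)]
        for i, fragment in enumerate(fragments)
    ]


# Toy token lists (assumption: not a chemically validated SAFE encoding).
fragments = [["c", "1", "c", "c", "c", "c", "c", "1"], ["C", "C", "O"]]
masked = remask_fragment(fragments, 1)
# A trivial filler that always proposes "N"; a trained model would go here.
proposal = regenerate(masked, lambda i, j: "N")
print(masked[1], proposal[1])
```

In an optimization loop, each proposal would be scored against the design objective and kept only if it improves on the current molecule; the scoring function is outside the scope of this sketch.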

Title: VideoAuteur: Towards Long Narrative Video Generation

Authors: Junfei Xiao, Feng Cheng, Lu Qi, Liangke Gui, Jiepeng Cen, Zhibei Ma, Alan Yuille, Lu Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.06173
Pdf URL: https://arxiv.org/pdf/2501.06173
Copy Paste: [[2501.06173]] VideoAuteur: Towards Long Narrative Video Generation(https://arxiv.org/abs/2501.06173)
Keywords: generation
Abstract: Recent video generation models have shown promising results in producing high-quality video clips lasting several seconds. However, these models face challenges in generating long sequences that convey clear and informative events, limiting their ability to support coherent narrations. In this paper, we present a large-scale cooking video dataset designed to advance long-form narrative generation in the cooking domain. We validate the quality of our proposed dataset in terms of visual fidelity and textual caption accuracy using state-of-the-art Vision-Language Models (VLMs) and video generation models, respectively. We further introduce a Long Narrative Video Director to enhance both visual and semantic coherence in generated videos and emphasize the role of aligning visual embeddings to achieve improved overall video quality. Our method demonstrates substantial improvements in generating visually detailed and semantically aligned keyframes, supported by finetuning techniques that integrate text and image embeddings within the video generation process. Project page: this https URL
摘要：最近的视频生成模型在制作持续数秒的高质量视频片段方面表现出了良好的效果。然而，这些模型在生成传达清晰信息丰富的事件的长序列方面面临挑战，限制了它们支持连贯叙述的能力。在本文中，我们展示了一个大规模烹饪视频数据集，旨在推进烹饪领域的长篇叙事生成。我们分别使用最先进的视觉语言模型 (VLM) 和视频生成模型，在视觉保真度和文本字幕准确性方面验证了我们提出的数据集的质量。我们进一步引入了一个长叙事视频导演，以增强生成的视频中的视觉和语义连贯性，并强调对齐视觉嵌入在实现整体视频质量改进方面的作用。我们的方法在生成视觉细节和语义对齐的关键帧方面表现出了显着的改进，这得益于在视频生成过程中集成文本和图像嵌入的微调技术。项目页面：此 https URL

Title: Multi-subject Open-set Personalization in Video Generation

Authors: Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Yuwei Fang, Kwot Sin Lee, Ivan Skorokhodov, Kfir Aberman, Jun-Yan Zhu, Ming-Hsuan Yang, Sergey Tulyakov
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.06187
Pdf URL: https://arxiv.org/pdf/2501.06187
Copy Paste: [[2501.06187]] Multi-subject Open-set Personalization in Video Generation(https://arxiv.org/abs/2501.06187)
Keywords: generation
Abstract: Video personalization methods allow us to synthesize videos with specific concepts such as people, pets, and places. However, existing methods often focus on limited domains, require time-consuming optimization per subject, or support only a single subject. We present Video Alchemist, a video model with built-in multi-subject, open-set personalization capabilities for both foreground objects and background, eliminating the need for time-consuming test-time optimization. Our model is built on a new Diffusion Transformer module that fuses each conditional reference image and its corresponding subject-level text prompt with cross-attention layers. Developing such a large model presents two main challenges: dataset and evaluation. First, as paired datasets of reference images and videos are extremely hard to collect, we sample selected video frames as reference images and synthesize a clip of the target video. However, while models can easily denoise training videos given reference frames, they fail to generalize to new contexts. To mitigate this issue, we design a new automatic data construction pipeline with extensive image augmentations. Second, evaluating open-set video personalization is a challenge in itself. To address this, we introduce a personalization benchmark that focuses on accurate subject fidelity and supports diverse personalization scenarios. Finally, our extensive experiments show that our method significantly outperforms existing personalization methods in both quantitative and qualitative evaluations.
摘要：视频个性化方法使我们能够合成具有特定概念（例如人物、宠物和地点）的视频。但是，现有方法通常只关注有限的领域，需要对每个主题进行耗时的优化，或者仅支持单个主题。我们提出了 Video Alchemist，一个视频模型，它具有内置的多主题、开放集个性化功能，适用于前景对象和背景，无需进行耗时的测试时间优化。我们的模型建立在一个新的 Diffusion Transformer 模块上，该模块将每个条件参考图像及其相应的主题级文本提示与交叉注意层融合在一起。开发如此大的模型面临两个主要挑战：数据集和评估。首先，由于参考图像和视频的配对数据集极难收集，我们采样选定的视频帧作为参考图像并合成目标视频的片段。但是，虽然模型可以轻松地根据参考帧对训练视频进行去噪，但它们无法推广到新的上下文。为了缓解这个问题，我们设计了一个具有广泛图像增强功能的新自动数据构建管道。其次，评估开放集视频个性化本身就是一项挑战。为了解决这个问题，我们引入了一个个性化基准，该基准侧重于准确的主题保真度并支持各种个性化场景。最后，我们进行了广泛的实验，表明我们的方法在定量和定性评估中都明显优于现有的个性化方法。