2024-12-31

Title: Vitron: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing

Authors: Hao Fei, Shengqiong Wu, Hanwang Zhang, Tat-Seng Chua, Shuicheng Yan
Subjects: cs.CV, cs.HC
Abstract URL: https://arxiv.org/abs/2412.19806
Pdf URL: https://arxiv.org/pdf/2412.19806
Copy Paste: [[2412.19806]] Vitron: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing(https://arxiv.org/abs/2412.19806)
Keywords: generation
Abstract: Recent developments of vision large language models (LLMs) have seen remarkable progress, yet still encounter challenges towards multimodal generalists, such as coarse-grained instance-level understanding, lack of unified support for both images and videos, and insufficient coverage across various vision tasks. In this paper, we present VITRON, a universal pixel-level vision LLM designed for comprehensive understanding, generating, segmenting, and editing of both static images and dynamic videos. Building on top of an LLM backbone, VITRON incorporates encoders for images, videos, and pixel-level regional visuals within its frontend modules, while employing state-of-the-art visual specialists as its backend, via which VITRON supports a spectrum of vision end tasks, spanning visual comprehension to visual generation, from low level to high level. To ensure effective and precise message passing from the LLM to backend modules for function invocation, we propose a novel hybrid method by simultaneously integrating discrete textual instructions and continuous signal embeddings. Further, we design various pixel-level spatiotemporal vision-language alignment learning for VITRON to reach the best fine-grained visual capability. Finally, a cross-task synergy module is devised to learn to maximize the task-invariant fine-grained visual features, enhancing the synergy between different visual tasks. Demonstrated over 12 visual tasks and evaluated across 22 datasets, VITRON showcases its extensive capabilities in the four main vision task clusters. Overall, this work illuminates the great potential of developing a more unified multimodal generalist. Project homepage: this https URL
摘要：视觉大型语言模型 (LLM) 的最新发展取得了显著进展，但在多模态通才方面仍面临挑战，例如粗粒度的实例级理解、缺乏对图像和视频的统一支持以及对各种视觉任务的覆盖不足。在本文中，我们介绍了 VITRON，这是一种通用的像素级视觉 LLM，旨在全面理解、生成、分割和编辑静态图像和动态视频。VITRON 以 LLM 主干为基础，在其前端模块中集成了用于图像、视频和像素级区域视觉效果的编码器，同时采用最先进的视觉专家作为后端，通过 VITRON 支持一系列视觉终端任务，涵盖从低级到高级的视觉理解到视觉生成。为了确保从 LLM 到后端模块传递消息的有效和精确，以进行函数调用，我们提出了一种新颖的混合方法，同时集成离散文本指令和连续信号嵌入。此外，我们为 VITRON 设计了各种像素级时空视觉语言对齐学习，以达到最佳的细粒度视觉能力。最后，建议使用跨任务协同模块来学习最大化任务不变的细粒度视觉特征，增强不同视觉任务之间的协同作用。VITRON 演示了 12 多个视觉任务并在 22 个数据集上进行了评估，展示了其在四个主要视觉任务集群中的广泛功能。总的来说，这项工作阐明了开发更统一的多模态通才的巨大潜力。项目主页：这个 https URL

Title: A Review of Latent Representation Models in Neuroimaging

Authors: C. Vázquez-García, F. J. Martínez-Murcia, F. Segovia Román, Juan M. Górriz
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.19844
Pdf URL: https://arxiv.org/pdf/2412.19844
Copy Paste: [[2412.19844]] A Review of Latent Representation Models in Neuroimaging(https://arxiv.org/abs/2412.19844)
Keywords: generative
Abstract: Neuroimaging data, particularly from techniques like MRI or PET, offer rich but complex information about brain structure and activity. To manage this complexity, latent representation models - such as Autoencoders, Generative Adversarial Networks (GANs), and Latent Diffusion Models (LDMs) - are increasingly applied. These models are designed to reduce high-dimensional neuroimaging data to lower-dimensional latent spaces, where key patterns and variations related to brain function can be identified. By modeling these latent spaces, researchers hope to gain insights into the biology and function of the brain, including how its structure changes with age or disease, or how it encodes sensory information, predicts and adapts to new inputs. This review discusses how these models are used for clinical applications, like disease diagnosis and progression monitoring, but also for exploring fundamental brain mechanisms such as active inference and predictive coding. These approaches provide a powerful tool for both understanding and simulating the brain's complex computational tasks, potentially advancing our knowledge of cognition, perception, and neural disorders.
摘要：神经影像数据，尤其是来自 MRI 或 PET 等技术的神经影像数据，提供了有关大脑结构和活动的丰富而复杂的信息。为了管理这种复杂性，潜在表示模型（例如自动编码器、生成对抗网络 (GAN) 和潜在扩散模型 (LDM)）的应用越来越多。这些模型旨在将高维神经影像数据降低到低维潜在空间，从而可以识别与大脑功能相关的关键模式和变化。通过对这些潜在空间进行建模，研究人员希望深入了解大脑的生物学和功能，包括大脑结构如何随着年龄或疾病而变化，或者大脑如何编码感官信息、预测和适应新输入。本综述讨论了这些模型如何用于临床应用，例如疾病诊断和进展监测，以及如何探索基本大脑机制，例如主动推理和预测编码。这些方法为理解和模拟大脑复杂的计算任务提供了强大的工具，有可能提高我们对认知、感知和神经疾病的认识。

Title: Symbolic Disentangled Representations for Images

Authors: Alexandr Korchemnyi, Alexey K. Kovalev, Aleksandr I. Panov
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.19847
Pdf URL: https://arxiv.org/pdf/2412.19847
Copy Paste: [[2412.19847]] Symbolic Disentangled Representations for Images(https://arxiv.org/abs/2412.19847)
Keywords: generative
Abstract: The idea of disentangled representations is to reduce the data to a set of generative factors that produce it. Typically, such representations are vectors in latent space, where each coordinate corresponds to one of the generative factors. The object can then be modified by changing the value of a particular coordinate, but it is necessary to determine which coordinate corresponds to the desired generative factor -- a difficult task if the vector representation has a high dimension. In this article, we propose ArSyD (Architecture for Symbolic Disentanglement), which represents each generative factor as a vector of the same dimension as the resulting representation. In ArSyD, the object representation is obtained as a superposition of the generative factor vector representations. We call such a representation a symbolic disentangled representation. We use the principles of Hyperdimensional Computing (also known as Vector Symbolic Architectures), where symbols are represented as hypervectors, allowing vector operations on them. Disentanglement is achieved by construction: no additional assumptions about the underlying distributions are made during training, and the model is only trained to reconstruct images in a weakly supervised manner. We study ArSyD on the dSprites and CLEVR datasets and provide a comprehensive analysis of the learned symbolic disentangled representations. We also propose new disentanglement metrics that allow comparison of methods using latent representations of different dimensions. ArSyD allows object properties to be edited in a controlled and interpretable way, and the dimensionality of the object property representation coincides with the dimensionality of the object representation itself.
摘要：解缠结表示的理念是将数据简化为一组生成它的生成因子。通常，这种表示是潜在空间中的向量，其中每个坐标对应于生成因子之一。然后可以通过更改特定坐标的值来修改对象，但必须确定哪个坐标对应于所需的生成因子——如果向量表示具有高维度，这是一项艰巨的任务。在本文中，我们提出了 ArSyD（符号解缠结架构），它将每个生成因子表示为与结果表示相同维度的向量。在 ArSyD 中，对象表示是作为生成因子向量表示的叠加而获得的。我们将这种表示称为 \textit{符号解缠结表示}。我们使用超维计算（也称为向量符号架构）的原理，其中符号表示为超向量，允许对其进行向量运算。解缠是通过构造实现的，训练期间不对底层分布做出任何额外假设，并且模型仅以弱监督的方式训练以重建图像。我们在 dSprites 和 CLEVR 数据集上研究 ArSyD，并对学习到的符号解缠表示进行全面分析。我们还提出了新的解缠指标，允许比较使用不同维度潜在表示的方法。ArSyD 允许以受控和可解释的方式编辑对象属性，并且对象属性表示的维数与对象表示本身的维数一致。
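
As a rough illustration of the Hyperdimensional Computing operations the abstract refers to (binding, superposition, and editing a single factor), here is a minimal sketch. It is not the ArSyD architecture, which learns its representations from images; the dimensionality, the factor names, and the helper functions (`random_hv`, `bind`, `bundle`, `decode`, `edit`) are all assumptions made for this example.

```python
import random

DIM = 2048  # hypervector dimensionality (assumed value for illustration)


def random_hv(rng, dim=DIM):
    """Random bipolar hypervector with entries in {-1, +1}."""
    return [rng.choice((-1, 1)) for _ in range(dim)]


def bind(a, b):
    """Bind two vectors by element-wise multiplication.

    For bipolar vectors, binding is its own inverse: bind(bind(a, b), a) == b.
    """
    return [x * y for x, y in zip(a, b)]


def bundle(vectors):
    """Superpose vectors by element-wise summation."""
    return [sum(column) for column in zip(*vectors)]


def similarity(a, b):
    """Dot product normalized by the dimensionality."""
    return sum(x * y for x, y in zip(a, b)) / len(a)


rng = random.Random(0)
# Hypothetical generative factors (roles) and their possible values.
roles = {name: random_hv(rng) for name in ("shape", "color", "size")}
values = {name: random_hv(rng) for name in ("square", "red", "blue", "large")}

# Object representation: a superposition of role-value bindings.
# It has the same dimensionality as every individual factor vector.
obj = bundle([
    bind(roles["shape"], values["square"]),
    bind(roles["color"], values["red"]),
    bind(roles["size"], values["large"]),
])


def decode(obj_hv, role):
    """Unbind a role and return the name of the most similar known value."""
    probe = bind(obj_hv, roles[role])
    return max(values, key=lambda name: similarity(probe, values[name]))


def edit(obj_hv, role, old, new):
    """Swap one factor's value: subtract the old binding, add the new one."""
    old_term = bind(roles[role], values[old])
    new_term = bind(roles[role], values[new])
    return [o - a + b for o, a, b in zip(obj_hv, old_term, new_term)]


edited = edit(obj, "color", "red", "blue")
print(decode(obj, "color"), decode(edited, "color"), decode(edited, "shape"))
```

Because every factor lives in the same high-dimensional space as the object itself, changing one property is a vector operation on the whole representation rather than a search for the right latent coordinate.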

Title: Generative Landmarks Guided Eyeglasses Removal 3D Face Reconstruction

Authors: Dapeng Zhao, Yue Qi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.19848
Pdf URL: https://arxiv.org/pdf/2412.19848
Copy Paste: [[2412.19848]] Generative Landmarks Guided Eyeglasses Removal 3D Face Reconstruction(https://arxiv.org/abs/2412.19848)
Keywords: generative
Abstract: Single-view 3D face reconstruction is a fundamental Computer Vision problem of extraordinary difficulty. Current systems often assume the input is an unobstructed face, which makes them unsuitable for in-the-wild conditions. We present a method for 3D face reconstruction that removes eyeglasses from a single image. Existing facial reconstruction methods fail to remove eyeglasses automatically for generating a photo-realistic 3D face "in-the-wild". The innovation of our method lies in a process for robustly identifying the eyeglasses area and removing it intelligently. In this work, we estimate the 2D face structure of the reasonable position of the eyeglasses area, which is used for the construction of 3D texture. An excellent anti-eyeglasses face reconstruction method should ensure the authenticity of the output, including the topological structure between the eyes, nose, and mouth. We achieve this via a deep learning architecture that performs direct regression of a 3DMM representation of the 3D facial geometry from a single 2D image. We also demonstrate how the related face parsing task can be incorporated into the proposed framework and help improve reconstruction quality. We conduct extensive experiments on existing 3D face reconstruction tasks as concrete examples to demonstrate the method's superior regulation ability in cases where existing methods often break down.
摘要：单视图 3D 人脸重建是一个极其困难的基本计算机视觉问题。当前的系统通常假设输入是没有遮挡的人脸，这使得它们的方法不适合在野外条件下使用。我们提出了一种从单个图像中移除眼镜的 3D 人脸重建方法。现有的人脸重建方法无法自动移除眼镜，从而在“野外”生成照片般逼真的 3D 人脸。我们方法的创新之处在于稳健地识别眼镜区域并智能地移除它的过程。在这项工作中，我们估计了眼镜区域合理位置的 2D 人脸结构，用于构建 3D 纹理。一个优秀的抗眼镜人脸重建方法应该确保输出的真实性，包括眼睛、鼻子和嘴巴之间的拓扑结构。我们通过深度学习架构实现这一点，该架构从单个 2D 图像中对 3D 面部几何的 3DMM 表示进行直接回归。我们还展示了如何将相关的人脸解析任务纳入到所提出的框架中并帮助提高重建质量。我们对现有的 3D 人脸重建任务进行了广泛的实验作为具体示例，以证明该方法比现有方法更出色的调节能力。

Title: Conditional Balance: Improving Multi-Conditioning Trade-Offs in Image Generation

Authors: Nadav Z. Cohen, Oron Nir, Ariel Shamir
Subjects: cs.CV, cs.GR, cs.LG
Abstract URL: https://arxiv.org/abs/2412.19853
Pdf URL: https://arxiv.org/pdf/2412.19853
Copy Paste: [[2412.19853]] Conditional Balance: Improving Multi-Conditioning Trade-Offs in Image Generation(https://arxiv.org/abs/2412.19853)
Keywords: generation
Abstract: Balancing content fidelity and artistic style is a pivotal challenge in image generation. While traditional style transfer methods and modern Denoising Diffusion Probabilistic Models (DDPMs) strive to achieve this balance, they often struggle to do so without sacrificing either style, content, or sometimes both. This work addresses this challenge by analyzing the ability of DDPMs to maintain content and style equilibrium. We introduce a novel method to identify sensitivities within the DDPM attention layers, identifying specific layers that correspond to different stylistic aspects. By directing conditional inputs only to these sensitive layers, our approach enables fine-grained control over style and content, significantly reducing issues arising from over-constrained inputs. Our findings demonstrate that this method enhances recent stylization techniques by better aligning style and content, ultimately improving the quality of generated visual content.
摘要：平衡内容保真度和艺术风格是图像生成的关键挑战。虽然传统的风格迁移方法和现代的去噪扩散概率模型 (DDPM) 都力求实现这种平衡，但它们往往难以在不牺牲风格、内容或有时两者的情况下做到这一点。这项工作通过分析 DDPM 保持内容和风格平衡的能力来解决这一挑战。我们引入了一种新方法来识别 DDPM 注意层中的敏感度，识别与不同风格方面相对应的特定层。通过将条件输入仅引导到这些敏感层，我们的方法可以对风格和内容进行细粒度控制，从而显着减少因输入过度受限而引起的问题。我们的研究结果表明，这种方法通过更好地协调风格和内容来增强最近的风格化技术，最终提高生成的视觉内容的质量。

Title: UniAvatar: Taming Lifelike Audio-Driven Talking Head Generation with Comprehensive Motion and Lighting Control

Authors: Wenzhang Sun, Xiang Li, Donglin Di, Zhuding Liang, Qiyuan Zhang, Hao Li, Wei Chen, Jianxun Cui
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.19860
Pdf URL: https://arxiv.org/pdf/2412.19860
Copy Paste: [[2412.19860]] UniAvatar: Taming Lifelike Audio-Driven Talking Head Generation with Comprehensive Motion and Lighting Control(https://arxiv.org/abs/2412.19860)
Keywords: generation
Abstract: Animating portrait images using audio input has recently become a popular task. Creating lifelike talking head videos requires flexible and natural movements, including facial and head dynamics, camera motion, and realistic light and shadow effects. Existing methods struggle to offer comprehensive, multifaceted control over these aspects. In this work, we introduce UniAvatar, a method designed to provide extensive control over a wide range of motion and illumination conditions. Specifically, we use the FLAME model to render all motion information onto a single image, maintaining the integrity of 3D motion details while enabling fine-grained, pixel-level control. Beyond motion, this approach also allows for comprehensive global illumination control. We design independent modules to manage both 3D motion and illumination, permitting separate and combined control. Extensive experiments demonstrate that our method outperforms others in both broad-range motion control and lighting control. Additionally, to enhance the diversity of motion and environmental contexts in current datasets, we collect and plan to publicly release two datasets, DH-FaceDrasMvVid-100 and DH-FaceReliVid-200, which capture significant head movements during speech and various lighting scenarios.
摘要：最近，使用音频输入为肖像图像制作动画是一项很受欢迎的任务。制作栩栩如生的头部说话视频需要灵活自然的动作，包括面部和头部动态、相机运动、逼真的光影效果。现有的方法难以提供对这些方面的全面、多方面的控制。在这项工作中，我们介绍了 UniAvatar，这是一种设计方法，可对各种运动和照明条件进行广泛的控制。具体来说，我们使用 FLAME 模型将所有运动信息渲染到单个图像上，在实现细粒度像素级控制的同时保持 3D 运动细节的完整性。除了运动之外，这种方法还允许全面的全局照明控制。我们设计了独立的模块来管理 3D 运动和照明，允许单独和组合控制。大量实验表明，我们的方法在广泛运动控制和照明控制方面均优于其他方法。此外，为了增强当前数据集中运动和环境背景的多样性，我们收集并计划公开发布两个数据集，DH-FaceDrasMvVid-100 和 DH-FaceReliVid-200，它们捕捉讲话和各种光照场景中的显著头部运动。

Title: Data-Free Group-Wise Fully Quantized Winograd Convolution via Learnable Scales

Authors: Shuokai Pan, Gerti Tuzi, Sudarshan Sreeram, Dibakar Gope
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.19867
Pdf URL: https://arxiv.org/pdf/2412.19867
Copy Paste: [[2412.19867]] Data-Free Group-Wise Fully Quantized Winograd Convolution via Learnable Scales(https://arxiv.org/abs/2412.19867)
Keywords: generation
Abstract: Despite the revolutionary breakthroughs of large-scale text-to-image diffusion models for complex vision and downstream tasks, their extremely high computational and storage costs limit their usability. Quantization of diffusion models has been explored in recent works to reduce compute costs and memory bandwidth usage. To further improve inference time, fast convolution algorithms such as Winograd can be used for convolution layers, which account for a significant portion of computations in diffusion models. However, the significant quality loss of fully quantized Winograd using existing coarser-grained post-training quantization methods, combined with the complexity and cost of finetuning the Winograd transformation matrices for such large models to recover quality, makes them unsuitable for large-scale foundation models. Motivated by the presence of a large range of values in them, we investigate the impact of finer-grained group-wise quantization in quantizing diffusion models. While group-wise quantization can largely handle the fully quantized Winograd convolution, it struggles to deal with the large distribution imbalance in a sizable portion of the Winograd domain computation. To reduce range differences in the Winograd domain, we propose finetuning only the scale parameters of the Winograd transform matrices without using any domain-specific training data. Because our method does not depend on any training data, the generalization performance of quantized diffusion models is safely guaranteed. For the text-to-image generation task, the 8-bit fully-quantized diffusion model with Winograd provides near-lossless quality (FID and CLIP scores) in comparison to the full-precision model. For image classification, our method outperforms the state-of-the-art Winograd PTQ method by 1.62% and 2.56% in top-1 ImageNet accuracy on ResNet-18 and ResNet-34, respectively, with Winograd F(6, 3).
摘要：尽管大规模文本到图像扩散模型在复杂视觉和下游任务方面取得了革命性的突破，但其极高的计算和存储成本限制了它们的可用性。最近的研究探索了扩散模型的量化，以降低计算成本和内存带宽使用率。为了进一步缩短推理时间，可以将 Winograd 等快速卷积算法用于卷积层，这在扩散模型中占了很大一部分计算量。然而，使用现有的粗粒度后训练量化方法完全量化的 Winograd 会造成显著的质量损失，再加上对这种大型模型的 Winograd 变换矩阵进行微调以恢复质量的复杂性和成本，使得它们不适合用于大规模基础模型。受其中存在大量值的启发，我们研究了细粒度分组量化在量化扩散模型中的影响。虽然分组量化可以在很大程度上处理完全量化的 Winograd 卷积，但它很难处理 Winograd 域计算中相当大一部分的分布不平衡。为了减少 Winograd 域中的范围差异，我们建议仅微调 Winograd 变换矩阵的尺度参数，而不使用任何特定于域的训练数据。由于我们的方法不依赖于任何训练数据，因此量化扩散模型的泛化性能可以得到安全保证。对于文本到图像的生成任务，与全精度模型相比，带有 Winograd 的 8 位完全量化扩散模型提供了近乎无损的质量（FID 和 CLIP 分数）。对于图像分类，我们的方法在 ResNet18 和 ResNet-34 上的 top-1 ImageNet 准确率上分别比最先进的 Winograd PTQ 方法高出 1.62% 和 2.56%，Winograd F(6, 3)。
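
To make the contrast between coarse-grained and group-wise quantization concrete, the following is a minimal sketch of generic symmetric integer quantization with one scale per group. It does not reproduce the paper's learnable-scale Winograd method; the function names, the group size, and the sample values are assumptions chosen only to show how a wide value range hurts a single shared scale.

```python
def quantize_groupwise(values, group_size, num_bits=8):
    """Symmetric quantization with one scale per group of `group_size` values."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for 8 bits
    quantized, scales = [], []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        max_abs = max(abs(v) for v in group)
        scale = max_abs / qmax if max_abs > 0 else 1.0
        scales.append(scale)
        # Round to the nearest integer level and clamp to the signed range.
        quantized.extend(max(-qmax, min(qmax, round(v / scale))) for v in group)
    return quantized, scales


def dequantize_groupwise(quantized, scales, group_size):
    """Map integer levels back to real values using each group's scale."""
    return [q * scales[i // group_size] for i, q in enumerate(quantized)]


def max_error(original, restored):
    """Largest absolute reconstruction error."""
    return max(abs(a - b) for a, b in zip(original, restored))


# Hypothetical tensor with a strong range imbalance between its two halves.
weights = [0.01, -0.02, 0.015, 0.005, 50.0, -30.0, 10.0, 5.0]

# One scale for the whole tensor (coarse-grained).
q_all, s_all = quantize_groupwise(weights, group_size=len(weights))
restored_all = dequantize_groupwise(q_all, s_all, group_size=len(weights))

# One scale per group of four values (group-wise).
q_grp, s_grp = quantize_groupwise(weights, group_size=4)
restored_grp = dequantize_groupwise(q_grp, s_grp, group_size=4)

# Compare the error on the small-magnitude half only.
err_all_small = max_error(weights[:4], restored_all[:4])
err_grp_small = max_error(weights[:4], restored_grp[:4])
print(err_all_small, err_grp_small)
```

With a single scale, the small-magnitude entries all collapse to zero, while a per-group scale keeps them distinguishable. The price is one extra scale value per group.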

Title: Minimax-Optimal Multi-Agent Robust Reinforcement Learning

Authors: Yuchen Jiao, Gen Li
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.19873
Pdf URL: https://arxiv.org/pdf/2412.19873
Copy Paste: [[2412.19873]] Minimax-Optimal Multi-Agent Robust Reinforcement Learning(https://arxiv.org/abs/2412.19873)
Keywords: generative
Abstract: Multi-agent robust reinforcement learning, also known as multi-player robust Markov games (RMGs), is a crucial framework for modeling competitive interactions under environmental uncertainties, with wide applications in multi-agent systems. However, existing results on sample complexity in RMGs suffer from at least one of three obstacles: restrictive range of uncertainty level or accuracy, the curse of multiple agents, and the barrier of long horizons, all of which cause existing results to significantly exceed the information-theoretic lower bound. To close this gap, we extend the Q-FTRL algorithm \citep{li2022minimax} to RMGs in the finite-horizon setting, assuming access to a generative model. We prove that the proposed algorithm achieves an $\varepsilon$-robust coarse correlated equilibrium (CCE) with a sample complexity (up to log factors) of $\widetilde{O}\left(H^3S\sum_{i=1}^mA_i\min\left\{H,1/R\right\}/\varepsilon^2\right)$, where $S$ denotes the number of states, $A_i$ is the number of actions of the $i$-th agent, $H$ is the finite horizon length, and $R$ is the uncertainty level. We also show that this sample complexity is minimax optimal by combining it with an information-theoretic lower bound. Additionally, in the special case of two-player zero-sum RMGs, the algorithm achieves an $\varepsilon$-robust Nash equilibrium (NE) with the same sample complexity.
摘要：多智能体稳健强化学习，也称为多玩家稳健马尔可夫博弈 (RMG)，是环境不确定性下建模竞争互动的重要框架，广泛应用于多智能体系统。然而，RMG 中样本复杂性的现有结果至少受到以下三个障碍之一的影响：不确定性水平或准确性的限制范围、多智能体的诅咒以及长期障碍，所有这些都导致现有结果大大超出信息论下限。为了弥补这一差距，我们将 Q-FTRL 算法 \citep{li2022minimax} 扩展到有限范围的 RMG，假设可以访问生成模型。我们证明，所提出的算法实现了 $\varepsilon$ 稳健粗相关均衡 (CCE)，其样本复杂度（最多为对数因子）为 $\widetilde{O}\left(H^3S\sum_{i=1}^mA_i\min\left\{H,1/R\right\}/\varepsilon^2\right)$，其中 $S$ 表示状态数，$A_i$ 表示第 $i$ 个代理的动作数，$H$ 表示有限视界长度，$R$ 表示不确定性水平。我们还通过结合信息论下限证明了此样本复杂度是极小极大最优的。此外，在两人零和 RMG 的特殊情况下，该算法实现了具有相同样本复杂度的 $\varepsilon$ 稳健纳什均衡 (NE)。
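
The sample complexity expression in the abstract can be evaluated numerically to see how it scales. The sketch below computes only the leading term $H^3 S \sum_i A_i \min\{H, 1/R\} / \varepsilon^2$, ignoring the constants and logarithmic factors hidden by $\widetilde{O}$; the function name and the example numbers are assumptions for illustration.

```python
def rmg_sample_complexity(H, S, actions, R, eps):
    """Leading term of the bound H^3 * S * sum_i(A_i) * min(H, 1/R) / eps^2.

    H: horizon length, S: number of states, actions: list of per-agent
    action counts A_i, R: uncertainty level (> 0), eps: target accuracy.
    Constants and log factors are deliberately omitted.
    """
    if R <= 0 or eps <= 0:
        raise ValueError("R and eps must be positive")
    return H ** 3 * S * sum(actions) * min(H, 1.0 / R) / eps ** 2


# Hypothetical problem sizes.
base = rmg_sample_complexity(H=2, S=3, actions=[2, 2], R=0.5, eps=1.0)
more_agents = rmg_sample_complexity(H=2, S=3, actions=[2, 2, 2, 2], R=0.5, eps=1.0)
tighter_eps = rmg_sample_complexity(H=2, S=3, actions=[2, 2], R=0.5, eps=0.5)
print(base, more_agents, tighter_eps)
```

The dependence on the agents enters through the sum of the action counts, so doubling the number of agents with equal action sets doubles the term. Halving $\varepsilon$ multiplies it by four.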

Title: YOLO-MST: Multiscale deep learning method for infrared small target detection based on super-resolution and YOLO

Authors: Taoran Yue, Xiaojin Lu, Jiaxi Cai, Yuanping Chen, Shibing Chu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.19878
Pdf URL: https://arxiv.org/pdf/2412.19878
Copy Paste: [[2412.19878]] YOLO-MST: Multiscale deep learning method for infrared small target detection based on super-resolution and YOLO(https://arxiv.org/abs/2412.19878)
Keywords: super-resolution
Abstract: With the advancement of aerospace technology and the increasing demands of military applications, the development of low false-alarm and high-precision infrared small target detection algorithms has emerged as a key focus of research globally. However, the traditional model-driven method is not robust enough when dealing with features such as noise, target size, and contrast. The existing deep-learning methods have limited ability to extract and fuse key features, and it is difficult to achieve high-precision detection in complex backgrounds and when target features are not obvious. To solve these problems, this paper proposes a deep-learning infrared small target detection method that combines image super-resolution technology with multi-scale observation. First, the input infrared images are preprocessed with super-resolution and multiple data enhancements are performed. Second, based on the YOLOv5 model, we propose a new deep-learning network named YOLO-MST. This network replaces the SPPF module with the self-designed MSFA module in the backbone, optimizes the neck, and adds a multi-scale dynamic detection head to the prediction head. By dynamically fusing features from different scales, the detection head can better adapt to complex scenes. The mAP@0.5 detection rates of this method on two public datasets, SIRST and IRIS, reached 96.4% and 99.5% respectively, more effectively solving the problems of missed detection, false alarms, and low precision.
摘要：随着航天技术的进步和军事应用需求的不断增长，发展低虚警高精度的红外小目标检测算法成为国内外研究的重点。然而传统的模型驱动方法在处理噪声、目标大小、对比度等特征时鲁棒性不够，现有的深度学习方法对关键特征的提取和融合能力有限，在复杂背景和目标特征不明显的情况下难以实现高精度检测。针对这些问题，本文提出了一种结合图像超分辨率技术与多尺度观测的深度学习红外小目标检测方法。首先，对输入的红外图像进行超分辨率预处理并进行多次数据增强。其次，基于YOLOv5模型，提出一种新的深度学习网络YOLO-MST。该网络包括在骨干部分用自主设计的MSFA模块替换SPPF模块，优化颈部，最后在预测头中添加多尺度动态检测头。通过动态融合不同尺度的特征，检测头可以更好地适应复杂场景，该方法在SIRST和IRIS两个公开数据集上的mAP@0.5检测率分别达到96.4%和99.5%，更加有效地解决了漏检、误报、精度不高等问题。
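
The "@0.5" in mAP@0.5 refers to an intersection-over-union (IoU) threshold: a predicted box counts as a correct detection only if its IoU with a ground-truth box is at least 0.5. A minimal sketch of that matching criterion follows. The box format `(x1, y1, x2, y2)` and the function names are assumptions, and the full mAP computation (ranking by confidence and averaging precision over recall levels) is not shown.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap region (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def is_match(pred_box, gt_box, threshold=0.5):
    """True if the prediction overlaps the ground truth enough to count as a hit."""
    return iou(pred_box, gt_box) >= threshold


# Hypothetical ground-truth box and two predictions.
gt = (0, 0, 2, 2)
shifted_half = (1, 0, 3, 2)     # shifted by half its width
shifted_small = (0.2, 0, 2.2, 2)  # shifted by a tenth of its width
print(iou(gt, shifted_half), is_match(gt, shifted_half))
print(iou(gt, shifted_small), is_match(gt, shifted_small))
```

For very small targets, a shift of a few pixels already changes the IoU substantially, so the threshold matters a great deal when scores are compared.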

Title: Char-SAM: Turning Segment Anything Model into Scene Text Segmentation Annotator with Character-level Visual Prompts

Authors: Enze Xie, Jiaho Lyu, Daiqing Wu, Huawen Shen, Yu Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.19917
Pdf URL: https://arxiv.org/pdf/2412.19917
Copy Paste: [[2412.19917]] Char-SAM: Turning Segment Anything Model into Scene Text Segmentation Annotator with Character-level Visual Prompts(https://arxiv.org/abs/2412.19917)
Keywords: generation
Abstract: The recent emergence of the Segment Anything Model (SAM) enables various domain-specific segmentation tasks to be tackled cost-effectively by using bounding boxes as prompts. However, in scene text segmentation, SAM cannot achieve desirable performance. Word-level bounding boxes are too coarse as prompts for characters, while character-level bounding boxes as prompts suffer from over-segmentation and under-segmentation issues. In this paper, we propose an automatic annotation pipeline named Char-SAM that turns SAM into a low-cost segmentation annotator with a character-level visual prompt. Specifically, leveraging some existing text detection datasets with word-level bounding box annotations, we first generate finer-grained character-level bounding box prompts using the Character Bounding-box Refinement (CBR) module. Next, we employ glyph information corresponding to text character categories as a new prompt in the Character Glyph Refinement (CGR) module to guide SAM in producing more accurate segmentation masks, addressing issues of over-segmentation and under-segmentation. These modules fully utilize the bbox-to-mask capability of SAM to generate high-quality text segmentation annotations automatically. Extensive experiments on TextSeg validate the effectiveness of Char-SAM. Its training-free nature also enables the generation of high-quality scene text segmentation datasets from real-world datasets like COCO-Text and MLT17.
摘要：最近出现的“任意分割模型”（SAM）使得各种领域特定的分割任务能够通过使用边界框作为提示来经济高效地解决。然而，在场景文本分割中，SAM 无法达到理想的性能。作为提示的词级边界框对于字符来说太粗糙，而作为提示的字符级边界框存在过度分割和欠分割的问题。在本文中，我们提出了一种名为 Char-SAM 的自动注释管道，它将 SAM 变成一个具有字符级视觉提示的低成本分割注释器。具体而言，利用一些具有词级边界框注释的现有文本检测数据集，我们首先使用字符边界框细化 CBR 模块生成更细粒度的字符级边界框提示。接下来，我们使用与文本字符类别相对应的字形信息作为字符字形细化 (CGR) 模块中的新提示，以指导 SAM 生成更准确的分割掩码，解决过度分割和欠分割的问题。这些模块充分利用了 SAM 的 bbox-to-mask 功能，自动生成高质量的文本分割注释。在 TextSeg 上进行的大量实验验证了 Char-SAM 的有效性。其无需训练的特性还使得能够从 COCO-Text 和 MLT17 等真实世界数据集生成高质量的场景文本分割数据集。

Title: Data-driven tool wear prediction in milling, based on a process-integrated single-sensor approach

Authors: Eric Hirsch, Christian Friedrich
Subjects: cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2412.19950
Pdf URL: https://arxiv.org/pdf/2412.19950
Copy Paste: [[2412.19950]] Data-driven tool wear prediction in milling, based on a process-integrated single-sensor approach(https://arxiv.org/abs/2412.19950)
Keywords: generation
Abstract: Accurate tool wear prediction is essential for maintaining productivity and minimizing costs in machining. However, the complex nature of the tool wear process poses significant challenges to achieving reliable predictions. This study explores data-driven methods, in particular deep learning, for tool wear prediction. Traditional data-driven approaches often focus on a single process, relying on multi-sensor setups and extensive data generation, which limits generalization to new settings. Moreover, multi-sensor integration is often impractical in industrial environments. To address these limitations, this research investigates the transferability of predictive models using minimal training data, validated across two processes. Furthermore, it uses a simple setup with a single acceleration sensor to establish a low-cost data generation approach that facilitates the generalization of models to other processes via transfer learning. The study evaluates several machine learning models, including convolutional neural networks (CNN), long short-term memory networks (LSTM), support vector machines (SVM) and decision trees, trained on different input formats such as feature vectors and short-time Fourier transform (STFT). The performance of the models is evaluated on different amounts of training data, including scenarios with significantly reduced datasets, providing insight into their effectiveness under constrained data conditions. The results demonstrate the potential of specific models and configurations for effective tool wear prediction, contributing to the development of more adaptable and efficient predictive maintenance strategies in machining. Notably, the ConvNeXt model performs exceptionally well, achieving 99.1% accuracy in identifying tool wear using data from only four milling tools operated until they are worn.
摘要：准确的刀具磨损预测对于保持生产率和降低加工成本至关重要。然而，刀具磨损过程的复杂性对实现可靠的预测提出了重大挑战。本研究探讨了用于刀具磨损预测的数据驱动方法，特别是深度学习。传统的数据驱动方法通常专注于单一过程，依赖于多传感器设置和大量数据生成，这限制了对新设置的推广。此外，多传感器集成在工业环境中通常是不切实际的。为了解决这些限制，本研究使用最少的训练数据调查了预测模型的可转移性，并在两个过程中进行了验证。此外，它使用带有单个加速度传感器的简单设置来建立一种低成本的数据生成方法，该方法有助于通过迁移学习将模型推广到其他过程。该研究评估了几种机器学习模型，包括卷积神经网络 (CNN)、长短期记忆网络 (LSTM)、支持向量机 (SVM) 和决策树，这些模型在不同输入格式（如特征向量和短时傅里叶变换 (STFT)）上进行训练。研究人员使用不同数量的训练数据（包括数据集显著减少的场景）评估了模型的性能，从而深入了解了模型在受限数据条件下的有效性。结果表明，特定模型和配置具有有效预测刀具磨损的潜力，有助于开发更具适应性和效率的机械加工预测性维护策略。值得注意的是，ConvNeXt 模型具有出色的性能，仅使用四把铣刀运行至磨损时的数据，就能实现 99.1% 的刀具磨损识别准确率。
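
One of the input formats named in the abstract is the short-time Fourier transform (STFT), which turns a one-dimensional sensor signal into a time-frequency image that a CNN can consume. The following is a minimal pure-Python sketch of an STFT magnitude computation. The frame size, hop length, window choice, and test signal are assumptions for illustration and are not taken from the study.

```python
import cmath
import math


def stft_magnitudes(signal, frame_size, hop):
    """Return a list of magnitude spectra, one per windowed frame.

    Each spectrum holds frame_size // 2 + 1 bins (non-negative frequencies).
    A periodic Hann window is applied to every frame before the DFT.
    """
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_size)
              for n in range(frame_size)]
    frames = []
    for start in range(0, len(signal) - frame_size + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_size)]
        spectrum = []
        for k in range(frame_size // 2 + 1):
            # Direct DFT of the windowed frame at frequency bin k.
            coeff = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_size)
                        for n in range(frame_size))
            spectrum.append(abs(coeff))
        frames.append(spectrum)
    return frames


# Hypothetical vibration signal: a pure tone that falls exactly on bin 4.
frame_size, hop = 16, 8
signal = [math.sin(2 * math.pi * 4 * n / frame_size) for n in range(64)]

spectrogram = stft_magnitudes(signal, frame_size, hop)
peak_bins = [max(range(len(s)), key=s.__getitem__) for s in spectrogram]
print(len(spectrogram), len(spectrogram[0]), peak_bins)
```

The direct DFT used here is quadratic in the frame size and is meant only to show the structure. A real pipeline would use an FFT routine from a numerical library.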

Title: ErgoChat: a Visual Query System for the Ergonomic Risk Assessment of Construction Workers

Authors: Chao Fan, Qipei Mei, Xiaonan Wang, Xinming Li
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.19954
Pdf URL: https://arxiv.org/pdf/2412.19954
Copy Paste: [[2412.19954]] ErgoChat: a Visual Query System for the Ergonomic Risk Assessment of Construction Workers(https://arxiv.org/abs/2412.19954)
Keywords: generative
Abstract: In the construction sector, workers often endure prolonged periods of high-intensity physical work and prolonged use of tools, resulting in injuries and illnesses primarily linked to postural ergonomic risks, a longstanding predominant health concern. To mitigate these risks, researchers have applied various technological methods to identify the ergonomic risks that construction workers face. However, traditional ergonomic risk assessment (ERA) techniques do not offer interactive feedback. The rapidly developing vision-language models (VLMs), capable of generating textual descriptions or answering questions about ergonomic risks based on image inputs, have not yet received widespread attention. This research introduces an interactive visual query system tailored to assess the postural ergonomic risks of construction workers. The system's capabilities include visual question answering (VQA), which responds to visual queries regarding workers' exposure to postural ergonomic risks, and image captioning (IC), which generates textual descriptions of these risks from images. Additionally, this study proposes a dataset designed for training and testing such methodologies. Systematic testing indicates that the VQA functionality delivers an accuracy of 96.5%. Moreover, evaluations using nine metrics for IC and assessments from human experts indicate that the proposed approach surpasses the performance of a method using the same architecture trained solely on generic datasets. This study sets a new direction for future developments in interactive ERA using generative artificial intelligence (AI) technologies.
摘要：在建筑行业，工人通常要长时间从事高强度的体力劳动，并长时间使用工具，从而导致主要与姿势人体工程学风险相关的伤害和疾病，姿势人体工程学风险是长期以来的主要健康问题。为了减轻这些风险，研究人员采用了各种技术方法来识别建筑工人面临的人体工程学风险。然而，传统的人体工程学风险评估 (ERA) 技术不提供交互式反馈。快速发展的视觉语言模型 (VLM) 能够生成文本描述或根据图像输入回答有关人体工程学风险的问题，但尚未受到广泛关注。本研究介绍了一种交互式视觉查询系统，专门用于评估建筑工人的姿势人体工程学风险。该系统的功能包括视觉问答 (VQA)，它可以响应有关工人暴露于姿势人体工程学风险的视觉查询，以及图像字幕 (IC)，它可以从图像中生成这些风险的文本描述。此外，本研究还提出了一个专为训练和测试此类方法而设计的数据集。系统测试表明，VQA 功能的准确率达到 96.5%。此外，使用九个 IC 指标进行的评估和人类专家的评估表明，所提出的方法的性能优于使用相同架构仅在通用数据集上训练的方法。这项研究为使用生成人工智能 (AI) 技术的交互式 ERA 的未来发展指明了新方向。

Title: MAKIMA: Tuning-free Multi-Attribute Open-domain Video Editing via Mask-Guided Attention Modulation

Authors: Haoyu Zheng, Wenqiao Zhang, Zheqi Lv, Yu Zhong, Yang Dai, Jianxiang An, Yongliang Shen, Juncheng Li, Dongping Zhang, Siliang Tang, Yueting Zhuang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.19978
Pdf URL: https://arxiv.org/pdf/2412.19978
Copy Paste: [[2412.19978]] MAKIMA: Tuning-free Multi-Attribute Open-domain Video Editing via Mask-Guided Attention Modulation(https://arxiv.org/abs/2412.19978)
Keywords: generation
Abstract: Diffusion-based text-to-image (T2I) models have demonstrated remarkable results in global video editing tasks. However, their focus is primarily on global video modifications, and achieving desired attribute-specific changes remains a challenging task, specifically in multi-attribute editing (MAE) in video. Contemporary video editing approaches either require extensive fine-tuning or rely on additional networks (such as ControlNet) for modeling multi-object appearances, yet they remain in their infancy, offering only coarse-grained MAE solutions. In this paper, we present MAKIMA, a tuning-free MAE framework built upon pretrained T2I models for open-domain video editing. Our approach preserves video structure and appearance information by incorporating attention maps and features from the inversion process during denoising. To facilitate precise editing of multiple attributes, we introduce mask-guided attention modulation, enhancing correlations between spatially corresponding tokens and suppressing cross-attribute interference in both self-attention and cross-attention layers. To balance video frame generation quality and efficiency, we implement consistent feature propagation, which generates frame sequences by editing keyframes and propagating their features throughout the sequence. Extensive experiments demonstrate that MAKIMA outperforms existing baselines in open-domain multi-attribute video editing tasks, achieving superior results in both editing accuracy and temporal consistency while maintaining computational efficiency.
摘要：基于扩散的文本到图像 (T2I) 模型在全局视频编辑任务中表现出色。然而，它们主要关注全局视频修改，而实现所需的特定属性更改仍然是一项艰巨的任务，特别是在视频的多属性编辑 (MAE) 中。当代视频编辑方法要么需要大量微调，要么依赖于其他网络（例如 ControlNet）来对多对象外观进行建模，但它们仍处于起步阶段，仅提供粗粒度的 MAE 解决方案。在本文中，我们介绍了 MAKIMA，这是一个基于预训练的 T2I 模型构建的无需调整的 MAE 框架，用于开放域视频编辑。我们的方法通过在去噪过程中结合注意力图和反演过程中的特征来保留视频结构和外观信息。为了促进对多个属性的精确编辑，我们引入了掩码引导的注意力调节，增强了空间对应标记之间的相关性，并抑制了自注意力层和交叉注意力层中的跨属性干扰。为了平衡视频帧生成质量和效率，我们实现了一致的特征传播，即通过编辑关键帧并在整个序列中传播其特征来生成帧序列。大量实验表明，MAKIMA 在开放域多属性视频编辑任务中的表现优于现有基准，在保持计算效率的同时，在编辑准确性和时间一致性方面都取得了优异的成绩。

Title: An Ordinary Differential Equation Sampler with Stochastic Start for Diffusion Bridge Models

Authors: Yuang Wang, Pengfei Jin, Li Zhang, Quanzheng Li, Zhiqiang Chen, Dufan Wu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.19992
Pdf URL: https://arxiv.org/pdf/2412.19992
Copy Paste: [[2412.19992]] An Ordinary Differential Equation Sampler with Stochastic Start for Diffusion Bridge Models(https://arxiv.org/abs/2412.19992)
Keywords: restoration, super-resolution, generation, generative
Abstract: Diffusion bridge models have demonstrated promising performance in conditional image generation tasks, such as image restoration and translation, by initializing the generative process from corrupted images instead of pure Gaussian noise. However, existing diffusion bridge models often rely on Stochastic Differential Equation (SDE) samplers, which result in slower inference speed compared to diffusion models that employ high-order Ordinary Differential Equation (ODE) solvers for acceleration. To mitigate this gap, we propose a high-order ODE sampler with a stochastic start for diffusion bridge models. To overcome the singular behavior of the probability flow ODE (PF-ODE) at the beginning of the reverse process, a posterior sampling approach was introduced at the first reverse step. The sampling was designed to ensure a smooth transition from corrupted images to the generative trajectory while reducing discretization errors. Following this stochastic start, Heun's second-order solver is applied to solve the PF-ODE, achieving high perceptual quality with significantly reduced neural function evaluations (NFEs). Our method is fully compatible with pretrained diffusion bridge models and requires no additional training. Extensive experiments on image restoration and translation tasks, including super-resolution, JPEG restoration, Edges-to-Handbags, and DIODE-Outdoor, demonstrated that our sampler outperforms state-of-the-art methods in both visual quality and Frechet Inception Distance (FID).
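
Illustrative note: the abstract above names Heun's second-order solver as the ODE integrator used after the stochastic start. The following is a minimal sketch of Heun's method on a generic scalar ODE, not the paper's PF-ODE or its posterior-sampling first step. The function name heun_solve, the test equation dy/dt = -y, and the step counts are assumptions chosen for illustration. Each step costs two evaluations of f, which corresponds to two neural function evaluations when f is a network.

```python
import math


def heun_solve(f, t0, y0, t1, steps):
    """Integrate dy/dt = f(t, y) from t0 to t1 with Heun's second-order method."""
    h = (t1 - t0) / steps
    t, y = t0, y0
    for _ in range(steps):
        k1 = f(t, y)                 # slope at the start of the step
        y_pred = y + h * k1          # Euler predictor
        k2 = f(t + h, y_pred)        # slope at the predicted endpoint
        y = y + 0.5 * h * (k1 + k2)  # trapezoidal corrector (average of both slopes)
        t += h
    return y


def decay(t, y):
    """Hypothetical test problem: dy/dt = -y, whose exact solution is exp(-t)."""
    return -y


exact = math.exp(-1.0)
error_10 = abs(heun_solve(decay, 0.0, 1.0, 1.0, steps=10) - exact)
error_20 = abs(heun_solve(decay, 0.0, 1.0, 1.0, steps=20) - exact)
print(error_10, error_20)
```

Doubling the number of steps should shrink the error by roughly a factor of four, which is the expected behavior of a second-order method.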

Title: Comprehensive Review of EEG-to-Output Research: Decoding Neural Signals into Images, Videos, and Audio

Authors: Yashvir Sabharwal, Balaji Rama
Subjects: cs.CV, cs.AI, q-bio.NC
Abstract URL: https://arxiv.org/abs/2412.19999
Pdf URL: https://arxiv.org/pdf/2412.19999
Copy Paste: [[2412.19999]] Comprehensive Review of EEG-to-Output Research: Decoding Neural Signals into Images, Videos, and Audio(https://arxiv.org/abs/2412.19999)
Keywords: generative
Abstract: Electroencephalography (EEG) is an invaluable tool in neuroscience, offering insights into brain activity with high temporal resolution. Recent advancements in machine learning and generative modeling have catalyzed the application of EEG in reconstructing perceptual experiences, including images, videos, and audio. This paper systematically reviews EEG-to-output research, focusing on state-of-the-art generative methods, evaluation metrics, and data challenges. Using PRISMA guidelines, we analyze 1800 studies and identify key trends, challenges, and opportunities in the field. The findings emphasize the potential of advanced models such as Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), and Transformers, while highlighting the pressing need for standardized datasets and cross-subject generalization. A roadmap for future research is proposed that aims to improve decoding accuracy and broaden real-world applications.

Title: A Robust Adversarial Ensemble with Causal (Feature Interaction) Interpretations for Image Classification

Authors: Chunheng Zhao, Pierluigi Pisu, Gurcan Comert, Negash Begashaw, Varghese Vaidyan, Nina Christine Hubig
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.20025
Pdf URL: https://arxiv.org/pdf/2412.20025
Copy Paste: [[2412.20025]] A Robust Adversarial Ensemble with Causal (Feature Interaction) Interpretations for Image Classification(https://arxiv.org/abs/2412.20025)
Keywords: generative
Abstract: Deep learning-based discriminative classifiers, despite their remarkable success, remain vulnerable to adversarial examples that can mislead model predictions. While adversarial training can enhance robustness, it fails to address the intrinsic vulnerability stemming from the opaque nature of these black-box models. We present a deep ensemble model that combines discriminative features with generative models to achieve both high accuracy and adversarial robustness. Our approach integrates a bottom-level pre-trained discriminative network for feature extraction with a top-level generative classification network that models adversarial input distributions through a deep latent variable model. Using variational Bayes, our model achieves superior robustness against white-box adversarial attacks without adversarial training. Extensive experiments on CIFAR-10 and CIFAR-100 demonstrate our model's superior adversarial robustness. Through evaluations using counterfactual metrics and feature interaction-based metrics, we establish correlations between model interpretability and adversarial robustness. Additionally, preliminary results on Tiny-ImageNet validate our approach's scalability to more complex datasets, offering a practical solution for developing robust image classification models.

Title: MaIR: A Locality- and Continuity-Preserving Mamba for Image Restoration

Authors: Boyun Li, Haiyu Zhao, Wenxin Wang, Peng Hu, Yuanbiao Gou, Xi Peng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.20066
Pdf URL: https://arxiv.org/pdf/2412.20066
Copy Paste: [[2412.20066]] MaIR: A Locality- and Continuity-Preserving Mamba for Image Restoration(https://arxiv.org/abs/2412.20066)
Keywords: restoration, super-resolution
Abstract: Recent advancements in Mamba have shown promising results in image restoration. These methods typically flatten 2D images into multiple distinct 1D sequences along rows and columns, process each sequence independently using selective scan operation, and recombine them to form the outputs. However, such a paradigm overlooks two vital aspects: i) the local relationships and spatial continuity inherent in natural images, and ii) the discrepancies among sequences unfolded through totally different ways. To overcome the drawbacks, we explore two problems in Mamba-based restoration methods: i) how to design a scanning strategy preserving both locality and continuity while facilitating restoration, and ii) how to aggregate the distinct sequences unfolded in totally different ways. To address these problems, we propose a novel Mamba-based Image Restoration model (MaIR), which consists of Nested S-shaped Scanning strategy (NSS) and Sequence Shuffle Attention block (SSA). Specifically, NSS preserves locality and continuity of the input images through the stripe-based scanning region and the S-shaped scanning path, respectively. SSA aggregates sequences through calculating attention weights within the corresponding channels of different sequences. Thanks to NSS and SSA, MaIR surpasses 40 baselines across 14 challenging datasets, achieving state-of-the-art performance on the tasks of image super-resolution, denoising, deblurring and dehazing. Our codes will be available after acceptance.

Title: UniRestorer: Universal Image Restoration via Adaptively Estimating Image Degradation at Proper Granularity

Authors: Jingbo Lin, Zhilu Zhang, Wenbo Li, Renjing Pei, Hang Xu, Hongzhi Zhang, Wangmeng Zuo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.20157
Pdf URL: https://arxiv.org/pdf/2412.20157
Copy Paste: [[2412.20157]] UniRestorer: Universal Image Restoration via Adaptively Estimating Image Degradation at Proper Granularity(https://arxiv.org/abs/2412.20157)
Keywords: restoration
Abstract: Recently, considerable progress has been made in all-in-one image restoration. Generally, existing methods can be degradation-agnostic or degradation-aware. However, the former are limited in leveraging degradation-specific restoration, and the latter suffer from the inevitable error in degradation estimation. Consequently, the performance of existing methods has a large gap compared to specific single-task models. In this work, we make a step forward in this topic, and present our UniRestorer with improved restoration performance. Specifically, we perform hierarchical clustering on degradation space, and train a multi-granularity mixture-of-experts (MoE) restoration model. Then, UniRestorer adopts both degradation and granularity estimation to adaptively select an appropriate expert for image restoration. In contrast to existing degradation-agnostic and -aware methods, UniRestorer can leverage degradation estimation to benefit degradation-specific restoration, and use granularity estimation to make the model robust to degradation estimation error. Experimental results show that our UniRestorer outperforms state-of-the-art all-in-one methods by a large margin, and is promising in closing the performance gap to specific single-task models. The code and pre-trained models will be publicly available at this https URL.

Title: Multi-Modality Driven LoRA for Adverse Condition Depth Estimation

Authors: Guanglei Yang, Rui Tian, Yongqiang Zhang, Zhun Zhong, Yongqiang Li, Wangmeng Zuo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.20162
Pdf URL: https://arxiv.org/pdf/2412.20162
Copy Paste: [[2412.20162]] Multi-Modality Driven LoRA for Adverse Condition Depth Estimation(https://arxiv.org/abs/2412.20162)
Keywords: generative
Abstract: The autonomous driving community is increasingly focused on addressing corner case problems, particularly those related to ensuring driving safety under adverse conditions (e.g., nighttime, fog, rain). To this end, the task of Adverse Condition Depth Estimation (ACDE) has gained significant attention. Previous approaches in ACDE have primarily relied on generative models, which necessitate additional target images to convert the sunny condition into adverse weather, or on learnable parameters for feature augmentation to bridge domain gaps, resulting in increased model complexity and tuning efforts. Furthermore, unlike CLIP-based methods where textual and visual features have been pre-aligned, depth estimation models lack sufficient alignment between multimodal features, hindering coherent understanding under adverse conditions. To address these limitations, we propose Multi-Modality Driven LoRA (MMD-LoRA), which leverages low-rank adaptation matrices for efficient fine-tuning from source-domain to target-domain. It consists of two core components: Prompt Driven Domain Alignment (PDDA) and Visual-Text Consistent Contrastive Learning (VTCCL). During PDDA, the image encoder with MMD-LoRA generates target-domain visual representations, supervised by an alignment loss requiring that the source-target difference be equal in the language and image spaces. Meanwhile, VTCCL bridges the gap between textual features from CLIP and visual features from the diffusion model, pushing apart different weather representations (vision and text) and bringing together similar ones. Through extensive experiments, the proposed method achieves state-of-the-art performance on the nuScenes and Oxford RobotCar datasets, underscoring robustness and efficiency in adapting to varied adverse environments.
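
Illustrative note: the abstract above builds on low-rank adaptation (LoRA). The sketch below shows only the standard LoRA weight update, W + (alpha / r) * B A, in plain Python. It does not reproduce MMD-LoRA, PDDA, or VTCCL, whose details are not given in the abstract. The matrix sizes, the values of W, A, and B, and the helper names matmul and lora_weight are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_weight(w, b, a, alpha, rank):
    """Return W + (alpha / rank) * B @ A; W stays frozen, only B and A would be trained."""
    scale = alpha / rank
    delta = matmul(b, a)  # low-rank update with the same shape as W
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


# Hypothetical 3x4 frozen weight with a rank-1 adapter (assumed values).
W = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0]]
A = [[0.5, -0.5, 1.0, 2.0]]           # 1x4 down-projection
B_init = [[0.0], [0.0], [0.0]]        # 3x1, zero-initialised so the adapter starts as a no-op
B_trained = [[1.0], [0.0], [2.0]]     # 3x1, stands in for values after fine-tuning

W_start = lora_weight(W, B_init, A, alpha=2.0, rank=1)
W_adapted = lora_weight(W, B_trained, A, alpha=2.0, rank=1)

# The adapter stores 3*1 + 1*4 = 7 numbers instead of the 12 entries of W.
adapter_params = len(B_trained) * len(B_trained[0]) + len(A) * len(A[0])
print(W_adapted, adapter_params)
```

The parameter saving is negligible at this toy size but grows with the layer dimensions, since the adapter scales with r * (rows + cols) rather than rows * cols.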

Title: StyleAutoEncoder for manipulating image attributes using pre-trained StyleGAN

Authors: Andrzej Bedychaj, Jacek Tabor, Marek Śmieja
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.20164
Pdf URL: https://arxiv.org/pdf/2412.20164
Copy Paste: [[2412.20164]] StyleAutoEncoder for manipulating image attributes using pre-trained StyleGAN(https://arxiv.org/abs/2412.20164)
Keywords: generative
Abstract: Deep conditional generative models are excellent tools for creating high-quality images and editing their attributes. However, training modern generative models from scratch is very expensive and requires large computational resources. In this paper, we introduce StyleAutoEncoder (StyleAE), a lightweight AutoEncoder module, which works as a plugin for pre-trained generative models and allows for manipulating the requested attributes of images. The proposed method offers a cost-effective solution for training deep generative models with limited computational resources, making it a promising technique for a wide range of applications. We evaluate StyleAutoEncoder by combining it with StyleGAN, which is currently one of the top generative models. Our experiments demonstrate that StyleAutoEncoder is at least as effective in manipulating image attributes as the state-of-the-art algorithms based on invertible normalizing flows. However, it is simpler, faster, and gives more freedom in designing neural

Title: Mining Platoon Patterns from Traffic Videos

Authors: Yijun Bei, Teng Ma, Dongxiang Zhang, Sai Wu, Kian-Lee Tan, Gang Chen
Subjects: cs.CV, cs.DB
Abstract URL: https://arxiv.org/abs/2412.20177
Pdf URL: https://arxiv.org/pdf/2412.20177
Copy Paste: [[2412.20177]] Mining Platoon Patterns from Traffic Videos(https://arxiv.org/abs/2412.20177)
Keywords: generation
Abstract: Discovering co-movement patterns from urban-scale video data sources has emerged as an attractive topic. This task aims to identify groups of objects that travel together along a common route, which offers effective support for government agencies in enhancing smart city management. However, the previous work has made a strong assumption on the accuracy of recovered trajectories from videos, and its co-movement pattern definition requires the group of objects to appear across consecutive cameras along the common route. In practice, this often leads to missing patterns if a vehicle is not correctly identified from a certain camera due to object occlusion or vehicle mis-matching. To address this challenge, we propose a relaxed definition of co-movement patterns from video data, which removes the consecutiveness requirement in the common route and accommodates a certain number of missing captured cameras for objects within the group. Moreover, a novel enumeration framework called MaxGrowth is developed to efficiently retrieve the relaxed patterns. Unlike previous filter-and-refine frameworks comprising both candidate enumeration and subsequent candidate verification procedures, MaxGrowth incurs no verification cost for the candidate patterns. It treats the co-movement pattern as an equivalent sequence of clusters, enumerating candidates with increasing sequence length while avoiding the generation of any false positives. We also propose two effective pruning rules to efficiently filter the non-maximal patterns. Extensive experiments are conducted to validate the efficiency of MaxGrowth and the quality of its generated co-movement patterns. Our MaxGrowth runs up to two orders of magnitude faster than the baseline algorithm. It also demonstrates high accuracy on a real video dataset when the trajectory recovery algorithm is not perfect.

Title: Generative Regression Based Watch Time Prediction for Video Recommendation: Model and Performance

Authors: Hongxu Ma, Kai Tian, Tao Zhang, Xuefeng Zhang, Chunjie Chen, Han Li, Jihong Guan, Shuigeng Zhou
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.20211
Pdf URL: https://arxiv.org/pdf/2412.20211
Copy Paste: [[2412.20211]] Generative Regression Based Watch Time Prediction for Video Recommendation: Model and Performance(https://arxiv.org/abs/2412.20211)
Keywords: generation, generative
Abstract: Watch time prediction (WTP) has emerged as a pivotal task in short video recommendation systems, designed to encapsulate user interests. Predicting users' watch times on videos often encounters challenges, including wide value ranges and imbalanced data distributions, which can lead to significant bias when directly regressing watch time. Recent studies have tried to tackle these issues by converting the continuous watch time estimation into an ordinal classification task. While these methods are somewhat effective, they exhibit notable limitations. Inspired by language modeling, we propose a novel Generative Regression (GR) paradigm for WTP based on sequence generation. This approach employs structural discretization to enable the lossless reconstruction of original values while maintaining prediction fidelity. By formulating the prediction problem as a numerical-to-sequence mapping, and with meticulously designed vocabulary and label encodings, each watch time is transformed into a sequence of tokens. To expedite model training, we introduce the curriculum learning with an embedding mixup strategy which can mitigate training-and-inference inconsistency associated with teacher forcing. We evaluate our method against state-of-the-art approaches on four public datasets and one industrial dataset. We also perform online A/B testing on Kuaishou, a leading video app with about 400 million DAUs, to demonstrate the real-world efficacy of our method. The results conclusively show that GR outperforms existing techniques significantly. Furthermore, we successfully apply GR to another regression task in recommendation systems, i.e., Lifetime Value (LTV) prediction, which highlights its potential as a novel and effective solution to general regression challenges.
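
Illustrative note: the abstract above describes turning each watch time into a token sequence that can be decoded back without loss. The sketch below shows one simple way such a numerical-to-sequence mapping could work, using fixed-length base-10 digits. The paper's actual vocabulary and label encoding are not given in the abstract, so the digit scheme, the integer-seconds assumption, and the names encode_watch_time and decode_tokens are all hypothetical.

```python
def encode_watch_time(seconds, base=10, length=4):
    """Map a non-negative integer watch time to a fixed-length token sequence."""
    if not 0 <= seconds < base ** length:
        raise ValueError("watch time outside the representable range")
    tokens = []
    for _ in range(length):
        seconds, digit = divmod(seconds, base)  # peel off the least significant digit
        tokens.append(digit)
    return tokens[::-1]  # most significant digit first, as a sequence model would emit it


def decode_tokens(tokens, base=10):
    """Reconstruct the original integer from its token sequence."""
    value = 0
    for tok in tokens:
        value = value * base + tok
    return value


tokens = encode_watch_time(137)
restored = decode_tokens(tokens)
print(tokens, restored)
```

Because every integer in the supported range has exactly one digit sequence, decoding recovers the original value exactly. A sequence model would then predict these tokens one at a time instead of regressing the number directly.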

Title: Motion Transfer-Driven intra-class data augmentation for Finger Vein Recognition

Authors: Xiu-Feng Huang, Lai-Man Po, Wei-Feng Ou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.20327
Pdf URL: https://arxiv.org/pdf/2412.20327
Copy Paste: [[2412.20327]] Motion Transfer-Driven intra-class data augmentation for Finger Vein Recognition(https://arxiv.org/abs/2412.20327)
Keywords: generation
Abstract: Finger vein recognition (FVR) has emerged as a secure biometric technique because of the confidentiality of vascular bio-information. Recently, deep learning-based FVR has gained increased popularity and achieved promising performance. However, the limited size of public vein datasets has caused overfitting issues and greatly limits the recognition performance. Although traditional data augmentation can partially alleviate this data shortage issue, it cannot capture the real finger posture variations due to the rigid label-preserving image transformations, bringing limited performance improvement. To address this issue, we propose a novel motion transfer (MT) model for finger vein image data augmentation via modeling the actual finger posture and rotational movements. The proposed model first utilizes a key point detector to extract the key point and pose map of the source and drive finger vein images. We then utilize a dense motion module to estimate the motion optical flow, which is fed to an image generation module for generating the image with the target pose. Experiments conducted on three public finger vein databases demonstrate that the proposed motion transfer model can effectively improve recognition accuracy. Code is available at: this https URL.

Title: FairDiffusion: Enhancing Equity in Latent Diffusion Models via Fair Bayesian Perturbation

Authors: Yan Luo, Muhammad Osama Khan, Congcong Wen, Muhammad Muneeb Afzal, Titus Fidelis Wuermeling, Min Shi, Yu Tian, Yi Fang, Mengyu Wang
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.20374
Pdf URL: https://arxiv.org/pdf/2412.20374
Copy Paste: [[2412.20374]] FairDiffusion: Enhancing Equity in Latent Diffusion Models via Fair Bayesian Perturbation(https://arxiv.org/abs/2412.20374)
Keywords: generation, generative
Abstract: Recent progress in generative AI, especially diffusion models, has demonstrated significant utility in text-to-image synthesis. Particularly in healthcare, these models offer immense potential in generating synthetic datasets and training medical students. However, despite these strong performances, it remains uncertain if the image generation quality is consistent across different demographic subgroups. To address this critical concern, we present the first comprehensive study on the fairness of medical text-to-image diffusion models. Our extensive evaluations of the popular Stable Diffusion model reveal significant disparities across gender, race, and ethnicity. To mitigate these biases, we introduce FairDiffusion, an equity-aware latent diffusion model that enhances fairness in both image generation quality as well as the semantic correlation of clinical features. In addition, we also design and curate FairGenMed, the first dataset for studying the fairness of medical generative models. Complementing this effort, we further evaluate FairDiffusion on two widely-used external medical datasets: HAM10000 (dermatoscopic images) and CheXpert (chest X-rays) to demonstrate FairDiffusion's effectiveness in addressing fairness concerns across diverse medical imaging modalities. Together, FairDiffusion and FairGenMed significantly advance research in fair generative learning, promoting equitable benefits of generative AI in healthcare.

Title: Tri-Ergon: Fine-grained Video-to-Audio Generation with Multi-modal Conditions and LUFS Control

Authors: Bingliang Li, Fengyu Yang, Yuxin Mao, Qingwen Ye, Hongkai Chen, Yiran Zhong
Subjects: cs.CV, cs.MM, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2412.20378
Pdf URL: https://arxiv.org/pdf/2412.20378
Copy Paste: [[2412.20378]] Tri-Ergon: Fine-grained Video-to-Audio Generation with Multi-modal Conditions and LUFS Control(https://arxiv.org/abs/2412.20378)
Keywords: generation
Abstract: Video-to-audio (V2A) generation utilizes visual-only video features to produce realistic sounds that correspond to the scene. However, current V2A models often lack fine-grained control over the generated audio, especially in terms of loudness variation and the incorporation of multi-modal conditions. To overcome these limitations, we introduce Tri-Ergon, a diffusion-based V2A model that incorporates textual, auditory, and pixel-level visual prompts to enable detailed and semantically rich audio synthesis. Additionally, we introduce Loudness Units relative to Full Scale (LUFS) embedding, which allows for precise manual control of the loudness changes over time for individual audio channels, enabling our model to effectively address the intricate correlation of video and audio in real-world Foley workflows. Tri-Ergon is capable of creating 44.1 kHz high-fidelity stereo audio clips of varying lengths up to 60 seconds, which significantly outperforms existing state-of-the-art V2A methods that typically generate mono audio for a fixed duration.

Title: Protégé: Learn and Generate Basic Makeup Styles with Generative Adversarial Networks (GANs)

Authors: Jia Wei Sii, Chee Seng Chan
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2412.20381
Pdf URL: https://arxiv.org/pdf/2412.20381
Copy Paste: [[2412.20381]] Protégé: Learn and Generate Basic Makeup Styles with Generative Adversarial Networks (GANs)(https://arxiv.org/abs/2412.20381)
Keywords: generative
Abstract: Makeup is no longer confined to physical application; people now use mobile apps to digitally apply makeup to their photos, which they then share on social media. However, while this shift has made makeup more accessible, designing diverse makeup styles tailored to individual faces remains a challenge. This design work must currently still be done manually by humans. Existing systems, such as makeup recommendation engines and makeup transfer techniques, are limited in creating innovative makeups for different individuals "intuitively": they require significant user effort and knowledge, and the makeup options available in apps are limited. Our motivation is to address this challenge by proposing Protégé, a new makeup application that leverages recent generative models (GANs) to learn and automatically generate makeup styles. This is a task that existing makeup applications (i.e., makeup recommendation systems using expert systems and makeup transfer methods) are unable to perform. Extensive experiments have been conducted to demonstrate the capability of Protégé in learning and creating diverse makeups, providing a convenient and intuitive way, marking a significant leap in digital makeup technology!

Title: Open-Sora: Democratizing Efficient Video Production for All

Authors: Zangwei Zheng, Xiangyu Peng, Tianji Yang, Chenhui Shen, Shenggui Li, Hongxin Liu, Yukun Zhou, Tianyi Li, Yang You
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.20404
Pdf URL: https://arxiv.org/pdf/2412.20404
Copy Paste: [[2412.20404]] Open-Sora: Democratizing Efficient Video Production for All(https://arxiv.org/abs/2412.20404)
Keywords: generation
Abstract: Vision and language are the two foundational senses for humans, and they build up our cognitive ability and intelligence. While significant breakthroughs have been made in AI language ability, artificial visual intelligence, especially the ability to generate and simulate the world we see, is far lagging behind. To facilitate the development and accessibility of artificial visual intelligence, we created Open-Sora, an open-source video generation model designed to produce high-fidelity video content. Open-Sora supports a wide spectrum of visual generation tasks, including text-to-image generation, text-to-video generation, and image-to-video generation. The model leverages advanced deep learning architectures and training/inference techniques to enable flexible video synthesis, which could generate video content of up to 15 seconds, up to 720p resolution, and arbitrary aspect ratios. Specifically, we introduce Spatial-Temporal Diffusion Transformer (STDiT), an efficient diffusion framework for videos that decouples spatial and temporal attention. We also introduce a highly compressive 3D autoencoder to make representations compact and further accelerate training with an ad hoc training strategy. Through this initiative, we aim to foster innovation, creativity, and inclusivity within the community of AI content creation. By embracing the open-source principle, Open-Sora democratizes full access to all the training/inference/data preparation codes as well as model weights. All resources are publicly available at: this https URL.

Title: EraseAnything: Enabling Concept Erasure in Rectified Flow Transformers

Authors: Daiheng Gao, Shilin Lu, Shaw Walters, Wenbo Zhou, Jiaming Chu, Jie Zhang, Bang Zhang, Mengxi Jia, Jian Zhao, Zhaoxin Fan, Weiming Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.20413
Pdf URL: https://arxiv.org/pdf/2412.20413
Copy Paste: [[2412.20413]] EraseAnything: Enabling Concept Erasure in Rectified Flow Transformers(https://arxiv.org/abs/2412.20413)
Keywords: generative
Abstract: Removing unwanted concepts from large-scale text-to-image (T2I) diffusion models while maintaining their overall generative quality remains an open challenge. This difficulty is especially pronounced in emerging paradigms, such as Stable Diffusion (SD) v3 and Flux, which incorporate flow matching and transformer-based architectures. These advancements limit the transferability of existing concept-erasure techniques that were originally designed for the previous T2I paradigm (e.g., SD v1.4). In this work, we introduce EraseAnything, the first method specifically developed to address concept erasure within the latest flow-based T2I framework. We formulate concept erasure as a bi-level optimization problem, employing LoRA-based parameter tuning and an attention map regularizer to selectively suppress undesirable activations. Furthermore, we propose a self-contrastive learning strategy to ensure that removing unwanted concepts does not inadvertently harm performance on unrelated ones. Experimental results demonstrate that EraseAnything successfully fills the research gap left by earlier methods in this new T2I paradigm, achieving state-of-the-art performance across a wide range of concept erasure tasks.
摘要：从大规模文本到图像 (T2I) 扩散模型中删除不需要的概念，同时保持其整体生成质量仍然是一个悬而未决的挑战。这种困难在新兴范式中尤其明显，例如稳定扩散 (SD) v3 和 Flux，它们结合了流匹配和基于变压器的架构。这些进步限制了最初为以前的 T2I 范式 (\textit{e.g.}，SD v1.4) 设计的现有概念擦除技术的可转移性。在这项工作中，我们引入了 \logopic \textbf{EraseAnything}，这是第一个专门为解决最新基于流的 T2I 框架中的概念擦除而开发的方法。我们将概念擦除表述为一个双层优化问题，采用基于 LoRA 的参数调整和注意力图正则化器来选择性地抑制不良激活。此外，我们提出了一种自对比学习策略，以确保删除不需要的概念不会无意中损害不相关概念的性能。实验结果表明，EraseAnything 成功填补了新 T2I 范式中早期方法留下的研究空白，在广泛的概念擦除任务中实现了最先进的性能。

Title: Bringing Objects to Life: 4D generation from 3D objects

Authors: Ohad Rahamim, Ori Malca, Dvir Samuel, Gal Chechik
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.20422
Pdf URL: https://arxiv.org/pdf/2412.20422
Copy Paste: [[2412.20422]] Bringing Objects to Life: 4D generation from 3D objects(https://arxiv.org/abs/2412.20422)
Keywords: generation, generative
Abstract: Recent advancements in generative modeling now enable the creation of 4D content (moving 3D objects) controlled with text prompts. 4D generation has large potential in applications like virtual worlds, media, and gaming, but existing methods provide limited control over the appearance and geometry of generated content. In this work, we introduce a method for animating user-provided 3D objects by conditioning on textual prompts to guide 4D generation, enabling custom animations while maintaining the identity of the original object. We first convert a 3D mesh into a "static" 4D Neural Radiance Field (NeRF) that preserves the visual attributes of the input object. Then, we animate the object using an Image-to-Video diffusion model driven by text. To improve motion realism, we introduce an incremental viewpoint selection protocol for sampling perspectives to promote lifelike movement and a masked Score Distillation Sampling (SDS) loss, which leverages attention maps to focus optimization on relevant regions. We evaluate our model in terms of temporal coherence, prompt adherence, and visual fidelity and find that our method outperforms baselines that are based on other approaches, achieving up to threefold improvements in identity preservation measured using LPIPS scores, and effectively balancing visual quality with dynamic content.
Summary: Recent advances in generative modeling make it possible to create 4D content (moving 3D objects) controlled by text prompts. 4D generation has great potential in applications such as virtual worlds, media, and gaming, but existing methods offer limited control over the appearance and geometry of the generated content. In this work, we present a method for animating user-provided 3D objects by conditioning 4D generation on text prompts, enabling custom animations while preserving the identity of the original object. We first convert a 3D mesh into a "static" 4D Neural Radiance Field (NeRF) that retains the visual attributes of the input object. We then animate the object with a text-driven image-to-video diffusion model. To improve motion realism, we introduce an incremental viewpoint selection protocol that samples perspectives to promote lifelike movement, and a masked Score Distillation Sampling (SDS) loss that uses attention maps to focus optimization on relevant regions. We evaluate our model on temporal coherence, prompt adherence, and visual fidelity, and find that it outperforms baselines built on other approaches, with up to threefold improvements in identity preservation as measured by LPIPS scores, while effectively balancing visual quality with dynamic content.

Title: ESVQA: Perceptual Quality Assessment of Egocentric Spatial Videos

Authors: Xilei Zhu, Huiyu Duan, Liu Yang, Yucheng Zhu, Xiongkuo Min, Guangtao Zhai, Patrick Le Callet
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2412.20423
Pdf URL: https://arxiv.org/pdf/2412.20423
Copy Paste: [[2412.20423]] ESVQA: Perceptual Quality Assessment of Egocentric Spatial Videos(https://arxiv.org/abs/2412.20423)
Keywords: quality assessment
Abstract: With the rapid development of eXtended Reality (XR), egocentric spatial shooting and display technologies have further enhanced immersion and engagement for users. Assessing the quality of experience (QoE) of egocentric spatial videos is crucial to ensure a high-quality viewing experience. However, the corresponding research is still lacking. In this paper, we use the embodied experience to highlight this more immersive experience and study the new problem, i.e., embodied perceptual quality assessment for egocentric spatial videos. Specifically, we introduce the first Egocentric Spatial Video Quality Assessment Database (ESVQAD), which comprises 600 egocentric spatial videos and their mean opinion scores (MOSs). Furthermore, we propose a novel multi-dimensional binocular feature fusion model, termed ESVQAnet, which integrates binocular spatial, motion, and semantic features to predict the perceptual quality. Experimental results demonstrate the ESVQAnet outperforms 16 state-of-the-art VQA models on the embodied perceptual quality assessment task, and exhibits strong generalization capability on traditional VQA tasks. The database and codes will be released upon the publication.
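Summary: With the rapid development of eXtended Reality (XR), egocentric spatial shooting and display technologies have further enhanced user immersion and engagement. Assessing the quality of experience (QoE) of egocentric spatial videos is crucial to ensuring a high-quality viewing experience, yet the corresponding research is still lacking. In this paper, we use the embodied experience to highlight this more immersive experience and study a new problem: embodied perceptual quality assessment for egocentric spatial videos. Specifically, we introduce the first Egocentric Spatial Video Quality Assessment Database (ESVQAD), which contains 600 egocentric spatial videos and their mean opinion scores (MOSs). We also propose a novel multi-dimensional binocular feature fusion model, termed ESVQAnet, which integrates binocular spatial, motion, and semantic features to predict perceptual quality. Experimental results show that ESVQAnet outperforms 16 state-of-the-art VQA models on the embodied perceptual quality assessment task and generalizes strongly to traditional VQA tasks. The database and code will be released upon publication.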

Title: Image Augmentation Agent for Weakly Supervised Semantic Segmentation

Authors: Wangyu Wu, Xianglin Qiu, Siqi Song, Zhenhong Chen, Xiaowei Huang, Fei Ma, Jimin Xiao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.20439
Pdf URL: https://arxiv.org/pdf/2412.20439
Copy Paste: [[2412.20439]] Image Augmentation Agent for Weakly Supervised Semantic Segmentation(https://arxiv.org/abs/2412.20439)
Keywords: generation
Abstract: Weakly-supervised semantic segmentation (WSSS) has achieved remarkable progress using only image-level labels. However, most existing WSSS methods focus on designing new network structures and loss functions to generate more accurate dense labels, overlooking the limitations imposed by fixed datasets, which can constrain performance improvements. We argue that more diverse trainable images provide WSSS with richer information and help the model understand more comprehensive semantic patterns. Therefore, in this paper, we introduce a novel approach called Image Augmentation Agent (IAA), which shows that it is possible to enhance WSSS from a data generation perspective. IAA mainly designs an augmentation agent that leverages large language models (LLMs) and diffusion models to automatically generate additional images for WSSS. In practice, to address the instability in prompt generation by LLMs, we develop a prompt self-refinement mechanism. It allows LLMs to re-evaluate the rationality of generated prompts to produce more coherent prompts. Additionally, we insert an online filter into the diffusion generation process to dynamically ensure the quality and balance of generated images. Experimental results show that our method significantly surpasses state-of-the-art WSSS approaches on the PASCAL VOC 2012 and MS COCO 2014 datasets.
Summary: Weakly-supervised semantic segmentation (WSSS) has made remarkable progress using only image-level labels. However, most existing WSSS methods concentrate on designing new network structures and loss functions to produce more accurate dense labels, while overlooking the limits imposed by fixed datasets, which can constrain performance gains. We argue that more diverse training images give WSSS richer information and help the model learn more comprehensive semantic patterns. In this paper, we therefore introduce a new approach called Image Augmentation Agent (IAA), which shows that WSSS can be enhanced from a data generation perspective. IAA centers on an augmentation agent that uses large language models (LLMs) and diffusion models to automatically generate additional images for WSSS. In practice, to address the instability of LLM prompt generation, we develop a prompt self-refinement mechanism that lets the LLM re-evaluate the rationality of generated prompts and produce more coherent ones. We also insert an online filter into the diffusion generation process to dynamically ensure the quality and balance of the generated images. Experimental results show that our method significantly surpasses state-of-the-art WSSS approaches on the PASCAL VOC 2012 and MS COCO 2014 datasets.

Title: JADE: Joint-aware Latent Diffusion for 3D Human Generative Modeling

Authors: Haorui Ji, Rong Wang, Taojun Lin, Hongdong Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.20470
Pdf URL: https://arxiv.org/pdf/2412.20470
Copy Paste: [[2412.20470]] JADE: Joint-aware Latent Diffusion for 3D Human Generative Modeling(https://arxiv.org/abs/2412.20470)
Keywords: generation, generative
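Abstract: Generative modeling of 3D human bodies has been studied extensively in computer vision. The core is to design a compact latent representation that is both expressive and semantically interpretable, yet existing approaches struggle to achieve both requirements. In this work, we introduce JADE, a generative framework that learns the variations of human shapes with fine-grained control. Our key insight is a joint-aware latent representation that decomposes human bodies into skeleton structures, modeled by joint positions, and local surface geometries, characterized by features attached to each joint. This disentangled latent space design enables geometric and semantic interpretation, facilitating users with flexible controllability. To generate coherent and plausible human shapes under our proposed decomposition, we also present a cascaded pipeline where two diffusions are employed to model the distribution of skeleton structures and local surface geometries respectively. Extensive experiments are conducted on public datasets, where we demonstrate the effectiveness of the JADE framework in multiple tasks in terms of autoencoding reconstruction accuracy, editing controllability and generation quality compared with existing methods.
Summary: Generative modeling of 3D human bodies has been studied extensively in computer vision. The core problem is to design a compact latent representation that is both expressive and semantically interpretable, but existing methods struggle to satisfy both requirements at once. In this work, we introduce JADE, a generative framework that learns variations in human shape with fine-grained control. Our key insight is a joint-aware latent representation that decomposes the human body into a skeleton structure, modeled by joint positions, and local surface geometry, characterized by features attached to each joint. This disentangled latent space supports geometric and semantic interpretation and gives users flexible control. To generate coherent and plausible human shapes under this decomposition, we also propose a cascaded pipeline in which two diffusion models capture the distributions of skeleton structures and local surface geometries, respectively. Extensive experiments on public datasets demonstrate the effectiveness of the JADE framework across multiple tasks, in terms of autoencoding reconstruction accuracy, editing controllability, and generation quality, compared with existing methods.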

Title: Toward Scene Graph and Layout Guided Complex 3D Scene Generation

Authors: Yu-Hsiang Huang, Wei Wang, Sheng-Yu Huang, Yu-Chiang Frank Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.20473
Pdf URL: https://arxiv.org/pdf/2412.20473
Copy Paste: [[2412.20473]] Toward Scene Graph and Layout Guided Complex 3D Scene Generation(https://arxiv.org/abs/2412.20473)
Keywords: generation
Abstract: Recent advancements in object-centric text-to-3D generation have shown impressive results. However, generating complex 3D scenes remains an open challenge due to the intricate relations between objects. Moreover, existing methods are largely based on score distillation sampling (SDS), which constrains the ability to manipulate multiobjects with specific interactions. Addressing these critical yet underexplored issues, we present a novel framework of Scene Graph and Layout Guided 3D Scene Generation (GraLa3D). Given a text prompt describing a complex 3D scene, GraLa3D utilizes LLM to model the scene using a scene graph representation with layout bounding box information. GraLa3D uniquely constructs the scene graph with single-object nodes and composite super-nodes. In addition to constraining 3D generation within the desirable layout, a major contribution lies in the modeling of interactions between objects in a super-node, while alleviating appearance leakage across objects within such nodes. Our experiments confirm that GraLa3D overcomes the above limitations and generates complex 3D scenes closely aligned with text prompts.
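Summary: Recent advances in object-centric text-to-3D generation have shown impressive results. However, generating complex 3D scenes remains an open challenge because of the intricate relations between objects. Moreover, existing methods are largely based on score distillation sampling (SDS), which limits the ability to manipulate multiple objects with specific interactions. To address these critical yet underexplored issues, we propose a novel framework, Scene Graph and Layout Guided 3D Scene Generation (GraLa3D). Given a text prompt describing a complex 3D scene, GraLa3D uses an LLM to model the scene with a scene graph representation that carries layout bounding box information. GraLa3D uniquely constructs the scene graph from single-object nodes and composite super-nodes. Beyond constraining 3D generation to the desired layout, a major contribution is modeling the interactions between objects within a super-node while alleviating appearance leakage across the objects in such nodes. Our experiments confirm that GraLa3D overcomes the above limitations and generates complex 3D scenes that closely align with the text prompts.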

Title: Multimodal Variational Autoencoder: a Barycentric View

Authors: Peijie Qiu, Wenhui Zhu, Sayantan Kumar, Xiwen Chen, Xiaotong Sun, Jin Yang, Abolfazl Razi, Yalin Wang, Aristeidis Sotiras
Subjects: cs.LG, cs.CV, cs.IT
Abstract URL: https://arxiv.org/abs/2412.20487
Pdf URL: https://arxiv.org/pdf/2412.20487
Copy Paste: [[2412.20487]] Multimodal Variational Autoencoder: a Barycentric View(https://arxiv.org/abs/2412.20487)
Keywords: generative
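Abstract: Multiple signal modalities, such as vision and sounds, are naturally present in real-world phenomena. Recently, there has been growing interest in learning generative models, in particular variational autoencoders (VAEs), for multimodal representation learning, especially in the case of missing modalities. The primary goal of these models is to learn a modality-invariant and modality-specific representation that characterizes information across multiple modalities. Previous attempts at multimodal VAEs approach this mainly through the lens of experts, aggregating unimodal inference distributions with a product of experts (PoE), a mixture of experts (MoE), or a combination of both. In this paper, we provide an alternative generic and theoretical formulation of multimodal VAE through the lens of barycenter. We first show that PoE and MoE are specific instances of barycenters, derived by minimizing the asymmetric weighted KL divergence to unimodal inference distributions. Our novel formulation extends these two barycenters to a more flexible choice by considering different types of divergences. In particular, we explore the Wasserstein barycenter defined by the 2-Wasserstein distance, which better preserves the geometry of unimodal distributions by capturing both modality-specific and modality-invariant representations compared to KL divergence. Empirical studies on three multimodal benchmarks demonstrated the effectiveness of the proposed method.
Summary: Multiple signal modalities, such as vision and sound, occur naturally in real-world phenomena. Interest has recently grown in learning generative models, in particular variational autoencoders (VAEs), for multimodal representation learning, especially when modalities are missing. The main goal of these models is to learn a modality-invariant and modality-specific representation that characterizes information across modalities. Previous multimodal VAEs approach this mainly through the lens of experts, aggregating unimodal inference distributions with a product of experts (PoE), a mixture of experts (MoE), or a combination of the two. In this paper, we offer an alternative general and theoretical formulation of the multimodal VAE through the lens of barycenters. We first show that PoE and MoE are specific instances of barycenters, obtained by minimizing the asymmetric weighted KL divergence to the unimodal inference distributions. Our formulation extends these two barycenters to a more flexible choice by considering different types of divergence. In particular, we explore the Wasserstein barycenter defined by the 2-Wasserstein distance, which, compared with KL divergence, better preserves the geometry of the unimodal distributions by capturing both modality-specific and modality-invariant representations. Empirical studies on three multimodal benchmarks demonstrate the effectiveness of the proposed method.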
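Note: the barycenter claim in this abstract can be made concrete with two standard variational identities. The block below is a sketch under stated assumptions, not the paper's own derivation: the weights w_i are assumed non-negative and summing to one, the names q*_PoE and q*_MoE are labels chosen here, and the abstract does not say which KL direction the paper attaches to which aggregator. With these normalized weights the first minimizer is a weighted geometric mean (a generalized PoE); the classical unweighted PoE is the plain product of the experts. The Gaussian formulas are the well-known one-dimensional closed forms.

```latex
% Assumptions: w_i >= 0, \sum_i w_i = 1; p_i are the unimodal inference distributions.
\begin{align}
q^\star_{\mathrm{PoE}} &= \arg\min_q \sum_i w_i\, \mathrm{KL}(q \,\|\, p_i)
  \;\Longrightarrow\; q^\star_{\mathrm{PoE}}(z) \propto \prod_i p_i(z)^{w_i}, \\
q^\star_{\mathrm{MoE}} &= \arg\min_q \sum_i w_i\, \mathrm{KL}(p_i \,\|\, q)
  \;\Longrightarrow\; q^\star_{\mathrm{MoE}}(z) = \sum_i w_i\, p_i(z).
\end{align}
% One-dimensional Gaussian experts p_i = \mathcal{N}(\mu_i, \sigma_i^2):
\begin{align}
\text{geometric (PoE-type):}\quad
  & \frac{1}{\sigma^2} = \sum_i \frac{w_i}{\sigma_i^2}, \qquad
    \mu = \sigma^2 \sum_i \frac{w_i\, \mu_i}{\sigma_i^2}, \\
\text{2-Wasserstein barycenter:}\quad
  & \mu = \sum_i w_i\, \mu_i, \qquad
    \sigma = \sum_i w_i\, \sigma_i.
\end{align}
```

In the Gaussian case the geometric barycenter averages precisions, so a single confident expert dominates, while the 2-Wasserstein barycenter averages means and standard deviations directly.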

Title: DPBridge: Latent Diffusion Bridge for Dense Prediction

Authors: Haorui Ji, Taojun Lin, Hongdong Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.20506
Pdf URL: https://arxiv.org/pdf/2412.20506
Copy Paste: [[2412.20506]] DPBridge: Latent Diffusion Bridge for Dense Prediction(https://arxiv.org/abs/2412.20506)
Keywords: generation, generative
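Abstract: Diffusion models have demonstrated remarkable success in dense prediction problems, which aim to model the per-pixel relationship between RGB images and dense signal maps, thanks to their ability to effectively capture complex data distributions. However, initiating the reverse sampling trajectory from an uninformative noise prior introduces limitations such as degraded performance and slow inference speed. In this work, we propose DPBridge, a generative framework that formulates dense prediction tasks as image-conditioned generation problems and establishes a direct mapping between the input image and its corresponding dense map based on a fully-tractable diffusion bridge process. This approach addresses the aforementioned limitations of conventional diffusion-based solutions. In addition, we introduce finetuning strategies to adapt our model from a pretrained image diffusion backbone, leveraging its rich visual prior knowledge to facilitate both efficient training and robust generalization ability. Experimental results show that our DPBridge can achieve competitive performance compared to both feed-forward and diffusion-based approaches across various benchmarks, highlighting its effectiveness and adaptability.
Summary: Diffusion models have shown remarkable success on dense prediction problems, which aim to model the per-pixel relationship between RGB images and dense signal maps, thanks to their ability to capture complex data distributions. However, starting the reverse sampling trajectory from an uninformative noise prior brings limitations such as degraded performance and slow inference. In this work, we propose DPBridge, a generative framework that formulates dense prediction tasks as image-conditioned generation problems and builds a direct mapping between an input image and its corresponding dense map using a fully tractable diffusion bridge process. This approach addresses the above limitations of conventional diffusion-based solutions. We also introduce fine-tuning strategies that adapt our model from a pretrained image diffusion backbone, using its rich visual prior knowledge to support both efficient training and robust generalization. Experimental results show that DPBridge achieves competitive performance against both feed-forward and diffusion-based approaches across a range of benchmarks, highlighting its effectiveness and adaptability.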

Title: Goal-Conditioned Data Augmentation for Offline Reinforcement Learning

Authors: Xingshuai Huang, Di Wu, Benoit Boulet
Subjects: cs.LG, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2412.20519
Pdf URL: https://arxiv.org/pdf/2412.20519
Copy Paste: [[2412.20519]] Goal-Conditioned Data Augmentation for Offline Reinforcement Learning(https://arxiv.org/abs/2412.20519)
Keywords: generative
Abstract: Offline reinforcement learning (RL) enables policy learning from pre-collected offline datasets, relaxing the need to interact directly with the environment. However, limited by the quality of offline datasets, it generally fails to learn well-qualified policies in suboptimal datasets. To address datasets with insufficient optimal demonstrations, we introduce Goal-cOnditioned Data Augmentation (GODA), a novel goal-conditioned diffusion-based method for augmenting samples with higher quality. Leveraging recent advancements in generative modeling, GODA incorporates a novel return-oriented goal condition with various selection mechanisms. Specifically, we introduce a controllable scaling technique to provide enhanced return-based guidance during data sampling. GODA learns a comprehensive distribution representation of the original offline datasets while generating new data with selectively higher-return goals, thereby maximizing the utility of limited optimal demonstrations. Furthermore, we propose a novel adaptive gated conditioning method for processing noised inputs and conditions, enhancing the capture of goal-oriented guidance. We conduct experiments on the D4RL benchmark and real-world challenges, specifically traffic signal control (TSC) tasks, to demonstrate GODA's effectiveness in enhancing data quality and superior performance compared to state-of-the-art data augmentation methods across various offline RL algorithms.
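Summary: Offline reinforcement learning (RL) enables policy learning from pre-collected offline datasets, relaxing the need to interact directly with the environment. Limited by the quality of those datasets, however, it generally fails to learn well-qualified policies from suboptimal data. To handle datasets with too few optimal demonstrations, we introduce Goal-cOnditioned Data Augmentation (GODA), a novel goal-conditioned, diffusion-based method for augmenting samples with higher quality. Building on recent advances in generative modeling, GODA combines a novel return-oriented goal condition with various selection mechanisms. Specifically, we introduce a controllable scaling technique that provides enhanced return-based guidance during data sampling. GODA learns a comprehensive distribution representation of the original offline datasets while generating new data with selectively higher-return goals, thereby maximizing the utility of the limited optimal demonstrations. We further propose a novel adaptive gated conditioning method for processing noised inputs and conditions, improving the capture of goal-oriented guidance. We run experiments on the D4RL benchmark and on real-world challenges, specifically traffic signal control (TSC) tasks, to demonstrate GODA's effectiveness in improving data quality and its superior performance over state-of-the-art data augmentation methods across various offline RL algorithms.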

Title: Zero-Shot Image Restoration Using Few-Step Guidance of Consistency Models (and Beyond)

Authors: Tomer Garber, Tom Tirer
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.20596
Pdf URL: https://arxiv.org/pdf/2412.20596
Copy Paste: [[2412.20596]] Zero-Shot Image Restoration Using Few-Step Guidance of Consistency Models (and Beyond)(https://arxiv.org/abs/2412.20596)
Keywords: restoration, super-resolution, generation, generative
Abstract: In recent years, it has become popular to tackle image restoration tasks with a single pretrained diffusion model (DM) and data-fidelity guidance, instead of training a dedicated deep neural network per task. However, such "zero-shot" restoration schemes currently require many Neural Function Evaluations (NFEs) for performing well, which may be attributed to the many NFEs needed in the original generative functionality of the DMs. Recently, faster variants of DMs have been explored for image generation. These include Consistency Models (CMs), which can generate samples via a couple of NFEs. However, existing works that use guided CMs for restoration still require tens of NFEs or fine-tuning of the model per task that leads to performance drop if the assumptions during the fine-tuning are not accurate. In this paper, we propose a zero-shot restoration scheme that uses CMs and operates well with as little as 4 NFEs. It is based on a wise combination of several ingredients: better initialization, back-projection guidance, and above all a novel noise injection mechanism. We demonstrate the advantages of our approach for image super-resolution, deblurring and inpainting. Interestingly, we show that the usefulness of our noise injection technique goes beyond CMs: it can also mitigate the performance degradation of existing guided DM methods when reducing their NFE count.
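Summary: In recent years it has become popular to tackle image restoration tasks with a single pretrained diffusion model (DM) and data-fidelity guidance, rather than training a dedicated deep neural network per task. However, such "zero-shot" restoration schemes currently need many Neural Function Evaluations (NFEs) to perform well, which may be attributed to the many NFEs required by the original generative functionality of DMs. Faster DM variants have recently been explored for image generation, including Consistency Models (CMs), which can generate samples in a couple of NFEs. Yet existing work that uses guided CMs for restoration still requires tens of NFEs, or per-task fine-tuning of the model, which degrades performance when the assumptions made during fine-tuning are inaccurate. In this paper, we propose a zero-shot restoration scheme that uses CMs and works well with as few as 4 NFEs. It rests on a careful combination of several ingredients: better initialization, back-projection guidance, and above all a novel noise injection mechanism. We demonstrate the advantages of our approach for image super-resolution, deblurring, and inpainting. Interestingly, we show that the usefulness of our noise injection technique extends beyond CMs: it can also mitigate the performance degradation of existing guided DM methods when their NFE count is reduced.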
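Note: the "back-projection guidance" ingredient can be illustrated with the generic back-projection step x' = x - A⁺(Ax - y), which forces an estimate x to agree with the measurements y under a known linear degradation A. The Python sketch below is only a toy under stated assumptions: A is taken to be 1D pair-averaging downsampling (so its pseudo-inverse A⁺ is sample replication), and the function names are hypothetical. It is not the paper's algorithm, which combines such a step with a consistency model and noise injection.

```python
def downsample(x):
    """Toy degradation A: average each adjacent pair of samples."""
    if len(x) % 2 != 0:
        raise ValueError("signal length must be even")
    return [(x[i] + x[i + 1]) / 2.0 for i in range(0, len(x), 2)]


def pinv_upsample(y):
    """Pseudo-inverse A^+ of pair averaging: replicate each sample twice."""
    out = []
    for value in y:
        out.extend([value, value])
    return out


def back_project(x, y):
    """Return x - A^+(A x - y), so that downsample(result) equals y."""
    residual = [a - b for a, b in zip(downsample(x), y)]
    correction = pinv_upsample(residual)
    return [xi - ci for xi, ci in zip(x, correction)]


if __name__ == "__main__":
    estimate = [1.0, 3.0, 2.0, 6.0]   # current restoration estimate
    observed = [1.0, 5.0]             # low-resolution measurements
    print(back_project(estimate, observed))
```

The step changes only the part of the estimate that the measurements can see: the pair averages are replaced by the observed values, while the within-pair detail supplied by the generative prior is left untouched.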

Title: NetFlowGen: Leveraging Generative Pre-training for Network Traffic Dynamics

Authors: Jiawei Zhou, Woojeong Kim, Zhiying Xu, Alexander M. Rush, Minlan Yu
Subjects: cs.LG, cs.AI, cs.NI
Abstract URL: https://arxiv.org/abs/2412.20635
Pdf URL: https://arxiv.org/pdf/2412.20635
Copy Paste: [[2412.20635]] NetFlowGen: Leveraging Generative Pre-training for Network Traffic Dynamics(https://arxiv.org/abs/2412.20635)
Keywords: generative
Abstract: Understanding the traffic dynamics in networks is a core capability for automated systems to monitor and analyze networking behaviors, reducing expensive human efforts and economic risks through tasks such as traffic classification, congestion prediction, and attack detection. However, it is still challenging to accurately model network traffic with machine learning approaches in an efficient and broadly applicable manner. Task-specific models trained from scratch are used for different networking applications, which limits the efficiency of model development and generalization of model deployment. Furthermore, while networking data is abundant, high-quality task-specific labels are often insufficient for training individual models. Large-scale self-supervised learning on unlabeled data provides a natural pathway for tackling these challenges. We propose to pre-train a general-purpose machine learning model to capture traffic dynamics with only traffic data from NetFlow records, with the goal of fine-tuning for different downstream tasks with small amount of labels. Our presented NetFlowGen framework goes beyond a proof-of-concept for network traffic pre-training and addresses specific challenges such as unifying network feature representations, learning from large unlabeled traffic data volume, and testing on real downstream tasks in DDoS attack detection. Experiments demonstrate promising results of our pre-training framework on capturing traffic dynamics and adapting to different networking tasks.
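Summary: Understanding traffic dynamics in networks is a core capability for automated systems that monitor and analyze networking behavior, reducing costly human effort and economic risk through tasks such as traffic classification, congestion prediction, and attack detection. However, accurately modeling network traffic with machine learning in an efficient and broadly applicable way remains challenging. Task-specific models trained from scratch are used for different networking applications, which limits the efficiency of model development and the generalization of model deployment. Moreover, although networking data is abundant, high-quality task-specific labels are often too scarce to train individual models. Large-scale self-supervised learning on unlabeled data offers a natural path to tackling these challenges. We propose to pre-train a general-purpose machine learning model that captures traffic dynamics using only traffic data from NetFlow records, with the goal of fine-tuning it for different downstream tasks with a small number of labels. Our NetFlowGen framework goes beyond a proof of concept for network traffic pre-training and addresses specific challenges such as unifying network feature representations, learning from large volumes of unlabeled traffic data, and testing on a real downstream task in DDoS attack detection. Experiments show promising results for our pre-training framework in capturing traffic dynamics and adapting to different networking tasks.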

Title: SafeSynthDP: Leveraging Large Language Models for Privacy-Preserving Synthetic Data Generation Using Differential Privacy

Authors: Md Mahadi Hasan Nahid, Sadid Bin Hasan
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2412.20641
Pdf URL: https://arxiv.org/pdf/2412.20641
Copy Paste: [[2412.20641]] SafeSynthDP: Leveraging Large Language Models for Privacy-Preserving Synthetic Data Generation Using Differential Privacy(https://arxiv.org/abs/2412.20641)
Keywords: generation
Abstract: Machine learning (ML) models frequently rely on training data that may include sensitive or personal information, raising substantial privacy concerns. Legislative frameworks such as the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA) have necessitated the development of strategies that preserve privacy while maintaining the utility of data. In this paper, we investigate the capability of Large Language Models (LLMs) to generate synthetic datasets integrated with Differential Privacy (DP) mechanisms, thereby enabling data-driven research and model training without direct exposure of sensitive information. Our approach incorporates DP-based noise injection methods, including Laplace and Gaussian distributions, into the data generation process. We then evaluate the utility of these DP-enhanced synthetic datasets by comparing the performance of ML models trained on them against models trained on the original data. To substantiate privacy guarantees, we assess the resilience of the generated synthetic data to membership inference attacks and related threats. The experimental results demonstrate that integrating DP within LLM-driven synthetic data generation offers a viable balance between privacy protection and data utility. This study provides a foundational methodology and insight into the privacy-preserving capabilities of LLMs, paving the way for compliant and effective ML research and applications.
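Summary: Machine learning (ML) models frequently rely on training data that may contain sensitive or personal information, raising substantial privacy concerns. Legislative frameworks such as the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA) have made it necessary to develop strategies that preserve privacy while maintaining the utility of data. In this paper, we investigate the ability of Large Language Models (LLMs) to generate synthetic datasets integrated with Differential Privacy (DP) mechanisms, enabling data-driven research and model training without directly exposing sensitive information. Our approach incorporates DP-based noise injection methods, including Laplace and Gaussian distributions, into the data generation process. We then evaluate the utility of these DP-enhanced synthetic datasets by comparing the performance of ML models trained on them against models trained on the original data. To substantiate the privacy guarantees, we assess the resilience of the generated synthetic data to membership inference attacks and related threats. The experimental results show that integrating DP into LLM-driven synthetic data generation offers a viable balance between privacy protection and data utility. This study provides a foundational methodology and insight into the privacy-preserving capabilities of LLMs, paving the way for compliant and effective ML research and applications.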
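Note: the Laplace and Gaussian noise injection mentioned in this abstract refers to the textbook differential-privacy mechanisms, which can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's pipeline: the abstract does not say where in the LLM-driven generation process the noise is applied, so the sketch privatizes a single numeric query result, and every function name is hypothetical. The Gaussian calibration is the classical bound, which holds for 0 < epsilon < 1.

```python
import math
import random


def laplace_scale(sensitivity, epsilon):
    """Scale b of Laplace noise giving epsilon-DP for the given L1 sensitivity."""
    if sensitivity <= 0 or epsilon <= 0:
        raise ValueError("sensitivity and epsilon must be positive")
    return sensitivity / epsilon


def gaussian_sigma(sensitivity, epsilon, delta):
    """Classical (epsilon, delta)-DP calibration for the given L2 sensitivity."""
    if not (0 < epsilon < 1) or not (0 < delta < 1):
        raise ValueError("requires 0 < epsilon < 1 and 0 < delta < 1")
    return math.sqrt(2.0 * math.log(1.25 / delta)) * sensitivity / epsilon


def laplace_mechanism(value, sensitivity, epsilon, rng):
    """Add Laplace(0, b) noise; the difference of two Exp(1/b) draws is Laplace."""
    b = laplace_scale(sensitivity, epsilon)
    noise = rng.expovariate(1.0 / b) - rng.expovariate(1.0 / b)
    return value + noise


def gaussian_mechanism(value, sensitivity, epsilon, delta, rng):
    """Add zero-mean Gaussian noise with the calibrated standard deviation."""
    return value + rng.gauss(0.0, gaussian_sigma(sensitivity, epsilon, delta))


if __name__ == "__main__":
    rng = random.Random(0)
    true_count = 128.0  # e.g. a counting query, whose sensitivity is 1
    print(laplace_mechanism(true_count, 1.0, 0.5, rng))
    print(gaussian_mechanism(true_count, 1.0, 0.5, 1e-5, rng))
```

A smaller epsilon means a larger noise scale and stronger privacy at the cost of utility, which is the trade-off the abstract evaluates by training models on the noised synthetic data.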

Title: Latent Drifting in Diffusion Models for Counterfactual Medical Image Synthesis

Authors: Yousef Yeganeh, Ioannis Charisiadis, Marta Hasny, Martin Hartenberger, Björn Ommer, Nassir Navab, Azade Farshad, Ehsan Adeli
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.20651
Pdf URL: https://arxiv.org/pdf/2412.20651
Copy Paste: [[2412.20651]] Latent Drifting in Diffusion Models for Counterfactual Medical Image Synthesis(https://arxiv.org/abs/2412.20651)
Keywords: generation
Abstract: Scaling by training on large datasets has been shown to enhance the quality and fidelity of image generation and manipulation with diffusion models; however, such large datasets are not always accessible in medical imaging due to cost and privacy issues, which contradicts one of the main applications of such models to produce synthetic samples where real data is scarce. Also, finetuning on pre-trained general models has been a challenge due to the distribution shift between the medical domain and the pre-trained models. Here, we propose Latent Drift (LD) for diffusion models that can be adopted for any fine-tuning method to mitigate the issues faced by the distribution shift or employed in inference time as a condition. Latent Drifting enables diffusion models to be conditioned for medical images fitted for the complex task of counterfactual image generation, which is crucial to investigate how parameters such as gender, age, and adding or removing diseases in a patient would alter the medical images. We evaluate our method on three public longitudinal benchmark datasets of brain MRI and chest X-rays for counterfactual image generation. Our results demonstrate significant performance gains in various scenarios when combined with different fine-tuning schemes. The source code of this work will be publicly released upon its acceptance.
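Summary: Scaling by training on large datasets has been shown to improve the quality and fidelity of image generation and manipulation with diffusion models. However, such large datasets are not always accessible in medical imaging because of cost and privacy issues, which is at odds with one of the main applications of these models: producing synthetic samples where real data is scarce. In addition, fine-tuning pre-trained general models has been difficult because of the distribution shift between the medical domain and the pre-trained models. Here, we propose Latent Drift (LD) for diffusion models, which can be adopted with any fine-tuning method to mitigate the problems caused by distribution shift, or used at inference time as a condition. Latent Drifting enables diffusion models to be conditioned on medical images for the complex task of counterfactual image generation, which is crucial for investigating how parameters such as gender, age, and the addition or removal of diseases in a patient would alter medical images. We evaluate our method on three public longitudinal benchmark datasets of brain MRI and chest X-rays for counterfactual image generation. Our results show significant performance gains in various scenarios when combined with different fine-tuning schemes. The source code of this work will be released publicly upon acceptance.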

Title: Overcoming Class Imbalance: Unified GNN Learning with Structural and Semantic Connectivity Representations

Authors: Abdullah Alchihabi, Hao Yan, Yuhong Guo
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.20656
Pdf URL: https://arxiv.org/pdf/2412.20656
Copy Paste: [[2412.20656]] Overcoming Class Imbalance: Unified GNN Learning with Structural and Semantic Connectivity Representations(https://arxiv.org/abs/2412.20656)
Keywords: generation
Abstract: Class imbalance is pervasive in real-world graph datasets, where the majority of annotated nodes belong to a small set of classes (majority classes), leaving many other classes (minority classes) with only a handful of labeled nodes. Graph Neural Networks (GNNs) suffer from significant performance degradation in the presence of class imbalance, exhibiting bias towards majority classes and struggling to generalize effectively on minority classes. This limitation stems, in part, from the message passing process, leading GNNs to overfit to the limited neighborhood of annotated nodes from minority classes and impeding the propagation of discriminative information throughout the entire graph. In this paper, we introduce a novel Unified Graph Neural Network Learning (Uni-GNN) framework to tackle class-imbalanced node classification. The proposed framework seamlessly integrates both structural and semantic connectivity representations through semantic and structural node encoders. By combining these connectivity types, Uni-GNN extends the propagation of node embeddings beyond immediate neighbors, encompassing non-adjacent structural nodes and semantically similar nodes, enabling efficient diffusion of discriminative information throughout the graph. Moreover, to harness the potential of unlabeled nodes within the graph, we employ a balanced pseudo-label generation mechanism that augments the pool of available labeled nodes from minority classes in the training set. Experimental results underscore the superior performance of our proposed Uni-GNN framework compared to state-of-the-art class-imbalanced graph learning baselines across multiple benchmark datasets.
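Summary: Class imbalance is pervasive in real-world graph datasets, where most annotated nodes belong to a small set of classes (majority classes) and many other classes (minority classes) have only a handful of labeled nodes. Graph Neural Networks (GNNs) suffer significant performance degradation under class imbalance, showing bias toward majority classes and struggling to generalize to minority classes. This limitation stems in part from the message passing process, which leads GNNs to overfit to the limited neighborhood of annotated minority-class nodes and impedes the propagation of discriminative information across the graph. In this paper, we introduce a novel Unified Graph Neural Network Learning (Uni-GNN) framework for class-imbalanced node classification. The framework integrates structural and semantic connectivity representations through semantic and structural node encoders. By combining these connectivity types, Uni-GNN extends the propagation of node embeddings beyond immediate neighbors to non-adjacent structural nodes and semantically similar nodes, enabling efficient diffusion of discriminative information throughout the graph. To harness the unlabeled nodes in the graph, we also employ a balanced pseudo-label generation mechanism that enlarges the pool of labeled minority-class nodes in the training set. Experimental results underscore the superior performance of the proposed Uni-GNN framework over state-of-the-art class-imbalanced graph learning baselines on multiple benchmark datasets.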

Title: Enhancing Table Recognition with Vision LLMs: A Benchmark and Neighbor-Guided Toolchain Reasoner

Authors: Yitong Zhou, Mingyue Cheng, Qingyang Mao, Qi Liu, Feiyang Xu, Xin Li, Enhong Chen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.20662
Pdf URL: https://arxiv.org/pdf/2412.20662
Copy Paste: [[2412.20662]] Enhancing Table Recognition with Vision LLMs: A Benchmark and Neighbor-Guided Toolchain Reasoner(https://arxiv.org/abs/2412.20662)
Keywords: generation
Abstract: Pre-trained foundation models have recently significantly progressed in structured table understanding and reasoning. However, despite advancements in areas such as table semantic understanding and table question answering, recognizing the structure and content of unstructured tables using Vision Large Language Models (VLLMs) remains under-explored. In this work, we address this research gap by employing VLLMs in a training-free reasoning paradigm. First, we design a benchmark with various hierarchical dimensions relevant to table recognition. Subsequently, we conduct in-depth evaluations using pre-trained VLLMs, finding that low-quality image input is a significant bottleneck in the recognition process. Drawing inspiration from these findings, we propose the Neighbor-Guided Toolchain Reasoner (NGTR) framework, which is characterized by integrating multiple lightweight models for low-level visual processing operations aimed at mitigating issues with low-quality input images. Specifically, we utilize a neighbor retrieval mechanism to guide the generation of multiple tool invocation plans, transferring tool selection experiences from similar neighbors to the given input, thereby facilitating suitable tool selection. Additionally, we introduce a reflection module to supervise the tool invocation process. Extensive experiments on public table recognition datasets demonstrate that our approach significantly enhances the recognition capabilities of the vanilla VLLMs. We believe that the designed benchmark and the proposed NGTR framework could provide an alternative solution in table recognition.
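Summary: Pre-trained foundation models have recently made significant progress in structured table understanding and reasoning. However, despite advances in areas such as table semantic understanding and table question answering, recognizing the structure and content of unstructured tables with Vision Large Language Models (VLLMs) remains under-explored. In this work, we address this gap by using VLLMs in a training-free reasoning paradigm. First, we design a benchmark with various hierarchical dimensions relevant to table recognition. We then run in-depth evaluations with pre-trained VLLMs and find that low-quality image input is a significant bottleneck in the recognition process. Drawing on these findings, we propose the Neighbor-Guided Toolchain Reasoner (NGTR) framework, which integrates multiple lightweight models for low-level visual processing operations aimed at mitigating the problems of low-quality input images. Specifically, we use a neighbor retrieval mechanism to guide the generation of multiple tool invocation plans, transferring tool selection experience from similar neighbors to the given input and thereby facilitating suitable tool selection. We also introduce a reflection module to supervise the tool invocation process. Extensive experiments on public table recognition datasets show that our approach significantly enhances the recognition capabilities of vanilla VLLMs. We believe the designed benchmark and the proposed NGTR framework can offer an alternative solution for table recognition.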

Title: HFI: A unified framework for training-free detection and implicit watermarking of latent diffusion model generated images

Authors: Sungik Choi, Sungwoo Park, Jaehoon Lee, Seunghyun Kim, Stanley Jungkyu Choi, Moontae Lee
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.20704
Pdf URL: https://arxiv.org/pdf/2412.20704
Copy Paste: [[2412.20704]] HFI: A unified framework for training-free detection and implicit watermarking of latent diffusion model generated images(https://arxiv.org/abs/2412.20704)
Keywords: generative
Abstract: Dramatic advances in the quality of the latent diffusion models (LDMs) also led to the malicious use of AI-generated images. While current AI-generated image detection methods assume the availability of real/AI-generated images for training, this is practically limited given the vast expressibility of LDMs. This motivates the training-free detection setup where no related data are available in advance. The existing LDM-generated image detection method assumes that images generated by LDM are easier to reconstruct using an autoencoder than real images. However, we observe that this reconstruction distance is overfitted to background information, leading the current method to underperform in detecting images with simple backgrounds. To address this, we propose a novel method called HFI. Specifically, by viewing the autoencoder of LDM as a downsampling-upsampling kernel, HFI measures the extent of aliasing, a distortion of high-frequency information that appears in the reconstructed image. HFI is training-free, efficient, and consistently outperforms other training-free methods in detecting challenging images generated by various generative models. We also show that HFI can successfully detect the images generated from the specified LDM as a means of implicit watermarking. HFI outperforms the best baseline method while achieving magnitudes of

Title: 4D Gaussian Splatting: Modeling Dynamic Scenes with Native 4D Primitives

Authors: Zeyu Yang, Zijie Pan, Xiatian Zhu, Li Zhang, Yu-Gang Jiang, Philip H.S. Torr
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.20720
Pdf URL: https://arxiv.org/pdf/2412.20720
Copy Paste: [[2412.20720]] 4D Gaussian Splatting: Modeling Dynamic Scenes with Native 4D Primitives(https://arxiv.org/abs/2412.20720)
Keywords: generation
Abstract: Dynamic 3D scene representation and novel view synthesis from captured videos are crucial for enabling immersive experiences required by AR/VR and metaverse applications. However, this task is challenging due to the complexity of unconstrained real-world scenes and their temporal dynamics. In this paper, we frame dynamic scenes as a spatio-temporal 4D volume learning problem, offering a native explicit reformulation with minimal assumptions about motion, which serves as a versatile dynamic scene learning framework. Specifically, we represent a target dynamic scene using a collection of 4D Gaussian primitives with explicit geometry and appearance features, dubbed as 4D Gaussian splatting (4DGS). This approach can capture relevant information in space and time by fitting the underlying spatio-temporal volume. Modeling the spacetime as a whole with 4D Gaussians parameterized by anisotropic ellipses that can rotate arbitrarily in space and time, our model can naturally learn view-dependent and time-evolved appearance with 4D spherindrical harmonics. Notably, our 4DGS model is the first solution that supports real-time rendering of high-resolution, photorealistic novel views for complex dynamic scenes. To enhance efficiency, we derive several compact variants that effectively reduce memory footprint and mitigate the risk of overfitting. Extensive experiments validate the superiority of 4DGS in terms of visual quality and efficiency across a range of dynamic scene-related tasks (e.g., novel view synthesis, 4D generation, scene understanding) and scenarios (e.g., single object, indoor scenes, driving environments, synthetic and real data).

Title: Dialogue Director: Bridging the Gap in Dialogue Visualization for Multimodal Storytelling

Authors: Min Zhang, Zilin Wang, Liyan Chen, Kunhong Liu, Juncong Lin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.20725
Pdf URL: https://arxiv.org/pdf/2412.20725
Copy Paste: [[2412.20725]] Dialogue Director: Bridging the Gap in Dialogue Visualization for Multimodal Storytelling(https://arxiv.org/abs/2412.20725)
Keywords: generation
Abstract: Recent advances in AI-driven storytelling have enhanced video generation and story visualization. However, translating dialogue-centric scripts into coherent storyboards remains a significant challenge due to limited script detail, inadequate physical context understanding, and the complexity of integrating cinematic principles. To address these challenges, we propose Dialogue Visualization, a novel task that transforms dialogue scripts into dynamic, multi-view storyboards. We introduce Dialogue Director, a training-free multimodal framework comprising a Script Director, Cinematographer, and Storyboard Maker. This framework leverages large multimodal models and diffusion-based architectures, employing techniques such as Chain-of-Thought reasoning, Retrieval-Augmented Generation, and multi-view synthesis to improve script understanding, physical context comprehension, and cinematic knowledge integration. Experimental results demonstrate that Dialogue Director outperforms state-of-the-art methods in script interpretation, physical world understanding, and cinematic principle application, significantly advancing the quality and controllability of dialogue-based story visualization.

Title: Advancing Parkinson's Disease Progression Prediction: Comparing Long Short-Term Memory Networks and Kolmogorov-Arnold Networks

Authors: Abhinav Roy, Bhavesh Gyanchandani, Aditya Oza, Abhishek Sharma
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.20744
Pdf URL: https://arxiv.org/pdf/2412.20744
Copy Paste: [[2412.20744]] Advancing Parkinson's Disease Progression Prediction: Comparing Long Short-Term Memory Networks and Kolmogorov-Arnold Networks(https://arxiv.org/abs/2412.20744)
Keywords: generative
Abstract: Parkinson's Disease (PD) is a degenerative neurological disorder that impairs motor and non-motor functions, significantly reducing quality of life and increasing mortality risk. Early and accurate detection of PD progression is vital for effective management and improved patient outcomes. Current diagnostic methods, however, are often costly, time-consuming, and require specialized equipment and expertise. This work proposes an innovative approach to predicting PD progression using regression methods, Long Short-Term Memory (LSTM) networks, and Kolmogorov Arnold Networks (KAN). KAN, utilizing spline-parametrized univariate functions, allows for dynamic learning of activation patterns, unlike traditional linear models. The Movement Disorder Society-Sponsored Revision of the Unified Parkinson's Disease Rating Scale (MDS-UPDRS) is a comprehensive tool for evaluating PD symptoms and is commonly used to measure disease progression. Additionally, protein or peptide abnormalities are linked to PD onset and progression. Identifying these associations can aid in predicting disease progression and understanding molecular changes. Comparing multiple models, including LSTM and KAN, this study aims to identify the method that delivers the highest metrics. The analysis reveals that KAN, with its dynamic learning capabilities, outperforms other approaches in predicting PD progression. This research highlights the potential of AI and machine learning in healthcare, paving the way for advanced computational models to enhance clinical predictions and improve patient care and treatment strategies in PD management.

Title: VMix: Improving Text-to-Image Diffusion Model with Cross-Attention Mixing Control

Authors: Shaojin Wu, Fei Ding, Mengqi Huang, Wei Liu, Qian He
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.20800
Pdf URL: https://arxiv.org/pdf/2412.20800
Copy Paste: [[2412.20800]] VMix: Improving Text-to-Image Diffusion Model with Cross-Attention Mixing Control(https://arxiv.org/abs/2412.20800)
Keywords: generation
Abstract: While diffusion models show extraordinary talents in text-to-image generation, they may still fail to generate highly aesthetic images. More specifically, there is still a gap between the generated images and the real-world aesthetic images in finer-grained dimensions including color, lighting, composition, etc. In this paper, we propose Cross-Attention Value Mixing Control (VMix) Adapter, a plug-and-play aesthetics adapter, to upgrade the quality of generated images while maintaining generality across visual concepts by (1) disentangling the input text prompt into the content description and aesthetic description by the initialization of aesthetic embedding, and (2) integrating aesthetic conditions into the denoising process through value-mixed cross-attention, with the network connected by zero-initialized linear layers. Our key insight is to enhance the aesthetic presentation of existing diffusion models by designing a superior condition control method, all while preserving the image-text alignment. Through our meticulous design, VMix is flexible enough to be applied to community models for better visual performance without retraining. To validate the effectiveness of our method, we conducted extensive experiments, showing that VMix outperforms other state-of-the-art methods and is compatible with other community modules (e.g., LoRA, ControlNet, and IPAdapter) for image generation. The project page is this https URL.

Title: TimeRAF: Retrieval-Augmented Foundation model for Zero-shot Time Series Forecasting

Authors: Huanyu Zhang, Chang Xu, Yi-Fan Zhang, Zhang Zhang, Liang Wang, Jiang Bian, Tieniu Tan
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.20810
Pdf URL: https://arxiv.org/pdf/2412.20810
Copy Paste: [[2412.20810]] TimeRAF: Retrieval-Augmented Foundation model for Zero-shot Time Series Forecasting(https://arxiv.org/abs/2412.20810)
Keywords: generation
Abstract: Time series forecasting plays a crucial role in data mining, driving rapid advancements across numerous industries. With the emergence of large models, time series foundation models (TSFMs) have exhibited remarkable generalization capabilities, such as zero-shot learning, through large-scale pre-training. Meanwhile, Retrieval-Augmented Generation (RAG) methods have been widely employed to enhance the performance of foundation models on unseen data, allowing models to access external knowledge. In this paper, we introduce TimeRAF, a Retrieval-Augmented Forecasting model that enhances zero-shot time series forecasting through retrieval-augmented techniques. We develop customized time series knowledge bases that are tailored to the specific forecasting tasks. TimeRAF employs an end-to-end learnable retriever to extract valuable information from the knowledge base. Additionally, we propose Channel Prompting for knowledge integration, which effectively extracts relevant information from the retrieved knowledge along the channel dimension. Extensive experiments demonstrate the effectiveness of our model, showing significant improvement across various domains and datasets.

Title: ReFlow6D: Refraction-Guided Transparent Object 6D Pose Estimation via Intermediate Representation Learning

Authors: Hrishikesh Gupta, Stefan Thalhammer, Jean-Baptiste Weibel, Alexander Haberl, Markus Vincze
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2412.20830
Pdf URL: https://arxiv.org/pdf/2412.20830
Copy Paste: [[2412.20830]] ReFlow6D: Refraction-Guided Transparent Object 6D Pose Estimation via Intermediate Representation Learning(https://arxiv.org/abs/2412.20830)
Keywords: generation
Abstract: Transparent objects are ubiquitous in daily life, making their perception and robotic manipulation important. However, they present a major challenge due to their distinct refractive and reflective properties when it comes to accurately estimating the 6D pose. To solve this, we present ReFlow6D, a novel method for transparent object 6D pose estimation that harnesses the refractive-intermediate representation. Unlike conventional approaches, our method leverages a feature space impervious to changes in RGB image space and independent of depth information. Drawing inspiration from image matting, we model the deformation of the light path through transparent objects, yielding a unique object-specific intermediate representation guided by light refraction that is independent of the environment in which objects are observed. By integrating these intermediate features into the pose estimation network, we show that ReFlow6D achieves precise 6D pose estimation of transparent objects, using only RGB images as input. Our method further introduces a novel transparent object compositing loss, fostering the generation of superior refractive-intermediate features. Empirical evaluations show that our approach significantly outperforms state-of-the-art methods on TOD and Trans32K-6D datasets. Robot grasping experiments further demonstrate that ReFlow6D's pose estimation accuracy effectively translates to real-world robotics tasks. The source code is available at: this https URL and this https URL.

Title: Inclusion 2024 Global Multimedia Deepfake Detection: Towards Multi-dimensional Facial Forgery Detection

Authors: Yi Zhang, Weize Gao, Changtao Miao, Man Luo, Jianshu Li, Wenzhong Deng, Zhe Li, Bingyu Hu, Weibin Yao, Wenbo Zhou, Tao Gong, Qi Chu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.20833
Pdf URL: https://arxiv.org/pdf/2412.20833
Copy Paste: [[2412.20833]] Inclusion 2024 Global Multimedia Deepfake Detection: Towards Multi-dimensional Facial Forgery Detection(https://arxiv.org/abs/2412.20833)
Keywords: generation
Abstract: In this paper, we present the Global Multimedia Deepfake Detection challenge, held concurrently with Inclusion 2024. Our Multimedia Deepfake Detection challenge aims to detect automatic image and audio-video manipulations including but not limited to editing, synthesis, generation, Photoshop, etc. The challenge attracted 1500 teams from all over the world, with about 5000 valid result submissions. We invited the top 20 teams to present their solutions to the challenge, from which the top 3 teams were awarded prizes in the grand finale. In this paper, we present the solutions from the top 3 teams of the two tracks, to boost research in the field of image and audio-video forgery detection. The methodologies developed through the challenge will contribute to the development of next-generation deepfake detection systems, and we encourage participants to open source their methods.

Title: DDIM sampling for Generative AIBIM, a faster intelligent structural design framework

Authors: Zhili He, Yu-Hsing Wang
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.20899
Pdf URL: https://arxiv.org/pdf/2412.20899
Copy Paste: [[2412.20899]] DDIM sampling for Generative AIBIM, a faster intelligent structural design framework(https://arxiv.org/abs/2412.20899)
Keywords: generation, generative
Abstract: Generative AIBIM, a successful structural design pipeline, has proven its ability to intelligently generate high-quality, diverse, and creative shear wall designs that are tailored to specific physical conditions. However, the current module of Generative AIBIM that generates designs, known as the physics-based conditional diffusion model (PCDM), necessitates 1000 iterations for each generation due to its reliance on the denoising diffusion probabilistic model (DDPM) sampling process. This leads to a time-consuming and computationally demanding generation process. To address this issue, this study introduces the denoising diffusion implicit model (DDIM), an accelerated generation method that replaces the DDPM sampling process in PCDM. While the original DDIM was designed for DDPM and the optimization process of PCDM differs from that of DDPM, this paper designs "DDIM sampling for PCDM," which modifies the original DDIM formulations to adapt to the optimization process of PCDM. Experimental results demonstrate that DDIM sampling for PCDM can accelerate the generation process of the original PCDM by a factor of 100 while maintaining the same visual quality in the generated results. This study demonstrates the effectiveness of DDIM sampling for PCDM in expediting intelligent structural design. Furthermore, this paper reorganizes the contents of DDIM, focusing on the practical usage of DDIM. This change is particularly meaningful for researchers who may not possess a strong background in machine learning theory but are interested in utilizing the tool effectively.
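Illustrative note: the 100x speed-up reported in this abstract is consistent with running the sampler on roughly 10 of the 1000 training timesteps. The sketch below assumes an evenly spaced timestep subsequence and the standard deterministic DDIM update (eta = 0) on scalar values. The PCDM-specific modifications are not described in the abstract and are not reproduced here, and the names `ddim_timesteps` and `ddim_step` are hypothetical.

```python
import math


def ddim_timesteps(num_train_steps: int, num_sample_steps: int) -> list[int]:
    """Return an evenly spaced, descending subsequence of training timesteps."""
    if not 0 < num_sample_steps <= num_train_steps:
        raise ValueError("num_sample_steps must be in 1..num_train_steps")
    stride = num_train_steps // num_sample_steps
    return list(range(num_train_steps - 1, -1, -stride))[:num_sample_steps]


def ddim_step(x_t: float, eps_pred: float, alpha_bar_t: float, alpha_bar_prev: float) -> float:
    """One deterministic DDIM update (eta = 0) for a scalar sample.

    alpha_bar_* are cumulative products of (1 - beta) from the noise schedule.
    """
    # Estimate the clean sample implied by the predicted noise.
    x0_pred = (x_t - math.sqrt(1.0 - alpha_bar_t) * eps_pred) / math.sqrt(alpha_bar_t)
    # Move to the earlier timestep along the same predicted noise direction.
    return math.sqrt(alpha_bar_prev) * x0_pred + math.sqrt(1.0 - alpha_bar_prev) * eps_pred


steps = ddim_timesteps(1000, 10)  # 10 sampling steps instead of 1000
```

Because the update is deterministic, the number of network evaluations equals the length of the timestep subsequence, which is where the saving comes from.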

Title: ILDiff: Generate Transparent Animated Stickers by Implicit Layout Distillation

Authors: Ting Zhang, Zhiqiang Yuan, Yeshuang Zhu, Jinchao Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.20901
Pdf URL: https://arxiv.org/pdf/2412.20901
Copy Paste: [[2412.20901]] ILDiff: Generate Transparent Animated Stickers by Implicit Layout Distillation(https://arxiv.org/abs/2412.20901)
Keywords: generation
Abstract: High-quality animated stickers usually contain transparent channels, which are often ignored by current video generation models. To generate fine-grained animated transparency channels, existing methods can be roughly divided into video matting algorithms and diffusion-based algorithms. The methods based on video matting perform poorly on semi-open areas in stickers, while diffusion-based methods are often used to model a single image, which leads to local flicker when modeling animated stickers. In this paper, we first propose the ILDiff method to generate animated transparent channels through implicit layout distillation, which solves the problems of semi-open area collapse and the lack of temporal information in existing methods. Second, we create the Transparent Animated Sticker Dataset (TASD), which contains 0.32M high-quality samples with transparent channels, to provide data support for related fields. Extensive experiments demonstrate that ILDiff can produce finer and smoother transparent channels compared to other methods such as Matting Anything and Layer Diffusion. Our code and dataset will be released at this https URL.

Title: Low-Light Image Enhancement via Generative Perceptual Priors

Authors: Han Zhou, Wei Dong, Xiaohong Liu, Yulun Zhang, Guangtao Zhai, Jun Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.20916
Pdf URL: https://arxiv.org/pdf/2412.20916
Copy Paste: [[2412.20916]] Low-Light Image Enhancement via Generative Perceptual Priors(https://arxiv.org/abs/2412.20916)
Keywords: generative
Abstract: Although significant progress has been made in enhancing visibility, retrieving texture details, and mitigating noise in Low-Light (LL) images, the challenge persists in applying current Low-Light Image Enhancement (LLIE) methods to real-world scenarios, primarily due to the diverse illumination conditions encountered. Furthermore, the quest for generating enhancements that are visually realistic and attractive remains an underexplored realm. In response to these challenges, we introduce a novel LLIE framework with the guidance of Generative Perceptual Priors (GPP-LLIE) derived from vision-language models (VLMs). Specifically, we first propose a pipeline that guides VLMs to assess multiple visual attributes of the LL image and quantify the assessment to output the global and local perceptual priors. Subsequently, to incorporate these generative perceptual priors to benefit LLIE, we introduce a transformer-based backbone in the diffusion process, and develop a new layer normalization (GPP-LN) and an attention mechanism (LPP-Attn) guided by global and local perceptual priors. Extensive experiments demonstrate that our model outperforms current SOTA methods on paired LL datasets and exhibits superior generalization on real-world data. The code is released at this https URL.

Title: HisynSeg: Weakly-Supervised Histopathological Image Segmentation via Image-Mixing Synthesis and Consistency Regularization

Authors: Zijie Fang, Yifeng Wang, Peizhang Xie, Zhi Wang, Yongbing Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.20924
Pdf URL: https://arxiv.org/pdf/2412.20924
Copy Paste: [[2412.20924]] HisynSeg: Weakly-Supervised Histopathological Image Segmentation via Image-Mixing Synthesis and Consistency Regularization(https://arxiv.org/abs/2412.20924)
Keywords: generation
Abstract: Tissue semantic segmentation is one of the key tasks in computational pathology. To avoid the expensive and laborious acquisition of pixel-level annotations, a wide range of studies attempt to adopt the class activation map (CAM), a weakly-supervised learning scheme, to achieve pixel-level tissue segmentation. However, CAM-based methods are prone to suffer from under-activation and over-activation issues, leading to poor segmentation performance. To address this problem, we propose a novel weakly-supervised semantic segmentation framework for histopathological images based on image-mixing synthesis and consistency regularization, dubbed HisynSeg. Specifically, synthesized histopathological images with pixel-level masks are generated for fully-supervised model training, where two synthesis strategies are proposed based on Mosaic transformation and Bézier mask generation. Besides, an image filtering module is developed to guarantee the authenticity of the synthesized images. In order to further avoid the model overfitting to the occasional synthesis artifacts, we additionally propose a novel self-supervised consistency regularization, which enables the real images without segmentation masks to supervise the training of the segmentation model. By integrating the proposed techniques, the HisynSeg framework successfully transforms the weakly-supervised semantic segmentation problem into a fully-supervised one, greatly improving the segmentation accuracy. Experimental results on three datasets prove that the proposed method achieves a state-of-the-art performance. Code is available at this https URL.

Title: Enhanced Multimodal RAG-LLM for Accurate Visual Question Answering

Authors: Junxiao Xue, Quan Deng, Fei Yu, Yanhao Wang, Jun Wang, Yuehua Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.20927
Pdf URL: https://arxiv.org/pdf/2412.20927
Copy Paste: [[2412.20927]] Enhanced Multimodal RAG-LLM for Accurate Visual Question Answering(https://arxiv.org/abs/2412.20927)
Keywords: generation
Abstract: Multimodal large language models (MLLMs), such as GPT-4o, Gemini, LLaVA, and Flamingo, have made significant progress in integrating visual and textual modalities, excelling in tasks like visual question answering (VQA), image captioning, and content retrieval. They can generate coherent and contextually relevant descriptions of images. However, they still face challenges in accurately identifying and counting objects and determining their spatial locations, particularly in complex scenes with overlapping or small objects. To address these limitations, we propose a novel framework based on multimodal retrieval-augmented generation (RAG), which introduces structured scene graphs to enhance object recognition, relationship identification, and spatial understanding within images. Our framework improves the MLLM's capacity to handle tasks requiring precise visual descriptions, especially in scenarios with challenging perspectives, such as aerial views or scenes with dense object arrangements. Finally, we conduct extensive experiments on the VG-150 dataset that focuses on first-person visual understanding and the AUG dataset that involves aerial imagery. The results show that our approach consistently outperforms existing MLLMs in VQA tasks, which stands out in recognizing, localizing, and quantifying objects in different spatial contexts and provides more accurate visual descriptions.

Title: Efficiently Serving LLM Reasoning Programs with Certaindex

Authors: Yichao Fu, Junda Chen, Siqi Zhu, Zheyu Fu, Zhongdongming Dai, Aurick Qiao, Hao Zhang
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2412.20993
Pdf URL: https://arxiv.org/pdf/2412.20993
Copy Paste: [[2412.20993]] Efficiently Serving LLM Reasoning Programs with Certaindex(https://arxiv.org/abs/2412.20993)
Keywords: generation
Abstract: The rapid evolution of large language models (LLMs) has unlocked their capabilities in advanced reasoning tasks like mathematical problem-solving, code generation, and legal analysis. Central to this progress are inference-time reasoning algorithms, which refine outputs by exploring multiple solution paths, at the cost of increasing compute demands and response latencies. Existing serving systems fail to adapt to the scaling behaviors of these algorithms or the varying difficulty of queries, leading to inefficient resource use and unmet latency targets. We present Dynasor, a system that optimizes inference-time compute for LLM reasoning queries. Unlike traditional engines, Dynasor tracks and schedules requests within reasoning queries and uses Certaindex, a proxy that measures statistical reasoning progress based on model certainty, to guide compute allocation dynamically. Dynasor co-adapts scheduling with reasoning progress: it allocates more compute to hard queries, reduces compute for simpler ones, and terminates unpromising queries early, balancing accuracy, latency, and cost. On diverse datasets and algorithms, Dynasor reduces compute by up to 50% in batch processing and sustains 3.3x higher query rates or 4.7x tighter latency SLOs in online serving.
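Illustrative note: the idea of spending less compute on easy queries can be sketched with self-consistency sampling that stops once the answers agree strongly enough. The certainty measure used here (vote share of the most common answer) is an assumption chosen for illustration and is not the paper's Certaindex definition. The function names and the `threshold`, `min_samples`, and `max_samples` parameters are hypothetical.

```python
from collections import Counter
from typing import Callable


def certainty(answers: list[str]) -> float:
    """Vote share of the most common answer (illustrative certainty proxy)."""
    if not answers:
        return 0.0
    return Counter(answers).most_common(1)[0][1] / len(answers)


def answer_with_budget(
    sample_answer: Callable[[], str],
    max_samples: int = 16,
    min_samples: int = 3,
    threshold: float = 0.8,
) -> tuple[str, int]:
    """Sample answers until the certainty proxy is high or the budget runs out.

    Returns the majority answer and the number of samples actually used.
    """
    answers: list[str] = []
    for _ in range(max_samples):
        answers.append(sample_answer())
        # Stop early once enough samples agree, saving the rest of the budget.
        if len(answers) >= min_samples and certainty(answers) >= threshold:
            break
    best = Counter(answers).most_common(1)[0][0]
    return best, len(answers)
```

In this sketch an easy query whose samples all agree stops after `min_samples` calls, while a query with split answers consumes the full budget.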

Title: EdgeRAG: Online-Indexed RAG for Edge Devices

Authors: Korakit Seemakhupt, Sihang Liu, Samira Khan
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.21023
Pdf URL: https://arxiv.org/pdf/2412.21023
Copy Paste: [[2412.21023]] EdgeRAG: Online-Indexed RAG for Edge Devices(https://arxiv.org/abs/2412.21023)
Keywords: generation
Abstract: Deploying Retrieval Augmented Generation (RAG) on resource-constrained edge devices is challenging due to limited memory and processing power. In this work, we propose EdgeRAG, which addresses the memory constraint by pruning embeddings within clusters and generating embeddings on-demand during retrieval. To avoid the latency of generating embeddings for large tail clusters, EdgeRAG pre-computes and stores embeddings for these clusters, while adaptively caching the remaining embeddings to minimize redundant computations and further optimize latency. Results from the BEIR suite show that EdgeRAG offers a significant latency reduction over the baseline IVF index with similar generation quality, while allowing all of our evaluated datasets to fit into memory.
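Note: the idea of keeping only cluster centroids resident and regenerating per-chunk embeddings at query time can be sketched as below. This is a toy illustration, not EdgeRAG's code. The class `OnDemandIVF`, the `embed` callback, and the plain LRU cache are all assumptions; the LRU cache stands in for the adaptive caching the abstract describes, and the pre-computed storage for large tail clusters is omitted.

```python
from collections import OrderedDict
from typing import Callable, Dict, List, Sequence

Vector = List[float]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


class OnDemandIVF:
    """Toy IVF-style index: centroids stay in memory, chunk embeddings do not."""

    def __init__(
        self,
        embed: Callable[[str], Vector],   # hypothetical text -> vector function
        centroids: List[Vector],          # one vector per cluster
        clusters: Dict[int, List[str]],   # cluster id -> raw text chunks
        cache_size: int = 2,
    ) -> None:
        self.embed = embed
        self.centroids = centroids
        self.clusters = clusters
        self.cache: "OrderedDict[int, List[Vector]]" = OrderedDict()
        self.cache_size = cache_size
        self.embed_calls = 0  # counts chunk embeddings generated on demand

    def _cluster_embeddings(self, cid: int) -> List[Vector]:
        if cid in self.cache:
            self.cache.move_to_end(cid)  # mark as most recently used
            return self.cache[cid]
        vectors = []
        for text in self.clusters[cid]:
            vectors.append(self.embed(text))
            self.embed_calls += 1
        self.cache[cid] = vectors
        if len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)  # evict least recently used cluster
        return vectors

    def search(self, query: str, top_k: int = 1) -> List[str]:
        q = self.embed(query)
        # First level: pick the closest centroid.
        cid = max(range(len(self.centroids)), key=lambda i: dot(q, self.centroids[i]))
        # Second level: embed (or fetch cached) chunks of that cluster and rank them.
        vectors = self._cluster_embeddings(cid)
        ranked = sorted(
            zip(self.clusters[cid], vectors),
            key=lambda tv: dot(q, tv[1]),
            reverse=True,
        )
        return [text for text, _ in ranked[:top_k]]
```

A repeated query against the same cluster is served from the cache, so `embed_calls` does not grow on the second search.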

Title: Visual Style Prompt Learning Using Diffusion Models for Blind Face Restoration

Authors: Wanglong Lu, Jikai Wang, Tao Wang, Kaihao Zhang, Xianta Jiang, Hanli Zhao
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2412.21042
Pdf URL: https://arxiv.org/pdf/2412.21042
Copy Paste: [[2412.21042]] Visual Style Prompt Learning Using Diffusion Models for Blind Face Restoration(https://arxiv.org/abs/2412.21042)
Keywords: restoration, generative
Abstract: Blind face restoration aims to recover high-quality facial images from various unidentified sources of degradation, posing significant challenges due to the minimal information retrievable from the degraded images. Prior knowledge-based methods, leveraging geometric priors and facial features, have led to advancements in face restoration but often fall short of capturing fine details. To address this, we introduce a visual style prompt learning framework that utilizes diffusion probabilistic models to explicitly generate visual prompts within the latent space of pre-trained generative models. These prompts are designed to guide the restoration process. To fully utilize the visual prompts and enhance the extraction of informative and rich patterns, we introduce a style-modulated aggregation transformation layer. Extensive experiments and applications demonstrate the superiority of our method in achieving high-quality blind face restoration. The source code is available at this https URL.

Title: E2EDiff: Direct Mapping from Noise to Data for Enhanced Diffusion Models

Authors: Zhiyu Tan, WenXu Qian, Hesen Chen, Mengping Yang, Lei Chen, Hao Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.21044
Pdf URL: https://arxiv.org/pdf/2412.21044
Copy Paste: [[2412.21044]] E2EDiff: Direct Mapping from Noise to Data for Enhanced Diffusion Models(https://arxiv.org/abs/2412.21044)
Keywords: generative
Abstract: Diffusion models have emerged as a powerful framework for generative modeling, achieving state-of-the-art performance across various tasks. However, they face several inherent limitations, including a training-sampling gap, information leakage in the progressive noising process, and the inability to incorporate advanced loss functions like perceptual and adversarial losses during training. To address these challenges, we propose an innovative end-to-end training framework that aligns the training and sampling processes by directly optimizing the final reconstruction output. Our method eliminates the training-sampling gap, mitigates information leakage by treating the training process as a direct mapping from pure noise to the target data distribution, and enables the integration of perceptual and adversarial losses into the objective. Extensive experiments on benchmarks such as COCO30K and HW30K demonstrate that our approach consistently outperforms traditional diffusion models, achieving superior results in terms of FID and CLIP score, even with reduced sampling steps. These findings highlight the potential of end-to-end training to advance diffusion-based generative models toward more robust and efficient solutions.

Title: Towards Effective Discrimination Testing for Generative AI

Authors: Thomas P. Zollo, Nikita Rajaneesh, Richard Zemel, Talia B. Gillis, Emily Black
Subjects: cs.LG, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2412.21052
Pdf URL: https://arxiv.org/pdf/2412.21052
Copy Paste: [[2412.21052]] Towards Effective Discrimination Testing for Generative AI(https://arxiv.org/abs/2412.21052)
Keywords: generative
Abstract: Generative AI (GenAI) models present new challenges in regulating against discriminatory behavior. In this paper, we argue that GenAI fairness research still has not met these challenges; instead, a significant gap remains between existing bias assessment methods and regulatory goals. This leads to ineffective regulation that can allow deployment of reportedly fair, yet actually discriminatory, GenAI systems. Towards remedying this problem, we connect the legal and technical literature around GenAI bias evaluation and identify areas of misalignment. Through four case studies, we demonstrate how this misalignment between fairness testing techniques and regulatory goals can result in discriminatory outcomes in real-world deployments, especially in adaptive or complex environments. We offer practical recommendations for improving discrimination testing to better align with regulatory goals and enhance the reliability of fairness assessments in future deployments.

Title: VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning for Image and Video Generation

Authors: Jiazheng Xu, Yu Huang, Jiale Cheng, Yuanming Yang, Jiajun Xu, Yuan Wang, Wenbo Duan, Shen Yang, Qunlin Jin, Shurun Li, Jiayan Teng, Zhuoyi Yang, Wendi Zheng, Xiao Liu, Ming Ding, Xiaohan Zhang, Xiaotao Gu, Shiyu Huang, Minlie Huang, Jie Tang, Yuxiao Dong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.21059
Pdf URL: https://arxiv.org/pdf/2412.21059
Copy Paste: [[2412.21059]] VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning for Image and Video Generation(https://arxiv.org/abs/2412.21059)
Keywords: generation, quality assessment
Abstract: We present a general strategy for aligning visual generation models -- both image and video generation -- with human preference. To start with, we build VisionReward -- a fine-grained and multi-dimensional reward model. We decompose human preferences in images and videos into multiple dimensions, each represented by a series of judgment questions, linearly weighted and summed to an interpretable and accurate score. To address the challenges of video quality assessment, we systematically analyze various dynamic features of videos, which helps VisionReward surpass VideoScore by 17.2% and achieve top performance for video preference prediction. Based on VisionReward, we develop a multi-objective preference learning algorithm that effectively addresses the issue of confounding factors within preference data. Our approach significantly outperforms existing image and video scoring methods on both machine metrics and human evaluation. All code and datasets are provided at this https URL.
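Note: the "judgment questions, linearly weighted and summed" scoring described in this abstract can be sketched in a few lines. This is an illustration only. It assumes yes/no answers, and the question names and weights in the usage example are made up, not taken from VisionReward.

```python
from typing import Dict


def preference_score(answers: Dict[str, bool], weights: Dict[str, float]) -> float:
    """Weighted sum of yes/no judgment answers (yes = 1, no = 0)."""
    return sum(weights[q] * (1.0 if answers[q] else 0.0) for q in weights)


def contributions(answers: Dict[str, bool], weights: Dict[str, float]) -> Dict[str, float]:
    """Per-question share of the score, which is what keeps it interpretable."""
    return {q: weights[q] * (1.0 if answers[q] else 0.0) for q in weights}


# Hypothetical questions and weights, for illustration only.
example_weights = {"is_sharp": 0.5, "is_artifact_free": 0.25, "matches_prompt": 1.0}
example_answers = {"is_sharp": True, "is_artifact_free": False, "matches_prompt": True}
example_score = preference_score(example_answers, example_weights)
```

Because the score is a plain sum, each question's contribution can be read off directly, which is one way to understand the "interpretable" claim in the abstract.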

Title: Varformer: Adapting VAR's Generative Prior for Image Restoration

Authors: Siyang Wang, Feng Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.21063
Pdf URL: https://arxiv.org/pdf/2412.21063
Copy Paste: [[2412.21063]] Varformer: Adapting VAR's Generative Prior for Image Restoration(https://arxiv.org/abs/2412.21063)
Keywords: restoration, generation, generative
Abstract: Generative models trained on extensive high-quality datasets effectively capture the structural and statistical properties of clean images, rendering them powerful priors for transforming degraded features into clean ones in image restoration. VAR, a novel image generative paradigm, surpasses diffusion models in generation quality by applying a next-scale prediction approach. It progressively captures both global structures and fine-grained details through the autoregressive process, consistent with the multi-scale restoration principle widely acknowledged in the restoration community. Furthermore, we observe that during the image reconstruction process utilizing VAR, scale predictions automatically modulate the input, facilitating the alignment of representations at subsequent scales with the distribution of clean images. To harness VAR's adaptive distribution alignment capability in image restoration tasks, we formulate the multi-scale latent representations within VAR as the restoration prior, thus advancing our delicately designed VarFormer framework. The strategic application of these priors enables our VarFormer to achieve remarkable generalization on unseen tasks while also reducing training computational costs. Extensive experiments underscore that our VarFormer outperforms existing multi-task image restoration methods across various restoration tasks.

Title: Vinci: A Real-time Embodied Smart Assistant based on Egocentric Vision-Language Model

Authors: Yifei Huang, Jilan Xu, Baoqi Pei, Yuping He, Guo Chen, Lijin Yang, Xinyuan Chen, Yaohui Wang, Zheng Nie, Jinyao Liu, Guoshun Fan, Dechen Lin, Fang Fang, Kunpeng Li, Chang Yuan, Yali Wang, Yu Qiao, Limin Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.21080
Pdf URL: https://arxiv.org/pdf/2412.21080
Copy Paste: [[2412.21080]] Vinci: A Real-time Embodied Smart Assistant based on Egocentric Vision-Language Model(https://arxiv.org/abs/2412.21080)
Keywords: generation
Abstract: We introduce Vinci, a real-time embodied smart assistant built upon an egocentric vision-language model. Designed for deployment on portable devices such as smartphones and wearable cameras, Vinci operates in an "always on" mode, continuously observing the environment to deliver seamless interaction and assistance. Users can wake up the system and engage in natural conversations to ask questions or seek assistance, with responses delivered through audio for hands-free convenience. With its ability to process long video streams in real-time, Vinci can answer user queries about current observations and historical context while also providing task planning based on past interactions. To further enhance usability, Vinci integrates a video generation module that creates step-by-step visual demonstrations for tasks that require detailed guidance. We hope that Vinci can establish a robust framework for portable, real-time egocentric AI systems, empowering users with contextual and actionable insights. We release the complete implementation for the development of the device in conjunction with a demo web platform to test uploaded videos at this https URL.

Title: Prometheus: 3D-Aware Latent Diffusion Models for Feed-Forward Text-to-3D Scene Generation

Authors: Yuanbo Yang, Jiahao Shao, Xinyang Li, Yujun Shen, Andreas Geiger, Yiyi Liao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.21117
Pdf URL: https://arxiv.org/pdf/2412.21117
Copy Paste: [[2412.21117]] Prometheus: 3D-Aware Latent Diffusion Models for Feed-Forward Text-to-3D Scene Generation(https://arxiv.org/abs/2412.21117)
Keywords: generation
Abstract: In this work, we introduce Prometheus, a 3D-aware latent diffusion model for text-to-3D generation at both object and scene levels in seconds. We formulate 3D scene generation as multi-view, feed-forward, pixel-aligned 3D Gaussian generation within the latent diffusion paradigm. To ensure generalizability, we build our model upon a pre-trained text-to-image generation model with only minimal adjustments, and further train it using a large number of images from both single-view and multi-view datasets. Furthermore, we introduce an RGB-D latent space into 3D Gaussian generation to disentangle appearance and geometry information, enabling efficient feed-forward generation of 3D Gaussians with better fidelity and geometry. Extensive experimental results demonstrate the effectiveness of our method in both feed-forward 3D Gaussian reconstruction and text-to-3D generation. Project page: this https URL

Title: PERSE: Personalized 3D Generative Avatars from A Single Portrait

Authors: Hyunsoo Cha, Inhee Lee, Hanbyul Joo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.21206
Pdf URL: https://arxiv.org/pdf/2412.21206
Copy Paste: [[2412.21206]] PERSE: Personalized 3D Generative Avatars from A Single Portrait(https://arxiv.org/abs/2412.21206)
Keywords: generative
Abstract: We present PERSE, a method for building an animatable personalized generative avatar from a reference portrait. Our avatar model enables facial attribute editing in a continuous and disentangled latent space to control each facial attribute, while preserving the individual's identity. To achieve this, our method begins by synthesizing large-scale synthetic 2D video datasets, where each video contains consistent changes in the facial expression and viewpoint, combined with a variation in a specific facial attribute from the original input. We propose a novel pipeline to produce high-quality, photorealistic 2D videos with facial attribute editing. Leveraging this synthetic attribute dataset, we present a personalized avatar creation method based on 3D Gaussian Splatting, learning a continuous and disentangled latent space for intuitive facial attribute manipulation. To enforce smooth transitions in this latent space, we introduce a latent space regularization technique that uses interpolated 2D faces as supervision. Compared to previous approaches, we demonstrate that PERSE generates high-quality avatars with interpolated attributes while preserving the identity of the reference person.