2025-04-30

Title: Marmot: Multi-Agent Reasoning for Multi-Object Self-Correcting in Improving Image-Text Alignment

Authors: Jiayang Sun, Hongbo Wang, Jie Cao, Huaibo Huang, Ran He
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.20054
Pdf URL: https://arxiv.org/pdf/2504.20054
Copy Paste: [[2504.20054]] Marmot: Multi-Agent Reasoning for Multi-Object Self-Correcting in Improving Image-Text Alignment(https://arxiv.org/abs/2504.20054)
Keywords: generation
Abstract: While diffusion models excel at generating high-quality images, they often struggle with accurate counting, attributes, and spatial relationships in complex multi-object scenes. To address these challenges, we propose Marmot, a novel and generalizable framework that employs Multi-Agent Reasoning for Multi-Object Self-Correcting, enhancing image-text alignment and facilitating more coherent multi-object image editing. Our framework adopts a divide-and-conquer strategy that decomposes the self-correction task into three critical dimensions (counting, attributes, and spatial relationships), which are further divided into object-level subtasks. We construct a multi-agent editing system featuring a decision-execution-verification mechanism, effectively mitigating inter-object interference and enhancing editing reliability. To resolve the problem of subtask integration, we propose a Pixel-Domain Stitching Smoother that employs mask-guided two-stage latent space optimization. This innovation enables parallel processing of subtask results, thereby enhancing runtime efficiency while eliminating multi-stage distortion accumulation. Extensive experiments demonstrate that Marmot significantly improves accuracy in object counting, attribute assignment, and spatial relationships for image generation tasks.
摘要：尽管扩散模型在生成高质量图像方面表现出色，但它们通常会在复杂的多对象场景中与准确的计数，属性和空间关系斗争。为了应对这些挑战，我们提出了Marmot，这是一个新颖且可推广的框架，该框架采用多代理推理来进行多对象自我校正，增强图像文本对齐，并促进更连贯的多对象图像编辑。我们的框架采用了分裂和纠纷策略，将自校正任务分解为三个关键维度（计数，属性和空间关系），并进一步分为对象级子任务。我们构建了具有决策执行验证机制的多代理编辑系统，有效地减轻了对象间干扰并增强编辑可靠性。为了解决子任务集成的问题，我们提出了一个像素域缝线更平滑，该缝线采用了掩码引导的两阶段潜在空间优化。这项创新可以并行处理子任务结果，从而提高运行时效率，同时消除多阶段失真的积累。广泛的实验表明，Marmot可显着提高对象计数，属性分配和图像生成任务的空间关系的准确性。

Title: VideoMultiAgents: A Multi-Agent Framework for Video Question Answering

Authors: Noriyuki Kugo, Xiang Li, Zixin Li, Ashish Gupta, Arpandeep Khatua, Nidhish Jain, Chaitanya Patel, Yuta Kyuragi, Masamoto Tanabiki, Kazuki Kozuka, Ehsan Adeli
Subjects: cs.CV, cs.MA
Abstract URL: https://arxiv.org/abs/2504.20091
Pdf URL: https://arxiv.org/pdf/2504.20091
Copy Paste: [[2504.20091]] VideoMultiAgents: A Multi-Agent Framework for Video Question Answering(https://arxiv.org/abs/2504.20091)
Keywords: generation
Abstract: Video Question Answering (VQA) inherently relies on multimodal reasoning, integrating visual, temporal, and linguistic cues to achieve a deeper understanding of video content. However, many existing methods rely on feeding frame-level captions into a single model, making it difficult to adequately capture temporal and interactive contexts. To address this limitation, we introduce VideoMultiAgents, a framework that integrates specialized agents for vision, scene graph analysis, and text processing. It enhances video understanding by leveraging complementary multimodal reasoning from independently operating agents. Our approach is also supplemented with question-guided caption generation, which produces captions that highlight objects, actions, and temporal transitions directly relevant to a given query, thus improving answer accuracy. Experimental results demonstrate that our method achieves state-of-the-art performance on Intent-QA (79.0%, +6.2% over previous SOTA), EgoSchema subset (75.4%, +3.4%), and NExT-QA (79.6%, +0.4%).
摘要：视频问题回答（VQA）固有地依赖于多模式推理，整合视觉，时间和语言提示，以深入了解视频内容。但是，许多现有的方法依靠馈送框架级标题为单个模型，因此很难充分捕获时间和交互式上下文。为了解决此限制，我们介绍了视频义务，该框架将专门的代理集成了视觉，场景图分析和文本处理。它增强了视频理解，从独立操作的代理商中利用互补的多模式推理。我们的方法还补充了一个问题引导的标题生成，该标题产生的标题突出了与给定查询直接相关的对象，动作和时间过渡，从而提高了答案的准确性。实验结果表明，我们的方法在Intent-QA（79.0％， +6.2％以上的SOTA），Egoschema子集（75.4％， +3.4％）和Next-QA（79.6％， +0.4％）上实现了最先进的性能。

Title: Integration Flow Models

Authors: Jingjing Wang, Dan Zhang, Joshua Luo, Yin Yang, Feng Luo
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.20179
Pdf URL: https://arxiv.org/pdf/2504.20179
Copy Paste: [[2504.20179]] Integration Flow Models(https://arxiv.org/abs/2504.20179)
Keywords: generation, generative
Abstract: Ordinary differential equation (ODE) based generative models have emerged as a powerful approach for producing high-quality samples in many applications. However, ODE-based methods either suffer from the discretization error of numerical ODE solvers, which restricts the quality of samples when only a few NFEs are used, or struggle with training instability. In this paper, we propose Integration Flow, which directly learns the integral of ODE-based trajectory paths without solving the ODE functions. Moreover, Integration Flow explicitly incorporates the target state $\mathbf{x}_0$ as the anchor state in guiding the reverse-time dynamics. We have theoretically proven this can contribute to both stability and accuracy. To the best of our knowledge, Integration Flow is the first model with a unified structure to estimate ODE-based generative models and the first to show the exact straightness of 1-Rectified Flow without reflow. Through theoretical analysis and empirical evaluations, we show that Integration Flow achieves improved performance when applied to existing ODE-based models, such as diffusion models, Rectified Flows, and PFGM++. Specifically, Integration Flow achieves one-step generation on CIFAR10 with FIDs of 2.86 for the Variance Exploding (VE) diffusion model, 3.36 for rectified flow without reflow, and 2.91 for PFGM++; and on ImageNet with FIDs of 4.09 for the VE diffusion model, 4.35 for rectified flow without reflow, and 4.15 for PFGM++.
摘要：基于普通的微分方程（ODE）的生成模型已成为在许多应用中生产高质量样本的强大方法。但是，基于ODE的方法要么遭受ODE数值求解器的离散误差，因此，当仅使用了几个NFE时，限制了样品的质量，或者在训练不稳定性方面挣扎。在本文中，我们提出了整合流，该集合流直接学习基于ODE的轨迹路径的积分而无需求解ODE函数。此外，集成流明确合并了目标状态$ \ mathbf {x} _0 $作为指导反度动力学的锚状态。从理论上讲，我们可以证明这可以有助于稳定性和准确性。据我们所知，集成流是第一个具有统一结构的模型，用于估计基于ODE的生成模型，也是第一个显示1个纠正流的精确直率而无需回光。通过理论分析和经验评估，我们表明，当集成流将其应用于现有基于ODE的模型（例如扩散模型，整流流和PFGM ++）时，可以提高性能。具体而言，对于方差爆炸（VE）扩散模型，积分流在CIFAR10上以2.86的FID达到一步生成，而无反流的整流流量为3.36，PFGM ++为2.91；对于VE扩散模型的FIDS，在Imagenet上，无需回流的整流流量为4.35，PFGM ++为4.15。

Title: Physics-Informed Diffusion Models for SAR Ship Wake Generation from Text Prompts

Authors: Kamirul Kamirul, Odysseas Pappas, Alin Achim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.20241
Pdf URL: https://arxiv.org/pdf/2504.20241
Copy Paste: [[2504.20241]] Physics-Informed Diffusion Models for SAR Ship Wake Generation from Text Prompts(https://arxiv.org/abs/2504.20241)
Keywords: generation
Abstract: Detecting ship presence via wake signatures in SAR imagery is attracting considerable research interest, but limited annotated data availability poses significant challenges for supervised learning. Physics-based simulations are commonly used to address this data scarcity, although they are slow and constrain end-to-end learning. In this work, we explore a new direction for more efficient and end-to-end SAR ship wake simulation using a diffusion model trained on data generated by a physics-based simulator. The training dataset is built by pairing images produced by the simulator with text prompts derived from simulation parameters. Experimental results show that the model generates realistic Kelvin wake patterns and achieves significantly faster inference than the physics-based simulator. These results highlight the potential of diffusion models for fast and controllable wake image generation, opening new possibilities for end-to-end downstream tasks in maritime SAR analysis.
摘要：通过SAR图像中的尾流签名检测船的存在引起了相当大的研究兴趣，但是有限的注释数据可用性对监督学习构成了重大挑战。基于物理学的模拟通常用于解决此数据稀缺性，尽管它们很慢并且会限制端到端学习。在这项工作中，我们使用基于物理基于物理的模拟器生成的数据训练的扩散模型探索了一个新方向，以提高高效和端到端的SAR船唤醒模拟。培训数据集是通过将模拟器制作的图像与仿真参数派生的文本提示配对构建的。实验结果表明，该模型会产生逼真的开尔文唤醒模式，并且比基于物理的模拟器更快地推断了推断。这些结果突出了扩散模型对快速和可控制的唤醒产生的潜力，在海上SAR分析中为端到端下游任务开辟了新的可能性。

Title: Generative Diffusion Models for Resource Allocation in Wireless Networks

Authors: Yigit Berkay Uslu, Samar Hadou, Shirin Saeedi Bidokhti, Alejandro Ribeiro
Subjects: cs.LG, eess.SP
Abstract URL: https://arxiv.org/abs/2504.20277
Pdf URL: https://arxiv.org/pdf/2504.20277
Copy Paste: [[2504.20277]] Generative Diffusion Models for Resource Allocation in Wireless Networks(https://arxiv.org/abs/2504.20277)
Keywords: generative
Abstract: This paper proposes a supervised training algorithm for learning stochastic resource allocation policies with generative diffusion models (GDMs). We formulate the allocation problem as the maximization of an ergodic utility function subject to ergodic Quality of Service (QoS) constraints. Given samples from a stochastic expert policy that yields a near-optimal solution to the problem, we train a GDM policy to imitate the expert and generate new samples from the optimal distribution. We achieve near-optimal performance through sequential execution of the generated samples. To enable generalization to a family of network configurations, we parameterize the backward diffusion process with a graph neural network (GNN) architecture. We present numerical results in a case study of power control in multi-user interference networks.
摘要：本文提出了一种有监督的培训算法，用于通过生成扩散模型（GDM）学习随机资源分配策略。我们将分配问题提出，作为受厄贡服务质量（QOS）约束的最大化的最大化。鉴于从随机专家政策中的样本，可以为问题提供近乎最佳的解决方案，我们训练GDM政策模仿专家并从最佳分布中生成新样本。我们通过顺序执行生成的样品实现了近乎最佳的性能。为了使网络配置家族概括，我们使用图形神经网络（GNN）体系结构参数化向后扩散过程。我们在多用户干扰网络中的功率控制案例研究中提出了数值结果。

Title: Image Interpolation with Score-based Riemannian Metrics of Diffusion Models

Authors: Shinnosuke Saito, Takashi Matsubara
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.20288
Pdf URL: https://arxiv.org/pdf/2504.20288
Copy Paste: [[2504.20288]] Image Interpolation with Score-based Riemannian Metrics of Diffusion Models(https://arxiv.org/abs/2504.20288)
Keywords: generation, generative
Abstract: Diffusion models excel in content generation by implicitly learning the data manifold, yet they lack a practical method to leverage this manifold - unlike other deep generative models equipped with latent spaces. This paper introduces a novel framework that treats the data space of pre-trained diffusion models as a Riemannian manifold, with a metric derived from the score function. Experiments with MNIST and Stable Diffusion show that this geometry-aware approach yields image interpolations that are more realistic, less noisy, and more faithful to prompts than existing methods, demonstrating its potential for improved content generation and editing.
摘要：扩散模型通过隐式学习数据歧管而在内容生成中出色，但它们缺乏一种实用的方法来利用这种歧管 - 与其他配备潜在空间的深层生成模型不同。本文介绍了一个新的框架，该框架将预训练的扩散模型的数据空间视为riemannian歧管，并从得分函数中得出了度量。使用MNIST和稳定扩散的实验表明，这种几何感知方法产生的图像插值比现有方法更现实，更嘈杂，更忠于提示，这表明其有可能提高内容的生成和编辑的潜力。

Title: A Cryptographic Perspective on Mitigation vs. Detection in Machine Learning

Authors: Greg Gluch, Shafi Goldwasser
Subjects: cs.LG, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2504.20310
Pdf URL: https://arxiv.org/pdf/2504.20310
Copy Paste: [[2504.20310]] A Cryptographic Perspective on Mitigation vs. Detection in Machine Learning(https://arxiv.org/abs/2504.20310)
Keywords: generative
Abstract: In this paper, we initiate a cryptographically inspired theoretical study of detection versus mitigation of adversarial inputs produced by attackers of Machine Learning algorithms during inference time. We formally define defense by detection (DbD) and defense by mitigation (DbM). Our definitions come in the form of a 3-round protocol between two resource-bounded parties: a trainer/defender and an attacker. The attacker aims to produce inference-time inputs that fool the training algorithm. We define correctness, completeness, and soundness properties to capture successful defense at inference time while not degrading (too much) the performance of the algorithm on inputs from the training distribution. We first show that achieving DbD and achieving DbM are equivalent for ML classification tasks. Surprisingly, this is not the case for ML generative learning tasks, where there are many possible correct outputs that can be generated for each input. We show a separation between DbD and DbM by exhibiting a generative learning task for which it is possible to defend by mitigation but provably impossible to defend by detection, under the assumption that Identity-Based Fully Homomorphic Encryption (IB-FHE), publicly verifiable zero-knowledge Succinct Non-Interactive Arguments of Knowledge (zk-SNARK), and Strongly Unforgeable Signatures exist. The mitigation phase uses significantly fewer samples than the initial training algorithm.
摘要：在本文中，我们启动了一项具有密码启发的理论研究，对检测与缓解对抗性输入的攻击者在推理期间的攻击者产生的对抗性输入。我们通过检测（DBD）正式定义防御（DBM）（DBM）。我们的定义以两个由资源结合的当事方之间的三轮协议的形式出现：培训者/后卫和攻击者。攻击者旨在产生欺骗培训算法的推理时间输入。我们定义正确性，完整性和健全性属性，以在推理时间捕获成功的防御，而不会降解（太多）算法对训练分布的输入的性能。我们首先表明，实现DBD和实现DBM是ML分类任务等效的。出乎意料的是，ML生成学习任务并非如此，其中有许多可能的正确输出可以为每个输入生成。我们通过表现出一种生成性学习任务，可以通过缓解来防御DBD和DBM之间的分离，但事实证明，在假设基于基于身份的完全同型加密（IB-FHE）的假设下，不可能通过检测进行防御，公开可验证的零知识零知识无关的非相互依据的知识论证（Zk-snark）和强有力地符合不合适的签名。缓解阶段使用的样品明显少于初始训练算法。

Title: Perturbation-efficient Zeroth-order Optimization for Hardware-friendly On-device Training

Authors: Qitao Tan, Sung-En Chang, Rui Xia, Huidong Ji, Chence Yang, Ci Zhang, Jun Liu, Zheng Zhan, Zhou Zou, Yanzhi Wang, Jin Lu, Geng Yuan
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2504.20314
Pdf URL: https://arxiv.org/pdf/2504.20314
Copy Paste: [[2504.20314]] Perturbation-efficient Zeroth-order Optimization for Hardware-friendly On-device Training(https://arxiv.org/abs/2504.20314)
Keywords: generation
Abstract: Zeroth-order (ZO) optimization is an emerging deep neural network (DNN) training paradigm that offers computational simplicity and memory savings. However, this seemingly promising approach faces a significant and long-ignored challenge. ZO requires generating a substantial number of Gaussian random numbers, which poses significant difficulties and can even make it infeasible on hardware platforms such as FPGAs and ASICs. In this paper, we identify this critical issue, which arises from the mismatch between algorithm and hardware designers. To address this issue, we propose PeZO, a perturbation-efficient ZO framework. Specifically, we design random number reuse strategies to significantly reduce the demand for random number generation and introduce a hardware-friendly adaptive scaling method to replace the costly Gaussian distribution with a uniform distribution. Our experiments show that PeZO reduces the LUTs and FFs required for random number generation by 48.6% and 12.7%, respectively, and saves up to 86% of power consumption, all without compromising training performance, making ZO optimization feasible for on-device training. To the best of our knowledge, we are the first to explore the potential of on-device ZO optimization, providing valuable insights for future research.
摘要：Zeroth-order（ZO）优化是一种新兴的深神经网络（DNN）训练范式，可提供计算简单性和内存节省。但是，这种看似有前途的方法面临着一个重大且长期以来的挑战。 ZO需要产生大量的高斯随机数，这构成了巨大的困难，甚至使其对于硬件平台（例如FPGA和ASIC）不可行。在本文中，我们确定了这个关键问题，该问题来自算法和硬件设计人员之间的不匹配。为了解决这个问题，我们提出了PEZO，这是一个扰动效率的ZO框架。具体而言，我们设计了随机数重复使用策略，以显着减少对随机数生成的需求，并引入一种适合硬件友好的自适应缩放方法，以均匀分布替换昂贵的高斯分布。我们的实验表明，PEZO将随机数生成的所需的LUT和FFS降低了48.6 \％和12.7 \％，并以最大86 \％的功率消耗保存，而无需损害培训性能，使ZO优化可行，可用于备能培训。据我们所知，我们是第一个探索开发ZO优化的潜力，为未来研究提供宝贵的见解。
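
Note on the PeZO entry above: the abstract describes two ideas, reusing random numbers and replacing Gaussian perturbations with uniform ones. The sketch below is only an illustration of how those two ideas could fit into a generic two-point zeroth-order update; it is not the paper's algorithm. The cyclic pool, the fixed sqrt(3) scaling (chosen so the uniform samples have unit variance), the step sizes, and all function names are assumptions made for this example. In particular, the paper's adaptive scaling method is not reproduced here.

```python
import math
import random

def make_pool(size, seed=0):
    """Pre-generate a small pool of uniform random numbers (assumed reuse scheme)."""
    rng = random.Random(seed)
    a = math.sqrt(3.0)  # uniform on [-a, a] has variance a^2 / 3 = 1
    return [rng.uniform(-a, a) for _ in range(size)]

def perturbation(pool, dim, offset):
    """Read `dim` values from the pool cyclically instead of drawing fresh ones."""
    n = len(pool)
    return [pool[(offset + i) % n] for i in range(dim)]

def zo_step(loss, params, pool, offset, lr=0.05, eps=1e-3):
    """One two-point zeroth-order update: no backpropagation, two loss evaluations."""
    u = perturbation(pool, len(params), offset)
    plus = [p + eps * ui for p, ui in zip(params, u)]
    minus = [p - eps * ui for p, ui in zip(params, u)]
    # Finite-difference estimate of the directional derivative along u.
    scale = (loss(plus) - loss(minus)) / (2.0 * eps)
    return [p - lr * scale * ui for p, ui in zip(params, u)]

def quadratic(w):
    """Toy loss with its minimum at w = (1, ..., 1)."""
    return sum((wi - 1.0) ** 2 for wi in w)

def train(steps=500, dim=4, pool_size=64):
    pool = make_pool(pool_size)
    w = [0.0] * dim
    for step in range(steps):
        # A different offset each step gives a different window into the same pool.
        w = zo_step(quadratic, w, pool, offset=step * 7)
    return w
```

Calling `quadratic(train())` and comparing it with the starting loss `quadratic([0.0] * 4)` shows whether the reused perturbations still make progress on this toy problem. The step size here was picked for this toy loss and would need tuning for anything else.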

Title: MicarVLMoE: A Modern Gated Cross-Aligned Vision-Language Mixture of Experts Model for Medical Image Captioning and Report Generation

Authors: Amaan Izhar, Nurul Japar, Norisma Idris, Ting Dang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.20343
Pdf URL: https://arxiv.org/pdf/2504.20343
Copy Paste: [[2504.20343]] MicarVLMoE: A Modern Gated Cross-Aligned Vision-Language Mixture of Experts Model for Medical Image Captioning and Report Generation(https://arxiv.org/abs/2504.20343)
Keywords: generation
Abstract: Medical image reporting (MIR) aims to generate structured clinical descriptions from radiological images. Existing methods struggle with fine-grained feature extraction, multimodal alignment, and generalization across diverse imaging types, often relying on vanilla transformers and focusing primarily on chest X-rays. We propose MicarVLMoE, a vision-language mixture-of-experts model with gated cross-aligned fusion, designed to address these limitations. Our architecture includes: (i) a multiscale vision encoder (MSVE) for capturing anatomical details at varying resolutions, (ii) a multihead dual-branch latent attention (MDLA) module for vision-language alignment through latent bottleneck representations, and (iii) a modulated mixture-of-experts (MoE) decoder for adaptive expert specialization. We extend MIR to CT scans, retinal imaging, MRI scans, and gross pathology images, reporting state-of-the-art results on COVCTR, MMR, PGROSS, and ROCO datasets. Extensive experiments and ablations confirm improved clinical accuracy, cross-modal alignment, and model interpretability. Code is available at this https URL.
摘要：医学图像报告（MIR）旨在从放射学图像中生成结构化的临床描述。现有方法在细粒度特征提取、多模态对齐以及跨不同成像类型的泛化方面存在困难，通常依赖普通Transformer，并且主要集中在胸部X射线上。我们提出了MicarVLMoE，这是一种具有门控交叉对齐融合的视觉-语言专家混合模型，旨在解决这些局限性。我们的架构包括：（i）多尺度视觉编码器（MSVE），用于在不同分辨率下捕获解剖细节，（ii）多头双分支潜在注意力（MDLA）模块，用于通过潜在瓶颈表示实现视觉-语言对齐，以及（iii）调制专家混合（MoE）解码器，用于自适应专家专门化。我们将MIR扩展到CT扫描、视网膜成像、MRI扫描和大体病理图像，并在COVCTR、MMR、PGROSS和ROCO数据集上报告了最先进的结果。大量实验和消融研究证实了临床准确性、跨模态对齐和模型可解释性的提升。代码可在此HTTPS URL上找到。

Title: Generative Learning for Slow Manifolds and Bifurcation Diagrams

Authors: Ellis R. Crabtree, Dimitris G. Giovanis, Nikolaos Evangelou, Juan M. Bello-Rivas, Ioannis G. Kevrekidis
Subjects: cs.LG, math.DS
Abstract URL: https://arxiv.org/abs/2504.20375
Pdf URL: https://arxiv.org/pdf/2504.20375
Copy Paste: [[2504.20375]] Generative Learning for Slow Manifolds and Bifurcation Diagrams(https://arxiv.org/abs/2504.20375)
Keywords: generative
Abstract: In dynamical systems characterized by separation of time scales, the approximation of so-called "slow manifolds", on which the long-term dynamics lie, is a useful step for model reduction. Initializing on such slow manifolds is a useful step in modeling, since it circumvents fast transients, and is crucial in multiscale algorithms alternating between fine scale (fast) and coarser scale (slow) simulations. In a similar spirit, when one studies the infinite time dynamics of systems depending on parameters, the system attractors (e.g., its steady states) lie on bifurcation diagrams. Sampling these manifolds gives us representative attractors (here, steady states of ODEs or PDEs) at different parameter values. Algorithms for the systematic construction of these manifolds are required parts of the "traditional" numerical nonlinear dynamics toolkit. In more recent years, as the field of Machine Learning develops, conditional score-based generative models (cSGMs) have demonstrated capabilities in generating plausible data from target distributions that are conditioned on some given label. It is tempting to exploit such generative models to produce samples of data distributions conditioned on some quantity of interest (QoI). In this work, we present a framework for using cSGMs to quickly (a) initialize on a low-dimensional (reduced-order) slow manifold of a multi-time-scale system consistent with desired value(s) of a QoI (a "label") on the manifold, and (b) approximate steady states in a bifurcation diagram consistent with a (new, out-of-sample) parameter value. This conditional sampling can help uncover the geometry of the reduced slow manifold and/or approximately "fill in" missing segments of steady states in a bifurcation diagram.
摘要：在以时间尺度分离为特征的动力系统中，对长期动力学所在的所谓“慢流形”进行近似，是模型降阶的有用步骤。在这类慢流形上初始化也是建模中的有用步骤，因为它绕过了快速瞬态，并且在细尺度（快速）与粗尺度（慢速）模拟交替进行的多尺度算法中至关重要。本着类似的精神，当人们研究依赖于参数的系统的无限时间动力学时，系统吸引子（例如其稳态）位于分岔图上。对这些流形进行采样，可以得到不同参数值下具有代表性的吸引子（此处为ODE或PDE的稳态）。系统地构建这些流形的算法是“传统”数值非线性动力学工具包的必要组成部分。近年来，随着机器学习领域的发展，条件分数生成模型（cSGM）已展示出从以给定标签为条件的目标分布中生成合理数据的能力。人们很自然地想利用这类生成模型，生成以某个感兴趣量（QoI）为条件的数据分布样本。在这项工作中，我们提出了一个框架，使用cSGM快速（a）在多时间尺度系统的低维（降阶）慢流形上进行初始化，使其与流形上QoI（“标签”）的期望值一致，以及（b）在分岔图中近似与（新的、样本外的）参数值一致的稳态。这种条件采样有助于揭示降阶慢流形的几何结构，和/或近似“填补”分岔图中缺失的稳态段。

Title: Inception: Jailbreak the Memory Mechanism of Text-to-Image Generation Systems

Authors: Shiqian Zhao, Jiayang Liu, Yiming Li, Runyi Hu, Xiaojun Jia, Wenshu Fan, Xinfeng Li, Jie Zhang, Wei Dong, Tianwei Zhang, Luu Anh Tuan
Subjects: cs.CV, cs.CR
Abstract URL: https://arxiv.org/abs/2504.20376
Pdf URL: https://arxiv.org/pdf/2504.20376
Copy Paste: [[2504.20376]] Inception: Jailbreak the Memory Mechanism of Text-to-Image Generation Systems(https://arxiv.org/abs/2504.20376)
Keywords: generation
Abstract: Currently, the memory mechanism has been widely and successfully exploited in online text-to-image (T2I) generation systems (e.g., DALL·E 3) for alleviating the growing tokenization burden and capturing key information in multi-turn interactions. Despite its practicality, its security analyses have fallen far behind. In this paper, we reveal that this mechanism exacerbates the risk of jailbreak attacks. Different from previous attacks that fuse the unsafe target prompt into one ultimate adversarial prompt, which can be easily detected or may generate non-unsafe images due to under- or over-optimization, we propose Inception, the first multi-turn jailbreak attack against the memory mechanism in real-world text-to-image generation systems. Inception embeds the malice at the inception of the chat session turn by turn, leveraging the mechanism that T2I generation systems retrieve key information in their memory. Specifically, Inception mainly consists of two modules. It first segments the unsafe prompt into chunks, which are subsequently fed to the system in multiple turns, serving as pseudo-gradients for directive optimization. Specifically, we develop a series of segmentation policies that ensure the images generated are semantically consistent with the target prompt. Secondly, after segmentation, to overcome the challenge of the inseparability of minimum unsafe words, we propose recursion, a strategy that makes minimum unsafe words subdivisible. Collectively, segmentation and recursion ensure that all the request prompts are benign but can lead to malicious outcomes. We conduct experiments on the real-world text-to-image generation system (i.e., DALL·E 3) to validate the effectiveness of Inception. The results indicate that Inception surpasses the state-of-the-art by a 14% margin in attack success rate.
摘要：当前，在线文本对图像（T2i）生成系统（例如$，$，dall $ \ cdot $ e 3）中，内存机制已被广泛利用，以减轻日益增长的令牌负担并捕获多转变互动中的关键信息。尽管其实用性，但其安全分析却远远落后。在本文中，我们揭示了这种机制加剧了越狱袭击的风险。不同于以前的攻击将不安全的目标提示融合到一个最终的对抗提示中，由于不足或过度优化，可以轻松检测或可能会产生非安全图像，我们提出了Inpection，这是对现实世界中的第一次多转变越狱攻击，这是对现实世界中文本到现实文本到达的生成系统中的记忆机制。 Inception在聊天会议开始时嵌入了恶意，并利用T2i生成系统在其内存中检索关键信息的机制。具体而言，成立主要由两个模块组成。它首先将不安全的提示置于块中，随后将其多个转弯送入系统，作为指令优化的伪级。具体而言，我们制定了一系列分割策略，以确保生成的图像在语义上与目标提示保持一致。其次，在细分后，为了克服最小不安全单词不可分割性的挑战，我们提出了递归，该策略使最小的不安全单词可划分。总体而言，细分和递归确保所有请求提示都良性良性，但可能导致恶意结果。我们对现实世界的文本到图像生成系统进行实验（$，$，dall $ \ cdot $ e 3），以验证启动的有效性。结果表明，Inception的攻击成功率超过了14 \％的最先进。

Title: FiLA-Video: Spatio-Temporal Compression for Fine-Grained Long Video Understanding

Authors: Yanan Guo, Wenhui Dong, Jun Song, Shiding Zhu, Xuan Zhang, Hanqing Yang, Yingbo Wang, Yang Du, Xianing Chen, Bo Zheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.20384
Pdf URL: https://arxiv.org/pdf/2504.20384
Copy Paste: [[2504.20384]] FiLA-Video: Spatio-Temporal Compression for Fine-Grained Long Video Understanding(https://arxiv.org/abs/2504.20384)
Keywords: generation
Abstract: Recent advancements in video understanding within visual large language models (VLLMs) have led to notable progress. However, the complexity of video data and contextual processing limitations still hinder long-video comprehension. A common approach is video feature compression to reduce token input to large language models, yet many methods either fail to prioritize essential features, leading to redundant inter-frame information, or introduce computationally expensive [...]. To address these issues, we propose FiLA (Fine-grained Vision Language Model)-Video, a novel framework that leverages a lightweight dynamic-weight multi-frame fusion strategy, which adaptively integrates multiple frames into a single representation while preserving key video information and reducing computational costs. To enhance frame selection for fusion, we introduce a keyframe selection strategy, effectively identifying informative frames from a larger pool for improved summarization. Additionally, we present a simple yet effective long-video training data generation strategy, boosting model performance without extensive manual annotation. Experimental results demonstrate that FiLA-Video achieves superior efficiency and accuracy in long-video comprehension compared to existing methods.
摘要：视觉大语言模型（VLLM）中视频理解的最新进展带来了显著的进步。但是，视频数据的复杂性和上下文处理的局限性仍然阻碍长视频理解。一种常见的方法是通过视频特征压缩来减少输入大语言模型的令牌，但是许多方法要么无法优先保留关键特征，导致帧间信息冗余，要么引入了计算昂贵的[...]。为了解决这些问题，我们提出了FiLA（Fine-grained Vision Language Model）-Video，这是一个新颖的框架，它利用轻量级的动态权重多帧融合策略，自适应地将多个帧融合为单一表示，同时保留关键视频信息并降低计算成本。为了增强用于融合的帧选择，我们引入了关键帧选择策略，有效地从更大的候选池中识别信息丰富的帧，以改进摘要。此外，我们提出了一种简单而有效的长视频训练数据生成策略，在无需大量人工标注的情况下提升模型性能。实验结果表明，与现有方法相比，FiLA-Video在长视频理解中实现了更高的效率和准确性。

Title: FourierSpecNet: Neural Collision Operator Approximation Inspired by the Fourier Spectral Method for Solving the Boltzmann Equation

Authors: Jae Yong Lee, Gwang Jae Jung, Byung Chan Lim, Hyung Ju Hwang
Subjects: cs.LG, cs.AI, math.NA, physics.comp-ph
Abstract URL: https://arxiv.org/abs/2504.20408
Pdf URL: https://arxiv.org/pdf/2504.20408
Copy Paste: [[2504.20408]] FourierSpecNet: Neural Collision Operator Approximation Inspired by the Fourier Spectral Method for Solving the Boltzmann Equation(https://arxiv.org/abs/2504.20408)
Keywords: super-resolution
Abstract: The Boltzmann equation, a fundamental model in kinetic theory, describes the evolution of particle distribution functions through a nonlinear, high-dimensional collision operator. However, its numerical solution remains computationally demanding, particularly for inelastic collisions and high-dimensional velocity domains. In this work, we propose the Fourier Neural Spectral Network (FourierSpecNet), a hybrid framework that integrates the Fourier spectral method with deep learning to approximate the collision operator in Fourier space efficiently. FourierSpecNet achieves resolution-invariant learning and supports zero-shot super-resolution, enabling accurate predictions at unseen resolutions without retraining. Beyond empirical validation, we establish a consistency result showing that the trained operator converges to the spectral solution as the discretization is refined. We evaluate our method on several benchmark cases, including Maxwellian and hard-sphere molecular models, as well as inelastic collision scenarios. The results demonstrate that FourierSpecNet offers competitive accuracy while significantly reducing computational cost compared to traditional spectral solvers. Our approach provides a robust and scalable alternative for solving the Boltzmann equation across both elastic and inelastic regimes.
摘要：Boltzmann方程是动力学理论中的基本模型，它通过非线性，高维碰撞算子描述了粒子分布函数的演变。但是，其数值解决方案在计算要求方面仍然保持要求，特别是对于非弹性碰撞和高维速度域而言。在这项工作中，我们提出了傅立叶神经光谱网络（FouriersPecnet），这是一种混合框架，将傅立叶光谱方法集成到深度学习中，以有效地将碰撞算子近似于傅立叶空间。 FouriersPecnet实现了分辨率不变的学习，并支持零击的超分辨率，从而在没有再培训的情况下可以在看不见的决议下进行准确的预测。除了经验验证之外，我们还建立了一个一致性结果，表明经过培训的操作员会随着离散化的完善而收敛到频谱解决方案。我们在几种基准案例中评估了我们的方法，包括麦克斯韦和硬球分子模型以及非弹性碰撞方案。结果表明，与传统的光谱求解器相比，傅里叶型型号具有竞争精度，同时大大降低了计算成本。我们的方法提供了一种可靠且可扩展的替代方法，用于求解弹性和非弹性方程的玻尔兹曼方程。

Title: GarmentX: Autoregressive Parametric Representations for High-Fidelity 3D Garment Generation

Authors: Jingfeng Guo, Jinnan Chen, Weikai Chen, Zhenyu Sun, Lanjiong Li, Baozhu Zhao, Lingting Zhu, Xin Wang, Qi Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.20409
Pdf URL: https://arxiv.org/pdf/2504.20409
Copy Paste: [[2504.20409]] GarmentX: Autoregressive Parametric Representations for High-Fidelity 3D Garment Generation(https://arxiv.org/abs/2504.20409)
Keywords: generation
Abstract: This work presents GarmentX, a novel framework for generating diverse, high-fidelity, and wearable 3D garments from a single input image. Traditional garment reconstruction methods directly predict 2D pattern edges and their connectivity, an overly unconstrained approach that often leads to severe self-intersections and physically implausible garment structures. In contrast, GarmentX introduces a structured and editable parametric representation compatible with GarmentCode, ensuring that the decoded sewing patterns always form valid, simulation-ready 3D garments while allowing for intuitive modifications of garment shape and style. To achieve this, we employ a masked autoregressive model that sequentially predicts garment parameters, leveraging autoregressive modeling for structured generation while mitigating inconsistencies in direct pattern prediction. Additionally, we introduce the GarmentX dataset, a large-scale dataset of 378,682 garment parameter-image pairs, constructed through an automatic data generation pipeline that synthesizes diverse and high-quality garment images conditioned on parametric garment representations. By integrating our method with the GarmentX dataset, we achieve state-of-the-art performance in geometric fidelity and input image alignment, significantly outperforming prior approaches. We will release the GarmentX dataset upon publication.
摘要：这项工作介绍了GarmentX，这是一个新颖的框架，可从单个输入图像中产生各种，高保真和可穿戴的3D服装。传统的服装重建方法直接预测了2D模式边缘及其连接性，这是一种过度不受约束的方法，通常会导致严重的自身交流和身体上不可行的服装结构。相比之下，GarmentX引入了与GarmentCode兼容的结构化且可编辑的参数表示形式，以确保解码的缝纫模式始终形成有效的，可以模拟的3D服装，同时允许对服装形状和样式进行直观的修改。为了实现这一目标，我们采用了一个掩盖的自回旋模型，该模型顺序预测了服装参数，利用自回旋建模来用于结构化生成，同时减轻直接模式预测中的不一致。此外，我们介绍了GarmentX数据集，这是一个由378,682 Garment参数图像对的大规模数据集，该数据集是通过自动数据生成管道构建的，该管道构建了根据参数服装表示的多样性和高质量的服装图像。通过将我们的方法与GarmentX数据集集成在一起，我们在几何保真度和输入图像对准方面实现了最先进的性能，从而大大优于先验方法。我们将在发布后发布GarmentX数据集。

Title: ADiff4TPP: Asynchronous Diffusion Models for Temporal Point Processes

Authors: Amartya Mukherjee, Ruizhi Deng, He Zhao, Yuzhen Mao, Leonid Sigal, Frederick Tung
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2504.20411
Pdf URL: https://arxiv.org/pdf/2504.20411
Copy Paste: [[2504.20411]] ADiff4TPP: Asynchronous Diffusion Models for Temporal Point Processes(https://arxiv.org/abs/2504.20411)
Keywords: generation
Abstract: This work introduces a novel approach to modeling temporal point processes using diffusion models with an asynchronous noise schedule. At each step of the diffusion process, the noise schedule injects noise of varying scales into different parts of the data. With a careful design of the noise schedules, earlier events are generated faster than later ones, thus providing stronger conditioning for forecasting the more distant future. We derive an objective to effectively train these models for a general family of noise schedules based on conditional flow matching. Our method models the joint distribution of the latent representations of events in a sequence and achieves state-of-the-art results in predicting both the next inter-event time and event type on benchmark datasets. Additionally, it flexibly accommodates varying lengths of observation and prediction windows in different forecasting settings by adjusting the starting and ending points of the generation process. Finally, our method shows superior performance in long-horizon prediction tasks, outperforming existing baseline methods.
摘要：这项工作介绍了一种新的方法，可以使用具有异步噪声时间表的扩散模型来建模时间点过程。在扩散过程的每个步骤中，噪声时间表将不同尺度的噪声注入数据的不同部分。通过仔细设计噪声时间表，早期的事件比以后的事件更快，从而为预测更遥远的未来提供了更强的条件。我们得出一个目标，可以根据条件流匹配有效地训练这些模型为一般的噪声表。我们的方法模拟了事件的潜在表示的联合分布并实现最先进的结果，从而预测了基准数据集中的下一个活动间时间和事件类型。此外，通过调整生成过程的起点和结尾点，它可以灵活地适应不同预测设置中不同的观察和预测窗口。最后，我们的方法在长马预测任务中显示出卓越的性能，超过了现有的基线方法。
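
Note on the ADiff4TPP entry above: the abstract says the noise schedule assigns different noise scales to different parts of the sequence so that earlier events are generated before later ones. The sketch below shows one simple way such a per-event schedule could be laid out. It is a hypothetical linear-lag schedule invented for illustration, not the schedule family derived in the paper; the `lag` parameter, the clipping, and the function names are all assumptions.

```python
def per_event_progress(s, num_events, lag=0.25):
    """Map global generation progress s in [0, 1] to a per-event progress in [0, 1].

    Event i starts denoising only after earlier events have a head start of
    `lag` per position, so event 0 finishes first (assumed schedule shape).
    """
    if not 0.0 <= s <= 1.0:
        raise ValueError("s must lie in [0, 1]")
    # Stretch the global clock so the last event still reaches progress 1 at s = 1.
    stretched = s * (1.0 + lag * (num_events - 1))
    return [min(1.0, max(0.0, stretched - lag * i)) for i in range(num_events)]

def noise_scales(s, num_events, lag=0.25):
    """Noise level per event: 1.0 means pure noise, 0.0 means fully generated."""
    return [1.0 - c for c in per_event_progress(s, num_events, lag)]

# Halfway through generation, earlier events are cleaner than later ones.
print(noise_scales(0.5, 5))  # [0.0, 0.25, 0.5, 0.75, 1.0]
```

Under this toy schedule, changing the starting and ending values of `s` changes which events are treated as observed and which are still being predicted, which is loosely the kind of window adjustment the abstract mentions.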

Title: GaLore 2: Large-Scale LLM Pre-Training by Gradient Low-Rank Projection

Authors: DiJia Su, Andrew Gu, Jane Xu, Yuandong Tian, Jiawei Zhao
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2504.20437
Pdf URL: https://arxiv.org/pdf/2504.20437
Copy Paste: [[2504.20437]] GaLore 2: Large-Scale LLM Pre-Training by Gradient Low-Rank Projection(https://arxiv.org/abs/2504.20437)
Keywords: generation
Abstract: Large language models (LLMs) have revolutionized natural language understanding and generation but face significant memory bottlenecks during training. GaLore, Gradient Low-Rank Projection, addresses this issue by leveraging the inherent low-rank structure of weight gradients, enabling substantial memory savings without sacrificing performance. Recent works further extend GaLore from various aspects, including low-bit quantization and higher-order tensor structures. However, there are several remaining challenges for GaLore, such as the computational overhead of SVD for subspace updates and the integration with state-of-the-art training parallelization strategies (e.g., FSDP). In this paper, we present GaLore 2, an efficient and scalable GaLore framework that addresses these challenges and incorporates recent advancements. In addition, we demonstrate the scalability of GaLore 2 by pre-training Llama 7B from scratch using up to 500 billion training tokens, highlighting its potential impact on real LLM pre-training scenarios.
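As a rough illustration of gradient low-rank projection, the sketch below projects a weight gradient onto its top singular directions, keeps optimizer state only in that small subspace, and projects the update back. It uses plain momentum instead of the optimizers used in the GaLore papers, and the function names and hyperparameters are assumptions for illustration, not the GaLore 2 implementation.

```python
import numpy as np


def compute_projector(grad, rank):
    """Return the top-`rank` left singular vectors of the gradient (m x rank)."""
    u, _, _ = np.linalg.svd(grad, full_matrices=False)
    return u[:, :rank]


def low_rank_step(weight, grad, proj, momentum, lr=0.01, beta=0.9):
    """One momentum step whose optimizer state lives in the rank-r subspace."""
    low_rank_grad = proj.T @ grad                    # shape (rank, n)
    momentum = beta * momentum + (1.0 - beta) * low_rank_grad
    weight = weight - lr * (proj @ momentum)         # project back to (m, n)
    return weight, momentum
```

The memory saving comes from `momentum` having shape `(rank, n)` rather than `(m, n)`. The SVD in `compute_projector` is the subspace-update cost that the abstract names as a remaining challenge.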

Title: PixelHacker: Image Inpainting with Structural and Semantic Consistency

Authors: Ziyang Xu, Kangsheng Duan, Xiaolei Shen, Zhifeng Ding, Wenyu Liu, Xiaohu Ruan, Xiaoxin Chen, Xinggang Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.20438
Pdf URL: https://arxiv.org/pdf/2504.20438
Copy Paste: [[2504.20438]] PixelHacker: Image Inpainting with Structural and Semantic Consistency(https://arxiv.org/abs/2504.20438)
Keywords: restoration, generation
Abstract: Image inpainting is a fundamental research area between image editing and image generation. Recent state-of-the-art (SOTA) methods have explored novel attention mechanisms, lightweight architectures, and context-aware modeling, demonstrating impressive performance. However, they often struggle with complex structure (e.g., texture, shape, spatial relations) and semantics (e.g., color consistency, object restoration, and logical correctness), leading to artifacts and inappropriate generation. To address this challenge, we design a simple yet effective inpainting paradigm called latent categories guidance, and further propose a diffusion-based model named PixelHacker. Specifically, we first construct a large dataset containing 14 million image-mask pairs by annotating foreground and background (potential 116 and 21 categories, respectively). Then, we encode potential foreground and background representations separately through two fixed-size embeddings, and intermittently inject these features into the denoising process via linear attention. Finally, by pre-training on our dataset and fine-tuning on open-source benchmarks, we obtain PixelHacker. Extensive experiments show that PixelHacker comprehensively outperforms the SOTA on a wide range of datasets (Places2, CelebA-HQ, and FFHQ) and exhibits remarkable consistency in both structure and semantics. Project page at this https URL.

Title: Reviving Any-Subset Autoregressive Models with Principled Parallel Sampling and Speculative Decoding

Authors: Gabe Guo, Stefano Ermon
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2504.20456
Pdf URL: https://arxiv.org/pdf/2504.20456
Copy Paste: [[2504.20456]] Reviving Any-Subset Autoregressive Models with Principled Parallel Sampling and Speculative Decoding(https://arxiv.org/abs/2504.20456)
Keywords: generation
Abstract: In arbitrary-order language models, it is an open question how to sample tokens in parallel from the correct joint distribution. With discrete diffusion models, the more tokens they generate in parallel, the less their predicted distributions adhere to the originally learned data distribution, as they rely on a conditional independence assumption that only works with infinitesimally small timesteps. We find that a different class of models, any-subset autoregressive models (AS-ARMs), holds the solution. As implied by the name, AS-ARMs can generate tokens in any order, and in parallel. Moreover, AS-ARMs support parallelized joint probability density estimation, allowing them to correct their own parallel-generated token distributions, via our Any-Subset Speculative Decoding (ASSD) algorithm. ASSD provably enables generation of tokens from the correct joint distribution, with the number of neural network calls upper bounded by the number of tokens predicted. We empirically verify that ASSD speeds up language generation, without sacrificing quality. Furthermore, we provide a mathematically justified scheme for training AS-ARMs for generation, and show that AS-ARMs achieve state-of-the-art performance among sub-200M parameter models on infilling benchmark tasks, and nearly match the performance of models 50X larger on code generation. Our theoretical and empirical results indicate that the once-forgotten AS-ARMs are a promising direction of language modeling.
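The accept/reject rule behind speculative decoding in general can be sketched for a single token as follows. This is the standard speculative sampling rule, not the ASSD algorithm from the paper: a token is proposed from a cheap draft distribution `q` and then corrected so that the final sample follows the target distribution `p` exactly. The function name and the list-based distributions are assumptions for illustration.

```python
import random


def speculative_sample(p, q, rng):
    """Return one token index distributed exactly according to p.

    p, q: lists of probabilities over the same vocabulary (target and draft).
    rng:  a random.Random instance.
    """
    x = rng.choices(range(len(q)), weights=q)[0]     # propose from the draft
    if rng.random() < min(1.0, p[x] / q[x]):         # accept with prob min(1, p/q)
        return x
    # On rejection, resample from the residual max(p - q, 0), which
    # random.choices normalizes implicitly through its weights argument.
    residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
    return rng.choices(range(len(p)), weights=residual)[0]
```

The closer the draft `q` is to the target `p`, the more proposals are accepted, which is where the speed-up comes from.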

Title: LMM4Gen3DHF: Benchmarking and Evaluating Multimodal 3D Human Face Generation with LMMs

Authors: Woo Yi Yang, Jiarui Wang, Sijing Wu, Huiyu Duan, Yuxin Zhu, Liu Yang, Kang Fu, Guangtao Zhai, Xiongkuo Min
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.20466
Pdf URL: https://arxiv.org/pdf/2504.20466
Copy Paste: [[2504.20466]] LMM4Gen3DHF: Benchmarking and Evaluating Multimodal 3D Human Face Generation with LMMs(https://arxiv.org/abs/2504.20466)
Keywords: generation, generative, quality assessment
Abstract: The rapid advancement of generative artificial intelligence has enabled the creation of 3D human faces (HFs) for applications including media production, virtual reality, security, healthcare, and game development. However, assessing the quality and realism of these AI-generated 3D human faces remains a significant challenge due to the subjective nature of human perception and innate perceptual sensitivity to facial features. To this end, we conduct a comprehensive study on the quality assessment of AI-generated 3D human faces. We first introduce Gen3DHF, a large-scale benchmark comprising 2,000 videos of AI-Generated 3D Human Faces along with 4,000 Mean Opinion Scores (MOS) collected across two dimensions, i.e., quality and authenticity, as well as 2,000 distortion-aware saliency maps and distortion descriptions. Based on Gen3DHF, we propose LMME3DHF, a Large Multimodal Model (LMM)-based metric for Evaluating 3DHF capable of quality and authenticity score prediction, distortion-aware visual question answering, and distortion-aware saliency prediction. Experimental results show that LMME3DHF achieves state-of-the-art performance, surpassing existing methods in both accurately predicting quality scores for AI-generated 3D human faces and effectively identifying distortion-aware salient regions and distortion types, while maintaining strong alignment with human perceptual judgments. Both the Gen3DHF database and LMME3DHF will be released upon publication.

Title: Antidote: A Unified Framework for Mitigating LVLM Hallucinations in Counterfactual Presupposition and Object Perception

Authors: Yuanchen Wu, Lu Zhang, Hang Yao, Junlong Du, Ke Yan, Shouhong Ding, Yunsheng Wu, Xiaoqiang Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.20468
Pdf URL: https://arxiv.org/pdf/2504.20468
Copy Paste: [[2504.20468]] Antidote: A Unified Framework for Mitigating LVLM Hallucinations in Counterfactual Presupposition and Object Perception(https://arxiv.org/abs/2504.20468)
Keywords: generation
Abstract: Large Vision-Language Models (LVLMs) have achieved impressive results across various cross-modal tasks. However, hallucinations, i.e., the models generating counterfactual responses, remain a challenge. Though recent studies have attempted to alleviate object perception hallucinations, they focus on the models' response generation while overlooking the task question itself. This paper discusses the vulnerability of LVLMs in solving counterfactual presupposition questions (CPQs), where the models are prone to accept the presuppositions of counterfactual objects and produce severe hallucinatory responses. To this end, we introduce "Antidote", a unified, synthetic data-driven post-training framework for mitigating both types of hallucination above. It leverages synthetic data to incorporate factual priors into questions to achieve self-correction, and decouples the mitigation process into a preference optimization problem. Furthermore, we construct "CP-Bench", a novel benchmark to evaluate LVLMs' ability to correctly handle CPQs and produce factual responses. Applied to the LLaVA series, Antidote can simultaneously enhance performance on CP-Bench by over 50%, POPE by 1.8-3.3%, and CHAIR & SHR by 30-50%, all without relying on external supervision from stronger LVLMs or human feedback, and without introducing noticeable catastrophic forgetting issues.

Title: Geometry-aware Temporal Aggregation Network for Monocular 3D Lane Detection

Authors: Huan Zheng, Wencheng Han, Tianyi Yan, Cheng-zhong Xu, Jianbing Shen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.20525
Pdf URL: https://arxiv.org/pdf/2504.20525
Copy Paste: [[2504.20525]] Geometry-aware Temporal Aggregation Network for Monocular 3D Lane Detection(https://arxiv.org/abs/2504.20525)
Keywords: generation
Abstract: Monocular 3D lane detection aims to estimate the 3D position of lanes from frontal-view (FV) images. However, current monocular 3D lane detection methods suffer from two limitations: inaccurate geometric information of the predicted 3D lanes and difficulties in maintaining lane integrity. To address these issues, we seek to fully exploit the potential of multiple input frames. First, we aim to enhance the ability to perceive the geometry of scenes by leveraging temporal geometric consistency. Second, we strive to improve the integrity of lanes by revealing more instance information from temporal sequences. Therefore, we propose a novel Geometry-aware Temporal Aggregation Network (GTA-Net) for monocular 3D lane detection. On one hand, we develop the Temporal Geometry Enhancement Module (TGEM), which exploits geometric consistency across successive frames, facilitating effective geometry perception. On the other hand, we present the Temporal Instance-aware Query Generation (TIQG), which strategically incorporates temporal cues into query generation, thereby enabling the exploration of comprehensive instance information. Experiments demonstrate that our GTA-Net achieves SoTA results, surpassing existing monocular 3D lane detection solutions.

Title: Autoencoder Models for Point Cloud Environmental Synthesis from WiFi Channel State Information: A Preliminary Study

Authors: Daniele Pannone, Danilo Avola
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2504.20541
Pdf URL: https://arxiv.org/pdf/2504.20541
Copy Paste: [[2504.20541]] Autoencoder Models for Point Cloud Environmental Synthesis from WiFi Channel State Information: A Preliminary Study(https://arxiv.org/abs/2504.20541)
Keywords: generation
Abstract: This paper introduces a deep learning framework for generating point clouds from WiFi Channel State Information data. We employ a two-stage autoencoder approach: a PointNet autoencoder with convolutional layers for point cloud generation, and a Convolutional Neural Network autoencoder to map CSI data to a matching latent space. By aligning these latent spaces, our method enables accurate environmental point cloud reconstruction from WiFi data. Experimental results validate the effectiveness of our approach, highlighting its potential for wireless sensing and environmental mapping applications.

Title: Digital Shielding for Cross-Domain Wi-Fi Signal Adaptation using Relativistic Average Generative Adversarial Network

Authors: Danilo Avola, Federica Bruni, Gian Luca Foresti, Daniele Pannone, Amedeo Ranaldi
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2504.20568
Pdf URL: https://arxiv.org/pdf/2504.20568
Copy Paste: [[2504.20568]] Digital Shielding for Cross-Domain Wi-Fi Signal Adaptation using Relativistic Average Generative Adversarial Network(https://arxiv.org/abs/2504.20568)
Keywords: generative
Abstract: Wi-Fi sensing uses radio-frequency signals from Wi-Fi devices to analyze environments, enabling tasks such as tracking people, detecting intrusions, and recognizing gestures. The rise of this technology is driven by the IEEE 802.11bf standard and growing demand for tools that can ensure privacy and operate through obstacles. However, the performance of Wi-Fi sensing is heavily influenced by environmental conditions, especially when extracting spatial and temporal features from the surrounding scene. A key challenge is achieving robust generalization across domains, ensuring stable performance even when the sensing environment changes significantly. This paper introduces a novel deep learning model for cross-domain adaptation of Wi-Fi signals, inspired by physical signal shielding. The model uses a Relativistic average Generative Adversarial Network (RaGAN) with Bidirectional Long Short-Term Memory (Bi-LSTM) architectures for both the generator and discriminator. To simulate physical shielding, an acrylic box lined with electromagnetic shielding fabric was constructed, mimicking a Faraday cage. Wi-Fi signal spectra were collected from various materials both inside (domain-free) and outside (domain-dependent) the box to train the model. A multi-class Support Vector Machine (SVM) was trained on domain-free spectra and tested on signals denoised by the RaGAN. The system achieved 96% accuracy and demonstrated strong material discrimination capabilities, offering potential for use in security applications to identify concealed objects based on their composition.

Title: AlignDiT: Multimodal Aligned Diffusion Transformer for Synchronized Speech Generation

Authors: Jeongsoo Choi, Ji-Hoon Kim, Kim Sung-Bin, Tae-Hyun Oh, Joon Son Chung
Subjects: cs.CV, cs.AI, cs.MM
Abstract URL: https://arxiv.org/abs/2504.20629
Pdf URL: https://arxiv.org/pdf/2504.20629
Copy Paste: [[2504.20629]] AlignDiT: Multimodal Aligned Diffusion Transformer for Synchronized Speech Generation(https://arxiv.org/abs/2504.20629)
Keywords: generation
Abstract: In this paper, we address the task of multimodal-to-speech generation, which aims to synthesize high-quality speech from multiple input modalities: text, video, and reference audio. This task has gained increasing attention due to its wide range of applications, such as film production, dubbing, and virtual avatars. Despite recent progress, existing methods still suffer from limitations in speech intelligibility, audio-video synchronization, speech naturalness, and voice similarity to the reference speaker. To address these challenges, we propose AlignDiT, a multimodal Aligned Diffusion Transformer that generates accurate, synchronized, and natural-sounding speech from aligned multimodal inputs. Built upon the in-context learning capability of the DiT architecture, AlignDiT explores three effective strategies to align multimodal representations. Furthermore, we introduce a novel multimodal classifier-free guidance mechanism that allows the model to adaptively balance information from each modality during speech synthesis. Extensive experiments demonstrate that AlignDiT significantly outperforms existing methods across multiple benchmarks in terms of quality, synchronization, and speaker similarity. Moreover, AlignDiT exhibits strong generalization capability across various multimodal tasks, such as video-to-speech synthesis and visual forced alignment, consistently achieving state-of-the-art performance. The demo page is available at this https URL .

Title: Bridging the Generalisation Gap: Synthetic Data Generation for Multi-Site Clinical Model Validation

Authors: Bradley Segal, Joshua Fieggen, David Clifton, Lei Clifton
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2504.20635
Pdf URL: https://arxiv.org/pdf/2504.20635
Copy Paste: [[2504.20635]] Bridging the Generalisation Gap: Synthetic Data Generation for Multi-Site Clinical Model Validation(https://arxiv.org/abs/2504.20635)
Keywords: generation, generative
Abstract: Ensuring the generalisability of clinical machine learning (ML) models across diverse healthcare settings remains a significant challenge due to variability in patient demographics, disease prevalence, and institutional practices. Existing model evaluation approaches often rely on real-world datasets, which are limited in availability, embed confounding biases, and lack the flexibility needed for systematic experimentation. Furthermore, while generative models aim for statistical realism, they often lack transparency and explicit control over factors driving distributional shifts. In this work, we propose a novel structured synthetic data framework designed for the controlled benchmarking of model robustness, fairness, and generalisability. Unlike approaches focused solely on mimicking observed data, our framework provides explicit control over the data generating process, including site-specific prevalence variations, hierarchical subgroup effects, and structured feature interactions. This enables targeted investigation into how models respond to specific distributional shifts and potential biases. Through controlled experiments, we demonstrate the framework's ability to isolate the impact of site variations, support fairness-aware audits, and reveal generalisation failures, particularly highlighting how model complexity interacts with site-specific effects. This work contributes a reproducible, interpretable, and configurable tool designed to advance the reliable deployment of ML in clinical settings.

Title: Advance Fake Video Detection via Vision Transformers

Authors: Joy Battocchio (1), Stefano Dell'Anna (1), Andrea Montibeller (1), Giulia Boato (1,2) ((1) University of Trento, Trento, Italy, (2) Truebees srl, Trento, Italy)
Subjects: cs.CV, cs.AI, cs.MM
Abstract URL: https://arxiv.org/abs/2504.20669
Pdf URL: https://arxiv.org/pdf/2504.20669
Copy Paste: [[2504.20669]] Advance Fake Video Detection via Vision Transformers(https://arxiv.org/abs/2504.20669)
Keywords: generation, generative
Abstract: Recent advancements in AI-based multimedia generation have enabled the creation of hyper-realistic images and videos, raising concerns about their potential use in spreading misinformation. The widespread accessibility of generative techniques, which allow for the production of fake multimedia from prompts or existing media, along with their continuous refinement, underscores the urgent need for highly accurate and generalizable AI-generated media detection methods, underlined also by new regulations like the European Digital AI Act. In this paper, we draw inspiration from Vision Transformer (ViT)-based fake image detection and extend this idea to video. We propose an original framework that effectively integrates ViT embeddings over time to enhance detection performance. Our method shows promising accuracy, generalization, and few-shot learning capabilities across a new, large, and diverse dataset of videos generated using five state-of-the-art open-source generative techniques, as well as a separate dataset containing videos produced by proprietary generative methods.

Title: Efficient Listener: Dyadic Facial Motion Synthesis via Action Diffusion

Authors: Zesheng Wang, Alexandre Bruckert, Patrick Le Callet, Guangtao Zhai
Subjects: cs.CV, cs.HC
Abstract URL: https://arxiv.org/abs/2504.20685
Pdf URL: https://arxiv.org/pdf/2504.20685
Copy Paste: [[2504.20685]] Efficient Listener: Dyadic Facial Motion Synthesis via Action Diffusion(https://arxiv.org/abs/2504.20685)
Keywords: generation
Abstract: Generating realistic listener facial motions in dyadic conversations remains challenging due to the high-dimensional action space and temporal dependency requirements. Existing approaches usually consider extracting 3D Morphable Model (3DMM) coefficients and modeling in the 3DMM space. However, this makes the computational speed of the 3DMM a bottleneck, making it difficult to achieve real-time interactive responses. To tackle this problem, we propose Facial Action Diffusion (FAD), which introduces diffusion methods from the field of image generation to achieve efficient facial action generation. We further build the Efficient Listener Network (ELNet), specially designed to accommodate both the visual and audio information of the speaker as input. By combining FAD and ELNet, the proposed method learns effective listener facial motion representations and improves performance over state-of-the-art methods while reducing computational time by 99%.

Title: What's Wrong with Your Synthetic Tabular Data? Using Explainable AI to Evaluate Generative Models

Authors: Jan Kapar, Niklas Koenen, Martin Jullum
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2504.20687
Pdf URL: https://arxiv.org/pdf/2504.20687
Copy Paste: [[2504.20687]] What's Wrong with Your Synthetic Tabular Data? Using Explainable AI to Evaluate Generative Models(https://arxiv.org/abs/2504.20687)
Keywords: generative
Abstract: Evaluating synthetic tabular data is challenging, since they can differ from the real data in so many ways. There exist numerous metrics of synthetic data quality, ranging from statistical distances to predictive performance, often providing conflicting results. Moreover, they fail to explain or pinpoint the specific weaknesses in the synthetic data. To address this, we apply explainable AI (XAI) techniques to a binary detection classifier trained to distinguish real from synthetic data. While the classifier identifies distributional differences, XAI concepts such as feature importance and feature effects, analyzed through methods like permutation feature importance, partial dependence plots, Shapley values and counterfactual explanations, reveal why synthetic data are distinguishable, highlighting inconsistencies, unrealistic dependencies, or missing patterns. This interpretability increases transparency in synthetic data evaluation and provides deeper insights beyond conventional metrics, helping diagnose and improve synthetic data quality. We apply our approach to two tabular datasets and generative models, showing that it uncovers issues overlooked by standard evaluation techniques.
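One of the XAI methods named in the abstract, permutation feature importance applied to a real-versus-synthetic detection classifier, can be sketched as below. The logistic-regression detector, the function names, and the hyperparameters are assumptions chosen to keep the example self-contained; the abstract does not specify which classifier or tooling the authors use.

```python
import numpy as np


def fit_logistic(X, y, lr=0.1, steps=1000):
    """Fit a logistic-regression detector (label 1 = synthetic) by gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w -= lr * X.T @ (p - y) / len(y)
        b -= lr * np.mean(p - y)
    return w, b


def accuracy(w, b, X, y):
    """Fraction of rows whose predicted label matches y."""
    return np.mean(((X @ w + b) > 0) == (y == 1))


def permutation_importance(w, b, X, y, rng):
    """Accuracy drop when each feature column is shuffled independently."""
    base = accuracy(w, b, X, y)
    drops = []
    for j in range(X.shape[1]):
        shuffled = X.copy()
        shuffled[:, j] = rng.permutation(shuffled[:, j])
        drops.append(base - accuracy(w, b, shuffled, y))
    return np.array(drops)
```

A feature with a large accuracy drop is one the detector relies on to tell synthetic rows from real ones, which points at where the generative model reproduces the real distribution poorly.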

Title: In-Context Edit: Enabling Instructional Image Editing with In-Context Generation in Large Scale Diffusion Transformer

Authors: Zechuan Zhang, Ji Xie, Yu Lu, Zongxin Yang, Yi Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.20690
Pdf URL: https://arxiv.org/pdf/2504.20690
Copy Paste: [[2504.20690]] In-Context Edit: Enabling Instructional Image Editing with In-Context Generation in Large Scale Diffusion Transformer(https://arxiv.org/abs/2504.20690)
Keywords: generation
Abstract: Instruction-based image editing enables robust image modification via natural language prompts, yet current methods face a precision-efficiency tradeoff. Fine-tuning methods demand significant computational resources and large datasets, while training-free techniques struggle with instruction comprehension and edit quality. We resolve this dilemma by leveraging the enhanced generation capacity and native contextual awareness of large-scale Diffusion Transformers (DiT). Our solution introduces three contributions: (1) an in-context editing framework for zero-shot instruction compliance using in-context prompting, avoiding structural changes; (2) a LoRA-MoE hybrid tuning strategy that enhances flexibility with efficient adaptation and dynamic expert routing, without extensive retraining; and (3) an early filter inference-time scaling method using vision-language models (VLMs) to select better initial noise early, improving edit quality. Extensive evaluations demonstrate our method's superiority: it outperforms state-of-the-art approaches while requiring only 0.5% of the training data and 1% of the trainable parameters compared to conventional baselines. This work establishes a new paradigm that enables high-precision yet efficient instruction-guided editing. Codes and demos can be found in this https URL.

Title: DDPS: Discrete Diffusion Posterior Sampling for Paths in Layered Graphs

Authors: Hao Luan, See-Kiong Ng, Chun Kai Ling
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2504.20754
Pdf URL: https://arxiv.org/pdf/2504.20754
Copy Paste: [[2504.20754]] DDPS: Discrete Diffusion Posterior Sampling for Paths in Layered Graphs(https://arxiv.org/abs/2504.20754)
Keywords: generation, generative
Abstract: Diffusion models form an important class of generative models today, accounting for much of the state of the art in cutting-edge AI research. While numerous extensions beyond image and video generation exist, few such approaches address the issue of explicit constraints in the samples generated. In this paper, we study the problem of generating paths in a layered graph (a variant of a directed acyclic graph) using discrete diffusion models, while guaranteeing that our generated samples are indeed paths. Our approach utilizes a simple yet effective representation for paths which we call the padded adjacency-list matrix (PALM). In addition, we show how to effectively perform classifier guidance, which helps steer the sampled paths to specific preferred edges without any retraining of the diffusion model. Our preliminary results show that, empirically, our method outperforms alternatives which do not explicitly account for path constraints.

Title: JTreeformer: Graph-Transformer via Latent-Diffusion Model for Molecular Generation

Authors: Ji Shi, Chengxun Xie, Zhonghao Li, Xinming Zhang, Miao Zhang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2504.20770
Pdf URL: https://arxiv.org/pdf/2504.20770
Copy Paste: [[2504.20770]] JTreeformer: Graph-Transformer via Latent-Diffusion Model for Molecular Generation(https://arxiv.org/abs/2504.20770)
Keywords: generation
Abstract: The discovery of new molecules based on the original chemical molecule distributions is of great importance in medicine. The graph transformer, with its advantages of high performance and scalability compared to traditional graph networks, has been widely explored in recent research for applications of graph structures. However, current transformer-based graph decoders struggle to effectively utilize graph information, which limits them to leveraging only sequences of nodes rather than the complex topological structures of molecule graphs. This paper focuses on building a graph transformer-based framework for molecular generation, which we call JTreeformer as it transforms graph generation into junction tree generation. It combines a GCN in parallel with multi-head attention as the encoder. It integrates a directed acyclic GCN into a graph-based Transformer to serve as a decoder, which can iteratively synthesize the entire molecule by leveraging information from the partially constructed molecular structure at each step. In addition, a diffusion model is inserted in the latent space generated by the encoder, to further enhance the efficiency and effectiveness of sampling. The empirical results demonstrate that our novel framework outperforms existing molecule generation methods, thus offering a promising tool to advance drug discovery (this https URL).

Title: CMT: A Cascade MAR with Topology Predictor for Multimodal Conditional CAD Generation

Authors: Jianyu Wu, Yizhou Wang, Xiangyu Yue, Xinzhu Ma, Jingyang Guo, Dongzhan Zhou, Wanli Ouyang, Shixiang Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.20830
Pdf URL: https://arxiv.org/pdf/2504.20830
Copy Paste: [[2504.20830]] CMT: A Cascade MAR with Topology Predictor for Multimodal Conditional CAD Generation(https://arxiv.org/abs/2504.20830)
Keywords: generation
Abstract: While accurate and user-friendly Computer-Aided Design (CAD) is crucial for industrial design and manufacturing, existing methods still struggle to achieve this due to their over-simplified representations or architectures incapable of supporting multimodal design requirements. In this paper, we attempt to tackle this problem from both the method and the dataset perspectives. First, we propose a cascade MAR with topology predictor (CMT), the first multimodal framework for CAD generation based on Boundary Representation (B-Rep). Specifically, the cascade MAR can effectively capture the "edge-counters-surface" priors that are essential in B-Reps, while the topology predictor directly estimates topology in B-Reps from the compact tokens in MAR. Second, to facilitate large-scale training, we develop a large-scale multimodal CAD dataset, mmABC, which includes over 1.3 million B-Rep models with multimodal annotations, including point clouds, text descriptions, and multi-view images. Extensive experiments show the superiority of CMT in both conditional and unconditional CAD generation tasks. For example, we improve Coverage and Valid ratio by +10.68% and +10.3%, respectively, compared to state-of-the-art methods on ABC in unconditional generation. CMT also improves Chamfer by +4.01 on image-conditioned CAD generation on mmABC. The dataset, code, and pretrained network will be released.

Title: Tabular Data Adapters: Improving Outlier Detection for Unlabeled Private Data

Authors: Dayananda Herurkar, Jörn Hees, Vesselin Tzvetkov, Andreas Dengel
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2504.20862
Pdf URL: https://arxiv.org/pdf/2504.20862
Copy Paste: [[2504.20862]] Tabular Data Adapters: Improving Outlier Detection for Unlabeled Private Data(https://arxiv.org/abs/2504.20862)
Keywords: generation
Abstract: The remarkable success of Deep Learning approaches is often based on, and demonstrated on, large public datasets. However, when applying such approaches to internal, private datasets, one frequently faces challenges arising from structural differences in the datasets, domain shift, and the lack of labels. In this work, we introduce Tabular Data Adapters (TDA), a novel method for generating soft labels for unlabeled tabular data in outlier detection tasks. By identifying statistically similar public datasets and transforming private data (based on a shared autoencoder) into a format compatible with state-of-the-art public models, our approach enables the generation of weak labels. It can thereby help to mitigate the cold start problem of labeling by building on existing outlier detection models for public datasets. In experiments on 50 tabular datasets across different domains, we demonstrate that our method is able to provide more accurate annotations than baseline approaches while reducing computational time. Our approach offers a scalable, efficient, and cost-effective solution to bridge the gap between public research models and real-world industrial applications.

Title: AI-GenBench: A New Ongoing Benchmark for AI-Generated Image Detection

Authors: Lorenzo Pellegrini, Davide Cozzolino, Serafino Pandolfini, Davide Maltoni, Matteo Ferrara, Luisa Verdoliva, Marco Prati, Marco Ramilli
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.20865
Pdf URL: https://arxiv.org/pdf/2504.20865
Copy Paste: [[2504.20865]] AI-GenBench: A New Ongoing Benchmark for AI-Generated Image Detection(https://arxiv.org/abs/2504.20865)
Keywords: generative
Abstract: The rapid advancement of generative AI has revolutionized image creation, enabling high-quality synthesis from text prompts while raising critical challenges for media authenticity. We present Ai-GenBench, a novel benchmark designed to address the urgent need for robust detection of AI-generated images in real-world scenarios. Unlike existing solutions that evaluate models on static datasets, Ai-GenBench introduces a temporal evaluation framework where detection methods are incrementally trained on synthetic images, historically ordered by their generative models, to test their ability to generalize to new generative models, such as the transition from GANs to diffusion models. Our benchmark focuses on high-quality, diverse visual content and overcomes key limitations of current approaches, including arbitrary dataset splits, unfair comparisons, and excessive computational demands. Ai-GenBench provides a comprehensive dataset, a standardized evaluation protocol, and accessible tools for both researchers and non-experts (e.g., journalists, fact-checkers), ensuring reproducibility while maintaining practical training requirements. By establishing clear evaluation rules and controlled augmentation strategies, Ai-GenBench enables meaningful comparison of detection methods and scalable solutions. Code and data are publicly available to ensure reproducibility and to support the development of robust forensic detectors to keep pace with the rise of new synthetic generators.

Title: Evaluating Generative Models for Tabular Data: Novel Metrics and Benchmarking

Authors: Dayananda Herurkar, Ahmad Ali, Andreas Dengel
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2504.20900
Pdf URL: https://arxiv.org/pdf/2504.20900
Copy Paste: [[2504.20900]] Evaluating Generative Models for Tabular Data: Novel Metrics and Benchmarking(https://arxiv.org/abs/2504.20900)
Keywords: generative
Abstract: Generative models have revolutionized multiple domains, yet their application to tabular data remains underexplored. Evaluating generative models for tabular data presents unique challenges due to structural complexity, large-scale variability, and mixed data types, making it difficult to intuitively capture intricate patterns. Existing evaluation metrics offer only partial insights, lacking a comprehensive measure of generative performance. To address this limitation, we propose three novel evaluation metrics: FAED, FPCAD, and RFIS. Our extensive experimental analysis, conducted on three standard network intrusion detection datasets, compares these metrics with established evaluation methods such as Fidelity, Utility, TSTR, and TRTS. Our results demonstrate that FAED effectively captures generative modeling issues overlooked by existing metrics. While FPCAD exhibits promising performance, further refinements are necessary to enhance its reliability. Our proposed framework provides a robust and practical approach for assessing generative models in tabular data applications.
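Of the established evaluation methods the abstract names, TSTR (train on synthetic, test on real) and TRTS (train on real, test on synthetic) are easy to illustrate. The sketch below is not the paper's benchmark code and does not implement FAED, FPCAD, or RFIS. It assumes a toy nearest-centroid classifier and made-up two-feature rows, and every name in it is hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Row = Tuple[Sequence[float], int]  # (features, class label)


def fit_centroids(rows: List[Row]) -> Dict[int, List[float]]:
    """Fit a nearest-centroid classifier: one mean feature vector per class."""
    sums: Dict[int, List[float]] = {}
    counts: Dict[int, int] = {}
    for x, y in rows:
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}


def predict(centroids: Dict[int, List[float]], x: Sequence[float]) -> int:
    """Predict the class whose centroid is closest in squared Euclidean distance."""
    return min(centroids, key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])))


def accuracy(centroids: Dict[int, List[float]], rows: List[Row]) -> float:
    return sum(predict(centroids, x) == y for x, y in rows) / len(rows)


def tstr(synthetic: List[Row], real: List[Row]) -> float:
    """Train on Synthetic, Test on Real."""
    return accuracy(fit_centroids(synthetic), real)


def trts(real: List[Row], synthetic: List[Row]) -> float:
    """Train on Real, Test on Synthetic."""
    return accuracy(fit_centroids(real), synthetic)


# Made-up rows for illustration only; not data from the paper.
real = [((0.0, 0.1), 0), ((0.2, 0.0), 0), ((1.0, 0.9), 1), ((0.9, 1.1), 1)]
synthetic = [((0.1, 0.1), 0), ((0.0, 0.3), 0), ((1.1, 1.0), 1), ((0.8, 0.9), 1)]

print("TSTR accuracy:", tstr(synthetic, real))
print("TRTS accuracy:", trts(real, synthetic))
```

In practice the real rows scored under TSTR would be a held-out test split, and the classifier would be whatever downstream model the evaluation targets.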

Title: Deep Learning Characterizes Depression and Suicidal Ideation from Eye Movements

Authors: Kleanthis Avramidis, Woojae Jeong, Aditya Kommineni, Sudarsana R. Kadiri, Marcus Ma, Colin McDaniel, Myzelle Hughes, Thomas McGee, Elsi Kaiser, Dani Byrd, Assal Habibi, B. Rael Cahn, Idan A. Blank, Kristina Lerman, Takfarinas Medani, Richard M. Leahy, Shrikanth Narayanan
Subjects: cs.LG, eess.SP
Abstract URL: https://arxiv.org/abs/2504.20944
Pdf URL: https://arxiv.org/pdf/2504.20944
Copy Paste: [[2504.20944]] Deep Learning Characterizes Depression and Suicidal Ideation from Eye Movements(https://arxiv.org/abs/2504.20944)
Keywords: generation
Abstract: Identifying physiological and behavioral markers for mental health conditions is a longstanding challenge in psychiatry. Depression and suicidal ideation, in particular, lack objective biomarkers, with screening and diagnosis primarily relying on self-reports and clinical interviews. Here, we investigate eye tracking as a potential marker modality for screening purposes. Eye movements are directly modulated by neuronal networks and have been associated with attentional and mood-related patterns; however, their predictive value for depression and suicidality remains unclear. We recorded eye-tracking sequences from 126 young adults as they read and responded to affective sentences, and subsequently developed a deep learning framework to predict their clinical status. The proposed model included separate branches for trials of positive and negative sentiment, and used 2D time-series representations to account for both intra-trial and inter-trial variations. We were able to identify depression and suicidal ideation with an area under the receiver operating characteristic curve (AUC) of 0.793 (95% CI: 0.765-0.819) against healthy controls, and suicidality specifically with 0.826 AUC (95% CI: 0.797-0.852). The model also exhibited moderate, yet significant, accuracy in differentiating depressed from suicidal participants, with 0.609 AUC (95% CI: 0.571-0.646). Discriminative patterns emerge more strongly when assessing the data relative to response generation than relative to the onset time of the final word of the sentences. The most pronounced effects were observed for negative-sentiment sentences, which are congruent with depressed and suicidal participants. Our findings highlight eye tracking as an objective tool for mental health assessment and underscore the modulatory impact of emotional stimuli on cognitive processes affecting oculomotor control.
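The abstract reports each AUC with a 95% confidence interval. As a rough illustration of how such an interval can be obtained, the sketch below computes AUC from pairwise score comparisons and a percentile bootstrap interval. The abstract does not state how the authors computed their intervals, so the bootstrap is an assumption here, and the scores are synthetic rather than the study's data.

```python
import random
from typing import List, Tuple


def auc(labels: List[int], scores: List[float]) -> float:
    """AUC as the chance that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(labels: List[int], scores: List[float],
                 n_boot: int = 1000, alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap interval for AUC, resampling cases with replacement."""
    rng = random.Random(seed)
    n = len(labels)
    stats: List[float] = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # skip resamples that contain only one class
        stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    low = stats[int((alpha / 2) * n_boot)]
    high = stats[int((1 - alpha / 2) * n_boot) - 1]
    return low, high


# Synthetic classifier scores for illustration only.
rng = random.Random(1)
labels = [0] * 60 + [1] * 60
scores = [rng.gauss(0.0, 1.0) for _ in range(60)] + [rng.gauss(1.0, 1.0) for _ in range(60)]

point = auc(labels, scores)
low, high = bootstrap_ci(labels, scores)
print(f"AUC {point:.3f} (95% CI: {low:.3f}-{high:.3f})")
```

The interval endpoints depend on the random seed and the number of resamples, so small differences between runs with different seeds are expected.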

Title: TesserAct: Learning 4D Embodied World Models

Authors: Haoyu Zhen, Qiao Sun, Hongxin Zhang, Junyan Li, Siyuan Zhou, Yilun Du, Chuang Gan
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2504.20995
Pdf URL: https://arxiv.org/pdf/2504.20995
Copy Paste: [[2504.20995]] TesserAct: Learning 4D Embodied World Models(https://arxiv.org/abs/2504.20995)
Keywords: generation
Abstract: This paper presents an effective approach for learning novel 4D embodied world models, which predict the dynamic evolution of 3D scenes over time in response to an embodied agent's actions, providing both spatial and temporal consistency. We propose to learn a 4D world model by training on RGB-DN (RGB, Depth, and Normal) videos. This not only surpasses traditional 2D models by incorporating detailed shape, configuration, and temporal changes into their predictions, but also allows us to effectively learn accurate inverse dynamic models for an embodied agent. Specifically, we first extend existing robotic manipulation video datasets with depth and normal information leveraging off-the-shelf models. Next, we fine-tune a video generation model on this annotated dataset, which jointly predicts RGB-DN (RGB, Depth, and Normal) for each frame. We then present an algorithm to directly convert generated RGB, Depth, and Normal videos into a high-quality 4D scene of the world. Our method ensures temporal and spatial coherence in 4D scene predictions from embodied scenarios, enables novel view synthesis for embodied environments, and facilitates policy learning that significantly outperforms those derived from prior video-based world models.

Title: X-Fusion: Introducing New Modality to Frozen Large Language Models

Authors: Sicheng Mo, Thao Nguyen, Xun Huang, Siddharth Srinivasan Iyer, Yijun Li, Yuchen Liu, Abhishek Tandon, Eli Shechtman, Krishna Kumar Singh, Yong Jae Lee, Bolei Zhou, Yuheng Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.20996
Pdf URL: https://arxiv.org/pdf/2504.20996
Copy Paste: [[2504.20996]] X-Fusion: Introducing New Modality to Frozen Large Language Models(https://arxiv.org/abs/2504.20996)
Keywords: generation
Abstract: We propose X-Fusion, a framework that extends pretrained Large Language Models (LLMs) for multimodal tasks while preserving their language capabilities. X-Fusion employs a dual-tower design with modality-specific weights, keeping the LLM's parameters frozen while integrating vision-specific information for both understanding and generation. Our experiments demonstrate that X-Fusion consistently outperforms alternative architectures on both image-to-text and text-to-image tasks. We find that incorporating understanding-focused data improves generation quality, reducing image data noise enhances overall performance, and feature alignment accelerates convergence for smaller models but has minimal impact on larger ones. Our findings provide valuable insights into building efficient unified multimodal models.

Title: Yo'Chameleon: Personalized Vision and Language Generation

Authors: Thao Nguyen, Krishna Kumar Singh, Jing Shi, Trung Bui, Yong Jae Lee, Yuheng Li
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2504.20998
Pdf URL: https://arxiv.org/pdf/2504.20998
Copy Paste: [[2504.20998]] Yo'Chameleon: Personalized Vision and Language Generation(https://arxiv.org/abs/2504.20998)
Keywords: generation
Abstract: Large Multimodal Models (e.g., GPT-4, Gemini, Chameleon) have evolved into powerful tools with millions of users. However, they remain generic models and lack personalized knowledge of specific user concepts. Previous work has explored personalization for text generation, yet it remains unclear how these methods can be adapted to new modalities, such as image generation. In this paper, we introduce Yo'Chameleon, the first attempt to study personalization for large multimodal models. Given 3-5 images of a particular concept, Yo'Chameleon leverages soft-prompt tuning to embed subject-specific information to (i) answer questions about the subject and (ii) recreate pixel-level details to produce images of the subject in new contexts. Yo'Chameleon is trained with (i) a self-prompting optimization mechanism to balance performance across multiple modalities, and (ii) a "soft-positive" image generation approach to enhance image quality in a few-shot setting.