2025-09-29

Title: Random Direct Preference Optimization for Radiography Report Generation

Authors: Valentin Samokhin, Boris Shirokikh, Mikhail Goncharov, Dmitriy Umerenkov, Maksim Bobrin, Ivan Oseledets, Dmitry Dylov, Mikhail Belyaev
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2509.21351
Pdf URL: https://arxiv.org/pdf/2509.21351
Copy Paste: [[2509.21351]] Random Direct Preference Optimization for Radiography Report Generation(https://arxiv.org/abs/2509.21351)
Keywords: generation
Abstract: Radiography Report Generation (RRG) has gained significant attention in medical image analysis as a promising tool for alleviating the growing workload of radiologists. However, despite numerous advancements, existing methods have yet to achieve the quality required for deployment in real-world clinical settings. Meanwhile, large Visual Language Models (VLMs) have demonstrated remarkable progress in the general domain by adopting training strategies originally designed for Large Language Models (LLMs), such as alignment techniques. In this paper, we introduce a model-agnostic framework to enhance RRG accuracy using Direct Preference Optimization (DPO). Our approach leverages random contrastive sampling to construct training pairs, eliminating the need for reward models or human preference annotations. Experiments on supplementing three state-of-the-art models with our Random DPO show that our method improves clinical performance metrics by up to 5%, without requiring any additional training data.
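The pair-construction idea in this abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the "rejected" report is simply a report drawn at random from a different study, and it uses the standard per-pair DPO loss. The function names, the `beta` value, and the sampling scheme are all assumptions made for illustration.

```python
import math
import random


def random_preference_pairs(reports, seed=0):
    """Pair each ground-truth report (chosen) with a randomly drawn report
    from a different study (rejected). Assumed sampling scheme; requires
    at least two reports."""
    rng = random.Random(seed)
    pairs = []
    for i, chosen in enumerate(reports):
        j = rng.choice([k for k in range(len(reports)) if k != i])
        pairs.append((chosen, reports[j]))
    return pairs


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one pair: -log(sigmoid(beta * margin)),
    where the margin compares policy and reference log-probabilities."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    # -log(sigmoid(z)) equals log(1 + exp(-z))
    return math.log1p(math.exp(-beta * margin))
```

No reward model or human label appears anywhere in the sketch: the only supervision is the existing image-report pairing, which matches the abstract's claim that no additional training data is needed.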

Title: Automated Prompt Generation for Creative and Counterfactual Text-to-image Synthesis

Authors: Aleksa Jelaca, Ying Jiao, Chang Tian, Marie-Francine Moens
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2509.21375
Pdf URL: https://arxiv.org/pdf/2509.21375
Copy Paste: [[2509.21375]] Automated Prompt Generation for Creative and Counterfactual Text-to-image Synthesis(https://arxiv.org/abs/2509.21375)
Keywords: generation
Abstract: Text-to-image generation has advanced rapidly with large-scale multimodal training, yet fine-grained controllability remains a critical challenge. Counterfactual controllability, defined as the capacity to deliberately generate images that contradict common-sense patterns, remains a major challenge but plays a crucial role in enabling creativity and exploratory applications. In this work, we address this gap with a focus on counterfactual size (e.g., generating a tiny walrus beside a giant button) and propose an automatic prompt engineering framework that adapts base prompts into revised prompts for counterfactual images. The framework comprises three components: an image evaluator that guides dataset construction by identifying successful image generations, a supervised prompt rewriter that produces revised prompts, and a DPO-trained ranker that selects the optimal revised prompt. We construct the first counterfactual size text-image dataset and enhance the image evaluator by extending Grounded SAM with refinements, achieving a 114 percent improvement over its backbone. Experiments demonstrate that our method outperforms state-of-the-art baselines and ChatGPT-4o, establishing a foundation for future research on counterfactual controllability.

Title: In silico Deep Learning Protocols for Label-Free Super-Resolution Microscopy: A Comparative Study of Network Architectures and SNR Dependence

Authors: Shiraz S Kaderuppan, Jonathan Mar, Andrew Irvine, Anurag Sharma, Muhammad Ramadan Saifuddin, Wai Leong Eugene Wong, Wai Lok Woo
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2509.21376
Pdf URL: https://arxiv.org/pdf/2509.21376
Copy Paste: [[2509.21376]] In silico Deep Learning Protocols for Label-Free Super-Resolution Microscopy: A Comparative Study of Network Architectures and SNR Dependence(https://arxiv.org/abs/2509.21376)
Keywords: super-resolution
Abstract: The field of optical microscopy spans across numerous industries and research domains, ranging from education to healthcare, quality inspection and analysis. Nonetheless, a key limitation often cited by optical microscopists refers to the limit of its lateral resolution (typically defined as ~200nm), with potential circumventions involving either costly external modules (e.g. confocal scan heads, etc) and/or specialized techniques [e.g. super-resolution (SR) fluorescent microscopy]. Addressing these challenges in a normal (non-specialist) context thus remains an aspect outside the scope of most microscope users & facilities. This study thus seeks to evaluate an alternative & economical approach to achieving SR optical microscopy, involving non-fluorescent phase-modulated microscopical modalities such as Zernike phase contrast (PCM) and differential interference contrast (DIC) microscopy. Two in silico deep neural network (DNN) architectures which we developed previously (termed O-Net and Theta-Net) are assessed on their abilities to resolve a custom-fabricated test target containing nanoscale features calibrated via atomic force microscopy (AFM). The results of our study demonstrate that although both O-Net and Theta-Net seemingly performed well when super-resolving these images, they were complementary (rather than competing) approaches to be considered for image SR, particularly under different image signal-to-noise ratios (SNRs). High image SNRs favoured the application of O-Net models, while low SNRs inclined preferentially towards Theta-Net models. These findings demonstrate the importance of model architectures (in conjunction with the source image SNR) on model performance and the SR quality of the generated images where DNN models are utilized for non-fluorescent optical nanoscopy, even where the same training dataset & number of epochs are being used.

Title: ShipwreckFinder: A QGIS Tool for Shipwreck Detection in Multibeam Sonar Data

Authors: Anja Sheppard, Tyler Smithline, Andrew Scheffer, David Smith, Advaith V. Sethuraman, Ryan Bird, Sabrina Lin, Katherine A. Skinner
Subjects: cs.CV, cs.RO, eess.IV
Abstract URL: https://arxiv.org/abs/2509.21386
Pdf URL: https://arxiv.org/pdf/2509.21386
Copy Paste: [[2509.21386]] ShipwreckFinder: A QGIS Tool for Shipwreck Detection in Multibeam Sonar Data(https://arxiv.org/abs/2509.21386)
Keywords: generation
Abstract: In this paper, we introduce ShipwreckFinder, an open-source QGIS plugin that detects shipwrecks from multibeam sonar data. Shipwrecks are an important historical marker of maritime history, and can be discovered through manual inspection of bathymetric data. However, this is a time-consuming process and often requires expert analysis. Our proposed tool allows users to automatically preprocess bathymetry data, perform deep learning inference, threshold model outputs, and produce either pixel-wise segmentation masks or bounding boxes of predicted shipwrecks. The backbone of this open-source tool is a deep learning model, which is trained on a variety of shipwreck data from the Great Lakes and the coasts of Ireland. Additionally, we employ synthetic data generation in order to increase the size and diversity of our dataset. We demonstrate superior segmentation performance with our open-source tool and training pipeline as compared to a deep learning-based ArcGIS toolkit and a more classical inverse sinkhole detection method. The open-source tool can be found at this https URL.

Title: Large AI Model-Enabled Generative Semantic Communications for Image Transmission

Authors: Qiyu Ma, Wanli Ni, Zhijin Qin
Subjects: cs.CV, cs.AI, cs.IT
Abstract URL: https://arxiv.org/abs/2509.21394
Pdf URL: https://arxiv.org/pdf/2509.21394
Copy Paste: [[2509.21394]] Large AI Model-Enabled Generative Semantic Communications for Image Transmission(https://arxiv.org/abs/2509.21394)
Keywords: generative
Abstract: The rapid development of generative artificial intelligence (AI) has introduced significant opportunities for enhancing the efficiency and accuracy of image transmission within semantic communication systems. Despite these advancements, existing methodologies often neglect the difference in importance of different regions of the image, potentially compromising the reconstruction quality of visually critical content. To address this issue, we introduce an innovative generative semantic communication system that refines semantic granularity by segmenting images into key and non-key regions. Key regions, which contain essential visual information, are processed using an image oriented semantic encoder, while non-key regions are efficiently compressed through an image-to-text modeling approach. Additionally, to mitigate the substantial storage and computational demands posed by large AI models, the proposed system employs a lightweight deployment strategy incorporating model quantization and low-rank adaptation fine-tuning techniques, significantly boosting resource utilization without sacrificing performance. Simulation results demonstrate that the proposed system outperforms traditional methods in terms of both semantic fidelity and visual quality, thereby affirming its effectiveness for image transmission tasks.

Title: Downscaling climate projections to 1 km with single-image super resolution

Authors: Petr Košťál, Pavel Kordík, Ondřej Podsztavek
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2509.21399
Pdf URL: https://arxiv.org/pdf/2509.21399
Copy Paste: [[2509.21399]] Downscaling climate projections to 1 km with single-image super resolution(https://arxiv.org/abs/2509.21399)
Keywords: super-resolution
Abstract: High-resolution climate projections are essential for local decision-making. However, available climate projections have low spatial resolution (e.g. 12.5 km), which limits their usability. We address this limitation by leveraging single-image super-resolution models to statistically downscale climate projections to 1-km resolution. Since high-resolution climate projections are unavailable for training, we train models on a high-resolution observational gridded data set and apply them to low-resolution climate projections. We propose a climate indicator-based assessment using observed climate indices computed at weather station locations to evaluate the downscaled climate projections without ground-truth high-resolution climate projections. Experiments on daily mean temperature demonstrate that single-image super-resolution models can downscale climate projections without increasing the error of climate indicators compared to low-resolution climate projections.

Title: JaiLIP: Jailbreaking Vision-Language Models via Loss Guided Image Perturbation

Authors: Md Jueal Mia, M. Hadi Amini
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.21401
Pdf URL: https://arxiv.org/pdf/2509.21401
Copy Paste: [[2509.21401]] JaiLIP: Jailbreaking Vision-Language Models via Loss Guided Image Perturbation(https://arxiv.org/abs/2509.21401)
Keywords: generation
Abstract: Vision-Language Models (VLMs) have remarkable abilities in multimodal generation and reasoning tasks. However, potential misuse or safety alignment concerns of VLMs have increased significantly due to different categories of attack vectors. Among various attack vectors, recent studies have demonstrated that image-based perturbations are particularly effective in generating harmful outputs. In the literature, many techniques have been proposed to jailbreak VLMs, but they lead to unstable performance and visible perturbations. In this study, we propose Jailbreaking with Loss-guided Image Perturbation (JaiLIP), a jailbreaking attack in the image space that minimizes a joint objective combining the mean squared error (MSE) loss between clean and adversarial images with the model's harmful-output loss. We evaluate our proposed method on VLMs using standard toxicity metrics from Perspective API and Detoxify. Experimental results demonstrate that our method generates highly effective and imperceptible adversarial images, outperforming existing methods in producing toxicity. Moreover, we have evaluated our method in the transportation domain to demonstrate the attack's practicality beyond toxic text generation in a specific domain. Our findings emphasize the practical challenges of image-based jailbreak attacks and the need for efficient defense mechanisms for VLMs.

Title: QuadGPT: Native Quadrilateral Mesh Generation with Autoregressive Models

Authors: Jian Liu, Chunshi Wang, Song Guo, Haohan Weng, Zhen Zhou, Zhiqi Li, Jiaao Yu, Yiling Zhu, Jing Xu, Biwen Lei, Zhuo Chen, Chunchao Guo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.21420
Pdf URL: https://arxiv.org/pdf/2509.21420
Copy Paste: [[2509.21420]] QuadGPT: Native Quadrilateral Mesh Generation with Autoregressive Models(https://arxiv.org/abs/2509.21420)
Keywords: generation, generative
Abstract: The generation of quadrilateral-dominant meshes is a cornerstone of professional 3D content creation. However, existing generative models generate quad meshes by first generating triangle meshes and then merging triangles into quadrilaterals with some specific rules, which typically produces quad meshes with poor topology. In this paper, we introduce QuadGPT, the first autoregressive framework for generating quadrilateral meshes in an end-to-end manner. QuadGPT formulates this as a sequence prediction paradigm, distinguished by two key innovations: a unified tokenization method to handle mixed topologies of triangles and quadrilaterals, and a specialized Reinforcement Learning fine-tuning method tDPO for better generation quality. Extensive experiments demonstrate that QuadGPT significantly surpasses previous triangle-to-quad conversion pipelines in both geometric accuracy and topological quality. Our work establishes a new benchmark for native quad-mesh generation and showcases the power of combining large-scale autoregressive models with topology-aware RL refinement for creating structured 3D assets.

Title: DyME: Dynamic Multi-Concept Erasure in Diffusion Models with Bi-Level Orthogonal LoRA Adaptation

Authors: Jiaqi Liu, Lan Zhang, Xiaoyong Yuan
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.21433
Pdf URL: https://arxiv.org/pdf/2509.21433
Copy Paste: [[2509.21433]] DyME: Dynamic Multi-Concept Erasure in Diffusion Models with Bi-Level Orthogonal LoRA Adaptation(https://arxiv.org/abs/2509.21433)
Keywords: generation
Abstract: Text-to-image diffusion models (DMs) inadvertently reproduce copyrighted styles and protected visual concepts, raising legal and ethical concerns. Concept erasure has emerged as a safeguard, aiming to selectively suppress such concepts through fine-tuning. However, existing methods do not scale to practical settings where providers must erase multiple and possibly conflicting concepts. The core bottleneck is their reliance on static erasure: a single checkpoint is fine-tuned to remove all target concepts, regardless of the actual erasure needs at inference. This rigid design mismatches real-world usage, where requests vary per generation, leading to degraded erasure success and reduced fidelity for non-target content. We propose DyME, an on-demand erasure framework that trains lightweight, concept-specific LoRA adapters and dynamically composes only those needed at inference. This modular design enables flexible multi-concept erasure, but naive composition causes interference among adapters, especially when many or semantically related concepts are suppressed. To overcome this, we introduce bi-level orthogonality constraints at both the feature and parameter levels, disentangling representation shifts and enforcing orthogonal adapter subspaces. We further develop ErasureBench-H, a new hierarchical benchmark with brand-series-character structure, enabling principled evaluation across semantic granularities and erasure set sizes. Experiments on ErasureBench-H and standard datasets (e.g., CIFAR-100, Imagenette) demonstrate that DyME consistently outperforms state-of-the-art baselines, achieving higher multi-concept erasure fidelity with minimal collateral degradation.
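One way to picture the parameter-level part of the orthogonality constraint is as a penalty on overlap between the row spaces of different adapters' down-projection matrices. The sketch below is only an illustration under that assumption; the abstract does not give the loss, and the names `orthogonality_penalty` and `compose`, the choice of matrix, and the squared-Frobenius form are hypothetical.

```python
def frob_sq_of_product(a, b):
    """Squared Frobenius norm of a @ b^T for row-major matrices
    a (r1 x d) and b (r2 x d), given as lists of rows."""
    total = 0.0
    for row_a in a:
        for row_b in b:
            dot = sum(x * y for x, y in zip(row_a, row_b))
            total += dot * dot
    return total


def orthogonality_penalty(adapters):
    """Sum of pairwise overlaps; zero exactly when every pair of adapters
    spans mutually orthogonal row spaces."""
    penalty = 0.0
    for i in range(len(adapters)):
        for j in range(i + 1, len(adapters)):
            penalty += frob_sq_of_product(adapters[i], adapters[j])
    return penalty


def compose(adapters_by_concept, requested):
    """On-demand composition: select only the adapters for the concepts
    that must be erased for this particular generation request."""
    return [adapters_by_concept[concept] for concept in requested]
```

A penalty of this kind would be added to each adapter's training loss, so that adapters composed at inference interfere with each other as little as possible.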

Title: Score-based Idempotent Distillation of Diffusion Models

Authors: Shehtab Zaman, Chengyan Liu, Kenneth Chiu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2509.21470
Pdf URL: https://arxiv.org/pdf/2509.21470
Copy Paste: [[2509.21470]] Score-based Idempotent Distillation of Diffusion Models(https://arxiv.org/abs/2509.21470)
Keywords: generation, generative
Abstract: Idempotent generative networks (IGNs) are a new line of generative models based on idempotent mapping to a target manifold. IGNs support both single- and multi-step generation, allowing for a flexible trade-off between computational cost and sample quality. But similar to Generative Adversarial Networks (GANs), conventional IGNs require adversarial training and are prone to training instabilities and mode collapse. Diffusion and score-based models are popular approaches to generative modeling that iteratively transport samples from one distribution, usually a Gaussian, to a target data distribution. These models have gained popularity due to their stable training dynamics and high-fidelity generation quality. However, this stability and quality come at a high computational cost, as the data must be transported incrementally along the entire trajectory. New sampling methods, model distillation, and consistency models have been developed to reduce the sampling cost and even perform one-shot sampling from diffusion models. In this work, we unite diffusion and IGNs by distilling idempotent models from diffusion model scores, an approach called SIGN. Our proposed method is highly stable and does not require adversarial losses. We provide a theoretical analysis of our proposed score-based training methods and empirically show that IGNs can be effectively distilled from a pre-trained diffusion model, enabling faster inference than iterative score-based models. SIGNs can perform multi-step sampling, allowing users to trade off quality for efficiency. These models operate directly on the source domain; they can project corrupted or alternate distributions back onto the target manifold, enabling zero-shot editing of inputs. We validate our models on multiple image datasets, achieving state-of-the-art results for idempotent models on the CIFAR and CelebA datasets.

Title: Are Hallucinations Bad Estimations?

Authors: Hude Liu, Jerry Yao-Chieh Hu, Jennifer Yuntong Zhang, Zhao Song, Han Liu
Subjects: cs.LG, cs.AI, cs.CL, cs.CV, stat.ML
Abstract URL: https://arxiv.org/abs/2509.21473
Pdf URL: https://arxiv.org/pdf/2509.21473
Copy Paste: [[2509.21473]] Are Hallucinations Bad Estimations?(https://arxiv.org/abs/2509.21473)
Keywords: generative
Abstract: We formalize hallucinations in generative models as failures to link an estimate to any plausible cause. Under this interpretation, we show that even loss-minimizing optimal estimators still hallucinate. We confirm this with a general high-probability lower bound on the hallucination rate for generic data distributions. This reframes hallucination as structural misalignment between loss minimization and human-acceptable outputs, and hence estimation errors induced by miscalibration. Experiments on coin aggregation, open-ended QA, and text-to-image support our theory.

Title: d2: Improved Techniques for Training Reasoning Diffusion Language Models

Authors: Guanghan Wang, Yair Schiff, Gilad Turok, Volodymyr Kuleshov
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.21474
Pdf URL: https://arxiv.org/pdf/2509.21474
Copy Paste: [[2509.21474]] d2: Improved Techniques for Training Reasoning Diffusion Language Models(https://arxiv.org/abs/2509.21474)
Keywords: generation
Abstract: While diffusion language models (DLMs) have achieved competitive performance in text generation, improving their reasoning ability with reinforcement learning remains an active research area. Here, we introduce d2, a reasoning framework tailored for masked DLMs. Central to our framework is a new policy gradient algorithm that relies on properties of masking to accurately estimate the likelihoods of sampling trajectories. Our estimators trade off computation for approximation accuracy in an analytically tractable manner, and are particularly effective for DLMs that support any-order likelihood estimation. We characterize and study this property in popular DLMs and show that it is key for efficient diffusion-based reasoning. Empirically, d2 significantly improves over previous diffusion reasoning frameworks using only RL (without relying on supervised fine-tuning), and sets a new state-of-the-art performance for DLMs on logical reasoning tasks (Countdown and Sudoku) and math reasoning benchmarks (GSM8K and MATH500).

Title: Filtering with Confidence: When Data Augmentation Meets Conformal Prediction

Authors: Zixuan Wu, So Won Jeong, Yating Liu, Yeo Jin Jung, Claire Donnat
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.21479
Pdf URL: https://arxiv.org/pdf/2509.21479
Copy Paste: [[2509.21479]] Filtering with Confidence: When Data Augmentation Meets Conformal Prediction(https://arxiv.org/abs/2509.21479)
Keywords: generation
Abstract: With promising empirical performance across a wide range of applications, synthetic data augmentation appears a viable solution to data scarcity and the demands of increasingly data-intensive models. Its effectiveness lies in expanding the training set in a way that reduces estimator variance while introducing only minimal bias. Controlling this bias is therefore critical: effective data augmentation should generate diverse samples from the same underlying distribution as the training set, with minimal shifts. In this paper, we propose conformal data augmentation, a principled data filtering framework that leverages the power of conformal prediction to produce diverse synthetic data while filtering out poor-quality generations with provable risk control. Our method is simple to implement, requires no access to internal model logits, nor large-scale model retraining. We demonstrate the effectiveness of our approach across multiple tasks, including topic prediction, sentiment analysis, image classification, and fraud detection, showing consistent performance improvements of up to 40% in F1 score over unaugmented baselines, and 4% over other filtered augmentation baselines.
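The filtering step can be illustrated with a generic split-conformal threshold. This is a sketch under stated assumptions, not the paper's procedure: it assumes a user-supplied nonconformity score (higher means less like the real data), a held-out calibration set of real samples, and the usual finite-sample quantile. The names `conformal_threshold` and `filter_synthetic` are hypothetical.

```python
import math


def conformal_threshold(calibration_scores, alpha=0.1):
    """Finite-sample (1 - alpha) conformal quantile of the calibration
    nonconformity scores: the ceil((n + 1) * (1 - alpha))-th smallest."""
    n = len(calibration_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: accept everything.
        return math.inf
    return sorted(calibration_scores)[k - 1]


def filter_synthetic(samples, score_fn, threshold):
    """Keep only generated samples whose nonconformity score does not
    exceed the calibrated threshold."""
    return [s for s in samples if score_fn(s) <= threshold]
```

Because the threshold depends only on scores, the filter needs no access to the generator's internal logits and no retraining, which is consistent with the properties the abstract lists.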

Title: GraphPFN: A Prior-Data Fitted Graph Foundation Model

Authors: Dmitry Eremeev, Oleg Platonov, Gleb Bazhenov, Artem Babenko, Liudmila Prokhorenkova
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.21489
Pdf URL: https://arxiv.org/pdf/2509.21489
Copy Paste: [[2509.21489]] GraphPFN: A Prior-Data Fitted Graph Foundation Model(https://arxiv.org/abs/2509.21489)
Keywords: generation
Abstract: Foundation models pretrained on large-scale datasets have transformed such fields as natural language processing and computer vision, but their application to graph data remains limited. Recently emerged graph foundation models, such as G2T-FM, utilize tabular foundation models for graph tasks and were shown to significantly outperform prior attempts to create GFMs. However, these models primarily rely on hand-crafted graph features, limiting their ability to learn complex graph-specific patterns. In this work, we propose GraphPFN: a prior-data fitted network for node-level prediction. First, we design a prior distribution of synthetic attributed graphs. For graph structure generation, we use a novel combination of multiple stochastic block models and a preferential attachment process. We then apply graph-aware structured causal models to generate node attributes and targets. This procedure allows us to efficiently generate a wide range of realistic graph datasets. Then, we augment the tabular foundation model LimiX with attention-based graph neighborhood aggregation layers and train it on synthetic graphs sampled from our prior, allowing the model to capture graph structural dependencies not present in tabular data. On diverse real-world graph datasets with up to 50,000 nodes, GraphPFN shows strong in-context learning performance and achieves state-of-the-art results after finetuning, outperforming both G2T-FM and task-specific GNNs trained from scratch on most datasets. More broadly, our work demonstrates that pretraining on synthetic graphs from a well-designed prior distribution is an effective strategy for building graph foundation models.
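The two structure-generation ingredients named in this abstract can be sketched independently of the paper. The code below is a minimal illustration that assumes one simple way of combining them (sample a stochastic block model, then grow it with preferentially attached nodes); the actual prior, its parameters, and how multiple block models are mixed are not specified in the abstract, so every name and default here is hypothetical.

```python
import random


def sample_sbm(block_sizes, p_in, p_out, rng):
    """Stochastic block model: connect nodes in the same block with
    probability p_in and nodes in different blocks with probability p_out."""
    labels = [b for b, size in enumerate(block_sizes) for _ in range(size)]
    n = len(labels)
    edges = set()
    for u in range(n):
        for v in range(u + 1, n):
            p = p_in if labels[u] == labels[v] else p_out
            if rng.random() < p:
                edges.add((u, v))
    return labels, edges


def add_preferential_nodes(n_existing, edges, n_new, m, rng):
    """Grow the graph: each new node links to m distinct existing nodes,
    chosen with probability proportional to (degree + 1)."""
    degree = [1] * n_existing  # +1 smoothing so isolated nodes stay reachable
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    grown = set(edges)
    for new in range(n_existing, n_existing + n_new):
        targets = set()
        while len(targets) < min(m, len(degree)):
            targets.add(rng.choices(range(len(degree)), weights=degree)[0])
        for t in targets:
            grown.add((t, new))
            degree[t] += 1
        degree.append(1 + len(targets))
    return grown
```

The block model supplies community structure and the attachment step supplies a heavy-tailed degree distribution; node attributes and targets would then be generated on top of the sampled structure.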

Title: SlimDiff: Training-Free, Activation-Guided Hands-free Slimming of Diffusion Models

Authors: Arani Roy, Shristi Das Biswas, Kaushik Roy
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2509.21498
Pdf URL: https://arxiv.org/pdf/2509.21498
Copy Paste: [[2509.21498]] SlimDiff: Training-Free, Activation-Guided Hands-free Slimming of Diffusion Models(https://arxiv.org/abs/2509.21498)
Keywords: generation, generative
Abstract: Diffusion models (DMs), lauded for their generative performance, are computationally prohibitive due to their billion-scale parameters and iterative denoising dynamics. Existing efficiency techniques, such as quantization, timestep reduction, or pruning, offer savings in compute, memory, or runtime but are strictly bottlenecked by reliance on fine-tuning or retraining to recover performance. In this work, we introduce SlimDiff, an automated activation-informed structural compression framework that reduces both attention and feedforward dimensionalities in DMs, while being entirely gradient-free. SlimDiff reframes DM compression as a spectral approximation task, where activation covariances across denoising timesteps define low-rank subspaces that guide dynamic pruning under a fixed compression budget. This activation-aware formulation mitigates error accumulation across timesteps by applying module-wise decompositions over functional weight groups: query-key interactions, value-output couplings, and feedforward projections, rather than isolated matrix factorizations, while adaptively allocating sparsity across modules to respect the non-uniform geometry of diffusion trajectories. SlimDiff achieves up to 35% acceleration and ~100M parameter reduction over baselines, with generation quality on par with uncompressed models without any backpropagation. Crucially, our approach requires only about 500 calibration samples, over 70× fewer than prior methods. To our knowledge, this is the first closed-form, activation-guided structural compression of DMs that is entirely training-free, providing both theoretical clarity and practical efficiency.

Title: Contrastive Mutual Information Learning: Toward Robust Representations without Positive-Pair Augmentations

Authors: Micha Livne
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2509.21511
Pdf URL: https://arxiv.org/pdf/2509.21511
Copy Paste: [[2509.21511]] Contrastive Mutual Information Learning: Toward Robust Representations without Positive-Pair Augmentations(https://arxiv.org/abs/2509.21511)
Keywords: generative
Abstract: Learning representations that transfer well to diverse downstream tasks remains a central challenge in representation learning. Existing paradigms (contrastive learning, self-supervised masking, and denoising auto-encoders) balance this challenge with different trade-offs. We introduce the contrastive Mutual Information Machine (cMIM), a probabilistic framework that extends the Mutual Information Machine (MIM) with a contrastive objective. While MIM maximizes mutual information between inputs and latents and promotes clustering of codes, it falls short on discriminative tasks. cMIM addresses this gap by imposing global discriminative structure while retaining MIM's generative fidelity. Our contributions are threefold. First, we propose cMIM, a contrastive extension of MIM that removes the need for positive data augmentation and is substantially less sensitive to batch size than InfoNCE. Second, we introduce informative embeddings, a general technique for extracting enriched features from encoder-decoder models that boosts discriminative performance without additional training and applies broadly beyond MIM. Third, we provide empirical evidence across vision and molecular benchmarks showing that cMIM outperforms MIM and InfoNCE on classification and regression tasks while preserving competitive reconstruction quality. These results position cMIM as a unified framework for representation learning, advancing the goal of models that serve both discriminative and generative applications effectively.
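For context, the InfoNCE baseline that the abstract compares against can be sketched as follows. This is the standard in-batch formulation written in NumPy, not the cMIM objective; the function name and the temperature default are illustrative assumptions. Because every other item in the batch acts as a negative, the loss depends directly on batch size.

```python
import numpy as np

def info_nce(z_a, z_b, temperature=0.1):
    """In-batch InfoNCE loss: row i of z_a is the positive for row i of z_b,
    and all other rows of z_b serve as negatives."""
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature                 # (batch x batch) similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))                 # cross-entropy on the diagonal
```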

Title: DistillKac: Few-Step Image Generation via Damped Wave Equations

Authors: Weiqiao Han, Chenlin Meng, Christopher D. Manning, Stefano Ermon
Subjects: cs.LG, cs.AI, cs.CV, math.PR, stat.ML
Abstract URL: https://arxiv.org/abs/2509.21513
Pdf URL: https://arxiv.org/pdf/2509.21513
Copy Paste: [[2509.21513]] DistillKac: Few-Step Image Generation via Damped Wave Equations(https://arxiv.org/abs/2509.21513)
Keywords: generation
Abstract: We present DistillKac, a fast image generator that uses the damped wave equation and its stochastic Kac representation to move probability mass at finite speed. In contrast to diffusion models, whose reverse-time velocities can become stiff and implicitly allow unbounded propagation speed, Kac dynamics enforce finite-speed transport and yield globally bounded kinetic energy. Building on this structure, we introduce classifier-free guidance in velocity space that preserves square integrability under mild conditions. We then propose endpoint-only distillation that trains a student to match a frozen teacher over long intervals. We prove a stability result that promotes supervision at the endpoints to closeness along the entire path. Experiments demonstrate that DistillKac delivers high-quality samples with very few function evaluations while retaining the numerical stability benefits of finite-speed probability flows.
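The finite-speed property mentioned above can be seen in the classical one-dimensional Kac process: a particle moves at constant speed and reverses direction at random times, and its density follows a damped wave (telegrapher's) equation. The simulation below is a toy sketch of that textbook process, not the paper's generator; the function name, parameter defaults, and time-stepping scheme are assumptions.

```python
import numpy as np

def simulate_kac_walk(n_particles, t_end, speed=1.0, flip_rate=2.0, dt=1e-3, seed=0):
    """Simulate a 1D persistent random walk: each particle moves at +/- speed
    and flips direction with probability flip_rate * dt per step."""
    rng = np.random.default_rng(seed)
    x = np.zeros(n_particles)
    direction = rng.choice([-1.0, 1.0], size=n_particles)
    n_steps = int(round(t_end / dt))
    for _ in range(n_steps):
        x += speed * direction * dt
        flips = rng.random(n_particles) < flip_rate * dt
        direction[flips] *= -1.0
    return x
```

Every particle stays within distance speed * t_end of its start, whereas Brownian motion has no such bound.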

Title: Preemptive Detection and Steering of LLM Misalignment via Latent Reachability

Authors: Sathwik Karnik, Somil Bansal
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2509.21528
Pdf URL: https://arxiv.org/pdf/2509.21528
Copy Paste: [[2509.21528]] Preemptive Detection and Steering of LLM Misalignment via Latent Reachability(https://arxiv.org/abs/2509.21528)
Keywords: generation
Abstract: Large language models (LLMs) are now ubiquitous in everyday tools, raising urgent safety concerns about their tendency to generate harmful content. The dominant safety approach, reinforcement learning from human feedback (RLHF), effectively shapes model behavior during training but offers no safeguards at inference time, where unsafe continuations may still arise. We propose BRT-Align, a reachability-based framework that brings control-theoretic safety tools to LLM inference. BRT-Align models autoregressive generation as a dynamical system in latent space and learns a safety value function via backward reachability, estimating the worst-case evolution of a trajectory. This enables two complementary mechanisms: (1) a runtime monitor that forecasts unsafe completions several tokens in advance, and (2) a least-restrictive steering filter that minimally perturbs latent states to redirect generation away from unsafe regions. Experiments across multiple LLMs and toxicity benchmarks demonstrate that BRT-Align provides more accurate and earlier detection of unsafe continuations than baselines. Moreover, for LLM safety alignment, BRT-Align substantially reduces unsafe generations while preserving sentence diversity and coherence. Qualitative results further highlight emergent alignment properties: BRT-Align consistently produces responses that are less violent, less profane, less offensive, and less politically biased. Together, these findings demonstrate that reachability analysis provides a principled and practical foundation for inference-time LLM safety.
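A safety value function of the kind described above can be illustrated on a tiny deterministic system. The sketch below computes, for each state, the minimum safety margin encountered along the trajectory that starts there. It is a toy discrete analogue under assumed names (safety_value, margin, next_state), not BRT-Align's learned latent-space value function.

```python
import numpy as np

def safety_value(margin, next_state, n_iters=None):
    """Fixed-point iteration V(s) = min(margin(s), V(next_state[s])).
    margin[s] < 0 marks an unsafe state; V(s) < 0 means the trajectory
    starting at s eventually reaches an unsafe state."""
    margin = np.asarray(margin, dtype=float)
    next_state = np.asarray(next_state)
    n_iters = len(margin) if n_iters is None else n_iters
    value = margin.copy()
    for _ in range(n_iters):
        value = np.minimum(margin, value[next_state])
    return value
```

A runtime monitor in this toy setting would flag any state whose value is negative, even when its own margin is still positive, which is the sense in which unsafe outcomes are forecast ahead of time.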

Title: Expert-guided Clinical Text Augmentation via Query-Based Model Collaboration

Authors: Dongkyu Cho, Miao Zhang, Rumi Chunara
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.21530
Pdf URL: https://arxiv.org/pdf/2509.21530
Copy Paste: [[2509.21530]] Expert-guided Clinical Text Augmentation via Query-Based Model Collaboration(https://arxiv.org/abs/2509.21530)
Keywords: generative
Abstract: Data augmentation is a widely used strategy to improve model robustness and generalization by enriching training datasets with synthetic examples. While large language models (LLMs) have demonstrated strong generative capabilities for this purpose, their applications in high-stakes domains like healthcare present unique challenges due to the risk of generating clinically incorrect or misleading information. In this work, we propose a novel query-based model collaboration framework that integrates expert-level domain knowledge to guide the augmentation process to preserve critical medical information. Experiments on clinical prediction tasks demonstrate that our lightweight collaboration-based approach consistently outperforms existing LLM augmentation methods while improving safety through reduced factual errors. This framework addresses the gap between LLM augmentation potential and the safety requirements of specialized domains.

Title: No Alignment Needed for Generation: Learning Linearly Separable Representations in Diffusion Models

Authors: Junno Yun, Yaşar Utku Alçalar, Mehmet Akçakaya
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.21565
Pdf URL: https://arxiv.org/pdf/2509.21565
Copy Paste: [[2509.21565]] No Alignment Needed for Generation: Learning Linearly Separable Representations in Diffusion Models(https://arxiv.org/abs/2509.21565)
Keywords: generation
Abstract: Efficient training strategies for large-scale diffusion models have recently emphasized the importance of improving discriminative feature representations in these models. A central line of work in this direction is representation alignment with features obtained from powerful external encoders, which improves the representation quality as assessed through linear probing. Alignment-based approaches show promise but depend on large pretrained encoders, which are computationally expensive to obtain. In this work, we propose an alternative regularization for training, based on promoting the Linear SEParability (LSEP) of intermediate layer representations. LSEP eliminates the need for an auxiliary encoder and representation alignment, while incorporating linear probing directly into the network's learning dynamics rather than treating it as a simple post-hoc evaluation tool. Our results demonstrate substantial improvements in both training efficiency and generation quality on flow-based transformer architectures such as SiTs, achieving an FID of 1.46 on the 256×256 ImageNet dataset.

Title: X-Streamer: Unified Human World Modeling with Audiovisual Interaction

Authors: You Xie, Tianpei Gu, Zenan Li, Chenxu Zhang, Guoxian Song, Xiaochen Zhao, Chao Liang, Jianwen Jiang, Hongyi Xu, Linjie Luo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.21574
Pdf URL: https://arxiv.org/pdf/2509.21574
Copy Paste: [[2509.21574]] X-Streamer: Unified Human World Modeling with Audiovisual Interaction(https://arxiv.org/abs/2509.21574)
Keywords: generation
Abstract: We introduce X-Streamer, an end-to-end multimodal human world modeling framework for building digital human agents capable of infinite interactions across text, speech, and video within a single unified architecture. Starting from a single portrait, X-Streamer enables real-time, open-ended video calls driven by streaming multimodal inputs. At its core is a Thinker-Actor dual-transformer architecture that unifies multimodal understanding and generation, turning a static portrait into persistent and intelligent audiovisual interactions. The Thinker module perceives and reasons over streaming user inputs, while its hidden states are translated by the Actor into synchronized multimodal streams in real time. Concretely, the Thinker leverages a pretrained large language-speech model, while the Actor employs a chunk-wise autoregressive diffusion model that cross-attends to the Thinker's hidden states to produce time-aligned multimodal responses with interleaved discrete text and audio tokens and continuous video latents. To ensure long-horizon stability, we design inter- and intra-chunk attentions with time-aligned multimodal positional embeddings for fine-grained cross-modality alignment and context retention, further reinforced by chunk-wise diffusion forcing and global identity referencing. X-Streamer runs in real time on two A100 GPUs, sustaining hours-long consistent video chat experiences from arbitrary portraits and paving the way toward unified world modeling of interactive digital humans.

Title: What Happens Next? Anticipating Future Motion by Generating Point Trajectories

Authors: Gabrijel Boduljak, Laurynas Karazija, Iro Laina, Christian Rupprecht, Andrea Vedaldi
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.21592
Pdf URL: https://arxiv.org/pdf/2509.21592
Copy Paste: [[2509.21592]] What Happens Next? Anticipating Future Motion by Generating Point Trajectories(https://arxiv.org/abs/2509.21592)
Keywords: generation
Abstract: We consider the problem of forecasting motion from a single image, i.e., predicting how objects in the world are likely to move, without the ability to observe other parameters such as the object velocities or the forces applied to them. We formulate this task as conditional generation of dense trajectory grids with a model that closely follows the architecture of modern video generators but outputs motion trajectories instead of pixels. This approach captures scene-wide dynamics and uncertainty, yielding more accurate and diverse predictions than prior regressors and generators. We extensively evaluate our method on simulated data, demonstrate its effectiveness on downstream applications such as robotics, and show promising accuracy on real-world intuitive physics datasets. Although recent state-of-the-art video generators are often regarded as world models, we show that they struggle with forecasting motion from a single image, even in simple physical scenarios such as falling blocks or mechanical object interactions, despite fine-tuning on such data. We show that this limitation arises from the overhead of generating pixels rather than directly modeling motion.

Title: GenUQ: Predictive Uncertainty Estimates via Generative Hyper-Networks

Authors: Tian Yu Yen, Reese E. Jones, Ravi G. Patel
Subjects: cs.LG, math.NA, stat.ML
Abstract URL: https://arxiv.org/abs/2509.21605
Pdf URL: https://arxiv.org/pdf/2509.21605
Copy Paste: [[2509.21605]] GenUQ: Predictive Uncertainty Estimates via Generative Hyper-Networks(https://arxiv.org/abs/2509.21605)
Keywords: generative
Abstract: Operator learning is a recently developed generalization of regression to mappings between functions. It promises to drastically reduce expensive numerical integration of PDEs to fast evaluations of mappings between functional states of a system, i.e., surrogate and reduced-order modeling. Operator learning has already found applications in several areas such as modeling sea ice, combustion, and atmospheric physics. Recent approaches towards integrating uncertainty quantification into the operator models have relied on likelihood-based methods to infer parameter distributions from noisy data. However, stochastic operators may yield actions from which a likelihood is difficult or impossible to construct. In this paper, we introduce GenUQ, a measure-theoretic approach to UQ that avoids constructing a likelihood by introducing a generative hyper-network model that produces parameter distributions consistent with observed data. We demonstrate that GenUQ outperforms other UQ methods in three example problems: recovering a manufactured operator, learning the solution operator to a stochastic elliptic PDE, and modeling the failure location of porous steel under tension.

Title: FantasyWorld: Geometry-Consistent World Modeling via Unified Video and 3D Prediction

Authors: Yixiang Dai, Fan Jiang, Chiyu Wang, Mu Xu, Yonggang Qi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.21657
Pdf URL: https://arxiv.org/pdf/2509.21657
Copy Paste: [[2509.21657]] FantasyWorld: Geometry-Consistent World Modeling via Unified Video and 3D Prediction(https://arxiv.org/abs/2509.21657)
Keywords: generation
Abstract: High-quality 3D world models are pivotal for embodied intelligence and Artificial General Intelligence (AGI), underpinning applications such as AR/VR content creation and robotic navigation. Despite the established strong imaginative priors, current video foundation models lack explicit 3D grounding capabilities, thus being limited in both spatial consistency and their utility for downstream 3D reasoning tasks. In this work, we present FantasyWorld, a geometry-enhanced framework that augments frozen video foundation models with a trainable geometric branch, enabling joint modeling of video latents and an implicit 3D field in a single forward pass. Our approach introduces cross-branch supervision, where geometry cues guide video generation and video priors regularize 3D prediction, thus yielding consistent and generalizable 3D-aware video representations. Notably, the resulting latents from the geometric branch can potentially serve as versatile representations for downstream 3D tasks such as novel view synthesis and navigation, without requiring per-scene optimization or fine-tuning. Extensive experiments show that FantasyWorld effectively bridges video imagination and 3D perception, outperforming recent geometry-consistent baselines in multi-view coherence and style consistency. Ablation studies further confirm that these gains stem from the unified backbone and cross-branch information exchange.

Title: Neuroprobe: Evaluating Intracranial Brain Responses to Naturalistic Stimuli

Authors: Andrii Zahorodnii, Christopher Wang, Bennett Stankovits, Charikleia Moraitaki, Geeling Chau, Andrei Barbu, Boris Katz, Ila R Fiete
Subjects: cs.LG, q-bio.NC
Abstract URL: https://arxiv.org/abs/2509.21671
Pdf URL: https://arxiv.org/pdf/2509.21671
Copy Paste: [[2509.21671]] Neuroprobe: Evaluating Intracranial Brain Responses to Naturalistic Stimuli(https://arxiv.org/abs/2509.21671)
Keywords: generation
Abstract: High-resolution neural datasets enable foundation models for the next generation of brain-computer interfaces and neurological treatments. The community requires rigorous benchmarks to discriminate between competing modeling approaches, yet no standardized evaluation frameworks exist for intracranial EEG (iEEG) recordings. To address this gap, we present Neuroprobe: a suite of decoding tasks for studying multi-modal language processing in the brain. Unlike scalp EEG, intracranial EEG requires invasive surgery to implant electrodes that record neural activity directly from the brain with minimal signal distortion. Neuroprobe is built on the BrainTreebank dataset, which consists of 40 hours of iEEG recordings from 10 human subjects performing a naturalistic movie viewing task. Neuroprobe serves two critical functions. First, it is a mine from which neuroscience insights can be drawn. Its high temporal and spatial resolution allows researchers to systematically determine when and where computations for each aspect of language processing occur in the brain by measuring the decodability of each feature across time and all electrode locations. Using Neuroprobe, we visualize how information flows from the superior temporal gyrus to the prefrontal cortex, and the progression from simple auditory features to more complex language features in a purely data-driven manner. Second, as the field moves toward neural foundation models, Neuroprobe provides a rigorous framework for comparing competing architectures and training protocols. We found that the linear baseline is surprisingly strong, beating frontier foundation models on many tasks. Neuroprobe is designed with computational efficiency and ease of use in mind. We make the code for Neuroprobe openly available and maintain a public leaderboard, aiming to enable rapid progress in the field of iEEG foundation models, at this https URL

Title: SpecMER: Fast Protein Generation with K-mer Guided Speculative Decoding

Authors: Thomas Walton, Darin Tsui, Aryan Musharaf, Amirali Aghazadeh
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.21689
Pdf URL: https://arxiv.org/pdf/2509.21689
Copy Paste: [[2509.21689]] SpecMER: Fast Protein Generation with K-mer Guided Speculative Decoding(https://arxiv.org/abs/2509.21689)
Keywords: generation
Abstract: Autoregressive models have transformed protein engineering by enabling the generation of novel protein sequences beyond those found in nature. However, their sequential inference introduces significant latency, limiting their utility in high-throughput protein screening. Speculative decoding accelerates generation by employing a lightweight draft model to sample tokens, which a larger target model then verifies and refines. Yet, in protein sequence generation, draft models are typically agnostic to the structural and functional constraints of the target protein, leading to biologically implausible outputs and a shift in the likelihood distribution of generated sequences. We introduce SpecMER (Speculative Decoding via k-mer Guidance), a novel framework that incorporates biological, structural, and functional priors using k-mer motifs extracted from multiple sequence alignments. By scoring candidate sequences in parallel and selecting those most consistent with known biological patterns, SpecMER significantly improves sequence plausibility while retaining the efficiency of speculative decoding. SpecMER achieves 24-32% speedup over standard autoregressive decoding, along with higher acceptance rates and improved sequence likelihoods.
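The candidate-selection step described above, scoring drafts by their agreement with k-mer motifs from an alignment, can be sketched in a few lines. This is a simplified illustration, not SpecMER's implementation; the helper names, the count-based score, and the toy sequences in the usage note are assumptions.

```python
from collections import Counter

def kmer_counts(sequences, k):
    """Count k-mers across aligned sequences, ignoring gap characters ('-')."""
    counts = Counter()
    for seq in sequences:
        seq = seq.replace("-", "")
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts

def kmer_score(candidate, counts, k):
    """Average alignment count of the candidate's k-mers (0 for unseen k-mers)."""
    n = len(candidate) - k + 1
    if n <= 0:
        return 0.0
    return sum(counts[candidate[i:i + k]] for i in range(n)) / n

def pick_best(candidates, counts, k):
    """Return the candidate whose k-mers agree best with the alignment."""
    return max(candidates, key=lambda c: kmer_score(c, counts, k))
```

For example, with counts built from the toy alignment ["MKTAY-IAK", "MKTAYIAKQ", "MKSAYIAK-"] and k = 3, pick_best prefers "MKTAYI" over "MKTAYW" and "WWWWWW". In a speculative decoding loop, the selected draft would then be passed to the target model for verification.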

Title: UISim: An Interactive Image-Based UI Simulator for Dynamic Mobile Environments

Authors: Jiannan Xiang, Yun Zhu, Lei Shu, Maria Wang, Lijun Yu, Gabriel Barcik, James Lyon, Srinivas Sunkara, Jindong Chen
Subjects: cs.CV, cs.AI, cs.CL, cs.HC, cs.LG
Abstract URL: https://arxiv.org/abs/2509.21733
Pdf URL: https://arxiv.org/pdf/2509.21733
Copy Paste: [[2509.21733]] UISim: An Interactive Image-Based UI Simulator for Dynamic Mobile Environments(https://arxiv.org/abs/2509.21733)
Keywords: generation
Abstract: Developing and testing user interfaces (UIs) and training AI agents to interact with them are challenging due to the dynamic and diverse nature of real-world mobile environments. Existing methods often rely on cumbersome physical devices or limited static analysis of screenshots, which hinders scalable testing and the development of intelligent UI agents. We introduce UISim, a novel image-based UI simulator that offers a dynamic and interactive platform for exploring mobile phone environments purely from screen images. Our system employs a two-stage method: given an initial phone screen image and a user action, it first predicts the abstract layout of the next UI state, then synthesizes a new, visually consistent image based on this predicted layout. This approach enables the realistic simulation of UI transitions. UISim provides immediate practical benefits for UI testing, rapid prototyping, and synthetic data generation. Furthermore, its interactive capabilities pave the way for advanced applications, such as UI navigation task planning for AI agents. Our experimental results show that UISim outperforms end-to-end UI generation baselines in generating realistic and coherent subsequent UI states, highlighting its fidelity and potential to streamline UI development and enhance AI agent training.

Title: UniVid: Unifying Vision Tasks with Pre-trained Video Generation Models

Authors: Lan Chen, Yuchao Gu, Qi Mao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.21760
Pdf URL: https://arxiv.org/pdf/2509.21760
Copy Paste: [[2509.21760]] UniVid: Unifying Vision Tasks with Pre-trained Video Generation Models(https://arxiv.org/abs/2509.21760)
Keywords: generation, generative
Abstract: Large language models, trained on extensive corpora, successfully unify diverse linguistic tasks within a single generative framework. Inspired by this, recent works like Large Vision Model (LVM) extend this paradigm to vision by organizing tasks into sequential visual sentences, where visual prompts serve as the context to guide outputs. However, such modeling requires task-specific pre-training across modalities and sources, which is costly and limits scalability to unseen tasks. Given that pre-trained video generation models inherently capture temporal sequence dependencies, we explore a more unified and scalable alternative: can a pre-trained video generation model adapt to diverse image and video tasks? To answer this, we propose UniVid, a framework that fine-tunes a video diffusion transformer to handle various vision tasks without task-specific modifications. Tasks are represented as visual sentences, where the context sequence defines both the task and the expected output modality. We evaluate the generalization of UniVid from two perspectives: (1) cross-modal inference with contexts composed of both images and videos, extending beyond LVM's uni-modal setting; (2) cross-source tasks from natural to annotated data, without multi-source pre-training. Despite being trained solely on natural video data, UniVid generalizes well in both settings. Notably, understanding and generation tasks can easily switch by simply reversing the visual sentence order in this paradigm. These findings highlight the potential of pre-trained video generation models to serve as a scalable and unified foundation for vision modeling. Our code will be released at this https URL.

Title: Beyond Formula Complexity: Effective Information Criterion Improves Performance and Interpretability for Symbolic Regression

Authors: Zihan Yu, Guanren Wang, Jingtao Ding, Huandong Wang, Yong Li
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.21780
Pdf URL: https://arxiv.org/pdf/2509.21780
Copy Paste: [[2509.21780]] Beyond Formula Complexity: Effective Information Criterion Improves Performance and Interpretability for Symbolic Regression(https://arxiv.org/abs/2509.21780)
Keywords: generative
Abstract: Symbolic regression discovers accurate and interpretable formulas to describe given data, thereby providing scientific insights for domain experts and promoting scientific discovery. However, existing symbolic regression methods often use complexity metrics as a proxy for interpretability, which consider only the size of the formula and ignore its internal mathematical structure. Therefore, while they can discover formulas with compact forms, the discovered formulas often have structures that are difficult to analyze or interpret mathematically. In this work, inspired by the observation that physical formulas are typically numerically stable under limited calculation precision, we propose the Effective Information Criterion (EIC). It treats formulas as information processing systems with specific internal structures and identifies unreasonable structures in them by the loss of significant digits or the amplification of rounding noise as data flows through the system. We find that this criterion reveals the gap between the structural rationality of models discovered by existing symbolic regression algorithms and real-world physical formulas. Combining EIC with various search-based symbolic regression algorithms improves their performance on the Pareto frontier and reduces the irrational structure in the results. Combining EIC with generative-based algorithms reduces the number of samples required for pre-training, improving sample efficiency by 2 to 4 times. Finally, for different formulas with similar accuracy and complexity, EIC shows a 70.2% agreement with 108 human experts' preferences for formula interpretability, demonstrating that EIC, by measuring the unreasonable structures in formulas, actually reflects the formula's interpretability.
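The "loss of significant digits" that the abstract refers to is a general floating-point phenomenon and can be shown with two algebraically equal formulas. The example below is a standard textbook illustration of catastrophic cancellation, not the EIC computation; the function names are illustrative.

```python
import math

def naive(x):
    """(1 - cos x) / x^2, which subtracts two nearly equal numbers for small x."""
    return (1.0 - math.cos(x)) / (x * x)

def stable(x):
    """Algebraically identical form 2 sin^2(x/2) / x^2, with no cancellation."""
    s = math.sin(x / 2.0)
    return 2.0 * s * s / (x * x)
```

Both expressions tend to 0.5 as x approaches 0, but near x = 1e-8 the naive form loses essentially all of its significant digits in double precision while the stable form stays accurate. A criterion sensitive to this kind of structure would prefer the second formula even though the two have similar size.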

Title: LongScape: Advancing Long-Horizon Embodied World Models with Context-Aware MoE

Authors: Yu Shang, Lei Jin, Yiding Ma, Xin Zhang, Chen Gao, Wei Wu, Yong Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.21790
Pdf URL: https://arxiv.org/pdf/2509.21790
Copy Paste: [[2509.21790]] LongScape: Advancing Long-Horizon Embodied World Models with Context-Aware MoE(https://arxiv.org/abs/2509.21790)
Keywords: generation
Abstract: Video-based world models hold significant potential for generating high-quality embodied manipulation data. However, current video generation methods struggle to achieve stable long-horizon generation: classical diffusion-based approaches often suffer from temporal inconsistency and visual drift over multiple rollouts, while autoregressive methods tend to compromise on visual detail. To solve this, we introduce LongScape, a hybrid framework that adaptively combines intra-chunk diffusion denoising with inter-chunk autoregressive causal generation. Our core innovation is an action-guided, variable-length chunking mechanism that partitions video based on the semantic context of robotic actions. This ensures each chunk represents a complete, coherent action, enabling the model to flexibly generate diverse dynamics. We further introduce a Context-aware Mixture-of-Experts (CMoE) framework that adaptively activates specialized experts for each chunk during generation, guaranteeing high visual quality and seamless chunk transitions. Extensive experimental results demonstrate that our method achieves stable and consistent long-horizon generation over extended rollouts. Our code is available at: this https URL.

Title: FastGRPO: Accelerating Policy Optimization via Concurrency-aware Speculative Decoding and Online Draft Learning

Authors: Yizhou Zhang, Ning Lv, Teng Wang, Jisheng Dang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2509.21792
Pdf URL: https://arxiv.org/pdf/2509.21792
Copy Paste: [[2509.21792]] FastGRPO: Accelerating Policy Optimization via Concurrency-aware Speculative Decoding and Online Draft Learning(https://arxiv.org/abs/2509.21792)
Keywords: generation
Abstract: Group relative policy optimization (GRPO) has demonstrated significant potential in improving the reasoning capabilities of large language models (LLMs) via reinforcement learning. However, its practical deployment is impeded by an excessively slow training process, primarily attributed to the computationally intensive autoregressive generation of multiple responses per query, which makes the generation phase the primary performance bottleneck. Although speculative decoding presents a promising direction for acceleration, its direct application in GRPO achieves limited speedup under high-concurrency training conditions. To overcome this limitation, we propose a concurrency-aware speculative decoding framework that dynamically adjusts the drafting and verification strategy according to real-time concurrency levels, thereby maximizing the acceleration of the generation process. Furthermore, to address performance degradation arising from distributional drift between the evolving target model and the fixed draft model during training, we introduce an online draft learning mechanism that enables the draft model to continuously adapt using feedback signals from the target model. Experimental results across multiple mathematical reasoning datasets and models demonstrate that the proposed method achieves end-to-end speedups of 2.35x to 2.72x, significantly surpassing baseline approaches in efficiency. The code is available at this https URL.
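For readers unfamiliar with speculative decoding, the draft-then-verify loop that this entry builds on can be sketched as follows. This is a toy greedy variant under stated assumptions: `target_next` and `draft_next` are hypothetical stand-ins for the two models, and `draft_len` is fixed, whereas the paper adjusts its drafting and verification strategy to the concurrency level and updates the draft model online. Neither of those mechanisms is reproduced here.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]


def speculative_greedy_decode(
    target_next: NextToken,
    draft_next: NextToken,
    prompt: List[Token],
    max_new_tokens: int,
    draft_len: int,
) -> Tuple[List[Token], int]:
    """Greedy speculative decoding with a fixed draft length (illustrative).

    Returns the generated sequence and the number of verification rounds.
    In a real system each round is one batched forward pass of the target.
    """
    tokens = list(prompt)
    rounds = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1) The cheap draft model proposes draft_len tokens autoregressively.
        proposal: List[Token] = []
        ctx = list(tokens)
        for _ in range(draft_len):
            t = draft_next(ctx)
            proposal.append(t)
            ctx.append(t)

        # 2) The target checks the proposal position by position.
        rounds += 1
        accepted: List[Token] = []
        ctx = list(tokens)
        for t in proposal:
            expected = target_next(ctx)
            if t != expected:
                accepted.append(expected)  # first mismatch: keep the target's token
                break
            accepted.append(t)
            ctx.append(t)
        else:
            # Every drafted token matched, so the target adds one more token.
            accepted.append(target_next(ctx))

        tokens.extend(accepted)
    return tokens[: len(prompt) + max_new_tokens], rounds


def toy_target(ctx: List[Token]) -> Token:
    """Hypothetical target model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    """Hypothetical draft model: agrees with the target except after 3 and 7."""
    return 0 if ctx[-1] % 4 == 3 else (ctx[-1] + 1) % 10


if __name__ == "__main__":
    out, rounds = speculative_greedy_decode(toy_target, toy_draft, [0], 12, 4)
    print(out, rounds)
```

Because mismatched tokens are replaced by the target's own choice, the output is identical to plain greedy decoding with the target; the gain is that several tokens can be accepted per verification round.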

Title: MoWM: Mixture-of-World-Models for Embodied Planning via Latent-to-Pixel Feature Modulation

Authors: Yu Shang, Yangcheng Yu, Xin Zhang, Xin Jin, Haisheng Su, Wei Wu, Yong Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.21797
Pdf URL: https://arxiv.org/pdf/2509.21797
Copy Paste: [[2509.21797]] MoWM: Mixture-of-World-Models for Embodied Planning via Latent-to-Pixel Feature Modulation(https://arxiv.org/abs/2509.21797)
Keywords: generation
Abstract: Embodied action planning is a core challenge in robotics, requiring models to generate precise actions from visual observations and language instructions. While video generation world models are promising, their reliance on pixel-level reconstruction often introduces visual redundancies that hinder action decoding and generalization. Latent world models offer a compact, motion-aware representation, but overlook the fine-grained details critical for precise manipulation. To overcome these limitations, we propose MoWM, a mixture-of-world-model framework that fuses representations from hybrid world models for embodied action planning. Our approach uses motion-aware representations from a latent model as a high-level prior, which guides the extraction of fine-grained visual features from the pixel space model. This design allows MoWM to highlight the informative visual details needed for action decoding. Extensive evaluations on the CALVIN benchmark demonstrate that our method achieves state-of-the-art task success rates and superior generalization. We also provide a comprehensive analysis of the strengths of each feature space, offering valuable insights for future research in embodied planning. The code is available at: this https URL.

Title: On the Complexity Theory of Masked Discrete Diffusion: From $\mathrm{poly}(1/ε)$ to Nearly $ε$-Free

Authors: Xunpeng Huang, Yingyu Lin, Nishant Jain, Kaibo Wang, Difan Zou, Yian Ma, Tong Zhang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.21835
Pdf URL: https://arxiv.org/pdf/2509.21835
Copy Paste: [[2509.21835]] On the Complexity Theory of Masked Discrete Diffusion: From $\mathrm{poly}(1/ε)$ to Nearly $ε$-Free(https://arxiv.org/abs/2509.21835)
Keywords: generation
Abstract: We study masked discrete diffusion -- a flexible paradigm for text generation in which tokens are progressively corrupted by special mask symbols before being denoised. Although this approach has demonstrated strong empirical performance, its theoretical complexity in high-dimensional settings remains insufficiently understood. Existing analyses largely focus on uniform discrete diffusion, and more recent attempts addressing masked diffusion either (1) overlook widely used Euler samplers, (2) impose restrictive bounded-score assumptions, or (3) fail to showcase the advantages of masked discrete diffusion over its uniform counterpart. To address this gap, we show that Euler samplers can achieve $\epsilon$-accuracy in total variation (TV) with $\tilde{O}(d^{2}\epsilon^{-3/2})$ discrete score evaluations, thereby providing the first rigorous analysis of the typical Euler sampler in masked discrete diffusion. We then propose a Mask-Aware Truncated Uniformization (MATU) approach that both removes bounded-score assumptions and preserves unbiased discrete score approximation. By exploiting the property that each token can be unmasked at most once, MATU attains a nearly $\epsilon$-free complexity of $O(d\,\ln d\cdot (1-\epsilon^2))$. This result surpasses existing uniformization methods under uniform discrete diffusion, eliminating the $\ln(1/\epsilon)$ factor and substantially speeding up convergence. Our findings not only provide a rigorous theoretical foundation for masked discrete diffusion, showcasing its practical advantages over uniform diffusion for text generation, but also pave the way for future efforts to analyze diffusion-based language models developed under the masking paradigm.

Title: DiTraj: training-free trajectory control for video diffusion transformer

Authors: Cheng Lei, Jiayu Zhang, Yue Ma, Xinyu Wang, Long Chen, Liang Tang, Yiqiang Yan, Fei Su, Zhicheng Zhao
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2509.21839
Pdf URL: https://arxiv.org/pdf/2509.21839
Copy Paste: [[2509.21839]] DiTraj: training-free trajectory control for video diffusion transformer(https://arxiv.org/abs/2509.21839)
Keywords: generation, generative
Abstract: Diffusion Transformer (DiT)-based video generation models with 3D full attention exhibit strong generative capabilities. Trajectory control represents a user-friendly task in the field of controllable video generation. However, existing methods either require substantial training resources or are specifically designed for U-Net, and thus do not take advantage of the superior performance of DiT. To address these issues, we propose DiTraj, a simple but effective training-free framework for trajectory control in text-to-video generation, tailored for DiT. Specifically, first, to inject the object's trajectory, we propose foreground-background separation guidance: we use a Large Language Model (LLM) to convert user-provided prompts into foreground and background prompts, which respectively guide the generation of foreground and background regions in the video. Then, we analyze 3D full attention and explore the tight correlation between inter-token attention scores and position embedding. Based on this, we propose inter-frame Spatial-Temporal Decoupled 3D-RoPE (STD-RoPE). By modifying only foreground tokens' position embedding, STD-RoPE eliminates their cross-frame spatial discrepancies, strengthening cross-frame attention among them and thus enhancing trajectory control. Additionally, we achieve 3D-aware trajectory control by regulating the density of position embedding. Extensive experiments demonstrate that our method outperforms previous methods in both video quality and trajectory controllability.

Title: A Comprehensive Evaluation of Transformer-Based Question Answering Models and RAG-Enhanced Design

Authors: Zichen Zhang, Kunlong Zhang, Hongwei Ruan, Yiming Luo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.21845
Pdf URL: https://arxiv.org/pdf/2509.21845
Copy Paste: [[2509.21845]] A Comprehensive Evaluation of Transformer-Based Question Answering Models and RAG-Enhanced Design(https://arxiv.org/abs/2509.21845)
Keywords: generation
Abstract: Transformer-based models have advanced the field of question answering, but multi-hop reasoning, where answers require combining evidence across multiple passages, remains difficult. This paper presents a comprehensive evaluation of retrieval strategies for multi-hop question answering within a retrieval-augmented generation framework. We compare cosine similarity, maximal marginal relevance, and a hybrid method that integrates dense embeddings with lexical overlap and re-ranking. To further improve retrieval, we adapt the EfficientRAG pipeline for query optimization, introducing token labeling and iterative refinement while maintaining efficiency. Experiments on the HotpotQA dataset show that the hybrid approach substantially outperforms baseline methods, achieving a relative improvement of 50 percent in exact match and 47 percent in F1 score compared to cosine similarity. Error analysis reveals that hybrid retrieval improves entity recall and evidence complementarity, while remaining limited in handling distractors and temporal reasoning. Overall, the results suggest that hybrid retrieval-augmented generation provides a practical zero-shot solution for multi-hop question answering, balancing accuracy, efficiency, and interpretability.
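Two of the retrieval strategies compared in this entry, plain cosine-similarity ranking and maximal marginal relevance (MMR), can be contrasted in a short sketch. The code below is a generic textbook formulation, not the paper's pipeline: the function names, the trade-off weight `lam`, and the toy vectors are assumptions made for illustration, and the hybrid method with lexical overlap and re-ranking is not shown.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def top_k_cosine(query: Vector, docs: List[Vector], k: int) -> List[int]:
    """Indices of the k passages most similar to the query."""
    ranked = sorted(range(len(docs)), key=lambda i: cosine(query, docs[i]), reverse=True)
    return ranked[:k]


def mmr(query: Vector, docs: List[Vector], k: int, lam: float = 0.5) -> List[int]:
    """Greedy MMR: trade relevance to the query against redundancy with picks so far."""
    selected: List[int] = []
    candidates = list(range(len(docs)))
    while candidates and len(selected) < k:

        def score(i: int) -> float:
            relevance = cosine(query, docs[i])
            redundancy = max((cosine(docs[i], docs[j]) for j in selected), default=0.0)
            return lam * relevance - (1.0 - lam) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


if __name__ == "__main__":
    query = [1.0, 1.0]
    # Passages 0 and 1 are near-duplicates; passage 2 covers a different direction.
    docs = [[1.0, 0.9], [1.0, 0.8], [0.2, 1.0]]
    print("cosine top-2:", top_k_cosine(query, docs, 2))
    print("MMR top-2   :", mmr(query, docs, 2))
```

In the toy data, cosine ranking returns the two near-duplicate passages, while MMR swaps the second one for the less redundant passage. This is the kind of evidence complementarity that multi-hop questions tend to need.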

Title: Graph of Agents: Principled Long Context Modeling by Emergent Multi-Agent Collaboration

Authors: Taejong Joo, Shu Ishida, Ivan Sosnovik, Bryan Lim, Sahand Rezaei-Shoshtari, Adam Gaier, Robert Giaquinto
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2509.21848
Pdf URL: https://arxiv.org/pdf/2509.21848
Copy Paste: [[2509.21848]] Graph of Agents: Principled Long Context Modeling by Emergent Multi-Agent Collaboration(https://arxiv.org/abs/2509.21848)
Keywords: generation
Abstract: As a model-agnostic approach to long context modeling, multi-agent systems can process inputs longer than a large language model's context window without retraining or architectural modifications. However, their performance often heavily relies on hand-crafted multi-agent collaboration strategies and prompt engineering, which limit generalizability. In this work, we introduce a principled framework that formalizes the model-agnostic long context modeling problem as a compression problem, yielding an information-theoretic compression objective. Building on this framework, we propose Graph of Agents (GoA), which dynamically constructs an input-dependent collaboration structure that maximizes this objective. For Llama 3.1 8B and Qwen3 8B across six document question answering benchmarks, GoA improves the average $F_1$ score over retrieval-augmented generation by 5.7% and over a strong multi-agent baseline using a fixed collaboration structure by 16.35%. Even with only a 2K context window, GoA surpasses the 128K context window Llama 3.1 8B on LongBench, showing a dramatic increase in effective context length. Our source code is available at this https URL.

Title: SRHand: Super-Resolving Hand Images and 3D Shapes via View/Pose-aware Neural Image Representations and Explicit 3D Meshes

Authors: Minje Kim, Tae-Kyun Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.21859
Pdf URL: https://arxiv.org/pdf/2509.21859
Copy Paste: [[2509.21859]] SRHand: Super-Resolving Hand Images and 3D Shapes via View/Pose-aware Neural Image Representations and Explicit 3D Meshes(https://arxiv.org/abs/2509.21859)
Keywords: super-resolution
Abstract: Reconstructing detailed hand avatars plays a crucial role in various applications. While prior works have focused on capturing high-fidelity hand geometry, they heavily rely on high-resolution multi-view image inputs and struggle to generalize on low-resolution images. Multi-view image super-resolution methods have been proposed to enforce 3D view consistency. These methods, however, are limited to static objects/scenes with fixed resolutions and are not applicable to articulated deformable hands. In this paper, we propose SRHand (Super-Resolution Hand), a method for reconstructing detailed 3D geometry as well as textured images of hands from low-resolution images. SRHand leverages the advantages of implicit image representation with explicit hand meshes. Specifically, we introduce a geometric-aware implicit image function (GIIF) that learns a detailed hand prior by upsampling the coarse input images. By jointly optimizing the implicit image function and explicit 3D hand shapes, our method preserves multi-view and pose consistency among upsampled hand images, and achieves fine-detailed 3D reconstruction (wrinkles, nails). In experiments using the InterHand2.6M and Goliath datasets, our method significantly outperforms state-of-the-art image upsampling methods adapted to hand datasets, and 3D hand reconstruction methods, quantitatively and qualitatively. Project page: this https URL

Title: MolSpectLLM: A Molecular Foundation Model Bridging Spectroscopy, Molecule Elucidation, and 3D Structure Generation

Authors: Shuaike Shen, Jiaqing Xie, Zhuo Yang, Antong Zhang, Shuzhou Sun, Ben Gao, Tianfan Fu, Biqing Qi, Yuqiang Li
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.21861
Pdf URL: https://arxiv.org/pdf/2509.21861
Copy Paste: [[2509.21861]] MolSpectLLM: A Molecular Foundation Model Bridging Spectroscopy, Molecule Elucidation, and 3D Structure Generation(https://arxiv.org/abs/2509.21861)
Keywords: generation
Abstract: Recent advances in molecular foundation models have shown impressive performance in molecular property prediction and de novo molecular design, with promising applications in areas such as drug discovery and reaction prediction. Nevertheless, most existing approaches rely exclusively on SMILES representations and overlook both experimental spectra and 3D structural information, two indispensable sources for capturing molecular behavior in real-world scenarios. This limitation reduces their effectiveness in tasks where stereochemistry, spatial conformation, and experimental validation are critical. To overcome these challenges, we propose MolSpectLLM, a molecular foundation model pretrained on Qwen2.5-7B that unifies experimental spectroscopy with molecular 3D structure. By explicitly modeling molecular spectra, MolSpectLLM achieves state-of-the-art performance on spectrum-related tasks, with an average accuracy of 0.53 across NMR, IR, and MS benchmarks. MolSpectLLM also shows strong performance on the spectra analysis task, obtaining 15.5% sequence accuracy and 41.7% token accuracy on Spectra-to-SMILES, substantially outperforming large general-purpose LLMs. More importantly, MolSpectLLM not only achieves strong performance on molecular elucidation tasks, but also generates accurate 3D molecular structures directly from SMILES or spectral inputs, bridging spectral analysis, molecular elucidation, and molecular design.

Title: Deepfakes: we need to re-think the concept of "real" images

Authors: Janis Keuper, Margret Keuper
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.21864
Pdf URL: https://arxiv.org/pdf/2509.21864
Copy Paste: [[2509.21864]] Deepfakes: we need to re-think the concept of "real" images(https://arxiv.org/abs/2509.21864)
Keywords: generation, generative
Abstract: The wide availability and low usability barrier of modern image generation models have triggered the reasonable fear of criminal misconduct and negative social implications. The machine learning community has been engaging this problem with an extensive series of publications proposing algorithmic solutions for the detection of "fake", e.g. entirely generated or partially manipulated, images. While there is undoubtedly some progress towards technical solutions of the problem, we argue that current and prior work is focusing too much on generative algorithms and "fake" data-samples, neglecting a clear definition and data collection of "real" images. The fundamental question "what is a real image?" might appear to be quite philosophical, but our analysis shows that the development and evaluation of basically all current "fake"-detection methods rely on only a few, quite old low-resolution datasets of "real" images like ImageNet. However, the technology for the acquisition of "real" images, aka taking photos, has drastically evolved over the last decade: Today, over 90% of all photographs are produced by smartphones which typically use algorithms to compute an image from multiple inputs (over time) from multiple sensors. Based on the fact that these image formation algorithms are typically neural network architectures which are closely related to "fake"-image generators, we state the position that today, we need to re-think the concept of "real" images. The purpose of this position paper is to raise the awareness of the current shortcomings in this active field of research and to trigger an open discussion whether the detection of "fake" images is a sound objective at all. At the very least, we need a clear technical definition of "real" images and new benchmark datasets.

Title: Beyond RAG vs. Long-Context: Learning Distraction-Aware Retrieval for Efficient Knowledge Grounding

Authors: Seong-Woong Shim, Myunsoo Kim, Jae Hyeon Cho, Byung-Jun Lee
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.21865
Pdf URL: https://arxiv.org/pdf/2509.21865
Copy Paste: [[2509.21865]] Beyond RAG vs. Long-Context: Learning Distraction-Aware Retrieval for Efficient Knowledge Grounding(https://arxiv.org/abs/2509.21865)
Keywords: generation
Abstract: Retrieval-Augmented Generation (RAG) is a framework for grounding Large Language Models (LLMs) in external, up-to-date information. However, recent advancements in context window size allow LLMs to process inputs of up to 128K tokens or more, offering an alternative strategy: supplying the full document context directly to the model, rather than relying on RAG to retrieve a subset of contexts. Nevertheless, this emerging alternative strategy has notable limitations: (i) it is token-inefficient to handle large and potentially redundant contexts; (ii) it exacerbates the "lost in the middle" phenomenon; and (iii) under limited model capacity, it amplifies distraction, ultimately degrading LLM output quality. In this paper, we propose LDAR (Learning Distraction-Aware Retrieval), an adaptive retriever that learns to retrieve contexts in a way that mitigates interference from distracting passages, thereby achieving significantly higher performance with reduced token usage compared to long-context approaches. Extensive experiments across diverse LLM architectures and six knowledge-intensive benchmarks demonstrate the effectiveness and robustness of our approach, highlighting the importance of balancing the trade-off between information coverage and distraction.

Title: Abductive Logical Rule Induction by Bridging Inductive Logic Programming and Multimodal Large Language Models

Authors: Yifei Peng, Yaoli Liu, Enbo Xia, Yu Jin, Wang-Zhou Dai, Zhong Ren, Yao-Xiang Ding, Kun Zhou
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.21874
Pdf URL: https://arxiv.org/pdf/2509.21874
Copy Paste: [[2509.21874]] Abductive Logical Rule Induction by Bridging Inductive Logic Programming and Multimodal Large Language Models(https://arxiv.org/abs/2509.21874)
Keywords: generation
Abstract: We propose ILP-CoT, a method that bridges Inductive Logic Programming (ILP) and Multimodal Large Language Models (MLLMs) for abductive logical rule induction. The task involves both discovering logical facts and inducing logical rules from a small number of unstructured textual or visual inputs, which remains challenging when relying solely on ILP, due to the requirement of specified background knowledge and high computational cost, or solely on MLLMs, due to the appearance of perceptual hallucinations. Based on the key observation that MLLMs could propose structure-correct rules even under hallucinations, our approach automatically builds ILP tasks with pruned search spaces based on the rule structure proposals from MLLMs, and utilizes an ILP system to output rules built upon rectified logical facts and formal inductive reasoning. Its effectiveness is verified through challenging logical induction benchmarks, as well as a potential application of our approach, namely text-to-image customized generation with rule induction. Our code and data are released at this https URL.

Title: Drag4D: Align Your Motion with Text-Driven 3D Scene Generation

Authors: Minjun Kang, Inkyu Shin, Taeyeop Lee, In So Kweon, Kuk-Jin Yoon
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.21888
Pdf URL: https://arxiv.org/pdf/2509.21888
Copy Paste: [[2509.21888]] Drag4D: Align Your Motion with Text-Driven 3D Scene Generation(https://arxiv.org/abs/2509.21888)
Keywords: generation
Abstract: We introduce Drag4D, an interactive framework that integrates object motion control within text-driven 3D scene generation. This framework enables users to define 3D trajectories for the 3D objects generated from a single image, seamlessly integrating them into a high-quality 3D background. Our Drag4D pipeline consists of three stages. First, we enhance text-to-3D background generation by applying 2D Gaussian Splatting with panoramic images and inpainted novel views, resulting in dense and visually complete 3D reconstructions. In the second stage, given a reference image of the target object, we introduce a 3D copy-and-paste approach: the target instance is extracted in a full 3D mesh using an off-the-shelf image-to-3D model and seamlessly composited into the generated 3D scene. The object mesh is then positioned within the 3D scene via our physics-aware object position learning, ensuring precise spatial alignment. Lastly, the spatially aligned object is temporally animated along a user-defined 3D trajectory. To mitigate motion hallucination and ensure view-consistent temporal alignment, we develop a part-augmented, motion-conditioned video diffusion model that processes multiview image pairs together with their projected 2D trajectories. We demonstrate the effectiveness of our unified architecture through evaluations at each stage and in the final results, showcasing the harmonized alignment of user-controlled object motion within a high-quality 3D background.

Title: Syncphony: Synchronized Audio-to-Video Generation with Diffusion Transformers

Authors: Jibin Song, Mingi Kwon, Jaeseok Jeong, Youngjung Uh
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.21893
Pdf URL: https://arxiv.org/pdf/2509.21893
Copy Paste: [[2509.21893]] Syncphony: Synchronized Audio-to-Video Generation with Diffusion Transformers(https://arxiv.org/abs/2509.21893)
Keywords: generation
Abstract: Text-to-video and image-to-video generation have made rapid progress in visual quality, but they remain limited in controlling the precise timing of motion. In contrast, audio provides temporal cues aligned with video motion, making it a promising condition for temporally controlled video generation. However, existing audio-to-video (A2V) models struggle with fine-grained synchronization due to indirect conditioning mechanisms or limited temporal modeling capacity. We present Syncphony, which generates 380x640 resolution, 24fps videos synchronized with diverse audio inputs. Our approach builds upon a pre-trained video backbone and incorporates two key components to improve synchronization: (1) Motion-aware Loss, which emphasizes learning at high-motion regions; (2) Audio Sync Guidance, which guides the full model using a visually aligned off-sync model without audio layers to better exploit audio cues at inference while maintaining visual quality. To evaluate synchronization, we propose CycleSync, a video-to-audio-based metric that measures the amount of motion cues in the generated video to reconstruct the original audio. Experiments on AVSync15 and The Greatest Hits datasets demonstrate that Syncphony outperforms existing methods in both synchronization accuracy and visual quality. Project page is available at: this https URL

Title: Discrete Guidance Matching: Exact Guidance for Discrete Flow Matching

Authors: Zhengyan Wan, Yidong Ouyang, Liyan Xie, Fang Fang, Hongyuan Zha, Guang Cheng
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2509.21912
Pdf URL: https://arxiv.org/pdf/2509.21912
Copy Paste: [[2509.21912]] Discrete Guidance Matching: Exact Guidance for Discrete Flow Matching(https://arxiv.org/abs/2509.21912)
Keywords: generation
Abstract: Guidance provides a simple and effective framework for posterior sampling by steering the generation process towards the desired distribution. When modeling discrete data, existing approaches mostly focus on guidance with the first-order Taylor approximation to improve the sampling efficiency. However, such an approximation is inappropriate in discrete state spaces since the approximation error could be large. A novel guidance framework for discrete data is proposed to address this problem: We derive the exact transition rate for the desired distribution given a learned discrete flow matching model, leading to guidance that only requires a single forward pass in each sampling step, significantly improving efficiency. This unified novel framework is general enough, encompassing existing guidance methods as special cases, and it can also be seamlessly applied to the masked diffusion model. We demonstrate the effectiveness of our proposed guidance on energy-guided simulations and preference alignment on text-to-image generation and multimodal understanding tasks. The code is available through this https URL.
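Illustration (not from the paper): the abstract argues that a first-order Taylor approximation of the guidance term can have a large error in discrete state spaces. The toy sketch below makes that concrete under stated assumptions: `log_score` is a hypothetical stand-in for log p(y|x), tokens are treated as one-hot vectors, and the token values and target are invented for the example. It compares the exact change in the score after one token flip with the change predicted from the gradient. It does not implement the paper's exact-guidance method.

```python
def log_score(tokens, values, target):
    """Hypothetical log p(y|x): negative squared gap between the summed
    token values and a target."""
    s = sum(values[t] for t in tokens)
    return -(s - target) ** 2


def grad_one_hot(tokens, values, target):
    """Gradient of log_score w.r.t. each one-hot entry (position d, token k).
    For this toy score it equals -2 * (s - target) * values[k] at every position."""
    s = sum(values[t] for t in tokens)
    return [[-2.0 * (s - target) * v for v in values] for _ in tokens]


def exact_change(tokens, d, new_token, values, target):
    """Exact change in the score when position d is flipped to new_token."""
    flipped = list(tokens)
    flipped[d] = new_token
    return log_score(flipped, values, target) - log_score(tokens, values, target)


def taylor_change(tokens, d, new_token, values, target):
    """First-order estimate of the same change from a single gradient."""
    g = grad_one_hot(tokens, values, target)
    return g[d][new_token] - g[d][tokens[d]]


values = [0.0, 1.0, 5.0]   # value carried by each vocabulary entry (assumed)
tokens = [0, 1, 1]         # current sequence, summed value 2.0
target = 3.0

exact = exact_change(tokens, 0, 2, values, target)
approx = taylor_change(tokens, 0, 2, values, target)
print(exact, approx)
```

In this toy case the Taylor estimate even has the wrong sign, because a token flip is a unit-size jump rather than an infinitesimal step: the gap between the two numbers is the squared difference of the token values.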

Title: Generation Properties of Stochastic Interpolation under Finite Training Set

Authors: Yunchen Li, Shaohui Lin, Zhou Yu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2509.21925
Pdf URL: https://arxiv.org/pdf/2509.21925
Copy Paste: [[2509.21925]] Generation Properties of Stochastic Interpolation under Finite Training Set(https://arxiv.org/abs/2509.21925)
Keywords: generation, generative
Abstract: This paper investigates the theoretical behavior of generative models under finite training populations. Within the stochastic interpolation generative framework, we derive closed-form expressions for the optimal velocity field and score function when only a finite number of training samples are available. We demonstrate that, under some regularity conditions, the deterministic generative process exactly recovers the training samples, while the stochastic generative process manifests as training samples with added Gaussian noise. Beyond the idealized setting, we consider model estimation errors and introduce formal definitions of underfitting and overfitting specific to generative models. Our theoretical analysis reveals that, in the presence of estimation errors, the stochastic generation process effectively produces convex combinations of training samples corrupted by a mixture of uniform and Gaussian noise. Experiments on generation tasks and downstream tasks such as classification support our theory.
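Illustration (not from the paper): a minimal one-dimensional sketch of a closed-form velocity field for a finite training set. It assumes the linear interpolant x_t = (1 - t) z + t x_1 with z drawn from a standard Gaussian and x_1 uniform over the training samples; the paper's stochastic interpolation framework may be more general. Under that assumption the optimal velocity is a softmax-weighted average of (x_i - x) / (1 - t), and Euler integration of the deterministic process lands on a training sample, which matches the abstract's claim. The names `optimal_velocity` and `generate` and the sample values are hypothetical.

```python
import math


def optimal_velocity(x, t, samples):
    """E[x_1 - z | x_t = x] for the linear interpolant and a finite training set."""
    sigma = 1.0 - t
    # Log posterior weight of each training sample given x_t = x.
    logits = [-((x - t * s) ** 2) / (2.0 * sigma ** 2) for s in samples]
    m = max(logits)                      # subtract the max for numerical stability
    w = [math.exp(l - m) for l in logits]
    total = sum(w)
    return sum(wi / total * (s - x) / sigma for wi, s in zip(w, samples))


def generate(z, samples, steps=1000):
    """Euler integration of dx/dt = v*(x, t) from t = 0 to t = 1."""
    x = z
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt                       # the last step uses t = 1 - dt, so sigma > 0
        x += dt * optimal_velocity(x, t, samples)
    return x


train = [-2.0, 0.5, 3.0]
for z in (-1.5, 0.0, 2.0):
    print(z, round(generate(z, train), 6))
```

Each starting noise value ends on one of the three training points, so the deterministic sampler memorizes the data in this idealized setting.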

Title: SemanticControl: A Training-Free Approach for Handling Loosely Aligned Visual Conditions in ControlNet

Authors: Woosung Joung, Daewon Chae, Jinkyu Kim
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2509.21938
Pdf URL: https://arxiv.org/pdf/2509.21938
Copy Paste: [[2509.21938]] SemanticControl: A Training-Free Approach for Handling Loosely Aligned Visual Conditions in ControlNet(https://arxiv.org/abs/2509.21938)
Keywords: generation
Abstract: ControlNet has enabled detailed spatial control in text-to-image diffusion models by incorporating additional visual conditions such as depth or edge maps. However, its effectiveness heavily depends on the availability of visual conditions that are precisely aligned with the generation goal specified by the text prompt, a requirement that often fails in practice, especially for uncommon or imaginative scenes. For example, generating an image of a cat cooking in a specific pose may be infeasible due to the lack of suitable visual conditions. In contrast, structurally similar cues can often be found in more common settings; for instance, poses of humans cooking are widely available and can serve as rough visual guides. Unfortunately, existing ControlNet models struggle to use such loosely aligned visual conditions, often resulting in low text fidelity or visual artifacts. To address this limitation, we propose SemanticControl, a training-free method for effectively leveraging misaligned but semantically relevant visual conditions. Our approach adaptively suppresses the influence of the visual condition where it conflicts with the prompt, while strengthening guidance from the text. The key idea is to first run an auxiliary denoising process using a surrogate prompt aligned with the visual condition (e.g., "a human playing guitar" for a human pose condition) to extract informative attention masks, and then utilize these masks during the denoising of the actual target prompt (e.g., "a cat playing guitar"). Experimental results demonstrate that our method improves performance under loosely aligned conditions across various condition types, including depth maps, edge maps, and human skeletons, outperforming existing baselines. Our code is available at this https URL.

Title: Structural Information-based Hierarchical Diffusion for Offline Reinforcement Learning

Authors: Xianghua Zeng, Hao Peng, Angsheng Li, Yicheng Pan
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.21942
Pdf URL: https://arxiv.org/pdf/2509.21942
Copy Paste: [[2509.21942]] Structural Information-based Hierarchical Diffusion for Offline Reinforcement Learning(https://arxiv.org/abs/2509.21942)
Keywords: generative
Abstract: Diffusion-based generative methods have shown promising potential for modeling trajectories from offline reinforcement learning (RL) datasets, and hierarchical diffusion has been introduced to mitigate variance accumulation and computational challenges in long-horizon planning tasks. However, existing approaches typically assume a fixed two-layer diffusion hierarchy with a single predefined temporal scale, which limits adaptability to diverse downstream tasks and reduces flexibility in decision making. In this work, we propose SIHD, a novel Structural Information-based Hierarchical Diffusion framework for effective and stable offline policy learning in long-horizon environments with sparse rewards. Specifically, we analyze structural information embedded in offline trajectories to construct the diffusion hierarchy adaptively, enabling flexible trajectory modeling across multiple temporal scales. Rather than relying on reward predictions from localized sub-trajectories, we quantify the structural information gain of each state community and use it as a conditioning signal within the corresponding diffusion layer. To reduce overreliance on offline datasets, we introduce a structural entropy regularizer that encourages exploration of underrepresented states while avoiding extrapolation errors from distributional shifts. Extensive evaluations on challenging offline RL tasks show that SIHD significantly outperforms state-of-the-art baselines in decision-making performance and demonstrates superior generalization across diverse scenarios.

Title: MultiCrafter: High-Fidelity Multi-Subject Generation via Spatially Disentangled Attention and Identity-Aware Reinforcement Learning

Authors: Tao Wu, Yibo Jiang, Yehao Lu, Zhizhong Wang, Zeyi Huang, Zequn Qin, Xi Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.21953
Pdf URL: https://arxiv.org/pdf/2509.21953
Copy Paste: [[2509.21953]] MultiCrafter: High-Fidelity Multi-Subject Generation via Spatially Disentangled Attention and Identity-Aware Reinforcement Learning(https://arxiv.org/abs/2509.21953)
Keywords: generation
Abstract: Multi-subject image generation aims to synthesize user-provided subjects in a single image while preserving subject fidelity, ensuring prompt consistency, and aligning with human aesthetic preferences. However, existing methods, particularly those built on the In-Context-Learning paradigm, are limited by their reliance on simple reconstruction-based objectives, leading both to severe attribute leakage that compromises subject fidelity and to a failure to align with nuanced human preferences. To address this, we propose MultiCrafter, a framework that ensures high-fidelity, preference-aligned generation. First, we find that the root cause of attribute leakage is a significant entanglement of attention between different subjects during the generation process. Therefore, we introduce explicit positional supervision to separate attention regions for each subject, effectively mitigating attribute leakage. To enable the model to accurately plan the attention region of different subjects in diverse scenarios, we employ a Mixture-of-Experts (MoE) architecture to enhance the model's capacity, allowing different experts to focus on different scenarios. Finally, we design a novel online reinforcement learning framework to align the model with human preferences, featuring a scoring mechanism to accurately assess multi-subject fidelity and a more stable training strategy tailored for the MoE architecture. Experiments validate that our framework significantly improves subject fidelity while aligning better with human preferences.

Title: No-Reference Image Contrast Assessment with Customized EfficientNet-B0

Authors: Javad Hassannataj Joloudari, Bita Mesbahzadeh, Omid Zare, Emrah Arslan, Roohallah Alizadehsani, Hossein Moosaei
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2509.21967
Pdf URL: https://arxiv.org/pdf/2509.21967
Copy Paste: [[2509.21967]] No-Reference Image Contrast Assessment with Customized EfficientNet-B0(https://arxiv.org/abs/2509.21967)
Keywords: quality assessment
Abstract: Image contrast is a fundamental factor in visual perception and plays a vital role in overall image quality. However, most no-reference image quality assessment (NR-IQA) models struggle to accurately evaluate contrast distortions under diverse real-world conditions. In this study, we propose a deep learning-based framework for blind contrast quality assessment by customizing and fine-tuning three pre-trained architectures, EfficientNet-B0, ResNet18, and MobileNetV2, to predict the perceptual Mean Opinion Score, along with an additional model built on a Siamese network, which showed only a limited ability to capture perceptual contrast distortions. Each model is modified with a contrast-aware regression head and trained end-to-end using targeted data augmentations on two benchmark datasets, CID2013 and CCID2014, containing synthetic and authentic contrast distortions. Performance is evaluated using the Pearson Linear Correlation Coefficient (PLCC) and the Spearman Rank Order Correlation Coefficient (SRCC), which assess the alignment between predicted and human-rated scores. Among these three models, our customized EfficientNet-B0 model achieves state-of-the-art performance with PLCC = 0.9286 and SRCC = 0.9178 on CCID2014 and PLCC = 0.9581 and SRCC = 0.9369 on CID2013, surpassing traditional methods and outperforming other deep baselines. These results highlight the model's robustness and effectiveness in capturing perceptual contrast distortion. Overall, the proposed method demonstrates that contrast-aware adaptation of lightweight pre-trained networks can yield a high-performing, scalable solution for no-reference contrast quality assessment suitable for real-time and resource-constrained applications.
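Illustration (not from the paper): the two evaluation metrics named in the abstract can be computed as follows. This is a minimal standard-library sketch; the score lists are invented for the example and are not results from the paper. The Spearman coefficient is computed as the Pearson correlation of average ranks, which handles ties.

```python
import math


def pearson(x, y):
    """Pearson linear correlation coefficient (PLCC)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank-order correlation coefficient (SRCC)."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up example: human mean opinion scores vs. model predictions.
mos = [1.2, 2.5, 3.1, 4.0, 4.8]
pred = [1.0, 2.0, 3.5, 3.9, 5.0]
print(f"PLCC={pearson(pred, mos):.4f}  SRCC={spearman(pred, mos):.4f}")
```

PLCC rewards a linear relationship between predictions and human scores, while SRCC only checks that the ordering agrees, so a model can reach SRCC = 1 with a PLCC below 1, as in this example.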

Title: Mind-the-Glitch: Visual Correspondence for Detecting Inconsistencies in Subject-Driven Generation

Authors: Abdelrahman Eldesokey, Aleksandar Cvejic, Bernard Ghanem, Peter Wonka
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.21989
Pdf URL: https://arxiv.org/pdf/2509.21989
Copy Paste: [[2509.21989]] Mind-the-Glitch: Visual Correspondence for Detecting Inconsistencies in Subject-Driven Generation(https://arxiv.org/abs/2509.21989)
Keywords: generation
Abstract: We propose a novel approach for disentangling visual and semantic features from the backbones of pre-trained diffusion models, enabling visual correspondence in a manner analogous to the well-established semantic correspondence. While diffusion model backbones are known to encode semantically rich features, they must also contain visual features to support their image synthesis capabilities. However, isolating these visual features is challenging due to the absence of annotated datasets. To address this, we introduce an automated pipeline that constructs image pairs with annotated semantic and visual correspondences based on existing subject-driven image generation datasets, and design a contrastive architecture to separate the two feature types. Leveraging the disentangled representations, we propose a new metric, Visual Semantic Matching (VSM), that quantifies visual inconsistencies in subject-driven image generation. Empirical results show that our approach outperforms global feature-based metrics such as CLIP, DINO, and vision-language models in quantifying visual inconsistencies while also enabling spatial localization of inconsistent regions. To our knowledge, this is the first method that supports both quantification and localization of inconsistencies in subject-driven generation, offering a valuable tool for advancing this task. Project page: this https URL

Title: WAVE: Learning Unified & Versatile Audio-Visual Embeddings with Multimodal LLM

Authors: Changli Tang, Qinfan Xiao, Ke Mei, Tianyi Wang, Fengyun Rao, Chao Zhang
Subjects: cs.CV, cs.SD
Abstract URL: https://arxiv.org/abs/2509.21990
Pdf URL: https://arxiv.org/pdf/2509.21990
Copy Paste: [[2509.21990]] WAVE: Learning Unified & Versatile Audio-Visual Embeddings with Multimodal LLM(https://arxiv.org/abs/2509.21990)
Keywords: generation
Abstract: While embeddings from multimodal large language models (LLMs) excel as general-purpose representations, their application to dynamic modalities like audio and video remains underexplored. We introduce WAVE (unified & versatile audio-visual embeddings), the first LLM-based embedding that creates a unified representation space for text, audio, and video modalities. WAVE employs a novel hierarchical feature fusion strategy and a joint multi-modal, multi-task training approach to enable two key capabilities: any-to-any cross-modal retrieval and the generation of prompt-aware embeddings tailored to user instructions. Experimentally, WAVE sets a new state-of-the-art on the MMEB-v2 video benchmark and achieves superior results in audio and video-to-audio retrieval. Its prompt-aware nature also yields remarkable performance in multimodal question answering, significantly outperforming existing embedding models. Ablation studies validate our joint training strategy, demonstrating improved performance across all modalities. With a newly introduced benchmark for versatile audio-visual learning, WAVE opens up broad possibilities for cross-modal, any-to-any applications. Our code, checkpoints, and data will be released.

Title: FailureAtlas: Mapping the Failure Landscape of T2I Models via Active Exploration

Authors: Muxi Chen, Zhaohua Zhang, Chenchen Zhao, Mingyang Chen, Wenyu Jiang, Tianwen Jiang, Jianhuan Zhuo, Yu Tang, Qiuyong Xiao, Jihong Zhang, Qiang Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.21995
Pdf URL: https://arxiv.org/pdf/2509.21995
Copy Paste: [[2509.21995]] FailureAtlas: Mapping the Failure Landscape of T2I Models via Active Exploration(https://arxiv.org/abs/2509.21995)
Keywords: generative
Abstract: Static benchmarks have provided a valuable foundation for comparing Text-to-Image (T2I) models. However, their passive design offers limited diagnostic power, struggling to uncover the full landscape of systematic failures or isolate their root causes. We argue for a complementary paradigm: active exploration. We introduce FailureAtlas, the first framework designed to autonomously explore and map the vast failure landscape of T2I models at scale. FailureAtlas frames error discovery as a structured search for minimal, failure-inducing concepts. While it is a computationally explosive problem, we make it tractable with novel acceleration techniques. When applied to Stable Diffusion models, our method uncovers hundreds of thousands of previously unknown error slices (over 247,000 in SD1.5 alone) and provides the first large-scale evidence linking these failures to data scarcity in the training set. By providing a principled and scalable engine for deep model auditing, FailureAtlas establishes a new, diagnostic-first methodology to guide the development of more robust generative AI. The code is available at this https URL

Title: Exposing Hallucinations To Suppress Them: VLMs Representation Editing With Generative Anchors

Authors: Youxu Shi, Suorong Yang, Dong Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.21997
Pdf URL: https://arxiv.org/pdf/2509.21997
Copy Paste: [[2509.21997]] Exposing Hallucinations To Suppress Them: VLMs Representation Editing With Generative Anchors(https://arxiv.org/abs/2509.21997)
Keywords: generative
Abstract: Multimodal large language models (MLLMs) have achieved remarkable success across diverse vision-language tasks, yet they remain highly susceptible to hallucinations, producing content that is fluent but inconsistent with visual evidence. Such hallucinations, spanning objects, attributes, and relations, persist even in larger models, while existing mitigation approaches often require additional finetuning, handcrafted priors, or trade-offs that compromise informativeness and scalability. To address this limitation, we propose a training-free, self-supervised method for hallucination mitigation. Our approach introduces a novel hallucination amplification mechanism: a caption is projected into the visual space via a text-to-image model to reveal implicit hallucination signals, serving as a negative anchor, while the original image provides a positive anchor. Leveraging these dual anchors, we edit decoder hidden states by pulling representations toward faithful semantics and pushing them away from hallucination directions. This correction requires no human priors or additional training costs, ensuring both effectiveness and efficiency. Extensive experiments across multiple benchmarks show that our method significantly reduces hallucinations at the object, attribute, and relation levels while largely preserving recall and caption richness, e.g., achieving a hallucination reduction by over 5% using LLaVA-v1.5-7B on CHAIR. Furthermore, results on diverse architectures, including LLaVA-NEXT-7B, Cambrian-8B, and InstructBLIP-7B, validate strong cross-architecture generalization. More importantly, when applied to hallucination-free captions, our method introduces almost no side effects, underscoring its robustness and practical plug-and-play applicability. The implementation will be publicly available.

Title: Goal-Guided Efficient Exploration via Large Language Model in Reinforcement Learning

Authors: Yajie Qi, Wei Wei, Lin Li, Lijun Zhang, Zhidong Gao, Da Wang, Huizhong Song
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.22008
Pdf URL: https://arxiv.org/pdf/2509.22008
Copy Paste: [[2509.22008]] Goal-Guided Efficient Exploration via Large Language Model in Reinforcement Learning(https://arxiv.org/abs/2509.22008)
Keywords: generation
Abstract: Real-world decision-making tasks typically occur in complex and open environments, posing significant challenges to reinforcement learning (RL) agents' exploration efficiency and long-horizon planning capabilities. A promising approach is LLM-enhanced RL, which leverages the rich prior knowledge and strong planning capabilities of LLMs to guide RL agents in efficient exploration. However, existing methods mostly rely on frequent and costly LLM invocations and suffer from limited performance due to the semantic mismatch. In this paper, we introduce a Structured Goal-guided Reinforcement Learning (SGRL) method that integrates a structured goal planner and a goal-conditioned action pruner to guide RL agents toward efficient exploration. Specifically, the structured goal planner utilizes LLMs to generate a reusable, structured function for goal generation, in which goals are prioritized. Furthermore, by utilizing LLMs to determine goals' priority weights, it dynamically generates forward-looking goals to guide the agent's policy toward more promising decision-making trajectories. The goal-conditioned action pruner employs an action masking mechanism that filters out actions misaligned with the current goal, thereby constraining the RL agent to select goal-consistent policies. We evaluate the proposed method on Crafter and Craftax-Classic, and experimental results demonstrate that SGRL achieves superior performance compared to existing state-of-the-art methods.
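Illustration (not from the paper): the abstract describes an action masking mechanism that filters out actions misaligned with the current goal. A common way to realize such a mask is to set the logits of disallowed actions to negative infinity before the softmax, so those actions receive zero probability. The sketch below shows that generic technique only; the action names, the goal-to-action table `GOAL_ACTIONS`, and the logits are hypothetical and do not come from SGRL.

```python
import math

# Hypothetical action set and goal-to-action table, invented for illustration.
ACTIONS = ["move", "collect", "place_table", "sleep"]
GOAL_ACTIONS = {
    "collect_wood": {"move", "collect"},
    "place_table": {"move", "place_table"},
}


def masked_policy(logits, goal):
    """Softmax over the policy logits, restricted to actions allowed by the goal."""
    allowed = GOAL_ACTIONS[goal]
    masked = [l if a in allowed else -math.inf for l, a in zip(logits, ACTIONS)]
    m = max(masked)
    if m == -math.inf:
        raise ValueError("goal masks out every action")
    exps = [math.exp(l - m) for l in masked]   # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


logits = [1.0, 2.0, 0.5, 3.0]                  # raw policy scores (made up)
probs = masked_policy(logits, "collect_wood")
print(dict(zip(ACTIONS, (round(p, 4) for p in probs))))
```

Although "sleep" has the highest raw logit here, the mask removes it, and the remaining probability mass is renormalized over the goal-consistent actions.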

Title: CoFFT: Chain of Foresight-Focus Thought for Visual Language Models

Authors: Xinyu Zhang, Yuxuan Dong, Lingling Zhang, Chengyou Jia, Zhuohang Dang, Basura Fernando, Jun Liu, Mike Zheng Shou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.22010
Pdf URL: https://arxiv.org/pdf/2509.22010
Copy Paste: [[2509.22010]] CoFFT: Chain of Foresight-Focus Thought for Visual Language Models(https://arxiv.org/abs/2509.22010)
Keywords: generation
Abstract: Despite significant advances in Vision Language Models (VLMs), they remain constrained by the complexity and redundancy of visual input. When images contain large amounts of irrelevant information, VLMs are susceptible to interference, thus generating excessive task-irrelevant reasoning processes or even hallucinations. This limitation stems from their inability to precisely discover and process the required regions during reasoning. To address this limitation, we present the Chain of Foresight-Focus Thought (CoFFT), a novel training-free approach that enhances VLMs' visual reasoning by emulating human visual cognition. Each Foresight-Focus Thought consists of three stages: (1) Diverse Sample Generation: generates diverse reasoning samples to explore potential reasoning paths, where each sample contains several reasoning steps; (2) Dual Foresight Decoding: rigorously evaluates these samples based on both visual focus and reasoning progression, adding the first step of the optimal sample to the reasoning process; (3) Visual Focus Adjustment: precisely adjusts visual focus toward regions most beneficial for future reasoning, before returning to stage (1) to generate subsequent reasoning samples until reaching the final answer. These stages function iteratively, creating an interdependent cycle where reasoning guides visual focus and visual focus informs subsequent reasoning. Empirical results across multiple benchmarks using Qwen2.5-VL, InternVL-2.5, and Llava-Next demonstrate consistent performance improvements of 3.1-5.8% with a controllable increase in computational overhead.

Title: Latent Diffusion: Multi-Dimension Stable Diffusion Latent Space Explorer

Authors: Zhihua Zhong, Xuanyang Huang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2509.22038
Pdf URL: https://arxiv.org/pdf/2509.22038
Copy Paste: [[2509.22038]] Latent Diffusion: Multi-Dimension Stable Diffusion Latent Space Explorer(https://arxiv.org/abs/2509.22038)
Keywords: generation, generative
Abstract: Latent space is one of the key concepts in generative AI, offering powerful means for creative exploration through vector manipulation. However, diffusion models like Stable Diffusion lack the intuitive latent vector control found in GANs, limiting their flexibility for artistic expression. This paper introduces a framework for integrating customizable latent space operations into the diffusion process. By enabling direct manipulation of conceptual and spatial representations, this approach expands creative possibilities in generative art. We demonstrate the potential of this framework through two artworks, Infinitepedia and Latent Motion, highlighting its use in conceptual blending and dynamic motion generation. Our findings reveal latent space structures with semantic and meaningless regions, offering insights into the geometry of diffusion models and paving the way for further explorations of latent space.

Title: High-Quality Sound Separation Across Diverse Categories via Visually-Guided Generative Modeling

Authors: Chao Huang, Susan Liang, Yapeng Tian, Anurag Kumar, Chenliang Xu
Subjects: cs.CV, cs.SD
Abstract URL: https://arxiv.org/abs/2509.22063
Pdf URL: https://arxiv.org/pdf/2509.22063
Copy Paste: [[2509.22063]] High-Quality Sound Separation Across Diverse Categories via Visually-Guided Generative Modeling(https://arxiv.org/abs/2509.22063)
Keywords: generative
Abstract: We propose DAVIS, a Diffusion-based Audio-VIsual Separation framework that solves the audio-visual sound source separation task through generative learning. Existing methods typically frame sound separation as a mask-based regression problem, achieving significant progress. However, they face limitations in capturing the complex data distribution required for high-quality separation of sounds from diverse categories. In contrast, DAVIS circumvents these issues by leveraging potent generative modeling paradigms, specifically Denoising Diffusion Probabilistic Models (DDPM) and the more recent Flow Matching (FM), integrated within a specialized Separation U-Net architecture. Our framework operates by synthesizing the desired separated sound spectrograms directly from a noise distribution, conditioned concurrently on the mixed audio input and associated visual information. The inherent nature of its generative objective makes DAVIS particularly adept at producing high-quality sound separations for diverse sound categories. We present comparative evaluations of DAVIS, encompassing both its DDPM and Flow Matching variants, against leading methods on the standard AVE and MUSIC datasets. The results affirm that both variants surpass existing approaches in separation quality, highlighting the efficacy of our generative framework for tackling the audio-visual source separation task.
摘要：我们提出了戴维斯（Davis），这是一个基于扩散的视听分离框架，该框架通过生成学习解决了视听声音源分离任务。现有方法通常将声音分离作为基于面具的回归问题，从而取得了重大进展。但是，他们在捕获高质量分离声音与不同类别的高质量分离所需的复杂数据分布时面临限制。相比之下，戴维斯通过利用有效的生成建模范式，特别是剥夺扩散概率模型（DDPM）和更近最近的流量匹配（FM）来绕过这些问题，该范例集成了专业的分离U-NET体系结构。我们的框架通过直接与噪声分布合成所需的分离声频谱图来运行，并在混合音频输入和相关的视觉信息上同时进行。其生成目标的固有性质使戴维斯特别擅长为各种声音类别生产高质量的声音分离。我们介绍了戴维斯的比较评估，其中包括其DDPM和流量匹配变体，与标准AVE和音乐数据集的领先方法。结果肯定，这两个变体都以分离质量超过了现有的方法，突出了我们生成框架应对视听源分离任务的功效。

Title: Large Material Gaussian Model for Relightable 3D Generation

Authors: Jingrui Ye, Lingting Zhu, Runze Zhang, Zeyu Hu, Yingda Yin, Lanjiong Li, Lequan Yu, Qingmin Liao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.22112
Pdf URL: https://arxiv.org/pdf/2509.22112
Copy Paste: [[2509.22112]] Large Material Gaussian Model for Relightable 3D Generation(https://arxiv.org/abs/2509.22112)
Keywords: generation
Abstract: The increasing demand for 3D assets across various industries necessitates efficient and automated methods for 3D content creation. Leveraging 3D Gaussian Splatting, recent large reconstruction models (LRMs) have demonstrated the ability to efficiently achieve high-quality 3D rendering by integrating multiview diffusion for generation and scalable transformers for reconstruction. However, existing models fail to produce the material properties of assets, which are crucial for realistic rendering in diverse lighting environments. In this paper, we introduce the Large Material Gaussian Model (MGM), a novel framework designed to generate high-quality 3D content with Physically Based Rendering (PBR) materials, i.e., albedo, roughness, and metallic properties, rather than merely producing RGB textures with uncontrolled light baking. Specifically, we first fine-tune a new multiview material diffusion model conditioned on input depth and normal maps. Utilizing the generated multiview PBR images, we explore a Gaussian material representation that not only aligns with 2D Gaussian Splatting but also models each channel of the PBR materials. The reconstructed point clouds can then be rendered to acquire PBR attributes, enabling dynamic relighting by applying various ambient light maps. Extensive experiments demonstrate that the materials produced by our method not only exhibit greater visual appeal compared to baseline methods but also enhance material modeling, thereby enabling practical downstream rendering applications.
摘要：各个行业对3D资产的需求不断增长，因此需要有效，自动化的方法来创建3D。利用3D高斯裂开，最近的大型重建模型（LRMS）证明了通过整合用于重建的生成和可扩展变压器的多视图扩散来有效实现高质量3D渲染的能力。但是，现有模型未能产生资产的材料特性，这对于在不同的照明环境中进行现实渲染至关重要。在本文中，我们介绍了大型材料高斯模型（MGM），这是一个新颖的框架，旨在生成具有基于物理的渲染（PBR）材料的高质量3D含量，即，IE，反照率，粗糙度和金属性能，而不仅仅是仅生产带有未控制轻便烘烤的RGB纹理。具体而言，我们首先对以输入深度和正常地图为条件的新的多视材料扩散模型进行微调。利用生成的多视PBR图像，我们探索了一种高斯材料表示，该表示不仅与2D高斯裂口保持一致，而且还对PBR材料的每个通道进行建模。然后，可以渲染重建的点云以获取PBR属性，从而通过应用各种环境光映射来使动态重新获得。广泛的实验表明，与基线方法相比，我们方法产生的材料不仅具有更大的视觉吸引力，而且还增强了材料建模，从而实现了实用的下游渲染应用。

Title: Countering adversarial evasion in regression analysis

Authors: David Benfield, Phan Tu Vuong, Alain Zemkoho
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.22113
Pdf URL: https://arxiv.org/pdf/2509.22113
Copy Paste: [[2509.22113]] Countering adversarial evasion in regression analysis(https://arxiv.org/abs/2509.22113)
Keywords: generation
Abstract: Adversarial machine learning challenges the assumption that the underlying distribution remains consistent throughout the training and implementation of a prediction model. In particular, adversarial evasion considers scenarios where adversaries adapt their data to influence particular outcomes from established prediction models. Such scenarios arise in applications such as spam email filtering, malware detection and fake-image generation, where security methods must be actively updated to keep up with the ever-improving generation of malicious data. Game theoretic models have been shown to be effective at modelling these scenarios and hence training resilient predictors against such adversaries. Recent advancements in the use of pessimistic bilevel optimisation, which removes assumptions about the convexity and uniqueness of the adversary's optimal strategy, have proved to be particularly effective at mitigating threats to classifiers due to its ability to capture the antagonistic nature of the adversary. However, this formulation has not yet been adapted to regression scenarios. This article proposes a pessimistic bilevel optimisation program for regression scenarios which makes no assumptions on the convexity or uniqueness of the adversary's solutions.
摘要：对抗机器的学习挑战了以下假设：在整个培训和实施预测模型的过程中，基础分布保持一致。尤其是，对抗性逃避考虑了对手对其数据进行调整以影响已建立预测模型的特定结果的场景，此类场景在诸如垃圾邮件电子邮件过滤，恶意软件检测和虚假图像生成之类的应用中都出现，其中必须积极地更新安全方法，以与不断增长的一代恶意数据相关。游戏理论模型已被证明可以有效地对这些场景进行建模，从而针对此类对手进行培训弹性预测因子。在使用悲观的双层最佳最佳使用方面的最新进展，这些优化消除了对敌人最佳策略的凸性和独特性的假设，这对于减轻对分类器的威胁而尤其有效，因为它捕获了对手的拮抗性。但是，这种表述尚未适应回归方案。本文旨在为回归方案提出一个悲观的二线优化程序，该程序对对手解决方案的凸度或唯一性没有任何假设。
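
For readers unfamiliar with the term, the generic textbook form of a pessimistic bilevel program can be written as below. This is only the standard template, not the specific regression model proposed in the paper; the symbols $w$ (learner's parameters), $x$ (adversary's data), $F$ (learner's loss), and $f$ (adversary's objective) are illustrative assumptions.

```latex
% Generic pessimistic bilevel program (illustrative notation, not the paper's).
% The learner minimises its loss under the worst case over ALL optimal
% responses of the adversary, so S(w) need not be a singleton or convex.
\begin{align}
  \min_{w} \; \max_{x \in S(w)} \; F(w, x),
  \qquad
  S(w) = \operatorname*{arg\,min}_{x'} \; f(w, x').
\end{align}
```

The inner maximum over $S(w)$ is what makes the formulation "pessimistic": when the adversary has several optimal strategies, the learner prepares for the one that hurts it most.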

Title: REFINE-CONTROL: A Semi-supervised Distillation Method For Conditional Image Generation

Authors: Yicheng Jiang, Jin Yuan, Hua Yuan, Yao Zhang, Yong Rui
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2509.22139
Pdf URL: https://arxiv.org/pdf/2509.22139
Copy Paste: [[2509.22139]] REFINE-CONTROL: A Semi-supervised Distillation Method For Conditional Image Generation(https://arxiv.org/abs/2509.22139)
Keywords: generation
Abstract: Conditional image generation models have achieved remarkable results by leveraging text-based control to generate customized images. However, the high resource demands of these models and the scarcity of well-annotated data have hindered their deployment on edge devices, leading to enormous costs and privacy concerns, especially when user data is sent to a third party. To overcome these challenges, we propose Refine-Control, a semi-supervised distillation framework. Specifically, we improve the performance of the student model by introducing a tri-level knowledge fusion loss to transfer different levels of knowledge. To enhance generalization and alleviate dataset scarcity, we introduce a semi-supervised distillation method utilizing both labeled and unlabeled data. Our experiments reveal that Refine-Control achieves significant reductions in computational cost and latency, while maintaining high-fidelity generation capabilities and controllability, as quantified by comparative metrics.
摘要：有条件的图像生成模型通过利用基于文本的控制来生成自定义图像，从而实现了显着的结果。但是，这些模型的高度资源需求以及通知数据的稀缺性阻碍了他们在边缘设备上的部署，从而导致了巨大的成本和隐私问题，尤其是当将用户数据发送给第三方时。为了克服这些挑战，我们提出了精炼控制，这是一个半监督的蒸馏框架。具体而言，我们通过引入三级知识融合损失来转移不同水平的知识来提高学生模型的表现。为了增强概括和减轻数据集稀缺性，我们使用标记和未标记的数据引入了一种半监督的蒸馏方法。我们的实验表明，精炼控制可实现计算成本和延迟的显着降低，同时维持高保真的产生能力和可控性，这是通过比较指标量化的。

Title: MultiMat: Multimodal Program Synthesis for Procedural Materials using Large Multimodal Models

Authors: Jonas Belouadi, Tamy Boubekeur, Adrien Kaiser
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.22151
Pdf URL: https://arxiv.org/pdf/2509.22151
Copy Paste: [[2509.22151]] MultiMat: Multimodal Program Synthesis for Procedural Materials using Large Multimodal Models(https://arxiv.org/abs/2509.22151)
Keywords: generation
Abstract: Material node graphs are programs that generate the 2D channels of procedural materials, including geometry such as roughness and displacement maps, and reflectance such as albedo and conductivity maps. They are essential in computer graphics for representing the appearance of virtual 3D objects parametrically and at arbitrary resolution. In particular, their directed acyclic graph structures and intermediate states provide an intuitive understanding and workflow for interactive appearance modeling. Creating such graphs is a challenging task and typically requires professional training. While recent neural program synthesis approaches attempt to simplify this process, they solely represent graphs as textual programs, failing to capture the inherently visual-spatial nature of node graphs that makes them accessible to humans. To address this gap, we present MultiMat, a multimodal program synthesis framework that leverages large multimodal models to process both visual and textual graph representations for improved generation of procedural material graphs. We train our models on a new dataset of production-quality procedural materials and combine them with a constrained tree search inference algorithm that ensures syntactic validity while efficiently navigating the program space. Our experimental results show that our multimodal program synthesis method is more efficient in both unconditional and conditional graph synthesis with higher visual quality and fidelity than text-only baselines, establishing new state-of-the-art performance.
摘要：材料节点图是生成程序材料的2D通道的程序，包括几何形状，例如粗糙度和位移图，以及反射率，例如反照率和电导率图。它们在计算机图形中至关重要，以代表参数和任意分辨率的虚拟3D对象的外观。特别是，它们的定向无环形结构和中间状态为交互式外观建模提供了直观的理解和工作流程。创建这样的图表是一项具有挑战性的任务，通常需要专业培训。虽然最近的神经程序合成方法试图简化此过程，但它们仅表示图形为文本程序，未能捕获节点图的固有视觉空间性质，这使得它们可以被人类访问。为了解决这一差距，我们提出了多模式程序合成框架多模型，该框架利用大型多模式模型处理视觉和文本图表以改进过程材料图的生成。我们在新的生产质量程序材料数据集上训练我们的模型，并将其与受约束的树搜索推理算法结合使用，该算法可确保句法有效性，同时有效地导航程序空间。我们的实验结果表明，我们的多模式程序合成方法在无条件和条件图合成中的视觉质量和忠诚度都比文本基线更高，从而建立了新的最新性能。

Title: Lightweight error mitigation strategies for post-training N:M activation sparsity in LLMs

Authors: Shirin Alanova, Kristina Kazistova, Ekaterina Galaeva, Alina Kostromina, Vladimir Smirnov, Redko Dmitry, Alexey Dontsov, Maxim Zhelnin, Evgeny Burnaev, Egor Shvetsov
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2509.22166
Pdf URL: https://arxiv.org/pdf/2509.22166
Copy Paste: [[2509.22166]] Lightweight error mitigation strategies for post-training N:M activation sparsity in LLMs(https://arxiv.org/abs/2509.22166)
Keywords: generative
Abstract: The demand for efficient large language model (LLM) inference has intensified the focus on sparsification techniques. While semi-structured (N:M) pruning is well-established for weights, its application to activation pruning remains underexplored despite its potential for dynamic, input-adaptive compression and reductions in I/O overhead. This work presents a comprehensive analysis of methods for post-training N:M activation pruning in LLMs. Across multiple LLMs, we demonstrate that pruning activations enables superior preservation of generative capabilities compared to weight pruning at equivalent sparsity levels. We evaluate lightweight, plug-and-play error mitigation techniques and pruning criteria, establishing strong hardware-friendly baselines that require minimal calibration. Furthermore, we explore sparsity patterns beyond NVIDIA's standard 2:4, showing that the 16:32 pattern achieves performance nearly on par with unstructured sparsity. However, considering the trade-off between flexibility and hardware implementation complexity, we focus on the 8:16 pattern as a superior candidate. Our findings provide both effective practical methods for activation pruning and a motivation for future hardware to support more flexible sparsity patterns. Our code is available at this https URL.
摘要：对有效的大语言模型（LLM）推论的需求加剧了对稀疏技术的关注。尽管半结构化（n：M）修剪的重量已建立了良好的重量，但尽管其在动态，输入适应性压缩的潜力和I/O高架上的降低可能会导致其在激活修剪中的应用。这项工作对LLMS中训练后N：M激活修剪的方法进行了全面分析。在多个LLMS中，我们证明，与在同等的稀疏度下修剪相比，修剪激活能够优先保存生成能力。我们评估轻量级，插入式错误缓解技术和修剪标准，建立强大的硬件友好基线，需要最小的校准。此外，我们探索了Nvidia标准2：4以外的稀疏模式，表明16:32模式的性能几乎与非结构化的稀疏性相当。但是，考虑到灵活性和硬件实施复杂性之间的权衡，我们将重点放在8:16的模式上，作为卓越的候选人。我们的发现既提供了激活修剪的有效实用方法，也提供了将来硬件的动机，以支持更灵活的稀疏模式。我们的代码可用此HTTPS URL。
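
To make the N:M pattern concrete, here is a minimal sketch of semi-structured pruning on a flat activation vector: in every group of M consecutive values, only N are kept. The function name `nm_prune` and the plain magnitude criterion are assumptions for illustration; the paper evaluates several pruning criteria and error mitigation techniques that are not reproduced here.

```python
from typing import List


def nm_prune(activations: List[float], n: int, m: int) -> List[float]:
    """Keep the n largest-magnitude values in each group of m consecutive
    entries and zero the rest (magnitude criterion is an assumption)."""
    if not 0 < n <= m:
        raise ValueError("require 0 < n <= m")
    if len(activations) % m != 0:
        raise ValueError("length must be a multiple of m")

    pruned: List[float] = []
    for start in range(0, len(activations), m):
        group = activations[start:start + m]
        # Indices of the n entries with the largest absolute value.
        ranked = sorted(range(m), key=lambda i: abs(group[i]), reverse=True)
        keep = set(ranked[:n])
        pruned.extend(x if i in keep else 0.0 for i, x in enumerate(group))
    return pruned


if __name__ == "__main__":
    x = [0.1, -2.0, 0.3, 1.5, -0.2, 0.05, 0.9, -0.4]
    print(nm_prune(x, 2, 4))  # 2:4 pattern: two survivors per group of four
```

Larger groups such as 8:16 keep the same 50% density but give the selection more freedom inside each group, which is the flexibility trade-off the abstract refers to.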

Title: UrbanFeel: A Comprehensive Benchmark for Temporal and Perceptual Understanding of City Scenes through Human Perspective

Authors: Jun He, Yi Lin, Zilong Huang, Jiacong Yin, Junyan Ye, Yuchuan Zhou, Weijia Li, Xiang Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.22228
Pdf URL: https://arxiv.org/pdf/2509.22228
Copy Paste: [[2509.22228]] UrbanFeel: A Comprehensive Benchmark for Temporal and Perceptual Understanding of City Scenes through Human Perspective(https://arxiv.org/abs/2509.22228)
Keywords: generation
Abstract: Urban development impacts over half of the global population, making human-centered understanding of its structural and perceptual changes essential for sustainable development. While Multimodal Large Language Models (MLLMs) have shown remarkable capabilities across various domains, existing benchmarks that explore their performance in urban environments remain limited, lacking systematic exploration of temporal evolution and of subjective perception of urban environments that aligns with human perception. To address these limitations, we propose UrbanFeel, a comprehensive benchmark designed to evaluate the performance of MLLMs in urban development understanding and subjective environmental perception. UrbanFeel comprises 14.3K carefully constructed visual questions spanning three cognitively progressive dimensions: Static Scene Perception, Temporal Change Understanding, and Subjective Environmental Perception. We collect multi-temporal single-view and panoramic street-view images from 11 representative cities worldwide, and generate high-quality question-answer pairs through a hybrid pipeline of spatial clustering, rule-based generation, model-assisted prompting, and manual annotation. Through extensive evaluation of 20 state-of-the-art MLLMs, we observe that Gemini-2.5 Pro achieves the best overall performance, with its accuracy approaching human expert levels and narrowing the average gap to just 1.5%. Most models perform well on tasks grounded in scene understanding. In particular, some models even surpass human annotators in pixel-level change detection. However, performance drops notably in tasks requiring temporal reasoning over urban development. Additionally, in the subjective perception dimension, several models reach human-level or even higher consistency in evaluating dimensions such as beauty and safety.
摘要：城市发展影响了全球人口的一半以上，使以人为中心的理解其对可持续发展至关重要的结构和感知变化。尽管多模式的大语言模型（MLLM）在各个领域都表现出了显着的功能，但探索其在城市环境中绩效的现有基准仍然有限，缺乏对时间进化的系统性探索和对城市环境的主观感知，而城市环境与人类的感知相吻合。为了解决这些限制，我们提出了Urbanfeel，这是一个综合基准，旨在评估MLLM在城市发展理解和主观环境感知中的表现。 Urbanfeel包括14.3K精心构建的视觉问题，涵盖了三个认知渐进的维度：静态场景感知，时间变化理解和主观环境感知。我们从全球11个代表性城市收集多个阶段的单视图和全景街景图像，并通过空间聚类，基于规则的一代，模型辅助提示和手动注释来产生高质量的问题解答。通过对20个最先进的MLLM的广泛评估，我们观察到Gemini-2.5 Pro可以实现最佳的整体性能，其准确性接近人类专家水平，并将平均差距缩小到仅为1.5 \％。大多数模型在基于场景理解的任务上表现良好。特别是，某些模型甚至超过了像素级变化检测中的人类注释。但是，绩效在需要城市发展的时间推理的任务中显着下降。此外，在主观的感知维度中，几种模型在评估尺寸（例如美丽和安全）方面达到了人类水平甚至更高的一致性。

Title: Beyond Classification Accuracy: Neural-MedBench and the Need for Deeper Reasoning Benchmarks

Authors: Miao Jing, Mengting Jia, Junling Lin, Zhongxia Shen, Lijun Wang, Yuanyuan Peng, Huan Gao, Mingkun Xu, Shangyang Li
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2509.22258
Pdf URL: https://arxiv.org/pdf/2509.22258
Copy Paste: [[2509.22258]] Beyond Classification Accuracy: Neural-MedBench and the Need for Deeper Reasoning Benchmarks(https://arxiv.org/abs/2509.22258)
Keywords: generation
Abstract: Recent advances in vision-language models (VLMs) have achieved remarkable performance on standard medical benchmarks, yet their true clinical reasoning ability remains unclear. Existing datasets predominantly emphasize classification accuracy, creating an evaluation illusion in which models appear proficient while still failing at high-stakes diagnostic reasoning. We introduce Neural-MedBench, a compact yet reasoning-intensive benchmark specifically designed to probe the limits of multimodal clinical reasoning in neurology. Neural-MedBench integrates multi-sequence MRI scans, structured electronic health records, and clinical notes, and encompasses three core task families: differential diagnosis, lesion recognition, and rationale generation. To ensure reliable evaluation, we develop a hybrid scoring pipeline that combines LLM-based graders, clinician validation, and semantic similarity metrics. Through systematic evaluation of state-of-the-art VLMs, including GPT-4o, Claude-4, and MedGemma, we observe a sharp performance drop compared to conventional datasets. Error analysis shows that reasoning failures, rather than perceptual errors, dominate model shortcomings. Our findings highlight the necessity of a Two-Axis Evaluation Framework: breadth-oriented large datasets for statistical generalization, and depth-oriented, compact benchmarks such as Neural-MedBench for reasoning fidelity. We release Neural-MedBench at this https URL as an open and extensible diagnostic testbed, which guides the expansion of future benchmarks and enables rigorous yet cost-effective assessment of clinically trustworthy AI.
摘要：视觉模型（VLM）的最新进展在标准医疗基准上取得了显着的性能，但其真正的临床推理能力仍然不清楚。现有数据集主要强调分类精度，从而产生了一个评估幻象，其中模型显得熟练，同时仍未在高风险诊断推理中失败。我们介绍了神经 - 甲板，这是一种紧凑而推理密集型的基准，该基准专为探测神经学中多模式临床推理的极限。 Neural-Medbench整合了多序列的MRI扫描，结构化的电子健康记录和临床注释，并包括三个核心任务系列：鉴别诊断，病变识别和生成基本原理。为了确保可靠的评估，我们开发了一个混合评分管道，该管道结合了基于LLM的分级器，临床医生验证和语义相似性指标。通过对包括GPT-4O，Claude-4和Medgemma在内的最新VLM的系统评估，与常规数据集相比，我们观察到急剧的性能下降。错误分析表明，推理失败而不是感知错误会主导模型缺陷。我们的发现突出了两轴评估框架的必要性：面向广度的大型数据集用于统计概括，以及面向深度的，紧凑的基准，例如神经 - 甲板等用于推理的忠诚度。我们在此HTTPS URL上释放神经中间的培养基，作为一个开放且可扩展的诊断测试床，它指导了未来基准的扩展，并可以对临床上可信赖的AI进行严格但具有成本效益的评估。

Title: UniMapGen: A Generative Framework for Large-Scale Map Construction from Multi-modal Data

Authors: Yujian Yuan, Changjie Wu, Xinyuan Chang, Sijin Wang, Hang Zhang, Shiyi Liang, Shuang Zeng, Mu Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.22262
Pdf URL: https://arxiv.org/pdf/2509.22262
Copy Paste: [[2509.22262]] UniMapGen: A Generative Framework for Large-Scale Map Construction from Multi-modal Data(https://arxiv.org/abs/2509.22262)
Keywords: generative
Abstract: Large-scale map construction is foundational for critical applications such as autonomous driving and navigation systems. Traditional large-scale map construction approaches mainly rely on costly and inefficient special data collection vehicles and labor-intensive annotation processes. While existing satellite-based methods have demonstrated promising potential in enhancing the efficiency and coverage of map construction, they exhibit two major limitations: (1) inherent drawbacks of satellite data (e.g., occlusions, outdatedness) and (2) inefficient vectorization from perception-based methods, resulting in discontinuous and rough roads that require extensive post-processing. This paper presents a novel generative framework, UniMapGen, for large-scale map construction, offering three key innovations: (1) representing lane lines as a discrete sequence and establishing an iterative strategy to generate more complete and smoother map vectors than traditional perception-based methods; (2) proposing a flexible architecture that supports multi-modal inputs, enabling dynamic selection among BEV, PV, and text prompts, to overcome the drawbacks of satellite data; and (3) developing a state update strategy for global continuity and consistency of the constructed large-scale map. UniMapGen achieves state-of-the-art performance on the OpenSatMap dataset. Furthermore, UniMapGen can infer occluded roads and predict roads missing from dataset annotations. Our code will be released.
摘要：大规模地图构造是针对自动驾驶和导航系统等关键应用的基础。传统的大规模构造方法主要依赖于昂贵且效率低下的特殊数据收集工具和劳动密集型注释过程。尽管现有的基于卫星的方法在提高地图构造的效率和覆盖范围方面表现出了有希望的潜力，但它们表现出两个主要局限性：（1）卫星数据的固有缺点（例如，遮挡，过时）和（2）基于感知的方法的效率低下，从而在不良和崎rough的公路中产生了广泛的公路。本文提出了一个新颖的生成框架，即Unimapgen，用于大规模的地图结构，提供了三个关键的创新：（1）将车道线表示为\ textbf {离散序列}并建立迭代策略，以产生比传统感知方法更完整且平滑的地图矢量。（2）提出一个灵活的体系结构，该体系结构支持\ textbf {多模式}输入，从而在BEV，PV和文本提示中启用动态选择，以克服卫星数据的缺点。（3）开发\ textbf {state Update}构建的大规模映射的全局连续性和一致性策略。 Unimapgen在OpenSATMAP数据集中实现了最先进的性能。此外，Unimapgen可以推断遮挡的道路，并预测数据集注释中缺少的道路。我们的代码将发布。

Title: MesaTask: Towards Task-Driven Tabletop Scene Generation via 3D Spatial Reasoning

Authors: Jinkun Hao, Naifu Liang, Zhen Luo, Xudong Xu, Weipeng Zhong, Ran Yi, Yichen Jin, Zhaoyang Lyu, Feng Zheng, Lizhuang Ma, Jiangmiao Pang
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2509.22281
Pdf URL: https://arxiv.org/pdf/2509.22281
Copy Paste: [[2509.22281]] MesaTask: Towards Task-Driven Tabletop Scene Generation via 3D Spatial Reasoning(https://arxiv.org/abs/2509.22281)
Keywords: generation
Abstract: The ability of robots to interpret human instructions and execute manipulation tasks necessitates the availability of task-relevant tabletop scenes for training. However, traditional methods for creating these scenes rely on time-consuming manual layout design or purely randomized layouts, which are limited in terms of plausibility or alignment with the tasks. In this paper, we formulate a novel task, namely task-oriented tabletop scene generation, which poses significant challenges due to the substantial gap between high-level task instructions and the tabletop scenes. To support research on such a challenging task, we introduce MesaTask-10K, a large-scale dataset comprising approximately 10,700 synthetic tabletop scenes with manually crafted layouts that ensure realistic layouts and intricate inter-object relations. To bridge the gap between tasks and scenes, we propose a Spatial Reasoning Chain that decomposes the generation process into object inference, spatial interrelation reasoning, and scene graph construction for the final 3D layout. We present MesaTask, an LLM-based framework that utilizes this reasoning chain and is further enhanced with DPO algorithms to generate physically plausible tabletop scenes that align well with given task descriptions. Exhaustive experiments demonstrate the superior performance of MesaTask compared to baselines in generating task-conforming tabletop scenes with realistic layouts. Project page is at this https URL
摘要：机器人解释人类说明和执行操纵任务的能力需要提供与任务相关的桌面场景进行培训。但是，创建这些场景的传统方法取决于耗时的手动布局设计或纯粹的随机布局，这些设计在合理性或与任务的一致性方面受到限制。在本文中，我们制定了一项新颖的任务，即面向任务的桌面场景，由于高级任务说明与桌面场景之间的巨大差距，这构成了重大挑战。为了支持有关此类具有挑战性的任务的研究，我们介绍了Mesatask-10k，这是一个大规模数据集，其中包括大约10,700个合成桌面场景，并具有手动制作的布局，可确保现实的布局和复杂的对象间关系。为了弥合任务和场景之间的差距，我们提出了一个空间推理链，将生成过程分解为对象推理，空间相互关系推理以及最终3D布局的场景图构造。我们提出了Mesatask，这是一种基于LLM的框架，它利用了该推理链，并通过DPO算法进一步增强了与给定的任务描述很好地对齐的物理上合理的桌面场景。详尽的实验表明，与基线相比，在生成具有逼真的布局的任务构成桌面场景时，Mesatask的性能优越。项目页面在此HTTPS URL

Title: Jailbreaking on Text-to-Video Models via Scene Splitting Strategy

Authors: Wonjun Lee, Haon Park, Doehyeon Lee, Bumsub Ham, Suhyun Kim
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2509.22292
Pdf URL: https://arxiv.org/pdf/2509.22292
Copy Paste: [[2509.22292]] Jailbreaking on Text-to-Video Models via Scene Splitting Strategy(https://arxiv.org/abs/2509.22292)
Keywords: generative
Abstract: Along with the rapid advancement of numerous Text-to-Video (T2V) models, growing concerns have emerged regarding their safety risks. While recent studies have explored vulnerabilities in models like LLMs, VLMs, and Text-to-Image (T2I) models through jailbreak attacks, T2V models remain largely unexplored, leaving a significant safety gap. To address this gap, we introduce SceneSplit, a novel black-box jailbreak method that works by fragmenting a harmful narrative into multiple scenes, each individually benign. This approach manipulates the generative output space, the abstract set of all potential video outputs for a given prompt, using the combination of scenes as a powerful constraint to guide the final outcome. While each scene individually corresponds to a wide and safe space where most outcomes are benign, their sequential combination collectively restricts this space, narrowing it to an unsafe region and significantly increasing the likelihood of generating a harmful video. This core mechanism is further enhanced through iterative scene manipulation, which bypasses the safety filter within this constrained unsafe region. Additionally, a strategy library that reuses successful attack patterns further improves the attack's overall effectiveness and robustness. To validate our method, we evaluate SceneSplit across 11 safety categories on T2V models. Our results show that it achieves a high average Attack Success Rate (ASR) of 77.2% on Luma Ray2, 84.1% on Hailuo, and 78.2% on Veo2, significantly outperforming the existing baseline. Through this work, we demonstrate that current T2V safety mechanisms are vulnerable to attacks that exploit narrative structure, providing new insights for understanding and improving the safety of T2V models.
摘要：随着众多文本对电视（T2V）模型的快速发展，人们对其安全风险的越来越关注。尽管最近的研究通过越狱攻击探索了LLM，VLM和文本形象（T2I）模型等模型中的漏洞，但T2V模型仍未开发，却留下了很大的安全差距。为了解决这一差距，我们介绍了SpaceSplit，这是一种新颖的BlackBox越狱方法，它是通过将有害的叙述分成多个场景的，每个场景都单独良性的。这种方法操纵了生成输出空间，即给定提示的所有潜在视频输出的抽象集，使用场景的组合作为指导最终结果的强大约束。尽管每个场景都单独对应于大多数结果是良性的宽阔安全空间，但它们的顺序组合共同限制了这个空间，将其缩小到不安全的区域，并显着增加了产生有害视频的可能性。通过迭代场景操纵进一步增强了这种核心机制，该机制绕过了该约束不安全区域内的安全过滤器。此外，重复攻击模式的战略库进一步提高了攻击的总体效率和鲁棒性。为了验证我们的方法，我们评估T2V模型上11个安全类别的场景平台。我们的结果表明，它在Luma Ray2上获得了高平均攻击成功率（ASR）为77.2％，Hailuo的攻击成功率为84.1％，而VEO2的平均攻击成功率为84.1％，而VEO2的平均攻击率为78.2％，显着超过了现有的基线。通过这项工作，我们证明了当前的T2V安全机制容易受到利用叙事结构的攻击，从而为理解和改善T2V模型的安全提供了新的见解。

Title: Aurora: Towards Universal Generative Multimodal Time Series Forecasting

Authors: Xingjian Wu, Jianxin Jin, Wanghui Qiu, Peng Chen, Yang Shu, Bin Yang, Chenjuan Guo
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.22295
Pdf URL: https://arxiv.org/pdf/2509.22295
Copy Paste: [[2509.22295]] Aurora: Towards Universal Generative Multimodal Time Series Forecasting(https://arxiv.org/abs/2509.22295)
Keywords: generative
Abstract: Cross-domain generalization is very important in Time Series Forecasting because similar historical information may lead to distinct future trends due to domain-specific characteristics. Recent works focus on building unimodal time series foundation models and end-to-end multimodal supervised models. Since domain-specific knowledge is often contained in modalities like text, the former lack explicit utilization of it, thus hindering performance. The latter are tailored for end-to-end scenarios and do not support zero-shot inference for cross-domain scenarios. In this work, we introduce Aurora, a Multimodal Time Series Foundation Model, which supports multimodal inputs and zero-shot inference. Pretrained on a Cross-domain Multimodal Time Series Corpus, Aurora can adaptively extract and focus on key domain knowledge contained in the corresponding text or image modalities, thus possessing strong cross-domain generalization capability. Through tokenization, encoding, and distillation, Aurora can extract multimodal domain knowledge as guidance and then utilize a Modality-Guided Multi-head Self-Attention to inject it into the modeling of temporal representations. In the decoding phase, the multimodal representations are used to generate the conditions and prototypes of future tokens, contributing to a novel Prototype-Guided Flow Matching for generative probabilistic forecasting. Comprehensive experiments on well-recognized benchmarks, including TimeMMD, TSFM-Bench and ProbTS, demonstrate the consistent state-of-the-art performance of Aurora on both unimodal and multimodal scenarios.
摘要：跨域概括在时间序列预测中非常重要，因为相似的历史信息可能会导致由于特定领域特定的特征而导致不同的未来趋势。最近的作品着重于建立单峰时间序列基础模型和端到端的多模式监督模型。由于特定于领域的知识通常包含在文本之类的方式中，因此前者缺乏明确的利用，从而阻碍了性能。后者是针对端到端方案量身定制的，并且不支持跨域情景的零射击推断。在这项工作中，我们介绍了Aurora，这是一种多模式时间序列基础模型，该模型支持多模式输入和零射击推断。 Aurora在CORSS域的多模式时间序列语料库中进行了预测，可以自适应地提取并专注于Corrsponding文本或图像模态中包含的关键领域知识，从而具有强大的跨域通用能力。通过象征化，编码和蒸馏，Aurora可以提取多模式领域知识作为指导，然后利用模态引导的多头自我注意力将其注入时间表示的建模。在解码阶段，多模式表示形式用于生成未来令牌的条件和原型，从而有助于新型原型引导的流动匹配，以实现生成概率的预测。关于良好认可的基准，包括TIMEMMD，TSFM基础和Probts的全面实验，证明了Aurora在单峰和多模式场景上的最先进性能。

Title: HiGS: History-Guided Sampling for Plug-and-Play Enhancement of Diffusion Models

Authors: Seyedmorteza Sadat, Farnood Salehi, Romann M. Weber
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.22300
Pdf URL: https://arxiv.org/pdf/2509.22300
Copy Paste: [[2509.22300]] HiGS: History-Guided Sampling for Plug-and-Play Enhancement of Diffusion Models(https://arxiv.org/abs/2509.22300)
Keywords: generation
Abstract: While diffusion models have made remarkable progress in image generation, their outputs can still appear unrealistic and lack fine details, especially when using fewer neural function evaluations (NFEs) or lower guidance scales. To address this issue, we propose a novel momentum-based sampling technique, termed history-guided sampling (HiGS), which enhances the quality and efficiency of diffusion sampling by integrating recent model predictions into each inference step. Specifically, HiGS leverages the difference between the current prediction and a weighted average of past predictions to steer the sampling process toward more realistic outputs with better details and structure. Our approach introduces practically no additional computation and integrates seamlessly into existing diffusion frameworks, requiring neither extra training nor fine-tuning. Extensive experiments show that HiGS consistently improves image quality across diverse models and architectures and under varying sampling budgets and guidance scales. Moreover, using a pretrained SiT model, HiGS achieves a new state-of-the-art FID of 1.61 for unguided ImageNet generation at 256×256 with only 30 sampling steps (instead of the standard 250). We thus present HiGS as a plug-and-play enhancement to standard diffusion sampling that enables faster generation with higher fidelity.
摘要：尽管扩散模型在图像生成方面取得了显着进展，但它们的输出仍然可能看起来不切实际，并且缺乏细节，尤其是在使用较少数量的神经功能评估（NFE）或较低的指导量表时。为了解决这个问题，我们提出了一种新型基于动量的采样技术，称为历史引导采样（HIGS），该技术通过将最新模型预测整合到每个推理步骤中来提高扩散采样的质量和效率。具体而言，HIGS利用当前预测与过去预测的加权平均值之间的差异，以将采样过程转向具有更好的细节和结构的更现实的输出。我们的方法几乎没有介绍其他计算，并无缝集成到现有的扩散框架中，不需要额外的培训也不需要进行微调。广泛的实验表明，在不同的模型和体系结构以及不同的采样预算和指导量表下，HIG始终提高了各种模型和体系结构的图像质量。此外，使用验证的SIT型号，HIGS以256 $ \ times $ 256的价格实现了一个新的最先进的FID为1.61，只有30个采样步骤（而不是标准250）。因此，我们将HIG作为标准扩散抽样的插件增强，使得能够以更高的保真度产生生成。
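
The core idea stated in the abstract, steering with the difference between the current prediction and a weighted average of past predictions, can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the class name `HistoryGuidance`, the use of an exponential moving average as the "weighted average", and the `weight` and `decay` parameters are all hypothetical choices.

```python
from typing import List, Optional


class HistoryGuidance:
    """Toy history-guided correction for a sequence of model predictions.

    Assumptions (not from the paper): history is an exponential moving
    average, and the correction is weight * (prediction - history).
    """

    def __init__(self, weight: float = 0.5, decay: float = 0.9) -> None:
        self.weight = weight
        self.decay = decay
        self.history: Optional[List[float]] = None

    def step(self, prediction: List[float]) -> List[float]:
        if self.history is None:
            # No history yet: return the prediction unchanged.
            self.history = list(prediction)
            return list(prediction)
        # Push the prediction further along its direction of change.
        guided = [p + self.weight * (p - h)
                  for p, h in zip(prediction, self.history)]
        # Update the running average with the raw (unguided) prediction.
        self.history = [self.decay * h + (1.0 - self.decay) * p
                        for p, h in zip(prediction, self.history)]
        return guided


if __name__ == "__main__":
    g = HistoryGuidance(weight=0.5, decay=0.5)
    print(g.step([1.0, 2.0]))
    print(g.step([2.0, 2.0]))
```

Because the correction reuses predictions the sampler has already computed, a scheme of this shape adds only a few vector operations per step, which is consistent with the abstract's claim of practically no additional computation.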

Title: RAPID^3: Tri-Level Reinforced Acceleration Policies for Diffusion Transformer

Authors: Wangbo Zhao, Yizeng Han, Zhiwei Tang, Jiasheng Tang, Pengfei Zhou, Kai Wang, Bohan Zhuang, Zhangyang Wang, Fan Wang, Yang You
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.22323
Pdf URL: https://arxiv.org/pdf/2509.22323
Copy Paste: [[2509.22323]] RAPID^3: Tri-Level Reinforced Acceleration Policies for Diffusion Transformer(https://arxiv.org/abs/2509.22323)
Keywords: generation
Abstract: Diffusion Transformers (DiTs) excel at visual generation yet remain hampered by slow sampling. Existing training-free accelerators - step reduction, feature caching, and sparse attention - enhance inference speed but typically rely on a uniform heuristic or a manually designed adaptive strategy for all images, leaving quality on the table. Alternatively, dynamic neural networks offer per-image adaptive acceleration, but their high fine-tuning costs limit broader applicability. To address these limitations, we introduce RAPID3: Tri-Level Reinforced Acceleration Policies for Diffusion Transformers, a framework that delivers image-wise acceleration with zero updates to the base generator. Specifically, three lightweight policy heads - Step-Skip, Cache-Reuse, and Sparse-Attention - observe the current denoising state and independently decide their corresponding speed-up at each timestep. All policy parameters are trained online via Group Relative Policy Optimization (GRPO) while the generator remains frozen. Meanwhile, an adversarially learned discriminator augments the reward signal, discouraging reward hacking by boosting returns only when generated samples stay close to the original model's distribution. Across state-of-the-art DiT backbones, including Stable Diffusion 3 and FLUX, RAPID3 achieves nearly 3x faster sampling with competitive generation quality.
摘要：扩散变压器（DITS）在视觉生成时出色，但由于缓慢的采样而阻碍了。现有的无训练加速器 - 减少阶跃，功能缓存和稀疏注意力 - 提高推理速度，但通常依赖于均匀的启发式启发式或手动设计的自适应策略，用于所有图像，并在桌子上留下质量。另外，动态神经网络提供了每一图像的自适应加速度，但它们的高调整成本限制了更广泛的适用性。为了解决这些限制，我们介绍了Rapid3：扩散变压器的三级增强加速度策略，该框架可在图像加速度上提供零更新的基础发电机。具体而言，三个轻巧的政策负责人 - 踩踏，cache-Reuse和稀疏注意事项 - 观察当前的降级状态，并在每个时间段上独立决定其相应的速度。所有策略参数均通过小组相对策略优化（GRPO）在线培训，而发电机仍冻结。同时，一个对手学习的歧视者增加了奖励信号，仅当生成的样本保持与原始模型的分布接近时，通过提高回报来阻止奖励黑客攻击。在包括稳定的扩散3和Flux在内的最先进的DIT主机上，Rapid3以竞争性的发电质量实现了近3倍的采样。
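
The abstract says the policy heads are trained with Group Relative Policy Optimization (GRPO). The step that gives GRPO its name is normalising each sample's reward against the other samples in its group, which can be sketched as below. The function name, the `eps` stabiliser, and the use of the population standard deviation are assumptions for illustration; how RAPID3 combines this with its discriminator-based reward is not shown.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float],
                              eps: float = 1e-8) -> List[float]:
    """Normalise rewards within one group: (r - mean) / (std + eps).

    A group is a set of rollouts for the same input; no learned value
    function is needed because the group mean acts as the baseline.
    """
    if not rewards:
        raise ValueError("rewards must be non-empty")
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std (an assumption)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical rewards for three sampled acceleration policies.
    print(group_relative_advantages([1.0, 2.0, 3.0]))
```

Rollouts that beat their group's average receive a positive advantage and the rest a negative one, so the update only needs relative quality within a group.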

Title: CircuitSense: A Hierarchical Circuit System Benchmark Bridging Visual Comprehension and Symbolic Reasoning in Engineering Design Process

Authors: Arman Akbari, Jian Gao, Yifei Zou, Mei Yang, Jinru Duan, Dmitrii Torbunov, Yanzhi Wang, Yihui Ren, Xuan Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.22339
Pdf URL: https://arxiv.org/pdf/2509.22339
Copy Paste: [[2509.22339]] CircuitSense: A Hierarchical Circuit System Benchmark Bridging Visual Comprehension and Symbolic Reasoning in Engineering Design Process(https://arxiv.org/abs/2509.22339)
Keywords: generation
Abstract: Engineering design operates through hierarchical abstraction from system specifications to component implementations, requiring visual understanding coupled with mathematical reasoning at each level. While Multi-modal Large Language Models (MLLMs) excel at natural image tasks, their ability to extract mathematical models from technical diagrams remains unexplored. We present CircuitSense, a comprehensive benchmark evaluating circuit understanding across this hierarchy through 8,006+ problems spanning component-level schematics to system-level block diagrams. Our benchmark uniquely examines the complete engineering workflow: Perception, Analysis, and Design, with a particular emphasis on the critical but underexplored capability of deriving symbolic equations from visual inputs. We introduce a hierarchical synthetic generation pipeline consisting of a grid-based schematic generator and a block diagram generator with auto-derived symbolic equation labels. Comprehensive evaluation of six state-of-the-art MLLMs, including both closed-source and open-source models, reveals fundamental limitations in visual-to-mathematical reasoning. Closed-source models achieve over 85% accuracy on perception tasks involving component recognition and topology identification, yet their performance on symbolic derivation and analytical reasoning falls below 19%, exposing a critical gap between visual parsing and symbolic reasoning. Models with stronger symbolic reasoning capabilities consistently achieve higher design task accuracy, confirming the fundamental role of mathematical understanding in circuit synthesis and establishing symbolic reasoning as the key metric for engineering competence.

Title: SurvDiff: A Diffusion Model for Generating Synthetic Data in Survival Analysis

Authors: Marie Brockschmidt, Maresa Schröder, Stefan Feuerriegel
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2509.22352
Pdf URL: https://arxiv.org/pdf/2509.22352
Copy Paste: [[2509.22352]] SurvDiff: A Diffusion Model for Generating Synthetic Data in Survival Analysis(https://arxiv.org/abs/2509.22352)
Keywords: generation, generative
Abstract: Survival analysis is a cornerstone of clinical research, modeling time-to-event outcomes such as metastasis, disease relapse, or patient death. Unlike standard tabular data, survival data often come with incomplete event information due to dropout or loss to follow-up. This poses unique challenges for synthetic data generation, where it is crucial for clinical research to faithfully reproduce both the event-time distribution and the censoring mechanism. In this paper, we propose SurvDiff, an end-to-end diffusion model specifically designed for generating synthetic data in survival analysis. SurvDiff is tailored to capture the data-generating mechanism by jointly generating mixed-type covariates, event times, and right-censoring, guided by a survival-tailored loss function. The loss encodes the time-to-event structure and directly optimizes for downstream survival tasks, which ensures that SurvDiff (i) reproduces realistic event-time distributions and (ii) preserves the censoring mechanism. Across multiple medical datasets, we show that SurvDiff consistently outperforms state-of-the-art generative baselines in both distributional fidelity and downstream evaluation metrics. To the best of our knowledge, SurvDiff is the first diffusion model explicitly designed for generating synthetic survival data.
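
Editor's sketch: to make "right-censoring" concrete, the snippet below simulates the kind of records a survival dataset contains. Each subject has a latent event time and a latent censoring time; only the smaller of the two is observed, together with a flag saying which one it was. The exponential distributions, the rates, and the function name are assumptions chosen for illustration. This is not the SurvDiff model, only the data structure that any synthetic generator for survival data has to reproduce.

```python
import random

def simulate_right_censored(n, event_rate=0.1, censor_rate=0.05, seed=0):
    """Return n (observed_time, event_indicator) pairs with right-censoring."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        t_event = rng.expovariate(event_rate)    # latent time of the event
        t_censor = rng.expovariate(censor_rate)  # latent dropout / end of follow-up
        observed = min(t_event, t_censor)        # only the earlier time is recorded
        event = 1 if t_event <= t_censor else 0  # 1 = event seen, 0 = censored
        rows.append((observed, event))
    return rows

records = simulate_right_censored(1000)
censored_fraction = sum(1 for _, e in records if e == 0) / len(records)
```

A generator that matches the event-time distribution but ignores the second column would misstate how much follow-up information is actually available, which is why the abstract treats the censoring mechanism as a separate target.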

Title: Stochastic activations

Authors: Maria Lomeli, Matthijs Douze, Gergely Szilvasy, Loic Cabannes, Jade Copet, Sainbayar Sukhbaatar, Jason Weston, Gabriel Synnaeve, Pierre-Emmanuel Mazaré, Hervé Jégou
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2509.22358
Pdf URL: https://arxiv.org/pdf/2509.22358
Copy Paste: [[2509.22358]] Stochastic activations(https://arxiv.org/abs/2509.22358)
Keywords: generation
Abstract: We introduce stochastic activations. This novel strategy randomly selects between several non-linear functions in the feed-forward layer of a large language model. In particular, we choose between SILU and RELU depending on a Bernoulli draw. This strategy circumvents the optimization problem associated with RELU, namely, the constant shape for negative inputs that blocks gradient flow. We leverage this strategy in two ways: (1) We use stochastic activations during pre-training and fine-tune the model with RELU, which is used at inference time to provide sparse latent vectors. This reduces the inference FLOPs and translates into a significant speedup on CPU. Interestingly, this leads to much better results than training from scratch with the RELU activation function. (2) We evaluate stochastic activations for generation. This strategy performs reasonably well: it is only slightly inferior to the best deterministic non-linearity, namely SILU combined with temperature scaling. This offers an alternative to existing strategies by providing a controlled way to increase the diversity of the generated text.
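
Editor's sketch: the Bernoulli choice between SILU and RELU can be illustrated in a few lines. The abstract does not say at what granularity the draw is made (per layer call, per token, or per element), so the version below assumes one draw per call; `p_silu`, the `training` flag, and the function names are likewise assumptions. With `training=False` the function falls back to RELU, mirroring usage (1) in the abstract, where RELU is used at inference to obtain sparse activations.

```python
import math
import random

def relu(x):
    return max(0.0, x)

def silu(x):
    # x * sigmoid(x), written so that exp() never overflows for large |x|
    if x >= 0:
        return x / (1.0 + math.exp(-x))
    e = math.exp(x)
    return x * e / (1.0 + e)

def stochastic_activation(values, p_silu=0.5, rng=random, training=True):
    """Apply SILU with probability p_silu during training, otherwise RELU."""
    if training and rng.random() < p_silu:
        return [silu(v) for v in values]
    return [relu(v) for v in values]

hidden = [-2.0, -0.5, 0.0, 1.5]
train_out = stochastic_activation(hidden, p_silu=0.5, rng=random.Random(0))
infer_out = stochastic_activation(hidden, training=False)  # always RELU
```

Because SILU has a non-zero slope for negative inputs, the steps on which it is drawn still pass gradient through units that RELU would silence.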

Title: Text Adversarial Attacks with Dynamic Outputs

Authors: Wenqiang Wang, Siyuan Liang, Xiao Yan, Xiaochun Cao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.22393
Pdf URL: https://arxiv.org/pdf/2509.22393
Copy Paste: [[2509.22393]] Text Adversarial Attacks with Dynamic Outputs(https://arxiv.org/abs/2509.22393)
Keywords: generative
Abstract: Text adversarial attack methods are typically designed for static scenarios with fixed numbers of output labels and a predefined label space, relying on extensive querying of the victim model (query-based attacks) or the surrogate model (transfer-based attacks). To address this gap, we introduce the Textual Dynamic Outputs Attack (TDOA) method, which employs a clustering-based surrogate model training approach to convert the dynamic-output scenario into a static single-output scenario. To improve attack effectiveness, we propose the farthest-label targeted attack strategy, which selects adversarial vectors that deviate most from the model's coarse-grained labels, thereby maximizing disruption. We extensively evaluate TDOA on four datasets and eight victim models (e.g., ChatGPT-4o, ChatGPT-4.1), showing its effectiveness in crafting adversarial examples and its strong potential to compromise large language models with limited access. With a single query per text, TDOA achieves a maximum attack success rate of 50.81%. Additionally, we find that TDOA also achieves state-of-the-art performance in conventional static output scenarios, reaching a maximum ASR of 82.68%. Meanwhile, by conceptualizing translation tasks as classification problems with unbounded output spaces, we extend the TDOA framework to generative settings, surpassing prior results by up to 0.64 RDBLEU and 0.62 RDchrF.

Title: Closing the Safety Gap: Surgical Concept Erasure in Visual Autoregressive Models

Authors: Xinhao Zhong, Yimin Zhou, Zhiqi Zhang, Junhao Li, Yi Sun, Bin Chen, Shu-Tao Xia, Ke Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.22400
Pdf URL: https://arxiv.org/pdf/2509.22400
Copy Paste: [[2509.22400]] Closing the Safety Gap: Surgical Concept Erasure in Visual Autoregressive Models(https://arxiv.org/abs/2509.22400)
Keywords: generation
Abstract: The rapid progress of visual autoregressive (VAR) models has brought new opportunities for text-to-image generation, but also heightened safety concerns. Existing concept erasure techniques, primarily designed for diffusion models, fail to generalize to VARs due to their next-scale token prediction paradigm. In this paper, we first propose VARE, a novel VAR erasure framework that enables stable concept erasure in VAR models by leveraging auxiliary visual tokens to reduce fine-tuning intensity. Building upon this, we introduce S-VARE, a novel and effective concept erasure method designed for VAR, which incorporates a filtered cross-entropy loss to precisely identify and minimally adjust unsafe visual tokens, along with a preservation loss to maintain semantic fidelity, addressing issues such as language drift and reduced diversity introduced by naïve fine-tuning. Extensive experiments demonstrate that our approach achieves surgical concept erasure while preserving generation quality, thereby closing the safety gap in autoregressive text-to-image generation left by earlier methods.

Title: MoveFM-R: Advancing Mobility Foundation Models via Language-driven Semantic Reasoning

Authors: Fanjin Meng, Yuan Yuan, Jingtao Ding, Jie Feng, Chonghua Han, Yong Li
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.22403
Pdf URL: https://arxiv.org/pdf/2509.22403
Copy Paste: [[2509.22403]] MoveFM-R: Advancing Mobility Foundation Models via Language-driven Semantic Reasoning(https://arxiv.org/abs/2509.22403)
Keywords: generation
Abstract: Mobility Foundation Models (MFMs) have advanced the modeling of human movement patterns, yet they face a ceiling due to limitations in data scale and semantic understanding. While Large Language Models (LLMs) offer powerful semantic reasoning, they lack the innate understanding of spatio-temporal statistics required for generating physically plausible mobility trajectories. To address these gaps, we propose MoveFM-R, a novel framework that unlocks the full potential of mobility foundation models by leveraging language-driven semantic reasoning capabilities. It tackles two key challenges: the vocabulary mismatch between continuous geographic coordinates and discrete language tokens, and the representation gap between the latent vectors of MFMs and the semantic world of LLMs. MoveFM-R is built on three core innovations: a semantically enhanced location encoding to bridge the geography-language gap, a progressive curriculum to align the LLM's reasoning with mobility patterns, and an interactive self-reflection mechanism for conditional trajectory generation. Extensive experiments demonstrate that MoveFM-R significantly outperforms existing MFM-based and LLM-based baselines. It also shows robust generalization in zero-shot settings and excels at generating realistic trajectories from natural language instructions. By synthesizing the statistical power of MFMs with the deep semantic understanding of LLMs, MoveFM-R pioneers a new paradigm that enables a more comprehensive, interpretable, and powerful modeling of human mobility. The implementation of MoveFM-R is available online at this https URL.

Title: RAU: Reference-based Anatomical Understanding with Vision Language Models

Authors: Yiwei Li, Yikang Liu, Jiaqi Guo, Lin Zhao, Zheyuan Zhang, Xiao Chen, Boris Mailhe, Ankush Mukherjee, Terrence Chen, Shanhui Sun
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2509.22404
Pdf URL: https://arxiv.org/pdf/2509.22404
Copy Paste: [[2509.22404]] RAU: Reference-based Anatomical Understanding with Vision Language Models(https://arxiv.org/abs/2509.22404)
Keywords: generation
Abstract: Anatomical understanding through deep learning is critical for automatic report generation, intra-operative navigation, and organ localization in medical imaging; however, its progress is constrained by the scarcity of expert-labeled data. A promising remedy is to leverage an annotated reference image to guide the interpretation of an unlabeled target. Although recent vision-language models (VLMs) exhibit non-trivial visual reasoning, their reference-based understanding and fine-grained localization remain limited. We introduce RAU, a framework for reference-based anatomical understanding with VLMs. We first show that a VLM learns to identify anatomical regions through relative spatial reasoning between reference and target images, trained on a moderately sized dataset. We validate this capability through visual question answering (VQA) and bounding box prediction. Next, we demonstrate that the VLM-derived spatial cues can be seamlessly integrated with the fine-grained segmentation capability of SAM2, enabling localization and pixel-level segmentation of small anatomical regions, such as vessel segments. Across two in-distribution and two out-of-distribution datasets, RAU consistently outperforms a SAM2 fine-tuning baseline using the same memory setup, yielding more accurate segmentations and more reliable localization. More importantly, its strong generalization ability makes it scalable to out-of-distribution datasets, a property crucial for medical image applications. To the best of our knowledge, RAU is the first to explore the capability of VLMs for reference-based identification, localization, and segmentation of anatomical structures in medical images. Its promising performance highlights the potential of VLM-driven approaches for anatomical understanding in automated clinical workflows.

Title: Fast-Forward Lattice Boltzmann: Learning Kinetic Behaviour with Physics-Informed Neural Operators

Authors: Xiao Xue, Marco F.P. ten Eikelder, Mingyang Gao, Xiaoyuan Cheng, Yiming Yang, Yi He, Shuo Wang, Sibo Cheng, Yukun Hu, Peter V. Coveney
Subjects: cs.LG, nlin.CG, physics.comp-ph, physics.flu-dyn
Abstract URL: https://arxiv.org/abs/2509.22411
Pdf URL: https://arxiv.org/pdf/2509.22411
Copy Paste: [[2509.22411]] Fast-Forward Lattice Boltzmann: Learning Kinetic Behaviour with Physics-Informed Neural Operators(https://arxiv.org/abs/2509.22411)
Keywords: super-resolution
Abstract: The lattice Boltzmann equation (LBE), rooted in kinetic theory, provides a powerful framework for capturing complex flow behaviour by describing the evolution of single-particle distribution functions (PDFs). Despite its success, solving the LBE numerically remains computationally intensive due to strict time-step restrictions imposed by collision kernels. Here, we introduce a physics-informed neural operator framework for the LBE that enables prediction over large time horizons without step-by-step integration, effectively bypassing the need to explicitly solve the collision kernel. We incorporate intrinsic moment-matching constraints of the LBE, along with global equivariance of the full distribution field, enabling the model to capture the complex dynamics of the underlying kinetic system. Our framework is discretization-invariant, enabling models trained on coarse lattices to generalise to finer ones (kinetic super-resolution). In addition, it is agnostic to the specific form of the underlying collision model, which makes it naturally applicable across different kinetic datasets regardless of the governing dynamics. Our results demonstrate robustness across complex flow scenarios, including von Karman vortex shedding, ligament breakup, and bubble adhesion. This establishes a new data-driven pathway for modelling kinetic systems.
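
Editor's sketch: the "moment-matching constraints" mentioned in the abstract can be made concrete with the two lowest moments of a lattice Boltzmann distribution: density is the sum of the populations, and momentum is the velocity-weighted sum. The snippet uses the standard D2Q9 lattice and the usual second-order equilibrium as an assumption; the abstract names neither a lattice nor the exact set of moments the authors constrain. It checks that an equilibrium built from a given density and velocity returns exactly those values as its moments, which is the kind of consistency a learned operator would be asked to respect.

```python
# D2Q9 lattice: nine discrete velocities with their standard weights.
VELOCITIES = [(0, 0), (1, 0), (0, 1), (-1, 0), (0, -1),
              (1, 1), (-1, 1), (-1, -1), (1, -1)]
WEIGHTS = [4 / 9] + [1 / 9] * 4 + [1 / 36] * 4

def equilibrium(rho, ux, uy):
    """Second-order equilibrium populations for density rho and velocity (ux, uy)."""
    usq = ux * ux + uy * uy
    f = []
    for (cx, cy), w in zip(VELOCITIES, WEIGHTS):
        cu = cx * ux + cy * uy
        f.append(w * rho * (1.0 + 3.0 * cu + 4.5 * cu * cu - 1.5 * usq))
    return f

def moments(f):
    """Zeroth and first moments: density and the two momentum components."""
    rho = sum(f)
    mx = sum(fi * cx for fi, (cx, _) in zip(f, VELOCITIES))
    my = sum(fi * cy for fi, (_, cy) in zip(f, VELOCITIES))
    return rho, mx, my

rho, mx, my = moments(equilibrium(1.2, 0.05, -0.02))
```

A moment-matching penalty could then compare `moments(predicted_f)` with `moments(reference_f)` at each lattice node; that loss form is a guess at the general idea, not the paper's stated objective.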

Title: LucidFlux: Caption-Free Universal Image Restoration via a Large-Scale Diffusion Transformer

Authors: Song Fei, Tian Ye, Lujia Wang, Lei Zhu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.22414
Pdf URL: https://arxiv.org/pdf/2509.22414
Copy Paste: [[2509.22414]] LucidFlux: Caption-Free Universal Image Restoration via a Large-Scale Diffusion Transformer(https://arxiv.org/abs/2509.22414)
Keywords: restoration
Abstract: Universal image restoration (UIR) aims to recover images degraded by unknown mixtures while preserving semantics -- conditions under which discriminative restorers and UNet-based diffusion priors often oversmooth, hallucinate, or drift. We present LucidFlux, a caption-free UIR framework that adapts a large diffusion transformer (Flux.1) without image captions. LucidFlux introduces a lightweight dual-branch conditioner that injects signals from the degraded input and a lightly restored proxy to respectively anchor geometry and suppress artifacts. Then, a timestep- and layer-adaptive modulation schedule is designed to route these cues across the backbone's hierarchy, in order to yield coarse-to-fine and context-aware updates that protect the global structure while recovering texture. After that, to avoid the latency and instability of text prompts or MLLM captions, we enforce caption-free semantic alignment via SigLIP features extracted from the proxy. A scalable curation pipeline further filters large-scale data for structure-rich supervision. Across synthetic and in-the-wild benchmarks, LucidFlux consistently outperforms strong open-source and commercial baselines, and ablation studies verify the necessity of each component. LucidFlux shows that, for large DiTs, when, where, and what to condition on -- rather than adding parameters or relying on text prompts -- is the governing lever for robust and caption-free universal image restoration in the wild.

Title: Explaining multimodal LLMs via intra-modal token interactions

Authors: Jiawei Liang, Ruoyu Chen, Xianghao Jiao, Siyuan Liang, Shiming Liu, Qunli Zhang, Zheng Hu, Xiaochun Cao
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2509.22415
Pdf URL: https://arxiv.org/pdf/2509.22415
Copy Paste: [[2509.22415]] Explaining multimodal LLMs via intra-modal token interactions(https://arxiv.org/abs/2509.22415)
Keywords: generation
Abstract: Multimodal Large Language Models (MLLMs) have achieved remarkable success across diverse vision-language tasks, yet their internal decision-making mechanisms remain insufficiently understood. Existing interpretability research has primarily focused on cross-modal attribution, identifying which image regions the model attends to during output generation. However, these approaches often overlook intra-modal dependencies. In the visual modality, attributing importance to isolated image patches ignores spatial context due to limited receptive fields, resulting in fragmented and noisy explanations. In the textual modality, reliance on preceding tokens introduces spurious activations. Failing to effectively mitigate this interference compromises attribution fidelity. To address these limitations, we propose enhancing interpretability by leveraging intra-modal interaction. For the visual branch, we introduce Multi-Scale Explanation Aggregation (MSEA), which aggregates attributions over multi-scale inputs to dynamically adjust receptive fields, producing more holistic and spatially coherent visual explanations. For the textual branch, we propose Activation Ranking Correlation (ARC), which measures the relevance of contextual tokens to the current token via alignment of their top-$k$ prediction rankings. ARC leverages this relevance to suppress spurious activations from irrelevant contexts while preserving semantically coherent ones. Extensive experiments across state-of-the-art MLLMs and benchmark datasets demonstrate that our approach consistently outperforms existing interpretability methods, yielding more faithful and fine-grained explanations of model behavior.
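
Editor's sketch: the abstract describes ARC as comparing the top-$k$ prediction rankings of a context token and the current token, but gives no formula. The snippet below is a stand-in, not the paper's definition: it scores two prediction vectors by the Jaccard overlap of their top-$k$ vocabulary indices, so that a context token whose likely continuations resemble the current token's scores near 1 and an unrelated one scores near 0. The function names and the choice of Jaccard overlap are assumptions.

```python
def top_k_indices(scores, k):
    """Indices of the k largest scores, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]

def top_k_overlap(scores_a, scores_b, k):
    """Jaccard overlap of the two top-k index sets (k must be at least 1)."""
    a = set(top_k_indices(scores_a, k))
    b = set(top_k_indices(scores_b, k))
    return len(a & b) / len(a | b)

# Hypothetical next-token scores over a five-word vocabulary.
current = [0.05, 0.60, 0.25, 0.05, 0.05]
context = [0.10, 0.50, 0.30, 0.05, 0.05]
relevance = top_k_overlap(current, context, k=2)
```

A relevance score of this kind could be used to down-weight activations from low-scoring context tokens; a rank-correlation measure would serve the same role while also using the order within the top-$k$ set.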

Title: Overclocking Electrostatic Generative Models

Authors: Daniil Shlenskii, Alexander Korotin
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.22454
Pdf URL: https://arxiv.org/pdf/2509.22454
Copy Paste: [[2509.22454]] Overclocking Electrostatic Generative Models(https://arxiv.org/abs/2509.22454)
Keywords: generative
Abstract: Electrostatic generative models such as PFGM++ have recently emerged as a powerful framework, achieving state-of-the-art performance in image synthesis. PFGM++ operates in an extended data space with auxiliary dimensionality $D$, recovering the diffusion model framework as $D\to\infty$, while yielding superior empirical results for finite $D$. Like diffusion models, PFGM++ relies on expensive ODE simulations to generate samples, making it computationally costly. To address this, we propose Inverse Poisson Flow Matching (IPFM), a novel distillation framework that accelerates electrostatic generative models across all values of $D$. Our IPFM reformulates distillation as an inverse problem: learning a generator whose induced electrostatic field matches that of the teacher. We derive a tractable training objective for this problem and show that, as $D \to \infty$, our IPFM closely recovers Score Identity Distillation (SiD), a recent method for distilling diffusion models. Empirically, our IPFM produces distilled generators that achieve near-teacher or even superior sample quality using only a few function evaluations. Moreover, we observe that distillation converges faster for finite $D$ than in the $D \to \infty$ (diffusion) limit, which is consistent with prior findings that finite-$D$ PFGM++ models exhibit more favorable optimization and sampling properties.

Title: Nonlinear Optimization with GPU-Accelerated Neural Network Constraints

Authors: Robert Parker, Oscar Dowson, Nicole LoGiudice, Manuel Garcia, Russell Bent
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.22462
Pdf URL: https://arxiv.org/pdf/2509.22462
Copy Paste: [[2509.22462]] Nonlinear Optimization with GPU-Accelerated Neural Network Constraints(https://arxiv.org/abs/2509.22462)
Keywords: generation
Abstract: We propose a reduced-space formulation for optimizing over trained neural networks where the network's outputs and derivatives are evaluated on a GPU. To do this, we treat the neural network as a "gray box" where intermediate variables and constraints are not exposed to the optimization solver. Compared to the full-space formulation, in which intermediate variables and constraints are exposed to the optimization solver, the reduced-space formulation leads to faster solves and fewer iterations in an interior point method. We demonstrate the benefits of this method on two optimization problems: adversarial generation for a classifier trained on MNIST images, and security-constrained optimal power flow with transient feasibility enforced using a neural network surrogate.
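
Editor's sketch: the difference between the two formulations is what the solver gets to see. In the reduced-space view the network is a single function from input to output; in the full-space view every layer output is its own decision variable tied to the previous layer by an equality constraint. The toy code below uses a small ReLU network in plain Python to show both views. The layer layout, the function names, and the plain equality residuals are assumptions for illustration; the GPU evaluation and the derivative handling described in the abstract are not modeled.

```python
def affine(weights, bias, x):
    """Dense layer without activation: W x + b."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def relu_vec(v):
    return [max(0.0, a) for a in v]

def forward(layers, x):
    """Reduced-space view: the solver only sees the map x -> y."""
    z = x
    for i, (weights, bias) in enumerate(layers):
        z = affine(weights, bias, z)
        if i < len(layers) - 1:  # no activation on the output layer
            z = relu_vec(z)
    return z

def full_space_residuals(layers, x, zs):
    """Full-space view: zs[i] is a solver variable for layer i's output.

    Returns one residual per intermediate value; all are zero when the
    variables are consistent with the network.
    """
    residuals = []
    prev = x
    for i, (weights, bias) in enumerate(layers):
        out = affine(weights, bias, prev)
        if i < len(layers) - 1:
            out = relu_vec(out)
        residuals.extend(zi - oi for zi, oi in zip(zs[i], out))
        prev = zs[i]
    return residuals

# Toy network: 2 inputs -> 1 hidden unit (ReLU) -> 1 output.
layers = [([[1.0, -1.0]], [0.0]), ([[2.0]], [1.0])]
y = forward(layers, [3.0, 1.0])
res = full_space_residuals(layers, [3.0, 1.0], [[2.0], y])
```

In this toy case the full-space problem carries two extra variables and two extra constraints; for a realistic network the count grows with the total number of neurons, which is the overhead the reduced-space formulation avoids.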

Title: Learning the Neighborhood: Contrast-Free Multimodal Self-Supervised Molecular Graph Pretraining

Authors: Boshra Ariguib, Mathias Niepert, Andrei Manolache
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2509.22468
Pdf URL: https://arxiv.org/pdf/2509.22468
Copy Paste: [[2509.22468]] Learning the Neighborhood: Contrast-Free Multimodal Self-Supervised Molecular Graph Pretraining(https://arxiv.org/abs/2509.22468)
Keywords: generative
Abstract: High-quality molecular representations are essential for property prediction and molecular design, yet large labeled datasets remain scarce. While self-supervised pretraining on molecular graphs has shown promise, many existing approaches either depend on hand-crafted augmentations or complex generative objectives, and often rely solely on 2D topology, leaving valuable 3D structural information underutilized. To address this gap, we introduce C-FREE (Contrast-Free Representation learning on Ego-nets), a simple framework that integrates 2D graphs with ensembles of 3D conformers. C-FREE learns molecular representations by predicting subgraph embeddings from their complementary neighborhoods in the latent space, using fixed-radius ego-nets as modeling units across different conformers. This design allows us to integrate both geometric and topological information within a hybrid Graph Neural Network (GNN)-Transformer backbone, without negatives, positional encodings, or expensive pre-processing. Pretraining on the GEOM dataset, which provides rich 3D conformational diversity, C-FREE achieves state-of-the-art results on MoleculeNet, surpassing contrastive, generative, and other multimodal self-supervised methods. Fine-tuning across datasets with diverse sizes and molecule types further demonstrates that pretraining transfers effectively to new chemical domains, highlighting the importance of 3D-informed molecular representations.
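
Editor's sketch: a fixed-radius ego-net is the subgraph of all nodes within a given number of hops of a center node. The snippet extracts one with a breadth-first search over an adjacency dictionary. It assumes that "radius" means hop count on the 2D molecular graph and uses a hypothetical five-atom chain as input; how C-FREE pairs an ego-net with its complementary neighborhood and with 3D conformers is not shown.

```python
from collections import deque

def ego_net(adjacency, center, radius):
    """Return the set of nodes within `radius` hops of `center`."""
    dist = {center: 0}
    queue = deque([center])
    while queue:
        node = queue.popleft()
        if dist[node] == radius:
            continue  # do not expand beyond the radius
        for neighbor in adjacency.get(node, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return set(dist)

# Hypothetical chain of five atoms: 0-1-2-3-4.
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
inner = ego_net(chain, center=2, radius=1)
complement = set(chain) - inner
```

The `complement` set is the rest of the molecule, which is the natural context from which an ego-net's embedding could be predicted in a setup like the one the abstract describes.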

Title: Bézier Meets Diffusion: Robust Generation Across Domains for Medical Image Segmentation

Authors: Chen Li, Meilong Xu, Xiaoling Hu, Weimin Lyu, Chao Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.22476
Pdf URL: https://arxiv.org/pdf/2509.22476
Copy Paste: [[2509.22476]] Bézier Meets Diffusion: Robust Generation Across Domains for Medical Image Segmentation(https://arxiv.org/abs/2509.22476)
Keywords: generation
Abstract: Training robust learning algorithms across different medical imaging modalities is challenging due to the large domain gap. Unsupervised domain adaptation (UDA) mitigates this problem by using annotated images from the source domain and unlabeled images from the target domain to train the deep models. Existing approaches often rely on GAN-based style transfer, but these methods struggle to capture cross-domain mappings in regions with high variability. In this paper, we propose a unified framework, Bézier Meets Diffusion, for cross-domain image generation. First, we introduce a Bézier-curve-based style transfer strategy that effectively reduces the domain gap between source and target domains. The transferred source images enable the training of a more robust segmentation model across domains. Thereafter, using pseudo-labels generated by this segmentation model on the target domain, we train a conditional diffusion model (CDM) to synthesize high-quality, labeled target-domain images. To mitigate the impact of noisy pseudo-labels, we further develop an uncertainty-guided score matching method that improves the robustness of CDM training. Extensive experiments on public datasets demonstrate that our approach generates realistic labeled images, significantly augmenting the target domain and improving segmentation performance.
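
Editor's sketch: one common way to use a Bézier curve for style transfer in medical imaging is as a monotone intensity remapping: a cubic curve from (0, 0) to (1, 1) whose two inner control points bend the mapping between input and output intensity. The snippet below implements that remapping for normalized pixel values. Whether the paper uses exactly this construction is an assumption, as are the function names, the control points, and the sampling resolution.

```python
from bisect import bisect_left

def bezier_point(t, p0, p1, p2, p3):
    """Point on a cubic Bezier curve at parameter t in [0, 1]."""
    u = 1.0 - t
    b0, b1, b2, b3 = u ** 3, 3 * u * u * t, 3 * u * t * t, t ** 3
    x = b0 * p0[0] + b1 * p1[0] + b2 * p2[0] + b3 * p3[0]
    y = b0 * p0[1] + b1 * p1[1] + b2 * p2[1] + b3 * p3[1]
    return x, y

def bezier_intensity_map(pixels, p1, p2, samples=256):
    """Remap intensities in [0, 1] through a curve from (0, 0) to (1, 1).

    p1 and p2 are inner control points with coordinates in [0, 1].
    """
    curve = [bezier_point(i / (samples - 1), (0.0, 0.0), p1, p2, (1.0, 1.0))
             for i in range(samples)]
    xs = [c[0] for c in curve]
    ys = [c[1] for c in curve]

    def remap(v):
        v = min(max(v, 0.0), 1.0)
        j = bisect_left(xs, v)
        if j == 0:
            return ys[0]
        if j >= len(xs):
            return ys[-1]
        x0, x1 = xs[j - 1], xs[j]
        if x1 == x0:
            return ys[j]
        w = (v - x0) / (x1 - x0)  # linear interpolation between curve samples
        return ys[j - 1] + w * (ys[j] - ys[j - 1])

    return [remap(v) for v in pixels]

# Hypothetical control points that brighten mid-range intensities.
styled = bezier_intensity_map([0.0, 0.2, 0.5, 0.8, 1.0], (0.2, 0.6), (0.6, 0.9))
```

Keeping the endpoints fixed at (0, 0) and (1, 1) preserves the intensity range, while varying the inner control points gives a family of appearance changes that leave anatomy and labels untouched.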

Title: Group Critical-token Policy Optimization for Autoregressive Image Generation

Authors: Guohui Zhang, Hu Yu, Xiaoxiao Ma, JingHao Zhang, Yaning Pan, Mingde Yao, Jie Xiao, Linjiang Huang, Feng Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.22485
Pdf URL: https://arxiv.org/pdf/2509.22485
Copy Paste: [[2509.22485]] Group Critical-token Policy Optimization for Autoregressive Image Generation(https://arxiv.org/abs/2509.22485)
Keywords: generation
Abstract: Recent studies have extended Reinforcement Learning with Verifiable Rewards (RLVR) to autoregressive (AR) visual generation and achieved promising progress. However, existing methods typically apply uniform optimization across all image tokens, while the varying contributions of different image tokens for RLVR's training remain unexplored. In fact, the key obstacle lies in how to identify more critical image tokens during AR generation and implement effective token-wise optimization for them. To tackle this challenge, we propose $\textbf{G}$roup $\textbf{C}$ritical-token $\textbf{P}$olicy $\textbf{O}$ptimization ($\textbf{GCPO}$), which facilitates effective policy optimization on critical tokens. We identify the critical tokens in RLVR-based AR generation from three perspectives, specifically: $\textbf{(1)}$ Causal dependency: early tokens fundamentally determine the later tokens and final image effect due to unidirectional dependency; $\textbf{(2)}$ Entropy-induced spatial structure: tokens with high entropy gradients correspond to image structure and bridge distinct visual regions; $\textbf{(3)}$ RLVR-focused token diversity: tokens with low visual similarity across a group of sampled images contribute to richer token-level diversity. For these identified critical tokens, we further introduce a dynamic token-wise advantage weight to encourage exploration, based on confidence divergence between the policy model and reference model. By leveraging 30\% of the image tokens, GCPO achieves better performance than GRPO with full tokens. Extensive experiments on multiple text-to-image benchmarks for both AR models and unified multimodal models demonstrate the effectiveness of GCPO for AR visual generation.
摘要：最近的研究已将可验证奖励的强化学习（RLVR）扩展到自回归（AR）视觉生成，并取得了有希望的进展。但是，现有方法通常在所有图像令牌上应用统一的优化，而不同图像令牌对RLVR训练的不同贡献仍未得到探索。实际上，关键障碍在于如何在AR生成过程中识别更关键的图像令牌，并对它们实施有效的逐令牌优化。为了应对这一挑战，我们提出 $\textbf{G}$roup $\textbf{C}$ritical-token $\textbf{P}$olicy $\textbf{O}$ptimization（$\textbf{GCPO}$），它促进对关键令牌的有效策略优化。我们从三个角度识别基于RLVR的AR生成中的关键令牌，具体而言：$\textbf{(1)}$ 因果依赖性：由于单向依赖性，早期令牌从根本上决定后续令牌和最终图像效果；$\textbf{(2)}$ 熵诱导的空间结构：具有高熵梯度的令牌对应于图像结构，并连接不同的视觉区域；$\textbf{(3)}$ 以RLVR为中心的令牌多样性：在一组采样图像中视觉相似性低的令牌有助于更丰富的令牌级多样性。对于这些已识别的关键令牌，我们进一步引入基于策略模型与参考模型之间置信度差异的动态逐令牌优势权重，以鼓励探索。通过利用30\%的图像令牌，GCPO取得了比使用全部令牌的GRPO更好的性能。在AR模型和统一多模态模型的多个文本到图像基准上进行的大量实验证明了GCPO对AR视觉生成的有效性。
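
Illustrative sketch (not from the paper): the second criterion in the abstract ties critical tokens to high entropy gradients. A minimal way to picture this is to compute the entropy of the predicted distribution at each grid position, take a finite-difference spatial gradient, and keep the top fraction of positions. The central-difference gradient, the function names, and the `keep_frac` parameter are assumptions; the paper's actual selection rule may differ.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def entropy_gradient_mask(prob_grid, keep_frac=0.3):
    """Return the (row, col) positions with the largest entropy gradient.

    prob_grid[r][c] is the predicted token distribution at that position.
    """
    h = [[entropy(p) for p in row] for row in prob_grid]
    rows, cols = len(h), len(h[0])
    scores = []
    for r in range(rows):
        for c in range(cols):
            # Finite differences, clamped at the grid border.
            dx = h[r][min(c + 1, cols - 1)] - h[r][max(c - 1, 0)]
            dy = h[min(r + 1, rows - 1)][c] - h[max(r - 1, 0)][c]
            scores.append((math.hypot(dx, dy), r, c))
    k = max(1, round(keep_frac * rows * cols))
    top = sorted(scores, reverse=True)[:k]
    return {(r, c) for _, r, c in top}


# Toy 1x4 grid: two confident tokens followed by two uncertain ones.
grid = [[[1.0, 0.0], [1.0, 0.0], [0.5, 0.5], [0.5, 0.5]]]
critical = entropy_gradient_mask(grid, keep_frac=0.5)
```

In the toy grid the selected positions are the two tokens on either side of the confident-to-uncertain boundary, which matches the intuition that such tokens sit on structural edges.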

Title: Where MLLMs Attend and What They Rely On: Explaining Autoregressive Token Generation

Authors: Ruoyu Chen, Xiaoqing Guo, Kangwei Liu, Siyuan Liang, Shiming Liu, Qunli Zhang, Hua Zhang, Xiaochun Cao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.22496
Pdf URL: https://arxiv.org/pdf/2509.22496
Copy Paste: [[2509.22496]] Where MLLMs Attend and What They Rely On: Explaining Autoregressive Token Generation(https://arxiv.org/abs/2509.22496)
Keywords: generation
Abstract: Multimodal large language models (MLLMs) have demonstrated remarkable capabilities in aligning visual inputs with natural language outputs. Yet, the extent to which generated tokens depend on visual modalities remains poorly understood, limiting interpretability and reliability. In this work, we present EAGLE, a lightweight black-box framework for explaining autoregressive token generation in MLLMs. EAGLE attributes any selected tokens to compact perceptual regions while quantifying the relative influence of language priors and perceptual evidence. The framework introduces an objective function that unifies sufficiency (insight score) and indispensability (necessity score), optimized via greedy search over sparsified image regions for faithful and efficient attribution. Beyond spatial attribution, EAGLE performs modality-aware analysis that disentangles what tokens rely on, providing fine-grained interpretability of model decisions. Extensive experiments across open-source MLLMs show that EAGLE consistently outperforms existing methods in faithfulness, localization, and hallucination diagnosis, while requiring substantially less GPU memory. These results highlight its effectiveness and practicality for advancing the interpretability of MLLMs. The code is available at this https URL.
摘要：多模式的大语言模型（MLLM）在将自然语言输出的视觉输入对齐方面表现出了显着的功能。然而，产生的代币取决于视觉方式的程度仍然很少了解，从而限制了可解释性和可靠性。在这项工作中，我们提出了Eagle，这是一个轻巧的黑盒框架，用于解释MLLM中的自回归令牌生成。 Eagle将任何选定的令牌归因于紧凑的感知区域，同时量化语言先验和感知证据的相对影响。该框架引入了一个目标函数，该目标函数统一了充分性（洞察力得分）和不可或缺的性能（必要分数），该目标函数是通过对稀疏图像区域的贪婪搜索进行了优化的。除空间归因之外，Eagle还执行模态感知的分析，该分析可以解散令牌所依赖的内容，从而提供了模型决策的细粒度解释性。跨开源MLLM的广泛实验表明，鹰在忠诚，本地化和幻觉诊断方面始终优于现有方法，同时需要少于GPU的内存。这些结果强调了其有效性和实用性，可提高MLLM的解释性。该代码可在此HTTPS URL上找到。
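
Illustrative sketch (not from the paper): the abstract describes a greedy search over image regions under an objective that combines an insight (sufficiency) score and a necessity (indispensability) score. The toy version below assumes a black-box `score_fn` that returns the model's confidence in the target token when only a given set of regions is visible. The additive form of the objective and all names are assumptions.

```python
def greedy_attribution(regions, score_fn, k):
    """Greedily pick k regions that maximize insight + necessity.

    score_fn(visible) is a hypothetical black-box call returning the
    confidence in the target token when only `visible` regions are shown.
    """
    full = frozenset(regions)

    def objective(subset):
        s = frozenset(subset)
        insight = score_fn(s)                              # keep only the subset
        necessity = score_fn(full) - score_fn(full - s)    # drop when removed
        return insight + necessity

    selected, remaining = [], list(regions)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda r: objective(selected + [r]))
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy scorer: each region contributes a fixed amount of evidence.
weights = {0: 0.10, 1: 0.60, 2: 0.05, 3: 0.25}
toy_score = lambda visible: sum(weights[r] for r in visible)
ranking = greedy_attribution(list(weights), toy_score, k=2)
```

Because only forward calls to `score_fn` are needed, the search treats the model as a black box, in line with the framework's description.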

Title: JointDiff: Bridging Continuous and Discrete in Multi-Agent Trajectory Generation

Authors: Guillem Capellera, Luis Ferraz, Antonio Rubio, Alexandre Alahi, Antonio Agudo
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2509.22522
Pdf URL: https://arxiv.org/pdf/2509.22522
Copy Paste: [[2509.22522]] JointDiff: Bridging Continuous and Discrete in Multi-Agent Trajectory Generation(https://arxiv.org/abs/2509.22522)
Keywords: generation, generative
Abstract: Generative models often treat continuous data and discrete events as separate processes, creating a gap in modeling complex systems where they interact synchronously. To bridge this gap, we introduce JointDiff, a novel diffusion framework designed to unify these two processes by simultaneously generating continuous spatio-temporal data and synchronous discrete events. We demonstrate its efficacy in the sports domain by simultaneously modeling multi-agent trajectories and key possession events. This joint modeling is validated with non-controllable generation and two novel controllable generation scenarios: weak-possessor-guidance, which offers flexible semantic control over game dynamics through a simple list of intended ball possessors, and text-guidance, which enables fine-grained, language-driven generation. To enable the conditioning with these guidance signals, we introduce CrossGuid, an effective conditioning operation for multi-agent domains. We also share a new unified sports benchmark enhanced with textual descriptions for soccer and football datasets. JointDiff achieves state-of-the-art performance, demonstrating that joint modeling is crucial for building realistic and controllable generative models for interactive systems.
摘要：生成模型通常将连续数据和离散事件视为单独的过程，从而在对复杂系统进行同步交互的复杂系统中造成差距。为了弥合这一差距，我们介绍了联合服，这是一个新颖的扩散框架，旨在通过同时生成连续的时空数据和同步离散事件来统一这两个过程。我们通过同时建模多代理轨迹和主要拥有事件来证明其在运动领域的功效。这种联合建模通过不可控制的一代和两个新颖的可控生成场景进行了验证：弱者指导，通过简单的预期球拥有者列表和文本引导，它可以灵活地对游戏动力学进行灵活的语义控制，从而实现精细的语言，语言驱动的一代。为了通过这些指导信号启用调节，我们引入了CrossGuid，这是多代理域的有效条件操作。我们还通过针对足球和足球数据集的文本描述来共享一个新的统一运动基准。联合船员实现了最先进的性能，表明联合建模对于建立交互式系统的现实且可控的生成模型至关重要。

Title: From Parameters to Behavior: Unsupervised Compression of the Policy Space

Authors: Davide Tenedini, Riccardo Zamboni, Mirco Mutti, Marcello Restelli
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2509.22566
Pdf URL: https://arxiv.org/pdf/2509.22566
Copy Paste: [[2509.22566]] From Parameters to Behavior: Unsupervised Compression of the Policy Space(https://arxiv.org/abs/2509.22566)
Keywords: generative
Abstract: Despite its recent successes, Deep Reinforcement Learning (DRL) is notoriously sample-inefficient. We argue that this inefficiency stems from the standard practice of optimizing policies directly in the high-dimensional and highly redundant parameter space $\Theta$. This challenge is greatly compounded in multi-task settings. In this work, we develop a novel, unsupervised approach that compresses the policy parameter space $\Theta$ into a low-dimensional latent space $\mathcal{Z}$. We train a generative model $g:\mathcal{Z}\to\Theta$ by optimizing a behavioral reconstruction loss, which ensures that the latent space is organized by functional similarity rather than proximity in parameterization. We conjecture that the inherent dimensionality of this manifold is a function of the environment's complexity, rather than the size of the policy network. We validate our approach in continuous control domains, showing that the parameterization of standard policy networks can be compressed up to five orders of magnitude while retaining most of its expressivity. As a byproduct, we show that the learned manifold enables task-specific adaptation via Policy Gradient operating in the latent space $\mathcal{Z}$.
摘要：尽管最近取得了成功，但深厚的加固学习（DRL）众所周知，样本感知了。我们认为，这种低效率源于直接在高维和高度冗余的参数空间$ \ theta $中优化政策的标准实践。在多任务设置中，这一挑战非常复杂。在这项工作中，我们开发了一种小说，无监督的方法，该方法将策略参数空间$ \ theta $压缩到低维的潜在空间$ \ mathcal {z} $中。我们通过优化行为重建损失来训练生成的模型$ g：\ Mathcal {z} \ to \ theta $，该损失可确保潜在空间是通过功能相似性而不是参数化的接近性来组织的。我们猜想，该流形的固有维度是环境复杂性的函数，而不是策略网络的大小。我们在连续控制域中验证我们的方法，表明标准策略网络的参数化最多可以压缩到五个数量级，同时保留其大多数表达性。作为副产品，我们表明，学到的歧管可以通过在潜在空间$ \ Mathcal {z} $中操作的策略梯度进行特定于任务的适应。

Title: Transport Based Mean Flows for Generative Modeling

Authors: Elaheh Akbari, Ping He, Ahmadreza Moradipari, Yikun Bai, Soheil Kolouri
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.22592
Pdf URL: https://arxiv.org/pdf/2509.22592
Copy Paste: [[2509.22592]] Transport Based Mean Flows for Generative Modeling(https://arxiv.org/abs/2509.22592)
Keywords: generation, generative
Abstract: Flow-matching generative models have emerged as a powerful paradigm for continuous data generation, achieving state-of-the-art results across domains such as images, 3D shapes, and point clouds. Despite their success, these models suffer from slow inference due to the requirement of numerous sequential sampling steps. Recent work has sought to accelerate inference by reducing the number of sampling steps. In particular, Mean Flows offer a one-step generation approach that delivers substantial speedups while retaining strong generative performance. Yet, in many continuous domains, Mean Flows fail to faithfully approximate the behavior of the original multi-step flow-matching process. In this work, we address this limitation by incorporating optimal transport-based sampling strategies into the Mean Flow framework, enabling one-step generators that better preserve the fidelity and diversity of the original multi-step flow process. Experiments on controlled low-dimensional settings and on high-dimensional tasks such as image generation, image-to-image translation, and point cloud generation demonstrate that our approach achieves superior inference accuracy in one-step generative modeling.
摘要：流量匹配生成模型已成为连续数据生成的强大范式，从图像，3D形状和点云等域上实现了最先进的结果。尽管取得了成功，但由于需要进行许多顺序采样步骤，这些模型的推断速度缓慢。最近的工作试图通过减少抽样步骤的数量来加速推断。特别是，平均流提供了一步生成的方法，可提供大量加速，同时保持强大的生成性能。但是，在许多连续域中，平均流量无法忠实地近似原始多步匹配过程的行为。在这项工作中，我们通过将基于最佳运输的采样策略纳入平均流框架中来解决这一限制，从而使一步生成器能够更好地保留原始多步流程的忠诚度和多样性。对受控的低维设置以及高维任务（例如图像生成，图像到图像翻译和点云生成）的实验表明，我们的方法在一步生成建模中实现了卓越的推理准确性。
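
Illustrative sketch (not from the paper): the abstract mentions optimal transport-based sampling but gives no details. A widely used variant in flow matching is minibatch OT coupling, where noise and data samples in a batch are re-paired to minimize total squared distance before training targets are built. The brute-force solver below is an assumption chosen for clarity and is only practical for tiny batches; an assignment solver such as `scipy.optimize.linear_sum_assignment` would normally be used.

```python
from itertools import permutations


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def ot_pairing(noise, data):
    """Re-pair noise and data samples to minimize total squared distance.

    Exhaustive search over permutations: exact, but factorial in batch size.
    """
    n = len(noise)
    best = min(
        permutations(range(n)),
        key=lambda perm: sum(sq_dist(noise[i], data[perm[i]]) for i in range(n)),
    )
    return [(noise[i], data[best[i]]) for i in range(n)]


noise = [(0.0, 0.0), (10.0, 10.0)]
data = [(9.0, 9.0), (1.0, 1.0)]
pairs = ot_pairing(noise, data)
```

Pairing nearby samples shortens and straightens the transport paths, which is the usual motivation for combining OT couplings with few-step or one-step generators.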

Title: LongLive: Real-time Interactive Long Video Generation

Authors: Shuai Yang, Wei Huang, Ruihang Chu, Yicheng Xiao, Yuyang Zhao, Xianbang Wang, Muyang Li, Enze Xie, Yingcong Chen, Yao Lu, Song Han, Yukang Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.22622
Pdf URL: https://arxiv.org/pdf/2509.22622
Copy Paste: [[2509.22622]] LongLive: Real-time Interactive Long Video Generation(https://arxiv.org/abs/2509.22622)
Keywords: generation
Abstract: We present LongLive, a frame-level autoregressive (AR) framework for real-time and interactive long video generation. Long video generation presents challenges in both efficiency and quality. Diffusion and Diffusion-Forcing models can produce high-quality videos but suffer from low efficiency due to bidirectional attention. Causal attention AR models support KV caching for faster inference, but often degrade in quality on long videos due to memory challenges during long-video training. In addition, beyond static prompt-based generation, interactive capabilities, such as streaming prompt inputs, are critical for dynamic content creation, enabling users to guide narratives in real time. This interactive requirement significantly increases complexity, especially in ensuring visual consistency and semantic coherence during prompt transitions. To address these challenges, LongLive adopts a causal, frame-level AR design that integrates a KV-recache mechanism that refreshes cached states with new prompts for smooth, adherent switches; streaming long tuning to enable long video training and to align training and inference (train-long-test-long); and short window attention paired with a frame-level attention sink, shortened to frame sink, preserving long-range consistency while enabling faster generation. With these key designs, LongLive fine-tunes a 1.3B-parameter short-clip model to minute-long generation in just 32 GPU-days. At inference, LongLive sustains 20.7 FPS on a single NVIDIA H100 and achieves strong performance on VBench in both short and long videos. LongLive supports up to 240-second videos on a single H100 GPU. LongLive further supports INT8-quantized inference with only marginal quality loss.
摘要：我们提出了Longlive，这是一个实时和互动式长期视频的框架级自动回归（AR）框架。长时间的视频生成提出了效率和质量的挑战。扩散和扩散模型可以产生高质量的视频，但由于双向关注而效率低下。因果关注AR模型支持KV缓存以进行更快的推理，但由于长期Video培训期间的记忆挑战，长期视频的质量经常降低。此外，除了基于静态及时的生成外，交互式功能（例如流及时输入）对于动态内容创建至关重要，使用户能够实时指导叙事。这种互动需求显着提高了复杂性，尤其是在确保在迅速过渡过程中的视觉一致性和语义连贯性方面。为了应对这些挑战，Longlive采用了因果关系级的AR设计，该设计集成了KV-Recache机制，该机构将缓存的状态刷新带有新提示，以提供平滑，坚固的开关；播放长时间的调整以实现长时间的视频培训，并结盟培训和推理（长时间测试）；窗户注意力与框架级别的关注下沉搭配使用，将其缩短为框架下沉，可以保留长距离的一致性，同时可以更快地产生。借助这些关键设计，Longlive微调在仅32个GPU周期内将1.3B参数的短卷型型模型到长达一分钟。在推断时，Longlive在单个NVIDIA H100上维持20.7 fps，在短视频和长视频中都在VBench上取得了强劲的表现。 Longlive在单个H100 GPU上最多支持240秒的视频。 Longlive进一步支持Int8定量推理，仅边缘质量损失。
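
Illustrative sketch (not from the paper): the combination of causal attention, a short attention window, and a frame sink can be pictured as a frame-level boolean mask. In the version below, a frame attends to itself, to the most recent frames inside the window, and to the first `num_sink` frames, which stay visible throughout. The parameter names and the choice of the earliest frames as the sink are assumptions.

```python
def frame_attention_mask(num_frames, window, num_sink=1):
    """mask[i][j] is True when frame i may attend to frame j."""
    mask = []
    for i in range(num_frames):
        row = []
        for j in range(num_frames):
            causal = j <= i            # never look at future frames
            in_window = i - j < window  # recent frames only
            is_sink = j < num_sink      # sink frames are always kept
            row.append(causal and (in_window or is_sink))
        mask.append(row)
    return mask


mask = frame_attention_mask(num_frames=6, window=2, num_sink=1)
last_row = mask[5]
```

For the last of six frames with a window of two, the visible frames are the sink frame 0 and frames 4 and 5, so the per-frame attention cost stays bounded as the video grows.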

Title: A Theoretical Analysis of Discrete Flow Matching Generative Models

Authors: Maojiang Su, Mingcheng Lu, Jerry Yao-Chieh Hu, Shang Wu, Zhao Song, Alex Reneau, Han Liu
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2509.22623
Pdf URL: https://arxiv.org/pdf/2509.22623
Copy Paste: [[2509.22623]] A Theoretical Analysis of Discrete Flow Matching Generative Models(https://arxiv.org/abs/2509.22623)
Keywords: generative
Abstract: We provide a theoretical analysis of end-to-end training for Discrete Flow Matching (DFM) generative models. DFM is a promising discrete generative modeling framework that learns the underlying generative dynamics by training a neural network to approximate the transformative velocity field. Our analysis establishes a clear chain of guarantees by decomposing the final distribution estimation error. We first prove that the total variation distance between the generated and target distributions is controlled by the risk of the learned velocity field. We then bound this risk by analyzing its two primary sources: (i) Approximation Error, where we quantify the capacity of the Transformer architecture to represent the true velocity, and (ii) Estimation Error, where we derive statistical convergence rates that bound the error from training on a finite dataset. By composing these results, we provide the first formal proof that the distribution generated by a trained DFM model converges to the true data distribution as the training set size increases.
摘要：我们为端到端训练离散流匹配（DFM）生成模型提供了理论分析。 DFM是一个有希望的离散生成建模框架，它通过训练神经网络近似变化速度场来学习潜在的生成动力学。我们的分析通过分解最终分配估计误差来建立清晰的保证链。我们首先证明生成的和目标分布之间的总变化距离受到学习速度场的风险控制。然后，我们通过分析其两个主要来源来束缚这种风险：（i）近似误差，其中我们量化了变压器体系结构代表真实速度的能力，以及（ii）估计误差，在其中我们得出统计收敛率，该统计收敛速率从有限数据集中训练中训练误差。通过撰写这些结果，我们提供了第一个正式证明，即受过训练的DFM模型生成的分布可证明随着训练集大小的增加而收敛到真实的数据分布。

Title: SPARK: Synergistic Policy And Reward Co-Evolving Framework

Authors: Ziyu Liu, Yuhang Zang, Shengyuan Ding, Yuhang Cao, Xiaoyi Dong, Haodong Duan, Dahua Lin, Jiaqi Wang
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2509.22624
Pdf URL: https://arxiv.org/pdf/2509.22624
Copy Paste: [[2509.22624]] SPARK: Synergistic Policy And Reward Co-Evolving Framework(https://arxiv.org/abs/2509.22624)
Keywords: generative
Abstract: Recent Large Language Models (LLMs) and Large Vision-Language Models (LVLMs) increasingly use Reinforcement Learning (RL) for post-pretraining, such as RL with Verifiable Rewards (RLVR) for objective tasks and RL from Human Feedback (RLHF) for subjective tasks. However, RLHF incurs high costs and potential reward-policy mismatch due to reliance on human preferences, while RLVR still wastes supervision by discarding rollouts and correctness signals after each update. To address these challenges, we introduce the Synergistic Policy And Reward Co-Evolving Framework (SPARK), an efficient, on-policy, and stable method that builds on RLVR. Instead of discarding rollouts and correctness data, SPARK recycles this valuable information to simultaneously train the model itself as a generative reward model. This auxiliary training uses a mix of objectives, such as pointwise reward score, pairwise comparison, and evaluation conditioned on further-reflection responses, to teach the model to evaluate and improve its own responses. Our process eliminates the need for a separate reward model and costly human preference data. SPARK creates a positive co-evolving feedback loop: improved reward accuracy yields better policy gradients, which in turn produce higher-quality rollouts that further refine the reward model. Our unified framework supports test-time scaling via self-reflection without external reward models and their associated costs. We show that SPARK achieves significant performance gains on multiple LLM and LVLM models and multiple reasoning, reward models, and general benchmarks. For example, SPARK-VL-7B achieves an average 9.7% gain on 7 reasoning benchmarks, 12.1% on 2 reward benchmarks, and 1.5% on 8 general benchmarks over the baselines, demonstrating robustness and broad generalization.
摘要：最近的大型语言模型（LLM）和大型视觉语言模型（LVLM）越来越多地使用强化学习（RL）进行预训练后的训练，例如针对客观任务的可验证奖励强化学习（RLVR）和针对主观任务的人类反馈强化学习（RLHF）。但是，由于依赖人类偏好，RLHF会产生高成本和潜在的奖励与策略不匹配，而RLVR仍会在每次更新后丢弃推出（rollout）和正确性信号，从而浪费监督。为了应对这些挑战，我们介绍了协同策略与奖励共同进化框架（SPARK），这是一种基于RLVR的高效、同策略（on-policy）且稳定的方法。SPARK不丢弃推出和正确性数据，而是回收这些有价值的信息，同时将模型本身训练为生成式奖励模型。这种辅助训练混合使用多种目标，例如逐点奖励评分、成对比较以及以进一步反思的回答为条件的评估，以教导模型评估和改进其自身的回答。我们的过程消除了对单独的奖励模型和昂贵的人类偏好数据的需求。SPARK创建了积极的共同进化反馈回路：奖励准确性的提高可产生更好的策略梯度，进而产生更高质量的推出，进一步完善奖励模型。我们的统一框架通过自我反思支持测试时扩展，而无需外部奖励模型及其相关成本。我们表明，SPARK在多个LLM和LVLM模型以及多个推理、奖励模型和通用基准上取得了显著的性能提升。例如，SPARK-VL-7B相对于基线在7个推理基准上平均提升9.7％，在2个奖励基准上提升12.1％，在8个通用基准上提升1.5％，展示了稳健性和广泛的泛化能力。
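
Illustrative sketch (not from the paper): the abstract says rollouts and correctness signals are recycled into reward-model training data with pointwise and pairwise objectives. The snippet below shows one plausible way to turn a group of verified rollouts into both kinds of examples. The record layout and function name are assumptions, and the reflection-conditioned objective is omitted.

```python
def recycle_rollouts(prompt, rollouts):
    """Turn verified rollouts into reward-training examples.

    rollouts is a list of (response, is_correct) tuples, where is_correct
    comes from the verifiable-reward check already run during RLVR.
    """
    # Pointwise: judge each response on its own.
    pointwise = [
        {"prompt": prompt, "response": resp, "label": ok}
        for resp, ok in rollouts
    ]
    # Pairwise: every correct response is preferred over every wrong one.
    correct = [resp for resp, ok in rollouts if ok]
    wrong = [resp for resp, ok in rollouts if not ok]
    pairwise = [
        {"prompt": prompt, "chosen": good, "rejected": bad}
        for good in correct
        for bad in wrong
    ]
    return pointwise, pairwise


rollouts = [("4", True), ("5", False), ("four", True)]
pointwise, pairwise = recycle_rollouts("What is 2 + 2?", rollouts)
```

Groups in which all rollouts agree yield no pairwise examples, so only the pointwise objective draws signal from them in this sketch.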

Title: Training-Free Synthetic Data Generation with Dual IP-Adapter Guidance

Authors: Luc Boudier, Loris Manganelli, Eleftherios Tsonis, Nicolas Dufour, Vicky Kalogeiton
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2509.22635
Pdf URL: https://arxiv.org/pdf/2509.22635
Copy Paste: [[2509.22635]] Training-Free Synthetic Data Generation with Dual IP-Adapter Guidance(https://arxiv.org/abs/2509.22635)
Keywords: generation, generative
Abstract: Few-shot image classification remains challenging due to the limited availability of labeled examples. Recent approaches have explored generating synthetic training data using text-to-image diffusion models, but often require extensive model fine-tuning or external information sources. We present a novel training-free approach, called DIPSY, that leverages IP-Adapter for image-to-image translation to generate highly discriminative synthetic images using only the available few-shot examples. DIPSY introduces three key innovations: (1) an extended classifier-free guidance scheme that enables independent control over positive and negative image conditioning; (2) a class similarity-based sampling strategy that identifies effective contrastive examples; and (3) a simple yet effective pipeline that requires no model fine-tuning or external captioning and filtering. Experiments across ten benchmark datasets demonstrate that our approach achieves state-of-the-art or comparable performance, while eliminating the need for generative model adaptation or reliance on external tools for caption generation and image filtering. Our results highlight the effectiveness of leveraging dual image prompting with positive-negative guidance for generating class-discriminative features, particularly for fine-grained classification tasks.
摘要：由于标记的示例的可用性有限，因此很少有射击图像分类仍然具有挑战性。最近的方法已经使用文本对图像扩散模型探索了生成合成训练数据，但通常需要广泛的模型微调或外部信息源。我们提出了一种称为Dipsy的新型无训练方法，该方法利用IP-Adapter用于图像到图像翻译，以仅使用可用的几示示例来生成高度歧视的合成图像。 DIPSY介绍了三个关键创新：（1）一种扩展的无分类指导方案，可独立控制正面和负面图像条件；（2）基于类相似性的采样策略，可以确定有效的对比示例；（3）一个简单而有效的管道，不需要模型进行微调或外部字幕和过滤。十个基准数据集的实验表明，我们的方法可以实现最先进的或可比的性能，同时消除了生成模型适应或依赖外部工具以生成字幕和图像过滤的需求。我们的结果突出了利用双图像促使双重图像的有效性，并具有积极的指导，以生成类歧视性特征，尤其是用于细粒度的分类任务。
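
Illustrative sketch (not from the paper): the abstract describes an extended classifier-free guidance scheme with independent control over positive and negative image conditioning, without giving the formula. One natural form, assumed here, pushes the noise prediction toward the positive condition and away from the negative one with separate weights `w_pos` and `w_neg`. The formula and names are assumptions and may not match DIPSY's actual scheme.

```python
def dual_guidance(eps_uncond, eps_pos, eps_neg, w_pos, w_neg):
    """Combine three noise predictions with separate guidance weights.

    Assumed form: eps_u + w_pos * (eps_pos - eps_u) - w_neg * (eps_neg - eps_u).
    Setting w_neg = 0 recovers standard classifier-free guidance.
    """
    return [
        u + w_pos * (p - u) - w_neg * (n - u)
        for u, p, n in zip(eps_uncond, eps_pos, eps_neg)
    ]


guided = dual_guidance([0.0], [1.0], [0.5], w_pos=2.0, w_neg=1.0)
plain_cfg = dual_guidance([0.0, 1.0], [1.0, 3.0], [9.0, 9.0], w_pos=2.0, w_neg=0.0)
```

Keeping the two weights separate lets the negative image (for example, one from a visually similar class) be emphasized or switched off without changing how strongly the positive image steers generation.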

Title: Scale-Wise VAR is Secretly Discrete Diffusion

Authors: Amandeep Kumar, Nithin Gopalakrishnan Nair, Vishal M. Patel
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2509.22636
Pdf URL: https://arxiv.org/pdf/2509.22636
Copy Paste: [[2509.22636]] Scale-Wise VAR is Secretly Discrete Diffusion(https://arxiv.org/abs/2509.22636)
Keywords: generation
Abstract: Autoregressive (AR) transformers have emerged as a powerful paradigm for visual generation, largely due to their scalability, computational efficiency and unified architecture with language and vision. Among them, next-scale prediction Visual Autoregressive Generation (VAR) has recently demonstrated remarkable performance, even surpassing diffusion-based models. In this work, we revisit VAR and uncover a theoretical insight: when equipped with a Markovian attention mask, VAR is mathematically equivalent to a discrete diffusion. We term this reinterpretation Scalable Visual Refinement with Discrete Diffusion (SRDD), establishing a principled bridge between AR transformers and diffusion models. Leveraging this new perspective, we show how one can directly import the advantages of diffusion, such as iterative refinement, into VAR and reduce architectural inefficiencies, yielding faster convergence, lower inference cost, and improved zero-shot reconstruction. Across multiple datasets, we show that the diffusion-based perspective of VAR leads to consistent gains in efficiency and generation.
摘要：自回归（AR）变压器已成为视觉生成的强大范式，这在很大程度上是由于其可扩展性，计算效率和具有语言和视觉的统一体系结构。其中，下一个比例预测视觉自回归产生（VAR）最近表现出了出色的性能，甚至超过了基于扩散的模型。在这项工作中，我们重新访问VAR并发现一个理论上的洞察力：当配备马尔可夫注意面罩时，VAR在数学上等同于离散扩散。我们将这种重新解释称为可扩展的视觉改进，并具有离散扩散（SRDD），建立了AR变形金刚和扩散模型之间的原则桥梁。利用这一新的观点，我们展示了如何直接导入扩散的优势，例如迭代精致，并将建筑效率低下减少到VAR中，从而产生更快的收敛性，较低的推理成本和改善的零发射重建。在多个数据集中，我们表明基于扩散的VAR视角可导致效率和发电的一致提高。
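
Illustrative sketch (not from the paper): the difference between a block-causal mask over scales and a Markovian one can be shown with a small boolean mask. In the sketch, tokens are grouped by scale; the block-causal variant lets a token see every scale up to its own, while the Markovian variant restricts it to its own scale and the immediately preceding one. Whether the paper's mask includes the token's own scale is an assumption here, as are the names.

```python
def scale_mask(scale_sizes, markovian):
    """mask[q][k] is True when query token q may attend to key token k.

    scale_sizes lists the number of tokens at each scale, coarse to fine.
    """
    scale_of = [s for s, n in enumerate(scale_sizes) for _ in range(n)]
    total = len(scale_of)
    mask = []
    for q in range(total):
        row = []
        for k in range(total):
            gap = scale_of[q] - scale_of[k]
            # Markovian: own scale and the previous one; otherwise all earlier scales.
            row.append(0 <= gap <= 1 if markovian else gap >= 0)
        mask.append(row)
    return mask


markov = scale_mask([1, 2, 3], markovian=True)
block_causal = scale_mask([1, 2, 3], markovian=False)
```

Under the Markovian mask each scale depends only on the one before it, which mirrors the step-to-step dependence of a diffusion chain and avoids attending over all earlier scales.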

Title: Learning Human-Perceived Fakeness in AI-Generated Videos via Multimodal LLMs

Authors: Xingyu Fu, Siyi Liu, Yinuo Xu, Pan Lu, Guangqiuse Hu, Tianbo Yang, Taran Anantasagar, Christopher Shen, Yikai Mao, Yuanzhe Liu, Keyush Shah, Chung Un Lee, Yejin Choi, James Zou, Dan Roth, Chris Callison-Burch
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2509.22646
Pdf URL: https://arxiv.org/pdf/2509.22646
Copy Paste: [[2509.22646]] Learning Human-Perceived Fakeness in AI-Generated Videos via Multimodal LLMs(https://arxiv.org/abs/2509.22646)
Keywords: generation
Abstract: Can humans identify AI-generated (fake) videos and provide grounded reasons? While video generation models have advanced rapidly, a critical dimension -- whether humans can detect deepfake traces within a generated video, i.e., spatiotemporal grounded visual artifacts that reveal a video as machine-generated -- has been largely overlooked. We introduce DeeptraceReward, the first fine-grained, spatially and temporally aware benchmark that annotates human-perceived fake traces for video generation reward. The dataset comprises 4.3K detailed annotations across 3.3K high-quality generated videos. Each annotation provides a natural-language explanation, pinpoints a bounding-box region containing the perceived trace, and marks precise onset and offset timestamps. We consolidate these annotations into 9 major categories of deepfake traces that lead humans to identify a video as AI-generated, and train multimodal language models (LMs) as reward models to mimic human judgments and localizations. On DeeptraceReward, our 7B reward model outperforms GPT-5 by 34.7% on average across fake clue identification, grounding, and explanation. Interestingly, we observe a consistent difficulty gradient: binary fake vs. real classification is substantially easier than fine-grained deepfake trace detection; within the latter, performance degrades from natural language explanations (easiest), to spatial grounding, to temporal labeling (hardest). By foregrounding human-perceived deepfake traces, DeeptraceReward provides a rigorous testbed and training signal for socially aware and trustworthy video generation.
摘要：人类可以识别AI生成的（假）视频并提供基础的原因吗？尽管视频生成模型已经迅速发展，但一个关键的维度 - 人类是否可以在生成的视频中检测到深泡沫的痕迹，即时空接地的视觉伪像，这些视觉伪像揭示了作为机器生成的视频的视频 - 在很大程度上被忽略了。我们介绍了DeepTracereward，这是第一个细粒度，空间和时间上意识到的基准，它注释了人类感知的假痕迹，以获得视频生成奖励。该数据集包含3.3k高质量生成的视频的4.3K详细注释。每个注释都提供了自然语言的解释，并指出一个包含感知痕迹的边界盒区域，并标记精确的发作和偏移时间戳。我们将这些注释巩固为9个主要类别的深层痕迹，这些痕迹使人类将视频识别为AI生成的，并训练多模型模型（LMS）作为模仿人类判断和本地化的奖励模型。在DeepTracereward上，我们的7B奖励模型在虚假的线索识别，接地和解释中平均比GPT-5的表现平均比34.7％。有趣的是，我们观察到一个一致的困难梯度：二进制假V.S.实际分类比细颗粒的深膜痕量检测要容易得多。在后者中，性能从自然语言解释（最简单）变为空间接地，暂时标记（最难）。通过预示着人类感知的深层痕迹，DeepTracereward为具有社会意识和值得信赖的视频生成提供了严格的测试床和训练信号。

Title: RefAM: Attention Magnets for Zero-Shot Referral Segmentation

Authors: Anna Kukleva, Enis Simsar, Alessio Tonioni, Muhammad Ferjad Naeem, Federico Tombari, Jan Eric Lenssen, Bernt Schiele
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.22650
Pdf URL: https://arxiv.org/pdf/2509.22650
Copy Paste: [[2509.22650]] RefAM: Attention Magnets for Zero-Shot Referral Segmentation(https://arxiv.org/abs/2509.22650)
Keywords: generative
Abstract: Most existing approaches to referring segmentation achieve strong performance only through fine-tuning or by composing multiple pre-trained models, often at the cost of additional training and architectural modifications. Meanwhile, large-scale generative diffusion models encode rich semantic information, making them attractive as general-purpose feature extractors. In this work, we introduce a new method that directly exploits features, namely attention scores, from diffusion transformers for downstream tasks, requiring neither architectural modifications nor additional training. To systematically evaluate these features, we extend benchmarks with vision-language grounding tasks spanning both images and videos. Our key insight is that stop words act as attention magnets: they accumulate surplus attention and can be filtered to reduce noise. Moreover, we identify global attention sinks (GAS) emerging in deeper layers and show that they can be safely suppressed or redirected onto auxiliary tokens, leading to sharper and more accurate grounding maps. We further propose an attention redistribution strategy, where appended stop words partition background activations into smaller clusters, yielding sharper and more localized heatmaps. Building on these findings, we develop RefAM, a simple training-free grounding framework that combines cross-attention maps, GAS handling, and redistribution. Across zero-shot referring image and video segmentation benchmarks, our approach consistently outperforms prior methods, establishing a new state of the art without fine-tuning or additional components.
摘要：大多数现有的参考细分方法仅通过微调或组成多个预训练的模型才能实现强大的性能，通常是以额外的培训和建筑修改为代价。同时，大规模生成扩散模型编码丰富的语义信息，使其作为通用特征提取器具有吸引力。在这项工作中，我们引入了一种新方法，该方法直接利用了从扩散变压器进行下游任务的功能，注意力分数，不需要建筑修改也不需要其他培训。为了系统地评估这些功能，我们通过跨越图像和视频的视觉接地任务扩展了基准测试。我们的关键见解是，停止单词充当注意力磁铁：它们会积累多余的注意力，并且可以过滤以减少噪声。此外，我们确定了更深层的全球关注点（气体），并表明它们可以安全地抑制或重定向到辅助令牌上，从而导致地图更加清晰，更准确。我们进一步提出了一种注意重新分布策略，其中附加了停止单词的背景激活为较小的群集，从而产生更清晰和更局部的热图。在这些发现的基础上，我们开发了Refam，这是一个简单的无培训接地框架，结合了跨注意地图，汽油处理和重新分配。在零摄像的参考图像和视频分割基准测试中，我们的方法始终优于先前的方法，建立了新的最新技术，而无需微调或其他组件。
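
Illustrative sketch (not from the paper): the stop-word filtering idea can be shown on toy cross-attention maps. Each text token has one attention map over image positions; maps belonging to stop words are dropped, and the rest are averaged into a grounding map. The stop-word list, the flat-list map format, and the averaging step are assumptions, and the sketch covers neither GAS handling nor the appended-stop-word redistribution.

```python
# Hypothetical, deliberately tiny stop-word list for the example.
STOP_WORDS = {"a", "an", "the", "of", "on", "in", "with", "and", "is"}


def grounding_map(tokens, attn_maps):
    """Average per-token attention maps after dropping stop-word tokens.

    attn_maps[i] is the attention of token i over flattened image positions.
    """
    kept = [m for t, m in zip(tokens, attn_maps) if t.lower() not in STOP_WORDS]
    if not kept:  # fall back to all tokens if everything was filtered
        kept = attn_maps
    n = len(kept)
    return [sum(vals) / n for vals in zip(*kept)]


tokens = ["the", "dog", "on", "grass"]
attn_maps = [
    [0.9, 0.9, 0.9],  # "the": diffuse, uninformative attention
    [0.1, 0.8, 0.1],  # "dog"
    [0.7, 0.7, 0.7],  # "on": diffuse, uninformative attention
    [0.1, 0.2, 0.7],  # "grass"
]
heatmap = grounding_map(tokens, attn_maps)
```

In the toy case the diffuse maps for "the" and "on" no longer wash out the result, so the averaged heatmap peaks where the content words attend.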