2025-10-28

Title: Agro-Consensus: Semantic Self-Consistency in Vision-Language Models for Crop Disease Management in Developing Countries

Authors: Mihir Gupta, Pratik Desai, Ross Greer
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.21757
Pdf URL: https://arxiv.org/pdf/2510.21757
Copy Paste: [[2510.21757]] Agro-Consensus: Semantic Self-Consistency in Vision-Language Models for Crop Disease Management in Developing Countries(https://arxiv.org/abs/2510.21757)
Keywords: generation
Abstract: Agricultural disease management in developing countries such as India, Kenya, and Nigeria faces significant challenges due to limited access to expert plant pathologists, unreliable internet connectivity, and cost constraints that hinder the deployment of large-scale AI systems. This work introduces a cost-effective self-consistency framework to improve vision-language model (VLM) reliability for agricultural image captioning. The proposed method employs semantic clustering, using a lightweight (80MB) pre-trained embedding model to group multiple candidate responses. It then selects the most coherent caption -- containing a diagnosis, symptoms, analysis, treatment, and prevention recommendations -- through a cosine similarity-based consensus. A practical human-in-the-loop (HITL) component is incorporated, wherein user confirmation of the crop type filters erroneous generations, ensuring higher-quality input for the consensus mechanism. Applied to the publicly available PlantVillage dataset using a fine-tuned 3B-parameter PaliGemma model, our framework demonstrates improvements over standard decoding methods. Evaluated on 800 crop disease images with up to 21 generations per image, our single-cluster consensus method achieves a peak accuracy of 83.1% with 10 candidate generations, compared to the 77.5% baseline accuracy of greedy decoding. The framework's effectiveness is further demonstrated when considering multiple clusters; accuracy rises to 94.0% when a correct response is found within any of the top four candidate clusters, outperforming the 88.5% achieved by a top-4 selection from the baseline.
摘要：印度、肯尼亚和尼日利亚等发展中国家的农业病害管理面临着重大挑战，因为植物病理学家专家的机会有限、互联网连接不可靠以及阻碍大规模人工智能系统部署的成本限制。这项工作引入了一种经济有效的自我一致性框架，以提高农业图像字幕的视觉语言模型（VLM）可靠性。所提出的方法采用语义聚类，使用轻量级 (80MB) 预训练嵌入模型对多个候选响应进行分组。然后，它通过基于余弦相似性的共识，选择最连贯的标题——包含诊断、症状、分析、治疗和预防建议。结合了实用的人机交互（HITL）组件，其中用户对作物类型的确认过滤了错误的生成，确保了共识机制的更高质量的输入。我们的框架使用经过微调的 3B 参数 PaliGemma 模型应用于公开的 PlantVillage 数据集，展示了相对于标准解码方法的改进。对 800 张农作物病害图像（每幅图像最多 21 代）进行评估，我们的单簇共识方法在 10 代候选中实现了 83.1% 的峰值准确度，而贪婪解码的基线准确度为 77.5%。当考虑多个集群时，该框架的有效性得到了进一步证明；当在前 4 个候选簇中的任何一个中找到正确响应时，准确率上升到 94.0%，优于从基线中选择前 4 个簇所达到的 88.5%。
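
The clustering-and-consensus step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the greedy clustering rule, the 0.8 threshold, the function names, and the toy 2-D vectors are all assumptions made for illustration, and a real pipeline would use caption embeddings from a sentence-embedding model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def cluster_by_similarity(embeddings, threshold=0.8):
    """Greedy clustering (an assumed rule): join the first cluster whose
    seed vector is at least `threshold` similar, else start a new cluster."""
    clusters = []  # each cluster is a list of candidate indices
    for i, emb in enumerate(embeddings):
        for members in clusters:
            if cosine(emb, embeddings[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    # Largest cluster first; ties keep their original order.
    return sorted(clusters, key=len, reverse=True)

def consensus_index(embeddings, threshold=0.8):
    """Index of the candidate with the highest mean similarity to the
    members of the largest cluster."""
    largest = cluster_by_similarity(embeddings, threshold)[0]

    def mean_similarity(i):
        return sum(cosine(embeddings[i], embeddings[j]) for j in largest) / len(largest)

    return max(largest, key=mean_similarity)

# Hypothetical 2-D stand-ins for caption embeddings.
toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.95, 0.05]]
clusters = cluster_by_similarity(toy)
best = consensus_index(toy)
```

In this toy run, candidates 0, 1, and 3 form the majority cluster and candidate 3 is selected because it lies between the other two in direction. Returning the first four clusters instead of one would correspond to the multi-cluster evaluation mentioned in the abstract.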

Title: Proportion and Perspective Control for Flow-Based Image Generation

Authors: Julien Boudier, Hugo Caselles-Dupré
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2510.21763
Pdf URL: https://arxiv.org/pdf/2510.21763
Copy Paste: [[2510.21763]] Proportion and Perspective Control for Flow-Based Image Generation(https://arxiv.org/abs/2510.21763)
Keywords: generation
Abstract: While modern text-to-image diffusion models generate high-fidelity images, they offer limited control over the spatial and geometric structure of the output. To address this, we introduce and evaluate two ControlNets specialized for artistic control: (1) a proportion ControlNet that uses bounding boxes to dictate the position and scale of objects, and (2) a perspective ControlNet that employs vanishing lines to control the 3D geometry of the scene. We support the training of these modules with data pipelines that leverage vision-language models for annotation and specialized algorithms for conditioning image synthesis. Our experiments demonstrate that both modules provide effective control but exhibit limitations with complex constraints. Both models are released on HuggingFace.
摘要：虽然现代文本到图像的扩散模型生成高保真图像，但它们对输出的空间和几何结构提供有限的控制。为了解决这个问题，我们引入并评估了两个专门用于艺术控制的 ControlNet：(1) 使用边界框来指示对象的位置和比例的比例 ControlNet，以及 (2) 使用消失线来控制场景的 3D 几何形状的透视 ControlNet。我们通过数据管道支持这些模块的训练，这些数据管道利用视觉语言模型进行注释和专门的算法进行调节图像合成。我们的实验表明，这两个模块都提供了有效的控制，但在复杂的约束下表现出局限性。两种模型均在 HuggingFace 上发布：此 https URL

Title: H2OFlow: Grounding Human-Object Affordances with 3D Generative Models and Dense Diffused Flows

Authors: Harry Zhang, Luca Carlone
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.21769
Pdf URL: https://arxiv.org/pdf/2510.21769
Copy Paste: [[2510.21769]] H2OFlow: Grounding Human-Object Affordances with 3D Generative Models and Dense Diffused Flows(https://arxiv.org/abs/2510.21769)
Keywords: generative
Abstract: Understanding how humans interact with the surrounding environment, and specifically reasoning about object interactions and affordances, is a critical challenge in computer vision, robotics, and AI. Current approaches often depend on labor-intensive, hand-labeled datasets capturing real-world or simulated human-object interaction (HOI) tasks, which are costly and time-consuming to produce. Furthermore, most existing methods for 3D affordance understanding are limited to contact-based analysis, neglecting other essential aspects of human-object interactions, such as orientation (e.g., humans might have a preferential orientation with respect to certain objects, such as a TV) and spatial occupancy (e.g., humans are more likely to occupy certain regions around an object, like the front of a microwave rather than its back). To address these limitations, we introduce H2OFlow, a novel framework that comprehensively learns 3D HOI affordances -- encompassing contact, orientation, and spatial occupancy -- using only synthetic data generated from 3D generative models. H2OFlow employs a dense 3D-flow-based representation, learned through a dense diffusion process operating on point clouds. This learned flow enables the discovery of rich 3D affordances without the need for human annotations. Through extensive quantitative and qualitative evaluations, we demonstrate that H2OFlow generalizes effectively to real-world objects and surpasses prior methods that rely on manual annotations or mesh-based representations in modeling 3D affordance.
摘要：了解人类如何与周围环境交互，特别是推理对象交互和可供性，是计算机视觉、机器人和人工智能领域的一项关键挑战。当前的方法通常依赖于劳动密集型、手工标记的数据集来捕获现实世界或模拟人与物体交互（HOI）任务，这些任务的生产成本高昂且耗时。此外，大多数现有的 3D 可供性理解方法仅限于基于接触的分析，忽略了人与物体交互的其他重要方面，例如方向（例如，人类可能对某些物体有优先方向，例如电视）和空间占用（例如，人类更有可能占据物体周围的某些区域，例如微波炉的前面而不是后面）。为了解决这些限制，我们引入了 \emph{H2OFlow}，这是一种新颖的框架，它仅使用 3D 生成模型生成的合成数据来全面学习 3D HOI 可供性（包括接触、方向和空间占用）。 H2OFlow 采用基于密集 3D 流的表示，通过在点云上运行的密集扩散过程来学习。这种学习流程能够发现丰富的 3D 可供性，而无需人工注释。通过广泛的定量和定性评估，我们证明 H2OFlow 可以有效地推广到现实世界的对象，并超越了之前在 3D 可供性建模中依赖手动注释或基于网格的表示的方法。

Title: OCR-Quality: A Human-Annotated Dataset for OCR Quality Assessment

Authors: Yulong Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2510.21774
Pdf URL: https://arxiv.org/pdf/2510.21774
Copy Paste: [[2510.21774]] OCR-Quality: A Human-Annotated Dataset for OCR Quality Assessment(https://arxiv.org/abs/2510.21774)
Keywords: quality assessment
Abstract: We present OCR-Quality, a comprehensive human-annotated dataset designed for evaluating and developing OCR quality assessment methods. The dataset consists of 1,000 PDF pages converted to PNG images at 300 DPI, sampled from diverse real-world scenarios, including academic papers, textbooks, e-books, and multilingual documents. Each document has been processed using state-of-the-art Vision-Language Models (VLMs) and manually annotated with quality scores using a 4-level scoring system (1: Excellent, 2: Good, 3: Fair, 4: Poor). The dataset includes detailed source information, annotation guidelines, and representative cases across various difficulty levels. OCR-Quality addresses the critical need for reliable OCR quality assessment in real-world applications and provides a valuable benchmark for training and evaluating OCR verification systems. The dataset is publicly available.
摘要：我们推出 OCR-Quality，这是一个综合的人工注释数据集，旨在评估和开发 OCR 质量评估方法。该数据集由 1,000 个以 300 DPI 转换为 PNG 图像的 PDF 页面组成，采样自不同的现实场景，包括学术论文、教科书、电子书和多语言文档。每个文档均使用最先进的视觉语言模型 (VLM) 进行处理，并使用 4 级评分系统（1：优秀、2：良好、3：一般、4：差）手动注释质量分数。该数据集包括详细的源信息、注释指南以及各个难度级别的代表性案例。 OCR-Quality 满足了实际应用中对可靠 OCR 质量评估的关键需求，并为培训和评估 OCR 验证系统提供了宝贵的基准。该数据集可通过此 https URL 公开获取。

Title: Face-MakeUpV2: Facial Consistency Learning for Controllable Text-to-Image Generation

Authors: Dawei Dai, Yinxiu Zhou, Chenghang Li, Guolai Jiang, Chengfang Zhang
Subjects: cs.CV, cs.AI, eess.IV
Abstract URL: https://arxiv.org/abs/2510.21775
Pdf URL: https://arxiv.org/pdf/2510.21775
Copy Paste: [[2510.21775]] Face-MakeUpV2: Facial Consistency Learning for Controllable Text-to-Image Generation(https://arxiv.org/abs/2510.21775)
Keywords: generation
Abstract: In facial image generation, current text-to-image models often suffer from facial attribute leakage and insufficient physical consistency when responding to local semantic instructions. In this study, we propose Face-MakeUpV2, a facial image generation model that aims to maintain the consistency of face ID and physical characteristics with the reference image. First, we constructed a large-scale dataset, FaceCaptionMask-1M, comprising approximately one million image-text-mask pairs that provide precise spatial supervision for the local semantic instructions. Second, we employed a general text-to-image pretrained model as the backbone and introduced two complementary facial information injection channels: a 3D facial rendering channel to incorporate the physical characteristics of the image and a global facial feature channel. Third, we formulated two optimization objectives for the supervised learning of our model: semantic alignment in the model's embedding space to mitigate the attribute leakage problem and perceptual loss on facial images to preserve ID consistency. Extensive experiments demonstrated that Face-MakeUpV2 achieves the best overall performance in terms of preserving face ID and maintaining physical consistency with the reference images. These results highlight the practical potential of Face-MakeUpV2 for reliable and controllable facial editing in diverse applications.
摘要：在面部图像生成中，当前的文本到图像模型在响应局部语义指令时经常遇到面部属性泄漏和物理一致性不足的问题。在本研究中，我们提出了 Face-MakeUpV2，一种面部图像生成模型，旨在保持面部 ID 和物理特征与参考图像的一致性。首先，我们构建了一个大型数据集 FaceCaptionMask-1M，包含大约一百万个图像-文本-掩码对，为局部语义指令提供精确的空间监督。其次，我们采用通用文本到图像预训练模型作为主干，并引入两个互补的面部信息注入通道：一个包含图像物理特征的 3D 面部渲染通道和一个全局面部特征通道。第三，我们为模型的监督学习制定了两个优化目标：模型嵌入空间中的语义对齐以减轻属性泄漏问题和面部图像的感知损失以保持 ID 一致性。大量实验表明，我们的 Face-MakeUpV2 在保留面部 ID 和保持参考图像的物理一致性方面实现了最佳整体性能。这些结果凸显了 Face-MakeUpV2 在各种应用中实现可靠且可控的面部编辑的实际潜力。

Title: Online Mixture of Experts: No-Regret Learning for Optimal Collective Decision-Making

Authors: Larkin Liu, Jalal Etesami
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2510.21788
Pdf URL: https://arxiv.org/pdf/2510.21788
Copy Paste: [[2510.21788]] Online Mixture of Experts: No-Regret Learning for Optimal Collective Decision-Making(https://arxiv.org/abs/2510.21788)
Keywords: generative
Abstract: We explore the use of expert-guided bandit learning, which we refer to as online mixture-of-experts (OMoE). In this setting, given a context, a candidate committee of experts must determine how to aggregate their outputs to achieve optimal results in terms of aggregate accuracy. We propose two algorithms to address this problem. The first algorithm combines aggregate voting with UCB-driven successive elimination, efficiently pruning suboptimal exploration actions. The second algorithm employs an online weighted-majority-voting mechanism, giving each expert voting power proportional to its predictive power. We derive theoretical guarantees for the regret properties in the bandit setting under ideal circumstances, and empirical results are provided accordingly. As a modern study on applications, these methods are applied to the online fine-tuning of a set of expert large language models (LLMs), where after each response, the generative LLM dynamically reweighs its set of experts and/or selects the optimal committee of experts to generate the most accurate response. Our results introduce new methodologies and no-regret guarantees for combining multiple experts to improve the overall performance of an aggregate model.
摘要：我们探索使用专家引导的老虎机学习，我们将其称为在线专家混合（OMoE）。在这种情况下，在给定的背景下，候选专家委员会必须确定如何汇总其输出，以在汇总准确性方面实现最佳结果。我们提出两种算法来解决这个问题。第一个算法将聚合投票与 UCB 驱动的连续消除相结合，有效地修剪次优探索行为。第二种算法采用在线加权多数投票机制，利用每个专家与其预测能力成比例的各自投票权。我们对理想情况下强盗设置中的遗憾属性进行了理论保证，并提供了相应的实证结果。作为一项现代应用研究，这些方法应用于一组专家大语言模型（LLM）的在线微调，在每次响应之后，生成式 LLM 动态地重新权衡其专家组和/或选择最佳专家委员会以生成最准确的响应。我们的结果引入了新的方法和无悔的保证，可以结合多个专家来提高聚合模型的整体性能。
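
The weighted-majority-voting idea mentioned in this abstract can be sketched with the classical full-information weighted majority rule. This is a swapped-in textbook technique, not the paper's bandit algorithm: the penalty factor `beta`, the function names, and the toy votes are assumptions made for illustration.

```python
def weighted_majority_vote(votes, weights):
    """Return the label whose supporting experts have the largest total weight."""
    totals = {}
    for vote, weight in zip(votes, weights):
        totals[vote] = totals.get(vote, 0.0) + weight
    return max(totals, key=totals.get)

def update_weights(votes, weights, truth, beta=0.4):
    """Multiply the weight of every expert that voted wrongly by beta (assumed value)."""
    return [w * (1.0 if v == truth else beta) for v, w in zip(votes, weights)]

# Three hypothetical experts; expert 0 is always right, experts 1 and 2 always wrong.
weights = [1.0, 1.0, 1.0]
rounds = [(["a", "b", "b"], "a")] * 3
predictions = []
for votes, truth in rounds:
    predictions.append(weighted_majority_vote(votes, weights))
    weights = update_weights(votes, weights, truth)
```

The committee is wrong in the first round, when the two unreliable experts outvote the reliable one, and correct afterwards once their weights have been discounted.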

Title: Exploring the design space of diffusion and flow models for data fusion

Authors: Niraj Chaudhari, Manmeet Singh, Naveen Sudharsan, Amit Kumar Srivastava, Harsh Kamath, Dushyant Mahajan, Ayan Paul
Subjects: cs.CV, physics.ins-det
Abstract URL: https://arxiv.org/abs/2510.21791
Pdf URL: https://arxiv.org/pdf/2510.21791
Copy Paste: [[2510.21791]] Exploring the design space of diffusion and flow models for data fusion(https://arxiv.org/abs/2510.21791)
Keywords: generative
Abstract: Data fusion is an essential task in various domains, enabling the integration of multi-source information to enhance data quality and insights. One key application is in satellite remote sensing, where fusing multi-sensor observations can improve spatial and temporal resolution. In this study, we explore the design space of diffusion and flow models for data fusion, focusing on the integration of Defense Meteorological Satellite Program's Operational Linescan System (DMSP-OLS) and Visible Infrared Imaging Radiometer Suite (VIIRS) nighttime lights data. Our approach leverages a diverse set of 2D image-to-image generative models, including UNet, diffusion, and flow modeling architectures. We evaluate the effectiveness of these architectures in satellite remote sensing data fusion, identifying diffusion models based on UNet as particularly adept at preserving fine-grained spatial details and generating high-fidelity fused images. We also provide guidance on the selection of noise schedulers in diffusion-based models, highlighting the trade-offs between iterative solvers for faster inference and discrete schedulers for higher-quality reconstructions. Additionally, we explore quantization techniques to optimize memory efficiency and computational cost without compromising performance. Our findings offer practical insights into selecting the most effective diffusion and flow model architectures for data fusion tasks, particularly in remote sensing applications, and provide recommendations for leveraging noise scheduling strategies to enhance fusion quality.
摘要：数据融合是各个领域的一项重要任务，能够整合多源信息以提高数据质量和洞察力。一项关键应用是卫星遥感，融合多传感器观测可以提高空间和时间分辨率。在本研究中，我们探索了数据融合的扩散和流动模型的设计空间，重点关注国防气象卫星计划的业务线扫描系统（DMSP-OLS）和可见红外成像辐射计套件（VIIRS）夜间灯光数据的集成。我们的方法利用了一组不同的 2D 图像到图像生成模型，包括 UNET、扩散和流建模架构。我们评估了这些架构在卫星遥感数据融合中的有效性，确定基于 UNet 的扩散模型特别擅长保留细粒度的空间细节并生成高保真融合图像。我们还提供了有关在基于扩散的模型中选择噪声调度器的指导，强调了用于更快推理的迭代求解器与用于更高质量重建的离散调度器之间的权衡。此外，我们探索量化技术来优化内存效率和计算成本而不影响性能。我们的研究结果提供了为数据融合任务选择最有效的扩散和流动模型架构的实用见解，特别是在遥感应用中，并为利用噪声调度策略来提高融合质量提供了建议。

Title: Variance-Reduction Guidance: Sampling Trajectory Optimization for Diffusion Models

Authors: Shifeng Xu, Yanzhu Liu, Adams Wai-Kin Kong
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2510.21792
Pdf URL: https://arxiv.org/pdf/2510.21792
Copy Paste: [[2510.21792]] Variance-Reduction Guidance: Sampling Trajectory Optimization for Diffusion Models(https://arxiv.org/abs/2510.21792)
Keywords: generation, generative
Abstract: Diffusion models are an emerging class of generative models. Their sampling process involves multiple steps, and in each step the models predict the noise from a noisy sample. When the models make a prediction, the output deviates from the ground truth; we call this deviation the prediction error. The prediction error accumulates over the sampling process and degrades generation quality. This paper introduces a novel technique for statistically measuring the prediction error and proposes the Variance-Reduction Guidance (VRG) method to mitigate this error. VRG does not require model fine-tuning or modification. Given a predefined sampling trajectory, it searches for a new trajectory that has the same number of sampling steps but produces higher-quality results. VRG is applicable to both conditional and unconditional generation. Experiments on various datasets and baselines demonstrate that VRG can significantly improve the generation quality of diffusion models. Source code is publicly available.
摘要：扩散模型已成为新兴的生成模型。他们的采样过程涉及多个步骤，并且在每个步骤中，模型都会从噪声样本中预测噪声。当模型进行预测时，输出会偏离真实情况，我们将这种偏差称为 \textit{预测误差}。预测误差在采样过程中累积并降低发电质量。本文介绍了一种统计测量预测误差的新技术，并提出了方差减少指导（VRG）方法来减轻这种误差。 VRG不需要模型微调或修改。给定预定义的采样轨迹，它会搜索具有相同采样步骤数但产生更高质量结果的新轨迹。 VRG适用于有条件和无条件生成。在各种数据集和基线上的实验表明，VRG可以显着提高扩散模型的生成质量。源代码可从此 https URL 获取。

Title: 2D_3D Feature Fusion via Cross-Modal Latent Synthesis and Attention Guided Restoration for Industrial Anomaly Detection

Authors: Usman Ali, Ali Zia, Abdul Rehman, Umer Ramzan, Zohaib Hassan, Talha Sattar, Jing Wang, Wei Xiang
Subjects: cs.CV, cs.AI, eess.IV
Abstract URL: https://arxiv.org/abs/2510.21793
Pdf URL: https://arxiv.org/pdf/2510.21793
Copy Paste: [[2510.21793]] 2D_3D Feature Fusion via Cross-Modal Latent Synthesis and Attention Guided Restoration for Industrial Anomaly Detection(https://arxiv.org/abs/2510.21793)
Keywords: restoration
Abstract: Industrial anomaly detection (IAD) increasingly benefits from integrating 2D and 3D data, but robust cross-modal fusion remains challenging. We propose a novel unsupervised framework, Multi-Modal Attention-Driven Fusion Restoration (MAFR), which synthesises a unified latent space from RGB images and point clouds using a shared fusion encoder, followed by attention-guided, modality-specific decoders. Anomalies are localised by measuring reconstruction errors between input features and their restored counterparts. Evaluations on the MVTec 3D-AD and Eyecandies benchmarks demonstrate that MAFR achieves state-of-the-art results, with a mean I-AUROC of 0.972 and 0.901, respectively. The framework also exhibits strong performance in few-shot learning settings, and ablation studies confirm the critical roles of the fusion architecture and composite loss. MAFR offers a principled approach for fusing visual and geometric information, advancing the robustness and accuracy of industrial anomaly detection. Code is publicly available.
摘要：工业异常检测 (IAD) 越来越受益于集成 2D 和 3D 数据，但强大的跨模式融合仍然具有挑战性。我们提出了一种新颖的无监督框架，多模态注意力驱动融合恢复（MAFR），它使用共享融合编码器从 RGB 图像和点云合成统一的潜在空间，然后是注意力引导的、特定于模态的解码器。通过测量输入特征与其恢复的对应特征之间的重建误差来定位异常。对 MVTec 3D-AD 和 Eyecandies 基准的评估表明，MAFR 实现了最先进的结果，平均 I-AUROC 分别为 0.972 和 0.901。该框架还在少样本学习设置中表现出强大的性能，并且消融研究证实了融合架构和复合损失的关键作用。 MAFR 提供了一种融合视觉和几何信息的原则性方法，提高了工业异常检测的稳健性和准确性。代码可在此 https URL 获取
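
Localising anomalies from reconstruction error, as this abstract describes, can be sketched as below. The squared L2 distance, the max-pooled image-level score, the function names, and the toy feature grid are assumptions made for illustration; the abstract does not specify which distance or pooling the paper uses.

```python
def anomaly_map(features, restored):
    """Per-location squared L2 distance between input and restored feature vectors.

    Both arguments are grids (lists of rows) of equal-length feature vectors.
    """
    return [
        [sum((a - b) ** 2 for a, b in zip(f, r)) for f, r in zip(f_row, r_row)]
        for f_row, r_row in zip(features, restored)
    ]

def image_score(amap):
    """Image-level anomaly score: the largest per-location error (assumed pooling)."""
    return max(max(row) for row in amap)

# Hypothetical 2x2 grid of 2-D features; only the bottom-right cell is restored poorly.
features = [[[1.0, 2.0], [0.0, 1.0]], [[3.0, 1.0], [2.0, 2.0]]]
restored = [[[1.0, 2.0], [0.0, 1.0]], [[3.0, 1.0], [0.0, 0.0]]]
amap = anomaly_map(features, restored)
score = image_score(amap)
```

The only non-zero entry in the map is the cell whose restoration differs from its input, which is the behaviour a restoration-based detector relies on.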

Title: Embodied Navigation with Auxiliary Task of Action Description Prediction

Authors: Haru Kondoh, Asako Kanezaki
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2510.21809
Pdf URL: https://arxiv.org/pdf/2510.21809
Copy Paste: [[2510.21809]] Embodied Navigation with Auxiliary Task of Action Description Prediction(https://arxiv.org/abs/2510.21809)
Keywords: generation
Abstract: The field of multimodal robot navigation in indoor environments has garnered significant attention in recent years. However, as tasks and methods become more advanced, the action decision systems tend to become more complex and operate as black boxes. For a reliable system, the ability to explain or describe its decisions is crucial; however, there tends to be a trade-off in that explainable systems cannot outperform non-explainable ones. In this paper, we propose incorporating the task of describing actions in language into the reinforcement learning of navigation as an auxiliary task. Existing studies have found it difficult to incorporate describing actions into reinforcement learning due to the absence of ground-truth data. We address this issue by leveraging knowledge distillation from pre-trained description generation models, such as vision-language models. We comprehensively evaluate our approach across various navigation tasks, demonstrating that it can describe actions while attaining high navigation performance. Furthermore, it achieves state-of-the-art performance in the particularly challenging multimodal navigation task of semantic audio-visual navigation.
摘要：近年来，室内环境中的多模式机器人导航领域引起了广泛关注。然而，随着任务和方法变得更加先进，行动决策系统往往变得更加复杂并且像黑匣子一样运行。对于一个可靠的系统，解释或描述其决策的能力至关重要；然而，往往需要权衡，可解释的系统在性能方面无法胜过不可解释的系统。在本文中，我们建议将用语言描述动作的任务作为辅助任务纳入导航的强化学习中。现有研究发现，由于缺乏真实数据，很难将描述动作纳入强化学习中。我们通过利用预先训练的描述生成模型（例如视觉语言模型）的知识蒸馏来解决这个问题。我们在各种导航任务中全面评估我们的方法，证明它可以描述动作，同时获得高导航性能。此外，它在语义视听导航这一特别具有挑战性的多模态导航任务中实现了最先进的性能。

Title: Comparative Analysis of Object Detection Algorithms for Surface Defect Detection

Authors: Arpan Maity, Tamal Ghosh
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2510.21811
Pdf URL: https://arxiv.org/pdf/2510.21811
Copy Paste: [[2510.21811]] Comparative Analysis of Object Detection Algorithms for Surface Defect Detection(https://arxiv.org/abs/2510.21811)
Keywords: generation
Abstract: This article compares the performance of six prominent object detection algorithms, YOLOv11, RetinaNet, Fast R-CNN, YOLOv8, RT-DETR, and DETR, on the NEU-DET surface defect detection dataset, comprising images representing various metal surface defects, a crucial application in industrial quality control. Each model's performance was assessed regarding detection accuracy, speed, and robustness across different defect types such as scratches, inclusions, and rolled-in scales. YOLOv11, a state-of-the-art real-time object detection algorithm, demonstrated superior performance compared to the other methods, achieving a remarkable 70% higher accuracy on average. This improvement can be attributed to YOLOv11's enhanced feature extraction capabilities and ability to process the entire image in a single forward pass, making it faster and more efficient in detecting minor surface defects. Additionally, YOLOv11's architecture optimizations, such as improved anchor box generation and deeper convolutional layers, contributed to more precise localization of defects. In conclusion, YOLOv11's outstanding performance in accuracy and speed solidifies its position as the most effective model for surface defect detection on the NEU dataset, surpassing competing algorithms by a substantial margin.
摘要：本文比较了六种著名的物体检测算法（YOLOv11、RetinaNet、Fast R-CNN、YOLOv8、RT-DETR 和 DETR）在 NEU-DET 表面缺陷检测数据集上的性能，该数据集包含代表各种金属表面缺陷的图像，这是工业质量控制中的关键应用。每个模型的性能均根据划痕、夹杂物和卷入鳞片等不同缺陷类型的检测精度、速度和鲁棒性进行评估。 YOLOv11 是一种最先进的实时目标检测算法，与其他方法相比，表现出卓越的性能，平均准确率提高了 70%。这一改进可归因于 YOLOv11 增强的特征提取能力以及在单次前向传递中处理整个图像的能力，使其能够更快、更高效地检测微小的表面缺陷。此外，YOLOv11的架构优化，例如改进的锚框生成和更深的卷积层，有助于更精确地定位缺陷。总之，YOLOv11 在准确性和速度方面的出色表现巩固了其作为 NEU 数据集上表面缺陷检测最有效模型的地位，大幅超越了竞争算法。

Title: SITS-DECO: A Generative Decoder Is All You Need For Multitask Satellite Image Time Series Modelling

Authors: Samuel J. Barrett, Docko Sow
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.21813
Pdf URL: https://arxiv.org/pdf/2510.21813
Copy Paste: [[2510.21813]] SITS-DECO: A Generative Decoder Is All You Need For Multitask Satellite Image Time Series Modelling(https://arxiv.org/abs/2510.21813)
Keywords: generative
Abstract: Earth Observation (EO) Foundation Modelling (FM) holds great promise for simplifying and improving the use of EO data for diverse real-world tasks. However, most existing models require additional adaptation before they can be used and are structured rigidly around particular data sources or training approaches. To address this, we take inspiration from large language models, where diverse tasks, both pre-training and downstream, are implicitly captured through next-token prediction over unified token sequences, leveraging the structure and diversity of the training data. We introduce SITS-DECO (Satellite Image Time Series-DECoder Only), a proof-of-concept generative model that applies this unified-sequence framing to EO data. Using a simple GPT-style decoder-only architecture, we demonstrate its ability to perform useful EO tasks (pixel-wise, multi-temporal, multi-modal crop-type classification) in a purely generative framework. Through symbolic prompting, we show that the model can perform multiple supervised and self-supervised tasks within a single unified architecture, without task- or modality-specific adaptation. Despite its simplicity and lack of spatial context, SITS-DECO outperforms much larger EO foundation models on crop-type classification (PASTIS-R), demonstrating that dense temporal sequence modelling is a critical missing ingredient in the current paradigm. This work exemplifies a data-centric modelling paradigm in which capability arises from the diversity and structure of the training data rather than from architectural complexity. SITS-DECO provides a lightweight, practical route to multi-modal, multi-task EO modelling, and a conceptual bridge toward future generative EO foundation models.
摘要：地球观测 (EO) 基础建模 (FM) 有望简化和改进 EO 数据在各种现实世界任务中的使用。然而，大多数现有模型在使用之前都需要进行额外的调整，并且严格围绕特定数据源或训练方法进行构建。为了解决这个问题，我们从大型语言模型中获得灵感，其中预训练和下游的不同任务是通过统一令牌序列的下一个令牌预测隐式捕获的，利用训练数据的结构和多样性。我们介绍 SITS-DECO（卫星图像时间序列 - 仅解码器），这是一种概念验证生成模型，它将这种统一序列框架应用于 EO 数据。使用简单的 GPT 式解码器架构，并展示其在纯生成框架中执行有用的 EO 任务（像素级、多时态、多模式作物类型分类）的能力。通过符号提示，我们表明该模型可以在单个统一架构中执行多个监督和自监督任务，而无需特定于任务或模式的适应。尽管 SITS-DECO 简单且缺乏空间背景，但其在作物类型分类 (PASTIS-R) 方面的表现优于更大的 EO 基础模型 (PASTIS-R)，这表明密集时间序列建模是当前范式中关键的缺失要素。这项工作例证了以数据为中心的建模范例，其中能力源于训练数据的多样性和结构，而不是架构复杂性。 SITS-DECO 为多模式、多任务 EO 建模提供了一条轻量级、实用的途径，并为未来生成 EO 基础模型搭建了概念桥梁。

Title: Wavelet-based GAN Fingerprint Detection using ResNet50

Authors: Sai Teja Erukude, Suhasnadh Reddy Veluru, Viswa Chaitanya Marella
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2510.21822
Pdf URL: https://arxiv.org/pdf/2510.21822
Copy Paste: [[2510.21822]] Wavelet-based GAN Fingerprint Detection using ResNet50(https://arxiv.org/abs/2510.21822)
Keywords: generative
Abstract: Identifying images generated by Generative Adversarial Networks (GANs) has become a significant challenge in digital image forensics. This research presents a wavelet-based detection method that uses discrete wavelet transform (DWT) preprocessing and a ResNet50 classification layer to differentiate StyleGAN-generated images from real ones. Haar and Daubechies wavelet filters are applied to convert the input images into multi-resolution representations, which are then fed to a ResNet50 network for classification, capitalizing on subtle artifacts left by the generative process. Moreover, the wavelet-based models are compared to an identical ResNet50 model trained on spatial data. The Haar and Daubechies preprocessed models achieved accuracies of 93.8 percent and 95.1 percent, respectively, much higher than the model developed in the spatial domain (accuracy rate of 81.5 percent). The Daubechies-based model outperforms Haar, showing that adding layers of descriptive frequency patterns can lead to even greater distinguishing power. These results indicate that the GAN-generated images have unique wavelet-domain artifacts or "fingerprints." The proposed method illustrates the effectiveness of wavelet-domain analysis for detecting GAN images and emphasizes the potential for further developing the capabilities of future deepfake detection systems.
摘要：识别生成对抗网络（GAN）生成的图像已成为数字图像取证中的重大挑战。本研究提出了一种基于小波的检测方法，该方法使用离散小波变换 (DWT) 预处理和 ResNet50 分类层来区分 StyleGAN 生成的图像与真实图像。 Haar 和 Daubechies 小波滤波器用于将输入图像转换为多分辨率表示，然后将其输入 ResNet50 网络进行分类，利用生成过程留下的微妙伪像。此外，还将基于小波的模型与在空间数据上训练的相同 ResNet50 模型进行了比较。 Haar 和 Daubechies 预处理模型的准确率分别为 93.8% 和 95.1%，远高于空间域开发的模型（准确率 81.5%）。基于 Daubechies 的模型优于 Haar，表明添加描述性频率模式层可以带来更大的区分能力。这些结果表明 GAN 生成的图像具有独特的小波域伪影或“指纹”。所提出的方法说明了小波域分析检测 GAN 图像的有效性，并强调了进一步开发未来深度伪造检测系统功能的潜力。
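
The DWT preprocessing step in this abstract can be illustrated with a single-level 2-D Haar transform written in plain Python. This is a minimal sketch, not the paper's pipeline: the orthonormal scaling by 1/2, the sub-band names, and the toy image are assumptions, and sub-band naming and sign conventions differ between libraries.

```python
def haar_dwt2(image):
    """Single-level 2-D Haar transform of a grayscale image with even height and width.

    Returns four half-resolution sub-bands: approximation, horizontal detail
    (difference between the top and bottom rows of each 2x2 block), vertical
    detail (difference between left and right columns), and diagonal detail.
    """
    approx, horiz, vert, diag = [], [], [], []
    for r in range(0, len(image), 2):
        a_row, h_row, v_row, d_row = [], [], [], []
        for c in range(0, len(image[0]), 2):
            tl, tr = image[r][c], image[r][c + 1]          # top-left, top-right
            bl, br = image[r + 1][c], image[r + 1][c + 1]  # bottom-left, bottom-right
            a_row.append((tl + tr + bl + br) / 2)
            h_row.append((tl + tr - bl - br) / 2)
            v_row.append((tl - tr + bl - br) / 2)
            d_row.append((tl - tr - bl + br) / 2)
        approx.append(a_row)
        horiz.append(h_row)
        vert.append(v_row)
        diag.append(d_row)
    return approx, horiz, vert, diag

# Hypothetical 2x4 image: a flat block followed by a textured block.
image = [[4, 4, 1, 3],
         [4, 4, 5, 7]]
approx, horiz, vert, diag = haar_dwt2(image)
```

The flat block produces zero detail coefficients while the textured block does not, which is the kind of frequency-domain signal a downstream classifier could pick up. A full pipeline would stack the sub-bands as input channels for the network.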

Title: A Flow Model with Low-Rank Transformers for Incomplete Multimodal Survival Analysis

Authors: Yi Yin, Yuntao Shou, Zao Dai, Yun Peng, Tao Meng, Wei Ai, Keqin Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.21829
Pdf URL: https://arxiv.org/pdf/2510.21829
Copy Paste: [[2510.21829]] A Flow Model with Low-Rank Transformers for Incomplete Multimodal Survival Analysis(https://arxiv.org/abs/2510.21829)
Keywords: generative
Abstract: In recent years, multimodal medical data-based survival analysis has attracted much attention. However, real-world datasets often suffer from the problem of incomplete modality, where some patient modality information is missing due to acquisition limitations or system failures. Existing methods typically infer missing modalities directly from observed ones using deep neural networks, but they often ignore the distributional discrepancy across modalities, resulting in inconsistent and unreliable modality reconstruction. To address these challenges, we propose a novel framework that combines a low-rank Transformer with a flow-based generative model for robust and flexible multimodal survival prediction. Specifically, we first formulate the concerned problem as incomplete multimodal survival analysis using the multi-instance representation of whole slide images (WSIs) and genomic profiles. To realize incomplete multimodal survival analysis, we propose a class-specific flow for cross-modal distribution alignment. Under the condition of class labels, we model and transform the cross-modal distribution. By virtue of the reversible structure and accurate density modeling capabilities of the normalizing flow model, the model can effectively construct a distribution-consistent latent space of the missing modality, thereby improving the consistency between the reconstructed data and the true distribution. Finally, we design a lightweight Transformer architecture to model intra-modal dependencies while alleviating the overfitting problem in high-dimensional modality fusion by virtue of the low-rank Transformer. Extensive experiments have demonstrated that our method not only achieves state-of-the-art performance under complete modality settings, but also maintains robust and superior accuracy under the incomplete modalities scenario.
摘要：近年来，基于多模态医学数据的生存分析备受关注。然而，现实世界的数据集经常遇到模态不完整的问题，由于采集限制或系统故障，一些患者模态信息丢失。现有方法通常使用深度神经网络直接从观察到的模态推断缺失的模态，但它们经常忽略模态之间的分布差异，导致模态重建不一致且不可靠。为了应对这些挑战，我们提出了一种新颖的框架，它将低秩 Transformer 与基于流的生成模型相结合，以实现稳健且灵活的多模式生存预测。具体来说，我们首先使用整个幻灯片图像（WSI）和基因组图谱的多实例表示将相关问题表述为不完整的多模式生存分析。为了实现不完整的多模态生存分析，我们提出了一种用于跨模态分布对齐的特定类别流程。在类标签的条件下，我们对跨模态分布进行建模和转换。借助归一化流模型的可逆结构和精确的密度建模能力，该模型可以有效地构建缺失模态的分布一致的潜在空间，从而提高重建数据与真实分布的一致性。最后，我们设计了一个轻量级的 Transformer 架构来对模态内依赖关系进行建模，同时借助低秩 Transformer 缓解高维模态融合中的过拟合问题。大量的实验表明，我们的方法不仅在完整模态设置下实现了最先进的性能，而且在不完整模态场景下保持了稳健和卓越的准确性。

Title: Restoring Pruned Large Language Models via Lost Component Compensation

Authors: Zijian Feng, Hanzhang Zhou, Zixiao Zhu, Tianjiao Li, Jia Jim Deryl Chua, Lee Onn Mak, Gee Wah Ng, Kezhi Mao
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2510.21834
Pdf URL: https://arxiv.org/pdf/2510.21834
Copy Paste: [[2510.21834]] Restoring Pruned Large Language Models via Lost Component Compensation(https://arxiv.org/abs/2510.21834)
Keywords: restoration
Abstract: Pruning is a widely used technique to reduce the size and inference cost of large language models (LLMs), but it often causes performance degradation. To mitigate this, existing restoration methods typically employ parameter-efficient fine-tuning (PEFT), such as LoRA, to recover the pruned model's performance. However, most PEFT methods are designed for dense models and overlook the distinct properties of pruned models, often resulting in suboptimal recovery. In this work, we propose a targeted restoration strategy for pruned models that restores performance while preserving their low cost and high efficiency. We observe that pruning-induced information loss is reflected in attention activations, and selectively reintroducing components of this information can significantly recover model performance. Based on this insight, we introduce RestoreLCC (Restoring Pruned LLMs via Lost Component Compensation), a plug-and-play method that contrastively probes critical attention heads via activation editing, extracts lost components from activation differences, and finally injects them back into the corresponding pruned heads for compensation and recovery. RestoreLCC is compatible with structured, semi-structured, and unstructured pruning schemes. Extensive experiments demonstrate that RestoreLCC consistently outperforms state-of-the-art baselines in both general and task-specific performance recovery, without compromising the sparsity or inference efficiency of pruned models.

Title: A Multimodal, Multitask System for Generating E Commerce Text Listings from Images

Authors: Nayan Kumar Singh
Subjects: cs.LG, cs.AI, cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2510.21835
Pdf URL: https://arxiv.org/pdf/2510.21835
Copy Paste: [[2510.21835]] A Multimodal, Multitask System for Generating E Commerce Text Listings from Images(https://arxiv.org/abs/2510.21835)
Keywords: generation, generative
Abstract: Manually generating catchy product names and descriptions is labor-intensive and slow for retailers. Although generative AI offers automation in the form of vision-to-language models (VLMs), current VLMs are prone to factual "hallucinations". Siloed, single-task models are not only inefficient but also fail to capture interdependent relationships between features. To address these challenges, we propose an end-to-end, multi-task system that generates factually grounded textual listings from a single image. The study makes two contributions to the model architecture. First, a multi-task learning approach to fine-tuning the vision encoder, in which a single vision backbone is jointly trained on attribute prediction (such as color, hemline, and neck style) and price regression. Second, a hierarchical generation process in which the model's own predicted attributes are embedded in a prompt and fed to the text decoder to improve factual consistency. In experiments, the multi-task approach outperforms independent models on both price regression, with a 3.6% better R2 value, and attribute classification, with a 6.6% higher F1 score. Critically, the hierarchical generation process proves highly effective, cutting the factual hallucination rate from 12.7% to 7.1%, a 44.5% relative reduction, compared to a non-hierarchical ablation. The hierarchical approach also reduces the latency of autoregressive text generation by a factor of 3.5 compared to a direct vision-to-language model of similar size. One minor caveat is that the model performs 3.5% worse than the direct vision-to-language model on ROUGE-L.

Title: Improving the Physics of Video Generation with VJEPA-2 Reward Signal

Authors: Jianhao Yuan, Xiaofeng Zhang, Felix Friedrich, Nicolas Beltran-Velez, Melissa Hall, Reyhane Askari-Hemmat, Xiaochuang Han, Nicolas Ballas, Michal Drozdzal, Adriana Romero-Soriano
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2510.21840
Pdf URL: https://arxiv.org/pdf/2510.21840
Copy Paste: [[2510.21840]] Improving the Physics of Video Generation with VJEPA-2 Reward Signal(https://arxiv.org/abs/2510.21840)
Keywords: generation, generative
Abstract: This is a short technical report describing the winning entry of the PhysicsIQ Challenge, presented at the Perception Test Workshop at ICCV 2025. State-of-the-art video generative models exhibit severely limited physical understanding and often produce implausible videos. The Physics IQ benchmark has shown that visual realism does not imply physics understanding. Yet intuitive physics understanding has been shown to emerge from SSL pretraining on natural videos. In this report, we investigate whether we can leverage SSL-based video world models to improve the physics plausibility of video generative models. In particular, we build on top of the state-of-the-art video generative model MAGI-1 and couple it with the recently introduced Video Joint Embedding Predictive Architecture 2 (VJEPA-2) to guide the generation process. We show that by leveraging VJEPA-2 as a reward signal, we can improve the physics plausibility of state-of-the-art video generative models by ~6%.

Title: KARIPAP: Quantum-Inspired Tensor Network Compression of Large Language Models Using Infinite Projected Entangled Pair States and Tensor Renormalization Group

Authors: Azree Nazri
Subjects: cs.LG, quant-ph
Abstract URL: https://arxiv.org/abs/2510.21844
Pdf URL: https://arxiv.org/pdf/2510.21844
Copy Paste: [[2510.21844]] KARIPAP: Quantum-Inspired Tensor Network Compression of Large Language Models Using Infinite Projected Entangled Pair States and Tensor Renormalization Group(https://arxiv.org/abs/2510.21844)
Keywords: generative
Abstract: Large Language Models (LLMs) like ChatGPT and LLaMA drive rapid progress in generative AI, yet their huge parameter scales create severe computational and environmental burdens. High training costs, energy use, and limited device deployment hinder accessibility. Existing compression methods (pruning, distillation, low-rank factorization, and quantization) reduce size but ignore complex inter-layer correlations. We propose KARIPAP, a quantum-inspired tensor network compression using Infinite Projected Entangled Pair States (iPEPS) and Tensor Renormalization Group (TRG) contraction. Unlike 1D Matrix Product States, iPEPS captures multi-directional entanglement in attention and deep transformer layers. TRG ensures polynomial-time contraction, making tensorization feasible while preserving key correlation geometry. Experiments on LLaMA-2 7B show up to 93% memory and 70% parameter reduction, with 50% faster training, 25% faster inference, and only 2-3% accuracy loss. Layer-wise entanglement profiling reveals redundancy in deeper layers, confirming their suitability for tensor factorization. KARIPAP demonstrates that modern LLMs occupy low-dimensional entanglement manifolds, enabling scalable, energy-efficient, and quantum-aware AI architectures.

Title: SynCast: Synergizing Contradictions in Precipitation Nowcasting via Diffusion Sequential Preference Optimization

Authors: Kaiyi Xu, Junchao Gong, Wenlong Zhang, Ben Fei, Lei Bai, Wanli Ouyang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2510.21847
Pdf URL: https://arxiv.org/pdf/2510.21847
Copy Paste: [[2510.21847]] SynCast: Synergizing Contradictions in Precipitation Nowcasting via Diffusion Sequential Preference Optimization(https://arxiv.org/abs/2510.21847)
Keywords: generative
Abstract: Precipitation nowcasting based on radar echoes plays a crucial role in monitoring extreme weather and supporting disaster prevention. Although deep learning approaches have achieved significant progress, they still face notable limitations. For example, deterministic models tend to produce over-smoothed predictions, which struggle to capture extreme events and fine-scale precipitation patterns. Probabilistic generative models, due to their inherent randomness, often show fluctuating performance across different metrics and rarely achieve consistently optimal results. Furthermore, precipitation nowcasting is typically evaluated using multiple metrics, some of which are inherently conflicting. For instance, there is often a trade-off between the Critical Success Index (CSI) and the False Alarm Ratio (FAR), making it challenging for existing models to deliver forecasts that perform well on both metrics simultaneously. To address these challenges, we introduce preference optimization into precipitation nowcasting for the first time, motivated by the success of reinforcement learning from human feedback in large language models. Specifically, we propose SynCast, a method that employs the two-stage post-training framework of Diffusion Sequential Preference Optimization (Diffusion-SPO), to progressively align conflicting metrics and consistently achieve superior performance. In the first stage, the framework focuses on reducing FAR, training the model to effectively suppress false alarms. Building on this foundation, the second stage further optimizes CSI with constraints that preserve FAR alignment, thereby achieving synergistic improvements across these conflicting metrics.
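The CSI/FAR trade-off mentioned in this abstract can be made concrete with the standard contingency-table definitions, CSI = hits / (hits + false alarms + misses) and FAR = false alarms / (hits + false alarms). The sketch below is illustrative only: the function name `csi_far` and the counts are hypothetical and do not come from the paper.

```python
def csi_far(hits: int, false_alarms: int, misses: int) -> tuple[float, float]:
    """Return (CSI, FAR) from contingency-table counts."""
    csi_den = hits + false_alarms + misses
    far_den = hits + false_alarms
    csi = hits / csi_den if csi_den else float("nan")
    far = false_alarms / far_den if far_den else float("nan")
    return csi, far


# Hypothetical counts for two forecasters scored on the same 100 rain events.
cautious = csi_far(hits=40, false_alarms=5, misses=60)
eager = csi_far(hits=80, false_alarms=60, misses=20)
print(f"cautious: CSI={cautious[0]:.3f}, FAR={cautious[1]:.3f}")
print(f"eager:    CSI={eager[0]:.3f}, FAR={eager[1]:.3f}")
```

With these made-up counts, the eager forecaster gets the better (higher) CSI and the worse (higher) FAR, which is the kind of conflict the two-stage alignment is meant to reconcile.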

Title: SCoPE VLM: Selective Context Processing for Efficient Document Navigation in Vision-Language Models

Authors: Gyubeum Lim, Yemo Koo, Vijay Krishna Madisetti
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2510.21850
Pdf URL: https://arxiv.org/pdf/2510.21850
Copy Paste: [[2510.21850]] SCoPE VLM: Selective Context Processing for Efficient Document Navigation in Vision-Language Models(https://arxiv.org/abs/2510.21850)
Keywords: generation
Abstract: Understanding long-context visual information remains a fundamental challenge for vision-language models, particularly in agentic tasks such as GUI control and web navigation. While web pages and GUI environments are inherently structured documents, current VLMs typically neglect decision-oriented document understanding in their training objectives. Existing approaches primarily extend visual embeddings to process long, high-resolution inputs, but these methods are memory-intensive and impractical for locally deployable solutions. To address these issues, we propose SCoPE VLM, a document navigation expert that leverages a novel Chain of Scroll mechanism to selectively and recursively navigate documents, focusing exclusively on relevant segments. We introduce a dedicated data generation pipeline to construct informative Chain of Scroll trajectories and Episodic Group Relative Policy Optimization, a tailored reinforcement learning method to reduce the gap between training and inference. Our method substantially reduces memory usage and effectively models human-like reading behaviors. To the best of our knowledge, SCoPE VLM is the first framework to explicitly model agentic reading patterns in multi-page document question answering, advancing the capabilities of multimodal agents.

Title: Poisson Flow Consistency Training

Authors: Anthony Zhang, Mahmut Gokmen, Dennis Hein, Rongjun Ge, Wenjun Xia, Ge Wang, Jin Chen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2510.21857
Pdf URL: https://arxiv.org/pdf/2510.21857
Copy Paste: [[2510.21857]] Poisson Flow Consistency Training(https://arxiv.org/abs/2510.21857)
Keywords: generation, generative
Abstract: The Poisson Flow Consistency Model (PFCM) is a consistency-style model based on the robust Poisson Flow Generative Model++ (PFGM++), which has achieved success in unconditional image generation and CT image denoising. Yet the PFCM can only be trained by distillation, which limits its potential in many data modalities. The objective of this research was to create a method for training the PFCM in isolation, called Poisson Flow Consistency Training (PFCT). The perturbation kernel was leveraged to remove the dependence on the pretrained PFGM++, and a sinusoidal discretization schedule and Beta noise distribution were introduced to facilitate adaptability and improve sample quality. The model was applied to low-dose computed tomography image denoising and improved the low-dose image in terms of LPIPS and SSIM. It also displayed denoising effectiveness similar to that of models such as the Consistency Model. Its effectiveness in denoising CT images establishes PFCT as a valid method of training the PFCM, with results competitive with other generative models. Further study is needed on the precise optimization of PFCT and on its applicability to other generative modeling tasks. The PFCT framework adds flexibility to the ways in which a PFCM can be created and applied in the field of generative modeling.

Title: The Mirror Loop: Recursive Non-Convergence in Generative Reasoning Systems

Authors: Bentley DeVilling (Course Correct Labs, Independent Research Group)
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2510.21861
Pdf URL: https://arxiv.org/pdf/2510.21861
Copy Paste: [[2510.21861]] The Mirror Loop: Recursive Non-Convergence in Generative Reasoning Systems(https://arxiv.org/abs/2510.21861)
Keywords: generative
Abstract: Large language models are often described as capable of reflective reasoning, yet recursive self-evaluation without external feedback frequently yields reformulation rather than progress. We test this prediction in a cross-provider study of 144 reasoning sequences across three models (OpenAI GPT-4o-mini, Anthropic Claude 3 Haiku, and Google Gemini 2.0 Flash) and four task families (arithmetic, code, explanation, reflection), each iterated ten times under two conditions: ungrounded self-critique and a minimal grounding intervention (a single verification step at iteration three). Mean informational change (delta I, measured via normalized edit distance) declined by 55% from early (0.193) to late (0.087) iterations in ungrounded runs, with consistent patterns across all three providers. Grounded runs showed a +28% rebound in informational change immediately after the intervention and sustained non-zero variance thereafter. Complementary measures (n-gram novelty, embedding drift, and character-level entropy) converged on the same pattern: reflection without contact tends toward informational closure. We interpret this as evidence for a structural limit on self-correction in generative reasoning: without an exchange of information with an independent verifier or environment, recursive inference approaches an attractor state of epistemic stasis. Minimal grounding functions as dissipative coupling, reintroducing informational flux. The cross-architecture consistency suggests the mirror loop arises from shared autoregressive training objectives rather than provider-specific alignment schemes. The results delineate when reflection is performative rather than epistemic and motivate design principles for grounded, cooperative reasoning. Materials and code are publicly available.
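The informational-change measure in this abstract is a normalized edit distance between successive iterations. The abstract does not say how the distance is normalized, so the sketch below assumes character-level Levenshtein distance divided by the length of the longer string. The function names are hypothetical. The last helper simply reproduces the reported 55% relative decline from the two quoted means.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance via the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def normalized_change(prev_text: str, next_text: str) -> float:
    """Assumed normalization: edit distance over the longer string's length."""
    longest = max(len(prev_text), len(next_text))
    return levenshtein(prev_text, next_text) / longest if longest else 0.0


def relative_decline(early: float, late: float) -> float:
    """Fractional drop from an early mean to a late mean."""
    return (early - late) / early


print(normalized_change("the answer is 42", "the answer is 42."))
print(f"{relative_decline(0.193, 0.087):.0%}")
```

A value near zero from `normalized_change` means an iteration barely altered the previous text, which is the "reformulation rather than progress" pattern the abstract describes.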

Title: Generative AI in Depth: A Survey of Recent Advances, Model Variants, and Real-World Applications

Authors: Shamim Yazdani, Akansha Singh, Nripsuta Saxena, Zichong Wang, Avash Palikhe, Deng Pan, Umapada Pal, Jie Yang, Wenbin Zhang
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.21887
Pdf URL: https://arxiv.org/pdf/2510.21887
Copy Paste: [[2510.21887]] Generative AI in Depth: A Survey of Recent Advances, Model Variants, and Real-World Applications(https://arxiv.org/abs/2510.21887)
Keywords: generative
Abstract: In recent years, deep-learning-based generative models, particularly Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), and Diffusion Models (DMs), have been instrumental in generating diverse, high-quality content across various domains, such as image and video synthesis. This capability has led to widespread adoption of these models and has captured strong public interest. As they continue to advance at a rapid pace, the growing volume of research, expanding application areas, and unresolved technical challenges make it increasingly difficult to stay current. To address this need, this survey introduces a comprehensive taxonomy that organizes the literature and provides a cohesive framework for understanding the development of GANs, VAEs, and DMs, including their many variants and combined approaches. We highlight key innovations that have improved the quality, diversity, and controllability of generated outputs, reflecting the expanding potential of generative artificial intelligence. In addition to summarizing technical progress, we examine rising ethical concerns, including the risks of misuse and the broader societal impact of synthetic media. Finally, we outline persistent challenges and propose future research directions, offering a structured and forward-looking perspective for researchers in this fast-evolving field.

Title: The Principles of Diffusion Models

Authors: Chieh-Hsin Lai, Yang Song, Dongjun Kim, Yuki Mitsufuji, Stefano Ermon
Subjects: cs.LG, cs.AI, cs.GR
Abstract URL: https://arxiv.org/abs/2510.21890
Pdf URL: https://arxiv.org/pdf/2510.21890
Copy Paste: [[2510.21890]] The Principles of Diffusion Models(https://arxiv.org/abs/2510.21890)
Keywords: generation
Abstract: This monograph presents the core principles that have guided the development of diffusion models, tracing their origins and showing how diverse formulations arise from shared mathematical ideas. Diffusion modeling starts by defining a forward process that gradually corrupts data into noise, linking the data distribution to a simple prior through a continuum of intermediate distributions. The goal is to learn a reverse process that transforms noise back into data while recovering the same intermediates. We describe three complementary views. The variational view, inspired by variational autoencoders, sees diffusion as learning to remove noise step by step. The score-based view, rooted in energy-based modeling, learns the gradient of the evolving data distribution, indicating how to nudge samples toward more likely regions. The flow-based view, related to normalizing flows, treats generation as following a smooth path that moves samples from noise to data under a learned velocity field. These perspectives share a common backbone: a time-dependent velocity field whose flow transports a simple prior to the data. Sampling then amounts to solving a differential equation that evolves noise into data along a continuous trajectory. On this foundation, the monograph discusses guidance for controllable generation, efficient numerical solvers, and diffusion-motivated flow-map models that learn direct mappings between arbitrary times. It provides a conceptual and mathematically grounded understanding of diffusion models for readers with basic deep-learning knowledge.
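The statement that sampling amounts to solving a differential equation can be illustrated with a toy Euler integrator. This is a sketch under strong assumptions, not the monograph's method: the velocity field is written by hand rather than learned, the "data distribution" is a single point `mu`, and time runs from t = 0 (noise) to t = 1 (data). For the straight-line path x_t = (1 - t) x_0 + t mu, the velocity is v(x, t) = (mu - x) / (1 - t). All names are hypothetical.

```python
import random


def toy_velocity(x: float, t: float, mu: float) -> float:
    """Hand-written velocity for a straight-line path toward a single data point mu.

    A real diffusion or flow model would replace this with a learned network.
    """
    return (mu - x) / (1.0 - t)


def euler_sample(x0: float, mu: float, steps: int) -> float:
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed-step Euler."""
    dt = 1.0 / steps
    x = x0
    for k in range(steps):
        t = k * dt  # t stays strictly below 1, so the division is safe
        x = x + dt * toy_velocity(x, t, mu)
    return x


random.seed(0)
noise = random.gauss(0.0, 1.0)  # draw from the simple prior
print(euler_sample(noise, mu=3.0, steps=10))
```

Because the toy path is a straight line, even a coarse Euler grid lands on `mu`. With a learned, curved velocity field the step count and solver choice matter, which is where the efficient numerical solvers discussed in the monograph come in.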

Title: Parallel Sampling from Masked Diffusion Models via Conditional Independence Testing

Authors: Iskander Azangulov, Teodora Pandeva, Niranjani Prasad, Javier Zazo, Sushrut Karmalkar
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2510.21961
Pdf URL: https://arxiv.org/pdf/2510.21961
Copy Paste: [[2510.21961]] Parallel Sampling from Masked Diffusion Models via Conditional Independence Testing(https://arxiv.org/abs/2510.21961)
Keywords: generation
Abstract: Masked diffusion models (MDMs) offer a compelling alternative to autoregressive models (ARMs) for discrete text generation because they enable parallel token sampling, rather than sequential, left-to-right generation. This means potentially much faster inference. However, effective parallel sampling faces two competing requirements: (i) simultaneously updated tokens must be conditionally independent, and (ii) updates should prioritise high-confidence predictions. These goals conflict because high-confidence predictions often cluster and depend on each other, reducing opportunities for parallel updates. We present PUNT, a model-agnostic sampler that reconciles this trade-off. Our method identifies token dependencies and removes lower-confidence tokens from conflicting groups. This produces sets of indices for unmasking that satisfy both independence and confidence criteria. Our approach ensures improved parallel unmasking through approximate conditional independence testing. Our experiments show that PUNT delivers a superior trade-off between accuracy and compute when compared to other strong training-free baselines, especially for generation of longer sequences. On the IFEval benchmark, it achieves up to 16% higher accuracy over baseline methods, including sequential generation (one-by-one). These gains hold across different values of hyperparameters, mitigating the need for brittle hyperparameter tuning. Moreover, we observe that PUNT induces an emergent hierarchical generation strategy, where the model first establishes high-level paragraph structure before local refinement, suggesting a planning-like generation process that contributes to strong alignment performance.
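The idea of removing lower-confidence tokens from conflicting groups can be sketched as a greedy selection. This is not the PUNT algorithm itself, whose dependency test the abstract does not detail. The sketch assumes the dependent pairs are already known, and the function name and inputs are hypothetical.

```python
def select_parallel_unmask(
    confidence: dict[int, float],
    dependent_pairs: set[frozenset[int]],
) -> list[int]:
    """Greedily keep high-confidence positions that do not depend on one another.

    dependent_pairs is assumed to be given; a real sampler would estimate it,
    for example with approximate conditional independence tests.
    """
    chosen: list[int] = []
    for pos in sorted(confidence, key=confidence.get, reverse=True):
        # Skip this position if it conflicts with any position already kept.
        if all(frozenset((pos, kept)) not in dependent_pairs for kept in chosen):
            chosen.append(pos)
    return sorted(chosen)


# Hypothetical masked positions with model confidences and two dependent pairs.
confidence = {0: 0.95, 1: 0.90, 2: 0.60, 3: 0.85}
dependent_pairs = {frozenset((0, 1)), frozenset((2, 3))}
print(select_parallel_unmask(confidence, dependent_pairs))  # [0, 3]
```

In each conflicting pair only the more confident position survives, so the returned positions can be unmasked in the same step while the rest wait for a later one.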

Title: Sprint: Sparse-Dense Residual Fusion for Efficient Diffusion Transformers

Authors: Dogyun Park, Moayed Haji-Ali, Yanyu Li, Willi Menapace, Sergey Tulyakov, Hyunwoo J. Kim, Aliaksandr Siarohin, Anil Kag
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.21986
Pdf URL: https://arxiv.org/pdf/2510.21986
Copy Paste: [[2510.21986]] Sprint: Sparse-Dense Residual Fusion for Efficient Diffusion Transformers(https://arxiv.org/abs/2510.21986)
Keywords: generative
Abstract: Diffusion Transformers (DiTs) deliver state-of-the-art generative performance but their quadratic training cost with sequence length makes large-scale pretraining prohibitively expensive. Token dropping can reduce training cost, yet naïve strategies degrade representations, and existing methods are either parameter-heavy or fail at high drop ratios. We present SPRINT, Sparse-Dense Residual Fusion for Efficient Diffusion Transformers, a simple method that enables aggressive token dropping (up to 75%) while preserving quality. SPRINT leverages the complementary roles of shallow and deep layers: early layers process all tokens to capture local detail, deeper layers operate on a sparse subset to cut computation, and their outputs are fused through residual connections. Training follows a two-stage schedule: long masked pre-training for efficiency followed by short full-token fine-tuning to close the train-inference gap. On ImageNet-1K 256x256, SPRINT achieves 9.8x training savings with comparable FID/FDD, and at inference, its Path-Drop Guidance (PDG) nearly halves FLOPs while improving quality. These results establish SPRINT as a simple, effective, and general solution for efficient DiT training.

Title: FlowOpt: Fast Optimization Through Whole Flow Processes for Training-Free Editing

Authors: Or Ronai, Vladimir Kulikov, Tomer Michaeli
Subjects: cs.CV, cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2510.22010
Pdf URL: https://arxiv.org/pdf/2510.22010
Copy Paste: [[2510.22010]] FlowOpt: Fast Optimization Through Whole Flow Processes for Training-Free Editing(https://arxiv.org/abs/2510.22010)
Keywords: restoration, generation
Abstract: The remarkable success of diffusion and flow-matching models has ignited a surge of work on adapting them at test time for controlled generation tasks. Examples range from image editing to restoration, compression and personalization. However, due to the iterative nature of the sampling process in those models, it is computationally impractical to use gradient-based optimization to directly control the image generated at the end of the process. As a result, existing methods typically resort to manipulating each timestep separately. Here we introduce FlowOpt, a zero-order (gradient-free) optimization framework that treats the entire flow process as a black box, enabling optimization through the whole sampling path without backpropagation through the model. Our method is highly efficient and allows users to monitor the intermediate optimization results and perform early stopping if desired. We prove a sufficient condition on FlowOpt's step-size, under which convergence to the global optimum is guaranteed. We further show how to empirically estimate this upper bound so as to choose an appropriate step-size. We demonstrate how FlowOpt can be used for image editing, showcasing two options: (i) inversion (determining the initial noise that generates a given image), and (ii) directly steering the edited image to be similar to the source image while conforming to a target text prompt. In both cases, FlowOpt achieves state-of-the-art results while using roughly the same number of neural function evaluations (NFEs) as existing methods. Code and examples are available on the project's webpage.
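To make "gradient-free optimization through a black box" concrete, here is one simple scheme that uses only function evaluations: a residual iteration for inversion. It is an illustrative assumption and is not claimed to be FlowOpt's update rule. The scalar `toy_flow` stands in for a full sampling process, and every name is hypothetical.

```python
from typing import Callable


def invert_black_box(
    f: Callable[[float], float],
    target: float,
    z0: float,
    step_size: float,
    iterations: int,
) -> float:
    """Find z with f(z) close to target using only evaluations of f (no gradients)."""
    z = z0
    for _ in range(iterations):
        residual = f(z) - target    # one black-box evaluation per iteration
        z = z - step_size * residual
    return z


def toy_flow(z: float) -> float:
    """Stand-in for a whole sampling process mapping initial noise to an output."""
    return 2.0 * z + 1.0


recovered = invert_black_box(toy_flow, target=7.0, z0=0.0, step_size=0.25, iterations=50)
print(recovered)
```

For this toy map the error obeys e_next = (1 - 2 * step_size) * e, so the iteration contracts only when 0 < step_size < 1. That gives a small-scale picture of why a step-size bound matters for convergence and why intermediate values of `z` can be monitored for early stopping.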

Title: Linearized Optimal Transport for Analysis of High-Dimensional Point-Cloud and Single-Cell Data

Authors: Tianxiang Wang, Yingtong Ke, Dhananjay Bhaskar, Smita Krishnaswamy, Alexander Cloninger
Subjects: cs.LG, q-bio.QM, stat.ML
Abstract URL: https://arxiv.org/abs/2510.22033
Pdf URL: https://arxiv.org/pdf/2510.22033
Copy Paste: [[2510.22033]] Linearized Optimal Transport for Analysis of High-Dimensional Point-Cloud and Single-Cell Data(https://arxiv.org/abs/2510.22033)
Keywords: generation, generative
Abstract: Single-cell technologies generate high-dimensional point clouds of cells, enabling detailed characterization of complex patient states and treatment responses. Yet each patient is represented by an irregular point cloud rather than a simple vector, making it difficult to directly quantify and compare biological differences between individuals. Nonlinear methods such as kernels and neural networks achieve predictive accuracy but act as black boxes, offering little biological interpretability. To address these limitations, we adapt the Linear Optimal Transport (LOT) framework to this setting, embedding irregular point clouds into a fixed-dimensional Euclidean space while preserving distributional structure. This embedding provides a principled linear representation that preserves optimal transport geometry while enabling downstream analysis. It also forms a registration between any two patients, enabling direct comparison of their cellular distributions. Within this space, LOT enables: (i) accurate and interpretable classification of COVID-19 patient states, where classifier weights map back to specific markers and spatial regions driving predictions; and (ii) synthetic data generation for patient-derived organoids, exploiting the linearity of the LOT embedding. LOT barycenters yield averaged cellular profiles representing combined conditions or samples, supporting drug interaction testing. Together, these results establish LOT as a unified framework that bridges predictive performance, interpretability, and generative modeling. By transforming heterogeneous point clouds into structured embeddings directly traceable to the original data, LOT opens new opportunities for understanding immune variation and treatment effects in high-dimensional biological systems.

Title: PF$Δ$: A Benchmark Dataset for Power Flow under Load, Generation, and Topology Variations

Authors: Ana K. Rivera, Anvita Bhagavathula, Alvaro Carbonero, Priya Donti
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2510.22048
Pdf URL: https://arxiv.org/pdf/2510.22048
Copy Paste: [[2510.22048]] PF$Δ$: A Benchmark Dataset for Power Flow under Load, Generation, and Topology Variations(https://arxiv.org/abs/2510.22048)
Keywords: generation
Abstract: Power flow (PF) calculations are the backbone of real-time grid operations, across workflows such as contingency analysis (where repeated PF evaluations assess grid security under outages) and topology optimization (which involves PF-based searches over combinatorially large action spaces). Running these calculations at operational timescales or across large evaluation spaces remains a major computational bottleneck. Additionally, growing uncertainty in power system operations from the integration of renewables and climate-induced extreme weather calls for tools that can accurately and efficiently simulate a wide range of scenarios and operating conditions. Machine learning methods offer a potential speedup over traditional solvers, but their performance has not been systematically assessed on benchmarks that capture real-world variability. This paper introduces PF$\Delta$, a benchmark dataset for power flow that captures diverse variations in load, generation, and topology. PF$\Delta$ contains 859,800 solved power flow instances spanning six different bus system sizes, capturing three types of contingency scenarios (N, N-1, and N-2), and including close-to-infeasible cases near steady-state voltage stability limits. We evaluate traditional solvers and GNN-based methods, highlighting key areas where existing approaches struggle, and identifying open problems for future research. Our dataset is available at this https URL and our code with data generation scripts and model implementations is at this https URL.

Title: Capturing Gaze Shifts for Guidance: Cross-Modal Fusion Enhancement for VLM Hallucination Mitigation

Authors: Zheng Qi, Chao Shang, Evangelia Spiliopoulou, Nikolaos Pappas
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.22067
Pdf URL: https://arxiv.org/pdf/2510.22067
Copy Paste: [[2510.22067]] Capturing Gaze Shifts for Guidance: Cross-Modal Fusion Enhancement for VLM Hallucination Mitigation(https://arxiv.org/abs/2510.22067)
Keywords: generative
Abstract: Vision language models (VLMs) often generate hallucinations, i.e., content that cannot be substantiated by either textual or visual inputs. Prior work primarily attributes this to over-reliance on linguistic prior knowledge rather than visual inputs. Some methods attempt to mitigate hallucination by amplifying visual token attention proportionally to their attention scores. However, these methods overlook the visual attention sink problem, where attention is frequently misallocated to task-irrelevant visual regions, and neglect cross-modal fusion balance by enhancing only visual attention without adjusting attention to the user query. This can result in amplifying incorrect areas while failing to properly interpret the user query. To address these challenges, we propose a simple yet effective method called Gaze Shift-Guided Cross-modal Fusion Enhancement (GIFT). GIFT pre-computes a holistic visual saliency map by tracking positive changes in visual attention, or "gaze shifts", during user query comprehension, and leverages this map to amplify attention to both salient visual information and the user query at each decoding step. This reduces the impact of visual attention sink, as irrelevant tokens exhibit minimal shifts, while ensuring balanced cross-modal fusion for well-integrated representation. Extensive experiments show that GIFT effectively mitigates hallucination in VLMs across both generative and classification tasks, achieving up to 20.7% improvement over greedy decoding, while maintaining general vision-language performance with low computational overhead.
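The "gaze shift" idea of tracking positive changes in visual attention can be sketched as follows. This is a toy illustration only: the abstract does not give the exact accumulation or normalization rule, so summing clipped differences and normalizing to 1 are assumptions, and `gaze_shift_saliency` is a hypothetical name.

```python
def gaze_shift_saliency(attention_steps):
    """Accumulate only the increases in attention per visual token.

    attention_steps: list of attention vectors over the same visual tokens,
    one per query-comprehension step (hypothetical input layout).
    """
    n_tokens = len(attention_steps[0])
    saliency = [0.0] * n_tokens
    for prev, curr in zip(attention_steps, attention_steps[1:]):
        for i in range(n_tokens):
            saliency[i] += max(0.0, curr[i] - prev[i])  # ignore decreases
    total = sum(saliency)
    return [s / total for s in saliency] if total > 0 else saliency

# Token 0 behaves like an attention sink: high attention, but never rising.
# Token 1 gains attention as the query is read. Token 2 stays flat.
steps = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.5, 0.4, 0.1],
]
saliency = gaze_shift_saliency(steps)
```

In this toy input the sink token receives zero saliency despite having the largest raw attention, which is the behavior the abstract attributes to tracking shifts rather than magnitudes.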

Title: MAGIC-Flow: Multiscale Adaptive Conditional Flows for Generation and Interpretable Classification

Authors: Luca Caldera, Giacomo Bottacini, Lara Cavinato
Subjects: cs.LG, cs.CV, eess.IV, stat.ML
Abstract URL: https://arxiv.org/abs/2510.22070
Pdf URL: https://arxiv.org/pdf/2510.22070
Copy Paste: [[2510.22070]] MAGIC-Flow: Multiscale Adaptive Conditional Flows for Generation and Interpretable Classification(https://arxiv.org/abs/2510.22070)
Keywords: generation, generative
Abstract: Generative modeling has emerged as a powerful paradigm for representation learning, but its direct applicability to challenging fields like medical imaging remains limited: mere generation, without task alignment, fails to provide a robust foundation for clinical use. We propose MAGIC-Flow, a conditional multiscale normalizing flow architecture that performs generation and classification within a single modular framework. The model is built as a hierarchy of invertible and differentiable bijections, where the Jacobian determinant factorizes across sub-transformations. We show how this ensures exact likelihood computation and stable optimization, while invertibility enables explicit visualization of sample likelihoods, providing an interpretable lens into the model's reasoning. By conditioning on class labels, MAGIC-Flow supports controllable sample synthesis and principled class-probability estimation, effectively aiding both generative and discriminative objectives. We evaluate MAGIC-Flow against top baselines using metrics for similarity, fidelity, and diversity. Across multiple datasets, it addresses generation and classification under scanner noise, and modality-specific synthesis and identification. Results show MAGIC-Flow creates realistic, diverse samples and improves classification. MAGIC-Flow is an effective strategy for generation and classification in data-limited domains, with direct benefits for privacy-preserving augmentation, robust generalization, and trustworthy medical AI.
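The statement that the Jacobian determinant factorizes across sub-transformations is what makes exact likelihoods cheap: the log-determinants of the individual bijections simply add. A minimal sketch with scalar affine bijections and a standard normal base density follows; the `Affine` class and helper names are illustrative assumptions, not MAGIC-Flow's architecture.

```python
import math

class Affine:
    """Invertible scalar map z = scale * x + shift (scale must be nonzero)."""
    def __init__(self, scale, shift):
        self.scale = scale
        self.shift = shift

    def forward(self, x):
        # Returns the transformed value and log|det J| of this layer.
        return self.scale * x + self.shift, math.log(abs(self.scale))

    def inverse(self, z):
        return (z - self.shift) / self.scale

def log_likelihood(x, layers):
    """Exact log p(x) = log p_base(f(x)) + sum of per-layer log|det J|."""
    total_log_det = 0.0
    for layer in layers:
        x, log_det = layer.forward(x)
        total_log_det += log_det
    log_base = -0.5 * (x * x + math.log(2.0 * math.pi))  # standard normal
    return log_base + total_log_det

def sample_from_latent(z, layers):
    """Invertibility: map a latent value back to data space."""
    for layer in reversed(layers):
        z = layer.inverse(z)
    return z

# These two layers compose to the identity, so their log-determinants cancel.
layers = [Affine(2.0, 1.0), Affine(0.5, -0.5)]
ll = log_likelihood(0.3, layers)
```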

Title: Discovering Latent Graphs with GFlowNets for Diverse Conditional Image Generation

Authors: Bailey Trang, Parham Saremi, Alan Q. Wang, Fangrui Huang, Zahra TehraniNasab, Amar Kumar, Tal Arbel, Li Fei-Fei, Ehsan Adeli
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2510.22107
Pdf URL: https://arxiv.org/pdf/2510.22107
Copy Paste: [[2510.22107]] Discovering Latent Graphs with GFlowNets for Diverse Conditional Image Generation(https://arxiv.org/abs/2510.22107)
Keywords: generation, generative
Abstract: Capturing diversity is crucial in conditional and prompt-based image generation, particularly when conditions contain uncertainty that can lead to multiple plausible outputs. To generate diverse images reflecting this diversity, traditional methods often modify random seeds, making it difficult to discern meaningful differences between samples, or diversify the input prompt, which is limited in verbally interpretable diversity. We propose Rainbow, a novel conditional image generation framework, applicable to any pretrained conditional generative model, that addresses inherent condition/prompt uncertainty and generates diverse plausible images. Rainbow is based on a simple yet effective idea: decomposing the input condition into diverse latent representations, each capturing an aspect of the uncertainty and generating a distinct image. First, we integrate a latent graph, parameterized by Generative Flow Networks (GFlowNets), into the prompt representation computation. Second, leveraging GFlowNets' advanced graph sampling capabilities to capture uncertainty and output diverse trajectories over the graph, we produce multiple trajectories that collectively represent the input condition, leading to diverse condition representations and corresponding output images. Evaluations on natural image and medical image datasets demonstrate Rainbow's improvement in both diversity and fidelity across image synthesis, image generation, and counterfactual generation tasks.

Title: GRAID: Enhancing Spatial Reasoning of VLMs Through High-Fidelity Data Generation

Authors: Karim Elmaaroufi, Liheng Lai, Justin Svegliato, Yutong Bai, Sanjit A. Seshia, Matei Zaharia
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2510.22118
Pdf URL: https://arxiv.org/pdf/2510.22118
Copy Paste: [[2510.22118]] GRAID: Enhancing Spatial Reasoning of VLMs Through High-Fidelity Data Generation(https://arxiv.org/abs/2510.22118)
Keywords: generation, generative
Abstract: Vision Language Models (VLMs) achieve strong performance on many vision-language tasks but often struggle with spatial reasoning, a prerequisite for many applications. Empirically, we find that a dataset produced by a current training data generation pipeline has a 57.6% human validation rate. This rate stems from current limitations: single-image 3D reconstruction introduces cascading modeling errors and requires wide answer tolerances, while caption-based methods require hyper-detailed annotations and suffer from generative hallucinations. We present GRAID, built on the key insight that qualitative spatial relationships can be reliably determined from 2D geometric primitives alone. By operating exclusively on 2D bounding boxes from standard object detectors, GRAID avoids both 3D reconstruction errors and generative hallucinations, resulting in datasets that, as validated by human evaluations, are of higher quality than those produced by existing tools. We apply our framework to the BDD100k, NuImages, and Waymo datasets, generating over 8.5 million high-quality VQA pairs, with questions spanning spatial relations, counting, ranking, and size comparisons. We evaluate one of the datasets and find it achieves 91.16% human-validated accuracy, compared to 57.6% on a dataset generated by recent work. Critically, we demonstrate that when trained on GRAID data, models learn spatial reasoning concepts that generalize: models fine-tuned on 6 question types improve on over 10 held-out types, with accuracy gains of 47.5% on BDD and 37.9% on NuImages for Llama 3.2 11B, and when trained on all question types, achieve improvements on several existing benchmarks such as BLINK. The GRAID framework, datasets, and additional information can be found on our project page (this https URL).
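The claim that qualitative spatial relations can be read off 2D bounding boxes can be illustrated with a short sketch. The `Box` type, the function names, and the rule of skipping horizontally overlapping pairs are assumptions made for illustration; they are not taken from the GRAID codebase.

```python
from typing import NamedTuple, Optional, Tuple

class Box(NamedTuple):
    """Axis-aligned box in pixel coordinates: (x1, y1) top-left, (x2, y2) bottom-right."""
    x1: float
    y1: float
    x2: float
    y2: float

def left_right_relation(a: Box, b: Box) -> Optional[str]:
    """Return 'left' or 'right' for a relative to b, or None when ambiguous."""
    if a.x2 < b.x1:
        return "left"
    if b.x2 < a.x1:
        return "right"
    return None  # horizontal extents overlap: skip rather than guess

def make_question(label_a: str, a: Box, label_b: str, b: Box) -> Optional[Tuple[str, str]]:
    """Build one yes/no VQA pair, or None if the relation is ambiguous."""
    relation = left_right_relation(a, b)
    if relation is None:
        return None
    question = f"Is the {label_a} to the left of the {label_b}?"
    return question, "yes" if relation == "left" else "no"

# Hypothetical detector outputs.
car = Box(10, 50, 60, 90)
sign = Box(100, 20, 120, 40)
truck = Box(40, 40, 110, 95)   # overlaps the car horizontally

pair = make_question("car", car, "sign", sign)
```

Returning `None` for overlapping boxes reflects the design choice of only emitting questions whose answer is unambiguous from geometry alone.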

Title: I2-NeRF: Learning Neural Radiance Fields Under Physically-Grounded Media Interactions

Authors: Shuhong Liu, Lin Gu, Ziteng Cui, Xuangeng Chu, Tatsuya Harada
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.22161
Pdf URL: https://arxiv.org/pdf/2510.22161
Copy Paste: [[2510.22161]] I2-NeRF: Learning Neural Radiance Fields Under Physically-Grounded Media Interactions(https://arxiv.org/abs/2510.22161)
Keywords: generative
Abstract: As part of the effort to endow generative AI with 3D physical-world perception, we propose I2-NeRF, a novel neural radiance field framework that enhances isometric and isotropic metric perception under media degradation. While existing NeRF models predominantly rely on object-centric sampling, I2-NeRF introduces a reverse-stratified upsampling strategy to achieve near-uniform sampling across 3D space, thereby preserving isometry. We further present a general radiative formulation for media degradation that unifies emission, absorption, and scattering into a particle model governed by the Beer-Lambert attenuation law. By composing the direct and media-induced in-scatter radiance, this formulation extends naturally to complex media environments such as underwater, haze, and even low-light scenes. By treating light propagation uniformly in both vertical and horizontal directions, I2-NeRF enables isotropic metric perception and can even estimate medium properties such as water depth. Experiments on real-world datasets demonstrate that our method significantly improves both reconstruction fidelity and physical plausibility compared to existing approaches.
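For readers unfamiliar with the Beer-Lambert law, the sketch below shows the attenuation term and a common single-medium composition of direct and in-scattered radiance. The homogeneous medium, the constant attenuation coefficient `sigma`, and the function names are simplifying assumptions; the paper's full radiative formulation is not reproduced here.

```python
import math

def transmittance(sigma, distance):
    """Beer-Lambert law: fraction of light surviving `distance` in a medium
    with constant attenuation coefficient `sigma`."""
    return math.exp(-sigma * distance)

def observed_radiance(direct, medium, sigma, distance):
    """Attenuated object radiance plus in-scattered medium radiance."""
    t = transmittance(sigma, distance)
    return t * direct + (1.0 - t) * medium

def distance_from_transmittance(t, sigma):
    """Invert the law to recover path length (e.g. a depth estimate)."""
    return -math.log(t) / sigma

near = observed_radiance(direct=0.8, medium=0.2, sigma=0.5, distance=0.0)
far = observed_radiance(direct=0.8, medium=0.2, sigma=0.5, distance=100.0)
```

At zero distance the camera sees the object radiance unchanged; at large distances the observation converges to the medium's own color, which is why haze and water wash out distant objects.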

Title: HARMONY: Hidden Activation Representations and Model Output-Aware Uncertainty Estimation for Vision-Language Models

Authors: Erum Mushtaq, Zalan Fabian, Yavuz Faruk Bakman, Anil Ramakrishna, Mahdi Soltanolkotabi, Salman Avestimehr
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.22171
Pdf URL: https://arxiv.org/pdf/2510.22171
Copy Paste: [[2510.22171]] HARMONY: Hidden Activation Representations and Model Output-Aware Uncertainty Estimation for Vision-Language Models(https://arxiv.org/abs/2510.22171)
Keywords: generation
Abstract: The growing deployment of Vision-Language Models (VLMs) in high-stakes applications such as autonomous driving and assistive technologies for visually impaired individuals necessitates reliable mechanisms to assess the trustworthiness of their generations. Uncertainty Estimation (UE) plays a central role in quantifying the reliability of model outputs and reducing unsafe generations via selective prediction. In this regard, most existing probability-based UE approaches rely on output probability distributions, aggregating token probabilities into a single uncertainty score using predefined functions such as length-normalization. Another line of research leverages model hidden representations and trains MLP-based models to predict uncertainty. However, these methods often fail to capture the complex multimodal relationships between semantic and textual tokens and struggle to identify biased probabilities often influenced by language priors. Motivated by these observations, we propose a novel UE framework, HARMONY, that jointly leverages fused multimodal information in model activations and the output distribution of the VLM to determine the reliability of responses. The key hypothesis of our work is that both the model's internal belief in its visual understanding, captured by its hidden representations, and the produced token probabilities carry valuable reliability signals that can be jointly leveraged to improve UE performance, surpassing approaches that rely on only one of these components. Experimental results on three open-ended VQA benchmarks, A-OKVQA, VizWiz, and PathVQA, and three state-of-the-art VLMs, LLaVA-7b, LLaVA-13b, and InstructBLIP, demonstrate that our method consistently performs on par with or better than existing approaches, achieving up to 4% improvement in AUROC and 6% in PRR, establishing a new state of the art in uncertainty estimation for VLMs.
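The probability-based baseline that the abstract contrasts with, length-normalized aggregation of token probabilities, fits in a few lines. This sketch shows that baseline only, not HARMONY itself; the function name and the example probabilities are made up for illustration.

```python
import math

def length_normalized_uncertainty(token_probs):
    """Mean negative log-probability of the generated tokens.

    Dividing by the sequence length keeps long answers from looking
    more uncertain merely because they contain more tokens.
    """
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

confident = length_normalized_uncertainty([0.9, 0.95, 0.9])
hesitant = length_normalized_uncertainty([0.4, 0.5, 0.3])
short = length_normalized_uncertainty([0.5, 0.5])
long = length_normalized_uncertainty([0.5, 0.5, 0.5, 0.5])
```

A score like this uses only the output distribution; the abstract's argument is that hidden activations carry additional reliability signal that such a function cannot see.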

Title: Scaling Non-Parametric Sampling with Representation

Authors: Vincent Lu, Aaron Truong, Zeyu Yun, Yubei Chen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2510.22196
Pdf URL: https://arxiv.org/pdf/2510.22196
Copy Paste: [[2510.22196]] Scaling Non-Parametric Sampling with Representation(https://arxiv.org/abs/2510.22196)
Keywords: generative
Abstract: Scaling and architectural advances have produced strikingly photorealistic image generative models, yet their mechanisms remain opaque. Rather than advancing scaling, our goal is to strip away complicated engineering tricks and propose a simple, non-parametric generative model. Our design is grounded in three principles of natural images, (i) spatial non-stationarity, (ii) low-level regularities, and (iii) high-level semantics, and defines each pixel's distribution from its local context window. Despite its minimal architecture and no training, the model produces high-fidelity samples on MNIST and visually compelling CIFAR-10 images. This combination of simplicity and strong empirical performance points toward a minimal theory of natural-image structure. The model's white-box nature also allows us to have a mechanistic understanding of how the model generalizes and generates diverse images. We study it by tracing each generated pixel back to its source images. These analyses reveal a simple, compositional procedure for "part-whole generalization", suggesting a hypothesis for how large neural network generative models learn to generalize.

Title: MOGRAS: Human Motion with Grasping in 3D Scenes

Authors: Kunal Bhosikar, Siddharth Katageri, Vivek Madhavaram, Kai Han, Charu Sharma
Subjects: cs.CV, cs.GR, cs.RO
Abstract URL: https://arxiv.org/abs/2510.22199
Pdf URL: https://arxiv.org/pdf/2510.22199
Copy Paste: [[2510.22199]] MOGRAS: Human Motion with Grasping in 3D Scenes(https://arxiv.org/abs/2510.22199)
Keywords: generation
Abstract: Generating realistic full-body motion interacting with objects is critical for applications in robotics, virtual reality, and human-computer interaction. While existing methods can generate full-body motion within 3D scenes, they often lack the fidelity for fine-grained tasks like object grasping. Conversely, methods that generate precise grasping motions typically ignore the surrounding 3D scene. This gap, generating full-body grasping motions that are physically plausible within a 3D scene, remains a significant challenge. To address this, we introduce MOGRAS (Human MOtion with GRAsping in 3D Scenes), a large-scale dataset that bridges this gap. MOGRAS provides pre-grasping full-body walking motions and final grasping poses within richly annotated 3D indoor scenes. We leverage MOGRAS to benchmark existing full-body grasping methods and demonstrate their limitations in scene-aware generation. Furthermore, we propose a simple yet effective method to adapt existing approaches to work seamlessly within 3D scenes. Through extensive quantitative and qualitative experiments, we validate the effectiveness of our dataset and highlight the significant improvements our proposed method achieves, paving the way for more realistic human-scene interactions.

Title: LongCat-Video Technical Report

Authors: Meituan LongCat Team: Xunliang Cai, Qilong Huang, Zhuoliang Kang, Hongyu Li, Shijun Liang, Liya Ma, Siyu Ren, Xiaoming Wei, Rixu Xie, Tong Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.22200
Pdf URL: https://arxiv.org/pdf/2510.22200
Copy Paste: [[2510.22200]] LongCat-Video Technical Report(https://arxiv.org/abs/2510.22200)
Keywords: generation
Abstract: Video generation is a critical pathway toward world models, with efficient long video inference as a key capability. Toward this end, we introduce LongCat-Video, a foundational video generation model with 13.6B parameters, delivering strong performance across multiple video generation tasks. It particularly excels in efficient and high-quality long video generation, representing our first step toward world models. Key features include: Unified architecture for multiple tasks: Built on the Diffusion Transformer (DiT) framework, LongCat-Video supports Text-to-Video, Image-to-Video, and Video-Continuation tasks with a single model; Long video generation: Pretraining on Video-Continuation tasks enables LongCat-Video to maintain high quality and temporal coherence in the generation of minutes-long videos; Efficient inference: LongCat-Video generates 720p, 30fps videos within minutes by employing a coarse-to-fine generation strategy along both the temporal and spatial axes. Block Sparse Attention further enhances efficiency, particularly at high resolutions; Strong performance with multi-reward RLHF: Multi-reward RLHF training enables LongCat-Video to achieve performance on par with the latest closed-source and leading open-source models. Code and model weights are publicly available to accelerate progress in the field.

Title: GRPO-Guard: Mitigating Implicit Over-Optimization in Flow Matching via Regulated Clipping

Authors: Jing Wang, Jiajun Liang, Jie Liu, Henglin Liu, Gongye Liu, Jun Zheng, Wanyuan Pang, Ao Ma, Zhenyu Xie, Xintao Wang, Meng Wang, Pengfei Wan, Xiaodan Liang
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2510.22319
Pdf URL: https://arxiv.org/pdf/2510.22319
Copy Paste: [[2510.22319]] GRPO-Guard: Mitigating Implicit Over-Optimization in Flow Matching via Regulated Clipping(https://arxiv.org/abs/2510.22319)
Keywords: generation
Abstract: Recently, GRPO-based reinforcement learning has shown remarkable progress in optimizing flow-matching models, effectively improving their alignment with task-specific rewards. Within these frameworks, the policy update relies on importance-ratio clipping to constrain overconfident positive and negative gradients. However, in practice, we observe a systematic shift in the importance-ratio distribution: its mean falls below 1 and its variance differs substantially across timesteps. This left-shifted and inconsistent distribution prevents positive-advantage samples from entering the clipped region, causing the mechanism to fail in constraining overconfident positive updates. As a result, the policy model inevitably enters an implicit over-optimization stage: while the proxy reward continues to increase, essential metrics such as image quality and text-prompt alignment deteriorate sharply, ultimately making the learned policy impractical for real-world use. To address this issue, we introduce GRPO-Guard, a simple yet effective enhancement to existing GRPO frameworks. Our method incorporates ratio normalization, which restores a balanced and step-consistent importance ratio, ensuring that PPO clipping properly constrains harmful updates across denoising timesteps. In addition, a gradient reweighting strategy equalizes policy gradients over noise conditions, preventing excessive updates from particular timestep regions. Together, these designs act as a regulated clipping mechanism, stabilizing optimization and substantially mitigating implicit over-optimization without relying on heavy KL regularization. Extensive experiments on multiple diffusion backbones (e.g., SD3.5M, Flux.1-dev) and diverse proxy tasks demonstrate that GRPO-Guard significantly reduces over-optimization while maintaining or even improving generation quality.
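The failure mode described above, where a left-shifted ratio distribution keeps positive-advantage samples from ever reaching the upper clip bound, can be reproduced numerically. The sketch below uses mean-centering in log space as a stand-in for ratio normalization; that specific rule, the optional rescaling, and the function names are assumptions, since the abstract does not state the exact formula.

```python
import math

def normalize_log_ratios(log_ratios, target_std=None, eps=1e-8):
    """Center (and optionally rescale) log importance ratios at one timestep.

    Centering makes the geometric mean of the ratios equal to 1; rescaling to
    a shared target_std would make the spread consistent across timesteps.
    """
    n = len(log_ratios)
    mean = sum(log_ratios) / n
    centered = [x - mean for x in log_ratios]
    if target_std is not None:
        std = math.sqrt(sum(c * c for c in centered) / n)
        centered = [c * target_std / (std + eps) for c in centered]
    return centered

def clipped_surrogate(log_ratio, advantage, clip_eps=0.2):
    """Standard PPO clipped objective for a single sample."""
    ratio = math.exp(log_ratio)
    clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped * advantage)

# Left-shifted ratios: every ratio is below 1, so the 1 + clip_eps bound
# never engages for positive advantages.
raw = [-0.5, -0.3, -0.1]
normed = normalize_log_ratios(raw)

unclipped_value = clipped_surrogate(raw[2], advantage=1.0)     # exp(-0.1), not clipped
clipped_value = clipped_surrogate(normed[2], advantage=1.0)    # capped at 1.2
```

After centering, the largest ratio exceeds `1 + clip_eps` and is capped, so the clip once again limits how far a positive-advantage sample can push the policy.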

Title: Moving Beyond Diffusion: Hierarchy-to-Hierarchy Autoregression for fMRI-to-Image Reconstruction

Authors: Xu Zhang, Ruijie Quan, Wenguan Wang, Yi Yang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2510.22335
Pdf URL: https://arxiv.org/pdf/2510.22335
Copy Paste: [[2510.22335]] Moving Beyond Diffusion: Hierarchy-to-Hierarchy Autoregression for fMRI-to-Image Reconstruction(https://arxiv.org/abs/2510.22335)
Keywords: generation
Abstract: Reconstructing visual stimuli from fMRI signals is a central challenge bridging machine learning and neuroscience. Recent diffusion-based methods typically map fMRI activity to a single high-level embedding, using it as fixed guidance throughout the entire generation process. However, this fixed guidance collapses hierarchical neural information and is misaligned with the stage-dependent demands of image reconstruction. In response, we propose MindHier, a coarse-to-fine fMRI-to-image reconstruction framework built on scale-wise autoregressive modeling. MindHier introduces three components: a Hierarchical fMRI Encoder to extract multi-level neural embeddings, a Hierarchy-to-Hierarchy Alignment scheme to enforce layer-wise correspondence with CLIP features, and a Scale-Aware Coarse-to-Fine Neural Guidance strategy to inject these embeddings into autoregression at matching scales. These designs make MindHier an efficient and cognitively-aligned alternative to diffusion-based methods by enabling a hierarchical reconstruction process that synthesizes global semantics before refining local details, akin to human visual perception. Extensive experiments on the NSD dataset show that MindHier achieves superior semantic fidelity, 4.67x faster inference, and more deterministic results than the diffusion-based baselines.

Title: GeoDiffusion: A Training-Free Framework for Accurate 3D Geometric Conditioning in Image Generation

Authors: Phillip Mueller, Talip Uenlue, Sebastian Schmidt, Marcel Kollovieh, Jiajie Fan, Stephan Guennemann, Lars Mikelsons
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.22337
Pdf URL: https://arxiv.org/pdf/2510.22337
Copy Paste: [[2510.22337]] GeoDiffusion: A Training-Free Framework for Accurate 3D Geometric Conditioning in Image Generation(https://arxiv.org/abs/2510.22337)
Keywords: generation, generative
Abstract: Precise geometric control in image generation is essential for engineering and product design and for creative industries to control 3D object features accurately in image space. Traditional 3D editing approaches are time-consuming and demand specialized skills, while current image-based generative methods lack accuracy in geometric conditioning. To address these challenges, we propose GeoDiffusion, a training-free framework for accurate and efficient geometric conditioning of 3D features in image generation. GeoDiffusion employs a class-specific 3D object as a geometric prior to define keypoints and parametric correlations in 3D space. We ensure viewpoint consistency through a rendered image of a reference 3D object, followed by style transfer to meet user-defined appearance specifications. At the core of our framework is GeoDrag, which improves the accuracy and speed of drag-based image editing on geometry guidance tasks and on general instructions in DragBench. Our results demonstrate that GeoDiffusion enables precise geometric modifications across various iterative design workflows.

Title: T2SMark: Balancing Robustness and Diversity in Noise-as-Watermark for Diffusion Models

Authors: Jindong Yang, Han Fang, Weiming Zhang, Nenghai Yu, Kejiang Chen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2510.22366
Pdf URL: https://arxiv.org/pdf/2510.22366
Copy Paste: [[2510.22366]] T2SMark: Balancing Robustness and Diversity in Noise-as-Watermark for Diffusion Models(https://arxiv.org/abs/2510.22366)
Keywords: generation, generative
Abstract: Diffusion models have advanced rapidly in recent years, producing high-fidelity images while raising concerns about intellectual property protection and the misuse of generative AI. Image watermarking for diffusion models, particularly with Noise-as-Watermark (NaW) methods, encodes the watermark as a specific standard Gaussian noise vector for image generation, embedding the information seamlessly while maintaining image quality. For detection, the generation process is inverted to recover the initial noise vector containing the watermark before extraction. However, existing NaW methods struggle to balance watermark robustness with generation diversity. Some methods achieve strong robustness by heavily constraining initial noise sampling, which degrades user experience, while others preserve diversity but prove too fragile for real-world deployment. To address this issue, we propose T2SMark, a two-stage watermarking scheme based on Tail-Truncated Sampling (TTS). Unlike prior methods that simply map bits to positive or negative values, TTS enhances robustness by embedding bits exclusively in the reliable tail regions while randomly sampling the central zone to preserve the latent distribution. Our two-stage framework then ensures sampling diversity by integrating a randomly generated session key into both encryption pipelines. We evaluate T2SMark on diffusion models with both U-Net and DiT backbones. Extensive experiments show that it achieves an optimal balance between robustness and diversity. Our code is available at this https URL.
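
The Tail-Truncated Sampling idea in the T2SMark abstract can be illustrated with a small sketch. It assumes a one-dimensional latent, a hypothetical threshold `tau`, and a hypothetical list of key-derived `carriers` positions; none of these names or values come from the paper, and the sketch omits the two-stage session-key encryption entirely. Payload bits are drawn from the Gaussian tails (sign encodes the bit), while every other position is drawn from the central zone.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard Gaussian N(0, 1)


def sample_tail(bit, tau, rng):
    """Draw z ~ N(0, 1) conditioned on z >= tau (bit 1) or z <= -tau (bit 0)."""
    lo = STD_NORMAL.cdf(tau)
    u = lo + (1.0 - lo) * rng.random()   # u lies in [Phi(tau), 1)
    u = min(u, 1.0 - 1e-12)              # guard against rounding up to exactly 1.0
    z = STD_NORMAL.inv_cdf(u)
    return z if bit == 1 else -z


def sample_center(tau, rng):
    """Draw z ~ N(0, 1) conditioned on |z| <= tau; these positions carry no payload."""
    lo, hi = STD_NORMAL.cdf(-tau), STD_NORMAL.cdf(tau)
    u = lo + (hi - lo) * rng.random()
    return STD_NORMAL.inv_cdf(u)


def embed(bits, latent_size, carriers, tau, rng):
    """Build a latent: carrier positions hold bits in the tails, the rest is central noise."""
    latent = [sample_center(tau, rng) for _ in range(latent_size)]
    for bit, pos in zip(bits, carriers):
        latent[pos] = sample_tail(bit, tau, rng)
    return latent


def extract(latent, carriers):
    """Read each bit back from the sign of its carrier position."""
    return [1 if latent[pos] > 0 else 0 for pos in carriers]


rng = random.Random(0)
bits = [1, 0, 0, 1, 1, 0, 1, 0]
carriers = [3, 7, 12, 18, 21, 25, 28, 30]   # hypothetical key-derived positions
latent = embed(bits, 32, carriers, tau=1.0, rng=rng)
noisy = [z + rng.uniform(-0.5, 0.5) for z in latent]  # stand-in for inversion error
recovered = extract(noisy, carriers)
```

Because every carrier value sits at least `tau` away from zero, a perturbation smaller than `tau` cannot flip its sign, which is the robustness intuition behind embedding only in the tails.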

Title: Top-Down Semantic Refinement for Image Captioning

Authors: Jusheng Zhang, Kaitong Cai, Jing Yang, Jian Wang, Chengpei Tang, Keze Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2510.22391
Pdf URL: https://arxiv.org/pdf/2510.22391
Copy Paste: [[2510.22391]] Top-Down Semantic Refinement for Image Captioning(https://arxiv.org/abs/2510.22391)
Keywords: generation
Abstract: Large Vision-Language Models (VLMs) face an inherent contradiction in image captioning: their powerful single-step generation capabilities often lead to a myopic decision-making process. This makes it difficult to maintain global narrative coherence while capturing rich details, a limitation that is particularly pronounced in tasks that require multi-step and complex scene description. To overcome this fundamental challenge, we redefine image captioning as a goal-oriented hierarchical refinement planning problem, and further propose a novel framework, named Top-Down Semantic Refinement (TDSR), which models the generation process as a Markov Decision Process (MDP). However, planning within the vast state space of a VLM presents a significant computational hurdle. Our core contribution, therefore, is the design of a highly efficient Monte Carlo Tree Search (MCTS) algorithm tailored for VLMs. By incorporating a visual-guided parallel expansion and a lightweight value network, our TDSR reduces the call frequency to the expensive VLM by an order of magnitude without sacrificing planning quality. Furthermore, an adaptive early stopping mechanism dynamically matches computational overhead to the image's complexity. Extensive experiments on multiple benchmarks, including DetailCaps, COMPOSITIONCAP, and POPE, demonstrate that our TDSR, as a plug-and-play module, can significantly enhance the performance of existing VLMs (e.g., LLaVA-1.5, Qwen2.5-VL) by achieving state-of-the-art or highly competitive results in fine-grained description, compositional generalization, and hallucination suppression.

Title: Benchmarking Egocentric Multimodal Goal Inference for Assistive Wearable Agents

Authors: Vijay Veerabadran, Fanyi Xiao, Nitin Kamra, Pedro Matias, Joy Chen, Caley Drooff, Brett D Roads, Riley Williams, Ethan Henderson, Xuanyi Zhao, Kevin Carlberg, Joseph Tighe, Karl Ridgeway
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2510.22443
Pdf URL: https://arxiv.org/pdf/2510.22443
Copy Paste: [[2510.22443]] Benchmarking Egocentric Multimodal Goal Inference for Assistive Wearable Agents(https://arxiv.org/abs/2510.22443)
Keywords: generative
Abstract: There has been a surge of interest in assistive wearable agents: agents embodied in wearable form factors (e.g., smart glasses) who take assistive actions toward a user's goal/query (e.g. "Where did I leave my keys?"). In this work, we consider the important complementary problem of inferring that goal from multi-modal contextual observations. Solving this "goal inference" problem holds the promise of eliminating the effort needed to interact with such an agent. This work focuses on creating WAGIBench, a strong benchmark to measure progress in solving this problem using vision-language models (VLMs). Given the limited prior work in this area, we collected a novel dataset comprising 29 hours of multimodal data from 348 participants across 3,477 recordings, featuring ground-truth goals alongside accompanying visual, audio, digital, and longitudinal contextual observations. We validate that human performance exceeds model performance, achieving 93% multiple-choice accuracy compared with 84% for the best-performing VLM. Generative benchmark results that evaluate several families of modern vision-language models show that larger models perform significantly better on the task, yet remain far from practical usefulness, as they produce relevant goals only 55% of the time. Through a modality ablation, we show that models benefit from extra information in relevant modalities with minimal performance degradation from irrelevant modalities.

Title: DynaPose4D: High-Quality 4D Dynamic Content Generation via Pose Alignment Loss

Authors: Jing Yang, Yufeng Yang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2510.22473
Pdf URL: https://arxiv.org/pdf/2510.22473
Copy Paste: [[2510.22473]] DynaPose4D: High-Quality 4D Dynamic Content Generation via Pose Alignment Loss(https://arxiv.org/abs/2510.22473)
Keywords: generation, generative
Abstract: Recent advancements in 2D and 3D generative models have expanded the capabilities of computer vision. However, generating high-quality 4D dynamic content from a single static image remains a significant challenge. Traditional methods have limitations in modeling temporal dependencies and accurately capturing dynamic geometry changes, especially when considering variations in camera perspective. To address this issue, we propose DynaPose4D, an innovative solution that integrates 4D Gaussian Splatting (4DGS) techniques with Category-Agnostic Pose Estimation (CAPE) technology. This framework uses 3D Gaussian Splatting to construct a 3D model from single images, then predicts multi-view pose keypoints based on one-shot support from a chosen view, leveraging supervisory signals to enhance motion consistency. Experimental results show that DynaPose4D achieves excellent coherence, consistency, and fluidity in dynamic motion generation. These findings not only validate the efficacy of the DynaPose4D framework but also indicate its potential applications in the domains of computer vision and animation production.

Title: LAMP: Data-Efficient Linear Affine Weight-Space Models for Parameter-Controlled 3D Shape Generation and Extrapolation

Authors: Ghadi Nehme, Yanxia Zhang, Dule Shu, Matt Klenk, Faez Ahmed
Subjects: cs.LG, cs.CE, cs.CV
Abstract URL: https://arxiv.org/abs/2510.22491
Pdf URL: https://arxiv.org/pdf/2510.22491
Copy Paste: [[2510.22491]] LAMP: Data-Efficient Linear Affine Weight-Space Models for Parameter-Controlled 3D Shape Generation and Extrapolation(https://arxiv.org/abs/2510.22491)
Keywords: generation
Abstract: Generating high-fidelity 3D geometries that satisfy specific parameter constraints has broad applications in design and engineering. However, current methods typically rely on large training datasets and struggle with controllability and generalization beyond the training distributions. To overcome these limitations, we introduce LAMP (Linear Affine Mixing of Parametric shapes), a data-efficient framework for controllable and interpretable 3D generation. LAMP first aligns signed distance function (SDF) decoders by overfitting each exemplar from a shared initialization, then synthesizes new geometries by solving a parameter-constrained mixing problem in the aligned weight space. To ensure robustness, we further propose a safety metric that detects geometry validity via linearity mismatch. We evaluate LAMP on two 3D parametric benchmarks: DrivAerNet++ and BlendedNet. We found that LAMP enables (i) controlled interpolation within bounds with as few as 100 samples, (ii) safe extrapolation by up to 100% parameter difference beyond training ranges, (iii) physics performance-guided optimization under fixed parameters. LAMP significantly outperforms conditional autoencoder and Deep Network Interpolation (DNI) baselines in both extrapolation and data efficiency. Our results demonstrate that LAMP advances controllable, data-efficient, and safe 3D generation for design exploration, dataset generation, and performance-driven optimization.
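
A toy sketch of parameter-constrained affine mixing in weight space, in the spirit of the LAMP abstract. It assumes a single scalar design parameter and picks, among all coefficient vectors that sum to one and reproduce the target parameter, the one closest to uniform weights. That selection rule and the function names are assumptions made for illustration; the abstract does not state how the actual mixing problem is solved.

```python
def affine_mixing_coefficients(params, target):
    """Coefficients a_i with sum(a_i) = 1 and sum(a_i * params[i]) = target.

    Among all such solutions this returns the one closest to uniform weights
    (an illustrative choice, not necessarily the paper's).
    """
    n = len(params)
    mean = sum(params) / n
    spread = sum((p - mean) ** 2 for p in params)
    if spread == 0:
        raise ValueError("exemplar parameters must not all be equal")
    return [1.0 / n + (target - mean) * (p - mean) / spread for p in params]


def mix_weights(weight_vectors, coeffs):
    """Affine combination of exemplar weight vectors (all of the same length)."""
    dim = len(weight_vectors[0])
    return [sum(c * w[k] for c, w in zip(coeffs, weight_vectors)) for k in range(dim)]


params = [1.0, 2.0, 3.0]
# Toy "decoders" whose weights vary linearly with the design parameter.
weights = [[p, 10.0 * p] for p in params]
coeffs = affine_mixing_coefficients(params, target=4.0)  # 4.0 lies outside the training range
mixed = mix_weights(weights, coeffs)
```

Extrapolation shows up as coefficients outside [0, 1]: the target 4.0 is beyond every exemplar, so at least one coefficient is negative while the coefficients still sum to one.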

Title: Accelerating Materials Design via LLM-Guided Evolutionary Search

Authors: Nikhil Abhyankar, Sanchit Kabra, Saaketh Desai, Chandan K. Reddy
Subjects: cs.LG, cond-mat.mtrl-sci, cs.AI, cs.NE
Abstract URL: https://arxiv.org/abs/2510.22503
Pdf URL: https://arxiv.org/pdf/2510.22503
Copy Paste: [[2510.22503]] Accelerating Materials Design via LLM-Guided Evolutionary Search(https://arxiv.org/abs/2510.22503)
Keywords: generation, generative
Abstract: Materials discovery requires navigating vast chemical and structural spaces while satisfying multiple, often conflicting, objectives. We present LLM-guided Evolution for MAterials design (LLEMA), a unified framework that couples the scientific knowledge embedded in large language models with chemistry-informed evolutionary rules and memory-based refinement. At each iteration, an LLM proposes crystallographically specified candidates under explicit property constraints; a surrogate-augmented oracle estimates physicochemical properties; and a multi-objective scorer updates success/failure memories to guide subsequent generations. Evaluated on 14 realistic tasks spanning electronics, energy, coatings, optics, and aerospace, LLEMA discovers candidates that are chemically plausible, thermodynamically stable, and property-aligned, achieving higher hit-rates and stronger Pareto fronts than generative and LLM-only baselines. Ablation studies confirm the importance of rule-guided generation, memory-based refinement, and surrogate prediction. By enforcing synthesizability and multi-objective trade-offs, LLEMA delivers a principled pathway to accelerate practical materials discovery. Code: this https URL
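
The LLEMA abstract compares methods by their Pareto fronts. As a reminder of what that means, the sketch below filters a candidate pool down to its non-dominated members, assuming every objective is to be maximized. The candidate names and scores are hypothetical, and this is only the generic non-dominated filter, not LLEMA's multi-objective scorer.

```python
def dominates(a, b):
    """True if a is at least as good as b on every objective and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(candidates):
    """Keep the candidates whose score vectors no other candidate dominates."""
    return [
        c for c in candidates
        if not any(dominates(other["scores"], c["scores"]) for other in candidates)
    ]


# Hypothetical candidates scored on two objectives (higher is better on both).
pool = [
    {"name": "cand_A", "scores": (0.9, 0.2)},
    {"name": "cand_B", "scores": (0.5, 0.5)},
    {"name": "cand_C", "scores": (0.4, 0.4)},  # dominated by cand_B
    {"name": "cand_D", "scores": (0.1, 0.9)},
]
front = [c["name"] for c in pareto_front(pool)]
```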

Title: CANDI: Hybrid Discrete-Continuous Diffusion Models

Authors: Patrick Pynadath, Jiaxin Shi, Ruqi Zhang
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2510.22510
Pdf URL: https://arxiv.org/pdf/2510.22510
Copy Paste: [[2510.22510]] CANDI: Hybrid Discrete-Continuous Diffusion Models(https://arxiv.org/abs/2510.22510)
Keywords: generation
Abstract: While continuous diffusion has shown remarkable success in continuous domains such as image generation, its direct application to discrete data has underperformed compared to purely discrete formulations. This gap is counterintuitive, given that continuous diffusion learns score functions that enable joint evolution across multiple positions. To understand this gap, we introduce token identifiability as an analytical framework for understanding how Gaussian noise corrupts discrete data through two mechanisms: discrete identity corruption and continuous rank degradation. We reveal that these mechanisms scale differently with vocabulary size, creating a temporal dissonance: at noise levels where discrete corruption preserves enough structure for conditional learning, continuous denoising is trivial; at noise levels where continuous denoising is meaningful, discrete corruption destroys nearly all conditional structure. To solve this, we propose CANDI (Continuous ANd DIscrete diffusion), a hybrid framework that decouples discrete and continuous corruption, enabling simultaneous learning of both conditional structure and continuous geometry. We empirically validate the temporal dissonance phenomenon and demonstrate that CANDI successfully avoids it. This unlocks the benefits of continuous diffusion for discrete spaces: on controlled generation, CANDI enables classifier-based guidance with off-the-shelf classifiers through simple gradient addition; on text generation, CANDI outperforms masked diffusion at low NFE, demonstrating the value of learning continuous gradients for discrete spaces.
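
The CANDI abstract's claim that discrete identity corruption scales with vocabulary size can be probed with a small Monte Carlo experiment. The sketch assumes that a token is a one-hot vector, that isotropic Gaussian noise of standard deviation `sigma` is added to it, and that "identity corruption" means the argmax no longer points at the original token. That operational definition is an assumption made for illustration, not the paper's formal one.

```python
import random


def identity_corruption_rate(vocab_size, sigma, trials, rng):
    """Fraction of trials where Gaussian noise changes the argmax of a one-hot token.

    Requires vocab_size >= 2.
    """
    flips = 0
    for _ in range(trials):
        # Token 0 is the true token: its coordinate starts at 1, all others at 0.
        true_coord = 1.0 + rng.gauss(0.0, sigma)
        best_other = max(rng.gauss(0.0, sigma) for _ in range(vocab_size - 1))
        if best_other > true_coord:
            flips += 1
    return flips / trials


rng = random.Random(0)
rates = {
    v: identity_corruption_rate(v, sigma=1.0, trials=2000, rng=rng)
    for v in (2, 16, 128)
}
```

At a fixed noise level the flip rate grows with the vocabulary, since more competing coordinates get a chance to overtake the true one.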

Title: Open Multimodal Retrieval-Augmented Factual Image Generation

Authors: Yang Tian, Fan Liu, Jingyuan Zhang, Wei Bi, Yupeng Hu, Liqiang Nie
Subjects: cs.CV, cs.AI, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2510.22521
Pdf URL: https://arxiv.org/pdf/2510.22521
Copy Paste: [[2510.22521]] Open Multimodal Retrieval-Augmented Factual Image Generation(https://arxiv.org/abs/2510.22521)
Keywords: generation
Abstract: Large Multimodal Models (LMMs) have achieved remarkable progress in generating photorealistic and prompt-aligned images, but they often produce outputs that contradict verifiable knowledge, especially when prompts involve fine-grained attributes or time-sensitive events. Conventional retrieval-augmented approaches attempt to address this issue by introducing external information, yet they are fundamentally incapable of grounding generation in accurate and evolving knowledge due to their reliance on static sources and shallow evidence integration. To bridge this gap, we introduce ORIG, an agentic open multimodal retrieval-augmented framework for Factual Image Generation (FIG), a new task that requires both visual realism and factual grounding. ORIG iteratively retrieves and filters multimodal evidence from the web and incrementally integrates the refined knowledge into enriched prompts to guide generation. To support systematic evaluation, we build FIG-Eval, a benchmark spanning ten categories across perceptual, compositional, and temporal dimensions. Experiments demonstrate that ORIG substantially improves factual consistency and overall image quality over strong baselines, highlighting the potential of open multimodal retrieval for factual image generation.

Title: AesCrop: Aesthetic-driven Cropping Guided by Composition

Authors: Yen-Hong Wong, Lai-Kuan Wong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.22528
Pdf URL: https://arxiv.org/pdf/2510.22528
Copy Paste: [[2510.22528]] AesCrop: Aesthetic-driven Cropping Guided by Composition(https://arxiv.org/abs/2510.22528)
Keywords: generation
Abstract: Aesthetic-driven image cropping is crucial for applications like view recommendation and thumbnail generation, where visual appeal significantly impacts user engagement. A key factor in visual appeal is composition--the deliberate arrangement of elements within an image. Some methods have successfully incorporated compositional knowledge through evaluation-based and regression-based paradigms. However, evaluation-based methods lack globality while regression-based methods lack diversity. Recently, hybrid approaches that integrate both paradigms have emerged, bridging the gap between these two to achieve better diversity and globality. Notably, existing hybrid methods do not incorporate photographic composition guidance, a key attribute that defines photographic aesthetics. In this work, we introduce AesCrop, a composition-aware hybrid image-cropping model that integrates a VMamba image encoder, augmented with a novel Mamba Composition Attention Bias (MCAB) and a transformer decoder to perform end-to-end rank-based image cropping, generating multiple crops along with the corresponding quality scores. By explicitly encoding compositional cues into the attention mechanism, MCAB directs AesCrop to focus on the most compositionally salient regions. Extensive experiments demonstrate that AesCrop outperforms current state-of-the-art methods, delivering superior quantitative metrics and qualitatively more pleasing crops.

Title: SRSR: Enhancing Semantic Accuracy in Real-World Image Super-Resolution with Spatially Re-Focused Text-Conditioning

Authors: Chen Chen, Majid Abdolshah, Violetta Shevchenko, Hongdong Li, Chang Xu, Pulak Purkait
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.22534
Pdf URL: https://arxiv.org/pdf/2510.22534
Copy Paste: [[2510.22534]] SRSR: Enhancing Semantic Accuracy in Real-World Image Super-Resolution with Spatially Re-Focused Text-Conditioning(https://arxiv.org/abs/2510.22534)
Keywords: super-resolution
Abstract: Existing diffusion-based super-resolution approaches often exhibit semantic ambiguities due to inaccuracies and incompleteness in their text conditioning, coupled with the inherent tendency for cross-attention to divert towards irrelevant pixels. These limitations can lead to semantic misalignment and hallucinated details in the generated high-resolution outputs. To address these, we propose a novel, plug-and-play spatially re-focused super-resolution (SRSR) framework that consists of two core components: first, we introduce Spatially Re-focused Cross-Attention (SRCA), which refines text conditioning at inference time by applying visually-grounded segmentation masks to guide cross-attention. Second, we introduce a Spatially Targeted Classifier-Free Guidance (STCFG) mechanism that selectively bypasses text influences on ungrounded pixels to prevent hallucinations. Extensive experiments on both synthetic and real-world datasets demonstrate that SRSR consistently outperforms seven state-of-the-art baselines in standard fidelity metrics (PSNR and SSIM) across all datasets, and in perceptual quality measures (LPIPS and DISTS) on two real-world benchmarks, underscoring its effectiveness in achieving both high semantic fidelity and perceptual quality in super-resolution.
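
One plausible reading of the Spatially Targeted Classifier-Free Guidance step in the SRSR abstract is standard classifier-free guidance gated by a per-pixel grounding mask. The sketch below assumes flattened per-pixel noise predictions and a binary `grounded_mask`; the function name, the mask semantics, and the exact blending rule are illustrative assumptions rather than the paper's formulation.

```python
def spatially_targeted_cfg(eps_uncond, eps_cond, grounded_mask, scale):
    """Classifier-free guidance applied only where the mask marks a pixel as grounded.

    Grounded pixels (mask 1):   eps_uncond + scale * (eps_cond - eps_uncond)
    Ungrounded pixels (mask 0): eps_uncond, so the text condition is bypassed.
    """
    return [
        u + scale * m * (c - u)
        for u, c, m in zip(eps_uncond, eps_cond, grounded_mask)
    ]


eps_uncond = [0.1, 0.2, 0.3, 0.4]
eps_cond = [0.5, 0.0, 0.7, 0.4]
grounded_mask = [1, 0, 1, 0]   # hypothetical mask from a segmentation model
guided = spatially_targeted_cfg(eps_uncond, eps_cond, grounded_mask, scale=2.0)
```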

Title: FAPO: Flawed-Aware Policy Optimization for Efficient and Reliable Reasoning

Authors: Yuyang Ding, Chi Zhang, Juntao Li, Haibin Lin, Xin Liu, Min Zhang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2510.22543
Pdf URL: https://arxiv.org/pdf/2510.22543
Copy Paste: [[2510.22543]] FAPO: Flawed-Aware Policy Optimization for Efficient and Reliable Reasoning(https://arxiv.org/abs/2510.22543)
Keywords: generative
Abstract: Reinforcement learning with verifiable rewards (RLVR) has emerged as a promising paradigm for enhancing the reasoning capabilities of large language models (LLMs). In this context, models explore reasoning trajectories and exploit rollouts with correct answers as positive signals for policy optimization. However, these rollouts might involve flawed patterns such as answer-guessing and jump-in-reasoning. Such flawed-positive rollouts are rewarded identically to fully correct ones, causing policy models to internalize these unreliable reasoning patterns. In this work, we first conduct a systematic study of flawed-positive rollouts in RL and find that they enable rapid capability gains during the early optimization stage, while constraining reasoning capability later by reinforcing unreliable patterns. Building on these insights, we propose Flawed-Aware Policy Optimization (FAPO), which presents a parameter-free reward penalty for flawed-positive rollouts, enabling the policy to leverage them as useful shortcuts in the warm-up stage, securing stable early gains, while gradually shifting optimization toward reliable reasoning in the later refinement stage. To accurately and comprehensively detect flawed-positive rollouts, we introduce a generative reward model (GenRM) with a process-level reward that precisely localizes reasoning errors. Experiments show that FAPO is effective in broad domains, improving outcome correctness, process reliability, and training stability without increasing the token budget.

Title: FastVLM: Self-Speculative Decoding for Fast Vision-Language Model Inference

Authors: Divya Jyoti Bajpai, Manjesh Kumar Hanawal
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2510.22641
Pdf URL: https://arxiv.org/pdf/2510.22641
Copy Paste: [[2510.22641]] FastVLM: Self-Speculative Decoding for Fast Vision-Language Model Inference(https://arxiv.org/abs/2510.22641)
Keywords: generation
Abstract: Vision-language Models (VLMs) have made significant strides in visual understanding and query response generation, but often face challenges of high computational cost and inference latency due to autoregressive decoding. In this work, we introduce an imitation-learning-based Self-Speculative Decoding (SSD) framework, named FastVLM, to address these limitations. Our approach employs a lightweight draft model for token generation in an autoregressive manner, while a full model verifies these tokens non-autoregressively. Accepted tokens proceed seamlessly, while rejected tokens are corrected by the full model and used to guide the draft model's refinement. Through an imitation network, FastVLM enhances the draft model by integrating deeper level insights from the full model's architecture. Also, it maintains the performance integrity of the full model while training the draft model, achieving a balance between efficiency and accuracy. Our method speeds up the inference process by 1.55-1.85x as compared to the final layer with minimal loss in performance.
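
The draft-then-verify loop described in the FastVLM abstract can be sketched in its generic greedy form. The two models are stand-in functions (`draft_next`, `full_next`) that map a token list to the next token; both are hypothetical toys. The sketch covers only the accept/correct logic, not the imitation network or the draft model's training, and the verification loop runs sequentially here where a real system would score all drafted positions in a single pass of the full model.

```python
def greedy_decode(next_token, prompt, max_new_tokens):
    """Plain autoregressive greedy decoding, used as the reference."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(next_token(tokens))
    return tokens


def speculative_decode(draft_next, full_next, prompt, max_new_tokens, draft_len=4):
    """Greedy draft-then-verify decoding; output matches greedy decoding with full_next."""
    tokens = list(prompt)
    target_len = len(prompt) + max_new_tokens
    while len(tokens) < target_len:
        # 1) The draft model proposes a short continuation autoregressively.
        draft = []
        for _ in range(min(draft_len, target_len - len(tokens))):
            draft.append(draft_next(tokens + draft))
        # 2) The full model checks each drafted position in order.
        accepted = []
        for proposed in draft:
            verified = full_next(tokens + accepted)
            accepted.append(verified)   # the full model's token is always the one kept
            if verified != proposed:    # first mismatch: keep the correction and stop
                break
        tokens.extend(accepted)
    return tokens


def full_next(tokens):
    """Toy full model: deterministic function of the last token."""
    return (tokens[-1] * 3 + 1) % 7


def draft_next(tokens):
    """Toy draft model: agrees with the full model except after token 4."""
    return 0 if tokens[-1] == 4 else full_next(tokens)


fast = speculative_decode(draft_next, full_next, prompt=[1], max_new_tokens=10)
reference = greedy_decode(full_next, prompt=[1], max_new_tokens=10)
```

Since every kept token is the full model's own choice, the output is identical to the reference; the speed-up in a real system comes from verifying several drafted tokens per full-model call.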

Title: Variational Polya Tree

Authors: Lu Xu, Tsai Hor Chan, Kwok Fai Lam, Lequan Yu, Guosheng Yin
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2510.22651
Pdf URL: https://arxiv.org/pdf/2510.22651
Copy Paste: [[2510.22651]] Variational Polya Tree(https://arxiv.org/abs/2510.22651)
Keywords: generative
Abstract: Density estimation is essential for generative modeling, particularly with the rise of modern neural networks. While existing methods capture complex data distributions, they often lack interpretability and uncertainty quantification. Bayesian nonparametric methods, especially the Pólya tree, offer a robust framework that addresses these issues by accurately capturing function behavior over small intervals. Traditional techniques like Markov chain Monte Carlo (MCMC) face high computational complexity and scalability limitations, hindering the use of Bayesian nonparametric methods in deep learning. To tackle this, we introduce the variational Pólya tree (VPT) model, which employs stochastic variational inference to compute posterior distributions. This model provides a flexible, nonparametric Bayesian prior that captures latent densities and works well with stochastic gradient optimization. We also leverage the joint distribution likelihood for a more precise variational posterior approximation than traditional mean-field methods. We evaluate the model performance on both real data and images, and demonstrate its competitiveness with other state-of-the-art deep density estimation methods. We also explore its ability in enhancing interpretability and uncertainty quantification. Code is available at this https URL.
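
For readers unfamiliar with the Pólya tree prior that the VPT abstract builds on, the sketch below samples a finite-depth Pólya tree on [0, 1) and evaluates the resulting piecewise-constant density. It assumes a dyadic partition and Beta(alpha, alpha) branch probabilities with alpha growing quadratically with depth, a common choice in the literature. It illustrates the prior only and does not implement the variational inference scheme described in the abstract; the function names are hypothetical.

```python
import random


def sample_polya_tree(depth, rng, c=1.0):
    """Sample a left-branch probability for every node of a dyadic tree on [0, 1)."""
    tree = {}
    for level in range(depth):
        alpha = c * (level + 1) ** 2          # Beta parameters grow with depth
        for node in range(2 ** level):
            tree[(level, node)] = rng.betavariate(alpha, alpha)
    return tree


def density(tree, depth, x):
    """Density at x in [0, 1): product of branch probabilities times 2**depth."""
    node, value = 0, 1.0
    for level in range(depth):
        left = tree[(level, node)]
        go_right = int(x * 2 ** (level + 1)) % 2   # next binary digit of x
        value *= 2.0 * (left if go_right == 0 else 1.0 - left)
        node = 2 * node + go_right
    return value


depth = 4
tree = sample_polya_tree(depth, random.Random(0))
bins = 2 ** depth
# Integrate the piecewise-constant density by evaluating it at each bin midpoint.
total_mass = sum(density(tree, depth, (k + 0.5) / bins) for k in range(bins)) / bins
```

Each path from the root to a leaf multiplies the probabilities of the branches taken, so the leaf masses sum to one and the sampled function is a valid density.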

Title: RoboSVG: A Unified Framework for Interactive SVG Generation with Multi-modal Guidance

Authors: Jiuniu Wang, Gongjie Zhang, Quanhao Qian, Junlong Gao, Deli Zhao, Ran Xu
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2510.22684
Pdf URL: https://arxiv.org/pdf/2510.22684
Copy Paste: [[2510.22684]] RoboSVG: A Unified Framework for Interactive SVG Generation with Multi-modal Guidance(https://arxiv.org/abs/2510.22684)
Keywords: generation
Abstract: Scalable Vector Graphics (SVGs) are fundamental to digital design and robot control, encoding not only visual structure but also motion paths in interactive drawings. In this work, we introduce RoboSVG, a unified multimodal framework for generating interactive SVGs guided by textual, visual, and numerical signals. Given an input query, the RoboSVG model first produces multimodal guidance, then synthesizes candidate SVGs through dedicated generation modules, and finally refines them under numerical guidance to yield high-quality outputs. To support this framework, we construct RoboDraw, a large-scale dataset of one million examples, each pairing an SVG generation condition (e.g., text, image, and partial SVG) with its corresponding ground-truth SVG code. The RoboDraw dataset enables systematic study of four tasks, including basic generation (Text-to-SVG, Image-to-SVG) and interactive generation (PartialSVG-to-SVG, PartialImage-to-SVG). Extensive experiments demonstrate that RoboSVG achieves superior query compliance and visual fidelity across tasks, establishing a new state of the art in versatile SVG generation. The dataset and source code of this project will be publicly available soon.

Title: FlowCritic: Bridging Value Estimation with Flow Matching in Reinforcement Learning

Authors: Shan Zhong, Shutong Ding, He Diao, Xiangyu Wang, Kah Chan Teh, Bei Peng
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2510.22686
Pdf URL: https://arxiv.org/pdf/2510.22686
Copy Paste: [[2510.22686]] FlowCritic: Bridging Value Estimation with Flow Matching in Reinforcement Learning(https://arxiv.org/abs/2510.22686)
Keywords: generative
Abstract: Reliable value estimation serves as the cornerstone of reinforcement learning (RL) by evaluating long-term returns and guiding policy improvement, significantly influencing the convergence speed and final performance. Existing works improve the reliability of value function estimation via multi-critic ensembles and distributional RL, yet the former merely combines multiple point estimates without capturing distributional information, whereas the latter relies on discretization or quantile regression, limiting the expressiveness of complex value distributions. Inspired by flow matching's success in generative modeling, we propose a generative paradigm for value estimation, named FlowCritic. Departing from conventional regression for deterministic value prediction, FlowCritic leverages flow matching to model value distributions and generate samples for value estimation.

Title: Windsock is Dancing: Adaptive Multimodal Retrieval-Augmented Generation

Authors: Shu Zhao, Tianyi Shen, Nilesh Ahuja, Omesh Tickoo, Vijaykrishnan Narayanan
Subjects: cs.CV, cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2510.22694
Pdf URL: https://arxiv.org/pdf/2510.22694
Copy Paste: [[2510.22694]] Windsock is Dancing: Adaptive Multimodal Retrieval-Augmented Generation(https://arxiv.org/abs/2510.22694)
Keywords: generation
Abstract: Multimodal Retrieval-Augmented Generation (MRAG) has emerged as a promising method to generate factual and up-to-date responses of Multimodal Large Language Models (MLLMs) by incorporating non-parametric knowledge from external knowledge bases. However, existing MRAG approaches suffer from static retrieval strategies, inflexible modality selection, and suboptimal utilization of retrieved information, leading to three critical challenges: determining when to retrieve, what modality to incorporate, and how to utilize retrieved information effectively. To address these challenges, we introduce Windsock, a query-dependent module making decisions on retrieval necessity and modality selection, effectively reducing computational overhead and improving response quality. Additionally, we propose Dynamic Noise-Resistance (DANCE) Instruction Tuning, an adaptive training strategy that enhances MLLMs' ability to utilize retrieved information while maintaining robustness against noise. Moreover, we adopt a self-assessment approach leveraging knowledge within MLLMs to convert question-answering datasets to MRAG training datasets. Extensive experiments demonstrate that our proposed method significantly improves the generation quality by 17.07% while reducing retrieval times by 8.95%.

Title: S-Chain: Structured Visual Chain-of-Thought For Medicine

Authors: Khai Le-Duc, Duy M. H. Nguyen, Phuong T. H. Trinh, Tien-Phat Nguyen, Nghiem T. Diep, An Ngo, Tung Vu, Trinh Vuong, Anh-Tien Nguyen, Mau Nguyen, Van Trung Hoang, Khai-Nguyen Nguyen, Hy Nguyen, Chris Ngo, Anji Liu, Nhat Ho, Anne-Christin Hauschild, Khanh Xuan Nguyen, Thanh Nguyen-Tang, Pengtao Xie, Daniel Sonntag, James Zou, Mathias Niepert, Anh Totti Nguyen
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2510.22728
Pdf URL: https://arxiv.org/pdf/2510.22728
Copy Paste: [[2510.22728]] S-Chain: Structured Visual Chain-of-Thought For Medicine(https://arxiv.org/abs/2510.22728)
Keywords: generation
Abstract: Faithful reasoning in medical vision-language models (VLMs) requires not only accurate predictions but also transparent alignment between textual rationales and visual evidence. While Chain-of-Thought (CoT) prompting has shown promise in medical visual question answering (VQA), no large-scale expert-level dataset has captured stepwise reasoning with precise visual grounding. We introduce S-Chain, the first large-scale dataset of 12,000 expert-annotated medical images with bounding boxes and structured visual CoT (SV-CoT), explicitly linking visual regions to reasoning steps. The dataset further supports 16 languages, totaling over 700k VQA pairs for broad multilingual applicability. Using S-Chain, we benchmark state-of-the-art medical VLMs (ExGra-Med, LLaVA-Med) and general-purpose VLMs (Qwen2.5-VL, InternVL2.5), showing that SV-CoT supervision significantly improves interpretability, grounding fidelity, and robustness. Beyond benchmarking, we study its synergy with retrieval-augmented generation, revealing how domain knowledge and visual grounding interact during autoregressive reasoning. Finally, we propose a new mechanism that strengthens the alignment between visual evidence and reasoning, improving both reliability and efficiency. S-Chain establishes a new benchmark for grounded medical reasoning and paves the way toward more trustworthy and explainable medical VLMs.

Title: Cross-view Localization and Synthesis - Datasets, Challenges and Opportunities

Authors: Ningli Xu, Rongjun Qin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.22736
Pdf URL: https://arxiv.org/pdf/2510.22736
Copy Paste: [[2510.22736]] Cross-view Localization and Synthesis - Datasets, Challenges and Opportunities(https://arxiv.org/abs/2510.22736)
Keywords: generative
Abstract: Cross-view localization and synthesis are two fundamental tasks in cross-view visual understanding, which deals with cross-view datasets: overhead (satellite or aerial) and ground-level imagery. These tasks have gained increasing attention due to their broad applications in autonomous navigation, urban planning, and augmented reality. Cross-view localization aims to estimate the geographic position of ground-level images based on information provided by overhead imagery, while cross-view synthesis seeks to generate ground-level images based on information from the overhead imagery. Both tasks remain challenging due to significant differences in viewing perspective, resolution, and occlusion, which are pervasive in cross-view datasets. Recent years have witnessed rapid progress driven by the availability of large-scale datasets and novel approaches. Typically, cross-view localization is formulated as an image retrieval problem in which ground-level features are matched against features of tiled overhead images, with both extracted by convolutional neural networks (CNNs) or vision transformers (ViTs) for cross-view feature embedding. Cross-view synthesis, on the other hand, generally relies on generative adversarial networks (GANs) or diffusion models. This paper presents a comprehensive survey of advances in cross-view localization and synthesis, reviewing widely used datasets, highlighting key challenges, and providing an organized overview of state-of-the-art techniques. Furthermore, it discusses current limitations, offers comparative analyses, and outlines promising directions for future research. We also include the project page via this https URL.

Title: A Theory of the Mechanics of Information: Generalization Through Measurement of Uncertainty (Learning is Measuring)

Authors: Christopher J. Hazard, Michael Resnick, Jacob Beel, Jack Xia, Cade Mack, Dominic Glennie, Matthew Fulp, David Maze, Andrew Bassett, Martin Koistinen
Subjects: cs.LG, cs.AI, math.ST, stat.ML
Abstract URL: https://arxiv.org/abs/2510.22809
Pdf URL: https://arxiv.org/pdf/2510.22809
Copy Paste: [[2510.22809]] A Theory of the Mechanics of Information: Generalization Through Measurement of Uncertainty (Learning is Measuring)(https://arxiv.org/abs/2510.22809)
Keywords: generative
Abstract: Traditional machine learning relies on explicit models and domain assumptions, limiting flexibility and interpretability. We introduce a model-free framework using surprisal (information theoretic uncertainty) to directly analyze and perform inferences from raw data, eliminating distribution modeling, reducing bias, and enabling efficient updates including direct edits and deletion of training data. By quantifying relevance through uncertainty, the approach enables generalizable inference across tasks including generative inference, causal discovery, anomaly detection, and time series forecasting. It emphasizes traceability, interpretability, and data-driven decision making, offering a unified, human-understandable framework for machine learning, and achieves performance at or near the state of the art across most common machine learning tasks. The mathematical foundations create a "physics" of information, which enables these techniques to apply effectively to a wide variety of complex data types, including data with missing values. Empirical results indicate that this may be a viable alternative path to neural networks with regard to scalable machine learning and artificial intelligence that can maintain human understandability of the underlying mechanics.

Title: MAGIC-Talk: Motion-aware Audio-Driven Talking Face Generation with Customizable Identity Control

Authors: Fatemeh Nazarieh, Zhenhua Feng, Diptesh Kanojia, Muhammad Awais, Josef Kittler
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.22810
Pdf URL: https://arxiv.org/pdf/2510.22810
Copy Paste: [[2510.22810]] MAGIC-Talk: Motion-aware Audio-Driven Talking Face Generation with Customizable Identity Control(https://arxiv.org/abs/2510.22810)
Keywords: generation
Abstract: Audio-driven talking face generation has gained significant attention for applications in digital media and virtual avatars. While recent methods improve audio-lip synchronization, they often struggle with temporal consistency, identity preservation, and customization, especially in long video generation. To address these issues, we propose MAGIC-Talk, a one-shot diffusion-based framework for customizable and temporally stable talking face generation. MAGIC-Talk consists of ReferenceNet, which preserves identity and enables fine-grained facial editing via text prompts, and AnimateNet, which enhances motion coherence using structured motion priors. Unlike previous methods requiring multiple reference images or fine-tuning, MAGIC-Talk maintains identity from a single image while ensuring smooth transitions across frames. Additionally, a progressive latent fusion strategy is introduced to improve long-form video quality by reducing motion inconsistencies and flickering. Extensive experiments demonstrate that MAGIC-Talk outperforms state-of-the-art methods in visual quality, identity preservation, and synchronization accuracy, offering a robust solution for talking face generation.

Title: Logical GANs: Adversarial Learning through Ehrenfeucht Fraisse Games

Authors: Mirco A. Mannucci
Subjects: cs.LG, cs.LO, math.LO
Abstract URL: https://arxiv.org/abs/2510.22824
Pdf URL: https://arxiv.org/pdf/2510.22824
Copy Paste: [[2510.22824]] Logical GANs: Adversarial Learning through Ehrenfeucht Fraisse Games(https://arxiv.org/abs/2510.22824)
Keywords: generation
Abstract: GANs promise indistinguishability; logic explains it. We put the two on a budget: a discriminator that can only "see" up to a logical depth $k$, and a generator that must look correct to that bounded observer. LOGAN (LOGical GANs) casts the discriminator as a depth-$k$ Ehrenfeucht-Fraïssé (EF) Opponent that searches for small, legible faults (odd cycles, nonplanar crossings, directed bridges), while the generator plays Builder, producing samples that admit a $k$-round matching to a target theory $T$. We ship a minimal toolkit -- an EF-probe simulator and MSO-style graph checkers -- and four experiments including real neural GAN training with PyTorch. Beyond verification, we score samples with a logical loss that mixes budgeted EF round-resilience with cheap certificate terms, enabling a practical curriculum on depth. Framework validation demonstrates $92\%$--$98\%$ property satisfaction via simulation (Exp. 3), while real neural GAN training achieves $5\%$--$14\%$ improvements on challenging properties and $98\%$ satisfaction on connectivity (matching simulation) through adversarial learning (Exp. 4). LOGAN is a compact, reproducible path toward logic-bounded generation with interpretable failures, proven effectiveness (both simulated and real training), and dials for control.

Title: Semantic Surgery: Zero-Shot Concept Erasure in Diffusion Models

Authors: Lexiang Xiong, Chengyu Liu, Jingwen Ye, Yan Liu, Yuecong Xu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2510.22851
Pdf URL: https://arxiv.org/pdf/2510.22851
Copy Paste: [[2510.22851]] Semantic Surgery: Zero-Shot Concept Erasure in Diffusion Models(https://arxiv.org/abs/2510.22851)
Keywords: generation, generative
Abstract: Concept erasure in text-to-image diffusion models is crucial for mitigating harmful content, yet existing methods often compromise generative quality. We introduce Semantic Surgery, a novel training-free, zero-shot framework for concept erasure that operates directly on text embeddings before the diffusion process. It dynamically estimates the presence of target concepts in a prompt and performs a calibrated vector subtraction to neutralize their influence at the source, enhancing both erasure completeness and locality. The framework includes a Co-Occurrence Encoding module for robust multi-concept erasure and a visual feedback loop to address latent concept persistence. As a training-free method, Semantic Surgery adapts dynamically to each prompt, ensuring precise interventions. Extensive experiments on object, explicit content, artistic style, and multi-celebrity erasure tasks show our method significantly outperforms state-of-the-art approaches. We achieve superior completeness and robustness while preserving locality and image quality (e.g., 93.58 H-score in object erasure, reducing explicit content to just 1 instance, and 8.09 H_a in style erasure with no quality degradation). This robustness also allows our framework to function as a built-in threat detection system, offering a practical solution for safer text-to-image generation.
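
The idea of neutralizing a concept by vector subtraction on the text embedding can be illustrated with a small sketch. Everything specific here is an assumption made for illustration: the abstract does not give the presence estimator or the calibration rule, so this version uses cosine similarity as the presence score, a hypothetical `threshold`, and plain projection removal as the subtraction. `erase_concept` is a hypothetical name, not the authors' API.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def erase_concept(prompt_emb, concept_emb, threshold=0.2):
    """Remove the concept direction from a prompt embedding, but only when the
    concept appears to be present (cosine similarity >= threshold, an assumed rule)."""
    p_norm, c_norm = norm(prompt_emb), norm(concept_emb)
    if p_norm == 0.0 or c_norm == 0.0:
        return list(prompt_emb)
    unit = [c / c_norm for c in concept_emb]
    projection = dot(prompt_emb, unit)   # length of the prompt along the concept
    presence = projection / p_norm       # cosine similarity as presence score
    if presence < threshold:
        return list(prompt_emb)          # concept judged absent: leave the prompt untouched
    return [p - projection * u for p, u in zip(prompt_emb, unit)]

concept = [1.0, 0.0, 0.0]
erased = erase_concept([1.0, 1.0, 0.0], concept)     # concept present
untouched = erase_concept([0.0, 1.0, 0.0], concept)  # concept absent
print(erased, untouched)
```

Leaving unrelated prompts unchanged is what the abstract calls locality; the threshold controls how aggressively the subtraction fires.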

Title: Encoder-Decoder Diffusion Language Models for Efficient Training and Inference

Authors: Marianne Arriola, Yair Schiff, Hao Phung, Aaron Gokaslan, Volodymyr Kuleshov
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2510.22852
Pdf URL: https://arxiv.org/pdf/2510.22852
Copy Paste: [[2510.22852]] Encoder-Decoder Diffusion Language Models for Efficient Training and Inference(https://arxiv.org/abs/2510.22852)
Keywords: generation
Abstract: Discrete diffusion models enable parallel token sampling for faster inference than autoregressive approaches. However, prior diffusion models use a decoder-only architecture, which requires sampling algorithms that invoke the full network at every denoising step and incur high computational cost. Our key insight is that discrete diffusion models perform two types of computation: 1) representing clean tokens and 2) denoising corrupted tokens, which enables us to use separate modules for each task. We propose an encoder-decoder architecture to accelerate discrete diffusion inference, which relies on an encoder to represent clean tokens and a lightweight decoder to iteratively refine a noised sequence. We also show that this architecture enables faster training of block diffusion models, which partition sequences into blocks for better quality and are commonly used in diffusion language model inference. We introduce a framework for Efficient Encoder-Decoder Diffusion (E2D2), consisting of an architecture with specialized training and sampling algorithms, and we show that E2D2 achieves superior trade-offs between generation quality and inference throughput on summarization, translation, and mathematical reasoning tasks. We provide the code, model weights, and blog post on the project page: this https URL

Title: A Review of End-to-End Precipitation Prediction Using Remote Sensing Data: from Divination to Machine Learning

Authors: Yugong Zeng, Jonathan Wu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2510.22855
Pdf URL: https://arxiv.org/pdf/2510.22855
Copy Paste: [[2510.22855]] A Review of End-to-End Precipitation Prediction Using Remote Sensing Data: from Divination to Machine Learning(https://arxiv.org/abs/2510.22855)
Keywords: generation
Abstract: Precipitation prediction has undergone a profound transformation -- from early symbolic and empirical methods rooted in divination and observation, to modern technologies based on atmospheric physics and artificial intelligence. This review traces the historical and technological evolution of precipitation forecasting, presenting a survey of end-to-end precipitation prediction technologies that spans ancient practices, the foundations of meteorological science, the rise of numerical weather prediction (NWP), and the emergence of machine learning (ML) and deep learning (DL) models. We first explore traditional and indigenous forecasting methods, then describe the development of physical modeling and statistical frameworks that underpin contemporary operational forecasting. Particular emphasis is placed on recent advances in neural network-based approaches, including automated deep learning, interpretability-driven design, and hybrid physical-data models. By synthesizing research across multiple eras and paradigms, this review not only depicts the history of end-to-end precipitation prediction but also outlines future directions for next-generation forecasting systems.

Title: Seeing the Unseen: Towards Zero-Shot Inspection for Wind Turbine Blades using Knowledge-Augmented Vision Language Models

Authors: Yang Zhang, Qianyu Zhou, Farhad Imani, Jiong Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.22868
Pdf URL: https://arxiv.org/pdf/2510.22868
Copy Paste: [[2510.22868]] Seeing the Unseen: Towards Zero-Shot Inspection for Wind Turbine Blades using Knowledge-Augmented Vision Language Models(https://arxiv.org/abs/2510.22868)
Keywords: generation
Abstract: Wind turbine blades operate in harsh environments, making timely damage detection essential for preventing failures and optimizing maintenance. Drone-based inspection and deep learning are promising, but typically depend on large, labeled datasets, which limit their ability to detect rare or evolving damage types. To address this, we propose a zero-shot-oriented inspection framework that integrates Retrieval-Augmented Generation (RAG) with Vision-Language Models (VLM). A multimodal knowledge base is constructed, comprising technical documentation, representative reference images, and domain-specific guidelines. A hybrid text-image retriever with keyword-aware reranking assembles the most relevant context to condition the VLM at inference, injecting domain knowledge without task-specific training. We evaluate the framework on 30 labeled blade images covering diverse damage categories. Although the dataset is small due to the difficulty of acquiring verified blade imagery, it covers multiple representative defect types. On this test set, the RAG-grounded VLM correctly classified all samples, whereas the same VLM without retrieval performed worse in both accuracy and precision. We further compare against open-vocabulary baselines and report Clopper-Pearson confidence intervals to quantify the uncertainty that comes with the small-sample setting. Ablation studies indicate that the key advantage of the framework lies in explainability and generalizability: retrieved references ground the reasoning process and enable the detection of previously unseen defects by leveraging domain knowledge rather than relying solely on visual cues. This research contributes a data-efficient solution for industrial inspection that reduces dependence on extensive labeled datasets.
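
To make the small-sample caveat concrete, the sketch below computes an exact Clopper-Pearson interval for 30 correct classifications out of 30. The 95% confidence level is an assumption (the abstract does not state one), and `clopper_pearson` is an illustrative helper written for this note, not code from the paper.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

def _root(g, iters=200):
    """Bisection root of a function that increases from g(0) < 0 to g(1) > 0."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if g(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided (1 - alpha) interval for a binomial proportion, k successes in n trials."""
    if n <= 0 or not 0 <= k <= n:
        raise ValueError("need n > 0 and 0 <= k <= n")
    # Lower bound p solves P(X >= k | p) = alpha / 2; it is 0 when k == 0.
    lower = 0.0 if k == 0 else _root(lambda p: 1 - binom_cdf(k - 1, n, p) - alpha / 2)
    # Upper bound p solves P(X <= k | p) = alpha / 2; it is 1 when k == n.
    upper = 1.0 if k == n else _root(lambda p: alpha / 2 - binom_cdf(k, n, p))
    return lower, upper

lower, upper = clopper_pearson(30, 30)
print(f"{lower:.3f}, {upper:.3f}")  # lower bound is 0.025 ** (1 / 30), about 0.884
```

Even a perfect score on 30 images is therefore compatible, at this level, with a true accuracy of roughly 88%.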

Title: Limits of Generative Pre-Training in Structured EMR Trajectories with Irregular Sampling

Authors: Nicholas I-Hsien Kuo, Blanca Gallego, Louisa Jorm
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2510.22878
Pdf URL: https://arxiv.org/pdf/2510.22878
Copy Paste: [[2510.22878]] Limits of Generative Pre-Training in Structured EMR Trajectories with Irregular Sampling(https://arxiv.org/abs/2510.22878)
Keywords: generative
Abstract: Foundation models refer to architectures trained on vast datasets using autoregressive pre-training from natural language processing to capture intricate patterns and motifs. They were originally developed to transfer such learned knowledge to downstream predictive tasks. Recently, however, some studies repurpose these learned representations for phenotype discovery without rigorous validation, risking superficially realistic but clinically incoherent embeddings. To test this mismatch, we trained two autoregressive models -- a sequence-to-sequence LSTM and a reduced Transformer -- on longitudinal ART for HIV and Acute Hypotension datasets. Controlled irregularity was added during training via random inter-visit gaps, while test sequences stayed complete. Patient-trajectory synthesis evaluated distributional and correlational fidelity. Both reproduced feature distributions but failed to preserve cross-feature structure -- showing that generative pre-training yields local realism but limited clinical coherence. These results highlight the need for domain-specific evaluation and support trajectory synthesis as a practical probe before fine-tuning or deployment.

Title: Transforming volcanic monitoring: A dataset and benchmark for onboard volcano activity detection

Authors: Darshana Priyasad, Tharindu Fernando, Maryam Haghighat, Harshala Gammulle, Clinton Fookes
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2510.22889
Pdf URL: https://arxiv.org/pdf/2510.22889
Copy Paste: [[2510.22889]] Transforming volcanic monitoring: A dataset and benchmark for onboard volcano activity detection(https://arxiv.org/abs/2510.22889)
Keywords: generation
Abstract: Natural disasters, such as volcanic eruptions, pose significant challenges to daily life and incur considerable global economic losses. The emergence of next-generation small satellites, capable of constellation-based operations, offers unparalleled opportunities for near-real-time monitoring and onboard processing of such events. However, a major bottleneck remains the lack of extensive annotated datasets capturing volcanic activity, which hinders the development of robust detection systems. This paper introduces a novel dataset explicitly designed for volcanic activity and eruption detection, encompassing diverse volcanoes worldwide. The dataset provides binary annotations to identify volcanic anomalies or non-anomalies, covering phenomena such as temperature anomalies, eruptions, and volcanic ash emissions. These annotations offer a foundational resource for developing and evaluating detection models, addressing a critical gap in volcanic monitoring research. Additionally, we present comprehensive benchmarks using state-of-the-art models to establish baselines for future studies. Furthermore, we explore the potential for deploying these models onboard next-generation satellites. Using the Intel Movidius Myriad X VPU as a testbed, we demonstrate the feasibility of volcanic activity detection directly onboard. This capability significantly reduces latency and improves response times, paving the way for advanced early warning systems and innovative solutions in volcanic disaster management, and encouraging further exploration and refinement of onboard monitoring technologies.

Title: On the Anisotropy of Score-Based Generative Models

Authors: Andreas Floros, Seyed-Mohsen Moosavi-Dezfooli, Pier Luigi Dragotti
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2510.22899
Pdf URL: https://arxiv.org/pdf/2510.22899
Copy Paste: [[2510.22899]] On the Anisotropy of Score-Based Generative Models(https://arxiv.org/abs/2510.22899)
Keywords: generative
Abstract: We investigate the role of network architecture in shaping the inductive biases of modern score-based generative models. To this end, we introduce the Score Anisotropy Directions (SADs), architecture-dependent directions that reveal how different networks preferentially capture data structure. Our analysis shows that SADs form adaptive bases aligned with the architecture's output geometry, providing a principled way to predict generalization ability in score models prior to training. Through both synthetic data and standard image benchmarks, we demonstrate that SADs reliably capture fine-grained model behavior and correlate with downstream performance, as measured by Wasserstein metrics. Our work offers a new lens for explaining and predicting directional biases of generative models.

Title: Towards Personalized Treatment Plan: Geometrical Model-Agnostic Approach to Counterfactual Explanations

Authors: Daniel Sin, Milad Toutounchian
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2510.22911
Pdf URL: https://arxiv.org/pdf/2510.22911
Copy Paste: [[2510.22911]] Towards Personalized Treatment Plan: Geometrical Model-Agnostic Approach to Counterfactual Explanations(https://arxiv.org/abs/2510.22911)
Keywords: generation
Abstract: In our article, we describe a method for generating counterfactual explanations in high-dimensional spaces using four steps: fitting a model to our dataset, finding the decision boundary, determining constraints on the problem, and computing the closest point (counterfactual explanation) on that boundary. We propose a discretized approach where we find many discrete points on the boundary and then identify the closest feasible counterfactual explanation. This method, which we later call $\textit{Segmented Sampling for Boundary Approximation}$ (SSBA), applies binary search to find decision boundary points and then searches for the closest boundary point. Across four datasets of varying dimensionality, we show that our method can outperform current methods for counterfactual generation, reducing the $L_2$ distance by $5\%$ to $50\%$. Our method can also handle real-world constraints by restricting changes to immutable and categorical features, such as age, gender, sex, height, and other related characteristics, as is the case for a health-based dataset. In terms of runtime, the SSBA algorithm generates multiple orders of magnitude more decision boundary points than a grid-based approach in the same amount of time. In general, our method provides a simple and effective model-agnostic method that can compute nearest feasible (i.e. realistic with constraints) counterfactual explanations. All of our results and our code can be found at this link: this https URL (dsin85691/SSBA_For_Counterfactuals)
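
The two-stage procedure described in the abstract (binary search for boundary points, then a nearest-point search) can be sketched as follows. This is one illustrative reading of the abstract, not the authors' SSBA code: the toy `predict` classifier, the random candidate pool, and the function names are all hypothetical, and the feasibility constraints on immutable features are left out.

```python
import math
import random

def predict(x):
    """Hypothetical toy classifier: class 1 if x[0] + x[1] > 1, else class 0."""
    return int(x[0] + x[1] > 1.0)

def boundary_point(x, y, predict, tol=1e-6):
    """Binary search along the segment x -> y for the point where the label flips.
    Assumes predict(x) != predict(y); returns a point just on y's side of the boundary."""
    label_x = predict(x)
    lo, hi = 0.0, 1.0  # lo stays on x's side, hi on y's side
    while hi - lo > tol:
        mid = (lo + hi) / 2
        point = [a + mid * (b - a) for a, b in zip(x, y)]
        if predict(point) == label_x:
            lo = mid
        else:
            hi = mid
    return [a + hi * (b - a) for a, b in zip(x, y)]

def closest_counterfactual(x, candidates, predict):
    """Find boundary points toward every opposite-class candidate, then keep the nearest (L2)."""
    label_x = predict(x)
    points = [boundary_point(x, y, predict) for y in candidates if predict(y) != label_x]
    if not points:
        raise ValueError("no candidate of the opposite class")
    return min(points, key=lambda p: math.dist(x, p))

rng = random.Random(0)
query = [0.0, 0.0]
candidates = [[rng.uniform(0, 2), rng.uniform(0, 2)] for _ in range(200)]
cf = closest_counterfactual(query, candidates, predict)
print(cf, math.dist(query, cf))
```

Each binary search costs only a logarithmic number of model calls per boundary point, which is one plausible reason such a scheme would produce far more boundary points than a grid in the same time.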

Title: Simple Denoising Diffusion Language Models

Authors: Huaisheng Zhu, Zhengyu Chen, Shijie Zhou, Zhihui Xie, Yige Yuan, Zhimeng Guo, Siyuan Xu, Hangfan Zhang, Vasant Honavar, Teng Xiao
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2510.22926
Pdf URL: https://arxiv.org/pdf/2510.22926
Copy Paste: [[2510.22926]] Simple Denoising Diffusion Language Models(https://arxiv.org/abs/2510.22926)
Keywords: generation
Abstract: Diffusion models have recently been extended to language generation through Masked Diffusion Language Models (MDLMs), which achieve performance competitive with strong autoregressive models. However, MDLMs tend to degrade in the few-step regime and cannot directly adopt existing few-step distillation methods designed for continuous diffusion models, as they lack the intrinsic property of mapping from noise to data. Recent Uniform-state Diffusion Models (USDMs), initialized from a uniform prior, alleviate some limitations but still suffer from complex loss formulations that hinder scalability. In this work, we propose a simplified denoising-based loss for USDMs that optimizes only noise-replaced tokens, stabilizing training and matching ELBO-level performance. Furthermore, by framing denoising as self-supervised learning, we introduce a simple modification to our denoising loss with contrastive-inspired negative gradients, which is practical and yields additional improvements in generation quality.
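
A minimal sketch of a denoising loss restricted to noise-replaced tokens, assuming a uniform corruption process and a plain cross-entropy objective. The abstract does not give the exact loss, so the weighting, the `corrupt` and `denoising_loss` helpers, and the uniform stand-in predictor are illustrative assumptions; the contrastive-inspired negative-gradient term is not shown.

```python
import math
import random

def corrupt(tokens, vocab_size, rate, rng):
    """Replace each token, with probability `rate`, by a uniformly drawn vocabulary id.
    Returns the noisy sequence and a mask marking the replaced positions."""
    noisy, replaced = [], []
    for tok in tokens:
        if rng.random() < rate:
            noisy.append(rng.randrange(vocab_size))
            replaced.append(True)
        else:
            noisy.append(tok)
            replaced.append(False)
    return noisy, replaced

def denoising_loss(log_probs, tokens, replaced):
    """Mean negative log-likelihood of the clean tokens, over replaced positions only.
    log_probs[i][v] is the model's log-probability of token v at position i."""
    terms = [-log_probs[i][tok] for i, (tok, r) in enumerate(zip(tokens, replaced)) if r]
    return sum(terms) / len(terms) if terms else 0.0

rng = random.Random(0)
vocab_size = 8
clean = [3, 1, 4, 1, 5]
noisy, replaced = corrupt(clean, vocab_size, rate=0.5, rng=rng)
# Stand-in for a model evaluated on `noisy`: a uniform predictor (assumption).
uniform = [[math.log(1.0 / vocab_size)] * vocab_size for _ in clean]
loss = denoising_loss(uniform, clean, replaced)
print(noisy, replaced, loss)
```

Positions that kept their original token contribute nothing to the loss, so the gradient signal comes only from tokens the model has to repair.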

Title: Diffuse to Detect: A Generalizable Framework for Anomaly Detection with Diffusion Models Applications to UAVs and Beyond

Authors: Mingze Gong, Juan Du, Jianbang You
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2510.22928
Pdf URL: https://arxiv.org/pdf/2510.22928
Copy Paste: [[2510.22928]] Diffuse to Detect: A Generalizable Framework for Anomaly Detection with Diffusion Models Applications to UAVs and Beyond(https://arxiv.org/abs/2510.22928)
Keywords: generative
Abstract: Anomaly detection in complex, high-dimensional data, such as UAV sensor readings, is essential for operational safety but challenging for existing methods due to their limited sensitivity, scalability, and inability to capture intricate dependencies. We propose the Diffuse to Detect (DTD) framework, a novel approach that innovatively adapts diffusion models for anomaly detection, diverging from their conventional use in generative tasks with high inference time. By comparison, DTD employs a single-step diffusion process to predict noise patterns, enabling rapid and precise identification of anomalies without reconstruction errors. This approach is grounded in robust theoretical foundations that link noise prediction to the data distribution's score function, ensuring reliable deviation detection. By integrating Graph Neural Networks to model sensor relationships as dynamic graphs, DTD effectively captures spatial (inter-sensor) and temporal anomalies. Its two-branch architecture, with parametric neural network-based energy scoring for scalability and nonparametric statistical methods for interpretability, provides flexible trade-offs between computational efficiency and transparency. Extensive evaluations on UAV sensor data, multivariate time series, and images demonstrate DTD's superior performance over existing methods, underscoring its generality across diverse data modalities. This versatility, combined with its adaptability, positions DTD as a transformative solution for safety-critical applications, including industrial monitoring and beyond.

Title: RL-AUX: Reinforcement Learning for Auxiliary Task Generation

Authors: Judah Goldfeder, Matthew So, Hod Lipson
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2510.22940
Pdf URL: https://arxiv.org/pdf/2510.22940
Copy Paste: [[2510.22940]] RL-AUX: Reinforcement Learning for Auxiliary Task Generation(https://arxiv.org/abs/2510.22940)
Keywords: generation
Abstract: Auxiliary Learning (AL) is a special case of Multi-task Learning (MTL) in which a network trains on auxiliary tasks to improve performance on its main task. This technique is used to improve generalization and, ultimately, performance on the network's main task. AL has been demonstrated to improve performance across multiple domains, including navigation, image classification, and natural language processing. One weakness of AL is the need for labeled auxiliary tasks, which can require human effort and domain expertise to generate. Meta Learning techniques have been used to solve this issue by learning an additional auxiliary task generation network that can create helpful tasks for the primary network. The most prominent techniques rely on Bi-Level Optimization, which incurs computational cost and increased code complexity. To avoid the need for Bi-Level Optimization, we present an RL-based approach to dynamically create auxiliary tasks. In this framework, an RL agent is tasked with selecting auxiliary labels for every data point in a training set. The agent is rewarded when its selection improves the performance on the primary task. We also experiment with learning optimal strategies for weighting the auxiliary loss per data point. On the 20-Superclass CIFAR100 problem, our RL approach outperforms human-labeled auxiliary tasks and performs as well as a prominent Bi-Level Optimization technique. Our weight learning approaches significantly outperform all of these benchmarks. For example, a Weight-Aware RL-based approach helps the VGG16 architecture achieve 80.9% test accuracy while the human-labeled auxiliary task setup achieved 75.53%. The goal of this work is to (1) prove that RL is a viable approach to dynamically generate auxiliary tasks and (2) demonstrate that per-sample auxiliary task weights can be learned alongside the auxiliary task labels and can achieve strong results.

Title: LightBagel: A Light-weighted, Double Fusion Framework for Unified Multimodal Understanding and Generation

Authors: Zeyu Wang, Zilong Chen, Chenhui Gou, Feng Li, Chaorui Deng, Deyao Zhu, Kunchang Li, Weihao Yu, Haoqin Tu, Haoqi Fan, Cihang Xie
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.22946
Pdf URL: https://arxiv.org/pdf/2510.22946
Copy Paste: [[2510.22946]] LightBagel: A Light-weighted, Double Fusion Framework for Unified Multimodal Understanding and Generation(https://arxiv.org/abs/2510.22946)
Keywords: generation
Abstract: Unified multimodal models have recently shown remarkable gains in both capability and versatility, yet most leading systems are still trained from scratch and require substantial computational resources. In this paper, we show that competitive performance can be obtained far more efficiently by strategically fusing publicly available models specialized for either generation or understanding. Our key design is to retain the original blocks while additionally interleaving multimodal self-attention blocks throughout the networks. This double fusion mechanism (1) effectively enables rich multi-modal fusion while largely preserving the original strengths of the base models, and (2) catalyzes synergistic fusion of high-level semantic representations from the understanding encoder with low-level spatial signals from the generation encoder. By training with only ~35B tokens, this approach achieves strong results across multiple benchmarks: 0.91 on GenEval for compositional text-to-image generation, 82.16 on DPG-Bench for complex text-to-image generation, 6.06 on GEditBench, and 3.77 on ImgEdit-Bench for image editing. By fully releasing the entire suite of code, model weights, and datasets, we hope to support future research on unified multimodal modeling.

Title: VALA: Learning Latent Anchors for Training-Free and Temporally Consistent

Authors: Zhangkai Wu, Xuhui Fan, Zhongyuan Xie, Kaize Shi, Longbing Cao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.22970
Pdf URL: https://arxiv.org/pdf/2510.22970
Copy Paste: [[2510.22970]] VALA: Learning Latent Anchors for Training-Free and Temporally Consistent(https://arxiv.org/abs/2510.22970)
Keywords: generation
Abstract: Recent advances in training-free video editing have enabled lightweight and precise cross-frame generation by leveraging pre-trained text-to-image diffusion models. However, existing methods often rely on heuristic frame selection to maintain temporal consistency during DDIM inversion, which introduces manual bias and reduces the scalability of end-to-end inference. In this paper, we propose VALA (Variational Alignment for Latent Anchors), a variational alignment module that adaptively selects key frames and compresses their latent features into semantic anchors for consistent video editing. To learn meaningful assignments, VALA adopts a variational framework with a contrastive learning objective. It can therefore transform cross-frame latent representations into compressed latent anchors that preserve both content and temporal coherence. Our method can be fully integrated into training-free text-to-image based video editing models. Extensive experiments on real-world video editing benchmarks show that VALA achieves state-of-the-art performance in inversion fidelity, editing quality, and temporal consistency, while offering improved efficiency over prior methods.

Title: Scaling Up Occupancy-centric Driving Scene Generation: Dataset and Method

Authors: Bohan Li, Xin Jin, Hu Zhu, Hongsi Liu, Ruikai Li, Jiazhe Guo, Kaiwen Cai, Chao Ma, Yueming Jin, Hao Zhao, Xiaokang Yang, Wenjun Zeng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.22973
Pdf URL: https://arxiv.org/pdf/2510.22973
Copy Paste: [[2510.22973]] Scaling Up Occupancy-centric Driving Scene Generation: Dataset and Method(https://arxiv.org/abs/2510.22973)
Keywords: generation, generative
Abstract: Driving scene generation is a critical domain for autonomous driving, enabling downstream applications, including perception and planning evaluation. Occupancy-centric methods have recently achieved state-of-the-art results by offering consistent conditioning across frames and modalities; however, their performance heavily depends on annotated occupancy data, which remains scarce. To overcome this limitation, we curate Nuplan-Occ, the largest semantic occupancy dataset to date, constructed from the widely used Nuplan benchmark. Its scale and diversity facilitate not only large-scale generative modeling but also autonomous driving downstream applications. Based on this dataset, we develop a unified framework that jointly synthesizes high-quality semantic occupancy, multi-view videos, and LiDAR point clouds. Our approach incorporates a spatio-temporal disentangled architecture to support high-fidelity spatial expansion and temporal forecasting of 4D dynamic occupancy. To bridge modal gaps, we further propose two novel techniques: a Gaussian splatting-based sparse point map rendering strategy that enhances multi-view video generation, and a sensor-aware embedding strategy that explicitly models LiDAR sensor properties for realistic multi-LiDAR simulation. Extensive experiments demonstrate that our method achieves superior generation fidelity and scalability compared to existing approaches, and validate its practical value in downstream tasks. Repo: this https URL

Title: SceneDecorator: Towards Scene-Oriented Story Generation with Scene Planning and Scene Consistency

Authors: Quanjian Song, Donghao Zhou, Jingyu Lin, Fei Shen, Jiaze Wang, Xiaowei Hu, Cunjian Chen, Pheng-Ann Heng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.22994
Pdf URL: https://arxiv.org/pdf/2510.22994
Copy Paste: [[2510.22994]] SceneDecorator: Towards Scene-Oriented Story Generation with Scene Planning and Scene Consistency(https://arxiv.org/abs/2510.22994)
Keywords: generation
Abstract: Recent text-to-image models have revolutionized image generation, but they still struggle with maintaining concept consistency across generated images. While existing works focus on character consistency, they often overlook the crucial role of scenes in storytelling, which restricts their creativity in practice. This paper introduces scene-oriented story generation, addressing two key challenges: (i) scene planning, where current methods fail to ensure scene-level narrative coherence by relying solely on text descriptions, and (ii) scene consistency, which remains largely unexplored in terms of maintaining scene consistency across multiple stories. We propose SceneDecorator, a training-free framework that employs VLM-Guided Scene Planning to ensure narrative coherence across different scenes in a "global-to-local" manner, and Long-Term Scene-Sharing Attention to maintain long-term scene consistency and subject diversity across generated stories. Extensive experiments demonstrate the superior performance of SceneDecorator, highlighting its potential to unleash creativity in the fields of arts, films, and games.

Title: CoMo: Compositional Motion Customization for Text-to-Video Generation

Authors: Youcan Xu, Zhen Wang, Jiaxin Shi, Kexin Li, Feifei Shao, Jun Xiao, Yi Yang, Jun Yu, Long Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.23007
Pdf URL: https://arxiv.org/pdf/2510.23007
Copy Paste: [[2510.23007]] CoMo: Compositional Motion Customization for Text-to-Video Generation(https://arxiv.org/abs/2510.23007)
Keywords: generation
Abstract: While recent text-to-video models excel at generating diverse scenes, they struggle with precise motion control, particularly for complex, multi-subject motions. Although methods for single-motion customization have been developed to address this gap, they fail in compositional scenarios due to two primary challenges: motion-appearance entanglement and ineffective multi-motion blending. This paper introduces CoMo, a novel framework for compositional motion customization in text-to-video generation, enabling the synthesis of multiple, distinct motions within a single video. CoMo addresses these issues through a two-phase approach. First, in the single-motion learning phase, a static-dynamic decoupled tuning paradigm disentangles motion from appearance to learn a motion-specific module. Second, in the multi-motion composition phase, a plug-and-play divide-and-merge strategy composes these learned motions without additional training by spatially isolating their influence during the denoising process. To facilitate research in this new domain, we also introduce a new benchmark and a novel evaluation metric designed to assess multi-motion fidelity and blending. Extensive experiments demonstrate that CoMo achieves state-of-the-art performance, significantly advancing the capabilities of controllable video generation. Our project page is at this https URL.

Title: M$^{3}$T2IBench: A Large-Scale Multi-Category, Multi-Instance, Multi-Relation Text-to-Image Benchmark

Authors: Huixuan Zhang, Xiaojun Wan
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2510.23020
Pdf URL: https://arxiv.org/pdf/2510.23020
Copy Paste: [[2510.23020]] M$^{3}$T2IBench: A Large-Scale Multi-Category, Multi-Instance, Multi-Relation Text-to-Image Benchmark(https://arxiv.org/abs/2510.23020)
Keywords: generation
Abstract: Text-to-image models are known to struggle with generating images that perfectly align with textual prompts. Several previous studies have focused on evaluating image-text alignment in text-to-image generation. However, these evaluations either address overly simple scenarios, especially overlooking the difficulty of prompts with multiple different instances belonging to the same category, or they introduce metrics that do not correlate well with human evaluation. In this study, we introduce M$^3$T2IBench, a large-scale, multi-category, multi-instance, multi-relation benchmark, along with an object-detection-based evaluation metric, AlignScore, which aligns closely with human evaluation. Our findings reveal that current open-source text-to-image models perform poorly on this challenging benchmark. Additionally, we propose the Revise-Then-Enforce approach to enhance image-text alignment. This training-free post-editing method demonstrates improvements in image-text alignment across a broad range of diffusion models. Our code and data have been released in the supplementary material and will be made publicly available after the paper is accepted.

Title: UniAIDet: A Unified and Universal Benchmark for AI-Generated Image Content Detection and Localization

Authors: Huixuan Zhang, Xiaojun Wan
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2510.23023
Pdf URL: https://arxiv.org/pdf/2510.23023
Copy Paste: [[2510.23023]] UniAIDet: A Unified and Universal Benchmark for AI-Generated Image Content Detection and Localization(https://arxiv.org/abs/2510.23023)
Keywords: generative
Abstract: With the rapid proliferation of image generative models, the authenticity of digital images has become a significant concern. While existing studies have proposed various methods for detecting AI-generated content, current benchmarks are limited in their coverage of diverse generative models and image categories, often overlooking end-to-end image editing and artistic images. To address these limitations, we introduce UniAIDet, a unified and comprehensive benchmark that includes both photographic and artistic images. UniAIDet covers a wide range of generative models, including text-to-image, image-to-image, image inpainting, image editing, and deepfake models. Using UniAIDet, we conduct a comprehensive evaluation of various detection methods and answer three key research questions regarding generalization capability and the relation between detection and localization. Our benchmark and analysis provide a robust foundation for future research.

Title: Towards Stable and Effective Reinforcement Learning for Mixture-of-Experts

Authors: Di Zhang, Xun Wu, Shaohan Huang, Yaru Hao, Li Dong, Zewen Chi, Zhifang Sui, Furu Wei
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2510.23027
Pdf URL: https://arxiv.org/pdf/2510.23027
Copy Paste: [[2510.23027]] Towards Stable and Effective Reinforcement Learning for Mixture-of-Experts(https://arxiv.org/abs/2510.23027)
Keywords: generation
Abstract: Recent advances in reinforcement learning (RL) have substantially improved the training of large-scale language models, leading to significant gains in generation quality and reasoning ability. However, most existing research focuses on dense models, while RL training for Mixture-of-Experts (MoE) architectures remains underexplored. To address the instability commonly observed in MoE training, we propose a novel router-aware approach to optimize importance sampling (IS) weights in off-policy RL. Specifically, we design a rescaling strategy guided by router logits, which effectively reduces gradient variance and mitigates training divergence. Experimental results demonstrate that our method significantly improves both the convergence stability and the final performance of MoE models, highlighting the potential of RL algorithmic innovations tailored to MoE architectures and providing a promising direction for efficient training of large-scale expert models.
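For readers unfamiliar with the importance sampling (IS) weights the abstract refers to, the sketch below shows the standard per-token ratio between the target and behavior policies, with optional truncation as a common variance control. This is only a generic illustration: the router-logit-guided rescaling proposed in the paper is not specified in the abstract and is not reproduced here, and the function names and the `cap` parameter are hypothetical.

```python
import math


def importance_weights(logp_target, logp_behavior, cap=None):
    """Per-token IS weights exp(log pi_target - log pi_behavior).

    If cap is given, weights are truncated at cap, which bounds their
    magnitude at the cost of some bias (a generic technique, not the
    paper's router-aware rescaling).
    """
    weights = [math.exp(t - b) for t, b in zip(logp_target, logp_behavior)]
    if cap is not None:
        weights = [min(w, cap) for w in weights]
    return weights


def is_weighted_mean(values, weights):
    """Average of per-token values, each scaled by its IS weight."""
    return sum(v * w for v, w in zip(values, weights)) / len(values)
```

Large gaps between the two policies' log-probabilities make the untruncated weights grow exponentially, which is one source of the gradient variance the abstract mentions.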

Title: Nested AutoRegressive Models

Authors: Hongyu Wu, Xuhui Fan, Zhangkai Wu, Longbing Cao
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2510.23028
Pdf URL: https://arxiv.org/pdf/2510.23028
Copy Paste: [[2510.23028]] Nested AutoRegressive Models(https://arxiv.org/abs/2510.23028)
Keywords: generation
Abstract: AutoRegressive (AR) models have demonstrated competitive performance in image generation, achieving results comparable to those of diffusion models. However, their token-by-token image generation mechanism remains computationally intensive, and existing solutions such as VAR often lead to limited sample diversity. In this work, we propose a Nested AutoRegressive (NestAR) model, which uses nested AutoRegressive architectures to generate images. NestAR designs multi-scale modules in a hierarchical order. These different scaled modules are constructed in an AR architecture, where one larger-scale module is conditioned on outputs from its previous smaller-scale module. Within each module, NestAR uses another AR structure to generate "patches" of tokens. The proposed nested AR architecture reduces the overall complexity from $\mathcal{O}(n)$ to $\mathcal{O}(\log n)$ in generating $n$ image tokens and increases image diversity. NestAR further incorporates a flow matching loss to use continuous tokens, and develops objectives to coordinate these multi-scale modules in model training. NestAR achieves competitive image generation performance while significantly lowering computational cost.

Title: LLM Meets Diffusion: A Hybrid Framework for Crystal Material Generation

Authors: Subhojyoti Khastagir, Kishalay Das, Pawan Goyal, Seung-Cheol Lee, Satadeep Bhattacharjee, Niloy Ganguly
Subjects: cs.LG, cond-mat.mtrl-sci, cs.AI
Abstract URL: https://arxiv.org/abs/2510.23040
Pdf URL: https://arxiv.org/pdf/2510.23040
Copy Paste: [[2510.23040]] LLM Meets Diffusion: A Hybrid Framework for Crystal Material Generation(https://arxiv.org/abs/2510.23040)
Keywords: generation, generative
Abstract: Recent advances in generative modeling have shown significant promise in designing novel periodic crystal structures. Existing approaches typically rely on either large language models (LLMs) or equivariant denoising models, each with complementary strengths: LLMs excel at handling discrete atomic types but often struggle with continuous features such as atomic positions and lattice parameters, while denoising models are effective at modeling continuous variables but encounter difficulties in generating accurate atomic compositions. To bridge this gap, we propose CrysLLMGen, a hybrid framework that integrates an LLM with a diffusion model to leverage their complementary strengths for crystal material generation. During sampling, CrysLLMGen first employs a fine-tuned LLM to produce an intermediate representation of atom types, atomic coordinates, and lattice structure. While retaining the predicted atom types, it passes the atomic coordinates and lattice structure to a pre-trained equivariant diffusion model for refinement. Our framework outperforms state-of-the-art generative models across several benchmark tasks and datasets. Specifically, CrysLLMGen not only achieves a balanced performance in terms of structural and compositional validity but also generates more stable and novel materials compared to LLM-based and denoising-based models. Furthermore, CrysLLMGen exhibits strong conditional generation capabilities, effectively producing materials that satisfy user-defined constraints. Code is available at this https URL

Title: Residual Diffusion Bridge Model for Image Restoration

Authors: Hebaixu Wang, Jing Zhang, Haoyang Chen, Haonan Guo, Di Wang, Jiayi Ma, Bo Du
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.23116
Pdf URL: https://arxiv.org/pdf/2510.23116
Copy Paste: [[2510.23116]] Residual Diffusion Bridge Model for Image Restoration(https://arxiv.org/abs/2510.23116)
Keywords: restoration
Abstract: Diffusion bridge models establish probabilistic paths between arbitrary paired distributions and exhibit great potential for universal image restoration. Most existing methods merely treat them as simple variants of stochastic interpolants, lacking a unified analytical perspective. Besides, they indiscriminately reconstruct images through global noise injection and removal, inevitably distorting undegraded regions due to imperfect reconstruction. To address these challenges, we propose the Residual Diffusion Bridge Model (RDBM). Specifically, we theoretically reformulate the stochastic differential equations of the generalized diffusion bridge and derive the analytical formulas of its forward and reverse processes. Crucially, we leverage the residuals from given distributions to modulate the noise injection and removal, enabling adaptive restoration of degraded regions while preserving intact ones. Moreover, we unravel the fundamental mathematical essence of existing bridge models, all of which are special cases of RDBM, and empirically demonstrate the optimality of our proposed models. Extensive experiments are conducted to demonstrate the state-of-the-art performance of our method both qualitatively and quantitatively across diverse image restoration tasks. Code is publicly available at this https URL.

Title: Task-Agnostic Fusion of Time Series and Imagery for Earth Observation

Authors: Gianfranco Basile, Johannes Jakubik, Benedikt Blumenstiel, Thomas Brunschwiler, Juan Bernabe Moreno
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.23118
Pdf URL: https://arxiv.org/pdf/2510.23118
Copy Paste: [[2510.23118]] Task-Agnostic Fusion of Time Series and Imagery for Earth Observation(https://arxiv.org/abs/2510.23118)
Keywords: generation
Abstract: We propose a task-agnostic framework for multimodal fusion of time series and single timestamp images, enabling cross-modal generation and robust downstream performance. Our approach explores deterministic and learned strategies for time series quantization and then leverages a masked correlation learning objective, aligning discrete image and time series tokens in a unified representation space. Instantiated in the Earth observation domain, the pretrained model generates consistent global temperature profiles from satellite imagery and is validated through counterfactual experiments. Across downstream tasks, our task-agnostic pretraining outperforms task-specific fusion by 6% in R$^2$ and 2% in RMSE on average, and exceeds baseline methods by 50% in R$^2$ and 12% in RMSE. Finally, we analyze gradient sensitivity across modalities, providing insights into model robustness. Code, data, and weights will be released under a permissive license.
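The two metrics quoted above follow their standard definitions, sketched below for reference. The helper names are illustrative, and the abstract does not say how the reported percentage gains are aggregated across tasks, so nothing here reproduces the paper's evaluation protocol.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: lower is better."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot, higher is better.

    Assumes y_true is not constant, so SS_tot is non-zero.
    """
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```

Because R$^2$ rewards higher values and RMSE rewards lower ones, an "improvement" in each points in opposite numeric directions.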

Title: Enabling Vibration-Based Gesture Recognition on Everyday Furniture via Energy-Efficient FPGA Implementation of 1D Convolutional Networks

Authors: Koki Shibata, Tianheng Ling, Chao Qian, Tomokazu Matsui, Hirohiko Suwa, Keiichi Yasumoto, Gregor Schiele
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2510.23156
Pdf URL: https://arxiv.org/pdf/2510.23156
Copy Paste: [[2510.23156]] Enabling Vibration-Based Gesture Recognition on Everyday Furniture via Energy-Efficient FPGA Implementation of 1D Convolutional Networks(https://arxiv.org/abs/2510.23156)
Keywords: generation
Abstract: The growing demand for smart home interfaces has increased interest in non-intrusive sensing methods like vibration-based gesture recognition. While prior studies demonstrated feasibility, they often rely on complex preprocessing and large Neural Networks (NNs) requiring costly high-performance hardware, resulting in high energy usage and limited real-world deployability. This study proposes an energy-efficient solution deploying compact NNs on low-power Field-Programmable Gate Arrays (FPGAs) to enable real-time gesture recognition with competitive accuracy. We adopt a series of optimizations: (1) We replace complex spectral preprocessing with raw waveform input, eliminating complex on-board preprocessing while reducing input size by 21x without sacrificing accuracy. (2) We design two lightweight architectures (1D-CNN and 1D-SepCNN) tailored for embedded FPGAs, reducing parameters from 369 million to as few as 216 while maintaining comparable accuracy. (3) With integer-only quantization and automated RTL generation, we achieve seamless FPGA deployment. A ping-pong buffering mechanism in 1D-SepCNN further improves deployability under tight memory constraints. (4) We extend a hardware-aware search framework to support constraint-driven model configuration selection, considering accuracy, deployability, latency, and energy consumption. Evaluated on two swipe-direction datasets with multiple users and ordinary tables, our approach achieves low-latency, energy-efficient inference on the AMD Spartan-7 XC7S25 FPGA. Under the PS data splitting setting, the selected 6-bit 1D-CNN reaches 0.970 average accuracy across users with 9.22 ms latency. The chosen 8-bit 1D-SepCNN further reduces latency to 6.83 ms (over 53x CPU speedup) with slightly lower accuracy (0.949). Both consume under 1.2 mJ per inference, demonstrating suitability for long-term edge operation.
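The ping-pong buffering mentioned in optimization (3) can be illustrated in software with a minimal sketch: two fixed-size buffers alternate as source and destination from layer to layer, so activation memory stays at two buffers regardless of network depth. The layer functions `scale_by_two` and `pairwise_sum` are hypothetical stand-ins, not the paper's 1D-SepCNN layers, and the calling convention (each layer writes into the destination buffer and returns its output length) is an assumption made for this example.

```python
def run_with_ping_pong(layers, x, buffer_size):
    """Run layers in sequence using two fixed buffers that swap roles each layer."""
    if len(x) > buffer_size:
        raise ValueError("input does not fit in the buffer")
    buffers = [[0] * buffer_size, [0] * buffer_size]
    n = len(x)
    buffers[0][:n] = x  # load the input into the first buffer
    src = 0
    for layer in layers:
        dst = 1 - src
        # Each layer reads n values from the source buffer, writes its
        # output into the destination buffer, and returns the output length.
        n = layer(buffers[src], n, buffers[dst])
        src = dst  # the freshly written buffer becomes the next source
    return buffers[src][:n]


def scale_by_two(src, n, dst):
    """Hypothetical layer: element-wise doubling, output length unchanged."""
    for i in range(n):
        dst[i] = 2 * src[i]
    return n


def pairwise_sum(src, n, dst):
    """Hypothetical layer: sum adjacent pairs, halving the length."""
    m = n // 2
    for i in range(m):
        dst[i] = src[2 * i] + src[2 * i + 1]
    return m
```

Calling `run_with_ping_pong([scale_by_two, pairwise_sum], [1, 2, 3, 4], 4)` doubles the input into the second buffer and then sums pairs back into the first, returning `[6, 14]` without allocating any per-layer storage.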
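
The parameter savings claimed for the lightweight architectures can be illustrated with a short parameter count. This is a minimal sketch, assuming that "Sep" in 1D-SepCNN refers to depthwise-separable convolution; the layer shape below is hypothetical and is not taken from the paper.

```python
def conv1d_params(c_in: int, c_out: int, kernel: int, bias: bool = True) -> int:
    """Parameter count of a standard 1D convolution layer."""
    return c_in * c_out * kernel + (c_out if bias else 0)


def sepconv1d_params(c_in: int, c_out: int, kernel: int, bias: bool = True) -> int:
    """Parameter count of a depthwise-separable 1D convolution.

    Depthwise stage: one length-`kernel` filter per input channel.
    Pointwise stage: a 1x1 convolution that mixes channels.
    """
    depthwise = c_in * kernel + (c_in if bias else 0)
    pointwise = c_in * c_out + (c_out if bias else 0)
    return depthwise + pointwise


# Hypothetical layer shape, chosen only for illustration.
standard = conv1d_params(c_in=8, c_out=16, kernel=5)
separable = sepconv1d_params(c_in=8, c_out=16, kernel=5)
print(standard, separable)
```

For this example shape the separable layer needs roughly a third of the parameters of the standard layer, and the gap widens as the kernel length grows.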

Title: Increasing LLM Coding Capabilities through Diverse Synthetic Coding Tasks

Authors: Amal Abed, Ivan Lukic, Jörg K.H. Franke, Frank Hutter
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2510.23208
Pdf URL: https://arxiv.org/pdf/2510.23208
Copy Paste: [[2510.23208]] Increasing LLM Coding Capabilities through Diverse Synthetic Coding Tasks(https://arxiv.org/abs/2510.23208)
Keywords: generation
Abstract: Large language models (LLMs) have shown impressive promise in code generation, yet their progress remains limited by the shortage of large-scale datasets that are both diverse and well-aligned with human reasoning. Most existing resources pair problems with solutions, but omit the intermediate thought process that guides coding. To close this gap, we present a scalable synthetic data generation pipeline that produces nearly 800k instruction-reasoning-code-test quadruplets. Each sample combines a task, a step-by-step reasoning trace, a working solution, and executable tests, enabling models to learn not just the what but also the how of problem solving. Our pipeline combines four key components: curated contest problems, web-mined content filtered by relevance classifiers, data expansion guided by reasoning patterns, and multi-stage execution-based validation. A genetic mutation algorithm further increases task diversity while maintaining consistency between reasoning traces and code implementations. Our key finding is that fine-tuning LLMs on this dataset yields consistent improvements on coding benchmarks. Beyond raw accuracy, reasoning-aware data can substitute for model scaling, generalize across architectures, and outperform leading open-source alternatives under identical sample budgets. Our work establishes reasoning-centered synthetic data generation as an efficient approach for advancing coding capabilities in LLMs. We publish our dataset and generation pipeline to facilitate further research.
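
The instruction-reasoning-code-test quadruplet and the execution-based validation step can be sketched as follows. The class name, field names, and the single-stage check are assumptions made for illustration; the abstract does not describe the pipeline's actual data schema or its validation stages.

```python
from dataclasses import dataclass


@dataclass
class Quadruplet:
    """One synthetic sample (field names are hypothetical)."""
    instruction: str
    reasoning: str
    code: str
    tests: str


def passes_execution_check(sample: Quadruplet) -> bool:
    """Run the solution, then its tests, in a fresh namespace.

    Note: exec on untrusted code is unsafe; a real pipeline would
    use a sandbox. This sketch only shows the control flow.
    """
    namespace: dict = {}
    try:
        exec(sample.code, namespace)
        exec(sample.tests, namespace)
    except Exception:
        return False
    return True


good = Quadruplet(
    instruction="Add two integers.",
    reasoning="Return the sum a + b.",
    code="def add(a, b):\n    return a + b\n",
    tests="assert add(2, 3) == 5\n",
)
bad = Quadruplet(
    instruction="Add two integers.",
    reasoning="Return the sum a + b.",
    code="def add(a, b):\n    return a - b\n",
    tests="assert add(2, 3) == 5\n",
)
kept = [s for s in (good, bad) if passes_execution_check(s)]
```

Only samples whose tests run without an exception are kept, so a solution that disagrees with its own tests is filtered out.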

Title: Accelerating Eigenvalue Dataset Generation via Chebyshev Subspace Filter

Authors: Hong Wang, Jie Wang, Jian Luo, Huanshuo Dong, Yeqiu Chen, Runmin Jiang, Zhen Huang
Subjects: cs.LG, cs.AI, math.NA
Abstract URL: https://arxiv.org/abs/2510.23215
Pdf URL: https://arxiv.org/pdf/2510.23215
Copy Paste: [[2510.23215]] Accelerating Eigenvalue Dataset Generation via Chebyshev Subspace Filter(https://arxiv.org/abs/2510.23215)
Keywords: generation
Abstract: Eigenvalue problems are among the most important topics in many scientific disciplines. With the recent surge and development of machine learning, neural eigenvalue methods have attracted significant attention as a forward pass of inference requires only a tiny fraction of the computation time compared to traditional solvers. However, a key limitation is the requirement for large amounts of labeled data in training, including operators and their eigenvalues. To tackle this limitation, we propose a novel method, named Sorting Chebyshev Subspace Filter (SCSF), which significantly accelerates eigenvalue data generation by leveraging similarities between operators -- a factor overlooked by existing methods. Specifically, SCSF employs truncated fast Fourier transform sorting to group operators with similar eigenvalue distributions and constructs a Chebyshev subspace filter that leverages eigenpairs from previously solved problems to assist in solving subsequent ones, reducing redundant computations. To the best of our knowledge, SCSF is the first method to accelerate eigenvalue data generation. Experimental results show that SCSF achieves up to a $3.5\times$ speedup compared to various numerical solvers.
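
The idea behind a Chebyshev filter can be shown on scalars. A degree-m Chebyshev polynomial stays bounded by 1 on [-1, 1] and grows quickly outside it, so mapping the unwanted part of the spectrum onto [-1, 1] damps those eigen-components relative to the wanted ones. This is a minimal sketch of that generic mechanism only; the interval bounds and degree are hypothetical, and SCSF's sorting and eigenpair-reuse steps are not modelled.

```python
import math


def chebyshev(m: int, t: float) -> float:
    """Evaluate the Chebyshev polynomial T_m(t) by the three-term recurrence."""
    prev, curr = 1.0, t
    if m == 0:
        return prev
    for _ in range(m - 1):
        prev, curr = curr, 2.0 * t * curr - prev
    return curr


def filter_gain(eigenvalue: float, a: float, b: float, m: int) -> float:
    """Gain a degree-m Chebyshev filter gives to one eigenvalue.

    The unwanted interval [a, b] is mapped affinely onto [-1, 1].
    """
    center, half_width = (a + b) / 2.0, (b - a) / 2.0
    return chebyshev(m, (eigenvalue - center) / half_width)


# Hypothetical spectrum: damp [1, 3], keep eigenvalues below 1.
wanted_gain = filter_gain(0.5, a=1.0, b=3.0, m=10)
unwanted_gain = filter_gain(2.3, a=1.0, b=3.0, m=10)
```

Applying the same polynomial to a matrix acts on each eigenvalue in this way, which is why a few filtered iterations can separate the wanted subspace.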

Title: Through the Lens: Benchmarking Deepfake Detectors Against Moiré-Induced Distortions

Authors: Razaib Tariq, Minji Heo, Simon S. Woo, Shahroz Tariq
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.23225
Pdf URL: https://arxiv.org/pdf/2510.23225
Copy Paste: [[2510.23225]] Through the Lens: Benchmarking Deepfake Detectors Against Moiré-Induced Distortions(https://arxiv.org/abs/2510.23225)
Keywords: generation
Abstract: Deepfake detection remains a pressing challenge, particularly in real-world settings where smartphone-captured media from digital screens often introduces Moiré artifacts that can distort detection outcomes. This study systematically evaluates state-of-the-art (SOTA) deepfake detectors on Moiré-affected videos, an issue that has received little attention. We collected a dataset of 12,832 videos, spanning 35.64 hours, from the Celeb-DF, DFD, DFDC, UADFV, and FF++ datasets, capturing footage under diverse real-world conditions, including varying screens, smartphones, lighting setups, and camera angles. To further examine the influence of Moiré patterns on deepfake detection, we conducted additional experiments using our DeepMoiréFake (DMF) dataset and two synthetic Moiré generation techniques. Across 15 top-performing detectors, our results show that Moiré artifacts degrade performance by as much as 25.4%, while synthetically generated Moiré patterns lead to a 21.4% drop in accuracy. Surprisingly, demoiréing methods, intended as a mitigation approach, instead worsened the problem, reducing accuracy by up to 17.2%. These findings underscore the urgent need for detection models that can robustly handle Moiré distortions alongside other real-world challenges, such as compression, sharpening, and blurring. By introducing the DMF dataset, we aim to drive future research toward closing the gap between controlled experiments and practical deepfake detection.

Title: Autoregressive Styled Text Image Generation, but Make it Reliable

Authors: Carmine Zaccagnino, Fabio Quattrini, Vittorio Pippi, Silvia Cascianelli, Alessio Tonioni, Rita Cucchiara
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.23240
Pdf URL: https://arxiv.org/pdf/2510.23240
Copy Paste: [[2510.23240]] Autoregressive Styled Text Image Generation, but Make it Reliable(https://arxiv.org/abs/2510.23240)
Keywords: generation
Abstract: Generating faithful and readable styled text images (especially for Styled Handwritten Text generation - HTG) is an open problem with several possible applications across graphic design, document understanding, and image editing. A lot of research effort in this task is dedicated to developing strategies that reproduce the stylistic characteristics of a given writer, with promising results in terms of style fidelity and generalization achieved by the recently proposed Autoregressive Transformer paradigm for HTG. However, this method requires additional inputs, lacks a proper stop mechanism, and might end up in repetition loops, generating visual artifacts. In this work, we rethink the autoregressive formulation by framing HTG as a multimodal prompt-conditioned generation task, and tackle the content controllability issues by introducing special textual input tokens for better alignment with the visual ones. Moreover, we devise a Classifier-Free-Guidance-based strategy for our autoregressive model. Through extensive experimental validation, we demonstrate that our approach, dubbed Eruku, compared to previous solutions requires fewer inputs, generalizes better to unseen styles, and follows more faithfully the textual prompt, improving content adherence.
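
Classifier-free guidance for an autoregressive model is usually applied to the next-token logits. The sketch below shows that standard formulation, under the assumption that the model can be run once with the prompt and once without it; the abstract does not give Eruku's exact guidance rule, and the function names and values here are illustrative.

```python
import math


def guided_logits(cond: list[float], uncond: list[float], scale: float) -> list[float]:
    """Standard classifier-free guidance on next-token logits.

    scale = 0 gives the unconditional logits, scale = 1 the conditional
    ones, and scale > 1 extrapolates away from the unconditional logits.
    """
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits over a two-token vocabulary.
cond, uncond = [2.0, 0.0], [1.0, 0.0]
p_plain = softmax(guided_logits(cond, uncond, 1.0))[0]
p_guided = softmax(guided_logits(cond, uncond, 2.0))[0]
```

Raising the scale increases the probability of tokens that the prompt favours, which is the mechanism by which guidance can improve adherence to the textual content.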

Title: A Novel Framework for Multi-Modal Protein Representation Learning

Authors: Runjie Zheng, Zhen Wang, Anjie Qiao, Jiancong Xie, Jiahua Rao, Yuedong Yang
Subjects: cs.LG, cs.AI, q-bio.QM
Abstract URL: https://arxiv.org/abs/2510.23273
Pdf URL: https://arxiv.org/pdf/2510.23273
Copy Paste: [[2510.23273]] A Novel Framework for Multi-Modal Protein Representation Learning(https://arxiv.org/abs/2510.23273)
Keywords: generation
Abstract: Accurate protein function prediction requires integrating heterogeneous intrinsic signals (e.g., sequence and structure) with noisy extrinsic contexts (e.g., protein-protein interactions and GO term annotations). However, two key challenges hinder effective fusion: (i) cross-modal distributional mismatch among embeddings produced by pre-trained intrinsic encoders, and (ii) noisy relational graphs of extrinsic data that degrade GNN-based information aggregation. We propose Diffused and Aligned Multi-modal Protein Embedding (DAMPE), a unified framework that addresses these through two core mechanisms. First, we propose Optimal Transport (OT)-based representation alignment that establishes correspondence between intrinsic embedding spaces of different modalities, effectively mitigating cross-modal heterogeneity. Second, we develop a Conditional Graph Generation (CGG)-based information fusion method, where a condition encoder fuses the aligned intrinsic embeddings to provide informative cues for graph reconstruction. Meanwhile, our theoretical analysis implies that the CGG objective drives this condition encoder to absorb graph-aware knowledge into its produced protein representations. Empirically, DAMPE outperforms or matches state-of-the-art methods such as DPFunc on standard GO benchmarks, achieving AUPR gains of 0.002-0.013 pp and Fmax gains of 0.004-0.007 pp. Ablation studies further show that OT-based alignment contributes 0.043-0.064 pp AUPR, while CGG-based fusion adds 0.005-0.111 pp Fmax. Overall, DAMPE offers a scalable and theoretically grounded approach for robust multi-modal protein representation learning, substantially enhancing protein function prediction.
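
An optimal transport plan between two sets of embeddings can be approximated with Sinkhorn iterations. The sketch below uses entropy-regularised OT as a stand-in, since the abstract does not say which OT solver or cost DAMPE uses; the cost matrix, marginals, and regularisation strength are hypothetical.

```python
import math


def sinkhorn(cost, a, b, eps=0.5, iters=500):
    """Entropy-regularised OT plan between histograms a and b.

    cost[i][j] is the cost of moving mass from source i to target j.
    Returns a matrix whose rows sum to a and whose columns sum to b.
    """
    n, m = len(a), len(b)
    kernel = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


# Hypothetical 2x2 problem: matching items are cheap to pair.
plan = sinkhorn(cost=[[0.0, 1.0], [1.0, 0.0]], a=[0.7, 0.3], b=[0.4, 0.6])
```

The resulting plan gives a soft correspondence between the two embedding sets, which can then be used to map one space onto the other.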

Title: Adaptive Stochastic Coefficients for Accelerating Diffusion Sampling

Authors: Ruoyu Wang, Beier Zhu, Junzhi Li, Liangyu Yuan, Chi Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.23285
Pdf URL: https://arxiv.org/pdf/2510.23285
Copy Paste: [[2510.23285]] Adaptive Stochastic Coefficients for Accelerating Diffusion Sampling(https://arxiv.org/abs/2510.23285)
Keywords: generative
Abstract: Diffusion-based generative processes, formulated as differential equation solving, frequently balance computational speed with sample quality. Our theoretical investigation of ODE- and SDE-based solvers reveals complementary weaknesses: ODE solvers accumulate irreducible gradient error along deterministic trajectories, while SDE methods suffer from amplified discretization errors when the step budget is limited. Building upon this insight, we introduce AdaSDE, a novel single-step SDE solver that aims to unify the efficiency of ODEs with the error resilience of SDEs. Specifically, we introduce a single per-step learnable coefficient, estimated via lightweight distillation, which dynamically regulates the error correction strength to accelerate diffusion sampling. Notably, our framework can be integrated with existing solvers to enhance their capabilities. Extensive experiments demonstrate state-of-the-art performance: at 5 NFE, AdaSDE achieves FID scores of 4.18 on CIFAR-10, 8.05 on FFHQ and 6.96 on LSUN Bedroom. Codes are available in this https URL.

Title: ReconViaGen: Towards Accurate Multi-view 3D Object Reconstruction via Generation

Authors: Jiahao Chang, Chongjie Ye, Yushuang Wu, Yuantao Chen, Yidan Zhang, Zhongjin Luo, Chenghong Li, Yihao Zhi, Xiaoguang Han
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2510.23306
Pdf URL: https://arxiv.org/pdf/2510.23306
Copy Paste: [[2510.23306]] ReconViaGen: Towards Accurate Multi-view 3D Object Reconstruction via Generation(https://arxiv.org/abs/2510.23306)
Keywords: generation, generative
Abstract: Existing multi-view 3D object reconstruction methods heavily rely on sufficient overlap between input views, where occlusions and sparse coverage in practice frequently yield severe reconstruction incompleteness. Recent advancements in diffusion-based 3D generative techniques offer the potential to address these limitations by leveraging learned generative priors to hallucinate invisible parts of objects, thereby generating plausible 3D structures. However, the stochastic nature of the inference process limits the accuracy and reliability of generation results, preventing existing reconstruction frameworks from integrating such 3D generative priors. In this work, we comprehensively analyze the reasons why diffusion-based 3D generative methods fail to achieve high consistency, including (a) the insufficiency in constructing and leveraging cross-view connections when extracting multi-view image features as conditions, and (b) the poor controllability of iterative denoising during local detail generation, which easily leads to fine geometric and texture details that are plausible but inconsistent with the inputs. Accordingly, we propose ReconViaGen to innovatively integrate reconstruction priors into the generative framework and devise several strategies that effectively address these issues. Extensive experiments demonstrate that our ReconViaGen can reconstruct complete and accurate 3D models consistent with input views in both global structure and local details. Project page: this https URL.

Title: Robust Non-negative Proximal Gradient Algorithm for Inverse Problems

Authors: Hanzhang Wang, Zonglin Liu, Jingyi Xu, Chenyang Wang, Zhiwei Zhong, Qiangqiang Shen
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2510.23362
Pdf URL: https://arxiv.org/pdf/2510.23362
Copy Paste: [[2510.23362]] Robust Non-negative Proximal Gradient Algorithm for Inverse Problems(https://arxiv.org/abs/2510.23362)
Keywords: restoration
Abstract: Proximal gradient algorithms (PGA), while foundational for inverse problems like image reconstruction, often yield unstable convergence and suboptimal solutions by violating the critical non-negativity constraint. We identify the gradient descent step as the root cause of this issue, which introduces negative values and induces high sensitivity to hyperparameters. To overcome these limitations, we propose a novel multiplicative update proximal gradient algorithm (SSO-PGA) with convergence guarantees, which is designed for robustness in non-negative inverse problems. Our key innovation lies in superseding the gradient descent step with a learnable sigmoid-based operator, which inherently enforces non-negativity and boundedness by transforming traditional subtractive updates into multiplicative ones. This design, augmented by a sliding parameter for enhanced stability and convergence, not only improves robustness but also boosts expressive capacity and noise immunity. We further formulate a degradation model for multi-modal restoration and derive its SSO-PGA-based optimization algorithm, which is then unfolded into a deep network to marry the interpretability of optimization with the power of deep learning. Extensive numerical and real-world experiments demonstrate that our method significantly surpasses traditional PGA and other state-of-the-art algorithms, ensuring superior performance and stability.
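
The contrast between subtractive and multiplicative updates can be seen on a small non-negative least-squares problem. The sketch below swaps in the classical Lee-Seung style multiplicative update, not the learnable sigmoid-based SSO-PGA operator, whose form the abstract does not give; the matrix, data, and step size are hypothetical.

```python
def matvec(m, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(mij * xj for mij, xj in zip(row, x)) for row in m]


def transpose(m):
    """Transpose of a list-of-rows matrix."""
    return [list(col) for col in zip(*m)]


def gradient_step(a, b, x, step):
    """Subtractive update x - step * A^T (A x - b); entries can turn negative."""
    residual = [r - bi for r, bi in zip(matvec(a, x), b)]
    grad = matvec(transpose(a), residual)
    return [xi - step * gi for xi, gi in zip(x, grad)]


def multiplicative_step(a, b, x, tiny=1e-12):
    """Classical update x * (A^T b) / (A^T A x); stays >= 0 for non-negative A, b, x."""
    at = transpose(a)
    numer = matvec(at, b)
    denom = matvec(at, matvec(a, x))
    return [xi * ni / (di + tiny) for xi, ni, di in zip(x, numer, denom)]


# Hypothetical problem data.
A = [[1.0, 0.5], [0.5, 1.0]]
b = [1.0, 0.2]
x0 = [0.5, 2.0]
x_sub = gradient_step(A, b, x0, step=0.5)
x_mul = multiplicative_step(A, b, x0)
```

With these values the gradient step pushes the first entry below zero, whereas the multiplicative step only rescales each entry by a positive factor and so cannot change its sign.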

Title: Symbolic Neural Generation with Applications to Lead Discovery in Drug Design

Authors: Ashwin Srinivasan, A Baskar, Tirtharaj Dash, Michael Bain, Sanjay Kumar Dey, Mainak Banerjee
Subjects: cs.LG, cs.AI, cs.NE, q-bio.BM
Abstract URL: https://arxiv.org/abs/2510.23379
Pdf URL: https://arxiv.org/pdf/2510.23379
Copy Paste: [[2510.23379]] Symbolic Neural Generation with Applications to Lead Discovery in Drug Design(https://arxiv.org/abs/2510.23379)
Keywords: generation
Abstract: We investigate a relatively underexplored class of hybrid neurosymbolic models integrating symbolic learning with neural reasoning to construct data generators meeting formal correctness criteria. In Symbolic Neural Generators (SNGs), symbolic learners examine logical specifications of feasible data from a small set of instances -- sometimes just one. Each specification in turn constrains the conditional information supplied to a neural-based generator, which rejects any instance violating the symbolic specification. Like other neurosymbolic approaches, SNG exploits the complementary strengths of symbolic and neural methods. The outcome of an SNG is a triple $(H, X, W)$, where $H$ is a symbolic description of feasible instances constructed from data, $X$ a set of generated new instances that satisfy the description, and $W$ an associated weight. We introduce a semantics for such systems, based on the construction of appropriate base and fibre partially-ordered sets combined into an overall partial order, and outline a probabilistic extension relevant to practical applications. In this extension, SNGs result from searching over a weighted partial ordering. We implement an SNG combining a restricted form of Inductive Logic Programming (ILP) with a large language model (LLM) and evaluate it on early-stage drug design. Our main interest is the description and the set of potential inhibitor molecules generated by the SNG. On benchmark problems -- where drug targets are well understood -- SNG performance is statistically comparable to state-of-the-art methods. On exploratory problems with poorly understood targets, generated molecules exhibit binding affinities on par with leading clinical candidates. Experts further find the symbolic specifications useful as preliminary filters, with several generated molecules identified as viable for synthesis and wet-lab testing.

Title: An Efficient Remote Sensing Super Resolution Method Exploring Diffusion Priors and Multi-Modal Constraints for Crop Type Mapping

Authors: Songxi Yang, Tang Sui, Qunying Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.23382
Pdf URL: https://arxiv.org/pdf/2510.23382
Copy Paste: [[2510.23382]] An Efficient Remote Sensing Super Resolution Method Exploring Diffusion Priors and Multi-Modal Constraints for Crop Type Mapping(https://arxiv.org/abs/2510.23382)
Keywords: generative
Abstract: Super resolution offers a way to harness medium- or even low-resolution but historically valuable remote sensing image archives. Generative models, especially diffusion models, have recently been applied to remote sensing super resolution (RSSR), yet several challenges exist. First, diffusion models are effective but require expensive resources for training from scratch and have slow inference speeds. Second, current methods make limited use of auxiliary information as real-world constraints to reconstruct scientifically realistic images. Finally, most current methods lack evaluation on downstream tasks. In this study, we present an efficient LSSR framework for RSSR, supported by a new multimodal dataset of paired 30 m Landsat 8 and 10 m Sentinel 2 imagery. Built on frozen pretrained Stable Diffusion, LSSR integrates cross-modal attention with auxiliary knowledge (Digital Elevation Model, land cover, month) and Synthetic Aperture Radar guidance, enhanced by adapters and a tailored Fourier NDVI loss to balance spatial details and spectral fidelity. Extensive experiments demonstrate that LSSR significantly improves crop boundary delineation and recovery, achieving state-of-the-art performance with Peak Signal-to-Noise Ratio/Structural Similarity Index Measure of 32.63/0.84 (RGB) and 23.99/0.78 (IR), and the lowest NDVI Mean Squared Error (0.042), while maintaining efficient inference (0.39 sec/image). Moreover, LSSR transfers effectively to NASA Harmonized Landsat and Sentinel (HLS) super resolution, yielding more reliable crop classification (F1: 0.86) than Sentinel-2 (F1: 0.85). These results highlight the potential of RSSR to advance precision agriculture.
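
Two of the reported metrics, PSNR and NDVI mean squared error, follow from standard definitions and can be sketched in a few lines. The pixel values below are hypothetical reflectances in [0, 1]; the paper's Fourier NDVI loss is not reproduced here, because the abstract does not define it.

```python
import math


def ndvi(nir: float, red: float, tiny: float = 1e-12) -> float:
    """Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red)."""
    return (nir - red) / (nir + red + tiny)


def mse(pred: list[float], target: list[float]) -> float:
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def psnr(pred: list[float], target: list[float], max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB for signals bounded by max_value."""
    err = mse(pred, target)
    return math.inf if err == 0 else 10.0 * math.log10(max_value ** 2 / err)


# Hypothetical two-pixel example: (NIR, Red) pairs per pixel.
pred_ndvi = [ndvi(0.5, 0.1), ndvi(0.4, 0.2)]
true_ndvi = [ndvi(0.6, 0.1), ndvi(0.4, 0.2)]
ndvi_mse = mse(pred_ndvi, true_ndvi)
```

Computing the error on NDVI rather than on raw bands measures whether the super-resolved image preserves the vegetation signal that crop mapping depends on.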

Title: The Best of N Worlds: Aligning Reinforcement Learning with Best-of-N Sampling via max@k Optimisation

Authors: Farid Bagirov, Mikhail Arkhipov, Ksenia Sycheva, Evgeniy Glukhov, Egor Bogomolov
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2510.23393
Pdf URL: https://arxiv.org/pdf/2510.23393
Copy Paste: [[2510.23393]] The Best of N Worlds: Aligning Reinforcement Learning with Best-of-N Sampling via max@k Optimisation(https://arxiv.org/abs/2510.23393)
Keywords: generation
Abstract: The application of Reinforcement Learning with Verifiable Rewards (RLVR) to mathematical and coding domains has demonstrated significant improvements in the reasoning and problem-solving abilities of Large Language Models. Despite its success in single-generation problem solving, the reinforcement learning fine-tuning process may harm the model's exploration ability, as reflected in decreased diversity of generations and a resulting degradation of performance during Best-of-N sampling for large N values. In this work, we focus on optimizing the max@k metric, a continuous generalization of pass@k. We derive an unbiased on-policy gradient estimate for direct optimization of this metric. Furthermore, we extend our derivations to off-policy updates, a common element in modern RLVR algorithms that allows better sample efficiency. Empirically, we show that our objective effectively optimizes the max@k metric in off-policy scenarios, aligning the model with the Best-of-N inference strategy.
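
The max@k metric can be estimated from n sampled rewards with a combinatorial formula, in the same way pass@k is estimated from binary outcomes. The sketch below is the standard unbiased estimator of the expected maximum over k of the n samples; it is an assumption that the paper uses this exact form, and the gradient estimator itself is not shown.

```python
from math import comb


def max_at_k(rewards: list[float], k: int) -> float:
    """Unbiased estimate of E[max of k draws] from n >= k sampled rewards.

    Averages the maximum over all size-k subsets without enumerating them.
    """
    n = len(rewards)
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    ordered = sorted(rewards)
    # The i-th smallest reward (1-based) is the maximum of comb(i-1, k-1) subsets.
    weighted = sum(comb(i - 1, k - 1) * r for i, r in enumerate(ordered, start=1))
    return weighted / comb(n, k)


# With binary rewards the estimate reduces to the usual pass@k formula.
continuous = max_at_k([0.1, 0.9, 0.4, 0.7], k=2)
binary = max_at_k([1, 0, 0, 1, 0], k=2)
```

For binary rewards with c successes among n samples, the result equals 1 - C(n-c, k) / C(n, k), which is why max@k is described as a continuous generalization of pass@k.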

Title: Mixed Precision Training of Neural ODEs

Authors: Elena Celledoni, Brynjulf Owren, Lars Ruthotto, Tianjiao Nicole Yang
Subjects: cs.LG, cs.AI, math.NA
Abstract URL: https://arxiv.org/abs/2510.23498
Pdf URL: https://arxiv.org/pdf/2510.23498
Copy Paste: [[2510.23498]] Mixed Precision Training of Neural ODEs(https://arxiv.org/abs/2510.23498)
Keywords: generative
Abstract: Exploiting low-precision computations has become a standard strategy in deep learning to address the growing computational costs imposed by ever larger models and datasets. However, naively performing all computations in low precision can lead to roundoff errors and instabilities. Therefore, mixed precision training schemes usually store the weights in high precision and use low-precision computations only for whitelisted operations. Despite their success, these principles are currently not reliable for training continuous-time architectures such as neural ordinary differential equations (Neural ODEs). This paper presents a mixed precision training framework for neural ODEs, combining explicit ODE solvers with a custom backpropagation scheme, and demonstrates its effectiveness across a range of learning tasks. Our scheme uses low-precision computations for evaluating the velocity, parameterized by the neural network, and for storing intermediate states, while stability is provided by a custom dynamic adjoint scaling and by accumulating the solution and gradients in higher precision. These contributions address two key challenges in training neural ODEs: the computational cost of repeated network evaluations and the growth of memory requirements with the number of time steps or layers. Along with the paper, we publish our extendable, open-source PyTorch package rampde, whose syntax resembles that of leading packages to provide a drop-in replacement in existing codes. We demonstrate the reliability and effectiveness of our scheme using challenging test cases and on neural ODE applications in image classification and generative models, achieving approximately 50% memory reduction and up to 2x speedup while maintaining accuracy comparable to single-precision training.
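
Two of the effects this abstract relies on, stalled accumulation in low precision and underflow of small gradients, can be reproduced with the standard library alone. This is a minimal sketch that emulates half precision through the `struct` module's `"e"` format; it does not use the rampde package, and the step count, increment, and scale factor are hypothetical. The fixed scale only illustrates the general idea of scaling and is not the paper's dynamic adjoint scaling.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def accumulate(increment: float, steps: int, low_precision_sum: bool) -> float:
    """Add a half-precision increment `steps` times.

    The running sum is kept either in half precision or in float64.
    """
    total = 0.0
    inc = to_fp16(increment)
    for _ in range(steps):
        total = total + inc
        if low_precision_sum:
            total = to_fp16(total)
    return total


# Many small steps, as in an explicit ODE solver with a small step size.
low = accumulate(1e-4, 10_000, low_precision_sum=True)
high = accumulate(1e-4, 10_000, low_precision_sum=False)

# A gradient too small for half precision survives if scaled first.
tiny_grad = 1e-8
lost = to_fp16(tiny_grad)
scale = 1024.0
recovered = to_fp16(tiny_grad * scale) / scale
```

The half-precision sum stops growing once each increment falls below half the spacing between representable values, while the higher-precision accumulator reaches the expected total of about 1.0.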

Title: FreeFuse: Multi-Subject LoRA Fusion via Auto Masking at Test Time

Authors: Yaoli Liu, Yao-Xiang Ding, Kun Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.23515
Pdf URL: https://arxiv.org/pdf/2510.23515
Copy Paste: [[2510.23515]] FreeFuse: Multi-Subject LoRA Fusion via Auto Masking at Test Time(https://arxiv.org/abs/2510.23515)
Keywords: generation
Abstract: This paper proposes FreeFuse, a novel training-free approach for multi-subject text-to-image generation through automatic fusion of multiple subject LoRAs. In contrast to existing methods that either focus on pre-inference LoRA weight merging or rely on segmentation models and complex techniques like noise blending to isolate LoRA outputs, our key insight is that context-aware dynamic subject masks can be automatically derived from cross-attention layer weights. Mathematical analysis shows that directly applying these masks to LoRA outputs during inference well approximates the case where the subject LoRA is integrated into the diffusion model and used individually for the masked region. FreeFuse demonstrates superior practicality and efficiency as it requires no additional training, no modification to LoRAs, no auxiliary models, and no user-defined prompt templates or region specifications. Instead, it only requires users to provide the LoRA activation words for seamless integration into standard workflows. Extensive experiments validate that FreeFuse outperforms existing approaches in both generation quality and usability under the multi-subject generation tasks. The project page is at this https URL
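
The idea of masking LoRA outputs by subject can be sketched on a toy one-dimensional feature map. The abstract does not say how the masks are derived from the cross-attention weights, so the per-position argmax below is an assumption, and all names and values are hypothetical.

```python
def subject_masks(attention: list[list[float]]) -> list[list[int]]:
    """Build one hard mask per subject.

    attention[s][p] is the cross-attention mass that subject s's
    activation word receives at spatial position p. Each position is
    assigned to the subject with the largest mass (an assumed rule).
    """
    n_subjects, n_pos = len(attention), len(attention[0])
    winners = [max(range(n_subjects), key=lambda s: attention[s][p]) for p in range(n_pos)]
    return [[1 if winners[p] == s else 0 for p in range(n_pos)] for s in range(n_subjects)]


def fuse(base: list[float], lora_outputs: list[list[float]], masks: list[list[int]]) -> list[float]:
    """Add each subject's LoRA output only inside that subject's mask."""
    fused = list(base)
    for delta, mask in zip(lora_outputs, masks):
        fused = [f + m * d for f, m, d in zip(fused, mask, delta)]
    return fused


# Two subjects, three positions.
attention = [[0.9, 0.2, 0.6], [0.1, 0.8, 0.4]]
masks = subject_masks(attention)
fused = fuse([1.0, 1.0, 1.0], [[0.5, 0.5, 0.5], [-0.2, -0.2, -0.2]], masks)
```

Because every position receives the update of exactly one LoRA, the subjects do not overwrite each other's regions, which is the effect the masking is meant to achieve.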

Title: More Than Generation: Unifying Generation and Depth Estimation via Text-to-Image Diffusion Models

Authors: Hongkai Lin, Dingkang Liang, Mingyang Du, Xin Zhou, Xiang Bai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.23574
Pdf URL: https://arxiv.org/pdf/2510.23574
Copy Paste: [[2510.23574]] More Than Generation: Unifying Generation and Depth Estimation via Text-to-Image Diffusion Models(https://arxiv.org/abs/2510.23574)
Keywords: generation, generative
Abstract: Generative depth estimation methods leverage the rich visual priors stored in pre-trained text-to-image diffusion models, demonstrating astonishing zero-shot capability. However, parameter updates during training lead to catastrophic degradation in the image generation capability of the pre-trained model. We introduce MERGE, a unified model for image generation and depth estimation, starting from a fixed pre-trained text-to-image model. MERGE demonstrates that the pre-trained text-to-image model can do more than image generation and can also be extended to depth estimation effortlessly. Specifically, MERGE introduces a play-and-plug framework that enables seamless switching between image generation and depth estimation modes through simple and pluggable converters. Meanwhile, we propose a Group Reuse Mechanism to encourage parameter reuse and improve the utilization of the additional learnable parameters. MERGE unleashes the powerful depth estimation capability of the pre-trained text-to-image model while preserving its original image generation ability. Compared to other unified models for image generation and depth estimation, MERGE achieves state-of-the-art performance across multiple depth estimation benchmarks. The code will be made available at this https URL
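Summary: Generative depth estimation methods exploit the rich visual priors stored in pre-trained text-to-image diffusion models and show striking zero-shot capability. However, parameter updates during training cause a catastrophic decline in the image generation ability of the pre-trained model. We introduce MERGE, a unified model for image generation and depth estimation that starts from a fixed pre-trained text-to-image model. MERGE shows that a pre-trained text-to-image model can not only generate images but can also be extended to depth estimation with little effort. Specifically, MERGE introduces a pluggable framework that switches seamlessly between image generation and depth estimation modes through simple, pluggable converters. We also propose a Group Reuse Mechanism to encourage parameter reuse and improve the utilization of the additional learnable parameters. MERGE unlocks the strong depth estimation capability of the pre-trained text-to-image model while preserving its original image generation ability. Compared with other unified models for image generation and depth estimation, MERGE achieves state-of-the-art performance on multiple depth estimation benchmarks. The code will be made available at this https URL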

Title: Lookahead Anchoring: Preserving Character Identity in Audio-Driven Human Animation

Authors: Junyoung Seo, Rodrigo Mira, Alexandros Haliassos, Stella Bounareli, Honglie Chen, Linh Tran, Seungryong Kim, Zoe Landgraf, Jie Shen
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2510.23581
Pdf URL: https://arxiv.org/pdf/2510.23581
Copy Paste: [[2510.23581]] Lookahead Anchoring: Preserving Character Identity in Audio-Driven Human Animation(https://arxiv.org/abs/2510.23581)
Keywords: generation
Abstract: Audio-driven human animation models often suffer from identity drift during temporal autoregressive generation, where characters gradually lose their identity over time. One solution is to generate keyframes as intermediate temporal anchors that prevent degradation, but this requires an additional keyframe generation stage and can restrict natural motion dynamics. To address this, we propose Lookahead Anchoring, which leverages keyframes from future timesteps ahead of the current generation window, rather than within it. This transforms keyframes from fixed boundaries into directional beacons: the model continuously pursues these future anchors while responding to immediate audio cues, maintaining consistent identity through persistent guidance. This also enables self-keyframing, where the reference image serves as the lookahead target, eliminating the need for keyframe generation entirely. We find that the temporal lookahead distance naturally controls the balance between expressivity and consistency: larger distances allow for greater motion freedom, while smaller ones strengthen identity adherence. When applied to three recent human animation models, Lookahead Anchoring achieves superior lip synchronization, identity preservation, and visual quality, demonstrating improved temporal conditioning across several different architectures. Video results are available at the following link: this https URL.
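Summary: Audio-driven human animation models often suffer from identity drift during temporal autoregressive generation, where characters gradually lose their identity over time. One solution is to generate keyframes as intermediate temporal anchors that prevent degradation, but this requires an additional keyframe generation stage and can restrict natural motion dynamics. To address this, we propose Lookahead Anchoring, which uses keyframes from future timesteps that lie ahead of the current generation window rather than inside it. This turns keyframes from fixed boundaries into directional beacons: the model continuously pursues these future anchors while responding to immediate audio cues, and the persistent guidance keeps identity consistent. It also enables self-keyframing, where the reference image serves as the lookahead target, which removes the need for keyframe generation entirely. We find that the temporal lookahead distance naturally controls the balance between expressivity and consistency: larger distances allow more freedom of motion, while smaller ones strengthen identity adherence. Applied to three recent human animation models, Lookahead Anchoring achieves superior lip synchronization, identity preservation, and visual quality, demonstrating improved temporal conditioning across several different architectures. Video results are available at the following link: this https URL.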

Title: FARMER: Flow AutoRegressive Transformer over Pixels

Authors: Guangting Zheng, Qinyu Zhao, Tao Yang, Fei Xiao, Zhijie Lin, Jie Wu, Jiajun Deng, Yanyong Zhang, Rui Zhu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.23588
Pdf URL: https://arxiv.org/pdf/2510.23588
Copy Paste: [[2510.23588]] FARMER: Flow AutoRegressive Transformer over Pixels(https://arxiv.org/abs/2510.23588)
Keywords: generation, generative
Abstract: Directly modeling the explicit likelihood of the raw data distribution is a key topic in machine learning and, through autoregressive modeling, has achieved scaling successes in Large Language Models. However, continuous AR modeling over visual pixel data suffers from extremely long sequences and high-dimensional spaces. In this paper, we present FARMER, a novel end-to-end generative framework that unifies Normalizing Flows (NF) and Autoregressive (AR) models for tractable likelihood estimation and high-quality image synthesis directly from raw pixels. FARMER employs an invertible autoregressive flow to transform images into latent sequences, whose distribution is modeled implicitly by an autoregressive model. To address the redundancy and complexity in pixel-level modeling, we propose a self-supervised dimension reduction scheme that partitions NF latent channels into informative and redundant groups, enabling more effective and efficient AR modeling. Furthermore, we design a one-step distillation scheme to significantly accelerate inference speed and introduce a resampling-based classifier-free guidance algorithm to boost image generation quality. Extensive experiments demonstrate that FARMER achieves competitive performance compared to existing pixel-based generative models while providing exact likelihoods and scalable training.
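Summary: Directly modeling the explicit likelihood of the raw data distribution is a key topic in machine learning, and autoregressive modeling of this kind has achieved scaling successes in Large Language Models. However, continuous AR modeling over visual pixel data faces extremely long sequences and high-dimensional spaces. In this paper, we present FARMER, a novel end-to-end generative framework that unifies Normalizing Flows (NF) and Autoregressive (AR) models for tractable likelihood estimation and high-quality image synthesis directly from raw pixels. FARMER uses an invertible autoregressive flow to transform images into latent sequences, whose distribution is modeled implicitly by an autoregressive model. To address the redundancy and complexity of pixel-level modeling, we propose a self-supervised dimension reduction scheme that partitions the NF latent channels into informative and redundant groups, enabling more effective and efficient AR modeling. In addition, we design a one-step distillation scheme that significantly accelerates inference, and we introduce a resampling-based classifier-free guidance algorithm to improve image generation quality. Extensive experiments show that FARMER achieves competitive performance compared with existing pixel-based generative models while providing exact likelihoods and scalable training.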

Title: PRISM-Bench: A Benchmark of Puzzle-Based Visual Tasks with CoT Error Detection

Authors: Yusu Qian, Cheng Wan, Chao Jia, Yinfei Yang, Qingyu Zhao, Zhe Gan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.23594
Pdf URL: https://arxiv.org/pdf/2510.23594
Copy Paste: [[2510.23594]] PRISM-Bench: A Benchmark of Puzzle-Based Visual Tasks with CoT Error Detection(https://arxiv.org/abs/2510.23594)
Keywords: generation
Abstract: We introduce PRISM-Bench, a benchmark of puzzle-based visual challenges designed to evaluate not only whether models can solve problems, but how their reasoning unfolds. Unlike prior evaluations that measure only final-answer accuracy, PRISM-Bench introduces a diagnostic task: given a visual puzzle and a step-by-step chain-of-thought (CoT) containing exactly one error, models must identify the first incorrect step. This setting enables fine-grained assessment of logical consistency, error detection, and visual reasoning. The puzzles in PRISM-Bench require multi-step symbolic, geometric, and analogical reasoning, resisting shortcuts based on superficial pattern matching. Evaluations across state-of-the-art MLLMs reveal a persistent gap between fluent generation and faithful reasoning: models that produce plausible CoTs often fail to locate simple logical faults. By disentangling answer generation from reasoning verification, PRISM-Bench offers a sharper lens on multimodal reasoning competence and underscores the need for diagnostic evaluation protocols in the development of trustworthy MLLMs.
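Summary: We introduce PRISM-Bench, a benchmark of puzzle-based visual challenges designed to evaluate not only whether models can solve problems but also how their reasoning unfolds. Unlike prior evaluations that measure only final-answer accuracy, PRISM-Bench introduces a diagnostic task: given a visual puzzle and a step-by-step chain-of-thought (CoT) containing exactly one error, the model must identify the first incorrect step. This setting allows fine-grained assessment of logical consistency, error detection, and visual reasoning. The puzzles in PRISM-Bench require multi-step symbolic, geometric, and analogical reasoning and resist shortcuts based on superficial pattern matching. Evaluations of state-of-the-art MLLMs reveal a persistent gap between fluent generation and faithful reasoning: models that produce plausible CoTs often fail to locate simple logical faults. By separating answer generation from reasoning verification, PRISM-Bench offers a sharper view of multimodal reasoning competence and underscores the need for diagnostic evaluation protocols in the development of trustworthy MLLMs.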
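
The diagnostic task described above (locating the first incorrect step in a chain-of-thought) suggests a simple exact-match score. The sketch below is an assumption for illustration: the abstract does not specify the benchmark's data format or official metric, and the function name and 1-based step indices are hypothetical.

```python
from typing import Sequence


def first_error_accuracy(predicted: Sequence[int], gold: Sequence[int]) -> float:
    """Fraction of puzzles where the predicted first-error step matches gold.

    predicted[i] and gold[i] are step indices (1-based here, by assumption)
    for puzzle i. Returns 0.0 for an empty evaluation set.
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    return correct / len(gold)


# Toy example: the model locates the faulty step in 3 of 4 puzzles.
score = first_error_accuracy([2, 3, 1, 4], [2, 3, 2, 4])
```

An exact-match score like this gives no credit for a near miss, which fits a task defined as identifying the first incorrect step rather than any incorrect step.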

Title: Track, Inpaint, Resplat: Subject-driven 3D and 4D Generation with Progressive Texture Infilling

Authors: Shuhong Zheng, Ashkan Mirzaei, Igor Gilitschenski
Subjects: cs.CV, cs.AI, cs.GR, cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2510.23605
Pdf URL: https://arxiv.org/pdf/2510.23605
Copy Paste: [[2510.23605]] Track, Inpaint, Resplat: Subject-driven 3D and 4D Generation with Progressive Texture Infilling(https://arxiv.org/abs/2510.23605)
Keywords: generation, generative
Abstract: Current 3D/4D generation methods are usually optimized for photorealism, efficiency, and aesthetics. However, they often fail to preserve the semantic identity of the subject across different viewpoints. Adapting generation methods with one or a few images of a specific subject (also known as Personalization or Subject-driven generation) allows generating visual content that aligns with the identity of the subject. However, personalized 3D/4D generation is still largely underexplored. In this work, we introduce TIRE (Track, Inpaint, REsplat), a novel method for subject-driven 3D/4D generation. It takes an initial 3D asset produced by an existing 3D generative model as input and uses video tracking to identify the regions that need to be modified. Then, we adopt a subject-driven 2D inpainting model for progressively infilling the identified regions. Finally, we resplat the modified 2D multi-view observations back to 3D while still maintaining consistency. Extensive experiments demonstrate that our approach significantly improves identity preservation in 3D/4D generation compared to state-of-the-art methods. Our project website is available at this https URL.
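Summary: Current 3D/4D generation methods are usually optimized for photorealism, efficiency, and aesthetics. However, they often fail to preserve the semantic identity of the subject across different viewpoints. Adapting generation methods with one or a few images of a specific subject (also known as personalization or subject-driven generation) makes it possible to generate visual content that matches the subject's identity. However, personalized 3D/4D generation remains largely underexplored. In this work, we introduce TIRE (Track, Inpaint, REsplat), a novel method for subject-driven 3D/4D generation. It takes an initial 3D asset produced by an existing 3D generative model as input and uses video tracking to identify the regions that need to be modified. We then adopt a subject-driven 2D inpainting model to progressively fill in the identified regions. Finally, we resplat the modified 2D multi-view observations back to 3D while maintaining consistency. Extensive experiments show that our approach significantly improves identity preservation in 3D/4D generation compared with state-of-the-art methods. Our project website is available at this https URL.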

Title: Variational Masked Diffusion Models

Authors: Yichi Zhang, Alex Schwing, Zhizhen Zhao
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2510.23606
Pdf URL: https://arxiv.org/pdf/2510.23606
Copy Paste: [[2510.23606]] Variational Masked Diffusion Models(https://arxiv.org/abs/2510.23606)
Keywords: generation, generative
Abstract: Masked diffusion models have recently emerged as a flexible framework for discrete generative modeling. However, a key limitation of standard masked diffusion is its inability to effectively capture dependencies among tokens that are predicted concurrently, leading to degraded generation quality when dependencies among tokens are important. To explicitly model dependencies among tokens, we propose Variational Masked Diffusion (VMD), a framework that introduces latent variables into the masked diffusion process. Through controlled experiments on synthetic datasets, we demonstrate that VMD successfully learns dependencies that conventional masked diffusion fails to capture. We further validate the effectiveness of our approach on Sudoku puzzles and text datasets, where learning of dependencies among tokens improves global consistency. Across these domains, VMD enhances both generation quality and dependency awareness, highlighting the value of integrating variational inference into masked diffusion. Our code is available at: this https URL.
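Summary: Masked diffusion models have recently emerged as a flexible framework for discrete generative modeling. However, a key limitation of standard masked diffusion is that it cannot effectively capture dependencies among tokens that are predicted concurrently, which degrades generation quality when those dependencies matter. To model dependencies among tokens explicitly, we propose Variational Masked Diffusion (VMD), a framework that introduces latent variables into the masked diffusion process. Through controlled experiments on synthetic datasets, we show that VMD successfully learns dependencies that conventional masked diffusion fails to capture. We further validate the effectiveness of our approach on Sudoku puzzles and text datasets, where learning dependencies among tokens improves global consistency. Across these domains, VMD improves both generation quality and dependency awareness, highlighting the value of integrating variational inference into masked diffusion. Our code is available at: this https URL.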