2025-09-09

Title: A Dataset Generation Scheme Based on Video2EEG-SPGN-Diffusion for SEED-VD

Authors: Yunfei Guo, Tao Zhang, Wu Huang, Yao Song
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2509.05321
Pdf URL: https://arxiv.org/pdf/2509.05321
Copy Paste: [[2509.05321]] A Dataset Generation Scheme Based on Video2EEG-SPGN-Diffusion for SEED-VD(https://arxiv.org/abs/2509.05321)
Keywords: generation
Abstract: This paper introduces an open-source framework, Video2EEG-SPGN-Diffusion, that leverages the SEED-VD dataset to generate a multimodal dataset of EEG signals conditioned on video stimuli. Additionally, we disclose an engineering pipeline for aligning video and EEG data pairs, facilitating the training of multimodal large models with EEG alignment capabilities. Personalized EEG signals are generated using a self-play graph network (SPGN) integrated with a diffusion model. As a major contribution, we release a new dataset comprising over 1000 samples of SEED-VD video stimuli paired with generated 62-channel EEG signals at 200 Hz and emotion labels, enabling video-EEG alignment and advancing multimodal research. This framework offers novel tools for emotion analysis, data augmentation, and brain-computer interface applications, with substantial research and engineering significance.

Title: RT-VLM: Re-Thinking Vision Language Model with 4-Clues for Real-World Object Recognition Robustness

Authors: Junghyun Park, Tuan Anh Nguyen, Dugki Min
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2509.05333
Pdf URL: https://arxiv.org/pdf/2509.05333
Copy Paste: [[2509.05333]] RT-VLM: Re-Thinking Vision Language Model with 4-Clues for Real-World Object Recognition Robustness(https://arxiv.org/abs/2509.05333)
Keywords: generation
Abstract: Real-world deployments often expose modern object recognition models to domain shifts that precipitate a severe drop in accuracy. Such shifts encompass (i) variations in low-level image statistics, (ii) changes in object pose and viewpoint, (iii) partial occlusion, and (iv) visual confusion across adjacent classes. To mitigate this degradation, we introduce the Re-Thinking Vision Language Model (RT-VLM) framework. The foundation of this framework is a unique synthetic dataset generation pipeline that produces images annotated with "4-Clues": precise bounding boxes, class names, detailed object-level captions, and a comprehensive context-level caption for the entire scene. We then perform parameter-efficient supervised tuning of Llama 3.2 11B Vision Instruct on this resource. At inference time, a two-stage Re-Thinking scheme is executed: the model first emits its own four clues, then re-examines these responses as evidence and iteratively corrects them. Across robustness benchmarks that isolate individual domain shifts, RT-VLM consistently surpasses strong baselines. These findings indicate that the integration of structured multimodal evidence with an explicit self-critique loop constitutes a promising route toward reliable and transferable visual understanding.

Title: FAVAE-Effective Frequency Aware Latent Tokenizer

Authors: Tejaswini Medi, Hsien-Yi Wang, Arianna Rampini, Margret Keuper
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2509.05441
Pdf URL: https://arxiv.org/pdf/2509.05441
Copy Paste: [[2509.05441]] FAVAE-Effective Frequency Aware Latent Tokenizer(https://arxiv.org/abs/2509.05441)
Keywords: generation, generative
Abstract: Latent generative models have shown remarkable progress in high-fidelity image synthesis, typically using a two-stage training process that involves compressing images into latent embeddings via learned tokenizers in the first stage. The quality of generation strongly depends on how expressive and well-optimized these latent embeddings are. While various methods have been proposed to learn effective latent representations, the reconstructed images often lack realism, particularly in textured regions with sharp transitions, due to loss of fine details governed by high frequencies. We conduct a detailed frequency decomposition of existing state-of-the-art (SOTA) latent tokenizers and show that conventional objectives inherently prioritize low-frequency reconstruction, often at the expense of high-frequency fidelity. Our analysis reveals these latent tokenizers exhibit a bias toward low-frequency information, when jointly optimized, leading to over-smoothed outputs and visual artifacts that diminish perceptual quality. To address this, we propose a wavelet-based, frequency-aware variational autoencoder (FA-VAE) framework that explicitly decouples the optimization of low- and high-frequency components. This decoupling enables improved reconstruction of fine textures while preserving global structure. Our approach bridges the fidelity gap in current latent tokenizers and emphasizes the importance of frequency-aware optimization for realistic image representation, with broader implications for applications in content creation, neural rendering, and medical imaging.
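
A minimal sketch of the frequency-decoupling idea in the FA-VAE abstract, not the authors' implementation: it assumes a one-level 1-D Haar transform as the wavelet and a weighted sum of per-band mean squared errors as the objective. The function names and the weights `w_low` and `w_high` are hypothetical. The toy signals show that an over-smoothed reconstruction can have zero low-frequency error while losing all high-frequency detail.

```python
import math

def haar_step(signal):
    """One level of the orthonormal 1-D Haar transform (even-length input)."""
    s = 1.0 / math.sqrt(2.0)
    low = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    high = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return low, high

def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def decoupled_loss(target, recon, w_low=1.0, w_high=1.0):
    """Weighted sum of per-band errors; w_low and w_high are hypothetical knobs."""
    t_low, t_high = haar_step(target)
    r_low, r_high = haar_step(recon)
    return w_low * mse(t_low, r_low) + w_high * mse(t_high, r_high)

target = [0.0, 1.0, 0.0, 1.0]   # sharp, high-frequency texture
smooth = [0.5, 0.5, 0.5, 0.5]   # over-smoothed reconstruction

low_t, high_t = haar_step(target)
low_s, high_s = haar_step(smooth)
print("low-band error: ", mse(low_t, low_s))
print("high-band error:", mse(high_t, high_s))
print("loss with extra high-frequency weight:",
      decoupled_loss(target, smooth, w_high=4.0))
```

Separating the bands lets the high-frequency term be weighted or optimized on its own, so it is not dominated by the low-frequency term.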

Title: EditIDv2: Editable ID Customization with Data-Lubricated ID Feature Integration for Text-to-Image Generation

Authors: Guandong Li, Zhaobin Chu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.05659
Pdf URL: https://arxiv.org/pdf/2509.05659
Copy Paste: [[2509.05659]] EditIDv2: Editable ID Customization with Data-Lubricated ID Feature Integration for Text-to-Image Generation(https://arxiv.org/abs/2509.05659)
Keywords: generation
Abstract: We propose EditIDv2, a tuning-free solution specifically designed for high-complexity narrative scenes and long text inputs. Existing character editing methods perform well under simple prompts, but often suffer from degraded editing capabilities, semantic understanding biases, and identity consistency breakdowns when faced with long text narratives containing multiple semantic layers, temporal logic, and complex contextual relationships. In EditID, we analyzed the impact of the ID integration module on editability. In EditIDv2, we further explore and address the influence of the ID feature integration module. The core of EditIDv2 is to discuss the issue of editability injection under minimal data lubrication. Through a sophisticated decomposition of PerceiverAttention, the introduction of ID loss and joint dynamic training with the diffusion model, as well as an offline fusion strategy for the integration module, we achieve deep, multi-level semantic editing while maintaining identity consistency in complex narrative environments using only a small amount of data lubrication. This meets the demands of long prompts and high-quality image generation, and achieves excellent results in the IBench evaluation.

Title: Context-Aware Multi-Turn Visual-Textual Reasoning in LVLMs via Dynamic Memory and Adaptive Visual Guidance

Authors: Weijie Shen, Xinrui Wang, Yuanqi Nie, Apiradee Boonmee
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.05669
Pdf URL: https://arxiv.org/pdf/2509.05669
Copy Paste: [[2509.05669]] Context-Aware Multi-Turn Visual-Textual Reasoning in LVLMs via Dynamic Memory and Adaptive Visual Guidance(https://arxiv.org/abs/2509.05669)
Keywords: generation
Abstract: Current Large Language Models (LLMs) and Vision-Language Large Models (LVLMs) excel in single-turn tasks but face significant challenges in multi-turn interactions requiring deep contextual understanding and complex visual reasoning, often leading to fragmented reasoning, context loss, and hallucinations. To address these limitations, we propose Context-Aware Multi-Turn Visual Reasoning (CAMVR), a novel framework designed to empower LVLMs with robust and coherent multi-turn visual-textual inference capabilities. CAMVR introduces two key innovations: a Visual-Textual Context Memory Unit (VCMU), a dynamic read-write memory network that stores and manages critical visual features, textual semantic representations, and their cross-modal correspondences from each interaction turn; and an Adaptive Visual Focus Guidance (AVFG) mechanism, which leverages the VCMU's context to dynamically adjust the visual encoder's attention to contextually relevant image regions. Our multi-level reasoning integration strategy ensures that response generation is deeply coherent with both current inputs and accumulated historical context. Extensive experiments on challenging datasets, including VisDial, an adapted A-OKVQA, and our novel Multi-Turn Instruction Following (MTIF) dataset, demonstrate that CAMVR consistently achieves state-of-the-art performance.

Title: Depth-Aware Super-Resolution via Distance-Adaptive Variational Formulation

Authors: Tianhao Guo, Bingjie Lu, Feng Wang, Zhengyang Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.05746
Pdf URL: https://arxiv.org/pdf/2509.05746
Copy Paste: [[2509.05746]] Depth-Aware Super-Resolution via Distance-Adaptive Variational Formulation(https://arxiv.org/abs/2509.05746)
Keywords: super-resolution, generation
Abstract: Single image super-resolution traditionally assumes spatially-invariant degradation models, yet real-world imaging systems exhibit complex distance-dependent effects including atmospheric scattering, depth-of-field variations, and perspective distortions. This fundamental limitation necessitates spatially-adaptive reconstruction strategies that explicitly incorporate geometric scene understanding for optimal performance. We propose a rigorous variational framework that characterizes super-resolution as a spatially-varying inverse problem, formulating the degradation operator as a pseudodifferential operator with distance-dependent spectral characteristics that enable theoretical analysis of reconstruction limits across depth ranges. Our neural architecture implements discrete gradient flow dynamics through cascaded residual blocks with depth-conditional convolution kernels, ensuring convergence to stationary points of the theoretical energy functional while incorporating learned distance-adaptive regularization terms that dynamically adjust smoothness constraints based on local geometric structure. Spectral constraints derived from atmospheric scattering theory prevent bandwidth violations and noise amplification in far-field regions, while adaptive kernel generation networks learn continuous mappings from depth to reconstruction filters. Comprehensive evaluation across five benchmark datasets demonstrates state-of-the-art performance, achieving 36.89/0.9516 and 30.54/0.8721 PSNR/SSIM at 2× and 4× scales on KITTI outdoor scenes, outperforming existing methods by 0.44 dB and 0.36 dB respectively. This work establishes the first theoretically-grounded distance-adaptive super-resolution framework and demonstrates significant improvements on depth-variant scenarios while maintaining competitive performance across traditional benchmarks.
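
To put the reported PSNR gains in perspective, the sketch below uses the standard definition PSNR = 10 log10(MAX^2 / MSE) and converts a decibel gain into the corresponding reduction in mean squared error. The helper names are illustrative, and the peak value of 1.0 assumes images normalized to [0, 1]; nothing here is taken from the paper's code.

```python
import math

def psnr(mse, max_value=1.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(max_value ** 2 / mse)

def mse_ratio_for_gain(gain_db):
    """Factor by which MSE shrinks for a PSNR gain of gain_db decibels."""
    return 10.0 ** (-gain_db / 10.0)

for gain in (0.44, 0.36):
    ratio = mse_ratio_for_gain(gain)
    print(f"+{gain} dB PSNR corresponds to MSE x {ratio:.3f}")
```

Under this definition, gains of 0.44 dB and 0.36 dB each correspond to a mean squared error a little under ten percent lower.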

Title: CRAB: Camera-Radar Fusion for Reducing Depth Ambiguity in Backward Projection based View Transformation

Authors: In-Jae Lee, Sihwan Hwang, Youngseok Kim, Wonjune Kim, Sanmin Kim, Dongsuk Kum
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.05785
Pdf URL: https://arxiv.org/pdf/2509.05785
Copy Paste: [[2509.05785]] CRAB: Camera-Radar Fusion for Reducing Depth Ambiguity in Backward Projection based View Transformation(https://arxiv.org/abs/2509.05785)
Keywords: generation
Abstract: Recently, camera-radar fusion-based 3D object detection methods in bird's eye view (BEV) have gained attention due to the complementary characteristics and cost-effectiveness of these sensors. Previous approaches using forward projection struggle with sparse BEV feature generation, while those employing backward projection overlook depth ambiguity, leading to false positives. In this paper, to address the aforementioned limitations, we propose a novel camera-radar fusion-based 3D object detection and segmentation model named CRAB (Camera-Radar fusion for reducing depth Ambiguity in Backward projection-based view transformation), using a backward projection that leverages radar to mitigate depth ambiguity. During the view transformation, CRAB aggregates perspective view image context features into BEV queries. It improves depth distinction among queries along the same ray by combining the dense but unreliable depth distribution from images with the sparse yet precise depth information from radar occupancy. We further introduce spatial cross-attention with a feature map containing radar context information to enhance the comprehension of the 3D scene. When evaluated on the nuScenes open dataset, our proposed approach achieves a state-of-the-art performance among backward projection-based camera-radar fusion methods with 62.4% NDS and 54.0% mAP in 3D object detection.

Title: A Probabilistic Segment Anything Model for Ambiguity-Aware Medical Image Segmentation

Authors: Tyler Ward, Abdullah Imran
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.05809
Pdf URL: https://arxiv.org/pdf/2509.05809
Copy Paste: [[2509.05809]] A Probabilistic Segment Anything Model for Ambiguity-Aware Medical Image Segmentation(https://arxiv.org/abs/2509.05809)
Keywords: generation
Abstract: Recent advances in promptable segmentation, such as the Segment Anything Model (SAM), have enabled flexible, high-quality mask generation across a wide range of visual domains. However, SAM and similar models remain fundamentally deterministic, producing a single segmentation per object per prompt, and fail to capture the inherent ambiguity present in many real-world tasks. This limitation is particularly troublesome in medical imaging, where multiple plausible segmentations may exist due to annotation uncertainty or inter-expert variability. In this paper, we introduce Probabilistic SAM, a probabilistic extension of SAM that models a distribution over segmentations conditioned on both the input image and prompt. By incorporating a latent variable space and training with a variational objective, our model learns to generate diverse and plausible segmentation masks reflecting the variability in human annotations. The architecture integrates a prior and posterior network into the SAM framework, allowing latent codes to modulate the prompt embeddings during inference. The latent space allows for efficient sampling during inference, enabling uncertainty-aware outputs with minimal overhead. We evaluate Probabilistic SAM on the public LIDC-IDRI lung nodule dataset and demonstrate its ability to produce diverse outputs that align with expert disagreement, outperforming existing probabilistic baselines on uncertainty-aware metrics. Our code is available at: this https URL.

Title: X-SQL: Expert Schema Linking and Understanding of Text-to-SQL with Multi-LLMs

Authors: Dazhi Peng
Subjects: cs.LG, cs.DB
Abstract URL: https://arxiv.org/abs/2509.05899
Pdf URL: https://arxiv.org/pdf/2509.05899
Copy Paste: [[2509.05899]] X-SQL: Expert Schema Linking and Understanding of Text-to-SQL with Multi-LLMs(https://arxiv.org/abs/2509.05899)
Keywords: generation
Abstract: With Large Language Models' (LLMs) emergent abilities on code generation tasks, Text-to-SQL has become one of the most popular downstream applications. Despite the strong results of multiple recent LLM-based Text-to-SQL frameworks, the research community often overlooks the importance of database schema information for generating high-quality SQL queries. We find that such schema information plays a significant or even dominant role in the Text-to-SQL task. To tackle this challenge, we propose a novel database schema expert with two components. We first introduce X-Linking, an LLM Supervised Finetuning (SFT)-based method that achieves superior Schema Linking results compared to existing open-source Text-to-SQL methods. In addition, we innovatively propose an X-Admin component that focuses on Schema Understanding by bridging the gap between abstract schema information and the user's natural language question. Aside from better learning with schema information, we experiment with Multi-LLMs for different components within the system to further boost its performance. By incorporating these techniques into our end-to-end framework, X-SQL, we have achieved Execution Accuracies of 84.9% on the Spider-Dev dataset and 82.5% on the Spider-Test dataset. This outstanding performance establishes X-SQL as the leading Text-to-SQL framework based on open-source models.

Title: Coefficients-Preserving Sampling for Reinforcement Learning with Flow Matching

Authors: Feng Wang, Zihao Yu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.05952
Pdf URL: https://arxiv.org/pdf/2509.05952
Copy Paste: [[2509.05952]] Coefficients-Preserving Sampling for Reinforcement Learning with Flow Matching(https://arxiv.org/abs/2509.05952)
Keywords: generation
Abstract: Reinforcement Learning (RL) has recently emerged as a powerful technique for improving image and video generation in Diffusion and Flow Matching models, specifically for enhancing output quality and alignment with prompts. A critical step for applying online RL methods on Flow Matching is the introduction of stochasticity into the deterministic framework, commonly realized by Stochastic Differential Equation (SDE). Our investigation reveals a significant drawback to this approach: SDE-based sampling introduces pronounced noise artifacts in the generated images, which we found to be detrimental to the reward learning process. A rigorous theoretical analysis traces the origin of this noise to an excess of stochasticity injected during inference. To address this, we draw inspiration from Denoising Diffusion Implicit Models (DDIM) to reformulate the sampling process. Our proposed method, Coefficients-Preserving Sampling (CPS), eliminates these noise artifacts. This leads to more accurate reward modeling, ultimately enabling faster and more stable convergence for reinforcement learning-based optimizers like Flow-GRPO and Dance-GRPO. Code will be released at this https URL

Title: OmniStyle2: Scalable and High Quality Artistic Style Transfer Data Generation via Destylization

Authors: Ye Wang, Zili Yi, Yibo Zhang, Peng Zheng, Xuping Xie, Jiang Lin, Yilin Wang, Rui Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.05970
Pdf URL: https://arxiv.org/pdf/2509.05970
Copy Paste: [[2509.05970]] OmniStyle2: Scalable and High Quality Artistic Style Transfer Data Generation via Destylization(https://arxiv.org/abs/2509.05970)
Keywords: generation
Abstract: OmniStyle2 introduces a novel approach to artistic style transfer by reframing it as a data problem. Our key insight is destylization, reversing style transfer by removing stylistic elements from artworks to recover natural, style-free counterparts. This yields DST-100K, a large-scale dataset that provides authentic supervision signals by aligning real artistic styles with their underlying content. To build DST-100K, we develop (1) DST, a text-guided destylization model that reconstructs stylefree content, and (2) DST-Filter, a multi-stage evaluation model that employs Chain-of-Thought reasoning to automatically discard low-quality pairs while ensuring content fidelity and style accuracy. Leveraging DST-100K, we train OmniStyle2, a simple feed-forward model based on FLUX.1-dev. Despite its simplicity, OmniStyle2 consistently surpasses state-of-the-art methods across both qualitative and quantitative benchmarks. Our results demonstrate that scalable data generation via destylization provides a reliable supervision paradigm, overcoming the fundamental challenge posed by the lack of ground-truth data in artistic style transfer.

Title: Multi-Strategy Guided Diffusion via Sparse Masking Temporal Reweighting Distribution Correction

Authors: Zekun Zhou, Yanru Gong, Liu Shi, Qiegen Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.05992
Pdf URL: https://arxiv.org/pdf/2509.05992
Copy Paste: [[2509.05992]] Multi-Strategy Guided Diffusion via Sparse Masking Temporal Reweighting Distribution Correction(https://arxiv.org/abs/2509.05992)
Keywords: restoration, generative
Abstract: Diffusion models have demonstrated remarkable generative capabilities in image processing tasks. We propose a Sparse condition Temporal Reweighted Integrated Distribution Estimation guided diffusion model (STRIDE) for sparse-view CT reconstruction. Specifically, we design a joint training mechanism guided by sparse conditional probabilities to facilitate the model's effective learning of missing projection view completion and global information modeling. Based on systematic theoretical analysis, we propose a temporally varying sparse condition reweighting guidance strategy to dynamically adjust weights during the progressive denoising process from pure noise to the real image, enabling the model to progressively perceive sparse-view information. Linear regression is employed to correct distributional shifts between known and generated data, mitigating inconsistencies arising during the guidance process. Furthermore, we construct a dual-network parallel architecture to perform global correction and optimization across multiple sub-frequency components, thereby effectively improving the model capability in both detail restoration and structural preservation, ultimately achieving high-quality image reconstruction. Experimental results on both public and real datasets demonstrate that the proposed method achieves the best improvement of 2.58 dB in PSNR, increase of 2.37% in SSIM, and reduction of 0.236 in MSE compared to the best-performing baseline methods. The reconstructed images exhibit excellent generalization and robustness in terms of structural consistency, detail restoration, and artifact suppression.
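
The STRIDE abstract mentions linear regression for correcting distributional shifts between known and generated data but gives no details. One plausible reading, sketched below purely as an assumption, is to fit a scale and offset by ordinary least squares on the positions where measurements exist and then apply that affine map to the generated values. The function names and the toy numbers are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y ~ slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def correct(generated, slope, intercept):
    """Apply the fitted affine correction to generated values."""
    return [slope * g + intercept for g in generated]

# Toy data: model output and real measurements at the same known views.
generated_at_known = [0.0, 1.0, 2.0, 3.0]
measured_at_known = [1.0, 3.0, 5.0, 7.0]
slope, intercept = fit_line(generated_at_known, measured_at_known)

# Values the model produced for views that were never measured.
generated_at_missing = [4.0, 5.0]
print(correct(generated_at_missing, slope, intercept))
```

The correction only removes a global scale and offset mismatch; any shift that is not affine would need a richer model.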

Title: BranchGRPO: Stable and Efficient GRPO with Structured Branching in Diffusion Models

Authors: Yuming Li, Yikai Wang, Yuying Zhu, Zhongyu Zhao, Ming Lu, Qi She, Shanghang Zhang
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.06040
Pdf URL: https://arxiv.org/pdf/2509.06040
Copy Paste: [[2509.06040]] BranchGRPO: Stable and Efficient GRPO with Structured Branching in Diffusion Models(https://arxiv.org/abs/2509.06040)
Keywords: generative
Abstract: Recent advancements in aligning image and video generative models via GRPO have achieved remarkable gains in enhancing human preference alignment. However, these methods still face high computational costs from on-policy rollouts and excessive SDE sampling steps, as well as training instability due to sparse rewards. In this paper, we propose BranchGRPO, a novel method that introduces a branch sampling policy updating the SDE sampling process. By sharing computation across common prefixes and pruning low-reward paths and redundant depths, BranchGRPO substantially lowers the per-update compute cost while maintaining or improving exploration diversity. This work makes three main contributions: (1) a branch sampling scheme that reduces rollout and training cost; (2) a tree-based advantage estimator incorporating dense process-level rewards; and (3) pruning strategies exploiting path and depth redundancy to accelerate convergence and boost performance. Experiments on image and video preference alignment show that BranchGRPO improves alignment scores by 16% over strong baselines, while cutting training time by 50%.
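
The saving from sharing computation across common prefixes can be illustrated with a simple count. The sketch below is not BranchGRPO itself: it assumes a fixed branching factor applied at chosen sampling steps (both hypothetical parameters) and compares the number of sampling steps in a branched tree with the same number of independent rollouts. It also shows the usual group-relative normalization of final rewards used in GRPO-style methods; the tree-based estimator with dense process-level rewards described in the abstract is not reproduced here.

```python
import statistics

def rollout_cost(total_steps, branch_steps, branch_factor):
    """Count sampling steps for a branched tree vs. independent rollouts.

    Every active path splits into branch_factor children just before each
    step listed in branch_steps; all paths then advance one step.
    """
    paths, tree_cost = 1, 0
    for t in range(total_steps):
        if t in branch_steps:
            paths *= branch_factor
        tree_cost += paths
    independent_cost = paths * total_steps
    return paths, tree_cost, independent_cost

def group_relative_advantages(rewards):
    """Normalize final rewards within a group of samples."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + 1e-8) for r in rewards]

leaves, tree_cost, independent_cost = rollout_cost(10, {3, 6}, 2)
print(leaves, tree_cost, independent_cost)
print(group_relative_advantages([0.2, 0.4, 0.6, 0.8]))
```

With 10 steps and splits before steps 3 and 6, the tree yields 4 final samples for 25 step evaluations, whereas 4 independent rollouts would need 40.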

Title: PolicyEvolve: Evolving Programmatic Policies by LLMs for multi-player games via Population-Based Training

Authors: Mingrui Lv, Hangzhi Liu, Zhi Luo, Hongjie Zhang, Jie Ou
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2509.06053
Pdf URL: https://arxiv.org/pdf/2509.06053
Copy Paste: [[2509.06053]] PolicyEvolve: Evolving Programmatic Policies by LLMs for multi-player games via Population-Based Training(https://arxiv.org/abs/2509.06053)
Keywords: generation
Abstract: Multi-agent reinforcement learning (MARL) has achieved significant progress in solving complex multi-player games through self-play. However, training effective adversarial policies requires millions of experience samples and substantial computational resources. Moreover, these policies lack interpretability, hindering their practical deployment. Recently, researchers have successfully leveraged Large Language Models (LLMs) to generate programmatic policies for single-agent tasks, transforming neural network-based policies into interpretable rule-based code with high execution efficiency. Inspired by this, we propose PolicyEvolve, a general framework for generating programmatic policies in multi-player games. PolicyEvolve significantly reduces reliance on manually crafted policy code, achieving high-performance policies with minimal environmental interactions. The framework comprises four modules: Global Pool, Local Pool, Policy Planner, and Trajectory Critic. The Global Pool preserves elite policies accumulated during iterative training. The Local Pool stores temporary policies for the current iteration; only sufficiently high-performing policies from this pool are promoted to the Global Pool. The Policy Planner serves as the core policy generation module. It samples the top three policies from the Global Pool, generates an initial policy for the current iteration based on environmental information, and refines this policy using feedback from the Trajectory Critic. Refined policies are then deposited into the Local Pool. This iterative process continues until the policy achieves a sufficiently high average win rate against the Global Pool, at which point it is integrated into the Global Pool. The Trajectory Critic analyzes interaction data from the current policy, identifies vulnerabilities, and proposes directional improvements to guide the Policy Planner.
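
The control flow described in the PolicyEvolve abstract can be sketched with a toy stand-in. Everything below is an assumption made for illustration: a "policy" is just an integer skill level, the higher skill wins a match, and the planner and critic are trivial hypothetical functions rather than LLM calls. Only the loop structure (Global Pool, Local Pool, refinement until a win-rate threshold is met, then promotion) follows the abstract.

```python
def win_rate(policy, pool):
    """Toy stand-in for match results: higher skill wins, equal skill ties."""
    score = sum(1.0 if policy > other else 0.5 if policy == other else 0.0
                for other in pool)
    return score / len(pool)

def trajectory_critic(policy, pool):
    """Hypothetical critic: suggest a +1 skill step if the policy loses or ties."""
    return 1 if any(policy <= other for other in pool) else 0

def policy_planner(global_pool):
    """Hypothetical planner: start from the strongest of the top three policies."""
    return sorted(global_pool, reverse=True)[:3][0]

def evolve(global_pool, iterations, threshold=0.85, max_refinements=10):
    """Run the pool-based loop and return the updated Global Pool."""
    global_pool = list(global_pool)
    for _ in range(iterations):
        local_pool = []                      # temporary policies, this iteration
        policy = policy_planner(global_pool)
        for _ in range(max_refinements):
            local_pool.append(policy)
            if win_rate(policy, global_pool) >= threshold:
                break
            policy += trajectory_critic(policy, global_pool)
        best = max(local_pool, key=lambda p: win_rate(p, global_pool))
        if win_rate(best, global_pool) >= threshold:
            global_pool.append(best)         # promote to the Global Pool
    return global_pool

print(evolve([1, 2, 3], iterations=3))
```

In the real framework the planner and critic would be LLM-driven and a policy would be program code evaluated in a game environment; the integer skill level only keeps the example deterministic.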

Title: Home-made Diffusion Model from Scratch to Hatch

Authors: Shih-Ying Yeh
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.06068
Pdf URL: https://arxiv.org/pdf/2509.06068
Copy Paste: [[2509.06068]] Home-made Diffusion Model from Scratch to Hatch(https://arxiv.org/abs/2509.06068)
Keywords: generation
Abstract: We introduce Home-made Diffusion Model (HDM), an efficient yet powerful text-to-image diffusion model optimized for training (and inference) on consumer-grade hardware. HDM achieves competitive 1024x1024 generation quality while maintaining a remarkably low training cost of $535-620 using four RTX5090 GPUs, representing a significant reduction in computational requirements compared to traditional approaches. Our key contributions include: (1) Cross-U-Transformer (XUT), a novel U-shaped transformer that employs cross-attention for skip connections, providing superior feature integration that leads to remarkable compositional consistency; (2) a comprehensive training recipe that incorporates TREAD acceleration, a novel shifted square crop strategy for efficient arbitrary aspect-ratio training, and progressive resolution scaling; and (3) an empirical demonstration that smaller models (343M parameters) with carefully crafted architectures can achieve high-quality results and emergent capabilities, such as intuitive camera control. Our work provides an alternative paradigm of scaling, demonstrating a viable path toward democratizing high-quality text-to-image generation for individual researchers and smaller organizations with limited computational resources.
摘要：我们介绍了自制的扩散模型（HDM），这是一种高效但功能强大的文本对图像扩散模型，可针对消费级硬件进行训练（和推断）。 HDM可实现竞争激烈的1024x1024发电质量，同时使用四种RTX5090 GPU保持较低的培训成本为535-620美元，与传统方法相比，计算需求的显着降低。我们的主要贡献包括：（1）新型的U形变压器，跨u-transformer（XUT），使用跨注意来进行跳过连接，提供出色的特征集成，从而提供出色的组成一致性；（2）综合胎面加速度的全面培训配方，这是一种新型的正方形作物策略，用于有效的任意方面比例训练以及进行性分辨率缩放；（3）经验证明，具有精心制作的体系结构的较小模型（343m参数）可以实现高质量的结果和紧急功能，例如直观的摄像头控制。我们的工作提供了缩放范围的替代范式，展示了一个可行的途径，可以为具有有限计算资源的个别研究人员和较小的组织民主化高质量的文本到图像生成。

Title: If generative AI is the answer, what is the question?

Authors: Ambuj Tewari
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2509.06120
Pdf URL: https://arxiv.org/pdf/2509.06120
Copy Paste: [[2509.06120]] If generative AI is the answer, what is the question?(https://arxiv.org/abs/2509.06120)
Keywords: generation, generative
Abstract: Beginning with text and images, generative AI has expanded to audio, video, computer code, and molecules. Yet, if generative AI is the answer, what is the question? We explore the foundations of generation as a distinct machine learning task with connections to prediction, compression, and decision-making. We survey five major generative model families: autoregressive models, variational autoencoders, normalizing flows, generative adversarial networks, and diffusion models. We then introduce a probabilistic framework that emphasizes the distinction between density estimation and generation. We review a game-theoretic framework with a two-player adversary-learner setup to study generation. We discuss post-training modifications that prepare generative models for deployment. We end by highlighting some important topics in socially responsible generation such as privacy, detection of AI-generated content, and copyright and IP. We adopt a task-first framing of generation, focusing on what generation is as a machine learning problem, rather than only on how models implement it.
摘要：从文本和图像开始，生成AI已扩展到音频，视频，计算机代码和分子。但是，如果生成AI是答案，那是什么问题？我们探索一项与预测，压缩和决策的连接的独特机器学习任务的基础。我们调查了五个主要的生成模型家族：自回归模型，变异自动编码器，标准化流量，生成对抗网络和扩散模型。然后，我们引入了一个概率框架，该框架强调了密度估计与产生之间的区别。我们通过两人对手学习者设置来审查游戏理论框架，以学习生成。我们讨论训练后修改，以准备生成模型进行部署。最后，我们强调了社会负责任的一些重要主题，例如隐私，AI生成的内容以及版权和IP。我们采用了一代人的任务领域，重点是作为机器学习问题，而不仅仅是模型如何实施它。

Title: SpecSwin3D: Generating Hyperspectral Imagery from Multispectral Data via Transformer Networks

Authors: Tang Sui, Songxi Yang, Qunying Huang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2509.06122
Pdf URL: https://arxiv.org/pdf/2509.06122
Copy Paste: [[2509.06122]] SpecSwin3D: Generating Hyperspectral Imagery from Multispectral Data via Transformer Networks(https://arxiv.org/abs/2509.06122)
Keywords: generation
Abstract: Multispectral and hyperspectral imagery are widely used in agriculture, environmental monitoring, and urban planning due to their complementary spatial and spectral characteristics. A fundamental trade-off persists: multispectral imagery offers high spatial but limited spectral resolution, while hyperspectral imagery provides rich spectra at lower spatial resolution. Prior hyperspectral generation approaches (e.g., pan-sharpening variants, matrix factorization, CNNs) often struggle to jointly preserve spatial detail and spectral fidelity. In response, we propose SpecSwin3D, a transformer-based model that generates hyperspectral imagery from multispectral inputs while preserving both spatial and spectral quality. Specifically, SpecSwin3D takes five multispectral bands as input and reconstructs 224 hyperspectral bands at the same spatial resolution. In addition, we observe that reconstruction errors grow for hyperspectral bands spectrally distant from the input bands. To address this, we introduce a cascade training strategy that progressively expands the spectral range to stabilize learning and improve fidelity. Moreover, we design an optimized band sequence that strategically repeats and orders the five selected multispectral bands to better capture pairwise relations within a 3D shifted-window transformer framework. Quantitatively, our model achieves a PSNR of 35.82 dB, SAM of 2.40°, and SSIM of 0.96, outperforming the baseline MHF-Net by +5.6 dB in PSNR and reducing ERGAS by more than half. Beyond reconstruction, we further demonstrate the practical value of SpecSwin3D on two downstream tasks, including land use classification and burnt area segmentation.
摘要：由于其互补的空间和光谱特征，多光谱和高光谱图像被广泛用于农业，环境监测和城市规划。基本的权衡持续存在：多光谱图像提供了很高的空间但有限的光谱分辨率，而高光谱图像则在较低的空间分辨率下提供了丰富的光谱。先前的高光谱生成方法（例如，泛伴变体，基质分解，CNN）通常很难共同保留空间细节和频谱保真度。作为响应，我们提出了Specswin3d，这是一个基于变压器的模型，该模型从多光谱输入中产生高光谱图像，同时保留空间和光谱质量。具体而言，Specswin3d采用五个多光谱带作为输入，并以相同的空间分辨率重建224个高光谱。此外，我们观察到，与输入条带相距较远的高光谱带的重建误差会增加。为了解决这个问题，我们引入了级联培训策略，该策略逐渐扩大光谱范围，以稳定学习并提高忠诚度。此外，我们设计了一个优化的频段序列，该序列从战略上重复并命令五个选定的多光谱频段，以更好地捕获3D移动窗口变压器框架中的成对关系。定量地，我们的模型达到的PSNR为35.82 dB，SAM为2.40°，SSIM为0.96，在PSNR中的基线MHF-NET的表现优于+5.6 dB，并将ERGA降低了一半以上。除了重建外，我们还进一步证明了Specswin3d在两个下游任务上的实际价值，包括土地使用分类和烧毁区域细分。
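
The SpecSwin3D abstract reports PSNR (in dB) and SAM (in degrees). As a reminder of what these two metrics measure, here is a minimal sketch using their standard textbook definitions on plain Python lists. The function names and the `peak` parameter are illustrative assumptions, and the paper's exact evaluation protocol (normalization, per-band versus per-pixel averaging) is not stated in the abstract.

```python
import math


def psnr(reference, estimate, peak=1.0):
    """Peak signal-to-noise ratio in dB for two equal-length value sequences."""
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(peak ** 2 / mse)


def spectral_angle_deg(ref_spectrum, est_spectrum):
    """Spectral angle (SAM) in degrees between two spectra of one pixel."""
    dot = sum(r * e for r, e in zip(ref_spectrum, est_spectrum))
    norm_ref = math.sqrt(sum(r * r for r in ref_spectrum))
    norm_est = math.sqrt(sum(e * e for e in est_spectrum))
    # Clamp to [-1, 1] to guard against rounding just outside acos's domain.
    cos_angle = max(-1.0, min(1.0, dot / (norm_ref * norm_est)))
    return math.degrees(math.acos(cos_angle))


if __name__ == "__main__":
    print(psnr([0.0, 0.0], [0.1, 0.1]))
    print(spectral_angle_deg([1.0, 0.0], [0.0, 1.0]))
```

For an image, SAM is usually reported as the mean of the per-pixel angles, so a lower value means the reconstructed spectra point in nearly the same direction as the reference spectra regardless of overall brightness.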

Title: RetinaGuard: Obfuscating Retinal Age in Fundus Images for Biometric Privacy Preserving

Authors: Zhengquan Luo (1), Chi Liu (1), Dongfu Xiao (1), Zhen Yu (2), Yueye Wang (3), Tianqing Zhu (1) ((1) City University of Macau, (2) Monash University, (3) Hong Kong Polytechnic University)
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.06142
Pdf URL: https://arxiv.org/pdf/2509.06142
Copy Paste: [[2509.06142]] RetinaGuard: Obfuscating Retinal Age in Fundus Images for Biometric Privacy Preserving(https://arxiv.org/abs/2509.06142)
Keywords: generative
Abstract: The integration of AI with medical images enables the extraction of implicit image-derived biomarkers for a precise health assessment. Recently, retinal age, a biomarker predicted from fundus images, has been shown to be a predictor of systemic disease risks, behavioral patterns, aging trajectory and even mortality. However, the capability to infer such sensitive biometric data raises significant privacy risks, where unauthorized use of fundus images could lead to bioinformation leakage, breaching individual privacy. In response, we formulate a new research problem of biometric privacy associated with medical images and propose RetinaGuard, a novel privacy-enhancing framework that employs a feature-level generative adversarial masking mechanism to obscure retinal age while preserving image visual quality and disease diagnostic utility. The framework further utilizes a novel multiple-to-one knowledge distillation strategy incorporating a retinal foundation model and diverse surrogate age encoders to enable a universal defense against black-box age prediction models. Comprehensive evaluations confirm that RetinaGuard successfully obfuscates retinal age prediction with minimal impact on image quality and pathological feature representation. RetinaGuard is also flexible for extension to other medical image-derived biomarkers.
摘要：AI与医学图像的集成使人可以提取隐式图像衍生的生物标志物进行精确的健康评估。最近，视网膜年龄是一种从眼底图像预测的生物标志物，是全身性疾病风险，行为模式，衰老轨迹甚至死亡率的可靠预测指标。但是，推断出这种敏感生物特征数据的能力会引起明显的隐私风险，未经授权使用眼睛图像可能会导致生物信息泄漏，从而违反个人隐私。作为回应，我们制定了与医学图像相关的生物特征隐私的新研究问题，并提出了Retinaguard，这是一种新型的隐私增强框架，该框架采用了特征级别的生成对抗性掩盖机制来模糊视网膜时代，同时保留了图像视觉质量和疾病诊断。该框架进一步利用了一种新颖的多一对一的知识蒸馏策略，该策略结合了视网膜基础模型和多样化的替代年龄编码，以实现对黑盒年龄预测模型的普遍防御。全面的评估证实，视网膜成功地掩盖了视网膜年龄预测，对图像质量和病理特征表示的影响最小。视网膜也可以灵活地扩展到其他医学图像衍生的生物标志物。视网膜也可以灵活地扩展到其他医学图像生物标志物。

Title: UniVerse-1: Unified Audio-Video Generation via Stitching of Experts

Authors: Duomin Wang, Wei Zuo, Aojie Li, Ling-Hao Chen, Xinyao Liao, Deyu Zhou, Zixin Yin, Xili Dai, Daxin Jiang, Gang Yu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.06155
Pdf URL: https://arxiv.org/pdf/2509.06155
Copy Paste: [[2509.06155]] UniVerse-1: Unified Audio-Video Generation via Stitching of Experts(https://arxiv.org/abs/2509.06155)
Keywords: generation
Abstract: We introduce UniVerse-1, a unified, Veo-3-like model capable of simultaneously generating coordinated audio and video. To enhance training efficiency, we bypass training from scratch and instead employ a stitching of experts (SoE) technique. This approach deeply fuses the corresponding blocks of pre-trained video and music generation expert models, thereby fully leveraging their foundational capabilities. To ensure accurate annotations and temporal alignment for both ambient sounds and speech with video content, we developed an online annotation pipeline that processes the required training data and generates labels during the training process. This strategy circumvents the performance degradation often caused by misaligned text-based annotations. Through the synergy of these techniques, our model, after being finetuned on approximately 7,600 hours of audio-video data, produces results with well-coordinated audio-visuals for ambient sound generation and strong alignment for speech generation. To systematically evaluate our proposed method, we introduce Verse-Bench, a new benchmark dataset. In an effort to advance research in audio-video generation and to close the performance gap with state-of-the-art models such as Veo3, we make our model and code publicly available. We hope this contribution will benefit the broader research community. Project page: this https URL.
摘要：我们介绍了Universe-1，这是一种统一的类似于VEO-3的模型，能够同时生成协调的音频和视频。为了提高培训效率，我们绕过从头开始训练，而是采用专家（SOE）技术的缝制。这种方法深层融合了相应的预训练视频和音乐发电专家模型，从而充分利用了它们的基础能力。为了确保与视频内容的环境声音和语音的准确注释和时间对齐，我们开发了一个在线注释管道，该管道可以处理所需的培训数据并在培训过程中生成标签。该策略规避了经常由基于文本的注释造成的绩效降低通常引起的。通过这些技术的协同作用，我们的模型在大约7,600小时的音频视频数据中进行了填充之后，产生的结果与协调的音频可见符，以进行环境声音的产生和强烈的语音一致性。为了系统地评估我们提出的方法，我们介绍了新的基准数据集Verse Bench。为了推动对音频视频生成的研究，并通过诸如VEO3之类的最先进模型来缩小性能差距，我们使我们的模型和代码公开可用。我们希望这一贡献将使更广泛的研究社区受益。项目页面：此HTTPS URL。

Title: UNO: Unifying One-stage Video Scene Graph Generation via Object-Centric Visual Representation Learning

Authors: Huy Le, Nhat Chung, Tung Kieu, Jingkang Yang, Ngan Le
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2509.06165
Pdf URL: https://arxiv.org/pdf/2509.06165
Copy Paste: [[2509.06165]] UNO: Unifying One-stage Video Scene Graph Generation via Object-Centric Visual Representation Learning(https://arxiv.org/abs/2509.06165)
Keywords: generation
Abstract: Video Scene Graph Generation (VidSGG) aims to represent dynamic visual content by detecting objects and modeling their temporal interactions as structured graphs. Prior studies typically target either coarse-grained box-level or fine-grained panoptic pixel-level VidSGG, often requiring task-specific architectures and multi-stage training pipelines. In this paper, we present UNO (UNified Object-centric VidSGG), a single-stage, unified framework that jointly addresses both tasks within an end-to-end architecture. UNO is designed to minimize task-specific modifications and maximize parameter sharing, enabling generalization across different levels of visual granularity. The core of UNO is an extended slot attention mechanism that decomposes visual features into object and relation slots. To ensure robust temporal modeling, we introduce object temporal consistency learning, which enforces consistent object representations across frames without relying on explicit tracking modules. Additionally, a dynamic triplet prediction module links relation slots to corresponding object pairs, capturing evolving interactions over time. We evaluate UNO on standard box-level and pixel-level VidSGG benchmarks. Results demonstrate that UNO not only achieves competitive performance across both tasks but also offers improved efficiency through a unified, object-centric design.
摘要：视频场景图（Vidsgg）旨在通过检测对象并将其时间相互作用作为结构图来表示动态视觉内容。先前的研究通常靶向粗粒盒级或细粒的全泛像素级vidsgg，通常需要特定于任务的架构和多阶段训练管道。在本文中，我们提出了一个单阶段的统一框架（统一以对象为中心的Vidsgg），该框架共同解决了端到端体系结构中的两个任务。 UNO旨在最大程度地减少特定于任务的修改并最大化参数共享，从而在不同级别的视觉粒度上概括。 UNO的核心是一种扩展的插槽注意机制，可将视觉特征分解为对象和关系插槽。为了确保鲁棒的时间建模，我们介绍对象时间一致性学习，该学习可以在不依赖明确跟踪模块的情况下强制跨帧的对象表示。此外，动态三重态预测模块将关系插槽链接到相应的对象对，从而捕获了随着时间的流逝而不断发展的相互作用。我们在标准的盒子级和像素级Vidsgg基准上评估UNO。结果表明，UNO不仅在这两个任务中都实现了竞争性能，而且还通过统一的，以对象为中心的设计提高了效率。

Title: UrbanMIMOMap: A Ray-Traced MIMO CSI Dataset with Precoding-Aware Maps and Benchmarks

Authors: Honggang Jia, Xiucheng Wang, Nan Cheng, Ruijin Sun, Changle Li
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2509.06270
Pdf URL: https://arxiv.org/pdf/2509.06270
Copy Paste: [[2509.06270]] UrbanMIMOMap: A Ray-Traced MIMO CSI Dataset with Precoding-Aware Maps and Benchmarks(https://arxiv.org/abs/2509.06270)
Keywords: generation
Abstract: Sixth generation (6G) systems require environment-aware communication, driven by native artificial intelligence (AI) and integrated sensing and communication (ISAC). Radio maps (RMs), providing spatially continuous channel information, are key enablers. However, generating high-fidelity RM ground truth via electromagnetic (EM) simulations is computationally intensive, motivating machine learning (ML)-based RM construction. The effectiveness of these data-driven methods depends on large-scale, high-quality training data. Current public datasets often focus on single-input single-output (SISO) and limited information, such as path loss, which is insufficient for advanced multi-input multi-output (MIMO) systems requiring detailed channel state information (CSI). To address this gap, this paper presents UrbanMIMOMap, a novel large-scale urban MIMO CSI dataset generated using high-precision ray tracing. UrbanMIMOMap offers comprehensive complex CSI matrices across a dense spatial grid, going beyond traditional path loss data. This rich CSI is vital for constructing high-fidelity RMs and serves as a fundamental resource for data-driven RM generation, including deep learning. We demonstrate the dataset's utility through baseline performance evaluations of representative ML methods for RM construction. This work provides a crucial dataset and reference for research in high-precision RM generation, MIMO spatial performance, and ML for 6G environment awareness. The code and data for this work are available at: this https URL.
摘要：第六代（6G）系统需要由本地人工智能（AI）和集成的感应和通信（ISAC）驱动的环境感知的通信。无线电图（RMS）提供空间连续的通道信息，是关键推动器。但是，通过电磁（EM）模拟生成高保真RM地面真相是计算密集型，激励机器学习（ML）基于RM的构造。这些数据驱动方法的有效性取决于大规模的高质量培训数据。当前的公共数据集通常专注于单输入单输出（SISO）和有限的信息，例如路径丢失，这不足以用于需要详细的通道状态信息（CSI）的高级多输入多输出（MIMO）系统。为了解决这一差距，本文介绍了UrbanMimomomap，这是一种新型的大型Urban Mimo CSI数据集，该数据集使用高精度射线追踪生成。 UrbanMimomap在密集的空间网格中提供了全面的复杂CSI矩阵，超越了传统的路径损失数据。这种丰富的CSI对于构建高保真性RMS至关重要，并且是数据驱动的RM生成（包括深度学习）的基本资源。我们通过对RM构建的代表性ML方法的基线绩效评估来证明数据集的实用性。这项工作为高精度RM生成，MIMO空间性能和ML的研究提供了关键的数据集和参考，以实现6G环境意识。此工作的代码和数据可在以下网址提供：此HTTPS URL。

Title: WindFM: An Open-Source Foundation Model for Zero-Shot Wind Power Forecasting

Authors: Hang Fan, Yu Shi, Zongliang Fu, Shuo Chen, Wei Wei, Wei Xu, Jian Li
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.06311
Pdf URL: https://arxiv.org/pdf/2509.06311
Copy Paste: [[2509.06311]] WindFM: An Open-Source Foundation Model for Zero-Shot Wind Power Forecasting(https://arxiv.org/abs/2509.06311)
Keywords: generation, generative
Abstract: High-quality wind power forecasting is crucial for the operation of modern power grids. However, prevailing data-driven paradigms either train a site-specific model that cannot generalize to other locations or rely on fine-tuning general-purpose time series foundation models, into which it is difficult to incorporate domain-specific data from the energy sector. This paper introduces WindFM, a lightweight and generative Foundation Model designed specifically for probabilistic wind power forecasting. WindFM employs a discretize-and-generate framework. A specialized time-series tokenizer first converts continuous multivariate observations into discrete, hierarchical tokens. Subsequently, a decoder-only Transformer learns a universal representation of wind generation dynamics by autoregressively pre-training on these token sequences. Using the comprehensive WIND Toolkit dataset comprising approximately 150 billion time steps from more than 126,000 sites, WindFM develops a foundational understanding of the complex interplay between atmospheric conditions and power output. Extensive experiments demonstrate that our compact 8.1M parameter model achieves state-of-the-art zero-shot performance on both deterministic and probabilistic tasks, outperforming specialized models and larger foundation models without any fine-tuning. In particular, WindFM exhibits strong adaptiveness under out-of-distribution data from a different continent, demonstrating the robustness and transferability of its learned representations. Our pre-trained model is publicly available at this https URL.
摘要：高质量的风能预测对于现代电网的运行至关重要。但是，盛行的数据驱动范例要么训练无法推广到其他位置的站点特异性模型，要么依靠通用时间序列序列模型的微调，该模型很难在能源部门中纳入特定于域的数据。本文介绍了WINDFM，这是一种专门针对概率风力预测的轻质和生成基础模型。 WINDFM采用离散和生成框架。专门的时间序列令牌首先将连续的多元观测值转换为离散的层次令牌。随后，仅解码器的变压器通过对这些令牌序列进行自动训练的预训练来了解风产生动力学的通用表示。 WINDFM使用来自126,000多个站点的综合风工具包数据集，其中包括126,000多个站点的时间步长，对大气条件和功率输出之间的复杂相互作用有了基本的了解。广泛的实验表明，我们紧凑的810万参数模型在确定性和概率任务上都能达到最新的零射击性能，优于专业模型和更大的基础模型，而没有任何微调。特别是，WINDFM在来自不同大陆的分布数据下表现出很强的适应性，这表明其学到的表示形式的稳健性和可传递性。我们的预培训模型可在此HTTPS URL上公开使用。

Title: Evaluating the Efficiency of Latent Spaces via the Coupling-Matrix

Authors: Mehmet Can Yavuz, Berrin Yanikoglu
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2509.06314
Pdf URL: https://arxiv.org/pdf/2509.06314
Copy Paste: [[2509.06314]] Evaluating the Efficiency of Latent Spaces via the Coupling-Matrix(https://arxiv.org/abs/2509.06314)
Keywords: generative
Abstract: A central challenge in representation learning is constructing latent embeddings that are both expressive and efficient. In practice, deep networks often produce redundant latent spaces where multiple coordinates encode overlapping information, reducing effective capacity and hindering generalization. Standard metrics such as accuracy or reconstruction loss provide only indirect evidence of such redundancy and cannot isolate it as a failure mode. We introduce a redundancy index, denoted rho(C), that directly quantifies inter-dimensional dependencies by analyzing coupling matrices derived from latent representations and comparing their off-diagonal statistics against a normal distribution via energy distance. The result is a compact, interpretable, and statistically grounded measure of representational quality. We validate rho(C) across discriminative and generative settings on MNIST variants, Fashion-MNIST, CIFAR-10, and CIFAR-100, spanning multiple architectures and hyperparameter optimization strategies. Empirically, low rho(C) reliably predicts high classification accuracy or low reconstruction error, while elevated redundancy is associated with performance collapse. Estimator reliability grows with latent dimension, yielding natural lower bounds for reliable analysis. We further show that Tree-structured Parzen Estimators (TPE) preferentially explore low-rho regions, suggesting that rho(C) can guide neural architecture search and serve as a redundancy-aware regularization target. By exposing redundancy as a universal bottleneck across models and tasks, rho(C) offers both a theoretical lens and a practical tool for evaluating and improving the efficiency of learned representations.
摘要：表示学习的核心挑战是构建既表现力又有效的潜在嵌入。在实践中，深网通常会产生冗余的潜在空间，其中多个坐标编码重叠信息，从而降低了有效的容量和阻碍概括。准确性或重建损失等标准指标仅提供这种冗余性的间接证据，并且不能将其隔离为故障模式。我们引入了一个冗余指数，表示Rho（C），该指数直接通过分析源自潜在表示的耦合矩阵并比较其非对抗的统计量与通过能量距离的正态分布进行比较。结果是一种紧凑，可解释的和统计上的代表性质量衡量标准。我们跨越了MNIST变体，Fashion-Mnist，CIFAR-10和CIFAR-100的Rho（C），跨越了多个体系结构和超参数优化策略。从经验上讲，低RHO（C）可靠地预测高分类精度或低重建误差，而冗余升高与性能崩溃有关。估计量的可靠性随潜在的维度增长，从而产生自然的下限以可靠分析。我们进一步表明，树结构化的parzen估计量（TPE）优先探索低RHO区域，这表明Rho（C）可以指导神经体系结构搜索并充当冗余 - 感知的正则化目标。通过将冗余作为跨模型和任务的通用瓶颈，Rho（c）提供了一种理论镜头和一种实用工具，用于评估和提高学会表示的效率。
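
The coupling-matrix abstract compares off-diagonal statistics against a normal distribution via energy distance. The sketch below shows only that generic building block: the squared energy distance between two one-dimensional samples, D^2 = 2 E|X - Y| - E|X - X'| - E|Y - Y'|, plus a helper that collects off-diagonal entries. How the paper builds the coupling matrix C and turns the distance into rho(C) is not given in the abstract, so the function names here are hypothetical and this is not the authors' estimator.

```python
def off_diagonal(matrix):
    """Collect every entry of a square matrix that is not on the diagonal."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(n) if i != j]


def mean_abs_diff(xs, ys):
    """Average absolute difference over all pairs (x, y)."""
    return sum(abs(x - y) for x in xs for y in ys) / (len(xs) * len(ys))


def energy_distance_sq(xs, ys):
    """Squared energy distance between two 1-D samples (V-statistic form)."""
    return (2.0 * mean_abs_diff(xs, ys)
            - mean_abs_diff(xs, xs)
            - mean_abs_diff(ys, ys))


if __name__ == "__main__":
    coupling = [[1.0, 0.2, -0.1],
                [0.2, 1.0, 0.4],
                [-0.1, 0.4, 1.0]]
    reference = [-0.3, -0.1, 0.0, 0.1, 0.3]  # stand-in for normal samples
    print(energy_distance_sq(off_diagonal(coupling), reference))
```

The distance is zero when both samples are identical and grows as the off-diagonal entries drift away from the reference sample.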

Title: Text4Seg++: Advancing Image Segmentation via Generative Language Modeling

Authors: Mengcheng Lan, Chaofeng Chen, Jiaxing Xu, Zongrui Li, Yiping Ke, Xudong Jiang, Yingchen Yu, Yunqing Zhao, Song Bai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.06321
Pdf URL: https://arxiv.org/pdf/2509.06321
Copy Paste: [[2509.06321]] Text4Seg++: Advancing Image Segmentation via Generative Language Modeling(https://arxiv.org/abs/2509.06321)
Keywords: generation, generative
Abstract: Multimodal Large Language Models (MLLMs) have shown exceptional capabilities in vision-language tasks. However, effectively integrating image segmentation into these models remains a significant challenge. In this work, we propose a novel text-as-mask paradigm that casts image segmentation as a text generation problem, eliminating the need for additional decoders and significantly simplifying the segmentation process. Our key innovation is semantic descriptors, a new textual representation of segmentation masks where each image patch is mapped to its corresponding text label. We first introduce image-wise semantic descriptors, a patch-aligned textual representation of segmentation masks that integrates naturally into the language modeling pipeline. To enhance efficiency, we introduce Row-wise Run-Length Encoding (R-RLE), which compresses redundant text sequences, reducing the length of semantic descriptors by 74% and accelerating inference by $3\times$, without compromising performance. Building upon this, our initial framework Text4Seg achieves strong segmentation performance across a wide range of vision tasks. To further improve granularity and compactness, we propose box-wise semantic descriptors, which localize regions of interest using bounding boxes and represent region masks via structured mask tokens called semantic bricks. This leads to our refined model, Text4Seg++, which formulates segmentation as a next-brick prediction task, combining precision, scalability, and generative efficiency. Comprehensive experiments on natural and remote sensing datasets show that Text4Seg++ consistently outperforms state-of-the-art models across diverse benchmarks without any task-specific fine-tuning, while remaining compatible with existing MLLM backbones. Our work highlights the effectiveness, scalability, and generalizability of text-driven image segmentation within the MLLM framework.
摘要：多模式的大型语言模型（MLLM）在视觉任务中表现出了出色的功能。但是，有效地将图像分割整合到这些模型中仍然是一个重大挑战。在这项工作中，我们提出了一种新颖的文本掩码范式，将图像分割作为文本生成问题，消除了对其他解码器的需求，并显着简化了分割过程。我们的关键创新是语义描述符，这是一种分割掩码的新文本表示，每个图像补丁都映射到其相应的文本标签。我们首先介绍图像的语义描述符，这是一种与贴片对准的分割掩模的文本表示，将自然集成到语言建模管道中。为了提高效率，我们介绍了行式运行长度编码（R-RLE），该编码（R-RLE）压缩冗余文本序列，将语义描述符的长度降低了74％，并将推论加速$ 3 \ times $，而不会损害性能。在此基础上，我们的初始框架Text4Seg在广泛的视力任务中实现了强大的细分性能。为了进一步改善粒度和紧凑性，我们提出了框的语义描述符，该描述符，该描述符使用边界框来定位感兴趣的区域，并通过称为语义砖的结构化掩码标记来代表区域蒙版。这导致了我们精致的模型Text4Seg ++，该模型将分割作为下一个砖头预测任务，结合了精度，可扩展性和生成效率。关于自然和遥感数据集的全面实验表明，Text4Seg ++始终超过不同基准测试的最先进模型，而没有任何特定于任务的微调，同时与现有的MLLM骨干保持兼容。我们的工作突出了MLLM框架内文本驱动图像分割的有效性，可扩展性和概括性。
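
The idea behind Row-wise Run-Length Encoding in the Text4Seg++ abstract can be illustrated on a small grid of patch labels. This is a minimal sketch, assuming each image patch carries one text label and that runs are compressed independently within each row; the function names and the `label*count` text format are hypothetical and may differ from the tokens the paper actually emits.

```python
from itertools import groupby


def row_rle(grid):
    """Run-length encode each row of a 2-D grid of patch labels."""
    return [[(label, len(list(run))) for label, run in groupby(row)]
            for row in grid]


def row_rle_decode(encoded):
    """Expand the (label, count) runs back into the full label grid."""
    return [[label for label, count in row for _ in range(count)]
            for row in encoded]


def to_text(encoded):
    """Render runs as text, one line per row (illustrative format only)."""
    return "\n".join(" | ".join(f"{label}*{count}" for label, count in row)
                     for row in encoded)


if __name__ == "__main__":
    grid = [["sky", "sky", "sky", "tree"],
            ["sky", "tree", "tree", "tree"]]
    encoded = row_rle(grid)
    print(to_text(encoded))
    assert row_rle_decode(encoded) == grid
```

Encoding per row keeps each line aligned with one row of patches, so the decoded text maps back to the patch grid without tracking runs that wrap across row boundaries.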

Title: Towards scalable organ level 3D plant segmentation: Bridging the data algorithm computing gap

Authors: Ruiming Du, Guangxun Zhai, Tian Qiu, Yu Jiang
Subjects: cs.CV, q-bio.QM
Abstract URL: https://arxiv.org/abs/2509.06329
Pdf URL: https://arxiv.org/pdf/2509.06329
Copy Paste: [[2509.06329]] Towards scalable organ level 3D plant segmentation: Bridging the data algorithm computing gap(https://arxiv.org/abs/2509.06329)
Keywords: generation
Abstract: The precise characterization of plant morphology provides valuable insights into plant environment interactions and genetic evolution. A key technology for extracting this information is 3D segmentation, which delineates individual plant organs from complex point clouds. Despite significant progress in general 3D computer vision domains, the adoption of 3D segmentation for plant phenotyping remains limited by three major challenges: i) the scarcity of large-scale annotated datasets, ii) technical difficulties in adapting advanced deep neural networks to plant point clouds, and iii) the lack of standardized benchmarks and evaluation protocols tailored to plant science. This review systematically addresses these barriers by: i) providing an overview of existing 3D plant datasets in the context of general 3D segmentation domains, ii) systematically summarizing deep learning-based methods for point cloud semantic and instance segmentation, iii) introducing Plant Segmentation Studio (PSS), an open-source framework for reproducible benchmarking, and iv) conducting extensive quantitative experiments to evaluate representative networks and sim-to-real learning strategies. Our findings highlight the efficacy of sparse convolutional backbones and transformer-based instance segmentation, while also emphasizing the complementary role of modeling-based and augmentation-based synthetic data generation for sim-to-real learning in reducing annotation demands. In general, this study bridges the gap between algorithmic advances and practical deployment, providing immediate tools for researchers and a roadmap for developing data-efficient and generalizable deep learning solutions in 3D plant phenotyping. Data and code are available at this https URL.
摘要：植物形态的精确表征为植物环境相互作用和遗传进化提供了宝贵的见解。提取此信息的关键技术是3D分割，它从复杂点云中描述了单个植物器官。尽管在一般3D计算机视觉域中取得了重大进展，但在三个主要挑战中采用了3D分割来限制植物表型的限制：i）大规模注释的数据集的稀缺性，ii）ii）将先进的深层神经网络适应植物点云的技术困难，以及iii）缺乏标准化的基准标准和评估协议对植物科学的尾声。这篇综述系统地通过以下方式解决了这些障碍：i）在一般3D分段域的背景下提供现有3D工厂数据集的概述，ii）系统地汇总了基于深度学习的方法，用于点云语义和实例分段，iii III），介绍工厂细分录像带（PSS），用于扩展的量子，用于扩展的量子，以评估量子的量子，并进行了定量的量子，并进行了量子的量子，并进行了用于复制的台阶，IV）的台词，IV），IV）和IV）网络和模拟学习策略。我们的发现突出了稀疏卷积骨架和基于变压器的实例分割的功效，同时还强调了基于建模的基于建模和基于增强的合成数据生成在减少注释需求中的SIM到现实学习的互补作用。总的来说，这项研究弥合了算法进步与实际部署之间的差距，为研究人员提供了直接的工具，并为在3D工厂表型中开发数据效率和可推广的深度学习解决方案提供了路线图。数据和代码可在此HTTPS URL上找到。

Title: A Fragile Number Sense: Probing the Elemental Limits of Numerical Reasoning in LLMs

Authors: Roussel Rahman, Aashwin Ananda Mishra
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2509.06332
Pdf URL: https://arxiv.org/pdf/2509.06332
Copy Paste: [[2509.06332]] A Fragile Number Sense: Probing the Elemental Limits of Numerical Reasoning in LLMs(https://arxiv.org/abs/2509.06332)
Keywords: generative
Abstract: Large Language Models (LLMs) have demonstrated remarkable emergent capabilities, yet the robustness of their numerical reasoning remains an open question. While standard benchmarks evaluate LLM reasoning on complex problem sets using aggregated metrics, they often obscure foundational weaknesses. In this work, we probe LLM mathematical numeracy by evaluating performance on problems of escalating complexity, from constituent operations to combinatorial puzzles. We test several state-of-the-art LLM-based agents on a 100-problem challenge comprising four categories: (1) basic arithmetic, (2) advanced operations, (3) primality checking, and (4) the Game of 24 number puzzle. Our results show that while the agents achieved high accuracy on the first three categories, which require deterministic algorithmic execution, they consistently failed at the number puzzle, underlining its demand for a heuristic search over a large combinatorial space to be a significant bottleneck. These findings reveal that the agents' proficiency is largely confined to recalling and executing known algorithms, rather than performing generative problem-solving. This suggests their apparent numerical reasoning is more akin to sophisticated pattern-matching than flexible, analytical thought, limiting their potential for tasks that require novel or creative numerical insights.
摘要：大型语言模型（LLM）表现出了出色的新兴能力，但其数值推理的鲁棒性仍然是一个悬而未决的问题。尽管标准基准测试使用汇总指标在复杂问题集上评估LLM推理，但它们通常掩盖了基础弱点。在这项工作中，我们通过评估从组成型操作到组合拼图的复杂性问题的性能来探测数学算法。我们在100个问题的挑战中测试了几种基于LLM的最先进的代理，其中包括四个类别：（1）基本算术，（2）高级操作，（3）Primality检查，以及（4）24个数字拼图的游戏。我们的结果表明，尽管代理在前三个类别上实现了高精度，这需要确定性的算法执行，但它们在数字难题中始终失败，强调了其对大型组合空间的启发式搜索的需求，以使其成为重要的瓶颈。这些发现表明，代理商的熟练程度在很大程度上仅限于回忆和执行已知算法，而不是执行生成性问题解决。这表明他们的明显数值推理更类似于复杂的图案匹配，而不是灵活的，分析的思想，限制了它们需要新颖或创造性的数值见解的任务的潜力。
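
The Game of 24 mentioned in the abstract asks for an arithmetic expression over four given numbers that evaluates to 24. The search space the abstract calls "a large combinatorial space" can be explored exhaustively by a short program. This is a minimal brute-force sketch, not the evaluation harness used in the paper; `solve24` and `_search` are hypothetical names, and exact fractions are used so that division does not introduce rounding errors.

```python
from fractions import Fraction
from itertools import combinations


def solve24(numbers, target=24):
    """Return an expression string that reaches the target, or None."""
    items = [(Fraction(n), str(n)) for n in numbers]
    return _search(items, Fraction(target))


def _search(items, target):
    # Base case: one value left, check whether it equals the target.
    if len(items) == 1:
        value, expr = items[0]
        return expr if value == target else None
    # Pick any two values, combine them with every operation, and recurse.
    for i, j in combinations(range(len(items)), 2):
        (a, ea), (b, eb) = items[i], items[j]
        rest = [items[k] for k in range(len(items)) if k not in (i, j)]
        candidates = [
            (a + b, f"({ea} + {eb})"),
            (a * b, f"({ea} * {eb})"),
            (a - b, f"({ea} - {eb})"),
            (b - a, f"({eb} - {ea})"),
        ]
        if b != 0:
            candidates.append((a / b, f"({ea} / {eb})"))
        if a != 0:
            candidates.append((b / a, f"({eb} / {ea})"))
        for value, expr in candidates:
            found = _search(rest + [(value, expr)], target)
            if found is not None:
                return found
    return None


if __name__ == "__main__":
    print(solve24([4, 7, 8, 8]))
    print(solve24([1, 1, 1, 1]))
```

The contrast with the abstract's finding is that this search is mechanical once written down, whereas the tested agents had to discover a valid combination without being handed such a procedure.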

Title: Your Super Resolution Model is not Enough for Tackling Real-World Scenarios

Authors: Dongsik Yoon, Jongeun Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.06387
Pdf URL: https://arxiv.org/pdf/2509.06387
Copy Paste: [[2509.06387]] Your Super Resolution Model is not Enough for Tackling Real-World Scenarios(https://arxiv.org/abs/2509.06387)
Keywords: super-resolution
Abstract: Despite remarkable progress in Single Image Super-Resolution (SISR), traditional models often struggle to generalize across varying scale factors, limiting their real-world applicability. To address this, we propose a plug-in Scale-Aware Attention Module (SAAM) designed to retrofit modern fixed-scale SR models with the ability to perform arbitrary-scale SR. SAAM employs lightweight, scale-adaptive feature extraction and upsampling, incorporating the Simple parameter-free Attention Module (SimAM) for efficient guidance and gradient variance loss to enhance sharpness in image details. Our method integrates seamlessly into multiple state-of-the-art SR backbones (e.g., SCNet, HiT-SR, OverNet), delivering competitive or superior performance across a wide range of integer and non-integer scale factors. Extensive experiments on benchmark datasets demonstrate that our approach enables robust multi-scale upscaling with minimal computational overhead, offering a practical solution for real-world scenarios.
摘要：尽管在单像超分辨率（SISR）方面取得了显着进展，但传统模型通常很难跨越不同的规模因素，从而限制了其现实世界的适用性。为了解决这个问题，我们提出了一个旨在改造现代固定尺度SR模型的插入式标尺感注意模块（SAAM），能够执行任意规模的SR。 SAAM采用轻巧的，自适应的特征提取和上采样，并结合了简单的无参数注意模块（SIMAM），以实现有效的指导和梯度差异损失，以增强图像细节的清晰度。我们的方法将无缝集成到多个最先进的SR骨架中（例如SCNET，HIT-SR，OVERNET），在广泛的整数和非整数规模因素上提供竞争性或优越的性能。基准数据集的广泛实验表明，我们的方法可以通过最小的计算开销来实现强大的多尺度上尺度进行尺度，从而为现实世界情景提供了实用的解决方案。

Title: VQualA 2025 Challenge on Image Super-Resolution Generated Content Quality Assessment: Methods and Results

Authors: Yixiao Li, Xin Li, Chris Wei Zhou, Shuo Xing, Hadi Amirpour, Xiaoshuai Hao, Guanghui Yue, Baoquan Zhao, Weide Liu, Xiaoyuan Yang, Zhengzhong Tu, Xinyu Li, Chuanbiao Song, Chenqi Zhang, Jun Lan, Huijia Zhu, Weiqiang Wang, Xiaoyan Sun, Shishun Tian, Dongyang Yan, Weixia Zhang, Junlin Chen, Wei Sun, Zhihua Wang, Zhuohang Shi, Zhizun Luo, Hang Ouyang, Tianxin Xiao, Fan Yang, Zhaowang Wu, Kaixin Deng
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2509.06413
Pdf URL: https://arxiv.org/pdf/2509.06413
Copy Paste: [[2509.06413]] VQualA 2025 Challenge on Image Super-Resolution Generated Content Quality Assessment: Methods and Results(https://arxiv.org/abs/2509.06413)
Keywords: super-resolution, generative, quality assessment
Abstract: This paper presents the ISRGC-Q Challenge, built upon the Image Super-Resolution Generated Content Quality Assessment (ISRGen-QA) dataset, and organized as part of the Visual Quality Assessment (VQualA) Competition at the ICCV 2025 Workshops. Unlike existing Super-Resolution Image Quality Assessment (SR-IQA) datasets, ISRGen-QA places a greater emphasis on SR images generated by the latest generative approaches, including Generative Adversarial Networks (GANs) and diffusion models. The primary goal of this challenge is to analyze the unique artifacts introduced by modern super-resolution techniques and to evaluate their perceptual quality effectively. A total of 108 participants registered for the challenge, with 4 teams submitting valid solutions and fact sheets for the final testing phase. These submissions demonstrated state-of-the-art (SOTA) performance on the ISRGen-QA dataset. The project is publicly available at: this https URL.

Title: CAPMix: Robust Time Series Anomaly Detection Based on Abnormal Assumptions with Dual-Space Mixup

Authors: Xudong Mou, Rui Wang, Tiejun Wang, Renyu Yang, Shiru Chen, Jie Sun, Tianyu Wo, Xudong Liu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2509.06419
Pdf URL: https://arxiv.org/pdf/2509.06419
Copy Paste: [[2509.06419]] CAPMix: Robust Time Series Anomaly Detection Based on Abnormal Assumptions with Dual-Space Mixup(https://arxiv.org/abs/2509.06419)
Keywords: generation
Abstract: Time series anomaly detection (TSAD) is a vital yet challenging task, particularly in scenarios where labeled anomalies are scarce and temporal dependencies are complex. Recent anomaly assumption (AA) approaches alleviate the lack of anomalies by injecting synthetic samples and training discriminative models. Despite promising results, these methods often suffer from two fundamental limitations: patchy generation, where scattered anomaly knowledge leads to overly simplistic or incoherent anomaly injection, and Anomaly Shift, where synthetic anomalies either resemble normal data too closely or diverge unrealistically from real anomalies, thereby distorting classification boundaries. In this paper, we propose CAPMix, a controllable anomaly augmentation framework that addresses both issues. First, we design a CutAddPaste mechanism to inject diverse and complex anomalies in a targeted manner, avoiding patchy generation. Second, we introduce a label revision strategy to adaptively refine anomaly labels, reducing the risk of anomaly shift. Finally, we employ dual-space mixup within a temporal convolutional network to enforce smoother and more robust decision boundaries. Extensive experiments on five benchmark datasets, including AIOps, UCR, SWaT, WADI, and ESA, demonstrate that CAPMix achieves significant improvements over state-of-the-art baselines, with enhanced robustness against contaminated training data. The code is available at this https URL.
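
The mixup step mentioned in the CAPMix abstract can be illustrated with a minimal sketch. This is generic input-space mixup on two equal-length windows, not the paper's dual-space mixup or its CutAddPaste mechanism; the function names, the Beta parameter `alpha`, and the toy windows are assumptions made for illustration.

```python
import random


def mixup_windows(x_a, y_a, x_b, y_b, lam):
    """Return the convex combination of two windows and of their labels."""
    if len(x_a) != len(x_b):
        raise ValueError("windows must have the same length")
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    x_mix = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y_mix = lam * y_a + (1.0 - lam) * y_b  # soft label in [0, 1]
    return x_mix, y_mix


def sample_lambda(alpha=0.4, rng=random):
    """Draw the mixing weight from Beta(alpha, alpha), as in standard mixup."""
    return rng.betavariate(alpha, alpha)


# Toy example: label 0.0 marks a normal window, 1.0 a synthetic anomaly.
normal = [0.0, 0.1, 0.0, -0.1]
anomalous = [0.0, 2.0, 2.0, 0.0]
x_mix, y_mix = mixup_windows(normal, 0.0, anomalous, 1.0, lam=0.75)
```

Training on such soft-labelled mixtures is one common way to smooth a decision boundary; how CAPMix applies mixup in its second space is not specified in the abstract.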

Title: Perception-oriented Bidirectional Attention Network for Image Super-resolution Quality Assessment

Authors: Yixiao Li, Xiaoyuan Yang, Guanghui Yue, Jun Fu, Qiuping Jiang, Xu Jia, Paul L. Rosin, Hantao Liu, Wei Zhou
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2509.06442
Pdf URL: https://arxiv.org/pdf/2509.06442
Copy Paste: [[2509.06442]] Perception-oriented Bidirectional Attention Network for Image Super-resolution Quality Assessment(https://arxiv.org/abs/2509.06442)
Keywords: super-resolution, generation, quality assessment
Abstract: Many super-resolution (SR) algorithms have been proposed to increase image resolution. However, full-reference (FR) image quality assessment (IQA) metrics for comparing and evaluating different SR algorithms are limited. In this work, we propose the Perception-oriented Bidirectional Attention Network (PBAN) for image SR FR-IQA, which is composed of three modules: an image encoder module, a perception-oriented bidirectional attention (PBA) module, and a quality prediction module. First, we encode the input images for feature representations. Inspired by the characteristics of the human visual system, we then construct the perception-oriented PBA module. Specifically, different from existing attention-based SR IQA methods, we conceive a Bidirectional Attention to bidirectionally construct visual attention to distortion, which is consistent with the generation and evaluation processes of SR images. To further guide the quality assessment towards the perception of distorted information, we propose Grouped Multi-scale Deformable Convolution, enabling the proposed method to adaptively perceive distortion. Moreover, we design Sub-information Excitation Convolution to direct visual perception to both sub-pixel and sub-channel attention. Finally, the quality prediction module is exploited to integrate quality-aware features and regress quality scores. Extensive experiments demonstrate that our proposed PBAN outperforms state-of-the-art quality assessment methods.

Title: A Statistical 3D Stomach Shape Model for Anatomical Analysis

Authors: Erez Posner, Ore Shtalrid, Oded Erell, Daniel Noy, Moshe Bouhnik
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.06464
Pdf URL: https://arxiv.org/pdf/2509.06464
Copy Paste: [[2509.06464]] A Statistical 3D Stomach Shape Model for Anatomical Analysis(https://arxiv.org/abs/2509.06464)
Keywords: generation
Abstract: Realistic and parameterized 3D models of human anatomy have become invaluable in research, diagnostics, and surgical planning. However, the development of detailed models for internal organs, such as the stomach, has been limited by data availability and methodological challenges. In this paper, we propose a novel pipeline for the generation of synthetic 3D stomach models, enabling the creation of anatomically diverse morphologies informed by established studies on stomach shape variability. Using this pipeline, we construct a dataset of synthetic stomachs. Building on this dataset, we develop a 3D statistical shape model of the stomach, trained to capture natural anatomical variability in a low-dimensional shape space. The model is further refined using CT meshes derived from publicly available datasets through a semi-supervised alignment process, enhancing its ability to generalize to unseen anatomical variations. We evaluated the model on a held-out test set of real stomach CT scans, demonstrating robust generalization and fit accuracy. We make the statistical shape model along with the synthetic dataset publicly available on GitLab: this https URL to facilitate further research. This work introduces the first statistical 3D shape model of the stomach, with applications ranging from surgical simulation and pre-operative planning to medical education and computational modeling. By combining synthetic data generation, parametric modeling, and real-world validation, our approach represents a significant advancement in organ modeling and opens new possibilities for personalized healthcare solutions.

Title: TIDE: Achieving Balanced Subject-Driven Image Generation via Target-Instructed Diffusion Enhancement

Authors: Jibai Lin, Bo Ma, Yating Yang, Rong Ma, Turghun Osman, Ahtamjan Ahmat, Rui Dong, Lei Wang, Xi Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.06499
Pdf URL: https://arxiv.org/pdf/2509.06499
Copy Paste: [[2509.06499]] TIDE: Achieving Balanced Subject-Driven Image Generation via Target-Instructed Diffusion Enhancement(https://arxiv.org/abs/2509.06499)
Keywords: generation
Abstract: Subject-driven image generation (SDIG) aims to manipulate specific subjects within images while adhering to textual instructions, a task crucial for advancing text-to-image diffusion models. SDIG requires reconciling the tension between maintaining subject identity and complying with dynamic edit instructions, a challenge inadequately addressed by existing methods. In this paper, we introduce the Target-Instructed Diffusion Enhancing (TIDE) framework, which resolves this tension through target supervision and preference learning without test-time fine-tuning. TIDE pioneers target-supervised triplet alignment, modelling subject adaptation dynamics using a (reference image, instruction, target images) triplet. This approach leverages the Direct Subject Diffusion (DSD) objective, training the model with paired "winning" (balanced preservation-compliance) and "losing" (distorted) targets, systematically generated and evaluated via quantitative metrics. This enables implicit reward modelling for optimal preservation-compliance balance. Experimental results on standard benchmarks demonstrate TIDE's superior performance in generating subject-faithful outputs while maintaining instruction compliance, outperforming baseline methods across multiple quantitative metrics. TIDE's versatility is further evidenced by its successful application to diverse tasks, including structural-conditioned generation, image-to-image generation, and text-image interpolation. Our code is available at this https URL.

Title: On optimal solutions of classical and sliced Wasserstein GANs with non-Gaussian data

Authors: Yu-Jui Huang, Hsin-Hua Shen, Yu-Chih Huang, Wan-Yi Lin, Shih-Chun Lin
Subjects: cs.LG, cs.IT, stat.ML
Abstract URL: https://arxiv.org/abs/2509.06505
Pdf URL: https://arxiv.org/pdf/2509.06505
Copy Paste: [[2509.06505]] On optimal solutions of classical and sliced Wasserstein GANs with non-Gaussian data(https://arxiv.org/abs/2509.06505)
Keywords: generative
Abstract: The generative adversarial network (GAN) aims to approximate an unknown distribution via a parameterized neural network (NN). While GANs have been widely applied in reinforcement and semi-supervised learning as well as computer vision tasks, selecting their parameters often requires an exhaustive search, and only a few selection methods can be proved theoretically optimal. One of the most promising GAN variants is the Wasserstein GAN (WGAN). Prior work on optimal parameters for WGAN is limited to the linear-quadratic-Gaussian (LQG) setting, where the NN is linear and the data is Gaussian. In this paper, we focus on the characterization of optimal WGAN parameters beyond the LQG setting. We derive closed-form optimal parameters for one-dimensional WGANs when the NN has non-linear activation functions and the data is non-Gaussian. To extend this to high-dimensional WGANs, we adopt the sliced Wasserstein framework and replace the constraint on marginal distributions of the randomly projected data with a constraint on the joint distribution of the original (unprojected) data. We show that the linear generator can be asymptotically optimal for sliced WGAN with non-Gaussian data. Empirical studies show that our closed-form WGAN parameters have good convergence behavior with data under both Gaussian and Laplace distributions. Also, compared to the r-principal component analysis (r-PCA) solution, our proposed solution for sliced WGAN can achieve the same performance while requiring fewer computational resources.
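
The sliced Wasserstein framework referred to above reduces a high-dimensional comparison to many one-dimensional ones, where the Wasserstein distance between equal-size empirical samples has a closed form via sorting. The sketch below is a plain Monte Carlo estimate under that assumption; it does not reproduce the paper's closed-form WGAN parameters, and the function names and default number of projections are illustrative choices.

```python
import math
import random


def wasserstein_1d(xs, ys, p=1):
    """p-Wasserstein distance between two equal-size 1D empirical samples."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    xs_sorted, ys_sorted = sorted(xs), sorted(ys)
    cost = sum(abs(a - b) ** p for a, b in zip(xs_sorted, ys_sorted)) / len(xs)
    return cost ** (1.0 / p)


def random_unit_vector(dim, rng):
    """Sample a direction uniformly on the unit sphere."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def sliced_wasserstein(X, Y, n_projections=64, p=1, seed=0):
    """Average the 1D distance over random projections of X and Y."""
    rng = random.Random(seed)
    dim = len(X[0])
    total = 0.0
    for _ in range(n_projections):
        theta = random_unit_vector(dim, rng)
        proj_x = [sum(t * c for t, c in zip(theta, x)) for x in X]
        proj_y = [sum(t * c for t, c in zip(theta, y)) for y in Y]
        total += wasserstein_1d(proj_x, proj_y, p) ** p
    return (total / n_projections) ** (1.0 / p)


X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
Y = [[x0 + 2.0, x1 + 2.0] for x0, x1 in X]  # X shifted by (2, 2)
distance = sliced_wasserstein(X, Y)
```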

Title: Predicting Fetal Outcomes from Cardiotocography Signals Using a Supervised Variational Autoencoder

Authors: John Tolladay, Beth Albert, Gabriel Davis Jones
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.06540
Pdf URL: https://arxiv.org/pdf/2509.06540
Copy Paste: [[2509.06540]] Predicting Fetal Outcomes from Cardiotocography Signals Using a Supervised Variational Autoencoder(https://arxiv.org/abs/2509.06540)
Keywords: generative
Abstract: Objective: To develop and interpret a supervised variational autoencoder (VAE) model for classifying cardiotocography (CTG) signals based on pregnancy outcomes, addressing interpretability limits of current deep learning approaches. Methods: The OxMat CTG dataset was used to train a VAE on five-minute fetal heart rate (FHR) segments, labeled with postnatal outcomes. The model was optimised for signal reconstruction and outcome prediction, incorporating Kullback-Leibler divergence and total correlation (TC) constraints to structure the latent space. Performance was evaluated using area under the receiver operating characteristic curve (AUROC) and mean squared error (MSE). Interpretability was assessed using coefficient of determination, latent traversals and unsupervised component analyses. Results: The model achieved an AUROC of 0.752 at the segment level and 0.779 at the CTG level, where predicted scores were aggregated. Relaxing TC constraints improved both reconstruction and classification. Latent analysis showed that baseline-related features (e.g., FHR baseline, baseline shift) were well represented and aligned with model scores, while metrics like short- and long-term variability were less strongly encoded. Traversals revealed clear signal changes for baseline features, while other properties were entangled or subtle. Unsupervised decompositions corroborated these patterns. Findings: This work demonstrates that supervised VAEs can achieve competitive fetal outcome prediction while partially encoding clinically meaningful CTG features. The irregular, multi-timescale nature of FHR signals poses challenges for disentangling physiological components, distinguishing CTG from more periodic signals such as ECG. Although full interpretability was not achieved, the model supports clinically useful outcome prediction and provides a basis for future interpretable, generative models.
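
A supervised VAE objective of the kind described above combines a reconstruction term, a Kullback-Leibler term, and an outcome-prediction term. The sketch below shows one such combination for a single sample; it omits the total correlation constraint, and the weights `beta` and `gamma` as well as the function names are assumptions rather than the authors' settings.

```python
import math


def gaussian_kl(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ) in closed form."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, logvar))


def mse(x, x_hat):
    """Mean squared reconstruction error over one signal segment."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def binary_cross_entropy(p, y, eps=1e-12):
    """Cross-entropy between predicted probability p and binary label y."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))


def supervised_vae_loss(x, x_hat, mu, logvar, p_outcome, y, beta=1.0, gamma=1.0):
    """Reconstruction + beta * KL + gamma * classification loss."""
    return (
        mse(x, x_hat)
        + beta * gaussian_kl(mu, logvar)
        + gamma * binary_cross_entropy(p_outcome, y)
    )


# Perfect reconstruction, posterior equal to the prior, uninformative classifier:
loss = supervised_vae_loss(
    x=[0.1, 0.2], x_hat=[0.1, 0.2], mu=[0.0], logvar=[0.0], p_outcome=0.5, y=1.0
)
```

In this toy case only the classification term is non-zero, so the loss equals log 2.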

Title: Group Effect Enhanced Generative Adversarial Imitation Learning for Individual Travel Behavior Modeling under Incentives

Authors: Yuanyuan Wu, Zhenlin Qin, Leizhen Wang, Xiaolei Ma, Zhenliang Ma
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.06656
Pdf URL: https://arxiv.org/pdf/2509.06656
Copy Paste: [[2509.06656]] Group Effect Enhanced Generative Adversarial Imitation Learning for Individual Travel Behavior Modeling under Incentives(https://arxiv.org/abs/2509.06656)
Keywords: generative
Abstract: Understanding and modeling individual travel behavior responses is crucial for urban mobility regulation and policy evaluation. The Markov decision process (MDP) provides a structured framework for dynamic travel behavior modeling at the individual level. However, solving an MDP in this context is highly data-intensive and faces challenges of data quantity, spatial-temporal coverage, and situational diversity. To address these, we propose a group-effect-enhanced generative adversarial imitation learning (gcGAIL) model that improves the individual behavior modeling efficiency by leveraging shared behavioral patterns among passenger groups. We validate the gcGAIL model using a public transport fare-discount case study and compare against state-of-the-art benchmarks, including adversarial inverse reinforcement learning (AIRL), baseline GAIL, and conditional GAIL. Experimental results demonstrate that gcGAIL outperforms these methods in learning individual travel behavior responses to incentives over time in terms of accuracy, generalization, and pattern demonstration efficiency. Notably, gcGAIL is robust to spatial variation, data sparsity, and behavioral diversity, maintaining strong performance even with partial expert demonstrations and underrepresented passenger groups. The gcGAIL model predicts the individual behavior response at any time, providing the basis for personalized incentives to induce sustainable behavior changes (better timing of incentive injections).

Title: STAGE: Segmentation-oriented Industrial Anomaly Synthesis via Graded Diffusion with Explicit Mask Alignment

Authors: Xichen Xu, Yanshu Wang, Jinbao Wang, Qunyi Zhang, Xiaoning Lei, Guoyang Xie, Guannan Jiang, Zhichao Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.06693
Pdf URL: https://arxiv.org/pdf/2509.06693
Copy Paste: [[2509.06693]] STAGE: Segmentation-oriented Industrial Anomaly Synthesis via Graded Diffusion with Explicit Mask Alignment(https://arxiv.org/abs/2509.06693)
Keywords: generation
Abstract: Segmentation-oriented Industrial Anomaly Synthesis (SIAS) plays a pivotal role in enhancing the performance of downstream anomaly segmentation, as it provides an effective means of expanding abnormal data. However, existing SIAS methods face several critical limitations: (i) the synthesized anomalies often lack intricate texture details and fail to align precisely with the surrounding background, and (ii) they struggle to generate fine-grained, pixel-level anomalies. To address these challenges, we propose Segmentation-oriented Anomaly synthesis via Graded diffusion with Explicit mask alignment, termed STAGE. STAGE introduces a novel anomaly inference strategy that incorporates clean background information as a prior to guide the denoising distribution, enabling the model to more effectively distinguish and highlight abnormal foregrounds. Furthermore, it employs a graded diffusion framework with an anomaly-only branch to explicitly record local anomalies during both the forward and reverse processes, ensuring that subtle anomalies are not overlooked. Finally, STAGE incorporates the explicit mask alignment (EMA) strategy to progressively align the synthesized anomalies with the background, resulting in context-consistent and structurally coherent generations. Extensive experiments on the MVTec and BTAD datasets demonstrate that STAGE achieves state-of-the-art performance in SIAS, which in turn enhances downstream anomaly segmentation.

Title: Nested Optimal Transport Distances

Authors: Ruben Bontorno, Songyan Hou
Subjects: cs.LG, q-fin.CP
Abstract URL: https://arxiv.org/abs/2509.06702
Pdf URL: https://arxiv.org/pdf/2509.06702
Copy Paste: [[2509.06702]] Nested Optimal Transport Distances(https://arxiv.org/abs/2509.06702)
Keywords: generation, generative
Abstract: Simulating realistic financial time series is essential for stress testing, scenario generation, and decision-making under uncertainty. Despite advances in deep generative models, there is no consensus metric for their evaluation. We focus on generative AI for financial time series in decision-making applications and employ the nested optimal transport distance, a time-causal variant of optimal transport distance, which is robust to tasks such as hedging, optimal stopping, and reinforcement learning. Moreover, we propose a statistically consistent, naturally parallelizable algorithm for its computation, achieving substantial speedups over existing approaches.

Title: Zero-shot 3D-Aware Trajectory-Guided image-to-video generation via Test-Time Training

Authors: Ruicheng Zhang, Jun Zhou, Zunnan Xu, Zihao Liu, Jiehui Huang, Mingyang Zhang, Yu Sun, Xiu Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.06723
Pdf URL: https://arxiv.org/pdf/2509.06723
Copy Paste: [[2509.06723]] Zero-shot 3D-Aware Trajectory-Guided image-to-video generation via Test-Time Training(https://arxiv.org/abs/2509.06723)
Keywords: generation, generative
Abstract: Trajectory-Guided image-to-video (I2V) generation aims to synthesize videos that adhere to user-specified motion instructions. Existing methods typically rely on computationally expensive fine-tuning on scarce annotated datasets. Although some zero-shot methods attempt trajectory control in the latent space, they may yield unrealistic motion by neglecting 3D perspective and creating a misalignment between the manipulated latents and the network's noise predictions. To address these challenges, we introduce Zo3T, a novel zero-shot test-time-training framework for trajectory-guided generation with three core innovations: First, we incorporate a 3D-Aware Kinematic Projection, leveraging inferred scene depth to derive perspective-correct affine transformations for target regions. Second, we introduce Trajectory-Guided Test-Time LoRA, a mechanism that dynamically injects ephemeral LoRA adapters into the denoising network and optimizes them alongside the latent state. Driven by a regional feature consistency loss, this co-adaptation effectively enforces motion constraints while allowing the pre-trained model to locally adapt its internal representations to the manipulated latent, thereby ensuring generative fidelity and on-manifold adherence. Finally, we develop Guidance Field Rectification, which refines the denoising evolutionary path by optimizing the conditional guidance field through a one-step lookahead strategy, ensuring efficient generative progression towards the target trajectory. Zo3T significantly enhances 3D realism and motion accuracy in trajectory-controlled I2V generation, demonstrating superior performance over existing training-based and zero-shot approaches.

Title: Raw2Event: Converting Raw Frame Camera into Event Camera

Authors: Zijie Ning, Enmin Lin, Sudarshan R. Iyengar, Patrick Vandewalle
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.06767
Pdf URL: https://arxiv.org/pdf/2509.06767
Copy Paste: [[2509.06767]] Raw2Event: Converting Raw Frame Camera into Event Camera(https://arxiv.org/abs/2509.06767)
Keywords: generation
Abstract: Event cameras offer unique advantages such as high temporal resolution, low latency, and high dynamic range, making them more and more popular for vision tasks under challenging light conditions. However, their high cost, limited resolution, and lack of features such as autofocus hinder their broad adoption, particularly for early-stage development and prototyping. In this work, we present Raw2Event, a complete hardware-software system that enables real-time event generation from low-cost raw frame-based cameras. By leveraging direct access to raw Bayer data and bypassing traditional image signal processors (ISP), our system is able to utilize the full potential of camera hardware, delivering higher dynamic range, higher resolution, and more faithful output than RGB-based frame-to-event converters. Built upon the DVS-Voltmeter model, Raw2Event features a configurable simulation framework optimized for deployment on embedded platforms. We further design a data acquisition pipeline that supports synchronized recording of raw, RGB, and event streams, facilitating downstream evaluation and dataset creation. Experimental results show that Raw2Event can generate event streams closely resembling those from real event cameras, while benefiting from higher resolution and autofocus capabilities. The system also supports user-intuitive parameter tuning, enabling flexible adaptation to various application requirements. Finally, we deploy the system on a Raspberry Pi for real-time operation, providing a scalable and cost-effective solution for event-based vision research and early-stage system development. The codes are available online: this https URL.
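
Frame-to-event conversion can be illustrated with the idealized event-camera model, in which a pixel emits an event each time its log intensity changes by a fixed contrast threshold. The sketch below implements only that textbook model; it is not the DVS-Voltmeter model that Raw2Event builds on, it stamps every event with the frame time instead of interpolating, and the function name, `threshold`, and `eps` are assumptions.

```python
import math


def frames_to_events(frames, timestamps, threshold=0.2, eps=1e-3):
    """Convert intensity frames (2D lists) to (t, x, y, polarity) events."""
    height, width = len(frames[0]), len(frames[0][0])
    # Per-pixel reference log intensity, initialised from the first frame.
    ref = [
        [math.log(frames[0][r][c] + eps) for c in range(width)]
        for r in range(height)
    ]
    events = []
    for frame, t in zip(frames[1:], timestamps[1:]):
        for r in range(height):
            for c in range(width):
                diff = math.log(frame[r][c] + eps) - ref[r][c]
                polarity = 1 if diff > 0 else -1
                count = int(abs(diff) // threshold)  # thresholds crossed
                for _ in range(count):
                    events.append((t, c, r, polarity))
                # Move the reference by the amount actually signalled.
                ref[r][c] += polarity * count * threshold
    return events


# One row, two pixels: the left pixel doubles in brightness, the right is static.
frames = [[[1.0, 1.0]], [[2.0, 1.0]]]
events = frames_to_events(frames, timestamps=[0.0, 0.01], threshold=0.3)
```

With a threshold of 0.3, doubling the intensity (a log change of about 0.69) yields two positive events at the left pixel and none at the right.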

Title: P3-SAM: Native 3D Part Segmentation

Authors: Changfeng Ma, Yang Li, Xinhao Yan, Jiachen Xu, Yunhan Yang, Chunshi Wang, Zibo Zhao, Yanwen Guo, Zhuo Chen, Chunchao Guo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.06784
Pdf URL: https://arxiv.org/pdf/2509.06784
Copy Paste: [[2509.06784]] P3-SAM: Native 3D Part Segmentation(https://arxiv.org/abs/2509.06784)
Keywords: generation
Abstract: Segmenting 3D assets into their constituent parts is crucial for enhancing 3D understanding, facilitating model reuse, and supporting various applications such as part generation. However, current methods face limitations such as poor robustness when dealing with complex objects and cannot fully automate the process. In this paper, we propose a native 3D point-promptable part segmentation model termed P3-SAM, designed to fully automate the segmentation of any 3D object into components. Inspired by SAM, P3-SAM consists of a feature extractor, multiple segmentation heads, and an IoU predictor, enabling interactive segmentation for users. We also propose an algorithm to automatically select and merge masks predicted by our model for part instance segmentation. Our model is trained on a newly built dataset containing nearly 3.7 million models with reasonable segmentation labels. Comparisons show that our method achieves precise segmentation results and strong robustness on complex objects, attaining state-of-the-art performance. Our code will be released soon.
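
The abstract does not spell out how masks are selected and merged, so the following is only a generic sketch of one plausible selection step: greedy suppression of overlapping candidate masks, ranked by a predicted quality score. Masks are represented as sets of point indices; `select_masks`, `iou_threshold`, and the toy data are hypothetical and should not be read as P3-SAM's algorithm.

```python
def mask_iou(mask_a, mask_b):
    """Intersection over union of two masks given as sets of point indices."""
    union = len(mask_a | mask_b)
    return len(mask_a & mask_b) / union if union else 0.0


def select_masks(masks, scores, iou_threshold=0.5):
    """Keep high-scoring masks, dropping any that overlap a kept mask too much."""
    order = sorted(range(len(masks)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(mask_iou(masks[i], masks[j]) <= iou_threshold for j in kept):
            kept.append(i)
    return kept


# Three candidate part masks over a point cloud, with predicted quality scores.
masks = [{0, 1, 2, 3}, {0, 1, 2}, {7, 8}]
scores = [0.9, 0.8, 0.7]
kept = select_masks(masks, scores)
```

Here the second mask overlaps the first with IoU 0.75 and is dropped, while the disjoint third mask is kept.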

Title: SynthDrive: Scalable Real2Sim2Real Sensor Simulation Pipeline for High-Fidelity Asset Generation and Driving Data Synthesis

Authors: Zhengqing Chen, Ruohong Mei, Xiaoyang Guo, Qingjie Wang, Yubin Hu, Wei Yin, Weiqiang Ren, Qian Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.06798
Pdf URL: https://arxiv.org/pdf/2509.06798
Copy Paste: [[2509.06798]] SynthDrive: Scalable Real2Sim2Real Sensor Simulation Pipeline for High-Fidelity Asset Generation and Driving Data Synthesis(https://arxiv.org/abs/2509.06798)
Keywords: generation
Abstract: In the field of autonomous driving, sensor simulation is essential for generating rare and diverse scenarios that are difficult to capture in real-world environments. Current solutions fall into two categories: 1) CG-based methods, such as CARLA, which lack diversity and struggle to scale to the vast array of rare cases required for robust perception training; and 2) learning-based approaches, such as NeuSim, which are limited to specific object categories (vehicles) and require extensive multi-sensor data, hindering their applicability to generic objects. To address these limitations, we propose a scalable real2sim2real system that leverages 3D generation to automate asset mining, generation, and rare-case data synthesis.

Title: MIORe & VAR-MIORe: Benchmarks to Push the Boundaries of Restoration

Authors: George Ciubotariu, Zhuyun Zhou, Zongwei Wu, Radu Timofte
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.06803
Pdf URL: https://arxiv.org/pdf/2509.06803
Copy Paste: [[2509.06803]] MIORe & VAR-MIORe: Benchmarks to Push the Boundaries of Restoration(https://arxiv.org/abs/2509.06803)
Keywords: restoration, generation
Abstract: We introduce MIORe and VAR-MIORe, two novel multi-task datasets that address critical limitations in current motion restoration benchmarks. Designed with high-frame-rate (1000 FPS) acquisition and professional-grade optics, our datasets capture a broad spectrum of motion scenarios, which include complex ego-camera movements, dynamic multi-subject interactions, and depth-dependent blur effects. By adaptively averaging frames based on computed optical flow metrics, MIORe generates consistent motion blur and preserves sharp inputs for video frame interpolation and optical flow estimation. VAR-MIORe extends this further by spanning a variable range of motion magnitudes, from minimal to extreme, establishing the first benchmark to offer explicit control over motion amplitude. We provide high-resolution, scalable ground truths that challenge existing algorithms under both controlled and adverse conditions, paving the way for next-generation research on various image and video restoration tasks.
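
Adaptive frame averaging can be sketched as follows: accumulate per-step optical flow magnitude until a target amount of motion is reached, then average that many high-frame-rate frames to synthesize blur. The stopping rule, the `target_pixels` parameter, and the choice of the middle frame as the sharp reference are assumptions for illustration; the abstract does not give the dataset's exact criteria.

```python
def frames_for_target_motion(flow_magnitudes, target_pixels):
    """Number of frames to average so accumulated motion reaches the target.

    flow_magnitudes[i] is the mean flow magnitude (pixels) from frame i to i + 1.
    """
    accumulated = 0.0
    for i, magnitude in enumerate(flow_magnitudes):
        accumulated += magnitude
        if accumulated >= target_pixels:
            return i + 2  # frames 0 .. i + 1 inclusive
    return len(flow_magnitudes) + 1  # target not reached: use every frame


def average_frames(frames):
    """Pixel-wise mean of a list of equally sized 2D frames."""
    n = len(frames)
    height, width = len(frames[0]), len(frames[0][0])
    return [
        [sum(f[r][c] for f in frames) / n for c in range(width)]
        for r in range(height)
    ]


def synthesize_blur(frames, flow_magnitudes, target_pixels):
    """Return (blurred frame, sharp middle frame, number of frames used)."""
    count = frames_for_target_motion(flow_magnitudes, target_pixels)
    window = frames[:count]
    return average_frames(window), window[len(window) // 2], count


# A bright dot moving one pixel per frame along a 1 x 4 row.
frames = [[[1.0, 0, 0, 0]], [[0, 1.0, 0, 0]], [[0, 0, 1.0, 0]], [[0, 0, 0, 1.0]]]
blurred, sharp, count = synthesize_blur(frames, [1.0, 1.0, 1.0], target_pixels=2.0)
```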

Title: UMO: Scaling Multi-Identity Consistency for Image Customization via Matching Reward

Authors: Yufeng Cheng, Wenxu Wu, Shaojin Wu, Mengqi Huang, Fei Ding, Qian He
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2509.06818
Pdf URL: https://arxiv.org/pdf/2509.06818
Copy Paste: [[2509.06818]] UMO: Scaling Multi-Identity Consistency for Image Customization via Matching Reward(https://arxiv.org/abs/2509.06818)
Keywords: generation
Abstract: Recent advancements in image customization exhibit a wide range of application prospects due to stronger customization capabilities. However, since we humans are more sensitive to faces, a significant challenge remains in preserving consistent identity while avoiding identity confusion with multi-reference images, limiting the identity scalability of customization models. To address this, we present UMO, a Unified Multi-identity Optimization framework, designed to maintain high-fidelity identity preservation and alleviate identity confusion with scalability. With a "multi-to-multi matching" paradigm, UMO reformulates multi-identity generation as a global assignment optimization problem and unleashes multi-identity consistency for existing image customization methods generally through reinforcement learning on diffusion models. To facilitate the training of UMO, we develop a scalable customization dataset with multi-reference images, consisting of both synthesised and real parts. Additionally, we propose a new metric to measure identity confusion. Extensive experiments demonstrate that UMO not only improves identity consistency significantly, but also reduces identity confusion on several image customization methods, setting a new state-of-the-art among open-source methods along the dimension of identity preservation. Code and model: this https URL
摘要：由于更强的自定义功能，图像定制的最新进展展示了广泛的应用程序前景。但是，由于我们人类对面孔更敏感，因此在保持一致的身份的同时避免身份混乱的多种参考图像，限制了自定义模型的身份可扩展性。为了解决这个问题，我们提出了一个统一的多个认同优化框架Umo，旨在保持高保真身份保存并减轻与可伸缩性的认同混淆。乌默（Umo）凭借“多对媒体匹配”范式，将多个身份生成重新定义为全球作业优化问题，并通过对扩散模型上的加强学习来释放现有图像自定义方法的多身份一致性。为了促进UMO的培训，我们开发了一个可扩展的自定义数据集，该数据集具有多参考图像，由合成部分和真实部分组成。此外，我们提出了一个新的指标来衡量身份混乱。广泛的实验表明，UMO不仅可以显着提高身份一致性，而且还减少了几种图像自定义方法的身份混乱，从而沿着保留身份的维度设置了开源方法之间的新最新方法。代码和模型：此HTTPS URL
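
To make the "global assignment optimization" framing concrete, here is a minimal sketch of one-to-one matching between reference identities and generated faces. The abstract does not specify UMO's similarity measure, solver, or reward, so the score matrix and the brute-force search below are illustrative assumptions only; a practical system would likely use a polynomial-time solver such as the Hungarian algorithm.

```python
from itertools import permutations


def best_assignment(similarity):
    """Return (assignment, total) maximizing the summed similarity.

    similarity[i][j] is a hypothetical score between reference identity i
    and generated face j. The matrix must be square. Brute force is used
    for clarity and is only feasible for a handful of identities.
    """
    n = len(similarity)
    best_perm, best_total = None, float("-inf")
    for perm in permutations(range(n)):
        total = sum(similarity[i][perm[i]] for i in range(n))
        if total > best_total:
            best_perm, best_total = perm, total
    return list(best_perm), best_total


# Toy scores: rows are reference identities, columns are generated faces.
scores = [
    [0.9, 0.2, 0.1],
    [0.3, 0.4, 0.8],
    [0.2, 0.7, 0.3],
]
assignment, total = best_assignment(scores)
```

Here `assignment[i]` is the generated face matched to reference identity `i`. Matching globally rather than greedily per identity prevents two references from claiming the same face, which is the kind of identity confusion the abstract describes.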

Title: floq: Training Critics via Flow-Matching for Scaling Compute in Value-Based RL

Authors: Bhavya Agrawalla, Michal Nauman, Khush Agarwal, Aviral Kumar
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2509.06863
Pdf URL: https://arxiv.org/pdf/2509.06863
Copy Paste: [[2509.06863]] floq: Training Critics via Flow-Matching for Scaling Compute in Value-Based RL(https://arxiv.org/abs/2509.06863)
Keywords: generative
Abstract: A hallmark of modern large-scale machine learning techniques is the use of training objectives that provide dense supervision to intermediate computations, such as teacher forcing the next token in language models or denoising step-by-step in diffusion models. This enables models to learn complex functions in a generalizable manner. Motivated by this observation, we investigate the benefits of iterative computation for temporal difference (TD) methods in reinforcement learning (RL). Typically they represent value functions in a monolithic fashion, without iterative compute. We introduce floq (flow-matching Q-functions), an approach that parameterizes the Q-function using a velocity field and trains it using techniques from flow-matching, typically used in generative modeling. This velocity field underneath the flow is trained using a TD-learning objective, which bootstraps from values produced by a target velocity field, computed by running multiple steps of numerical integration. Crucially, floq allows for more fine-grained control and scaling of the Q-function capacity than monolithic architectures, by appropriately setting the number of integration steps. Across a suite of challenging offline RL benchmarks and online fine-tuning tasks, floq improves performance by nearly 1.8x. floq scales capacity far better than standard TD-learning architectures, highlighting the potential of iterative computation for value learning.
摘要：现代大规模机器学习技术的标志是使用训练目标，这些目标为中间计算提供了密切的监督，例如教师在语言模型中强迫下一个令牌或在扩散模型中逐步降级。这使模型能够以可概括的方式学习复杂的功能。在这一观察结果的推动下，我们研究了迭代计算对时间差（TD）方法在增强学习（RL）中的好处。通常，它们以整体方式表示价值函数，而无需迭代计算。我们介绍了floq（流量匹配Q-函数），该方法使用速度字段参数对Q功能进行参数，并使用流量匹配的技术训练它，该技术通常用于生成建模。使用TD学习目标训练该流程下方的速度字段，该目标是由目标速度场产生的值引导的，该值通过运行数值集成的多个步骤计算得出。至关重要的是，通过适当设置集成步骤的数量，FLOQ允许比整体体系结构更细粒度的控制和缩放Q功能能力。在一系列具有挑战性的离线RL基准和在线微调任务中，Floq提高了近1.8倍的性能。 FLOQ量表的容量远胜于标准的TD学习架构，突出了迭代计算对价值学习的潜力。
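
The core mechanism, reading a Q-value off a velocity field by running several numerical integration steps, can be sketched as follows. The abstract does not state the integrator, the starting value, or the network architecture, so the Euler scheme, the zero initial value, and the toy `toy_velocity` function are all assumptions made for illustration.

```python
def q_value(velocity, state, action, num_steps, z0=0.0):
    """Integrate a velocity field over t in [0, 1] with Euler steps.

    `velocity(z, t, state, action)` stands in for a learned network;
    the value reached at t = 1 is treated as the Q-value estimate.
    """
    z = z0
    step = 1.0 / num_steps
    for k in range(num_steps):
        t = k * step
        z += step * velocity(z, t, state, action)
    return z


def toy_velocity(z, t, state, action):
    """Hypothetical constant field, giving a straight-line flow from z0."""
    return 2.0 * state + action


q = q_value(toy_velocity, state=1.5, action=0.5, num_steps=8)
```

With this constant toy field the result does not depend on `num_steps`. For a learned field that depends on `z` and `t`, the number of steps sets how much computation goes into each value estimate, which is the scaling knob the abstract refers to.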

Title: A New Hybrid Model of Generative Adversarial Network and You Only Look Once Algorithm for Automatic License-Plate Recognition

Authors: Behnoud Shafiezadeh, Amir Mashmool, Farshad Eshghi, Manoochehr Kelarestaghi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.06868
Pdf URL: https://arxiv.org/pdf/2509.06868
Copy Paste: [[2509.06868]] A New Hybrid Model of Generative Adversarial Network and You Only Look Once Algorithm for Automatic License-Plate Recognition(https://arxiv.org/abs/2509.06868)
Keywords: generative
Abstract: Automatic License-Plate Recognition (ALPR) plays a pivotal role in Intelligent Transportation Systems (ITS) as a fundamental element of Smart Cities. However, due to its high variability, ALPR faces challenging issues more efficiently addressed by deep learning techniques. In this paper, a selective Generative Adversarial Network (GAN) is proposed for deblurring in the preprocessing step, coupled with the state-of-the-art You-Only-Look-Once (YOLO)v5 object detection architectures for License-Plate Detection (LPD), and the integrated Character Segmentation (CS) and Character Recognition (CR) steps. The selective preprocessing bypasses unnecessary and sometimes counter-productive input manipulations, while YOLOv5 LPD/CS+CR delivers high accuracy and low computing cost. As a result, YOLOv5 achieves a detection time of 0.026 seconds for both LP and CR detection stages, facilitating real-time applications with exceptionally rapid responsiveness. Moreover, the proposed model achieves accuracy rates of 95% and 97% in the LPD and CR detection phases, respectively. Furthermore, the inclusion of the Deblur-GAN pre-processor significantly improves detection accuracy by nearly 40%, especially when encountering blurred License Plates (LPs). To train and test the learning components, we generated and publicly released our blur and ALPR datasets (using Iranian license plates as a use-case), which are more representative of close-to-real-life ad-hoc situations. The findings demonstrate that employing the state-of-the-art YOLO model results in excellent overall precision and detection time, making it well-suited for portable applications. Additionally, integrating the Deblur-GAN model as a preliminary processing step enhances the overall effectiveness of our comprehensive model, particularly when confronted with blurred scenes captured by the camera as input.
摘要：自动许可板识别（ALPR）在智能运输系统（ITS）中起关键作用是智能城市的基本要素。但是，由于其高可变性，ALPR面临着深入学习技术更有效地解决的具有挑战性的问题。在本文中，建议在预处理步骤中进行选择性生成性对抗网络（GAN），并与最先进的您的唯一的外观孔（YOLO）V5对象检测架构进行许可证板检测（LPD），以及集成的字符序列（CS）和特征（CR）和特征（CR）。选择性预处理绕过不必要的，有时是适得其反的输入操作，而Yolov5 LPD/CS+CR可提供高精度和低计算成本。结果，Yolov5的LP和CR检测阶段都达到了0.026秒的检测时间，从而促进了具有异常快速响应性的实时应用。此外，所提出的模型分别在LPD和CR检测阶段达到了95 \％和97 \％的准确率。 Furthermore, the inclusion of the Deblur-GAN pre-processor significantly improves detection accuracy by nearly 40\%, especially when encountering blurred License Plates (LPs).To train and test the learning components, we generated and publicly released our blur and ALPR datasets (using Iranian license plates as a use-case), which are more representative of close-to-real-life ad-hoc situations.研究结果表明，采用最先进的YOLO模型可带来出色的总体精度和检测时间，因此非常适合便携式应用程序。此外，将Deblur-GAN模型集成为初步处理步骤，提高了我们综合模型的整体有效性，尤其是在面对相机捕获的模糊场景作为输入时。

Title: BIR-Adapter: A Low-Complexity Diffusion Model Adapter for Blind Image Restoration

Authors: Cem Eteke, Alexander Griessel, Wolfgang Kellerer, Eckehard Steinbach
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.06904
Pdf URL: https://arxiv.org/pdf/2509.06904
Copy Paste: [[2509.06904]] BIR-Adapter: A Low-Complexity Diffusion Model Adapter for Blind Image Restoration(https://arxiv.org/abs/2509.06904)
Keywords: restoration, super-resolution
Abstract: This paper introduces BIR-Adapter, a low-complexity blind image restoration adapter for diffusion models. The BIR-Adapter enables the utilization of the prior of pre-trained large-scale diffusion models on blind image restoration without training any auxiliary feature extractor. We take advantage of the robustness of pretrained models. We extract features from degraded images via the model itself and extend the self-attention mechanism with these degraded features. We introduce a sampling guidance mechanism to reduce hallucinations. We perform experiments on synthetic and real-world degradations and demonstrate that BIR-Adapter achieves competitive or better performance compared to state-of-the-art methods while having significantly lower complexity. Additionally, its adapter-based design enables integration into other diffusion models, enabling broader applications in image restoration tasks. We showcase this by extending a super-resolution-only model to perform better under additional unknown degradations.
摘要：本文介绍了Bir-Adapter，这是一种用于扩散模型的低复杂性盲图修复适配器。 BIR-ADAPTER可以在盲图恢复中使用预训练的大规模扩散模型的先验，而无需训练任何辅助特征提取器。我们利用验证模型的鲁棒性。我们通过模型本身从退化的图像中提取特征，并使用这些退化的特征扩展自我发项机制。我们引入了一种采样指导机制，以减少幻觉。我们对合成和现实世界降解进行实验，并证明与最新方法相比，BIR-ADAPTER可以在竞争性或更好的性能上实现竞争性或更好的性能，同时复杂性的较低。此外，其基于适配器的设计使集成到其他扩散模型中，从而在图像恢复任务中更广泛的应用程序。我们通过扩展仅超分辨率模型以在其他未知降解下表现更好的方法来展示这一点。

Title: From Noise to Narrative: Tracing the Origins of Hallucinations in Transformers

Authors: Praneet Suresh, Jack Stanley, Sonia Joseph, Luca Scimeca, Danilo Bzdok
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2509.06938
Pdf URL: https://arxiv.org/pdf/2509.06938
Copy Paste: [[2509.06938]] From Noise to Narrative: Tracing the Origins of Hallucinations in Transformers(https://arxiv.org/abs/2509.06938)
Keywords: generative
Abstract: As generative AI systems become competent and democratized in science, business, and government, deeper insight into their failure modes is now acutely needed. The occasional volatility in their behavior, such as the propensity of transformer models to hallucinate, impedes trust and adoption of emerging AI solutions in high-stakes areas. In the present work, we establish how and when hallucinations arise in pre-trained transformer models through concept representations captured by sparse autoencoders, under scenarios with experimentally controlled uncertainty in the input space. Our systematic experiments reveal that the number of semantic concepts used by the transformer model grows as the input information becomes increasingly unstructured. In the face of growing uncertainty in the input space, the transformer model becomes prone to activate coherent yet input-insensitive semantic features, leading to hallucinated output. At its extreme, for pure-noise inputs, we identify a wide variety of robustly triggered and meaningful concepts in the intermediate activations of pre-trained transformer models, whose functional integrity we confirm through targeted steering. We also show that hallucinations in the output of a transformer model can be reliably predicted from the concept patterns embedded in transformer layer activations. This collection of insights on transformer internal processing mechanics has immediate consequences for aligning AI models with human values, AI safety, opening the attack surface for potential adversarial attacks, and providing a basis for automatic quantification of a model's hallucination risk.
摘要：随着生成的AI系统在科学，商业和政府中变得胜任和民主化，对其失败模式的更深入的了解现在构成了急需。偶尔的行为波动性，例如变压器模型幻觉的倾向，阻碍了高风险地区新出现的AI解决方案的信任和采用。在目前的工作中，我们在预先训练的变压器模型中通过稀疏自动编码器捕获的概念表示形式，在输入空间中实验控制的不确定性下，如何以及何时出现幻觉。我们的系统实验表明，随着输入信息变得越来越非结构化，变压器模型使用的语义概念数量会增长。面对输入空间中不确定性的日益确定，变压器模型容易激活连贯但不敏感的语义特征，从而导致幻觉输出。在极端情况下，对于纯净的输入，我们在预训练的变压器模型的中间激活中确定了各种各样的强大触发和有意义的概念，我们通过靶向转向确认其功能完整性。我们还表明，可以从变压器层激活中嵌入的概念模式可靠地预测变压器模型的输出中的幻觉。关于变压器内部处理机制的洞察力集合对使人工价值，AI安全性，为潜在的对抗性攻击打开攻击表面的AI模型有直接的后果，并为自动量化模型的幻觉风险提供了基础。

Title: Outcome-based Exploration for LLM Reasoning

Authors: Yuda Song, Julia Kempe, Remi Munos
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2509.06941
Pdf URL: https://arxiv.org/pdf/2509.06941
Copy Paste: [[2509.06941]] Outcome-based Exploration for LLM Reasoning(https://arxiv.org/abs/2509.06941)
Keywords: generation
Abstract: Reinforcement learning (RL) has emerged as a powerful method for improving the reasoning abilities of large language models (LLMs). Outcome-based RL, which rewards policies solely for the correctness of the final answer, yields substantial accuracy gains but also induces a systematic loss in generation diversity. This collapse undermines real-world performance, where diversity is critical for test-time scaling. We analyze this phenomenon by viewing RL post-training as a sampling process and show that, strikingly, RL can reduce effective diversity even on the training set relative to the base model. Our study highlights two central findings: (i) a transfer of diversity degradation, where reduced diversity on solved problems propagates to unsolved ones, and (ii) the tractability of the outcome space, since reasoning tasks admit only a limited set of distinct answers. Motivated by these insights, we propose outcome-based exploration, which assigns exploration bonuses according to final outcomes. We introduce two complementary algorithms: historical exploration, which encourages rarely observed answers via UCB-style bonuses, and batch exploration, which penalizes within-batch repetition to promote test-time diversity. Experiments on standard competition math with Llama and Qwen models demonstrate that both methods improve accuracy while mitigating diversity collapse. On the theoretical side, we formalize the benefit of outcome-based exploration through a new model of outcome-based bandits. Together, these contributions chart a practical path toward RL methods that enhance reasoning without sacrificing the diversity essential for scalable deployment.
摘要：强化学习（RL）已成为提高大语模型（LLMS）推理能力的强大方法。基于结果的RL仅是为了最终答案的正确性而奖励政策，它产生了实质性的准确性提高，但也会导致发电多样性的系统损失。这种崩溃破坏了现实世界的表现，在这种表现中，多样性对于测试时间缩放至关重要。我们通过将RL后培训视为抽样过程来分析这种现象，并表明，即使在相对于基本模型的训练集上，RL也可以降低有效的多样性。我们的研究强调了两个中心发现：（i）多样性退化的转移，在该问题中降低了解决问题的多样性传播到未解决的问题，以及（ii）结果空间的障碍，因为推理任务仅承认一组有限的不同答案。在这些见解的推动下，我们提出了基于结果的探索，该探索根据最终结果分配了勘探奖金。我们介绍了两种互补算法：历史探索，该探索很少通过UCB风格的奖金和批处理探索来观察答案，从而惩罚了批处理重复以促进测试时间的多样性。使用Llama和QWEN模型进行标准竞争数学的实验表明，两种方法都提高了精度，同时降低了多样性崩溃。从理论方面来说，我们通过新的基于结果的土匪模型正式化基于结果的探索的好处。这些贡献共同列出了通往RL方法的实用途径，该方法可以增强推理，而无需牺牲可扩展部署必不可少的多样性。
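
The two exploration schemes named in the abstract can be sketched in a few lines. The abstract does not give the exact bonus or penalty formulas, so the inverse-square-root count bonus (a common UCB-style choice) and the repetition penalty below are assumptions, as are the function names and the `scale` parameter.

```python
import math
from collections import Counter


def historical_bonus(answer, history, scale=1.0):
    """UCB-style bonus: larger for final answers seen rarely so far.

    `history` counts how often each final answer appeared in earlier
    rollouts; the 1/sqrt(count + 1) form is an assumed choice.
    """
    return scale / math.sqrt(history[answer] + 1)


def batch_penalty(answers, scale=1.0):
    """Per-sample penalty for repeating a final answer within one batch."""
    counts = Counter(answers)
    n = len(answers)
    return [-scale * (counts[a] - 1) / n for a in answers]


# Toy history: answer "42" was seen 8 times, answer "7" never.
history = Counter({"42": 8})
bonus_common = historical_bonus("42", history)
bonus_rare = historical_bonus("7", history)

# Toy batch of four sampled final answers for a single question.
penalties = batch_penalty(["42", "42", "7", "42"])
```

In both cases the shaping term depends only on the final answer, not on the reasoning trace. That keeps the counts tractable because, as the abstract notes, reasoning tasks admit only a limited set of distinct answers.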

Title: Interleaving Reasoning for Better Text-to-Image Generation

Authors: Wenxuan Huang, Shuang Chen, Zheyong Xie, Shaosheng Cao, Shixiang Tang, Yufan Shen, Qingyu Yin, Wenbo Hu, Xiaoman Wang, Yuntian Tang, Junbo Qiao, Yue Guo, Yao Hu, Zhenfei Yin, Philip Torr, Yu Cheng, Wanli Ouyang, Shaohui Lin
Subjects: cs.CV, cs.AI, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2509.06945
Pdf URL: https://arxiv.org/pdf/2509.06945
Copy Paste: [[2509.06945]] Interleaving Reasoning for Better Text-to-Image Generation(https://arxiv.org/abs/2509.06945)
Keywords: generation
Abstract: Unified multimodal understanding and generation models have recently achieved significant improvements in image generation capability, yet a large gap remains in instruction following and detail preservation compared to systems that tightly couple comprehension with generation, such as GPT-4o. Motivated by recent advances in interleaving reasoning, we explore whether such reasoning can further improve Text-to-Image (T2I) generation. We introduce Interleaving Reasoning Generation (IRG), a framework that alternates between text-based thinking and image synthesis: the model first produces text-based thinking to guide an initial image, then reflects on the result to refine fine-grained details, visual quality, and aesthetics while preserving semantics. To train IRG effectively, we propose Interleaving Reasoning Generation Learning (IRGL), which targets two sub-goals: (1) strengthening the initial think-and-generate stage to establish core content and base quality, and (2) enabling high-quality textual reflection and faithful implementation of those refinements in a subsequent image. We curate IRGL-300K, a dataset organized into six decomposed learning modes that jointly cover learning text-based thinking and full thinking-image trajectories. Starting from a unified foundation model that natively emits interleaved text-image outputs, our two-stage training first builds robust thinking and reflection, then efficiently tunes the IRG pipeline on the full thinking-image trajectory data. Extensive experiments show SoTA performance, yielding absolute gains of 5-10 points on GenEval, WISE, TIIF, GenAI-Bench, and OneIG-EN, alongside substantial improvements in visual quality and fine-grained fidelity. The code, model weights and datasets will be released at: this https URL.
摘要：统一的多模式理解和生成模型最近在图像生成能力方面取得了显着提高，但是与将与GPT-4O等生成相结合的系统相比，随后的教学和细节保存仍然存在较大的差距。在交错推理方面的最新进展中，我们探讨了这种推理是否可以进一步改善文本形象（T2i）的一代。我们介绍了交织的推理生成（IRG），该框架在基于文本的思维和图像合成之间交替：该模型首先产生基于文本的思维来指导初始图像，然后反思结果以完善细节细节，视觉质量和美学，同时保留语义。为了有效地培训IRG，我们提出了针对两个子目标的交织推理生成学习（IRGL）：（1）加强初始的思想阶段，以建立核心内容和基础质量，以及（2）在后续图像中实现高质量的文本反思和忠实地实现这些精炼。我们策划IRGL-300K，这是一个组织成六种分解的学习模式，共同涵盖了基于文本的思维和完整的思维形象轨迹。从本地发出交织的文本图像输出的统一基础模型开始，我们的两阶段训练首先建立了强大的思维和反思，然后在完整的思维形象轨迹数据中有效地调节了IRG管道。广泛的实验表明SOTA性能，在Geneval，Wise，Wise，Tiif，Genai-Bench和Oneig-en上获得5-10分的绝对增长，以及视觉质量和细粒度的忠诚度的实质性提高。代码，模型权重和数据集将在：此HTTPS URL中发布。