2025-08-19

Title: Sparse Attention across Multiple-context KV Cache

Authors: Ziyi Cao, Qingyi Si, Jingbin Zhang, Bingquan Liu
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2508.11661
Pdf URL: https://arxiv.org/pdf/2508.11661
Copy Paste: [[2508.11661]] Sparse Attention across Multiple-context KV Cache(https://arxiv.org/abs/2508.11661)
Keywords: generation
Abstract: Large language models face significant cost challenges in long-sequence inference. To address this, reusing historical Key-Value (KV) Cache for improved inference efficiency has become a mainstream approach. Recent advances further enhance throughput by using sparse attention mechanisms to select the most relevant KV Cache, thereby reducing sequence length. However, such techniques are limited to single-context scenarios, where historical KV Cache is computed sequentially with causal-attention dependencies. In retrieval-augmented generation (RAG) scenarios, where the retrieved documents used as context are unknown beforehand, each document's KV Cache is computed and stored independently (termed multiple-context KV Cache), lacking cross-attention between contexts. This renders existing methods ineffective. Although prior work partially recomputes multiple-context KV Cache to mitigate accuracy loss from missing cross-attention, it requires retaining all KV Cache throughout, failing to reduce memory overhead. This paper presents SamKV, the first exploration of attention sparsification for multiple-context KV Cache. Specifically, SamKV takes into account the complementary information of other contexts when sparsifying one context, and then locally recomputes the sparsified information. Experiments demonstrate that our method compresses sequence length to 15% without accuracy degradation compared with full-recomputation baselines, significantly boosting throughput in multi-context RAG scenarios.
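
To make the idea of KV Cache sparsification concrete, the sketch below scores every cached key against a query and keeps only the highest-scoring fraction of entries. This is not SamKV's algorithm, which the abstract does not spell out: the function name `select_topk_kv`, the scaled dot-product scoring, and the `keep_ratio` parameter are all illustrative assumptions.

```python
import math

def select_topk_kv(query, keys, values, keep_ratio=0.15):
    """Keep the KV entries whose keys score highest against the query.

    Hypothetical helper: the scaled dot-product score and keep_ratio are
    assumptions for illustration, not details taken from the paper.
    """
    if len(keys) != len(values):
        raise ValueError("keys and values must have the same length")
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    n_keep = max(1, math.ceil(keep_ratio * len(keys)))
    # Pick the n_keep best indices, then restore their original order so
    # the retained cache entries stay in sequence order.
    ranked = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:n_keep])
    return [keys[i] for i in kept], [values[i] for i in kept], kept
```

Calling it with `keep_ratio=0.5` on a four-entry cache returns the two entries whose keys align best with the query, in their original order.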

Title: FusionFM: Fusing Eye-specific Foundational Models for Optimized Ophthalmic Diagnosis

Authors: Ke Zou, Jocelyn Hui Lin Goh, Yukun Zhou, Tian Lin, Samantha Min Er Yew, Sahana Srinivasan, Meng Wang, Rui Santos, Gabor M. Somfai, Huazhu Fu, Haoyu Chen, Pearse A. Keane, Ching-Yu Cheng, Yih Chung Tham
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.11721
Pdf URL: https://arxiv.org/pdf/2508.11721
Copy Paste: [[2508.11721]] FusionFM: Fusing Eye-specific Foundational Models for Optimized Ophthalmic Diagnosis(https://arxiv.org/abs/2508.11721)
Keywords: generation
Abstract: Foundation models (FMs) have shown great promise in medical image analysis by improving generalization across diverse downstream tasks. In ophthalmology, several FMs have recently emerged, but there is still no clear answer to fundamental questions: Which FM performs the best? Are they equally good across different tasks? What if we combine all FMs together? To our knowledge, this is the first study to systematically evaluate both single and fused ophthalmic FMs. To address these questions, we propose FusionFM, a comprehensive evaluation suite, along with two fusion approaches to integrate different ophthalmic FMs. Our framework covers both ophthalmic disease detection (glaucoma, diabetic retinopathy, and age-related macular degeneration) and systemic disease prediction (diabetes and hypertension) based on retinal imaging. We benchmarked four state-of-the-art FMs (RETFound, VisionFM, RetiZero, and DINORET) using standardized datasets from multiple countries and evaluated their performance using AUC and F1 metrics. Our results show that DINORET and RetiZero achieve superior performance in both ophthalmic and systemic disease tasks, with RetiZero exhibiting stronger generalization on external datasets. Regarding fusion strategies, the Gating-based approach provides modest improvements in predicting glaucoma, AMD, and hypertension. Despite these advances, predicting systemic diseases, especially hypertension in external cohorts, remains challenging. These findings provide an evidence-based evaluation of ophthalmic FMs, highlight the benefits of model fusion, and point to strategies for enhancing their clinical applicability.

Title: Scalable Geospatial Data Generation Using AlphaEarth Foundations Model

Authors: Luc Houriez (1 and 2), Sebastian Pilarski (1), Behzad Vahedi (1), Ali Ahmadalipour (1), Teo Honda Scully (1), Nicholas Aflitto (1), David Andre (1), Caroline Jaffe (1), Martha Wedner (1), Rich Mazzola (1), Josh Jeffery (1), Ben Messinger (1), Sage McGinley-Smith (1), Sarah Russell (1) ((1) X the Moonshot Factory - Bellwether, (2) Stanford University)
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2508.11739
Pdf URL: https://arxiv.org/pdf/2508.11739
Copy Paste: [[2508.11739]] Scalable Geospatial Data Generation Using AlphaEarth Foundations Model(https://arxiv.org/abs/2508.11739)
Keywords: generation
Abstract: High-quality labeled geospatial datasets are essential for extracting insights and understanding our planet. Unfortunately, these datasets often do not span the entire globe and are limited to certain geographic regions where data was collected. Google DeepMind's recently released AlphaEarth Foundations (AEF) provides an information-dense global geospatial representation designed to serve as a useful input across a wide gamut of tasks. In this article we propose and evaluate a methodology which leverages AEF to extend geospatial labeled datasets beyond their initial geographic regions. We show that even basic models like random forests or logistic regression can be used to accomplish this task. We investigate a case study of extending LANDFIRE's Existing Vegetation Type (EVT) dataset beyond the USA into Canada at two levels of granularity: EvtPhys (13 classes) and EvtGp (80 classes). Qualitatively, for EvtPhys, model predictions align with ground truth. Trained models achieve 81% and 73% classification accuracy on EvtPhys validation sets in the USA and Canada, despite discussed limitations.

Title: FairTabGen: Unifying Counterfactual and Causal Fairness in Synthetic Tabular Data Generation

Authors: Nitish Nagesh, Salar Shakibhamedan, Mahdi Bagheri, Ziyu Wang, Nima TaheriNejad, Axel Jantsch, Amir M. Rahmani
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.11810
Pdf URL: https://arxiv.org/pdf/2508.11810
Copy Paste: [[2508.11810]] FairTabGen: Unifying Counterfactual and Causal Fairness in Synthetic Tabular Data Generation(https://arxiv.org/abs/2508.11810)
Keywords: generation
Abstract: Generating synthetic data is crucial in privacy-sensitive, data-scarce settings, especially for tabular datasets widely used in real-world applications. A key challenge is improving counterfactual and causal fairness, while preserving high utility. We present FairTabGen, a fairness-aware large language model-based framework for tabular synthetic data generation. We integrate multiple fairness definitions including counterfactual and causal fairness into both its generation and evaluation pipelines. We use in-context learning, prompt refinement, and fairness-aware data curation to balance fairness and utility. Across diverse datasets, our method outperforms state-of-the-art GAN-based and LLM-based methods, achieving up to 10% improvements on fairness metrics such as demographic parity and path-specific causal effects while retaining statistical utility. Remarkably, it achieves these gains using less than 20% of the original data, highlighting its efficiency in low-data regimes. These results demonstrate a principled and practical approach for generating fair and useful synthetic tabular data.
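
Demographic parity, one of the fairness metrics named in the abstract, compares how often each group receives a positive prediction. The sketch below computes the gap between the highest and lowest group rates; the function name and the list-based inputs are assumptions for illustration, and FairTabGen's own evaluation pipeline may define the metric differently.

```python
def demographic_parity_difference(predictions, groups):
    """Gap between the highest and lowest positive-prediction rates.

    predictions: list of 0/1 model outputs.
    groups: list of group labels, aligned with predictions.
    A value of 0.0 means every group receives positives at the same rate.
    """
    if len(predictions) != len(groups):
        raise ValueError("predictions and groups must have the same length")
    totals, positives = {}, {}
    for pred, group in zip(predictions, groups):
        totals[group] = totals.get(group, 0) + 1
        positives[group] = positives.get(group, 0) + pred
    rates = [positives[g] / totals[g] for g in totals]
    return max(rates) - min(rates)
```

For example, if group "a" receives positives half the time and group "b" a quarter of the time, the function returns 0.25.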

Title: Large Kernel Modulation Network for Efficient Image Super-Resolution

Authors: Quanwei Hu, Yinggan Tang, Xuguang Zhang
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2508.11893
Pdf URL: https://arxiv.org/pdf/2508.11893
Copy Paste: [[2508.11893]] Large Kernel Modulation Network for Efficient Image Super-Resolution(https://arxiv.org/abs/2508.11893)
Keywords: super-resolution
Abstract: Image super-resolution (SR) in resource-constrained scenarios demands lightweight models balancing performance and latency. Convolutional neural networks (CNNs) offer low latency but lack non-local feature capture, while Transformers excel at non-local modeling yet suffer slow inference. To address this trade-off, we propose the Large Kernel Modulation Network (LKMN), a pure CNN-based model. LKMN has two core components: Enhanced Partial Large Kernel Block (EPLKB) and Cross-Gate Feed-Forward Network (CGFN). The EPLKB utilizes channel shuffle to boost inter-channel interaction, incorporates channel attention to focus on key information, and applies large kernel strip convolutions on partial channels for non-local feature extraction with reduced complexity. The CGFN dynamically adjusts discrepancies between input, local, and non-local features via a learnable scaling factor, then employs a cross-gate strategy to modulate and fuse these features, enhancing their complementarity. Extensive experiments demonstrate that our method outperforms existing state-of-the-art (SOTA) lightweight SR models while balancing quality and efficiency. Specifically, LKMN-L achieves a 0.23 dB PSNR improvement over DAT-light on the Manga109 dataset at $\times$4 upscaling while running nearly 4.8$\times$ faster. Code is provided in the supplementary materials. The code is available at this https URL.

Title: SafeCtrl: Region-Based Safety Control for Text-to-Image Diffusion via Detect-Then-Suppress

Authors: Lingyun Zhang, Yu Xie, Yanwei Fu, Ping Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.11904
Pdf URL: https://arxiv.org/pdf/2508.11904
Copy Paste: [[2508.11904]] SafeCtrl: Region-Based Safety Control for Text-to-Image Diffusion via Detect-Then-Suppress(https://arxiv.org/abs/2508.11904)
Keywords: generative
Abstract: The widespread deployment of text-to-image models is challenged by their potential to generate harmful content. While existing safety methods, such as prompt rewriting or model fine-tuning, provide valuable interventions, they often introduce a trade-off between safety and fidelity. Recent localization-based approaches have shown promise, yet their reliance on explicit "concept replacement" can sometimes lead to semantic incongruity. To address these limitations, we explore a more flexible detect-then-suppress paradigm. We introduce SafeCtrl, a lightweight, non-intrusive plugin that first precisely localizes unsafe content. Instead of performing a hard A-to-B substitution, SafeCtrl then suppresses the harmful semantics, allowing the generative process to naturally and coherently resolve into a safe, context-aware alternative. A key aspect of our work is a novel training strategy using Direct Preference Optimization (DPO). We leverage readily available, image-level preference data to train our module, enabling it to learn nuanced suppression behaviors and perform region-guided interventions at inference without requiring costly, pixel-level annotations. Extensive experiments show that SafeCtrl significantly outperforms state-of-the-art methods in both safety efficacy and fidelity preservation. Our findings suggest that decoupled, suppression-based control is a highly effective and scalable direction for building more responsible generative models.
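
For readers unfamiliar with Direct Preference Optimization, the sketch below evaluates the standard DPO loss for a single preference pair. It shows only the generic objective; how SafeCtrl adapts it to image-level preference data is not described in the abstract. The function name, argument names, and the default `beta=0.1` are illustrative assumptions.

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Each argument is a log-probability of a sample under either the
    trainable policy or the frozen reference model. beta is an assumed
    temperature, not a value taken from the paper.
    """
    chosen_gain = policy_logp_chosen - ref_logp_chosen
    rejected_gain = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy matches the reference, the margin is zero and the loss equals log 2; the loss falls as the policy raises the chosen sample's likelihood relative to the rejected one.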

Title: Assessment of Using Synthetic Data in Brain Tumor Segmentation

Authors: Aditi Jahagirdar, Sameer Joshi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.11922
Pdf URL: https://arxiv.org/pdf/2508.11922
Copy Paste: [[2508.11922]] Assessment of Using Synthetic Data in Brain Tumor Segmentation(https://arxiv.org/abs/2508.11922)
Keywords: generative
Abstract: Manual brain tumor segmentation from MRI scans is challenging due to tumor heterogeneity, scarcity of annotated data, and class imbalance in medical imaging datasets. Synthetic data generated by generative models has the potential to mitigate these issues by improving dataset diversity. This study investigates, as a proof of concept, the impact of incorporating synthetic MRI data, generated using a pre-trained GAN model, into training a U-Net segmentation network. Experiments were conducted using real data from the BraTS 2020 dataset, synthetic data generated with the medigan library, and hybrid datasets combining real and synthetic samples in varying proportions. While overall quantitative performance (Dice coefficient, IoU, precision, recall, accuracy) was comparable between real-only and hybrid-trained models, qualitative inspection suggested that hybrid datasets, particularly with 40% real and 60% synthetic data, improved whole tumor boundary delineation. However, region-wise accuracy for the tumor core and the enhancing tumor remained lower, indicating a persistent class imbalance. The findings support the feasibility of synthetic data as an augmentation strategy for brain tumor segmentation, while highlighting the need for larger-scale experiments, volumetric data consistency, and mitigating class imbalance in future work.
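
The Dice coefficient and IoU mentioned in the abstract both measure overlap between a predicted mask and a ground-truth mask. The sketch below computes them for flat binary masks; the function name and the convention of returning 1.0 when both masks are empty are assumptions made here, not details from the study.

```python
def dice_and_iou(pred, target):
    """Return (Dice, IoU) for two flat binary masks given as lists of 0/1."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(p * t for p, t in zip(pred, target))
    pred_sum, target_sum = sum(pred), sum(target)
    union = pred_sum + target_sum - intersection
    if union == 0:
        # Both masks are empty: treat as perfect agreement (assumed convention).
        return 1.0, 1.0
    dice = 2.0 * intersection / (pred_sum + target_sum)
    iou = intersection / union
    return dice, iou
```

The two scores are related by Dice = 2 * IoU / (1 + IoU), so they always rank a pair of predictions the same way on a single mask.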

Title: Deep Learning For Point Cloud Denoising: A Survey

Authors: Chengwei Zhang, Xueyi Zhang, Mingrui Lao, Tao Jiang, Xinhao Xu, Wenjie Li, Fubo Zhang, Longyong Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.11932
Pdf URL: https://arxiv.org/pdf/2508.11932
Copy Paste: [[2508.11932]] Deep Learning For Point Cloud Denoising: A Survey(https://arxiv.org/abs/2508.11932)
Keywords: restoration
Abstract: Real-world environment-derived point clouds invariably exhibit noise across varying modalities and intensities. Hence, point cloud denoising (PCD) is essential as a preprocessing step to improve downstream task performance. Deep learning (DL)-based PCD models, known for their strong representation capabilities and flexible architectures, have surpassed traditional methods in denoising performance. To the best of our knowledge, despite recent advances in performance, no comprehensive survey systematically summarizes the developments of DL-based PCD. To fill the gap, this paper identifies key challenges in DL-based PCD, summarizes the main contributions of existing methods, and proposes a taxonomy tailored to denoising tasks. To achieve this goal, we formulate PCD as a two-step process: outlier removal and surface noise restoration, encompassing most scenarios and requirements of PCD. Additionally, we compare methods in terms of similarities, differences, and respective advantages. Finally, we discuss research limitations and future directions, offering insights for further advancements in PCD.

Title: Extending Straight-Through Estimation for Robust Neural Networks on Analog CIM Hardware

Authors: Yuannuo Feng, Wenyong Zhou, Yuexi Lyu, Yixiang Zhang, Zhengwu Liu, Ngai Wong, Wang Kang
Subjects: cs.LG, cs.AI, cs.AR
Abstract URL: https://arxiv.org/abs/2508.11940
Pdf URL: https://arxiv.org/pdf/2508.11940
Copy Paste: [[2508.11940]] Extending Straight-Through Estimation for Robust Neural Networks on Analog CIM Hardware(https://arxiv.org/abs/2508.11940)
Keywords: generation
Abstract: Analog Compute-In-Memory (CIM) architectures promise significant energy efficiency gains for neural network inference, but suffer from complex hardware-induced noise that poses major challenges for deployment. While noise-aware training methods have been proposed to address this issue, they typically rely on idealized and differentiable noise models that fail to capture the full complexity of analog CIM hardware variations. Motivated by the Straight-Through Estimator (STE) framework in quantization, we decouple forward noise simulation from backward gradient computation, enabling noise-aware training with more accurate but computationally intractable noise modeling in analog CIM systems. We provide theoretical analysis demonstrating that our approach preserves essential gradient directional information while maintaining computational tractability and optimization stability. Extensive experiments show that our extended STE framework achieves up to 5.3% accuracy improvement on image classification, 0.72 perplexity reduction on text generation, 2.2$\times$ speedup in training time, and 37.9% lower peak memory usage compared to standard noise-aware training methods.
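
The core idea of decoupling a noisy forward pass from the backward gradient can be shown on a one-parameter model. In the sketch below the forward pass uses a perturbed, rounded weight (a non-differentiable stand-in for hardware noise), while the gradient is computed as if that perturbation were the identity map. The noise model, function names, and hyperparameters are assumptions for illustration and are not the paper's formulation.

```python
import random

def noisy_weight(w, rng, sigma=0.05, step=0.01):
    """Assumed stand-in for analog noise: Gaussian jitter, then rounding.

    The rounding makes this map non-differentiable with respect to w.
    """
    return round((w + rng.gauss(0.0, sigma)) / step) * step

def train_with_ste(xs, ys, steps=500, lr=0.05, seed=0):
    """Fit y = w * x with a noisy forward pass and a straight-through gradient."""
    rng = random.Random(seed)
    w = 0.0
    for _ in range(steps):
        grad = 0.0
        for x, y in zip(xs, ys):
            w_hw = noisy_weight(w, rng)   # forward: simulated hardware weight
            err = w_hw * x - y            # squared-error residual under noise
            grad += 2.0 * err * x         # backward: d(w_hw)/dw treated as 1
        w -= lr * grad / len(xs)
    return w
```

On data generated by y = 2x, the learned weight settles close to 2 even though every forward pass sees a perturbed weight.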

Title: UniUGG: Unified 3D Understanding and Generation via Geometric-Semantic Encoding

Authors: Yueming Xu, Jiahui Zhang, Ze Huang, Yurui Chen, Yanpeng Zhou, Zhenyu Chen, Yu-Jie Yuan, Pengxiang Xia, Guowei Huang, Xinyue Cai, Zhongang Qi, Xingyue Quan, Jianye Hao, Hang Xu, Li Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.11952
Pdf URL: https://arxiv.org/pdf/2508.11952
Copy Paste: [[2508.11952]] UniUGG: Unified 3D Understanding and Generation via Geometric-Semantic Encoding(https://arxiv.org/abs/2508.11952)
Keywords: generation
Abstract: Despite the impressive progress on understanding and generating images shown by the recent unified architectures, the integration of 3D tasks remains challenging and largely unexplored. In this paper, we introduce UniUGG, the first unified understanding and generation framework for 3D modalities. Our unified framework employs an LLM to comprehend and decode sentences and 3D representations. At its core, we propose a spatial decoder leveraging a latent diffusion model to generate high-quality 3D representations. This allows for the generation and imagination of 3D scenes based on a reference image and an arbitrary view transformation, while retaining support for spatial visual question answering (VQA) tasks. Additionally, we propose a geometric-semantic learning strategy to pretrain the vision encoder. This design jointly captures the input's semantic and geometric cues, enhancing both spatial understanding and generation. Extensive experimental results demonstrate the superiority of our method in visual representation, spatial understanding, and 3D generation. The source code will be released upon paper acceptance.

Title: MOON: Generative MLLM-based Multimodal Representation Learning for E-commerce Product Understanding

Authors: Daoze Zhang, Zhanheng Nie, Jianyu Liu, Chenghan Fu, Wanxian Guan, Yuan Gao, Jun Song, Pengjie Wang, Jian Xu, Bo Zheng
Subjects: cs.CV, cs.AI, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2508.11999
Pdf URL: https://arxiv.org/pdf/2508.11999
Copy Paste: [[2508.11999]] MOON: Generative MLLM-based Multimodal Representation Learning for E-commerce Product Understanding(https://arxiv.org/abs/2508.11999)
Keywords: generative
Abstract: With the rapid advancement of e-commerce, exploring general representations rather than task-specific ones has attracted increasing research attention. For product understanding, although existing discriminative dual-flow architectures drive progress in this field, they inherently struggle to model the many-to-one alignment between multiple images and texts of products. Therefore, we argue that generative Multimodal Large Language Models (MLLMs) hold significant potential for improving product representation learning. Nevertheless, achieving this goal still remains non-trivial due to several key challenges: the lack of multimodal and aspect-aware modeling modules in typical LLMs; the common presence of background noise in product images; and the absence of a standard benchmark for evaluation. To address these issues, we propose the first generative MLLM-based model named MOON for product representation learning. Our method (1) employs a guided Mixture-of-Experts (MoE) module for targeted modeling of multimodal and aspect-specific product content; (2) effectively detects core semantic regions in product images to mitigate the distraction and interference caused by background noise; and (3) introduces the specialized negative sampling strategy to increase the difficulty and diversity of negative samples. In addition, we release a large-scale multimodal benchmark MBE for various product understanding tasks. Experimentally, our model demonstrates competitive zero-shot performance on both our benchmark and the public dataset, showcasing strong generalization across various downstream tasks, including cross-modal retrieval, product classification, and attribute prediction. Furthermore, the case study and visualization illustrate the effectiveness of MOON for product understanding.

Title: Q-FSRU: Quantum-Augmented Frequency-Spectral Fusion for Medical Visual Question Answering

Authors: Rakesh Thakur, Yusra Tariq
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.12036
Pdf URL: https://arxiv.org/pdf/2508.12036
Copy Paste: [[2508.12036]] Q-FSRU: Quantum-Augmented Frequency-Spectral Fusion for Medical Visual Question Answering(https://arxiv.org/abs/2508.12036)
Keywords: generation
Abstract: Solving tough clinical questions that require both image and text understanding is still a major challenge in healthcare AI. In this work, we propose Q-FSRU, a new model that combines Frequency Spectrum Representation and Fusion (FSRU) with a method called Quantum Retrieval-Augmented Generation (Quantum RAG) for medical Visual Question Answering (VQA). The model takes in features from medical images and related text, then shifts them into the frequency domain using Fast Fourier Transform (FFT). This helps it focus on more meaningful data and filter out noise or less useful information. To improve accuracy and ensure that answers are based on real knowledge, we add a quantum-inspired retrieval system. It fetches useful medical facts from external sources using quantum-based similarity techniques. These details are then merged with the frequency-based features for stronger reasoning. We evaluated our model using the VQA-RAD dataset, which includes real radiology images and questions. The results showed that Q-FSRU outperforms earlier models, especially on complex cases needing image-text reasoning. The mix of frequency and quantum information improves both performance and explainability. Overall, this approach offers a promising way to build smart, clear, and helpful AI tools for doctors.

Title: Content Accuracy and Quality Aware Resource Allocation Based on LP-Guided DRL for ISAC-Driven AIGC Networks

Authors: Ningzhe Shi, Yiqing Zhou, Ling Liu, Jinglin Shi, Yihao Wu, Haiwei Shi, Hanxiao Yu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.12079
Pdf URL: https://arxiv.org/pdf/2508.12079
Copy Paste: [[2508.12079]] Content Accuracy and Quality Aware Resource Allocation Based on LP-Guided DRL for ISAC-Driven AIGC Networks(https://arxiv.org/abs/2508.12079)
Keywords: generation, generative
Abstract: Integrated sensing and communication (ISAC) can enhance artificial intelligence-generated content (AIGC) networks by providing efficient sensing and transmission. Existing AIGC services usually assume that the accuracy of the generated content can be ensured given accurate input data and prompts, so only the content generation quality (CGQ) is considered. However, this assumption does not hold in ISAC-based AIGC networks, where content generation is based on inaccurate sensed data. Moreover, the AIGC model itself introduces generation errors, which depend on the number of generating steps (i.e., computing resources). To assess the quality of experience of ISAC-based AIGC services, we propose a content accuracy and quality aware service assessment metric (CAQA). Since allocating more resources to sensing and generating improves content accuracy but may reduce communication quality, and vice versa, this sensing-generating (computing)-communication three-dimensional resource tradeoff must be optimized to maximize the average CAQA (AvgCAQA) across all users with AIGC (CAQA-AIGC). This problem is NP-hard, with a large solution space that grows exponentially with the number of users. To solve the CAQA-AIGC problem with low complexity, a linear programming (LP) guided deep reinforcement learning (DRL) algorithm with an action filter (LPDRL-F) is proposed. Through the LP-guided approach and the action filter, LPDRL-F can transform the original three-dimensional solution space to two dimensions, reducing complexity while improving the learning performance of DRL. Simulations show that compared to existing DRL and generative diffusion model algorithms without LP, LPDRL-F converges faster by over 60% and finds better resource allocation solutions, improving AvgCAQA by more than 14%. With LPDRL-F, CAQA-AIGC can achieve an improvement in AvgCAQA of more than 50% compared to existing schemes focusing solely on CGQ.

Title: VimoRAG: Video-based Retrieval-augmented 3D Motion Generation for Motion Language Models

Authors: Haidong Xu, Guangwei Xu, Zhedong Zheng, Xiatian Zhu, Wei Ji, Xiangtai Li, Ruijie Guo, Meishan Zhang, Min Zhang, Hao Fei
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2508.12081
Pdf URL: https://arxiv.org/pdf/2508.12081
Copy Paste: [[2508.12081]] VimoRAG: Video-based Retrieval-augmented 3D Motion Generation for Motion Language Models(https://arxiv.org/abs/2508.12081)
Keywords: generation
Abstract: This paper introduces VimoRAG, a novel video-based retrieval-augmented motion generation framework for motion large language models (LLMs). As motion LLMs face severe out-of-domain/out-of-vocabulary issues due to limited annotated data, VimoRAG leverages large-scale in-the-wild video databases to enhance 3D motion generation by retrieving relevant 2D human motion signals. While video-based motion RAG is nontrivial, we address two key bottlenecks: (1) developing an effective motion-centered video retrieval model that distinguishes human poses and actions, and (2) mitigating the issue of error propagation caused by suboptimal retrieval results. We design the Gemini Motion Video Retriever mechanism and the Motion-centric Dual-alignment DPO Trainer, enabling effective retrieval and generation processes. Experimental results show that VimoRAG significantly boosts the performance of motion LLMs constrained to text-only input.

Title: Generic Event Boundary Detection via Denoising Diffusion

Authors: Jaejun Hwang, Dayoung Gong, Manjin Kim, Minsu Cho
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.12084
Pdf URL: https://arxiv.org/pdf/2508.12084
Copy Paste: [[2508.12084]] Generic Event Boundary Detection via Denoising Diffusion(https://arxiv.org/abs/2508.12084)
Keywords: generative
Abstract: Generic event boundary detection (GEBD) aims to identify natural boundaries in a video, segmenting it into distinct and meaningful chunks. Despite the inherent subjectivity of event boundaries, previous methods have focused on deterministic predictions, overlooking the diversity of plausible solutions. In this paper, we introduce a novel diffusion-based boundary detection model, dubbed DiffGEBD, that tackles the problem of GEBD from a generative perspective. The proposed model encodes relevant changes across adjacent frames via temporal self-similarity and then iteratively decodes random noise into plausible event boundaries conditioned on the encoded features. Classifier-free guidance allows the degree of diversity to be controlled in denoising diffusion. In addition, we introduce a new evaluation metric to assess the quality of predictions considering both diversity and fidelity. Experiments show that our method achieves strong performance on two standard benchmarks, Kinetics-GEBD and TAPOS, generating diverse and plausible event boundaries.
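The abstract does not spell out how the guidance is applied, so the following is only a minimal sketch of the standard classifier-free guidance blend, assuming the usual formulation in which a conditional and an unconditional noise prediction are combined with a scale `w`. The function name and the list-based vectors are hypothetical and are not taken from DiffGEBD.

```python
from typing import List


def classifier_free_guidance(
    eps_uncond: List[float], eps_cond: List[float], w: float
) -> List[float]:
    """Blend unconditional and conditional noise predictions.

    Assumed standard form: eps = eps_uncond + w * (eps_cond - eps_uncond).
    w = 0 returns the unconditional prediction, w = 1 the conditional one,
    and w > 1 extrapolates past the conditional prediction.
    """
    if len(eps_uncond) != len(eps_cond):
        raise ValueError("predictions must have the same length")
    return [u + w * (c - u) for u, c in zip(eps_uncond, eps_cond)]
```

In this formulation a larger `w` generally pulls samples closer to the conditioning signal at the cost of diversity, which is presumably the knob the abstract refers to.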

Title: Error Propagation Mechanisms and Compensation Strategies for Quantized Diffusion

Authors: Songwei Liu, Hong Liu, Fangmin Chen, Xurui Peng, Chenqian Yan, Lean Fu, Xing Mei
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.12094
Pdf URL: https://arxiv.org/pdf/2508.12094
Copy Paste: [[2508.12094]] Error Propagation Mechanisms and Compensation Strategies for Quantized Diffusion(https://arxiv.org/abs/2508.12094)
Keywords: generation
Abstract: Diffusion models have transformed image synthesis by establishing unprecedented quality and creativity benchmarks. Nevertheless, their large-scale deployment faces challenges due to computationally intensive iterative denoising processes. Although post-training quantization (PTQ) provides an effective pathway for accelerating sampling, the iterative nature of diffusion models causes stepwise quantization errors to accumulate progressively during generation, inevitably compromising output fidelity. To address this challenge, we develop a theoretical framework that mathematically formulates error propagation in Diffusion Models (DMs), deriving per-step quantization error propagation equations and establishing the first closed-form solution for cumulative error. Building on this theoretical foundation, we propose a timestep-aware cumulative error compensation scheme. Extensive experiments across multiple image datasets demonstrate that our compensation strategy effectively mitigates error propagation, significantly enhancing existing PTQ methods to achieve state-of-the-art (SOTA) performance on low-precision diffusion models.

Title: Generative Medical Event Models Improve with Scale

Authors: Shane Waxler, Paul Blazek, Davis White, Daniel Sneider, Kevin Chung, Mani Nagarathnam, Patrick Williams, Hank Voeller, Karen Wong, Matthew Swanhorst, Sheng Zhang, Naoto Usuyama, Cliff Wong, Tristan Naumann, Hoifung Poon, Andrew Loza, Daniella Meeker, Seth Hain, Rahul Shah
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2508.12104
Pdf URL: https://arxiv.org/pdf/2508.12104
Copy Paste: [[2508.12104]] Generative Medical Event Models Improve with Scale(https://arxiv.org/abs/2508.12104)
Keywords: generation, generative
Abstract: Realizing personalized medicine at scale calls for methods that distill insights from longitudinal patient journeys, which can be viewed as a sequence of medical events. Foundation models pretrained on large-scale medical event data represent a promising direction for scaling real-world evidence generation and generalizing to diverse downstream tasks. Using Epic Cosmos, a dataset with medical events from de-identified longitudinal health records for 16.3 billion encounters over 300 million unique patient records from 310 health systems, we introduce the Cosmos Medical Event Transformer (CoMET) models, a family of decoder-only transformer models pretrained on 118 million patients representing 115 billion discrete medical events (151 billion tokens). We present the largest scaling-law study for medical event data, establishing a methodology for pretraining and revealing power-law scaling relationships for compute, tokens, and model size. Based on this, we pretrained a series of compute-optimal models with up to 1 billion parameters. Conditioned on a patient's real-world history, CoMET autoregressively generates the next medical event, simulating patient health timelines. We studied 78 real-world tasks, including diagnosis prediction, disease prognosis, and healthcare operations. Remarkably for a foundation model with generic pretraining and simulation-based inference, CoMET generally outperformed or matched task-specific supervised models on these tasks, without requiring task-specific fine-tuning or few-shot examples. CoMET's predictive power consistently improves as the model and pretraining scale. Our results show that CoMET, a generative medical event foundation model, can effectively capture complex clinical dynamics, providing an extensible and generalizable framework to support clinical decision-making, streamline healthcare operations, and improve patient outcomes.
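The abstract mentions power-law scaling relationships but does not give the fitting procedure. As a rough illustration only, a power law of the form y = a * x^b can be estimated by ordinary least squares in log-log space. The helper name `fit_power_law` is hypothetical, and the sketch assumes strictly positive inputs; it is not the CoMET methodology.

```python
import math
from typing import List, Tuple


def fit_power_law(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Fit y = a * x**b by least squares on (log x, log y); returns (a, b).

    Assumes all xs and ys are strictly positive and at least two distinct xs.
    """
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((u - mean_x) ** 2 for u in lx)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
    b = sxy / sxx                        # slope in log-log space = exponent
    a = math.exp(mean_y - b * mean_x)    # intercept in log space = log(a)
    return a, b
```

For a loss-versus-compute curve, a negative fitted exponent `b` would indicate that loss falls as compute grows.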

Title: VELVET-Med: Vision and Efficient Language Pre-training for Volumetric Imaging Tasks in Medicine

Authors: Ziyang Zhang, Yang Yu, Xulei Yang, Si Yong Yeo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.12108
Pdf URL: https://arxiv.org/pdf/2508.12108
Copy Paste: [[2508.12108]] VELVET-Med: Vision and Efficient Language Pre-training for Volumetric Imaging Tasks in Medicine(https://arxiv.org/abs/2508.12108)
Keywords: generation
Abstract: Vision-and-language models (VLMs) have been increasingly explored in the medical domain, particularly following the success of CLIP in the general domain. However, unlike the relatively straightforward pairing of 2D images and text, curating large-scale paired data in the medical field for volumetric modalities such as CT scans remains a challenging and time-intensive process. This difficulty often limits the performance on downstream tasks. To address these challenges, we propose a novel vision-language pre-training (VLP) framework, termed VELVET-Med, specifically designed for limited volumetric data such as 3D CT and associated radiology reports. Instead of relying on large-scale data collection, our method focuses on the development of effective pre-training objectives and model architectures. The key contributions are: 1) We incorporate uni-modal self-supervised learning into the VLP framework, which is often underexplored in the existing literature. 2) We propose a novel language encoder, termed TriBERT, for learning multi-level textual semantics. 3) We devise hierarchical contrastive learning to capture multi-level vision-language correspondence. Using only 38,875 scan-report pairs, our approach seeks to uncover rich spatial and semantic relationships embedded in volumetric medical images and corresponding clinical narratives, thereby enhancing the generalization ability of the learned encoders. The resulting encoders exhibit strong transferability, achieving state-of-the-art performance across a wide range of downstream tasks, including 3D segmentation, cross-modal retrieval, visual question answering, and report generation.

Title: Demystifying Foreground-Background Memorization in Diffusion Models

Authors: Jimmy Z. Di, Yiwei Lu, Yaoliang Yu, Gautam Kamath, Adam Dziedzic, Franziska Boenisch
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.12148
Pdf URL: https://arxiv.org/pdf/2508.12148
Copy Paste: [[2508.12148]] Demystifying Foreground-Background Memorization in Diffusion Models(https://arxiv.org/abs/2508.12148)
Keywords: generation
Abstract: Diffusion models (DMs) memorize training images and can reproduce near-duplicates during generation. Current detection methods identify verbatim memorization but fail to capture two critical aspects: quantifying partial memorization occurring in small image regions, and memorization patterns beyond specific prompt-image pairs. To address these limitations, we propose Foreground Background Memorization (FB-Mem), a novel segmentation-based metric that classifies and quantifies memorized regions within generated images. Our method reveals that memorization is more pervasive than previously understood: (1) individual generations from single prompts may be linked to clusters of similar training images, revealing complex memorization patterns that extend beyond one-to-one correspondences; and (2) existing model-level mitigation methods, such as neuron deactivation and pruning, fail to eliminate local memorization, which persists particularly in foreground regions. Our work establishes an effective framework for measuring memorization in diffusion models, demonstrates the inadequacy of current mitigation approaches, and proposes a stronger mitigation method using a clustering approach.

Title: Scalable RF Simulation in Generative 4D Worlds

Authors: Zhiwei Zheng, Dongyin Hu, Mingmin Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.12176
Pdf URL: https://arxiv.org/pdf/2508.12176
Copy Paste: [[2508.12176]] Scalable RF Simulation in Generative 4D Worlds(https://arxiv.org/abs/2508.12176)
Keywords: generation, generative
Abstract: Radio Frequency (RF) sensing has emerged as a powerful, privacy-preserving alternative to vision-based methods for indoor perception tasks. However, collecting high-quality RF data in dynamic and diverse indoor environments remains a major challenge. To address this, we introduce WaveVerse, a prompt-based, scalable framework that simulates realistic RF signals from generated indoor scenes with human motions. WaveVerse introduces a language-guided 4D world generator, which includes a state-aware causal transformer for human motion generation conditioned on spatial constraints and texts, and a phase-coherent ray tracing simulator that enables the simulation of accurate and coherent RF signals. Experiments demonstrate the effectiveness of our approach in conditioned human motion generation and highlight how phase coherence is applied to beamforming and respiration monitoring. We further present two case studies in ML-based high-resolution imaging and human activity recognition, demonstrating that WaveVerse not only enables data generation for RF imaging for the first time, but also consistently achieves performance gain in both data-limited and data-adequate scenarios.

Title: Distribution Matching via Generalized Consistency Models

Authors: Sagar Shrestha, Rajesh Shrestha, Tri Nguyen, Subash Timilsina
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.12222
Pdf URL: https://arxiv.org/pdf/2508.12222
Copy Paste: [[2508.12222]] Distribution Matching via Generalized Consistency Models(https://arxiv.org/abs/2508.12222)
Keywords: generative
Abstract: Recent advances in generative models have demonstrated remarkable performance across various data modalities. Beyond their typical use in data synthesis, these models play a crucial role in distribution matching tasks such as latent variable modeling, domain translation, and domain adaptation. Generative Adversarial Networks (GANs) have emerged as the preferred method of distribution matching due to their efficacy in handling high-dimensional data and their flexibility in accommodating various constraints. However, GANs often encounter challenges in training due to their bi-level min-max optimization objective and susceptibility to mode collapse. In this work, we propose a novel approach for distribution matching inspired by the consistency models employed in Continuous Normalizing Flow (CNF). Our model inherits the advantages of CNF models, such as having a straightforward norm minimization objective, while remaining adaptable to different constraints similar to GANs. We provide theoretical validation of our proposed objective and demonstrate its performance through experiments on synthetic and real-world datasets.

Title: In vivo 3D ultrasound computed tomography of musculoskeletal tissues with generative neural physics

Authors: Zhijun Zeng, Youjia Zheng, Chang Su, Qianhang Wu, Hao Hu, Zeyuan Dong, Shan Gao, Yang Lv, Rui Tang, Ligang Cui, Zhiyong Hou, Weijun Lin, Zuoqiang Shi, Yubing Li, He Sun
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.12226
Pdf URL: https://arxiv.org/pdf/2508.12226
Copy Paste: [[2508.12226]] In vivo 3D ultrasound computed tomography of musculoskeletal tissues with generative neural physics(https://arxiv.org/abs/2508.12226)
Keywords: generative
Abstract: Ultrasound computed tomography (USCT) is a radiation-free, high-resolution modality but remains limited for musculoskeletal imaging due to conventional ray-based reconstructions that neglect strong scattering. We propose a generative neural physics framework that couples generative networks with physics-informed neural simulation for fast, high-fidelity 3D USCT. By learning a compact surrogate of ultrasonic wave propagation from only dozens of cross-modality images, our method merges the accuracy of wave modeling with the efficiency and stability of deep learning. This enables accurate quantitative imaging of in vivo musculoskeletal tissues, producing spatial maps of acoustic properties beyond reflection-mode images. On synthetic and in vivo data (breast, arm, leg), we reconstruct 3D maps of tissue parameters in under ten minutes, with sensitivity to biomechanical properties in muscle and bone and resolution comparable to MRI. By overcoming computational bottlenecks in strongly scattering regimes, this approach advances USCT toward routine clinical assessment of musculoskeletal disease.

Title: SNNSIR: A Simple Spiking Neural Network for Stereo Image Restoration

Authors: Ronghua Xu, Jin Xie, Jing Nie, Jiale Cao, Yanwei Pang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.12271
Pdf URL: https://arxiv.org/pdf/2508.12271
Copy Paste: [[2508.12271]] SNNSIR: A Simple Spiking Neural Network for Stereo Image Restoration(https://arxiv.org/abs/2508.12271)
Keywords: restoration, super-resolution
Abstract: Spiking Neural Networks (SNNs), characterized by discrete binary activations, offer high computational efficiency and low energy consumption, making them well-suited for computation-intensive tasks such as stereo image restoration. In this work, we propose SNNSIR, a simple yet effective Spiking Neural Network for Stereo Image Restoration, specifically designed under the spike-driven paradigm where neurons transmit information through sparse, event-based binary spikes. In contrast to existing hybrid SNN-ANN models that still rely on operations such as floating-point matrix division or exponentiation, which are incompatible with the binary and event-driven nature of SNNs, our proposed SNNSIR adopts a fully spike-driven architecture to achieve low-power and hardware-friendly computation. To address the expressiveness limitations of binary spiking neurons, we first introduce a lightweight Spike Residual Basic Block (SRBB) to enhance information flow via spike-compatible residual learning. Building on this, the Spike Stereo Convolutional Modulation (SSCM) module introduces simplified nonlinearity through element-wise multiplication and highlights noise-sensitive regions via cross-view-aware modulation. Complementing this, the Spike Stereo Cross-Attention (SSCA) module further improves stereo correspondence by enabling efficient bidirectional feature interaction across views within a spike-compatible framework. Extensive experiments on diverse stereo image restoration tasks, including rain streak removal, raindrop removal, low-light enhancement, and super-resolution demonstrate that our model achieves competitive restoration performance while significantly reducing computational overhead. These results highlight the potential for real-time, low-power stereo vision applications. The code will be available after the article is accepted.

Title: Semantic Discrepancy-aware Detector for Image Forgery Identification

Authors: Ziye Wang, Minghang Yu, Chunyan Xu, Zhen Cui
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.12341
Pdf URL: https://arxiv.org/pdf/2508.12341
Copy Paste: [[2508.12341]] Semantic Discrepancy-aware Detector for Image Forgery Identification(https://arxiv.org/abs/2508.12341)
Keywords: generation
Abstract: With the rapid advancement of image generation techniques, robust forgery detection has become increasingly imperative to ensure the trustworthiness of digital media. Recent research indicates that the learned semantic concepts of pre-trained models are critical for identifying fake images. However, the misalignment between the forgery and semantic concept spaces hinders the model's forgery detection performance. To address this problem, we propose a novel Semantic Discrepancy-aware Detector (SDD) that leverages reconstruction learning to align the two spaces at a fine-grained visual level. By exploiting the conceptual knowledge embedded in the pre-trained vision language model, we specifically design a semantic token sampling module to mitigate the space shifts caused by features irrelevant to both forgery traces and semantic concepts. A concept-level forgery discrepancy learning module, built upon a visual reconstruction paradigm, is proposed to strengthen the interaction between visual semantic concepts and forgery traces, effectively capturing discrepancies under the concepts' guidance. Finally, the low-level forgery feature enhancer integrates the learned concept-level forgery discrepancies to minimize redundant forgery information. Experiments conducted on two standard image forgery datasets demonstrate the efficacy of the proposed SDD, which achieves superior results compared to existing methods. The code is available at this https URL.

Title: Navigating the Exploration-Exploitation Tradeoff in Inference-Time Scaling of Diffusion Models

Authors: Xun Su, Jianming Huang, Yang Yusen, Zhongxi Fang, Hiroyuki Kasai
Subjects: cs.LG, cs.AI, math.ST
Abstract URL: https://arxiv.org/abs/2508.12361
Pdf URL: https://arxiv.org/pdf/2508.12361
Copy Paste: [[2508.12361]] Navigating the Exploration-Exploitation Tradeoff in Inference-Time Scaling of Diffusion Models(https://arxiv.org/abs/2508.12361)
Keywords: generation
Abstract: Inference-time scaling has achieved remarkable success in language models, yet its adaptation to diffusion models remains underexplored. We observe that the efficacy of recent Sequential Monte Carlo (SMC)-based methods largely stems from globally fitting the reward-tilted distribution, which inherently preserves diversity during multi-modal search. However, current applications of SMC to diffusion models face a fundamental dilemma: early-stage noise samples offer high potential for improvement but are difficult to evaluate accurately, whereas late-stage samples can be reliably assessed but are largely irreversible. To address this exploration-exploitation trade-off, we approach the problem from the perspective of the search algorithm and propose two strategies: Funnel Schedule and Adaptive Temperature. These simple yet effective methods are tailored to the unique generation dynamics and phase-transition behavior of diffusion models. By progressively reducing the number of maintained particles and down-weighting the influence of early-stage rewards, our methods significantly enhance sample quality without increasing the total number of Noise Function Evaluations. Experimental results on multiple benchmarks and state-of-the-art text-to-image diffusion models demonstrate that our approach outperforms previous baselines.
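The abstract names the two strategies without giving their exact form. The sketch below illustrates the general idea only, under loudly stated assumptions: a linear particle-count schedule stands in for the Funnel Schedule, and a softmax over reward divided by temperature stands in for temperature-controlled resampling weights. Both function names and both functional forms are hypothetical and may differ from the paper.

```python
import math
from typing import List


def funnel_schedule(n_start: int, n_end: int, num_steps: int) -> List[int]:
    """Particle count per step, shrinking linearly from n_start to n_end.

    Assumption: a linear decay; the paper's actual schedule is not given here.
    """
    if num_steps < 2:
        return [n_start]
    return [
        round(n_start + (n_end - n_start) * t / (num_steps - 1))
        for t in range(num_steps)
    ]


def resampling_weights(rewards: List[float], temperature: float) -> List[float]:
    """Softmax of rewards / temperature (numerically stabilised).

    A high temperature flattens the weights, so noisy early-stage rewards
    influence resampling less; a low temperature sharpens them.
    """
    scaled = [r / temperature for r in rewards]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

One plausible use is to pair a wide particle set and a high temperature at early denoising steps with a narrow set and a low temperature at late steps, which matches the explore-then-exploit description in the abstract.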

Title: DeCoT: Decomposing Complex Instructions for Enhanced Text-to-Image Generation with Large Language Models

Authors: Xiaochuan Lin, Xiangyong Chen, Xuan Li, Yichen Su
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.12396
Pdf URL: https://arxiv.org/pdf/2508.12396
Copy Paste: [[2508.12396]] DeCoT: Decomposing Complex Instructions for Enhanced Text-to-Image Generation with Large Language Models(https://arxiv.org/abs/2508.12396)
Keywords: generation
Abstract: Despite remarkable advancements, current Text-to-Image (T2I) models struggle with complex, long-form textual instructions, frequently failing to accurately render intricate details, spatial relationships, or specific constraints. This limitation is highlighted by benchmarks such as LongBench-T2I, which reveal deficiencies in handling composition, specific text, and fine textures. To address this, we propose DeCoT (Decomposition-CoT), a novel framework that leverages Large Language Models (LLMs) to significantly enhance T2I models' understanding and execution of complex instructions. DeCoT operates in two core stages: first, Complex Instruction Decomposition and Semantic Enhancement, where an LLM breaks down raw instructions into structured, actionable semantic units and clarifies ambiguities; second, Multi-Stage Prompt Integration and Adaptive Generation, which transforms these units into a hierarchical or optimized single prompt tailored for existing T2I models. Extensive experiments on the LongBench-T2I dataset demonstrate that DeCoT consistently and substantially improves the performance of leading T2I models across all evaluated dimensions, particularly in challenging aspects like "Text" and "Composition". Quantitative results, validated by multiple MLLM evaluators (Gemini-2.0-Flash and InternVL3-78B), show that DeCoT, when integrated with Infinity-8B, achieves an average score of 3.52, outperforming the baseline Infinity-8B (3.44). Ablation studies confirm the critical contribution of each DeCoT component and the importance of sophisticated LLM prompting. Furthermore, human evaluations corroborate these findings, indicating superior perceptual quality and instruction fidelity. DeCoT effectively bridges the gap between high-level user intent and T2I model requirements, leading to more faithful and accurate image generation.

Title: Federated Cross-Modal Style-Aware Prompt Generation

Authors: Suraj Prasad, Navyansh Mahla, Sunny Gupta, Amit Sethi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.12399
Pdf URL: https://arxiv.org/pdf/2508.12399
Copy Paste: [[2508.12399]] Federated Cross-Modal Style-Aware Prompt Generation(https://arxiv.org/abs/2508.12399)
Keywords: generation
Abstract: Prompt learning has propelled vision-language models like CLIP to excel in diverse tasks, making them ideal for federated learning due to computational efficiency. However, conventional approaches that rely solely on final-layer features miss out on rich multi-scale visual cues and domain-specific style variations in decentralized client data. To bridge this gap, we introduce FedCSAP (Federated Cross-Modal Style-Aware Prompt Generation). Our framework harnesses low, mid, and high-level features from CLIP's vision encoder alongside client-specific style indicators derived from batch-level statistics. By merging intricate visual details with textual context, FedCSAP produces robust, context-aware prompt tokens that are both distinct and non-redundant, thereby boosting generalization across seen and unseen classes. Operating within a federated learning paradigm, our approach ensures data privacy through local training and global aggregation, adeptly handling non-IID class distributions and diverse domain-specific styles. Comprehensive experiments on multiple image classification datasets confirm that FedCSAP outperforms existing federated prompt learning methods in both accuracy and overall generalization.

Title: MPCAR: Multi-Perspective Contextual Augmentation for Enhanced Visual Reasoning in Large Vision-Language Models

Authors: Amirul Rahman, Qiang Xu, Xueying Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.12400
Pdf URL: https://arxiv.org/pdf/2508.12400
Copy Paste: [[2508.12400]] MPCAR: Multi-Perspective Contextual Augmentation for Enhanced Visual Reasoning in Large Vision-Language Models(https://arxiv.org/abs/2508.12400)
Keywords: generation, generative
Abstract: Despite significant advancements, Large Vision-Language Models (LVLMs) continue to face challenges in complex visual reasoning tasks that demand deep contextual understanding, multi-angle analysis, or meticulous detail recognition. Existing approaches often rely on single-shot image encoding and prompts, limiting their ability to fully capture nuanced visual information. Inspired by the notion that strategically generated "additional" information can serve as beneficial contextual augmentation, we propose Multi-Perspective Contextual Augmentation for Reasoning (MPCAR), a novel inference-time strategy designed to enhance LVLM performance. MPCAR operates in three stages: first, an LVLM generates N diverse and complementary descriptions or preliminary reasoning paths from various angles; second, these descriptions are intelligently integrated with the original question to construct a comprehensive context-augmented prompt; and finally, this enriched prompt guides the ultimate LVLM for deep reasoning and final answer generation. Crucially, MPCAR achieves these enhancements without requiring any fine-tuning of the underlying LVLM's parameters. Extensive experiments on challenging Visual Question Answering (VQA) datasets, including GQA, VQA-CP v2, and ScienceQA (Image-VQA), demonstrate that MPCAR consistently outperforms established baseline methods. Our quantitative results show significant accuracy gains, particularly on tasks requiring robust contextual understanding, while human evaluations confirm improved coherence and completeness of the generated answers. Ablation studies further highlight the importance of diverse prompt templates and the number of generated perspectives. This work underscores the efficacy of leveraging LVLMs' inherent generative capabilities to enrich input contexts, thereby unlocking their latent reasoning potential for complex multimodal tasks.
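The three stages described in the abstract can be sketched as a small pipeline. This is a minimal illustration, assuming the LVLM is available as a plain callable from prompt text to response text; the image input is omitted for brevity, and the function name, prompt layout, and `perspective_prompts` parameter are all hypothetical rather than taken from MPCAR.

```python
from typing import Callable, List


def mpcar_answer(
    lvlm: Callable[[str], str],
    question: str,
    perspective_prompts: List[str],
) -> str:
    """Three-stage, inference-time context augmentation (illustrative only)."""
    # Stage 1: one description or preliminary reasoning path per perspective.
    descriptions = [
        lvlm(f"{prompt}\nQuestion: {question}") for prompt in perspective_prompts
    ]
    # Stage 2: merge the descriptions with the original question.
    context = "\n".join(
        f"Perspective {i + 1}: {text}" for i, text in enumerate(descriptions)
    )
    augmented_prompt = f"{context}\nQuestion: {question}\nAnswer:"
    # Stage 3: a final call on the enriched prompt produces the answer.
    return lvlm(augmented_prompt)
```

With N perspectives the sketch issues N + 1 model calls and changes no model parameters, consistent with the abstract's claim that no fine-tuning is required.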

Title: TiP4GEN: Text to Immersive Panorama 4D Scene Generation

Authors: Ke Xing, Hanwen Liang, Dejia Xu, Yuyang Yin, Konstantinos N. Plataniotis, Yao Zhao, Yunchao Wei
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.12415
Pdf URL: https://arxiv.org/pdf/2508.12415
Copy Paste: [[2508.12415]] TiP4GEN: Text to Immersive Panorama 4D Scene Generation(https://arxiv.org/abs/2508.12415)
Keywords: generation
Abstract: With the rapid advancement and widespread adoption of VR/AR technologies, there is a growing demand for the creation of high-quality, immersive dynamic scenes. However, existing generation works predominantly concentrate on the creation of static scenes or narrow perspective-view dynamic scenes, falling short of delivering a truly 360-degree immersive experience from any viewpoint. In this paper, we introduce TiP4GEN, an advanced text-to-dynamic panorama scene generation framework that enables fine-grained content control and synthesizes motion-rich, geometry-consistent panoramic 4D scenes. TiP4GEN integrates panorama video generation and dynamic scene reconstruction to create 360-degree immersive virtual environments. For video generation, we introduce a Dual-branch Generation Model consisting of a panorama branch and a perspective branch, responsible for global and local view generation, respectively. A bidirectional cross-attention mechanism facilitates comprehensive information exchange between the branches. For scene reconstruction, we propose a Geometry-aligned Reconstruction Model based on 3D Gaussian Splatting. By aligning spatial-temporal point clouds using metric depth maps and initializing scene cameras with estimated poses, our method ensures geometric consistency and temporal coherence for the reconstructed scenes. Extensive experiments demonstrate the effectiveness of our proposed designs and the superiority of TiP4GEN in generating visually compelling and motion-coherent dynamic panoramic scenes. Our project page is at this https URL.
摘要：随着VR/AR技术的快速发展和广泛采用，人们对创造高质量，沉浸式动态场景的需求不断增长。但是，现有一代主要专注于创建静态场景或狭窄的视角 - 视图动态场景，而从任何角度来看，都没有提供真正360度的沉浸式体验。在本文中，我们介绍了\ textbf {tip4gen}，这是一种高级文本到动态的全景场景生成框架，可实现精细的内容控制并综合了运动富的几何学，一致的全景4D场景。 Tip4Gen集成了全景视频生成和动态场景重建，以创建360度沉浸式虚拟环境。对于视频生成，我们介绍了一个由全景分支和一个透视分支组成的\ textbf {Dual-Branch生成模型}，分别负责全球和本地视图生成。双向交叉注意机制促进了分支机构之间的全面信息交换。对于场景重建，我们建议基于3D高斯分裂的\ textbf {几何与几何的重建模型}。通过使用度量深度图并用估计的姿势将空间点云对准时空云，我们的方法可确保重建场景的几何一致性和时间连贯性。广泛的实验证明了我们提出的设计的有效性以及TIP4GEN在产生视觉上吸引人和动态的动态全景场景中的优势。我们的项目页面在此HTTPS URL上。

Title: Adversarial Attacks on VQA-NLE: Exposing and Alleviating Inconsistencies in Visual Question Answering Explanations

Authors: Yahsin Yeh, Yilun Wu, Bokai Ruan, Honghan Shuai
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2508.12430
Pdf URL: https://arxiv.org/pdf/2508.12430
Copy Paste: [[2508.12430]] Adversarial Attacks on VQA-NLE: Exposing and Alleviating Inconsistencies in Visual Question Answering Explanations(https://arxiv.org/abs/2508.12430)
Keywords: generation
Abstract: Natural language explanations in visual question answering (VQA-NLE) aim to make black-box models more transparent by elucidating their decision-making processes. However, we find that existing VQA-NLE systems can produce inconsistent explanations and reach conclusions without genuinely understanding the underlying context, exposing weaknesses in either their inference pipeline or explanation-generation mechanism. To highlight these vulnerabilities, we not only leverage an existing adversarial strategy to perturb questions but also propose a novel strategy that minimally alters images to induce contradictory or spurious outputs. We further introduce a mitigation method that leverages external knowledge to alleviate these inconsistencies, thereby bolstering model robustness. Extensive evaluations on two standard benchmarks and two widely used VQA-NLE models underscore the effectiveness of our attacks and the potential of knowledge-based defenses, ultimately revealing pressing security and reliability concerns in current VQA-NLE systems.
摘要：视觉问题回答（VQA-NLE）中的自然语言解释旨在通过阐明其决策过程使黑盒模型更加透明。但是，我们发现现有的VQA-nle系统可以产生不一致的解释并得出结论，而无需真正理解潜在的环境，从而在其推理管道或解释生成机制中暴露了弱点。为了强调这些脆弱性，我们不仅利用现有的对抗策略来扰动问题，而且提出了一种新颖的策略，该策略最小化了图像以引起矛盾或虚假的产出。我们进一步介绍了一种缓解方法，该方法利用外部知识来减轻这些不一致之处，从而增强了模型的鲁棒性。对两个标准基准和两个广泛使用的VQA-NLE模型的广泛评估强调了我们攻击的有效性以及基于知识的防御力的潜力，最终揭示了当前VQA-NLE系统中紧迫的安全性和可靠性问题。

Title: X-Ray-CoT: Interpretable Chest X-ray Diagnosis with Vision-Language Models via Chain-of-Thought Reasoning

Authors: Chee Ng, Liliang Sun, Shaoqing Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.12455
Pdf URL: https://arxiv.org/pdf/2508.12455
Copy Paste: [[2508.12455]] X-Ray-CoT: Interpretable Chest X-ray Diagnosis with Vision-Language Models via Chain-of-Thought Reasoning(https://arxiv.org/abs/2508.12455)
Keywords: generation
Abstract: Chest X-ray imaging is crucial for diagnosing pulmonary and cardiac diseases, yet its interpretation demands extensive clinical experience and suffers from inter-observer variability. While deep learning models offer high diagnostic accuracy, their black-box nature hinders clinical adoption in high-stakes medical settings. To address this, we propose X-Ray-CoT (Chest X-Ray Chain-of-Thought), a novel framework leveraging Large Vision-Language Models (LVLMs) for intelligent chest X-ray diagnosis and interpretable report generation. X-Ray-CoT simulates human radiologists' "chain-of-thought" by first extracting multi-modal features and visual concepts, then employing an LLM-based component with a structured Chain-of-Thought prompting strategy to reason and produce detailed natural language diagnostic reports. Evaluated on the CORDA dataset, X-Ray-CoT achieves competitive quantitative performance, with a Balanced Accuracy of 80.52% and F1 score of 78.65% for disease diagnosis, slightly surpassing existing black-box models. Crucially, it uniquely generates high-quality, explainable reports, as validated by preliminary human evaluations. Our ablation studies confirm the integral role of each proposed component, highlighting the necessity of multi-modal fusion and CoT reasoning for robust and transparent medical AI. This work represents a significant step towards trustworthy and clinically actionable AI systems in medical imaging.
摘要：胸部X射线成像对于诊断肺部和心脏疾病至关重要，但是其解释需要广泛的临床经验，并且患有观察者间的变异性。虽然深度学习模型具有很高的诊断精度，但它们的黑盒自然可以阻碍高风险医疗环境中的临床采用。为了解决这个问题，我们提出了X射线仪（胸部X射线链链），这是一个新型框架，利用视觉语言大型模型（LVLM），用于智能胸部X射线诊断和可解释的报告生成。 X射线单位通过首先提取多模式特征和视觉概念来模拟人类放射科医生的“思想链”，然后采用基于LLM的组件，具有结构化的Thequient of-Theark Praingnation促进策略，以推理并产生详细的自然语言诊断报告。在Corda数据集上进行了评估，X射线COT实现了具有竞争力的定量性能，疾病诊断的平衡精度为80.52％，F1得分为78.65％，略有超过现有的黑盒模型。至关重要的是，它独特地产生了由人类初步评估验证的高质量，可解释的报告。我们的消融研究证实了每个提出的组件的组成部分，强调了多模式融合的必要性以及对健壮和透明医疗AI的COT推理的必要性。这项工作代表了迈向医学成像中可信赖和临床可行的AI系统的重要一步。

Title: Standardization of Neuromuscular Reflex Analysis -- Role of Fine-Tuned Vision-Language Model Consortium and OpenAI gpt-oss Reasoning LLM Enabled Decision Support System

Authors: Eranga Bandara, Ross Gore, Sachin Shetty, Ravi Mukkamala, Christopher Rhea, Atmaram Yarlagadda, Shaifali Kaushik, L.H.M.P.De Silva, Andriy Maznychenko, Inna Sokolowska, Amin Hass, Kasun De Zoysa
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.12473
Pdf URL: https://arxiv.org/pdf/2508.12473
Copy Paste: [[2508.12473]] Standardization of Neuromuscular Reflex Analysis -- Role of Fine-Tuned Vision-Language Model Consortium and OpenAI gpt-oss Reasoning LLM Enabled Decision Support System(https://arxiv.org/abs/2508.12473)
Keywords: generation
Abstract: Accurate assessment of neuromuscular reflexes, such as the H-reflex, plays a critical role in sports science, rehabilitation, and clinical neurology. Traditional analysis of H-reflex EMG waveforms is subject to variability and interpretation bias among clinicians and researchers, limiting reliability and standardization. To address these challenges, we propose a Fine-Tuned Vision-Language Model (VLM) Consortium and a reasoning Large-Language Model (LLM)-enabled Decision Support System for automated H-reflex waveform interpretation and diagnosis. Our approach leverages multiple VLMs, each fine-tuned on curated datasets of H-reflex EMG waveform images annotated with clinical observations, recovery timelines, and athlete metadata. These models are capable of extracting key electrophysiological features and predicting neuromuscular states, including fatigue, injury, and recovery, directly from EMG images and contextual metadata. Diagnostic outputs from the VLM consortium are aggregated using a consensus-based method and refined by a specialized reasoning LLM, which ensures robust, transparent, and explainable decision support for clinicians and sports scientists. The end-to-end platform orchestrates seamless communication between the VLM ensemble and the reasoning LLM, integrating prompt engineering strategies and automated reasoning workflows using LLM Agents. Experimental results demonstrate that this hybrid system delivers highly accurate, consistent, and interpretable H-reflex assessments, significantly advancing the automation and standardization of neuromuscular diagnostics. To our knowledge, this work represents the first integration of a fine-tuned VLM consortium with a reasoning LLM for image-based H-reflex analysis, laying the foundation for next-generation AI-assisted neuromuscular assessment and athlete monitoring platforms.
摘要：精确评估神经肌肉反射，例如H反射，在运动科学，康复和临床神经病学中起着至关重要的作用。 H-反射EMG波形的传统分析会存在临床医生和研究人员之间的可变性和解释偏见，从而限制了可靠性和标准化。为了应对这些挑战，我们提出了一个微调的视觉语言模型（VLM）联盟和一个推理大语模型（LLM）启用的启用H-Reflex波浪形波形解释和诊断的决策支持系统。我们的方法利用了多个VLM，每个VLM都在带有临床观察，恢复时间表和运动员元数据的H-反射EMG波形图像的策划数据集上进行了微调。这些模型能够直接从EMG图像和上下文元数据中提取关键的电生理特征并预测神经肌肉状态，包括疲劳，损伤和恢复。来自VLM联盟的诊断输出使用基于共识的方法汇总，并通过专门的推理LLM进行了完善，该方法可确保对临床医生和体育科学家的稳健，透明和可解释的决策支持。端到端平台在VLM集合和推理LLM之间进行了无缝的沟通，使用LLM代理集成了及时的工程策略和自动推理工作流程。实验结果表明，该混合系统提供了高度准确，一致且可解释的H反射评估，从而显着提高了神经肌肉诊断的自动化和标准化。据我们所知，这项工作代表了一个微调VLM联盟与用于基于图像的H反射分析的推理LLM的首次整合，为下一代AI辅助神经肌肉评估和运动员监测平台奠定了基础。
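Note (illustrative sketch): the abstract says the outputs of the VLM consortium are aggregated with "a consensus-based method" but does not specify it. One simple possibility is a majority vote over the predicted neuromuscular state. The vote rule and the function name below are assumptions made for illustration only.

```python
from collections import Counter
from typing import List, Tuple


def consensus(predictions: List[str]) -> Tuple[str, float]:
    """Return the majority label and the fraction of models that agree with it.

    Assumed aggregation rule (plain majority vote); ties go to the label seen first.
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    label, votes = Counter(predictions).most_common(1)[0]
    return label, votes / len(predictions)


# Example: three hypothetical VLM outputs for one H-reflex waveform image.
label, agreement = consensus(["fatigue", "fatigue", "recovery"])
```

The agreement fraction could be passed to the downstream reasoning LLM as a rough confidence signal, although the abstract does not say whether the system does this.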

Title: Design and Validation of a Responsible Artificial Intelligence-based System for the Referral of Diabetic Retinopathy Patients

Authors: E. Ulises Moya-Sánchez, Abraham Sánchez-Perez, Raúl Nanclares Da Veiga, Alejandro Zarate-Macías, Edgar Villareal, Alejandro Sánchez-Montes, Edtna Jauregui-Ulloa, Héctor Moreno, Ulises Cortés
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.12506
Pdf URL: https://arxiv.org/pdf/2508.12506
Copy Paste: [[2508.12506]] Design and Validation of a Responsible Artificial Intelligence-based System for the Referral of Diabetic Retinopathy Patients(https://arxiv.org/abs/2508.12506)
Keywords: quality assessment
Abstract: Diabetic Retinopathy (DR) is a leading cause of vision loss in working-age individuals. Early detection of DR can reduce the risk of vision loss by up to 95%, but a shortage of retinologists and challenges in timely examination complicate detection. Artificial Intelligence (AI) models using retinal fundus photographs (RFPs) offer a promising solution. However, adoption in clinical settings is hindered by low-quality data and biases that may lead AI systems to learn unintended features. To address these challenges, we developed RAIS-DR, a Responsible AI System for DR screening that incorporates ethical principles across the AI lifecycle. RAIS-DR integrates efficient convolutional models for preprocessing, quality assessment, and three specialized DR classification models. We evaluated RAIS-DR against the FDA-approved EyeArt system on a local dataset of 1,046 patients, unseen by both systems. RAIS-DR demonstrated significant improvements, with F1 scores increasing by 5-12%, accuracy by 6-19%, and specificity by 10-20%. Additionally, fairness metrics such as Disparate Impact and Equal Opportunity Difference indicated equitable performance across demographic subgroups, underscoring RAIS-DR's potential to reduce healthcare disparities. These results highlight RAIS-DR as a robust and ethically aligned solution for DR screening in clinical settings. The code and weights of RAIS-DR are available at this https URL with RAIL.
摘要：糖尿病性视网膜病（DR）是造成工作年龄个体视力丧失的主要原因。早期检测DR可以将视力丧失的风险降低多达95％，但是在及时检查中缺乏学者和挑战的短缺使检测复杂化。使用视网膜眼镜照片（RFP）的人工智能（AI）模型提供了有前途的解决方案。但是，在临床环境中的采用受到低质量数据和可能导致AI系统学习意外功能的偏见的阻碍。为了应对这些挑战，我们开发了RAIS-DR，这是一种负责的AI筛查系统，该系统纳入了整个AI生命周期的道德原则。 RAIS-DR集成了有效的卷积模型，用于预处理，质量评估和三个专业的DR分类模型。我们在1,046例患者的本地数据集上对RAIS-DR进行了对FDA批准的Eyeart系统的评估，这两种系统都看不见。 RAIS-DR显示出显着的改善，F1得分增加了5-12％，准确性增加了6-19％，特异性提高了10-20％。此外，公平性指标（例如不同的影响力和机会均衡差异）表明人口亚组的公平表现，强调了RAIS-DR的潜力减少医疗保健差异。这些结果强调了RAIS-DR是在临床环境中进行DR筛查的稳健且具有道德对准的解决方案。该代码，RAI-DR的重量可在此HTTPS URL上提供。
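Note (illustrative sketch): the two fairness metrics named in the abstract are commonly defined as a ratio of positive-prediction rates (Disparate Impact) and a difference of true-positive rates (Equal Opportunity Difference) between two groups. The code below uses those common definitions. How the paper forms its demographic subgroups, and which group it treats as the reference, is not stated in the abstract.

```python
from typing import Sequence


def selection_rate(y_pred: Sequence[int]) -> float:
    """Fraction of cases predicted positive (e.g. referred)."""
    return sum(y_pred) / len(y_pred)


def true_positive_rate(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Recall on the truly positive cases."""
    hits = [p for t, p in zip(y_true, y_pred) if t == 1]
    return sum(hits) / len(hits)


def disparate_impact(pred_group_a: Sequence[int], pred_group_b: Sequence[int]) -> float:
    """Ratio of selection rates; 1.0 means both groups are selected equally often."""
    return selection_rate(pred_group_a) / selection_rate(pred_group_b)


def equal_opportunity_difference(true_a, pred_a, true_b, pred_b) -> float:
    """Difference of true-positive rates; 0.0 means equal recall in both groups."""
    return true_positive_rate(true_a, pred_a) - true_positive_rate(true_b, pred_b)
```

A Disparate Impact near 1 and an Equal Opportunity Difference near 0 are the values usually read as equitable behaviour.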

Title: LangVision-LoRA-NAS: Neural Architecture Search for Variable LoRA Rank in Vision Language Models

Authors: Krishna Teja Chitty-Venkata, Murali Emani, Venkatram Vishwanath
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.12512
Pdf URL: https://arxiv.org/pdf/2508.12512
Copy Paste: [[2508.12512]] LangVision-LoRA-NAS: Neural Architecture Search for Variable LoRA Rank in Vision Language Models(https://arxiv.org/abs/2508.12512)
Keywords: generation
Abstract: Vision Language Models (VLMs) integrate visual and text modalities to enable multimodal understanding and generation. These models typically combine a Vision Transformer (ViT) as an image encoder and a Large Language Model (LLM) for text generation. LoRA (Low-Rank Adaptation) is an efficient fine-tuning method to adapt pre-trained models to new tasks by introducing low-rank updates to their weights. While LoRA has emerged as a powerful technique for fine-tuning large models, current implementations assume a fixed rank, potentially limiting flexibility and efficiency across diverse tasks. This paper introduces LangVision-LoRA-NAS, a novel framework that integrates Neural Architecture Search (NAS) with LoRA to optimize VLMs for variable-rank adaptation. Our approach leverages NAS to dynamically search for the optimal LoRA rank configuration tailored to specific multimodal tasks, balancing performance and computational efficiency. Through extensive experiments using the LLaMA-3.2-11B model on several datasets, LangVision-LoRA-NAS demonstrates notable improvement in model performance while reducing fine-tuning costs. Our base and searched fine-tuned models on LLaMA-3.2-11B-Vision-Instruct can be found at this https URL, and the code for LangVision-LoRA-NAS can be found at this https URL.
摘要：视觉语言模型（VLM）整合了视觉和文本模式，以实现多模式的理解和生成。这些模型通常将视觉变压器（VIT）作为图像编码器和用于文本生成的大型语言模型（LLM）。 LORA（低级适应）是一种有效的微调方法，可以通过对其权重引入低级更新来使预训练的模型适应新任务。尽管洛拉（Lora）通过引入低级更新而成为一种对大型模型进行微调的强大技术，但当前的实现却具有固定排名，可能会限制各种任务的灵活性和效率。本文介绍了\ textit {langvision-lora-nas}，这是一个新颖的框架，将神经体系结构搜索（NAS）与lora集成在一起，以优化VLMS以进行可变级别的适应。我们的方法利用NAS动态地搜索针对特定多模式任务的最佳LORA等级配置，平衡性能和计算效率。通过在几个数据集上使用Llama-3.2-11b模型的广泛实验，Langvision-Lora-NAS在降低微调成本的同时，可以显着改善模型性能。我们可以在Llama-3.2-11b-vision-Instruct上进行的基本和搜索的微调模型找到\ Href {此HTTPS URL} {\ TextColor {blue} {there} {there}} {there}}，langvision-lora-nas的代码可以找到\ \ href {this Href {this Href {ewerl}
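Note (illustrative sketch): the low-rank update that the abstract refers to is the standard LoRA formulation, where a frozen weight W of shape (d, k) is adapted as W + (alpha / r) * B A with B of shape (d, r) and A of shape (r, k). The pure-Python code below shows that update and why the rank r controls the trainable parameter count. It does not reproduce the paper's search procedure, which the abstract does not describe.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product, kept dependency-free for illustration."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_weight(w: Matrix, b: Matrix, a: Matrix, alpha: float) -> Matrix:
    """Effective weight W + (alpha / r) * B @ A, where r is the LoRA rank."""
    r = len(a)  # A has shape (r, k); B has shape (d, r)
    delta = matmul(b, a)
    scale = alpha / r
    return [[wij + scale * dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


def lora_param_count(d: int, k: int, r: int) -> int:
    """Trainable parameters of one adapter: r * (d + k) instead of d * k."""
    return r * (d + k)
```

For a 4096 x 4096 layer, rank 8 trains 65,536 parameters and rank 64 trains 524,288. That linear growth is what makes a per-layer rank choice a meaningful search space.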

Title: An Initial Study of Bird's-Eye View Generation for Autonomous Vehicles using Cross-View Transformers

Authors: Felipe Carlos dos Santos, Eric Aislan Antonelo, Gustavo Claudio Karl Couto
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.12520
Pdf URL: https://arxiv.org/pdf/2508.12520
Copy Paste: [[2508.12520]] An Initial Study of Bird's-Eye View Generation for Autonomous Vehicles using Cross-View Transformers(https://arxiv.org/abs/2508.12520)
Keywords: generation
Abstract: Bird's-Eye View (BEV) maps provide a structured, top-down abstraction that is crucial for autonomous-driving perception. In this work, we employ Cross-View Transformers (CVT) to learn a mapping from camera images to three BEV channels (road, lane markings, and planned trajectory) using a realistic simulator for urban driving. Our study examines generalization to unseen towns, the effect of different camera layouts, and two loss formulations (focal and L1). Using training data from only one town, a four-camera CVT trained with the L1 loss delivers the most robust test performance when evaluated in a new town. Overall, our results underscore CVT's promise for mapping camera inputs to reasonably accurate BEV maps.
摘要：鸟眼（BEV）地图提供了一个结构化的自上而下的抽象，这对于自动驾驶感知至关重要。在这项工作中，我们使用跨视图变压器（CVT）学习将相机图像映射到三个BEV的频道 - 道路，车道标记和计划的轨迹 - 使用现实的模拟器用于城市驾驶。我们的研究研究了对看不见城镇的概括，不同的相机布局的效果以及两个损失配方（焦点和L1）。使用仅来自城镇的培训数据，接受了L1损失训练的四个相机CVT提供了在新城镇中评估的最强大的测试表现。总体而言，我们的结果强调了CVT对映射摄像机输入的承诺，以合理准确的BEV地图。
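Note (illustrative sketch): the two loss formulations compared in the abstract can be written per BEV cell as below, using the usual definitions of mean absolute error and binary focal loss. The `gamma` and `alpha` defaults are common choices, not values taken from the paper.

```python
import math
from typing import Sequence


def l1_loss(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean absolute error between predicted and target cell values."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def focal_loss(pred: Sequence[float], target: Sequence[int],
               gamma: float = 2.0, alpha: float = 0.25, eps: float = 1e-7) -> float:
    """Binary focal loss: -alpha_t * (1 - p_t)^gamma * log(p_t), averaged over cells."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)      # clamp to keep log() finite
        p_t = p if t == 1 else 1.0 - p       # probability assigned to the true class
        a_t = alpha if t == 1 else 1.0 - alpha
        total += -a_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(pred)
```

With `gamma = 0` the focal term reduces to a class-weighted cross-entropy. Larger `gamma` down-weights cells the model already classifies confidently, which matters when thin structures such as lane markings occupy few cells.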

Title: Toward Architecture-Agnostic Local Control of Posterior Collapse in VAEs

Authors: Hyunsoo Song, Seungwhan Kim, Seungkyu Lee
Subjects: cs.LG, cs.CV, stat.ML
Abstract URL: https://arxiv.org/abs/2508.12530
Pdf URL: https://arxiv.org/pdf/2508.12530
Copy Paste: [[2508.12530]] Toward Architecture-Agnostic Local Control of Posterior Collapse in VAEs(https://arxiv.org/abs/2508.12530)
Keywords: generative
Abstract: Variational autoencoders (VAEs), one of the most widely used generative models, are known to suffer from posterior collapse, a phenomenon that reduces the diversity of generated samples. To avoid posterior collapse, many prior works have tried to control the influence of regularization loss. However, the trade-off between reconstruction and regularization is not satisfactory. For this reason, several methods have been proposed to guarantee latent identifiability, which is the key to avoiding posterior collapse. However, they require structural constraints on the network architecture. For further clarification, we define local posterior collapse to reflect the importance of individual sample points in the data space and to relax the network constraint. Then, we propose Latent Reconstruction (LR) loss, which is inspired by mathematical properties of injective and composite functions, to control posterior collapse without restriction to a specific architecture. We experimentally evaluate our approach, which controls posterior collapse, on varied datasets such as MNIST, Fashion-MNIST, Omniglot, CelebA, and FFHQ.
摘要：众所周知，使用最广泛的生成模型之一，变异自动编码器（VAE）患有后塌陷，这种现象可降低产生的样品的多样性。为了避免后倒塌，许多先前的工作试图控制正则化损失的影响。但是，重建和正则化之间的权衡并不令人满意。因此，已经提出了几种方法来保证潜在的可识别性，这是避免后倒塌的关键。但是，它们需要网络体系结构的结构限制。为了进一步澄清，我们定义了局部后部崩溃，以反映单个样本在数据空间中的重要性并放松网络约束。然后，我们提出潜在的重建（LR）损失，该损失受外注射和复合函数的数学特性的启发，以控制后置崩溃，而无需限制到特定体系结构。我们通过实验评估我们的方法，该方法控制了MNIST，FashionMnist，Omniglot，Celeba和FFHQ等各种数据集上的后部崩溃。
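Note (illustrative sketch): the regularization term whose influence the abstract says prior work tries to control is, in a standard Gaussian VAE, the KL divergence between the encoder posterior N(mu, sigma^2) and the prior N(0, I). The code below computes that term only. It is not the paper's Latent Reconstruction loss, which the abstract does not specify.

```python
import math
from typing import Sequence


def kl_to_standard_normal(mu: Sequence[float], log_var: Sequence[float]) -> float:
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) summed over latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))


# A collapsed posterior equals the prior for every input, so its KL term is zero.
collapsed = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])
informative = kl_to_standard_normal([1.0, -1.0], [0.0, 0.0])
```

A KL term that sits at zero for every sample is the usual symptom of posterior collapse: the latent code carries no information about the input.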

Title: REVEAL -- Reasoning and Evaluation of Visual Evidence through Aligned Language

Authors: Ipsita Praharaj, Yukta Butala, Yash Butala
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.12543
Pdf URL: https://arxiv.org/pdf/2508.12543
Copy Paste: [[2508.12543]] REVEAL -- Reasoning and Evaluation of Visual Evidence through Aligned Language(https://arxiv.org/abs/2508.12543)
Keywords: generative
Abstract: The rapid advancement of generative models has intensified the challenge of detecting and interpreting visual forgeries, necessitating robust frameworks for image forgery detection while providing reasoning as well as localization. While existing works approach this problem using supervised training for specific manipulation or anomaly detection in the embedding space, generalization across domains remains a challenge. We frame this problem of forgery detection as a prompt-driven visual reasoning task, leveraging the semantic alignment capabilities of large vision-language models. We propose a framework, `REVEAL` (Reasoning and Evaluation of Visual Evidence through Aligned Language), that incorporates generalized guidelines. We propose two tangential approaches - (1) Holistic Scene-level Evaluation that relies on the physics, semantics, perspective, and realism of the image as a whole and (2) Region-wise anomaly detection that splits the image into multiple regions and analyzes each of them. We conduct experiments over datasets from different domains (Photoshop, DeepFake and AIGC editing). We compare the Vision Language Models against competitive baselines and analyze the reasoning provided by them.
摘要：生成模型的快速发展加剧了检测和解释视觉伪造的挑战，需要在提供推理和本地化的同时进行图像伪造检测的稳健框架。尽管现有作品使用受监督的培训来解决此问题，以在嵌入空间中进行特定的操纵或异常检测，但跨域的概括仍然是一个挑战。我们将伪造检测的问题框起来是一项迅速驱动的视觉推理任务，利用大型视觉模型的语义一致性功能。我们提出了一个框架，即“揭示”（通过对齐语言对视觉证据的推理和评估），其中包含了广义准则。我们提出了两种切线方法 - （1）整体场景级评估依赖于整个图像的物理，语义，观点和现实主义，以及（2）区域的异常检测，将图像分为多个区域并分析每个区域。我们对来自不同域的数据集进行实验（Photoshop，DeepFake和AIGC编辑）。我们将视觉语言模型与竞争基线进行比较，并分析其提供的推理。

Title: Illuminating LLM Coding Agents: Visual Analytics for Deeper Understanding and Enhancement

Authors: Junpeng Wang, Yuzhong Chen, Menghai Pan, Chin-Chia Michael Yeh, Mahashweta Das
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.12555
Pdf URL: https://arxiv.org/pdf/2508.12555
Copy Paste: [[2508.12555]] Illuminating LLM Coding Agents: Visual Analytics for Deeper Understanding and Enhancement(https://arxiv.org/abs/2508.12555)
Keywords: generation
Abstract: Coding agents powered by large language models (LLMs) have gained traction for automating code generation through iterative problem-solving with minimal human involvement. Despite the emergence of various frameworks, e.g., LangChain, AutoML, and AIDE, ML scientists still struggle to effectively review and adjust the agents' coding process. The current approach of manually inspecting individual outputs is inefficient, making it difficult to track code evolution, compare coding iterations, and identify improvement opportunities. To address this challenge, we introduce a visual analytics system designed to enhance the examination of coding agent behaviors. Focusing on the AIDE framework, our system supports comparative analysis across three levels: (1) Code-Level Analysis, which reveals how the agent debugs and refines its code over iterations; (2) Process-Level Analysis, which contrasts different solution-seeking processes explored by the agent; and (3) LLM-Level Analysis, which highlights variations in coding behavior across different LLMs. By integrating these perspectives, our system enables ML scientists to gain a structured understanding of agent behaviors, facilitating more effective debugging and prompt engineering. Through case studies using coding agents to tackle popular Kaggle competitions, we demonstrate how our system provides valuable insights into the iterative coding process.
摘要：由大语言模型（LLM）提供动力的编码代理通过迭代问题解决，以最少的人类参与来使代码生成自动化。尽管出现了各种框架，例如Langchain，Automl和Aide，ML科学家仍然很难有效地审查和调整代理的编码过程。当前手动检查各个输出的方法效率低下，因此很难跟踪代码演变，比较编码迭代并确定改进机会。为了应对这一挑战，我们引入了一个视觉分析系统，旨在增强编码剂行为的检查。为了关注助手框架，我们的系统支持跨三个级别的比较分析：（1）代码级分析，揭示了代理如何调试和完善其代码在迭代中；（2）过程级分析，该分析与代理商探索的不同解决方案寻求的过程对比；（3）LLM级分析，该分析突出了不同LLM的编码行为的变化。通过整合这些观点，我们的系统使ML科学家能够获得对代理行为的结构化理解，从而促进更有效的调试和迅速的工程。通过使用编码代理来应对流行的Kaggle竞争的案例研究，我们演示了我们的系统如何为迭代编码过程提供宝贵的见解。

Title: Physics-informed deep operator network for traffic state estimation

Authors: Zhihao Li, Ting Wang, Guojian Zou, Ruofei Wang, Ye Li
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.12593
Pdf URL: https://arxiv.org/pdf/2508.12593
Copy Paste: [[2508.12593]] Physics-informed deep operator network for traffic state estimation(https://arxiv.org/abs/2508.12593)
Keywords: generation
Abstract: Traffic state estimation (TSE) fundamentally involves solving high-dimensional spatiotemporal partial differential equations (PDEs) governing traffic flow dynamics from limited, noisy measurements. While Physics-Informed Neural Networks (PINNs) enforce PDE constraints point-wise, this paper adopts a physics-informed deep operator network (PI-DeepONet) framework that reformulates TSE as an operator learning problem. Our approach trains a parameterized neural operator that maps sparse input data to the full spatiotemporal traffic state field, governed by the traffic flow conservation law. Crucially, unlike PINNs that enforce PDE constraints point-wise, PI-DeepONet integrates a traffic flow conservation model and the fundamental diagram directly into the operator learning process, ensuring physical consistency while capturing congestion propagation, spatial correlations, and temporal evolution. Experiments on the NGSIM dataset demonstrate superior performance over state-of-the-art baselines. Further analysis reveals insights into optimal function generation strategies and branch network complexity. Additionally, the impact of input function generation methods and the number of functions on model performance is explored, highlighting the robustness and efficacy of the proposed framework.
摘要：交通状态估计（TSE）从根本上涉及解决高维时空偏微分方程（PDE），这些方程（PDES）从有限的，嘈杂的测量值中管理交通流动机。尽管物理知识的神经网络（PINNS）强制执行PDE的限制，但本文采用了一个具有物理信息的深层操作员网络（PI-Deeponet）框架，该框架将TSE重新定义为操作员学习问题。我们的方法训练一个参数化的神经操作员，该神经操作员将稀疏输入数据映射到由交通流量保护法管辖的完整时空交通状态领域。至关重要的是，与执行PDE限制点的PINN不同，Pi-Deeponet将交通流量的保存模型和基本图直接整合到操作员学习过程中，从而确保身体一致性，同时捕获拥塞传播，空间相关性和时间进化。 NGSIM数据集上的实验表明，比最先进的基线表现出了卓越的性能。进一步的分析揭示了对最佳功能生成策略和分支网络复杂性的见解。此外，探索了输入功能生成方法和功能数量对模型性能的影响，突出了提出的框架的鲁棒性和功效。
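Note (illustrative sketch): a physics term built from a traffic flow conservation law and a fundamental diagram typically penalizes the residual of d(rho)/dt + d(q(rho))/dx = 0. The code below evaluates that residual on a small density grid with central differences. The Greenshields diagram q = rho * v_free * (1 - rho / rho_max) and the default parameter values are assumptions, since the abstract does not say which diagram or discretization the paper uses.

```python
from typing import List


def greenshields_flow(rho: float, v_free: float, rho_max: float) -> float:
    """Flow q(rho) under the (assumed) Greenshields fundamental diagram."""
    return rho * v_free * (1.0 - rho / rho_max)


def conservation_residual(rho: List[List[float]], dt: float, dx: float,
                          v_free: float = 30.0, rho_max: float = 0.15) -> List[List[float]]:
    """Residual of d(rho)/dt + d(q)/dx at interior grid points; rho is indexed [time][space]."""
    residual = []
    for n in range(1, len(rho) - 1):
        row = []
        for i in range(1, len(rho[n]) - 1):
            d_rho_dt = (rho[n + 1][i] - rho[n - 1][i]) / (2.0 * dt)
            q_right = greenshields_flow(rho[n][i + 1], v_free, rho_max)
            q_left = greenshields_flow(rho[n][i - 1], v_free, rho_max)
            row.append(d_rho_dt + (q_right - q_left) / (2.0 * dx))
        residual.append(row)
    return residual
```

A physics-informed model would add the mean squared residual to its data-fitting loss. In practice the derivatives would come from automatic differentiation of the network output rather than from finite differences.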

Title: A Hybrid Surrogate for Electric Vehicle Parameter Estimation and Power Consumption via Physics-Informed Neural Operators

Authors: Hansol Lim, Jongseong Brad Choi, Jee Won Lee, Haeseong Jeoung, Minkyu Han
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.12602
Pdf URL: https://arxiv.org/pdf/2508.12602
Copy Paste: [[2508.12602]] A Hybrid Surrogate for Electric Vehicle Parameter Estimation and Power Consumption via Physics-Informed Neural Operators(https://arxiv.org/abs/2508.12602)
Keywords: generative
Abstract: We present a hybrid surrogate model for electric vehicle parameter estimation and power consumption. We combine our novel architecture, the Spectral Parameter Operator, built on a Fourier Neural Operator backbone for global context, with a differentiable physics module in the forward pass. From speed and acceleration alone, it outputs time-varying motor and regenerative braking efficiencies, as well as aerodynamic drag, rolling resistance, effective mass, and auxiliary power. These parameters drive a physics-embedded estimate of battery power, eliminating any separate physics-residual loss. The modular design lets representations converge to physically meaningful parameters that reflect the current state and condition of the vehicle. We evaluate on real-world logs from a Tesla Model 3, Tesla Model S, and the Kia EV9. The surrogate achieves a mean absolute error of 0.2 kW (about 1% of average traction power at highway speeds) for Tesla vehicles and about 0.8 kW on the Kia EV9. The framework is interpretable and generalizes well to unseen conditions and sampling rates, making it practical for path optimization, eco-routing, on-board diagnostics, and prognostics and health management.
摘要：我们提出了一个用于电动汽车参数估计和功耗的混合替代模型。我们将新型体系结构频谱参数运算符结合在用于全球环境的傅立叶神经操作员骨架上，并在正向通行中使用可区分的物理模块。仅凭速度和加速度，它就会输出时变的电动机和再生制动效率，以及空气动力阻力，滚动电阻，有效的质量和辅助功率。这些参数驱动了电池功率的物理装置的估计值，从而消除了任何单独的物理损失。模块化设计使表示形式收敛到反映车辆当前状态和状况的物理意义参数。我们从特斯拉模型3，特斯拉模型S和起亚EV9的真实日志中评估了现实世界日志。特斯拉车辆的替代物的平均绝对误差为0.2kW（高速公路速度下的平均牵引力1％），起亚EV9的平均绝对误差为0.2kW。该框架是可以解释的，并且可以很好地推广到看不见的条件和采样率，这使其可用于路径优化，生态穿线，板载诊断和预后健康管理。
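Note (illustrative sketch): a physics-embedded battery power estimate driven by the parameters listed in the abstract can be pictured with the textbook longitudinal vehicle model below. The exact form of the paper's physics module is not given in the abstract, so the function signature, the efficiency handling, and the constants here are assumptions.

```python
def battery_power_w(v: float, a: float, mass: float, c_d_a: float, c_rr: float,
                    eta_motor: float, eta_regen: float, p_aux: float,
                    rho_air: float = 1.225, g: float = 9.81) -> float:
    """Battery power in watts from speed v (m/s) and acceleration a (m/s^2).

    mass is the effective mass (kg), c_d_a the drag area Cd*A (m^2),
    c_rr the rolling-resistance coefficient, p_aux the auxiliary load (W).
    """
    f_inertia = mass * a
    f_drag = 0.5 * rho_air * c_d_a * v * v
    f_roll = c_rr * mass * g if v > 0 else 0.0
    p_wheel = (f_inertia + f_drag + f_roll) * v
    if p_wheel >= 0:
        p_traction = p_wheel / eta_motor   # motoring: the battery supplies more than wheel power
    else:
        p_traction = p_wheel * eta_regen   # braking: only a fraction of wheel power is recovered
    return p_traction + p_aux
```

Because every input after `v` and `a` is a physical quantity, a network that predicts those inputs and is trained only on measured battery power is pushed toward interpretable parameter values. That is consistent with the abstract's statement that no separate physics-residual loss is needed.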

Title: ViLaD: A Large Vision Language Diffusion Framework for End-to-End Autonomous Driving

Authors: Can Cui, Yupeng Zhou, Juntong Peng, Sung-Yeon Park, Zichong Yang, Prashanth Sankaranarayanan, Jiaru Zhang, Ruqi Zhang, Ziran Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.12603
Pdf URL: https://arxiv.org/pdf/2508.12603
Copy Paste: [[2508.12603]] ViLaD: A Large Vision Language Diffusion Framework for End-to-End Autonomous Driving(https://arxiv.org/abs/2508.12603)
Keywords: generation
Abstract: End-to-end autonomous driving systems built on Vision Language Models (VLMs) have shown significant promise, yet their reliance on autoregressive architectures introduces some limitations for real-world applications. The sequential, token-by-token generation process of these models results in high inference latency and cannot perform bidirectional reasoning, making them unsuitable for dynamic, safety-critical environments. To overcome these challenges, we introduce ViLaD, a novel Large Vision Language Diffusion (LVLD) framework for end-to-end autonomous driving that represents a paradigm shift. ViLaD leverages a masked diffusion model that enables parallel generation of entire driving decision sequences, significantly reducing computational latency. Moreover, its architecture supports bidirectional reasoning, allowing the model to consider both past and future simultaneously, and supports progressive easy-first generation to iteratively improve decision quality. We conduct comprehensive experiments on the nuScenes dataset, where ViLaD outperforms state-of-the-art autoregressive VLM baselines in both planning accuracy and inference speed, while achieving a near-zero failure rate. Furthermore, we demonstrate the framework's practical viability through a real-world deployment on an autonomous vehicle for an interactive parking task, confirming its effectiveness and soundness for practical applications.
摘要：建立在视觉语言模型（VLM）上的端到端自主驾驶系统已经表现出巨大的希望，但是他们对自回归体系结构的依赖引入了对现实世界应用的一些局限性。这些模型的顺序，逐个代价的生成过程会导致高推断潜伏期，并且无法执行双向推理，从而使它们不适合动态，安全至关重要的环境。为了克服这些挑战，我们引入了Vilad，这是一种新型的大型视觉语言扩散（LVLD）框架，用于端到端自动驾驶，代表了范式转变。维拉德（Vilad）利用了一个掩盖的扩散模型，该模型可以平行生成整个驱动决策序列，从而大大降低了计算潜伏期。此外，它的体系结构支持双向推理，允许该模型同时考虑过去和未来，并支持渐进式的易于首发，以迭代地改善决策质量。我们在Nuscenes数据集上进行了全面的实验，在该数据集中，Vilad在计划的准确性和推理速度方面都超过了最先进的自回旋VLM基线，同时达到接近零的失败率。此外，我们通过在自动驾驶汽车上的实际部署进行互动停车任务来证明该框架的实际生存能力，从而证实了其对实际应用的有效性和合理性。
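Note (illustrative sketch): "progressive easy-first generation" with a masked diffusion model is often realized by starting from a fully masked sequence, predicting all positions in parallel, and committing only the most confident predictions at each step. The toy loop below shows that general idea. The `predict` callable, the commit schedule, and the token names are hypothetical and do not come from the paper.

```python
from typing import Callable, List, Optional, Tuple

Sequence_ = List[Optional[str]]


def easy_first_decode(predict: Callable[[Sequence_], List[Tuple[str, float]]],
                      length: int, steps: int) -> Sequence_:
    """Fill a fully masked sequence in at most `steps` parallel rounds, most confident first."""
    if length < 1 or steps < 1:
        raise ValueError("length and steps must be positive")
    seq: Sequence_ = [None] * length           # None marks a masked position
    per_step = -(-length // steps)             # ceil(length / steps) positions per round
    while any(tok is None for tok in seq):
        proposals = predict(seq)               # (token, confidence) for every position
        masked = [i for i, tok in enumerate(seq) if tok is None]
        masked.sort(key=lambda i: proposals[i][1], reverse=True)
        for i in masked[:per_step]:            # commit only the easiest positions
            seq[i] = proposals[i][0]
    return seq
```

Each round sees the tokens committed so far on both sides of a masked position, which is the sense in which such decoding is bidirectional. The number of model calls is bounded by `steps` rather than by the sequence length.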

Title: ViDA-UGC: Detailed Image Quality Analysis via Visual Distortion Assessment for UGC Images

Authors: Wenjie Liao, Jieyu Yuan, Yifang Xu, Chunle Guo, Zilong Zhang, Jihong Li, Jiachen Fu, Haotian Fan, Tao Li, Junhui Cui, Chongyi Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.12605
Pdf URL: https://arxiv.org/pdf/2508.12605
Copy Paste: [[2508.12605]] ViDA-UGC: Detailed Image Quality Analysis via Visual Distortion Assessment for UGC Images(https://arxiv.org/abs/2508.12605)
Keywords: restoration, quality assessment
Abstract: Recent advances in Multimodal Large Language Models (MLLMs) have introduced a paradigm shift for Image Quality Assessment (IQA) from unexplainable image quality scoring to explainable IQA, demonstrating practical applications like quality control and optimization guidance. However, current explainable IQA methods not only inadequately use the same distortion criteria to evaluate both User-Generated Content (UGC) and AI-Generated Content (AIGC) images, but also lack detailed quality analysis for monitoring image quality and guiding image restoration. In this study, we establish the first large-scale Visual Distortion Assessment Instruction Tuning Dataset for UGC images, termed ViDA-UGC, which comprises 11K images with fine-grained quality grounding, detailed quality perception, and reasoning quality description data. This dataset is constructed through a distortion-oriented pipeline, which involves human subject annotation and a Chain-of-Thought (CoT) assessment framework. This framework guides GPT-4o to generate quality descriptions by identifying and analyzing UGC distortions, which helps capture rich low-level visual features that inherently correlate with distortion patterns. Moreover, we carefully select 476 images with 6,149 corresponding question-answer pairs from ViDA-UGC and invite a professional team to ensure the accuracy and quality of GPT-generated information. The selected and revised data further contribute to the first UGC distortion assessment benchmark, termed ViDA-UGC-Bench. Experimental results demonstrate the effectiveness of the ViDA-UGC and CoT framework for consistently enhancing various image quality analysis abilities across multiple base MLLMs on ViDA-UGC-Bench and Q-Bench, even surpassing GPT-4o.
摘要：多模式大语言模型（MLLM）的最新进展引入了图像质量评估（IQA）的范式转移，从无法解释的图像质量评分到可解释的IQA，并展示了诸如质量控制和优化指南之类的实际应用。但是，当前可解释的IQA方法不仅使用相同的失真标准来评估用户生成的内容（UGC）和AI生成的内容（AIGC）图像，而且还缺乏详细的质量分析来监视图像质量和指导图像恢复。在这项研究中，我们建立了为UGC图像的第一个大规模视觉失真评估指令调谐数据集，称为VIDA-UGC，其中包括具有优质质量接地，详细质量感知和推理质量描述数据的11K图像。该数据集是通过面向失真的管道来构建的，该管道涉及人类主题注释和一系列思想链（COT）评估框架。该框架指导GPT-4O通过识别和分析UGC扭曲来生成质量描述，这有助于捕获丰富的低级视觉特征，这些视觉特征固有地与失真模式相关。此外，我们仔细选择了476张图像，其中包括VIDA-UGC的6,149个问题答案对，并邀请一支专业团队确保GPT生成的信息的准确性和质量。选定和修订的数据进一步有助于第一个UGC失真评估基准，称为VIDA-UGC基座。实验结果证明了VIDA-UGC和COT框架对始终增强VIDA-UGC基座和Q基座上多个基本MLLM的各种图像质量分析能力的有效性，甚至超过GPT-4O。

Title: Creative4U: MLLMs-based Advertising Creative Image Selector with Comparative Reasoning

Authors: Yukang Lin, Xiang Zhang, Shichang Jia, Bowen Wan, Chenghan Fu, Xudong Ren, Yueran Liu, Wanxian Guan, Pengji Wang, Jian Xu, Bo Zheng, Baolin Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.12628
Pdf URL: https://arxiv.org/pdf/2508.12628
Copy Paste: [[2508.12628]] Creative4U: MLLMs-based Advertising Creative Image Selector with Comparative Reasoning(https://arxiv.org/abs/2508.12628)
Keywords: generation
Abstract: Creative images in advertising are the heart and soul of e-commerce platforms. An eye-catching creative image can enhance the shopping experience for users, boosting income for advertisers and advertising revenue for platforms. With the advent of AIGC technology, advertisers can produce large quantities of creative images at minimal cost. However, they struggle to assess creative quality when selecting among them. Existing methods primarily focus on creative ranking, which fails to address the need for explainable creative selection. In this work, we propose the first paradigm for explainable creative assessment and selection. Powered by multimodal large language models (MLLMs), our approach integrates the assessment and selection of creative images into a natural language generation task. To facilitate this research, we construct CreativePair, the first comparative reasoning-induced creative dataset featuring 8k annotated image pairs, with each sample including a label indicating which image is superior. Additionally, we introduce Creative4U (pronounced Creative for You), an MLLM-based creative selector that takes into account users' interests. Through Reason-to-Select RFT, which includes supervised fine-tuning with Chain-of-Thought (CoT-SFT) and Group Relative Policy Optimization (GRPO) based reinforcement learning, Creative4U is able to evaluate and select creative images accurately. Both offline and online experiments demonstrate the effectiveness of our approach. Our code and dataset will be made public to advance research and industrial applications.
摘要：广告中的创意图像是电子商务平台的心脏和灵魂。引人注目的创意图像可以增强用户的购物体验，增加广告客户的收入以及平台的广告收入。随着AIGC技术的出现，广告商可以以最低的成本生产大量的创意图像。但是，他们努力评估选择的创作质量。现有方法主要集中于创意排名，这无法满足对可解释的创意选择的需求。在这项工作中，我们提出了第一个用于解释创意评估和选择的范式。由多模式大语模型（MLLM）提供支持，我们的方法将创意图像的评估和选择集成到了自然语言生成任务中。为了促进这项研究，我们构建了CreativePair，这是第一个比较推理诱导的创意数据集，具有8K注释的图像对，每个示例都包括一个标签，指示哪个图像优越。此外，我们介绍了Creative4U（发音为您的Creative for You），这是一种基于MLLMS的创意选择器，考虑了用户的兴趣。通过理性选择的RFT，其中包括基于思想链（COT-SFT）和基于团体相对政策优化（GRPO）的强化学习的监督微调，Creative4U能够准确评估和选择创意图像。离线和在线实验都证明了我们方法的有效性。我们的代码和数据集将公开以推进研究和工业应用。

Title: FlowMol3: Flow Matching for 3D De Novo Small-Molecule Generation

Authors: Ian Dunn, David R. Koes
Subjects: cs.LG, q-bio.BM
Abstract URL: https://arxiv.org/abs/2508.12629
Pdf URL: https://arxiv.org/pdf/2508.12629
Copy Paste: [[2508.12629]] FlowMol3: Flow Matching for 3D De Novo Small-Molecule Generation(https://arxiv.org/abs/2508.12629)
Keywords: generation, generative
Abstract: A generative model capable of sampling realistic molecules with desired properties could accelerate chemical discovery across a wide range of applications. Toward this goal, significant effort has focused on developing models that jointly sample molecular topology and 3D structure. We present FlowMol3, an open-source, multi-modal flow matching model that advances the state of the art for all-atom, small-molecule generation. Its substantial performance gains over previous FlowMol versions are achieved without changes to the graph neural network architecture or the underlying flow matching formulation. Instead, FlowMol3's improvements arise from three architecture-agnostic techniques that incur negligible computational cost: self-conditioning, fake atoms, and train-time geometry distortion. FlowMol3 achieves nearly 100% molecular validity for drug-like molecules with explicit hydrogens, more accurately reproduces the functional group composition and geometry of its training data, and does so with an order of magnitude fewer learnable parameters than comparable methods. We hypothesize that these techniques mitigate a general pathology affecting transport-based generative models, enabling detection and correction of distribution drift during inference. Our results highlight simple, transferable strategies for improving the stability and quality of diffusion- and flow-based molecular generative models.
摘要：能够以所需特性对逼真的分子进行采样的生成模型可以在广泛的应用中加速化学发现。为了实现这一目标，重大努力集中在开发共同采样分子拓扑和3D结构的模型上。我们提出了FlowMol3，这是一种开源的多模式流匹配模型，可在全部原子，小分子生成中提高最新技术。它在以前的流量版本上的实质性增长是可以实现的，而无需更改图神经网络体系结构或基础流匹配公式。取而代之的是，FlowMol3的改进是由三种构建 - 敏锐的技术产生的，这些技术会产生可忽略不计的计算成本：自我调节，假原子和火车时间几何变形。 Flowmol3具有显式氢的药物样分子的近100％分子有效性，更准确地再现了其训练数据的功能组组成和几何形状，并且与可比方法相比，可学习参数的数量较少。我们假设这些技术减轻了影响基于运输的生成模型的一般病理学，从而实现了推理过程中分布漂移的检测和校正。我们的结果凸显了简单，可转移的策略，以改善基于扩散和基于流量的分子生成模型的稳定性和质量。

Title: Score-informed Neural Operator for Enhancing Ordering-based Causal Discovery

Authors: Jiyeon Kang, Songseong Kim, Chanhui Lee, Doyeong Hwang, Joanie Hayoun Chung, Yunkyung Ko, Sumin Lee, Sungwoong Kim, Sungbin Lim
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.12650
Pdf URL: https://arxiv.org/pdf/2508.12650
Copy Paste: [[2508.12650]] Score-informed Neural Operator for Enhancing Ordering-based Causal Discovery(https://arxiv.org/abs/2508.12650)
Keywords: generative
Abstract: Ordering-based approaches to causal discovery identify topological orders of causal graphs, providing scalable alternatives to combinatorial search methods. Under the Additive Noise Model (ANM) assumption, recent causal ordering methods based on score matching require an accurate estimation of the Hessian diagonal of the log-densities. However, previous approaches mainly use Stein gradient estimators, which are computationally expensive and memory-intensive. Although DiffAN addresses these limitations by substituting kernel-based estimates with diffusion models, it remains numerically unstable due to the second-order derivatives of score models. To alleviate these problems, we propose Score-informed Neural Operator (SciNO), a probabilistic generative model in smooth function spaces designed to stably approximate the Hessian diagonal and to preserve structural information during the score modeling. Empirical results show that SciNO reduces order divergence by 42.7% on synthetic graphs and by 31.5% on real-world datasets on average compared to DiffAN, while maintaining memory efficiency and scalability. Furthermore, we propose a probabilistic control algorithm for causal reasoning with autoregressive models that integrates SciNO's probability estimates with autoregressive model priors, enabling reliable data-driven causal ordering informed by semantic information. Consequently, the proposed method enhances causal reasoning abilities of LLMs without additional fine-tuning or prompt engineering.
摘要：基于订购的因果发现方法识别因果图的拓扑顺序，为组合搜索方法提供了可扩展的替代方案。在添加噪声模型（ANM）假设下，基于得分匹配的最新因果排序方法需要准确估算对数密度的Hessian对角线。但是，以前的方法主要使用Stein梯度估计器，这些估计器在计算上昂贵且内存密集。尽管Diffan通过将基于内核的估计替换为扩散模型来解决这些局限性，但由于得分模型的二阶导数，它在数值上保持不稳定。为了减轻这些问题，我们提出了得分信息信息神经操作员（SCINO），这是一种平滑功能空间中的概率生成模型，旨在稳定近似近似Hessian的对角线并在分数建模期间保留结构信息。经验结果表明，与DIFFAN相比，SCINO在合成图上平均将差异差异降低了42.7％，而现实世界中的数据集则将差异降低31.5％，同时保持内存效率和可扩展性。此外，我们为因果推理提供了一种概率控制算法，该算法与自回归模型相结合，该模型将Scino的概率估计与自回归模型先验相结合，从而使可靠的数据驱动的因果订购通过语义信息告知。因此，所提出的方法增强了LLM的因果推理能力，而无需进行其他微调或及时的工程。

Title: Stable Diffusion-Based Approach for Human De-Occlusion

Authors: Seung Young Noh, Ju Yong Chang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.12663
Pdf URL: https://arxiv.org/pdf/2508.12663
Copy Paste: [[2508.12663]] Stable Diffusion-Based Approach for Human De-Occlusion(https://arxiv.org/abs/2508.12663)
Keywords: generation
Abstract: Humans can infer the missing parts of an occluded object by leveraging prior knowledge and visible cues. However, enabling deep learning models to accurately predict such occluded regions remains a challenging task. De-occlusion addresses this problem by reconstructing both the mask and RGB appearance. In this work, we focus on human de-occlusion, specifically targeting the recovery of occluded body structures and appearances. Our approach decomposes the task into two stages: mask completion and RGB completion. The first stage leverages a diffusion-based human body prior to provide a comprehensive representation of body structure, combined with occluded joint heatmaps that offer explicit spatial cues about missing regions. The reconstructed amodal mask then serves as a conditioning input for the second stage, guiding the model on which areas require RGB reconstruction. To further enhance RGB generation, we incorporate human-specific textual features derived using a visual question answering (VQA) model and encoded via a CLIP encoder. RGB completion is performed using Stable Diffusion, with decoder fine-tuning applied to mitigate pixel-level degradation in visible regions -- a known limitation of prior diffusion-based de-occlusion methods caused by latent space transformations. Our method effectively reconstructs human appearances even under severe occlusions and consistently outperforms existing methods in both mask and RGB completion. Moreover, the de-occluded images generated by our approach can improve the performance of downstream human-centric tasks, such as 2D pose estimation and 3D human reconstruction. The code will be made publicly available.
摘要：人类可以通过利用先验知识和可见的提示来推断被阻塞物体的缺失部分。但是，使深度学习模型能够准确预测此类封闭的区域仍然是一项艰巨的任务。去核心通过重建蒙版和RGB外观来解决此问题。在这项工作中，我们专注于人类的二牙期，专门针对遮挡的身体结构和外观的恢复。我们的方法将任务分解为两个阶段：蒙版完成和RGB完成。第一阶段利用基于扩散的人体在提供人体结构的全面表示之前，再加上遮挡的关节热图，提供了有关缺失区域的明确空间提示。然后，重建的Amodal掩码是第二阶段的调节输入，指导模型在哪些区域需要RGB重建的区域。为了进一步增强RGB的生成，我们结合了使用视觉响应（VQA）模型得出的人类特定的文本特征，并通过剪辑编码进行了编码。使用稳定的扩散进行RGB完成，将解码器微调应用于可见区域中的像素级降解 - 已知的基于先前扩散的基于基于扩散的去嵌入方法的限制是由潜在的空间变换引起的。我们的方法有效地重建了人类的外观，即使在严重的遮挡下，在蒙版和RGB完成中的现有方法始终如一。此外，我们的方法产生的去邻算图像可以改善以2D姿势估计和3D人类重建等下游以人为本的任务的性能。该代码将公开可用。

Title: BUILDA: A Thermal Building Data Generation Framework for Transfer Learning

Authors: Thomas Krug, Fabian Raisch, Dominik Aimer, Markus Wirnsberger, Ferdinand Sigg, Benjamin Schäfer, Benjamin Tischler
Subjects: cs.LG, eess.SY
Abstract URL: https://arxiv.org/abs/2508.12703
Pdf URL: https://arxiv.org/pdf/2508.12703
Copy Paste: [[2508.12703]] BUILDA: A Thermal Building Data Generation Framework for Transfer Learning(https://arxiv.org/abs/2508.12703)
Keywords: generation
Abstract: Transfer learning (TL) can improve data-driven modeling of building thermal dynamics. Therefore, many new TL research areas are emerging in the field, such as selecting the right source model for TL. However, these research directions require massive amounts of thermal building data, which are currently lacking. Neither public datasets nor existing data generators meet the needs of TL research in terms of data quality and quantity. Moreover, existing data generation approaches typically require expert knowledge in building simulation. We present BuilDa, a thermal building data generation framework for producing synthetic data of adequate quality and quantity for TL research. The framework does not require profound building simulation knowledge to generate large volumes of data. BuilDa uses a single-zone Modelica model that is exported as a Functional Mock-up Unit (FMU) and simulated in Python. We demonstrate BuilDa by generating data and utilizing it for pretraining and fine-tuning TL models.
摘要：转移学习（TL）可以改善建筑热动力学的数据驱动建模。因此，该领域中出现了许多新的TL研究领域，例如为TL选择正确的源模型。但是，这些研究方向需要大量目前缺乏的热建筑数据。公共数据集和现有数据生成器都不满足数据质量和数量的TL研究需求。此外，现有的数据生成方法通常需要建立模拟的专家知识。我们提出了Builda，这是一种用于生成TL研究质量和数量足够数量的合成数据的热建筑数据生成框架。该框架不需要深刻的建筑模拟知识来生成大量数据。 Builda使用单区模型模型，该模型被导出为功能模型单元（FMU）并在Python中进行模拟。我们通过生成数据并将其用于预处理和微调TL模型来演示Builda。

Title: Drifting Away from Truth: GenAI-Driven News Diversity Challenges LVLM-Based Misinformation Detection

Authors: Fanxiao Li, Jiaying Wu, Tingchao Fu, Yunyun Dong, Bingbing Song, Wei Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.12711
Pdf URL: https://arxiv.org/pdf/2508.12711
Copy Paste: [[2508.12711]] Drifting Away from Truth: GenAI-Driven News Diversity Challenges LVLM-Based Misinformation Detection(https://arxiv.org/abs/2508.12711)
Keywords: generative
Abstract: The proliferation of multimodal misinformation poses growing threats to public discourse and societal trust. While Large Vision-Language Models (LVLMs) have enabled recent progress in multimodal misinformation detection (MMD), the rise of generative AI (GenAI) tools introduces a new challenge: GenAI-driven news diversity, characterized by highly varied and complex content. We show that this diversity induces multi-level drift, comprising (1) model-level misperception drift, where stylistic variations disrupt a model's internal reasoning, and (2) evidence-level drift, where expression diversity degrades the quality or relevance of retrieved external evidence. These drifts significantly degrade the robustness of current LVLM-based MMD systems. To systematically study this problem, we introduce DriftBench, a large-scale benchmark comprising 16,000 news instances across six categories of diversification. We design three evaluation tasks: (1) robustness of truth verification under multi-level drift; (2) susceptibility to adversarial evidence contamination generated by GenAI; and (3) analysis of reasoning consistency across diverse inputs. Experiments with six state-of-the-art LVLM-based detectors show substantial performance drops (average F1 -14.8%) and increasingly unstable reasoning traces, with even more severe failures under adversarial evidence injection. Our findings uncover fundamental vulnerabilities in existing MMD systems and suggest an urgent need for more resilient approaches in the GenAI era.
摘要：多模式错误信息的扩散对公共话语和社会信任构成了日益严重的威胁。尽管大型视觉模型（LVLM）使多模式错误信息检测（MMD）的最新进展，但生成型AI（Genai）工具的兴起引入了一个新的挑战：Genai驱动的新闻多样性，其特征是高度差异和复杂的内容。我们表明，这种多样性引起了多层漂移，包括（1）模型级别的误解漂移，在这种情况下，风格差异破坏了模型的内部推理，以及（2）证据级别的漂移，表达多样性会降低检索到外部证据的质量或相关性。这些漂移大大降低了当前基于LVLM的MMD系统的鲁棒性。为了系统地研究此问题，我们介绍了Driftbench，这是一个大规模的基准，其中包括六种多元化类别的16,000个新闻实例。我们设计了三个评估任务：（1）在多层漂移下真实验证的鲁棒性；（2）Genai产生的对对抗证据污染的敏感性；（3）分析不同投入的推理一致性。具有六个最先进的LVLM检测器的实验显示出大量的性能下降（平均F1 -14.8％）和越来越不稳定的推理轨迹，在对抗证据注射下，更严重的失败。我们的发现发现了现有的MMD系统中的根本脆弱性，并暗示迫切需要在Genai时代更具弹性的方法。

Title: Single-Reference Text-to-Image Manipulation with Dual Contrastive Denoising Score

Authors: Syed Muhmmad Israr, Feng Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.12718
Pdf URL: https://arxiv.org/pdf/2508.12718
Copy Paste: [[2508.12718]] Single-Reference Text-to-Image Manipulation with Dual Contrastive Denoising Score(https://arxiv.org/abs/2508.12718)
Keywords: generative
Abstract: Large-scale text-to-image generative models have shown remarkable ability to synthesize diverse and high-quality images. However, it is still challenging to directly apply these models for editing real images for two reasons. First, it is difficult for users to come up with a perfect text prompt that accurately describes every visual detail in the input image. Second, while existing models can introduce desirable changes in certain regions, they often dramatically alter the input content and introduce unexpected changes in unwanted regions. To address these challenges, we present Dual Contrastive Denoising Score, a simple yet powerful framework that leverages the rich generative prior of text-to-image diffusion models. Inspired by contrastive learning approaches for unpaired image-to-image translation, we introduce a straightforward dual contrastive loss within the proposed framework. Our approach utilizes the extensive spatial information from the intermediate representations of the self-attention layers in latent diffusion models without depending on auxiliary networks. Our method achieves both flexible content modification and structure preservation between input and output images, as well as zero-shot image-to-image translation. Through extensive experiments, we show that our approach outperforms existing methods in real image editing while maintaining the capability to directly utilize pretrained text-to-image diffusion models without further training.
摘要：大规模的文本对图像生成模型表现出了显着的能力，可以综合多样化和高质量的图像。但是，直接将这些模型应用于编辑真实图像的原因有两个原因仍然是一项挑战。首先，用户很难提出一个完美的文本提示，该提示可以准确地描述输入图像中的每个视觉细节。其次，尽管现有模型可以在某些区域引入理想的变化，但它们通常会大大改变输入内容并引入不必要区域的意外变化。为了应对这些挑战，我们提出了双重对比分数，这是一个简单而强大的框架，利用文本到图像扩散模型的丰富生成性。受到对比的学习方法的启发，我们在拟议的框架内引入了直接的双重对比损失。我们的方法利用了潜在扩散模型中自我发项层的中间表示的广泛空间信息，而无需依赖辅助网络。我们的方法可以在输入和输出图像之间进行灵活的内容修改和结构保存，以及零拍摄图像到图像的翻译。通过广泛的实验，我们表明我们的方法在实际图像编辑中优于现有方法，同时保持了直接利用未经进一步培训的文本到图像扩散模型的能力。

Title: A Multi-Resolution Benchmark Framework for Spatial Reasoning Assessment in Neural Networks

Authors: Manuela Imbriani, Gina Belmonte, Mieke Massink, Alessandro Tofani, Vincenzo Ciancia
Subjects: cs.LG, physics.app-ph, physics.med-ph
Abstract URL: https://arxiv.org/abs/2508.12741
Pdf URL: https://arxiv.org/pdf/2508.12741
Copy Paste: [[2508.12741]] A Multi-Resolution Benchmark Framework for Spatial Reasoning Assessment in Neural Networks(https://arxiv.org/abs/2508.12741)
Keywords: generation
Abstract: This paper presents preliminary results in the definition of a comprehensive benchmark framework designed to systematically evaluate spatial reasoning capabilities in neural networks, with a particular focus on morphological properties such as connectivity and distance relationships. The framework is currently being used to study the capabilities of nnU-Net, exploiting the spatial model checker VoxLogicA to generate two distinct categories of synthetic datasets: maze connectivity problems for topological analysis and spatial distance computation tasks for geometric understanding. Each category is evaluated across multiple resolutions to assess scalability and generalization properties. The automated pipeline encompasses a complete machine learning workflow including: synthetic dataset generation, standardized training with cross-validation, inference execution, and comprehensive evaluation using Dice coefficient and IoU (Intersection over Union) metrics. Preliminary experimental results demonstrate significant challenges in neural network spatial reasoning capabilities, revealing systematic failures in basic geometric and topological understanding tasks. The framework provides a reproducible experimental protocol, enabling researchers to identify specific limitations. Such limitations could be addressed through hybrid approaches combining neural networks with symbolic reasoning methods for improved spatial understanding in clinical applications, establishing a foundation for ongoing research into neural network spatial reasoning limitations and potential solutions.
摘要：本文提出了初步的结果，其定义是一个综合基准框架，旨在系统地评估神经网络中的空间推理能力，特别关注形态学特性，例如连接性和距离关系。该框架目前用于研究NNU-NET的功能，利用空间模型Checker Voxlogica生成了合成数据集的两个不同类别：用于拓扑分析和空间距离计算任务的迷宫连接问题，以了解几何学的理解。每个类别跨多种分辨率进行评估，以评估可扩展性和概括属性。自动化管道涵盖了完整的机器学习工作流程，包括：合成数据集生成，交叉验证，推理执行和使用骰子系数和IOU（联合交叉点）指标的标准化培训以及全面评估。初步实验结果表明，神经网络空间推理能力的挑战，揭示了基本几何和拓扑理解任务中的系统失败。该框架提供了可再现的实验协议，使研究人员能够确定特定的局限性。可以通过将神经网络与象征性推理方法结合起来的混合方法来解决此类局限性，以改善临床应用中的空间理解，为对神经网络空间推理限制和潜在解决方案的持续研究建立基础。
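
Note: the abstract above evaluates segmentation output with the Dice coefficient and IoU. As a minimal sketch of how these two overlap metrics are commonly computed for binary masks (the function name `dice_and_iou`, the flat 0/1 mask representation, and the empty-mask convention are assumptions for illustration, not taken from the paper's pipeline):

```python
def dice_and_iou(pred, target):
    """Return (Dice, IoU) for two equal-length binary masks given as flat 0/1 sequences."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of elements")
    # Count voxels that are foreground in both masks, and in each mask separately.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    pred_sum = sum(1 for p in pred if p)
    target_sum = sum(1 for t in target if t)
    union = pred_sum + target_sum - intersection
    if union == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0, 1.0
    dice = 2.0 * intersection / (pred_sum + target_sum)
    iou = intersection / union
    return dice, iou


# Toy example: one overlapping foreground voxel out of two in each mask.
dice, iou = dice_and_iou([1, 1, 0, 0], [1, 0, 1, 0])
```

For the same pair of masks, Dice is never smaller than IoU; the two are related by Dice = 2·IoU / (1 + IoU).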

Title: D2-Mamba: Dual-Scale Fusion and Dual-Path Scanning with SSMs for Shadow Removal

Authors: Linhao Li, Boya Jin, Zizhe Li, Lanqing Guo, Hao Cheng, Bo Li, Yongfeng Dong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.12750
Pdf URL: https://arxiv.org/pdf/2508.12750
Copy Paste: [[2508.12750]] D2-Mamba: Dual-Scale Fusion and Dual-Path Scanning with SSMs for Shadow Removal(https://arxiv.org/abs/2508.12750)
Keywords: restoration
Abstract: Shadow removal aims to restore images that are partially degraded by shadows, where the degradation is spatially localized and non-uniform. Unlike general restoration tasks that assume global degradation, shadow removal can leverage abundant information from non-shadow regions for guidance. However, the transformation required to correct shadowed areas often differs significantly from that of well-lit regions, making it challenging to apply uniform correction strategies. This necessitates the effective integration of non-local contextual cues and adaptive modeling of region-specific transformations. To this end, we propose a novel Mamba-based network featuring dual-scale fusion and dual-path scanning to selectively propagate contextual information based on transformation similarity across regions. Specifically, the proposed Dual-Scale Fusion Mamba Block (DFMB) enhances multi-scale feature representation by fusing original features with low-resolution features, effectively reducing boundary artifacts. The Dual-Path Mamba Group (DPMG) captures global features via horizontal scanning and incorporates a mask-aware adaptive scanning strategy, which improves structural continuity and fine-grained region modeling. Experimental results demonstrate that our method significantly outperforms existing state-of-the-art approaches on shadow removal benchmarks.
摘要：阴影去除旨在恢复被阴影部分降解的图像，在该图像上，降解在空间上定位且不均匀。与承担全球退化的一般恢复任务不同，阴影去除可以利用非阴影地区的大量信息进行指导。但是，纠正阴影区域所需的转换通常与光线充足的区域有很大不同，这使得采用统一的校正策略具有挑战性。这需要有效地整合非本地上下文提示和特定区域转换的自适应建模。为此，我们提出了一个基于双尺度融合和双路径扫描的新型基于MAMBA的网络，以基于各个地区的转换相似性有选择地传播上下文信息。具体而言，提出的双尺度融合Mamba块（DFMB）通过将原始特征与低分辨率特征融合，从而增强了多尺度特征表示形式，从而有效地减少了边界伪像。 Dual-Path Mamba组（DPMG）通过水平扫描捕获全局特征，并结合了面膜感知的自适应扫描策略，从而改善了结构连续性和细粒度的区域建模。实验结果表明，我们的方法在阴影去除基准上的现有最新方法大大优于现有的最新方法。

Title: Next Visual Granularity Generation

Authors: Yikai Wang, Zhouxia Wang, Zhonghua Wu, Qingyi Tao, Kang Liao, Chen Change Loy
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2508.12811
Pdf URL: https://arxiv.org/pdf/2508.12811
Copy Paste: [[2508.12811]] Next Visual Granularity Generation(https://arxiv.org/abs/2508.12811)
Keywords: generation
Abstract: We propose a novel approach to image generation by decomposing an image into a structured sequence, where each element in the sequence shares the same spatial resolution but differs in the number of unique tokens used, capturing different levels of visual granularity. Image generation is carried out through our newly introduced Next Visual Granularity (NVG) generation framework, which generates a visual granularity sequence beginning from an empty image and progressively refines it, from global layout to fine details, in a structured manner. This iterative process encodes a hierarchical, layered representation that offers fine-grained control over the generation process across multiple granularity levels. We train a series of NVG models for class-conditional image generation on the ImageNet dataset and observe clear scaling behavior. Compared to the VAR series, NVG consistently outperforms it in terms of FID scores (3.30 -> 3.03, 2.57 -> 2.44, 2.09 -> 2.06). We also conduct extensive analysis to showcase the capability and potential of the NVG framework. Our code and models will be released.
摘要：我们通过将图像分解为结构化序列，提出了一种新颖的图像生成方法，在该序列中，该序列中的每个元素都具有相同的空间分辨率，但在使用的独特令牌数量上有所不同，从而捕获了不同级别的视觉粒度。图像生成是通过我们新引入的下一个视觉粒度（NVG）生成框架进行的，该框架生成了一个视觉粒度序列，从空图像开始，并以结构化的方式逐步完善了它，从全局布局到细节。这个迭代过程编码了分层的分层表示，该表示对多个粒度级别的生成过程提供了细粒度的控制。我们训练一系列的NVG模型，以在Imagenet数据集上生成类条件图像生成，并观察清晰的缩放行为。与VAR系列相比，NVG在FID得分方面始终优于它（3.30-> 3.03，2.57-> 2.44，2.09-> 2.06）。我们还进行了广泛的分析，以展示NVG框架的能力和潜力。我们的代码和模型将发布。

Title: DEEP-SEA: Deep-Learning Enhancement for Environmental Perception in Submerged Aquatics

Authors: Shuang Chen, Ronald Thenius, Farshad Arvin, Amir Atapour-Abarghouei
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.12824
Pdf URL: https://arxiv.org/pdf/2508.12824
Copy Paste: [[2508.12824]] DEEP-SEA: Deep-Learning Enhancement for Environmental Perception in Submerged Aquatics(https://arxiv.org/abs/2508.12824)
Keywords: restoration
Abstract: Continuous and reliable underwater monitoring is essential for assessing marine biodiversity, detecting ecological changes and supporting autonomous exploration in aquatic environments. Underwater monitoring platforms rely mainly on visual data for marine biodiversity analysis, ecological assessment and autonomous exploration. However, underwater environments present significant challenges due to light scattering, absorption and turbidity, which degrade image clarity and distort colour information, making accurate observation difficult. To address these challenges, we propose DEEP-SEA, a novel deep learning-based underwater image restoration model to enhance both low- and high-frequency information while preserving spatial structures. The proposed Dual-Frequency Enhanced Self-Attention Spatial and Frequency Modulator aims to adaptively refine feature representations in the frequency domain while simultaneously refining spatial information for better structural preservation. Our comprehensive experiments on EUVP and LSUI datasets demonstrate its superiority over the state of the art in restoring fine-grained image detail and structural consistency. By effectively mitigating underwater visual degradation, DEEP-SEA has the potential to improve the reliability of underwater monitoring platforms for more accurate ecological observation, species identification and autonomous navigation.
摘要：连续可靠的水下监测对于评估海洋生物多样性，检测生态变化并支持水生环境中的自主探索至关重要。水下监测平台主要依靠视觉数据进行海洋生物多样性分析，生态评估和自主探索。但是，水下环境由于光散射，吸收和浊度而引起了重大挑战，这会降低图像的清晰度和扭曲色彩信息，从而使准确的观察变得困难。为了应对这些挑战，我们提出了深海，这是一种新型的基于深度学习的水下图像恢复模型，以增强低频和高频信息，同时保留空间结构。提出的双频增强了自我发挥的空间和频率调节器的目的是在频域中适应性地完善特征表示，并同时进行空间信息，以更好地保存结构。我们对EUVP和LSUI数据集进行的全面实验证明了在恢复精细颗粒的图像细节和结构一致性方面的优势。通过有效缓解水下视觉降解，深海有可能提高水下监测平台的可靠性，以进行更准确的生态观察，物种鉴定和自主导航。

Title: S^2-Guidance: Stochastic Self Guidance for Training-Free Enhancement of Diffusion Models

Authors: Chubin Chen, Jiashu Zhu, Xiaokun Feng, Nisha Huang, Meiqi Wu, Fangyuan Mao, Jiahong Wu, Xiangxiang Chu, Xiu Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.12880
Pdf URL: https://arxiv.org/pdf/2508.12880
Copy Paste: [[2508.12880]] S^2-Guidance: Stochastic Self Guidance for Training-Free Enhancement of Diffusion Models(https://arxiv.org/abs/2508.12880)
Keywords: generation
Abstract: Classifier-free Guidance (CFG) is a widely used technique in modern diffusion models for enhancing sample quality and prompt adherence. However, through an empirical analysis on Gaussian mixture modeling with a closed-form solution, we observe a discrepancy between the suboptimal results produced by CFG and the ground truth. The model's excessive reliance on these suboptimal predictions often leads to semantic incoherence and low-quality outputs. To address this issue, we first empirically demonstrate that the model's suboptimal predictions can be effectively refined using sub-networks of the model itself. Building on this insight, we propose S^2-Guidance, a novel method that leverages stochastic block-dropping during the forward process to construct stochastic sub-networks, effectively guiding the model away from potential low-quality predictions and toward high-quality outputs. Extensive qualitative and quantitative experiments on text-to-image and text-to-video generation tasks demonstrate that S^2-Guidance delivers superior performance, consistently surpassing CFG and other advanced guidance strategies. Our code will be released.
摘要：无分类器指导（CFG）是现代扩散模型中广泛使用的技术，可提高样品质量和及时粘附。然而，通过对高斯混合溶液建模的经验分析，我们观察到CFG产生的次优结果与地面真相之间存在差异。该模型过度依赖这些次优的预测通常会导致语义不一致和低质量输出。为了解决这个问题，我们首先从经验上证明，模型本身的子网络可以有效地完善模型的次优预测。在这种见解的基础上，我们提出了S^2 Guidance，这是一种新的方法，该方法在远期过程中利用随机的障碍物来构建随机子网络，有效地指导该模型从潜在的低质量预测和高质量的产出远离潜在的模型。关于文本对图像和文本对文本的定性实验和定量实验，表明S^2 Guidance提供了卓越的性能，始终超过CFG和其他高级指导策略。我们的代码将发布。
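
Note: the abstract above builds on Classifier-free Guidance (CFG). As a minimal sketch of the standard CFG combination step only (this is not S^2-Guidance, whose sub-network construction is not specified in the abstract; the function name `cfg_combine` and the plain-list representation of noise predictions are assumptions for illustration):

```python
def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Standard CFG: extrapolate from the unconditional prediction toward the conditional one.

    eps_uncond, eps_cond: equal-length sequences of floats (flattened noise predictions).
    guidance_scale: 0 gives the unconditional prediction, 1 the conditional one,
    and values above 1 push further in the conditional direction.
    """
    if len(eps_uncond) != len(eps_cond):
        raise ValueError("predictions must have the same number of elements")
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy example with two-element "predictions".
guided = cfg_combine([0.0, 1.0], [1.0, 3.0], guidance_scale=2.0)
```

In practice the two predictions come from the same denoiser evaluated with and without the prompt at each sampling step; here they are hard-coded so the arithmetic is easy to follow.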

Title: CTFlow: Video-Inspired Latent Flow Matching for 3D CT Synthesis

Authors: Jiayi Wang, Hadrien Reynaud, Franciskus Xaverius Erick, Bernhard Kainz
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.12900
Pdf URL: https://arxiv.org/pdf/2508.12900
Copy Paste: [[2508.12900]] CTFlow: Video-Inspired Latent Flow Matching for 3D CT Synthesis(https://arxiv.org/abs/2508.12900)
Keywords: generation, generative
Abstract: Generative modelling of entire CT volumes conditioned on clinical reports has the potential to accelerate research through data augmentation, privacy-preserving synthesis and reducing regulatory constraints on patient data while preserving diagnostic signals. With the recent release of CT-RATE, a large-scale collection of 3D CT volumes paired with their respective clinical reports, training large text-conditioned CT volume generation models has become achievable. In this work, we introduce CTFlow, a 0.5B latent flow matching transformer model, conditioned on clinical reports. We leverage the A-VAE from FLUX to define our latent space, and rely on the CT-Clip text encoder to encode the clinical reports. To generate consistent whole CT volumes while keeping the memory constraints tractable, we rely on a custom autoregressive approach, where the model predicts the first sequence of slices of the volume from text only, and then relies on the previously generated sequence of slices and the text to predict the following sequence. We evaluate our results against a state-of-the-art generative CT model, and demonstrate the superiority of our approach in terms of temporal coherence, image diversity and text-image alignment, with FID, FVD, IS scores and CLIP score.
摘要：以临床报告为条件的整个CT体积的生成建模有可能通过数据扩大，隐私保护合成并减少对患者数据的调节器构成，同时保留诊断信号，从而加速研究。随着CT率最近发布的，大规模的3D CT卷与各自的临床报告搭配，培训大型文本条件的CT量产生模型已成为可以实现的。在这项工作中，我们引入了CTFlow，这是一个0.5B潜在的流动匹配变压器模型，以临床报告为条件。我们利用Flux的A-VAE来定义我们的潜在空间，并依靠CT-CLIP文本编码器来编码临床报告。为了在保持内存约束时产生一致的整个CT量，我们依靠一种自定义自动回归方法，该方法可以预测仅文本的第一个卷片段序列，然后依靠先前生成的切片和文本序列，以预测以下序列。我们根据最先进的生成CT模型评估了结果，并在时间连贯性，图像多样性和文本图像对齐方面证明了我们方法的优越性，而FID，FVD是分数和剪辑得分。
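
Note: the abstract above describes generating a CT volume chunk by chunk: the first sequence of slices is predicted from text only, and each later sequence from the text plus the previously generated sequence. A minimal sketch of that control flow, assuming a hypothetical `generate_chunk(text, previous_chunk)` callable in place of the actual model (all names here are illustrative, and the toy generator just emits slice indices):

```python
def generate_volume(text, num_chunks, generate_chunk):
    """Autoregressively build a volume as a flat list of slices.

    generate_chunk(text, previous_chunk) is a hypothetical stand-in for the model:
    previous_chunk is None for the first call (text-only conditioning) and the
    most recently generated chunk for every later call.
    """
    slices = []
    previous = None
    for _ in range(num_chunks):
        chunk = generate_chunk(text, previous)
        slices.extend(chunk)
        previous = chunk  # only the latest chunk is kept as context
    return slices


def toy_generate_chunk(text, previous, chunk_size=2):
    """Toy generator: continues slice numbering after the previous chunk."""
    start = 0 if previous is None else previous[-1] + 1
    return list(range(start, start + chunk_size))


volume = generate_volume("example report", num_chunks=3, generate_chunk=toy_generate_chunk)
```

Keeping only the latest chunk as context is what bounds memory use in this sketch: the cost of each step does not grow with the number of slices already generated.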

Title: CMF-IoU: Multi-Stage Cross-Modal Fusion 3D Object Detection with IoU Joint Prediction

Authors: Zhiwei Ning, Zhaojiang Liu, Xuanang Gao, Yifan Zuo, Jie Yang, Yuming Fang, Wei Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.12917
Pdf URL: https://arxiv.org/pdf/2508.12917
Copy Paste: [[2508.12917]] CMF-IoU: Multi-Stage Cross-Modal Fusion 3D Object Detection with IoU Joint Prediction(https://arxiv.org/abs/2508.12917)
Keywords: generation
Abstract: Multi-modal methods based on camera and LiDAR sensors have garnered significant attention in the field of 3D detection. However, many prevalent works focus on single or partial stage fusion, leading to insufficient feature extraction and suboptimal performance. In this paper, we introduce a multi-stage cross-modal fusion 3D detection framework, termed CMF-IoU, to effectively address the challenge of aligning 3D spatial and 2D semantic information. Specifically, we first project the pixel information into 3D space via a depth completion network to get the pseudo points, which unifies the representation of the LiDAR and camera information. Then, a bilateral cross-view enhancement 3D backbone is designed to encode LiDAR points and pseudo points. The first sparse-to-distant (S2D) branch utilizes an encoder-decoder structure to reinforce the representation of sparse LiDAR points. The second residual view consistency (ResVC) branch is proposed to mitigate the influence of inaccurate pseudo points via both the 3D and 2D convolution processes. Subsequently, we introduce an iterative voxel-point aware fine-grained pooling module, which captures the spatial information from LiDAR points and textural information from pseudo points in the proposal refinement stage. To achieve more precise refinement during iteration, an intersection over union (IoU) joint prediction branch integrated with a novel proposal generation technique is designed to preserve the bounding boxes with both high IoU and classification scores. Extensive experiments show the superior performance of our method on the KITTI, nuScenes and Waymo datasets.
摘要：基于相机和激光雷达传感器的多模式方法在3D检测领域引起了极大的关注。但是，许多普遍的工作重点是单个或部分阶段融合，导致特征提取和次优性能不足。在本文中，我们介绍了称为CMF-IOU的多阶段跨模式融合3D检测框架，以有效地应对3D空间和2D语义信息的挑战。具体来说，我们首先通过深度完成网络将像素信息投射到3D空间中，以获取伪积分，该点可以统一激光雷达和相机信息的表示。然后，双边跨视图增强3D主链设计用于编码LIDAR点和伪点。第一个稀疏到距离（S2D）分支利用编码器解码器结构来加强稀疏的LIDAR点的表示。提出了第二个残留视图一致性（RESVC）分支，以通过3D和2D卷积过程来减轻不准确的伪点的影响。随后，我们引入了一个迭代的体素点，意识到细粒度的合并模块，该模块在提案改进阶段中捕获了LiDAR点的空间信息和纹理信息。为了在迭代期间实现更精确的改进，与新建议生成技术集成的联合（IOU）联合预测分支相交旨在保留具有较高IOU和分类分数的边界框。广泛的实验表明，我们方法在Kitti，Nuscenes和Waymo数据集上的出色性能。

Title: 7Bench: a Comprehensive Benchmark for Layout-guided Text-to-image Models

Authors: Elena Izzo, Luca Parolari, Davide Vezzaro, Lamberto Ballan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.12919
Pdf URL: https://arxiv.org/pdf/2508.12919
Copy Paste: [[2508.12919]] 7Bench: a Comprehensive Benchmark for Layout-guided Text-to-image Models(https://arxiv.org/abs/2508.12919)
Keywords: generation
Abstract: Layout-guided text-to-image models offer greater control over the generation process by explicitly conditioning image synthesis on the spatial arrangement of elements. As a result, their adoption has increased in many computer vision applications, ranging from content creation to synthetic data generation. A critical challenge is achieving precise alignment between the image, textual prompt, and layout, ensuring semantic fidelity and spatial accuracy. Although recent benchmarks assess text alignment, layout alignment remains overlooked, and no existing benchmark jointly evaluates both. This gap limits the ability to evaluate a model's spatial fidelity, which is crucial when using layout-guided generation for synthetic data, as errors can introduce noise and degrade data quality. In this work, we introduce 7Bench, the first benchmark to assess both semantic and spatial alignment in layout-guided text-to-image generation. It features text-and-layout pairs spanning seven challenging scenarios, investigating object generation, color fidelity, attribute recognition, inter-object relationships, and spatial control. We propose an evaluation protocol that builds on existing frameworks by incorporating the layout alignment score to assess spatial accuracy. Using 7Bench, we evaluate several state-of-the-art diffusion models, uncovering their respective strengths and limitations across diverse alignment tasks. The benchmark is available at this https URL.

Title: Lumen: Consistent Video Relighting and Harmonious Background Replacement with Video Generative Models

Authors: Jianshu Zeng, Yuxuan Liu, Yutong Feng, Chenxuan Miao, Zixiang Gao, Jiwang Qu, Jianzhang Zhang, Bin Wang, Kun Yuan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.12945
Pdf URL: https://arxiv.org/pdf/2508.12945
Copy Paste: [[2508.12945]] Lumen: Consistent Video Relighting and Harmonious Background Replacement with Video Generative Models(https://arxiv.org/abs/2508.12945)
Keywords: generative
Abstract: Video relighting is a challenging yet valuable task, aiming to replace the background in videos while correspondingly adjusting the lighting in the foreground with harmonious blending. During translation, it is essential to preserve the original properties of the foreground, e.g., albedo, and propagate consistent relighting among temporal frames. In this paper, we propose Lumen, an end-to-end video relighting framework developed on large-scale video generative models, which accepts flexible textual descriptions to control lighting and background. Considering the scarcity of high-quality paired videos with the same foreground in various lighting conditions, we construct a large-scale dataset with a mixture of realistic and synthetic videos. For the synthetic domain, benefiting from the abundant 3D assets in the community, we leverage an advanced 3D rendering engine to curate video pairs in diverse environments. For the realistic domain, we adapt an HDR-based lighting simulation to compensate for the lack of paired in-the-wild videos. Powered by the aforementioned dataset, we design a joint training curriculum to effectively unleash the strengths of each domain, i.e., the physical consistency in synthetic videos, and the generalized domain distribution in realistic videos. To implement this, we inject a domain-aware adapter into the model to decouple the learning of relighting and domain appearance distribution. We construct a comprehensive benchmark to evaluate Lumen together with existing methods, from the perspectives of foreground preservation and video consistency assessment. Experimental results demonstrate that Lumen effectively edits the input into cinematic relighted videos with consistent lighting and strict foreground preservation. Our project page: this https URL

Title: Compact Attention: Exploiting Structured Spatio-Temporal Sparsity for Fast Video Generation

Authors: Qirui Li, Guangcong Zheng, Qi Zhao, Jie Li, Bin Dong, Yiwu Yao, Xi Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.12969
Pdf URL: https://arxiv.org/pdf/2508.12969
Copy Paste: [[2508.12969]] Compact Attention: Exploiting Structured Spatio-Temporal Sparsity for Fast Video Generation(https://arxiv.org/abs/2508.12969)
Keywords: generation
Abstract: The computational demands of self-attention mechanisms pose a critical challenge for transformer-based video generation, particularly in synthesizing ultra-long sequences. Current approaches, such as factorized attention and fixed sparse patterns, fail to fully exploit the inherent spatio-temporal redundancies in video data. Through systematic analysis of video diffusion transformers (DiT), we uncover a key insight: Attention matrices exhibit structured, yet heterogeneous sparsity patterns, where specialized heads dynamically attend to distinct spatiotemporal regions (e.g., local pattern, cross-shaped pattern, or global pattern). Existing sparse attention methods either impose rigid constraints or introduce significant overhead, limiting their effectiveness. To address this, we propose Compact Attention, a hardware-aware acceleration framework featuring three innovations: 1) Adaptive tiling strategies that approximate diverse spatial interaction patterns via dynamic tile grouping, 2) Temporally varying windows that adjust sparsity levels based on frame proximity, and 3) An automated configuration search algorithm that optimizes sparse patterns while preserving critical attention pathways. Our method achieves 1.6~2.5x acceleration in attention computation on single-GPU setups while maintaining comparable visual quality with full-attention baselines. This work provides a principled approach to unlocking efficient long-form video generation through structured sparsity exploitation. Project Page: this https URL
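Note: the "temporally varying windows" idea (denser attention to nearby frames, sparser attention to distant ones) can be sketched as a boolean attention mask. This is an illustrative toy, not the Compact Attention kernel. Tokens within a frame are treated as a 1D sequence for simplicity, and `base_radius` and `decay` are hypothetical parameters.

```python
def temporal_window_mask(num_frames, tokens_per_frame, base_radius, decay):
    """Return mask[q][k]; True means query token q may attend to key token k.

    The spatial window radius shrinks by `decay` for every frame of temporal
    distance, so far-away frames contribute fewer (or no) keys.
    """
    n = num_frames * tokens_per_frame
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        frame_q, pos_q = divmod(q, tokens_per_frame)
        for k in range(n):
            frame_k, pos_k = divmod(k, tokens_per_frame)
            radius = base_radius - decay * abs(frame_q - frame_k)
            mask[q][k] = radius >= 0 and abs(pos_q - pos_k) <= radius
    return mask


def density(mask):
    """Fraction of query-key pairs that are kept."""
    total = sum(len(row) for row in mask)
    kept = sum(1 for row in mask for v in row if v)
    return kept / total


mask = temporal_window_mask(num_frames=3, tokens_per_frame=4, base_radius=1, decay=1)
print(f"kept fraction: {density(mask):.2f}")
```

A dense mask like this is only useful for inspection. An actual speedup depends on a kernel that skips the masked tiles, which this sketch does not attempt.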

Title: Omni Survey for Multimodality Analysis in Visual Object Tracking

Authors: Zhangyong Tang, Tianyang Xu, Xuefeng Zhu, Hui Li, Shaochuan Zhao, Tao Zhou, Chunyang Cheng, Xiaojun Wu, Josef Kittler
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.13000
Pdf URL: https://arxiv.org/pdf/2508.13000
Copy Paste: [[2508.13000]] Omni Survey for Multimodality Analysis in Visual Object Tracking(https://arxiv.org/abs/2508.13000)
Keywords: generation
Abstract: The development of smart cities has led to the generation of massive amounts of multi-modal data in the context of a range of tasks that enable comprehensive monitoring of the smart city infrastructure and services. This paper surveys one of the most critical tasks, multi-modal visual object tracking (MMVOT), from the perspective of multimodality analysis. Generally, MMVOT differs from single-modal tracking in four key aspects: data collection, modality alignment and annotation, model design, and evaluation. Accordingly, we begin with an introduction to the relevant data modalities, laying the groundwork for their integration. This naturally leads to a discussion of the challenges of multi-modal data collection, alignment, and annotation. Subsequently, existing MMVOT methods are categorised based on different ways of dealing with the visible (RGB) and X modalities: programming the auxiliary X branch with replicated or non-replicated experimental configurations from the RGB branch. Here X can be thermal infrared (T), depth (D), event (E), near infrared (NIR), language (L), or sonar (S). The final part of the paper addresses evaluation and benchmarking. In summary, we undertake an omni survey of all aspects of multi-modal visual object tracking (VOT), covering six MMVOT tasks and featuring 338 references in total. In addition, we discuss the fundamental rhetorical question: is multi-modal tracking always guaranteed to provide a superior solution to unimodal tracking with the help of information fusion, and if not, in what circumstances is its application beneficial? Furthermore, for the first time in this field, we analyse the distributions of the object categories in the existing MMVOT datasets, revealing their pronounced long-tail nature and a noticeable lack of animal categories when compared with RGB datasets.

Title: Matrix-Game 2.0: An Open-Source, Real-Time, and Streaming Interactive World Model

Authors: Xianglong He, Chunli Peng, Zexiang Liu, Boyang Wang, Yifan Zhang, Qi Cui, Fei Kang, Biao Jiang, Mengyin An, Yangyang Ren, Baixin Xu, Hao-Xiang Guo, Kaixiong Gong, Cyrus Wu, Wei Li, Xuchen Song, Yang Liu, Eric Li, Yahui Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.13009
Pdf URL: https://arxiv.org/pdf/2508.13009
Copy Paste: [[2508.13009]] Matrix-Game 2.0: An Open-Source, Real-Time, and Streaming Interactive World Model(https://arxiv.org/abs/2508.13009)
Keywords: generation
Abstract: Recent advances in interactive video generation have demonstrated diffusion models' potential as world models by capturing complex physical dynamics and interactive behaviors. However, existing interactive world models depend on bidirectional attention and lengthy inference steps, severely limiting real-time performance. Consequently, they struggle to simulate real-world dynamics, where outcomes must update instantaneously based on historical context and current actions. To address this, we present Matrix-Game 2.0, an interactive world model that generates long videos on-the-fly via few-step auto-regressive diffusion. Our framework consists of three key components: (1) A scalable data production pipeline for Unreal Engine and GTA5 environments to effectively produce massive amounts (about 1200 hours) of video data with diverse interaction annotations; (2) An action injection module that enables frame-level mouse and keyboard inputs as interactive conditions; (3) A few-step distillation based on the causal architecture for real-time and streaming video generation. Matrix-Game 2.0 can generate high-quality minute-level videos across diverse scenes at an ultra-fast speed of 25 FPS. We open-source our model weights and codebase to advance research in interactive world modeling.

Title: EgoTwin: Dreaming Body and View in First Person

Authors: Jingqiao Xiu, Fangzhou Hong, Yicong Li, Mengze Li, Wentao Wang, Sirui Han, Liang Pan, Ziwei Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.13013
Pdf URL: https://arxiv.org/pdf/2508.13013
Copy Paste: [[2508.13013]] EgoTwin: Dreaming Body and View in First Person(https://arxiv.org/abs/2508.13013)
Keywords: generation
Abstract: While exocentric video synthesis has achieved great progress, egocentric video generation remains largely underexplored, which requires modeling first-person view content along with camera motion patterns induced by the wearer's body movements. To bridge this gap, we introduce a novel task of joint egocentric video and human motion generation, characterized by two key challenges: 1) Viewpoint Alignment: the camera trajectory in the generated video must accurately align with the head trajectory derived from human motion; 2) Causal Interplay: the synthesized human motion must causally align with the observed visual dynamics across adjacent video frames. To address these challenges, we propose EgoTwin, a joint video-motion generation framework built on the diffusion transformer architecture. Specifically, EgoTwin introduces a head-centric motion representation that anchors the human motion to the head joint and incorporates a cybernetics-inspired interaction mechanism that explicitly captures the causal interplay between video and motion within attention operations. For comprehensive evaluation, we curate a large-scale real-world dataset of synchronized text-video-motion triplets and design novel metrics to assess video-motion consistency. Extensive experiments demonstrate the effectiveness of the EgoTwin framework.

Title: Eyes on the Image: Gaze Supervised Multimodal Learning for Chest X-ray Diagnosis and Report Generation

Authors: Tanjim Islam Riju, Shuchismita Anwar, Saman Sarker Joy, Farig Sadeque, Swakkhar Shatabda
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2508.13068
Pdf URL: https://arxiv.org/pdf/2508.13068
Copy Paste: [[2508.13068]] Eyes on the Image: Gaze Supervised Multimodal Learning for Chest X-ray Diagnosis and Report Generation(https://arxiv.org/abs/2508.13068)
Keywords: generation
Abstract: We propose a two-stage multimodal framework that enhances disease classification and region-aware radiology report generation from chest X-rays, leveraging the MIMIC-Eye dataset. In the first stage, we introduce a gaze-guided contrastive learning architecture for disease classification. It integrates visual features, clinical labels, bounding boxes, and radiologist eye-tracking signals and is equipped with a novel multi-term gaze-attention loss combining MSE, KL divergence, correlation, and center-of-mass alignment. Incorporating fixations improves F1 score from 0.597 to 0.631 (+5.70%) and AUC from 0.821 to 0.849 (+3.41%), while also improving precision and recall, highlighting the effectiveness of gaze-informed attention supervision. In the second stage, we present a modular report generation pipeline that extracts confidence-weighted diagnostic keywords, maps them to anatomical regions using a curated dictionary constructed from domain-specific priors, and generates region-aligned sentences via structured prompts. This pipeline improves report quality as measured by clinical keyword recall and ROUGE overlap. Our results demonstrate that integrating gaze data improves both classification performance and the interpretability of generated medical reports.
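Note: a multi-term loss of the kind described (MSE, KL divergence, correlation, and center-of-mass alignment between a model attention map and a gaze heatmap) might look like the sketch below. It is not the authors' formulation. The exact form of each term, the KL direction, the equal default weights, and all function names are assumptions made for illustration.

```python
import math


def _normalize(m, eps=1e-8):
    """Turn a 2D map of non-negative values into a probability map."""
    total = sum(v + eps for row in m for v in row)
    return [[(v + eps) / total for v in row] for row in m]


def _center_of_mass(p):
    """Expected (row, column) index under a normalized 2D map."""
    cy = sum(i * v for i, row in enumerate(p) for v in row)
    cx = sum(j * v for row in p for j, v in enumerate(row))
    return cy, cx


def _pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0  # correlation is undefined for a constant map
    return cov / math.sqrt(va * vb)


def gaze_attention_loss(attn, gaze, weights=(1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of four disagreement terms between attention and gaze maps."""
    p, q = _normalize(attn), _normalize(gaze)
    a = [v for row in p for v in row]
    g = [v for row in q for v in row]
    mse = sum((x - y) ** 2 for x, y in zip(a, g)) / len(a)
    kl = sum(y * math.log(y / x) for x, y in zip(a, g))  # KL(gaze || attention)
    corr_term = 1.0 - _pearson(a, g)
    (ay, ax), (gy, gx) = _center_of_mass(p), _center_of_mass(q)
    com = math.hypot(ay - gy, ax - gx)
    w_mse, w_kl, w_corr, w_com = weights
    return w_mse * mse + w_kl * kl + w_corr * corr_term + w_com * com
```

The loss is near zero when the attention map matches the gaze heatmap and grows as the two maps diverge in value, shape, or location.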

Title: ID-Card Synthetic Generation: Toward a Simulated Bona fide Dataset

Authors: Qingwen Zeng, Juan E. Tapia, Izan Garcia, Juan M. Espin, Christoph Busch
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.13078
Pdf URL: https://arxiv.org/pdf/2508.13078
Copy Paste: [[2508.13078]] ID-Card Synthetic Generation: Toward a Simulated Bona fide Dataset(https://arxiv.org/abs/2508.13078)
Keywords: generation
Abstract: Nowadays, the development of a Presentation Attack Detection (PAD) system for ID cards presents a challenge due to the lack of images available to train a robust PAD system and the increase in diversity of possible attack instrument species. Today, most algorithms focus on generating attack samples and do not take into account the limited number of bona fide images. This work is one of the first to propose a method for mimicking bona fide images by generating synthetic versions of them using Stable Diffusion, which may help improve the generalisation capabilities of the detector. Furthermore, the new images generated are evaluated in a system trained from scratch and in a commercial solution. The PAD system yields an interesting result, as it identifies our images as bona fide, which has a positive impact on detection performance and data restrictions.

Title: DMS:Diffusion-Based Multi-Baseline Stereo Generation for Improving Self-Supervised Depth Estimation

Authors: Zihua Liu, Yizhou Li, Songyan Zhang, Masatoshi Okutomi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.13091
Pdf URL: https://arxiv.org/pdf/2508.13091
Copy Paste: [[2508.13091]] DMS:Diffusion-Based Multi-Baseline Stereo Generation for Improving Self-Supervised Depth Estimation(https://arxiv.org/abs/2508.13091)
Keywords: generation
Abstract: While supervised stereo matching and monocular depth estimation have advanced significantly with learning-based algorithms, self-supervised methods using stereo images as supervision signals have received relatively less focus and require further investigation. A primary challenge arises from ambiguity introduced during photometric reconstruction, particularly due to missing corresponding pixels in ill-posed regions of the target view, such as occlusions and out-of-frame areas. To address this and establish explicit photometric correspondences, we propose DMS, a model-agnostic approach that utilizes geometric priors from diffusion models to synthesize novel views along the epipolar direction, guided by directional prompts. Specifically, we finetune a Stable Diffusion model to simulate perspectives at key positions: left-left view shifted from the left camera, right-right view shifted from the right camera, along with an additional novel view between the left and right cameras. These synthesized views supplement occluded pixels, enabling explicit photometric reconstruction. Our proposed DMS is a cost-free, "plug-and-play" method that seamlessly enhances self-supervised stereo matching and monocular depth estimation, and relies solely on unlabeled stereo image pairs for both training and synthesizing. Extensive experiments demonstrate the effectiveness of our approach, with up to 35% outlier reduction and state-of-the-art performance across multiple benchmark datasets.
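Note: the out-of-frame problem in photometric reconstruction can be seen on a single scanline. The sketch below is a toy, not part of DMS. It assumes a rectified stereo pair and integer disparities, where left pixel x corresponds to right pixel x - d. The function names are hypothetical.

```python
def reconstruct_left_from_right(right_row, disparity_row):
    """Warp one right-image scanline into the left view.

    Returns (recon, valid): recon[x] is the sampled right-image value, or None
    when the source pixel x - d falls outside the right image.
    """
    recon, valid = [], []
    for x, d in enumerate(disparity_row):
        src = x - d
        if 0 <= src < len(right_row):
            recon.append(right_row[src])
            valid.append(True)
        else:
            recon.append(None)  # out of frame: no photometric supervision here
            valid.append(False)
    return recon, valid


def photometric_l1(left_row, recon, valid):
    """Mean absolute error over pixels that have a valid correspondence."""
    errors = [abs(l - r) for l, r, v in zip(left_row, recon, valid) if v]
    return sum(errors) / len(errors) if errors else 0.0


right = [10, 20, 30, 40, 50]
left = [0, 0, 10, 20, 30]
recon, valid = reconstruct_left_from_right(right, [2, 2, 2, 2, 2])
print(valid, photometric_l1(left, recon, valid))
```

The leftmost pixels of the left view have no counterpart in the right image, so they contribute no loss. According to the abstract, an additional synthesized view further to the left is meant to supply correspondences for such pixels.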

Title: Precise Action-to-Video Generation Through Visual Action Prompts

Authors: Yuang Wang, Chao Wen, Haoyu Guo, Sida Peng, Minghan Qin, Hujun Bao, Xiaowei Zhou, Ruizhen Hu
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2508.13104
Pdf URL: https://arxiv.org/pdf/2508.13104
Copy Paste: [[2508.13104]] Precise Action-to-Video Generation Through Visual Action Prompts(https://arxiv.org/abs/2508.13104)
Keywords: generation, generative
Abstract: We present visual action prompts, a unified action representation for action-to-video generation of complex high-DoF interactions while maintaining transferable visual dynamics across domains. Action-driven video generation faces a precision-generality trade-off: existing methods using text, primitive actions, or coarse masks offer generality but lack precision, while agent-centric action signals provide precision at the cost of cross-domain transferability. To balance action precision and dynamic transferability, we propose to "render" actions into precise visual prompts as domain-agnostic representations that preserve both geometric precision and cross-domain adaptability for complex actions; specifically, we choose visual skeletons for their generality and accessibility. We propose robust pipelines to construct skeletons from two interaction-rich data sources - human-object interactions (HOI) and dexterous robotic manipulation - enabling cross-domain training of action-driven generative models. By integrating visual skeletons into pretrained video generation models via lightweight fine-tuning, we enable precise action control of complex interaction while preserving the learning of cross-domain dynamics. Experiments on EgoVid, RT-1 and DROID demonstrate the effectiveness of our proposed approach. Project page: this https URL.

Title: MDPO: Overcoming the Training-Inference Divide of Masked Diffusion Language Models

Authors: Haoyu He, Katrin Renz, Yong Cao, Andreas Geiger
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.13148
Pdf URL: https://arxiv.org/pdf/2508.13148
Copy Paste: [[2508.13148]] MDPO: Overcoming the Training-Inference Divide of Masked Diffusion Language Models(https://arxiv.org/abs/2508.13148)
Keywords: generation
Abstract: Diffusion language models, as a promising alternative to traditional autoregressive (AR) models, enable faster generation and richer conditioning on bidirectional context. However, they suffer from a key discrepancy between training and inference: during inference, masked diffusion language models (MDLMs) progressively reveal the structure of the generated sequence by producing fewer and fewer masked tokens, whereas this structure is ignored in training as tokens are masked at random. Although this discrepancy between training and inference can lead to suboptimal performance, it has been largely overlooked by previous works, leaving the closing of this gap between the two stages an open problem. To address this, we frame the problem of learning effective denoising trajectories as a sequential decision-making problem and use the resulting framework to apply reinforcement learning. We propose Masked Diffusion Policy Optimization (MDPO), a novel method that exploits the Markov property of diffusion and explicitly trains the model under the same progressive refining schedule used at inference. MDPO matches the performance of the previous state-of-the-art (SOTA) method with 60x fewer gradient updates, while achieving average improvements of 9.6% on MATH500 and 54.2% on Countdown over SOTA when trained within the same number of weight updates. Additionally, we improve the remasking strategy of MDLMs as a plug-in inference replacement to overcome the limitation that the model cannot refine tokens flexibly. This simple yet effective training-free strategy, which we refer to as RCR, consistently improves performance and yields additional gains when combined with MDPO. Our findings establish great potential for investigating the discrepancy between pre-training and inference of MDLMs. Code: this https URL. Project Page: this https URL.
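Note: the training-inference discrepancy described above can be made concrete by comparing the two masking regimes. The sketch below is a toy, not the MDPO code. The per-position confidence values stand in for model outputs and are held fixed, whereas a real sampler would recompute them at every step. All function names are hypothetical.

```python
import random


def random_training_mask(length, mask_ratio, rng):
    """Training-style masking: positions are masked uniformly at random."""
    k = round(length * mask_ratio)
    masked = set(rng.sample(range(length), k))
    return [i in masked for i in range(length)]


def progressive_inference_masks(confidence, steps):
    """Inference-style masking: start fully masked, then unmask the most
    confident remaining positions step by step (True means still masked)."""
    length = len(confidence)
    order = sorted(range(length), key=lambda i: -confidence[i])
    masks = []
    for step in range(1, steps + 1):
        revealed = set(order[: (length * step) // steps])
        masks.append([i not in revealed for i in range(length)])
    return masks


rng = random.Random(0)
print(random_training_mask(4, 0.5, rng))
print(progressive_inference_masks([0.9, 0.1, 0.5, 0.7], steps=2))
```

The random mask has no relation to which tokens a sampler would reveal first, while the inference masks form a nested sequence. Training on the second kind of trajectory is the gap the abstract says MDPO targets.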

Title: 4DNeX: Feed-Forward 4D Generative Modeling Made Easy

Authors: Zhaoxi Chen, Tianqi Liu, Long Zhuo, Jiawei Ren, Zeng Tao, He Zhu, Fangzhou Hong, Liang Pan, Ziwei Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.13154
Pdf URL: https://arxiv.org/pdf/2508.13154
Copy Paste: [[2508.13154]] 4DNeX: Feed-Forward 4D Generative Modeling Made Easy(https://arxiv.org/abs/2508.13154)
Keywords: generation, generative
Abstract: We present 4DNeX, the first feed-forward framework for generating 4D (i.e., dynamic 3D) scene representations from a single image. In contrast to existing methods that rely on computationally intensive optimization or require multi-frame video inputs, 4DNeX enables efficient, end-to-end image-to-4D generation by fine-tuning a pretrained video diffusion model. Specifically, 1) to alleviate the scarcity of 4D data, we construct 4DNeX-10M, a large-scale dataset with high-quality 4D annotations generated using advanced reconstruction approaches. 2) we introduce a unified 6D video representation that jointly models RGB and XYZ sequences, facilitating structured learning of both appearance and geometry. 3) we propose a set of simple yet effective adaptation strategies to repurpose pretrained video diffusion models for 4D modeling. 4DNeX produces high-quality dynamic point clouds that enable novel-view video synthesis. Extensive experiments demonstrate that 4DNeX outperforms existing 4D generation methods in efficiency and generalizability, offering a scalable solution for image-to-4D modeling and laying the foundation for generative 4D world models that simulate dynamic scene evolution.