2025-09-10

Title: VLMs-in-the-Wild: Bridging the Gap Between Academic Benchmarks and Enterprise Reality

Authors: Srihari Bandraupalli, Anupam Purwar
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2509.06994
Pdf URL: https://arxiv.org/pdf/2509.06994
Copy Paste: [[2509.06994]] VLMs-in-the-Wild: Bridging the Gap Between Academic Benchmarks and Enterprise Reality(https://arxiv.org/abs/2509.06994)
Keywords: quality assessment
Abstract: Open-source Vision-Language Models show immense promise for enterprise applications, yet a critical disconnect exists between academic evaluation and enterprise deployment requirements. Current benchmarks rely heavily on multiple-choice questions and synthetic data, failing to capture the complexity of real-world business applications like social media content analysis. This paper introduces VLM-in-the-Wild (ViLD), a comprehensive framework to bridge this gap by evaluating VLMs on operational enterprise requirements. We define ten business-critical tasks: logo detection, OCR, object detection, human presence and demographic analysis, human activity and appearance analysis, scene detection, camera perspective and media quality assessment, dominant colors, comprehensive description, and NSFW detection. To this framework, we bring an innovative BlockWeaver Algorithm that solves the challenging problem of comparing unordered, variably-grouped OCR outputs from VLMs without relying on embeddings or LLMs, achieving remarkable speed and reliability. To demonstrate the efficacy of ViLD, we constructed a new benchmark dataset of 7,500 diverse samples, carefully stratified from a corpus of one million real-world images and videos. ViLD provides actionable insights by combining semantic matching (both embedding-based and LLM-as-a-judge approaches), traditional metrics, and novel methods to measure the completeness and faithfulness of descriptive outputs. By benchmarking leading open-source VLMs (Qwen, MIMO, and InternVL) against a powerful proprietary baseline under the ViLD framework, we provide one of the first industry-grounded, task-driven assessments of VLM capabilities, offering actionable insights for their deployment in enterprise environments.

Title: Visible Yet Unreadable: A Systematic Blind Spot of Vision Language Models Across Writing Systems

Authors: Jie Zhang, Ting Xu, Gelei Deng, Runyi Hu, Han Qiu, Tianwei Zhang, Qing Guo, Ivor Tsang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2509.06996
Pdf URL: https://arxiv.org/pdf/2509.06996
Copy Paste: [[2509.06996]] Visible Yet Unreadable: A Systematic Blind Spot of Vision Language Models Across Writing Systems(https://arxiv.org/abs/2509.06996)
Keywords: generation
Abstract: Writing is a universal cultural technology that reuses vision for symbolic communication. Humans display striking resilience: we readily recognize words even when characters are fragmented, fused, or partially occluded. This paper investigates whether advanced vision language models (VLMs) share this resilience. We construct two psychophysics-inspired benchmarks across distinct writing systems, Chinese logographs and English alphabetic words, by splicing, recombining, and overlaying glyphs to yield "visible but unreadable" stimuli for models while remaining legible to humans. Despite strong performance on clean text, contemporary VLMs show a severe drop under these perturbations, frequently producing unrelated or incoherent outputs. The pattern suggests a structural limitation: models heavily leverage generic visual invariances but under-rely on compositional priors needed for robust literacy. We release stimuli generation code, prompts, and evaluation protocols to facilitate transparent replication and follow-up work. Our findings motivate architectures and training strategies that encode symbol segmentation, composition, and binding across scripts, and they delineate concrete challenges for deploying multimodal systems in education, accessibility, cultural heritage, and security.

Title: K-Syn: K-space Data Synthesis in Ultra Low-data Regimes

Authors: Guan Yu, Zhang Jianhua, Liang Dong, Liu Qiegen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.06997
Pdf URL: https://arxiv.org/pdf/2509.06997
Copy Paste: [[2509.06997]] K-Syn: K-space Data Synthesis in Ultra Low-data Regimes(https://arxiv.org/abs/2509.06997)
Keywords: generation, generative
Abstract: Owing to the inherently dynamic and complex characteristics of cardiac magnetic resonance (CMR) imaging, high-quality and diverse k-space data are rarely available in practice, which in turn hampers robust reconstruction of dynamic cardiac MRI. To address this challenge, we perform feature-level learning directly in the frequency domain and employ a temporal-fusion strategy as the generative guidance to synthesize k-space data. Specifically, leveraging the global representation capacity of the Fourier transform, the frequency domain can be considered a natural global feature space. Therefore, unlike traditional methods that use pixel-level convolution for feature learning and modeling in the image domain, this letter focuses on feature-level modeling in the frequency domain, enabling stable and rich generation even with ultra low-data regimes. Moreover, leveraging the advantages of feature-level modeling in the frequency domain, we integrate k-space data across time frames with multiple fusion strategies to steer and further optimize the generative trajectory. Experimental results demonstrate that the proposed method possesses strong generative ability in low-data regimes, indicating practical potential to alleviate data scarcity in dynamic MRI reconstruction.

Title: Human-in-the-Loop: Quantitative Evaluation of 3D Models Generation by Large Language Models

Authors: Ahmed R. Sadik, Mariusz Bujny
Subjects: cs.CV, cs.AI, cs.ET
Abstract URL: https://arxiv.org/abs/2509.07010
Pdf URL: https://arxiv.org/pdf/2509.07010
Copy Paste: [[2509.07010]] Human-in-the-Loop: Quantitative Evaluation of 3D Models Generation by Large Language Models(https://arxiv.org/abs/2509.07010)
Keywords: generation, generative
Abstract: Large Language Models are increasingly capable of interpreting multimodal inputs to generate complex 3D shapes, yet robust methods to evaluate geometric and structural fidelity remain underdeveloped. This paper introduces a human-in-the-loop framework for the quantitative evaluation of LLM-generated 3D models, supporting applications such as democratization of CAD design, reverse engineering of legacy designs, and rapid prototyping. We propose a comprehensive suite of similarity and complexity metrics, including volumetric accuracy, surface alignment, dimensional fidelity, and topological intricacy, to benchmark generated models against ground-truth CAD references. Using an L-bracket component as a case study, we systematically compare LLM performance across four input modalities: 2D orthographic views, isometric sketches, geometric structure trees, and code-based correction prompts. Our findings demonstrate improved generation fidelity with increased semantic richness, with code-level prompts achieving perfect reconstruction across all metrics. A key contribution of this work is demonstrating that our proposed quantitative evaluation approach enables significantly faster convergence toward the ground truth, especially compared to traditional qualitative methods based solely on visual inspection and human intuition. This work not only advances the understanding of AI-assisted shape synthesis but also provides a scalable methodology to validate and refine generative models for diverse CAD applications.

Title: Moment- and Power-Spectrum-Based Gaussianity Regularization for Text-to-Image Models

Authors: Jisung Hwang, Jaihoon Kim, Minhyuk Sung
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.07027
Pdf URL: https://arxiv.org/pdf/2509.07027
Copy Paste: [[2509.07027]] Moment- and Power-Spectrum-Based Gaussianity Regularization for Text-to-Image Models(https://arxiv.org/abs/2509.07027)
Keywords: generative
Abstract: We propose a novel regularization loss that enforces standard Gaussianity, encouraging samples to align with a standard Gaussian distribution. This facilitates a range of downstream tasks involving optimization in the latent space of text-to-image models. We treat elements of a high-dimensional sample as one-dimensional standard Gaussian variables and define a composite loss that combines moment-based regularization in the spatial domain with power spectrum-based regularization in the spectral domain. Since the expected values of moments and power spectrum distributions are analytically known, the loss promotes conformity to these properties. To ensure permutation invariance, the losses are applied to randomly permuted inputs. Notably, existing Gaussianity-based regularizations fall within our unified framework: some correspond to moment losses of specific orders, while the previous covariance-matching loss is equivalent to our spectral loss but incurs higher time complexity due to its spatial-domain computation. We showcase the application of our regularization in generative modeling for test-time reward alignment with a text-to-image model, specifically to enhance aesthetics and text alignment. Our regularization outperforms previous Gaussianity regularization, effectively prevents reward hacking and accelerates convergence.
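As a rough illustration of the moment-based part of the regularization described in this abstract, the sketch below compares empirical moments of a sample with the analytically known moments of a standard Gaussian (0 for odd orders, (k-1)!! for even orders). The function names, the choice of orders 1 to 4, and the squared-error form are assumptions made for illustration and are not taken from the paper; the power-spectrum term and the random permutation step are not covered.

```python
import random

def gaussian_moment(k: int) -> float:
    """Return E[x^k] for x ~ N(0, 1): 0 for odd k, (k-1)!! for even k."""
    if k % 2 == 1:
        return 0.0
    result = 1.0
    for i in range(k - 1, 0, -2):
        result *= i
    return result

def moment_loss(sample, orders=(1, 2, 3, 4)) -> float:
    """Sum of squared gaps between empirical and standard-Gaussian moments.

    Hypothetical loss form, chosen only to illustrate the idea.
    """
    n = len(sample)
    loss = 0.0
    for k in orders:
        empirical = sum(x ** k for x in sample) / n
        loss += (empirical - gaussian_moment(k)) ** 2
    return loss

if __name__ == "__main__":
    rng = random.Random(0)
    gaussian_sample = [rng.gauss(0.0, 1.0) for _ in range(20000)]
    uniform_sample = [rng.random() for _ in range(20000)]
    # The Gaussian sample should score much lower than the uniform one.
    print("gaussian:", moment_loss(gaussian_sample))
    print("uniform: ", moment_loss(uniform_sample))
```

Because the target moments are known in closed form, a loss of this kind needs no reference samples, only the sample being regularized.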

Title: Automated Evaluation of Gender Bias Across 13 Large Multimodal Models

Authors: Juan Manuel Contreras
Subjects: cs.CV, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2509.07050
Pdf URL: https://arxiv.org/pdf/2509.07050
Copy Paste: [[2509.07050]] Automated Evaluation of Gender Bias Across 13 Large Multimodal Models(https://arxiv.org/abs/2509.07050)
Keywords: generation
Abstract: Large multimodal models (LMMs) have revolutionized text-to-image generation, but they risk perpetuating the harmful social biases in their training data. Prior work has identified gender bias in these models, but methodological limitations prevented large-scale, comparable, cross-model analysis. To address this gap, we introduce the Aymara Image Fairness Evaluation, a benchmark for assessing social bias in AI-generated images. We test 13 commercially available LMMs using 75 procedurally-generated, gender-neutral prompts to generate people in stereotypically-male, stereotypically-female, and non-stereotypical professions. We then use a validated LLM-as-a-judge system to score the 965 resulting images for gender representation. Our results reveal (p < .001 for all): 1) LMMs systematically not only reproduce but actually amplify occupational gender stereotypes relative to real-world labor data, generating men in 93.0% of images for male-stereotyped professions but only 22.5% for female-stereotyped professions; 2) Models exhibit a strong default-male bias, generating men 68.3% of the time for non-stereotyped professions; and 3) The extent of bias varies dramatically across models, with overall male representation ranging from 46.7% to 73.3%. Notably, the top-performing model de-amplified gender stereotypes and approached gender parity, achieving the highest fairness scores. This variation suggests high bias is not an inevitable outcome but a consequence of design choices. Our work provides the most comprehensive cross-model benchmark of gender bias to date and underscores the necessity of standardized, automated evaluation tools for promoting accountability and fairness in AI development.

Title: PLaID++: A Preference Aligned Language Model for Targeted Inorganic Materials Design

Authors: Andy Xu, Rohan Desai, Larry Wang, Gabriel Hope, Ethan Ritz
Subjects: cs.LG, cond-mat.mtrl-sci
Abstract URL: https://arxiv.org/abs/2509.07150
Pdf URL: https://arxiv.org/pdf/2509.07150
Copy Paste: [[2509.07150]] PLaID++: A Preference Aligned Language Model for Targeted Inorganic Materials Design(https://arxiv.org/abs/2509.07150)
Keywords: generation
Abstract: Discovering novel materials is critical for technological advancements such as solar cells, batteries, and carbon capture. However, the development of new materials is constrained by a slow and expensive trial-and-error process. To accelerate this pipeline, we introduce PLaID++, a Large Language Model (LLM) fine-tuned for stable and property-guided crystal generation. We fine-tune Qwen-2.5 7B to generate crystal structures using a novel Wyckoff-based text representation. We show that generation can be effectively guided with a reinforcement learning technique based on Direct Preference Optimization (DPO), with sampled structures categorized by their stability, novelty, and space group. By encoding symmetry constraints directly into text and guiding model outputs towards desirable chemical space, PLaID++ generates structures that are thermodynamically stable, unique, and novel at a ~50% greater rate than prior methods and conditionally generates structures with desired space group properties. Our experiments highlight the effectiveness of iterative DPO, achieving ~115% and ~50% improvements in unconditional and space group conditioned generation, respectively, compared to fine-tuning alone. Our work demonstrates the potential of adapting post-training techniques from natural language processing to materials design, paving the way for targeted and efficient discovery of novel materials.
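For readers unfamiliar with Direct Preference Optimization, the sketch below shows the standard per-pair DPO objective that this abstract builds on: the negative log-sigmoid of a scaled difference between the policy-versus-reference log-ratios of a preferred and a dispreferred sample. This is the generic textbook form, not necessarily the exact variant used in PLaID++. The function name, argument names, and the value beta = 0.1 are illustrative assumptions.

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair.

    Inputs are sequence log-probabilities under the policy being trained
    (logp_*) and under the frozen reference model (ref_logp_*).
    """
    # How much more the policy prefers each sample than the reference does.
    chosen_ratio = logp_chosen - ref_logp_chosen
    rejected_ratio = logp_rejected - ref_logp_rejected
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

if __name__ == "__main__":
    # Hypothetical numbers: the policy has shifted mass toward the chosen sample.
    print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
    # Opposite case: the policy has shifted mass toward the rejected sample.
    print(dpo_loss(-3.0, -1.0, -2.0, -2.0))
```

In a crystal-generation setting, one plausible way to form pairs (an assumption, not a detail from the abstract) would be to treat a sampled structure judged stable and novel as the chosen sample and an unstable one as the rejected sample.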

Title: Breast Cancer Detection in Thermographic Images via Diffusion-Based Augmentation and Nonlinear Feature Fusion

Authors: Sepehr Salem, M. Moein Esfahani, Jingyu Liu, Vince Calhoun
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2509.07277
Pdf URL: https://arxiv.org/pdf/2509.07277
Copy Paste: [[2509.07277]] Breast Cancer Detection in Thermographic Images via Diffusion-Based Augmentation and Nonlinear Feature Fusion(https://arxiv.org/abs/2509.07277)
Keywords: generative
Abstract: Data scarcity hinders deep learning for medical imaging. We propose a framework for breast cancer classification in thermograms that addresses this using a Diffusion Probabilistic Model (DPM) for data augmentation. Our DPM-based augmentation is shown to be superior to both traditional methods and a ProGAN baseline. The framework fuses deep features from a pre-trained ResNet-50 with handcrafted nonlinear features (e.g., Fractal Dimension) derived from U-Net segmented tumors. An XGBoost classifier trained on these fused features achieves 98.0% accuracy and 98.1% sensitivity. Ablation studies and statistical tests confirm that both the DPM augmentation and the nonlinear feature fusion are critical, statistically significant components of this success. This work validates the synergy between advanced generative models and interpretable features for creating highly accurate medical diagnostic tools.

Title: Reconstruction Alignment Improves Unified Multimodal Models

Authors: Ji Xie, Trevor Darrell, Luke Zettlemoyer, XuDong Wang
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.07295
Pdf URL: https://arxiv.org/pdf/2509.07295
Copy Paste: [[2509.07295]] Reconstruction Alignment Improves Unified Multimodal Models(https://arxiv.org/abs/2509.07295)
Keywords: generation
Abstract: Unified multimodal models (UMMs) unify visual understanding and generation within a single architecture. However, conventional training relies on image-text pairs (or sequences) whose captions are typically sparse and miss fine-grained visual details--even when they use hundreds of words to describe a simple image. We introduce Reconstruction Alignment (RecA), a resource-efficient post-training method that leverages visual understanding encoder embeddings as dense "text prompts," providing rich supervision without captions. Concretely, RecA conditions a UMM on its own visual understanding embeddings and optimizes it to reconstruct the input image with a self-supervised reconstruction loss, thereby realigning understanding and generation. Despite its simplicity, RecA is broadly applicable: across autoregressive, masked-autoregressive, and diffusion-based UMMs, it consistently improves generation and editing fidelity. With only 27 GPU-hours, post-training with RecA substantially improves image generation performance on GenEval (0.73 → 0.90) and DPGBench (80.93 → 88.15), while also boosting editing benchmarks (ImgEdit 3.38 → 3.75, GEdit 6.94 → 7.25). Notably, RecA surpasses much larger open-source models and applies broadly across diverse UMM architectures, establishing it as an efficient and general post-training alignment strategy for UMMs.

Title: CancerGUIDE: Cancer Guideline Understanding via Internal Disagreement Estimation

Authors: Alyssa Unell, Noel C. F. Codella, Sam Preston, Peniel Argaw, Wen-wai Yim, Zelalem Gero, Cliff Wong, Rajesh Jena, Eric Horvitz, Amanda K. Hall, Ruican Rachel Zhong, Jiachen Li, Shrey Jain, Mu Wei, Matthew Lungren, Hoifung Poon
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.07325
Pdf URL: https://arxiv.org/pdf/2509.07325
Copy Paste: [[2509.07325]] CancerGUIDE: Cancer Guideline Understanding via Internal Disagreement Estimation(https://arxiv.org/abs/2509.07325)
Keywords: generation
Abstract: The National Comprehensive Cancer Network (NCCN) provides evidence-based guidelines for cancer treatment. Translating complex patient presentations into guideline-compliant treatment recommendations is time-intensive, requires specialized expertise, and is prone to error. Advances in large language model (LLM) capabilities promise to reduce the time required to generate treatment recommendations and improve accuracy. We present an LLM agent-based approach to automatically generate guideline-concordant treatment trajectories for patients with non-small cell lung cancer (NSCLC). Our contributions are threefold. First, we construct a novel longitudinal dataset of 121 cases of NSCLC patients that includes clinical encounters, diagnostic results, and medical histories, each expertly annotated with the corresponding NCCN guideline trajectories by board-certified oncologists. Second, we demonstrate that existing LLMs possess domain-specific knowledge that enables high-quality proxy benchmark generation for both model development and evaluation, achieving strong correlation (Spearman coefficient r=0.88, RMSE = 0.08) with expert-annotated benchmarks. Third, we develop a hybrid approach combining expensive human annotations with model consistency information to create both the agent framework that predicts the relevant guidelines for a patient, as well as a meta-classifier that verifies prediction accuracy with calibrated confidence scores for treatment recommendations (AUROC=0.800), a critical capability for communicating the accuracy of outputs, custom-tailoring tradeoffs in performance, and supporting regulatory compliance. This work establishes a framework for clinically viable LLM-based guideline adherence systems that balance accuracy, interpretability, and regulatory requirements while reducing annotation costs, providing a scalable pathway toward automated clinical decision support.

Title: The Choice of Divergence: A Neglected Key to Mitigating Diversity Collapse in Reinforcement Learning with Verifiable Reward

Authors: Long Li, Jiaran Hao, Jason Klein Liu, Zhijian Zhou, Xiaoyu Tan, Wei Chu, Zhe Wang, Shirui Pan, Chao Qu, Yuan Qi
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2509.07430
Pdf URL: https://arxiv.org/pdf/2509.07430
Copy Paste: [[2509.07430]] The Choice of Divergence: A Neglected Key to Mitigating Diversity Collapse in Reinforcement Learning with Verifiable Reward(https://arxiv.org/abs/2509.07430)
Keywords: generation
Abstract: A central paradox in fine-tuning Large Language Models (LLMs) with Reinforcement Learning with Verifiable Reward (RLVR) is the frequent degradation of multi-attempt performance (Pass@k) despite improvements in single-attempt accuracy (Pass@1). This is often accompanied by catastrophic forgetting, where models lose previously acquired skills. While various methods have been proposed, the choice and function of the divergence term have been surprisingly unexamined as a proactive solution. We argue that standard RLVR objectives -- both those using the mode-seeking reverse KL-divergence and those forgoing a divergence term entirely -- lack a crucial mechanism for knowledge retention. The reverse-KL actively accelerates this decay by narrowing the policy, while its absence provides no safeguard against the model drifting from its diverse knowledge base. We propose a fundamental shift in perspective: using the divergence term itself as the solution. Our framework, Diversity-Preserving Hybrid RL (DPH-RL), leverages mass-covering f-divergences (like forward-KL and JS-divergence) to function as a rehearsal mechanism. By continuously referencing the initial policy, this approach forces the model to maintain broad solution coverage. Extensive experiments on math and SQL generation demonstrate that DPH-RL not only resolves the Pass@k degradation but improves both Pass@1 and Pass@k in- and out-of-domain. Additionally, DPH-RL is more training-efficient because it computes f-divergence using generator functions, requiring only sampling from the initial policy and no online reference model. Our work highlights a crucial, overlooked axis for improving RLVR, demonstrating that the proper selection of a divergence measure is a powerful tool for building more general and diverse reasoning models.
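The contrast this abstract draws between mode-seeking reverse KL and mass-covering divergences can be seen on a toy discrete example. The sketch below assumes the common convention that forward KL is KL(reference || policy) and reverse KL is KL(policy || reference). The two four-outcome distributions are invented for illustration, and none of this reproduces DPH-RL's generator-function estimator.

```python
import math

def kl(p, q) -> float:
    """KL(p || q) for discrete distributions; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q) -> float:
    """Jensen-Shannon divergence: symmetric and bounded above by log 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical initial policy spreading mass over four solution styles,
# and a fine-tuned policy that has collapsed onto one of them.
reference = [0.25, 0.25, 0.25, 0.25]
collapsed = [0.97, 0.01, 0.01, 0.01]

forward_kl = kl(reference, collapsed)  # weighted by reference mass
reverse_kl = kl(collapsed, reference)  # weighted by policy mass
js_div = js(reference, collapsed)

if __name__ == "__main__":
    print("forward KL:", forward_kl)
    print("reverse KL:", reverse_kl)
    print("JS:        ", js_div)
```

In this toy case the forward KL is the larger of the two, because it weights each outcome by the reference distribution and so charges heavily for the three outcomes the collapsed policy has nearly abandoned. Reverse KL weights by the policy's own mass and largely ignores them.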

Title: DreamLifting: A Plug-in Module Lifting MV Diffusion Models for 3D Asset Generation

Authors: Ze-Xin Yin, Jiaxiong Qiu, Liu Liu, Xinjie Wang, Wei Sui, Zhizhong Su, Jian Yang, Jin Xie
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.07435
Pdf URL: https://arxiv.org/pdf/2509.07435
Copy Paste: [[2509.07435]] DreamLifting: A Plug-in Module Lifting MV Diffusion Models for 3D Asset Generation(https://arxiv.org/abs/2509.07435)
Keywords: generation
Abstract: The labor- and experience-intensive creation of 3D assets with physically based rendering (PBR) materials demands an autonomous 3D asset creation pipeline. However, most existing 3D generation methods focus on geometry modeling, either baking textures into simple vertex colors or leaving texture synthesis to post-processing with image diffusion models. To achieve end-to-end PBR-ready 3D asset generation, we present Lightweight Gaussian Asset Adapter (LGAA), a novel framework that unifies the modeling of geometry and PBR materials by exploiting multi-view (MV) diffusion priors from a novel perspective. The LGAA features a modular design with three components. Specifically, the LGAA Wrapper reuses and adapts network layers from MV diffusion models, which encapsulate knowledge acquired from billions of images, enabling better convergence in a data-efficient manner. To incorporate multiple diffusion priors for geometry and PBR synthesis, the LGAA Switcher aligns multiple LGAA Wrapper layers encapsulating different knowledge. Then, a tamed variational autoencoder (VAE), termed LGAA Decoder, is designed to predict 2D Gaussian Splatting (2DGS) with PBR channels. Finally, we introduce a dedicated post-processing procedure to effectively extract high-quality, relightable mesh assets from the resulting 2DGS. Extensive quantitative and qualitative experiments demonstrate the superior performance of LGAA with both text- and image-conditioned MV diffusion models. Additionally, the modular design enables flexible incorporation of multiple diffusion priors, and the knowledge-preserving scheme leads to efficient convergence when trained on merely 69k multi-view instances. Our code, pre-trained weights, and the dataset used will be publicly available via our project page: this https URL.

Title: ANYPORTAL: Zero-Shot Consistent Video Background Replacement

Authors: Wenshuo Gao, Xicheng Lan, Shuai Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.07472
Pdf URL: https://arxiv.org/pdf/2509.07472
Copy Paste: [[2509.07472]] ANYPORTAL: Zero-Shot Consistent Video Background Replacement(https://arxiv.org/abs/2509.07472)
Keywords: generation
Abstract: Despite the rapid advancements in video generation technology, creating high-quality videos that precisely align with user intentions remains a significant challenge. Existing methods often fail to achieve fine-grained control over video details, limiting their practical applicability. We introduce ANYPORTAL, a novel zero-shot framework for video background replacement that leverages pre-trained diffusion models. Our framework collaboratively integrates the temporal prior of video diffusion models with the relighting capabilities of image diffusion models in a zero-shot setting. To address the critical challenge of foreground consistency, we propose a Refinement Projection Algorithm, which enables pixel-level detail manipulation to ensure precise foreground preservation. ANYPORTAL is training-free and overcomes the challenges of achieving foreground consistency and temporally coherent relighting. Experimental results demonstrate that ANYPORTAL achieves high-quality results on consumer-grade GPUs, offering a practical and efficient solution for video content creation and editing.

Title: EHWGesture -- A dataset for multimodal understanding of clinical gestures

Authors: Gianluca Amprimo, Alberto Ancilotto, Alessandro Savino, Fabio Quazzolo, Claudia Ferraris, Gabriella Olmo, Elisabetta Farella, Stefano Di Carlo
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2509.07525
Pdf URL: https://arxiv.org/pdf/2509.07525
Copy Paste: [[2509.07525]] EHWGesture -- A dataset for multimodal understanding of clinical gestures(https://arxiv.org/abs/2509.07525)
Keywords: quality assessment
Abstract: Hand gesture understanding is essential for several applications in human-computer interaction, including automatic clinical assessment of hand dexterity. While deep learning has advanced static gesture recognition, dynamic gesture understanding remains challenging due to complex spatiotemporal variations. Moreover, existing datasets often lack multimodal and multi-view diversity, precise ground-truth tracking, and an action quality component embedded within gestures. This paper introduces EHWGesture, a multimodal video dataset for gesture understanding featuring five clinically relevant gestures. It includes over 1,100 recordings (6 hours), captured from 25 healthy subjects using two high-resolution RGB-Depth cameras and an event camera. A motion capture system provides precise ground-truth hand landmark tracking, and all devices are spatially calibrated and synchronized to ensure cross-modal alignment. Moreover, to embed an action quality task within gesture understanding, collected recordings are organized in classes of execution speed that mirror clinical evaluations of hand dexterity. Baseline experiments highlight the dataset's potential for gesture classification, gesture trigger detection, and action quality assessment. Thus, EHWGesture can serve as a comprehensive benchmark for advancing multimodal clinical gesture understanding.

Title: PanoLAM: Large Avatar Model for Gaussian Full-Head Synthesis from One-shot Unposed Image

Authors: Peng Li, Yisheng He, Yingdong Hu, Yuan Dong, Weihao Yuan, Yuan Liu, Zilong Dong, Yike Guo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.07552
Pdf URL: https://arxiv.org/pdf/2509.07552
Copy Paste: [[2509.07552]] PanoLAM: Large Avatar Model for Gaussian Full-Head Synthesis from One-shot Unposed Image(https://arxiv.org/abs/2509.07552)
Keywords: generation
Abstract: We present a feed-forward framework for Gaussian full-head synthesis from a single unposed image. Unlike previous work that relies on time-consuming GAN inversion and test-time optimization, our framework can reconstruct the Gaussian full-head model given a single unposed image in a single forward pass. This enables fast reconstruction and rendering during inference. To mitigate the lack of large-scale 3D head assets, we propose a large-scale synthetic dataset from trained 3D GANs and train our framework using only synthetic data. For efficient high-fidelity generation, we introduce a coarse-to-fine Gaussian head generation pipeline, where sparse points from the FLAME model interact with the image features by transformer blocks for feature extraction and coarse shape reconstruction, which are then densified for high-fidelity reconstruction. To fully leverage the prior knowledge residing in pretrained 3D GANs, we propose a dual-branch framework that aggregates the structured spherical triplane feature and unstructured point-based features for more effective Gaussian head reconstruction. Experimental results demonstrate the effectiveness of our framework compared with existing work.

Title: $ΔL$ Normalization: Rethink Loss Aggregation in RLVR

Authors: Zhiyuan He, Xufang Luo, Yike Zhang, Yuqing Yang, Lili Qiu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2509.07558
Pdf URL: https://arxiv.org/pdf/2509.07558
Copy Paste: [[2509.07558]] $ΔL$ Normalization: Rethink Loss Aggregation in RLVR(https://arxiv.org/abs/2509.07558)
Keywords: generation
Abstract: We propose $\Delta L$ Normalization, a simple yet effective loss aggregation method tailored to the characteristic of dynamic generation lengths in Reinforcement Learning with Verifiable Rewards (RLVR). Recently, RLVR has demonstrated strong potential in improving the reasoning capabilities of large language models (LLMs), but a major challenge lies in the large variability of response lengths during training, which leads to high gradient variance and unstable optimization. Although previous methods such as GRPO, DAPO, and Dr. GRPO introduce different loss normalization terms to address this issue, they either produce biased estimates or still suffer from high gradient variance. By analyzing the effect of varying lengths on policy loss both theoretically and empirically, we reformulate the problem as finding a minimum-variance unbiased estimator. Our proposed $\Delta L$ Normalization not only provides an unbiased estimate of the true policy loss but also minimizes gradient variance in theory. Extensive experiments show that it consistently achieves superior results across different model sizes, maximum lengths, and tasks. Our code will be made public at this https URL.
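The idea of a minimum-variance unbiased estimator can be illustrated with a toy sketch. It assumes, purely for illustration, that each response gives an independent unbiased loss estimate whose variance is proportional to its length; under that assumption the best linear combination weights each sample in inverse proportion to its length. The function names and the variance model are hypothetical and are not taken from the paper.

```python
def min_variance_weights(lengths):
    """Inverse-variance weights, assuming Var(estimate_i) is proportional
    to lengths[i] (an illustrative assumption). The weights sum to 1, so a
    weighted sum of unbiased estimates remains unbiased."""
    inverse = [1.0 / length for length in lengths]
    total = sum(inverse)
    return [value / total for value in inverse]


def combined_variance(weights, lengths):
    """Variance of sum(w_i * g_i) for independent g_i with Var(g_i) = L_i."""
    return sum(w * w * length for w, length in zip(weights, lengths))


lengths = [10, 100, 1000]
w_opt = min_variance_weights(lengths)
w_uniform = [1.0 / len(lengths)] * len(lengths)

var_opt = combined_variance(w_opt, lengths)
var_uniform = combined_variance(w_uniform, lengths)
```

With these lengths, uniform averaging lets the longest response dominate the variance, while the inverse-length weights keep the estimate unbiased at a much lower variance.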

Title: uGMM-NN: Univariate Gaussian Mixture Model Neural Network

Authors: Zakeria Sharif Ali
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2509.07569
Pdf URL: https://arxiv.org/pdf/2509.07569
Copy Paste: [[2509.07569]] uGMM-NN: Univariate Gaussian Mixture Model Neural Network(https://arxiv.org/abs/2509.07569)
Keywords: generative
Abstract: This paper introduces the Univariate Gaussian Mixture Model Neural Network (uGMM-NN), a novel neural architecture that embeds probabilistic reasoning directly into the computational units of deep networks. Unlike traditional neurons, which apply weighted sums followed by fixed nonlinearities, each uGMM-NN node parameterizes its activations as a univariate Gaussian mixture, with learnable means, variances, and mixing coefficients. This design enables richer representations by capturing multimodality and uncertainty at the level of individual neurons, while retaining the scalability of standard feedforward networks. We demonstrate that uGMM-NN can achieve competitive discriminative performance compared to conventional multilayer perceptrons, while additionally offering a probabilistic interpretation of activations. The proposed framework provides a foundation for integrating uncertainty-aware components into modern neural architectures, opening new directions for both discriminative and generative modeling.
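As a rough illustration of what a univariate Gaussian mixture with means, variances, and mixing coefficients computes, the sketch below evaluates the log-density of a scalar under such a mixture. The abstract does not say how a node's inputs are mapped to its mixture, so this shows only the density evaluation; the function name and the use of softmax logits for the mixing coefficients are assumptions.

```python
import math


def gmm_log_density(x, means, variances, logits):
    """Log-density of scalar x under a univariate Gaussian mixture.

    Mixing coefficients are softmax(logits), so they are positive and sum
    to 1. All sums are done in log space for numerical stability.
    """
    top = max(logits)
    log_z = top + math.log(sum(math.exp(l - top) for l in logits))
    terms = []
    for mu, var, logit in zip(means, variances, logits):
        log_pi = logit - log_z
        log_normal = -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)
        terms.append(log_pi + log_normal)
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


# Two-component example: one narrow mode at -1, one wide mode at 2.
value = gmm_log_density(0.5, means=[-1.0, 2.0], variances=[0.25, 4.0],
                        logits=[0.0, 0.0])
```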

Title: Bias in Gender Bias Benchmarks: How Spurious Features Distort Evaluation

Authors: Yusuke Hirota, Ryo Hachiuma, Boyi Li, Ximing Lu, Michael Ross Boone, Boris Ivanovic, Yejin Choi, Marco Pavone, Yu-Chiang Frank Wang, Noa Garcia, Yuta Nakashima, Chao-Han Huck Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.07596
Pdf URL: https://arxiv.org/pdf/2509.07596
Copy Paste: [[2509.07596]] Bias in Gender Bias Benchmarks: How Spurious Features Distort Evaluation(https://arxiv.org/abs/2509.07596)
Keywords: generative
Abstract: Gender bias in vision-language foundation models (VLMs) raises concerns about their safe deployment and is typically evaluated using benchmarks with gender annotations on real-world images. However, as these benchmarks often contain spurious correlations between gender and non-gender features, such as objects and backgrounds, we identify a critical oversight in gender bias evaluation: Do spurious features distort gender bias evaluation? To address this question, we systematically perturb non-gender features across four widely used benchmarks (COCO-gender, FACET, MIAP, and PHASE) and various VLMs to quantify their impact on bias evaluation. Our findings reveal that even minimal perturbations, such as masking just 10% of objects or weakly blurring backgrounds, can dramatically alter bias scores, shifting metrics by up to 175% in generative VLMs and 43% in CLIP variants. This suggests that current bias evaluations often reflect model responses to spurious features rather than gender bias, undermining their reliability. Since creating spurious feature-free benchmarks is fundamentally challenging, we recommend reporting bias metrics alongside feature-sensitivity measurements to enable a more reliable bias assessment.

Title: Beyond Rebalancing: Benchmarking Binary Classifiers Under Class Imbalance Without Rebalancing Techniques

Authors: Ali Nawaz, Amir Ahmad, Shehroz S. Khan
Subjects: cs.LG, cs.AI, cs.IT
Abstract URL: https://arxiv.org/abs/2509.07605
Pdf URL: https://arxiv.org/pdf/2509.07605
Copy Paste: [[2509.07605]] Beyond Rebalancing: Benchmarking Binary Classifiers Under Class Imbalance Without Rebalancing Techniques(https://arxiv.org/abs/2509.07605)
Keywords: generation
Abstract: Class imbalance poses a significant challenge to supervised classification, particularly in critical domains like medical diagnostics and anomaly detection where minority class instances are rare. While numerous studies have explored rebalancing techniques to address this issue, less attention has been given to evaluating the performance of binary classifiers under imbalance when no such techniques are applied. Therefore, the goal of this study is to assess the performance of binary classifiers "as-is", without performing any explicit rebalancing. Specifically, we systematically evaluate the robustness of a diverse set of binary classifiers across both real-world and synthetic datasets, under progressively reduced minority class sizes, using one-shot and few-shot scenarios as baselines. Our approach also explores varying data complexities through synthetic decision boundary generation to simulate real-world conditions. In addition to standard classifiers, we include experiments using undersampling, oversampling strategies, and one-class classification (OCC) methods to examine their behavior under severe imbalance. The results confirm that classification becomes more difficult as data complexity increases and the minority class size decreases. While traditional classifiers deteriorate under extreme imbalance, advanced models like TabPFN and boosting-based ensembles retain relatively higher performance and better generalization compared to traditional classifiers. Visual interpretability and evaluation metrics further validate these findings. Our work offers valuable guidance on model selection for imbalanced learning, providing insights into classifier robustness without dependence on explicit rebalancing techniques.
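The "progressively reduced minority class sizes" protocol can be sketched as a subsampling step that keeps every majority sample and only a chosen number of minority samples, with no rebalancing afterwards. This is a minimal sketch; the function name, the seed handling, and the example sizes are illustrative assumptions rather than the study's actual setup.

```python
import random


def reduce_minority(labels, minority_label, keep, seed=0):
    """Return sorted indices that keep all majority samples and `keep`
    randomly chosen minority samples (fewer if not enough exist)."""
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == minority_label]
    majority = [i for i, y in enumerate(labels) if y != minority_label]
    kept = rng.sample(minority, min(keep, len(minority)))
    return sorted(majority + kept)


labels = [0] * 90 + [1] * 10
# Shrink the minority class from 10 samples down to a single sample.
sizes = {k: len(reduce_minority(labels, 1, k)) for k in (10, 5, 1)}
```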

Title: Data-Efficient Fine-Tuning of Vision-Language Models for Diagnosis of Alzheimer's Disease

Authors: Fangqi Cheng, Surajit Ray, Xiaochen Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.07613
Pdf URL: https://arxiv.org/pdf/2509.07613
Copy Paste: [[2509.07613]] Data-Efficient Fine-Tuning of Vision-Language Models for Diagnosis of Alzheimer's Disease(https://arxiv.org/abs/2509.07613)
Keywords: generation
Abstract: Medical vision-language models (Med-VLMs) have shown impressive results in tasks such as report generation and visual question answering, but they still face several limitations. Most notably, they underutilize patient metadata and lack integration of clinical diagnostic knowledge. Moreover, most existing models are typically trained from scratch or fine-tuned on large-scale 2D image-text pairs, requiring extensive computational resources, and their effectiveness on 3D medical imaging is often limited due to the absence of structural information. To address these gaps, we propose a data-efficient fine-tuning pipeline to adapt 3D CT-based Med-VLMs for 3D MRI and demonstrate its application in Alzheimer's disease (AD) diagnosis. Our system introduces two key innovations. First, we convert structured metadata into synthetic reports, enriching textual input for improved image-text alignment. Second, we add an auxiliary token trained to predict the mini-mental state examination (MMSE) score, a widely used clinical measure of cognitive function that correlates with AD severity. This provides additional supervision for fine-tuning. Applying lightweight prompt tuning to both image and text modalities, our approach achieves state-of-the-art performance on two AD datasets using 1,500 training images, outperforming existing methods fine-tuned on 10,000 images. Code will be released upon publication.

Title: Self-Supervised Cross-Encoder for Neurodegenerative Disease Diagnosis

Authors: Fangqi Cheng, Yingying Zhao, Xiaochen Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.07623
Pdf URL: https://arxiv.org/pdf/2509.07623
Copy Paste: [[2509.07623]] Self-Supervised Cross-Encoder for Neurodegenerative Disease Diagnosis(https://arxiv.org/abs/2509.07623)
Keywords: generative
Abstract: Deep learning has shown significant potential in diagnosing neurodegenerative diseases from MRI data. However, most existing methods rely heavily on large volumes of labeled data and often yield representations that lack interpretability. To address both challenges, we propose a novel self-supervised cross-encoder framework that leverages the temporal continuity in longitudinal MRI scans for supervision. This framework disentangles learned representations into two components: a static representation, constrained by contrastive learning, which captures stable anatomical features; and a dynamic representation, guided by input-gradient regularization, which reflects temporal changes and can be effectively fine-tuned for downstream classification tasks. Experimental results on the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset demonstrate that our method achieves superior classification accuracy and improved interpretability. Furthermore, the learned representations exhibit strong zero-shot generalization on the Open Access Series of Imaging Studies (OASIS) dataset and cross-task generalization on the Parkinson Progression Marker Initiative (PPMI) dataset. The code for the proposed method will be made publicly available.

Title: Semantic Watermarking Reinvented: Enhancing Robustness and Generation Quality with Fourier Integrity

Authors: Sung Ju Lee, Nam Ik Cho
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.07647
Pdf URL: https://arxiv.org/pdf/2509.07647
Copy Paste: [[2509.07647]] Semantic Watermarking Reinvented: Enhancing Robustness and Generation Quality with Fourier Integrity(https://arxiv.org/abs/2509.07647)
Keywords: generation
Abstract: Semantic watermarking techniques for latent diffusion models (LDMs) are robust against regeneration attacks, but often suffer from detection performance degradation due to the loss of frequency integrity. To tackle this problem, we propose a novel embedding method called Hermitian Symmetric Fourier Watermarking (SFW), which maintains frequency integrity by enforcing Hermitian symmetry. Additionally, we introduce a center-aware embedding strategy that reduces the vulnerability of semantic watermarking due to cropping attacks by ensuring robust information retention. To validate our approach, we apply these techniques to existing semantic watermarking schemes, enhancing their frequency-domain structures for better robustness and retrieval accuracy. Extensive experiments demonstrate that our methods achieve state-of-the-art verification and identification performance, surpassing previous approaches across various attack scenarios. Ablation studies confirm the impact of SFW on detection capabilities, the effectiveness of the center-aware embedding against cropping, and how message capacity influences identification accuracy. Notably, our method achieves the highest detection accuracy while maintaining superior image fidelity, as evidenced by FID and CLIP scores. Conclusively, our proposed SFW is shown to be an effective framework for balancing robustness and image fidelity, addressing the inherent trade-offs in semantic watermarking. Code available at this https URL
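Why Hermitian symmetry matters for frequency integrity can be seen in a toy 1-D example: a spectrum transforms back to a purely real signal only if it is Hermitian-symmetric, so an arbitrary complex pattern written into the Fourier domain loses information once the imaginary part of the signal is discarded. The sketch below is a generic illustration of that property, not the authors' SFW method; the function names and the sample spectrum are assumptions.

```python
import cmath


def enforce_hermitian(spectrum):
    """Return a copy satisfying X[k] == conj(X[(-k) mod N]) (1-D case)."""
    n = len(spectrum)
    out = list(spectrum)
    for k in range(n):
        mirror = (-k) % n
        if k == mirror:
            # Self-mirrored bins (DC, and Nyquist for even N) must be real.
            out[k] = complex(out[k].real, 0.0)
        elif k < mirror:
            out[mirror] = out[k].conjugate()
    return out


def inverse_dft(spectrum):
    """Naive O(N^2) inverse discrete Fourier transform."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


raw = [1 + 2j, 3 - 1j, -2 + 0.5j, 0.5 + 4j, 2 - 3j, -1 + 1j]
signal = inverse_dft(enforce_hermitian(raw))
max_imag = max(abs(v.imag) for v in signal)          # numerically zero
raw_imag = max(abs(v.imag) for v in inverse_dft(raw))  # clearly non-zero
```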

Title: Faster, Self-Supervised Super-Resolution for Anisotropic Multi-View MRI Using a Sparse Coordinate Loss

Authors: Maja Schlereth, Moritz Schillinger, Katharina Breininger
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.07798
Pdf URL: https://arxiv.org/pdf/2509.07798
Copy Paste: [[2509.07798]] Faster, Self-Supervised Super-Resolution for Anisotropic Multi-View MRI Using a Sparse Coordinate Loss(https://arxiv.org/abs/2509.07798)
Keywords: super-resolution
Abstract: Acquiring images in high resolution is often a challenging task. Especially in the medical sector, image quality has to be balanced with acquisition time and patient comfort. To strike a compromise between scan time and quality for Magnetic Resonance (MR) imaging, two anisotropic scans with different low-resolution (LR) orientations can be acquired. Typically, LR scans are analyzed individually by radiologists, which is time consuming and can lead to inaccurate interpretation. To tackle this, we propose a novel approach for fusing two orthogonal anisotropic LR MR images to reconstruct anatomical details in a unified representation. Our multi-view neural network is trained in a self-supervised manner, without requiring corresponding high-resolution (HR) data. To optimize the model, we introduce a sparse coordinate-based loss, enabling the integration of LR images with arbitrary scaling. We evaluate our method on MR images from two independent cohorts. Our results demonstrate comparable or even improved super-resolution (SR) performance compared to state-of-the-art (SOTA) self-supervised SR methods for different upsampling scales. By combining a patient-agnostic offline and a patient-specific online phase, we achieve a substantial speed-up of up to ten times for patient-specific reconstruction while achieving similar or better SR quality. Code is available at this https URL.

Title: Feature Space Analysis by Guided Diffusion Model

Authors: Kimiaki Shirahama, Miki Yanobu, Kaduki Yamashita, Miho Ohsaki
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2509.07936
Pdf URL: https://arxiv.org/pdf/2509.07936
Copy Paste: [[2509.07936]] Feature Space Analysis by Guided Diffusion Model(https://arxiv.org/abs/2509.07936)
Keywords: generation
Abstract: One of the key issues in Deep Neural Networks (DNNs) is the black-box nature of their internal feature extraction process. Targeting vision-related domains, this paper focuses on analysing the feature space of a DNN by proposing a decoder that can generate images whose features are guaranteed to closely match a user-specified feature. Owing to this guarantee, which past studies lack, our decoder lets us show which of the various attributes in an image the DNN encodes into a feature, by generating images whose features are close to that feature. Our decoder is implemented as a guided diffusion model that guides the reverse image generation of a pre-trained diffusion model to minimise the Euclidean distance between the feature of a clean image estimated at each step and the user-specified feature. One practical advantage of our decoder is that it can analyse feature spaces of different DNNs with no additional training and run on a single COTS GPU. The experimental results targeting CLIP's image encoder, ResNet-50 and vision transformer demonstrate that images generated by our decoder have features remarkably similar to the user-specified ones and reveal valuable insights into these DNNs' feature spaces.

Title: Bringing Multi-Modal Multi-Task Federated Foundation Models to Education Domain: Prospects and Challenges

Authors: Kasra Borazjani, Naji Khosravan, Rajeev Sahay, Bita Akram, Seyyedali Hosseinalipour
Subjects: cs.LG, cs.AI, cs.ET
Abstract URL: https://arxiv.org/abs/2509.07946
Pdf URL: https://arxiv.org/pdf/2509.07946
Copy Paste: [[2509.07946]] Bringing Multi-Modal Multi-Task Federated Foundation Models to Education Domain: Prospects and Challenges(https://arxiv.org/abs/2509.07946)
Keywords: generation
Abstract: Multi-modal multi-task (M3T) foundation models (FMs) have recently shown transformative potential in artificial intelligence, with emerging applications in education. However, their deployment in real-world educational settings is hindered by privacy regulations, data silos, and limited domain-specific data availability. We introduce M3T Federated Foundation Models (FedFMs) for education: a paradigm that integrates federated learning (FL) with M3T FMs to enable collaborative, privacy-preserving training across decentralized institutions while accommodating diverse modalities and tasks. Subsequently, this position paper aims to unveil M3T FedFMs as a promising yet underexplored approach to the education community, explore its potentials, and reveal its related future research directions. We outline how M3T FedFMs can advance three critical pillars of next-generation intelligent education systems: (i) privacy preservation, by keeping sensitive multi-modal student and institutional data local; (ii) personalization, through modular architectures enabling tailored models for students, instructors, and institutions; and (iii) equity and inclusivity, by facilitating participation from underrepresented and resource-constrained entities. We finally identify various open research challenges, including studying of (i) inter-institution heterogeneous privacy regulations, (ii) the non-uniformity of data modalities' characteristics, (iii) the unlearning approaches for M3T FedFMs, (iv) the continual learning frameworks for M3T FedFMs, and (v) M3T FedFM model interpretability, which must be collectively addressed for practical deployment.

Title: Visual-TableQA: Open-Domain Benchmark for Reasoning over Table Images

Authors: Boammani Aser Lompo, Marc Haraoui
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2509.07966
Pdf URL: https://arxiv.org/pdf/2509.07966
Copy Paste: [[2509.07966]] Visual-TableQA: Open-Domain Benchmark for Reasoning over Table Images(https://arxiv.org/abs/2509.07966)
Keywords: generation
Abstract: Visual reasoning over structured data such as tables is a critical capability for modern vision-language models (VLMs), yet current benchmarks remain limited in scale, diversity, or reasoning depth, especially when it comes to rendered table images. Addressing this gap, we introduce Visual-TableQA, a large-scale, open-domain multimodal dataset specifically designed to evaluate and enhance visual reasoning over complex tabular data. Our generation pipeline is modular, scalable, and fully autonomous, involving multiple reasoning LLMs collaborating across distinct roles: generation, validation, and inspiration. Visual-TableQA comprises 2.5k richly structured LaTeX-rendered tables and 6k reasoning-intensive QA pairs, all produced at a cost of under USD 100. To promote diversity and creativity, our pipeline performs multi-model collaborative data generation via cross-model prompting ('inspiration') and LLM-jury filtering. Stronger models seed layouts and topics that weaker models elaborate, collectively distilling diverse reasoning patterns and visual structures into the dataset. Empirical results show that models fine-tuned on Visual-TableQA generalize robustly to external benchmarks, outperforming several proprietary models despite the dataset's synthetic nature. The full pipeline and resources are publicly available at this https URL.

Title: One View, Many Worlds: Single-Image to 3D Object Meets Generative Domain Randomization for One-Shot 6D Pose Estimation

Authors: Zheng Geng, Nan Wang, Shaocong Xu, Chongjie Ye, Bohan Li, Zhaoxi Chen, Sida Peng, Hao Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.07978
Pdf URL: https://arxiv.org/pdf/2509.07978
Copy Paste: [[2509.07978]] One View, Many Worlds: Single-Image to 3D Object Meets Generative Domain Randomization for One-Shot 6D Pose Estimation(https://arxiv.org/abs/2509.07978)
Keywords: generation, generative
Abstract: Estimating the 6D pose of arbitrary unseen objects from a single reference image is critical for robotics operating in the long-tail of real-world instances. However, this setting is notoriously challenging: 3D models are rarely available, single-view reconstructions lack metric scale, and domain gaps between generated models and real-world images undermine robustness. We propose OnePoseViaGen, a pipeline that tackles these challenges through two key components. First, a coarse-to-fine alignment module jointly refines scale and pose by combining multi-view feature matching with render-and-compare refinement. Second, a text-guided generative domain randomization strategy diversifies textures, enabling effective fine-tuning of pose estimators with synthetic data. Together, these steps allow high-fidelity single-view 3D generation to support reliable one-shot 6D pose estimation. On challenging benchmarks (YCBInEOAT, Toyota-Light, LM-O), OnePoseViaGen achieves state-of-the-art performance far surpassing prior approaches. We further demonstrate robust dexterous grasping with a real robot hand, validating the practicality of our method in real-world manipulation. Project page: this https URL