2025-06-24

Title: Mechanistic Interpretability of Diffusion Models: Circuit-Level Analysis and Causal Validation

Authors: Dip Roy
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.17237
Pdf URL: https://arxiv.org/pdf/2506.17237
Copy Paste: [[2506.17237]] Mechanistic Interpretability of Diffusion Models: Circuit-Level Analysis and Causal Validation(https://arxiv.org/abs/2506.17237)
Keywords: generation, generative
Abstract: We present a quantitative circuit-level analysis of diffusion models, establishing computational pathways and mechanistic principles underlying image generation processes. Through systematic intervention experiments across 2,000 synthetic and 2,000 CelebA facial images, we discover fundamental algorithmic differences in how diffusion architectures process synthetic versus naturalistic data distributions. Our investigation reveals that real-world face processing requires circuits with measurably higher computational complexity (complexity ratio = 1.084 ± 0.008, p < 0.001), exhibiting distinct attention specialization patterns with entropy divergence ranging from 0.015 to 0.166 across denoising timesteps. We identify eight functionally distinct attention mechanisms showing specialized computational roles, including edge detection (entropy = 3.18 ± 0.12), texture analysis (entropy = 4.16 ± 0.08), and semantic understanding (entropy = 2.67 ± 0.15). Intervention analysis demonstrates critical computational bottlenecks where targeted ablations produce 25.6% to 128.3% performance degradation, providing causal evidence for identified circuit functions. These findings establish quantitative foundations for algorithmic understanding and control of generative model behavior through mechanistic intervention strategies.
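The abstract characterizes attention mechanisms by their entropy but does not say how it is computed. As a minimal sketch, assuming the usual Shannon entropy (natural log) over one normalized attention row, the quantity could be computed as follows. The function name and the toy attention rows are illustrative, not taken from the paper.

```python
import math

def attention_entropy(weights):
    """Shannon entropy (in nats) of one attention row whose weights sum to 1."""
    return -sum(w * math.log(w) for w in weights if w > 0.0)

# Two hypothetical attention rows over 64 key positions.
uniform = [1.0 / 64] * 64             # attends equally everywhere
peaked = [0.97] + [0.03 / 63] * 63    # attends almost entirely to one position

h_uniform = attention_entropy(uniform)
h_peaked = attention_entropy(peaked)
print(f"uniform: {h_uniform:.3f} nats, peaked: {h_peaked:.3f} nats")
```

Under this definition a diffuse head has high entropy and a sharply focused head has low entropy, which is one way to read the differing values reported per mechanism.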

Title: Recursive Learning-Based Virtual Buffering for Analytical Global Placement

Authors: Andrew B. Kahng, Yiting Liu, Zhiang Wang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.17247
Pdf URL: https://arxiv.org/pdf/2506.17247
Copy Paste: [[2506.17247]] Recursive Learning-Based Virtual Buffering for Analytical Global Placement(https://arxiv.org/abs/2506.17247)
Keywords: generative
Abstract: Due to the skewed scaling of interconnect versus cell delay in modern technology nodes, placement with buffer porosity (i.e., cell density) awareness is essential for timing closure in physical synthesis flows. However, existing approaches face two key challenges: (i) traditional van Ginneken-Lillis-style buffering approaches are computationally expensive during global placement; and (ii) machine learning-based approaches, such as BufFormer, lack a thorough consideration of Electrical Rule Check (ERC) violations and fail to "close the loop" back into the physical design flow. In this work, we propose MLBuf-RePlAce, the first open-source learning-driven virtual buffering-aware analytical global placement framework, built on top of the OpenROAD infrastructure. MLBuf-RePlAce adopts an efficient recursive learning-based generative buffering approach to predict buffer types and locations, addressing ERC violations during global placement. We compare MLBuf-RePlAce against the default virtual buffering-based timing-driven global placer in OpenROAD, using open-source testcases from the TILOS MacroPlacement and OpenROAD-flow-scripts repositories. Without degradation of post-route power, MLBuf-RePlAce achieves (maximum, average) improvements of (56%, 31%) in total negative slack (TNS) within the open-source OpenROAD flow. When evaluated by completion in a commercial flow, MLBuf-RePlAce achieves (maximum, average) improvements of (53%, 28%) in TNS with an average of 0.2% improvement in post-route power.
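The results above are reported as improvements in total negative slack (TNS). A minimal sketch of that metric, assuming the conventional definition (the sum of all negative endpoint slacks) and a relative-reduction formula for the improvement percentage; the slack values are hypothetical and the helper names are not from OpenROAD:

```python
def total_negative_slack(slacks):
    """Sum of negative endpoint slacks; endpoints that meet timing contribute 0."""
    return sum(s for s in slacks if s < 0)

def improvement_pct(baseline_tns, new_tns):
    """Relative reduction in TNS magnitude, in percent."""
    return 100.0 * (abs(baseline_tns) - abs(new_tns)) / abs(baseline_tns)

# Hypothetical endpoint slacks in nanoseconds, before and after buffering.
baseline = [-0.30, -0.10, 0.05, -0.20]
improved = [-0.15, 0.02, 0.05, -0.15]

tns_before = total_negative_slack(baseline)
tns_after = total_negative_slack(improved)
print(f"TNS {tns_before:.2f} -> {tns_after:.2f} ns, "
      f"improvement {improvement_pct(tns_before, tns_after):.1f}%")
```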

Title: Origins of Creativity in Attention-Based Diffusion Models

Authors: Emma Finn, T. Anderson Keller, Manos Theodosis, Demba E. Ba
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2506.17324
Pdf URL: https://arxiv.org/pdf/2506.17324
Copy Paste: [[2506.17324]] Origins of Creativity in Attention-Based Diffusion Models(https://arxiv.org/abs/2506.17324)
Keywords: generation
Abstract: As diffusion models have become the tool of choice for image generation and as the quality of the images continues to improve, the question of how 'creativity' originates in diffusion has become increasingly important. The score matching perspective on diffusion has proven particularly fruitful for understanding how and why diffusion models generate images that remain plausible while differing significantly from their training images. In particular, as explained in (Kamb & Ganguli, 2024) and others, e.g., (Ambrogioni, 2023), theory suggests that if our score matching were optimal, we would only be able to recover training samples through our diffusion process. However, as shown by Kamb & Ganguli (2024), in diffusion models where the score is parametrized by a simple CNN, the inductive biases of the CNN itself (translation equivariance and locality) allow the model to generate samples that globally do not match any training samples, but are rather patch-wise 'mosaics'. Notably, however, this theory does not extend to describe the role of self-attention in this process. In this work, we take a preliminary step in this direction to extend this theory to the case of diffusion models whose score is parametrized by a CNN with a final self-attention layer. We show that our theory suggests that self-attention will induce a globally image-consistent arrangement of local features beyond the patch-level in generated samples, and we verify this behavior empirically on a carefully crafted dataset.

Title: A Novel Multi-layer Task-centric and Data Quality Framework for Autonomous Driving

Authors: Yuhan Zhou, Haihua Chen, Kewei Sha
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.17346
Pdf URL: https://arxiv.org/pdf/2506.17346
Copy Paste: [[2506.17346]] A Novel Multi-layer Task-centric and Data Quality Framework for Autonomous Driving(https://arxiv.org/abs/2506.17346)
Keywords: generation
Abstract: The next-generation autonomous vehicles (AVs), embedded with frequent real-time decision-making, will rely heavily on a large volume of multisource and multimodal data. In real-world settings, the data quality (DQ) of different sources and modalities usually varies due to unexpected environmental factors or sensor issues. However, both researchers and practitioners in the AV field overwhelmingly concentrate on models/algorithms while undervaluing the DQ. To fulfill the needs of the next-generation AVs with guarantees of functionality, efficiency, and trustworthiness, this paper proposes a novel task-centric and data quality-based framework which consists of five layers: data layer, DQ layer, task layer, application layer, and goal layer. The proposed framework aims to map DQ to task requirements and performance goals. To illustrate, a case study investigating redundancy on the nuScenes dataset proves that partially removing redundancy on multisource image data could improve YOLOv8 object detection task performance. Analysis on multimodal data of image and LiDAR further presents existing redundancy DQ issues. This paper opens up a range of critical but unexplored challenges at the intersection of DQ, task orchestration, and performance-oriented system development in AVs. It is expected to guide the AV community toward building more adaptive, explainable, and resilient AVs that respond intelligently to dynamic environments and heterogeneous data streams. Code, data, and implementation details are publicly available at: this https URL.

Title: Efficient Feedback Gate Network for Hyperspectral Image Super-Resolution

Authors: Xufei Wang, Mingjian Zhang, Fei Ge, Jinchen Zhu, Wen Sha, Jifen Ren, Zhimeng Hou, Shouguo Zheng, Ling Zheng, Shizhuang Weng
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2506.17361
Pdf URL: https://arxiv.org/pdf/2506.17361
Copy Paste: [[2506.17361]] Efficient Feedback Gate Network for Hyperspectral Image Super-Resolution(https://arxiv.org/abs/2506.17361)
Keywords: super-resolution
Abstract: Even without auxiliary images, single hyperspectral image super-resolution (SHSR) methods can be designed to improve the spatial resolution of hyperspectral images. However, failing to thoroughly explore coherence along bands and spatial-spectral information limits the performance of SHSR. In this study, we propose a novel group-based SHSR method termed the efficient feedback gate network, which uses various feedback and gate operations involving large kernel convolutions and spectral interactions. In particular, by providing different guidance for neighboring groups, we can learn rich band information and hierarchical hyperspectral spatial information using channel shuffling and dilated convolution in the shuffled and progressive dilated fusion module (SPDFM). Moreover, we develop a wide-bound perception gate block and a spectrum enhancement gate block to construct the spatial-spectral reinforcement gate module (SSRGM) and obtain highly representative spatial-spectral features efficiently. Additionally, we apply a three-dimensional SSRGM to enhance holistic information and coherence for hyperspectral data. The experimental results on three hyperspectral datasets demonstrate the superior performance of the proposed network over the state-of-the-art methods in terms of spectral fidelity and spatial content reconstruction.

Title: SAFEx: Analyzing Vulnerabilities of MoE-Based LLMs via Stable Safety-critical Expert Identification

Authors: Zhenglin Lai, Mengyao Liao, Dong Xu, Zebin Zhao, Zhihang Yuan, Chao Fan, Jianqiang Li, Bingzhe Wu
Subjects: cs.LG, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2506.17368
Pdf URL: https://arxiv.org/pdf/2506.17368
Copy Paste: [[2506.17368]] SAFEx: Analyzing Vulnerabilities of MoE-Based LLMs via Stable Safety-critical Expert Identification(https://arxiv.org/abs/2506.17368)
Keywords: generation
Abstract: Large language models based on Mixture-of-Experts have achieved substantial gains in efficiency and scalability, yet their architectural uniqueness introduces underexplored safety alignment challenges. Existing safety alignment strategies, predominantly designed for dense models, are ill-suited to address MoE-specific vulnerabilities. In this work, we formalize and systematically study MoE models' positional vulnerability: the phenomenon where safety-aligned behaviors rely on specific expert modules, revealing critical risks inherent to MoE architectures. To this end, we present SAFEx, an analytical framework that robustly identifies, characterizes, and validates the safety-critical experts using a novel Stability-based Expert Selection (SES) algorithm. Notably, our approach enables the explicit decomposition of safety-critical experts into distinct functional groups, including those responsible for harmful content detection and those controlling safe response generation. Extensive experiments on mainstream MoE models, such as the recently released Qwen3-MoE, demonstrated that their intrinsic safety mechanisms heavily rely on a small subset of positional experts. Disabling these experts significantly compromised the models' ability to refuse harmful requests. For Qwen3-MoE with 6144 experts (in the FNN layer), we find that disabling as few as 12 identified safety-critical experts can cause the refusal rate to drop by 22%, demonstrating the disproportionate impact of a small set of experts on overall model safety.
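The abstract describes disabling selected experts but does not specify the mechanism. One common way to ablate experts in a top-k routed layer is to exclude them before expert selection, so the router falls back to the next-best experts. The sketch below illustrates that generic idea only; `route_top_k` and the toy logits are hypothetical and do not reflect Qwen3-MoE internals or the SAFEx implementation.

```python
import math

def route_top_k(router_logits, k, disabled=frozenset()):
    """Return {expert_index: gate_weight} for the top-k experts that are not disabled.

    Gate weights are a softmax over the selected experts' logits.
    """
    allowed = [(logit, i) for i, logit in enumerate(router_logits) if i not in disabled]
    top = sorted(allowed, reverse=True)[:k]
    max_logit = max(logit for logit, _ in top)   # subtract max for numerical stability
    exps = {i: math.exp(logit - max_logit) for logit, i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

logits = [2.0, 0.5, 1.5, -1.0]                     # router scores for 4 experts
normal = route_top_k(logits, k=2)                  # selects experts 0 and 2
ablated = route_top_k(logits, k=2, disabled={0})   # expert 0 masked; 2 and 1 selected
print(normal, ablated)
```

Comparing a model's refusal rate with and without such a mask is the kind of intervention the reported 22% drop would be measured by.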

Title: Aha Moment Revisited: Are VLMs Truly Capable of Self Verification in Inference-time Scaling?

Authors: Mingyuan Wu, Meitang Li, Jingcheng Yang, Jize Jiang, Kaizhuo Yan, Zhaoheng Li, Minjia Zhang, Klara Nahrstedt
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.17417
Pdf URL: https://arxiv.org/pdf/2506.17417
Copy Paste: [[2506.17417]] Aha Moment Revisited: Are VLMs Truly Capable of Self Verification in Inference-time Scaling?(https://arxiv.org/abs/2506.17417)
Keywords: generation
Abstract: Recent advances in large language models (LLMs) have demonstrated that inference-time computation techniques, such as decoding-time scaling and self-refinement, can significantly enhance reasoning capabilities without relying on external knowledge. A key driver of this success is the emergence of self-correction and self-verification behaviors, often elicited through reinforcement learning (RL). In this paper, we investigate whether these inference-time techniques extend effectively to vision-language models (VLMs), particularly those trained with RL. We find that while decoding strategies such as majority voting and best-of-N selection with self-verification all improve VLM reasoning performance, generation-reliant methods such as the former achieve significantly higher gains than verification-reliant methods such as the latter. Additionally, the self-correction behavior often associated with RL-tuned models, such as the "aha moment", does not lead to measurable gains. Through extensive experimentation within the inference-time scaling framework, we identify a key root cause: RL-trained VLMs still lack robust self-verification capabilities across both visual and textual modalities.
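The two decoding strategies contrasted above differ in what they rely on: majority voting depends only on the sampled generations, while best-of-N depends on a verifier score for each sample. A minimal sketch under that reading; the sampled answers and self-verification scores are made up for illustration.

```python
from collections import Counter

def majority_vote(answers):
    """Generation-reliant: pick the most frequent answer among N samples."""
    return Counter(answers).most_common(1)[0][0]

def best_of_n(answers, verifier_scores):
    """Verification-reliant: pick the answer with the highest verifier score."""
    best_index = max(range(len(answers)), key=lambda i: verifier_scores[i])
    return answers[best_index]

samples = ["12", "15", "12", "12", "9"]
scores = [0.40, 0.90, 0.35, 0.50, 0.10]   # hypothetical self-verification scores

voted = majority_vote(samples)
selected = best_of_n(samples, scores)
print(voted, selected)
```

When the verifier is unreliable, as the abstract argues for RL-trained VLMs, best-of-N can confidently select a minority answer that majority voting would have rejected.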

Title: VLA-OS: Structuring and Dissecting Planning Representations and Paradigms in Vision-Language-Action Models

Authors: Chongkai Gao, Zixuan Liu, Zhenghao Chi, Junshan Huang, Xin Fei, Yiwen Hou, Yuxuan Zhang, Yudi Lin, Zhirui Fang, Zeyu Jiang, Lin Shao
Subjects: cs.CV, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2506.17561
Pdf URL: https://arxiv.org/pdf/2506.17561
Copy Paste: [[2506.17561]] VLA-OS: Structuring and Dissecting Planning Representations and Paradigms in Vision-Language-Action Models(https://arxiv.org/abs/2506.17561)
Keywords: generation
Abstract: Recent studies on Vision-Language-Action (VLA) models have shifted from the end-to-end action-generation paradigm toward a pipeline involving task planning followed by action generation, demonstrating improved performance on various complex, long-horizon manipulation tasks. However, existing approaches vary significantly in terms of network architectures, planning paradigms, representations, and training data sources, making it challenging for researchers to identify the precise sources of performance gains and the components that need further improvement. To systematically investigate the impacts of different planning paradigms and representations in isolation from network architectures and training data, we introduce VLA-OS, a unified VLA architecture series capable of various task planning paradigms, and design a comprehensive suite of controlled experiments across diverse object categories (rigid and deformable), visual modalities (2D and 3D), environments (simulation and real-world), and end-effectors (grippers and dexterous hands). Our results demonstrate that: 1) visually grounded planning representations are generally better than language planning representations; 2) the Hierarchical-VLA paradigm generally achieves performance superior or comparable to that of other paradigms in task performance, pretraining, generalization ability, scalability, and continual learning ability, albeit at the cost of slower training and inference speeds.

Title: LLM-driven Medical Report Generation via Communication-efficient Heterogeneous Federated Learning

Authors: Haoxuan Che, Haibo Jin, Zhengrui Guo, Yi Lin, Cheng Jin, Hao Chen
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2506.17562
Pdf URL: https://arxiv.org/pdf/2506.17562
Copy Paste: [[2506.17562]] LLM-driven Medical Report Generation via Communication-efficient Heterogeneous Federated Learning(https://arxiv.org/abs/2506.17562)
Keywords: generation
Abstract: LLMs have demonstrated significant potential in Medical Report Generation (MRG), yet their development requires large amounts of medical image-report pairs, which are commonly scattered across multiple centers. Centralizing these data is exceptionally challenging due to privacy regulations, thereby impeding model development and broader adoption of LLM-driven MRG models. To address this challenge, we present FedMRG, the first framework that leverages Federated Learning (FL) to enable privacy-preserving, multi-center development of LLM-driven MRG models, specifically designed to overcome the critical challenge of communication-efficient LLM training under multi-modal data heterogeneity. To start with, our framework tackles the fundamental challenge of communication overhead in FL-LLM tuning by employing low-rank factorization to efficiently decompose parameter updates, significantly reducing gradient transmission costs and making LLM-driven MRG feasible in bandwidth-constrained FL settings. Furthermore, we observed the dual heterogeneity in MRG under the FL scenario: varying image characteristics across medical centers, as well as diverse reporting styles and terminology preferences. To address this, we further enhance FedMRG with (1) client-aware contrastive learning in the MRG encoder, coupled with diagnosis-driven prompts, which capture both globally generalizable and locally distinctive features while maintaining diagnostic accuracy; and (2) a dual-adapter mutual boosting mechanism in the MRG decoder that harmonizes generic and specialized adapters to address variations in reporting styles and terminology. Through extensive evaluation of our established FL-MRG benchmark, we demonstrate the generalizability and adaptability of FedMRG, underscoring its potential in harnessing multi-center data and generating clinically accurate reports while maintaining communication efficiency.
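The communication saving from low-rank factorization follows from simple counting: a dense update to an m x n weight matrix has m*n values, while a rank-r factorization into an m x r and an r x n matrix has r*(m + n). The sketch below works through that arithmetic; the layer shape and rank are hypothetical, and the abstract does not state which values FedMRG uses.

```python
def full_update_size(rows, cols):
    """Number of values sent for a dense update to a rows x cols weight matrix."""
    return rows * cols

def low_rank_update_size(rows, cols, rank):
    """Number of values sent when the update is factored as (rows x rank) @ (rank x cols)."""
    return rank * (rows + cols)

rows, cols, rank = 4096, 4096, 8   # hypothetical layer shape and rank

dense = full_update_size(rows, cols)
factored = low_rank_update_size(rows, cols, rank)
print(f"dense: {dense:,} values, rank-{rank}: {factored:,} values, "
      f"{dense / factored:.0f}x fewer per round")
```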

Title: LFR-PINO: A Layered Fourier Reduced Physics-Informed Neural Operator for Parametric PDEs

Authors: Jing Wang, Biao Chen, Hairun Xie, Rui Wang, Yifan Xia, Jifa Zhang, Hui Xu
Subjects: cs.LG, physics.comp-ph
Abstract URL: https://arxiv.org/abs/2506.17582
Pdf URL: https://arxiv.org/pdf/2506.17582
Copy Paste: [[2506.17582]] LFR-PINO: A Layered Fourier Reduced Physics-Informed Neural Operator for Parametric PDEs(https://arxiv.org/abs/2506.17582)
Keywords: generation
Abstract: Physics-informed neural operators have emerged as a powerful paradigm for solving parametric partial differential equations (PDEs), particularly in the aerospace field, enabling the learning of solution operators that generalize across parameter spaces. However, existing methods either suffer from limited expressiveness due to fixed basis/coefficient designs, or face computational challenges due to the high dimensionality of the parameter-to-weight mapping space. We present LFR-PINO, a novel physics-informed neural operator that introduces two key innovations: (1) a layered hypernetwork architecture that enables specialized parameter generation for each network layer, and (2) a frequency-domain reduction strategy that significantly reduces parameter count while preserving essential spectral features. This design enables efficient learning of a universal PDE solver through pre-training, capable of directly handling new equations while allowing optional fine-tuning for enhanced precision. The effectiveness of this approach is demonstrated through comprehensive experiments on four representative PDE problems, where LFR-PINO achieves 22.8%-68.7% error reduction compared to state-of-the-art baselines. Notably, the frequency-domain reduction strategy reduces memory usage by 28.6%-69.3% compared to Hyper-PINNs while maintaining solution accuracy, striking an optimal balance between computational efficiency and solution fidelity.

Title: OpenMAP-BrainAge: Generalizable and Interpretable Brain Age Predictor

Authors: Pengyu Kan, Craig Jones, Kenichi Oishi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.17597
Pdf URL: https://arxiv.org/pdf/2506.17597
Copy Paste: [[2506.17597]] OpenMAP-BrainAge: Generalizable and Interpretable Brain Age Predictor(https://arxiv.org/abs/2506.17597)
Keywords: generative
Abstract: Purpose: To develop an age prediction model which is interpretable and robust to demographic and technological variances in brain MRI scans. Materials and Methods: We propose a transformer-based architecture that leverages self-supervised pre-training on large-scale datasets. Our model processes pseudo-3D T1-weighted MRI scans from three anatomical views and incorporates brain volumetric information. By introducing a stem architecture, we reduce the conventional quadratic complexity of transformer models to linear complexity, enabling scalability for high-dimensional MRI data. We trained our model on the ADNI2 & 3 (N=1348) and OASIS3 (N=716) datasets (age range: 42-95) from North America, with an 8:1:1 split for train, validation and test. Then, we validated it on the AIBL dataset (N=768, age range: 60-92) from Australia. Results: We achieved an MAE of 3.65 years on the ADNI2 & 3 and OASIS3 test set and high generalizability, with an MAE of 3.54 years on AIBL. There was a notable increase in brain age gap (BAG) across cognitive groups, with a mean of 0.15 years (95% CI: [-0.22, 0.51]) in CN, 2.55 years ([2.40, 2.70]) in MCI, and 6.12 years ([5.82, 6.43]) in AD. Additionally, a significant negative correlation between BAG and cognitive scores was observed, with correlation coefficients of -0.185 (p < 0.001) for MoCA and -0.231 (p < 0.001) for MMSE. Gradient-based feature attribution highlighted ventricles and white matter structures as key regions influenced by brain aging. Conclusion: Our model effectively fused information from different views and volumetric information to achieve state-of-the-art brain age prediction accuracy, improved generalizability and interpretability with association to neurodegenerative disorders.
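The two headline quantities in this abstract, MAE and brain age gap (BAG), are simple to compute. The sketch below assumes the common convention that BAG is predicted age minus chronological age, which the abstract does not spell out; the ages are invented for illustration.

```python
def mean_absolute_error(predicted, actual):
    """Average absolute difference between predicted and chronological age."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

def brain_age_gaps(predicted, actual):
    """Per-subject gap; positive means the brain looks older than the subject is."""
    return [p - a for p, a in zip(predicted, actual)]

chronological = [70.0, 65.0, 80.0]   # hypothetical subjects, in years
predicted = [72.0, 64.0, 86.0]

mae = mean_absolute_error(predicted, chronological)
gaps = brain_age_gaps(predicted, chronological)
mean_gap = sum(gaps) / len(gaps)
print(f"MAE = {mae:.2f} years, mean BAG = {mean_gap:.2f} years")
```

MAE discards the sign of the error, whereas the mean BAG keeps it, which is why a group can show a large positive mean BAG (as reported for AD) independently of the overall MAE.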

Title: HIRE: Lightweight High-Resolution Image Feature Enrichment for Multimodal LLMs

Authors: Nikitha SR, Aradhya Neeraj Mathur, Tarun Ram Menta, Rishabh Jain, Mausoom Sarkar
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.17608
Pdf URL: https://arxiv.org/pdf/2506.17608
Copy Paste: [[2506.17608]] HIRE: Lightweight High-Resolution Image Feature Enrichment for Multimodal LLMs(https://arxiv.org/abs/2506.17608)
Keywords: generation
Abstract: The integration of high-resolution image features in modern multimodal large language models has demonstrated significant improvements in fine-grained visual understanding tasks, achieving high performance across multiple benchmarks. Since these features are obtained from large image encoders like ViT, they come with a significant increase in computational costs due to multiple calls to these encoders. In this work, we first develop an intuition for feature upsampling as a natural extension of high-resolution feature generation. Through extensive experiments and ablations, we demonstrate how a shallow feature enricher can achieve competitive results with tremendous reductions in training and inference time as well as computational cost, with up to 1.5x savings in FLOPs.

Title: Optimization-Free Patch Attack on Stereo Depth Estimation

Authors: Hangcheng Liu, Xu Kuang, Xingshuo Han, Xingwan Wu, Haoran Ou, Shangwei Guo, Xingyi Huang, Tao Xiang, Tianwei Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.17632
Pdf URL: https://arxiv.org/pdf/2506.17632
Copy Paste: [[2506.17632]] Optimization-Free Patch Attack on Stereo Depth Estimation(https://arxiv.org/abs/2506.17632)
Keywords: generation
Abstract: Stereo Depth Estimation (SDE) is essential for scene understanding in vision-based systems like autonomous driving. However, recent studies show that SDE models are vulnerable to adversarial attacks, which are often limited to unrealistic settings, e.g., digital perturbations on separate stereo views in static scenes, restricting their real-world applicability. This raises a critical question: how can we design physically realizable, scene-adaptive, and transferable attacks against SDE under realistic constraints? To answer this, we make two key contributions. First, we propose a unified attack framework that extends optimization-based techniques to four core stages of stereo matching: feature extraction, cost-volume construction, cost aggregation, and disparity regression. A comprehensive stage-wise evaluation across 9 mainstream SDE models, under constraints like photometric consistency, reveals that optimization-based patches suffer from poor transferability. Interestingly, partially transferable patches suggest that patterns, rather than pixel-level perturbations, may be key to generalizable attacks. Motivated by this, we present PatchHunter, the first optimization-free adversarial patch attack against SDE. PatchHunter formulates patch generation as a reinforcement learning-driven search over a structured space of visual patterns crafted to disrupt SDE assumptions. We validate PatchHunter across three levels: the KITTI dataset, the CARLA simulator, and real-world vehicle deployment. PatchHunter not only surpasses optimization-based methods in effectiveness but also achieves significantly better black-box transferability. Even under challenging physical conditions like low light, PatchHunter maintains high attack success (e.g., D1-all > 0.4), whereas optimization-based methods fail.
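Attack success above is reported with the D1-all metric. As a sketch, assuming the KITTI 2015 stereo convention (a pixel is an outlier when its disparity error exceeds both 3 px and 5% of the ground-truth disparity), the metric could be computed as follows. The disparity values are hypothetical, and the paper may apply the metric over a different pixel region.

```python
def d1_all(predicted, ground_truth, abs_thresh=3.0, rel_thresh=0.05):
    """Fraction of pixels whose disparity error exceeds both thresholds.

    Assumes every ground-truth disparity is valid and strictly positive.
    """
    outliers = 0
    for p, g in zip(predicted, ground_truth):
        err = abs(p - g)
        if err > abs_thresh and err / abs(g) > rel_thresh:
            outliers += 1
    return outliers / len(ground_truth)

# Hypothetical per-pixel disparities (in pixels).
gt = [50.0, 40.0, 100.0, 20.0, 10.0]
pred = [50.5, 45.0, 104.0, 20.0, 2.0]

score = d1_all(pred, gt)
print(f"D1-all = {score:.2f}")
```

A higher D1-all means more of the disparity map is wrong, so for an attacker a larger value indicates a more effective patch.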

Title: Histopathology Image Report Generation by Vision Language Model with Multimodal In-Context Learning

Authors: Shih-Wen Liu, Hsuan-Yu Fan, Wei-Ta Chu, Fu-En Yang, Yu-Chiang Frank Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.17645
Pdf URL: https://arxiv.org/pdf/2506.17645
Copy Paste: [[2506.17645]] Histopathology Image Report Generation by Vision Language Model with Multimodal In-Context Learning(https://arxiv.org/abs/2506.17645)
Keywords: generation
Abstract: Automating medical report generation from histopathology images is a critical challenge requiring effective visual representations and domain-specific knowledge. Inspired by the common practices of human experts, we propose an in-context learning framework called PathGenIC that integrates context derived from the training set with a multimodal in-context learning (ICL) mechanism. Our method dynamically retrieves semantically similar whole slide image (WSI)-report pairs and incorporates adaptive feedback to enhance contextual relevance and generation quality. Evaluated on the HistGen benchmark, the framework achieves state-of-the-art results, with significant improvements across BLEU, METEOR, and ROUGE-L metrics, and demonstrates robustness across diverse report lengths and disease categories. By maximizing training data utility and bridging vision and language with ICL, our work offers a solution for AI-driven histopathology reporting, setting a strong foundation for future advancements in multimodal clinical applications.

Title: DreamJourney: Perpetual View Generation with Video Diffusion Models

Authors: Bo Pan, Yang Chen, Yingwei Pan, Ting Yao, Wei Chen, Tao Mei
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.17705
Pdf URL: https://arxiv.org/pdf/2506.17705
Copy Paste: [[2506.17705]] DreamJourney: Perpetual View Generation with Video Diffusion Models(https://arxiv.org/abs/2506.17705)
Keywords: generation, generative
Abstract: Perpetual view generation aims to synthesize a long-term video corresponding to an arbitrary camera trajectory solely from a single input image. Recent methods commonly utilize a pre-trained text-to-image diffusion model to synthesize new content of previously unseen regions along camera movement. However, the underlying 2D diffusion model lacks 3D awareness and results in distorted artifacts. Moreover, they are limited to generating views of static 3D scenes, neglecting to capture object movements within the dynamic 4D world. To alleviate these issues, we present DreamJourney, a two-stage framework that leverages the world simulation capacity of video diffusion models to trigger a new perpetual scene view generation task with both camera movements and object dynamics. Specifically, in stage I, DreamJourney first lifts the input image to a 3D point cloud and renders a sequence of partial images from a specific camera trajectory. A video diffusion model is then utilized as a generative prior to complete the missing regions and enhance visual coherence across the sequence, producing a cross-view consistent video that adheres to the 3D scene and camera trajectory. Meanwhile, we introduce two simple yet effective strategies (early stopping and view padding) to further stabilize the generation process and improve visual quality. Next, in stage II, DreamJourney leverages a multimodal large language model to produce a text prompt describing object movements in the current view, and uses a video diffusion model to animate the current view with object movements. Stages I and II are repeated recurrently, enabling perpetual dynamic scene view generation. Extensive experiments demonstrate the superiority of our DreamJourney over state-of-the-art methods both quantitatively and qualitatively. Our project page: this https URL.

Title: Programmable-Room: Interactive Textured 3D Room Meshes Generation Empowered by Large Language Models

Authors: Jihyun Kim, Junho Park, Kyeongbo Kong, Suk-Ju Kang
Subjects: cs.CV, cs.AI, cs.MM
Abstract URL: https://arxiv.org/abs/2506.17707
Pdf URL: https://arxiv.org/pdf/2506.17707
Copy Paste: [[2506.17707]] Programmable-Room: Interactive Textured 3D Room Meshes Generation Empowered by Large Language Models(https://arxiv.org/abs/2506.17707)
Keywords: generation
Abstract: We present Programmable-Room, a framework which interactively generates and edits a 3D room mesh, given natural language instructions. For precise control of each attribute of a room, we decompose the challenging task into simpler steps such as creating plausible 3D coordinates for room meshes, generating panorama images for the texture, constructing 3D meshes by integrating the coordinates and panorama texture images, and arranging furniture. To support the various decomposed tasks with a unified framework, we incorporate visual programming (VP). VP is a method that utilizes a large language model (LLM) to write a Python-like program, which is an ordered list of the modules necessary for the various tasks given in natural language. We develop most of the modules. In particular, for the texture-generating module, we utilize a pretrained large-scale diffusion model to generate panorama images conditioned on text and visual prompts (i.e., layout, depth, and semantic map) simultaneously. Specifically, we enhance the panorama image generation quality by optimizing the training objective with a 1D representation of a panorama scene obtained from a bidirectional LSTM. We demonstrate Programmable-Room's flexibility in generating and editing 3D room meshes, and show our framework's superiority over an existing model quantitatively and qualitatively. The project page is available at this https URL.

Title: PhysID: Physics-based Interactive Dynamics from a Single-view Image

Authors: Sourabh Vasant Gothe, Ayon Chattopadhyay, Gunturi Venkata Sai Phani Kiran, Pratik, Vibhav Agarwal, Jayesh Rajkumar Vachhani, Sourav Ghosh, Parameswaranath VM, Barath Raj KR
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.17746
Pdf URL: https://arxiv.org/pdf/2506.17746
Copy Paste: [[2506.17746]] PhysID: Physics-based Interactive Dynamics from a Single-view Image(https://arxiv.org/abs/2506.17746)
Keywords: generation, generative
Abstract: Transforming static images into interactive experiences remains a challenging task in computer vision. Tackling this challenge holds the potential to elevate mobile user experiences, notably through interactive and AR/VR applications. Current approaches aim to achieve this either using pre-recorded video responses or requiring multi-view images as input. In this paper, we present PhysID, which streamlines the creation of physics-based interactive dynamics from a single-view image by leveraging large generative models for 3D mesh generation and physical property prediction. This significantly reduces the expertise required for engineering-intensive tasks like 3D modeling and intrinsic property calibration, enabling the process to be scaled with minimal manual intervention. We integrate an on-device physics-based engine for physically plausible real-time rendering with user interactions. PhysID represents a leap forward in mobile-based interactive dynamics, offering real-time, non-deterministic interactions and user-personalization with efficient on-device memory consumption. Experiments evaluate the zero-shot capabilities of various Multimodal Large Language Models (MLLMs) on diverse tasks and the performance of 3D reconstruction models. These results demonstrate the cohesive functioning of all modules within the end-to-end framework, contributing to its effectiveness.

Title: PhysiX: A Foundation Model for Physics Simulations

Authors: Tung Nguyen, Arsh Koneru, Shufan Li, Aditya Grover
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.17774
Pdf URL: https://arxiv.org/pdf/2506.17774
Copy Paste: [[2506.17774]] PhysiX: A Foundation Model for Physics Simulations(https://arxiv.org/abs/2506.17774)
Keywords: generative
Abstract: Foundation models have achieved remarkable success across video, image, and language domains. By scaling up the number of parameters and training datasets, these models acquire generalizable world knowledge and often surpass task-specific approaches. However, such progress has yet to extend to the domain of physics simulation. A primary bottleneck is data scarcity: while millions of images, videos, and textual resources are readily available on the internet, the largest physics simulation datasets contain only tens of thousands of samples. This data limitation hinders the use of large models, as overfitting becomes a major concern. As a result, physics applications typically rely on small models, which struggle with long-range prediction due to limited context understanding. Additionally, unlike images, videos, or text, which typically exhibit fixed granularity, physics datasets often vary drastically in scale, amplifying the challenges of scaling up multitask training. We introduce PhysiX, the first large-scale foundation model for physics simulation. PhysiX is a 4.5B parameter autoregressive generative model. It uses a discrete tokenizer to encode physical processes at different scales into a sequence of discrete tokens, and employs an autoregressive next-token prediction objective to model such processes in the token space. To mitigate the rounding error in the discretization process, PhysiX incorporates a specialized refinement module. Through extensive experiments, we show that PhysiX effectively addresses the data bottleneck, outperforming task-specific baselines under comparable settings as well as the previous absolute state-of-the-art approaches on The Well benchmark. Our results indicate that knowledge learned from natural videos can be successfully transferred to physics simulation, and that joint training across diverse simulation tasks enables synergistic learning.

Title: Toward Autonomous UI Exploration: The UIExplorer Benchmark

Authors: Andrei Cristian Nica, Akshaya Vishnu Kudlu Shanbhogue, Harshil Shah, Aleix Cambray, Tudor Berariu, Lucas Maystre, David Barber
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.17779
Pdf URL: https://arxiv.org/pdf/2506.17779
Copy Paste: [[2506.17779]] Toward Autonomous UI Exploration: The UIExplorer Benchmark(https://arxiv.org/abs/2506.17779)
Keywords: generation
Abstract: Autonomous agents must know how to explore user interfaces (UIs) for reliable task solving, yet systematic evaluation of this crucial phase is lacking. We introduce UIExplore-Bench, the first benchmark explicitly dedicated to UI exploration. The benchmark evaluates agents with either Structured mode (granting access to layout information like DOM trees) or Screen mode (relying on GUI-only observations such as screenshots and human-like mouse/keyboard interactions) across three levels in a standardized GitLab sandbox environment. We formalize exploration as the process of maximizing the set of actionable UI components discovered and propose a metric, human-normalized UI-Functionalities Observed (hUFO), to quantify the effectiveness of exploration. Our results show that UIExplore-AlGo achieves the leading mean hUFO scores, reaching up to 77.2% of human performance in Structured mode and 59.0% in Screen mode at 2,000 steps, particularly excelling at the Sparse level. The results highlight the relevance of our benchmark, as current agents show a substantial performance gap compared to one hour of human expert exploration, indicating ample room for future advancements. We publicly release the benchmark environment, an exploration dataset, and an evaluation suite to catalyze research into efficient UI exploration strategies and their downstream applications, such as experience-driven task completion and automated training data generation.
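The abstract does not spell out how hUFO is computed. One plausible reading, offered only as a sketch, is the number of distinct UI functionalities an agent observes divided by the number a human reference explorer observes in the same environment. The function name `hufo`, the string identifiers, and the choice not to intersect the two sets are all assumptions, not the benchmark's definition.

```python
def hufo(agent_observed, human_observed):
    """Human-normalized count of distinct UI functionalities observed.

    Assumption: the score is |distinct agent discoveries| divided by
    |distinct human discoveries|; the benchmark's exact formula may differ.
    """
    human = set(human_observed)
    if not human:
        raise ValueError("human reference must contain at least one functionality")
    return len(set(agent_observed)) / len(human)


# Hypothetical functionality identifiers from one exploration session.
agent_log = ["open_issue", "create_branch", "open_issue", "edit_wiki"]
human_log = ["open_issue", "create_branch", "edit_wiki", "merge_request"]
score = hufo(agent_log, human_log)
print(f"{score:.1%}")
```

Duplicates in the agent log are counted once, so revisiting the same component does not raise the score.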

Title: Beyond instruction-conditioning, MoTE: Mixture of Task Experts for Multi-task Embedding Models

Authors: Miguel Romero, Shuoyang Ding, Corey D. Barret, Georgiana Dinu, George Karypis
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2506.17781
Pdf URL: https://arxiv.org/pdf/2506.17781
Copy Paste: [[2506.17781]] Beyond instruction-conditioning, MoTE: Mixture of Task Experts for Multi-task Embedding Models(https://arxiv.org/abs/2506.17781)
Keywords: generation
Abstract: Dense embeddings are fundamental to modern machine learning systems, powering Retrieval-Augmented Generation (RAG), information retrieval, and representation learning. While instruction-conditioning has become the dominant approach for embedding specialization, its direct application to low-capacity models imposes fundamental representational constraints that limit the performance gains derived from specialization. In this paper, we analyze these limitations and introduce the Mixture of Task Experts (MoTE) transformer block, which leverages task-specialized parameters trained with Task-Aware Contrastive Learning (TACL) to enhance the model's ability to generate specialized embeddings. Empirical results show that MoTE achieves 64% higher performance gains in retrieval datasets (+3.27 → +5.21) and 43% higher performance gains across all datasets (+1.81 → +2.60). Critically, these gains are achieved without altering instructions, training data, inference time, or number of active parameters.

Title: Reimagining Parameter Space Exploration with Diffusion Models

Authors: Lijun Zhang, Xiao Liu, Hui Guan
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.17807
Pdf URL: https://arxiv.org/pdf/2506.17807
Copy Paste: [[2506.17807]] Reimagining Parameter Space Exploration with Diffusion Models(https://arxiv.org/abs/2506.17807)
Keywords: generative
Abstract: Adapting neural networks to new tasks typically requires task-specific fine-tuning, which is time-consuming and reliant on labeled data. We explore a generative alternative that produces task-specific parameters directly from task identity, eliminating the need for task-specific training. To this end, we propose using diffusion models to learn the underlying structure of effective task-specific parameter space and synthesize parameters on demand. Once trained, the task-conditioned diffusion model can generate specialized weights directly from task identifiers. We evaluate this approach across three scenarios: generating parameters for a single seen task, for multiple seen tasks, and for entirely unseen tasks. Experiments show that diffusion models can generate accurate task-specific parameters and support multi-task interpolation when parameter subspaces are well-structured, but fail to generalize to unseen tasks, highlighting both the potential and limitations of this generative solution.

Title: Aligning Frozen LLMs by Reinforcement Learning: An Iterative Reweight-then-Optimize Approach

Authors: Xinnan Zhang, Chenliang Li, Siliang Zeng, Jiaxiang Li, Zhongruo Wang, Kaixiang Lin, Songtao Lu, Alfredo Garcia, Mingyi Hong
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2506.17828
Pdf URL: https://arxiv.org/pdf/2506.17828
Copy Paste: [[2506.17828]] Aligning Frozen LLMs by Reinforcement Learning: An Iterative Reweight-then-Optimize Approach(https://arxiv.org/abs/2506.17828)
Keywords: generation
Abstract: Aligning large language models (LLMs) with human preferences usually requires fine-tuning methods such as RLHF and DPO. These methods directly optimize the model parameters, so they cannot be used at test time to improve model performance, nor are they applicable when the model weights are not accessible. In contrast, test-time methods sidestep weight updates by leveraging reward functions to guide and improve output quality. However, they incur high inference costs, and their one-shot guidance is often based on imperfect reward or value functions, leading to suboptimal outputs. In this work, we present a method named Iterative Reweight-then-Optimize (IRO), a reinforcement learning (RL) framework that performs RL-style alignment of the (frozen) base model without touching its parameters. During training, each iteration (i) samples candidates from the base model, (ii) resamples using current value functions, and (iii) trains a new lightweight value function that guides the next decoding pass. At test time, the value functions are used to guide the base model generation via a search-based optimization process. Notably, users can apply IRO to align a model on their own dataset, similar to OpenAI's reinforcement fine-tuning (RFT), but without requiring access to the model weights.
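Step (ii) above, resampling candidates with a value function while the base model stays frozen, can be illustrated with a toy sketch. This is not the IRO algorithm itself: the softmax-over-values weighting, the temperature `beta`, the helper names, and the length-based `toy_value` function are all assumptions chosen for illustration.

```python
import math
import random


def reweight(candidates, value_fn, beta=1.0):
    """Turn value scores into sampling probabilities via a softmax (assumed form)."""
    scores = [value_fn(c) / beta for c in candidates]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def resample(candidates, probs, k, rng):
    """Draw k candidates with replacement according to the reweighted probabilities."""
    return rng.choices(candidates, weights=probs, k=k)


def toy_value(text):
    """Hypothetical value function: prefers shorter answers."""
    return -len(text)


# Stand-ins for samples drawn from a frozen base model.
candidates = ["ok", "a longer reply", "a much, much longer reply"]
probs = reweight(candidates, toy_value, beta=5.0)
picked = resample(candidates, probs, k=4, rng=random.Random(0))
print(probs, picked)
```

Only the sampling distribution over candidates changes here; nothing in the sketch touches model weights, which is the property the abstract emphasizes.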

Title: A Comparative Study of Open-Source Libraries for Synthetic Tabular Data Generation: SDV vs. SynthCity

Authors: Cristian Del Gobbo
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.17847
Pdf URL: https://arxiv.org/pdf/2506.17847
Copy Paste: [[2506.17847]] A Comparative Study of Open-Source Libraries for Synthetic Tabular Data Generation: SDV vs. SynthCity(https://arxiv.org/abs/2506.17847)
Keywords: generation
Abstract: High-quality training data is critical to the performance of machine learning models, particularly Large Language Models (LLMs). However, obtaining real, high-quality data can be challenging, especially for smaller organizations and early-stage startups. Synthetic data generators provide a promising solution by replicating the statistical and structural properties of real data while preserving privacy and scalability. This study evaluates the performance of six tabular synthetic data generators from two widely used open-source libraries: SDV (Gaussian Copula, CTGAN, TVAE) and SynthCity (Bayesian Network, CTGAN, TVAE). Using a real-world dataset from the UCI Machine Learning Repository, comprising energy consumption and environmental variables from Belgium, we simulate a low-data regime by training models on only 1,000 rows. Each generator is then tasked with producing synthetic datasets under two conditions: a 1:1 (1,000 rows) and a 1:10 (10,000 rows) input-output ratio. Evaluation is conducted using two criteria: statistical similarity, measured via classical statistics and distributional metrics; and predictive utility, assessed using a "Train on Synthetic, Test on Real" approach with four regression models. While statistical similarity remained consistent across models in both scenarios, predictive utility declined notably in the 1:10 case. The Bayesian Network from SynthCity achieved the highest fidelity in both scenarios, while TVAE from SDV performed best in predictive tasks under the 1:10 setting. Although no significant performance gap was found between the two libraries, SDV stands out for its superior documentation and ease of use, making it more accessible for practitioners.
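The "Train on Synthetic, Test on Real" protocol mentioned above fits a model on generator output only and scores it on held-out real rows. The sketch below shows the idea with a one-feature least-squares line in plain Python. The data values and helper names are hypothetical, and the study itself uses four regression models rather than this single toy model.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (single feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mse(model, xs, ys):
    """Mean squared error of the fitted line on the given rows."""
    slope, intercept = model
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical rows: the model never sees real data during training.
synthetic_x = [0.0, 1.0, 2.0, 3.0]
synthetic_y = [1.0, 3.0, 5.0, 7.0]
real_test_x = [4.0, 5.0]
real_test_y = [9.0, 11.5]

model = fit_line(synthetic_x, synthetic_y)            # train on synthetic
tstr_error = mse(model, real_test_x, real_test_y)     # test on real
print(model, tstr_error)
```

A lower error on the real test rows suggests the synthetic data preserved the relationships a downstream model needs, which is what the abstract calls predictive utility.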

Title: PlanMoGPT: Flow-Enhanced Progressive Planning for Text to Motion Synthesis

Authors: Chuhao Jin, Haosen Li, Bingzi Zhang, Che Liu, Xiting Wang, Ruihua Song, Wenbing Huang, Ying Qin, Fuzheng Zhang, Di Zhang
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2506.17912
Pdf URL: https://arxiv.org/pdf/2506.17912
Copy Paste: [[2506.17912]] PlanMoGPT: Flow-Enhanced Progressive Planning for Text to Motion Synthesis(https://arxiv.org/abs/2506.17912)
Keywords: generation
Abstract: Recent advances in large language models (LLMs) have enabled breakthroughs in many multimodal generation tasks, but a significant performance gap still exists in text-to-motion generation, where LLM-based methods lag far behind non-LLM methods. We identify the granularity of motion tokenization as a critical bottleneck: fine-grained tokenization induces local dependency issues, where LLMs overemphasize short-term coherence at the expense of global semantic alignment, while coarse-grained tokenization sacrifices motion details. To resolve this issue, we propose PlanMoGPT, an LLM-based framework integrating progressive planning and flow-enhanced fine-grained motion tokenization. First, our progressive planning mechanism leverages LLMs' autoregressive capabilities to hierarchically generate motion tokens by starting from sparse global plans and iteratively refining them into full sequences. Second, our flow-enhanced tokenizer doubles the downsampling resolution and expands the codebook size by eight times, minimizing detail loss during discretization, while a flow-enhanced decoder recovers motion nuances. Extensive experiments on text-to-motion benchmarks demonstrate that it achieves state-of-the-art performance, improving FID scores by 63.8% (from 0.380 to 0.141) on long-sequence generation while enhancing motion diversity by 49.9% compared to existing methods. The proposed framework successfully resolves the diversity-quality trade-off that plagues current non-LLM approaches, establishing new standards for text-to-motion generation.

Title: GEMeX-ThinkVG: Towards Thinking with Visual Grounding in Medical VQA via Reinforcement Learning

Authors: Bo Liu, Xiangyu Zhao, Along He, Yidi Chen, Huazhu Fu, Xiao-Ming Wu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.17939
Pdf URL: https://arxiv.org/pdf/2506.17939
Copy Paste: [[2506.17939]] GEMeX-ThinkVG: Towards Thinking with Visual Grounding in Medical VQA via Reinforcement Learning(https://arxiv.org/abs/2506.17939)
Keywords: generation
Abstract: Medical visual question answering aims to support clinical decision-making by enabling models to answer natural language questions based on medical images. While recent advances in multi-modal learning have significantly improved performance, current methods still suffer from limited answer reliability and poor interpretability, impairing the ability of clinicians and patients to understand and trust model-generated answers. To address this, this work first proposes a Thinking with Visual Grounding (ThinkVG) dataset wherein the answer generation is decomposed into intermediate reasoning steps that explicitly ground relevant visual regions of the medical image, thereby providing fine-grained explainability. Furthermore, we introduce a novel verifiable reward mechanism for reinforcement learning to guide post-training, improving the alignment between the model's reasoning process and its final answer. Remarkably, our method achieves comparable performance using only one-eighth of the training data, demonstrating the efficiency and effectiveness of the proposal. The dataset is available at this https URL.

Title: Adapting Vision-Language Models for Evaluating World Models

Authors: Mariya Hendriksen, Tabish Rashid, David Bignell, Raluca Georgescu, Abdelhak Lemkhenter, Katja Hofmann, Sam Devlin, Sarah Parisot
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2506.17967
Pdf URL: https://arxiv.org/pdf/2506.17967
Copy Paste: [[2506.17967]] Adapting Vision-Language Models for Evaluating World Models(https://arxiv.org/abs/2506.17967)
Keywords: generative
Abstract: World models -- generative models that simulate environment dynamics conditioned on past observations and actions -- are gaining prominence in planning, simulation, and embodied AI. However, evaluating their rollouts remains a fundamental challenge, requiring fine-grained, temporally grounded assessment of action alignment and semantic consistency -- capabilities not captured by existing metrics. Vision-Language Models (VLMs) have shown promise as automatic evaluators of generative content due to their strong multimodal reasoning abilities. Yet, their use in fine-grained, temporally sensitive evaluation tasks remains limited and requires targeted adaptation. We introduce an evaluation protocol targeting two recognition tasks -- action recognition and character recognition -- each assessed across binary, multiple-choice, and open-ended formats. To support this, we present UNIVERSE (UNIfied Vision-language Evaluator for Rollouts in Simulated Environments), a method for adapting VLMs to rollout evaluation under data and compute constraints. We conduct a large-scale study comparing full, partial, and parameter-efficient finetuning across task formats, context lengths, sampling strategies, and data compositions. The resulting unified evaluator matches the performance of task-specific baselines using a single checkpoint. Human studies confirm strong alignment with human judgments, establishing UNIVERSE as a scalable, semantics-aware evaluator for world models.

Title: BPCLIP: A Bottom-up Image Quality Assessment from Distortion to Semantics Based on CLIP

Authors: Chenyue Song, Chen Hui, Wei Zhang, Haiqi Zhu, Shaohui Liu, Hong Huang, Feng Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.17969
Pdf URL: https://arxiv.org/pdf/2506.17969
Copy Paste: [[2506.17969]] BPCLIP: A Bottom-up Image Quality Assessment from Distortion to Semantics Based on CLIP(https://arxiv.org/abs/2506.17969)
Keywords: quality assessment
Abstract: Image Quality Assessment (IQA) aims to evaluate the perceptual quality of images based on human subjective perception. Existing methods generally combine multiscale features to achieve high performance, but most rely on straightforward linear fusion of these features, which may not adequately capture the impact of distortions on semantic content. To address this, we propose a bottom-up image quality assessment approach based on the Contrastive Language-Image Pre-training (CLIP, a recently proposed model that aligns images and text in a shared feature space), named BPCLIP, which progressively extracts the impact of low-level distortions on high-level semantics. Specifically, we utilize an encoder to extract multiscale features from the input image and introduce a bottom-up multiscale cross attention module designed to capture the relationships between shallow and deep features. In addition, by incorporating 40 image quality adjectives across six distinct dimensions, we enable the pre-trained CLIP text encoder to generate representations of the intrinsic quality of the image, thereby strengthening the connection between image quality perception and human language. Our method achieves superior results on most public Full-Reference (FR) and No-Reference (NR) IQA benchmarks, while demonstrating greater robustness.

Title: Enabling PSO-Secure Synthetic Data Sharing Using Diversity-Aware Diffusion Models

Authors: Mischa Dombrowski, Bernhard Kainz
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.17975
Pdf URL: https://arxiv.org/pdf/2506.17975
Copy Paste: [[2506.17975]] Enabling PSO-Secure Synthetic Data Sharing Using Diversity-Aware Diffusion Models(https://arxiv.org/abs/2506.17975)
Keywords: generation
Abstract: Synthetic data has recently reached a level of visual fidelity that makes it nearly indistinguishable from real data, offering great promise for privacy-preserving data sharing in medical imaging. However, fully synthetic datasets still suffer from significant limitations: First and foremost, the legal aspect of sharing synthetic data is often neglected and data regulations, such as the GDPR, are largely ignored. Secondly, synthetic models fall short of matching the performance of real data, even for in-domain downstream applications. Recent methods for image generation have focused on maximising image diversity instead of fidelity solely to improve the mode coverage and therefore the downstream performance of synthetic data. In this work, we shift perspective and highlight how maximizing diversity can also be interpreted as protecting natural persons from being singled out, which leads to predicate singling-out (PSO) secure synthetic datasets. Specifically, we propose a generalisable framework for training diffusion models on personal data which leads to unpersonal synthetic datasets achieving performance within one percentage point of real-data models while significantly outperforming state-of-the-art methods that do not ensure privacy. Our code is available at this https URL.
摘要：综合数据最近达到了视觉保真度，这几乎使其与实际数据几乎没有区别，从而为隐私保护数据共享在医学成像中提供了巨大的希望。但是，完全合成数据集仍然存在重大局限性：首先，共享合成数据的法律方面通常被忽略，而数据法规（例如GDPR）被忽略了。其次，即使对于下游应用程序，合成模型也无法匹配真实数据的性能。图像产生的最新方法集中于最大化图像多样性，而不是仅仅是为了改善模式覆盖范围，因此是合成数据的下游性能。在这项工作中，我们改变了观点，并强调了如何最大化多样性也可以解释为保护自然人免于被挑出，这导致了谓词单打（PSO）安全的合成数据集。具体来说，我们为个人数据培训扩散模型提供了一个可普遍的框架，该框架导致无人合成数据集在实际数据模型的一个百分点内达到性能，同时显着超过了无法确保隐私性的先进方法。我们的代码可在此HTTPS URL上找到。

Title: Imputation of Longitudinal Data Using GANs: Challenges and Implications for Classification

Authors: Sharon Torao Pingi, Md Abul Bashar, Richi Nayak
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2506.18007
Pdf URL: https://arxiv.org/pdf/2506.18007
Copy Paste: [[2506.18007]] Imputation of Longitudinal Data Using GANs: Challenges and Implications for Classification(https://arxiv.org/abs/2506.18007)
Keywords: generative
Abstract: Longitudinal data is commonly utilised across various domains, such as health, biomedical, education and survey studies. This ubiquity has led to a rise in statistical, machine and deep learning-based methods for Longitudinal Data Classification (LDC). However, the intricate nature of the data, characterised by its multi-dimensionality, causes instance-level heterogeneity and temporal correlations that add to the complexity of longitudinal data analysis. Additionally, LDC accuracy is often hampered by the pervasiveness of missing values in longitudinal data. Despite ongoing research that draws on the generative power and utility of Generative Adversarial Networks (GANs) to address the missing data problem, critical considerations include statistical assumptions surrounding longitudinal data and missingness within it, as well as other data-level challenges like class imbalance and mixed data types that impact longitudinal data imputation (LDI) and the subsequent LDC process in GANs. This paper provides a comprehensive overview of how GANs have been applied in LDI, with a focus on whether GANs have adequately addressed fundamental assumptions about the data from an LDC perspective. We propose a categorisation of main approaches to GAN-based LDI, highlight strengths and limitations of methods, identify key research trends, and provide promising future directions. Our findings indicate that while GANs show great potential for LDI to improve usability and quality of longitudinal data for tasks like LDC, there is a need for more versatile approaches that can handle the wider spectrum of challenges presented by longitudinal data with missing values. By synthesising current knowledge and identifying critical research gaps, this survey aims to guide future research efforts in developing more effective GAN-based solutions to address LDC challenges.
摘要：纵向数据通常在各个领域（例如健康，生物医学，教育和调查研究）中使用。这种普遍性导致纵向数据分类（LDC）的统计，机器和深度学习方法的增加。然而，数据的复杂性质以其多维性的特征，导致实例级异质性和时间相关性增加，从而增加了纵向数据分析的复杂性。此外，在纵向数据中缺失值的普遍性通常会阻碍LDC的准确性。尽管正在进行的研究借鉴了生成的对抗网络（GAN）以解决缺失数据问题的生成能力和实用性，但批判考虑包括围绕其中的纵向数据及其丢失的统计假设，以及其他数据级别的挑战，例如类别的挑战和诸如类型的混合数据类型，这些挑战会影响纵向数据插入（LDI）（LDI）和后续ldc ldc。本文详细概述了LDI如何应用gans，重点是GAN是否从LDC的角度充分地解决了有关数据的基本假设。我们建议对基于GAN的LDI的主要方法进行分类，强调方法的优势和局限性，确定关键的研究趋势并提供有希望的未来方向。我们的发现表明，尽管GAN具有LDI的巨大潜力，可以提高LDC等任务的纵向数据的可用性和质量，但需要使用更多的用途方法来处理具有缺失值的纵向数据所带来的更广泛的挑战。通过综合当前知识并确定关键的研究差距，该调查旨在指导未来的研究工作，以开发更有效的基于GAN的解决方案以应对最不发达国家的挑战。

Title: ShareGPT-4o-Image: Aligning Multimodal Models with GPT-4o-Level Image Generation

Authors: Junying Chen, Zhenyang Cai, Pengcheng Chen, Shunian Chen, Ke Ji, Xidong Wang, Yunjin Yang, Benyou Wang
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.18095
Pdf URL: https://arxiv.org/pdf/2506.18095
Copy Paste: [[2506.18095]] ShareGPT-4o-Image: Aligning Multimodal Models with GPT-4o-Level Image Generation(https://arxiv.org/abs/2506.18095)
Keywords: generation, generative
Abstract: Recent advances in multimodal generative models have unlocked photorealistic, instruction-aligned image generation, yet leading systems like GPT-4o-Image remain proprietary and inaccessible. To democratize these capabilities, we present ShareGPT-4o-Image, the first dataset comprising 45K text-to-image and 46K text-and-image-to-image data, all synthesized using GPT-4o's image generation capabilities for distilling its advanced image generation abilities. Leveraging this dataset, we develop Janus-4o, a multimodal large language model capable of both text-to-image and text-and-image-to-image generation. Janus-4o not only significantly improves text-to-image generation over its predecessor, Janus-Pro, but also newly supports text-and-image-to-image generation. Notably, it achieves impressive performance in text-and-image-to-image generation from scratch, using only 91K synthetic samples and 6 hours of training on a machine with 8 A800 GPUs. We hope the release of ShareGPT-4o-Image and Janus-4o will foster open research in photorealistic, instruction-aligned image generation.
摘要：多模式生成模型的最新进展已解锁了与教学的图像产生，但诸如GPT-4O图像之类的领先系统仍然专有且无法访问。为了使这些功能民主化，我们介绍了共享节目4O图像，这是第一个包含45K文本图像图像和46K文本和图像图像数据的数据集，所有数据集都使用GPT-4O的图像生成功能合成，以蒸馏出其高级图像生成能力。利用此数据集，我们开发了Janus-4O，这是一种多模式的大型语言模型，能够具有文本形象和文本和图像形象生成。 Janus-4O不仅显着改善了其前身Janus-Pro的文本到图像的生成，而且还新地支持文本和图像形象。值得注意的是，它仅在8台A800-GPU机器上仅使用91K合成样本和6个小时的培训，在从头开始的文本和图像生成中实现了令人印象深刻的性能。我们希望释放ShareGpt-4O图像和Janus-4O的发行能够促进与逼真的，教学一致的图像生成的开放研究。

Title: RL for Reasoning by Adaptively Revealing Rationales

Authors: Mohammad Hossein Amani, Aryo Lotfi, Nicolas Mario Baldwin, Samy Bengio, Mehrdad Farajtabar, Emmanuel Abbe, Robert West
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.18110
Pdf URL: https://arxiv.org/pdf/2506.18110
Copy Paste: [[2506.18110]] RL for Reasoning by Adaptively Revealing Rationales(https://arxiv.org/abs/2506.18110)
Keywords: generation
Abstract: We propose that reinforcement learning (RL) from partial expert demonstrations is not merely a training heuristic, but a promising framework for solving complex sequence generation tasks. Supervised fine-tuning (SFT) relies on dense ground-truth labels, which become increasingly costly as sequence length grows. RL, on the other hand, struggles with sparse rewards and a combinatorially large output space. We address this by introducing adaptive backtracking (AdaBack), a per-sample curriculum learning algorithm that reveals only a partial prefix of the target output during training. The supervision length is adjusted dynamically for each sample based on the model's past reward signal, allowing it to incrementally learn to complete reasoning chains by conditioning on correct partial solutions. We investigate this intermediate regime between SFT and RL and argue that per-sample curriculum learning is more than a trade-off between efficiency and generality: it can succeed in tasks with long sequences of latent dependencies where SFT and RL both fail to generalize. Using a synthetic task with latent parity constraints, we show that our adaptive curriculum over partial answers reliably solves problems that are otherwise intractable. On mathematical reasoning benchmarks (MATH, GSM8k), we find that curriculum learning enables models to solve problems that RL alone cannot, acquiring new reasoning capabilities through incremental exposure to partial solutions.
摘要：我们建议，从部分专家演示中进行的加强学习（RL）不仅是训练启发式启发式，而且是解决复杂序列生成任务的有前途的框架。监督的微调（SFT）依赖于密集的地面真相标签，随着序列长度的增长，它们的成本越来越高。另一方面，RL在稀疏的奖励和组合较大的输出空间中挣扎。我们通过引入自适应回溯（ADABACK）来解决这一问题，这是一种按样本课程学习算法，该算法仅显示训练过程中目标输出的部分前缀。根据模型的过去奖励信号对每个样本进行动态调整监督长度，从而使其可以通过根据正确的部分解决方案进行调节来逐步学习完成推理链。我们研究了SFT和RL之间的这种中间制度，并认为每个样本课程学习不仅仅是效率和通用性之间的权衡，它可以成功地在SFT和RL都无法概括的潜在依赖性序列的任务中取得成功。使用具有潜在奇偶校验约束的综合任务，我们表明我们对部分答案的适应性课程可靠地解决了其他棘手的问题。在数学推理基准（Math，GSM8K）上，我们发现课程学习使模型能够解决单独使用RL无法获得的问题，从而通过逐步暴露于部分解决方案来获得新的推理能力。
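
Illustration (not from the paper): the abstract describes a per-sample curriculum in which only a prefix of the target is revealed and the revealed length is adjusted from the model's past reward. A minimal Python sketch of that idea follows. The function names, the fixed step size, and the reward threshold are hypothetical choices made for illustration; the abstract does not specify the actual update rule used by AdaBack.

```python
def reveal_prefix(target_tokens, ratio):
    """Return the part of the target that is shown to the model.

    ratio is the fraction of the target revealed, in [0, 1].
    """
    k = round(ratio * len(target_tokens))
    return target_tokens[:k]


def update_ratio(ratio, reward, step=0.25, threshold=0.5):
    """Hypothetical rule: reveal less after a success, more after a failure."""
    if reward >= threshold:
        ratio -= step
    else:
        ratio += step
    # Keep the ratio inside the valid range.
    return min(1.0, max(0.0, ratio))


# One sample's curriculum over four training visits (rewards are made up).
target = ["step1", "step2", "step3", "answer"]
ratio = 1.0          # start fully supervised
history = []
for reward in [1.0, 1.0, 0.0, 1.0]:
    history.append(reveal_prefix(target, ratio))
    ratio = update_ratio(ratio, reward)
```

Here each sample keeps its own `ratio`, so easy samples can move toward pure RL (nothing revealed) while hard samples stay closer to supervised prefixes.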

Title: Targeted False Positive Synthesis via Detector-guided Adversarial Diffusion Attacker for Robust Polyp Detection

Authors: Quan Zhou, Gan Luo, Qiang Hu, Qingyong Zhang, Jinhua Zhang, Yinjiao Tian, Qiang Li, Zhiwei Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.18134
Pdf URL: https://arxiv.org/pdf/2506.18134
Copy Paste: [[2506.18134]] Targeted False Positive Synthesis via Detector-guided Adversarial Diffusion Attacker for Robust Polyp Detection(https://arxiv.org/abs/2506.18134)
Keywords: generative
Abstract: Polyp detection is crucial for colorectal cancer screening, yet existing models are limited by the scale and diversity of available data. While generative models show promise for data augmentation, current methods mainly focus on enhancing polyp diversity, often overlooking the critical issue of false positives. In this paper, we address this gap by proposing an adversarial diffusion framework to synthesize high-value false positives. The extensive variability of negative backgrounds presents a significant challenge in false positive synthesis. To overcome this, we introduce two key innovations: First, we design a regional noise matching strategy to construct a negative synthesis space using polyp detection datasets. This strategy trains a negative-centric diffusion model by masking polyp regions, ensuring the model focuses exclusively on learning diverse background patterns. Second, we introduce the Detector-guided Adversarial Diffusion Attacker (DADA) module, which perturbs the negative synthesis process to disrupt a pre-trained detector's decision, guiding the negative-centric diffusion model to generate high-value, detector-confusing false positives instead of low-value, ordinary backgrounds. Our approach is the first to apply adversarial diffusion to lesion detection, establishing a new paradigm for targeted false positive synthesis and paving the way for more reliable clinical applications in colorectal cancer screening. Extensive results on public and in-house datasets verify the superiority of our method over the current state of the art, with our synthesized data improving the detectors by at least 2.6% and 2.7% in F1-score, respectively, over the baselines. Codes are at this https URL.
摘要：息肉检测对于大肠癌筛查至关重要，但是现有模型受到可用数据的规模和多样性的限制。虽然生成模型显示出对数据增强的希望，但当前的方法主要集中于增强息肉多样性，通常忽略误报的关键问题。在本文中，我们通过提出一个对抗性扩散框架来综合高价值误报来解决这一差距。负背景的广泛差异在假阳性综合中提出了重大挑战。为了克服这一点，我们介绍了两个关键的创新：首先，我们设计了一种区域噪声匹配策略，以使用息肉检测数据集构建负合成空间。该策略通过掩盖息肉区域来训练一个负为中心的扩散模型，从而确保该模型仅专注于学习多样化的背景模式。其次，我们介绍了检测器引导的对抗扩散攻击者（DADA）模块，该模块会导致负面的合成过程破坏预训练的检测器的决定，从而指导中心为中心的中心扩散模型，以产生高价值的，检测器，探测器，从而产生虚假的积极性，而不是低价值，普通背景。我们的方法是第一个将对抗性扩散应用于病变检测的方法，为靶向假阳性合成的新范式建立了新的范式，并为在结直肠癌筛查中更可靠的临床应用铺平了道路。公共和内部数据集的广泛结果验证了我们方法比当前最新方法的优越性，而我们的合成数据将检测器的提高至少2.6％和2.7％，而F1得分比基线相比。代码在此HTTPS URL处。

Title: Pattern-Based Phase-Separation of Tracer and Dispersed Phase Particles in Two-Phase Defocusing Particle Tracking Velocimetry

Authors: Christian Sax, Jochen Kriegseis
Subjects: cs.CV, physics.app-ph, physics.flu-dyn
Abstract URL: https://arxiv.org/abs/2506.18157
Pdf URL: https://arxiv.org/pdf/2506.18157
Copy Paste: [[2506.18157]] Pattern-Based Phase-Separation of Tracer and Dispersed Phase Particles in Two-Phase Defocusing Particle Tracking Velocimetry(https://arxiv.org/abs/2506.18157)
Keywords: generation, generative
Abstract: This work investigates the feasibility of a post-processing-based approach for phase separation in defocusing particle tracking velocimetry for dispersed two-phase flows. The method enables the simultaneous 3D localization of both tracer particles and particles of the dispersed phase, using a single-camera setup. The distinction between phases is based on pattern differences in defocused particle images, which arise from distinct light scattering behaviors of tracer particles and bubbles or droplets. Convolutional neural networks, including Faster R-CNN and YOLOv4 variants, are trained to detect and classify particle images based on these pattern features. To generate large, labeled training datasets, a generative adversarial network-based framework is introduced, allowing the generation of auto-labeled data that more closely reflects experiment-specific visual appearance. Evaluation across six datasets, comprising synthetic two-phase and real single- and two-phase flows, demonstrates high detection precision and classification accuracy (95-100%), even under domain shifts. The results confirm the viability of using CNNs for robust phase separation in dispersed two-phase DPTV, particularly in scenarios where traditional wavelength-, size-, or ensemble correlation-based methods are impractical.
摘要：这项工作调查了基于后处理的方法在散布粒子跟踪速度测定速度计的可行性。该方法可以使用单相机设置同时确定分散相的示踪剂颗粒和分散相的颗粒。相之间的区别是基于散焦粒子图像的模式差异，这是由示踪剂颗粒和气泡或液滴的不同光散射行为产生的。培训了卷积神经网络，包括更快的R-CNN和Yolov4变体，可以根据这些模式特征检测和分类粒子图像。为了生成大型，标记的训练数据集，引入了基于生成的对抗网络的框架，从而生成了自动标记的数据，该数据更紧密地反映了特定于实验的视觉外观。跨六个数据集的评估，包括合成的两相和实际的单相和两相流，即使在域移位下，也证明了高检测精度和分类精度（95-100％）。结果证实了使用CNN在分散两相dptv中使用CNN进行鲁棒相分离的生存能力，尤其是在传统波长，大小或集合相关方法的情况下是不切实际的。

Title: Make It Efficient: Dynamic Sparse Attention for Autoregressive Image Generation

Authors: Xunzhi Xiang, Qi Fan
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.18226
Pdf URL: https://arxiv.org/pdf/2506.18226
Copy Paste: [[2506.18226]] Make It Efficient: Dynamic Sparse Attention for Autoregressive Image Generation(https://arxiv.org/abs/2506.18226)
Keywords: generation
Abstract: Autoregressive conditional image generation models have emerged as a dominant paradigm in text-to-image synthesis. These methods typically convert images into one-dimensional token sequences and leverage the self-attention mechanism, which has achieved remarkable success in natural language processing, to capture long-range dependencies, model global context, and ensure semantic coherence. However, excessively long contexts during inference lead to significant memory overhead caused by KV-cache and computational delays. To alleviate these challenges, we systematically analyze how global semantics, spatial layouts, and fine-grained textures are formed during inference, and propose a novel training-free context optimization method called Adaptive Dynamic Sparse Attention (ADSA). Conceptually, ADSA dynamically identifies historical tokens crucial for maintaining local texture consistency and those essential for ensuring global semantic coherence, thereby efficiently streamlining attention computation. Additionally, we introduce a dynamic KV-cache update mechanism tailored for ADSA, reducing GPU memory consumption during inference by approximately 50%. Extensive qualitative and quantitative experiments demonstrate the effectiveness and superiority of our approach in terms of both generation quality and resource efficiency.
摘要：自回归有条件的图像生成模型已成为文本对图像合成中的主要范式。这些方法通常将图像转换为一维代币序列，并利用自我发挥的机制，在自然语言处理中取得了显着成功，以捕获远距离依赖性，模拟全局上下文并确保语义连贯性。但是，推断期间的过长背景会导致由KV-CACHE和计算延迟引起的大量内存开销。为了减轻这些挑战，我们系统地分析了在推断期间如何形成全球语义，空间布局和细粒度的纹理，并提出了一种新型的无训练上下文优化方法，称为自适应动态稀疏注意（ADSA）。从概念上讲，ADSA动态地识别历史令牌对于保持局部纹理的一致性至关重要，并且对于确保全局语义连贯性至关重要的标记，从而有效地简化了注意力计算。此外，我们引入了针对ADSA量身定制的动态KV-CACHE更新机制，将推断期间的GPU内存消耗降低了约50美元\％$。广泛的定性和定量实验证明了我们方法在发电质量和资源效率方面的有效性和优势。

Title: Semantic Structure-Aware Generative Attacks for Enhanced Adversarial Transferability

Authors: Jongoh Jeong, Hunmin Yang, Jaeseok Jeong, Kuk-Jin Yoon
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.18248
Pdf URL: https://arxiv.org/pdf/2506.18248
Copy Paste: [[2506.18248]] Semantic Structure-Aware Generative Attacks for Enhanced Adversarial Transferability(https://arxiv.org/abs/2506.18248)
Keywords: generative
Abstract: Generative adversarial attacks train a perturbation generator on a white-box surrogate model and subsequently apply the crafted perturbations to unseen black-box victim models. In contrast to iterative attacks, these methods deliver superior inference-time efficiency, scalability, and transferability; however, up until now, existing studies have not fully exploited the representational capacity of generative models to preserve and harness semantic information. Specifically, the intermediate activations of the generator encode rich semantic features--object boundaries and coarse shapes--that remain under-exploited, thereby limiting the alignment of perturbations with object-salient regions which are critical for adversarial transferability. To remedy this, we introduce a semantic structure-aware attack framework based on the Mean Teacher, which serves as a temporally smoothed feature reference. With this smoothed reference, we further direct semantic consistency between the early-layer activations in the student and those of the semantically rich teacher by feature distillation. By anchoring perturbation synthesis to the semantically salient early intermediate blocks within the generator based on empirical findings, our method guides progressive adversarial perturbation on regions that substantially enhance adversarial transferability. We conduct extensive experiments over diverse models, domains and tasks to demonstrate consistent improvements relative to state-of-the-art generative attacks, comprehensively evaluated using conventional metrics and our newly proposed Accidental Correction Rate (ACR).
摘要：生成对抗攻击在白盒替代模型上训练扰动发生器，然后将精心设计的扰动应用于看不见的黑盒受害者模型。与迭代攻击相反，这些方法具有出色的推理时间效率，可伸缩性和可传递性。但是，到目前为止，现有的研究还没有完全利用生成模型保存和利用语义信息的代表性。具体而言，发电机的中间激活编码了丰富的语义特征（对象边界和粗形） - 探索不足，从而限制了对对抗性转移性至关重要的扰动与对象偏置区域的比对。为了解决这个问题，我们基于平均教师引入了语义结构感知攻击框架，该攻击框架是时间平滑的特征参考。通过这种平滑的参考，我们通过特征蒸馏来进一步指导学生的早期激活与语义丰富教师的早期激活之间的语义一致性。通过基于经验发现，通过将扰动合成锚定在发电机内的语义显着的早期中间区块中，我们的方法指导了对区域的渐进性对抗性扰动，从而实质上增强了对抗性可传递性。我们对各种模型，域和任务进行了广泛的实验，以证明相对于最新生成攻击的一致改进，并使用常规指标和我们新提出的新提出的意外校正率（ACR）进行了全面评估。
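
Illustration (not from the paper): the abstract uses a Mean Teacher as a temporally smoothed feature reference and aligns student activations to it by feature distillation. In the standard Mean Teacher setup, the teacher's weights are an exponential moving average (EMA) of the student's weights. The Python sketch below shows that EMA update plus a mean-squared-error distillation term on plain lists. The function names and the choice of MSE as the distillation loss are assumptions for illustration; the abstract does not state which loss or decay value the authors use.

```python
def ema_update(teacher, student, decay=0.99):
    """Move each teacher weight a small step toward the student weight."""
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]


def feature_distillation_loss(student_feat, teacher_feat):
    """Mean squared error between student and teacher features (assumed loss)."""
    n = len(student_feat)
    return sum((s - t) ** 2 for s, t in zip(student_feat, teacher_feat)) / n


# Toy weights: with decay=0.5 the teacher moves halfway to the student.
teacher = [0.0, 1.0]
student = [1.0, 1.0]
teacher = ema_update(teacher, student, decay=0.5)

# Toy features: one of two entries differs by 2, so the MSE is 4 / 2.
loss = feature_distillation_loss([1.0, 3.0], [1.0, 1.0])
```

A decay close to 1 makes the teacher change slowly, which is what makes it a smoothed reference rather than a copy of the current student.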

Title: Improving Weakly Supervised Temporal Action Localization by Exploiting Multi-resolution Information in Temporal Domain

Authors: Rui Su, Dong Xu, Luping Zhou, Wanli Ouyang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.18261
Pdf URL: https://arxiv.org/pdf/2506.18261
Copy Paste: [[2506.18261]] Improving Weakly Supervised Temporal Action Localization by Exploiting Multi-resolution Information in Temporal Domain(https://arxiv.org/abs/2506.18261)
Keywords: generation
Abstract: Weakly supervised temporal action localization is a challenging task as only the video-level annotation is available during the training process. To address this problem, we propose a two-stage approach to fully exploit multi-resolution information in the temporal domain and generate high quality frame-level pseudo labels based on both appearance and motion streams. Specifically, in the first stage, we generate reliable initial frame-level pseudo labels, and in the second stage, we iteratively refine the pseudo labels and use a set of selected frames with highly confident pseudo labels to train neural networks and better predict action class scores at each frame. We fully exploit temporal information at multiple scales to improve temporal action localization performance. Specifically, in order to obtain reliable initial frame-level pseudo labels, in the first stage, we propose an Initial Label Generation (ILG) module, which leverages temporal multi-resolution consistency to generate high quality class activation sequences (CASs), which consist of a number of sequences with each sequence measuring how likely each video frame is to belong to one specific action class. In the second stage, we propose a Progressive Temporal Label Refinement (PTLR) framework. In our PTLR framework, two networks called Network-OTS and Network-RTS, which are respectively used to generate CASs for the original temporal scale and the reduced temporal scales, are used as two streams (i.e., the OTS stream and the RTS stream) to refine the pseudo labels in turn. In this way, the multi-resolution information in the temporal domain is exchanged at the pseudo label level, and our work can help improve each stream (i.e., the OTS/RTS stream) by exploiting the refined pseudo labels from another stream (i.e., the RTS/OTS stream).
摘要：弱监督的时间动作本地化是一项具有挑战性的任务，因为在培训过程中只有视频级注释。为了解决这个问题，我们提出了一种两阶段的方法，以完全利用时间域中的多分辨率信息，并根据外观和运动流生成高质量的框架级伪标签。具体而言，在第一阶段，我们生成可靠的初始帧级伪标签，在第二阶段，我们迭代地改进了伪标签，并使用具有高度自信的伪贴标签的一组选定框架来训练神经网络，并更好地预测每个帧处的动作级别。我们在多个尺度上充分利用时间信息，以提高时间动作定位性能。具体而言，为了获得可靠的初始帧级伪标签，在第一阶段，我们提出了一个初始标签生成（ILG）模块，该模块利用时间多分辨率的一致性来生成高质量的类激活序列（CASS），该序列（CASS）由每个序列组成的每个序列都可以测量每个视频帧的可能性属于一个特定的特定动作类别。在第二阶段，我们提出了一个渐进的时间标签改进（PTLR）框架。在我们的PTLR框架中，将两个称为网络-OTS和Network-RT的网络分别用于生成原始时间尺度和减少的时间尺度的Cass，用作两个流（即OTS流和RTS流），以改进伪标签。通过这种方式，在伪标签级别交换了时间域中的多分辨率信息，我们的工作可以通过从另一个流（即RTS/OTS流）利用精制的伪标签来帮助改善每个流（即OTS/RTS流）。

Title: Adaptive Mask-guided K-space Diffusion for Accelerated MRI Reconstruction

Authors: Qinrong Cai, Yu Guan, Zhibo Chen, Dong Liang, Qiuyun Fan, Qiegen Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.18270
Pdf URL: https://arxiv.org/pdf/2506.18270
Copy Paste: [[2506.18270]] Adaptive Mask-guided K-space Diffusion for Accelerated MRI Reconstruction(https://arxiv.org/abs/2506.18270)
Keywords: generation
Abstract: As the deep learning revolution marches on, masked modeling has emerged as a distinctive approach that involves predicting parts of the original data that are proportionally masked during training, and has demonstrated exceptional performance in multiple fields. Magnetic Resonance Imaging (MRI) reconstruction is a critical task in medical imaging that seeks to recover high-quality images from under-sampled k-space data. However, previous MRI reconstruction strategies usually optimized the entire image domain or k-space, without considering the importance of different frequency regions in the k-space. This work introduces a diffusion model based on adaptive masks (AMDM), which utilizes the adaptive adjustment of frequency distribution based on k-space data to develop a hybrid mask mechanism that adapts to different k-space inputs. This enables the effective separation of high-frequency and low-frequency components, producing diverse frequency-specific representations. Additionally, the k-space frequency distribution informs the generation of adaptive masks, which, in turn, guide a closed-loop diffusion process. Experimental results verified the ability of this method to learn specific frequency information and thereby improved the quality of MRI reconstruction, providing a flexible framework for optimizing k-space data using masks in the future.
摘要：随着深度学习革命的发展，蒙版的建模已成为一种独特的方法，涉及预测训练过程中按比例掩盖的原始数据的一部分，并且在多个领域表现出了出色的性能。磁共振成像（MRI）重建是医学成像中的关键任务，它试图从采样不足的K空间数据中恢复高质量的图像。但是，以前的MRI重建策略通常优化了整个图像域或K空间，而不考虑K空间中不同频率区域的重要性这项工作引入了基于自适应掩模（AMDM）的扩散模型，该模型利用基于K-Space数据来开发hybrid Masks机构的频率分布的自适应调整k-Space k-space k-space in Difction knopacts k-Space分布。这使高频和低频组件可以有效分离，从而产生各种特定于频率的表示。此外，K空间频率分布会导致自适应面罩的产生，从而指导闭环扩散过程。实验结果验证了该方法学习特定频率信息的能力，从而提高了MRI重建的质量，从而提供了一个灵活的框架，以便将来使用掩码优化K-Space数据。

Title: Instability in Diffusion ODEs: An Explanation for Inaccurate Image Reconstruction

Authors: Han Zhang, Jinghong Mao, Shangwen Zhu, Zhantao Yang, Lianghua Huang, Yu Liu, Deli Zhao, Ruili Feng, Fan Cheng
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.18290
Pdf URL: https://arxiv.org/pdf/2506.18290
Copy Paste: [[2506.18290]] Instability in Diffusion ODEs: An Explanation for Inaccurate Image Reconstruction(https://arxiv.org/abs/2506.18290)
Keywords: restoration, generation
Abstract: Diffusion reconstruction plays a critical role in various applications such as image editing, restoration, and style transfer. In theory, the reconstruction should be simple - it just inverts and regenerates images by numerically solving the Probability Flow-Ordinary Differential Equation (PF-ODE). Yet in practice, noticeable reconstruction errors have been observed, which cannot be well explained by numerical errors. In this work, we identify a deeper intrinsic property in the PF-ODE generation process, the instability, that can further amplify the reconstruction errors. The root of this instability lies in the sparsity inherent in the generation distribution, which means that the probability is concentrated on scattered and small regions while the vast majority remains almost empty. To demonstrate the existence of instability and its amplification on reconstruction error, we conduct experiments on both toy numerical examples and popular open-sourced diffusion models. Furthermore, based on the characteristics of image data, we theoretically prove that the instability's probability converges to one as the data dimensionality increases. Our findings highlight the inherent challenges in diffusion-based reconstruction and can offer insights for future improvements.
摘要：扩散重建在各种应用中起着至关重要的作用，例如图像编辑，恢复和样式转移。从理论上讲，重建应该很简单 - 它仅通过数值求解概率流动差微分方程（PF-ODE）来反转和再生图像。然而，实际上，已经观察到明显的重建错误，这不能通过数值错误来很好地解释。在这项工作中，我们确定了PF-ode生成过程中不稳定性中更深的内在属性，可以进一步扩大重建错误。这种不稳定性的根源在于生成分布中固有的稀疏性，这意味着概率集中在散落的区域和小区域，而绝大多数则几乎保持空虚。为了证明不稳定性的存在及其对重建误差的扩增，我们在玩具数值示例和流行的开源扩散模型上进行了实验。此外，根据图像数据的特征，我们从理论上证明，随着数据维度的增加，不稳定性的概率会收敛到一个。我们的发现突出了基于扩散的重建中固有的挑战，可以为未来的改进提供见解。

Title: Rapeseed population point cloud completion network (RP-PCN) with dynamic graph convolution for 3D reconstruction of crop canopy occlusion architecture

Authors: Ziyue Guo (1 and 2), Xin Yang (1 and 2), Yutao Shen (1 and 2), Yang Zhu (3), Lixi Jiang (3), Haiyan Cen (1 and 2) ((1) College of Biosystems Engineering and Food Science, Zhejiang University, (2) Key Laboratory of Spectroscopy Sensing, Ministry of Agriculture and Rural Affairs, (3) Institute of Crop Science, Zhejiang University)
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.18292
Pdf URL: https://arxiv.org/pdf/2506.18292
Copy Paste: [[2506.18292]] Rapeseed population point cloud completion network (RP-PCN) with dynamic graph convolution for 3D reconstruction of crop canopy occlusion architecture(https://arxiv.org/abs/2506.18292)
Keywords: generation
Abstract: Quantitative descriptions of complete canopy architecture are crucial for evaluating crop photosynthesis and yield to guide ideotype design. Although three-dimensional (3D) sensing technologies have been developed for plant and canopy reconstruction, severe occlusion and complex architectures hinder accurate canopy descriptions. In this study, we propose a point cloud completion model for 3D reconstruction of rapeseed populations from seeding to silique stages using multi-view imaging. A complete point cloud generation framework was developed with the virtual-real integration (VRI) simulation method and occlusion point detection algorithm to annotate the training dataset by distinguishing surface from occluded points. The rapeseed population point cloud completion network (RP-PCN) was designed with a multi-resolution dynamic graph convolutional encoder (MRDG) and point pyramid decoder (PPD) to predict occluded points based on input surface point clouds. A dynamic graph convolutional feature extractor (DGCFE) was introduced to capture structural variations across the growth period. The effectiveness of point cloud completion was validated by predicting yield using architectural indicators from complete point clouds of rapeseed population. The results demonstrated that RP-PCN achieved chamfer distance (CD) values of 3.35 cm, 3.46 cm, 4.32 cm, and 4.51 cm at the seedling, bolting, flowering, and silique stages, respectively. Ablation studies showed the effectiveness of the MRDG and DGCFE modules, reducing CD values by 10% and 23%, respectively. The silique efficiency index (SEI) from RP-PCN improved yield prediction accuracy by 11.2% compared to incomplete point clouds. The RP-PCN pipeline proposed in this study has the potential to be extended to other crops, significantly enhancing the analysis of population canopy architectures in field environments.
摘要：完整冠层结构的定量描述对于评估作物光合作用和产量以指导意识型设计至关重要。尽管已经开发了用于植物和冠层重建，严重的遮挡和复杂体系结构的三维（3D）传感技术阻碍了精确的冠层描述。在这项研究中，我们提出了一个点云完成模型，用于使用多视图成像进行3D重建从播种到形成阶段的菜籽群的3D重建模型。通过虚拟真实集成（VRI）仿真方法和遮挡点检测算法开发了一个完整的点云生成框架，通过将表面与遮挡点区分开来注释训练数据集。 Rapeseed种群云完成网络（RP-PCN）是使用多分辨率动态图卷积编码器（MRDG）和点金字塔解码器（PPD）设计的，以预测基于输入表面云的遮挡点。引入了动态图卷积特征提取器（DGCFE），以捕获整个生长期的结构变化。通过使用Rapeseed人群的完整点云来预测产量，可以通过预测产量来验证点云完成的有效性。结果表明，RP-PCN分别在幼苗，螺栓，螺栓，开花和成形阶段达到3.35 cm，3.46 cm，4.32 cm和4.51 cm的倒角距离（CD）值。消融研究表明，MRDG和DGCFE模块的有效性分别将CD值降低了10％和23％。与不完整的点云相比，RP-PCN的Silique效率指数（SEI）提高了11.2％的产量预测准确性。这项研究中提出的RP-PCN管道有可能扩展到其他农作物，从而显着增强了对现场环境中人口冠层体系结构的分析。
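
Illustration (not from the paper): the abstract reports completion quality as chamfer distance (CD) in centimetres. A minimal Python sketch of one common CD definition follows: for each point, take the Euclidean distance to its nearest neighbour in the other cloud, average per direction, and sum the two directions. This particular variant (unsquared distances, mean per direction, sum of both directions) is an assumption; the abstract does not say which variant the authors use, and other conventions differ by squaring or scaling.

```python
import math


def chamfer_distance(cloud_a, cloud_b):
    """Symmetric chamfer distance between two non-empty point clouds."""

    def one_way(src, dst):
        # Mean distance from each source point to its nearest destination point.
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return one_way(cloud_a, cloud_b) + one_way(cloud_b, cloud_a)


# Toy clouds (coordinates are made up): one shared point, one displaced point.
predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
cd = chamfer_distance(predicted, reference)
```

This brute-force version costs O(|A| * |B|) distance evaluations, which is fine for small examples; large canopy clouds would normally use a spatial index for the nearest-neighbour search.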

Title: NSFW-Classifier Guided Prompt Sanitization for Safe Text-to-Image Generation

Authors: Yu Xie, Chengjie Zeng, Lingyun Zhang, Yanwei Fu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.18325
Pdf URL: https://arxiv.org/pdf/2506.18325
Copy Paste: [[2506.18325]] NSFW-Classifier Guided Prompt Sanitization for Safe Text-to-Image Generation(https://arxiv.org/abs/2506.18325)
Keywords: generation
Abstract: The rapid advancement of text-to-image (T2I) models, such as Stable Diffusion, has enhanced their capability to synthesize images from textual prompts. However, this progress also raises significant risks of misuse, including the generation of harmful content (e.g., pornography, violence, discrimination), which contradicts the ethical goals of T2I technology and hinders its sustainable development. Inspired by "jailbreak" attacks in large language models, which bypass restrictions through subtle prompt modifications, this paper proposes NSFW-Classifier Guided Prompt Sanitization (PromptSan), a novel approach to detoxify harmful prompts without altering model architecture or degrading generation capability. PromptSan includes two variants: PromptSan-Modify, which iteratively identifies and replaces harmful tokens in input prompts using text NSFW classifiers during inference, and PromptSan-Suffix, which trains an optimized suffix token sequence to neutralize harmful intent while passing both text and image NSFW classifier checks. Extensive experiments demonstrate that PromptSan achieves state-of-the-art performance in reducing harmful content generation across multiple metrics, effectively balancing safety and usability.
摘要：文本到图像（T2I）模型的快速发展（例如稳定的扩散）增强了其从文本提示中合成图像的能力。但是，这一进展也引起了滥用的重大风险，包括产生有害内容（例如，色情，暴力，歧视），这与T2I技术的道德目标相矛盾，并阻碍了其可持续发展。受大语言模型中的“越狱”攻击的启发，该攻击是通过微妙的及时修改绕过限制的，本文提出了NSFW分类器引导及时及时卫生化（提示），这是一种新颖的方法，用于在不改变模型体系结构或降级生成能力的情况下排毒有害提示。提示包括两个变体：提示 - 修改，它在推理过程中使用文本NSFW分类器在输入提示中识别并取代有害令牌，并提示训练优化的后缀代币序列，以中和有害意图，同时通过文本和图像NSFW分类器检查。广泛的实验表明，促使San在减少多个指标的有害内容生成方面取得了最新的性能，从而有效地平衡了安全性和可用性。

Title: Geometry-Aware Preference Learning for 3D Texture Generation

Authors: AmirHossein Zamani, Tianhao Xie, Amir G. Aghdam, Tiberiu Popa, Eugene Belilovsky
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.18331
Pdf URL: https://arxiv.org/pdf/2506.18331
Copy Paste: [[2506.18331]] Geometry-Aware Preference Learning for 3D Texture Generation(https://arxiv.org/abs/2506.18331)
Keywords: generation, generative
Abstract: Recent advances in 3D generative models have achieved impressive results, but the 3D content generated by these models may not align with subjective human preferences or task-specific criteria. Moreover, a core challenge in the 3D texture generation domain remains: most existing approaches rely on repeated calls to 2D text-to-image generative models, which lack an inherent understanding of the 3D structure of the input 3D mesh object. To address this, we propose an end-to-end differentiable preference learning framework that back-propagates human preferences, represented by differentiable reward functions, through the entire 3D generative pipeline, making the process inherently geometry-aware. We demonstrate the effectiveness of our framework using four proposed novel geometry-aware reward functions, offering a more controllable and interpretable pathway for high-quality 3D content creation from natural language.

Title: Rethinking Decoder Design: Improving Biomarker Segmentation Using Depth-to-Space Restoration and Residual Linear Attention

Authors: Saad Wazir, Daeyoung Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.18335
Pdf URL: https://arxiv.org/pdf/2506.18335
Copy Paste: [[2506.18335]] Rethinking Decoder Design: Improving Biomarker Segmentation Using Depth-to-Space Restoration and Residual Linear Attention(https://arxiv.org/abs/2506.18335)
Keywords: restoration
Abstract: Segmenting biomarkers in medical images is crucial for various biotech applications. Despite advances, Transformer and CNN based methods often struggle with variations in staining and morphology, limiting feature extraction. In medical image segmentation, where datasets often have limited sample availability, recent state-of-the-art (SOTA) methods achieve higher accuracy by leveraging pre-trained encoders, whereas end-to-end methods tend to underperform. This is due to challenges in effectively transferring rich multiscale features from encoders to decoders, as well as limitations in decoder efficiency. To address these issues, we propose an architecture that captures multi-scale local and global contextual information and a novel decoder design, which effectively integrates features from the encoder, emphasizes important channels and regions, and reconstructs spatial dimensions to enhance segmentation accuracy. Our method, compatible with various encoders, outperforms SOTA methods, as demonstrated by experiments on four datasets and ablation studies. Specifically, our method achieves absolute performance gains of 2.76% on MoNuSeg, 3.12% on DSB, 2.87% on Electron Microscopy, and 4.03% on TNBC datasets compared to existing SOTA methods. Code: this https URL

Title: Controlled Generation with Equivariant Variational Flow Matching

Authors: Floor Eijkelboom, Heiko Zimmermann, Sharvaree Vadgama, Erik J Bekkers, Max Welling, Christian A. Naesseth, Jan-Willem van de Meent
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.18340
Pdf URL: https://arxiv.org/pdf/2506.18340
Copy Paste: [[2506.18340]] Controlled Generation with Equivariant Variational Flow Matching(https://arxiv.org/abs/2506.18340)
Keywords: generation, generative
Abstract: We derive a controlled generation objective within the framework of Variational Flow Matching (VFM), which casts flow matching as a variational inference problem. We demonstrate that controlled generation can be implemented in two ways: (1) by way of end-to-end training of conditional generative models, or (2) as a Bayesian inference problem, enabling post hoc control of unconditional models without retraining. Furthermore, we establish the conditions required for equivariant generation and provide an equivariant formulation of VFM tailored for molecular generation, ensuring invariance to rotations, translations, and permutations. We evaluate our approach on both uncontrolled and controlled molecular generation, achieving state-of-the-art performance on uncontrolled generation and outperforming state-of-the-art models in controlled generation, both with end-to-end training and in the Bayesian inference setting. This work strengthens the connection between flow-based generative modeling and Bayesian inference, offering a scalable and principled framework for constraint-driven and symmetry-aware generation.

Title: BSMamba: Brightness and Semantic Modeling for Long-Range Interaction in Low-Light Image Enhancement

Authors: Tongshun Zhang, Pingping Liu, Mengen Cai, Zijian Zhang, Yubing Lu, Qiuzhan Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.18346
Pdf URL: https://arxiv.org/pdf/2506.18346
Copy Paste: [[2506.18346]] BSMamba: Brightness and Semantic Modeling for Long-Range Interaction in Low-Light Image Enhancement(https://arxiv.org/abs/2506.18346)
Keywords: restoration
Abstract: Current low-light image enhancement (LLIE) methods face significant limitations in simultaneously improving brightness while preserving semantic consistency, fine details, and computational efficiency. With the emergence of state-space models, particularly Mamba, image restoration has achieved remarkable performance, yet existing visual Mamba approaches flatten 2D images into 1D token sequences using fixed scanning rules, critically limiting interactions between distant tokens with causal relationships and constraining their ability to capture meaningful long-range dependencies. To address these fundamental limitations, we propose BSMamba, a novel visual Mamba architecture comprising two specially designed components: Brightness Mamba and Semantic Mamba. The Brightness Mamba revolutionizes token interaction patterns by prioritizing connections between distant tokens with similar brightness levels, effectively addressing the challenge of brightness restoration in LLIE tasks through brightness-guided selective attention. Complementing this, the Semantic Mamba establishes priority interactions between tokens sharing similar semantic meanings, allowing the model to maintain contextual consistency by connecting semantically related regions across the image, thus preserving the hierarchical nature of image semantics during enhancement. By intelligently modeling tokens based on brightness and semantic similarity rather than arbitrary scanning patterns, BSMamba transcends the constraints of conventional token sequencing while adhering to the principles of causal modeling. Extensive experiments demonstrate that BSMamba achieves state-of-the-art performance in LLIE while preserving semantic consistency.

Title: RePIC: Reinforced Post-Training for Personalizing Multi-Modal Language Models

Authors: Yeongtak Oh, Jisoo Mok, Dohyun Chung, Juhyeon Shin, Sangha Park, Johan Barthelemy, Sungroh Yoon
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.18369
Pdf URL: https://arxiv.org/pdf/2506.18369
Copy Paste: [[2506.18369]] RePIC: Reinforced Post-Training for Personalizing Multi-Modal Language Models(https://arxiv.org/abs/2506.18369)
Keywords: generation
Abstract: Recent multi-modal large language models (MLLMs) often struggle to generate personalized image captions, even when trained on high-quality captions. In this work, we observe that such limitations persist in existing post-training-based MLLM personalization methods. Specifically, despite being post-tuned with large-scale caption data through supervised fine-tuning (SFT), these models frequently fail to produce faithful descriptions in real-world scenarios, such as multi-concept image captioning. However, acquiring large-scale, high-quality captions for such complex settings is both costly and difficult. To address the data-centric nature of SFT, we propose a reinforcement learning (RL)-based post-training framework. To the best of our knowledge, this is the first RL-based approach to post-train MLLMs for personalized image captioning. Our method significantly enhances both visual recognition and personalized generation capabilities of MLLMs, and consistently outperforms existing SFT-based baselines, especially in the challenging multi-concept image captioning task.

Title: CPAM: Context-Preserving Adaptive Manipulation for Zero-Shot Real Image Editing

Authors: Dinh-Khoi Vo, Thanh-Toan Do, Tam V. Nguyen, Minh-Triet Tran, Trung-Nghia Le
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.18438
Pdf URL: https://arxiv.org/pdf/2506.18438
Copy Paste: [[2506.18438]] CPAM: Context-Preserving Adaptive Manipulation for Zero-Shot Real Image Editing(https://arxiv.org/abs/2506.18438)
Keywords: generation
Abstract: Editing natural images using textual descriptions in text-to-image diffusion models remains a significant challenge, particularly in achieving consistent generation and handling complex, non-rigid objects. Existing methods often struggle to preserve textures and identity, require extensive fine-tuning, and exhibit limitations in editing specific spatial regions or objects while retaining background details. This paper proposes Context-Preserving Adaptive Manipulation (CPAM), a novel zero-shot framework for complicated, non-rigid real image editing. Specifically, we propose a preservation adaptation module that adjusts self-attention mechanisms to preserve and independently control the object and background effectively. This ensures that the objects' shapes, textures, and identities are maintained while keeping the background undistorted during the editing process using the mask guidance technique. Additionally, we develop a localized extraction module to mitigate the interference with the non-desired modified regions during conditioning in cross-attention mechanisms. We also introduce various mask-guidance strategies to facilitate diverse image manipulation tasks in a simple manner. Extensive experiments on our newly constructed Image Manipulation BenchmArk (IMBA), a robust benchmark dataset specifically designed for real image editing, demonstrate that our proposed method is the preferred choice among human raters, outperforming existing state-of-the-art editing techniques.

Title: GANs vs. Diffusion Models for virtual staining with the HER2match dataset

Authors: Pascal Klöckner, José Teixeira, Diana Montezuma, Jaime S. Cardoso, Hugo M. Horlings, Sara P. Oliveira
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.18484
Pdf URL: https://arxiv.org/pdf/2506.18484
Copy Paste: [[2506.18484]] GANs vs. Diffusion Models for virtual staining with the HER2match dataset(https://arxiv.org/abs/2506.18484)
Keywords: generative
Abstract: Virtual staining is a promising technique that uses deep generative models to recreate histological stains, providing a faster and more cost-effective alternative to traditional tissue chemical staining. Specifically for H&E-HER2 staining transfer, despite a rising trend in publications, the lack of sufficient public datasets has hindered progress in the topic. Additionally, it is currently unclear which model frameworks perform best for this particular task. In this paper, we introduce the HER2match dataset, the first publicly available dataset with the same breast cancer tissue sections stained with both H&E and HER2. Furthermore, we compare the performance of several Generative Adversarial Networks (GANs) and Diffusion Models (DMs), and implement a novel Brownian Bridge Diffusion Model for H&E-HER2 translation. Our findings indicate that, overall, GANs perform better than DMs, with only the BBDM achieving comparable results. Furthermore, we emphasize the importance of data alignment, as all models trained on HER2match produced vastly improved visuals compared to the widely used consecutive-slide BCI dataset. This research provides a new high-quality dataset ([available upon publication acceptance]), improving both model training and evaluation. In addition, our comparison of frameworks offers valuable guidance for researchers working on the topic.

Title: ShowFlow: From Robust Single Concept to Condition-Free Multi-Concept Generation

Authors: Trong-Vu Hoang, Quang-Binh Nguyen, Thanh-Toan Do, Tam V. Nguyen, Minh-Triet Tran, Trung-Nghia Le
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.18493
Pdf URL: https://arxiv.org/pdf/2506.18493
Copy Paste: [[2506.18493]] ShowFlow: From Robust Single Concept to Condition-Free Multi-Concept Generation(https://arxiv.org/abs/2506.18493)
Keywords: generation
Abstract: Customizing image generation remains a core challenge in controllable image synthesis. For single-concept generation, maintaining both identity preservation and prompt alignment is challenging. In multi-concept scenarios, relying solely on a prompt without additional conditions like layout boxes or semantic masks often leads to identity loss and concept omission. In this paper, we introduce ShowFlow, a comprehensive framework designed to tackle these challenges. We propose ShowFlow-S for single-concept image generation, and ShowFlow-M for handling multiple concepts. ShowFlow-S introduces a KronA-WED adapter, which integrates a Kronecker adapter with weight and embedding decomposition, and employs a disentangled learning approach with a novel attention regularization objective to enhance single-concept generation. Building on this foundation, ShowFlow-M directly reuses the learned models from ShowFlow-S to support multi-concept generation without extra conditions, incorporating a Subject-Adaptive Matching Attention (SAMA) and a layout consistency strategy as a plug-and-play module. Extensive experiments and user studies validate ShowFlow's effectiveness, highlighting its potential in real-world applications like advertising and virtual dressing.

Title: PuckTrick: A Library for Making Synthetic Data More Realistic

Authors: Alessandra Agostini, Andrea Maurino, Blerina Spahiu
Subjects: cs.LG, cs.AI, cs.DB
Abstract URL: https://arxiv.org/abs/2506.18499
Pdf URL: https://arxiv.org/pdf/2506.18499
Copy Paste: [[2506.18499]] PuckTrick: A Library for Making Synthetic Data More Realistic(https://arxiv.org/abs/2506.18499)
Keywords: generation
Abstract: The increasing reliance on machine learning (ML) models for decision-making requires high-quality training data. However, access to real-world datasets is often restricted due to privacy concerns, proprietary restrictions, and incomplete data availability. As a result, synthetic data generation (SDG) has emerged as a viable alternative, enabling the creation of artificial datasets that preserve the statistical properties of real data while ensuring privacy compliance. Despite its advantages, synthetic data is often overly clean and lacks real-world imperfections, such as missing values, noise, outliers, and misclassified labels, which can significantly impact model generalization and robustness. To address this limitation, we introduce Pucktrick, a Python library designed to systematically contaminate synthetic datasets by introducing controlled errors. The library supports multiple error types, including missing data, noisy values, outliers, label misclassification, duplication, and class imbalance, offering a structured approach to evaluating ML model resilience under real-world data imperfections. Pucktrick provides two contamination modes: one for injecting errors into clean datasets and another for further corrupting already contaminated datasets. Through extensive experiments on real-world financial datasets, we evaluate the impact of systematic data contamination on model performance. Our findings demonstrate that ML models trained on contaminated synthetic data outperform those trained on purely synthetic, error-free data, particularly for tree-based and linear models such as SVMs and Extra Trees.
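The controlled contamination described in the PuckTrick abstract can be illustrated with a minimal sketch. It does not use the Pucktrick library or its API; the function names `inject_missing` and `flip_binary_labels` and all parameters are hypothetical, chosen only to show two of the listed error types (missing data and label misclassification) applied at a chosen rate.

```python
import numpy as np

def inject_missing(values, rate, rng):
    """Return a float copy with roughly `rate` of the entries set to NaN."""
    out = np.asarray(values, dtype=float).copy()
    mask = rng.random(out.shape) < rate  # each cell is dropped independently
    out[mask] = np.nan
    return out

def flip_binary_labels(labels, rate, rng):
    """Return a copy of 0/1 labels with exactly round(rate * n) entries flipped."""
    out = np.asarray(labels).copy()
    n_flip = int(round(rate * out.size))
    idx = rng.choice(out.size, size=n_flip, replace=False)
    out[idx] = 1 - out[idx]
    return out

# Hypothetical clean "synthetic" data: 5 rows, 4 numeric features, 100 labels.
rng = np.random.default_rng(0)
clean = np.arange(20, dtype=float).reshape(5, 4)
dirty = inject_missing(clean, 0.3, rng)
labels = np.array([0, 1] * 50)
noisy = flip_binary_labels(labels, 0.1, rng)
```

Both helpers return copies, so the clean data stays available for comparison. Calling them again on `dirty` or `noisy` loosely mirrors the second mode mentioned in the abstract, where an already contaminated dataset is corrupted further.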

Title: MedTVT-R1: A Multimodal LLM Empowering Medical Reasoning and Diagnosis

Authors: Yuting Zhang, Kaishen Yuan, Hao Lu, Yutao Yue, Jintai Chen, Kaishun Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.18512
Pdf URL: https://arxiv.org/pdf/2506.18512
Copy Paste: [[2506.18512]] MedTVT-R1: A Multimodal LLM Empowering Medical Reasoning and Diagnosis(https://arxiv.org/abs/2506.18512)
Keywords: generation
Abstract: Accurate and interpretable multi-disease diagnosis remains a critical challenge in medical research, particularly when leveraging heterogeneous multimodal medical data. Current approaches often rely on single-modal data, limiting their ability to comprehensively understand complex diseases. To address this, we propose MedTVT-R1, a novel Multimodal Large Language Model (MLLM) framework designed to integrate clinical multimodal data for reasoning and diagnosing multiple diseases. We construct MedTVT-QA, a curated instruction dataset that provides question-answer pairs for physiological-level interpretations and disease-level diagnoses with a Chain of Evidence approach. MedTVT-R1 incorporates a modality perception layer to capture inter-modal dependencies and adaptively weight modality contributions. Additionally, we employ Group Relative Policy Optimization (GRPO)-based Reinforcement Fine-Tuning with a Jaccard Reward function to enhance diagnostic reasoning. Experimental results demonstrate MedTVT-R1's superiority in multimodal feature utilization and multi-disease diagnosis, offering significant potential for clinical applications such as diagnostic report generation and comorbidity reasoning. The dataset and code are available at this https URL.
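The MedTVT-R1 abstract names a Jaccard Reward function but does not give its exact form. As a rough illustration only, the sketch below computes the standard Jaccard index between a predicted and a reference set of diagnoses; the function name `jaccard_reward`, the empty-set convention, and the example disease labels are assumptions, not details from the paper.

```python
def jaccard_reward(predicted, reference):
    """Standard Jaccard index |A & B| / |A | B| between two label collections."""
    pred, ref = set(predicted), set(reference)
    if not pred and not ref:
        # Assumed convention: two empty diagnosis sets count as a perfect match.
        return 1.0
    return len(pred & ref) / len(pred | ref)

# Hypothetical multi-disease prediction: one label correct, one wrong, one missed.
reward = jaccard_reward({"pneumonia", "anemia"}, {"pneumonia", "sepsis"})
```

A set-overlap score of this kind gives partial credit when only some diseases are diagnosed correctly, which is one plausible reason to use it as a reward for multi-label outputs.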

Title: Enhancing Image Restoration Transformer via Adaptive Translation Equivariance

Authors: JiaKui Hu, Zhengjian Yao, Lujia Jin, Hangzhou He, Yanye Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.18520
Pdf URL: https://arxiv.org/pdf/2506.18520
Copy Paste: [[2506.18520]] Enhancing Image Restoration Transformer via Adaptive Translation Equivariance(https://arxiv.org/abs/2506.18520)
Keywords: restoration
Abstract: Translation equivariance is a fundamental inductive bias in image restoration, ensuring that translated inputs produce translated outputs. Attention mechanisms in modern restoration transformers undermine this property, adversely impacting both training convergence and generalization. To alleviate this issue, we propose two key strategies for incorporating translation equivariance: slide indexing and component stacking. Slide indexing maintains operator responses at fixed positions, with sliding window attention being a notable example, while component stacking enables the arrangement of translation-equivariant operators in parallel or sequentially, thereby building complex architectures while preserving translation equivariance. However, these strategies still create a dilemma in model design between the high computational cost of self-attention and the fixed receptive field associated with sliding window attention. To address this, we develop an adaptive sliding indexing mechanism to efficiently select key-value pairs for each query, which are then concatenated in parallel with globally aggregated key-value pairs. The designed network, called the Translation Equivariance Adaptive Transformer (TEAFormer), is assessed across a variety of image restoration tasks. The results highlight its superiority in terms of effectiveness, training convergence, and generalization.
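The property stated in the first sentence of the TEAFormer abstract, that translated inputs produce translated outputs, can be checked numerically. The sketch below is not the TEAFormer architecture: it compares a 1D circular convolution (equivariant to cyclic shifts) with a hypothetical operator that weights each element by its absolute position, used here only as a stand-in for an operator that breaks the property.

```python
import numpy as np

def circular_conv(x, kernel):
    """Circular cross-correlation: out[i] = sum_j kernel[j] * x[(i + j) % n]."""
    n = len(x)
    return np.array([
        sum(kernel[j] * x[(i + j) % n] for j in range(len(kernel)))
        for i in range(n)
    ])

def position_weighted(x):
    """Multiply each element by its absolute index (depends on position)."""
    return x * np.arange(len(x))

def is_shift_equivariant(op, x, shift):
    """True when op(shift(x)) equals shift(op(x)) for a cyclic shift."""
    return np.allclose(op(np.roll(x, shift)), np.roll(op(x), shift))

x = np.arange(1.0, 7.0)            # [1, 2, 3, 4, 5, 6]
kernel = np.array([0.5, -1.0, 2.0])
conv_ok = is_shift_equivariant(lambda v: circular_conv(v, kernel), x, 2)
pos_ok = is_shift_equivariant(position_weighted, x, 2)
```

Here `conv_ok` is true and `pos_ok` is false: the convolution applies the same weights relative to each position, while the position-weighted operator responds differently depending on where a value sits.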

Title: Auto-Regressively Generating Multi-View Consistent Images

Authors: JiaKui Hu, Yuxiao Yang, Jialun Liu, Jinbo Wu, Chen Zhao, Yanye Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.18527
Pdf URL: https://arxiv.org/pdf/2506.18527
Copy Paste: [[2506.18527]] Auto-Regressively Generating Multi-View Consistent Images(https://arxiv.org/abs/2506.18527)
Keywords: generation
Abstract: Generating multi-view images from human instructions is crucial for 3D content creation. The primary challenges involve maintaining consistency across multiple views and effectively synthesizing shapes and textures under diverse conditions. In this paper, we propose the Multi-View Auto-Regressive (MV-AR) method, which leverages an auto-regressive model to progressively generate consistent multi-view images from arbitrary prompts. Firstly, the next-token-prediction capability of the AR model significantly enhances its effectiveness in facilitating progressive multi-view synthesis. When generating widely-separated views, MV-AR can utilize all its preceding views to extract effective reference information. Subsequently, we propose a unified model that accommodates various prompts through architecture design and training strategies. To address multiple conditions, we introduce condition injection modules for text, camera pose, image, and shape. To manage multi-modal conditions simultaneously, a progressive training strategy is employed. This strategy initially adopts the text-to-multi-view (t2mv) model as a baseline to enhance the development of a comprehensive X-to-multi-view (X2mv) model by randomly dropping and combining conditions. Finally, to alleviate the overfitting problem caused by limited high-quality data, we propose the "Shuffle View" data augmentation technique, thus significantly expanding the training data by several orders of magnitude. Experiments demonstrate the performance and versatility of our MV-AR, which generates consistent multi-view images across a range of conditions and performs on par with leading diffusion-based multi-view image generation models. Code and models will be released at this https URL.

Title: VQ-Insight: Teaching VLMs for AI-Generated Video Quality Understanding via Progressive Visual Reinforcement Learning

Authors: Xuanyu Zhang, Weiqi Li, Shijie Zhao, Junlin Li, Li Zhang, Jian Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.18564
Pdf URL: https://arxiv.org/pdf/2506.18564
Copy Paste: [[2506.18564]] VQ-Insight: Teaching VLMs for AI-Generated Video Quality Understanding via Progressive Visual Reinforcement Learning(https://arxiv.org/abs/2506.18564)
Keywords: generation, quality assessment
Abstract: Recent advances in AI-generated content (AIGC) have led to the emergence of powerful text-to-video generation models. Despite these successes, evaluating the quality of AIGC-generated videos remains challenging due to limited generalization, lack of temporal awareness, heavy reliance on large-scale annotated datasets, and the lack of effective interaction with generation models. Most current approaches rely on supervised finetuning of vision-language models (VLMs), which often require large-scale annotated datasets and tend to decouple understanding and generation. To address these shortcomings, we propose VQ-Insight, a novel reasoning-style VLM framework for AIGC video quality assessment. Our approach features: (1) a progressive video quality learning scheme that combines image quality warm-up, general task-specific temporal learning, and joint optimization with the video generation model; (2) the design of multi-dimension scoring rewards, preference comparison rewards, and temporal modeling rewards to enhance both generalization and specialization in video quality evaluation. Extensive experiments demonstrate that VQ-Insight consistently outperforms state-of-the-art baselines in preference comparison, multi-dimension scoring, and natural video scoring, bringing significant improvements for video generation tasks.

Title: VisualChef: Generating Visual Aids in Cooking via Mask Inpainting

Authors: Oleh Kuzyk, Zuoyue Li, Marc Pollefeys, Xi Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.18569
Pdf URL: https://arxiv.org/pdf/2506.18569
Copy Paste: [[2506.18569]] VisualChef: Generating Visual Aids in Cooking via Mask Inpainting(https://arxiv.org/abs/2506.18569)
Keywords: generation
Abstract: Cooking requires not only following instructions but also understanding, executing, and monitoring each step - a process that can be challenging without visual guidance. Although recipe images and videos offer helpful cues, they often lack consistency in focus, tools, and setup. To better support the cooking process, we introduce VisualChef, a method for generating contextual visual aids tailored to cooking scenarios. Given an initial frame and a specified action, VisualChef generates images depicting both the action's execution and the resulting appearance of the object, while preserving the initial frame's environment. Previous work aims to integrate knowledge extracted from large language models by generating detailed textual descriptions to guide image generation, which requires fine-grained visual-textual alignment and involves additional annotations. In contrast, VisualChef simplifies alignment through mask-based visual grounding. Our key insight is identifying action-relevant objects and classifying them to enable targeted modifications that reflect the intended action and outcome while maintaining a consistent environment. In addition, we propose an automated pipeline to extract high-quality initial, action, and final state frames. We evaluate VisualChef quantitatively and qualitatively on three egocentric video datasets and show its improvements over state-of-the-art methods.

Title: No Training Wheels: Steering Vectors for Bias Correction at Inference Time

Authors: Aviral Gupta, Armaan Sethi, Ameesh Sethi
Subjects: cs.LG, cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2506.18598
Pdf URL: https://arxiv.org/pdf/2506.18598
Copy Paste: [[2506.18598]] No Training Wheels: Steering Vectors for Bias Correction at Inference Time(https://arxiv.org/abs/2506.18598)
Keywords: generative
Abstract: Neural network classifiers trained on datasets with uneven group representation often inherit class biases and learn spurious correlations. These models may perform well on average but consistently fail on atypical groups. For example, in hair color classification, datasets may over-represent females with blond hair, reinforcing stereotypes. Although various algorithmic and data-centric methods have been proposed to address such biases, they often require retraining or significant compute. In this work, we propose a cheap, training-free method inspired by steering vectors used to edit behaviors in large language models. We compute the difference in mean activations between majority and minority groups to define a "bias vector," which we subtract from the model's residual stream. This leads to reduced classification bias and improved worst-group accuracy. We explore multiple strategies for extracting and applying these vectors in transformer-like classifiers, showing that steering vectors, traditionally used in generative models, can also be effective in classification. More broadly, we showcase an extremely cheap, inference-time, training-free method to mitigate bias in classification models.
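The bias-vector computation described in the steering-vectors abstract, a difference of mean activations that is then subtracted from the residual stream, can be sketched on synthetic data. The activations below are random arrays standing in for a real model's residual stream, and the names `bias_vector`, `steer`, and the scaling factor `alpha` are assumptions for illustration; the abstract does not specify the layer, the scaling, or the extraction strategy.

```python
import numpy as np

def bias_vector(acts_majority, acts_minority):
    """Difference of mean activations: majority mean minus minority mean."""
    return acts_majority.mean(axis=0) - acts_minority.mean(axis=0)

def steer(residual, vec, alpha=1.0):
    """Subtract the scaled bias vector from every activation row."""
    return residual - alpha * vec

# Synthetic stand-in for residual-stream activations of width d.
rng = np.random.default_rng(0)
d = 8
offset = np.zeros(d)
offset[0] = 3.0  # the majority group is shifted along one direction
maj = rng.normal(size=(200, d)) + offset
mino = rng.normal(size=(50, d))

v = bias_vector(maj, mino)
steered = steer(maj, v)
```

With `alpha=1.0` the steered majority activations have the same mean as the minority group, so the mean difference that defined the bias vector is removed. No gradient step or retraining is involved, which matches the training-free, inference-time framing of the abstract.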

Title: Simulation-Free Differential Dynamics through Neural Conservation Laws

Authors: Mengjian Hua, Eric Vanden-Eijnden, Ricky T.Q. Chen
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.18604
Pdf URL: https://arxiv.org/pdf/2506.18604
Copy Paste: [[2506.18604]] Simulation-Free Differential Dynamics through Neural Conservation Laws(https://arxiv.org/abs/2506.18604)
Keywords: generative
Abstract: We present a novel simulation-free framework for training continuous-time diffusion processes over very general objective functions. Existing methods typically involve either prescribing the optimal diffusion process -- which only works for heavily restricted problem formulations -- or requiring expensive simulation to numerically obtain the time-dependent densities and sample from the diffusion process. In contrast, we propose a coupled parameterization which jointly models a time-dependent density function, or probability path, and the dynamics of a diffusion process that generates this probability path. To accomplish this, our approach directly bakes in the Fokker-Planck equation and density function requirements as hard constraints, by extending and greatly simplifying the construction of Neural Conservation Laws. This enables simulation-free training for a large variety of problem formulations, from data-driven objectives as in generative modeling and dynamical optimal transport, to optimality-based objectives as in stochastic optimal control, with straightforward extensions to mean-field objectives due to the ease of accessing exact density functions. We validate our method in a diverse range of application domains from modeling spatio-temporal events to learning optimal dynamics from population data.
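
For reference, the Fokker-Planck constraint mentioned in the abstract can be written out in its textbook form. The notation below is an assumption, not taken from the paper: a drift b_t, a constant scalar noise level sigma, and a density p_t. The second line rewrites the same equation as a continuity equation, which is the sense in which it is a conservation law.

```latex
% Assumed SDE (symbols not from the abstract): dX_t = b_t(X_t)\,dt + \sigma\,dW_t
\partial_t p_t(x) = -\nabla \cdot \bigl( b_t(x)\, p_t(x) \bigr) + \frac{\sigma^2}{2}\, \Delta p_t(x)

% Equivalent continuity (conservation-law) form
\partial_t p_t(x) + \nabla \cdot \bigl( p_t(x)\, v_t(x) \bigr) = 0,
\qquad v_t(x) = b_t(x) - \frac{\sigma^2}{2}\, \nabla \log p_t(x)
```

The two forms agree because p_t \nabla \log p_t = \nabla p_t, so the divergence of the second term of p_t v_t reproduces the Laplacian term.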

Title: On Union-Closedness of Language Generation

Authors: Steve Hanneke, Amin Karbasi, Anay Mehrotra, Grigoris Velegkas
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.18642
Pdf URL: https://arxiv.org/pdf/2506.18642
Copy Paste: [[2506.18642]] On Union-Closedness of Language Generation(https://arxiv.org/abs/2506.18642)
Keywords: generation
Abstract: We investigate language generation in the limit - a model introduced by Kleinberg and Mullainathan [NeurIPS 2024] and extended by Li, Raman, and Tewari [COLT 2025]. While Kleinberg and Mullainathan proved generation is possible for all countable collections, Li et al. defined a hierarchy of generation notions (uniform, non-uniform, and generatable) and explored their feasibility for uncountable collections. Our first set of results resolves two open questions of Li et al. by proving that finite unions of generatable or non-uniformly generatable classes need not be generatable. These follow from a stronger result: there is a non-uniformly generatable class and a uniformly generatable class whose union is non-generatable. This adds to the aspects along which language generation in the limit is different from traditional tasks in statistical learning theory like classification, which are closed under finite unions. In particular, it implies that given two generators for different collections, one cannot combine them to obtain a single "more powerful" generator, prohibiting this notion of boosting. Our construction also addresses a third open question of Li et al. on whether there are uncountable classes that are non-uniformly generatable and do not satisfy the eventually unbounded closure (EUC) condition introduced by Li, Raman, and Tewari. Our approach utilizes carefully constructed classes along with a novel diagonalization argument that could be of independent interest in the growing area of language generation.

Title: RDPO: Real Data Preference Optimization for Physics Consistency Video Generation

Authors: Wenxu Qian, Chaoyue Wang, Hou Peng, Zhiyu Tan, Hao Li, Anxiang Zeng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.18655
Pdf URL: https://arxiv.org/pdf/2506.18655
Copy Paste: [[2506.18655]] RDPO: Real Data Preference Optimization for Physics Consistency Video Generation(https://arxiv.org/abs/2506.18655)
Keywords: generation
Abstract: Video generation techniques have achieved remarkable advancements in visual quality, yet faithfully reproducing real-world physics remains elusive. Preference-based model post-training may improve physical consistency, but requires costly human-annotated datasets or reward models that are not yet feasible. To address these challenges, we present Real Data Preference Optimisation (RDPO), an annotation-free framework that distills physical priors directly from real-world videos. Specifically, the proposed RDPO reverse-samples real video sequences with a pre-trained generator to automatically build preference pairs that are statistically distinguishable in terms of physical correctness. A multi-stage iterative training schedule then guides the generator to obey physical laws increasingly well. Benefiting from the dynamic information explored from real videos, our proposed RDPO significantly improves the action coherence and physical realism of the generated videos. Evaluations on multiple benchmarks and human evaluations have demonstrated that RDPO achieves improvements across multiple dimensions. The source code and demonstration of this paper are available at: this https URL

Title: Historical Report Guided Bi-modal Concurrent Learning for Pathology Report Generation

Authors: Ling Zhang, Boxiang Yun, Qingli Li, Yan Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.18658
Pdf URL: https://arxiv.org/pdf/2506.18658
Copy Paste: [[2506.18658]] Historical Report Guided Bi-modal Concurrent Learning for Pathology Report Generation(https://arxiv.org/abs/2506.18658)
Keywords: generation
Abstract: Automated pathology report generation from Whole Slide Images (WSIs) faces two key challenges: (1) lack of semantic content in visual features and (2) inherent information redundancy in WSIs. To address these issues, we propose a novel Historical Report Guided Bi-modal Concurrent Learning Framework for Pathology Report Generation (BiGen) emulating pathologists' diagnostic reasoning, consisting of: (1) A knowledge retrieval mechanism to provide rich semantic content, which retrieves WSI-relevant knowledge from a pre-built medical knowledge bank by matching high-attention patches and (2) A bi-modal concurrent learning strategy instantiated via a learnable visual token and a learnable textual token to dynamically extract key visual features and retrieved knowledge, where weight-shared layers enable cross-modal alignment between visual features and knowledge features. Our multi-modal decoder integrates both modalities for comprehensive diagnostic report generation. Experiments on the PathText (BRCA) dataset demonstrate our framework's superiority, achieving state-of-the-art performance with 7.4% relative improvement in NLP metrics and 19.1% enhancement in classification metrics for Her-2 prediction versus existing methods. Ablation studies validate the necessity of our proposed modules, highlighting our method's ability to provide WSI-relevant rich semantic content and suppress information redundancy in WSIs. Code is publicly available at this https URL.

Title: SIM-Net: A Multimodal Fusion Network Using Inferred 3D Object Shape Point Clouds from RGB Images for 2D Classification

Authors: Youcef Sklab, Hanane Ariouat, Eric Chenin, Edi Prifti, Jean-Daniel Zucker
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.18683
Pdf URL: https://arxiv.org/pdf/2506.18683
Copy Paste: [[2506.18683]] SIM-Net: A Multimodal Fusion Network Using Inferred 3D Object Shape Point Clouds from RGB Images for 2D Classification(https://arxiv.org/abs/2506.18683)
Keywords: generation
Abstract: We introduce the Shape-Image Multimodal Network (SIM-Net), a novel 2D image classification architecture that integrates 3D point cloud representations inferred directly from RGB images. Our key contribution lies in a pixel-to-point transformation that converts 2D object masks into 3D point clouds, enabling the fusion of texture-based and geometric features for enhanced classification performance. SIM-Net is particularly well-suited for the classification of digitized herbarium specimens, a task made challenging by heterogeneous backgrounds, non-plant elements, and occlusions that compromise conventional image-based models. To address these issues, SIM-Net employs a segmentation-based preprocessing step to extract object masks prior to 3D point cloud generation. The architecture comprises a CNN encoder for 2D image features and a PointNet-based encoder for geometric features, which are fused into a unified latent space. Experimental evaluations on herbarium datasets demonstrate that SIM-Net consistently outperforms ResNet101, achieving gains of up to 9.9% in accuracy and 12.3% in F-score. It also surpasses several transformer-based state-of-the-art architectures, highlighting the benefits of incorporating 3D structural reasoning into 2D image classification tasks.

Title: Matrix-Game: Interactive World Foundation Model

Authors: Yifan Zhang, Chunli Peng, Boyang Wang, Puyi Wang, Qingcheng Zhu, Fei Kang, Biao Jiang, Zedong Gao, Eric Li, Yang Liu, Yahui Zhou
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.18701
Pdf URL: https://arxiv.org/pdf/2506.18701
Copy Paste: [[2506.18701]] Matrix-Game: Interactive World Foundation Model(https://arxiv.org/abs/2506.18701)
Keywords: generation
Abstract: We introduce Matrix-Game, an interactive world foundation model for controllable game world generation. Matrix-Game is trained using a two-stage pipeline that first performs large-scale unlabeled pretraining for environment understanding, followed by action-labeled training for interactive video generation. To support this, we curate Matrix-Game-MC, a comprehensive Minecraft dataset comprising over 2,700 hours of unlabeled gameplay video clips and over 1,000 hours of high-quality labeled clips with fine-grained keyboard and mouse action annotations. Our model adopts a controllable image-to-world generation paradigm, conditioned on a reference image, motion context, and user actions. With over 17 billion parameters, Matrix-Game enables precise control over character actions and camera movements, while maintaining high visual quality and temporal coherence. To evaluate performance, we develop GameWorld Score, a unified benchmark measuring visual quality, temporal quality, action controllability, and physical rule understanding for Minecraft world generation. Extensive experiments show that Matrix-Game consistently outperforms prior open-source Minecraft world models (including Oasis and MineWorld) across all metrics, with particularly strong gains in controllability and physical consistency. Double-blind human evaluations further confirm the superiority of Matrix-Game, highlighting its ability to generate perceptually realistic and precisely controllable videos across diverse game scenarios. To facilitate future research on interactive image-to-world generation, we will open-source the Matrix-Game model weights and the GameWorld Score benchmark at this https URL.

Title: USVTrack: USV-Based 4D Radar-Camera Tracking Dataset for Autonomous Driving in Inland Waterways

Authors: Shanliang Yao, Runwei Guan, Yi Ni, Sen Xu, Yong Yue, Xiaohui Zhu, Ryan Wen Liu
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2506.18737
Pdf URL: https://arxiv.org/pdf/2506.18737
Copy Paste: [[2506.18737]] USVTrack: USV-Based 4D Radar-Camera Tracking Dataset for Autonomous Driving in Inland Waterways(https://arxiv.org/abs/2506.18737)
Keywords: generation
Abstract: Object tracking in inland waterways plays a crucial role in safe and cost-effective applications, including waterborne transportation, sightseeing tours, environmental monitoring and surface rescue. Our Unmanned Surface Vehicle (USV), equipped with a 4D radar, a monocular camera, a GPS, and an IMU, delivers robust tracking capabilities in complex waterborne environments. By leveraging these sensors, our USV collected comprehensive object tracking data, which we present as USVTrack, the first 4D radar-camera tracking dataset tailored for autonomous driving in new generation waterborne transportation systems. Our USVTrack dataset presents rich scenarios, featuring diverse waterways, varying times of day, and multiple weather and lighting conditions. Moreover, we present a simple but effective radar-camera matching method, termed RCM, which can be plugged into popular two-stage association trackers. Experimental results utilizing RCM demonstrate the effectiveness of the radar-camera matching in improving object tracking accuracy and reliability for autonomous driving in waterborne environments. The USVTrack dataset is public on this https URL.

Title: ContinualFlow: Learning and Unlearning with Neural Flow Matching

Authors: Lorenzo Simone, Davide Bacciu, Shuangge Ma
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.18747
Pdf URL: https://arxiv.org/pdf/2506.18747
Copy Paste: [[2506.18747]] ContinualFlow: Learning and Unlearning with Neural Flow Matching(https://arxiv.org/abs/2506.18747)
Keywords: generative
Abstract: We introduce ContinualFlow, a principled framework for targeted unlearning in generative models via Flow Matching. Our method leverages an energy-based reweighting loss to softly subtract undesired regions of the data distribution without retraining from scratch or requiring direct access to the samples to be unlearned. Instead, it relies on energy-based proxies to guide the unlearning process. We prove that this induces gradients equivalent to Flow Matching toward a soft mass-subtracted target, and validate the framework through experiments on 2D and image domains, supported by interpretable visualizations and quantitative evaluations.

Title: 3D Arena: An Open Platform for Generative 3D Evaluation

Authors: Dylan Ebert
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.18787
Pdf URL: https://arxiv.org/pdf/2506.18787
Copy Paste: [[2506.18787]] 3D Arena: An Open Platform for Generative 3D Evaluation(https://arxiv.org/abs/2506.18787)
Keywords: generation, generative
Abstract: Evaluating Generative 3D models remains challenging due to misalignment between automated metrics and human perception of quality. Current benchmarks rely on image-based metrics that ignore 3D structure or geometric measures that fail to capture perceptual appeal and real-world utility. To address this gap, we present 3D Arena, an open platform for evaluating image-to-3D generation models through large-scale human preference collection using pairwise comparisons. Since launching in June 2024, the platform has collected 123,243 votes from 8,096 users across 19 state-of-the-art models, establishing the largest human preference evaluation for Generative 3D. We contribute the iso3d dataset of 100 evaluation prompts and demonstrate quality control achieving 99.75% user authenticity through statistical fraud detection. Our ELO-based ranking system provides reliable model assessment, with the platform becoming an established evaluation resource. Through analysis of this preference data, we present insights into human preference patterns. Our findings reveal preferences for visual presentation features, with Gaussian splat outputs achieving a 16.6 ELO advantage over meshes and textured models receiving a 144.1 ELO advantage over untextured models. We provide recommendations for improving evaluation methods, including multi-criteria assessment, task-oriented evaluation, and format-aware comparison. The platform's community engagement establishes 3D Arena as a benchmark for the field while advancing understanding of human-centered evaluation in Generative 3D.
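
To give a sense of scale for the reported rating gaps, the sketch below uses the conventional Elo formulas. It assumes the standard logistic curve with a 400-point scale and an illustrative K-factor of 32; the abstract does not state which constants the platform uses, so the numbers are indicative only.

```python
def expected_score(rating_a, rating_b):
    """Expected win probability of A over B under the standard Elo curve."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a, rating_b, a_won, k=32.0):
    """Update both ratings after one pairwise vote.

    k=32 is a common default, assumed here rather than taken from the paper.
    """
    e_a = expected_score(rating_a, rating_b)
    s_a = 1.0 if a_won else 0.0
    delta = k * (s_a - e_a)
    return rating_a + delta, rating_b - delta


# Rating gaps quoted in the abstract, read as expected win rates.
p_splat = expected_score(16.6, 0.0)      # about 0.524
p_texture = expected_score(144.1, 0.0)   # about 0.696

# One vote between two equally rated models.
new_a, new_b = elo_update(1500.0, 1500.0, a_won=True)   # (1516.0, 1484.0)
```

Under these assumptions, a 16.6-point gap corresponds to a roughly 52% expected win rate and a 144.1-point gap to roughly 70%, which suggests the texture effect is far larger than the output-format effect.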

Title: 4Real-Video-V2: Fused View-Time Attention and Feedforward Reconstruction for 4D Scene Generation

Authors: Chaoyang Wang, Ashkan Mirzaei, Vidit Goel, Willi Menapace, Aliaksandr Siarohin, Avalon Vinella, Michael Vasilkovsky, Ivan Skorokhodov, Vladislav Shakhrai, Sergey Korolev, Sergey Tulyakov, Peter Wonka
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.18839
Pdf URL: https://arxiv.org/pdf/2506.18839
Copy Paste: [[2506.18839]] 4Real-Video-V2: Fused View-Time Attention and Feedforward Reconstruction for 4D Scene Generation(https://arxiv.org/abs/2506.18839)
Keywords: generation
Abstract: We propose the first framework capable of computing a 4D spatio-temporal grid of video frames and 3D Gaussian particles for each time step using a feed-forward architecture. Our architecture has two main components, a 4D video model and a 4D reconstruction model. In the first part, we analyze current 4D video diffusion architectures that perform spatial and temporal attention either sequentially or in parallel within a two-stream design. We highlight the limitations of existing approaches and introduce a novel fused architecture that performs spatial and temporal attention within a single layer. The key to our method is a sparse attention pattern, where tokens attend to others in the same frame, at the same timestamp, or from the same viewpoint. In the second part, we extend existing 3D reconstruction algorithms by introducing a Gaussian head, a camera token replacement algorithm, and additional dynamic layers and training. Overall, we establish a new state of the art for 4D generation, improving both visual quality and reconstruction capability.
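
The sparse attention pattern in the abstract can be pictured as a boolean mask over tokens laid out on a view-by-time grid. The sketch below is one reading of that sentence, not the paper's implementation: the token ordering, the function name, and the grid sizes are assumptions.

```python
from itertools import product


def fused_view_time_mask(num_views, num_times, tokens_per_frame):
    """Boolean attention mask: entry [i][j] is True if token i may attend to j.

    Tokens are ordered view-major, then time, then spatial position
    (an assumed layout). A pair is allowed when the two tokens share a
    viewpoint or share a timestamp; sharing a frame implies both.
    """
    tokens = [
        (v, t)
        for v, t, _ in product(
            range(num_views), range(num_times), range(tokens_per_frame)
        )
    ]
    return [
        [(vi == vj) or (ti == tj) for (vj, tj) in tokens]
        for (vi, ti) in tokens
    ]


# Smallest non-trivial grid: 2 views x 2 timestamps, 1 token per frame.
mask = fused_view_time_mask(num_views=2, num_times=2, tokens_per_frame=1)
allowed = sum(sum(row) for row in mask)   # 12 of 16 pairs
```

In this toy grid only the "diagonal" pairs (different view and different time) are masked out. On larger grids the allowed fraction shrinks, which is where a sparse pattern would save computation relative to full attention.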

Title: Phantom-Data : Towards a General Subject-Consistent Video Generation Dataset

Authors: Zhuowei Chen, Bingchuan Li, Tianxiang Ma, Lijie Liu, Mingcong Liu, Yi Zhang, Gen Li, Xinghui Li, Siyu Zhou, Qian He, Xinglong Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.18851
Pdf URL: https://arxiv.org/pdf/2506.18851
Copy Paste: [[2506.18851]] Phantom-Data : Towards a General Subject-Consistent Video Generation Dataset(https://arxiv.org/abs/2506.18851)
Keywords: generation
Abstract: Subject-to-video generation has witnessed substantial progress in recent years. However, existing models still face significant challenges in faithfully following textual instructions. This limitation, commonly known as the copy-paste problem, arises from the widely used in-pair training paradigm. This approach inherently entangles subject identity with background and contextual attributes by sampling reference images from the same scene as the target video. To address this issue, we introduce Phantom-Data, the first general-purpose cross-pair subject-to-video consistency dataset, containing approximately one million identity-consistent pairs across diverse categories. Our dataset is constructed via a three-stage pipeline: (1) a general and input-aligned subject detection module, (2) large-scale cross-context subject retrieval from more than 53 million videos and 3 billion images, and (3) prior-guided identity verification to ensure visual consistency under contextual variation. Comprehensive experiments show that training with Phantom-Data significantly improves prompt alignment and visual quality while preserving identity consistency on par with in-pair baselines.

Title: TAMMs: Temporal-Aware Multimodal Model for Satellite Image Change Understanding and Forecasting

Authors: Zhongbin Guo, Yuhao Wang, Ping Jian, Xinyue Chen, Wei Peng, Ertai E
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.18862
Pdf URL: https://arxiv.org/pdf/2506.18862
Copy Paste: [[2506.18862]] TAMMs: Temporal-Aware Multimodal Model for Satellite Image Change Understanding and Forecasting(https://arxiv.org/abs/2506.18862)
Keywords: generation
Abstract: Satellite image time-series analysis demands fine-grained spatial-temporal reasoning, which remains a challenge for existing multimodal large language models (MLLMs). In this work, we study the capabilities of MLLMs on a novel task that jointly targets temporal change understanding and future scene generation, aiming to assess their potential for modeling complex multimodal dynamics over time. We propose TAMMs, a Temporal-Aware Multimodal Model for satellite image change understanding and forecasting, which enhances frozen MLLMs with lightweight temporal modules for structured sequence encoding and contextual prompting. To guide future image generation, TAMMs introduces a Semantic-Fused Control Injection (SFCI) mechanism that adaptively combines high-level semantic reasoning and structural priors within an enhanced ControlNet. This dual-path conditioning enables temporally consistent and semantically grounded image synthesis. Experiments demonstrate that TAMMs outperforms strong MLLM baselines in both temporal change understanding and future image forecasting tasks, highlighting how carefully designed temporal reasoning and semantic fusion can unlock the full potential of MLLMs for spatio-temporal understanding.

Title: OmniAvatar: Efficient Audio-Driven Avatar Video Generation with Adaptive Body Animation

Authors: Qijun Gan, Ruizi Yang, Jianke Zhu, Shaofei Xue, Steven Hoi
Subjects: cs.CV, cs.AI, cs.MM
Abstract URL: https://arxiv.org/abs/2506.18866
Pdf URL: https://arxiv.org/pdf/2506.18866
Copy Paste: [[2506.18866]] OmniAvatar: Efficient Audio-Driven Avatar Video Generation with Adaptive Body Animation(https://arxiv.org/abs/2506.18866)
Keywords: generation
Abstract: Significant progress has been made in audio-driven human animation, while most existing methods focus mainly on facial movements, limiting their ability to create full-body animations with natural synchronization and fluidity. They also struggle with precise prompt control for fine-grained generation. To tackle these challenges, we introduce OmniAvatar, an innovative audio-driven full-body video generation model that enhances human animation with improved lip-sync accuracy and natural movements. OmniAvatar introduces a pixel-wise multi-hierarchical audio embedding strategy to better capture audio features in the latent space, enhancing lip-syncing across diverse scenes. To preserve the capability for prompt-driven control of foundation models while effectively incorporating audio features, we employ a LoRA-based training approach. Extensive experiments show that OmniAvatar surpasses existing models in both facial and semi-body video generation, offering precise text-based control for creating videos in various domains, such as podcasts, human interactions, dynamic scenes, and singing. Our project page is this https URL.

Title: OmniGen2: Exploration to Advanced Multimodal Generation

Authors: Chenyuan Wu, Pengfei Zheng, Ruiran Yan, Shitao Xiao, Xin Luo, Yueze Wang, Wanli Li, Xiyan Jiang, Yexin Liu, Junjie Zhou, Ze Liu, Ziyi Xia, Chaofan Li, Haoge Deng, Jiahao Wang, Kun Luo, Bo Zhang, Defu Lian, Xinlong Wang, Zhongyuan Wang, Tiejun Huang, Zheng Liu
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2506.18871
Pdf URL: https://arxiv.org/pdf/2506.18871
Copy Paste: [[2506.18871]] OmniGen2: Exploration to Advanced Multimodal Generation(https://arxiv.org/abs/2506.18871)
Keywords: generation, generative
Abstract: In this work, we introduce OmniGen2, a versatile and open-source generative model designed to provide a unified solution for diverse generation tasks, including text-to-image, image editing, and in-context generation. Unlike OmniGen v1, OmniGen2 features two distinct decoding pathways for text and image modalities, utilizing unshared parameters and a decoupled image tokenizer. This design enables OmniGen2 to build upon existing multimodal understanding models without the need to re-adapt VAE inputs, thereby preserving the original text generation capabilities. To facilitate the training of OmniGen2, we developed comprehensive data construction pipelines, encompassing image editing and in-context generation data. Additionally, we introduce a reflection mechanism tailored for image generation tasks and curate a dedicated reflection dataset based on OmniGen2. Despite its relatively modest parameter size, OmniGen2 achieves competitive results on multiple task benchmarks, including text-to-image and image editing. To further evaluate in-context generation, also referred to as subject-driven tasks, we introduce a new benchmark named OmniContext. OmniGen2 achieves state-of-the-art performance among open-source models in terms of consistency. We will release our models, training code, datasets, and data construction pipeline to support future research in this field. Project Page: this https URL GitHub Link: this https URL

Title: Let Your Video Listen to Your Music!

Authors: Xinyu Zhang, Dong Gong, Zicheng Duan, Anton van den Hengel, Lingqiao Liu
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2506.18881
Pdf URL: https://arxiv.org/pdf/2506.18881
Copy Paste: [[2506.18881]] Let Your Video Listen to Your Music!(https://arxiv.org/abs/2506.18881)
Keywords: generation, generative
Abstract: Aligning the rhythm of visual motion in a video with a given music track is a practical need in multimedia production, yet remains an underexplored task in autonomous video editing. Effective alignment between motion and musical beats enhances viewer engagement and visual appeal, particularly in music videos, promotional content, and cinematic editing. Existing methods typically depend on labor-intensive manual cutting, speed adjustments, or heuristic-based editing techniques to achieve synchronization. While some generative models handle joint video and music generation, they often entangle the two modalities, limiting flexibility in aligning video to music beats while preserving the full visual content. In this paper, we propose a novel and efficient framework, termed MVAA (Music-Video Auto-Alignment), that automatically edits video to align with the rhythm of a given music track while preserving the original visual content. To enhance flexibility, we modularize the task into a two-step process in our MVAA: aligning motion keyframes with audio beats, followed by rhythm-aware video inpainting. Specifically, we first insert keyframes at timestamps aligned with musical beats, then use a frame-conditioned diffusion model to generate coherent intermediate frames, preserving the original video's semantic content. Since comprehensive test-time training can be time-consuming, we adopt a two-stage strategy: pretraining the inpainting module on a small video set to learn general motion priors, followed by rapid inference-time fine-tuning for video-specific adaptation. This hybrid approach enables adaptation within 10 minutes with one epoch on a single NVIDIA 4090 GPU using CogVideoX-5b-I2V as the backbone. Extensive experiments show that our approach can achieve high-quality beat alignment and visual smoothness.
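Note: the first MVAA step, placing keyframes at timestamps aligned with musical beats, can be illustrated with a minimal sketch. The nearest-beat snapping rule and the name `snap_to_beats` are assumptions made for illustration; the abstract does not specify how keyframes are matched to beats, so this is not the authors' procedure.

```python
from bisect import bisect_left
from typing import List


def snap_to_beats(keyframe_times: List[float], beat_times: List[float]) -> List[float]:
    """Move each keyframe time to the nearest beat time.

    Hypothetical helper: `beat_times` must be sorted in ascending order.
    """
    if not beat_times:
        raise ValueError("beat_times must not be empty")
    snapped = []
    for t in keyframe_times:
        # Index of the first beat that is >= t.
        i = bisect_left(beat_times, t)
        # Only the beat just before and the beat at/after t can be nearest.
        candidates = beat_times[max(i - 1, 0): i + 1]
        snapped.append(min(candidates, key=lambda b: abs(b - t)))
    return snapped


if __name__ == "__main__":
    beats = [0.0, 0.5, 1.0, 1.5]          # seconds, e.g. from a beat tracker
    keyframes = [0.62, 1.4, 2.0]          # seconds, motion keyframes in the source video
    print(snap_to_beats(keyframes, beats))
```

In this sketch two keyframes can land on the same beat; a real pipeline would likely need a one-to-one, order-preserving assignment before the inpainting step fills in the frames between them.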

Title: Universal Video Temporal Grounding with Generative Multi-modal Large Language Models

Authors: Zeqian Li, Shangzhe Di, Zhonghua Zhai, Weilin Huang, Yanfeng Wang, Weidi Xie
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.18883
Pdf URL: https://arxiv.org/pdf/2506.18883
Copy Paste: [[2506.18883]] Universal Video Temporal Grounding with Generative Multi-modal Large Language Models(https://arxiv.org/abs/2506.18883)
Keywords: generative
Abstract: This paper presents a computational model for universal video temporal grounding, which accurately localizes temporal moments in videos based on natural language queries (e.g., questions or descriptions). Unlike existing methods that are often limited to specific video domains or durations, we propose UniTime, a robust and universal video grounding model leveraging the strong vision-language understanding capabilities of generative Multi-modal Large Language Models (MLLMs). Our model effectively handles videos of diverse views, genres, and lengths while comprehending complex language queries. The key contributions include: (i) We consider steering strong MLLMs for temporal grounding in videos. To enable precise timestamp outputs, we incorporate temporal information by interleaving timestamp tokens with video tokens. (ii) By training the model to handle videos with different input granularities through adaptive frame scaling, our approach achieves robust temporal grounding for both short and long videos. (iii) Comprehensive experiments show that UniTime outperforms state-of-the-art approaches in both zero-shot and dataset-specific finetuned settings across five public temporal grounding benchmarks. (iv) When employed as a preliminary moment retriever for long-form video question-answering (VideoQA), UniTime significantly improves VideoQA accuracy, highlighting its value for complex video understanding tasks.
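Note: contribution (i), interleaving timestamp tokens with video tokens, can be pictured with a small sketch. The token strings, the `<t=...s>` format, and the function name are hypothetical; the abstract does not describe UniTime's actual token layout.

```python
from typing import List


def interleave_timestamps(frame_tokens: List[List[str]], timestamps: List[float]) -> List[str]:
    """Place one textual timestamp token in front of each frame's visual tokens."""
    if len(frame_tokens) != len(timestamps):
        raise ValueError("need exactly one timestamp per frame")
    sequence: List[str] = []
    for t, tokens in zip(timestamps, frame_tokens):
        sequence.append(f"<t={t:.1f}s>")  # hypothetical timestamp-token format
        sequence.extend(tokens)
    return sequence


if __name__ == "__main__":
    frames = [["v0_a", "v0_b"], ["v1_a", "v1_b"]]  # placeholder visual tokens
    print(interleave_timestamps(frames, [0.0, 2.0]))
```

With the timestamps present in the input sequence, a language model can in principle answer a grounding query by referring to the time tokens it has already seen.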

Title: 4D-LRM: Large Space-Time Reconstruction Model From and To Any View at Any Time

Authors: Ziqiao Ma, Xuweiyi Chen, Shoubin Yu, Sai Bi, Kai Zhang, Chen Ziwen, Sihan Xu, Jianing Yang, Zexiang Xu, Kalyan Sunkavalli, Mohit Bansal, Joyce Chai, Hao Tan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.18890
Pdf URL: https://arxiv.org/pdf/2506.18890
Copy Paste: [[2506.18890]] 4D-LRM: Large Space-Time Reconstruction Model From and To Any View at Any Time(https://arxiv.org/abs/2506.18890)
Keywords: generative
Abstract: Can we scale 4D pretraining to learn general space-time representations that reconstruct an object from a few views at some times to any view at any time? We provide an affirmative answer with 4D-LRM, the first large-scale 4D reconstruction model that takes input from unconstrained views and timestamps and renders arbitrary novel view-time combinations. Unlike prior 4D approaches, e.g., optimization-based, geometry-based, or generative, that struggle with efficiency, generalization, or faithfulness, 4D-LRM learns a unified space-time representation and directly predicts per-pixel 4D Gaussian primitives from posed image tokens across time, enabling fast, high-quality rendering at, in principle, infinite frame rate. Our results demonstrate that scaling spatiotemporal pretraining enables accurate and efficient 4D reconstruction. We show that 4D-LRM generalizes to novel objects, interpolates across time, and handles diverse camera setups. It reconstructs 24-frame sequences in one forward pass with less than 1.5 seconds on a single A100 GPU.

Title: Vision as a Dialect: Unifying Visual Understanding and Generation via Text-Aligned Representations

Authors: Jiaming Han, Hao Chen, Yang Zhao, Hanyu Wang, Qi Zhao, Ziyan Yang, Hao He, Xiangyu Yue, Lu Jiang
Subjects: cs.CV, cs.AI, cs.CL, cs.MM
Abstract URL: https://arxiv.org/abs/2506.18898
Pdf URL: https://arxiv.org/pdf/2506.18898
Copy Paste: [[2506.18898]] Vision as a Dialect: Unifying Visual Understanding and Generation via Text-Aligned Representations(https://arxiv.org/abs/2506.18898)
Keywords: generation, generative
Abstract: This paper presents a multimodal framework that attempts to unify visual understanding and generation within a shared discrete semantic representation. At its core is the Text-Aligned Tokenizer (TA-Tok), which converts images into discrete tokens using a text-aligned codebook projected from a large language model's (LLM) vocabulary. By integrating vision and text into a unified space with an expanded vocabulary, our multimodal LLM, Tar, enables cross-modal input and output through a shared interface, without the need for modality-specific designs. Additionally, we propose scale-adaptive encoding and decoding to balance efficiency and visual detail, along with a generative de-tokenizer to produce high-fidelity visual outputs. To address diverse decoding needs, we utilize two complementary de-tokenizers: a fast autoregressive model and a diffusion-based model. To enhance modality fusion, we investigate advanced pre-training tasks, demonstrating improvements in both visual understanding and generation. Experiments across benchmarks show that Tar matches or surpasses existing multimodal LLM methods, achieving faster convergence and greater training efficiency. Code, models, and data are available at this https URL

Title: FilMaster: Bridging Cinematic Principles and Generative AI for Automated Film Generation

Authors: Kaiyi Huang, Yukun Huang, Xintao Wang, Zinan Lin, Xuefei Ning, Pengfei Wan, Di Zhang, Yu Wang, Xihui Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.18899
Pdf URL: https://arxiv.org/pdf/2506.18899
Copy Paste: [[2506.18899]] FilMaster: Bridging Cinematic Principles and Generative AI for Automated Film Generation(https://arxiv.org/abs/2506.18899)
Keywords: generation, generative
Abstract: AI-driven content creation has shown potential in film production. However, existing film generation systems struggle to implement cinematic principles and thus fail to generate professional-quality films, particularly lacking diverse camera language and cinematic rhythm. This results in templated visuals and unengaging narratives. To address this, we introduce FilMaster, an end-to-end AI system that integrates real-world cinematic principles for professional-grade film generation, yielding editable, industry-standard outputs. FilMaster is built on two key principles: (1) learning cinematography from extensive real-world film data and (2) emulating professional, audience-centric post-production workflows. Inspired by these principles, FilMaster incorporates two stages: a Reference-Guided Generation Stage which transforms user input to video clips, and a Generative Post-Production Stage which transforms raw footage into audiovisual outputs by orchestrating visual and auditory elements for cinematic rhythm. Our generation stage highlights a Multi-shot Synergized RAG Camera Language Design module to guide the AI in generating professional camera language by retrieving reference clips from a vast corpus of 440,000 film clips. Our post-production stage emulates professional workflows by designing an Audience-Centric Cinematic Rhythm Control module, including Rough Cut and Fine Cut processes informed by simulated audience feedback, for effective integration of audiovisual elements to achieve engaging content. The system is empowered by generative AI models like (M)LLMs and video generation models. Furthermore, we introduce FilmEval, a comprehensive benchmark for evaluating AI-generated films. Extensive experiments show FilMaster's superior performance in camera language design and cinematic rhythm control, advancing generative AI in professional filmmaking.

Title: From Virtual Games to Real-World Play

Authors: Wenqiang Sun, Fangyun Wei, Jinjing Zhao, Xi Chen, Zilong Chen, Hongyang Zhang, Jun Zhang, Yan Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.18901
Pdf URL: https://arxiv.org/pdf/2506.18901
Copy Paste: [[2506.18901]] From Virtual Games to Real-World Play(https://arxiv.org/abs/2506.18901)
Keywords: generation
Abstract: We introduce RealPlay, a neural network-based real-world game engine that enables interactive video generation from user control signals. Unlike prior works focused on game-style visuals, RealPlay aims to produce photorealistic, temporally consistent video sequences that resemble real-world footage. It operates in an interactive loop: users observe a generated scene, issue a control command, and receive a short video chunk in response. To enable such realistic and responsive generation, we address key challenges including iterative chunk-wise prediction for low-latency feedback, temporal consistency across iterations, and accurate control response. RealPlay is trained on a combination of labeled game data and unlabeled real-world videos, without requiring real-world action annotations. Notably, we observe two forms of generalization: (1) control transfer: RealPlay effectively maps control signals from virtual to real-world scenarios; and (2) entity transfer: although training labels originate solely from a car racing game, RealPlay generalizes to control diverse real-world entities, including bicycles and pedestrians, beyond vehicles. Project page can be found: this https URL

Title: VMem: Consistent Interactive Video Scene Generation with Surfel-Indexed View Memory

Authors: Runjia Li, Philip Torr, Andrea Vedaldi, Tomas Jakab
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.18903
Pdf URL: https://arxiv.org/pdf/2506.18903
Copy Paste: [[2506.18903]] VMem: Consistent Interactive Video Scene Generation with Surfel-Indexed View Memory(https://arxiv.org/abs/2506.18903)
Keywords: generation
Abstract: We propose a novel memory mechanism to build video generators that can explore environments interactively. Similar results have previously been achieved by out-painting 2D views of the scene while incrementally reconstructing its 3D geometry, which quickly accumulates errors, or by video generators with a short context window, which struggle to maintain scene coherence over the long term. To address these limitations, we introduce Surfel-Indexed View Memory (VMem), a mechanism that remembers past views by indexing them geometrically based on the 3D surface elements (surfels) they have observed. VMem enables the efficient retrieval of the most relevant past views when generating new ones. By focusing only on these relevant views, our method produces consistent explorations of imagined environments at a fraction of the computational cost of using all past views as context. We evaluate our approach on challenging long-term scene synthesis benchmarks and demonstrate superior performance compared to existing methods in maintaining scene coherence and camera control.
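Note: the idea of indexing past views by the surfels they observed, then retrieving the most relevant views for a new camera, can be sketched as an inverted index. The integer surfel IDs, the vote-count ranking, and the class name `SurfelViewMemory` are assumptions made for illustration; the abstract does not give VMem's data structures or scoring rule.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Set


class SurfelViewMemory:
    """Toy inverted index: surfel id -> ids of the past views that observed it."""

    def __init__(self) -> None:
        self._views_by_surfel: Dict[int, Set[int]] = defaultdict(set)

    def add_view(self, view_id: int, observed_surfels: Iterable[int]) -> None:
        """Record which surfels a newly generated view has observed."""
        for surfel in observed_surfels:
            self._views_by_surfel[surfel].add(view_id)

    def retrieve(self, visible_surfels: Iterable[int], k: int) -> List[int]:
        """Return up to k past views sharing the most surfels with the new camera."""
        votes: Counter = Counter()
        for surfel in set(visible_surfels):
            for view_id in self._views_by_surfel.get(surfel, ()):
                votes[view_id] += 1
        # Most shared surfels first; ties broken by view id for determinism.
        ranked = sorted(votes.items(), key=lambda item: (-item[1], item[0]))
        return [view_id for view_id, _ in ranked[:k]]


if __name__ == "__main__":
    memory = SurfelViewMemory()
    memory.add_view(0, [1, 2, 3])
    memory.add_view(1, [3, 4])
    memory.add_view(2, [7])
    # Surfels assumed visible from the new camera pose (hypothetical input).
    print(memory.retrieve([3, 4, 5], k=2))
```

Only the retrieved views would then be passed to the generator as context, which is where the cost saving over conditioning on all past views comes from.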