2025-06-12

Title: Enhancing the Safety of Medical Vision-Language Models by Synthetic Demonstrations

Authors: Zhiyu Xue, Reza Abbasi-Asl, Ramtin Pedarsani
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.09067
Pdf URL: https://arxiv.org/pdf/2506.09067
Copy Paste: [[2506.09067]] Enhancing the Safety of Medical Vision-Language Models by Synthetic Demonstrations(https://arxiv.org/abs/2506.09067)
Keywords: generative
Abstract: Generative medical vision-language models (Med-VLMs) are primarily designed to generate complex textual information (e.g., diagnostic reports) from multimodal inputs including vision modality (e.g., medical images) and language modality (e.g., clinical queries). However, their security vulnerabilities remain underexplored. Med-VLMs should be capable of rejecting harmful queries, such as "Provide detailed instructions for using this CT scan for insurance fraud". At the same time, addressing security concerns introduces the risk of over-defense, where safety-enhancing mechanisms may degrade general performance, causing Med-VLMs to reject benign clinical queries. In this paper, we propose a novel inference-time defense strategy to mitigate harmful queries, enabling defense against visual and textual jailbreak attacks. Using diverse medical imaging datasets collected from nine modalities, we demonstrate that our defense strategy based on synthetic clinical demonstrations enhances model safety without significantly compromising performance. Additionally, we find that increasing the demonstration budget alleviates the over-defense issue. We then introduce a mixed demonstration strategy as a trade-off solution for balancing security and performance under few-shot demonstration budget constraints.

Title: BG-HOP: A Bimanual Generative Hand-Object Prior

Authors: Sriram Krishna, Sravan Chittupalli, Sungjae Park
Subjects: cs.CV, cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2506.09068
Pdf URL: https://arxiv.org/pdf/2506.09068
Copy Paste: [[2506.09068]] BG-HOP: A Bimanual Generative Hand-Object Prior(https://arxiv.org/abs/2506.09068)
Keywords: generative
Abstract: In this work, we present BG-HOP, a generative prior that seeks to model bimanual hand-object interactions in 3D. We address the challenge of limited bimanual interaction data by extending existing single-hand generative priors, demonstrating preliminary results in capturing the joint distribution of hands and objects. Our experiments showcase the model's capability to generate bimanual interactions and synthesize grasps for given objects. We make code and models publicly available.

Title: FlagEvalMM: A Flexible Framework for Comprehensive Multimodal Model Evaluation

Authors: Zheqi He, Yesheng Liu, Jing-shu Zheng, Xuejing Li, Richeng Xuan, Jin-Ge Yao, Xi Yang
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2506.09081
Pdf URL: https://arxiv.org/pdf/2506.09081
Copy Paste: [[2506.09081]] FlagEvalMM: A Flexible Framework for Comprehensive Multimodal Model Evaluation(https://arxiv.org/abs/2506.09081)
Keywords: generation
Abstract: We present FlagEvalMM, an open-source evaluation framework designed to comprehensively assess multimodal models across a diverse range of vision-language understanding and generation tasks, such as visual question answering, text-to-image/video generation, and image-text retrieval. We decouple model inference from evaluation through an independent evaluation service, thus enabling flexible resource allocation and seamless integration of new tasks and models. Moreover, FlagEvalMM utilizes advanced inference acceleration tools (e.g., vLLM, SGLang) and asynchronous data loading to significantly enhance evaluation efficiency. Extensive experiments show that FlagEvalMM offers accurate and efficient insights into model strengths and limitations, making it a valuable tool for advancing multimodal research. The framework is publicly accessible at https://github.com/flageval-baai/FlagEvalMM.

Title: AVA-Bench: Atomic Visual Ability Benchmark for Vision Foundation Models

Authors: Zheda Mai, Arpita Chowdhury, Zihe Wang, Sooyoung Jeon, Lemeng Wang, Jiacheng Hou, Jihyung Kil, Wei-Lun Chao
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.09082
Pdf URL: https://arxiv.org/pdf/2506.09082
Copy Paste: [[2506.09082]] AVA-Bench: Atomic Visual Ability Benchmark for Vision Foundation Models(https://arxiv.org/abs/2506.09082)
Keywords: generation
Abstract: The rise of vision foundation models (VFMs) calls for systematic evaluation. A common approach pairs VFMs with large language models (LLMs) as general-purpose heads, followed by evaluation on broad Visual Question Answering (VQA) benchmarks. However, this protocol has two key blind spots: (i) the instruction tuning data may not align with VQA test distributions, meaning a wrong prediction can stem from such data mismatch rather than a VFM's visual shortcomings; (ii) VQA benchmarks often require multiple visual abilities, making it hard to tell whether errors stem from lacking all required abilities or just a single critical one. To address these gaps, we introduce AVA-Bench, the first benchmark that explicitly disentangles 14 Atomic Visual Abilities (AVAs) -- foundational skills like localization, depth estimation, and spatial understanding that collectively support complex visual reasoning tasks. By decoupling AVAs and matching training and test distributions within each, AVA-Bench pinpoints exactly where a VFM excels or falters. Applying AVA-Bench to leading VFMs thus reveals distinctive "ability fingerprints," turning VFM selection from educated guesswork into principled engineering. Notably, we find that a 0.5B LLM yields VFM rankings similar to those of a 7B LLM while cutting GPU hours by 8x, enabling more efficient evaluation. By offering a comprehensive and transparent benchmark, we hope AVA-Bench lays the foundation for the next generation of VFMs.

Title: BakuFlow: A Streamlining Semi-Automatic Label Generation Tool

Authors: Jerry Lin, Partick P. W. Chen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.09083
Pdf URL: https://arxiv.org/pdf/2506.09083
Copy Paste: [[2506.09083]] BakuFlow: A Streamlining Semi-Automatic Label Generation Tool(https://arxiv.org/abs/2506.09083)
Keywords: generation
Abstract: Accurate data labeling (or annotation) is still a bottleneck in computer vision, especially for large-scale tasks where manual labeling is time-consuming and error-prone. While tools like LabelImg can handle the labeling task, some of them still require annotators to manually label each image. In this paper, we introduce BakuFlow, a streamlining semi-automatic label generation tool. Key features include (1) a live adjustable magnifier for pixel-precise manual corrections, improving user experience; (2) an interactive data augmentation module to diversify training datasets; (3) label propagation for rapidly copying labeled objects between consecutive frames, greatly accelerating annotation of video data; and (4) an automatic labeling module powered by a modified YOLOE framework. Unlike the original YOLOE, our extension supports adding new object classes and any number of visual prompts per class during annotation, enabling flexible and scalable labeling for dynamic, real-world datasets. These innovations make BakuFlow especially effective for object detection and tracking, substantially reducing labeling workload and improving efficiency in practical computer vision and industrial scenarios.

Title: LLM-ML Teaming: Integrated Symbolic Decoding and Gradient Search for Valid and Stable Generative Feature Transformation

Authors: Xinyuan Wang, Haoyue Bai, Nanxu Gong, Wangyang Ying, Sixun Dong, Xiquan Cui, Yanjie Fu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.09085
Pdf URL: https://arxiv.org/pdf/2506.09085
Copy Paste: [[2506.09085]] LLM-ML Teaming: Integrated Symbolic Decoding and Gradient Search for Valid and Stable Generative Feature Transformation(https://arxiv.org/abs/2506.09085)
Keywords: generation, generative
Abstract: Feature transformation enhances data representation by deriving new features from the original data. Generative AI offers potential for this task, but faces challenges in stable generation (consistent outputs) and valid generation (error-free sequences). Existing methods--traditional MLs' low validity and LLMs' instability--fail to resolve both. We find that LLMs ensure valid syntax, while ML's gradient-steered search stabilizes performance. To bridge this gap, we propose a teaming framework combining LLMs' symbolic generation with ML's gradient optimization. This framework includes four steps: (1) golden example generation, aiming to prepare high-quality samples with the ground knowledge of the teacher LLM; (2) feature transformation sequence embedding and search, intending to uncover potentially superior embeddings within the latent space; (3) student LLM feature transformation, aiming to distill knowledge from the teacher LLM; (4) LLM-ML decoder teaming, dedicated to combining ML and the student LLM probabilities for valid and stable generation. The experiments on various datasets show that the teaming policy can achieve 5% improvement in downstream performance while reducing nearly half of the error cases. The results also demonstrate the efficiency and robustness of the teaming policy. Additionally, we have exciting findings on LLMs' capacity to understand the original data.

Title: CUDA-LLM: LLMs Can Write Efficient CUDA Kernels

Authors: Wentao Chen, Jiace Zhu, Qi Fan, Yehan Ma, An Zou
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.09092
Pdf URL: https://arxiv.org/pdf/2506.09092
Copy Paste: [[2506.09092]] CUDA-LLM: LLMs Can Write Efficient CUDA Kernels(https://arxiv.org/abs/2506.09092)
Keywords: generation
Abstract: Large Language Models (LLMs) have demonstrated strong capabilities in general-purpose code generation. However, generating code that is deeply hardware-specific, architecture-aware, and performance-critical, especially for massively parallel GPUs, remains a complex challenge. In this work, we explore the use of LLMs for the automated generation and optimization of CUDA programs, with the goal of producing high-performance GPU kernels that fully exploit the underlying hardware. To address this challenge, we propose a novel framework called Feature Search and Reinforcement (FSR). FSR jointly optimizes compilation and functional correctness, as well as the runtime performance, which are validated through extensive and diverse test cases, and measured by actual kernel execution latency on the target GPU, respectively. This approach enables LLMs not only to generate syntactically and semantically correct CUDA code but also to iteratively refine it for efficiency, tailored to the characteristics of the GPU architecture. We evaluate FSR on representative CUDA kernels, covering AI workloads and computationally intensive algorithms. Our results show that LLMs augmented with FSR consistently guarantee correctness rates. Meanwhile, the automatically generated kernels can outperform general human-written code by a factor of up to 179x in execution speeds. These findings highlight the potential of combining LLMs with performance reinforcement to automate GPU programming for hardware-specific, architecture-sensitive, and performance-critical applications.
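The abstract describes validating candidates on test cases and then measuring execution latency. A minimal Python sketch of that validate-then-time selection step follows. It is an illustration under stated assumptions, not the FSR framework itself: plain Python functions stand in for generated CUDA kernels, and the names `is_correct`, `measure_latency`, `select_kernel`, and the three candidates are all hypothetical.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple

# Assumption: a "kernel" is modelled as a Python callable; a real pipeline
# would compile CUDA source and time it on the target GPU instead.
Kernel = Callable[[List[float]], float]
TestCase = Tuple[List[float], float]

def is_correct(kernel: Kernel, tests: List[TestCase], tol: float = 1e-9) -> bool:
    """Return True only if the candidate runs and matches every expected output."""
    try:
        return all(abs(kernel(x) - expected) <= tol for x, expected in tests)
    except Exception:
        return False  # a crashing candidate counts as incorrect

def measure_latency(kernel: Kernel, sample: List[float], repeats: int = 5) -> float:
    """Best-of-`repeats` wall-clock time in seconds for one call."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        kernel(sample)
        best = min(best, time.perf_counter() - start)
    return best

def select_kernel(candidates: Dict[str, Kernel], tests: List[TestCase],
                  sample: List[float]) -> Optional[str]:
    """Discard incorrect candidates, then return the name of the fastest one."""
    valid = {name: k for name, k in candidates.items() if is_correct(k, tests)}
    if not valid:
        return None
    return min(valid, key=lambda name: measure_latency(valid[name], sample))

def loop_sum(xs: List[float]) -> float:
    total = 0.0
    for x in xs:
        total += x
    return total

# Hypothetical candidates for a sum reduction; "buggy" drops the first element.
candidates: Dict[str, Kernel] = {
    "loop_sum": loop_sum,
    "builtin_sum": lambda xs: float(sum(xs)),
    "buggy": lambda xs: float(sum(xs[1:])),
}
tests: List[TestCase] = [([1.0, 2.0, 3.0], 6.0), ([0.5], 0.5), ([], 0.0)]
chosen = select_kernel(candidates, tests, sample=[1.0] * 10_000)
print("selected:", chosen)
```

Correctness acts as a hard filter before timing, so a fast but wrong candidate can never be selected. An iterative refinement loop would feed the measured latencies back into the next round of generation.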

Title: Intra-Trajectory Consistency for Reward Modeling

Authors: Chaoyang Zhou, Shunyu Liu, Zengmao Wang, Di Wang, Rong-Cheng Tu, Bo Du, Dacheng Tao
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.09096
Pdf URL: https://arxiv.org/pdf/2506.09096
Copy Paste: [[2506.09096]] Intra-Trajectory Consistency for Reward Modeling(https://arxiv.org/abs/2506.09096)
Keywords: generation
Abstract: Reward models are critical for improving large language models (LLMs), particularly in reinforcement learning from human feedback (RLHF) or inference-time verification. Current reward modeling typically relies on scores of overall responses to learn the outcome rewards for the responses. However, since the response-level scores are coarse-grained supervision signals, the reward model struggles to identify the specific components within a response trajectory that truly correlate with the scores, leading to poor generalization on unseen responses. In this paper, we propose to leverage generation probabilities to establish reward consistency between processes in the response trajectory, which allows the response-level supervisory signal to propagate across processes, thereby providing additional fine-grained signals for reward learning. Building on analysis under the Bayesian framework, we develop an intra-trajectory consistency regularization to enforce that adjacent processes with higher next-token generation probability maintain more consistent rewards. We apply the proposed regularization to the advanced outcome reward model, improving its performance on RewardBench. Besides, we show that the reward model trained with the proposed regularization induces better DPO-aligned policies and achieves better best-of-N (BON) inference-time verification results. Our code is provided in this https URL.
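The consistency idea in this abstract can be illustrated with a small penalty term. The sketch below is one plausible instantiation and not the paper's loss: it assumes the reward model outputs a scalar reward for every prefix of a response, and it weights the squared difference between adjacent prefix rewards by the next-token generation probability. The function name `consistency_penalty` and the numbers are hypothetical.

```python
from typing import Sequence

def consistency_penalty(prefix_rewards: Sequence[float],
                        next_token_probs: Sequence[float]) -> float:
    """Mean of p_t * (r_{t+1} - r_t)^2 over adjacent prefixes.

    prefix_rewards[t]   : reward assigned to the first t tokens (assumed available)
    next_token_probs[t] : generator probability of the token that extends
                          prefix t into prefix t+1
    A high-probability step is penalised more for changing the reward.
    """
    if len(next_token_probs) == 0:
        raise ValueError("need at least one adjacent pair of prefixes")
    if len(prefix_rewards) != len(next_token_probs) + 1:
        raise ValueError("expected one more reward than probabilities")
    terms = [
        p * (r_next - r) ** 2
        for r, r_next, p in zip(prefix_rewards, prefix_rewards[1:], next_token_probs)
    ]
    return sum(terms) / len(terms)

# Hypothetical trajectory with three prefixes and two transitions.
penalty = consistency_penalty([0.2, 0.4, 0.4], [0.5, 0.9])
print(f"penalty: {penalty:.4f}")
```

In training, a term like this would be added to the usual response-level reward loss with a tunable weight, so that the response-level label still drives learning while adjacent prefixes are pulled toward consistent rewards.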

Title: Bias Analysis in Unconditional Image Generative Models

Authors: Xiaofeng Zhang, Michelle Lin, Simon Lacoste-Julien, Aaron Courville, Yash Goyal
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2506.09106
Pdf URL: https://arxiv.org/pdf/2506.09106
Copy Paste: [[2506.09106]] Bias Analysis in Unconditional Image Generative Models(https://arxiv.org/abs/2506.09106)
Keywords: generation, generative
Abstract: The widespread adoption of generative AI models has raised growing concerns about representational harm and potential discriminatory outcomes. Yet, despite growing literature on this topic, the mechanisms by which bias emerges - especially in unconditional generation - remain disentangled. We define the bias of an attribute as the difference between the probability of its presence in the observed distribution and its expected proportion in an ideal reference distribution. In our analysis, we train a set of unconditional image generative models and adopt a commonly used bias evaluation framework to study bias shift between training and generated distributions. Our experiments reveal that the detected attribute shifts are small. We find that the attribute shifts are sensitive to the attribute classifier used to label generated images in the evaluation framework, particularly when its decision boundaries fall in high-density regions. Our empirical analysis indicates that this classifier sensitivity is often observed in attribute values that lie on a spectrum, as opposed to exhibiting a binary nature. This highlights the need for more representative labeling practices, understanding the shortcomings through greater scrutiny of evaluation frameworks, and recognizing the socially complex nature of attributes when evaluating bias.
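The bias definition in this abstract amounts to bias(a) = P_observed(a) - P_reference(a). A minimal Python sketch follows, assuming binary attribute labels produced by some attribute classifier and using the training set's attribute frequency as the reference proportion. The function name `attribute_bias` and the label lists are illustrative and not taken from the paper.

```python
from typing import Sequence

def attribute_bias(observed: Sequence[int], reference_proportion: float) -> float:
    """Bias = P(attribute present in observed samples) - reference proportion."""
    if not observed:
        raise ValueError("need at least one observed label")
    p_observed = sum(observed) / len(observed)
    return p_observed - reference_proportion

# Hypothetical classifier labels: 1 = attribute present, 0 = absent.
train_labels = [1, 0, 1, 1, 0, 0, 1, 0]      # attribute present in 4 of 8
generated_labels = [1, 1, 1, 0, 1, 0, 1, 0]  # attribute present in 5 of 8

reference = sum(train_labels) / len(train_labels)
shift = attribute_bias(generated_labels, reference)
print(f"bias shift: {shift:+.3f}")
```

Because both proportions come from classifier outputs, a classifier whose decision boundary sits in a dense region can flip many labels and move the measured shift, which is the sensitivity the abstract reports.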

Title: SensorLM: Learning the Language of Wearable Sensors

Authors: Yuwei Zhang, Kumar Ayush, Siyuan Qiao, A. Ali Heydari, Girish Narayanswamy, Maxwell A. Xu, Ahmed A. Metwally, Shawn Xu, Jake Garrison, Xuhai Xu, Tim Althoff, Yun Liu, Pushmeet Kohli, Jiening Zhan, Mark Malhotra, Shwetak Patel, Cecilia Mascolo, Xin Liu, Daniel McDuff, Yuzhe Yang
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2506.09108
Pdf URL: https://arxiv.org/pdf/2506.09108
Copy Paste: [[2506.09108]] SensorLM: Learning the Language of Wearable Sensors(https://arxiv.org/abs/2506.09108)
Keywords: generation
Abstract: We present SensorLM, a family of sensor-language foundation models that enable wearable sensor data understanding with natural language. Despite its pervasive nature, aligning and interpreting sensor data with language remains challenging due to the lack of paired, richly annotated sensor-text descriptions in uncurated, real-world wearable data. We introduce a hierarchical caption generation pipeline designed to capture statistical, structural, and semantic information from sensor data. This approach enabled the curation of the largest sensor-language dataset to date, comprising over 59.7 million hours of data from more than 103,000 people. Furthermore, SensorLM extends prominent multimodal pretraining architectures (e.g., CLIP, CoCa) and recovers them as specific variants within a generic architecture. Extensive experiments on real-world tasks in human activity analysis and healthcare verify the superior performance of SensorLM over state-of-the-art in zero-shot recognition, few-shot learning, and cross-modal retrieval. SensorLM also demonstrates intriguing capabilities including scaling behaviors, label efficiency, sensor captioning, and zero-shot generalization to unseen tasks.

Title: Seedance 1.0: Exploring the Boundaries of Video Generation Models

Authors: Yu Gao, Haoyuan Guo, Tuyen Hoang, Weilin Huang, Lu Jiang, Fangyuan Kong, Huixia Li, Jiashi Li, Liang Li, Xiaojie Li, Xunsong Li, Yifu Li, Shanchuan Lin, Zhijie Lin, Jiawei Liu, Shu Liu, Xiaonan Nie, Zhiwu Qing, Yuxi Ren, Li Sun, Zhi Tian, Rui Wang, Sen Wang, Guoqiang Wei, Guohong Wu, Jie Wu, Ruiqi Xia, Fei Xiao, Xuefeng Xiao, Jiangqiao Yan, Ceyuan Yang, Jianchao Yang, Runkai Yang, Tao Yang, Yihang Yang, Zilyu Ye, Xuejiao Zeng, Yan Zeng, Heng Zhang, Yang Zhao, Xiaozheng Zheng, Peihao Zhu, Jiaxin Zou, Feilong Zuo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.09113
Pdf URL: https://arxiv.org/pdf/2506.09113
Copy Paste: [[2506.09113]] Seedance 1.0: Exploring the Boundaries of Video Generation Models(https://arxiv.org/abs/2506.09113)
Keywords: generation
Abstract: Notable breakthroughs in diffusion modeling have propelled rapid improvements in video generation, yet current foundational models still face critical challenges in simultaneously balancing prompt following, motion plausibility, and visual quality. In this report, we introduce Seedance 1.0, a high-performance and inference-efficient video foundation generation model that integrates several core technical improvements: (i) multi-source data curation augmented with precise and meaningful video captioning, enabling comprehensive learning across diverse scenarios; (ii) an efficient architecture design with a proposed training paradigm, which natively supports multi-shot generation and joint learning of both text-to-video and image-to-video tasks; (iii) carefully optimized post-training approaches leveraging fine-grained supervised fine-tuning and video-specific RLHF with multi-dimensional reward mechanisms for comprehensive performance improvements; (iv) excellent model acceleration achieving ~10x inference speedup through multi-stage distillation strategies and system-level optimizations. Seedance 1.0 can generate a 5-second video at 1080p resolution in only 41.4 seconds (NVIDIA-L20). Compared to state-of-the-art video generation models, Seedance 1.0 stands out with high-quality and fast video generation having superior spatiotemporal fluidity with structural stability, precise instruction adherence in complex multi-subject contexts, and native multi-shot narrative coherence with consistent subject representation.

Title: TRACE: Grounding Time Series in Context for Multimodal Embedding and Retrieval

Authors: Jialin Chen, Ziyu Zhao, Gaukhar Nurbek, Aosong Feng, Ali Maatouk, Leandros Tassiulas, Yifeng Gao, Rex Ying
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.09114
Pdf URL: https://arxiv.org/pdf/2506.09114
Copy Paste: [[2506.09114]] TRACE: Grounding Time Series in Context for Multimodal Embedding and Retrieval(https://arxiv.org/abs/2506.09114)
Keywords: generation
Abstract: The ubiquity of dynamic data in domains such as weather, healthcare, and energy underscores a growing need for effective interpretation and retrieval of time-series data. These data are inherently tied to domain-specific contexts, such as clinical notes or weather narratives, making cross-modal retrieval essential not only for downstream tasks but also for developing robust time-series foundation models by retrieval-augmented generation (RAG). Despite the increasing demand, time-series retrieval remains largely underexplored. Existing methods often lack semantic grounding, struggle to align heterogeneous modalities, and have limited capacity for handling multi-channel signals. To address this gap, we propose TRACE, a generic multimodal retriever that grounds time-series embeddings in aligned textual context. TRACE enables fine-grained channel-level alignment and employs hard negative mining to facilitate semantically meaningful retrieval. It supports flexible cross-modal retrieval modes, including Text-to-Timeseries and Timeseries-to-Text, effectively linking linguistic descriptions with complex temporal patterns. By retrieving semantically relevant pairs, TRACE enriches downstream models with informative context, leading to improved predictive accuracy and interpretability. Beyond a static retrieval engine, TRACE also serves as a powerful standalone encoder, with lightweight task-specific tuning that refines context-aware representations while maintaining strong cross-modal alignment. These representations achieve state-of-the-art performance on downstream forecasting and classification tasks. Extensive experiments across multiple domains highlight its dual utility, as both an effective encoder for downstream applications and a general-purpose retriever to enhance time-series models.

Title: MultiNet: An Open-Source Software Toolkit & Benchmark Suite for the Evaluation and Adaptation of Multimodal Action Models

Authors: Pranav Guruprasad, Yangyue Wang, Harshvardhan Sikka
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2506.09172
Pdf URL: https://arxiv.org/pdf/2506.09172
Copy Paste: [[2506.09172]] MultiNet: An Open-Source Software Toolkit & Benchmark Suite for the Evaluation and Adaptation of Multimodal Action Models(https://arxiv.org/abs/2506.09172)
Keywords: generation
Abstract: Recent innovations in multimodal action models represent a promising direction for developing general-purpose agentic systems, combining visual understanding, language comprehension, and action generation. We introduce MultiNet - a novel, fully open-source benchmark and surrounding software ecosystem designed to rigorously evaluate and adapt models across vision, language, and action domains. We establish standardized evaluation protocols for assessing vision-language models (VLMs) and vision-language-action models (VLAs), and provide open source software to download relevant data, models, and evaluations. Additionally, we provide a composite dataset with over 1.3 trillion tokens of image captioning, visual question answering, commonsense reasoning, robotic control, digital game-play, simulated locomotion/manipulation, and many more tasks. The MultiNet benchmark, framework, toolkit, and evaluation harness have been used in downstream research on the limitations of VLA generalization.

Title: FedRAG: A Framework for Fine-Tuning Retrieval-Augmented Generation Systems

Authors: Val Andrei Fajardo, David B. Emerson, Amandeep Singh, Veronica Chatrath, Marcelo Lotif, Ravi Theja, Alex Cheung, Izuki Matsubi
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2506.09200
Pdf URL: https://arxiv.org/pdf/2506.09200
Copy Paste: [[2506.09200]] FedRAG: A Framework for Fine-Tuning Retrieval-Augmented Generation Systems(https://arxiv.org/abs/2506.09200)
Keywords: generation
Abstract: Retrieval-augmented generation (RAG) systems have been shown to be effective in addressing many of the drawbacks of relying solely on the parametric memory of large language models. Recent work has demonstrated that RAG systems can be improved via fine-tuning of their retriever and generator models. In this work, we introduce FedRAG, a framework for fine-tuning RAG systems across centralized and federated architectures. FedRAG supports state-of-the-art fine-tuning methods, offering a simple and intuitive interface and a seamless conversion from centralized to federated training tasks. FedRAG is also deeply integrated with the modern RAG ecosystem, filling a critical gap in available tools.

Title: Policy-Based Trajectory Clustering in Offline Reinforcement Learning

Authors: Hao Hu, Xinqi Wang, Simon Shaolei Du
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.09202
Pdf URL: https://arxiv.org/pdf/2506.09202
Copy Paste: [[2506.09202]] Policy-Based Trajectory Clustering in Offline Reinforcement Learning(https://arxiv.org/abs/2506.09202)
Keywords: generation
Abstract: We introduce a novel task of clustering trajectories from offline reinforcement learning (RL) datasets, where each cluster center represents the policy that generated its trajectories. By leveraging the connection between the KL-divergence of offline trajectory distributions and a mixture of policy-induced distributions, we formulate a natural clustering objective. To solve this, we propose Policy-Guided K-means (PG-Kmeans) and Centroid-Attracted Autoencoder (CAAE). PG-Kmeans iteratively trains behavior cloning (BC) policies and assigns trajectories based on policy generation probabilities, while CAAE resembles the VQ-VAE framework by guiding the latent representations of trajectories toward the vicinity of specific codebook entries to achieve clustering. Theoretically, we prove the finite-step convergence of PG-Kmeans and identify a key challenge in offline trajectory clustering: the inherent ambiguity of optimal solutions due to policy-induced conflicts, which can result in multiple equally valid but structurally distinct clusterings. Experimentally, we validate our methods on the widely used D4RL dataset and custom GridWorld environments. Our results show that both PG-Kmeans and CAAE effectively partition trajectories into meaningful clusters. They offer a promising framework for policy-based trajectory clustering, with broad applications in offline RL and beyond.
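The PG-Kmeans loop described in this abstract alternates between fitting one behavior cloning policy per cluster and reassigning each trajectory to the policy most likely to have generated it. A minimal Python sketch follows, under assumptions that are not from the paper: states and actions are small discrete sets, behavior cloning is a count-based tabular estimate with Laplace smoothing, and the initial assignment is round-robin. All function names are hypothetical.

```python
import math
from typing import List, Tuple

Trajectory = List[Tuple[int, int]]  # sequence of (state, action) pairs
Policy = List[List[float]]          # policy[s][a] = probability of action a in state s

def fit_bc_policy(trajs: List[Trajectory], n_states: int, n_actions: int,
                  alpha: float = 1.0) -> Policy:
    """Tabular behavior cloning: smoothed action frequencies per state."""
    counts = [[alpha] * n_actions for _ in range(n_states)]
    for traj in trajs:
        for s, a in traj:
            counts[s][a] += 1.0
    return [[c / sum(row) for c in row] for row in counts]

def log_likelihood(traj: Trajectory, policy: Policy) -> float:
    """Log-probability that `policy` picks the actions seen in `traj`."""
    return sum(math.log(policy[s][a]) for s, a in traj)

def pg_kmeans(trajs: List[Trajectory], k: int, n_states: int, n_actions: int,
              max_iters: int = 50) -> Tuple[List[int], List[Policy]]:
    assign = [i % k for i in range(len(trajs))]  # assumed round-robin init
    for _ in range(max_iters):
        # "Centroid" update: one cloned policy per cluster.
        policies = [
            fit_bc_policy([t for t, c in zip(trajs, assign) if c == j],
                          n_states, n_actions)
            for j in range(k)
        ]
        # Assignment update: most likely generating policy for each trajectory.
        new_assign = [
            max(range(k), key=lambda j: log_likelihood(t, policies[j]))
            for t in trajs
        ]
        if new_assign == assign:
            break  # assignments stopped changing
        assign = new_assign
    return assign, policies

# Toy data: one behavior always takes action 0, the other always action 1.
traj_a: Trajectory = [(0, 0), (1, 0), (2, 0), (0, 0)]
traj_b: Trajectory = [(0, 1), (1, 1), (2, 1), (0, 1)]
trajs = [traj_a] * 3 + [traj_b] * 3
labels, policies = pg_kmeans(trajs, k=2, n_states=3, n_actions=2)
print(labels)
```

The structure mirrors ordinary K-means, with a cloned policy in place of a centroid and negative log-likelihood in place of squared distance. As with K-means, the outcome can depend on the initial assignment, which relates to the ambiguity of optimal solutions that the abstract highlights.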

Title: SoK: Machine Unlearning for Large Language Models

Authors: Jie Ren, Yue Xing, Yingqian Cui, Charu C. Aggarwal, Hui Liu
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2506.09227
Pdf URL: https://arxiv.org/pdf/2506.09227
Copy Paste: [[2506.09227]] SoK: Machine Unlearning for Large Language Models(https://arxiv.org/abs/2506.09227)
Keywords: generative
Abstract: Large language model (LLM) unlearning has become a critical topic in machine learning, aiming to eliminate the influence of specific training data or knowledge without retraining the model from scratch. A variety of techniques have been proposed, including Gradient Ascent, model editing, and re-steering hidden representations. While existing surveys often organize these methods by their technical characteristics, such classifications tend to overlook a more fundamental dimension: the underlying intention of unlearning--whether it seeks to truly remove internal knowledge or merely suppress its behavioral effects. In this SoK paper, we propose a new taxonomy based on this intention-oriented perspective. Building on this taxonomy, we make three key contributions. First, we revisit recent findings suggesting that many removal methods may functionally behave like suppression, and explore whether true removal is necessary or achievable. Second, we survey existing evaluation strategies, identify limitations in current metrics and benchmarks, and suggest directions for developing more reliable and intention-aligned evaluations. Third, we highlight practical challenges--such as scalability and support for sequential unlearning--that currently hinder the broader deployment of unlearning methods. In summary, this work offers a comprehensive framework for understanding and advancing unlearning in generative AI, aiming to support future research and guide policy decisions around data removal and privacy.
摘要：大型语言模型（LLM）的学习已成为机器学习的关键主题，旨在消除特定培训数据或知识的影响，而无需从头开始重述模型。已经提出了各种技术，包括梯度上升，模型编辑和重新启动隐藏表示形式。尽管现有的调查通常通过其技术特征来组织这些方法，但这种分类倾向于忽略一个更基本的维度：学习的潜在意图 - 无论是试图真正消除内部知识还是仅仅抑制其行为影响。在这篇SOK论文中，我们提出了一种基于这种面向意图的观点的新分类法。在这种分类法的基础上，我们做出了三个关键贡献。首先，我们重新审视了最近的发现，表明许多删除方法在功能上可能表现得像抑制作用，并探讨了真正的去除是必要的还是可以实现的。其次，我们调查了现有的评估策略，确定当前指标和基准的局限性，并提出了开发更可靠和意图一致的评估的方向。第三，我们重点介绍了实用的挑战 - 例如可扩展性和对顺序学习的支持 - 目前阻碍了更广泛的学习方法的部署。总而言之，这项工作提供了一个综合框架，可以理解和推进生成AI的学习，旨在支持未来的研究并指导围绕数据删除和隐私的政策决策。

Title: Agent-based Condition Monitoring Assistance with Multimodal Industrial Database Retrieval Augmented Generation

Authors: Karl Löwenmark, Daniel Strömbergsson, Chang Liu, Marcus Liwicki, Fredrik Sandin
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.09247
Pdf URL: https://arxiv.org/pdf/2506.09247
Copy Paste: [[2506.09247]] Agent-based Condition Monitoring Assistance with Multimodal Industrial Database Retrieval Augmented Generation(https://arxiv.org/abs/2506.09247)
Keywords: generation
Abstract: Condition monitoring (CM) plays a crucial role in ensuring reliability and efficiency in the process industry. Although computerised maintenance systems effectively detect and classify faults, tasks like fault severity estimation and maintenance decisions still largely depend on human expert analysis. The analysis and decision making automatically performed by current systems typically exhibit considerable uncertainty and high false alarm rates, leading to increased workload and reduced efficiency. This work integrates large language model (LLM)-based reasoning agents with CM workflows to address analyst and industry needs, namely reducing false alarms, enhancing fault severity estimation, improving decision support, and offering explainable interfaces. We propose MindRAG, a modular framework combining multimodal retrieval-augmented generation (RAG) with novel vector store structures designed specifically for CM data. The framework leverages existing annotations and maintenance work orders as surrogates for labels in a supervised learning protocol, addressing the common challenge of training predictive models on unlabelled and noisy real-world datasets. The primary contributions include: (1) an approach for structuring industry CM data into a semi-structured multimodal vector store compatible with LLM-driven workflows; (2) developing multimodal RAG techniques tailored for CM data; (3) developing practical reasoning agents capable of addressing real-world CM queries; and (4) presenting an experimental framework for integrating and evaluating such agents in realistic industrial scenarios. Preliminary results, evaluated with the help of an experienced analyst, indicate that MindRAG provides meaningful decision support for more efficient management of alarms, thereby improving the interpretability of CM systems.
摘要：条件监测（CM）在确保过程行业的可靠性和效率方面起着至关重要的作用。尽管计算机维护系统有效地检测和分类了故障，但故障严重性估计和维护决策等任务仍然在很大程度上取决于人类专家分析。当前系统自动执行的分析和决策通常会表现出很大的不确定性和高误报率，从而增加了工作量和降低效率。这项工作将基于大型语言模型（LLM）的推理代理与CM工作流程集成在一起，以满足分析师和行业需求，即减少错误警报，增强故障严重性估算，改善决策支持并提供可解释的接口。我们提出了MindRag，这是一种模块化框架，将多式联运检索生成（RAG）与专门为CM数据设计的新型矢量存储结构结合在一起。该框架利用现有的注释和维护工作订单作为监督学习协议中标签的替代品，解决了对未标记和嘈杂的现实世界数据集培训预测模型的共同挑战。主要贡献包括：（1）将行业CM数据构造到半结构化的多模式矢量商店中的方法；（2）开发针对CM数据量身定制的多模式抹布技术；（3）开发能够解决现实CM查询的实用推理代理；（4）提出一个实验框架，用于在现实的工业场景中整合和评估此类药物。在经验丰富的分析师的帮助下进行评估的初步结果表明，Mindrag为更有效地管理警报提供了有意义的决策支持，从而提高了CM系统的解释性。

Title: G-Sim: Generative Simulations with Large Language Models and Gradient-Free Calibration

Authors: Samuel Holt, Max Ruiz Luyten, Antonin Berthon, Mihaela van der Schaar
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2506.09272
Pdf URL: https://arxiv.org/pdf/2506.09272
Copy Paste: [[2506.09272]] G-Sim: Generative Simulations with Large Language Models and Gradient-Free Calibration(https://arxiv.org/abs/2506.09272)
Keywords: generative
Abstract: Constructing robust simulators is essential for asking "what if?" questions and guiding policy in critical domains like healthcare and logistics. However, existing methods often struggle, either failing to generalize beyond historical data or, when using Large Language Models (LLMs), suffering from inaccuracies and poor empirical alignment. We introduce G-Sim, a hybrid framework that automates simulator construction by synergizing LLM-driven structural design with rigorous empirical calibration. G-Sim employs an LLM in an iterative loop to propose and refine a simulator's core components and causal relationships, guided by domain knowledge. This structure is then grounded in reality by estimating its parameters using flexible calibration techniques. Specifically, G-Sim can leverage methods that are both likelihood-free and gradient-free with respect to the simulator, such as gradient-free optimization for direct parameter estimation or simulation-based inference for obtaining a posterior distribution over parameters. This allows it to handle non-differentiable and stochastic simulators. By integrating domain priors with empirical evidence, G-Sim produces reliable, causally-informed simulators, mitigating data-inefficiency and enabling robust system-level interventions for complex decision-making.
摘要：构建强大的模拟器对于问“如果呢？”至关重要。在医疗保健和物流等关键领域中的问题和指导政策。但是，现有的方法通常很难，要么无法概括历史数据超出历史数据，要么在使用大型语言模型（LLMS）时，缺乏准确性和差的经验一致性。我们介绍了G-SIM，这是一种混合框架，通过与严格的经验校准协同结合LLM驱动的结构设计来自动化模拟器结构。 G-SIM在迭代循环中采用LLM来提出和完善模拟器的核心组成部分和因果关系，并在领域知识的指导下。然后，通过使用灵活的校准技术估算其参数来实现这种结构。具体而言，G-SIM可以利用相对于模拟器的可能性不含可能性和无梯度的方法，例如用于直接参数估计的无梯度优化或基于仿真的推断，用于在参数上获得后验分布。这使其可以处理非差异和随机模拟器。通过将领域的先验与经验证据整合在一起，G-SIM产生可靠的，因果关系的模拟器，可缓解数据信息，并为复杂的决策做出强大的系统级干预措施。
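
The idea of calibrating a non-differentiable simulator without gradients or a likelihood can be illustrated with a toy example. The sketch below is an assumption-laden illustration rather than G-Sim itself: the simulator (`simulate_arrival_rate`), the single parameter `theta`, the summary statistic, and the use of exhaustive grid search as the gradient-free optimizer are all hypothetical choices made for brevity.

```python
import random


def simulate_arrival_rate(theta, noise):
    """Toy simulator: fraction of time steps with an arrival.

    Thresholding the noise makes the output a step function of theta,
    so gradients with respect to theta are not available.
    """
    return sum(1 for u in noise if u < theta) / len(noise)


def calibrate(observed_stat, candidates, noise):
    """Gradient-free calibration: keep the candidate whose simulated
    summary statistic is closest to the observed one."""
    return min(
        candidates,
        key=lambda th: abs(simulate_arrival_rate(th, noise) - observed_stat),
    )


rng = random.Random(0)
true_theta = 0.3  # hidden parameter used only to produce "observed" data

# Observed data and calibration runs use independent noise draws.
observed = simulate_arrival_rate(true_theta, [rng.random() for _ in range(20000)])
noise = [rng.random() for _ in range(20000)]  # reused across candidates

grid = [i / 100 for i in range(101)]
theta_hat = calibrate(observed, grid, noise)
print(theta_hat)
```

Reusing the same noise for every candidate (common random numbers) keeps the comparison between candidates from being dominated by sampling noise. A more capable optimizer or simulation-based inference would replace the grid in a realistic setting.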

Title: Natural Language Guided Ligand-Binding Protein Design

Authors: Zhenqiao Song, Ramith Hettiarachchi, Chuan Li, Jianwen Xie, Lei Li
Subjects: cs.LG, cs.CE, cs.CL
Abstract URL: https://arxiv.org/abs/2506.09332
Pdf URL: https://arxiv.org/pdf/2506.09332
Copy Paste: [[2506.09332]] Natural Language Guided Ligand-Binding Protein Design(https://arxiv.org/abs/2506.09332)
Keywords: generative
Abstract: Can AI protein models follow human language instructions and design proteins with desired functions (e.g. binding to a ligand)? Designing proteins that bind to a given ligand is crucial in a wide range of applications in biology and chemistry. Most prior AI models are trained on protein-ligand complex data, which is scarce due to the high cost and time requirements of laboratory experiments. In contrast, there is a substantial body of human-curated text descriptions about protein-ligand interactions and ligand formula. In this paper, we propose InstructPro, a family of protein generative models that follow natural language instructions to design ligand-binding proteins. Given a textual description of the desired function and a ligand formula in SMILES, InstructPro generates protein sequences that are functionally consistent with the specified instructions. We develop the model architecture, training strategy, and a large-scale dataset, InstructProBench, to support both training and evaluation. InstructProBench consists of 9,592,829 triples of (function description, ligand formula, protein sequence). We train two model variants: InstructPro-1B (with 1 billion parameters) and InstructPro-3B (with 3 billion parameters). Both variants consistently outperform strong baselines, including ProGen2, ESM3, and Pinal. Notably, InstructPro-1B achieves the highest docking success rate (81.52% at moderate confidence) and the lowest average root mean square deviation (RMSD) compared to ground truth structures (4.026Å). InstructPro-3B further decreases the average RMSD to 2.527Å, demonstrating InstructPro's ability to generate ligand-binding proteins that align with the functional specifications.
摘要：AI蛋白模型可以遵循人类语言指示和设计具有所需功能的蛋白质（例如与配体结合）吗？设计与给定配体结合的蛋白质在生物学和化学中的广泛应用中至关重要。大多数先前的AI模型都接受了蛋白质配体复杂数据的培训，由于实验室实验的成本和时间要求很高，因此很少。相比之下，关于蛋白质 - 配体相互作用和配体配方的人类策划的文本描述大量存在。在本文中，我们提出了遵循自然语言指示的蛋白质生成模型家族指GrusthPro，以设计配体结合蛋白质。给定对所需函数的文本描述和微笑中的配体公式，指Gentermpro生成了与指定指令在功能上一致的蛋白质序列。我们开发了模型架构，培训策略和大规模数据集，即指示性能，以支持培训和评估。指示性底座由9,592,829个三元组成（功能描述，配体配方，蛋白质序列）。我们训练两个模型变体：指示Pro-1b（具有10亿参数）和指示Pro-3b〜（具有30亿个参数）。两种变体始终胜过强大的基准，包括后代2，ESM3和Pinal。值得注意的是，与地面真相结构（4.026Å）相比，指示Pro-1b达到了最高的对接成功率（中等置信度为81.52％）和最低的平均根平方偏差（RMSD）（4.026Å）。指示Pro-3b进一步描述了平均RMSD为2.527Å，证明了PenchentPro生成与功能规格一致的配体结合蛋白的能力。
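
For readers unfamiliar with the RMSD figures quoted above, the metric itself is simple to compute. The sketch below assumes two equally long lists of already matched and already superimposed 3D coordinates in Å; the function name `rmsd` and the toy coordinates are illustrative, and the abstract does not state how the authors align structures before measuring.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean square deviation between two matched coordinate lists.

    Assumes the structures are already superimposed; no alignment is done here.
    """
    if not coords_a or len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


reference = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
# Every atom displaced by (3, 4, 0), i.e. by exactly 5 Å.
shifted = [(x + 3.0, y + 4.0, z) for x, y, z in reference]

print(rmsd(reference, reference))
print(rmsd(reference, shifted))
```

Because every atom in `shifted` is displaced by the same 5 Å vector, the second value equals that displacement. In practice a superposition step would remove such rigid shifts before the deviation is measured.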

Title: CheckManual: A New Challenge and Benchmark for Manual-based Appliance Manipulation

Authors: Yuxing Long, Jiyao Zhang, Mingjie Pan, Tianshu Wu, Taewhan Kim, Hao Dong
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2506.09343
Pdf URL: https://arxiv.org/pdf/2506.09343
Copy Paste: [[2506.09343]] CheckManual: A New Challenge and Benchmark for Manual-based Appliance Manipulation(https://arxiv.org/abs/2506.09343)
Keywords: generation
Abstract: Correct use of electrical appliances has significantly improved human life quality. Unlike simple tools that can be manipulated with common sense, different parts of electrical appliances have specific functions defined by manufacturers. If we want a robot to heat bread in a microwave, we should enable it to review the microwave manual first. From the manual, it can learn about component functions, interaction methods, and representative task steps for appliances. However, previous manual-related works remain limited to question-answering tasks, while existing manipulation researchers ignore the manual's important role and fail to comprehend multi-page manuals. In this paper, we propose the first manual-based appliance manipulation benchmark, CheckManual. Specifically, we design a large model-assisted, human-revised data generation pipeline to create manuals based on CAD appliance models. With these manuals, we establish novel manual-based manipulation challenges, metrics, and simulator environments for model performance evaluation. Furthermore, we propose the first manual-based manipulation planning model, ManualPlan, to set up a group of baselines for the CheckManual benchmark.
摘要：正确使用电器可以显着改善人类生活质量。与可以通过常识来操纵的简单工具不同，电器的不同部分具有制造商定义的特定功能。如果我们希望机器人用微波炉加热面包，则应使它们能够先查看微波手册。从手册中，它可以了解组件函数，交互方法以及有关设备的代表性任务步骤。但是，以前与手动相关的作品仍然仅限于提问任务，而现有的操纵研究人员忽略了手册的重要作用，并且无法理解多页手册。在本文中，我们提出了第一个基于手动的设备操纵基准测试手册。具体而言，我们设计了一个大型模型辅助的人工修复的数据生成管道，以创建基于CAD设备模型的手册。使用这些手册，我们建立了新颖的基于手动的操纵挑战，指标和模拟器环境，以进行模型性能评估。此外，我们提出了第一个基于手动的操纵计划模型手册计划，以为CheckManual Bench Marked设置一组基线。

Title: Autoregressive Adversarial Post-Training for Real-Time Interactive Video Generation

Authors: Shanchuan Lin, Ceyuan Yang, Hao He, Jianwen Jiang, Yuxi Ren, Xin Xia, Yang Zhao, Xuefeng Xiao, Lu Jiang
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.09350
Pdf URL: https://arxiv.org/pdf/2506.09350
Copy Paste: [[2506.09350]] Autoregressive Adversarial Post-Training for Real-Time Interactive Video Generation(https://arxiv.org/abs/2506.09350)
Keywords: generation
Abstract: Existing large-scale video generation models are computationally intensive, preventing adoption in real-time and interactive applications. In this work, we propose autoregressive adversarial post-training (AAPT) to transform a pre-trained latent video diffusion model into a real-time, interactive video generator. Our model autoregressively generates a latent frame at a time using a single neural function evaluation (1NFE). The model can stream the result to the user in real time and receive interactive responses as controls to generate the next latent frame. Unlike existing approaches, our method explores adversarial training as an effective paradigm for autoregressive generation. This not only allows us to design an architecture that is more efficient for one-step generation while fully utilizing the KV cache, but also enables training the model in a student-forcing manner that proves to be effective in reducing error accumulation during long video generation. Our experiments demonstrate that our 8B model achieves real-time, 24fps, streaming video generation at 736x416 resolution on a single H100, or 1280x720 on 8xH100 up to a minute long (1440 frames). Visit our research website at this https URL
摘要：现有的大规模视频生成模型在计算上是密集的，可防止在实时和交互式应用中采用。在这项工作中，我们提出了自回归的对抗后训练（AAPT），以将预训练的潜在视频扩散模型转换为实时的交互式视频生成器。我们的模型自动加工一次使用单个神经功能评估（1NFE）一次生成潜在框架。该模型可以实时将结果传输到用户，并接收交互式响应作为控件以生成下一个潜在帧。与现有方法不同，我们的方法探讨了对抗性训练，作为自动回归产生的有效范式。这不仅使我们能够设计一种对一步生成更有效的体系结构，同时充分利用KV缓存，而且还可以以学生训练的方式培训模型，该方式被证明可以有效地减少长时间视频生成期间的错误积累。我们的实验表明，我们的8B模型可实现实时的24fps，单个H100的736x416分辨率流动视频生成，或8xH100的1280x720，最多为一分钟（1440帧）。访问我们的研究网站，网址为HTTPS URL
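
The throughput figures in this abstract can be cross-checked with a little arithmetic. The sketch below only uses numbers stated above (24 fps, 1440 frames); the variable names are illustrative, and the per-frame budget is a derived quantity rather than one the authors report.

```python
FPS = 24        # stated real-time frame rate
FRAMES = 1440   # stated maximum length in frames

# Length of the longest clip in seconds.
duration_s = FRAMES / FPS

# Wall-clock budget for generating each latent frame if streaming keeps up with playback.
frame_budget_ms = 1000 / FPS

print(duration_s, round(frame_budget_ms, 2))
```

The 1440-frame limit corresponds to the one-minute clip length mentioned in the abstract, and real-time streaming leaves roughly 42 ms for each single-evaluation generation step.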

Title: SAGE: Exploring the Boundaries of Unsafe Concept Domain with Semantic-Augment Erasing

Authors: Hongguang Zhu, Yunchao Wei, Mengyu Wang, Siyu Jiao, Yan Fang, Jiannan Huang, Yao Zhao
Subjects: cs.CV, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2506.09363
Pdf URL: https://arxiv.org/pdf/2506.09363
Copy Paste: [[2506.09363]] SAGE: Exploring the Boundaries of Unsafe Concept Domain with Semantic-Augment Erasing(https://arxiv.org/abs/2506.09363)
Keywords: generation
Abstract: Diffusion models (DMs) have achieved significant progress in text-to-image generation. However, the inevitable inclusion of sensitive information during pre-training poses safety risks, such as unsafe content generation and copyright infringement. Concept erasing finetunes weights to unlearn undesirable concepts, and has emerged as a promising solution. However, existing methods treat an unsafe concept as a fixed word and repeatedly erase it, trapping DMs in a "word concept abyss", which prevents generalized concept-related erasing. To escape this abyss, we introduce semantic-augment erasing, which transforms concept word erasure into concept domain erasure through cyclic self-check and self-erasure. It efficiently explores and unlearns the boundary representation of the concept domain through semantic spatial relationships between original and training DMs, without requiring additional preprocessed data. Meanwhile, to mitigate the retention degradation of irrelevant concepts while erasing unsafe concepts, we further propose a global-local collaborative retention mechanism that combines global semantic relationship alignment with local predicted noise preservation, effectively expanding the retentive receptive field for irrelevant concepts. We name our method SAGE, and extensive experiments demonstrate the comprehensive superiority of SAGE compared with other methods in the safe generation of DMs. The code and weights will be open-sourced at this https URL.
摘要：扩散模型（DMS）已在文本到图像生成方面取得了重大进展。但是，在预训练期间不可避免地包含敏感信息会带来安全风险，例如不安全的内容产生和版权侵犯。概念删除了Finetunes权重以取消不良概念，并成为有前途的解决方案。但是，现有方法将不安全的概念视为一个固定的单词，并反复删除它，将DMS捕获在``单词概念bebyss''中，该概念阻止了与概念有关的概念相关的擦除。为了逃避这种深渊，我们介绍了语义射程擦除，该擦除将概念词删除转化为概念域擦除，并通过环状自我检查和自我估计。它通过原始DM和训练DMS之间的语义空间关系有效地探索并取消了概念域的边界表示，而无需额外的预处理数据。同时，为了减轻无关概念的保留下降，同时消除了不安全的概念，我们进一步提出了将全球局部合作保留机制与局部预测的噪声保存结合在一起，有效地扩展了与无关概念的保留接收场。我们命名了我们的鼠尾草方法，并且广泛的实验证明了与安全生成DMS中的其他方法相比，SAGE的全面优势。代码和权重将在此HTTPS URL上开源。

Title: Anomaly Detection and Generation with Diffusion Models: A Survey

Authors: Yang Liu, Jing Liu, Chengfang Li, Rui Xi, Wenchao Li, Liang Cao, Jin Wang, Laurence T. Yang, Junsong Yuan, Wei Zhou
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.09368
Pdf URL: https://arxiv.org/pdf/2506.09368
Copy Paste: [[2506.09368]] Anomaly Detection and Generation with Diffusion Models: A Survey(https://arxiv.org/abs/2506.09368)
Keywords: generation
Abstract: Anomaly detection (AD) plays a pivotal role across diverse domains, including cybersecurity, finance, healthcare, and industrial manufacturing, by identifying unexpected patterns that deviate from established norms in real-world data. Recent advancements in deep learning, specifically diffusion models (DMs), have sparked significant interest due to their ability to learn complex data distributions and generate high-fidelity samples, offering a robust framework for unsupervised AD. In this survey, we comprehensively review anomaly detection and generation with diffusion models (ADGDM), presenting a tutorial-style analysis of the theoretical foundations and practical implementations and spanning images, videos, time series, tabular, and multimodal data. Crucially, unlike existing surveys that often treat anomaly detection and generation as separate problems, we highlight their inherent synergistic relationship. We reveal how DMs enable a reinforcing cycle where generation techniques directly address the fundamental challenge of anomaly data scarcity, while detection methods provide critical feedback to improve generation fidelity and relevance, advancing both capabilities beyond their individual potential. A detailed taxonomy categorizes ADGDM methods based on anomaly scoring mechanisms, conditioning strategies, and architectural designs, analyzing their strengths and limitations. We finally discuss key challenges including scalability and computational efficiency, and outline promising future directions such as efficient architectures, conditioning strategies, and integration with foundation models (e.g., visual-language models and large language models). By synthesizing recent advances and outlining open research questions, this survey aims to guide researchers and practitioners in leveraging DMs for innovative AD solutions across diverse applications.
摘要：异常检测（AD）通过确定偏离现实世界数据中既定规范的意外模式，在包括网络安全，金融，医疗保健和工业制造中起关键作用。深度学习，特别是扩散模型（DMS）的最新进展引起了人们的兴趣，因为它们能够学习复杂的数据分布并生成高保真样本，从而为无监督的AD提供了强大的框架。在这项调查中，我们通过扩散模型（ADGDM）全面回顾了异常检测和产生，对理论基础和实际实现以及跨越图像，视频，时间序列，表格和多模式数据进行了教程风格的分析。至关重要的是，与经常将异常检测和产生视为单独问题的现有调查不同，我们强调了它们固有的协同关系。我们揭示了DMS如何启用加强周期，其中发电技术直接解决了异常数据稀缺的基本挑战，而检测方法则提供了关键的反馈，以提高发电的忠诚度和相关性，从而超越了个人潜力。详细的分类法对ADGDM方法进行了分类，该方法基于异常评分机制，调理策略和建筑设计，分析了它们的优势和局限性。我们最终讨论了关键挑战，包括可扩展性和计算效率，以及概述有希望的未来方向，例如有效的体系结构，调理策略以及与基础模型（例如，视觉语言模型和大型语言模型）的集成。通过综合最新进展并概述了开放研究问题，该调查旨在指导研究人员和从业人员利用DMS来为各种应用程序进行创新的AD解决方案。

Title: Revisiting Diffusion Models: From Generative Pre-training to One-Step Generation

Authors: Bowen Zheng, Tianming Yang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.09376
Pdf URL: https://arxiv.org/pdf/2506.09376
Copy Paste: [[2506.09376]] Revisiting Diffusion Models: From Generative Pre-training to One-Step Generation(https://arxiv.org/abs/2506.09376)
Keywords: generation, generative
Abstract: Diffusion distillation is a widely used technique to reduce the sampling cost of diffusion models, yet it often requires extensive training, and the student performance tends to be degraded. Recent studies show that incorporating a GAN objective may alleviate these issues, yet the underlying mechanism remains unclear. In this work, we first identify a key limitation of distillation: mismatched step sizes and parameter numbers between the teacher and the student model lead them to converge to different local minima, rendering direct imitation suboptimal. We further demonstrate that a standalone GAN objective, without relying on a distillation loss, overcomes this limitation and is sufficient to convert diffusion models into efficient one-step generators. Based on this finding, we propose that diffusion training may be viewed as a form of generative pre-training, equipping models with capabilities that can be unlocked through lightweight GAN fine-tuning. Supporting this view, we create a one-step generation model by fine-tuning a pre-trained model with 85% of parameters frozen, achieving strong performance with only 0.2M images and near-SOTA results with 5M images. We further present a frequency-domain analysis that may explain the one-step generative capability gained in diffusion training. Overall, our work provides a new perspective for diffusion training, highlighting its role as a powerful generative pre-training process, which can be the basis for building efficient one-step generation models.
摘要：扩散蒸馏是一种广泛使用的技术，可降低扩散模型的采样成本，但通常需要进行广泛的培训，并且学生的表现往往会降低。最近的研究表明，结合目标可能会减轻这些问题，但基本机制尚不清楚。在这项工作中，我们首先确定蒸馏的关键局限性：教师和学生模型之间的不匹配的步进大小和参数数量导致他们收敛到不同的本地最小值，从而使直接模仿次优。我们进一步证明，一个独立的gan目标而不依赖蒸馏损失，克服了这一限制，足以将扩散模型转换为有效的一步生成器。基于这一发现，我们建议将扩散训练视为一种生成预训练的形式，将模型装备有能力，可以通过轻巧的gan微调来解锁。在支持此视图的情况下，我们通过微调冻结参数的85％的预训练模型来创建一个步骤生成的模型，仅使用0.2m的图像和5M图像的近距离图像实现了强劲的性能。我们进一步提出了频域分析，该分析可以解释扩散训练中获得的一步生成能力。总体而言，我们的工作为扩散训练提供了新的观点，强调了其作为强大的生成预训练过程的作用，这可能是建立有效的一步生成模型的基础。

Title: Synthetic Human Action Video Data Generation with Pose Transfer

Authors: Vaclav Knapp, Matyas Bohacek
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.09411
Pdf URL: https://arxiv.org/pdf/2506.09411
Copy Paste: [[2506.09411]] Synthetic Human Action Video Data Generation with Pose Transfer(https://arxiv.org/abs/2506.09411)
Keywords: generation
Abstract: In video understanding tasks, particularly those involving human motion, synthetic data generation often suffers from uncanny features, diminishing its effectiveness for training. Tasks such as sign language translation, gesture recognition, and human motion understanding in autonomous driving have thus been unable to exploit the full potential of synthetic data. This paper proposes a method for generating synthetic human action video data using pose transfer (specifically, controllable 3D Gaussian avatar models). We evaluate this method on the Toyota Smarthome and NTU RGB+D datasets and show that it improves performance in action recognition tasks. Moreover, we demonstrate that the method can effectively scale few-shot datasets, making up for groups underrepresented in the real training data and adding diverse backgrounds. We open-source the method along with RANDOM People, a dataset with videos and avatars of novel human identities for pose transfer crowd-sourced from the internet.
摘要：在视频理解任务，尤其是涉及人类运动的任务中，合成数据的生成通常具有不可思议的功能，从而降低了其训练有效性。因此，在自主驾驶中的手语翻译，手势识别和人类运动理解等任务无法利用合成数据的全部潜力。本文提出了一种使用姿势转移生成合成人类动作视频数据的方法（特别是可控的3D高斯化身模型）。我们在Toyota Smarthome和NTU RGB+D数据集上评估了此方法，并表明它可以提高行动识别任务中的性能。此外，我们证明该方法可以有效地扩展几个数据集，从而弥补了实际培训数据中所占的分数并增加了不同的背景。我们与随机人一起开放该方法，一个数据集，其中包含来自互联网的姿势转移的新型人类身份的视频和化身。

Title: Noise Conditional Variational Score Distillation

Authors: Xinyu Peng, Ziyang Zheng, Yaoming Wang, Han Li, Nuowen Kan, Wenrui Dai, Chenglin Li, Junni Zou, Hongkai Xiong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.09416
Pdf URL: https://arxiv.org/pdf/2506.09416
Copy Paste: [[2506.09416]] Noise Conditional Variational Score Distillation(https://arxiv.org/abs/2506.09416)
Keywords: generation, generative
Abstract: We propose Noise Conditional Variational Score Distillation (NCVSD), a novel method for distilling pretrained diffusion models into generative denoisers. We achieve this by revealing that the unconditional score function implicitly characterizes the score function of denoising posterior distributions. By integrating this insight into the Variational Score Distillation (VSD) framework, we enable scalable learning of generative denoisers capable of approximating samples from the denoising posterior distribution across a wide range of noise levels. The proposed generative denoisers exhibit desirable properties that allow fast generation while preserving the benefit of iterative refinement: (1) fast one-step generation through sampling from pure Gaussian noise at high noise levels; (2) improved sample quality by scaling the test-time compute with multi-step sampling; and (3) zero-shot probabilistic inference for flexible and controllable sampling. We evaluate NCVSD through extensive experiments, including class-conditional image generation and inverse problem solving. By scaling the test-time compute, our method outperforms teacher diffusion models and is on par with consistency models of larger sizes. Additionally, with significantly fewer NFEs than diffusion-based methods, we achieve record-breaking LPIPS on inverse problems.
摘要：我们提出了噪声条件变化评分蒸馏（NCVSD），这是一种将预验扩散模型提炼成生成式Deoisiser的新方法。我们通过揭示无条件得分函数隐式表征后验分布的得分函数来实现这一目标。通过将这种洞察力集成到变分得分蒸馏（VSD）框架中，我们可以可扩展地学习生成的DeNoisers，能够近似于在较大噪声水平的后验分布中近似样品。提出的生成型e依者表现出理想的特性，可以快速生成，同时保持迭代精致的好处：（1）通过在高噪声水平下从纯高斯噪声中抽样快速一步生成；（2）通过通过多步抽样缩放测试时间计算来提高样品质量；（3）用于柔性和可控采样的零射概率推断。我们通过广泛的实验评估NCVSD，包括阶级条件图像产生和逆问题解决。通过缩放测试时间计算，我们的方法优于教师扩散模型，并且与较大尺寸的一致性模型相当。此外，我们的NFE明显少于基于扩散的方法，我们在反问题上实现了创纪录的LPIP。

Title: A High-Quality Dataset and Reliable Evaluation for Interleaved Image-Text Generation

Authors: Yukang Feng, Jianwen Sun, Chuanhao Li, Zizhen Li, Jiaxin Ai, Fanrui Zhang, Yifan Chang, Sizhuo Zhou, Shenglin Zhang, Yu Dai, Kaipeng Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.09427
Pdf URL: https://arxiv.org/pdf/2506.09427
Copy Paste: [[2506.09427]] A High-Quality Dataset and Reliable Evaluation for Interleaved Image-Text Generation(https://arxiv.org/abs/2506.09427)
Keywords: generation
Abstract: Recent advancements in Large Multimodal Models (LMMs) have significantly improved multimodal understanding and generation. However, these models still struggle to generate tightly interleaved image-text outputs, primarily due to the limited scale, quality and instructional richness of current training datasets. To address this, we introduce InterSyn, a large-scale multimodal dataset constructed using our Self-Evaluation with Iterative Refinement (SEIR) method. InterSyn features multi-turn, instruction-driven dialogues with tightly interleaved image-text responses, providing rich object diversity and rigorous automated quality refinement, making it well-suited for training next-generation instruction-following LMMs. Furthermore, to address the lack of reliable evaluation tools capable of assessing interleaved multimodal outputs, we introduce SynJudge, an automatic evaluation model designed to quantitatively assess multimodal outputs along four dimensions: text content, image content, image quality, and image-text synergy. Experimental studies show that the SEIR method leads to substantially higher dataset quality compared to an otherwise identical process without refinement. Moreover, LMMs trained on InterSyn achieve uniform performance gains across all evaluation metrics, confirming InterSyn's utility for advancing multimodal systems.
摘要：大型多模型模型（LMM）的最新进展已显着改善了多模式的理解和产生。但是，这些模型仍然难以生成紧密交织的图像文本输出，这主要是由于当前培训数据集的规模，质量和教学丰富性有限。为了解决这个问题，我们介绍了Intersone，这是一种使用我们的自我评估的大规模多模式数据集（SEIR）方法。 Intersone具有多弯曲，指令驱动的对话，并具有紧密交织的Imagetext响应，提供了丰富的对象多样性和严格的自动质量优化，非常适合培训下一代指令遵循LMMS。此外，为了解决缺乏能够评估交织的多模式输出的可靠评估工具，我们引入了Synjudge，这是一种自动评估模型，旨在沿着四个维度进行定量评估多模式输出：文本内容，图像内容，图像质量，图像质量和图像文本协同作用。实验研究表明，与没有细化的过程相比，SEIR方法与其他相同的过程相比，数据集质量大大提高。此外，经过培训的Intersone培训的LMM在所有评估指标中都取得了统一的性能提高，从而证实了Intersone的效用，用于推进多模式系统。

Title: Marrying Autoregressive Transformer and Diffusion with Multi-Reference Autoregression

Authors: Dingcheng Zhen, Qian Qiao, Tan Yu, Kangxi Wu, Ziwei Zhang, Siyuan Liu, Shunshun Yin, Ming Tao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.09482
Pdf URL: https://arxiv.org/pdf/2506.09482
Copy Paste: [[2506.09482]] Marrying Autoregressive Transformer and Diffusion with Multi-Reference Autoregression(https://arxiv.org/abs/2506.09482)
Keywords: generation
Abstract: We introduce TransDiff, the first image generation model that marries Autoregressive (AR) Transformer with diffusion models. In this joint modeling framework, TransDiff encodes labels and images into high-level semantic features and employs a diffusion model to estimate the distribution of image samples. On the ImageNet 256x256 benchmark, TransDiff significantly outperforms other image generation models based on standalone AR Transformer or diffusion models. Specifically, TransDiff achieves a Fréchet Inception Distance (FID) of 1.61 and an Inception Score (IS) of 293.4, and further provides 2x faster inference compared to state-of-the-art methods based on AR Transformer and 112x faster inference compared to diffusion-only models. Furthermore, building on the TransDiff model, we introduce a novel image generation paradigm called Multi-Reference Autoregression (MRAR), which performs autoregressive generation by predicting the next image. MRAR enables the model to reference multiple previously generated images, thereby facilitating the learning of more diverse representations and improving the quality of generated images in subsequent iterations. By applying MRAR, the performance of TransDiff is improved, with the FID reduced from 1.61 to 1.42. We expect TransDiff to open up a new frontier in the field of image generation.
摘要：我们介绍了Transdiff，这是将自回归（AR）变压器与扩散模型相结合的第一个图像生成模型。在这个联合建模框架中，transdiff将标签和图像编码为高级语义特征，并采用扩散模型来估计图像样品的分布。在Imagenet 256x256基准上，Transdiff明显优于基于独立AR变压器或扩散模型的其他图像生成模型。具体而言，Transdiff达到1.61的FRéchet成立距离（FID），与基于AR变压器的最新方法相比，与基于AR Transformer的最新方法相比，与X112的最新方法相比，与扩散模型相比，与最新的推理相比，推理潜伏期更快。此外，在Transdiff模型的基础上，我们引入了一种称为多引用自动进度（MRAR）的新型图像生成范式，该范式通过预测下一个图像来执行自回归产生。 MRAR使该模型能够参考多个先前生成的图像，从而促进学习更多样化的表示形式，并在随后的迭代中提高生成的图像的质量。通过应用MRAR，Transdiff的性能得到改善，FID从1.61降低到1.42。我们希望Transdiff在图像生成领域开放一个新的边界。
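
To put the reported MRAR improvement in relative terms, the FID change from 1.61 to 1.42 can be expressed as a fraction of the starting value. The helper name `relative_reduction` is illustrative; only the two FID values come from the abstract.

```python
def relative_reduction(before, after):
    """Fractional decrease from `before` to `after` (lower FID is better)."""
    return (before - after) / before


fid_drop = relative_reduction(1.61, 1.42)
print(f"{fid_drop:.1%}")
```

This works out to a relative FID reduction of a little under 12%.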

Title: Revisit What You See: Disclose Language Prior in Vision Tokens for Efficient Guided Decoding of LVLMs

Authors: Beomsik Cho, Jaehyung Kim
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2506.09522
Pdf URL: https://arxiv.org/pdf/2506.09522
Copy Paste: [[2506.09522]] Revisit What You See: Disclose Language Prior in Vision Tokens for Efficient Guided Decoding of LVLMs(https://arxiv.org/abs/2506.09522)
Keywords: generation
Abstract: Large Vision-Language Models (LVLMs) have demonstrated remarkable performance across various multimodal tasks by integrating visual perception with language understanding. However, conventional decoding strategies of LVLMs often fail to successfully utilize visual information, leading to visually ungrounded responses. While various approaches have been proposed to address this limitation, they typically require additional training, multi-step inference procedures, or external model dependencies. This paper introduces ReVisiT, a simple yet effective decoding method that references vision tokens to guide the text generation process in LVLMs. Our approach leverages the semantic information embedded within vision tokens by projecting them into the text token distribution space, and dynamically selecting the most relevant vision token at each decoding step through constrained divergence minimization. This selected vision token is then used to refine the output distribution to better incorporate visual semantics. Experiments on three LVLM hallucination benchmarks with two recent LVLMs demonstrate that ReVisiT consistently enhances visual grounding with minimal computational overhead. Moreover, our method achieves competitive or superior results relative to state-of-the-art baselines while reducing computational costs by up to $2\times$.
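
As a rough illustration of the decoding idea in the ReVisiT abstract, the sketch below picks, among vision tokens already projected into the vocabulary space, the one whose distribution is closest to the model's current output distribution, and then uses it to sharpen that output. The function names (`select_vision_token`, `refine`), the toy three-word vocabulary, the use of plain KL divergence without any constraint, and the product-based refinement rule are all assumptions made for illustration; the abstract does not specify them, so this is not the authors' implementation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def select_vision_token(base_probs, vision_probs):
    """Index of the vision-token distribution with the smallest divergence from base_probs."""
    divergences = [kl_divergence(base_probs, q) for q in vision_probs]
    return min(range(len(vision_probs)), key=divergences.__getitem__)

def refine(base_probs, guide_probs):
    """Hypothetical refinement rule: renormalized element-wise product."""
    combined = [p * q for p, q in zip(base_probs, guide_probs)]
    total = sum(combined)
    return [c / total for c in combined]

# Toy vocabulary: ["cat", "dog", "car"].
base = softmax([1.0, 0.9, 0.2])      # the text-only step is unsure between "cat" and "dog"
vision = [
    softmax([0.1, 0.1, 3.0]),        # a vision token that projects mostly onto "car"
    softmax([0.5, 3.0, 0.1]),        # a vision token that projects mostly onto "dog"
]
best = select_vision_token(base, vision)
refined = refine(base, vision[best])
```

In this toy setup the "dog"-like vision token is selected because it diverges least from the base distribution, and the refined distribution then places most of its mass on "dog".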

Title: Consistent Story Generation with Asymmetry Zigzag Sampling

Authors: Mingxiao LI, mang ning, Marie-Francine Moens
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.09612
Pdf URL: https://arxiv.org/pdf/2506.09612
Copy Paste: [[2506.09612]] Consistent Story Generation with Asymmetry Zigzag Sampling(https://arxiv.org/abs/2506.09612)
Keywords: generation
Abstract: Text-to-image generation models have made significant progress in producing high-quality images from textual descriptions, yet they continue to struggle with maintaining subject consistency across multiple images, a fundamental requirement for visual storytelling. Existing methods attempt to address this by either fine-tuning models on large-scale story visualization datasets, which is resource-intensive, or by using training-free techniques that share information across generations, which still yield limited success. In this paper, we introduce a novel training-free sampling strategy called Zigzag Sampling with Asymmetric Prompts and Visual Sharing to enhance subject consistency in visual story generation. Our approach proposes a zigzag sampling mechanism that alternates between asymmetric prompts to retain subject characteristics, while a visual sharing module transfers visual cues across generated images to further enforce consistency. Experimental results, based on both quantitative metrics and qualitative evaluations, demonstrate that our method significantly outperforms previous approaches in generating coherent and consistent visual stories. The code is available at this https URL.

Title: In-Context Bias Propagation in LLM-Based Tabular Data Generation

Authors: Pol G.Recasens, Alberto Gutierrez, Jordi Torres, Josep.Ll Berral, Anisa Halimi, Kieran Fraser
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.09630
Pdf URL: https://arxiv.org/pdf/2506.09630
Copy Paste: [[2506.09630]] In-Context Bias Propagation in LLM-Based Tabular Data Generation(https://arxiv.org/abs/2506.09630)
Keywords: generation
Abstract: Large Language Models (LLMs) are increasingly used for synthetic tabular data generation through in-context learning (ICL), offering a practical solution for data augmentation in data-scarce scenarios. While prior work has shown the potential of LLMs to improve downstream task performance through augmenting underrepresented groups, these benefits often assume access to a subset of unbiased in-context examples, representative of the real dataset. In real-world settings, however, data is frequently noisy and demographically skewed. In this paper, we systematically study how statistical biases within in-context examples propagate to the distribution of synthetic tabular data, showing that even mild in-context biases lead to global statistical distortions. We further introduce an adversarial scenario where a malicious contributor can inject bias into the synthetic dataset via a subset of in-context examples, ultimately compromising the fairness of downstream classifiers for a targeted and protected subgroup. Our findings demonstrate a new vulnerability associated with LLM-based data generation pipelines that rely on in-context prompts in sensitive domains.

Title: HSENet: Hybrid Spatial Encoding Network for 3D Medical Vision-Language Understanding

Authors: Yanzhao Shi, Xiaodan Zhang, Junzhong Ji, Haoning Jiang, Chengxin Zheng, Yinong Wang, Liangqiong Qu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.09634
Pdf URL: https://arxiv.org/pdf/2506.09634
Copy Paste: [[2506.09634]] HSENet: Hybrid Spatial Encoding Network for 3D Medical Vision-Language Understanding(https://arxiv.org/abs/2506.09634)
Keywords: generation
Abstract: Automated 3D CT diagnosis empowers clinicians to make timely, evidence-based decisions by enhancing diagnostic accuracy and workflow efficiency. While multimodal large language models (MLLMs) exhibit promising performance in visual-language understanding, existing methods mainly focus on 2D medical images, which fundamentally limits their ability to capture complex 3D anatomical structures. This limitation often leads to misinterpretation of subtle pathologies and causes diagnostic hallucinations. In this paper, we present Hybrid Spatial Encoding Network (HSENet), a framework that exploits enriched 3D medical visual cues by effective visual perception and projection for accurate and robust vision-language understanding. Specifically, HSENet employs dual-3D vision encoders to perceive both global volumetric contexts and fine-grained anatomical details, which are pre-trained by dual-stage alignment with diagnostic reports. Furthermore, we propose Spatial Packer, an efficient multimodal projector that condenses high-resolution 3D spatial regions into a compact set of informative visual tokens via centroid-based compression. By assigning spatial packers with dual-3D vision encoders, HSENet can seamlessly perceive and transfer hybrid visual representations to LLM's semantic space, facilitating accurate diagnostic text generation. Experimental results demonstrate that our method achieves state-of-the-art performance in 3D language-visual retrieval (39.85% of R@100, +5.96% gain), 3D medical report generation (24.01% of BLEU-4, +8.01% gain), and 3D visual question answering (73.60% of Major Class Accuracy, +1.99% gain), confirming its effectiveness. Our code is available at this https URL.

Title: FedVLMBench: Benchmarking Federated Fine-Tuning of Vision-Language Models

Authors: Weiying Zheng, Ziyue Lin, Pengxin Guo, Yuyin Zhou, Feifei Wang, Liangqiong Qu
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2506.09638
Pdf URL: https://arxiv.org/pdf/2506.09638
Copy Paste: [[2506.09638]] FedVLMBench: Benchmarking Federated Fine-Tuning of Vision-Language Models(https://arxiv.org/abs/2506.09638)
Keywords: generation
Abstract: Vision-Language Models (VLMs) have demonstrated remarkable capabilities in cross-modal understanding and generation by integrating visual and textual information. While instruction tuning and parameter-efficient fine-tuning methods have substantially improved the generalization of VLMs, most existing approaches rely on centralized training, posing challenges for deployment in domains with strict privacy requirements like healthcare. Recent efforts have introduced Federated Learning (FL) into VLM fine-tuning to address these privacy concerns, yet comprehensive benchmarks for evaluating federated fine-tuning strategies, model architectures, and task generalization remain lacking. In this work, we present FedVLMBench, the first systematic benchmark for federated fine-tuning of VLMs. FedVLMBench integrates two mainstream VLM architectures (encoder-based and encoder-free), four fine-tuning strategies, five FL algorithms, six multimodal datasets spanning four cross-domain single-task scenarios and two cross-domain multitask settings, covering four distinct downstream task categories. Through extensive experiments, we uncover key insights into the interplay between VLM architectures, fine-tuning strategies, data heterogeneity, and multi-task federated optimization. Notably, we find that a 2-layer multilayer perceptron (MLP) connector with concurrent connector and LLM tuning emerges as the optimal configuration for encoder-based VLMs in FL. Furthermore, current FL methods exhibit significantly higher sensitivity to data heterogeneity in vision-centric tasks than text-centric ones, across both encoder-free and encoder-based VLM architectures. Our benchmark provides essential tools, datasets, and empirical guidance for the research community, offering a standardized platform to advance privacy-preserving, federated training of multimodal foundation models.

Title: DGAE: Diffusion-Guided Autoencoder for Efficient Latent Representation Learning

Authors: Dongxu Liu, Yuang Peng, Haomiao Tang, Yuwei Chen, Chunrui Han, Zheng Ge, Daxin Jiang, Mingxue Liao
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.09644
Pdf URL: https://arxiv.org/pdf/2506.09644
Copy Paste: [[2506.09644]] DGAE: Diffusion-Guided Autoencoder for Efficient Latent Representation Learning(https://arxiv.org/abs/2506.09644)
Keywords: generation, generative
Abstract: Autoencoders empower state-of-the-art image and video generative models by compressing pixels into a latent space through visual tokenization. Although recent advances have alleviated the performance degradation of autoencoders under high compression ratios, addressing the training instability caused by GAN remains an open challenge. While improving spatial compression, we also aim to minimize the latent space dimensionality, enabling more efficient and compact representations. To tackle these challenges, we focus on improving the decoder's expressiveness. Concretely, we propose DGAE, which employs a diffusion model to guide the decoder in recovering informative signals that are not fully decoded from the latent representation. With this design, DGAE effectively mitigates the performance degradation under high spatial compression rates. At the same time, DGAE achieves state-of-the-art performance with a 2x smaller latent space. When integrated with Diffusion Models, DGAE demonstrates competitive performance on image generation for ImageNet-1K and shows that this compact latent representation facilitates faster convergence of the diffusion model.

Title: HopaDIFF: Holistic-Partial Aware Fourier Conditioned Diffusion for Referring Human Action Segmentation in Multi-Person Scenarios

Authors: Kunyu Peng, Junchao Huang, Xiangsheng Huang, Di Wen, Junwei Zheng, Yufan Chen, Kailun Yang, Jiamin Wu, Chongqing Hao, Rainer Stiefelhagen
Subjects: cs.CV, cs.LG, cs.MM, cs.RO, eess.IV
Abstract URL: https://arxiv.org/abs/2506.09650
Pdf URL: https://arxiv.org/pdf/2506.09650
Copy Paste: [[2506.09650]] HopaDIFF: Holistic-Partial Aware Fourier Conditioned Diffusion for Referring Human Action Segmentation in Multi-Person Scenarios(https://arxiv.org/abs/2506.09650)
Keywords: generation
Abstract: Action segmentation is a core challenge in high-level video understanding, aiming to partition untrimmed videos into segments and assign each a label from a predefined action set. Existing methods primarily address single-person activities with fixed action sequences, overlooking multi-person scenarios. In this work, we pioneer textual reference-guided human action segmentation in multi-person settings, where a textual description specifies the target person for segmentation. We introduce the first dataset for Referring Human Action Segmentation, i.e., RHAS133, built from 133 movies and annotated with 137 fine-grained actions with 33h video data, together with textual descriptions for this new task. Benchmarking existing action recognition methods on RHAS133 using VLM-based feature extractors reveals limited performance and poor aggregation of visual cues for the target person. To address this, we propose a holistic-partial aware Fourier-conditioned diffusion framework, i.e., HopaDIFF, leveraging a novel cross-input gate attentional xLSTM to enhance holistic-partial long-range reasoning and a novel Fourier condition to introduce more fine-grained control to improve the action segmentation generation. HopaDIFF achieves state-of-the-art results on RHAS133 in diverse evaluation settings. The code is available at this https URL.

Title: CINeMA: Conditional Implicit Neural Multi-Modal Atlas for a Spatio-Temporal Representation of the Perinatal Brain

Authors: Maik Dannecker, Vasiliki Sideri-Lampretsa, Sophie Starck, Angeline Mihailov, Mathieu Milh, Nadine Girard, Guillaume Auzias, Daniel Rueckert
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2506.09668
Pdf URL: https://arxiv.org/pdf/2506.09668
Copy Paste: [[2506.09668]] CINeMA: Conditional Implicit Neural Multi-Modal Atlas for a Spatio-Temporal Representation of the Perinatal Brain(https://arxiv.org/abs/2506.09668)
Keywords: generative
Abstract: Magnetic resonance imaging of fetal and neonatal brains reveals rapid neurodevelopment marked by substantial anatomical changes unfolding within days. Studying this critical stage of the developing human brain, therefore, requires accurate brain models, referred to as atlases, of high spatial and temporal resolution. To meet these demands, established traditional atlases and recently proposed deep learning-based methods rely on large and comprehensive datasets. This poses a major challenge for studying brains in the presence of pathologies for which data remains scarce. We address this limitation with CINeMA (Conditional Implicit Neural Multi-Modal Atlas), a novel framework for creating high-resolution, spatio-temporal, multimodal brain atlases, suitable for low-data settings. Unlike established methods, CINeMA operates in latent space, avoiding compute-intensive image registration and reducing atlas construction times from days to minutes. Furthermore, it enables flexible conditioning on anatomical features including gestational age (GA), birth age, and pathologies like ventriculomegaly (VM) and agenesis of the corpus callosum (ACC). CINeMA supports downstream tasks such as tissue segmentation and age prediction, whereas its generative properties enable synthetic data creation and anatomically informed data augmentation. Surpassing state-of-the-art methods in accuracy, efficiency, and versatility, CINeMA represents a powerful tool for advancing brain research. We release the code and atlases at this https URL.

Title: Towards Practical Alzheimer's Disease Diagnosis: A Lightweight and Interpretable Spiking Neural Model

Authors: Changwei Wu, Yifei Chen, Yuxin Du, Jinying Zong, Jie Dong, Mingxuan Liu, Yong Peng, Jin Fan, Feiwei Qin, Changmiao Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.09695
Pdf URL: https://arxiv.org/pdf/2506.09695
Copy Paste: [[2506.09695]] Towards Practical Alzheimer's Disease Diagnosis: A Lightweight and Interpretable Spiking Neural Model(https://arxiv.org/abs/2506.09695)
Keywords: generation
Abstract: Early diagnosis of Alzheimer's Disease (AD), especially at the mild cognitive impairment (MCI) stage, is vital yet hindered by subjective assessments and the high cost of multimodal imaging modalities. Although deep learning methods offer automated alternatives, their energy inefficiency and computational demands limit real-world deployment, particularly in resource-constrained settings. As a brain-inspired paradigm, spiking neural networks (SNNs) are inherently well-suited for modeling the sparse, event-driven patterns of neural degeneration in AD, offering a promising foundation for interpretable and low-power medical diagnostics. However, existing SNNs often suffer from weak expressiveness and unstable training, which restrict their effectiveness in complex medical tasks. To address these limitations, we propose FasterSNN, a hybrid neural architecture that integrates biologically inspired LIF neurons with region-adaptive convolution and multi-scale spiking attention. This design enables sparse, efficient processing of 3D MRI while preserving diagnostic accuracy. Experiments on benchmark datasets demonstrate that FasterSNN achieves competitive performance with substantially improved efficiency and stability, supporting its potential for practical AD screening. Our source code is available at this https URL.

Title: TRIDENT: Temporally Restricted Inference via DFA-Enhanced Neural Traversal

Authors: Vincenzo Collura, Karim Tit, Laura Bussi, Eleonora Giunchiglia, Maxime Cordy
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.09701
Pdf URL: https://arxiv.org/pdf/2506.09701
Copy Paste: [[2506.09701]] TRIDENT: Temporally Restricted Inference via DFA-Enhanced Neural Traversal(https://arxiv.org/abs/2506.09701)
Keywords: generation, generative
Abstract: Large Language Models (LLMs) and other neural architectures have achieved impressive results across a variety of generative and classification tasks. However, they remain fundamentally ill-equipped to ensure that their outputs satisfy temporal constraints, such as those expressible in Linear Temporal Logic over finite traces (LTLf). In this paper, we introduce TRIDENT: a general and model-agnostic inference-time algorithm that guarantees compliance with such constraints without requiring any retraining. TRIDENT compiles LTLf formulas into a Deterministic Finite Automaton (DFA), which is used to guide a constrained variant of beam search. At each decoding step, transitions that would lead to constraint violations are masked, while remaining paths are dynamically re-ranked based on both the model's probabilities and the DFA's acceptance structure. We formally prove that the resulting sequences are guaranteed to satisfy the given LTLf constraints, and we empirically demonstrate that TRIDENT also improves output quality. We validate our approach on two distinct tasks: temporally constrained image-stream classification and controlled text generation. In both settings, TRIDENT achieves perfect constraint satisfaction, while comparison with the state of the art shows improved efficiency and high standard quality metrics.
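
The DFA-guided decoding described in the TRIDENT abstract can be sketched on a toy problem. The example below is a minimal illustration under stated assumptions, not the authors' algorithm: the DFA is written by hand rather than compiled from an LTLf formula, the "model" is a fixed per-step table of token log-probabilities that ignores context, and the names `TRANSITIONS`, `steps_to_accept`, and `constrained_beam_search` are hypothetical. At each step it masks transitions that violate the constraint, and also transitions into states that can no longer reach an accepting state in the remaining steps, so every surviving beam ends in an accepting state.

```python
import math

# Hand-written toy DFA over tokens {"a", "b", "c"} for the constraint
# "a 'b' must occur, and no 'a' may follow the first 'b'".
# A missing transition means the constraint would be violated.
TRANSITIONS = {
    0: {"a": 0, "b": 1, "c": 0},   # state 0: no "b" seen yet
    1: {"b": 1, "c": 1},           # state 1: "b" seen, "a" is now forbidden
}
ACCEPTING = {1}

def steps_to_accept(transitions, accepting):
    """Fewest further tokens needed to reach an accepting state from each state."""
    dist = {s: (0 if s in accepting else math.inf) for s in transitions}
    changed = True
    while changed:
        changed = False
        for state, edges in transitions.items():
            best = min((1 + dist[t] for t in edges.values()), default=math.inf)
            if best < dist[state]:
                dist[state] = best
                changed = True
    return dist

def constrained_beam_search(step_logprobs, beam_width):
    """Beam search that only keeps prefixes able to end in an accepting DFA state."""
    dist = steps_to_accept(TRANSITIONS, ACCEPTING)
    beams = [(0.0, [], 0)]  # (log-probability, tokens so far, DFA state)
    for step, logprobs in enumerate(step_logprobs):
        remaining = len(step_logprobs) - step - 1
        candidates = []
        for score, tokens, state in beams:
            for token, lp in logprobs.items():
                nxt = TRANSITIONS[state].get(token)
                # Mask violations and states that cannot reach acceptance in time.
                if nxt is None or dist[nxt] > remaining:
                    continue
                candidates.append((score + lp, tokens + [token], nxt))
        beams = sorted(candidates, key=lambda b: b[0], reverse=True)[:beam_width]
    return beams

# The toy model strongly prefers "a" at every step.
step = {"a": math.log(0.7), "b": math.log(0.2), "c": math.log(0.1)}
best_score, best_tokens, best_state = constrained_beam_search([step] * 3, beam_width=1)[0]
```

Unconstrained greedy decoding would emit "a" three times and violate the constraint. With the mask in place, the single beam keeps the preferred token until the last step and is then forced to emit "b".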

Title: ELBO-T2IAlign: A Generic ELBO-Based Method for Calibrating Pixel-level Text-Image Alignment in Diffusion Models

Authors: Qin Zhou, Zhiyang Zhang, Jinglong Wang, Xiaobin Li, Jing Zhang, Qian Yu, Lu Sheng, Dong Xu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.09740
Pdf URL: https://arxiv.org/pdf/2506.09740
Copy Paste: [[2506.09740]] ELBO-T2IAlign: A Generic ELBO-Based Method for Calibrating Pixel-level Text-Image Alignment in Diffusion Models(https://arxiv.org/abs/2506.09740)
Keywords: generation
Abstract: Diffusion models excel at image generation. Recent studies have shown that these models not only generate high-quality images but also encode text-image alignment information through attention maps or loss functions. This information is valuable for various downstream tasks, including segmentation, text-guided image editing, and compositional image generation. However, current methods heavily rely on the assumption of perfect text-image alignment in diffusion models, which is not the case. In this paper, we propose using zero-shot referring image segmentation as a proxy task to evaluate the pixel-level image and class-level text alignment of popular diffusion models. We conduct an in-depth analysis of pixel-text misalignment in diffusion models from the perspective of training data bias. We find that misalignment occurs in images with small-sized, occluded, or rare object classes. Therefore, we propose ELBO-T2IAlign, a simple yet effective method to calibrate pixel-text alignment in diffusion models based on the evidence lower bound (ELBO) of likelihood. Our method is training-free and generic: it eliminates the need to identify the specific cause of misalignment and works well across various diffusion model architectures. Extensive experiments on commonly used benchmark datasets on image segmentation and generation have verified the effectiveness of our proposed calibration approach.

Title: Accurate and efficient zero-shot 6D pose estimation with frozen foundation models

Authors: Andrea Caraffa, Davide Boscaini, Fabio Poiesi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.09784
Pdf URL: https://arxiv.org/pdf/2506.09784
Copy Paste: [[2506.09784]] Accurate and efficient zero-shot 6D pose estimation with frozen foundation models(https://arxiv.org/abs/2506.09784)
Keywords: generation
Abstract: Estimating the 6D pose of objects from RGBD data is a fundamental problem in computer vision, with applications in robotics and augmented reality. A key challenge is achieving generalization to novel objects that were not seen during training. Most existing approaches address this by scaling up training on synthetic data tailored to the task, a process that demands substantial computational resources. But is task-specific training really necessary for accurate and efficient 6D pose estimation of novel objects? To answer No!, we introduce FreeZeV2, the second generation of FreeZe: a training-free method that achieves strong generalization to unseen objects by leveraging geometric and vision foundation models pre-trained on unrelated data. FreeZeV2 improves both accuracy and efficiency over FreeZe through three key contributions: (i) a sparse feature extraction strategy that reduces inference-time computation without sacrificing accuracy; (ii) a feature-aware scoring mechanism that improves both pose selection during RANSAC-based 3D registration and the final ranking of pose candidates; and (iii) a modular design that supports ensembles of instance segmentation models, increasing robustness to segmentation mask errors. We evaluate FreeZeV2 on the seven core datasets of the BOP Benchmark, where it establishes a new state-of-the-art in 6D pose estimation of unseen objects. When using the same segmentation masks, FreeZeV2 achieves a remarkable 8x speedup over FreeZe while also improving accuracy by 5%. When using ensembles of segmentation models, FreeZeV2 gains an additional 8% in accuracy while still running 2.5x faster than FreeZe. FreeZeV2 was awarded Best Overall Method at the BOP Challenge 2024.

Title: DreamCS: Geometry-Aware Text-to-3D Generation with Unpaired 3D Reward Supervision

Authors: Xiandong Zou, Ruihao Xia, Hongsong Wang, Pan Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.09814
Pdf URL: https://arxiv.org/pdf/2506.09814
Copy Paste: [[2506.09814]] DreamCS: Geometry-Aware Text-to-3D Generation with Unpaired 3D Reward Supervision(https://arxiv.org/abs/2506.09814)
Keywords: generation
Abstract: While text-to-3D generation has attracted growing interest, existing methods often struggle to produce 3D assets that align well with human preferences. Current preference alignment techniques for 3D content typically rely on hard-to-collect preference-paired multi-view 2D images to train 2D reward models, which then guide 3D generation -- leading to geometric artifacts due to their inherent 2D bias. To address these limitations, we construct 3D-MeshPref, the first large-scale unpaired 3D preference dataset, featuring diverse 3D meshes annotated by a large language model and refined by human evaluators. We then develop RewardCS, the first reward model trained directly on unpaired 3D-MeshPref data using a novel Cauchy-Schwarz divergence objective, enabling effective learning of human-aligned 3D geometric preferences without requiring paired comparisons. Building on this, we propose DreamCS, a unified framework that integrates RewardCS into text-to-3D pipelines -- enhancing both implicit and explicit 3D generation with human preference feedback. Extensive experiments show DreamCS outperforms prior methods, producing 3D assets that are both geometrically faithful and human-preferred. Code and models will be released publicly.
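
For readers unfamiliar with the Cauchy-Schwarz divergence named in the DreamCS abstract, the sketch below computes its classical form for two discrete distributions, D_CS(p, q) = -log((sum p_i q_i)^2 / (sum p_i^2 * sum q_i^2)). This is only the textbook quantity; some references define it without the square, which halves the value. The abstract does not give the RewardCS training objective, so nothing here should be read as the authors' loss, and the function name and example distributions are illustrative assumptions.

```python
import math

def cauchy_schwarz_divergence(p, q):
    """Classical Cauchy-Schwarz divergence between two discrete distributions.

    It is zero when p and q are identical, symmetric in its arguments, and
    non-negative by the Cauchy-Schwarz inequality. The distributions must
    overlap somewhere, otherwise the logarithm is undefined.
    """
    cross = sum(pi * qi for pi, qi in zip(p, q))
    norm_p = sum(pi * pi for pi in p)
    norm_q = sum(qi * qi for qi in q)
    return -math.log(cross * cross / (norm_p * norm_q))

# Illustrative distributions over three outcomes.
p = [0.7, 0.2, 0.1]
q = [0.1, 0.3, 0.6]
d_pq = cauchy_schwarz_divergence(p, q)
d_qp = cauchy_schwarz_divergence(q, p)
d_pp = cauchy_schwarz_divergence(p, p)
```

Because the quantity compares two distributions as wholes, it can be evaluated on two sets of samples without pairing individual items, which is presumably why it suits an unpaired preference dataset.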

Title: Only-Style: Stylistic Consistency in Image Generation without Content Leakage

Authors: Tilemachos Aravanis (1), Panagiotis Filntisis (2 and 3), Petros Maragos (1 and 2 and 3), George Retsinas (2 and 3) ((1) School of Electrical & Computer Engineering, National Technical University of Athens, Greece, (2) Robotics Institute, Athena Research Center, Maroussi, Greece, (3) HERON - Center of Excellence in Robotics, Athens, Greece)
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.09916
Pdf URL: https://arxiv.org/pdf/2506.09916
Copy Paste: [[2506.09916]] Only-Style: Stylistic Consistency in Image Generation without Content Leakage(https://arxiv.org/abs/2506.09916)
Keywords: generation
Abstract: Generating images in a consistent reference visual style remains a challenging computer vision task. State-of-the-art methods aiming for style-consistent generation struggle to effectively separate semantic content from stylistic elements, leading to content leakage from the image provided as a reference to the targets. To address this challenge, we propose Only-Style: a method designed to mitigate content leakage in a semantically coherent manner while preserving stylistic consistency. Only-Style works by localizing content leakage during inference, allowing the adaptive tuning of a parameter that controls the style alignment process, specifically within the image patches containing the subject in the reference image. This adaptive process best balances stylistic consistency with leakage elimination. Moreover, the localization of content leakage can function as a standalone component, given a reference-target image pair, allowing the adaptive tuning of any method-specific parameter that provides control over the impact of the stylistic reference. In addition, we propose a novel evaluation framework to quantify the success of style-consistent generations in avoiding undesired content leakage. Our approach demonstrates a significant improvement over state-of-the-art methods through extensive evaluation across diverse instances, consistently achieving robust stylistic consistency without undesired content leakage.

Title: HadaNorm: Diffusion Transformer Quantization through Mean-Centered Transformations

Authors: Marco Federici, Riccardo Del Chiaro, Boris van Breugel, Paul Whatmough, Markus Nagel
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.09932
Pdf URL: https://arxiv.org/pdf/2506.09932
Copy Paste: [[2506.09932]] HadaNorm: Diffusion Transformer Quantization through Mean-Centered Transformations(https://arxiv.org/abs/2506.09932)
Keywords: generation
Abstract: Diffusion models represent the cutting edge in image generation, but their high memory and computational demands hinder deployment on resource-constrained devices. Post-Training Quantization (PTQ) offers a promising solution by reducing the bitwidth of matrix operations. However, standard PTQ methods struggle with outliers, and achieving higher compression often requires transforming model weights and activations before quantization. In this work, we propose HadaNorm, a novel linear transformation that extends existing approaches and effectively mitigates outliers by normalizing activation feature channels before applying Hadamard transformations, enabling more aggressive activation quantization. We demonstrate that HadaNorm consistently reduces quantization error across the various components of transformer blocks, achieving superior efficiency-performance trade-offs when compared to state-of-the-art methods.

Title: Canonical Latent Representations in Conditional Diffusion Models

Authors: Yitao Xu, Tong Zhang, Ehsan Pajouheshgar, Sabine Süsstrunk
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2506.09955
Pdf URL: https://arxiv.org/pdf/2506.09955
Copy Paste: [[2506.09955]] Canonical Latent Representations in Conditional Diffusion Models(https://arxiv.org/abs/2506.09955)
Keywords: generative
Abstract: Conditional diffusion models (CDMs) have shown impressive performance across a range of generative tasks. Their ability to model the full data distribution has opened new avenues for analysis-by-synthesis in downstream discriminative learning. However, this same modeling capacity causes CDMs to entangle the class-defining features with irrelevant context, posing challenges to extracting robust and interpretable representations. To this end, we identify Canonical LAtent Representations (CLAReps), latent codes whose internal CDM features preserve essential categorical information while discarding non-discriminative signals. When decoded, CLAReps produce representative samples for each class, offering an interpretable and compact summary of the core class semantics with minimal irrelevant details. Exploiting CLAReps, we develop a novel diffusion-based feature-distillation paradigm, CaDistill. While the student has full access to the training set, the CDM as teacher transfers core class knowledge only via CLAReps, which amounts to merely 10% of the training data in size. After training, the student achieves strong adversarial robustness and generalization ability, focusing more on the class signals instead of spurious background cues. Our findings suggest that CDMs can serve not just as image generators but also as compact, interpretable teachers that can drive robust representation learning.

Title: Efficient Part-level 3D Object Generation via Dual Volume Packing

Authors: Jiaxiang Tang, Ruijie Lu, Zhaoshuo Li, Zekun Hao, Xuan Li, Fangyin Wei, Shuran Song, Gang Zeng, Ming-Yu Liu, Tsung-Yi Lin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.09980
Pdf URL: https://arxiv.org/pdf/2506.09980
Copy Paste: [[2506.09980]] Efficient Part-level 3D Object Generation via Dual Volume Packing(https://arxiv.org/abs/2506.09980)
Keywords: generation
Abstract: Recent progress in 3D object generation has greatly improved both quality and efficiency. However, most existing methods generate a single mesh with all parts fused together, which limits the ability to edit or manipulate individual parts. A key challenge is that different objects may have a varying number of parts. To address this, we propose a new end-to-end framework for part-level 3D object generation. Given a single input image, our method generates high-quality 3D objects with an arbitrary number of complete and semantically meaningful parts. We introduce a dual volume packing strategy that organizes all parts into two complementary volumes, allowing for the creation of complete and interleaved parts that assemble into the final object. Experiments show that our model achieves better quality, diversity, and generalization than previous image-based part-level generation methods.

Title: AnimateAnyMesh: A Feed-Forward 4D Foundation Model for Text-Driven Universal Mesh Animation

Authors: Zijie Wu, Chaohui Yu, Fan Wang, Xiang Bai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.09982
Pdf URL: https://arxiv.org/pdf/2506.09982
Copy Paste: [[2506.09982]] AnimateAnyMesh: A Feed-Forward 4D Foundation Model for Text-Driven Universal Mesh Animation(https://arxiv.org/abs/2506.09982)
Keywords: generation
Abstract: Recent advances in 4D content generation have attracted increasing attention, yet creating high-quality animated 3D models remains challenging due to the complexity of modeling spatio-temporal distributions and the scarcity of 4D training data. In this paper, we present AnimateAnyMesh, the first feed-forward framework that enables efficient text-driven animation of arbitrary 3D meshes. Our approach leverages a novel DyMeshVAE architecture that effectively compresses and reconstructs dynamic mesh sequences by disentangling spatial and temporal features while preserving local topological structures. To enable high-quality text-conditional generation, we employ a Rectified Flow-based training strategy in the compressed latent space. Additionally, we contribute the DyMesh Dataset, containing over 4M diverse dynamic mesh sequences with text annotations. Experimental results demonstrate that our method generates semantically accurate and temporally coherent mesh animations in a few seconds, significantly outperforming existing approaches in both quality and efficiency. Our work marks a substantial step forward in making 4D content creation more accessible and practical. All the data, code, and models will be open-released.
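
The abstract mentions a Rectified Flow-based training strategy in latent space. The sketch below shows only the generic rectified-flow objective as it is commonly formulated (straight-line interpolation between noise and data, with the constant velocity as regression target); it is not taken from the paper, and the random arrays standing in for compressed mesh latents, the convention that `x0` is noise and `x1` is data, and all names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, dim = 4, 8
x0 = rng.standard_normal((batch, dim))  # noise samples
x1 = rng.standard_normal((batch, dim))  # stand-in for data latents (assumption)

def interpolate(x0, x1, t):
    """Point on the straight path from x0 (t=0) to x1 (t=1)."""
    return (1.0 - t) * x0 + t * x1

def velocity_target(x0, x1):
    """Regression target: the constant velocity along the straight path."""
    return x1 - x0

def flow_matching_loss(pred_v, x0, x1):
    """Mean squared error between a predicted velocity and the target."""
    return float(np.mean((pred_v - velocity_target(x0, x1)) ** 2))

# One training step would sample t, form x_t, and regress a network's
# output v(x_t, t) onto the target. No network is trained here; the
# oracle velocity is plugged in to show that the loss vanishes for it.
t = rng.uniform(size=(batch, 1))
x_t = interpolate(x0, x1, t)
oracle_loss = flow_matching_loss(velocity_target(x0, x1), x0, x1)

# Sampling: Euler integration of dx/dt = v from t=0 to t=1.
steps = 10
x = x0.copy()
for _ in range(steps):
    x = x + velocity_target(x0, x1) / steps
```

With the exact velocity, Euler integration lands on `x1` regardless of the step count, which is why straight paths are attractive for few-step sampling; a learned velocity field only approximates this.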

Title: InterActHuman: Multi-Concept Human Animation with Layout-Aligned Audio Conditions

Authors: Zhenzhi Wang, Jiaqi Yang, Jianwen Jiang, Chao Liang, Gaojie Lin, Zerong Zheng, Ceyuan Yang, Dahua Lin
Subjects: cs.CV, cs.AI, cs.SD
Abstract URL: https://arxiv.org/abs/2506.09984
Pdf URL: https://arxiv.org/pdf/2506.09984
Copy Paste: [[2506.09984]] InterActHuman: Multi-Concept Human Animation with Layout-Aligned Audio Conditions(https://arxiv.org/abs/2506.09984)
Keywords: generation
Abstract: End-to-end human animation with rich multi-modal conditions (e.g., text, image, and audio) has achieved remarkable advancements in recent years. However, most existing methods can only animate a single subject and inject conditions in a global manner, ignoring scenarios in which multiple concepts appear in the same video with rich human-human and human-object interactions. Such a global assumption prevents precise, per-identity control of multiple concepts, including humans and objects, and therefore hinders applications. In this work, we discard the single-entity assumption and introduce a novel framework that enforces strong, region-specific binding of conditions from modalities to each identity's spatiotemporal footprint. Given reference images of multiple concepts, our method automatically infers layout information by leveraging a mask predictor to match appearance cues between the denoised video and each reference appearance. Furthermore, we inject each local audio condition into its corresponding region to ensure layout-aligned modality matching in an iterative manner. This design enables the high-quality generation of controllable multi-concept human-centric videos. Empirical results and ablation studies validate the effectiveness of our explicit layout control for multi-modal conditions compared to implicit counterparts and other existing methods.

Title: EditInspector: A Benchmark for Evaluation of Text-Guided Image Edits

Authors: Ron Yosef, Moran Yanuka, Yonatan Bitton, Dani Lischinski
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.09988
Pdf URL: https://arxiv.org/pdf/2506.09988
Copy Paste: [[2506.09988]] EditInspector: A Benchmark for Evaluation of Text-Guided Image Edits(https://arxiv.org/abs/2506.09988)
Keywords: generation, generative
Abstract: Text-guided image editing, fueled by recent advancements in generative AI, is becoming increasingly widespread. This trend highlights the need for a comprehensive framework to verify text-guided edits and assess their quality. To address this need, we introduce EditInspector, a novel benchmark for evaluation of text-guided image edits, based on human annotations collected using an extensive template for edit verification. We leverage EditInspector to evaluate the performance of state-of-the-art (SoTA) vision and language models in assessing edits across various dimensions, including accuracy, artifact detection, visual quality, seamless integration with the image scene, adherence to common sense, and the ability to describe edit-induced changes. Our findings indicate that current models struggle to evaluate edits comprehensively and frequently hallucinate when describing the changes. To address these challenges, we propose two novel methods that outperform SoTA models in both artifact detection and difference caption generation.

Title: Multiverse: Your Language Models Secretly Decide How to Parallelize and Merge Generation

Authors: Xinyu Yang, Yuwei An, Hongyi Liu, Tianqi Chen, Beidi Chen
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.09991
Pdf URL: https://arxiv.org/pdf/2506.09991
Copy Paste: [[2506.09991]] Multiverse: Your Language Models Secretly Decide How to Parallelize and Merge Generation(https://arxiv.org/abs/2506.09991)
Keywords: generation, generative
Abstract: Autoregressive Large Language Models (AR-LLMs) frequently exhibit implicit parallelism in sequential generation. Inspired by this, we introduce Multiverse, a new generative model that enables natively parallel generation. Multiverse internalizes a MapReduce paradigm, generating automatically through three stages: (i) a Map stage for adaptive task decomposition, (ii) a Process stage for parallel subtask execution, and (iii) a Reduce stage for lossless result synthesis. Next, we build a real-world Multiverse reasoning model with co-design of data, algorithm, and system, enabling rapid and seamless transfer from frontier AR-LLMs. Starting from sequential reasoning chains, we create Multiverse 1K by converting them into structured training data using an automated LLM-assisted pipeline, avoiding costly human annotations. Algorithmically, we design Multiverse Attention to separate parallel reasoning steps while keeping compatibility with causal attention for efficient training. Systematically, we implement Multiverse Engine to enable parallel inference. It features a dedicated scheduler that dynamically switches between sequential and parallel generation, triggered directly by the model. After a 3-hour fine-tuning with 1K examples, our Multiverse-32B stands as the only open-sourced non-AR model achieving performance on par with leading AR-LLMs of the same scale, evidenced by AIME24 & 25 scores of 54% and 46%, respectively. Moreover, our budget control experiments show that Multiverse-32B exhibits superior scaling, outperforming AR-LLMs by 1.87% on average using the same context length. Such scaling further leads to practical efficiency gain, achieving up to 2x speedup across varying batch sizes. We have open-sourced the entire Multiverse ecosystem, including data, model weights, engine, supporting tools, as well as complete data curation prompts and detailed training and evaluation recipes.
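
The three-stage Map / Process / Reduce structure named in the abstract can be pictured with a toy analogy. The sketch below is not the Multiverse model or engine: it only shows the control flow of splitting a task, running subtasks in parallel, and merging the results, using a trivial sum-of-squares task. All function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def map_stage(task, n_parts):
    """Map: split the task into roughly equal contiguous subtasks."""
    size = max(1, -(-len(task) // n_parts))  # ceiling division, at least 1
    return [task[i:i + size] for i in range(0, len(task), size)]

def process_stage(subtask):
    """Process: solve one subtask independently (here, a sum of squares)."""
    return sum(v * v for v in subtask)

def reduce_stage(partials):
    """Reduce: merge partial results without losing information."""
    return sum(partials)

def run(task, n_parts=4):
    subtasks = map_stage(task, n_parts)
    # Subtasks do not depend on each other, so they can run concurrently.
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        partials = list(pool.map(process_stage, subtasks))
    return reduce_stage(partials)

numbers = list(range(1, 11))
result = run(numbers)
print(result)  # 385
```

In the paper's setting the decomposition and merge points are decided by the model during generation rather than by fixed code; the analogy covers only the shape of the pipeline.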

Title: Text-Aware Image Restoration with Diffusion Models

Authors: Jaewon Min, Jin Hyeon Kim, Paul Hyunbin Cho, Jaeeun Lee, Jihye Park, Minkyu Park, Sangpil Kim, Hyunhee Park, Seungryong Kim
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.09993
Pdf URL: https://arxiv.org/pdf/2506.09993
Copy Paste: [[2506.09993]] Text-Aware Image Restoration with Diffusion Models(https://arxiv.org/abs/2506.09993)
Keywords: restoration
Abstract: Image restoration aims to recover degraded images. However, existing diffusion-based restoration methods, despite great success in natural image restoration, often struggle to faithfully reconstruct textual regions in degraded images. Those methods frequently generate plausible but incorrect text-like patterns, a phenomenon we refer to as text-image hallucination. In this paper, we introduce Text-Aware Image Restoration (TAIR), a novel restoration task that requires the simultaneous recovery of visual contents and textual fidelity. To tackle this task, we present SA-Text, a large-scale benchmark of 100K high-quality scene images densely annotated with diverse and complex text instances. Furthermore, we propose a multi-task diffusion framework, called TeReDiff, that integrates internal features from diffusion models into a text-spotting module, enabling both components to benefit from joint training. This allows for the extraction of rich text representations, which are utilized as prompts in subsequent denoising steps. Extensive experiments demonstrate that our approach consistently outperforms state-of-the-art restoration methods, achieving significant gains in text recognition accuracy.

Title: PlayerOne: Egocentric World Simulator

Authors: Yuanpeng Tu, Hao Luo, Xi Chen, Xiang Bai, Fan Wang, Hengshuang Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.09995
Pdf URL: https://arxiv.org/pdf/2506.09995
Copy Paste: [[2506.09995]] PlayerOne: Egocentric World Simulator(https://arxiv.org/abs/2506.09995)
Keywords: generation
Abstract: We introduce PlayerOne, the first egocentric realistic world simulator, facilitating immersive and unrestricted exploration within vividly dynamic environments. Given an egocentric scene image from the user, PlayerOne can accurately construct the corresponding world and generate egocentric videos that are strictly aligned with the real-scene human motion of the user captured by an exocentric camera. PlayerOne is trained in a coarse-to-fine pipeline that first performs pretraining on large-scale egocentric text-video pairs for coarse-level egocentric understanding, followed by finetuning on synchronous motion-video data extracted from egocentric-exocentric video datasets with our automatic construction pipeline. Besides, considering the varying importance of different components, we design a part-disentangled motion injection scheme, enabling precise control of part-level movements. In addition, we devise a joint reconstruction framework that progressively models both the 4D scene and video frames, ensuring scene consistency in long-form video generation. Experimental results demonstrate its great generalization ability in precise control of varying human movements and world-consistent modeling of diverse scenarios. It marks the first endeavor into egocentric real-world simulation and can pave the way for the community to delve into fresh frontiers of world modeling and its diverse applications.