2025-07-08

Title: Advancing Talking Head Generation: A Comprehensive Survey of Multi-Modal Methodologies, Datasets, Evaluation Metrics, and Loss Functions

Authors: Vineet Kumar Rakesh, Soumya Mazumdar, Research Pratim Maity, Sarbajit Pal, Amitabha Das, Tapas Samanta
Subjects: cs.CV, cs.AI, cs.GR, cs.HC, cs.MM
Abstract URL: https://arxiv.org/abs/2507.02900
Pdf URL: https://arxiv.org/pdf/2507.02900
Copy Paste: [[2507.02900]] Advancing Talking Head Generation: A Comprehensive Survey of Multi-Modal Methodologies, Datasets, Evaluation Metrics, and Loss Functions(https://arxiv.org/abs/2507.02900)
Keywords: generation
Abstract: Talking Head Generation (THG) has emerged as a transformative technology in computer vision, enabling the synthesis of realistic human faces synchronized with image, audio, text, or video inputs. This paper provides a comprehensive review of methodologies and frameworks for talking head generation, categorizing approaches into 2D-based, 3D-based, Neural Radiance Fields (NeRF)-based, diffusion-based, parameter-driven, and other techniques. It evaluates algorithms, datasets, and evaluation metrics while highlighting advancements in perceptual realism and technical efficiency critical for applications such as digital avatars, video dubbing, ultra-low bitrate video conferencing, and online education. The study identifies challenges such as reliance on pre-trained models, extreme pose handling, multilingual synthesis, and temporal consistency. Future directions include modular architectures, multilingual datasets, hybrid models blending pre-trained and task-specific layers, and innovative loss functions. By synthesizing existing research and exploring emerging trends, this paper aims to provide actionable insights for researchers and practitioners in the field of talking head generation. For the complete survey, code, and curated resource list, visit our GitHub repository: this https URL.

Title: Controllable diffusion-based generation for multi-channel biological data

Authors: Haoran Zhang, Mingyuan Zhou, Wesley Tansey
Subjects: cs.LG, cs.CE
Abstract URL: https://arxiv.org/abs/2507.02902
Pdf URL: https://arxiv.org/pdf/2507.02902
Copy Paste: [[2507.02902]] Controllable diffusion-based generation for multi-channel biological data(https://arxiv.org/abs/2507.02902)
Keywords: generation, generative
Abstract: Spatial profiling technologies in biology, such as imaging mass cytometry (IMC) and spatial transcriptomics (ST), generate high-dimensional, multi-channel data with strong spatial alignment and complex inter-channel relationships. Generative modeling of such data requires jointly capturing intra- and inter-channel structure, while also generalizing across arbitrary combinations of observed and missing channels for practical application. Existing diffusion-based models generally assume low-dimensional inputs (e.g., RGB images) and rely on simple conditioning mechanisms that break spatial correspondence and ignore inter-channel dependencies. This work proposes a unified diffusion framework for controllable generation over structured and spatial biological data. Our model contains two key innovations: (1) a hierarchical feature injection mechanism that enables multi-resolution conditioning on spatially aligned channels, and (2) a combination of latent-space and output-space channel-wise attention to capture inter-channel relationships. To support flexible conditioning and generalization to arbitrary subsets of observed channels, we train the model using a random masking strategy, enabling it to reconstruct missing channels from any combination of inputs. We demonstrate state-of-the-art performance across both spatial and non-spatial prediction tasks, including protein imputation in IMC and gene-to-protein prediction in single-cell datasets, and show strong generalization to unseen conditional configurations.
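The random masking strategy mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' training code: the function names, the five-channel sample, and the rule of keeping at least one observed and one hidden channel are assumptions made for illustration.

```python
import random

def random_channel_mask(num_channels, rng, min_observed=1):
    """Return one boolean per channel: True = observed (conditioning input),
    False = hidden (to be reconstructed). Hypothetical helper."""
    # Keep between min_observed and num_channels - 1 channels,
    # so at least one channel is always left to predict.
    k = rng.randint(min_observed, num_channels - 1)
    observed = set(rng.sample(range(num_channels), k))
    return [c in observed for c in range(num_channels)]

def split_by_mask(sample, mask):
    """Split one multi-channel sample into conditioning inputs and targets.
    Masked-out positions are represented by None."""
    inputs = [x if m else None for x, m in zip(sample, mask)]
    targets = [x if not m else None for x, m in zip(sample, mask)]
    return inputs, targets

rng = random.Random(0)
sample = [0.2, 1.5, 0.0, 3.1, 0.7]  # assumed 5-channel measurement at one location
mask = random_channel_mask(len(sample), rng)
inputs, targets = split_by_mask(sample, mask)
```

Drawing a fresh mask for every training example is what would expose a model to many different observed/missing combinations, which is the property the abstract relies on for generalization to arbitrary channel subsets.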

Title: Efficient Certified Reasoning for Binarized Neural Networks

Authors: Jiong Yang, Yong Kiam Tan, Mate Soos, Magnus O. Myreen, Kuldeep S. Meel
Subjects: cs.LG, cs.AI, cs.LO
Abstract URL: https://arxiv.org/abs/2507.02916
Pdf URL: https://arxiv.org/pdf/2507.02916
Copy Paste: [[2507.02916]] Efficient Certified Reasoning for Binarized Neural Networks(https://arxiv.org/abs/2507.02916)
Keywords: generation
Abstract: Neural networks have emerged as essential components in safety-critical applications -- these use cases demand complex, yet trustworthy computations. Binarized Neural Networks (BNNs) are a type of neural network where each neuron is constrained to a Boolean value; they are particularly well-suited for safety-critical tasks because they retain much of the computational capacities of full-scale (floating-point or quantized) deep neural networks, but remain compatible with satisfiability solvers for qualitative verification and with model counters for quantitative reasoning. However, existing methods for BNN analysis suffer from either limited scalability or susceptibility to soundness errors, which hinders their applicability in real-world scenarios. In this work, we present a scalable and trustworthy approach for both qualitative and quantitative verification of BNNs. Our approach introduces a native representation of BNN constraints in a custom-designed solver for qualitative reasoning, and in an approximate model counter for quantitative reasoning. We further develop specialized proof generation and checking pipelines with native support for BNN constraint reasoning, ensuring trustworthiness for all of our verification results. Empirical evaluations on a BNN robustness verification benchmark suite demonstrate that our certified solving approach achieves a $9\times$ speedup over prior certified CNF and PB-based approaches, and our certified counting approach achieves a $218\times$ speedup over the existing CNF-based baseline. In terms of coverage, our pipeline produces fully certified results for $99\%$ and $86\%$ of the qualitative and quantitative reasoning queries on BNNs, respectively. This is in sharp contrast to the best existing baselines which can fully certify only $62\%$ and $4\%$ of the queries, respectively.
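To make the two kinds of query concrete, here is a small sketch of a single binarized neuron. It assumes the common convention of inputs and weights in {-1, +1} with a sign activation, which the abstract does not spell out. Brute-force enumeration stands in for the satisfiability solver (qualitative: does a firing input exist?) and the model counter (quantitative: how many inputs fire?); it is not the paper's solver or proof pipeline.

```python
from itertools import product

def bnn_neuron(x, w, bias):
    """Binarized neuron: inputs and weights in {-1, +1}, sign activation."""
    total = sum(wi * xi for wi, xi in zip(w, x)) + bias
    return 1 if total >= 0 else -1

def as_cardinality(x, w, bias):
    """Same neuron written as a cardinality constraint.

    With a = number of positions where x agrees with w, the weighted sum
    equals 2a - n, so the neuron fires iff 2a >= n - bias.
    """
    n = len(w)
    agreements = sum(1 for wi, xi in zip(w, x) if wi == xi)
    return 1 if 2 * agreements >= n - bias else -1

W = (1, -1, 1, 1)   # assumed example weights
BIAS = 0            # assumed example bias
inputs = list(product((-1, 1), repeat=len(W)))

# Qualitative query: is there any input that makes the neuron fire?
exists_firing = any(bnn_neuron(x, W, BIAS) == 1 for x in inputs)
# Quantitative query: how many inputs make the neuron fire?
count_firing = sum(1 for x in inputs if bnn_neuron(x, W, BIAS) == 1)
```

Enumeration is exponential in the number of inputs, which is why realistic BNN verification needs dedicated solvers and counters of the kind the abstract describes.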

Title: Large Language Model Agent for Modular Task Execution in Drug Discovery

Authors: Janghoon Ock, Radheesh Sharma Meda, Srivathsan Badrinarayanan, Neha S. Aluru, Achuth Chandrasekhar, Amir Barati Farimani
Subjects: cs.LG, cs.CL, q-bio.BM
Abstract URL: https://arxiv.org/abs/2507.02925
Pdf URL: https://arxiv.org/pdf/2507.02925
Copy Paste: [[2507.02925]] Large Language Model Agent for Modular Task Execution in Drug Discovery(https://arxiv.org/abs/2507.02925)
Keywords: generation
Abstract: We present a modular framework powered by large language models (LLMs) that automates and streamlines key tasks across the early-stage computational drug discovery pipeline. By combining LLM reasoning with domain-specific tools, the framework performs biomedical data retrieval, domain-specific question answering, molecular generation, property prediction, property-aware molecular refinement, and 3D protein-ligand structure generation. In a case study targeting BCL-2 in lymphocytic leukemia, the agent autonomously retrieved relevant biomolecular information (including FASTA sequences, SMILES representations, and literature) and answered mechanistic questions with improved contextual accuracy over standard LLMs. It then generated chemically diverse seed molecules and predicted 67 ADMET-related properties, which guided iterative molecular refinement. Across two refinement rounds, the number of molecules with QED > 0.6 increased from 34 to 55, and those passing at least four out of five empirical drug-likeness rules rose from 29 to 52, within a pool of 194 molecules. The framework also employed Boltz-2 to generate 3D protein-ligand complexes and provide rapid binding affinity estimates for candidate compounds. These results demonstrate that the approach effectively supports molecular screening, prioritization, and structure evaluation. Its modular design enables flexible integration of evolving tools and models, providing a scalable foundation for AI-assisted therapeutic discovery.

Title: GameTileNet: A Semantic Dataset for Low-Resolution Game Art in Procedural Content Generation

Authors: Yi-Chun Chen, Arnav Jhala
Subjects: cs.CV, cs.AI, cs.CL, cs.MM
Abstract URL: https://arxiv.org/abs/2507.02941
Pdf URL: https://arxiv.org/pdf/2507.02941
Copy Paste: [[2507.02941]] GameTileNet: A Semantic Dataset for Low-Resolution Game Art in Procedural Content Generation(https://arxiv.org/abs/2507.02941)
Keywords: generation, generative
Abstract: GameTileNet is a dataset designed to provide semantic labels for low-resolution digital game art, advancing procedural content generation (PCG) and related AI research as a vision-language alignment task. Large Language Models (LLMs) and image-generative AI models have enabled indie developers to create visual assets, such as sprites, for game interactions. However, generating visuals that align with game narratives remains challenging due to inconsistent AI outputs, requiring manual adjustments by human artists. The diversity of visual representations in automatically generated game content is also limited because of the imbalance in distributions across styles for training data. GameTileNet addresses this by collecting artist-created game tiles from this http URL under Creative Commons licenses and providing semantic annotations to support narrative-driven content generation. The dataset introduces a pipeline for object detection in low-resolution tile-based game art (e.g., 32x32 pixels) and annotates semantics, connectivity, and object classifications. GameTileNet is a valuable resource for improving PCG methods, supporting narrative-rich game content, and establishing a baseline for object detection in low-resolution, non-photorealistic images. TL;DR: GameTileNet is a semantic dataset of low-resolution game tiles designed to support narrative-driven procedural content generation through visual-language alignment.

Title: Iterative Zoom-In: Temporal Interval Exploration for Long Video Understanding

Authors: Chenglin Li, Qianglong Chen, fengtao, Yin Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2507.02946
Pdf URL: https://arxiv.org/pdf/2507.02946
Copy Paste: [[2507.02946]] Iterative Zoom-In: Temporal Interval Exploration for Long Video Understanding(https://arxiv.org/abs/2507.02946)
Keywords: generation
Abstract: Multimodal Large Language Models (MLLMs) have shown strong performance in video understanding tasks. However, they continue to struggle with long-form videos because of an inefficient perception of temporal intervals. Unlike humans, who can dynamically adjust their temporal focus to locate query-relevant moments, current MLLMs often rely on dense, uniform sampling across the video timeline, leading to high memory consumption and a risk of missing crucial information. To address this challenge, we introduce Temporal Search (TS), a training-free framework that enables MLLMs to iteratively explore temporal regions for improved long video understanding. TS is based on a key observation: the model's generation confidence across different temporal intervals is highly correlated with prediction accuracy. TS operates through two main iterative stages. First, the MLLM proposes a temporal interval that is likely to contain task-relevant information. Then, it samples a fixed number of frames from the interval, regardless of length, and feeds them into the model to produce a refined response and confidence score. TS refines the focus of the model by iteratively shifting attention to more fine-grained temporal intervals, improving its understanding of long videos. Additionally, keyframe-level descriptions are collected to facilitate cross-interval perception throughout the video. To further improve efficiency, we introduce TS-BFS, a best-first search strategy over a tree. Each node represents a candidate interval and is expanded via two methods: self-driven proposals and uniform partitioning. Nodes are scored based on confidence and self-evaluation, and the most promising one is selected for continued exploration.
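The best-first search over candidate intervals can be sketched as follows. This is a simplified illustration, not the authors' TS-BFS: nodes are expanded by uniform partitioning only (self-driven proposals are omitted), and `toy_confidence` is a hypothetical stand-in for the model's confidence and self-evaluation score, built around an assumed target moment at 437 s.

```python
import heapq

def best_first_interval_search(duration, score, min_length, parts=2, max_steps=100):
    """Best-first search over temporal intervals of a video lasting `duration` seconds.

    `score(start, end)` stands in for the model's confidence on an interval.
    Each node is expanded by uniform partitioning into `parts` children.
    Returns (score, start, end) for the selected interval.
    """
    root = (score(0.0, duration), 0.0, duration)
    best = root
    # heapq is a min-heap, so scores are negated to pop the most confident node first.
    frontier = [(-root[0], root[1], root[2])]
    for _ in range(max_steps):
        if not frontier:
            break
        neg_score, start, end = heapq.heappop(frontier)
        node = (-neg_score, start, end)
        if node[0] > best[0]:
            best = node
        if end - start <= min_length:
            return node  # the most promising node cannot be split further
        step = (end - start) / parts
        for i in range(parts):
            child_start = start + i * step
            child_end = child_start + step
            heapq.heappush(frontier, (-score(child_start, child_end), child_start, child_end))
    return best

TARGET = 437.0  # assumed timestamp of the query-relevant moment

def toy_confidence(start, end):
    # A fixed frame budget covers a short interval more densely, so confidence
    # rises as the interval shrinks around the target (toy assumption).
    return 1.0 / (end - start) if start <= TARGET < end else 0.0

result = best_first_interval_search(1024.0, toy_confidence, min_length=8.0)
```

Because only the highest-scoring frontier node is expanded at each step, the search reaches an 8-second interval around the target after a handful of expansions instead of sampling the whole timeline densely.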

Title: CS-VLM: Compressed Sensing Attention for Efficient Vision-Language Representation Learning

Authors: Andrew Kiruluta, Preethi Raju, Priscilla Burity
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.02957
Pdf URL: https://arxiv.org/pdf/2507.02957
Copy Paste: [[2507.02957]] CS-VLM: Compressed Sensing Attention for Efficient Vision-Language Representation Learning(https://arxiv.org/abs/2507.02957)
Keywords: generation
Abstract: Vision-Language Models (vLLMs) have emerged as powerful architectures for joint reasoning over visual and textual inputs, enabling breakthroughs in image captioning, cross-modal retrieval, and multimodal dialogue. However, as these models scale to longer video sequences and richer language descriptions, the quadratic complexity of the standard attention mechanism presents a fundamental computational bottleneck. This challenge is exacerbated in vLLMs, where attention must be computed not only within modalities but also across them, leading to prohibitive memory and latency costs. In this work, we introduce the Compressed Sensing Attention Transformer (CSAT), a novel architecture that reimagines attention computation through the lens of compressed sensing. By projecting high-dimensional key and value representations into a lower-dimensional subspace via random measurement matrices and reconstructing the attention outputs using sparse recovery algorithms, CSAT significantly reduces attention complexity while maintaining semantic fidelity. Applied to vLLMs, CSAT exploits the inherent compressibility of both visual and textual representations, especially evident in video, where temporal redundancy is high, and in language, where cross-modal grounding is often sparse. In contrast to LLMs, which must often model entangled symbolic dependencies, vLLMs benefit from structured sparsity in alignment and scene composition, making them particularly well-suited to compressed attention. We provide a formal mathematical treatment of CSAT, demonstrate its integration into vision-language pipelines, and validate its performance on standard benchmarks, highlighting its promise as a scalable, interpretable, and resource-efficient solution for next-generation multimodal transformers.

Title: Concept-based Adversarial Attack: a Probabilistic Perspective

Authors: Andi Zhang, Xuan Ding, Steven McDonagh, Samuel Kaski
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2507.02965
Pdf URL: https://arxiv.org/pdf/2507.02965
Copy Paste: [[2507.02965]] Concept-based Adversarial Attack: a Probabilistic Perspective(https://arxiv.org/abs/2507.02965)
Keywords: generative
Abstract: We propose a concept-based adversarial attack framework that extends beyond single-image perturbations by adopting a probabilistic perspective. Rather than modifying a single image, our method operates on an entire concept -- represented by a probabilistic generative model or a set of images -- to generate diverse adversarial examples. Preserving the concept is essential, as it ensures that the resulting adversarial images remain identifiable as instances of the original underlying category or identity. By sampling from this concept-based adversarial distribution, we generate images that maintain the original concept but vary in pose, viewpoint, or background, thereby misleading the classifier. Mathematically, this framework remains consistent with traditional adversarial attacks in a principled manner. Our theoretical and empirical results demonstrate that concept-based adversarial attacks yield more diverse adversarial examples and effectively preserve the underlying concept, while achieving higher attack efficiency.

Title: Mimesis, Poiesis, and Imagination: Exploring Text-to-Image Generation of Biblical Narratives

Authors: Willem Th. van Peursen, Samuel E. Entsua-Mensah
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.02973
Pdf URL: https://arxiv.org/pdf/2507.02973
Copy Paste: [[2507.02973]] Mimesis, Poiesis, and Imagination: Exploring Text-to-Image Generation of Biblical Narratives(https://arxiv.org/abs/2507.02973)
Keywords: generation
Abstract: This study explores the intersection of artificial intelligence and the visualization of Biblical narratives by analyzing AI-generated images of Exodus 2:5-9 (Moses found in the River Nile) using MidJourney. Drawing on the classical concepts of mimesis (imitation) and poiesis (creative generation), the authors investigate how text-to-image (T2I) models reproduce or reimagine sacred narratives. Through comparative visual analysis, including Google image results and classical paintings, the research evaluates the stylistic, theological, and cultural dimensions of AI-generated depictions. Findings show that while AI excels in producing aesthetically rich and imaginative visuals, it also reflects the biases and limitations of its training data. The study highlights AI's potential to augment human imagination but questions its capacity for genuine creativity, authorial intent, and theological depth. It concludes by suggesting that AI can serve as a creative partner in reinterpreting biblical texts, though its role in sacred art remains complex and contested.

Title: InvisibleInk: High-Utility and Low-Cost Text Generation with Differential Privacy

Authors: Vishnu Vinod, Krishna Pillutla, Abhradeep Guha Thakurta
Subjects: cs.LG, cs.CL, cs.CR
Abstract URL: https://arxiv.org/abs/2507.02974
Pdf URL: https://arxiv.org/pdf/2507.02974
Copy Paste: [[2507.02974]] InvisibleInk: High-Utility and Low-Cost Text Generation with Differential Privacy(https://arxiv.org/abs/2507.02974)
Keywords: generation
Abstract: As major progress in LLM-based long-form text generation enables paradigms such as retrieval-augmented generation (RAG) and inference-time scaling, safely incorporating private information into the generation remains a critical open question. We present InvisibleInk, a highly scalable long-form text generation framework satisfying rigorous differential privacy guarantees with respect to the sensitive references. It interprets sampling from the LLM's next-token distribution as the exponential mechanism over the LLM logits with two innovations. First, we reduce the privacy cost by isolating and clipping only the sensitive information in the model logits (relative to the public logits). Second, we improve text quality by sampling from a small superset of the top-$k$ private tokens. Empirical evaluations demonstrate a consistent $8\times$ reduction in computation cost over state-of-the-art baselines to generate long-form private text of the same utility across privacy levels. In summary, InvisibleInk is able to generate private long-form text at less than $10\times$ the computation cost of non-private generation.
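The first innovation, clipping only the part of the logits that differs from the public logits, can be sketched as below. This is an illustrative reading of the abstract, not the InvisibleInk algorithm: the function names, clipping bound, temperature, and example logits are assumptions, and the sketch performs no privacy accounting and omits the top-$k$ superset step.

```python
import math
import random

def clip_relative_to_public(private_logits, public_logits, clip_bound):
    """Clip only the private-minus-public difference to [-clip_bound, clip_bound],
    then add it back onto the public logits (hypothetical helper)."""
    return [
        pub + max(-clip_bound, min(clip_bound, priv - pub))
        for priv, pub in zip(private_logits, public_logits)
    ]

def softmax(logits, temperature=1.0):
    """Numerically stable softmax; sampling from it matches the exponential
    mechanism with the logits acting as utilities."""
    m = max(logits)
    exps = [math.exp((l - m) / temperature) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def sample_token(logits, temperature, rng):
    probs = softmax(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

public = [2.0, 1.0, 0.0, -1.0]   # assumed logits without the sensitive reference
private = [2.5, 1.0, 6.0, -1.0]  # assumed logits with the sensitive reference
clipped = clip_relative_to_public(private, public, clip_bound=1.0)
probs = softmax(clipped, temperature=1.0)
token = sample_token(clipped, temperature=1.0, rng=random.Random(0))
```

In this toy example the third token's logit is raised by 6.0 when the sensitive reference is present, but clipping limits its influence to 1.0, so no single reference can shift the sampling distribution by more than the bound allows.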

Title: Introducing Answered with Evidence -- a framework for evaluating whether LLM responses to biomedical questions are founded in evidence

Authors: Julian D Baldwin, Christina Dinh, Arjun Mukerji, Neil Sanghavi, Saurabh Gombar
Subjects: cs.LG, cs.IR
Abstract URL: https://arxiv.org/abs/2507.02975
Pdf URL: https://arxiv.org/pdf/2507.02975
Copy Paste: [[2507.02975]] Introducing Answered with Evidence -- a framework for evaluating whether LLM responses to biomedical questions are founded in evidence(https://arxiv.org/abs/2507.02975)
Keywords: generation
Abstract: The growing use of large language models (LLMs) for biomedical question answering raises concerns about the accuracy and evidentiary support of their responses. To address this, we present Answered with Evidence, a framework for evaluating whether LLM-generated answers are grounded in scientific literature. We analyzed thousands of physician-submitted questions using a comparative pipeline that included: (1) Alexandria, formerly known as the Atropos Evidence Library, a retrieval-augmented generation (RAG) system based on novel observational studies, and (2) two PubMed-based retrieval-augmented systems (System and Perplexity). We found that PubMed-based systems provided evidence-supported answers for approximately 44% of questions, while the novel evidence source did so for about 50%. Combined, these sources enabled reliable answers to over 70% of biomedical queries. As LLMs become increasingly capable of summarizing scientific content, maximizing their value will require systems that can accurately retrieve both published and custom-generated evidence or generate such evidence in real time.

Title: FreqCross: A Multi-Modal Frequency-Spatial Fusion Network for Robust Detection of Stable Diffusion 3.5 Generated Images

Authors: Guang Yang
Subjects: cs.CV, cs.CR
Abstract URL: https://arxiv.org/abs/2507.02995
Pdf URL: https://arxiv.org/pdf/2507.02995
Copy Paste: [[2507.02995]] FreqCross: A Multi-Modal Frequency-Spatial Fusion Network for Robust Detection of Stable Diffusion 3.5 Generated Images(https://arxiv.org/abs/2507.02995)
Keywords: generation
Abstract: The rapid advancement of diffusion models, particularly Stable Diffusion 3.5, has enabled the generation of highly photorealistic synthetic images that pose significant challenges to existing detection methods. This paper presents FreqCross, a novel multi-modal fusion network that combines spatial RGB features, frequency domain artifacts, and radial energy distribution patterns to achieve robust detection of AI-generated images. Our approach leverages a three-branch architecture: (1) a ResNet-18 backbone for spatial feature extraction, (2) a lightweight CNN for processing 2D FFT magnitude spectra, and (3) a multi-layer perceptron for analyzing radial energy profiles. We introduce a novel radial energy distribution analysis that captures characteristic frequency artifacts inherent in diffusion-generated images, and fuse it with spatial and spectral cues via simple feature concatenation followed by a compact classification head. Extensive experiments on a dataset of 10,000 paired real (MS-COCO) and synthetic (Stable Diffusion 3.5) images demonstrate that FreqCross achieves 97.8% accuracy, outperforming state-of-the-art baselines by 5.2%. The frequency analysis further reveals that synthetic images exhibit distinct spectral signatures in the 0.1-0.4 normalised frequency range, providing a theoretical foundation for our approach. Code and pre-trained models are publicly available to facilitate reproducible research.
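A radial energy profile of the kind fed to the third branch can be sketched with the standard library alone. The binning scheme, the normalisation, and the 8x8 test images are assumptions, not details from the paper, and a naive discrete Fourier transform stands in for a real FFT routine so that the example has no dependencies.

```python
import cmath
import math

def dft2_magnitude(img):
    """Naive 2D DFT magnitude spectrum. O(N^4), so only for tiny demo images;
    a real pipeline would use an FFT library instead."""
    h, w = len(img), len(img[0])
    mag = [[0.0] * w for _ in range(h)]
    for u in range(h):
        for v in range(w):
            acc = 0j
            for y in range(h):
                for x in range(w):
                    acc += img[y][x] * cmath.exp(-2j * math.pi * (u * y / h + v * x / w))
            mag[u][v] = abs(acc)
    return mag

def radial_energy_profile(mag, num_bins):
    """Share of spectral energy per radial band of normalised frequency.
    Bins span radius 0 to 0.5; diagonal corners beyond 0.5 fall in the last bin."""
    h, w = len(mag), len(mag[0])
    energy = [0.0] * num_bins
    for u in range(h):
        for v in range(w):
            # Map DFT indices to signed normalised frequencies in [-0.5, 0.5).
            fu = (u if u < h / 2 else u - h) / h
            fv = (v if v < w / 2 else v - w) / w
            radius = math.hypot(fu, fv)
            b = min(int(radius / 0.5 * num_bins), num_bins - 1)
            energy[b] += mag[u][v] ** 2
    total = sum(energy)
    return [e / total for e in energy] if total else energy

flat = [[1.0] * 8 for _ in range(8)]                                       # only a DC component
checker = [[1.0 if (x + y) % 2 == 0 else -1.0 for x in range(8)] for y in range(8)]  # highest frequency
flat_profile = radial_energy_profile(dft2_magnitude(flat), num_bins=5)
checker_profile = radial_energy_profile(dft2_magnitude(checker), num_bins=5)
```

The flat image puts all its energy in the lowest band and the checkerboard in the highest, which shows how the profile compresses a 2D spectrum into a short vector suitable as input to a small multi-layer perceptron.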

Title: Rethinking Data Protection in the (Generative) Artificial Intelligence Era

Authors: Yiming Li, Shuo Shao, Yu He, Junfeng Guo, Tianwei Zhang, Zhan Qin, Pin-Yu Chen, Michael Backes, Philip Torr, Dacheng Tao, Kui Ren
Subjects: cs.LG, cs.AI, cs.CR, cs.CV, cs.CY
Abstract URL: https://arxiv.org/abs/2507.03034
Pdf URL: https://arxiv.org/pdf/2507.03034
Copy Paste: [[2507.03034]] Rethinking Data Protection in the (Generative) Artificial Intelligence Era(https://arxiv.org/abs/2507.03034)
Keywords: generative
Abstract: The (generative) artificial intelligence (AI) era has profoundly reshaped the meaning and value of data. No longer confined to static content, data now permeates every stage of the AI lifecycle, from the training samples that shape model parameters to the prompts and outputs that drive real-world model deployment. This shift renders traditional notions of data protection insufficient, while the boundaries of what needs safeguarding remain poorly defined. Failing to safeguard data in AI systems can inflict societal and individual harm, underscoring the urgent need to clearly delineate the scope of and rigorously enforce data protection. In this perspective, we propose a four-level taxonomy, including non-usability, privacy preservation, traceability, and deletability, that captures the diverse protection needs arising in modern (generative) AI models and systems. Our framework offers a structured understanding of the trade-offs between data utility and control, spanning the entire AI pipeline, including training datasets, model weights, system prompts, and AI-generated content. We analyze representative technical approaches at each level and reveal regulatory blind spots that leave critical assets exposed. By offering a structured lens to align future AI technologies and governance with trustworthy data practices, we underscore the urgency of rethinking data protection for modern AI techniques and provide timely guidance for developers, researchers, and regulators alike.

Title: Cycle-Consistent Helmholtz Machine: Goal-Seeded Simulation via Inverted Inference

Authors: Xin Li
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.03065
Pdf URL: https://arxiv.org/pdf/2507.03065
Copy Paste: [[2507.03065]] Cycle-Consistent Helmholtz Machine: Goal-Seeded Simulation via Inverted Inference(https://arxiv.org/abs/2507.03065)
Keywords: generation, generative
Abstract: The Helmholtz Machine (HM) is a foundational architecture for unsupervised learning, coupling a bottom-up recognition model with a top-down generative model through alternating inference. However, its reliance on symmetric, data-driven updates constrains its ability to perform goal-directed reasoning or simulate temporally extended processes. In this work, we introduce the Cycle-Consistent Helmholtz Machine (C$^2$HM), a novel extension that reframes inference as a goal-seeded, asymmetric process grounded in structured internal priors. Rather than inferring latent causes solely from sensory data, C$^2$HM simulates plausible latent trajectories conditioned on abstract goals, aligning them with observed outcomes through a recursive cycle of forward generation and inverse refinement. This cycle-consistent formulation integrates top-down structure with bottom-up evidence via a variational loop, enforcing mutual alignment between goal-conditioned latent predictions and recognition-based reconstructions. We formalize this mechanism within the framework of the Context-Content Uncertainty Principle (CCUP), which posits that inference proceeds by aligning structured, low-entropy content with high-entropy, ambiguous context. C$^2$HM improves representational efficiency, supports memory chaining via path-dependent inference, and enables spatial compositional imagination. By offering a biologically inspired alternative to classical amortized inference, C$^2$HM reconceives generative modeling as intentional simulation, bridging memory-based planning and unsupervised learning in a unified probabilistic framework.

Title: SymMatika: Structure-Aware Symbolic Discovery

Authors: Michael Scherk, Boyuan Chen
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.03110
Pdf URL: https://arxiv.org/pdf/2507.03110
Copy Paste: [[2507.03110]] SymMatika: Structure-Aware Symbolic Discovery(https://arxiv.org/abs/2507.03110)
Keywords: generation
Abstract: Symbolic regression (SR) seeks to recover closed-form mathematical expressions that describe observed data. While existing methods have advanced the discovery of either explicit mappings (i.e., $y = f(\mathbf{x})$) or implicit relations (i.e., $F(\mathbf{x}, y)=0$), few modern and accessible frameworks support both. Moreover, most approaches treat each expression candidate in isolation, without reusing recurring structural patterns that could accelerate search. We introduce SymMatika, a hybrid SR algorithm that combines multi-island genetic programming (GP) with a reusable motif library inspired by biological sequence analysis. SymMatika identifies high-impact substructures in top-performing candidates and reintroduces them to guide future generations. Additionally, it incorporates a feedback-driven evolutionary engine and supports both explicit and implicit relation discovery using implicit-derivative metrics. Across benchmarks, SymMatika achieves state-of-the-art recovery rates, with 5.1% higher performance than the previous best results on Nguyen, the first recovery of Nguyen-12, and competitive performance on the Feynman equations. It also recovers implicit physical laws from Eureqa datasets up to $100\times$ faster. Our results demonstrate the power of structure-aware evolutionary search for scientific discovery. To support broader research in interpretable modeling and symbolic discovery, we have open-sourced the full SymMatika framework.

Title: BLaST: High Performance Inference and Pretraining using BLock Sparse Transformers

Authors: Patrik Okanovic, Sameer Deshmukh, Grzegorz Kwasniewski, Kentaro Katayama, Takumi Honda, Maciej Besta, Torsten Hoefler
Subjects: cs.LG, cs.DC
Abstract URL: https://arxiv.org/abs/2507.03117
Pdf URL: https://arxiv.org/pdf/2507.03117
Copy Paste: [[2507.03117]] BLaST: High Performance Inference and Pretraining using BLock Sparse Transformers(https://arxiv.org/abs/2507.03117)
Keywords: generation
Abstract: The energy consumption of large-scale ML models is dominated by data movement - shuffling billions of parameters across memory hierarchies and data centers. Effective sparsification to prune redundant parameters is still challenging: existing methods incur significant accuracy degradation, performance overhead, or both. We introduce (Bl)ock (a)nd (S)parse (T)ransformers (BLaST), a general, robust, and reliable sparsification method applicable to linear layers in all settings. Our method iteratively sparsifies weight matrices into a block sparsity pattern suitable for efficient sparse matrix-matrix (SpMM) multiplication. BLaST achieves up to 95% sparsity in MLP weights with negligible accuracy loss. Our fused, highly optimized Sparse MLP kernel delivers up to 16.7x speedup over dense MLPs across 9 architectures and 8 datasets, resulting in up to 1.6x inference speedup, 1.11x pretraining speedup and up to 3.12x inference memory usage reduction. BLaST enables the next generation of large-scale AI systems by reducing energy use, memory footprint, and latency.
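To make the idea of a block sparsity pattern concrete, here is a minimal sketch in plain Python. It is not the BLaST algorithm: the abstract does not state how blocks are selected, so the one-shot magnitude criterion used here (drop the blocks with the smallest squared Frobenius norm) is an assumption, and the function name `block_sparsify` and its parameters are hypothetical.

```python
def block_sparsify(weight, block, sparsity):
    """Zero out the lowest-magnitude square blocks of a weight matrix.

    weight   -- matrix as a list of equal-length rows
    block    -- side length of each square block (must divide both dimensions)
    sparsity -- fraction of blocks to zero, in [0, 1]

    Assumption: blocks are ranked by squared Frobenius norm. The abstract
    describes an iterative procedure whose criterion is not given here.
    """
    rows, cols = len(weight), len(weight[0])
    assert rows % block == 0 and cols % block == 0

    # Score every block by the sum of its squared entries.
    scores = {}
    for bi in range(0, rows, block):
        for bj in range(0, cols, block):
            scores[(bi, bj)] = sum(
                weight[i][j] ** 2
                for i in range(bi, bi + block)
                for j in range(bj, bj + block)
            )

    # Drop the requested fraction of blocks, lowest score first.
    n_drop = int(sparsity * len(scores))
    dropped = sorted(scores, key=scores.get)[:n_drop]

    pruned = [row[:] for row in weight]
    for bi, bj in dropped:
        for i in range(bi, bi + block):
            for j in range(bj, bj + block):
                pruned[i][j] = 0.0
    return pruned


W = [
    [1.0, 1.0, 0.1, 0.1],
    [1.0, 1.0, 0.1, 0.1],
    [0.2, 0.2, 2.0, 2.0],
    [0.2, 0.2, 2.0, 2.0],
]
pruned_W = block_sparsify(W, block=2, sparsity=0.5)
```

Because whole blocks are zeroed, a sparse matrix-matrix kernel can skip them as contiguous units, which is why the abstract calls the pattern suitable for efficient SpMM.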

Title: HGCA: Hybrid GPU-CPU Attention for Long Context LLM Inference

Authors: Weishu Deng, Yujie Yang, Peiran Du, Lingfeng Xiang, Zhen Lin, Chen Zhong, Song Jiang, Hui Lu, Jia Rao
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.03153
Pdf URL: https://arxiv.org/pdf/2507.03153
Copy Paste: [[2507.03153]] HGCA: Hybrid GPU-CPU Attention for Long Context LLM Inference(https://arxiv.org/abs/2507.03153)
Keywords: generation
Abstract: Scaling inference for large language models (LLMs) is increasingly constrained by limited GPU memory, especially due to growing key-value (KV) caches required for long-context generation. While existing approaches offload KV caches to CPU memory or apply sparse attention to reduce GPU load, they often underutilize CPU compute resources and compromise accuracy. We present HGCA, a hybrid CPU-GPU attention mechanism that enables scalable, high-throughput LLM inference with near-full attention quality. HGCA performs dense attention on recently generated KV entries retained in GPU memory and parallel sparse attention on selected, salient KV entries in CPU memory. The attention outputs are efficiently merged using log-sum-exp fusion, minimizing PCIe transfer overhead. HGCA also introduces a fine-grained, per-head sparsification strategy optimized for CPU execution, preserving contextual relevance while reducing computation. Our implementation seamlessly integrates into existing LLM frameworks without requiring model retraining. Experiments across diverse models and workloads show that HGCA achieves superior scalability, supports longer sequences and larger batch sizes, and outperforms existing sparse attention baselines in both performance and accuracy -- all on commodity GPU hardware.
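The log-sum-exp fusion mentioned in the abstract can be illustrated with a small sketch. It shows the standard identity for combining softmax attention computed over two disjoint sets of KV entries; it is not HGCA's implementation, and the function names `partial_attention` and `merge_partials` are hypothetical. Each partition returns only its output vector and one scalar (the log-sum-exp of its scores), so that is all that needs to cross between devices.

```python
import math


def partial_attention(q, keys, values):
    """Softmax attention of one query over one KV partition.

    Returns (output vector, log-sum-exp of the scaled scores).
    """
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    out = [
        sum(w * v[d] for w, v in zip(weights, values)) / total
        for d in range(dim)
    ]
    return out, m + math.log(total)


def merge_partials(out_a, lse_a, out_b, lse_b):
    """Combine two partial results into attention over the union of both partitions."""
    m = max(lse_a, lse_b)
    wa = math.exp(lse_a - m)  # relative softmax mass of partition A
    wb = math.exp(lse_b - m)  # relative softmax mass of partition B
    return [(wa * a + wb * b) / (wa + wb) for a, b in zip(out_a, out_b)]


q = [1.0, 0.5]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.5]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

# Pretend the first two entries live on one device and the rest on another.
out_a, lse_a = partial_attention(q, keys[:2], values[:2])
out_b, lse_b = partial_attention(q, keys[2:], values[2:])
merged = merge_partials(out_a, lse_a, out_b, lse_b)

# Reference: attention over all entries at once.
full, _ = partial_attention(q, keys, values)
```

When both partitions are attended densely the merged result matches full attention exactly. In the setting the abstract describes, the CPU side attends only to selected entries, so the merged output approximates full attention.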

Title: Latent Thermodynamic Flows: Unified Representation Learning and Generative Modeling of Temperature-Dependent Behaviors from Limited Data

Authors: Yunrui Qiu, Richard John, Lukas Herron, Pratyush Tiwary
Subjects: cs.LG, cond-mat.stat-mech, physics.bio-ph, physics.chem-ph
Abstract URL: https://arxiv.org/abs/2507.03174
Pdf URL: https://arxiv.org/pdf/2507.03174
Copy Paste: [[2507.03174]] Latent Thermodynamic Flows: Unified Representation Learning and Generative Modeling of Temperature-Dependent Behaviors from Limited Data(https://arxiv.org/abs/2507.03174)
Keywords: generative
Abstract: Accurate characterization of the equilibrium distributions of complex molecular systems and their dependence on environmental factors such as temperature is essential for understanding thermodynamic properties and transition mechanisms. Projecting these distributions onto meaningful low-dimensional representations enables interpretability and downstream analysis. Recent advances in generative AI, particularly flow models such as Normalizing Flows (NFs), have shown promise in modeling such distributions, but their scope is limited without tailored representation learning. In this work, we introduce Latent Thermodynamic Flows (LaTF), an end-to-end framework that tightly integrates representation learning and generative modeling. LaTF unifies the State Predictive Information Bottleneck (SPIB) with NFs to simultaneously learn low-dimensional latent representations, referred to as Collective Variables (CVs), classify metastable states, and generate equilibrium distributions across temperatures beyond the training data. The two components of representation learning and generative modeling are optimized jointly, ensuring that the learned latent features capture the system's slow, important degrees of freedom while the generative model accurately reproduces the system's equilibrium behavior. We demonstrate LaTF's effectiveness across diverse systems, including a model potential, the Chignolin protein, and a cluster of Lennard-Jones particles, with thorough evaluations and benchmarking using multiple metrics and extensive simulations. Finally, we apply LaTF to an RNA tetraloop system, where despite using simulation data from only two temperatures, LaTF reconstructs the temperature-dependent structural ensemble and melting behavior, consistent with experimental and prior extensive computational results.

Title: LACONIC: A 3D Layout Adapter for Controllable Image Creation

Authors: Léopold Maillard, Tom Durand, Adrien Ramanana Rahary, Maks Ovsjanikov
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.03257
Pdf URL: https://arxiv.org/pdf/2507.03257
Copy Paste: [[2507.03257]] LACONIC: A 3D Layout Adapter for Controllable Image Creation(https://arxiv.org/abs/2507.03257)
Keywords: generative
Abstract: Existing generative approaches for guided image synthesis of multi-object scenes typically rely on 2D controls in the image or text space. As a result, these methods struggle to maintain and respect the consistent three-dimensional geometric structure underlying the scene. In this paper, we propose a novel conditioning approach, training method and adapter network that can be plugged into pretrained text-to-image diffusion models. Our approach provides a way to endow such models with 3D-awareness, while leveraging their rich prior knowledge. Our method supports camera control, conditioning on explicit 3D geometries and, for the first time, accounts for the entire context of a scene, i.e., both on- and off-screen items, to synthesize plausible and semantically rich images. Despite its multi-modal nature, our model is lightweight, requires a reasonable amount of data for supervised learning and shows remarkable generalization power. We also introduce methods for intuitive and consistent image editing and restyling, e.g., by positioning, rotating or resizing individual objects in a scene. Our method integrates well within various image creation workflows and enables a richer set of applications compared to previous approaches.

Title: Dual-frequency Selected Knowledge Distillation with Statistical-based Sample Rectification for PolSAR Image Classification

Authors: Xinyue Xin, Ming Li, Yan Wu, Xiang Li, Peng Zhang, Dazhi Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.03268
Pdf URL: https://arxiv.org/pdf/2507.03268
Copy Paste: [[2507.03268]] Dual-frequency Selected Knowledge Distillation with Statistical-based Sample Rectification for PolSAR Image Classification(https://arxiv.org/abs/2507.03268)
Keywords: generation
Abstract: The collaborative classification of dual-frequency PolSAR images is a meaningful but challenging research problem. The effect of regional consistency on classification information learning and the rational use of dual-frequency data are the two main difficulties for dual-frequency collaborative classification. To tackle these problems, a selected knowledge distillation network with statistical-based sample rectification (SKDNet-SSR) is proposed in this article. First, in addition to applying CNN and ViT as local and global feature extractors, a statistical-based dynamic sample rectification (SDSR) module is designed to avoid the impact of poor regional consistency on the spatial information learning process. Specifically, based on the fact that the PolSAR covariance matrix conforms to the complex Wishart distribution, SDSR first dynamically evaluates the sample purity, and then performs pixel selection and pixel generation to remove noisy pixels, thereby avoiding the feature interaction between informative pixels and noisy pixels and improving the classification feature extraction process. Next, a dual-frequency gate-selected distillation (DGSD) module is constructed to emphasize the advantages of different frequency bands and perform complementary learning on dual-frequency data. It uses the dominant single-frequency branch on each sample as the teacher model to train the dual-frequency student model, enabling the student model to learn the optimal results and realizing complementary utilization of dual-frequency data on different terrain objects. Comprehensive experiments on four measured dual-frequency PolSAR datasets demonstrate that the proposed SKDNet-SSR outperforms other related methods.

Title: ConceptMix++: Leveling the Playing Field in Text-to-Image Benchmarking via Iterative Prompt Optimization

Authors: Haosheng Gan, Berk Tinaz, Mohammad Shahab Sepehri, Zalan Fabian, Mahdi Soltanolkotabi
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2507.03275
Pdf URL: https://arxiv.org/pdf/2507.03275
Copy Paste: [[2507.03275]] ConceptMix++: Leveling the Playing Field in Text-to-Image Benchmarking via Iterative Prompt Optimization(https://arxiv.org/abs/2507.03275)
Keywords: generation, generative
Abstract: Current text-to-image (T2I) benchmarks evaluate models on rigid prompts, potentially underestimating true generative capabilities due to prompt sensitivity and creating biases that favor certain models while disadvantaging others. We introduce ConceptMix++, a framework that disentangles prompt phrasing from visual generation capabilities by applying iterative prompt optimization. Building on ConceptMix, our approach incorporates a multimodal optimization pipeline that leverages vision-language model feedback to refine prompts systematically. Through extensive experiments across multiple diffusion models, we show that optimized prompts significantly improve compositional generation performance, revealing previously hidden model capabilities and enabling fairer comparisons across T2I models. Our analysis reveals that certain visual concepts -- such as spatial relationships and shapes -- benefit more from optimization than others, suggesting that existing benchmarks systematically underestimate model performance in these categories. Additionally, we find strong cross-model transferability of optimized prompts, indicating shared preferences for effective prompt phrasing across models. These findings demonstrate that rigid benchmarking approaches may significantly underrepresent true model capabilities, while our framework provides more accurate assessment and insights for future development.

Title: Global Variational Inference Enhanced Robust Domain Adaptation

Authors: Lingkun Luo, Shiqiang Hu, Liming Chen
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.03291
Pdf URL: https://arxiv.org/pdf/2507.03291
Copy Paste: [[2507.03291]] Global Variational Inference Enhanced Robust Domain Adaptation(https://arxiv.org/abs/2507.03291)
Keywords: generative
Abstract: Deep learning-based domain adaptation (DA) methods have shown strong performance by learning transferable representations. However, their reliance on mini-batch training limits global distribution modeling, leading to unstable alignment and suboptimal generalization. We propose Global Variational Inference Enhanced Domain Adaptation (GVI-DA), a framework that learns continuous, class-conditional global priors via variational inference to enable structure-aware cross-domain alignment. GVI-DA minimizes domain gaps through latent feature reconstruction, and mitigates posterior collapse using global codebook learning with randomized sampling. It further improves robustness by discarding low-confidence pseudo-labels and generating reliable target-domain samples. Extensive experiments on four benchmarks and thirty-eight DA tasks demonstrate consistent state-of-the-art performance. We also derive the model's evidence lower bound (ELBO) and analyze the effects of prior continuity, codebook size, and pseudo-label noise tolerance. In addition, we compare GVI-DA with diffusion-based generative frameworks in terms of optimization principles and efficiency, highlighting both its theoretical soundness and practical advantages.
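For orientation, the generic evidence lower bound optimized by variational models can be written as below. This is the textbook form, not the GVI-DA-specific bound that the paper derives. The symbols $x$ (observation), $z$ (latent variable), $q_\phi$ (approximate posterior), $p_\theta$ (generative model), and $p(z)$ (prior) are assumptions introduced here for illustration.

```latex
\log p_\theta(x) \;\ge\;
\mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right]
\;-\; \mathrm{KL}\!\left(q_\phi(z \mid x)\,\|\,p(z)\right)
```

The first term rewards accurate reconstruction and the second keeps the approximate posterior close to the prior. Posterior collapse, which the abstract says GVI-DA mitigates, is the failure mode in which the KL term is driven to zero and the latent carries no information about $x$.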

Title: CPKD: Clinical Prior Knowledge-Constrained Diffusion Models for Surgical Phase Recognition in Endoscopic Submucosal Dissection

Authors: Xiangning Zhang, Jinnan Chen, Qingwei Zhang, Yaqi Wang, Chengfeng Zhou, Xiaobo Li, Dahong Qian
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.03295
Pdf URL: https://arxiv.org/pdf/2507.03295
Copy Paste: [[2507.03295]] CPKD: Clinical Prior Knowledge-Constrained Diffusion Models for Surgical Phase Recognition in Endoscopic Submucosal Dissection(https://arxiv.org/abs/2507.03295)
Keywords: generative
Abstract: Gastrointestinal malignancies constitute a leading cause of cancer-related mortality worldwide, with advanced-stage prognosis remaining particularly dismal. Originating as a groundbreaking technique for early gastric cancer treatment, Endoscopic Submucosal Dissection (ESD) has evolved into a versatile intervention for diverse gastrointestinal lesions. While computer-assisted systems significantly enhance procedural precision and safety in ESD, their clinical adoption faces a critical bottleneck: reliable surgical phase recognition within complex endoscopic workflows. Current state-of-the-art approaches predominantly rely on multi-stage refinement architectures that iteratively optimize temporal predictions. In this paper, we present Clinical Prior Knowledge-Constrained Diffusion (CPKD), a novel generative framework that reimagines phase recognition through denoising diffusion principles while preserving the core iterative refinement philosophy. This architecture progressively reconstructs phase sequences starting from random noise and conditioned on visual-temporal features. To better capture three domain-specific characteristics, including positional priors, boundary ambiguity, and relation dependency, we design a conditional masking strategy. Furthermore, we incorporate clinical prior knowledge into the model training to improve its ability to correct phase logical errors. Comprehensive evaluations on ESD820, Cholec80, and external multi-center data demonstrate that our proposed CPKD achieves superior or comparable performance to state-of-the-art approaches, validating the effectiveness of diffusion-based generative paradigms for surgical phase recognition.

Title: Personalized Image Generation from an Author Writing Style

Authors: Sagar Gandhi, Vishal Gandhi
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2507.03313
Pdf URL: https://arxiv.org/pdf/2507.03313
Copy Paste: [[2507.03313]] Personalized Image Generation from an Author Writing Style(https://arxiv.org/abs/2507.03313)
Keywords: generation, generative
Abstract: Translating nuanced, textually-defined authorial writing styles into compelling visual representations presents a novel challenge in generative AI. This paper introduces a pipeline that leverages Author Writing Sheets (AWS) - structured summaries of an author's literary characteristics - as input to a Large Language Model (LLM, Claude 3.7 Sonnet). The LLM interprets the AWS to generate three distinct, descriptive text-to-image prompts, which are then rendered by a diffusion model (Stable Diffusion 3.5 Medium). We evaluated our approach using 49 author styles from Reddit data, with human evaluators assessing the stylistic match and visual distinctiveness of the generated images. Results indicate a good perceived alignment between the generated visuals and the textual authorial profiles (mean style match: $4.08/5$), with images rated as moderately distinctive. Qualitative analysis further highlighted the pipeline's ability to capture mood and atmosphere, while also identifying challenges in representing highly abstract narrative elements. This work contributes a novel end-to-end methodology for visual authorial style personalization and provides an initial empirical validation, opening avenues for applications in creative assistance and cross-modal understanding.

Title: Mirror in the Model: Ad Banner Image Generation via Reflective Multi-LLM and Multi-modal Agents

Authors: Zhao Wang, Bowen Chen, Yotaro Shimose, Sota Moriyama, Heng Wang, Shingo Takamatsu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.03326
Pdf URL: https://arxiv.org/pdf/2507.03326
Copy Paste: [[2507.03326]] Mirror in the Model: Ad Banner Image Generation via Reflective Multi-LLM and Multi-modal Agents(https://arxiv.org/abs/2507.03326)
Keywords: generation, generative
Abstract: Recent generative models such as GPT-4o have shown strong capabilities in producing high-quality images with accurate text rendering. However, commercial design tasks like advertising banners demand more than visual fidelity -- they require structured layouts, precise typography, consistent branding, and more. In this paper, we introduce MIMO (Mirror In-the-Model), an agentic refinement framework for automatic ad banner generation. MIMO combines a hierarchical multi-modal agent system (MIMO-Core) with a coordination loop (MIMO-Loop) that explores multiple stylistic directions and iteratively improves design quality. Requiring only a simple natural language based prompt and logo image as input, MIMO automatically detects and corrects multiple types of errors during generation. Experiments show that MIMO significantly outperforms existing diffusion and LLM-based baselines in real-world banner design scenarios.

Title: Task-Specific Generative Dataset Distillation with Difficulty-Guided Sampling

Authors: Mingzhuo Li, Guang Li, Jiafeng Mao, Linfeng Ye, Takahiro Ogawa, Miki Haseyama
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2507.03331
Pdf URL: https://arxiv.org/pdf/2507.03331
Copy Paste: [[2507.03331]] Task-Specific Generative Dataset Distillation with Difficulty-Guided Sampling(https://arxiv.org/abs/2507.03331)
Keywords: generative
Abstract: To alleviate the reliance of deep neural networks on large-scale datasets, dataset distillation aims to generate compact, high-quality synthetic datasets that can achieve comparable performance to the original dataset. The integration of generative models has significantly advanced this field. However, existing approaches primarily focus on aligning the distilled dataset with the original one, often overlooking task-specific information that can be critical for optimal downstream performance. In this paper, focusing on the downstream task of classification, we propose a task-specific sampling strategy for generative dataset distillation that incorporates the concept of difficulty to consider the requirements of the target task better. The final dataset is sampled from a larger image pool with a sampling distribution obtained by matching the difficulty distribution of the original dataset. A logarithmic transformation is applied as a pre-processing step to correct for distributional bias. The results of extensive experiments demonstrate the effectiveness of our method and suggest its potential for enhancing performance on other downstream tasks.
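The sampling step described in the abstract can be sketched roughly as follows. This is an illustrative approximation under stated assumptions, not the authors' method: the abstract does not define the difficulty score, the binning, or the exact form of the logarithmic transformation, so the positive per-image difficulty scores, the log-spaced histogram bins, and the names `bin_index` and `difficulty_matched_sample` are all hypothetical.

```python
import math
import random
from collections import Counter


def bin_index(score, lo, hi, n_bins):
    """Map a positive difficulty score to a histogram bin after a log transform."""
    pos = (math.log(score) - math.log(lo)) / (math.log(hi) - math.log(lo))
    return min(n_bins - 1, max(0, int(pos * n_bins)))


def difficulty_matched_sample(original_scores, pool, budget,
                              n_bins=3, lo=1e-3, hi=1.0, seed=0):
    """Pick about `budget` items from `pool` so that their binned difficulty
    histogram follows that of the original dataset.

    original_scores -- difficulty scores of the original dataset (assumed given)
    pool            -- list of (item_id, difficulty) for the generated image pool
    """
    rng = random.Random(seed)

    # Target histogram: how the original dataset spreads over difficulty bins.
    target = Counter(bin_index(s, lo, hi, n_bins) for s in original_scores)

    # Group pool items by their difficulty bin.
    by_bin = {}
    for item, score in pool:
        by_bin.setdefault(bin_index(score, lo, hi, n_bins), []).append(item)

    # Draw from each bin in proportion to the target histogram.
    chosen = []
    for b, count in sorted(target.items()):
        quota = round(budget * count / len(original_scores))
        candidates = by_bin.get(b, [])
        chosen.extend(rng.sample(candidates, min(quota, len(candidates))))
    return chosen


original = [0.005, 0.05, 0.05, 0.5]          # one easy, two medium, one hard
pool = ([(f"a{i}", 0.005) for i in range(4)]
        + [(f"b{i}", 0.05) for i in range(4)]
        + [(f"c{i}", 0.5) for i in range(4)])
picked = difficulty_matched_sample(original, pool, budget=4)
```

Rounding the per-bin quotas means the result can differ slightly from `budget` for other inputs; a fuller version would redistribute the remainder.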

Title: Masked Temporal Interpolation Diffusion for Procedure Planning in Instructional Videos

Authors: Yufan Zhou, Zhaobo Qi, Lingshuai Lin, Junqi Jing, Tingting Chai, Beichen Zhang, Shuhui Wang, Weigang Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.03393
Pdf URL: https://arxiv.org/pdf/2507.03393
Copy Paste: [[2507.03393]] Masked Temporal Interpolation Diffusion for Procedure Planning in Instructional Videos(https://arxiv.org/abs/2507.03393)
Keywords: generation
Abstract: In this paper, we address the challenge of procedure planning in instructional videos, aiming to generate coherent and task-aligned action sequences from start and end visual observations. Previous work has mainly relied on text-level supervision to bridge the gap between observed states and unobserved actions, but it struggles with capturing intricate temporal relationships among actions. Building on these efforts, we propose the Masked Temporal Interpolation Diffusion (MTID) model that introduces a latent space temporal interpolation module within the diffusion model. This module leverages a learnable interpolation matrix to generate intermediate latent features, thereby augmenting visual supervision with richer mid-state details. By integrating this enriched supervision into the model, we enable end-to-end training tailored to task-specific requirements, significantly enhancing the model's capacity to predict temporally coherent action sequences. Additionally, we introduce an action-aware mask projection mechanism to restrict the action generation space, combined with a task-adaptive masked proximity loss to prioritize more accurate reasoning results close to the given start and end states over those in intermediate steps. Simultaneously, it filters out task-irrelevant action predictions, leading to contextually aware action sequences. Experimental results across three widely used benchmark datasets demonstrate that our MTID achieves promising action planning performance on most metrics. The code is available at this https URL.

Title: Pose-Star: Anatomy-Aware Editing for Open-World Fashion Images

Authors: Yuran Dong, Mang Ye
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2507.03402
Pdf URL: https://arxiv.org/pdf/2507.03402
Copy Paste: [[2507.03402]] Pose-Star: Anatomy-Aware Editing for Open-World Fashion Images(https://arxiv.org/abs/2507.03402)
Keywords: generation
Abstract: To advance real-world fashion image editing, we analyze existing two-stage pipelines (mask generation followed by diffusion-based editing), which overly prioritize generator optimization while neglecting mask controllability. This results in two critical limitations: I) poor user-defined flexibility (coarse-grained human masks restrict edits to predefined regions like upper torso; fine-grained clothes masks preserve poses but forbid style/length customization). II) weak pose robustness (mask generators fail due to articulated poses and miss rare regions like waist, while human parsers remain limited by predefined categories). To address these gaps, we propose Pose-Star, a framework that dynamically recomposes body structures (e.g., neck, chest, etc.) into anatomy-aware masks (e.g., chest-length) for user-defined edits. In Pose-Star, we calibrate diffusion-derived attention (Star tokens) via skeletal keypoints to enhance rare structure localization in complex poses, suppress noise through phase-aware analysis of attention dynamics (Convergence, Stabilization, Divergence) with threshold masking and sliding-window fusion, and refine edges via cross-self attention merging and Canny alignment. This work bridges controlled benchmarks and open-world demands, pioneering anatomy-aware, pose-robust editing and laying the foundation for industrial fashion image editing.

Title: Reinforcement Learning-based Feature Generation Algorithm for Scientific Data

Authors: Meng Xiao, Junfeng Zhou, Yuanchun Zhou
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2507.03498
Pdf URL: https://arxiv.org/pdf/2507.03498
Copy Paste: [[2507.03498]] Reinforcement Learning-based Feature Generation Algorithm for Scientific Data(https://arxiv.org/abs/2507.03498)
Keywords: generation
Abstract: Feature generation (FG) aims to enhance the prediction potential of original data by constructing high-order feature combinations and removing redundant features. It is a key preprocessing step for tabular scientific data to improve downstream machine-learning model performance. Traditional methods face two challenges when generating features for scientific data. First, the effective construction of high-order feature combinations in scientific data requires profound and extensive domain-specific expertise. Second, as the order of feature combinations increases, the search space expands exponentially, demanding a prohibitive amount of human labor. Advancements in the Data-Centric Artificial Intelligence (DCAI) paradigm have opened novel avenues for automating feature generation processes. Inspired by this, this paper revisits the conventional feature generation workflow and proposes the Multi-agent Feature Generation (MAFG) framework. Specifically, in the iterative exploration stage, multiple agents collaboratively construct mathematical transformation equations, synthesize and identify feature combinations exhibiting high information content, and leverage a reinforcement learning mechanism to evolve their strategies. Upon completing the exploration phase, MAFG integrates large language models (LLMs) to interpretatively evaluate the features generated at each significant model performance breakthrough. Experimental results and case studies consistently demonstrate that the MAFG framework effectively automates the feature generation process and significantly enhances various downstream scientific data mining tasks.

Title: Generating Synthetic Relational Tabular Data via Structural Causal Models

Authors: Frederik Hoppe, Astrid Franz, Lars Kleinemeier, Udo Göbel
Subjects: cs.LG, cs.AI, stat.AP
Abstract URL: https://arxiv.org/abs/2507.03528
Pdf URL: https://arxiv.org/pdf/2507.03528
Copy Paste: [[2507.03528]] Generating Synthetic Relational Tabular Data via Structural Causal Models(https://arxiv.org/abs/2507.03528)
Keywords: generation
Abstract: Synthetic tabular data generation has received increasing attention in recent years, particularly with the emergence of foundation models for tabular data. The breakthrough success of TabPFN (Hollmann et al., 2025), which leverages vast quantities of synthetic tabular datasets derived from structural causal models (SCMs), demonstrates the critical role synthetic data plays in developing powerful tabular foundation models. However, most real-world tabular data exists in relational formats spanning multiple interconnected tables, a structure not adequately addressed by current generation methods. In this work, we extend the SCM-based approach by developing a novel framework that generates realistic synthetic relational tabular data, including causal relationships across tables. Our experiments confirm that this framework is able to construct relational datasets with complex inter-table dependencies mimicking real-world scenarios.

Title: Beyond Accuracy: Metrics that Uncover What Makes a 'Good' Visual Descriptor

Authors: Ethan Lin, Linxi Zhao, Atharva Sehgal, Jennifer J. Sun
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.03542
Pdf URL: https://arxiv.org/pdf/2507.03542
Copy Paste: [[2507.03542]] Beyond Accuracy: Metrics that Uncover What Makes a 'Good' Visual Descriptor(https://arxiv.org/abs/2507.03542)
Keywords: generation
Abstract: Text-based visual descriptors, ranging from simple class names to more descriptive phrases, are widely used in visual concept discovery and image classification with vision-language models (VLMs). Their effectiveness, however, depends on a complex interplay of factors, including semantic clarity, presence in the VLM's pre-training data, and how well the descriptors serve as a meaningful representation space. In this work, we systematically analyze descriptor quality along two key dimensions: (1) representational capacity, and (2) relationship with VLM pre-training data. We evaluate a spectrum of descriptor generation methods, from zero-shot LLM-generated prompts to iteratively refined descriptors. Motivated by ideas from representation alignment and language understanding, we introduce two alignment-based metrics, Global Alignment and CLIP Similarity, that move beyond accuracy. These metrics allow us to shed light on how different descriptor generation strategies interact with foundation model properties, offering insights into ways of studying descriptor effectiveness beyond accuracy evaluations.

Title: Kinetic Langevin Diffusion for Crystalline Materials Generation

Authors: François Cornet, Federico Bergamin, Arghya Bhowmik, Juan Maria Garcia Lastra, Jes Frellsen, Mikkel N. Schmidt
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.03602
Pdf URL: https://arxiv.org/pdf/2507.03602
Copy Paste: [[2507.03602]] Kinetic Langevin Diffusion for Crystalline Materials Generation(https://arxiv.org/abs/2507.03602)
Keywords: generation, generative
Abstract: Generative modeling of crystalline materials using diffusion models presents a series of challenges: the data distribution is characterized by inherent symmetries and involves multiple modalities, with some defined on specific manifolds. Notably, the treatment of fractional coordinates representing atomic positions in the unit cell requires careful consideration, as they lie on a hypertorus. In this work, we introduce Kinetic Langevin Diffusion for Materials (KLDM), a novel diffusion model for crystalline materials generation, where the key innovation resides in the modeling of the coordinates. Instead of resorting to Riemannian diffusion on the hypertorus directly, we generalize Trivialized Diffusion Model (TDM) to account for the symmetries inherent to crystals. By coupling coordinates with auxiliary Euclidean variables representing velocities, the diffusion process is now offset to a flat space. This allows us to effectively perform diffusion on the hypertorus while providing a training objective that accounts for the periodic translation symmetry of the true data distribution. We evaluate KLDM on both Crystal Structure Prediction (CSP) and De-novo Generation (DNG) tasks, demonstrating its competitive performance with current state-of-the-art models.
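
Illustrative sketch (not taken from the paper): the abstract describes coupling fractional coordinates on the hypertorus with auxiliary Euclidean velocities, so that noise is injected in a flat space. The snippet below shows that general idea with a plain Euler-Maruyama step of a kinetic (underdamped) Langevin process. The function names, the friction parameter `gamma`, and the discretization are assumptions made for illustration and are not KLDM's actual scheme.

```python
import math
import random


def wrap(x):
    """Map a real number onto the unit circle [0, 1)."""
    y = x % 1.0
    return 0.0 if y >= 1.0 else y  # guard against float round-up to 1.0


def torus_delta(a, b):
    """Shortest signed displacement from b to a on the unit circle."""
    return (a - b + 0.5) % 1.0 - 0.5


def kinetic_step(coords, vels, dt, gamma, rng):
    """One step: Gaussian noise enters through the velocities only,
    and the coordinates are moved by the velocity, then wrapped."""
    new_coords, new_vels = [], []
    for x, v in zip(coords, vels):
        noise = rng.gauss(0.0, 1.0)
        v = v - gamma * v * dt + math.sqrt(2.0 * gamma * dt) * noise
        new_vels.append(v)
        new_coords.append(wrap(x + v * dt))
    return new_coords, new_vels


if __name__ == "__main__":
    rng = random.Random(0)
    coords, vels = [0.1, 0.5, 0.95], [0.0, 0.0, 0.0]
    for _ in range(100):
        coords, vels = kinetic_step(coords, vels, dt=0.05, gamma=1.0, rng=rng)
    print(coords)
```

Because the coordinates are only ever updated through wrapped displacements, shifting every coordinate by the same constant commutes with the step, which is one simple way to see how periodic translation symmetry can be respected.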

Title: Scientific Machine Learning of Chaotic Systems Discovers Governing Equations for Neural Populations

Authors: Anthony G. Chesebro, David Hofmann, Vaibhav Dixit, Earl K. Miller, Richard H. Granger, Alan Edelman, Christopher V. Rackauckas, Lilianne R. Mujica-Parodi, Helmut H. Strey
Subjects: cs.LG, math-ph, nlin.CD, q-bio.NC
Abstract URL: https://arxiv.org/abs/2507.03631
Pdf URL: https://arxiv.org/pdf/2507.03631
Copy Paste: [[2507.03631]] Scientific Machine Learning of Chaotic Systems Discovers Governing Equations for Neural Populations(https://arxiv.org/abs/2507.03631)
Keywords: generation
Abstract: Discovering governing equations that describe complex chaotic systems remains a fundamental challenge in physics and neuroscience. Here, we introduce the PEM-UDE method, which combines the prediction-error method with universal differential equations to extract interpretable mathematical expressions from chaotic dynamical systems, even with limited or noisy observations. This approach succeeds where traditional techniques fail by smoothing optimization landscapes and removing the chaotic properties during the fitting process without distorting optimal parameters. We demonstrate its efficacy by recovering hidden states in the Rossler system and reconstructing dynamics from noise-corrupted electrical circuit data, where the correct functional form of the dynamics is recovered even when one of the observed time series is corrupted by noise 5x the magnitude of the true signal. We demonstrate that this method is capable of recovering the correct dynamics, whereas direct symbolic regression methods, such as SINDy, fail to do so with the given amount of data and noise. Importantly, when applied to neural populations, our method derives novel governing equations that respect biological constraints such as network sparsity - a constraint necessary for cortical information processing yet not captured in next-generation neural mass models - while preserving microscale neuronal parameters. These equations predict an emergent relationship between connection density and both oscillation frequency and synchrony in neural circuits. We validate these predictions using three intracranial electrode recording datasets from the medial entorhinal cortex, prefrontal cortex, and orbitofrontal cortex. Our work provides a pathway to develop mechanistic, multi-scale brain models that generalize across diverse neural architectures, bridging the gap between single-neuron dynamics and macroscale brain activity.

Title: Plugging Attention into Power Grids: Towards Transparent Forecasting

Authors: Eloi Campagne, Itai Zehavi, Yvenn Amara-Ouali, Yannig Goude, Argyris Kalogeratos
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.03690
Pdf URL: https://arxiv.org/pdf/2507.03690
Copy Paste: [[2507.03690]] Plugging Attention into Power Grids: Towards Transparent Forecasting(https://arxiv.org/abs/2507.03690)
Keywords: generation
Abstract: Accurate electricity consumption forecasting is crucial for ensuring grid stability and optimizing power generation, particularly in increasingly decentralized and complex systems. While classical approaches such as Generalized Additive Models (GAMs) remain widely used, they often fail to capture the spatial dependencies inherent in energy networks. Graph Neural Networks (GNNs) offer a principled framework to incorporate this structure by directly leveraging graph topologies. In this work, we evaluate a broad set of GNN architectures -- including GCN, GraphSAGE, ChebConv, TAG, APPNP, TransformerConv, and Graph Attention Networks (GAT and GATv2) -- on two real-world electricity consumption datasets from France and the UK. Our experiments show that while complex architectures like GATv2 and TransformerConv do not consistently outperform their simpler counterparts, models such as GCN and APPNP achieve strong results in low-data or highly disaggregated settings. Nonetheless, the vanilla GAT remains highly competitive across both datasets and offers an additional interpretability layer via attention mechanisms. We perform a temporal analysis of attention weights, revealing evolving patterns of regional interaction linked to seasonal and meteorological variability. These results highlight that, although attention is not universally superior, it provides valuable explanatory power when spatial dependencies are prominent. Finally, we benchmark ensemble-based expert aggregation strategies, showing that uniform or learned combinations can enhance robustness and outperform individual models under data heterogeneity.
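
Illustrative sketch (not taken from the paper): the abstract says GNNs incorporate the spatial structure of the grid by operating directly on graph topology. The snippet below shows the standard GCN neighborhood aggregation with self-loops and symmetric degree normalization, written in plain Python and without the learned weight matrix. The two-node graph and its feature values are made up for illustration.

```python
import math


def gcn_propagate(adj, feats):
    """Aggregate node features with D^{-1/2} (A + I) D^{-1/2}.

    adj   : square 0/1 adjacency matrix (list of lists), no self-loops
    feats : one feature vector (list of floats) per node
    """
    n = len(adj)
    # Add self-loops so each node keeps part of its own signal.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    out = []
    for i in range(n):
        row = [0.0] * len(feats[0])
        for j in range(n):
            if a_hat[i][j]:
                w = a_hat[i][j] / math.sqrt(deg[i] * deg[j])
                for k, x in enumerate(feats[j]):
                    row[k] += w * x
        out.append(row)
    return out


if __name__ == "__main__":
    # Two connected regions with hypothetical load features.
    print(gcn_propagate([[0, 1], [1, 0]], [[1.0], [3.0]]))
```

An attention-based layer such as GAT replaces the fixed weight `w` with a learned, input-dependent coefficient, which is what makes the attention weights inspectable over time.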

Title: FAROS: Fair Graph Generation via Attribute Switching Mechanisms

Authors: Abdennacer Badaoui, Oussama Kharouiche, Hatim Mrabet, Daniele Malitesta, Fragkiskos D. Malliaros
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.03728
Pdf URL: https://arxiv.org/pdf/2507.03728
Copy Paste: [[2507.03728]] FAROS: Fair Graph Generation via Attribute Switching Mechanisms(https://arxiv.org/abs/2507.03728)
Keywords: generation
Abstract: Recent advancements in graph diffusion models (GDMs) have enabled the synthesis of realistic network structures, yet ensuring fairness in the generated data remains a critical challenge. Existing solutions attempt to mitigate bias by re-training the GDMs with ad-hoc fairness constraints. Conversely, with this work, we propose FAROS, a novel FAir graph geneRatiOn framework leveraging attribute Switching mechanisms and directly running in the generation process of the pre-trained GDM. Technically, our approach works by altering nodes' sensitive attributes during generation. To this end, FAROS calculates the optimal fraction of switching nodes and selects the diffusion step at which to perform the switch by setting tailored multi-criteria constraints that preserve the node-topology profile of the original distribution (a proxy for accuracy) while ensuring that edges in the generated graph are independent of the sensitive attributes (a proxy for fairness). Our experiments on benchmark datasets for link prediction demonstrate that the proposed approach effectively reduces fairness discrepancies while maintaining accuracy comparable to (or even higher than) other similar baselines. Notably, FAROS is also able to strike a better accuracy-fairness trade-off than other competitors in some of the tested settings under the Pareto optimality concept, demonstrating the effectiveness of the imposed multi-criteria constraints.
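
Illustrative sketch (not taken from the paper): the abstract describes switching the sensitive attributes of a fraction of nodes at some diffusion step. The snippet below shows only the switching operation itself for a binary attribute. How FAROS chooses the fraction and the step is not described in enough detail here to reproduce, so both are treated as given inputs, and the function name is hypothetical.

```python
import random


def switch_attributes(attrs, fraction, rng):
    """Flip the binary sensitive attribute of a random subset of nodes.

    attrs    : list of 0/1 sensitive attributes, one per node
    fraction : share of nodes to switch (assumed to be supplied externally)
    rng      : random.Random instance, for reproducibility
    """
    k = round(fraction * len(attrs))
    chosen = set(rng.sample(range(len(attrs)), k))
    return [1 - a if i in chosen else a for i, a in enumerate(attrs)]


if __name__ == "__main__":
    attrs = [0, 1] * 5
    print(switch_attributes(attrs, 0.3, random.Random(0)))
```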

Title: Flow-Anchored Consistency Models

Authors: Yansong Peng, Kai Zhu, Yu Liu, Pingyu Wu, Hebei Li, Xiaoyan Sun, Feng Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.03738
Pdf URL: https://arxiv.org/pdf/2507.03738
Copy Paste: [[2507.03738]] Flow-Anchored Consistency Models(https://arxiv.org/abs/2507.03738)
Keywords: generation, generative
Abstract: Continuous-time Consistency Models (CMs) promise efficient few-step generation but face significant challenges with training instability. We argue this instability stems from a fundamental conflict: by training a network to learn only a shortcut across a probability flow, the model loses its grasp on the instantaneous velocity field that defines the flow. Our solution is to explicitly anchor the model in the underlying flow during training. We introduce the Flow-Anchored Consistency Model (FACM), a simple but effective training strategy that uses a Flow Matching (FM) task as an anchor for the primary CM shortcut objective. This Flow-Anchoring approach requires no architectural modifications and is broadly compatible with standard model architectures. By distilling a pre-trained LightningDiT model, our method achieves a state-of-the-art FID of 1.32 with two steps (NFE=2) and 1.76 with just one step (NFE=1) on ImageNet 256x256, significantly outperforming previous methods. This provides a general and effective recipe for building high-performance, few-step generative models. Our code and pretrained models: this https URL.
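
Illustrative sketch (not taken from the paper): the abstract describes using a Flow Matching task as an anchor alongside the consistency (shortcut) objective. The snippet below shows a common linear-path form of the flow matching regression target and a weighted sum of the two losses. The interpolation convention, the weight `fm_weight`, and treating the consistency loss as an opaque number are all assumptions made for illustration.

```python
def interpolate(x0, x1, t):
    """Point on the straight path from x0 (t=0) to x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def fm_target(x0, x1):
    """Velocity of the straight path, constant in t."""
    return [b - a for a, b in zip(x0, x1)]


def mse(u, v):
    """Mean squared error between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)


def anchored_loss(cm_loss, v_pred, x0, x1, fm_weight=1.0):
    """Shortcut loss plus a flow matching term that keeps the
    predicted velocity close to the path velocity."""
    return cm_loss + fm_weight * mse(v_pred, fm_target(x0, x1))


if __name__ == "__main__":
    x0, x1 = [0.0, 2.0], [4.0, 6.0]
    print(interpolate(x0, x1, 0.25))
    print(anchored_loss(0.5, [3.0, 5.0], x0, x1))
```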

Title: ChestGPT: Integrating Large Language Models and Vision Transformers for Disease Detection and Localization in Chest X-Rays

Authors: Shehroz S. Khan, Petar Przulj, Ahmed Ashraf, Ali Abedi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.03739
Pdf URL: https://arxiv.org/pdf/2507.03739
Copy Paste: [[2507.03739]] ChestGPT: Integrating Large Language Models and Vision Transformers for Disease Detection and Localization in Chest X-Rays(https://arxiv.org/abs/2507.03739)
Keywords: generative
Abstract: The global demand for radiologists is increasing rapidly due to a growing reliance on medical imaging services, while the supply of radiologists is not keeping pace. Advances in computer vision and image processing technologies present significant potential to address this gap by enhancing radiologists' capabilities and improving diagnostic accuracy. Large language models (LLMs), particularly generative pre-trained transformers (GPTs), have become the primary approach for understanding and generating textual data. In parallel, vision transformers (ViTs) have proven effective at converting visual data into a format that LLMs can process efficiently. In this paper, we present ChestGPT, a deep-learning framework that integrates the EVA ViT with the Llama 2 LLM to classify diseases and localize regions of interest in chest X-ray images. The ViT converts X-ray images into tokens, which are then fed, together with engineered prompts, into the LLM, enabling joint classification and localization of diseases. This approach incorporates transfer learning techniques to enhance both explainability and performance. The proposed method achieved strong global disease classification performance on the VinDr-CXR dataset, with an F1 score of 0.76, and successfully localized pathologies by generating bounding boxes around the regions of interest. We also outline several task-specific prompts, in addition to general-purpose prompts, for scenarios radiologists might encounter. Overall, this framework offers an assistive tool that can lighten radiologists' workload by providing preliminary findings and regions of interest to facilitate their diagnostic process.

Title: StreamDiT: Real-Time Streaming Text-to-Video Generation

Authors: Akio Kodaira, Tingbo Hou, Ji Hou, Masayoshi Tomizuka, Yue Zhao
Subjects: cs.CV, cs.AI, cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2507.03745
Pdf URL: https://arxiv.org/pdf/2507.03745
Copy Paste: [[2507.03745]] StreamDiT: Real-Time Streaming Text-to-Video Generation(https://arxiv.org/abs/2507.03745)
Keywords: generation
Abstract: Recently, great progress has been achieved in text-to-video (T2V) generation by scaling transformer-based diffusion models to billions of parameters, which can generate high-quality videos. However, existing models typically produce only short clips offline, restricting their use cases in interactive and real-time applications. This paper addresses these challenges by proposing StreamDiT, a streaming video generation model. StreamDiT training is based on flow matching by adding a moving buffer. We design mixed training with different partitioning schemes of buffered frames to boost both content consistency and visual quality. StreamDiT modeling is based on adaLN DiT with varying time embedding and window attention. To practice the proposed method, we train a StreamDiT model with 4B parameters. In addition, we propose a multistep distillation method tailored for StreamDiT. Sampling distillation is performed in each segment of a chosen partitioning scheme. After distillation, the total number of function evaluations (NFEs) is reduced to the number of chunks in a buffer. Finally, our distilled model reaches real-time performance at 16 FPS on one GPU, which can generate video streams at 512p resolution. We evaluate our method through both quantitative metrics and human evaluation. Our model enables real-time applications, e.g., streaming generation, interactive generation, and video-to-video. We provide video results and more examples on our project website.

Title: Interpretable Diffusion Models with B-cos Networks

Abstract URL: https://arxiv.org/abs/2507.03846
Pdf URL: https://arxiv.org/pdf/2507.03846
Copy Paste: [[2507.03846]] Interpretable Diffusion Models with B-cos Networks(https://arxiv.org/abs/2507.03846)
Keywords: generation
Abstract: Text-to-image diffusion models generate images by iteratively denoising random noise, conditioned on a prompt. While these models have enabled impressive progress in image generation, they often fail to accurately reflect all semantic information described in the prompt -- failures that are difficult to detect automatically. In this work, we introduce a diffusion model architecture built with B-cos modules that offers inherent interpretability. Our approach provides insight into how individual prompt tokens affect the generated image by producing explanations that highlight the pixel regions influenced by each token. We demonstrate that B-cos diffusion models can produce high-quality images while providing meaningful insights into prompt-image alignment.

Title: GenAI-Powered Inference

Authors: Kosuke Imai, Kentaro Nakamura
Subjects: cs.LG, stat.ME, stat.ML
Abstract URL: https://arxiv.org/abs/2507.03897
Pdf URL: https://arxiv.org/pdf/2507.03897
Copy Paste: [[2507.03897]] GenAI-Powered Inference(https://arxiv.org/abs/2507.03897)
Keywords: generative
Abstract: We introduce GenAI-Powered Inference (GPI), a statistical framework for both causal and predictive inference using unstructured data, including text and images. GPI leverages open-source Generative Artificial Intelligence (GenAI) models - such as large language models and diffusion models - not only to generate unstructured data at scale but also to extract low-dimensional representations that capture their underlying structure. Applying machine learning to these representations, GPI enables estimation of causal and predictive effects while quantifying associated estimation uncertainty. Unlike existing approaches to representation learning, GPI does not require fine-tuning of generative models, making it computationally efficient and broadly accessible. We illustrate the versatility of the GPI framework through three applications: (1) analyzing Chinese social media censorship, (2) estimating predictive effects of candidates' facial appearance on electoral outcomes, and (3) assessing the persuasiveness of political rhetoric. An open-source software package is available for implementing GPI.

Title: Transformer Model for Alzheimer's Disease Progression Prediction Using Longitudinal Visit Sequences

Authors: Mahdi Moghaddami, Clayton Schubring, Mohammad-Reza Siadat
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2507.03899
Pdf URL: https://arxiv.org/pdf/2507.03899
Copy Paste: [[2507.03899]] Transformer Model for Alzheimer's Disease Progression Prediction Using Longitudinal Visit Sequences(https://arxiv.org/abs/2507.03899)
Keywords: generative
Abstract: Alzheimer's disease (AD) is a neurodegenerative disorder with no known cure that affects tens of millions of people worldwide. Early detection of AD is critical for timely intervention to halt or slow the progression of the disease. In this study, we propose a Transformer model for predicting the stage of AD progression at a subject's next clinical visit using features from a sequence of visits extracted from the subject's visit history. We also rigorously compare our model to recurrent neural networks (RNNs) such as long short-term memory (LSTM), gated recurrent unit (GRU), and minimalRNN and assess their performances based on factors such as the length of prior visits and data imbalance. We test the importance of different feature categories and visit history, as well as compare the model to a newer Transformer-based model optimized for time series. Our model demonstrates strong predictive performance despite missing visits and missing features in available visits, particularly in identifying converter subjects -- individuals transitioning to more severe disease stages -- an area that has posed significant challenges in longitudinal prediction. The results highlight the model's potential in enhancing early diagnosis and patient outcomes.

Title: Taming Anomalies with Down-Up Sampling Networks: Group Center Preserving Reconstruction for 3D Anomaly Detection

Authors: Hanzhe Liang, Jie Zhang, Tao Dai, Linlin Shen, Jinbao Wang, Can Gao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.03903
Pdf URL: https://arxiv.org/pdf/2507.03903
Copy Paste: [[2507.03903]] Taming Anomalies with Down-Up Sampling Networks: Group Center Preserving Reconstruction for 3D Anomaly Detection(https://arxiv.org/abs/2507.03903)
Keywords: generation
Abstract: Reconstruction-based methods have demonstrated very promising results for 3D anomaly detection. However, these methods face great challenges in handling high-precision point clouds due to the large scale and complex structure. In this study, a Down-Up Sampling Network (DUS-Net) is proposed to reconstruct high-precision point clouds for 3D anomaly detection by preserving the group center geometric structure. The DUS-Net first introduces a Noise Generation module to generate noisy patches, which facilitates the diversity of training data and strengthens the feature representation for reconstruction. Then, a Down-sampling Network (Down-Net) is developed to learn an anomaly-free center point cloud from patches with noise injection. Subsequently, an Up-sampling Network (Up-Net) is designed to reconstruct high-precision point clouds by fusing multi-scale up-sampling features. Our method leverages group centers for construction, enabling the preservation of geometric structure and providing a more precise point cloud. Extensive experiments demonstrate the effectiveness of our proposed method, achieving state-of-the-art (SOTA) performance with an Object-level AUROC of 79.9% and 79.5%, and a Point-level AUROC of 71.2% and 84.7% on the Real3D-AD and Anomaly-ShapeNet datasets, respectively.
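
Illustrative sketch (not taken from the paper): the results above are reported as AUROC at the object and point level. For readers unfamiliar with the metric, the snippet below computes AUROC from anomaly scores and binary labels as the fraction of (anomalous, normal) pairs that are ranked correctly, with ties counted as half. The example scores are made up and this is not the authors' evaluation code.

```python
def auroc(scores, labels):
    """Area under the ROC curve via pairwise ranking.

    scores : anomaly scores, higher means more anomalous
    labels : 1 for anomalous, 0 for normal
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    print(auroc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))
```

The pairwise form is quadratic in the number of samples, which is fine for a handful of objects but would normally be replaced by a sort-based computation for point-level scores.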

Title: EchoMimicV3: 1.3B Parameters are All You Need for Unified Multi-Modal and Multi-Task Human Animation

Authors: Rang Meng, Yan Wang, Weipeng Wu, Ruobing Zheng, Yuming Li, Chenguang Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.03905
Pdf URL: https://arxiv.org/pdf/2507.03905
Copy Paste: [[2507.03905]] EchoMimicV3: 1.3B Parameters are All You Need for Unified Multi-Modal and Multi-Task Human Animation(https://arxiv.org/abs/2507.03905)
Keywords: generation
Abstract: Human animation has recently advanced rapidly, achieving increasingly realistic and vivid results, especially with the integration of large-scale video generation models. However, the slow inference speed and high computational cost of these large models pose significant challenges for practical applications. Additionally, various tasks in human animation, such as lip-syncing, audio-driven full-body animation, and video generation from start and end frames, often require different specialized models. The introduction of large video models has not alleviated this dilemma. This raises an important question: Can we make human animation Faster, Higher in quality, Stronger in generalization, and handle various tasks Together in one model? To address this, we dive into video generation models and discover that the devil lies in the details: inspired by MAE, we propose a novel unified Multi-Task paradigm for human animation, treating diverse generation tasks as spatial-temporal local reconstructions that require modifications only on the input side; given the interplay and division among multi-modal conditions including text, image, and audio, we introduce a multi-modal decoupled cross-attention module to fuse multiple modalities in a divide-and-conquer manner; and we propose a new SFT+Reward alternating training paradigm, enabling a minimal model with 1.3B parameters to achieve generation quality comparable to models with 10 times the parameter count. Through these innovations, our work paves the way for efficient, high-quality, and versatile digital human generation, addressing both performance and practicality challenges in the field. Extensive experiments demonstrate that EchoMimicV3 outperforms existing models in both facial and semi-body video generation, providing precise text-based control for creating videos in a wide range of scenarios.
摘要：人类动画最近迅速发展，取得了越来越现实和生动的结果，尤其是随着大型视频生成模型的整合。但是，这些大型模型的缓慢推理速度和高计算成本为实际应用带来了重大挑战。此外，人类动画中的各种任务，例如唇部同步，音频驱动的全身动画以及从开始和终点的视频生成，通常需要不同的专用模型。大型视频模型的引入并未减轻这种困境。这就提出了一个重要的问题：我们可以使人类动画更快，质量更高，概括性更强，并以一种模型将各种任务一起完成？为了解决这个问题，我们深入研究视频生成模型，发现魔鬼在于细节：受MAE的启发，我们提出了一种新型的统一的统一的多任务范式，用于人类动画，将多样化的生成任务视为空间时空的本地重建，仅需要在输入方面进行修改；考虑到包括文本，图像和音频在内的多模式条件之间的相互作用和划分，我们引入了多模式脱钩的跨意义模块，以分隔和构造方式融合多模式。我们提出了一个新的SFT+奖励交替训练范式，从而使最小模型具有1.3B参数，以实现与参数计数10倍的模型相当的生成质量。通过这些创新，我们的工作为高效，高质量和多才多艺的数字人类一代铺平了道路，以应对该领域的性能和实用性挑战。广泛的实验表明，Echomimicv3在面部和半身视频生成中都优于现有模型，从而提供了精确的基于文本的控制，用于在各种场景中创建视频。

Title: Bridging Vision and Language: Optimal Transport-Driven Radiology Report Generation via LLMs

Authors: Haifeng Zhao, Yufei Zhang, Leilei Ma, Shuo Xu, Dengdi Sun
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.03908
Pdf URL: https://arxiv.org/pdf/2507.03908
Copy Paste: [[2507.03908]] Bridging Vision and Language: Optimal Transport-Driven Radiology Report Generation via LLMs(https://arxiv.org/abs/2507.03908)
Keywords: generation
Abstract: Radiology report generation represents a significant application within medical AI, and has achieved impressive results. Concurrently, large language models (LLMs) have demonstrated remarkable performance across various domains. However, empirical validation indicates that general LLMs tend to focus more on linguistic fluency than on clinical effectiveness, and lack the ability to effectively capture the relationship between X-ray images and their corresponding texts, thus resulting in poor clinical practicability. To address these challenges, we propose Optimal Transport-Driven Radiology Report Generation (OTDRG), a novel framework that leverages Optimal Transport (OT) to align image features with disease labels extracted from reports, effectively bridging the cross-modal gap. The core component of OTDRG is Alignment & Fine-Tuning, where OT uses the encoded label features and image visual features to minimize cross-modal distances, and then integrates image and text features for LLM fine-tuning. Additionally, we design a novel disease prediction module to predict disease labels contained in X-ray images during validation and testing. Evaluated on the MIMIC-CXR and IU X-Ray datasets, OTDRG achieves state-of-the-art performance in both natural language generation (NLG) and clinical efficacy (CE) metrics, delivering reports that are not only linguistically coherent but also clinically accurate.
摘要：放射学报告的产生代表了医学AI中的重要应用，并取得了令人印象深刻的结果。同时，大型语言模型（LLM）在各个领域都表现出了出色的性能。但是，经验验证表明，普通LLM倾向于更多地关注语言流利性，而不是临床有效性，并且缺乏有效捕获X射线图像及其相应文本之间关系的能力，从而导致临床实用性差。为了应对这些挑战，我们提出了最佳运输驱动的放射学报告生成（OTDRG），这是一个利用最佳运输（OT）将图像特征与从报告中提取的疾病标签相结合的新型框架，有效地弥合了跨模式间隙。 OTDRG的核心组成部分是对齐\＆微调，其中OT利用了标签特征和图像视觉特征的编码来最大程度地减少交叉模式距离，然后集成了LLMS微调的图像和文本特征。此外，我们设计了一个新的疾病预测模块，以预测验证和测试期间X射线图像中包含的疾病标签。 OTDRG在模拟CXR和IU X射线数据集上进行了评估，在自然语言生成（NLG）和临床效力（CE）度量标准中都达到了最先进的表现，不仅提供了语言相干性，而且在临床上还准确的报告。
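
As a rough illustration of the OT-based alignment described in the abstract above, the following is a minimal sketch of entropy-regularised optimal transport (Sinkhorn iterations) between a set of image features and a set of label embeddings. It is not OTDRG's implementation: the cosine cost, the uniform marginals, the regularisation strength `reg`, and the random arrays `image_feats` and `label_feats` are all assumptions made for illustration.

```python
import numpy as np

def cosine_cost(x, y):
    """Pairwise cost 1 - cosine similarity between rows of x and rows of y."""
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    y = y / np.linalg.norm(y, axis=1, keepdims=True)
    return 1.0 - x @ y.T

def sinkhorn_plan(cost, reg=0.5, n_iters=200):
    """Entropy-regularised transport plan between uniform marginals (assumed)."""
    n, m = cost.shape
    a = np.full(n, 1.0 / n)          # source marginal (image side)
    b = np.full(m, 1.0 / m)          # target marginal (label side)
    kernel = np.exp(-cost / reg)     # Gibbs kernel
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (kernel @ v)         # rescale rows to match a
        v = b / (kernel.T @ u)       # rescale columns to match b
    return u[:, None] * kernel * v[None, :]

rng = np.random.default_rng(0)
image_feats = rng.normal(size=(6, 16))   # hypothetical image patch features
label_feats = rng.normal(size=(4, 16))   # hypothetical disease-label embeddings

cost = cosine_cost(image_feats, label_feats)
plan = sinkhorn_plan(cost)
ot_distance = float((plan * cost).sum())  # transport cost usable as an alignment loss
print(plan.shape, ot_distance)
```

In a training setting, a quantity like `ot_distance` would be minimised with respect to the encoders that produce the features; here it is only evaluated once on fixed random inputs.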

Title: Return of the Latent Space COWBOYS: Re-thinking the use of VAEs for Bayesian Optimisation of Structured Spaces

Authors: Henry B. Moss, Sebastian W. Ober, Tom Diethe
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2507.03910
Pdf URL: https://arxiv.org/pdf/2507.03910
Copy Paste: [[2507.03910]] Return of the Latent Space COWBOYS: Re-thinking the use of VAEs for Bayesian Optimisation of Structured Spaces(https://arxiv.org/abs/2507.03910)
Keywords: generation, generative
Abstract: Bayesian optimisation in the latent space of a Variational AutoEncoder (VAE) is a powerful framework for optimisation tasks over complex structured domains, such as the space of scientifically interesting molecules. However, existing approaches tightly couple the surrogate and generative models, which can lead to suboptimal performance when the latent space is not tailored to specific tasks, which in turn has led to the proposal of increasingly sophisticated algorithms. In this work, we explore a new direction, instead proposing a decoupled approach that trains a generative model and a Gaussian Process (GP) surrogate separately, then combines them via a simple yet principled Bayesian update rule. This separation allows each component to focus on its strengths -- structure generation from the VAE and predictive modelling by the GP. We show that our decoupled approach improves our ability to identify high-potential candidates in molecular optimisation problems under constrained evaluation budgets.
摘要：变异自动编码器（VAE）的潜在空间中的贝叶斯优化是一个强大的框架，可在复杂的结构化域（例如科学有趣的分子的空间）优化任务。但是，现有方法紧密地融合了替代模型和生成模型，当潜在空间不是针对特定任务量身定制的，这可能导致次优性能，这又导致了越来越复杂的算法的提议。在这项工作中，我们探索了一个新的方向，而是提出了一种分离的方法，该方法分别训练生成模型和高斯流程（GP）代理，然后通过简单但原则上的贝叶斯更新规则将它们结合在一起。这种分离使每个组件都可以专注于其强度 - 从VAE产生结构，并通过GP进行预测建模。我们表明，我们的分离方法提高了我们在受限评估预算下识别分子优化问题中高电位候选者的能力。

Title: DNF-Intrinsic: Deterministic Noise-Free Diffusion for Indoor Inverse Rendering

Authors: Rongjia Zheng, Qing Zhang, Chengjiang Long, Wei-Shi Zheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.03924
Pdf URL: https://arxiv.org/pdf/2507.03924
Copy Paste: [[2507.03924]] DNF-Intrinsic: Deterministic Noise-Free Diffusion for Indoor Inverse Rendering(https://arxiv.org/abs/2507.03924)
Keywords: generative
Abstract: Recent methods have shown that pre-trained diffusion models can be fine-tuned to enable generative inverse rendering by learning an image-conditioned noise-to-intrinsic mapping. Despite their remarkable progress, they struggle to robustly produce high-quality results, because the noise-to-intrinsic paradigm essentially uses noisy images with deteriorated structure and appearance for intrinsic prediction, even though it is common knowledge that structure and appearance information in an image is crucial for inverse rendering. To address this issue, we present DNF-Intrinsic, a robust yet efficient inverse rendering approach fine-tuned from a pre-trained diffusion model, where we propose to take the source image rather than Gaussian noise as input to directly predict deterministic intrinsic properties via flow matching. Moreover, we design a generative renderer to ensure that the predicted intrinsic properties are physically faithful to the source image. Experiments on both synthetic and real-world datasets show that our method clearly outperforms existing state-of-the-art methods.
摘要：最近的方法表明，可以对预训练的扩散模型进行微调，从而通过学习图像条件的噪声对内部映射来实现生成反向渲染。尽管取得了显着的进步，但他们仍在努力地产生高质量的结果，因为噪声到内在的范式基本上利用嘈杂的图像，其结构和外观恶化以进行内在预测，而图像中的结构和外观信息对于倒数呈现至关重要。为了解决这个问题，我们提出了DNF-Intrinsic，这是一种从预训练的扩散模型中微调的强大而有效的反向渲染方法，我们建议在其中采用源图像，而不是通过流量匹配来直接预测确定性内在属性的输入。此外，我们设计了一个生成渲染器，以限制预测的内在属性在物理上忠于源图像。合成和现实世界数据集的实验表明，我们的方法显然优于现有的最新方法。

Title: Evaluating Adversarial Protections for Diffusion Personalization: A Comprehensive Study

Authors: Kai Ye, Tianyi Chen, Zhen Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2507.03953
Pdf URL: https://arxiv.org/pdf/2507.03953
Copy Paste: [[2507.03953]] Evaluating Adversarial Protections for Diffusion Personalization: A Comprehensive Study(https://arxiv.org/abs/2507.03953)
Keywords: generation
Abstract: With the increasing adoption of diffusion models for image generation and personalization, concerns regarding privacy breaches and content misuse have become more pressing. In this study, we conduct a comprehensive comparison of eight perturbation-based protection methods (AdvDM, ASPL, FSGM, MetaCloak, Mist, PhotoGuard, SDS, and SimAC) across both portrait and artwork domains. These methods are evaluated under varying perturbation budgets, using a range of metrics to assess visual imperceptibility and protective efficacy. Our results offer practical guidance for method selection. Code is available at: this https URL.
摘要：随着图像产生和个性化扩散模型的越来越多，人们对隐私漏洞和内容滥用的担忧变得更加紧迫。在这项研究中，我们对八种基于扰动的保护方法进行了全面比较：Advdm，ASPL，FSGM，Metacloak，Mist，Photoguard，SDS和Simac-横跨肖像和艺术品域。这些方法在不同的扰动预算下进行了评估，并使用一系列指标来评估视觉上的不可识别性和保护效果。我们的结果为方法选择提供了实用的指导。代码可用：此HTTPS URL。

Title: Robust Low-light Scene Restoration via Illumination Transition

Authors: Ze Li, Feng Zhang, Xiatian Zhu, Meng Zhang, Yanghong Zhou, P. Y. Mok
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.03976
Pdf URL: https://arxiv.org/pdf/2507.03976
Copy Paste: [[2507.03976]] Robust Low-light Scene Restoration via Illumination Transition(https://arxiv.org/abs/2507.03976)
Keywords: restoration
Abstract: Synthesizing normal-light novel views from low-light multiview images is an important yet challenging task, given the low visibility and high ISO noise present in the input images. Existing low-light enhancement methods often struggle to effectively preprocess such low-light inputs, as they fail to consider correlations among multiple views. Although other state-of-the-art methods have introduced illumination-related components offering alternative solutions to the problem, they often result in drawbacks such as color distortions and artifacts, and they provide limited denoising effectiveness. In this paper, we propose a novel Robust Low-light Scene Restoration framework (RoSe), which enables effective synthesis of novel views in normal lighting conditions from low-light multiview image inputs, by formulating the task as an illuminance transition estimation problem in 3D space, conceptualizing it as a specialized rendering task. This multiview-consistent illuminance transition field establishes a robust connection between low-light and normal-light conditions. By further exploiting the inherent low-rank property of illumination to constrain the transition representation, we achieve more effective denoising without complex 2D techniques or explicit noise modeling. To implement RoSe, we design a concise dual-branch architecture and introduce a low-rank denoising module. Experiments demonstrate that RoSe significantly outperforms state-of-the-art models in both rendering quality and multiview consistency on standard benchmarks. The codes and data are available at this https URL.
摘要：鉴于输入图像中存在较低的可见性和高的ISO噪声，从低光多视图中综合的正常光明视图是一项重要但具有挑战性的任务。现有的低光增强方法通常难以有效地预处理此类低光输入，因为它们未能考虑多种观点之间的相关性。尽管其他最先进的方法引入了与照明相关的组件，为问题提供了替代解决方案，但它们通常会导致诸如颜色扭曲和伪像之类的缺点，并提供有限的DeNose有效性。在本文中，我们提出了一个新颖的稳健弱光场景恢复框架（ROSE），该框架可以通过将任务作为一个照明过渡估计问题在3D空间中的照明型估计问题中有效地合成正常照明条件下的新型视图，从而将其概念化为一项专业订阅任务。这个多视图的照明过渡场在弱光和正常光条件之间建立了牢固的联系。通过进一步利用照明的固有低级特性来限制过渡表示，我们在没有复杂的2D技术或显式噪声建模的情况下实现了更有效的降解。为了实施玫瑰，我们设计了一个简洁的双支架构，并引入了一个低级别的Denoising模块。实验表明，在标准基准的呈现质量和多视图一致性方面，Rose的表现明显优于最先进的模型。代码和数据可在此HTTPS URL上找到。

Title: LEHA-CVQAD: Dataset To Enable Generalized Video Quality Assessment of Compression Artifacts

Authors: Aleksandr Gushchin, Maksim Smirnov, Dmitriy Vatolin, Anastasia Antsiferova
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.03990
Pdf URL: https://arxiv.org/pdf/2507.03990
Copy Paste: [[2507.03990]] LEHA-CVQAD: Dataset To Enable Generalized Video Quality Assessment of Compression Artifacts(https://arxiv.org/abs/2507.03990)
Keywords: quality assessment
Abstract: We propose the LEHA-CVQAD (Large-scale Enriched Human-Annotated) dataset, which comprises 6,240 clips for compression-oriented video quality assessment. 59 source videos are encoded with 186 codec-preset variants; 1.8M pairwise and 1.5k MOS ratings are fused into a single quality scale, and part of the videos remains hidden for blind evaluation. We also propose Rate-Distortion Alignment Error (RDAE), a novel evaluation metric that quantifies how well VQA models preserve bitrate-quality ordering, directly supporting codec parameter tuning. Testing IQA/VQA methods reveals that popular VQA metrics exhibit high RDAE and lower correlations, underscoring the dataset's challenges and utility. The open part and the results of LEHA-CVQAD are available at this https URL$.io/lcvqad/
摘要：我们提出了LEHA-CVQAD（大规模丰富的人类通知）数据集，该数据集包括6,240个剪辑，用于压缩为导向的视频质量评估。 59个源视频用186个编解码器 - 披风变体，180万，配对1.5k MOS评分融合成单个质量量表；部分视频仍然隐藏了盲目评估。我们还提出了速率分数比对误差（RDAE），这是一个新颖的评估度量标准，可量化VQA模型保留比特量质量的排序，直接支持编解码器参数调整。测试IQA/VQA方法表明，流行的VQA指标表现出较高的RDAE和较低的相关性，强调了数据集挑战和实用性。 LEHA-CVQAD的开放零件和结果可在此HTTPS URL $ .IO/LCVQAD/

Title: NRSeg: Noise-Resilient Learning for BEV Semantic Segmentation via Driving World Models

Authors: Siyu Li, Fei Teng, Yihong Cao, Kailun Yang, Zhiyong Li, Yaonan Wang
Subjects: cs.CV, cs.RO, eess.IV
Abstract URL: https://arxiv.org/abs/2507.04002
Pdf URL: https://arxiv.org/pdf/2507.04002
Copy Paste: [[2507.04002]] NRSeg: Noise-Resilient Learning for BEV Semantic Segmentation via Driving World Models(https://arxiv.org/abs/2507.04002)
Keywords: generation
Abstract: Bird's-Eye View (BEV) semantic segmentation is an indispensable perception task in end-to-end autonomous driving systems. Unsupervised and semi-supervised learning for BEV tasks, although pivotal for real-world applications, underperforms due to the homogeneous distribution of the labeled data. In this work, we explore the potential of synthetic data from driving world models to enhance the diversity of labeled data and make BEV segmentation more robust. Yet, our preliminary findings reveal that generation noise in synthetic data compromises efficient BEV model learning. To fully harness the potential of synthetic data from world models, this paper proposes NRSeg, a noise-resilient learning framework for BEV semantic segmentation. Specifically, a Perspective-Geometry Consistency Metric (PGCM) is proposed to quantitatively evaluate the guidance capability of generated data for model learning. This metric originates from the alignment measure between the perspective road mask of generated data and the mask projected from the BEV labels. Moreover, a Bi-Distribution Parallel Prediction (BiDPP) is designed to enhance the inherent robustness of the model, where the learning process is constrained through parallel prediction of multinomial and Dirichlet distributions. The former efficiently predicts semantic probabilities, whereas the latter adopts evidential deep learning to realize uncertainty quantification. Furthermore, a Hierarchical Local Semantic Exclusion (HLSE) module is designed to address the non-mutual exclusivity inherent in BEV semantic segmentation tasks. Experimental results demonstrate that NRSeg achieves state-of-the-art performance, yielding the highest improvements in mIoU of 13.8% and 11.4% in unsupervised and semi-supervised BEV segmentation tasks, respectively. The source code will be made publicly available at this https URL.
摘要：鸟类视图（BEV）语义分割是端到端自动驾驶系统中必不可少的感知任务。由于标记的数据的均匀分布，BEV任务的无监督和半监督学习是现实世界应用的关键。在这项工作中，我们探讨了从驱动世界模型中的合成数据的潜力，以增强标记数据的多样性，以鲁棒性BEV分割。然而，我们的初步发现表明，合成数据中的产生噪声损害了有效的BEV模型学习。为了充分利用世界模型中合成数据的潜力，本文提出了NRSEG，这是BEV语义分割的噪声学习框架。具体而言，提出了一个透视几何一致性度量（PGCM），以定量评估模型学习生成数据的指导能力。该指标源自生成数据的透视路面掩模和从BEV标签投影的面具之间的对齐度措施。此外，BI分布并行预测（BIDPP）旨在增强模型的固有性，其中学习过程通过对多项式和Dirichlet分布的并行预测来限制。前者有效地预测语义概率，而后者采用证据深度学习来实现不确定性量化。此外，层次结构的本地语义排除（HLSE）模块旨在解决BEV语义分割任务中固有的非差异排他性。实验结果表明，NRSEG达到了最先进的性能，在MIOU的13.8％和无监督和半监督的BEV分段任务中，MIOU的提高最高。源代码将在此HTTPS URL上公开可用。
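
To make the "parallel prediction of multinomial and Dirichlet distributions" in the abstract above more concrete, here is a minimal sketch of the two kinds of output heads. It is not NRSeg's BiDPP module: it assumes the subjective-logic formulation common in evidential deep learning (softplus evidence, `alpha = evidence + 1`, uncertainty `K / sum(alpha)`), and the function names and the `logits` array are hypothetical.

```python
import numpy as np

def multinomial_probs(logits):
    """Softmax head: point estimate of class probabilities."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def dirichlet_head(logits):
    """Evidential head: Dirichlet parameters, expected probabilities, uncertainty."""
    evidence = np.log1p(np.exp(logits))           # softplus keeps evidence non-negative
    alpha = evidence + 1.0                        # Dirichlet concentration parameters
    strength = alpha.sum(axis=-1, keepdims=True)  # total evidence plus number of classes
    expected_probs = alpha / strength             # mean of the Dirichlet
    num_classes = alpha.shape[-1]
    uncertainty = num_classes / strength[..., 0]  # in (0, 1]; high when evidence is low
    return expected_probs, uncertainty

# Two hypothetical pixels: one with strong evidence for class 0, one with almost none.
logits = np.array([[8.0, -4.0, -4.0],
                   [-4.0, -4.0, -4.0]])

soft = multinomial_probs(logits)
probs, unc = dirichlet_head(logits)
print(soft.round(3), probs.round(3), unc.round(3))
```

The second pixel receives a uniform softmax output and an uncertainty close to 1, which is the kind of signal an evidential branch can provide alongside the plain probability estimate.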

Title: PresentAgent: Multimodal Agent for Presentation Video Generation

Authors: Jingwei Shi, Zeyu Zhang, Biao Wu, Yanjie Liang, Meng Fang, Ling Chen, Yang Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.04036
Pdf URL: https://arxiv.org/pdf/2507.04036
Copy Paste: [[2507.04036]] PresentAgent: Multimodal Agent for Presentation Video Generation(https://arxiv.org/abs/2507.04036)
Keywords: generation
Abstract: We present PresentAgent, a multimodal agent that transforms long-form documents into narrated presentation videos. While existing approaches are limited to generating static slides or text summaries, our method advances beyond these limitations by producing fully synchronized visual and spoken content that closely mimics human-style presentations. To achieve this integration, PresentAgent employs a modular pipeline that systematically segments the input document, plans and renders slide-style visual frames, generates contextual spoken narration with large language models and Text-to-Speech models, and seamlessly composes the final video with precise audio-visual alignment. Given the complexity of evaluating such multimodal outputs, we introduce PresentEval, a unified assessment framework powered by Vision-Language Models that comprehensively scores videos across three critical dimensions: content fidelity, visual clarity, and audience comprehension through prompt-based evaluation. Our experimental validation on a curated dataset of 30 document-presentation pairs demonstrates that PresentAgent approaches human-level quality across all evaluation metrics. These results highlight the significant potential of controllable multimodal agents in transforming static textual materials into dynamic, effective, and accessible presentation formats. Code will be available at this https URL.
摘要：我们介绍了现在的多模式代理，它将长格式文档转换为叙述的演示视频。尽管现有方法仅限于生成静态幻灯片或文本摘要，但我们的方法通过产生完全同步的视觉和口头内容来超越这些限制，从而密切模仿人类风格的演示文稿。为了实现这种集成，现在使用的是模块化管道，该管道系统地将输入文档，计划和渲染器呈现幻灯片风格的视觉框架，使用大语言模型和文本对语音模型生成上下文叙述，并以精确的视听统一为单位。鉴于评估此类多模式输出的复杂性，我们介绍了PresentEval，这是一个由视觉语言模型提供动力的统一评估框架，该模型可以通过基于及时的评估来全面评分跨三个关键维度的视频：内容保真度，视觉清晰度和受众理解。我们对30个文档呈现对的策划数据集进行的实验验证表明，当前的方法在所有评估指标中使用人级质量。这些结果突出了可控多模式的重要潜力，将静态文本材料转换为动态，有效且可访问的演示格式。代码将在此HTTPS URL上可用。

Title: Breaking Imitation Bottlenecks: Reinforced Diffusion Powers Diverse Trajectory Generation

Authors: Ziying Song, Lin Liu, Hongyu Pan, Bencheng Liao, Mingzhe Guo, Lei Yang, Yongchang Zhang, Shaoqing Xu, Caiyan Jia, Yadan Luo
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2507.04049
Pdf URL: https://arxiv.org/pdf/2507.04049
Copy Paste: [[2507.04049]] Breaking Imitation Bottlenecks: Reinforced Diffusion Powers Diverse Trajectory Generation(https://arxiv.org/abs/2507.04049)
Keywords: generation
Abstract: Most end-to-end autonomous driving methods rely on imitation learning from single expert demonstrations, often leading to conservative and homogeneous behaviors that limit generalization in complex real-world scenarios. In this work, we propose DIVER, an end-to-end driving framework that integrates reinforcement learning with diffusion-based generation to produce diverse and feasible trajectories. At the core of DIVER lies a reinforced diffusion-based generation mechanism. First, the model conditions on map elements and surrounding agents to generate multiple reference trajectories from a single ground-truth trajectory, alleviating the limitations of imitation learning that arise from relying solely on single expert demonstrations. Second, reinforcement learning is employed to guide the diffusion process, where reward-based supervision enforces safety and diversity constraints on the generated trajectories, thereby enhancing their practicality and generalization capability. Furthermore, to address the limitations of L2-based open-loop metrics in capturing trajectory diversity, we propose a novel Diversity metric to evaluate the diversity of multi-mode predictions. Extensive experiments on the closed-loop NAVSIM and Bench2Drive benchmarks, as well as the open-loop nuScenes dataset, demonstrate that DIVER significantly improves trajectory diversity, effectively addressing the mode collapse problem inherent in imitation learning.
摘要：大多数端到端的自主驾驶方法都依赖于单个专家演示的模仿学习，这通常导致保守和同质行为，这些行为限制了在复杂的现实世界情景中的概括。在这项工作中，我们提出了Diver，这是一个端到端的驾驶框架，将增强学习与基于扩散的生成集成在一起，以产生多样化和可行的轨迹。潜水员的核心是基于增强扩散的生成机制。首先，在地图元素和周围代理上的模型条件从单个基地轨迹产生多个参考轨迹，从而减轻了仅依靠仅依靠单个专家演示的模仿学习的局限性。其次，使用强化学习来指导扩散过程，在这种过程中，基于奖励的监督对生成的轨迹实施安全性和多样性约束，从而增强了它们的实用性和概括能力。此外，为了解决捕获轨迹多样性中基于L2的开放环指标的局限性，我们提出了一种新型的多样性指标，以评估闭环NAVSIM和基准2驱动基准的多模式多样性的多样性。模仿学习固有的崩溃问题。

Title: Generate, Refine, and Encode: Leveraging Synthesized Novel Samples for On-the-Fly Fine-Grained Category Discovery

Authors: Xiao Liu, Nan Pu, Haiyang Zheng, Wenjing Li, Nicu Sebe, Zhun Zhong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.04051
Pdf URL: https://arxiv.org/pdf/2507.04051
Copy Paste: [[2507.04051]] Generate, Refine, and Encode: Leveraging Synthesized Novel Samples for On-the-Fly Fine-Grained Category Discovery(https://arxiv.org/abs/2507.04051)
Keywords: generation
Abstract: In this paper, we investigate a practical yet challenging task: On-the-fly Category Discovery (OCD). This task focuses on the online identification of newly arriving stream data that may belong to both known and unknown categories, utilizing the category knowledge from only labeled data. Existing OCD methods are devoted to fully mining transferable knowledge from only labeled data. However, the transferability learned by these methods is limited because the knowledge contained in known categories is often insufficient, especially when few annotated data/categories are available in fine-grained recognition. To mitigate this limitation, we propose a diffusion-based OCD framework, dubbed DiffGRE, which integrates Generation, Refinement, and Encoding in a multi-stage fashion. Specifically, we first design an attribute-composition generation method based on cross-image interpolation in the diffusion latent space to synthesize novel samples. Then, we propose a diversity-driven refinement approach to select the synthesized images that differ from known categories for subsequent OCD model training. Finally, we leverage a semi-supervised leader encoding to inject additional category knowledge contained in synthesized data into the OCD models, which can benefit the discovery of both known and unknown categories during the on-the-fly inference process. Extensive experiments demonstrate the superiority of our DiffGRE over previous methods on six fine-grained datasets.
摘要：在本文中，我们研究了一项实用但充满挑战的任务：即时类别发现（OCD）。此任务着重于在线识别可能属于已知类别和未知类别的新到达流数据，仅利用仅标记数据的类别知识。现有的OCD方法专门用于完全从标记的数据中完全挖掘可转移的知识。但是，这些方法学到的可传递性受到限制，因为已知类别中包含的知识通常不够，尤其是当很少有带注释的数据/类别以细粒度识别提供时。为了减轻这种限制，我们提出了一个基于扩散的OCD框架，称为DIFFGRE，该框架以多阶段的方式集成了生成，改进和编码。具体而言，我们首先设计了一种基于扩散潜在空间中跨图像插值的属性 - 分类生成方法，以综合新样本。然后，我们提出了一种多样性驱动的改进方法，以选择与已知类别不同的OCD模型训练的合成图像。最后，我们利用一个半监督的领导者编码，将合成数据中包含的其他类别知识注入OCD模型中，该知识可以使人们在直接推理过程中发现已知和未知类别的发现。广泛的实验表明，在六个细粒数据集上，我们的差异比以前的方法具有优越性。

Title: PromptSR: Cascade Prompting for Lightweight Image Super-Resolution

Authors: Wenyang Liu, Chen Cai, Jianjun Gao, Kejun Wu, Yi Wang, Kim-Hui Yap, Lap-Pui Chau
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.04118
Pdf URL: https://arxiv.org/pdf/2507.04118
Copy Paste: [[2507.04118]] PromptSR: Cascade Prompting for Lightweight Image Super-Resolution(https://arxiv.org/abs/2507.04118)
Keywords: super-resolution
Abstract: Although the lightweight Vision Transformer has significantly advanced image super-resolution (SR), it faces the inherent challenge of a limited receptive field due to the window-based self-attention modeling. The quadratic computational complexity relative to window size restricts its ability to use a large window size for expanding the receptive field while maintaining low computational costs. To address this challenge, we propose PromptSR, a novel prompt-empowered lightweight image SR method. The core component is the proposed cascade prompting block (CPB), which enhances global information access and local refinement via three cascaded prompting layers: a global anchor prompting layer (GAPL) and two local prompting layers (LPLs). The GAPL leverages downscaled features as anchors to construct low-dimensional anchor prompts (APs) through cross-scale attention, significantly reducing computational costs. These APs, with enhanced global perception, are then used to provide global prompts, efficiently facilitating long-range token connections. The two LPLs subsequently combine category-based self-attention and window-based self-attention to refine the representation in a coarse-to-fine manner. They leverage attention maps from the GAPL as additional global prompts, enabling them to perceive features globally at different granularities for adaptive local refinement. In this way, the proposed CPB effectively combines global priors and local details, significantly enlarging the receptive field while maintaining the low computational costs of our PromptSR. The experimental results demonstrate the superiority of our method, which outperforms state-of-the-art lightweight SR methods in quantitative, qualitative, and complexity evaluations. Our code will be released at this https URL.
摘要：尽管轻巧的视觉变压器具有明显的高级图像超分辨率（SR），但由于基于窗口的自我发项模型，它面临着有限的接受场的固有挑战。相对于窗口尺寸的二次计算复杂性限制了其使用较大窗口大小来扩展接受场的能力，同时保持低计算成本。为了应对这一挑战，我们提出了一种提示，这是一种新颖的迅速授权的轻质图像SR方法。核心组件是建议的级联提示块（CPB），它通过三个级联提示层增强了全局信息访问和本地改进：全球锚点提示层（GAPL）和两个本地提示层（LPLS）。 GAPL通过跨尺度注意力构建低维锚提示（AP）的锚点，利用缩小的特征来大大降低计算成本。这些AP随后使用增强的全球感知来提供全球提示，从而有效地促进了远程令牌连接。随后，这两个LPL结合了基于类别的自我注意事项和基于窗口的自我注意力，以粗略的方式完善表示形式。他们利用GAPL的注意力图作为其他全球提示，使他们能够以不同的粒度感知全球特征，以进行自适应的局部改进。通过这种方式，拟议的CPB有效地结合了全球先验和当地细节，在保持我们的提示的低计算成本的同时，大大扩大了接受场。实验结果证明了我们方法的优势，在定量，定性和复杂性评估中，它的表现优于最先进的轻量级SR方法。我们的代码将在此HTTPS URL上发布。

Title: Unlocking Compositional Control: Self-Supervision for LVLM-Based Image Generation

Authors: Fernando Gabriela Garcia, Spencer Burns, Ryan Shaw, Hunter Young
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.04151
Pdf URL: https://arxiv.org/pdf/2507.04151
Copy Paste: [[2507.04151]] Unlocking Compositional Control: Self-Supervision for LVLM-Based Image Generation(https://arxiv.org/abs/2507.04151)
Keywords: generation, generative
Abstract: This paper introduces Hierarchical Self-Supervised LVLM (Hi-SSLVLM), a novel generative model designed to significantly advance text-to-image synthesis, particularly for complex and compositionally challenging prompts. Traditional methods often grapple with the high cost of meticulously curated paired image-text datasets and struggle with precise control over fine-grained visual attributes and intricate spatial relationships. Our Hi-SSLVLM addresses these limitations through a unique two-stage self-supervised learning strategy. The first stage, Multi-Granularity Visual-Language Grounding, enables the Large Vision-Language Model (LVLM) backbone to autonomously generate and align hierarchical captions (global and local) to images, cultivating a deep internal semantic understanding without reliance on extensive human annotation. The second stage, Self-Refinement and Guided Image Generation, leverages this acquired knowledge by an Internal Compositional Planning (ICP) mechanism, where the LVLM first formulates detailed textual sub-prompts to guide the image generation process, complemented by a novel Semantic Consistency Loss for precise output alignment. Comprehensive experiments against leading baselines, including Janus-Pro-1B, Stable Diffusion XL 1.0, DeepFloyd IF v1.0, and ControlNet-XL, on multi-dimensional benchmarks such as Gemini-2.0-Flash and InternVL3-78B, demonstrate Hi-SSLVLM's superior performance across all fine-grained metrics. An in-depth ablation study confirms the critical role of each proposed component. Furthermore, human evaluations corroborate our quantitative findings, highlighting Hi-SSLVLM's enhanced fidelity to prompt, compositional accuracy, and overall aesthetic quality, marking a significant step towards more controllable and semantically consistent open-ended text-to-image generation.
摘要：本文介绍了层次的自我监督LVLM（HI-SSLVLM），这是一种新颖的生成模型，旨在显着提高文本到图像的综合，尤其是对于复杂和构图挑战性的提示。传统方法经常应对精心策划的成对图像文本数据集的高成本，并与精确控制精细的视觉属性和复杂的空间关系。我们的HI-SSLVLM通过独特的两阶段自学学习策略来解决这些局限性。第一阶段的多粒性视觉语言接地使大型视觉模型（LVLM）骨干可以自主产生和使层次结构字幕（全球和局部）自动生成图像，从而培养了内部语义的深刻理解，而无需依赖广泛的人类注释。第二阶段是自我进行和有指导的图像产生，它通过内部组成计划（ICP）机制来利用这一获得的知识，其中LVLM首先制定了详细的文本子标准来指导图像生成过程，并得到了新颖的语义一致性一致性损失，以使精确输出损失。针对领先基线的全面实验，包括Janus-Pro-1b，稳定的扩散XL 1.0，DeepFloyd IF V1.0和ControlNet-XL，以及在多维基准上进行的，例如GemIni-2.0-Flash和Internvl3-78B，Experd expivesion Hi-Sslvlm跨越了所有精美的高级元素。深入的消融研究证实了每个提出的组件的关键作用。此外，人类评估证实了我们的定量发现，突出了Hi-SSLVLM增强的忠诚度，迅速，组成准确性和整体美学质量，这标志着朝着更具可控性和语义上一致的开放式开放式文本到图像生成迈出的重要一步。

Title: LVLM-Composer's Explicit Planning for Image Generation

Authors: Spencer Ramsey, Jeffrey Lee, Amina Grant
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.04152
Pdf URL: https://arxiv.org/pdf/2507.04152
Copy Paste: [[2507.04152]] LVLM-Composer's Explicit Planning for Image Generation(https://arxiv.org/abs/2507.04152)
Keywords: generation, generative
Abstract: The burgeoning field of generative artificial intelligence has fundamentally reshaped our approach to content creation, with Large Vision-Language Models (LVLMs) standing at its forefront. While current LVLMs have demonstrated impressive capabilities in text-to-image generation, they often falter when confronted with complex textual descriptions demanding precise compositional understanding and visual planning. This limitation particularly impacts the accurate rendering of multiple objects, their attributes, spatial relationships, and specific poses within intricate scenes, as evidenced by benchmarks like LongBench-T2I. To address these challenges, we introduce LVLM-Composer, a novel 10-billion parameter scale LVLM specifically engineered for enhanced compositional image synthesis. Our method incorporates a Hierarchical Semantic Planning Module for structured prompt decomposition and a Fine-Grained Feature Alignment Mechanism for precise visual guidance during generation. We propose a multi-stage training paradigm, featuring Hierarchical Semantic-Visual Grounding Pre-training and Compositional Planning Reinforcement Learning with Self-Correction, to instill robust compositional reasoning. Extensive experiments on the LongBench-T2I benchmark, utilizing automatic evaluation by Gemini-2.0-Flash and InternVL3-78B, demonstrate LVLM-Composer's superior performance across critical compositional dimensions including object accuracy, composition fidelity, and pose accuracy, significantly outperforming state-of-the-art baselines. An in-depth ablation study further validates the indispensable contribution of our proposed modules, while human evaluations confirm the perceptual superiority of our generated images. LVLM-Composer represents a significant step towards truly controllable and compositionally accurate open-ended text-to-image generation.
摘要：生成人工智能的新兴领域从根本上重塑了我们的内容创建方法，大型视觉语言模型（LVLMS）站在其最前沿。尽管当前的LVLM在文本到图像的一代中表现出了令人印象深刻的功能，但在面对复杂的文本描述时，它们通常会摇摇欲坠，要求精确的构图理解和视觉计划。这种限制特别影响了多个对象的准确渲染，它们的属性，空间关系和特定的姿势在错综复杂的场景中，如Longbench-T2i（如Longbench-t2i）所证明的。为了应对这些挑战，我们引入了LVLM-Composer，这是一种新型参数量表LVLM，专门设计用于增强组成图像合成。我们的方法结合了用于结构化迅速分解的层次语义计划模块，以及在发电期间进行精确视觉引导的细粒特征对齐机制。我们提出了一个多阶段的培训范式，其中包含层次的语义 - 视觉基础训练和构图计划加强学习，并通过自我纠正进行灌输强大的构图推理。在Longbench-T2I基准测试上进行了广泛的实验，利用Gemini-2.0-Flash和InternVL3-78B自动评估，展示了LVLM-Composer在关键组成维度上的出色性能，包括对象准确性，组合忠诚度和伪造精度，显着超过了态度的状态基础。一项深入的消融研究进一步验证了我们提出的模块的必不可少的贡献，而人类评估则证实了我们生成的图像的感知优势。 LVLM-COMPOSER代表了迈向真正可控制且构图准确的开放式文本对图像生成的重要一步。

Title: Voyaging into Unbounded Dynamic Scenes from a Single View

Authors: Fengrui Tian, Tianjiao Ding, Jinqi Luo, Hancheng Min, René Vidal
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.04183
Pdf URL: https://arxiv.org/pdf/2507.04183
Copy Paste: [[2507.04183]] Voyaging into Unbounded Dynamic Scenes from a Single View(https://arxiv.org/abs/2507.04183)
Keywords: generation
Abstract: This paper studies the problem of generating an unbounded dynamic scene from a single view, which has wide applications in augmented/virtual reality and robotics. Since the scene is changing over time, different generated views need to be consistent with the underlying 3D motions. While previous works learn such consistency by training from multiple views, the generated scene regions are constrained to stay close to the training views, with limited camera movements. To address this issue, we propose DynamicVoyager, which reformulates dynamic scene generation as a scene outpainting process for new dynamic content. As 2D outpainting models can hardly generate 3D-consistent motions from only 2D pixels at a single view, we consider pixels as rays to enrich the pixel input with the ray context, so that the 3D motion consistency can be learned from the ray information. More specifically, we first map the single-view video input to a dynamic point cloud with the estimated video depths. Then we render the partial video at a novel view and outpaint the video with ray contexts from the point cloud to generate 3D-consistent motions. We employ the outpainted video to update the point cloud, which is used for scene outpainting from future novel views. Experiments show that our model is able to generate unbounded scenes with consistent motions along fly-through cameras, and the generated contents can be controlled with scene prompts.
摘要：本文研究了从单个视图中产生无界动态场景的问题，该视图在增强/虚拟现实和机器人技术中具有广泛的应用。由于场景随着时间的流逝而发生变化，因此不同生成的视图必须与基础3D动作一致。虽然先前的作品通过从多个视图中训练学习了这种一致性，但生成的场景区域却遇到了距离摄像机运动有限的训练视图。为了解决这个问题，我们提出了DynamicIcvoyager，该动态浏览器将动态场景生成重新定义为新的动态内容的场景流程。由于2D支出模型几乎无法从单个视图处的2D像素产生3D一致的动作，因此我们将像素视为射线，可以用射线上下文丰富像素输入，因此可以从射线信息中学到3D运动一致性。更具体地说，我们首先将单视视频输入映射到具有估计的视频深度的动态点云。然后，我们通过新颖的视图渲染部分视频，并用射线云的射线上下文绘制视频，从而产生3D一致的动作。我们采用了柱子的视频来更新点云，该视频用于场景从未来的小说视图中产生。实验表明，我们的模型能够生成无界场景，并沿着直发相机的动作一致，并且可以通过场景提示来控制生成的内容。
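
To make the "pixels as rays" and "video to point cloud" steps in the abstract above concrete, here is a minimal sketch under a standard pinhole camera model. The function names and the intrinsics (fx, fy, cx, cy) are illustrative assumptions, not DynamicVoyager's actual interface, and the paper's ray context may be encoded differently.

```python
import math

def pixel_to_ray(u, v, fx, fy, cx, cy):
    """Unit ray direction (camera coordinates) through pixel (u, v), pinhole model."""
    x = (u - cx) / fx
    y = (v - cy) / fy
    norm = math.sqrt(x * x + y * y + 1.0)
    return (x / norm, y / norm, 1.0 / norm)

def unproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with z-depth `depth` to a 3D point in camera coordinates."""
    return ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)

def frame_to_points(depth_map, fx, fy, cx, cy):
    """Turn one depth map (list of rows) into a point list; one call per video frame."""
    return [
        unproject(u, v, d, fx, fy, cx, cy)
        for v, row in enumerate(depth_map)
        for u, d in enumerate(row)
    ]
```

Calling `frame_to_points` once per frame with that frame's estimated depth gives a time-indexed (dynamic) point cloud in the sense sketched here.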

Title: An explicit formulation of the learned noise predictor $ε_θ({\bf x}_t, t)$ via the forward-process noise $ε_{t}$ in denoising diffusion probabilistic models (DDPMs)

Authors: KiHyun Yun
Subjects: cs.LG, math.AP
Abstract URL: https://arxiv.org/abs/2507.04203
Pdf URL: https://arxiv.org/pdf/2507.04203
Copy Paste: [[2507.04203]] An explicit formulation of the learned noise predictor $ε_θ({\bf x}_t, t)$ via the forward-process noise $ε_{t}$ in denoising diffusion probabilistic models (DDPMs)(https://arxiv.org/abs/2507.04203)
Keywords: generative
Abstract: In denoising diffusion probabilistic models (DDPMs), the learned noise predictor $\epsilon_{\theta}({\bf x}_t, t)$ is trained to approximate the forward-process noise $\epsilon_t$. The equality $\nabla_{{\bf x}_t} \log q({\bf x}_t) = -\frac{1}{\sqrt{1 - \bar{\alpha}_t}} \epsilon_{\theta}({\bf x}_t, t)$ plays a fundamental role in both theoretical analyses and algorithmic design, and thus is frequently employed across diffusion-based generative models. In this paper, an explicit formulation of $\epsilon_{\theta}({\bf x}_t, t)$ in terms of the forward-process noise $\epsilon_t$ is derived. This result shows how the forward-process noise $\epsilon_t$ contributes to the learned predictor $\epsilon_{\theta}({\bf x}_t, t)$. Furthermore, based on this formulation, we present a novel and mathematically rigorous proof of the fundamental equality above, clarifying its origin and providing new theoretical insight into the structure of diffusion models.
摘要：在降级扩散概率模型（DDPM）中，对学到的噪声预测指标$ \ epsilon _ {\ theta}（{\ bf x} _t，t）$进行了培训，以近似为前进过程噪声$ \ epsilon_t $。等值$ \ nabla _ {{\ bf x} _t} \ log q（{\ bf x} _t）= - \ frac 1 {\ sqrt {\ sqrt {1- {\ bar \ alpha} _t} _t _t} _t}}}}}}} \ epsilon _ bf在理论分析和算法设计中都起着基本作用，因此经常在基于扩散的生成模型中使用。在本文中，$ \ epsilon _ {\ theta}（{\ bf x} _t，t）$的明确配方在forward-process noise $ \ epsilon_t $方面被得出。该结果显示了前进噪声$ \ epsilon_t $如何有助于学习的预测指标$ \ epsilon _ {\ theta}（{\ bf x} _t，t）$。此外，基于这种表述，我们提出了上面基本平等的新颖且数学上严格的证明，阐明了其起源，并提供了对扩散模型结构的新理论见解。
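
The equality quoted in the abstract above can be checked in closed form for a toy case. The sketch below is not the paper's derivation. It assumes scalar data x_0 ~ N(0, data_var), the usual forward step x_t = sqrt(alpha_bar) x_0 + sqrt(1 - alpha_bar) eps, and an ideal predictor equal to the conditional mean E[eps | x_t] (the minimizer of the squared-error training loss). All function names are illustrative.

```python
import math

def marginal_score(x_t, alpha_bar, data_var):
    """d/dx log q(x_t) when q(x_t) = N(0, alpha_bar * data_var + 1 - alpha_bar)."""
    var_t = alpha_bar * data_var + (1.0 - alpha_bar)
    return -x_t / var_t

def ideal_noise_prediction(x_t, alpha_bar, data_var):
    """E[eps | x_t] for jointly Gaussian (eps, x_t): Cov(eps, x_t) / Var(x_t) * x_t."""
    var_t = alpha_bar * data_var + (1.0 - alpha_bar)
    cov = math.sqrt(1.0 - alpha_bar)  # Cov(eps, x_t), since eps is independent of x_0
    return cov * x_t / var_t

def score_from_noise(eps_hat, alpha_bar):
    """Right-hand side of the equality: -eps_hat / sqrt(1 - alpha_bar)."""
    return -eps_hat / math.sqrt(1.0 - alpha_bar)
```

In this Gaussian setting the two sides agree for every x_t, because both reduce to -x_t / (alpha_bar * data_var + 1 - alpha_bar).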

Title: Quick Bypass Mechanism of Zero-Shot Diffusion-Based Image Restoration

Authors: Yu-Shan Tai, An-Yeu (Andy) Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.04207
Pdf URL: https://arxiv.org/pdf/2507.04207
Copy Paste: [[2507.04207]] Quick Bypass Mechanism of Zero-Shot Diffusion-Based Image Restoration(https://arxiv.org/abs/2507.04207)
Keywords: restoration, super-resolution, generation
Abstract: Recent advancements in diffusion models have demonstrated remarkable success in various image generation tasks. Building upon these achievements, diffusion models have also been effectively adapted to image restoration tasks, e.g., super-resolution and deblurring, aiming to recover high-quality images from degraded inputs. Although existing zero-shot approaches enable pretrained diffusion models to perform restoration tasks without additional fine-tuning, these methods often suffer from prolonged iteration times in the denoising process. To address this limitation, we propose a Quick Bypass Mechanism (QBM), a strategy that significantly accelerates the denoising process by initializing from an intermediate approximation, effectively bypassing early denoising steps. Furthermore, recognizing that the approximation may introduce inconsistencies, we introduce a Revised Reverse Process (RRP), which adjusts the weighting of random noise to enhance stochasticity and mitigate potential disharmony. We validate the proposed methods on ImageNet-1K and CelebA-HQ across multiple image restoration tasks, e.g., super-resolution, deblurring, and compressed sensing. Our experimental results show that the proposed methods can effectively accelerate existing methods while maintaining their original performance.
摘要：扩散模型的最新进展表明，在各种图像生成任务中取得了巨大的成功。在这些成就的基础上，扩散模型也有效地适应了图像恢复任务，例如超分辨率和脱张，旨在从退化的输入中恢复高质量的图像。尽管现有的零击方法使经过预定的扩散模型可以执行恢复任务而无需进行其他微调，但这些方法在剥离过程中通常会持续时间延长迭代时间。为了解决这一限制，我们提出了一种快速旁路机制（QBM），该策略通过从中间近似中初始化，有效地绕过早期的DeNoising步骤，从而显着加速了转化过程。此外，认识到近似可能会引入不一致之处，我们引入了修订后的反向过程（RRP），该过程调整了随机噪声的加权以增强随机性并减轻潜在的不和谐。我们在多个图像恢复任务中验证了Imagenet-1K和Celeba-HQ的建议方法，例如超分辨率，去膨胀和压缩感测。我们的实验结果表明，所提出的方法可以有效地加速现有方法，同时保持原始性能。
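
The general idea of starting the reverse process from an intermediate step can be sketched as follows. This is a generic illustration rather than the QBM algorithm itself: the linear beta schedule, the function names, and the way `x_approx` would be obtained are all assumptions, since the abstract does not specify them.

```python
import math

def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal levels alpha_bar_1..alpha_bar_T for a linear beta schedule."""
    alpha_bars, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / max(num_steps - 1, 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def bypass_init(x_approx, alpha_bar_t, noise):
    """Forward-noise an approximate restoration to the chosen intermediate step."""
    signal = math.sqrt(alpha_bar_t)
    sigma = math.sqrt(1.0 - alpha_bar_t)
    return [signal * x + sigma * n for x, n in zip(x_approx, noise)]

def remaining_steps(num_steps, start_step):
    """Reverse steps still to run when starting at start_step instead of num_steps."""
    if not 1 <= start_step <= num_steps:
        raise ValueError("start_step must lie in [1, num_steps]")
    return list(range(start_step, 0, -1))
```

With a 1000-step schedule and a hypothetical start at step 300, `remaining_steps(1000, 300)` leaves 300 reverse steps, so the first 700 are skipped. Whether quality holds at a given start step depends on how good the approximation is.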

Title: DreamPoster: A Unified Framework for Image-Conditioned Generative Poster Design

Authors: Xiwei Hu, Haokun Chen, Zhongqi Qi, Hui Zhang, Dexiang Hong, Jie Shao, Xinglong Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.04218
Pdf URL: https://arxiv.org/pdf/2507.04218
Copy Paste: [[2507.04218]] DreamPoster: A Unified Framework for Image-Conditioned Generative Poster Design(https://arxiv.org/abs/2507.04218)
Keywords: generation, generative
Abstract: We present DreamPoster, a Text-to-Image generation framework that intelligently synthesizes high-quality posters from user-provided images and text prompts while maintaining content fidelity and supporting flexible resolution and layout outputs. Specifically, DreamPoster is built upon our T2I model, Seedream3.0, to uniformly process different poster generating types. For dataset construction, we propose a systematic data annotation pipeline that precisely annotates textual content and typographic hierarchy information within poster images, while employing comprehensive methodologies to construct paired datasets comprising source materials (e.g., raw graphics/text) and their corresponding final poster outputs. Additionally, we implement a progressive training strategy that enables the model to hierarchically acquire multi-task generation capabilities while maintaining high-quality generation. Evaluations on our testing benchmarks demonstrate DreamPoster's superiority over existing methods, achieving a high usability rate of 88.55%, compared to GPT-4o (47.56%) and SeedEdit3.0 (25.96%). DreamPoster will be online in Jimeng and other Bytedance Apps.
摘要：我们提出DreamPoster，这是一个文本到图像生成框架，它可以智能地从用户提供的图像和文本提示中巧妙地综合了高质量的海报，同时保持内容保真度并支持灵活的分辨率和布局输出。具体而言，Dreamposter建立在我们的T2I模型，SeedReam3.0上，以统一处理不同的海报生成类型。对于数据集构建，我们提出了一个系统的数据注释管道，该管道精确注释了海报图像中的文本内容和印刷层次结构信息，同时采用全面的方法来构造包含源材料（例如原始图形/文本）的配对数据集及其相应的最终海报输出。此外，我们实施了一种渐进培训策略，该策略使该模型能够在维持高质量生成的同时层次获得多任务生成能力。与GPT-4O（47.56 \％）和SEEDEDIT3.0（25.96 \％）相比，我们的测试基准评估表明Dreamposter优于现有方法的优势，达到88.55％\％。 Dreamposter将在Jimeng和其他派系应用程序在线。

Title: Model Collapse Is Not a Bug but a Feature in Machine Unlearning for LLMs

Authors: Yan Scholten, Sophie Xhonneux, Stephan Günnemann, Leo Schwinn
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2507.04219
Pdf URL: https://arxiv.org/pdf/2507.04219
Copy Paste: [[2507.04219]] Model Collapse Is Not a Bug but a Feature in Machine Unlearning for LLMs(https://arxiv.org/abs/2507.04219)
Keywords: generation, generative
Abstract: Current unlearning methods for LLMs optimize on the private information they seek to remove by incorporating it into their training objectives. We argue this not only risks reinforcing exposure to sensitive data, it also fundamentally contradicts the principle of minimizing its use. As a remedy, we propose a novel unlearning method - Partial Model Collapse (PMC), which does not require unlearning targets in the unlearning objective. Our approach is inspired by recent observations that training generative models on their own generations leads to distribution collapse, effectively removing information from the model. Our core idea is to leverage this collapse for unlearning by triggering collapse partially on the sensitive data. We theoretically analyze that our approach converges to the desired outcome, i.e. the LLM unlearns the information in the forget set. We empirically demonstrate that PMC overcomes two key limitations of existing unlearning approaches that explicitly optimize on unlearning targets, and more effectively removes private information from model outputs. Overall, our contributions represent an important step toward more comprehensive unlearning that aligns with real-world privacy constraints. Code available at this https URL.
摘要：LLM的当前未学习方法可以通过将其纳入其培训目标中，以优化他们寻求删除的私人信息。我们认为，这不仅有可能加强暴露于敏感数据的风险，而且从根本上讲，这也与最大程度地减少其使用的原则相矛盾。作为一种补救措施，我们提出了一种新颖的学习方法 - 部分模型崩溃（PMC），该方法在未学习目标中不需要学习目标。我们的方法的灵感来自最近的观察，即训练生成模型会导致分布崩溃，从而有效地从模型中删除了信息。我们的核心思想是利用这种崩溃来通过在敏感数据上部分触发崩溃来进行学习。我们从理论上分析了我们的方法是否会收敛到所需的结果，即LLM在“忘记集合”中的信息。我们从经验上证明，PMC克服了现有的未学习方法的两个关键局限性，这些方法明确优化了学习目标，并更有效地从模型输出中删除了私人信息。总体而言，我们的贡献代表了朝着更全面的耕种迈出的重要一步，即与现实世界的隐私约束保持一致。可在此HTTPS URL上找到代码。

Title: Zero-Shot Cyclic Peptide Design with Composable Geometric Conditions

Authors: Dapeng Jiang, Xiangzhe Kong, Jiaqi Han, Mingyu Li, Rui Jiao, Wenbing Huang, Stefano Ermon, Jianzhu Ma, Yang Liu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2507.04225
Pdf URL: https://arxiv.org/pdf/2507.04225
Copy Paste: [[2507.04225]] Zero-Shot Cyclic Peptide Design with Composable Geometric Conditions(https://arxiv.org/abs/2507.04225)
Keywords: generation, generative
Abstract: Cyclic peptides, characterized by geometric constraints absent in linear peptides, offer enhanced biochemical properties, presenting new opportunities to address unmet medical needs. However, designing target-specific cyclic peptides remains underexplored due to limited training data. To bridge the gap, we propose CP-Composer, a novel generative framework that enables zero-shot cyclic peptide generation via composable geometric constraints. Our approach decomposes complex cyclization patterns into unit constraints, which are incorporated into a diffusion model through geometric conditioning on nodes and edges. During training, the model learns from unit constraints and their random combinations in linear peptides, while at inference, novel constraint combinations required for cyclization are imposed as input. Experiments show that our model, despite being trained on linear peptides, is capable of generating diverse target-binding cyclic peptides, reaching success rates from 38% to 84% on different cyclization strategies.
摘要：循环肽的特征是线性肽中缺少几何约束，提供了增强的生化特性，为满足未满足的医疗需求提供了新的机会。但是，由于训练数据有限，设计目标特异性循环肽仍未被逐渐倍增。为了弥合差距，我们提出了CP-Composer，这是一种新型的生成框架，可以通过合并的几何约束来零光周期肽产生。我们的方法将复杂的环化模式分解为单位约束，这些模式通过节点和边缘的几何条件纳入扩散模型。在训练过程中，该模型从单位约束及其在线性肽中的随机组合中学习，而在推断时，施加了环化学所需的新约束组合作为输入。实验表明，我们的模型尽管接受过线性肽的培训，但能够产生各种目标结合循环肽，在不同的环化策略上达到了从38％到84％的成功率。

Title: MoReMouse: Monocular Reconstruction of Laboratory Mouse

Authors: Yuan Zhong, Jingxiang Sun, Liang An, Yebin Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.04258
Pdf URL: https://arxiv.org/pdf/2507.04258
Copy Paste: [[2507.04258]] MoReMouse: Monocular Reconstruction of Laboratory Mouse(https://arxiv.org/abs/2507.04258)
Keywords: generation
Abstract: Laboratory mice play a crucial role in biomedical research, yet accurate 3D mouse surface motion reconstruction remains challenging due to their complex non-rigid geometric deformations and textureless appearance. Moreover, the absence of structured 3D datasets severely hinders progress beyond sparse keypoint tracking. To narrow the gap, we present MoReMouse, the first monocular dense 3D reconstruction network tailored for laboratory mice. To achieve this goal, we highlight three key designs. First, we construct the first high-fidelity dense-view synthetic dataset for mice by rendering our self-designed realistic Gaussian mouse avatar. Second, MoReMouse adopts a transformer-based feedforward architecture with a triplane representation, achieving high-quality 3D surface generation from a single image. Third, we create geodesic-based continuous correspondence embeddings on the mouse surface, which serve as strong semantic priors to improve reconstruction stability and surface consistency. Extensive quantitative and qualitative experiments demonstrate that MoReMouse significantly outperforms existing open-source methods in accuracy and robustness. Video results are available at this https URL.
摘要：实验室小鼠在生物医学研究中起着至关重要的作用，但由于其复杂的非韧性几何变形和无纹理外观，精确的3D小鼠表面运动重建仍然具有挑战性。此外，缺乏结构化的3D数据集极大地阻碍了稀疏关键点跟踪的进度。为了缩小差距，我们提出了Moremuse，这是针对实验室小鼠量身定制的第一个单眼密集的3D重建网络。为了实现这一目标，我们重点介绍了三个关键设计。首先，我们通过渲染自设计的逼真的高斯鼠标头像来构建第一个用于小鼠的高保真密集图合成数据集。其次，Moremuse采用了带有三层代表的基于变压器的前馈结构，从单个图像中实现了高质量的3D表面生成。第三，我们在小鼠表面上创建了基于测量的连续对应关系，这是提高重建稳定性和表面一致性的强大语义先验。广泛的定量和定性实验表明，Moremuse在准确性和鲁棒性方面的表现显着优于现有的开源方法。视频结果可在此HTTPS URL上找到。

Title: An Explainable Transformer Model for Alzheimer's Disease Detection Using Retinal Imaging

Authors: Saeed Jamshidiha, Alireza Rezaee, Farshid Hajati, Mojtaba Golzan, Raymond Chiong
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2507.04259
Pdf URL: https://arxiv.org/pdf/2507.04259
Copy Paste: [[2507.04259]] An Explainable Transformer Model for Alzheimer's Disease Detection Using Retinal Imaging(https://arxiv.org/abs/2507.04259)
Keywords: generative
Abstract: Alzheimer's disease (AD) is a neurodegenerative disorder that affects millions worldwide. In the absence of effective treatment options, early diagnosis is crucial for initiating management strategies to delay disease onset and slow down its progression. In this study, we propose Retformer, a novel transformer-based architecture for detecting AD using retinal imaging modalities, leveraging the power of transformers and explainable artificial intelligence. The Retformer model is trained on datasets of different modalities of retinal images from patients with AD and age-matched healthy controls, enabling it to learn complex patterns and relationships between image features and disease diagnosis. To provide insights into the decision-making process of our model, we employ the Gradient-weighted Class Activation Mapping algorithm to visualize the feature importance maps, highlighting the regions of the retinal images that contribute most significantly to the classification outcome. These findings are compared to existing clinical studies on detecting AD using retinal biomarkers, allowing us to identify the most important features for AD detection in each imaging modality. The Retformer model outperforms a variety of benchmark algorithms across different performance metrics by margins of up to 11%.
摘要：阿尔茨海默氏病（AD）是一种神经退行性疾病，影响了全球数百万。在没有有效治疗方案的情况下，早期诊断对于启动管理策略延迟疾病发作并减慢其进展至关重要。在这项研究中，我们提出了Retformer，这是一种基于变压器的新型架构，用于使用视网膜成像方式检测AD，利用变压器的力量和可解释的人工智能。在具有AD和年龄匹配的健康对照患者的视网膜图像的不同模式的数据集上，对Retformer模型进行了训练，从而使其能够学习复杂的模式以及图像特征和疾病诊断之间的关系。为了洞悉模型的决策过程，我们采用了梯度加权的类激活映射算法来可视化特征的重要性图，从而突出了视网膜图像的区域，这些区域对分类结果贡献最大。将这些发现与现有的使用视网膜生物标志物检测AD检测AD的临床研究进行了比较，从而使我们能够在每种成像方式中识别出最重要的AD检测特征。 Retformer模型的表现优于不同性能指标的各种基准算法，最高可达11 \。

Title: Towards Lightest Low-Light Image Enhancement Architecture for Mobile Devices

Authors: Guangrui Bai, Hailong Yan, Wenhai Liu, Yahui Deng, Erbao Dong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.04277
Pdf URL: https://arxiv.org/pdf/2507.04277
Copy Paste: [[2507.04277]] Towards Lightest Low-Light Image Enhancement Architecture for Mobile Devices(https://arxiv.org/abs/2507.04277)
Keywords: restoration
Abstract: Real-time low-light image enhancement on mobile and embedded devices requires models that balance visual quality and computational efficiency. Existing deep learning methods often rely on large networks and labeled datasets, limiting their deployment on resource-constrained platforms. In this paper, we propose LiteIE, an ultra-lightweight unsupervised enhancement framework that eliminates dependence on large-scale supervision and generalizes well across diverse conditions. We design a backbone-agnostic feature extractor with only two convolutional layers to produce compact image feature enhancement tensors. In addition, we develop a parameter-free Iterative Restoration Module, which reuses the extracted features to progressively recover fine details lost in earlier enhancement steps, without introducing any additional learnable parameters. We further propose an unsupervised training objective that integrates exposure control, edge-aware smoothness, and multi-scale color consistency losses. On the LOL dataset, LiteIE achieves 19.04 dB PSNR, surpassing SOTA by 1.4 dB while using only 0.07% of its parameters. On a Snapdragon 8 Gen 3 mobile processor, LiteIE runs at 30 FPS for 4K images with just 58 parameters, enabling real-time deployment on edge devices. These results establish LiteIE as an efficient and practical solution for low-light enhancement on resource-limited platforms.
摘要：移动和嵌入式设备上的实时低光图像增强功能需要平衡视觉质量和计算效率的模型。现有的深度学习方法通常依赖大型网络和标记的数据集，从而将其部署限制在资源受限的平台上。在本文中，我们提出了Liteie，这是一种超轻量级无监督的增强框架，消除了对大规模监督的依赖，并在各种条件下均能很好地概括。我们设计了一个仅使用两个卷积层的骨干不稳定特征提取器，以产生紧凑的图像特征增强张量。此外，我们开发了一个无参数的迭代恢复模块，该模块将重复提取的功能逐渐恢复在早期增强步骤中丢失的细节，而无需引入任何其他可学习的参数。我们进一步提出了一个无监督的训练目标，该目标整合了曝光控制，边缘感知的平滑度和多尺度颜色一致性损失。 LITEIE在LOL数据集上的实验达到19.04 DB PSNR，仅使用0.07 \％的参数，超过1.4 dB的SOTA。在Snapdragon 8 Gen 3移动处理器上，Liteie以30 fps的速度运行，仅使用58个参数，可在边缘设备上实时部署。这些结果将Liteie作为一种有效且实用的解决方案，用于在资源有限的平台上进行低光增强。
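
For readers unfamiliar with the PSNR figures quoted in the abstract above, the metric is defined as 10 * log10(MAX^2 / MSE). A minimal sketch, assuming images are given as flat lists of pixel values with an 8-bit peak of 255 (the function name and input layout are illustrative, not taken from the paper's evaluation code):

```python
import math

def psnr(reference, test, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    if len(reference) != len(test) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Because the scale is logarithmic, a 1.4 dB gain corresponds to a mean squared error lower by a factor of 10^0.14, roughly 1.38.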

Title: SeqTex: Generate Mesh Textures in Video Sequence

Authors: Ze Yuan (1), Xin Yu (1), Yangtian Sun (1), Yuan-Chen Guo (2), Yan-Pei Cao (2), Ding Liang (2), Xiaojuan Qi (1) ((1) HKU, (2) VAST)
Subjects: cs.CV, cs.AI, cs.GR
Abstract URL: https://arxiv.org/abs/2507.04285
Pdf URL: https://arxiv.org/pdf/2507.04285
Copy Paste: [[2507.04285]] SeqTex: Generate Mesh Textures in Video Sequence(https://arxiv.org/abs/2507.04285)
Keywords: generation, generative
Abstract: Training native 3D texture generative models remains a fundamental yet challenging problem, largely due to the limited availability of large-scale, high-quality 3D texture datasets. This scarcity hinders generalization to real-world scenarios. To address this, most existing methods finetune foundation image generative models to exploit their learned visual priors. However, these approaches typically generate only multi-view images and rely on post-processing to produce UV texture maps -- an essential representation in modern graphics pipelines. Such two-stage pipelines often suffer from error accumulation and spatial inconsistencies across the 3D surface. In this paper, we introduce SeqTex, a novel end-to-end framework that leverages the visual knowledge encoded in pretrained video foundation models to directly generate complete UV texture maps. Unlike previous methods that model the distribution of UV textures in isolation, SeqTex reformulates the task as a sequence generation problem, enabling the model to learn the joint distribution of multi-view renderings and UV textures. This design effectively transfers the consistent image-space priors from video foundation models into the UV domain. To further enhance performance, we propose several architectural innovations: a decoupled multi-view and UV branch design, geometry-informed attention to guide cross-domain feature alignment, and adaptive token resolution to preserve fine texture details while maintaining computational efficiency. Together, these components allow SeqTex to fully utilize pretrained video priors and synthesize high-fidelity UV texture maps without the need for post-processing. Extensive experiments show that SeqTex achieves state-of-the-art performance on both image-conditioned and text-conditioned 3D texture generation tasks, with superior 3D consistency, texture-geometry alignment, and real-world generalization.
摘要：培训本机3D纹理生成模型仍然是一个根本但具有挑战性的问题，这主要是由于大规模高质量3D纹理数据集的可用性有限。这种稀缺性阻碍了对现实情况的概括。为了解决这个问题，大多数现有的方法Finetune Foundation Image生成模型可利用其学习的视觉先验。但是，这些方法通常仅生成多视图图像，并依靠后处理来生成紫外线纹理图 - 现代图形管道中的必不可少的表示。这样的两阶段管道通常会遭受3D表面上的错误积累和空间不一致的困扰。在本文中，我们介绍了Seqtex，这是一个新颖的端到端框架，该框架利用了预验证的视频基础模型中编码的视觉知识直接生成完整的UV纹理图。与以前的方法对孤立的紫外线纹理分布进行建模不同，seqtex将任务重新定义为序列生成问题，从而使模型能够学习多视图渲染和紫外线纹理的联合分布。该设计有效地将一致的图像空间先验从视频基础模型转移到了UV域。为了进一步提高性能，我们提出了几项架构创新：脱钩的多视图和紫外线分支设计，几何形状的注意力，以指导跨域特征对齐，以及适应性令牌分辨率，以保持良好的纹理细节，同时保持计算效率。这些组件一起允许Seqtex充分利用预审预周化的视频先验并合成高保真紫外线纹理地图，而无需进行后处理。广泛的实验表明，SeqTex在图像条件和文本条件的3D纹理生成任务上都实现了最先进的性能，具有出色的3D一致性，纹理几何形状对准和现实世界的概括。

Title: MPQ-DMv2: Flexible Residual Mixed Precision Quantization for Low-Bit Diffusion Models with Temporal Distillation

Authors: Weilun Feng, Chuanguang Yang, Haotong Qin, Yuqi Li, Xiangqi Li, Zhulin An, Libo Huang, Boyu Diao, Fuzhen Zhuang, Michele Magno, Yongjun Xu, Yingli Tian, Tingwen Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.04290
Pdf URL: https://arxiv.org/pdf/2507.04290
Copy Paste: [[2507.04290]] MPQ-DMv2: Flexible Residual Mixed Precision Quantization for Low-Bit Diffusion Models with Temporal Distillation(https://arxiv.org/abs/2507.04290)
Keywords: generation
Abstract: Diffusion models have demonstrated remarkable performance on vision generation tasks. However, their high computational complexity hinders wide application on edge devices. Quantization has emerged as a promising technique for inference acceleration and memory reduction. However, existing quantization methods do not generalize well under extremely low-bit (2-4 bit) quantization. Directly applying these methods causes severe performance degradation. We identify that the existing quantization framework suffers from an outlier-unfriendly quantizer design, suboptimal initialization, and optimization strategy. We present MPQ-DMv2, an improved Mixed Precision Quantization framework for extremely low-bit Diffusion Models. From the quantization perspective, the imbalanced distribution caused by salient outliers is quantization-unfriendly for uniform quantizers. We propose Flexible Z-Order Residual Mixed Quantization, which utilizes an efficient binary residual branch for flexible quant steps to handle salient error. For the optimization framework, we theoretically analyze the convergence and optimality of the LoRA module and propose Object-Oriented Low-Rank Initialization to use prior quantization error for informative initialization. We then propose Memory-based Temporal Relation Distillation to construct an online time-aware pixel queue for long-term denoising temporal information distillation, which ensures the overall temporal consistency between quantized and full-precision models. Comprehensive experiments on various generation tasks show that our MPQ-DMv2 surpasses current SOTA methods by a great margin on different architectures, especially under extremely low-bit widths.
摘要：扩散模型在视力生成任务上表现出了显着的性能。但是，高计算复杂性阻碍了其在边缘设备上的广泛应用。量化已成为推理加速和减少记忆的有前途的技术。但是，在极低位（2-4位）量化下，现有的量化方法不能很好地概括。直接应用这些方法将导致严重的性能降解。我们确定现有的量化框架遭受了异常不友好的量化设计，次优初始化和优化策略。我们提出MPQ-DMV2，改进的\ textbf {M} ixed \ textbf {p} recision \ textbf {q} uantization框架，用于极低的低位\ textbf {d} iffusion \ fiffusion \ textbf \ textbf {m} odels。对于量化的角度，显着异常值引起的不平衡分布对统一量化器而言是不友好的。我们提出\ textIt {灵活的Z阶残差混合量化}，该量子利用有效的二进制残留分支来处理灵活的定量步骤来处理明显的错误。对于优化框架，我们理论上分析了Lora模块的收敛性和最佳性，并提出了\ textit {面向对象的低级初始化}以使用先前的量化错误来提供信息初始化。然后，我们建议\ textIt {基于内存的时间关系蒸馏}来构建一个在线时间感知的像素队列，以用于长期deNo的时间信息蒸馏，以确保量化和完整精确模型之间的整体时间一致性。有关各种一代任务的全面实验表明，我们的MPQ-DMV2通过不同的体系结构的差距超过当前的SOTA方法，尤其是在极低的宽度宽度下。
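
Why outliers hurt a uniform quantizer, and how a one-bit residual branch can recover part of the error, can be shown with a toy example. This is a generic min-max quantizer plus sign-based residual binarization, offered as an assumption-laden illustration. It is not the paper's Flexible Z-Order Residual Mixed Quantization, and all names are hypothetical.

```python
def uniform_quantize(values, bits):
    """Min-max uniform quantizer: snap each value to one of 2**bits evenly spaced levels."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return list(values)
    step = (hi - lo) / (2 ** bits - 1)
    return [lo + round((v - lo) / step) * step for v in values]

def binary_residual(values, quantized):
    """Approximate the quantization error with one shared scale times a sign per value."""
    residual = [v - q for v, q in zip(values, quantized)]
    scale = sum(abs(r) for r in residual) / len(residual)
    return [scale if r >= 0 else -scale for r in residual]

def mse(a, b):
    """Mean squared error between two equally sized lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# One salient outlier (4.0) stretches the 2-bit grid so the small weights collapse together.
weights = [0.1, -0.2, 0.05, 0.3, -0.15, 4.0]
coarse = uniform_quantize(weights, bits=2)
corrected = [q + r for q, r in zip(coarse, binary_residual(weights, coarse))]
error_before = mse(weights, coarse)
error_after = mse(weights, corrected)
```

With the scale set to the mean absolute residual, the corrected error equals the original error minus the squared scale, so `error_after` never exceeds `error_before` in this construction.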

Title: Multi-Modal Semantic Parsing for the Interpretation of Tombstone Inscriptions

Authors: Xiao Zhang, Johan Bos
Subjects: cs.CV, cs.CL, cs.MM
Abstract URL: https://arxiv.org/abs/2507.04377
Pdf URL: https://arxiv.org/pdf/2507.04377
Copy Paste: [[2507.04377]] Multi-Modal Semantic Parsing for the Interpretation of Tombstone Inscriptions(https://arxiv.org/abs/2507.04377)
Keywords: generation
Abstract: Tombstones are historically and culturally rich artifacts, encapsulating individual lives, community memory, historical narratives and artistic expression. Yet, many tombstones today face significant preservation challenges, including physical erosion, vandalism, environmental degradation, and political shifts. In this paper, we introduce a novel multi-modal framework for tombstone digitization, aiming to improve the interpretation, organization and retrieval of tombstone content. Our approach leverages vision-language models (VLMs) to translate tombstone images into structured Tombstone Meaning Representations (TMRs), capturing both image and text information. To further enrich semantic parsing, we incorporate retrieval-augmented generation (RAG) to integrate externally dependent elements such as toponyms, occupation codes, and ontological concepts. Compared to traditional OCR-based pipelines, our method improves parsing accuracy from an F1 score of 36.1 to 89.5. We additionally evaluate the model's robustness across diverse linguistic and cultural inscriptions, and simulate physical degradation through image fusion to assess performance under noisy or damaged conditions. Our work represents the first attempt to formalize tombstone understanding using large vision-language models, presenting implications for heritage preservation.
摘要：墓碑在历史和文化上是丰富的文物，封装了个人生活，社区记忆，历史叙事和艺术表达。然而，当今许多墓碑面临着重大的保护挑战，包括身体侵蚀，故意破坏，环境退化和政治转变。在本文中，我们引入了一个新型的多模式框架，用于墓碑数字化，旨在改善墓碑含量的解释，组织和检索。我们的方法利用视觉模型（VLM）将墓碑图像转化为结构化墓碑的含义表示（TMR），从而捕获图像和文本信息。为了进一步丰富语义解析，我们结合了检索增强的一代（RAG），以整合外部依赖的元素，例如上调，职业代码和本体论概念。与传统的基于OCR的管道相比，我们的方法将解析准确性从36.1的F1分数提高到89.5。我们还评估了该模型在各种语言和文化铭文中的鲁棒性，并通过图像融合模拟物理降解，以评估在嘈杂或受损条件下的性能。我们的工作是使用大型视觉语言模型对墓碑进行正式理解的首次尝试，从而对遗产保存产生了影响。

Title: Sat2City: 3D City Generation from A Single Satellite Image with Cascaded Latent Diffusion

Authors: Tongyan Hua, Lutao Jiang, Ying-Cong Chen, Wufan Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.04403
Pdf URL: https://arxiv.org/pdf/2507.04403
Copy Paste: [[2507.04403]] Sat2City: 3D City Generation from A Single Satellite Image with Cascaded Latent Diffusion(https://arxiv.org/abs/2507.04403)
Keywords: generation, generative
Abstract: Recent advancements in generative models have enabled 3D urban scene generation from satellite imagery, unlocking promising applications in gaming, digital twins, and beyond. However, most existing methods rely heavily on neural rendering techniques, which hinder their ability to produce detailed 3D structures on a broader scale, largely due to the inherent structural ambiguity derived from relatively limited 2D observations. To address this challenge, we propose Sat2City, a novel framework that synergizes the representational capacity of sparse voxel grids with latent diffusion models, tailored specifically for our novel 3D city dataset. Our approach is enabled by three key components: (1) a cascaded latent diffusion framework that progressively recovers 3D city structures from satellite imagery, (2) a Re-Hash operation at its Variational Autoencoder (VAE) bottleneck to compute multi-scale feature grids for stable appearance optimization, and (3) an inverse sampling strategy enabling implicit supervision for smooth appearance. To overcome the challenge of collecting real-world city-scale 3D models with high-quality geometry and appearance, we introduce a dataset of synthesized large-scale 3D cities paired with satellite-view height maps. Validated on this dataset, our framework generates detailed 3D structures from a single satellite image, achieving superior fidelity compared to existing city generation models.
摘要：生成模型的最新进展已从卫星图像中启用了3D Urban场景，从而解开了游戏，数字双胞胎及以后的有希望的应用。但是，大多数现有方法都严重依赖于神经渲染技术，这阻碍了它们在更广泛的规模上产生详细的3D结构的能力，这在很大程度上是由于固有的结构歧义性，从相对有限的2D观察结果中得出。为了应对这一挑战，我们提出了SAT2City，这是一个新颖的框架，通过潜在的扩散模型协同稀疏体素网格的代表性，专门针对我们的新颖3D City DataSet量身定制。我们的方法由三个关键组成部分启用：（1）层叠潜的扩散框架逐渐从卫星图像中恢复3D城市结构，（（2）在其变异的自动编码器（VAE）瓶颈上进行重新敲打操作，以计算多层尺度的范围，以使稳定的外观优化和（3）构成构成构成稳定的构成策略的构成策略，并构成策略构成策略的策略，并构成策略策略，并将其构成式的策略范围。收集具有高质量几何形状和外观的现实世界尺度3D模型的挑战，我们引入了合成的大规模3D城市的数据集，与卫星视图高度图配对。在此数据集中验证，我们的框架从单个卫星图像中生成了详细的3D结构，与现有的城市生成模型相比，获得了优越的保真度。

Title: Multimedia Verification Through Multi-Agent Deep Research Multimodal Large Language Models

Authors: Huy Hoan Le, Van Sy Thinh Nguyen, Thi Le Chi Dang, Vo Thanh Khang Nguyen, Truong Thanh Hung Nguyen, Hung Cao
Subjects: cs.CV, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2507.04410
Pdf URL: https://arxiv.org/pdf/2507.04410
Copy Paste: [[2507.04410]] Multimedia Verification Through Multi-Agent Deep Research Multimodal Large Language Models(https://arxiv.org/abs/2507.04410)
Keywords: generation
Abstract: This paper presents our submission to the ACMMM25 - Grand Challenge on Multimedia Verification. We developed a multi-agent verification system that combines Multimodal Large Language Models (MLLMs) with specialized verification tools to detect multimedia misinformation. Our system operates through six stages: raw data processing, planning, information extraction, deep research, evidence collection, and report generation. The core Deep Researcher Agent employs four tools: reverse image search, metadata analysis, fact-checking databases, and verified news processing that extracts spatial, temporal, attribution, and motivational context. We demonstrate our approach on a challenge dataset sample involving complex multimedia content. Our system successfully verified content authenticity, extracted precise geolocation and timing information, and traced source attribution across multiple platforms, effectively addressing real-world multimedia verification scenarios.
摘要：本文介绍了我们对ACMMM25的提交 - 多媒体验证的巨大挑战。我们开发了一个多模式大型语言模型（MLLM）和专门验证工具以检测多媒体错误信息的多模式大语言模型（MLLM）。我们的系统通过六个阶段运行：原始数据处理，计划，信息提取，深入研究，证据收集和报告生成。核心深层研究人员使用四个工具：反向图像搜索，元数据分析，事实检查数据库以及验证的新闻处理，以提取空间，时间，归因和动机上下文。我们在涉及复杂多媒体内容的挑战数据集样本上演示了我们的方法。我们的系统成功验证了内容的真实性，提取了精确的地理位置和定时信息，并在多个平台上追踪了源归因，从而有效地解决了现实世界中的多媒体验证方案。

Title: Tail-aware Adversarial Attacks: A Distributional Approach to Efficient LLM Jailbreaking

Authors: Tim Beyer, Yan Scholten, Stephan Günnemann, Leo Schwinn
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.04446
Pdf URL: https://arxiv.org/pdf/2507.04446
Copy Paste: [[2507.04446]] Tail-aware Adversarial Attacks: A Distributional Approach to Efficient LLM Jailbreaking(https://arxiv.org/abs/2507.04446)
Keywords: generation
Abstract: To guarantee safe and robust deployment of large language models (LLMs) at scale, it is critical to accurately assess their adversarial robustness. Existing adversarial attacks typically target harmful responses in single-point, greedy generations, overlooking the inherently stochastic nature of LLMs. In this paper, we propose a novel framework for adversarial robustness evaluation that explicitly models the entire output distribution, including tail-risks, providing better estimates for model robustness at scale. By casting the attack process as a resource allocation problem between optimization and sampling, we determine compute-optimal tradeoffs and show that integrating sampling into existing attacks boosts ASR by up to 48% and improves efficiency by up to two orders of magnitude. Our framework also enables us to analyze how different attack algorithms affect output harm distributions. Surprisingly, we find that most optimization strategies have little effect on output harmfulness. Finally, we introduce a data-free proof-of-concept objective based on entropy-maximization to demonstrate how our tail-aware perspective enables new optimization targets. Overall, our findings highlight the importance of tail-aware attacks and evaluation protocols to accurately assess and strengthen LLM safety.
摘要：为了确保大规模的大型语言模型（LLM）的安全部署，准确评估其对抗性鲁棒性至关重要。现有的对抗性攻击通常以单点贪婪的世代为目标，忽略了LLM的固有随机性。在本文中，我们提出了一个新颖的框架，用于对抗性鲁棒性评估，该框架明确地模拟了整个输出分布，包括尾危，为模型稳健性提供了更好的估计。通过将攻击过程作为优化和采样之间的资源分配问题，我们确定了计算最佳的权衡，并表明将采样集成到现有攻击中最多可提高48％，并提高效率高达两个数量级。我们的框架还使我们能够分析不同的攻击算法如何影响输出危害分布。令人惊讶的是，我们发现大多数优化策略对产出有害的影响几乎没有影响。最后，我们基于熵 - 最大化引入了无数据证明目标，以证明我们的尾随观点如何实现新的优化目标。总体而言，我们的发现突出了尾声攻击和评估方案的重要性，以准确评估和增强LLM安全性。
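
The optimization-versus-sampling tradeoff described in the abstract above can be illustrated with a small budget model. This is a simplified sketch, not the paper's framework: it assumes samples are independent with a fixed per-sample success probability, and the `success_prob` curve passed in (success probability as a function of optimization steps) is a hypothetical input.

```python
def asr_with_sampling(per_sample_success, num_samples):
    """Probability that at least one of num_samples independent samples succeeds."""
    return 1.0 - (1.0 - per_sample_success) ** num_samples

def best_split(budget, step_cost, sample_cost, success_prob):
    """Search over optimization steps; spend the remaining budget on samples.

    Returns (attack success rate, optimization steps, samples) for the best split.
    """
    best = (0.0, 0, 0)
    for steps in range(budget // step_cost + 1):
        samples = (budget - steps * step_cost) // sample_cost
        if samples < 1:
            continue
        asr = asr_with_sampling(success_prob(steps), samples)
        if asr > best[0]:
            best = (asr, steps, samples)
    return best
```

For example, `best_split(100, 10, 1, lambda k: min(0.9, 0.01 + 0.02 * k))` compares spending the whole budget on sampling against mixing in some optimization steps. Under this toy curve a mixed allocation comes out ahead of pure sampling.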

Title: DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge

Authors: Wenyao Zhang, Hongsi Liu, Zekun Qi, Yunnan Wang, XinQiang Yu, Jiazhao Zhang, Runpei Dong, Jiawei He, He Wang, Zhizheng Zhang, Li Yi, Wenjun Zeng, Xin Jin
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2507.04447
Pdf URL: https://arxiv.org/pdf/2507.04447
Copy Paste: [[2507.04447]] DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge(https://arxiv.org/abs/2507.04447)
Keywords: generation
Abstract: Recent advances in vision-language-action (VLA) models have shown promise in integrating image generation with action prediction to improve generalization and reasoning in robot manipulation. However, existing methods are limited to challenging image-based forecasting, which suffers from redundant information and lacks comprehensive and critical world knowledge, including dynamic, spatial and semantic information. To address these limitations, we propose DreamVLA, a novel VLA framework that integrates comprehensive world knowledge forecasting to enable inverse dynamics modeling, thereby establishing a perception-prediction-action loop for manipulation tasks. Specifically, DreamVLA introduces a dynamic-region-guided world knowledge prediction, integrated with the spatial and semantic cues, which provide compact yet comprehensive representations for action planning. This design aligns with how humans interact with the world by first forming abstract multimodal reasoning chains before acting. To mitigate interference among the dynamic, spatial and semantic information during training, we adopt a block-wise structured attention mechanism that masks their mutual attention, preventing information leakage and keeping each representation clean and disentangled. Moreover, to model the conditional distribution over future actions, we employ a diffusion-based transformer that disentangles action representations from shared latent features. Extensive experiments on both real-world and simulation environments demonstrate that DreamVLA achieves 76.7% success rate on real robot tasks and 4.44 average length on the CALVIN ABC-D benchmarks.

Title: CoT-lized Diffusion: Let's Reinforce T2I Generation Step-by-step

Authors: Zheyuan Liu, Munan Ning, Qihui Zhang, Shuo Yang, Zhongrui Wang, Yiwei Yang, Xianzhe Xu, Yibing Song, Weihua Chen, Fan Wang, Li Yuan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.04451
Pdf URL: https://arxiv.org/pdf/2507.04451
Copy Paste: [[2507.04451]] CoT-lized Diffusion: Let's Reinforce T2I Generation Step-by-step(https://arxiv.org/abs/2507.04451)
Keywords: generation
Abstract: Current text-to-image (T2I) generation models struggle to align spatial composition with the input text, especially in complex scenes. Even layout-based approaches yield suboptimal spatial control, as their generation process is decoupled from layout planning, making it difficult to refine the layout during synthesis. We present CoT-Diff, a framework that brings step-by-step CoT-style reasoning into T2I generation by tightly integrating Multimodal Large Language Model (MLLM)-driven 3D layout planning with the diffusion process. CoT-Diff enables layout-aware reasoning inline within a single diffusion round: at each denoising step, the MLLM evaluates intermediate predictions, dynamically updates the 3D scene layout, and continuously guides the generation process. The updated layout is converted into semantic conditions and depth maps, which are fused into the diffusion model via a condition-aware attention mechanism, enabling precise spatial control and semantic injection. Experiments on 3D Scene benchmarks show that CoT-Diff significantly improves spatial alignment and compositional fidelity, and outperforms the state-of-the-art method by 34.7% in complex scene spatial accuracy, thereby validating the effectiveness of this entangled generation paradigm.

Title: Source Attribution in Retrieval-Augmented Generation

Authors: Ikhtiyor Nematov, Tarik Kalai, Elizaveta Kuzmenko, Gabriele Fugagnoli, Dimitris Sacharidis, Katja Hose, Tomer Sagi
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2507.04480
Pdf URL: https://arxiv.org/pdf/2507.04480
Copy Paste: [[2507.04480]] Source Attribution in Retrieval-Augmented Generation(https://arxiv.org/abs/2507.04480)
Keywords: generation
Abstract: While attribution methods, such as Shapley values, are widely used to explain the importance of features or training data in traditional machine learning, their application to Large Language Models (LLMs), particularly within Retrieval-Augmented Generation (RAG) systems, is nascent and challenging. The primary obstacle is the substantial computational cost, where each utility function evaluation involves an expensive LLM call, resulting in direct monetary and time expenses. This paper investigates the feasibility and effectiveness of adapting Shapley-based attribution to identify influential retrieved documents in RAG. We compare Shapley with more computationally tractable approximations and some existing attribution methods for LLM. Our work aims to: (1) systematically apply established attribution principles to the RAG document-level setting; (2) quantify how well SHAP approximations can mirror exact attributions while minimizing costly LLM interactions; and (3) evaluate their practical explainability in identifying critical documents, especially under complex inter-document relationships such as redundancy, complementarity, and synergy. This study seeks to bridge the gap between powerful attribution techniques and the practical constraints of LLM-based RAG systems, offering insights into achieving reliable and affordable RAG explainability.
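
The cost argument in this abstract can be made concrete with a minimal sketch of exact Shapley attribution over retrieved documents. The document names and `toy_utility` below are hypothetical stand-ins (in a real RAG system each utility evaluation would be an LLM call); the sketch is illustrative and is not the authors' implementation.

```python
from itertools import combinations
from math import factorial


def exact_shapley(docs, utility):
    """Exact Shapley value of each document for a set-valued utility function."""
    n = len(docs)
    values = {d: 0.0 for d in docs}
    for d in docs:
        others = [x for x in docs if x != d]
        for k in range(n):
            # Weight of a coalition of size k that does not contain d.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Marginal contribution of d when added to coalition s.
                values[d] += weight * (utility(s | {d}) - utility(s))
    return values


def toy_utility(doc_set):
    """Hypothetical answer-quality score; stands in for an expensive LLM call."""
    score = 0.0
    if "a" in doc_set or "b" in doc_set:   # "a" and "b" are redundant
        score += 0.6
    if "c" in doc_set and "d" in doc_set:  # "c" and "d" are synergistic
        score += 0.4
    return score


phi = exact_shapley(["a", "b", "c", "d"], toy_utility)
```

With n documents the exact computation touches all 2^n coalitions, which is why cheaper approximations matter once every utility evaluation costs an LLM call. In this toy case the redundant pair splits its 0.6 credit evenly, and so does the synergistic pair with its 0.4.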

Title: A Training-Free Style-Personalization via Scale-wise Autoregressive Model

Authors: Kyoungmin Lee, Jihun Park, Jongmin Gim, Wonhyeok Choi, Kyumin Hwang, Jaeyeul Kim, Sunghoon Im
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.04482
Pdf URL: https://arxiv.org/pdf/2507.04482
Copy Paste: [[2507.04482]] A Training-Free Style-Personalization via Scale-wise Autoregressive Model(https://arxiv.org/abs/2507.04482)
Keywords: generation
Abstract: We present a training-free framework for style-personalized image generation that controls content and style information during inference using a scale-wise autoregressive model. Our method employs a three-path design--content, style, and generation--each guided by a corresponding text prompt, enabling flexible and efficient control over image semantics without any additional training. A central contribution of this work is a step-wise and attention-wise intervention analysis. Through systematic prompt and feature injection, we find that early-to-middle generation steps play a pivotal role in shaping both content and style, and that query features predominantly encode content-specific information. Guided by these insights, we introduce two targeted mechanisms: Key Stage Attention Sharing, which aligns content and style during the semantically critical steps, and Adaptive Query Sharing, which reinforces content semantics in later steps through similarity-aware query blending. Extensive experiments demonstrate that our method achieves competitive style fidelity and prompt fidelity compared to fine-tuned baselines, while offering faster inference and greater deployment flexibility.

Title: Grounded Gesture Generation: Language, Motion, and Space

Authors: Anna Deichler, Jim O'Regan, Teo Guichoux, David Johansson, Jonas Beskow
Subjects: cs.CV, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2507.04522
Pdf URL: https://arxiv.org/pdf/2507.04522
Copy Paste: [[2507.04522]] Grounded Gesture Generation: Language, Motion, and Space(https://arxiv.org/abs/2507.04522)
Keywords: generation
Abstract: Human motion generation has advanced rapidly in recent years, yet the critical problem of creating spatially grounded, context-aware gestures has been largely overlooked. Existing models typically specialize either in descriptive motion generation, such as locomotion and object interaction, or in isolated co-speech gesture synthesis aligned with utterance semantics. However, both lines of work often treat motion and environmental grounding separately, limiting advances toward embodied, communicative agents. To address this gap, our work introduces a multimodal dataset and framework for grounded gesture generation, combining two key resources: (1) a synthetic dataset of spatially grounded referential gestures, and (2) MM-Conv, a VR-based dataset capturing two-party dialogues. Together, they provide over 7.7 hours of synchronized motion, speech, and 3D scene information, standardized in the HumanML3D format. Our framework further connects to a physics-based simulator, enabling synthetic data generation and situated evaluation. By bridging gesture modeling and spatial grounding, our contribution establishes a foundation for advancing research in situated gesture generation and grounded multimodal interaction. Project page: this https URL

Title: MambaVideo for Discrete Video Tokenization with Channel-Split Quantization

Authors: Dawit Mureja Argaw, Xian Liu, Joon Son Chung, Ming-Yu Liu, Fitsum Reda
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.04559
Pdf URL: https://arxiv.org/pdf/2507.04559
Copy Paste: [[2507.04559]] MambaVideo for Discrete Video Tokenization with Channel-Split Quantization(https://arxiv.org/abs/2507.04559)
Keywords: generation, generative
Abstract: Discrete video tokenization is essential for efficient autoregressive generative modeling due to the high dimensionality of video data. This work introduces a state-of-the-art discrete video tokenizer with two key contributions. First, we propose a novel Mamba-based encoder-decoder architecture that overcomes the limitations of previous sequence-based tokenizers. Second, we introduce a new quantization scheme, channel-split quantization, which significantly enhances the representational power of quantized latents while preserving the token count. Our model sets a new state-of-the-art, outperforming both causal 3D convolution-based and Transformer-based approaches across multiple datasets. Experimental results further demonstrate its robustness as a tokenizer for autoregressive video generation.

Title: S$^2$Edit: Text-Guided Image Editing with Precise Semantic and Spatial Control

Authors: Xudong Liu, Zikun Chen, Ruowei Jiang, Ziyi Wu, Kejia Yin, Han Zhao, Parham Aarabi, Igor Gilitschenski
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.04584
Pdf URL: https://arxiv.org/pdf/2507.04584
Copy Paste: [[2507.04584]] S$^2$Edit: Text-Guided Image Editing with Precise Semantic and Spatial Control(https://arxiv.org/abs/2507.04584)
Keywords: generation
Abstract: Recent advances in diffusion models have enabled high-quality generation and manipulation of images guided by texts, as well as concept learning from images. However, naive applications of existing methods to editing tasks that require fine-grained control, e.g., face editing, often lead to suboptimal solutions with identity information and high-frequency details lost during the editing process, or irrelevant image regions altered due to entangled concepts. In this work, we propose S$^2$Edit, a novel method based on a pre-trained text-to-image diffusion model that enables personalized editing with precise semantic and spatial control. We first fine-tune our model to embed the identity information into a learnable text token. During fine-tuning, we disentangle the learned identity token from attributes to be edited by enforcing an orthogonality constraint in the textual feature space. To ensure that the identity token only affects regions of interest, we apply object masks to guide the cross-attention maps. At inference time, our method performs localized editing while faithfully preserving the original identity through the learned identity token, which is semantically disentangled and spatially focused. Extensive experiments demonstrate the superiority of S$^2$Edit over state-of-the-art methods both quantitatively and qualitatively. Additionally, we showcase several compositional image editing applications of S$^2$Edit such as makeup transfer.

Title: VLM2Vec-V2: Advancing Multimodal Embedding for Videos, Images, and Visual Documents

Authors: Rui Meng, Ziyan Jiang, Ye Liu, Mingyi Su, Xinyi Yang, Yuepeng Fu, Can Qin, Zeyuan Chen, Ran Xu, Caiming Xiong, Yingbo Zhou, Wenhu Chen, Semih Yavuz
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2507.04590
Pdf URL: https://arxiv.org/pdf/2507.04590
Copy Paste: [[2507.04590]] VLM2Vec-V2: Advancing Multimodal Embedding for Videos, Images, and Visual Documents(https://arxiv.org/abs/2507.04590)
Keywords: generation
Abstract: Multimodal embedding models have been crucial in enabling various downstream tasks such as semantic similarity, information retrieval, and clustering over different modalities. However, existing multimodal embeddings like VLM2Vec, E5-V, GME are predominantly focused on natural images, with limited support for other visual forms such as videos and visual documents. This restricts their applicability in real-world scenarios, including AI agents, multi-modal search and recommendation, and retrieval-augmented generation (RAG). To close this gap, we propose VLM2Vec-V2, a unified framework for learning embeddings across diverse visual forms. First, we introduce MMEB-V2, a comprehensive benchmark that extends MMEB with five new task types: visual document retrieval, video retrieval, temporal grounding, video classification and video question answering - spanning text, image, video, and visual document inputs. Next, we train VLM2Vec-V2, a general-purpose embedding model that supports text, image, video, and visual document inputs. Extensive experiments show that VLM2Vec-V2 achieves strong performance not only on the newly introduced video and document retrieval tasks, but also improves over prior baselines on the original image benchmarks. Through extensive evaluation, our study offers insights into the generalizability of various multimodal embedding models and highlights effective strategies for unified embedding learning, laying the groundwork for more scalable and adaptable representation learning in both research and real-world settings.

Title: QR-LoRA: Efficient and Disentangled Fine-tuning via QR Decomposition for Customized Generation

Authors: Jiahui Yang, Yongjia Ma, Donglin Di, Hao Li, Wei Chen, Yan Xie, Jianxun Cui, Xun Yang, Wangmeng Zuo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.04599
Pdf URL: https://arxiv.org/pdf/2507.04599
Copy Paste: [[2507.04599]] QR-LoRA: Efficient and Disentangled Fine-tuning via QR Decomposition for Customized Generation(https://arxiv.org/abs/2507.04599)
Keywords: generation, generative
Abstract: Existing text-to-image models often rely on parameter fine-tuning techniques such as Low-Rank Adaptation (LoRA) to customize visual attributes. However, when combining multiple LoRA models for content-style fusion tasks, unstructured modifications of weight matrices often lead to undesired feature entanglement between content and style attributes. We propose QR-LoRA, a novel fine-tuning framework leveraging QR decomposition for structured parameter updates that effectively separate visual attributes. Our key insight is that the orthogonal Q matrix naturally minimizes interference between different visual features, while the upper triangular R matrix efficiently encodes attribute-specific transformations. Our approach fixes both Q and R matrices while only training an additional task-specific $\Delta R$ matrix. This structured design reduces trainable parameters to half of conventional LoRA methods and supports effective merging of multiple adaptations without cross-contamination due to the strong disentanglement properties between $\Delta R$ matrices. Experiments demonstrate that QR-LoRA achieves superior disentanglement in content-style fusion tasks, establishing a new paradigm for parameter-efficient, disentangled fine-tuning in generative models.
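
A minimal sketch of the structure this abstract describes: factor a weight matrix as W = QR, keep Q and R frozen, and express the adaptation as Q(R + ΔR). The helper names, the toy matrix, and the choice to apply QR directly to W are assumptions made for illustration; the abstract does not specify these details, and the Gram-Schmidt routine here is just a small stdlib substitute for a library QR call.

```python
def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def add(A, B):
    """Element-wise sum of two same-shaped matrices."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def qr_gram_schmidt(W):
    """Classical Gram-Schmidt QR of a full-column-rank matrix (list of rows)."""
    cols = [list(c) for c in zip(*W)]
    n = len(cols)
    q_cols = []
    R = [[0.0] * n for _ in range(n)]
    for j, v in enumerate(cols):
        u = v[:]
        for i, q in enumerate(q_cols):
            # Projection coefficient of column j onto the i-th orthonormal vector.
            R[i][j] = sum(qi * vi for qi, vi in zip(q, v))
            u = [ui - R[i][j] * qi for ui, qi in zip(u, q)]
        R[j][j] = sum(ui * ui for ui in u) ** 0.5
        q_cols.append([ui / R[j][j] for ui in u])
    Q = [list(r) for r in zip(*q_cols)]  # back to row-major, shape m x n
    return Q, R


# Hypothetical frozen weight matrix (3 x 2) and a hypothetical trained update.
W = [[2.0, 1.0], [0.0, 3.0], [1.0, 1.0]]
Q, R = qr_gram_schmidt(W)
delta_R = [[0.1, 0.0], [0.0, -0.2]]

# Only delta_R would be trainable; Q and R stay fixed.
W_adapted = matmul(Q, add(R, delta_R))
```

Because Q has orthonormal columns, the change in the weights is exactly Q·ΔR, so each adaptation is expressed in the same fixed orthonormal basis.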

Title: any4: Learned 4-bit Numeric Representation for LLMs

Authors: Mostafa Elhoushi, Jeff Johnson
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2507.04610
Pdf URL: https://arxiv.org/pdf/2507.04610
Copy Paste: [[2507.04610]] any4: Learned 4-bit Numeric Representation for LLMs(https://arxiv.org/abs/2507.04610)
Keywords: generation
Abstract: We present any4, a learned 4-bit weight quantization solution for large language models (LLMs) providing arbitrary numeric representations without requiring pre-processing of weights or activations. any4 yields higher accuracy compared to other related 4-bit numeric representation types: int4, fp4 and nf4, as evaluated on a range of model sizes, generations and families (Llama 2, Llama 3, Mistral and Mixtral). While any4 does not require preprocessing of weights or activations, it is also competitive with orthogonal techniques that require such preprocessing (e.g., AWQ and GPTQ). We also experiment with any3 and any2 and show competitiveness at lower bits. Additionally, we show that we can calibrate using a single curated diverse sample rather than hundreds of samples from a dataset as done in most quantization approaches. We also open source tinygemm, a latency-optimized GPU matrix multiplication library for LLMs, that implements any4 using a GPU-efficient lookup table strategy along with other common quantization methods. We open source our code at this https URL.
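
The lookup-table idea can be sketched as follows: each weight is stored as a 4-bit code that indexes a table of 16 learned values. Learning the table with plain 1-D k-means (Lloyd's algorithm) is an assumption made here for illustration, since the abstract does not say how the representation is learned; all function names and the toy weights are hypothetical.

```python
import random


def nearest(lut, w):
    """Index of the table entry closest to weight w."""
    return min(range(len(lut)), key=lambda i: abs(lut[i] - w))


def learn_lut(weights, n_levels=16, iters=20):
    """Fit n_levels representative values with 1-D k-means (Lloyd's algorithm)."""
    lo, hi = min(weights), max(weights)
    # Start from a uniform grid over the observed range (int4-like levels).
    lut = [lo + (hi - lo) * i / (n_levels - 1) for i in range(n_levels)]
    for _ in range(iters):
        buckets = [[] for _ in lut]
        for w in weights:
            buckets[nearest(lut, w)].append(w)
        # Move each entry to the mean of its bucket; keep empty entries as-is.
        lut = [sum(b) / len(b) if b else c for b, c in zip(buckets, lut)]
    return lut


def quantize(weights, lut):
    """Replace each weight by its 4-bit code (an index into the table)."""
    return [nearest(lut, w) for w in weights]


def dequantize(codes, lut):
    """Recover approximate weights by table lookup."""
    return [lut[c] for c in codes]


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


rng = random.Random(0)
weights = [rng.gauss(0.0, 1.0) for _ in range(512)]  # toy weight row

lo, hi = min(weights), max(weights)
uniform_lut = [lo + (hi - lo) * i / 15 for i in range(16)]
learned_lut = learn_lut(weights)

codes = quantize(weights, learned_lut)
err_learned = mse(weights, dequantize(codes, learned_lut))
err_uniform = mse(weights, dequantize(quantize(weights, uniform_lut), uniform_lut))
```

Since the learned table starts from the uniform grid and Lloyd's iterations never increase the squared error, `err_learned` is no larger than `err_uniform` on the data the table was fitted to.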

Title: Multimodal LLM Integrated Semantic Communications for 6G Immersive Experiences

Authors: Yusong Zhang, Yuxuan Sun, Lei Guo, Wei Chen, Bo Ai, Deniz Gunduz
Subjects: cs.LG, cs.AI, cs.NI
Abstract URL: https://arxiv.org/abs/2507.04621
Pdf URL: https://arxiv.org/pdf/2507.04621
Copy Paste: [[2507.04621]] Multimodal LLM Integrated Semantic Communications for 6G Immersive Experiences(https://arxiv.org/abs/2507.04621)
Keywords: generation, generative
Abstract: 6G networks promise revolutionary immersive communication experiences including augmented reality (AR), virtual reality (VR), and holographic communications. These applications demand high-dimensional multimodal data transmission and intelligent data processing in real time, which is extremely challenging over resource-limited wireless communication systems. Moreover, a joint understanding of the environment, context, and user intent is essential to deliver task-relevant content effectively. This article presents a novel multimodal large language model (MLLM) integrated semantic communications framework, termed MLLM-SC, which fully leverages reasoning and generative capabilities of pre-trained foundation models for context-aware and task-oriented wireless communication. The MLLM-SC framework adopts a device-edge collaborative architecture. At the edge, an MLLM-empowered semantic guidance module analyzes multimodal inputs, user intents, and channel conditions to generate importance-aware attention maps prioritizing semantically critical information. An importance-aware semantic encoder and a resource-adaptive semantic decoder are jointly designed and optimized, which can utilize the semantic guidance for adaptive bandwidth allocation and high-quality content reconstruction or generation. Extensive case studies on visual question answering for AR/VR applications and diffusion-driven image generation validate the effectiveness of MLLM-SC.

Title: Learn 3D VQA Better with Active Selection and Reannotation

Authors: Shengli Zhou, Yang Liu, Feng Zheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.04630
Pdf URL: https://arxiv.org/pdf/2507.04630
Copy Paste: [[2507.04630]] Learn 3D VQA Better with Active Selection and Reannotation(https://arxiv.org/abs/2507.04630)
Keywords: generation
Abstract: 3D Visual Question Answering (3D VQA) is crucial for enabling models to perceive the physical world and perform spatial reasoning. In 3D VQA, the free-form nature of answers often leads to improper annotations that can confuse or mislead models when training on the entire dataset. While other text generation tasks can mitigate this issue by learning on large-scale datasets, the scarcity of 3D scene data enlarges the negative effect of misleading annotations. Although active learning strategies can select valuable instances for training, they fail to identify and resolve misleading labels, which the oracle inevitably provides in practice. To address this issue, we propose a multi-turn interactive active learning strategy. This strategy selects data based on models' semantic uncertainty to form a solid knowledge foundation more effectively and actively requests reannotation from an oracle to resolve potentially misleading labels. For uncertainty assessment, we utilize a variance-based metric that takes semantic relationships between terms into consideration, thus avoiding the uniform inter-class similarity assumption of previous assessment metrics. Extensive experiments exhibit better model performance and a substantial reduction in training costs, with a halving of training costs for achieving relatively high accuracy. The code is available at this https URL.

Title: A Cycle-Consistency Constrained Framework for Dynamic Solution Space Reduction in Noninjective Regression

Authors: Hanzhang Jia, Yi Gao
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.04659
Pdf URL: https://arxiv.org/pdf/2507.04659
Copy Paste: [[2507.04659]] A Cycle-Consistency Constrained Framework for Dynamic Solution Space Reduction in Noninjective Regression(https://arxiv.org/abs/2507.04659)
Keywords: generation
Abstract: To address the challenges posed by the heavy reliance of multi-output models on preset probability distributions and embedded prior knowledge in non-injective regression tasks, this paper proposes a cycle consistency-based data-driven training framework. The method jointly optimizes a forward model $\Phi: X \to Y$ and a backward model $\Psi: Y \to X$, where the cycle consistency loss is defined as $L_{\text{cycle}} = L(Y - \Phi(\Psi(Y)))$ (and vice versa). By minimizing this loss, the framework establishes a closed-loop mechanism integrating generation and validation phases, eliminating the need for manual rule design or prior distribution assumptions. Experiments on normalized synthetic and simulated datasets demonstrate that the proposed method achieves a cycle reconstruction error below 0.003, achieving an improvement of approximately 30% in evaluation metrics compared to baseline models without cycle consistency. Furthermore, the framework supports unsupervised learning and significantly reduces reliance on manual intervention, demonstrating potential advantages in non-injective regression tasks.
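
A toy sketch of the cycle-consistency idea: map a target y back to an input with a backward model, push it through the forward model again, and penalize the mismatch. For brevity the forward model is fixed to x² (non-injective, since x and -x give the same y) and only a one-parameter backward model is trained, whereas the abstract describes optimizing both models jointly; the function names, data, and finite-difference optimizer are all hypothetical choices for this illustration.

```python
import math


def phi(x):
    """Forward model X -> Y; non-injective because phi(x) == phi(-x)."""
    return x * x


def psi(y, theta):
    """Backward model Y -> X with a single trainable scale theta (toy choice)."""
    return theta * math.sqrt(y)


def cycle_loss(ys, theta):
    """Mean squared mismatch between y and its round trip phi(psi(y))."""
    return sum((y - phi(psi(y, theta))) ** 2 for y in ys) / len(ys)


def fit(ys, theta, lr=0.01, steps=300, eps=1e-6):
    """Gradient descent on the cycle loss using a central finite difference."""
    for _ in range(steps):
        grad = (cycle_loss(ys, theta + eps) - cycle_loss(ys, theta - eps)) / (2 * eps)
        theta -= lr * grad
    return theta


ys = [0.25, 1.0, 2.25, 4.0]   # targets only; no paired x values are needed
theta_pos = fit(ys, 0.5)      # settles near +1
theta_neg = fit(ys, -0.5)     # settles near -1
```

Both runs drive the cycle loss to roughly zero but land on different inverse branches, which shows how the cycle constraint selects a consistent solution without prescribing which one in advance.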

Title: Hybrid Adversarial Spectral Loss Conditional Generative Adversarial Networks for Signal Data Augmentation in Ultra-precision Machining Surface Roughness Prediction

Authors: Suiyan Shang, Chi Fai Cheung, Pai Zheng
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.04665
Pdf URL: https://arxiv.org/pdf/2507.04665
Copy Paste: [[2507.04665]] Hybrid Adversarial Spectral Loss Conditional Generative Adversarial Networks for Signal Data Augmentation in Ultra-precision Machining Surface Roughness Prediction(https://arxiv.org/abs/2507.04665)
Keywords: generation, generative
Abstract: Accurate surface roughness prediction in ultra-precision machining (UPM) is critical for real-time quality control, but small datasets hinder model performance. We propose HAS-CGAN, a Hybrid Adversarial Spectral Loss CGAN, for effective UPM data augmentation. Among five CGAN variants tested, HAS-CGAN excels in 1D force signal generation, particularly for high-frequency signals, achieving >0.85 wavelet coherence through Fourier-domain optimization. By combining generated signals with machining parameters, prediction accuracy significantly improves. Experiments with traditional ML (SVR, RF, LSTM) and deep learning models (BPNN, 1DCNN, CNN-Transformer) demonstrate that augmenting training data with 520+ synthetic samples reduces prediction error from 31.4% (original 52 samples) to ~9%, effectively addressing data scarcity in UPM roughness prediction.

Title: ChangeBridge: Spatiotemporal Image Generation with Multimodal Controls for Remote Sensing

Authors: Zhenghui Zhao, Chen Wu, Di Wang, Hongruixuan Chen, Zhuo Zheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.04678
Pdf URL: https://arxiv.org/pdf/2507.04678
Copy Paste: [[2507.04678]] ChangeBridge: Spatiotemporal Image Generation with Multimodal Controls for Remote Sensing(https://arxiv.org/abs/2507.04678)
Keywords: generation, generative
Abstract: Recent advancements in generative methods, especially diffusion models, have made great progress in remote sensing image synthesis. Despite these advancements, existing methods have not explored the simulation of future scenarios based on given scenario images. This simulation capability has wide applications for urban planning, land management, and beyond. In this work, we propose ChangeBridge, a conditional spatiotemporal diffusion model. Given pre-event images and conditioned on multimodal spatial controls (e.g., text prompts, instance layouts, and semantic maps), ChangeBridge can synthesize post-event images. The core idea behind ChangeBridge is to recast the noise-to-image diffusion model as a pre-to-post diffusion bridge. Conditioned on multimodal controls, ChangeBridge leverages a stochastic Brownian-bridge diffusion, directly modeling the spatiotemporal evolution between pre-event and post-event states. To the best of our knowledge, ChangeBridge is the first spatiotemporal generative model with multimodal controls for remote sensing. Experimental results demonstrate that ChangeBridge can simulate high-fidelity future scenarios aligned with given conditions, including event and event-driven background variations. Code will be available.
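
To illustrate what a Brownian bridge between two states looks like, the sketch below samples the textbook bridge marginal pinned at a "pre-event" vector at time 0 and a "post-event" vector at time T. The function name, the flat-list stand-ins for images, and the parameters `T` and `sigma` are assumptions for this illustration; the abstract does not give ChangeBridge's exact formulation or conditioning mechanism.

```python
import math
import random


def bridge_sample(x0, xT, t, T=1.0, sigma=1.0, rng=random):
    """Sample a Brownian-bridge marginal at time t, pinned to x0 at 0 and xT at T.

    The mean interpolates linearly between the endpoints, and the variance
    sigma^2 * t * (T - t) / T vanishes at both ends.
    """
    frac = t / T
    std = sigma * math.sqrt(t * (T - t) / T)
    return [(1 - frac) * a + frac * b + std * rng.gauss(0.0, 1.0)
            for a, b in zip(x0, xT)]


rng = random.Random(0)
pre_event = [0.0, 0.2, 0.4]    # hypothetical flattened "pre-event" pixels
post_event = [1.0, 0.8, 0.1]   # hypothetical flattened "post-event" pixels

start = bridge_sample(pre_event, post_event, t=0.0, rng=rng)
middle = bridge_sample(pre_event, post_event, t=0.5, rng=rng)
end = bridge_sample(pre_event, post_event, t=1.0, rng=rng)
```

Unlike a noise-to-image process, both endpoints here are data: the sample equals the pre-event state at t = 0 and the post-event state at t = T, with the largest stochasticity midway.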

Title: TeethGenerator: A two-stage framework for paired pre- and post-orthodontic 3D dental data generation

Authors: Changsong Lei, Yaqian Liang, Shaofeng Wang, Jiajia Dai, Yong-Jin Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.04685
Pdf URL: https://arxiv.org/pdf/2507.04685
Copy Paste: [[2507.04685]] TeethGenerator: A two-stage framework for paired pre- and post-orthodontic 3D dental data generation(https://arxiv.org/abs/2507.04685)
Keywords: generation
Abstract: Digital orthodontics represents a prominent and critical application of computer vision technology in the medical field. So far, the labor-intensive process of collecting clinical data, particularly in acquiring paired 3D orthodontic teeth models, constitutes a crucial bottleneck for developing tooth arrangement neural networks. Although numerous general 3D shape generation methods have been proposed, most of them focus on single-object generation and are insufficient for generating anatomically structured teeth models, each comprising 24-32 segmented teeth. In this paper, we propose TeethGenerator, a novel two-stage framework designed to synthesize paired pre- and post-orthodontic 3D teeth models, aiming to facilitate the training of downstream tooth arrangement networks. Specifically, our approach consists of two key modules: (1) a teeth shape generation module that leverages a diffusion model to learn the distribution of morphological characteristics of teeth, enabling the generation of diverse post-orthodontic teeth models; and (2) a teeth style generation module that synthesizes corresponding pre-orthodontic teeth models by incorporating desired styles as conditional inputs. Extensive qualitative and quantitative experiments demonstrate that our synthetic dataset aligns closely with the distribution of real orthodontic data, and significantly improves tooth alignment performance when combined with real data for training. The code and dataset are available at this https URL.
摘要：数字牙齿牙本质代表了计算机视觉技术在医疗领域的突出应用。到目前为止，收集临床数据的劳动密集型过程，尤其是在获得配对的3D正畸牙齿模型时，构成了开发牙齿布置神经网络的关键瓶颈。尽管已经提出了许多一般的3D形状生成方法，但其中大多数专注于单对象的生成，不足以产生解剖结构化的牙齿模型，每种模型包括24-32分段的牙齿。在本文中，我们提出了一种新型的两阶段框架，旨在合成配对的3D牙齿模型前和后正畸，旨在促进培训下游牙齿布置网络。具体而言，我们的方法由两个关键模块组成：（1）牙齿形状的产生模块，该模块利用扩散模型学习牙齿的形态特征的分布，从而使多种后正牙性牙齿模型产生；（2）通过将所需样式作为有条件输入的条件输入来合成相应的前正畸牙齿模型的牙齿样式生成模块。广泛的定性和定量实验表明，我们的合成数据集与真实正畸数据的分布紧密保持一致，并在与实际数据结合使用以进行训练时显着促进牙齿对齐性能。该代码和数据集可在此HTTPS URL上找到。

Title: Structure-Guided Diffusion Models for High-Fidelity Portrait Shadow Removal

Authors: Wanchang Yu, Qing Zhang, Rongjia Zheng, Wei-Shi Zheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.04692
Pdf URL: https://arxiv.org/pdf/2507.04692
Copy Paste: [[2507.04692]] Structure-Guided Diffusion Models for High-Fidelity Portrait Shadow Removal(https://arxiv.org/abs/2507.04692)
Keywords: restoration, generative
Abstract: We present a diffusion-based portrait shadow removal approach that can robustly produce high-fidelity results. Unlike previous methods, we cast shadow removal as diffusion-based inpainting. To this end, we first train a shadow-independent structure extraction network on a real-world portrait dataset with various synthetic lighting conditions, which allows us to generate a shadow-independent structure map that includes facial details while excluding unwanted shadow boundaries. The structure map is then used as a condition to train a structure-guided inpainting diffusion model for removing shadows in a generative manner. Finally, to restore the fine-scale details (e.g., eyelashes, moles and spots) that may not be captured by the structure map, we take the gradients inside the shadow regions as guidance and train a detail restoration diffusion model to refine the shadow removal result. Extensive experiments on the benchmark datasets show that our method clearly outperforms existing methods, and effectively avoids previously common issues such as facial identity tampering, shadow residual, color distortion, structure blurring, and loss of details. Our code is available at this https URL.
摘要：我们提出了一种基于扩散的肖像删除方法，可以牢固地产生高保真的结果。与以前的方法不同，我们将阴影去除作为基于扩散的镶嵌。为此，我们首先在具有各种合成照明条件的真实世界肖像数据集上训练独立于阴影的结构提取网络，该网络允许生成独立的阴影结构图，包括面部细节，同时排除不需要的阴影边界。然后将结构图用作条件，以训练以生成方式去除阴影的结构引导的介入扩散模型。最后，为了恢复可能不会被结构图捕获的细节（例如，睫毛，痣和斑点），我们将阴影区域内的梯度作为指导和训练细节恢复扩散模型来完善阴影删除结果。基准数据集上的广泛实验表明，我们的方法明显优于现有方法，并且有效地避免了以前常见的问题，例如面部身份篡改，阴影残留，颜色扭曲，结构模糊和细节丢失。我们的代码可在此HTTPS URL上找到。

Title: Performance Evaluation of General Purpose Large Language Models for Basic Linear Algebra Subprograms Code Generation

Authors: Daichi Mukunoki, Shun-ichiro Hayashi, Tetsuya Hoshino, Takahiro Katagiri
Subjects: cs.LG, cs.DC, cs.MS
Abstract URL: https://arxiv.org/abs/2507.04697
Pdf URL: https://arxiv.org/pdf/2507.04697
Copy Paste: [[2507.04697]] Performance Evaluation of General Purpose Large Language Models for Basic Linear Algebra Subprograms Code Generation(https://arxiv.org/abs/2507.04697)
Keywords: generation, generative
Abstract: Generative AI technology based on Large Language Models (LLM) has been developed and applied to assist or automatically generate program codes. In this paper, we evaluate the capability of existing general LLMs for Basic Linear Algebra Subprograms (BLAS) code generation for CPUs. We use two LLMs provided by OpenAI: GPT-4.1, a Generative Pre-trained Transformer (GPT) model, and o4-mini, one of the o-series of reasoning models. Both were released in April 2025. For the routines from level-1 to level-3 BLAS, we tried to generate (1) C code without optimization from the routine name only, (2) C code with basic performance optimizations (thread parallelization, SIMD vectorization, and cache blocking) from the routine name only, and (3) C code with basic performance optimizations based on Fortran reference code. As a result, we found that correct code can be generated in many cases even when only the routine name is given. We also confirmed that thread parallelization with OpenMP, SIMD vectorization, and cache blocking can be implemented to some extent, and that the code is faster than the reference code.
摘要：已经开发并应用了基于大语言模型（LLM）的生成AI技术来协助或自动生成程序代码。在本文中，我们评估了CPU的基本线性代数子程序（BLAS）代码生成的现有一般LLM的能力。我们使用OpenAI：GPT-4.1提供的两个LLM，这是一种生成的预训练的变压器（GPT）模型，而O4-Mini是推理模型的O系列之一。两者都在2025年4月发布。对于从1级到3个Blas的例程，我们试图生成（1）C代码，而无需仅从例程名称中进行优化，（2）C代码具有基本性能优化（线程并行化，SIMD矢量化和仅例行名称的线程并行化，Cache vectorization和Cache Blocking），并且（仅例行名称）和（3）C代码具有基于基于Fortran fortran fortran参考代码的基本性能优化。结果，我们发现即使只给出例程名称，也可以在许多情况下生成正确的代码。我们还确认可以在某种程度上实现使用OpenMP，SIMD矢量化和缓存阻塞的线程并行化，并且代码比参考代码更快。
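
Illustrative sketch: cache blocking, one of the optimizations named in the abstract above, restructures the loops of a level-3 BLAS routine such as matrix-matrix multiplication so that small tiles are reused while they are still in cache. The paper generates C code; Python is used here only to show the loop structure, and the function names and block size `bs` are hypothetical. Pure Python will not show the speedup that the same restructuring gives in C.

```python
def gemm_naive(A, B):
    """C = A * B with the plain triple loop (i, p, j order)."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for p in range(k):
            a = A[i][p]
            for j in range(m):
                C[i][j] += a * B[p][j]
    return C


def gemm_blocked(A, B, bs=2):
    """Same product, but iterating over bs x bs tiles of A, B and C."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, bs):              # tile row of C
        for p0 in range(0, k, bs):          # tile along the shared dimension
            for j0 in range(0, m, bs):      # tile column of C
                for i in range(i0, min(i0 + bs, n)):
                    for p in range(p0, min(p0 + bs, k)):
                        a = A[i][p]
                        for j in range(j0, min(j0 + bs, m)):
                            C[i][j] += a * B[p][j]
    return C


if __name__ == "__main__":
    A = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
    B = [[1.0, 0.0], [0.0, 1.0], [2.0, 3.0]]
    print(gemm_naive(A, B))
    print(gemm_blocked(A, B, bs=2))
```

Both functions accumulate each output entry over the shared index in the same order, so they return identical results; only the memory access pattern differs.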

Title: A Visual Leap in CLIP Compositionality Reasoning through Generation of Counterfactual Sets

Authors: Zexi Jia, Chuanwei Huang, Hongyan Fei, Yeshuang Zhu, Zhiqiang Yuan, Ying Deng, Jiapei Zhang, Jinchao Zhang, Jie Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.04699
Pdf URL: https://arxiv.org/pdf/2507.04699
Copy Paste: [[2507.04699]] A Visual Leap in CLIP Compositionality Reasoning through Generation of Counterfactual Sets(https://arxiv.org/abs/2507.04699)
Keywords: generation
Abstract: Vision-language models (VLMs) often struggle with compositional reasoning due to insufficient high-quality image-text data. To tackle this challenge, we propose a novel block-based diffusion approach that automatically generates counterfactual datasets without manual annotation. Our method utilizes large language models to identify entities and their spatial relationships. It then independently generates image blocks as "puzzle pieces" coherently arranged according to specified compositional rules. This process creates diverse, high-fidelity counterfactual image-text pairs with precisely controlled variations. In addition, we introduce a specialized loss function that differentiates inter-set from intra-set samples, enhancing training efficiency and reducing the need for negative samples. Experiments demonstrate that fine-tuning VLMs with our counterfactual datasets significantly improves visual reasoning performance. Our approach achieves state-of-the-art results across multiple benchmarks while using substantially less training data than existing methods.
摘要：视觉语言模型（VLM）通常由于不足的高质量图像文本数据而与组成推理斗争。为了应对这一挑战，我们提出了一种基于块的新型扩散方法，该方法会自动生成反事实数据集而无需手动注释。我们的方法利用大型语言模型来识别实体及其空间关系。然后，它独立生成图像块，因为“拼图零件”是根据指定的构图规则连贯安排的。这个过程创造了具有精确控制变化的多样化，高保真的反事实图形对。此外，我们引入了一种专业的损失函数，该函数将间隔与集合样本区分开，提高了训练效率并减少了对负样本的需求。实验表明，对我们的反事实数据集进行微调VLM可显着提高视觉推理性能。我们的方法可在多个基准测试中获得最新的结果，同时使用的培训数据比现有方法少得多。

Title: Identity-Preserving Text-to-Video Generation Guided by Simple yet Effective Spatial-Temporal Decoupled Representations

Authors: Yuji Wang, Moran Li, Xiaobin Hu, Ran Yi, Jiangning Zhang, Han Feng, Weijian Cao, Yabiao Wang, Chengjie Wang, Lizhuang Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.04705
Pdf URL: https://arxiv.org/pdf/2507.04705
Copy Paste: [[2507.04705]] Identity-Preserving Text-to-Video Generation Guided by Simple yet Effective Spatial-Temporal Decoupled Representations(https://arxiv.org/abs/2507.04705)
Keywords: generation
Abstract: Identity-preserving text-to-video (IPT2V) generation, which aims to create high-fidelity videos with consistent human identity, has become crucial for downstream applications. However, current end-to-end frameworks suffer a critical spatial-temporal trade-off: optimizing for spatially coherent layouts of key elements (e.g., character identity preservation) often compromises instruction-compliant temporal smoothness, while prioritizing dynamic realism risks disrupting the spatial coherence of visual structures. To tackle this issue, we propose a simple yet effective spatial-temporal decoupled framework that decomposes representations into spatial features for layouts and temporal features for motion dynamics. Specifically, our paper proposes a semantic prompt optimization mechanism and a stage-wise decoupled generation paradigm. The former decouples the prompt into spatial and temporal components. In line with the stage-wise decoupled paradigm, the spatial prompts guide the text-to-image (T2I) stage to generate coherent spatial features, while the temporal prompts direct the subsequent image-to-video (I2V) stage to ensure motion consistency. Experimental results validate that our approach achieves excellent spatiotemporal consistency, demonstrating outstanding performance in identity preservation, text relevance, and video quality. By leveraging this simple yet robust mechanism, our algorithm secures the runner-up position in the 2025 ACM MultiMedia Challenge.
摘要：旨在创建具有一致人类身份的高保真视频的文本对视频（IPT2V）的生成，对下游应用至关重要。但是，当前的端到端框架遭受了关键的时空权衡：优化关键要素的空间连贯布局（例如，角色身份保存）通常会损害符合教学符合教学的时间平滑度，同时优先考虑动态现实主义的风险，从而破坏了视觉结构的空间连贯性。为了解决这个问题，我们提出了一个简单而有效的空间脱钩框架，该框架将表示形式分解为布局的空间特征和运动动力学的时间特征。具体而言，我们的论文提出了语义提示优化机制和阶段脱钩的生成范式。以前的模块将提示分解为空间和时间组件。与随后的阶段脱钩方法保持一致，空间提示指导文本形象（T2I）阶段以生成相干的空间特征，而时间上则提示直接直接直接直接映像 - i2V（I2V）阶段，以确保运动一致性。实验结果证明了我们的方法可以达到出色的时空一致性，表现出在身份保存，文本相关性和视频质量方面的出色表现。通过利用这种简单而强大的机制，我们的算法确保了2025年ACM多媒体挑战的亚军。

Title: UrbanMind: Towards Urban General Intelligence via Tool-Enhanced Retrieval-Augmented Generation and Multilevel Optimization

Authors: Kai Yang, Zelin Zhu, Chengtao Jian, Hui Ma, Shengjie Zhao, Xiaozhou Ye, Ye Ouyang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2507.04706
Pdf URL: https://arxiv.org/pdf/2507.04706
Copy Paste: [[2507.04706]] UrbanMind: Towards Urban General Intelligence via Tool-Enhanced Retrieval-Augmented Generation and Multilevel Optimization(https://arxiv.org/abs/2507.04706)
Keywords: generation
Abstract: Urban general intelligence (UGI) refers to the capacity of AI systems to autonomously perceive, reason, and act within dynamic and complex urban environments. In this paper, we introduce UrbanMind, a tool-enhanced retrieval-augmented generation (RAG) framework designed to facilitate UGI. Central to UrbanMind is a novel architecture based on Continual Retrieval-Augmented MoE-based LLM (C-RAG-LLM), which dynamically incorporates domain-specific knowledge and evolving urban data to support long-term adaptability. The architecture of C-RAG-LLM aligns naturally with a multilevel optimization framework, where different layers are treated as interdependent sub-problems. Each layer has distinct objectives and can be optimized either independently or jointly through a hierarchical learning process. The framework is highly flexible, supporting both end-to-end training and partial layer-wise optimization based on resource or deployment constraints. To remain adaptive under data drift, it is further integrated with an incremental corpus updating mechanism. Evaluations on real-world urban tasks of varying complexity verify the effectiveness of the proposed framework. This work presents a promising step toward the realization of general-purpose LLM agents in future urban environments.
摘要：城市通用情报（UGI）是指AI系统在动态和复杂的城市环境中自主感知，理性和行动的能力。在本文中，我们介绍了UrbanMind，这是一种工具增强的检索生成一代（RAG）框架，旨在促进UGI。 UrbanMind的核心是一种基于连续检索的基于MOE的LLM（C-RAG-LLM）的新型体系结构，该架构动态地结合了特定于领域的知识和不断发展的城市数据以支持长期适应性。 C-rag-llm的架构自然与多级优化框架一致，其中不同的层被视为相互依存的子问题。每个层都有不同的目标，可以通过分层学习过程独立或共同优化。该框架非常灵活，基于资源或部署约束，支持端到端培训和部分层的优化。为了在数据漂移下保持自适应，它将与增量语料库更新机制进一步集成。对现实世界中各种复杂性的现实城市任务的评估验证了拟议框架的有效性。这项工作为在未来的城市环境中实现通用LLM代理的实现迈出了有希望的一步。

Title: Spooky Action at a Distance: Normalization Layers Enable Side-Channel Spatial Communication

Authors: Samuel Pfrommer, George Ma, Yixiao Huang, Somayeh Sojoudi
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.04709
Pdf URL: https://arxiv.org/pdf/2507.04709
Copy Paste: [[2507.04709]] Spooky Action at a Distance: Normalization Layers Enable Side-Channel Spatial Communication(https://arxiv.org/abs/2507.04709)
Keywords: generation
Abstract: This work shows that normalization layers can facilitate a surprising degree of communication across the spatial dimensions of an input tensor. We study a toy localization task with a convolutional architecture and show that normalization layers enable an iterative message passing procedure, allowing information aggregation from well outside the local receptive field. Our results suggest that normalization layers should be employed with caution in applications such as diffusion-based trajectory generation, where maintaining a spatially limited receptive field is crucial.
摘要：这项工作表明，归一化层可以促进在输入张量的空间维度上的令人惊讶的通信程度。我们使用卷积体系结构研究玩具本地化任务，并表明归一化层实现了迭代消息传递过程，从而可以从当地接收领域的井外进行信息聚集。我们的结果表明，应在诸如基于扩散的轨迹产生等应用中谨慎使用归一化层，在空间上保持有限的接受场至关重要。
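
Illustrative sketch: the effect described in the abstract above can be reproduced in a toy 1D setting. A kernel-size-3 convolution only sees its immediate neighbours, yet once its output is normalized with statistics computed over the whole spatial axis, a change at one end of the signal alters the value at the other end. The specific choice of normalization (mean and variance over all positions, as in instance-style or layer-style normalization) and all names below are assumptions for illustration; the abstract does not specify the layers or the task used in the paper.

```python
import math


def conv3(x, w=(0.25, 0.5, 0.25)):
    """1D convolution, kernel size 3, zero padding: output i sees x[i-1..i+1]."""
    p = [0.0] + list(x) + [0.0]
    return [w[0] * p[i] + w[1] * p[i + 1] + w[2] * p[i + 2]
            for i in range(len(x))]


def spatial_norm(x, eps=1e-5):
    """Normalize with mean/variance taken over the whole spatial axis."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


x = [0.0, 1.0, 2.0, 1.0, 0.0, 1.0, 2.0, 1.0]
x_pert = [5.0] + x[1:]          # perturb only the first position

local_same = conv3(x)[-1] == conv3(x_pert)[-1]
normed_same = spatial_norm(conv3(x))[-1] == spatial_norm(conv3(x_pert))[-1]

if __name__ == "__main__":
    print("last output unchanged after conv only:", local_same)
    print("last output unchanged after conv + norm:", normed_same)
```

The convolution alone leaves the last position untouched by the perturbation, while the normalized output at that position changes, because the shared mean and variance carry information across the full spatial extent.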

Title: GraphBrep: Learning B-Rep in Graph Structure for Efficient CAD Generation

Authors: Weilin Lai, Tie Xu, Hu Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.04765
Pdf URL: https://arxiv.org/pdf/2507.04765
Copy Paste: [[2507.04765]] GraphBrep: Learning B-Rep in Graph Structure for Efficient CAD Generation(https://arxiv.org/abs/2507.04765)
Keywords: generation
Abstract: Direct B-Rep generation is increasingly important in CAD workflows, eliminating costly modeling sequence data and supporting complex features. A key challenge is modeling the joint distribution of the misaligned geometry and topology. Existing methods tend to implicitly embed topology into the geometric features of edges. Although this integration ensures feature alignment, it also causes edge geometry to carry more redundant structural information compared to the original B-Rep, leading to significantly higher computational cost. To reduce redundancy, we propose GraphBrep, a B-Rep generation model that explicitly represents and learns compact topology. Following the original structure of B-Rep, we construct an undirected weighted graph to represent surface topology. A graph diffusion model is employed to learn topology conditioned on surface features, serving as the basis for determining connectivity between primitive surfaces. The explicit representation ensures a compact data structure, effectively reducing computational cost during both training and inference. Experiments on two large-scale unconditional datasets and one category-conditional dataset demonstrate that the proposed method significantly reduces training and inference times (up to 31.3% and 56.3% for given datasets, respectively) while maintaining high-quality CAD generation compared with SOTA.
摘要：直接B-REP生成在CAD工作流程中越来越重要，消除了昂贵的建模序列数据并支持复杂的功能。一个关键的挑战是对未对准的几何和拓扑的联合分布进行建模。现有方法倾向于将拓扑隐式嵌入边缘的几何特征中。尽管这种集成确保了特征对齐，但与原始B-REP相比，它也会导致边缘几何形状具有更多冗余的结构信息，从而导致计算成本明显更高。为了减少冗余，我们提出了GraphBrep，这是一个明确表示并学习紧凑拓扑的B-REP生成模型。遵循B-REP的原始结构，我们构造了一个无向加权图来表示表面拓扑。使用图形扩散模型来学习以表面特征为条件的拓扑，这是确定原始表面之间连通性的基础。明确表示可确保紧凑的数据结构，从而有效地降低了培训和推理期间的计算成本。在两个大规模无条件数据集和一个类别条件数据集上进行的实验证明了所提出的方法可显着减少培训和推理时间（分别为给定数据集的31.3％和56.3％），而与SOTA相比，同时保持高质量的CAD生成。
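
Illustrative sketch: an undirected weighted graph over surfaces, as mentioned in the abstract above, can be stored as a symmetric matrix whose entry (a, b) describes how faces a and b are connected. The example below builds such a matrix for a cube. Treating the weight as the number of shared edges between two faces is an assumption made for this example; the abstract does not say what the weights in GraphBrep encode, and the names `OPPOSITE` and `cube_face_graph` are hypothetical.

```python
from itertools import combinations

# Cube faces 0..5; each face has exactly one opposite face it never touches.
OPPOSITE = {0: 1, 1: 0, 2: 3, 3: 2, 4: 5, 5: 4}


def cube_face_graph():
    """Symmetric weight matrix W: W[a][b] = number of edges shared by faces a, b."""
    n = 6
    W = [[0] * n for _ in range(n)]
    for a, b in combinations(range(n), 2):
        if OPPOSITE[a] != b:          # adjacent cube faces share one edge
            W[a][b] = W[b][a] = 1
    return W


if __name__ == "__main__":
    for row in cube_face_graph():
        print(row)
```

The upper triangle of the matrix sums to 12, matching the 12 edges of a cube, so the topology is captured by a 6 x 6 table of small integers without storing any edge geometry.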

Title: From Vision To Language through Graph of Events in Space and Time: An Explainable Self-supervised Approach

Authors: Mihai Masala, Marius Leordeanu
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2507.04815
Pdf URL: https://arxiv.org/pdf/2507.04815
Copy Paste: [[2507.04815]] From Vision To Language through Graph of Events in Space and Time: An Explainable Self-supervised Approach(https://arxiv.org/abs/2507.04815)
Keywords: generation
Abstract: The task of describing video content in natural language is commonly referred to as video captioning. Unlike conventional video captions, which are typically brief and widely available, long-form paragraph descriptions in natural language are scarce. This limitation of current datasets is due to the expensive human manual annotation required and to the highly challenging task of explaining the language formation process from the perspective of the underlying story, as a complex system of interconnected events in space and time. Through a thorough analysis of recently published methods and available datasets, we identify a general lack of published resources dedicated to the problem of describing videos in complex language, beyond the level of descriptions in the form of enumerations of simple captions. Furthermore, while state-of-the-art methods produce impressive results on the task of generating shorter captions from videos by direct end-to-end learning between the videos and text, the problem of explaining the relationship between vision and language is still beyond our reach. In this work, we propose a shared representation between vision and language, based on graphs of events in space and time, which can be obtained in an explainable and analytical way, to integrate and connect multiple vision tasks to produce the final natural language description. Moreover, we also demonstrate how our automated and explainable video description generation process can function as a fully automatic teacher to effectively train direct, end-to-end neural student pathways, within a self-supervised neuro-analytical system. We validate that our explainable neuro-analytical approach generates coherent, rich and relevant textual descriptions on videos collected from multiple varied datasets, using standard evaluation metrics, human annotations, and consensus from ensembles of state-of-the-art VLMs.
摘要：用自然语言描述视频内容的任务通常称为视频字幕。与传统的视频标题不同，通常是简短且广泛可用的，自然语言中的长期段落描述很少。当前数据集的这种局限性是由于所需的昂贵人工注释以及从基础故事的角度来解释语言形成过程的高度挑战性的任务，这是一个复杂的时空中相互联系的系统。通过对最近发布的方法和可用数据集的透彻分析，我们总体上缺乏专门针对复杂语言描述视频的问题，超出了简单字幕的列举形式的描述级别。此外，尽管最新的方法对通过视频和文本之间的直接端到端学习从视频产生较短字幕的任务产生了令人印象深刻的结果，但解释视觉与语言之间关系的问题仍然超出了我们的影响力。在这项工作中，我们基于时空事件图表，提出了视觉和语言之间的共同表示，可以以可解释和分析的方式获得，以整合和连接多个视觉任务以产生最终的自然语言描述。此外，我们还展示了我们的自动化和可解释的视频描述生成过程如何充当自动自动的老师，以在自我监督的神经分析系统中有效地训练直接的，端到端的神经学生路径。我们验证我们的可解释的神经分析方法使用标准评估指标，人类注释和共识，从最先进的VLMS的团结中，对从多个不同数据集收集的视频产生连贯，丰富和相关的文本描述。

Title: Semantically Consistent Discrete Diffusion for 3D Biological Graph Modeling

Authors: Chinmay Prabhakar, Suprosanna Shit, Tamaz Amiranashvili, Hongwei Bran Li, Bjoern Menze
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.04856
Pdf URL: https://arxiv.org/pdf/2507.04856
Copy Paste: [[2507.04856]] Semantically Consistent Discrete Diffusion for 3D Biological Graph Modeling(https://arxiv.org/abs/2507.04856)
Keywords: generation, generative
Abstract: 3D spatial graphs play a crucial role in biological and clinical research by modeling anatomical networks such as blood vessels, neurons, and airways. However, generating 3D biological graphs while maintaining anatomical validity remains challenging, a key limitation of existing diffusion-based methods. In this work, we propose a novel 3D biological graph generation method that adheres to structural and semantic plausibility conditions. We achieve this by using a novel projection operator during sampling that stochastically fixes inconsistencies. Further, we adopt a superior edge-deletion-based noising procedure suitable for sparse biological graphs. Our method demonstrates superior performance on two real-world datasets, human circle of Willis and lung airways, compared to previous approaches. Importantly, we demonstrate that the generated samples significantly enhance downstream graph labeling performance. Furthermore, we show that our generative model is a reasonable out-of-the-box link predictor.
摘要：3D空间图通过对血管，神经元和气道等解剖网络进行建模，在生物学和临床研究中起着至关重要的作用。但是，在保持解剖学有效性的同时生成3D生物图仍然具有挑战性，这是现有基于扩散方法的关键限制。在这项工作中，我们提出了一种新型的3D生物图生成方法，该方法遵循结构和语义的合理性条件。我们在抽样过程中使用新颖的投影算子来固定不一致之处来实现这一目标。此外，我们采用了适用于稀疏生物图的基于边缘缺失的尖型程序。我们的方法表明，与以前的方法相比，在两个现实世界数据集Willis和肺气道的人类圈子和肺气道上表现出了出色的性能。重要的是，我们证明生成的样品显着增强了下游图标记性能。此外，我们表明我们的生成模型是合理的开箱即用链接预测。
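
Illustrative sketch: one simple way to picture an edge-deletion-based noising procedure, as named in the abstract above, is a forward process in which every surviving edge is independently dropped with probability `beta` at each step, so an edge survives `t` steps with probability `(1 - beta) ** t` and no new edges are ever added. This independent-deletion kernel, the value of `beta`, and the function name are assumptions for illustration; the abstract does not give the paper's actual transition rule.

```python
import random


def noise_edges(edges, t, beta=0.1, rng=random):
    """Return the edges that survive t steps of independent deletion.

    Each edge is kept with probability (1 - beta) ** t. Edges are only
    removed, never inserted, so the noised graph stays sparse.
    """
    keep_prob = (1.0 - beta) ** t
    return {e for e in edges if rng.random() < keep_prob}


if __name__ == "__main__":
    # Toy vessel-like tree given as (node, node) pairs (hypothetical data).
    graph = {(0, 1), (1, 2), (1, 3), (3, 4), (3, 5)}
    for t in (0, 5, 20):
        print(t, sorted(noise_edges(graph, t)))
```

Because the process only removes edges, its terminal state is the empty graph, which is a natural prior for sparse structures such as vessel or airway trees.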

Title: HV-MMBench: Benchmarking MLLMs for Human-Centric Video Understanding

Authors: Yuxuan Cai, Jiangning Zhang, Zhenye Gan, Qingdong He, Xiaobin Hu, Junwei Zhu, Yabiao Wang, Chengjie Wang, Zhucun Xue, Xinwei He, Xiang Bai
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2507.04909
Pdf URL: https://arxiv.org/pdf/2507.04909
Copy Paste: [[2507.04909]] HV-MMBench: Benchmarking MLLMs for Human-Centric Video Understanding(https://arxiv.org/abs/2507.04909)
Keywords: generation
Abstract: Multimodal Large Language Models (MLLMs) have demonstrated significant advances in visual understanding tasks involving both images and videos. However, their capacity to comprehend human-centric video data remains underexplored, primarily due to the absence of comprehensive and high-quality evaluation benchmarks. Existing human-centric benchmarks predominantly emphasize video generation quality and action recognition, while overlooking essential perceptual and cognitive abilities required in human-centered scenarios. Furthermore, they are often limited by single-question paradigms and overly simplistic evaluation metrics. To address the above limitations, we propose HV-MMBench, a rigorously curated benchmark designed to provide a more holistic evaluation of MLLMs in human-centric video understanding. Compared to existing human-centric video benchmarks, our work offers the following key features: (1) Diverse evaluation dimensions: HV-MMBench encompasses 15 tasks, ranging from basic attribute perception (e.g., age estimation, emotion recognition) to advanced cognitive reasoning (e.g., social relationship prediction, intention prediction), enabling comprehensive assessment of model capabilities; (2) Varied data types: The benchmark includes multiple-choice, fill-in-blank, true/false, and open-ended question formats, combined with diverse evaluation metrics, to more accurately and robustly reflect model performance; (3) Multi-domain video coverage: The benchmark spans 50 distinct visual scenarios, enabling comprehensive evaluation across fine-grained scene variations; (4) Temporal coverage: The benchmark covers videos from short-term (10 seconds) to long-term (up to 30 minutes) durations, supporting systematic analysis of models' temporal reasoning abilities across diverse contextual lengths.
摘要：多模式的大语言模型（MLLM）在涉及图像和视频的视觉理解任务方面已显示出重大进展。但是，他们理解以人为中心的视频数据的能力仍然没有被忽视，这主要是由于缺乏全面和高质量的评估基准。现有以人为中心的基准主要强调视频的质量和动作识别，同时忽略了以人为本的场景中所需的基本知觉和认知能力。此外，它们通常受到单个问题范式的限制和过于简单的评估指标。为了解决上述局限性，我们提出了一种现代的HV-MMBench，这是一种严格的精心策划的基准测试，旨在在以人为中心的视频理解中对MLLM进行更全面的评估。与现有的以人为中心的视频基准相比，我们的工作提供了以下关键特征：（1）多样化的评估维度：HV-MMBench包含15个任务，从基本属性感知（例如，年龄估计，情感识别）到高级认知推理（例如，社会关系预测），ENABLE INTEM INBLISIOMS INBLISIOMS GOLLECTINIC GOLLACTICTINIC; （2）多样化的数据类型：基准包括多项选择，填充，真/错误和开放式问题格式，并结合了不同的评估指标，以更准确，稳健地反映模型性能；（3）多域视频覆盖范围：基准测试跨越50个不同的视觉场景，从而可以跨细性场景变化进行全面评估；（4）时间覆盖范围：基准涵盖了从短期（10秒）到长期（长达30分钟）持续时间的视频，支持对各种上下文长度跨模型的时间推理能力的系统分析。

Title: RainShift: A Benchmark for Precipitation Downscaling Across Geographies

Authors: Paula Harder, Luca Schmidt, Francis Pelletier, Nicole Ludwig, Matthew Chantry, Christian Lessig, Alex Hernandez-Garcia, David Rolnick
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.04930
Pdf URL: https://arxiv.org/pdf/2507.04930
Copy Paste: [[2507.04930]] RainShift: A Benchmark for Precipitation Downscaling Across Geographies(https://arxiv.org/abs/2507.04930)
Keywords: super-resolution
Abstract: Earth System Models (ESM) are our main tool for projecting the impacts of climate change. However, running these models at sufficient resolution for local-scale risk assessments is not computationally feasible. Deep learning-based super-resolution models offer a promising solution to downscale ESM outputs to higher resolutions by learning from data. Yet, due to regional variations in climatic processes, these models typically require retraining for each geographical area, demanding high-resolution observational data, which is unevenly available across the globe. This highlights the need to assess how well these models generalize across geographic regions. To address this, we introduce RainShift, a dataset and benchmark for evaluating downscaling under geographic distribution shifts. We evaluate state-of-the-art downscaling approaches including GANs and diffusion models in generalizing across data gaps between the Global North and Global South. Our findings reveal substantial performance drops in out-of-distribution regions, depending on model and geographic area. While expanding the training domain generally improves generalization, it is insufficient to overcome shifts between geographically distinct regions. We show that addressing these shifts through, for example, data alignment can improve spatial generalization. Our work advances the global applicability of downscaling methods and represents a step toward reducing inequities in access to high-resolution climate information.
摘要：地球系统模型（ESM）是我们投影气候变化影响的主要工具。但是，以足够的分辨率运行这些模型在本地规模的风险评估上是不可行的。基于深度学习的超分辨率模型通过从数据中学习，为降级ESM输出提供了一个有希望的解决方案。然而，由于气候过程中的区域变化，这些模型通常需要为每个地理区域的高分辨率观测数据进行重新培训，这在全球范围内都不均匀。这凸显了需要评估这些模型在地理区域中概括的程度。为了解决这个问题，我们介绍了RainShift，这是一个数据集和基准测试，用于评估地理分布变化下的缩减。我们评估了最先进的缩减方法，包括gan和扩散模型，以跨全球北部和全球南方之间的数据差距概括。我们的发现表明，根据模型和地理区域，分布区域的性能下降。在扩展训练领域通常会改善概括，但它不足以克服地理上不同区域之间的转变。我们表明，解决这些转变，例如，数据对准可以改善空间概括。我们的工作提高了缩减方法的全球适用性，并代表了减少访问高分辨率气候信息的不平等现象的一步。

Title: Taming the Tri-Space Tension: ARC-Guided Hallucination Modeling and Control for Text-to-Image Generation

Authors: Jianjiang Yang, Ziyan Huang
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2507.04946
Pdf URL: https://arxiv.org/pdf/2507.04946
Copy Paste: [[2507.04946]] Taming the Tri-Space Tension: ARC-Guided Hallucination Modeling and Control for Text-to-Image Generation(https://arxiv.org/abs/2507.04946)
Keywords: generation, generative
Abstract: Despite remarkable progress in image quality and prompt fidelity, text-to-image (T2I) diffusion models continue to exhibit persistent "hallucinations", where generated content subtly or significantly diverges from the intended prompt semantics. While often regarded as unpredictable artifacts, we argue that these failures reflect deeper, structured misalignments within the generative process. In this work, we propose a cognitively inspired perspective that reinterprets hallucinations as trajectory drift within a latent alignment space. Empirical observations reveal that generation unfolds within a multiaxial cognitive tension field, where the model must continuously negotiate competing demands across three key axes: semantic coherence, structural alignment, and knowledge grounding. We then formalize this three-axis space as the Hallucination Tri-Space and introduce the Alignment Risk Code (ARC): a dynamic vector representation that quantifies real-time alignment tension during generation. The magnitude of ARC captures overall misalignment, its direction identifies the dominant failure axis, and its imbalance reflects tension asymmetry. Based on this formulation, we develop the TensionModulator (TM-ARC): a lightweight controller that operates entirely in latent space. TM-ARC monitors ARC signals and applies targeted, axis-specific interventions during the sampling process. Extensive experiments on standard T2I benchmarks demonstrate that our approach significantly reduces hallucination without compromising image quality or diversity. This framework offers a unified and interpretable approach for understanding and mitigating generative failures in diffusion-based T2I systems.
摘要：尽管图像质量和及时的保真度取得了显着进展，但文本对图像（T2I）扩散模型继续表现出持久的“幻觉”，在这些“幻觉”中产生的内容微妙或与预期的及时语义有显着分歧。尽管经常被认为是不可预测的伪影，但我们认为这些故障反映了生成过程中更深，结构化的未对准。在这项工作中，我们提出了一种具有认知启发的观点，将幻觉重新解释为潜在对齐空间内的轨迹漂移。经验观察结果表明，在多轴认知张力领域中，一代人的发展必须在三个关键的临界轴上不断协商竞争需求：语义连贯性，结构一致性和知识接地。然后，我们将此三轴空间形式化为\ textbf {幻觉三个空间}，并介绍对齐风险代码（ARC）：一种动态矢量表示，该表示可以量化生成期间的实时对齐张力。 ARC的大小捕获了总体错位，其方向识别出主要的故障轴，其失衡反映了张力不对称。基于此公式，我们开发了张力调节器（TM-ARC）：完全在潜在空间中运行的轻型控制器。 TM-ARC监视ARC信号，并在采样过程中采用针对性的轴特异性干预措施。对标准T2I基准测试的广泛实验表明，我们的方法显着降低了幻觉，而不会损害图像质量或多样性。该框架提供了一种统一且可解释的方法，用于理解和减轻基于扩散的T2I系统中的生成失败。
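
Illustrative sketch: the abstract above reads three quantities off the ARC vector, namely its magnitude, its direction, and its imbalance. A minimal way to compute such summaries from a three-component vector is shown below. The concrete formulas (Euclidean norm for magnitude, largest component for the dominant axis, max minus min for imbalance), the axis labels, and the function name are assumptions for illustration; the abstract does not define how ARC is computed or summarized.

```python
import math

# Axis order is an assumption: semantic coherence, structural alignment,
# knowledge grounding, as listed in the abstract.
AXES = ("semantic", "structural", "knowledge")


def summarize_arc(arc):
    """Return (magnitude, dominant_axis, imbalance) for a 3-component vector."""
    magnitude = math.sqrt(sum(v * v for v in arc))          # overall tension
    dominant = AXES[max(range(len(AXES)), key=lambda i: arc[i])]
    imbalance = max(arc) - min(arc)                         # spread across axes
    return magnitude, dominant, imbalance


if __name__ == "__main__":
    print(summarize_arc((0.0, 3.0, 4.0)))
```

A controller in the spirit of TM-ARC could then apply an intervention only on the axis reported as dominant, though how the paper does this is not described in the abstract.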

Title: DC-AR: Efficient Masked Autoregressive Image Generation with Deep Compression Hybrid Tokenizer

Authors: Yecheng Wu, Junyu Chen, Zhuoyang Zhang, Enze Xie, Jincheng Yu, Junsong Chen, Jinyi Hu, Yao Lu, Song Han, Han Cai
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2507.04947
Pdf URL: https://arxiv.org/pdf/2507.04947
Copy Paste: [[2507.04947]] DC-AR: Efficient Masked Autoregressive Image Generation with Deep Compression Hybrid Tokenizer(https://arxiv.org/abs/2507.04947)
Keywords: generation
Abstract: We introduce DC-AR, a novel masked autoregressive (AR) text-to-image generation framework that delivers superior image generation quality with exceptional computational efficiency. Due to the tokenizers' limitations, prior masked AR models have lagged behind diffusion models in terms of quality or efficiency. We overcome this limitation by introducing DC-HT - a deep compression hybrid tokenizer for AR models that achieves a 32x spatial compression ratio while maintaining high reconstruction fidelity and cross-resolution generalization ability. Building upon DC-HT, we extend MaskGIT and create a new hybrid masked autoregressive image generation framework that first produces the structural elements through discrete tokens and then applies refinements via residual tokens. DC-AR achieves state-of-the-art results with a gFID of 5.49 on MJHQ-30K and an overall score of 0.69 on GenEval, while offering 1.5-7.9x higher throughput and 2.0-3.5x lower latency compared to prior leading diffusion and autoregressive models.
摘要：我们介绍了DC-AR，这是一种新颖的掩盖自回归（AR）文本到图像生成框架，可提供出色的图像生成质量，具有出色的计算效率。由于标记者的局限性，就质量或效率而言，先前的蒙版AR模型落后于扩散模型。我们通过引入DC-HT来克服这一局限性 - AR模型的深度压缩杂交令牌剂，该模型可实现32X空间压缩比，同时保持高重建忠诚度和跨分辨率概括能力。在DC-HT的基础上，我们扩展了MaskGit，并创建了一个新的混合掩盖自回归图像生成框架，该框架首先通过离散令牌产生结构元素，然后通过残留令牌应用细化。 DC-AR在MJHQ-30K上的GFID为5.49，在Geneval上的GFID为5.49，而GFID的总分为0.69，同时提供了1.5-7.9倍的吞吐量，而与先前的前导扩散和自动性型号相比，吞吐量提高了1.5-7.9倍，延迟较低2.0-3.5倍。
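
Illustrative sketch: a 32x spatial compression ratio, as stated in the abstract above, means each side of the image is divided by 32 before tokens are produced, so the token count falls with the square of the ratio. The image sizes and the 16x comparison point below are assumptions chosen for the arithmetic, not figures from the paper.

```python
def token_count(height, width, ratio):
    """Number of spatial tokens when each side is downsampled by `ratio`."""
    if height % ratio or width % ratio:
        raise ValueError("image sides must be divisible by the ratio")
    return (height // ratio) * (width // ratio)


if __name__ == "__main__":
    for side in (512, 1024):
        print(side, token_count(side, side, 16), token_count(side, side, 32))
```

For a 512 x 512 image this gives 16 x 16 = 256 tokens at 32x, a quarter of the 1024 tokens a 16x tokenizer would produce, which is one reason a higher compression ratio can raise throughput in a masked autoregressive model.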

Title: Hear-Your-Click: Interactive Video-to-Audio Generation via Object-aware Contrastive Audio-Visual Fine-tuning

Authors: Yingshan Liang, Keyu Fan, Zhicheng Du, Yiran Wang, Qingyang Shi, Xinyu Zhang, Jiasheng Lu, Peiwu Qin
Subjects: cs.CV, cs.AI, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2507.04959
Pdf URL: https://arxiv.org/pdf/2507.04959
Copy Paste: [[2507.04959]] Hear-Your-Click: Interactive Video-to-Audio Generation via Object-aware Contrastive Audio-Visual Fine-tuning(https://arxiv.org/abs/2507.04959)
Keywords: generation
Abstract: Video-to-audio (V2A) generation shows great potential in fields such as film production. Despite significant advances, current V2A methods, which rely on global video information, struggle with complex scenes and often fail to generate audio tailored to specific objects or regions in the videos. To address these limitations, we introduce Hear-Your-Click, an interactive V2A framework that enables users to generate sounds for specific objects in the videos by simply clicking on the frame. To achieve this, we propose Object-aware Contrastive Audio-Visual Fine-tuning (OCAV) with a Mask-guided Visual Encoder (MVE) to obtain object-level visual features aligned with corresponding audio segments. Furthermore, we tailor two data augmentation strategies, Random Video Stitching (RVS) and Mask-guided Loudness Modulation (MLM), aimed at enhancing the model's sensitivity to the segmented objects. To effectively measure audio-visual correspondence, we design a new evaluation metric, the CAV score. Extensive experiments demonstrate that our framework offers more precise control and improved generation performance across various metrics. Project Page: this https URL
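
The abstract does not spell out the contrastive objective behind OCAV. As a rough illustration of what aligning object-level visual features with audio segments can look like, the sketch below computes a symmetric InfoNCE-style loss over paired embeddings. The function names, the cosine similarity, and the temperature of 0.07 are assumptions made for this example, not details taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def _cross_entropy_on_diagonal(sim):
    """Mean of -log softmax(row)[i] over rows i (the matching pair is index i)."""
    total = 0.0
    for i, row in enumerate(sim):
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(sim)

def symmetric_info_nce(visual, audio, temperature=0.07):
    """Hypothetical contrastive loss: visual[i] and audio[i] form a positive pair.

    All other pairings in the batch act as negatives. The loss is averaged over
    the visual-to-audio and audio-to-visual directions.
    """
    sim = [[cosine(v, a) / temperature for a in audio] for v in visual]
    sim_t = [list(col) for col in zip(*sim)]
    return 0.5 * (_cross_entropy_on_diagonal(sim) + _cross_entropy_on_diagonal(sim_t))
```

With correctly paired embeddings the loss is close to zero, and it grows when the pairs are shuffled, which is the behaviour a contrastive fine-tuning stage relies on.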

Title: Estimating Object Physical Properties from RGB-D Vision and Depth Robot Sensors Using Deep Learning

Authors: Ricardo Cardoso, Plinio Moreno
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.05029
Pdf URL: https://arxiv.org/pdf/2507.05029
Copy Paste: [[2507.05029]] Estimating Object Physical Properties from RGB-D Vision and Depth Robot Sensors Using Deep Learning(https://arxiv.org/abs/2507.05029)
Keywords: generation
Abstract: Inertial mass plays a crucial role in robotic applications such as object grasping, manipulation, and simulation, providing a strong prior for planning and control. Accurately estimating an object's mass before interaction can significantly enhance the performance of various robotic tasks. However, mass estimation using only vision sensors is a relatively underexplored area. This paper proposes a novel approach combining sparse point-cloud data from depth images with RGB images to estimate the mass of objects. We evaluate a range of point-cloud processing architectures, alongside RGB-only methods. To overcome the limited availability of training data, we create a synthetic dataset using ShapeNetSem 3D models, simulating RGBD images via a Kinect camera. This synthetic data is used to train an image generation model for estimating dense depth maps, which we then use to augment an existing dataset of images paired with mass values. Our approach significantly outperforms existing benchmarks across all evaluated metrics. The data generation (this https URL) as well as the training of the depth estimator (this https URL) and the mass estimator (this https URL) are available online.

Title: ICAS: Detecting Training Data from Autoregressive Image Generative Models

Authors: Hongyao Yu, Yixiang Qiu, Yiheng Yang, Hao Fang, Tianqu Zhuang, Jiaxin Hong, Bin Chen, Hao Wu, Shu-Tao Xia
Subjects: cs.CV, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2507.05068
Pdf URL: https://arxiv.org/pdf/2507.05068
Copy Paste: [[2507.05068]] ICAS: Detecting Training Data from Autoregressive Image Generative Models(https://arxiv.org/abs/2507.05068)
Keywords: generation, generative
Abstract: Autoregressive image generation has witnessed rapid advancements, with prominent models such as scale-wise visual auto-regression pushing the boundaries of visual synthesis. However, these developments also raise significant concerns regarding data privacy and copyright. In response, training data detection has emerged as a critical task for identifying unauthorized data usage in model training. To better understand the vulnerability of autoregressive image generative models to such detection, we conduct the first study applying membership inference to this domain. Our approach comprises two key components: implicit classification and an adaptive score aggregation strategy. First, we compute the implicit token-wise classification score within the query image. Then we propose an adaptive score aggregation strategy to acquire a final score, which places greater emphasis on the tokens with lower scores. A higher final score indicates that the sample is more likely to be involved in the training set. To validate the effectiveness of our method, we adapt existing detection algorithms originally designed for LLMs to visual autoregressive models. Extensive experiments demonstrate the superiority of our method in both class-conditional and text-to-image scenarios. Moreover, our approach exhibits strong robustness and generalization under various data transformations. Furthermore, sufficient experiments suggest two novel key findings: (1) A linear scaling law on membership inference, exposing the vulnerability of large foundation models. (2) Training data from scale-wise visual autoregressive models is easier to detect than that from other autoregressive paradigms. Code is available at this https URL.
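
The abstract says only that the aggregation places greater emphasis on tokens with lower scores. It does not give the formula. One simple way to get that behaviour, shown here purely as an assumed illustration, is to weight each token score by a softmax over the negated scores. The function name and the `temperature` parameter are hypothetical.

```python
import math

def adaptive_aggregate(token_scores, temperature=1.0):
    """Hypothetical aggregation: low-scoring tokens receive larger weights.

    Weights are proportional to exp(-(score - min_score) / temperature), so the
    result lies between the minimum score and the plain mean. As in the
    abstract, a higher final score would suggest a training-set member.
    """
    lowest = min(token_scores)
    weights = [math.exp(-(s - lowest) / temperature) for s in token_scores]
    total = sum(weights)
    return sum(w * s for w, s in zip(weights, token_scores)) / total
```

Lowering `temperature` pushes the result toward the minimum token score, while a very large `temperature` recovers the ordinary mean.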

Title: MoDiT: Learning Highly Consistent 3D Motion Coefficients with Diffusion Transformer for Talking Head Generation

Authors: Yucheng Wang, Dan Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.05092
Pdf URL: https://arxiv.org/pdf/2507.05092
Copy Paste: [[2507.05092]] MoDiT: Learning Highly Consistent 3D Motion Coefficients with Diffusion Transformer for Talking Head Generation(https://arxiv.org/abs/2507.05092)
Keywords: generation
Abstract: Audio-driven talking head generation is critical for applications such as virtual assistants, video games, and films, where natural lip movements are essential. Despite progress in this field, challenges remain in producing both consistent and realistic facial animations. Existing methods, often based on GANs or UNet-based diffusion models, face three major limitations: (i) temporal jittering caused by weak temporal constraints, resulting in frame inconsistencies; (ii) identity drift due to insufficient 3D information extraction, leading to poor preservation of facial identity; and (iii) unnatural blinking behavior due to inadequate modeling of realistic blink dynamics. To address these issues, we propose MoDiT, a novel framework that combines the 3D Morphable Model (3DMM) with a Diffusion-based Transformer. Our contributions include: (i) A hierarchical denoising strategy with revised temporal attention and biased self/cross-attention mechanisms, enabling the model to refine lip synchronization and progressively enhance full-face coherence, effectively mitigating temporal jittering. (ii) The integration of 3DMM coefficients to provide explicit spatial constraints, ensuring accurate 3D-informed optical flow prediction and improved lip synchronization using Wav2Lip results, thereby preserving identity consistency. (iii) A refined blinking strategy to model natural eye movements, with smoother and more realistic blinking behaviors.

Title: DICE: Discrete inverse continuity equation for learning population dynamics

Authors: Tobias Blickhan, Jules Berman, Andrew Stuart, Benjamin Peherstorfer
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2507.05107
Pdf URL: https://arxiv.org/pdf/2507.05107
Copy Paste: [[2507.05107]] DICE: Discrete inverse continuity equation for learning population dynamics(https://arxiv.org/abs/2507.05107)
Keywords: generative
Abstract: We introduce the Discrete Inverse Continuity Equation (DICE) method, a generative modeling approach that learns the evolution of a stochastic process from given sample populations at a finite number of time points. Models learned with DICE capture the typically smooth and well-behaved population dynamics, rather than the dynamics of individual sample trajectories that can exhibit complex or even chaotic behavior. The DICE loss function is developed specifically to be invariant, even in discrete time, to spatially constant but time-varying spurious constants that can emerge during training; this invariance increases training stability and robustness. Generating a trajectory of sample populations with DICE is fast because samples evolve directly in the time interval over which the stochastic process is formulated, in contrast to approaches that condition on time and then require multiple sampling steps per time step. DICE is stable to train, in situations where other methods for learning population dynamics fail, and DICE generates representative samples with orders of magnitude lower costs than methods that have to condition on time. Numerical experiments on a wide range of problems from random waves, Vlasov-Poisson instabilities and high-dimensional chaos are included to justify these assertions.

Title: Reviving Cultural Heritage: A Novel Approach for Comprehensive Historical Document Restoration

Authors: Yuyi Zhang, Peirong Zhang, Zhenhua Yang, Pengyu Yan, Yongxin Shi, Pengwei Liu, Fengjun Guo, Lianwen Jin
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2507.05108
Pdf URL: https://arxiv.org/pdf/2507.05108
Copy Paste: [[2507.05108]] Reviving Cultural Heritage: A Novel Approach for Comprehensive Historical Document Restoration(https://arxiv.org/abs/2507.05108)
Keywords: restoration
Abstract: Historical documents represent an invaluable cultural heritage, yet have undergone significant degradation over time through tears, water erosion, and oxidation. Existing Historical Document Restoration (HDR) methods primarily focus on single modality or limited-size restoration, failing to meet practical needs. To fill this gap, we present a full-page HDR dataset (FPHDR) and a novel automated HDR solution (AutoHDR). Specifically, FPHDR comprises 1,633 real and 6,543 synthetic images with character-level and line-level locations, as well as character annotations in different damage grades. AutoHDR mimics historians' restoration workflows through a three-stage approach: OCR-assisted damage localization, vision-language context text prediction, and patch autoregressive appearance restoration. The modular architecture of AutoHDR enables seamless human-machine collaboration, allowing for flexible intervention and optimization at each restoration stage. Experiments demonstrate AutoHDR's remarkable performance in HDR. When processing severely damaged documents, our method improves OCR accuracy from 46.83% to 84.05%, with further enhancement to 94.25% through human-machine collaboration. We believe this work represents a significant advancement in automated historical document restoration and contributes substantially to cultural heritage preservation. The model and dataset are available at this https URL.

Title: VERITAS: Verification and Explanation of Realness in Images for Transparency in AI Systems

Authors: Aadi Srivastava, Vignesh Natarajkumar, Utkarsh Bheemanaboyna, Devisree Akashapu, Nagraj Gaonkar, Archit Joshi
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2507.05146
Pdf URL: https://arxiv.org/pdf/2507.05146
Copy Paste: [[2507.05146]] VERITAS: Verification and Explanation of Realness in Images for Transparency in AI Systems(https://arxiv.org/abs/2507.05146)
Keywords: generation, generative
Abstract: The widespread and rapid adoption of AI-generated content, created by models such as Generative Adversarial Networks (GANs) and Diffusion Models, has revolutionized the digital media landscape by allowing efficient and creative content generation. However, these models also blur the difference between real images and AI-generated synthetic images, raising concerns regarding content authenticity and integrity. While many existing solutions to detect fake images focus solely on classification and higher-resolution images, they often lack transparency in their decision-making, making it difficult for users to understand why an image is classified as fake. In this paper, we present VERITAS, a comprehensive framework that not only accurately detects whether a small (32x32) image is AI-generated but also explains why it was classified that way through artifact localization and semantic reasoning. VERITAS produces human-readable explanations that describe key artifacts in synthetic images. We show that this architecture offers clear explanations of the basis of zero-shot synthetic image detection tasks. Code and relevant prompts can be found at this https URL.

Title: 4DSloMo: 4D Reconstruction for High Speed Scene with Asynchronous Capture

Authors: Yutian Chen, Shi Guo, Tianshuo Yang, Lihe Ding, Xiuyuan Yu, Jinwei Gu, Tianfan Xue
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.05163
Pdf URL: https://arxiv.org/pdf/2507.05163
Copy Paste: [[2507.05163]] 4DSloMo: 4D Reconstruction for High Speed Scene with Asynchronous Capture(https://arxiv.org/abs/2507.05163)
Keywords: generative
Abstract: Reconstructing fast-dynamic scenes from multi-view videos is crucial for high-speed motion analysis and realistic 4D reconstruction. However, the majority of 4D capture systems are limited to frame rates below 30 FPS (frames per second), and a direct 4D reconstruction of high-speed motion from low FPS input may lead to undesirable results. In this work, we propose a high-speed 4D capturing system using only low FPS cameras, through novel capturing and processing modules. On the capturing side, we propose an asynchronous capture scheme that increases the effective frame rate by staggering the start times of cameras. By grouping cameras and leveraging a base frame rate of 25 FPS, our method achieves an equivalent frame rate of 100-200 FPS without requiring specialized high-speed cameras. On the processing side, we also propose a novel generative model to fix artifacts caused by 4D sparse-view reconstruction, as asynchrony reduces the number of viewpoints at each timestamp. Specifically, we propose to train a video-diffusion-based artifact-fix model for sparse 4D reconstruction, which refines missing details, maintains temporal consistency, and improves overall reconstruction quality. Experimental results demonstrate that our method significantly enhances high-speed 4D reconstruction compared to synchronous capture.
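
The arithmetic behind the asynchronous capture scheme can be sketched as follows. If cameras are split into groups and each group starts a fraction of a frame period later than the previous one, the combined stream samples time more densely than any single camera. The group counts of 4 and 8 are assumptions chosen because they reproduce the 100-200 FPS range quoted in the abstract from a 25 FPS base. The function names are illustrative.

```python
def effective_fps(base_fps, num_groups):
    """Equivalent frame rate when num_groups camera groups are evenly staggered."""
    return base_fps * num_groups

def staggered_timestamps(base_fps, num_groups, frames_per_camera):
    """Return (time_in_seconds, group_index) pairs in capture order.

    Each group runs at base_fps, and group g starts g / (base_fps * num_groups)
    seconds after group 0.
    """
    period = 1.0 / base_fps
    offset = period / num_groups
    schedule = []
    for frame in range(frames_per_camera):
        for group in range(num_groups):
            schedule.append((frame * period + group * offset, group))
    return schedule

print(effective_fps(25, 4))  # 100
print(effective_fps(25, 8))  # 200
```

The trade-off noted in the abstract follows directly from this layout: at any single timestamp only one group of cameras has a frame, so each reconstruction sees fewer viewpoints.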

Title: Critiques of World Models

Authors: Eric Xing, Mingkai Deng, Jinyu Hou, Zhiting Hu
Subjects: cs.LG, cs.AI, cs.CL, cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2507.05169
Pdf URL: https://arxiv.org/pdf/2507.05169
Copy Paste: [[2507.05169]] Critiques of World Models(https://arxiv.org/abs/2507.05169)
Keywords: generative
Abstract: The world model, the supposed algorithmic surrogate of the real-world environment that biological agents experience and act upon, has been an emerging topic in recent years because of the rising need to develop virtual agents with artificial (general) intelligence. There has been much debate on what a world model really is, how to build it, how to use it, and how to evaluate it. In this essay, starting from the imagination in the famed sci-fi classic Dune, and drawing inspiration from the concept of "hypothetical thinking" in the psychology literature, we offer critiques of several schools of thought on world modeling, and argue that the primary goal of a world model is to simulate all actionable possibilities of the real world for purposeful reasoning and acting. Building on the critiques, we propose a new architecture for a general-purpose world model, based on hierarchical, multi-level, and mixed continuous/discrete representations, and a generative and self-supervised learning framework, with an outlook of a Physical, Agentic, and Nested (PAN) AGI system enabled by such a model.

Title: Semantic Frame Interpolation

Authors: Yijia Hong, Jiangning Zhang, Ran Yi, Yuji Wang, Weijian Cao, Xiaobin Hu, Zhucun Xue, Yabiao Wang, Chengjie Wang, Lizhuang Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.05173
Pdf URL: https://arxiv.org/pdf/2507.05173
Copy Paste: [[2507.05173]] Semantic Frame Interpolation(https://arxiv.org/abs/2507.05173)
Keywords: generation
Abstract: Generating intermediate video content of varying lengths based on given first and last frames, along with text prompt information, offers significant research and application potential. However, traditional frame interpolation tasks primarily focus on scenarios with a small number of frames, no text control, and minimal differences between the first and last frames. Recent community developers have utilized large video models represented by Wan to endow frame-to-frame capabilities. However, these models can only generate a fixed number of frames and often fail to produce satisfactory results for certain frame lengths, while this setting lacks a clear official definition and a well-established benchmark. In this paper, we first propose a new practical Semantic Frame Interpolation (SFI) task from the perspective of academic definition, which covers the above two settings and supports inference at multiple frame rates. To achieve this goal, we propose a novel SemFi model building upon Wan2.1, which incorporates a Mixture-of-LoRA module to ensure the generation of high-consistency content that aligns with control conditions across various frame length limitations. Furthermore, we propose SFI-300K, the first general-purpose dataset and benchmark specifically designed for SFI. To support this, we collect and process data from the perspective of SFI, carefully designing evaluation metrics and methods to assess the model's performance across multiple dimensions, encompassing image and video, and various aspects, including consistency and diversity. Through extensive experiments on SFI-300K, we demonstrate that our method is particularly well-suited to meet the requirements of the SFI task.

Title: $φ$-Adapt: A Physics-Informed Adaptation Learning Approach to 2D Quantum Material Discovery

Authors: Hoang-Quan Nguyen, Xuan Bac Nguyen, Sankalp Pandey, Tim Faltermeier, Nicholas Borys, Hugh Churchill, Khoa Luu
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2507.05184
Pdf URL: https://arxiv.org/pdf/2507.05184
Copy Paste: [[2507.05184]] $φ$-Adapt: A Physics-Informed Adaptation Learning Approach to 2D Quantum Material Discovery(https://arxiv.org/abs/2507.05184)
Keywords: generation
Abstract: Characterizing quantum flakes is a critical step in quantum hardware engineering because the quality of these flakes directly influences qubit performance. Although computer vision methods for identifying two-dimensional quantum flakes have emerged, they still face significant challenges in estimating flake thickness. These challenges include limited data, poor generalization, sensitivity to domain shifts, and a lack of physical interpretability. In this paper, we introduce one of the first Physics-informed Adaptation Learning approaches to overcome these obstacles. We focus on two main issues, i.e., data scarcity and generalization. First, we propose a new synthetic data generation framework that produces diverse quantum flake samples across various materials and configurations, reducing the need for time-consuming manual collection. Second, we present $\varphi$-Adapt, a physics-informed adaptation method that bridges the performance gap between models trained on synthetic data and those deployed in real-world settings. Experimental results show that our approach achieves state-of-the-art performance on multiple benchmarks, outperforming existing methods. Our proposed approach advances the integration of physics-based modeling and domain adaptation. It also addresses a critical gap in leveraging synthesized data for real-world 2D material analysis, offering impactful tools for deep learning and materials science communities.

Title: Logit Reweighting for Topic-Focused Summarization

Authors: Joschka Braun, Bálint Mucsányi, Seyed Ali Bahrainian
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2507.05235
Pdf URL: https://arxiv.org/pdf/2507.05235
Copy Paste: [[2507.05235]] Logit Reweighting for Topic-Focused Summarization(https://arxiv.org/abs/2507.05235)
Keywords: generation
Abstract: Generating abstractive summaries that adhere to a specific topic remains a significant challenge for language models. While standard approaches, such as fine-tuning, are resource-intensive, simpler methods like prompt engineering often struggle to maintain topical focus, particularly with smaller models. To address this, we propose a lightweight method that enhances topical relevance by directly reweighting the logits of topic-relevant tokens during generation. We evaluate three such reweighting techniques: Constant Shift, which adds a constant value to logits; Factor Scaling, which multiplies them by a factor; and Threshold Selection, which selectively boosts logits that exceed a probability threshold. Experiments on the NEWTS topical summarization dataset, using both Gemma-2B and Llama-3-8B models, show that these techniques effectively increase the use of topic-relevant vocabulary. Notably, the Threshold Selection method successfully improves topical focus without compromising summary quality, a trade-off often seen in other approaches. Our findings demonstrate that directly reweighting logits is a practical and resource-efficient alternative to fine-tuning, offering a promising pathway for precisely controlling the thematic content of generated text.
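
The three reweighting techniques named in the abstract can be sketched on a toy logit vector. This is a minimal illustration under stated assumptions: `topic_ids` stands for the set of vocabulary indices judged topic-relevant, and the additive boost used in `threshold_selection` is a guess, since the abstract does not say how the boost is applied. None of the function names come from the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def constant_shift(logits, topic_ids, delta):
    """Constant Shift: add a constant to the logits of topic-relevant tokens."""
    return [x + delta if i in topic_ids else x for i, x in enumerate(logits)]

def factor_scaling(logits, topic_ids, factor):
    """Factor Scaling: multiply topic-relevant logits by a factor.

    Note that a factor above 1 pushes a negative logit further down.
    """
    return [x * factor if i in topic_ids else x for i, x in enumerate(logits)]

def threshold_selection(logits, topic_ids, threshold, delta):
    """Threshold Selection: boost only topic tokens whose probability exceeds threshold.

    The additive boost `delta` is an assumption made for this sketch.
    """
    probs = softmax(logits)
    return [
        x + delta if i in topic_ids and probs[i] > threshold else x
        for i, x in enumerate(logits)
    ]

logits = [2.0, 1.0, 0.5, -1.0]
topic_ids = {1, 3}
print(constant_shift(logits, topic_ids, 2.0))            # [2.0, 3.0, 0.5, 1.0]
print(factor_scaling(logits, topic_ids, 2.0))            # [2.0, 2.0, 0.5, -2.0]
print(threshold_selection(logits, topic_ids, 0.1, 2.0))  # [2.0, 3.0, 0.5, -1.0]
```

In the toy example only token 1 passes the probability threshold, so Threshold Selection leaves the already unlikely token 3 untouched. That selectivity is one plausible reason the method can raise topical focus without promoting tokens the model considers poor continuations.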

Title: From Marginal to Joint Predictions: Evaluating Scene-Consistent Trajectory Prediction Approaches for Automated Driving

Authors: Fabian Konstantinidis, Ariel Dallari Guerreiro, Raphael Trumpp, Moritz Sackmann, Ulrich Hofmann, Marco Caccamo, Christoph Stiller
Subjects: cs.CV, cs.AI, cs.LG, cs.MA, cs.RO
Abstract URL: https://arxiv.org/abs/2507.05254
Pdf URL: https://arxiv.org/pdf/2507.05254
Copy Paste: [[2507.05254]] From Marginal to Joint Predictions: Evaluating Scene-Consistent Trajectory Prediction Approaches for Automated Driving(https://arxiv.org/abs/2507.05254)
Keywords: generative
Abstract: Accurate motion prediction of surrounding traffic participants is crucial for the safe and efficient operation of automated vehicles in dynamic environments. Marginal prediction models commonly forecast each agent's future trajectories independently, often leading to sub-optimal planning decisions for an automated vehicle. In contrast, joint prediction models explicitly account for the interactions between agents, yielding socially and physically consistent predictions on a scene level. However, existing approaches differ not only in their problem formulation but also in the model architectures and implementation details used, making it difficult to compare them. In this work, we systematically investigate different approaches to joint motion prediction, including post-processing of the marginal predictions, explicitly training the model for joint predictions, and framing the problem as a generative task. We evaluate each approach in terms of prediction accuracy, multi-modality, and inference efficiency, offering a comprehensive analysis of the strengths and limitations of each approach. Several prediction examples are available at this https URL.
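
The difference between marginal and joint evaluation can be made concrete with a small displacement-error example. The sketch assumes a commonly used pair of metrics: a marginal minimum ADE that picks the best mode per agent independently, and a joint minimum ADE that must pick one mode for the whole scene. The abstract does not name the metrics the paper uses, so the function names and data layout (`modes[k][a]` is the trajectory of agent `a` in mode `k`) are illustrative.

```python
import math

def ade(pred, gt):
    """Average displacement error between two trajectories of (x, y) points."""
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(gt)

def marginal_min_ade(modes, gt):
    """Pick the best mode separately for each agent, then average over agents."""
    n = len(gt)
    return sum(min(ade(mode[a], gt[a]) for mode in modes) for a in range(n)) / n

def joint_min_ade(modes, gt):
    """Pick the single mode that is best for the scene as a whole."""
    n = len(gt)
    return min(sum(ade(mode[a], gt[a]) for a in range(n)) / n for mode in modes)

gt = [[(0, 0), (1, 0)], [(0, 1), (1, 1)]]          # two agents, two timesteps
modes = [
    [[(0, 0), (1, 0)], [(0, 2), (1, 2)]],          # mode 0: agent 0 exact, agent 1 off by 1
    [[(0, -1), (1, -1)], [(0, 1), (1, 1)]],        # mode 1: agent 0 off by 1, agent 1 exact
]
print(marginal_min_ade(modes, gt))  # 0.0
print(joint_min_ade(modes, gt))     # 0.5
```

Here the marginal metric reports a perfect score even though no single predicted scene matches both agents. This is the kind of scene-level inconsistency that joint prediction approaches are meant to expose.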

Title: SegmentDreamer: Towards High-fidelity Text-to-3D Synthesis with Segmented Consistency Trajectory Distillation

Authors: Jiahao Zhu, Zixuan Chen, Guangcong Wang, Xiaohua Xie, Yi Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.05256
Pdf URL: https://arxiv.org/pdf/2507.05256
Copy Paste: [[2507.05256]] SegmentDreamer: Towards High-fidelity Text-to-3D Synthesis with Segmented Consistency Trajectory Distillation(https://arxiv.org/abs/2507.05256)
Keywords: generation
Abstract: Recent advancements in text-to-3D generation improve the visual quality of Score Distillation Sampling (SDS) and its variants by directly connecting Consistency Distillation (CD) to score distillation. However, due to the imbalance between self-consistency and cross-consistency, these CD-based methods inherently suffer from improper conditional guidance, leading to sub-optimal generation results. To address this issue, we present SegmentDreamer, a novel framework designed to fully unleash the potential of consistency models for high-fidelity text-to-3D generation. Specifically, we reformulate SDS through the proposed Segmented Consistency Trajectory Distillation (SCTD), effectively mitigating the imbalance issues by explicitly defining the relationship between self- and cross-consistency. Moreover, SCTD partitions the Probability Flow Ordinary Differential Equation (PF-ODE) trajectory into multiple sub-trajectories and ensures consistency within each segment, which can theoretically provide a significantly tighter upper bound on distillation error. Additionally, we propose a distillation pipeline for swifter and more stable generation. Extensive experiments demonstrate that our SegmentDreamer outperforms state-of-the-art methods in visual quality, enabling high-fidelity 3D asset creation through 3D Gaussian Splatting (3DGS).