2025-07-18

Title: Spatially Grounded Explanations in Vision Language Models for Document Visual Question Answering

Authors: Maximiliano Hormazábal Lagos, Héctor Cerezo-Costas, Dimosthenis Karatzas
Subjects: cs.CV, cs.AI, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2507.12490
Pdf URL: https://arxiv.org/pdf/2507.12490
Copy Paste: [[2507.12490]] Spatially Grounded Explanations in Vision Language Models for Document Visual Question Answering(https://arxiv.org/abs/2507.12490)
Keywords: generation
Abstract: We introduce EaGERS, a fully training-free and model-agnostic pipeline that (1) generates natural language rationales via a vision language model, (2) grounds these rationales to spatial sub-regions by computing multimodal embedding similarities over a configurable grid with majority voting, and (3) restricts the generation of responses only from the relevant regions selected in the masked image. Experiments on the DocVQA dataset demonstrate that our best configuration not only outperforms the base model on exact match accuracy and Average Normalized Levenshtein Similarity metrics but also enhances transparency and reproducibility in DocVQA without additional model fine-tuning.
摘要：我们提出了EaGERS，这是一个完全无需训练且与模型无关的流程：（1）通过视觉语言模型生成自然语言理由；（2）在可配置的网格上计算多模态嵌入相似度并进行多数投票，将这些理由定位到空间子区域；（3）限制模型仅根据掩码图像中选出的相关区域生成回答。在DocVQA数据集上的实验表明，我们的最佳配置不仅在精确匹配准确率和平均归一化Levenshtein相似度指标上优于基础模型，而且无需额外的模型微调即可提升DocVQA的透明度和可复现性。
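
The grounding step (2) of the abstract can be illustrated with a minimal sketch. It assumes the rationale and each grid cell have already been embedded by several multimodal embedding models; the function names, the top-k selection rule, and the toy vectors are hypothetical and are not taken from the paper. Each model votes for its most similar cells, and cells chosen by more than half of the models are kept, with the rest to be masked before answering.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_k_cells(rationale_emb, cell_embs, k):
    """Indices of the k grid cells most similar to the rationale."""
    ranked = sorted(range(len(cell_embs)),
                    key=lambda i: cosine(rationale_emb, cell_embs[i]),
                    reverse=True)
    return ranked[:k]

def majority_vote(per_model_top_cells):
    """Keep cells selected by more than half of the embedding models."""
    votes = Counter(c for cells in per_model_top_cells for c in set(cells))
    needed = len(per_model_top_cells) / 2
    return sorted(c for c, n in votes.items() if n > needed)

# Toy example: a 2x2 grid (4 cells) embedded by three hypothetical models.
# For brevity the same rationale vector is reused for every model.
rationale = [1.0, 0.0]
cells_by_model = [
    [[0.9, 0.1], [0.1, 0.9], [0.8, 0.3], [0.0, 1.0]],
    [[1.0, 0.0], [0.2, 0.8], [0.1, 0.9], [0.7, 0.2]],
    [[0.9, 0.2], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]],
]
selected = majority_vote([top_k_cells(rationale, cells, k=2)
                          for cells in cells_by_model])
print(selected)  # cells that would stay unmasked
```

The grid size and `k` stand in for the "configurable grid" of the abstract; a real pipeline would crop the image into cells and embed each crop.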

Title: Scaling Up RL: Unlocking Diverse Reasoning in LLMs via Prolonged Training

Authors: Mingjie Liu, Shizhe Diao, Jian Hu, Ximing Lu, Xin Dong, Hao Zhang, Alexander Bukharin, Shaokun Zhang, Jiaqi Zeng, Makesh Narsimhan Sreedhar, Gerald Shen, David Mosallanezhad, Di Zhang, Jonas Yang, June Yang, Oleksii Kuchaiev, Guilin Liu, Zhiding Yu, Pavlo Molchanov, Yejin Choi, Jan Kautz, Yi Dong
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2507.12507
Pdf URL: https://arxiv.org/pdf/2507.12507
Copy Paste: [[2507.12507]] Scaling Up RL: Unlocking Diverse Reasoning in LLMs via Prolonged Training(https://arxiv.org/abs/2507.12507)
Keywords: generation
Abstract: Recent advancements in reasoning-focused language models such as OpenAI's O1 and DeepSeek-R1 have shown that scaling test-time computation, through chain-of-thought reasoning and iterative exploration, can yield substantial improvements on complex tasks like mathematics and code generation. These breakthroughs have been driven by large-scale reinforcement learning (RL), particularly when combined with verifiable reward signals that provide objective and grounded supervision. In this report, we investigate the effects of prolonged reinforcement learning on a small language model across a diverse set of reasoning domains. Our work identifies several key ingredients for effective training, including the use of verifiable reward tasks, enhancements to Group Relative Policy Optimization (GRPO), and practical techniques to improve training stability and generalization. We introduce controlled KL regularization, clipping ratio, and periodic reference policy resets as critical components for unlocking long-term performance gains. Our model achieves significant improvements over strong baselines, including +14.7% on math, +13.9% on coding, and +54.8% on logic puzzle tasks. To facilitate continued research, we release our model publicly.
摘要：以推理为中心的语言模型（例如OpenAI的O1和DeepSeek-R1）的最新进展表明，缩放测试时间的计算 - 经过思考的推理和迭代探索范围可以对数学和代码生成等复杂任务进行实质性改进。这些突破是由大规模增强学习（RL）驱动的，尤其是当与可验证的奖励信号相结合时，这些信号提供了客观和扎根的监督。在本报告中，我们研究了长期加强学习对各种推理领域的小语言模型的影响。我们的工作确定了有效培训的几种关键要素，包括使用可验证的奖励任务，对小组相对政策优化的增强（GRPO）以及改善培训稳定性和概括的实用技术。我们引入受控的KL正则化，剪辑比率和定期参考策略作为解锁长期绩效提高的关键组成部分。我们的模型比强基础方面取得了重大改进，包括数学的 +14.7％，编码 +13.9％，逻辑拼图任务的 +54.8％。为了促进继续研究，我们公开发布模型。
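
The GRPO ingredients named in the abstract can be sketched in a few lines. This is a generic illustration of group-relative advantages and a clipped policy-ratio term, not the authors' training code; the use of the population standard deviation, the separate lower and upper clip bounds, and all default values are assumptions.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other samples for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std is an assumption here
    return [(r - mean) / (std + eps) for r in rewards]

def clipped_term(ratio, advantage, clip_low=0.2, clip_high=0.2):
    """PPO-style clipped surrogate for one token or one sample."""
    clipped = min(max(ratio, 1.0 - clip_low), 1.0 + clip_high)
    return min(ratio * advantage, clipped * advantage)

# Four sampled answers to one prompt, scored by a verifiable 0/1 reward.
rewards = [1.0, 0.0, 0.0, 1.0]
advs = group_relative_advantages(rewards)
print(advs)
print(clipped_term(1.5, 1.0))   # gain is capped once the ratio exceeds 1 + clip_high
print(clipped_term(0.5, -1.0))  # penalty is floored once the ratio drops below 1 - clip_low
```

The KL regularization and periodic reference policy resets mentioned in the abstract would sit around this term in a full training loop and are not modeled here.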

Title: IncA-DES: An incremental and adaptive dynamic ensemble selection approach using online K-d tree neighborhood search for data streams with concept drift

Authors: Eduardo V. L. Barboza, Paulo R. Lisboa de Almeida, Alceu de Souza Britto Jr., Robert Sabourin, Rafael M. O. Cruz
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.12573
Pdf URL: https://arxiv.org/pdf/2507.12573
Copy Paste: [[2507.12573]] IncA-DES: An incremental and adaptive dynamic ensemble selection approach using online K-d tree neighborhood search for data streams with concept drift(https://arxiv.org/abs/2507.12573)
Keywords: generation
Abstract: Data streams pose challenges not usually encountered in batch-based ML. One of them is concept drift, which is characterized by a change in the data distribution over time. Among the many approaches explored in the literature, the fusion of classifiers has shown good results and is receiving growing attention. Dynamic selection (DS) methods, because the ensemble is instance-based, seem to be an efficient choice under drifting scenarios. However, some attention must be paid to adapting such methods for concept drift. The training must be done so as to create local experts, and the commonly used neighborhood-search DS may become prohibitive with the continuous arrival of data. In this work, we propose IncA-DES, which employs a training strategy that promotes the generation of local experts under the assumption that different regions of the feature space become available with time. Additionally, the fusion of a concept drift detector supports the maintenance of information and adaptation to a new concept. An overlap-based classification filter is also employed to avoid using the DS method when there is a consensus in the neighborhood, a strategy that we argue every DS method should employ, as it was shown to make them more applicable and quicker. Moreover, aiming to reduce the processing time of the kNN, we propose an Online K-d tree algorithm, which can quickly remove instances without becoming inconsistent and deals with unbalancing concerns that may occur in data streams. Experimental results showed that the proposed framework achieved the best average accuracy compared to seven state-of-the-art methods across different levels of label availability, and the shortest processing time among the most accurate methods. Additionally, the fusion with the Online K-d tree improved processing time with a negligible loss in accuracy. We have made our framework available in an online repository.
摘要：数据流构成基于批处理的ML通常不会遇到的挑战。其中之一是概念漂移，其特征是随着时间的推移数据分布的变化。在文献中探讨的许多方法中，分类器的融合一直表现出良好的结果，并且正在越来越关注。 DS方法由于集合为基于实例，在漂流方案下似乎是一个有效的选择。但是，必须注意将这种方法适应概念漂移。必须进行培训以创建本地专家，并且随着数据的持续到来，常用的邻里搜索DS可能会变得越来越高。在这项工作中，我们提出了Inca-DES，它采用了一种培训策略，该策略促进了本地专家的产生，假设特征空间的不同区域随着时间而言可用。此外，概念漂移探测器的融合支持维护信息并适应新概念。还采用了基于重叠的分类过滤器，以避免在附近达成共识时避免使用DS方法，这是我们认为应采用每种DS方法的策略，因为它被证明使它们更适用和更快。此外，为了减少KNN的处理时间，我们提出了一种在线k-d树算法，该算法可以快速删除实例而不会变得不一致，并处理数据流中可能发生的不平衡问题。实验结果表明，考虑到不同级别的标签可用性，所提出的框架具有最佳的平均准确性，并提出了最准确的方法之间的较小处理时间。此外，与在线k-d树的融合有改善的处理时间，准确性损失却忽略不计。我们已经在在线存储库中提供了框架。
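
The idea of skipping dynamic selection when the neighborhood already agrees can be sketched as follows. This is a simplified illustration: it uses a brute-force nearest-neighbor search rather than the Online K-d tree proposed in the paper, treats "consensus" as all k neighbors sharing one label, and uses a majority vote as a placeholder for a real DS method. All names are hypothetical.

```python
import math
from collections import Counter

def k_nearest(query, instances, k):
    """Brute-force kNN: the k (features, label) pairs closest to the query."""
    return sorted(instances, key=lambda inst: math.dist(query, inst[0]))[:k]

def classify(query, instances, k, dynamic_selection):
    """Return (label, route); DS is only invoked when neighbors disagree."""
    neighbors = k_nearest(query, instances, k)
    labels = {label for _, label in neighbors}
    if len(labels) == 1:
        return labels.pop(), "consensus"
    return dynamic_selection(query, neighbors), "ds"

def majority_fallback(query, neighbors):
    """Placeholder for a DS method: majority label in the neighborhood."""
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

# Toy window of labeled stream instances.
window = [((0.0, 0.0), "a"), ((0.1, 0.0), "a"), ((0.0, 0.2), "a"),
          ((1.0, 1.0), "b"), ((0.9, 1.1), "b"), ((0.4, 0.5), "a")]
easy = classify((0.05, 0.05), window, 3, majority_fallback)
hard = classify((0.6, 0.7), window, 3, majority_fallback)
print(easy, hard)
```

The saving comes from the first branch: queries that fall in a homogeneous region never pay the cost of estimating classifier competence.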

Title: Assay2Mol: large language model-based drug design using BioAssay context

Authors: Yifan Deng, Spencer S. Ericksen, Anthony Gitter
Subjects: cs.LG, cs.AI, q-bio.QM
Abstract URL: https://arxiv.org/abs/2507.12574
Pdf URL: https://arxiv.org/pdf/2507.12574
Copy Paste: [[2507.12574]] Assay2Mol: large language model-based drug design using BioAssay context(https://arxiv.org/abs/2507.12574)
Keywords: generation
Abstract: Scientific databases aggregate vast amounts of quantitative data alongside descriptive text. In biochemistry, molecule screening assays evaluate the functional responses of candidate molecules against disease targets. Unstructured text that describes the biological mechanisms through which these targets operate, experimental screening protocols, and other attributes of assays offers rich information for new drug discovery campaigns but has remained untapped because of its unstructured format. We present Assay2Mol, a large language model-based workflow that can capitalize on the vast existing biochemical screening assays for early-stage drug discovery. Assay2Mol retrieves existing assay records involving targets similar to the new target and generates candidate molecules using in-context learning with the retrieved assay screening data. Assay2Mol outperforms recent machine learning approaches that generate candidate ligand molecules for target protein structures, while also promoting more synthesizable molecule generation.
摘要：科学数据库汇总了大量的定量数据以及描述性文本。在生物化学中，分子筛选测定法评估了候选分子对疾病靶标的功能反应。描述这些目标操作的生物学机制的非结构化文本，实验性筛选协议和其他测定属性为新药物发现运动提供了丰富的信息，但由于这种非结构化的格式而没有开发。我们提出了Assay2mol，这是一种基于语言模型的大型工作流程，可以利用现有的现有生化筛查分析，以进行早期药物发现。 Assay2mol检索涉及类似于新目标的目标的现有测定记录，并使用检索到的测定筛选数据使用中文学习生成候选分子。 Assay2mol优于最新的机器学习方法，该方法为靶蛋白结构生成候选配体分子，同时还促进了更多可合成的分子产生。
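
The retrieve-then-prompt workflow described in the abstract can be sketched as below. Everything here is an assumption for illustration: token-overlap (Jaccard) similarity stands in for whatever retrieval the authors use, the record layout and prompt wording are hypothetical, and the assay entries are made-up toy data rather than real screening results.

```python
def jaccard(a, b):
    """Token-overlap similarity between two text descriptions."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

def retrieve(query, assays, top_k):
    """Return the top_k assay records most similar to the query text."""
    return sorted(assays, key=lambda a: jaccard(query, a["description"]),
                  reverse=True)[:top_k]

def build_prompt(query, retrieved):
    """Assemble an in-context prompt from retrieved assay screening data."""
    lines = [f"Target description: {query}", "Relevant assays:"]
    for assay in retrieved:
        lines.append(f"- {assay['description']}")
        for smiles, active in assay["results"]:
            lines.append(f"  {smiles}: {'active' if active else 'inactive'}")
    lines.append("Propose a new candidate molecule as SMILES:")
    return "\n".join(lines)

# Toy records; descriptions and activity labels are invented for the example.
assays = [
    {"description": "kinase activity inhibition screen",
     "results": [("CCO", False), ("c1ccccc1", True)]},
    {"description": "bacterial growth assay",
     "results": [("CCN", True)]},
]
query = "inhibitors of kinase activity"
hits = retrieve(query, assays, top_k=1)
prompt = build_prompt(query, hits)
print(prompt)
```

The resulting prompt would then be sent to a language model; that call is left out because the abstract does not specify the model or interface.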

Title: Federated Learning in Open- and Closed-Loop EMG Decoding: A Privacy and Performance Perspective

Authors: Kai Malcolm, César Uribe, Momona Yamagami
Subjects: cs.LG, cs.CR, cs.HC
Abstract URL: https://arxiv.org/abs/2507.12652
Pdf URL: https://arxiv.org/pdf/2507.12652
Copy Paste: [[2507.12652]] Federated Learning in Open- and Closed-Loop EMG Decoding: A Privacy and Performance Perspective(https://arxiv.org/abs/2507.12652)
Keywords: generation
Abstract: Invasive and non-invasive neural interfaces hold promise as high-bandwidth input devices for next-generation technologies. However, neural signals inherently encode sensitive information about an individual's identity and health, making data sharing for decoder training a critical privacy challenge. Federated learning (FL), a distributed, privacy-preserving learning framework, presents a promising solution, but it remains unexplored in closed-loop adaptive neural interfaces. Here, we introduce FL-based neural decoding and systematically evaluate its performance and privacy using high-dimensional electromyography signals in both open- and closed-loop scenarios. In open-loop simulations, FL significantly outperformed local learning baselines, demonstrating its potential for high-performance, privacy-conscious neural decoding. In contrast, closed-loop user studies required adapting FL methods to accommodate single-user, real-time interactions, a scenario not supported by standard FL. This modification resulted in local learning decoders surpassing the adapted FL approach in closed-loop performance, yet local learning still carried higher privacy risks. Our findings highlight a critical performance-privacy tradeoff in real-time adaptive applications and indicate the need for FL methods specifically designed for co-adaptive, single-user applications.
摘要：侵入性和非侵入性神经界面作为下一代技术的高带宽输入设备有希望。但是，神经信号固有地编码有关个人身份和健康的敏感信息，从而使解码器培训的数据共享成为关键的隐私挑战。联合学习（FL）是一个分布式，保护隐私的学习框架，提出了一个有希望的解决方案，但在闭环自适应神经接口中仍未探索。在这里，我们介绍了基于FL的神经解码，并系统地评估了在开放环和闭环方案中使用高维肌电图信号的性能和隐私。在开环模拟中，FL显着胜过本地学习基线，证明了其具有高性能，具有隐私意识的神经解码的潜力。相反，闭环用户研究需要适应FL方法来适应单用户，实时互动，这是标准FL不支持的情况。这种修改导致本地学习解码器超过了闭环性能中适用的FL方法，但本地学习仍然具有更高的隐私风险。我们的发现突出了实时自适应应用中的重要绩效私人关系权衡，并表明需要专门为共同自适应的单用户应用设计的FL方法。
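
For readers unfamiliar with federated learning, the aggregation step can be sketched with the standard federated averaging rule, in which a server combines client model parameters weighted by each client's sample count. The abstract does not state which aggregation rule the authors use, so this is an assumption, and the flat parameter lists are a simplification.

```python
def federated_average(client_weights, client_sizes):
    """Weighted mean of client parameter vectors (one flat list per client)."""
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(n_params)
    ]

# Two clients with locally trained decoders; raw EMG data never leaves a client.
clients = [[1.0, 2.0], [3.0, 4.0]]
sizes = [1, 3]
global_w = federated_average(clients, sizes)
print(global_w)
```

Only parameters are exchanged, which is the source of the privacy benefit. With a single user in a closed loop there is one client per session, which is the scenario the abstract says standard FL does not support directly.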

Title: Think-Before-Draw: Decomposing Emotion Semantics & Fine-Grained Controllable Expressive Talking Head Generation

Authors: Hanlei Shi, Leyuan Qu, Yu Liu, Di Gao, Yuhua Zheng, Taihao Li
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2507.12761
Pdf URL: https://arxiv.org/pdf/2507.12761
Copy Paste: [[2507.12761]] Think-Before-Draw: Decomposing Emotion Semantics & Fine-Grained Controllable Expressive Talking Head Generation(https://arxiv.org/abs/2507.12761)
Keywords: generation
Abstract: Emotional talking-head generation has emerged as a pivotal research area at the intersection of computer vision and multimodal artificial intelligence, with its core value lying in making human-computer interaction more immersive and empathetic. With the advancement of multimodal large language models, the driving signals for emotional talking-head generation have shifted from audio and video to more flexible text. However, current text-driven methods rely on predefined discrete emotion label texts, oversimplifying the dynamic complexity of real facial muscle movements and thus failing to convey emotion naturally. This study proposes the Think-Before-Draw framework to address two key challenges: (1) In-depth semantic parsing of emotions: by innovatively introducing Chain-of-Thought (CoT), abstract emotion labels are transformed into physiologically grounded facial muscle movement descriptions, enabling the mapping from high-level semantics to actionable motion features; and (2) Fine-grained expressiveness optimization: inspired by artists' portrait painting process, a progressive guidance denoising strategy is proposed, employing a "global emotion localization, local muscle control" mechanism to refine micro-expression dynamics in the generated output. Extensive experiments demonstrate that our approach achieves state-of-the-art performance on widely-used benchmarks, including MEAD and HDTF. Additionally, we collected a set of portrait images to evaluate our model's zero-shot generation capability.
摘要：情感说话头生成已成为计算机视觉与多模态人工智能交叉领域的关键研究方向，其核心价值在于使人机交互更具沉浸感和共情力。随着多模态大语言模型的发展，情感说话头生成的驱动信号已从音频和视频转向更灵活的文本。然而，现有的文本驱动方法依赖预定义的离散情感标签文本，过度简化了真实面部肌肉运动的动态复杂性，因而无法自然地表达情感。本研究提出Think-Before-Draw框架以应对两个关键挑战：（1）情感的深度语义解析：通过创新性地引入思维链（CoT），将抽象的情感标签转化为具有生理学依据的面部肌肉运动描述，从而实现从高层语义到可操作运动特征的映射；（2）细粒度表现力优化：受艺术家肖像绘画过程的启发，提出一种渐进式引导去噪策略，采用“全局情感定位、局部肌肉控制”机制来细化生成结果中的微表情动态。大量实验表明，我们的方法在MEAD和HDTF等广泛使用的基准上达到了最先进的性能。此外，我们收集了一组肖像图像，以评估模型的零样本生成能力。

Title: World Model-Based End-to-End Scene Generation for Accident Anticipation in Autonomous Driving

Authors: Yanchen Guan, Haicheng Liao, Chengyue Wang, Xingcheng Liu, Jiaxun Zhang, Zhenning Li
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2507.12762
Pdf URL: https://arxiv.org/pdf/2507.12762
Copy Paste: [[2507.12762]] World Model-Based End-to-End Scene Generation for Accident Anticipation in Autonomous Driving(https://arxiv.org/abs/2507.12762)
Keywords: generation, generative
Abstract: Reliable anticipation of traffic accidents is essential for advancing autonomous driving systems. However, this objective is limited by two fundamental challenges: the scarcity of diverse, high-quality training data and the frequent absence of crucial object-level cues due to environmental disruptions or sensor deficiencies. To tackle these issues, we propose a comprehensive framework combining generative scene augmentation with adaptive temporal reasoning. Specifically, we develop a video generation pipeline that utilizes a world model guided by domain-informed prompts to create high-resolution, statistically consistent driving scenarios, particularly enriching the coverage of edge cases and complex interactions. In parallel, we construct a dynamic prediction model that encodes spatio-temporal relationships through strengthened graph convolutions and dilated temporal operators, effectively addressing data incompleteness and transient visual noise. Furthermore, we release a new benchmark dataset designed to better capture diverse real-world driving risks. Extensive experiments on public and newly released datasets confirm that our framework enhances both the accuracy and lead time of accident anticipation, offering a robust solution to current data and modeling limitations in safety-critical autonomous driving applications.
摘要：对交通事故的可靠期望对于推进自动驾驶系统至关重要。但是，这一目标受到两个基本挑战的限制：由于环境中断或传感器缺陷而导致的多样化，高质量训练数据的稀缺性以及经常缺乏关键对象级别的提示。为了解决这些问题，我们提出了一个综合框架，将增强生成场景与自适应时间推理相结合。具体来说，我们开发了一条视频生成管道，该管道利用以域信息提示为指导的世界模型创建高分辨率，统计上一致的驾驶场景，尤其是丰富了边缘案例的覆盖范围和复杂的交互。同时，我们构建了一个动态预测模型，该模型通过加强图形卷积和扩张的临时操作员来编码时空关系，从而有效地解决了数据不完整和瞬时视觉噪声。此外，我们发布了一个新的基准数据集，旨在更好地捕获各种现实世界的驾驶风险。对公共和新发布的数据集进行的广泛实验证实，我们的框架可以提高事故预期的准确性和交付时间，从而为当前数据提供了强有力的解决方案，并在安全至关重要的自主驾驶应用程序中限制了限制。

Title: Local Representative Token Guided Merging for Text-to-Image Generation

Authors: Min-Jeong Lee, Hee-Dong Kim, Seong-Whan Lee
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2507.12771
Pdf URL: https://arxiv.org/pdf/2507.12771
Copy Paste: [[2507.12771]] Local Representative Token Guided Merging for Text-to-Image Generation(https://arxiv.org/abs/2507.12771)
Keywords: generation
Abstract: Stable diffusion is an outstanding image generation model for text-to-image, but its time-consuming generation process remains a challenge due to the quadratic complexity of attention operations. Recent token merging methods improve efficiency by reducing the number of tokens during attention operations, but often overlook the characteristics of attention-based image generation models, limiting their effectiveness. In this paper, we propose local representative token guided merging (ReToM), a novel token merging strategy applicable to any attention mechanism in image generation. To merge tokens based on various contextual information, ReToM defines local boundaries as windows within attention inputs and adjusts window sizes. Furthermore, we introduce a representative token, chosen per window by computing similarity at a specific timestep and selecting the token with the highest average similarity. This approach preserves the most salient local features while minimizing computational overhead. Experimental results show that ReToM achieves a 6.2% improvement in FID and higher CLIP scores compared to the baseline, while maintaining comparable inference time. We empirically demonstrate that ReToM is effective in balancing visual quality and computational efficiency.
摘要：稳定的扩散是文本对图像的出色图像生成模型，但由于注意操作的二次复杂性，其耗时的生成过程仍然是一个挑战。最近的令牌合并方法通过减少注意操作期间的令牌数量来提高效率，但通常忽略了基于注意力的图像生成模型的特征，从而限制了它们的有效性。在本文中，我们提出了当地代表的令牌指导性合并（retom），这是一种适用于图像生成中任何注意力机制的新型令牌合并策略。为了根据各种上下文信息合并令牌，Retom将局部边界定义为注意输入中的窗口，并调整窗口大小。此外，我们介绍了代表令牌，该代币代表每个窗口最具代表性的令牌，通过在特定时间段上计算相似性并选择具有最高平均值相似性的令牌。这种方法保留了最显着的本地特征，同时最大程度地减少了计算开销。实验结果表明，与基线相比，RETOM在FID和较高的夹得分方面提高了6.2％，同时保持了可比的推理时间。我们从经验上证明，Retom可以有效地平衡视觉质量和计算效率。
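
The representative-token rule described in the abstract, selecting the token with the highest average similarity inside a window, can be sketched as follows. Cosine similarity, the exclusion of self-similarity from the average, and the toy token vectors are assumptions; how windows are formed and how the remaining tokens are merged is not shown.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def representative_token(window_tokens):
    """Index of the token with the highest mean similarity to the others."""
    if len(window_tokens) < 2:
        return 0
    best_idx, best_score = 0, float("-inf")
    for i, token in enumerate(window_tokens):
        sims = [cosine(token, other)
                for j, other in enumerate(window_tokens) if j != i]
        score = sum(sims) / len(sims)
        if score > best_score:
            best_idx, best_score = i, score
    return best_idx

# One window of three 2-dimensional tokens.
window = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
rep = representative_token(window)
print(rep)
```

In a merging scheme, the other tokens in the window would then be merged toward the representative, so that attention runs over fewer tokens.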

Title: DeQA-Doc: Adapting DeQA-Score to Document Image Quality Assessment

Authors: Junjie Gao, Runze Liu, Yingzhe Peng, Shujian Yang, Jin Zhang, Kai Yang, Zhiyuan You
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.12796
Pdf URL: https://arxiv.org/pdf/2507.12796
Copy Paste: [[2507.12796]] DeQA-Doc: Adapting DeQA-Score to Document Image Quality Assessment(https://arxiv.org/abs/2507.12796)
Keywords: quality assessment
Abstract: Document quality assessment is critical for a wide range of applications including document digitization, OCR, and archival. However, existing approaches often struggle to provide accurate and robust quality scores, limiting their applicability in practical scenarios. With the rapid progress in Multi-modal Large Language Models (MLLMs), recent MLLM-based methods have achieved remarkable performance in image quality assessment. In this work, we extend this success to the document domain by adapting DeQA-Score, a state-of-the-art MLLM-based image quality scorer, for document quality assessment. We propose DeQA-Doc, a framework that leverages the visual language capabilities of MLLMs and a soft label strategy to regress continuous document quality scores. To adapt DeQA-Score to DeQA-Doc, we adopt two complementary solutions to construct soft labels without the variance information. Also, we relax the resolution constraints to support the large resolution of document images. Finally, we introduce ensemble methods to further enhance the performance. Extensive experiments demonstrate that DeQA-Doc significantly outperforms existing baselines, offering accurate and generalizable document quality assessment across diverse degradation types. Code and model weights are available at this https URL.
摘要：文档质量评估对于包括文档数字化，OCR和档案包括的广泛应用至关重要。但是，现有的方法通常难以提供准确，稳健的质量分数，从而限制了它们在实际情况下的适用性。随着多模式大语言模型（MLLM）的快速发展，最近基于MLLM的方法在图像质量评估中取得了显着的性能。在这项工作中，我们通过改编DEQA-Score（基于MLLM的最先进的图像质量得分手）来扩展此成功域，以进行文档质量评估。我们提出了DEQA-DOC，该框架利用MLLM的视觉语言功能和软标签策略来回归连续文档质量得分。为了将DEQA得分调整为DEQA-DOC，我们采用了两种互补解决方案来构建无方差信息的软标签。另外，我们放宽了分辨率的约束，以支持文档图像的大分辨率。最后，我们介绍了整体方法以进一步提高性能。广泛的实验表明，DEQA-DOC显着胜过现有的基线，提供了各种降解类型的准确且可推广的文档质量评估。该HTTPS URL中可用代码和模型权重。
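
One way to build a soft label from a continuous score when no variance is available is to split the probability mass between the two adjacent discrete quality levels so that the expected value reproduces the score. This is a generic construction offered as an illustration; it is not claimed to be either of the two solutions adopted by the authors, and the five-level scale is an assumption.

```python
def soft_label(score, levels=(1, 2, 3, 4, 5)):
    """Probability over discrete levels whose mean equals the clamped score."""
    score = min(max(score, levels[0]), levels[-1])
    probs = [0.0] * len(levels)
    for i in range(len(levels) - 1):
        lo, hi = levels[i], levels[i + 1]
        if lo <= score <= hi:
            probs[i + 1] = (score - lo) / (hi - lo)
            probs[i] = 1.0 - probs[i + 1]
            break
    return probs

def expected_score(probs, levels=(1, 2, 3, 4, 5)):
    """Recover a continuous score from predicted level probabilities."""
    return sum(p * level for p, level in zip(probs, levels))

probs = soft_label(3.25)
print(probs, expected_score(probs))
```

At inference time the same expectation can be taken over the model's predicted probabilities for the level tokens to obtain a continuous quality score.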

Title: ATL-Diff: Audio-Driven Talking Head Generation with Early Landmarks-Guide Noise Diffusion

Authors: Hoang-Son Vo, Quang-Vinh Nguyen, Seungwon Kim, Hyung-Jeong Yang, Soonja Yeom, Soo-Hyung Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.12804
Pdf URL: https://arxiv.org/pdf/2507.12804
Copy Paste: [[2507.12804]] ATL-Diff: Audio-Driven Talking Head Generation with Early Landmarks-Guide Noise Diffusion(https://arxiv.org/abs/2507.12804)
Keywords: generation
Abstract: Audio-driven talking head generation requires precise synchronization between facial animations and audio signals. This paper introduces ATL-Diff, a novel approach addressing synchronization limitations while reducing noise and computational costs. Our framework features three key components: a Landmark Generation Module converting audio to facial landmarks, a Landmarks-Guide Noise approach that decouples audio by distributing noise according to landmarks, and a 3D Identity Diffusion network preserving identity characteristics. Experiments on MEAD and CREMA-D datasets demonstrate that ATL-Diff outperforms state-of-the-art methods across all metrics. Our approach achieves near real-time processing with high-quality animations, computational efficiency, and exceptional preservation of facial nuances. This advancement offers promising applications for virtual assistants, education, medical communication, and digital platforms. The source code is available at this https URL.
摘要：音频驱动的说话头生成需要面部动画与音频信号之间的精确同步。本文介绍了ATL-Diff，这是一种解决同步局限、同时降低噪声和计算成本的新方法。我们的框架包含三个关键组件：将音频转换为面部关键点的关键点生成模块（Landmark Generation Module）、根据关键点分布噪声从而解耦音频的关键点引导噪声（Landmarks-Guide Noise）方法，以及保留身份特征的3D身份扩散（3D Identity Diffusion）网络。在MEAD和CREMA-D数据集上的实验表明，ATL-Diff在所有指标上均优于最先进的方法。我们的方法以高质量的动画、较高的计算效率和对面部细节的出色保留实现了接近实时的处理。这一进展为虚拟助手、教育、医疗沟通和数字平台带来了有前景的应用。源代码可在此 https URL 获取。

Title: RONOM: Reduced-Order Neural Operator Modeling

Authors: Sven Dummer, Dongwei Ye, Christoph Brune
Subjects: cs.LG, cs.CE, math.NA
Abstract URL: https://arxiv.org/abs/2507.12814
Pdf URL: https://arxiv.org/pdf/2507.12814
Copy Paste: [[2507.12814]] RONOM: Reduced-Order Neural Operator Modeling(https://arxiv.org/abs/2507.12814)
Keywords: super-resolution
Abstract: Time-dependent partial differential equations are ubiquitous in physics-based modeling, but they remain computationally intensive in many-query scenarios, such as real-time forecasting, optimal control, and uncertainty quantification. Reduced-order modeling (ROM) addresses these challenges by constructing a low-dimensional surrogate model but relies on a fixed discretization, which limits flexibility across varying meshes during evaluation. Operator learning approaches, such as neural operators, offer an alternative by parameterizing mappings between infinite-dimensional function spaces, enabling adaptation to data across different resolutions. Whereas ROM provides rigorous numerical error estimates, neural operator learning largely focuses on discretization convergence and invariance without quantifying the error between the infinite-dimensional and the discretized operators. This work introduces the reduced-order neural operator modeling (RONOM) framework, which bridges concepts from ROM and operator learning. We establish a discretization error bound analogous to those in ROM, and get insights into RONOM's discretization convergence and discretization robustness. Moreover, two numerical examples are presented that compare RONOM to existing neural operators for solving partial differential equations. The results demonstrate that RONOM using standard vector-to-vector neural networks achieves comparable performance in input generalization and superior performance in both spatial super-resolution and discretization robustness, while also offering novel insights into temporal super-resolution scenarios.
摘要：时间依赖性的部分微分方程在基于物理的建模中无处不在，但是它们在许多经常情况下仍然在计算中，例如实时预测，最佳控制和不确定性量化。减少订单建模（ROM）通过构建低维替代模型来解决这些挑战，但依赖于固定离散化，这限制了评估过程中各个不同网格的灵活性。操作员的学习方法（例如神经操作员）通过参数化无限维函数空间之间的映射提供了替代方案，从而可以对不同分辨率进行适应数据。尽管ROM提供了严格的数值误差估计，但神经操作员的学习很大程度上集中在离散化的融合和不变性上，而无需量化无限二维和离散操作员之间的误差。这项工作介绍了降低的神经操作员建模（RONOM）框架，该框架从ROM和操作员学习中介绍了概念。我们建立了类似于ROM的离散误差，并了解Ronomy的离散化收敛性和离散化鲁棒性。此外，提出了两个数值示例，它们将ROMON与现有神经操作员进行了比较，以解决偏微分方程。结果表明，使用标准载体到矢量神经网络的RONOM在空间超分辨率和离散化鲁棒性中都能在输入概括和卓越的性能中实现可比的性能，同时还为时间超级分辨率方案提供了新的见解。

Title: FIQ: Fundamental Question Generation with the Integration of Question Embeddings for Video Question Answering

Authors: Ju-Young Oh, Ho-Joong Kim, Seong-Whan Lee
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2507.12816
Pdf URL: https://arxiv.org/pdf/2507.12816
Copy Paste: [[2507.12816]] FIQ: Fundamental Question Generation with the Integration of Question Embeddings for Video Question Answering(https://arxiv.org/abs/2507.12816)
Keywords: generation
Abstract: Video question answering (VQA) is a multimodal task that requires the interpretation of a video to answer a given question. Existing VQA methods primarily utilize question and answer (Q&A) pairs to learn the spatio-temporal characteristics of video content. However, these annotations are typically event-centric, which is not enough to capture the broader context of each video. The absence of essential details such as object types, spatial layouts, and descriptive attributes restricts the model to learning only a fragmented scene representation. This issue limits the model's capacity for generalization and higher-level reasoning. In this paper, we propose a fundamental question generation with the integration of question embeddings for video question answering (FIQ), a novel approach designed to strengthen the reasoning ability of the model by enhancing the fundamental understanding of videos. FIQ generates Q&A pairs based on descriptions extracted from videos, enriching the training data with fundamental scene information. Generated Q&A pairs enable the model to understand the primary context, leading to enhanced generalizability and reasoning ability. Furthermore, we incorporate a VQ-CAlign module that assists task-specific question embeddings with visual features, ensuring that essential domain-specific details are preserved to increase the adaptability of downstream tasks. Experiments on SUTD-TrafficQA demonstrate that our FIQ achieves state-of-the-art performance compared to existing baseline methods.
摘要：视频问题回答（VQA）是一项多模式任务，需要对视频解释以回答给定的问题。现有的VQA方法主要利用问答（问答）对学习视频内容的时空特征。但是，这些注释通常以事件为中心，这不足以捕获每个视频的更广泛上下文。缺乏基本细节，例如对象类型，空间布局和描述性属性，将模型限制在仅学习零散的场景表示形式。这个问题限制了该模型的概括能力和更高层次的推理。在本文中，我们提出了一个基本问题的产生，并结合了视频问题回答的问题嵌入（FIQ），这是一种新颖的方法，旨在通过增强对视频的基本理解来增强模型的推理能力。 FIQ根据从视频中提取的描述生成问答对，并使用基本场景信息丰富培训数据。生成的问答对使模型能够理解主要环境，从而提高了普遍性和推理能力。此外，我们合并了一个VQ-Calign模块，该模块可以帮助特定于任务的问题嵌入视觉特征，以确保保留基本特定领域的细节以提高下游任务的适应性。 Sutd-TrafficQA的实验表明，与现有基线方法相比，我们的FIQ实现了最先进的性能。

Title: SEMT: Static-Expansion-Mesh Transformer Network Architecture for Remote Sensing Image Captioning

Authors: Khang Truong, Lam Pham, Hieu Tang, Jasmin Lampert, Martin Boyer, Son Phan, Truong Nguyen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2507.12845
Pdf URL: https://arxiv.org/pdf/2507.12845
Copy Paste: [[2507.12845]] SEMT: Static-Expansion-Mesh Transformer Network Architecture for Remote Sensing Image Captioning(https://arxiv.org/abs/2507.12845)
Keywords: generation
Abstract: Image captioning has emerged as a crucial task at the intersection of computer vision and natural language processing, enabling automated generation of descriptive text from visual content. In the context of remote sensing, image captioning plays a significant role in interpreting vast and complex satellite imagery, aiding applications such as environmental monitoring, disaster assessment, and urban planning. This motivates us, in this paper, to present a transformer-based network architecture for remote sensing image captioning (RSIC) in which multiple techniques (Static Expansion, Memory-Augmented Self-Attention, and Mesh Transformer) are evaluated and integrated. We evaluate our proposed models using two benchmark remote sensing image datasets, UCM-Caption and NWPU-Caption. Our best model outperforms the state-of-the-art systems on most evaluation metrics, which demonstrates its potential for real-life remote sensing image systems.
摘要：图像字幕已成为计算机视觉和自然语言处理的交集中的至关重要任务，从而从视觉内容中可以自动生成描述性文本。在遥感的背景下，图像字幕在解释庞大而复杂的卫星图像，诸如环境监测，灾难评估和城市规划等应用程序中起着重要作用。这促使我们在本文中介绍了基于变压器的网络体系结构，用于遥感图像字幕（RSIC），其中对静态扩展，内存增强的自我注意，网格变压器的多种技术进行了评估和集成。我们使用UCM-CAPTION和NWPU-CAPTION的两个基准遥感图像数据集评估了我们提出的模型。我们的最佳模型在大多数评估指标上都优于最先进的系统，这表明了申请现实生活遥感图像系统的潜力。

Title: An Investigation of Ear-EEG Signals for a Novel Biometric Authentication System

Authors: Danilo Avola, Giancarlo Crocetti, Gian Luca Foresti, Daniele Pannone, Claudio Piciarelli, Amedeo Ranaldi
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.12873
Pdf URL: https://arxiv.org/pdf/2507.12873
Copy Paste: [[2507.12873]] An Investigation of Ear-EEG Signals for a Novel Biometric Authentication System(https://arxiv.org/abs/2507.12873)
Keywords: generation
Abstract: This work explores the feasibility of biometric authentication using EEG signals acquired through in-ear devices, commonly referred to as ear-EEG. Traditional EEG-based biometric systems, while secure, often suffer from low usability due to cumbersome scalp-based electrode setups. In this study, we propose a novel and practical framework leveraging ear-EEG signals as a user-friendly alternative for everyday biometric authentication. The system extracts an original combination of temporal and spectral features from ear-EEG signals and feeds them into a fully connected deep neural network for subject identification. Experimental results on the only currently available ear-EEG dataset suitable for different purposes, including biometric authentication, demonstrate promising performance, with an average accuracy of 82% in a subject identification scenario. These findings confirm the potential of ear-EEG as a viable and deployable direction for next-generation real-world biometric systems.
摘要：这项工作探讨了利用入耳式设备采集的脑电信号（通常称为ear-EEG）进行生物特征认证的可行性。传统的基于脑电的生物特征识别系统虽然安全，但由于基于头皮的电极设置繁琐，可用性往往较低。在这项研究中，我们提出了一个新颖而实用的框架，利用ear-EEG信号作为日常生物特征认证的用户友好替代方案。该系统从ear-EEG信号中提取一种新的时域与频谱特征组合，并将其输入全连接深度神经网络以进行受试者识别。在目前唯一可用的、适用于包括生物特征认证在内的多种用途的ear-EEG数据集上，实验结果显示出有前景的性能，在受试者识别场景中平均准确率为82%。这些发现证实了ear-EEG作为下一代现实世界生物特征识别系统的可行且可部署方向的潜力。

Title: DMQ: Dissecting Outliers of Diffusion Models for Post-Training Quantization

Authors: Dongyeun Lee, Jiwan Hur, Hyounguk Shon, Jae Young Lee, Junmo Kim
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2507.12933
Pdf URL: https://arxiv.org/pdf/2507.12933
Copy Paste: [[2507.12933]] DMQ: Dissecting Outliers of Diffusion Models for Post-Training Quantization(https://arxiv.org/abs/2507.12933)
Keywords: generation
Abstract: Diffusion models have achieved remarkable success in image generation but come with significant computational costs, posing challenges for deployment in resource-constrained environments. Recent post-training quantization (PTQ) methods have attempted to mitigate this issue by focusing on the iterative nature of diffusion models. However, these approaches often overlook outliers, leading to degraded performance at low bit-widths. In this paper, we propose DMQ, which combines Learned Equivalent Scaling (LES) and channel-wise Power-of-Two Scaling (PTS) to effectively address these challenges. Learned Equivalent Scaling optimizes channel-wise scaling factors to redistribute quantization difficulty between weights and activations, reducing overall quantization error. Recognizing that early denoising steps, despite having small quantization errors, crucially impact the final output due to error accumulation, we incorporate an adaptive timestep weighting scheme to prioritize these critical steps during learning. Furthermore, identifying that layers such as skip connections exhibit high inter-channel variance, we introduce channel-wise Power-of-Two Scaling for activations. To ensure robust selection of PTS factors even with a small calibration set, we introduce a voting algorithm that enhances reliability. Extensive experiments demonstrate that our method significantly outperforms existing works, especially at low bit-widths such as W4A6 (4-bit weight, 6-bit activation) and W4A8, maintaining high image generation quality and model stability. The code is available at this https URL.
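The equivalent-scaling idea in the DMQ abstract can be illustrated with a small sketch. Dividing each activation channel by a factor and multiplying the matching weight row by the same factor leaves a linear layer's output unchanged, while moving dynamic range from activations to weights. The helper names, the hand-picked factors, and the max-abs rounding rule below are illustrative assumptions; they are not DMQ's learned procedure or its voting algorithm.

```python
import math

def linear(x, w):
    """Compute y[j] = sum_i x[i] * w[i][j] for a single input vector x."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def apply_equivalent_scaling(x, w, s):
    """Divide activation channel i by s[i] and multiply weight row i by s[i]."""
    x_scaled = [xi / si for xi, si in zip(x, s)]
    w_scaled = [[wij * si for wij in row] for row, si in zip(w, s)]
    return x_scaled, w_scaled

def power_of_two_factor(channel_max, target_max):
    """Nearest power-of-two factor mapping channel_max toward target_max.

    Assumed rule for illustration only; the paper selects PTS factors by voting.
    """
    return 2.0 ** round(math.log2(channel_max / target_max))

x = [8.0, 0.5]                   # channel 0 plays the role of an outlier
w = [[0.25, -0.5], [1.0, 2.0]]
s = [4.0, 1.0]                   # hypothetical per-channel scaling factors

x_s, w_s = apply_equivalent_scaling(x, w, s)
y = linear(x, w)                 # original layer output
y_s = linear(x_s, w_s)           # output after equivalent scaling
```

Because power-of-two factors can be applied with bit shifts, they are cheap at inference time; the sketch only shows how such a factor could be chosen for one channel.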

Title: Insights into a radiology-specialised multimodal large language model with sparse autoencoders

Authors: Kenza Bouzid, Shruthi Bannur, Daniel Coelho de Castro, Anton Schwaighofer, Javier Alvarez-Valle, Stephanie L. Hyland
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.12950
Pdf URL: https://arxiv.org/pdf/2507.12950
Copy Paste: [[2507.12950]] Insights into a radiology-specialised multimodal large language model with sparse autoencoders(https://arxiv.org/abs/2507.12950)
Keywords: generation
Abstract: Interpretability can improve the safety, transparency and trust of AI models, which is especially important in healthcare applications where decisions often carry significant consequences. Mechanistic interpretability, particularly through the use of sparse autoencoders (SAEs), offers a promising approach for uncovering human-interpretable features within large transformer-based models. In this study, we apply Matryoshka-SAE to the radiology-specialised multimodal large language model, MAIRA-2, to interpret its internal representations. Using large-scale automated interpretability of the SAE features, we identify a range of clinically relevant concepts - including medical devices (e.g., line and tube placements, pacemaker presence), pathologies such as pleural effusion and cardiomegaly, longitudinal changes and textual features. We further examine the influence of these features on model behaviour through steering, demonstrating directional control over generations with mixed success. Our results reveal practical and methodological challenges, yet they offer initial insights into the internal concepts learned by MAIRA-2 - marking a step toward deeper mechanistic understanding and interpretability of a radiology-adapted multimodal large language model, and paving the way for improved model transparency. We release the trained SAEs and interpretations: this https URL.

Title: LoViC: Efficient Long Video Generation with Context Compression

Authors: Jiaxiu Jiang, Wenbo Li, Jingjing Ren, Yuping Qiu, Yong Guo, Xiaogang Xu, Han Wu, Wangmeng Zuo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.12952
Pdf URL: https://arxiv.org/pdf/2507.12952
Copy Paste: [[2507.12952]] LoViC: Efficient Long Video Generation with Context Compression(https://arxiv.org/abs/2507.12952)
Keywords: generation
Abstract: Despite recent advances in diffusion transformers (DiTs) for text-to-video generation, scaling to long-duration content remains challenging due to the quadratic complexity of self-attention. While prior efforts -- such as sparse attention and temporally autoregressive models -- offer partial relief, they often compromise temporal coherence or scalability. We introduce LoViC, a DiT-based framework trained on million-scale open-domain videos, designed to produce long, coherent videos through a segment-wise generation process. At the core of our approach is FlexFormer, an expressive autoencoder that jointly compresses video and text into unified latent representations. It supports variable-length inputs with linearly adjustable compression rates, enabled by a single query token design based on the Q-Former architecture. Additionally, by encoding temporal context through position-aware mechanisms, our model seamlessly supports prediction, retradiction, interpolation, and multi-shot generation within a unified paradigm. Extensive experiments across diverse tasks validate the effectiveness and versatility of our approach.

Title: FantasyPortrait: Enhancing Multi-Character Portrait Animation with Expression-Augmented Diffusion Transformers

Authors: Qiang Wang, Mengchao Wang, Fan Jiang, Yaqi Fan, Yonggang Qi, Mu Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.12956
Pdf URL: https://arxiv.org/pdf/2507.12956
Copy Paste: [[2507.12956]] FantasyPortrait: Enhancing Multi-Character Portrait Animation with Expression-Augmented Diffusion Transformers(https://arxiv.org/abs/2507.12956)
Keywords: generation
Abstract: Producing expressive facial animations from static images is a challenging task. Prior methods relying on explicit geometric priors (e.g., facial landmarks or 3DMM) often suffer from artifacts in cross reenactment and struggle to capture subtle emotions. Furthermore, existing approaches lack support for multi-character animation, as driving features from different individuals frequently interfere with one another, complicating the task. To address these challenges, we propose FantasyPortrait, a diffusion transformer based framework capable of generating high-fidelity and emotion-rich animations for both single- and multi-character scenarios. Our method introduces an expression-augmented learning strategy that utilizes implicit representations to capture identity-agnostic facial dynamics, enhancing the model's ability to render fine-grained emotions. For multi-character control, we design a masked cross-attention mechanism that ensures independent yet coordinated expression generation, effectively preventing feature interference. To advance research in this area, we propose the Multi-Expr dataset and ExprBench, which are specifically designed datasets and benchmarks for training and evaluating multi-character portrait animations. Extensive experiments demonstrate that FantasyPortrait significantly outperforms state-of-the-art methods in both quantitative metrics and qualitative evaluations, excelling particularly in challenging cross reenactment and multi-character contexts. Our project page is this https URL.

Title: A Distributed Generative AI Approach for Heterogeneous Multi-Domain Environments under Data Sharing constraints

Authors: Youssef Tawfilis, Hossam Amer, Minar El-Aasser, Tallal Elshabrawy
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2507.12979
Pdf URL: https://arxiv.org/pdf/2507.12979
Copy Paste: [[2507.12979]] A Distributed Generative AI Approach for Heterogeneous Multi-Domain Environments under Data Sharing constraints(https://arxiv.org/abs/2507.12979)
Keywords: generation, generative
Abstract: Federated Learning has gained increasing attention for its ability to enable multiple nodes to collaboratively train machine learning models without sharing their raw data. At the same time, Generative AI -- particularly Generative Adversarial Networks (GANs) -- has achieved remarkable success across a wide range of domains, such as healthcare, security, and image generation. However, training generative models typically requires large datasets and significant computational resources, which are often unavailable in real-world settings. Acquiring such resources can be costly and inefficient, especially when many underutilized devices -- such as IoT devices and edge devices -- with varying capabilities remain idle. Moreover, obtaining large datasets is challenging due to privacy concerns and copyright restrictions, as most devices are unwilling to share their data. To address these challenges, we propose a novel approach for decentralized GAN training that enables the utilization of distributed data and underutilized, low-capability devices while not sharing data in its raw form. Our approach is designed to tackle key challenges in decentralized environments, combining KLD-weighted Clustered Federated Learning to address the issues of data heterogeneity and multi-domain datasets, with Heterogeneous U-Shaped split learning to tackle the challenge of device heterogeneity under strict data sharing constraints -- ensuring that no labels or raw data, whether real or synthetic, are ever shared between nodes. Experimental results show that our approach delivers consistent and significant improvements across key performance metrics, achieving 1.1x -- 2.2x higher image generation scores and an average 10% boost in classification metrics (up to 50% in multi-domain non-IID settings), with much lower latency compared to several benchmarks. Find our code at this https URL.
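The abstract mentions KLD-weighted aggregation without giving the rule. As a rough sketch of the underlying quantity, the code below computes the Kullback-Leibler divergence between each node's label distribution and a reference distribution, then turns it into normalized weights. The exponential weighting in `kld_weights` is a hypothetical choice made for illustration only; it is not taken from the paper.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def kld_weights(node_dists, reference):
    """Hypothetical rule: weight each node by exp(-KL), then normalize to sum to 1."""
    raw = [math.exp(-kl_divergence(p, reference)) for p in node_dists]
    total = sum(raw)
    return [r / total for r in raw]

reference = [0.5, 0.5]        # assumed reference label distribution
node_dists = [
    [0.5, 0.5],               # node whose labels match the reference
    [1.0, 0.0],               # node that only holds one class
]
weights = kld_weights(node_dists, reference)
```

Under this assumed rule, the node whose data matches the reference receives twice the weight of the single-class node.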

Title: Fault detection and diagnosis for the engine electrical system of a space launcher based on a temporal convolutional autoencoder and calibrated classifiers

Authors: Luis Basora, Louison Bocquet-Nouaille, Elinirina Robinson, Serge Le Gonidec
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.13022
Pdf URL: https://arxiv.org/pdf/2507.13022
Copy Paste: [[2507.13022]] Fault detection and diagnosis for the engine electrical system of a space launcher based on a temporal convolutional autoencoder and calibrated classifiers(https://arxiv.org/abs/2507.13022)
Keywords: generation
Abstract: In the context of the health monitoring for the next generation of reusable space launchers, we outline a first step toward developing an onboard fault detection and diagnostic capability for the electrical system that controls the engine valves. Unlike existing approaches in the literature, our solution is designed to meet a broader range of key requirements. This includes estimating confidence levels for predictions, detecting out-of-distribution (OOD) cases, and controlling false alarms. The proposed solution is based on a temporal convolutional autoencoder to automatically extract low-dimensional features from raw sensor data. Fault detection and diagnosis are respectively carried out using a binary and a multiclass classifier trained on the autoencoder latent and residual spaces. The classifiers are histogram-based gradient boosting models calibrated to output probabilities that can be interpreted as confidence levels. A relatively simple technique, based on inductive conformal anomaly detection, is used to identify OOD data. We leverage other simple yet effective techniques, such as cumulative sum control chart (CUSUM) to limit the false alarms, and threshold moving to address class imbalance in fault detection. The proposed framework is highly configurable and has been evaluated on simulated data, covering both nominal and anomalous operational scenarios. The results indicate that our solution is a promising first step, though testing with real data will be necessary to ensure that it achieves the required maturity level for operational use.
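The abstract names the cumulative sum control chart (CUSUM) as the technique for limiting false alarms. A minimal one-sided CUSUM over a stream of per-sample fault scores might look like the sketch below. The target, slack, and threshold values, and the choice to reset after an alarm, are assumptions for illustration rather than the authors' settings.

```python
def cusum_alarms(scores, target, slack, threshold):
    """One-sided (upper) CUSUM.

    Accumulates how far each score exceeds target + slack, never dropping
    below zero, and reports the indices where the sum exceeds threshold.
    """
    s = 0.0
    alarms = []
    for i, x in enumerate(scores):
        s = max(0.0, s + (x - target - slack))
        if s > threshold:
            alarms.append(i)
            s = 0.0  # assumed policy: restart accumulation after an alarm
    return alarms

# Hypothetical fault scores: one isolated spike, then a sustained run.
scores = [0.1, 0.9, 0.1, 0.1, 0.8, 0.9, 0.9, 0.1]
alarms = cusum_alarms(scores, target=0.2, slack=0.1, threshold=1.0)
```

The isolated spike at index 1 does not raise an alarm on its own; only the sustained run of high scores pushes the cumulative sum over the threshold, which is how the chart suppresses one-off false positives.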

Title: Resurrect Mask AutoRegressive Modeling for Efficient and Scalable Image Generation

Authors: Yi Xin, Le Zhuo, Qi Qin, Siqi Luo, Yuewen Cao, Bin Fu, Yangfan He, Hongsheng Li, Guangtao Zhai, Xiaohong Liu, Peng Gao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.13032
Pdf URL: https://arxiv.org/pdf/2507.13032
Copy Paste: [[2507.13032]] Resurrect Mask AutoRegressive Modeling for Efficient and Scalable Image Generation(https://arxiv.org/abs/2507.13032)
Keywords: generation
Abstract: AutoRegressive (AR) models have made notable progress in image generation, with Masked AutoRegressive (MAR) models gaining attention for their efficient parallel decoding. However, MAR models have traditionally underperformed when compared to standard AR models. This study refines the MAR architecture to improve image generation quality. We begin by evaluating various image tokenizers to identify the most effective one. Subsequently, we introduce an improved Bidirectional LLaMA architecture by replacing causal attention with bidirectional attention and incorporating 2D RoPE, which together form our advanced model, MaskGIL. Scaled from 111M to 1.4B parameters, MaskGIL achieves an FID score of 3.71, matching state-of-the-art AR models in the ImageNet 256x256 benchmark, while requiring only 8 inference steps compared to the 256 steps of AR models. Furthermore, we develop a text-driven MaskGIL model with 775M parameters for generating images from text at various resolutions. Beyond image generation, MaskGIL extends to accelerate AR-based generation and enable real-time speech-to-image conversion. Our codes and models are available at this https URL.
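To make the "8 steps versus 256 steps" contrast concrete, the sketch below computes how many of 256 image tokens a masked decoder could reveal at each of 8 parallel steps. The cosine schedule is an assumption borrowed from common masked-decoding practice; the abstract does not state which schedule MaskGIL uses.

```python
import math

def tokens_revealed_per_step(num_tokens, num_steps):
    """Cosine masking schedule (assumed).

    After step t the number of still-masked tokens is
    floor(num_tokens * cos(pi/2 * t / num_steps)); the difference between
    consecutive steps is the number of tokens decoded in parallel.
    """
    revealed = []
    masked_before = num_tokens
    for t in range(1, num_steps + 1):
        masked_after = math.floor(num_tokens * math.cos(math.pi / 2 * t / num_steps))
        revealed.append(masked_before - masked_after)
        masked_before = masked_after
    return revealed

schedule = tokens_revealed_per_step(num_tokens=256, num_steps=8)
```

A token-by-token AR decoder would reveal exactly one token per step and therefore need 256 steps for the same grid; the schedule above starts with few tokens and reveals progressively more per step as context accumulates.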

Title: R^2MoE: Redundancy-Removal Mixture of Experts for Lifelong Concept Learning

Authors: Xiaohan Guo, Yusong Cai, Zejia Liu, Zhengning Wang, Lili Pan, Hongliang Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.13107
Pdf URL: https://arxiv.org/pdf/2507.13107
Copy Paste: [[2507.13107]] R^2MoE: Redundancy-Removal Mixture of Experts for Lifelong Concept Learning(https://arxiv.org/abs/2507.13107)
Keywords: generative
Abstract: Enabling large-scale generative models to continuously learn new visual concepts is essential for personalizing pre-trained models to meet individual user preferences. Existing approaches for continual visual concept learning are constrained by two fundamental challenges: catastrophic forgetting and parameter expansion. In this paper, we propose Redundancy-Removal Mixture of Experts (R^2MoE), a parameter-efficient framework for lifelong visual concept learning that effectively learns new concepts while incurring minimal parameter overhead. Our framework includes three key innovative contributions: First, we propose a mixture-of-experts framework with a routing distillation mechanism that enables experts to acquire concept-specific knowledge while preserving the gating network's routing capability, thereby effectively mitigating catastrophic forgetting. Second, we propose a strategy for eliminating redundant layer-wise experts that reduces the number of expert parameters by fully utilizing previously learned experts. Third, we employ a hierarchical local attention-guided inference approach to mitigate interference between generated visual concepts. Extensive experiments have demonstrated that our method generates images with superior conceptual fidelity compared to the state-of-the-art (SOTA) method, achieving an impressive 87.8% reduction in forgetting rates and 63.3% fewer parameters on the CustomConcept 101 dataset. Our code is available at this https URL.

Title: NGTM: Substructure-based Neural Graph Topic Model for Interpretable Graph Generation

Authors: Yuanxin Zhuang, Dazhong Shen, Ying Sun
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.13133
Pdf URL: https://arxiv.org/pdf/2507.13133
Copy Paste: [[2507.13133]] NGTM: Substructure-based Neural Graph Topic Model for Interpretable Graph Generation(https://arxiv.org/abs/2507.13133)
Keywords: generation, generative
Abstract: Graph generation plays a pivotal role across numerous domains, including molecular design and knowledge graph construction. Although existing methods achieve considerable success in generating realistic graphs, their interpretability remains limited, often obscuring the rationale behind structural decisions. To address this challenge, we propose the Neural Graph Topic Model (NGTM), a novel generative framework inspired by topic modeling in natural language processing. NGTM represents graphs as mixtures of latent topics, each defining a distribution over semantically meaningful substructures, which facilitates explicit interpretability at both local and global scales. The generation process transparently integrates these topic distributions with a global structural variable, enabling clear semantic tracing of each generated graph. Experiments demonstrate that NGTM achieves competitive generation quality while uniquely enabling fine-grained control and interpretability, allowing users to tune structural features or induce biological properties through topic-level adjustments.

Title: Orbis: Overcoming Challenges of Long-Horizon Prediction in Driving World Models

Authors: Arian Mousakhan, Sudhanshu Mittal, Silvio Galesso, Karim Farid, Thomas Brox
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2507.13162
Pdf URL: https://arxiv.org/pdf/2507.13162
Copy Paste: [[2507.13162]] Orbis: Overcoming Challenges of Long-Horizon Prediction in Driving World Models(https://arxiv.org/abs/2507.13162)
Keywords: generation
Abstract: Existing world models for autonomous driving struggle with long-horizon generation and generalization to challenging scenarios. In this work, we develop a model using simple design choices, and without additional supervision or sensors, such as maps, depth, or multiple cameras. We show that our model yields state-of-the-art performance, despite having only 469M parameters and being trained on 280h of video data. It particularly stands out in difficult scenarios like turning maneuvers and urban traffic. We test whether discrete token models possibly have advantages over continuous models based on flow matching. To this end, we set up a hybrid tokenizer that is compatible with both approaches and allows for a side-by-side comparison. Our study concludes in favor of the continuous autoregressive model, which is less brittle on individual design choices and more powerful than the model built on discrete tokens. Code, models and qualitative results are publicly available at this https URL.

Title: Synthesizing Reality: Leveraging the Generative AI-Powered Platform Midjourney for Construction Worker Detection

Authors: Hongyang Zhao, Tianyu Liang, Sina Davari, Daeho Kim
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2507.13221
Pdf URL: https://arxiv.org/pdf/2507.13221
Copy Paste: [[2507.13221]] Synthesizing Reality: Leveraging the Generative AI-Powered Platform Midjourney for Construction Worker Detection(https://arxiv.org/abs/2507.13221)
Keywords: generative
Abstract: While recent advancements in deep neural networks (DNNs) have substantially enhanced visual AI's capabilities, the challenge of inadequate data diversity and volume remains, particularly in the construction domain. This study presents a novel image synthesis methodology tailored for construction worker detection, leveraging the generative-AI platform Midjourney. The approach entails generating a collection of 12,000 synthetic images by formulating 3000 different prompts, with an emphasis on image realism and diversity. These images, after manual labeling, serve as a dataset for DNN training. Evaluation on a real construction image dataset yielded promising results, with the model attaining average precisions (APs) of 0.937 and 0.642 at intersection-over-union (IoU) thresholds of 0.5 and 0.5 to 0.95, respectively. Notably, the model demonstrated near-perfect performance on the synthetic dataset, achieving APs of 0.994 and 0.919 at the two mentioned thresholds. These findings reveal both the potential and weakness of generative AI in addressing DNN training data scarcity.
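The reported APs depend on the IoU threshold used to decide whether a predicted box matches a labeled worker. The sketch below computes IoU for axis-aligned boxes and counts how many thresholds in the 0.5 to 0.95 range a single detection would pass. The 0.05 step follows the common COCO convention and is an assumption, since the abstract gives only the range; the boxes are made up.

```python
def iou(box_a, box_b):
    """Intersection over union for boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Thresholds 0.50, 0.55, ..., 0.95 (assumed COCO-style step of 0.05).
thresholds = [(50 + 5 * i) / 100 for i in range(10)]

ground_truth = (0.0, 0.0, 10.0, 10.0)   # hypothetical labeled worker box
prediction = (0.0, 0.0, 10.0, 8.0)      # hypothetical detector output

overlap = iou(ground_truth, prediction)
passed = [t for t in thresholds if overlap >= t]
```

A detection with IoU 0.8 counts as correct at the 0.5 threshold but fails the three strictest thresholds, which is why AP averaged over 0.5 to 0.95 is lower than AP at 0.5 alone.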

Title: Leveraging Pre-Trained Visual Models for AI-Generated Video Detection

Authors: Keerthi Veeramachaneni, Praveen Tirupattur, Amrit Singh Bedi, Mubarak Shah
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.13224
Pdf URL: https://arxiv.org/pdf/2507.13224
Copy Paste: [[2507.13224]] Leveraging Pre-Trained Visual Models for AI-Generated Video Detection(https://arxiv.org/abs/2507.13224)
Keywords: generation, generative
Abstract: Recent advances in Generative AI (GenAI) have led to significant improvements in the quality of generated visual content. As AI-generated visual content becomes increasingly indistinguishable from real content, the challenge of detecting the generated content becomes critical in combating misinformation, ensuring privacy, and preventing security threats. Although there has been substantial progress in detecting AI-generated images, current methods for video detection are largely focused on deepfakes, which primarily involve human faces. However, the field of video generation has advanced beyond deepfakes, creating an urgent need for methods capable of detecting AI-generated videos with generic content. To address this gap, we propose a novel approach that leverages pre-trained visual models to distinguish between real and generated videos. The features extracted from these pre-trained models, which have been trained on extensive real visual content, contain inherent signals that can help distinguish real from generated videos. Using these extracted features, we achieve high detection performance without requiring additional model training, and we further improve performance by training a simple linear classification layer on top of the extracted features. We validated our method on a dataset we compiled (VID-AID), which includes around 10,000 AI-generated videos produced by 9 different text-to-video models, along with 4,000 real videos, totaling over 7 hours of video content. Our evaluation shows that our approach achieves high detection accuracy, above 90% on average, underscoring its effectiveness. Upon acceptance, we plan to publicly release the code, the pre-trained models, and our dataset to support ongoing research in this critical area.

Title: VITA: Vision-to-Action Flow Matching Policy

Authors: Dechen Gao, Boqi Zhao, Andrew Lee, Ian Chuang, Hanchu Zhou, Hang Wang, Zhe Zhao, Junshan Zhang, Iman Soltani
Subjects: cs.CV, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2507.13231
Pdf URL: https://arxiv.org/pdf/2507.13231
Copy Paste: [[2507.13231]] VITA: Vision-to-Action Flow Matching Policy(https://arxiv.org/abs/2507.13231)
Keywords: generation, generative
Abstract: We present VITA, a Vision-To-Action flow matching policy that evolves latent visual representations into latent actions for visuomotor control. Traditional flow matching and diffusion policies sample from standard source distributions (e.g., Gaussian noise) and require additional conditioning mechanisms like cross-attention to condition action generation on visual information, creating time and space overheads. VITA proposes a novel paradigm that treats latent images as the flow source, learning an inherent mapping from vision to action while eliminating separate conditioning modules and preserving generative modeling capabilities. Learning flows between fundamentally different modalities like vision and action is challenging due to sparse action data lacking semantic structures and dimensional mismatches between high-dimensional visual representations and raw actions. We address this by creating a structured action latent space via an autoencoder as the flow matching target, up-sampling raw actions to match visual representation shapes. Crucially, we supervise flow matching with both encoder targets and final action outputs through flow latent decoding, which backpropagates action reconstruction loss through sequential flow matching ODE solving steps for effective end-to-end learning. Implemented as simple MLP layers, VITA is evaluated on challenging bi-manual manipulation tasks on the ALOHA platform, including 5 simulation and 2 real-world tasks. Despite its simplicity, MLP-only VITA outperforms or matches state-of-the-art generative policies while reducing inference latency by 50-130% compared to conventional flow matching policies requiring different conditioning mechanisms or complex architectures. To our knowledge, VITA is the first MLP-only flow matching policy capable of solving complex bi-manual manipulation tasks like those in ALOHA benchmarks.
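The "sequential flow matching ODE solving steps" in the abstract can be pictured with a toy forward-Euler integrator that transports a source vector to a target vector. In VITA the velocity would come from a learned MLP and the source would be a latent image; here an oracle velocity for the straight-line path stands in for the network, and all vectors and names are made-up placeholders.

```python
def euler_flow(x0, velocity, num_steps):
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with forward Euler."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

source = [0.0, 1.0, -1.0]   # stand-in for a latent visual representation
target = [2.0, -1.0, 0.5]   # stand-in for a latent action

def oracle_velocity(x, t):
    # For the straight-line path x_t = (1 - t) * source + t * target,
    # the velocity is constant and equal to target - source.
    # A trained policy would predict this from (x, t) instead.
    return [b - a for a, b in zip(source, target)]

result = euler_flow(source, oracle_velocity, num_steps=4)
```

Because every Euler step is a plain differentiable update, a loss computed on the final output can in principle be backpropagated through all steps, which is the mechanism the abstract describes as flow latent decoding.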

Title: Leveraging Asynchronous Cross-border Market Data for Improved Day-Ahead Electricity Price Forecasting in European Markets

Authors: Maria Margarida Mascarenhas, Jilles De Blauwe, Mikael Amelin, Hussain Kazmi
Subjects: cs.LG, eess.SY
Abstract URL: https://arxiv.org/abs/2507.13250
Pdf URL: https://arxiv.org/pdf/2507.13250
Copy Paste: [[2507.13250]] Leveraging Asynchronous Cross-border Market Data for Improved Day-Ahead Electricity Price Forecasting in European Markets(https://arxiv.org/abs/2507.13250)
Keywords: generation
Abstract: Accurate short-term electricity price forecasting is crucial for strategically scheduling demand and generation bids in day-ahead markets. While data-driven techniques have shown considerable prowess in achieving high forecast accuracy in recent years, they rely heavily on the quality of input covariates. In this paper, we investigate whether asynchronously published prices as a result of differing gate closure times (GCTs) in some bidding zones can improve forecasting accuracy in other markets with later GCTs. Using a state-of-the-art ensemble of models, we show significant improvements of 22% and 9% in forecast accuracy in the Belgian (BE) and Swedish bidding zones (SE3) respectively, when including price data from interconnected markets with earlier GCT (Germany-Luxembourg, Austria, and Switzerland). This improvement holds for both general and extreme market conditions. Our analysis also yields further important insights: frequent model recalibration is necessary for maximum accuracy but comes at substantial additional computational costs, and using data from more markets does not always lead to better performance - a fact we delve deeper into with interpretability analysis of the forecast models. Overall, these findings provide valuable guidance for market participants and decision-makers aiming to optimize bidding strategies within increasingly interconnected and volatile European energy markets.

Title: FashionPose: Text to Pose to Relight Image Generation for Personalized Fashion Visualization

Authors: Chuancheng Shi, Yixiang Chen, Burong Lei, Jichao Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.13311
Pdf URL: https://arxiv.org/pdf/2507.13311
Copy Paste: [[2507.13311]] FashionPose: Text to Pose to Relight Image Generation for Personalized Fashion Visualization(https://arxiv.org/abs/2507.13311)
Keywords: generation
Abstract: Realistic and controllable garment visualization is critical for fashion e-commerce, where users expect personalized previews under diverse poses and lighting conditions. Existing methods often rely on predefined poses, limiting semantic flexibility and illumination adaptability. To address this, we introduce FashionPose, the first unified text-to-pose-to-relighting generation framework. Given a natural language description, our method first predicts a 2D human pose, then employs a diffusion model to generate high-fidelity person images, and finally applies a lightweight relighting module, all guided by the same textual input. By replacing explicit pose annotations with text-driven conditioning, FashionPose enables accurate pose alignment, faithful garment rendering, and flexible lighting control. Experiments demonstrate fine-grained pose synthesis and efficient, consistent relighting, providing a practical solution for personalized virtual fashion display.

Title: Taming Diffusion Transformer for Real-Time Mobile Video Generation

Authors: Yushu Wu, Yanyu Li, Anil Kag, Ivan Skorokhodov, Willi Menapace, Ke Ma, Arpit Sahni, Ju Hu, Aliaksandr Siarohin, Dhritiman Sagar, Yanzhi Wang, Sergey Tulyakov
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2507.13343
Pdf URL: https://arxiv.org/pdf/2507.13343
Copy Paste: [[2507.13343]] Taming Diffusion Transformer for Real-Time Mobile Video Generation(https://arxiv.org/abs/2507.13343)
Keywords: generation
Abstract: Diffusion Transformers (DiT) have shown strong performance in video generation tasks, but their high computational cost makes them impractical for resource-constrained devices like smartphones, and real-time generation is even more challenging. In this work, we propose a series of novel optimizations to significantly accelerate video generation and enable real-time performance on mobile platforms. First, we employ a highly compressed variational autoencoder (VAE) to reduce the dimensionality of the input data without sacrificing visual quality. Second, we introduce a KD-guided, sensitivity-aware tri-level pruning strategy to shrink the model size to suit mobile platforms while preserving critical performance characteristics. Third, we develop an adversarial step distillation technique tailored for DiT, which allows us to reduce the number of inference steps to four. Combined, these optimizations enable our model to achieve over 10 frames per second (FPS) generation on an iPhone 16 Pro Max, demonstrating the feasibility of real-time, high-quality video generation on mobile devices.

Title: Imbalance in Balance: Online Concept Balancing in Generation Models

Authors: Yukai Shi, Jiarong Ou, Rui Chen, Haotian Yang, Jiahao Wang, Xin Tao, Pengfei Wan, Di Zhang, Kun Gai
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2507.13345
Pdf URL: https://arxiv.org/pdf/2507.13345
Copy Paste: [[2507.13345]] Imbalance in Balance: Online Concept Balancing in Generation Models(https://arxiv.org/abs/2507.13345)
Keywords: generation
Abstract: In visual generation tasks, the responses to complex concepts and their combinations often lack stability and are error-prone, which remains an under-explored area. In this paper, we explore the causal factors behind poor concept responses through carefully designed experiments. We also design a concept-wise equalization loss function (IMBA loss) to address this issue. Our proposed method is online, eliminating the need for offline dataset processing, and requires minimal code changes. On our newly proposed complex concept benchmark Inert-CompBench and two other public test sets, our method significantly enhances the concept response capability of baseline models and yields highly competitive results with only a few lines of code.

Title: AutoPartGen: Autogressive 3D Part Generation and Discovery

Authors: Minghao Chen, Jianyuan Wang, Roman Shapovalov, Tom Monnier, Hyunyoung Jung, Dilin Wang, Rakesh Ranjan, Iro Laina, Andrea Vedaldi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.13346
Pdf URL: https://arxiv.org/pdf/2507.13346
Copy Paste: [[2507.13346]] AutoPartGen: Autogressive 3D Part Generation and Discovery(https://arxiv.org/abs/2507.13346)
Keywords: generation
Abstract: We introduce AutoPartGen, a model that generates objects composed of 3D parts in an autoregressive manner. This model can take as input an image of an object, 2D masks of the object's parts, or an existing 3D object, and generate a corresponding compositional 3D reconstruction. Our approach builds upon 3DShape2VecSet, a recent latent 3D representation with powerful geometric expressiveness. We observe that this latent space exhibits strong compositional properties, making it particularly well-suited for part-based generation tasks. Specifically, AutoPartGen generates object parts autoregressively, predicting one part at a time while conditioning on previously generated parts and additional inputs, such as 2D images, masks, or 3D objects. This process continues until the model decides that all parts have been generated, thus automatically determining the type and number of parts. The resulting parts can be seamlessly assembled into coherent objects or scenes without requiring additional optimization. We evaluate both the overall 3D generation capabilities and the part-level generation quality of AutoPartGen, demonstrating that it achieves state-of-the-art performance in 3D part generation.
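Editor's sketch: the autoregressive loop described in the abstract above (predict one part, condition on the parts generated so far, stop when the model signals completion) can be illustrated in a few lines. This is not the authors' code; `generate_parts`, `toy_predictor`, the string-valued `Part` type, and the `max_parts` safety cap are all hypothetical stand-ins chosen for illustration.

```python
from typing import Callable, List, Optional

Part = str  # hypothetical placeholder; a real system would emit a latent 3D part


def generate_parts(
    predict_next: Callable[[List[Part]], Optional[Part]],
    max_parts: int = 32,
) -> List[Part]:
    """Generate parts one at a time until the predictor returns None."""
    parts: List[Part] = []
    while len(parts) < max_parts:  # safety cap (an assumption, not from the abstract)
        next_part = predict_next(parts)  # conditioned on previously generated parts
        if next_part is None:  # the predictor signals that the object is complete
            break
        parts.append(next_part)
    return parts


def toy_predictor(previous: List[Part]) -> Optional[Part]:
    """Hypothetical stand-in for a trained model: emits the parts of a chair in order."""
    chair = ["seat", "back", "leg", "leg", "leg", "leg"]
    return chair[len(previous)] if len(previous) < len(chair) else None


parts = generate_parts(toy_predictor)
```

The number of parts is never passed in; it falls out of where the predictor chooses to stop, which is the property the abstract highlights.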

Title: Hierarchical Rectified Flow Matching with Mini-Batch Couplings

Authors: Yichi Zhang, Yici Yan, Alex Schwing, Zhizhen Zhao
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2507.13350
Pdf URL: https://arxiv.org/pdf/2507.13350
Copy Paste: [[2507.13350]] Hierarchical Rectified Flow Matching with Mini-Batch Couplings(https://arxiv.org/abs/2507.13350)
Keywords: generative
Abstract: Flow matching has emerged as a compelling generative modeling approach that is widely used across domains. To generate data via a flow matching model, an ordinary differential equation (ODE) is numerically solved via forward integration of the modeled velocity field. To better capture the multi-modality that is inherent in typical velocity fields, hierarchical flow matching was recently introduced. It uses a hierarchy of ODEs that are numerically integrated when generating data. This hierarchy of ODEs captures the multi-modal velocity distribution, just as vanilla flow matching can model a multi-modal data distribution. While this hierarchy makes it possible to model multi-modal velocity distributions, the complexity of the modeled distribution remains identical across levels of the hierarchy. In this paper, we study how to gradually adjust the complexity of the distributions across different levels of the hierarchy via mini-batch couplings. We show the benefits of mini-batch couplings in hierarchical rectified flow matching via compelling results on synthetic and imaging data. Code is available at this https URL.
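Editor's sketch: the sampling step mentioned in the abstract above, forward integration of a velocity field from t = 0 to t = 1, can be illustrated with a plain forward Euler loop. This is a minimal one-dimensional example, not the paper's hierarchical method; `euler_sample` and `toy_velocity` are hypothetical names, and the hand-written velocity field is an assumption standing in for a trained network.

```python
import random
from typing import Callable


def euler_sample(
    velocity: Callable[[float, float], float],
    x0: float,
    num_steps: int = 100,
) -> float:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with forward Euler."""
    x = x0
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1, so toy_velocity never divides by zero
        x = x + dt * velocity(x, t)
    return x


def toy_velocity(x: float, t: float, target: float = 2.0) -> float:
    """Hypothetical stand-in for a learned field: moves any x so that it reaches `target` at t=1."""
    return (target - x) / (1.0 - t)


random.seed(0)
noise = [random.gauss(0.0, 1.0) for _ in range(5)]  # source samples at t=0
samples = [euler_sample(toy_velocity, x0) for x0 in noise]  # transported samples at t=1
```

With this toy field every noise sample is carried to the same point, so the resulting distribution is a single spike at `target`; a trained velocity field would instead map noise onto the data distribution.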