2025-08-20

Title: Contextual Attention-Based Multimodal Fusion of LLM and CNN for Sentiment Analysis

Authors: Meriem Zerkouk, Miloud Mihoubi, Belkacem Chikhaoui
Subjects: cs.LG, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2508.13196
Pdf URL: https://arxiv.org/pdf/2508.13196
Copy Paste: [[2508.13196]] Contextual Attention-Based Multimodal Fusion of LLM and CNN for Sentiment Analysis(https://arxiv.org/abs/2508.13196)
Keywords: generative
Abstract: This paper introduces a novel approach for multimodal sentiment analysis on social media, particularly in the context of natural disasters, where understanding public sentiment is crucial for effective crisis management. Unlike conventional methods that process text and image modalities separately, our approach integrates Convolutional Neural Network (CNN)-based image analysis with Large Language Model (LLM)-based text processing, leveraging Generative Pre-trained Transformer (GPT) and prompt engineering to extract sentiment-relevant features from the CrisisMMD dataset. To model intermodal relationships, we introduce a contextual attention mechanism within the fusion process. Through contextual-attention layers, this mechanism captures intermodality interactions, enhancing the model's comprehension of complex relationships between textual and visual data. The deep neural network architecture of our model learns from these fused features, leading to improved accuracy compared to existing baselines. Experimental results demonstrate significant advancements in classifying social media data into informative and non-informative categories across various natural disasters. Our model achieves a 2.43% increase in accuracy and a 5.18% increase in F1-score, highlighting its efficacy in processing complex multimodal data. Beyond quantitative metrics, our approach provides deeper insight into the sentiments expressed during crises. The practical implications extend to real-time disaster management, where enhanced sentiment analysis can optimize the accuracy of emergency interventions. By bridging the gap between multimodal analysis, LLM-powered text understanding, and disaster response, our work presents a promising direction for Artificial Intelligence (AI)-driven crisis management solutions.
Summary: The paper proposes a multimodal sentiment analysis method for social media posts about natural disasters. It fuses CNN-based image features with GPT-based text features extracted from the CrisisMMD dataset through a contextual attention mechanism, and reports gains of 2.43% in accuracy and 5.18% in F1-score over existing baselines when classifying posts as informative or non-informative.
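
The fusion step described above can be illustrated with a generic cross-attention sketch. This is not the authors' architecture: the abstract does not specify the layer design, so the scaled dot-product form, the function name `cross_attention_fuse`, and the choice to concatenate text features with attended image features are all assumptions made for illustration.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_fuse(text_feats, image_feats):
    """Hypothetical fusion: text tokens attend over image regions.

    text_feats: (n_text, d), image_feats: (n_img, d).
    Returns the fused features (n_text, 2 * d) and the attention weights.
    """
    d = text_feats.shape[-1]
    weights = softmax(text_feats @ image_feats.T / np.sqrt(d))  # (n_text, n_img)
    attended = weights @ image_feats                             # (n_text, d)
    fused = np.concatenate([text_feats, attended], axis=-1)
    return fused, weights

rng = np.random.default_rng(0)
fused, weights = cross_attention_fuse(rng.normal(size=(4, 8)), rng.normal(size=(6, 8)))
```

Each row of `weights` sums to one, so every text token receives a convex combination of image-region features before the fused vector is passed to a classifier.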

Title: Strategies for training point distributions in physics-informed neural networks

Authors: Santosh Humagain, Toni Schneidereit
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.13216
Pdf URL: https://arxiv.org/pdf/2508.13216
Copy Paste: [[2508.13216]] Strategies for training point distributions in physics-informed neural networks(https://arxiv.org/abs/2508.13216)
Keywords: generation
Abstract: Physics-informed neural networks approach the approximation of differential equations by directly incorporating their structure and given conditions in a loss function. This enables conditions such as invariants to be easily added during the modelling phase. In addition, the approach can be considered mesh-free and can be utilised to compute solutions on arbitrary grids after the training phase. Therefore, physics-informed neural networks are emerging as a promising alternative to solving differential equations with methods from numerical mathematics. However, their performance highly depends on a large variety of factors. In this paper, we systematically investigate and evaluate a core component of the approach, namely the training point distribution. We test two ordinary and two partial differential equations with five strategies for training data generation and shallow network architectures, with one and two hidden layers. In addition to common distributions, we introduce sine-based training points, which are motivated by the construction of Chebyshev nodes. The results are challenged by using certain parameter combinations, for example random and fixed-seed weight initialisation for reproducibility. The results show the impact of the training point distributions on the solution accuracy, and we find evidence that they are connected to the characteristics of the differential equation.
Summary: The paper systematically evaluates how the training point distribution affects physics-informed neural networks. Two ordinary and two partial differential equations are tested with five point-generation strategies, including newly introduced sine-based points motivated by Chebyshev nodes, on shallow networks with one and two hidden layers. The results indicate that the effect on solution accuracy is connected to the characteristics of the differential equation.
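
To make the notion of a training point distribution concrete, here is a minimal sketch of three common one-dimensional choices: equidistant, seeded random, and Chebyshev nodes. The function names and the interval defaults are assumptions for illustration. The paper's sine-based points are not defined in the abstract, so they are deliberately not reproduced here.

```python
import numpy as np

def uniform_points(n, a=0.0, b=1.0):
    """Equidistant points on [a, b], endpoints included."""
    return np.linspace(a, b, n)

def random_points(n, a=0.0, b=1.0, seed=0):
    """Sorted uniformly random points on [a, b] with a fixed seed."""
    rng = np.random.default_rng(seed)
    return np.sort(rng.uniform(a, b, n))

def chebyshev_points(n, a=0.0, b=1.0):
    """Chebyshev nodes of the first kind, mapped from [-1, 1] to [a, b]."""
    k = np.arange(1, n + 1)
    x = np.cos((2 * k - 1) * np.pi / (2 * n))  # nodes in (-1, 1), descending
    return np.sort(0.5 * (a + b) + 0.5 * (b - a) * x)

pts = chebyshev_points(5, 0.0, 2.0)
```

Chebyshev nodes cluster towards the interval boundaries, which is the property that distinguishes them from the equidistant grid.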

Title: MIRAGE: Towards AI-Generated Image Detection in the Wild

Authors: Cheng Xia, Manxi Lin, Jiexiang Tan, Xiaoxiong Du, Yang Qiu, Junjun Zheng, Xiangheng Kong, Yuning Jiang, Bo Zheng
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.13223
Pdf URL: https://arxiv.org/pdf/2508.13223
Copy Paste: [[2508.13223]] MIRAGE: Towards AI-Generated Image Detection in the Wild(https://arxiv.org/abs/2508.13223)
Keywords: generative
Abstract: The spreading of AI-generated images (AIGI), driven by advances in generative AI, poses a significant threat to information security and public trust. Existing AIGI detectors, while effective against images in clean laboratory settings, fail to generalize to in-the-wild scenarios. These real-world images are noisy, varying from "obviously fake" images to realistic ones derived from multiple generative models and further edited for quality control. We address in-the-wild AIGI detection in this paper. We introduce Mirage, a challenging benchmark designed to emulate the complexity of in-the-wild AIGI. Mirage is constructed from two sources: (1) a large corpus of Internet-sourced AIGI verified by human experts, and (2) a synthesized dataset created through the collaboration between multiple expert generators, closely simulating the realistic AIGI in the wild. Building on this benchmark, we propose Mirage-R1, a vision-language model with heuristic-to-analytic reasoning, a reflective reasoning mechanism for AIGI detection. Mirage-R1 is trained in two stages: a supervised fine-tuning cold start, followed by a reinforcement learning stage. By further adopting an inference-time adaptive thinking strategy, Mirage-R1 is able to provide either a quick judgment or a more robust and accurate conclusion, effectively balancing inference speed and performance. Extensive experiments show that our model leads state-of-the-art detectors by 5% and 10% on Mirage and the public benchmark, respectively. The benchmark and code will be made publicly available.
Summary: Existing detectors of AI-generated images (AIGI) work in clean laboratory settings but fail to generalize to in-the-wild images. The paper introduces Mirage, a benchmark built from expert-verified Internet AIGI and a synthesized dataset produced by multiple collaborating generators, and Mirage-R1, a vision-language model with heuristic-to-analytic reasoning that is trained by supervised fine-tuning followed by reinforcement learning. With an inference-time adaptive thinking strategy, the model leads state-of-the-art detectors by 5% on Mirage and 10% on the public benchmark.

Title: DianJin-OCR-R1: Enhancing OCR Capabilities via a Reasoning-and-Tool Interleaved Vision-Language Model

Authors: Qian Chen, Xianyin Zhang, Lifan Guo, Feng Chen, Chi Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.13238
Pdf URL: https://arxiv.org/pdf/2508.13238
Copy Paste: [[2508.13238]] DianJin-OCR-R1: Enhancing OCR Capabilities via a Reasoning-and-Tool Interleaved Vision-Language Model(https://arxiv.org/abs/2508.13238)
Keywords: generative
Abstract: Recent advances in large vision-language models (LVLMs) have enabled a new paradigm of end-to-end document image parsing, excelling in Optical Character Recognition (OCR) tasks such as text, table, and formula recognition. However, generative LVLMs, similarly to large language models (LLMs), are prone to hallucinations, that is, generating words that do not exist in input images. Furthermore, LVLMs are designed for general purposes and tend to be less effective on OCR tasks compared to expert models that are trained on domain-specific datasets. In this paper, we propose DianJin-OCR-R1, a reasoning-enhanced framework designed to address these limitations through training reasoning-and-tool interleaved VLMs. Given a recognition instruction, our DianJin-OCR-R1 model first recognizes the content in the input image using its own OCR capabilities, then calls other tools (i.e., other expert models) to obtain their results as references, and finally looks at the image again and rethinks the reasoning process to provide the final recognized content. Since the architectures of expert models are tailored for specific OCR tasks, which makes them less prone to hallucinations, their results can help VLMs mitigate hallucinations. Additionally, expert models are typically smaller in scale and easy to iterate, enabling performance improvements for VLMs at a lower cost. We evaluate our model on ReST and OmniDocBench, and experimental results show that our DianJin-OCR-R1 models consistently outperform their non-reasoning counterparts and expert OCR models, which demonstrates the effectiveness of our method.
Summary: Generative large vision-language models are prone to hallucinating text that is absent from the input image and are less effective on OCR than domain-specific expert models. DianJin-OCR-R1 trains reasoning-and-tool interleaved VLMs: the model first recognizes the content itself, then calls expert models for reference results, and finally re-examines the image and its reasoning before giving the final output. On ReST and OmniDocBench, the DianJin-OCR-R1 models consistently outperform their non-reasoning counterparts and expert OCR models.

Title: GaitCrafter: Diffusion Model for Biometric Preserving Gait Synthesis

Authors: Sirshapan Mitra, Yogesh S. Rawat
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.13300
Pdf URL: https://arxiv.org/pdf/2508.13300
Copy Paste: [[2508.13300]] GaitCrafter: Diffusion Model for Biometric Preserving Gait Synthesis(https://arxiv.org/abs/2508.13300)
Keywords: generation, generative
Abstract: Gait recognition is a valuable biometric task that enables the identification of individuals from a distance based on their walking patterns. However, it remains limited by the lack of large-scale labeled datasets and the difficulty of collecting diverse gait samples for each individual while preserving privacy. To address these challenges, we propose GaitCrafter, a diffusion-based framework for synthesizing realistic gait sequences in the silhouette domain. Unlike prior works that rely on simulated environments or alternative generative models, GaitCrafter trains a video diffusion model from scratch, exclusively on gait silhouette data. Our approach enables the generation of temporally consistent and identity-preserving gait sequences. Moreover, the generation process is controllable, allowing conditioning on various covariates such as clothing, carried objects, and view angle. We show that incorporating synthetic samples generated by GaitCrafter into the gait recognition pipeline leads to improved performance, especially under challenging conditions. Additionally, we introduce a mechanism to generate novel identities (synthetic individuals not present in the original dataset) by interpolating identity embeddings. These novel identities exhibit unique, consistent gait patterns and are useful for training models while maintaining privacy of real subjects. Overall, our work takes an important step toward leveraging diffusion models for high-quality, controllable, and privacy-aware gait data generation.
Summary: Gait recognition is limited by the lack of large-scale labeled datasets and by the difficulty of collecting diverse samples while preserving privacy. GaitCrafter trains a video diffusion model from scratch on gait silhouette data to generate temporally consistent, identity-preserving sequences that can be conditioned on clothing, carried objects, and view angle. Adding its synthetic samples to the recognition pipeline improves performance, and interpolating identity embeddings yields novel synthetic identities that are not present in the original dataset.

Title: Efficient Constraint-Aware Flow Matching via Randomized Exploration

Authors: Zhengyan Huan, Jacob Boerma, Li-Ping Liu, Shuchin Aeron
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.13316
Pdf URL: https://arxiv.org/pdf/2508.13316
Copy Paste: [[2508.13316]] Efficient Constraint-Aware Flow Matching via Randomized Exploration(https://arxiv.org/abs/2508.13316)
Keywords: generation
Abstract: We consider the problem of generating samples via Flow Matching (FM) with an additional requirement that the generated samples must satisfy given constraints. We consider two scenarios, viz.: (a) when a differentiable distance function to the constraint set is given, and (b) when the constraint set is only available via queries to a membership oracle. For case (a), we propose a simple adaptation of the FM objective with an additional term that penalizes the distance between the constraint set and the generated samples. For case (b), we propose to employ randomization and learn a mean flow that is numerically shown to have a high likelihood of satisfying the constraints. This approach deviates significantly from existing works that require simple convex constraints, knowledge of a barrier function, or a reflection mechanism to constrain the probability flow. Furthermore, in the proposed setting we show that a two-stage approach, where both stages approximate the same original flow but with only the second stage probing the constraints via randomization, is more computationally efficient. Through several synthetic cases of constrained generation, we numerically show that the proposed approaches achieve significant gains in terms of constraint satisfaction while matching the target distributions. As a showcase for a practical oracle-based constraint, we show how our approach can be used for training an adversarial example generator, using queries to a hard-label black-box classifier. We conclude with several future research directions. Our code is available at this https URL.
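Summary: The paper studies Flow Matching (FM) under the requirement that generated samples satisfy given constraints. When a differentiable distance function to the constraint set is available, the FM objective is extended with a term that penalizes the distance between the constraint set and the generated samples. When only a membership oracle is available, randomization is used to learn a mean flow that is shown numerically to satisfy the constraints with high likelihood, and a two-stage approach in which only the second stage probes the constraints is more computationally efficient. The method is showcased by training an adversarial example generator with queries to a hard-label black-box classifier.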
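
Case (a) above, an FM objective plus a distance penalty, can be sketched as follows. This is an illustration under stated assumptions rather than the authors' formulation: the linear interpolation path, the one-step endpoint estimate, the unit-ball constraint set, and the names `penalized_fm_loss` and `lam` are all hypothetical, and the NumPy code computes only the forward value of the loss (no gradients).

```python
import numpy as np

def distance_to_unit_ball(x):
    """Euclidean distance from each row of x to the closed unit ball."""
    return np.maximum(np.linalg.norm(x, axis=-1) - 1.0, 0.0)

def penalized_fm_loss(v_pred, x0, x1, t, lam=1.0):
    """Conditional FM loss on a linear path plus a constraint penalty.

    x0: noise samples (n, d), x1: data samples (n, d), t: times in [0, 1] (n, 1),
    v_pred: model velocity evaluated at x_t = (1 - t) * x0 + t * x1.
    """
    x_t = (1.0 - t) * x0 + t * x1
    target = x1 - x0                       # velocity of the linear path
    fm = np.mean(np.sum((v_pred - target) ** 2, axis=-1))
    x1_hat = x_t + (1.0 - t) * v_pred      # one-step estimate of the endpoint
    penalty = np.mean(distance_to_unit_ball(x1_hat))
    return fm + lam * penalty

x0 = np.array([[0.3, -0.2], [1.0, 1.0]])
x1 = np.array([[0.5, 0.0], [0.0, 0.5]])
t = np.array([[0.25], [0.75]])
loss_inside = penalized_fm_loss(x1 - x0, x0, x1, t)
```

With a perfect velocity prediction the endpoint estimate equals `x1`, so the loss reduces to `lam` times the mean distance of the data from the constraint set; it is zero when the data already lie inside the set.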

Title: X-MoE: Enabling Scalable Training for Emerging Mixture-of-Experts Architectures on HPC Platforms

Authors: Yueming Yuan, Ahan Gupta, Jianping Li, Sajal Dash, Feiyi Wang, Minjia Zhang
Subjects: cs.LG, cs.CL, cs.DC
Abstract URL: https://arxiv.org/abs/2508.13337
Pdf URL: https://arxiv.org/pdf/2508.13337
Copy Paste: [[2508.13337]] X-MoE: Enabling Scalable Training for Emerging Mixture-of-Experts Architectures on HPC Platforms(https://arxiv.org/abs/2508.13337)
Keywords: generation
Abstract: Emerging expert-specialized Mixture-of-Experts (MoE) architectures, such as DeepSeek-MoE, deliver strong model quality through fine-grained expert segmentation and large top-k routing. However, their scalability is limited by substantial activation memory overhead and costly all-to-all communication. Furthermore, current MoE training systems - primarily optimized for NVIDIA GPUs - perform suboptimally on non-NVIDIA platforms, leaving significant computational potential untapped. In this work, we present X-MoE, a novel MoE training system designed to deliver scalable training performance for next-generation MoE architectures. X-MoE achieves this via several novel techniques, including efficient padding-free MoE training with cross-platform kernels, redundancy-bypassing dispatch, and hybrid parallelism with sequence-sharded MoE blocks. Our evaluation on the Frontier supercomputer, powered by AMD MI250X GPUs, shows that X-MoE scales DeepSeek-style MoEs up to 545 billion parameters across 1024 GPUs - 10x larger than the largest trainable model with existing methods under the same hardware budget, while maintaining high training throughput. The source code of X-MoE is available at this https URL.
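Summary: Expert-specialized Mixture-of-Experts (MoE) architectures such as DeepSeek-MoE are limited in scalability by activation memory overhead and costly all-to-all communication, and current training systems, primarily optimized for NVIDIA GPUs, perform suboptimally on other platforms. X-MoE combines padding-free MoE training with cross-platform kernels, redundancy-bypassing dispatch, and hybrid parallelism with sequence-sharded MoE blocks. On the Frontier supercomputer with AMD MI250X GPUs, it scales DeepSeek-style MoEs up to 545 billion parameters across 1024 GPUs, 10x larger than the largest model trainable with existing methods under the same hardware budget, while maintaining high training throughput. The source code of X-MoE is available at this https URL.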

Title: Counterfactual Probabilistic Diffusion with Expert Models

Authors: Wenhao Mu, Zhi Cao, Mehmed Uludag, Alexander Rodríguez
Subjects: cs.LG, cs.AI, stat.ME
Abstract URL: https://arxiv.org/abs/2508.13355
Pdf URL: https://arxiv.org/pdf/2508.13355
Copy Paste: [[2508.13355]] Counterfactual Probabilistic Diffusion with Expert Models(https://arxiv.org/abs/2508.13355)
Keywords: generative
Abstract: Predicting counterfactual distributions in complex dynamical systems is essential for scientific modeling and decision-making in domains such as public health and medicine. However, existing methods often rely on point estimates or purely data-driven models, which tend to falter under data scarcity. We propose a time series diffusion-based framework that incorporates guidance from imperfect expert models by extracting high-level signals to serve as structured priors for generative modeling. Our method, ODE-Diff, bridges mechanistic and data-driven approaches, enabling more reliable and interpretable causal inference. We evaluate ODE-Diff across semi-synthetic COVID-19 simulations, synthetic pharmacological dynamics, and real-world case studies, demonstrating that it consistently outperforms strong baselines in both point prediction and distributional accuracy.
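Summary: Predicting counterfactual distributions in complex dynamical systems matters for decision-making in domains such as public health and medicine, but existing methods often rely on point estimates or purely data-driven models that falter under data scarcity. ODE-Diff is a time series diffusion framework that extracts high-level signals from imperfect expert models to serve as structured priors for generative modeling, bridging mechanistic and data-driven approaches. Across semi-synthetic COVID-19 simulations, synthetic pharmacological dynamics, and real-world case studies, it consistently outperforms strong baselines in both point prediction and distributional accuracy.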

Title: NovoMolGen: Rethinking Molecular Language Model Pretraining

Authors: Kamran Chitsaz, Roshan Balaji, Quentin Fournier, Nirav Pravinbhai Bhatt, Sarath Chandar
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.13408
Pdf URL: https://arxiv.org/pdf/2508.13408
Copy Paste: [[2508.13408]] NovoMolGen: Rethinking Molecular Language Model Pretraining(https://arxiv.org/abs/2508.13408)
Keywords: generation, generative
Abstract: Designing de-novo molecules with desired property profiles requires efficient exploration of the vast chemical space ranging from $10^{23}$ to $10^{60}$ possible synthesizable candidates. While various deep generative models have been developed to design small molecules using diverse input representations, Molecular Large Language Models (Mol-LLMs) based on string representations have emerged as a scalable approach capable of exploring billions of molecules. However, there remains limited understanding regarding how standard language modeling practices such as textual representations, tokenization strategies, model size, and dataset scale impact molecular generation performance. In this work, we systematically investigate these critical aspects by introducing NovoMolGen, a family of transformer-based foundation models pretrained on 1.5 billion molecules for de-novo molecule generation. Through extensive empirical analyses, we identify a weak correlation between performance metrics measured during pretraining and actual downstream performance, revealing important distinctions between molecular and general NLP training dynamics. NovoMolGen establishes new state-of-the-art results, substantially outperforming prior Mol-LLMs and specialized generative models in both unconstrained and goal-directed molecular generation tasks, thus providing a robust foundation for advancing efficient and effective molecular modeling strategies.
Summary: De-novo molecular design requires exploring a chemical space of $10^{23}$ to $10^{60}$ possible synthesizable candidates, yet it is poorly understood how textual representation, tokenization strategy, model size, and dataset scale affect molecular language models. NovoMolGen is a family of transformer-based foundation models pretrained on 1.5 billion molecules. The study finds only a weak correlation between metrics measured during pretraining and downstream performance, and NovoMolGen substantially outperforms prior Mol-LLMs and specialized generative models on both unconstrained and goal-directed molecular generation tasks.

Title: EventTSF: Event-Aware Non-Stationary Time Series Forecasting

Authors: Yunfeng Ge, Ming Jin, Yiji Zhao, Hongyan Li, Bo Du, Chang Xu, Shirui Pan
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.13434
Pdf URL: https://arxiv.org/pdf/2508.13434
Copy Paste: [[2508.13434]] EventTSF: Event-Aware Non-Stationary Time Series Forecasting(https://arxiv.org/abs/2508.13434)
Keywords: generation
Abstract: Time series forecasting plays a vital role in critical domains like energy and transportation, where non-stationary dynamics are deeply intertwined with events in other modalities such as texts. However, incorporating natural language-based external events to improve non-stationary forecasting remains largely unexplored, as most approaches still rely on a single modality, resulting in limited contextual knowledge and model underperformance. Enabling fine-grained multimodal interactions between temporal and textual data is challenged by three fundamental issues: (1) the difficulty of fine-grained synchronization between time-varying discrete textual events and continuous time series; (2) the inherent temporal uncertainty introduced by textual semantics; and (3) the misalignment between textual event embeddings and multi-resolution temporal patterns. In this work, we address these challenges by introducing event-aware non-stationary time series forecasting (EventTSF), an autoregressive generation framework that integrates historical time series with textual events to make subsequent forecasts. Specifically, EventTSF uses autoregressive diffusion with flow matching at each step to capture nuanced temporal-event interactions. To handle event-induced uncertainty, flow matching timesteps are adaptively controlled according to event semantic signals. The underlying denoiser employs a multimodal U-shaped diffusion transformer that efficiently fuses temporal and textual modalities across different resolutions. Extensive experiments on 8 synthetic and real-world datasets show that EventTSF outperforms 12 baselines across diverse event-aware non-stationary time series forecasting scenarios, achieving substantial improvements of 10.7% higher forecasting accuracy and $1.13\times$ faster training efficiency.
Summary: Non-stationary time series in domains such as energy and transportation are intertwined with textual events, but most forecasting approaches still rely on a single modality. EventTSF is an autoregressive generation framework that integrates historical time series with textual events. It uses autoregressive diffusion with flow matching at each step, adapts the flow matching timesteps to event semantic signals to handle event-induced uncertainty, and employs a multimodal U-shaped diffusion transformer as the denoiser. On 8 synthetic and real-world datasets it outperforms 12 baselines, with 10.7% higher forecasting accuracy and $1.13\times$ faster training.

Title: Structured Prompting and Multi-Agent Knowledge Distillation for Traffic Video Interpretation and Risk Inference

Authors: Yunxiang Yang, Ningning Xu, Jidong J. Yang
Subjects: cs.CV, cs.AI, cs.CL, eess.IV
Abstract URL: https://arxiv.org/abs/2508.13439
Pdf URL: https://arxiv.org/pdf/2508.13439
Copy Paste: [[2508.13439]] Structured Prompting and Multi-Agent Knowledge Distillation for Traffic Video Interpretation and Risk Inference(https://arxiv.org/abs/2508.13439)
Keywords: generation
Abstract: Comprehensive highway scene understanding and robust traffic risk inference are vital for advancing Intelligent Transportation Systems (ITS) and autonomous driving. Traditional approaches often struggle with scalability and generalization, particularly under the complex and dynamic conditions of real-world environments. To address these challenges, we introduce a novel structured prompting and knowledge distillation framework that enables automatic generation of high-quality traffic scene annotations and contextual risk assessments. Our framework orchestrates two large Vision-Language Models (VLMs): GPT-4o and o3-mini, using a structured Chain-of-Thought (CoT) strategy to produce rich, multi-perspective outputs. These outputs serve as knowledge-enriched pseudo-annotations for supervised fine-tuning of a much smaller student VLM. The resulting compact 3B-scale model, named VISTA (Vision for Intelligent Scene and Traffic Analysis), is capable of understanding low-resolution traffic videos and generating semantically faithful, risk-aware captions. Despite its significantly reduced parameter count, VISTA achieves strong performance across established captioning metrics (BLEU-4, METEOR, ROUGE-L, and CIDEr) when benchmarked against its teacher models. This demonstrates that effective knowledge distillation and structured multi-agent supervision can empower lightweight VLMs to capture complex reasoning capabilities. The compact architecture of VISTA facilitates efficient deployment on edge devices, enabling real-time risk monitoring without requiring extensive infrastructure upgrades.
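Summary: The paper introduces a structured prompting and knowledge distillation framework that automatically generates high-quality traffic scene annotations and contextual risk assessments. Two large models, GPT-4o and o3-mini, are orchestrated with a structured Chain-of-Thought (CoT) strategy to produce pseudo-annotations for supervised fine-tuning of a much smaller student. The resulting 3B-scale model, VISTA (Vision for Intelligent Scene and Traffic Analysis), understands low-resolution traffic videos, generates risk-aware captions, and achieves strong BLEU-4, METEOR, ROUGE-L, and CIDEr scores when benchmarked against its teachers. Its compact architecture facilitates deployment on edge devices for real-time risk monitoring.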

Title: EDTalk++: Full Disentanglement for Controllable Talking Head Synthesis

Authors: Shuai Tan, Bin Ji
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.13442
Pdf URL: https://arxiv.org/pdf/2508.13442
Copy Paste: [[2508.13442]] EDTalk++: Full Disentanglement for Controllable Talking Head Synthesis(https://arxiv.org/abs/2508.13442)
Keywords: generation
Abstract: Achieving disentangled control over multiple facial motions and accommodating diverse input modalities greatly broadens the applications and entertainment value of talking head generation. This necessitates a deep exploration of the decoupling space for facial features, ensuring that they a) operate independently without mutual interference and b) can be preserved and shared across different input modalities, two aspects that are often neglected in existing methods. To address this gap, this paper proposes EDTalk++, a novel full disentanglement framework for controllable talking head generation. Our framework enables individual manipulation of mouth shape, head pose, eye movement, and emotional expression, conditioned on video or audio inputs. Specifically, we employ four lightweight modules to decompose the facial dynamics into four distinct latent spaces representing mouth, pose, eye, and expression, respectively. Each space is characterized by a set of learnable bases whose linear combinations define specific motions. To ensure independence and accelerate training, we enforce orthogonality among bases and devise an efficient training strategy to allocate motion responsibilities to each space without relying on external knowledge. The learned bases are then stored in corresponding banks, enabling shared visual priors with audio input. Furthermore, considering the properties of each space, we propose an Audio-to-Motion module for audio-driven talking head synthesis. Experiments are conducted to demonstrate the effectiveness of EDTalk++.
Summary: EDTalk++ is a full disentanglement framework for controllable talking head generation that allows mouth shape, head pose, eye movement, and emotional expression to be manipulated individually from video or audio inputs. Four lightweight modules decompose facial dynamics into four latent spaces, each described by learnable bases whose linear combinations define specific motions. Orthogonality among the bases is enforced to keep the spaces independent, the learned bases are stored in banks so that visual priors can be shared with audio input, and an Audio-to-Motion module supports audio-driven synthesis.
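
The idea of motions defined as linear combinations of orthogonal learnable bases can be sketched as below. This is a generic illustration, not the EDTalk++ implementation: the Frobenius-norm penalty, the row-wise layout of the bases, and the names `orthogonality_penalty` and `compose_motion` are assumptions.

```python
import numpy as np

def orthogonality_penalty(bases):
    """Squared Frobenius norm of B B^T - I for row-wise bases B of shape (k, d)."""
    gram = bases @ bases.T
    return float(np.sum((gram - np.eye(bases.shape[0])) ** 2))

def compose_motion(bases, coeffs):
    """Latent motion code as a linear combination of the bases."""
    return coeffs @ bases

B = np.eye(3, 5)  # three orthonormal bases in a 5-dimensional latent space
motion = compose_motion(B, np.array([1.0, 2.0, 3.0]))
```

The penalty vanishes exactly when the bases are orthonormal, so adding it to a training loss pushes different bases towards encoding non-overlapping motion directions.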

Title: Revisiting MLLM Token Technology through the Lens of Classical Visual Coding

Authors: Jinming Liu, Junyan Lin, Yuntao Wei, Kele Shao, Keda Tao, Jianguo Huang, Xudong Yang, Zhibo Chen, Huan Wang, Xin Jin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.13460
Pdf URL: https://arxiv.org/pdf/2508.13460
Copy Paste: [[2508.13460]] Revisiting MLLM Token Technology through the Lens of Classical Visual Coding(https://arxiv.org/abs/2508.13460)
Keywords: generation
Abstract: Classical visual coding and Multimodal Large Language Model (MLLM) token technology share the same core objective: maximizing information fidelity while minimizing computational cost. Therefore, this paper reexamines MLLM token technology, including tokenization, token compression, and token reasoning, through the established principles of the long-developed field of visual coding. From this perspective, we (1) establish a unified formulation bridging token technology and visual coding, enabling a systematic, module-by-module comparative analysis; (2) synthesize bidirectional insights, exploring how visual coding principles can enhance the efficiency and robustness of MLLM token techniques and, conversely, how token technology paradigms can inform the design of next-generation semantic visual codecs; and (3) outline promising future research directions and critical unsolved challenges. In summary, this study presents the first comprehensive and structured technology comparison of MLLM token technology and visual coding, paving the way for more efficient multimodal models and more powerful visual codecs simultaneously.
Summary: Classical visual coding and MLLM token technology both aim to maximize information fidelity while minimizing computational cost. The paper reexamines tokenization, token compression, and token reasoning through the principles of visual coding: it establishes a unified formulation for module-by-module comparison, draws insights in both directions between the two fields, and outlines future research directions and unsolved challenges.

Title: MINR: Efficient Implicit Neural Representations for Multi-Image Encoding

Authors: Wenyong Zhou, Taiqiang Wu, Zhengwu Liu, Yuxin Cheng, Chen Zhang, Ngai Wong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.13471
Pdf URL: https://arxiv.org/pdf/2508.13471
Copy Paste: [[2508.13471]] MINR: Efficient Implicit Neural Representations for Multi-Image Encoding(https://arxiv.org/abs/2508.13471)
Keywords: super-resolution
Abstract: Implicit Neural Representations (INRs) aim to parameterize discrete signals through implicit continuous functions. However, formulating each image with a separate neural network (typically a Multi-Layer Perceptron (MLP)) leads to computational and storage inefficiencies when encoding multiple images. To address this issue, we propose MINR, which shares specific layers to encode multiple images efficiently. We first compare the layer-wise weight distributions for several trained INRs and find that corresponding intermediate layers follow highly similar distribution patterns. Motivated by this, we share these intermediate layers across multiple images while preserving the input and output layers as input-specific. In addition, we design an extra novel projection layer for each image to capture its unique features. Experimental results on image reconstruction and super-resolution tasks demonstrate that MINR can save up to 60% of the parameters while maintaining comparable performance. In particular, MINR scales effectively to handle 100 images, maintaining an average peak signal-to-noise ratio (PSNR) of 34 dB. Further analysis of various backbones demonstrates the robustness of the proposed MINR.
摘要：隐式神经表示（INRS）旨在通过隐式连续函数参数化离散信号。但是，用单独的神经网络〜（通常是多层感知器（MLP））制定每个图像会导致计算和存储效率低下时，在编码多图像时。为了解决此问题，我们建议MinR，共享特定的图层以有效地编码多图像。我们首先比较了几个训练有素的INR的层重量分布，并发现相应的中间层遵循高度相似的分布模式。在此激励的情况下，我们在多个图像上共享这些中间层，同时将输入和输出层保存为特定于输入的层。此外，我们为每个图像设计一个额外的新型投影层，以捕获其独特的特征。图像重建和超分辨率任务的实验结果表明，Minr可以节省多达60 \％的参数，同时保持可比的性能。特别是，微小的尺度有效地处理100张图像，维持平均峰值信噪比（PSNR）为34 dB。对各种骨架的进一步分析证明了拟议的Minr的鲁棒性。
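The saving from sharing intermediate layers can be illustrated by counting MLP parameters. The sketch below is not the MINR implementation: the layer widths, the number of hidden layers, and the function names are assumptions chosen for illustration, and the per-image projection layer mentioned in the abstract is left out. The toy numbers are therefore not expected to match the 60% figure reported by the authors.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


def separate_inr_params(num_images, in_dim, hidden, out_dim, num_hidden_layers):
    """One full MLP per image: nothing is shared."""
    per_image = (
        linear_params(in_dim, hidden)
        + num_hidden_layers * linear_params(hidden, hidden)
        + linear_params(hidden, out_dim)
    )
    return num_images * per_image


def shared_inr_params(num_images, in_dim, hidden, out_dim, num_hidden_layers):
    """Intermediate layers stored once; input and output layers kept per image."""
    shared = num_hidden_layers * linear_params(hidden, hidden)
    per_image = linear_params(in_dim, hidden) + linear_params(hidden, out_dim)
    return shared + num_images * per_image


# Hypothetical configuration: 100 images, (x, y) -> RGB, 3 hidden layers of width 64.
separate = separate_inr_params(100, 2, 64, 3, 3)
shared = shared_inr_params(100, 2, 64, 3, 3)
saving = 1 - shared / separate
print(f"separate={separate}, shared={shared}, saving={saving:.1%}")
```

Because the hidden-to-hidden layers dominate the parameter count of a small MLP, storing them once is where nearly all of the saving comes from in this toy setting.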

Title: 2D Gaussians Meet Visual Tokenizer

Authors: Yiang Shi, Xiaoyang Guo, Wei Yin, Mingkai Jia, Qian Zhang, Xiaolin Hu, Wenyu Liu, Xinggang Wan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.13515
Pdf URL: https://arxiv.org/pdf/2508.13515
Copy Paste: [[2508.13515]] 2D Gaussians Meet Visual Tokenizer(https://arxiv.org/abs/2508.13515)
Keywords: generation
Abstract: The image tokenizer is a critical component in AR image generation, as it determines how rich and structured visual content is encoded into compact representations. Existing quantization-based tokenizers such as VQ-GAN primarily focus on appearance features like texture and color, often neglecting geometric structures due to their patch-based design. In this work, we explored how to incorporate more visual information into the tokenizer and proposed a new framework named Visual Gaussian Quantization (VGQ), a novel tokenizer paradigm that explicitly enhances structural modeling by integrating 2D Gaussians into traditional visual codebook quantization frameworks. Our approach addresses the inherent limitations of naive quantization methods such as VQ-GAN, which struggle to model structured visual information due to their patch-based design and emphasis on texture and color. In contrast, VGQ encodes image latents as 2D Gaussian distributions, effectively capturing geometric and spatial structures by directly modeling structure-related parameters such as position, rotation and scale. We further demonstrate that increasing the density of 2D Gaussians within the tokens leads to significant gains in reconstruction fidelity, providing a flexible trade-off between token efficiency and visual richness. On the ImageNet 256x256 benchmark, VGQ achieves strong reconstruction quality with an rFID score of 1.00. Furthermore, by increasing the density of 2D Gaussians within the tokens, VGQ gains a significant boost in reconstruction capability and achieves a state-of-the-art reconstruction rFID score of 0.556 and a PSNR of 24.93, substantially outperforming existing methods. Codes will be released soon.
摘要：图像令牌是AR图像生成中的关键组件，因为它确定了如何将丰富和结构化的视觉内容编码为紧凑的表示形式。现有的基于量化的引导剂（例如VQ-GAN）主要关注外观特征，例如纹理和颜色，通常由于基于贴片的设计而忽略了几何结构。在这项工作中，我们探讨了如何将更多的视觉信息纳入令牌仪中，并提出了一个名为Visual Gaussian量化（VGQ）的新框架，这是一种新型的令牌范式，该范式通过将2D高斯人集成到传统的Visual CodeBook量子框架中，从而明确地增强了结构建模。我们的方法解决了幼稚量化方法（例如VQ-GAN）的固有局限性，该方法由于基于斑块的设计并强调纹理和颜色而难以模拟结构化的视觉信息。相比之下，VGQ将图像潜在编码为2D高斯分布，通过直接建模与结构相关的参数（例如位置，旋转和比例）有效地捕获几何和空间结构。我们进一步证明，在代币中增加2D高斯人的密度会导致重建保真度的显着增长，从而在代币效率和视觉丰富度之间进行了灵活的权衡。在Imagenet 256x256基准中，VGQ以1.00的RFID得分达到了强大的重建质量。此外，通过增加代币中2D高斯人的密度，VGQ获得了重建能力的显着提升，并获得了最新的重建RFID RFID得分为0.556，PSNR的PSNR为24.93，实质上超过了现有方法。代码将很快发布。

Title: Evaluating Open-Source Vision Language Models for Facial Emotion Recognition against Traditional Deep Learning Models

Authors: Vamsi Krishna Mulukutla, Sai Supriya Pavarala, Srinivasa Raju Rudraraju, Sridevi Bonthu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.13524
Pdf URL: https://arxiv.org/pdf/2508.13524
Copy Paste: [[2508.13524]] Evaluating Open-Source Vision Language Models for Facial Emotion Recognition against Traditional Deep Learning Models(https://arxiv.org/abs/2508.13524)
Keywords: restoration
Abstract: Facial Emotion Recognition (FER) is crucial for applications such as human-computer interaction and mental health diagnostics. This study presents the first empirical comparison of open-source Vision-Language Models (VLMs), including Phi-3.5 Vision and CLIP, against traditional deep learning models VGG19, ResNet-50, and EfficientNet-B0 on the challenging FER-2013 dataset, which contains 35,887 low-resolution grayscale images across seven emotion classes. To address the mismatch between VLM training assumptions and the noisy nature of FER data, we introduce a novel pipeline that integrates GFPGAN-based image restoration with FER evaluation. Results show that traditional models, particularly EfficientNet-B0 (86.44%) and ResNet-50 (85.72%), significantly outperform VLMs like CLIP (64.07%) and Phi-3.5 Vision (51.66%), highlighting the limitations of VLMs in low-quality visual tasks. In addition to performance evaluation using precision, recall, F1-score, and accuracy, we provide a detailed computational cost analysis covering preprocessing, training, inference, and evaluation phases, offering practical insights for deployment. This work underscores the need for adapting VLMs to noisy environments and provides a reproducible benchmark for future research in emotion recognition.
摘要：面部情绪识别（FER）对于诸如人类计算机互动和心理健康诊断等应用至关重要。这项研究介绍了开源视觉语言模型（VLM）的首次经验比较，包括PHI-3.5视觉和剪辑，与传统的深度学习模型VGG19，Resnet-50和EfficityNet-B0在挑战性的FER-2013数据集上，其中包含35,887个低分辨率的七个情感图像。为了解决VLM训练假设与FER数据的嘈杂性质之间的不匹配，我们引入了一条新型管道，将基于GFPGAN的图像恢复与FER评估相结合。结果表明，传统模型，尤其是EfficityNet-B0（86.44％）和Resnet-50（85.72％），显着胜过剪辑（64.07％）和PHI-3.5视觉（51.66％）（51.66％），突出了低质量视觉任务中VLMS的局限性。除了使用精确，召回，F1得分和准确性评估绩效外，我们还提供了详细的计算成本分析，涵盖了预处理，培训，推理和评估阶段，从而为部署提供了实用的见解。这项工作强调了将VLM适应嘈杂环境的必要性，并为未来的情感识别研究提供了可重复的基准。

Title: MuFlex: A Scalable, Physics-based Platform for Multi-Building Flexibility Analysis and Coordination

Authors: Ziyan Wu, Ivan Korolija, Rui Tang
Subjects: cs.LG, eess.SY
Abstract URL: https://arxiv.org/abs/2508.13532
Pdf URL: https://arxiv.org/pdf/2508.13532
Copy Paste: [[2508.13532]] MuFlex: A Scalable, Physics-based Platform for Multi-Building Flexibility Analysis and Coordination(https://arxiv.org/abs/2508.13532)
Keywords: generation
Abstract: With the increasing penetration of renewable generation on the power grid, maintaining system balance requires coordinated demand flexibility from aggregations of buildings. Reinforcement learning (RL) has been widely explored for building controls because of its model-free nature. Open-source simulation testbeds are essential not only for training RL agents but also for fairly benchmarking control strategies. However, most building-sector testbeds target single buildings; multi-building platforms are relatively limited and typically rely on simplified models (e.g., Resistance-Capacitance) or data-driven approaches, which lack the ability to fully capture the physical intricacies and intermediate variables necessary for interpreting control performance. Moreover, these platforms often impose fixed inputs, outputs, and model formats, restricting their applicability as benchmarking tools across diverse control scenarios. To address these gaps, MuFlex, a scalable, open-source platform for benchmarking and testing control strategies for multi-building flexibility coordination, was developed in this study. MuFlex enables synchronous information exchange across EnergyPlus building models and adheres to the latest OpenAI Gym interface, providing a modular, standardized RL implementation. The platform capabilities were demonstrated in a case study coordinating demand flexibility across four office buildings using the Soft Actor-Critic algorithm with carefully fine-tuned hyperparameters. The results show that aggregating the four buildings' flexibility reduced total peak demand below a specified threshold while maintaining indoor environmental quality.
摘要：随着可再生生成在电网上的渗透量的增加，保持系统平衡需要建筑物聚集的协调需求灵活性。强化学习（RL）由于其无模型性质而被广泛探索用于建筑物控制。开源仿真测试床不仅对于培训RL代理，而且对于相当基准的控制策略至关重要。但是，大多数建筑部门测试台针对单个建筑物；多构建平台相对有限，通常依赖于简化的模型（例如阻力范围）或数据驱动的方法，这些方法缺乏完全捕获物理复杂性和用于解释控制性能所需的中间变量的能力。此外，这些平台通常会施加固定的输入，输出和模型格式，从而将其适用性限制为在不同控制方案中的基准测试工具。为了解决这些差距，Muflex是一个可扩展的开源平台，用于基准测试和测试用于多构建灵活性协调的控制策略。 MUFLEX启用了跨能量构建模型的同步信息交换，并遵守最新的OpenAI Gym界面，提供了模块化的标准化RL实现。在一个案例研究中证明了平台功能，该案例研究使用智能调节的超参数使用软演员评价算法来协调四个办公楼的需求灵活性。结果表明，在维持室内环境质量的同时，汇总四个建筑物的灵活性将总峰值需求降低到指定阈值以下。

Title: EAvatar: Expression-Aware Head Avatar Reconstruction with Generative Geometry Priors

Authors: Shikun Zhang, Cunjian Chen, Yiqun Wang, Qiuhong Ke, Yong Li
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.13537
Pdf URL: https://arxiv.org/pdf/2508.13537
Copy Paste: [[2508.13537]] EAvatar: Expression-Aware Head Avatar Reconstruction with Generative Geometry Priors(https://arxiv.org/abs/2508.13537)
Keywords: generative
Abstract: High-fidelity head avatar reconstruction plays a crucial role in AR/VR, gaming, and multimedia content creation. Recent advances in 3D Gaussian Splatting (3DGS) have demonstrated effectiveness in modeling complex geometry with real-time rendering capability and are now widely used in high-fidelity head avatar reconstruction tasks. However, existing 3DGS-based methods still face significant challenges in capturing fine-grained facial expressions and preserving local texture continuity, especially in highly deformable regions. To mitigate these limitations, we propose a novel 3DGS-based framework termed EAvatar for head reconstruction that is both expression-aware and deformation-aware. Our method introduces a sparse expression control mechanism, where a small number of key Gaussians are used to influence the deformation of their neighboring Gaussians, enabling accurate modeling of local deformations and fine-scale texture transitions. Furthermore, we leverage high-quality 3D priors from pretrained generative models to provide a more reliable facial geometry, offering structural guidance that improves convergence stability and shape accuracy during training. Experimental results demonstrate that our method produces more accurate and visually coherent head reconstructions with improved expression controllability and detail fidelity.
摘要：高保真头像化身重建在AR/VR，游戏和多媒体内容创建中起着至关重要的作用。 3D高斯脱落（3DGS）的最新进展表明，在建模具有实时渲染能力的复杂几何形状方面具有有效性，现在已广泛用于高保真头像Avatar重建任务。但是，现有的基于3DGS的方法在捕获细粒度的面部表情并保留局部纹理连续性方面仍然面临重大挑战，尤其是在高度变形的区域中。为了减轻这些局限性，我们提出了一种基于3DGS的新型框架，称为eavatar用于头部重建，既具有表达意识，又是变形感。我们的方法引入了稀疏的表达控制机制，其中使用少量的关键高斯人来影响其相邻的高斯人的变形，从而可以准确地建模局部变形和细尺度纹理跃迁。此外，我们利用预审预周化的生成模型的高质量3D先验来提供更可靠的面部几何形状，提供结构指导，以提高培训期间的收敛稳定性和形状准确性。实验结果表明，我们的方法可产生更准确和视觉上连贯的头部重建，具有改善的表达可控性和细节保真度。

Title: FLAIR: Frequency- and Locality-Aware Implicit Neural Representations

Authors: Sukhun Ko, Dahyeon Kye, Kyle Min, Chanho Eom, Jihyong Oh
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.13544
Pdf URL: https://arxiv.org/pdf/2508.13544
Copy Paste: [[2508.13544]] FLAIR: Frequency- and Locality-Aware Implicit Neural Representations(https://arxiv.org/abs/2508.13544)
Keywords: restoration
Abstract: Implicit Neural Representations (INRs) leverage neural networks to map coordinates to corresponding signals, enabling continuous and compact representations. This paradigm has driven significant advances in various vision tasks. However, existing INRs lack frequency selectivity, spatial localization, and sparse representations, leading to an over-reliance on redundant signal components. Consequently, they exhibit spectral bias, tending to learn low-frequency components early while struggling to capture fine high-frequency details. To address these issues, we propose FLAIR (Frequency- and Locality-Aware Implicit Neural Representations), which incorporates two key innovations. The first is RC-GAUSS, a novel activation designed for explicit frequency selection and spatial localization under the constraints of the time-frequency uncertainty principle (TFUP). The second is Wavelet-Energy-Guided Encoding (WEGE), which leverages the discrete wavelet transform (DWT) to compute energy scores and explicitly guide frequency information to the network. Our method consistently outperforms existing INRs in 2D image representation and restoration, as well as 3D reconstruction.
摘要：隐式神经表示（INRS）利用神经网络将坐标映射到相应的信号，实现连续和紧凑的表示。这种范式在各种视力任务中取得了重大进步。但是，现有的INR缺乏频率选择性，空间定位和稀疏表示，导致对冗余信号组件的过度依赖。因此，它们表现出光谱偏见，倾向于尽早学习低频组件，同时努力捕获高频细节。为了解决这些问题，我们提出了Flair（频率和地方感知的隐性神经表示），其中包含了两项关键的创新。第一个是RC-Gauss，这是一种新颖的激活，旨在在时间频率不确定性原理（TFUP）的约束下进行显式频率选择和空间定位。第二个是小波 - 能源引导的编码（WEGE），它利用离散小波变换（DWT）来计算能量得分，并明确指导网络的频率信息。我们的方法在2D图像表示和恢复以及3D重建中始终优于现有的INR。
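The idea of scoring frequency content with wavelet energy can be sketched with a one-level Haar transform on a 1D signal. This is an illustrative assumption, not the WEGE procedure itself: the abstract does not state which wavelet, how many levels, or how the scores are fed to the network, and the function names below are hypothetical.

```python
import math


def haar_dwt(signal):
    """One-level orthonormal Haar DWT of an even-length sequence.

    Returns (approximation, detail) coefficient lists.
    """
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    scale = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / scale for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / scale for i in range(0, len(signal), 2)]
    return approx, detail


def band_energy(coefficients):
    """Energy of a sub-band: sum of squared coefficients."""
    return sum(c * c for c in coefficients)


# A constant signal has all of its energy in the low-frequency band,
# while an alternating signal has all of it in the high-frequency band.
smooth_approx, smooth_detail = haar_dwt([1.0, 1.0, 1.0, 1.0])
rough_approx, rough_detail = haar_dwt([1.0, -1.0, 1.0, -1.0])
print(band_energy(smooth_approx), band_energy(smooth_detail))
print(band_energy(rough_approx), band_energy(rough_detail))
```

Since the Haar transform used here is orthonormal, the two band energies sum to the energy of the input, so each score can be read as a share of the total signal energy.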

Title: A Lightweight Dual-Mode Optimization for Generative Face Video Coding

Authors: Zihan Zhang, Shanzhi Yin, Bolin Chen, Ru-Ling Liao, Shiqi Wang, Yan Ye
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2508.13547
Pdf URL: https://arxiv.org/pdf/2508.13547
Copy Paste: [[2508.13547]] A Lightweight Dual-Mode Optimization for Generative Face Video Coding(https://arxiv.org/abs/2508.13547)
Keywords: generative
Abstract: Generative Face Video Coding (GFVC) achieves superior rate-distortion performance by leveraging the strong inference capabilities of deep generative models. However, its practical deployment is hindered by large model parameters and high computational costs. To address this, we propose a lightweight GFVC framework that introduces dual-mode optimization - combining architectural redesign and operational refinement - to reduce complexity whilst preserving reconstruction quality. Architecturally, we replace traditional 3 x 3 convolutions with slimmer and more efficient layers, reducing complexity without compromising feature expressiveness. Operationally, we develop a two-stage adaptive channel pruning strategy: (1) soft pruning during training identifies redundant channels via learnable thresholds, and (2) hard pruning permanently eliminates these channels post-training using a derived mask. This dual-phase approach ensures both training stability and inference efficiency. Experimental results demonstrate that the proposed lightweight dual-mode optimization for GFVC can achieve 90.4% parameter reduction and 88.9% computation saving compared to the baseline, whilst achieving superior performance compared to state-of-the-art video coding standard Versatile Video Coding (VVC) in terms of perceptual-level quality metrics. As such, the proposed method is expected to enable efficient GFVC deployment in resource-constrained environments such as mobile edge devices.
摘要：生成的面部视频编码（GFVC）通过利用深层生成模型的强推理能力来实现卓越的利率差异性能。但是，大型模型参数和高计算成本阻碍了其实际部署。为了解决这个问题，我们提出了一个轻巧的GFVC框架，该框架引入了双模式优化 - 结合体系结构重新设计和操作改进 - 以降低复杂性，同时保留重建质量。从结构上讲，我们用更薄，更有效的层代替了传统的3 x 3卷积，从而降低了复杂性而不会折衷特征表达。在操作上，我们制定了一个两阶段的自适应通道修剪策略：（1）训练期间的软修剪通过可学习的阈值来识别冗余通道，（2）（2）使用派生的掩码在训练后永久消除这些通道。这种双相方法可确保训练稳定性和推理效率。实验结果表明，与基线相比，与基线相比，提出的对GFVC的轻巧双模式优化可以实现90.4％的参数降低和88.9％的计算节省，同时与最先进的视频编码标准的通用视频编码（VVC）相比，在感知级别的级别质量质量的质量学方面相比，实现了较高的性能。因此，预计所提出的方法将在资源约束环境（例如移动边缘设备）中有效地部署有效的GFVC。
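The two-stage pruning strategy described above (soft pruning during training, then hard pruning with a derived mask) can be sketched as follows. This is a minimal illustration under stated assumptions: the sigmoid gate, the fixed threshold value, and all function names are hypothetical, and in the paper the thresholds are learned during training rather than set by hand.

```python
import math


def soft_gates(importance, threshold, temperature=0.1):
    """Soft pruning: a differentiable gate in (0, 1) per channel.

    Channels whose importance is well below the threshold get a gate near 0,
    so they are suppressed during training without being removed.
    """
    return [1.0 / (1.0 + math.exp(-(s - threshold) / temperature)) for s in importance]


def hard_mask(importance, threshold):
    """Hard pruning: a binary keep/drop decision derived from the same threshold."""
    return [s > threshold for s in importance]


def prune_channels(channels, mask):
    """Permanently drop the channels whose mask entry is False."""
    return [c for c, keep in zip(channels, mask) if keep]


# Hypothetical per-channel importance scores and threshold.
importance = [0.9, 0.05, 0.4, 0.01]
threshold = 0.2
gates = soft_gates(importance, threshold)
mask = hard_mask(importance, threshold)
kept = prune_channels(["c0", "c1", "c2", "c3"], mask)
print(gates, mask, kept)
```

Keeping the gate soft during training lets gradients still reach channels that are on the way out, which is one plausible reading of why the abstract credits the two-phase design with training stability.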

Title: Color Spike Data Generation via Bio-inspired Neuron-like Encoding with an Artificial Photoreceptor Layer

Authors: Hsieh Ching-Teng, Wang Yuan-Kai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.13558
Pdf URL: https://arxiv.org/pdf/2508.13558
Copy Paste: [[2508.13558]] Color Spike Data Generation via Bio-inspired Neuron-like Encoding with an Artificial Photoreceptor Layer(https://arxiv.org/abs/2508.13558)
Keywords: generation
Abstract: In recent years, neuromorphic computing and spiking neural networks (SNNs) have advanced rapidly through integration with deep learning. However, the performance of SNNs still lags behind that of convolutional neural networks (CNNs), primarily due to the limited information capacity of spike-based data. Although some studies have attempted to improve SNN performance by training them with non-spiking inputs such as static images, this approach deviates from the original intent of neuromorphic computing, which emphasizes spike-based information processing. To address this issue, we propose a Neuron-like Encoding method that generates spike data based on the intrinsic operational principles and functions of biological neurons. This method is further enhanced by the incorporation of an artificial photoreceptor layer, enabling spike data to carry both color and luminance information, thereby forming a complete visual spike signal. Experimental results using the Integrate-and-Fire neuron model demonstrate that this biologically inspired approach effectively increases the information content of spike signals and improves SNN performance, all while adhering to neuromorphic principles. We believe this concept holds strong potential for future development and may contribute to overcoming current limitations in neuromorphic computing, facilitating broader applications of SNNs.
摘要：近年来，神经形态计算和尖峰神经网络（SNN）通过与深度学习的整合而迅速地进行了广告。但是，SNN的性能仍然落后于卷积神经网络（CNN）的性能，这主要是由于基于SPIKE的数据的信息能力有限。尽管一些研究试图通过使用静态图像等非刺激输入来训练SNN性能，但这种方法偏离了神经形态计算的原始意图，该目的强调了基于Spike的信息处理。为了解决这个问题，我们提出了一种神经元状的编码方法，该方法基于生物神经元的固有操作原理和功能生成尖峰数据。通过掺入人工pho-toreceptor层，进一步增强了此方法，从而使Spike数据同时携带颜色和亮度信息，从而形成了完整的视觉尖峰信号。使用集成和射击神经元模型的实验结果表明，这种生物学启发的方法有效地增加了尖峰信号的信息含量并改善了SNN性能，同时遵守神经形态原理。我们认为，这个概念具有未来发展的强大潜力，并可能有助于克服神经形态计算的当前局限性，从而促进SNN的更广泛应用。
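How an Integrate-and-Fire neuron turns an intensity into a spike train can be sketched in a few lines. The example is an assumption-laden illustration, not the paper's Neuron-like Encoding: the threshold, the subtractive reset, and the per-channel treatment of an RGB pixel (standing in for the artificial photoreceptor layer) are all hypothetical choices.

```python
def integrate_and_fire(intensity, num_steps, threshold=1.0):
    """Encode a constant input as a binary spike train.

    The membrane potential accumulates the input at every step; when it
    reaches the threshold the neuron emits a spike and the threshold is
    subtracted from the potential.
    """
    potential = 0.0
    spikes = []
    for _ in range(num_steps):
        potential += intensity
        if potential >= threshold:
            spikes.append(1)
            potential -= threshold
        else:
            spikes.append(0)
    return spikes


def encode_rgb_pixel(rgb, num_steps):
    """Hypothetical color encoding: one spike train per color channel."""
    return {name: integrate_and_fire(value, num_steps)
            for name, value in zip(("r", "g", "b"), rgb)}


trains = encode_rgb_pixel((0.5, 0.25, 0.0), num_steps=8)
for name, train in trains.items():
    print(name, train, "rate =", sum(train) / len(train))
```

With a constant input, the firing rate approaches the input intensity, so brighter channels produce denser spike trains.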

Title: Prediction of Hospital Associated Infections During Continuous Hospital Stays

Authors: Rituparna Datta, Methun Kamruzzaman, Eili Y. Klein, Gregory R Madden, Xinwei Deng, Anil Vullikanti, Parantapa Bhattacharya
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.13561
Pdf URL: https://arxiv.org/pdf/2508.13561
Copy Paste: [[2508.13561]] Prediction of Hospital Associated Infections During Continuous Hospital Stays(https://arxiv.org/abs/2508.13561)
Keywords: generative
Abstract: The US Centers for Disease Control and Prevention (CDC), in 2019, designated Methicillin-resistant Staphylococcus aureus (MRSA) as a serious antimicrobial resistance threat. The risk of acquiring MRSA and suffering life-threatening consequences due to it remains especially high for hospitalized patients due to a unique combination of factors, including co-morbid conditions, immunosuppression, antibiotic use, and risk of contact with contaminated hospital workers and equipment. In this paper, we present a novel generative probabilistic model, GenHAI, for modeling sequences of MRSA test result outcomes for patients during a single hospitalization. This model can be used to answer many important questions from the perspectives of hospital administrators for mitigating the risk of MRSA infections. Our model is based on the probabilistic programming paradigm, and can be used to approximately answer a variety of predictive, causal, and counterfactual questions. We demonstrate the efficacy of our model by comparing it against discriminative and generative machine learning models using two real-world datasets.
摘要：2019年，美国疾病控制与预防中心（CDC）指定了耐甲氧西林金黄色葡萄球菌（MRSA）为严重的抗菌耐药性威胁。由于独特的因素组合，包括：合并症，免疫抑制，抗生素使用以及与受污染的医院工作人员和设备接触的风险，获得MRSA并遭受威胁生命的后果的风险尤其很高。在本文中，我们提出了一种新型的生成概率模型Genhai，用于在一次住院期间为患者的MRSA测试结果序列建模。该模型可用于从医院管理员的角度回答许多重要问题，以减轻MRSA感染的风险。我们的模型基于概率编程范式，可用于大致回答各种预测，因果和反事实问题。我们通过将模型与使用两个现实世界数据集进行判别和生成机器学习模型进行比较来证明我们的模型的功效。

Title: Generative Model-Based Feature Attention Module for Video Action Analysis

Authors: Guiqin Wang, Peng Zhao, Cong Zhao, Jing Huang, Siyan Guo, Shusen Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.13565
Pdf URL: https://arxiv.org/pdf/2508.13565
Copy Paste: [[2508.13565]] Generative Model-Based Feature Attention Module for Video Action Analysis(https://arxiv.org/abs/2508.13565)
Keywords: generative
Abstract: Video action analysis is a foundational technology within the realm of intelligent video comprehension, particularly concerning its application in the Internet of Things (IoT). However, existing methodologies overlook feature semantics in feature extraction and focus on optimizing action proposals. As a result, their limited precision makes them unsuitable for widespread adoption in high-performance IoT applications such as autonomous driving, which require robust and scalable intelligent video analytics. To address this issue, we propose a novel generative attention-based model to learn the relation of feature semantics. Specifically, by leveraging the differences between actions' foreground and background, our model simultaneously learns the frame- and segment-dependencies of temporal action feature semantics, which makes effective use of feature semantics in feature extraction. To evaluate the effectiveness of our model, we conduct extensive experiments on two benchmark video tasks: action recognition and action detection. In the context of action detection tasks, we substantiate the superiority of our approach through comprehensive validation on widely recognized datasets. Moreover, we extend the validation of the effectiveness of our proposed method to a broader task, video action recognition. Our code is available at this https URL.
摘要：视频动作分析是智能视频理解领域内的基础技术，尤其是关于其在物联网（IoT）中的应用。但是，现有的方法论忽略了特征提取中的特征语义，并且专注于优化动作提案，因此由于精确的限制（例如自动驾驶），这些解决方案不适合在高性能IoT应用程序中广泛采用，这需要强大而可扩展的智能视频分析分析。为了解决这个问题，我们提出了一个基于生成注意力的新型模型，以了解特征语义的关系。具体而言，通过利用动作的前景和背景的差异，我们的模型同时了解了时间动作功能语义的框架和段依赖性，该语义利用了特征语义在功能提取中有效地提取。为了评估我们的模型的有效性，我们对两个基准视频任务，动作识别和动作检测进行了广泛的实验。在行动检测任务的背景下，我们通过对广泛认可的数据集进行全面验证来证实方法的优势。此外，我们将提出方法的有效性的验证扩展到更广泛的任务，即视频行动识别。我们的代码可在此HTTPS URL上找到。

Title: Bridging Clear and Adverse Driving Conditions

Authors: Yoel Shapiro, Yahia Showgan, Koustav Mullick
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.13592
Pdf URL: https://arxiv.org/pdf/2508.13592
Copy Paste: [[2508.13592]] Bridging Clear and Adverse Driving Conditions(https://arxiv.org/abs/2508.13592)
Keywords: generation
Abstract: Autonomous Driving (AD) systems exhibit markedly degraded performance under adverse environmental conditions, such as low illumination and precipitation. The underrepresentation of adverse conditions in AD datasets makes it challenging to address this deficiency. To circumvent the prohibitive cost of acquiring and annotating adverse weather data, we propose a novel Domain Adaptation (DA) pipeline that transforms clear-weather images into fog, rain, snow, and nighttime images. Here, we systematically develop and evaluate several novel data-generation pipelines, including simulation-only, GAN-based, and hybrid diffusion-GAN approaches, to synthesize photorealistic adverse images from labelled clear images. We leverage an existing DA GAN, extend it to support auxiliary inputs, and develop a novel training recipe that leverages both simulated and real images. The simulated images facilitate exact supervision by providing perfectly matched image pairs, while the real images help bridge the simulation-to-real (sim2real) gap. We further introduce a method to mitigate hallucinations and artifacts in Stable-Diffusion Image-to-Image (img2img) outputs by blending them adaptively with their progenitor images. We finetune downstream models on our synthetic data and evaluate them on the Adverse Conditions Dataset with Correspondences (ACDC). We achieve 1.85 percent overall improvement in semantic segmentation, and 4.62 percent on nighttime, demonstrating the efficacy of our hybrid method for robust AD perception under challenging conditions.
摘要：在不利的环境条件下（例如低照明和降水），自动驾驶（AD）系统表现出明显退化的性能。 AD数据集中不利条件的代表性不足使得解决此缺陷变得具有挑战性。为了避免获取和注释不利天气数据的高昂成本，我们提出了一种新型的域适应性（DA）管道，将晴朗的图像转化为雾，雨，雪和夜间图像。在这里，我们系统地开发和评估了几种新型的数据生成管道，包括仅模拟，基于GAN和混合扩散式的方法，以从标记的清晰图像中综合了感性的不利图像。我们利用现有的da gan，扩展它以支持辅助输入，并开发出一种利用模拟图像和真实图像的新颖培训配方。模拟图像通过提供完美匹配的图像对来促进精确的监督，而真实的图像有助于桥接模拟到真实（SIM2REAL）间隙。我们进一步介绍了一种方法，以减轻稳定的扩散图像到图像（IMG2IMG）输出中的幻觉和工件，通过将它们与祖细胞图像自适应地融合在一起。我们在合成数据上对下游模型进行了Finetune模型，并在具有对应关系（ACDC）的不利条件数据集上对其进行了评估。我们在语义细分方面取得了1.85％的总体改善，夜间的总体提高了4.62％，这表明了混合方法在具有挑战性的条件下对强大的AD感知的功效。

Title: PersonaVlog: Personalized Multimodal Vlog Generation with Multi-Agent Collaboration and Iterative Self-Correction

Authors: Xiaolu Hou, Bing Ma, Jiaxiang Cheng, Xuhua Ren, Kai Yu, Wenyue Li, Tianxiang Zheng, Qinglin Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.13602
Pdf URL: https://arxiv.org/pdf/2508.13602
Copy Paste: [[2508.13602]] PersonaVlog: Personalized Multimodal Vlog Generation with Multi-Agent Collaboration and Iterative Self-Correction(https://arxiv.org/abs/2508.13602)
Keywords: generation
Abstract: With the growing demand for short videos and personalized content, automated Video Log (Vlog) generation has become a key direction in multimodal content creation. Existing methods mostly rely on predefined scripts, lacking dynamism and personal expression. Therefore, there is an urgent need for an automated Vlog generation approach that enables effective multimodal collaboration and high personalization. To this end, we propose PersonaVlog, an automated multimodal stylized Vlog generation framework that can produce personalized Vlogs featuring videos, background music, and inner monologue speech based on a given theme and reference image. Specifically, we propose a multi-agent collaboration framework based on Multimodal Large Language Models (MLLMs). This framework efficiently generates high-quality prompts for multimodal content creation based on user input, thereby improving the efficiency and creativity of the process. In addition, we incorporate a feedback and rollback mechanism that leverages MLLMs to evaluate and provide feedback on generated results, thereby enabling iterative self-correction of multimodal content. We also propose ThemeVlogEval, a theme-based automated benchmarking framework that provides standardized metrics and datasets for fair evaluation. Comprehensive experiments demonstrate the significant advantages and potential of our framework over several baselines, highlighting its effectiveness and great potential for generating automated Vlogs.
摘要：随着对简短视频和个性化内容的需求不断增长，自动化视频日志（VLOG）生成已成为多模式内容创建的关键方向。现有方法主要依赖于预定义的脚本，缺乏动态和个人表达。因此，迫切需要一种自动vlog生成方法，该方法可以实现有效的多模式协作和高个性化。为此，我们提出了PersonAvlog，这是一种自动化的多模式风化视频博客生成框架，可以生产具有视频，背景音乐和基于给定主题和参考图像的内部独白语音的个性化视频博客。具体而言，我们建议基于多模式大语言模型（MLLM）的多代理协作框架。该框架有效地生成了基于用户输入的多模式内容创建的高质量提示，从而提高了过程的效率和创造力。此外，我们结合了一种反馈和回滚机制，该机制利用MLLM来评估和提供有关生成结果的反馈，从而实现了多模式内容的迭代自我校正。我们还提出了TheVlogeval，这是一个基于主题的自动基准测试框架，可提供标准化的指标和数据集以进行公平评估。全面的实验证明了我们框架比几个基线的显着优势和潜力，突出了其有效性和产生自动视频博客的巨大潜力。

Title: Towards a Larger Model via One-Shot Federated Learning on Heterogeneous Client Models

Authors: Wenxuan Ye, Xueli An, Onur Ayan, Junfan Wang, Xueqiang Yan, Georg Carle
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.13625
Pdf URL: https://arxiv.org/pdf/2508.13625
Copy Paste: [[2508.13625]] Towards a Larger Model via One-Shot Federated Learning on Heterogeneous Client Models(https://arxiv.org/abs/2508.13625)
Keywords: generation
Abstract: Large models, renowned for superior performance, outperform smaller ones even without billion-parameter scales. While mobile network servers have ample computational resources to support larger models than client devices, privacy constraints prevent clients from directly sharing their raw data. Federated Learning (FL) enables decentralized clients to collaboratively train a shared model by exchanging model parameters instead of transmitting raw data. Yet, it requires a uniform model architecture and multiple communication rounds, which neglect resource heterogeneity, impose heavy computational demands on clients, and increase communication overhead. To address these challenges, we propose FedOL, to construct a larger and more comprehensive server model in one-shot settings (i.e., in a single communication round). Instead of model parameter sharing, FedOL employs knowledge distillation, where clients only exchange model prediction outputs on an unlabeled public dataset. This reduces communication overhead by transmitting compact predictions instead of full model weights and enables model customization by allowing heterogeneous model architectures. A key challenge in this setting is that client predictions may be biased due to skewed local data distributions, and the lack of ground-truth labels in the public dataset further complicates reliable learning. To mitigate these issues, FedOL introduces a specialized objective function that iteratively refines pseudo-labels and the server model, improving learning reliability. To complement this, FedOL incorporates a tailored pseudo-label generation and knowledge distillation strategy that effectively integrates diverse knowledge. Simulation results show that FedOL significantly outperforms existing baselines, offering a cost-effective solution for mobile networks where clients possess valuable private data but limited computational resources.
摘要：大型型号以出色的性能而闻名，即使没有数十亿参数量表，也要表现较小。尽管移动网络服务器具有大量的计算资源来支持比客户端设备更大的模型，但隐私限制阻止客户直接共享其原始数据。联合学习（FL）使分散的客户能够通过交换模型参数而不是传输原始数据来协作训练共享模型。然而，它需要统一的模型架构和多个通信回合，这忽略了资源异质性，对客户施加了巨大的计算需求，并增加了通信开销。为了应对这些挑战，我们提出了FedOl，以在一声设置（即在一次通信回合中）构建更大，更全面的服务器模型。 Fedol不是模型参数共享，而是采用知识蒸馏，客户只在未标记的公共数据集中交换模型预测输出。这样可以通过传输紧凑的预测而不是完整的模型权重来减少通信开销，并通过允许异质模型体系结构启用模型自定义。在这种情况下，一个关键的挑战是，由于本地数据分布的偏差，客户预测可能会偏向，并且公共数据集缺乏地面真实标签会使可靠的学习进一步复杂化。为了减轻这些问题，Fedol引入了专门的目标功能，该功能可以迭代地完善伪标签和服务器模型，从而提高了学习可靠性。为了补充这一点，Fedol结合了量身定制的伪标签生成和知识蒸馏策略，可有效整合多样化的知识。仿真结果表明，联邦高地的表现明显胜过现有的基线，为客户提供有价值的私人数据但计算资源有限的移动网络提供了具有成本效益的解决方案。
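The one-shot exchange described above, where clients send only their predictions on an unlabeled public dataset, can be sketched as a simple ensemble step. Plain averaging followed by an argmax is an assumption made for illustration; the abstract says FedOL uses a tailored pseudo-label generation strategy and an objective that iteratively refines the labels, neither of which is reproduced here. Function names are hypothetical.

```python
def average_predictions(client_probs):
    """Average the class-probability vectors that clients report for one public sample."""
    num_clients = len(client_probs)
    num_classes = len(client_probs[0])
    return [sum(p[c] for p in client_probs) / num_clients for c in range(num_classes)]


def pseudo_label(client_probs):
    """Pick the class with the highest averaged probability as the pseudo-label."""
    avg = average_predictions(client_probs)
    return max(range(len(avg)), key=lambda c: avg[c])


# Hypothetical predictions from three heterogeneous client models for one sample.
client_probs = [
    [0.7, 0.2, 0.1],
    [0.1, 0.6, 0.3],
    [0.6, 0.3, 0.1],
]
avg = average_predictions(client_probs)
label = pseudo_label(client_probs)
print(avg, label)
```

The server would then train its larger model on the public samples using such pseudo-labels (or the averaged soft targets), which is why no model weights need to be exchanged and client architectures are free to differ.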

Title: DiffIER: Optimizing Diffusion Models with Iterative Error Reduction

Authors: Ao Chen, Lihe Ding, Tianfan Xue
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.13628
Pdf URL: https://arxiv.org/pdf/2508.13628
Copy Paste: [[2508.13628]] DiffIER: Optimizing Diffusion Models with Iterative Error Reduction(https://arxiv.org/abs/2508.13628)
Keywords: super-resolution, generation
Abstract: Diffusion models have demonstrated remarkable capabilities in generating high-quality samples and enhancing performance across diverse domains through Classifier-Free Guidance (CFG). However, the quality of generated samples is highly sensitive to the selection of the guidance weight. In this work, we identify a critical "training-inference gap" and we argue that it is the presence of this gap that undermines the performance of conditional generation and renders outputs highly sensitive to the guidance weight. We quantify this gap by measuring the accumulated error during the inference stage and establish a correlation between the selection of guidance weight and minimizing this gap. Furthermore, to mitigate this gap, we propose DiffIER, an optimization-based method for high-quality generation. We demonstrate that the accumulated error can be effectively reduced by an iterative error minimization at each step during inference. By introducing this novel plug-and-play optimization framework, we enable the optimization of errors at every single inference step and enhance generation quality. Empirical results demonstrate that our proposed method outperforms baseline approaches in conditional generation tasks. Furthermore, the method achieves consistent success in text-to-image generation, image super-resolution, and text-to-speech generation, underscoring its versatility and potential for broad applications in future research.
摘要：扩散模型表明，通过无分类器的指导（CFG）在生成高质量的样本和增强性能方面具有显着的功能。但是，生成的样品的质量对选择指导重量的质量非常敏感。在这项工作中，我们确定了一个关键的``训练 - 推断差距''，我们认为正是这种差距的存在破坏了有条件产生的性能，并且使对指导重量的输出非常敏感。我们通过测量推理阶段的累积误差来量化这一差距，并在选择指导权重和最小化此间隙之间建立相关性。此外，为了减轻这一差距，我们提出了Diffier，这是一种基于优化的高质量生成方法。我们证明，在推理过程中的每个步骤中，迭代误差最小化可以有效地减少累积误差。通过引入这个新颖的插件优化框架，我们可以在每个推论步骤中优化错误，并提高发电质量。经验结果表明，我们提出的方法在条件生成任务中的基线方法优于基线方法。此外，该方法在文本到图像生成，图像超分辨率和文本到语音生成方面取得了一致的成功，强调了其多功能性和在未来研究中的广泛应用的潜力。
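The role of the guidance weight is easiest to see in the standard Classifier-Free Guidance combination of the conditional and unconditional noise predictions. The sketch below shows only that widely used formula on plain lists; it does not implement DiffIER's iterative error minimization, and the function name and toy values are illustrative.

```python
def cfg_combine(eps_uncond, eps_cond, guidance_weight):
    """Standard CFG: eps = eps_uncond + w * (eps_cond - eps_uncond).

    w = 0 gives the unconditional prediction, w = 1 the conditional one,
    and w > 1 extrapolates beyond the conditional prediction.
    """
    return [u + guidance_weight * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy noise predictions for a two-element latent.
eps_uncond = [0.0, 1.0]
eps_cond = [1.0, 3.0]
for w in (0.0, 1.0, 2.0):
    print(w, cfg_combine(eps_uncond, eps_cond, w))
```

Because the output scales linearly with the weight at every denoising step, a poorly chosen weight compounds over the whole sampling trajectory, which is consistent with the sensitivity the abstract describes.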

Title: Text2Weight: Bridging Natural Language and Neural Network Weight Spaces

Authors: Bowen Tian, Wenshuo Chen, Zexi Li, Songning Lai, Jiemin Wu, Yutao Yue
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.13633
Pdf URL: https://arxiv.org/pdf/2508.13633
Copy Paste: [[2508.13633]] Text2Weight: Bridging Natural Language and Neural Network Weight Spaces(https://arxiv.org/abs/2508.13633)
Keywords: generation, generative
Abstract: How far are we really from automatically generating neural networks? While neural network weight generation shows promise, current approaches struggle with generalization to unseen tasks and practical application exploration. To address this, we propose T2W, a diffusion transformer framework that generates task-specific weights conditioned on natural language descriptions. T2W hierarchically processes network parameters into uniform blocks, integrates text embeddings from CLIP via a prior attention mechanism, and employs adversarial training with weight-space augmentation to enhance generalization. Experiments on Cifar100, Caltech256, and TinyImageNet demonstrate T2W's ability to produce high-quality weights for unseen tasks, outperforming optimization-based initialization and enabling novel applications such as weight enhancement and text-guided model fusion. Our work bridges textual semantics with weight-space dynamics, supported by an open-source dataset of text-weight pairs, advancing the practicality of generative models in neural network parameter synthesis. Our code is available on GitHub.

Title: Personalized Subgraph Federated Learning with Sheaf Collaboration

Authors: Wenfei Liang, Yanan Zhao, Rui She, Yiming Li, Wee Peng Tay
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.13642
Pdf URL: https://arxiv.org/pdf/2508.13642
Copy Paste: [[2508.13642]] Personalized Subgraph Federated Learning with Sheaf Collaboration(https://arxiv.org/abs/2508.13642)
Keywords: generation
Abstract: Graph-structured data is prevalent in many applications. In subgraph federated learning (FL), this data is distributed across clients, each with a local subgraph. Personalized subgraph FL aims to develop a customized model for each client to handle diverse data distributions. However, performance variation across clients remains a key issue due to the heterogeneity of local subgraphs. To overcome the challenge, we propose FedSheafHN, a novel framework built on a sheaf collaboration mechanism to unify enhanced client descriptors with efficient personalized model generation. Specifically, FedSheafHN embeds each client's local subgraph into a server-constructed collaboration graph by leveraging graph-level embeddings and employing sheaf diffusion within the collaboration graph to enrich client representations. Subsequently, FedSheafHN generates customized client models via a server-optimized hypernetwork. Empirical evaluations demonstrate that FedSheafHN outperforms existing personalized subgraph FL methods on various graph datasets. Additionally, it exhibits fast model convergence and effectively generalizes to new clients.

Title: Disentangled Deep Smoothed Bootstrap for Fair Imbalanced Regression

Authors: Samuel Stocksieker, Denys Pommeret, Arthur Charpentier
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2508.13829
Pdf URL: https://arxiv.org/pdf/2508.13829
Copy Paste: [[2508.13829]] Disentangled Deep Smoothed Bootstrap for Fair Imbalanced Regression(https://arxiv.org/abs/2508.13829)
Keywords: generation
Abstract: Imbalanced distribution learning is a common and significant challenge in predictive modeling, often reducing the performance of standard algorithms. Although various approaches address this issue, most are tailored to classification problems, with a limited focus on regression. This paper introduces a novel method to improve learning on tabular data within the Imbalanced Regression (IR) framework, which is a critical problem. We propose using Variational Autoencoders (VAEs) to model and define a latent representation of data distributions. However, VAEs can be inefficient with imbalanced data like other standard approaches. To address this, we develop an innovative data generation method that combines a disentangled VAE with a Smoothed Bootstrap applied in the latent space. We evaluate the efficiency of this method through numerical comparisons with competitors on benchmark datasets for IR.
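Note: a smoothed bootstrap resamples observations with replacement and perturbs each draw with kernel noise. The following is a minimal sketch on plain one-dimensional data with a Gaussian kernel; it does not reproduce the paper's disentangled VAE or its latent-space procedure, and the function name, bandwidth, and data values are illustrative assumptions.

```python
import random


def smoothed_bootstrap(data, n_samples, bandwidth, seed=0):
    """Draw n_samples points: resample with replacement, then add Gaussian noise.

    bandwidth is the standard deviation of the smoothing kernel;
    bandwidth = 0 reduces to the ordinary bootstrap.
    """
    rng = random.Random(seed)
    return [rng.choice(data) + rng.gauss(0.0, bandwidth) for _ in range(n_samples)]


# Hypothetical imbalanced target values: few observations in the upper tail.
targets = [1.0, 1.1, 0.9, 1.2, 1.0, 5.0]
synthetic = smoothed_bootstrap(targets, n_samples=10, bandwidth=0.1)
print(synthetic)
```

In the setting the abstract describes, the resampling would be applied to latent codes rather than to raw values, with decoding back to data space afterwards.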

Title: SAGA: Learning Signal-Aligned Distributions for Improved Text-to-Image Generation

Authors: Paul Grimal, Michaël Soumm, Hervé Le Borgne, Olivier Ferret, Akihiro Sugimoto
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.13866
Pdf URL: https://arxiv.org/pdf/2508.13866
Copy Paste: [[2508.13866]] SAGA: Learning Signal-Aligned Distributions for Improved Text-to-Image Generation(https://arxiv.org/abs/2508.13866)
Keywords: generation
Abstract: State-of-the-art text-to-image models produce visually impressive results but often struggle with precise alignment to text prompts, leading to missing critical elements or unintended blending of distinct concepts. We propose a novel approach that learns a high-success-rate distribution conditioned on a target prompt, ensuring that generated images faithfully reflect the corresponding prompts. Our method explicitly models the signal component during the denoising process, offering fine-grained control that mitigates over-optimization and out-of-distribution artifacts. Moreover, our framework is training-free and seamlessly integrates with both existing diffusion and flow matching architectures. It also supports additional conditioning modalities -- such as bounding boxes -- for enhanced spatial alignment. Extensive experiments demonstrate that our approach outperforms current state-of-the-art methods. The code is available at this https URL.

Title: Revisiting Diffusion Q-Learning: From Iterative Denoising to One-Step Action Generation

Authors: Thanh Nguyen, Chang D. Yoo
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.13904
Pdf URL: https://arxiv.org/pdf/2508.13904
Copy Paste: [[2508.13904]] Revisiting Diffusion Q-Learning: From Iterative Denoising to One-Step Action Generation(https://arxiv.org/abs/2508.13904)
Keywords: generation, generative
Abstract: The generative power of diffusion models (DMs) has recently enabled high-performing decision-making algorithms in offline reinforcement learning (RL), achieving state-of-the-art results across standard benchmarks. Among them, Diffusion Q-Learning (DQL) stands out as a leading method for its consistently strong performance. Nevertheless, DQL remains limited in practice due to its reliance on multi-step denoising for action generation during both training and inference. Although one-step denoising is desirable, simply applying it to DQL leads to a drastic performance drop. In this work, we revisit DQL and identify its core limitations. We then propose One-Step Flow Q-Learning (OFQL), a novel framework that enables efficient one-step action generation during both training and inference, without requiring auxiliary models, distillation, or multi-phase training. Specifically, OFQL reformulates DQL within the sample-efficient Flow Matching (FM) framework. While conventional FM induces curved generative trajectories that impede one-step generation, OFQL instead learns an average velocity field that facilitates direct, accurate action generation. Collectively, OFQL eliminates the need for multi-step sampling and recursive gradient updates in DQL, resulting in faster and more robust training and inference. Extensive experiments on the D4RL benchmark demonstrate that OFQL outperforms DQL and other diffusion-based baselines, while substantially reducing both training and inference time compared to DQL.
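Note: the contrast between instantaneous and average velocity can be illustrated with a toy one-dimensional trajectory. The following is a sketch under stated assumptions, not OFQL itself: the curved path x(t) = x0 + (x1 - x0) t^2 and all function names are hypothetical, and the average velocity is computed analytically here, whereas the abstract describes learning it.

```python
def position(t, x0, x1):
    """Toy curved trajectory from x0 (t = 0) to x1 (t = 1)."""
    return x0 + (x1 - x0) * t ** 2


def instantaneous_velocity(t, x0, x1):
    """Time derivative of the toy trajectory at time t."""
    return 2.0 * t * (x1 - x0)


def average_velocity(r, t, x0, x1):
    """Displacement between times r and t divided by the elapsed time."""
    return (position(t, x0, x1) - position(r, x0, x1)) / (t - r)


x0, x1 = 0.0, 1.0

# A single Euler step of size 1 using the velocity at t = 0 does not move at all,
# because the curved path starts with zero speed.
one_step_inst = x0 + 1.0 * instantaneous_velocity(0.0, x0, x1)

# A single step using the average velocity over [0, 1] lands on the endpoint.
one_step_avg = x0 + 1.0 * average_velocity(0.0, 1.0, x0, x1)

print(one_step_inst, one_step_avg)
```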

Title: DIME-Net: A Dual-Illumination Adaptive Enhancement Network Based on Retinex and Mixture-of-Experts

Authors: Ziang Wang, Xiaoqin Wang, Dingyi Wang, Qiang Li, Shushan Qiao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.13921
Pdf URL: https://arxiv.org/pdf/2508.13921
Copy Paste: [[2508.13921]] DIME-Net: A Dual-Illumination Adaptive Enhancement Network Based on Retinex and Mixture-of-Experts(https://arxiv.org/abs/2508.13921)
Keywords: restoration
Abstract: Image degradation caused by complex lighting conditions such as low-light and backlit scenarios is commonly encountered in real-world environments, significantly affecting image quality and downstream vision tasks. Most existing methods focus on a single type of illumination degradation and lack the ability to handle diverse lighting conditions in a unified manner. To address this issue, we propose a dual-illumination enhancement framework called DIME-Net. The core of our method is a Mixture-of-Experts illumination estimator module, where a sparse gating mechanism adaptively selects suitable S-curve expert networks based on the illumination characteristics of the input image. By integrating Retinex theory, this module effectively performs enhancement tailored to both low-light and backlit images. To further correct illumination-induced artifacts and color distortions, we design a damage restoration module equipped with Illumination-Aware Cross Attention and Sequential-State Global Attention mechanisms. In addition, we construct a hybrid illumination dataset, MixBL, by integrating existing datasets, allowing our model to achieve robust illumination adaptability through a single training process. Experimental results show that DIME-Net achieves competitive performance on both synthetic and real-world low-light and backlit datasets without any retraining. These results demonstrate its generalization ability and potential for practical multimedia applications under diverse and complex illumination conditions.
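Note: a sparse gating mechanism of the kind mentioned above can be sketched as top-k selection followed by a softmax over the selected experts. This is a generic Mixture-of-Experts illustration, not DIME-Net's module; the top-k softmax form, the function names, and the scores are assumptions.

```python
import math


def sparse_gate(scores, k):
    """Return gating weights that are nonzero only for the k highest scores."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    peak = max(scores[i] for i in top)  # subtract the max for numerical stability
    exps = {i: math.exp(scores[i] - peak) for i in top}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0 for i in range(len(scores))]


def mixture(expert_outputs, weights):
    """Weighted sum of scalar expert outputs."""
    return sum(w * o for w, o in zip(weights, expert_outputs))


# Hypothetical gate scores for four experts; only the best two are kept.
weights = sparse_gate([2.0, 0.5, 1.0, -1.0], k=2)
print(weights)
print(mixture([0.8, 0.1, 0.4, 0.9], weights))
```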

Title: ViT-FIQA: Assessing Face Image Quality using Vision Transformers

Authors: Andrea Atzori, Fadi Boutros, Naser Damer
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.13957
Pdf URL: https://arxiv.org/pdf/2508.13957
Copy Paste: [[2508.13957]] ViT-FIQA: Assessing Face Image Quality using Vision Transformers(https://arxiv.org/abs/2508.13957)
Keywords: quality assessment
Abstract: Face Image Quality Assessment (FIQA) aims to predict the utility of a face image for face recognition (FR) systems. State-of-the-art FIQA methods mainly rely on convolutional neural networks (CNNs), leaving the potential of Vision Transformer (ViT) architectures underexplored. This work proposes ViT-FIQA, a novel approach that extends standard ViT backbones, originally optimized for FR, through a learnable quality token designed to predict a scalar utility score for any given face image. The learnable quality token is concatenated with the standard image patch tokens, and the whole sequence is processed via global self-attention by the ViT encoders to aggregate contextual information across all patches. At the output of the backbone, ViT-FIQA branches into two heads: (1) the patch tokens are passed through a fully connected layer to learn discriminative face representations via a margin-penalty softmax loss, and (2) the quality token is fed into a regression head to learn to predict the face sample's utility. Extensive experiments on challenging benchmarks and several FR models, including both CNN- and ViT-based architectures, demonstrate that ViT-FIQA consistently achieves top-tier performance. These results underscore the effectiveness of transformer-based architectures in modeling face image utility and highlight the potential of ViTs as a scalable foundation for future FIQA research (see this https URL).

Title: ROVR-Open-Dataset: A Large-Scale Depth Dataset for Autonomous Driving

Authors: Xianda Guo, Ruijun Zhang, Yiqun Duan, Ruilin Wang, Keyuan Zhou, Wenzhao Zheng, Wenke Huang, Gangwei Xu, Mike Horton, Yuan Si, Hao Zhao, Long Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.13977
Pdf URL: https://arxiv.org/pdf/2508.13977
Copy Paste: [[2508.13977]] ROVR-Open-Dataset: A Large-Scale Depth Dataset for Autonomous Driving(https://arxiv.org/abs/2508.13977)
Keywords: generation
Abstract: Depth estimation is a fundamental task for 3D scene understanding in autonomous driving, robotics, and augmented reality. Existing depth datasets, such as KITTI, nuScenes, and DDAD, have advanced the field but suffer from limitations in diversity and scalability. As benchmark performance on these datasets approaches saturation, there is an increasing need for a new generation of large-scale, diverse, and cost-efficient datasets to support the era of foundation models and multi-modal learning. To address these challenges, we introduce a large-scale, diverse, frame-wise continuous dataset for depth estimation in dynamic outdoor driving environments, comprising 20K video frames to evaluate existing methods. Our lightweight acquisition pipeline ensures broad scene coverage at low cost, while sparse yet statistically sufficient ground truth enables robust training. Compared to existing datasets, ours presents greater diversity in driving scenarios and lower depth density, creating new challenges for generalization. Benchmark experiments with standard monocular depth estimation models validate the dataset's utility and highlight substantial performance gaps in challenging conditions, establishing a new platform for advancing depth estimation research.

Title: Physics-Based 3D Simulation for Synthetic Data Generation and Failure Analysis in Packaging Stability Assessment

Authors: Samuel Seligardi, Pietro Musoni, Eleonora Iotti, Gianluca Contesso, Alessandro Dal Palù
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.13989
Pdf URL: https://arxiv.org/pdf/2508.13989
Copy Paste: [[2508.13989]] Physics-Based 3D Simulation for Synthetic Data Generation and Failure Analysis in Packaging Stability Assessment(https://arxiv.org/abs/2508.13989)
Keywords: generation
Abstract: The design and analysis of pallet setups are essential for ensuring the safety of package transportation. With rising demands in the logistics sector, the development of automated systems utilizing advanced technologies has become increasingly crucial. Moreover, the widespread use of plastic wrapping has motivated researchers to investigate eco-friendly alternatives that still adhere to safety standards. We present a fully controllable and accurate physical simulation system capable of replicating the behavior of moving pallets. It features a 3D graphics-based virtual environment that supports a wide range of configurations, including variable package layouts, different wrapping materials, and diverse dynamic conditions. This innovative approach reduces the need for physical testing, cutting costs and environmental impact while improving measurement accuracy for analyzing pallet dynamics. Additionally, we train a deep neural network to evaluate the rendered videos generated by our simulator, as a crash-test predictor for pallet configurations, further enhancing the system's utility in safety analysis.

Title: ResPlan: A Large-Scale Vector-Graph Dataset of 17,000 Residential Floor Plans

Authors: Mohamed Abouagour, Eleftherios Garyfallidis
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2508.14006
Pdf URL: https://arxiv.org/pdf/2508.14006
Copy Paste: [[2508.14006]] ResPlan: A Large-Scale Vector-Graph Dataset of 17,000 Residential Floor Plans(https://arxiv.org/abs/2508.14006)
Keywords: generation, generative
Abstract: We introduce ResPlan, a large-scale dataset of 17,000 detailed, structurally rich, and realistic residential floor plans, created to advance spatial AI research. Each plan includes precise annotations of architectural elements (walls, doors, windows, balconies) and functional spaces (such as kitchens, bedrooms, and bathrooms). ResPlan addresses key limitations of existing datasets such as RPLAN (Wu et al., 2019) and MSD (van Engelenburg et al., 2024) by offering enhanced visual fidelity and greater structural diversity, reflecting realistic and non-idealized residential layouts. Designed as a versatile, general-purpose resource, ResPlan supports a wide range of applications including robotics, reinforcement learning, generative AI, virtual and augmented reality, simulations, and game development. Plans are provided in both geometric and graph-based formats, enabling direct integration into simulation engines and fast 3D conversion. A key contribution is an open-source pipeline for geometry cleaning, alignment, and annotation refinement. Additionally, ResPlan includes structured representations of room connectivity, supporting graph-based spatial reasoning tasks. Finally, we present comparative analyses with existing benchmarks and outline several open benchmark tasks enabled by ResPlan. Ultimately, ResPlan offers a significant advance in scale, realism, and usability, providing a robust foundation for developing and benchmarking next-generation spatial intelligence systems.

Title: InfiniteTalk: Audio-driven Video Generation for Sparse-Frame Video Dubbing

Authors: Shaoshu Yang, Zhe Kong, Feng Gao, Meng Cheng, Xiangyu Liu, Yong Zhang, Zhuoliang Kang, Wenhan Luo, Xunliang Cai, Ran He, Xiaoming Wei
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.14033
Pdf URL: https://arxiv.org/pdf/2508.14033
Copy Paste: [[2508.14033]] InfiniteTalk: Audio-driven Video Generation for Sparse-Frame Video Dubbing(https://arxiv.org/abs/2508.14033)
Keywords: generation
Abstract: Recent breakthroughs in video AIGC have ushered in a transformative era for audio-driven human animation. However, conventional video dubbing techniques remain constrained to mouth region editing, resulting in discordant facial expressions and body gestures that compromise viewer immersion. To overcome this limitation, we introduce sparse-frame video dubbing, a novel paradigm that strategically preserves reference keyframes to maintain identity, iconic gestures, and camera trajectories while enabling holistic, audio-synchronized full-body motion editing. Through critical analysis, we identify why naive image-to-video models fail in this task, particularly their inability to achieve adaptive conditioning. Addressing this, we propose InfiniteTalk, a streaming audio-driven generator designed for infinite-length long sequence dubbing. This architecture leverages temporal context frames for seamless inter-chunk transitions and incorporates a simple yet effective sampling strategy that optimizes control strength via fine-grained reference frame positioning. Comprehensive evaluations on HDTF, CelebV-HQ, and EMTD datasets demonstrate state-of-the-art performance. Quantitative metrics confirm superior visual realism, emotional coherence, and full-body motion synchronization.

Title: GeoSAM2: Unleashing the Power of SAM2 for 3D Part Segmentation

Authors: Ken Deng, Yunhan Yang, Jingxiang Sun, Xihui Liu, Yebin Liu, Ding Liang, Yan-Pei Cao
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.14036
Pdf URL: https://arxiv.org/pdf/2508.14036
Copy Paste: [[2508.14036]] GeoSAM2: Unleashing the Power of SAM2 for 3D Part Segmentation(https://arxiv.org/abs/2508.14036)
Keywords: generation, generative
Abstract: Modern 3D generation methods can rapidly create shapes from sparse or single views, but their outputs often lack geometric detail due to computational constraints. We present DetailGen3D, a generative approach specifically designed to enhance these generated 3D shapes. Our key insight is to model the coarse-to-fine transformation directly through data-dependent flows in latent space, avoiding the computational overhead of large-scale 3D generative models. We introduce a token matching strategy that ensures accurate spatial correspondence during refinement, enabling local detail synthesis while preserving global structure. By carefully designing our training data to match the characteristics of synthesized coarse shapes, our method can effectively enhance shapes produced by various 3D generation and reconstruction approaches, from single-view to sparse multi-view inputs. Extensive experiments demonstrate that DetailGen3D achieves high-fidelity geometric detail synthesis while maintaining efficiency in training.