2025-08-12

Title: MILD: Multi-Layer Diffusion Strategy for Complex and Precise Multi-IP Aware Human Erasing

Authors: Jinghan Yu, Zhiyuan Ma, Yue Ma, Kaiqi Liu, Yuhan Wang, Jianjun Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.06543
Pdf URL: https://arxiv.org/pdf/2508.06543
Copy Paste: [[2508.06543]] MILD: Multi-Layer Diffusion Strategy for Complex and Precise Multi-IP Aware Human Erasing(https://arxiv.org/abs/2508.06543)
Keywords: restoration, generation
Abstract: Recent years have witnessed the success of diffusion models in image-customized tasks. Prior works have achieved notable progress on human-oriented erasing using explicit mask guidance and semantic-aware inpainting. However, they struggle under complex multi-IP scenarios involving human-human occlusions, human-object entanglements, and background interferences. These challenges are mainly due to: 1) Dataset limitations, as existing datasets rarely cover dense occlusions, camouflaged backgrounds, and diverse interactions; 2) Lack of spatial decoupling, where foreground instances cannot be effectively disentangled, limiting clean background restoration. In this work, we introduce a high-quality multi-IP human erasing dataset with diverse pose variations and complex backgrounds. We then propose Multi-Layer Diffusion (MILD), a novel strategy that decomposes generation into semantically separated pathways for each instance and the background. To enhance human-centric understanding, we introduce Human Morphology Guidance, integrating pose, parsing, and spatial relations. We further present Spatially-Modulated Attention to better guide attention flow. Extensive experiments show that MILD outperforms state-of-the-art methods on challenging human erasing benchmarks.
摘要：近年来见证了扩散模型在图像注定的任务中的成功。先前的工作在使用明确的面具指导和语义意识的介绍方面取得了显着的进展。但是，他们在复杂的多IP场景中挣扎，涉及人类遮挡，人类对象纠缠和背景干扰。这些挑战主要是由于：1）数据集限制，因为现有数据集很少涵盖密集的阻塞，伪装的背景和不同的交互作用； 2）缺乏空间脱钩，而前景实例无法有效地分解，从而限制了干净的背景恢复。在这项工作中，我们引入了具有不同姿势变化和复杂背景的高质量多IP人类擦除数据集。然后，我们提出了多层扩散（温和），这是一种新型策略，将生成分解为每个实例和背景的语义分离途径。为了增强以人为本的理解，我们介绍了人类形态指导，整合姿势，解析和空间关系。我们进一步提出了空间调节的注意力，以更好地指导注意力流动。广泛的实验表明，关于挑战人类擦除基准的最先进的方法。

Title: Statistical Confidence Rescoring for Robust 3D Scene Graph Generation from Multi-View Images

Authors: Qi Xun Yeo, Yanyan Li, Gim Hee Lee
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2508.06546
Pdf URL: https://arxiv.org/pdf/2508.06546
Copy Paste: [[2508.06546]] Statistical Confidence Rescoring for Robust 3D Scene Graph Generation from Multi-View Images(https://arxiv.org/abs/2508.06546)
Keywords: generation
Abstract: Modern 3D semantic scene graph estimation methods utilize ground truth 3D annotations to accurately predict target objects, predicates, and relationships. In the absence of given 3D ground truth representations, we explore leveraging only multi-view RGB images to tackle this task. To attain robust features for accurate scene graph estimation, we must overcome the noisy reconstructed pseudo point-based geometry from predicted depth maps and reduce the amount of background noise present in multi-view image features. The key is to enrich node and edge features with accurate semantic and spatial information and through neighboring relations. We obtain semantic masks to guide feature aggregation to filter background features and design a novel method to incorporate neighboring node information to aid robustness of our scene graph estimates. Furthermore, we leverage explicit statistical priors calculated from the training summary statistics to refine node and edge predictions based on their one-hop neighborhood. Our experiments show that our method outperforms current methods purely using multi-view images as the initial input. Our project page is available at this https URL.
摘要：现代3D语义场景图估计方法利用地面真相3D注释来准确预测目标对象，谓词和关系。在没有给定的3D地面真相表示的情况下，我们仅探索仅利用多视图RGB图像来解决此任务。为了获得可靠的功能以进行准确的场景图估计，我们必须从预测的深度图中克服嘈杂的基于伪点的几何形状，并减少多视图图像特征中存在的背景噪声量。关键是通过准确的语义和空间信息以及通过相邻关系丰富节点和边缘特征。我们获得语义掩模来指导特征聚合以滤波背景特征，并设计一种新型方法，以合并相邻的节点信息以帮助我们场景图估计的鲁棒性。此外，我们利用从培训摘要统计数据中计算出的明确统计先验，以根据其单跳社区来完善节点和边缘预测。我们的实验表明，我们的方法纯粹使用多视图图像作为初始输入纯粹的当前方法。我们的项目页面可在此HTTPS URL上找到。
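
Illustrative sketch (not from the paper): the abstract mentions refining edge predictions with statistical priors computed from training summary statistics. One simple way to picture this, assuming the prior is an empirical predicate frequency for each (subject class, object class) pair, is to multiply the model's predicate probabilities by that prior and renormalize. The names `fit_prior` and `rescore`, the smoothing constant `alpha`, and the toy triples are all hypothetical; the paper's actual rescoring rule may differ.

```python
from collections import Counter, defaultdict


def fit_prior(triples, predicates, alpha=1.0):
    """Build a smoothed empirical prior P(predicate | subject class, object class).

    `triples` is an iterable of (subject_class, predicate, object_class) taken
    from training annotations. `alpha` is an assumed add-alpha smoothing constant.
    """
    counts = defaultdict(Counter)
    for subj, pred, obj in triples:
        counts[(subj, obj)][pred] += 1

    def prior(subj, obj):
        c = counts[(subj, obj)]
        total = sum(c.values()) + alpha * len(predicates)
        return {p: (c[p] + alpha) / total for p in predicates}

    return prior


def rescore(model_probs, prior_probs):
    """Multiply model probabilities by the prior and renormalize to sum to 1."""
    raw = {p: model_probs[p] * prior_probs[p] for p in model_probs}
    norm = sum(raw.values())
    return {p: v / norm for p, v in raw.items()}


# Toy usage: the training data strongly favors "cup on table".
predicates = ["on", "under"]
train = [("cup", "on", "table")] * 8 + [("cup", "under", "table")]
prior = fit_prior(train, predicates)
refined = rescore({"on": 0.45, "under": 0.55}, prior("cup", "table"))
```

In this toy case the raw model slightly prefers "under", while the rescored distribution prefers "on" because the prior dominates.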

Title: Slice or the Whole Pie? Utility Control for AI Models

Authors: Ye Tao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.06551
Pdf URL: https://arxiv.org/pdf/2508.06551
Copy Paste: [[2508.06551]] Slice or the Whole Pie? Utility Control for AI Models(https://arxiv.org/abs/2508.06551)
Keywords: generation
Abstract: Training deep neural networks (DNNs) has become an increasingly resource-intensive task, requiring large volumes of labeled data, substantial computational power, and considerable fine-tuning efforts to achieve optimal performance across diverse use cases. Although pre-trained models offer a useful starting point, adapting them to meet specific user needs often demands extensive customization and infrastructure overhead. This challenge grows when a single model must support diverse applications with differing requirements for performance. Traditional solutions often involve training multiple model versions to meet varying requirements, which can be inefficient and difficult to maintain. In order to overcome this challenge, we propose NNObfuscator, a novel utility control mechanism that enables AI models to dynamically modify their performance according to predefined conditions. It is different from traditional methods that need separate models for each user. Instead, NNObfuscator allows a single model to be adapted in real time, giving you controlled access to multiple levels of performance. This mechanism enables model owners to set up tiered access, ensuring that free-tier users receive a baseline level of performance while premium users benefit from enhanced capabilities. The approach improves resource allocation, reduces unnecessary computation, and supports sustainable business models in AI deployment. To validate our approach, we conducted experiments on multiple tasks, including image classification, semantic segmentation, and text-to-image generation, using well-established models such as ResNet, DeepLab, VGG16, FCN and Stable Diffusion. Experimental results show that NNObfuscator successfully makes a model more adaptable, so that a single trained model can handle a broad range of tasks without requiring a lot of changes.
摘要：培训深度神经网络（DNNS）已成为越来越多的资源密集型任务，需要大量标记的数据，实质性的计算能力以及相当大的微调努力，以实现各种用例的最佳性能。尽管预训练的模型提供了一个有用的起点，但调整它们以满足特定的用户需求通常需要广泛的自定义和基础架构开销。当单个模型必须支持具有不同性能要求的不同应用时，这一挑战就会增加。传统解决方案通常涉及培训多个模型版本以满足不同的要求，这可能是效率低下且难以维护的。为了克服这一挑战，我们提出了NnobFuscator，这是一种新型的实用程序控制机制，使AI模型能够根据预定义的条件动态修改其性能。它与需要为每个用户的单独模型的传统方法不同。取而代之的是，nnobfuscator允许实时调整单个模型，从而使您可以控制对多个级别的性能访问。这种机制使模型所有者可以设置分层访问权限，从而确保自由层用户获得基线的性能水平，而高级用户则从增强的功能中受益。该方法可改善资源分配，减少不必要的计算，并支持AI部署中的可持续业务模型。为了验证我们的方法，我们使用诸如Resnet，DeepLab，VGG16，FCN和稳定扩散的模型进行了多个任务进行实验，包括图像分类，语义分割和文本进行图像生成。实验结果表明，NnoBfuscator成功地使模型更适应能力，因此单个训练有素的模型可以处理广泛的任务而无需进行大量更改。

Title: GFlowNets for Learning Better Drug-Drug Interaction Representations

Authors: Azmine Toushik Wasi
Subjects: cs.LG, q-bio.BM, q-bio.MN
Abstract URL: https://arxiv.org/abs/2508.06576
Pdf URL: https://arxiv.org/pdf/2508.06576
Copy Paste: [[2508.06576]] GFlowNets for Learning Better Drug-Drug Interaction Representations(https://arxiv.org/abs/2508.06576)
Keywords: generative
Abstract: Drug-drug interactions pose a significant challenge in clinical pharmacology, with severe class imbalance among interaction types limiting the effectiveness of predictive models. Common interactions dominate datasets, while rare but critical interactions remain underrepresented, leading to poor model performance on infrequent cases. Existing methods often treat DDI prediction as a binary problem, ignoring class-specific nuances and exacerbating bias toward frequent interactions. To address this, we propose a framework combining Generative Flow Networks (GFlowNet) with Variational Graph Autoencoders (VGAE) to generate synthetic samples for rare classes, improving model balance and generating effective and novel DDI pairs. Our approach enhances predictive performance across interaction types, ensuring better clinical reliability.
摘要：药物相互作用在临床药理学中构成了重大挑战，相互作用类型的严重类别失衡限制了预测模型的有效性。共同的相互作用主导了数据集，而罕见但关键的相互作用的代表性不足，导致案例不佳的模型性能差。现有的方法通常将DDI预测视为二进制问题，忽略了特定的差异，并加剧了对频繁相互作用的偏见。为了解决这个问题，我们提出了一个将生成流网络（GFLOWNET）与各种图形自动编码器（VGAE）结合起来的框架，以生成用于稀有类别的合成样本，改善模型平衡并产生有效且新颖的DDI对。我们的方法增强了相互作用类型的预测性能，从而确保了更好的临床可靠性。

Title: Generative Artificial Intelligence Extracts Structure-Function Relationships from Plants for New Materials

Authors: Rachel K. Luu, Jingyu Deng, Mohammed Shahrudin Ibrahim, Nam-Joon Cho, Ming Dao, Subra Suresh, Markus J. Buehler
Subjects: cs.LG, cond-mat.dis-nn, cond-mat.mtrl-sci, cond-mat.other, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2508.06591
Pdf URL: https://arxiv.org/pdf/2508.06591
Copy Paste: [[2508.06591]] Generative Artificial Intelligence Extracts Structure-Function Relationships from Plants for New Materials(https://arxiv.org/abs/2508.06591)
Keywords: generation, generative
Abstract: Large language models (LLMs) have reshaped the research landscape by enabling new approaches to knowledge retrieval and creative ideation. Yet their application in discipline-specific experimental science, particularly in highly multi-disciplinary domains like materials science, remains limited. We present a first-of-its-kind framework that integrates generative AI with literature from hitherto-unconnected fields such as plant science, biomimetics, and materials engineering to extract insights and design experiments for materials. We focus on humidity-responsive systems such as pollen-based materials and Rhapis excelsa (broadleaf lady palm) leaves, which exhibit self-actuation and adaptive performance. Using a suite of AI tools, including a fine-tuned model (BioinspiredLLM), Retrieval-Augmented Generation (RAG), agentic systems, and a Hierarchical Sampling strategy, we extract structure-property relationships and translate them into new classes of bioinspired materials. Structured inference protocols generate and evaluate hundreds of hypotheses from a single query, surfacing novel and experimentally tractable ideas. We validate our approach through real-world implementation: LLM-generated procedures, materials designs, and mechanical predictions were tested in the laboratory, culminating in the fabrication of a novel pollen-based adhesive with tunable morphology and measured shear strength, establishing a foundation for future plant-derived adhesive design. This work demonstrates how AI-assisted ideation can drive real-world materials design and enable effective human-AI collaboration.
摘要：大型语言模型（LLM）通过启用知识检索和创造性的新方法来重塑研究格局。然而，它们在特定学科的实验科学中的应用，尤其是在材料科学等高度多学科领域中的应用仍然有限。我们提出了一个初始的框架，该框架将生成性AI与迄今无连接领域的文献相结合，例如植物科学，仿生学和材料工程，以提取材料的见解和设计实验。我们专注于湿度响应的系统，例如基于花粉的材料和Rhapis Excelsa（Broadleaf Lady Palm）叶子，它们表现出自我效果和适应性性能。使用包括微调模型（Bioinspiredllm），检索功能的生成（RAG），代理系统和层次采样策略在内的AI工具套件，我们提取结构 - 秘密关系并将其转化为新的生物启动材料。结构化的推理方案从单个查询中产生和评估数百个假设，表面浮出水面和实验性的思想。我们通过现实世界实施来验证我们的方法：在实验室中测试了LLM生成的程序，材料设计和机械预测，最终在制造具有可调的形态的新型花粉粘合剂并测量剪切强度并为将来的植物剂的基础上建立了基础。这项工作展示了AI辅助的构想如何推动现实世界的材料设计并实现有效的人类协作。

Title: Local Diffusion Models and Phases of Data Distributions

Authors: Fangjun Hu, Guangkuo Liu, Yifan Zhang, Xun Gao
Subjects: cs.LG, cond-mat.stat-mech, quant-ph
Abstract URL: https://arxiv.org/abs/2508.06614
Pdf URL: https://arxiv.org/pdf/2508.06614
Copy Paste: [[2508.06614]] Local Diffusion Models and Phases of Data Distributions(https://arxiv.org/abs/2508.06614)
Keywords: generative
Abstract: As a class of generative artificial intelligence frameworks inspired by statistical physics, diffusion models have shown extraordinary performance in synthesizing complicated data distributions through a denoising process gradually guided by score functions. Real-life data, like images, is often spatially structured in low-dimensional spaces. However, ordinary diffusion models ignore this local structure and learn spatially global score functions, which are often computationally expensive. In this work, we introduce a new perspective on the phases of data distributions, which provides insight into constructing local denoisers with reduced computational costs. We define two distributions as belonging to the same data distribution phase if they can be mutually connected via spatially local operations such as local denoisers. Then, we show that the reverse denoising process consists of an early trivial phase and a late data phase, sandwiching a rapid phase transition where local denoisers must fail. To diagnose such phase transitions, we prove an information-theoretic bound on the fidelity of local denoisers based on conditional mutual information, and conduct numerical experiments in a real-world dataset. This work suggests simpler and more efficient architectures of diffusion models: far from the phase transition point, we can use small local neural networks to compute the score function; global neural networks are only necessary around the narrow time interval of phase transitions. This result also opens up new directions for studying phases of data distributions, the broader science of generative artificial intelligence, and guiding the design of neural networks inspired by physics concepts.
摘要：作为受统计物理启发的一类生成人工智能框架，扩散模型通过通过分数功能逐渐指导的脱氧过程来综合复杂的数据分布时表现出非凡的性能。现实生活中的数据（如图像）通常是在低维空间中在空间上构造的。但是，普通的扩散模型忽略了这种本地结构，并学习空间全局的分数功能，这些功能通常在计算上很昂贵。在这项工作中，我们介绍了有关数据分布阶段的新观点，该观点为构建以降低的计算成本构建本地Denoisiser提供了见识。如果可以通过空间局部操作（例如局部Denoisers）相互连接，我们将两个分布定义为属于同一数据分布阶段。然后，我们表明，反向降解过程包括一个早期阶段和一个晚期数据阶段，将局部Denoisiser必须失败的快速阶段过渡夹。为了诊断此类相变，我们证明了基于条件互信息的本地DINOISER的保真度的信息理论，并在现实世界数据集中进行数值实验。这项工作提出了扩散模型的更简单，更有效的体系结构：远离相变点，我们可以使用小的局部神经网络来计算得分函数；全球神经网络仅在相变的狭窄时间间隔内才有必要。该结果还为研究数据分布的阶段，更广泛的生成人工智能科学以及指导受物理概念启发的神经网络的设计开辟了新的方向。

Title: CycleDiff: Cycle Diffusion Models for Unpaired Image-to-image Translation

Authors: Shilong Zou, Yuhang Huang, Renjiao Yi, Chenyang Zhu, Kai Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.06625
Pdf URL: https://arxiv.org/pdf/2508.06625
Copy Paste: [[2508.06625]] CycleDiff: Cycle Diffusion Models for Unpaired Image-to-image Translation(https://arxiv.org/abs/2508.06625)
Keywords: generative
Abstract: We introduce a diffusion-based cross-domain image translator in the absence of paired training data. Unlike GAN-based methods, our approach integrates diffusion models to learn the image translation process, allowing for more coverable modeling of the data distribution and performance improvement of the cross-domain translation. However, incorporating the translation process within the diffusion process is still challenging since the two processes are not aligned exactly, i.e., the diffusion process is applied to the noisy signal while the translation process is conducted on the clean signal. As a result, recent diffusion-based studies employ separate training or shallow integration to learn the two processes, yet this may trap the translation optimization in local minima, constraining the effectiveness of diffusion models. To address the problem, we propose a novel joint learning framework that aligns the diffusion and the translation process, thereby improving the global optimality. Specifically, we propose to extract the image components with diffusion models to represent the clean signal and employ the translation process with the image components, enabling end-to-end joint learning. On the other hand, we introduce a time-dependent translation network to learn the complex translation mapping, resulting in effective translation learning and significant performance improvement. Benefiting from the design of joint learning, our method enables global optimization of both processes, enhancing the optimality and achieving improved fidelity and structural consistency. We have conducted extensive experiments on RGB$\leftrightarrow$RGB and diverse cross-modality translation tasks including RGB$\leftrightarrow$Edge, RGB$\leftrightarrow$Semantics and RGB$\leftrightarrow$Depth, showcasing better generative performance than the state of the art.
摘要：在没有配对训练数据的情况下，我们引入了基于扩散的跨域图像翻译器。与基于GAN的方法不同，我们的方法集成了扩散模型以学习图像翻译过程，从而使数据分布和跨域翻译的性能改进更加可覆盖。但是，将翻译过程纳入扩散过程仍然具有挑战性，因为这两个过程不准确对齐，即，在清洁信号上进行翻译过程时，将扩散过程应用于嘈杂信号。结果，最近的基于扩散的研究采用单独的培训或浅整合来学习这两个过程，但这可能会导致翻译优化的局部最小值，从而限制了扩散模型的有效性。为了解决这个问题，我们提出了一个新颖的联合学习框架，该框架使扩散和翻译过程保持一致，从而改善了全球最优性。具体而言，我们建议使用扩散模型提取图像组件，以表示清洁信号并使用图像组件采用翻译过程，从而实现端到端的关节学习方式。另一方面，我们介绍了一个与时间有关的翻译网络，以学习复杂的翻译映射，从而有效地翻译学习和显着的性能改善。从联合学习的设计中受益，我们的方法可以使这两个过程的全球优化，从而增强最佳性和提高的忠诚度和结构一致性。我们已经在RGB $ \ leftrightArrow $ RGB上进行了广泛的实验，并进行了多样化的跨模式翻译任务，包括RGB $ \ leftrightArow $ edge，RGB $ \ leftrightArrow $ Sentics和RGB $ \ leftrightArrow $ depth $ depth $ depth $ depth，展示了更好的生产性绩效，比各个国家 /地区的艺术效果更好。

Title: Using Imperfect Synthetic Data in Downstream Inference Tasks

Authors: Yewon Byun, Shantanu Gupta, Zachary C. Lipton, Rachel Leah Childers, Bryan Wilder
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2508.06635
Pdf URL: https://arxiv.org/pdf/2508.06635
Copy Paste: [[2508.06635]] Using Imperfect Synthetic Data in Downstream Inference Tasks(https://arxiv.org/abs/2508.06635)
Keywords: generation
Abstract: Predictions and generations from large language models are increasingly being explored as an aid to computational social science and human subject research in limited data regimes. While previous technical work has explored the potential to use model-predicted labels for unlabeled data in a principled manner, there is increasing interest in using large language models to generate entirely new synthetic samples (also termed as synthetic simulations), such as in responses to surveys. However, it is not immediately clear by what means practitioners can combine such data with real data and yet produce statistically valid conclusions upon them. In this work, we introduce a new estimator based on generalized method of moments, providing a hyperparameter-free solution with strong theoretical guarantees to address the challenge at hand. Surprisingly, we find that interactions between the moment residuals of synthetic data and those of real data can improve estimates of the target parameter. We empirically validate the finite-sample performance of our estimator across different regression tasks in computational social science applications, demonstrating large empirical gains.
摘要：大型语言模型的预测和世代越来越多地被探讨，以帮助计算社会科学和人类学科在有限的数据制度中。尽管以前的技术工作已经探索了以原则性的方式使用模型预测的标签作为未标记数据的潜力，但使用大型语言模型来生成全新的合成样本（也称为合成模拟），例如对Surveys的响应，人们越来越兴趣。但是，从业人员可以将这些数据与实际数据相结合并对其产生统计有效的结论，这是什么手段可以立即清楚的。在这项工作中，我们介绍了一种基于广义的矩方法，提供了一个新的估计量，提供了一种不含高参数的解决方案，并具有强大的理论保证，以应对当前的挑战。令人惊讶的是，我们发现合成数据的矩残差与真实数据的矩之间之间的相互作用可以改善目标参数的估计值。我们从经验上验证了计算社会科学应用中不同回归任务的估计器的有限样本性能，证明了巨大的经验收益。

Title: Privacy-Preserving Tabular Synthetic Data Generation Using TabularARGN

Authors: Andrey Sidorenko, Paul Tiwald
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.06647
Pdf URL: https://arxiv.org/pdf/2508.06647
Copy Paste: [[2508.06647]] Privacy-Preserving Tabular Synthetic Data Generation Using TabularARGN(https://arxiv.org/abs/2508.06647)
Keywords: generation, generative
Abstract: Synthetic data generation has become essential for securely sharing and analyzing sensitive data sets. Traditional anonymization techniques, however, often fail to adequately preserve privacy. We introduce the Tabular Auto-Regressive Generative Network (TabularARGN), a neural network architecture specifically designed for generating high-quality synthetic tabular data. Using a discretization-based auto-regressive approach, TabularARGN achieves high data fidelity while remaining computationally efficient. We evaluate TabularARGN against existing synthetic data generation methods, showing competitive results in statistical similarity, machine learning utility, and detection robustness. We further perform an in-depth privacy evaluation using systematic membership-inference attacks, highlighting the robustness and effective privacy-utility balance of our approach.
摘要：合成数据生成对于安全共享和分析敏感数据集已成为必不可少的。但是，传统的匿名技术通常无法充分保留隐私。我们介绍了表格自动回报生成网络（TabularArgn），这是一种专门设计用于生成高质量合成表格数据的神经网络体系结构。使用基于离散化的自动回归方法，TabularArgn实现了高数据保真度，同时保持计算有效。我们对现有合成数据生成方法的表格评估，在统计相似性，机器学习实用程序和检测鲁棒性方面显示出竞争成果。我们进一步使用系统的会员推荐攻击进行了深入的隐私评估，突出了我们方法的鲁棒性和有效的隐私性平衡。

Title: Towards Robust Red-Green Watermarking for Autoregressive Image Generators

Authors: Denis Lukovnikov, Andreas Müller, Erwin Quiring, Asja Fischer
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.06656
Pdf URL: https://arxiv.org/pdf/2508.06656
Copy Paste: [[2508.06656]] Towards Robust Red-Green Watermarking for Autoregressive Image Generators(https://arxiv.org/abs/2508.06656)
Keywords: generation
Abstract: In-generation watermarking for detecting and attributing generated content has recently been explored for latent diffusion models (LDMs), demonstrating high robustness. However, the use of in-generation watermarks in autoregressive (AR) image models has not been explored yet. AR models generate images by autoregressively predicting a sequence of visual tokens that are then decoded into pixels using a vector-quantized decoder. Inspired by red-green watermarks for large language models, we examine token-level watermarking schemes that bias the next-token prediction based on prior tokens. We find that a direct transfer of these schemes works in principle, but the detectability of the watermarks decreases considerably under common image perturbations. As a remedy, we propose two novel watermarking methods that rely on visual token clustering to assign similar tokens to the same set. Firstly, we investigate a training-free approach that relies on a cluster lookup table, and secondly, we finetune VAE encoders to predict token clusters directly from perturbed images. Overall, our experiments show that cluster-level watermarks improve robustness against perturbations and regeneration attacks while preserving image quality. Cluster classification further boosts watermark detectability, outperforming a set of baselines. Moreover, our methods offer fast verification runtime, comparable to lightweight post-hoc watermarking methods.
摘要：最近已经为潜在扩散模型（LDMS）探索了用于检测和归因于生成含量的生成水印，证明了高鲁棒性。但是，尚未探讨在自回归（AR）图像模型中使用内部水印。 AR模型通过自动调节性预测一系列视觉令牌来生成图像，然后使用矢量定量解码器将其解码为像素。受大型语言模型的红绿色水印的启发，我们检查了令牌级的水印方案，这些方案偏向于基于代币之前的下一步预测。我们发现这些方案的直接转移原则上起作用，但是在常见的图像扰动下，水印的可检测性大大降低。作为一种补救措施，我们提出了两种依赖视觉令牌聚类以将相似令牌分配给同一组的新型水印方法。首先，我们研究了一种依赖群集查找表的无训练方法，其次，我们对VAE编码器进行了Finetune Vae编码，以直接从扰动的图像中预测令牌簇。总体而言，我们的实验表明，集群水平的水印可以改善对扰动和再生攻击的鲁棒性，同时保持图像质量。群集分类进一步提高了水印的可检测性，表现优于一组基线。此外，我们的方法提供了快速验证运行时，与轻巧的事后水印方法相当。
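
Illustrative sketch (not from the paper): the abstract describes red-green watermarks that bias the next visual token based on prior tokens, with similar tokens grouped into clusters so that they fall in the same set. The toy below assumes a keyed hash over (previous cluster, current cluster) decides green membership, uses a hard "always pick a green token" generator instead of a logit bias, and detects the mark with a one-sided z-score on the green count. The key, the 16-cluster mapping `token % 16`, and all function names are hypothetical.

```python
import hashlib
import math
import random


def is_green(prev_cluster, cluster, key="demo-key", gamma=0.5):
    """Keyed pseudo-random test: is `cluster` green after `prev_cluster`?"""
    digest = hashlib.sha256(f"{key}:{prev_cluster}:{cluster}".encode()).digest()
    # Map the first 8 bytes to [0, 1); a fraction gamma of clusters is green.
    return int.from_bytes(digest[:8], "big") / 2**64 < gamma


def generate(n_tokens, vocab_size, token_to_cluster, key="demo-key", seed=0):
    """Toy generator that always samples a token from a green cluster."""
    rng = random.Random(seed)
    tokens = [rng.randrange(vocab_size)]
    for _ in range(n_tokens - 1):
        prev = token_to_cluster[tokens[-1]]
        green = [t for t in range(vocab_size)
                 if is_green(prev, token_to_cluster[t], key)]
        # Fall back to the full vocabulary if no cluster happens to be green.
        tokens.append(rng.choice(green or list(range(vocab_size))))
    return tokens


def detect(tokens, token_to_cluster, key="demo-key", gamma=0.5):
    """z-score of the green count over consecutive token pairs."""
    clusters = [token_to_cluster[t] for t in tokens]
    n = len(clusters) - 1
    if n < 1:
        raise ValueError("need at least two tokens")
    green = sum(is_green(p, c, key, gamma)
                for p, c in zip(clusters, clusters[1:]))
    return (green - gamma * n) / math.sqrt(n * gamma * (1 - gamma))


# Toy usage: 64 tokens grouped into 16 clusters (assumed mapping).
token_to_cluster = [t % 16 for t in range(64)]
marked = generate(200, 64, token_to_cluster)
z_marked = detect(marked, token_to_cluster)
# A perturbation that swaps each token for another one in the same cluster
# leaves the cluster sequence, and therefore the z-score, unchanged.
swapped = [(t + 16) % 64 for t in marked]
z_swapped = detect(swapped, token_to_cluster)
```

Because detection only looks at cluster indices, any perturbation that moves a token within its cluster cannot lower the score, which is the intuition behind cluster-level robustness.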

Title: Fourier Optics and Deep Learning Methods for Fast 3D Reconstruction in Digital Holography

Authors: Justin London
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.06703
Pdf URL: https://arxiv.org/pdf/2508.06703
Copy Paste: [[2508.06703]] Fourier Optics and Deep Learning Methods for Fast 3D Reconstruction in Digital Holography(https://arxiv.org/abs/2508.06703)
Keywords: generation
Abstract: Computer-generated holography (CGH) is a promising method that modulates user-defined waveforms with digital holograms. An efficient and fast pipeline framework is proposed to synthesize CGH using initial point cloud and MRI data. This input data is reconstructed into volumetric objects that are then input into non-convex Fourier optics optimization algorithms for phase-only hologram (POH) and complex-hologram (CH) generation using alternating projection, SGD, and quasi-Newton methods. The reconstruction performance of these algorithms, as measured by MSE, RMSE, and PSNR, is analyzed and compared with that of HoloNet deep learning CGH. Performance metrics are shown to be improved by using 2D median filtering to remove artifacts and speckled noise during optimization.
摘要：计算机生成的全息图（CGH）是一种有前途的方法，可通过数字全息图调节用户定义的波形。提出了一个有效而快速的管道框架，可以使用初始点云和MRI数据合成CGH。该输入数据被重构为体积对象，然后使用交替的投影，SGD和QASI-NETWTON方法将其输入到非相关傅立叶光学优化算法中，用于仅相 - 全息图（POH）和复杂 - 总体（CH）生成。分析了MSE，RMSE和PSNR测量的这些算法的重建性能以及Holonet深度学习CGH的比较。通过使用2D中值过滤来消除优化过程中的伪影和斑点噪声，显示出性能指标可以改善。
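
Illustrative sketch (not from the paper): the abstract evaluates reconstructions with MSE, RMSE, and PSNR and reports gains from 2D median filtering. The pure-Python toy below computes these metrics for images stored as nested lists with values in [0, 1] and applies a 3x3 median filter with windows clipped at the border. The window size, the border handling, and the peak value of 1.0 are assumptions, and the filter is applied once after the fact rather than inside an optimization loop.

```python
import math
from statistics import median


def mse(a, b):
    """Mean squared error between two equally sized 2D images."""
    n = sum(len(row) for row in a)
    return sum((x - y) ** 2
               for ra, rb in zip(a, b) for x, y in zip(ra, rb)) / n


def rmse(a, b):
    """Root mean squared error."""
    return math.sqrt(mse(a, b))


def psnr(a, b, peak=1.0):
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    err = mse(a, b)
    return math.inf if err == 0 else 10 * math.log10(peak ** 2 / err)


def median_filter3(img):
    """3x3 median filter; windows are clipped at the image border."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for i in range(h):
        for j in range(w):
            window = [img[y][x]
                      for y in range(max(0, i - 1), min(h, i + 2))
                      for x in range(max(0, j - 1), min(w, j + 2))]
            out[i][j] = median(window)
    return out


# Toy usage: a flat 5x5 target and a copy with one speckle in the center.
target = [[0.5] * 5 for _ in range(5)]
noisy = [row[:] for row in target]
noisy[2][2] = 1.0
filtered = median_filter3(noisy)
```

Here the single speckle gives an MSE of 0.01 (PSNR of about 20 dB) before filtering, and the median filter removes it entirely in this toy case.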

Title: Restage4D: Reanimating Deformable 3D Reconstruction from a Single Video

Authors: Jixuan He, Chieh Hubert Lin, Lu Qi, Ming-Hsuan Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.06715
Pdf URL: https://arxiv.org/pdf/2508.06715
Copy Paste: [[2508.06715]] Restage4D: Reanimating Deformable 3D Reconstruction from a Single Video(https://arxiv.org/abs/2508.06715)
Keywords: generative
Abstract: Creating deformable 3D content has gained increasing attention with the rise of text-to-image and image-to-video generative models. While these models provide rich semantic priors for appearance, they struggle to capture the physical realism and motion dynamics needed for authentic 4D scene synthesis. In contrast, real-world videos can provide physically grounded geometry and articulation cues that are difficult to hallucinate. One question is raised: Can we generate physically consistent 4D content by leveraging the motion priors of the real-world video? In this work, we explore the task of reanimating deformable 3D scenes from a single video, using the original sequence as a supervisory signal to correct artifacts from synthetic motion. We introduce Restage4D, a geometry-preserving pipeline for video-conditioned 4D restaging. Our approach uses a video-rewinding training strategy to temporally bridge a real base video and a synthetic driving video via a shared motion representation. We further incorporate an occlusion-aware rigidity loss and a disocclusion backtracing mechanism to improve structural and geometry consistency under challenging motion. We validate Restage4D on DAVIS and PointOdyssey, demonstrating improved geometry consistency, motion quality, and 3D tracking performance. Our method not only preserves deformable structure under novel motion, but also automatically corrects errors introduced by generative models, revealing the potential of video priors in the 4D restaging task. Source code and trained models will be released.
摘要：随着文本对图像和图像到视频生成模型的兴起，创建可变形的3D内容引起了人们的关注。尽管这些模型为外观提供了丰富的语义先验，但它们努力捕获真实4D场景综合所需的物理现实主义和运动动态。相比之下，现实世界的视频可以提供很难幻觉的物理几何形状和发音线索。提出了一个问题：\ textit {我们可以通过利用现实世界视频的运动先验来生成物理一致的4D内容}？在这项工作中，我们探讨了从单个视频中复兴可变形的3D场景的任务，使用原始序列作为监督信号来纠正合成运动中的伪影。我们介绍\ textbf {restage4d}，这是一种用于视频条件4D RESTAGGE的几何管道。我们的方法使用视频训练培训策略来暂时桥接真实的基础视频和合成驾驶视频，并通过共享的运动表示。我们进一步纳入了遮挡意识到的刚度损失和不概括的回溯机制，以提高在具有挑战性的运动下的结构和几何形状一致性。我们验证了Davis和Pointodyssey上的Restage4D，证明了几何学的一致性，运动质量和3D跟踪性能。我们的方法不仅可以在新型运动下保留可变形的结构，而且还可以自动纠正生成模型引入的错误，从而揭示了4D RESTAGGING任务中视频先验的潜力。源代码和训练有素的模型将发布。

Title: PANAMA: A Network-Aware MARL Framework for Multi-Agent Path Finding in Digital Twin Ecosystems

Authors: Arman Dogru, R. Irem Bor-Yaliniz, Nimal Gamini Senarath
Subjects: cs.LG, cs.AI, cs.DC, cs.MA, cs.RO
Abstract URL: https://arxiv.org/abs/2508.06767
Pdf URL: https://arxiv.org/pdf/2508.06767
Copy Paste: [[2508.06767]] PANAMA: A Network-Aware MARL Framework for Multi-Agent Path Finding in Digital Twin Ecosystems(https://arxiv.org/abs/2508.06767)
Keywords: generation
Abstract: Digital Twins (DTs) are transforming industries through advanced data processing and analysis, positioning the world of DTs, the Digital World, as a cornerstone of next-generation technologies including embodied AI. As robotics and automated systems scale, efficient data-sharing frameworks and robust algorithms become critical. We explore the pivotal role of data handling in next-gen networks, focusing on dynamics between application and network providers (AP/NP) in DT ecosystems. We introduce PANAMA, a novel algorithm with Priority Asymmetry for Network Aware Multi-agent Reinforcement Learning (MARL) based multi-agent path finding (MAPF). By adopting a Centralized Training with Decentralized Execution (CTDE) framework and asynchronous actor-learner architectures, PANAMA accelerates training while enabling autonomous task execution by embodied AI. Our approach demonstrates superior pathfinding performance in accuracy, speed, and scalability compared to existing benchmarks. Through simulations, we highlight optimized data-sharing strategies for scalable, automated systems, ensuring resilience in complex, real-world environments. PANAMA bridges the gap between network-aware decision-making and robust multi-agent coordination, advancing the synergy between DTs, wireless networks, and AI-driven automation.
摘要：数字双胞胎（DTS）正在通过高级数据处理和分析来改变行业，将DTS，数字世界的世界定位为包括体现AI在内的Next Generation Technologies的基石。随着机器人技术和自动化系统量表，有效的数据共享框架和鲁棒算法变得至关重要。我们探讨了数据处理在下一代网络中的关键作用，重点介绍了DT生态系统中应用程序和网络提供商（AP/NP）之间的动态。我们介绍了巴拿马，这是一种具有优先级不对称性的新型算法，用于网络意识到的多代理增强学习（MARL）的多试路径发现（MAPF）。通过采用分散执行（CTDE）框架和异步演员学习架构的集中式培训，巴拿马可以加速培训，同时通过体现的AI实现自治任务执行。我们的方法表明，与现有基准相比，与现有基准相比，准确性，速度和可伸缩性的较高路径性能。通过模拟，我们突出了针对可扩展的自动化系统的优化数据共享策略，从而确保了复杂的现实世界环境中的弹性。巴拿马弥合了网络感知的决策与强大的多代理协调之间的差距，推动了DTS，无线网络和AI驱动自动化之间的协同作用。

Title: Offline-to-Online Reinforcement Learning with Classifier-Free Diffusion Generation

Authors: Xiao Huang, Xu Liu, Enze Zhang, Tong Yu, Shuai Li
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.06806
Pdf URL: https://arxiv.org/pdf/2508.06806
Copy Paste: [[2508.06806]] Offline-to-Online Reinforcement Learning with Classifier-Free Diffusion Generation(https://arxiv.org/abs/2508.06806)
Keywords: generation
Abstract: Offline-to-online Reinforcement Learning (O2O RL) aims to perform online fine-tuning on an offline pre-trained policy to minimize costly online interactions. Existing work used offline datasets to generate data that conform to the online data distribution for data augmentation. However, generated data still exhibits a gap with the online data, limiting overall performance. To address this, we propose a new data augmentation approach, Classifier-Free Diffusion Generation (CFDG). Without introducing additional classifier training overhead, CFDG leverages classifier-free guidance diffusion to significantly enhance the generation quality of offline and online data with different distributions. Additionally, it employs a reweighting method to enable more generated data to align with the online data, enhancing performance while maintaining the agent's stability. Experimental results show that CFDG outperforms replaying the two data types or using a standard diffusion model to generate new data. Our method is versatile and can be integrated with existing offline-to-online RL algorithms. By applying CFDG to the popular methods IQL, PEX, and APL, we achieve a notable 15% average improvement in empirical performance on D4RL benchmark tasks such as MuJoCo and AntMaze.
摘要：离线到线加强学习（O2O RL）旨在在离线预培训的政策上进行在线微调，以最大程度地减少昂贵的在线互动。现有工作使用离线数据集生成符合在线数据分布以进行数据扩展的数据。但是，生成的数据仍然显示出在线数据的差距，从而限制了整体性能。为了解决这个问题，我们提出了一种新的数据增强方法，即无分类器扩散生成（CFDG）。在不引入其他分类器培训间接费用的情况下，CFDG利用了无分类器的指导扩散，可以显着提高具有不同分布的离线和在线数据的发电质量。此外，它采用了一种重量级方法来启用更多生成的数据与在线数据保持一致，同时增强了性能，同时保持代理的稳定性。实验结果表明，CFDG的表现优于重播两种数据类型或使用标准扩散模型生成新数据。我们的方法是多功能的，可以与现有的脱机RL算法集成。通过将CFDG实施到流行方法IQL，PEX和APL，我们在D4RL基准（例如Mujoco和Antmaze）上的经验性能平均得出15％。
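
Illustrative sketch (not from the paper): classifier-free guidance combines an unconditional and a conditional noise prediction from the same diffusion model, so no separate classifier has to be trained. In the common guidance-scale form, the guided prediction is eps_uncond + s * (eps_cond - eps_uncond). The snippet below shows only that combination step on plain Python lists. The function name `cfg_combine` is hypothetical, and how CFDG defines its condition (for example, an offline versus online data label) is an assumption not confirmed by the abstract.

```python
def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Classifier-free guidance: extrapolate from the unconditional noise
    prediction toward the conditional one.

    guidance_scale = 0 returns the unconditional prediction,
    guidance_scale = 1 returns the conditional prediction,
    guidance_scale > 1 pushes further in the conditional direction.
    """
    return [u + guidance_scale * (c - u)
            for u, c in zip(eps_uncond, eps_cond)]


# Toy usage with two-dimensional noise predictions.
eps_u = [0.0, 1.0]
eps_c = [1.0, 3.0]
guided = cfg_combine(eps_u, eps_c, 2.0)
```

A scale above 1 trades sample diversity for stronger adherence to the condition, which is the usual knob tuned when applying this technique.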

Title: AGIC: Attention-Guided Image Captioning to Improve Caption Relevance

Authors: L. D. M. S. Sai Teja, Ashok Urlana, Pruthwik Mishra
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.06853
Pdf URL: https://arxiv.org/pdf/2508.06853
Copy Paste: [[2508.06853]] AGIC: Attention-Guided Image Captioning to Improve Caption Relevance(https://arxiv.org/abs/2508.06853)
Keywords: generation
Abstract: Despite significant progress in image captioning, generating accurate and descriptive captions remains a long-standing challenge. In this study, we propose Attention-Guided Image Captioning (AGIC), which amplifies salient visual regions directly in the feature space to guide caption generation. We further introduce a hybrid decoding strategy that combines deterministic and probabilistic sampling to balance fluency and diversity. To evaluate AGIC, we conduct extensive experiments on the Flickr8k and Flickr30k datasets. The results show that AGIC matches or surpasses several state-of-the-art models while achieving faster inference. Moreover, AGIC demonstrates strong performance across multiple evaluation metrics, offering a scalable and interpretable solution for image captioning.

Title: Advancements in Chinese font generation since deep learning era: A survey

Authors: Weiran Chen, Guiqian Zhu, Ying Li, Yi Ji, Chunping Liu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.06900
Pdf URL: https://arxiv.org/pdf/2508.06900
Copy Paste: [[2508.06900]] Advancements in Chinese font generation since deep learning era: A survey(https://arxiv.org/abs/2508.06900)
Keywords: generation
Abstract: Chinese font generation aims to create a new Chinese font library based on some reference samples. It is a topic of great concern to many font designers and typographers. Over the past years, with the rapid development of deep learning algorithms, various new techniques have achieved flourishing and thriving progress. Nevertheless, how to improve the overall quality of generated Chinese character images remains a tough issue. In this paper, we conduct a holistic survey of the recent Chinese font generation approaches based on deep learning. To be specific, we first illustrate the research background of the task. Then, we outline our literature selection and analysis methodology, and review a series of related fundamentals, including classical deep learning architectures, font representation formats, public datasets, and frequently-used evaluation metrics. After that, relying on the number of reference samples required to generate a new font, we categorize the existing methods into two major groups: many-shot font generation and few-shot font generation methods. Within each category, representative approaches are summarized, and their strengths and limitations are also discussed in detail. Finally, we conclude our paper with the challenges and future directions, with the expectation to provide some valuable illuminations for the researchers in this field.

Title: MultiRef: Controllable Image Generation with Multiple Visual References

Authors: Ruoxi Chen, Dongping Chen, Siyuan Wu, Sinan Wang, Shiyun Lang, Petr Sushko, Gaoyang Jiang, Yao Wan, Ranjay Krishna
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.06905
Pdf URL: https://arxiv.org/pdf/2508.06905
Copy Paste: [[2508.06905]] MultiRef: Controllable Image Generation with Multiple Visual References(https://arxiv.org/abs/2508.06905)
Keywords: generation, generative
Abstract: Visual designers naturally draw inspiration from multiple visual references, combining diverse elements and aesthetic principles to create artwork. However, current image generative frameworks predominantly rely on single-source inputs -- either text prompts or individual reference images. In this paper, we focus on the task of controllable image generation using multiple visual references. We introduce MultiRef-bench, a rigorous evaluation framework comprising 990 synthetic and 1,000 real-world samples that require incorporating visual content from multiple reference images. The synthetic samples are generated through our data engine RefBlend, with 10 reference types and 33 reference combinations. Based on RefBlend, we further construct a dataset MultiRef containing 38k high-quality images to facilitate further research. Our experiments across three interleaved image-text models (i.e., OmniGen, ACE, and Show-o) and six agentic frameworks (e.g., ChatDiT and LLM + SD) reveal that even state-of-the-art systems struggle with multi-reference conditioning, with the best model OmniGen achieving only 66.6% on synthetic samples and 79.0% on real-world cases on average compared to the golden answer. These findings provide valuable directions for developing more flexible and human-like creative tools that can effectively integrate multiple sources of visual inspiration. The dataset is publicly available at: this https URL.

Title: QuiZSF: An efficient data-model interaction framework for zero-shot time-series forecasting

Authors: Shichao Ma, Zhengyang Zhou, Qihe Huang, Binwu Wang, Kuo Yang, Huan Li, Yang Wang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.06915
Pdf URL: https://arxiv.org/pdf/2508.06915
Copy Paste: [[2508.06915]] QuiZSF: An efficient data-model interaction framework for zero-shot time-series forecasting(https://arxiv.org/abs/2508.06915)
Keywords: generation
Abstract: Time series forecasting has become increasingly important to empower diverse applications with streaming data. Zero-shot time-series forecasting (ZSF), particularly valuable in data-scarce scenarios, such as domain transfer or forecasting under extreme conditions, is difficult for traditional models to deal with. While time series pre-trained models (TSPMs) have demonstrated strong performance in ZSF, they often lack mechanisms to dynamically incorporate external knowledge. Fortunately, emerging retrieval-augmented generation (RAG) offers a promising path for injecting such knowledge on demand, yet it is rarely integrated with TSPMs. To leverage the strengths of both worlds, we introduce RAG into TSPMs to enhance zero-shot time series forecasting. In this paper, we propose QuiZSF (Quick Zero-Shot Time Series Forecaster), a lightweight and modular framework that couples efficient retrieval with representation learning and model adaptation for ZSF. Specifically, we construct a hierarchical tree-structured ChronoRAG Base (CRB) for scalable time-series storage and domain-aware retrieval, introduce a Multi-grained Series Interaction Learner (MSIL) to extract fine- and coarse-grained relational features, and develop a dual-branch Model Cooperation Coherer (MCC) that aligns retrieved knowledge with two kinds of TSPMs: Non-LLM based and LLM based. Compared with contemporary baselines, QuiZSF, with Non-LLM based and LLM based TSPMs as the base model, ranks Top1 in 75% and 87.5% of prediction settings, respectively, while maintaining high efficiency in memory and inference time.

Title: Talk2Image: A Multi-Agent System for Multi-Turn Image Generation and Editing

Authors: Shichao Ma, Yunhe Guo, Jiahao Su, Qihe Huang, Zhengyang Zhou, Yang Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.06916
Pdf URL: https://arxiv.org/pdf/2508.06916
Copy Paste: [[2508.06916]] Talk2Image: A Multi-Agent System for Multi-Turn Image Generation and Editing(https://arxiv.org/abs/2508.06916)
Keywords: generation
Abstract: Text-to-image generation tasks have driven remarkable advances in diverse media applications, yet most focus on single-turn scenarios and struggle with iterative, multi-turn creative tasks. Recent dialogue-based systems attempt to bridge this gap, but their single-agent, sequential paradigm often causes intention drift and incoherent edits. To address these limitations, we present Talk2Image, a novel multi-agent system for interactive image generation and editing in multi-turn dialogue scenarios. Our approach integrates three key components: intention parsing from dialogue history, task decomposition and collaborative execution across specialized agents, and feedback-driven refinement based on a multi-view evaluation mechanism. Talk2Image enables step-by-step alignment with user intention and consistent image editing. Experiments demonstrate that Talk2Image outperforms existing baselines in controllability, coherence, and user satisfaction across iterative image generation and editing tasks.

Title: AR-GRPO: Training Autoregressive Image Generation Models via Reinforcement Learning

Authors: Shihao Yuan, Yahui Liu, Yang Yue, Jingyuan Zhang, Wangmeng Zuo, Qi Wang, Fuzheng Zhang, Guorui Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.06924
Pdf URL: https://arxiv.org/pdf/2508.06924
Copy Paste: [[2508.06924]] AR-GRPO: Training Autoregressive Image Generation Models via Reinforcement Learning(https://arxiv.org/abs/2508.06924)
Keywords: generation
Abstract: Inspired by the success of reinforcement learning (RL) in refining large language models (LLMs), we propose AR-GRPO, an approach to integrate online RL training into autoregressive (AR) image generation models. We adapt the Group Relative Policy Optimization (GRPO) algorithm to refine the vanilla autoregressive models' outputs by carefully designed reward functions that evaluate generated images across multiple quality dimensions, including perceptual quality, realism, and semantic fidelity. We conduct comprehensive experiments on both class-conditional (i.e., class-to-image) and text-conditional (i.e., text-to-image) image generation tasks, demonstrating that our RL-enhanced framework significantly improves both the image quality and human preference of generated images compared to the standard AR baselines. Our results show consistent improvements across various evaluation metrics, establishing the viability of RL-based optimization for AR image generation and opening new avenues for controllable and high-quality image synthesis. The source codes and models are available at: this https URL.
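The group-relative part of GRPO can be illustrated with a short sketch. It shows one common formulation, in which the rewards of several samples generated for the same prompt or class are standardized within the group to obtain advantages. The function name, the `eps` stabilizer, and the toy rewards are assumptions for illustration; the paper's reward functions and exact normalization are not specified in the abstract.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Standardize rewards within one group of samples.

    Each advantage is (reward - group mean) / (group std + eps), so a
    sample is credited only for being better than its siblings, and no
    learned value function is needed.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for three images generated from the same condition.
advantages = group_relative_advantages([1.0, 2.0, 3.0])
```

The sample with the mean reward gets an advantage of zero, and a group whose samples all receive the same reward contributes no learning signal.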

Title: CannyEdit: Selective Canny Control and Dual-Prompt Guidance for Training-Free Image Editing

Authors: Weiyan Xie, Han Gao, Didan Deng, Kaican Li, April Hua Liu, Yongxiang Huang, Nevin L. Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.06937
Pdf URL: https://arxiv.org/pdf/2508.06937
Copy Paste: [[2508.06937]] CannyEdit: Selective Canny Control and Dual-Prompt Guidance for Training-Free Image Editing(https://arxiv.org/abs/2508.06937)
Keywords: generative
Abstract: Recent advances in text-to-image (T2I) models have enabled training-free regional image editing by leveraging the generative priors of foundation models. However, existing methods struggle to balance text adherence in edited regions, context fidelity in unedited areas, and seamless integration of edits. We introduce CannyEdit, a novel training-free framework that addresses these challenges through two key innovations: (1) Selective Canny Control, which masks the structural guidance of Canny ControlNet in user-specified editable regions while strictly preserving details of the source images in unedited areas via inversion-phase ControlNet information retention. This enables precise, text-driven edits without compromising contextual integrity. (2) Dual-Prompt Guidance, which combines local prompts for object-specific edits with a global target prompt to maintain coherent scene interactions. On real-world image editing tasks (addition, replacement, removal), CannyEdit outperforms prior methods like KV-Edit, achieving a 2.93 to 10.49 percent improvement in the balance of text adherence and context fidelity. In terms of editing seamlessness, user studies reveal only 49.2 percent of general users and 42.0 percent of AIGC experts identified CannyEdit's results as AI-edited when paired with real images without edits, versus 76.08 to 89.09 percent for competitor methods.

Title: Discovery Learning accelerates battery design evaluation

Authors: Jiawei Zhang, Yifei Zhang, Baozhao Yi, Yao Ren, Qi Jiao, Hanyu Bai, Weiran Jiang, Ziyou Song
Subjects: cs.LG, cs.CE, eess.SY, physics.comp-ph
Abstract URL: https://arxiv.org/abs/2508.06985
Pdf URL: https://arxiv.org/pdf/2508.06985
Copy Paste: [[2508.06985]] Discovery Learning accelerates battery design evaluation(https://arxiv.org/abs/2508.06985)
Keywords: generation
Abstract: Fast and reliable validation of novel designs in complex physical systems such as batteries is critical to accelerating technological innovation. However, battery research and development remain bottlenecked by the prohibitively high time and energy costs required to evaluate numerous new design candidates, particularly in battery prototyping and life testing. Despite recent progress in data-driven battery lifetime prediction, existing methods require labeled data of target designs to improve accuracy and cannot make reliable predictions until after prototyping, thus falling far short of the efficiency needed to enable rapid feedback for battery design. Here, we introduce Discovery Learning (DL), a scientific machine-learning paradigm that integrates active learning, physics-guided learning, and zero-shot learning into a human-like reasoning loop, drawing inspiration from learning theories in educational psychology. DL can learn from historical battery designs and actively reduce the need for prototyping, thus enabling rapid lifetime evaluation for unobserved material-design combinations without requiring additional data labeling. To test DL, we present 123 industrial-grade large-format lithium-ion pouch cells, spanning eight material-design combinations and diverse cycling protocols. Trained solely on public datasets of small-capacity cylindrical cells, DL achieves 7.2% test error in predicting the average cycle life under unknown device variability. This results in savings of 98% in time and 95% in energy compared to industrial practices. This work highlights the potential of uncovering insights from historical designs to inform and accelerate the development of next-generation battery technologies. DL represents a key advance toward efficient data-driven modeling and helps realize the promise of machine learning for accelerating scientific discovery and engineering innovation.

Title: TADoc: Robust Time-Aware Document Image Dewarping

Authors: Fangmin Zhao, Weichao Zeng, Zhenhang Li, Dongbao Yang, Yu Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.06988
Pdf URL: https://arxiv.org/pdf/2508.06988
Copy Paste: [[2508.06988]] TADoc: Robust Time-Aware Document Image Dewarping(https://arxiv.org/abs/2508.06988)
Keywords: restoration
Abstract: Flattening curved, wrinkled, and rotated document images captured by portable photographing devices, termed document image dewarping, has become an increasingly important task with the rise of digital economy and online working. Although many methods have been proposed recently, they often struggle to achieve satisfactory results when confronted with intricate document structures and higher degrees of deformation in real-world scenarios. Our main insight is that, unlike other document restoration tasks (e.g., deblurring), dewarping in real physical scenes is a progressive motion rather than a one-step transformation. Based on this, we have undertaken two key initiatives. Firstly, we reformulate this task, modeling it for the first time as a dynamic process that encompasses a series of intermediate states. Secondly, we design a lightweight framework called TADoc (Time-Aware Document Dewarping Network) to address the geometric distortion of document images. In addition, due to the inadequacy of OCR metrics for document images containing sparse text, the comprehensiveness of evaluation is insufficient. To address this shortcoming, we propose a new metric -- DLS (Document Layout Similarity) -- to evaluate the effectiveness of document dewarping in downstream tasks. Extensive experiments and in-depth evaluations have been conducted and the results indicate that our model possesses strong robustness, achieving superiority on several benchmarks with different document types and degrees of distortion.

Title: S2-UniSeg: Fast Universal Agglomerative Pooling for Scalable Segment Anything without Supervision

Authors: Huihui Xu, Jin Ye, Hongqiu Wang, Changkai Ji, Jiashi Lin, Ming Hu, Ziyan Huang, Ying Chen, Chenglong Ma, Tianbin Li, Lihao Liu, Junjun He, Lei Zhu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.06995
Pdf URL: https://arxiv.org/pdf/2508.06995
Copy Paste: [[2508.06995]] S2-UniSeg: Fast Universal Agglomerative Pooling for Scalable Segment Anything without Supervision(https://arxiv.org/abs/2508.06995)
Keywords: generation
Abstract: Recent self-supervised image segmentation models have achieved promising performance on semantic segmentation and class-agnostic instance segmentation. However, their pretraining schedule is multi-stage, requiring a time-consuming pseudo-mask generation process between training epochs. This time-consuming offline process not only makes it difficult to scale with training dataset size, but also leads to sub-optimal solutions due to its discontinuous optimization routine. To address these issues, we first present a novel pseudo-mask algorithm, Fast Universal Agglomerative Pooling (UniAP). Each layer of UniAP can identify groups of similar nodes in parallel, allowing it to generate semantic-level, instance-level, and multi-granular pseudo-masks within tens of milliseconds for one image. Based on the fast UniAP, we propose the Scalable Self-Supervised Universal Segmentation (S2-UniSeg), which employs a student and a momentum teacher for continuous pretraining. A novel segmentation-oriented pretext task, Query-wise Self-Distillation (QuerySD), is proposed to pretrain S2-UniSeg to learn the local-to-global correspondences. Under the same setting, S2-UniSeg outperforms the SOTA UnSAM model, achieving notable improvements of AP+6.9 on COCO, AR+11.1 on UVO, PixelAcc+4.5 on COCOStuff-27, and RQ+8.0 on Cityscapes. After scaling up to a larger 2M-image subset of SA-1B, S2-UniSeg further achieves performance gains on all four benchmarks. Our code and pretrained models are available at this https URL.
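The "student and momentum teacher" setup mentioned in the abstract usually means the teacher's weights are an exponential moving average (EMA) of the student's. The sketch below shows that generic update on flat lists of parameters. The function name and the momentum value are assumptions, and S2-UniSeg's actual schedule may differ.

```python
from typing import List


def ema_update(teacher: List[float], student: List[float], momentum: float = 0.99) -> List[float]:
    """Return teacher weights moved slightly toward the student weights.

    new_teacher = momentum * teacher + (1 - momentum) * student
    A momentum close to 1 makes the teacher a slowly changing average
    of past students.
    """
    if len(teacher) != len(student):
        raise ValueError("parameter lists must have the same length")
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]


# Toy parameters: the teacher drifts toward the student after every step.
teacher = [1.0, 1.0]
student = [0.0, 2.0]
teacher = ema_update(teacher, student, momentum=0.5)
```

Because the teacher is updated after every training step, its targets can be refreshed continuously, in contrast to the offline pseudo-mask generation between epochs that the abstract criticizes.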

Title: Spatio-Temporal Conditional Diffusion Models for Forecasting Future Multiple Sclerosis Lesion Masks Conditioned on Treatments

Authors: Gian Mario Favero, Ge Ya Luo, Nima Fathi, Justin Szeto, Douglas L. Arnold, Brennan Nichyporuk, Chris Pal, Tal Arbel
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.07006
Pdf URL: https://arxiv.org/pdf/2508.07006
Copy Paste: [[2508.07006]] Spatio-Temporal Conditional Diffusion Models for Forecasting Future Multiple Sclerosis Lesion Masks Conditioned on Treatments(https://arxiv.org/abs/2508.07006)
Keywords: generative
Abstract: Image-based personalized medicine has the potential to transform healthcare, particularly for diseases that exhibit heterogeneous progression such as Multiple Sclerosis (MS). In this work, we introduce the first treatment-aware spatio-temporal diffusion model that is able to generate future masks demonstrating lesion evolution in MS. Our voxel-space approach incorporates multi-modal patient data, including MRI and treatment information, to forecast new and enlarging T2 (NET2) lesion masks at a future time point. Extensive experiments on a multi-centre dataset of 2131 patient 3D MRIs from randomized clinical trials for relapsing-remitting MS demonstrate that our generative model is able to accurately predict NET2 lesion masks for patients across six different treatments. Moreover, we demonstrate our model has the potential for real-world clinical applications through downstream tasks such as future lesion count and location estimation, binary lesion activity classification, and generating counterfactual future NET2 masks for several treatments with different efficacies. This work highlights the potential of causal, image-based generative models as powerful tools for advancing data-driven prognostics in MS.

Title: HiMat: DiT-based Ultra-High Resolution SVBRDF Generation

Authors: Zixiong Wang, Jian Yang, Yiwei Hu, Milos Hasan, Beibei Wang
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2508.07011
Pdf URL: https://arxiv.org/pdf/2508.07011
Copy Paste: [[2508.07011]] HiMat: DiT-based Ultra-High Resolution SVBRDF Generation(https://arxiv.org/abs/2508.07011)
Keywords: generation, generative
Abstract: Creating highly detailed SVBRDFs is essential for 3D content creation. The rise of high-resolution text-to-image generative models, based on diffusion transformers (DiT), suggests an opportunity to finetune them for this task. However, retargeting the models to produce multiple aligned SVBRDF maps instead of just RGB images, while achieving high efficiency and ensuring consistency across different maps, remains a challenge. In this paper, we introduce HiMat: a memory- and computation-efficient diffusion-based framework capable of generating native 4K-resolution SVBRDFs. A key challenge we address is maintaining consistency across different maps in a lightweight manner, without relying on training new VAEs or significantly altering the DiT backbone (which would damage its prior capabilities). To tackle this, we introduce the CrossStitch module, a lightweight convolutional module that captures inter-map dependencies through localized operations. Its weights are initialized such that the DiT backbone operation is unchanged before finetuning starts. HiMat enables generation with strong structural coherence and high-frequency details. Results with a large set of text prompts demonstrate the effectiveness of our approach for 4K SVBRDF generation. Further experiments suggest generalization to tasks such as intrinsic decomposition.

Title: DocRefine: An Intelligent Framework for Scientific Document Understanding and Content Optimization based on Multimodal Large Model Agents

Authors: Kun Qian, Wenjie Li, Tianyu Sun, Wenhong Wang, Wenhan Luo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.07021
Pdf URL: https://arxiv.org/pdf/2508.07021
Copy Paste: [[2508.07021]] DocRefine: An Intelligent Framework for Scientific Document Understanding and Content Optimization based on Multimodal Large Model Agents(https://arxiv.org/abs/2508.07021)
Keywords: generation
Abstract: The exponential growth of scientific literature in PDF format necessitates advanced tools for efficient and accurate document understanding, summarization, and content optimization. Traditional methods fall short in handling complex layouts and multimodal content, while direct application of Large Language Models (LLMs) and Vision-Language Large Models (LVLMs) lacks precision and control for intricate editing tasks. This paper introduces DocRefine, an innovative framework designed for intelligent understanding, content refinement, and automated summarization of scientific PDF documents, driven by natural language instructions. DocRefine leverages the power of advanced LVLMs (e.g., GPT-4o) by orchestrating a sophisticated multi-agent system comprising six specialized and collaborative agents: Layout & Structure Analysis, Multimodal Content Understanding, Instruction Decomposition, Content Refinement, Summarization & Generation, and Fidelity & Consistency Verification. This closed-loop feedback architecture ensures high semantic accuracy and visual fidelity. Evaluated on the comprehensive DocEditBench dataset, DocRefine consistently outperforms state-of-the-art baselines across various tasks, achieving overall scores of 86.7% for Semantic Consistency Score (SCS), 93.9% for Layout Fidelity Index (LFI), and 85.0% for Instruction Adherence Rate (IAR). These results demonstrate DocRefine's superior capability in handling complex multimodal document editing, preserving semantic integrity, and maintaining visual consistency, marking a significant advancement in automated scientific document processing.

Title: Trustworthy Medical Imaging with Large Language Models: A Study of Hallucinations Across Modalities

Authors: Anindya Bijoy Das, Shahnewaz Karim Sakib, Shibbir Ahmed
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.07031
Pdf URL: https://arxiv.org/pdf/2508.07031
Copy Paste: [[2508.07031]] Trustworthy Medical Imaging with Large Language Models: A Study of Hallucinations Across Modalities(https://arxiv.org/abs/2508.07031)
Keywords: generation, generative
Abstract: Large Language Models (LLMs) are increasingly applied to medical imaging tasks, including image interpretation and synthetic image generation. However, these models often produce hallucinations, which are confident but incorrect outputs that can mislead clinical decisions. This study examines hallucinations in two directions: image to text, where LLMs generate reports from X-ray, CT, or MRI scans, and text to image, where models create medical images from clinical prompts. We analyze errors such as factual inconsistencies and anatomical inaccuracies, evaluating outputs using expert-informed criteria across imaging modalities. Our findings reveal common patterns of hallucination in both interpretive and generative tasks, with implications for clinical reliability. We also discuss factors contributing to these failures, including model architecture and training data. By systematically studying both image understanding and generation, this work provides insights into improving the safety and trustworthiness of LLM-driven medical imaging systems.

Title: A Stage-Aware Mixture of Experts Framework for Neurodegenerative Disease Progression Modelling

Authors: Tiantian He, Keyue Jiang, An Zhao, Anna Schroder, Elinor Thompson, Sonja Soskic, Frederik Barkhof, Daniel C. Alexander
Subjects: cs.LG, q-bio.QM
Abstract URL: https://arxiv.org/abs/2508.07032
Pdf URL: https://arxiv.org/pdf/2508.07032
Copy Paste: [[2508.07032]] A Stage-Aware Mixture of Experts Framework for Neurodegenerative Disease Progression Modelling(https://arxiv.org/abs/2508.07032)
Keywords: generative
Abstract: The long-term progression of neurodegenerative diseases is commonly conceptualized as a spatiotemporal diffusion process that consists of a graph diffusion process across the structural brain connectome and a localized reaction process within brain regions. However, modeling this progression remains challenging due to 1) the scarcity of longitudinal data obtained through irregular and infrequent subject visits and 2) the complex interplay of pathological mechanisms across brain regions and disease stages, where traditional models assume fixed mechanisms throughout disease progression. To address these limitations, we propose a novel stage-aware Mixture of Experts (MoE) framework that explicitly models how different contributing mechanisms dominate at different disease stages through time-dependent expert this http URL-wise, we utilize an iterative dual optimization method to properly estimate the temporal position of individual observations, constructing a cohort-level progression trajectory from irregular snapshots. Model-wise, we enhance the spatial component with an inhomogeneous graph neural diffusion model (IGND) that allows diffusivity to vary based on node states and time, providing more flexible representations of brain networks. We also introduce a localized neural reaction module to capture complex dynamics beyond standard this http URL resulting IGND-MoE model dynamically integrates these components across temporal states, offering a principled way to understand how stage-specific pathological mechanisms contribute to progression. The stage-wise weights yield novel clinical insights that align with literature, suggesting that graph-related processes are more influential at early stages, while other unknown physical processes become dominant later on.
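The reaction-diffusion picture described at the start of the abstract can be made concrete with a small sketch: pathology spreads along graph edges through the graph Laplacian and grows locally in each region. This is the classical fixed-mechanism baseline, not the IGND-MoE model. The logistic reaction term, the constant `diffusivity` and `growth` values, the explicit Euler step, and the toy two-region graph are all assumptions for illustration.

```python
from typing import List

Matrix = List[List[float]]


def laplacian(adj: Matrix) -> Matrix:
    """Graph Laplacian L = D - A for a symmetric adjacency matrix."""
    n = len(adj)
    return [[(sum(adj[i]) if i == j else 0.0) - adj[i][j] for j in range(n)]
            for i in range(n)]


def euler_step(x: List[float], adj: Matrix, diffusivity: float,
               growth: float, dt: float) -> List[float]:
    """One explicit Euler step of dx/dt = -diffusivity * L x + growth * x * (1 - x)."""
    lap = laplacian(adj)
    n = len(x)
    out = []
    for i in range(n):
        # Diffusion: flow along edges from high to low concentration.
        diffusion = -diffusivity * sum(lap[i][j] * x[j] for j in range(n))
        # Reaction: local logistic growth (an assumed, commonly used form).
        reaction = growth * x[i] * (1.0 - x[i])
        out.append(x[i] + dt * (diffusion + reaction))
    return out


# Two connected regions; pathology starts entirely in region 0.
adj = [[0.0, 1.0], [1.0, 0.0]]
state = euler_step([1.0, 0.0], adj, diffusivity=1.0, growth=0.0, dt=0.1)
```

With `growth=0.0` the step only redistributes pathology between the two regions, so the total is conserved. A stage-aware model of the kind the abstract proposes would let the relative weight of the diffusion and reaction terms change over disease time instead of keeping it fixed.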
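The stage-aware mixture described in this abstract can be illustrated with a minimal sketch, assuming a much simplified form: regional pathology `x` evolves under a weighted sum of a graph-diffusion term (`-L x`, with `L` the graph Laplacian) and a local logistic reaction term, and the two weights vary with time. The gate shape (`slope`, `midpoint`), the rate constants, and the helper names are hypothetical and not taken from the paper, whose IGND and reaction modules are learned networks rather than the fixed closed-form terms used here.

```python
import math

def laplacian_apply(adj, x):
    """Return L x for the graph Laplacian L = D - A of a weighted adjacency matrix."""
    n = len(x)
    return [sum(adj[i][j] * (x[i] - x[j]) for j in range(n)) for i in range(n)]

def gate_weights(t, slope=1.0, midpoint=5.0):
    """Hypothetical time-dependent expert weights: softmax over two logits.

    Returns [w_diffusion, w_reaction]; diffusion dominates before `midpoint`.
    """
    logits = [-slope * (t - midpoint), slope * (t - midpoint)]
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def step(x, adj, t, dt=0.1, k_diff=0.5, k_react=0.8):
    """One explicit Euler step of the gated reaction-diffusion dynamics."""
    w_diff, w_react = gate_weights(t)
    lx = laplacian_apply(adj, x)
    return [
        xi + dt * (w_diff * (-k_diff * lxi) + w_react * k_react * xi * (1.0 - xi))
        for xi, lxi in zip(x, lx)
    ]

def simulate(x0, adj, t_end=10.0, dt=0.1):
    """Integrate from t = 0 to t_end and return the final state."""
    x, t = list(x0), 0.0
    for _ in range(int(round(t_end / dt))):
        x = step(x, adj, t, dt)
        t += dt
    return x
```

For example, `simulate([0.2, 0.0, 0.0], [[0, 1, 0], [1, 0, 1], [0, 1, 0]])` seeds pathology in one region of a three-node chain; early steps mostly spread it along edges, later steps mostly grow it locally.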

Title: 3DGS-VBench: A Comprehensive Video Quality Evaluation Benchmark for 3DGS Compression

Authors: Yuke Xing, William Gordon, Qi Yang, Kaifa Yang, Jiarui Wang, Yiling Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.07038
Pdf URL: https://arxiv.org/pdf/2508.07038
Copy Paste: [[2508.07038]] 3DGS-VBench: A Comprehensive Video Quality Evaluation Benchmark for 3DGS Compression(https://arxiv.org/abs/2508.07038)
Keywords: generative, quality assessment
Abstract: 3D Gaussian Splatting (3DGS) enables real-time novel view synthesis with high visual fidelity, but its substantial storage requirements hinder practical deployment, prompting state-of-the-art (SOTA) 3DGS methods to incorporate compression modules. However, these 3DGS generative compression techniques introduce unique distortions lacking systematic quality assessment research. To this end, we establish 3DGS-VBench, a large-scale Video Quality Assessment (VQA) Dataset and Benchmark with 660 compressed 3DGS models and video sequences generated from 11 scenes across 6 SOTA 3DGS compression algorithms with systematically designed parameter levels. With annotations from 50 participants, we obtained MOS scores with outlier removal and validated dataset reliability. We benchmark 6 3DGS compression algorithms on storage efficiency and visual quality, and evaluate 15 quality assessment metrics across multiple paradigms. Our work enables specialized VQA model training for 3DGS, serving as a catalyst for compression and quality assessment research. The dataset is available at this https URL.
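The abstract mentions computing MOS (mean opinion scores) with outlier removal but does not say which screening rule is used. The sketch below assumes a simple per-video z-score filter; the threshold `z_thresh` and the function name are hypothetical, and this is not claimed to be the paper's procedure or a standardized observer-screening method.

```python
from statistics import mean, pstdev

def mos_with_outlier_removal(ratings, z_thresh=2.0):
    """Compute a mean opinion score per video after dropping outlying ratings.

    ratings: dict mapping video id -> list of numeric scores from participants.
    A rating is dropped when it lies more than z_thresh population standard
    deviations from that video's mean (an assumed, illustrative rule).
    """
    result = {}
    for vid, scores in ratings.items():
        mu, sigma = mean(scores), pstdev(scores)
        if sigma == 0:
            kept = list(scores)  # all raters agree, nothing to screen
        else:
            kept = [s for s in scores if abs(s - mu) <= z_thresh * sigma]
        result[vid] = mean(kept)
    return result
```

Screening per rating keeps the example short; subject-level screening, where a whole participant is rejected, is a common alternative.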

Title: Towards High-Order Mean Flow Generative Models: Feasibility, Expressivity, and Provably Efficient Criteria

Authors: Yang Cao, Yubin Chen, Zhao Song, Jiahao Zhang
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2508.07102
Pdf URL: https://arxiv.org/pdf/2508.07102
Copy Paste: [[2508.07102]] Towards High-Order Mean Flow Generative Models: Feasibility, Expressivity, and Provably Efficient Criteria(https://arxiv.org/abs/2508.07102)
Keywords: generative
Abstract: Generative modelling has seen significant advances through simulation-free paradigms such as Flow Matching, and in particular, the MeanFlow framework, which replaces instantaneous velocity fields with average velocities to enable efficient single-step sampling. In this work, we introduce a theoretical study on Second-Order MeanFlow, a novel extension that incorporates average acceleration fields into the MeanFlow objective. We first establish the feasibility of our approach by proving that the average acceleration satisfies a generalized consistency condition analogous to first-order MeanFlow, thereby supporting stable, one-step sampling and tractable loss functions. We then characterize its expressivity via circuit complexity analysis, showing that under mild assumptions, the Second-Order MeanFlow sampling process can be implemented by uniform threshold circuits within the $\mathsf{TC}^0$ class. Finally, we derive provably efficient criteria for scalable implementation by leveraging fast approximate attention computations: we prove that attention operations within the Second-Order MeanFlow architecture can be approximated to within $1/\mathrm{poly}(n)$ error in time $n^{2+o(1)}$. Together, these results lay the theoretical foundation for high-order flow matching models that combine rich dynamics with practical sampling efficiency.
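For orientation, the first-order quantities that this abstract builds on can be written out as follows. This is recalled from the original MeanFlow formulation rather than from this paper, and the notation ($z_t$, $v$, $u$, $r < t$) is assumed; the paper's average acceleration field and its generalized consistency condition are not reproduced here.

```latex
% Average velocity over [r, t] along the flow path z_\tau with instantaneous velocity v
u(z_t, r, t) \;=\; \frac{1}{t - r} \int_r^t v(z_\tau, \tau)\, d\tau

% First-order MeanFlow identity, obtained by differentiating (t - r)\,u with respect to t
u(z_t, r, t) \;=\; v(z_t, t) \;-\; (t - r)\, \frac{d}{dt} u(z_t, r, t),
\qquad
\frac{d}{dt} u \;=\; v(z_t, t)\, \partial_z u + \partial_t u

% Single-step sampling from time t back to time r
z_r \;=\; z_t - (t - r)\, u(z_t, r, t)
```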

Title: Perceptual Evaluation of GANs and Diffusion Models for Generating X-rays

Authors: Gregory Schuit, Denis Parra, Cecilia Besa
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.07128
Pdf URL: https://arxiv.org/pdf/2508.07128
Copy Paste: [[2508.07128]] Perceptual Evaluation of GANs and Diffusion Models for Generating X-rays(https://arxiv.org/abs/2508.07128)
Keywords: generation, generative
Abstract: Generative image models have achieved remarkable progress in both natural and medical imaging. In the medical context, these techniques offer a potential solution to data scarcity, especially for low-prevalence anomalies that impair the performance of AI-driven diagnostic and segmentation tools. However, questions remain regarding the fidelity and clinical utility of synthetic images, since poor generation quality can undermine model generalizability and trust. In this study, we evaluate the effectiveness of state-of-the-art generative models, namely Generative Adversarial Networks (GANs) and Diffusion Models (DMs), for synthesizing chest X-rays conditioned on four abnormalities: Atelectasis (AT), Lung Opacity (LO), Pleural Effusion (PE), and Enlarged Cardiac Silhouette (ECS). Using a benchmark composed of real images from the MIMIC-CXR dataset and synthetic images from both GANs and DMs, we conducted a reader study with three radiologists of varied experience. Participants were asked to distinguish real from synthetic images and assess the consistency between visual features and the target abnormality. Our results show that while DMs generate more visually realistic images overall, GANs can report better accuracy for specific conditions, such as absence of ECS. We further identify visual cues radiologists use to detect synthetic images, offering insights into the perceptual gaps in current models. These findings underscore the complementary strengths of GANs and DMs and point to the need for further refinement to ensure generative models can reliably augment training datasets for AI diagnostic systems.

Title: CMAMRNet: A Contextual Mask-Aware Network Enhancing Mural Restoration Through Comprehensive Mask Guidance

Authors: Yingtie Lei, Fanghai Yi, Yihang Dong, Weihuang Liu, Xiaofeng Zhang, Zimeng Li, Chi-Man Pun, Xuhang Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.07140
Pdf URL: https://arxiv.org/pdf/2508.07140
Copy Paste: [[2508.07140]] CMAMRNet: A Contextual Mask-Aware Network Enhancing Mural Restoration Through Comprehensive Mask Guidance(https://arxiv.org/abs/2508.07140)
Keywords: restoration
Abstract: Murals, as invaluable cultural artifacts, face continuous deterioration from environmental factors and human activities. Digital restoration of murals faces unique challenges due to their complex degradation patterns and the critical need to preserve artistic authenticity. Existing learning-based methods struggle with maintaining consistent mask guidance throughout their networks, leading to insufficient focus on damaged regions and compromised restoration quality. We propose CMAMRNet, a Contextual Mask-Aware Mural Restoration Network that addresses these limitations through comprehensive mask guidance and multi-scale feature extraction. Our framework introduces two key components: (1) the Mask-Aware Up/Down-Sampler (MAUDS), which ensures consistent mask sensitivity across resolution scales through dedicated channel-wise feature selection and mask-guided feature fusion; and (2) the Co-Feature Aggregator (CFA), operating at both the highest and lowest resolutions to extract complementary features for capturing fine textures and global structures in degraded regions. Experimental results on benchmark datasets demonstrate that CMAMRNet outperforms state-of-the-art methods, effectively preserving both structural integrity and artistic details in restored murals. The code is available at this https URL.

Title: CoopDiff: Anticipating 3D Human-object Interactions via Contact-consistent Decoupled Diffusion

Authors: Xiaotong Lin, Tianming Liang, Jian-Fang Hu, Kun-Yu Lin, Yulei Kang, Chunwei Tian, Jianhuang Lai, Wei-Shi Zheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.07162
Pdf URL: https://arxiv.org/pdf/2508.07162
Copy Paste: [[2508.07162]] CoopDiff: Anticipating 3D Human-object Interactions via Contact-consistent Decoupled Diffusion(https://arxiv.org/abs/2508.07162)
Keywords: generation
Abstract: 3D human-object interaction (HOI) anticipation aims to predict the future motion of humans and their manipulated objects, conditioned on the historical context. Generally, articulated humans and rigid objects exhibit different motion patterns, due to their distinct intrinsic physical properties. However, this distinction is ignored by most existing works, which intend to capture the dynamics of both humans and objects within a single prediction model. In this work, we propose a novel contact-consistent decoupled diffusion framework, CoopDiff, which employs two distinct branches to decouple human and object motion modeling, with the human-object contact points as shared anchors to bridge the motion generation across branches. The human dynamics branch aims to predict highly structured human motion, while the object dynamics branch focuses on object motion with rigid translations and rotations. These two branches are bridged by a series of shared contact points with a consistency constraint for coherent human-object motion prediction. To further enhance human-object consistency and prediction reliability, we propose a human-driven interaction module to guide object motion modeling. Extensive experiments on the BEHAVE and Human-object Interaction datasets demonstrate that our CoopDiff outperforms state-of-the-art methods.

Title: Large-scale Multi-sequence Pretraining for Generalizable MRI Analysis in Versatile Clinical Applications

Authors: Zelin Qiu, Xi Wang, Zhuoyao Xie, Juan Zhou, Yu Wang, Lingjie Yang, Xinrui Jiang, Juyoung Bae, Moo Hyun Son, Qiang Ye, Dexuan Chen, Rui Zhang, Tao Li, Neeraj Ramesh Mahboobani, Varut Vardhanabhuti, Xiaohui Duan, Yinghua Zhao, Hao Chen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.07165
Pdf URL: https://arxiv.org/pdf/2508.07165
Copy Paste: [[2508.07165]] Large-scale Multi-sequence Pretraining for Generalizable MRI Analysis in Versatile Clinical Applications(https://arxiv.org/abs/2508.07165)
Keywords: generation
Abstract: Multi-sequence Magnetic Resonance Imaging (MRI) offers remarkable versatility, enabling the distinct visualization of different tissue types. Nevertheless, the inherent heterogeneity among MRI sequences poses significant challenges to the generalization capability of deep learning models. These challenges undermine model performance when faced with varying acquisition parameters, thereby severely restricting their clinical utility. In this study, we present PRISM, a foundation model PRe-trained with large-scale multI-Sequence MRI. We collected a total of 64 datasets from both public and private sources, encompassing a wide range of whole-body anatomical structures, with scans spanning diverse MRI sequences. Among them, 336,476 volumetric MRI scans from 34 datasets (8 public and 26 private) were curated to construct the largest multi-organ multi-sequence MRI pretraining corpus to date. We propose a novel pretraining paradigm that disentangles anatomically invariant features from sequence-specific variations in MRI, while preserving high-level semantic representations. We established a benchmark comprising 44 downstream tasks, including disease diagnosis, image segmentation, registration, progression prediction, and report generation. These tasks were evaluated on 32 public datasets and 5 private cohorts. PRISM consistently outperformed both non-pretrained models and existing foundation models, achieving first-rank results in 39 out of 44 downstream benchmarks with statistically significant improvements. These results underscore its ability to learn robust and generalizable representations across unseen data acquired under diverse MRI protocols. PRISM provides a scalable framework for multi-sequence MRI analysis, thereby enhancing the translational potential of AI in radiology. It delivers consistent performance across diverse imaging protocols, reinforcing its clinical applicability.

Title: Similarity Matters: A Novel Depth-guided Network for Image Restoration and A New Dataset

Authors: Junyi He, Liuling Chen, Hongyang Zhou, Zhang xiaoxing, Xiaobin Zhu, Shengxiang Yu, Jingyan Qin, Xu-Cheng Yin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.07211
Pdf URL: https://arxiv.org/pdf/2508.07211
Copy Paste: [[2508.07211]] Similarity Matters: A Novel Depth-guided Network for Image Restoration and A New Dataset(https://arxiv.org/abs/2508.07211)
Keywords: restoration
Abstract: Image restoration has seen substantial progress in recent years. However, existing methods often neglect depth information, which hurts similarity matching and results in attention distractions in shallow depth-of-field (DoF) scenarios and excessive enhancement of background content in deep DoF settings. To overcome these limitations, we propose a novel Depth-Guided Network (DGN) for image restoration, together with a novel large-scale high-resolution dataset. Specifically, the network consists of two interactive branches: a depth estimation branch that provides structural guidance, and an image restoration branch that performs the core restoration task. In addition, the image restoration branch exploits intra-object similarity through progressive window-based self-attention and captures inter-object similarity via sparse non-local attention. Through joint training, depth features contribute to improved restoration quality, while the enhanced visual features from the restoration branch in turn help refine depth estimation. Notably, we also introduce a new dataset for training and evaluation, consisting of 9,205 high-resolution images from 403 plant species, with diverse depth and texture variations. Extensive experiments show that our method achieves state-of-the-art performance on several standard benchmarks and generalizes well to unseen plant images, demonstrating its effectiveness and robustness.

Title: Unsupervised Real-World Super-Resolution via Rectified Flow Degradation Modelling

Authors: Hongyang Zhou, Xiaobin Zhu, Liuling Chen, Junyi He, Jingyan Qin, Xu-Cheng Yin, Zhang xiaoxing
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.07214
Pdf URL: https://arxiv.org/pdf/2508.07214
Copy Paste: [[2508.07214]] Unsupervised Real-World Super-Resolution via Rectified Flow Degradation Modelling(https://arxiv.org/abs/2508.07214)
Keywords: super-resolution
Abstract: Unsupervised real-world super-resolution (SR) faces critical challenges due to the complex, unknown degradation distributions in practical scenarios. Existing methods struggle to generalize from synthetic low-resolution (LR) and high-resolution (HR) image pairs to real-world data due to a significant domain gap. In this paper, we propose an unsupervised real-world SR method based on rectified flow to effectively capture and model real-world degradation, synthesizing LR-HR training pairs with realistic degradation. Specifically, given unpaired LR and HR images, we propose a novel Rectified Flow Degradation Module (RFDM) that introduces degradation-transformed LR (DT-LR) images as intermediaries. By modeling the degradation trajectory in a continuous and invertible manner, RFDM better captures real-world degradation and enhances the realism of generated LR images. Additionally, we propose a Fourier Prior Guided Degradation Module (FGDM) that leverages structural information embedded in Fourier phase components to ensure more precise modeling of real-world degradation. Finally, the LR images are processed by both FGDM and RFDM, producing final synthetic LR images with real-world degradation. The synthetic LR images are paired with the given HR images to train the off-the-shelf SR networks. Extensive experiments on real-world datasets demonstrate that our method significantly enhances the performance of existing SR approaches in real-world scenarios.
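The rectified-flow building block that this abstract relies on can be sketched generically: samples from two distributions are joined by a straight line, a network is trained to predict the constant velocity along that line, and new samples are produced by integrating the learned velocity. The code below is a minimal illustration on plain vectors; treating `x0` as a source (for example HR-derived) sample and `x1` as a target (for example real LR) sample is an assumption, and the paper's RFDM and FGDM modules are not modelled.

```python
def interpolate(x0, x1, t):
    """Point on the straight path between x0 (t = 0) and x1 (t = 1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def velocity_target(x0, x1):
    """Regression target for the velocity network: constant along the path."""
    return [b - a for a, b in zip(x0, x1)]

def flow_matching_loss(v_pred, x0, x1):
    """Mean squared error between a predicted velocity and the target x1 - x0."""
    target = velocity_target(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(v_pred, target)) / len(target)

def euler_sample(x0, velocity_fn, n_steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1 with Euler steps."""
    x, dt = list(x0), 1.0 / n_steps
    for i in range(n_steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

In training, `t` would be drawn uniformly from [0, 1] and `velocity_fn` would be a neural network evaluated at `interpolate(x0, x1, t)`; with a perfectly straight learned flow, a single Euler step already reaches the target.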

Title: Bridging Semantic Logic Gaps: A Cognition-Inspired Multimodal Boundary-Preserving Network for Image Manipulation Localization

Authors: Songlin Li, Zhiqing Guo, Yuanman Li, Zeyu Li, Yunfeng Diao, Gaobo Yang, Liejun Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.07216
Pdf URL: https://arxiv.org/pdf/2508.07216
Copy Paste: [[2508.07216]] Bridging Semantic Logic Gaps: A Cognition-Inspired Multimodal Boundary-Preserving Network for Image Manipulation Localization(https://arxiv.org/abs/2508.07216)
Keywords: restoration
Abstract: Existing image manipulation localization (IML) models mainly rely on visual cues but ignore the semantic logical relationships between content features. In fact, the content semantics conveyed by real images often conform to human cognitive laws. However, image manipulation technology usually destroys the internal relationship between content features, thus leaving semantic clues for IML. In this paper, we propose a cognition-inspired multimodal boundary-preserving network (CMB-Net). Specifically, CMB-Net utilizes large language models (LLMs) to analyze manipulated regions within images and generate prompt-based textual information to compensate for the lack of semantic relationships in the visual information. Considering that the erroneous texts induced by hallucination from LLMs will damage the accuracy of IML, we propose an image-text central ambiguity module (ITCAM). It assigns weights to the text features by quantifying the ambiguity between text and image features, thereby ensuring the beneficial impact of textual information. We also propose an image-text interaction module (ITIM) that aligns visual and text features using a correlation matrix for fine-grained interaction. Finally, inspired by invertible neural networks, we propose a restoration edge decoder (RED) that mutually generates input and output features to preserve boundary information in manipulated regions without loss. Extensive experiments show that CMB-Net outperforms most existing IML models.

Title: EDGE: A Theoretical Framework for Misconception-Aware Adaptive Learning

Authors: Ananda Prakash Verma
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.07224
Pdf URL: https://arxiv.org/pdf/2508.07224
Copy Paste: [[2508.07224]] EDGE: A Theoretical Framework for Misconception-Aware Adaptive Learning(https://arxiv.org/abs/2508.07224)
Keywords: generation
Abstract: We present EDGE, a general-purpose, misconception-aware adaptive learning framework composed of four stages: Evaluate (ability and state estimation), Diagnose (posterior inference of misconceptions), Generate (counterfactual item synthesis), and Exercise (index-based retrieval scheduling). EDGE unifies psychometrics (IRT/Bayesian state space models), cognitive diagnostics (misconception discovery from distractor patterns and response latencies), contrastive item generation (minimal perturbations that invalidate learner shortcuts while preserving psychometric validity), and principled scheduling (a restless bandit approximation to spaced retrieval). We formalize a composite readiness metric, EdgeScore, prove its monotonicity and Lipschitz continuity, and derive an index policy that is near-optimal under mild assumptions on forgetting and learning gains. We further establish conditions under which counterfactual items provably reduce the posterior probability of a targeted misconception faster than standard practice. The paper focuses on theory and implementable pseudocode; empirical study is left to future work.
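The Evaluate stage named in this abstract draws on item response theory (IRT). As a rough illustration only, the sketch below uses the standard two-parameter logistic (2PL) model and a grid-based posterior mean for learner ability under a standard normal prior. The function names, the grid, and the choice of 2PL with this prior are assumptions; the paper's own state-space model and EdgeScore are not reproduced.

```python
import math

def p_correct(theta, a, b):
    """2PL IRT: probability of a correct response given ability theta,
    item discrimination a, and item difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def estimate_ability(responses, items, grid=None):
    """Posterior-mean ability on a grid, with a standard normal prior.

    responses: list of 0/1 outcomes; items: list of (a, b) pairs, same order.
    """
    if grid is None:
        grid = [i / 10.0 for i in range(-40, 41)]  # theta from -4.0 to 4.0
    weights = []
    for theta in grid:
        log_w = -0.5 * theta * theta  # log prior up to an additive constant
        for y, (a, b) in zip(responses, items):
            p = p_correct(theta, a, b)
            log_w += math.log(p if y else 1.0 - p)
        weights.append(math.exp(log_w))
    total = sum(weights)
    return sum(t * w for t, w in zip(grid, weights)) / total
```

A learner who answers every item correctly receives a higher estimate than one who answers none, and the prior keeps both estimates finite even though the maximum-likelihood estimate would diverge in those cases.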

Title: HaDM-ST: Histology-Assisted Differential Modeling for Spatial Transcriptomics Generation

Authors: Xuepeng Liu, Zheng Jiang, Pinan Zhu, Hanyu Liu, Chao Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.07225
Pdf URL: https://arxiv.org/pdf/2508.07225
Copy Paste: [[2508.07225]] HaDM-ST: Histology-Assisted Differential Modeling for Spatial Transcriptomics Generation(https://arxiv.org/abs/2508.07225)
Keywords: generation
Abstract: Spatial transcriptomics (ST) reveals spatial heterogeneity of gene expression, yet its resolution is limited by current platforms. Recent methods enhance resolution via H&E-stained histology, but three major challenges persist: (1) isolating expression-relevant features from visually complex H&E images; (2) achieving spatially precise multimodal alignment in diffusion-based frameworks; and (3) modeling gene-specific variation across expression channels. We propose HaDM-ST (Histology-assisted Differential Modeling for ST Generation), a high-resolution ST generation framework conditioned on H&E images and low-resolution ST. HaDM-ST includes: (i) a semantic distillation network to extract predictive cues from H&E; (ii) a spatial alignment module enforcing pixel-wise correspondence with low-resolution ST; and (iii) a channel-aware adversarial learner for fine-grained gene-level modeling. Experiments on 200 genes across diverse tissues and species show HaDM-ST consistently outperforms prior methods, enhancing spatial fidelity and gene-level coherence in high-resolution ST predictions.

Title: Consistent and Controllable Image Animation with Motion Linear Diffusion Transformers

Authors: Xin Ma, Yaohui Wang, Genyun Jia, Xinyuan Chen, Tien-Tsin Wong, Cunjian Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.07246
Pdf URL: https://arxiv.org/pdf/2508.07246
Copy Paste: [[2508.07246]] Consistent and Controllable Image Animation with Motion Linear Diffusion Transformers(https://arxiv.org/abs/2508.07246)
Keywords: generation, generative
Abstract: Image animation has seen significant progress, driven by the powerful generative capabilities of diffusion models. However, maintaining appearance consistency with static input images and mitigating abrupt motion transitions in generated animations remain persistent challenges. While text-to-video (T2V) generation has demonstrated impressive performance with diffusion transformer models, the image animation field still largely relies on U-Net-based diffusion models, which lag behind the latest T2V approaches. Moreover, the quadratic complexity of vanilla self-attention mechanisms in Transformers imposes heavy computational demands, making image animation particularly resource-intensive. To address these issues, we propose MiraMo, a framework designed to enhance efficiency, appearance consistency, and motion smoothness in image animation. Specifically, MiraMo introduces three key elements: (1) A foundational text-to-video architecture replacing vanilla self-attention with efficient linear attention to reduce computational overhead while preserving generation quality; (2) A novel motion residual learning paradigm that focuses on modeling motion dynamics rather than directly predicting frames, improving temporal consistency; and (3) A DCT-based noise refinement strategy during inference to suppress sudden motion artifacts, complemented by a dynamics control module to balance motion smoothness and expressiveness. Extensive experiments against state-of-the-art methods validate the superiority of MiraMo in generating consistent, smooth, and controllable animations with accelerated inference speed. Additionally, we demonstrate the versatility of MiraMo through applications in motion transfer and video editing tasks.
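The efficiency argument in this abstract rests on replacing softmax self-attention, whose cost grows quadratically with sequence length, by linear attention. The sketch below shows one well-known generic form, using the positive feature map elu(x) + 1 and reordering the matrix products so that keys and values are summarized once. It is an assumption that MiraMo uses anything resembling this particular kernel; the function names are illustrative, and `quadratic_reference` exists only to check that the reordering gives the same result.

```python
import math

def feature_map(x):
    """elu(x) + 1: equals x + 1 for x > 0 and exp(x) otherwise, always positive."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]

def linear_attention(Q, K, V):
    """Kernelized attention in time linear in the number of tokens.

    Q, K: lists of d_k-dimensional vectors; V: list of d_v-dimensional vectors.
    """
    d_k, d_v = len(K[0]), len(V[0])
    # Summaries computed once: S = sum_j phi(k_j) v_j^T and z = sum_j phi(k_j).
    S = [[0.0] * d_v for _ in range(d_k)]
    z = [0.0] * d_k
    for k, v in zip(K, V):
        fk = feature_map(k)
        for a in range(d_k):
            z[a] += fk[a]
            for b in range(d_v):
                S[a][b] += fk[a] * v[b]
    out = []
    for q in Q:
        fq = feature_map(q)
        denom = sum(fq[a] * z[a] for a in range(d_k))
        out.append([sum(fq[a] * S[a][b] for a in range(d_k)) / denom for b in range(d_v)])
    return out

def quadratic_reference(Q, K, V):
    """Same kernel evaluated pairwise (quadratic cost), for checking only."""
    out = []
    for q in Q:
        fq = feature_map(q)
        w = [sum(a * b for a, b in zip(fq, feature_map(k))) for k in K]
        total = sum(w)
        out.append([sum(wj * v[b] for wj, v in zip(w, V)) / total for b in range(len(V[0]))])
    return out
```

Because the summaries `S` and `z` do not depend on the query, the per-token cost is independent of sequence length, which is the source of the savings over pairwise attention.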

Title: SynMatch: Rethinking Consistency in Medical Image Segmentation with Sparse Annotations

Authors: Zhiqiang Shen, Peng Cao, Xiaoli Liu, Jinzhu Yang, Osmar R. Zaiane
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.07298
Pdf URL: https://arxiv.org/pdf/2508.07298
Copy Paste: [[2508.07298]] SynMatch: Rethinking Consistency in Medical Image Segmentation with Sparse Annotations(https://arxiv.org/abs/2508.07298)
Keywords: generation
Abstract: Label scarcity remains a major challenge in deep learning-based medical image segmentation. Recent studies use strong-weak pseudo supervision to leverage unlabeled data. However, performance is often hindered by inconsistencies between pseudo labels and their corresponding unlabeled images. In this work, we propose SynMatch, a novel framework that sidesteps the need for improving pseudo labels by synthesizing images to match them instead. Specifically, SynMatch synthesizes images using texture and shape features extracted from the same segmentation model that generates the corresponding pseudo labels for unlabeled images. This design enables the generation of highly consistent synthesized-image-pseudo-label pairs without requiring any training parameters for image synthesis. We extensively evaluate SynMatch across diverse medical image segmentation tasks under semi-supervised learning (SSL), weakly-supervised learning (WSL), and barely-supervised learning (BSL) settings with increasingly limited annotations. The results demonstrate that SynMatch achieves superior performance, especially in the most challenging BSL setting. For example, it outperforms the recent strong-weak pseudo supervision-based method by 29.71% and 10.05% on the polyp segmentation task with 5% and 10% scribble annotations, respectively. The code will be released at this https URL.

Title: DragonFruitQualityNet: A Lightweight Convolutional Neural Network for Real-Time Dragon Fruit Quality Inspection on Mobile Devices

Authors: Md Zahurul Haquea, Yeahyea Sarker, Muhammed Farhan Sadique Mahi, Syed Jubayer Jaman, Md Robiul Islam
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.07306
Pdf URL: https://arxiv.org/pdf/2508.07306
Copy Paste: [[2508.07306]] DragonFruitQualityNet: A Lightweight Convolutional Neural Network for Real-Time Dragon Fruit Quality Inspection on Mobile Devices(https://arxiv.org/abs/2508.07306)
Keywords: quality assessment
Abstract: Dragon fruit, renowned for its nutritional benefits and economic value, has experienced rising global demand due to its affordability and local availability. As dragon fruit cultivation expands, efficient pre- and post-harvest quality inspection has become essential for improving agricultural productivity and minimizing post-harvest losses. This study presents DragonFruitQualityNet, a lightweight Convolutional Neural Network (CNN) optimized for real-time quality assessment of dragon fruits on mobile devices. We curated a diverse dataset of 13,789 images, integrating self-collected samples with public datasets (dataset from Mendeley Data), and classified them into four categories: fresh, immature, mature, and defective fruits to ensure robust model training. The proposed model achieves an impressive 93.98% accuracy, outperforming existing methods in fruit quality classification. To facilitate practical adoption, we embedded the model into an intuitive mobile application, enabling farmers and agricultural stakeholders to conduct on-device, real-time quality inspections. This research provides an accurate, efficient, and scalable AI-driven solution for dragon fruit quality control, supporting digital agriculture and empowering smallholder farmers with accessible technology. By bridging the gap between research and real-world application, our work advances post-harvest management and promotes sustainable farming practices.

Title: RORPCap: Retrieval-based Objects and Relations Prompt for Image Captioning

Authors: Jinjing Gu, Tianbao Qin, Yuanyuan Pu, Zhengpeng Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.07318
Pdf URL: https://arxiv.org/pdf/2508.07318
Copy Paste: [[2508.07318]] RORPCap: Retrieval-based Objects and Relations Prompt for Image Captioning(https://arxiv.org/abs/2508.07318)
Keywords: generation
Abstract: Image captioning aims to generate natural language descriptions for input images in an open-form manner. To accurately generate descriptions related to the image, a critical step in image captioning is to identify objects and understand their relations within the image. Modern approaches typically capitalize on object detectors or combine detectors with a Graph Convolutional Network (GCN). However, these models suffer from redundant detection information, difficulty in GCN construction, and high training costs. To address these issues, a Retrieval-based Objects and Relations Prompt for Image Captioning (RORPCap) is proposed, inspired by the fact that image-text retrieval can provide rich semantic information for input images. RORPCap employs an Objects and relations Extraction Model to extract object and relation words from the image. These words are then incorporated into predefined prompt templates and encoded as prompt embeddings. Next, a Mamba-based mapping network is designed to quickly map image embeddings extracted by CLIP to visual-text embeddings. Finally, the resulting prompt embeddings and visual-text embeddings are concatenated to form textual-enriched feature embeddings, which are fed into a GPT-2 model for caption generation. Extensive experiments conducted on the widely used MS-COCO dataset show that RORPCap requires only 2.6 hours of training under cross-entropy loss, achieving a CIDEr score of 120.5% and a SPICE score of 22.0% on the "Karpathy" test split. RORPCap achieves performance comparable to detector-based and GCN-based models with the shortest training time and demonstrates its potential as an alternative for image captioning.

Title: Planner-Refiner: Dynamic Space-Time Refinement for Vision-Language Alignment in Videos

Authors: Tuyen Tran, Thao Minh Le, Quang-Hung Le, Truyen Tran
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.07330
Pdf URL: https://arxiv.org/pdf/2508.07330
Copy Paste: [[2508.07330]] Planner-Refiner: Dynamic Space-Time Refinement for Vision-Language Alignment in Videos(https://arxiv.org/abs/2508.07330)
Keywords: generation
Abstract: Vision-language alignment in video must address the complexity of language, evolving interacting entities, their action chains, and semantic gaps between language and vision. This work introduces Planner-Refiner, a framework to overcome these challenges. Planner-Refiner bridges the semantic gap by iteratively refining visual elements' space-time representation, guided by language until semantic gaps are minimal. A Planner module schedules language guidance by decomposing complex linguistic prompts into short sentence chains. The Refiner processes each short sentence, a noun-phrase and verb-phrase pair, to direct visual tokens' self-attention across space then time, achieving efficient single-step refinement. A recurrent system chains these steps, maintaining refined visual token representations. The final representation feeds into task-specific heads for alignment generation. We demonstrate Planner-Refiner's effectiveness on two video-language alignment tasks: Referring Video Object Segmentation and Temporal Grounding with varying language complexity. We further introduce a new MeViS-X benchmark to assess models' capability with long queries. Superior performance versus state-of-the-art methods on these benchmarks shows the approach's potential, especially for complex prompts.

Title: Finite-Time Convergence Analysis of ODE-based Generative Models for Stochastic Interpolants

Authors: Yuhao Liu, Rui Hu, Yu Chen, Longbo Huang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.07333
Pdf URL: https://arxiv.org/pdf/2508.07333
Copy Paste: [[2508.07333]] Finite-Time Convergence Analysis of ODE-based Generative Models for Stochastic Interpolants(https://arxiv.org/abs/2508.07333)
Keywords: generative
Abstract: Stochastic interpolants offer a robust framework for continuously transforming samples between arbitrary data distributions, holding significant promise for generative modeling. Despite their potential, rigorous finite-time convergence guarantees for practical numerical schemes remain largely unexplored. In this work, we address the finite-time convergence analysis of numerical implementations for ordinary differential equations (ODEs) derived from stochastic interpolants. Specifically, we establish novel finite-time error bounds in total variation distance for two widely used numerical integrators: the first-order forward Euler method and the second-order Heun's method. Furthermore, our analysis on the iteration complexity of specific stochastic interpolant constructions provides optimized schedules to enhance computational efficiency. Our theoretical findings are corroborated by numerical experiments, which validate the derived error bounds and complexity analyses.
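
The two integrators named in the abstract can be illustrated with a minimal sketch. The scalar test problem below (dx/dt = -x) and all function names are assumptions chosen for illustration; they are not the paper's velocity field, and the sketch measures pointwise error on one trajectory rather than the total variation bounds the paper derives.

```python
import math

def euler_step(v, x, t, h):
    """First-order forward Euler update for dx/dt = v(x, t)."""
    return x + h * v(x, t)

def heun_step(v, x, t, h):
    """Second-order Heun update: average the slope at the start and at an Euler predictor."""
    k1 = v(x, t)
    k2 = v(x + h * k1, t + h)
    return x + 0.5 * h * (k1 + k2)

def integrate(step, v, x0, t0, t1, n_steps):
    """Apply a one-step method n_steps times on a uniform grid from t0 to t1."""
    h = (t1 - t0) / n_steps
    x, t = x0, t0
    for _ in range(n_steps):
        x = step(v, x, t, h)
        t += h
    return x

def toy_velocity(x, t):
    """Hypothetical velocity field with known solution x(t) = x0 * exp(-t)."""
    return -x

exact = math.exp(-1.0)
euler_err = abs(integrate(euler_step, toy_velocity, 1.0, 0.0, 1.0, 20) - exact)
heun_err = abs(integrate(heun_step, toy_velocity, 1.0, 0.0, 1.0, 20) - exact)
print(f"Euler error: {euler_err:.2e}, Heun error: {heun_err:.2e}")
```

With the same number of steps, Heun's method spends two velocity evaluations per step and should show a markedly smaller error on this toy problem, which is the trade-off an iteration-complexity analysis has to weigh.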

Title: CoAR: Concept Injection into Autoregressive Models for Personalized Text-to-Image Generation

Authors: Fangtai Wu, Mushui Liu, Weijie He, Wanggui He, Hao Jiang, Zhao Wang, Yunlong Yu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.07341
Pdf URL: https://arxiv.org/pdf/2508.07341
Copy Paste: [[2508.07341]] CoAR: Concept Injection into Autoregressive Models for Personalized Text-to-Image Generation(https://arxiv.org/abs/2508.07341)
Keywords: generation
Abstract: The unified autoregressive (AR) model excels at multimodal understanding and generation, but its potential for customized image generation remains underexplored. Existing customized generation methods rely on full fine-tuning or adapters, making them costly and prone to overfitting or catastrophic forgetting. In this paper, we propose CoAR, a novel framework for injecting subject concepts into the unified AR models while keeping all pre-trained parameters completely frozen. CoAR learns effective, specific subject representations with only a minimal number of parameters using a Layerwise Multimodal Context Learning strategy. To address overfitting and language drift, we further introduce regularization that preserves the pre-trained distribution and anchors context tokens to improve subject fidelity and re-contextualization. Additionally, CoAR supports training-free subject customization in a user-provided style. Experiments demonstrate that CoAR achieves superior performance on both subject-driven personalization and style personalization, while delivering significant gains in computational and memory efficiency. Notably, CoAR tunes less than 0.05% of the parameters while achieving competitive performance compared to recent Proxy-Tuning. Code: this https URL

Title: SODiff: Semantic-Oriented Diffusion Model for JPEG Compression Artifacts Removal

Authors: Tingyu Yang, Jue Gong, Jinpei Guo, Wenbo Li, Yong Guo, Yulun Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.07346
Pdf URL: https://arxiv.org/pdf/2508.07346
Copy Paste: [[2508.07346]] SODiff: Semantic-Oriented Diffusion Model for JPEG Compression Artifacts Removal(https://arxiv.org/abs/2508.07346)
Keywords: restoration, generative
Abstract: JPEG, as a widely used image compression standard, often introduces severe visual artifacts when achieving high compression ratios. Although existing deep learning-based restoration methods have made considerable progress, they often struggle to recover complex texture details, resulting in over-smoothed outputs. To overcome these limitations, we propose SODiff, a novel and efficient semantic-oriented one-step diffusion model for JPEG artifacts removal. Our core idea is that effective restoration hinges on providing semantic-oriented guidance to the pre-trained diffusion model, thereby fully leveraging its powerful generative prior. To this end, SODiff incorporates a semantic-aligned image prompt extractor (SAIPE). SAIPE extracts rich features from low-quality (LQ) images and projects them into an embedding space semantically aligned with that of the text encoder. Simultaneously, it preserves crucial information for faithful reconstruction. Furthermore, we propose a quality factor-aware time predictor that implicitly learns the compression quality factor (QF) of the LQ image and adaptively selects the optimal denoising start timestep for the diffusion process. Extensive experimental results show that our SODiff outperforms recent leading methods in both visual quality and quantitative metrics. Code is available at: this https URL

Title: DIP-GS: Deep Image Prior For Gaussian Splatting Sparse View Recovery

Authors: Rajaei Khatib, Raja Giryes
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.07372
Pdf URL: https://arxiv.org/pdf/2508.07372
Copy Paste: [[2508.07372]] DIP-GS: Deep Image Prior For Gaussian Splatting Sparse View Recovery(https://arxiv.org/abs/2508.07372)
Keywords: generative
Abstract: 3D Gaussian Splatting (3DGS) is a leading 3D scene reconstruction method, obtaining high-quality reconstruction with real-time rendering performance. The main idea behind 3DGS is to represent the scene as a collection of 3D Gaussians, while learning their parameters to fit the given views of the scene. While achieving superior performance in the presence of many views, 3DGS struggles with sparse view reconstruction, where the input views are sparse, do not fully cover the scene, and have low overlap. In this paper, we propose DIP-GS, a Deep Image Prior (DIP) 3DGS representation. By using the DIP prior, which utilizes internal structure and patterns, in a coarse-to-fine manner, DIP-based 3DGS can operate in scenarios where vanilla 3DGS fails, such as sparse view recovery. Note that our approach does not use any pre-trained models such as generative models and depth estimation, but rather relies only on the input frames. Among such methods, DIP-GS obtains state-of-the-art (SOTA) competitive results on various sparse-view reconstruction tasks, demonstrating its capabilities.

Title: Tight Bounds for Schrödinger Potential Estimation in Unpaired Image-to-Image Translation Problems

Authors: Nikita Puchkin, Denis Suchkov, Alexey Naumov, Denis Belomestny
Subjects: cs.LG, math.ST, stat.ML
Abstract URL: https://arxiv.org/abs/2508.07392
Pdf URL: https://arxiv.org/pdf/2508.07392
Copy Paste: [[2508.07392]] Tight Bounds for Schrödinger Potential Estimation in Unpaired Image-to-Image Translation Problems(https://arxiv.org/abs/2508.07392)
Keywords: generative
Abstract: Modern methods of generative modelling and unpaired image-to-image translation based on Schrödinger bridges and stochastic optimal control theory aim to transform an initial density to a target one in an optimal way. In the present paper, we assume that we only have access to i.i.d. samples from initial and final distributions. This makes our setup suitable for both generative modelling and unpaired image-to-image translation. Relying on the stochastic optimal control approach, we choose an Ornstein-Uhlenbeck process as the reference one and estimate the corresponding Schrödinger potential. Introducing a risk function as the Kullback-Leibler divergence between couplings, we derive tight bounds on generalization ability of an empirical risk minimizer in a class of Schrödinger potentials including Gaussian mixtures. Thanks to the mixing properties of the Ornstein-Uhlenbeck process, we almost achieve fast rates of convergence up to some logarithmic factors in favourable scenarios. We also illustrate performance of the suggested approach with numerical experiments.
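
As background for the reference process mentioned in the abstract, here is a minimal simulation sketch of an Ornstein-Uhlenbeck process. The parameterization dX = -theta * X dt + sigma dW, the parameter values, and all function names are assumptions for illustration; the paper's own normalization may differ. The sketch uses the exact Gaussian transition of the process, so it has no discretization error, and it shows how quickly the process forgets its starting point, which is the mixing property the abstract refers to.

```python
import math
import random

def ou_step(x, h, theta, sigma, rng):
    """Exact transition of dX = -theta * X dt + sigma dW over a time step h."""
    decay = math.exp(-theta * h)
    std = sigma * math.sqrt((1.0 - decay * decay) / (2.0 * theta))
    return x * decay + std * rng.gauss(0.0, 1.0)

def simulate_ou(x0, horizon, n_steps, theta, sigma, rng):
    """Return the state at time `horizon` of one path started at x0."""
    h = horizon / n_steps
    x = x0
    for _ in range(n_steps):
        x = ou_step(x, h, theta, sigma, rng)
    return x

rng = random.Random(0)  # fixed seed so the run is reproducible
theta, sigma = 1.0, 1.0  # hypothetical parameter choices
samples = [simulate_ou(3.0, 5.0, 10, theta, sigma, rng) for _ in range(20000)]

mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
stationary_var = sigma ** 2 / (2.0 * theta)
print(f"empirical mean {mean:.3f}, variance {var:.3f}, stationary variance {stationary_var:.3f}")
```

Although every path starts at 3.0, after five time units the empirical mean and variance should sit close to the stationary values 0 and sigma^2 / (2 * theta).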

Title: CLUE: Leveraging Low-Rank Adaptation to Capture Latent Uncovered Evidence for Image Forgery Localization

Authors: Youqi Wang, Shunquan Tan, Rongxuan Peng, Bin Li, Jiwu Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.07413
Pdf URL: https://arxiv.org/pdf/2508.07413
Copy Paste: [[2508.07413]] CLUE: Leveraging Low-Rank Adaptation to Capture Latent Uncovered Evidence for Image Forgery Localization(https://arxiv.org/abs/2508.07413)
Keywords: generative
Abstract: The increasing accessibility of image editing tools and generative AI has led to a proliferation of visually convincing forgeries, compromising the authenticity of digital media. In this paper, in addition to leveraging distortions from conventional forgeries, we repurpose the mechanism of a state-of-the-art (SOTA) text-to-image synthesis model by exploiting its internal generative process, turning it into a high-fidelity forgery localization tool. To this end, we propose CLUE (Capture Latent Uncovered Evidence), a framework that employs Low-Rank Adaptation (LoRA) to parameter-efficiently reconfigure Stable Diffusion 3 (SD3) as a forensic feature extractor. Our approach begins with the strategic use of SD3's Rectified Flow (RF) mechanism to inject noise at varying intensities into the latent representation, thereby steering the LoRA-tuned denoising process to amplify subtle statistical inconsistencies indicative of a forgery. To complement the latent analysis with high-level semantic context and precise spatial details, our method incorporates contextual features from the image encoder of the Segment Anything Model (SAM), which is parameter-efficiently adapted to better trace the boundaries of forged regions. Extensive evaluations demonstrate CLUE's SOTA generalization performance, significantly outperforming prior methods. Furthermore, CLUE shows superior robustness against common post-processing attacks and Online Social Networks (OSNs). Code is publicly available at this https URL.
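
The noise-injection step can be pictured with the standard rectified-flow interpolation, z_t = (1 - t) * z_0 + t * eps, where t in [0, 1] sets the noise intensity. The sketch below is an assumption-laden illustration: the toy latent, the chosen values of t, and the function name are hypothetical, and the abstract does not state which intensities or schedule CLUE actually uses.

```python
import random

def inject_noise(latent, noise, t):
    """Rectified-flow interpolation z_t = (1 - t) * z_0 + t * eps for t in [0, 1]."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    return [(1.0 - t) * z + t * e for z, e in zip(latent, noise)]

rng = random.Random(0)  # fixed seed so the run is reproducible
latent = [0.5, -1.0, 2.0, 0.0]  # hypothetical flattened latent
noise = [rng.gauss(0.0, 1.0) for _ in latent]

# The same latent perturbed at several intensities (values chosen for illustration).
noisy_versions = {t: inject_noise(latent, noise, t) for t in (0.0, 0.25, 0.5, 1.0)}
for t, z_t in noisy_versions.items():
    print(t, [round(v, 3) for v in z_t])
```

At t = 0 the latent is returned unchanged and at t = 1 it is replaced entirely by noise; intermediate values trade off the two linearly.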

Title: VisR-Bench: An Empirical Study on Visual Retrieval-Augmented Generation for Multilingual Long Document Understanding

Authors: Jian Chen, Ming Li, Jihyung Kil, Chenguang Wang, Tong Yu, Ryan Rossi, Tianyi Zhou, Changyou Chen, Ruiyi Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.07493
Pdf URL: https://arxiv.org/pdf/2508.07493
Copy Paste: [[2508.07493]] VisR-Bench: An Empirical Study on Visual Retrieval-Augmented Generation for Multilingual Long Document Understanding(https://arxiv.org/abs/2508.07493)
Keywords: generation
Abstract: Most organizational data in this world are stored as documents, and visual retrieval plays a crucial role in unlocking the collective intelligence from all these documents. However, existing benchmarks focus on English-only document retrieval or only consider multilingual question-answering on a single-page image. To bridge this gap, we introduce VisR-Bench, a multilingual benchmark designed for question-driven multimodal retrieval in long documents. Our benchmark comprises over 35K high-quality QA pairs across 1.2K documents, enabling fine-grained evaluation of multimodal retrieval. VisR-Bench spans sixteen languages with three question types (figures, text, and tables), offering diverse linguistic and question coverage. Unlike prior datasets, we include queries without explicit answers, preventing models from relying on superficial keyword matching. We evaluate various retrieval models, including text-based methods, multimodal encoders, and MLLMs, providing insights into their strengths and limitations. Our results show that while MLLMs significantly outperform text-based and multimodal encoder models, they still struggle with structured tables and low-resource languages, highlighting key challenges in multilingual visual retrieval.

Title: Enhanced Generative Structure Prior for Chinese Text Image Super-resolution

Authors: Xiaoming Li, Wangmeng Zuo, Chen Change Loy
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.07537
Pdf URL: https://arxiv.org/pdf/2508.07537
Copy Paste: [[2508.07537]] Enhanced Generative Structure Prior for Chinese Text Image Super-resolution(https://arxiv.org/abs/2508.07537)
Keywords: restoration, super-resolution, generative
Abstract: Faithful text image super-resolution (SR) is challenging because each character has a unique structure and usually exhibits diverse font styles and layouts. While existing methods primarily focus on English text, less attention has been paid to more complex scripts like Chinese. In this paper, we introduce a high-quality text image SR framework designed to restore the precise strokes of low-resolution (LR) Chinese characters. Unlike methods that rely on character recognition priors to regularize the SR task, we propose a novel structure prior that offers structure-level guidance to enhance visual quality. Our framework incorporates this structure prior within a StyleGAN model, leveraging its generative capabilities for restoration. To maintain the integrity of character structures while accommodating various font styles and layouts, we implement a codebook-based mechanism that restricts the generative space of StyleGAN. Each code in the codebook represents the structure of a specific character, while the vector $w$ in StyleGAN controls the character's style, including typeface, orientation, and location. Through the collaborative interaction between the codebook and style, we generate a high-resolution structure prior that aligns with LR characters both spatially and structurally. Experiments demonstrate that this structure prior provides robust, character-specific guidance, enabling the accurate restoration of clear strokes in degraded characters, even for real-world LR Chinese text with irregular layouts. Our code and pre-trained models will be available at this https URL

Title: CoT-Pose: Chain-of-Thought Reasoning for 3D Pose Generation from Abstract Prompts

Authors: Junuk Cha, Jihyeon Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.07540
Pdf URL: https://arxiv.org/pdf/2508.07540
Copy Paste: [[2508.07540]] CoT-Pose: Chain-of-Thought Reasoning for 3D Pose Generation from Abstract Prompts(https://arxiv.org/abs/2508.07540)
Keywords: generation
Abstract: Recent advances in multi-modal large language models (MLLMs) and chain-of-thought (CoT) reasoning have led to significant progress in image and text generation tasks. However, the field of 3D human pose generation still faces critical limitations. Most existing text-to-pose models rely heavily on detailed (low-level) prompts that explicitly describe joint configurations. In contrast, humans tend to communicate actions and intentions using abstract (high-level) language. This mismatch results in a practical challenge for deploying pose generation systems in real-world scenarios. To bridge this gap, we introduce a novel framework that incorporates CoT reasoning into the pose generation process, enabling the interpretation of abstract prompts into accurate 3D human poses. We further propose a data synthesis pipeline that automatically generates triplets of abstract prompts, detailed prompts, and corresponding 3D poses for the training process. Experimental results demonstrate that our reasoning-enhanced model, CoT-Pose, can effectively generate plausible and semantically aligned poses from abstract textual inputs. This work highlights the importance of high-level understanding in pose generation and opens new directions for reasoning-enhanced approaches to human pose generation.

Title: Commentary Generation for Soccer Highlights

Authors: Chidaksh Ravuru
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2508.07543
Pdf URL: https://arxiv.org/pdf/2508.07543
Copy Paste: [[2508.07543]] Commentary Generation for Soccer Highlights(https://arxiv.org/abs/2508.07543)
Keywords: generation
Abstract: Automated soccer commentary generation has evolved from template-based systems to advanced neural architectures, aiming to produce real-time descriptions of sports events. While frameworks like SoccerNet-Caption laid foundational work, their inability to achieve fine-grained alignment between video content and commentary remains a significant challenge. Recent efforts such as MatchTime, with its MatchVoice model, address this issue through coarse and fine-grained alignment techniques, achieving improved temporal synchronization. In this paper, we extend MatchVoice to commentary generation for soccer highlights using the GOAL dataset, which emphasizes short clips over entire games. We conduct extensive experiments to reproduce the original MatchTime results and evaluate our setup, highlighting the impact of different training configurations and hardware limitations. Furthermore, we explore the effect of varying window sizes on zero-shot performance. While MatchVoice exhibits promising generalization capabilities, our findings suggest the need for integrating techniques from broader video-language domains to further enhance performance. Our code is available at this https URL.

Title: Splat4D: Diffusion-Enhanced 4D Gaussian Splatting for Temporally and Spatially Consistent Content Creation

Authors: Minghao Yin, Yukang Cao, Songyou Peng, Kai Han
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.07557
Pdf URL: https://arxiv.org/pdf/2508.07557
Copy Paste: [[2508.07557]] Splat4D: Diffusion-Enhanced 4D Gaussian Splatting for Temporally and Spatially Consistent Content Creation(https://arxiv.org/abs/2508.07557)
Keywords: generation
Abstract: Generating high-quality 4D content from monocular videos for applications such as digital humans and AR/VR poses challenges in ensuring temporal and spatial consistency, preserving intricate details, and incorporating user guidance effectively. To overcome these challenges, we introduce Splat4D, a novel framework enabling high-fidelity 4D content generation from a monocular video. Splat4D achieves superior performance while maintaining faithful spatial-temporal coherence by leveraging multi-view rendering, inconsistency identification, a video diffusion model, and an asymmetric U-Net for refinement. Through extensive evaluations on public benchmarks, Splat4D consistently demonstrates state-of-the-art performance across various metrics, underscoring the efficacy of our approach. Additionally, the versatility of Splat4D is validated in various applications such as text/image conditioned 4D generation, 4D human generation, and text-guided content editing, producing coherent outcomes following user instructions.

Title: When and how can inexact generative models still sample from the data manifold?

Authors: Nisha Chandramoorthy, Adriaan de Clercq
Subjects: cs.LG, math.DS, math.PR
Abstract URL: https://arxiv.org/abs/2508.07581
Pdf URL: https://arxiv.org/pdf/2508.07581
Copy Paste: [[2508.07581]] When and how can inexact generative models still sample from the data manifold?(https://arxiv.org/abs/2508.07581)
Keywords: generative
Abstract: A curious phenomenon observed in some dynamical generative models is the following: despite learning errors in the score function or the drift vector field, the generated samples appear to shift along the support of the data distribution but not away from it. In this work, we investigate this phenomenon of robustness of the support by taking a dynamical systems approach on the generating stochastic/deterministic process. Our perturbation analysis of the probability flow reveals that infinitesimal learning errors cause the predicted density to be different from the target density only on the data manifold for a wide class of generative models. Further, what is the dynamical mechanism that leads to the robustness of the support? We show that the alignment of the top Lyapunov vectors (most sensitive infinitesimal perturbation directions) with the tangent spaces along the boundary of the data manifold leads to robustness and prove a sufficient condition on the dynamics of the generating process to achieve this alignment. Moreover, the alignment condition is efficient to compute and, in practice, for robust generative models, automatically leads to accurate estimates of the tangent bundle of the data manifold. Using a finite-time linear perturbation analysis on sample paths as well as probability flows, our work complements and extends existing works on obtaining theoretical guarantees for generative models from a stochastic analysis, statistical learning and uncertainty quantification points of view. Our results apply across different dynamical generative models, such as conditional flow-matching and score-based generative models, and for different target distributions that may or may not satisfy the manifold hypothesis.

Title: ShoulderShot: Generating Over-the-Shoulder Dialogue Videos

Authors: Yuang Zhang, Junqi Cheng, Haoyu Zhao, Jiaxi Gu, Fangyuan Zou, Zenghui Lu, Peng Shu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.07597
Pdf URL: https://arxiv.org/pdf/2508.07597
Copy Paste: [[2508.07597]] ShoulderShot: Generating Over-the-Shoulder Dialogue Videos(https://arxiv.org/abs/2508.07597)
Keywords: generation
Abstract: Over-the-shoulder dialogue videos are essential in films, short dramas, and advertisements, providing visual variety and enhancing viewers' emotional connection. Despite their importance, such dialogue scenes remain largely underexplored in video generation research. The main challenges include maintaining character consistency across different shots, creating a sense of spatial continuity, and generating long, multi-turn dialogues within limited computational budgets. Here, we present ShoulderShot, a framework that combines dual-shot generation with looping video, enabling extended dialogues while preserving character consistency. Our results demonstrate capabilities that surpass existing methods in terms of shot-reverse-shot layout, spatial continuity, and flexibility in dialogue length, thereby opening up new possibilities for practical dialogue video generation. Videos and comparisons are available at this https URL.

Title: LaVieID: Local Autoregressive Diffusion Transformers for Identity-Preserving Video Creation

Authors: Wenhui Song, Hanhui Li, Jiehui Huang, Panwen Hu, Yuhao Cheng, Long Chen, Yiqiang Yan, Xiaodan Liang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.07603
Pdf URL: https://arxiv.org/pdf/2508.07603
Copy Paste: [[2508.07603]] LaVieID: Local Autoregressive Diffusion Transformers for Identity-Preserving Video Creation(https://arxiv.org/abs/2508.07603)
Keywords: generation
Abstract: In this paper, we present LaVieID, a novel \underline{l}ocal \underline{a}utoregressive \underline{vi}d\underline{e}o diffusion framework designed to tackle the challenging \underline{id}entity-preserving text-to-video task. The key idea of LaVieID is to mitigate the loss of identity information inherent in the stochastic global generation process of diffusion transformers (DiTs) from both spatial and temporal perspectives. Specifically, unlike the global and unstructured modeling of facial latent states in existing DiTs, LaVieID introduces a local router to explicitly represent latent states by weighted combinations of fine-grained local facial structures. This alleviates undesirable feature interference and encourages DiTs to capture distinctive facial characteristics. Furthermore, a temporal autoregressive module is integrated into LaVieID to refine denoised latent tokens before video decoding. This module divides latent tokens temporally into chunks, exploiting their long-range temporal dependencies to predict biases for rectifying tokens, thereby significantly enhancing inter-frame identity consistency. Consequently, LaVieID can generate high-fidelity personalized videos and achieve state-of-the-art performance. Our code and models are available at this https URL.

Title: X2Edit: Revisiting Arbitrary-Instruction Image Editing through Self-Constructed Data and Task-Aware Representation Learning

Authors: Jian Ma, Xujie Zhu, Zihao Pan, Qirong Peng, Xu Guo, Chen Chen, Haonan Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.07607
Pdf URL: https://arxiv.org/pdf/2508.07607
Copy Paste: [[2508.07607]] X2Edit: Revisiting Arbitrary-Instruction Image Editing through Self-Constructed Data and Task-Aware Representation Learning(https://arxiv.org/abs/2508.07607)
Keywords: generation, generative
Abstract: Existing open-source datasets for arbitrary-instruction image editing remain suboptimal, while a plug-and-play editing module compatible with community-prevalent generative models is notably absent. In this paper, we first introduce the X2Edit Dataset, a comprehensive dataset covering 14 diverse editing tasks, including subject-driven generation. We use industry-leading unified image generation models and expert models to construct the data. Meanwhile, we design reasonable editing instructions with the VLM and implement various scoring mechanisms to filter the data. As a result, we construct 3.7 million high-quality samples with balanced categories. Second, to integrate seamlessly with community image generation models, we design task-aware MoE-LoRA training based on FLUX.1, with only 8\% of the parameters of the full model. To further improve the final performance, we utilize the internal representations of the diffusion model and define positive/negative samples based on image editing types to introduce contrastive learning. Extensive experiments demonstrate that the model's editing performance is competitive among many excellent models. Additionally, the constructed dataset exhibits substantial advantages over existing open-source datasets. The open-source code, checkpoints, and datasets for X2Edit can be found at the following link: this https URL.

Title: Efficient Approximate Posterior Sampling with Annealed Langevin Monte Carlo

Authors: Advait Parulekar, Litu Rout, Karthikeyan Shanmugam, Sanjay Shakkottai
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.07631
Pdf URL: https://arxiv.org/pdf/2508.07631
Copy Paste: [[2508.07631]] Efficient Approximate Posterior Sampling with Annealed Langevin Monte Carlo(https://arxiv.org/abs/2508.07631)
Keywords: super-resolution, generative
Abstract: We study the problem of posterior sampling in the context of score-based generative models. We have a trained score network for a prior $p(x)$, a measurement model $p(y|x)$, and are tasked with sampling from the posterior $p(x|y)$. Prior work has shown this to be intractable in KL (in the worst case) under well-accepted computational hardness assumptions. Despite this, popular algorithms for tasks such as image super-resolution, stylization, and reconstruction enjoy empirical success. Rather than establishing distributional assumptions or restricted settings under which exact posterior sampling is tractable, we view this as a more general "tilting" problem of biasing a distribution towards a measurement. Under minimal assumptions, we show that one can tractably sample from a distribution that is simultaneously close to the posterior of a noised prior in KL divergence and the true posterior in Fisher divergence. Intuitively, this combination ensures that the resulting sample is consistent with both the measurement and the prior. To the best of our knowledge, these are the first formal results for (approximate) posterior sampling in polynomial time.
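Illustration (not from the paper): the "tilting" view can be sketched on a toy 1D problem where everything is known in closed form. The sketch below assumes a standard normal prior, a Gaussian measurement y = x + noise, and a hand-picked noise schedule and step size. All function names are hypothetical, and this is generic annealed Langevin dynamics with a likelihood term added to the prior score, not the algorithm analyzed by the authors.

```python
import math
import random


def prior_score(x, t):
    """Score of the toy prior N(0, 1) after adding Gaussian noise of std t."""
    return -x / (1.0 + t * t)


def likelihood_score(x, y, sigma):
    """Gradient in x of log N(y; x, sigma^2), i.e. the measurement tilt."""
    return (y - x) / (sigma * sigma)


def annealed_langevin_sample(y, sigma, noise_levels, steps, step_size, rng):
    """Run Langevin updates while the prior noise level is annealed down."""
    x = rng.gauss(0.0, 1.0)
    for t in noise_levels:
        for _ in range(steps):
            drift = prior_score(x, t) + likelihood_score(x, y, sigma)
            noise = math.sqrt(2.0 * step_size) * rng.gauss(0.0, 1.0)
            x += step_size * drift + noise
    return x


def estimate_posterior_mean(y, sigma, n_chains=1000, seed=0):
    """Average many independent chains (schedule and step size are assumptions)."""
    rng = random.Random(seed)
    levels = [1.0, 0.5, 0.25, 0.0]
    draws = [
        annealed_langevin_sample(y, sigma, levels, 100, 0.01, rng)
        for _ in range(n_chains)
    ]
    return sum(draws) / n_chains
```

In this toy setting the exact posterior is Gaussian with mean y / (1 + sigma^2), so estimate_posterior_mean(1.0, 0.5) should land close to 0.8, up to sampling and discretization error.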

Title: LaRender: Training-Free Occlusion Control in Image Generation via Latent Rendering

Authors: Xiaohang Zhan, Dingming Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.07647
Pdf URL: https://arxiv.org/pdf/2508.07647
Copy Paste: [[2508.07647]] LaRender: Training-Free Occlusion Control in Image Generation via Latent Rendering(https://arxiv.org/abs/2508.07647)
Keywords: generation
Abstract: We propose a novel training-free image generation algorithm that precisely controls the occlusion relationships between objects in an image. Existing image generation methods typically rely on prompts to influence occlusion, which often lack precision. While layout-to-image methods provide control over object locations, they fail to address occlusion relationships explicitly. Given a pre-trained image diffusion model, our method leverages volume rendering principles to "render" the scene in latent space, guided by occlusion relationships and the estimated transmittance of objects. This approach does not require retraining or fine-tuning the image diffusion model, yet it enables accurate occlusion control due to its physics-grounded foundation. In extensive experiments, our method significantly outperforms existing approaches in terms of occlusion accuracy. Furthermore, we demonstrate that by adjusting the opacities of objects or concepts during rendering, our method can achieve a variety of effects, such as altering the transparency of objects, the density of mass (e.g., forests), the concentration of particles (e.g., rain, fog), the intensity of light, and the strength of lens effects, etc.
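Illustration (not from the paper): the volume rendering principle the abstract refers to is commonly written as front-to-back compositing, where each object contributes in proportion to its own opacity times the transmittance left over by everything in front of it. The sketch below applies that standard rule to small per-object feature vectors. The function name and the use of plain lists as stand-ins for latent features are assumptions, and the authors' actual latent-space procedure may differ.

```python
def composite_front_to_back(features, opacities):
    """Blend per-object feature vectors ordered from nearest to farthest.

    Object i receives weight opacity_i * prod_{j < i} (1 - opacity_j),
    the usual emission-absorption weight from volume rendering.
    """
    if len(features) != len(opacities):
        raise ValueError("features and opacities must have the same length")
    dim = len(features[0])
    blended = [0.0] * dim
    transmittance = 1.0  # fraction of the "ray" not yet absorbed
    weights = []
    for feat, alpha in zip(features, opacities):
        w = transmittance * alpha
        weights.append(w)
        for k in range(dim):
            blended[k] += w * feat[k]
        transmittance *= 1.0 - alpha
    return blended, weights, transmittance
```

For example, with a front object of opacity 0.75 and a fully opaque object behind it, the weights are 0.75 and 0.25. Lowering the front opacity lets more of the occluded object show through, which mirrors the transparency and density effects described in the abstract.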

Title: GLiClass: Generalist Lightweight Model for Sequence Classification Tasks

Authors: Ihor Stepanov, Mykhailo Shtopko, Dmytro Vodianytskyi, Oleksandr Lukashov, Alexander Yavorskyi, Mykyta Yaroshenko
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2508.07662
Pdf URL: https://arxiv.org/pdf/2508.07662
Copy Paste: [[2508.07662]] GLiClass: Generalist Lightweight Model for Sequence Classification Tasks(https://arxiv.org/abs/2508.07662)
Keywords: generative
Abstract: Classification is one of the most widespread tasks in AI applications, serving often as the first step in filtering, sorting, and categorizing data. Since modern AI systems must handle large volumes of input data and early pipeline stages can propagate errors downstream, achieving high efficiency and accuracy is critical. Moreover, classification requirements can change dynamically based on user needs, necessitating models with strong zero-shot capabilities. While generative LLMs have become mainstream for zero-shot classification due to their versatility, they suffer from inconsistent instruction following and computational inefficiency. Cross-encoders, commonly used as rerankers in RAG pipelines, face a different bottleneck: they must process text-label pairs sequentially, significantly reducing efficiency with large label sets. Embedding-based approaches offer good efficiency but struggle with complex scenarios involving logical and semantic constraints. We propose GLiClass, a novel method that adapts the GLiNER architecture for sequence classification tasks. Our approach achieves strong accuracy and efficiency comparable to embedding-based methods, while maintaining the flexibility needed for zero-shot and few-shot learning scenarios. Additionally, we adapted proximal policy optimization (PPO) for multi-label text classification, enabling training classifiers in data-sparse conditions or from human feedback.

Title: Undress to Redress: A Training-Free Framework for Virtual Try-On

Authors: Zhiying Li, Junhao Wu, Yeying Jin, Daiheng Gao, Yun Ji, Kaichuan Kong, Lei Yu, Hao Xu, Kai Chen, Bruce Gu, Nana Wang, Zhaoxin Fan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.07680
Pdf URL: https://arxiv.org/pdf/2508.07680
Copy Paste: [[2508.07680]] Undress to Redress: A Training-Free Framework for Virtual Try-On(https://arxiv.org/abs/2508.07680)
Keywords: restoration
Abstract: Virtual try-on (VTON) is a crucial task for enhancing user experience in online shopping by generating realistic garment previews on personal photos. Although existing methods have achieved impressive results, they struggle with long-sleeve-to-short-sleeve conversions (a common and practical scenario), often producing unrealistic outputs when exposed skin is underrepresented in the original image. We argue that this challenge arises from the "majority" completion rule in current VTON models, which leads to inaccurate skin restoration in such cases. To address this, we propose UR-VTON (Undress-Redress Virtual Try-ON), a novel, training-free framework that can be seamlessly integrated with any existing VTON method. UR-VTON introduces an "undress-to-redress" mechanism: it first reveals the user's torso by virtually "undressing," then applies the target short-sleeve garment, effectively decomposing the conversion into two more manageable steps. Additionally, we incorporate Dynamic Classifier-Free Guidance scheduling to balance diversity and image quality during DDPM sampling, and employ Structural Refiner to enhance detail fidelity using high-frequency cues. Finally, we present LS-TON, a new benchmark for long-sleeve-to-short-sleeve try-on. Extensive experiments demonstrate that UR-VTON outperforms state-of-the-art methods in both detail preservation and image quality. Code will be released upon acceptance.

Title: TAR-TVG: Enhancing VLMs with Timestamp Anchor-Constrained Reasoning for Temporal Video Grounding

Authors: Chaohong Guo, Xun Mo, Yongwei Nie, Xuemiao Xu, Chao Xu, Fei Yu, Chengjiang Long
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.07683
Pdf URL: https://arxiv.org/pdf/2508.07683
Copy Paste: [[2508.07683]] TAR-TVG: Enhancing VLMs with Timestamp Anchor-Constrained Reasoning for Temporal Video Grounding(https://arxiv.org/abs/2508.07683)
Keywords: generation
Abstract: Temporal Video Grounding (TVG) aims to precisely localize video segments corresponding to natural language queries, which is a critical capability for long-form video understanding. Although existing reinforcement learning approaches encourage models to generate reasoning chains before predictions, they fail to explicitly constrain the reasoning process to ensure the quality of the final temporal predictions. To address this limitation, we propose Timestamp Anchor-constrained Reasoning for Temporal Video Grounding (TAR-TVG), a novel framework that introduces timestamp anchors within the reasoning process to enforce explicit supervision to the thought content. These anchors serve as intermediate verification points. More importantly, we require each reasoning step to produce increasingly accurate temporal estimations, thereby ensuring that the reasoning process contributes meaningfully to the final prediction. To address the challenge of low-probability anchor generation in models (e.g., Qwen2.5-VL-3B), we develop an efficient self-distillation training strategy: (1) initial GRPO training to collect 30K high-quality reasoning traces containing multiple timestamp anchors, (2) supervised fine-tuning (SFT) on distilled data, and (3) final GRPO optimization on the SFT-enhanced model. This three-stage training strategy enables robust anchor generation while maintaining reasoning quality. Experiments show that our model achieves state-of-the-art performance while producing interpretable, verifiable reasoning chains with progressively refined temporal estimations.

Title: Make Your MoVe: Make Your 3D Contents by Adapting Multi-View Diffusion Models to External Editing

Authors: Weitao Wang, Haoran Xu, Jun Meng, Haoqian Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.07700
Pdf URL: https://arxiv.org/pdf/2508.07700
Copy Paste: [[2508.07700]] Make Your MoVe: Make Your 3D Contents by Adapting Multi-View Diffusion Models to External Editing(https://arxiv.org/abs/2508.07700)
Keywords: generation
Abstract: As 3D generation techniques continue to flourish, the demand for generating personalized content is rapidly rising. Users increasingly seek to apply various editing methods to polish generated 3D content, aiming to enhance its color, style, and lighting without compromising the underlying geometry. However, most existing editing tools focus on the 2D domain, and directly feeding their results into 3D generation methods (like multi-view diffusion models) will introduce information loss, degrading the quality of the final 3D assets. In this paper, we propose a tuning-free, plug-and-play scheme that aligns edited assets with their original geometry in a single inference run. Central to our approach is a geometry preservation module that guides the edited multi-view generation with original input normal latents. Besides, an injection switcher is proposed to deliberately control the supervision extent of the original normals, ensuring the alignment between the edited color and normal views. Extensive experiments show that our method consistently improves both the multi-view consistency and mesh quality of edited 3D assets, across multiple combinations of multi-view diffusion models and editing methods.

Title: Enhancing Small-Scale Dataset Expansion with Triplet-Connection-based Sample Re-Weighting

Authors: Ting Xiang, Changjian Chen, Zhuo Tang, Qifeng Zhang, Fei Lyu, Li Yang, Jiapeng Zhang, Kenli Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.07723
Pdf URL: https://arxiv.org/pdf/2508.07723
Copy Paste: [[2508.07723]] Enhancing Small-Scale Dataset Expansion with Triplet-Connection-based Sample Re-Weighting(https://arxiv.org/abs/2508.07723)
Keywords: generation, generative
Abstract: The performance of computer vision models in certain real-world applications, such as medical diagnosis, is often limited by the scarcity of available images. Expanding datasets using pre-trained generative models is an effective solution. However, due to the uncontrollable generation process and the ambiguity of natural language, noisy images may be generated. Re-weighting is an effective way to address this issue by assigning low weights to such noisy images. We first theoretically analyze three types of supervision for the generated images. Based on the theoretical analysis, we develop TriReWeight, a triplet-connection-based sample re-weighting method to enhance generative data augmentation. Theoretically, TriReWeight can be integrated with any generative data augmentation method and never degrades its performance. Moreover, its generalization approaches the optimum at the rate $O(\sqrt{d\ln (n)/n})$. Our experiments validate the correctness of the theoretical analysis and demonstrate that our method outperforms the existing SOTA methods by $7.9\%$ on average over six natural image datasets and by $3.4\%$ on average over three medical datasets. We also experimentally validate that our method can enhance the performance of different generative data augmentation methods.

Title: Grouped Speculative Decoding for Autoregressive Image Generation

Authors: Junhyuk So, Juncheol Shin, Hyunho Kook, Eunhyeok Park
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.07747
Pdf URL: https://arxiv.org/pdf/2508.07747
Copy Paste: [[2508.07747]] Grouped Speculative Decoding for Autoregressive Image Generation(https://arxiv.org/abs/2508.07747)
Keywords: generation, generative
Abstract: Recently, autoregressive (AR) image models have demonstrated remarkable generative capabilities, positioning themselves as a compelling alternative to diffusion models. However, their sequential nature leads to long inference times, limiting their practical scalability. In this work, we introduce Grouped Speculative Decoding (GSD), a novel, training-free acceleration method for AR image models. While recent studies have explored Speculative Decoding (SD) as a means to speed up AR image generation, existing approaches either provide only modest acceleration or require additional training. Our in-depth analysis reveals a fundamental difference between language and image tokens: image tokens exhibit inherent redundancy and diversity, meaning multiple tokens can convey valid semantics. However, traditional SD methods are designed to accept only a single most-likely token, which fails to leverage this difference, leading to excessive false-negative rejections. To address this, we propose a new SD strategy that evaluates clusters of visually valid tokens rather than relying on a single target token. Additionally, we observe that static clustering based on embedding distance is ineffective, which motivates our dynamic GSD approach. Extensive experiments show that GSD accelerates AR image models by an average of 3.7x while preserving image quality, all without requiring any additional training. The source code is available at this https URL.
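Illustration (not from the paper): the false-negative rejections mentioned in the abstract can be seen in the usual speculative decoding acceptance test, which accepts a drafted token with probability min(1, p/q) for target probability p and draft probability q. The sketch below contrasts that token-level test with a group-level variant that compares probability mass summed over a cluster of interchangeable tokens. The cluster assignment is fixed by hand here and the function names are hypothetical, whereas the abstract describes a dynamic clustering scheme whose exact acceptance rule is not given.

```python
def token_accept_prob(token, p_target, q_draft):
    """Standard speculative decoding: accept with probability min(1, p/q)."""
    return min(1.0, p_target[token] / q_draft[token])


def grouped_accept_prob(token, clusters, p_target, q_draft):
    """Group-level variant: compare mass summed over the token's cluster."""
    group = next(c for c in clusters if token in c)
    p_group = sum(p_target[t] for t in group)
    q_group = sum(q_draft[t] for t in group)
    return min(1.0, p_group / q_group)


# Hypothetical 4-token vocabulary in which tokens 0 and 1 are assumed to be
# visually interchangeable, so they share one cluster.
p_target = [0.125, 0.375, 0.25, 0.25]
q_draft = [0.5, 0.0, 0.25, 0.25]
clusters = [{0, 1}, {2}, {3}]
```

Here the draft model puts its mass on token 0 while the target prefers the equivalent token 1. The token-level test accepts token 0 with probability 0.25, whereas the group-level test accepts it with probability 1.0 because both models assign the same total mass to the cluster.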

Title: Learning to Align, Aligning to Learn: A Unified Approach for Self-Optimized Alignment

Authors: Haowen Wang, Yun Yue, Zhiling Ye, Shuowen Zhang, Lei Fan, Jiaxin Liang, Jiadi Jiang, Cheng Wei, Jingyuan Deng, Xudong Han, Ji Li, Chunxiao Guo, Peng Wei, Jian Wang, Jinjie Gu
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2508.07750
Pdf URL: https://arxiv.org/pdf/2508.07750
Copy Paste: [[2508.07750]] Learning to Align, Aligning to Learn: A Unified Approach for Self-Optimized Alignment(https://arxiv.org/abs/2508.07750)
Keywords: generation, quality assessment
Abstract: Alignment methodologies have emerged as a critical pathway for enhancing language model alignment capabilities. While SFT (supervised fine-tuning) accelerates convergence through direct token-level loss intervention, its efficacy is constrained by offline policy trajectories. In contrast, RL (reinforcement learning) facilitates exploratory policy optimization, but suffers from low sample efficiency and stringent dependency on high-quality base models. To address these dual challenges, we propose GRAO (Group Relative Alignment Optimization), a unified framework that synergizes the respective strengths of SFT and RL through three key innovations: 1) A multi-sample generation strategy enabling comparative quality assessment via reward feedback; 2) A novel Group Direct Alignment Loss formulation leveraging intra-group relative advantage weighting; 3) Reference-aware parameter updates guided by pairwise preference dynamics. Our theoretical analysis establishes GRAO's convergence guarantees and sample efficiency advantages over conventional approaches. Comprehensive evaluations across complex human alignment tasks demonstrate GRAO's superior performance, achieving 57.70\%, 17.65\%, 7.95\%, and 5.18\% relative improvements over SFT, DPO, PPO, and GRPO baselines, respectively. This work provides both a theoretically grounded alignment framework and empirical evidence for efficient capability evolution in language models.

Title: Comparison Reveals Commonality: Customized Image Generation through Contrastive Inversion

Authors: Minseo Kim, Minchan Kwon, Dongyeun Lee, Yunho Jeon, Junmo Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.07755
Pdf URL: https://arxiv.org/pdf/2508.07755
Copy Paste: [[2508.07755]] Comparison Reveals Commonality: Customized Image Generation through Contrastive Inversion(https://arxiv.org/abs/2508.07755)
Keywords: generation
Abstract: The recent demand for customized image generation raises a need for techniques that effectively extract the common concept from small sets of images. Existing methods typically rely on additional guidance, such as text prompts or spatial masks, to capture the common target concept. Unfortunately, relying on manually provided guidance can lead to incomplete separation of auxiliary features, which degrades generation quality. In this paper, we propose Contrastive Inversion, a novel approach that identifies the common concept by comparing the input images without relying on additional information. We train the target token along with the image-wise auxiliary text tokens via contrastive learning, which extracts the well-disentangled true semantics of the target. Then we apply disentangled cross-attention fine-tuning to improve concept fidelity without overfitting. Experimental results and analysis demonstrate that our method achieves a balanced, high-level performance in both concept representation and editing, outperforming existing techniques.

Title: Sparse Probabilistic Graph Circuits

Authors: Martin Rektoris, Milan Papež, Václav Šmídl, Tomáš Pevný
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.07763
Pdf URL: https://arxiv.org/pdf/2508.07763
Copy Paste: [[2508.07763]] Sparse Probabilistic Graph Circuits(https://arxiv.org/abs/2508.07763)
Keywords: generative
Abstract: Deep generative models (DGMs) for graphs achieve impressively high expressive power thanks to very efficient and scalable neural networks. However, these networks contain non-linearities that prevent analytical computation of many standard probabilistic inference queries, i.e., these DGMs are considered \emph{intractable}. While recently proposed Probabilistic Graph Circuits (PGCs) address this issue by enabling \emph{tractable} probabilistic inference, they operate on dense graph representations with $\mathcal{O}(n^2)$ complexity for graphs with $n$ nodes and \emph{$m$ edges}. To address this scalability issue, we introduce Sparse PGCs, a new class of tractable generative models that operate directly on sparse graph representation, reducing the complexity to $\mathcal{O}(n + m)$, which is particularly beneficial for $m \ll n^2$. In the context of de novo drug design, we empirically demonstrate that SPGCs retain exact inference capabilities, improve memory efficiency and inference speed, and match the performance of intractable DGMs in key metrics.
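Illustration (not from the paper): the gap between $\mathcal{O}(n^2)$ and $\mathcal{O}(n + m)$ is already visible at the level of graph storage. The sketch below builds a dense adjacency matrix and a sparse neighbor-list representation for the same undirected graph and counts the stored cells. It only illustrates the representation cost, not the probabilistic circuit itself, and the helper names and the 6-node chain are assumptions chosen for the example.

```python
def dense_adjacency(n, edges):
    """n x n matrix: storage grows as n^2 regardless of the edge count."""
    adj = [[0] * n for _ in range(n)]
    for u, v in edges:
        adj[u][v] = 1
        adj[v][u] = 1
    return adj


def sparse_adjacency(n, edges):
    """One neighbor list per node: storage grows as n + 2m."""
    neighbors = [[] for _ in range(n)]
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)
    return neighbors


def dense_cells(adj):
    """Number of stored matrix entries."""
    return sum(len(row) for row in adj)


def sparse_cells(neighbors):
    """Number of lists plus number of stored neighbor entries."""
    return len(neighbors) + sum(len(nbrs) for nbrs in neighbors)


# A chain of 6 nodes with 5 undirected edges, a hypothetical molecule-like graph.
chain_edges = [(i, i + 1) for i in range(5)]
```

For this chain the dense form stores 36 cells while the sparse form stores 16 (6 lists plus 10 neighbor entries). Molecular graphs typically have only a few bonds per atom, so the gap widens quickly as n grows, which is the $m \ll n^2$ regime the abstract highlights.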

Title: UniSVG: A Unified Dataset for Vector Graphic Understanding and Generation with Multimodal Large Language Models

Authors: Jinke Li, Jiarui Yu, Chenxing Wei, Hande Dong, Qiang Lin, Liangjing Yang, Zhicai Wang, Yanbin Hao
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.07766
Pdf URL: https://arxiv.org/pdf/2508.07766
Copy Paste: [[2508.07766]] UniSVG: A Unified Dataset for Vector Graphic Understanding and Generation with Multimodal Large Language Models(https://arxiv.org/abs/2508.07766)
Keywords: generation
Abstract: Unlike bitmap images, scalable vector graphics (SVG) maintain quality when scaled and are frequently employed in computer vision and artistic design in the form of SVG code. In this era of proliferating AI-powered systems, enabling AI to understand and generate SVG has become increasingly urgent. However, AI-driven SVG understanding and generation (U&G) remain significant challenges. SVG code, equivalent to a set of curves and lines controlled by floating-point parameters, demands high precision in SVG U&G. Besides, SVG generation operates under diverse conditional constraints, including textual prompts and visual references, which requires powerful multi-modal processing for condition-to-SVG transformation. Recently, rapidly advancing Multi-modal Large Language Models (MLLMs) have demonstrated the capability to process multi-modal inputs and generate complex vector controlling parameters, suggesting the potential to address SVG U&G tasks within a unified model. To unlock MLLMs' capabilities in the SVG area, we propose an SVG-centric dataset called UniSVG, comprising 525k data items, tailored for MLLM training and evaluation. To the best of our knowledge, it is the first comprehensive dataset designed for unified SVG generation (from textual prompts and images) and SVG understanding (color, category, usage, etc.). As expected, learning on the proposed dataset boosts open-source MLLMs' performance on various SVG U&G tasks, surpassing SOTA closed-source MLLMs like GPT-4V. We release the dataset, benchmark, weights, code, and experiment details on this https URL.
摘要：与位图图像不同，可扩展的矢量图形（SVG）保持缩放时保持质量，经常在SVG代码的表示中使用计算机视觉和艺术设计。在这个增殖的AI驱动系统的时代，使AI能够理解和产生SVG变得越来越紧迫。但是，AI驱动的SVG理解和产生（U＆G）仍然是重大挑战。 SVG代码相当于一组由浮点数参数控制的曲线和线路，要求SVG U＆G中的高精度。此外，SVG生成在各种条件约束下运行，包括文本提示和视觉参考，这需要强大的多模式处理，以进行状态到SVG转换。最近，多模式大语言模型（MLLM）的快速增长已经证明了处理多模式输入并生成复杂的向量控制参数的功能，这表明有潜力解决统一模型中SVG U＆G任务的潜力。为了解锁SVG区域中MLLM的功能，我们提出了一个以SVG为中心的数据集，该数据集为UNISVG，其中包括525K数据项，该数据项是针对MLLM培训和评估而定制的。据我们所知，这是第一个专为统一SVG生成（从文本提示和图像）和SVG理解（颜色，类别，用法等）设计的综合数据集。如预期的那样，在拟议的数据集中学习可以提高开源MLLM在各种SVG U＆G任务上的性能，超过SOTA Close-Source MLLM，例如GPT-4V。我们发布了此HTTPS URL上的数据集，基准，权重，代码和实验详细信息。

Title: Dream4D: Lifting Camera-Controlled I2V towards Spatiotemporally Consistent 4D Generation

Authors: Xiaoyan Liu, Kangrui Li, Jiaxin Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.07769
Pdf URL: https://arxiv.org/pdf/2508.07769
Copy Paste: [[2508.07769]] Dream4D: Lifting Camera-Controlled I2V towards Spatiotemporally Consistent 4D Generation(https://arxiv.org/abs/2508.07769)
Keywords: generation
Abstract: The synthesis of spatiotemporally coherent 4D content presents fundamental challenges in computer vision, requiring simultaneous modeling of high-fidelity spatial representations and physically plausible temporal dynamics. Current approaches often struggle to maintain view consistency while handling complex scene dynamics, particularly in large-scale environments with multiple interacting elements. This work introduces Dream4D, a novel framework that bridges this gap through a synergy of controllable video generation and neural 4D reconstruction. Our approach seamlessly combines a two-stage architecture: it first predicts optimal camera trajectories from a single image using few-shot learning, then generates geometrically consistent multi-view sequences via a specialized pose-conditioned diffusion process, which are finally converted into a persistent 4D representation. This framework is the first to leverage both rich temporal priors from video diffusion models and geometric awareness of the reconstruction models, which significantly facilitates 4D generation and shows higher quality (e.g., mPSNR, mSSIM) over existing methods.
摘要：时空连贯的4D内容的合成提出了计算机视觉中的基本挑战，需要同时建模高保真空间表示和物理上合理的时间动力学。当前的方法通常在处理复杂的场景动态时，尤其是在具有多个相互作用元素的大规模环境中，通常难以保持视图一致性。这项工作介绍了Dream4D，这是一个新颖的框架，通过可控视频生成和神经4D重建的协同作用弥合了这一差距。我们的方法无缝地结合了两个阶段的体系结构：它首先使用几个图像从单个图像中预测最佳摄像机轨迹，然后通过专门的姿势调节的扩散过程生成几何一致的多视图序列，最终将其转换为持久的4D表示。该框架是第一个利用视频扩散模型和重建模型的几何意识来利用丰富的时间先验的框架，这些模型显着促进了4D代并在现有方法上显示出更高的质量（例如MPSNR，MSSIM）。

Title: Power Battery Detection

Authors: Xiaoqi Zhao, Peiqian Cao, Lihe Zhang, Zonglei Feng, Hanqi Liu, Jiaming Zuo, Youwei Pang, Weisi Lin, Georges El Fakhri, Huchuan Lu, Xiaofeng Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.07797
Pdf URL: https://arxiv.org/pdf/2508.07797
Copy Paste: [[2508.07797]] Power Battery Detection(https://arxiv.org/abs/2508.07797)
Keywords: generation
Abstract: Power batteries are essential components in electric vehicles, where internal structural defects can pose serious safety risks. We conduct a comprehensive study on a new task, power battery detection (PBD), which aims to localize the dense endpoints of cathode and anode plates from industrial X-ray images for quality inspection. Manual inspection is inefficient and error-prone, while traditional vision algorithms struggle with densely packed plates, low contrast, scale variation, and imaging artifacts. To address this issue and draw more attention to this meaningful task, we present PBD5K, the first large-scale benchmark for this task, consisting of 5,000 X-ray images from nine battery types with fine-grained annotations and eight types of real-world visual interference. To support scalable and consistent labeling, we develop an intelligent annotation pipeline that combines image filtering, model-assisted pre-labeling, cross-verification, and layered quality evaluation. We formulate PBD as a point-level segmentation problem and propose MDCNeXt, a model designed to extract and integrate multi-dimensional structure clues including point, line, and count information from the plate itself. To improve discrimination between plates and suppress visual interference, MDCNeXt incorporates two state space modules. The first is a prompt-filtered module that learns contrastive relationships guided by task-specific prompts. The second is a density-aware reordering module that refines segmentation in regions with high plate density. In addition, we propose a distance-adaptive mask generation strategy to provide robust supervision under varying spatial distributions of anode and cathode positions. The source code and datasets will be publicly available at this https URL (PBD5K).
摘要：电池是电动汽车中必不可少的组件，内部结构缺陷会带来严重的安全风险。我们对新任务，电池检测（PBD）进行了全面研究，该研究旨在将来自工业X射线图像的阴极和阳极板的密集端点定位，以进行质量检查。手动检查效率低下且容易出错，而传统视觉算法则与密集的板，低对比度，比例变化和成像伪像。为了解决这个问题并将更多的注意力引起这项有意义的任务，我们提出了PBD5K，这是该任务的第一个大规模基准，由来自9个电池类型的5,000张X射线图像组成，带有精细的注释和八种现实世界的视觉干扰。为了支持可扩展和一致的标签，我们开发了一条智能注释管道，该管道结合了图像过滤，模型辅助预标，交叉验证和分层质量评估。我们将PBD作为点级分割问题提出，并提出了MDCNext，该模型旨在提取和整合来自板本身的点，线和计数信息，包括点，线和计数信息。为了改善板之间的歧视并抑制视觉干扰，MDCNEXT结合了两个状态空间模块。第一个是一个及时过滤的模块，该模块学习以特定于任务的提示为指导的对比关系。第二个是一个密度感知的重新排序模块，可在高板密度高的区域中进行分割。此外，我们提出了一种远程自适应掩模的产生策略，以在阳极和阴极位置的不同空间分布下提供强大的监督。源代码和数据集将在\ href {此https url} {pbd5k}上公开可用。

Title: Pose-RFT: Enhancing MLLMs for 3D Pose Generation via Hybrid Action Reinforcement Fine-Tuning

Authors: Bao Li, Xiaomei Zhang, Miao Xu, Zhaoxin Fan, Xiangyu Zhu, Zhen Lei
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.07804
Pdf URL: https://arxiv.org/pdf/2508.07804
Copy Paste: [[2508.07804]] Pose-RFT: Enhancing MLLMs for 3D Pose Generation via Hybrid Action Reinforcement Fine-Tuning(https://arxiv.org/abs/2508.07804)
Keywords: generation
Abstract: Generating 3D human poses from multimodal inputs such as images or text requires models to capture both rich spatial and semantic correspondences. While pose-specific multimodal large language models (MLLMs) have shown promise in this task, they are typically trained with supervised objectives such as SMPL parameter regression or token-level prediction, which struggle to model the inherent ambiguity and achieve task-specific alignment required for accurate 3D pose generation. To address these limitations, we propose Pose-RFT, a reinforcement fine-tuning framework tailored for 3D human pose generation in MLLMs. We formulate the task as a hybrid action reinforcement learning problem that jointly optimizes discrete language prediction and continuous pose generation. To this end, we introduce HyGRPO, a hybrid reinforcement learning algorithm that performs group-wise reward normalization over sampled responses to guide joint optimization of discrete and continuous actions. Pose-RFT further incorporates task-specific reward functions to guide optimization towards spatial alignment in image-to-pose generation and semantic consistency in text-to-pose generation. Extensive experiments on multiple pose generation benchmarks demonstrate that Pose-RFT significantly improves performance over existing pose-specific MLLMs, validating the effectiveness of hybrid action reinforcement fine-tuning for 3D pose generation.
摘要：从图像或文本等多模式输入中生成3D人的姿势需要模型来捕获丰富的空间和语义对应关系。尽管姿势特异性的多模式大语言模型（MLLM）在此任务中表现出了承诺，但它们通常受到监督目标的培训，例如SMPL参数回归或代币级别的预测，这些预测难以建模固有的歧义并实现准确的3D姿势生成所需的任务特定对齐。为了解决这些局限性，我们提出了姿势-RFT，这是为3D人类姿势在MLLM中量身定制的增强微调框架。我们将任务制定为混合动作增强学习问题，共同优化了离散的语言预测和持续姿势产生。为此，我们介绍了Hygrpo，这是一种混合增强学习算法，该算法对采样响应进行群体奖励归一化，以指导离散和连续作用的关节优化。 Pose-RFT进一步结合了特定于任务的奖励功能，以指导优化图像之间的空间对齐，并在文本之间生成文本之间的语义一致性。对多个姿势产生基准的广泛实验表明，姿势RFT显着提高了现有姿势特异性MLLM的性能，从而验证了混合动作加强对3D姿势生成的有效性。
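
The group-wise reward normalization mentioned in the Pose-RFT abstract can be illustrated with a minimal sketch. The abstract does not give HyGRPO's exact formula, so the code assumes the common GRPO-style form, in which each sampled response's reward is standardized against the mean and standard deviation of its own group. The function name `group_normalized_advantages` and the `eps` parameter are hypothetical.

```python
from statistics import mean, pstdev

def group_normalized_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of sampled responses.

    Assumption: GRPO-style normalization, (r - mean) / (std + eps).
    The exact form used by HyGRPO is not stated in the abstract.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]

# One group: rewards for several responses sampled from the same prompt.
advantages = group_normalized_advantages([0.2, 0.5, 0.9, 0.4])
```

Responses that score above the group mean receive positive advantages and the rest negative ones, so the signal depends only on relative quality within the group rather than on the absolute reward scale.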

Title: DiTVR: Zero-Shot Diffusion Transformer for Video Restoration

Authors: Sicheng Gao, Nancy Mehta, Zongwei Wu, Radu Timofte
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.07811
Pdf URL: https://arxiv.org/pdf/2508.07811
Copy Paste: [[2508.07811]] DiTVR: Zero-Shot Diffusion Transformer for Video Restoration(https://arxiv.org/abs/2508.07811)
Keywords: restoration, generative
Abstract: Video restoration aims to reconstruct high quality video sequences from low quality inputs, addressing tasks such as super resolution, denoising, and deblurring. Traditional regression based methods often produce unrealistic details and require extensive paired datasets, while recent generative diffusion models face challenges in ensuring temporal consistency. We introduce DiTVR, a zero shot video restoration framework that couples a diffusion transformer with trajectory aware attention and a wavelet guided, flow consistent sampler. Unlike prior 3D convolutional or frame wise diffusion approaches, our attention mechanism aligns tokens along optical flow trajectories, with particular emphasis on vital layers that exhibit the highest sensitivity to temporal dynamics. A spatiotemporal neighbour cache dynamically selects relevant tokens based on motion correspondences across frames. The flow guided sampler injects data consistency only into low-frequency bands, preserving high frequency priors while accelerating convergence. DiTVR establishes a new zero shot state of the art on video restoration benchmarks, demonstrating superior temporal consistency and detail preservation while remaining robust to flow noise and occlusions.
摘要：视频恢复旨在通过低质量输入来重建高质量的视频序列，以解决超级分辨率，denoising和DeBlurring等任务。传统的基于回归的方法通常会产生不切实际的细节，并需要广泛的配对数据集，而最近的生成扩散模型在确保时间一致性方面面临着挑战。我们介绍了DITVR，这是一个零拍摄的视频修复框架，该框架将扩散变压器与轨迹意识到的注意力和小波引导，流量一致的采样器结合在一起。与先前的3D卷积或框架明智的扩散方法不同，我们的注意机制沿光流轨迹对齐令牌，特别强调对时间动力学表现出最高敏感性的重要层。时空邻居缓存根据跨帧的运动对应关系动态选择相关令牌。流引导采样器仅将数据一致性注入低频频段，在加速收敛的同时保留高频先验。 DITVR在视频恢复基准测试中建立了新的零拍摄状态，展示了较高的时间一致性和细节保存，同时保持稳定的流噪声和遮挡。
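
The idea of injecting data consistency only into low-frequency bands, as the DiTVR abstract describes, can be sketched on a 1-D signal with a one-level Haar transform. This is an illustrative assumption rather than the paper's sampler: the function names, the `weight` parameter, and the choice of the Haar wavelet are all hypothetical, and the real method operates on video latents inside a diffusion sampling loop.

```python
import math

def haar_forward(x):
    """One-level orthonormal Haar transform of an even-length list."""
    s = math.sqrt(2.0)
    low = [(x[i] + x[i + 1]) / s for i in range(0, len(x), 2)]
    high = [(x[i] - x[i + 1]) / s for i in range(0, len(x), 2)]
    return low, high

def haar_inverse(low, high):
    """Invert haar_forward."""
    s = math.sqrt(2.0)
    out = []
    for l, h in zip(low, high):
        out.extend([(l + h) / s, (l - h) / s])
    return out

def low_band_consistency(estimate, observation, weight=1.0):
    """Pull only the low-frequency band of `estimate` toward `observation`.

    The high-frequency band of the estimate (the generative prior's detail)
    is left untouched. `weight` in [0, 1] is a hypothetical blending factor.
    """
    est_low, est_high = haar_forward(estimate)
    obs_low, _ = haar_forward(observation)
    mixed = [(1.0 - weight) * e + weight * o for e, o in zip(est_low, obs_low)]
    return haar_inverse(mixed, est_high)

# The result takes coarse structure from the observation, detail from the estimate.
refined = low_band_consistency([1.0, 3.0, 2.0, 6.0], [4.0, 4.0, 2.0, 2.0])
```

With `weight=1.0` the coarse structure comes entirely from the observation, while the detail coefficients of the current estimate are preserved.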

Title: Segmenting and Understanding: Region-aware Semantic Attention for Fine-grained Image Quality Assessment with Large Language Models

Authors: Chenyue Song, Chen Hui, Haiqi Zhu, Feng Jiang, Yachun Mi, Wei Zhang, Shaohui Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.07818
Pdf URL: https://arxiv.org/pdf/2508.07818
Copy Paste: [[2508.07818]] Segmenting and Understanding: Region-aware Semantic Attention for Fine-grained Image Quality Assessment with Large Language Models(https://arxiv.org/abs/2508.07818)
Keywords: quality assessment
Abstract: No-reference image quality assessment (NR-IQA) aims to simulate the process of perceiving image quality aligned with subjective human perception. However, existing NR-IQA methods either focus on global representations that lead to limited insights into the semantically salient regions or employ a uniform weighting for region features that weakens the sensitivity to local quality variations. In this paper, we propose a fine-grained image quality assessment model, named RSFIQA, which integrates region-level distortion information to perceive multi-dimensional quality discrepancies. To enhance regional quality awareness, we first utilize the Segment Anything Model (SAM) to dynamically partition the input image into non-overlapping semantic regions. For each region, we teach a powerful Multi-modal Large Language Model (MLLM) to extract descriptive content and perceive multi-dimensional distortions, enabling a comprehensive understanding of both local semantics and quality degradations. To effectively leverage this information, we introduce the Region-Aware Semantic Attention (RSA) mechanism, which generates a global attention map by aggregating fine-grained representations from local regions. In addition, RSFIQA is backbone-agnostic and can be seamlessly integrated into various deep neural network architectures. Extensive experiments demonstrate the robustness and effectiveness of the proposed method, which achieves competitive quality prediction performance across multiple benchmark datasets.
摘要：无参考图像质量评估（NR-IQA）旨在模拟感知与主观人类感知一致的图像质量的过程。但是，现有的NR-IQA方法要么集中于全球表示，从而导致对语义显着区域的见解有限，或者对区域特征采用统一的加权，从而削弱了对本地质量变化的敏感性。在本文中，我们提出了一个名为RSFIQA的细粒图像质量评估模型，该模型将区域级失真信息集成在一起以感知多维质量差异。为了提高区域质量的意识，我们首先利用该细分模型（SAM）将输入图像动态分配到非重叠的语义区域中。对于每个区域，我们教一个强大的多模式大语言模型（MLLM）来提取描述性内容并感知多维扭曲，从而使人们对本地语义和质量降解有了全面的了解。为了有效利用这些信息，我们引入了区域感知语义注意（RSA）机制，该机制通过汇总当地地区的细粒度表示来生成全球注意力图。此外，RSFIQA是骨干 - 敏捷的，并且可以无缝集成到各种深层神经网络架构中。广泛的实验证明了所提出的方法的鲁棒性和有效性，该方法在多个基准数据集中实现了竞争性质量预测性能。

Title: Being-M0.5: A Real-Time Controllable Vision-Language-Motion Model

Authors: Bin Cao, Sipeng Zheng, Ye Wang, Lujie Xia, Qianshan Wei, Qin Jin, Jing Liu, Zongqing Lu
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2508.07863
Pdf URL: https://arxiv.org/pdf/2508.07863
Copy Paste: [[2508.07863]] Being-M0.5: A Real-Time Controllable Vision-Language-Motion Model(https://arxiv.org/abs/2508.07863)
Keywords: generation
Abstract: Human motion generation has emerged as a critical technology with transformative potential for real-world applications. However, existing vision-language-motion models (VLMMs) face significant limitations that hinder their practical deployment. We identify controllability as a main bottleneck, manifesting in five key aspects: inadequate response to diverse human commands, limited pose initialization capabilities, poor performance on long-term sequences, insufficient handling of unseen scenarios, and lack of fine-grained control over individual body parts. To overcome these limitations, we present Being-M0.5, the first real-time, controllable VLMM that achieves state-of-the-art performance across multiple motion generation tasks. Our approach is built upon HuMo100M, the largest and most comprehensive human motion dataset to date, comprising over 5 million self-collected motion sequences, 100 million multi-task instructional instances, and detailed part-level annotations that address a critical gap in existing datasets. We introduce a novel part-aware residual quantization technique for motion tokenization that enables precise, granular control over individual body parts during generation. Extensive experimental validation demonstrates Being-M0.5's superior performance across diverse motion benchmarks, while comprehensive efficiency analysis confirms its real-time capabilities. Our contributions include design insights and detailed computational analysis to guide future development of practical motion generators. We believe that HuMo100M and Being-M0.5 represent significant advances that will accelerate the adoption of motion generation technologies in real-world applications. The project page is available at this https URL.
摘要：人类运动产生已成为一项关键技术，具有对现实世界应用的变革潜力。但是，现有的视觉 - 语言动作模型（VLMMS）面临着严重的限制，阻碍了其实际部署。我们将可控性确定为主要瓶颈，在五个关键方面表现出来：对多样化的人类命令的反应不足，姿势初始化能力有限，长期序列的性能差，对看不见的场景的处理不足以及对单个身体部位的良好控制。为了克服这些局限性，我们提出了M0.5，这是第一个实时，可控制的VLMM，可在多个运动生成任务中实现最先进的性能。我们的方法建立在Humo100m的基础上，Humo100m是迄今为止最大，最全面的人类运动数据集，其中包括超过500万个自我收集的运动序列，1亿个多任务指导实例，以及详细的零件级注释，这些零件级别的注释解决了现有数据集中的关键差距。我们引入了一种新型的零件感知剩余量化技术，以进行运动令牌化，该技术可以在发电过程中精确，颗粒状控制各个身体部位。广泛的实验验证表明，M0.5在各种运动基准中的出色性能，而全面的效率分析证实了其实时功能。我们的贡献包括设计见解和详细的计算分析，以指导实用运动生成器的未来开发。我们认为，Humo100m和M-0.5代表着重大进步，可以加快在现实世界应用中采用运动生产技术。该项目页面可在此HTTPS URL上找到。

Title: TAP: Parameter-efficient Task-Aware Prompting for Adverse Weather Removal

Authors: Hanting Wang, Shengpeng Ji, Shulei Wang, Hai Huang, Xiao Jin, Qifei Zhang, Tao Jin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.07878
Pdf URL: https://arxiv.org/pdf/2508.07878
Copy Paste: [[2508.07878]] TAP: Parameter-efficient Task-Aware Prompting for Adverse Weather Removal(https://arxiv.org/abs/2508.07878)
Keywords: restoration
Abstract: Image restoration under adverse weather conditions has been extensively explored, leading to numerous high-performance methods. In particular, recent advances in All-in-One approaches have shown impressive results by training on multi-task image restoration datasets. However, most of these methods rely on dedicated network modules or parameters for each specific degradation type, resulting in a significant parameter overhead. Moreover, the relatedness across different restoration tasks is often overlooked. In light of these issues, we propose a parameter-efficient All-in-One image restoration framework that leverages task-aware enhanced prompts to tackle various adverse weather degradations. Specifically, we adopt a two-stage training paradigm consisting of a pretraining phase and a prompt-tuning phase to mitigate parameter conflicts across tasks. We first employ supervised learning to acquire general restoration knowledge, and then adapt the model to handle specific degradations via trainable soft prompts. Crucially, we enhance these task-specific prompts in a task-aware manner. We apply low-rank decomposition to these prompts to capture both task-general and task-specific characteristics, and impose contrastive constraints to better align them with the actual inter-task relatedness. These enhanced prompts not only improve the parameter efficiency of the restoration model but also enable more accurate task modeling, as evidenced by t-SNE analysis. Experimental results on different restoration tasks demonstrate that the proposed method achieves superior performance with only 2.75M parameters.
摘要：在不利天气条件下的图像恢复已被广泛探索，从而导致了许多高性能方法。特别是，通过对多任务图像恢复数据集进行培训，多合一方法的最新进展已显示出令人印象深刻的结果。但是，这些方法中的大多数都依赖于每种特定降解类型的专用网络模块或参数，从而导致了重要的参数开销。此外，经常忽略不同恢复任务的相关性。鉴于这些问题，我们提出了一个参数有效的多合一图像恢复框架，该框架利用任务感知的增强提示来应对此HTTP URL的各种不良天气，我们采用了一个两阶段的训练范式，该训练范围包括预处理阶段和迅速的阶段阶段来减轻跨任务的参数冲突。我们首先采用监督学习来获取一般恢复知识，然后通过可训练的软提示来调整模型以处理特定的降解。至关重要的是，我们以任务感知的方式增强了这些特定于任务的提示。我们对这些提示进行低级分解，以捕获任务将军和特定于任务的特征，并施加对比度的约束，以更好地与实际任务之间的相关性保持一致。这些增强的提示不仅提高了恢复模型的参数效率，而且还可以实现更准确的任务建模，如T-SNE分析所证明的那样。对不同恢复任务的实验结果表明，所提出的方法仅用275万参数实现了卓越的性能。

Title: Not Yet AlphaFold for the Mind: Evaluating Centaur as a Synthetic Participant

Authors: Sabrina Namazova, Alessandra Brondetta, Younes Strittmatter, Matthew Nassar, Sebastian Musslick
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.07887
Pdf URL: https://arxiv.org/pdf/2508.07887
Copy Paste: [[2508.07887]] Not Yet AlphaFold for the Mind: Evaluating Centaur as a Synthetic Participant(https://arxiv.org/abs/2508.07887)
Keywords: generative
Abstract: Simulators have revolutionized scientific practice across the natural sciences. By generating data that reliably approximate real-world phenomena, they enable scientists to accelerate hypothesis testing and optimize experimental designs. This is perhaps best illustrated by AlphaFold, a Nobel-prize winning simulator in chemistry that predicts protein structures from amino acid sequences, enabling rapid prototyping of molecular interactions, drug targets, and protein functions. In the behavioral sciences, a reliable participant simulator - a system capable of producing human-like behavior across cognitive tasks - would represent a similarly transformative advance. Recently, Binz et al. introduced Centaur, a large language model (LLM) fine-tuned on human data from 160 experiments, proposing its use not only as a model of cognition but also as a participant simulator for "in silico prototyping of experimental studies", e.g., to advance automated cognitive science. Here, we review the core criteria for a participant simulator and assess how well Centaur meets them. Although Centaur demonstrates strong predictive accuracy, its generative behavior - a critical criterion for a participant simulator - systematically diverges from human data. This suggests that, while Centaur is a significant step toward predicting human behavior, it does not yet meet the standards of a reliable participant simulator or an accurate model of cognition.
摘要：模拟器已经彻底改变了自然科学的科学实践。通过生成可靠地近似现实现象的数据，它们使科学家能够加速假设测试并优化实验设计。 Alphafold是Alphafold，这是一种化学中的诺贝尔奖赢得模拟器Alphafold，可以预测氨基酸序列的蛋白质结构，从而可以快速对分子相互作用，药物靶标和蛋白质功能进行快速原型。在行为科学中，可靠的参与者模拟器 - 能够在认知任务中产生类似人类行为的系统 - 将代表类似的变革性进步。最近，Binz等。引入了对160个实验的人类数据进行微调的大型语言模型（LLM）的引入，不仅提出了其用作认知模型，而且还提出了作为“实验研究的硅原型制作”的参与者模拟器，例如，以提高自动认知科学。在这里，我们回顾了参与者模拟器的核心标准，并评估了半人马的与之满足。尽管半人马座表现出强大的预测精度，但其生成行为 - 对参与者模拟器的关键标准 - 系统地与人类数据分歧。这表明，尽管半人马座是预测人类行为的重要一步，但它尚未符合可靠的参与者模拟器或准确的认知模型的标准。

Title: Stand-In: A Lightweight and Plug-and-Play Identity Control for Video Generation

Authors: Bowen Xue, Qixin Yan, Wenjing Wang, Hao Liu, Chen Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.07901
Pdf URL: https://arxiv.org/pdf/2508.07901
Copy Paste: [[2508.07901]] Stand-In: A Lightweight and Plug-and-Play Identity Control for Video Generation(https://arxiv.org/abs/2508.07901)
Keywords: generation, generative
Abstract: Generating high-fidelity human videos that match user-specified identities is important yet challenging in the field of generative AI. Existing methods often rely on an excessive number of training parameters and lack compatibility with other AIGC tools. In this paper, we propose Stand-In, a lightweight and plug-and-play framework for identity preservation in video generation. Specifically, we introduce a conditional image branch into the pre-trained video generation model. Identity control is achieved through restricted self-attentions with conditional position mapping, and can be learned quickly with only 2000 pairs. Despite incorporating and training just ~1% additional parameters, our framework achieves excellent results in video quality and identity preservation, outperforming other full-parameter training methods. Moreover, our framework can be seamlessly integrated for other tasks, such as subject-driven video generation, pose-referenced video generation, stylization, and face swapping.
摘要：在生成AI领域，生成与用户指定身份相匹配的高保真人类视频既重要却又具有挑战性。现有方法通常依赖于过多的培训参数，并且与其他AIGC工具缺乏兼容性。在本文中，我们提出了Stand-In，这是一个轻巧和插件的框架，用于视频生成中的身份保存。具体而言，我们将条件图像分支引入预先训练的视频生成模型。身份控制是通过有条件位置映射的受限制自我来实现的，只能通过2000对快速学习。尽管仅纳入和培训$ \ sim $ 1 \％额外的参数，但我们的框架在视频质量和身份保存方面取得了出色的成果，表现优于其他全参数培训方法。此外，我们的框架可以无缝集成到其他任务，例如主题驱动的视频生成，姿势引用的视频生成，风格化和面部交换。

Title: Diffusing the Blind Spot: Uterine MRI Synthesis with Diffusion Models

Authors: Johanna P. Müller, Anika Knupfer, Pedro Blöss, Edoardo Berardi Vittur, Bernhard Kainz, Jana Hutter
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.07903
Pdf URL: https://arxiv.org/pdf/2508.07903
Copy Paste: [[2508.07903]] Diffusing the Blind Spot: Uterine MRI Synthesis with Diffusion Models(https://arxiv.org/abs/2508.07903)
Keywords: generative
Abstract: Despite significant progress in generative modelling, existing diffusion models often struggle to produce anatomically precise female pelvic images, limiting their application in gynaecological imaging, where data scarcity and patient privacy concerns are critical. To overcome these barriers, we introduce a novel diffusion-based framework for uterine MRI synthesis, integrating both unconditional and conditioned Denoising Diffusion Probabilistic Models (DDPMs) and Latent Diffusion Models (LDMs) in 2D and 3D. Our approach generates anatomically coherent, high fidelity synthetic images that closely mimic real scans and provide valuable resources for training robust diagnostic models. We evaluate generative quality using advanced perceptual and distributional metrics, benchmarking against standard reconstruction methods, and demonstrate substantial gains in diagnostic accuracy on a key classification task. A blinded expert evaluation further validates the clinical realism of our synthetic images. We release our models with privacy safeguards and a comprehensive synthetic uterine MRI dataset to support reproducible research and advance equitable AI in gynaecology.
摘要：尽管在生成建模方面取得了重大进展，但现有的扩散模型通常很难产生解剖学精确的雌性骨盆图像，从而限制了其在妇科成像中的应用，在这种情况下，数据稀缺和患者隐私问题至关重要。为了克服这些障碍，我们引入了一种基于子宫MRI合成的新型基于扩散的框架，在2D和3D中整合了无条件和条件的DENOCORINEDENO型扩散概率模型（DDPM）和潜在扩散模型（LDMS）。我们的方法产生了解剖学上连贯的高保真综合图像，这些图像紧密模仿了实际扫描，并为训练健壮的诊断模型提供了宝贵的资源。我们使用高级感知和分布指标评估生成质量，对标准重建方法进行基准测试，并在关键分类任务上证明诊断准确性的显着提高。盲目的专家评估进一步验证了我们的合成图像的临床现实主义。我们使用隐私保护措施和全面的合成子宫MRI数据集释放我们的模型，以支持可重复的研究并提高妇科中的公平AI。

Title: Generative Video Matting

Authors: Yongtao Ge, Kangyang Xie, Guangkai Xu, Mingyu Liu, Li Ke, Longtao Huang, Hui Xue, Hao Chen, Chunhua Shen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.07905
Pdf URL: https://arxiv.org/pdf/2508.07905
Copy Paste: [[2508.07905]] Generative Video Matting(https://arxiv.org/abs/2508.07905)
Keywords: generation, generative
Abstract: Video matting has traditionally been limited by the lack of high-quality ground-truth data. Most existing video matting datasets provide only human-annotated imperfect alpha and foreground annotations, which must be composited to background images or videos during the training stage. Thus, the generalization capability of previous methods in real-world scenarios is typically poor. In this work, we propose to solve the problem from two perspectives. First, we emphasize the importance of large-scale pre-training by pursuing diverse synthetic and pseudo-labeled segmentation datasets. We also develop a scalable synthetic data generation pipeline that can render diverse human bodies and fine-grained hairs, yielding around 200 video clips with a 3-second duration for fine-tuning. Second, we introduce a novel video matting approach that can effectively leverage the rich priors from pre-trained video diffusion models. This architecture offers two key advantages. First, strong priors play a critical role in bridging the domain gap between synthetic and real-world scenes. Second, unlike most existing methods that process video matting frame-by-frame and use an independent decoder to aggregate temporal information, our model is inherently designed for video, ensuring strong temporal consistency. We provide a comprehensive quantitative evaluation across three benchmark datasets, demonstrating our approach's superior performance, and present comprehensive qualitative results in diverse real-world scenes, illustrating the strong generalization capability of our method. The code is available at this https URL.
摘要：传统上，由于缺乏高质量的地面数据，视频垫受到了限制。大多数现有的视频垫数据集仅提供人类通知的不完美α和前景注释，在培训阶段必须将其合成到背景图像或视频中。因此，在实际情况下，以前方法的概括能力通常很差。在这项工作中，我们建议从两个角度解决问题。首先，我们通过追求多样化的合成和伪标记的细分数据集来强调大规模预训练的重要性。我们还开发了可扩展的合成数据生成管道，该管道可以使人体和细粒度的头发产生多种多样，从而产生约200个视频剪辑，并以3秒的持续时间进行微调。其次，我们介绍了一种新颖的视频效果方法，该方法可以有效利用预先训练的视频扩散模型中的丰富先验。该体系结构提供了两个关键优势。首先，强壮的先验在弥合合成场景和现实世界之间的域间隙方面起着关键作用。其次，与大多数现有的方法处理逐帧框架并使用独立解码器来汇总时间信息的方法不同，我们的模型本质上是为视频设计的，可确保强大的时间一致性。我们在三个基准数据集中提供了全面的定量评估，证明了我们的方法的出色性能，并在各种现实世界的场景中呈现了全面的定性结果，这说明了我们方法的强大概括能力。该代码可在此HTTPS URL上找到。

Title: RSVLM-QA: A Benchmark Dataset for Remote Sensing Vision Language Model-based Question Answering

Authors: Xing Zi, Jinghao Xiao, Yunxiao Shi, Xian Tao, Jun Li, Ali Braytee, Mukesh Prasad
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.07918
Pdf URL: https://arxiv.org/pdf/2508.07918
Copy Paste: [[2508.07918]] RSVLM-QA: A Benchmark Dataset for Remote Sensing Vision Language Model-based Question Answering(https://arxiv.org/abs/2508.07918)
Keywords: generation
Abstract: Visual Question Answering (VQA) in remote sensing (RS) is pivotal for interpreting Earth observation data. However, existing RS VQA datasets are constrained by limitations in annotation richness, question diversity, and the assessment of specific reasoning capabilities. This paper introduces the RSVLM-QA dataset, a new large-scale, content-rich VQA dataset for the RS domain. RSVLM-QA is constructed by integrating data from several prominent RS segmentation and detection datasets: WHU, LoveDA, INRIA, and iSAID. We employ an innovative dual-track annotation generation pipeline. Firstly, we leverage Large Language Models (LLMs), specifically GPT-4.1, with meticulously designed prompts to automatically generate a suite of detailed annotations including image captions, spatial relations, and semantic tags, alongside complex caption-based VQA pairs. Secondly, to address the challenging task of object counting in RS imagery, we have developed a specialized automated process that extracts object counts directly from the original segmentation data; GPT-4.1 then formulates natural language answers from these counts, which are paired with preset question templates to create counting QA pairs. RSVLM-QA comprises 13,820 images and 162,373 VQA pairs, featuring extensive annotations and diverse question types. We provide a detailed statistical analysis of the dataset and a comparison with existing RS VQA benchmarks, highlighting the superior depth and breadth of RSVLM-QA's annotations. Furthermore, we conduct benchmark experiments on six mainstream Vision Language Models (VLMs), demonstrating that RSVLM-QA effectively evaluates and challenges the understanding and reasoning abilities of current VLMs in the RS domain. We believe RSVLM-QA will serve as a pivotal resource for the RS VQA and VLM research communities, poised to catalyze advancements in the field.
摘要：遥感（RS）中的视觉问题回答（VQA）对于解释地球观测数据是关键的。但是，现有的RS VQA数据集受到注释丰富，问题多样性和评估特定推理能力的限制的限制。本文介绍了RSVLM-QA数据集，这是一种新的大型，富含内容的VQA数据集，用于RS域。 RSVLM-QA是通过整合来自几个突出的RS细分和检测数据集的数据来构建的：WHU，LOVEDA，INRIA和ISAID。我们采用创新的双轨注释生成管道。首先，我们利用精心设计的提示来利用大型语言模型（LLMS），特别是GPT-4.1，以自动生成一套详细注释，包括图像标题，空间关系和语义标签，以及基于复杂的字幕的VQA Pairs。其次，为了解决RS图像中对象计数的具有挑战性的任务，我们开发了一个专门的自动化过程，该过程直接从原始分段数据中提取对象计数；然后，GPT-4.1从这些计数中制定了自然语言答案，这些答案与预设的问题模板配对以创建计数QA对。 RSVLM-QA包含13,820张图像和162,373个VQA对，具有广泛的注释和各种问题类型。我们提供了有关数据集的详细统计分析，并与现有的RS VQA基准进行了比较，强调了RSVLM-QA注释的优越深度和广度。此外，我们对六个主流视觉语言模型（VLM）进行了基准实验，这表明RSVLM-QA有效地评估并挑战了RS域中当前VLM的理解和推理能力。我们认为，RSVLM-QA将成为RS VQA和VLM研究社区的关键资源，并准备催化该领域的进步。
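
The counting track described in the RSVLM-QA abstract (object counts extracted from segmentation data, then paired with preset question templates) can be sketched as follows. The abstract does not specify how counts are extracted, so this sketch assumes 4-connected component counting on a 2-D label mask. The function names and the question template are hypothetical, and the answer is a plain string here, whereas the paper has GPT-4.1 phrase the answer.

```python
from collections import deque

def count_objects(mask, class_id):
    """Count 4-connected regions labelled `class_id` in a 2-D mask (list of rows)."""
    if not mask:
        return 0
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] != class_id or seen[r][c]:
                continue
            count += 1  # found an unvisited region; flood-fill it
            seen[r][c] = True
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx] and mask[ny][nx] == class_id):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
    return count

def counting_qa(mask, class_id, class_name):
    """Fill a preset (hypothetical) question template with the extracted count."""
    n = count_objects(mask, class_id)
    question = f"How many {class_name} instances are visible in the image?"
    return {"question": question, "answer": str(n)}

mask = [
    [1, 1, 0, 2],
    [0, 0, 0, 2],
    [1, 0, 0, 0],
]
qa = counting_qa(mask, 1, "building")
```

Deriving the count from annotation data rather than from a language model keeps the ground-truth answer deterministic, which is the motivation the abstract gives for a dedicated counting process.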

Title: Safeguarding Generative AI Applications in Preclinical Imaging through Hybrid Anomaly Detection

Authors: Jakub Binda, Valentina Paneta, Vasileios Eleftheriadis, Hongkyou Chung, Panagiotis Papadimitroulas, Neo Christopher Chung
Subjects: cs.CV, cs.HC, cs.LG
Abstract URL: https://arxiv.org/abs/2508.07923
Pdf URL: https://arxiv.org/pdf/2508.07923
Copy Paste: [[2508.07923]] Safeguarding Generative AI Applications in Preclinical Imaging through Hybrid Anomaly Detection(https://arxiv.org/abs/2508.07923)
Keywords: generative
Abstract: Generative AI holds great potential to automate and enhance data synthesis in nuclear medicine. However, the high-stakes nature of biomedical imaging necessitates robust mechanisms to detect and manage unexpected or erroneous model behavior. We present the development and implementation of a hybrid anomaly detection framework to safeguard GenAI models in BIOEMTECH's eyes(TM) systems. Two applications are demonstrated: Pose2Xray, which generates synthetic X-rays from photographic mouse images, and DosimetrEYE, which estimates 3D radiation dose maps from 2D SPECT/CT scans. In both cases, our outlier detection (OD) enhances reliability, reduces manual oversight, and supports real-time quality control. This approach strengthens the industrial viability of GenAI in preclinical settings by increasing robustness, scalability, and regulatory compliance.
摘要：生成的AI具有自动化和增强核医学数据合成的巨大潜力。但是，生物医学成像的高风险性质需要鲁棒的机制来检测和管理意外或错误的模型行为。我们介绍了混合异常检测框架的开发和实施，以保护Bioemtech眼（TM）系统中的Genai模型。证明了两种应用：Pose2xray，它从照相小鼠图像和Dosimetreye中生成合成X射线，该射线估计了2D SPECT/CT扫描的3D辐射剂量图。在这两种情况下，我们的异常检测（OD）可提高可靠性，降低手动监督并支持实时质量控制。这种方法通过提高鲁棒性，可伸缩性和调节性依从性来增强临床前环境中Genai的工业生存能力。

Title: Score Augmentation for Diffusion Models

Authors: Liang Hou, Yuan Gao, Boyuan Jiang, Xin Tao, Qi Yan, Renjie Liao, Pengfei Wan, Di Zhang, Kun Gai
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.07926
Pdf URL: https://arxiv.org/pdf/2508.07926
Copy Paste: [[2508.07926]] Score Augmentation for Diffusion Models(https://arxiv.org/abs/2508.07926)
Keywords: generative
Abstract: Diffusion models have achieved remarkable success in generative modeling. However, this study confirms the existence of overfitting in diffusion model training, particularly in data-limited regimes. To address this challenge, we propose Score Augmentation (ScoreAug), a novel data augmentation framework specifically designed for diffusion models. Unlike conventional augmentation approaches that operate on clean data, ScoreAug applies transformations to noisy data, aligning with the inherent denoising mechanism of diffusion. Crucially, ScoreAug further requires the denoiser to predict the augmentation of the original target. This design establishes an equivariant learning objective, enabling the denoiser to learn scores across varied denoising spaces, thereby realizing what we term score augmentation. We also theoretically analyze the relationship between scores in different spaces under general transformations. In experiments, we extensively validate ScoreAug on multiple benchmarks including CIFAR-10, FFHQ, AFHQv2, and ImageNet, with results demonstrating significant performance improvements over baselines. Notably, ScoreAug effectively mitigates overfitting across diverse scenarios, such as varying data scales and model capacities, while exhibiting stable convergence properties. Another advantage of ScoreAug over standard data augmentation lies in its ability to circumvent data leakage issues under certain conditions. Furthermore, we show that ScoreAug can be synergistically combined with traditional data augmentation techniques to achieve additional performance gains.
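The core idea in the abstract (transform the noisy sample, then regress toward the transformed clean target) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the 1-D signal, the `flip` transform, the additive-noise form `x0 + sigma * eps`, and the helper name `scoreaug_pair` are hypothetical and are not taken from the paper.

```python
import random

def flip(v):
    """Example augmentation T: reverse a 1-D signal (a linear, invertible map)."""
    return v[::-1]

def scoreaug_pair(x0, sigma, rng, transform=flip):
    """Build one (input, target) pair: T is applied to the noisy sample,
    and the regression target is T applied to the clean sample."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_noisy = [a + sigma * e for a, e in zip(x0, eps)]
    return transform(x_noisy), transform(x0), eps

rng = random.Random(0)
x0 = [0.0, 1.0, 2.0, 3.0]
net_input, target, eps = scoreaug_pair(x0, sigma=0.5, rng=rng)

# For a linear T, the residual between input and target is the transformed noise.
residual = [a - b for a, b in zip(net_input, target)]
print(target)
print(residual)
```

With a linear transform such as a flip, the noise in the augmented input is simply the flipped noise, so the denoising problem keeps the same structure in the augmented space.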

Title: Omni-Effects: Unified and Spatially-Controllable Visual Effects Generation

Authors: Fangyuan Mao, Aiming Hao, Jintao Chen, Dongxia Liu, Xiaokun Feng, Jiashu Zhu, Meiqi Wu, Chubin Chen, Jiahong Wu, Xiangxiang Chu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.07981
Pdf URL: https://arxiv.org/pdf/2508.07981
Copy Paste: [[2508.07981]] Omni-Effects: Unified and Spatially-Controllable Visual Effects Generation(https://arxiv.org/abs/2508.07981)
Keywords: generation
Abstract: Visual effects (VFX) are essential visual enhancements fundamental to modern cinematic production. Although video generation models offer cost-efficient solutions for VFX production, current methods are constrained by per-effect LoRA training, which limits generation to single effects. This fundamental limitation impedes applications that require spatially controllable composite effects, i.e., the concurrent generation of multiple effects at designated locations. However, integrating diverse effects into a unified framework faces major challenges: interference from effect variations and spatial uncontrollability during multi-VFX joint training. To tackle these challenges, we propose Omni-Effects, the first unified framework capable of generating prompt-guided effects and spatially controllable composite effects. The core of our framework comprises two key innovations: (1) LoRA-based Mixture of Experts (LoRA-MoE), which employs a group of expert LoRAs, integrating diverse effects within a unified model while effectively mitigating cross-task interference; and (2) Spatial-Aware Prompt (SAP), which incorporates spatial mask information into the text token, enabling precise spatial control. Furthermore, we introduce an Independent-Information Flow (IIF) module integrated within the SAP, isolating the control signals corresponding to individual effects to prevent any unwanted blending. To facilitate this research, we construct a comprehensive VFX dataset Omni-VFX via a novel data collection pipeline combining image editing and First-Last Frame-to-Video (FLF2V) synthesis, and introduce a dedicated VFX evaluation framework for validating model performance. Extensive experiments demonstrate that Omni-Effects achieves precise spatial control and diverse effect generation, enabling users to specify both the category and location of desired effects.
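A generic mixture of LoRA experts can be sketched as a frozen base layer plus a gated sum of low-rank updates. This is only a minimal illustration of that general pattern, not the paper's LoRA-MoE: the softmax router, the matrix shapes, and the names `lora_moe_forward`, `w_gate`, and `experts` are assumptions.

```python
import math

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def softmax(z):
    m = max(z)
    e = [math.exp(t - m) for t in z]
    s = sum(e)
    return [t / s for t in e]

def lora_moe_forward(x, w_base, experts, w_gate):
    """Compute y = W x + sum_i g_i * B_i (A_i x), with g = softmax(W_gate x).

    Each expert is a pair (A_i, B_i) of low-rank factors (hypothetical layout).
    """
    gates = softmax(matvec(w_gate, x))
    y = matvec(w_base, x)
    for g, (a, b) in zip(gates, experts):
        delta = matvec(b, matvec(a, x))  # low-rank update B_i (A_i x)
        y = [yi + g * di for yi, di in zip(y, delta)]
    return y, gates

x = [1.0, 2.0]
w_base = [[1.0, 0.0], [0.0, 1.0]]          # frozen base weight (identity here)
experts = [
    ([[1.0, 0.0]], [[1.0], [0.0]]),        # rank-1 expert acting on dimension 0
    ([[0.0, 1.0]], [[0.0], [1.0]]),        # rank-1 expert acting on dimension 1
]
w_gate = [[0.0, 0.0], [0.0, 0.0]]          # zero router gives uniform gates
y, gates = lora_moe_forward(x, w_base, experts, w_gate)
print(y, gates)
```

Keeping one small low-rank pair per effect and mixing them through a router is one common way to let several specializations share a single backbone.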

Title: Mitigating Biases in Surgical Operating Rooms with Geometry

Authors: Tony Danjun Wang, Tobias Czempiel, Nassir Navab, Lennart Bastian
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.08028
Pdf URL: https://arxiv.org/pdf/2508.08028
Copy Paste: [[2508.08028]] Mitigating Biases in Surgical Operating Rooms with Geometry(https://arxiv.org/abs/2508.08028)
Keywords: generation
Abstract: Deep neural networks are prone to learning spurious correlations, exploiting dataset-specific artifacts rather than meaningful features for prediction. In surgical operating rooms (OR), these manifest through the standardization of smocks and gowns that obscure robust identifying landmarks, introducing model bias for tasks related to modeling OR personnel. Through gradient-based saliency analysis on two public OR datasets, we reveal that CNN models succumb to such shortcuts, fixating on incidental visual cues such as footwear beneath surgical gowns, distinctive eyewear, or other role-specific identifiers. Avoiding such biases is essential for the next generation of intelligent assistance systems in the OR, which should accurately recognize personalized workflow traits, such as surgical skill level or coordination with other staff members. We address this problem by encoding personnel as 3D point cloud sequences, disentangling identity-relevant shape and motion patterns from appearance-based confounders. Our experiments demonstrate that while RGB and geometric methods achieve comparable performance on datasets with apparent simulation artifacts, RGB models suffer a 12% accuracy drop in realistic clinical settings with decreased visual diversity due to standardizations. This performance gap confirms that geometric representations capture more meaningful biometric features, providing an avenue to developing robust methods of modeling humans in the OR.

Title: TRIDE: A Text-assisted Radar-Image weather-aware fusion network for Depth Estimation

Authors: Huawei Sun, Zixu Wang, Hao Feng, Julius Ott, Lorenzo Servadei, Robert Wille
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.08038
Pdf URL: https://arxiv.org/pdf/2508.08038
Copy Paste: [[2508.08038]] TRIDE: A Text-assisted Radar-Image weather-aware fusion network for Depth Estimation(https://arxiv.org/abs/2508.08038)
Keywords: generation
Abstract: Depth estimation, essential for autonomous driving, seeks to interpret the 3D environment surrounding vehicles. The development of radar sensors, known for their cost-efficiency and robustness, has spurred interest in radar-camera fusion-based solutions. However, existing algorithms fuse features from these modalities without accounting for weather conditions, despite radars being known to be more robust than cameras under adverse weather. Additionally, while Vision-Language models have seen rapid advancement, utilizing language descriptions alongside other modalities for depth estimation remains an open challenge. This paper first introduces a text-generation strategy along with feature extraction and fusion techniques that can assist monocular depth estimation pipelines, leading to improved accuracy across different algorithms on the KITTI dataset. Building on this, we propose TRIDE, a radar-camera fusion algorithm that enhances text feature extraction by incorporating radar point information. To address the impact of weather on sensor performance, we introduce a weather-aware fusion block that adaptively adjusts radar weighting based on current weather conditions. Our method, benchmarked on the nuScenes dataset, demonstrates performance gains over the state-of-the-art, achieving a 12.87% improvement in MAE and a 9.08% improvement in RMSE. Code: this https URL
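The abstract reports gains in MAE and RMSE. As a reminder of what those depth-error metrics measure, here is a small sketch using their standard definitions. The sample depth values are made up, and reading "improvement" as a relative error reduction against a baseline is an assumption, not something the abstract states.

```python
import math

def mae(pred, gt):
    """Mean absolute error between predicted and ground-truth depths."""
    return sum(abs(p - g) for p, g in zip(pred, gt)) / len(gt)

def rmse(pred, gt):
    """Root mean squared error between predicted and ground-truth depths."""
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)) / len(gt))

def relative_improvement(baseline_error, new_error):
    """Percentage reduction of an error metric relative to a baseline (assumed reading)."""
    return 100.0 * (baseline_error - new_error) / baseline_error

gt = [10.0, 20.0, 30.0, 40.0]      # hypothetical ground-truth depths in metres
pred = [12.0, 18.0, 33.0, 40.0]    # hypothetical predictions
print(mae(pred, gt), rmse(pred, gt))
print(relative_improvement(2.0, mae(pred, gt)))
```

RMSE penalizes large individual errors more heavily than MAE, which is why the two metrics can improve by different amounts for the same model.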

Title: S^2VG: 3D Stereoscopic and Spatial Video Generation via Denoising Frame Matrix

Authors: Peng Dai, Feitong Tan, Qiangeng Xu, Yihua Huang, David Futschik, Ruofei Du, Sean Fanello, Yinda Zhang, Xiaojuan Qi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.08048
Pdf URL: https://arxiv.org/pdf/2508.08048
Copy Paste: [[2508.08048]] S^2VG: 3D Stereoscopic and Spatial Video Generation via Denoising Frame Matrix(https://arxiv.org/abs/2508.08048)
Keywords: generation, generative
Abstract: While video generation models excel at producing high-quality monocular videos, generating 3D stereoscopic and spatial videos for immersive applications remains an underexplored challenge. We present a pose-free and training-free method that leverages an off-the-shelf monocular video generation model to produce immersive 3D videos. Our approach first warps the generated monocular video into pre-defined camera viewpoints using estimated depth information, then applies a novel frame matrix inpainting framework. This framework utilizes the original video generation model to synthesize missing content across different viewpoints and timestamps, ensuring spatial and temporal consistency without requiring additional model fine-tuning. Moreover, we develop a dual-update scheme that further improves the quality of video inpainting by alleviating the negative effects propagated from disoccluded areas in the latent space. The resulting multi-view videos are then adapted into stereoscopic pairs or optimized into 4D Gaussians for spatial video synthesis. We validate the efficacy of our proposed method by conducting experiments on videos from various generative models, such as Sora, Lumiere, WALT, and Zeroscope. The experiments demonstrate that our method has a significant improvement over previous methods. Project page at: this https URL
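The depth-based warping step mentioned in the abstract can be illustrated on a single scanline. This sketch assumes a rectified horizontal camera shift and the textbook stereo relation disparity = focal length x baseline / depth. The function name `warp_scanline` and all values are hypothetical, and the paper's actual warping procedure may differ.

```python
def warp_scanline(colors, depths, focal_px, baseline):
    """Forward-warp one image row to a horizontally shifted camera.

    Pixels move by disparity = focal_px * baseline / depth. A z-buffer keeps the
    nearest surface, and positions that receive no pixel stay None: those are the
    disoccluded holes an inpainting stage would have to fill.
    """
    width = len(colors)
    out = [None] * width
    zbuf = [float("inf")] * width
    for x, (c, z) in enumerate(zip(colors, depths)):
        disparity = focal_px * baseline / z
        xt = int(round(x - disparity))
        if 0 <= xt < width and z < zbuf[xt]:
            out[xt] = c
            zbuf[xt] = z
    return out

colors = ["a", "b", "c", "d"]
depths = [10.0, 10.0, 1.0, 10.0]   # pixel "c" is much closer to the camera
warped = warp_scanline(colors, depths, focal_px=2.0, baseline=1.0)
print(warped)  # ['c', 'b', None, 'd']
```

The near pixel shifts further than the background and covers it, leaving a hole where it used to be. Holes of this kind are the regions that the inpainting stage described in the abstract has to synthesize.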

Title: Matrix-3D: Omnidirectional Explorable 3D World Generation

Authors: Zhongqi Yang, Wenhang Ge, Yuqi Li, Jiaqi Chen, Haoyuan Li, Mengyin An, Fei Kang, Hua Xue, Baixin Xu, Yuyang Yin, Eric Li, Yang Liu, Yikai Wang, Hao-Xiang Guo, Yahui Zhou
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2508.08086
Pdf URL: https://arxiv.org/pdf/2508.08086
Copy Paste: [[2508.08086]] Matrix-3D: Omnidirectional Explorable 3D World Generation(https://arxiv.org/abs/2508.08086)
Keywords: generation
Abstract: Explorable 3D world generation from a single image or text prompt forms a cornerstone of spatial intelligence. Recent works utilize video models to achieve wide-scope and generalizable 3D world generation. However, existing approaches often suffer from a limited scope in the generated scenes. In this work, we propose Matrix-3D, a framework that utilizes panoramic representation for wide-coverage omnidirectional explorable 3D world generation and combines conditional video generation with panoramic 3D reconstruction. We first train a trajectory-guided panoramic video diffusion model that employs scene mesh renders as conditions, to enable high-quality and geometrically consistent scene video generation. To lift the panoramic scene video to a 3D world, we propose two separate methods: (1) a feed-forward large panorama reconstruction model for rapid 3D scene reconstruction and (2) an optimization-based pipeline for accurate and detailed 3D scene reconstruction. To facilitate effective training, we also introduce the Matrix-Pano dataset, the first large-scale synthetic collection comprising 116K high-quality static panoramic video sequences with depth and trajectory annotations. Extensive experiments demonstrate that our proposed framework achieves state-of-the-art performance in panoramic video generation and 3D world generation. See more in this https URL.

Title: TBAC-UniImage: Unified Understanding and Generation by Ladder-Side Diffusion Tuning

Authors: Junzhe Xu, Yuyang Yin, Xi Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.08098
Pdf URL: https://arxiv.org/pdf/2508.08098
Copy Paste: [[2508.08098]] TBAC-UniImage: Unified Understanding and Generation by Ladder-Side Diffusion Tuning(https://arxiv.org/abs/2508.08098)
Keywords: generation, generative
Abstract: This paper introduces TBAC-UniImage, a novel unified model for multimodal understanding and generation. We achieve this by deeply integrating a pre-trained Diffusion Model, acting as a generative ladder, with a Multimodal Large Language Model (MLLM). Previous diffusion-based unified models face two primary limitations. One approach uses only the MLLM's final hidden state as the generative condition. This creates a shallow connection, as the generator is isolated from the rich, hierarchical representations within the MLLM's intermediate layers. The other approach, pretraining a unified generative architecture from scratch, is computationally expensive and prohibitive for many researchers. To overcome these issues, our work explores a new paradigm. Instead of relying on a single output, we use representations from multiple, diverse layers of the MLLM as generative conditions for the diffusion model. This method treats the pre-trained generator as a ladder, receiving guidance from various depths of the MLLM's understanding process. Consequently, TBAC-UniImage achieves a much deeper and more fine-grained unification of understanding and generation.

Title: FantasyStyle: Controllable Stylized Distillation for 3D Gaussian Splatting

Authors: Yitong Yang, Yinglin Wang, Changshuo Wang, Huajie Wang, Shuting He
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.08136
Pdf URL: https://arxiv.org/pdf/2508.08136
Copy Paste: [[2508.08136]] FantasyStyle: Controllable Stylized Distillation for 3D Gaussian Splatting(https://arxiv.org/abs/2508.08136)
Keywords: generative
Abstract: The success of 3DGS in generative and editing applications has sparked growing interest in 3DGS-based style transfer. However, current methods still face two major challenges: (1) multi-view inconsistency often leads to style conflicts, resulting in appearance smoothing and distortion; and (2) heavy reliance on VGG features, which struggle to disentangle style and content from style images, often causing content leakage and excessive stylization. To tackle these issues, we introduce FantasyStyle, a 3DGS-based style transfer framework, and the first to rely entirely on diffusion model distillation. It comprises two key components: (1) Multi-View Frequency Consistency. We enhance cross-view consistency by applying a 3D filter to multi-view noisy latent, selectively reducing low-frequency components to mitigate stylized prior conflicts. (2) Controllable Stylized Distillation. To suppress content leakage from style images, we introduce negative guidance to exclude undesired content. In addition, we identify the limitations of Score Distillation Sampling and Delta Denoising Score in 3D style transfer and remove the reconstruction term accordingly. Building on these insights, we propose a controllable stylized distillation that leverages negative guidance to more effectively optimize the 3D Gaussians. Extensive experiments demonstrate that our method consistently outperforms state-of-the-art approaches, achieving higher stylization quality and visual realism across various scenes and styles.

Title: MuaLLM: A Multimodal Large Language Model Agent for Circuit Design Assistance with Hybrid Contextual Retrieval-Augmented Generation

Authors: Pravallika Abbineni, Saoud Aldowaish, Colin Liechty, Soroosh Noorzad, Ali Ghazizadeh, Morteza Fayazi
Subjects: cs.LG, cs.AI, eess.SY
Abstract URL: https://arxiv.org/abs/2508.08137
Pdf URL: https://arxiv.org/pdf/2508.08137
Copy Paste: [[2508.08137]] MuaLLM: A Multimodal Large Language Model Agent for Circuit Design Assistance with Hybrid Contextual Retrieval-Augmented Generation(https://arxiv.org/abs/2508.08137)
Keywords: generation
Abstract: Conducting a comprehensive literature review is crucial for advancing circuit design methodologies. However, the rapid influx of state-of-the-art research, inconsistent data representation, and the complexity of optimizing circuit design objectives make this task significantly challenging. In this paper, we propose MuaLLM, an open-source multimodal Large Language Model (LLM) agent for circuit design assistance that integrates a hybrid Retrieval-Augmented Generation (RAG) framework with an adaptive vector database of circuit design research papers. Unlike conventional LLMs, the MuaLLM agent employs a Reason + Act (ReAct) workflow for iterative reasoning, goal-setting, and multi-step information retrieval. It functions as a question-answering design assistant, capable of interpreting complex queries and providing reasoned responses grounded in circuit literature. Its multimodal capabilities enable processing of both textual and visual data, facilitating more efficient and comprehensive analysis. The system dynamically adapts using intelligent search tools, automated document retrieval from the internet, and real-time database updates. Unlike conventional approaches constrained by model context limits, MuaLLM decouples retrieval from inference, enabling scalable reasoning over arbitrarily large corpora. At the maximum context length supported by standard LLMs, MuaLLM remains up to 10x less costly and 1.6x faster while maintaining the same accuracy. This allows rapid, no-human-in-the-loop database generation, overcoming the bottleneck of simulation-based dataset creation for circuits. To evaluate MuaLLM, we introduce two custom datasets: RAG-250, targeting retrieval and citation performance, and Reasoning-100 (Reas-100), focused on multistep reasoning in circuit design. MuaLLM achieves 90.1% recall on RAG-250, and 86.8% accuracy on Reas-100.

Title: Pindrop it! Audio and Visual Deepfake Countermeasures for Robust Detection and Fine Grained-Localization

Authors: Nicholas Klein, Hemlata Tak, James Fullwood, Krishna Regmi, Leonidas Spinoulas, Ganesh Sivaraman, Tianxiang Chen, Elie Khoury
Subjects: cs.CV, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2508.08141
Pdf URL: https://arxiv.org/pdf/2508.08141
Copy Paste: [[2508.08141]] Pindrop it! Audio and Visual Deepfake Countermeasures for Robust Detection and Fine Grained-Localization(https://arxiv.org/abs/2508.08141)
Keywords: generation
Abstract: The field of visual and audio generation is burgeoning with new state-of-the-art methods. This rapid proliferation of new techniques underscores the need for robust solutions for detecting synthetic content in videos. In particular, when fine-grained alterations via localized manipulations are performed in visual, audio, or both domains, these subtle modifications add challenges to the detection algorithms. This paper presents solutions for the problems of deepfake video classification and localization. The methods were submitted to the ACM 1M Deepfakes Detection Challenge, achieving the best performance in the temporal localization task and a top four ranking in the classification task for the TestA split of the evaluation dataset.

Title: CD-TVD: Contrastive Diffusion for 3D Super-Resolution with Scarce High-Resolution Time-Varying Data

Authors: Chongke Bi, Xin Gao, Jiangkang Deng, Guan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.08173
Pdf URL: https://arxiv.org/pdf/2508.08173
Copy Paste: [[2508.08173]] CD-TVD: Contrastive Diffusion for 3D Super-Resolution with Scarce High-Resolution Time-Varying Data(https://arxiv.org/abs/2508.08173)
Keywords: super-resolution
Abstract: Large-scale scientific simulations require significant resources to generate high-resolution time-varying data (TVD). While super-resolution is an efficient post-processing strategy to reduce costs, existing methods rely on a large amount of HR training data, limiting their applicability to diverse simulation scenarios. To address this constraint, we propose CD-TVD, a novel framework that combines contrastive learning and an improved diffusion-based super-resolution model to achieve accurate 3D super-resolution from limited time-step high-resolution data. During pre-training on historical simulation data, the contrastive encoder and diffusion super-resolution modules learn degradation patterns and detailed features of high-resolution and low-resolution samples. In the training phase, the improved diffusion model with a local attention mechanism is fine-tuned using only one newly generated high-resolution timestep, leveraging the degradation knowledge learned by the encoder. This design minimizes the reliance on large-scale high-resolution datasets while maintaining the capability to recover fine-grained details. Experimental results on fluid and atmospheric simulation datasets confirm that CD-TVD delivers accurate and resource-efficient 3D super-resolution, marking a significant advancement in data augmentation for large-scale scientific simulations. The code is available at this https URL.

Title: PP-Motion: Physical-Perceptual Fidelity Evaluation for Human Motion Generation

Authors: Sihan Zhao, Zixuan Wang, Tianyu Luan, Jia Jia, Wentao Zhu, Jiebo Luo, Junsong Yuan, Nan Xi
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2508.08179
Pdf URL: https://arxiv.org/pdf/2508.08179
Copy Paste: [[2508.08179]] PP-Motion: Physical-Perceptual Fidelity Evaluation for Human Motion Generation(https://arxiv.org/abs/2508.08179)
Keywords: generation
Abstract: Human motion generation has found widespread applications in AR/VR, film, sports, and medical rehabilitation, offering a cost-effective alternative to traditional motion capture systems. However, evaluating the fidelity of such generated motions is a crucial, multifaceted task. Although previous approaches have attempted motion fidelity evaluation using human perception or physical constraints, there remains an inherent gap between human-perceived fidelity and physical feasibility. Moreover, the subjective and coarse binary labeling of human perception further undermines the development of a robust data-driven metric. We address these issues by introducing a physical labeling method. This method evaluates motion fidelity by calculating the minimum modifications needed for a motion to align with physical laws. With this approach, we are able to produce fine-grained, continuous physical alignment annotations that serve as objective ground truth. With these annotations, we propose PP-Motion, a novel data-driven metric to evaluate both physical and perceptual fidelity of human motion. To effectively capture underlying physical priors, we employ Pearson's correlation loss for the training of our metric. Additionally, by incorporating a human-based perceptual fidelity loss, our metric can capture fidelity that simultaneously considers both human perception and physical alignment. Experimental results demonstrate that our metric, PP-Motion, not only aligns with physical laws but also aligns better with human perception of motion fidelity than previous work.
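A Pearson correlation loss, as named in the abstract, is commonly written as one minus the Pearson correlation coefficient between predicted scores and reference annotations. The sketch below uses that common form. The exact formulation in the paper is not given in the abstract, so the `1 - r` form and the function names are assumptions.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences.

    Assumes neither sequence is constant (otherwise the denominator is zero).
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def pearson_loss(pred, target):
    """Loss in [0, 2]: 0 for perfect positive correlation, 2 for perfect negative."""
    return 1.0 - pearson(pred, target)

scores = [1.0, 2.0, 3.0]
labels = [2.0, 4.0, 6.0]
print(pearson_loss(scores, labels))
print(pearson_loss(scores[::-1], labels))
```

Because the correlation is invariant to the scale and offset of the predictions, this kind of loss rewards getting the ordering and relative spacing right rather than matching absolute annotation values.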

Title: Reinforcement Learning in Vision: A Survey

Authors: Weijia Wu, Chen Gao, Joya Chen, Kevin Qinghong Lin, Qingwei Meng, Yiming Zhang, Yuke Qiu, Hong Zhou, Mike Zheng Shou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.08189
Pdf URL: https://arxiv.org/pdf/2508.08189
Copy Paste: [[2508.08189]] Reinforcement Learning in Vision: A Survey(https://arxiv.org/abs/2508.08189)
Keywords: generation
Abstract: Recent advances at the intersection of reinforcement learning (RL) and visual intelligence have enabled agents that not only perceive complex visual scenes but also reason, generate, and act within them. This survey offers a critical and up-to-date synthesis of the field. We first formalize visual RL problems and trace the evolution of policy-optimization strategies from RLHF to verifiable reward paradigms, and from Proximal Policy Optimization to Group Relative Policy Optimization. We then organize more than 200 representative works into four thematic pillars: multi-modal large language models, visual generation, unified model frameworks, and vision-language-action models. For each pillar we examine algorithmic design, reward engineering, benchmark progress, and we distill trends such as curriculum-driven training, preference-aligned diffusion, and unified reward modeling. Finally, we review evaluation protocols spanning set-level fidelity, sample-level preference, and state-level stability, and we identify open challenges that include sample efficiency, generalization, and safe deployment. Our goal is to provide researchers and practitioners with a coherent map of the rapidly expanding landscape of visual RL and to highlight promising directions for future inquiry. Resources are available at: this https URL.

Title: SAGOnline: Segment Any Gaussians Online

Authors: Wentao Sun, Quanyun Wu, Hanqing Xu, Kyle Gao, Zhengsen Xu, Yiping Chen, Dedong Zhang, Lingfei Ma, John S. Zelek, Jonathan Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.08219
Pdf URL: https://arxiv.org/pdf/2508.08219
Copy Paste: [[2508.08219]] SAGOnline: Segment Any Gaussians Online(https://arxiv.org/abs/2508.08219)
Keywords: generation
Abstract: 3D Gaussian Splatting (3DGS) has emerged as a powerful paradigm for explicit 3D scene representation, yet achieving efficient and consistent 3D segmentation remains challenging. Current methods suffer from prohibitive computational costs, limited 3D spatial reasoning, and an inability to track multiple objects simultaneously. We present Segment Any Gaussians Online (SAGOnline), a lightweight and zero-shot framework for real-time 3D segmentation in Gaussian scenes that addresses these limitations through two key innovations: (1) a decoupled strategy that integrates video foundation models (e.g., SAM2) for view-consistent 2D mask propagation across synthesized views; and (2) a GPU-accelerated 3D mask generation and Gaussian-level instance labeling algorithm that assigns unique identifiers to 3D primitives, enabling lossless multi-object tracking and segmentation across views. SAGOnline achieves state-of-the-art performance on NVOS (92.7% mIoU) and Spin-NeRF (95.2% mIoU) benchmarks, outperforming Feature3DGS, OmniSeg3D-gs, and SA3D by 15 to 1500 times in inference speed (27 ms/frame). Qualitative results demonstrate robust multi-object segmentation and tracking in complex scenes. Our contributions include: (i) a lightweight and zero-shot framework for 3D segmentation in Gaussian scenes, (ii) explicit labeling of Gaussian primitives enabling simultaneous segmentation and tracking, and (iii) the effective adaptation of 2D video foundation models to the 3D domain. This work allows real-time rendering and 3D scene understanding, paving the way for practical AR/VR and robotic applications.

Title: Learning User Preferences for Image Generation Model

Authors: Wenyi Mo, Ying Ba, Tianyu Zhang, Yalong Bai, Biye Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.08220
Pdf URL: https://arxiv.org/pdf/2508.08220
Copy Paste: [[2508.08220]] Learning User Preferences for Image Generation Model(https://arxiv.org/abs/2508.08220)
Keywords: generation
Abstract: User preference prediction requires a comprehensive and accurate understanding of individual tastes. This includes both surface-level attributes, such as color and style, and deeper content-related aspects, such as themes and composition. However, existing methods typically rely on general human preferences or assume static user profiles, often neglecting individual variability and the dynamic, multifaceted nature of personal taste. To address these limitations, we propose an approach built upon Multimodal Large Language Models, introducing contrastive preference loss and preference tokens to learn personalized user preferences from historical interactions. The contrastive preference loss is designed to effectively distinguish between user "likes" and "dislikes", while the learnable preference tokens capture shared interest representations among existing users, enabling the model to activate group-specific preferences and enhance consistency across similar users. Extensive experiments demonstrate our model outperforms other methods in preference prediction accuracy, effectively identifying users with similar aesthetic inclinations and providing more precise guidance for generating images that align with individual tastes. The project page is this https URL.
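
Illustration (not from the paper): the abstract only says that the contrastive preference loss separates a user's "likes" from "dislikes". One common way to express that idea is a pairwise logistic loss over scalar preference scores. The sketch below assumes this form; the function name, the `margin` parameter, and the use of scalar scores are all hypothetical and are not the authors' formulation.

```python
import math


def pairwise_preference_loss(liked_scores, disliked_scores, margin=0.0):
    """Mean of -log(sigmoid(s_like - s_dislike - margin)) over all pairs.

    Hypothetical stand-in for a "contrastive preference loss": it is small
    when every liked item outscores every disliked item for the same user.
    """
    if not liked_scores or not disliked_scores:
        raise ValueError("need at least one liked and one disliked score")
    total = 0.0
    for s_pos in liked_scores:
        for s_neg in disliked_scores:
            gap = s_pos - s_neg - margin
            # Numerically stable softplus(-gap), equal to -log(sigmoid(gap)).
            total += max(-gap, 0.0) + math.log1p(math.exp(-abs(gap)))
    return total / (len(liked_scores) * len(disliked_scores))
```

With this form, scores that rank liked images above disliked ones give a loss near zero, and a tie gives log 2 per pair.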

Title: OMGSR: You Only Need One Mid-timestep Guidance for Real-World Image Super-Resolution

Authors: Zhiqiang Wu, Zhaomang Sun, Tong Zhou, Bingtao Fu, Ji Cong, Yitong Dong, Huaqi Zhang, Xuan Tang, Mingsong Chen, Xian Wei
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.08227
Pdf URL: https://arxiv.org/pdf/2508.08227
Copy Paste: [[2508.08227]] OMGSR: You Only Need One Mid-timestep Guidance for Real-World Image Super-Resolution(https://arxiv.org/abs/2508.08227)
Keywords: super-resolution, generation, generative
Abstract: Denoising Diffusion Probabilistic Models (DDPM) and Flow Matching (FM) generative models show promising potential for one-step Real-World Image Super-Resolution (Real-ISR). Recent one-step Real-ISR models typically inject a Low-Quality (LQ) image latent distribution at the initial timestep. However, a fundamental gap exists between the LQ image latent distribution and the Gaussian noisy latent distribution, limiting the effective utilization of generative priors. We observe that the noisy latent distribution at DDPM/FM mid-timesteps aligns more closely with the LQ image latent distribution. Based on this insight, we present One Mid-timestep Guidance Real-ISR (OMGSR), a universal framework applicable to DDPM/FM-based generative models. OMGSR injects the LQ image latent distribution at a pre-computed mid-timestep, incorporating the proposed Latent Distribution Refinement loss to alleviate the latent distribution gap. We also design the Overlap-Chunked LPIPS/GAN loss to eliminate checkerboard artifacts in image generation. Within this framework, we instantiate OMGSR for DDPM/FM-based generative models with two variants: OMGSR-S (SD-Turbo) and OMGSR-F (FLUX.1-dev). Experimental results demonstrate that OMGSR-S/F achieves balanced/excellent performance across quantitative and qualitative metrics at 512-resolution. Notably, OMGSR-F establishes overwhelming dominance in all reference metrics. We further train a 1k-resolution OMGSR-F to match the default resolution of FLUX.1-dev, which yields excellent results, especially in the details of the image generation. We also generate 2k-resolution images with the 1k-resolution OMGSR-F using our two-stage Tiled VAE & Diffusion.
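
Illustration (not from the paper): the abstract states that the LQ latent is injected at a pre-computed mid-timestep but does not say how that timestep is chosen. A minimal sketch of one plausible criterion follows: pick the DDPM timestep whose signal-to-noise ratio is closest to an SNR estimated for the LQ latent. The linear beta schedule values, the SNR-matching rule, and both function names are assumptions made for this example.

```python
import math


def ddpm_log_snr(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Log signal-to-noise ratio per timestep for a linear beta schedule.

    Uses the standard DDPM forward process, where alpha_bar_t is the
    cumulative product of (1 - beta) and SNR_t = alpha_bar_t / (1 - alpha_bar_t).
    """
    if num_steps < 2:
        raise ValueError("num_steps must be at least 2")
    log_snr, alpha_bar = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        alpha_bar *= 1.0 - beta
        log_snr.append(math.log(alpha_bar / (1.0 - alpha_bar)))
    return log_snr


def pick_mid_timestep(target_snr, log_snr):
    """Index of the timestep whose log-SNR is nearest to log(target_snr).

    `target_snr` is a hypothetical estimate of how much clean signal the
    LQ latent retains; the abstract does not specify such a quantity.
    """
    target = math.log(target_snr)
    return min(range(len(log_snr)), key=lambda t: abs(log_snr[t] - target))
```

Under this assumed rule, a heavily degraded input (low estimated SNR) maps to a later timestep and a mildly degraded one to an earlier timestep.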

Title: Cut2Next: Generating Next Shot via In-Context Tuning

Authors: Jingwen He, Hongbo Liu, Jiajun Li, Ziqi Huang, Yu Qiao, Wanli Ouyang, Ziwei Liu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.08244
Pdf URL: https://arxiv.org/pdf/2508.08244
Copy Paste: [[2508.08244]] Cut2Next: Generating Next Shot via In-Context Tuning(https://arxiv.org/abs/2508.08244)
Keywords: generation
Abstract: Effective multi-shot generation demands purposeful, film-like transitions and strict cinematic continuity. Current methods, however, often prioritize basic visual consistency, neglecting crucial editing patterns (e.g., shot/reverse shot, cutaways) that drive narrative flow for compelling storytelling. This yields outputs that may be visually coherent but lack narrative sophistication and true cinematic integrity. To bridge this, we introduce Next Shot Generation (NSG): synthesizing a subsequent, high-quality shot that critically conforms to professional editing patterns while upholding rigorous cinematic continuity. Our framework, Cut2Next, leverages a Diffusion Transformer (DiT). It employs in-context tuning guided by a novel Hierarchical Multi-Prompting strategy. This strategy uses Relational Prompts to define overall context and inter-shot editing styles. Individual Prompts then specify per-shot content and cinematographic attributes. Together, these guide Cut2Next to generate cinematically appropriate next shots. Architectural innovations, Context-Aware Condition Injection (CACI) and Hierarchical Attention Mask (HAM), further integrate these diverse signals without introducing new parameters. We construct RawCuts (large-scale) and CuratedCuts (refined) datasets, both with hierarchical prompts, and introduce CutBench for evaluation. Experiments show Cut2Next excels in visual consistency and text fidelity. Crucially, user studies reveal a strong preference for Cut2Next, particularly for its adherence to intended editing patterns and overall cinematic continuity, validating its ability to generate high-quality, narratively expressive, and cinematically coherent subsequent shots.

Title: StableAvatar: Infinite-Length Audio-Driven Avatar Video Generation

Authors: Shuyuan Tu, Yueming Pan, Yinming Huang, Xintong Han, Zhen Xing, Qi Dai, Chong Luo, Zuxuan Wu, Yu-Gang Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.08248
Pdf URL: https://arxiv.org/pdf/2508.08248
Copy Paste: [[2508.08248]] StableAvatar: Infinite-Length Audio-Driven Avatar Video Generation(https://arxiv.org/abs/2508.08248)
Keywords: generation
Abstract: Current diffusion models for audio-driven avatar video generation struggle to synthesize long videos with natural audio synchronization and identity consistency. This paper presents StableAvatar, the first end-to-end video diffusion transformer that synthesizes infinite-length high-quality videos without post-processing. Conditioned on a reference image and audio, StableAvatar integrates tailored training and inference modules to enable infinite-length video generation. We observe that the main reason preventing existing models from generating long videos lies in their audio modeling. They typically rely on third-party off-the-shelf extractors to obtain audio embeddings, which are then directly injected into the diffusion model via cross-attention. Since current diffusion backbones lack any audio-related priors, this approach causes severe latent distribution error accumulation across video clips, leading the latent distribution of subsequent segments to drift away from the optimal distribution gradually. To address this, StableAvatar introduces a novel Time-step-aware Audio Adapter that prevents error accumulation via time-step-aware modulation. During inference, we propose a novel Audio Native Guidance Mechanism to further enhance the audio synchronization by leveraging the diffusion's own evolving joint audio-latent prediction as a dynamic guidance signal. To enhance the smoothness of the infinite-length videos, we introduce a Dynamic Weighted Sliding-window Strategy that fuses latents over time. Experiments on benchmarks show the effectiveness of StableAvatar both qualitatively and quantitatively.
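
Illustration (not from the paper): the abstract describes a Dynamic Weighted Sliding-window Strategy that fuses latents over time but gives no formula. The sketch below shows the general idea of blending overlapping windows with position-dependent weights, assuming a simple triangular weighting and one scalar latent per frame. The function name, the weighting scheme, and the scalar representation are all hypothetical simplifications.

```python
def fuse_windows(windows, starts, total_len):
    """Blend overlapping windows of per-frame latents into one sequence.

    Each frame's weight is its distance to the nearest window edge, so a
    window contributes less near its borders and overlaps cross-fade
    smoothly. This weighting is an assumption, not the paper's scheme.
    """
    acc = [0.0] * total_len
    weight_sum = [0.0] * total_len
    for window, start in zip(windows, starts):
        n = len(window)
        for i, value in enumerate(window):
            weight = min(i + 1, n - i)  # triangular ramp, peak at the center
            acc[start + i] += weight * value
            weight_sum[start + i] += weight
    if any(w == 0.0 for w in weight_sum):
        raise ValueError("every frame must be covered by at least one window")
    return [a / w for a, w in zip(acc, weight_sum)]
```

For two four-frame windows that overlap by two frames, the overlapping frames move gradually from the first window's values to the second's instead of switching abruptly at the boundary.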