2025-08-11

Title: UnGuide: Learning to Forget with LoRA-Guided Diffusion Models

Authors: Agnieszka Polowczyk, Alicja Polowczyk, Dawid Malarz, Artur Kasymov, Marcin Mazur, Jacek Tabor, Przemysław Spurek
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.05755
Pdf URL: https://arxiv.org/pdf/2508.05755
Copy Paste: [[2508.05755]] UnGuide: Learning to Forget with LoRA-Guided Diffusion Models(https://arxiv.org/abs/2508.05755)
Keywords: generation
Abstract: Recent advances in large-scale text-to-image diffusion models have heightened concerns about their potential misuse, especially in generating harmful or misleading content. This underscores the urgent need for effective machine unlearning, i.e., removing specific knowledge or concepts from pretrained models without compromising overall performance. One possible approach is Low-Rank Adaptation (LoRA), which offers an efficient means to fine-tune models for targeted unlearning. However, LoRA often inadvertently alters unrelated content, leading to diminished image fidelity and realism. To address this limitation, we introduce UnGuide, a novel approach that incorporates UnGuidance, a dynamic inference mechanism that leverages Classifier-Free Guidance (CFG) to exert precise control over the unlearning process. UnGuide modulates the guidance scale based on the stability of the first few denoising steps, enabling selective unlearning by the LoRA adapter. For prompts containing the erased concept, the LoRA module predominates and is counterbalanced by the base model; for unrelated prompts, the base model governs generation, preserving content fidelity. Empirical results demonstrate that UnGuide achieves controlled concept removal and retains the expressive power of diffusion models, outperforming existing LoRA-based methods in both object erasure and explicit content removal tasks.
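The Classifier-Free Guidance rule that the abstract builds on can be sketched as below. This is only the standard CFG extrapolation, not the UnGuide update itself: the abstract does not give the exact formula, so the function name `cfg_combine`, the role assignment (base-model prediction as the reference, LoRA prediction as the guided branch), and the toy vectors are illustrative assumptions.

```python
from typing import List


def cfg_combine(reference: List[float], guided: List[float], scale: float) -> List[float]:
    """Standard CFG extrapolation: reference + scale * (guided - reference).

    scale = 0 returns the reference prediction unchanged,
    scale = 1 returns the guided prediction,
    scale > 1 extrapolates past the guided prediction.
    """
    if len(reference) != len(guided):
        raise ValueError("predictions must have the same length")
    return [r + scale * (g - r) for r, g in zip(reference, guided)]


# Toy noise predictions (hypothetical values, one short vector per model).
base_pred = [1.0, 2.0]   # assumed: prediction of the unmodified base model
lora_pred = [3.0, 0.0]   # assumed: prediction of the LoRA-adapted model

# With scale 0 the base model alone determines the output; larger scales
# push the result toward, and then beyond, the LoRA prediction.
for scale in (0.0, 1.0, 2.0):
    print(scale, cfg_combine(base_pred, lora_pred, scale))
```

How the scale is chosen from the early denoising steps is the part the paper contributes, and it is deliberately left out of this sketch.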

Title: MAISI-v2: Accelerated 3D High-Resolution Medical Image Synthesis with Rectified Flow and Region-specific Contrastive Loss

Authors: Can Zhao, Pengfei Guo, Dong Yang, Yucheng Tang, Yufan He, Benjamin Simon, Mason Belue, Stephanie Harmon, Baris Turkbey, Daguang Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.05772
Pdf URL: https://arxiv.org/pdf/2508.05772
Copy Paste: [[2508.05772]] MAISI-v2: Accelerated 3D High-Resolution Medical Image Synthesis with Rectified Flow and Region-specific Contrastive Loss(https://arxiv.org/abs/2508.05772)
Keywords: generation
Abstract: Medical image synthesis is an important topic for both clinical and research applications. Recently, diffusion models have become a leading approach in this area. Despite their strengths, many existing methods struggle with (1) limited generalizability, working only for specific body regions or voxel spacings, (2) slow inference, which is a common issue for diffusion models, and (3) weak alignment with input conditions, which is a critical issue for medical imaging. MAISI, a previously proposed framework, addresses generalizability issues but still suffers from slow inference and limited condition consistency. In this work, we present MAISI-v2, the first accelerated 3D medical image synthesis framework that integrates rectified flow to enable fast and high-quality generation. To further enhance condition fidelity, we introduce a novel region-specific contrastive loss that increases sensitivity to the region of interest. Our experiments show that MAISI-v2 can achieve SOTA image quality with $33 \times$ acceleration for the latent diffusion model. We also conducted a downstream segmentation experiment to show that the synthetic images can be used for data augmentation. We release our code, training details, model weights, and a GUI demo to facilitate reproducibility and promote further development within the community.
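Why rectified flow allows few-step sampling can be illustrated with a toy example. Rectified flow trains a velocity field along the straight path x_t = (1 - t) x_0 + t x_1 and samples by integrating dx/dt = v(x, t). The sketch below uses a hand-written oracle velocity in place of a trained network; `euler_sample`, `oracle_velocity`, and the two-element vectors are assumptions for illustration and say nothing about MAISI-v2's actual model or step count.

```python
from typing import Callable, List

Vector = List[float]


def interpolate(x0: Vector, x1: Vector, t: float) -> Vector:
    """Point on the straight path between noise x0 and data x1."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def euler_sample(x0: Vector, velocity: Callable[[Vector, float], Vector],
                 num_steps: int) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        v = velocity(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy setup: a single noise/data pair, so the ideal velocity is constant.
noise = [0.0, 1.0]
data = [2.0, -1.0]


def oracle_velocity(x: Vector, t: float) -> Vector:
    """Stand-in for a trained network: the straight-line velocity x1 - x0."""
    return [b - a for a, b in zip(noise, data)]


# Because the path is straight, one Euler step already lands on the target.
print(euler_sample(noise, oracle_velocity, num_steps=1))
print(euler_sample(noise, oracle_velocity, num_steps=4))
```

A learned velocity field is only approximately straight, so real samplers still use several steps, but far fewer than a curved diffusion trajectory needs.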

Title: HOLODECK 2.0: Vision-Language-Guided 3D World Generation with Editing

Authors: Zixuan Bian, Ruohan Ren, Yue Yang, Chris Callison-Burch
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2508.05899
Pdf URL: https://arxiv.org/pdf/2508.05899
Copy Paste: [[2508.05899]] HOLODECK 2.0: Vision-Language-Guided 3D World Generation with Editing(https://arxiv.org/abs/2508.05899)
Keywords: generation, generative
Abstract: 3D scene generation plays a crucial role in gaming, artistic creation, virtual reality and many other domains. However, current 3D scene design still relies heavily on extensive manual effort from creators, and existing automated methods struggle to generate open-domain scenes or support flexible editing. As a result, generating 3D worlds directly from text has garnered increasing attention. In this paper, we introduce HOLODECK 2.0, an advanced vision-language-guided framework for 3D world generation with support for interactive scene editing based on human feedback. HOLODECK 2.0 can generate diverse and stylistically rich 3D scenes (e.g., realistic, cartoon, anime, and cyberpunk styles) that exhibit high semantic fidelity to fine-grained input descriptions, suitable for both indoor and open-domain environments. HOLODECK 2.0 leverages vision-language models (VLMs) to identify and parse the objects required in a scene and generates corresponding high-quality assets via state-of-the-art 3D generative models. It then iteratively applies spatial constraints derived from the VLMs to achieve semantically coherent and physically plausible layouts. Human evaluations and CLIP-based assessments demonstrate that HOLODECK 2.0 effectively generates high-quality scenes closely aligned with detailed textual descriptions, consistently outperforming baselines across indoor and open-domain scenarios. Additionally, we provide editing capabilities that flexibly adapt to human feedback, supporting layout refinement and style-consistent object edits. Finally, we present a practical application of HOLODECK 2.0 in procedural game modeling, generating visually rich and immersive environments, potentially boosting efficiency.

Title: A 3DGS-Diffusion Self-Supervised Framework for Normal Estimation from a Single Image

Authors: Yanxing Liang, Yinghui Wang, Jinlong Yang, Wei Li
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.05950
Pdf URL: https://arxiv.org/pdf/2508.05950
Copy Paste: [[2508.05950]] A 3DGS-Diffusion Self-Supervised Framework for Normal Estimation from a Single Image(https://arxiv.org/abs/2508.05950)
Keywords: generation
Abstract: The lack of spatial dimensional information remains a challenge in normal estimation from a single image. Although recent diffusion-based methods have demonstrated significant potential in 2D-to-3D implicit mapping, they rely on data-driven statistical priors and miss the explicit modeling of light-surface interaction, leading to multi-view normal direction conflicts. Moreover, the discrete sampling mechanism of diffusion models causes gradient discontinuity in differentiable rendering reconstruction modules, preventing 3D geometric errors from being backpropagated to the normal generation network, thereby forcing existing methods to depend on dense normal annotations. This paper proposes SINGAD, a novel Self-supervised framework from a single Image for Normal estimation via 3D GAussian splatting guided Diffusion. By integrating physics-driven light-interaction modeling and a differentiable rendering-based reprojection strategy, our framework directly converts 3D geometric errors into normal optimization signals, solving the challenges of multi-view geometric inconsistency and data dependency. Specifically, the framework constructs a light-interaction-driven 3DGS reparameterization model to generate multi-scale geometric features consistent with light transport principles, ensuring multi-view normal consistency. A cross-domain feature fusion module is designed within a conditional diffusion model, embedding geometric priors to constrain normal generation while maintaining accurate geometric error propagation. Furthermore, a differentiable 3D reprojection loss strategy is introduced for self-supervised optimization that minimizes geometric error between the reconstructed and input images, eliminating dependence on annotated normal datasets. Quantitative evaluations on the Google Scanned Objects dataset demonstrate that our method outperforms state-of-the-art approaches across multiple metrics.

Title: Bifrost-1: Bridging Multimodal LLMs and Diffusion Models with Patch-level CLIP Latents

Authors: Han Lin, Jaemin Cho, Amir Zadeh, Chuan Li, Mohit Bansal
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2508.05954
Pdf URL: https://arxiv.org/pdf/2508.05954
Copy Paste: [[2508.05954]] Bifrost-1: Bridging Multimodal LLMs and Diffusion Models with Patch-level CLIP Latents(https://arxiv.org/abs/2508.05954)
Keywords: generation
Abstract: There is growing interest in integrating high-fidelity visual synthesis capabilities into large language models (LLMs) without compromising their strong reasoning capabilities. Existing methods that directly train LLMs or bridge LLMs and diffusion models usually suffer from costly training since the backbone LLMs have not seen image representations during pretraining. We present Bifrost-1, a unified framework that bridges pretrained multimodal LLMs (MLLMs) and diffusion models using patch-level CLIP image embeddings as latent variables, which are natively aligned with the MLLM's CLIP visual encoder. These patch-level image embeddings are integrated into the diffusion model with a lightweight adaptation of its ControlNet. To retain the original multimodal reasoning capabilities of MLLMs, we equip the MLLM with a visual generation branch initialized from the original MLLM parameters when predicting the patch-level image embeddings. By seamlessly integrating pretrained MLLMs and diffusion models with patch-level CLIP latents, our framework enables high-fidelity controllable image generation with significant training efficiency. Our experiments demonstrate that Bifrost-1 achieves comparable or better performance than previous methods in terms of visual fidelity and multimodal understanding, with substantially lower compute during training. We also provide comprehensive ablation studies showing the effectiveness of our design choices.

Title: Pruning the Unsurprising: Efficient Code Reasoning via First-Token Surprisal

Authors: Wenhao Zeng, Yaoning Wang, Chao Hu, Yuling Shi, Chengcheng Wan, Hongyu Zhang, Xiaodong Gu
Subjects: cs.LG, cs.SE
Abstract URL: https://arxiv.org/abs/2508.05988
Pdf URL: https://arxiv.org/pdf/2508.05988
Copy Paste: [[2508.05988]] Pruning the Unsurprising: Efficient Code Reasoning via First-Token Surprisal(https://arxiv.org/abs/2508.05988)
Keywords: generation
Abstract: Recently, Large Reasoning Models (LRMs) have demonstrated remarkable capabilities in code reasoning by scaling up the length of Chain-of-Thought (CoT). However, excessively long reasoning traces introduce substantial challenges in terms of training cost, inference latency, and deployment feasibility. While various CoT compression approaches have emerged to address this challenge, they face inherent trade-offs: token-level methods often disrupt syntactic and logical coherence, while step-level methods based on perplexity fail to reliably capture the logically critical reasoning steps. In this paper, we propose ASAP (Anchor-guided, Surprisal-based Pruning), a novel coarse-to-fine framework for CoT compression. ASAP first performs anchor-guided pruning to preserve the core reasoning structure, which efficiently reduces the search space for subsequent processing. It then enables a logic-aware pruning by selecting logically essential reasoning steps based on a novel first-token surprisal metric. Finally, ASAP teaches models to autonomously generate and leverage these concise CoTs at inference time, enabling efficient reasoning in coding tasks. Experiments show that ASAP achieves state-of-the-art accuracy across multiple code generation benchmarks while substantially reducing training and inference costs. On the challenging LiveCodeBench v4_v5 benchmark, our approach reduces token generation by 23.5% and inference latency by 43.5% compared to the strongest baseline, while achieving a competitive accuracy of 36.19% in Pass@1. Our results highlight a promising direction for building powerful and efficient LRMs.
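The idea of ranking reasoning steps by first-token surprisal can be sketched as follows. Surprisal is the negative log-probability of a token; everything else here is an assumption, since the abstract does not specify ASAP's selection rule: the `prune_steps` helper, the fixed `keep` budget, and the example probabilities are hypothetical.

```python
import math
from typing import List, Tuple


def surprisal(prob: float) -> float:
    """Surprisal of an event with probability prob, in nats: -log(prob)."""
    if not 0.0 < prob <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log(prob)


def prune_steps(steps: List[Tuple[str, float]], keep: int) -> List[str]:
    """Keep the `keep` steps whose first token was most surprising.

    Each step is (text, probability the model assigned to its first token).
    The surviving steps are returned in their original order.
    """
    ranked = sorted(range(len(steps)),
                    key=lambda i: surprisal(steps[i][1]),
                    reverse=True)
    kept = sorted(ranked[:keep])
    return [steps[i][0] for i in kept]


# Hypothetical reasoning trace with made-up first-token probabilities.
trace = [
    ("Restate the problem.", 0.90),
    ("Notice the input can be empty.", 0.10),
    ("Loop over the list.", 0.50),
    ("Switch to a two-pointer scan.", 0.05),
]
print(prune_steps(trace, keep=2))
```

Steps that open with a highly predictable token are treated as unsurprising and dropped first, which matches the intuition in the paper's title.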

Title: Optimizing Prompt Sequences using Monte Carlo Tree Search for LLM-Based Optimization

Authors: Fei Xu Yu, Gina Adam, Nathaniel D. Bastian, Tian Lan
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.05995
Pdf URL: https://arxiv.org/pdf/2508.05995
Copy Paste: [[2508.05995]] Optimizing Prompt Sequences using Monte Carlo Tree Search for LLM-Based Optimization(https://arxiv.org/abs/2508.05995)
Keywords: generation
Abstract: Large language models (LLMs) have demonstrated remarkable capabilities in code generation and structured reasoning; however, their performance often degrades on complex tasks that require consistent multi-step planning. Recent work has explored combining LLMs with Monte Carlo Tree Search (MCTS), yet existing approaches primarily focus on generating heuristic-based code for optimization or target simpler tasks where correctness alone is sufficient. In this work, we propose MCTS-OPS, a novel neural-symbolic framework that formulates prompt selection as a sequential decision process guided by MCTS. Our method explores and refines multi-step prompt sequences with the goal of improving code generation quality and enhancing the problem-solving capabilities of LLMs in general optimization. Experiments on network optimization show significant improvement over the baselines, both in the success rate of executing the generated code and in the optimization results with the specified objective and constraints (2$\sim$4$\times$ higher reward and 3$\times$ lower standard deviation). Moreover, on hard problems, it attains the optimal solution in about 10% more cases than baseline methods. These results highlight the promise of combining symbolic planning with LLMs for robust, high-quality code generation in complex domains.
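Treating prompt selection as a tree search implies a rule for choosing which prompt to expand next. A common choice in MCTS is the UCT (UCB1) score, sketched below. The abstract does not say which selection rule or reward MCTS-OPS uses, so the exploration constant, the `select_child` helper, and the prompt names are illustrative assumptions.

```python
import math
from typing import Dict, Tuple


def uct_score(total_reward: float, visits: int, parent_visits: int,
              c: float = math.sqrt(2.0)) -> float:
    """UCT score: mean reward plus an exploration bonus.

    Unvisited children score infinity so each one is tried at least once.
    """
    if visits == 0:
        return math.inf
    mean = total_reward / visits
    bonus = c * math.sqrt(math.log(parent_visits) / visits)
    return mean + bonus


def select_child(children: Dict[str, Tuple[float, int]], parent_visits: int) -> str:
    """Return the name of the child with the highest UCT score."""
    return max(children,
               key=lambda name: uct_score(children[name][0],
                                          children[name][1],
                                          parent_visits))


# Hypothetical candidate prompts: name -> (total reward, visit count).
candidates = {
    "restate-constraints": (3.0, 5),
    "ask-for-solver-code": (1.0, 1),
}
print(select_child(candidates, parent_visits=6))
```

The bonus term shrinks as a prompt is visited more often, so the search balances reusing prompts that scored well against trying prompts it knows little about.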

Title: Improved Sub-Visible Particle Classification in Flow Imaging Microscopy via Generative AI-Based Image Synthesis

Authors: Utku Ozbulak, Michaela Cohrs, Hristo L. Svilenov, Joris Vankerschaver, Wesley De Neve
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2508.06021
Pdf URL: https://arxiv.org/pdf/2508.06021
Copy Paste: [[2508.06021]] Improved Sub-Visible Particle Classification in Flow Imaging Microscopy via Generative AI-Based Image Synthesis(https://arxiv.org/abs/2508.06021)
Keywords: generative
Abstract: Sub-visible particle analysis using flow imaging microscopy combined with deep learning has proven effective in identifying particle types, enabling the distinction of harmless components such as silicone oil from protein particles. However, the scarcity of available data and severe imbalance between particle types within datasets remain substantial hurdles when applying multi-class classifiers to such problems, often forcing researchers to rely on less effective methods. The aforementioned issue is particularly challenging for particle types that appear unintentionally and in lower numbers, such as silicone oil and air bubbles, as opposed to protein particles, where obtaining large numbers of images through controlled settings is comparatively straightforward. In this work, we develop a state-of-the-art diffusion model to address data imbalance by generating high-fidelity images that can augment training datasets, enabling the effective training of multi-class deep neural networks. We validate this approach by demonstrating that the generated samples closely resemble real particle images in terms of visual quality and structure. To assess the effectiveness of using diffusion-generated images in training datasets, we conduct large-scale experiments on a validation dataset comprising 500,000 protein particle images and demonstrate that this approach improves classification performance with no negligible downside. Finally, to promote open research and reproducibility, we publicly release both our diffusion models and the trained multi-class deep neural network classifiers, along with a straightforward interface for easy integration into future studies, at this https URL.

Title: InstantEdit: Text-Guided Few-Step Image Editing with Piecewise Rectified Flow

Authors: Yiming Gong, Zhen Zhu, Minjia Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.06033
Pdf URL: https://arxiv.org/pdf/2508.06033
Copy Paste: [[2508.06033]] InstantEdit: Text-Guided Few-Step Image Editing with Piecewise Rectified Flow(https://arxiv.org/abs/2508.06033)
Keywords: generation
Abstract: We propose a fast text-guided image editing method called InstantEdit based on the RectifiedFlow framework, which is structured as a few-step editing process that preserves critical content while closely following textual instructions. Our approach leverages the straight sampling trajectories of RectifiedFlow by introducing a specialized inversion strategy called PerRFI. To maintain consistent yet editable results for the RectifiedFlow model, we further propose a novel regeneration method, Inversion Latent Injection, which effectively reuses latent information obtained during inversion to facilitate more coherent and detailed regeneration. Additionally, we propose a Disentangled Prompt Guidance technique to balance editability with detail preservation, and integrate a Canny-conditioned ControlNet to incorporate structural cues and suppress artifacts. Evaluation on the PIE image editing dataset demonstrates that InstantEdit is not only fast but also achieves better qualitative and quantitative results compared to state-of-the-art few-step editing methods.

Title: Fourier-VLM: Compressing Vision Tokens in the Frequency Domain for Large Vision-Language Models

Authors: Huanyu Wang, Jushi Kai, Haoli Bai, Lu Hou, Bo Jiang, Ziwei He, Zhouhan Lin
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.06038
Pdf URL: https://arxiv.org/pdf/2508.06038
Copy Paste: [[2508.06038]] Fourier-VLM: Compressing Vision Tokens in the Frequency Domain for Large Vision-Language Models(https://arxiv.org/abs/2508.06038)
Keywords: generation
Abstract: Vision-Language Models (VLMs) typically replace the predefined image placeholder token () in textual instructions with visual features from an image encoder, forming the input to a backbone Large Language Model (LLM). However, the large number of vision tokens significantly increases the context length, leading to high computational overhead and inference latency. While previous efforts mitigate this by selecting only important visual features or leveraging learnable queries to reduce token count, they often compromise performance or introduce substantial extra costs. In response, we propose Fourier-VLM, a simple yet efficient method that compresses visual representations in the frequency domain. Our approach is motivated by the observation that vision features output from the vision encoder exhibit concentrated energy in low-frequency components. Leveraging this, we apply a low-pass filter to the vision features using a two-dimensional Discrete Cosine Transform (DCT). Notably, the DCT is efficiently computed via the Fast Fourier Transform (FFT) operator with a time complexity of $\mathcal{O}(n\log n)$, minimizing the extra computational cost while introducing no additional parameters. Extensive experiments across various image-based benchmarks demonstrate that Fourier-VLM achieves competitive performance with strong generalizability across both LLaVA and Qwen-VL architectures. Crucially, it reduces inference FLOPs by up to 83.8% and boosts generation speed by 31.2% compared to LLaVA-v1.5, highlighting its superior efficiency and practicality.
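The low-pass step described in the abstract can be illustrated on a small grid. The sketch below computes an orthonormal 2D DCT-II with a naive O(n^2) loop per 1D transform (not the FFT-based computation the abstract mentions) and keeps only the top-left block of coefficients. It works on one scalar channel of a square grid; the function names, the `keep` parameter, and the toy grid are assumptions, and how Fourier-VLM feeds the compressed features to the LLM is not shown.

```python
import math
from typing import List

Grid = List[List[float]]


def dct(x: List[float]) -> List[float]:
    """Orthonormal 1D DCT-II (naive implementation)."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out


def idct(c: List[float]) -> List[float]:
    """Inverse of the orthonormal DCT-II above (a DCT-III)."""
    n = len(c)
    out = []
    for i in range(n):
        s = c[0] * math.sqrt(1.0 / n)
        s += sum(c[k] * math.sqrt(2.0 / n) * math.cos(math.pi * (i + 0.5) * k / n)
                 for k in range(1, n))
        out.append(s)
    return out


def transpose(m: Grid) -> Grid:
    return [list(col) for col in zip(*m)]


def dct2(grid: Grid) -> Grid:
    """2D DCT: transform every row, then every column."""
    rows = [dct(r) for r in grid]
    return transpose([dct(col) for col in transpose(rows)])


def idct2(coeffs: Grid) -> Grid:
    """2D inverse DCT: invert every row, then every column."""
    rows = [idct(r) for r in coeffs]
    return transpose([idct(col) for col in transpose(rows)])


def low_pass(grid: Grid, keep: int) -> Grid:
    """Keep only the keep x keep lowest-frequency DCT coefficients."""
    coeffs = dct2(grid)
    return [row[:keep] for row in coeffs[:keep]]


# Toy 4 x 4 feature map for a single channel (hypothetical values).
features = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
compressed = low_pass(features, keep=2)
print(len(features) * len(features[0]), "values ->",
      len(compressed) * len(compressed[0]), "coefficients")
```

Keeping a 2 x 2 block of a 4 x 4 grid cuts the number of stored values by a factor of four, which is the sense in which frequency-domain truncation reduces the token count.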

Title: NEP: Autoregressive Image Editing via Next Editing Token Prediction

Authors: Huimin Wu, Xiaojian Ma, Haozhe Zhao, Yanpeng Zhao, Qing Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.06044
Pdf URL: https://arxiv.org/pdf/2508.06044
Copy Paste: [[2508.06044]] NEP: Autoregressive Image Editing via Next Editing Token Prediction(https://arxiv.org/abs/2508.06044)
Keywords: generation
Abstract: Text-guided image editing involves modifying a source image based on a language instruction and, typically, requires changes to only small local regions. However, existing approaches generate the entire target image rather than selectively regenerating only the intended editing areas. This results in (1) unnecessary computational costs and (2) a bias toward reconstructing non-editing regions, which compromises the quality of the intended edits. To resolve these limitations, we propose to formulate image editing as Next Editing-token Prediction (NEP) based on autoregressive image generation, where only regions that need to be edited are regenerated, thus avoiding unintended modification of the non-editing areas. To enable any-region editing, we propose to pre-train an any-order autoregressive text-to-image (T2I) model. Once trained, it is capable of zero-shot image editing and can be easily adapted to NEP for image editing, which achieves a new state of the art on widely used image editing benchmarks. Moreover, our model naturally supports test-time scaling (TTS) by iteratively refining its generation in a zero-shot manner. The project page is: this https URL

Title: VQAThinker: Exploring Generalizable and Explainable Video Quality Assessment via Reinforcement Learning

Authors: Linhan Cao, Wei Sun, Weixia Zhang, Xiangyang Zhu, Jun Jia, Kaiwei Zhang, Dandan Zhu, Guangtao Zhai, Xiongkuo Min
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.06051
Pdf URL: https://arxiv.org/pdf/2508.06051
Copy Paste: [[2508.06051]] VQAThinker: Exploring Generalizable and Explainable Video Quality Assessment via Reinforcement Learning(https://arxiv.org/abs/2508.06051)
Keywords: quality assessment
Abstract: Video quality assessment (VQA) aims to objectively quantify perceptual quality degradation in alignment with human visual perception. Despite recent advances, existing VQA models still suffer from two critical limitations: poor generalization to out-of-distribution (OOD) videos and limited explainability, which restrict their applicability in real-world scenarios. To address these challenges, we propose VQAThinker, a reasoning-based VQA framework that leverages large multimodal models (LMMs) with reinforcement learning to jointly model video quality understanding and scoring, emulating human perceptual decision-making. Specifically, we adopt group relative policy optimization (GRPO), a rule-guided reinforcement learning algorithm that enables reasoning over video quality under score-level supervision, and introduce three VQA-specific rewards: (1) a bell-shaped regression reward that increases rapidly as the prediction error decreases and becomes progressively less sensitive near the ground truth; (2) a pairwise ranking reward that guides the model to correctly determine the relative quality between video pairs; and (3) a temporal consistency reward that encourages the model to prefer temporally coherent videos over their perturbed counterparts. Extensive experiments demonstrate that VQAThinker achieves state-of-the-art performance on both in-domain and OOD VQA benchmarks, showing strong generalization for video quality scoring. Furthermore, evaluations on video quality understanding tasks validate its superiority in distortion attribution and quality description compared to existing explainable VQA models and LMMs. These findings demonstrate that reinforcement learning offers an effective pathway toward building generalizable and explainable VQA models solely with score-level supervision.
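Two of the rewards described in the abstract can be sketched with simple functional forms. The abstract characterizes them only qualitatively, so the Gaussian shape, the `sigma` value, and the 0/1 ranking reward below are assumptions chosen to match that description, not the paper's definitions.

```python
import math


def bell_reward(pred: float, target: float, sigma: float = 0.5) -> float:
    """Gaussian-shaped reward: 1 at zero error, flat near the target,
    and changing fastest when the error is around sigma."""
    err = pred - target
    return math.exp(-(err * err) / (2.0 * sigma * sigma))


def pairwise_reward(pred_a: float, pred_b: float,
                    true_a: float, true_b: float) -> float:
    """1.0 if the predicted ordering of two videos matches the true
    ordering, otherwise 0.0 (ties count as a mismatch)."""
    return 1.0 if (pred_a - pred_b) * (true_a - true_b) > 0 else 0.0


# Hypothetical quality scores on a 1-5 scale.
print(bell_reward(3.9, 4.0), bell_reward(3.0, 4.0))
print(pairwise_reward(4.2, 2.1, 3.5, 1.0))
```

A Gaussian has zero slope at zero error, so small deviations near the ground truth barely change the reward, while larger errors are penalized steeply.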

Title: Towards MR-Based Trochleoplasty Planning

Authors: Michael Wehrli, Alicia Durrer, Paul Friedrich, Sidaty El Hadramy, Edwin Li, Luana Brahaj, Carol C. Hasler, Philippe C. Cattin
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.06076
Pdf URL: https://arxiv.org/pdf/2508.06076
Copy Paste: [[2508.06076]] Towards MR-Based Trochleoplasty Planning(https://arxiv.org/abs/2508.06076)
Keywords: generation
Abstract: To treat Trochlear Dysplasia (TD), current approaches rely mainly on low-resolution clinical Magnetic Resonance (MR) scans and surgical intuition. The surgeries are planned based on surgeons' experience, have limited adoption of minimally invasive techniques, and lead to inconsistent outcomes. We propose a pipeline that generates super-resolved, patient-specific 3D pseudo-healthy target morphologies from conventional clinical MR scans. First, we compute an isotropic super-resolved MR volume using an Implicit Neural Representation (INR). Next, we segment the femur, tibia, patella, and fibula with a multi-label custom-trained network. Finally, we train a Wavelet Diffusion Model (WDM) to generate pseudo-healthy target morphologies of the trochlear region. In contrast to prior work producing pseudo-healthy low-resolution 3D MR images, our approach enables the generation of sub-millimeter resolved 3D shapes suitable for pre- and intraoperative use. These can serve as preoperative blueprints for reshaping the femoral groove while preserving the native patella articulation. Furthermore, and in contrast to other work, we do not require a CT for our pipeline, which reduces the amount of radiation. We evaluated our approach on 25 TD patients and could show that our target morphologies significantly improve the sulcus angle (SA) and trochlear groove depth (TGD). The code and interactive visualization are available at this https URL.

Title: DreamVE: Unified Instruction-based Image and Video Editing

Authors: Bin Xia, Jiyang Liu, Yuechen Zhang, Bohao Peng, Ruihang Chu, Yitong Wang, Xinglong Wu, Bei Yu, Jiaya Jia
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.06080
Pdf URL: https://arxiv.org/pdf/2508.06080
Copy Paste: [[2508.06080]] DreamVE: Unified Instruction-based Image and Video Editing(https://arxiv.org/abs/2508.06080)
Keywords: generation, generative
Abstract: Instruction-based editing holds vast potential due to its simple and efficient interactive editing format. However, instruction-based editing, particularly for video, has been constrained by limited training data, hindering its practical application. To this end, we introduce DreamVE, a unified model for instruction-based image and video editing. Specifically, we propose a two-stage training strategy: first image editing, then video editing. This offers two main benefits: (1) Image data scales more easily, and models are more efficient to train, providing useful priors for faster and better video editing training. (2) Unifying image and video generation is natural and aligns with current trends. Moreover, we present comprehensive training data synthesis pipelines, including collage-based and generative model-based data synthesis. The collage-based data synthesis combines foreground objects and backgrounds to generate diverse editing data, such as object manipulation, background changes, and text modifications. It can easily generate billions of accurate, consistent, realistic, and diverse editing pairs. We pretrain DreamVE on extensive collage-based data to achieve strong performance in key editing types and enhance generalization and transfer capabilities. However, collage-based data lacks some attribute editing cases, leading to a relative drop in performance. In contrast, the generative model-based pipeline, despite being hard to scale up, offers flexibility in handling attribute editing cases. Therefore, we use generative model-based data to further fine-tune DreamVE. In addition, we design an efficient and powerful editing framework for DreamVE. We build on the SOTA T2V model and use a token concatenation with early drop approach to inject source image guidance, ensuring strong consistency and editability. The codes and models will be released.

Title: SwiftVideo: A Unified Framework for Few-Step Video Generation through Trajectory-Distribution Alignment

Authors: Yanxiao Sun, Jiafu Wu, Yun Cao, Chengming Xu, Yabiao Wang, Weijian Cao, Donghao Luo, Chengjie Wang, Yanwei Fu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.06082
Pdf URL: https://arxiv.org/pdf/2508.06082
Copy Paste: [[2508.06082]] SwiftVideo: A Unified Framework for Few-Step Video Generation through Trajectory-Distribution Alignment(https://arxiv.org/abs/2508.06082)
Keywords: generation
Abstract: Diffusion-based or flow-based models have achieved significant progress in video synthesis but require multiple iterative sampling steps, which incur substantial computational overhead. While many distillation methods that are solely based on trajectory-preserving or distribution-matching have been developed to accelerate video generation models, these approaches often suffer from performance breakdown or increased artifacts under few-step settings. To address these limitations, we propose SwiftVideo, a unified and stable distillation framework that combines the advantages of trajectory-preserving and distribution-matching strategies. Our approach introduces continuous-time consistency distillation to ensure precise preservation of ODE trajectories. Subsequently, we propose a dual-perspective alignment that includes distribution alignment between synthetic and real data along with trajectory alignment across different inference steps. Our method maintains high-quality video generation while substantially reducing the number of inference steps. Quantitative evaluations on the OpenVid-1M benchmark demonstrate that our method significantly outperforms existing approaches in few-step video generation.

Title: Q-CLIP: Unleashing the Power of Vision-Language Models for Video Quality Assessment through Unified Cross-Modal Adaptation

Authors: Yachun Mi, Yu Li, Yanting Li, Shixin Sun, Chen Hui, Tong Zhang, Yuanyuan Liu, Chenyue Song, Shaohui Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.06092
Pdf URL: https://arxiv.org/pdf/2508.06092
Copy Paste: [[2508.06092]] Q-CLIP: Unleashing the Power of Vision-Language Models for Video Quality Assessment through Unified Cross-Modal Adaptation(https://arxiv.org/abs/2508.06092)
Keywords: quality assessment
Abstract: Accurate and efficient Video Quality Assessment (VQA) has long been a key research challenge. Current mainstream VQA methods typically improve performance by pretraining on large-scale classification datasets (e.g., ImageNet, Kinetics-400), followed by fine-tuning on VQA datasets. However, this strategy presents two significant challenges: (1) merely transferring semantic knowledge learned from pretraining is insufficient for VQA, as video quality depends on multiple factors (e.g., semantics, distortion, motion, aesthetics); (2) pretraining on large-scale datasets demands enormous computational resources, often dozens or even hundreds of times greater than training directly on VQA datasets. Recently, Vision-Language Models (VLMs) have shown remarkable generalization capabilities across a wide range of visual tasks, and have begun to demonstrate promising potential in quality assessment. In this work, we propose Q-CLIP, the first fully VLMs-based framework for VQA. Q-CLIP enhances both visual and textual representations through a Shared Cross-Modal Adapter (SCMA), which contains only a minimal number of trainable parameters and is the only component that requires training. This design significantly reduces computational cost. In addition, we introduce a set of five learnable quality-level prompts to guide the VLMs in perceiving subtle quality variations, thereby further enhancing the model's sensitivity to video quality. Furthermore, we investigate the impact of different frame sampling strategies on VQA performance, and find that frame-difference-based sampling leads to better generalization performance across datasets. Extensive experiments demonstrate that Q-CLIP exhibits excellent performance on several VQA datasets.

Title: E-React: Towards Emotionally Controlled Synthesis of Human Reactions

Authors: Chen Zhu, Buzhen Huang, Zijing Wu, Binghui Zuo, Yangang Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.06093
Pdf URL: https://arxiv.org/pdf/2508.06093
Copy Paste: [[2508.06093]] E-React: Towards Emotionally Controlled Synthesis of Human Reactions(https://arxiv.org/abs/2508.06093)
Keywords: generation
Abstract: Emotion serves as an essential component in daily human interactions. Existing human motion generation frameworks do not consider the impact of emotions, which reduces naturalness and limits their application in interactive tasks, such as human reaction synthesis. In this work, we introduce a novel task: generating diverse reaction motions in response to different emotional cues. However, learning emotion representation from limited motion data and incorporating it into a motion generation framework remains a challenging problem. To address the above obstacles, we introduce a semi-supervised emotion prior in an actor-reactor diffusion model to facilitate emotion-driven reaction synthesis. Specifically, based on the observation that motion clips within a short sequence tend to share the same emotion, we first devise a semi-supervised learning framework to train an emotion prior. With this prior, we further train an actor-reactor diffusion model to generate reactions by considering both spatial interaction and emotional response. Finally, given a motion sequence of an actor, our approach can generate realistic reactions under various emotional conditions. Experimental results demonstrate that our model outperforms existing reaction generation methods. The code and data will be made publicly available at this https URL

Title: UGD-IML: A Unified Generative Diffusion-based Framework for Constrained and Unconstrained Image Manipulation Localization

Authors: Yachun Mi, Xingyang He, Shixin Sun, Yu Li, Yanting Li, Zhixuan Li, Jian Jin, Chen Hui, Shaohui Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.06101
Pdf URL: https://arxiv.org/pdf/2508.06101
Copy Paste: [[2508.06101]] UGD-IML: A Unified Generative Diffusion-based Framework for Constrained and Unconstrained Image Manipulation Localization(https://arxiv.org/abs/2508.06101)
Keywords: generative
Abstract: In the digital age, advanced image editing tools pose a serious threat to the integrity of visual content, making image forgery detection and localization a key research focus. Most existing Image Manipulation Localization (IML) methods rely on discriminative learning and require large, high-quality annotated datasets. However, current datasets lack sufficient scale and diversity, limiting model performance in real-world scenarios. To overcome this, recent studies have explored Constrained IML (CIML), which generates pixel-level annotations through algorithmic supervision. However, existing CIML approaches often depend on complex multi-stage pipelines, making the annotation process inefficient. In this work, we propose a novel generative framework based on diffusion models, named UGD-IML, which for the first time unifies both IML and CIML tasks within a single framework. By learning the underlying data distribution, generative diffusion models inherently reduce the reliance on large-scale labeled datasets, allowing our approach to perform effectively even under limited data conditions. In addition, by leveraging a class embedding mechanism and a parameter-sharing design, our model seamlessly switches between IML and CIML modes without extra components or training overhead. Furthermore, the end-to-end design enables our model to avoid cumbersome steps in the data annotation process. Extensive experimental results on multiple datasets demonstrate that UGD-IML outperforms the SOTA methods by an average of 9.66 and 4.36 in terms of F1 metrics for IML and CIML tasks, respectively. Moreover, the proposed method also excels in uncertainty estimation, visualization and robustness.

Title: SC-Captioner: Improving Image Captioning with Self-Correction by Reinforcement Learning

Authors: Lin Zhang, Xianfang Zeng, Kangcong Li, Gang Yu, Tao Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.06125
Pdf URL: https://arxiv.org/pdf/2508.06125
Copy Paste: [[2508.06125]] SC-Captioner: Improving Image Captioning with Self-Correction by Reinforcement Learning(https://arxiv.org/abs/2508.06125)
Keywords: quality assessment
Abstract: We propose SC-Captioner, a reinforcement learning framework that enables the self-correcting capability of image caption models. Our crucial technique lies in the design of the reward function to incentivize accurate caption corrections. Specifically, the predicted and reference captions are decomposed into object, attribute, and relation sets using scene-graph parsing algorithms. We calculate the set difference between sets of initial and self-corrected captions to identify added and removed elements. These elements are matched against the reference sets to calculate correctness bonuses for accurate refinements and mistake punishments for wrong additions and removals, thereby forming the final reward. For image caption quality assessment, we propose a set of metrics refined from CAPTURE that alleviate its incomplete precision evaluation and inefficient relation matching problems. Furthermore, we collect a fine-grained annotated image caption dataset, RefinedCaps, consisting of 6.5K diverse images from COCO dataset. Experiments show that applying SC-Captioner on large visual-language models can generate better image captions across various scenarios, significantly outperforming the direct preference optimization training strategy.
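The reward described in the abstract above can be illustrated with a minimal sketch. It assumes exact string matching between elements and a unit bonus and penalty; the function name `correction_reward` and its parameters are hypothetical and not taken from the paper, which may match elements more softly and weight object, attribute, and relation sets differently.

```python
def correction_reward(initial, corrected, reference, bonus=1.0, penalty=1.0):
    """Toy reward for a caption self-correction step (illustrative assumption).

    Each argument is a set of parsed elements (objects, attributes, or relations).
    """
    added = corrected - initial      # elements the correction introduced
    removed = initial - corrected    # elements the correction dropped
    reward = 0.0
    for element in added:
        # Adding something the reference contains is an accurate refinement.
        reward += bonus if element in reference else -penalty
    for element in removed:
        # Removing something the reference lacks is an accurate refinement.
        reward += bonus if element not in reference else -penalty
    return reward


reference = {"dog", "blue ball", "cat"}
initial = {"dog", "red ball", "cat"}

# Good correction: swaps the wrong attribute for the right one.
good = correction_reward(initial, {"dog", "blue ball", "cat"}, reference)
# Bad correction: drops an object that is actually in the image.
bad = correction_reward(initial, {"dog", "red ball"}, reference)
print(good, bad)
```

In this toy setting the good correction earns one bonus for the addition and one for the removal, while the bad correction is penalized once for deleting a reference element.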

Title: DiffCap: Diffusion-based Real-time Human Motion Capture using Sparse IMUs and a Monocular Camera

Authors: Shaohua Pan, Xinyu Yi, Yan Zhou, Weihua Jian, Yuan Zhang, Pengfei Wan, Feng Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.06139
Pdf URL: https://arxiv.org/pdf/2508.06139
Copy Paste: [[2508.06139]] DiffCap: Diffusion-based Real-time Human Motion Capture using Sparse IMUs and a Monocular Camera(https://arxiv.org/abs/2508.06139)
Keywords: generation
Abstract: Combining sparse IMUs and a monocular camera is a new promising setting to perform real-time human motion capture. This paper proposes a diffusion-based solution to learn human motion priors and fuse the two modalities of signals together seamlessly in a unified framework. By delicately considering the characteristics of the two signals, the sequential visual information is considered as a whole and transformed into a condition embedding, while the inertial measurement is concatenated with the noisy body pose frame by frame to construct a sequential input for the diffusion model. Firstly, we observe that the visual information may be unavailable in some frames due to occlusions or subjects moving out of the camera view. Thus incorporating the sequential visual features as a whole to get a single feature embedding is robust to the occasional degenerations of visual information in those frames. On the other hand, the IMU measurements are robust to occlusions and always stable when signal transmission has no problem. So incorporating them frame-wisely could better explore the temporal information for the system. Experiments have demonstrated the effectiveness of the system design and its state-of-the-art performance in pose estimation compared with the previous works. Our codes are available for research at this https URL.

Title: Text-guided Visual Prompt DINO for Generic Segmentation

Authors: Yuchen Guan, Chong Sun, Canmiao Fu, Zhipeng Huang, Chun Yuan, Chen Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.06146
Pdf URL: https://arxiv.org/pdf/2508.06146
Copy Paste: [[2508.06146]] Text-guided Visual Prompt DINO for Generic Segmentation(https://arxiv.org/abs/2508.06146)
Keywords: generation, generative
Abstract: Recent advancements in multimodal vision models have highlighted limitations in late-stage feature fusion and suboptimal query selection for hybrid prompts open-world segmentation, alongside constraints from caption-derived vocabularies. To address these challenges, we propose Prompt-DINO, a text-guided visual Prompt DINO framework featuring three key innovations. First, we introduce an early fusion mechanism that unifies text/visual prompts and backbone features at the initial encoding stage, enabling deeper cross-modal interactions to resolve semantic ambiguities. Second, we design order-aligned query selection for DETR-based architectures, explicitly optimizing the structural alignment between text and visual queries during decoding to enhance semantic-spatial consistency. Third, we develop a generative data engine powered by the Recognize Anything via Prompting (RAP) model, which synthesizes 0.5B diverse training instances through a dual-path cross-verification pipeline, reducing label noise by 80.5% compared to conventional approaches. Extensive experiments demonstrate that Prompt-DINO achieves state-of-the-art performance on open-world detection benchmarks while significantly expanding semantic coverage beyond fixed-vocabulary constraints. Our work establishes a new paradigm for scalable multimodal detection and data generation in open-world scenarios. Data&Code are available at this https URL.

Title: Improving Diagnostic Accuracy for Oral Cancer with inpainting Synthesis Lesions Generated Using Diffusion Models

Authors: Yong Oh Lee, JeeEun Kim, Jung Woo Lee
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2508.06151
Pdf URL: https://arxiv.org/pdf/2508.06151
Copy Paste: [[2508.06151]] Improving Diagnostic Accuracy for Oral Cancer with inpainting Synthesis Lesions Generated Using Diffusion Models(https://arxiv.org/abs/2508.06151)
Keywords: generation
Abstract: In oral cancer diagnostics, the limited availability of annotated datasets frequently constrains the performance of diagnostic models, particularly due to the variability and insufficiency of training data. To address these challenges, this study proposed a novel approach to enhance diagnostic accuracy by synthesizing realistic oral cancer lesions using an inpainting technique with a fine-tuned diffusion model. We compiled a comprehensive dataset from multiple sources, featuring a variety of oral cancer images. Our method generated synthetic lesions that exhibit a high degree of visual fidelity to actual lesions, thereby significantly enhancing the performance of diagnostic algorithms. The results show that our classification model achieved a diagnostic accuracy of 0.97 in differentiating between cancerous and non-cancerous tissues, while our detection model accurately identified lesion locations with 0.85 accuracy. This method validates the potential for synthetic image generation in medical diagnostics and paves the way for further research into extending these methods to other types of cancer diagnostics.

Title: An Interpretable Multi-Plane Fusion Framework With Kolmogorov-Arnold Network Guided Attention Enhancement for Alzheimer's Disease Diagnosis

Authors: Xiaoxiao Yang, Meiliang Liu, Yunfang Xu, Zijin Li, Zhengye Si, Xinyue Yang, Zhiwen Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.06157
Pdf URL: https://arxiv.org/pdf/2508.06157
Copy Paste: [[2508.06157]] An Interpretable Multi-Plane Fusion Framework With Kolmogorov-Arnold Network Guided Attention Enhancement for Alzheimer's Disease Diagnosis(https://arxiv.org/abs/2508.06157)
Keywords: generative
Abstract: Alzheimer's disease (AD) is a progressive neurodegenerative disorder that severely impairs cognitive function and quality of life. Timely intervention in AD relies heavily on early and precise diagnosis, which remains challenging due to the complex and subtle structural changes in the brain. Most existing deep learning methods focus only on a single plane of structural magnetic resonance imaging (sMRI) and struggle to accurately capture the complex and nonlinear relationships among pathological regions of the brain, thus limiting their ability to precisely identify atrophic features. To overcome these limitations, we propose an innovative framework, MPF-KANSC, which integrates multi-plane fusion (MPF) for combining features from the coronal, sagittal, and axial planes, and a Kolmogorov-Arnold Network-guided spatial-channel attention mechanism (KANSC) to more effectively learn and represent sMRI atrophy features. Specifically, the proposed model enables parallel feature extraction from multiple anatomical planes, thus capturing more comprehensive structural information. The KANSC attention mechanism further leverages a more flexible and accurate nonlinear function approximation technique, facilitating precise identification and localization of disease-related abnormalities. Experiments on the ADNI dataset confirm that the proposed MPF-KANSC achieves superior performance in AD diagnosis. Moreover, our findings provide new evidence of right-lateralized asymmetry in subcortical structural changes during AD progression, highlighting the model's promising interpretability.

Title: Fewer Denoising Steps or Cheaper Per-Step Inference: Towards Compute-Optimal Diffusion Model Deployment

Authors: Zhenbang Du, Yonggan Fu, Lifu Wang, Jiayi Qian, Xiao Luo, Yingyan (Celine) Lin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.06160
Pdf URL: https://arxiv.org/pdf/2508.06160
Copy Paste: [[2508.06160]] Fewer Denoising Steps or Cheaper Per-Step Inference: Towards Compute-Optimal Diffusion Model Deployment(https://arxiv.org/abs/2508.06160)
Keywords: generation, generative
Abstract: Diffusion models have shown remarkable success across generative tasks, yet their high computational demands challenge deployment on resource-limited platforms. This paper investigates a critical question for compute-optimal diffusion model deployment: Under a post-training setting without fine-tuning, is it more effective to reduce the number of denoising steps or to use a cheaper per-step inference? Intuitively, reducing the number of denoising steps increases the variability of the distributions across steps, making the model more sensitive to compression. In contrast, keeping more denoising steps makes the differences smaller, preserving redundancy, and making post-training compression more feasible. To systematically examine this, we propose PostDiff, a training-free framework for accelerating pre-trained diffusion models by reducing redundancy at both the input level and module level in a post-training manner. At the input level, we propose a mixed-resolution denoising scheme based on the insight that reducing generation resolution in early denoising steps can enhance low-frequency components and improve final generation fidelity. At the module level, we employ a hybrid module caching strategy to reuse computations across denoising steps. Extensive experiments and ablation studies demonstrate that (1) PostDiff can significantly improve the fidelity-efficiency trade-off of state-of-the-art diffusion models, and (2) to boost efficiency while maintaining decent generation fidelity, reducing per-step inference cost is often more effective than reducing the number of denoising steps. Our code is available at this https URL.
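The idea of reusing module computations across denoising steps can be sketched as follows. This is a generic caching illustration, not PostDiff's actual strategy: the abstract does not say which modules are cached or how often they are refreshed, so the `CachedModule` class, the `refresh_every` schedule, and the toy `expensive_block` are all assumptions.

```python
class CachedModule:
    """Wraps an expensive function and recomputes it only every `refresh_every` steps."""

    def __init__(self, fn, refresh_every=2):
        self.fn = fn
        self.refresh_every = refresh_every
        self.cache = None
        self.calls = 0  # number of real (non-cached) evaluations

    def __call__(self, x, step):
        # Recompute on refresh steps; otherwise reuse the (slightly stale) output.
        if self.cache is None or step % self.refresh_every == 0:
            self.cache = self.fn(x)
            self.calls += 1
        return self.cache


def expensive_block(x):
    # Stand-in for a costly network block.
    return [2.0 * v for v in x]


block = CachedModule(expensive_block, refresh_every=3)
inputs = [[float(step)] for step in range(6)]  # input changes at every step
outputs = [block(x, step) for step, x in enumerate(inputs)]
print(outputs, block.calls)
```

The cached outputs at steps 1, 2, 4, and 5 are approximations computed from an earlier input; the trade-off is fewer real evaluations per trajectory in exchange for that staleness.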

Title: Synthetic Data-Driven Multi-Architecture Framework for Automated Polyp Segmentation Through Integrated Detection and Mask Generation

Authors: Ojonugwa Oluwafemi Ejiga Peter, Akingbola Oluwapemiisin, Amalahu Chetachi, Adeniran Opeyemi, Fahmi Khalifa, Md Mahmudur Rahman
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.06170
Pdf URL: https://arxiv.org/pdf/2508.06170
Copy Paste: [[2508.06170]] Synthetic Data-Driven Multi-Architecture Framework for Automated Polyp Segmentation Through Integrated Detection and Mask Generation(https://arxiv.org/abs/2508.06170)
Keywords: generation
Abstract: Colonoscopy is a vital tool for the early diagnosis of colorectal cancer, which is one of the main causes of cancer-related mortality globally; hence, it is deemed an essential technique for the prevention and early detection of colorectal cancer. The research introduces a unique multidirectional architectural framework to automate polyp detection within colonoscopy images while helping resolve limited healthcare dataset sizes and annotation complexities. The research implements a comprehensive system that delivers synthetic data generation through Stable Diffusion enhancements together with detection and segmentation algorithms. This detection approach combines Faster R-CNN for initial object localization while the Segment Anything Model (SAM) refines the segmentation masks. The Faster R-CNN detection algorithm achieved a recall of 93.08% combined with a precision of 88.97% and an F1 score of 90.98%. SAM is then used to generate the image mask. The research evaluated five state-of-the-art segmentation models that included U-Net, PSPNet, FPN, LinkNet, and MANet using ResNet34 as a base model. The results demonstrate the superior performance of FPN with the highest scores of PSNR (7.205893) and SSIM (0.492381), while U-Net excels in recall (84.85%) and LinkNet shows balanced performance in IoU (64.20%) and Dice score (77.53%).
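As a quick consistency check, the detection figures reported in the abstract above can be plugged into the standard F1 definition, the harmonic mean of precision and recall. The helper name `f1_score` is our own; only the three reported numbers come from the abstract.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Reported values: precision 88.97%, recall 93.08%.
f1 = f1_score(0.8897, 0.9308)
print(f"{f1:.4f}")  # 0.9098
```

The result agrees with the reported F1 score of 90.98%.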

Title: MA-CBP: A Criminal Behavior Prediction Framework Based on Multi-Agent Asynchronous Collaboration

Authors: Cheng Liu, Daou Zhang, Tingxu Liu, Yuhan Wang, Jinyang Chen, Yuexuan Li, Xinying Xiao, Chenbo Xin, Ziru Wang, Weichao Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.06189
Pdf URL: https://arxiv.org/pdf/2508.06189
Copy Paste: [[2508.06189]] MA-CBP: A Criminal Behavior Prediction Framework Based on Multi-Agent Asynchronous Collaboration(https://arxiv.org/abs/2508.06189)
Keywords: generative
Abstract: With the acceleration of urbanization, criminal behavior in public scenes poses an increasingly serious threat to social security. Traditional anomaly detection methods based on feature recognition struggle to capture high-level behavioral semantics from historical information, while generative approaches based on Large Language Models (LLMs) often fail to meet real-time requirements. To address these challenges, we propose MA-CBP, a criminal behavior prediction framework based on multi-agent asynchronous collaboration. This framework transforms real-time video streams into frame-level semantic descriptions, constructs causally consistent historical summaries, and fuses adjacent image frames to perform joint reasoning over long- and short-term contexts. The resulting behavioral decisions include key elements such as event subjects, locations, and causes, enabling early warning of potential criminal activity. In addition, we construct a high-quality criminal behavior dataset that provides multi-scale language supervision, including frame-level, summary-level, and event-level semantic annotations. Experimental results demonstrate that our method achieves superior performance on multiple datasets and offers a promising solution for risk warning in urban public safety scenarios.

Title: PA-HOI: A Physics-Aware Human and Object Interaction Dataset

Authors: Ruiyan Wang, Lin Zuo, Zonghao Lin, Qiang Wang, Zhengxue Cheng, Rong Xie, Jun Ling, Li Song
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.06205
Pdf URL: https://arxiv.org/pdf/2508.06205
Copy Paste: [[2508.06205]] PA-HOI: A Physics-Aware Human and Object Interaction Dataset(https://arxiv.org/abs/2508.06205)
Keywords: generation
Abstract: The Human-Object Interaction (HOI) task explores the dynamic interactions between humans and objects in physical environments, providing essential biomechanical and cognitive-behavioral foundations for fields such as robotics, virtual reality, and human-computer interaction. However, existing HOI data sets focus on details of affordance, often neglecting the influence of physical properties of objects on human long-term motion. To bridge this gap, we introduce the PA-HOI Motion Capture dataset, which highlights the impact of objects' physical attributes on human motion dynamics, including human posture, moving velocity, and other motion characteristics. The dataset comprises 562 motion sequences of human-object interactions, with each sequence performed by subjects of different genders interacting with 35 3D objects that vary in size, shape, and weight. This dataset stands out by significantly extending the scope of existing ones for understanding how the physical attributes of different objects influence human posture, speed, motion scale, and interacting strategies. We further demonstrate the applicability of the PA-HOI dataset by integrating it with existing motion generation methods, validating its capacity to transfer realistic physical awareness.
摘要：人类对象相互作用（HOI）任务探讨了人与物理环境中对象之间的动态相互作用，为机器人技术，虚拟现实和人类计算机互动等领域提供了必不可少的生物力学和认知行为基础。但是，现有的HOI数据集集中在负担的详细信息上，通常忽略了物体对人类长期运动的影响。为了弥合这一差距，我们介绍了PA-HOI运动捕获数据集，该数据集突出了对象的物理属性对人类运动动力学的影响，包括人体姿势，运动速度和其他运动特征。数据集包含人类对象相互作用的562个运动序列，每个序列由不同性别的受试者与35个3D对象相互作用的受试者执行，这些对象的大小，形状和权重变化。该数据集通过显着扩展现有数据的范围来理解不同物体的物理属性如何影响人类的姿势，速度，运动规模和相互作用策略如何脱颖而出。我们进一步通过将PA-HOI数据集与现有运动生成方法集成在一起，验证其传递现实的物理意识的能力来证明PA-HOI数据集的适用性。

Title: Towards Unified Image Deblurring using a Mixture-of-Experts Decoder

Authors: Daniel Feijoo, Paula Garrido-Mellado, Jaesung Rim, Alvaro Garcia, Marcos V. Conde
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.06228
Pdf URL: https://arxiv.org/pdf/2508.06228
Copy Paste: [[2508.06228]] Towards Unified Image Deblurring using a Mixture-of-Experts Decoder(https://arxiv.org/abs/2508.06228)
Keywords: restoration
Abstract: Image deblurring, removing blurring artifacts from images, is a fundamental task in computational photography and low-level computer vision. Existing approaches focus on specialized solutions tailored to particular blur types, so they generalize poorly. As a result, covering several blur types requires multiple models, which is impractical in many real scenarios. In this paper, we introduce the first all-in-one deblurring method capable of efficiently restoring images affected by diverse blur degradations, including global motion, local motion, blur in low-light conditions, and defocus blur. We propose a mixture-of-experts (MoE) decoding module, which dynamically routes image features based on the recognized blur degradation, enabling precise and efficient restoration in an end-to-end manner. Our unified approach not only achieves performance comparable to dedicated task-specific models, but also demonstrates remarkable robustness and generalization capabilities on unseen blur degradation scenarios.
摘要：图像去除图像从图像中删除模糊的人工制品，是计算摄影和低级计算机视觉中的一项基本任务。现有方法集中于针对特定模糊类型的专门解决方案，因此，这些解决方案缺乏概括。当前方法中的这种限制意味着需要多种模型涵盖几种模糊类型，这在许多实际情况下都是不切实际的。在本文中，我们介绍了第一种能够有效恢复受不同模糊降解影响的图像的全合一脱毛方法，包括全球运动，局部运动，低光条件下的模糊和Defocus Blur。我们提出了一个混合物（MOE）解码模块，该模块基于公认的模糊降解，动态地路由图像特征，以端到端的方式实现精确有效的恢复。我们的统一方法不仅可以达到与专用特定于任务的模型相当的性能，而且还表现出在看不见的模糊退化场景上的显着鲁棒性和概括能力。
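
The routing idea in the deblurring abstract above can be illustrated with a generic soft-gated mixture of experts. This is only a minimal sketch, not the authors' decoder: the names `softmax`, `moe_decode`, the scalar "feature", and the toy experts are all hypothetical, and the paper may use a different gating scheme.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of gate logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def moe_decode(feature, gate_logits, experts):
    """Weight each expert's output by its gate probability and sum (hypothetical)."""
    weights = softmax(gate_logits)
    return sum(w * expert(feature) for w, expert in zip(weights, experts))

# Toy experts standing in for blur-specific decoder branches (assumption).
experts = [lambda x: x, lambda x: 3.0 * x]
print(moe_decode(2.0, [0.0, 0.0], experts))
```

In a real decoder the gate logits would be predicted from the image features, so the mixture shifts toward the branch suited to the recognized blur type.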

Title: Synthetic Data Generation and Differential Privacy using Tensor Networks' Matrix Product States (MPS)

Authors: Alejandro Moreno R., Desale Fentaw, Samuel Palmer, Raúl Salles de Padua, Ninad Dixit, Samuel Mugel, Roman Orús, Manuel Radons, Josef Menter, Ali Abedi
Subjects: cs.LG, cs.AI, cs.CR, quant-ph
Abstract URL: https://arxiv.org/abs/2508.06251
Pdf URL: https://arxiv.org/pdf/2508.06251
Copy Paste: [[2508.06251]] Synthetic Data Generation and Differential Privacy using Tensor Networks' Matrix Product States (MPS)(https://arxiv.org/abs/2508.06251)
Keywords: generation, generative
Abstract: Synthetic data generation is a key technique in modern artificial intelligence, addressing data scarcity, privacy constraints, and the need for diverse datasets in training robust models. In this work, we propose a method for generating privacy-preserving high-quality synthetic tabular data using Tensor Networks, specifically Matrix Product States (MPS). We benchmark the MPS-based generative model against state-of-the-art models such as CTGAN, VAE, and PrivBayes, focusing on both fidelity and privacy-preserving capabilities. To ensure differential privacy (DP), we integrate noise injection and gradient clipping during training, enabling privacy guarantees via Rényi Differential Privacy accounting. Across multiple metrics analyzing data fidelity and downstream machine learning task performance, our results show that MPS outperforms classical models, particularly under strict privacy constraints. This work highlights MPS as a promising tool for privacy-aware synthetic data generation. By combining the expressive power of tensor network representations with formal privacy mechanisms, the proposed approach offers an interpretable and scalable alternative for secure data sharing. Its structured design facilitates integration into sensitive domains where both data quality and confidentiality are critical.
摘要：合成数据生成是现代人工智能中的一项关键技术，可以解决数据稀缺，隐私限制以及对培训强大模型中各种数据集的需求。在这项工作中，我们提出了一种使用张量网络，特别是矩阵乘积状态（MPS）生成隐私保存高质量合成表格数据的方法。我们针对Ctgan，Vae和Privbayes等最先进的模型进行基于MPS的生成模型，重点介绍了保真度和隐私能力。为了确保差异隐私（DP），我们在培训期间整合了噪声注入和梯度剪辑，从而通过Rényi差异隐私会计来实现隐私保证。在分析数据保真度和下游机器学习任务性能的多个指标中，我们的结果表明，MPS优于经典模型，尤其是在严格的隐私约束下。这项工作突出了国会议员是一种有前途的合成数据生成的有前途的工具。通过将张量网络表示的表达能力与形式的隐私机制相结合，提出的方法为安全数据共享提供了可解释且可扩展的替代方案。它的结构化设计有助于将数据质量和机密性至关重要的敏感领域集成。
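
The abstract above mentions noise injection and gradient clipping during training. A minimal sketch of that generic recipe (per-example clipping followed by Gaussian noise, as in DP-SGD) is given here. It is not the authors' MPS training code: `clip_gradient`, `private_mean_gradient`, and all parameter values are illustrative assumptions, and the privacy accounting itself is omitted.

```python
import math
import random

def clip_gradient(grad, max_norm):
    """Scale a per-example gradient so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]

def private_mean_gradient(per_example_grads, max_norm, noise_multiplier, rng):
    """Clip each gradient, sum them, add Gaussian noise, then average."""
    dim = len(per_example_grads[0])
    total = [0.0] * dim
    for grad in per_example_grads:
        clipped = clip_gradient(grad, max_norm)
        total = [t + c for t, c in zip(total, clipped)]
    sigma = noise_multiplier * max_norm  # noise scale tied to the clipping bound
    noisy = [t + rng.gauss(0.0, sigma) for t in total]
    return [n / len(per_example_grads) for n in noisy]

rng = random.Random(0)  # fixed seed only to make the demo repeatable
grads = [[3.0, 4.0], [0.3, 0.4]]  # made-up per-example gradients
print(private_mean_gradient(grads, max_norm=1.0, noise_multiplier=1.1, rng=rng))
```

Clipping bounds each example's influence on the update, which is what lets the added noise translate into a formal privacy guarantee.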

Title: SIFThinker: Spatially-Aware Image Focus for Visual Reasoning

Authors: Zhangquan Chen, Ruihui Zhao, Chuwei Luo, Mingze Sun, Xinlei Yu, Yangyang Kang, Ruqi Huang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.06259
Pdf URL: https://arxiv.org/pdf/2508.06259
Copy Paste: [[2508.06259]] SIFThinker: Spatially-Aware Image Focus for Visual Reasoning(https://arxiv.org/abs/2508.06259)
Keywords: generation
Abstract: Current multimodal large language models (MLLMs) still face significant challenges in complex visual tasks (e.g., spatial understanding, fine-grained perception). Prior methods have tried to incorporate visual reasoning, but they fail to leverage attention correction with spatial cues to iteratively refine their focus on prompt-relevant regions. In this paper, we introduce SIFThinker, a spatially-aware "think-with-images" framework that mimics human visual perception. Specifically, SIFThinker enables attention correcting and image region focusing by interleaving depth-enhanced bounding boxes and natural language. Our contributions are twofold. First, we introduce a reverse-expansion-forward-inference strategy that facilitates the generation of interleaved image-text chains of thought for process-level supervision, which in turn leads to the construction of the SIF-50K dataset. Second, we propose GRPO-SIF, a reinforced training paradigm that integrates depth-informed visual grounding into a unified reasoning pipeline, teaching the model to dynamically correct and focus on prompt-relevant regions. Extensive experiments demonstrate that SIFThinker outperforms state-of-the-art methods in spatial understanding and fine-grained visual perception, while maintaining strong general capabilities, highlighting the effectiveness of our method.
摘要：当前的多模式大型语言模型（MLLM）在复杂的视觉任务（例如，空间理解，细粒度的感知）中仍然面临重大挑战。先前的方法试图结合视觉推理，但是，它们无法利用注意力校正和空间线索，以迭代地完善他们对及时相关区域的关注。在本文中，我们介绍了Sifthinker，这是一种模仿人类视觉感知的空间意识的“思想图像”框架。具体而言，Sifthinker可以通过交织深度增强的边界框和自然语言来纠正和图像区域。我们的贡献是双重的：首先，我们引入了一种反向扩展的推论策略，该策略有助于产生交错的思想链链，以进行过程级别的监督，进而导致SIF-50K数据集的构建。此外，我们提出了GRPO-SIF，这是一种加强的训练范式，将深度信息的视觉接地整合到统一的推理管道中，教导该模型动态纠正并专注于迅速相关的区域。广泛的实验表明，Sifthinker在空间理解和细粒度的视觉感知中的表现优于最先进的方法，同时保持了强大的一般能力，强调了我们方法的有效性。

Title: OM2P: Offline Multi-Agent Mean-Flow Policy

Authors: Zhuoran Li, Xun Wang, Hai Zhong, Longbo Huang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.06269
Pdf URL: https://arxiv.org/pdf/2508.06269
Copy Paste: [[2508.06269]] OM2P: Offline Multi-Agent Mean-Flow Policy(https://arxiv.org/abs/2508.06269)
Keywords: generation, generative
Abstract: Generative models, especially diffusion and flow-based models, have shown promise in offline multi-agent reinforcement learning. However, integrating powerful generative models into this framework poses unique challenges. In particular, diffusion and flow-based policies suffer from low sampling efficiency due to their iterative generation processes, making them impractical in time-sensitive or resource-constrained settings. To tackle these difficulties, we propose OM2P (Offline Multi-Agent Mean-Flow Policy), a novel offline MARL algorithm to achieve efficient one-step action sampling. To address the misalignment between generative objectives and reward maximization, we introduce a reward-aware optimization scheme that integrates a carefully-designed mean-flow matching loss with Q-function supervision. Additionally, we design a generalized timestep distribution and a derivative-free estimation strategy to reduce memory overhead and improve training stability. Empirical evaluations on Multi-Agent Particle and MuJoCo benchmarks demonstrate that OM2P achieves superior performance, with up to a 3.8x reduction in GPU memory usage and up to a 10.8x speed-up in training time. Our approach is the first to successfully integrate a mean-flow model into offline MARL, paving the way for practical and scalable generative policies in cooperative multi-agent settings.
摘要：生成模型，尤其是扩散和基于流的模型，在离线多代理增强学习方面已经有希望。但是，将强大的生成模型整合到该框架中会带来独特的挑战。特别是，由于其迭代生成过程，扩散和基于流动的策略的采样效率低，使得它们在时间敏感或资源约束的设置中不切实际。为了解决这些困难，我们提出了OM2P（离线多代理流量政策），这是一种新颖的离线MARL算法，以实现有效的一步动作采样。为了解决生成目标与奖励最大化之间的不对符，我们引入了奖励意识到的优化方案，该方案将精心设计的平均匹配损失与Q功能监督相结合。此外，我们设计了广义的时间段分布和无衍生的估计策略，以减少内存开销并提高训练稳定性。对多代理粒子和Mujoco基准测试的经验评估表明，OM2P的性能卓越，GPU存储器的使用情况下降了3.8倍，并且在训练时间中的速度高达10.8倍。我们的方法代表了第一个成功将均值模型模型整合到离线MAL中的方法，为合作多代理设置中实用和可扩展的生成策略铺平了道路。

Title: Can Diffusion Models Bridge the Domain Gap in Cardiac MR Imaging?

Authors: Xin Ci Wong, Duygu Sarikaya, Kieran Zucker, Marc De Kamps, Nishant Ravikumar
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.06327
Pdf URL: https://arxiv.org/pdf/2508.06327
Copy Paste: [[2508.06327]] Can Diffusion Models Bridge the Domain Gap in Cardiac MR Imaging?(https://arxiv.org/abs/2508.06327)
Keywords: generative
Abstract: Magnetic resonance (MR) imaging, including cardiac MR, is prone to domain shift due to variations in imaging devices and acquisition protocols. This challenge limits the deployment of trained AI models in real-world scenarios, where performance degrades on unseen domains. Traditional solutions involve increasing the size of the dataset through ad-hoc image augmentation or additional online training/transfer learning, which have several limitations. Synthetic data offers a promising alternative, but anatomical/structural consistency constraints limit the effectiveness of generative models in creating image-label pairs. To address this, we propose a diffusion model (DM) trained on a source domain that generates synthetic cardiac MR images that resemble a given reference. The synthetic data maintains spatial and structural fidelity, ensuring similarity to the source domain and compatibility with the segmentation mask. We assess the utility of our generative approach in multi-centre cardiac MR segmentation, using the 2D nnU-Net, 3D nnU-Net and vanilla U-Net segmentation networks. We explore domain generalisation, where domain-invariant segmentation models are trained on synthetic source domain data, and domain adaptation, where we shift target domain data towards the source domain using the DM. Both strategies significantly improved segmentation performance on data from an unseen target domain, in terms of surface-based metrics (Welch's t-test, p < 0.01), compared to training segmentation models on real data alone. The proposed method reduces the need for transfer learning or online training to address domain shift challenges in cardiac MR image analysis, which is especially useful in data-scarce settings.
摘要：包括心脏MR在内的磁共振成像（MR）成像由于成像设备和采集方案的变化而容易出现域移位。这项挑战限制了在现实世界中的训练有素的AI模型的部署，在现实世界中，性能会在看不见的域上降低。传统解决方案涉及通过临时图像增加或其他在线培训/转移学习来增加数据集的大小，这些培训/转移学习有多个限制。合成数据提供了一种有希望的替代方案，但是解剖/结构一致性约束限制了生成模型在创建图像标签对的有效性。为了解决这个问题，我们提出了一个在源域上训练的扩散模型（DM），该模型生成了类似于给定参考的合成心脏MR图像。合成数据保持空间和结构保真度，确保与源域的相似性以及与分割面罩的兼容性。我们使用2D NNU-NET，3D NNU-NET和VANILLA U-NET分割网络评估了多中心心脏MR分割中生成方法的实用性。我们探索域的概括，其中，在合成源域数据和域的适应性上对域不变分割模型进行了训练，其中，我们使用DM将目标域数据转移到源域。与仅实际数据的训练分割模型相比，这两种策略都大大提高了来自看不见的目标域的数据的细分性能（Welch的t检验，p <0.01）。提出的方法可以改善转移学习或在线培训的需求，以解决心脏MR图像分析中的域转移挑战，在数据筛选设置中尤其有用。

Title: Structural Equation-VAE: Disentangled Latent Representations for Tabular Data

Authors: Ruiyu Zhang, Ce Zhao, Xin Zhao, Lin Nie, Wai-Fung Lam
Subjects: cs.LG, cs.AI, cs.NE
Abstract URL: https://arxiv.org/abs/2508.06347
Pdf URL: https://arxiv.org/pdf/2508.06347
Copy Paste: [[2508.06347]] Structural Equation-VAE: Disentangled Latent Representations for Tabular Data(https://arxiv.org/abs/2508.06347)
Keywords: generative
Abstract: Learning interpretable latent representations from tabular data remains a challenge in deep generative modeling. We introduce SE-VAE (Structural Equation-Variational Autoencoder), a novel architecture that embeds measurement structure directly into the design of a variational autoencoder. Inspired by structural equation modeling, SE-VAE aligns latent subspaces with known indicator groupings and introduces a global nuisance latent to isolate construct-specific confounding variation. This modular architecture enables disentanglement through design rather than through statistical regularizers alone. We evaluate SE-VAE on a suite of simulated tabular datasets and benchmark its performance against a series of leading baselines using standard disentanglement metrics. SE-VAE consistently outperforms alternatives in factor recovery, interpretability, and robustness to nuisance variation. Ablation results reveal that architectural structure, rather than regularization strength, is the key driver of performance. SE-VAE offers a principled framework for white-box generative modeling in scientific and social domains where latent constructs are theory-driven and measurement validity is essential.
摘要：从表格数据中学习可解释的潜在表示仍然是深层生成建模的挑战。我们介绍了Se-vae（结构方程式自动编码器），这是一种新型架构，将测量结构直接嵌入变异自动编码器的设计中。受结构方程建模的启发，SE-VAE将潜在子空间与已知的指标分组对齐，并引入了一个全球滋扰潜在的，可隔离构造特定的混杂变化。这种模块化体系结构可以通过设计而不是单独的统计正规化来实现分解。我们在模拟表格数据集的套件上评估了SE-VAE，并使用标准分解指标对一系列领先基准进行了基准测试。 SE-VAE始终在因子恢复性，可解释性和鲁棒性方面胜过替代方案。消融结果表明，建筑结构而不是正则化强度是性能的关键驱动力。 SE-VAE为科学和社会领域中的白盒生成建模提供了一个原则的框架，在该领域中，潜在的结构是理论驱动的，测量有效性至关重要。

Title: Aligning Effective Tokens with Video Anomaly in Large Language Models

Authors: Yingxian Chen, Jiahui Liu, Ruifan Di, Yanwei Li, Chirui Chang, Shizhen Zhao, Wilton W.T. Fok, Xiaojuan Qi, Yik-Chung Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.06350
Pdf URL: https://arxiv.org/pdf/2508.06350
Copy Paste: [[2508.06350]] Aligning Effective Tokens with Video Anomaly in Large Language Models(https://arxiv.org/abs/2508.06350)
Keywords: generation
Abstract: Understanding abnormal events in videos is a vital and challenging task that has garnered significant attention in a wide range of applications. Although current video understanding Multi-modal Large Language Models (MLLMs) are capable of analyzing general videos, they often struggle to handle anomalies due to the spatial and temporal sparsity of abnormal events, where the redundant information always leads to suboptimal outcomes. To address these challenges, exploiting the representation and generalization capabilities of Vision Language Models (VLMs) and Large Language Models (LLMs), we propose VA-GPT, a novel MLLM designed for summarizing and localizing abnormal events in various videos. Our approach efficiently aligns effective tokens between visual encoders and LLMs through two key proposed modules: Spatial Effective Token Selection (SETS) and Temporal Effective Token Generation (TETG). These modules enable our model to effectively capture and analyze both spatial and temporal information associated with abnormal events, resulting in more accurate responses and interactions. Furthermore, we construct an instruction-following dataset specifically for fine-tuning video-anomaly-aware MLLMs, and introduce a cross-domain evaluation benchmark based on the XD-Violence dataset. Our proposed method outperforms existing state-of-the-art methods on various benchmarks.
摘要：了解视频中的异常事件是一项至关重要且具有挑战性的任务，在广泛的应用中引起了极大的关注。尽管当前的视频理解多模式大型语言模型（MLLM）能够分析一般视频，但由于异常事件的空间和时间稀疏性，它们通常很难处理异常，在这种情况下，冗余信息总是会导致次优结果。为了应对这些挑战，利用Vison语言模型（VLM）和大型语言模型（LLMS）的表示和概括能力，我们提出了一种旨在在各种视频中总结和本地化异常事件的小型MLLM。我们的方法通过两个关键提出的模块有效地将视觉编码器和LLM之间的有效令牌对准：空间有效的令牌选择（集合）和时间有效令牌产生（TETG）。这些模块使我们的模型能够有效捕获和分析与异常事件相关的空间和时间信息，从而产生更准确的响应和相互作用。此外，我们构建了一个专门针对微调视频 - 反对感知的MLLM的指令关注数据集，并基于XD-Violence数据集引入了跨域评估基准。我们提出的方法在各种基准测试中都优于现有的最新方法。

Title: ActivityDiff: A diffusion model with Positive and Negative Activity Guidance for De Novo Drug Design

Authors: Renyi Zhou, Huimin Zhu, Jing Tang, Min Li
Subjects: cs.LG, cs.AI, q-bio.BM
Abstract URL: https://arxiv.org/abs/2508.06364
Pdf URL: https://arxiv.org/pdf/2508.06364
Copy Paste: [[2508.06364]] ActivityDiff: A diffusion model with Positive and Negative Activity Guidance for De Novo Drug Design(https://arxiv.org/abs/2508.06364)
Keywords: generation, generative
Abstract: Achieving precise control over a molecule's biological activity, encompassing targeted activation/inhibition, cooperative multi-target modulation, and off-target toxicity mitigation, remains a critical challenge in de novo drug design. However, existing generative methods primarily focus on producing molecules with a single desired activity, lacking integrated mechanisms for the simultaneous management of multiple intended and unintended molecular interactions. Here, we propose ActivityDiff, a generative approach based on the classifier-guidance technique of diffusion models. It leverages separately trained drug-target classifiers for both positive and negative guidance, enabling the model to enhance desired activities while minimizing harmful off-target effects. Experimental results show that ActivityDiff effectively handles essential drug design tasks, including single-/dual-target generation, fragment-constrained dual-target design, selective generation to enhance target specificity, and reduction of off-target effects. These results demonstrate the effectiveness of classifier-guided diffusion in balancing efficacy and safety in molecular design. Overall, our work introduces a novel paradigm for achieving integrated control over molecular activity, and provides ActivityDiff as a versatile and extensible framework.
摘要：对分子的生物活性激活/抑制作用，合作多目标调节和靶向降低毒性降低毒性降低剂量的精确控制，这是从新手药物设计中的关键挑战。但是，现有的生成方法主要集中于产生具有单个所需活性的分子，缺乏用于同时管理多个预期和意外分子相互作用的综合机制。在这里，我们提出了ActivityDiff，这是一种基于扩散模型的分类器引导技术的生成方法。它分别利用了训练有素的药物目标分类器来获得积极和负指导，从而使模型能够增强所需的活动，同时最大程度地减少有害的脱离目标效应。实验结果表明，ActiveDiff有效地处理基本的药物设计任务，包括单/双目标生成，碎片受限的双目标设计，选择性生成以增强目标特异性以及降低脱靶效应。这些结果证明了分类器引导扩散在平衡分子设计中的功效和安全性方面的有效性。总体而言，我们的工作引入了一种新型的范式，以实现对分子活性的综合控制，并将活动档作为一种多功能且可扩展的框架。
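
The positive and negative guidance described in the ActivityDiff abstract above can be pictured with the standard classifier-guidance update, where classifier log-probability gradients are added to (or subtracted from) the model's score. This is a sketch of that general technique under stated assumptions, not the paper's exact formulation: `guided_score`, the guidance scales, and the toy gradient vectors are hypothetical.

```python
def guided_score(base_score, pos_grads, neg_grads, pos_scale, neg_scale):
    """Combine an unconditional score with classifier log-probability gradients.

    pos_grads: gradients of log p(desired activity | x), pushed toward.
    neg_grads: gradients of log p(off-target activity | x), pushed away from.
    """
    out = list(base_score)
    for g in pos_grads:
        out = [o + pos_scale * gi for o, gi in zip(out, g)]
    for g in neg_grads:
        out = [o - neg_scale * gi for o, gi in zip(out, g)]
    return out

# Toy 2-D example with one desired target and one off-target classifier.
print(guided_score([0.0, 1.0], [[1.0, 0.0]], [[0.0, 2.0]], pos_scale=2.0, neg_scale=0.5))
```

Because each classifier contributes its own term, several targets and off-targets can be combined at sampling time without retraining the generative model.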

Title: End-to-End Text-to-SQL with Dataset Selection: Leveraging LLMs for Adaptive Query Generation

Authors: Anurag Tripathi, Vaibhav Patle, Abhinav Jain, Ayush Pundir, Sairam Menon, Ajeet Kumar Singh
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.06387
Pdf URL: https://arxiv.org/pdf/2508.06387
Copy Paste: [[2508.06387]] End-to-End Text-to-SQL with Dataset Selection: Leveraging LLMs for Adaptive Query Generation(https://arxiv.org/abs/2508.06387)
Keywords: generation
Abstract: Text-to-SQL bridges the gap between natural language and structured database language, thus allowing non-technical users to easily query databases. Traditional approaches model text-to-SQL as a direct translation task, where a given Natural Language Query (NLQ) is mapped to an SQL command. Recent advances in large language models (LLMs) have significantly improved translation accuracy, but these methods all require that the target database is pre-specified. This becomes problematic in scenarios with multiple extensive databases, where identifying the correct database becomes a crucial yet overlooked step. In this paper, we propose a three-stage end-to-end text-to-SQL framework to identify the user's intended database before generating SQL queries. Our approach leverages LLMs and prompt engineering to extract implicit information from natural language queries (NLQs) in the form of a ruleset. We then train a large db_id prediction model, which includes a RoBERTa-based finetuned encoder, to predict the correct database identifier (db_id) based on both the NLQ and the LLM-generated rules. Finally, we refine the generated SQL by using critic agents to correct errors. Experimental results demonstrate that our framework outperforms the current state-of-the-art models in both database intent prediction and SQL generation accuracy.
摘要：文本到SQL桥接自然语言和结构化数据库语言之间的差距，从而使非技术用户可以轻松查询数据库。传统方法将文本到SQL作为直接翻译任务，其中给定的自然语言查询（NLQ）映射到SQL命令。大型语言模型（LLM）的最新进展已显着提高了翻译精度，但是，这些方法都要求预先指定目标数据库。在有多个广泛的数据库的情况下，这变得有问题，其中识别正确的数据库成为至关重要但被忽略的步骤。在本文中，我们提出了一个三阶段的端到端文本到SQL框架，以在生成SQL查询之前识别用户的预期数据库。我们的方法利用LLM并促使工程以规则集的形式从自然语言查询（NLQ）中提取隐式信息。然后，我们训练一个大型DB \ _ID预测模型，其中包括基于Roberta的Fineted Encoder，以根据NLQ和LLM生成的规则来预测正确的数据库标识符（DB \ _ID）。最后，我们通过使用评论家纠正错误来完善生成的SQL。实验结果表明，我们的框架在数据库意图预测和SQL生成准确性中的当前最新模型的表现优于当前的最新模型。

Title: FVGen: Accelerating Novel-View Synthesis with Adversarial Video Diffusion Distillation

Authors: Wenbin Teng, Gonglin Chen, Haiwei Chen, Yajie Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.06392
Pdf URL: https://arxiv.org/pdf/2508.06392
Copy Paste: [[2508.06392]] FVGen: Accelerating Novel-View Synthesis with Adversarial Video Diffusion Distillation(https://arxiv.org/abs/2508.06392)
Keywords: generative
Abstract: Recent progress in 3D reconstruction has enabled realistic 3D models from dense image captures, yet challenges persist with sparse views, often leading to artifacts in unseen areas. Recent works leverage Video Diffusion Models (VDMs) to generate dense observations, filling the gaps when only sparse views are available for 3D reconstruction tasks. A significant limitation of these methods is their slow sampling speed when using VDMs. In this paper, we present FVGen, a novel framework that addresses this challenge by enabling fast novel view synthesis using VDMs in as few as four sampling steps. We propose a novel video diffusion model distillation method that distills a multi-step denoising teacher model into a few-step denoising student model using Generative Adversarial Networks (GANs) and softened reverse KL-divergence minimization. Extensive experiments on real-world datasets show that, compared to previous works, our framework generates the same number of novel views with similar (or even better) visual quality while reducing sampling time by more than 90%. FVGen significantly improves time efficiency for downstream reconstruction tasks, particularly when working with sparse input views (more than 2) where pre-trained VDMs need to be run multiple times to achieve better spatial coverage.
摘要：3D重建中的最新进展使图像捕获的现实3D模型能够实现现实的3D模型，但挑战仍然存在稀疏的观点，通常会导致看不见的地区的工件。最近的作品利用视频扩散模型（VDM）来产生密集的观察结果，在仅用于3D重建任务的稀疏视图时填补了空白。这些方法的一个重要局限性是使用VDMS时的缓慢采样速度。在本文中，我们提出了FVGEN，这是一个新颖的框架，通过在短短四个采样步骤中使用VDMS启用快速的新颖视图综合来解决这一挑战。我们提出了一种新型的视频扩散模型蒸馏方法，该方法将多步骤的教师模型提炼成使用生成的对抗网络（GAN）（GAN）并软化的反向KL-Divergence最小化的几步降级学生模型。对现实世界数据集的广泛实验表明，与以前的作品相比，我们的框架生成了相同数量的新型视图，具有相似的视觉质量（甚至更好），同时将采样时间减少了90％以上。 FVGEN可显着提高下游重建任务的时间效率，尤其是在使用稀疏输入视图（超过2个）时，需要多次运行预训练的VDM，以实现更好的空间覆盖范围。

Title: A Classification-Aware Super-Resolution Framework for Ship Targets in SAR Imagery

Authors: Ch Muhammad Awais, Marco Reggiannini, Davide Moroni, Oktay Karakus
Subjects: cs.CV, cs.AI, eess.IV
Abstract URL: https://arxiv.org/abs/2508.06407
Pdf URL: https://arxiv.org/pdf/2508.06407
Copy Paste: [[2508.06407]] A Classification-Aware Super-Resolution Framework for Ship Targets in SAR Imagery(https://arxiv.org/abs/2508.06407)
Keywords: super-resolution
Abstract: High-resolution imagery plays a critical role in improving the performance of visual recognition tasks such as classification, detection, and segmentation. In many domains, including remote sensing and surveillance, low-resolution images can limit the accuracy of automated analysis. To address this, super-resolution (SR) techniques have been widely adopted to attempt to reconstruct high-resolution images from low-resolution inputs. Related traditional approaches focus solely on enhancing image quality based on pixel-level metrics, leaving the relationship between super-resolved image fidelity and downstream classification performance largely underexplored. This raises a key question: can integrating classification objectives directly into the super-resolution process further improve classification accuracy? In this paper, we try to respond to this question by investigating the relationship between super-resolution and classification through the deployment of a specialised algorithmic strategy. We propose a novel methodology that increases the resolution of synthetic aperture radar imagery by optimising loss functions that account for both image quality and classification performance. Our approach improves image quality, as measured by scientifically ascertained image quality indicators, while also enhancing classification accuracy.
摘要：高分辨率图像在改善视觉识别任务（例如分类，检测和分割）的性能中起着关键作用。在许多领域（包括遥感和监视）中，低分辨率图像可以限制自动分析的准确性。为了解决这个问题，超分辨率（SR）技术已被广泛采用，以尝试从低分辨率输入中重建高分辨率图像。相关的传统方法仅着眼于基于像素级指标增强图像质量，从而使超级分辨的图像保真度与下游分类性能之间的关系在很大程度上尚未得到充分展望。这提出了一个关键问题：可以将分类目标直接整合到超分辨率过程中进一步提高分类准确性吗？在本文中，我们试图通过部署专业算法策略来研究超分辨率和分类之间的关系来回答这个问题。我们提出了一种新颖的方法，该方法可以通过优化构成图像质量和分类性能的损失函数来增加合成孔径雷达图像的分辨率。通过科学确定的图像质量指标衡量，我们的方法改善了图像质量，同时还提高了分类精度。

Title: SPARSE Data, Rich Results: Few-Shot Semi-Supervised Learning via Class-Conditioned Image Translation

Authors: Guido Manni, Clemente Lauretti, Loredana Zollo, Paolo Soda
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.06429
Pdf URL: https://arxiv.org/pdf/2508.06429
Copy Paste: [[2508.06429]] SPARSE Data, Rich Results: Few-Shot Semi-Supervised Learning via Class-Conditioned Image Translation(https://arxiv.org/abs/2508.06429)
Keywords: generation
Abstract: Deep learning has revolutionized medical imaging, but its effectiveness is severely limited by insufficient labeled training data. This paper introduces a novel GAN-based semi-supervised learning framework specifically designed for low labeled-data regimes, evaluated across settings with 5 to 50 labeled samples per class. Our approach integrates three specialized neural networks -- a generator for class-conditioned image translation, a discriminator for authenticity assessment and classification, and a dedicated classifier -- within a three-phase training framework. The method alternates between supervised training on limited labeled data and unsupervised learning that leverages abundant unlabeled images through image-to-image translation rather than generation from noise. We employ ensemble-based pseudo-labeling that combines confidence-weighted predictions from the discriminator and classifier with temporal consistency through exponential moving averaging, enabling reliable label estimation for unlabeled data. Comprehensive evaluation across eleven MedMNIST datasets demonstrates that our approach achieves statistically significant improvements over six state-of-the-art GAN-based semi-supervised methods, with particularly strong performance in the extreme 5-shot setting where the scarcity of labeled data is most challenging. The framework maintains its superiority across all evaluated settings (5, 10, 20, and 50 shots per class). Our approach offers a practical solution for medical imaging applications where annotation costs are prohibitive, enabling robust classification performance even with minimal labeled data. Code is available at this https URL.
摘要：深度学习彻底改变了医学成像，但其有效性受到标记培训数据不足的严重限制。本文介绍了一个新型的基于GAN的半监督学习框架，专门为低标记的DATA制度设计，在各个类别的设置之间进行了评估，每个类别都有5至50个标记的样品。我们的方法将三个专业的神经网络（一种用于类调节图像翻译的生成器，一个真实性评估和分类的歧视器以及专门的分类器）集成了三相培训框架。该方法在有限标记的数据和无监督的学习中交替使用，这些培训通过图像到图像的翻译而不是噪声产生，从而利用了丰富的未标记图像。我们采用基于合奏的伪标记，通过指数移动平均构架结合了来自歧视者和分类器的置信加权预测与时间一致性，从而实现了无标记数据的可靠标签估计。跨11个MedMnist数据集进行的全面评估表明，我们的方法在六个基于GAN的最先进的半监督方法上取得了统计学上的显着改进，在极端的5-Shot环境中，标记数据稀缺的极端性能是最具挑战性的。该框架在所有评估的设置中保持其优势（每类5、10、20和50杆）。我们的方法为医学成像应用提供了一种实用的解决方案，在该应用中，注释成本是过于良好的，即使具有最小的标记数据，也可以使分类的性能稳健。代码可在此HTTPS URL上找到。
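
The pseudo-labeling step in the abstract above combines confidence-weighted predictions from two networks and smooths them over time with an exponential moving average. The sketch below shows one plausible reading of that description. The weighting rule (each model weighted by its maximum class probability), the decay, the threshold, and all function names are assumptions rather than details taken from the paper.

```python
def ensemble_probs(p_disc, p_clf):
    """Confidence-weighted average of discriminator and classifier probabilities."""
    w_d, w_c = max(p_disc), max(p_clf)  # assumed confidence = max class probability
    return [(w_d * d + w_c * c) / (w_d + w_c) for d, c in zip(p_disc, p_clf)]

def ema_update(previous, current, decay):
    """Exponential moving average of class probabilities across training steps."""
    return [decay * p + (1.0 - decay) * c for p, c in zip(previous, current)]

def pseudo_label(probs, threshold):
    """Return the argmax class if it clears the threshold, otherwise None."""
    best = max(range(len(probs)), key=lambda k: probs[k])
    return best if probs[best] >= threshold else None

running = [0.5, 0.5]  # uninformative starting estimate for one unlabeled image
for p_disc, p_clf in [([0.8, 0.2], [0.6, 0.4]), ([0.9, 0.1], [0.7, 0.3])]:
    running = ema_update(running, ensemble_probs(p_disc, p_clf), decay=0.9)
print(running, pseudo_label(running, threshold=0.7))
```

The moving average keeps a single noisy prediction from flipping a pseudo-label, which matters most when only a few labeled samples per class are available.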

Title: WGAST: Weakly-Supervised Generative Network for Daily 10 m Land Surface Temperature Estimation via Spatio-Temporal Fusion

Authors: Sofiane Bouaziz, Adel Hafiane, Raphael Canals, Rachid Nedjai
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2508.06485
Pdf URL: https://arxiv.org/pdf/2508.06485
Copy Paste: [[2508.06485]] WGAST: Weakly-Supervised Generative Network for Daily 10 m Land Surface Temperature Estimation via Spatio-Temporal Fusion(https://arxiv.org/abs/2508.06485)
Keywords: generative
Abstract: Urbanization, climate change, and agricultural stress are increasing the demand for precise and timely environmental monitoring. Land Surface Temperature (LST) is a key variable in this context and is retrieved from remote sensing satellites. However, these systems face a trade-off between spatial and temporal resolution. While spatio-temporal fusion methods offer promising solutions, few have addressed the estimation of daily LST at 10 m resolution. In this study, we present WGAST, a Weakly-Supervised Generative Network for Daily 10 m LST Estimation via Spatio-Temporal Fusion of Terra MODIS, Landsat 8, and Sentinel-2. WGAST is the first end-to-end deep learning framework designed for this task. It adopts a conditional generative adversarial architecture, with a generator composed of four stages: feature extraction, fusion, LST reconstruction, and noise suppression. The first stage employs a set of encoders to extract multi-level latent representations from the inputs, which are then fused in the second stage using cosine similarity, normalization, and temporal attention mechanisms. The third stage decodes the fused features into high-resolution LST, followed by a Gaussian filter to suppress high-frequency noise. Training follows a weakly supervised strategy based on physical averaging principles and reinforced by a PatchGAN discriminator. Experiments demonstrate that WGAST outperforms existing methods in both quantitative and qualitative evaluations. Compared to the best-performing baseline, on average, WGAST reduces RMSE by 17.18% and improves SSIM by 11.00%. Furthermore, WGAST is robust to cloud-induced LST and effectively captures fine-scale thermal patterns, as validated against 33 ground-based sensors. The code is available at this https URL.
摘要：城市化，气候变化和农业压力正在增加对精确和及时环境监测的需求。土地表面温度（LST）是这种情况下的关键变量，并从遥感卫星中获取。但是，这些系统面临空间和时间分辨率之间的权衡。虽然时空融合方法提供了有希望的解决方案，但很少有人以10 m分辨率解决了每日LST的估计。在这项研究中，我们提出了WGAST，这是通过Terra Modis，Landsat 8和Sentinel-2的时空融合每天进行10 m lst估算的弱监督生成网络。 WGAST是为此任务设计的第一个端到端深度学习框架。它采用有条件的生成对抗结构，其发电机由四个阶段组成：特征提取，融合，LST重建和抑制噪声。第一阶段采用一组编码器来从输入中提取多级潜在表示，然后使用余弦相似性，归一化和时间注意机制在第二阶段融合。第三阶段将融合的特征解码为高分辨率LST，然后将高斯滤光片抑制高频噪声。培训遵循基于物理平均原则的弱监督策略，并由Patchgan歧视者加强。实验表明，在定量和定性评估中，WGAST优于现有方法。与表现最佳的基线相比，平均而言，WGAST将RMSE降低了17.18％，将SSIM提高了11.00％。此外，WGAST对云诱导的LST具有鲁棒性，并有效地捕获了细尺度的热模式，该模式对33个地面传感器进行了验证。该代码可在此HTTPS URL上找到。
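
The WGAST abstract above says training is weakly supervised through physical averaging principles. One way to read this is that a fine-resolution prediction, averaged over each coarse pixel's footprint, should match the coarse observation. The sketch below illustrates only that reading. The block size, the toy grids, and the names `block_average` and `rmse` are assumptions, not the authors' loss.

```python
import math

def block_average(fine, factor):
    """Average non-overlapping factor x factor blocks of a 2-D grid.

    Assumes both grid dimensions are divisible by factor.
    """
    coarse = []
    for i in range(0, len(fine), factor):
        row = []
        for j in range(0, len(fine[0]), factor):
            block = [fine[a][b] for a in range(i, i + factor) for b in range(j, j + factor)]
            row.append(sum(block) / len(block))
        coarse.append(row)
    return coarse

def rmse(a, b):
    """Root-mean-square error between two grids of equal shape."""
    diffs = [(x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return math.sqrt(sum(diffs) / len(diffs))

fine_prediction = [[1.0, 3.0], [5.0, 7.0]]  # made-up fine-resolution LST values
coarse_observation = [[3.0]]                # made-up coarse-resolution value
print(rmse(block_average(fine_prediction, 2), coarse_observation))
```

A consistency term of this kind needs no fine-resolution ground truth, which is the sense in which the supervision is weak.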

Title: Effective Training Data Synthesis for Improving MLLM Chart Understanding

Authors: Yuwei Yang, Zeyu Zhang, Yunzhong Hou, Zhuowan Li, Gaowen Liu, Ali Payani, Yuan-Sen Ting, Liang Zheng
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2508.06492
Pdf URL: https://arxiv.org/pdf/2508.06492
Copy Paste: [[2508.06492]] Effective Training Data Synthesis for Improving MLLM Chart Understanding(https://arxiv.org/abs/2508.06492)
Keywords: generation
Abstract: Being able to effectively read scientific plots, or chart understanding, is central to building effective agents for science. However, existing multimodal large language models (MLLMs), especially open-source ones, still fall behind, with a typical success rate of 30%-50% on challenging benchmarks. Previous studies on fine-tuning MLLMs with synthetic charts are often restricted by their inadequate similarity to the real charts, which could compromise model training and performance on complex real-world charts. In this study, we show that modularizing chart generation and diversifying visual details improves chart understanding capabilities. In particular, we design a five-step data synthesis pipeline, where we separate data and function creation for single plot generation, condition the generation of later subplots on earlier ones for multi-subplot figures, visually diversify the generated figures, filter out low quality data, and finally generate the question-answer (QA) pairs with GPT-4o. This approach allows us to streamline the generation of fine-tuning datasets and introduce the effective chart dataset (ECD), which contains 10k+ chart images and 300k+ QA pairs, covering 25 topics and featuring 250+ chart type combinations with high visual complexity. We show that ECD consistently improves the performance of various MLLMs on a range of real-world and synthetic test sets. Code, data and models are available at: this https URL.

Title: LightSwitch: Multi-view Relighting with Material-guided Diffusion

Authors: Yehonathan Litman, Fernando De la Torre, Shubham Tulsiani
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.06494
Pdf URL: https://arxiv.org/pdf/2508.06494
Copy Paste: [[2508.06494]] LightSwitch: Multi-view Relighting with Material-guided Diffusion(https://arxiv.org/abs/2508.06494)
Keywords: generative
Abstract: Recent approaches for 3D relighting have shown promise in integrating 2D image relighting generative priors to alter the appearance of a 3D representation while preserving the underlying structure. Nevertheless, generative priors used for 2D relighting that directly relight from an input image either do not take advantage of inferable intrinsic properties of the subject or cannot consider multi-view data at scale, leading to subpar relighting. In this paper, we propose LightSwitch, a novel finetuned material-relighting diffusion framework that efficiently relights an arbitrary number of input images to a target lighting condition while incorporating cues from inferred intrinsic properties. By using multi-view and material information cues together with a scalable denoising scheme, our method consistently and efficiently relights dense multi-view data of objects with diverse material compositions. We show that our 2D relighting prediction quality exceeds previous state-of-the-art relighting priors that directly relight from images. We further demonstrate that LightSwitch matches or outperforms state-of-the-art diffusion inverse rendering methods in relighting synthetic and real objects in as little as 2 minutes.