2025-06-13

Title: Multimodal Cinematic Video Synthesis Using Text-to-Image and Audio Generation Models

Authors: Sridhar S, Nithin A, Shakeel Rifath, Vasantha Raj
Subjects: cs.CV, cs.AI, cs.CL, cs.GR, cs.MM
Abstract URL: https://arxiv.org/abs/2506.10005
Pdf URL: https://arxiv.org/pdf/2506.10005
Copy Paste: [[2506.10005]] Multimodal Cinematic Video Synthesis Using Text-to-Image and Audio Generation Models(https://arxiv.org/abs/2506.10005)
Keywords: generation, generative
Abstract: Advances in generative artificial intelligence have altered multimedia creation, allowing for automatic cinematic video synthesis from text inputs. This work describes a method for creating 60-second cinematic movies incorporating Stable Diffusion for high-fidelity image synthesis, GPT-2 for narrative structuring, and a hybrid audio pipeline using gTTS and YouTube-sourced music. It uses a five-scene framework, which is augmented by linear frame interpolation, cinematic post-processing (e.g., sharpening), and audio-video synchronization to provide professional-quality results. It was created in a GPU-accelerated Google Colab environment using Python 3.11. It has a dual-mode Gradio interface (Simple and Advanced), which supports resolutions of up to 1024x768 and frame rates of 15-30 FPS. Optimizations such as CUDA memory management and error handling ensure reliability. The experiments demonstrate outstanding visual quality, narrative coherence, and efficiency, furthering text-to-video synthesis for creative, educational, and industrial applications.
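The linear frame interpolation step mentioned in the abstract can be sketched as a per-pixel blend between two key frames. This is a minimal illustration, not the authors' code: the grayscale nested-list `Frame` type and the helper names `total_frames`, `lerp_frames`, and `interpolate` are assumptions made for the example. It also shows the frame budget implied by the abstract: a 60-second clip at 15 to 30 FPS needs 900 to 1800 frames.

```python
from typing import List

# Assumption for illustration: a grayscale frame stored as rows of intensities.
Frame = List[List[float]]


def total_frames(duration_s: int, fps: int) -> int:
    """Number of frames needed for a clip of the given length and frame rate."""
    return duration_s * fps


def lerp_frames(a: Frame, b: Frame, t: float) -> Frame:
    """Blend two same-sized frames pixel by pixel: (1 - t) * a + t * b."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must be in [0, 1]")
    return [
        [(1.0 - t) * pa + t * pb for pa, pb in zip(row_a, row_b)]
        for row_a, row_b in zip(a, b)
    ]


def interpolate(a: Frame, b: Frame, n_between: int) -> List[Frame]:
    """Return n_between evenly spaced intermediate frames between a and b."""
    return [
        lerp_frames(a, b, k / (n_between + 1))
        for k in range(1, n_between + 1)
    ]


# Two tiny made-up key frames standing in for generated images.
key_a = [[0.0, 0.0], [0.0, 0.0]]
key_b = [[1.0, 0.5], [0.0, 1.0]]
mid_frames = interpolate(key_a, key_b, 3)
```

Linear blending is cheap and produces a cross-fade rather than true motion, which is why it suits slow scene transitions better than fast movement.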

Title: Optimizing Latent Dimension Allocation in Hierarchical VAEs: Balancing Attenuation and Information Retention for OOD Detection

Authors: Dane Williamson, Yangfeng Ji, Matthew Dwyer
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.10089
Pdf URL: https://arxiv.org/pdf/2506.10089
Copy Paste: [[2506.10089]] Optimizing Latent Dimension Allocation in Hierarchical VAEs: Balancing Attenuation and Information Retention for OOD Detection(https://arxiv.org/abs/2506.10089)
Keywords: generative
Abstract: Out-of-distribution (OOD) detection is a critical task in machine learning, particularly for safety-critical applications where unexpected inputs must be reliably flagged. While hierarchical variational autoencoders (HVAEs) offer improved representational capacity over traditional VAEs, their performance is highly sensitive to how latent dimensions are distributed across layers. Existing approaches often allocate latent capacity arbitrarily, leading to ineffective representations or posterior collapse. In this work, we introduce a theoretically grounded framework for optimizing latent dimension allocation in HVAEs, drawing on principles from information theory to formalize the trade-off between information loss and representational attenuation. We prove the existence of an optimal allocation ratio $r^{\ast}$ under a fixed latent budget, and empirically show that tuning this ratio consistently improves OOD detection performance across datasets and architectures. Our approach outperforms baseline HVAE configurations and provides practical guidance for principled latent structure design, leading to more robust OOD detection with deep generative models.

Title: NnD: Diffusion-based Generation of Physically-Nonnegative Objects

Authors: Nadav Torem, Tamar Sde-Chen, Yoav Y. Schechner
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.10112
Pdf URL: https://arxiv.org/pdf/2506.10112
Copy Paste: [[2506.10112]] NnD: Diffusion-based Generation of Physically-Nonnegative Objects(https://arxiv.org/abs/2506.10112)
Keywords: generation, generative
Abstract: Most natural objects have inherent complexity and variability. While some simple objects can be modeled from first principles, many real-world phenomena, such as cloud formation, require computationally expensive simulations that limit scalability. This work focuses on a class of physically meaningful, nonnegative objects that are computationally tractable but costly to simulate. To dramatically reduce computational costs, we propose nonnegative diffusion (NnD). This is a learned generative model using score-based diffusion. It adapts annealed Langevin dynamics to enforce, by design, non-negativity throughout iterative scene generation and analysis (inference). NnD trains on high-quality physically simulated objects. Once trained, it can be used for generation and inference. We demonstrate generation of 3D volumetric clouds, comprising inherently nonnegative microphysical fields. Our generated clouds are consistent with cloud physics trends. They are effectively not distinguished as non-physical by expert perception.

Title: ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs

Authors: Xiyao Wang, Zhengyuan Yang, Chao Feng, Yongyuan Liang, Yuhang Zhou, Xiaoyu Liu, Ziyi Zang, Ming Li, Chung-Ching Lin, Kevin Lin, Linjie Li, Furong Huang, Lijuan Wang
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2506.10128
Pdf URL: https://arxiv.org/pdf/2506.10128
Copy Paste: [[2506.10128]] ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs(https://arxiv.org/abs/2506.10128)
Keywords: generation
Abstract: Reinforcement learning (RL) has shown great effectiveness for fine-tuning large language models (LLMs) using tasks that are challenging yet easily verifiable, such as math reasoning or code generation. However, extending this success to visual perception in vision-language models (VLMs) has been impeded by the scarcity of vision-centric tasks that are simultaneously challenging and unambiguously verifiable. To this end, we introduce ViCrit (Visual Caption Hallucination Critic), an RL proxy task that trains VLMs to localize a subtle, synthetic visual hallucination injected into paragraphs of human-written image captions. Starting from 200-word captions, we inject a single, subtle visual description error (altering a few words on objects, attributes, counts, or spatial relations) and task the model to pinpoint the corrupted span given the image and the modified caption. This formulation preserves the full perceptual difficulty while providing a binary, exact-match reward that is easy to compute and unambiguous. Models trained with the ViCrit Task exhibit substantial gains across a variety of VL benchmarks. Crucially, the improvements transfer beyond natural-image training data to abstract image reasoning and visual math, showing promise of learning to perceive rather than merely memorizing seen objects. To facilitate evaluation, we further introduce ViCrit-Bench, a category-balanced diagnostic benchmark that systematically probes perception errors across diverse image domains and error types. Together, our results demonstrate that fine-grained hallucination criticism is an effective and generalizable objective for enhancing visual perception in VLMs.
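The binary, exact-match reward described in the abstract can be sketched as follows. This is a hedged illustration only: the function name `exact_match_reward`, the whitespace and case normalization, and the example caption are assumptions, and the paper may define the match differently.

```python
def normalize(span: str) -> str:
    """Lowercase and collapse whitespace (an assumed normalization)."""
    return " ".join(span.lower().split())


def exact_match_reward(predicted_span: str, corrupted_span: str) -> int:
    """Return 1 if the model pinpointed the injected span exactly, else 0."""
    return int(normalize(predicted_span) == normalize(corrupted_span))


# Made-up example: one detail of a caption is altered, and the model
# must name the altered span given the image and the modified caption.
original_caption = "A brown dog sits beside two red chairs."
corrupted_span = "three red chairs"
modified_caption = original_caption.replace("two red chairs", corrupted_span)

reward_hit = exact_match_reward("Three  red chairs", corrupted_span)
reward_miss = exact_match_reward("brown dog", corrupted_span)
```

Because the reward depends only on string equality with a known injected span, it needs no learned judge, which is what makes the task verifiable.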

Title: The 2025 PNPL Competition: Speech Detection and Phoneme Classification in the LibriBrain Dataset

Authors: Gilad Landau, Miran Özdogan, Gereon Elvers, Francesco Mantegna, Pratik Somaiya, Dulhan Jayalath, Luisa Kurth, Teyun Kwon, Brendan Shillingford, Greg Farquhar, Minqi Jiang, Karim Jerbi, Hamza Abdelhedi, Yorguin Mantilla Ramos, Caglar Gulcehre, Mark Woolrich, Natalie Voets, Oiwi Parker Jones
Subjects: cs.LG, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2506.10165
Pdf URL: https://arxiv.org/pdf/2506.10165
Copy Paste: [[2506.10165]] The 2025 PNPL Competition: Speech Detection and Phoneme Classification in the LibriBrain Dataset(https://arxiv.org/abs/2506.10165)
Keywords: restoration
Abstract: The advance of speech decoding from non-invasive brain data holds the potential for profound societal impact. Among its most promising applications is the restoration of communication to paralysed individuals affected by speech deficits such as dysarthria, without the need for high-risk surgical interventions. The ultimate aim of the 2025 PNPL competition is to produce the conditions for an "ImageNet moment" or breakthrough in non-invasive neural decoding, by harnessing the collective power of the machine learning community. To facilitate this vision we present the largest within-subject MEG dataset recorded to date (LibriBrain) together with a user-friendly Python library (pnpl) for easy data access and integration with deep learning frameworks. For the competition we define two foundational tasks (i.e. Speech Detection and Phoneme Classification from brain data), complete with standardised data splits and evaluation metrics, illustrative benchmark models, online tutorial code, a community discussion board, and public leaderboard for submissions. To promote accessibility and participation the competition features a Standard track that emphasises algorithmic innovation, as well as an Extended track that is expected to reward larger-scale computing, accelerating progress toward a non-invasive brain-computer interface for speech.

Title: SPARKE: Scalable Prompt-Aware Diversity Guidance in Diffusion Models via RKE Score

Authors: Mohammad Jalali, Haoyu Lei, Amin Gohari, Farzan Farnia
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.10173
Pdf URL: https://arxiv.org/pdf/2506.10173
Copy Paste: [[2506.10173]] SPARKE: Scalable Prompt-Aware Diversity Guidance in Diffusion Models via RKE Score(https://arxiv.org/abs/2506.10173)
Keywords: generation, generative
Abstract: Diffusion models have demonstrated remarkable success in high-fidelity image synthesis and prompt-guided generative modeling. However, ensuring adequate diversity in generated samples of prompt-guided diffusion models remains a challenge, particularly when the prompts span a broad semantic spectrum and the diversity of generated data needs to be evaluated in a prompt-aware fashion across semantically similar prompts. Recent methods have introduced guidance via diversity measures to encourage more varied generations. In this work, we extend the diversity measure-based approaches by proposing the Scalable Prompt-Aware Rényi Kernel Entropy Diversity Guidance (SPARKE) method for prompt-aware diversity guidance. SPARKE utilizes conditional entropy for diversity guidance, which dynamically conditions diversity measurement on similar prompts and enables prompt-aware diversity control. While the entropy-based guidance approach enhances prompt-aware diversity, its reliance on the matrix-based entropy scores poses computational challenges in large-scale generation settings. To address this, we focus on the special case of Conditional latent RKE Score Guidance, reducing entropy computation and gradient-based optimization complexity from the $O(n^3)$ of general entropy measures to $O(n)$. The reduced computational complexity allows for diversity-guided sampling over potentially thousands of generation rounds on different prompts. We numerically test the SPARKE method on several text-to-image diffusion models, demonstrating that the proposed method improves the prompt-aware diversity of the generated data without incurring significant computational costs. We release our code on the project page: this https URL
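To illustrate why an order-2 kernel entropy avoids the eigendecomposition that general matrix-based entropies need, the sketch below computes an unconditional score of the form $n^2 / \|K\|_F^2$ for a Gaussian kernel matrix $K$ with unit diagonal. This formula is an assumption about the order-2 Rényi kernel entropy (the abstract gives no definition), and the sketch is not the conditional, prompt-aware SPARKE guidance itself. The names `gaussian_kernel` and `rke_score` are hypothetical.

```python
import math
from typing import Sequence


def gaussian_kernel(x: Sequence[float], y: Sequence[float], sigma: float = 1.0) -> float:
    """Gaussian (RBF) kernel; equals 1 when x == y."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def rke_score(samples: Sequence[Sequence[float]], sigma: float = 1.0) -> float:
    """Assumed order-2 score: n^2 / ||K||_F^2.

    Only pairwise kernel values are needed, so no O(n^3)
    eigendecomposition of K is required.
    """
    n = len(samples)
    frob_sq = sum(
        gaussian_kernel(x, y, sigma) ** 2 for x in samples for y in samples
    )
    return n * n / frob_sq


# Identical samples behave like a single mode; well-separated samples
# behave like one mode each.
collapsed = rke_score([(0.0, 0.0)] * 4)
spread = rke_score([(0.0, 0.0), (100.0, 0.0), (0.0, 100.0)])
```

Under this assumed definition the score ranges from 1 (all samples identical) to n (all samples mutually dissimilar), so it can be read as an effective number of modes. The full pairwise sum here costs $O(n^2)$; the $O(n)$ figure in the abstract refers to the authors' own guidance formulation, which is not reproduced here.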

Title: Retrieval of Surface Solar Radiation through Implicit Albedo Recovery from Temporal Context

Authors: Yael Frischholz, Devis Tuia, Michael Lehning
Subjects: cs.CV, physics.ao-ph
Abstract URL: https://arxiv.org/abs/2506.10174
Pdf URL: https://arxiv.org/pdf/2506.10174
Copy Paste: [[2506.10174]] Retrieval of Surface Solar Radiation through Implicit Albedo Recovery from Temporal Context(https://arxiv.org/abs/2506.10174)
Keywords: generation
Abstract: Accurate retrieval of surface solar radiation (SSR) from satellite imagery critically depends on estimating the background reflectance that a spaceborne sensor would observe under clear-sky conditions. Deviations from this baseline can then be used to detect cloud presence and guide radiative transfer models in inferring atmospheric attenuation. Operational retrieval algorithms typically approximate background reflectance using monthly statistics, assuming surface properties vary slowly relative to atmospheric conditions. However, this approach fails in mountainous regions where intermittent snow cover and changing snow surfaces are frequent. We propose an attention-based emulator for SSR retrieval that implicitly learns to infer clear-sky surface reflectance from raw satellite image sequences. Built on the Temporo-Spatial Vision Transformer, our approach eliminates the need for hand-crafted features such as explicit albedo maps or cloud masks. The emulator is trained on instantaneous SSR estimates from the HelioMont algorithm over Switzerland, a region characterized by complex terrain and dynamic snow cover. Inputs include multi-spectral SEVIRI imagery from the Meteosat Second Generation platform, augmented with static topographic features and solar geometry. The target variable is HelioMont's SSR, computed as the sum of its direct and diffuse horizontal irradiance components, given at a spatial resolution of 1.7 km. We show that, when provided with a sufficiently long temporal context, the model matches the performance of albedo-informed models, highlighting the model's ability to internally learn and exploit latent surface reflectance dynamics. Our geospatial analysis shows this effect is most powerful in mountainous regions and improves generalization in both simple and complex topographic settings. Code and datasets are publicly available at this https URL

Title: Geometric Regularity in Deterministic Sampling of Diffusion-based Generative Models

Authors: Defang Chen, Zhenyu Zhou, Can Wang, Siwei Lyu
Subjects: cs.LG, cs.CV, stat.ML
Abstract URL: https://arxiv.org/abs/2506.10177
Pdf URL: https://arxiv.org/pdf/2506.10177
Copy Paste: [[2506.10177]] Geometric Regularity in Deterministic Sampling of Diffusion-based Generative Models(https://arxiv.org/abs/2506.10177)
Keywords: generation, generative
Abstract: Diffusion-based generative models employ stochastic differential equations (SDEs) and their equivalent probability flow ordinary differential equations (ODEs) to establish a smooth transformation between complex high-dimensional data distributions and tractable prior distributions. In this paper, we reveal a striking geometric regularity in the deterministic sampling dynamics: each simulated sampling trajectory lies within an extremely low-dimensional subspace, and all trajectories exhibit an almost identical "boomerang" shape, regardless of the model architecture, applied conditions, or generated content. We characterize several intriguing properties of these trajectories, particularly under closed-form solutions based on kernel-estimated data modeling. We also demonstrate a practical application of the discovered trajectory regularity by proposing a dynamic programming-based scheme to better align the sampling time schedule with the underlying trajectory structure. This simple strategy requires minimal modification to existing ODE-based numerical solvers, incurs negligible computational overhead, and achieves superior image generation performance, especially in regions with only $5 \sim 10$ function evaluations.

Title: Scalable Non-Equivariant 3D Molecule Generation via Rotational Alignment

Authors: Yuhui Ding, Thomas Hofmann
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.10186
Pdf URL: https://arxiv.org/pdf/2506.10186
Copy Paste: [[2506.10186]] Scalable Non-Equivariant 3D Molecule Generation via Rotational Alignment(https://arxiv.org/abs/2506.10186)
Keywords: generation
Abstract: Equivariant diffusion models have achieved impressive performance in 3D molecule generation. These models incorporate Euclidean symmetries of 3D molecules by utilizing an SE(3)-equivariant denoising network. However, specialized equivariant architectures limit the scalability and efficiency of diffusion models. In this paper, we propose an approach that relaxes such equivariance constraints. Specifically, our approach learns a sample-dependent SO(3) transformation for each molecule to construct an aligned latent space. A non-equivariant diffusion model is then trained over the aligned representations. Experimental results demonstrate that our approach performs significantly better than previously reported non-equivariant models. It yields sample quality comparable to state-of-the-art equivariant diffusion models and offers improved training and sampling efficiency. Our code is available at this https URL

Title: LaMAGIC2: Advanced Circuit Formulations for Language Model-Based Analog Topology Generation

Authors: Chen-Chia Chang, Wan-Hsuan Lin, Yikang Shen, Yiran Chen, Xin Zhang
Subjects: cs.LG, cs.AI, cs.AR
Abstract URL: https://arxiv.org/abs/2506.10235
Pdf URL: https://arxiv.org/pdf/2506.10235
Copy Paste: [[2506.10235]] LaMAGIC2: Advanced Circuit Formulations for Language Model-Based Analog Topology Generation(https://arxiv.org/abs/2506.10235)
Keywords: generation
Abstract: Automating analog topology design is crucial because modern applications have customized requirements that currently demand heavy manual engineering effort. The state-of-the-art work applies a sequence-to-sequence approach and supervised finetuning on language models to generate topologies given user specifications. However, its circuit formulation is inefficient due to its $O(|V|^2)$ token length and suffers from low precision sensitivity to numeric inputs. In this work, we introduce LaMAGIC2, a succinct float-input canonical formulation with identifier (SFCI) for language model-based analog topology generation. SFCI addresses these challenges by improving component-type recognition through identifier-based representations, reducing token length complexity to $O(|V|)$, and enhancing numeric precision sensitivity for better performance under tight tolerances. Our experiments demonstrate that LaMAGIC2 achieves 34% higher success rates under a tight tolerance of 0.01 and 10X lower MSEs compared to a prior method. LaMAGIC2 also exhibits better transferability for circuits with more vertices with up to 58.5% improvement. These advancements establish LaMAGIC2 as a robust framework for analog topology generation.

Title: HalLoc: Token-level Localization of Hallucinations for Vision Language Models

Authors: Eunkyu Park, Minyeong Kim, Gunhee Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.10286
Pdf URL: https://arxiv.org/pdf/2506.10286
Copy Paste: [[2506.10286]] HalLoc: Token-level Localization of Hallucinations for Vision Language Models(https://arxiv.org/abs/2506.10286)
Keywords: generation
Abstract: Hallucinations pose a significant challenge to the reliability of large vision-language models, making their detection essential for ensuring accuracy in critical applications. Current detection methods often rely on computationally intensive models, leading to high latency and resource demands. Their definitive outcomes also fail to account for real-world scenarios where the line between hallucinated and truthful information is unclear. To address these issues, we propose HalLoc, a dataset designed for efficient, probabilistic hallucination detection. It features 150K token-level annotated samples, including hallucination types, across Visual Question Answering (VQA), instruction-following, and image captioning tasks. This dataset facilitates the development of models that detect hallucinations with graded confidence, enabling more informed user interactions. Additionally, we introduce a baseline model trained on HalLoc, offering low-overhead, concurrent hallucination detection during generation. The model can be seamlessly integrated into existing VLMs, improving reliability while preserving efficiency. The prospect of a robust plug-and-play hallucination detection module opens new avenues for enhancing the trustworthiness of vision-language models in real-world applications. The HalLoc dataset and code are publicly available at: this https URL.

Title: Towards Scalable SOAP Note Generation: A Weakly Supervised Multimodal Framework

Authors: Sadia Kamal, Tim Oates, Joy Wan
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.10328
Pdf URL: https://arxiv.org/pdf/2506.10328
Copy Paste: [[2506.10328]] Towards Scalable SOAP Note Generation: A Weakly Supervised Multimodal Framework(https://arxiv.org/abs/2506.10328)
Keywords: generation
Abstract: Skin carcinoma is the most prevalent form of cancer globally, accounting for over $8 billion in annual healthcare expenditures. In clinical settings, physicians document patient visits using detailed SOAP (Subjective, Objective, Assessment, and Plan) notes. However, manually generating these notes is labor-intensive and contributes to clinician burnout. In this work, we propose a weakly supervised multimodal framework to generate clinically structured SOAP notes from limited inputs, including lesion images and sparse clinical text. Our approach reduces reliance on manual annotations, enabling scalable, clinically grounded documentation while alleviating clinician burden and reducing the need for large annotated data. Our method achieves performance comparable to GPT-4o, Claude, and DeepSeek Janus Pro across key clinical relevance metrics. To evaluate clinical quality, we introduce two novel metrics MedConceptEval and Clinical Coherence Score (CCS) which assess semantic alignment with expert medical concepts and input features, respectively.

Title: Research on Audio-Visual Quality Assessment Dataset and Method for User-Generated Omnidirectional Video

Authors: Fei Zhao, Da Pan, Zelu Qi, Ping Shi
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2506.10331
Pdf URL: https://arxiv.org/pdf/2506.10331
Copy Paste: [[2506.10331]] Research on Audio-Visual Quality Assessment Dataset and Method for User-Generated Omnidirectional Video(https://arxiv.org/abs/2506.10331)
Keywords: quality assessment
Abstract: In response to the rising prominence of the Metaverse, omnidirectional videos (ODVs) have garnered notable interest, gradually shifting from professional-generated content (PGC) to user-generated content (UGC). However, the study of audio-visual quality assessment (AVQA) within ODVs remains limited. To address this, we construct a dataset of UGC omnidirectional audio and video (A/V) content. The videos are captured by five individuals using two different types of omnidirectional cameras, shooting 300 videos covering 10 different scene types. A subjective AVQA experiment is conducted on the dataset to obtain the Mean Opinion Scores (MOSs) of the A/V sequences. Then, to facilitate development in the UGC-ODV AVQA field, we construct an effective AVQA baseline model on the proposed dataset; the model consists of a video feature extraction module, an audio feature extraction module, and an audio-visual fusion module. The experimental results demonstrate that our model achieves optimal performance on the proposed dataset.
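A Mean Opinion Score is the average of the ratings that subjects give to one sequence. The sketch below shows that computation on made-up data; the clip identifiers, the 1 to 5 rating scale, and the function name `mean_opinion_scores` are assumptions for illustration, and the abstract does not say whether the authors screen outliers or rescale ratings first.

```python
from statistics import mean
from typing import Dict, List


def mean_opinion_scores(ratings: Dict[str, List[float]]) -> Dict[str, float]:
    """Average the per-subject ratings for each A/V sequence."""
    return {clip: mean(scores) for clip, scores in ratings.items()}


# Hypothetical ratings from four subjects on an assumed 1-5 scale.
ratings = {
    "clip_a": [5.0, 4.0, 3.0, 4.0],
    "clip_b": [2.0, 1.0, 2.0, 3.0],
}
mos = mean_opinion_scores(ratings)
```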

Title: GeoCAD: Local Geometry-Controllable CAD Generation

Authors: Zhanwei Zhang, Kaiyuan Liu, Junjie Liu, Wenxiao Wang, Binbin Lin, Liang Xie, Chen Shen, Deng Cai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.10337
Pdf URL: https://arxiv.org/pdf/2506.10337
Copy Paste: [[2506.10337]] GeoCAD: Local Geometry-Controllable CAD Generation(https://arxiv.org/abs/2506.10337)
Keywords: generation
Abstract: Local geometry-controllable computer-aided design (CAD) generation aims to modify local parts of CAD models automatically, enhancing design efficiency. It also ensures that the shapes of newly generated local parts follow user-specific geometric instructions (e.g., an isosceles right triangle or a rectangle with one corner cut off). However, existing methods encounter challenges in achieving this goal. Specifically, they either lack the ability to follow textual instructions or are unable to focus on the local parts. To address this limitation, we introduce GeoCAD, a user-friendly and local geometry-controllable CAD generation method. Specifically, we first propose a complementary captioning strategy to generate geometric instructions for local parts. This strategy involves vertex-based and VLLM-based captioning for systematically annotating simple and complex parts, respectively. In this way, we caption $\sim$221k different local parts in total. In the training stage, given a CAD model, we randomly mask a local part. Then, using its geometric instruction and the remaining parts as input, we prompt large language models (LLMs) to predict the masked part. During inference, users can specify any local part for modification while adhering to a variety of predefined geometric instructions. Extensive experiments demonstrate the effectiveness of GeoCAD in generation quality, validity and text-to-CAD consistency. Code will be available at this https URL.

Title: UrbanSense: A Framework for Quantitative Analysis of Urban Streetscapes leveraging Vision Large Language Models

Authors: Jun Yin, Jing Zhong, Peilin Li, Pengyu Zeng, Miao Zhang, Ran Luo, Shuai Lu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.10342
Pdf URL: https://arxiv.org/pdf/2506.10342
Copy Paste: [[2506.10342]] UrbanSense: A Framework for Quantitative Analysis of Urban Streetscapes leveraging Vision Large Language Models(https://arxiv.org/abs/2506.10342)
Keywords: generation
Abstract: Urban cultures and architectural styles vary significantly across cities due to geographical, chronological, historical, and socio-political factors. Understanding these differences is essential for anticipating how cities may evolve in the future. As representative cases of historical continuity and modern innovation in China, Beijing and Shenzhen offer valuable perspectives for exploring the transformation of urban streetscapes. However, conventional approaches to urban cultural studies often rely on expert interpretation and historical documentation, which are difficult to standardize across different contexts. To address this, we propose a multimodal research framework based on vision-language models, enabling automated and scalable analysis of urban streetscape style differences. This approach enhances the objectivity and data-driven nature of urban form research. The contributions of this study are as follows: First, we construct UrbanDiffBench, a curated dataset of urban streetscapes containing architectural images from different periods and regions. Second, we develop UrbanSense, the first vision-language-model-based framework for urban streetscape analysis, enabling the quantitative generation and comparison of urban style representations. Third, experimental results show that over 80% of generated descriptions pass the t-test ($p < 0.05$). High Phi scores (0.912 for cities, 0.833 for periods) from subjective evaluations confirm the method's ability to capture subtle stylistic differences. These results highlight the method's potential to quantify and interpret urban style evolution, offering a scientifically grounded lens for future design.
摘要：由于地理，年代，历史和社会政治因素，城市文化和建筑风格在各个城市之间差异很大。了解这些差异对于预测城市将来如何发展至关重要。作为中国历史连续性和现代创新的代表性案例，北京和深圳为探索城市街景的转变提供了宝贵的观点。但是，通俗的城市文化研究方法通常依赖于专家解释和历史文献，这些文献很难在不同的情况下进行标准化。为了解决这个问题，我们提出了一个基于视觉模型的多模式研究框架，从而实现了对城市街景风格差异的自动化和可扩展分析。这种方法增强了城市形式研究的客观性和数据驱动的性质。这项研究的贡献如下：首先，我们构建了Urbandiffbench，这是一个策划的城市街景数据集，其中包含来自不同时期和地区的建筑图像。其次，我们开发了Urbansense，这是城市街景分析的第一个基于视觉语言模型的框架，从而实现了城市风格表示的定量生成和比较。第三，实验结果表明，超过80％的生成描述通过t检验（P小于0.05）。主观评估的高PHI得分（城市为0.912，周期为0.833）证实了该方法捕获微妙的文体差异的能力。这些结果突出了该方法量化和解释城市风格演变的潜力，为未来的设计提供了科学扎根的镜头。

Title: PhysioWave: A Multi-Scale Wavelet-Transformer for Physiological Signal Representation

Authors: Yanlong Chen, Mattia Orlandi, Pierangelo Maria Rapa, Simone Benatti, Luca Benini, Yawei Li
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.10351
Pdf URL: https://arxiv.org/pdf/2506.10351
Copy Paste: [[2506.10351]] PhysioWave: A Multi-Scale Wavelet-Transformer for Physiological Signal Representation(https://arxiv.org/abs/2506.10351)
Keywords: generation
Abstract: Physiological signals are often corrupted by motion artifacts, baseline drift, and other low-SNR disturbances, which pose significant challenges for analysis. Additionally, these signals exhibit strong non-stationarity, with sharp peaks and abrupt changes that evolve continuously, making them difficult to represent using traditional time-domain or filtering methods. To address these issues, a novel wavelet-based approach for physiological signal analysis is presented, aiming to capture multi-scale time-frequency features in various physiological signals. Leveraging this technique, two large-scale pretrained models specific to EMG and ECG are introduced for the first time, achieving superior performance and setting new baselines in downstream tasks. Additionally, a unified multi-modal framework is constructed by integrating a pretrained EEG model, where each modality is guided through its dedicated branch and fused via learnable weighted fusion. This design effectively addresses challenges such as low signal-to-noise ratio, high inter-subject variability, and device mismatch, outperforming existing methods on multi-modal tasks. The proposed wavelet-based architecture lays a solid foundation for analysis of diverse physiological signals, while the multi-modal design points to next-generation physiological signal processing with potential impact on wearable health monitoring, clinical diagnostics, and broader biomedical applications.
摘要：生理信号通常被运动伪像，基线漂移和其他低SNR干扰所破坏，这对分析构成了重大挑战。此外，这些信号表现出强大的非平稳性，峰值峰值和突然变化不断发展，因此很难使用传统的时域或过滤方法来代表它们。为了解决这些问题，提出了一种基于小波的生理信号分析方法，旨在捕获各种生理信号中的多尺度时间频率特征。首次介绍了利用这项技术的两种特定于EMG和ECG的大规模预认证的模型，从而实现了卓越的性能并在下游任务中设置新的基线。此外，统一的多模式框架是通过集成预验证的脑电图模型来构建的，在该模型中，每种模态都通过其专用分支引导并通过可学习的加权融合融合。该设计有效地解决了诸如低信噪比，高主体间可变性和设备不匹配等挑战，在多模式任务上的现有方法表现优于现有方法。拟议的基于小波的建筑为分析各种生理信号的稳固基础，而多模式设计指向下一代生理信号处理，对可穿戴健康监测，临床诊断和更广泛的生物医学应用有潜在影响。
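
The "learnable weighted fusion" mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the fusion rule, so the version below is an assumption: one scalar logit per modality, normalized with a softmax, then a weighted sum of per-modality feature vectors. The names `softmax`, `weighted_fusion`, and the toy EMG/ECG/EEG vectors are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_fusion(features, logits):
    """Fuse equally sized feature vectors with softmax-normalized weights.

    `logits` stands in for the learnable parameters (one per modality);
    in a real model they would be updated by gradient descent.
    """
    weights = softmax(logits)
    dim = len(features[0])
    return [sum(w * f[k] for w, f in zip(weights, features)) for k in range(dim)]

# Toy per-modality embeddings (hypothetical values).
emg = [1.0, 0.0]
ecg = [0.0, 1.0]
eeg = [1.0, 1.0]

# Equal logits give equal weights, so the result is the plain average.
fused = weighted_fusion([emg, ecg, eeg], logits=[0.0, 0.0, 0.0])
```

Normalizing the weights keeps the fused vector on the same scale as the inputs regardless of how many modalities are present; whether the paper normalizes its weights this way is not stated in the abstract.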

Title: Motion-R1: Chain-of-Thought Reasoning and Reinforcement Learning for Human Motion Generation

Authors: Runqi Ouyang, Haoyun Li, Zhenyuan Zhang, Xiaofeng Wang, Zheng Zhu, Guan Huang, Xingang Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.10353
Pdf URL: https://arxiv.org/pdf/2506.10353
Copy Paste: [[2506.10353]] Motion-R1: Chain-of-Thought Reasoning and Reinforcement Learning for Human Motion Generation(https://arxiv.org/abs/2506.10353)
Keywords: generation
Abstract: Recent advances in large language models, especially in natural language understanding and reasoning, have opened new possibilities for text-to-motion generation. Although existing approaches have made notable progress in semantic alignment and motion synthesis, they often rely on end-to-end mapping strategies that fail to capture deep linguistic structures and logical reasoning. Consequently, generated motions tend to lack controllability, consistency, and diversity. To address these limitations, we propose Motion-R1, a unified motion-language modeling framework that integrates a Chain-of-Thought mechanism. By explicitly decomposing complex textual instructions into logically structured action paths, Motion-R1 provides high-level semantic guidance for motion generation, significantly enhancing the model's ability to interpret and execute multi-step, long-horizon, and compositionally rich commands. To train our model, we adopt Group Relative Policy Optimization, a reinforcement learning algorithm designed for large models, which leverages motion quality feedback to optimize reasoning chains and motion synthesis jointly. Extensive experiments across multiple benchmark datasets demonstrate that Motion-R1 achieves competitive or superior performance compared to state-of-the-art methods, particularly in scenarios requiring nuanced semantic understanding and long-term temporal coherence. The code, model and data will be publicly available.
摘要：大型语言模型的最新进展，尤其是在自然语言理解和推理方面，已经为文本到动作生成开辟了新的可能性。尽管现有的方法在语义一致性和运动综合方面取得了显着进步，但它们通常依赖于无法捕获深层语言结构和逻辑推理的端到端映射策略。因此，产生的动议往往缺乏可控性，一致性和多样性。为了解决这些局限性，我们提出了Motion-R1，这是一个整合了经过思考机制的统一运动语言建模框架。通过将复杂的文本指令明确分解为逻辑结构化的动作路径，Motion-R1为运动生成提供了高级的语义指导，显着增强了该模型解释和执行多步，长途培训和作曲丰富的命令的能力。为了训练我们的模型，我们采用了小组相对政策优化，这是一种为大型模型设计的强化学习算法，该算法利用运动质量反馈以优化推理链和运动合成。跨多个基准数据集进行的广泛实验表明，与最新方法相比，Motion-R1具有竞争性或卓越的性能，尤其是在需要细微的语义理解和长期时间连贯性的情况下。代码，模型和数据将公开可用。
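
As a rough illustration of the Group Relative Policy Optimization step named in the abstract, the sketch below computes group-relative advantages: several outputs are sampled for the same prompt, each receives a scalar reward, and every reward is standardized against the group's mean and standard deviation. This is a minimal sketch of the general technique, not Motion-R1's training code; the function name, the reward values, and the use of the population standard deviation are assumptions.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of sampled outputs.

    Outputs scoring above the group mean get a positive advantage,
    those below get a negative one; `eps` avoids division by zero
    when every sample receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical motion-quality rewards for four samples of one instruction.
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)
```

Because the baseline comes from the group itself, no separate value network is needed to estimate it, which is the usual motivation given for this family of methods.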

Title: Can We Infer Confidential Properties of Training Data from LLMs?

Authors: Pengrun Huang, Chhavi Yadav, Ruihan Wu, Kamalika Chaudhuri
Subjects: cs.LG, cs.CL, cs.CR
Abstract URL: https://arxiv.org/abs/2506.10364
Pdf URL: https://arxiv.org/pdf/2506.10364
Copy Paste: [[2506.10364]] Can We Infer Confidential Properties of Training Data from LLMs?(https://arxiv.org/abs/2506.10364)
Keywords: generation, generative
Abstract: Large language models (LLMs) are increasingly fine-tuned on domain-specific datasets to support applications in fields such as healthcare, finance, and law. These fine-tuning datasets often have sensitive and confidential dataset-level properties -- such as patient demographics or disease prevalence -- that are not intended to be revealed. While prior work has studied property inference attacks on discriminative models (e.g., image classification models) and generative models (e.g., GANs for image data), it remains unclear if such attacks transfer to LLMs. In this work, we introduce PropInfer, a benchmark task for evaluating property inference in LLMs under two fine-tuning paradigms: question-answering and chat-completion. Built on the ChatDoctor dataset, our benchmark includes a range of property types and task configurations. We further propose two tailored attacks: a prompt-based generation attack and a shadow-model attack leveraging word frequency signals. Empirical evaluations across multiple pretrained LLMs show the success of our attacks, revealing a previously unrecognized vulnerability in LLMs.
摘要：大型语言模型（LLM）越来越多地在特定领域的数据集上进行了微调，以支持医疗保健，金融和法律等领域的应用程序。这些微调数据集通常具有敏感且机密的数据集级特性（例如患者人口统计或疾病患病率），这些特性不打算被揭示。虽然先前的工作已经研究了对歧视模型（例如，图像分类模型）和生成模型（例如，图像数据的gans）的属性推断攻击，但仍不清楚这种攻击是否传输到LLMS。在这项工作中，我们介绍了Propinfer，这是在两个微调范式下评估LLM中属性推断的基准任务：提问和聊天完成。我们的基准构建在Chatdoctor数据集中，包括一系列属性类型和任务配置。我们进一步提出了两次量身定制的攻击：基于及时的一代攻击和一个阴影模型攻击，利用字频频率信号。跨多个经过验证的LLM的经验评估显示了我们攻击的成功，揭示了以前未被认可的LLMS脆弱性。

Title: EQA-RM: A Generative Embodied Reward Model with Test-time Scaling

Authors: Yuhang Chen, Zhen Tan, Tianlong Chen
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.10389
Pdf URL: https://arxiv.org/pdf/2506.10389
Copy Paste: [[2506.10389]] EQA-RM: A Generative Embodied Reward Model with Test-time Scaling(https://arxiv.org/abs/2506.10389)
Keywords: generative
Abstract: Reward Models (RMs), vital for large model alignment, are underexplored for complex embodied tasks like Embodied Question Answering (EQA) where nuanced evaluation of agents' spatial, temporal, and logical understanding is critical yet not considered by generic approaches. We introduce EQA-RM, a novel generative multimodal reward model specifically architected for EQA, trained via our innovative Contrastive Group Relative Policy Optimization (C-GRPO) strategy to learn fine-grained behavioral distinctions. The generative nature of EQA-RM provides interpretable, structured reward feedback (beyond simple scalars), uniquely enabling test-time scaling to dynamically adjust evaluation granularity, from concise scores to detailed critiques of reasoning and grounding, at inference without retraining. Concurrently, we introduce EQARewardBench, a new benchmark built on OpenEQA for standardized EQA reward model assessment. Demonstrating high sample efficiency, EQA-RM (fine-tuning Qwen2-VL-2B-Instruct) achieves 61.9% accuracy on EQA-RM-Bench with only 700 samples, outperforming strong proprietary baselines, including Gemini-2.5-Flash, GPT-4o, Claude-3.5-Haiku, and open-sourced state-of-the-art models such as RoVRM and VisualPRM. The code and dataset can be found here this https URL.
摘要：奖励模型（RM）对大型模型对齐至关重要，但在复杂的具身任务（例如具身问答（EQA））中尚未得到充分探索；在这类任务中，对智能体的空间、时间和逻辑理解进行细致评估至关重要，而通用方法并未考虑这一点。我们介绍了EQA-RM，这是一种专门针对EQA进行架构的新型生成多模式奖励模型，通过我们创新的对比组相对政策优化（C-GRPO）策略进行了训练，以学习细粒度的行为区别。EQA-RM的生成性质提供了可解释的，结构化的奖励反馈（超越简单的标量），从而使测试时间缩放能够在推理时动态调整评估粒度，从简洁的分数到对推理和基础的详细批评，而无需再培训。同时，我们介绍了EQARewardBench，这是一种基于OpenEQA的新基准，用于标准化的EQA奖励模型评估。EQA-RM（微调Qwen2-VL-2B-Instruct）展示了较高的样本效率，仅用700个样本就在EQA-RM-Bench上达到61.9%的准确率，优于强大的专有基线（包括Gemini-2.5-Flash、GPT-4o、Claude-3.5-Haiku）以及RoVRM和VisualPRM等开源的最新模型。代码和数据集可以在此处找到此HTTPS URL。

Title: ReconMOST: Multi-Layer Sea Temperature Reconstruction with Observations-Guided Diffusion

Authors: Yuanyi Song, Pumeng Lyu, Ben Fei, Fenghua Ling, Wanli Ouyang, Lei Bai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.10391
Pdf URL: https://arxiv.org/pdf/2506.10391
Copy Paste: [[2506.10391]] ReconMOST: Multi-Layer Sea Temperature Reconstruction with Observations-Guided Diffusion(https://arxiv.org/abs/2506.10391)
Keywords: generation
Abstract: Accurate ocean reconstruction is essential for reflecting global climate dynamics and supporting marine meteorological research. Conventional methods face challenges due to sparse data, algorithmic complexity, and high computational costs, while the growing use of machine learning (ML) methods remains limited to reconstruction problems at the sea surface and in local regions, and struggles with issues like cloud occlusion. To address these limitations, this paper proposes ReconMOST, a data-driven guided diffusion model framework for multi-layer sea temperature reconstruction. Specifically, we first pre-train an unconditional diffusion model using a large collection of historical numerical simulation data, enabling the model to attain physically consistent distribution patterns of ocean temperature fields. During the generation phase, sparse yet high-accuracy in-situ observational data are utilized as guidance points for the reverse diffusion process, generating accurate reconstruction results. Importantly, in regions lacking direct observational data, the physically consistent spatial distribution patterns learned during pre-training enable implicitly guided and physically plausible reconstructions. Our method extends ML-based SST reconstruction to a global, multi-layer setting, handling over 92.5% missing data while maintaining reconstruction accuracy, spatial resolution, and superior generalization capability. We pre-train our model on CMIP6 numerical simulation data and conduct guided reconstruction experiments on CMIP6 and EN4 analysis data. The mean squared error (MSE) reaches 0.049 on guidance, 0.680 on reconstruction, and 0.633 in total, demonstrating the effectiveness and robustness of the proposed framework. Our source code is available at this https URL.
摘要：准确的海洋重建对于反映全球气候动态和支持海洋气象研究至关重要。传统方法由于数据稀疏，算法复杂性和高计算成本而面临挑战，而增加机器学习（ML）方法的使用量仍然限于海面和地方地区的重建问题，从而在云闭塞等问题上进行了努力。为了解决这些局限性，本文提出了一个由数据驱动的引导性扩散模型框架进行重新构建，用于多层海温重建。具体而言，我们首先使用大量历史数值模拟数据收集的无条件扩散模型预先培训，从而使该模型能够达到海洋温度场的物理一致分布模式。在生成阶段，使用稀疏但高临界性的原位观察数据被用作反向扩散过程的指导点，从而产生准确的重建结果。重要的是，在缺乏直接观察数据的区域中，在预训练期间学到的物理一致的空间分布模式可以隐式引导和物理上合理的重建。我们的方法将基于ML的SST重建扩建到全局多层设置，处理超过92.5％的数据，同时保持重建精度，空间分辨率和出色的概括能力。我们在CMIP6数值模拟数据上预先培训我们的模型，并在CMIP6和EN4分析数据上进行指导的重建实验。平均平方误差（MSE）值的结果在指导上达到0.049，重建为0.680，总计为0.633，分别证明了所提出框架的有效性和鲁棒性。我们的源代码可在此HTTPS URL上找到。

Title: Pisces: An Auto-regressive Foundation Model for Image Understanding and Generation

Authors: Zhiyang Xu, Jiuhai Chen, Zhaojiang Lin, Xichen Pan, Lifu Huang, Tianyi Zhou, Madian Khabsa, Qifan Wang, Di Jin, Michihiro Yasunaga, Lili Yu, Xi Victoria Lin, Shaoliang Nie
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.10395
Pdf URL: https://arxiv.org/pdf/2506.10395
Copy Paste: [[2506.10395]] Pisces: An Auto-regressive Foundation Model for Image Understanding and Generation(https://arxiv.org/abs/2506.10395)
Keywords: generation, generative
Abstract: Recent advances in large language models (LLMs) have enabled multimodal foundation models to tackle both image understanding and generation within a unified framework. Despite these gains, unified models often underperform compared to specialized models in either task. A key challenge in developing unified models lies in the inherent differences between the visual features needed for image understanding versus generation, as well as the distinct training processes required for each modality. In this work, we introduce Pisces, an auto-regressive multimodal foundation model that addresses this challenge through a novel decoupled visual encoding architecture and tailored training techniques optimized for multimodal generation. Combined with meticulous data curation, pretraining, and finetuning, Pisces achieves competitive performance in both image understanding and image generation. We evaluate Pisces on over 20 public benchmarks for image understanding, where it demonstrates strong performance across a wide range of tasks. Additionally, on GenEval, a widely adopted benchmark for image generation, Pisces exhibits robust generative capabilities. Our extensive analysis reveals the synergistic relationship between image understanding and generation, and the benefits of using separate visual encoders, advancing the field of unified multimodal models.
摘要：大型语言模型（LLM）的最新进展使多模式的模型能够在统一框架内解决图像理解和生成。尽管有这些收益，但与任一任务中的专业模型相比，统一模型的表现通常不佳。开发统一模型的关键挑战在于图像理解与生成所需的视觉特征以及每种模式所需的不同训练过程之间的固有差异。在这项工作中，我们介绍了双鱼座，这是一种自动回归多模式的基础模型，该模型通过新颖的脱钩视觉编码体系结构和针对多模式生成优化的量身定制的训练技术来应对这一挑战。与细致的数据策展，预处理和填充性结合在一起，双鱼座在图像理解和图像产生中都达到了竞争性能。我们在20多个公共基准上评估了双鱼座，以了解图像理解，在这些公共基准中，它在各种任务中都表现出了强劲的表现。此外，在Geneval上是图像产生的广泛采用的基准，双鱼座具有强大的生成能力。我们的广泛分析揭示了图像理解与产生之间的协同关系，以及使用单独的视觉编码器，推进统一多模型的领域的好处。

Title: Time To Impeach LLM-as-a-Judge: Programs are the Future of Evaluation

Authors: Tzu-Heng Huang, Harit Vishwakarma, Frederic Sala
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.10403
Pdf URL: https://arxiv.org/pdf/2506.10403
Copy Paste: [[2506.10403]] Time To Impeach LLM-as-a-Judge: Programs are the Future of Evaluation(https://arxiv.org/abs/2506.10403)
Keywords: generation
Abstract: Large language models (LLMs) are widely used to evaluate the quality of LLM generations and responses, but this leads to significant challenges: high API costs, uncertain reliability, inflexible pipelines, and inherent biases. To address these, we introduce PAJAMA (Program-As-a-Judge for Automated Model Assessment), a new alternative that uses LLMs to synthesize executable judging programs instead of directly scoring responses. These synthesized programs can be stored and run locally, costing orders of magnitude less while providing interpretable and auditable judging logic that can be easily adapted. Program-based judges mitigate biases, improving judgment consistency by 15.83% and reducing biased responses by 23.7% on average compared to a Qwen2.5-14B-based LLM-as-a-judge. When program judgments are distilled into a model, PAJAMA outperforms LLM-as-a-judge on the challenging CHAT-HARD subset of RewardBench, outperforming metrics by 2.19% on Prometheus and 8.67% on the JudgeLM dataset, all at three orders of magnitude lower cost.
摘要：大型语言模型（LLM）广泛用于评估LLM生成内容和响应的质量，但这带来了重大挑战：高API成本，不确定的可靠性，不灵活的管道和固有的偏见。为了解决这些问题，我们介绍了PAJAMA（Program-As-a-Judge for Automated Model Assessment，即用于自动模型评估的程序评审），这是一种新的替代方案，使用LLM合成可执行的评审程序，而不是直接对响应评分。这些合成的程序可以在本地存储和运行，成本降低数个数量级，同时提供可解释、可审核且易于调整的评审逻辑。与基于Qwen2.5-14B的LLM-as-a-judge相比，基于程序的评审减轻了偏见，将判断一致性提高了15.83％，并将有偏见的响应平均减少了23.7％。当程序判断被蒸馏到模型中时，PAJAMA在RewardBench中具有挑战性的CHAT-HARD子集上优于LLM-as-a-judge，在Prometheus上的指标高出2.19％，在JudgeLM数据集上高出8.67％，而成本低三个数量级。

Title: Generative Algorithms for Wildfire Progression Reconstruction from Multi-Modal Satellite Active Fire Measurements and Terrain Height

Authors: Bryan Shaddy, Brianna Binder, Agnimitra Dasgupta, Haitong Qin, James Haley, Angel Farguell, Kyle Hilburn, Derek V. Mallia, Adam Kochanski, Jan Mandel, Assad Oberai
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.10404
Pdf URL: https://arxiv.org/pdf/2506.10404
Copy Paste: [[2506.10404]] Generative Algorithms for Wildfire Progression Reconstruction from Multi-Modal Satellite Active Fire Measurements and Terrain Height(https://arxiv.org/abs/2506.10404)
Keywords: generative
Abstract: Increasing wildfire occurrence has spurred growing interest in wildfire spread prediction. However, even the most complex wildfire models diverge from observed progression during multi-day simulations, motivating need for data assimilation. A useful approach to assimilating measurement data into complex coupled atmosphere-wildfire models is to estimate wildfire progression from measurements and use this progression to develop a matching atmospheric state. In this study, an approach is developed for estimating fire progression from VIIRS active fire measurements, GOES-derived ignition times, and terrain height data. A conditional Generative Adversarial Network is trained with simulations of historic wildfires from the atmosphere-wildfire model WRF-SFIRE, thus allowing incorporation of WRF-SFIRE physics into estimates. Fire progression is succinctly represented by fire arrival time, and measurements for training are obtained by applying an approximate observation operator to WRF-SFIRE solutions, eliminating need for satellite data during training. The model is trained on tuples of fire arrival times, measurements, and terrain, and once trained leverages measurements of real fires and corresponding terrain data to generate samples of fire arrival times. The approach is validated on five Pacific US wildfires, with results compared against high-resolution perimeters measured via aircraft, finding an average Sorensen-Dice coefficient of 0.81. The influence of terrain height on the arrival time inference is also evaluated and it is observed that terrain has minimal influence when the inference is conditioned on satellite measurements.
摘要：野火的增加促使人们对野火扩散预测的兴趣日益增长。但是，即使是最复杂的野火模型，在多天模拟过程中观察到的进展也有所不同，激发了数据同化的需求。将测量数据吸收到复杂的耦合大气 - 野火模型中的一种有用方法是估计野火从测量中进展，并使用此进展来发展匹配的大气状态。在这项研究中，开发了一种方法来估算Viirs主动火灾测量，衍生的点火时间和地形高度数据的火灾进展。有条件的生成对抗网络对大气野火模型WRF-SFIRE的历史野火进行了训练，从而可以将WRF-SFIRE物理学纳入估计值。火灾进度以火灾到达时间为简洁，并通过将近似观测操作员应用于WRF-SFIRE解决方案来获得训练的测量结果，从而消除了训练期间对卫星数据的需求。该模型是针对火灾到达时间，测量和地形的元素进行训练的，一旦训练有素的真实火灾和相应地形数据的测量，以生成火灾到达时间的样本。该方法在五个太平洋野火中得到了验证，结果与通过飞机测量的高分辨率周围相比，结果的平均Sorensen-DICE系数为0.81。还评估了地形高度对到达时间推断的影响，并且观察到当推断对卫星测量的调节时，地形的影响很小。
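
The Sorensen-Dice coefficient reported above measures overlap between a predicted burned area and a measured perimeter. The sketch below shows one plausible way to compute it from fire arrival times: threshold each arrival-time grid at a time `t` to get the set of burned cells, then compare the two sets. The grids, the threshold, and the helper names are hypothetical, and the paper's exact evaluation procedure may differ.

```python
def burned_cells(arrival_time, t):
    """Return the set of (row, col) cells whose fire arrival time is <= t."""
    return {
        (i, j)
        for i, row in enumerate(arrival_time)
        for j, value in enumerate(row)
        if value <= t
    }

def dice_coefficient(a, b):
    """Sorensen-Dice coefficient 2|A n B| / (|A| + |B|) for two sets."""
    if not a and not b:
        return 1.0  # two empty regions are treated as a perfect match
    return 2.0 * len(a & b) / (len(a) + len(b))

# Toy arrival-time grids in hours since ignition (hypothetical values).
predicted = [
    [1.0, 2.0, 5.0],
    [2.0, 3.0, 6.0],
    [4.0, 7.0, 9.0],
]
observed = [
    [1.0, 2.0, 3.0],
    [2.0, 4.0, 6.0],
    [5.0, 8.0, 9.0],
]

score = dice_coefficient(burned_cells(predicted, 4.0), burned_cells(observed, 4.0))
```

A value of 1 means the two burned regions coincide exactly and 0 means they share no cells, so the reported average of 0.81 indicates substantial but imperfect overlap.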

Title: Rethinking Generative Human Video Coding with Implicit Motion Transformation

Authors: Bolin Chen, Ru-Ling Liao, Jie Chen, Yan Ye
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2506.10453
Pdf URL: https://arxiv.org/pdf/2506.10453
Copy Paste: [[2506.10453]] Rethinking Generative Human Video Coding with Implicit Motion Transformation(https://arxiv.org/abs/2506.10453)
Keywords: generative
Abstract: Beyond traditional hybrid-based video codec, generative video codec could achieve promising compression performance by evolving high-dimensional signals into compact feature representations for bitstream compactness at the encoder side and developing explicit motion fields as intermediate supervision for high-quality reconstruction at the decoder side. This paradigm has achieved significant success in face video compression. However, compared to facial videos, human body videos pose greater challenges due to their more complex and diverse motion patterns, i.e., when using explicit motion guidance for Generative Human Video Coding (GHVC), the reconstruction results could suffer severe distortions and inaccurate motion. As such, this paper highlights the limitations of explicit motion-based approaches for human body video compression and investigates the GHVC performance improvement with the aid of Implicit Motion Transformation, namely IMT. In particular, we propose to characterize complex human body signal into compact visual features and transform these features into implicit motion guidance for signal reconstruction. Experimental results demonstrate the effectiveness of the proposed IMT paradigm, which can facilitate GHVC to achieve high-efficiency compression and high-fidelity synthesis.
摘要：除了传统的混合视频编解码器之外，生成视频编解码器还可以通过将高维信号发展为编码器侧的bitstream紧凑性的紧凑特征表示，并将显式运动场作为中等监督，以在解码器侧进行高质量重建，以实现有希望的压缩性能。该范式在面部视频压缩中取得了巨大的成功。但是，与面部视频相比，人体视频对其更复杂和多样化的运动模式提出了更大的挑战，即，在使用明确的运动指南（GHVC）时，重建结果可能会遭受严重的扭曲和不准确的运动。因此，本文强调了对人体视频压缩的明确运动方法的局限性，并借助于隐式运动转换研究了GHVC性能的改善，即IMT。特别是，我们建议将复杂的人体信号表征为紧凑的视觉特征，并将这些特征转化为信号重建的隐式运动指导。实验结果证明了拟议的IMT范式的有效性，这可以促进GHVC实现高效率压缩和高保真合成。

Title: LLMs Are Not Yet Ready for Deepfake Image Detection

Authors: Shahroz Tariq, David Nguyen, M.A.P. Chamikara, Tingmin Wu, Alsharif Abuadbba, Kristen Moore
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.10474
Pdf URL: https://arxiv.org/pdf/2506.10474
Copy Paste: [[2506.10474]] LLMs Are Not Yet Ready for Deepfake Image Detection(https://arxiv.org/abs/2506.10474)
Keywords: generation
Abstract: The growing sophistication of deepfakes presents substantial challenges to the integrity of media and the preservation of public trust. Concurrently, vision-language models (VLMs), large language models enhanced with visual reasoning capabilities, have emerged as promising tools across various domains, sparking interest in their applicability to deepfake detection. This study conducts a structured zero-shot evaluation of four prominent VLMs: ChatGPT, Claude, Gemini, and Grok, focusing on three primary deepfake types: faceswap, reenactment, and synthetic generation. Leveraging a meticulously assembled benchmark comprising authentic and manipulated images from diverse sources, we evaluate each model's classification accuracy and reasoning depth. Our analysis indicates that while VLMs can produce coherent explanations and detect surface-level anomalies, they are not yet dependable as standalone detection systems. We highlight critical failure modes, such as an overemphasis on stylistic elements and vulnerability to misleading visual patterns like vintage aesthetics. Nevertheless, VLMs exhibit strengths in interpretability and contextual analysis, suggesting their potential to augment human expertise in forensic workflows. These insights imply that although general-purpose models currently lack the reliability needed for autonomous deepfake detection, they hold promise as integral components in hybrid or human-in-the-loop detection frameworks.
摘要：深层效果的越来越复杂给媒体的完整性和公众信任的保存带来了重大挑战。同时，视觉模型（通过视觉推理功能增强的大型语言模型）同时成为了各个领域的有前途的工具，这引起了对其对深层检测的适用性的兴趣。这项研究对四个突出的VLM进行了结构化的零摄像评估：Chatgpt，Claude，Gemini和Grok，重点介绍了三种主要的深层类型：面部wap，重演和合成生成。利用精心组装的基准，该基准包括来自不同来源的真实和操纵图像，我们评估了每个模型的分类精度和推理深度。我们的分析表明，尽管VLM可以产生连贯的解释并检测表面级异常，但它们尚不可作为独立检测系统可靠。我们重点介绍了关键的故障模式，例如过分强调风格元素以及误导诸如老式美学之类的视觉模式的脆弱性。然而，VLM在解释性和上下文分析方面具有优势，这表明它们有可能增强法医工作流中人类专业知识的潜力。这些见解表明，尽管通用模型目前缺乏自动摄影检测所需的可靠性，但它们将其作为混合或人类在循环检测框架中不可或缺的组成部分的希望。

Title: Equivariant Neural Diffusion for Molecule Generation

Authors: François Cornet, Grigory Bartosh, Mikkel N. Schmidt, Christian A. Naesseth
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.10532
Pdf URL: https://arxiv.org/pdf/2506.10532
Copy Paste: [[2506.10532]] Equivariant Neural Diffusion for Molecule Generation(https://arxiv.org/abs/2506.10532)
Keywords: generation, generative
Abstract: We introduce Equivariant Neural Diffusion (END), a novel diffusion model for molecule generation in 3D that is equivariant to Euclidean transformations. Compared to current state-of-the-art equivariant diffusion models, the key innovation in END lies in its learnable forward process for enhanced generative modelling. Rather than pre-specified, the forward process is parameterized through a time- and data-dependent transformation that is equivariant to rigid transformations. Through a series of experiments on standard molecule generation benchmarks, we demonstrate the competitive performance of END compared to several strong baselines for both unconditional and conditional generation.
摘要：我们介绍了eproimiant神经扩散（END），这是一个与欧几里得转化相等的3D分子生成的新型扩散模型。与当前最新的模棱两可的扩散模型相比，终点的关键创新在于其可学习的远期过程以增强生成建模。远期过程不是预先指定的，而是通过时间和数据依赖性转换对刚性转换进行参数化。通过对标准分子生成基准测试的一系列实验，我们证明了终端的竞争性能与无条件和条件产生的几个强基础相比。

Title: DreamActor-H1: High-Fidelity Human-Product Demonstration Video Generation via Motion-designed Diffusion Transformers

Authors: Lizhen Wang, Zhurong Xia, Tianshu Hu, Pengrui Wang, Pengfei Wang, Zerong Zheng, Ming Zhou
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.10568
Pdf URL: https://arxiv.org/pdf/2506.10568
Copy Paste: [[2506.10568]] DreamActor-H1: High-Fidelity Human-Product Demonstration Video Generation via Motion-designed Diffusion Transformers(https://arxiv.org/abs/2506.10568)
Keywords: generation
Abstract: In e-commerce and digital marketing, generating high-fidelity human-product demonstration videos is important for effective product presentation. However, most existing frameworks either fail to preserve the identities of both humans and products or lack an understanding of human-product spatial relationships, leading to unrealistic representations and unnatural interactions. To address these challenges, we propose a Diffusion Transformer (DiT)-based framework. Our method simultaneously preserves human identities and product-specific details, such as logos and textures, by injecting paired human-product reference information and utilizing an additional masked cross-attention mechanism. We employ a 3D body mesh template and product bounding boxes to provide precise motion guidance, enabling intuitive alignment of hand gestures with product placements. Additionally, structured text encoding is used to incorporate category-level semantics, enhancing 3D consistency during small rotational changes across frames. Trained on a hybrid dataset with extensive data augmentation strategies, our approach outperforms state-of-the-art techniques in maintaining the identity integrity of both humans and products and generating realistic demonstration motions. Project page: this https URL.
摘要：在电子商务和数字营销中，产生高保真的人类产品演示视频对于有效的产品展示非常重要。但是，大多数现有的框架要么无法保留人类和产品的身份，要么缺乏对人类产品空间关系的了解，从而导致不切实际的表示和不自然的相互作用。为了应对这些挑战，我们提出了一个基于扩散的变压器（DIT）的框架。我们的方法同时保留了人类的身份和特定于产品的细节，例如徽标和纹理，通过注入配对的人类产品参考信息并利用额外的蒙版跨注意机制。我们采用3D车身网格模板和产品边界框来提供精确的运动指导，从而使手势与产品放置的直观比对。此外，结构化文本编码用于合并类别级的语义，从而在跨帧的小型旋转变化过程中增强了3D一致性。在混合数据集中培训了具有广泛的数据增强策略的培训，我们的方法在保持人类和产品的身份完整性以及产生现实的演示动作方面优于最先进的技术。项目页面：此HTTPS URL。

Title: Improving Medical Visual Representation Learning with Pathological-level Cross-Modal Alignment and Correlation Exploration

Authors: Jun Wang, Lixing Zhu, Xiaohan Yu, Abhir Bhalerao, Yulan He
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.10573
Pdf URL: https://arxiv.org/pdf/2506.10573
Copy Paste: [[2506.10573]] Improving Medical Visual Representation Learning with Pathological-level Cross-Modal Alignment and Correlation Exploration(https://arxiv.org/abs/2506.10573)
Keywords: generation
Abstract: Learning medical visual representations from image-report pairs through joint learning has garnered increasing research attention due to its potential to alleviate the data scarcity problem in the medical domain. The primary challenges stem from the lengthy reports that feature complex discourse relations and semantic pathologies. Previous works have predominantly focused on instance-wise or token-wise cross-modal alignment, often neglecting the importance of pathological-level consistency. This paper presents a novel framework PLACE that promotes the Pathological-Level Alignment and enriches the fine-grained details via Correlation Exploration without additional human annotations. Specifically, we propose a novel pathological-level cross-modal alignment (PCMA) approach to maximize the consistency of pathology observations from both images and reports. To facilitate this, a Visual Pathology Observation Extractor is introduced to extract visual pathological observation representations from localized tokens. The PCMA module operates independently of any external disease annotations, enhancing the generalizability and robustness of our methods. Furthermore, we design a proxy task that enforces the model to identify correlations among image patches, thereby enriching the fine-grained details crucial for various downstream tasks. Experimental results demonstrate that our proposed framework achieves new state-of-the-art performance on multiple downstream tasks, including classification, image-to-text retrieval, semantic segmentation, object detection and report generation.
摘要：从图像报告对通过联合学习学习的医学视觉表示，由于其减轻医疗领域中数据稀缺问题的潜力，因此增加了研究的关注。主要的挑战源于漫长的报告，这些报告具有复杂的话语关系和语义病理。以前的作品主要集中在实例或令牌跨模式对齐方式上，通常忽略了病理水平一致性的重要性。本文提出了一个新颖的框架场所，该框架促进了病理水平的对齐，并通过相关探索而没有其他人类注释来丰富细粒度的细节。具体而言，我们提出了一种新型的病理水平跨模式比对（PCMA）方法，以最大程度地提高图像和报告中病理观察的一致性。为了促进这一点，引入了视觉病理观察提取器，以从局部令牌提取视觉病理观察表述。 PCMA模块独立于任何外部疾病注释，增强了我们方法的普遍性和鲁棒性。此外，我们设计了一个代理任务，该任务可以强制执行模型以识别图像贴片之间的相关性，从而丰富了对各种下游任务至关重要的细粒细节。实验结果表明，我们提出的框架在多个下游任务上实现了新的最新性能，包括分类，图像到文本检索，语义分割，对象检测和报告生成。

Title: DanceChat: Large Language Model-Guided Music-to-Dance Generation

Authors: Qing Wang, Xiaohang Yang, Yilan Dong, Naveen Raj Govindaraj, Gregory Slabaugh, Shanxin Yuan
Subjects: cs.CV, cs.MM, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2506.10574
Pdf URL: https://arxiv.org/pdf/2506.10574
Copy Paste: [[2506.10574]] DanceChat: Large Language Model-Guided Music-to-Dance Generation(https://arxiv.org/abs/2506.10574)
Keywords: generation
Abstract: Music-to-dance generation aims to synthesize human dance motion conditioned on musical input. Despite recent progress, significant challenges remain due to the semantic gap between music and dance motion, as music offers only abstract cues, such as melody, groove, and emotion, without explicitly specifying the physical movements. Moreover, a single piece of music can produce multiple plausible dance interpretations. This one-to-many mapping demands additional guidance, as music alone provides limited information for generating diverse dance movements. The challenge is further amplified by the scarcity of paired music and dance data, which restricts the modelâĂŹs ability to learn diverse dance patterns. In this paper, we introduce DanceChat, a Large Language Model (LLM)-guided music-to-dance generation approach. We use an LLM as a choreographer that provides textual motion instructions, offering explicit, high-level guidance for dance generation. This approach goes beyond implicit learning from music alone, enabling the model to generate dance that is both more diverse and better aligned with musical styles. Our approach consists of three components: (1) an LLM-based pseudo instruction generation module that produces textual dance guidance based on music style and structure, (2) a multi-modal feature extraction and fusion module that integrates music, rhythm, and textual guidance into a shared representation, and (3) a diffusion-based motion synthesis module together with a multi-modal alignment loss, which ensures that the generated dance is aligned with both musical and textual cues. Extensive experiments on AIST++ and human evaluations show that DanceChat outperforms state-of-the-art methods both qualitatively and quantitatively.
摘要：音乐到舞蹈的生成旨在合成以音乐输入为条件的人类舞蹈动作。尽管最近取得了进展，但由于音乐和舞蹈动作之间的语义差距，重大挑战仍然存在，因为音乐仅提供抽象提示，例如旋律、律动和情感，而没有明确指定身体动作。此外，一段音乐可以产生多种合理的舞蹈诠释。这种一对多的映射需要额外的指导，因为仅凭音乐为生成多样化的舞蹈动作提供的信息有限。配对音乐和舞蹈数据的稀缺进一步加剧了这一挑战，限制了模型学习多样化舞蹈模式的能力。在本文中，我们介绍了DanceChat，这是一种大型语言模型（LLM）引导的音乐到舞蹈生成方法。我们使用LLM作为编舞者，提供文本动作指令，为舞蹈生成提供明确的高层指导。这种方法超越了仅凭音乐的隐式学习，使模型能够生成既更多样化又更好地与音乐风格保持一致的舞蹈。我们的方法由三个部分组成：（1）基于LLM的伪指令生成模块，根据音乐风格和结构生成文本舞蹈指导；（2）多模态特征提取与融合模块，将音乐、节奏和文本指导整合到共享表示中；（3）基于扩散的动作合成模块以及多模态对齐损失，确保生成的舞蹈与音乐和文本提示保持一致。在AIST++上的大量实验和人工评估表明，DanceChat在定性和定量上都优于最先进的方法。

Title: Text to Image for Multi-Label Image Recognition with Joint Prompt-Adapter Learning

Authors: Chun-Mei Feng, Kai Yu, Xinxing Xu, Salman Khan, Rick Siow Mong Goh, Wangmeng Zuo, Yong Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.10575
Pdf URL: https://arxiv.org/pdf/2506.10575
Copy Paste: [[2506.10575]] Text to Image for Multi-Label Image Recognition with Joint Prompt-Adapter Learning(https://arxiv.org/abs/2506.10575)
Keywords: generation
Abstract: Benefiting from image-text contrastive learning, pre-trained vision-language models such as CLIP allow texts to be used directly as images (TaI) for parameter-efficient fine-tuning (PEFT). While CLIP is capable of making image features similar to the corresponding text features, the modality gap remains a nontrivial issue and limits the image recognition performance of TaI. Using multi-label image recognition (MLR) as an example, we present a novel method, called T2I-PAL, to tackle the modality gap issue when using only text captions for PEFT. The core design of T2I-PAL is to leverage pre-trained text-to-image generation models to generate photo-realistic and diverse images from text captions, thereby reducing the modality gap. To further enhance MLR, T2I-PAL incorporates a class-wise heatmap and learnable prototypes. This aggregates local similarities, making the representation of local visual features more robust and informative for multi-label recognition. For better PEFT, we further combine both prompt tuning and adapter learning to enhance classification performance. T2I-PAL offers significant advantages: it eliminates the need for fully semantically annotated training images, thereby reducing the manual annotation workload, and it preserves the intrinsic mode of the CLIP model, allowing for seamless integration with any existing CLIP framework. Extensive experiments on multiple benchmarks, including MS-COCO, VOC2007, and NUS-WIDE, show that our T2I-PAL can boost recognition performance by 3.47% on average over the top-ranked state-of-the-art methods.
摘要：受益于图像文本对比学习，预训练的视觉语言模型，例如剪辑，允许将文本直接作为图像（TAI）进行参数有效的微调（PEFT）。尽管剪辑能够使图像功能与相应的文本特征相似，但模态差距仍然是一个非平凡的问题，并且限制了TAI的图像识别性能。以多标签图像识别（MLR）为例，我们提出了一种新颖的方法，称为T2i-pal，在仅使用PEFT的文本字幕时解决模态差距问题。 T2I-PAL的核心设计是利用预先训练的文本对图像生成模型来生成来自文本字幕的照片现实和多样的图像，从而减少了模态差距。为了进一步增强MLR，T2I-PAL结合了类的热图和可学习的原型。这汇总了局部相似性，从而使局部视觉特征的表示更强大，并且提供了多标签识别的信息。为了获得更好的PEFT，我们进一步结合了及时的调整和适配器学习以增强分类性能。 T2I-PAL提供了显着的优势：它消除了对完全注释的训练图像的需求，从而减少了手动注释工作量，并保留了剪辑模型的固有模式，从而可以与任何现有的夹子框架无缝集成。在包括MS-Coco，VOC2007和NUS范围的多个基准测试的广泛实验表明，我们的T2I-PAL可以平均提高识别性能的平均3.47％以上是最高的最新方法。

Title: Harmonizing Geometry and Uncertainty: Diffusion with Hyperspheres

Authors: Muskan Dosi, Chiranjeev Chiranjeev, Kartik Thakral, Mayank Vatsa, Richa Singh
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.10576
Pdf URL: https://arxiv.org/pdf/2506.10576
Copy Paste: [[2506.10576]] Harmonizing Geometry and Uncertainty: Diffusion with Hyperspheres(https://arxiv.org/abs/2506.10576)
Keywords: generative
Abstract: Do contemporary diffusion models preserve the class geometry of hyperspherical data? Standard diffusion models rely on isotropic Gaussian noise in the forward process, inherently favoring Euclidean spaces. However, many real-world problems involve non-Euclidean distributions, such as hyperspherical manifolds, where class-specific patterns are governed by angular geometry within hypercones. When modeled in Euclidean space, these angular subtleties are lost, leading to suboptimal generative performance. To address this limitation, we introduce HyperSphereDiff to align hyperspherical structures with directional noise, preserving class geometry and effectively capturing angular uncertainty. We demonstrate both theoretically and empirically that this approach aligns the generative process with the intrinsic geometry of hyperspherical data, resulting in more accurate and geometry-aware generative models. We evaluate our framework on four object datasets and two face datasets, showing that incorporating angular uncertainty better preserves the underlying hyperspherical manifold. Resources are available at: this https URL
摘要：当代扩散模型是否保留了超级数据的类几何形状？标准扩散模型在正向过程中依赖于各向同性高斯噪声，固有地利用欧几里得空间。但是，许多现实世界中的问题都涉及非欧几里得分布，例如超球形歧管，其中特定于类的模式受超辅体中的角几何形状的控制。当在欧几里得空间中建模时，这些角微微丢失了，导致了次优的生成性能。为了解决这一限制，我们将Hyperspherediff引入了与定向噪声，保持阶级几何形状并有效捕获角度不确定性的超透明结构。我们在理论上和经验上都证明了这种方法将生成过程与超透明数据的内在几何形状保持一致，从而产生了更准确和几何学意识到的生成模型。我们在四个对象数据集和两个面部数据集上评估了我们的框架，这表明结合角度不确定性可以更好地保留潜在的超级歧管。资源可用：{此https url}

Title: High-resolution efficient image generation from WiFi CSI using a pretrained latent diffusion model

Authors: Eshan Ramesh, Nishio Takayuki
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.10605
Pdf URL: https://arxiv.org/pdf/2506.10605
Copy Paste: [[2506.10605]] High-resolution efficient image generation from WiFi CSI using a pretrained latent diffusion model(https://arxiv.org/abs/2506.10605)
Keywords: generation
Abstract: We present LatentCSI, a novel method for generating images of the physical environment from WiFi CSI measurements that leverages a pretrained latent diffusion model (LDM). Unlike prior approaches that rely on complex and computationally intensive techniques such as GANs, our method employs a lightweight neural network to map CSI amplitudes directly into the latent space of an LDM. We then apply the LDM's denoising diffusion model to the latent representation with text-based guidance before decoding using the LDM's pretrained decoder to obtain a high-resolution image. This design bypasses the challenges of pixel-space image generation and avoids the explicit image encoding stage typically required in conventional image-to-image pipelines, enabling efficient and high-quality image synthesis. We validate our approach on two datasets: a wide-band CSI dataset we collected with off-the-shelf WiFi devices and cameras; and a subset of the publicly available MM-Fi dataset. The results demonstrate that LatentCSI outperforms baselines of comparable complexity trained directly on ground-truth images in both computational efficiency and perceptual quality, while additionally providing practical advantages through its unique capacity for text-guided controllability.
摘要：我们提出了UtentCSI，这是一种新的方法，用于从WiFi CSI测量中生成物理环境的图像，该测量利用了预验证的潜在扩散模型（LDM）。与依靠复杂和计算密集型技术（例如gans）的先前方法不同，我们的方法采用轻量级神经网络将CSI振幅直接映射到LDM的潜在空间。然后，我们将LDM的DeNoising扩散模型应用于潜在表示，并使用基于文本的指导使用，然后使用LDM验证的解码器解码以获得高分辨率图像。该设计绕过了像素空间图像生成的挑战，并避免了通常在传统的图像到图像管道中所需的明确图像编码阶段，从而实现了有效且高质量的图像合成。我们在两个数据集上验证了我们的方法：我们使用现成的WiFi设备和相机收集的宽波段CSI数据集；以及公开可用的MM-FI数据集的子集。结果表明，LatentCSI的表现优于直接在计算效率和感知质量的地面图像上训练的可比复杂性的基准，同时还通过其独特的文本引导可控性能提供实用的优势。

Title: Hessian Geometry of Latent Space in Generative Models

Authors: Alexander Lobashev, Dmitry Guskov, Maria Larchenko, Mikhail Tamm
Subjects: cs.LG, cond-mat.stat-mech, cs.CV, math.DG, math.ST
Abstract URL: https://arxiv.org/abs/2506.10632
Pdf URL: https://arxiv.org/pdf/2506.10632
Copy Paste: [[2506.10632]] Hessian Geometry of Latent Space in Generative Models(https://arxiv.org/abs/2506.10632)
Keywords: generative
Abstract: This paper presents a novel method for analyzing the latent space geometry of generative models, including statistical physics models and diffusion models, by reconstructing the Fisher information metric. The method approximates the posterior distribution of latent variables given generated samples and uses this to learn the log-partition function, which defines the Fisher metric for exponential families. Theoretical convergence guarantees are provided, and the method is validated on the Ising and TASEP models, outperforming existing baselines in reconstructing thermodynamic quantities. Applied to diffusion models, the method reveals a fractal structure of phase transitions in the latent space, characterized by abrupt changes in the Fisher metric. We demonstrate that while geodesic interpolations are approximately linear within individual phases, this linearity breaks down at phase boundaries, where the diffusion model exhibits a divergent Lipschitz constant with respect to the latent space. These findings provide new insights into the complex structure of diffusion model latent spaces and their connection to phenomena like phase transitions. Our source code is available at this https URL.
摘要：本文提出了一种通过重建Fisher信息指标来分析生成模型（包括统计物理模型和扩散模型）的潜在空间几何形状的新方法。该方法近似给定样品的潜在变量的后验分布，并使用它来学习对数分区函数，该函数定义了指数族的Fisher指标。提供了理论收敛的保证，该方法在ISING和TASEP模型上进行了验证，在重建热力学量时的表现优于现有基准。该方法应用于扩散模型，揭示了潜在空间中相变的分形结构，其特征是Fisher度量的突然变化。我们证明，虽然地球插值在单个相内大致是线性的，但这种线性在相边界处分解，其中扩散模型相对于潜在空间表现出不同的Lipschitz常数。这些发现为扩散模型潜在空间的复杂结构及其与相似现象（如相变的现象的联系）提供了新的见解。我们的源代码可在此HTTPS URL上找到。

Title: Symmetrical Flow Matching: Unified Image Generation, Segmentation, and Classification with Score-Based Generative Models

Authors: Francisco Caetano, Christiaan Viviers, Peter H.N. De With, Fons van der Sommen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.10634
Pdf URL: https://arxiv.org/pdf/2506.10634
Copy Paste: [[2506.10634]] Symmetrical Flow Matching: Unified Image Generation, Segmentation, and Classification with Score-Based Generative Models(https://arxiv.org/abs/2506.10634)
Keywords: generation, generative
Abstract: Flow Matching has emerged as a powerful framework for learning continuous transformations between distributions, enabling high-fidelity generative modeling. This work introduces Symmetrical Flow Matching (SymmFlow), a new formulation that unifies semantic segmentation, classification, and image generation within a single model. Using a symmetric learning objective, SymmFlow models forward and reverse transformations jointly, ensuring bi-directional consistency, while preserving sufficient entropy for generative diversity. A new training objective is introduced to explicitly retain semantic information across flows, featuring efficient sampling while preserving semantic structure, allowing for one-step segmentation and classification without iterative refinement. Unlike previous approaches that impose strict one-to-one mapping between masks and images, SymmFlow generalizes to flexible conditioning, supporting both pixel-level and image-level class labels. Experimental results on various benchmarks demonstrate that SymmFlow achieves state-of-the-art performance on semantic image synthesis, obtaining FID scores of 11.9 on CelebAMask-HQ and 7.0 on COCO-Stuff with only 25 inference steps. Additionally, it delivers competitive results on semantic segmentation and shows promising capabilities in classification tasks. The code will be publicly available.
摘要：流量匹配已成为学习分布之间连续转换的强大框架，从而实现了高保真的生成建模。这项工作介绍了对称流量匹配（Symmflow），这是一种新的公式，统一了单个模型中语义分割，分类和图像生成。使用对称学习目标，共同向前和反向转换，确保双向一致性，同时保留足够的生成多样性熵。引入了一个新的培训目标，以明确保留流量的语义信息，具有有效的采样，同时保留语义结构，从而可以进行一步分割和分类，而无需迭代精致。与以前的方法在蒙版和图像之间强加一对一映射的方法不同，Symmflow推广到灵活的调理，支持像素级别和图像级别的类标签。各种基准的实验结果表明，Symmflow在语义图像合成上实现了最新的性能，在Celebamask-HQ上获得了11.9的FID得分，并且在Coco-STuff上只有25个推理步骤。此外，它在语义细分方面提供了竞争成果，并显示了分类任务中有希望的能力。该代码将公开可用。
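The abstract above builds on Flow Matching, which learns a velocity field that transports samples from one distribution to another. As a rough illustration only, the following sketch shows the standard straight-line construction of a training pair (interpolated point and target velocity). It is not SymmFlow's symmetric objective, which the abstract does not spell out; the function name `flow_matching_pair` and the toy vectors are assumptions made for this example.

```python
import numpy as np

def flow_matching_pair(x0, x1, t):
    """Return the point on the straight path from x0 to x1 at time t,
    together with the constant velocity a model would be trained to predict."""
    x_t = (1.0 - t) * x0 + t * x1
    velocity = x1 - x0
    return x_t, velocity

rng = np.random.default_rng(0)
x0 = rng.standard_normal(4)  # sample from the source (noise) distribution
x1 = rng.standard_normal(4)  # stand-in for a data sample (hypothetical)
t = 0.25
x_t, v = flow_matching_pair(x0, x1, t)

# Following the target velocity for the remaining time lands on x1.
reached = x_t + (1.0 - t) * v
```

In practice a neural network would regress `velocity` from `(x_t, t)`; sampling then integrates the learned field, which is why fewer integration steps translate directly into faster inference.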

Title: GigaVideo-1: Advancing Video Generation via Automatic Feedback with 4 GPU-Hours Fine-Tuning

Authors: Xiaoyi Bao, Jindi Lv, Xiaofeng Wang, Zheng Zhu, Xinze Chen, YuKun Zhou, Jiancheng Lv, Xingang Wang, Guan Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.10639
Pdf URL: https://arxiv.org/pdf/2506.10639
Copy Paste: [[2506.10639]] GigaVideo-1: Advancing Video Generation via Automatic Feedback with 4 GPU-Hours Fine-Tuning(https://arxiv.org/abs/2506.10639)
Keywords: generation
Abstract: Recent progress in diffusion models has greatly enhanced video generation quality, yet these models still require fine-tuning to improve specific dimensions like instance preservation, motion rationality, composition, and physical plausibility. Existing fine-tuning approaches often rely on human annotations and large-scale computational resources, limiting their practicality. In this work, we propose GigaVideo-1, an efficient fine-tuning framework that advances video generation without additional human supervision. Rather than injecting large volumes of high-quality data from external sources, GigaVideo-1 unlocks the latent potential of pre-trained video diffusion models through automatic feedback. Specifically, we focus on two key aspects of the fine-tuning process: data and optimization. To improve fine-tuning data, we design a prompt-driven data engine that constructs diverse, weakness-oriented training samples. On the optimization side, we introduce a reward-guided training strategy, which adaptively weights samples using feedback from pre-trained vision-language models with a realism constraint. We evaluate GigaVideo-1 on the VBench-2.0 benchmark using Wan2.1 as the baseline across 17 evaluation dimensions. Experiments show that GigaVideo-1 consistently improves performance on almost all the dimensions with an average gain of about 4% using only 4 GPU-hours. Requiring no manual annotations and minimal real data, GigaVideo-1 demonstrates both effectiveness and efficiency. Code, model, and data will be publicly available.
摘要：扩散模型的最新进展已大大提高了视频生成质量，但是这些模型仍然需要进行微调以改善特定维度，例如保存实例，运动理性，组成和身体上的合理性。现有的微调方法通常依赖于人类注释和大规模的计算资源，从而限制了它们的实用性。在这项工作中，我们提出了Gigavideo-1，这是一个有效的微调框架，可以在没有其他人类监督的情况下进行视频生成。 Gigavideo-1不是从外部来源注入大量高质量数据，而是通过自动反馈来解锁预训练的视频扩散模型的潜在潜力。具体而言，我们关注微调过程的两个关键方面：数据和优化。为了改善微调数据，我们设计了一个迅速驱动的数据引擎，该数据引擎构建了各种以弱点为导向的培训样本。在优化方面，我们引入了奖励指导的培训策略，该策略使用具有现实主义约束的预训练的视觉语言模型的反馈来适应样品。我们使用WAN2.1作为17个评估维度的基线评估VBENCH-2.0基准的Gigavideo-1。实验表明，Gigavideo-1始终使用仅使用4个GPU小时的平均增益，几乎平均增益约为4％。 Gigavideo-1不需要手动注释和最少的真实数据，既表现出有效性又表现出效率。代码，模型和数据将公开可用。

Title: Uncertainty-Masked Bernoulli Diffusion for Camouflaged Object Detection Refinement

Authors: Yuqi Shen, Fengyang Xiao, Sujie Hu, Youwei Pang, Yifan Pu, Chengyu Fang, Xiu Li, Chunming He
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.10712
Pdf URL: https://arxiv.org/pdf/2506.10712
Copy Paste: [[2506.10712]] Uncertainty-Masked Bernoulli Diffusion for Camouflaged Object Detection Refinement(https://arxiv.org/abs/2506.10712)
Keywords: generative
Abstract: Camouflaged Object Detection (COD) presents inherent challenges due to the subtle visual differences between targets and their backgrounds. While existing methods have made notable progress, there remains significant potential for post-processing refinement that has yet to be fully explored. To address this limitation, we propose the Uncertainty-Masked Bernoulli Diffusion (UMBD) model, the first generative refinement framework specifically designed for COD. UMBD introduces an uncertainty-guided masking mechanism that selectively applies Bernoulli diffusion to residual regions with poor segmentation quality, enabling targeted refinement while preserving correctly segmented areas. To support this process, we design the Hybrid Uncertainty Quantification Network (HUQNet), which employs a multi-branch architecture and fuses uncertainty from multiple sources to improve estimation accuracy. This enables adaptive guidance during the generative sampling process. The proposed UMBD framework can be seamlessly integrated with a wide range of existing Encoder-Decoder-based COD models, combining their discriminative capabilities with the generative advantages of diffusion-based refinement. Extensive experiments across multiple COD benchmarks demonstrate consistent performance improvements, achieving average gains of 5.5% in MAE and 3.2% in weighted F-measure with only modest computational overhead. Code will be released.
摘要：伪装的对象检测（COD）提出了固有的挑战，这是由于目标及其背景之间存在细微的视觉差异。尽管现有方法取得了显着的进展，但仍有尚未充分探索的后加工精炼的潜力。为了解决此限制，我们提出了不确定性掩盖的Bernoulli扩散（UMBD）模型，这是专门为COD设计的第一个生成改进框架。 UMBD引入了一种不确定性引导的掩蔽机制，该机制有选择地将Bernoulli扩散应用于分割质量较差的残留区域，从而实现了靶向细化，同时保留了正确分割的区域。为了支持此过程，我们设计了混合不确定性量化网络（HUQNET），该网络采用了多支分支体系结构并融合了来自多个来源的不确定性以提高估计准确性。这可以在生成抽样过程中进行自适应指导。所提出的UMBD框架可以与广泛的基于编码的COD模型无缝集成，将其判别能力与基于扩散的细化的生成优势相结合。跨多个COD基准测试的广泛实验表现出一致的性能提高，MAE的平均增长率为5.5％，而加权F量的3.2％仅具有适度的计算开销。代码将发布。

Title: IQE-CLIP: Instance-aware Query Embedding for Zero-/Few-shot Anomaly Detection in Medical Domain

Authors: Hong Huang, Weixiang Sun, Zhijian Wu, Jingwen Niu, Donghuan Lu, Xian Wu, Yefeng Zheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.10730
Pdf URL: https://arxiv.org/pdf/2506.10730
Copy Paste: [[2506.10730]] IQE-CLIP: Instance-aware Query Embedding for Zero-/Few-shot Anomaly Detection in Medical Domain(https://arxiv.org/abs/2506.10730)
Keywords: generation
Abstract: Recent advances in vision-language models, such as CLIP, have significantly improved performance in zero- and few-shot anomaly detection (ZFSAD) tasks. However, most existing CLIP-based methods assume prior knowledge of categories and rely on carefully designed prompts tailored to specific scenarios. While these text prompts capture semantic information in the textual space, they often fail to distinguish normal and anomalous instances in the joint embedding space. Moreover, most ZFSAD approaches focus on industrial domains, with limited exploration in medical tasks. To address these limitations, we propose IQE-CLIP, a novel framework for ZFSAD in the medical domain. We show that query embeddings integrating both textual and instance-aware visual information serve as more effective indicators of anomalies. Specifically, we introduce class-based and learnable prompting tokens to better adapt CLIP to the medical setting. Furthermore, we design an instance-aware query module that extracts region-level contextual information from both modalities, enabling the generation of anomaly-sensitive embeddings. Extensive experiments on six medical datasets demonstrate that IQE-CLIP achieves state-of-the-art performance in both zero-shot and few-shot settings. Code and data are available at this https URL.
摘要：视觉模型（例如剪辑）的最新进展，在零射击异常检测（ZFSAD）任务中的性能显着提高了。但是，大多数现有的基于剪辑的方法都具有类别的先验知识，并依赖于针对特定方案量身定制的精心设计的提示。尽管这些文本提示在文本空间中捕获语义信息，但它们通常无法区分关节嵌入空间中的正常和异常实例。此外，大多数ZFSAD方法都集中在工业领域，对医疗任务的探索有限。为了解决这些局限性，我们提出了IQE-CLIP，这是医疗领域中ZFSAD的新框架。我们表明，整合文本和实例感知视觉信息的查询嵌入式是异常的更有效指标。具体来说，我们介绍了基于班级和可学习的提示令牌，以更好地适应医疗环境。此外，我们设计了一个实例感知的查询模块，该模块可以从两种模态中提取区域级的上下文信息，从而能够生成对异常敏感的嵌入。在六个医疗数据集上进行的广泛实验表明，IQE-CLIP在零射击和少量设置中都达到了最先进的性能。代码和数据可在\ href {此https url} {this https url}中获得。

Title: PosterCraft: Rethinking High-Quality Aesthetic Poster Generation in a Unified Framework

Authors: SiXiang Chen, Jianyu Lai, Jialin Gao, Tian Ye, Haoyu Chen, Hengyu Shi, Shitong Shao, Yunlong Lin, Song Fei, Zhaohu Xing, Yeying Jin, Junfeng Luo, Xiaoming Wei, Lei Zhu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.10741
Pdf URL: https://arxiv.org/pdf/2506.10741
Copy Paste: [[2506.10741]] PosterCraft: Rethinking High-Quality Aesthetic Poster Generation in a Unified Framework(https://arxiv.org/abs/2506.10741)
Keywords: generation
Abstract: Generating aesthetic posters is more challenging than simple design images: it requires not only precise text rendering but also the seamless integration of abstract artistic content, striking layouts, and overall stylistic harmony. To address this, we propose PosterCraft, a unified framework that abandons prior modular pipelines and rigid, predefined layouts, allowing the model to freely explore coherent, visually compelling compositions. PosterCraft employs a carefully designed, cascaded workflow to optimize the generation of high-aesthetic posters: (i) large-scale text-rendering optimization on our newly introduced Text-Render-2M dataset; (ii) region-aware supervised fine-tuning on HQ-Poster100K; (iii) aesthetic-text-reinforcement learning via best-of-n preference optimization; and (iv) joint vision-language feedback refinement. Each stage is supported by a fully automated data-construction pipeline tailored to its specific needs, enabling robust training without complex architectural modifications. Evaluated in multiple experiments, PosterCraft significantly outperforms open-source baselines in rendering accuracy, layout coherence, and overall visual appeal, approaching the quality of SOTA commercial systems. Our code, models, and datasets can be found on the project page: this https URL
摘要：生成美学海报比简单的设计图像更具挑战性：它不仅需要精确的文本渲染，而且需要抽象艺术内容，引人注目的布局和整体风格和谐的无缝集成。为了解决这个问题，我们提出了Postercraft，这是一个统一的框架，它放弃了先前的模块化管道和刚性，预定义的布局，从而使模型可以自由探索相干，视觉上令人信服的组合物。 Postercraft采用了精心设计的级联工作流程，以优化高审美海报的生成：（i）在我们新引入的Text Render-2M数据集中，大规模的文本渲染优化；（ii）在HQ-Poster100k上进行的受到区域感知的微调；（iii）通过最佳偏好优化的审美 - 文本提升学习；（iv）联合视觉反馈的改进。每个阶段都由根据其特定需求量身定制的全自动数据构建管道支持，从而实现了强大的训练，而无需进行复杂的体系结构修改。在多个实验中进行评估，Postercraft在呈现精度，布局连贯性和整体视觉吸引力方面显着优于开源基线，从而占据了SOTA商业系统的质量。我们的代码，模型和数据集可以在项目页面中找到：此HTTPS URL

Title: Neural at ArchEHR-QA 2025: Agentic Prompt Optimization for Evidence-Grounded Clinical Question Answering

Authors: Sai Prasanna Teja Reddy Bogireddy, Abrar Majeedi, Viswanatha Reddy Gajjala, Zhuoyan Xu, Siddhant Rai, Vaishnav Potlapalli
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2506.10751
Pdf URL: https://arxiv.org/pdf/2506.10751
Copy Paste: [[2506.10751]] Neural at ArchEHR-QA 2025: Agentic Prompt Optimization for Evidence-Grounded Clinical Question Answering(https://arxiv.org/abs/2506.10751)
Keywords: generation
Abstract: Automated question answering (QA) over electronic health records (EHRs) can bridge critical information gaps for clinicians and patients, yet it demands both precise evidence retrieval and faithful answer generation under limited supervision. In this work, we present Neural, the runner-up in the BioNLP 2025 ArchEHR-QA shared task on evidence-grounded clinical QA. Our proposed method decouples the task into (1) sentence-level evidence identification and (2) answer synthesis with explicit citations. For each stage, we automatically explore the prompt space with DSPy's MIPROv2 optimizer, jointly tuning instructions and few-shot demonstrations on the development set. A self-consistency voting scheme further improves evidence recall without sacrificing precision. On the hidden test set, our method attains an overall score of 51.5, placing second while outperforming standard zero-shot and few-shot prompting by over 20 and 10 points, respectively. These results indicate that data-driven prompt optimization is a cost-effective alternative to model fine-tuning for high-stakes clinical QA, advancing the reliability of AI assistants in healthcare.
摘要：对电子健康记录（EHRS）的自动化问答（QA）可以弥合临床医生和患者的关键信息差距，但是在有限的监督下，它要求精确的证据检索和忠实的答案生成。在这项工作中，我们介绍了神经，这是Bionlp 2025 Archehr-QA的亚军，在证据基础的临床质量检查上共享任务。我们提出的方法将任务解除为（1）句子级证据识别和（2）用显式引用回答合成。对于每个阶段，我们都会使用DSPY的MIPROV 2 Optimizer自动探索提示空间，共同调整说明和在开发集中进行的几次演示。自洽投票计划进一步改善了证据召回而无需牺牲精度。在隐藏的测试集中，我们的方法的总分达到51.5，位于第二阶段，同时优于标准零射门和少数射击的提示分别超过20分和10分。这些结果表明，数据驱动的迅速优化是用于高风险临床质量检查模型的成本效益的替代方法，可以提高AI助手在医疗保健方面的可靠性。
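The abstract above mentions a self-consistency voting scheme for evidence identification but does not state the voting rule. The sketch below assumes the simplest variant: the evidence selector is sampled several times, and a sentence is kept if it appears in at least a given fraction of the runs. The function name `vote_evidence`, the `min_fraction` threshold, and the toy sentence IDs are all hypothetical.

```python
from collections import Counter

def vote_evidence(runs, min_fraction=0.5):
    """Keep sentence IDs selected in at least `min_fraction` of the sampled runs.

    `runs` is a list of per-run selections (each a list of sentence IDs).
    Duplicates within a single run are counted once.
    """
    counts = Counter(sid for run in runs for sid in set(run))
    needed = min_fraction * len(runs)
    return sorted(sid for sid, c in counts.items() if c >= needed)

# Four sampled runs of a (hypothetical) evidence selector over one clinical note.
runs = [[1, 3, 4], [1, 4], [1, 2, 4], [4, 5]]
selected = vote_evidence(runs)
```

Lowering `min_fraction` favors recall, raising it favors precision; the threshold actually used by the authors is not given in the abstract.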

Title: Stroke-based Cyclic Amplifier: Image Super-Resolution at Arbitrary Ultra-Large Scales

Authors: Wenhao Guo, Peng Lu, Xujun Peng, Zhaoran Zhao, Sheng Li
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.10774
Pdf URL: https://arxiv.org/pdf/2506.10774
Copy Paste: [[2506.10774]] Stroke-based Cyclic Amplifier: Image Super-Resolution at Arbitrary Ultra-Large Scales(https://arxiv.org/abs/2506.10774)
Keywords: super-resolution
Abstract: Prior Arbitrary-Scale Image Super-Resolution (ASISR) methods often experience a significant performance decline when the upsampling factor exceeds the range covered by the training data, introducing substantial blurring. To address this issue, we propose a unified model, Stroke-based Cyclic Amplifier (SbCA), for ultra-large upsampling tasks. The key to SbCA is the stroke vector amplifier, which decomposes the image into a series of strokes represented as vector graphics for magnification. The detail completion module then restores missing details, ensuring high-fidelity image reconstruction. Our cyclic strategy achieves ultra-large upsampling by iteratively refining details with this unified SbCA model, trained only once, while keeping sub-scales within the training range. Our approach effectively addresses the distribution drift issue and eliminates artifacts, noise and blurring, producing high-quality, high-resolution super-resolved images. Experimental validations on both synthetic and real-world datasets demonstrate that our approach significantly outperforms existing methods in ultra-large upsampling tasks (e.g. ×100), delivering visual quality far superior to state-of-the-art techniques.
摘要：当上采样因子超过训练数据所覆盖的范围时，先前的任意规模的图像超分辨率（ISISR）方法通常会出现显着的性能下降，从而引入了实质性的模糊。为了解决这个问题，我们提出了一个统一的模型，基于中风的循环放大器（SBCA），以实现超大的UP采样任务。 SBCA的关键是中风矢量放大器，该放大器将图像分解为一系列的中风，称为载体图形。然后，详细信息完成模块还恢复缺失的详细信息，以确保高保真图像重建。我们的循环策略通过使用这种统一的SBCA模型迭代精炼细节来实现超大的提升，同时仅接受一次训练，同时将子量表保持在训练范围内。我们的方法有效地解决了分布漂移问题，并消除了人工制品，噪声和模糊，从而产生了高质量的高分辨率超级分辨图像。对合成数据集和现实世界数据集的实验验证表明，我们的方法在超大提升任务（例如$ \ times100 $）中的现有方法显着优于现有方法，从而提供了视觉质量，远远超过了最先进的技术。
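The cyclic strategy described above reaches a very large factor by applying the same model repeatedly, with each pass staying inside the scale range seen during training. The arithmetic can be sketched as follows, assuming an equal split across cycles and a hypothetical maximum trained scale of 4; the abstract does not say how the authors actually choose the sub-scales.

```python
def cyclic_schedule(target_scale, max_trained_scale):
    """Split a large upsampling factor into equal per-cycle factors,
    each no larger than the largest scale seen in training."""
    if target_scale < 1 or max_trained_scale <= 1:
        raise ValueError("target_scale must be >= 1 and max_trained_scale > 1")
    cycles = 1
    # Smallest number of cycles whose combined maximum reach covers the target.
    while max_trained_scale ** cycles < target_scale:
        cycles += 1
    per_cycle = target_scale ** (1.0 / cycles)
    return [per_cycle] * cycles

# Example: a x100 target with a model assumed to be trained up to x4.
schedule = cyclic_schedule(100, 4)
```

With these assumed numbers, three cycles reach at most ×64, so four cycles are needed, each at roughly ×3.16.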

Title: Dense Associative Memory with Epanechnikov Energy

Authors: Benjamin Hoover, Zhaoyang Shi, Krishnakumar Balasubramanian, Dmitry Krotov, Parikshit Ram
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.10801
Pdf URL: https://arxiv.org/pdf/2506.10801
Copy Paste: [[2506.10801]] Dense Associative Memory with Epanechnikov Energy(https://arxiv.org/abs/2506.10801)
Keywords: generative
Abstract: We propose a novel energy function for Dense Associative Memory (DenseAM) networks, the log-sum-ReLU (LSR), inspired by optimal kernel density estimation. Unlike the common log-sum-exponential (LSE) function, LSR is based on the Epanechnikov kernel and enables exact memory retrieval with exponential capacity without requiring exponential separation functions. Moreover, it introduces abundant additional emergent local minima while preserving perfect pattern recovery -- a characteristic previously unseen in DenseAM literature. Empirical results show that LSR energy has significantly more local minima (memories) that have comparable log-likelihood to LSE-based models. Analysis of LSR's emergent memories on image datasets reveals a degree of creativity and novelty, hinting at this method's potential for both large-scale memory storage and generative tasks.
摘要：我们提出了一个新型的能量功能，用于密集的关联记忆（denseam）网络，log-sum-relu（LSR），灵感来自最佳内核密度估计。与通用的对数 - 指数（LSE）函数不同，LSR基于Epanechnikov内核，并可以以指数级的容量来实现精确的内存检索，而无需指数分离函数。此外，它引入了丰富的附加\ emph {Emph {Emph}局部最小值，同时保留了完美的模式恢复 - 一种以前从未见过的特征，在eNSeam文献中是从未见过的。经验结果表明，LSR能量具有与基于LSE的模型相当的对数模型的明显局部最小值（记忆）。对LSR在图像数据集上的新兴记忆的分析揭示了一定程度的创造力和新颖性，这暗示了这种方法对大规模存储器存储和生成任务的潜力。
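To make the contrast between the two energies concrete, the sketch below treats each energy as the negative log of a sum of kernel evaluations centered on the stored patterns: a Gaussian kernel for LSE and an Epanechnikov kernel, max(0, 1 - u^2), for LSR. The exact parameterization used in the paper is not given in the abstract, so the scaling by `beta`, the small `eps` guard against log(0), and the function names are assumptions.

```python
import numpy as np

def lse_energy(x, memories, beta=1.0):
    """Log-sum-exp energy: every stored pattern contributes (Gaussian kernel)."""
    sq = np.sum((memories - x) ** 2, axis=1)
    z = -0.5 * beta * sq
    m = np.max(z)  # subtract the max for numerical stability
    return -(m + np.log(np.sum(np.exp(z - m)))) / beta

def lsr_energy(x, memories, beta=1.0, eps=1e-12):
    """Log-sum-ReLU energy: only patterns within a finite radius contribute
    (Epanechnikov kernel). `eps` is an assumed guard for the all-zero case."""
    sq = np.sum((memories - x) ** 2, axis=1)
    k = np.maximum(0.0, 1.0 - 0.5 * beta * sq)
    return -np.log(eps + np.sum(k)) / beta

# Three well-separated toy patterns (hypothetical data).
memories = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])
at_memory = lsr_energy(memories[0], memories)
far_away = lsr_energy(np.array([20.0, 20.0]), memories)
```

Because the Epanechnikov kernel has compact support, a query sitting on a well-separated stored pattern sees only that pattern, whereas the Gaussian kernel always mixes in a small contribution from every other pattern.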

Title: CreatiPoster: Towards Editable and Controllable Multi-Layer Graphic Design Generation

Authors: Zhao Zhang, Yutao Cheng, Dexiang Hong, Maoke Yang, Gonglei Shi, Lei Ma, Hui Zhang, Jie Shao, Xinglong Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.10890
Pdf URL: https://arxiv.org/pdf/2506.10890
Copy Paste: [[2506.10890]] CreatiPoster: Towards Editable and Controllable Multi-Layer Graphic Design Generation(https://arxiv.org/abs/2506.10890)
Keywords: generation
Abstract: Graphic design plays a crucial role in both commercial and personal contexts, yet creating high-quality, editable, and aesthetically pleasing graphic compositions remains a time-consuming and skill-intensive task, especially for beginners. Current AI tools automate parts of the workflow, but struggle to accurately incorporate user-supplied assets, maintain editability, and achieve professional visual appeal. Commercial systems, like Canva Magic Design, rely on vast template libraries, which are impractical to replicate. In this paper, we introduce CreatiPoster, a framework that generates editable, multi-layer compositions from optional natural-language instructions or assets. A protocol model, an RGBA large multimodal model, first produces a JSON specification detailing every layer (text or asset) with precise layout, hierarchy, content and style, plus a concise background prompt. A conditional background model then synthesizes a coherent background conditioned on these rendered foreground layers. We construct a benchmark with automated metrics for graphic-design generation and show that CreatiPoster surpasses leading open-source approaches and proprietary commercial systems. To catalyze further research, we release a copyright-free corpus of 100,000 multi-layer designs. CreatiPoster supports diverse applications such as canvas editing, text overlay, responsive resizing, multilingual adaptation, and animated posters, advancing the democratization of AI-assisted graphic design. Project homepage: this https URL
摘要：图形设计在商业和个人环境中都起着至关重要的作用，但创造了高质量，可编辑和美观的图形作品仍然是一项耗时且技能密集的任务，尤其是对于初学者而言。当前的AI工具可自动化工作流程的一部分，但要准确地合并用户提供资产，维护编辑性并实现专业的视觉吸引力。商业系统（例如Canva Magic Design）依赖于庞大的模板库，这些模板库是不切实际的。在本文中，我们介绍了Creatiposter，该框架从可选的自然语言说明或资产中生成可编辑的多层组成。协议模型是RGBA大型多模型模型，首先产生JSON规范，详细介绍了具有精确布局，层次结构，内容和样式以及简洁的背景提示的每个图层（文本或资产）。然后，有条件的背景模型综合了在此渲染的前景层上的连贯背景。我们构建一个具有自动指标的基准，用于图形设计生成，并表明Creatiposter超过了领先的开源方法和专有商业系统。为了促进进一步的研究，我们发布了100,000个多层设计的无版权语料库。 Creatiposter支持各种应用程序，例如帆布编辑，文本覆盖，响应式调整，多语言适应和动画海报，从而推进了AI辅助图形设计的民主化。项目主页：此HTTPS URL

Title: The Diffusion Duality

Authors: Subham Sekhar Sahoo, Justin Deschenaux, Aaron Gokaslan, Guanghan Wang, Justin Chiu, Volodymyr Kuleshov
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2506.10892
Pdf URL: https://arxiv.org/pdf/2506.10892
Copy Paste: [[2506.10892]] The Diffusion Duality(https://arxiv.org/abs/2506.10892)
Keywords: generation
Abstract: Uniform-state discrete diffusion models hold the promise of fast text generation due to their inherent ability to self-correct. However, they are typically outperformed by autoregressive models and masked diffusion models. In this work, we narrow this performance gap by leveraging a key insight: Uniform-state diffusion processes naturally emerge from an underlying Gaussian diffusion. Our method, Duo, transfers powerful techniques from Gaussian diffusion to improve both training and sampling. First, we introduce a curriculum learning strategy guided by the Gaussian process, doubling training speed by reducing variance. Models trained with curriculum learning surpass autoregressive models in zero-shot perplexity on 3 of 7 benchmarks. Second, we present Discrete Consistency Distillation, which adapts consistency distillation from the continuous to the discrete setting. This algorithm unlocks few-step generation in diffusion language models by accelerating sampling by two orders of magnitude. We provide the code and model checkpoints on the project page: this http URL
摘要：统一的离散扩散模型由于其固有的自我校正能力而具有快速文本生成的希望。但是，它们通常超过自回归模型和掩盖扩散模型的表现。在这项工作中，我们通过利用关键洞察力来缩小这种性能差距：统一的扩散过程自然而然地从潜在的高斯扩散中出现。我们的二重奏从高斯扩散转移了强大的技术，以改善训练和采样。首先，我们引入了以高斯流程为指导的课程学习策略，通过降低差异来加倍训练速度。接受课程学习训练的模型在7个基准中的3个基准中以零射击的零射模型超级回归模型。其次，我们提出离散的一致性蒸馏，该蒸馏会适应从连续设置到离散设置的一致性蒸馏。该算法通过通过两个数量级加速采样来在扩散语言模型中解锁几步的生成。我们在项目页面上提供代码和模型检查点：此HTTP URL

Title: AIR: Zero-shot Generative Model Adaptation with Iterative Refinement

Authors: Guimeng Liu, Milad Abdollahzadeh, Ngai-Man Cheung
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.10895
Pdf URL: https://arxiv.org/pdf/2506.10895
Copy Paste: [[2506.10895]] AIR: Zero-shot Generative Model Adaptation with Iterative Refinement(https://arxiv.org/abs/2506.10895)
Keywords: generative
Abstract: Zero-shot generative model adaptation (ZSGM) aims to adapt a pre-trained generator to a target domain using only text guidance and without any samples from the target domain. Central to recent ZSGM approaches is a directional loss, which uses the text guidance by aligning the image offset with the text offset in the embedding space of a vision-language model like CLIP. This is similar to analogical reasoning in NLP, where the offset between one pair of words is used to identify a missing element in another pair by aligning the offsets between these two pairs. However, a major limitation of existing ZSGM methods is that the learning objective assumes complete alignment between image offset and text offset in the CLIP embedding space, resulting in quality degradation in generated images. Our work makes two main contributions. Inspired by the offset misalignment studies in NLP, as our first contribution, we perform an empirical study to analyze the misalignment between text offset and image offset in CLIP embedding space for various large publicly available datasets. Our important finding is that offset misalignment in CLIP embedding space is correlated with concept distance, i.e., close concepts have less offset misalignment. To address the limitations of the current approaches, as our second contribution, we propose Adaptation with Iterative Refinement (AIR), which is the first ZSGM approach to focus on improving target domain image quality based on our new insight on offset misalignment. Qualitative, quantitative, and user studies in 26 experiment setups consistently demonstrate that the proposed AIR approach achieves SOTA performance. Additional experiments are in Supp.
摘要：零击生成模型适应（ZSGM）旨在仅使用文本指南将预训练的发电机调整为目标域，而没有目标域中的任何样品。最近的ZSGM方法的中心是定向损失，该方向损失以文本指导为将图像偏移与文本偏移的形式对齐与视觉模型（如剪辑）的嵌入空间中的文本偏移。这类似于NLP中的类似推理，其中使用一对单词之间的偏移来识别另一对中缺少的元素，通过对齐这两对之间的偏移。但是，现有ZSGM方法的主要局限性是，学习目标假设剪辑嵌入空间中的图像偏移和文本偏移之间的完全对齐，从而导致生成的图像中的优质降级。我们的工作做出了两个主要贡献。受到NLP偏移量的未对准研究的启发，作为我们的第一个贡献，我们进行了一项经验研究，以分析文本偏移量和图像偏移之间在夹子嵌入空间中的偏移量之间的错位，用于各种大型公开可用数据集。我们重要的发现是，夹具嵌入空间中的偏移量未对准与概念距离相关，即，近距离概念的偏移量较小。 To address the limitations of the current approaches, as our second contribution, we propose Adaptation with Iterative Refinement (AIR), which is the first ZSGM approach to focus on improving target domain image quality based on our new insight on offset misalignment. Qualitative, quantitative, and user studies in 26 experiment setups consistently demonstrate that the proposed AIR approach achieves SOTA performance. 其他实验是在SUPP中。
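The directional loss described in this abstract can be illustrated with a small sketch. It assumes the commonly used form, one minus the cosine similarity between the image offset and the text offset. The function names and the toy 2-D vectors are hypothetical stand-ins for CLIP embeddings, and this is not the AIR method itself.

```python
import math

def offset(src, tgt):
    """Offset vector pointing from the source embedding to the target embedding."""
    return [t - s for s, t in zip(src, tgt)]

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def directional_loss(img_src, img_tgt, txt_src, txt_tgt):
    """Assumed form: 1 - cos(image offset, text offset).

    The loss is 0 when the two offsets point the same way and grows
    as they become misaligned.
    """
    return 1.0 - cosine(offset(img_src, img_tgt), offset(txt_src, txt_tgt))

# Toy embeddings (hypothetical values, not real CLIP outputs).
aligned = directional_loss([1.0, 0.0], [1.0, 2.0], [0.0, 1.0], [0.0, 4.0])
misaligned = directional_loss([1.0, 0.0], [1.0, 2.0], [0.0, 1.0], [3.0, 1.0])
```

In this toy setting the loss only measures how far the two offsets are from parallel. The abstract's point is that assuming perfect alignment becomes less accurate as the source and target concepts move further apart.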

Title: M4V: Multi-Modal Mamba for Text-to-Video Generation

Authors: Jiancheng Huang, Gengwei Zhang, Zequn Jie, Siyu Jiao, Yinlong Qian, Ling Chen, Yunchao Wei, Lin Ma
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.10915
Pdf URL: https://arxiv.org/pdf/2506.10915
Copy Paste: [[2506.10915]] M4V: Multi-Modal Mamba for Text-to-Video Generation(https://arxiv.org/abs/2506.10915)
Keywords: generation
Abstract: Text-to-video generation has significantly enriched content creation and holds the potential to evolve into powerful world simulators. However, modeling the vast spatiotemporal space remains computationally demanding, particularly when employing Transformers, which incur quadratic complexity in sequence processing and thus limit practical applications. Recent advancements in linear-time sequence modeling, particularly the Mamba architecture, offer a more efficient alternative. Nevertheless, its plain design limits its direct applicability to multi-modal and spatiotemporal video generation tasks. To address these challenges, we introduce M4V, a Multi-Modal Mamba framework for text-to-video generation. Specifically, we propose a multi-modal diffusion Mamba (MM-DiM) block that enables seamless integration of multi-modal information and spatiotemporal modeling through a multi-modal token re-composition design. As a result, the Mamba blocks in M4V reduce FLOPs by 45% compared to the attention-based alternative when generating videos at 768$\times$1280 resolution. Additionally, to mitigate the visual quality degradation in long-context autoregressive generation processes, we introduce a reward learning strategy that further enhances per-frame visual realism. Extensive experiments on text-to-video benchmarks demonstrate M4V's ability to produce high-quality videos while significantly lowering computational costs. Code and models will be publicly available at this https URL.
摘要：文本到视频的生成大大丰富了内容的创建，并具有发展成为强大的世界模拟器的潜力。但是，建模庞大的时空空间在计算上仍然需要进行计算要求，尤其是在使用变压器时，这会在序列处理中产生二次复杂性，从而限制了实际应用。线性时间序列建模的最新进步，尤其是MAMBA体系结构，提供了更有效的替代方案。然而，其简单的设计将其直接适用性限制在多模式和时空视频生成任务中。为了应对这些挑战，我们介绍了M4V，这是一个多式联运的MAMBA框架，用于文本到视频生成。具体而言，我们提出了一个多模式扩散mamba（mm-dim）块，该块可以通过多模式的代币重电设计实现多模式信息和时空建模的无缝集成。结果，与基于注意力的替代方案相比，M4V中的MAMBA块以768 $ \ times $ 1280的分辨率减少了45％。此外，为了减轻长期文化自回归生成过程中的视觉质量退化，我们引入了奖励学习策略，进一步增强了人均视觉现实主义。对文本至视频基准测试的广泛实验表明，M4V能够制作高质量视频，同时显着降低计算成本。代码和模型将在此HTTPS URL上公开可用。
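The complexity contrast in this abstract can be made concrete with a minimal sketch. It compares the number of token pairs that full self-attention scores with a single-pass linear recurrence. The recurrence is a simplified fixed-parameter state-space scan, assumed here for illustration only. It is not the Mamba architecture or the MM-DiM block, and the parameter names `a`, `b`, `c` are hypothetical.

```python
def attention_pair_count(n_tokens):
    """Full self-attention scores every token against every token: n^2 pairs."""
    return n_tokens * n_tokens

def linear_scan(xs, a=0.5, b=1.0, c=1.0):
    """One pass over the sequence: h_t = a*h_{t-1} + b*x_t, y_t = c*h_t.

    The work grows linearly with sequence length because each step only
    touches the running state h, never the earlier tokens themselves.
    """
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x
        ys.append(c * h)
    return ys

# An impulse decays geometrically through the state.
impulse_response = linear_scan([1.0, 0.0, 0.0])
```

Doubling the sequence length doubles the scan's work but quadruples the attention pair count, which is why long spatiotemporal token sequences are costly for attention.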

Title: VINCIE: Unlocking In-context Image Editing from Video

Authors: Leigang Qu, Feng Cheng, Ziyan Yang, Qi Zhao, Shanchuan Lin, Yichun Shi, Yicong Li, Wenjie Wang, Tat-Seng Chua, Lu Jiang
Subjects: cs.CV, cs.AI, cs.CL, cs.LG, cs.MM
Abstract URL: https://arxiv.org/abs/2506.10941
Pdf URL: https://arxiv.org/pdf/2506.10941
Copy Paste: [[2506.10941]] VINCIE: Unlocking In-context Image Editing from Video(https://arxiv.org/abs/2506.10941)
Keywords: generation
Abstract: In-context image editing aims to modify images based on a contextual sequence comprising text and previously generated images. Existing methods typically depend on task-specific pipelines and expert models (e.g., segmentation and inpainting) to curate training data. In this work, we explore whether an in-context image editing model can be learned directly from videos. We introduce a scalable approach to annotate videos as interleaved multimodal sequences. To effectively learn from this data, we design a block-causal diffusion transformer trained on three proxy tasks: next-image prediction, current segmentation prediction, and next-segmentation prediction. Additionally, we propose a novel multi-turn image editing benchmark to advance research in this area. Extensive experiments demonstrate that our model exhibits strong in-context image editing capabilities and achieves state-of-the-art results on two multi-turn image editing benchmarks. Despite being trained exclusively on videos, our model also shows promising abilities in multi-concept composition, story generation, and chain-of-editing applications.
摘要：在上下文图像编辑中，旨在根据包含文本和先前生成的图像的上下文序列修改图像。现有方法通常取决于特定于任务的管道和专家模型（例如，细分和内化）来策划培训数据。在这项工作中，我们探讨了是否可以直接从视频中学到中文图像编辑模型。我们引入了一种可扩展的方法来注释视频作为交织的多模式序列。为了有效地从这些数据中学习，我们设计了一个对三个代理任务进行训练的块临界扩散变压器：下一个图像预测，当前分段预测和下一分段预测。此外，我们提出了一种新型的多转弯图像编辑基准，以推动该领域的研究。广泛的实验表明，我们的模型表现出强大的外观图像编辑功能，并在两个多转变图像编辑基准上实现了最先进的结果。尽管仅接受了视频的培训，但我们的模型还显示了多概念构图，故事产生和编辑链应用程序中有希望的能力。

Title: Self-Adapting Language Models

Authors: Adam Zweiger, Jyothish Pari, Han Guo, Ekin Akyürek, Yoon Kim, Pulkit Agrawal
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.10943
Pdf URL: https://arxiv.org/pdf/2506.10943
Copy Paste: [[2506.10943]] Self-Adapting Language Models(https://arxiv.org/abs/2506.10943)
Keywords: generation
Abstract: Large language models (LLMs) are powerful but static; they lack mechanisms to adapt their weights in response to new tasks, knowledge, or examples. We introduce Self-Adapting LLMs (SEAL), a framework that enables LLMs to self-adapt by generating their own finetuning data and update directives. Given a new input, the model produces a self-edit: a generation that may restructure the information in different ways, specify optimization hyperparameters, or invoke tools for data augmentation and gradient-based updates. Through supervised finetuning (SFT), these self-edits result in persistent weight updates, enabling lasting adaptation. To train the model to produce effective self-edits, we use a reinforcement learning loop with the downstream performance of the updated model as the reward signal. Unlike prior approaches that rely on separate adaptation modules or auxiliary networks, SEAL directly uses the model's own generation to control its adaptation process. Experiments on knowledge incorporation and few-shot generalization show that SEAL is a promising step toward language models capable of self-directed adaptation. Our website and code are available at this https URL.
摘要：大型语言模型（LLM）具有强大的功能，但静态；他们缺乏针对新任务，知识或示例来调整其权重的机制。我们介绍了自我适应LLM（密封），该框架使LLMS能够通过生成自己的固定数据和更新指令来自适应。给定新的输入，该模型会产生一个自我编辑的一代，该生成可能以不同的方式重组信息，指定优化超参数或调用用于数据增强和基于梯度的更新的工具。通过有监督的Finetuning（SFT），这些自我编辑会导致持续的体重更新，从而实现持久的适应性。为了训练模型以产生有效的自我编辑，我们将使用更新模型的下游性能作为奖励信号使用加固学习循环。与依靠单独的适应模块或辅助网络的先前方法不同，Seal直接使用模型自己的生成来控制其适应过程。关于知识融合和几乎没有概括的实验表明，密封是朝着能够自我指导适应的语言模型迈出的有前途的一步。我们的网站和代码可在此HTTPS URL上找到。

Title: Execution Guided Line-by-Line Code Generation

Authors: Boaz Lavon, Shahar Katz, Lior Wolf
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.10948
Pdf URL: https://arxiv.org/pdf/2506.10948
Copy Paste: [[2506.10948]] Execution Guided Line-by-Line Code Generation(https://arxiv.org/abs/2506.10948)
Keywords: generation
Abstract: We present a novel approach to neural code generation that incorporates real-time execution signals into the language model generation process. While large language models (LLMs) have demonstrated impressive code generation capabilities, they typically do not utilize execution feedback during inference, a critical signal that human programmers regularly leverage. Our method, Execution-Guided Classifier-Free Guidance (EG-CFG), dynamically incorporates execution signals as the model generates code, providing line-by-line feedback that guides the generation process toward executable solutions. EG-CFG employs a multi-stage process: first, we conduct beam search to sample candidate program completions for each line; second, we extract execution signals by executing these candidates against test cases; and finally, we incorporate these signals into the prompt during generation. By maintaining consistent signals across tokens within the same line and refreshing signals at line boundaries, our approach provides coherent guidance while preserving syntactic structure. Moreover, the method naturally supports native parallelism at the task level in which multiple agents operate in parallel, exploring diverse reasoning paths and collectively generating a broad set of candidate solutions. Our experiments across diverse coding tasks demonstrate that EG-CFG significantly improves code generation performance compared to standard approaches, achieving state-of-the-art results across various levels of complexity, from foundational problems to challenging competitive programming tasks. Our code is available at: this https URL
摘要：我们提出了一种新颖的神经代码生成方法，该方法将实时执行信号纳入语言模型生成过程中。尽管大型语言模型（LLMS）表现出了令人印象深刻的代码生成功能，但它们通常在推断期间不利用执行反馈，这是人类程序员经常杠杆作用的关键信号。我们的方法是执行指导的无分类器指导（EG-CFG），将执行信号动态地包含在模型生成代码时，提供逐条反馈，从而将生成过程引导到可执行的解决方案。 EG-CFG采用了多阶段过程：首先，我们进行光束搜索以对每行进行样本候选程序完成；其次，我们通过针对测试案例执行这些候选人来提取执行信号；最后，我们将这些信号纳入一代期间的提示中。通过在同一行中保持跨令牌的一致信号，并在线边界处保持刷新信号，我们的方法在保留句法结构的同时提供了连贯的指导。此外，该方法自然会在多个代理并行操作的任务级别上支持本地并行性，探索各种推理路径并集体生成广泛的候选解决方案。我们跨不同编码任务的实验表明，与标准方法相比，EG-CFG显着提高了代码生成性能，从而在各种复杂性上取得了最新的结果，从基础问题到具有挑战性的竞争编程任务。我们的代码可用：此HTTPS URL
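The guidance step named in the method's title can be sketched with the standard classifier-free guidance combination of two next-token logit vectors. In this sketch the "conditional" logits are assumed to come from a prompt that includes execution signals and the "unconditional" logits from a prompt without them. The function names, the guidance weight `gamma`, and the toy logits are hypothetical, and the exact EG-CFG formulation may differ.

```python
import math

def cfg_logits(uncond, cond, gamma):
    """Standard classifier-free guidance: uncond + gamma * (cond - uncond).

    gamma = 0 ignores the conditioning, gamma = 1 recovers the conditional
    logits, and gamma > 1 extrapolates further toward the conditioning.
    """
    return [u + gamma * (c - u) for u, c in zip(uncond, cond)]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy two-token vocabulary (hypothetical values).
uncond = [0.0, 1.0]   # logits without execution feedback in the prompt
cond = [1.0, 0.0]     # logits with execution feedback in the prompt
guided = cfg_logits(uncond, cond, 2.0)
probs = softmax(guided)
```

With `gamma = 2.0` the guided distribution favors the first token more strongly than the conditional logits alone would.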

Title: ReGuidance: A Simple Diffusion Wrapper for Boosting Sample Quality on Hard Inverse Problems

Authors: Aayush Karan, Kulin Shah, Sitan Chen
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2506.10955
Pdf URL: https://arxiv.org/pdf/2506.10955
Copy Paste: [[2506.10955]] ReGuidance: A Simple Diffusion Wrapper for Boosting Sample Quality on Hard Inverse Problems(https://arxiv.org/abs/2506.10955)
Keywords: super-resolution
Abstract: There has been a flurry of activity around using pretrained diffusion models as informed data priors for solving inverse problems, and more generally around steering these models using reward models. Training-free methods like diffusion posterior sampling (DPS) and its many variants have offered flexible heuristic algorithms for these tasks, but when the reward is not informative enough, e.g., in hard inverse problems with low signal-to-noise ratio, these techniques veer off the data manifold, failing to produce realistic outputs. In this work, we devise a simple wrapper, ReGuidance, for boosting both the sample realism and reward achieved by these methods. Given a candidate solution $\hat{x}$ produced by an algorithm of the user's choice, we propose inverting the solution by running the unconditional probability flow ODE in reverse starting from $\hat{x}$, and then using the resulting latent as an initialization for DPS. We evaluate our wrapper on hard inverse problems like large box in-painting and super-resolution with high upscaling. Whereas state-of-the-art baselines visibly fail, we find that applying our wrapper on top of these baselines significantly boosts sample quality and measurement consistency. We complement these findings with theory proving that on certain multimodal data distributions, ReGuidance simultaneously boosts the reward and brings the candidate solution closer to the data manifold. To our knowledge, this constitutes the first rigorous algorithmic guarantee for DPS.
摘要：使用验证的扩散模型作为知情的数据先验来解决反问题，并且更一般地围绕使用奖励模型来指导这些模型。诸如扩散后验采样（DPS）及其许多变体之类的无训练方法为这些任务提供了灵活的启发式算法，但是当奖励不够有用的内容不足以提供信息，例如，在低信噪比的硬性倒数问题中，这些技术差异很小，这些技术会产生现实的输出。在这项工作中，我们设计了一个简单的包装器，用于增强这些方法所获得的样本现实主义和回报。给定一个候选解决方案$ \ hat {x} $由用户选择的算法产生的，我们建议通过运行无条件的概率流量来反转解决方案，从$ \ hat {x} $开始，然后使用结果的dps初始化。我们在诸如大盒子上贴上大盒子和高尺度上的超级分辨率之类的硬质问题上评估包装纸。尽管最先进的基线明显失败，但我们发现将包装器应用于这些基准的顶部会显着提高样本质量和测量一致性。我们将这些发现与理论进行了补充，证明在某些多模式数据分布上，同时提高了奖励，并使候选解决方案更接近数据歧管。据我们所知，这构成了DPS的第一个严格算法保证。

Title: SpectralAR: Spectral Autoregressive Visual Generation

Authors: Yuanhui Huang, Weiliang Chen, Wenzhao Zheng, Yueqi Duan, Jie Zhou, Jiwen Lu
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.10962
Pdf URL: https://arxiv.org/pdf/2506.10962
Copy Paste: [[2506.10962]] SpectralAR: Spectral Autoregressive Visual Generation(https://arxiv.org/abs/2506.10962)
Keywords: generation
Abstract: Autoregressive visual generation has garnered increasing attention due to its scalability and compatibility with other modalities compared with diffusion models. Most existing methods construct visual sequences as spatial patches for autoregressive generation. However, image patches are inherently parallel, contradicting the causal nature of autoregressive modeling. To address this, we propose a Spectral AutoRegressive (SpectralAR) visual generation framework, which realizes causality for visual sequences from the spectral perspective. Specifically, we first transform an image into ordered spectral tokens with Nested Spectral Tokenization, representing lower to higher frequency components. We then perform autoregressive generation in a coarse-to-fine manner with the sequences of spectral tokens. By considering different levels of detail in images, our SpectralAR achieves both sequence causality and token efficiency without bells and whistles. We conduct extensive experiments on ImageNet-1K for image reconstruction and autoregressive generation, and SpectralAR achieves 3.02 gFID with only 64 tokens and 310M parameters. Project page: this https URL.
摘要：与扩散模型相比，自回归视觉产生的关注性越来越多，它与其他方式相比。大多数现有方法将视觉序列构建为自回归产生的空间贴片。但是，图像贴片本质上是平行的，与自回归建模的因果性质相矛盾。为了解决这个问题，我们提出了一个光谱自回归（光谱）视觉生成框架，该框架从光谱的角度实现了视觉序列的因果关系。具体而言，我们首先将图像转换为具有嵌套光谱令牌化的有序光谱令牌，代表较低的频率组件。然后，我们用光谱令牌序列以粗到最新的方式进行自回归产生。通过考虑图像中不同级别的细节，我们的光谱既达到了序列因果关系，又可以实现象征性的效率，而没有铃铛和哨声。我们对Imagenet-1K进行了广泛的实验，以进行图像重建和自回归产生，而Spectralar仅使用64个令牌和310m参数实现3.02 GFID。项目页面：此HTTPS URL。
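The coarse-to-fine idea of ordering tokens from lower to higher frequency can be illustrated on a 1-D signal. This sketch uses an orthonormal DCT-II as a stand-in spectral transform. That choice is an assumption for illustration, since the abstract does not specify how Nested Spectral Tokenization is implemented, and the function names are hypothetical.

```python
import math

def dct(signal):
    """Orthonormal DCT-II: coefficient k captures the k-th frequency component."""
    n = len(signal)
    out = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        out.append(scale * sum(x * math.cos(math.pi * (i + 0.5) * k / n)
                               for i, x in enumerate(signal)))
    return out

def idct(coeffs):
    """Inverse of the orthonormal DCT-II above (an orthonormal DCT-III)."""
    n = len(coeffs)
    out = []
    for i in range(n):
        out.append(sum(math.sqrt((1.0 if k == 0 else 2.0) / n) * c
                       * math.cos(math.pi * (i + 0.5) * k / n)
                       for k, c in enumerate(coeffs)))
    return out

def coarse_to_fine(signal):
    """Reconstruct the signal from its first 1, 2, ..., n spectral coefficients."""
    coeffs = dct(signal)
    stages = []
    for keep in range(1, len(coeffs) + 1):
        truncated = coeffs[:keep] + [0.0] * (len(coeffs) - keep)
        stages.append(idct(truncated))
    return stages

def squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

signal = [1.0, 2.0, 4.0, 3.0]
stages = coarse_to_fine(signal)
errors = [squared_error(stage, signal) for stage in stages]
```

The first stage is just the signal's mean, and each added coefficient can only reduce the reconstruction error. This gives the sequence a natural causal order, since later tokens refine what earlier ones established.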

Title: MMMG: A Massive, Multidisciplinary, Multi-Tier Generation Benchmark for Text-to-Image Reasoning

Authors: Yuxuan Luo, Yuhui Yuan, Junwen Chen, Haonan Cai, Ziyi Yue, Yuwei Yang, Fatima Zohra Daha, Ji Li, Zhouhui Lian
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2506.10963
Pdf URL: https://arxiv.org/pdf/2506.10963
Copy Paste: [[2506.10963]] MMMG: A Massive, Multidisciplinary, Multi-Tier Generation Benchmark for Text-to-Image Reasoning(https://arxiv.org/abs/2506.10963)
Keywords: generation
Abstract: In this paper, we introduce knowledge image generation as a new task, alongside the Massive Multi-Discipline Multi-Tier Knowledge-Image Generation Benchmark (MMMG) to probe the reasoning capability of image generation models. Knowledge images have been central to human civilization and to the mechanisms of human learning--a fact underscored by dual-coding theory and the picture-superiority effect. Generating such images is challenging, demanding multimodal reasoning that fuses world knowledge with pixel-level grounding into clear explanatory visuals. To enable comprehensive evaluation, MMMG offers 4,456 expert-validated (knowledge) image-prompt pairs spanning 10 disciplines, 6 educational levels, and diverse knowledge formats such as charts, diagrams, and mind maps. To eliminate confounding complexity during evaluation, we adopt a unified Knowledge Graph (KG) representation. Each KG explicitly delineates a target image's core entities and their dependencies. We further introduce MMMG-Score to evaluate generated knowledge images. This metric combines factual fidelity, measured by graph-edit distance between KGs, with visual clarity assessment. Comprehensive evaluations of 16 state-of-the-art text-to-image generation models expose serious reasoning deficits--low entity fidelity, weak relations, and clutter--with GPT-4o achieving an MMMG-Score of only 50.20, underscoring the benchmark's difficulty. To spur further progress, we release FLUX-Reason (MMMG-Score of 34.45), an effective and open baseline that combines a reasoning LLM with diffusion models and is trained on 16,000 curated knowledge image-prompt pairs.
摘要：在本文中，我们将知识图像生成作为一项新任务，以及大量的多学科多层知识图像生成基准（MMMG），以探测图像生成模型的推理能力。知识图像是人类文明和人类学习机制的核心 - 这是双重编码理论和表现性效应的事实。产生这样的图像是具有挑战性的，要求多模式的推理将世界知识与像素级融合到明确的解释性视觉效果中。为了启用全面的评估，MMMG提供了4,456个专家验证（知识）图像 - 预测对，涵盖10个学科，6个教育水平以及多样化的知识格式，例如图表，图表和思维图。为了消除评估过程中的混杂复杂性，我们采用统一知识图（kg）表示。每个公斤明确描述目标图像的核心实体及其依赖性。我们进一步介绍MMMG得分以评估生成的知识图像。该指标结合了事实保真度，通过千克之间的图形距离和视觉清晰度评估来衡量。对16个最先进的文本到图像生成模型的全面评估暴露了严重的推理缺陷 - 低实体忠诚度，弱关系和混乱 - GPT-4O仅达到了仅50.20的MMMG得分，强调了基准的困难。为了刺激进一步的进展，我们释放了磁通次数（34.45的mmmg得分），这是一种有效且开放的基线，将推理LLM与扩散模型结合在一起，并在16,000个策划的知识图像推出对中进行了训练。

Title: GenWorld: Towards Detecting AI-generated Real-world Simulation Videos

Authors: Weiliang Chen, Wenzhao Zheng, Yu Zheng, Lei Chen, Jie Zhou, Jiwen Lu, Yueqi Duan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.10975
Pdf URL: https://arxiv.org/pdf/2506.10975
Copy Paste: [[2506.10975]] GenWorld: Towards Detecting AI-generated Real-world Simulation Videos(https://arxiv.org/abs/2506.10975)
Keywords: generation
Abstract: The flourishing of video generation technologies has endangered the credibility of real-world information and intensified the demand for AI-generated video detectors. Despite some progress, the lack of high-quality real-world datasets hinders the development of trustworthy detectors. In this paper, we propose GenWorld, a large-scale, high-quality, and real-world simulation dataset for AI-generated video detection. GenWorld features the following characteristics: (1) Real-world Simulation: GenWorld focuses on videos that replicate real-world scenarios, which have a significant impact due to their realism and potential influence; (2) High Quality: GenWorld employs multiple state-of-the-art video generation models to provide realistic and high-quality forged videos; (3) Cross-prompt Diversity: GenWorld includes videos generated from diverse generators and various prompt modalities (e.g., text, image, video), offering the potential to learn more generalizable forensic features. We analyze existing methods and find they fail to detect high-quality videos generated by world models (i.e., Cosmos), revealing potential drawbacks of ignoring real-world clues. To address this, we propose a simple yet effective model, SpannDetector, to leverage multi-view consistency as a strong criterion for real-world AI-generated video detection. Experiments show that our method achieves superior results, highlighting a promising direction for explainable AI-generated video detection based on physical plausibility. We believe that GenWorld will advance the field of AI-generated video detection. Project Page: this https URL
摘要：视频生成技术的蓬勃发展危害了现实世界信息的可信度，并加强了对AI生成的视频探测器的需求。尽管有一些进展，但缺乏高质量的现实世界数据集阻碍了可信赖的探测器的发展。在本文中，我们提出了Genworld，这是AI生成的视频检测的大规模，高质量和现实世界模拟数据集。 GenWorld具有以下特征：（1）现实世界的模拟：GenWorld专注于复制现实世界情景的视频，由于其现实主义和潜在影响而产生重大影响；（2）高质量：Genworld采用多种最先进的视频生成模型来提供现实且高质量的锻造视频；（3）交叉推测的多样性：GenWorld包括由不同发电机生成的视频和各种及时的模式（例如，文本，图像，视频），提供了学习更多可推广的法医特征的潜力。我们分析了现有方法，发现它们无法检测到世界模型（即宇宙）生成的高质量视频，从而揭示了忽略现实世界线索的潜在缺点。为了解决这个问题，我们提出了一个简单而有效的模型SpannDetector，以利用多视图一致性，作为真实世界AI生成的视频检测的强大标准。实验表明，我们的方法取得了卓越的结果，突出了基于物理合理性的可解释AI生成视频检测的有希望的方向。我们认为，Genworld将推进AI生成的视频检测领域。项目页面：此HTTPS URL

Title: Fine-Grained Perturbation Guidance via Attention Head Selection

Authors: Donghoon Ahn, Jiwon Kang, Sanghyun Lee, Minjae Kim, Jaewon Min, Wooseok Jang, Saungwu Lee, Sayak Paul, Susung Hong, Seungryong Kim
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.10978
Pdf URL: https://arxiv.org/pdf/2506.10978
Copy Paste: [[2506.10978]] Fine-Grained Perturbation Guidance via Attention Head Selection(https://arxiv.org/abs/2506.10978)
Keywords: generation
Abstract: Recent guidance methods in diffusion models steer reverse sampling by perturbing the model to construct an implicit weak model and guide generation away from it. Among these approaches, attention perturbation has demonstrated strong empirical performance in unconditional scenarios where classifier-free guidance is not applicable. However, existing attention perturbation methods lack principled approaches for determining where perturbations should be applied, particularly in Diffusion Transformer (DiT) architectures where quality-relevant computations are distributed across layers. In this paper, we investigate the granularity of attention perturbations, ranging from the layer level down to individual attention heads, and discover that specific heads govern distinct visual concepts such as structure, style, and texture quality. Building on this insight, we propose "HeadHunter", a systematic framework for iteratively selecting attention heads that align with user-centric objectives, enabling fine-grained control over generation quality and visual attributes. In addition, we introduce SoftPAG, which linearly interpolates each selected head's attention map toward an identity matrix, providing a continuous knob to tune perturbation strength and suppress artifacts. Our approach not only mitigates the oversmoothing issues of existing layer-level perturbation but also enables targeted manipulation of specific visual styles through compositional head selection. We validate our method on modern large-scale DiT-based text-to-image models including Stable Diffusion 3 and FLUX.1, demonstrating superior performance in both general quality enhancement and style-specific guidance. Our work provides the first head-level analysis of attention perturbation in diffusion models, uncovering interpretable specialization within attention layers and enabling practical design of effective perturbation strategies.
摘要：扩散模型中的最新引导方法通过扰动模型来构建隐式弱模型并引导产生远离它，从而介绍了反向采样。在这些方法中，注意力扰动在不适用的无条件指导的无条件场景中表现出强烈的经验表现。但是，现有的注意力扰动方法缺乏确定应在何处应用扰动的原则方法，尤其是在扩散变压器（DIT）体系结构中，在跨层分布质量相关的计算。在本文中，我们研究了注意力扰动的粒度，从层级到个体的注意力头，并发现特定的头部控制着不同的视觉概念，例如结构，样式和纹理质量。在此洞察力的基础上，我们提出了“ Headhunter”，这是一个系统的迭代框架，用于迭代选择与以用户为中心的目标保持一致的注意力头，从而可以对发电质量和视觉属性进行细粒度的控制。此外，我们引入了软键，该软键语线性将每个选定的头部的注意力图插入朝向身份矩阵，提供连续的旋钮来调节扰动强度并抑制伪影。我们的方法不仅可以减轻现有层级扰动的过度厚度问题，而且还可以通过组成的头部选择有针对性地操纵特定的视觉样式。我们验证了现代大型DIT基于文本对图像模型的方法，包括稳定的扩散3和Flux.1，在一般质量增强和特定于样式的指导中都表现出卓越的性能。我们的工作提供了对扩散模型中注意力扰动的首次头部级分析，在注意层内揭示了可解释的专业化，并实现了有效扰动策略的实用设计。
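The SoftPAG interpolation described in this abstract can be sketched directly: a head's attention map is blended linearly toward the identity matrix. The function name and the strength parameter `u` are hypothetical, and the sketch assumes a square self-attention map whose rows sum to 1. How heads are selected and how the perturbed model is used for guidance are outside this example.

```python
def soft_perturb(attn, u):
    """Blend a square attention map toward the identity: (1 - u) * A + u * I.

    u = 0 leaves the head unchanged and u = 1 replaces its map with the
    identity. Values in between give a continuous perturbation strength.
    """
    n = len(attn)
    return [[(1.0 - u) * attn[i][j] + (u if i == j else 0.0)
             for j in range(n)]
            for i in range(n)]

# Toy 2x2 attention map with rows summing to 1 (hypothetical values).
attn = [[0.5, 0.5],
        [0.25, 0.75]]
half = soft_perturb(attn, 0.5)
```

Because both the attention map and the identity have rows that sum to 1, every interpolated map is still a valid row-stochastic attention map.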

Title: SceneCompleter: Dense 3D Scene Completion for Generative Novel View Synthesis

Authors: Weiliang Chen, Jiayi Bi, Yuanhui Huang, Wenzhao Zheng, Yueqi Duan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.10981
Pdf URL: https://arxiv.org/pdf/2506.10981
Copy Paste: [[2506.10981]] SceneCompleter: Dense 3D Scene Completion for Generative Novel View Synthesis(https://arxiv.org/abs/2506.10981)
Keywords: generative
Abstract: Generative models have gained significant attention in novel view synthesis (NVS) by alleviating the reliance on dense multi-view captures. However, existing methods typically fall into a conventional paradigm, where generative models first complete missing areas in 2D, followed by 3D recovery techniques to reconstruct the scene, which often results in overly smooth surfaces and distorted geometry, as generative models struggle to infer 3D structure solely from RGB data. In this paper, we propose SceneCompleter, a novel framework that achieves 3D-consistent generative novel view synthesis through dense 3D scene completion. SceneCompleter achieves both visual coherence and 3D-consistent generative scene completion through two key components: (1) a geometry-appearance dual-stream diffusion model that jointly synthesizes novel views in RGBD space; (2) a scene embedder that encodes a more holistic scene understanding from the reference image. By effectively fusing structural and textural information, our method demonstrates superior coherence and plausibility in generative novel view synthesis across diverse datasets. Project Page: this https URL
摘要：生成模型通过减轻对密集的多视图捕获的依赖，从而在新型视图合成（NVS）中引起了极大的关注。但是，现有方法通常属于常规范式，生成模型在2D中首先完成缺失区域，然后是重建场景的3D恢复技术，这通常会导致过度平滑的表面和变形的几何形状，因为生成模型难以从RGB数据中推断3D结构。在本文中，我们提出了SceneCompleter，这是一个新颖的框架，该框架通过密集的3D场景完成实现了3D一致的生成性小说综合。 SceneComleter通过两个关键组成部分实现了视觉连贯性和3D一致的生成场景完成：（1）几何形状表达式双流扩散模型，该模型共同综合了RGBD空间中的新视图；（2）从参考图像中编码更全面的场景理解的场景嵌入器。通过有效融合结构和纹理信息，我们的方法证明了在各种数据集中的新型新观点合成中的卓越连贯性和合理性。项目页面：此HTTPS URL