2025-07-22

Title: Self-Improving Language Models for Evolutionary Program Synthesis: A Case Study on ARC-AGI

Authors: Julien Pourcel, Cédric Colas, Pierre-Yves Oudeyer
Subjects: cs.LG, cs.AI, cs.NE
Abstract URL: https://arxiv.org/abs/2507.14172
Pdf URL: https://arxiv.org/pdf/2507.14172
Copy Paste: [[2507.14172]] Self-Improving Language Models for Evolutionary Program Synthesis: A Case Study on ARC-AGI(https://arxiv.org/abs/2507.14172)
Keywords: generative
Abstract: Many program synthesis tasks prove too challenging for even state-of-the-art language models to solve in single attempts. Search-based evolutionary methods offer a promising alternative by exploring solution spaces iteratively, but their effectiveness remains limited by the fixed capabilities of the underlying generative model. We propose SOAR, a method that learns program synthesis by integrating language models into a self-improving evolutionary loop. SOAR alternates between (1) an evolutionary search that uses an LLM to sample and refine candidate solutions, and (2) a hindsight learning phase that converts search attempts into valid problem-solution pairs used to fine-tune the LLM's sampling and refinement capabilities, enabling increasingly effective search in subsequent iterations. On the challenging ARC-AGI benchmark, SOAR achieves significant performance gains across model scales and iterations, leveraging positive transfer between the sampling and refinement finetuning tasks. These improvements carry over to test-time adaptation, enabling SOAR to solve 52% of the public test set. Our code is open-sourced at: this https URL
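
The two alternating phases described in the abstract can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the "LLM" is replaced by a random sampler over linear programs, tasks are lists of integer input/output pairs, and the names `sample_program`, `hindsight_relabel`, and `search_and_collect` are hypothetical. The refinement step and the fine-tuning step itself are omitted. The point is only the hindsight idea: a sampled program that misses the target task is still a correct solution to the task defined by its own outputs.

```python
import random
from typing import Callable, List, Tuple

Program = Callable[[int], int]
Task = List[Tuple[int, int]]  # (input, expected output) pairs


def sample_program(rng: random.Random) -> Tuple[str, Program]:
    """Hypothetical stand-in for LLM sampling: draw a program x -> a*x + b."""
    a, b = rng.randint(-3, 3), rng.randint(-3, 3)
    return f"lambda x: {a}*x + {b}", (lambda x, a=a, b=b: a * x + b)


def solves(program: Program, task: Task) -> bool:
    """A program solves a task if it reproduces every expected output."""
    return all(program(x) == y for x, y in task)


def hindsight_relabel(program: Program, task: Task) -> Task:
    """Keep the task inputs, replace the targets with the program's outputs."""
    return [(x, program(x)) for x, _ in task]


def search_and_collect(
    task: Task, budget: int, rng: random.Random
) -> Tuple[bool, List[Tuple[Task, str]]]:
    """Search phase plus data collection for the learning phase.

    Returns whether any sample solved the task, and a list of
    (relabeled task, program source) pairs that are valid by construction.
    """
    solved = False
    training_pairs: List[Tuple[Task, str]] = []
    for _ in range(budget):
        source, program = sample_program(rng)
        solved = solved or solves(program, task)
        training_pairs.append((hindsight_relabel(program, task), source))
    return solved, training_pairs
```

In a full loop, the collected pairs would be used to fine-tune the sampler before the next round of search; that step depends on the training stack and is not sketched here.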

Title: LaCache: Ladder-Shaped KV Caching for Efficient Long-Context Modeling of Large Language Models

Authors: Dachuan Shi, Yonggan Fu, Xiangchi Yuan, Zhongzhi Yu, Haoran You, Sixu Li, Xin Dong, Jan Kautz, Pavlo Molchanov, Yingyan (Celine) Lin
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2507.14204
Pdf URL: https://arxiv.org/pdf/2507.14204
Copy Paste: [[2507.14204]] LaCache: Ladder-Shaped KV Caching for Efficient Long-Context Modeling of Large Language Models(https://arxiv.org/abs/2507.14204)
Keywords: generation, generative
Abstract: Recent advancements in Large Language Models (LLMs) have spurred interest in numerous applications requiring robust long-range capabilities, essential for processing extensive input contexts and continuously generating extended outputs. As sequence lengths increase, the number of Key-Value (KV) pairs in LLMs escalates, creating a significant efficiency bottleneck. In this paper, we propose a new KV cache optimization paradigm called LaCache, a training-free method for efficient and accurate generative inference of LLMs. LaCache enables LLMs to simultaneously address both of the critical challenges in long-range modeling: robust long-range capabilities and continuous generation without running out of memory (OOM). Specifically, LaCache integrates two key innovations: (1) a ladder-shaped KV cache pattern that stores KV pairs not only sequentially (left-to-right within each layer) but also across layers (from shallow to deep), providing an extended span for capturing long-range dependencies under a fixed storage budget, thereby boosting long-range capabilities; and (2) an iterative compaction mechanism that progressively compresses older caches, freeing up space for new tokens within a fixed cache size. This token distance-based dynamic compression enables more effective continuous generation under constrained cache budgets. Experiments across various tasks, benchmarks, and LLMs consistently validate LaCache's effectiveness in enhancing LLMs' long-range capabilities. Our code is available at this https URL.

Title: Developing an AI-Guided Assistant Device for the Deaf and Hearing Impaired

Authors: Jiayu (Jerry) Liu
Subjects: cs.LG, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2507.14215
Pdf URL: https://arxiv.org/pdf/2507.14215
Copy Paste: [[2507.14215]] Developing an AI-Guided Assistant Device for the Deaf and Hearing Impaired(https://arxiv.org/abs/2507.14215)
Keywords: generation
Abstract: This study aims to develop a deep learning system for an accessibility device for the deaf or hearing impaired. The device will accurately localize and identify sound sources in real time. This study will fill an important gap in current research by leveraging machine learning techniques to target the underprivileged community. The system includes three main components. 1. JerryNet: A custom-designed CNN architecture that determines the direction of arrival (DoA) for nine possible directions. 2. Audio Classification: This model is based on fine-tuning the Contrastive Language-Audio Pretraining (CLAP) model to identify the exact sound classes based only on audio. 3. Multimodal integration model: This is an accurate sound localization model that combines audio, visual, and text data to locate the exact sound sources in the images. This component consists of two modules: an object detection module using YOLOv9 to generate the bounding boxes of all objects, and an audio-visual localization model to identify the optimal bounding box using complete Intersection over Union (CIoU). The hardware consists of a four-microphone rectangular formation and a camera mounted on glasses, with a wristband for displaying necessary information such as direction. On a custom collected data set, JerryNet achieved a precision of 91.1% for the sound direction, outperforming all the baseline models. The CLAP model achieved 98.5% and 95% accuracy on custom and AudioSet datasets, respectively. The audio-visual localization model within component 3 yielded a cIoU of 0.892 and an AUC of 0.658, surpassing other similar models. This study has considerable potential for future work, paving the way to creating a new generation of accessibility devices.
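
The abstract expands CIoU as complete Intersection over Union. Assuming this refers to the standard bounding-box definition (plain IoU minus a centre-distance penalty and an aspect-ratio penalty), a minimal sketch looks as follows. The function name `complete_iou` and the `(x1, y1, x2, y2)` box layout are illustrative choices, and the abstract does not say how the paper applies the score when selecting a box.

```python
import math
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) with x1 < x2, y1 < y2


def complete_iou(a: Box, b: Box) -> float:
    """CIoU = IoU - rho^2 / c^2 - alpha * v for two axis-aligned boxes."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b

    # Plain IoU from the overlap rectangle.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    iou = inter / union

    # Squared distance between box centres (rho^2).
    rho2 = ((ax1 + ax2) / 2 - (bx1 + bx2) / 2) ** 2 \
        + ((ay1 + ay2) / 2 - (by1 + by2) / 2) ** 2

    # Squared diagonal of the smallest box enclosing both (c^2).
    enclose_w = max(ax2, bx2) - min(ax1, bx1)
    enclose_h = max(ay2, by2) - min(ay1, by1)
    c2 = enclose_w ** 2 + enclose_h ** 2

    # Aspect-ratio consistency term v and its trade-off weight alpha.
    v = (4 / math.pi ** 2) * (
        math.atan((bx2 - bx1) / (by2 - by1)) - math.atan((ax2 - ax1) / (ay2 - ay1))
    ) ** 2
    alpha = v / ((1 - iou) + v) if v > 0 else 0.0

    return iou - rho2 / c2 - alpha * v
```

Unlike plain IoU, the score stays informative for non-overlapping boxes: it becomes negative and decreases as the centres move apart.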

Title: Rethinking Individual Fairness in Deepfake Detection

Authors: Aryana Hou, Li Lin, Justin Li, Shu Hu
Subjects: cs.LG, cs.CY
Abstract URL: https://arxiv.org/abs/2507.14326
Pdf URL: https://arxiv.org/pdf/2507.14326
Copy Paste: [[2507.14326]] Rethinking Individual Fairness in Deepfake Detection(https://arxiv.org/abs/2507.14326)
Keywords: generative
Abstract: Generative AI models have substantially improved the realism of synthetic media, yet their misuse through sophisticated DeepFakes poses significant risks. Despite recent advances in deepfake detection, fairness remains inadequately addressed, enabling deepfake makers to exploit biases against specific populations. While previous studies have emphasized group-level fairness, individual fairness (i.e., ensuring similar predictions for similar individuals) remains largely unexplored. In this work, we identify for the first time that the original principle of individual fairness fundamentally fails in the context of deepfake detection, revealing a critical gap previously unexplored in the literature. To mitigate it, we propose the first generalizable framework that can be integrated into existing deepfake detectors to enhance individual fairness and generalization. Extensive experiments conducted on leading deepfake datasets demonstrate that our approach significantly improves individual fairness while maintaining robust detection performance, outperforming state-of-the-art methods. The code is available at this https URL.

Title: Solo Connection: A Parameter Efficient Fine-Tuning Technique for Transformers

Authors: Harsh Nilesh Pathak, Randy Paffenroth
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2507.14353
Pdf URL: https://arxiv.org/pdf/2507.14353
Copy Paste: [[2507.14353]] Solo Connection: A Parameter Efficient Fine-Tuning Technique for Transformers(https://arxiv.org/abs/2507.14353)
Keywords: generation, generative
Abstract: Parameter-efficient fine-tuning (PEFT) is a versatile and extensible approach for adapting a Large Language Model (LLM) to newer tasks. One of the most prominent PEFT approaches, Low-Rank Adaptation (LoRA), primarily focuses on adjusting the attention weight matrices within individual decoder blocks of a Generative Pre-trained Transformer (GPT2). In contrast, we introduce Solo Connection, a novel method that adapts the representation at the decoder-block level rather than modifying individual weight matrices. Not only does Solo Connection outperform LoRA on E2E natural language generation benchmarks, but it also reduces the number of trainable parameters by 59% relative to LoRA and by more than 99% compared to full fine-tuning of GPT2, an early version of Large Language Models (LLMs). Solo Connection is also motivated by homotopy theory: we introduce a trainable linear transformation that gradually interpolates between a zero vector and the task-specific representation, enabling smooth and stable adaptation over time. While skip connections in the original 12-layer GPT2 are typically confined to individual decoder blocks, subsequent GPT2 variants scale up to 48 layers, and even larger language models can include 128 or more decoder blocks. These expanded architectures underscore the need to revisit how skip connections are employed during fine-tuning. This paper focuses on long skip connections that link outputs of different decoder blocks, potentially enhancing the model's ability to adapt to new tasks while leveraging pre-trained knowledge.
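
The interpolation between a zero vector and the task-specific representation can be read as a straight-line homotopy. The following is a sketch of that reading only, assuming a scalar parameter $t$ and a task-specific map $g$; both symbols are illustrative, and the abstract does not specify the exact parameterization of the trainable linear transformation.

```latex
H(x, t) = (1 - t)\,\mathbf{0} + t\,g(x) = t\,g(x), \qquad t \in [0, 1]
```

Under this reading, $H(x, 0) = \mathbf{0}$ means the added connection contributes nothing at the start, and $H(x, 1) = g(x)$ means it contributes the full task-specific representation at the end.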

Title: Hallucination Score: Towards Mitigating Hallucinations in Generative Image Super-Resolution

Authors: Weiming Ren, Raghav Goyal, Zhiming Hu, Tristan Ty Aumentado-Armstrong, Iqbal Mohomed, Alex Levinshtein
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.14367
Pdf URL: https://arxiv.org/pdf/2507.14367
Copy Paste: [[2507.14367]] Hallucination Score: Towards Mitigating Hallucinations in Generative Image Super-Resolution(https://arxiv.org/abs/2507.14367)
Keywords: super-resolution, generative
Abstract: Generative super-resolution (GSR) currently sets the state-of-the-art in terms of perceptual image quality, overcoming the "regression-to-the-mean" blur of prior non-generative models. However, from a human perspective, such models do not fully conform to the optimal balance between quality and fidelity. Instead, a different class of artifacts, in which generated details fail to perceptually match the low resolution image (LRI) or ground-truth image (GTI), is a critical but understudied issue in GSR, limiting its practical deployments. In this work, we focus on measuring, analyzing, and mitigating these artifacts (i.e., "hallucinations"). We observe that hallucinations are not well characterized by existing image metrics or quality models, as they are orthogonal to both exact fidelity and no-reference quality. Instead, we take advantage of a multimodal large language model (MLLM) by constructing a prompt that assesses hallucinatory visual elements and generates a "Hallucination Score" (HS). We find that our HS is closely aligned with human evaluations, and also provides complementary insights to prior image metrics used for super-resolution (SR) models. In addition, we find certain deep feature distances have strong correlations with HS. We therefore propose to align the GSR models by using such features as differentiable reward functions to mitigate hallucinations.

Title: DUSTrack: Semi-automated point tracking in ultrasound videos

Authors: Praneeth Namburi, Roger Pallarès-López, Jessica Rosendorf, Duarte Folgado, Brian W. Anthony
Subjects: cs.CV, q-bio.QM
Abstract URL: https://arxiv.org/abs/2507.14368
Pdf URL: https://arxiv.org/pdf/2507.14368
Copy Paste: [[2507.14368]] DUSTrack: Semi-automated point tracking in ultrasound videos(https://arxiv.org/abs/2507.14368)
Keywords: generation
Abstract: Ultrasound technology enables safe, non-invasive imaging of dynamic tissue behavior, making it a valuable tool in medicine, biomechanics, and sports science. However, accurately tracking tissue motion in B-mode ultrasound remains challenging due to speckle noise, low edge contrast, and out-of-plane movement. These challenges complicate the task of tracking anatomical landmarks over time, which is essential for quantifying tissue dynamics in many clinical and research applications. This manuscript introduces DUSTrack (Deep learning and optical flow-based toolkit for UltraSound Tracking), a semi-automated framework for tracking arbitrary points in B-mode ultrasound videos. We combine deep learning with optical flow to deliver high-quality and robust tracking across diverse anatomical structures and motion patterns. The toolkit includes a graphical user interface that streamlines the generation of high-quality training data and supports iterative model refinement. It also implements a novel optical-flow-based filtering technique that reduces high-frequency frame-to-frame noise while preserving rapid tissue motion. DUSTrack demonstrates superior accuracy compared to contemporary zero-shot point trackers and performs on par with specialized methods, establishing its potential as a general and foundational tool for clinical and biomechanical research. We demonstrate DUSTrack's versatility through three use cases: cardiac wall motion tracking in echocardiograms, muscle deformation analysis during reaching tasks, and fascicle tracking during ankle plantarflexion. As an open-source solution, DUSTrack offers a powerful, flexible framework for point tracking to quantify tissue motion from ultrasound videos. DUSTrack is available at this https URL.

Title: It's Not That Simple. An Analysis of Simple Test-Time Scaling

Authors: Guojun Wu
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2507.14419
Pdf URL: https://arxiv.org/pdf/2507.14419
Copy Paste: [[2507.14419]] It's Not That Simple. An Analysis of Simple Test-Time Scaling(https://arxiv.org/abs/2507.14419)
Keywords: generation
Abstract: Prior work proposed simple test-time scaling, a method for replicating this scaling behavior with models distilled from o1-like models by manually controlling test-time compute: either scaling down by enforcing a maximum length or scaling up by iteratively appending "Wait" when the model is about to terminate its generation. This paper presents an analysis of simple test-time scaling and finds that the scaling behavior is largely attributed to scaling down by enforcing a maximum length. In contrast, fine-tuning on long CoT data distilled from o1-like models has no significant impact on scaling behavior, and scaling up by appending "Wait" leads to inconsistencies, as the model may oscillate between solutions. A key distinction exists between scaling down by enforcing a maximum length and scaling up test-time compute in o1-like models, such as DeepSeek-R1. These models are typically allowed to utilize as much compute as needed, with the only constraint being the model's maximum supported length. By learning to naturally scale up test-time compute during reinforcement learning, o1-like models surpass their peak performance when scaling up. In contrast, simple test-time scaling progressively imposes a lower upper limit on model performance as it scales down. While replicating the test-time scaling behavior of o1 models can be straightforward by scaling down, it is crucial to recognize that the goal of scaling test-time compute is to unlock higher performance, beyond what the model could originally achieve, rather than merely reproducing the appearance of scaling behavior.
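
The two manual controls named in the abstract, a hard length cap and appending "Wait" when the model tries to stop, can be sketched as a small decoding loop. This is an illustration under stated assumptions, not the implementation from the analysed work: the model is abstracted as a `next_token` callable, `END` is a hypothetical stop marker, and `toy_model` is a stub that stops after two tokens.

```python
from typing import Callable, List

END = "<end>"  # hypothetical marker the model emits when it wants to stop


def generate_with_budget(
    next_token: Callable[[List[str]], str],
    max_tokens: int,
    min_tokens: int = 0,
    wait_token: str = "Wait",
) -> List[str]:
    """Decode with a length cap (scale down) and forced continuation (scale up)."""
    tokens: List[str] = []
    while len(tokens) < max_tokens:        # scaling down: enforce a maximum length
        token = next_token(tokens)
        if token == END:
            if len(tokens) < min_tokens:   # scaling up: replace the stop with "Wait"
                tokens.append(wait_token)
                continue
            break
        tokens.append(token)
    return tokens


def toy_model(tokens: List[str]) -> str:
    """Stub model: emits two "step" tokens after the start or the last "Wait", then stops."""
    since_wait = 0
    for token in reversed(tokens):
        if token == "Wait":
            break
        since_wait += 1
    return END if since_wait >= 2 else "step"
```

With `min_tokens=0` the loop only ever truncates, which is the scaling-down regime the abstract identifies as the main source of the observed scaling behavior; raising `min_tokens` triggers the "Wait" path.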

Title: Adaptive 3D Gaussian Splatting Video Streaming: Visual Saliency-Aware Tiling and Meta-Learning-Based Bitrate Adaptation

Authors: Han Gong, Qiyue Li, Jie Li, Zhi Liu
Subjects: cs.CV, cs.MM, eess.IV
Abstract URL: https://arxiv.org/abs/2507.14454
Pdf URL: https://arxiv.org/pdf/2507.14454
Copy Paste: [[2507.14454]] Adaptive 3D Gaussian Splatting Video Streaming: Visual Saliency-Aware Tiling and Meta-Learning-Based Bitrate Adaptation(https://arxiv.org/abs/2507.14454)
Keywords: quality assessment
Abstract: 3D Gaussian splatting video (3DGS) streaming has recently emerged as a research hotspot in both academia and industry, owing to its impressive ability to deliver immersive 3D video experiences. However, research in this area is still in its early stages, and several fundamental challenges, such as tiling, quality assessment, and bitrate adaptation, require further investigation. In this paper, we tackle these challenges by proposing a comprehensive set of solutions. Specifically, we propose an adaptive 3DGS tiling technique guided by saliency analysis, which integrates both spatial and temporal features. Each tile is encoded into versions possessing dedicated deformation fields and multiple quality levels for adaptive selection. We also introduce a novel quality assessment framework for 3DGS video that jointly evaluates spatial-domain degradation in 3DGS representations during streaming and the quality of the resulting 2D rendered images. Additionally, we develop a meta-learning-based adaptive bitrate algorithm specifically tailored for 3DGS video streaming, achieving optimal performance across varying network conditions. Extensive experiments demonstrate that our proposed approaches significantly outperform state-of-the-art methods.

Title: Benefit from Reference: Retrieval-Augmented Cross-modal Point Cloud Completion

Authors: Hongye Hou, Liu Zhan, Yang Yang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2507.14485
Pdf URL: https://arxiv.org/pdf/2507.14485
Copy Paste: [[2507.14485]] Benefit from Reference: Retrieval-Augmented Cross-modal Point Cloud Completion(https://arxiv.org/abs/2507.14485)
Keywords: generation
Abstract: Completing the whole 3D structure based on an incomplete point cloud is a challenging task, particularly when the residual point cloud lacks typical structural characteristics. Recent methods based on cross-modal learning attempt to introduce instance images to aid the structure feature learning. However, they still focus on each particular input class, limiting their generation abilities. In this work, we propose a novel retrieval-augmented point cloud completion framework. The core idea is to incorporate cross-modal retrieval into the completion task to learn structural prior information from similar reference samples. Specifically, we design a Structural Shared Feature Encoder (SSFE) to jointly extract cross-modal features and reconstruct reference features as priors. Benefiting from a dual-channel control gate in the encoder, relevant structural features in the reference sample are enhanced and irrelevant information interference is suppressed. In addition, we propose a Progressive Retrieval-Augmented Generator (PRAG) that employs a hierarchical feature fusion mechanism to integrate reference prior information with input features from global to local. Through extensive evaluations on multiple datasets and real-world scenes, our method shows its effectiveness in generating fine-grained point clouds, as well as its generalization capability in handling sparse data and unseen categories.

Title: Efficient Whole Slide Pathology VQA via Token Compression

Authors: Weimin Lyu, Qingqiao Hu, Kehan Qi, Zhan Shi, Wentao Huang, Saumya Gupta, Chao Chen
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2507.14497
Pdf URL: https://arxiv.org/pdf/2507.14497
Copy Paste: [[2507.14497]] Efficient Whole Slide Pathology VQA via Token Compression(https://arxiv.org/abs/2507.14497)
Keywords: generation, generative
Abstract: Whole-slide images (WSIs) in pathology can reach up to 10,000 x 10,000 pixels, posing significant challenges for multimodal large language models (MLLMs) due to long context length and high computational demands. Previous methods typically focus on patch-level analysis or slide-level classification using CLIP-based models with multi-instance learning, but they lack the generative capabilities needed for visual question answering (VQA). More recent MLLM-based approaches address VQA by feeding thousands of patch tokens directly into the language model, which leads to excessive resource consumption. To address these limitations, we propose Token Compression Pathology LLaVA (TCP-LLaVA), the first MLLM architecture to perform WSI VQA via token compression. TCP-LLaVA introduces a set of trainable compression tokens that aggregate visual and textual information through a modality compression module, inspired by the [CLS] token mechanism in BERT. Only the compressed tokens are forwarded to the LLM for answer generation, significantly reducing input length and computational cost. Experiments on ten TCGA tumor subtypes show that TCP-LLaVA outperforms existing MLLM baselines in VQA accuracy while reducing training resource consumption by a substantial margin.
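
One common way for a small set of tokens to aggregate a long sequence is attention pooling, where each compression token acts as a query over all input tokens. The sketch below shows that generic mechanism only; it is an assumption about how such a module could work, not the architecture of TCP-LLaVA. It uses a single head with no learned projections, and the names `softmax` and `compress` are hypothetical. In a trained model the queries would be learnable parameters.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def compress(queries: List[Vector], tokens: List[Vector]) -> List[Vector]:
    """Pool N input tokens into len(queries) output tokens via dot-product attention."""
    dim = len(tokens[0])
    scale = math.sqrt(dim)
    compressed: List[Vector] = []
    for query in queries:
        # Similarity of this compression token to every input token.
        scores = [sum(q * t for q, t in zip(query, token)) / scale for token in tokens]
        weights = softmax(scores)
        # Weighted average of the input tokens.
        compressed.append(
            [sum(w * token[d] for w, token in zip(weights, tokens)) for d in range(dim)]
        )
    return compressed
```

The output length depends only on the number of queries, so thousands of patch tokens can be reduced to a fixed, small number of vectors before they reach the language model.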

Title: Generative Distribution Distillation

Authors: Jiequan Cui, Beier Zhu, Qingshan Xu, Xiaogang Xu, Pengguang Chen, Xiaojuan Qi, Bei Yu, Hanwang Zhang, Richang Hong
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2507.14503
Pdf URL: https://arxiv.org/pdf/2507.14503
Copy Paste: [[2507.14503]] Generative Distribution Distillation(https://arxiv.org/abs/2507.14503)
Keywords: generative
Abstract: In this paper, we formulate knowledge distillation (KD) as a conditional generative problem and propose the Generative Distribution Distillation (GenDD) framework. A naive GenDD baseline encounters two major challenges: the curse of high-dimensional optimization and the lack of semantic supervision from labels. To address these issues, we introduce a Split Tokenization strategy, achieving stable and effective unsupervised KD. Additionally, we develop the Distribution Contraction technique to integrate label supervision into the reconstruction objective. Our theoretical proof demonstrates that GenDD with Distribution Contraction serves as a gradient-level surrogate for multi-task learning, realizing efficient supervised training without explicit classification loss on multi-step sampling image representations. To evaluate the effectiveness of our method, we conduct experiments on balanced, imbalanced, and unlabeled data. Experimental results show that GenDD performs competitively in the unsupervised setting, significantly surpassing the KL baseline by 16.29% on the ImageNet validation set. With label supervision, our ResNet-50 achieves 82.28% top-1 accuracy on ImageNet with 600 epochs of training, establishing a new state-of-the-art.

Title: Clutter Detection and Removal by Multi-Objective Analysis for Photographic Guidance

Authors: Xiaoran Wu
Subjects: cs.CV, cs.HC
Abstract URL: https://arxiv.org/abs/2507.14553
Pdf URL: https://arxiv.org/pdf/2507.14553
Copy Paste: [[2507.14553]] Clutter Detection and Removal by Multi-Objective Analysis for Photographic Guidance(https://arxiv.org/abs/2507.14553)
Keywords: generative
Abstract: Clutter in photos is a distraction preventing photographers from conveying the intended emotions or stories to the audience. Photography amateurs frequently include clutter in their photos due to unconscious negligence or the lack of experience in creating a decluttered, aesthetically appealing scene for shooting. We are thus motivated to develop a camera guidance system that provides solutions and guidance for clutter identification and removal. We estimate and visualize the contribution of objects to the overall aesthetics and content of a photo, based on which users can interactively identify clutter. Suggestions on getting rid of clutter, as well as a tool that removes cluttered objects computationally, are provided to guide users to deal with different kinds of clutter and improve their photographic work. Two technical novelties underpin interactions in our system: a clutter distinguishment algorithm with aesthetics evaluations for objects and an iterative image inpainting algorithm based on generative adversarial nets that reconstructs missing regions of removed objects for high-resolution images. User studies demonstrate that our system provides flexible interfaces and accurate algorithms that allow users to better identify distractions and take higher quality images within less time.

Title: Benchmarking GANs, Diffusion Models, and Flow Matching for T1w-to-T2w MRI Translation

Authors: Andrea Moschetto, Lemuel Puglisi, Alec Sargood, Pierluigi Dell'Acqua, Francesco Guarnera, Sebastiano Battiato, Daniele Ravì
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2507.14575
Pdf URL: https://arxiv.org/pdf/2507.14575
Copy Paste: [[2507.14575]] Benchmarking GANs, Diffusion Models, and Flow Matching for T1w-to-T2w MRI Translation(https://arxiv.org/abs/2507.14575)
Keywords: generative
Abstract: Magnetic Resonance Imaging (MRI) enables the acquisition of multiple image contrasts, such as T1-weighted (T1w) and T2-weighted (T2w) scans, each offering distinct diagnostic insights. However, acquiring all desired modalities increases scan time and cost, motivating research into computational methods for cross-modal synthesis. To address this, recent approaches aim to synthesize missing MRI contrasts from those already acquired, reducing acquisition time while preserving diagnostic quality. Image-to-image (I2I) translation provides a promising framework for this task. In this paper, we present a comprehensive benchmark of generative models, specifically Generative Adversarial Networks (GANs), diffusion models, and flow matching (FM) techniques, for T1w-to-T2w 2D MRI I2I translation. All frameworks are implemented with comparable settings and evaluated on three publicly available MRI datasets of healthy adults. Our quantitative and qualitative analyses show that the GAN-based Pix2Pix model outperforms diffusion and FM-based methods in terms of structural fidelity, image quality, and computational efficiency. Consistent with existing literature, these results suggest that flow-based models are prone to overfitting on small datasets and simpler tasks, and may require more data to match or surpass GAN performance. These findings offer practical guidance for deploying I2I translation techniques in real-world MRI workflows and highlight promising directions for future research in cross-modal medical image synthesis. Code and models are publicly available at this https URL.

Title: A Transformer-Based Conditional GAN with Multiple Instance Learning for UAV Signal Detection and Classification

Authors: Haochen Liu, Jia Bi, Xiaomin Wang, Xin Yang, Ling Wang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2507.14592
Pdf URL: https://arxiv.org/pdf/2507.14592
Copy Paste: [[2507.14592]] A Transformer-Based Conditional GAN with Multiple Instance Learning for UAV Signal Detection and Classification(https://arxiv.org/abs/2507.14592)
Keywords: generative
Abstract: Unmanned Aerial Vehicles (UAVs) are increasingly used in surveillance, logistics, agriculture, disaster management, and military operations. Accurate detection and classification of UAV flight states, such as hovering, cruising, ascending, or transitioning, is essential for safe and effective operations. However, conventional time series classification (TSC) methods often lack robustness and generalization for dynamic UAV environments, while state-of-the-art (SOTA) models like Transformers and LSTM-based architectures typically require large datasets and entail high computational costs, especially with high-dimensional data streams. This paper proposes a novel framework that integrates a Transformer-based Generative Adversarial Network (GAN) with Multiple Instance Locally Explainable Learning (MILET) to address these challenges in UAV flight state classification. The Transformer encoder captures long-range temporal dependencies and complex telemetry dynamics, while the GAN module augments limited datasets with realistic synthetic samples. MIL is incorporated to focus attention on the most discriminative input segments, reducing noise and computational overhead. Experimental results show that the proposed method achieves accuracies of 96.5% on the DroneDetect dataset and 98.6% on the DroneRF dataset, outperforming other SOTA approaches. The framework also demonstrates strong computational efficiency and robust generalization across diverse UAV platforms and flight states, highlighting its potential for real-time deployment in resource-constrained environments.

Title: BusterX++: Towards Unified Cross-Modal AI-Generated Content Detection and Explanation with MLLM

Authors: Haiquan Wen, Tianxiao Li, Zhenglin Huang, Yiwei He, Guangliang Cheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.14632
Pdf URL: https://arxiv.org/pdf/2507.14632
Copy Paste: [[2507.14632]] BusterX++: Towards Unified Cross-Modal AI-Generated Content Detection and Explanation with MLLM(https://arxiv.org/abs/2507.14632)
Keywords: generation, generative
Abstract: Recent advances in generative AI have dramatically improved image and video synthesis capabilities, significantly increasing the risk of misinformation through sophisticated fake content. In response, detection methods have evolved from traditional approaches to multimodal large language models (MLLMs), offering enhanced transparency and interpretability in identifying synthetic media. However, current detection systems remain fundamentally limited by their single-modality design. These approaches analyze images or videos separately, making them ineffective against synthetic content that combines multiple media formats. To address these challenges, we introduce BusterX++, a novel framework designed specifically for cross-modal detection and explanation of synthetic media. Our approach incorporates an advanced reinforcement learning (RL) post-training strategy that eliminates cold-start. Through Multi-stage Training, Thinking Reward, and Hybrid Reasoning, BusterX++ achieves stable and substantial performance improvements. To enable comprehensive evaluation, we also present GenBuster++, a cross-modal benchmark leveraging state-of-the-art image and video generation techniques. This benchmark comprises 4,000 images and video clips, meticulously curated by human experts using a novel filtering methodology to ensure high quality, diversity, and real-world applicability. Extensive experiments demonstrate the effectiveness and generalizability of our approach.

Title: Docopilot: Improving Multimodal Models for Document-Level Understanding

Authors: Yuchen Duan, Zhe Chen, Yusong Hu, Weiyun Wang, Shenglong Ye, Botian Shi, Lewei Lu, Qibin Hou, Tong Lu, Hongsheng Li, Jifeng Dai, Wenhai Wang
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2507.14675
Pdf URL: https://arxiv.org/pdf/2507.14675
Copy Paste: [[2507.14675]] Docopilot: Improving Multimodal Models for Document-Level Understanding(https://arxiv.org/abs/2507.14675)
Keywords: generation
Abstract: Despite significant progress in multimodal large language models (MLLMs), their performance on complex, multi-page document comprehension remains inadequate, largely due to the lack of high-quality, document-level datasets. While current retrieval-augmented generation (RAG) methods offer partial solutions, they suffer from issues, such as fragmented retrieval contexts, multi-stage error accumulation, and extra time costs of retrieval. In this work, we present a high-quality document-level dataset, Doc-750K, designed to support in-depth understanding of multimodal documents. This dataset includes diverse document structures, extensive cross-page dependencies, and real question-answer pairs derived from the original documents. Building on the dataset, we develop a native multimodal model, Docopilot, which can accurately handle document-level dependencies without relying on RAG. Experiments demonstrate that Docopilot achieves superior coherence, accuracy, and efficiency in document understanding tasks and multi-turn interactions, setting a new baseline for document-level multimodal understanding. Data, code, and models are released at this https URL

Title: GCC-Spam: Spam Detection via GAN, Contrastive Learning, and Character Similarity Networks

Authors: Zixin Xu, Zhijie Wang, Zhiyuan Pan
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2507.14679
Pdf URL: https://arxiv.org/pdf/2507.14679
Copy Paste: [[2507.14679]] GCC-Spam: Spam Detection via GAN, Contrastive Learning, and Character Similarity Networks(https://arxiv.org/abs/2507.14679)
Keywords: generative
Abstract: The exponential growth of spam text on the Internet necessitates robust detection mechanisms to mitigate risks such as information leakage and social instability. This work addresses two principal challenges: adversarial strategies employed by spammers and the scarcity of labeled data. We propose a novel spam-text detection framework GCC-Spam, which integrates three core innovations. First, a character similarity network captures orthographic and phonetic features to counter character-obfuscation attacks and furthermore produces sentence embeddings for downstream classification. Second, contrastive learning enhances discriminability by optimizing the latent-space distance between spam and normal texts. Third, a Generative Adversarial Network (GAN) generates realistic pseudo-spam samples to alleviate data scarcity while improving model robustness and classification accuracy. Extensive experiments on real-world datasets demonstrate that our model outperforms baseline approaches, achieving higher detection rates with significantly fewer labeled examples.

Title: Fraud is Not Just Rarity: A Causal Prototype Attention Approach to Realistic Synthetic Oversampling

Authors: Claudio Giusti, Luca Guarnera, Mirko Casu, Sebastiano Battiato
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2507.14706
Pdf URL: https://arxiv.org/pdf/2507.14706
Copy Paste: [[2507.14706]] Fraud is Not Just Rarity: A Causal Prototype Attention Approach to Realistic Synthetic Oversampling(https://arxiv.org/abs/2507.14706)
Keywords: generative
Abstract: Detecting fraudulent credit card transactions remains a significant challenge, due to the extreme class imbalance in real-world data and the often subtle patterns that separate fraud from legitimate activity. Existing research commonly attempts to address this by generating synthetic samples for the minority class using approaches such as GANs, VAEs, or hybrid generative models. However, these techniques, particularly when applied only to minority-class data, tend to result in overconfident classifiers and poor latent cluster separation, ultimately limiting real-world detection performance. In this study, we propose the Causal Prototype Attention Classifier (CPAC), an interpretable architecture that promotes class-aware clustering and improved latent space structure through prototype-based attention mechanisms. We couple it with the encoder of a VAE-GAN, allowing it to offer better cluster separation and moving beyond post-hoc sample augmentation. We compared CPAC-augmented models to traditional oversamplers, such as SMOTE, as well as to state-of-the-art generative models, both with and without CPAC-based latent classifiers. Our results show that classifier-guided latent shaping with CPAC delivers superior performance, achieving an F1-score of 93.14% and recall of 90.18%, along with improved latent cluster separation. Further ablation studies and visualizations provide deeper insight into the benefits and limitations of classifier-driven representation learning for fraud detection. The codebase for this work will be available at final submission.

Title: Exploring the Dynamic Scheduling Space of Real-Time Generative AI Applications on Emerging Heterogeneous Systems

Authors: Rachid Karami, Rajeev Patwari, Hyoukjun Kwon, Ashish Sirasao
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.14715
Pdf URL: https://arxiv.org/pdf/2507.14715
Copy Paste: [[2507.14715]] Exploring the Dynamic Scheduling Space of Real-Time Generative AI Applications on Emerging Heterogeneous Systems(https://arxiv.org/abs/2507.14715)
Keywords: generative
Abstract: The integration of generative AI models, particularly large language models (LLMs), into real-time multi-model AI applications such as video conferencing and gaming is giving rise to a new class of workloads: real-time generative AI (RTGen). These workloads combine the compute intensity and dynamic execution patterns of generative models with the stringent latency and concurrency constraints of real-time inference. To meet the diverse demands of RTGen workloads, modern edge platforms increasingly adopt heterogeneous system-on-chip (SoC) architectures that integrate CPUs, GPUs, and NPUs. Despite the potential of heterogeneous SoC, the scheduling space complexity and performance implications of RTGen workloads on such platforms remain underexplored. In this work, we perform a comprehensive characterization of RTGen workloads on AMD's latest heterogeneous SoC, Ryzen AI. We construct realistic multi-model scenarios inspired by industry use cases and profile model performance across all available backends. Using this data, we evaluate five scheduling policies and their impact on both real-time metrics (e.g., deadline violation rate) and LLM performance (e.g., time-to-first-token and tokens-per-second). Our results show that scheduling decisions significantly affect workload performance (e.g., leading to a 41.7% difference in deadline violation rates on average), and highlight the need for scheduling strategies that are aware of workload dynamics and hardware heterogeneity. Our findings underscore the importance of workload-aware, dynamic heterogeneous scheduling in enabling high-performance, on-device RTGen applications.
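The metrics named in this abstract can be made concrete with a minimal sketch. The code below is not taken from the paper: `FrameJob`, `deadline_violation_rate`, and `llm_metrics` are hypothetical names, and tokens-per-second is assumed here to be total tokens divided by total time since the request (some definitions exclude the first token instead).

```python
from dataclasses import dataclass


@dataclass
class FrameJob:
    """One real-time inference job (hypothetical record layout)."""
    deadline_ms: float  # time by which the result is needed
    finish_ms: float    # time at which the result was produced


def deadline_violation_rate(jobs):
    """Fraction of jobs that finished after their deadline."""
    if not jobs:
        return 0.0
    missed = sum(1 for job in jobs if job.finish_ms > job.deadline_ms)
    return missed / len(jobs)


def llm_metrics(request_ms, token_times_ms):
    """Time-to-first-token (ms) and tokens-per-second for one LLM request.

    Assumption: throughput counts every token over the whole request time.
    """
    ttft_ms = token_times_ms[0] - request_ms
    total_s = (token_times_ms[-1] - request_ms) / 1000.0
    return {"ttft_ms": ttft_ms, "tokens_per_second": len(token_times_ms) / total_s}


# Illustrative numbers only, not measurements from the paper.
jobs = [FrameJob(33, 30), FrameJob(66, 70), FrameJob(99, 95), FrameJob(132, 131)]
rate = deadline_violation_rate(jobs)
metrics = llm_metrics(request_ms=0.0, token_times_ms=[200.0, 300.0, 400.0, 500.0])
print(rate, metrics)
```

Comparing two scheduling policies then amounts to running the same job trace under each policy and comparing these numbers.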

Title: Beyond the Single-Best Model: Rashomon Partial Dependence Profile for Trustworthy Explanations in AutoML

Authors: Mustafa Cavus, Jan N. van Rijn, Przemysław Biecek
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.14744
Pdf URL: https://arxiv.org/pdf/2507.14744
Copy Paste: [[2507.14744]] Beyond the Single-Best Model: Rashomon Partial Dependence Profile for Trustworthy Explanations in AutoML(https://arxiv.org/abs/2507.14744)
Keywords: generation
Abstract: Automated machine learning systems efficiently streamline model selection but often focus on a single best-performing model, overlooking explanation uncertainty, an essential concern in human-centered explainable AI. To address this, we propose a novel framework that incorporates model multiplicity into explanation generation by aggregating partial dependence profiles (PDP) from a set of near-optimal models, known as the Rashomon set. The resulting Rashomon PDP captures interpretive variability and highlights areas of disagreement, providing users with a richer, uncertainty-aware view of feature effects. To assess its usefulness, we introduce two quantitative metrics, the coverage rate and the mean width of confidence intervals, to evaluate the consistency between the standard PDP and the proposed Rashomon PDP. Experiments on 35 regression datasets from the OpenML CTR23 benchmark suite show that in most cases, the Rashomon PDP covers less than 70% of the best model's PDP, underscoring the limitations of single-model explanations. Our findings suggest that Rashomon PDP improves the reliability and trustworthiness of model interpretations by adding information that would otherwise be neglected. This is particularly useful in high-stakes domains where transparency and confidence are critical.
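The aggregation described in this abstract can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the Rashomon set is taken to be every model whose loss is within an absolute tolerance `epsilon` of the best loss, the interval is a normal-approximation band around the mean profile, and coverage is the share of grid points where the best model's PDP falls inside that band. All function names are hypothetical.

```python
import math
from statistics import mean, stdev


def partial_dependence(model, rows, feature, grid):
    """Average model prediction with `feature` forced to each grid value."""
    profile = []
    for value in grid:
        preds = []
        for row in rows:
            modified = list(row)
            modified[feature] = value
            preds.append(model(modified))
        profile.append(mean(preds))
    return profile


def rashomon_pdp(models, losses, rows, feature, grid, epsilon=0.05, z=1.96):
    """Aggregate PDPs over near-optimal models and score the best model's PDP."""
    best_loss = min(losses)
    best_model = models[losses.index(best_loss)]
    # Assumption: absolute loss tolerance defines the Rashomon set.
    members = [m for m, loss in zip(models, losses) if loss <= best_loss + epsilon]
    profiles = [partial_dependence(m, rows, feature, grid) for m in members]

    lower, upper = [], []
    for k in range(len(grid)):
        values = [p[k] for p in profiles]
        center = mean(values)
        # Assumption: normal-approximation interval around the mean profile.
        half = z * stdev(values) / math.sqrt(len(values)) if len(values) > 1 else 0.0
        lower.append(center - half)
        upper.append(center + half)

    best_profile = partial_dependence(best_model, rows, feature, grid)
    covered = [lo <= b <= hi for lo, b, hi in zip(lower, best_profile, upper)]
    return {
        "n_members": len(members),
        "coverage_rate": sum(covered) / len(grid),
        "mean_width": mean(hi - lo for lo, hi in zip(lower, upper)),
    }


# Toy models as plain callables; losses are made-up validation losses.
models = [lambda r: 2 * r[0], lambda r: 2 * r[0] + 1, lambda r: 10 * r[0]]
losses = [0.10, 0.12, 0.90]
rows = [[0.0, 1.0], [1.0, 2.0]]
result = rashomon_pdp(models, losses, rows, feature=0, grid=[0.0, 1.0, 2.0])
print(result)
```

A low coverage rate or a wide band signals that near-optimal models disagree about the feature effect, which a single-model PDP would hide.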

Title: Omni-Think: Scaling Cross-Domain Generalization in LLMs via Multi-Task RL with Hybrid Rewards

Authors: Derek Li, Jiaming Zhou, Amirreza Kazemi, Qianyi Sun, Abbas Ghaddar, Mohammad Ali Alomrani, Liheng Ma, Yu Luo, Dong Li, Feng Wen, Jianye Hao, Mark Coates, Yingxue Zhang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2507.14783
Pdf URL: https://arxiv.org/pdf/2507.14783
Copy Paste: [[2507.14783]] Omni-Think: Scaling Cross-Domain Generalization in LLMs via Multi-Task RL with Hybrid Rewards(https://arxiv.org/abs/2507.14783)
Keywords: generation, generative
Abstract: The advancement of general-purpose artificial intelligence relies on large language models (LLMs) that excel across a wide range of tasks, from structured reasoning to creative generation. However, post-training methods like Supervised Fine-Tuning (SFT) often struggle with generalization, favoring memorization over transferable learning. In this work, we introduce Omni-Think, a unified reinforcement learning (RL) framework that enhances LLM performance across diverse tasks by combining rule-based verifiable rewards with generative preference signals via LLM-as-a-Judge evaluations. Our approach enables consistent optimization across task types and scales RL-based training to subjective domains. We further investigate training strategies, demonstrating that a curriculum-based progression that orders tasks from structured to open-ended improves performance and reduces forgetting. Experimental results across four domains reveal that curriculum learning improves performance by 5.2% over joint training and 9.1% over model merging. These results highlight the importance of task-aware sampling and hybrid supervision in scaling RL-based post-training for general-purpose LLMs.

Title: Distilling Parallel Gradients for Fast ODE Solvers of Diffusion Models

Authors: Beier Zhu, Ruoyu Wang, Tong Zhao, Hanwang Zhang, Chi Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.14797
Pdf URL: https://arxiv.org/pdf/2507.14797
Copy Paste: [[2507.14797]] Distilling Parallel Gradients for Fast ODE Solvers of Diffusion Models(https://arxiv.org/abs/2507.14797)
Keywords: generative
Abstract: Diffusion models (DMs) have achieved state-of-the-art generative performance but suffer from high sampling latency due to their sequential denoising nature. Existing solver-based acceleration methods often face image quality degradation under a low-latency budget. In this paper, we propose the Ensemble Parallel Direction solver (dubbed EPD), a novel ODE solver that mitigates truncation errors by incorporating multiple parallel gradient evaluations in each ODE step. Importantly, since the additional gradient computations are independent, they can be fully parallelized, preserving low-latency sampling. Our method optimizes a small set of learnable parameters in a distillation fashion, ensuring minimal training overhead. In addition, our method can serve as a plugin to improve existing ODE samplers. Extensive experiments on various image synthesis benchmarks demonstrate the effectiveness of EPD in achieving high-quality and low-latency sampling. For example, at the same latency level of 5 NFE, EPD achieves an FID of 4.47 on CIFAR-10, 7.97 on FFHQ, 8.17 on ImageNet, and 8.26 on LSUN Bedroom, surpassing existing learning-based solvers by a significant margin. Code is available at this https URL.

Title: Exploring Scalable Unified Modeling for General Low-Level Vision

Authors: Xiangyu Chen, Kaiwen Zhu, Yuandong Pu, Shuo Cao, Xiaohui Li, Wenlong Zhang, Yihao Liu, Yu Qiao, Jiantao Zhou, Chao Dong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.14801
Pdf URL: https://arxiv.org/pdf/2507.14801
Copy Paste: [[2507.14801]] Exploring Scalable Unified Modeling for General Low-Level Vision(https://arxiv.org/abs/2507.14801)
Keywords: restoration
Abstract: Low-level vision involves a wide spectrum of tasks, including image restoration, enhancement, stylization, and feature extraction, which differ significantly in both task formulation and output domains. To address the challenge of unified modeling across such diverse tasks, we propose a Visual task Prompt-based Image Processing (VPIP) framework that leverages input-target image pairs as visual prompts to guide the model in performing a variety of low-level vision tasks. The framework comprises an end-to-end image processing backbone, a prompt encoder, and a prompt interaction module, enabling flexible integration with various architectures and effective utilization of task-specific visual representations. Based on this design, we develop a unified low-level vision model, GenLV, and evaluate its performance across multiple representative tasks. To explore the scalability of this approach, we extend the framework along two dimensions: model capacity and task diversity. We construct a large-scale benchmark consisting of over 100 low-level vision tasks and train multiple versions of the model with varying scales. Experimental results show that the proposed method achieves considerable performance across a wide range of tasks. Notably, increasing the number of training tasks enhances generalization, particularly for tasks with limited data, indicating the model's ability to learn transferable representations through joint training. Further evaluations in zero-shot generalization, few-shot transfer, and task-specific fine-tuning scenarios demonstrate the model's strong adaptability, confirming the effectiveness, scalability, and potential of the proposed framework as a unified foundation for general low-level vision modeling.

Title: SegQuant: A Semantics-Aware and Generalizable Quantization Framework for Diffusion Models

Authors: Jiaji Zhang, Ruichao Sun, Hailiang Zhao, Jiaju Wu, Peng Chen, Hao Li, Xinkui Zhao, Kingsum Chow, Gang Xiong, Lin Ye, Shuiguang Deng
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2507.14811
Pdf URL: https://arxiv.org/pdf/2507.14811
Copy Paste: [[2507.14811]] SegQuant: A Semantics-Aware and Generalizable Quantization Framework for Diffusion Models(https://arxiv.org/abs/2507.14811)
Keywords: generative
Abstract: Diffusion models have demonstrated exceptional generative capabilities but are computationally intensive, posing significant challenges for deployment in resource-constrained or latency-sensitive environments. Quantization offers an effective means to reduce model size and computational cost, with post-training quantization (PTQ) being particularly appealing due to its compatibility with pre-trained models without requiring retraining or training data. However, existing PTQ methods for diffusion models often rely on architecture-specific heuristics that limit their generalizability and hinder integration with industrial deployment pipelines. To address these limitations, we propose SegQuant, a unified quantization framework that adaptively combines complementary techniques to enhance cross-model versatility. SegQuant consists of a segment-aware, graph-based quantization strategy (SegLinear) that captures structural semantics and spatial heterogeneity, along with a dual-scale quantization scheme (DualScale) that preserves polarity-asymmetric activations, which is crucial for maintaining visual fidelity in generated outputs. SegQuant is broadly applicable beyond Transformer-based diffusion models, achieving strong performance while ensuring seamless compatibility with mainstream deployment tools.

Title: Paired Image Generation with Diffusion-Guided Diffusion Models

Authors: Haoxuan Zhang, Wenju Cui, Yuzhu Cao, Tao Tan, Jie Liu, Yunsong Peng, Jian Zheng
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2507.14833
Pdf URL: https://arxiv.org/pdf/2507.14833
Copy Paste: [[2507.14833]] Paired Image Generation with Diffusion-Guided Diffusion Models(https://arxiv.org/abs/2507.14833)
Keywords: generation
Abstract: The segmentation of mass lesions in digital breast tomosynthesis (DBT) images is very significant for the early screening of breast cancer. However, the high-density breast tissue often leads to high concealment of the mass lesions, which makes manual annotation difficult and time-consuming. As a result, there is a lack of annotated data for model training. Diffusion models are commonly used for data augmentation, but the existing methods face two challenges. First, due to the high concealment of lesions, it is difficult for the model to learn the features of the lesion area. This leads to the low generation quality of the lesion areas, thus limiting the quality of the generated images. Second, existing methods can only generate images and cannot generate corresponding annotations, which restricts the usability of the generated images in supervised training. In this work, we propose a paired image generation method. The method does not require external conditions and can achieve the generation of paired images by training an extra diffusion guider for the conditional diffusion model. During the experimental phase, we generated paired DBT slices and mass lesion masks. Then, we incorporated them into the supervised training process of the mass lesion segmentation task. The experimental results show that our method can improve the generation quality without external conditions. Moreover, it contributes to alleviating the shortage of annotated data, thus enhancing the performance of downstream tasks.

Title: The Invisible Leash: Why RLVR May Not Escape Its Origin

Authors: Fang Wu, Weihao Xuan, Ximing Lu, Zaid Harchaoui, Yejin Choi
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2507.14843
Pdf URL: https://arxiv.org/pdf/2507.14843
Copy Paste: [[2507.14843]] The Invisible Leash: Why RLVR May Not Escape Its Origin(https://arxiv.org/abs/2507.14843)
Keywords: generation
Abstract: Recent advances in large reasoning models highlight Reinforcement Learning with Verifiable Rewards (RLVR) as a promising method for enhancing AI's capabilities, particularly in solving complex logical tasks. However, it remains unclear whether RLVR truly expands a model's reasoning boundary or merely amplifies high-reward outputs that the base model already knows for improved precision. This study presents a theoretical and empirical investigation that provides fresh insights into the potential limits of RLVR. First, we offer a new theoretical perspective that RLVR is constrained by the base model's support (it is unable to sample solutions with zero initial probability) and operates as a conservative reweighting mechanism that may restrict the discovery of entirely original solutions. We also identify an entropy-reward tradeoff: while RLVR reliably enhances precision, it may progressively narrow exploration and potentially overlook correct yet underrepresented solutions. Extensive empirical experiments validate that while RLVR consistently improves pass@1, the shrinkage of empirical support generally outweighs the expansion of empirical support under larger sampling budgets, failing to recover correct answers that were previously accessible to the base model. Interestingly, we also observe that while RLVR sometimes increases token-level entropy, resulting in greater uncertainty at each generation step, answer-level entropy declines, indicating that these seemingly more uncertain paths ultimately converge onto a smaller set of distinct answers. Taken together, these findings reveal potential limits of RLVR in extending reasoning horizons. Breaking this invisible leash may require future algorithmic innovations such as explicit exploration mechanisms or hybrid strategies that seed probability mass into underrepresented solution regions.
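The two quantities this abstract contrasts, pass@k and the shrinkage versus expansion of empirical support, can be illustrated with a short sketch. The abstract does not say how pass@k was computed, so the combinatorial estimator below (widely used for code-generation benchmarks) is an assumption. `support_change` is a hypothetical helper that compares support at the level of solved problem IDs, which is only one possible reading of "empirical support".

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if k > n:
        raise ValueError("k must not exceed the number of samples n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def support_change(base_solved, rlvr_solved):
    """Split solved problems into preserved, lost (shrinkage), and new (expansion)."""
    base, rlvr = set(base_solved), set(rlvr_solved)
    return {
        "preserved": base & rlvr,
        "shrinkage": base - rlvr,   # solved by the base model only
        "expansion": rlvr - base,   # solved by the RLVR model only
    }


# Illustrative problem IDs, not data from the paper.
base_solved = {"p1", "p2", "p3", "p4"}
rlvr_solved = {"p1", "p2", "p5"}
change = support_change(base_solved, rlvr_solved)
print(pass_at_k(10, 3, 1), len(change["shrinkage"]), len(change["expansion"]))
```

In this toy case shrinkage (two problems) outweighs expansion (one problem), the pattern the abstract reports under larger sampling budgets.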

Title: Grounding Degradations in Natural Language for All-In-One Video Restoration

Authors: Muhammad Kamran Janjua, Amirhosein Ghasemabadi, Kunlin Zhang, Mohammad Salameh, Chao Gao, Di Niu
Subjects: cs.CV, cs.AI, cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2507.14851
Pdf URL: https://arxiv.org/pdf/2507.14851
Copy Paste: [[2507.14851]] Grounding Degradations in Natural Language for All-In-One Video Restoration(https://arxiv.org/abs/2507.14851)
Keywords: restoration
Abstract: In this work, we propose an all-in-one video restoration framework that grounds degradation-aware semantic context of video frames in natural language via foundation models, offering interpretable and flexible guidance. Unlike prior art, our method assumes no degradation knowledge at training or test time and learns an approximation to the grounded knowledge such that the foundation model can be safely disentangled during inference, adding no extra cost. Further, we call for standardization of benchmarks in all-in-one video restoration, and propose two benchmarks in the multi-degradation setting, three-task (3D) and four-task (4D), and two time-varying composite degradation benchmarks, one of the latter being our proposed dataset with varying snow intensity, simulating how weather degradations affect videos naturally. We compare our method with prior works and report state-of-the-art performance on all benchmarks.
摘要：在这项工作中，我们提出了一个多合一的视频恢复框架，该框架通过基础模型以自然语言的视频帧的降级感知语义背景，提供可解释且灵活的指导。与先前的艺术不同，我们的方法在火车或测试时间中不假定降解知识，并且学会了与接地知识的近似，因此在推理期间可以安全地分解基础模型，从而增加了额外的成本。此外，我们呼吁在多合一视频恢复中对基准进行标准化，并在多降解设置中提出两个基准测试，三任任务（3D）和四任任务（4D）以及两个时间变化的复合降解基准；后者之一是我们提议的数据集具有不同的积雪强度，模拟天气降解如何自然影响视频。我们将我们的方法与先前的作品进行比较，并在所有基准上报告最先进的性能。

Title: Stereo-GS: Multi-View Stereo Vision Model for Generalizable 3D Gaussian Splatting Reconstruction

Authors: Xiufeng Huang, Ka Chun Cheung, Runmin Cong, Simon See, Renjie Wan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.14921
Pdf URL: https://arxiv.org/pdf/2507.14921
Copy Paste: [[2507.14921]] Stereo-GS: Multi-View Stereo Vision Model for Generalizable 3D Gaussian Splatting Reconstruction(https://arxiv.org/abs/2507.14921)
Keywords: generation
Abstract: Generalizable 3D Gaussian Splatting reconstruction showcases advanced Image-to-3D content creation but requires substantial computational resources and large datasets, posing challenges to training models from scratch. Current methods usually entangle the prediction of 3D Gaussian geometry and appearance, which rely heavily on data-driven priors and result in slow regression speeds. To address this, we propose Stereo-GS, a disentangled framework for efficient 3D Gaussian prediction. Our method extracts features from local image pairs using a stereo vision backbone and fuses them via global attention blocks. Dedicated point and Gaussian prediction heads generate multi-view point-maps for geometry and Gaussian features for appearance, combined as GS-maps to represent the 3DGS object. A refinement network enhances these GS-maps for high-quality reconstruction. Unlike existing methods that depend on camera parameters, our approach achieves pose-free 3D reconstruction, improving robustness and practicality. By reducing resource demands while maintaining high-quality outputs, Stereo-GS provides an efficient, scalable solution for real-world 3D content generation.
摘要：可概括的3D高斯脱落重建展示了高级图像到3D内容创建，但需要大量的计算资源和大型数据集，从而对训练模型构成了挑战。当前的方法通常纠缠3D高斯几何形状和外观的预测，这些预测严重依赖于数据驱动的先验，并导致缓慢的回归速度。为了解决这个问题，我们提出了一个有效的3D高斯预测的分离框架\方法。我们的方法使用立体声视觉主链从本地图像对提取特征，并通过全球注意力块融合它们。专用点和高斯预测头生成了用于外观的几何和高斯特征的多视图映射，并将其作为GS-MAPS组合来表示3DGS对象。改进网络增强了这些GS-MAP，以进行高质量的重建。与依赖相机参数的现有方法不同，我们的方法实现了无姿势的3D重建，改善了鲁棒性和实用性。通过减少资源需求，同时维持高质量的输出，\方法为实际3D内容生成提供了有效的，可扩展的解决方案。

Title: OmniVTON: Training-Free Universal Virtual Try-On

Authors: Zhaotong Yang, Yuhui Li, Shengfeng He, Xinzhe Li, Yangyang Xu, Junyu Dong, Yong Du
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.15037
Pdf URL: https://arxiv.org/pdf/2507.15037
Copy Paste: [[2507.15037]] OmniVTON: Training-Free Universal Virtual Try-On(https://arxiv.org/abs/2507.15037)
Keywords: generation
Abstract: Image-based Virtual Try-On (VTON) techniques rely on either supervised in-shop approaches, which ensure high fidelity but struggle with cross-domain generalization, or unsupervised in-the-wild methods, which improve adaptability but remain constrained by data biases and limited universality. A unified, training-free solution that works across both scenarios remains an open challenge. We propose OmniVTON, the first training-free universal VTON framework that decouples garment and pose conditioning to achieve both texture fidelity and pose consistency across diverse settings. To preserve garment details, we introduce a garment prior generation mechanism that aligns clothing with the body, followed by a continuous boundary stitching technique to achieve fine-grained texture retention. For precise pose alignment, we utilize DDIM inversion to capture structural cues while suppressing texture interference, ensuring accurate body alignment independent of the original image textures. By disentangling garment and pose constraints, OmniVTON eliminates the bias inherent in diffusion models when handling multiple conditions simultaneously. Experimental results demonstrate that OmniVTON achieves superior performance across diverse datasets, garment types, and application scenarios. Notably, it is the first framework capable of multi-human VTON, enabling realistic garment transfer across multiple individuals in a single scene. Code is available at this https URL
摘要：基于图像的虚拟试验（VTON）技术依赖于监督的店内方法，这些方法确保了高保真度，但要在跨域的概括或无监督的内部方法中挣扎，从而改善了适应性，但仍受到数据偏见和有限的普遍性的限制。在两种情况下都可以使用的统一的无训练解决方案仍然是一个悬而未决的挑战。我们提出了Omnivton，这是第一个无训练的通用VTON框架，该框架破坏了衣服和姿势条件，以实现各种环境中的纹理忠诚度和姿势一致性。为了保存服装细节，我们引入了一种服装前一代机制，该机制将衣服与身体保持一致，然后是连续的边界缝合技术，以实现细粒度的质地保留。为了精确的姿势对齐，我们利用DDIM倒置来捕获结构提示，同时抑制纹理干扰，从而确保与原始图像纹理无关的准确身体对齐。通过解开服装和构成构成，Omnivton同时处理多个条件时消除了扩散模型中固有的偏差。实验结果表明，Omnivton在各种数据集，服装类型和应用程序方面都取得了出色的性能。值得注意的是，这是一个能够多人Vton的第一个框架，可以在一个场景中跨多个个人进行逼真的服装转移。代码可在此HTTPS URL上找到

Title: Time-RA: Towards Time Series Reasoning for Anomaly with LLM Feedback

Authors: Yiyuan Yang, Zichuan Liu, Lei Song, Kai Ying, Zhiguang Wang, Tom Bamford, Svitlana Vyetrenko, Jiang Bian, Qingsong Wen
Subjects: cs.LG, cs.AI, cs.MM
Abstract URL: https://arxiv.org/abs/2507.15066
Pdf URL: https://arxiv.org/pdf/2507.15066
Copy Paste: [[2507.15066]] Time-RA: Towards Time Series Reasoning for Anomaly with LLM Feedback(https://arxiv.org/abs/2507.15066)
Keywords: generative
Abstract: Time series anomaly detection is critical across various domains, yet current approaches often limit analysis to mere binary anomaly classification without detailed categorization or further explanatory reasoning. To address these limitations, we propose a novel task, Time-series Reasoning for Anomaly (Time-RA) that transforms classical time series anomaly detection from a discriminative into a generative, reasoning-intensive task leveraging Large Language Models (LLMs). Also, we introduce the first real-world multimodal benchmark dataset, RATs40K, explicitly annotated for anomaly reasoning, comprising approximately 40,000 samples across 10 real-world domains. Each sample includes numeric time series data, contextual text information, and visual representations, each annotated with fine-grained categories (14 types for univariate anomalies and 6 for multivariate anomalies) and structured explanatory reasoning. We develop a sophisticated annotation framework utilizing ensemble-generated labels refined through GPT-4-driven feedback, ensuring accuracy and interpretability. Extensive benchmarking of LLMs and multimodal LLMs demonstrates the capabilities and limitations of current models, highlighting the critical role of supervised fine-tuning. Our dataset and task pave the way for significant advancements in interpretable time series anomaly detection and reasoning.
摘要：时间序列异常检测在各个领域至关重要，但是当前的方法通常将分析限制为仅二进制异常分类而无需详细的分类或进一步的解释推理。为了解决这些局限性，我们提出了一项新的任务，是异常的时间序列推理（Time-RA），将经典时间序列异常检测从歧视性转变为利用大型语言模型（LLMS）的生成性，推理密集型任务。此外，我们介绍了第一个现实世界多模式基准数据集Rats40K，明确注释了用于异常推理，其中包括10个现实世界中的大约40,000个样本。每个示例包括数字时间序列数据，上下文文本信息和视觉表示，每个样本都以细粒类别（单变量异常的14种类型，多变量异常）和结构化的解释推理进行注释。我们开发了一个复杂的注释框架，利用合奏生成的标签通过GPT-4驱动的反馈进行了完善，从而确保了准确性和解释性。 LLM和多模式LLM的广泛基准测试证明了当前模型的功能和局限性，强调了监督微调的关键作用。我们的数据集和任务为解释时间序列异常检测和推理的重大进步铺平了道路。

Title: Aesthetics is Cheap, Show me the Text: An Empirical Evaluation of State-of-the-Art Generative Models for OCR

Authors: Peirong Zhang, Haowei Xu, Jiaxin Zhang, Guitao Xu, Xuhan Zheng, Zhenhua Yang, Junle Liu, Yuyi Zhang, Lianwen Jin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.15085
Pdf URL: https://arxiv.org/pdf/2507.15085
Copy Paste: [[2507.15085]] Aesthetics is Cheap, Show me the Text: An Empirical Evaluation of State-of-the-Art Generative Models for OCR(https://arxiv.org/abs/2507.15085)
Keywords: generation, generative
Abstract: Text image is a unique and crucial information medium that integrates visual aesthetics and linguistic semantics in modern e-society. Due to their subtlety and complexity, the generation of text images represents a challenging and evolving frontier in the image generation field. The recent surge of specialized image generators (e.g., Flux-series) and unified generative models (e.g., GPT-4o), which demonstrate exceptional fidelity, raises a natural question: can they master the intricacies of text image generation and editing? Motivated by this, we assess current state-of-the-art generative models' capabilities in terms of text image generation and editing. We incorporate various typical optical character recognition (OCR) tasks into our evaluation and broaden the concept of text-based generation tasks into OCR generative tasks. We select 33 representative tasks and categorize them into five categories: document, handwritten text, scene text, artistic text, and complex & layout-rich text. For comprehensive evaluation, we examine six models across both closed-source and open-source domains, using tailored, high-quality image inputs and prompts. Through this evaluation, we draw crucial observations and identify the weaknesses of current generative models for OCR tasks. We argue that photorealistic text image generation and editing should be internalized as foundational skills into general-domain generative models, rather than being delegated to specialized solutions, and we hope this empirical analysis can provide valuable insights for the community to achieve this goal. This evaluation is online and will be continuously updated at our GitHub repository.
摘要：文本图像是一种独特而至关重要的信息媒介，可在现代电子社会中整合视觉美学和语言语义。由于它们的微妙和复杂性，文本图像的产生代表了图像生成领域中具有挑战性且不断发展的前沿。最新的专业图像发生器（\ emph {e.g。}，通量系列）和统一的生成模型（\ emph {e.g。}，gpt-4O）的激增，证明了异常的保真度，提出了一个自然的问题：他们可以掌握文本图像的复杂性和编辑的复杂性吗？在此激励的情况下，我们评估了当前的最新生成模型的功能，从文本图像生成和编辑方面。我们将各种典型的光学特征识别（OCR）任务纳入我们的评估中，并将基于文本的生成任务的概念扩展到OCR生成任务中。我们选择33个代表性任务，并将其分为五个类别：文档，手写文本，场景文本，艺术文本和复杂\＆布局富含文本。为了进行全面的评估，我们使用量身定制的高质量图像输入和提示来检查封闭源和开源域的六个模型。通过此评估，我们绘制关键观察结果，并确定OCR任务的当前生成模型的弱点。我们认为，影像逼真的文本图像生成和编辑应将其内部化为一般域生成模型的基础技能，而不是将其委派成专业解决方案，我们希望这种经验分析可以为社区提供有价值的见解，以实现这一目标。此评估是在线的，将在我们的GitHub存储库中不断更新。

Title: AnalogFed: Federated Discovery of Analog Circuit Topologies with Generative AI

Authors: Qiufeng Li, Shu Hong, Jian Gao, Xuan Zhang, Tian Lan, Weidong Cao
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2507.15104
Pdf URL: https://arxiv.org/pdf/2507.15104
Copy Paste: [[2507.15104]] AnalogFed: Federated Discovery of Analog Circuit Topologies with Generative AI(https://arxiv.org/abs/2507.15104)
Keywords: generative
Abstract: Recent breakthroughs in AI/ML offer exciting opportunities to revolutionize analog design automation through data-driven approaches. In particular, researchers are increasingly fascinated by harnessing the power of generative AI to automate the discovery of novel analog circuit topologies. Unlocking the full potential of generative AI in these data-driven discoveries requires access to large and diverse datasets. However, there is a significant barrier in the analog domain: analog circuit design is inherently proprietary, involving not only confidential circuit structures but also the underlying commercial semiconductor processes. As a result, current generative AI research is largely confined to individual researchers who construct small, narrowly focused private datasets. This fragmentation severely limits collaborative innovation and impedes progress across the research community. To address these challenges, we propose AnalogFed. AnalogFed enables collaborative topology discovery across decentralized clients (e.g., individual researchers or institutions) without requiring the sharing of raw private data. To make this vision practical, we introduce a suite of techniques tailored to the unique challenges of applying federated learning (FedL) in analog design, from generative model development and data heterogeneity handling to privacy-preserving strategies that ensure both flexibility and security for circuit designers and semiconductor manufacturers. Extensive experiments across varying client counts and dataset sizes demonstrate that AnalogFed achieves performance comparable to centralized baselines while maintaining strict data privacy. Specifically, the generative AI model within AnalogFed achieves state-of-the-art efficiency and scalability in the design of analog circuit topologies.
摘要：AI/ML的最新突破为通过数据驱动的方法彻底改变模拟设计自动化提供了令人兴奋的机会。特别是，研究人员越来越着迷于利用生成AI的力量自动化新型模拟电路拓扑的发现。在这些数据驱动的发现中释放生成AI的全部潜力需要访问大型和多样化的HTTP URL，模拟域中存在着重要的障碍 - Analog电路设计本质上是专有的，不仅涉及机密的电路结构，而且还涉及基础的商业半导体过程。结果，当前的生成AI研究在很大程度上仅限于构建小型，狭窄专注的私人数据集的个体研究人员。这种分裂严重限制了协作创新，并阻碍了整个研究界的进步。为了应对这些挑战，我们提出了类似的方式。 Analogfed启用了分散的客户（例如个人研究人员或机构）之间的协作拓扑发现，而无需共享原始私人数据。为了使这个愿景实用，我们引入了一套针对模拟设计中使用FedL的独特挑战量身定制的技术 - 从生成模型开发和数据异质性处理中，将其处理用于隐私保护策略，以确保巡回设计人员和半导体制造商的灵活性和安全性。各种客户计数和数据集大小之间进行的广泛实验表明，模拟f的性能与集中式基准相当 - 而保持严格的数据隐私。具体而言，类似物中的生成AI模型在模拟电路拓扑设计的设计中实现了最先进的效率和可扩展性。

Title: Resonant-Tunnelling Diode Reservoir Computing System for Image Recognition

Authors: A. H. Abbas, Hend Abdel-Ghani, Ivan S. Maksymov
Subjects: cs.LG, physics.app-ph
Abstract URL: https://arxiv.org/abs/2507.15158
Pdf URL: https://arxiv.org/pdf/2507.15158
Copy Paste: [[2507.15158]] Resonant-Tunnelling Diode Reservoir Computing System for Image Recognition(https://arxiv.org/abs/2507.15158)
Keywords: generation
Abstract: As artificial intelligence continues to push into real-time, edge-based and resource-constrained environments, there is an urgent need for novel, hardware-efficient computational models. In this study, we present and validate a neuromorphic computing architecture based on resonant-tunnelling diodes (RTDs), which exhibit the nonlinear characteristics ideal for physical reservoir computing (RC). We theoretically formulate and numerically implement an RTD-based RC system and demonstrate its effectiveness on two image recognition benchmarks: handwritten digit classification and object recognition using the Fruit 360 dataset. Our results show that this circuit-level architecture delivers promising performance while adhering to the principles of next-generation RC, eliminating random connectivity in favour of a deterministic nonlinear transformation of input signals.
摘要：随着人工智能继续进入实时，基于边缘和资源约束的环境，迫切需要新颖，硬件有效的计算模型。在这项研究中，我们介绍并验证基于谐振型二极管二极管（RTDS）的神经形态计算结构，该结构表现出非常适合物理储层计算（RC）的非线性特征。我们从理论上制定并在数值上实现了基于RTD的RC系统，并在两个图像识别基准上演示了其有效性：使用〜360数据集的手写数字分类和对象识别。我们的结果表明，该电路级体系结构在遵守下一代RC的原理时提供了有希望的性能 - 消除了随机连接，而有利于对输入信号的确定性非线性转换。

Title: Designing User-Centric Metrics for Evaluation of Counterfactual Explanations

Authors: Firdaus Ahmed Choudhury, Ethan Leicht, Jude Ethan Bislig, Hangzhi Guo, Amulya Yadav
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.15162
Pdf URL: https://arxiv.org/pdf/2507.15162
Copy Paste: [[2507.15162]] Designing User-Centric Metrics for Evaluation of Counterfactual Explanations(https://arxiv.org/abs/2507.15162)
Keywords: generation
Abstract: Machine learning-based decision models are increasingly being used to make decisions that significantly impact people's lives, but their opaque nature leaves end users without a clear understanding of why a decision was made. Counterfactual Explanations (CFEs) have grown in popularity as a means of offering actionable guidance by identifying the minimum changes in feature values required to flip a model's prediction to something more desirable. Unfortunately, most prior research in CFEs relies on artificial evaluation metrics, such as proximity, which may overlook end-user preferences and constraints, e.g., the user's perception of effort needed to make certain feature changes may differ from that of the model designer. To address this research gap, this paper makes three novel contributions. First, we conduct a pilot study with 20 crowd-workers on Amazon MTurk to experimentally validate the alignment of existing CF evaluation metrics with real-world user preferences. Results show that user-preferred CFEs matched those based on proximity in only 63.81% of cases, highlighting the limited applicability of these metrics in real-world settings. Second, inspired by the need to design a user-informed evaluation metric for CFEs, we conduct a more detailed two-day user study with 41 participants facing realistic credit application scenarios to find experimental support for or against three intuitive hypotheses that may explain how end users evaluate CFEs. Third, based on the findings of this second study, we propose the AWP model, a novel user-centric, two-stage model that describes one possible mechanism by which users evaluate and select CFEs. Our results show that AWP predicts user-preferred CFEs with 84.37% accuracy. Our study provides the first human-centered validation for personalized cost models in CFE generation and highlights the need for adaptive, user-centered evaluation metrics.
摘要：基于机器学习的决策模型越来越多地用于做出显着影响人们生活的决策，但是他们不透明的大自然使最终用户没有明确了解为什么做出决定。反事实解释（CFE）在流行度上已增长，作为提供可行指导的一种手段，通过确定将模型预测到更可取的事物所需的特征值所需的最小变化。不幸的是，大多数先前在CFE的研究都依赖人工评估指标，例如接近性，这些指标可能忽略了最终用户的偏好和约束，例如，用户对进行某些特征变化所需的努力的看法可能与模型设计师的特征变化有所不同。为了解决这一研究差距，本文做出了三个新颖的贡献。首先，我们对亚马逊MTURK上的20名众群落者进行了一项试点研究，以实验验证现有的CF评估指标与现实世界用户偏好的一致性。结果表明，用户偏爱的CFE仅在63.81％的情况下基于接近性匹配的CFE匹配这些CFE，从而突出了这些指标在现实世界中的有限适用性。其次，灵感来自为CFE设计用户信息的评估指标的启发，我们进行了更详细的为期两天的用户研究，其中41位参与者面临现实的信用应用程序方案，以寻求对或反对三个直觉假设的实验支持，以解释最终用户如何评估CFES。第三，根据第二项研究的发现，我们提出了AWP模型，AWP模型是一种新型的以用户为中心的两阶段模型，描述了用户评估和选择CFE的一种可能机制。我们的结果表明，AWP的准确性为84.37％，可以预测用户偏爱的CFE。我们的研究为CFE生成中的个性化成本模型提供了首个以人为中心的验证，并强调了对自适应，以用户为中心的评估指标的需求。
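
To make the "proximity" metric mentioned in the abstract above concrete, here is a minimal sketch. It assumes proximity is measured as the L1 distance between the original feature vector and a candidate counterfactual, which is one common choice; the abstract does not say which distance the authors use, and the feature values below are hypothetical.

```python
def l1_proximity(x, cf):
    """L1 distance between an input and a candidate counterfactual."""
    return sum(abs(a - b) for a, b in zip(x, cf))

def closest_counterfactual(x, candidates):
    """Pick the candidate that changes the input the least under L1 distance."""
    return min(candidates, key=lambda cf: l1_proximity(x, cf))

# Hypothetical credit applicant: [income in $1000s, open accounts, years employed]
applicant = [40.0, 2.0, 5.0]
candidates = [
    [45.0, 2.0, 5.0],  # raise income by 5
    [40.0, 4.0, 5.0],  # open two more accounts
    [40.0, 2.0, 9.0],  # four more years of employment
]

best = closest_counterfactual(applicant, candidates)
print(best)  # [40.0, 4.0, 5.0]
```

A purely geometric score like this treats every unit of change as equally costly, which is the kind of mismatch with user-perceived effort that the abstract points to.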

Title: Better Models and Algorithms for Learning Ising Models from Dynamics

Authors: Jason Gaitonde, Ankur Moitra, Elchanan Mossel
Subjects: cs.LG, cs.DS, stat.ML
Abstract URL: https://arxiv.org/abs/2507.15173
Pdf URL: https://arxiv.org/pdf/2507.15173
Copy Paste: [[2507.15173]] Better Models and Algorithms for Learning Ising Models from Dynamics(https://arxiv.org/abs/2507.15173)
Keywords: generative
Abstract: We study the problem of learning the structure and parameters of the Ising model, a fundamental model of high-dimensional data, when observing the evolution of an associated Markov chain. A recent line of work has studied the natural problem of learning when observing an evolution of the well-known Glauber dynamics [Bresler, Gamarnik, Shah, IEEE Trans. Inf. Theory 2018; Gaitonde, Mossel, STOC 2024], which provides an arguably more realistic generative model than the classical i.i.d. setting. However, this prior work crucially assumes that all site update attempts are observed, even when this attempt does not change the configuration: this strong observation model is seemingly essential for these approaches. While perhaps possible in restrictive contexts, this precludes applicability to most realistic settings where we can observe only the stochastic evolution itself, a minimal and natural assumption for any process we might hope to learn from. However, designing algorithms that succeed in this more realistic setting has remained an open problem [Bresler, Gamarnik, Shah, IEEE Trans. Inf. Theory 2018; Gaitonde, Moitra, Mossel, STOC 2025]. In this work, we give the first algorithms that efficiently learn the Ising model in this much more natural observation model that only observes when the configuration changes. For Ising models with maximum degree $d$, our algorithm recovers the underlying dependency graph in time $\mathsf{poly}(d)\cdot n^2\log n$ and then the actual parameters in additional $\widetilde{O}(2^d n)$ time, which qualitatively matches the state-of-the-art even in the i.i.d. setting in a much weaker observation model. Our analysis holds more generally for a broader class of reversible, single-site Markov chains that also includes the popular Metropolis chain by leveraging more robust properties of reversible Markov chains.
摘要：我们研究了Ising模型的结构和参数的问题，即观察相关Markov链的演变时，是高维数据的基本模型。最近的一项工作研究了学习众所周知的Glauber动力学演变时的自然问题[Bresler，Gamarnik，Shah，IEEE Trans。 inf。理论2018，Gaitonde，Mossel Stoc 2024]，它比经典的I.I.D.提供了更现实的生成模型。环境。但是，这项先前的工作至关重要地假设观察到所有站点更新尝试，\ emph {即使此尝试不改变配置}：这种强大的观察模型对于这些方法似乎至关重要。虽然在限制性上下文中也许可能是可能的，但这排除了适用于最现实的设置，在那里我们可以观察到\ emph {仅}随机进化本身，这是我们希望从我们可能希望学习的任何过程的最小和自然假设。但是，设计在这种更现实的环境中取得成功的算法仍然是一个空旷的问题[Bresler，Gamarnik，Shah，IEEE Trans。 inf。 2018年理论，盖森德，莫特拉，摩塞尔，stoc 2025]。在这项工作中，我们给出了第一个算法，该算法在这个更自然的观察模型中有效地学习ISING模型，该模型仅在配置变化时观察。对于具有最高程度$ d $的ISIN模型，我们的算法在时间$ \ m varysf {poly}（d）（d）（d）\ cdot n^2 \ log n $，然后在其他$ \ widetilde {o}（O}（o}（2^d n）$时与I.I.I.I.I.I.I匹配的实际参数。设置以弱得多的观察模型。我们的分析更普遍地用于一类更广泛的可逆，单位马尔可夫链，其中还包括流行的大都会链，它利用可逆的马尔可夫链的更强大的特性。
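
The difference between the two observation models in the abstract above can be sketched with a small simulation. This is only an illustration under stated assumptions: it uses discrete-time Glauber dynamics with no external field, and the coupling matrix `J`, the step count, and the function names are hypothetical. It is not the paper's learning algorithm; it only shows which data each observation model would hand to a learner.

```python
import math
import random

def glauber_step(spins, J, rng):
    """Resample one uniformly chosen site from its conditional distribution."""
    i = rng.randrange(len(spins))
    field = sum(J[i][j] * spins[j] for j in range(len(spins)) if j != i)
    p_plus = 1.0 / (1.0 + math.exp(-2.0 * field))  # P(spin_i = +1 | other spins)
    new_spin = 1 if rng.random() < p_plus else -1
    changed = new_spin != spins[i]
    spins[i] = new_spin
    return i, changed

def simulate(J, steps, seed=0):
    """Return the log of all update attempts and the log of actual flips only."""
    rng = random.Random(seed)
    spins = [rng.choice([-1, 1]) for _ in range(len(J))]
    all_attempts, flips_only = [], []
    for t in range(steps):
        site, changed = glauber_step(spins, J, rng)
        all_attempts.append((t, site))                  # stronger observation model
        if changed:
            flips_only.append((t, site, spins[site]))   # only configuration changes
    return all_attempts, flips_only

# Hypothetical 3-site chain with couplings between neighbouring sites.
J = [[0.0, 0.8, 0.0],
     [0.8, 0.0, -0.5],
     [0.0, -0.5, 0.0]]
attempts, flips = simulate(J, steps=200)
print(len(attempts), len(flips))
```

In the weaker model a learner sees only `flips`, so update attempts that left the configuration unchanged are invisible.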

Title: MeshMamba: State Space Models for Articulated 3D Mesh Generation and Reconstruction

Authors: Yusuke Yoshiyasu, Leyuan Sun, Ryusuke Sagawa
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.15212
Pdf URL: https://arxiv.org/pdf/2507.15212
Copy Paste: [[2507.15212]] MeshMamba: State Space Models for Articulated 3D Mesh Generation and Reconstruction(https://arxiv.org/abs/2507.15212)
Keywords: generation
Abstract: In this paper, we introduce MeshMamba, a neural network model for learning 3D articulated mesh models by employing the recently proposed Mamba State Space Models (Mamba-SSMs). MeshMamba is efficient and scalable in handling a large number of input tokens, enabling the generation and reconstruction of body mesh models with more than 10,000 vertices, capturing clothing and hand geometries. The key to effectively learning MeshMamba is the serialization technique of mesh vertices into orderings that are easily processed by Mamba. This is achieved by sorting the vertices based on body part annotations or the 3D vertex locations of a template mesh, such that the ordering respects the structure of articulated shapes. Based on MeshMamba, we design 1) MambaDiff3D, a denoising diffusion model for generating 3D articulated meshes and 2) Mamba-HMR, a 3D human mesh recovery model that reconstructs a human body shape and pose from a single image. Experimental results showed that MambaDiff3D can generate dense 3D human meshes in clothes, with grasping hands, etc., and outperforms previous approaches in the 3D human shape generation task. Additionally, Mamba-HMR extends the capabilities of previous non-parametric human mesh recovery approaches, which were limited to handling body-only poses using around 500 vertex tokens, to the whole-body setting with face and hands, while achieving competitive performance in (near) real-time.
摘要：在本文中，我们通过采用最近提出的Mamba State Space Models（Mamba-SSM）介绍了一种用于学习3D表达网格模型的神经网络模型Meshmamba。 Meshmamba在处理大量输入令牌方面具有高效且可扩展，从而使人体网格模型的生成和重建具有10,000多个顶点，可捕获服装和手工几何形状。有效学习Meshmamba的关键是网格顶点的序列化技术，可容易由Mamba处理。这是通过基于身体部分注释或模板网格的3D顶点位置对顶点进行排序来实现的，从而使订购尊重铰接形状的结构。基于Meshmamba，我们设计1）Mambadiff3d，这是一种用于生成3D铰接式网格和2）Mamba-HMR的脱氧扩散模型，这是一个3D人类网格恢复模型，可重建人体形状并从单个图像中伪装。实验结果表明，Mambadiff3D可以用握手等产生密集的3D人类网眼，并且在3D人类形状生成任务中的先前方法都优于先前的方法。此外，Mamba-HMR将以前非参数人网状恢复方法的能力扩展到使用大约500个顶点代币的唯一姿势，并以面部和手的整体设置为全身设置，同时实现（接近）实时的竞争性能。
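
The vertex serialization idea in the abstract above can be sketched as a sort over template vertices. This is a minimal illustration, not the authors' implementation: the function name, the `axis_order` parameter, and the choice to break ties within a body part by coordinates are assumptions made for the example.

```python
def serialize_vertices(template_vertices, part_labels=None, axis_order=(1, 0, 2)):
    """Return vertex indices in a fixed order suitable for a sequence model.

    Vertices are sorted by body-part label when labels are given, then by
    template coordinates in `axis_order` (default: y, then x, then z).
    """
    def key(i):
        coords = tuple(template_vertices[i][a] for a in axis_order)
        return coords if part_labels is None else (part_labels[i],) + coords
    return sorted(range(len(template_vertices)), key=key)

# Hypothetical 4-vertex template with (x, y, z) coordinates and part labels.
verts = [(0.0, 1.0, 0.0), (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.5, 2.0, 0.0)]
parts = [0, 1, 1, 0]

order_by_location = serialize_vertices(verts)
order_by_part = serialize_vertices(verts, parts)
print(order_by_location)  # [1, 2, 0, 3]
print(order_by_part)      # [0, 3, 1, 2]
```

Because the ordering is computed once on the template, every mesh sharing that topology can be flattened into tokens in the same order.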

Title: Improving Joint Embedding Predictive Architecture with Diffusion Noise

Authors: Yuping Qiu, Rui Zhu, Ying-cong Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.15216
Pdf URL: https://arxiv.org/pdf/2507.15216
Copy Paste: [[2507.15216]] Improving Joint Embedding Predictive Architecture with Diffusion Noise(https://arxiv.org/abs/2507.15216)
Keywords: generation, generative
Abstract: Self-supervised learning (SSL) has become an incredibly successful method for feature learning, widely applied to many downstream tasks. It has proven especially effective for discriminative tasks, surpassing the trending generative models. However, generative models perform better in image generation and detail enhancement. Thus, it is natural for us to find a connection between SSL and generative models to further enhance the representation capacity of SSL. As generative models can create new samples by approximating the data distribution, such modeling should also lead to a semantic understanding of the raw visual data, which is necessary for recognition tasks. This motivates us to combine the core principle of the diffusion model, diffusion noise, with SSL to learn a competitive recognition model. Specifically, diffusion noise can be viewed as a particular state of mask that reveals a close relationship between masked image modeling (MIM) and diffusion models. In this paper, we propose N-JEPA (Noise-based JEPA) to incorporate diffusion noise into MIM through the position embedding of masked tokens. The multi-level noise schedule is a series of feature augmentations to further enhance the robustness of our model. We perform a comprehensive study to confirm its effectiveness in the classification of downstream tasks. Code will be released publicly soon.
摘要：自我监督的学习已成为特征学习的一种非常成功的方法，并广泛应用于许多下游任务。事实证明，它对于判别任务特别有效，超过了趋势生成模型。但是，生成模型在图像生成和细节增强方面的表现更好。因此，对于我们来说，找到SSL与生成模型之间的联系是很自然的，以进一步增强SSL的表示能力。由于生成模型可以通过近似数据分布创建新样本，因此这种建模还应导致对原始视觉数据的语义理解，这对于识别任务是必需的。这启发了我们将扩散模型的核心原理结合在一起：扩散噪声与SSL学习竞争识别模型。具体而言，扩散噪声可以视为特定的掩模状态，该状态揭示了掩盖图像建模（MIM）和扩散模型之间的紧密关系。在本文中，我们提出N-JEPA（基于噪声的JEPA）将扩散噪声通过掩盖令牌的位置融合到MIM中。多级噪声时间表是一系列功能增强，以进一步增强我们的模型的鲁棒性。我们进行一项全面的研究，以确认其在下游任务的分类中的有效性。代码将很快公开发布。

Title: Hierarchical Part-based Generative Model for Realistic 3D Blood Vessel

Authors: Siqi Chen, Guoqing Zhang, Jiahao Lai, Bingzhi Shen, Sihong Zhang, Caixia Dong, Xuejin Chen, Yang Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.15223
Pdf URL: https://arxiv.org/pdf/2507.15223
Copy Paste: [[2507.15223]] Hierarchical Part-based Generative Model for Realistic 3D Blood Vessel(https://arxiv.org/abs/2507.15223)
Keywords: generation, generative
Abstract: Advancements in 3D vision have increased the impact of blood vessel modeling on medical applications. However, accurately representing the complex geometry and topology of blood vessels remains a challenge due to their intricate branching patterns, curvatures, and irregular shapes. In this study, we propose a hierarchical part-based framework for 3D vessel generation that separates the global binary tree-like topology from local geometric details. Our approach proceeds in three stages: (1) key graph generation to model the overall hierarchical structure, (2) vessel segment generation conditioned on geometric properties, and (3) hierarchical vessel assembly by integrating the local segments according to the global key graph. We validate our framework on real-world datasets, demonstrating superior performance over existing methods in modeling complex vascular networks. This work marks the first successful application of a part-based generative approach for 3D vessel modeling, setting a new benchmark for vascular data generation. The code is available at: this https URL.
摘要：3D视力的进步增加了血管建模对医疗应用的影响。然而，由于其复杂的分支模式，曲率和不规则形状，准确地代表血管的复杂几何形状和拓扑拓扑仍然是一个挑战。在这项研究中，我们提出了一个基于层次的零件框架，用于3D船只生成，将类似全球二进制树状拓扑结构与局部几何细节分开。我们的方法分为三个阶段：（1）关键图生成以建模整体层次结构，（2）以几何特性为条件的血管段生成，以及（3）分层容器组装，通过根据全球关键图整合局部片段。我们在现实世界数据集上验证了我们的框架，证明了在建模复杂血管网络中的现有方法优于现有方法。这项工作标志着基于零件的生成方法在3D容器建模中的首次成功应用，为血管数据生成树立了新的基准。该代码可用：此HTTPS URL。

Title: Cross-Domain Few-Shot Learning with Coalescent Projections and Latent Space Reservation

Authors: Naeem Paeedeh, Mahardhika Pratama, Wolfgang Mayer, Jimmy Cao, Ryszard Kowlczyk
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2507.15243
Pdf URL: https://arxiv.org/pdf/2507.15243
Copy Paste: [[2507.15243]] Cross-Domain Few-Shot Learning with Coalescent Projections and Latent Space Reservation(https://arxiv.org/abs/2507.15243)
Keywords: generation
Abstract: Despite the progress in Cross-Domain Few-Shot Learning (CD-FSL), a model pre-trained with DINO combined with a prototypical classifier outperforms the latest SOTA methods. A crucial limitation that needs to be overcome is that updating too many parameters of the transformers leads to overfitting due to the scarcity of labeled samples. To address this challenge, we propose a new concept, Coalescent Projection (CP), as an effective successor to soft prompts. Additionally, we propose a novel pseudo-class generation method combined with Self-Supervised Transformations (SSTs) that relies solely on the base domain to prepare the network for encountering unseen samples from different domains. The proposed method exhibits its effectiveness in comprehensive experiments on the extreme domain shift scenario of the BSCD-FSL benchmark. Our code is published at this https URL.
摘要：尽管跨域几乎没有学习（CD-FSL）取得了进展，但该模型由Dino预先训练，并结合了原型分类器的表现优于最新的SOTA方法。需要克服的关键限制是，由于标记样品的稀缺性，更新变压器的太多参数会导致过度拟合。为了应对这一挑战，我们提出了一个新概念，即合并预测（CP），作为软提示的有效继任者。此外，我们提出了一种新型的伪级生成方法，并结合了自我监督的转换（SST），该方法仅依赖于基本域来准备网络，以准备遇到来自不同领域的看不见的样本。所提出的方法在BSCD-FSL基准的极端域移动方案中综合实验中表现出其有效性。我们的代码在此HTTPS URL上发布。

Title: FreeCus: Free Lunch Subject-driven Customization in Diffusion Transformers

Authors: Yanbing Zhang, Zhe Wang, Qin Zhou, Mengping Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.15249
Pdf URL: https://arxiv.org/pdf/2507.15249
Copy Paste: [[2507.15249]] FreeCus: Free Lunch Subject-driven Customization in Diffusion Transformers(https://arxiv.org/abs/2507.15249)
Keywords: generation
Abstract: In light of recent breakthroughs in text-to-image (T2I) generation, particularly with diffusion transformers (DiT), subject-driven technologies are increasingly being employed for high-fidelity customized production that preserves subject identity from reference inputs, enabling thrilling design workflows and engaging entertainment. Existing alternatives typically require either per-subject optimization via trainable text embeddings or training specialized encoders for subject feature extraction on large-scale datasets. Such dependencies on training procedures fundamentally constrain their practical applications. More importantly, current methodologies fail to fully leverage the inherent zero-shot potential of modern diffusion transformers (e.g., the Flux series) for authentic subject-driven synthesis. To bridge this gap, we propose FreeCus, a genuinely training-free framework that activates DiT's capabilities through three key innovations: 1) We introduce a pivotal attention sharing mechanism that captures the subject's layout integrity while preserving crucial editing flexibility. 2) Through a straightforward analysis of DiT's dynamic shifting, we propose an upgraded variant that significantly improves fine-grained feature extraction. 3) We further integrate advanced Multimodal Large Language Models (MLLMs) to enrich cross-modal semantic representations. Extensive experiments reflect that our method successfully unlocks DiT's zero-shot ability for consistent subject synthesis across diverse contexts, achieving state-of-the-art or comparable results compared to approaches that require additional training. Notably, our framework demonstrates seamless compatibility with existing inpainting pipelines and control modules, facilitating more compelling experiences. Our code is available at: this https URL.
摘要：鉴于最近在文本到图像（T2i）的生成中的突破，尤其是在扩散变压器（DIT）的情况下，受试者驱动的技术正在越来越多地用于高保真定制的生产，从而从参考输入中保留了主题身份，从而实现了激动人心的设计工作Flows Design Flows和吸引人的娱乐。现有的替代方案通常需要通过可训练的文本嵌入方式进行每个受试者的优化，或者需要在大规模数据集中进行主题特征提取的培训专业编码器。这种对培训程序的依赖性从根本上限制了其实际应用。更重要的是，当前的方法论无法充分利用现代扩散变压器（例如，磁通序列）的固有零射击潜力来实现真实主题驱动的合成。为了弥合这一差距，我们提出了Freecus，这是一个真正无训练的框架，通过三个关键创新激活DIT的功能：1）我们引入了一种关注的关注共享机制，可捕获主题的布局完整性，同时保留重要的编辑灵活性。 2）通过直接分析DIT的动态变化，我们提出了一种升级的变体，可显着改善细粒的特征提取。 3）我们进一步整合了先进的多模式大语言模型（MLLM），以丰富跨模式语义表示。广泛的实验反映出，与需要额外培训的方法相比，我们的方法成功地解锁了DIT的零弹性能力，以跨不同环境构成一致的主题合成，从而实现最新或可比的结果。值得注意的是，我们的框架表明了与现有的介入管道和控制模块的无缝兼容性，从而促进了更具吸引力的体验。我们的代码可用：此HTTPS URL。

Title: CHORDS: Diffusion Sampling Accelerator with Multi-core Hierarchical ODE Solvers

Authors: Jiaqi Han, Haotian Ye, Puheng Li, Minkai Xu, James Zou, Stefano Ermon
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.15260
Pdf URL: https://arxiv.org/pdf/2507.15260
Copy Paste: [[2507.15260]] CHORDS: Diffusion Sampling Accelerator with Multi-core Hierarchical ODE Solvers(https://arxiv.org/abs/2507.15260)
Keywords: generation, generative
Abstract: Diffusion-based generative models have become dominant generators of high-fidelity images and videos but remain limited by their computationally expensive inference procedures. Existing acceleration techniques either require extensive model retraining or compromise significantly on sample quality. This paper explores a general, training-free, and model-agnostic acceleration strategy via multi-core parallelism. Our framework views multi-core diffusion sampling as an ODE solver pipeline, where slower yet accurate solvers progressively rectify faster solvers through a theoretically justified inter-core communication mechanism. This motivates our multi-core training-free diffusion sampling accelerator, CHORDS, which is compatible with various diffusion samplers, model architectures, and modalities. Through extensive experiments, CHORDS significantly accelerates sampling across diverse large-scale image and video diffusion models, yielding up to 2.1x speedup with four cores, improving by 50% over baselines, and 2.9x speedup with eight cores, all without quality degradation. This advancement enables CHORDS to establish a solid foundation for real-time, high-fidelity diffusion generation.

Title: Conditional Video Generation for High-Efficiency Video Compression

Authors: Fangqiu Yi, Jingyu Xu, Jiawei Shao, Chi Zhang, Xuelong Li
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2507.15269
Pdf URL: https://arxiv.org/pdf/2507.15269
Copy Paste: [[2507.15269]] Conditional Video Generation for High-Efficiency Video Compression(https://arxiv.org/abs/2507.15269)
Keywords: generation, generative
Abstract: Perceptual studies demonstrate that conditional diffusion models excel at reconstructing video content aligned with human visual perception. Building on this insight, we propose a video compression framework that leverages conditional diffusion models for perceptually optimized reconstruction. Specifically, we reframe video compression as a conditional generation task, where a generative model synthesizes video from sparse, yet informative signals. Our approach introduces three key modules: (1) Multi-granular conditioning that captures both static scene structure and dynamic spatio-temporal cues; (2) Compact representations designed for efficient transmission without sacrificing semantic richness; (3) Multi-condition training with modality dropout and role-aware embeddings, which prevent over-reliance on any single modality and enhance robustness. Extensive experiments show that our method significantly outperforms both traditional and neural codecs on perceptual quality metrics such as Fréchet Video Distance (FVD) and LPIPS, especially under high compression ratios.

Title: RoadFusion: Latent Diffusion Model for Pavement Defect Detection

Authors: Muhammad Aqeel, Kidus Dagnaw Bellete, Francesco Setti
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.15346
Pdf URL: https://arxiv.org/pdf/2507.15346
Copy Paste: [[2507.15346]] RoadFusion: Latent Diffusion Model for Pavement Defect Detection(https://arxiv.org/abs/2507.15346)
Keywords: generation
Abstract: Pavement defect detection faces critical challenges including limited annotated data, domain shift between training and deployment environments, and high variability in defect appearances across different road conditions. We propose RoadFusion, a framework that addresses these limitations through synthetic anomaly generation with dual-path feature adaptation. A latent diffusion model synthesizes diverse, realistic defects using text prompts and spatial masks, enabling effective training under data scarcity. Two separate feature adaptors specialize representations for normal and anomalous inputs, improving robustness to domain shift and defect variability. A lightweight discriminator learns to distinguish fine-grained defect patterns at the patch level. Evaluated on six benchmark datasets, RoadFusion achieves consistently strong performance across both classification and localization tasks, setting a new state of the art on multiple metrics relevant to real-world road inspection.

Title: SAIGFormer: A Spatially-Adaptive Illumination-Guided Network for Low-Light Image Enhancement

Authors: Hanting Li, Fei Zhou, Xin Sun, Yang Hua, Jungong Han, Liang-Jie Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.15520
Pdf URL: https://arxiv.org/pdf/2507.15520
Copy Paste: [[2507.15520]] SAIGFormer: A Spatially-Adaptive Illumination-Guided Network for Low-Light Image Enhancement(https://arxiv.org/abs/2507.15520)
Keywords: restoration
Abstract: Recent Transformer-based low-light enhancement methods have made promising progress in recovering global illumination. However, they still struggle with non-uniform lighting scenarios, such as backlit and shadowed scenes, which lead to over-exposure or inadequate brightness restoration. To address this challenge, we present a Spatially-Adaptive Illumination-Guided Transformer (SAIGFormer) framework that enables accurate illumination restoration. Specifically, we propose a dynamic integral image representation to model the spatially-varying illumination, and further construct a novel Spatially-Adaptive Integral Illumination Estimator ($\text{SAI}^2\text{E}$). Moreover, we introduce an Illumination-Guided Multi-head Self-Attention (IG-MSA) mechanism, which leverages the illumination to calibrate the lightness-relevant features toward visually pleasing illumination enhancement. Extensive experiments on five standard low-light datasets and a cross-domain benchmark (LOL-Blur) demonstrate that our SAIGFormer significantly outperforms state-of-the-art methods in both quantitative and qualitative metrics. In particular, our method achieves superior performance in non-uniform illumination enhancement while exhibiting strong generalization capabilities across multiple datasets. Code is available at this https URL.

Title: Red-Team Multi-Agent Reinforcement Learning for Emergency Braking Scenario

Authors: Yinsong Chen, Kaifeng Wang, Xiaoqiang Meng, Xueyuan Li, Zirui Li, Xin Gao
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2507.15587
Pdf URL: https://arxiv.org/pdf/2507.15587
Copy Paste: [[2507.15587]] Red-Team Multi-Agent Reinforcement Learning for Emergency Braking Scenario(https://arxiv.org/abs/2507.15587)
Keywords: generation
Abstract: Current research on decision-making in safety-critical scenarios often relies on inefficient data-driven scenario generation or specific modeling approaches, which fail to capture corner cases in real-world contexts. To address this issue, we propose a Red-Team Multi-Agent Reinforcement Learning framework, where background vehicles with interference capabilities are treated as red-team agents. Through active interference and exploration, red-team vehicles can uncover corner cases outside the data distribution. The framework uses a Constraint Graph Representation Markov Decision Process, ensuring that red-team vehicles comply with safety rules while continuously disrupting the autonomous vehicles (AVs). A policy threat zone model is constructed to quantify the threat posed by red-team vehicles to AVs, inducing more extreme actions to increase the danger level of the scenario. Experimental results show that the proposed framework significantly impacts AVs' decision-making safety and generates various corner cases. This method also offers a novel direction for research in safety-critical scenarios.

Title: SegDT: A Diffusion Transformer-Based Segmentation Model for Medical Imaging

Authors: Salah Eddine Bekhouche, Gaby Maroun, Fadi Dornaika, Abdenour Hadid
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.15595
Pdf URL: https://arxiv.org/pdf/2507.15595
Copy Paste: [[2507.15595]] SegDT: A Diffusion Transformer-Based Segmentation Model for Medical Imaging(https://arxiv.org/abs/2507.15595)
Keywords: generation
Abstract: Medical image segmentation is crucial for many healthcare tasks, including disease diagnosis and treatment planning. One key area is the segmentation of skin lesions, which is vital for diagnosing skin cancer and monitoring patients. In this context, this paper introduces SegDT, a new segmentation model based on diffusion transformer (DiT). SegDT is designed to work on low-cost hardware and incorporates Rectified Flow, which improves generation quality with fewer inference steps while maintaining the flexibility of standard diffusion models. Our method is evaluated on three benchmarking datasets and compared against several existing works, achieving state-of-the-art results while maintaining fast inference speeds. This makes the proposed model appealing for real-world medical applications. This work advances the performance and capabilities of deep learning models in medical image analysis, enabling faster, more accurate diagnostic tools for healthcare professionals. The code is publicly available on GitHub at this https URL.

Title: Being-H0: Vision-Language-Action Pretraining from Large-Scale Human Videos

Authors: Hao Luo, Yicheng Feng, Wanpeng Zhang, Sipeng Zheng, Ye Wang, Haoqi Yuan, Jiazheng Liu, Chaoyi Xu, Qin Jin, Zongqing Lu
Subjects: cs.CV, cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2507.15597
Pdf URL: https://arxiv.org/pdf/2507.15597
Copy Paste: [[2507.15597]] Being-H0: Vision-Language-Action Pretraining from Large-Scale Human Videos(https://arxiv.org/abs/2507.15597)
Keywords: generation
Abstract: We introduce Being-H0, a dexterous Vision-Language-Action model (VLA) trained on large-scale human videos. Existing VLAs struggle with complex manipulation tasks requiring high dexterity and generalize poorly to novel scenarios and tasks, primarily due to their reliance on synthetic data with significant sim-to-real gaps or teleoperated demonstrations lacking scale and diversity. To address this data bottleneck, we propose leveraging human hands as a foundation manipulator, capitalizing on the rich dexterity and scalability present in web data. Our approach centers on physical instruction tuning, a novel training paradigm that combines large-scale VLA pretraining from human videos, physical space alignment for 3D reasoning, and post-training adaptation for robotic tasks. Additionally, we introduce a part-level motion tokenization method which achieves millimeter-level reconstruction accuracy to model precise hand trajectories for action learning. To support our proposed paradigm, we further develop a comprehensive data curation pipeline that integrates heterogeneous sources -- including motion capture, VR, and RGB-only videos -- into a large-scale dataset with millions of motion-based instructional instances. We empirically show the excellence of Being-H0 in hand motion generation and instruction following, and it also scales well with model and data sizes. Importantly, we observe the expected gains of Being-H0 in real-world robotic manipulation as physical instruction tuning is applied. More details are available at this https URL.

Title: Optimal Batch-Size Control for Low-Latency Federated Learning with Device Heterogeneity

Authors: Huiling Yang, Zhanwei Wang, Kaibin Huang
Subjects: cs.LG, eess.SY
Abstract URL: https://arxiv.org/abs/2507.15601
Pdf URL: https://arxiv.org/pdf/2507.15601
Copy Paste: [[2507.15601]] Optimal Batch-Size Control for Low-Latency Federated Learning with Device Heterogeneity(https://arxiv.org/abs/2507.15601)
Keywords: generation
Abstract: Federated learning (FL) has emerged as a popular approach for collaborative machine learning in sixth-generation (6G) networks, primarily due to its privacy-preserving capabilities. The deployment of FL algorithms is expected to empower a wide range of Internet-of-Things (IoT) applications, e.g., autonomous driving, augmented reality, and healthcare. The mission-critical and time-sensitive nature of these applications necessitates the design of low-latency FL frameworks that guarantee high learning performance. In practice, achieving low-latency FL faces two challenges: the overhead of computing and transmitting high-dimensional model updates, and the heterogeneity in communication-and-computation (C$^2$) capabilities across devices. To address these challenges, we propose a novel C$^2$-aware framework for optimal batch-size control that minimizes end-to-end (E2E) learning latency while ensuring convergence. The framework is designed to balance a fundamental C$^2$ tradeoff as revealed through convergence analysis. Specifically, increasing batch sizes improves the accuracy of gradient estimation in FL and thus reduces the number of communication rounds required for convergence, but results in higher per-round latency, and vice versa. The associated problem of latency minimization is intractable; however, we solve it by designing an accurate and tractable surrogate for convergence speed, with parameters fitted to real data. This approach yields two batch-size control strategies tailored to scenarios with slow and fast fading, while also accommodating device heterogeneity. Extensive experiments using real datasets demonstrate that the proposed strategies outperform conventional batch-size adaptation schemes that do not consider the C$^2$ tradeoff or device heterogeneity.
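The C$^2$ tradeoff described in the abstract can be illustrated with a toy latency model. This is only a sketch and not the paper's surrogate: the functional forms of `rounds_needed` and `per_round_latency`, and every constant in them, are hypothetical and chosen solely to show why an intermediate batch size can minimize end-to-end latency.

```python
import math

def rounds_needed(batch_size, a=2000.0, c=50.0):
    # Hypothetical surrogate: larger batches give less noisy gradients,
    # so fewer communication rounds are needed, down to a floor of c rounds.
    return math.ceil(a / batch_size + c)

def per_round_latency(batch_size, t_comm=0.5, t_sample=0.01):
    # Hypothetical per-round cost: a fixed communication time plus
    # computation time that grows linearly with the batch size.
    return t_comm + t_sample * batch_size

def total_latency(batch_size):
    # End-to-end latency = number of rounds x latency of each round.
    return rounds_needed(batch_size) * per_round_latency(batch_size)

candidates = [8, 16, 32, 64, 128, 256, 512, 1024]
best = min(candidates, key=total_latency)

for b in candidates:
    print(b, rounds_needed(b), round(total_latency(b), 2))
print("best batch size among candidates:", best)
```

Under these assumed constants, very small batches pay for many rounds and very large batches pay for slow rounds, so the minimum sits in between. With heterogeneous devices, `t_comm` and `t_sample` would differ per device, which is the setting the abstract targets.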

Title: CylinderPlane: Nested Cylinder Representation for 3D-aware Image Generation

Authors: Ru Jia, Xiaozhuang Ma, Jianji Wang, Nanning Zheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.15606
Pdf URL: https://arxiv.org/pdf/2507.15606
Copy Paste: [[2507.15606]] CylinderPlane: Nested Cylinder Representation for 3D-aware Image Generation(https://arxiv.org/abs/2507.15606)
Keywords: generation, generative
Abstract: While the Tri-plane representation has advanced the development of 3D-aware image generative models, problems rooted in its inherent structure, such as multi-face artifacts caused by sharing the same features in symmetric regions, limit its ability to generate 360$^\circ$ view images. In this paper, we propose CylinderPlane, a novel implicit representation based on the Cylindrical Coordinate System, to eliminate the feature ambiguity issue and ensure multi-view consistency in 360$^\circ$. Unlike the inevitable feature entanglement in the Cartesian coordinate-based Tri-plane representation, the cylindrical coordinate system explicitly separates features at different angles, allowing our cylindrical representation to achieve high-quality, artifact-free 360$^\circ$ image synthesis. We further introduce a nested cylinder representation that composites multiple cylinders at different scales, making the model more adaptable to complex geometry and varying resolutions. The combination of cylinders with different resolutions can effectively capture more critical locations and multi-scale features, greatly facilitating fine detail learning and robustness to different resolutions. Moreover, our representation is agnostic to implicit rendering methods and can be easily integrated into any neural rendering pipeline. Extensive experiments on both a synthetic dataset and unstructured in-the-wild images demonstrate that our proposed representation achieves superior performance over previous methods.
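The claim that cylindrical coordinates separate features at different angles can be checked with the standard Cartesian-to-cylindrical conversion. The sketch below assumes the cylinder axis is the z-axis and uses two made-up points mirrored across the YZ plane; it illustrates the coordinate geometry only, not the paper's feature-sampling code.

```python
import math

def cartesian_to_cylindrical(x, y, z):
    """Map a 3D point to (radius, azimuth in degrees, height)."""
    radius = math.hypot(x, y)
    azimuth = math.degrees(math.atan2(y, x)) % 360.0
    return radius, azimuth, z

# Two hypothetical points mirrored across the YZ plane (front vs. back).
front_point = (1.0, 0.0, 0.5)
back_point = (-1.0, 0.0, 0.5)

# Projection onto the YZ plane drops x, so both points land on (y, z) = (0.0, 0.5).
print(front_point[1:], back_point[1:])

# In cylindrical coordinates the azimuth differs, so the points stay distinct.
front = cartesian_to_cylindrical(*front_point)
back = cartesian_to_cylindrical(*back_point)
print(front, back)
```

The two points share radius and height but sit half a turn apart in azimuth. This is one way to read the abstract's statement that a cylindrical parameterization avoids sharing features between symmetric regions.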

Title: Accelerating HEC-RAS: A Recurrent Neural Operator for Rapid River Forecasting

Authors: Edward Holmberg, Pujan Pokhrel, Maximilian Zoch, Elias Ioup, Ken Pathak, Steven Sloan, Kendall Niles, Jay Ratcliff, Maik Flanagin, Christian Guetl, Julian Simeonov, Mahdi Abdelguerfi
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2507.15614
Pdf URL: https://arxiv.org/pdf/2507.15614
Copy Paste: [[2507.15614]] Accelerating HEC-RAS: A Recurrent Neural Operator for Rapid River Forecasting(https://arxiv.org/abs/2507.15614)
Keywords: generation
Abstract: Physics-based solvers like HEC-RAS provide high-fidelity river forecasts but are too computationally intensive for on-the-fly decision-making during flood events. The central challenge is to accelerate these simulations without sacrificing accuracy. This paper introduces a deep learning surrogate that treats HEC-RAS not as a solver but as a data-generation engine. We propose a hybrid, auto-regressive architecture that combines a Gated Recurrent Unit (GRU) to capture short-term temporal dynamics with a Geometry-Aware Fourier Neural Operator (Geo-FNO) to model long-range spatial dependencies along a river reach. The model learns underlying physics implicitly from a minimal eight-channel feature vector encoding dynamic state, static geometry, and boundary forcings extracted directly from native HEC-RAS files. Trained on 67 reaches of the Mississippi River Basin, the surrogate was evaluated on a year-long, unseen hold-out simulation. Results show the model achieves a strong predictive accuracy, with a median absolute stage error of 0.31 feet. Critically, for a full 67-reach ensemble forecast, our surrogate reduces the required wall-clock time from 139 minutes to 40 minutes, a speedup of nearly 3.5 times over the traditional solver. The success of this data-driven approach demonstrates that robust feature engineering can produce a viable, high-speed replacement for conventional hydraulic models, improving the computational feasibility of large-scale ensemble flood forecasting.
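The reported speedup follows directly from the two wall-clock times quoted in the abstract. A short check of the arithmetic, using only those two figures:

```python
baseline_minutes = 139   # HEC-RAS wall-clock time for the 67-reach ensemble
surrogate_minutes = 40   # surrogate wall-clock time for the same forecast

speedup = baseline_minutes / surrogate_minutes
time_saved_fraction = 1 - surrogate_minutes / baseline_minutes

print(f"speedup: {speedup:.3f}x")                 # speedup: 3.475x
print(f"time saved: {time_saved_fraction:.1%}")   # time saved: 71.2%
```

So "nearly 3.5 times" corresponds to a factor of 3.475, or roughly 71% less wall-clock time per ensemble forecast.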

Title: Data Mixing Agent: Learning to Re-weight Domains for Continual Pre-training

Authors: Kailai Yang, Xiao Liu, Lei Ji, Hao Li, Yeyun Gong, Peng Cheng, Mao Yang
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2507.15640
Pdf URL: https://arxiv.org/pdf/2507.15640
Copy Paste: [[2507.15640]] Data Mixing Agent: Learning to Re-weight Domains for Continual Pre-training(https://arxiv.org/abs/2507.15640)
Keywords: generation
Abstract: Continual pre-training on small-scale task-specific data is an effective method for improving large language models in new target fields, yet it risks catastrophic forgetting of their original capabilities. A common solution is to re-weight training data mixtures from source and target fields on a domain space to achieve balanced performance. Previous domain reweighting strategies rely on manual designation with certain heuristics based on human intuition or empirical results. In this work, we prove that more general heuristics can be parameterized by proposing Data Mixing Agent, the first model-based, end-to-end framework that learns to re-weight domains. The agent learns generalizable heuristics through reinforcement learning on large quantities of data mixing trajectories with corresponding feedback from an evaluation environment. Experiments in continual pre-training on math reasoning show that Data Mixing Agent outperforms strong baselines in achieving balanced performance across source and target field benchmarks. Furthermore, it generalizes well across unseen source fields, target models, and domain spaces without retraining. Direct application to the code generation field also indicates its adaptability across target domains. Further analysis shows that the agent's heuristics align well with human intuitions and that it achieves superior model performance with less source-field data.

Title: Visual-Language Model Knowledge Distillation Method for Image Quality Assessment

Authors: Yongkang Hou, Jiarun Song
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.15680
Pdf URL: https://arxiv.org/pdf/2507.15680
Copy Paste: [[2507.15680]] Visual-Language Model Knowledge Distillation Method for Image Quality Assessment(https://arxiv.org/abs/2507.15680)
Keywords: quality assessment
Abstract: Image Quality Assessment (IQA) is a core task in computer vision. Multimodal methods based on vision-language models, such as CLIP, have demonstrated exceptional generalization capabilities in IQA tasks. To address the issues of excessive parameter burden and insufficient ability to identify local distorted features in CLIP for IQA, this study proposes a visual-language model knowledge distillation method aimed at guiding the training of models with architectural advantages using CLIP's IQA knowledge. First, quality-graded prompt templates are designed to guide CLIP to output quality scores. Then, CLIP is fine-tuned to enhance its capabilities in IQA tasks. Finally, a modality-adaptive knowledge distillation strategy is proposed to transfer guidance from the CLIP teacher model to the student model. Our experiments were conducted on multiple IQA datasets, and the results show that the proposed method significantly reduces model complexity while outperforming existing IQA methods, demonstrating strong potential for practical deployment.
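One common way to turn quality-graded prompts into a scalar score is to take the expected grade under a softmax over image-prompt similarities. The sketch below illustrates that idea only. The abstract does not give the paper's templates or scoring rule, so the grade names, their numeric values, the prompt template mentioned in the docstring, and the similarity numbers are all assumptions.

```python
import math

# Hypothetical quality grades and the numeric score attached to each.
GRADES = {"bad": 1.0, "poor": 2.0, "fair": 3.0, "good": 4.0, "perfect": 5.0}

def softmax(values):
    # Subtract the maximum for numerical stability.
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def quality_score(similarities):
    """Expected grade under a softmax over image-prompt similarities.

    `similarities` maps each grade name to the (already scaled) similarity
    between an image embedding and a prompt such as
    "a photo of {grade} quality" (template assumed for illustration).
    """
    names = list(GRADES)
    probs = softmax([similarities[name] for name in names])
    return sum(p * GRADES[name] for p, name in zip(probs, names))

# Made-up similarities: this image matches the "good" prompt best.
example = {"bad": 0.5, "poor": 1.0, "fair": 2.0, "good": 4.0, "perfect": 2.5}
print(round(quality_score(example), 3))
```

A teacher that outputs such scores could then supervise a smaller student, which is the role the abstract assigns to the distillation step.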

Title: Efficient Face Image Quality Assessment via Self-training and Knowledge Distillation

Authors: Wei Sun, Weixia Zhang, Linhan Cao, Jun Jia, Xiangyang Zhu, Dandan Zhu, Xiongkuo Min, Guangtao Zhai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.15709
Pdf URL: https://arxiv.org/pdf/2507.15709
Copy Paste: [[2507.15709]] Efficient Face Image Quality Assessment via Self-training and Knowledge Distillation(https://arxiv.org/abs/2507.15709)
Keywords: quality assessment
Abstract: Face image quality assessment (FIQA) is essential for various face-related applications. Although FIQA has been extensively studied and has achieved significant progress, the computational complexity of FIQA algorithms remains a key concern for ensuring scalability and practical deployment in real-world systems. In this paper, we aim to develop a computationally efficient FIQA method that can be easily deployed in real-world applications. Specifically, our method consists of two stages: training a powerful teacher model and distilling a lightweight student model from it. To build a strong teacher model, we adopt a self-training strategy to improve its capacity. We first train the teacher model using labeled face images, then use it to generate pseudo-labels for a set of unlabeled images. These pseudo-labeled samples are used in two ways: (1) to distill knowledge into the student model, and (2) to combine with the original labeled images to further enhance the teacher model through self-training. The enhanced teacher model is then used to pseudo-label another set of unlabeled images for distilling the student model. The student model is trained using a combination of labeled images, pseudo-labeled images from the original teacher model, and pseudo-labeled images from the enhanced teacher model. Experimental results demonstrate that our student model achieves comparable performance to the teacher model with an extremely low computational overhead. Moreover, our method achieved first place in the ICCV 2025 VQualA FIQA Challenge. The code is available at this https URL.
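The data flow of the two-stage pipeline in the abstract can be traced with a toy example. In this sketch a one-dimensional least-squares line stands in for both the teacher and the student networks, and the data, the noise level, and the helper names `fit_line` and `predict` are all hypothetical. It shows only which data each model sees, not the accuracy gains reported in the paper.

```python
import random

def fit_line(xs, ys):
    """Least-squares fit y = w * x + b; stands in for training a model."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    w = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var_x
    return w, mean_y - w * mean_x

def predict(model, xs):
    w, b = model
    return [w * x + b for x in xs]

rng = random.Random(0)
# Hypothetical labeled set: quality = 0.8 * feature + 0.1, plus label noise.
labeled_x = [rng.random() for _ in range(20)]
labeled_y = [0.8 * x + 0.1 + rng.gauss(0, 0.05) for x in labeled_x]
unlabeled_1 = [rng.random() for _ in range(200)]
unlabeled_2 = [rng.random() for _ in range(200)]

# Train the teacher on labeled data, then pseudo-label the first unlabeled set.
teacher = fit_line(labeled_x, labeled_y)
pseudo_1 = predict(teacher, unlabeled_1)

# Self-training: retrain the teacher on labeled plus pseudo-labeled data,
# then pseudo-label a second unlabeled set with the enhanced teacher.
enhanced_teacher = fit_line(labeled_x + unlabeled_1, labeled_y + pseudo_1)
pseudo_2 = predict(enhanced_teacher, unlabeled_2)

# Distillation: the student sees labeled data and both pseudo-labeled sets.
student = fit_line(labeled_x + unlabeled_1 + unlabeled_2,
                   labeled_y + pseudo_1 + pseudo_2)
print(teacher, enhanced_teacher, student)
```

With a linear stand-in, self-training cannot add capacity, so the three fitted lines end up close to one another. In the paper's setting the student is a lighter network than the teacher, which is where the computational savings come from.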

Title: A Practical Investigation of Spatially-Controlled Image Generation with Transformers

Authors: Guoxuan Xia, Harleen Hanspal, Petru-Daniel Tudosiu, Shifeng Zhang, Sarah Parisot
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.15724
Pdf URL: https://arxiv.org/pdf/2507.15724
Copy Paste: [[2507.15724]] A Practical Investigation of Spatially-Controlled Image Generation with Transformers(https://arxiv.org/abs/2507.15724)
Keywords: generation
Abstract: Enabling image generation models to be spatially controlled is an important area of research, empowering users to better generate images according to their own fine-grained specifications via, e.g., edge maps or poses. Although this task has seen impressive improvements in recent times, a focus on rapidly producing stronger models has come at the cost of detailed and fair scientific comparison. Differing training data, model architectures and generation paradigms make it difficult to disentangle the factors contributing to performance. Meanwhile, the motivations and nuances of certain approaches become lost in the literature. In this work, we aim to provide clear takeaways across generation paradigms for practitioners wishing to develop transformer-based systems for spatially-controlled generation, clarifying the literature and addressing knowledge gaps. We perform controlled experiments on ImageNet across diffusion-based/flow-based and autoregressive (AR) models. First, we establish control token prefilling as a simple, general and performant baseline approach for transformers. We then investigate previously underexplored sampling-time enhancements, showing that extending classifier-free guidance to control, as well as softmax truncation, has a strong impact on control-generation consistency. Finally, we re-clarify the motivation of adapter-based approaches, demonstrating that they mitigate "forgetting" and maintain generation quality when trained on limited downstream data, but underperform full training in terms of generation-control consistency. Code will be released upon publication.
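Classifier-free guidance combines two model outputs by extrapolating from the unconditional prediction toward the conditional one. The sketch below shows that standard combination with the control signal as the condition. The abstract does not say how the paper builds the branch without control, so using the control-free prediction as the "unconditional" branch is an assumption here, and the numbers are made up.

```python
def classifier_free_guidance(uncond, cond, scale):
    """Standard CFG combination: uncond + scale * (cond - uncond)."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]

# Hypothetical per-step model outputs (noise predictions for a diffusion
# model, or next-token logits for an autoregressive model).
without_control = [0.10, -0.20, 0.30]
with_control = [0.30, -0.10, 0.10]

guided = classifier_free_guidance(without_control, with_control, scale=2.0)
print(guided)
```

A scale of 1.0 returns the controlled prediction unchanged, and larger scales push the output further in the direction the control signal suggests.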

Title: TokensGen: Harnessing Condensed Tokens for Long Video Generation

Authors: Wenqi Ouyang, Zeqi Xiao, Danni Yang, Yifan Zhou, Shuai Yang, Lei Yang, Jianlou Si, Xingang Pan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.15728
Pdf URL: https://arxiv.org/pdf/2507.15728
Copy Paste: [[2507.15728]] TokensGen: Harnessing Condensed Tokens for Long Video Generation(https://arxiv.org/abs/2507.15728)
Keywords: generation, generative
Abstract: Generating consistent long videos is a complex challenge: while diffusion-based generative models generate visually impressive short clips, extending them to longer durations often leads to memory bottlenecks and long-term inconsistency. In this paper, we propose TokensGen, a novel two-stage framework that leverages condensed tokens to address these issues. Our method decomposes long video generation into three core tasks: (1) inner-clip semantic control, (2) long-term consistency control, and (3) inter-clip smooth transition. First, we train To2V (Token-to-Video), a short video diffusion model guided by text and video tokens, with a Video Tokenizer that condenses short clips into semantically rich tokens. Second, we introduce T2To (Text-to-Token), a video token diffusion transformer that generates all tokens at once, ensuring global consistency across clips. Finally, during inference, an adaptive FIFO-Diffusion strategy seamlessly connects adjacent clips, reducing boundary artifacts and enhancing smooth transitions. Experimental results demonstrate that our approach significantly enhances long-term temporal and content coherence without incurring prohibitive computational overhead. By leveraging condensed tokens and pre-trained short video models, our method provides a scalable, modular solution for long video generation, opening new possibilities for storytelling, cinematic production, and immersive simulations. Please see our project page at this https URL.

Title: Label tree semantic losses for rich multi-class medical image segmentation

Authors: Junwen Wang, Oscar MacCormac, William Rochford, Aaron Kujawa, Jonathan Shapey, Tom Vercauteren
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.15777
Pdf URL: https://arxiv.org/pdf/2507.15777
Copy Paste: [[2507.15777]] Label tree semantic losses for rich multi-class medical image segmentation(https://arxiv.org/abs/2507.15777)
Keywords: generation
Abstract: Rich and accurate medical image segmentation is poised to underpin the next generation of AI-defined clinical practice by delineating critical anatomy for pre-operative planning, guiding real-time intra-operative navigation, and supporting precise post-operative assessment. However, commonly used learning methods for medical and surgical imaging segmentation tasks penalise all errors equivalently and thus fail to exploit any inter-class semantics in the label space. This becomes particularly problematic as the cardinality and richness of labels increase to include subtly different classes. In this work, we propose two tree-based semantic loss functions which take advantage of a hierarchical organisation of the labels. We further incorporate our losses in a recently proposed approach for training with sparse, background-free annotations to extend the applicability of our proposed losses. Extensive experiments are reported on two medical and surgical image segmentation tasks, namely head MRI for whole brain parcellation (WBP) with full supervision and neurosurgical hyperspectral imaging (HSI) for scene understanding with sparse annotations. Results demonstrate that our proposed method reaches state-of-the-art performance in both cases.
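Abstract (translated from Chinese): Rich and accurate medical image segmentation is poised to underpin the next generation of AI-defined clinical practice by delineating critical anatomy for pre-operative planning, guiding real-time intra-operative navigation, and supporting precise post-operative assessment. However, commonly used learning methods for medical and surgical image segmentation penalise all errors equally and therefore cannot exploit any inter-class semantics in the label space. This becomes particularly problematic as the cardinality and richness of the labels increase to include subtly different classes. In this work, we propose two tree-based semantic loss functions that exploit a hierarchical organisation of the labels. We further incorporate our losses into a recently proposed approach for training with sparse, background-free annotations, which extends the applicability of the proposed losses. Extensive experiments are reported on two medical and surgical image segmentation tasks: head MRI for whole brain parcellation (WBP) with full supervision, and neurosurgical hyperspectral imaging (HSI) for scene understanding with sparse annotations. Results show that the proposed method reaches state-of-the-art performance in both cases.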
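
To illustrate what a loss that exploits a label hierarchy can look like, here is a minimal sketch. It is a generic hierarchical cross-entropy, not necessarily either of the two losses proposed in the paper: leaf probabilities are summed up to each ancestor node, and the loss adds the negative log-probability at every level of the true label's path. The tree, the label names, and the function names are all hypothetical.

```python
import math

# Hypothetical label tree (child -> parent); the root has no parent.
PARENT = {
    "root": None,
    "tissue": "root",
    "fluid": "root",
    "grey_matter": "tissue",
    "white_matter": "tissue",
    "csf": "fluid",
}


def ancestors(node):
    """Return the path from `node` up to (but excluding) the root."""
    path = []
    while node is not None and PARENT[node] is not None:
        path.append(node)
        node = PARENT[node]
    return path


def node_probability(leaf_probs, node):
    """Probability of an internal node = sum of its descendant leaves."""
    return sum(p for leaf, p in leaf_probs.items() if node in ancestors(leaf))


def tree_cross_entropy(leaf_probs, true_leaf):
    """Sum of -log p over every node on the true label's path."""
    return -sum(
        math.log(node_probability(leaf_probs, node))
        for node in ancestors(true_leaf)
    )


# Both predictions give the true leaf the same probability, but the first
# puts the remaining mass on a sibling class, the second on a distant class.
near_miss = {"grey_matter": 0.1, "white_matter": 0.8, "csf": 0.1}
far_miss = {"grey_matter": 0.1, "white_matter": 0.1, "csf": 0.8}
loss_near = tree_cross_entropy(near_miss, "grey_matter")
loss_far = tree_cross_entropy(far_miss, "grey_matter")
```

A flat cross-entropy would score both predictions identically, since the true leaf has probability 0.1 in each. Under this sketch the near miss is penalised less because the parent node `tissue` still receives most of the probability mass.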

Title: Diffusion models for multivariate subsurface generation and efficient probabilistic inversion

Authors: Roberto Miele, Niklas Linde
Subjects: cs.CV, cs.LG, physics.geo-ph, stat.AP
Abstract URL: https://arxiv.org/abs/2507.15809
Pdf URL: https://arxiv.org/pdf/2507.15809
Copy Paste: [[2507.15809]] Diffusion models for multivariate subsurface generation and efficient probabilistic inversion(https://arxiv.org/abs/2507.15809)
Keywords: generation, generative
Abstract: Diffusion models offer stable training and state-of-the-art performance for deep generative modeling tasks. Here, we consider their use in the context of multivariate subsurface modeling and probabilistic inversion. We first demonstrate that diffusion models enhance multivariate modeling capabilities compared to variational autoencoders and generative adversarial networks. In diffusion modeling, the generative process involves a comparatively large number of time steps with update rules that can be modified to account for conditioning data. We propose different corrections to the popular Diffusion Posterior Sampling approach by Chung et al. (2023). In particular, we introduce a likelihood approximation accounting for the noise contamination that is inherent in diffusion modeling. We assess performance in a multivariate geological scenario involving facies and correlated acoustic impedance. Conditional modeling is demonstrated using both local hard data (well logs) and nonlinear geophysics (full-stack seismic data). Our tests show significantly improved statistical robustness, enhanced sampling of the posterior probability density function and reduced computational costs, compared to the original approach. The method can be used with both hard and indirect conditioning data, individually or simultaneously. As the inversion is included within the diffusion process, it is faster than other methods requiring an outer loop around the generative model, such as Markov chain Monte Carlo.
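Abstract (translated from Chinese): Diffusion models offer stable training and state-of-the-art performance for deep generative modeling tasks. Here, we consider their use in multivariate subsurface modeling and probabilistic inversion. We first show that diffusion models improve multivariate modeling capabilities compared with variational autoencoders and generative adversarial networks. In diffusion modeling, the generative process involves a comparatively large number of time steps, with update rules that can be modified to account for conditioning data. We propose several corrections to the popular Diffusion Posterior Sampling approach of Chung et al. (2023). In particular, we introduce a likelihood approximation that accounts for the noise contamination inherent in diffusion modeling. We assess performance in a multivariate geological scenario involving facies and correlated acoustic impedance. Conditional modeling is demonstrated using both local hard data (well logs) and nonlinear geophysics (full-stack seismic data). Compared with the original approach, our tests show significantly improved statistical robustness, better sampling of the posterior probability density function, and reduced computational cost. The method can be used with hard and indirect conditioning data, individually or simultaneously. Because the inversion is embedded in the diffusion process, it is faster than methods that require an outer loop around the generative model, such as Markov chain Monte Carlo.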

Title: Can Your Model Separate Yolks with a Water Bottle? Benchmarking Physical Commonsense Understanding in Video Generation Models

Authors: Enes Sanli, Baris Sarper Tezcan, Aykut Erdem, Erkut Erdem
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.15824
Pdf URL: https://arxiv.org/pdf/2507.15824
Copy Paste: [[2507.15824]] Can Your Model Separate Yolks with a Water Bottle? Benchmarking Physical Commonsense Understanding in Video Generation Models(https://arxiv.org/abs/2507.15824)
Keywords: generation, generative
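Abstract: Recent progress in text-to-video (T2V) generation has enabled the synthesis of visually compelling and temporally coherent videos from natural language. However, these models often fall short in basic physical commonsense, producing outputs that violate intuitive expectations around causality, object behavior, and tool use. Addressing this gap, we present PhysVidBench, a benchmark designed to evaluate the physical reasoning capabilities of T2V systems. The benchmark includes 383 carefully curated prompts, emphasizing tool use, material properties, procedural interactions, and domains where physical plausibility is crucial. For each prompt, we generate videos using diverse state-of-the-art models and adopt a three-stage evaluation pipeline: (1) formulate grounded physics questions from the prompt, (2) caption the generated video with a vision-language model, and (3) task a language model to answer several physics-involved questions using only the caption. This indirect strategy circumvents common hallucination issues in direct video-based evaluation. By highlighting affordances and tool-mediated actions, areas overlooked in current T2V evaluations, PhysVidBench provides a structured, interpretable framework for assessing physical commonsense in generative video models.
Abstract (translated from Chinese): Recent progress in text-to-video (T2V) generation has made it possible to synthesize visually compelling and temporally coherent videos from natural language. However, these models often fall short in basic physical commonsense, producing outputs that violate intuitive expectations about causality, object behavior, and tool use. To address this gap, we present PhysVidBench, a benchmark designed to evaluate the physical reasoning capabilities of T2V systems. The benchmark includes 383 carefully curated prompts that emphasize tool use, material properties, procedural interactions, and domains where physical plausibility is crucial. For each prompt, we generate videos with a range of state-of-the-art models and apply a three-stage evaluation pipeline: (1) formulate grounded physics questions from the prompt, (2) caption the generated video with a vision-language model, and (3) task a language model with answering several physics-related questions using only the caption. This indirect strategy avoids the hallucination issues common in direct video-based evaluation. By highlighting affordances and tool-mediated actions, areas overlooked in current T2V evaluations, PhysVidBench provides a structured, interpretable framework for assessing physical commonsense in generative video models.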

Title: FASTGEN: Fast and Cost-Effective Synthetic Tabular Data Generation with LLMs

Authors: Anh Nguyen, Sam Schafft, Nicholas Hale, John Alfaro
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2507.15839
Pdf URL: https://arxiv.org/pdf/2507.15839
Copy Paste: [[2507.15839]] FASTGEN: Fast and Cost-Effective Synthetic Tabular Data Generation with LLMs(https://arxiv.org/abs/2507.15839)
Keywords: generation
Abstract: Synthetic data generation has emerged as an invaluable solution in scenarios where real-world data collection and usage are limited by cost and scarcity. Large language models (LLMs) have demonstrated remarkable capabilities in producing high-fidelity, domain-relevant samples across various fields. However, existing approaches that directly use LLMs to generate each record individually impose prohibitive time and cost burdens, particularly when large volumes of synthetic data are required. In this work, we propose a fast, cost-effective method for realistic tabular data synthesis that leverages LLMs to infer and encode each field's distribution into a reusable sampling script. By automatically classifying fields into numerical, categorical, or free-text types, the LLM generates distribution-based scripts that can efficiently produce diverse, realistic datasets at scale without continuous model inference. Experimental results show that our approach outperforms traditional direct methods in both diversity and data realism, substantially reducing the burden of high-volume synthetic data generation. We plan to apply this methodology to accelerate testing in production pipelines, thereby shortening development cycles and improving overall system efficiency. We believe our insights and lessons learned will aid researchers and practitioners seeking scalable, cost-effective solutions for synthetic data generation.
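Abstract (translated from Chinese): Synthetic data generation has emerged as an invaluable solution in scenarios where real-world data collection and usage are limited by cost and scarcity. Large language models (LLMs) have shown remarkable capabilities in producing high-fidelity, domain-relevant samples across various fields. However, existing approaches that use LLMs directly to generate each record individually impose prohibitive time and cost burdens, particularly when large volumes of synthetic data are required. In this work, we propose a fast, cost-effective method for realistic tabular data synthesis that uses LLMs to infer each field's distribution and encode it into a reusable sampling script. By automatically classifying fields into numerical, categorical, or free-text types, the LLM generates distribution-based scripts that can efficiently produce diverse, realistic datasets at scale without continuous model inference. Experimental results show that our approach outperforms traditional direct methods in both diversity and data realism, substantially reducing the burden of high-volume synthetic data generation. We plan to apply this methodology to accelerate testing in production pipelines, thereby shortening development cycles and improving overall system efficiency. We believe our insights and lessons learned will help researchers and practitioners seeking scalable, cost-effective solutions for synthetic data generation.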
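
The idea of a reusable, distribution-based sampling script can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the field specification format, the field names, and the distribution parameters are all hypothetical stand-ins for what an LLM might infer once per schema. After that one-off step, records are drawn locally with no further model calls.

```python
import random

# Hypothetical per-field specs, standing in for what an LLM might infer once.
FIELD_SPECS = {
    "age": {"type": "numerical", "mean": 40.0, "std": 12.0, "min": 18, "max": 90},
    "plan": {
        "type": "categorical",
        "values": ["free", "pro", "team"],
        "weights": [0.6, 0.3, 0.1],
    },
    "note": {
        "type": "free_text",
        "templates": ["Customer asked about {topic}.", "Follow-up on {topic}."],
        "topics": ["billing", "login", "export"],
    },
}


def sample_field(spec, rng):
    """Draw one value for a field according to its declared type."""
    kind = spec["type"]
    if kind == "numerical":
        # Gaussian draw, clamped to the declared range.
        value = rng.gauss(spec["mean"], spec["std"])
        return int(min(max(value, spec["min"]), spec["max"]))
    if kind == "categorical":
        return rng.choices(spec["values"], weights=spec["weights"], k=1)[0]
    if kind == "free_text":
        template = rng.choice(spec["templates"])
        return template.format(topic=rng.choice(spec["topics"]))
    raise ValueError(f"unknown field type: {kind}")


def generate_records(specs, n, seed=0):
    """Generate n synthetic records; no model inference is involved."""
    rng = random.Random(seed)
    return [
        {name: sample_field(spec, rng) for name, spec in specs.items()}
        for _ in range(n)
    ]


records = generate_records(FIELD_SPECS, 50, seed=1)
```

The cost of the LLM is paid once, when the specs are produced; generating a million rows afterwards is an ordinary local loop, which is the source of the speed and cost savings the abstract describes.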

Title: Latent Denoising Makes Good Visual Tokenizers

Authors: Jiawei Yang, Tianhong Li, Lijie Fan, Yonglong Tian, Yue Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.15856
Pdf URL: https://arxiv.org/pdf/2507.15856
Copy Paste: [[2507.15856]] Latent Denoising Makes Good Visual Tokenizers(https://arxiv.org/abs/2507.15856)
Keywords: generative
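Abstract: Despite the fundamental role of visual tokenizers, it remains unclear what properties could make them more effective for generative modeling. We observe that modern generative models share a conceptually similar training objective -- reconstructing clean signals from corrupted inputs such as Gaussian noise or masking -- a process we term denoising. Motivated by this insight, we propose aligning tokenizer embeddings directly with the downstream denoising objective, encouraging latent embeddings to be more easily reconstructed even when heavily corrupted. To achieve this, we introduce the Latent Denoising Tokenizer (l-DeTok), a simple yet effective tokenizer trained to reconstruct clean images from latent embeddings corrupted by interpolative noise and random masking. Extensive experiments on ImageNet 256x256 demonstrate that our tokenizer consistently outperforms standard tokenizers across six representative generative models. Our findings highlight denoising as a fundamental design principle for tokenizer development, and we hope it could motivate new perspectives for future tokenizer design.
Abstract (translated from Chinese): Despite the fundamental role of visual tokenizers, it remains unclear what properties could make them more effective for generative modeling. We observe that modern generative models share a conceptually similar training objective, namely reconstructing clean signals from corrupted inputs such as Gaussian noise or masking, a process we call denoising. Motivated by this insight, we propose aligning tokenizer embeddings directly with the downstream denoising objective, encouraging latent embeddings to be easier to reconstruct even when heavily corrupted. To achieve this, we introduce the Latent Denoising Tokenizer (l-DeTok), a simple yet effective tokenizer trained to reconstruct clean images from latent embeddings corrupted by interpolative noise and random masking. Extensive experiments on ImageNet 256x256 show that our tokenizer consistently outperforms standard tokenizers across six representative generative models. Our findings highlight denoising as a fundamental design principle for tokenizer development, and we hope it will motivate new perspectives for future tokenizer design.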
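
The corruption step named in the abstract, interpolative noise plus random masking, can be sketched in a few lines. The exact formulation here is an assumption made for illustration: each kept token is blended with Gaussian noise as `(1 - noise_level) * x + noise_level * noise`, and masked tokens are represented by `None`. The function name and parameters are hypothetical, and the paper's noise schedule and mask handling may differ.

```python
import random


def corrupt_latents(latents, noise_level, mask_ratio, rng):
    """Corrupt a sequence of token embeddings (lists of floats).

    Each token is masked with probability `mask_ratio`; otherwise it is
    interpolated towards standard Gaussian noise by `noise_level` in [0, 1].
    """
    corrupted = []
    for token in latents:
        if rng.random() < mask_ratio:
            corrupted.append(None)  # assumed placeholder for a masked token
            continue
        corrupted.append(
            [
                (1.0 - noise_level) * value + noise_level * rng.gauss(0.0, 1.0)
                for value in token
            ]
        )
    return corrupted


rng = random.Random(0)
latents = [[0.5, -1.0], [2.0, 0.25], [0.0, 1.5]]
noisy = corrupt_latents(latents, noise_level=0.7, mask_ratio=0.3, rng=rng)
```

In a training loop of this kind, a decoder would receive `noisy` and be asked to reconstruct the clean image, so the encoder is pushed towards embeddings that survive heavy corruption.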