2025-08-08

Title: LumiGen: An LVLM-Enhanced Iterative Framework for Fine-Grained Text-to-Image Generation

Authors: Xiaoqi Dong, Xiangyu Zhou, Nicholas Evans, Yujia Lin
Subjects: cs.LG, cs.GR
Abstract URL: https://arxiv.org/abs/2508.04732
Pdf URL: https://arxiv.org/pdf/2508.04732
Copy Paste: [[2508.04732]] LumiGen: An LVLM-Enhanced Iterative Framework for Fine-Grained Text-to-Image Generation(https://arxiv.org/abs/2508.04732)
Keywords: generation
Abstract: Text-to-Image (T2I) generation has made significant advancements with diffusion models, yet challenges persist in handling complex instructions, ensuring fine-grained content control, and maintaining deep semantic consistency. Existing T2I models often struggle with tasks like accurate text rendering, precise pose generation, or intricate compositional coherence. Concurrently, Large Vision-Language Models (LVLMs) have demonstrated powerful capabilities in cross-modal understanding and instruction following. We propose LumiGen, a novel LVLM-enhanced iterative framework designed to elevate T2I model performance, particularly in areas requiring fine-grained control, through a closed-loop, LVLM-driven feedback mechanism. LumiGen comprises an Intelligent Prompt Parsing & Augmentation (IPPA) module for proactive prompt enhancement and an Iterative Visual Feedback & Refinement (IVFR) module, which acts as a "visual critic" to iteratively correct and optimize generated images. Evaluated on the challenging LongBench-T2I Benchmark, LumiGen achieves a superior average score of 3.08, outperforming state-of-the-art baselines. Notably, our framework demonstrates significant improvements in critical dimensions such as text rendering and pose expression, validating the effectiveness of LVLM integration for more controllable and higher-quality image generation.
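
The closed-loop feedback idea in this abstract can be pictured with a minimal sketch. It is not the LumiGen implementation: the function name `refine`, the `generate` and `critique` callables, the score threshold, and the way feedback is appended to the prompt are all hypothetical stand-ins for a T2I model and an LVLM critic.

```python
def refine(prompt, generate, critique, max_rounds=3, target=0.9):
    """Generate, critique, and regenerate until the critic is satisfied.

    generate(prompt) -> image            (hypothetical T2I model wrapper)
    critique(prompt, image) -> (score, feedback)  (hypothetical LVLM critic)
    """
    current = prompt
    image = generate(current)
    for _ in range(max_rounds):
        # The critic always judges the image against the ORIGINAL request.
        score, feedback = critique(prompt, image)
        if score >= target:
            break
        # Assumption: feedback is folded back in as extra prompt text.
        current = f"{current}\nFix: {feedback}"
        image = generate(current)
    return image, current


# Toy stand-ins so the loop can be run without any model.
def fake_generate(p):
    return p.upper()

def fake_critique(p, img):
    return (1.0, "") if "FIX" in img else (0.0, "add sign text")

image, final_prompt = refine("a cat", fake_generate, fake_critique)
```

With the toy stand-ins the loop runs one correction round and then stops because the critic's score reaches the target.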

Title: Edge-Assisted Collaborative Fine-Tuning for Multi-User Personalized Artificial Intelligence Generated Content (AIGC)

Authors: Nan Li, Wanting Yang, Marie Siew, Zehui Xiong, Binbin Chen, Shiwen Mao, Kwok-Yan Lam
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.04745
Pdf URL: https://arxiv.org/pdf/2508.04745
Copy Paste: [[2508.04745]] Edge-Assisted Collaborative Fine-Tuning for Multi-User Personalized Artificial Intelligence Generated Content (AIGC)(https://arxiv.org/abs/2508.04745)
Keywords: generation
Abstract: Diffusion models (DMs) have emerged as powerful tools for high-quality content generation, yet their intensive computational requirements for inference pose challenges for resource-constrained edge devices. Cloud-based solutions aid in computation but often fall short in addressing privacy risks, personalization efficiency, and communication costs in multi-user edge-AIGC scenarios. To bridge this gap, we first analyze existing edge-AIGC applications in personalized content synthesis, revealing their limitations in efficiency and scalability. We then propose a novel cluster-aware hierarchical federated aggregation framework. Based on parameter-efficient local fine-tuning via Low-Rank Adaptation (LoRA), the framework first clusters clients based on the similarity of their uploaded task requirements, followed by an intra-cluster aggregation for enhanced personalization at the server-side. Subsequently, an inter-cluster knowledge interaction paradigm is implemented to enable hybrid-style content generation across diverse clusters. Building upon federated learning (FL) collaboration, our framework simultaneously trains personalized models for individual users at the devices and a shared global model enhanced with multiple LoRA adapters on the server, enabling efficient edge inference; meanwhile, all prompts for clustering and inference are encoded prior to transmission, thereby further mitigating the risk of plaintext leakage. Our evaluations demonstrate that the framework achieves accelerated convergence while maintaining practical viability for scalable multi-user personalized AIGC services under edge constraints.
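
As a rough illustration of the "cluster clients, then aggregate within each cluster" step, the sketch below groups clients by cosine similarity of their task vectors and averages the flattened LoRA updates per cluster. The abstract does not specify the clustering rule or the aggregation weights, so the greedy threshold clustering, the plain unweighted mean, and all names here are assumptions for illustration only.

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def cluster_clients(task_vecs, threshold=0.8):
    """Greedy clustering (an assumption): join the first cluster whose
    first member is similar enough, otherwise open a new cluster."""
    clusters = []
    for i, vec in enumerate(task_vecs):
        for members in clusters:
            if cosine(vec, task_vecs[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters

def average(vectors):
    """Element-wise mean of equally sized vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def aggregate_per_cluster(task_vecs, lora_updates, threshold=0.8):
    """Intra-cluster aggregation: one averaged LoRA update per cluster."""
    return [average([lora_updates[i] for i in members])
            for members in cluster_clients(task_vecs, threshold)]


task_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]          # encoded task requirements
lora_updates = [[1.0, 2.0], [3.0, 4.0], [10.0, 20.0]]     # flattened LoRA parameters
per_cluster = aggregate_per_cluster(task_vecs, lora_updates)
```

Here the first two clients land in one cluster and the third in its own, so the server would hold two cluster-level adapters.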

Title: Uncertainty-aware Predict-Then-Optimize Framework for Equitable Post-Disaster Power Restoration

Authors: Lin Jiang, Dahai Yu, Rongchao Xu, Tian Tang, Guang Wang
Subjects: cs.LG, cs.AI, cs.SI
Abstract URL: https://arxiv.org/abs/2508.04780
Pdf URL: https://arxiv.org/pdf/2508.04780
Copy Paste: [[2508.04780]] Uncertainty-aware Predict-Then-Optimize Framework for Equitable Post-Disaster Power Restoration(https://arxiv.org/abs/2508.04780)
Keywords: restoration
Abstract: The increasing frequency of extreme weather events, such as hurricanes, highlights the urgent need for efficient and equitable power system restoration. Many electricity providers make restoration decisions primarily based on the volume of power restoration requests from each region. However, our data-driven analysis reveals significant disparities in request submission volume, as disadvantaged communities tend to submit fewer restoration requests. This disparity makes the current restoration solution inequitable, leaving these communities vulnerable to extended power outages. To address this, we aim to propose an equity-aware power restoration strategy that balances both restoration efficiency and equity across communities. However, achieving this goal is challenging for two reasons: the difficulty of predicting repair durations under dataset heteroscedasticity, and the tendency of reinforcement learning agents to favor low-uncertainty actions, which potentially undermine equity. To overcome these challenges, we design a predict-then-optimize framework called EPOPR with two key components: (1) Equity-Conformalized Quantile Regression for uncertainty-aware repair duration prediction, and (2) Spatial-Temporal Attentional RL that adapts to varying uncertainty levels across regions for equitable decision-making. Experimental results show that our EPOPR effectively reduces the average power outage duration by 3.60% and decreases inequity between different communities by 14.19% compared to state-of-the-art baselines.
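
For readers unfamiliar with conformalized quantile regression, the sketch below shows the standard split-conformal version: a calibration set is used to widen (or shrink) a quantile model's predicted band so that it reaches the desired coverage. This is the generic technique, not the paper's equity-aware variant, and the function names and toy numbers are illustrative assumptions.

```python
import math

def cqr_margin(lo_cal, hi_cal, y_cal, alpha=0.1):
    """Conformal margin computed on a held-out calibration set.

    lo_cal / hi_cal are the lower and upper quantile predictions,
    y_cal the observed values (e.g. repair durations).
    """
    # Conformity score: positive when the truth falls outside the band.
    scores = sorted(max(lo - y, y - hi)
                    for lo, hi, y in zip(lo_cal, hi_cal, y_cal))
    n = len(scores)
    # Finite-sample rank ceil((n + 1)(1 - alpha)). Capping at n is a
    # simplification; strictly, the interval is unbounded in that case.
    k = min(n, math.ceil((n + 1) * (1 - alpha)))
    return scores[k - 1]

def cqr_interval(lo, hi, margin):
    """Calibrated interval for a new prediction."""
    return lo - margin, hi + margin


margin = cqr_margin([1, 2, 3, 4], [3, 4, 5, 6], [2, 5, 2.5, 7], alpha=0.5)
interval = cqr_interval(2, 4, margin)
```

A negative margin is possible and simply means the raw quantile band was wider than needed on the calibration data.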

Title: RetinexDual: Retinex-based Dual Nature Approach for Generalized Ultra-High-Definition Image Restoration

Authors: Mohab Kishawy, Ali Abdellatif Hussein, Jun Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.04797
Pdf URL: https://arxiv.org/pdf/2508.04797
Copy Paste: [[2508.04797]] RetinexDual: Retinex-based Dual Nature Approach for Generalized Ultra-High-Definition Image Restoration(https://arxiv.org/abs/2508.04797)
Keywords: restoration
Abstract: Advancements in image sensing have elevated the importance of Ultra-High-Definition Image Restoration (UHD IR). Traditional methods, such as extreme downsampling or transformation from the spatial to the frequency domain, encounter significant drawbacks: downsampling induces irreversible information loss in UHD images, while our frequency analysis reveals that pure frequency-domain approaches are ineffective for spatially confined image artifacts, primarily due to the loss of degradation locality. To overcome these limitations, we present RetinexDual, a novel Retinex theory-based framework designed for generalized UHD IR tasks. RetinexDual leverages two complementary sub-networks: the Scale-Attentive maMBA (SAMBA) and the Frequency Illumination Adaptor (FIA). SAMBA, responsible for correcting the reflectance component, utilizes a coarse-to-fine mechanism to overcome the causal modeling of mamba, which effectively reduces artifacts and restores intricate details. On the other hand, FIA ensures precise correction of color and illumination distortions by operating in the frequency domain and leveraging the global context provided by it. Evaluating RetinexDual on four UHD IR tasks, namely deraining, deblurring, dehazing, and Low-Light Image Enhancement (LLIE), shows that it outperforms recent methods qualitatively and quantitatively. Ablation studies demonstrate the importance of employing distinct designs for each branch in RetinexDual, as well as the effectiveness of its various components.

Title: Single-Step Reconstruction-Free Anomaly Detection and Segmentation via Diffusion Models

Authors: Mehrdad Moradi, Marco Grasso, Bianca Maria Colosimo, Kamran Paynabar
Subjects: cs.CV, eess.IV, stat.ML
Abstract URL: https://arxiv.org/abs/2508.04818
Pdf URL: https://arxiv.org/pdf/2508.04818
Copy Paste: [[2508.04818]] Single-Step Reconstruction-Free Anomaly Detection and Segmentation via Diffusion Models(https://arxiv.org/abs/2508.04818)
Keywords: generative
Abstract: Generative models have demonstrated significant success in anomaly detection and segmentation over the past decade. Recently, diffusion models have emerged as a powerful alternative, outperforming previous approaches such as GANs and VAEs. In typical diffusion-based anomaly detection, a model is trained on normal data, and during inference, anomalous images are perturbed to a predefined intermediate step in the forward diffusion process. The corresponding normal image is then reconstructed through iterative reverse sampling. However, reconstruction-based approaches present three major challenges: (1) the reconstruction process is computationally expensive due to multiple sampling steps, making real-time applications impractical; (2) for complex or subtle patterns, the reconstructed image may correspond to a different normal pattern rather than the original input; and (3) choosing an appropriate intermediate noise level is challenging because it is application-dependent and often assumes prior knowledge of anomalies, an assumption that does not hold in unsupervised settings. We introduce Reconstruction-free Anomaly Detection with Attention-based diffusion models in Real-time (RADAR), which overcomes the limitations of reconstruction-based anomaly detection. Unlike current SOTA methods that reconstruct the input image, RADAR directly produces anomaly maps from the diffusion model, improving both detection accuracy and computational efficiency. We evaluate RADAR on real-world 3D-printed material and the MVTec-AD dataset. Our approach surpasses state-of-the-art diffusion-based and statistical machine learning models across all key metrics, including accuracy, precision, recall, and F1 score. Specifically, RADAR improves F1 score by 7% on MVTec-AD and 13% on the 3D-printed material dataset compared to the next best model. Code available at: this https URL

Title: Unified Flow Matching for Long Horizon Event Forecasting

Authors: Xiao Shou
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.04843
Pdf URL: https://arxiv.org/pdf/2508.04843
Copy Paste: [[2508.04843]] Unified Flow Matching for Long Horizon Event Forecasting(https://arxiv.org/abs/2508.04843)
Keywords: generation
Abstract: Modeling long horizon marked event sequences is a fundamental challenge in many real-world applications, including healthcare, finance, and user behavior modeling. Existing neural temporal point process models are typically autoregressive, predicting the next event one step at a time, which limits their efficiency and leads to error accumulation in long-range forecasting. In this work, we propose a unified flow matching framework for marked temporal point processes that enables non-autoregressive, joint modeling of inter-event times and event types, via continuous and discrete flow matching. By learning continuous-time flows for both components, our method generates coherent long horizon event trajectories without sequential decoding. We evaluate our model on six real-world benchmarks and demonstrate significant improvements over autoregressive and diffusion-based baselines in both accuracy and generation efficiency.
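
The continuous half of flow matching can be illustrated with the common linear-path formulation: a training point is an interpolation between a noise sample and a data sample, the regression target is their difference, and sampling integrates the learned velocity in a few steps over the whole vector at once. This is a generic sketch, not the paper's model; the vector could stand for all inter-event times of a horizon, the discrete flow for event types is not covered, and every name below is an assumption.

```python
def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Velocity of the straight path; it is constant in t."""
    return [b - a for a, b in zip(x0, x1)]

def flow_matching_loss(predicted_v, x0, x1):
    """Mean squared error between a predicted velocity and the target."""
    target = target_velocity(x0, x1)
    return sum((p - v) ** 2 for p, v in zip(predicted_v, target)) / len(target)

def euler_sample(x0, velocity_fn, steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps.
    All coordinates move together, so there is no sequential decoding."""
    x, dt = list(x0), 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


noise, data = [0.0, 0.0], [2.0, 4.0]
midpoint = interpolate(noise, data, 0.5)
# A perfect "model" that returns the true velocity recovers the data point.
sample = euler_sample(noise, lambda x, t: target_velocity(noise, data))
```

In practice `velocity_fn` would be a neural network trained by minimizing `flow_matching_loss` over random pairs and random times.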

Title: Accelerating Conditional Prompt Learning via Masked Image Modeling for Vision-Language Models

Authors: Phuoc-Nguyen Bui, Khanh-Binh Nguyen, Hyunseung Choo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.04942
Pdf URL: https://arxiv.org/pdf/2508.04942
Copy Paste: [[2508.04942]] Accelerating Conditional Prompt Learning via Masked Image Modeling for Vision-Language Models(https://arxiv.org/abs/2508.04942)
Keywords: generation
Abstract: Vision-language models (VLMs) like CLIP excel in zero-shot learning but often require resource-intensive training to adapt to new tasks. Prompt learning techniques, such as CoOp and CoCoOp, offer efficient adaptation but tend to overfit to known classes, limiting generalization to unseen categories. We introduce ProMIM, a plug-and-play framework that enhances conditional prompt learning by integrating masked image modeling (MIM) into existing VLM pipelines. ProMIM leverages a simple yet effective masking strategy to generate robust, instance-conditioned prompts, seamlessly augmenting methods like CoOp and CoCoOp without altering their core architectures. By masking only visible image patches and using these representations to guide prompt generation, ProMIM improves feature robustness and mitigates overfitting, all while introducing negligible additional computational cost. Extensive experiments across zero-shot and few-shot classification tasks demonstrate that ProMIM consistently boosts generalization performance when plugged into existing approaches, providing a practical, lightweight solution for real-world vision-language applications.

Title: TRKT: Weakly Supervised Dynamic Scene Graph Generation with Temporal-enhanced Relation-aware Knowledge Transferring

Authors: Zhu Xu, Ting Lei, Zhimin Li, Guan Wang, Qingchao Chen, Yuxin Peng, Yang Liu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.04943
Pdf URL: https://arxiv.org/pdf/2508.04943
Copy Paste: [[2508.04943]] TRKT: Weakly Supervised Dynamic Scene Graph Generation with Temporal-enhanced Relation-aware Knowledge Transferring(https://arxiv.org/abs/2508.04943)
Keywords: generation
Abstract: Dynamic Scene Graph Generation (DSGG) aims to create a scene graph for each video frame by detecting objects and predicting their relationships. Weakly Supervised DSGG (WS-DSGG) reduces annotation workload by using an unlocalized scene graph from a single frame per video for training. Existing WS-DSGG methods depend on an off-the-shelf external object detector to generate pseudo labels for subsequent DSGG training. However, detectors trained on static, object-centric images struggle in dynamic, relation-aware scenarios required for DSGG, leading to inaccurate localization and low-confidence proposals. To address the challenges posed by external object detectors in WS-DSGG, we propose a Temporal-enhanced Relation-aware Knowledge Transferring (TRKT) method, which leverages knowledge to enhance detection in relation-aware dynamic scenarios. TRKT is built on two key components: (1) Relation-aware knowledge mining: we first employ object and relation class decoders that generate category-specific attention maps to highlight both object regions and interactive areas. Then we propose an Inter-frame Attention Augmentation strategy that exploits optical flow for neighboring frames to enhance the attention maps, making them motion-aware and robust to motion blur. This step yields relation- and motion-aware knowledge mining for WS-DSGG. (2) Dual-stream fusion: we introduce a Dual-stream Fusion Module that integrates category-specific attention maps into external detections to refine object localization and boost confidence scores for object proposals. Extensive experiments demonstrate that TRKT achieves state-of-the-art performance on the Action Genome dataset. Our code is available at this https URL.

Title: Steering One-Step Diffusion Model with Fidelity-Rich Decoder for Fast Image Compression

Authors: Zheng Chen, Mingde Zhou, Jinpei Guo, Jiale Yuan, Yifei Ji, Yulun Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.04979
Pdf URL: https://arxiv.org/pdf/2508.04979
Copy Paste: [[2508.04979]] Steering One-Step Diffusion Model with Fidelity-Rich Decoder for Fast Image Compression(https://arxiv.org/abs/2508.04979)
Keywords: generative
Abstract: Diffusion-based image compression has demonstrated impressive perceptual performance. However, it suffers from two critical drawbacks: (1) excessive decoding latency due to multi-step sampling, and (2) poor fidelity resulting from over-reliance on generative priors. To address these issues, we propose SODEC, a novel single-step diffusion image compression model. We argue that in image compression, a sufficiently informative latent renders multi-step refinement unnecessary. Based on this insight, we leverage a pre-trained VAE-based model to produce latents with rich information, and replace the iterative denoising process with a single-step decoding. Meanwhile, to improve fidelity, we introduce the fidelity guidance module, encouraging output that is faithful to the original image. Furthermore, we design the rate annealing training strategy to enable effective training under extremely low bitrates. Extensive experiments show that SODEC significantly outperforms existing methods, achieving superior rate-distortion-perception performance. Moreover, compared to previous diffusion-based compression models, SODEC improves decoding speed by more than 20$\times$. Code is released at: this https URL.

Title: AU-IQA: A Benchmark Dataset for Perceptual Quality Assessment of AI-Enhanced User-Generated Content

Authors: Shushi Wang, Chunyi Li, Zicheng Zhang, Han Zhou, Wei Dong, Jun Chen, Guangtao Zhai, Xiaohong Liu
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2508.05016
Pdf URL: https://arxiv.org/pdf/2508.05016
Copy Paste: [[2508.05016]] AU-IQA: A Benchmark Dataset for Perceptual Quality Assessment of AI-Enhanced User-Generated Content(https://arxiv.org/abs/2508.05016)
Keywords: super-resolution, quality assessment
Abstract: AI-based image enhancement techniques have been widely adopted in various visual applications, significantly improving the perceptual quality of user-generated content (UGC). However, the lack of specialized quality assessment models has become a significant limiting factor in this field, degrading user experience and hindering the advancement of enhancement methods. While perceptual quality assessment methods have shown strong performance on UGC and AIGC individually, their effectiveness on AI-enhanced UGC (AI-UGC), which blends features from both, remains largely unexplored. To address this gap, we construct AU-IQA, a benchmark dataset comprising 4,800 AI-UGC images produced by three representative enhancement types: super-resolution, low-light enhancement, and denoising. On this dataset, we further evaluate a range of existing quality assessment models, including traditional IQA methods and large multimodal models. Finally, we provide a comprehensive analysis of how well current approaches perform in assessing the perceptual quality of AI-UGC. The access link to the AU-IQA is this https URL.

Title: A Novel Image Similarity Metric for Scene Composition Structure

Authors: Md Redwanul Haque, Manzur Murshed, Manoranjan Paul, Tsz-Kwan Lee
Subjects: cs.CV, cs.IT
Abstract URL: https://arxiv.org/abs/2508.05037
Pdf URL: https://arxiv.org/pdf/2508.05037
Copy Paste: [[2508.05037]] A Novel Image Similarity Metric for Scene Composition Structure(https://arxiv.org/abs/2508.05037)
Keywords: generative
Abstract: The rapid advancement of generative AI models necessitates novel methods for evaluating image quality that extend beyond human perception. A critical concern for these models is the preservation of an image's underlying Scene Composition Structure (SCS), which defines the geometric relationships among objects and the background, their relative positions, sizes, orientations, etc. Maintaining SCS integrity is paramount for ensuring faithful and structurally accurate GenAI outputs. Traditional image similarity metrics often fall short in assessing SCS. Pixel-level approaches are overly sensitive to minor visual noise, while perception-based metrics prioritize human aesthetic appeal, neither adequately capturing structural fidelity. Furthermore, recent neural-network-based metrics introduce training overheads and potential generalization issues. We introduce the SCS Similarity Index Measure (SCSSIM), a novel, analytical, and training-free metric that quantifies SCS preservation by exploiting statistical measures derived from the Cuboidal hierarchical partitioning of images, robustly capturing non-object-based structural relationships. Our experiments demonstrate SCSSIM's high invariance to non-compositional distortions, accurately reflecting unchanged SCS. Conversely, it shows a strong monotonic decrease for compositional distortions, precisely indicating when SCS has been altered. Compared to existing metrics, SCSSIM exhibits superior properties for structural evaluation, making it an invaluable tool for developing and evaluating generative models, ensuring the integrity of scene composition.

Title: Automatic Image Colorization with Convolutional Neural Networks and Generative Adversarial Networks

Authors: Ruiyu Li, Changyuan Qiu, Hangrui Cao, Qihan Ren, Yuqing Qiu
Subjects: cs.CV, cs.AI, cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2508.05068
Pdf URL: https://arxiv.org/pdf/2508.05068
Copy Paste: [[2508.05068]] Automatic Image Colorization with Convolutional Neural Networks and Generative Adversarial Networks(https://arxiv.org/abs/2508.05068)
Keywords: restoration, generative
Abstract: Image colorization, the task of adding colors to grayscale images, has been the focus of significant research efforts in computer vision in recent years for its various application areas such as color restoration and automatic animation colorization [15, 1]. The colorization problem is challenging as it is highly ill-posed with two out of three image dimensions lost, resulting in large degrees of freedom. However, semantics of the scene as well as the surface texture could provide important cues for colors: the sky is typically blue, the clouds are typically white and the grass is typically green, and there are huge amounts of training data available for learning such priors since any colored image could serve as a training data point [20]. Colorization was initially formulated as a regression task [5], which ignores the multi-modal nature of color prediction. In this project, we explore automatic image colorization via classification and adversarial learning. We will build our models on prior works, apply modifications for our specific scenario and make comparisons.

Title: FLUX-Makeup: High-Fidelity, Identity-Consistent, and Robust Makeup Transfer via Diffusion Transformer

Authors: Jian Zhu, Shanyuan Liu, Liuzhuozheng Li, Yue Gong, He Wang, Bo Cheng, Yuhang Ma, Liebucha Wu, Xiaoyu Wu, Dawei Leng, Yuhui Yin, Yang Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.05069
Pdf URL: https://arxiv.org/pdf/2508.05069
Copy Paste: [[2508.05069]] FLUX-Makeup: High-Fidelity, Identity-Consistent, and Robust Makeup Transfer via Diffusion Transformer(https://arxiv.org/abs/2508.05069)
Keywords: generation
Abstract: Makeup transfer aims to apply the makeup style from a reference face to a target face and has been increasingly adopted in practical applications. Existing GAN-based approaches typically rely on carefully designed loss functions to balance transfer quality and facial identity consistency, while diffusion-based methods often depend on additional face-control modules or algorithms to preserve identity. However, these auxiliary components tend to introduce extra errors, leading to suboptimal transfer results. To overcome these limitations, we propose FLUX-Makeup, a high-fidelity, identity-consistent, and robust makeup transfer framework that eliminates the need for any auxiliary face-control components. Instead, our method directly leverages source-reference image pairs to achieve superior transfer performance. Specifically, we build our framework upon FLUX-Kontext, using the source image as its native conditional input. Furthermore, we introduce RefLoRAInjector, a lightweight makeup feature injector that decouples the reference pathway from the backbone, enabling efficient and comprehensive extraction of makeup-related information. In parallel, we design a robust and scalable data generation pipeline to provide more accurate supervision during training. The paired makeup datasets produced by this pipeline significantly surpass the quality of all existing datasets. Extensive experiments demonstrate that FLUX-Makeup achieves state-of-the-art performance, exhibiting strong robustness across diverse scenarios.

Title: Integrated Influence: Data Attribution with Baseline

Authors: Linxiao Yang, Xinyu Gu, Liang Sun
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.05089
Pdf URL: https://arxiv.org/pdf/2508.05089
Copy Paste: [[2508.05089]] Integrated Influence: Data Attribution with Baseline(https://arxiv.org/abs/2508.05089)
Keywords: generation
Abstract: As an effective approach to quantify how training samples influence test samples, data attribution is crucial for understanding data and models and for further enhancing the transparency of machine learning models. We find that prevailing data attribution methods based on the leave-one-out (LOO) strategy suffer from local-based explanation, as these LOO-based methods only perturb a single training sample and overlook the collective influence in the training set. On the other hand, the lack of a baseline in many data attribution methods reduces the flexibility of the explanation, e.g., failing to provide counterfactual explanations. In this paper, we propose Integrated Influence, a novel data attribution method that incorporates a baseline approach. Our method defines a baseline dataset, follows a data degeneration process to transition the current dataset to the baseline, and accumulates the influence of each sample throughout this process. We provide a solid theoretical framework for our method, and further demonstrate that popular methods, such as influence functions, can be viewed as special cases of our approach. Experimental results show that Integrated Influence generates more reliable data attributions compared to existing methods in both the data attribution task and the mislabeled example identification task.

Title: PoseGen: In-Context LoRA Finetuning for Pose-Controllable Long Human Video Generation

Authors: Jingxuan He, Busheng Su, Finn Wong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.05091
Pdf URL: https://arxiv.org/pdf/2508.05091
Copy Paste: [[2508.05091]] PoseGen: In-Context LoRA Finetuning for Pose-Controllable Long Human Video Generation(https://arxiv.org/abs/2508.05091)
Keywords: generation
Abstract: Generating long, temporally coherent videos with precise control over subject identity and motion is a formidable challenge for current diffusion models, which often suffer from identity drift and are limited to short clips. We introduce PoseGen, a novel framework that generates arbitrarily long videos of a specific subject from a single reference image and a driving pose sequence. Our core innovation is an in-context LoRA finetuning strategy that injects subject appearance at the token level for identity preservation, while simultaneously conditioning on pose information at the channel level for fine-grained motion control. To overcome duration limits, PoseGen pioneers an interleaved segment generation method that seamlessly stitches video clips together, using a shared KV cache mechanism and a specialized transition process to ensure background consistency and temporal smoothness. Trained on a remarkably small 33-hour video dataset, extensive experiments show that PoseGen significantly outperforms state-of-the-art methods in identity fidelity, pose accuracy, and its unique ability to produce coherent, artifact-free videos of unlimited duration.

Title: Exploring Superior Function Calls via Reinforcement Learning

Authors: Bingguang Hao, Maolin Wang, Zengzhuang Xu, Yicheng Chen, Cunyin Peng, Jinjie GU, Chenyi Zhuang
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2508.05118
Pdf URL: https://arxiv.org/pdf/2508.05118
Copy Paste: [[2508.05118]] Exploring Superior Function Calls via Reinforcement Learning(https://arxiv.org/abs/2508.05118)
Keywords: generation
Abstract: Function calling capabilities are crucial for deploying Large Language Models in real-world applications, yet current training approaches fail to develop robust reasoning strategies. Supervised fine-tuning produces models that rely on superficial pattern matching, while standard reinforcement learning methods struggle with the complex action space of structured function calls. We present a novel reinforcement learning framework designed to enhance group relative policy optimization through strategic entropy-based exploration specifically tailored for function calling tasks. Our approach addresses three critical challenges in function calling: insufficient exploration during policy learning, lack of structured reasoning in chain-of-thought generation, and inadequate verification of parameter extraction. Our two-stage data preparation pipeline ensures high-quality training samples through iterative LLM evaluation and abstract syntax tree validation. Extensive experiments on the Berkeley Function Calling Leaderboard demonstrate that this framework achieves state-of-the-art performance among open-source models with 86.02% overall accuracy, outperforming standard GRPO by up to 6% on complex multi-function scenarios. Notably, our method shows particularly strong improvements on code-pretrained models, suggesting that structured language generation capabilities provide an advantageous starting point for reinforcement learning in function calling tasks. We will release all the code, models and dataset to benefit the community.
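
Two ingredients named in this abstract can be sketched briefly: the group-relative reward normalization that GRPO-style methods use, and an abstract-syntax-tree check of a generated function call. This is a minimal sketch under stated assumptions, not the authors' pipeline: the schema layout (`SCHEMA`, `params`, `required`) and the rule that only keyword arguments with literal values are accepted are invented for illustration, and the entropy-based exploration term is not shown.

```python
import ast
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean/std of its own sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical tool schema: parameter name -> expected Python type.
SCHEMA = {
    "get_weather": {"params": {"city": str, "unit": str}, "required": ["city"]},
}

def validate_call(call_src, schema):
    """Return True if call_src parses as a single call that matches the schema."""
    try:
        tree = ast.parse(call_src, mode="eval")
    except (SyntaxError, ValueError):
        return False
    call = tree.body
    if not isinstance(call, ast.Call) or not isinstance(call.func, ast.Name):
        return False
    spec = schema.get(call.func.id)
    if spec is None or call.args:  # unknown function, or positional arguments
        return False
    seen = set()
    for kw in call.keywords:
        if kw.arg is None or kw.arg not in spec["params"]:
            return False
        try:
            value = ast.literal_eval(kw.value)  # only literal arguments are accepted
        except (ValueError, TypeError):
            return False
        if not isinstance(value, spec["params"][kw.arg]):
            return False
        seen.add(kw.arg)
    return all(name in seen for name in spec["required"])

print(validate_call('get_weather(city="Paris", unit="C")', SCHEMA))
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
```

A structural check of this kind could serve either as a data filter or as part of a reward signal; which role it plays in the paper beyond data preparation is not stated in the abstract.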

Title: Latent Expression Generation for Referring Image Segmentation and Grounding

Authors: Seonghoon Yu, Joonbeom Hong, Joonseok Lee, Jeany Son
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.05123
Pdf URL: https://arxiv.org/pdf/2508.05123
Copy Paste: [[2508.05123]] Latent Expression Generation for Referring Image Segmentation and Grounding(https://arxiv.org/abs/2508.05123)
Keywords: generation
Abstract: Visual grounding tasks, such as referring image segmentation (RIS) and referring expression comprehension (REC), aim to localize a target object based on a given textual description. The target object in an image can be described in multiple ways, reflecting diverse attributes such as color, position, and more. However, most existing methods rely on a single textual input, which captures only a fraction of the rich information available in the visual domain. This mismatch between rich visual details and sparse textual cues can lead to the misidentification of similar objects. To address this, we propose a novel visual grounding framework that leverages multiple latent expressions generated from a single textual input by incorporating complementary visual details absent from the original description. Specifically, we introduce subject distributor and visual concept injector modules to embed both shared-subject and distinct-attributes concepts into the latent representations, thereby capturing unique and target-specific visual cues. We also propose a positive-margin contrastive learning strategy to align all latent expressions with the original text while preserving subtle variations. Experimental results show that our method not only outperforms state-of-the-art RIS and REC approaches on multiple benchmarks but also achieves outstanding performance on the generalized referring expression segmentation (GRES) benchmark.

Title: Rotation Equivariant Arbitrary-scale Image Super-Resolution

Authors: Qi Xie, Jiahong Fu, Zongben Xu, Deyu Meng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.05160
Pdf URL: https://arxiv.org/pdf/2508.05160
Copy Paste: [[2508.05160]] Rotation Equivariant Arbitrary-scale Image Super-Resolution(https://arxiv.org/abs/2508.05160)
Keywords: super-resolution
Abstract: Arbitrary-scale image super-resolution (ASISR), a recent popular topic in computer vision, aims to achieve arbitrary-scale high-resolution recoveries from a low-resolution input image. This task is realized by representing the image as a continuous implicit function through two fundamental modules, a deep-network-based encoder and an implicit neural representation (INR) module. Despite achieving notable progress, a crucial challenge of such a highly ill-posed setting is that many common geometric patterns, such as repetitive textures, edges, or shapes, are seriously warped and deformed in the low-resolution images, naturally leading to unexpected artifacts appearing in their high-resolution recoveries. Embedding rotation equivariance into the ASISR network is thus necessary, as it has been widely demonstrated that this enhancement enables the recovery to faithfully maintain the original orientations and structural integrity of geometric patterns underlying the input image. Motivated by this, we make efforts to construct a rotation equivariant ASISR method in this study. Specifically, we elaborately redesign the basic architectures of INR and encoder modules, incorporating intrinsic rotation equivariance capabilities beyond those of conventional ASISR networks. Through such amelioration, the ASISR network can, for the first time, be implemented with end-to-end rotational equivariance maintained from input to output. We also provide a solid theoretical analysis to evaluate its intrinsic equivariance error, demonstrating its inherent nature of embedding such an equivariance structure. The superiority of the proposed method is substantiated by experiments conducted on both simulated and real datasets. We also validate that the proposed framework can be readily integrated into current ASISR methods in a plug-and-play manner to further enhance their performance.
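
The notion of an equivariance error can be made concrete with a small numerical check: an operator f is rotation equivariant when f(rot(x)) equals rot(f(x)). The sketch below is not the paper's network. It assumes 90-degree rotations only, square grids with wrap-around borders, and two hand-picked 3x3 filters (`BOX`, `GRAD_X`) chosen purely for illustration.

```python
def rot90(img):
    """Rotate a square grid by 90 degrees counter-clockwise."""
    return [list(col) for col in zip(*img)][::-1]

def circular_filter(img, kernel):
    """3x3 cross-correlation with wrap-around (circular) borders."""
    n = len(img)
    out = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            out[i][j] = sum(
                kernel[di + 1][dj + 1] * img[(i + di) % n][(j + dj) % n]
                for di in (-1, 0, 1)
                for dj in (-1, 0, 1)
            )
    return out

def equivariance_error(f, img):
    """Largest absolute difference between f(rot(img)) and rot(f(img))."""
    a = f(rot90(img))
    b = rot90(f(img))
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))

BOX = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]        # unchanged by 90-degree rotation
GRAD_X = [[0, 0, 0], [-1, 0, 1], [0, 0, 0]]    # horizontal difference, not symmetric

img = [[0, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
print(equivariance_error(lambda x: circular_filter(x, BOX), img))
print(equivariance_error(lambda x: circular_filter(x, GRAD_X), img))
```

The rotation-symmetric box filter commutes with the rotation, while the horizontal-difference filter does not; the same check applied to a full network is one way to measure how far it is from exact equivariance.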

Title: X-MoGen: Unified Motion Generation across Humans and Animals

Authors: Xuan Wang, Kai Ruan, Liyang Qian, Zhizhi Guo, Chang Su, Gaoang Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.05162
Pdf URL: https://arxiv.org/pdf/2508.05162
Copy Paste: [[2508.05162]] X-MoGen: Unified Motion Generation across Humans and Animals(https://arxiv.org/abs/2508.05162)
Keywords: generation
Abstract: Text-driven motion generation has attracted increasing attention due to its broad applications in virtual reality, animation, and robotics. While existing methods typically model human and animal motion separately, a joint cross-species approach offers key advantages, such as a unified representation and improved generalization. However, morphological differences across species remain a key challenge, often compromising motion plausibility. To address this, we propose X-MoGen, the first unified framework for cross-species text-driven motion generation covering both humans and animals. X-MoGen adopts a two-stage architecture. First, a conditional graph variational autoencoder learns canonical T-pose priors, while an autoencoder encodes motion into a shared latent space regularized by morphological loss. In the second stage, we perform masked motion modeling to generate motion embeddings conditioned on textual descriptions. During training, a morphological consistency module is employed to promote skeletal plausibility across species. To support unified modeling, we construct UniMo4D, a large-scale dataset of 115 species and 119k motion sequences, which integrates human and animal motions under a shared skeletal topology for joint training. Extensive experiments on UniMo4D demonstrate that X-MoGen outperforms state-of-the-art methods on both seen and unseen species.

Title: Multi-tracklet Tracking for Generic Targets with Adaptive Detection Clustering

Authors: Zewei Wu, Longhao Wang, Cui Wang, César Teixeira, Wei Ke, Zhang Xiong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.05172
Pdf URL: https://arxiv.org/pdf/2508.05172
Copy Paste: [[2508.05172]] Multi-tracklet Tracking for Generic Targets with Adaptive Detection Clustering(https://arxiv.org/abs/2508.05172)
Keywords: generation
Abstract: Tracking specific targets, such as pedestrians and vehicles, has been the focus of recent vision-based multitarget tracking studies. However, in some real-world scenarios, unseen categories often challenge existing methods due to low-confidence detections, weak motion and appearance constraints, and long-term occlusions. To address these issues, this article proposes a tracklet-enhanced tracker called Multi-Tracklet Tracking (MTT) that integrates flexible tracklet generation into a multi-tracklet association framework. This framework first adaptively clusters the detection results according to their short-term spatio-temporal correlation into robust tracklets and then estimates the best tracklet partitions using multiple clues, such as location and appearance over time to mitigate error propagation in long-term association. Finally, extensive experiments on the benchmark for generic multiple object tracking demonstrate the competitiveness of the proposed framework.
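
The first stage described here, grouping detections into short tracklets by their spatio-temporal correlation, can be sketched with a simple greedy linker. This is an assumption-laden stand-in, not the MTT method: it uses a fixed IoU threshold rather than adaptive clustering, links only between consecutive frames, and ignores appearance cues. The function names and the box format `(x1, y1, x2, y2)` are hypothetical.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def build_tracklets(frames, iou_thresh=0.5):
    """Greedily link boxes in consecutive frames; returns lists of (frame_idx, box)."""
    tracklets = []
    active = []  # indices of tracklets that were extended in the previous frame
    for t, boxes in enumerate(frames):
        next_active = []
        unmatched = list(range(len(boxes)))
        for k in active:
            last_box = tracklets[k][-1][1]
            best = max(unmatched, key=lambda i: iou(last_box, boxes[i]), default=None)
            if best is not None and iou(last_box, boxes[best]) >= iou_thresh:
                tracklets[k].append((t, boxes[best]))
                unmatched.remove(best)
                next_active.append(k)
        for i in unmatched:  # every leftover detection starts a new tracklet
            tracklets.append([(t, boxes[i])])
            next_active.append(len(tracklets) - 1)
        active = next_active
    return tracklets

frames = [
    [(0, 0, 10, 10), (50, 50, 60, 60)],
    [(1, 0, 11, 10), (50, 51, 60, 61)],
    [(2, 0, 12, 10)],
]
for tracklet in build_tracklets(frames):
    print(tracklet)
```

Short tracklets built this way would then be the units passed to a longer-range association step, which is where the abstract's multi-clue partition estimation comes in.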

Title: FAITH: A Framework for Assessing Intrinsic Tabular Hallucinations in finance

Authors: Mengao Zhang, Jiayu Fu, Tanya Warrier, Yuwen Wang, Tianhui Tan, Ke-wei Huang
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2508.05201
Pdf URL: https://arxiv.org/pdf/2508.05201
Copy Paste: [[2508.05201]] FAITH: A Framework for Assessing Intrinsic Tabular Hallucinations in finance(https://arxiv.org/abs/2508.05201)
Keywords: generative
Abstract: Hallucination remains a critical challenge for deploying Large Language Models (LLMs) in finance. Accurate extraction and precise calculation from tabular data are essential for reliable financial analysis, since even minor numerical errors can undermine decision-making and regulatory compliance. Financial applications have unique requirements, often relying on context-dependent, numerical, and proprietary tabular data that existing hallucination benchmarks rarely capture. In this study, we develop a rigorous and scalable framework for evaluating intrinsic hallucinations in financial LLMs, conceptualized as a context-aware masked span prediction task over real-world financial documents. Our main contributions are: (1) a novel, automated dataset creation paradigm using a masking strategy; (2) a new hallucination evaluation dataset derived from S&P 500 annual reports; and (3) a comprehensive evaluation of intrinsic hallucination patterns in state-of-the-art LLMs on financial tabular data. Our work provides a robust methodology for in-house LLM evaluation and serves as a critical step toward building more trustworthy and reliable financial Generative AI systems.
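
The masking-based dataset creation and scoring described above can be sketched in a few lines. This is only an illustration of the general masked-span idea, assuming that numeric spans in a sentence are the masking targets and that a prediction counts as a hallucination when it is non-numeric or deviates from the source value beyond a relative tolerance. The regular expression, the `[MASK]` token, and the tolerance are assumptions, not details from the paper.

```python
import re

# Integers or decimals, optionally with thousands separators (e.g. 1,250.5).
NUM = re.compile(r"-?\d+(?:,\d{3})*(?:\.\d+)?")

def make_masked_examples(sentence):
    """Return one (masked_sentence, answer) pair per numeric span."""
    examples = []
    for m in NUM.finditer(sentence):
        masked = sentence[:m.start()] + "[MASK]" + sentence[m.end():]
        examples.append((masked, float(m.group().replace(",", ""))))
    return examples

def is_hallucinated(prediction, answer, rel_tol=1e-3):
    """Flag predictions that are non-numeric or deviate from the source value."""
    try:
        value = float(str(prediction).replace(",", ""))
    except ValueError:
        return True
    return abs(value - answer) > rel_tol * max(1.0, abs(answer))

sentence = "Revenue rose to 1,250.5 million in 2023, up from 980 million."
for masked, answer in make_masked_examples(sentence):
    print(masked, answer)
```

Because the answer is taken directly from the source text, a scheme like this needs no manual labels, which is consistent with the abstract's emphasis on automated and scalable dataset creation.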

Title: ReasoningTrack: Chain-of-Thought Reasoning for Long-term Vision-Language Tracking

Authors: Xiao Wang, Liye Jin, Xufeng Lou, Shiao Wang, Lan Chen, Bo Jiang, Zhipeng Zhang
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2508.05221
Pdf URL: https://arxiv.org/pdf/2508.05221
Copy Paste: [[2508.05221]] ReasoningTrack: Chain-of-Thought Reasoning for Long-term Vision-Language Tracking(https://arxiv.org/abs/2508.05221)
Keywords: generation
Abstract: Vision-language tracking has received increasing attention in recent years, as textual information can effectively address the inflexibility and inaccuracy associated with specifying the target object to be tracked. Existing works either directly fuse the fixed language with vision features or simply modify it using attention; however, their performance is still limited. Recently, some researchers have explored using text generation to adapt to the variations in the target during tracking. However, these works fail to provide insights into the model's reasoning process and do not fully leverage the advantages of large models, which further limits their overall performance. To address the aforementioned issues, this paper proposes a novel reasoning-based vision-language tracking framework, named ReasoningTrack, based on a pre-trained vision-language model Qwen2.5-VL. Both SFT (Supervised Fine-Tuning) and reinforcement learning GRPO are used for the optimization of reasoning and language generation. We embed the updated language descriptions and feed them into a unified tracking backbone network together with vision features. Then, we adopt a tracking head to predict the specific location of the target object. In addition, we propose a large-scale long-term vision-language tracking benchmark dataset, termed TNLLT, which contains 200 video sequences. 20 baseline visual trackers are re-trained and evaluated on this dataset, which builds a solid foundation for the vision-language visual tracking task. Extensive experiments on multiple vision-language tracking benchmark datasets fully validate the effectiveness of our proposed reasoning-based natural language generation strategy. The source code of this paper will be released on this https URL

Title: ArbiViewGen: Controllable Arbitrary Viewpoint Camera Data Generation for Autonomous Driving via Stable Diffusion Models

Authors: Yatong Lan, Jingfeng Chen, Yiru Wang, Lei He
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.05236
Pdf URL: https://arxiv.org/pdf/2508.05236
Copy Paste: [[2508.05236]] ArbiViewGen: Controllable Arbitrary Viewpoint Camera Data Generation for Autonomous Driving via Stable Diffusion Models(https://arxiv.org/abs/2508.05236)
Keywords: generation, generative
Abstract: Arbitrary viewpoint image generation holds significant potential for autonomous driving, yet remains a challenging task due to the lack of ground-truth data for extrapolated views, which hampers the training of high-fidelity generative models. In this work, we propose ArbiViewGen, a novel diffusion-based framework for the generation of controllable camera images from arbitrary points of view. To address the absence of ground-truth data in unseen views, we introduce two key components: Feature-Aware Adaptive View Stitching (FAVS) and Cross-View Consistency Self-Supervised Learning (CVC-SSL). FAVS employs a hierarchical matching strategy that first establishes coarse geometric correspondences using camera poses, then performs fine-grained alignment through improved feature matching algorithms, and identifies high-confidence matching regions via clustering analysis. Building upon this, CVC-SSL adopts a self-supervised training paradigm where the model reconstructs the original camera views from the synthesized stitched images using a diffusion model, enforcing cross-view consistency without requiring supervision from extrapolated data. Our framework requires only multi-camera images and their associated poses for training, eliminating the need for additional sensors or depth maps. To our knowledge, ArbiViewGen is the first method capable of controllable arbitrary view camera image generation in multiple vehicle configurations.

Title: SGDFuse: SAM-Guided Diffusion for High-Fidelity Infrared and Visible Image Fusion

Authors: Xiaoyang Zhang, Zhen Hua, Yakun Ju, Wei Zhou, Jun Liu, Alex C. Kot
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.05264
Pdf URL: https://arxiv.org/pdf/2508.05264
Copy Paste: [[2508.05264]] SGDFuse: SAM-Guided Diffusion for High-Fidelity Infrared and Visible Image Fusion(https://arxiv.org/abs/2508.05264)
Keywords: generation
Abstract: Infrared and visible image fusion (IVIF) aims to combine the thermal radiation information from infrared images with the rich texture details from visible images to enhance perceptual capabilities for downstream visual tasks. However, existing methods often fail to preserve key targets due to a lack of deep semantic understanding of the scene, while the fusion process itself can also introduce artifacts and detail loss, severely compromising both image quality and task performance. To address these issues, this paper proposes SGDFuse, a conditional diffusion model guided by the Segment Anything Model (SAM), to achieve high-fidelity and semantically-aware image fusion. The core of our method is to utilize high-quality semantic masks generated by SAM as explicit priors to guide the optimization of the fusion process via a conditional diffusion model. Specifically, the framework operates in a two-stage process: it first performs a preliminary fusion of multi-modal features, and then utilizes the semantic masks from SAM jointly with the preliminary fused image as a condition to drive the diffusion model's coarse-to-fine denoising generation. This ensures the fusion process not only has explicit semantic directionality but also guarantees the high fidelity of the final result. Extensive experiments demonstrate that SGDFuse achieves state-of-the-art performance in both subjective and objective evaluations, as well as in its adaptability to downstream tasks, providing a powerful solution to the core challenges in image fusion. The code of SGDFuse is available at this https URL.

Title: B4DL: A Benchmark for 4D LiDAR LLM in Spatio-Temporal Understanding

Authors: Changho Choi, Youngwoo Shin, Gyojin Han, Dong-Jae Lee, Junmo Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.05269
Pdf URL: https://arxiv.org/pdf/2508.05269
Copy Paste: [[2508.05269]] B4DL: A Benchmark for 4D LiDAR LLM in Spatio-Temporal Understanding(https://arxiv.org/abs/2508.05269)
Keywords: generation
Abstract: Understanding dynamic outdoor environments requires capturing complex object interactions and their evolution over time. LiDAR-based 4D point clouds provide precise spatial geometry and rich temporal cues, making them ideal for representing real-world scenes. However, despite its potential, 4D LiDAR remains underexplored in the context of Multimodal Large Language Models (MLLMs) due to the absence of high-quality, modality-specific annotations and the lack of MLLM architectures capable of processing its high-dimensional composition. To address these challenges, we introduce B4DL, a new benchmark specifically designed for training and evaluating MLLMs on 4D LiDAR understanding. In addition, we propose a scalable data generation pipeline and an MLLM model that, for the first time, directly processes raw 4D LiDAR by bridging it with language understanding. Combined with our dataset and benchmark, our model offers a unified solution for spatio-temporal reasoning in dynamic outdoor environments. We provide rendered 4D LiDAR videos, the generated dataset, and inference outputs on diverse scenarios at: this https URL

Title: mKG-RAG: Multimodal Knowledge Graph-Enhanced RAG for Visual Question Answering

Authors: Xu Yuan, Liangbo Ning, Wenqi Fan, Qing Li
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.05318
Pdf URL: https://arxiv.org/pdf/2508.05318
Copy Paste: [[2508.05318]] mKG-RAG: Multimodal Knowledge Graph-Enhanced RAG for Visual Question Answering(https://arxiv.org/abs/2508.05318)
Keywords: generation
Abstract: Recently, Retrieval-Augmented Generation (RAG) has been proposed to expand internal knowledge of Multimodal Large Language Models (MLLMs) by incorporating external knowledge databases into the generation process, which is widely used for knowledge-based Visual Question Answering (VQA) tasks. Despite impressive advancements, vanilla RAG-based VQA methods that rely on unstructured documents and overlook the structural relationships among knowledge elements frequently introduce irrelevant or misleading content, reducing answer accuracy and reliability. To overcome these challenges, a promising solution is to integrate multimodal knowledge graphs (KGs) into RAG-based VQA frameworks to enhance the generation by introducing structured multimodal knowledge. Therefore, in this paper, we propose a novel multimodal knowledge-augmented generation framework (mKG-RAG) based on multimodal KGs for knowledge-intensive VQA tasks. Specifically, our approach leverages MLLM-powered keyword extraction and vision-text matching to distill semantically consistent and modality-aligned entities/relationships from multimodal documents, constructing high-quality multimodal KGs as structured knowledge representations. In addition, a dual-stage retrieval strategy equipped with a question-aware multimodal retriever is introduced to improve retrieval efficiency while refining precision. Comprehensive experiments demonstrate that our approach significantly outperforms existing methods, setting a new state-of-the-art for knowledge-based VQA.

Title: PriorRG: Prior-Guided Contrastive Pre-training and Coarse-to-Fine Decoding for Chest X-ray Report Generation

Authors: Kang Liu, Zhuoqi Ma, Zikang Fang, Yunan Li, Kun Xie, Qiguang Miao
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.05353
Pdf URL: https://arxiv.org/pdf/2508.05353
Copy Paste: [[2508.05353]] PriorRG: Prior-Guided Contrastive Pre-training and Coarse-to-Fine Decoding for Chest X-ray Report Generation(https://arxiv.org/abs/2508.05353)
Keywords: generation
Abstract: Chest X-ray report generation aims to reduce radiologists' workload by automatically producing high-quality preliminary reports. A critical yet underexplored aspect of this task is the effective use of patient-specific prior knowledge -- including clinical context (e.g., symptoms, medical history) and the most recent prior image -- which radiologists routinely rely on for diagnostic reasoning. Most existing methods generate reports from single images, neglecting this essential prior information and thus failing to capture diagnostic intent or disease progression. To bridge this gap, we propose PriorRG, a novel chest X-ray report generation framework that emulates real-world clinical workflows via a two-stage training pipeline. In Stage 1, we introduce a prior-guided contrastive pre-training scheme that leverages clinical context to guide spatiotemporal feature extraction, allowing the model to align more closely with the intrinsic spatiotemporal semantics in radiology reports. In Stage 2, we present a prior-aware coarse-to-fine decoding for report generation that progressively integrates patient-specific prior knowledge with the vision encoder's hidden states. This decoding allows the model to align with diagnostic focus and track disease progression, thereby enhancing the clinical accuracy and fluency of the generated reports. Extensive experiments on MIMIC-CXR and MIMIC-ABN datasets demonstrate that PriorRG outperforms state-of-the-art methods, achieving a 3.6% BLEU-4 and 3.8% F1 score improvement on MIMIC-CXR, and a 5.9% BLEU-1 gain on MIMIC-ABN. Code and checkpoints will be released upon acceptance.

Title: CT-GRAPH: Hierarchical Graph Attention Network for Anatomy-Guided CT Report Generation

Authors: Hamza Kalisch, Fabian Hörst, Jens Kleesiek, Ken Herrmann, Constantin Seibold
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.05375
Pdf URL: https://arxiv.org/pdf/2508.05375
Copy Paste: [[2508.05375]] CT-GRAPH: Hierarchical Graph Attention Network for Anatomy-Guided CT Report Generation(https://arxiv.org/abs/2508.05375)
Keywords: generation
Abstract: As medical imaging is central to diagnostic processes, automating the generation of radiology reports has become increasingly relevant to assist radiologists with their heavy workloads. Most current methods rely solely on global image features, failing to capture fine-grained organ relationships crucial for accurate reporting. To this end, we propose CT-GRAPH, a hierarchical graph attention network that explicitly models radiological knowledge by structuring anatomical regions into a graph, linking fine-grained organ features to coarser anatomical systems and a global patient context. Our method leverages pretrained 3D medical feature encoders to obtain global and organ-level features by utilizing anatomical masks. These features are further refined within the graph and then integrated into a large language model to generate detailed medical reports. We evaluate our approach for the task of report generation on the large-scale chest CT dataset CT-RATE. We provide an in-depth analysis of pretrained feature encoders for CT report generation and show that our method achieves a substantial absolute improvement of 7.9% in F1 score over current state-of-the-art methods. The code is publicly available at this https URL.

Title: Echo: Decoupling Inference and Training for Large-Scale RL Alignment on Heterogeneous Swarms

Authors: Jie Xiao, Shaoduo Gan, Changyuan Fan, Qingnan Ren, Alfred Long, Yuchen Zhang, Rymon Yu, Eric Yang, Lynn Ai
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.05387
Pdf URL: https://arxiv.org/pdf/2508.05387
Copy Paste: [[2508.05387]] Echo: Decoupling Inference and Training for Large-Scale RL Alignment on Heterogeneous Swarms(https://arxiv.org/abs/2508.05387)
Keywords: generation
Abstract: Modern RL-based post-training for large language models (LLMs) co-locates trajectory sampling and policy optimisation on the same GPU cluster, forcing the system to switch between inference and training workloads. This serial context switching violates the single-program-multiple-data (SPMD) assumption underlying today's distributed training systems. We present Echo, an RL system that cleanly decouples these two phases across heterogeneous "inference" and "training" swarms while preserving statistical efficiency. Echo introduces two lightweight synchronization protocols: a sequential pull mode that refreshes sampler weights on every API call for minimal bias, and an asynchronous push-pull mode that streams version-tagged rollouts through a replay buffer to maximise hardware utilisation. Training three representative RL workloads with Qwen3-4B, Qwen2.5-7B and Qwen3-32B on a geographically distributed cluster, Echo matches a fully co-located Verl baseline in convergence speed and final reward while off-loading trajectory generation to commodity edge hardware. These promising results demonstrate that large-scale RL for LLMs could achieve datacentre-grade performance using decentralised, heterogeneous resources.
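The asynchronous push-pull mode in the Echo abstract streams version-tagged rollouts through a replay buffer. The following is a minimal sketch of that idea, not Echo's implementation: the `Rollout` and `VersionedReplayBuffer` names and the `max_staleness` cutoff are hypothetical, introduced here only to show how a version tag lets a trainer skip rollouts produced by outdated weights.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Rollout:
    policy_version: int  # version of the weights that produced this trajectory
    trajectory: list     # placeholder payload (tokens, rewards, ...)


class VersionedReplayBuffer:
    """Toy buffer: samplers push tagged rollouts, the trainer pulls fresh ones."""

    def __init__(self, capacity: int, max_staleness: int):
        self.buffer = deque(maxlen=capacity)  # oldest entries fall off when full
        self.max_staleness = max_staleness    # hypothetical cutoff, in versions

    def push(self, rollout: Rollout) -> None:
        self.buffer.append(rollout)

    def pull(self, current_version: int, batch_size: int) -> list:
        """Return up to batch_size rollouts that are not too stale, oldest first."""
        batch = []
        kept = deque(maxlen=self.buffer.maxlen)
        while self.buffer:
            r = self.buffer.popleft()
            if current_version - r.policy_version > self.max_staleness:
                continue  # drop rollouts generated by weights that are too old
            if len(batch) < batch_size:
                batch.append(r)
            else:
                kept.append(r)  # leave the remainder for the next pull
        self.buffer = kept
        return batch
```

In this sketch a sampler never blocks on the trainer: it keeps pushing, and staleness is handled entirely on the pull side.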

Title: UNCAGE: Contrastive Attention Guidance for Masked Generative Transformers in Text-to-Image Generation

Authors: Wonjun Kang, Byeongkeun Ahn, Minjae Lee, Kevin Galim, Seunghyuk Oh, Hyung Il Koo, Nam Ik Cho
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2508.05399
Pdf URL: https://arxiv.org/pdf/2508.05399
Copy Paste: [[2508.05399]] UNCAGE: Contrastive Attention Guidance for Masked Generative Transformers in Text-to-Image Generation(https://arxiv.org/abs/2508.05399)
Keywords: generation, generative
Abstract: Text-to-image (T2I) generation has been actively studied using Diffusion Models and Autoregressive Models. Recently, Masked Generative Transformers have gained attention as an alternative to Autoregressive Models to overcome the inherent limitations of causal attention and autoregressive decoding through bidirectional attention and parallel decoding, enabling efficient and high-quality image generation. However, compositional T2I generation remains challenging, as even state-of-the-art Diffusion Models often fail to accurately bind attributes and achieve proper text-image alignment. While Diffusion Models have been extensively studied for this issue, Masked Generative Transformers exhibit similar limitations but have not been explored in this context. To address this, we propose Unmasking with Contrastive Attention Guidance (UNCAGE), a novel training-free method that improves compositional fidelity by leveraging attention maps to prioritize the unmasking of tokens that clearly represent individual objects. UNCAGE consistently improves performance in both quantitative and qualitative evaluations across multiple benchmarks and metrics, with negligible inference overhead. Our code is available at this https URL.
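The UNCAGE abstract says that attention maps are used to prioritize unmasking tokens that clearly represent a single object. The abstract does not give the scoring rule, so the sketch below uses an assumed one: for each image position, the gap between the strongest and second-strongest object attention. The function names and the toy attention values are hypothetical.

```python
import numpy as np


def contrastive_scores(attn: np.ndarray) -> np.ndarray:
    """attn has shape (num_objects, num_positions), with at least two objects.

    Assumed score: top object attention minus runner-up attention per position.
    A large gap means the position is claimed by one object only.
    """
    top2 = np.sort(attn, axis=0)[-2:]  # two largest values per position
    return top2[1] - top2[0]


def pick_positions(attn: np.ndarray, masked: np.ndarray, k: int) -> list:
    """Choose the k still-masked positions with the clearest object ownership."""
    scores = np.where(masked, contrastive_scores(attn), -np.inf)
    order = np.argsort(-scores, kind="stable")
    return order[:k].tolist()


# Toy attention for two objects over four image positions.
attn = np.array([[0.9, 0.5, 0.1, 0.2],
                 [0.1, 0.5, 0.8, 0.3]])
masked = np.array([True, True, True, False])
chosen = pick_positions(attn, masked, k=2)
```

Position 1 is attended equally by both objects, so it is deferred, which is the behaviour the abstract motivates.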

Title: MolSnap: Snap-Fast Molecular Generation with Latent Variational Mean Flow

Authors: Md Atik Ahamed, Qiang Ye, Qiang Cheng
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.05411
Pdf URL: https://arxiv.org/pdf/2508.05411
Copy Paste: [[2508.05411]] MolSnap: Snap-Fast Molecular Generation with Latent Variational Mean Flow(https://arxiv.org/abs/2508.05411)
Keywords: generation
Abstract: Molecular generation conditioned on textual descriptions is a fundamental task in computational chemistry and drug discovery. Existing methods often struggle to simultaneously ensure high-quality, diverse generation and fast inference. In this work, we propose a novel causality-aware framework that addresses these challenges through two key innovations. First, we introduce a Causality-Aware Transformer (CAT) that jointly encodes molecular graph tokens and text instructions while enforcing causal dependencies during generation. Second, we develop a Variational Mean Flow (VMF) framework that generalizes existing flow-based methods by modeling the latent space as a mixture of Gaussians, enhancing expressiveness beyond unimodal priors. VMF enables efficient one-step inference while maintaining strong generation quality and diversity. Extensive experiments on four standard molecular benchmarks demonstrate that our model outperforms state-of-the-art baselines, achieving higher novelty (up to 74.5%), diversity (up to 70.3%), and 100% validity across all datasets. Moreover, VMF requires only a single function evaluation (NFE) during conditional generation and up to five NFEs for unconditional generation, offering substantial computational efficiency over diffusion-based methods.

Title: Discovering Interpretable Programmatic Policies via Multimodal LLM-assisted Evolutionary Search

Authors: Qinglong Hu, Xialiang Tong, Mingxuan Yuan, Fei Liu, Zhichao Lu, Qingfu Zhang
Subjects: cs.LG, cs.NE
Abstract URL: https://arxiv.org/abs/2508.05433
Pdf URL: https://arxiv.org/pdf/2508.05433
Copy Paste: [[2508.05433]] Discovering Interpretable Programmatic Policies via Multimodal LLM-assisted Evolutionary Search(https://arxiv.org/abs/2508.05433)
Keywords: generation
Abstract: Interpretability and high performance are essential goals in designing control policies, particularly for safety-critical tasks. Deep reinforcement learning has greatly enhanced performance, yet its inherent lack of interpretability often undermines trust and hinders real-world deployment. This work addresses these dual challenges by introducing a novel approach for programmatic policy discovery, called Multimodal Large Language Model-assisted Evolutionary Search (MLES). MLES utilizes multimodal large language models as policy generators, combining them with evolutionary mechanisms for automatic policy optimization. It integrates visual feedback-driven behavior analysis within the policy generation process to identify failure patterns and facilitate targeted improvements, enhancing the efficiency of policy discovery and producing adaptable, human-aligned policies. Experimental results show that MLES achieves policy discovery capabilities and efficiency comparable to Proximal Policy Optimization (PPO) across two control tasks, while offering transparent control logic and traceable design processes. This paradigm overcomes the limitations of predefined domain-specific languages, facilitates knowledge transfer and reuse, and is scalable across various control tasks. MLES shows promise as a leading approach for the next generation of interpretable control policy discovery.

Title: EnergyPatchTST: Multi-scale Time Series Transformers with Uncertainty Estimation for Energy Forecasting

Authors: Wei Li, Zixin Wang, Qizheng Sun, Qixiang Gao, Fenglei Yang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.05454
Pdf URL: https://arxiv.org/pdf/2508.05454
Copy Paste: [[2508.05454]] EnergyPatchTST: Multi-scale Time Series Transformers with Uncertainty Estimation for Energy Forecasting(https://arxiv.org/abs/2508.05454)
Keywords: generation
Abstract: Accurate and reliable energy time series prediction is of great significance for power generation planning and allocation. Deep learning has become the mainstream approach to time series prediction. However, the multi-scale temporal dynamics and the irregularity of real data limit existing methods. Therefore, we propose EnergyPatchTST, an extension of the Patch Time Series Transformer designed specifically for energy forecasting. The main innovations of our method are as follows: (1) a multi-scale feature extraction mechanism to capture patterns at different time resolutions; (2) a probability prediction framework to estimate uncertainty through Monte Carlo elimination; (3) an integration path for future known variables (such as temperature and wind conditions); and (4) pre-training and fine-tuning examples to enhance performance on limited energy data sets. A series of experiments on common energy data sets shows that EnergyPatchTST outperforms other commonly used methods, reducing the prediction error by 7-12% and providing reliable uncertainty estimation, which offers an important reference for time series prediction in the energy field.
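The first innovation listed in the EnergyPatchTST abstract is multi-scale feature extraction over patches. As a rough illustration of what patching a series at several resolutions looks like, here is a sketch; the `(patch_len, stride)` pairs and function names are assumptions chosen for the example, not values from the paper.

```python
import numpy as np


def make_patches(series: np.ndarray, patch_len: int, stride: int) -> np.ndarray:
    """Slice a 1-D series into windows of shape (n_patches, patch_len)."""
    windows = np.lib.stride_tricks.sliding_window_view(series, patch_len)
    return windows[::stride]


def multi_scale_patches(series: np.ndarray, scales=((4, 2), (8, 4), (16, 8))) -> dict:
    """Return one patch array per (patch_len, stride) pair, keyed by patch_len."""
    return {p: make_patches(series, p, s) for p, s in scales}


# Example: 32 hourly load readings patched at three resolutions.
series = np.arange(32, dtype=float)
patches = multi_scale_patches(series)
```

Short patches expose fast fluctuations and long patches expose slower trends; a model would embed each scale separately before combining them.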

Title: FS-IQA: Certified Feature Smoothing for Robust Image Quality Assessment

Authors: Ekaterina Shumitskaya, Dmitriy Vatolin, Anastasia Antsiferova
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.05516
Pdf URL: https://arxiv.org/pdf/2508.05516
Copy Paste: [[2508.05516]] FS-IQA: Certified Feature Smoothing for Robust Image Quality Assessment(https://arxiv.org/abs/2508.05516)
Keywords: quality assessment
Abstract: We propose a novel certified defense method for Image Quality Assessment (IQA) models based on randomized smoothing with noise applied in the feature space rather than the input space. Unlike prior approaches that inject Gaussian noise directly into input images, often degrading visual quality, our method preserves image fidelity while providing robustness guarantees. To formally connect noise levels in the feature space with corresponding input-space perturbations, we analyze the maximum singular value of the backbone network's Jacobian. Our approach supports both full-reference (FR) and no-reference (NR) IQA models without requiring any architectural modifications, suitable for various scenarios. It is also computationally efficient, requiring a single backbone forward pass per image. Compared to previous methods, it reduces inference time by 99.5% without certification and by 20.6% when certification is applied. We validate our method with extensive experiments on two benchmark datasets, involving six widely-used FR and NR IQA models and comparisons against five state-of-the-art certified defenses. Our results demonstrate consistent improvements in correlation with subjective quality scores by up to 30.9%.
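Two points in the FS-IQA abstract lend themselves to a small numerical sketch: noise is injected into features after a single backbone pass, and the largest singular value of the backbone Jacobian relates feature-space movement to input-space perturbations. The example below uses a toy linear backbone and head (both assumptions), for which the Jacobian is simply the weight matrix, and it aggregates noisy scores with a median, which is also an assumption rather than the paper's stated rule.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins (assumptions): a linear "backbone" and a linear scoring "head".
W = rng.normal(size=(16, 64))  # backbone: features = W @ image_vector
h = rng.normal(size=16)        # head: score = h @ features


def smoothed_score(x: np.ndarray, sigma: float, n_samples: int = 200) -> float:
    """One backbone pass, then many cheap noisy head evaluations."""
    feats = W @ x  # single backbone forward pass
    noise = rng.normal(scale=sigma, size=(n_samples, feats.size))
    scores = (feats + noise) @ h  # head applied to each noisy feature copy
    return float(np.median(scores))


# For a linear backbone the Jacobian is W, so its largest singular value
# bounds how far the features move for a given input perturbation.
sigma_max = np.linalg.svd(W, compute_uv=False)[0]
delta = rng.normal(size=64) * 1e-2
feature_shift = np.linalg.norm(W @ delta)
input_shift = np.linalg.norm(delta)
```

For a nonlinear backbone the bound would only hold locally or would need a bound on the Jacobian over the perturbation region; the linear case is used here because the inequality is exact.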

Title: When Deepfake Detection Meets Graph Neural Network:a Unified and Lightweight Learning Framework

Authors: Haoyu Liu, Chaoyu Gong, Mengke He, Jiate Li, Kai Han, Siqiang Luo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.05526
Pdf URL: https://arxiv.org/pdf/2508.05526
Copy Paste: [[2508.05526]] When Deepfake Detection Meets Graph Neural Network:a Unified and Lightweight Learning Framework(https://arxiv.org/abs/2508.05526)
Keywords: generative
Abstract: The proliferation of generative video models has made detecting AI-generated and manipulated videos an urgent challenge. Existing detection approaches often fail to generalize across diverse manipulation types due to their reliance on isolated spatial, temporal, or spectral information, and typically require large models to perform well. This paper introduces SSTGNN, a lightweight Spatial-Spectral-Temporal Graph Neural Network framework that represents videos as structured graphs, enabling joint reasoning over spatial inconsistencies, temporal artifacts, and spectral distortions. SSTGNN incorporates learnable spectral filters and temporal differential modeling into a graph-based architecture, capturing subtle manipulation traces more effectively. Extensive experiments on diverse benchmark datasets demonstrate that SSTGNN not only achieves superior performance in both in-domain and cross-domain settings, but also offers strong robustness against unseen manipulations. Remarkably, SSTGNN accomplishes these results with up to 42.4× fewer parameters than state-of-the-art models, making it highly lightweight and scalable for real-world deployment.

Title: Tractable Sharpness-Aware Learning of Probabilistic Circuits

Authors: Hrithik Suresh, Sahil Sidheekh, Vishnu Shreeram M.P, Sriraam Natarajan, Narayanan C. Krishnan
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.05537
Pdf URL: https://arxiv.org/pdf/2508.05537
Copy Paste: [[2508.05537]] Tractable Sharpness-Aware Learning of Probabilistic Circuits(https://arxiv.org/abs/2508.05537)
Keywords: generative
Abstract: Probabilistic Circuits (PCs) are a class of generative models that allow exact and tractable inference for a wide range of queries. While recent developments have enabled the learning of deep and expressive PCs, this increased capacity can often lead to overfitting, especially when data is limited. We analyze PC overfitting from a log-likelihood-landscape perspective and show that it is often caused by convergence to sharp optima that generalize poorly. Inspired by sharpness-aware minimization in neural networks, we propose a Hessian-based regularizer for training PCs. As a key contribution, we show that the trace of the Hessian of the log-likelihood (a sharpness proxy that is typically intractable in deep neural networks) can be computed efficiently for PCs. Minimizing this Hessian trace induces a gradient-norm-based regularizer that yields simple closed-form parameter updates for EM and integrates seamlessly with gradient-based learning methods. Experiments on synthetic and real-world datasets demonstrate that our method consistently guides PCs toward flatter minima and improves generalization performance.
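The link the abstract draws between the Hessian trace and a gradient-norm regularizer can be seen in a toy case. The sketch below is not the paper's derivation: it assumes a single sum node p(x) = sum_k w_k p_k(x) with the weights treated as free parameters (normalization ignored). There, the second derivative of log p with respect to w_k is -(p_k / p)^2, so the Hessian trace equals minus the squared gradient norm, which the code checks with finite differences.

```python
import numpy as np


def log_lik(w: np.ndarray, child_probs: np.ndarray) -> float:
    """Log-likelihood of one sample under a single sum node with weights w."""
    return float(np.log(w @ child_probs))


def hessian_trace_closed_form(w: np.ndarray, child_probs: np.ndarray) -> float:
    """Trace of the Hessian w.r.t. w: minus the squared norm of the gradient."""
    grad = child_probs / (w @ child_probs)  # d log p / d w_k = p_k / p
    return float(-np.sum(grad ** 2))


def hessian_trace_finite_diff(w: np.ndarray, child_probs: np.ndarray, eps: float = 1e-4) -> float:
    """Numerical check: sum of central second differences along each coordinate."""
    total = 0.0
    for k in range(w.size):
        e = np.zeros_like(w)
        e[k] = eps
        total += (log_lik(w + e, child_probs)
                  - 2.0 * log_lik(w, child_probs)
                  + log_lik(w - e, child_probs)) / eps ** 2
    return total


w = np.array([0.2, 0.5, 0.3])
child_probs = np.array([0.10, 0.40, 0.25])  # p_k(x) for one sample x
closed = hessian_trace_closed_form(w, child_probs)
numeric = hessian_trace_finite_diff(w, child_probs)
```

The closed form needs only quantities from one forward evaluation, which hints at why such a trace can be cheap for circuits, whereas a deep network would require second-order backpropagation or stochastic estimators.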

Title: Follow-Your-Instruction: A Comprehensive MLLM Agent for World Data Synthesis

Authors: Kunyu Feng, Yue Ma, Xinhua Zhang, Boshi Liu, Yikuang Yuluo, Yinhan Zhang, Runtao Liu, Hongyu Liu, Zhiyuan Qin, Shanhui Mo, Qifeng Chen, Zeyu Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.05580
Pdf URL: https://arxiv.org/pdf/2508.05580
Copy Paste: [[2508.05580]] Follow-Your-Instruction: A Comprehensive MLLM Agent for World Data Synthesis(https://arxiv.org/abs/2508.05580)
Keywords: generative
Abstract: With the growing demands of AI-generated content (AIGC), the need for high-quality, diverse, and scalable data has become increasingly crucial. However, collecting large-scale real-world data remains costly and time-consuming, hindering the development of downstream applications. While some works attempt to collect task-specific data via a rendering process, most approaches still rely on manual scene construction, limiting their scalability and accuracy. To address these challenges, we propose Follow-Your-Instruction, a Multimodal Large Language Model (MLLM)-driven framework for automatically synthesizing high-quality 2D, 3D, and 4D data. Follow-Your-Instruction first collects assets and their associated descriptions through multimodal inputs using the MLLM-Collector. Then it constructs 3D layouts, and leverages Vision-Language Models (VLMs) for semantic refinement through multi-view scenes with the MLLM-Generator and MLLM-Optimizer, respectively. Finally, it uses MLLM-Planner to generate temporally coherent future frames. We evaluate the quality of the generated data through comprehensive experiments on the 2D, 3D, and 4D generative tasks. The results show that our synthetic data significantly boosts the performance of existing baseline models, demonstrating Follow-Your-Instruction's potential as a scalable and effective data engine for generative intelligence.

Title: WeTok: Powerful Discrete Tokenization for High-Fidelity Visual Reconstruction

Authors: Shaobin Zhuang, Yiwei Guo, Canmiao Fu, Zhipeng Huang, Zeyue Tian, Ying Zhang, Chen Li, Yali Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.05599
Pdf URL: https://arxiv.org/pdf/2508.05599
Copy Paste: [[2508.05599]] WeTok: Powerful Discrete Tokenization for High-Fidelity Visual Reconstruction(https://arxiv.org/abs/2508.05599)
Keywords: generation, generative
Abstract: The visual tokenizer is a critical component for vision generation. However, existing tokenizers often face an unsatisfactory trade-off between compression ratio and reconstruction fidelity. To fill this gap, we introduce a powerful and concise WeTok tokenizer, which surpasses the previous leading tokenizers via two core innovations. (1) Group-wise lookup-free Quantization (GQ). We partition the latent features into groups and perform lookup-free quantization for each group. As a result, GQ can efficiently overcome the memory and computation limitations of prior tokenizers, while achieving a reconstruction breakthrough with more scalable codebooks. (2) Generative Decoding (GD). Unlike prior tokenizers, we introduce a generative decoder with a prior of an extra noise variable. In this case, GD can probabilistically model the distribution of visual data conditioned on discrete tokens, allowing WeTok to reconstruct visual details, especially at high compression ratios. Extensive experiments on mainstream benchmarks show the superior performance of WeTok. On the ImageNet 50k validation set, WeTok achieves a record-low zero-shot rFID (WeTok: 0.12 vs. FLUX-VAE: 0.18 vs. SD-VAE 3.5: 0.19). Furthermore, our highest compression model achieves a zero-shot rFID of 3.49 with a compression ratio of 768, outperforming Cosmos (rFID 4.57 at a compression ratio of 384), which has only 50% of our compression ratio. Code and models are available: this https URL.
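Group-wise lookup-free quantization, as the WeTok abstract describes it, partitions latent channels into groups and quantizes each group without a learned codebook. A minimal sketch follows, assuming the common sign-based form of lookup-free quantization, where each channel becomes one bit and the bits of a group form that group's token index. The function name and the bit ordering are choices made for this example.

```python
import numpy as np


def group_lfq(z: np.ndarray, num_groups: int):
    """Quantize a latent vector group by group without a learned codebook.

    Each channel is binarized by sign; the bits of one group, read with the
    first channel as the least significant bit, give that group's index.
    """
    assert z.size % num_groups == 0, "channels must split evenly into groups"
    groups = z.reshape(num_groups, -1)           # (num_groups, bits_per_group)
    bits = (groups > 0).astype(np.int64)         # 1 where positive, else 0
    quantized = np.where(groups > 0, 1.0, -1.0)  # codes in {-1, +1}
    weights = 2 ** np.arange(groups.shape[1])    # binary place values
    indices = bits @ weights                     # one integer index per group
    return quantized.reshape(z.shape), indices


z = np.array([0.3, -1.2, 0.7, 0.1, -0.4, -0.9, 2.0, -0.1])
q, idx = group_lfq(z, num_groups=2)
```

With G groups of b bits each, a latent is represented by G indices in the range 0 to 2^b - 1 instead of a single index into a table of size 2^(G*b), which is one way to read the abstract's claim about memory and computation.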

Title: LLaVA-RE: Binary Image-Text Relevancy Evaluation with Multimodal Large Language Model

Authors: Tao Sun, Oliver Liu, JinJin Li, Lan Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.05602
Pdf URL: https://arxiv.org/pdf/2508.05602
Copy Paste: [[2508.05602]] LLaVA-RE: Binary Image-Text Relevancy Evaluation with Multimodal Large Language Model(https://arxiv.org/abs/2508.05602)
Keywords: generative
Abstract: Multimodal generative AI usually involves generating image or text responses given inputs in another modality. The evaluation of image-text relevancy is essential for measuring response quality or ranking candidate responses. In particular, binary relevancy evaluation, i.e., "Relevant" vs. "Not Relevant", is a fundamental problem. However, this is a challenging task considering that texts have diverse formats and the definition of relevancy varies in different scenarios. We find that Multimodal Large Language Models (MLLMs) are an ideal choice to build such evaluators, as they can flexibly handle complex text formats and take in additional task information. In this paper, we present LLaVA-RE, a first attempt at binary image-text relevancy evaluation with an MLLM. It follows the LLaVA architecture and adopts detailed task instructions and multimodal in-context samples. In addition, we propose a novel binary relevancy data set that covers various tasks. Experimental results validate the effectiveness of our framework.

Title: Uni-cot: Towards Unified Chain-of-Thought Reasoning Across Text and Vision

Authors: Luozheng Qin, Jia Gong, Yuqing Sun, Tianjiao Li, Mengping Yang, Xiaomeng Yang, Chao Qu, Zhiyu Tan, Hao Li
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2508.05606
Pdf URL: https://arxiv.org/pdf/2508.05606
Copy Paste: [[2508.05606]] Uni-cot: Towards Unified Chain-of-Thought Reasoning Across Text and Vision(https://arxiv.org/abs/2508.05606)
Keywords: generation
Abstract: Chain-of-Thought (CoT) reasoning has been widely adopted to enhance Large Language Models (LLMs) by decomposing complex tasks into simpler, sequential subtasks. However, extending CoT to vision-language reasoning tasks remains challenging, as it often requires interpreting transitions of visual states to support reasoning. Existing methods often struggle with this due to a limited capacity to model visual state transitions or to incoherent visual trajectories caused by fragmented architectures. To overcome these limitations, we propose Uni-CoT, a Unified Chain-of-Thought framework that enables coherent and grounded multimodal reasoning within a single unified model. The key idea is to leverage a model capable of both image understanding and generation to reason over visual content and model evolving visual states. However, empowering a unified model to achieve this is non-trivial, given the high computational cost and the burden of training. To address this, Uni-CoT introduces a novel two-level reasoning paradigm: a Macro-Level CoT for high-level task planning and a Micro-Level CoT for subtask execution. This design significantly reduces the computational overhead. Furthermore, we introduce a structured training paradigm that combines interleaved image-text supervision for macro-level CoT with multi-task objectives for micro-level CoT. Together, these innovations allow Uni-CoT to perform scalable and coherent multi-modal reasoning. Thanks to our design, all experiments can be efficiently completed using only 8 A100 GPUs with 80GB VRAM each. Experimental results on a reasoning-driven image generation benchmark (WISE) and editing benchmarks (RISE and KRIS) indicate that Uni-CoT achieves SOTA performance and strong generalization, establishing Uni-CoT as a promising solution for multi-modal reasoning. Project Page and Code: this https URL

Title: Hi3DEval: Advancing 3D Generation Evaluation with Hierarchical Validity

Authors: Yuhan Zhang, Long Zhuo, Ziyang Chu, Tong Wu, Zhibing Li, Liang Pan, Dahua Lin, Ziwei Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.05609
Pdf URL: https://arxiv.org/pdf/2508.05609
Copy Paste: [[2508.05609]] Hi3DEval: Advancing 3D Generation Evaluation with Hierarchical Validity(https://arxiv.org/abs/2508.05609)
Keywords: generation, generative, quality assessment
Abstract: Despite rapid advances in 3D content generation, quality assessment for the generated 3D assets remains challenging. Existing methods mainly rely on image-based metrics and operate solely at the object level, limiting their ability to capture spatial coherence, material authenticity, and high-fidelity local details. 1) To address these challenges, we introduce Hi3DEval, a hierarchical evaluation framework tailored for 3D generative content. It combines both object-level and part-level evaluation, enabling holistic assessments across multiple dimensions as well as fine-grained quality analysis. Additionally, we extend texture evaluation beyond aesthetic appearance by explicitly assessing material realism, focusing on attributes such as albedo, saturation, and metallicness. 2) To support this framework, we construct Hi3DBench, a large-scale dataset comprising diverse 3D assets and high-quality annotations, accompanied by a reliable multi-agent annotation pipeline. We further propose a 3D-aware automated scoring system based on hybrid 3D representations. Specifically, we leverage video-based representations for object-level and material-subject evaluations to enhance modeling of spatio-temporal consistency and employ pretrained 3D features for part-level perception. Extensive experiments demonstrate that our approach outperforms existing image-based metrics in modeling 3D characteristics and achieves superior alignment with human preference, providing a scalable alternative to manual evaluations. The project page is available at this https URL.

Title: TrajEvo: Trajectory Prediction Heuristics Design via LLM-driven Evolution

Authors: Zhikai Zhao, Chuanbo Hua, Federico Berto, Kanghoon Lee, Zihan Ma, Jiachen Li, Jinkyoo Park
Subjects: cs.LG, cs.AI, cs.NE, cs.RO
Abstract URL: https://arxiv.org/abs/2508.05616
Pdf URL: https://arxiv.org/pdf/2508.05616
Copy Paste: [[2508.05616]] TrajEvo: Trajectory Prediction Heuristics Design via LLM-driven Evolution(https://arxiv.org/abs/2508.05616)
Keywords: generation
Abstract: Trajectory prediction is a critical task in modeling human behavior, especially in safety-critical domains such as social robotics and autonomous vehicle navigation. Traditional heuristics based on handcrafted rules often lack accuracy and generalizability. Although deep learning approaches offer improved performance, they typically suffer from high computational cost, limited explainability, and, importantly, poor generalization to out-of-distribution (OOD) scenarios. In this paper, we introduce TrajEvo, a framework that leverages Large Language Models (LLMs) to automatically design trajectory prediction heuristics. TrajEvo employs an evolutionary algorithm to generate and refine prediction heuristics from past trajectory data. We propose two key innovations: Cross-Generation Elite Sampling to encourage population diversity, and a Statistics Feedback Loop that enables the LLM to analyze and improve alternative predictions. Our evaluations demonstrate that TrajEvo outperforms existing heuristic methods across multiple real-world datasets, and notably surpasses both heuristic and deep learning methods in generalizing to an unseen OOD real-world dataset. TrajEvo marks a promising step toward the automated design of fast, explainable, and generalizable trajectory prediction heuristics. We release our source code to facilitate future research at this https URL.
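To make "trajectory prediction heuristic" concrete, the sketch below shows the kind of small function an evolutionary search could generate, score, and select. It is an illustration under assumptions: the constant-velocity rule is a standard baseline rather than a TrajEvo result, average displacement error is used as the fitness measure, and `select_elite` is a hypothetical stand-in for the selection step of an evolutionary loop.

```python
import numpy as np


def constant_velocity(history: np.ndarray, horizon: int) -> np.ndarray:
    """Baseline heuristic: extrapolate the last observed displacement.

    history has shape (T_obs, 2); the result has shape (horizon, 2).
    """
    velocity = history[-1] - history[-2]
    steps = np.arange(1, horizon + 1)[:, None]
    return history[-1] + steps * velocity


def stay_still(history: np.ndarray, horizon: int) -> np.ndarray:
    """Weaker heuristic: repeat the last observed position."""
    return np.repeat(history[-1:], horizon, axis=0)


def average_displacement_error(pred: np.ndarray, truth: np.ndarray) -> float:
    """Mean Euclidean distance between predicted and true positions."""
    return float(np.mean(np.linalg.norm(pred - truth, axis=1)))


def select_elite(candidates, history: np.ndarray, truth: np.ndarray):
    """Keep the candidate heuristic with the lowest error on the given data."""
    horizon = len(truth)
    return min(candidates,
               key=lambda f: average_displacement_error(f(history, horizon), truth))


history = np.array([[0.0, 0.0], [1.0, 0.0]])
truth = np.array([[2.0, 0.0], [3.0, 0.0]])
best = select_elite([stay_still, constant_velocity], history, truth)
```

Because each candidate is plain code, the winning heuristic stays readable and cheap to run, which matches the explainability and speed goals stated in the abstract.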

Title: GAP: Gaussianize Any Point Clouds with Text Guidance

Authors: Weiqi Zhang, Junsheng Zhou, Haotian Geng, Wenyuan Zhang, Yu-Shen Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.05631
Pdf URL: https://arxiv.org/pdf/2508.05631
Copy Paste: [[2508.05631]] GAP: Gaussianize Any Point Clouds with Text Guidance(https://arxiv.org/abs/2508.05631)
Keywords: generation
Abstract: 3D Gaussian Splatting (3DGS) has demonstrated its advantages in achieving fast and high-quality rendering. As point clouds serve as a widely used and easily accessible form of 3D representation, bridging the gap between point clouds and Gaussians becomes increasingly important. Recent studies have explored how to convert colored points into Gaussians, but directly generating Gaussians from colorless 3D point clouds remains an unsolved challenge. In this paper, we propose GAP, a novel approach that gaussianizes raw point clouds into high-fidelity 3D Gaussians with text guidance. Our key idea is to design a multi-view optimization framework that leverages a depth-aware image diffusion model to synthesize consistent appearances across different viewpoints. To ensure geometric accuracy, we introduce a surface-anchoring mechanism that effectively constrains Gaussians to lie on the surfaces of 3D shapes during optimization. Furthermore, GAP incorporates a diffuse-based inpainting strategy that specifically targets completing hard-to-observe regions. We evaluate GAP on the Point-to-Gaussian generation task across varying complexity levels, from synthetic point clouds to challenging real-world scans, and even large-scale scenes. Project Page: this https URL.
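Summary: 3D Gaussian Splatting (3DGS) has demonstrated its advantages in fast, high-quality rendering. Because point clouds are a widely used and easily accessible form of 3D representation, bridging the gap between point clouds and Gaussians is increasingly important. Recent studies have explored how to convert colored points into Gaussians, but directly generating Gaussians from colorless 3D point clouds remains an unsolved challenge. In this paper, we propose GAP, a novel approach that gaussianizes raw point clouds into high-fidelity 3D Gaussians with text guidance. Our key idea is to design a multi-view optimization framework that leverages a depth-aware image diffusion model to synthesize consistent appearances across different viewpoints. To ensure geometric accuracy, we introduce a surface-anchoring mechanism that effectively constrains Gaussians to lie on the surfaces of 3D shapes during optimization. Furthermore, GAP incorporates a diffuse-based inpainting strategy that specifically targets the completion of hard-to-observe regions. We evaluate GAP on the Point-to-Gaussian generation task across varying complexity levels, from synthetic point clouds to challenging real-world scans and even large-scale scenes. Project Page: this https URL.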

Title: FaceAnonyMixer: Cancelable Faces via Identity Consistent Latent Space Mixing

Authors: Mohammed Talha Alam, Fahad Shamshad, Fakhri Karray, Karthik Nandakumar
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.05636
Pdf URL: https://arxiv.org/pdf/2508.05636
Copy Paste: [[2508.05636]] FaceAnonyMixer: Cancelable Faces via Identity Consistent Latent Space Mixing(https://arxiv.org/abs/2508.05636)
Keywords: generation, generative
Abstract: Advancements in face recognition (FR) technologies have amplified privacy concerns, necessitating methods that protect identity while maintaining recognition utility. Existing face anonymization methods typically focus on obscuring identity but fail to meet the requirements of biometric template protection, including revocability, unlinkability, and irreversibility. We propose FaceAnonyMixer, a cancelable face generation framework that leverages the latent space of a pre-trained generative model to synthesize privacy-preserving face images. The core idea of FaceAnonyMixer is to irreversibly mix the latent code of a real face image with a synthetic code derived from a revocable key. The mixed latent code is further refined through a carefully designed multi-objective loss to satisfy all cancelable biometric requirements. FaceAnonyMixer is capable of generating high-quality cancelable faces that can be directly matched using existing FR systems without requiring any modifications. Extensive experiments on benchmark datasets demonstrate that FaceAnonyMixer delivers superior recognition accuracy while providing significantly stronger privacy protection, achieving over an 11% gain on commercial API compared to recent cancelable biometric methods. Code is available at: this https URL.
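Summary: Advances in face recognition (FR) technologies have amplified privacy concerns, calling for methods that protect identity while maintaining recognition utility. Existing face anonymization methods typically focus on obscuring identity but fail to meet the requirements of biometric template protection, including revocability, unlinkability, and irreversibility. We propose FaceAnonyMixer, a cancelable face generation framework that leverages the latent space of a pre-trained generative model to synthesize privacy-preserving face images. The core idea of FaceAnonyMixer is to irreversibly mix the latent code of a real face image with a synthetic code derived from a revocable key. The mixed latent code is further refined through a carefully designed multi-objective loss to satisfy all cancelable biometric requirements. FaceAnonyMixer can generate high-quality cancelable faces that can be matched directly by existing FR systems without any modifications. Extensive experiments on benchmark datasets show that FaceAnonyMixer delivers superior recognition accuracy while providing significantly stronger privacy protection, achieving over an 11% gain on a commercial API compared to recent cancelable biometric methods. Code is available at: this https URL.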