2025-06-10

Title: Wine Quality Prediction with Ensemble Trees: A Unified, Leak-Free Comparative Study

Authors: Zilang Chen
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.06327
Pdf URL: https://arxiv.org/pdf/2506.06327
Copy Paste: [[2506.06327]] Wine Quality Prediction with Ensemble Trees: A Unified, Leak-Free Comparative Study(https://arxiv.org/abs/2506.06327)
Keywords: quality assessment
Abstract: Accurate and reproducible wine-quality assessment is critical for production control yet remains dominated by subjective, labour-intensive tasting panels. We present the first unified benchmark of five ensemble learners (Random Forest, Gradient Boosting, XGBoost, LightGBM, CatBoost) on the canonical Vinho Verde red- and white-wine datasets (1,599 and 4,898 instances, 11 physicochemical attributes). Our leakage-free workflow employs an 80:20 stratified train-test split, five-fold StratifiedGroupKFold within the training set, per-fold standardisation, SMOTE-Tomek resampling, inverse-frequency cost weighting, Optuna hyper-parameter search (120-200 trials per model) and a two-stage feature-selection refit. Final scores on untouched test sets are reported with weighted F1 as the headline metric. Gradient Boosting achieves the highest accuracy (weighted F1 0.693 +/- 0.028 for red and 0.664 +/- 0.016 for white), followed within three percentage points by Random Forest and XGBoost. Limiting each model to its five top-ranked variables lowers dimensionality by 55 percent while reducing weighted F1 by only 2.6 percentage points for red and 3.0 percentage points for white, indicating that alcohol, volatile acidity, sulphates, free SO2 and chlorides capture most predictive signal. Runtime profiling on an EPYC 9K84/H20 node reveals a steep efficiency gradient: Gradient Boosting averages 12 h per five-fold study, XGBoost and LightGBM require 2-3 h, CatBoost 1 h, and Random Forest under 50 min. We therefore recommend Random Forest as the most cost-effective production model, XGBoost and LightGBM as GPU-efficient alternatives, and Gradient Boosting as the accuracy ceiling for offline benchmarking. The fully documented pipeline and metric set provide a reproducible baseline for future work on imbalanced multi-class wine-quality prediction.
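Illustrative sketch (not the authors' code): the abstract names weighted F1 as the headline metric and inverse-frequency cost weighting as one of the imbalance remedies. The pure-Python snippet below shows one common way to compute both. The weight formula n_samples / (n_classes * class_count) is an assumption, since the abstract does not say which variant was used, and the labels are made-up toy data.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count) (assumed variant)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * counts[c]) for c in counts}


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = 0.0
    for c, n_c in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = n_c - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += n_c * f1
    return total / len(y_true)


# Toy quality labels (hypothetical, not taken from the Vinho Verde data).
y_true = [5, 5, 5, 5, 6, 6, 6, 7]
y_pred = [5, 5, 5, 6, 6, 6, 5, 7]

weights = inverse_frequency_weights(y_true)
score = weighted_f1(y_true, y_pred)
print(weights)
print(round(score, 3))
```

In a leak-free workflow of the kind described, such weights would be derived from the training folds only, never from the held-out test set.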

Title: ExplainBench: A Benchmark Framework for Local Model Explanations in Fairness-Critical Applications

Authors: James Afful
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.06330
Pdf URL: https://arxiv.org/pdf/2506.06330
Copy Paste: [[2506.06330]] ExplainBench: A Benchmark Framework for Local Model Explanations in Fairness-Critical Applications(https://arxiv.org/abs/2506.06330)
Keywords: generation
Abstract: As machine learning systems are increasingly deployed in high-stakes domains such as criminal justice, finance, and healthcare, the demand for interpretable and trustworthy models has intensified. Despite the proliferation of local explanation techniques, including SHAP, LIME, and counterfactual methods, there exists no standardized, reproducible framework for their comparative evaluation, particularly in fairness-sensitive settings. We introduce ExplainBench, an open-source benchmarking suite for systematic evaluation of local model explanations across ethically consequential datasets. ExplainBench provides unified wrappers for popular explanation algorithms, integrates end-to-end pipelines for model training and explanation generation, and supports evaluation via fidelity, sparsity, and robustness metrics. The framework includes a Streamlit-based graphical interface for interactive exploration and is packaged as a Python module for seamless integration into research workflows. We demonstrate ExplainBench on datasets commonly used in fairness research, such as COMPAS, UCI Adult Income, and LendingClub, and showcase how different explanation methods behave under a shared experimental protocol. By enabling reproducible, comparative analysis of local explanations, ExplainBench advances the methodological foundations of interpretable machine learning and facilitates accountability in real-world AI systems.

Title: From Transformers to Large Language Models: A systematic review of AI applications in the energy sector towards Agentic Digital Twins

Authors: Gabriel Antonesi, Tudor Cioara, Ionut Anghel, Vasilis Michalakopoulos, Elissaios Sarmas, Liana Toderean
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.06359
Pdf URL: https://arxiv.org/pdf/2506.06359
Copy Paste: [[2506.06359]] From Transformers to Large Language Models: A systematic review of AI applications in the energy sector towards Agentic Digital Twins(https://arxiv.org/abs/2506.06359)
Keywords: generation, generative
Abstract: Artificial intelligence (AI) has long promised to improve energy management in smart grids by enhancing situational awareness and supporting more effective decision-making. While traditional machine learning has demonstrated notable results in forecasting and optimization, it often struggles with generalization, situational awareness, and heterogeneous data integration. Recent advances in foundation models such as the Transformer architecture and Large Language Models (LLMs) have demonstrated improved capabilities in modelling complex temporal and contextual relationships, as well as in multi-modal data fusion, which is essential for most AI applications in the energy sector. In this review we synthesize the rapidly expanding field of AI applications in the energy domain, focusing on Transformers and LLMs. We examine the architectural foundations, domain-specific adaptations and practical implementations of transformer models across various forecasting and grid management tasks. We then explore the emerging role of LLMs in the field: adaptation and fine-tuning for the energy sector, the types of tasks they are suited for, and the new challenges they introduce. Along the way, we highlight practical implementations, innovations, and areas where the research frontier is rapidly expanding. The developments reviewed underscore a broader trend: Generative AI (GenAI) is beginning to augment decision-making not only in high-level planning but also in day-to-day operations, from forecasting and grid balancing to workforce training and asset onboarding. Building on these developments, we introduce the concept of the Agentic Digital Twin, a next-generation model that integrates LLMs to bring autonomy, proactivity, and social interaction into digital twin-based energy management systems.

Title: Beyond the Norm: A Survey of Synthetic Data Generation for Rare Events

Authors: Jingyi Gu, Xuan Zhang, Guiling Wang
Subjects: cs.LG, cs.AI, cs.CE
Abstract URL: https://arxiv.org/abs/2506.06380
Pdf URL: https://arxiv.org/pdf/2506.06380
Copy Paste: [[2506.06380]] Beyond the Norm: A Survey of Synthetic Data Generation for Rare Events(https://arxiv.org/abs/2506.06380)
Keywords: generation, generative
Abstract: Extreme events, such as market crashes, natural disasters, and pandemics, are rare but catastrophic, often triggering cascading failures across interconnected systems. Accurate prediction and early warning can help minimize losses and improve preparedness. While data-driven methods offer powerful capabilities for extreme event modeling, they require abundant training data, yet extreme event data is inherently scarce, creating a fundamental challenge. Synthetic data generation has emerged as a powerful solution. However, existing surveys focus on general data with an emphasis on privacy preservation, rather than on the unique performance requirements of extreme events. This survey provides the first overview of synthetic data generation for extreme events. We systematically review generative modeling techniques and large language models, particularly those enhanced by statistical theory as well as specialized training and sampling mechanisms to capture heavy-tailed distributions. We summarize benchmark datasets and introduce a tailored evaluation framework covering statistical, dependence, visual, and task-oriented metrics. A central contribution is our in-depth analysis of each metric's applicability in extremeness and domain-specific adaptations, providing actionable guidance for model evaluation in extreme settings. We categorize key application domains and identify underexplored areas like behavioral finance, wildfires, earthquakes, windstorms, and infectious outbreaks. Finally, we outline open challenges, providing a structured foundation for advancing synthetic rare-event research.

Title: Exploring Adversarial Watermarking in Transformer-Based Models: Transferability and Robustness Against Defense Mechanism for Medical Images

Authors: Rifat Sadik, Tanvir Rahman, Arpan Bhattacharjee, Bikash Chandra Halder, Ismail Hossain
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2506.06389
Pdf URL: https://arxiv.org/pdf/2506.06389
Copy Paste: [[2506.06389]] Exploring Adversarial Watermarking in Transformer-Based Models: Transferability and Robustness Against Defense Mechanism for Medical Images(https://arxiv.org/abs/2506.06389)
Keywords: generation
Abstract: Deep learning models have shown remarkable success in dermatological image analysis, offering potential for automated skin disease diagnosis. Previously, convolutional neural network (CNN) based architectures achieved immense popularity and success in computer vision (CV) tasks such as skin image recognition, generation and video analysis. With the emergence of transformer-based models, however, CV tasks are now increasingly carried out using these models. Vision Transformers (ViTs) are one such family of transformer-based models that has shown success in computer vision. They use self-attention mechanisms to achieve state-of-the-art performance across various tasks. However, their reliance on global attention mechanisms makes them susceptible to adversarial perturbations. This paper investigates the susceptibility of ViTs for medical images to adversarial watermarking, a method that adds so-called imperceptible perturbations in order to fool models. By generating adversarial watermarks through Projected Gradient Descent (PGD), we examine the transferability of such attacks to CNNs and analyze the performance of a defense mechanism, adversarial training. Results indicate that while performance is not compromised for clean images, ViTs become much more vulnerable to adversarial attacks, with accuracy dropping to as low as 27.6%. Nevertheless, adversarial training raises it up to 90.0%.
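Illustrative sketch (not the authors' code): the abstract generates adversarial perturbations with Projected Gradient Descent (PGD). The toy below applies L-infinity PGD to a three-feature logistic model rather than a ViT, so that the signed-gradient step and the projection step are visible in a few lines. The weights, input, eps, step size and iteration count are all hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(x, w, b):
    """Probability of class 1 under a logistic model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def pgd_attack(x, y, w, b, eps=0.2, step=0.05, iters=10):
    """L-infinity PGD against p = sigmoid(w.x + b); y is the true label in {0, 1}.

    Each iteration takes a signed-gradient ascent step on the cross-entropy
    loss, then projects back into the eps-ball around x and into [0, 1].
    """
    x_adv = list(x)
    for _ in range(iters):
        p = predict(x_adv, w, b)
        # For cross-entropy loss, d(loss)/d(x_i) = (p - y) * w_i.
        grad = [(p - y) * wi for wi in w]
        for i, g in enumerate(grad):
            xi = x_adv[i] + step * ((g > 0) - (g < 0))
            xi = min(max(xi, x[i] - eps), x[i] + eps)  # project onto the eps-ball
            x_adv[i] = min(max(xi, 0.0), 1.0)  # keep a valid pixel intensity
    return x_adv


# Hypothetical model and input, chosen only to make the effect visible.
w, b = [4.0, -3.0, 2.0], -1.0
x, y = [0.6, 0.3, 0.5], 1

x_adv = pgd_attack(x, y, w, b)
clean_p = predict(x, w, b)
adv_p = predict(x_adv, w, b)
print(round(clean_p, 3), round(adv_p, 3))
```

Adversarial training, the defense evaluated in the paper, would add such perturbed inputs with their true labels to the training batches.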

Title: Unlocking Chemical Insights: Superior Molecular Representations from Intermediate Encoder Layers

Authors: Luis Pinto
Subjects: cs.LG, cs.AI, physics.chem-ph, q-bio.BM
Abstract URL: https://arxiv.org/abs/2506.06443
Pdf URL: https://arxiv.org/pdf/2506.06443
Copy Paste: [[2506.06443]] Unlocking Chemical Insights: Superior Molecular Representations from Intermediate Encoder Layers(https://arxiv.org/abs/2506.06443)
Keywords: generation
Abstract: Pretrained molecular encoders have become indispensable in computational chemistry for tasks such as property prediction and molecular generation. However, the standard practice of relying solely on final-layer embeddings for downstream tasks may discard valuable information. In this work, we challenge this convention by conducting a comprehensive layer-wise analysis of five diverse molecular encoders across 22 ADMET property prediction tasks. Our results demonstrate that embeddings from intermediate layers consistently outperform final-layer representations. Specifically, using fixed embeddings from the optimal intermediate layers improved downstream performance by an average of 5.4%, with gains of up to 28.6%. Furthermore, finetuning up to these intermediate layers yielded even greater average improvements of 8.5%, with performance increases as high as 40.8%, achieving new state-of-the-art results on several benchmarks. Additionally, a strong positive correlation between fixed embedding performance and finetuning outcomes supports an efficient evaluate-then-finetune approach, enabling identification of optimal layers with reduced computational cost. These findings highlight the importance of exploring the full representational depth of molecular encoders to achieve substantial performance improvements and computational efficiency. The code is made publicly available at this https URL.

Title: Synthetic Problem Generation for Reasoning via Quality-Diversity Algorithms

Authors: Alex Havrilla, Edward Hughes, Mikayel Samvelyan, Jacob Abernethy
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.06499
Pdf URL: https://arxiv.org/pdf/2506.06499
Copy Paste: [[2506.06499]] Synthetic Problem Generation for Reasoning via Quality-Diversity Algorithms(https://arxiv.org/abs/2506.06499)
Keywords: generation
Abstract: Large language model (LLM) driven synthetic data generation has emerged as a powerful method for improving model reasoning capabilities. However, most methods either distill large state-of-the-art models into small students or use natural ground-truth problem statements to guarantee problem statement quality. This limits the scalability of these approaches to more complex and diverse problem domains. To address this, we present SPARQ: Synthetic Problem Generation for Reasoning via Quality-Diversity Algorithms, a novel approach for generating high-quality and diverse synthetic math problem and solution pairs using only a single model by measuring a problem's solve-rate: a proxy for problem difficulty. Starting from a seed dataset of 7.5K samples, we generate over 20 million new problem-solution pairs. We show that filtering the generated data by difficulty and then fine-tuning the same model on the resulting data improves relative model performance by up to 24%. Additionally, we conduct ablations studying the impact of synthetic data quantity, quality and diversity on model generalization. We find that higher quality, as measured by problem difficulty, facilitates better in-distribution performance. Further, while generating diverse synthetic data does not as strongly benefit in-distribution performance, filtering for more diverse data facilitates more robust OOD generalization. We also confirm the existence of model and data scaling laws for synthetically generated problems, which positively benefit downstream model generalization.
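Illustrative sketch (not the authors' code): the abstract uses a problem's solve-rate as a proxy for difficulty and filters generated problems by difficulty before fine-tuning. The snippet below shows that idea in its simplest form. The thresholds, the problem names and the attempt outcomes are hypothetical, and how an attempt is judged correct is left outside the sketch.

```python
def solve_rate(attempts):
    """Fraction of sampled solution attempts judged correct (list of booleans)."""
    return sum(attempts) / len(attempts) if attempts else 0.0


def filter_by_difficulty(problems, low=0.1, high=0.9):
    """Keep problems whose solve-rate lies in [low, high].

    A rate near 1 suggests a trivially easy problem; a rate of 0 may mean the
    problem is too hard or ill-posed. Both thresholds are assumed values.
    """
    return [name for name, attempts in problems.items()
            if low <= solve_rate(attempts) <= high]


# Hypothetical outcomes of 8 sampled attempts per generated problem.
problems = {
    "easy": [True] * 8,
    "medium": [True, False] * 4,
    "hard": [True] + [False] * 7,
    "unsolved": [False] * 8,
}

kept = filter_by_difficulty(problems)
print(kept)
```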

Title: Hierarchical and Collaborative LLM-Based Control for Multi-UAV Motion and Communication in Integrated Terrestrial and Non-Terrestrial Networks

Authors: Zijiang Yan, Hao Zhou, Jianhua Pei, Hina Tabassum
Subjects: cs.LG, cs.AI, cs.NI, cs.RO, eess.SY
Abstract URL: https://arxiv.org/abs/2506.06532
Pdf URL: https://arxiv.org/pdf/2506.06532
Copy Paste: [[2506.06532]] Hierarchical and Collaborative LLM-Based Control for Multi-UAV Motion and Communication in Integrated Terrestrial and Non-Terrestrial Networks(https://arxiv.org/abs/2506.06532)
Keywords: generation
Abstract: Unmanned aerial vehicles (UAVs) have been widely adopted in various real-world applications. However, the control and optimization of multi-UAV systems remain a significant challenge, particularly in dynamic and constrained environments. This work explores the joint motion and communication control of multiple UAVs operating within integrated terrestrial and non-terrestrial networks that include high-altitude platform stations (HAPS). Specifically, we consider an aerial highway scenario in which UAVs must accelerate, decelerate, and change lanes to avoid collisions and maintain overall traffic flow. Different from existing studies, we propose a novel hierarchical and collaborative method based on large language models (LLMs). In our approach, an LLM deployed on the HAPS performs UAV access control, while another LLM onboard each UAV handles motion planning and control. This LLM-based framework leverages the rich knowledge embedded in pre-trained models to enable both high-level strategic planning and low-level tactical decisions. This knowledge-driven paradigm holds great potential for the development of next-generation 3D aerial highway systems. Experimental results demonstrate that our proposed collaborative LLM-based method achieves higher system rewards, lower operational costs, and significantly reduced UAV collision rates compared to baseline approaches.

Title: A Deep Learning Approach for Facial Attribute Manipulation and Reconstruction in Surveillance and Reconnaissance

Authors: Anees Nashath Shaik, Barbara Villarini, Vasileios Argyriou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.06578
Pdf URL: https://arxiv.org/pdf/2506.06578
Copy Paste: [[2506.06578]] A Deep Learning Approach for Facial Attribute Manipulation and Reconstruction in Surveillance and Reconnaissance(https://arxiv.org/abs/2506.06578)
Keywords: generative
Abstract: Surveillance systems play a critical role in security and reconnaissance, but their performance is often compromised by low-quality images and videos, leading to reduced accuracy in face recognition. Additionally, existing AI-based facial analysis models suffer from biases related to skin tone variations and partially occluded faces, further limiting their effectiveness in diverse real-world scenarios. These challenges are the results of data limitations and imbalances, where available training datasets lack sufficient diversity, resulting in unfair and unreliable facial recognition performance. To address these issues, we propose a data-driven platform that enhances surveillance capabilities by generating synthetic training data tailored to compensate for dataset biases. Our approach leverages deep learning-based facial attribute manipulation and reconstruction using autoencoders and Generative Adversarial Networks (GANs) to create diverse and high-quality facial datasets. Additionally, our system integrates an image enhancement module, improving the clarity of low-resolution or occluded faces in surveillance footage. We evaluate our approach using the CelebA dataset, demonstrating that the proposed platform enhances both training data diversity and model fairness. This work contributes to reducing bias in AI-based facial analysis and improving surveillance accuracy in challenging environments, leading to fairer and more reliable security applications.

Title: Towards Efficient Multi-LLM Inference: Characterization and Analysis of LLM Routing and Hierarchical Techniques

Authors: Adarsh Prasad Behera, Jaya Prakash Champati, Roberto Morabito, Sasu Tarkoma, James Gross
Subjects: cs.LG, cs.AI, cs.CL, cs.DC
Abstract URL: https://arxiv.org/abs/2506.06579
Pdf URL: https://arxiv.org/pdf/2506.06579
Copy Paste: [[2506.06579]] Towards Efficient Multi-LLM Inference: Characterization and Analysis of LLM Routing and Hierarchical Techniques(https://arxiv.org/abs/2506.06579)
Keywords: generation
Abstract: Recent progress in Language Models (LMs) has dramatically advanced the field of natural language processing (NLP), excelling at tasks like text generation, summarization, and question answering. However, their inference remains computationally expensive and energy-intensive, especially in settings with limited hardware, power, or bandwidth. This makes it difficult to deploy LMs in mobile, edge, or cost-sensitive environments. To address these challenges, recent approaches have introduced multi-LLM intelligent model selection strategies that dynamically allocate computational resources based on query complexity -- using lightweight models for simpler queries and escalating to larger models only when necessary. This survey explores two complementary strategies for efficient LLM inference: (i) routing, which selects the most suitable model based on the query, and (ii) cascading or hierarchical inference (HI), which escalates queries through a sequence of models until a confident response is found. Both approaches aim to reduce computation by using lightweight models for simpler tasks while offloading only when needed. We provide a comparative analysis of these techniques across key performance metrics, discuss benchmarking efforts, and outline open challenges. Finally, we outline future research directions to enable faster response times, adaptive model selection based on task complexity, and scalable deployment across heterogeneous environments, making LLM-based systems more efficient and accessible for real-world applications.
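Illustrative sketch (not taken from the survey): the two strategies contrasted in the abstract differ in when the model choice is made. Routing picks one model up front from the query alone, while cascading (hierarchical inference) runs models from cheapest to largest until one is confident enough. The stand-in models, the word-count heuristic and the 0.8 threshold below are all hypothetical.

```python
from typing import Callable, List, Tuple

# A model maps a query to (answer, confidence in [0, 1]).
Model = Callable[[str], Tuple[str, float]]


def cascade(query: str, models: List[Model], threshold: float = 0.8):
    """Try models in order; stop at the first answer with enough confidence.

    Returns the answer and how many models were invoked.
    """
    answer, used = "", 0
    for used, model in enumerate(models, start=1):
        answer, conf = model(query)
        if conf >= threshold:
            break
    return answer, used


def route(query: str, models: List[Model], pick: Callable[[str], int]) -> str:
    """Send the query to exactly one model, chosen up front by the router."""
    answer, _ = models[pick(query)](query)
    return answer


# Stand-in models: the small one is only confident on short queries.
def small(q: str) -> Tuple[str, float]:
    return "small:" + q, 0.9 if len(q.split()) <= 4 else 0.3


def large(q: str) -> Tuple[str, float]:
    return "large:" + q, 0.95


def by_length(q: str) -> int:
    return 0 if len(q.split()) <= 4 else 1


short_q = "what is 2+2"
long_q = "explain the proof of the four colour theorem"

print(cascade(short_q, [small, large]))
print(cascade(long_q, [small, large]))
print(route(long_q, [small, large], by_length))
```

The trade-off visible here is that the cascade pays for the small model even when it ends up escalating, whereas the router pays for one model only but depends on the quality of its up-front choice.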

Title: Breaking Data Silos: Towards Open and Scalable Mobility Foundation Models via Generative Continual Learning

Authors: Yuan Yuan, Yukun Liu, Chonghua Han, Jie Feng, Yong Li
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2506.06694
Pdf URL: https://arxiv.org/pdf/2506.06694
Copy Paste: [[2506.06694]] Breaking Data Silos: Towards Open and Scalable Mobility Foundation Models via Generative Continual Learning(https://arxiv.org/abs/2506.06694)
Keywords: generative
Abstract: Foundation models have revolutionized fields such as natural language processing and computer vision by enabling general-purpose learning across diverse tasks and datasets. However, building analogous models for human mobility remains challenging due to the privacy-sensitive nature of mobility data and the resulting data silos across institutions. To bridge this gap, we propose MoveGCL, a scalable and privacy-preserving framework for training mobility foundation models via generative continual learning. Without sharing raw data, MoveGCL enables decentralized and progressive model evolution by replaying synthetic trajectories generated from a frozen teacher model, and reinforces knowledge retention through a tailored distillation strategy that mitigates catastrophic forgetting. To address the heterogeneity of mobility patterns, MoveGCL incorporates a Mixture-of-Experts Transformer with a mobility-aware expert routing mechanism, and employs a layer-wise progressive adaptation strategy to stabilize continual updates. Experiments on six real-world urban datasets demonstrate that MoveGCL achieves performance comparable to joint training and significantly outperforms federated learning baselines, while offering strong privacy protection. MoveGCL marks a crucial step toward unlocking foundation models for mobility, offering a practical blueprint for open, scalable, and privacy-preserving model development in the era of foundation models.

Title: A Systematic Investigation on Deep Learning-Based Omnidirectional Image and Video Super-Resolution

Authors: Qianqian Zhao, Chunle Guo, Tianyi Zhang, Junpei Zhang, Peiyang Jia, Tan Su, Wenjie Jiang, Chongyi Li
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2506.06710
Pdf URL: https://arxiv.org/pdf/2506.06710
Copy Paste: [[2506.06710]] A Systematic Investigation on Deep Learning-Based Omnidirectional Image and Video Super-Resolution(https://arxiv.org/abs/2506.06710)
Keywords: super-resolution
Abstract: Omnidirectional image and video super-resolution is a crucial research topic in low-level vision, playing an essential role in virtual reality and augmented reality applications. Its goal is to reconstruct high-resolution images or video frames from low-resolution inputs, thereby enhancing detail preservation and enabling more accurate scene analysis and interpretation. In recent years, numerous innovative and effective approaches have been proposed, predominantly based on deep learning techniques, involving diverse network architectures, loss functions, projection strategies, and training datasets. This paper presents a systematic review of recent progress in omnidirectional image and video super-resolution, focusing on deep learning-based methods. Given that existing datasets predominantly rely on synthetic degradation and fall short in capturing real-world distortions, we introduce a new dataset, 360Insta, that comprises authentically degraded omnidirectional images and videos collected under diverse conditions, including varying lighting, motion, and exposure settings. This dataset addresses a critical gap in current omnidirectional benchmarks and enables more robust evaluation of the generalization capabilities of omnidirectional super-resolution methods. We conduct comprehensive qualitative and quantitative evaluations of existing methods on both public datasets and our proposed dataset. Furthermore, we provide a systematic overview of the current status of research and discuss promising directions for future exploration. All datasets, methods, and evaluation metrics introduced in this work are publicly available and will be regularly updated. Project page: this https URL.

Title: RecipeGen: A Step-Aligned Multimodal Benchmark for Real-World Recipe Generation

Authors: Ruoxuan Zhang, Jidong Gao, Bin Wen, Hongxia Xie, Chenming Zhang, Honghan-shuai, Wen-Huang Cheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.06733
Pdf URL: https://arxiv.org/pdf/2506.06733
Copy Paste: [[2506.06733]] RecipeGen: A Step-Aligned Multimodal Benchmark for Real-World Recipe Generation(https://arxiv.org/abs/2506.06733)
Keywords: generation
Abstract: Creating recipe images is a key challenge in food computing, with applications in culinary education and multimodal recipe assistants. However, existing datasets lack fine-grained alignment between recipe goals, step-wise instructions, and visual content. We present RecipeGen, the first large-scale, real-world benchmark for recipe-based Text-to-Image (T2I), Image-to-Video (I2V), and Text-to-Video (T2V) generation. RecipeGen contains 26,453 recipes, 196,724 images, and 4,491 videos, covering diverse ingredients, cooking procedures, styles, and dish types. We further propose domain-specific evaluation metrics to assess ingredient fidelity and interaction modeling, benchmark representative T2I, I2V, and T2V models, and provide insights for future recipe generation models. Project page is available now.

Title: Training-Free Identity Preservation in Stylized Image Generation Using Diffusion Models

Authors: Mohammad Ali Rezaei, Helia Hajikazem, Saeed Khanehgir, Mahdi Javanmardi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.06802
Pdf URL: https://arxiv.org/pdf/2506.06802
Copy Paste: [[2506.06802]] Training-Free Identity Preservation in Stylized Image Generation Using Diffusion Models(https://arxiv.org/abs/2506.06802)
Keywords: generation, generative
Abstract: While diffusion models have demonstrated remarkable generative capabilities, existing style transfer techniques often struggle to maintain identity while achieving high-quality stylization. This limitation is particularly acute for images where faces are small or exhibit significant camera-to-face distances, frequently leading to inadequate identity preservation. To address this, we introduce a novel, training-free framework for identity-preserved stylized image synthesis using diffusion models. Key contributions include: (1) the "Mosaic Restored Content Image" technique, significantly enhancing identity retention, especially in complex scenes; and (2) a training-free content consistency loss that enhances the preservation of fine-grained content details by directing more attention to the original image during stylization. Our experiments reveal that the proposed approach substantially surpasses the baseline model in concurrently maintaining high stylistic fidelity and robust identity integrity, particularly under conditions of small facial regions or significant camera-to-face distances, all without necessitating model retraining or fine-tuning.

Title: IMPA-HGAE:Intra-Meta-Path Augmented Heterogeneous Graph Autoencoder

Authors: Di Lin, Wanjing Ren, Xuanbin Li, Rui Zhang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.06809
Pdf URL: https://arxiv.org/pdf/2506.06809
Copy Paste: [[2506.06809]] IMPA-HGAE:Intra-Meta-Path Augmented Heterogeneous Graph Autoencoder(https://arxiv.org/abs/2506.06809)
Keywords: generative
Abstract: Self-supervised learning (SSL) methods have been increasingly applied to diverse downstream tasks due to their superior generalization capabilities and low annotation costs. However, most existing heterogeneous graph SSL models convert heterogeneous graphs into homogeneous ones via meta-paths for training, which only leverages information from the nodes at both ends of each meta-path while underutilizing the heterogeneous node information along the meta-paths. To address this limitation, this paper proposes a novel framework named IMPA-HGAE to enhance target node embeddings by fully exploiting internal node information along meta-paths. Experimental results validate that IMPA-HGAE achieves superior performance on heterogeneous datasets. Furthermore, this paper introduces innovative masking strategies to strengthen the representational capacity of generative SSL models on heterogeneous graph data. Additionally, this paper discusses the interpretability of the proposed method and potential future directions for generative self-supervised learning in heterogeneous graphs. This work provides insights into leveraging meta-path-guided structural semantics for robust representation learning in complex graph scenarios.

Title: Controllable Coupled Image Generation via Diffusion Models

Authors: Chenfei Yuan, Nanshan Jia, Hangqi Li, Peter W. Glynn, Zeyu Zheng
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.06826
Pdf URL: https://arxiv.org/pdf/2506.06826
Copy Paste: [[2506.06826]] Controllable Coupled Image Generation via Diffusion Models(https://arxiv.org/abs/2506.06826)
Keywords: generation
Abstract: We provide an attention-level control method for the task of coupled image generation, where "coupled" means that multiple simultaneously generated images are expected to have the same or very similar backgrounds. While the backgrounds are coupled, the centered objects in the generated images are still expected to retain the flexibility afforded by different text prompts. The proposed method disentangles the background and entity components in the model's cross-attention modules and attaches a sequence of time-varying weight control parameters that depend on the sampling time step. We optimize this sequence of weight control parameters with a combined objective that assesses how coupled the backgrounds are as well as text-to-image alignment and overall visual quality. Empirical results demonstrate that our method outperforms existing approaches across these criteria.

Title: Face recognition on point cloud with cgan-top for denoising

Authors: Junyu Liu, Jianfeng Ren, Sunhong Liang, Xudong Jiang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.06864
Pdf URL: https://arxiv.org/pdf/2506.06864
Copy Paste: [[2506.06864]] Face recognition on point cloud with cgan-top for denoising(https://arxiv.org/abs/2506.06864)
Keywords: generative
Abstract: Face recognition using 3D point clouds is attracting growing interest, but raw point clouds often contain a significant amount of noise due to imperfect sensors. In this paper, an end-to-end 3D face recognition method for noisy point clouds is proposed, which synergistically integrates the denoising and recognition modules. Specifically, a Conditional Generative Adversarial Network on Three Orthogonal Planes (cGAN-TOP) is designed to effectively remove the noise in the point cloud and recover the underlying features for subsequent recognition. A Linked Dynamic Graph Convolutional Neural Network (LDGCNN) is then adapted to recognize faces from the processed point cloud; it hierarchically links both the local point features and the neighboring features of multiple scales. The proposed method is validated on the Bosphorus dataset. It significantly improves the recognition accuracy under all noise settings, with a maximum gain of 14.81%.

Title: LaTtE-Flow: Layerwise Timestep-Expert Flow-based Transformer

Authors: Ying Shen, Zhiyang Xu, Jiuhai Chen, Shizhe Diao, Jiaxin Zhang, Yuguang Yao, Joy Rimchala, Ismini Lourentzou, Lifu Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.06952
Pdf URL: https://arxiv.org/pdf/2506.06952
Copy Paste: [[2506.06952]] LaTtE-Flow: Layerwise Timestep-Expert Flow-based Transformer(https://arxiv.org/abs/2506.06952)
Keywords: generation
Abstract: Recent advances in multimodal foundation models unifying image understanding and generation have opened exciting avenues for tackling a wide range of vision-language tasks within a single framework. Despite progress, existing unified models typically require extensive pretraining and struggle to achieve the same level of performance compared to models dedicated to each task. Additionally, many of these models suffer from slow image generation speeds, limiting their practical deployment in real-time or resource-constrained settings. In this work, we propose Layerwise Timestep-Expert Flow-based Transformer (LaTtE-Flow), a novel and efficient architecture that unifies image understanding and generation within a single multimodal model. LaTtE-Flow builds upon powerful pretrained Vision-Language Models (VLMs) to inherit strong multimodal understanding capabilities, and extends them with a novel Layerwise Timestep Experts flow-based architecture for efficient image generation. LaTtE-Flow distributes the flow-matching process across specialized groups of Transformer layers, each responsible for a distinct subset of timesteps. This design significantly improves sampling efficiency by activating only a small subset of layers at each sampling timestep. To further enhance performance, we propose a Timestep-Conditioned Residual Attention mechanism for efficient information reuse across layers. Experiments demonstrate that LaTtE-Flow achieves strong performance on multimodal understanding tasks, while achieving competitive image generation quality with around 6x faster inference speed compared to recent unified multimodal models.
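The layerwise timestep-expert idea in the abstract above, where each group of Transformer layers is responsible for a distinct subset of timesteps, can be illustrated with a small sketch. This is not the authors' implementation: the function name `active_layer_group`, the uniform split of layers and timesteps, and the convention that the timestep `t` is normalized to [0, 1) are all assumptions made for illustration.

```python
def active_layer_group(t, num_layers, num_groups):
    """Return the layer indices assumed to be active at normalized timestep t.

    Hypothetical scheme: layers are split into equal contiguous groups, and
    the interval [0, 1) is split into equally wide timestep bins, one bin
    per group.
    """
    if not 0.0 <= t < 1.0:
        raise ValueError("t must lie in [0, 1)")
    if num_layers % num_groups != 0:
        raise ValueError("num_layers must be divisible by num_groups")
    group = int(t * num_groups)            # timestep bin that t falls into
    per_group = num_layers // num_groups   # number of layers in each group
    start = group * per_group
    return list(range(start, start + per_group))


if __name__ == "__main__":
    for t in (0.1, 0.6, 0.99):
        print(t, active_layer_group(t, num_layers=24, num_groups=4))
```

Under these assumed settings only 6 of the 24 layers would run at any given sampling step, which illustrates how activating a subset of layers per timestep can reduce per-step compute.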

Title: Task-driven real-world super-resolution of document scans

Authors: Maciej Zyrek, Tomasz Tarasiewicz, Jakub Sadel, Aleksandra Krzywon, Michal Kawulok
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.06953
Pdf URL: https://arxiv.org/pdf/2506.06953
Copy Paste: [[2506.06953]] Task-driven real-world super-resolution of document scans(https://arxiv.org/abs/2506.06953)
Keywords: super-resolution
Abstract: Single-image super-resolution refers to the reconstruction of a high-resolution image from a single low-resolution observation. Although recent deep learning-based methods have demonstrated notable success on simulated datasets -- with low-resolution images obtained by degrading and downsampling high-resolution ones -- they frequently fail to generalize to real-world settings, such as document scans, which are affected by complex degradations and semantic variability. In this study, we introduce a task-driven, multi-task learning framework for training a super-resolution network specifically optimized for optical character recognition tasks. We propose to incorporate auxiliary loss functions derived from high-level vision tasks, including text detection using the connectionist text proposal network, text recognition via a convolutional recurrent neural network, keypoint localization using this http URL, and hue consistency. To balance these diverse objectives, we employ a dynamic weight averaging mechanism, which adaptively adjusts the relative importance of each loss term based on its convergence behavior. We validate our approach on the SRResNet architecture, which is a well-established technique for single-image super-resolution. Experimental evaluations on both simulated and real-world scanned document datasets demonstrate that the proposed approach improves text detection, measured with intersection over union, while preserving overall image fidelity. These findings underscore the value of multi-objective optimization in super-resolution models for bridging the gap between simulated training regimes and practical deployment in real-world scenarios.
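The dynamic weight averaging step mentioned in the abstract above can be sketched as follows. The abstract does not give the exact formula, so this sketch assumes the commonly used formulation in which each loss weight is a softmax over the ratio of that loss's two most recent values, scaled so that the weights sum to the number of tasks. The function name `dwa_weights`, the temperature value, and the loss numbers are hypothetical.

```python
import math


def dwa_weights(losses_prev, losses_prev2, temperature=2.0):
    """Compute per-task loss weights from the two most recent loss values.

    losses_prev  : loss of each task at the previous epoch (t - 1)
    losses_prev2 : loss of each task two epochs ago (t - 2)
    A task whose loss is decreasing more slowly (ratio closer to or above 1)
    receives a larger weight.
    """
    if len(losses_prev) != len(losses_prev2):
        raise ValueError("both loss lists must have the same length")
    ratios = [cur / old for cur, old in zip(losses_prev, losses_prev2)]
    exps = [math.exp(r / temperature) for r in ratios]
    total = sum(exps)
    num_tasks = len(ratios)
    # Scale the softmax so that the weights sum to the number of tasks.
    return [num_tasks * e / total for e in exps]


if __name__ == "__main__":
    # Hypothetical losses for detection, recognition, keypoints, hue.
    previous = [0.90, 0.50, 0.80, 0.95]
    two_back = [1.00, 1.00, 1.00, 1.00]
    print(dwa_weights(previous, two_back))
```

In this illustration the second task, whose loss halved, receives the smallest weight, while the fourth task, whose loss barely moved, receives the largest.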

Title: AR-RAG: Autoregressive Retrieval Augmentation for Image Generation

Authors: Jingyuan Qi, Zhiyang Xu, Qifan Wang, Lifu Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.06962
Pdf URL: https://arxiv.org/pdf/2506.06962
Copy Paste: [[2506.06962]] AR-RAG: Autoregressive Retrieval Augmentation for Image Generation(https://arxiv.org/abs/2506.06962)
Keywords: generation
Abstract: We introduce Autoregressive Retrieval Augmentation (AR-RAG), a novel paradigm that enhances image generation by autoregressively incorporating k-nearest neighbor retrievals at the patch level. Unlike prior methods that perform a single, static retrieval before generation and condition the entire generation on fixed reference images, AR-RAG performs context-aware retrievals at each generation step, using prior-generated patches as queries to retrieve and incorporate the most relevant patch-level visual references, enabling the model to respond to evolving generation needs while avoiding limitations (e.g., over-copying, stylistic bias, etc.) prevalent in existing methods. To realize AR-RAG, we propose two parallel frameworks: (1) Distribution-Augmentation in Decoding (DAiD), a training-free plug-and-use decoding strategy that directly merges the distribution of model-predicted patches with the distribution of retrieved patches, and (2) Feature-Augmentation in Decoding (FAiD), a parameter-efficient fine-tuning method that progressively smooths the features of retrieved patches via multi-scale convolution operations and leverages them to augment the image generation process. We validate the effectiveness of AR-RAG on widely adopted benchmarks, including Midjourney-30K, GenEval and DPG-Bench, demonstrating significant performance gains over state-of-the-art image generation models.
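The DAiD idea of merging the model-predicted patch distribution with a distribution derived from retrieved patches can be illustrated with a short sketch. The abstract only says the two distributions are merged directly, so everything concrete here is an assumption: the linear interpolation rule, the softmax over negative retrieval distances, and the names `retrieval_distribution`, `merge_distributions`, and `weight` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def retrieval_distribution(distance_by_token, vocab_size, temperature=1.0):
    """Turn retrieval distances into a distribution over patch tokens.

    distance_by_token maps a patch-token id to the distance of its nearest
    retrieved patch; closer patches receive more probability mass. Tokens
    that were not retrieved receive zero probability.
    """
    tokens = list(distance_by_token)
    probs = softmax([-distance_by_token[t] / temperature for t in tokens])
    dist = [0.0] * vocab_size
    for token, p in zip(tokens, probs):
        dist[token] = p
    return dist


def merge_distributions(model_probs, retrieved_probs, weight=0.3):
    """Assumed merge rule: convex combination of the two distributions."""
    return [(1.0 - weight) * p + weight * q
            for p, q in zip(model_probs, retrieved_probs)]


if __name__ == "__main__":
    model_probs = [0.25, 0.25, 0.25, 0.25]          # toy 4-token vocabulary
    retrieved = retrieval_distribution({2: 0.1, 3: 1.0}, vocab_size=4)
    print(merge_distributions(model_probs, retrieved))
```

Because both inputs are probability distributions and the merge is a convex combination, the result is again a valid distribution that can be sampled from at each decoding step.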

Title: DM$^3$Net: Dual-Camera Super-Resolution via Domain Modulation and Multi-scale Matching

Authors: Cong Guan, Jiacheng Ying, Yuya Ieiri, Osamu Yoshie
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.06993
Pdf URL: https://arxiv.org/pdf/2506.06993
Copy Paste: [[2506.06993]] DM$^3$Net: Dual-Camera Super-Resolution via Domain Modulation and Multi-scale Matching(https://arxiv.org/abs/2506.06993)
Keywords: super-resolution
Abstract: Dual-camera super-resolution is highly practical for smartphone photography; it primarily super-resolves wide-angle images using the telephoto image as a reference. In this paper, we propose DM$^3$Net, a novel dual-camera super-resolution network based on Domain Modulation and Multi-scale Matching. To bridge the domain gap between the high-resolution domain and the degraded domain, we learn two compressed global representations from image pairs corresponding to the two domains. To enable reliable transfer of high-frequency structural details from the reference image, we design a multi-scale matching module that conducts patch-level feature matching and retrieval across multiple receptive fields to improve matching accuracy and robustness. Moreover, we introduce Key Pruning to achieve a significant reduction in memory usage and inference time with little sacrifice in model performance. Experimental results on three real-world datasets demonstrate that our DM$^3$Net outperforms the state-of-the-art approaches.

Title: Towards Physics-informed Diffusion for Anomaly Detection in Trajectories

Authors: Arun Sharma, Mingzhou Yang, Majid Farhadloo, Subhankar Ghosh, Bharat Jayaprakash, Shashi Shekhar
Subjects: cs.LG, cs.AI, cs.CV, stat.ML
Abstract URL: https://arxiv.org/abs/2506.06999
Pdf URL: https://arxiv.org/pdf/2506.06999
Copy Paste: [[2506.06999]] Towards Physics-informed Diffusion for Anomaly Detection in Trajectories(https://arxiv.org/abs/2506.06999)
Keywords: generation, generative
Abstract: Given trajectory data, a domain-specific study area, and a user-defined threshold, we aim to find anomalous trajectories indicative of possible GPS spoofing (e.g., fake trajectory). The problem is societally important to curb illegal activities in international waters, such as unauthorized fishing and illicit oil transfers. The problem is challenging due to advances in AI-driven deepfake generation (e.g., additive noise, fake trajectories) and the lack of an adequate number of labeled samples for ground-truth verification. Recent literature shows promising results for anomalous trajectory detection using generative models despite data sparsity. However, these models do not consider fine-scale spatiotemporal dependencies and prior physical knowledge, resulting in higher false-positive rates. To address these limitations, we propose a physics-informed diffusion model that integrates kinematic constraints to identify trajectories that do not adhere to physical laws. Experimental results on real-world datasets in the maritime and urban domains show that the proposed framework results in higher prediction accuracy and lower estimation error rate for anomaly detection and trajectory generation methods, respectively. Our implementation is available at this https URL.
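To make the notion of a kinematic constraint from the abstract above concrete, the sketch below flags a trajectory whose implied speed between consecutive fixes exceeds a physical bound. This is a plain threshold check for illustration only, not the paper's diffusion model; the planar coordinates, the function names, and the speed limit used in the example are assumptions.

```python
import math


def implied_speeds(points):
    """Return the speed (m/s) implied by each pair of consecutive fixes.

    points is a list of (t_seconds, x_meters, y_meters) tuples with strictly
    increasing timestamps; planar coordinates are assumed for simplicity.
    """
    speeds = []
    for (t0, x0, y0), (t1, x1, y1) in zip(points, points[1:]):
        dt = t1 - t0
        if dt <= 0:
            raise ValueError("timestamps must be strictly increasing")
        speeds.append(math.hypot(x1 - x0, y1 - y0) / dt)
    return speeds


def violates_speed_limit(points, max_speed):
    """True if any segment implies a speed above max_speed (m/s)."""
    return any(speed > max_speed for speed in implied_speeds(points))


if __name__ == "__main__":
    plausible = [(0, 0.0, 0.0), (10, 100.0, 0.0), (20, 200.0, 0.0)]
    spoofed = [(0, 0.0, 0.0), (10, 100.0, 0.0), (20, 5000.0, 0.0)]
    limit = 15.0  # hypothetical maximum vessel speed in m/s
    print(violates_speed_limit(plausible, limit))
    print(violates_speed_limit(spoofed, limit))
```

A hard threshold like this only catches gross violations; the abstract describes integrating such constraints into a learned generative model instead.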

Title: MAGNET: A Multi-agent Framework for Finding Audio-Visual Needles by Reasoning over Multi-Video Haystacks

Authors: Sanjoy Chowdhury, Mohamed Elmoghany, Yohan Abeysinghe, Junjie Fei, Sayan Nag, Salman Khan, Mohamed Elhoseiny, Dinesh Manocha
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.07016
Pdf URL: https://arxiv.org/pdf/2506.07016
Copy Paste: [[2506.07016]] MAGNET: A Multi-agent Framework for Finding Audio-Visual Needles by Reasoning over Multi-Video Haystacks(https://arxiv.org/abs/2506.07016)
Keywords: generation
Abstract: Large multimodal models (LMMs) have shown remarkable progress in audio-visual understanding, yet they struggle with real-world scenarios that require complex reasoning across extensive video collections. Existing benchmarks for video question answering remain limited in scope, typically involving one clip per query, which falls short of representing the challenges of large-scale, audio-visual retrieval and reasoning encountered in practical applications. To bridge this gap, we introduce a novel task named AV-HaystacksQA, where the goal is to identify salient segments across different videos in response to a query and link them together to generate the most informative answer. To this end, we present AVHaystacks, an audio-visual benchmark comprising 3100 annotated QA pairs designed to assess the capabilities of LMMs in multi-video retrieval and temporal grounding tasks. Additionally, we propose a model-agnostic, multi-agent framework, MAGNET, to address this challenge, achieving up to 89% and 65% relative improvements over baseline methods on BLEU@4 and GPT evaluation scores, respectively, in the QA task on our proposed AVHaystacks. To enable robust evaluation of multi-video retrieval and temporal grounding for optimal response generation, we introduce two new metrics: STEM, which captures alignment errors between a ground-truth and a predicted step sequence, and MTGS, which facilitates balanced and interpretable evaluation of segment-level grounding performance. Project: this https URL

Title: Interpretable and Reliable Detection of AI-Generated Images via Grounded Reasoning in MLLMs

Authors: Yikun Ji, Hong Yan, Jun Lan, Huijia Zhu, Weiqiang Wang, Qi Fan, Liqing Zhang, Jianfu Zhang
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2506.07045
Pdf URL: https://arxiv.org/pdf/2506.07045
Copy Paste: [[2506.07045]] Interpretable and Reliable Detection of AI-Generated Images via Grounded Reasoning in MLLMs(https://arxiv.org/abs/2506.07045)
Keywords: generation
Abstract: The rapid advancement of image generation technologies intensifies the demand for interpretable and robust detection methods. Although existing approaches often attain high accuracy, they typically operate as black boxes without providing human-understandable justifications. Multi-modal Large Language Models (MLLMs), while not originally intended for forgery detection, exhibit strong analytical and reasoning capabilities. When properly fine-tuned, they can effectively identify AI-generated images and offer meaningful explanations. However, existing MLLMs still struggle with hallucination and often fail to align their visual interpretations with actual image content and human reasoning. To bridge this gap, we construct a dataset of AI-generated images annotated with bounding boxes and descriptive captions that highlight synthesis artifacts, establishing a foundation for human-aligned visual-textual grounded reasoning. We then finetune MLLMs through a multi-stage optimization strategy that progressively balances the objectives of accurate detection, visual localization, and coherent textual explanation. The resulting model achieves superior performance in both detecting AI-generated images and localizing visual flaws, significantly outperforming baseline methods.

Title: D2R: dual regularization loss with collaborative adversarial generation for model robustness

Authors: Zhenyu Liu, Huizhi Liang, Rajiv Ranjan, Zhanxing Zhu, Vaclav Snasel, Varun Ojha
Subjects: cs.CV, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2506.07056
Pdf URL: https://arxiv.org/pdf/2506.07056
Copy Paste: [[2506.07056]] D2R: dual regularization loss with collaborative adversarial generation for model robustness(https://arxiv.org/abs/2506.07056)
Keywords: generation
Abstract: The robustness of Deep Neural Network models is crucial for defending models against adversarial attacks. Recent defense methods have employed collaborative learning frameworks to enhance model robustness. Two key limitations of existing methods are (i) insufficient guidance of the target model via loss functions and (ii) non-collaborative adversarial generation. We therefore propose a dual regularization loss (D2R Loss) method and a collaborative adversarial generation (CAG) strategy for adversarial training. D2R loss includes two optimization steps. The adversarial distribution and clean distribution optimizations enhance the target model's robustness by leveraging the strengths of different loss functions obtained via a suitable function space exploration to focus more precisely on the target model's distribution. CAG generates adversarial samples using a gradient-based collaboration between guidance and target models. We conducted extensive experiments on three benchmark datasets (CIFAR-10, CIFAR-100, and Tiny ImageNet) and two popular target models (WideResNet34-10 and PreActResNet18). Our results show that D2R loss with CAG produces highly robust models.

Title: SceneLCM: End-to-End Layout-Guided Interactive Indoor Scene Generation with Latent Consistency Model

Authors: Yangkai Lin, Jiabao Lei, Kui Jia
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.07091
Pdf URL: https://arxiv.org/pdf/2506.07091
Copy Paste: [[2506.07091]] SceneLCM: End-to-End Layout-Guided Interactive Indoor Scene Generation with Latent Consistency Model(https://arxiv.org/abs/2506.07091)
Keywords: generation
Abstract: Our project page: this https URL. Automated generation of complex, interactive indoor scenes tailored to user prompts remains a formidable challenge. While existing methods achieve indoor scene synthesis, they struggle with rigid editing constraints, physical incoherence, excessive human effort, single-room limitations, and suboptimal material quality. To address these limitations, we propose SceneLCM, an end-to-end framework that synergizes a Large Language Model (LLM) for layout design with a Latent Consistency Model (LCM) for scene optimization. Our approach decomposes scene generation into four modular pipelines: (1) Layout Generation. We employ LLM-guided 3D spatial reasoning to convert textual descriptions into parametric blueprints (3D layouts), and an iterative programmatic validation mechanism refines the layout parameters through LLM-mediated dialogue loops. (2) Furniture Generation. SceneLCM employs Consistency Trajectory Sampling (CTS), a consistency distillation sampling loss guided by the LCM, to form fast, semantically rich, and high-quality representations. We also offer two theoretical justifications showing that our CTS loss is equivalent to the consistency loss and that its distillation error is bounded by the truncation error of the Euler solver. (3) Environment Optimization. We use a multiresolution texture field to encode the appearance of the scene and optimize it via the CTS loss. To maintain cross-geometric texture coherence, we introduce a normal-aware cross-attention decoder that predicts RGB by cross-attending to the anchor locations in geometrically heterogeneous instances. (4) Physical Editing. SceneLCM supports physical editing by integrating physical simulation, achieving persistent physical realism. Extensive experiments validate SceneLCM's superiority over state-of-the-art techniques, showing its wide-ranging potential for diverse applications.

Title: Quality-Diversity Red-Teaming: Automated Generation of High-Quality and Diverse Attackers for Large Language Models

Authors: Ren-Jian Wang, Ke Xue, Zeyu Qin, Ziniu Li, Sheng Tang, Hao-Tian Li, Shengcai Liu, Chao Qian
Subjects: cs.LG, cs.AI, cs.NE
Abstract URL: https://arxiv.org/abs/2506.07121
Pdf URL: https://arxiv.org/pdf/2506.07121
Copy Paste: [[2506.07121]] Quality-Diversity Red-Teaming: Automated Generation of High-Quality and Diverse Attackers for Large Language Models(https://arxiv.org/abs/2506.07121)
Keywords: generation
Abstract: Ensuring safety of large language models (LLMs) is important. Red teaming--a systematic approach to identifying adversarial prompts that elicit harmful responses from target LLMs--has emerged as a crucial safety evaluation method. Within this framework, the diversity of adversarial prompts is essential for comprehensive safety assessments. We find that previous approaches to red-teaming may suffer from two key limitations. First, they often pursue diversity through simplistic metrics like word frequency or sentence embedding similarity, which may not capture meaningful variation in attack strategies. Second, the common practice of training a single attacker model restricts coverage across potential attack styles and risk categories. This paper introduces Quality-Diversity Red-Teaming (QDRT), a new framework designed to address these limitations. QDRT achieves goal-driven diversity through behavior-conditioned training and implements a behavioral replay buffer in an open-ended manner. Additionally, it trains multiple specialized attackers capable of generating high-quality attacks across diverse styles and risk categories. Our empirical evaluation demonstrates that QDRT generates attacks that are both more diverse and more effective against a wide range of target LLMs, including GPT-2, Llama-3, Gemma-2, and Qwen2.5. This work advances the field of LLM safety by providing a systematic and effective approach to automated red-teaming, ultimately supporting the responsible deployment of LLMs.

Title: Hi-VAE: Efficient Video Autoencoding with Global and Detailed Motion

Authors: Huaize Liu, Wenzhang Sun, Qiyuan Zhang, Donglin Di, Biao Gong, Hao Li, Chen Wei, Changqing Zou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.07136
Pdf URL: https://arxiv.org/pdf/2506.07136
Copy Paste: [[2506.07136]] Hi-VAE: Efficient Video Autoencoding with Global and Detailed Motion(https://arxiv.org/abs/2506.07136)
Keywords: generation, generative
Abstract: Recent breakthroughs in video autoencoders (Video AEs) have advanced video generation, but existing methods fail to efficiently model spatio-temporal redundancies in dynamics, resulting in suboptimal compression factors. This shortfall leads to excessive training costs for downstream tasks. To address this, we introduce Hi-VAE, an efficient video autoencoding framework that hierarchically encodes coarse-to-fine motion representations of video dynamics and formulates the decoding process as a conditional generation task. Specifically, Hi-VAE decomposes video dynamics into two latent spaces: Global Motion, capturing overarching motion patterns, and Detailed Motion, encoding high-frequency spatial details. Using separate self-supervised motion encoders, we compress video latents into compact motion representations to reduce redundancy significantly. A conditional diffusion decoder then reconstructs videos by combining hierarchical global and detailed motions, enabling high-fidelity video reconstructions. Extensive experiments demonstrate that Hi-VAE achieves a high compression factor of 1428$\times$, almost 30$\times$ higher than baseline methods (e.g., Cosmos-VAE at 48$\times$), validating the efficiency of our approach. Meanwhile, Hi-VAE maintains high reconstruction quality at such high compression rates and performs effectively in downstream generative tasks. Moreover, Hi-VAE exhibits interpretability and scalability, providing new perspectives for future exploration in video latent representation and generation.
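As a quick check of the "almost 30$\times$" figure quoted in the Hi-VAE abstract, the ratio of the two reported compression factors works out as follows (only the two numbers stated in the abstract are used):

```latex
\[
\frac{1428}{48} = 29.75 \approx 30
\]
```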

Title: AMoPO: Adaptive Multi-objective Preference Optimization without Reward Models and Reference Models

Authors: Qi Liu, Jingqing Ruan, Hao Li, Haodong Zhao, Desheng Wang, Jiansong Chen, Wan Guanglu, Xunliang Cai, Zhi Zheng, Tong Xu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.07165
Pdf URL: https://arxiv.org/pdf/2506.07165
Copy Paste: [[2506.07165]] AMoPO: Adaptive Multi-objective Preference Optimization without Reward Models and Reference Models(https://arxiv.org/abs/2506.07165)
Keywords: generation
Abstract: Existing multi-objective preference alignment methods for large language models (LLMs) face limitations: (1) an inability to effectively balance various preference dimensions, and (2) a reliance on auxiliary reward/reference models that introduces computational complexity. To address these challenges, we propose Adaptive Multi-objective Preference Optimization (AMoPO), a novel framework that achieves dynamic balance across preference dimensions. By introducing the multi-objective optimization paradigm to use the dimension-aware generation metrics as implicit rewards, AMoPO aligns LLMs with diverse preferences without additional reward models or reference models. We introduce an adaptive weight assignment mechanism that models the generation space as a Gaussian distribution, allowing dynamic prioritization of preference dimensions. Empirical results demonstrate that AMoPO outperforms state-of-the-art baselines by 28.5%, and the experiments on 7B, 14B, and 32B models reveal the scaling ability of AMoPO. Moreover, additional analysis of multiple dimensions verifies its adaptability and effectiveness. These findings validate AMoPO's capability to achieve dimension-aware preference alignment, highlighting its superiority. Our code and datasets are available at this https URL.

Title: Frame Guidance: Training-Free Guidance for Frame-Level Control in Video Diffusion Models

Authors: Sangwon Jang, Taekyung Ki, Jaehyeong Jo, Jaehong Yoon, Soo Ye Kim, Zhe Lin, Sung Ju Hwang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.07177
Pdf URL: https://arxiv.org/pdf/2506.07177
Copy Paste: [[2506.07177]] Frame Guidance: Training-Free Guidance for Frame-Level Control in Video Diffusion Models(https://arxiv.org/abs/2506.07177)
Keywords: generation
Abstract: Advancements in diffusion models have significantly improved video quality, directing attention to fine-grained controllability. However, many existing methods depend on fine-tuning large-scale video models for specific tasks, which becomes increasingly impractical as model sizes continue to grow. In this work, we present Frame Guidance, a training-free guidance method for controllable video generation based on frame-level signals, such as keyframes, style reference images, sketches, or depth maps. For practical training-free guidance, we propose a simple latent processing method that dramatically reduces memory usage, and apply a novel latent optimization strategy designed for globally coherent video generation. Frame Guidance enables effective control across diverse tasks, including keyframe guidance, stylization, and looping, without any training, and is compatible with any video model. Experimental results show that Frame Guidance can produce high-quality controlled videos for a wide range of tasks and input signals.

Title: GGBall: Graph Generative Model on Poincaré Ball

Authors: Tianci Bu, Chuanrui Wang, Hao Ma, Haoren Zheng, Xin Lu, Tailin Wu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.07198
Pdf URL: https://arxiv.org/pdf/2506.07198
Copy Paste: [[2506.07198]] GGBall: Graph Generative Model on Poincaré Ball(https://arxiv.org/abs/2506.07198)
Keywords: generation, generative
Abstract: Generating graphs with hierarchical structures remains a fundamental challenge due to the limitations of Euclidean geometry in capturing exponential complexity. Here we introduce GGBall, a novel hyperbolic framework for graph generation that integrates geometric inductive biases with modern generative paradigms. GGBall combines a Hyperbolic Vector-Quantized Autoencoder (HVQVAE) with a Riemannian flow matching prior defined via closed-form geodesics. This design enables flow-based priors to model complex latent distributions, while vector quantization helps preserve the curvature-aware structure of the hyperbolic space. We further develop a suite of hyperbolic GNN and Transformer layers that operate entirely within the manifold, ensuring stability and scalability. Empirically, our model reduces degree MMD by over 75% on Community-Small and over 40% on Ego-Small compared to state-of-the-art baselines, demonstrating an improved ability to preserve topological hierarchies. These results highlight the potential of hyperbolic geometry as a powerful foundation for the generative modeling of complex, structured, and hierarchical data domains. Our code is available at this https URL.

Title: TV-LiVE: Training-Free, Text-Guided Video Editing via Layer Informed Vitality Exploitation

Authors: Min-Jung Kim, Dongjin Kim, Seokju Yun, Jaegul Choo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.07205
Pdf URL: https://arxiv.org/pdf/2506.07205
Copy Paste: [[2506.07205]] TV-LiVE: Training-Free, Text-Guided Video Editing via Layer Informed Vitality Exploitation(https://arxiv.org/abs/2506.07205)
Keywords: generation
Abstract: Video editing has garnered increasing attention alongside the rapid progress of diffusion-based video generation models. As part of these advancements, there is a growing demand for more accessible and controllable forms of video editing, such as prompt-based editing. Previous studies have primarily focused on tasks such as style transfer, background replacement, object substitution, and attribute modification, while maintaining the content structure of the source video. However, more complex tasks, including the addition of novel objects and non-rigid transformations, remain relatively unexplored. In this paper, we present TV-LiVE, a Training-free and text-guided Video editing framework via Layer Informed Vitality Exploitation. We empirically identify vital layers within the video generation model that significantly influence the quality of generated outputs. Notably, these layers are closely associated with Rotary Position Embeddings (RoPE). Based on this observation, our method enables both object addition and non-rigid video editing by selectively injecting key and value features from the source model into the corresponding layers of the target model guided by the layer vitality. For object addition, we further identify prominent layers to extract the mask regions corresponding to the newly added target prompt. We found that the extracted masks from the prominent layers faithfully indicate the region to be edited. Experimental results demonstrate that TV-LiVE outperforms existing approaches for both object addition and non-rigid video editing. Project Page: this https URL

Title: Hallucination at a Glance: Controlled Visual Edits and Fine-Grained Multimodal Learning

Authors: Tianyi Bai, Yuxuan Fan, Jiantao Qiu, Fupeng Sun, Jiayi Song, Junlin Han, Zichen Liu, Conghui He, Wentao Zhang, Binhang Yuan
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2506.07227
Pdf URL: https://arxiv.org/pdf/2506.07227
Copy Paste: [[2506.07227]] Hallucination at a Glance: Controlled Visual Edits and Fine-Grained Multimodal Learning(https://arxiv.org/abs/2506.07227)
Keywords: generation
Abstract: Multimodal large language models (MLLMs) have achieved strong performance on vision-language tasks but still struggle with fine-grained visual differences, leading to hallucinations or missed semantic shifts. We attribute this to limitations in both training data and learning objectives. To address these issues, we propose a controlled data generation pipeline that produces minimally edited image pairs with semantically aligned captions. Using this pipeline, we construct the Micro Edit Dataset (MED), containing over 50K image-text pairs spanning 11 fine-grained edit categories, including attribute, count, position, and object presence changes. Building on MED, we introduce a supervised fine-tuning (SFT) framework with a feature-level consistency loss that promotes stable visual embeddings under small edits. We evaluate our approach on the Micro Edit Detection benchmark, which includes carefully balanced evaluation pairs designed to test sensitivity to subtle visual variations across the same edit categories. Our method improves difference detection accuracy and reduces hallucinations compared to strong baselines, including GPT-4o. Moreover, it yields consistent gains on standard vision-language tasks such as image captioning and visual question answering. These results demonstrate the effectiveness of combining targeted data and alignment objectives for enhancing fine-grained visual reasoning in MLLMs.

Title: Multi-Step Visual Reasoning with Visual Tokens Scaling and Verification

Authors: Tianyi Bai, Zengjie Hu, Fupeng Sun, Jiantao Qiu, Yizhen Jiang, Guangxin He, Bohan Zeng, Conghui He, Binhang Yuan, Wentao Zhang
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2506.07235
Pdf URL: https://arxiv.org/pdf/2506.07235
Copy Paste: [[2506.07235]] Multi-Step Visual Reasoning with Visual Tokens Scaling and Verification(https://arxiv.org/abs/2506.07235)
Keywords: generation
Abstract: Multi-modal large language models (MLLMs) have achieved remarkable capabilities by integrating visual perception with language understanding, enabling applications such as image-grounded dialogue, visual question answering, and scientific analysis. However, most MLLMs adopt a static inference paradigm, encoding the entire image into fixed visual tokens upfront, which limits their ability to iteratively refine understanding or adapt to context during inference. This contrasts sharply with human perception, which is dynamic, selective, and feedback-driven. In this work, we introduce a novel framework for inference-time visual token scaling that enables MLLMs to perform iterative, verifier-guided reasoning over visual content. We formulate the problem as a Markov Decision Process, involving a reasoner that proposes visual actions and a verifier, which is trained via multi-step Direct Preference Optimization (DPO), that evaluates these actions and determines when reasoning should terminate. To support this, we present a new dataset, VTS, comprising supervised reasoning trajectories (VTS-SFT) and preference-labeled reasoning comparisons (VTS-DPO). Our method significantly outperforms existing approaches across diverse visual reasoning benchmarks, offering not only improved accuracy but also more interpretable and grounded reasoning processes. These results demonstrate the promise of dynamic inference mechanisms for enabling fine-grained, context-aware visual reasoning in next-generation MLLMs.

Title: From Generation to Generalization: Emergent Few-Shot Learning in Video Diffusion Models

Authors: Pablo Acuaviva, Aram Davtyan, Mariam Hassan, Sebastian Stapf, Ahmad Rahimi, Alexandre Alahi, Paolo Favaro
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.07280
Pdf URL: https://arxiv.org/pdf/2506.07280
Copy Paste: [[2506.07280]] From Generation to Generalization: Emergent Few-Shot Learning in Video Diffusion Models(https://arxiv.org/abs/2506.07280)
Keywords: generation, generative
Abstract: Video Diffusion Models (VDMs) have emerged as powerful generative tools, capable of synthesizing high-quality spatiotemporal content. Yet, their potential goes far beyond mere video generation. We argue that the training dynamics of VDMs, driven by the need to model coherent sequences, naturally push them to internalize structured representations and an implicit understanding of the visual world. To probe the extent of this internal knowledge, we introduce a few-shot fine-tuning framework that repurposes VDMs for new tasks using only a handful of examples. Our method transforms each task into a visual transition, enabling the training of LoRA weights on short input-output sequences without altering the generative interface of a frozen VDM. Despite minimal supervision, the model exhibits strong generalization across diverse tasks, from low-level vision (for example, segmentation and pose estimation) to high-level reasoning (for example, on ARC-AGI). These results reframe VDMs as more than generative engines. They are adaptable visual learners with the potential to serve as the backbone for future foundation models in vision.

Title: Multi-Step Guided Diffusion for Image Restoration on Edge Devices: Toward Lightweight Perception in Embodied AI

Authors: Aditya Chakravarty
Subjects: cs.CV, cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2506.07286
Pdf URL: https://arxiv.org/pdf/2506.07286
Copy Paste: [[2506.07286]] Multi-Step Guided Diffusion for Image Restoration on Edge Devices: Toward Lightweight Perception in Embodied AI(https://arxiv.org/abs/2506.07286)
Keywords: restoration, super-resolution
Abstract: Diffusion models have shown remarkable flexibility for solving inverse problems without task-specific retraining. However, existing approaches such as Manifold Preserving Guided Diffusion (MPGD) apply only a single gradient update per denoising step, limiting restoration fidelity and robustness, especially in embedded or out-of-distribution settings. In this work, we introduce a multistep optimization strategy within each denoising timestep, significantly enhancing image quality, perceptual accuracy, and generalization. Our experiments on super-resolution and Gaussian deblurring demonstrate that increasing the number of gradient updates per step improves LPIPS and PSNR with minimal latency overhead. Notably, we validate this approach on a Jetson Orin Nano using degraded ImageNet and a UAV dataset, showing that MPGD, originally trained on face datasets, generalizes effectively to natural and aerial scenes. Our findings highlight MPGD's potential as a lightweight, plug-and-play restoration module for real-time visual perception in embodied AI agents such as drones and mobile robots.
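The contrast between one and several gradient updates per denoising step, described in the abstract above, can be illustrated with a toy sketch. This is not MPGD itself: the degradation operator `A` (2x average pooling), the step size, the zero initial estimate, and the `refine` helper are all hypothetical stand-ins, and no diffusion model or manifold projection is involved. It only shows how extra updates inside one step shrink the measurement residual.

```python
import numpy as np

def refine(x, y, A, lr, num_updates):
    """Run `num_updates` gradient steps on the loss 0.5 * ||A x - y||^2."""
    for _ in range(num_updates):
        x = x - lr * A.T @ (A @ x - y)
    return x

# Hypothetical degradation: 2x average pooling of an 8-sample signal.
A = np.kron(np.eye(4), np.array([[0.5, 0.5]]))
y = A @ np.arange(8.0)   # observed low-resolution measurement
x0 = np.zeros(8)         # stand-in for the denoiser's clean-signal estimate

# Measurement residual after one update versus four updates in the same step.
res_single = np.linalg.norm(A @ refine(x0, y, A, lr=0.5, num_updates=1) - y)
res_multi = np.linalg.norm(A @ refine(x0, y, A, lr=0.5, num_updates=4) - y)
print(res_single, res_multi)
```

In this toy setting each update scales the residual by a fixed factor below one, so more updates per step always give a closer fit to the measurement; the latency cost grows linearly with `num_updates`.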

Title: FANVID: A Benchmark for Face and License Plate Recognition in Low-Resolution Videos

Authors: Kavitha Viswanathan, Vrinda Goel, Shlesh Gholap, Devayan Ghosh, Madhav Gupta, Dhruvi Ganatra, Sanket Potdar, Amit Sethi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.07304
Pdf URL: https://arxiv.org/pdf/2506.07304
Copy Paste: [[2506.07304]] FANVID: A Benchmark for Face and License Plate Recognition in Low-Resolution Videos(https://arxiv.org/abs/2506.07304)
Keywords: super-resolution
Abstract: Real-world surveillance often renders faces and license plates unrecognizable in individual low-resolution (LR) frames, hindering reliable identification. To advance temporal recognition models, we present FANVID, a novel video-based benchmark comprising nearly 1,463 LR clips (180 x 320, 20--60 FPS) featuring 63 identities and 49 license plates from three English-speaking countries. Each video includes distractor faces and plates, increasing task difficulty and realism. The dataset contains 31,096 manually verified bounding boxes and labels. FANVID defines two tasks: (1) face matching -- detecting LR faces and matching them to high-resolution mugshots, and (2) license plate recognition -- extracting text from LR plates without a predefined database. Videos are downsampled from high-resolution sources to ensure that faces and text are indecipherable in single frames, requiring models to exploit temporal information. We introduce evaluation metrics adapted from mean Average Precision at IoU > 0.5, prioritizing identity correctness for faces and character-level accuracy for text. A baseline method with pre-trained video super-resolution, detection, and recognition achieved performance scores of 0.58 (face matching) and 0.42 (plate recognition), highlighting both the feasibility and challenge of the tasks. FANVID's selection of faces and plates balances diversity with recognition challenge. We release the software for data access, evaluation, baseline, and annotation to support reproducibility and extension. FANVID aims to catalyze innovation in temporal modeling for LR recognition, with applications in surveillance, forensics, and autonomous vehicles.
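The IoU > 0.5 criterion mentioned in the FANVID abstract can be sketched as below. This covers only the box-overlap test; the identity and character-level scoring that the benchmark layers on top is not described in the abstract and is not reproduced here. The `(x1, y1, x2, y2)` box layout and the function names `iou` and `is_match` are assumptions for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    # Clamp at zero so disjoint boxes give an empty intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def is_match(pred_box, gt_box, threshold=0.5):
    """A detection counts as localized when its IoU exceeds the threshold."""
    return iou(pred_box, gt_box) > threshold

print(is_match((0, 0, 10, 10), (0, 0, 10, 8)))
```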

Title: Generative Modeling of Networked Time-Series via Transformer Architectures

Authors: Yusuf Elnady
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.07312
Pdf URL: https://arxiv.org/pdf/2506.07312
Copy Paste: [[2506.07312]] Generative Modeling of Networked Time-Series via Transformer Architectures(https://arxiv.org/abs/2506.07312)
Keywords: generative
Abstract: Many security and network applications require large datasets to train machine learning models. Limited data access is a well-known problem in the security domain. Recent studies have shown the potential of Transformer models to enlarge datasets by synthesizing new samples, but the synthesized samples do not improve the models over the real data. To address this issue, we design an efficient transformer-based model as a generative framework for time-series data that can be used to boost the performance of existing and new ML workflows. Our new transformer model achieves state-of-the-art results. We design our model to generalize across different datasets and to produce high-quality samples.

Title: Graph-KV: Breaking Sequence via Injecting Structural Biases into Large Language Models

Authors: Haoyu Wang, Peihao Wang, Mufei Li, Shikun Liu, Siqi Miao, Zhangyang Wang, Pan Li
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.07334
Pdf URL: https://arxiv.org/pdf/2506.07334
Copy Paste: [[2506.07334]] Graph-KV: Breaking Sequence via Injecting Structural Biases into Large Language Models(https://arxiv.org/abs/2506.07334)
Keywords: generation
Abstract: Modern large language models (LLMs) are inherently auto-regressive, requiring input to be serialized into flat sequences regardless of their structural dependencies. This serialization hinders the model's ability to leverage structural inductive biases, especially in tasks such as retrieval-augmented generation (RAG) and reasoning on data with native graph structures, where inter-segment dependencies are crucial. We introduce Graph-KV with the potential to overcome this limitation. Graph-KV leverages the KV-cache of text segments as condensed representations and governs their interaction through structural inductive biases. In this framework, 'target' segments selectively attend only to the KV-caches of their designated 'source' segments, rather than all preceding segments in a serialized sequence. This approach induces a graph-structured block mask, sparsifying attention and enabling a message-passing-like step within the LLM. Furthermore, strategically allocated positional encodings for source and target segments reduce positional bias and context window consumption. We evaluate Graph-KV across three scenarios: (1) seven RAG benchmarks spanning direct inference, multi-hop reasoning, and long-document understanding; (2) Arxiv-QA, a novel academic paper QA task with full-text scientific papers structured as citation ego-graphs; and (3) paper topic classification within a citation network. By effectively reducing positional bias and harnessing structural inductive biases, Graph-KV substantially outperforms baselines, including standard costly sequential encoding, across various settings. Code and the Graph-KV data are publicly available.
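The graph-structured block mask described in the Graph-KV abstract can be pictured with a small sketch. It is a simplification under stated assumptions, not the authors' implementation: segments are laid out back to back, attention inside a segment is causal, and a target segment sees every token of its source segments. The function name `graph_block_mask` and its arguments are hypothetical, and positional-encoding allocation is ignored.

```python
import numpy as np

def graph_block_mask(seg_lens, edges):
    """Boolean attention mask (True = may attend) over concatenated segments.

    seg_lens: token count of each text segment.
    edges: (source, target) index pairs; a target segment may read its sources.
    """
    starts = np.concatenate(([0], np.cumsum(seg_lens)))
    total = int(starts[-1])
    mask = np.zeros((total, total), dtype=bool)
    for i, n in enumerate(seg_lens):
        s = starts[i]
        # Within a segment: ordinary causal (lower-triangular) attention.
        mask[s:s + n, s:s + n] = np.tril(np.ones((n, n), dtype=bool))
    for src, tgt in edges:
        # Every token of the target sees every cached token of the source.
        mask[starts[tgt]:starts[tgt + 1], starts[src]:starts[src + 1]] = True
    return mask

# Three segments of 2, 3, and 2 tokens; segment 2 attends to segment 0 only.
mask = graph_block_mask([2, 3, 2], edges=[(0, 2)])
print(mask.astype(int))
```

Compared with a full causal mask over the serialized sequence, segment 2 here never attends to segment 1, which is the sparsification the abstract refers to.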

Title: Generative Models at the Frontier of Compression: A Survey on Generative Face Video Coding

Authors: Bolin Chen, Shanzhi Yin, Goluck Konuko, Giuseppe Valenzise, Zihan Zhang, Shiqi Wang, Yan Ye
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.07369
Pdf URL: https://arxiv.org/pdf/2506.07369
Copy Paste: [[2506.07369]] Generative Models at the Frontier of Compression: A Survey on Generative Face Video Coding(https://arxiv.org/abs/2506.07369)
Keywords: generative
Abstract: The rise of deep generative models has greatly advanced video compression, reshaping the paradigm of face video coding through their powerful capability for semantic-aware representation and lifelike synthesis. Generative Face Video Coding (GFVC) stands at the forefront of this revolution: it can characterize complex facial dynamics as compact latent codes for bitstream compactness at the encoder side and leverage powerful deep generative models to reconstruct high-fidelity face signals from the compressed latent codes at the decoder side. As such, this well-designed GFVC paradigm could enable high-fidelity face video communication at ultra-low bitrate ranges, far surpassing the capabilities of the latest Versatile Video Coding (VVC) standard. To pioneer foundational research and accelerate the evolution of GFVC, this paper presents the first comprehensive survey of GFVC technologies, systematically bridging critical gaps between theoretical innovation and industrial standardization. In particular, we first review a broad range of existing GFVC methods with different feature representations and optimization strategies, and conduct a thorough benchmarking analysis. In addition, we construct a large-scale GFVC-compressed face video database with subjective Mean Opinion Scores (MOSs) based on human perception, aiming to identify the most appropriate quality metrics tailored to GFVC. Moreover, we summarize the GFVC standardization potentials with a unified high-level syntax and develop a low-complexity GFVC system, both of which are expected to push forward future practical deployments and applications. Finally, we envision the potential of GFVC in industrial applications and deliberate on the current challenges and future opportunities.

Title: ARGUS: Hallucination and Omission Evaluation in Video-LLMs

Authors: Ruchit Rawal, Reza Shirkavand, Heng Huang, Gowthami Somepalli, Tom Goldstein
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.07371
Pdf URL: https://arxiv.org/pdf/2506.07371
Copy Paste: [[2506.07371]] ARGUS: Hallucination and Omission Evaluation in Video-LLMs(https://arxiv.org/abs/2506.07371)
Keywords: generation
Abstract: Video large language models have not yet been widely deployed, largely due to their tendency to hallucinate. Typical benchmarks for Video-LLMs rely simply on multiple-choice questions. Unfortunately, VideoLLMs hallucinate far more aggressively on freeform text generation tasks like video captioning than they do on multiple choice verification tasks. To address this weakness, we propose ARGUS, a VideoLLM benchmark that measures freeform video captioning performance. By comparing VideoLLM outputs to human ground truth captions, ARGUS quantifies dual metrics. First, we measure the rate of hallucinations in the form of incorrect statements about video content or temporal relationships. Second, we measure the rate at which the model omits important descriptive details. Together, these dual metrics form a comprehensive view of video captioning performance.
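The two rates named in the ARGUS abstract can be written down as simple fractions once each statement has been judged. The sketch below assumes those judgments already exist as booleans; how ARGUS actually decides whether a statement is supported or a detail is covered is not specified in the abstract, and the function names are hypothetical.

```python
def hallucination_rate(caption_supported):
    """Fraction of generated statements judged unsupported by the video."""
    if not caption_supported:
        return 0.0
    return sum(1 for ok in caption_supported if not ok) / len(caption_supported)

def omission_rate(reference_covered):
    """Fraction of reference details that the generated caption leaves out."""
    if not reference_covered:
        return 0.0
    return sum(1 for ok in reference_covered if not ok) / len(reference_covered)

# Hypothetical judgments: 4 generated statements, 3 reference details.
print(hallucination_rate([True, False, True, True]))
print(omission_rate([True, False, False]))
```

Reporting both numbers matters because a model can lower one at the expense of the other, for example by producing very short captions that hallucinate little but omit a lot.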

Title: MrM: Black-Box Membership Inference Attacks against Multimodal RAG Systems

Authors: Peiru Yang, Jinhua Yin, Haoran Zheng, Xueying Bai, Huili Wang, Yufei Sun, Xintian Li, Shangguang Wang, Yongfeng Huang, Tao Qi
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.07399
Pdf URL: https://arxiv.org/pdf/2506.07399
Copy Paste: [[2506.07399]] MrM: Black-Box Membership Inference Attacks against Multimodal RAG Systems(https://arxiv.org/abs/2506.07399)
Keywords: generation
Abstract: Multimodal retrieval-augmented generation (RAG) systems enhance large vision-language models by integrating cross-modal knowledge, enabling their increasing adoption across real-world multimodal tasks. These knowledge databases may contain sensitive information that requires privacy protection. However, multimodal RAG systems inherently grant external users indirect access to such data, making them potentially vulnerable to privacy attacks, particularly membership inference attacks (MIAs). Existing MIA methods targeting RAG systems predominantly focus on the textual modality, while the visual modality remains relatively underexplored. To bridge this gap, we propose MrM, the first black-box MIA framework targeted at multimodal RAG systems. It utilizes a multi-object data perturbation framework constrained by counterfactual attacks, which can concurrently induce the RAG systems to retrieve the target data and generate information that leaks the membership information. Our method first employs an object-aware data perturbation method to constrain the perturbation to key semantics and ensure successful retrieval. Building on this, we design a counterfact-informed mask selection strategy to prioritize the most informative masked regions, aiming to eliminate the interference of model self-knowledge and amplify attack efficacy. Finally, we perform statistical membership inference by modeling query trials to extract features that reflect the reconstruction of masked semantics from response patterns. Experiments on two visual datasets and eight mainstream commercial visual-language models (e.g., GPT-4o, Gemini-2) demonstrate that MrM achieves consistently strong performance across both sample-level and set-level evaluations, and remains robust under adaptive defenses.

Title: InverseScope: Scalable Activation Inversion for Interpreting Large Language Models

Authors: Yifan Luo, Zhennan Zhou, Bin Dong
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.07406
Pdf URL: https://arxiv.org/pdf/2506.07406
Copy Paste: [[2506.07406]] InverseScope: Scalable Activation Inversion for Interpreting Large Language Models(https://arxiv.org/abs/2506.07406)
Keywords: generation
Abstract: Understanding the internal representations of large language models (LLMs) is a central challenge in interpretability research. Existing feature interpretability methods often rely on strong assumptions about the structure of representations that may not hold in practice. In this work, we introduce InverseScope, an assumption-light and scalable framework for interpreting neural activations via input inversion. Given a target activation, we define a distribution over inputs that generate similar activations and analyze this distribution to infer the encoded features. To address the inefficiency of sampling in high-dimensional spaces, we propose a novel conditional generation architecture that significantly improves sample efficiency compared to previous methods. We further introduce a quantitative evaluation protocol that tests interpretability hypotheses using feature consistency rate computed over the sampled inputs. InverseScope scales inversion-based interpretability methods to larger models and practical tasks, enabling systematic and quantitative analysis of internal representations in real-world LLMs.
摘要：了解大语言模型（LLM）的内部表示是解释性研究的核心挑战。现有的特征可解释性方法通常依赖于对可能无法实践中可能不存在的表示结构的强有力的假设。在这项工作中，我们介绍了InverseScope，这是一个假设且可扩展的框架，用于通过输入反转来解释神经激活。鉴于目标激活，我们定义了一个分布，而不是产生相似激活的输入，并分析此分布以推断编码的特征。为了解决高维空间中采样效率低下的效率，我们提出了一种新型的条件产生结构，该结构与以前的方法相比显着提高了采样效率。我们进一步介绍了一种定量评估协议，该协议使用在采样输入中计算出的特征一致性来检验可解释性假设。 InverseScope量表基于反转的可解释性方法和实用任务，从而对现实世界中LLM中的内部表示形式进行系统和定量分析。

Title: Compressed Feature Quality Assessment: Dataset and Baselines

Authors: Changsheng Gao, Wei Zhou, Guosheng Lin, Weisi Lin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.07412
Pdf URL: https://arxiv.org/pdf/2506.07412
Copy Paste: [[2506.07412]] Compressed Feature Quality Assessment: Dataset and Baselines(https://arxiv.org/abs/2506.07412)
Keywords: quality assessment
Abstract: The widespread deployment of large models in resource-constrained environments has underscored the need for efficient transmission of intermediate feature representations. In this context, feature coding, which compresses features into compact bitstreams, becomes a critical component for scenarios involving feature transmission, storage, and reuse. However, this compression process introduces inherent semantic degradation that is notoriously difficult to quantify with traditional metrics. To address this, this paper introduces the research problem of Compressed Feature Quality Assessment (CFQA), which seeks to evaluate the semantic fidelity of compressed features. To advance CFQA research, we propose the first benchmark dataset, comprising 300 original features and 12000 compressed features derived from three vision tasks and four feature codecs. Task-specific performance drops are provided as true semantic distortion for the evaluation of CFQA metrics. We assess the performance of three widely used metrics (MSE, cosine similarity, and Centered Kernel Alignment) in capturing semantic degradation. The results underscore the representativeness of the dataset and highlight the need for more refined metrics capable of addressing the nuances of semantic distortion in compressed features. To facilitate the ongoing development of CFQA research, we release the dataset and all accompanying source code at this https URL. This contribution aims to advance the field and provide a foundational resource for the community to explore CFQA.
摘要：大型模型在资源约束环境中的广泛部署突显了有效传输中间特征表示形式的需求。在这种情况下，将功能压缩到紧凑的bitstreams中的功能编码成为涉及特征传输，存储和重复使用的方案的关键组件。但是，这种压缩过程引入了固有的语义下降，众所周知，使用传统指标很难量化。为了解决这个问题，本文介绍了压缩功能质量评估（CFQA）的研究问题，该问题旨在评估压缩特征的语义忠诚度。为了推进CFQA研究，我们提出了第一个基准数据集，其中包括300个原始功能和12000个来自三个视觉任务和四个功能编解码器的压缩功能。特定于任务的性能下降作为CFQA指标评估的真实语义失真。我们在捕获语义降解时评估了三个广泛使用的指标（MSE，余弦相似性和集中核比对）的性能。结果强调了数据集的代表性，并突出了需要更精致的指标，能够解决压缩特征中语义失真的细微差别。为了促进CFQA研究的持续开发，我们以\ href {this HTTPS url} {this HTTPS url}发布数据集和所有随附的源代码。这项贡献旨在推进该领域，并为社区探索CFQA提供基础资源。
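As a rough illustration of the three baseline metrics named in the abstract above, the following is a minimal pure-Python sketch. It assumes features are given as plain lists (vectors for MSE and cosine similarity, sample-by-dimension matrices for CKA) and uses the commonly cited linear form of Centered Kernel Alignment; the function names are hypothetical and this is not the authors' evaluation code.

```python
import math

def mse(a, b):
    """Mean squared error between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def _center_columns(m):
    """Subtract the per-column mean from every row of a samples-by-dims matrix."""
    n = len(m)
    means = [sum(row[j] for row in m) / n for j in range(len(m[0]))]
    return [[row[j] - means[j] for j in range(len(row))] for row in m]

def _cross_frobenius_sq(a, b):
    """Squared Frobenius norm of a^T b for two matrices with the same row count."""
    total = 0.0
    for i in range(len(a[0])):
        for j in range(len(b[0])):
            s = sum(a[k][i] * b[k][j] for k in range(len(a)))
            total += s * s
    return total

def linear_cka(x, y):
    """Linear CKA: ||Xc^T Yc||_F^2 / (||Xc^T Xc||_F * ||Yc^T Yc||_F)."""
    xc, yc = _center_columns(x), _center_columns(y)
    num = _cross_frobenius_sq(xc, yc)
    den = math.sqrt(_cross_frobenius_sq(xc, xc) * _cross_frobenius_sq(yc, yc))
    return num / den

# Toy data (made up for illustration): an "original" feature matrix and a
# uniformly rescaled copy standing in for a compressed version.
original = [[1.0, 2.0], [3.0, 1.0], [0.0, 4.0], [2.0, 2.0]]
rescaled = [[2.0 * v for v in row] for row in original]
cka_same = linear_cka(original, original)
cka_scaled = linear_cka(original, rescaled)
```

Linear CKA is invariant to isotropic rescaling, so both toy comparisons come out at 1 up to rounding, whereas MSE would penalise the rescaled copy. That difference in sensitivity is one reason the three metrics can disagree about how much a codec has degraded a feature.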

Title: LiteVLM: A Low-Latency Vision-Language Model Inference Pipeline for Resource-Constrained Environments

Authors: Jin Huang, Yuchao Jin, Le An, Josh Park
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.07416
Pdf URL: https://arxiv.org/pdf/2506.07416
Copy Paste: [[2506.07416]] LiteVLM: A Low-Latency Vision-Language Model Inference Pipeline for Resource-Constrained Environments(https://arxiv.org/abs/2506.07416)
Keywords: generation
Abstract: This paper introduces an efficient Vision-Language Model (VLM) pipeline specifically optimized for deployment on embedded devices, such as those used in robotics and autonomous driving. The pipeline significantly reduces the computational overhead by jointly leveraging patch selection to filter irrelevant camera views, a token selection module to reduce input sequence length for the LLM, and speculative decoding to accelerate token generation. Evaluated on the NVIDIA DRIVE Thor platform for autonomous driving applications, our pipeline achieves a $2.5\times$ end-to-end latency reduction without compromising task accuracy. The speed-up further increases to $3.2\times$ when applying FP8 post-training quantization. These results demonstrate our pipeline as a viable solution for enabling real-time VLM deployment in resource-constrained environments.
摘要：本文介绍了专门针对嵌入式设备部署的高效视觉语言模型（VLM）管道，例如在机器人技术和自动驾驶中使用的设备。该管道通过共同利用贴片选择来滤波无关的相机视图，可以大大降低计算开销，而对于降低了LLM的输入序列长度以及投机解码以加速令牌产生。在NVIDIA DRIVE THR平台上用于自动驾驶应用程序的评估，我们的管道在不损害任务准确性的情况下实现了$ 2.5 \ times $端到端的延迟延迟。应用FP8训练后量化时，加速速度将进一步增加到$ 3.2 \ tims $。这些结果证明了我们的管道是在资源约束环境中实现实时VLM部署的可行解决方案。
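To make the reported factors concrete, here is a small arithmetic sketch. It assumes that "$2.5\times$ latency reduction" means baseline latency divided by optimized latency equals 2.5, which is an interpretation of the abstract and not something it states explicitly; `latency_fraction` is a hypothetical helper name.

```python
def latency_fraction(speedup):
    """Fraction of baseline latency that remains after a given speed-up factor."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 / speedup

# Factors quoted in the abstract; their interpretation is an assumption.
pipeline_only = latency_fraction(2.5)   # remaining latency without FP8
with_fp8 = latency_fraction(3.2)        # remaining latency with FP8 quantization
extra_gain_from_fp8 = 3.2 / 2.5         # additional factor attributable to FP8
```

Under that reading, the pipeline alone leaves about 40% of the baseline latency, FP8 quantization brings it to roughly 31%, and the quantization step by itself contributes a further factor of about 1.28.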

Title: PhysiInter: Integrating Physical Mapping for High-Fidelity Human Interaction Generation

Authors: Wei Yao, Yunlian Sun, Chang Liu, Hongwen Zhang, Jinhui Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.07456
Pdf URL: https://arxiv.org/pdf/2506.07456
Copy Paste: [[2506.07456]] PhysiInter: Integrating Physical Mapping for High-Fidelity Human Interaction Generation(https://arxiv.org/abs/2506.07456)
Keywords: generation, generative
Abstract: Driven by advancements in motion capture and generative artificial intelligence, leveraging large-scale MoCap datasets to train generative models for synthesizing diverse, realistic human motions has become a promising research direction. However, existing motion-capture techniques and generative models often neglect physical constraints, leading to artifacts such as interpenetration, sliding, and floating. These issues are exacerbated in multi-person motion generation, where complex interactions are involved. To address these limitations, we introduce physical mapping, integrated throughout the human interaction generation pipeline. Specifically, motion imitation within a physics-based simulation environment is used to project target motions into a physically valid space. The resulting motions are adjusted to adhere to real-world physics constraints while retaining their original semantic meaning. This mapping not only improves MoCap data quality but also directly informs post-processing of generated motions. Given the unique interactivity of multi-person scenarios, we propose a tailored motion representation framework. Motion Consistency (MC) and Marker-based Interaction (MI) loss functions are introduced to improve model performance. Experiments show our method achieves impressive results in generated human motion quality, with a 3%-89% improvement in physical fidelity. Project page: this http URL.
摘要：在运动捕获和生成人工智能方面的进步驱动下，利用大规模的MOCAP数据集训练生成模型来综合各种现实的人类动作已成为一个有希望的研究方向。但是，现有的运动捕获技术和生成模型经常忽略物理约束，从而导致诸如互穿，滑动和漂浮的伪像。这些问题在涉及复杂的相互作用的多人运动产生中加剧了。为了解决这些局限性，我们引入了物理映射，并集成了整个人类互动生成管道。具体而言，基于物理的模拟环境中的运动模仿用于将目标运动投影到物理上有效的空间中。将最终的动作调整为遵守现实世界的物理限制，同时保留其原始的语义含义。该映射不仅提高了MOCAP数据质量，而且还可以直接告知生成动作的后处理。鉴于多人场景的独特互动性，我们提出了一个量身定制的运动表示框架。引入了运动一致性（MC）和基于标记的相互作用（MI）损失函数以提高模型性能。实验表明，我们的方法在产生的人类运动质量中取得了令人印象深刻的结果，身体忠诚度提高了3％-89％。项目页面此HTTP URL

Title: ProteinZero: Self-Improving Protein Generation via Online Reinforcement Learning

Authors: Ziwen Wang, Jiajun Fan, Ruihan Guo, Thao Nguyen, Heng Ji, Ge Liu
Subjects: cs.LG, q-bio.QM
Abstract URL: https://arxiv.org/abs/2506.07459
Pdf URL: https://arxiv.org/pdf/2506.07459
Copy Paste: [[2506.07459]] ProteinZero: Self-Improving Protein Generation via Online Reinforcement Learning(https://arxiv.org/abs/2506.07459)
Keywords: generation, generative
Abstract: Protein generative models have shown remarkable promise in protein design but still face limitations in success rate, due to the scarcity of high-quality protein datasets for supervised pretraining. We present ProteinZero, a novel framework that enables scalable, automated, and continuous self-improvement of the inverse folding model through online reinforcement learning. To achieve computationally tractable online feedback, we introduce efficient proxy reward models based on ESM-fold and a novel rapid ddG predictor that significantly accelerates evaluation speed. ProteinZero employs a general RL framework balancing multi-reward maximization, KL-divergence from a reference model, and a novel protein-embedding level diversity regularization that prevents mode collapse while promoting higher sequence diversity. Through extensive experiments, we demonstrate that ProteinZero substantially outperforms existing methods across every key metric in protein design, achieving significant improvements in structural accuracy, designability, thermodynamic stability, and sequence diversity. Most impressively, ProteinZero reduces design failure rates by approximately 36% - 48% compared to widely-used methods like ProteinMPNN, ESM-IF and InstructPLM, consistently achieving success rates exceeding 90% across diverse and complex protein folds. Notably, the entire RL run on CATH-4.3 can be done with a single 8 X GPU node in under 3 days, including reward computation. Our work establishes a new paradigm for protein design where models evolve continuously from their own generated outputs, opening new possibilities for exploring the vast protein design space.
摘要：蛋白质生成模型在蛋白质设计中表现出了很大的希望，但由于缺乏高质量的蛋白质数据集而无法进行预训练，但仍面临成功率的限制。我们提出了Proteinzero，这是一个新颖的框架，可以通过在线增强学习对反折叠模型进行可扩展，自动化和连续的自我改善。为了实现计算上可处理的在线反馈，我们基于ESM折叠和一种新型的快速DDG预测器引入有效的代理奖励模型，该模型可显着加速评估速度。 Proteinzero采用了一个通用的RL框架，平衡了多回报最大化，来自参考模型的KL差异以及一种新型的蛋白质插入水平多样性正则化，以防止模式崩溃，同时促进较高的序列多样性。通过广泛的实验，我们证明了蛋白质在蛋白质设计中的每个关键指标上的现有方法基本上都优于现有方法，从而取得了显着提高的结构准确性，可设计性，热力学稳定性和序列多样性。与广泛使用的方法相比，最令人印象深刻的是，Proteinzero将设计故障率降低了约36％-48％，例如蛋白质MPNN，ESM-IF和Coventplm，始终达到的成功率超过了多种和复杂的蛋白质折叠的90％。值得注意的是，可以在3天内使用一个8 x GPU节点进行CATH-4.3上的整个RL运行，包括奖励计算。我们的工作建立了一种新的蛋白质设计范式，其中模型从自己的产量中不断发展，为探索巨大的蛋白质设计空间开辟了新的可能性。

Title: GLOS: Sign Language Generation with Temporally Aligned Gloss-Level Conditioning

Authors: Taeryung Lee, Hyeongjin Nam, Gyeongsik Moon, Kyoung Mu Lee
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2506.07460
Pdf URL: https://arxiv.org/pdf/2506.07460
Copy Paste: [[2506.07460]] GLOS: Sign Language Generation with Temporally Aligned Gloss-Level Conditioning(https://arxiv.org/abs/2506.07460)
Keywords: generation
Abstract: Sign language generation (SLG), or text-to-sign generation, bridges the gap between signers and non-signers. Despite recent progress in SLG, existing methods still often suffer from incorrect lexical ordering and low semantic accuracy. This is primarily due to sentence-level condition, which encodes the entire sentence of the input text into a single feature vector as a condition for SLG. This approach fails to capture the temporal structure of sign language and lacks the granularity of word-level semantics, often leading to disordered sign sequences and ambiguous motions. To overcome these limitations, we propose GLOS, a sign language generation framework with temporally aligned gloss-level conditioning. First, we employ gloss-level conditions, which we define as sequences of gloss embeddings temporally aligned with the motion sequence. This enables the model to access both the temporal structure of sign language and word-level semantics at each timestep. As a result, this allows for fine-grained control of signs and better preservation of lexical order. Second, we introduce a condition fusion module, temporal alignment conditioning (TAC), to efficiently deliver the word-level semantic and temporal structure provided by the gloss-level condition to the corresponding motion timesteps. Our method, which is composed of gloss-level conditions and TAC, generates signs with correct lexical order and high semantic accuracy, outperforming prior methods on CSL-Daily and Phoenix-2014T.
摘要：手语的生成（SLG）或文本到符号生成，弥合了签名者和非签名者之间的差距。尽管SLG最近取得了进展，但现有方法仍然经常遭受词汇顺序不正确和语义准确性低的损失。这主要是由于句子级条件，该条件将输入文本的整个句子编码为单个特征向量作为SLG的条件。这种方法无法捕获手语的时间结构，并且缺乏单词级语义的粒度，通常会导致异常的标志序列和模棱两可的动作。为了克服这些局限性，我们提出了GLOS，这是一个具有时间对齐的光泽级调节的手语语言生成框架。首先，我们采用光泽水平条件，我们将其定义为与运动序列时期对齐的光泽嵌入序列。这使该模型能够在每个时间步长访问手语和单词级语义的时间结构。结果，这允许对符号的细粒度控制和更好地保存词汇顺序。其次，我们引入了条件融合模块，时间对齐调节（TAC），以有效地将由光泽级条件提供的单词级别的语义和时间结构传递给相应的运动时间段。我们的方法由光泽级别的条件和TAC组成，它以正确的词汇顺序和高语义精度生成符号，超过了CSL Daily和Phoenix-2014T的先验方法。

Title: Drive Any Mesh: 4D Latent Diffusion for Mesh Deformation from Video

Authors: Yahao Shi, Yang Liu, Yanmin Wu, Xing Liu, Chen Zhao, Jie Luo, Bin Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.07489
Pdf URL: https://arxiv.org/pdf/2506.07489
Copy Paste: [[2506.07489]] Drive Any Mesh: 4D Latent Diffusion for Mesh Deformation from Video(https://arxiv.org/abs/2506.07489)
Keywords: generation
Abstract: We propose DriveAnyMesh, a method for driving mesh guided by monocular video. Current 4D generation techniques encounter challenges with modern rendering engines. Implicit methods have low rendering efficiency and are unfriendly to rasterization-based engines, while skeletal methods demand significant manual effort and lack cross-category generalization. Animating existing 3D assets, instead of creating 4D assets from scratch, demands a deep understanding of the input's 3D structure. To tackle these challenges, we present a 4D diffusion model that denoises sequences of latent sets, which are then decoded to produce mesh animations from point cloud trajectory sequences. These latent sets leverage a transformer-based variational autoencoder, simultaneously capturing 3D shape and motion information. By employing a spatiotemporal, transformer-based diffusion model, information is exchanged across multiple latent frames, enhancing the efficiency and generalization of the generated results. Our experimental results demonstrate that DriveAnyMesh can rapidly produce high-quality animations for complex motions and is compatible with modern rendering engines. This method holds potential for applications in both the gaming and filming industries.
摘要：我们提出了Driveanymesh，这是一种以单眼视频为指导的网眼的方法。当前的第4代技术与现代渲染引擎遇到挑战。隐式方法具有较低的渲染效率，并且对基于栅格化的发动机不友好，而骨骼方法需要大量的手动努力，并且缺乏跨类别概括。对现有的3D资产进行动画动画，而不是从头开始创建4D资产，需要深入了解输入的3D结构。为了应对这些挑战，我们提出了一个4D扩散模型，该模型可以降低潜在集的序列，然后将其解码以从点云轨迹序列产生网格动画。这些潜在设置利用基于变压器的变异自动编码器，同时捕获3D形状和运动信息。通过采用时空，基于变压器的扩散模型，信息可以在多个潜在框架上交换，从而提高了生成的结果的效率和概括。我们的实验结果表明，Driveanymesh可以迅速为复杂运动生成高质量的动画，并且与现代渲染引擎兼容。这种方法具有在游戏和拍摄行业中应用的潜力。

Title: Genesis: Multimodal Driving Scene Generation with Spatio-Temporal and Cross-Modal Consistency

Authors: Xiangyu Guo, Zhanqian Wu, Kaixin Xiong, Ziyang Xu, Lijun Zhou, Gangwei Xu, Shaoqing Xu, Haiyang Sun, Bing Wang, Guang Chen, Hangjun Ye, Wenyu Liu, Xinggang Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.07497
Pdf URL: https://arxiv.org/pdf/2506.07497
Copy Paste: [[2506.07497]] Genesis: Multimodal Driving Scene Generation with Spatio-Temporal and Cross-Modal Consistency(https://arxiv.org/abs/2506.07497)
Keywords: generation
Abstract: We present Genesis, a unified framework for joint generation of multi-view driving videos and LiDAR sequences with spatio-temporal and cross-modal consistency. Genesis employs a two-stage architecture that integrates a DiT-based video diffusion model with 3D-VAE encoding, and a BEV-aware LiDAR generator with NeRF-based rendering and adaptive sampling. Both modalities are directly coupled through a shared latent space, enabling coherent evolution across visual and geometric domains. To guide the generation with structured semantics, we introduce DataCrafter, a captioning module built on vision-language models that provides scene-level and instance-level supervision. Extensive experiments on the nuScenes benchmark demonstrate that Genesis achieves state-of-the-art performance across video and LiDAR metrics (FVD 16.95, FID 4.24, Chamfer 0.611), and benefits downstream tasks including segmentation and 3D detection, validating the semantic fidelity and practical utility of the generated data.
摘要：我们提出了Genesis，这是一个统一的框架，用于具有时空和跨模式一致性的多视图驱动视频和LIDAR序列。 Genesis采用了两阶段的体系结构，该体系结构将基于DIT的视频扩散模型与3D-VAE编码集成在一起，以及带有基于NERF的渲染和自适应采样的BEV感知激光雷达生成器。两种模态都直接通过共享的潜在空间耦合，从而使视觉和几何域之间的相干演变能够。为了通过结构化语义指导一代，我们介绍了DataCrafter，这是一个基于视觉语言模型的字幕模块，该模型提供场景级别和实例级别的监督。关于Nuscenes基准测试的广泛实验表明，Genesis在视频和激光雷达指标（FVD 16.95，FID 4.24，Chamfer 0.611）中实现了最先进的性能，并有益于下游任务，包括分段和3D检测，验证了富裕性和生成数据的忠诚度和实用性。

Title: Addressing Correlated Latent Exogenous Variables in Debiased Recommender Systems

Authors: Shuqiang Zhang, Yuchao Zhang, Jinkun Chen, Haochen Sui
Subjects: cs.LG, cs.IR
Abstract URL: https://arxiv.org/abs/2506.07517
Pdf URL: https://arxiv.org/pdf/2506.07517
Copy Paste: [[2506.07517]] Addressing Correlated Latent Exogenous Variables in Debiased Recommender Systems(https://arxiv.org/abs/2506.07517)
Keywords: generation
Abstract: Recommendation systems (RS) aim to provide personalized content, but they face a challenge in unbiased learning due to selection bias, where users only interact with items they prefer. This bias leads to a distorted representation of user preferences, which hinders the accuracy and fairness of recommendations. To address the issue, various methods such as error-imputation-based, inverse propensity scoring, and doubly robust techniques have been developed. Despite the progress, from the structural causal model perspective, previous debiasing methods in RS assume the independence of the exogenous variables. In this paper, we relax this assumption and propose a learning algorithm based on likelihood maximization to learn a prediction model. We first discuss the correlation and difference between unmeasured confounding and our scenario, then we propose a unified method that effectively handles latent exogenous variables. Specifically, our method models the data generation process with latent exogenous variables under mild normality assumptions. We then develop a Monte Carlo algorithm to numerically estimate the likelihood function. Extensive experiments on synthetic datasets and three real-world datasets demonstrate the effectiveness of our proposed method. The code is at this https URL.
摘要：推荐系统（RS）旨在提供个性化的内容，但由于选择偏见，它们在无偏学习中面临挑战，在这种偏见中，用户只与他们喜欢的项目进行交互。这种偏见会导致用户偏好的扭曲表示，这阻碍了建议的准确性和公平性。为了解决这个问题，已经开发了各种方法，例如基于错误的基于错误的基于错误的倾向评分和双重强大的技术。尽管取得了进展，从结构性因果模型的角度来看，RS中先前的偏见方法假设外源变量的独立性。在本文中，我们发布了此假设，并提出了一种基于可能性最大化的学习算法以学习预测模型。我们首先讨论未衡量的混杂和我们的场景之间的相关性和差异，然后提出了一种有效处理潜在外源变量的统一方法。具体而言，我们的方法在轻度正态性假设下使用潜在的外源变量对数据生成过程进行建模。然后，我们开发一种蒙特卡洛算法以数值估计似然函数。关于合成数据集和三个现实世界数据集的广泛实验证明了我们提出的方法的有效性。该代码在此HTTPS URL上。
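For readers unfamiliar with the baselines mentioned in the abstract above, the sketch below shows the textbook inverse propensity scoring (IPS) estimator of the average prediction error. It illustrates the prior technique the paper builds on, not the proposed likelihood-based method; the function name and the toy numbers are made up for illustration.

```python
def ips_loss(errors, propensities, observed):
    """Inverse-propensity-weighted estimate of the mean error over all user-item pairs.

    errors[i]       : prediction error for pair i (only used where observed)
    propensities[i] : assumed probability that pair i is observed, in (0, 1]
    observed[i]     : 1 if the rating for pair i was observed, else 0
    """
    total = 0.0
    for e, p, o in zip(errors, propensities, observed):
        if o:
            # Up-weight rarely observed pairs to offset selection bias.
            total += e / p
    return total / len(errors)

# Toy example: four user-item pairs, two of them observed.
toy_errors = [1.0, 2.0, 3.0, 4.0]
toy_propensities = [0.5, 0.5, 0.25, 0.25]
toy_observed = [1, 0, 1, 0]
toy_estimate = ips_loss(toy_errors, toy_propensities, toy_observed)
```

The estimator is unbiased only when the propensities are correct, which is the kind of modelling assumption that debiasing methods in this line of work aim to weaken.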

Title: Domain Randomization for Object Detection in Manufacturing Applications using Synthetic Data: A Comprehensive Study

Authors: Xiaomeng Zhu, Jacob Henningsson, Duruo Li, Pär Mårtensson, Lars Hanson, Mårten Björkman, Atsuto Maki
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.07539
Pdf URL: https://arxiv.org/pdf/2506.07539
Copy Paste: [[2506.07539]] Domain Randomization for Object Detection in Manufacturing Applications using Synthetic Data: A Comprehensive Study(https://arxiv.org/abs/2506.07539)
Keywords: generation
Abstract: This paper addresses key aspects of domain randomization in generating synthetic data for manufacturing object detection applications. To this end, we present a comprehensive data generation pipeline that reflects different factors: object characteristics, background, illumination, camera settings, and post-processing. We also introduce the Synthetic Industrial Parts Object Detection dataset (SIP15-OD) consisting of 15 objects from three industrial use cases under varying environments as a test bed for the study, while also employing an industrial dataset publicly available for robotic applications. In our experiments, we present more abundant results and insights into the feasibility as well as challenges of sim-to-real object detection. In particular, we identified material properties, rendering methods, post-processing, and distractors as important factors. Our method, leveraging these, achieves top performance on the public dataset with YOLOv8 models trained exclusively on synthetic data: mAP@50 scores of 96.4% for the robotics dataset, and 94.1%, 99.5%, and 95.3% across three of the SIP15-OD use cases, respectively. The results showcase the effectiveness of the proposed domain randomization, potentially covering the distribution close to real data for the applications.
摘要：本文介绍了域随机化的关键方面，以生成制造对象检测应用的合成数据。为此，我们提出了一条全面的数据生成管道，该管道反映了不同的因素：对象特征，背景，照明，摄像头设置和后处理。我们还介绍了合成工业零件对象检测数据集（SIP15-OD），该数据集（SIP15-OD）由来自不同环境下的三个工业用例的15个对象作为研究床，同时还采用了用于机器人应用的工业数据集。在我们的实验中，我们对可行性以及SIM真实对象检测的挑战提出了更丰富的结果和见解。特别是，我们确定了材料特性，渲染方法，后处理和干扰因素是重要因素。我们利用这些方法的方法在公共数据集上使用Yolov8模型在公共数据集上实现了最高的性能，该模型专门培训了合成数据。 MAP@机器人数据集为96.4％，分别为96.4％，在三种SIP15-OD用例中分别为94.1％，99.5％和95.3％。结果展示了提出的域随机化的有效性，可能涵盖了应用程序接近实际数据的分布。
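The mAP@50 figures quoted above rest on an intersection-over-union (IoU) test at a threshold of 0.5. The following is a minimal sketch of that matching criterion, assuming axis-aligned boxes in `(x1, y1, x2, y2)` form; the helper names are hypothetical, and the full mAP computation (ranking by confidence, precision-recall integration, averaging over classes) is intentionally omitted.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap region, clamped at zero when disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def is_true_positive(pred_box, gt_box, threshold=0.5):
    """A prediction counts as a match at mAP@50 when IoU >= 0.5."""
    return iou(pred_box, gt_box) >= threshold

overlap = iou((0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0))
```

In this toy case the two boxes share one unit of area out of a union of seven, so the prediction would not count as a match at the 0.5 threshold.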

Title: APTOS-2024 challenge report: Generation of synthetic 3D OCT images from fundus photographs

Authors: Bowen Liu, Weiyi Zhang, Peranut Chotcomwongse, Xiaolan Chen, Ruoyu Chen, Pawin Pakaymaskul, Niracha Arjkongharn, Nattaporn Vongsa, Xuelian Cheng, Zongyuan Ge, Kun Huang, Xiaohui Li, Yiru Duan, Zhenbang Wang, BaoYe Xie, Qiang Chen, Huazhu Fu, Michael A. Mahr, Jiaqi Qu, Wangyiyang Chen, Shiye Wang, Yubo Tan, Yongjie Li, Mingguang He, Danli Shi, Paisan Ruamviboonsuk
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.07542
Pdf URL: https://arxiv.org/pdf/2506.07542
Copy Paste: [[2506.07542]] APTOS-2024 challenge report: Generation of synthetic 3D OCT images from fundus photographs(https://arxiv.org/abs/2506.07542)
Keywords: generation, generative
Abstract: Optical Coherence Tomography (OCT) provides high-resolution, 3D, and non-invasive visualization of retinal layers in vivo, serving as a critical tool for lesion localization and disease diagnosis. However, its widespread adoption is limited by equipment costs and the need for specialized operators. In comparison, 2D color fundus photography offers faster acquisition and greater accessibility with less dependence on expensive devices. Although generative artificial intelligence has demonstrated promising results in medical image synthesis, translating 2D fundus images into 3D OCT images presents unique challenges due to inherent differences in data dimensionality and biological information between modalities. To advance generative models in the fundus-to-3D-OCT setting, the Asia Pacific Tele-Ophthalmology Society (APTOS-2024) organized a challenge titled Artificial Intelligence-based OCT Generation from Fundus Images. This paper details the challenge framework (referred to as APTOS-2024 Challenge), including: the benchmark dataset; the evaluation methodology, featuring two fidelity metrics, image-based distance (pixel-level OCT B-scan similarity) and video-based distance (semantic-level volumetric consistency); and analysis of top-performing solutions. The challenge attracted 342 participating teams, with 42 preliminary submissions and 9 finalists. Leading methodologies incorporated innovations in hybrid data preprocessing or augmentation (cross-modality collaborative paradigms), pre-training on external ophthalmic imaging datasets, integration of vision foundation models, and model architecture improvement. The APTOS-2024 Challenge is the first benchmark demonstrating the feasibility of fundus-to-3D-OCT synthesis as a potential solution for improving ophthalmic care accessibility in under-resourced healthcare settings, while helping to expedite medical research and clinical applications.
摘要：光学相干断层扫描（OCT）提供了体内视网膜层的高分辨率，3D和非侵入性可视化，作为病变定位和疾病诊断的关键工具。但是，其广泛采用受设备成本和专业运营商的需求的限制。相比之下，2D Color Fellus Photography提供了更快的获取和更大的可访问性，而对昂贵的设备的依赖性较小。尽管生成人工智能在医学图像综合中表现出了有希望的结果，但由于数据维度的固有差异和模态之间的生物学信息，将2D底面图像转化为3D OCT图像带来了独特的挑战。为了推动在眼底到3D-OCT环境中的生成模型，亚太电视学会（APTOS-2024）组织了一项挑战，标题为“基于人工智能的Oct Generation”。本文详细介绍了挑战框架（称为APTOS-2024挑战），包括：基准数据集，评估方法，具有两个基于Fidelity指标的距离（像素级OCT B-SCAN相似性）和基于视频的距离（语义级别的体积一致性），以及上层解决方案的分析。挑战吸引了342支参与的团队，有42个初步提交和9名决赛入围者。领先的方法论在混合数据预处理或增强（跨模式协作范式）中纳入了创新，对外部眼科成像数据集进行了预训练，视觉基础模型的集成以及模型架构改进。 APTOS-2024挑战是第一个基准测试，该基准证明了基底到3D-OCT合成的可行性，这是在资源不足的医疗保健环境中改善眼科护理可及性的潜在解决方案，同时有助于加快医疗研究和临床应用。

Title: Synthesize Privacy-Preserving High-Resolution Images via Private Textual Intermediaries

Authors: Haoxiang Wang, Zinan Lin, Da Yu, Huishuai Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.07555
Pdf URL: https://arxiv.org/pdf/2506.07555
Copy Paste: [[2506.07555]] Synthesize Privacy-Preserving High-Resolution Images via Private Textual Intermediaries(https://arxiv.org/abs/2506.07555)
Keywords: generation
Abstract: Generating high-fidelity, differentially private (DP) synthetic images offers a promising route to share and analyze sensitive visual data without compromising individual privacy. However, existing DP image synthesis methods struggle to produce high-resolution outputs that faithfully capture the structure of the original data. In this paper, we introduce a novel method, referred to as Synthesis via Private Textual Intermediaries (SPTI), that can generate high-resolution DP images with easy adoption. The key idea is to shift the challenge of DP image synthesis from the image domain to the text domain by leveraging state-of-the-art DP text generation methods. SPTI first summarizes each private image into a concise textual description using image-to-text models, then applies a modified Private Evolution algorithm to generate DP text, and finally reconstructs images using text-to-image models. Notably, SPTI requires no model training, only inference with off-the-shelf models. Given a private dataset, SPTI produces synthetic images of substantially higher quality than prior DP approaches. On the LSUN Bedroom dataset, SPTI attains an FID $\leq$ 26.71 under $\epsilon = 1.0$, improving over Private Evolution's FID of 40.36. Similarly, on MM-CelebA-HQ, SPTI achieves an FID $\leq$ 33.27 at $\epsilon = 1.0$, compared to 57.01 from DP fine-tuning baselines. Overall, our results demonstrate that Synthesis via Private Textual Intermediaries provides a resource-efficient and proprietary-model-compatible framework for generating high-resolution DP synthetic images, greatly expanding access to private visual datasets.
摘要：产生差异性私有（DP）合成图像的高保真度提供了一种有希望的途径，可以共享和分析敏感的视觉数据，而不会损害个人隐私。但是，现有的DP图像合成方法难以生成高分辨率的输出，以忠实地捕获原始数据的结构。在本文中，我们介绍了一种新颖的方法，即通过私人文本中介机构（SPTI）称为合成，该方法可以生成具有易于采用的高分辨率DP图像。关键思想是通过利用ART DP文本生成方法的状态将DP图像合成的挑战从图像域转移到文本域。 SPTI首先将每个私人图像汇总到使用图像到文本模型的简洁文本描述中，然后应用修改后的私人演进算法来生成DP文本，最后使用文本进行图像模型重建图像。值得注意的是，SPTI不需要模型培训，只需要使用架子模型进行推理。鉴于私有数据集，SPTI产生的质量的合成图像比以前的DP方法要高得多。在LSUN卧室数据集上，SPTI在Epsilon下的FID低于或等于26.71，等于1.0，比私有进化的FID改善了40.36。同样，在MM Celeba HQ上，SPTI在Epsilon等于1.0时的FID小于或等于33.27，而DP微调基线的FID则达到了57.01。总体而言，我们的结果表明，通过私人文本中介机构的合成提供了一种资源高效且专有模型兼容的框架，用于生成高分辨率DP合成图像，从而大大扩展了对私人视觉数据集的访问。

Title: OpenDance: Multimodal Controllable 3D Dance Generation Using Large-scale Internet Data

Authors: Jinlu Zhang, Zixi Kang, Yizhou Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.07565
Pdf URL: https://arxiv.org/pdf/2506.07565
Copy Paste: [[2506.07565]] OpenDance: Multimodal Controllable 3D Dance Generation Using Large-scale Internet Data(https://arxiv.org/abs/2506.07565)
Keywords: generation
Abstract: Music-driven dance generation offers significant creative potential yet faces considerable challenges. The absence of fine-grained multimodal data and the difficulty of flexible multi-conditional generation limit previous works on generation controllability and diversity in practice. In this paper, we build OpenDance5D, an extensive human dance dataset comprising over 101 hours across 14 distinct genres. Each sample has five modalities to facilitate robust cross-modal learning: RGB video, audio, 2D keypoints, 3D motion, and fine-grained textual descriptions from human arts. Furthermore, we propose OpenDanceNet, a unified masked modeling framework for controllable dance generation conditioned on music and arbitrary combinations of text prompts, keypoints, or character positioning. Comprehensive experiments demonstrate that OpenDanceNet achieves high-fidelity and flexible controllability.
摘要：音乐驱动的舞蹈一代具有巨大的创造潜力，但面临着巨大的挑战。缺乏细粒度的多模式数据以及灵活的多条件生成的难度限制了先前在实践中产生可控性和多样性的工作。在本文中，我们构建了Opendance5D，这是一个广泛的人类舞蹈数据集，其中包括14种不同类型的101个小时。每个样本具有五种模式，可以促进稳健的跨模式学习：RGB视频，音频，2D关键点，3D运动和人类艺术的细粒文本描述。此外，我们提出了OpenDancenet，这是一个统一的蒙版建模框架，用于可控的舞蹈生成，以音乐为条件和文本提示，关键点或角色定位的任意组合。全面的实验表明，OpenDancenet具有高保真性和灵活的可控性。

Title: LLM-driven Indoor Scene Layout Generation via Scaled Human-aligned Data Synthesis and Multi-Stage Preference Optimization

Authors: Yixuan Yang, Zhen Luo, Tongsheng Ding, Junru Lu, Mingqi Gao, Jinyu Yang, Victor Sanchez, Feng Zheng
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.07570
Pdf URL: https://arxiv.org/pdf/2506.07570
Copy Paste: [[2506.07570]] LLM-driven Indoor Scene Layout Generation via Scaled Human-aligned Data Synthesis and Multi-Stage Preference Optimization(https://arxiv.org/abs/2506.07570)
Keywords: generation
Abstract: Automatic indoor layout generation has attracted increasing attention due to its potential in interior design, virtual environment construction, and embodied AI. Existing methods fall into two categories: prompt-driven approaches that leverage proprietary LLM services (e.g., GPT APIs) and learning-based methods trained on layout data upon diffusion-based models. Prompt-driven methods often suffer from spatial inconsistency and high computational costs, while learning-based methods are typically constrained by coarse relational graphs and limited datasets, restricting their generalization to diverse room categories. In this paper, we revisit LLM-based indoor layout generation and present 3D-SynthPlace, a large-scale dataset that combines synthetic layouts generated via a 'GPT synthesize, Human inspect' pipeline, upgraded from the 3D-Front dataset. 3D-SynthPlace contains nearly 17,000 scenes, covering four common room types -- bedroom, living room, kitchen, and bathroom -- enriched with diverse objects and high-level spatial annotations. We further introduce OptiScene, a strong open-source LLM optimized for indoor layout generation, fine-tuned based on our 3D-SynthPlace dataset through our two-stage training. For the warm-up stage I, we adopt supervised fine-tuning (SFT), in which the model is taught to first generate high-level spatial descriptions and then conditionally predict concrete object placements. For the reinforcing stage II, to better align the generated layouts with human design preferences, we apply multi-turn direct preference optimization (DPO), which significantly improves layout quality and generation success rates. Extensive experiments demonstrate that OptiScene outperforms traditional prompt-driven and learning-based baselines. Moreover, OptiScene shows promising potential in interactive tasks such as scene editing and robot navigation.
摘要：自动室内布局产生引起了越来越多的关注，因为它在室内设计，虚拟环境构建和体现的AI中的潜力。现有方法分为两类：利用专有LLM服务（例如GPT API）和基于学习的方法的及时驱动的方法，这些方法对基于扩散的模型进行了布局数据培训。及时驱动的方法通常会遭受空间不一致和高计算成本的困扰，而基于学习的方法通常受粗略的关系图和有限的数据集限制，从而将其泛化限制为各种房间类别。在本文中，我们重新访问了基于LLM的室内布局生成，并介绍了3D-synthplace，这是一个大规模数据集，结合了通过“ GPT合成，人体检查”管道生成的合成布局，从3D-Front数据集升级。第3D季军包含近17,000个场景，涵盖了四种公共房间类型 - 卧室，客厅，厨房和浴室 - 充满了不同的物体和高级空间注释。我们进一步介绍了Optiscene，这是一家强大的开源LLM，可针对室内布局生成优化，并通过我们的两个阶段培训基于我们的3D-Synthplace数据集进行了微调。对于第一阶段的第一阶段，我们采用了监督的微调（SFT），该调节被教导首先产生高级空间描述，然后有条件预测具体的对象放置。对于加强II期，为了更好地将生成的布局与人类设计偏好保持一致，我们采用了多转 - 直接偏好优化（DPO），从而显着提高了布局质量和发电的成功率。广泛的实验表明，Optiscene的表现优于传统的及时驱动和基于学习的基线。此外，Optiscene在诸如场景编辑和机器人导航之类的交互式任务中显示出有希望的潜力。
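The second training stage described above relies on direct preference optimization. As a point of reference, the sketch below computes the standard single-pair DPO loss from sequence log-probabilities. It assumes the usual formulation with a frozen reference model and a temperature `beta`; the multi-turn variant used in the paper is not specified in the abstract and is not reproduced here, and the function name and default `beta` are illustrative choices.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    Each argument is the summed log-probability of a full response under the
    policy (logp_*) or the frozen reference model (ref_logp_*).
    """
    # Implicit reward margin: how much more the policy prefers the chosen
    # response over the rejected one, relative to the reference model.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    # -log(sigmoid(beta * margin)) written as a numerically stable softplus.
    z = -beta * margin
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

# Toy values: the policy has shifted probability towards the chosen layout.
neutral = dpo_loss(-1.5, -1.5, -1.5, -1.5)
improved = dpo_loss(-1.0, -2.0, -1.5, -1.5)
```

When the policy matches the reference the loss equals log 2, and it falls below that value as the policy assigns relatively more probability to the preferred layout than the reference does.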

Title: Explore the vulnerability of black-box models via diffusion models

Authors: Jiacheng Shi, Yanfu Zhang, Huajie Shao, Ashley Gao
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2506.07590
Pdf URL: https://arxiv.org/pdf/2506.07590
Copy Paste: [[2506.07590]] Explore the vulnerability of black-box models via diffusion models(https://arxiv.org/abs/2506.07590)
Keywords: generation
Abstract: Recent advancements in diffusion models have enabled high-fidelity and photorealistic image generation across diverse applications. However, these models also present security and privacy risks, including copyright violations, sensitive information leakage, and the creation of harmful or offensive content that could be exploited maliciously. In this study, we uncover a novel security threat where an attacker leverages diffusion model APIs to generate synthetic images, which are then used to train a high-performing substitute model. This enables the attacker to execute model extraction and transfer-based adversarial attacks on black-box classification models with minimal queries, without needing access to the original training data. The generated images are sufficiently high-resolution and diverse to train a substitute model whose outputs closely match those of the target model. Across seven benchmarks, including CIFAR and ImageNet subsets, our method shows an average improvement of 27.37% over state-of-the-art methods while using just 0.01 times the query budget, achieving a 98.68% success rate in adversarial attacks on the target model.
摘要：扩散模型的最新进展已实现了各种应用程序的高保真和逼真的图像产生。但是，这些模型还呈现出安全性和隐私风险，包括侵犯版权，敏感信息泄漏以及可能被恶意剥削的有害或冒犯内容的创建。在这项研究中，我们发现了一种新的安全威胁，其中攻击者利用扩散模型API生成合成图像，然后将其用于训练高性能的替代模型。这使攻击者能够对具有最小查询的黑盒分类模型执行模型提取和基于转移的对抗攻击，而无需访问原始培训数据。生成的图像是足够高分辨率的，并且可以训练替代模型，该模型的输出与目标模型的输出非常匹配。在包括CIFAR和Imagenet子集在内的七个基准测试中，我们的方法在最先进的方法中的平均提高了27.37％，而仅使用0.01次查询预算，在目标模型上实现了对抗性攻击的98.68％的成功率。

Title: TwinBreak: Jailbreaking LLM Security Alignments based on Twin Prompts

Authors: Torsten Krauß, Hamid Dashtbani, Alexandra Dmitrienko
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.07596
Pdf URL: https://arxiv.org/pdf/2506.07596
Copy Paste: [[2506.07596]] TwinBreak: Jailbreaking LLM Security Alignments based on Twin Prompts(https://arxiv.org/abs/2506.07596)
Keywords: generation
Abstract: Machine learning is advancing rapidly, with applications bringing notable benefits, such as improvements in translation and code generation. Models like ChatGPT, powered by Large Language Models (LLMs), are increasingly integrated into daily life. However, alongside these benefits, LLMs also introduce social risks. Malicious users can exploit LLMs by submitting harmful prompts, such as requesting instructions for illegal activities. To mitigate this, models often include a security mechanism that automatically rejects such harmful prompts. However, they can be bypassed through LLM jailbreaks. Current jailbreaks often require significant manual effort, high computational costs, or result in excessive model modifications that may degrade regular utility. We introduce TwinBreak, an innovative safety alignment removal method. Building on the idea that the safety mechanism operates like an embedded backdoor, TwinBreak identifies and prunes parameters responsible for this functionality. By focusing on the most relevant model layers, TwinBreak performs fine-grained analysis of parameters essential to model utility and safety. TwinBreak is the first method to analyze intermediate outputs from prompts with high structural and content similarity to isolate safety parameters. We present the TwinPrompt dataset containing 100 such twin prompts. Experiments confirm TwinBreak's effectiveness, achieving 89% to 98% success rates with minimal computational requirements across 16 LLMs from five vendors.
摘要：机器学习正在迅速发展，应用程序带来了显着的好处，例如改进翻译和代码生成。由大型语言模型（LLMS）提供支持的Chatgpt之类的模型越来越多地整合到日常生活中。但是，除了这些好处，LLMS还引入了社会风险。恶意用户可以通过提交有害提示来利用LLM，例如请求非法活动的说明。为了减轻这种情况，模型通常包括一种安全机制，该机制会自动拒绝这种有害的提示。但是，它们可以通过LLM越狱。当前的越狱通常需要大量的手动努力，高计算成本或导致过度的模型修改，以降低常规效用。我们介绍了TwinBreak，这是一种创新的安全对准方法。基于安全机制像嵌入式后门一样运行的想法，Twinbreak识别和李子参数负责此功能。通过关注最相关的模型层，TwinBreak对模型效用和安全性必不可少的参数进行细粒度分析。 Twinbreak是第一种分析具有高结构和内容相似性提示的中间输出的方法，以隔离安全参数。我们介绍包含100个这样的Twin提示的Twinprompt数据集。实验证实了TwinBreak的有效性，达到了89％至98％的成功率，在五个供应商的16个LLM中，计算要求最小。

Title: SceneRAG: Scene-level Retrieval-Augmented Generation for Video Understanding

Authors: Nianbo Zeng, Haowen Hou, Fei Richard Yu, Si Shi, Ying Tiffany He
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.07600
Pdf URL: https://arxiv.org/pdf/2506.07600
Copy Paste: [[2506.07600]] SceneRAG: Scene-level Retrieval-Augmented Generation for Video Understanding(https://arxiv.org/abs/2506.07600)
Keywords: generation
Abstract: Despite recent advances in retrieval-augmented generation (RAG) for video understanding, effectively understanding long-form video content remains underexplored due to the vast scale and high complexity of video data. Current RAG approaches typically segment videos into fixed-length chunks, which often disrupts the continuity of contextual information and fails to capture authentic scene boundaries. Inspired by the human ability to naturally organize continuous experiences into coherent scenes, we present SceneRAG, a unified framework that leverages large language models to segment videos into narrative-consistent scenes by processing ASR transcripts alongside temporal metadata. SceneRAG further sharpens these initial boundaries through lightweight heuristics and iterative correction. For each scene, the framework fuses information from both visual and textual modalities to extract entity relations and dynamically builds a knowledge graph, enabling robust multi-hop retrieval and generation that account for long-range dependencies. Experiments on the LongerVideos benchmark, featuring over 134 hours of diverse content, confirm that SceneRAG substantially outperforms prior baselines, achieving a win rate of up to 72.5 percent on generation tasks.
摘要：尽管最近在检索启动的生成（RAG）方面取得了进步，但由于视频数据的范围广泛和高复杂性，有效地理解长期视频内容仍然没有得到充实。当前的RAG方法通常将视频分割为固定长度的块，这通常会破坏上下文信息的连续性，并且无法捕获真实的场景边界。受到人类自然组织连续体验进入连贯场景的能力的启发，我们展示了一个统一的框架，该框架通过将ASR转录本与暂时的元数据一起处理，利用大型语言模型将视频分为叙事的场景。 Scenerag通过轻巧的启发式和迭代校正进一步锐化了这些初始界限。对于每个场景，框架都融合了来自视觉和文本模式的信息，以提取实体关系并动态构建知识图，从而实现了可靠的多跳检索和生成，并以长期依赖性为例。 Longervideos基准测试的实验以134小时的不同内容为特征，证实Scenerag在发电任务上的获胜率最高为72.5％。

Title: Scaling Human Activity Recognition: A Comparative Evaluation of Synthetic Data Generation and Augmentation Techniques

Authors: Zikang Leng, Archith Iyer, Thomas Plötz
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.07612
Pdf URL: https://arxiv.org/pdf/2506.07612
Copy Paste: [[2506.07612]] Scaling Human Activity Recognition: A Comparative Evaluation of Synthetic Data Generation and Augmentation Techniques(https://arxiv.org/abs/2506.07612)
Keywords: generation
Abstract: Human activity recognition (HAR) is often limited by the scarcity of labeled datasets due to the high cost and complexity of real-world data collection. To mitigate this, recent work has explored generating virtual inertial measurement unit (IMU) data via cross-modality transfer. While video-based and language-based pipelines have each shown promise, they differ in assumptions and computational cost. Moreover, their effectiveness relative to traditional sensor-level data augmentation remains unclear. In this paper, we present a direct comparison of these two virtual IMU generation approaches against classical data augmentation techniques. We construct a large-scale virtual IMU dataset spanning 100 diverse activities from Kinetics-400 and simulate sensor signals at 22 body locations. The three data generation strategies are evaluated on benchmark HAR datasets (UTD-MHAD, PAMAP2, HAD-AW) using four popular models. Results show that virtual IMU data significantly improves performance over real or augmented data alone, particularly under limited-data conditions. We offer practical guidance on choosing data generation strategies and highlight the distinct advantages and disadvantages of each approach.
摘要：由于实际数据收集的高成本和复杂性，人类活动识别（HAR）通常受到标记数据集的稀缺的限制。为了减轻这种情况，最近的工作探索了通过交叉模式传输生成虚拟惯性测量单元（IMU）数据。尽管基于视频和基于语言的管道各自都表现出了希望，但它们的假设和计算成本有所不同。此外，它们相对于传统传感器级数据增强的有效性尚不清楚。在本文中，我们提出了这两种虚拟IMU生成方法与经典数据增强技术之间的直接比较。我们构建了一个大规模的虚拟IMU数据集，该数据集涵盖了100种来自动力学400的不同活动，并在22个身体位置模拟传感器信号。使用四个流行模型在基准HAR数据集（UTD-MHAD，PAMAP2，HAD-AW）上评估了三种数据生成策略。结果表明，虚拟IMU数据仅比实际或增强数据显着提高性能，尤其是在有限的数据条件下。我们为选择数据生成策略提供了实用的指导，并突出了每种方法的不同优势和缺点。

Title: Synthetic Visual Genome

Authors: Jae Sung Park, Zixian Ma, Linjie Li, Chenhao Zheng, Cheng-Yu Hsieh, Ximing Lu, Khyathi Chandu, Quan Kong, Norimasa Kobori, Ali Farhadi, Yejin Choi, Ranjay Krishna
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.07643
Pdf URL: https://arxiv.org/pdf/2506.07643
Copy Paste: [[2506.07643]] Synthetic Visual Genome(https://arxiv.org/abs/2506.07643)
Keywords: generation
Abstract: Reasoning over visual relationships (spatial, functional, interactional, social, etc.) is considered to be a fundamental component of human cognition. Yet, despite the major advances in visual comprehension in multimodal language models (MLMs), precise reasoning over relationships and their generation remains a challenge. We introduce ROBIN: an MLM instruction-tuned with densely annotated relationships capable of constructing high-quality dense scene graphs at scale. To train ROBIN, we curate SVG, a synthetic scene graph dataset, by completing the missing relations of selected objects in existing scene graphs using a teacher MLM and a carefully designed filtering process to ensure high quality. To generate more accurate and rich scene graphs at scale for any image, we introduce SG-EDIT: a self-distillation framework where GPT-4o further refines ROBIN's predicted scene graphs by removing unlikely relations and/or suggesting relevant ones. In total, our dataset contains 146K images and 5.6M relationships for 2.6M objects. Results show that our ROBIN-3B model, despite being trained on less than 3 million instances, outperforms similar-size models trained on over 300 million instances on relationship understanding benchmarks, and even surpasses larger models up to 13B parameters. Notably, it achieves state-of-the-art performance in referring expression comprehension with a score of 88.9, surpassing the previous best of 87.4. Our results suggest that training on the refined scene graph data is crucial to maintaining high performance across diverse visual reasoning tasks.
摘要：关于视觉关系空间，功能，互动，社会等的推理 - 被认为是人类认知的基本组成部分。然而，尽管多模式模型（MLMS）的视觉理解取得了重大进展，但对关系及其几代人的精确推理仍然是一个挑战。我们介绍了Robin：一种传教士指导，该指令通过密集的注释关系，能够大规模构建高质量的浓密场景图。为了训练罗宾，我们通过使用教师MLM和精心设计的过滤过程在现有场景图中完成所选对象的缺失关系来策划SVG，这是一个合成场景图数据集。为了为任何图像生成更准确，更丰富的场景图表，我们介绍了SG-Edit：一个自distillation框架，GPT-4O通过删除不太可能的关系和/或建议相关的框架进一步完善了Robin的预测场景图。总共，我们的数据集包含146K图像和560万个关系的2.6m对象。结果表明，我们的Robin-3B模型在接受少于300万个实例的培训中，胜过相似大小的模型，该模型在关系中超过3亿个实例上训练了有关理解基准测试的实例，甚至超过了最高13B参数的较大模型。值得注意的是，它以88.9的成绩在引用表达理解的方面取得了最新的表现，超过了87.4的最佳成绩。我们的结果表明，对精致场景图数据进行培训对于在各种视觉推理任务中保持高性能至关重要。

Title: NOVA3D: Normal Aligned Video Diffusion Model for Single Image to 3D Generation

Authors: Yuxiao Yang, Peihao Li, Yuhong Zhang, Junzhe Lu, Xianglong He, Minghan Qin, Weitao Wang, Haoqian Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.07698
Pdf URL: https://arxiv.org/pdf/2506.07698
Copy Paste: [[2506.07698]] NOVA3D: Normal Aligned Video Diffusion Model for Single Image to 3D Generation(https://arxiv.org/abs/2506.07698)
Keywords: generation
Abstract: 3D AI-generated content (AIGC) has made 3D content creation increasingly accessible to everyone. While recent methods leverage Score Distillation Sampling to distill 3D objects from pretrained image diffusion models, they often suffer from inadequate 3D priors, leading to insufficient multi-view consistency. In this work, we introduce NOVA3D, an innovative single-image-to-3D generation framework. Our key insight lies in leveraging strong 3D priors from a pretrained video diffusion model and integrating geometric information during multi-view video fine-tuning. To facilitate information exchange between color and geometric domains, we propose the Geometry-Temporal Alignment (GTA) attention mechanism, thereby improving generalization and multi-view consistency. Moreover, we introduce the de-conflict geometry fusion algorithm, which improves texture fidelity by addressing multi-view inaccuracies and resolving discrepancies in pose alignment. Extensive experiments validate the superiority of NOVA3D over existing baselines.
摘要：3D AI生成的内容（AIGC）使任何人成为3D内容创建者都越来越访问。尽管最近的方法利用得分蒸馏采样来蒸馏到预处理的图像扩散模型中，但它们通常遭受3D先验不足的影响，导致多视图一致性不足。在这项工作中，我们介绍了一个创新的单图像到3D代框架Nova3d。我们的关键见解在于利用预验证的视频扩散模型中的强3D先验和在多视频视频微调过程中整合几何信息。为了促进颜色和几何结构域之间的信息交换，我们提出了几何形状对准（GTA）注意机制，从而提高了概括和多视图的一致性。此外，我们介绍了电流冲突的几何融合算法，该算法通过解决姿势对齐中的多视图不准确和解决差异来提高纹理保真度。广泛的实验验证了NOVA3D优于现有基线的优势。

Title: Adaptive Blind Super-Resolution Network for Spatial-Specific and Spatial-Agnostic Degradations

Authors: Weilei Wen, Chunle Guo, Wenqi Ren, Hongpeng Wang, Xiuli Shao
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2506.07705
Pdf URL: https://arxiv.org/pdf/2506.07705
Copy Paste: [[2506.07705]] Adaptive Blind Super-Resolution Network for Spatial-Specific and Spatial-Agnostic Degradations(https://arxiv.org/abs/2506.07705)
Keywords: super-resolution
Abstract: Prior methodologies have disregarded the diversities among distinct degradation types during image reconstruction, employing a uniform network model to handle multiple deteriorations. Nevertheless, we discover that prevalent degradation modalities, including sampling, blurring, and noise, can be roughly categorized into two classes. We classify the first class as spatial-agnostic dominant degradations, less affected by regional changes in image space, such as downsampling and noise degradation. The second class degradation type is intimately associated with the spatial position of the image, such as blurring, and we identify them as spatial-specific dominant degradations. We introduce a dynamic filter network integrating global and local branches to address these two degradation types. This network can greatly alleviate the practical degradation problem. Specifically, the global dynamic filtering layer can perceive the spatial-agnostic dominant degradation in different images by applying weights generated by the attention mechanism to multiple parallel standard convolution kernels, enhancing the network's representation ability. Meanwhile, the local dynamic filtering layer converts feature maps of the image into a spatially specific dynamic filtering operator, which performs spatially specific convolution operations on the image features to handle spatial-specific dominant degradations. By effectively integrating both global and local dynamic filtering operators, our proposed method outperforms state-of-the-art blind super-resolution algorithms in both synthetic and real image datasets.
摘要：先前的方法已经忽略了图像重建过程中不同降解类型之间的多样性，采用统一的网络模型来处理多种恶化。然而，我们发现普遍的降解方式（包括抽样，模糊和噪声）可以大致分为两类。我们将一类分类为空间不足的显性降解，受图像空间的区域变化（例如下采样和噪声降解）的影响较小。第二类降解类型与图像的空间位置密切相关，例如模糊，我们将其确定为空间特异性的主要降解。我们引入了一个动态过滤网络，集成了全局和本地分支，以解决这两种降解类型。该网络可以大大减轻实际的退化问题。具体而言，通过将注意机制生成的权重应用于多个平行的标准卷积内核，从而增强了网络的表示能力，可以通过将注意力机制生成的权重应用到不同图像中的空间 - 不合时性降解。同时，局部动态滤波层将图像的图将图像转换为空间特定的动态滤波操作员，该操作员在图像特征上执行具有空间特定的卷积操作，以处理空间特异性的主要降解。通过有效整合全局和局部动态过滤操作符，我们提出的方法在合成图像和真实图像数据集中都优于最先进的盲目超级分辨率算法。
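
The global dynamic filtering idea in the abstract above (attention-generated weights applied to multiple parallel standard convolution kernels) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `aggregate_kernels` and `conv2d_valid` are hypothetical, the attention logits are supplied by hand rather than predicted by a network, and softmax normalisation of the weights is an assumption.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_kernels(kernels, attention_logits):
    """Blend K same-shaped 2D kernels into one kernel.

    Assumption: the per-image attention logits are normalised with softmax
    and used as mixing weights over the parallel kernels.
    """
    weights = softmax(attention_logits)
    rows, cols = len(kernels[0]), len(kernels[0][0])
    return [
        [sum(w * k[r][c] for w, k in zip(weights, kernels)) for c in range(cols)]
        for r in range(rows)
    ]


def conv2d_valid(image, kernel):
    """Plain 'valid' 2D cross-correlation on nested lists."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b] for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


if __name__ == "__main__":
    kernel_a = [[1.0, 0.0], [0.0, 1.0]]
    kernel_b = [[0.0, 2.0], [2.0, 0.0]]
    # Equal logits give equal mixing weights for the two kernels.
    blended = aggregate_kernels([kernel_a, kernel_b], [0.0, 0.0])
    print(conv2d_valid([[1.0, 2.0], [3.0, 4.0]], blended))
```

Because convolution is linear in the kernel, blending the kernels first costs one convolution per image instead of one per parallel kernel, which is the usual motivation for this style of dynamic filtering.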

Title: Evaluating Robustness in Latent Diffusion Models via Embedding Level Augmentation

Authors: Boris Martirosyan, Alexey Karmanov
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.07706
Pdf URL: https://arxiv.org/pdf/2506.07706
Copy Paste: [[2506.07706]] Evaluating Robustness in Latent Diffusion Models via Embedding Level Augmentation(https://arxiv.org/abs/2506.07706)
Keywords: generation
Abstract: Latent diffusion models (LDMs) achieve state-of-the-art performance across various tasks, including image generation and video synthesis. However, they generally lack robustness, a limitation that remains not fully explored in current research. In this paper, we propose several methods to address this gap. First, we hypothesize that the robustness of LDMs should primarily be measured without their text encoder, because if we take and explore the whole architecture, the problems of the image generator and the text encoder will be fused. Second, we introduce novel data augmentation techniques designed to reveal robustness shortcomings in LDMs when processing diverse textual prompts. We then fine-tune Stable Diffusion 3 and Stable Diffusion XL models using Dreambooth, incorporating these proposed augmentation methods across multiple tasks. Finally, we propose a novel evaluation pipeline specifically tailored to assess the robustness of LDMs fine-tuned via Dreambooth.
摘要：潜在扩散模型（LDMS）在各种任务（包括图像生成和视频综合）上实现最先进的性能。但是，它们通常缺乏健壮性，这一限制在当前的研究中仍未得到充分探索。在本文中，我们提出了几种解决此差距的方法。首先，我们假设LDMS的鲁棒性主要应在没有其文本编码的情况下测量，因为如果我们进行并探索整个体系结构，则融合了图像生成器和文本编码器的问题。其次，我们介绍了新型的数据增强技术，旨在在处理多种文本提示时揭示LDM的鲁棒性缺陷。然后，我们使用Dreambooth微调了稳定的扩散3和稳定的扩散XL模型，并在多个任务中结合了这些提出的增强方法。最后，我们提出了一条新型的评估管道，专门针对通过Dreambooth进行微调的LDMS的鲁棒性量身定制。

Title: Consistent Video Editing as Flow-Driven Image-to-Video Generation

Authors: Ge Wang, Songlin Fan, Hangxu Liu, Quanjian Song, Hewei Wang, Jinfeng Xu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.07713
Pdf URL: https://arxiv.org/pdf/2506.07713
Copy Paste: [[2506.07713]] Consistent Video Editing as Flow-Driven Image-to-Video Generation(https://arxiv.org/abs/2506.07713)
Keywords: generation
Abstract: With the rise of video diffusion models, downstream applications like video editing have advanced significantly without incurring much computational cost. One particular challenge in this task lies in the motion transfer process from the source video to the edited one, which requires accounting for the shape deformation in between while maintaining temporal consistency in the generated video sequence. However, existing methods fail to model complicated motion patterns for video editing, and are fundamentally limited to object replacement, where tasks with non-rigid object motions like multi-object and portrait editing are largely neglected. In this paper, we observe that optical flows offer a promising alternative in complex motion modeling, and present FlowV2V to re-investigate video editing as a task of flow-driven Image-to-Video (I2V) generation. Specifically, FlowV2V decomposes the entire pipeline into first-frame editing and conditional I2V generation, and simulates a pseudo flow sequence that aligns with the deformed shape, thus ensuring consistency during editing. Experimental results on DAVIS-EDIT, with improvements of 13.67% and 50.66% on DOVER and warping error, illustrate the superior temporal consistency and sample quality of FlowV2V compared to existing state-of-the-art methods. Furthermore, we conduct comprehensive ablation studies to analyze the internal functionalities of the first-frame paradigm and flow alignment in the proposed method.
摘要：随着视频扩散模型的繁荣，诸如视频编辑之类的下游应用程序已得到了大量促进，而没有消耗大量计算成本。此任务中的一个特殊挑战在于运动转移过程从源视频到编辑的一个，在该过程中，它需要考虑两者之间的形状变形，同时维持生成的视频序列中的时间一致性。但是，现有方法无法为视频编辑建模复杂的运动模式，并且从根本上仅限于对象置换，在这种情况下，具有多物体对象运动（如多对象和肖像编辑）的任务在很大程度上被忽略了。在本文中，我们观察到，光流在复杂的运动建模中提供了有前途的替代方案，并将FlowV2V提供了重新调查视频编辑，作为流动驱动的图像到视频（I2V）一代的任务。具体而言，FlowV2V将整个管道分解为第一框架编辑和条件I2V生成，并模拟与变形形状对齐的伪流序列，从而确保编辑过程中的一致性。 Davis-Edit的实验结果，多佛和扭曲错误的改善为13.67％和50.66％，这说明了与现有最新的最先进的时间一致性和样本质量相比。此外，我们进行了全面的消融研究，以分析拟议方法中第一范围范式和流动比对的内部功能。

Title: AssetDropper: Asset Extraction via Diffusion Models with Reward-Driven Optimization

Authors: Lanjiong Li, Guanhua Zhao, Lingting Zhu, Zeyu Cai, Lequan Yu, Jian Zhang, Zeyu Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.07738
Pdf URL: https://arxiv.org/pdf/2506.07738
Copy Paste: [[2506.07738]] AssetDropper: Asset Extraction via Diffusion Models with Reward-Driven Optimization(https://arxiv.org/abs/2506.07738)
Keywords: generative
Abstract: Recent research on generative models has primarily focused on creating product-ready visual outputs; however, designers often favor access to standardized asset libraries, a domain that has yet to be significantly enhanced by generative capabilities. Although open-world scenes provide ample raw materials for designers, efficiently extracting high-quality, standardized assets remains a challenge. To address this, we introduce AssetDropper, the first framework designed to extract assets from reference images, providing artists with an open-world asset palette. Our model adeptly extracts a front view of selected subjects from input images, effectively handling complex scenarios such as perspective distortion and subject occlusion. We establish a synthetic dataset of more than 200,000 image-subject pairs and a real-world benchmark with thousands more for evaluation, facilitating future research on downstream tasks. Furthermore, to ensure precise asset extraction that aligns well with the image prompts, we employ a pre-trained reward model to form a closed loop with feedback. We design the reward model to perform an inverse task that pastes the extracted assets back into the reference sources, which assists training with additional consistency and mitigates hallucination. Extensive experiments show that, with the aid of reward-driven optimization, AssetDropper achieves state-of-the-art results in asset extraction. Project page: this http URL.
摘要：关于生成模型的最新研究主要集中在创建可用产品的视觉输出上。但是，设计师通常会赞成访问标准化资产库，这一领域尚未通过生成能力显着增强。尽管开放世界的场景为设计师提供了充足的原材料，但有效提取高质量的资产仍然是一个挑战。为了解决这个问题，我们介绍了AssetDropper，这是第一个框架，旨在从参考图像中提取资产，为艺术家提供开放世界的资产调色板。我们的模型从输入图像中熟练提取所选主题的前视图，有效地处理复杂的方案，例如透视失真和受试者遮挡。我们建立了一个超过200,000个图像对象对的合成数据集和一个现实世界中的基准，具有数千个用于评估的基准，从而促进了下游任务中未来研究的探索。此外，为了确保精确的资产提取与图像提示很好地保持一致，我们采用了预先培训的奖励模型来实现带有反馈的闭环。我们设计了奖励模型，以执行逆任务，该任务将提取的资产粘贴回参考来源，该资产有助于培训以额外的一致性并减轻幻觉。广泛的实验表明，借助奖励驱动的优化，AssetDropper实现了最新的资产提取结果。项目页面：此HTTP URL。

Title: Flow-Anything: Learning Real-World Optical Flow Estimation from Large-Scale Single-view Images

Authors: Yingping Liang, Ying Fu, Yutao Hu, Wenqi Shao, Jiaming Liu, Debing Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.07740
Pdf URL: https://arxiv.org/pdf/2506.07740
Copy Paste: [[2506.07740]] Flow-Anything: Learning Real-World Optical Flow Estimation from Large-Scale Single-view Images(https://arxiv.org/abs/2506.07740)
Keywords: generation
Abstract: Optical flow estimation is a crucial subfield of computer vision, serving as a foundation for video tasks. However, the real-world robustness is limited by animated synthetic datasets for training. This introduces domain gaps when applied to real-world applications and limits the benefits of scaling up datasets. To address these challenges, we propose Flow-Anything, a large-scale data generation framework designed to learn optical flow estimation from any single-view images in the real world. We employ two effective steps to make data scaling-up promising. First, we convert a single-view image into a 3D representation using advanced monocular depth estimation networks. This allows us to render optical flow and novel view images under a virtual camera. Second, we develop an Object-Independent Volume Rendering module and a Depth-Aware Inpainting module to model the dynamic objects in the 3D representation. These two steps allow us to generate realistic datasets for training from large-scale single-view images, namely FA-Flow Dataset. For the first time, we demonstrate the benefits of generating optical flow training data from large-scale real-world images, outperforming the most advanced unsupervised methods and supervised methods on synthetic datasets. Moreover, our models serve as a foundation model and enhance the performance of various downstream video tasks.
摘要：光流估计是计算机视觉的关键子场，是视频任务的基础。但是，现实世界的鲁棒性受动画合成数据集的限制，用于培训。当应用于现实世界应用程序时，这引入了域间隙，并限制了扩展数据集的好处。为了应对这些挑战，我们建议\ textbf {Flow-thing}，这是一个大规模的数据生成框架，旨在从现实世界中的任何单视图像中学习光流估计。我们采用两个有效的步骤来扩展数据。首先，我们使用高级单眼深度估计网络将单视图像转换为3D表示。这使我们能够在虚拟相机下渲染光流和新颖的图像。其次，我们开发了独立于对象的音量渲染模块和一个深度感知的镶嵌模块，以模拟3D表示中的动态对象。这两个步骤使我们能够生成现实的数据集，用于从大规模单视图像培训，即\ textbf {fa-flow dataset}。我们第一次演示了从大型现实世界图像中生成光流训练数据的好处，表现优于最先进的无监督方法和合成数据集上的监督方法。此外，我们的模型是基础模型，并提高了各种下游视频任务的性能。

Title: Difference Inversion: Interpolate and Isolate the Difference with Token Consistency for Image Analogy Generation

Authors: Hyunsoo Kim, Donghyun Kim, Suhyun Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.07750
Pdf URL: https://arxiv.org/pdf/2506.07750
Copy Paste: [[2506.07750]] Difference Inversion: Interpolate and Isolate the Difference with Token Consistency for Image Analogy Generation(https://arxiv.org/abs/2506.07750)
Keywords: generation
Abstract: How can we generate an image B' that satisfies A:A'::B:B', given the input images A, A' and B? Recent works have tackled this challenge through approaches like visual in-context learning or visual instruction. However, these methods are typically limited to specific models (e.g., InstructPix2Pix, inpainting models) rather than general diffusion models (e.g., Stable Diffusion, SDXL). This dependency may lead to inherited biases or lower editing capabilities. In this paper, we propose Difference Inversion, a method that isolates only the difference between A and A' and applies it to B to generate a plausible B'. To address model dependency, it is crucial to structure prompts in the form of a "Full Prompt" suitable for input to Stable Diffusion models, rather than using an "Instruction Prompt". To this end, we accurately extract the difference between A and A' and combine it with the prompt of B, enabling a plug-and-play application of the difference. To extract a precise difference, we first identify it through 1) Delta Interpolation. Additionally, to ensure accurate training, we propose the 2) Token Consistency Loss and 3) Zero Initialization of Token Embeddings. Our extensive experiments demonstrate that Difference Inversion outperforms existing baselines both quantitatively and qualitatively, indicating its ability to generate more feasible B' in a model-agnostic manner.
摘要：给定输入图像a，a'和b，我们如何生成满足a：a':: b：b'的图像b'？最近的作品通过视觉中的内在学习或视觉指导等方法来应对这一挑战。但是，这些方法通常仅限于特定模型（例如ConstrumentPix2Pix。涂上模型），而不是一般的扩散模型（例如稳定的扩散，SDXL）。这种依赖性可能导致继承的偏见或较低的编辑功能。在本文中，我们提出了差异反演，这种方法仅分离与a和a'的差异并将其应用于b以生成合理的b'。为了解决模型依赖性，至关重要的是，以“完整提示”的形式结构提示至关重要，适合输入稳定的扩散模型，而不是使用“指令提示”。为此，我们准确地提取A和A之间的差异，并将其与B的提示结合在一起，从而实现差异的插件应用。要提取精确的差异，我们首先通过1）三角插值来识别它。此外，为了确保准确的训练，我们提出2）令牌一致性损失和3）令牌嵌入的零初始化。我们的广泛实验表明，差异反演的表现优于定量和定性的现有基准，这表明其能够以模型 - 敏捷的方式生成更可行的B'。

Title: Comparing Credit Risk Estimates in the Gen-AI Era

Authors: Nicola Lavecchia, Sid Fadanelli, Federico Ricciuti, Gennaro Aloe, Enrico Bagli, Pietro Giuffrida, Daniele Vergari
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.07754
Pdf URL: https://arxiv.org/pdf/2506.07754
Copy Paste: [[2506.07754]] Comparing Credit Risk Estimates in the Gen-AI Era(https://arxiv.org/abs/2506.07754)
Keywords: generative
Abstract: Generative AI technologies have demonstrated significant potential across diverse applications. This study provides a comparative analysis of credit score modeling techniques, contrasting traditional approaches with those leveraging generative AI. Our findings reveal that current generative AI models fall short of matching the performance of traditional methods, regardless of the integration strategy employed. These results highlight the limitations in the current capabilities of generative AI for credit risk scoring, emphasizing the need for further research and development before generative AI can be applied to this specific task or equivalent ones.
摘要：生成的AI技术在不同的应用中表现出了巨大的潜力。这项研究提供了对信用评分建模技术的比较分析，将传统方法与利用生成AI的方法进行了对比。我们的发现表明，无论采用什么集成策略，当前的生成AI模型都无法匹配传统方法的性能。这些结果强调了生成AI的当前功能对信用风险评分的局限性，强调了在将生成AI应用于该特定任务或同等学位的可能性之前进行进一步的研究和开发的必要性。

Title: Language-Vision Planner and Executor for Text-to-Visual Reasoning

Authors: Yichang Xu, Gaowen Liu, Ramana Rao Kompella, Sihao Hu, Tiansheng Huang, Fatih Ilhan, Selim Furkan Tekin, Zachary Yahn, Ling Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.07778
Pdf URL: https://arxiv.org/pdf/2506.07778
Copy Paste: [[2506.07778]] Language-Vision Planner and Executor for Text-to-Visual Reasoning(https://arxiv.org/abs/2506.07778)
Keywords: generation
Abstract: The advancement in large language models (LLMs) and large vision models has fueled the rapid progress in multi-modal visual-text reasoning capabilities. However, existing vision-language models (VLMs) to date suffer from poor generalization performance. Inspired by recent development in LLMs for visual reasoning, this paper presents VLAgent, an AI system that can create a step-by-step visual reasoning plan with an easy-to-understand script and execute each step of the plan in real time by integrating the planning script with execution verifications via an automated process supported by VLAgent. In the task planning phase, VLAgent fine-tunes an LLM through in-context learning to generate a step-by-step planner for each user-submitted text-visual reasoning task. During the plan execution phase, VLAgent progressively refines the composition of neuro-symbolic executable modules to generate high-confidence reasoning results. VLAgent has three unique design characteristics: First, we improve the quality of plan generation through in-context learning, improving logic reasoning by reducing erroneous logic steps, incorrect programs, and LLM hallucinations. Second, we design a syntax-semantics parser to identify and correct additional logic errors of the LLM-generated planning script prior to launching the plan executor. Finally, we employ the ensemble method to improve the generalization performance of our step-executor. Extensive experiments with four visual reasoning benchmarks (GQA, MME, NLVR2, VQAv2) show that VLAgent achieves significant performance enhancement for multimodal text-visual reasoning applications, compared to the existing representative VLMs and LLM-based visual composition approaches like ViperGPT and VisProg, thanks to the novel optimization modules of the VLAgent back-engine (SS-Parser, Plan Repairer, Output Verifiers). Code and data will be made available upon paper acceptance.
摘要：大语言模型（LLM）和大型视觉模型的进步推动了多模式视觉文本推理能力的快速进步。但是，迄今为止，现有的视觉模型（VLM）遭受了概括性能的影响。受LLM最近开发的视觉推理发展的启发，本文介绍了Vlagent，这是一个AI系统，该系统可以通过易于理解的脚本创建逐步的视觉推理计划，并通过通过Vlagent支持的自动化过程将计划脚本与执行验证进行实时执行。在任务计划阶段，Vlagent通过文本学习LLM进行了llm，为每个用户提交的文本 - 视觉推理任务生成一个逐步计划者。在计划执行阶段，Vlagent逐渐完善了神经符号可执行模块的组成，以产生高信心推理结果。 Vlagent具有三个独特的设计特征：首先，我们通过秘密学习来提高计划生成的质量，从而通过减少错误的逻辑步骤，不正确的程序和LLM幻觉来改善逻辑推理。其次，我们设计了一个语法 - 模式解析器，以在启动计划执行者之前识别和纠正LLM生成的计划脚本的其他逻辑错误。最后，我们采用合奏方法来改善步进任务的概括性能。具有四个视觉推理基准（GQA，MME，NLVR2，VQAV2）的广泛实验表明，与退出的代表性VLMS和LLM基于Vipergpt和Visprog的vipergpt和Grespengenge，Nemps Grackegrine，Moveling for to Modegenge for to Modegenge to for Nomgy for to to viperty to vlagent可在多模式文本 - 视觉推理应用程序上实现显着的性能增强。（SS-Parser，计划维修器，输出验证者）。代码和数据将在纸张接受后提供。

Title: Re-ranking Reasoning Context with Tree Search Makes Large Vision-Language Models Stronger

Authors: Qi Yang, Chenghao Zhang, Lubin Fan, Kun Ding, Jieping Ye, Shiming Xiang
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.07785
Pdf URL: https://arxiv.org/pdf/2506.07785
Copy Paste: [[2506.07785]] Re-ranking Reasoning Context with Tree Search Makes Large Vision-Language Models Stronger(https://arxiv.org/abs/2506.07785)
Keywords: generation
Abstract: Recent advancements in Large Vision Language Models (LVLMs) have significantly improved performance in Visual Question Answering (VQA) tasks through multimodal Retrieval-Augmented Generation (RAG). However, existing methods still face challenges, such as the scarcity of knowledge with reasoning examples and erratic responses from retrieved knowledge. To address these issues, in this study, we propose a multimodal RAG framework, termed RCTS, which enhances LVLMs by constructing a Reasoning Context-enriched knowledge base and a Tree Search re-ranking method. Specifically, we introduce a self-consistent evaluation mechanism to enrich the knowledge base with intrinsic reasoning patterns. We further propose a Monte Carlo Tree Search with Heuristic Rewards (MCTS-HR) to prioritize the most relevant examples. This ensures that LVLMs can leverage high-quality contextual reasoning for better and more consistent responses. Extensive experiments demonstrate that our framework achieves state-of-the-art performance on multiple VQA datasets, significantly outperforming In-Context Learning (ICL) and Vanilla-RAG methods. It highlights the effectiveness of our knowledge base and re-ranking method in improving LVLMs. Our code is available at this https URL.

Title: Incorporating Uncertainty-Guided and Top-k Codebook Matching for Real-World Blind Image Super-Resolution

Authors: Weilei Wen, Tianyi Zhang, Qianqian Zhao, Zhaohui Zheng, Chunle Guo, Xiuli Shao, Chongyi Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.07809
Pdf URL: https://arxiv.org/pdf/2506.07809
Copy Paste: [[2506.07809]] Incorporating Uncertainty-Guided and Top-k Codebook Matching for Real-World Blind Image Super-Resolution(https://arxiv.org/abs/2506.07809)
Keywords: super-resolution
Abstract: Recent advancements in codebook-based real image super-resolution (SR) have shown promising results in real-world applications. The core idea involves matching high-quality image features from a codebook based on low-resolution (LR) image features. However, existing methods face two major challenges: inaccurate feature matching with the codebook and poor texture detail reconstruction. To address these issues, we propose a novel Uncertainty-Guided and Top-k Codebook Matching SR (UGTSR) framework, which incorporates three key components: (1) an uncertainty learning mechanism that guides the model to focus on texture-rich regions, (2) a Top-k feature matching strategy that enhances feature matching accuracy by fusing multiple candidate features, and (3) an Align-Attention module that enhances the alignment of information between LR and HR features. Experimental results demonstrate significant improvements in texture realism and reconstruction fidelity compared to existing methods. We will release the code upon formal publication.

Title: Self-Cascaded Diffusion Models for Arbitrary-Scale Image Super-Resolution

Authors: Junseo Bang, Joonhee Lee, Kyeonghyun Lee, Haechang Lee, Dong Un Kang, Se Young Chun
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.07813
Pdf URL: https://arxiv.org/pdf/2506.07813
Copy Paste: [[2506.07813]] Self-Cascaded Diffusion Models for Arbitrary-Scale Image Super-Resolution(https://arxiv.org/abs/2506.07813)
Keywords: super-resolution, generative
Abstract: Arbitrary-scale image super-resolution aims to upsample images to any desired resolution, offering greater flexibility than traditional fixed-scale super-resolution. Recent approaches in this domain utilize regression-based or generative models, but many of them use a single-stage upsampling process, which may be challenging to learn across a wide, continuous distribution of scaling factors. Progressive upsampling strategies have shown promise in mitigating this issue, yet their integration with diffusion models for flexible upscaling remains underexplored. Here, we present CasArbi, a novel self-cascaded diffusion framework for arbitrary-scale image super-resolution. CasArbi meets the varying scaling demands by breaking them down into smaller sequential factors and progressively enhancing the image resolution at each step with seamless transitions for arbitrary scales. Our novel coordinate-guided residual diffusion model allows for the learning of continuous image representations while enabling efficient diffusion sampling. Extensive experiments demonstrate that our CasArbi outperforms prior art in both perceptual and distortion performance metrics across diverse arbitrary-scale super-resolution benchmarks.
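Illustrative sketch (not from the paper): the CasArbi abstract describes breaking a target scale into smaller sequential factors but does not say how those factors are chosen. The Python sketch below assumes one simple rule, equal geometric steps with a hypothetical per-step cap `max_step`; the function name `cascade_factors` and the default cap of 2.0 are assumptions made here for illustration only.

```python
import math


def cascade_factors(target_scale: float, max_step: float = 2.0) -> list[float]:
    """Split target_scale into equal per-step factors, each at most max_step.

    Assumed rule for illustration; the paper's actual schedule may differ.
    """
    if target_scale < 1.0:
        raise ValueError("target_scale must be >= 1")
    if max_step <= 1.0:
        raise ValueError("max_step must be > 1")
    if target_scale == 1.0:
        return []  # nothing to upsample
    # Smallest number of steps such that max_step ** steps >= target_scale.
    # The small epsilon guards against floating-point noise in the log ratio.
    steps = max(1, math.ceil(math.log(target_scale) / math.log(max_step) - 1e-12))
    factor = target_scale ** (1.0 / steps)
    return [factor] * steps


if __name__ == "__main__":
    for scale in (1.5, 4.0, 6.0):
        print(scale, cascade_factors(scale))
```

Under this assumed rule the product of the returned factors equals the requested scale up to floating-point error, so applying them one after another reaches the target resolution.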

Title: M2Restore: Mixture-of-Experts-based Mamba-CNN Fusion Framework for All-in-One Image Restoration

Authors: Yongzhen Wang, Yongjun Li, Zhuoran Zheng, Xiao-Ping Zhang, Mingqiang Wei
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.07814
Pdf URL: https://arxiv.org/pdf/2506.07814
Copy Paste: [[2506.07814]] M2Restore: Mixture-of-Experts-based Mamba-CNN Fusion Framework for All-in-One Image Restoration(https://arxiv.org/abs/2506.07814)
Keywords: restoration
Abstract: Natural images are often degraded by complex, composite degradations such as rain, snow, and haze, which adversely impact downstream vision applications. While existing image restoration efforts have achieved notable success, they are still hindered by two critical challenges: limited generalization across dynamically varying degradation scenarios and a suboptimal balance between preserving local details and modeling global dependencies. To overcome these challenges, we propose M2Restore, a novel Mixture-of-Experts (MoE)-based Mamba-CNN fusion framework for efficient and robust all-in-one image restoration. M2Restore introduces three key contributions: First, to boost the model's generalization across diverse degradation conditions, we exploit a CLIP-guided MoE gating mechanism that fuses task-conditioned prompts with CLIP-derived semantic priors. This mechanism is further refined via cross-modal feature calibration, which enables precise expert selection for various degradation types. Second, to jointly capture global contextual dependencies and fine-grained local details, we design a dual-stream architecture that integrates the localized representational strength of CNNs with the long-range modeling efficiency of Mamba. This integration enables collaborative optimization of global semantic relationships and local structural fidelity, preserving global coherence while enhancing detail restoration. Third, we introduce an edge-aware dynamic gating mechanism that adaptively balances global modeling and local enhancement by reallocating computational attention to degradation-sensitive regions. This targeted focus leads to more efficient and precise restoration. Extensive experiments across multiple image restoration benchmarks validate the superiority of M2Restore in both visual quality and quantitative performance.

Title: Accelerating Diffusion Models in Offline RL via Reward-Aware Consistency Trajectory Distillation

Authors: Xintong Duan, Yutong He, Fahim Tajwar, Ruslan Salakhutdinov, J. Zico Kolter, Jeff Schneider
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.07822
Pdf URL: https://arxiv.org/pdf/2506.07822
Copy Paste: [[2506.07822]] Accelerating Diffusion Models in Offline RL via Reward-Aware Consistency Trajectory Distillation(https://arxiv.org/abs/2506.07822)
Keywords: generation
Abstract: Although diffusion models have achieved strong results in decision-making tasks, their slow inference speed remains a key limitation. While the consistency model offers a potential solution, its applications to decision-making often struggle with suboptimal demonstrations or rely on complex concurrent training of multiple networks. In this work, we propose a novel approach to consistency distillation for offline reinforcement learning that directly incorporates reward optimization into the distillation process. Our method enables single-step generation while maintaining higher performance and simpler training. Empirical evaluations on the Gym MuJoCo benchmarks and long horizon planning demonstrate that our approach can achieve an 8.7% improvement over previous state-of-the-art while offering up to 142x speedup over diffusion counterparts in inference time.

Title: R3D2: Realistic 3D Asset Insertion via Diffusion for Autonomous Driving Simulation

Authors: William Ljungbergh, Bernardo Taveira, Wenzhao Zheng, Adam Tonderski, Chensheng Peng, Fredrik Kahl, Christoffer Petersson, Michael Felsberg, Kurt Keutzer, Masayoshi Tomizuka, Wei Zhan
Subjects: cs.CV, cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2506.07826
Pdf URL: https://arxiv.org/pdf/2506.07826
Copy Paste: [[2506.07826]] R3D2: Realistic 3D Asset Insertion via Diffusion for Autonomous Driving Simulation(https://arxiv.org/abs/2506.07826)
Keywords: generative
Abstract: Validating autonomous driving (AD) systems requires diverse and safety-critical testing, making photorealistic virtual environments essential. Traditional simulation platforms, while controllable, are resource-intensive to scale and often suffer from a domain gap with real-world data. In contrast, neural reconstruction methods like 3D Gaussian Splatting (3DGS) offer a scalable solution for creating photorealistic digital twins of real-world driving scenes. However, they struggle with dynamic object manipulation and reusability as their per-scene optimization-based methodology tends to result in incomplete object models with integrated illumination effects. This paper introduces R3D2, a lightweight, one-step diffusion model designed to overcome these limitations and enable realistic insertion of complete 3D assets into existing scenes by generating plausible rendering effects, such as shadows and consistent lighting, in real time. This is achieved by training R3D2 on a novel dataset: 3DGS object assets are generated from in-the-wild AD data using an image-conditioned 3D generative model, and then synthetically placed into neural rendering-based virtual environments, allowing R3D2 to learn realistic integration. Quantitative and qualitative evaluations demonstrate that R3D2 significantly enhances the realism of inserted assets, enabling use-cases like text-to-3D asset insertion and cross-scene/dataset object transfer, allowing for true scalability in AD validation. To promote further research in scalable and realistic AD simulation, we will release our dataset and code, see this https URL.

Title: Diffusion models under low-noise regime

Authors: Elizabeth Pavlova, Xue-Xin Wei
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.07841
Pdf URL: https://arxiv.org/pdf/2506.07841
Copy Paste: [[2506.07841]] Diffusion models under low-noise regime(https://arxiv.org/abs/2506.07841)
Keywords: generative
Abstract: Recent work on diffusion models proposed that they operate in two regimes: memorization, in which models reproduce their training data, and generalization, in which they generate novel samples. While this has been tested in high-noise settings, the behavior of diffusion models as effective denoisers when the corruption level is small remains unclear. To address this gap, we systematically investigated the behavior of diffusion models under low-noise diffusion dynamics, with implications for model robustness and interpretability. Using (i) CelebA subsets of varying sample sizes and (ii) analytic Gaussian mixture benchmarks, we reveal that models trained on disjoint data diverge near the data manifold even when their high-noise outputs converge. We quantify how training set size, data geometry, and model objective choice shape denoising trajectories and affect score accuracy, providing insights into how these models actually learn representations of data distributions. This work starts to address gaps in our understanding of generative model reliability in practical applications where small perturbations are common.

Title: Jarzynski Reweighting and Sampling Dynamics for Training Energy-Based Models: Theoretical Analysis of Different Transition Kernels

Authors: Davide Carbone
Subjects: cs.LG, physics.comp-ph
Abstract URL: https://arxiv.org/abs/2506.07843
Pdf URL: https://arxiv.org/pdf/2506.07843
Copy Paste: [[2506.07843]] Jarzynski Reweighting and Sampling Dynamics for Training Energy-Based Models: Theoretical Analysis of Different Transition Kernels(https://arxiv.org/abs/2506.07843)
Keywords: generative
Abstract: Energy-Based Models (EBMs) provide a flexible framework for generative modeling, but their training remains theoretically challenging due to the need to approximate normalization constants and efficiently sample from complex, multi-modal distributions. Traditional methods, such as contrastive divergence and score matching, introduce biases that can hinder accurate learning. In this work, we present a theoretical analysis of Jarzynski reweighting, a technique from non-equilibrium statistical mechanics, and its implications for training EBMs. We focus on the role of the choice of the kernel and we illustrate these theoretical considerations in two key generative frameworks: (i) flow-based diffusion models, where we reinterpret Jarzynski reweighting in the context of stochastic interpolants to mitigate discretization errors and improve sample quality, and (ii) Restricted Boltzmann Machines, where we analyze its role in correcting the biases of contrastive divergence. Our results provide insights into the interplay between kernel choice and model performance, highlighting the potential of Jarzynski reweighting as a principled tool for generative learning.

Title: PolyVivid: Vivid Multi-Subject Video Generation with Cross-Modal Interaction and Enhancement

Authors: Teng Hu, Zhentao Yu, Zhengguang Zhou, Jiangning Zhang, Yuan Zhou, Qinglin Lu, Ran Yi
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.07848
Pdf URL: https://arxiv.org/pdf/2506.07848
Copy Paste: [[2506.07848]] PolyVivid: Vivid Multi-Subject Video Generation with Cross-Modal Interaction and Enhancement(https://arxiv.org/abs/2506.07848)
Keywords: generation
Abstract: Despite recent advances in video generation, existing models still lack fine-grained controllability, especially for multi-subject customization with consistent identity and interaction. In this paper, we propose PolyVivid, a multi-subject video customization framework that enables flexible and identity-consistent generation. To establish accurate correspondences between subject images and textual entities, we design a VLLM-based text-image fusion module that embeds visual identities into the textual space for precise grounding. To further enhance identity preservation and subject interaction, we propose a 3D-RoPE-based enhancement module that enables structured bidirectional fusion between text and image embeddings. Moreover, we develop an attention-inherited identity injection module to effectively inject fused identity features into the video generation process, mitigating identity drift. Finally, we construct an MLLM-based data pipeline that combines MLLM-based grounding, segmentation, and a clique-based subject consolidation strategy to produce high-quality multi-subject data, effectively enhancing subject distinction and reducing ambiguity in downstream video generation. Extensive experiments demonstrate that PolyVivid achieves superior performance in identity fidelity, video realism, and subject alignment, outperforming existing open-source and commercial baselines.

Title: SAM2Auto: Auto Annotation Using FLASH

Authors: Arash Rocky, Q.M. Jonathan Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.07850
Pdf URL: https://arxiv.org/pdf/2506.07850
Copy Paste: [[2506.07850]] SAM2Auto: Auto Annotation Using FLASH(https://arxiv.org/abs/2506.07850)
Keywords: generation
Abstract: Vision-Language Models (VLMs) lag behind Large Language Models due to the scarcity of annotated datasets, as creating paired visual-textual annotations is labor-intensive and expensive. To address this bottleneck, we introduce SAM2Auto, the first fully automated annotation pipeline for video datasets requiring no human intervention or dataset-specific training. Our approach consists of two key components: SMART-OD, a robust object detection system that combines automatic mask generation with open-world object detection capabilities, and FLASH (Frame-Level Annotation and Segmentation Handler), a multi-object real-time video instance segmentation (VIS) that maintains consistent object identification across video frames even with intermittent detection gaps. Unlike existing open-world detection methods that require frame-specific hyperparameter tuning and suffer from numerous false positives, our system employs statistical approaches to minimize detection errors while ensuring consistent object tracking throughout entire video sequences. Extensive experimental validation demonstrates that SAM2Auto achieves comparable accuracy to manual annotation while dramatically reducing annotation time and eliminating labor costs. The system successfully handles diverse datasets without requiring retraining or extensive parameter adjustments, making it a practical solution for large-scale dataset creation. Our work establishes a new baseline for automated video annotation and provides a pathway for accelerating VLM development by addressing the fundamental dataset bottleneck that has constrained progress in vision-language understanding.

Title: VIVAT: Virtuous Improving VAE Training through Artifact Mitigation

Authors: Lev Novitskiy, Viacheslav Vasilev, Maria Kovaleva, Vladimir Arkhipkin, Denis Dimitrov
Subjects: cs.CV, cs.LG, cs.MM
Abstract URL: https://arxiv.org/abs/2506.07863
Pdf URL: https://arxiv.org/pdf/2506.07863
Copy Paste: [[2506.07863]] VIVAT: Virtuous Improving VAE Training through Artifact Mitigation(https://arxiv.org/abs/2506.07863)
Keywords: generation, generative
Abstract: Variational Autoencoders (VAEs) remain a cornerstone of generative computer vision, yet their training is often plagued by artifacts that degrade reconstruction and generation quality. This paper introduces VIVAT, a systematic approach to mitigating common artifacts in KL-VAE training without requiring radical architectural changes. We present a detailed taxonomy of five prevalent artifacts - color shift, grid patterns, blur, corner and droplet artifacts - and analyze their root causes. Through straightforward modifications, including adjustments to loss weights, padding strategies, and the integration of Spatially Conditional Normalization, we demonstrate significant improvements in VAE performance. Our method achieves state-of-the-art results in image reconstruction metrics (PSNR and SSIM) across multiple benchmarks and enhances text-to-image generation quality, as evidenced by superior CLIP scores. By preserving the simplicity of the KL-VAE framework while addressing its practical challenges, VIVAT offers actionable insights for researchers and practitioners aiming to optimize VAE training.

Title: Diffusion Counterfactual Generation with Semantic Abduction

Authors: Rajat Rasal, Avinash Kori, Fabio De Sousa Ribeiro, Tian Xia, Ben Glocker
Subjects: cs.LG, cs.AI, cs.CV, stat.ML
Abstract URL: https://arxiv.org/abs/2506.07883
Pdf URL: https://arxiv.org/pdf/2506.07883
Copy Paste: [[2506.07883]] Diffusion Counterfactual Generation with Semantic Abduction(https://arxiv.org/abs/2506.07883)
Keywords: generation
Abstract: Counterfactual image generation presents significant challenges, including preserving identity, maintaining perceptual quality, and ensuring faithfulness to an underlying causal model. While existing auto-encoding frameworks admit semantic latent spaces which can be manipulated for causal control, they struggle with scalability and fidelity. Advancements in diffusion models present opportunities for improving counterfactual image editing, having demonstrated state-of-the-art visual quality, human-aligned perception and representation learning capabilities. Here, we present a suite of diffusion-based causal mechanisms, introducing the notions of spatial, semantic and dynamic abduction. We propose a general framework that integrates semantic representations into diffusion models through the lens of Pearlian causality to edit images via a counterfactual reasoning process. To our knowledge, this is the first work to consider high-level semantic identity preservation for diffusion counterfactuals and to demonstrate how semantic control enables principled trade-offs between faithful causal control and identity preservation.

Title: EgoM2P: Egocentric Multimodal Multitask Pretraining

Authors: Gen Li, Yutong Chen, Yiqian Wu, Kaifeng Zhao, Marc Pollefeys, Siyu Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.07886
Pdf URL: https://arxiv.org/pdf/2506.07886
Copy Paste: [[2506.07886]] EgoM2P: Egocentric Multimodal Multitask Pretraining(https://arxiv.org/abs/2506.07886)
Keywords: generative
Abstract: Understanding multimodal signals in egocentric vision, such as RGB video, depth, camera poses, and gaze, is essential for applications in augmented reality, robotics, and human-computer interaction. These capabilities enable systems to better interpret the camera wearer's actions, intentions, and surrounding environment. However, building large-scale egocentric multimodal and multitask models presents unique challenges. Egocentric data are inherently heterogeneous, with large variations in modality coverage across devices and settings. Generating pseudo-labels for missing modalities, such as gaze or head-mounted camera trajectories, is often infeasible, making standard supervised learning approaches difficult to scale. Furthermore, dynamic camera motion and the complex temporal and spatial structure of first-person video pose additional challenges for the direct application of existing multimodal foundation models. To address these challenges, we introduce a set of efficient temporal tokenizers and propose EgoM2P, a masked modeling framework that learns from temporally aware multimodal tokens to train a large, general-purpose model for egocentric 4D understanding. This unified design supports multitasking across diverse egocentric perception and synthesis tasks, including gaze prediction, egocentric camera tracking, and monocular depth estimation from egocentric video. EgoM2P also serves as a generative model for conditional egocentric video synthesis. Across these tasks, EgoM2P matches or outperforms specialist models while being an order of magnitude faster. We will fully open-source EgoM2P to support the community and advance egocentric vision research. Project page: this https URL

Title: Video Unlearning via Low-Rank Refusal Vector

Authors: Simone Facchiano, Stefano Saravalle, Matteo Migliarini, Edoardo De Matteis, Alessio Sampieri, Andrea Pilzer, Emanuele Rodolà, Indro Spinelli, Luca Franco, Fabio Galasso
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.07891
Pdf URL: https://arxiv.org/pdf/2506.07891
Copy Paste: [[2506.07891]] Video Unlearning via Low-Rank Refusal Vector(https://arxiv.org/abs/2506.07891)
Keywords: generation, generative
Abstract: Video generative models democratize the creation of visual content through intuitive instruction following, but they also inherit the biases and harmful concepts embedded within their web-scale training data. This inheritance creates a significant risk, as users can readily generate undesirable and even illegal content. This work introduces the first unlearning technique tailored explicitly for video diffusion models to address this critical issue. Our method requires 5 multi-modal prompt pairs only. Each pair contains a "safe" and an "unsafe" example that differ only by the target concept. Averaging their per-layer latent differences produces a "refusal vector", which, once subtracted from the model parameters, neutralizes the unsafe concept. We introduce a novel low-rank factorization approach on the covariance difference of embeddings that yields robust refusal vectors. This isolates the target concept while minimizing collateral unlearning of other semantics, thus preserving the visual quality of the generated video. Our method preserves the model's generation quality while operating without retraining or access to the original training data. By embedding the refusal direction directly into the model's weights, the suppression mechanism becomes inherently more robust against adversarial bypass attempts compared to surface-level input-output filters. In a thorough qualitative and quantitative evaluation, we show that we can neutralize a variety of harmful contents, including explicit nudity, graphic violence, copyrights, and trademarks. Project page: this https URL.
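Illustrative sketch (not from the paper): the abstract states that averaging the per-layer latent differences of the safe/unsafe prompt pairs produces a "refusal vector". The Python sketch below shows only that averaging step on plain lists. The function name `mean_difference_per_layer` and the nested-list layout (pairs, then layers, then vector entries) are assumptions made here; the low-rank factorization of the covariance difference and the subtraction from model parameters described in the abstract are not reproduced.

```python
def mean_difference_per_layer(safe, unsafe):
    """Average (unsafe - safe) latents over prompt pairs, separately per layer.

    safe[p][l] and unsafe[p][l] are the latent vectors (lists of floats) of
    pair p at layer l. This data layout is an assumption for illustration.
    """
    if not safe or len(safe) != len(unsafe):
        raise ValueError("need the same non-zero number of safe and unsafe items")
    n_pairs = len(safe)
    n_layers = len(safe[0])
    vectors = []
    for layer in range(n_layers):
        dim = len(safe[0][layer])
        acc = [0.0] * dim
        for p in range(n_pairs):
            for i in range(dim):
                acc[i] += unsafe[p][layer][i] - safe[p][layer][i]
        vectors.append([a / n_pairs for a in acc])
    return vectors


if __name__ == "__main__":
    # Two toy prompt pairs, one layer, two-dimensional latents.
    safe = [[[0.0, 0.0]], [[1.0, 1.0]]]
    unsafe = [[[1.0, 0.0]], [[2.0, 3.0]]]
    print(mean_difference_per_layer(safe, unsafe))
```

With the toy inputs above, the per-pair differences are [1, 0] and [1, 2], so the single-layer mean difference is [1.0, 1.0].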

Title: FunDiff: Diffusion Models over Function Spaces for Physics-Informed Generative Modeling

Authors: Sifan Wang, Zehao Dou, Tong-Rui Liu, Lu Lu
Subjects: cs.LG, physics.comp-ph, stat.ML
Abstract URL: https://arxiv.org/abs/2506.07902
Pdf URL: https://arxiv.org/pdf/2506.07902
Copy Paste: [[2506.07902]] FunDiff: Diffusion Models over Function Spaces for Physics-Informed Generative Modeling(https://arxiv.org/abs/2506.07902)
Keywords: generative
Abstract: Recent advances in generative modeling, particularly diffusion models and flow matching, have achieved remarkable success in synthesizing discrete data such as images and videos. However, adapting these models to physical applications remains challenging, as the quantities of interest are continuous functions governed by complex physical laws. Here, we introduce FunDiff, a novel framework for generative modeling in function spaces. FunDiff combines a latent diffusion process with a function autoencoder architecture to handle input functions with varying discretizations, generate continuous functions evaluable at arbitrary locations, and seamlessly incorporate physical priors. These priors are enforced through architectural constraints or physics-informed loss functions, ensuring that generated samples satisfy fundamental physical laws. We theoretically establish minimax optimality guarantees for density estimation in function spaces, showing that diffusion-based estimators achieve optimal convergence rates under suitable regularity conditions. We demonstrate the practical effectiveness of FunDiff across diverse applications in fluid dynamics and solid mechanics. Empirical results show that our method generates physically consistent samples with high fidelity to the target distribution and exhibits robustness to noisy and low-resolution data. Code and datasets are publicly available at this https URL.

Title: Diffuse Everything: Multimodal Diffusion Models on Arbitrary State Spaces

Authors: Kevin Rojas, Yuchen Zhu, Sichen Zhu, Felix X.-F. Ye, Molei Tao
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2506.07903
Pdf URL: https://arxiv.org/pdf/2506.07903
Copy Paste: [[2506.07903]] Diffuse Everything: Multimodal Diffusion Models on Arbitrary State Spaces(https://arxiv.org/abs/2506.07903)
Keywords: generation
Abstract: Diffusion models have demonstrated remarkable performance in generating unimodal data across various tasks, including image, video, and text generation. In contrast, the joint generation of multimodal data through diffusion models is still in the early stages of exploration. Existing approaches rely heavily on external preprocessing protocols, such as tokenizers and variational autoencoders, to harmonize varied data representations into a unified, unimodal format. This process places heavy demands on the accuracy of the encoders and decoders, which can be problematic for applications with limited data. To lift this restriction, we propose a novel framework for building multimodal diffusion models on arbitrary state spaces, enabling native generation of coupled data across different modalities. By introducing an innovative decoupled noise schedule for each modality, we enable both unconditional and modality-conditioned generation within a single model simultaneously. We empirically validate our approach for text-image generation and mixed-type tabular data synthesis, demonstrating that it achieves competitive performance.
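The decoupled noise schedule can be illustrated with a small sketch. It assumes two modalities (text and image), each with its own noise level in [0, 1], and assumes that conditioning on a modality corresponds to keeping it clean (noise level 0). The function name, the mode names, and the uniform sampling are hypothetical choices for illustration, not the authors' implementation.

```python
import random


def sample_noise_levels(mode, rng):
    """Return (t_text, t_image) noise levels for one training example.

    Assumed convention (not taken from the paper): 0.0 means clean data.
    mode == "joint":      both modalities get independent noise levels
    mode == "text2image": text is kept clean and acts as the condition
    mode == "image2text": image is kept clean and acts as the condition
    """
    if mode == "joint":
        return rng.random(), rng.random()
    if mode == "text2image":
        return 0.0, rng.random()
    if mode == "image2text":
        return rng.random(), 0.0
    raise ValueError(f"unknown mode: {mode}")


if __name__ == "__main__":
    rng = random.Random(0)
    for mode in ("joint", "text2image", "image2text"):
        print(mode, sample_noise_levels(mode, rng))
```

Because each modality carries its own noise level, a single network trained on all three modes could in principle serve both unconditional and modality-conditioned sampling, which is the property the abstract describes.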

Title: W4S4: WaLRUS Meets S4 for Long-Range Sequence Modeling

Authors: Hossein Babaei, Mel White, Richard G. Baraniuk
Subjects: cs.LG, eess.AS, eess.IV, eess.SP
Abstract URL: https://arxiv.org/abs/2506.07920
Pdf URL: https://arxiv.org/pdf/2506.07920
Copy Paste: [[2506.07920]] W4S4: WaLRUS Meets S4 for Long-Range Sequence Modeling(https://arxiv.org/abs/2506.07920)
Keywords: generation
Abstract: State Space Models (SSMs) have emerged as powerful components for sequence modeling, enabling efficient handling of long-range dependencies via linear recurrence and convolutional computation. However, their effectiveness depends heavily on the choice and initialization of the state matrix. In this work, we build on the SaFARi framework and existing WaLRUS SSMs to introduce a new variant, W4S4 (WaLRUS for S4), a new class of SSMs constructed from redundant wavelet frames. WaLRUS admits a stable diagonalization and supports fast kernel computation without requiring low-rank approximations, making it both theoretically grounded and computationally efficient. We show that WaLRUS retains information over long horizons significantly better than HiPPO-based SSMs, both in isolation and when integrated into deep architectures such as S4. Our experiments demonstrate consistent improvements across delay reconstruction tasks, classification benchmarks, and long-range sequence modeling, confirming that high-quality, structured initialization enabled by wavelet-based state dynamics offers substantial advantages over existing alternatives. WaLRUS provides a scalable and versatile foundation for the next generation of deep SSM-based models.

Title: A Generative Physics-Informed Reinforcement Learning-Based Approach for Construction of Representative Drive Cycle

Authors: Amirreza Yasami, Mohammadali Tofigh, Mahdi Shahbakhti, Charles Robert Koch
Subjects: cs.LG, eess.SY
Abstract URL: https://arxiv.org/abs/2506.07929
Pdf URL: https://arxiv.org/pdf/2506.07929
Copy Paste: [[2506.07929]] A Generative Physics-Informed Reinforcement Learning-Based Approach for Construction of Representative Drive Cycle(https://arxiv.org/abs/2506.07929)
Keywords: generative
Abstract: Accurate driving cycle construction is crucial for vehicle design, fuel economy analysis, and environmental impact assessments. This paper introduces a generative Physics-Informed Expected SARSA-Monte Carlo (PIESMC) approach that constructs representative driving cycles by capturing transient dynamics, acceleration, deceleration, idling, and road grade transitions while ensuring model fidelity. Leveraging a physics-informed reinforcement learning framework with Monte Carlo sampling, PIESMC delivers efficient cycle construction with reduced computational cost. Experimental evaluations on two real-world datasets demonstrate that PIESMC replicates key kinematic and energy metrics, achieving up to a 57.3% reduction in cumulative kinematic fragment errors compared to the Micro-trip-based (MTB) method and a 10.5% reduction relative to the Markov-chain-based (MCB) method. Moreover, it is nearly an order of magnitude faster than conventional techniques. Analyses of vehicle-specific power distributions and wavelet-transformed frequency content further confirm its ability to reproduce experimental central tendencies and variability.
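For readers unfamiliar with the Expected SARSA component named above, the following is a minimal sketch of the textbook tabular update under an epsilon-greedy policy. The abstract does not describe the state and action encoding, the physics-informed terms, or the Monte Carlo sampling, so none of those are modeled here; the function names and the dictionary-of-lists Q-table are assumptions made for illustration.

```python
def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities of an epsilon-greedy policy over q_values."""
    n = len(q_values)
    best = max(range(n), key=lambda a: q_values[a])
    probs = [epsilon / n] * n
    probs[best] += 1.0 - epsilon
    return probs


def expected_sarsa_update(Q, s, a, reward, s_next,
                          alpha=0.1, gamma=0.99, epsilon=0.1):
    """Apply one Expected SARSA update to Q[s][a] in place and return it.

    The target uses the expectation of Q[s_next] under the policy,
    rather than the value of a single sampled next action.
    """
    probs = epsilon_greedy_probs(Q[s_next], epsilon)
    expected_next = sum(p * q for p, q in zip(probs, Q[s_next]))
    td_target = reward + gamma * expected_next
    Q[s][a] += alpha * (td_target - Q[s][a])
    return Q[s][a]


if __name__ == "__main__":
    # Two hypothetical states with two actions each.
    Q = {0: [0.0, 0.0], 1: [1.0, 0.0]}
    print(expected_sarsa_update(Q, s=0, a=0, reward=1.0, s_next=1))
```

Averaging over the next-state policy removes the variance that ordinary SARSA incurs from sampling the next action, which is the usual motivation for this variant.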

Title: TokenBreak: Bypassing Text Classification Models Through Token Manipulation

Authors: Kasimir Schulz, Kenneth Yeung, Kieran Evans
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2506.07948
Pdf URL: https://arxiv.org/pdf/2506.07948
Copy Paste: [[2506.07948]] TokenBreak: Bypassing Text Classification Models Through Token Manipulation(https://arxiv.org/abs/2506.07948)
Keywords: generation
Abstract: Natural Language Processing (NLP) models are used for text-related tasks such as classification and generation. To complete these tasks, input data is first tokenized from human-readable text into a format the model can understand, enabling it to make inferences and understand context. Text classification models can be implemented to guard against threats such as prompt injection attacks against Large Language Models (LLMs), toxic input and cybersecurity risks such as spam emails. In this paper, we introduce TokenBreak: a novel attack that can bypass these protection models by taking advantage of the tokenization strategy they use. This attack technique manipulates input text in such a way that certain models give an incorrect classification. Importantly, the end target (LLM or email recipient) can still understand and respond to the manipulated text and therefore be vulnerable to the very attack the protection model was put in place to prevent. The tokenizer is tied to model architecture, meaning it is possible to predict whether or not a model is vulnerable to attack based on family. We also present a defensive strategy as an added layer of protection that can be implemented without having to retrain the defensive model.

Title: Cost-Optimal Active AI Model Evaluation

Authors: Anastasios N. Angelopoulos, Jacob Eisenstein, Jonathan Berant, Alekh Agarwal, Adam Fisch
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.07949
Pdf URL: https://arxiv.org/pdf/2506.07949
Copy Paste: [[2506.07949]] Cost-Optimal Active AI Model Evaluation(https://arxiv.org/abs/2506.07949)
Keywords: generative
Abstract: The development lifecycle of generative AI systems requires continual evaluation, data acquisition, and annotation, which is costly in both resources and time. In practice, rapid iteration often makes it necessary to rely on synthetic annotation data because of the low cost, despite the potential for substantial bias. In this paper, we develop novel, cost-aware methods for actively balancing the use of a cheap, but often inaccurate, weak rater -- such as a model-based autorater that is designed to automatically assess the quality of generated content -- with a more expensive, but also more accurate, strong rater alternative such as a human. More specifically, the goal of our approach is to produce a low variance, unbiased estimate of the mean of the target "strong" rating, subject to some total annotation budget. Building on recent work in active and prediction-powered statistical inference, we derive a family of cost-optimal policies for allocating a given annotation budget between weak and strong raters so as to maximize statistical efficiency. Using synthetic and real-world data, we empirically characterize the conditions under which these policies yield improvements over prior methods. We find that, especially in tasks where there is high variability in the difficulty of examples, our policies can achieve the same estimation precision at a far lower total annotation budget than standard evaluation methods.
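The estimation target described above can be sketched with a standard inverse-probability-weighted correction: every item receives a cheap weak rating, and the strong rater is queried for item i with probability probs[i], with the observed residual reweighted by 1 / probs[i] so that the estimate stays unbiased. How to choose those probabilities under a budget is the paper's contribution and is not reproduced here; the function name, the `strong_oracle` callback, and the example numbers are hypothetical.

```python
import random


def active_mean_estimate(weak, strong_oracle, probs, rng):
    """Estimate the mean strong rating from weak ratings plus sampled corrections.

    weak[i]          -- weak rater score for item i (available for every item)
    strong_oracle(i) -- returns the strong rating for item i (assumed costly)
    probs[i]         -- probability in (0, 1] of querying the strong rater
    Returns (estimate, number_of_strong_queries).
    """
    total = 0.0
    n_queries = 0
    for i, (w, p) in enumerate(zip(weak, probs)):
        total += w
        if rng.random() < p:
            # Reweight the residual so its expectation equals strong - weak.
            total += (strong_oracle(i) - w) / p
            n_queries += 1
    return total / len(weak), n_queries


if __name__ == "__main__":
    strong = [1.0, 0.0, 1.0, 1.0]   # hypothetical human ratings
    weak = [0.9, 0.2, 0.4, 0.7]     # hypothetical autorater scores
    probs = [0.2, 0.2, 0.9, 0.5]    # query more often where the autorater is unsure
    print(active_mean_estimate(weak, lambda i: strong[i], probs, random.Random(0)))
```

When the weak rater is accurate, the residuals are small and few strong queries are needed for a precise estimate; when it is unreliable on some items, concentrating queries there reduces variance.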

Title: SlideCoder: Layout-aware RAG-enhanced Hierarchical Slide Generation from Design

Authors: Wenxin Tang, Jingyu Xiao, Wenxuan Jiang, Xi Xiao, Yuhang Wang, Xuxin Tang, Qing Li, Yuehe Ma, Junliang Liu, Shisong Tang, Michael R. Lyu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.07964
Pdf URL: https://arxiv.org/pdf/2506.07964
Copy Paste: [[2506.07964]] SlideCoder: Layout-aware RAG-enhanced Hierarchical Slide Generation from Design(https://arxiv.org/abs/2506.07964)
Keywords: generation
Abstract: Manual slide creation is labor-intensive and requires expert prior knowledge. Existing natural language-based LLM generation methods struggle to capture the visual and structural nuances of slide designs. To address this, we formalize the Reference Image to Slide Generation task and propose Slide2Code, the first benchmark with difficulty-tiered samples based on a novel Slide Complexity Metric. We introduce SlideCoder, a layout-aware, retrieval-augmented framework for generating editable slides from reference images. SlideCoder integrates a Color Gradient-based Segmentation algorithm and a Hierarchical Retrieval-Augmented Generation method to decompose complex tasks and enhance code generation. We also release SlideMaster, a 7B open-source model fine-tuned with improved reverse-engineered data. Experiments show that SlideCoder outperforms state-of-the-art baselines by up to 40.5 points, demonstrating strong performance across layout fidelity, execution accuracy, and visual consistency. Our code is available at this https URL.

Title: OneIG-Bench: Omni-dimensional Nuanced Evaluation for Image Generation

Authors: Jingjing Chang, Yixiao Fang, Peng Xing, Shuhan Wu, Wei Cheng, Rui Wang, Xianfang Zeng, Gang Yu, Hai-Bao Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.07977
Pdf URL: https://arxiv.org/pdf/2506.07977
Copy Paste: [[2506.07977]] OneIG-Bench: Omni-dimensional Nuanced Evaluation for Image Generation(https://arxiv.org/abs/2506.07977)
Keywords: generation
Abstract: Text-to-image (T2I) models have garnered significant attention for generating high-quality images aligned with text prompts. However, rapid advances in T2I models have exposed the limitations of early benchmarks, which lack comprehensive evaluation of aspects such as reasoning, text rendering, and style. Notably, recent state-of-the-art models, with their rich knowledge modeling capabilities, show promising results on image generation problems that require strong reasoning ability, yet existing evaluation systems have not adequately addressed this frontier. To systematically address these gaps, we introduce OneIG-Bench, a meticulously designed comprehensive benchmark framework for fine-grained evaluation of T2I models across multiple dimensions, including prompt-image alignment, text rendering precision, reasoning-generated content, stylization, and diversity. By structuring the evaluation, this benchmark enables in-depth analysis of model performance, helping researchers and practitioners pinpoint strengths and bottlenecks in the full pipeline of image generation. Specifically, OneIG-Bench enables flexible evaluation by allowing users to focus on a particular evaluation subset. Instead of generating images for the entire set of prompts, users can generate images only for the prompts associated with the selected dimension and complete the corresponding evaluation accordingly. Our codebase and dataset are now publicly available to facilitate reproducible evaluation studies and cross-model comparisons within the T2I research community.

Title: Realistic Urban Traffic Generator using Decentralized Federated Learning for the SUMO simulator

Authors: Alberto Bazán-Guillén, Carlos Beis-Penedo, Diego Cajaraville-Aboy, Pablo Barbecho-Bautista, Rebeca P. Díaz-Redondo, Luis J. de la Cruz Llopis, Ana Fernández-Vilas, Mónica Aguilar Igartua, Manuel Fernández-Veiga
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.07980
Pdf URL: https://arxiv.org/pdf/2506.07980
Copy Paste: [[2506.07980]] Realistic Urban Traffic Generator using Decentralized Federated Learning for the SUMO simulator(https://arxiv.org/abs/2506.07980)
Keywords: generation
Abstract: Realistic urban traffic simulation is essential for sustainable urban planning and the development of intelligent transportation systems. However, generating high-fidelity, time-varying traffic profiles that accurately reflect real-world conditions, especially in large-scale scenarios, remains a major challenge. Existing methods often suffer from limited accuracy or scalability, or raise privacy concerns due to centralized data processing. This work introduces DesRUTGe (Decentralized Realistic Urban Traffic Generator), a novel framework that integrates Deep Reinforcement Learning (DRL) agents with the SUMO simulator to generate realistic 24-hour traffic patterns. A key innovation of DesRUTGe is its use of Decentralized Federated Learning (DFL), wherein each traffic detector and its corresponding urban zone function as an independent learning node. These nodes train local DRL models using minimal historical data and collaboratively refine their performance by exchanging model parameters with selected peers (e.g., geographically adjacent zones), without requiring a central coordinator. Evaluated using real-world data from the city of Barcelona, DesRUTGe outperforms standard SUMO-based tools such as RouteSampler, as well as other centralized learning approaches, by delivering more accurate and privacy-preserving traffic pattern generation.

Title: CXR-LT 2024: A MICCAI challenge on long-tailed, multi-label, and zero-shot disease classification from chest X-ray

Authors: Mingquan Lin, Gregory Holste, Song Wang, Yiliang Zhou, Yishu Wei, Imon Banerjee, Pengyi Chen, Tianjie Dai, Yuexi Du, Nicha C. Dvornek, Yuyan Ge, Zuowei Guo, Shouhei Hanaoka, Dongkyun Kim, Pablo Messina, Yang Lu, Denis Parra, Donghyun Son, Álvaro Soto, Aisha Urooj, René Vidal, Yosuke Yamagishi, Zefan Yang, Ruichi Zhang, Yang Zhou, Leo Anthony Celi, Ronald M. Summers, Zhiyong Lu, Hao Chen, Adam Flanders, George Shih, Zhangyang Wang, Yifan Peng
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2506.07984
Pdf URL: https://arxiv.org/pdf/2506.07984
Copy Paste: [[2506.07984]] CXR-LT 2024: A MICCAI challenge on long-tailed, multi-label, and zero-shot disease classification from chest X-ray(https://arxiv.org/abs/2506.07984)
Keywords: generative
Abstract: The CXR-LT series is a community-driven initiative designed to enhance lung disease classification using chest X-rays (CXR). It tackles challenges in open long-tailed lung disease classification and enhances the measurability of state-of-the-art techniques. The first event, CXR-LT 2023, aimed to achieve these goals by providing high-quality benchmark CXR data for model development and conducting comprehensive evaluations to identify ongoing issues impacting lung disease classification performance. Building on the success of CXR-LT 2023, CXR-LT 2024 expands the dataset to 377,110 chest X-rays (CXRs) and 45 disease labels, including 19 new rare disease findings. It also introduces a new focus on zero-shot learning to address limitations identified in the previous event. Specifically, CXR-LT 2024 features three tasks: (i) long-tailed classification on a large, noisy test set, (ii) long-tailed classification on a manually annotated "gold standard" subset, and (iii) zero-shot generalization to five previously unseen disease findings. This paper provides an overview of CXR-LT 2024, detailing the data curation process and consolidating state-of-the-art solutions, including the use of multimodal models for rare disease detection, advanced generative approaches to handle noisy labels, and zero-shot learning strategies for unseen diseases. Additionally, the expanded dataset enhances disease coverage to better represent real-world clinical settings, offering a valuable resource for future research. By synthesizing the insights and innovations of participating teams, we aim to advance the development of clinically realistic and generalizable diagnostic models for chest radiography.

Title: Rethinking Cross-Modal Interaction in Multimodal Diffusion Transformers

Authors: Zhengyao Lv, Tianlin Pan, Chenyang Si, Zhaoxi Chen, Wangmeng Zuo, Ziwei Liu, Kwan-Yee K. Wong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.07986
Pdf URL: https://arxiv.org/pdf/2506.07986
Copy Paste: [[2506.07986]] Rethinking Cross-Modal Interaction in Multimodal Diffusion Transformers(https://arxiv.org/abs/2506.07986)
Keywords: generation
Abstract: Multimodal Diffusion Transformers (MM-DiTs) have achieved remarkable progress in text-driven visual generation. However, even state-of-the-art MM-DiT models like FLUX struggle with achieving precise alignment between text prompts and generated content. We identify two key issues in the attention mechanism of MM-DiT, namely 1) the suppression of cross-modal attention due to token imbalance between visual and textual modalities and 2) the lack of timestep-aware attention weighting, which hinder the alignment. To address these issues, we propose Temperature-Adjusted Cross-modal Attention (TACA), a parameter-efficient method that dynamically rebalances multimodal interactions through temperature scaling and timestep-dependent adjustment. When combined with LoRA fine-tuning, TACA significantly enhances text-image alignment on the T2I-CompBench benchmark with minimal computational overhead. We tested TACA on state-of-the-art models like FLUX and SD3.5, demonstrating its ability to improve image-text alignment in terms of object appearance, attribute binding, and spatial relationships. Our findings highlight the importance of balancing cross-modal attention in improving semantic fidelity in text-to-image diffusion models. Our code is publicly available at this https URL.
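The two ingredients named above, temperature scaling of cross-modal attention and a timestep-dependent adjustment, can be sketched for a single image query attending over text and image keys. This is an illustration under stated assumptions rather than the released implementation: the function names, the step-shaped schedule, and the values of `gamma0` and `t_threshold` are hypothetical.

```python
import math


def gamma_for_timestep(t, gamma0=1.2, t_threshold=0.5):
    """Hypothetical schedule: boost text logits only at noisier timesteps."""
    return gamma0 if t >= t_threshold else 1.0


def attention_weights(query, text_keys, image_keys, gamma=1.0):
    """Softmax attention weights of one query over [text keys, image keys].

    Logits toward text keys are multiplied by gamma; gamma == 1.0 recovers
    standard scaled dot-product attention.
    """
    scale = 1.0 / math.sqrt(len(query))

    def logit(key):
        return sum(q * k for q, k in zip(query, key)) * scale

    logits = [gamma * logit(k) for k in text_keys] + [logit(k) for k in image_keys]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    normalizer = sum(exps)
    return [e / normalizer for e in exps]


if __name__ == "__main__":
    query = [1.0, 0.0]
    text_keys = [[1.0, 0.0]]
    image_keys = [[1.0, 0.0], [0.0, 1.0]]
    for t in (0.9, 0.1):
        gamma = gamma_for_timestep(t)
        print(t, attention_weights(query, text_keys, image_keys, gamma))
```

In an MM-DiT the image tokens typically far outnumber the text tokens, so the softmax mass assigned to text is diluted; scaling the text logits is one way to counteract that imbalance without adding trainable parameters.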

Title: Generative Modeling of Weights: Generalization or Memorization?

Authors: Boya Zeng, Yida Yin, Zhiqiu Xu, Zhuang Liu
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2506.07998
Pdf URL: https://arxiv.org/pdf/2506.07998
Copy Paste: [[2506.07998]] Generative Modeling of Weights: Generalization or Memorization?(https://arxiv.org/abs/2506.07998)
Keywords: generation, generative
Abstract: Generative models, with their success in image and video generation, have recently been explored for synthesizing effective neural network weights. These approaches take trained neural network checkpoints as training data, and aim to generate high-performing neural network weights during inference. In this work, we examine four representative methods on their ability to generate novel model weights, i.e., weights that are different from the checkpoints seen during training. Surprisingly, we find that these methods synthesize weights largely by memorization: they produce either replicas, or at best simple interpolations, of the training checkpoints. Current methods fail to outperform simple baselines, such as adding noise to the weights or taking a simple weight ensemble, in obtaining different and simultaneously high-performing models. We further show that this memorization cannot be effectively mitigated by modifying modeling factors commonly associated with memorization in image diffusion models, or applying data augmentations. Our findings provide a realistic assessment of what types of data current generative models can model, and highlight the need for more careful evaluation of generative models in new domains. Our code is available at this https URL.
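The two simple baselines mentioned above, and a basic novelty check, can be sketched on flattened weight vectors. The abstract does not specify how the ensemble is formed or how novelty is measured, so this sketch assumes a plain element-wise average of checkpoints and an L2 distance to the nearest training checkpoint; all function names are hypothetical.

```python
import math
import random


def add_noise(weights, sigma, rng):
    """Baseline 1: perturb a checkpoint with Gaussian noise of scale sigma."""
    return [w + rng.gauss(0.0, sigma) for w in weights]


def weight_ensemble(checkpoints):
    """Baseline 2 (assumed form): element-wise average of several checkpoints."""
    n = len(checkpoints)
    return [sum(values) / n for values in zip(*checkpoints)]


def nearest_checkpoint_distance(candidate, checkpoints):
    """L2 distance to the closest training checkpoint.

    A very small distance suggests the candidate is a near-replica of
    something seen during training (an assumed proxy for memorization).
    """
    return min(math.dist(candidate, c) for c in checkpoints)


if __name__ == "__main__":
    rng = random.Random(0)
    checkpoints = [[0.0, 0.0], [2.0, 2.0]]
    noisy = add_noise(checkpoints[0], sigma=0.01, rng=rng)
    averaged = weight_ensemble(checkpoints)
    print(nearest_checkpoint_distance(noisy, checkpoints))
    print(nearest_checkpoint_distance(averaged, checkpoints))
```

A generated weight vector is only interesting if it is both far from every training checkpoint and still performs well, so a distance like this would need to be reported alongside task accuracy.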

Title: MADFormer: Mixed Autoregressive and Diffusion Transformers for Continuous Image Generation

Authors: Junhao Chen, Yulia Tsvetkov, Xiaochuang Han
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2506.07999
Pdf URL: https://arxiv.org/pdf/2506.07999
Copy Paste: [[2506.07999]] MADFormer: Mixed Autoregressive and Diffusion Transformers for Continuous Image Generation(https://arxiv.org/abs/2506.07999)
Keywords: generation, generative
Abstract: Recent progress in multimodal generation has increasingly combined autoregressive (AR) and diffusion-based approaches, leveraging their complementary strengths: AR models capture long-range dependencies and produce fluent, context-aware outputs, while diffusion models operate in continuous latent spaces to refine high-fidelity visual details. However, existing hybrids often lack systematic guidance on how and why to allocate model capacity between these paradigms. In this work, we introduce MADFormer, a Mixed Autoregressive and Diffusion Transformer that serves as a testbed for analyzing AR-diffusion trade-offs. MADFormer partitions image generation into spatial blocks, using AR layers for one-pass global conditioning across blocks and diffusion layers for iterative local refinement within each block. Through controlled experiments on FFHQ-1024 and ImageNet, we identify two key insights: (1) block-wise partitioning significantly improves performance on high-resolution images, and (2) vertically mixing AR and diffusion layers yields better quality-efficiency balances--improving FID by up to 75% under constrained inference compute. Our findings offer practical design principles for future hybrid generative models.

Title: Audio-Sync Video Generation with Multi-Stream Temporal Control

Authors: Shuchen Weng, Haojie Zheng, Zheng Chang, Si Li, Boxin Shi, Xinlong Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.08003
Pdf URL: https://arxiv.org/pdf/2506.08003
Copy Paste: [[2506.08003]] Audio-Sync Video Generation with Multi-Stream Temporal Control(https://arxiv.org/abs/2506.08003)
Keywords: generation
Abstract: Audio is inherently temporal and closely synchronized with the visual world, making it a naturally aligned and expressive control signal for controllable video generation (e.g., movies). Beyond control, directly translating audio into video is essential for understanding and visualizing rich audio narratives (e.g., podcasts or historical recordings). However, existing approaches fall short in generating high-quality videos with precise audio-visual synchronization, especially across diverse and complex audio types. In this work, we introduce MTV, a versatile framework for audio-sync video generation. MTV explicitly separates audio into speech, effects, and music tracks, enabling disentangled control over lip motion, event timing, and visual mood, respectively -- resulting in fine-grained and semantically aligned video generation. To support the framework, we additionally present DEMIX, a dataset comprising high-quality cinematic videos and demixed audio tracks. DEMIX is structured into five overlapped subsets, enabling scalable multi-stage training for diverse generation scenarios. Extensive experiments demonstrate that MTV achieves state-of-the-art performance across six standard metrics spanning video quality, text-video consistency, and audio-video alignment. Project page: this https URL.

Title: Dreamland: Controllable World Creation with Simulator and Generative Models

Authors: Sicheng Mo, Ziyang Leng, Leon Liu, Weizhen Wang, Honglin He, Bolei Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.08006
Pdf URL: https://arxiv.org/pdf/2506.08006
Copy Paste: [[2506.08006]] Dreamland: Controllable World Creation with Simulator and Generative Models(https://arxiv.org/abs/2506.08006)
Keywords: generation, generative
Abstract: Large-scale video generative models can synthesize diverse and realistic visual content for dynamic world creation, but they often lack element-wise controllability, hindering their use in editing scenes and training embodied AI agents. We propose Dreamland, a hybrid world generation framework combining the granular control of a physics-based simulator and the photorealistic content output of large-scale pretrained generative models. In particular, we design a layered world abstraction that encodes both pixel-level and object-level semantics and geometry as an intermediate representation to bridge the simulator and the generative model. This approach enhances controllability, minimizes adaptation cost through early alignment with real-world distributions, and supports off-the-shelf use of existing and future pretrained generative models. We further construct a D3Sim dataset to facilitate the training and evaluation of hybrid generation pipelines. Experiments demonstrate that Dreamland outperforms existing baselines with 50.8% improved image quality, 17.9% stronger controllability, and has great potential to enhance embodied agent training. Code and data will be made available.

Title: Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion

Authors: Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, Eli Shechtman
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.08009
Pdf URL: https://arxiv.org/pdf/2506.08009
Copy Paste: [[2506.08009]] Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion(https://arxiv.org/abs/2506.08009)
Keywords: generation
Abstract: We introduce Self Forcing, a novel training paradigm for autoregressive video diffusion models. It addresses the longstanding issue of exposure bias, where models trained on ground-truth context must generate sequences conditioned on their own imperfect outputs during inference. Unlike prior methods that denoise future frames based on ground-truth context frames, Self Forcing conditions each frame's generation on previously self-generated outputs by performing autoregressive rollout with key-value (KV) caching during training. This strategy enables supervision through a holistic loss at the video level that directly evaluates the quality of the entire generated sequence, rather than relying solely on traditional frame-wise objectives. To ensure training efficiency, we employ a few-step diffusion model along with a stochastic gradient truncation strategy, effectively balancing computational cost and performance. We further introduce a rolling KV cache mechanism that enables efficient autoregressive video extrapolation. Extensive experiments demonstrate that our approach achieves real-time streaming video generation with sub-second latency on a single GPU, while matching or even surpassing the generation quality of significantly slower and non-causal diffusion models. Project website: this http URL
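The rolling KV cache mentioned above can be pictured as a fixed-size window over per-frame key/value entries, where appending a new frame evicts the oldest one. The sketch below shows only that data structure, using placeholder strings in place of tensors; the class name and the `max_frames` parameter are hypothetical, and the abstract does not say how the actual mechanism handles positions or eviction.

```python
from collections import deque


class RollingKVCache:
    """Keep keys and values for at most `max_frames` most recent frames."""

    def __init__(self, max_frames):
        # deque(maxlen=...) drops the oldest entry automatically when full.
        self.keys = deque(maxlen=max_frames)
        self.values = deque(maxlen=max_frames)

    def append(self, key, value):
        """Store the key/value pair produced for a newly generated frame."""
        self.keys.append(key)
        self.values.append(value)

    def context(self):
        """Return the cached keys and values, oldest first."""
        return list(self.keys), list(self.values)


if __name__ == "__main__":
    cache = RollingKVCache(max_frames=2)
    for frame in range(3):
        cache.append(f"k{frame}", f"v{frame}")
    print(cache.context())
```

Bounding the cache keeps memory use and per-frame attention cost constant as the video grows, which is what makes long autoregressive extrapolation practical.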