2025-07-15

Title: Recurrent Expansion: A Pathway Toward the Next Generation of Deep Learning

Authors: Tarek Berghout
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2507.08828
Pdf URL: https://arxiv.org/pdf/2507.08828
Copy Paste: [[2507.08828]] Recurrent Expansion: A Pathway Toward the Next Generation of Deep Learning(https://arxiv.org/abs/2507.08828)
Keywords: generation
Abstract: This paper introduces Recurrent Expansion (RE) as a new learning paradigm that advances beyond conventional Machine Learning (ML) and Deep Learning (DL). While DL focuses on learning from static data representations, RE proposes an additional dimension: learning from the evolving behavior of models themselves. RE emphasizes multiple mappings of data through identical deep architectures and analyzes their internal representations (i.e., feature maps) in conjunction with observed performance signals such as loss. By incorporating these behavioral traces, RE enables iterative self-improvement, allowing each model version to gain insight from its predecessors. The framework is extended through Multiverse RE (MVRE), which aggregates signals from parallel model instances, and further through Heterogeneous MVRE (HMVRE), where models of varying architectures contribute diverse perspectives. A scalable and adaptive variant, Sc-HMVRE, introduces selective mechanisms and scale diversity for real-world deployment. Altogether, RE presents a shift in DL: from purely representational learning to behavior-aware, self-evolving systems. It lays the groundwork for a new class of intelligent models capable of reasoning over their own learning dynamics, offering a path toward scalable, introspective, and adaptive artificial intelligence. A simple code example to help beginners run their own experiments is provided in the Code Availability section of the paper.
摘要：本文将经常性扩展（RE）作为一种新的学习范式介绍，超越了传统的机器学习（ML）和深度学习（DL）。尽管DL专注于从静态数据表示中学习，但RE提出了一个额外的维度：从模型本身不断发展的行为中学习。 RE通过相同的深层体系结构强调了数据的多次映射，并与观察到的性能信号（例如损耗）一起分析其内部表示（即功能图）。通过合并这些行为痕迹，RE可以启用迭代的自我改进，从而使每个模型版本都能从其前身那里获得见解。该框架通过多元宇宙RE（MVRE）扩展，该框架从并行模型实例中汇总了信号，并通过异质MVRE（HMVRE）进一步扩展了信号，其中各种体系结构的模型贡献了不同的观点。可扩展的自适应变体SC-HMVRE引入了选择性机制和规模多样性，以实现现实世界的部署。总之，RE提出了DL的转变：从纯粹代表性学习到行为感知，自我发展的系统。它为一个新的智能模型奠定了基础，能够推理自己的学习动力，为可扩展，内省和自适应人工智能提供了一条途径。本文的代码可用性部分提供了一个简单的代码示例，以支持初学者运行自己的实验。

Title: GUIDE: Towards Scalable Advising for Research Ideas

Authors: Yaowenqi Liu, BingXu Meng, Rui Pan, Jerry Huang, Tong Zhang
Subjects: cs.LG, cs.MA
Abstract URL: https://arxiv.org/abs/2507.08870
Pdf URL: https://arxiv.org/pdf/2507.08870
Copy Paste: [[2507.08870]] GUIDE: Towards Scalable Advising for Research Ideas(https://arxiv.org/abs/2507.08870)
Keywords: generation
Abstract: The field of AI research is advancing at an unprecedented pace, enabling automated hypothesis generation and experimental design across diverse domains such as biology, mathematics, and artificial intelligence. Despite these advancements, there remains a significant gap in the availability of scalable advising systems capable of providing high-quality, well-reasoned feedback to refine proposed hypotheses and experimental designs. To address this challenge, we explore key factors that underlie the development of robust advising systems, including model size, context length, confidence estimation, and structured reasoning processes. Our findings reveal that a relatively small model, when equipped with a well-compressed literature database and a structured reasoning framework, can outperform powerful general-purpose language models such as Deepseek-R1 in terms of acceptance rates for self-ranked top-30% submissions to ICLR 2025. Moreover, when limited to high-confidence predictions, our system achieves an acceptance rate exceeding 90% on the ICLR 2025 test set, underscoring its potential to significantly enhance the quality and efficiency of hypothesis generation and experimental design. The code is released at this https URL.
摘要：人工智能研究领域正在以前所未有的速度前进，从而实现了跨生物学，数学和人工智能等不同领域的自动假设产生和实验设计。尽管取得了这些进步，但可扩展的咨询系统的可用性仍然存在很大的差距，能够提供高质量，合理的反馈来提出提议的假设和实验设计。为了应对这一挑战，我们探讨了关键因素，这些因素是发展强大的建议系统的发展，包括模型大小，上下文长度，置信度估计和结构化推理过程。我们的发现表明，一个相对较小的模型在配备了良好压缩的文献数据库和结构化的推理框架时，可以超越强大的通用语言模型，例如DeepSeek-r1，就自我排列的TOP-30％提交的接受率而言，ICLR 2025的自我排列的top-30％提交率。此外，限制了90％的预测率。强调其显着提高假设产生和实验设计的质量和效率的潜力。该代码在此HTTPS URL上发布。

Title: Next-Generation Travel Demand Modeling with a Generative Framework for Household Activity Coordination

Authors: Xishun Liao, Haoxuan Ma, Yifan Liu, Yuxiang Wei, Brian Yueshuai He, Chris Stanford, Jiaqi Ma
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2507.08871
Pdf URL: https://arxiv.org/pdf/2507.08871
Copy Paste: [[2507.08871]] Next-Generation Travel Demand Modeling with a Generative Framework for Household Activity Coordination(https://arxiv.org/abs/2507.08871)
Keywords: generation, generative
Abstract: Travel demand models are critical tools for planning, policy, and mobility system design. Traditional activity-based models (ABMs), although grounded in behavioral theories, often rely on simplified rules and assumptions, and are costly to develop and difficult to adapt across different regions. This paper presents a learning-based travel demand modeling framework that synthesizes household-coordinated daily activity patterns based on a household's socio-demographic profile. The framework integrates population synthesis, coordinated activity generation, location assignment, and large-scale microscopic traffic simulation into a unified system. It is fully generative, data-driven, scalable, and transferable to other regions. A full-pipeline implementation is conducted for Los Angeles, with a population of 10 million. Comprehensive validation shows that the model closely replicates real-world mobility patterns and matches the performance of legacy ABMs with significantly reduced modeling cost and greater scalability. With respect to the SCAG ABM benchmark, the origin-destination matrix achieves a cosine similarity of 0.97, and the daily vehicle miles traveled (VMT) in the network yields a Jensen-Shannon divergence (JSD) of 0.006 and a mean absolute percentage error (MAPE) of 9.8%. When compared to real-world observations from Caltrans PeMS, the evaluation of corridor-level traffic speed and volume reaches a JSD of 0.001 and a MAPE of 6.11%.
摘要：旅行需求模型是计划，政策和移动系统设计的关键工具。传统的基于活动的模型（ABM）虽然以行为理论为基础，但通常依赖于简化的规则和假设，并且成本高昂，而在不同地区的发展和难以适应。本文提出了一个基于学习的旅行需求建模框架，该框架综合了基于家庭社会人口统计学概况的家庭协调的日常活动模式。整个框架将人口合成，协调的活动产生，位置分配和大规模的微观交通模拟整合到统一系统中。它是完全生成的，数据驱动的，可扩展的，并且可以转移到其他区域。在人口1000万人口的洛杉矶进行了全Pipeline实施。全面的验证表明，该模型密切复制了现实世界的移动性模式，并匹配了传统ABM的性能，并大大降低了建模成本和更大的可扩展性。关于SCAG ABM基准测试，原始用途矩阵的余弦相似性为0.97，网络中的每日车辆里程（VMT）产生0.006 Jensen-Shannon Divergence（JSD）和9.8％的平均绝对百分比误差（MAPE）。与Caltrans PEM的现实世界观察相比，有关走廊级的交通速度和音量的评估达到0.001 JSD和6.11％的MAPE。
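
Note: the three validation metrics named in this abstract (cosine similarity, JSD, MAPE) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' evaluation code: the function names are hypothetical, the JSD uses base-2 logarithms (the abstract does not state the base), and inputs are flat lists of non-negative numbers.

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two flattened vectors (e.g. OD matrices).
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def _kl_bits(p, q):
    # KL(p || q) in bits; terms with p_i == 0 contribute nothing.
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jensen_shannon_divergence(p, q):
    # Normalize both histograms, then average the KL to their midpoint.
    # Base-2 logs are an assumption; they bound the result to [0, 1].
    total_p, total_q = sum(p), sum(q)
    p = [x / total_p for x in p]
    q = [x / total_q for x in q]
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * _kl_bits(p, m) + 0.5 * _kl_bits(q, m)

def mape(actual, predicted):
    # Mean absolute percentage error, in percent; actual values must be nonzero.
    errors = [abs((a - f) / a) for a, f in zip(actual, predicted)]
    return 100.0 * sum(errors) / len(errors)
```

For example, `mape([100.0, 200.0], [110.0, 180.0])` averages two 10% errors, and `jensen_shannon_divergence` returns 0 for identical histograms and 1 for disjoint ones.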

Title: Detecting Deepfake Talking Heads from Facial Biometric Anomalies

Authors: Justin D. Norman, Hany Farid
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.08917
Pdf URL: https://arxiv.org/pdf/2507.08917
Copy Paste: [[2507.08917]] Detecting Deepfake Talking Heads from Facial Biometric Anomalies(https://arxiv.org/abs/2507.08917)
Keywords: generation
Abstract: The combination of highly realistic voice cloning, along with visually compelling avatar, face-swap, or lip-sync deepfake video generation, makes it relatively easy to create a video of anyone saying anything. Today, such deepfake impersonations are often used to power frauds, scams, and political disinformation. We propose a novel forensic machine learning technique for the detection of deepfake video impersonations that leverages unnatural patterns in facial biometrics. We evaluate this technique across a large dataset of deepfake techniques and impersonations, as well as assess its reliability to video laundering and its generalization to previously unseen video deepfake generators.
摘要：高度逼真的语音克隆以及视觉上引人入胜的头像，面部折扣或Lip-sync DeepFake视频的结合，使创建任何人说话的视频相对容易。如今，这种深层模仿通常被用来为欺诈，骗局和政治虚假信息提供动力。我们提出了一种新颖的法医学学习技术，用于检测深击视频模仿，该视频模仿利用面部生物识别技术中的不自然模式。我们在大型深层技术和模仿数据集中评估了这一技术，并评估了其对视频洗牌的可靠性及其对以前看不见的视频Deepfake发电机的概括。

Title: Beyond Scores: Proximal Diffusion Models

Authors: Zhenghan Fang, Mateo Díaz, Sam Buchanan, Jeremias Sulam
Subjects: cs.LG, math.ST, stat.ML
Abstract URL: https://arxiv.org/abs/2507.08956
Pdf URL: https://arxiv.org/pdf/2507.08956
Copy Paste: [[2507.08956]] Beyond Scores: Proximal Diffusion Models(https://arxiv.org/abs/2507.08956)
Keywords: generative
Abstract: Diffusion models have quickly become some of the most popular and powerful generative models for high-dimensional data. The key insight that enabled their development was the realization that access to the score -- the gradient of the log-density at different noise levels -- allows for sampling from data distributions by solving a reverse-time stochastic differential equation (SDE) via forward discretization, and that popular denoisers allow for unbiased estimators of this score. In this paper, we demonstrate that an alternative, backward discretization of these SDEs, using proximal maps in place of the score, leads to theoretical and practical benefits. We leverage recent results in proximal matching to learn proximal operators of the log-density and, with them, develop Proximal Diffusion Models (ProxDM). Theoretically, we prove that $\widetilde{O}(d/\sqrt{\varepsilon})$ steps suffice for the resulting discretization to generate an $\varepsilon$-accurate distribution w.r.t. the KL divergence. Empirically, we show that two variants of ProxDM achieve significantly faster convergence within just a few sampling steps compared to conventional score-matching methods.
摘要：扩散模型已迅速成为用于高维数据的一些最受欢迎，最强大的生成模型。促成其发展的关键见解是认识到，访问分数（在不同噪声水平下的对数密度的梯度）可以通过向前离散化解决反向时间随机微分方程（SDE），从而可以从数据分布中进行采样，并且流行的去噪器可以给出该分数的无偏估计量。在本文中，我们证明了对这些SDE采用另一种向后离散化，即使用近端映射代替分数，可以带来理论和实际的好处。我们利用近端匹配的最新结果来学习对数密度的近端算子，并据此开发近端扩散模型（ProxDM）。从理论上讲，我们证明了 $\widetilde{O}(d/\sqrt{\varepsilon})$ 步足以使所得离散化生成关于KL散度的 $\varepsilon$ 精确分布。从经验上讲，我们表明，与传统的分数匹配方法相比，ProxDM的两个变体仅在几个采样步骤内就获得了明显更快的收敛速度。
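
Note: the contrast between forward (score-based) and backward (proximal) discretization can be illustrated on a one-dimensional toy problem. The sketch below is not ProxDM itself and does not simulate an SDE; it assumes the potential f(x) = x^2 / (2 * sigma2), i.e. the negative log-density of a zero-mean Gaussian up to a constant, for which the proximal map has a closed form. All function names are hypothetical.

```python
def forward_step(x, h, sigma2):
    # Explicit (forward Euler) step using the gradient f'(x) = x / sigma2.
    return x - h * x / sigma2

def prox_step(x, h, sigma2):
    # Implicit (backward Euler) step: solve y = x - h * f'(y) for y.
    # This equals prox_{h f}(x) = argmin_y f(y) + (y - x)^2 / (2 h),
    # which for the quadratic f above is x / (1 + h / sigma2).
    return x / (1.0 + h / sigma2)

def iterate(step, x0, h, sigma2, n_steps):
    # Apply the chosen step rule n_steps times starting from x0.
    x = x0
    for _ in range(n_steps):
        x = step(x, h, sigma2)
    return x

# With a large step size the explicit rule oscillates and blows up,
# while the proximal rule still contracts toward the mode at 0.
explicit = iterate(forward_step, 1.0, 3.0, 1.0, 5)
implicit = iterate(prox_step, 1.0, 3.0, 1.0, 5)
```

The design point is that the proximal step stays stable for any positive step size in this toy setting, which is one intuition for why a backward discretization may tolerate fewer, larger sampling steps.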

Title: Theory-Informed Improvements to Classifier-Free Guidance for Discrete Diffusion Models

Authors: Kevin Rojas, Ye He, Chieh-Hsin Lai, Yuta Takida, Yuki Mitsufuji, Molei Tao
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2507.08965
Pdf URL: https://arxiv.org/pdf/2507.08965
Copy Paste: [[2507.08965]] Theory-Informed Improvements to Classifier-Free Guidance for Discrete Diffusion Models(https://arxiv.org/abs/2507.08965)
Keywords: generation
Abstract: Classifier-Free Guidance (CFG) is a widely used technique for conditional generation and improving sample quality in continuous diffusion models, and recent works have extended it to discrete diffusion. This paper theoretically analyzes CFG in the context of masked discrete diffusion, focusing on the role of guidance schedules. Our analysis shows that high guidance early in sampling (when inputs are heavily masked) harms generation quality, while late-stage guidance has a larger effect. These findings provide a theoretical explanation for empirical observations in recent studies on guidance schedules. The analysis also reveals an imperfection of the current CFG implementations. These implementations can unintentionally cause imbalanced transitions, such as unmasking too rapidly during the early stages of generation, which degrades the quality of the resulting samples. To address this, we draw insight from the analysis and propose a novel classifier-free guidance mechanism empirically applicable to any discrete diffusion. Intuitively, our method smoothens the transport between the data distribution and the initial (masked/uniform) distribution, which results in improved sample quality. Remarkably, our method is achievable via a simple one-line code change. The efficacy of our method is empirically demonstrated with experiments on ImageNet (masked discrete diffusion) and QM9 (uniform discrete diffusion).
摘要：无分类器引导（CFG）是一种广泛使用的技术，用于有条件地生成并改善了连续扩散模型的样品质量，最近的工作已将其扩展到离散扩散。本文理论上在掩盖离散扩散的背景下对CFG进行了分析，重点是指导时间表的作用。我们的分析表明，在抽样的早期（当输入被严重掩盖时）较高的指导会损害发电质量，而后期指导的效果更大。这些发现在最近的指导时间表研究中为经验观察提供了理论上的解释。该分析还揭示了当前CFG实现的不完美。这些实现可能会无意间引起不平衡的过渡，例如在发电的早期阶段揭示过快，从而降低了所得样品的质量。为了解决这个问题，我们从分析中获取见解，并提出一种新颖的无分类指导机制，从经验上适用于任何离散扩散。直观地，我们的方法使数据分布与初始（掩盖/均匀）分布之间的传输平滑，从而提高了样品质量。值得注意的是，我们的方法可以通过简单的单行代码更改来实现。通过对成像网（掩盖离散扩散）和QM9（均匀离散扩散）的实验，我们的方法的功效得到了经验证明。
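
Note: as background for this entry, the standard classifier-free guidance rule for a categorical distribution, plus a time-dependent guidance weight, can be sketched as follows. This is not the paper's proposed mechanism or its one-line change, which the abstract does not spell out. The convention used here (w = 0 recovers the conditional distribution) and the linear `ramp_schedule` are assumptions made for illustration, and all probabilities are assumed strictly positive.

```python
import math

def guided_distribution(p_cond, p_uncond, w):
    # Standard CFG in log space: log p_w = (1 + w) log p_cond - w log p_uncond + const.
    logits = [(1.0 + w) * math.log(pc) - w * math.log(pu)
              for pc, pu in zip(p_cond, p_uncond)]
    # Renormalize with a numerically stable softmax.
    top = max(logits)
    weights = [math.exp(v - top) for v in logits]
    total = sum(weights)
    return [v / total for v in weights]

def ramp_schedule(masked_fraction, w_max):
    # Hypothetical schedule: little guidance while most tokens are still
    # masked (early sampling), full guidance once almost none are (late).
    return w_max * (1.0 - masked_fraction)

# Example: guidance sharpens the class-preferred token late in sampling.
late = guided_distribution([0.6, 0.4], [0.5, 0.5], ramp_schedule(0.0, 1.0))
early = guided_distribution([0.6, 0.4], [0.5, 0.5], ramp_schedule(1.0, 1.0))
```

A schedule shaped like this only mirrors the abstract's qualitative finding that strong early guidance hurts and late guidance matters more; the actual schedules analyzed are in the paper.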

Title: Learning Diffusion Models with Flexible Representation Guidance

Authors: Chenyu Wang, Cai Zhou, Sharut Gupta, Zongyu Lin, Stefanie Jegelka, Stephen Bates, Tommi Jaakkola
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2507.08980
Pdf URL: https://arxiv.org/pdf/2507.08980
Copy Paste: [[2507.08980]] Learning Diffusion Models with Flexible Representation Guidance(https://arxiv.org/abs/2507.08980)
Keywords: generation
Abstract: Diffusion models can be improved with additional guidance towards more effective representations of input. Indeed, prior empirical work has already shown that aligning internal representations of the diffusion model with those of pre-trained models improves generation quality. In this paper, we present a systematic framework for incorporating representation guidance into diffusion models. We provide alternative decompositions of denoising models along with their associated training criteria, where the decompositions determine when and how the auxiliary representations are incorporated. Guided by our theoretical insights, we introduce two new strategies for enhancing representation alignment in diffusion models. First, we pair examples with target representations either derived from themselves or arising from different synthetic modalities, and subsequently learn a joint model over the multimodal pairs. Second, we design an optimal training curriculum that balances representation learning and data generation. Our experiments across image, protein sequence, and molecule generation tasks demonstrate superior performance as well as accelerated training. In particular, on the class-conditional ImageNet $256\times 256$ benchmark, our guidance yields $23.3$ times faster training than the original SiT-XL as well as a fourfold speedup over the state-of-the-art method REPA. The code is available at this https URL.
摘要：通过额外的指导，可以改善扩散模型，以更有效的输入表示。实际上，先前的经验工作已经表明，将扩散模型的内部表示与预训练模型的内部表示对齐可以提高生成质量。在本文中，我们提出了一个系统的框架，将表示指导纳入扩散模型中。我们提供了去噪模型的替代分解及其相关的训练标准，其中分解确定了何时以及如何纳入辅助表示。在我们的理论见解的指导下，我们引入了两种新的策略，以增强扩散模型中的表示对齐。首先，我们将示例与源自自身或来自不同合成模态的目标表示配对，然后在多模态对上学习联合模型。其次，我们设计了一个最佳的训练课程，可以平衡表示学习和数据生成。我们跨图像，蛋白质序列和分子生成任务的实验表明了卓越的性能和加速训练。尤其是，在类条件 ImageNet $256\times 256$ 基准上，我们的指导使训练速度比原始 SiT-XL 快 $23.3$ 倍，并比最先进的方法 REPA 快四倍。该代码可在此HTTPS URL上找到。

Title: Exploiting Leaderboards for Large-Scale Distribution of Malicious Models

Authors: Anshuman Suri, Harsh Chaudhari, Yuefeng Peng, Ali Naseh, Amir Houmansadr, Alina Oprea
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2507.08983
Pdf URL: https://arxiv.org/pdf/2507.08983
Copy Paste: [[2507.08983]] Exploiting Leaderboards for Large-Scale Distribution of Malicious Models(https://arxiv.org/abs/2507.08983)
Keywords: generation
Abstract: While poisoning attacks on machine learning models have been extensively studied, the mechanisms by which adversaries can distribute poisoned models at scale remain largely unexplored. In this paper, we shed light on how model leaderboards -- ranked platforms for model discovery and evaluation -- can serve as a powerful channel for adversaries for stealthy large-scale distribution of poisoned models. We present TrojanClimb, a general framework that enables injection of malicious behaviors while maintaining competitive leaderboard performance. We demonstrate its effectiveness across four diverse modalities: text-embedding, text-generation, text-to-speech and text-to-image, showing that adversaries can successfully achieve high leaderboard rankings while embedding arbitrary harmful functionalities, from backdoors to bias injection. Our findings reveal a significant vulnerability in the machine learning ecosystem, highlighting the urgent need to redesign leaderboard evaluation mechanisms to detect and filter malicious (e.g., poisoned) models, while exposing broader security implications for the machine learning community regarding the risks of adopting models from unverified sources.
摘要：虽然已经对机器学习模型的中毒攻击进行了广泛的研究，但对手可以在大规模分发中毒模型的机制基本上仍未开发。在本文中，我们阐明了模型排行榜如何（用于模型发现和评估的排名平台）如何成为对手的强大渠道，用于隐身大规模分布中毒模型。我们提出了Trojanclimb，这是一个通用框架，可以在保持竞争性排行榜表现的同时注入恶意行为。我们证明了它在四种不同方式的有效性：文本插入，文本收藏，文本到语音和文本形象，表明对手可以成功地达到较高的排行榜排名，同时嵌入了任意有害功能，从后门到偏见注入。我们的发现揭示了机器学习生态系统中的重要脆弱性，强调迫切需要重新设计排行榜评估机制以检测和过滤恶意（例如，中毒）模型，同时暴露了对机器学习社区的更广泛的安全性影响，从而对从未经证实来源采用模型采用的风险。

Title: On Evaluating Performance of LLM Inference Serving Systems

Authors: Amey Agrawal, Nitin Kedia, Anmol Agarwal, Jayashree Mohan, Nipun Kwatra, Souvik Kundu, Ramachandran Ramjee, Alexey Tumanov
Subjects: cs.LG, cs.AI, cs.DC
Abstract URL: https://arxiv.org/abs/2507.09019
Pdf URL: https://arxiv.org/pdf/2507.09019
Copy Paste: [[2507.09019]] On Evaluating Performance of LLM Inference Serving Systems(https://arxiv.org/abs/2507.09019)
Keywords: generation
Abstract: The rapid evolution of Large Language Model (LLM) inference systems has yielded significant efficiency improvements. However, our systematic analysis reveals that current evaluation methodologies frequently exhibit fundamental flaws, often manifesting as common evaluation anti-patterns that obscure true performance characteristics and impede scientific progress. Through a comprehensive examination of recent systems, we identify recurring anti-patterns across three key dimensions: Baseline Fairness, Evaluation Setup, and Metric Design. These anti-patterns are uniquely problematic for LLM inference due to its dual-phase nature combining distinct prefill and decode operations, its handling of highly heterogeneous workloads, and its strict temporal requirements for interactive use. We demonstrate how common anti-patterns -- such as inadequate baseline comparisons that conflate engineering effort with algorithmic novelty, workload selections that fail to represent production scenarios, and metric normalizations that hide substantial performance variability like generation stalls -- lead to misleading conclusions. To address these challenges, we provide a comprehensive checklist derived from our analysis, establishing a framework for recognizing and avoiding these anti-patterns in favor of robust LLM inference evaluation. To demonstrate the practical application of our framework, we present a case study analyzing speculative decoding, a technique whose bursty, non-uniform token generation is easily misinterpreted when evaluated using approaches characteristic of these anti-patterns. Our work establishes a rigorous foundation for evaluation methodology, enabling meaningful comparisons, ensuring reproducible results, and ultimately accelerating genuine progress in LLM inference systems by moving beyond common anti-patterns to align evaluation with real-world requirements.
摘要：大语言模型（LLM）推理系统的快速演变已取得了重大效率的提高。但是，我们的系统分析表明，当前的评估方法经常表现出基本缺陷，通常表现为常见的评估反图案，这些反图案掩盖了真实的绩效特征并阻碍了科学进步。通过对最近的系统的全面检查，我们确定了三个关键维度的反复出现的反图案：基线公平，评估设置和度量标准设计。这些抗模式对于LLM推断是独特的问题，这是由于其双相性质结合了不同的预填充和解码操作，对高度异构工作负载的处理以及对互动使用的严格时间要求。我们证明了如何将工程工作与算法新颖性相结合的基线比较不足，例如基线的比较不足，无法代表生产场景的工作负载选择以及隐藏了像生成失速的实质性变异性的度量标准化，以误解结论。为了应对这些挑战，我们提供了一个从我们的分析中得出的全面清单，建立了一个框架，以识别和避免使用这些反模式，以支持强大的LLM推理评估。为了证明我们的框架的实际应用，我们提出了一个案例研究，分析了投机解码，这是一种使用这些反模式特征的方法进行评估时，其爆发，不均匀的令牌产生很容易被误解。我们的工作为评估方法论建立了严格的基础，实现有意义的比较，确保可重现的结果，并最终通过超越共同的反patterns来使LLM推理系统中的真正进步加速，以使评估与现实世界的要求保持一致。
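
Note: the point about metric normalization hiding generation stalls can be made concrete with a small sketch. The two traces below are invented for illustration (token emission times in milliseconds) and the function names are hypothetical; the abstract does not prescribe these particular metrics.

```python
def inter_token_gaps(timestamps_ms):
    # Time between consecutive token emissions.
    return [later - earlier
            for earlier, later in zip(timestamps_ms, timestamps_ms[1:])]

def mean_gap(timestamps_ms):
    # Normalized latency: average time per output token after the first.
    gaps = inter_token_gaps(timestamps_ms)
    return sum(gaps) / len(gaps)

def max_gap(timestamps_ms):
    # Worst stall a user would actually experience.
    return max(inter_token_gaps(timestamps_ms))

# Invented traces: both emit 6 tokens and finish at 500 ms.
smooth_trace = [0, 100, 200, 300, 400, 500]
bursty_trace = [0, 10, 20, 30, 40, 500]
```

Both traces have the same mean gap of 100 ms, yet the bursty one contains a 460 ms stall, so reporting only the normalized average would make the two systems look identical.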

Title: Behavioral Exploration: Learning to Explore via In-Context Adaptation

Authors: Andrew Wagenmaker, Zhiyuan Zhou, Sergey Levine
Subjects: cs.LG, cs.RO, eess.SY
Abstract URL: https://arxiv.org/abs/2507.09041
Pdf URL: https://arxiv.org/pdf/2507.09041
Copy Paste: [[2507.09041]] Behavioral Exploration: Learning to Explore via In-Context Adaptation(https://arxiv.org/abs/2507.09041)
Keywords: generative
Abstract: Developing autonomous agents that quickly explore an environment and adapt their behavior online is a canonical challenge in robotics and machine learning. While humans are able to achieve such fast online exploration and adaptation, often acquiring new information and skills in only a handful of interactions, existing algorithmic approaches tend to rely on random exploration and slow, gradient-based behavior updates. How can we endow autonomous agents with such capabilities on par with humans? Taking inspiration from recent progress on both in-context learning and large-scale behavioral cloning, in this work we propose behavioral exploration: training agents to internalize what it means to explore and adapt in-context over the space of "expert" behaviors. To achieve this, given access to a dataset of expert demonstrations, we train a long-context generative model to predict expert actions conditioned on a context of past observations and a measure of how "exploratory" the expert's behaviors are relative to this context. This enables the model not only to mimic the behavior of an expert but also, by feeding its past history of interactions into its context, to select expert behaviors different from those previously selected, thereby allowing for fast online adaptation and targeted, "expert-like" exploration. We demonstrate the effectiveness of our method in both simulated locomotion and manipulation settings, as well as on real-world robotic manipulation tasks, illustrating its ability to learn adaptive, exploratory behavior.
摘要：在机器人技术和机器学习中，开发迅速探索环境并在线适应其行为的自主代理是一个规范的挑战。尽管人类能够实现如此快速的在线探索和适应性，但通常仅在少数互动中获得新的信息和技能，但现有的算法方法倾向于依靠随机探索和基于梯度缓慢的行为更新。我们如何才能使自主代理具有与人类相当的能力？从最近的进展和大规模行为克隆中汲取灵感，在这项工作中，我们提出了行为探索：培训代理人内部化探索和适应``专家''行为的含义。为了实现这一目标，鉴于访问专家演示的数据集，我们培训了一个长篇小说生成模型，以预测以过去观察的背景为条件的专家行动，并衡量``探索性''专家的行为与此上下文相关。这使该模型不仅能够模仿专家的行为，还可以通过将其过去的互动历史喂入其上下文中，以选择与以前选择的不同的专家行为，从而允许快速在线适应并有针对性的``专家''exploration。我们证明了我们方法在模拟的运动和操纵设置以及现实世界机器人操纵任务中的有效性，这说明了其学习适应性，探索性行为的能力。

Title: Shortening the Trajectories: Identity-Aware Gaussian Approximation for Efficient 3D Molecular Generation

Authors: Jingxiang Qu, Wenhan Gao, Yi Liu
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2507.09043
Pdf URL: https://arxiv.org/pdf/2507.09043
Copy Paste: [[2507.09043]] Shortening the Trajectories: Identity-Aware Gaussian Approximation for Efficient 3D Molecular Generation(https://arxiv.org/abs/2507.09043)
Keywords: generation, generative
Abstract: Gaussian-based Probabilistic Generative Models (GPGMs) generate data by reversing a stochastic process that progressively corrupts samples with Gaussian noise. While these models have achieved state-of-the-art performance across diverse domains, their practical deployment remains constrained by the high computational cost of long generative trajectories, which often involve hundreds to thousands of steps during training and sampling. In this work, we introduce a theoretically grounded and empirically validated framework that improves generation efficiency without sacrificing training granularity or inference fidelity. Our key insight is that for certain data modalities, the noising process causes data to rapidly lose its identity and converge toward a Gaussian distribution. We analytically identify a characteristic step at which the data has acquired sufficient Gaussianity, and then replace the remaining generation trajectory with a closed-form Gaussian approximation. Unlike existing acceleration techniques that coarsen the trajectories by skipping steps, our method preserves the full resolution of learning dynamics while avoiding redundant stochastic perturbations between 'Gaussian-like' distributions. Empirical results across multiple data modalities demonstrate substantial improvements in both sample quality and computational efficiency.
摘要：基于高斯的概率生成模型（GPGM）通过逆转随机过程逐渐损坏具有高斯噪声的样本来生成数据。尽管这些模型已经在各种领域实现了最新的性能，但它们的实际部署仍受到长期生成轨迹的高计算成本的限制，长期生成轨迹的高计算成本通常涉及训练和抽样过程中数百到数千个步骤。在这项工作中，我们引入了理论上扎根和经验验证的框架，该框架在不牺牲培训粒度或推理保真度的情况下提高了发电效率。我们的关键见解是，对于某些数据方式，尖锐的过程导致数据迅速失去其身份并趋向于高斯分布。我们在分析上确定了数据获得足够高斯性的特征步骤，然后用封闭形式的高斯近似替换剩余的一代轨迹。与现有的加速技术通过跳过步骤来使轨迹的轨迹使轨迹更加完善，我们的方法可以保留学习动力学的完整分辨率，同时避免了“高斯样”分布之间的冗余随机扰动。多种数据模式的经验结果表明，样本质量和计算效率都有很大的提高。
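
Note: the observation that noising makes data "lose its identity" can be illustrated with the usual Gaussian forward process, x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps, where abar_t is the running product of (1 - beta_s). The sketch below tracks the signal coefficient sqrt(abar_t) and finds the first step at which it drops below a threshold. The linear beta schedule and the threshold rule are assumptions for illustration; the paper derives its characteristic step analytically, and the abstract does not give that criterion.

```python
import math

def linear_betas(n_steps, beta_start=1e-4, beta_end=0.02):
    # Linearly spaced noise variances (an assumed, commonly used schedule).
    span = beta_end - beta_start
    return [beta_start + span * i / (n_steps - 1) for i in range(n_steps)]

def signal_coefficients(betas):
    # sqrt(abar_t) for each step: how much of x_0 survives in x_t.
    coefficients = []
    alpha_bar = 1.0
    for beta in betas:
        alpha_bar *= 1.0 - beta
        coefficients.append(math.sqrt(alpha_bar))
    return coefficients

def first_step_below(coefficients, threshold):
    # Hypothetical stand-in for a "sufficiently Gaussian" step:
    # the first index whose signal coefficient is under the threshold.
    for step, value in enumerate(coefficients):
        if value < threshold:
            return step
    return None

coefficients = signal_coefficients(linear_betas(1000))
cutoff = first_step_below(coefficients, 0.1)
```

Steps after `cutoff` carry very little of the original sample under this toy criterion, which is the kind of tail a closed-form Gaussian approximation could plausibly replace.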

Title: Can Contrastive Learning Improve Class-Imbalanced Diffusion Model?

Authors: Fang Chen, Alex Villa, Gongbo Liang, Xiaoyi Lu, Meng Tang
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2507.09052
Pdf URL: https://arxiv.org/pdf/2507.09052
Copy Paste: [[2507.09052]] Can Contrastive Learning Improve Class-Imbalanced Diffusion Model?(https://arxiv.org/abs/2507.09052)
Keywords: generation
Abstract: Training data for class-conditional image synthesis often exhibit a long-tailed distribution with limited images for tail classes. Such an imbalance causes mode collapse and reduces the diversity of synthesized images for tail classes. For class-conditional diffusion models trained on imbalanced data, we aim to improve the diversity of tail class images without compromising the fidelity and diversity of head class images. We achieve this by introducing two deceptively simple but highly effective contrastive loss functions. Firstly, we employ an unsupervised InfoNCE loss utilizing negative samples to increase the distance/dissimilarity among synthetic images, particularly for tail classes. To further enhance the diversity of tail classes, our second loss is an MSE loss that contrasts class-conditional generation with unconditional generation at large timesteps. This second loss makes the denoising process insensitive to class conditions for the initial steps, which enriches tail classes through knowledge sharing from head classes. Conditional-unconditional alignment has been shown to enhance the performance of long-tailed GAN. We are the first to adapt such alignment to diffusion models. We successfully leveraged contrastive learning for class-imbalanced diffusion models. Our contrastive learning framework is easy to implement and outperforms standard DDPM and alternative methods for class-imbalanced diffusion models across various datasets, including CIFAR10/100-LT, PlacesLT, TinyImageNetLT, and ImageNetLT.
摘要：班级条件图像合成的训练数据通常显示出长尾巴分布，而尾部类别的图像有限。这种不平衡导致模式崩溃，并减少了尾部类别的合成图像的多样性。对于接受不平衡数据训练的课堂条件扩散模型，我们旨在改善尾部类图像的多样性，而不会损害头类图像的忠诚度和多样性。我们通过引入两个看似简单但高效的对比损失函数来实现这一目标。首先，我们利用负面样品采用了无监督的Infonce损失，以增加合成图像之间的距离/差异性，尤其是对于尾部类别。为了进一步提高尾部类别的多样性，我们的第二次损失是MSE损失，将阶级的生成与大时的无条件产生形成对比。第二次损失使得对初始步骤的班级条件的denoisis流程不敏感，从而通过头等阶层的知识共享丰富了尾巴类。有条件的无条件比对已显示可增强长尾gan的性能。我们是第一个将这种比对适应扩散模型的人。我们成功利用了对阶级失去平衡扩散模型的对比度学习。我们的对比学习框架易于实现，并且胜过标准DDPM和跨各种数据集的类不平衡扩散模型的替代方法，包括CIFAR10/100-LT，PLOCELT，TIMELT，TINYIMAGENETLT和IMAGENETLT。

Title: From Physics to Foundation Models: A Review of AI-Driven Quantitative Remote Sensing Inversion

Authors: Zhenyu Yu, Mohd Yamani Idna Idris, Hua Wang, Pei Wang, Junyi Chen, Kun Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.09081
Pdf URL: https://arxiv.org/pdf/2507.09081
Copy Paste: [[2507.09081]] From Physics to Foundation Models: A Review of AI-Driven Quantitative Remote Sensing Inversion(https://arxiv.org/abs/2507.09081)
Keywords: generation
Abstract: Quantitative remote sensing inversion aims to estimate continuous surface variables -- such as biomass, vegetation indices, and evapotranspiration -- from satellite observations, supporting applications in ecosystem monitoring, carbon accounting, and land management. With the evolution of remote sensing systems and artificial intelligence, traditional physics-based paradigms are giving way to data-driven and foundation model (FM)-based approaches. This paper systematically reviews the methodological evolution of inversion techniques, from physical models (e.g., PROSPECT, SCOPE, DART) to machine learning methods (e.g., deep learning, multimodal fusion), and further to foundation models (e.g., SatMAE, GFM, mmEarth). We compare the modeling assumptions, application scenarios, and limitations of each paradigm, with emphasis on recent FM advances in self-supervised pretraining, multi-modal integration, and cross-task adaptation. We also highlight persistent challenges in physical interpretability, domain generalization, limited supervision, and uncertainty quantification. Finally, we envision the development of next-generation foundation models for remote sensing inversion, emphasizing unified modeling capacity, cross-domain generalization, and physical interpretability.
摘要：定量遥感反演旨在估计连续的表面变量，例如生物量，植被指数和蒸散量，从卫星观测中观察到，支持生态系统监测，碳核算和土地管理中的应用。随着遥感系统和人工智能的演变，传统的基于物理的范例正在取代基于数据驱动的基础模型（FM）的方法。本文系统地回顾了反转技术的方法论演变，从物理模型（例如，前景，范围，飞镖）到机器学习方法（例如，深度学习，多模式融合），再到基础模型（例如Satmae，satmae，gfm，mmearth）。我们比较了每个范式的建模假设，应用程序场景和局限性，并强调了自我监督预训练，多模式集成和交叉任务改编的最新FM进展。我们还强调了物理解释性，领域概括，有限的监督和不确定性量化方面的持续挑战。最后，我们设想开发用于遥感倒置的下一代基础模型，强调统一的建模能力，跨域泛化和物理解释性。

Title: Taming generative video models for zero-shot optical flow extraction

Authors: Seungwoo Kim, Khai Loong Aw, Klemen Kotar, Cristobal Eyzaguirre, Wanhee Lee, Yunong Liu, Jared Watrous, Stefan Stojanov, Juan Carlos Niebles, Jiajun Wu, Daniel L. K. Yamins
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.09082
Pdf URL: https://arxiv.org/pdf/2507.09082
Copy Paste: [[2507.09082]] Taming generative video models for zero-shot optical flow extraction(https://arxiv.org/abs/2507.09082)
Keywords: generative
Abstract: Extracting optical flow from videos remains a core computer vision problem. Motivated by the success of large general-purpose models, we ask whether frozen self-supervised video models trained only for future frame prediction can be prompted, without fine-tuning, to output flow. Prior work reading out depth or illumination from video generators required fine-tuning, which is impractical for flow, where labels are scarce and synthetic datasets suffer from a sim-to-real gap. Inspired by the Counterfactual World Model (CWM) paradigm, which can obtain point-wise correspondences by injecting a small tracer perturbation into a next-frame predictor and tracking its propagation, we extend this idea to generative video models. We explore several popular architectures and find that successful zero-shot flow extraction in this manner is aided by three model properties: (1) distributional prediction of future frames (avoiding blurry or noisy outputs); (2) factorized latents that treat each spatio-temporal patch independently; and (3) random-access decoding that can condition on any subset of future pixels. These properties are uniquely present in the recent Local Random Access Sequence (LRAS) architecture. Building on LRAS, we propose KL-tracing: a novel test-time procedure that injects a localized perturbation into the first frame, rolls out the model one step, and computes the Kullback-Leibler divergence between perturbed and unperturbed predictive distributions. Without any flow-specific fine-tuning, our method outperforms state-of-the-art models on the real-world TAP-Vid DAVIS dataset (16.6% relative improvement in endpoint error) and the synthetic TAP-Vid Kubric dataset (4.7% relative improvement). Our results indicate that counterfactual prompting of controllable generative video models is a scalable and effective alternative to supervised or photometric-loss approaches for high-quality flow.
摘要：从视频中提取光流仍然是核心计算机视觉问题。由于大型通用模型的成功，我们询问是否可以在不进行微调的情况下提示仅针对未来框架预测训练的冷冻自我监管的录像模型以输出流量。先前的工作读取视频发电机的深度或照明需要微调，这对于稀缺标签和合成数据集的流量是不切实际的。受反事实世界模型（CWM）范式的启发，可以通过将小的示踪剂扰动注入下一框架预测变量并跟踪其传播，从而获得了点对应的对应关系，我们将此想法扩展到了生成的视频模型。我们探索了几个流行的体系结构，并发现以这种方式成功的零击流量提取得到了三种模型属性：（1）未来框架的分布预测（避免模糊或嘈杂的输出）；（2）独立处理每个时空斑块的分解潜伏期；（3）可以在未来像素的任何子集上进行调节的随机访问解码。这些属性在最近的局部随机访问序列（LRAS）体系结构中唯一存在。在LRAS的基础上，我们提出了KL-Tracing：一种新型的测试时间程序，将局部扰动注入第一个帧，将模型一步推出，并计算kullback-Leibler的差异，在受扰动和未扰动的预测性分布之间。如果没有任何特定流量的微调，我们的方法在现实世界中的TAP-VID DAVIS数据集（端点错误的相对改进为16.6％）和合成TAP-VID Kubric（相对改善为4.7％）上的最先进模型（16.6％的相对改进）。我们的结果表明，可控生成视频模型的反事实提示是用于高质量流量的监督或光度损失方法的可扩展性替代方案。
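
Note: the final step of the KL-tracing procedure described above (comparing perturbed and unperturbed predictive distributions) can be sketched on toy data. This is an illustration only: the per-location categorical distributions are invented, the direction KL(perturbed || unperturbed) is an assumption, all function names are hypothetical, and taking the argmax of the divergence map as the tracer's destination is a simplification that the abstract does not confirm.

```python
import math

def kl_divergence(p, q):
    # KL(p || q) in nats; q must be positive wherever p is.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def kl_map(perturbed, unperturbed):
    # Per-location divergence between the two next-frame predictions.
    # Each grid cell holds a categorical distribution over token values.
    return [[kl_divergence(p_cell, q_cell)
             for p_cell, q_cell in zip(p_row, q_row)]
            for p_row, q_row in zip(perturbed, unperturbed)]

def argmax_location(grid):
    # (row, col) of the largest divergence: where the tracer shows up.
    best_value, best_location = max(
        (value, (r, c))
        for r, row in enumerate(grid)
        for c, value in enumerate(row)
    )
    return best_location

def flow_vector(source, target):
    # Displacement from the perturbed source location to the response peak.
    return (target[0] - source[0], target[1] - source[1])

# Toy 2x2 frame: perturbing cell (0, 0) changes the prediction at cell (1, 0).
clean = [[[0.5, 0.5], [0.5, 0.5]], [[0.5, 0.5], [0.5, 0.5]]]
traced = [[[0.5, 0.5], [0.5, 0.5]], [[0.9, 0.1], [0.5, 0.5]]]
peak = argmax_location(kl_map(traced, clean))
flow = flow_vector((0, 0), peak)
```

Here the divergence is zero everywhere except the cell whose prediction changed, so the estimated displacement for the point at (0, 0) is one row down.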

Title: RadEyeVideo: Enhancing general-domain Large Vision Language Model for chest X-ray analysis with video representations of eye gaze

Authors: Yunsoo Kim, Jinge Wu, Honghan Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.09097
Pdf URL: https://arxiv.org/pdf/2507.09097
Copy Paste: [[2507.09097]] RadEyeVideo: Enhancing general-domain Large Vision Language Model for chest X-ray analysis with video representations of eye gaze(https://arxiv.org/abs/2507.09097)
Keywords: generation
Abstract: Large Vision-Language Models (LVLMs) have demonstrated promising performance in chest X-ray (CXR) analysis. To enhance human-computer interaction, several studies have incorporated radiologists' eye gaze, typically through heatmaps or textual prompts. However, these methods often overlook the sequential order of eye movements, which could provide valuable insights by highlighting both the areas of interest and the order in which they are examined. In this work, we propose a novel approach called RadEyeVideo that integrates radiologists' eye-fixation data as a video sequence, capturing both the temporal and spatial dynamics of their gaze. We evaluate this method in CXR report generation and disease diagnosis using three general-domain, open-source LVLMs with video input capabilities. When prompted with eye-gaze videos, model performance improves by up to 24.6% in the report generation task and on average 15.2% for both tasks using scaled evaluation metrics. Notably, RadEyeVideo enhanced an open-domain LVLM, LLaVA-OneVision, to surpass task-specific medical LVLMs such as MAIRA-2 and CheXagent, which are trained on large chest X-ray datasets. This work highlights that domain experts' knowledge (eye-gaze information in this case), when effectively integrated with LVLMs, can significantly enhance general-domain models' capabilities in clinical tasks. RadEyeVideo is a step toward a scalable human-centered approach to utilizing LVLMs in medical image analytics.

Title: Harnessing Text-to-Image Diffusion Models for Point Cloud Self-Supervised Learning

Authors: Yiyang Chen, Shanshan Zhao, Lunhao Duan, Changxing Ding, Dacheng Tao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.09102
Pdf URL: https://arxiv.org/pdf/2507.09102
Copy Paste: [[2507.09102]] Harnessing Text-to-Image Diffusion Models for Point Cloud Self-Supervised Learning(https://arxiv.org/abs/2507.09102)
Keywords: generation
Abstract: Diffusion-based models, widely used in text-to-image generation, have proven effective in 2D representation learning. Recently, this framework has been extended to 3D self-supervised learning by constructing a conditional point generator for enhancing 3D representations. However, its performance remains constrained by the 3D diffusion model, which is trained on the available 3D datasets with limited size. We hypothesize that the robust capabilities of text-to-image diffusion models, particularly Stable Diffusion (SD), which is trained on large-scale datasets, can help overcome these limitations. To investigate this hypothesis, we propose PointSD, a framework that leverages the SD model for 3D self-supervised learning. By replacing the SD model's text encoder with a 3D encoder, we train a point-to-image diffusion model that allows point clouds to guide the denoising of rendered noisy images. With the trained point-to-image diffusion model, we use noise-free images as the input and point clouds as the condition to extract SD features. Next, we train a 3D backbone by aligning its features with these SD features, thereby facilitating direct semantic learning. Comprehensive experiments on downstream point cloud tasks and ablation studies demonstrate that the SD model can enhance point cloud self-supervised learning. Code is publicly available at this https URL.

Title: Hybrid Autoregressive-Diffusion Model for Real-Time Streaming Sign Language Production

Authors: Maoxiao Ye, Xinfeng Ye, Mano Manoharan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.09105
Pdf URL: https://arxiv.org/pdf/2507.09105
Copy Paste: [[2507.09105]] Hybrid Autoregressive-Diffusion Model for Real-Time Streaming Sign Language Production(https://arxiv.org/abs/2507.09105)
Keywords: generation
Abstract: Earlier Sign Language Production (SLP) models typically relied on autoregressive methods that generate output tokens one by one, which inherently provide temporal alignment. Although techniques like Teacher Forcing can prevent model collapse during training, they still cannot solve the problem of error accumulation during inference, since ground truth is unavailable at that stage. In contrast, more recent approaches based on diffusion models leverage step-by-step denoising to enable high-quality generation. However, the iterative nature of these models and the requirement to denoise entire sequences limit their applicability in real-time tasks like SLP. To address this, we apply a hybrid approach combining autoregressive and diffusion models to SLP for the first time, leveraging the strengths of both models in sequential dependency modeling and output refinement. To capture fine-grained body movements, we design a Multi-Scale Pose Representation module that separately extracts detailed features from distinct articulators and integrates them via a Multi-Scale Fusion module. Furthermore, we introduce a Confidence-Aware Causal Attention mechanism that utilizes joint-level confidence scores to dynamically guide the pose generation process, improving accuracy and robustness. Extensive experiments on the PHOENIX14T and How2Sign datasets demonstrate the effectiveness of our method in both generation quality and real-time streaming efficiency.

Title: SnapMoGen: Human Motion Generation from Expressive Texts

Authors: Chuan Guo, Inwoo Hwang, Jian Wang, Bing Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.09122
Pdf URL: https://arxiv.org/pdf/2507.09122
Copy Paste: [[2507.09122]] SnapMoGen: Human Motion Generation from Expressive Texts(https://arxiv.org/abs/2507.09122)
Keywords: generation, generative
Abstract: Text-to-motion generation has experienced remarkable progress in recent years. However, current approaches remain limited to synthesizing motion from short or general text prompts, primarily due to dataset constraints. This limitation undermines fine-grained controllability and generalization to unseen prompts. In this paper, we introduce SnapMoGen, a new text-motion dataset featuring high-quality motion capture data paired with accurate, expressive textual annotations. The dataset comprises 20K motion clips totaling 44 hours, accompanied by 122K detailed textual descriptions averaging 48 words per description (vs. 12 words for HumanML3D). Importantly, these motion clips preserve the temporal continuity of the long sequences they were drawn from, facilitating research in long-term motion generation and blending. We also improve upon previous generative masked modeling approaches. Our model, MoMask++, transforms motion into multi-scale token sequences that better exploit the token capacity, and learns to generate all tokens using a single generative masked transformer. MoMask++ achieves state-of-the-art performance on both HumanML3D and SnapMoGen benchmarks. Additionally, we demonstrate the ability to process casual user prompts by employing an LLM to reformat inputs to align with the expressivity and narration style of SnapMoGen. Project webpage: this https URL

Title: $I^{2}$-World: Intra-Inter Tokenization for Efficient Dynamic 4D Scene Forecasting

Authors: Zhimin Liao, Ping Wei, Ruijie Zhang, Shuaijia Chen, Haoxuan Wang, Ziyang Ren
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.09144
Pdf URL: https://arxiv.org/pdf/2507.09144
Copy Paste: [[2507.09144]] $I^{2}$-World: Intra-Inter Tokenization for Efficient Dynamic 4D Scene Forecasting(https://arxiv.org/abs/2507.09144)
Keywords: generation
Abstract: Forecasting the evolution of 3D scenes and generating unseen scenarios via occupancy-based world models offers substantial potential for addressing corner cases in autonomous driving systems. While tokenization has revolutionized image and video generation, efficiently tokenizing complex 3D scenes remains a critical challenge for 3D world models. To address this, we propose $I^{2}$-World, an efficient framework for 4D occupancy forecasting. Our method decouples scene tokenization into intra-scene and inter-scene tokenizers. The intra-scene tokenizer employs a multi-scale residual quantization strategy to hierarchically compress 3D scenes while preserving spatial details. The inter-scene tokenizer residually aggregates temporal dependencies across timesteps. This dual design preserves the compactness of 3D tokenizers while retaining the dynamic expressiveness of 4D tokenizers. Unlike decoder-only GPT-style autoregressive models, $I^{2}$-World adopts an encoder-decoder architecture. The encoder aggregates spatial context from the current scene and predicts a transformation matrix to enable high-level control over scene generation. The decoder, conditioned on this matrix and historical tokens, ensures temporal consistency during generation. Experiments demonstrate that $I^{2}$-World achieves state-of-the-art performance, outperforming existing methods by 25.1% in mIoU and 36.9% in IoU for 4D occupancy forecasting while exhibiting exceptional computational efficiency: it requires merely 2.9 GB of training memory and achieves real-time inference at 37.0 FPS. Our code is available on this https URL.

Title: THYME: Temporal Hierarchical-Cyclic Interactivity Modeling for Video Scene Graphs in Aerial Footage

Authors: Trong-Thuan Nguyen, Pha Nguyen, Jackson Cothren, Alper Yilmaz, Minh-Triet Tran, Khoa Luu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.09200
Pdf URL: https://arxiv.org/pdf/2507.09200
Copy Paste: [[2507.09200]] THYME: Temporal Hierarchical-Cyclic Interactivity Modeling for Video Scene Graphs in Aerial Footage(https://arxiv.org/abs/2507.09200)
Keywords: generation
Abstract: The rapid proliferation of video in applications such as autonomous driving, surveillance, and sports analytics necessitates robust methods for dynamic scene understanding. Despite advances in static scene graph generation and early attempts at video scene graph generation, previous methods often suffer from fragmented representations, failing to capture fine-grained spatial details and long-range temporal dependencies simultaneously. To address these limitations, we introduce the Temporal Hierarchical Cyclic Scene Graph (THYME) approach, which synergistically integrates hierarchical feature aggregation with cyclic temporal refinement. In particular, THYME effectively models multi-scale spatial context and enforces temporal consistency across frames, yielding more accurate and coherent scene graphs. In addition, we present AeroEye-v1.0, a novel aerial video dataset enriched with five types of interactivity that overcome the constraints of existing datasets and provide a comprehensive benchmark for dynamic scene graph generation. Empirically, extensive experiments on ASPIRe and AeroEye-v1.0 demonstrate that the proposed THYME approach outperforms state-of-the-art methods, offering improved scene understanding in ground-view and aerial scenarios.

Title: Capturing Unseen Spatial Extremes Through Knowledge-Informed Generative Modeling

Authors: Xinyue Liu, Xiao Peng, Shuyue Yan, Yuntian Chen, Dongxiao Zhang, Zhixiao Niu, Hui-Min Wang, Xiaogang He
Subjects: cs.LG, physics.ao-ph, physics.data-an, physics.geo-ph, stat.ML
Abstract URL: https://arxiv.org/abs/2507.09211
Pdf URL: https://arxiv.org/pdf/2507.09211
Copy Paste: [[2507.09211]] Capturing Unseen Spatial Extremes Through Knowledge-Informed Generative Modeling(https://arxiv.org/abs/2507.09211)
Keywords: generative
Abstract: Observed records of climate extremes provide an incomplete picture of risk, missing "unseen" extremes that exceed historical bounds. In parallel, neglecting spatial dependence undervalues the risk of synchronized hazards that amplify impacts. To address these challenges, we develop DeepX-GAN (Dependence-Enhanced Embedding for Physical eXtremes - Generative Adversarial Network), a knowledge-informed deep generative model designed to better capture the spatial structure of rare extremes. The zero-shot generalizability of DeepX-GAN enables simulation of unseen extremes that fall outside historical experience yet remain statistically plausible. We define two types of unseen extremes: "checkmate" extremes that directly hit targets, and "stalemate" extremes that narrowly miss. These unrealized scenarios expose latent risks in fragile systems and may reinforce a false sense of resilience if overlooked. Near misses, in particular, can prompt either proactive adaptation or dangerous complacency, depending on how they are interpreted. Applying DeepX-GAN to the Middle East and North Africa (MENA), we find that these unseen extremes disproportionately affect regions with high vulnerability and low socioeconomic readiness, but differ in urgency and interpretation. Future warming could expand and redistribute these unseen extremes, with emerging exposure hotspots in Indo-Pakistan and Central Africa. This distributional shift highlights critical blind spots in conventional hazard planning and underscores the need to develop spatially adaptive policies that anticipate emergent risk hotspots rather than simply extrapolating from historical patterns.

Title: Warm Starts Accelerate Generative Modelling

Authors: Jonas Scholz, Richard E. Turner
Subjects: cs.LG, cs.CV, stat.ML
Abstract URL: https://arxiv.org/abs/2507.09212
Pdf URL: https://arxiv.org/pdf/2507.09212
Copy Paste: [[2507.09212]] Warm Starts Accelerate Generative Modelling(https://arxiv.org/abs/2507.09212)
Keywords: generation, generative
Abstract: Iterative generative models, like diffusion and flow-matching, create high-fidelity samples by progressively refining a noise vector into data. However, this process is notoriously slow, often requiring hundreds of function evaluations. We introduce the warm-start model, a simple, deterministic model that dramatically accelerates conditional generation by providing a better starting point. Instead of starting generation from an uninformed N(0, I) prior, our warm-start model predicts an informed prior N(mu, sigma), whose moments are conditioned on the input context. This "warm start" substantially reduces the distance the generative process must traverse, particularly when the conditioning information is strongly informative. On tasks like image inpainting, our method achieves results competitive with a 1000-step DDPM baseline using only 11 total function evaluations (1 for the warm start, 10 for generation). A simple conditional normalization trick makes our method compatible with any standard generative model and sampler without modification, allowing it to be combined with other efficient sampling techniques for further acceleration. Our implementation is available at this https URL.
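
As a rough illustration of the warm-start idea in the abstract above, the sketch below samples from an informed prior N(mu, sigma) and shows one plausible reading of the conditional normalization trick: data are shifted and scaled by the predicted moments so that an unmodified generative model can still start from N(0, I). This reading is an assumption based on the abstract alone, the function names are hypothetical, and `mu` and `sigma` are fixed numbers here rather than outputs of a trained warm-start network.

```python
import random

def warm_start_sample(mu, sigma, rng):
    """Draw an initial state from the informed prior N(mu, diag(sigma^2))."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]

def normalize(x, mu, sigma):
    """Map data into the frame where the informed prior becomes N(0, I)."""
    return [(xi - m) / s for xi, m, s in zip(x, mu, sigma)]

def denormalize(z, mu, sigma):
    """Map a sample from the normalized frame back to data space."""
    return [m + s * zi for zi, m, s in zip(z, mu, sigma)]

# Hypothetical moments that a warm-start model might predict for one input context.
mu = [1.0, -1.0]
sigma = [0.5, 2.0]

rng = random.Random(0)
x0 = warm_start_sample(mu, sigma, rng)    # starting point for the iterative sampler
z = normalize([1.5, -2.0], mu, sigma)     # a data point seen in the normalized frame
x_back = denormalize(z, mu, sigma)        # round trip back to data space
print(z, x_back)
```

Under this assumption, a standard sampler would run entirely in the normalized frame and its output would be mapped back with `denormalize`, which is why no change to the sampler itself would be needed.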

Title: EgoAnimate: Generating Human Animations from Egocentric top-down Views

Authors: G. Kutay Türkoglu, Julian Tanke, Iheb Belgacem, Lev Markhasin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.09230
Pdf URL: https://arxiv.org/pdf/2507.09230
Copy Paste: [[2507.09230]] EgoAnimate: Generating Human Animations from Egocentric top-down Views(https://arxiv.org/abs/2507.09230)
Keywords: generation, generative
Abstract: An ideal digital telepresence experience requires accurate replication of a person's body, clothing, and movements. To capture and transfer these movements into virtual reality, the egocentric (first-person) perspective can be adopted, which enables the use of a portable and cost-effective device without front-view cameras. However, this viewpoint introduces challenges such as occlusions and distorted body proportions. There are few works reconstructing human appearance from egocentric views, and none use a generative prior-based approach. Some methods create avatars from a single egocentric image during inference, but still rely on multi-view datasets during training. To our knowledge, this is the first study using a generative backbone to reconstruct animatable avatars from egocentric inputs. Based on Stable Diffusion, our method reduces training burden and improves generalizability. Inspired by methods such as SiTH and MagicMan, which perform 360-degree reconstruction from a frontal image, we introduce a pipeline that generates realistic frontal views from occluded top-down images using ControlNet and a Stable Diffusion backbone. Our goal is to convert a single top-down egocentric image into a realistic frontal representation and feed it into an image-to-motion model. This enables generation of avatar motions from minimal input, paving the way for more accessible and generalizable telepresence systems.

Title: Generative Latent Kernel Modeling for Blind Motion Deblurring

Authors: Chenhao Ding, Jiangtao Zhang, Zongsheng Yue, Hui Wang, Qian Zhao, Deyu Meng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.09285
Pdf URL: https://arxiv.org/pdf/2507.09285
Copy Paste: [[2507.09285]] Generative Latent Kernel Modeling for Blind Motion Deblurring(https://arxiv.org/abs/2507.09285)
Keywords: generative
Abstract: Deep prior-based approaches have demonstrated remarkable success in blind motion deblurring (BMD) recently. These methods, however, are often limited by the high non-convexity of the underlying optimization process in BMD, which leads to extreme sensitivity to the initial blur kernel. To address this issue, we propose a novel framework for BMD that leverages a deep generative model to encode the kernel prior and induce a better initialization for the blur kernel. Specifically, we pre-train a kernel generator based on a generative adversarial network (GAN) to aptly characterize the kernel's prior distribution, as well as a kernel initializer to provide a well-informed and high-quality starting point for kernel estimation. By combining these two components, we constrain the BMD solution within a compact latent kernel manifold, thus alleviating the aforementioned sensitivity for kernel initialization. Notably, the kernel generator and initializer are designed to be easily integrated with existing BMD methods in a plug-and-play manner, enhancing their overall performance. Furthermore, we extend our approach to tackle blind non-uniform motion deblurring without the need for additional priors, achieving state-of-the-art performance on challenging benchmark datasets. The source code is available at this https URL.

Title: Geo-RepNet: Geometry-Aware Representation Learning for Surgical Phase Recognition in Endoscopic Submucosal Dissection

Authors: Rui Tang, Haochen Yin, Guankun Wang, Long Bai, An Wang, Huxin Gao, Jiazheng Wang, Hongliang Ren
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2507.09294
Pdf URL: https://arxiv.org/pdf/2507.09294
Copy Paste: [[2507.09294]] Geo-RepNet: Geometry-Aware Representation Learning for Surgical Phase Recognition in Endoscopic Submucosal Dissection(https://arxiv.org/abs/2507.09294)
Keywords: generation
Abstract: Surgical phase recognition plays a critical role in developing intelligent assistance systems for minimally invasive procedures such as Endoscopic Submucosal Dissection (ESD). However, the high visual similarity across different phases and the lack of structural cues in RGB images pose significant challenges. Depth information offers valuable geometric cues that can complement appearance features by providing insights into spatial relationships and anatomical structures. In this paper, we pioneer the use of depth information for surgical phase recognition and propose Geo-RepNet, a geometry-aware convolutional framework that integrates RGB image and depth information to enhance recognition performance in complex surgical scenes. Built upon a re-parameterizable RepVGG backbone, Geo-RepNet incorporates the Depth-Guided Geometric Prior Generation (DGPG) module that extracts geometry priors from raw depth maps, and the Geometry-Enhanced Multi-scale Attention (GEMA) to inject spatial guidance through geometry-aware cross-attention and efficient multi-scale aggregation. To evaluate the effectiveness of our approach, we construct a nine-phase ESD dataset with dense frame-level annotations from real-world ESD videos. Extensive experiments on the proposed dataset demonstrate that Geo-RepNet achieves state-of-the-art performance while maintaining robustness and high computational efficiency under complex and low-texture surgical environments.

Title: AlphaVAE: Unified End-to-End RGBA Image Reconstruction and Generation with Alpha-Aware Representation Learning

Authors: Zile Wang, Hao Yu, Jiabo Zhan, Chun Yuan
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2507.09308
Pdf URL: https://arxiv.org/pdf/2507.09308
Copy Paste: [[2507.09308]] AlphaVAE: Unified End-to-End RGBA Image Reconstruction and Generation with Alpha-Aware Representation Learning(https://arxiv.org/abs/2507.09308)
Keywords: generation
Abstract: Recent advances in latent diffusion models have achieved remarkable results in high-fidelity RGB image synthesis by leveraging pretrained VAEs to compress and reconstruct pixel data at low computational cost. However, the generation of transparent or layered content (RGBA image) remains largely unexplored, due to the lack of large-scale benchmarks. In this work, we propose ALPHA, the first comprehensive RGBA benchmark that adapts standard RGB metrics to four-channel images via alpha blending over canonical backgrounds. We further introduce ALPHAVAE, a unified end-to-end RGBA VAE that extends a pretrained RGB VAE by incorporating a dedicated alpha channel. The model is trained with a composite objective that combines alpha-blended pixel reconstruction, patch-level fidelity, perceptual consistency, and dual KL divergence constraints to ensure latent fidelity across both RGB and alpha representations. Our RGBA VAE, trained on only 8K images in contrast to 1M used by prior methods, achieves a +4.9 dB improvement in PSNR and a +3.2% increase in SSIM over LayerDiffuse in reconstruction. It also enables superior transparent image generation when fine-tuned within a latent diffusion framework. Our code, data, and models are released on this https URL for reproducibility.
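
The abstract above describes adapting RGB metrics to four-channel images via alpha blending over canonical backgrounds. A minimal sketch of that idea for PSNR follows. It assumes pixel values in [0, 1], uses black and white as example backgrounds, and averages the per-background scores; the actual background set and aggregation used by the ALPHA benchmark are not stated in the abstract, and the function names are hypothetical.

```python
import math

def alpha_blend(rgba, background):
    """Composite one RGBA pixel (values in [0, 1]) over an RGB background color."""
    r, g, b, a = rgba
    return tuple(a * c + (1.0 - a) * bg for c, bg in zip((r, g, b), background))

def blended_psnr(ref_pixels, test_pixels, backgrounds, peak=1.0):
    """Average PSNR (dB) of two RGBA images after blending over each background."""
    scores = []
    for bg in backgrounds:
        squared_error, count = 0.0, 0
        for ref, test in zip(ref_pixels, test_pixels):
            for c_ref, c_test in zip(alpha_blend(ref, bg), alpha_blend(test, bg)):
                squared_error += (c_ref - c_test) ** 2
                count += 1
        mse = squared_error / count
        scores.append(float("inf") if mse == 0 else 10.0 * math.log10(peak ** 2 / mse))
    return sum(scores) / len(scores)

# Assumed example backgrounds: black and white.
backgrounds = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]

reference = [(1.0, 1.0, 1.0, 1.0)]       # one opaque white pixel
reconstruction = [(0.9, 0.9, 0.9, 1.0)]  # slightly darker reconstruction
psnr = blended_psnr(reference, reconstruction, backgrounds)
print(round(psnr, 2))
```

Blending before scoring means that color errors in fully transparent regions do not affect the metric, while errors in the alpha channel show up as color differences against at least one background.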

Title: Geometric Generative Modeling with Noise-Conditioned Graph Networks

Authors: Peter Pao-Huang, Mitchell Black, Xiaojie Qiu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.09391
Pdf URL: https://arxiv.org/pdf/2507.09391
Copy Paste: [[2507.09391]] Geometric Generative Modeling with Noise-Conditioned Graph Networks(https://arxiv.org/abs/2507.09391)
Keywords: generation, generative
Abstract: Generative modeling of graphs with spatial structure is essential across many applications from computer graphics to spatial genomics. Recent flow-based generative models have achieved impressive results by gradually adding and then learning to remove noise from these graphs. Existing models, however, use graph neural network architectures that are independent of the noise level, limiting their expressiveness. To address this issue, we introduce Noise-Conditioned Graph Networks (NCGNs), a class of graph neural networks that dynamically modify their architecture according to the noise level during generation. Our theoretical and empirical analysis reveals that as noise increases, (1) graphs require information from increasingly distant neighbors and (2) graphs can be effectively represented at lower resolutions. Based on these insights, we develop Dynamic Message Passing (DMP), a specific instantiation of NCGNs that adapts both the range and resolution of message passing to the noise level. DMP consistently outperforms noise-independent architectures on a variety of domains including 3D point clouds, spatiotemporal transcriptomics, and images. Code is available at this https URL.

Title: Domain Adaptation and Multi-view Attention for Learnable Landmark Tracking with Sparse Data

Authors: Timothy Chase Jr, Karthik Dantu
Subjects: cs.CV, cs.AI, cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2507.09420
Pdf URL: https://arxiv.org/pdf/2507.09420
Copy Paste: [[2507.09420]] Domain Adaptation and Multi-view Attention for Learnable Landmark Tracking with Sparse Data(https://arxiv.org/abs/2507.09420)
Keywords: generation
Abstract: The detection and tracking of celestial surface terrain features are crucial for autonomous spaceflight applications, including Terrain Relative Navigation (TRN), Entry, Descent, and Landing (EDL), hazard analysis, and scientific data collection. Traditional photoclinometry-based pipelines often rely on extensive a priori imaging and offline processing, constrained by the computational limitations of radiation-hardened systems. While historically effective, these approaches typically increase mission costs and duration, operate at low processing rates, and have limited generalization. Recently, learning-based computer vision has gained popularity to enhance spacecraft autonomy and overcome these limitations. While promising, emerging techniques frequently impose computational demands exceeding the capabilities of typical spacecraft hardware for real-time operation and are further challenged by the scarcity of labeled training data for diverse extraterrestrial environments. In this work, we present novel formulations for in-situ landmark tracking via detection and description. We utilize lightweight, computationally efficient neural network architectures designed for real-time execution on current-generation spacecraft flight processors. For landmark detection, we propose improved domain adaptation methods that enable the identification of celestial terrain features with distinct, cheaply acquired training data. Concurrently, for landmark description, we introduce a novel attention alignment formulation that learns robust feature representations that maintain correspondence despite significant landmark viewpoint variations. Together, these contributions form a unified system for landmark tracking that demonstrates superior performance compared to existing state-of-the-art techniques.
摘要：天体表面地形特征的检测和跟踪对于自动空间飞行应用至关重要，包括地形相对导航（TRN），进入，下降和着陆（EDL），危害分析和科学数据收集至关重要。传统的基于光胶根计的管道通常依赖于广泛的先验成像和离线处理，受到辐射硬化系统的计算限制的约束。尽管历史上有效，但这些方法通常会增加任务成本和持续时间，以低处理率运行，并且概括有限。最近，基于学习的计算机愿景已获得知名度，以增强航天器的自主权并克服这些局限性。在有希望的同时，新兴技术经常施加计算需求，超过了实时操作的典型航天器硬件的功能，并且由于稀缺的较少标记的培训数据而在各种外星环境中稀缺。在这项工作中，我们介绍了通过检测和描述进行原位地标跟踪的新颖配方。我们利用轻巧，计算高效的神经网络体系结构，旨在在当前生成航天器飞行处理器上实时执行。对于具有里程碑意义的检测，我们提出了改进的域适应方法，以鉴定具有独特，便宜的训练数据的天体地形特征。同时，对于具有里程碑意义的描述，我们引入了一种新颖的注意对准配方，该公式学习了尽管有很大的地标观点变化，但可以了解可保持对应关系的稳健特征表示。这些贡献共同构成了地标跟踪的统一系统，与现有的最新技术相比，表现出卓越的性能。

Title: Toward Developing Machine-Learning-Aided Tools for the Thermomechanical Monitoring of Nuclear Reactor Components

Authors: Luiz Aldeia Machado, Victor Coppo Leite, Elia Merzari, Arthur Motta, Roberto Ponciroli, Lander Ibarra, Lise Charlot
Subjects: cs.LG, cs.CE
Abstract URL: https://arxiv.org/abs/2507.09443
Pdf URL: https://arxiv.org/pdf/2507.09443
Copy Paste: [[2507.09443]] Toward Developing Machine-Learning-Aided Tools for the Thermomechanical Monitoring of Nuclear Reactor Components(https://arxiv.org/abs/2507.09443)
Keywords: generation
Abstract: Proactive maintenance strategies, such as Predictive Maintenance (PdM), play an important role in the operation of Nuclear Power Plants (NPPs), particularly due to their capacity to reduce offline time by preventing unexpected shutdowns caused by component failures. In this work, we explore the use of a Convolutional Neural Network (CNN) architecture combined with a computational thermomechanical model to calculate the temperature, stress, and strain of a Pressurized Water Reactor (PWR) fuel rod during operation. This estimation relies on a limited number of temperature measurements from the cladding's outer surface. This methodology can potentially aid in developing PdM tools for nuclear reactors by enabling real-time monitoring of such systems. The training, validation, and testing datasets were generated through coupled simulations involving BISON, a finite element-based nuclear fuel performance code, and the MOOSE Thermal-Hydraulics Module (MOOSE-THM). We conducted eleven simulations, varying the peak linear heat generation rates. Of these, eight were used for training, two for validation, and one for testing. The CNN was trained for over 1,000 epochs without signs of overfitting, achieving highly accurate temperature distribution predictions. These were then used in a thermomechanical model to determine the stress and strain distribution within the fuel rod.
摘要：积极的维护策略，例如预测维护（PDM），在核电厂（NPP）的运行中起着重要作用，特别是由于它们通过防止由组件故障引起的意外关闭而减少离线时间的能力。在这项工作中，我们探讨了卷积神经网络（CNN）结构与计算热力学模型相结合的使用，以计算操作过程中加压水反应堆（PWR）燃料杆的温度，应力和应变。该估计依赖于覆层外表面的有限数量的温度测量值。这种方法可以通过实时监视此类系统来帮助开发用于核反应堆的PDM工具。培训，验证和测试数据集是通过涉及野牛，有限元核燃料性能代码的耦合模拟生成的，而驼鹿热液压模块（Moose-Thm）。我们进行了11次模拟，改变了线性热产生速率。其中，有8个用于培训，两人用于验证，一项用于测试。对CNN进行了1,000多个时期的训练，没有过度拟合的迹象，从而实现了高度准确的温度分布预测。然后将它们用于热机械模型中，以确定燃油棒中的应力和应变分布。
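
The eight/two/one partition of the eleven simulations described in the abstract can be sketched as below. This is only an illustration: the peak linear heat generation rate values and the rule for assigning runs to subsets are hypothetical placeholders, since the abstract does not state them.

```python
# Hypothetical peak linear heat generation rates (kW/m) for the eleven runs;
# the abstract does not give the actual values.
peak_lhgr = [15.0 + 2.0 * i for i in range(11)]


def split_runs(runs, n_train=8, n_val=2, n_test=1):
    """Partition simulation runs into train/validation/test subsets by position."""
    if n_train + n_val + n_test != len(runs):
        raise ValueError("subset sizes must add up to the number of runs")
    train = runs[:n_train]
    val = runs[n_train:n_train + n_val]
    test = runs[n_train + n_val:]
    return train, val, test


train, val, test = split_runs(peak_lhgr)
print(len(train), len(val), len(test))
```

Splitting whole simulations rather than individual time steps keeps the test case at a heat rate the network never saw during training. Whether the authors assigned runs in this order is an assumption.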

Title: La-Proteina: Atomistic Protein Generation via Partially Latent Flow Matching

Authors: Tomas Geffner, Kieran Didi, Zhonglin Cao, Danny Reidenbach, Zuobai Zhang, Christian Dallago, Emine Kucukbenli, Karsten Kreis, Arash Vahdat
Subjects: cs.LG, q-bio.QM
Abstract URL: https://arxiv.org/abs/2507.09466
Pdf URL: https://arxiv.org/pdf/2507.09466
Copy Paste: [[2507.09466]] La-Proteina: Atomistic Protein Generation via Partially Latent Flow Matching(https://arxiv.org/abs/2507.09466)
Keywords: generation, generative
Abstract: Recently, many generative models for de novo protein structure design have emerged. Yet only a few tackle the difficult task of directly generating fully atomistic structures jointly with the underlying amino acid sequence. This is challenging, for instance, because the model must reason over side chains that change in length during generation. We introduce La-Proteina for atomistic protein design based on a novel partially latent protein representation: coarse backbone structure is modeled explicitly, while sequence and atomistic details are captured via per-residue latent variables of fixed dimensionality, thereby effectively side-stepping challenges of explicit side-chain representations. Flow matching in this partially latent space then models the joint distribution over sequences and full-atom structures. La-Proteina achieves state-of-the-art performance on multiple generation benchmarks, including all-atom co-designability, diversity, and structural validity, as confirmed through detailed structural analyses and evaluations. Notably, La-Proteina also surpasses previous models in atomistic motif scaffolding performance, unlocking critical atomistic structure-conditioned protein design tasks. Moreover, La-Proteina is able to generate co-designable proteins of up to 800 residues, a regime where most baselines collapse and fail to produce valid samples, demonstrating La-Proteina's scalability and robustness.
摘要：最近，已经出现了许多从头蛋白质结构设计的生成模型。然而，只有很少能解决与基础氨基酸序列共同产生完全原子结构的困难任务。例如，这是具有挑战性的，因为该模型必须推理生成期间长度变化的侧链。我们基于一种新型的部分潜在蛋白质表示，为原子蛋白设计介绍了LA蛋白质设计：粗骨结构是明确建模的，而序列和原子质细节是通过固定尺寸的每个固定潜在变量来捕获的，从而有效地，有效地侧向显式侧链表示挑战。在该部分潜在空间中的流量匹配，然后对序列和全原子结构的关节分布进行建模。通过详细的结构分析和评估证实，LA-Proteina在多代基准上实现了多代基准的最先进性能，包括全部原子的共同点，多样性和结构有效性。值得注意的是，LA-Proteina还超过了原子基序脚手架性能的先前模型，解锁了关键的原子结构条件条件的蛋白质设计任务。此外，LA-Proteina能够生成多达800个残基的可共设计蛋白，这是大多数碱基崩溃且无法产生有效样品的制度，表明LA-Proteina的可扩展性和稳健性。
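
For readers unfamiliar with flow matching, the training target can be sketched with the common straight-line (linear interpolation) probability path. This is a generic toy illustration, not La-Proteina's formulation: the three-element vector standing in for a per-residue latent, the function names, and the placeholder models are all assumptions, and the paper may use a different path or parameterization.

```python
import random


def interpolate(x0, x1, t):
    """Point on the straight-line path between noise x0 and data x1 at time t."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity of the straight-line path; it does not depend on t."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(predict_velocity, x0, x1, t):
    """Mean squared error between predicted and target velocity at x_t."""
    x_t = interpolate(x0, x1, t)
    v_pred = predict_velocity(x_t, t)
    v_true = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_true)) / len(v_true)


rng = random.Random(0)
x1 = [0.5, -1.0, 2.0]                    # hypothetical stand-in for a data latent
x0 = [rng.gauss(0.0, 1.0) for _ in x1]   # Gaussian noise sample
t = rng.random()                         # time drawn uniformly from [0, 1)


def zero_model(x_t, t):
    """Placeholder model that always predicts zero velocity."""
    return [0.0] * len(x_t)


def oracle_model(x_t, t):
    """Placeholder model that returns the true velocity for this (x0, x1) pair."""
    return target_velocity(x0, x1)


print(flow_matching_loss(zero_model, x0, x1, t))
print(flow_matching_loss(oracle_model, x0, x1, t))
```

In practice the placeholder models would be replaced by a neural network trained to minimize this loss over many noise and data pairs; sampling then integrates the learned velocity field from noise toward data.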

Title: Assessing reliability of explanations in unbalanced datasets: a use-case on the occurrence of frost events

Authors: Ilaria Vascotto, Valentina Blasone, Alex Rodriguez, Alessandro Bonaita, Luca Bortolussi
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.09545
Pdf URL: https://arxiv.org/pdf/2507.09545
Copy Paste: [[2507.09545]] Assessing reliability of explanations in unbalanced datasets: a use-case on the occurrence of frost events(https://arxiv.org/abs/2507.09545)
Keywords: generation
Abstract: The usage of eXplainable Artificial Intelligence (XAI) methods has become essential in practical applications, given the increasing deployment of Artificial Intelligence (AI) models and the legislative requirements put forward in recent years. A fundamental but often underestimated aspect of the explanations is their robustness, a key property that should be satisfied in order to trust the explanations. In this study, we provide some preliminary insights on evaluating the reliability of explanations in the specific case of unbalanced datasets, which are very frequent in high-risk use-cases, but at the same time considerably challenging for both AI models and XAI methods. We propose a simple evaluation focused on the minority class (i.e., the less frequent one) that leverages on-manifold generation of neighbours, explanation aggregation and a metric to test explanation consistency. We present a use-case based on a tabular dataset with numerical features focusing on the occurrence of frost events.
摘要：鉴于人工智能（AI）模型的部署越来越多，并且在最近几年提出的立法要求，因此使用可解释的人工智能（XAI）方法在实际应用中已成为必不可少的。解释的基本但经常被低估的方面是它们的稳健性，这是一个关键的特性，以信任解释。在这项研究中，我们提供了一些初步见解，以评估在不平衡数据集的特定情况下解释的可靠性，这些解释在高风险的用例中非常频繁，但同时对于AI模型和XAI方法都有巨大挑战。我们提出了一个简单的评估，该评估的重点是少数群体（即较少的频率），该评估利用了邻居的跨性别生成，解释汇总和一个指标来测试解释一致性。我们提出了一个基于表格数据集的用例，其数值功能重点是发生霜冻事件。

Title: WordCraft: Interactive Artistic Typography with Attention Awareness and Noise Blending

Authors: Zhe Wang, Jingbo Zhang, Tianyi Wei, Wanchao Su, Can Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.09573
Pdf URL: https://arxiv.org/pdf/2507.09573
Copy Paste: [[2507.09573]] WordCraft: Interactive Artistic Typography with Attention Awareness and Noise Blending(https://arxiv.org/abs/2507.09573)
Keywords: generation, generative
Abstract: Artistic typography aims to stylize input characters with visual effects that are both creative and legible. Traditional approaches rely heavily on manual design, while recent generative models, particularly diffusion-based methods, have enabled automated character stylization. However, existing solutions remain limited in interactivity, lacking support for localized edits, iterative refinement, multi-character composition, and open-ended prompt interpretation. We introduce WordCraft, an interactive artistic typography system that integrates diffusion models to address these limitations. WordCraft features a training-free regional attention mechanism for precise, multi-region generation and a noise blending that supports continuous refinement without compromising visual quality. To support flexible, intent-driven generation, we incorporate a large language model to parse and structure both concrete and abstract user prompts. These components allow our framework to synthesize high-quality, stylized typography across single- and multi-character inputs across multiple languages, supporting diverse user-centered workflows. Our system significantly enhances interactivity in artistic typography synthesis, opening up creative possibilities for artists and designers.
摘要：艺术排版旨在对具有创造性且清晰清晰的视觉效果的输入字符进行风格化。传统方法在很大程度上依赖于手动设计，而最近的生成模型，尤其是基于扩散的方法，已经实现了自动角色风格。但是，现有的解决方案在互动性上仍然有限，缺乏对本地化编辑，迭代精致，多字符组成和开放式及时解释的支持。我们介绍WordCraft，这是一种交互式艺术排版系统，该系统集成了扩散模型以解决这些局限性。 Wordcraft具有无训练的区域注意机制，可用于精确，多区域生成和噪音混合，可在不损害视觉质量的情况下进行连续完善。为了支持灵活的，意图驱动的生成，我们合并了一个大型语言模型来解析和构建具体和抽象用户提示。这些组件允许我们的框架在跨多种语言的单个和多字符输入中综合了高质量的，风格化的版式，从而支持各种以用户为中心的工作流程。我们的系统大大提高了艺术版式合成中的交互性，为艺术家和设计师打开了创造性的可能性。

Title: MENTOR: Efficient Multimodal-Conditioned Tuning for Autoregressive Vision Generation Models

Authors: Haozhe Zhao, Zefan Cai, Shuzheng Si, Liang Chen, Jiuxiang Gu, Wen Xiao, Junjie Hu
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2507.09574
Pdf URL: https://arxiv.org/pdf/2507.09574
Copy Paste: [[2507.09574]] MENTOR: Efficient Multimodal-Conditioned Tuning for Autoregressive Vision Generation Models(https://arxiv.org/abs/2507.09574)
Keywords: generation
Abstract: Recent text-to-image models produce high-quality results but still struggle with precise visual control and with balancing multimodal inputs, and they require extensive training for complex multimodal image generation. To address these limitations, we propose MENTOR, a novel autoregressive (AR) framework for efficient Multimodal-conditioned Tuning for Autoregressive multimodal image generation. MENTOR combines an AR image generator with a two-stage training paradigm, enabling fine-grained, token-level alignment between multimodal inputs and image outputs without relying on auxiliary adapters or cross-attention modules. The two-stage training consists of: (1) a multimodal alignment stage that establishes robust pixel- and semantic-level alignment, followed by (2) a multimodal instruction tuning stage that balances the integration of multimodal inputs and enhances generation controllability. Despite modest model size, suboptimal base components, and limited training resources, MENTOR achieves strong performance on the DreamBench++ benchmark, outperforming competitive baselines in concept preservation and prompt following. Additionally, our method delivers superior image reconstruction fidelity, broad task adaptability, and improved training efficiency compared to diffusion-based methods. Dataset, code, and models are available at: this https URL
摘要：最近的文本到图像模型会产生高质量的结果，但仍无法精确的视觉控制，平衡多模式输入，并需要进行大量复杂的多模式图像生成。为了解决这些局限性，我们提出了Mentor，这是一种新型自回归（AR）框架，用于有效的多模式调节，以进行自动回归多模式图像生成。 Mentor将AR图像发生器与两阶段训练范式相结合，可以在多模式输入和图像输出之间进行细粒度的，令牌级别的对齐，而无需依赖辅助适配器或交叉意见模块。两阶段训练包括：（1）一个多模式对齐阶段，该阶段建立了强大的像素和语义级别对齐，其次是（2）多模式指令调整阶段，该阶段平衡了多模式输入的整合并增强产生可控性。尽管模型大小，次优基础组件和有限的培训资源，但导师在Dreambench ++基准测试中取得了出色的性能，在概念保存方面表现优于竞争性基线，并及时关注。此外，与基于扩散的方法相比，我们的方法还提供了出色的图像重建保真度，广泛的任务适应性和提高的训练效率。数据集，代码和模型可在以下网址提供：此HTTPS URL

Title: Demystifying Flux Architecture

Authors: Or Greenberg
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.09595
Pdf URL: https://arxiv.org/pdf/2507.09595
Copy Paste: [[2507.09595]] Demystifying Flux Architecture(https://arxiv.org/abs/2507.09595)
Keywords: generation
Abstract: FLUX.1 is a diffusion-based text-to-image generation model developed by Black Forest Labs, designed to achieve faithful text-image alignment while maintaining high image quality and diversity. FLUX is considered state-of-the-art in text-to-image generation, outperforming popular models such as Midjourney, DALL-E 3, Stable Diffusion 3 (SD3), and SDXL. Although publicly available as open source, the authors have not released official technical documentation detailing the model's architecture or training setup. This report summarizes an extensive reverse-engineering effort aimed at demystifying FLUX's architecture directly from its source code, to support its adoption as a backbone for future research and development. This document is an unofficial technical report and is not published or endorsed by the original developers or their affiliated institutions.
摘要：Flux.1是由Black Forest Labs开发的基于扩散的文本对图像生成模型，旨在实现忠实的文本图像对齐，同时保持高图像质量和多样性。通量被认为是文本到图像生成的最先进，优于Midjourney，Dall-E 3，稳定扩散3（SD3）和SDXL等流行模型。尽管作为开源公开可用，但作者尚未发布官方技术文档，详细介绍了该模型的体系结构或培训设置。该报告总结了一项广泛的反向工程工作，旨在直接从其源代码中揭开Flux的架构，以支持其作为未来研究和开发的骨干的采用。该文档是一份非正式的技术报告，未经原始开发商或其附属机构发表或认可。

Title: Generate Aligned Anomaly: Region-Guided Few-Shot Anomaly Image-Mask Pair Synthesis for Industrial Inspection

Authors: Yilin Lu, Jianghang Lin, Linhuang Xie, Kai Zhao, Yansong Qu, Shengchuan Zhang, Liujuan Cao, Rongrong Ji
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.09619
Pdf URL: https://arxiv.org/pdf/2507.09619
Copy Paste: [[2507.09619]] Generate Aligned Anomaly: Region-Guided Few-Shot Anomaly Image-Mask Pair Synthesis for Industrial Inspection(https://arxiv.org/abs/2507.09619)
Keywords: generation
Abstract: Anomaly inspection plays a vital role in industrial manufacturing, but the scarcity of anomaly samples significantly limits the effectiveness of existing methods in tasks such as localization and classification. While several anomaly synthesis approaches have been introduced for data augmentation, they often struggle with low realism, inaccurate mask alignment, and poor generalization. To overcome these limitations, we propose Generate Aligned Anomaly (GAA), a region-guided, few-shot anomaly image-mask pair generation framework. GAA leverages the strong priors of a pretrained latent diffusion model to generate realistic, diverse, and semantically aligned anomalies using only a small number of samples. The framework first employs Localized Concept Decomposition to jointly model the semantic features and spatial information of anomalies, enabling flexible control over the type and location of anomalies. It then utilizes Adaptive Multi-Round Anomaly Clustering to perform fine-grained semantic clustering of anomaly concepts, thereby enhancing the consistency of anomaly representations. Subsequently, a region-guided mask generation strategy ensures precise alignment between anomalies and their corresponding masks, while a low-quality sample filtering module is introduced to further improve the overall quality of the generated samples. Extensive experiments on the MVTec AD and LOCO datasets demonstrate that GAA achieves superior performance in both anomaly synthesis quality and downstream tasks such as localization and classification.
摘要：异常检查在工业制造中起着至关重要的作用，但是异常样品的稀缺性显着限制了现有方法在本地化和分类等任务中的有效性。尽管已经引入了几种异常合成方法以进行数据增强，但它们通常在低现实主义，不准确的掩模一致性和概括不佳的情况下挣扎。为了克服这些局限性，我们建议生成对齐异常（GAA），这是一个区域引导的，几乎没有射击异常的图像掩模对生成框架。 GAA利用了预审预测的潜扩散模型的强大先验，以仅使用少量样品生成现实，多样和语义对齐异常。该框架首先采用局部概念分解来共同对异常的语义特征和空间信息进行建模，从而可以灵活地控制异常的类型和位置。然后，它利用自适应多轮异常聚类来执行对异常概念的细粒语义聚类，从而增强了异常表示的一致性。随后，一个区域引导的掩码生成策略确保异常之间的精确比对，而引入了低质量的样品滤波模块，以进一步提高生成样品的整体质量。在MVTEC AD和LOCO数据集上进行的广泛实验表明，GAA在综合质量和下游任务（例如定位和分类）中都可以达到卓越的性能。

Title: Brain Stroke Detection and Classification Using CT Imaging with Transformer Models and Explainable AI

Authors: Shomukh Qari, Maha A. Thafar
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2507.09630
Pdf URL: https://arxiv.org/pdf/2507.09630
Copy Paste: [[2507.09630]] Brain Stroke Detection and Classification Using CT Imaging with Transformer Models and Explainable AI(https://arxiv.org/abs/2507.09630)
Keywords: generation
Abstract: Stroke is one of the leading causes of death globally, making early and accurate diagnosis essential for improving patient outcomes, particularly in emergency settings where timely intervention is critical. CT scans are the key imaging modality because of their speed, accessibility, and cost-effectiveness. This study proposed an artificial intelligence framework for multiclass stroke classification (ischemic, hemorrhagic, and no stroke) using CT scan images from a dataset provided by the Republic of Turkey's Ministry of Health. The proposed method adopted MaxViT, a state-of-the-art Vision Transformer, as the primary deep learning model for image-based stroke classification, with additional transformer variants (vision transformer, transformer-in-transformer, and ConvNext). To enhance model generalization and address class imbalance, we applied data augmentation techniques, including synthetic image generation. The MaxViT model trained with augmentation achieved the best performance, reaching an accuracy and F1-score of 98.00%, outperforming all other evaluated models and the baseline methods. The primary goal of this study was to distinguish between stroke types with high accuracy while addressing crucial issues of transparency and trust in artificial intelligence models. To achieve this, Explainable Artificial Intelligence (XAI) was integrated into the framework, particularly Grad-CAM++. It provides visual explanations of the model's decisions by highlighting relevant stroke regions in the CT scans and establishing an accurate, interpretable, and clinically applicable solution for early stroke detection. This research contributed to the development of a trustworthy AI-assisted diagnostic tool for stroke, facilitating its integration into clinical practice and enhancing access to timely and optimal stroke diagnosis in emergency departments, thereby saving more lives.
摘要：中风是全球死亡的主要原因之一，使得早期且准确的诊断对于改善患者预后至关重要，尤其是在及时干预至关重要的紧急情况下。 CT扫描是关键的成像方式，因为它们的速度，可访问性和成本效益。这项研究提出了一个用于多类中风分类的人工智能框架（缺血，出血和无中风），使用土耳其卫生部提供的数据集中的CT扫描图像。所提出的方法采用了最先进的视觉变压器Maxvit作为基于图像的中风分类的主要深度学习模型，并具有其他变压器变体（Vision Transformer，Transformer，Transformer-In-trans-In-In-transformer和Convnext）。为了增强模型的概括和解决类别的不平衡，我们应用了数据增强技术，包括合成图像生成。经过扩展训练的Maxvit模型达到了最佳性能，达到了98.00％的准确性和F1得分，表现优于所有其他评估的模型和基线方法。这项研究的主要目标是区分具有高精度的中风类型，同时解决人工智能模型中透明度和信任的关键问题。为此，将可解释的人工智能（XAI）集成到了框架中，尤其是Grad-CAM ++。它通过突出CT扫描中的相关中风区域并为早期中风检测建立准确，可解释和临床适用的解决方案，从而提供了模型决策的视觉解释。这项研究有助于开发可信赖的AI辅助诊断工具，以促进其融合到临床实践中，并增强了在急诊部门及时，最佳的中风诊断的机会，从而挽救了更多的生命。

Title: Prompt2DEM: High-Resolution DEMs for Urban and Open Environments from Global Prompts Using a Monocular Foundation Model

Authors: Osher Rafaeli, Tal Svoray, Ariel Nahlieli
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2507.09681
Pdf URL: https://arxiv.org/pdf/2507.09681
Copy Paste: [[2507.09681]] Prompt2DEM: High-Resolution DEMs for Urban and Open Environments from Global Prompts Using a Monocular Foundation Model(https://arxiv.org/abs/2507.09681)
Keywords: super-resolution
Abstract: High-resolution elevation estimations are essential to understand catchment and hillslope hydrology, study urban morphology and dynamics, and monitor the growth, decline, and mortality of terrestrial ecosystems. Various deep learning approaches (e.g., super-resolution techniques, monocular depth estimation) have been developed to create high-resolution Digital Elevation Models (DEMs). However, super-resolution techniques are limited by the upscaling factor, and monocular depth estimation lacks global elevation context, which restricts its conversion to a seamless DEM. The recently introduced technique of prompt-based monocular depth estimation has opened new opportunities to extract estimates of absolute elevation in a global context. We present here a framework for the estimation of high-resolution DEMs as a new paradigm for absolute global elevation mapping. It is exemplified using low-resolution Shuttle Radar Topography Mission (SRTM) elevation data as prompts and high-resolution RGB imagery from the National Agriculture Imagery Program (NAIP). The approach fine-tunes a vision transformer encoder with LiDAR-derived DEMs and employs a versatile prompting strategy, enabling tasks such as DEM estimation, void filling, and updating. Our framework achieves a 100x resolution gain (from 30-m to 30-cm), surpassing prior methods by an order of magnitude. Evaluations across three diverse U.S. landscapes show robust generalization, capturing urban structures and fine-scale terrain features with < 5 m MAE relative to LiDAR, improving over SRTM by up to 18%. Hydrological analysis confirms suitability for hazard and environmental studies. We demonstrate scalability by applying the framework to large regions in the U.S. and Israel. All code and pretrained models are publicly available at: this https URL.
摘要：高分辨率高程估计对于了解集水区和山坡水文，研究城市形态和动态至关重要，并监测陆地生态系统的生长，下降和死亡率。已经开发了各种深度学习方法（例如，超分辨率技术，单眼深度估计）来创建高分辨率数字高程模型（DEMS）。但是，超分辨率技术受到升级因素的限制，而单眼深度估计缺乏全球高度环境，从而使其转换为无缝的DEM限制。最近引入的基于迅速的单眼深度估计的技术为在全球背景下提取绝对升高的估计提供了新的机会。我们在这里提出了一个将高分辨率DEM估算为绝对全球高程映射的新范式的框架。使用低分辨率的穿梭雷达地形任务（SRTM）高程数据作为提示和高分辨率RGB图像（NAIP）的高分辨率RGB图像来说明它。该方法通过LIDAR衍生的DEM进行了视觉变压器编码器，并采用了多功能提示策略，从而实现了DEM估计，无效填充和更新等任务。我们的框架可实现100倍的分辨率增益（从30米到30厘米），超过了先前的方法。在美国三种不同的景观中进行的评估显示出强大的概括，捕获了相对于激光雷达（Lidar）<5 m mae的城市结构和细小的地形特征，而SRTM的提高高达18％。水文分析证实了对危害和环境研究的适用性。我们通过将框架应用于美国和以色列的大型地区来证明可扩展性。所有代码和预估计的模型均可公开使用：此HTTPS URL。
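
The 100x figure in the abstract is a linear (per-axis) gain, which is easy to check from the two ground sampling distances it quotes. The short sketch below works through the arithmetic; the variable names are illustrative and not taken from the paper's code.

```python
srtm_resolution_m = 30.0     # prompt DEM ground sampling distance quoted in the abstract
output_resolution_m = 0.30   # output DEM ground sampling distance quoted in the abstract

# Gain along one axis of the grid.
linear_gain = srtm_resolution_m / output_resolution_m

# Number of output cells that cover the footprint of a single SRTM cell.
cells_per_prompt_pixel = round(linear_gain) ** 2

print(f"linear gain: {linear_gain:.0f}x")
print(f"output cells per SRTM cell: {cells_per_prompt_pixel}")
```

Each 30 m prompt cell therefore has to be resolved into ten thousand 30 cm cells, which is why the high-resolution RGB imagery, rather than the prompt, must supply the fine-scale detail.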

Title: Post-Training Quantization of Generative and Discriminative LSTM Text Classifiers: A Study of Calibration, Class Balance, and Robustness

Authors: Md Mushfiqur Rahaman, Elliot Chang, Tasmiah Haque, Srinjoy Das
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2507.09687
Pdf URL: https://arxiv.org/pdf/2507.09687
Copy Paste: [[2507.09687]] Post-Training Quantization of Generative and Discriminative LSTM Text Classifiers: A Study of Calibration, Class Balance, and Robustness(https://arxiv.org/abs/2507.09687)
Keywords: generative
Abstract: Text classification plays a pivotal role in edge computing applications like industrial monitoring, health diagnostics, and smart assistants, where low latency and high accuracy are both key requirements. Generative classifiers, in particular, have been shown to exhibit robustness to out-of-distribution and noisy data, which is an extremely critical consideration for deployment in such real-time edge environments. However, deploying such models on edge devices faces computational and memory constraints. Post Training Quantization (PTQ) reduces model size and compute costs without retraining, making it ideal for edge deployment. In this work, we present a comprehensive comparative study of generative and discriminative Long Short Term Memory (LSTM)-based text classification models with PTQ using the Brevitas quantization library. We evaluate both types of classifier models across multiple bitwidths and assess their robustness under regular and noisy input conditions. We find that while discriminative classifiers remain robust, generative ones are more sensitive to bitwidth, calibration data used during PTQ, and input noise during quantized inference. We study the influence of class imbalance in calibration data for both types of classifiers, comparing scenarios with evenly and unevenly distributed class samples including their effect on weight adjustments and activation profiles during PTQ. Using test statistics derived from nonparametric hypothesis testing, we identify that using class imbalanced data during calibration introduces insufficient weight adaptation at lower bitwidths for generative LSTM classifiers, thereby leading to degraded performance. This study underscores the role of calibration data in PTQ and when generative classifiers succeed or fail under noise, aiding deployment in edge environments.
摘要：文本分类在边缘计算应用程序中起关键作用，例如工业监测，健康诊断和智能助手，在这些应用程序中，低潜伏期和高精度都是关键要求。尤其是生成分类器已被证明对分布和嘈杂数据表现出健壮性，这是在这种实时边缘环境中部署的极为关键的考虑因素。但是，在边缘设备上部署此类模型会面对计算和内存约束。培训后量化（PTQ）降低了模型的规模和计算成本而无需再培训，因此它非常适合边缘部署。在这项工作中，我们介绍了使用BREVITAS量化库，对具有PTQ的基于PTQ的基于PTQ的生成和歧视性长期记忆（LSTM）的全面比较研究。我们评估了多个位宽的两种类型的分类器模型，并在常规和嘈杂的输入条件下评估它们的鲁棒性。我们发现，尽管判别分类器保持稳健，但生成性分类器对位宽，PTQ期间使用的校准数据和量化推断期间的输入噪声更为敏感。我们研究了两种类型的分类器中类不平衡数据中类不平衡的影响，将场景与均匀分布分布的类样本进行比较，包括它们对PTQ期间重量调整和激活概况的影响。使用从非参数假设测试得出的测试统计数据，我们确定在校准过程中使用类不平衡数据引入生成LSTM分类器的较低位的重量适应性不足，从而导致性能退化。这项研究强调了校准数据在PTQ中的作用，并且当生成分类器在噪声下成功或失败时，在边缘环境中有助于部署。
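
The sensitivity to bitwidth discussed in the abstract can be illustrated with a minimal, library-free sketch of symmetric uniform quantization. This does not use or reproduce the Brevitas API and is not the authors' pipeline: the random Gaussian weights stand in for a trained LSTM weight tensor, and the max-abs scale is one simple calibration choice among many.

```python
import random


def quantize_dequantize(weights, bitwidth):
    """Symmetric uniform quantization to signed integers, then back to floats."""
    qmax = 2 ** (bitwidth - 1) - 1            # largest representable integer level
    scale = max(abs(w) for w in weights) / qmax
    out = []
    for w in weights:
        q = round(w / scale)                  # nearest integer level
        q = max(-qmax, min(qmax, q))          # clip to the signed range
        out.append(q * scale)
    return out


def mse(a, b):
    """Mean squared error between two equally long lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


rng = random.Random(0)
weights = [rng.gauss(0.0, 1.0) for _ in range(1000)]   # hypothetical weight tensor

errors = {b: mse(weights, quantize_dequantize(weights, b)) for b in (8, 4, 2)}
for b, e in errors.items():
    print(f"{b}-bit reconstruction MSE: {e:.6f}")
```

Reconstruction error grows as the bitwidth shrinks because fewer levels must cover the same range. In real post-training quantization the scale is estimated from calibration data, which is where the class balance of that data can matter.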

Title: ExpStar: Towards Automatic Commentary Generation for Multi-discipline Scientific Experiments

Authors: Jiali Chen, Yujie Jia, Zihan Wu, Jinyu Yang, Jianpeng Chen, Xusen Hei, Jiayuan Xie, Yi Cai, Qing Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.09693
Pdf URL: https://arxiv.org/pdf/2507.09693
Copy Paste: [[2507.09693]] ExpStar: Towards Automatic Commentary Generation for Multi-discipline Scientific Experiments(https://arxiv.org/abs/2507.09693)
Keywords: generation
Abstract: Experiment commentary is crucial in describing the experimental procedures, delving into underlying scientific principles, and incorporating content-related safety guidelines. In practice, human teachers rely heavily on subject-specific expertise and invest significant time preparing such commentary. To address this challenge, we introduce the task of automatic commentary generation across multi-discipline scientific experiments. While recent progress in large multimodal models (LMMs) has demonstrated promising capabilities in video understanding and reasoning, their ability to generate fine-grained and insightful experiment commentary remains largely underexplored. In this paper, we make the following contributions: (i) We construct ExpInstruct, the first dataset tailored for experiment commentary generation, featuring over 7K step-level commentaries across 21 scientific subjects from 3 core disciplines (i.e., science, healthcare and engineering). Each sample includes procedural descriptions along with potential scientific principles (e.g., chemical equations and physical laws) and safety guidelines. (ii) We propose ExpStar, an automatic experiment commentary generation model that leverages a retrieval-augmented mechanism to adaptively access, evaluate, and utilize external knowledge. (iii) Extensive experiments show that our ExpStar substantially outperforms 14 leading LMMs, which highlights the superiority of our dataset and model. We believe that ExpStar holds great potential for advancing AI-assisted scientific experiment instruction.
摘要：实验评论对于描述实验程序，深入研究基本科学原则以及纳入与内容相关的安全指南至关重要。实际上，人类教师在很大程度上依赖于特定于学科的专业知识，并花费大量时间准备这样的评论。为了应对这一挑战，我们介绍了在多学科科学实验中自动评论生成的任务。尽管大型多模型模型（LMM）的最新进展表现出了视频理解和推理的有希望的能力，但它们产生细粒度和有见地的实验评论的能力仍然很大程度上尚未得到充实。在本文中，我们做出以下贡献：（i）我们构建了\ textit {expinstruct}，这是针对实验评论生成的第一个数据集，其中包含超过7 \ textit {k}跨3个核心学科科学主题的步骤级评论（\ ie，科学，医疗保健和工程学）。每个样本包括程序描述以及潜在的科学原理（\ EG，化学方程式和物理定律）和安全指南。（ii）我们提出了Expstar，这是一种自动实验评论生成模型，该模型利用检索调查机制适应性地访问，评估和利用外部知识。（iii）广泛的实验表明，我们的Expstar基本上优于14个领先的LMM，这突出了我们的数据集和模型的优势。我们认为，Expstar具有推进AI辅助科学实验教学的巨大潜力。

Title: Continental scale habitat modelling with artificial intelligence and multimodal earth observation

Authors: Sara Si-Moussi, Stephan Hennekens, Sander Mucher, Stan Los, Wilfried Thuiller
Subjects: cs.LG, q-bio.PE, stat.AP
Abstract URL: https://arxiv.org/abs/2507.09732
Pdf URL: https://arxiv.org/pdf/2507.09732
Copy Paste: [[2507.09732]] Continental scale habitat modelling with artificial intelligence and multimodal earth observation(https://arxiv.org/abs/2507.09732)
Keywords: restoration, generation, quality assessment
Abstract: Habitats integrate the abiotic conditions and biophysical structures that support biodiversity and sustain nature's contributions to people. As these ecosystems face mounting pressure from human activities, accurate, high-resolution habitat maps are essential for effective conservation and restoration. Yet current maps often fall short in thematic or spatial resolution because they must (1) model several mutually exclusive habitat types that co-occur across landscapes and (2) cope with severe class imbalance that complicates multi-class training. Here, we evaluated how high-resolution remote sensing (RS) data and Artificial Intelligence (AI) tools can improve habitat classification over large geographic extents at fine thematic resolution. Using vegetation plots from the European Vegetation Archive, we modelled Level 3 EUNIS habitats across Europe and assessed multiple modelling strategies against independent validation datasets. Strategies that exploited the hierarchical nature of habitat nomenclatures resolved classification ambiguities, especially in fragmented landscapes. Integrating multi-spectral (MSI) and synthetic aperture radar (SAR) imagery, particularly through Earth Observation Foundation models, enhanced within-formation discrimination and overall performance. Finally, ensemble machine learning that corrects class imbalance boosted accuracy further. Our methodological framework is transferable beyond Europe and adaptable to other classification systems. Future research should advance temporal modelling of dynamic habitats, extend to habitat segmentation and quality assessment, and exploit next-generation EO data paired with higher-quality in-situ observations.
摘要：栖息地整合了支持生物多样性并维持大自然对人的贡献的非生物条件和生物物理结构。由于这些生态系统面临着人类活动的不断增长的压力，因此准确的高分辨率栖息地图对于有效的保护和恢复至关重要。然而，当前的地图通常在主题或空间分辨率上落下，因为它们必须（1）建模几种相互排斥的栖息地类型，这些栖息地类型在跨景观中共发生，并且（2）应对严重的类失衡，使多级训练变得复杂。在这里，我们评估了高分辨率遥感（RS）数据和人工智能（AI）工具如何以精细的主题分辨率改善大型地理范围的栖息地分类。使用欧洲植被档案馆的植被图，我们对整个欧洲的3级Eunis栖息地进行了建模，并评估了针对独立验证数据集的多种建模策略。利用栖息地命名的层次结构性质的策略解决了分类的歧义，尤其是在碎片的景观中。整合多光谱（MSI）和合成孔径雷达（SAR）图像，尤其是通过地球观测基础模型，增强了形成歧视和整体性能。最后，纠正班级失衡的集合机器学习进一步提高了准确性。我们的方法论框架可以在欧洲以外转移，并且可以适应其他分类系统。未来的研究应推进动态栖息地的时间建模，扩展到栖息地细分和质量评估，并利用下一代EO数据与高质量的原位观测配对。

Title: Universal Physics Simulation: A Foundational Diffusion Approach

Authors: Bradley Camburn
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2507.09733
Pdf URL: https://arxiv.org/pdf/2507.09733
Copy Paste: [[2507.09733]] Universal Physics Simulation: A Foundational Diffusion Approach(https://arxiv.org/abs/2507.09733)
Keywords: generation
Abstract: We present the first foundational AI model for universal physics simulation that learns physical laws directly from boundary-condition data without requiring a priori equation encoding. Traditional physics-informed neural networks (PINNs) and finite-difference methods necessitate explicit mathematical formulation of governing equations, fundamentally limiting their generalizability and discovery potential. Our sketch-guided diffusion transformer approach reimagines computational physics by treating simulation as a conditional generation problem, where spatial boundary conditions guide the synthesis of physically accurate steady-state solutions. By leveraging enhanced diffusion transformer architectures with novel spatial relationship encoding, our model achieves direct boundary-to-equilibrium mapping and is generalizable to diverse physics domains. Unlike sequential time-stepping methods that accumulate errors over iterations, our approach bypasses temporal integration entirely, directly generating steady-state solutions with SSIM > 0.8 while maintaining sub-pixel boundary accuracy. Our data-informed approach enables physics discovery through learned representations analyzable via Layer-wise Relevance Propagation (LRP), revealing emergent physical relationships without predetermined mathematical constraints. This work represents a paradigm shift from AI-accelerated physics to AI-discovered physics, establishing the first truly universal physics simulation framework.
摘要：我们介绍了通用物理模拟的第一个基础AI模型，该模型直接从边界条件数据中学习物理定律而无需先验方程式编码。传统物理信息的神经网络（PINN）和有限差异方法需要明确的数学表述，从根本上限制了它们的普遍性和发现潜力。我们的草图引导的扩散变压器方法通过将模拟视为条件生成问题来重新构想计算物理，其中空间边界条件指导物理准确的稳态溶液的综合。通过利用新型的空间关系编码来利用增强的扩散变压器体系结构，我们的模型可以实现直接的边界到平衡映射，并且可以推广到各种物理领域。与在迭代上累积错误的顺序时间稳定方法不同，我们的方法完全绕过时间积分，直接以SSIM> 0.8生成稳态解决方案，同时保持子像素边界的精度。我们的数据信息方法可以通过可通过层次相关性传播（LRP）分析的学习表示来实现物理发现，从而揭示了新兴的物理关系而没有预定的数学约束。这项工作代表了从AI-Accelerated物理学到AI发现的物理学的范式转变，建立了第一个真正的通用物理模拟框架。
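
The abstract reports quality as SSIM > 0.8. As a reminder of what that metric measures, here is a simplified single-window SSIM in plain Python. It is a sketch under stated assumptions: standard SSIM is usually computed over local sliding windows and averaged, whereas this version treats the whole field as one window, and the sample values are hypothetical.

```python
def global_ssim(x, y, data_range=1.0):
    """Single-window SSIM over two equally sized flat lists of pixel values."""
    if len(x) != len(y) or not x:
        raise ValueError("inputs must be non-empty and the same length")
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    # Stabilizing constants from the usual SSIM definition (K1 = 0.01, K2 = 0.03).
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    )


reference = [0.0, 0.25, 0.5, 0.75, 1.0, 0.75, 0.5, 0.25]   # hypothetical steady-state field
prediction = [v + 0.02 for v in reference]                  # same shape with a small offset
score = global_ssim(reference, prediction)
print(score > 0.8)
```

Because SSIM compares means, variances, and covariance rather than per-pixel differences, a prediction with the right structure and a small uniform offset still scores close to 1.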

Title: Advancing Text-to-3D Generation with Linearized Lookahead Variational Score Distillation

Authors: Yu Lei, Bingde Liu, Qingsong Xie, Haonan Lu, Zhijie Deng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.09748
Pdf URL: https://arxiv.org/pdf/2507.09748
Copy Paste: [[2507.09748]] Advancing Text-to-3D Generation with Linearized Lookahead Variational Score Distillation(https://arxiv.org/abs/2507.09748)
Keywords: generation
Abstract: Text-to-3D generation based on score distillation of pre-trained 2D diffusion models has gained increasing interest, with variational score distillation (VSD) as a remarkable example. VSD proves that vanilla score distillation can be improved by introducing an extra score-based model, which characterizes the distribution of images rendered from 3D models, to correct the distillation gradient. Despite the theoretical foundations, VSD, in practice, is likely to suffer from slow and sometimes ill-posed convergence. In this paper, we perform an in-depth investigation of the interplay between the introduced score model and the 3D model, and find that there exists a mismatching problem between LoRA and 3D distributions in practical implementation. We can simply adjust their optimization order to improve the generation quality. By doing so, the score model looks ahead to the current 3D state and hence yields more reasonable corrections. Nevertheless, naive lookahead VSD may suffer from unstable training in practice due to the potential over-fitting. To address this, we propose to use a linearized variant of the model for score distillation, giving rise to the Linearized Lookahead Variational Score Distillation ($L^2$-VSD). $L^2$-VSD can be realized efficiently with forward-mode autodiff functionalities of existing deep learning libraries. Extensive experiments validate the efficacy of $L^2$-VSD, revealing its clear superiority over prior score distillation-based methods. We also show that our method can be seamlessly incorporated into any other VSD-based text-to-3D framework.
摘要：基于预先训练的2D扩散模型的得分蒸馏的文本到3D生成已经增加了兴趣，而变化得分蒸馏（VSD）是一个了不起的例子。 VSD证明，可以通过引入额外的基于得分的模型来改善香草评分蒸馏，该模型表征了从3D模型呈现的图像的分布，以纠正蒸馏梯度。尽管有理论基础，但实际上，VSD可能会遭受缓慢的，有时甚至不足的融合。在本文中，我们对引入的分数模型与3D模型之间的相互作用进行了深入的研究，并发现在实施中，洛拉和3D分布之间存在不匹配的问题。我们可以简单地调整其优化顺序以提高发电质量。通过这样做，分数模型可以提示当前的3D状态，因此可以产生更合理的校正。然而，由于潜在的过度拟合，Naive Lookahead VSD在实践中可能会受到不稳定的培训。为了解决这个问题，我们建议使用模型的线性变体进行得分蒸馏，从而产生线性化的lookahead变化得分蒸馏（$ l^2 $ -VSD）。 $ l^2 $ -VSD可以通过现有深度学习库的前向模式自动功能有效地实现。广泛的实验验证了$ l^2 $ -VSD的功效，揭示了其与先前得分蒸馏的方法相比的明显优势。我们还表明，我们的方法可以无缝地纳入任何其他基于VSD的文本到3D框架中。
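The abstract above states that the linearized model can be realized with forward-mode autodiff. The sketch below shows the general idea, a first-order Taylor expansion f(w0) + J (w - w0) obtained from a single forward pass, using dual numbers in plain Python. The `Dual` class, `toy_model`, and `linearized` are hypothetical names for illustration only and are not the authors' implementation. A deep learning library would supply the Jacobian-vector product instead.

```python
import math
from dataclasses import dataclass


@dataclass
class Dual:
    """A value together with its directional derivative (tangent)."""
    val: float
    tan: float = 0.0

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.tan + other.tan)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val * other.val,
                    self.val * other.tan + self.tan * other.val)

    __rmul__ = __mul__


def dual_tanh(x):
    t = math.tanh(x.val)
    return Dual(t, (1.0 - t * t) * x.tan)


def toy_model(w, x):
    """Hypothetical stand-in for a network output: tanh(w * x) + w."""
    return dual_tanh(w * x) + w


def jvp(f, w0, dw, x):
    """One forward pass yields f(w0, x) and its derivative along dw."""
    out = f(Dual(w0, dw), x)
    return out.val, out.tan


def linearized(f, w0, w, x):
    """First-order expansion of f around w0, evaluated at w."""
    value, tangent = jvp(f, w0, w - w0, x)
    return value + tangent


exact = math.tanh(0.501 * 2.0) + 0.501
approx = linearized(toy_model, 0.5, 0.501, 2.0)
```

For a small parameter step the linearized output tracks the exact one closely, and it costs one forward pass with no backward pass.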

Title: Do we need equivariant models for molecule generation?

Authors: Ewa M. Nowara, Joshua Rackers, Patricia Suriana, Pan Kessel, Max Shen, Andrew Martin Watkins, Michael Maser
Subjects: cs.LG, q-bio.QM
Abstract URL: https://arxiv.org/abs/2507.09753
Pdf URL: https://arxiv.org/pdf/2507.09753
Copy Paste: [[2507.09753]] Do we need equivariant models for molecule generation?(https://arxiv.org/abs/2507.09753)
Keywords: generation, generative
Abstract: Deep generative models are increasingly used for molecular discovery, with most recent approaches relying on equivariant graph neural networks (GNNs) under the assumption that explicit equivariance is essential for generating high-quality 3D molecules. However, these models are complex, difficult to train, and scale poorly. We investigate whether non-equivariant convolutional neural networks (CNNs) trained with rotation augmentations can learn equivariance and match the performance of equivariant models. We derive a loss decomposition that separates prediction error from equivariance error, and evaluate how model size, dataset size, and training duration affect performance across denoising, molecule generation, and property prediction. To our knowledge, this is the first study to analyze learned equivariance in generative tasks.
摘要：深层生成模型越来越多地用于分子发现，最近的方法依赖于均衡图神经网络（GNN），这是假设显式均等对产生高质量3D分子至关重要的假设。但是，这些模型很复杂，难以训练，并且扩展不佳。我们研究了接受旋转增强训练的非等级卷积神经网络（CNN）是否可以学习均衡性并符合均衡模型的性能。我们得出了损失分解，将预测误差与均衡误差分开，并评估模型大小，数据集大小和训练持续时间如何影响跨denoising，分子产生和属性预测的性能。据我们所知，这是第一个分析生成任务中学到的均等性的研究。
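The abstract above separates prediction error from equivariance error. One common way to quantify the latter is the gap between f(Rx) and R f(x) over sampled rotations R. The sketch below does this for 2D points in plain Python. It is an illustrative measure under that assumption, not the paper's loss decomposition, and `scale_map` and `shear_map` are hypothetical toy models.

```python
import math


def rotate(p, theta):
    """Rotate a 2D point p = (x, y) by theta radians about the origin."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * p[0] - s * p[1], s * p[0] + c * p[1])


def equivariance_error(f, points, thetas):
    """Mean squared gap between f(R x) and R f(x)."""
    total, count = 0.0, 0
    for p in points:
        for th in thetas:
            a = f(rotate(p, th))
            b = rotate(f(p), th)
            total += (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
            count += 1
    return total / count


def scale_map(p):
    """Equivariant toy model: uniform scaling commutes with rotation."""
    return (2.0 * p[0], 2.0 * p[1])


def shear_map(p):
    """Non-equivariant toy model: a shear does not commute with rotation."""
    return (p[0] + p[1], p[1])


points = [(1.0, 0.0), (0.5, -2.0)]
thetas = [0.3, 1.2]
err_scale = equivariance_error(scale_map, points, thetas)
err_shear = equivariance_error(shear_map, points, thetas)
```

A model trained with rotation augmentation would be expected to drive this quantity toward zero without having equivariance built into its architecture.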

Title: Efficient Molecular Conformer Generation with SO(3)-Averaged Flow Matching and Reflow

Authors: Zhonglin Cao, Mario Geiger, Allan dos Santos Costa, Danny Reidenbach, Karsten Kreis, Tomas Geffner, Franco Pellegrini, Guoqing Zhou, Emine Kucukbenli
Subjects: cs.LG, physics.chem-ph
Abstract URL: https://arxiv.org/abs/2507.09785
Pdf URL: https://arxiv.org/pdf/2507.09785
Copy Paste: [[2507.09785]] Efficient Molecular Conformer Generation with SO(3)-Averaged Flow Matching and Reflow(https://arxiv.org/abs/2507.09785)
Keywords: generation, generative
Abstract: Fast and accurate generation of molecular conformers is desired for downstream computational chemistry and drug discovery tasks. Currently, training and sampling state-of-the-art diffusion or flow-based models for conformer generation require significant computational resources. In this work, we build upon flow-matching and propose two mechanisms for accelerating training and inference of generative models for 3D molecular conformer generation. For fast training, we introduce the SO(3)-Averaged Flow training objective, which leads to faster convergence to better generation quality compared to conditional optimal transport flow or Kabsch-aligned flow. We demonstrate that models trained using SO(3)-Averaged Flow can reach state-of-the-art conformer generation quality. For fast inference, we show that the reflow and distillation methods of flow-based models enable few-steps or even one-step molecular conformer generation with high quality. The training techniques proposed in this work show a path towards highly efficient molecular conformer generation with flow-based models.
摘要：对于下游计算化学和药物发现任务，需要快速准确的分子构象异构体。当前，培训和采样最新的扩散或基于流程的构造模型需要大量的计算资源。在这项工作中，我们以流量匹配为基础，并提出了两种机制，用于加速训练和推断3D分子构象异构体生成的生成模型。对于快速训练，我们引入了SO（3）平均流量训练目标，这与有条件的最佳运输流量或Kabsch对齐的流量相比，会导致更快地融合到更好的生成质量。我们证明了使用SO（3）平均流量训练的模型可以达到最新的构象生成质量。对于快速推断，我们表明，基于流的模型的回流和蒸馏方法可以具有高质量的几步甚至一步分子构象异构体的产生。这项工作中提出的训练技术显示了通过基于流的模型产生高效的分子构象异构体的途径。
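For readers unfamiliar with flow matching, the sketch below shows the plain straight-line variant: interpolate between a noise sample x0 and a data sample x1, regress the velocity x1 - x0, and integrate with Euler steps at sampling time. This is an assumed baseline formulation for illustration. It does not implement the SO(3)-Averaged Flow objective or Kabsch alignment from the abstract, and all function names are hypothetical.

```python
def interpolate(x0, x1, t):
    """Point on the straight path from x0 (t = 0) to x1 (t = 1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity of the straight path, constant in t."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(velocity_fn, x0, x1, t):
    """Mean squared error between predicted and target velocity at time t."""
    xt = interpolate(x0, x1, t)
    pred = velocity_fn(xt, t)
    target = target_velocity(x0, x1)
    return sum((p - u) ** 2 for p, u in zip(pred, target)) / len(target)


def euler_sample(velocity_fn, x0, steps):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed Euler steps."""
    x, dt = list(x0), 1.0 / steps
    for k in range(steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


noise = [0.0, 0.0, 0.0]
data = [1.0, 2.0, -1.0]


def perfect(x, t):
    """Toy velocity field that already matches the target exactly."""
    return target_velocity(noise, data)


loss = flow_matching_loss(perfect, noise, data, 0.3)
one_step = euler_sample(perfect, noise, 1)
four_step = euler_sample(perfect, noise, 4)
```

When the learned paths are straight, a single Euler step already lands on the data sample. This is the intuition behind the few-step and one-step generation that the abstract attributes to reflow and distillation.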

Title: Generative Cognitive Diagnosis

Authors: Jiatong Li, Qi Liu, Mengxiao Zhu
Subjects: cs.LG, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2507.09831
Pdf URL: https://arxiv.org/pdf/2507.09831
Copy Paste: [[2507.09831]] Generative Cognitive Diagnosis(https://arxiv.org/abs/2507.09831)
Keywords: generation, generative
Abstract: Cognitive diagnosis (CD) models latent cognitive states of human learners by analyzing their response patterns on diagnostic tests, serving as a crucial machine learning technique for educational assessment and evaluation. Traditional cognitive diagnosis models typically follow a transductive prediction paradigm that optimizes parameters to fit response scores and extract learner abilities. These approaches face significant limitations: they cannot perform instant diagnosis for new learners without computationally expensive retraining, and they produce diagnostic outputs with limited reliability. In this study, we introduce a novel generative diagnosis paradigm that fundamentally shifts CD from predictive to generative modeling, enabling inductive inference of cognitive states without parameter re-optimization. We propose two simple yet effective instantiations of this paradigm: Generative Item Response Theory (G-IRT) and Generative Neural Cognitive Diagnosis Model (G-NCDM), which achieve excellent performance improvements over traditional methods. The generative approach disentangles cognitive state inference from response prediction through a well-designed generation process that incorporates identifiability and monotonicity conditions. Extensive experiments on real-world datasets demonstrate the effectiveness of our methodology in addressing scalability and reliability challenges, notably a $100\times$ speedup for the diagnosis of new learners. Our framework opens new avenues for cognitive diagnosis applications in artificial intelligence, particularly for intelligent model evaluation and intelligent education systems. The code is available at this https URL.
摘要：认知诊断（CD）通过分析诊断测试的反应模式来模型的潜在认知状态，作为一种至关重要的机器学习技术，用于教育评估和评估。传统的认知诊断模型通常遵循转导的预测范式，该范式优化参数以符合响应得分并提取学习者的能力。这些方法面临重大局限性，因为如果没有计算昂贵的再培训并产生可靠性有限的诊断输出，它们就无法对新学习者进行即时诊断。在这项研究中，我们引入了一种新型的生成诊断范式，该范例从根本上将CD从预测性建模转移到生成性建模，从而使认知状态的诱导推断没有参数重新挑选。我们提出了该范式的两个简单而有效的实例：生成项目响应理论（G-tirt）和生成神经认知诊断模型（G-NCDM），这些模型（G-NCDM）对传统方法实现了出色的性能改善。生成方法将认知状态的推断从响应预测中解散，通过精心设计的生成过程，该过程结合了可识别性和单调性条件。关于现实世界数据集的广泛实验证明了我们方法学在解决可扩展性和可靠性挑战方面的有效性，尤其是$ \ times 100 $速度，用于诊断新学习者。我们的框架为人工智能中的认知诊断应用开辟了新的途径，特别是对于智能模型评估和智能教育系统。该代码可在此HTTPS URL上找到。
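As background for the abstract above, the sketch below shows the classical two-parameter logistic Item Response Theory model and a grid-search ability estimate for a new learner with item parameters held fixed. This is the traditional formulation, given here only as context. It is not G-IRT or G-NCDM, and the item parameters and grid are made-up illustration values.

```python
import math


def irt_2pl(theta, a, b):
    """Probability of a correct response for ability theta,
    item discrimination a and item difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def log_likelihood(theta, responses, items):
    """Log-likelihood of binary responses given fixed item parameters."""
    total = 0.0
    for r, (a, b) in zip(responses, items):
        p = irt_2pl(theta, a, b)
        total += math.log(p) if r else math.log(1.0 - p)
    return total


def estimate_ability(responses, items, grid):
    """Pick the grid ability that maximizes the likelihood.
    Item parameters are not re-optimized."""
    return max(grid, key=lambda th: log_likelihood(th, responses, items))


# Hypothetical items as (discrimination, difficulty) pairs.
items = [(1.0, -1.0), (1.0, 0.0), (1.0, 1.0)]
grid = [i / 10.0 for i in range(-30, 31)]
strong = estimate_ability([1, 1, 1], items, grid)
weak = estimate_ability([0, 0, 0], items, grid)
```

With positive discrimination the response probability increases monotonically in ability, which is the kind of monotonicity condition the abstract refers to.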

Title: A Pre-training Framework for Relational Data with Information-theoretic Principles

Authors: Quang Truong, Zhikai Chen, Mingxuan Ju, Tong Zhao, Neil Shah, Jiliang Tang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2507.09837
Pdf URL: https://arxiv.org/pdf/2507.09837
Copy Paste: [[2507.09837]] A Pre-training Framework for Relational Data with Information-theoretic Principles(https://arxiv.org/abs/2507.09837)
Keywords: generation
Abstract: Relational databases underpin critical infrastructure across a wide range of domains, yet the design of generalizable pre-training strategies for learning from relational databases remains an open challenge due to task heterogeneity. Specifically, there exist infinitely many possible downstream tasks, as tasks are defined based on relational schema graphs, temporal dependencies, and SQL-defined label logics. An effective pre-training framework is desired to take these factors into account in order to obtain task-aware representations. By incorporating knowledge of the underlying distribution that drives label generation, downstream tasks can benefit from relevant side-channel information. To bridge this gap, we introduce Task Vector Estimation (TVE), a novel pre-training framework that constructs predictive supervisory signals via set-based aggregation over schema traversal graphs, explicitly modeling next-window relational dynamics. We formalize our approach through an information-theoretic lens, demonstrating that task-informed representations retain more relevant signals than those obtained without task priors. Extensive experiments on the RelBench benchmark show that TVE consistently outperforms traditional pre-training baselines. Our findings advocate for pre-training objectives that encode task heterogeneity and temporal structure as design principles for predictive modeling on relational databases.
摘要：关系数据库基于跨广泛领域的关键基础架构的基础，但是由于任务异质性，从关系数据库中学习的可推广的预培训策略是从关系数据库中学习的。具体而言，由于关系图表，时间依赖性和SQL定义的标签逻辑定义了任务，因此存在许多可能的下游任务。希望将这些因素考虑到有效的培训框架，以获得任务感知表示。通过纳入驱动标签生成的基本分布的知识，下游任务可以从相关的侧通道信息中受益。为了弥合这一差距，我们介绍了任务矢量估计（TVE），这是一个新颖的训练框架，通过基于集合的架构遍历图表来构建预测性监督信号，明确对下一窗口的关系动力学进行了明确建模。我们通过信息理论镜头对方法进行形式化，这表明，任务信息的表示保留的信号比没有任务先验的信号更相关。 Relbench基准的广泛实验表明，TVE始终优于传统的预训练基线。我们的发现倡导了编码任务异质性和时间结构作为关系数据库预测建模的设计原理的预训练目标。

Title: SpeakerVid-5M: A Large-Scale High-Quality Dataset for Audio-Visual Dyadic Interactive Human Generation

Authors: Youliang Zhang, Zhaoyang Li, Duomin Wang, Jiahe Zhang, Deyu Zhou, Zixin Yin, Xili Dai, Gang Yu, Xiu Li
Subjects: cs.CV, eess.AS
Abstract URL: https://arxiv.org/abs/2507.09862
Pdf URL: https://arxiv.org/pdf/2507.09862
Copy Paste: [[2507.09862]] SpeakerVid-5M: A Large-Scale High-Quality Dataset for Audio-Visual Dyadic Interactive Human Generation(https://arxiv.org/abs/2507.09862)
Keywords: generation
Abstract: The rapid development of large-scale models has catalyzed significant breakthroughs in the digital human domain. These advanced methodologies offer high-fidelity solutions for avatar driving and rendering, leading academia to focus on the next major challenge: the audio-visual dyadic interactive virtual human. To facilitate research in this emerging area, we present the SpeakerVid-5M dataset, the first large-scale, high-quality dataset designed for audio-visual dyadic interactive virtual human generation. Totaling over 8,743 hours, SpeakerVid-5M contains more than 5.2 million video clips of human portraits. It covers diverse scales and interaction types, including monadic talking, listening, and dyadic conversations. Crucially, the dataset is structured along two key dimensions: interaction type and data quality. First, it is categorized into four types (dialogue branch, single branch, listening branch and multi-turn branch) based on the interaction scenario. Second, it is stratified into a large-scale pre-training subset and a curated, high-quality subset for Supervised Fine-Tuning (SFT). This dual structure accommodates a wide array of 2D virtual human tasks. In addition, we provide an autoregressive (AR)-based video chat baseline trained on this data, accompanied by a dedicated set of metrics and test data that serve as a benchmark, VidChatBench, for future work. Both the dataset and the corresponding data processing code will be publicly released. Project page: this https URL
摘要：大规模模型的快速发展促进了数字人类领域的显着突破。这些先进的方法为化身驾驶和渲染提供了高保真的解决方案，使学术界专注于下一个主要挑战：视听二元互动互动虚拟人类。为了促进该新兴领域的研究，我们介绍了SpeakervID-5M数据集，这是第一个专为视听二元互动虚拟人类代理设计的大型高质量数据集。 SpeakervId-5m总计超过8,743小时，包含超过520万个人类肖像的视频片段。它涵盖了各种尺度和互动类型，包括Monadic说话，听力和二元对话。至关重要的是，数据集沿两个关键维度结构：交互类型和数据质量。首先，根据交互情况，它分为四种类型（对话分支，单个分支，听力分支和多转支分支）。其次，将其分层为大规模的预训练子集，并为监督微调（SFT）进行了精选的高质量子集。这种双重结构可容纳各种各样的2D虚拟人类任务。此外，我们还提供了基于此数据的基于自回旋（AR）的视频聊天基线，并配有一套专用的指标和测试数据，以作为基准Vidchatbench进行未来工作。数据集和相应的数据处理代码都将公开发布。项目页面：此HTTPS URL

Title: Counterfactual Visual Explanation via Causally-Guided Adversarial Steering

Authors: Yiran Qiao, Disheng Liu, Yiren Lu, Yu Yin, Mengnan Du, Jing Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.09881
Pdf URL: https://arxiv.org/pdf/2507.09881
Copy Paste: [[2507.09881]] Counterfactual Visual Explanation via Causally-Guided Adversarial Steering(https://arxiv.org/abs/2507.09881)
Keywords: generation
Abstract: Recent work on counterfactual visual explanations has contributed to making artificial intelligence models more explainable by providing visual perturbations that flip the prediction. However, these approaches neglect the causal relationships and the spurious correlations behind the image generation process, which often leads to unintended alterations in the counterfactual images and limits the quality of the explanations. To address this challenge, we introduce a novel framework, CECAS, which first leverages a causally-guided adversarial method to generate counterfactual explanations. It innovatively integrates a causal perspective to avoid unwanted perturbations on spurious factors in the counterfactuals. Extensive experiments demonstrate that our method outperforms existing state-of-the-art approaches across multiple benchmark datasets and ultimately achieves a balanced trade-off among various aspects of validity, sparsity, proximity, and realism.
摘要：关于反事实视觉解释的最新工作有助于通过提供视觉扰动来翻转预测，从而使人工智能模型更具解释。但是，这些方法忽略了图像生成过程背后的因果关系和虚假的相关性，这通常会导致反事实图像的意外变化，并以有限的质量使解释提供了解释。为了应对这一挑战，我们引入了一种新颖的框架CECA，该框架首先利用因果引导的对抗方法来产生反事实解释。它创新地整合了因果观点，以避免对反事实中的虚假因素的不良扰动。广泛的实验表明，我们的方法在多个基准数据集中优于现有的最新方法，并最终在有效性，稀疏，接近性和现实主义的各个方面之间取得平衡的权衡。

Title: IGD: Instructional Graphic Design with Multimodal Layer Generation

Authors: Yadong Qu, Shancheng Fang, Yuxin Wang, Xiaorui Wang, Zhineng Chen, Hongtao Xie, Yongdong Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.09910
Pdf URL: https://arxiv.org/pdf/2507.09910
Copy Paste: [[2507.09910]] IGD: Instructional Graphic Design with Multimodal Layer Generation(https://arxiv.org/abs/2507.09910)
Keywords: generation
Abstract: Graphic design visually conveys information and data by creating and combining text, images and graphics. Two-stage methods that rely primarily on layout generation lack creativity and intelligence, making graphic design still labor-intensive. Existing diffusion-based methods generate non-editable graphic design files at image level with poor legibility in visual text rendering, which prevents them from achieving satisfactory and practical automated graphic design. In this paper, we propose Instructional Graphic Designer (IGD) to swiftly generate multimodal layers with editable flexibility with only natural language instructions. IGD adopts a new paradigm that leverages parametric rendering and image asset generation. First, we develop a design platform and establish a standardized format for multi-scenario design files, thus laying the foundation for scaling up data. Second, IGD utilizes the multimodal understanding and reasoning capabilities of MLLM to accomplish attribute prediction, sequencing and layout of layers. It also employs a diffusion model to generate image content for assets. By enabling end-to-end training, IGD architecturally supports scalability and extensibility in complex graphic design tasks. The superior experimental results demonstrate that IGD offers a new solution for graphic design.
摘要：图形设计通过创建和组合文本，图像和图形来视觉上传达信息和数据。主要依赖布局一代的两阶段方法缺乏创造力和智力，使图形设计仍然富含劳动力。现有的基于扩散的方法在图像级别生成了非编辑的图形设计文件，视觉文本渲染中的可读性差，这使它们无法实现令人满意且实用的自动化图形设计。在本文中，我们建议教学图形设计师（IGD）迅速生成具有自然语言指令的可编辑灵活性的多模式层。 IGD采用了一种利用参数渲染和图像资产产生的新范式。首先，我们开发一个设计平台，并为多幕科设计文件建立标准化格式，从而为扩展数据奠定了基础。其次，IGD利用MLLM的多模式理解和推理能力来完成层的属性预测，测序和布局。它还采用扩散模型来生成资产的图像内容。通过启用端到端培训，IGD架构在复杂的图形设计任务中支持可扩展性和可扩展性。出色的实验结果表明，IGD为图形设计提供了新的解决方案。

Title: Crucial-Diff: A Unified Diffusion Model for Crucial Image and Annotation Synthesis in Data-scarce Scenarios

Authors: Siyue Yao, Mingjie Sun, Eng Gee Lim, Ran Yi, Baojiang Zhong, Moncef Gabbouj
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.09915
Pdf URL: https://arxiv.org/pdf/2507.09915
Copy Paste: [[2507.09915]] Crucial-Diff: A Unified Diffusion Model for Crucial Image and Annotation Synthesis in Data-scarce Scenarios(https://arxiv.org/abs/2507.09915)
Keywords: generative
Abstract: The scarcity of data in various scenarios, such as medical, industrial and autonomous driving applications, leads to model overfitting and dataset imbalance, thus hindering effective detection and segmentation performance. Existing studies employ generative models to synthesize more training samples to mitigate data scarcity. However, these synthetic samples are repetitive or simplistic and fail to provide "crucial information" that targets the downstream model's weaknesses. Additionally, these methods typically require separate training for different objects, leading to computational inefficiencies. To address these issues, we propose Crucial-Diff, a domain-agnostic framework designed to synthesize crucial samples. Our method integrates two key modules. The Scene Agnostic Feature Extractor (SAFE) utilizes a unified feature extractor to capture target information. The Weakness Aware Sample Miner (WASM) generates hard-to-detect samples using feedback from the detection results of the downstream model, which is then fused with the output of the SAFE module. Together, our Crucial-Diff framework generates diverse, high-quality training data, achieving a pixel-level AP of 83.63% and an F1-MAX of 78.12% on MVTec. On the polyp dataset, Crucial-Diff reaches an mIoU of 81.64% and an mDice of 87.69%. Code will be released after acceptance.
摘要：在各种情况下（例如医疗，行业和自动驾驶）的数据稀缺导致模型过度拟合和数据集不平衡，从而阻碍了有效的检测和细分性能。现有研究采用生成模型来合成更多的培训样本来减轻数据稀缺性。但是，这些合成样本是重复的或简单的，无法提供针对下游模型弱点的“关键信息”。此外，这些方法通常需要针对不同对象的单独培训，从而导致计算效率低下。为了解决这些问题，我们提出了旨在综合关键样本的域 - 不合SNOSTIC框架Cycucial-Diff。我们的方法集成了两个关键模块。场景不可知的特征提取器（SAFE）利用统一的功能提取器来捕获目标信息。弱点意识到的样本矿工（WASM）使用下游模型的检测结果的反馈生成难以检测的样品，然后将其与安全模块的输出融合。我们的关键DIFF框架共同产生了多样化的高质量培训数据，在MVTEC上获得了83.63％的像素级AP，F1-MAX为78.12％。在Polyp数据集上，关键的木材达到81.64％的MIOU，MDICE为87.69％。接受代码将在接受后发布。
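The abstract above reports mIoU and mDice on the polyp dataset. As a reference for how such segmentation scores are typically computed, the sketch below evaluates IoU and Dice on binary masks and averages them over images. The flat 0/1 list representation and the empty-mask convention are assumptions of this sketch, not details from the paper.

```python
def iou_and_dice(pred, target):
    """IoU and Dice for two binary masks given as flat 0/1 sequences."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    p_sum = sum(1 for p in pred if p)
    t_sum = sum(1 for t in target if t)
    union = p_sum + t_sum - inter
    if union == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0, 1.0
    return inter / union, 2.0 * inter / (p_sum + t_sum)


def mean_scores(pairs):
    """Average IoU and Dice over a list of (pred, target) mask pairs."""
    scores = [iou_and_dice(p, t) for p, t in pairs]
    n = len(scores)
    return sum(s[0] for s in scores) / n, sum(s[1] for s in scores) / n


iou, dice = iou_and_dice([1, 1, 0, 0], [1, 0, 1, 0])
m_iou, m_dice = mean_scores([([1, 1, 0, 0], [1, 0, 1, 0]),
                             ([1, 1, 1, 0], [1, 1, 1, 0])])
```

For a single mask pair the two scores are related by Dice = 2 IoU / (1 + IoU), so Dice is never lower than IoU. The reported 87.69% mDice against 81.64% mIoU fits that ordering.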

Title: Long-Tailed Data Classification by Increasing and Decreasing Neurons During Training

Authors: Taigo Sakai, Kazuhiro Hotta
Subjects: cs.LG, stat.AP
Abstract URL: https://arxiv.org/abs/2507.09940
Pdf URL: https://arxiv.org/pdf/2507.09940
Copy Paste: [[2507.09940]] Long-Tailed Data Classification by Increasing and Decreasing Neurons During Training(https://arxiv.org/abs/2507.09940)
Keywords: generation
Abstract: In conventional deep learning, the number of neurons typically remains fixed during training. However, insights from biology suggest that the human hippocampus undergoes continuous generation and pruning of neurons over the course of learning, implying that a flexible allocation of capacity can contribute to enhanced performance. Real-world datasets often exhibit class imbalance, where certain classes have far fewer samples than others, leading to significantly reduced recognition accuracy for minority classes when relying on fixed size networks. To address the challenge, we propose a method that periodically adds and removes neurons during training, thereby boosting representational power for minority classes. By retaining critical features learned from majority classes while selectively increasing neurons for underrepresented classes, our approach dynamically adjusts capacity during training. Importantly, while the number of neurons changes throughout training, the final network size and structure remain unchanged, ensuring efficiency and compatibility with this http URL, by experiments on three different datasets and five representative models, we demonstrate that the proposed method outperforms fixed size networks and shows even greater accuracy when combined with other imbalance-handling techniques. Our results underscore the effectiveness of dynamic, biologically inspired network designs in improving performance on class-imbalanced data.
摘要：在常规深度学习中，神经元的数量通常在训练过程中保持固定。但是，生物学的见解表明，人类海马在学习过程中经历了连续的神经元产生和神经元修剪，这意味着能力的灵活分配可以提高性能。现实世界中的数据集经常表现出类别的不平衡情况，在某些类别的样本少得多，因此在依靠固定尺寸的HTTP URL时，可以显着降低少数群体的识别准确性，我们提出了一种在培训过程中定期添加和删除神经元的方法，从而增强了对少数族裔的代表性力量。通过保留从多数类中学到的关键特征，同时选择性地增加了代表性不足的类别的神经元，我们的方法在训练过程中动态调节容量。重要的是，尽管通过在三个不同的数据集和五个代表性模型的实验中进行实验，但在整个训练过程中神经元的数量变化，但最终的网络大小和结构保持不变，从而确保了与该HTTP URL的效率和兼容性，我们证明，与其他Immbalance Immbalance Handling技术结合时，提出的方法均优于固定尺寸的固定尺寸，并且显示出更高的准确性。我们的结果强调了动态，生物学启发的网络设计在改善类不平衡数据的性能方面的有效性。
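The grow-and-prune cycle described in the abstract above can be pictured with a toy layer stored as a list of weight rows, one row per neuron. In this sketch the new neurons start with small random weights and the pruning criterion is the smallest L2 norm. Both choices are assumptions made for illustration, since the abstract does not state the paper's actual criteria. The training steps that would run between `grow` and `prune` are omitted.

```python
import random


def grow(weights, k, in_dim, rng, scale=0.01):
    """Append k new neurons (rows) with small random weights."""
    new_rows = [[rng.gauss(0.0, scale) for _ in range(in_dim)] for _ in range(k)]
    return weights + new_rows


def prune(weights, k):
    """Drop the k neurons with the smallest L2 norm, keeping row order."""
    norms = [sum(w * w for w in row) for row in weights]
    by_norm = sorted(range(len(weights)), key=lambda i: norms[i])
    keep = sorted(by_norm[k:])
    return [weights[i] for i in keep]


rng = random.Random(0)
layer = [[1.0, 1.0], [2.0, -1.0]]      # two trained neurons, two inputs each
grown = grow(layer, 2, 2, rng)         # temporary extra capacity
restored = prune(grown, 2)             # back to the original width
```

Growing by k and later pruning by k returns the layer to its starting width, matching the abstract's point that the final network size is unchanged.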

Title: Iceberg: Enhancing HLS Modeling with Synthetic Data

Authors: Zijian Ding, Tung Nguyen, Weikai Li, Aditya Grover, Yizhou Sun, Jason Cong
Subjects: cs.LG, cs.AR
Abstract URL: https://arxiv.org/abs/2507.09948
Pdf URL: https://arxiv.org/pdf/2507.09948
Copy Paste: [[2507.09948]] Iceberg: Enhancing HLS Modeling with Synthetic Data(https://arxiv.org/abs/2507.09948)
Keywords: generation
Abstract: Deep learning-based prediction models for High-Level Synthesis (HLS) of hardware designs often struggle to generalize. In this paper, we study how to close the generalizability gap of these models through pretraining on synthetic data and introduce Iceberg, a synthetic data augmentation approach that expands both large language model (LLM)-generated programs and weak labels of unseen design configurations. Our weak label generation method is integrated with an in-context model architecture, enabling meta-learning from actual and proximate labels. Iceberg improves the geometric mean modeling accuracy by $86.4\%$ when adapting to six real-world applications with few-shot examples and achieves $2.47\times$ and $1.12\times$ better offline DSE performance when adapting to two different test datasets. Our open-sourced code is here: this https URL
摘要：硬件设计的高级合成（HLS）基于深度学习的预测模型通常很难概括。在本文中，我们研究了如何通过预处理合成数据来缩小这些模型的概括性差距，并引入冰山，冰山是一种合成数据增强方法，可以扩大大型语言模型（LLM）的程序和弱标签，又是看不见的设计配置的标签。我们的弱标签生成方法与内在的模型体系结构集成在一起，从而从实际和近端标签中启用了元学习。当适应几个示例的六个现实世界应用程序时，冰山将几何平均建模精度提高了$ 86.4 \％$，并且在适应两个不同的测试数据集时，可以实现$ 2.47 \ times $ $ 2.47 \ times $ and $ 1.12 \ times $更好的离线DSE性能。我们的开源代码在这里：\ href {this https url} {此https url}
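The abstract above reports its headline result as a geometric mean across applications. The sketch below shows how a geometric mean is computed and why it suits ratio-like scores. The per-application numbers are made up for illustration and do not come from the paper.

```python
import math


def geometric_mean(values):
    """Geometric mean of positive values, computed in log space."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical per-application scores; one large outlier.
scores = [1.0, 1.0, 1.0, 16.0]
geo = geometric_mean(scores)
arith = sum(scores) / len(scores)
```

Here the geometric mean is 2 while the arithmetic mean is 4.75, so a single outlier application influences the summary far less.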

Title: 4D-MISR: A unified model for low-dose super-resolution imaging via feature fusion

Authors: Zifei Wang, Zian Mao, Xiaoya He, Xi Huang, Haoran Zhang, Chun Cheng, Shufen Chu, Tingzheng Hou, Xiaoqin Zeng, Yujun Xie
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.09953
Pdf URL: https://arxiv.org/pdf/2507.09953
Copy Paste: [[2507.09953]] 4D-MISR: A unified model for low-dose super-resolution imaging via feature fusion(https://arxiv.org/abs/2507.09953)
Keywords: super-resolution
Abstract: While electron microscopy offers crucial atomic-resolution insights into structure-property relationships, radiation damage severely limits its use on beam-sensitive materials like proteins and 2D materials. To overcome this challenge, we push beyond the electron dose limits of conventional electron microscopy by adapting principles from multi-image super-resolution (MISR) that have been widely used in remote sensing. Our method fuses multiple low-resolution, sub-pixel-shifted views and enhances the reconstruction with a convolutional neural network (CNN) that integrates features from synthetic, multi-angle observations. We developed a dual-path, attention-guided network for 4D-STEM that achieves atomic-scale super-resolution from ultra-low-dose data. This provides robust atomic-scale visualization across amorphous, semi-crystalline, and crystalline beam-sensitive specimens. Systematic evaluations on representative materials demonstrate comparable spatial resolution to conventional ptychography under ultra-low-dose conditions. Our work expands the capabilities of 4D-STEM, offering a new and generalizable method for the structural analysis of radiation-vulnerable materials.
摘要：尽管电子显微镜为结构特性关系提供了至关重要的原子分辨率洞察力，但辐射损害严重限制了其对梁敏感材料（如蛋白质和2D材料）的使用。为了克服这一挑战，我们通过调整已广泛用于遥感的多图像超分辨率（MISR）的原理来超越常规电子显微镜的电子剂量极限。我们的方法融合了多个低分辨率，子像素切换的视图，并使用卷积神经网络（CNN）增强重建，该卷积神经网络（CNN）整合了合成，多角度观察的特征。我们开发了一个针对4D STEM的双路，注意引导网络，该网络从超低剂量数据中实现了原子级的超分辨率。这提供了跨无定形，半晶和晶体束敏感的标本的稳健原子尺度可视化。对代表性材料的系统评估表明，在超低剂量条件下，与常规ptychography相当的空间分辨率可比。我们的工作扩大了4D-STEM的功能，为辐射式材料的结构分析提供了一种新的且可推广的方法。
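The multi-image super-resolution principle mentioned in the abstract above can be shown in one dimension: several low-resolution views, each offset by a different sub-pixel shift, are interleaved back onto a fine grid. The sketch assumes ideal, noise-free sampling with known integer shifts on the fine grid. The paper's CNN-based fusion for 4D-STEM is not reproduced, and the function names are hypothetical.

```python
def downsample(signal, factor, shift):
    """Low-resolution view: every factor-th sample, starting at shift."""
    return signal[shift::factor]


def fuse_views(views, factor):
    """Interleave shifted low-resolution views onto the fine grid.
    View k is assumed to have been sampled with shift k."""
    n = sum(len(v) for v in views)
    fused = [0.0] * n
    for shift, view in enumerate(views):
        fused[shift::factor] = view
    return fused


fine = [0.0, 1.0, 4.0, 9.0, 16.0, 25.0, 36.0, 49.0]
factor = 4
views = [downsample(fine, factor, s) for s in range(factor)]
recovered = fuse_views(views, factor)
```

Each view alone holds only two of the eight samples, yet the four shifted views together recover the fine signal exactly in this idealized setting.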

Title: Compliance Minimization via Physics-Informed Gaussian Processes

Authors: Xiangyu Sun, Amin Yousefpour, Shirin Hosseinmardi, Ramin Bostanabad
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.09968
Pdf URL: https://arxiv.org/pdf/2507.09968
Copy Paste: [[2507.09968]] Compliance Minimization via Physics-Informed Gaussian Processes(https://arxiv.org/abs/2507.09968)
Keywords: super-resolution
Abstract: Machine learning (ML) techniques have recently gained significant attention for solving compliance minimization (CM) problems. However, these methods typically provide poor feature boundaries, are very expensive, and lack a systematic mechanism to control the design complexity. Herein, we address these limitations by proposing a mesh-free and simultaneous framework based on physics-informed Gaussian processes (GPs). In our approach, we parameterize the design and state variables with GP priors which have independent kernels but share a multi-output neural network (NN) as their mean function. The architecture of this NN is based on Parametric Grid Convolutional Attention Networks (PGCANs), which not only mitigate spectral bias issues but also provide an interpretable mechanism to control design complexity. We estimate all the parameters of our GP-based representations by simultaneously minimizing the compliance, total potential energy, and residual of the volume fraction constraint. Importantly, our loss function excludes all data-based residuals, as GPs automatically satisfy them. We also develop computational schemes based on curriculum training and numerical integration to increase the efficiency and robustness of our approach, which is shown to (1) produce super-resolution topologies with fast convergence, (2) achieve smaller compliance and less gray area fraction compared to traditional numerical methods, (3) provide control over fine-scale features, and (4) outperform competing ML-based methods.
摘要：机器学习（ML）技术最近在解决合规性最小化（CM）问题方面引起了极大的关注。但是，这些方法通常提供较差的特征边界，非常昂贵，并且缺乏控制设计复杂性的系统机制。本文中，我们通过基于物理信息高斯过程（GPS）提出一个无网格和同时框架来解决这些限制。在我们的方法中，我们使用具有独立内核但共享多输出神经网络（NN）作为其平均功能的GP先验的设计和状态变量来参数化设计和状态变量。该NN的体系结构基于参数网格卷积注意网络（PGCANS），该网络不仅减轻了光谱偏差问题，而且还提供了控制设计复杂性的可解释机制。我们通过同时最大程度地减少依从性，总势能和体积分数约束的残留量来估计基于GP表示的所有参数。重要的是，我们的损失函数将所有基于数据的残差都排除在自动满足它们时。我们还基于课程培训和数值整合开发了计算方案，以提高方法的效率和鲁棒性，这表明（1）产生具有快速收敛性的超分辨率拓扑，（2）与传统的数值方法相比，实现较小的合规性和较小的灰色面积分数，与传统的数值方法相比，（3）提供了对良好的特征和（4）竞争性竞争的良好型特征的控制权。

Title: Latent Diffusion Models with Masked AutoEncoders

Authors: Junho Lee, Jeongwoo Shin, Hyungwook Choi, Joonseok Lee
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.09984
Pdf URL: https://arxiv.org/pdf/2507.09984
Copy Paste: [[2507.09984]] Latent Diffusion Models with Masked AutoEncoders(https://arxiv.org/abs/2507.09984)
Keywords: generation
Abstract: In spite of the remarkable potential of Latent Diffusion Models (LDMs) in image generation, the desired properties and optimal design of the autoencoders have been underexplored. In this work, we analyze the role of autoencoders in LDMs and identify three key properties: latent smoothness, perceptual compression quality, and reconstruction quality. We demonstrate that existing autoencoders fail to simultaneously satisfy all three properties, and propose Variational Masked AutoEncoders (VMAEs), taking advantage of the hierarchical features maintained by the Masked AutoEncoder. We integrate VMAEs into the LDM framework, introducing Latent Diffusion Models with Masked AutoEncoders (LDMAEs). Through comprehensive experiments, we demonstrate significantly enhanced image generation quality and computational efficiency.
摘要：尽管潜在扩散模型（LDMS）在图像生成中具有显着的潜力，但自动编码器的所需属性和最佳设计仍未得到解动。在这项工作中，我们分析了自动编码器在LDM中的作用，并确定了三个关键特性：潜在的平滑度，感知压缩质量和重建质量。我们证明，现有的自动编码器无法同时满足所有三个属性，并提出了变分蒙版自动编码器（VMAES），利用蒙版自动编码器维护的层次结构功能。我们将VMAES集成到LDM框架中，并用掩盖自动编码器（LDMAES）引入潜在扩散模型。通过全面的实验，我们证明了图像产生质量和计算效率的显着提高。

Title: 3DGAA: Realistic and Robust 3D Gaussian-based Adversarial Attack for Autonomous Driving

Authors: Yixun Zhang, Lizhi Wang, Junjun Zhao, Wending Zhao, Feng Zhou, Yonghao Dang, Jianqin Yin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.09993
Pdf URL: https://arxiv.org/pdf/2507.09993
Copy Paste: [[2507.09993]] 3DGAA: Realistic and Robust 3D Gaussian-based Adversarial Attack for Autonomous Driving(https://arxiv.org/abs/2507.09993)
Keywords: generation
Abstract: Camera-based object detection systems play a vital role in autonomous driving, yet they remain vulnerable to adversarial threats in real-world environments. While existing 2D and 3D physical attacks typically optimize texture, they often struggle to balance physical realism and attack robustness. In this work, we propose 3D Gaussian-based Adversarial Attack (3DGAA), a novel adversarial object generation framework that leverages the full 14-dimensional parameterization of 3D Gaussian Splatting (3DGS) to jointly optimize geometry and appearance in physically realizable ways. Unlike prior works that rely on patches or texture, 3DGAA jointly perturbs both geometric attributes (shape, scale, rotation) and appearance attributes (color, opacity) to produce physically realistic and transferable adversarial objects. We further introduce a physical filtering module to preserve geometric fidelity, and a physical augmentation module to simulate complex physical scenarios, thus enhancing attack generalization under real-world conditions. We evaluate 3DGAA on both virtual benchmarks and physical-world setups using miniature vehicle models. Experimental results show that 3DGAA reduces the detection mAP from 87.21% to 7.38%, significantly outperforming existing 3D physical attacks. Moreover, our method maintains high transferability across different physical conditions, demonstrating a new state of the art in physically realizable adversarial attacks. These results validate 3DGAA as a practical attack framework for evaluating the safety of perception systems in autonomous driving.
摘要：基于相机的对象检测系统在自主驾驶中起着至关重要的作用，但它们仍然容易受到现实环境中的对抗威胁的影响。尽管现有的2D和3D物理攻击通常会优化纹理，但它们通常很难平衡物理现实主义和攻击稳健性。在这项工作中，我们提出了一个基于3D高斯的对抗攻击（3DGAA），这是一个新型的对抗对象生成框架，利用3D高斯分裂（3DGS）的完整14维参数化（3DGS）以可实现的方式共同优化几何和外观。与依靠斑块或纹理的先前作品不同，3DGAA共同掩盖了几何属性（形状，比例，旋转）和外观属性（颜色，不透明度），以产生物理逼真且可转移的对手对象。我们进一步引入了一个物理过滤模块，以保留几何忠诚度，以及一个物理增强模块，以模拟复杂的物理场景，从而增强在现实世界中的攻击概括。我们使用微型车辆模型在虚拟基准和物理世界设置上评估了3DGAA。实验结果表明，3DGAA可以将检测图从87.21％降低到7.38％，从而明显超过了现有的3D物理攻击。此外，我们的方法在不同的物理条件下保持了高可传递性，这表明在物理上可实现的对抗性攻击中是新的最新性能。这些结果验证了3DGAA作为评估自动驾驶中感知系统安全性的实际攻击框架。
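The abstract above refers to a 14-dimensional parameterization of each 3D Gaussian. One breakdown that adds up to 14 is position (3), scale (3), rotation quaternion (4), RGB color (3) and opacity (1). The sketch below encodes that layout as a data class. The split is an assumption based on common 3D Gaussian Splatting descriptions, and the abstract itself does not spell it out. The class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Gaussian3D:
    # Geometric attributes.
    position: tuple   # (x, y, z)
    scale: tuple      # (sx, sy, sz)
    rotation: tuple   # unit quaternion (w, x, y, z)
    # Appearance attributes.
    color: tuple      # (r, g, b)
    opacity: float

    def flatten(self):
        """Concatenate all attributes into one optimizable parameter vector."""
        return [*self.position, *self.scale, *self.rotation,
                *self.color, self.opacity]


g = Gaussian3D(position=(0.0, 0.0, 0.0),
               scale=(1.0, 1.0, 1.0),
               rotation=(1.0, 0.0, 0.0, 0.0),
               color=(0.5, 0.5, 0.5),
               opacity=0.9)
params = g.flatten()
```

An attack that perturbs geometry and appearance jointly would optimize over all 14 entries of this vector rather than over the color entries alone.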

Title: Frequency Regulation for Exposure Bias Mitigation in Diffusion Models

Authors: Meng Yu, Kun Zhan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.10072
Pdf URL: https://arxiv.org/pdf/2507.10072
Copy Paste: [[2507.10072]] Frequency Regulation for Exposure Bias Mitigation in Diffusion Models(https://arxiv.org/abs/2507.10072)
Keywords: generative
Abstract: Diffusion models exhibit impressive generative capabilities but are significantly impacted by exposure bias. In this paper, we make a key observation: the energy of the predicted noisy images decreases during the diffusion process. Building on this, we identify two important findings: 1) The reduction in energy follows distinct patterns in the low-frequency and high-frequency subbands; 2) This energy reduction results in amplitude variations between the network-reconstructed clean data and the real clean data. Based on the first finding, we introduce a frequency-domain regulation mechanism utilizing wavelet transforms, which separately adjusts the low- and high-frequency subbands. Leveraging the second insight, we provide a more accurate analysis of exposure bias in the two subbands. Our method is training-free and plug-and-play, significantly improving the generative quality of various diffusion models and providing a robust solution to exposure bias across different model architectures. The source code is available at this https URL.
摘要：扩散模型具有令人印象深刻的生成能力，但受到暴露偏见的影响。在本文中，我们进行了一个关键的观察：预测噪声图像的能量在扩散过程中减少。在此基础上，我们确定了两个重要的发现：1）能量的减少遵循低频和高频子带中的不同模式； 2）减少能量会导致网络重建的清洁数据与真实的清洁数据之间的幅度变化。根据第一个发现，我们引入了利用小波变换的频域调节机制，该机制分别调整了低频和高频子带。利用第二个见解，我们对两个子带中的暴露偏置进行了更准确的分析。我们的方法是无训练和插入式播放的方法，可显着提高各种扩散模型的生成质量，并为跨不同模型体系结构的暴露偏见提供了强大的解决方案。源代码可在此HTTPS URL上找到。
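To make the idea of adjusting low- and high-frequency subbands separately more concrete, here is a minimal sketch using a single-level 1D Haar wavelet transform written in plain Python. The abstract does not specify the wavelet, the gains, or how they vary over the diffusion process, so `low_gain`, `high_gain`, and the function names are illustrative assumptions rather than the authors' method.

```python
import math


def haar_forward(x):
    """Single-level orthonormal Haar transform of an even-length sequence."""
    s = math.sqrt(2.0)
    low = [(x[i] + x[i + 1]) / s for i in range(0, len(x), 2)]   # local averages
    high = [(x[i] - x[i + 1]) / s for i in range(0, len(x), 2)]  # local differences
    return low, high


def haar_inverse(low, high):
    """Invert haar_forward, interleaving the reconstructed sample pairs."""
    s = math.sqrt(2.0)
    out = []
    for l, h in zip(low, high):
        out.append((l + h) / s)
        out.append((l - h) / s)
    return out


def regulate(x, low_gain, high_gain):
    """Scale the two subbands independently, then reconstruct the signal."""
    low, high = haar_forward(x)
    low = [low_gain * v for v in low]
    high = [high_gain * v for v in high]
    return haar_inverse(low, high)


signal = [1.0, 3.0, 2.0, 2.0]
unchanged = regulate(signal, 1.0, 1.0)   # unit gains reproduce the input
boosted = regulate(signal, 1.0, 2.0)     # doubles only the high-frequency detail
```

With unit gains the transform is a perfect round trip. Raising only `high_gain` amplifies local differences while leaving each pair's mean untouched, which is the kind of subband-specific control the abstract describes.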

Title: Towards High Supervised Learning Utility Training Data Generation: Data Pruning and Column Reordering

Authors: Tung Sum Thomas Kwok, Zeyong Zhang, Chi-Hua Wang, Guang Cheng
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2507.10088
Pdf URL: https://arxiv.org/pdf/2507.10088
Copy Paste: [[2507.10088]] Towards High Supervised Learning Utility Training Data Generation: Data Pruning and Column Reordering(https://arxiv.org/abs/2507.10088)
Keywords: generation
Abstract: Tabular data synthesis for supervised learning ('SL') model training is gaining popularity in industries such as healthcare, finance, and retail. Despite the progress made in tabular data generators, models trained with synthetic data often underperform compared to those trained with original data. This low SL utility of synthetic data stems from exaggerated class imbalance and from SL data relationships that tabular generators overlook. To address these challenges, we draw inspiration from techniques in emerging data-centric artificial intelligence and introduce Pruning and ReOrdering ('PRRO'), a novel pipeline that integrates data-centric techniques into tabular data synthesis. PRRO incorporates data pruning to guide the table generator towards observations with high signal-to-noise ratio, ensuring that the class distribution of synthetic data closely matches that of the original data. In addition, PRRO employs a column reordering algorithm to align the data modeling structure of generators with that of SL models. These two modules enable PRRO to optimize the SL utility of synthetic data. Empirical experiments on 22 public datasets show that synthetic data generated using PRRO enhances predictive performance compared to data generated without PRRO. Specifically, replacing the original data with PRRO-generated synthetic data yields an average improvement of 26.74% and up to 871.46%, while appending PRRO-generated data to the original data yields an average improvement of 6.13% and up to 200.32%. Furthermore, experiments on six highly imbalanced datasets show that PRRO enables the generator to produce synthetic data with a class distribution that resembles the original data more closely, achieving a similarity improvement of 43%. Through PRRO, we foster a seamless integration of data synthesis into subsequent SL prediction, promoting quality and accessible data analysis.
摘要：用于监督学习（“ SL”）模型培训的表格数据合成在医疗保健，金融和零售等行业中越来越受欢迎。尽管在表格数据生成器中取得了进展，但与受过原始数据培训的模型相比，经过合成数据训练的模型通常表现不佳。综合数据的这种低SL实用性源于表格发生器忽略的类不平衡夸张和SL数据关系。为了应对这些挑战，我们从新兴数据中心的人工智能中的技术中汲取灵感，并阐明修剪和重新排序（“ PRRO”），这是一种将以数据为中心技术集成到表格数据合成中的新型管道。 PRRO结合了数据修剪，以指导表生成器对具有较高信噪比的观测值，从而确保合成数据的类别分布与原始数据的相匹配。此外，PRRO还采用一列重新排序算法来使生成器的数据建模结构与SL模型的数据建模结构。这两个模块使PRRO可以优化合成数据的SL实用性。 22个公共数据集的经验实验表明，与没有PRRO生成的数据相比，使用PRRO生成的合成数据增强了预测性能。具体而言，使用PRRO的合成替代原始数据的平均提高26.74％，最高为871.46％，而合成对原始数据结果的合成附属物则具有PRRO生成的数据结果，导致平均改善6.13％，高达200.32％。此外，对六个高度不平衡数据集进行的实验表明，PRRO使生成器能够以类似于原始数据更接近的类分布产生合成数据，从而实现了43％的相似性提高。通过PRRO，我们将数据合成的无缝集成到随后的SL预测中，从而促进质量和可访问的数据分析。

Title: A Training-Free, Task-Agnostic Framework for Enhancing MLLM Performance on High-Resolution Images

Authors: Jaeseong Lee, Yeeun Choi, Heechan Choi, Hanjung Kim, Seonjoo Kim
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2507.10202
Pdf URL: https://arxiv.org/pdf/2507.10202
Copy Paste: [[2507.10202]] A Training-Free, Task-Agnostic Framework for Enhancing MLLM Performance on High-Resolution Images(https://arxiv.org/abs/2507.10202)
Keywords: generation
Abstract: Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in vision-language understanding, reasoning, and generation. However, they struggle with tasks requiring fine-grained localization and reasoning in high-resolution images. This constraint stems from the fact that MLLMs are fine-tuned with a fixed image resolution to align with the pre-trained image encoder used in the MLLM. Consequently, feeding high-resolution images directly into MLLMs leads to poor generalization due to a train-test resolution discrepancy, while downsampling these images, although it ensures consistency, compromises fine-grained visual details and ultimately degrades performance. To address this challenge, we propose Extract Candidate then Predict (ECP), a novel training-free, task-agnostic two-stage framework designed to enhance MLLM performance on high-resolution images. The key intuition behind ECP is that while MLLMs struggle with high-resolution images, their predictions on downsampled images still contain implicit localization cues. By first identifying a candidate region using the coarse prediction and then predicting the final output based on that candidate region, ECP effectively preserves fine-grained details while mitigating the challenges posed by high-resolution data. We validate our framework on 4K GUI grounding and 4K, 8K MLLM perception, achieving absolute improvements of +21.3%, +5.8%, and +5.2% over the baseline, respectively, demonstrating its effectiveness. Code is available at this https URL.
摘要：多模式的大型语言模型（MLLM）在视觉理解，推理和产生中表现出了显着的功能。但是，他们在高分辨率图像中需要细粒度定位和推理的任务困难。该约束源于以下事实：MLLM通过固定图像分辨率进行了微调，以与MLLM中使用的预训练的图像编码器保持一致。因此，将高分辨率图像直接馈入MLLM会导致由于火车测试的分辨率差异而导致泛化，同时减少了这些图像，尽管确保了一致性 - 启示性良好的视觉细节，并最终降低了性能。为了应对这一挑战，我们建议提取候选者然后预测（ECP），这是一种新型的无培训，无任务的两阶段框架，旨在增强高分辨率图像上的MLLM性能。 ECP背后的关键直觉是，尽管MLLM在高分辨率图像上挣扎，但它们对缩写的图像的预测仍然包含隐式定位提示。通过首先使用粗糙的预测来识别候选区域，然后根据候选区域预测最终输出，ECP有效地保留了细粒细节，同时减轻了高分辨率数据带来的挑战。我们在4K GUI接地和4K，8K MLLM感知上验证了我们的框架，与基线相比，分别达到 +21.3％， +5.8％， +5.2％的绝对改善，表明其有效性。代码可在此HTTPS URL上找到。
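The two-stage idea above (coarse prediction on a downsampled image, then a second pass on a candidate region of the original) depends on mapping a low-resolution box back to high-resolution pixel coordinates. The sketch below shows only that coordinate mapping. The function name `candidate_crop`, the `margin` padding, and the example sizes are assumptions for illustration, and the MLLM calls themselves are omitted.

```python
import math


def candidate_crop(coarse_box, low_res_size, high_res_size, margin=0.25):
    """Map a box predicted on the downsampled image to a padded crop of the original.

    coarse_box: (x0, y0, x1, y1) in downsampled-image pixels.
    low_res_size / high_res_size: (width, height) of the two images.
    margin: hypothetical padding, as a fraction of the scaled box size.
    """
    x0, y0, x1, y1 = coarse_box
    low_w, low_h = low_res_size
    high_w, high_h = high_res_size
    sx, sy = high_w / low_w, high_h / low_h          # per-axis upscaling factors
    X0, Y0, X1, Y1 = x0 * sx, y0 * sy, x1 * sx, y1 * sy
    pad_x, pad_y = margin * (X1 - X0), margin * (Y1 - Y0)
    # Pad the box, then clamp it to the bounds of the high-resolution image.
    return (
        max(0, math.floor(X0 - pad_x)),
        max(0, math.floor(Y0 - pad_y)),
        min(high_w, math.ceil(X1 + pad_x)),
        min(high_h, math.ceil(Y1 + pad_y)),
    )


# A coarse box on a 480x270 downsample of a 3840x2160 (4K) image.
crop = candidate_crop((100, 50, 140, 80), (480, 270), (3840, 2160))
print(crop)  # (720, 340, 1200, 700)
```

The crop would then be passed to the model at full detail for the second-stage prediction. The margin guards against a slightly misplaced coarse box cutting off the target.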

Title: From Wardrobe to Canvas: Wardrobe Polyptych LoRA for Part-level Controllable Human Image Generation

Authors: Jeongho Kim, Sunghyun Park, Hyoungwoo Park, Sungrack Yun, Jaegul Choo, Seokeon Cho
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.10217
Pdf URL: https://arxiv.org/pdf/2507.10217
Copy Paste: [[2507.10217]] From Wardrobe to Canvas: Wardrobe Polyptych LoRA for Part-level Controllable Human Image Generation(https://arxiv.org/abs/2507.10217)
Keywords: generation
Abstract: Recent diffusion models achieve personalization by learning specific subjects, allowing learned attributes to be integrated into generated images. However, personalized human image generation remains challenging due to the need for precise and consistent attribute preservation (e.g., identity, clothing details). Existing subject-driven image generation methods often require either (1) inference-time fine-tuning with a few images for each new subject or (2) large-scale dataset training for generalization. Both approaches are computationally expensive and impractical for real-time applications. To address these limitations, we present Wardrobe Polyptych LoRA, a novel part-level controllable model for personalized human image generation. By training only LoRA layers, our method removes the computational burden at inference while ensuring high-fidelity synthesis of unseen subjects. Our key idea is to condition the generation on the subject's wardrobe and leverage spatial references to reduce information loss, thereby improving fidelity and consistency. Additionally, we introduce a selective subject region loss, which encourages the model to disregard some of the reference images during training. Our loss ensures that generated images better align with text prompts while maintaining subject integrity. Notably, our Wardrobe Polyptych LoRA requires no additional parameters at the inference stage and performs generation using a single model trained on a few training samples. We construct a new dataset and benchmark tailored for personalized human image generation. Extensive experiments show that our approach significantly outperforms existing techniques in fidelity and consistency, enabling realistic and identity-preserving full-body synthesis.
摘要：最近的扩散模型通过学习特定主题来实现个性化，从而可以将学习的属性集成到生成的图像中。但是，由于需要精确且一致的属性保存（例如身份，服装细节），人类形象的产生仍然具有挑战性。现有主题驱动的图像生成方法通常需要（1）推理时间微调，每个新主题很少，或（2）大规模数据集训练进行概括。对于实时应用，两种方法在计算上都是昂贵的，并且不切实际。为了解决这些局限性，我们提出了衣柜Polyptych Lora，这是一种用于个性化人类形象生成的新型零件可控模型。通过仅训练洛拉层，我们的方法消除了推断时的计算负担，同时确保了看不见的受试者的高保真综合。我们的关键思想是将一代人的衣柜调节，并利用空间参考来减少信息丢失，从而提高忠诚度和一致性。此外，我们引入了选择性的主题损失，该损失鼓励该模型在训练过程中无视一些参考图像。我们的损失确保生成的图像可以更好地与文本提示保持一致，同时保持主题完整性。值得注意的是，我们的衣柜Polyptych Lora在推理阶段不需要其他参数，并且使用对几个训练样本进行训练的单个模型进行生成。我们构建了一个针对个性化人类形象生成的新数据集和基准测试。广泛的实验表明，我们的方法在忠诚度和一致性方面显着优于现有技术，从而实现了现实和具有身份的全身综合。

Title: Straighten Viscous Rectified Flow via Noise Optimization

Authors: Jimin Dai, Jiexi Yan, Jian Yang, Lei Luo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.10218
Pdf URL: https://arxiv.org/pdf/2507.10218
Copy Paste: [[2507.10218]] Straighten Viscous Rectified Flow via Noise Optimization(https://arxiv.org/abs/2507.10218)
Keywords: generation
Abstract: The Reflow operation aims to straighten the inference trajectories of the rectified flow during training by constructing deterministic couplings between noises and images, thereby improving the quality of generated images in single-step or few-step generation. However, we identify critical limitations in Reflow, particularly its inability to rapidly generate high-quality images due to a distribution gap between the images in its constructed deterministic couplings and real images. To address these shortcomings, we propose a novel alternative called Straighten Viscous Rectified Flow via Noise Optimization (VRFNO), which is a joint training framework integrating an encoder and a neural velocity field. VRFNO introduces two key innovations: (1) a historical velocity term that enhances trajectory distinction, enabling the model to more accurately predict the velocity of the current trajectory, and (2) noise optimization through reparameterization, which forms optimized couplings with real images that are then used for training, effectively mitigating errors caused by Reflow's limitations. Comprehensive experiments on synthetic data and real datasets with varying resolutions show that VRFNO significantly mitigates the limitations of Reflow, achieving state-of-the-art performance in both one-step and few-step generation tasks.
摘要：反射操作旨在通过在噪声和图像之间构建确定性耦合，从而拉直训练过程中整流流的推理轨迹，从而提高单步或几步生成中产生的图像的质量。但是，我们确定了反流的临界局限性，尤其是由于图像之间的分布差距在其构造的确定性耦合和真实图像中，因此无法快速产生高质量的图像。为了解决这些缺点，我们提出了一种通过噪声优化（VRFNO）的新型替代方案，称为拉直的粘性整流流，这是一个集成编码器和神经速度场的联合训练框架。 VRFNO介绍了两个关键创新：（1）历史速度术语增强轨迹区别，使模型能够更准确地预测当前轨迹的速度，以及（2）通过重新拨动的噪声优化，以重新拨动以优化的图像形成了与真实的图像形成真正的图像，这些图像可用于训练，从而通过训练，通过反映误差，这些图像是由Refore frol refors refors frol refors refors refors for refors refors误差。关于合成数据和带有不同决议的真实数据集的全面实验表明，VRFNO大大减轻了回流的局限性，在一步和少数步骤的任务中都实现了最先进的性能。

Title: Spatial Lifting for Dense Prediction

Authors: Mingzhi Xu, Yizhe Zhang
Subjects: cs.CV, cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2507.10222
Pdf URL: https://arxiv.org/pdf/2507.10222
Copy Paste: [[2507.10222]] Spatial Lifting for Dense Prediction(https://arxiv.org/abs/2507.10222)
Keywords: quality assessment
Abstract: We present Spatial Lifting (SL), a novel methodology for dense prediction tasks. SL operates by lifting standard inputs, such as 2D images, into a higher-dimensional space and subsequently processing them using networks designed for that higher dimension, such as a 3D U-Net. Counterintuitively, this dimensionality lifting allows us to achieve good performance on benchmark tasks compared to conventional approaches, while reducing inference costs and significantly lowering the number of model parameters. The SL framework produces intrinsically structured outputs along the lifted dimension. This emergent structure facilitates dense supervision during training and enables robust, near-zero-additional-cost prediction quality assessment at test time. We validate our approach across 19 benchmark datasets (13 for semantic segmentation and 6 for depth estimation), demonstrating competitive dense prediction performance while reducing the model parameter count by over 98% (in the U-Net case) and lowering inference costs. Spatial Lifting introduces a new vision modeling paradigm that offers a promising path toward more efficient, accurate, and reliable deep networks for dense prediction tasks in vision.
摘要：我们提出了空间提升（SL），这是一种用于密集预测任务的新方法。 SL通过将标准输入（例如2D图像）提升到更高维度的空间，然后使用为该更高维度设计的网络（例如3D U-net）进行处理。违反直觉，与传统方法相比，这种维度提升使我们能够在基准任务上实现良好的性能，同时降低了推理成本并大大降低了模型参数的数量。 SL框架沿着抬高的维度产生本质上结构化的输出。这种紧急的结构有助于在训练过程中进行密集的监督，并在测试时实现了稳健的，接近零成本的预测质量评估。我们在19个基准数据集中验证了我们的方法（用于语义细分的13个，对深度估计为6），证明了竞争性密集的预测性能，同时将模型参数计数降低了98％以上（在U-NET案例中）并降低推理成本。空间提升引入了新的视觉建模范式，该范式为更高效，准确，可靠的深层网络提供了有希望的途径，以实现视觉中的密集预测任务。

Title: Navigating the Challenges of AI-Generated Image Detection in the Wild: What Truly Matters?

Authors: Despina Konstantinidou, Dimitrios Karageorgiou, Christos Koutlis, Olga Papadopoulou, Emmanouil Schinas, Symeon Papadopoulos
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.10236
Pdf URL: https://arxiv.org/pdf/2507.10236
Copy Paste: [[2507.10236]] Navigating the Challenges of AI-Generated Image Detection in the Wild: What Truly Matters?(https://arxiv.org/abs/2507.10236)
Keywords: generative
Abstract: The rapid advancement of generative technologies presents both unprecedented creative opportunities and significant challenges, particularly in maintaining social trust and ensuring the integrity of digital information. Following these concerns, the challenge of AI-Generated Image Detection (AID) becomes increasingly critical. As these technologies become more sophisticated, the quality of AI-generated images has reached a level that can easily deceive even the most discerning observers. Our systematic evaluation highlights a critical weakness in current AI-Generated Image Detection models: while they perform exceptionally well on controlled benchmark datasets, they struggle significantly with real-world variations. To assess this, we introduce ITW-SM, a new dataset of real and AI-generated images collected from major social media platforms. In this paper, we identify four key factors that influence AID performance in real-world scenarios: backbone architecture, training data composition, pre-processing strategies and data augmentation combinations. By systematically analyzing these components, we shed light on their impact on detection efficacy. Our modifications result in an average AUC improvement of 26.87% across various AID models under real-world conditions.
摘要：生成技术的快速发展既提出了前所未有的创造机会，又带来了重大挑战，尤其是在维持社会信任和确保数字信息的完整性方面。遵循这些担忧，AI生成的图像检测（AID）的挑战变得越来越关键。随着这些技术变得越来越复杂，AI生成的图像的质量已经达到了一个可以轻松欺骗甚至最挑剔的观察者的水平。我们的系统评估强调了当前AI生成的图像检测模型的关键弱点：虽然它们在受控基准数据集上表现出色，但它们在现实世界中的变化很大程度上挣扎。为了评估这一点，我们介绍了ITW-SM，这是从主要社交媒体平台收集的真实和AI生成图像的新数据集。在本文中，我们确定了在现实世界中影响援助绩效的四个关键因素：骨干架构，培训数据组成，预处理策略和数据增强组合。通过系统地分析这些组件，我们阐明了它们对检测功效的影响。我们的修改导致在现实情况下，各种援助模型的平均AUC提高了26.87％。

Title: Conditional Chemical Language Models are Versatile Tools in Drug Discovery

Authors: Lu Zhu, Emmanuel Noutahi
Subjects: cs.LG, q-bio.BM
Abstract URL: https://arxiv.org/abs/2507.10273
Pdf URL: https://arxiv.org/pdf/2507.10273
Copy Paste: [[2507.10273]] Conditional Chemical Language Models are Versatile Tools in Drug Discovery(https://arxiv.org/abs/2507.10273)
Keywords: generation, generative
Abstract: Generative chemical language models (CLMs) have demonstrated strong capabilities in molecular design, yet their impact in drug discovery remains limited by the absence of reliable reward signals and the lack of interpretability in their outputs. We present SAFE-T, a generalist chemical modeling framework that conditions on biological context -- such as protein targets or mechanisms of action -- to prioritize and design molecules without relying on structural information or engineered scoring functions. SAFE-T models the conditional likelihood of fragment-based molecular sequences given a biological prompt, enabling principled scoring of molecules across tasks such as virtual screening, drug-target interaction prediction, and activity cliff detection. Moreover, it supports goal-directed generation by sampling from this learned distribution, aligning molecular design with biological objectives. In comprehensive zero-shot evaluations across predictive (LIT-PCBA, DAVIS, KIBA, ACNet) and generative (DRUG, PMO) benchmarks, SAFE-T consistently achieves performance comparable to or better than existing approaches while being significantly faster. Fragment-level attribution further reveals that SAFE-T captures known structure-activity relationships, supporting interpretable and biologically grounded design. Together with its computational efficiency, these results demonstrate that conditional generative CLMs can unify scoring and generation to accelerate early-stage drug discovery.
摘要：生成化学语言模型（CLM）在分子设计中表现出很强的能力，但是它们在药物发现中的影响仍然受到没有可靠的奖励信号以及其产出缺乏可解释性的限制。我们提出了Safe-T，这是一种通才的化学建模框架，该框架在生物学环境（例如蛋白质靶标或作用机制）上进行条件，以优先级和设计分子，而无需依赖结构信息或工程评分功能。 Safe-T模拟了鉴于生物学及时的基于碎片的分子序列的条件可能性，从而使跨越诸如虚拟筛选，药物目标相互作用预测和活动悬崖检测等任务的分子进行了原则评分。此外，它通过从该学到的分布中取样，将分子设计与生物学目标进行对齐来支持目标定向的生成。在跨预测性（LIT-PCBA，DAVIS，KIBA，ACNET）和生成性（药物，PMO）基准的全面零摄像评估中，Safe-T始终达到与现有方法相当或更高的性能，同时更快。碎片级归因进一步表明，安全 - 捕获了已知的结构 - 活性关系，支持可解释和生物学的设计。这些结果及其计算效率表明，有条件的生成CLM可以统一评分和产生以加速早期药物发现。

Title: Show and Polish: Reference-Guided Identity Preservation in Face Video Restoration

Authors: Wenkang Han, Wang Lin, Yiyun Zhou, Qi Liu, Shulei Wang, Chang Yao, Jingyuan Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.10293
Pdf URL: https://arxiv.org/pdf/2507.10293
Copy Paste: [[2507.10293]] Show and Polish: Reference-Guided Identity Preservation in Face Video Restoration(https://arxiv.org/abs/2507.10293)
Keywords: restoration, generation
Abstract: Face Video Restoration (FVR) aims to recover high-quality face videos from degraded versions. Traditional methods struggle to preserve fine-grained, identity-specific features when degradation is severe, often producing average-looking faces that lack individual characteristics. To address these challenges, we introduce IP-FVR, a novel method that leverages a high-quality reference face image as a visual prompt to provide identity conditioning during the denoising process. IP-FVR incorporates semantically rich identity information from the reference image using decoupled cross-attention mechanisms, ensuring detailed and identity-consistent results. For intra-clip identity drift (within 24 frames), we introduce an identity-preserving feedback learning method that combines cosine similarity-based reward signals with suffix-weighted temporal aggregation. This approach effectively minimizes drift within sequences of frames. For inter-clip identity drift, we develop an exponential blending strategy that aligns identities across clips by iteratively blending frames from previous clips during the denoising process. This method ensures consistent identity representation across different clips. Additionally, we enhance the restoration process with a multi-stream negative prompt, guiding the model's attention to relevant facial attributes and minimizing the generation of low-quality or incorrect features. Extensive experiments on both synthetic and real-world datasets demonstrate that IP-FVR outperforms existing methods in both quality and identity preservation, showcasing its substantial potential for practical applications in face video restoration.
摘要：面部视频修复（FVR）旨在从退化版本中恢复高质量的面部视频。当退化严重时，通常会产生缺乏个人特征的平均面孔时，传统方法难以保留特定于身份的特定特定特征。为了应对这些挑战，我们介绍了IP-FVR，这是一种新型方法，该方法利用高质量的参考面图像作为视觉提示，以在DeNoising过程中提供身份条件。 IP-FVR使用分离的跨注意机制从参考图像中纳入了语义上丰富的身份信息，从而确保了详细和身份一致的结果。对于CLIP Intra Intra Intra-CLIP身份漂移（在24帧之内），我们引入了一种具有身份的反馈学习方法，该方法将基于余弦相似性的奖励信号与后缀加权的时间聚集结合在一起。这种方法有效地最大程度地减少了框架序列中的漂移。对于Clip Inter Clip身份漂移，我们制定了一种指数式的混合策略，该策略通过在DeNoising过程中从以前的剪辑中迭代混合框架来对齐剪辑的身份。此方法可确保在不同夹子之间保持一致的身份表示。此外，我们通过多流面负提示增强了恢复过程，从而指导模型对相关面部属性的关注，并最大程度地减少低质量或不正确特征的产生。关于合成和现实世界数据集的广泛实验表明，IP-FVR在质量和身份保存方面均优于现有方法，从而展示了其在面部视频修复中实践应用的巨大潜力。

Title: Mind the Gap: Aligning Vision Foundation Models to Image Feature Matching

Authors: Yuhan Liu, Jingwen Fu, Yang Wu, Kangyi Wu, Pengna Li, Jiayi Wu, Sanping Zhou, Jingmin Xin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.10318
Pdf URL: https://arxiv.org/pdf/2507.10318
Copy Paste: [[2507.10318]] Mind the Gap: Aligning Vision Foundation Models to Image Feature Matching(https://arxiv.org/abs/2507.10318)
Keywords: generative
Abstract: Leveraging vision foundation models has emerged as a mainstream paradigm for improving the performance of image feature matching. However, previous works have ignored the misalignment that arises when foundation models are introduced into feature matching. The misalignment stems from the discrepancy between foundation models, which focus on single-image understanding, and the cross-image understanding that feature matching requires. Specifically, 1) the embeddings derived from commonly used foundation models differ from the optimal embeddings required for feature matching; 2) there is no effective mechanism for transferring single-image understanding ability to cross-image understanding. A significant consequence of the misalignment is that these methods struggle with multi-instance feature matching problems. To address this, we introduce a simple but effective framework, called IMD (Image feature Matching with a pre-trained Diffusion model), with two parts: 1) Unlike the dominant solutions employing contrastive-learning based foundation models that emphasize global semantics, we integrate generative diffusion models to effectively capture instance-level details. 2) We leverage the prompt mechanism in generative models as a natural tunnel and propose a novel cross-image interaction prompting module to facilitate bidirectional information interaction between image pairs. To more accurately measure the misalignment, we propose a new benchmark called IMIM, which focuses on multi-instance scenarios. Our proposed IMD establishes a new state-of-the-art in commonly evaluated benchmarks, and its 12% improvement on IMIM indicates that our method efficiently mitigates the misalignment.
摘要：利用视觉基础模型已成为主流范式，可改善图像特征匹配的性能。但是，在将基础模型引入功能匹配时，以前的作品忽略了错位。未对准源于关注单像理解的基础模型与特征匹配的跨图像理解要求之间的差异。具体而言，1）从常用的基础模型中得出的嵌入式具有差异，并具有特征匹配所需的最佳嵌入； 2）缺乏有效的机制来利用单位图像理解能力进入跨图像的理解。未对准的重大结果是，在解决多种构造功能匹配问题时，他们挣扎。为了解决这个问题，我们引入了一个简单但有效的框架，称为IMD（与预训练的扩散模型匹配的图像特征），具有两个部分：1）与采用基于对比的基础基础模型的主要解决方案不同，我们强调了全球语义学，我们集成了基于生成的基于生成的扩散模型，以有效地捕获实例的详细信息。 2）我们利用生成模型中的迅速机制作为天然隧道，提出了一种新型的跨图像相互作用，促使模块促进图像对之间的双向信息相互作用。为了更准确地衡量未对准，我们提出了一个名为IMIM的新基准，该基准重点介绍了多个实体方案。我们提出的IMD在普遍评估的基准中建立了一种新的最新技术，而IMIM中的优势提高了12％，这表明我们的方法有效地减轻了未对准的方法。

Title: MoCap-Impute: A Comprehensive Benchmark and Comparative Analysis of Imputation Methods for IMU-based Motion Capture Data

Authors: Mahmoud Bekhit, Ahmad Salah, Ahmed Salim Alrawahi, Tarek Attia, Ahmed Ali, Esraa Eldesokey, Ahmed Fathalla
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.10334
Pdf URL: https://arxiv.org/pdf/2507.10334
Copy Paste: [[2507.10334]] MoCap-Impute: A Comprehensive Benchmark and Comparative Analysis of Imputation Methods for IMU-based Motion Capture Data(https://arxiv.org/abs/2507.10334)
Keywords: generative
Abstract: Motion capture (MoCap) data from wearable Inertial Measurement Units (IMUs) is vital for applications in sports science, but its utility is often compromised by missing data. Despite numerous imputation techniques, a systematic performance evaluation for IMU-derived MoCap time-series data is lacking. We address this gap by conducting a comprehensive comparative analysis of statistical, machine learning, and deep learning imputation methods. Our evaluation considers three distinct contexts: univariate time-series, multivariate across subjects, and multivariate across kinematic angles. To facilitate this benchmark, we introduce the first publicly available MoCap dataset designed specifically for imputation, featuring data from 53 karate practitioners. We simulate three controlled missingness mechanisms: missing completely at random (MCAR), block missingness, and a novel value-dependent pattern at signal transition points. Our experiments, conducted on 39 kinematic variables across all subjects, reveal that multivariate imputation frameworks consistently outperform univariate approaches, particularly for complex missingness. For instance, multivariate methods achieve up to a 50% mean absolute error reduction (MAE from 10.8 to 5.8) compared to univariate techniques for transition point missingness. Advanced models like Generative Adversarial Imputation Networks (GAIN) and Iterative Imputers demonstrate the highest accuracy in these challenging scenarios. This work provides a critical baseline for future research and offers practical recommendations for improving the integrity and robustness of MoCap data analysis.
摘要：来自可穿戴惯性测量单元（IMU）的运动捕获（MOCAP）数据对于运动科学的应用至关重要，但是丢失的数据通常会损害其效用。尽管采用了许多插补技术，但缺乏对IMU衍生的MOCAP时间序列数据进行系统的性能评估。我们通过对统计，机器学习和深度学习插补方法进行全面的比较分析来解决这一差距。我们的评估考虑了三种不同的环境：单变量时间序列，跨主题的多变量以及跨运动角度的多变量。为了促进该基准，我们介绍了专门为插补设计的第一个公开可用的MOCAP数据集，其中包含来自53位空手道从业人员的数据。我们模拟了三种受控的丢失机制：完全随机丢失（MCAR），块丢失和信号过渡点的新颖值依赖性模式。我们的实验在所有受试者的39个运动学变量上进行，表明多元插补框架始终超过单变量方法，尤其是对于复杂的缺失。例如，与单变量技术相比，多元方法达到了高达50％的平均绝对误差降低（MAE从10.8到5.8）。高级模型（例如生成对抗性插补网络）（增益）和迭代螺旋桨在这些具有挑战性的情况下表现出最高的准确性。这项工作为未来的研究提供了关键的基准，并提供了改善Mo-CAP数据分析的完整性和鲁棒性的实用建议。
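As a small illustration of the evaluation protocol described above, the sketch below masks values completely at random (MCAR), fills them with linear interpolation as a simple univariate baseline, and scores the result with mean absolute error on the masked positions only. The baseline, function names, and toy series are assumptions for illustration; none of them are the benchmark's actual methods or data. The last function works out the relative reduction for the MAE figures quoted in the abstract.

```python
import random


def mcar_mask(n, missing_rate, seed=0):
    """Mark each position as missing independently with probability missing_rate."""
    rng = random.Random(seed)
    return [rng.random() < missing_rate for _ in range(n)]


def interpolate(series, mask):
    """Fill masked entries by linear interpolation between the nearest observed values."""
    observed = [i for i, m in enumerate(mask) if not m]
    if not observed:
        raise ValueError("at least one observed value is required")
    out = list(series)
    for i, missing in enumerate(mask):
        if not missing:
            continue
        left = max((j for j in observed if j < i), default=None)
        right = min((j for j in observed if j > i), default=None)
        if left is None:            # gap at the start: copy the first observed value
            out[i] = series[right]
        elif right is None:         # gap at the end: copy the last observed value
            out[i] = series[left]
        else:
            w = (i - left) / (right - left)
            out[i] = (1 - w) * series[left] + w * series[right]
    return out


def mae(truth, imputed, mask):
    """Mean absolute error over the masked (imputed) positions only."""
    errors = [abs(t - p) for t, p, m in zip(truth, imputed, mask) if m]
    return sum(errors) / len(errors) if errors else 0.0


def relative_reduction(before, after):
    """Fractional error reduction going from `before` to `after`."""
    return (before - after) / before


truth = [0.0, 1.0, 2.0, 3.0, 4.0]
mask = [False, True, False, True, False]
filled = interpolate(truth, mask)
print(mae(truth, filled, mask))                        # 0.0 on this linear toy series
print(round(100 * relative_reduction(10.8, 5.8), 1))   # 46.3
```

The quoted drop from 10.8 to 5.8 works out to roughly a 46.3% reduction, which is consistent with the abstract's "up to 50%" wording.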

Title: Text Embedding Knows How to Quantize Text-Guided Diffusion Models

Authors: Hongjae Lee, Myungjun Son, Dongjea Kang, Seung-Won Jung
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.10340
Pdf URL: https://arxiv.org/pdf/2507.10340
Copy Paste: [[2507.10340]] Text Embedding Knows How to Quantize Text-Guided Diffusion Models(https://arxiv.org/abs/2507.10340)
Keywords: generation
Abstract: Despite the success of diffusion models in image generation tasks such as text-to-image, the enormous computational complexity of diffusion models limits their use in resource-constrained environments. To address this, network quantization has emerged as a promising solution for designing efficient diffusion models. However, existing diffusion model quantization methods do not consider input conditions, such as text prompts, as an essential source of information for quantization. In this paper, we propose a novel quantization method dubbed Quantization of Language-to-Image diffusion models using text Prompts (QLIP). QLIP leverages text prompts to guide the selection of bit precision for every layer at each time step. In addition, QLIP can be seamlessly integrated into existing quantization methods to enhance quantization efficiency. Our extensive experiments demonstrate the effectiveness of QLIP in reducing computational complexity and improving the quality of the generated images across various datasets.
摘要：尽管扩散模型在图像生成任务（例如文本对图像）中的成功取得了成功，但扩散模型的巨大计算复杂性限制了它们在资源约束环境中的使用。为了解决这个问题，网络量化已成为设计有效扩散模型的有希望的解决方案。但是，现有的扩散模型量化方法并未将输入条件（例如文本提示）视为量化的重要信息来源。在本文中，我们提出了一种使用文本提示（QLIP）的新颖量化方法，称为语言到图像扩散模型的量化。 QLIP利用文本提示在每个时间步骤中指导每一层的位精度选择。此外，QLIP可以无缝集成到现有的量化方法中，以提高量化效率。我们的广泛实验表明，QLIP在降低计算复杂性并提高各个数据集中生成的图像的质量方面的有效性。

Title: Text-Visual Semantic Constrained AI-Generated Image Quality Assessment

Authors: Qiang Li, Qingsen Yan, Haojian Huang, Peng Wu, Haokui Zhang, Yanning Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.10432
Pdf URL: https://arxiv.org/pdf/2507.10432
Copy Paste: [[2507.10432]] Text-Visual Semantic Constrained AI-Generated Image Quality Assessment(https://arxiv.org/abs/2507.10432)
Keywords: quality assessment
Abstract: With the rapid advancements in Artificial Intelligence Generated Image (AGI) technology, accurately assessing the quality of such images has become an increasingly vital requirement. Prevailing methods typically rely on cross-modal models like CLIP or BLIP to evaluate text-image alignment and visual quality. However, when applied to AGIs, these methods encounter two primary challenges: semantic misalignment and missing detail perception. To address these limitations, we propose Text-Visual Semantic Constrained AI-Generated Image Quality Assessment (SC-AGIQA), a unified framework that leverages text-visual semantic constraints to significantly enhance the comprehensive evaluation of both text-image consistency and perceptual distortion in AI-generated images. Our approach integrates key capabilities from multiple models and tackles the aforementioned challenges by introducing two core modules: the Text-assisted Semantic Alignment Module (TSAM), which leverages Multimodal Large Language Models (MLLMs) to bridge the semantic gap by generating an image description and comparing it against the original prompt for a refined consistency check, and the Frequency-domain Fine-Grained Degradation Perception Module (FFDPM), which draws inspiration from Human Visual System (HVS) properties by employing frequency domain analysis combined with perceptual sensitivity weighting to better quantify subtle visual distortions and enhance the capture of fine-grained visual quality details in images. Extensive experiments conducted on multiple benchmark datasets demonstrate that SC-AGIQA outperforms existing state-of-the-art methods. The code is publicly available at this https URL.
摘要：随着人工智能产生的图像（AGI）技术的快速发展，对其质量的准确评估已成为越来越重要的要求。普遍的方法通常依赖于剪辑或BLIP等跨模型模型来评估文本图像对齐和视觉质量。但是，当应用于AGIS时，这些方法会遇到两个主要挑战：语义错位和细节缺失。为了解决这些限制，我们提出了文本视觉语义约束的AI生成的图像质量评估（SC-Agiqa），该统一框架利用文本 - 视觉语义约束，以显着增强AI生成图像中文本图像一致性和感知失真的全面评估。我们的方法通过引入两个核心模块来整合来自多种模型的关键功能，并通过引入两个核心模块来应对上述挑战：文本辅助的语义对准模块（TSAM），该模块（TSAM）利用多模式大型语言模型（MLLM）来弥合语义上的语义差距，以通过生成图像的形象并与原始提示进行限制限制范围的限制范围，并限制了频率的一致性模式，并将其比较（FFDPM）通过采用频域分析与感知灵敏度加权相结合，从人类视觉系统（HVS）属性中汲取灵感，以更好地量化微妙的视觉失真并增强图像中细粒的视觉质量细节的捕获。在多个基准数据集上进行的广泛实验表明，SC-AGIQA的表现优于现有的最新方法。该代码在此HTTPS URL上公开可用。

Title: RefSTAR: Blind Facial Image Restoration with Reference Selection, Transfer, and Reconstruction

Authors: Zhicun Yin, Junjie Chen, Ming Liu, Zhixin Wang, Fan Li, Renjing Pei, Xiaoming Li, Rynson W.H. Lau, Wangmeng Zuo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.10470
Pdf URL: https://arxiv.org/pdf/2507.10470
Copy Paste: [[2507.10470]] RefSTAR: Blind Facial Image Restoration with Reference Selection, Transfer, and Reconstruction(https://arxiv.org/abs/2507.10470)
Keywords: restoration, generation, generative
Abstract: Blind facial image restoration is highly challenging due to unknown complex degradations and the sensitivity of humans to faces. Although existing methods introduce auxiliary information from generative priors or high-quality reference images, they still struggle with identity preservation problems, mainly due to improper feature introduction on detailed textures. In this paper, we focus on effectively incorporating appropriate features from high-quality reference images, presenting a novel blind facial image restoration method that considers reference selection, transfer, and reconstruction (RefSTAR). In terms of selection, we construct a reference selection (RefSel) module. For training the RefSel module, we construct a RefSel-HQ dataset through a mask generation pipeline, which contains annotation masks for 10,000 ground truth-reference pairs. As for the transfer, due to the trivial solution in vanilla cross-attention operations, a feature fusion paradigm is designed to force the features from the reference to be integrated. Finally, we propose a reference image reconstruction mechanism that further ensures the presence of reference image features in the output image. The cycle consistency loss is also redesigned in conjunction with the mask. Extensive experiments on various backbone models demonstrate superior performance, showing better identity preservation ability and reference feature transfer quality. Source code, dataset, and pre-trained models are available at this https URL.

Title: Graph World Model

Authors: Tao Feng, Yexin Wu, Guanyu Lin, Jiaxuan You
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.10539
Pdf URL: https://arxiv.org/pdf/2507.10539
Copy Paste: [[2507.10539]] Graph World Model(https://arxiv.org/abs/2507.10539)
Keywords: generation
Abstract: World models (WMs) demonstrate strong capabilities in prediction, generation, and planning tasks. Existing WMs primarily focus on unstructured data and cannot leverage the ubiquitous structured data, often represented as graphs, in the digital world. While multiple graph foundation models have been proposed, they focus on graph learning tasks and cannot extend to diverse multi-modal data and interdisciplinary tasks. To address these challenges, we propose the Graph World Model (GWM), a world model that supports both unstructured and graph-structured states with multi-modal information and represents diverse tasks as actions. The core of a GWM is a generic message-passing algorithm to aggregate structured information, either over a unified multi-modal token space by converting multi-modal data into text (GWM-T) or a unified multi-modal embedding space by modality-specific encoders (GWM-E). Notably, GWM introduces action nodes to support diverse tasks, where action nodes are linked to other nodes via direct reference or similarity computation. Extensive experiments on six tasks from diverse domains, including multi-modal generation and matching, recommendation, graph prediction, multi-agent, retrieval-augmented generation, and planning and optimization, show that the same GWM outperforms or matches domain-specific baselines' performance, benefits from multi-hop structures, and demonstrates strong zero-shot/few-shot capabilities on unseen new tasks. Our code for GWM is released at this https URL.
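
Illustrative note: the abstract describes action nodes that are linked to other nodes by similarity computation, followed by message passing over the resulting graph. The sketch below is a toy version of that idea under stated assumptions, not the GWM implementation: it uses cosine similarity with top-k linking and plain mean aggregation, and the names `link_action_node` and `aggregate` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def link_action_node(action_vec, node_vecs, k=2):
    """Return the ids of the k nodes most similar to the action embedding."""
    ranked = sorted(
        node_vecs,
        key=lambda n: cosine(action_vec, node_vecs[n]),
        reverse=True,
    )
    return ranked[:k]


def aggregate(node_vecs, edges, hops=1):
    """Mean aggregation: each node averages itself with its listed neighbours."""
    vecs = dict(node_vecs)
    for _ in range(hops):
        updated = {}
        for n, v in vecs.items():
            group = [v] + [vecs[m] for m in edges.get(n, [])]
            updated[n] = [sum(col) / len(group) for col in zip(*group)]
        vecs = updated
    return vecs


# Toy graph state with 2-dimensional embeddings (values are made up).
nodes = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
edges = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}

# Attach an action node to its most similar state nodes, then aggregate.
action = [1.0, 0.05]
targets = link_action_node(action, nodes, k=2)
nodes_with_action = dict(nodes, action=action)
edges_with_action = dict(edges, action=targets)
one_hop = aggregate(nodes_with_action, edges_with_action, hops=1)
two_hop = aggregate(nodes_with_action, edges_with_action, hops=2)
```

With `hops=2` the action node also receives information from `c` through `b`, which is the kind of multi-hop effect the abstract refers to.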