2025-10-31

Title: HiMAE: Hierarchical Masked Autoencoders Discover Resolution-Specific Structure in Wearable Time Series

Authors: Simon A. Lee, Cyrus Tanade, Hao Zhou, Juhyeon Lee, Megha Thukral, Minji Han, Rachel Choi, Md Sazzad Hissain Khan, Baiying Lu, Migyeong Gwak, Mehrab Bin Morshed, Viswam Nathan, Md Mahbubur Rahman, Li Zhu, Subramaniam Venkatraman, Sharanya Arcot Desai
Subjects: cs.LG, cs.AI, eess.SP
Abstract URL: https://arxiv.org/abs/2510.25785
Pdf URL: https://arxiv.org/pdf/2510.25785
Copy Paste: [[2510.25785]] HiMAE: Hierarchical Masked Autoencoders Discover Resolution-Specific Structure in Wearable Time Series(https://arxiv.org/abs/2510.25785)
Keywords: generative
Abstract: Wearable sensors provide abundant physiological time series, yet the principles governing their predictive utility remain unclear. We hypothesize that temporal resolution is a fundamental axis of representation learning, with different clinical and behavioral outcomes relying on structure at distinct scales. To test this resolution hypothesis, we introduce HiMAE (Hierarchical Masked Autoencoder), a self-supervised framework that combines masked autoencoding with a hierarchical convolutional encoder-decoder. HiMAE produces multi-resolution embeddings that enable systematic evaluation of which temporal scales carry predictive signal, transforming resolution from a hyperparameter into a probe for interpretability. Across classification, regression, and generative benchmarks, HiMAE consistently outperforms state-of-the-art foundation models that collapse scale, while being orders of magnitude smaller. HiMAE is an efficient representation learner compact enough to run entirely on-watch, achieving sub-millisecond inference on smartwatch-class CPUs for true edge inference. Together, these contributions position HiMAE as both an efficient self-supervised learning method and a discovery tool for scale-sensitive structure in wearable health.

Title: SHA-256 Infused Embedding-Driven Generative Modeling of High-Energy Molecules in Low-Data Regimes

Authors: Siddharth Verma, Alankar Alankar
Subjects: cs.LG, cond-mat.mtrl-sci
Abstract URL: https://arxiv.org/abs/2510.25788
Pdf URL: https://arxiv.org/pdf/2510.25788
Copy Paste: [[2510.25788]] SHA-256 Infused Embedding-Driven Generative Modeling of High-Energy Molecules in Low-Data Regimes(https://arxiv.org/abs/2510.25788)
Keywords: generation, generative
Abstract: High-energy materials (HEMs) are critical for propulsion and defense domains, yet their discovery remains constrained by experimental data and restricted access to testing facilities. This work presents a novel approach toward high-energy molecules by combining Long Short-Term Memory (LSTM) networks for molecular generation and Attentive Graph Neural Networks (GNN) for property predictions. We propose a transformative embedding space construction strategy that integrates fixed SHA-256 embeddings with partially trainable representations. Unlike conventional regularization techniques, this changes the representational basis itself, reshaping the molecular input space before learning begins. Without recourse to pretraining, the generator achieves 67.5% validity and 37.5% novelty. The generated library exhibits a mean Tanimoto coefficient of 0.214 relative to the training set, signifying the ability of the framework to generate a diverse chemical space. We identified 37 new super explosives with predicted detonation velocities higher than 9 km/s.
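Note: the Tanimoto coefficient quoted in this abstract is the ratio of shared to total on-bits between two molecular fingerprints. The sketch below is illustrative only: it represents fingerprints as plain sets of bit indices (an assumption; the abstract does not name a fingerprint type or toolkit) and averages over all generated/training pairs (also an assumption; the abstract does not say how the mean was aggregated). The names `tanimoto` and `mean_tanimoto` are hypothetical.

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    if not a and not b:
        return 1.0  # convention assumed here for two empty fingerprints
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def mean_tanimoto(generated: list, training: list) -> float:
    """Mean similarity over every (generated, training) pair.

    Assumption: all-pairs averaging; a nearest-neighbour mean is an
    equally plausible reading of the abstract.
    """
    scores = [tanimoto(g, t) for g in generated for t in training]
    return sum(scores) / len(scores)


# Toy fingerprints (made-up bit indices, not real molecules).
generated = [{1, 2, 3, 4}, {2, 5}]
training = [{1, 2}, {3, 4, 5, 6}]
print(round(mean_tanimoto(generated, training), 4))
```

A lower mean indicates that generated molecules share fewer fingerprint bits with the training set, which is how the abstract argues for diversity.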

Title: MedVLSynther: Synthesizing High-Quality Visual Question Answering from Medical Documents with Generator-Verifier LMMs

Authors: Xiaoke Huang, Ningsen Wang, Hui Liu, Xianfeng Tang, Yuyin Zhou
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2510.25867
Pdf URL: https://arxiv.org/pdf/2510.25867
Copy Paste: [[2510.25867]] MedVLSynther: Synthesizing High-Quality Visual Question Answering from Medical Documents with Generator-Verifier LMMs(https://arxiv.org/abs/2510.25867)
Keywords: generation
Abstract: Large Multimodal Models (LMMs) are increasingly capable of answering medical questions that require joint reasoning over images and text, yet training general medical VQA systems is impeded by the lack of large, openly usable, high-quality corpora. We present MedVLSynther, a rubric-guided generator-verifier framework that synthesizes high-quality multiple-choice VQA items directly from open biomedical literature by conditioning on figures, captions, and in-text references. The generator produces self-contained stems and parallel, mutually exclusive options under a machine-checkable JSON schema; a multi-stage verifier enforces essential gates (self-containment, single correct answer, clinical validity, image-text consistency), awards fine-grained positive points, and penalizes common failure modes before acceptance. Applying this pipeline to PubMed Central yields MedSynVQA: 13,087 audited questions over 14,803 images spanning 13 imaging modalities and 28 anatomical regions. Training open-weight LMMs with reinforcement learning using verifiable rewards improves accuracy across six medical VQA benchmarks, achieving averages of 55.85 (3B) and 58.15 (7B), with up to 77.57 on VQA-RAD and 67.76 on PathVQA, outperforming strong medical LMMs. Ablations verify that both generation and verification are necessary and that more verified data consistently helps, and a targeted contamination analysis detects no leakage from evaluation suites. By operating entirely on open literature and open-weight models, MedVLSynther offers an auditable, reproducible, and privacy-preserving path to scalable medical VQA training data.

Title: MIRO: MultI-Reward cOnditioned pretraining improves T2I quality and efficiency

Authors: Nicolas Dufour, Lucas Degeorge, Arijit Ghosh, Vicky Kalogeiton, David Picard
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2510.25897
Pdf URL: https://arxiv.org/pdf/2510.25897
Copy Paste: [[2510.25897]] MIRO: MultI-Reward cOnditioned pretraining improves T2I quality and efficiency(https://arxiv.org/abs/2510.25897)
Keywords: generation, generative
Abstract: Current text-to-image generative models are trained on large uncurated datasets to enable diverse generation capabilities. However, this does not align well with user preferences. Recently, reward models have been specifically designed to perform post-hoc selection of generated images and align them to a reward, typically user preference. This discarding of informative data, together with optimizing for a single reward, tends to harm diversity, semantic fidelity and efficiency. Instead of this post-processing, we propose to condition the model on multiple reward models during training to let the model learn user preferences directly. We show that this not only dramatically improves the visual quality of the generated images but also significantly speeds up training. Our proposed method, called MIRO, achieves state-of-the-art performance on the GenEval compositional benchmark and user-preference scores (PickAScore, ImageReward, HPSv2).

Title: Generative Image Restoration and Super-Resolution using Physics-Informed Synthetic Data for Scanning Tunneling Microscopy

Authors: Nikola L. Kolev (1,2), Tommaso Rodani (3,4), Neil J. Curson (1,2), Taylor J.Z. Stock (1,2), Alberto Cazzaniga (4) ((1) London Centre for Nanotechnology, University College London, London, United Kingdom, (2) Department of Electronic and Electrical Engineering, University College London, London, United Kingdom, (3) University of Trieste, Trieste, Italy, (4) AREA Science Park, Trieste, Italy)
Subjects: cs.CV, cond-mat.mtrl-sci
Abstract URL: https://arxiv.org/abs/2510.25921
Pdf URL: https://arxiv.org/pdf/2510.25921
Copy Paste: [[2510.25921]] Generative Image Restoration and Super-Resolution using Physics-Informed Synthetic Data for Scanning Tunneling Microscopy(https://arxiv.org/abs/2510.25921)
Keywords: restoration, super-resolution, generation, generative
Abstract: Scanning tunnelling microscopy (STM) enables atomic-resolution imaging and atom manipulation, but its utility is often limited by tip degradation and slow serial data acquisition. Fabrication adds another layer of complexity since the tip is often subjected to large voltages, which may alter the shape of its apex, requiring it to be conditioned. Here, we propose a machine learning (ML) approach for image repair and super-resolution to alleviate both challenges. Using a dataset of only 36 pristine experimental images of Si(001):H, we demonstrate that a physics-informed synthetic data generation pipeline can be used to train several state-of-the-art flow-matching and diffusion models. Quantitative evaluation with metrics such as the CLIP Maximum Mean Discrepancy (CMMD) score and structural similarity demonstrates that our models are able to effectively restore images and offer a two- to fourfold reduction in image acquisition time by accurately reconstructing images from sparsely sampled data. Our framework has the potential to significantly increase STM experimental throughput by offering a route to reducing the frequency of tip-conditioning procedures and to enhancing frame rates in existing high-speed STM systems.

Title: SplitFlow: Flow Decomposition for Inversion-Free Text-to-Image Editing

Authors: Sung-Hoon Yoon, Minghan Li, Gaspard Beaudouin, Congcong Wen, Muhammad Rafay Azhar, Mengyu Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.25970
Pdf URL: https://arxiv.org/pdf/2510.25970
Copy Paste: [[2510.25970]] SplitFlow: Flow Decomposition for Inversion-Free Text-to-Image Editing(https://arxiv.org/abs/2510.25970)
Keywords: generation, generative
Abstract: Rectified flow models have become a de facto standard in image generation due to their stable sampling trajectories and high-fidelity outputs. Despite their strong generative capabilities, they face critical limitations in image editing tasks: inaccurate inversion processes for mapping real images back into the latent space, and gradient entanglement issues during editing often result in outputs that do not faithfully reflect the target prompt. Recent efforts have attempted to directly map source and target distributions via ODE-based approaches without inversion; however, these methods still yield suboptimal editing quality. In this work, we propose a flow decomposition-and-aggregation framework built upon an inversion-free formulation to address these limitations. Specifically, we semantically decompose the target prompt into multiple sub-prompts, compute an independent flow for each, and aggregate them to form a unified editing trajectory. While we empirically observe that decomposing the original flow enhances diversity in the target space, generating semantically aligned outputs still requires consistent guidance toward the full target prompt. To this end, we design a projection and soft-aggregation mechanism for flow, inspired by gradient conflict resolution in multi-task learning. This approach adaptively weights the sub-target velocity fields, suppressing semantic redundancy while emphasizing distinct directions, thereby preserving both diversity and consistency in the final edited output. Experimental results demonstrate that our method outperforms existing zero-shot editing approaches in terms of semantic fidelity and attribute disentanglement. The code is available at this https URL.
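Note: the abstract describes a projection and soft-aggregation step for sub-prompt velocity fields, inspired by gradient conflict resolution in multi-task learning, but gives no formulas. The sketch below is not the SplitFlow method. It swaps in a PCGrad-style projection (a known multi-task technique, here applied in a fixed order rather than a random one) followed by softmax weighting, purely to illustrate the idea. The function names and the `logits` scores are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def project_conflicts(fields):
    """For each field, remove its component along any other field it conflicts with.

    Two fields conflict when their dot product is negative (PCGrad-style rule).
    """
    resolved = []
    for i, field in enumerate(fields):
        v = list(field)
        for j, other in enumerate(fields):
            if i == j:
                continue
            d = dot(v, other)
            if d < 0:  # conflicting direction: project onto the normal plane of `other`
                scale = d / dot(other, other)
                v = [a - scale * b for a, b in zip(v, other)]
        resolved.append(v)
    return resolved


def soft_aggregate(fields, logits):
    """Softmax-weighted sum of fields; `logits` are assumed per-field scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(fields[0])
    return [sum(w * f[k] for w, f in zip(weights, fields)) for k in range(dim)]


# Two toy 2-D velocity vectors that point in partly opposing directions.
sub_flows = [[1.0, 0.0], [-1.0, 1.0]]
resolved = project_conflicts(sub_flows)
print(resolved)
print(soft_aggregate(resolved, logits=[0.0, 0.0]))
```

After projection, no resolved vector has a negative dot product with the original vectors of the other sub-flows, so the aggregate no longer cancels one sub-prompt's direction against another's.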

Title: Towards Scaling Laws for Symbolic Regression

Authors: David Otte, Jörg K.H. Franke, Frank Hutter
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2510.26064
Pdf URL: https://arxiv.org/pdf/2510.26064
Copy Paste: [[2510.26064]] Towards Scaling Laws for Symbolic Regression(https://arxiv.org/abs/2510.26064)
Keywords: generation
Abstract: Symbolic regression (SR) aims to discover the underlying mathematical expressions that explain observed data. This holds promise for both gaining scientific insight and for producing inherently interpretable and generalizable models for tabular data. In this work we focus on the basics of SR. Deep learning-based SR has recently become competitive with genetic programming approaches, but the role of scale has remained largely unexplored. Inspired by scaling laws in language modeling, we present the first systematic investigation of scaling in SR, using a scalable end-to-end transformer pipeline and carefully generated training data. Across five different model sizes and spanning three orders of magnitude in compute, we find that both validation loss and solved rate follow clear power-law trends with compute. We further identify compute-optimal hyperparameter scaling: optimal batch size and learning rate grow with model size, and a token-to-parameter ratio of $\approx$15 is optimal in our regime, with a slight upward trend as compute increases. These results demonstrate that SR performance is largely predictable from compute and offer important insights for training the next generation of SR models.
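Note: a power-law trend of the kind this abstract reports is a straight line in log-log space, so its exponent can be estimated by ordinary least squares on the logarithms. The sketch below assumes a pure power law `loss = a * compute**(-b)` with no irreducible-loss offset (an assumption; the abstract does not give the functional form) and uses synthetic numbers, not results from the paper. `fit_power_law` and `tokens_for` are hypothetical helper names.

```python
import math


def fit_power_law(compute, loss):
    """Fit loss = a * compute**(-b) by least squares in log-log space. Returns (a, b)."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(v) for v in loss]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    a = math.exp(mean_y - slope * mean_x)
    return a, -slope


def tokens_for(params, ratio=15.0):
    """Training tokens implied by a fixed token-to-parameter ratio (about 15 in the abstract)."""
    return ratio * params


# Synthetic data spanning three orders of magnitude in compute.
compute = [1e15, 1e16, 1e17, 1e18]
loss = [50.0 * c ** -0.1 for c in compute]
a, b = fit_power_law(compute, loss)
print(f"a={a:.3f}, b={b:.3f}")
print(f"tokens for a 10M-parameter model: {tokens_for(10e6):.0f}")
```

The abstract notes a slight upward trend in the optimal ratio as compute grows, so a fixed `ratio` is only a first approximation.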

Title: New Money: A Systematic Review of Synthetic Data Generation for Finance

Authors: James Meldrum, Basem Suleiman, Fethi Rabhi, Muhammad Johan Alibasa
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2510.26076
Pdf URL: https://arxiv.org/pdf/2510.26076
Copy Paste: [[2510.26076]] New Money: A Systematic Review of Synthetic Data Generation for Finance(https://arxiv.org/abs/2510.26076)
Keywords: generation, generative
Abstract: Synthetic data generation has emerged as a promising approach to address the challenges of using sensitive financial data in machine learning applications. By leveraging generative models, such as Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs), it is possible to create artificial datasets that preserve the statistical properties of real financial records while mitigating privacy risks and regulatory constraints. Despite the rapid growth of this field, a comprehensive synthesis of the current research landscape has been lacking. This systematic review consolidates and analyses 72 studies published since 2018 that focus on synthetic financial data generation. We categorise the types of financial information synthesised, the generative methods employed, and the evaluation strategies used to assess data utility and privacy. The findings indicate that GAN-based approaches dominate the literature, particularly for generating time-series market data and tabular credit data. While several innovative techniques demonstrate potential for improved realism and privacy preservation, there remains a notable lack of rigorous evaluation of privacy safeguards across studies. By providing an integrated overview of generative techniques, applications, and evaluation methods, this review highlights critical research gaps and offers guidance for future work aimed at developing robust, privacy-preserving synthetic data solutions for the financial domain.

Title: Security Risk of Misalignment between Text and Image in Multi-modal Model

Authors: Xiaosen Wang, Zhijin Ge, Shaokang Wang
Subjects: cs.CV, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2510.26105
Pdf URL: https://arxiv.org/pdf/2510.26105
Copy Paste: [[2510.26105]] Security Risk of Misalignment between Text and Image in Multi-modal Model(https://arxiv.org/abs/2510.26105)
Keywords: generation
Abstract: Despite the notable advancements and versatility of multi-modal diffusion models, such as text-to-image models, their susceptibility to adversarial inputs remains underexplored. Contrary to expectations, our investigations reveal that the alignment between textual and image modalities in existing diffusion models is inadequate. This misalignment presents significant risks, especially in the generation of inappropriate or Not-Safe-For-Work (NSFW) content. To this end, we propose a novel attack called Prompt-Restricted Multi-modal Attack (PReMA) to manipulate the generated content by modifying the input image in conjunction with any specified prompt, without altering the prompt itself. PReMA is the first attack that manipulates model outputs by solely creating adversarial images, distinguishing itself from prior methods that primarily generate adversarial prompts to produce NSFW content. Consequently, PReMA poses a novel threat to the integrity of multi-modal diffusion models, particularly in image-editing applications that operate with fixed prompts. Comprehensive evaluations conducted on image inpainting and style transfer tasks across various models confirm the potent efficacy of PReMA.

Title: OracleAgent: A Multimodal Reasoning Agent for Oracle Bone Script Research

Authors: Caoshuo Li, Zengmao Ding, Xiaobin Hu, Bang Li, Donghao Luo, Xu Peng, Taisong Jin, Yongge Liu, Shengwei Han, Jing Yang, Xiaoping He, Feng Gao, AndyPian Wu, SevenShu, Chaoyang Wang, Chengjie Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.26114
Pdf URL: https://arxiv.org/pdf/2510.26114
Copy Paste: [[2510.26114]] OracleAgent: A Multimodal Reasoning Agent for Oracle Bone Script Research(https://arxiv.org/abs/2510.26114)
Keywords: generation
Abstract: As one of the earliest writing systems, Oracle Bone Script (OBS) preserves the cultural and intellectual heritage of ancient civilizations. However, current OBS research faces two major challenges: (1) the interpretation of OBS involves a complex workflow comprising multiple serial and parallel sub-tasks, and (2) the efficiency of OBS information organization and retrieval remains a critical bottleneck, as scholars often spend substantial effort searching for, compiling, and managing relevant resources. To address these challenges, we present OracleAgent, the first agent system designed for the structured management and retrieval of OBS-related information. OracleAgent seamlessly integrates multiple OBS analysis tools, empowered by large language models (LLMs), and can flexibly orchestrate these components. Additionally, we construct a comprehensive domain-specific multimodal knowledge base for OBS, which is built through a rigorous multi-year process of data collection, cleaning, and expert annotation. The knowledge base comprises over 1.4M single-character rubbing images and 80K interpretation texts. OracleAgent leverages this resource through its multimodal tools to assist experts in retrieval tasks of character, document, interpretation text, and rubbing image. Extensive experiments demonstrate that OracleAgent achieves superior performance across a range of multimodal reasoning and generation tasks, surpassing leading mainstream multimodal large language models (MLLMs) (e.g., GPT-4o). Furthermore, our case study illustrates that OracleAgent can effectively assist domain experts, significantly reducing the time cost of OBS research. These results highlight OracleAgent as a significant step toward the practical deployment of OBS-assisted research and automated interpretation systems.

Title: FullPart: Generating each 3D Part at Full Resolution

Authors: Lihe Ding, Shaocong Dong, Yaokun Li, Chenjian Gao, Xiao Chen, Rui Han, Yihao Kuang, Hong Zhang, Bo Huang, Zhanpeng Huang, Zibin Wang, Dan Xu, Tianfan Xue
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.26140
Pdf URL: https://arxiv.org/pdf/2510.26140
Copy Paste: [[2510.26140]] FullPart: Generating each 3D Part at Full Resolution(https://arxiv.org/abs/2510.26140)
Keywords: generation
Abstract: Part-based 3D generation holds great potential for various applications. Previous part generators that represent parts using implicit vector-set tokens often suffer from insufficient geometric details. Another line of work adopts an explicit voxel representation but shares a global voxel grid among all parts; this often causes small parts to occupy too few voxels, leading to degraded quality. In this paper, we propose FullPart, a novel framework that combines both implicit and explicit paradigms. It first derives the bounding box layout through an implicit box vector-set diffusion process, a task that implicit diffusion handles effectively since box tokens contain little geometric detail. Then, it generates detailed parts, each within its own fixed full-resolution voxel grid. Instead of sharing a global low-resolution space, each part in our method - even small ones - is generated at full resolution, enabling the synthesis of intricate details. We further introduce a center-point encoding strategy to address the misalignment issue when exchanging information between parts of different actual sizes, thereby maintaining global coherence. Moreover, to tackle the scarcity of reliable part data, we present PartVerse-XL, the largest human-annotated 3D part dataset to date with 40K objects and 320K parts. Extensive experiments demonstrate that FullPart achieves state-of-the-art results in 3D part generation. We will release all code, data, and model to benefit future research in 3D part generation.

Title: BasicAVSR: Arbitrary-Scale Video Super-Resolution via Image Priors and Enhanced Motion Compensation

Authors: Wei Shang, Wanying Zhang, Shuhang Gu, Pengfei Zhu, Qinghua Hu, Dongwei Ren
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.26149
Pdf URL: https://arxiv.org/pdf/2510.26149
Copy Paste: [[2510.26149]] BasicAVSR: Arbitrary-Scale Video Super-Resolution via Image Priors and Enhanced Motion Compensation(https://arxiv.org/abs/2510.26149)
Keywords: super-resolution
Abstract: Arbitrary-scale video super-resolution (AVSR) aims to enhance the resolution of video frames, potentially at various scaling factors, which presents several challenges regarding spatial detail reproduction, temporal consistency, and computational complexity. In this paper, we propose a strong baseline BasicAVSR for AVSR by integrating four key components: 1) adaptive multi-scale frequency priors generated from image Laplacian pyramids, 2) a flow-guided propagation unit to aggregate spatiotemporal information from adjacent frames, 3) a second-order motion compensation unit for more accurate spatial alignment of adjacent frames, and 4) a hyper-upsampling unit to generate scale-aware and content-independent upsampling kernels. To meet diverse application demands, we instantiate three propagation variants: (i) a unidirectional RNN unit for strictly online inference, (ii) a unidirectional RNN unit empowered with a limited lookahead that tolerates a small output delay, and (iii) a bidirectional RNN unit designed for offline tasks where computational resources are less constrained. Experimental results demonstrate the effectiveness and adaptability of our model across these different scenarios. Through extensive experiments, we show that BasicAVSR significantly outperforms existing methods in terms of super-resolution quality, generalization ability, and inference speed. Our work not only advances the state-of-the-art in AVSR but also extends its core components to multiple frameworks for diverse scenarios. The code is available at this https URL.

Title: CRAG-MM: Multi-modal Multi-turn Comprehensive RAG Benchmark

Authors: Jiaqi Wang, Xiao Yang, Kai Sun, Parth Suresh, Sanat Sharma, Adam Czyzewski, Derek Andersen, Surya Appini, Arkav Banerjee, Sajal Choudhary, Shervin Ghasemlou, Ziqiang Guan, Akil Iyer, Haidar Khan, Lingkun Kong, Roy Luo, Tiffany Ma, Zhen Qiao, David Tran, Wenfang Xu, Skyler Yeatman, Chen Zhou, Gunveer Gujral, Yinglong Xia, Shane Moon, Nicolas Scheffer, Nirav Shah, Eun Chang, Yue Liu, Florian Metze, Tammy Stark, Zhaleh Feizollahi, Andrea Jessee, Mangesh Pujari, Ahmed Aly, Babak Damavandi, Rakesh Wanga, Anuj Kumar, Rohit Patel, Wen-tau Yih, Xin Luna Dong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.26160
Pdf URL: https://arxiv.org/pdf/2510.26160
Copy Paste: [[2510.26160]] CRAG-MM: Multi-modal Multi-turn Comprehensive RAG Benchmark(https://arxiv.org/abs/2510.26160)
Keywords: generation
Abstract: Wearable devices such as smart glasses are transforming the way people interact with their surroundings, enabling users to seek information regarding entities in their view. Multi-Modal Retrieval-Augmented Generation (MM-RAG) plays a key role in supporting such questions, yet there is still no comprehensive benchmark for this task, especially regarding wearables scenarios. To fill this gap, we present CRAG-MM -- a Comprehensive RAG benchmark for Multi-modal Multi-turn conversations. CRAG-MM contains a diverse set of 6.5K (image, question, answer) triplets and 2K visual-based multi-turn conversations across 13 domains, including 6.2K egocentric images designed to mimic captures from wearable devices. We carefully constructed the questions to reflect real-world scenarios and challenges, including five types of image-quality issues, six question types, varying entity popularity, differing information dynamism, and different conversation turns. We design three tasks: single-source augmentation, multi-source augmentation, and multi-turn conversations -- each paired with an associated retrieval corpus and APIs for both image-KG retrieval and webpage retrieval. Our evaluation shows that straightforward RAG approaches achieve only 32% and 43% truthfulness on CRAG-MM single- and multi-turn QA, respectively, whereas state-of-the-art industry solutions have similar quality (32%/45%), underscoring ample room for improvement. The benchmark has hosted KDD Cup 2025, attracting about 1K participants and 5K submissions, with winning solutions improving baseline performance by 28%, highlighting its early impact on advancing the field.

Title: Sketch2PoseNet: Efficient and Generalized Sketch to 3D Human Pose Prediction

Authors: Li Wang, Yiyu Zhuang, Yanwen Wang, Xun Cao, Chuan Guo, Xinxin Zuo, Hao Zhu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.26196
Pdf URL: https://arxiv.org/pdf/2510.26196
Copy Paste: [[2510.26196]] Sketch2PoseNet: Efficient and Generalized Sketch to 3D Human Pose Prediction(https://arxiv.org/abs/2510.26196)
Keywords: generative
Abstract: 3D human pose estimation from sketches has broad applications in computer animation and film production. Unlike traditional human pose estimation, this task presents unique challenges due to the abstract and disproportionate nature of sketches. Previous sketch-to-pose methods, constrained by the lack of large-scale sketch-3D pose annotations, primarily relied on optimization with heuristic rules, an approach that is both time-consuming and limited in generalizability. To address these challenges, we propose a novel approach leveraging a "learn from synthesis" strategy. First, a diffusion model is trained to synthesize sketch images from 2D poses projected from 3D human poses, mimicking disproportionate human structures in sketches. This process enables the creation of a synthetic dataset, SKEP-120K, consisting of 120k accurate sketch-3D pose annotation pairs across various sketch styles. Building on this synthetic dataset, we introduce an end-to-end data-driven framework for estimating human poses and shapes from diverse sketch styles. Our framework combines existing 2D pose detectors and generative diffusion priors for sketch feature extraction with a feed-forward neural network for efficient 2D pose estimation. Multiple heuristic loss functions are incorporated to guarantee geometric coherence between the derived 3D poses and the detected 2D poses while preserving accurate self-contacts. Qualitative, quantitative, and subjective evaluations collectively show that our model substantially surpasses previous ones in both estimation accuracy and speed for sketch-to-pose tasks.
摘要：根据草图进行 3D 人体姿势估计在计算机动画和电影制作中具有广泛的应用。与传统的人体姿势估计不同，由于草图的抽象性和不成比例的性质，这项任务提出了独特的挑战。以前的草图到姿势方法，由于缺乏大规模草图 3D 姿势注释，主要依赖于启发式规则的优化，这种方法既耗时又普遍性有限。为了应对这些挑战，我们提出了一种利用“从综合中学习”策略的新方法。首先，训练扩散模型来合成从 3D 人体姿势投影的 2D 姿势的草图图像，模仿草图中不成比例的人体结构。此过程可以创建合成数据集 SKEP-120K，其中包含 12 万个跨各种草图样式的精确草图 3D 姿势注释对。在此合成数据集的基础上，我们引入了一个端到端数据驱动框架，用于根据不同的草图风格估计人体姿势和形状。我们的框架结合了现有的 2D 姿态检测器和用于草图特征提取的生成扩散先验，以及用于高效 2D 姿态估计的前馈神经网络。结合了多个启发式损失函数，以保证导出的 3D 姿势和检测到的 2D 姿势之间的几何一致性，同时保持准确的自接触。定性、定量和主观评估共同表明，我们的模型在草图到姿势任务的估计精度和速度方面都大大超过了以前的模型。

Title: OmniLayout: Enabling Coarse-to-Fine Learning with LLMs for Universal Document Layout Generation

Authors: Hengrui Kang, Zhuangcheng Gu, Zhiyuan Zhao, Zichen Wen, Bin Wang, Weijia Li, Conghui He
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.26213
Pdf URL: https://arxiv.org/pdf/2510.26213
Copy Paste: [[2510.26213]] OmniLayout: Enabling Coarse-to-Fine Learning with LLMs for Universal Document Layout Generation(https://arxiv.org/abs/2510.26213)
Keywords: generation, generative
Abstract: Document AI has advanced rapidly and is attracting increasing attention. Yet, while most efforts have focused on document layout analysis (DLA), its generative counterpart, document layout generation, remains underexplored. A major obstacle lies in the scarcity of diverse layouts: academic papers with Manhattan-style structures dominate existing studies, while open-world genres such as newspapers and magazines remain severely underrepresented. To address this gap, we curate OmniLayout-1M, the first million-scale dataset of diverse document layouts, covering six common document types and comprising contemporary layouts collected from multiple sources. Moreover, since existing methods struggle in complex domains and often fail to arrange long sequences coherently, we introduce OmniLayout-LLM, a 0.5B model with a two-stage Coarse-to-Fine learning paradigm: 1) learning universal layout principles from OmniLayout-1M with coarse category definitions, and 2) transferring the knowledge to a specific domain with fine-grained annotations. Extensive experiments demonstrate that our approach achieves strong performance on multiple domains in the M$^{6}$Doc dataset, substantially surpassing both existing layout generation experts and several recent general-purpose LLMs. Our code, models, and dataset will be publicly released.
摘要：文档人工智能发展迅速，并引起越来越多的关注。然而，虽然大多数工作都集中在文档布局分析 (DLA) 上，但其对应的生成文档布局生成仍然没有得到充分探索。一个主要障碍在于缺乏多样化的布局：具有曼哈顿式结构的学术论文在现有研究中占主导地位，而报纸和杂志等开放世界类型的代表性仍然严重不足。为了解决这一差距，我们策划了 OmniLayout-1M，这是第一个百万级不同文档布局的数据集，涵盖六种常见文档类型，并包含从多个来源收集的当代布局。此外，由于现有方法在复杂领域中举步维艰，并且常常无法连贯地排列长序列，因此我们引入了 OmniLayout-LLM，这是一个 0.5B 模型，具有设计的两阶段从粗到细的学习范式：1）从具有粗略类别定义的 OmniLayout-1M 中学习通用布局原则，2）通过细粒度注释将知识转移到特定领域。大量的实验表明，我们的方法在 M$^{6}$Doc 数据集的多个领域上实现了强大的性能，大大超过了现有的布局生成专家和几个最新的通用法学硕士。我们的代码、模型和数据集将公开发布。

Title: Likely Interpolants of Generative Models

Authors: Frederik Möbius Rygaard, Shen Zhu, Yinzhu Jin, Søren Hauberg, Tom Fletcher
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2510.26266
Pdf URL: https://arxiv.org/pdf/2510.26266
Copy Paste: [[2510.26266]] Likely Interpolants of Generative Models(https://arxiv.org/abs/2510.26266)
Keywords: generation, generative
Abstract: Interpolation in generative models allows for controlled generation, model inspection, and more. Unfortunately, most generative models lack a principled notion of interpolants without restrictive assumptions on either the model or data dimension. In this paper, we develop a general interpolation scheme that targets likely transition paths compatible with different metrics and probability distributions. We consider interpolants analogous to a geodesic constrained to a suitable data distribution and derive a novel algorithm for computing these curves, which requires no additional training. Theoretically, we show that our method can locally be considered a geodesic under a suitable Riemannian metric. We quantitatively show that our interpolation scheme traverses higher density regions than baselines across a range of models and datasets.
摘要：生成模型中的插值允许控制生成、模型检查等。不幸的是，大多数生成模型缺乏插值的主要概念，并且对模型或数据维度没有限制性假设。在本文中，我们开发了一种通用插值方案，其目标是与不同度量和概率分布兼容的可能转换路径。我们认为插值类似于受合适数据分布约束的测地线，并推导出一种用于计算这些曲线的新算法，该算法不需要额外的训练。从理论上讲，我们表明我们的方法局部可以被视为合适黎曼度量下的测地线。我们定量地表明，我们的插值方案在一系列模型和数据集中遍历的密度区域比基线更高。

Title: Revisiting Generative Infrared and Visible Image Fusion Based on Human Cognitive Laws

Authors: Lin Guo, Xiaoqing Luo, Wei Xie, Zhancheng Zhang, Hui Li, Rui Wang, Zhenhua Feng, Xiaoning Song
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.26268
Pdf URL: https://arxiv.org/pdf/2510.26268
Copy Paste: [[2510.26268]] Revisiting Generative Infrared and Visible Image Fusion Based on Human Cognitive Laws(https://arxiv.org/abs/2510.26268)
Keywords: generation, generative
Abstract: Existing infrared and visible image fusion methods often face the dilemma of balancing modal information. Generative fusion methods reconstruct fused images by learning from data distributions, but their generative capabilities remain limited. Moreover, the lack of interpretability in modal information selection further affects the reliability and consistency of fusion results in complex scenarios. This manuscript revisits the essence of generative image fusion under the inspiration of human cognitive laws and proposes a novel infrared and visible image fusion method, termed HCLFuse. First, HCLFuse investigates the quantification theory of information mapping in unsupervised fusion networks, which leads to the design of a multi-scale mask-regulated variational bottleneck encoder. This encoder applies posterior probability modeling and information decomposition to extract accurate and concise low-level modal information, thereby supporting the generation of high-fidelity structural details. Furthermore, the probabilistic generative capability of the diffusion model is integrated with physical laws, forming a time-varying physical guidance mechanism that adaptively regulates the generation process at different stages, thereby enhancing the ability of the model to perceive the intrinsic structure of data and reducing dependence on data quality. Experimental results show that the proposed method achieves state-of-the-art fusion performance in qualitative and quantitative evaluations across multiple datasets and significantly improves semantic segmentation metrics. This fully demonstrates the advantages of this generative image fusion method, drawing inspiration from human cognition, in enhancing structural consistency and detail quality.
摘要：现有的红外和可见光图像融合方法常常面临平衡模态信息的困境。生成融合方法通过学习数据分布来重建融合图像，但其生成能力仍然有限。此外，模态信息选择缺乏可解释性进一步影响了复杂场景下融合结果的可靠性和一致性。该手稿在人类认知规律的启发下重新审视了生成图像融合的本质，并提出了一种新颖的红外和可见光图像融合方法，称为 HCLFuse。首先，HCLFuse 研究了无监督融合网络中信息映射的量化理论，从而设计了多尺度掩模调节的变分瓶颈编码器。该编码器应用后验概率建模和信息分解来提取准确而简洁的低级模态信息，从而支持高保真结构细节的生成。此外，扩散模型的概率生成能力与物理规律相结合，形成时变物理引导机制，自适应调节不同阶段的生成过程，从而增强模型感知数据内在结构的能力，减少对数据质量的依赖。实验结果表明，所提出的方法在多个数据集的定性和定量评估中实现了最先进的融合性能，并显着提高了语义分割指标。这充分展示了这种从人类认知中汲取灵感的生成图像融合方法在增强结构一致性和细节质量方面的优势。

Title: Distributional Multi-objective Black-box Optimization for Diffusion-model Inference-time Multi-Target Generation

Authors: Kim Yong Tan, Yueming Lyu, Ivor Tsang, Yew-Soon Ong
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2510.26278
Pdf URL: https://arxiv.org/pdf/2510.26278
Copy Paste: [[2510.26278]] Distributional Multi-objective Black-box Optimization for Diffusion-model Inference-time Multi-Target Generation(https://arxiv.org/abs/2510.26278)
Keywords: generation
Abstract: Diffusion models have been successful in learning complex data distributions. This capability has driven their application to high-dimensional multi-objective black-box optimization problems. Existing approaches often wrap an external optimization loop, such as an evolutionary algorithm, around the diffusion model. However, these approaches treat the diffusion model as a black-box refiner, which overlooks the internal distribution transition of the diffusion generation process, limiting their efficiency. To address these challenges, we propose the Inference-time Multi-target Generation (IMG) algorithm, which optimizes the diffusion process at inference time to generate samples that simultaneously satisfy multiple objectives. Specifically, our IMG performs weighted resampling during the diffusion generation process according to the expected aggregated multi-objective values. This weighted resampling strategy ensures the diffusion-generated samples are distributed according to our desired multi-target Boltzmann distribution. We further derive that the multi-target Boltzmann distribution has an interesting log-likelihood interpretation, where it is the optimal solution to the distributional multi-objective optimization problem. We implemented IMG for a multi-objective molecule generation task. Experiments show that IMG, requiring only a single generation pass, achieves a significantly higher hypervolume than baseline optimization algorithms that often require hundreds of diffusion generations. Notably, our algorithm can be viewed as an optimized diffusion process and can be integrated into existing methods to further improve their performance.
摘要：扩散模型在学习复杂的数据分布方面取得了成功。这种能力推动了它们在高维多目标黑盒优化问题中的应用。现有方法通常对扩散模型采用外部优化循环，例如进化算法。然而，这些方法将扩散模型视为黑盒精炼器，忽略了扩散生成过程的内部分布转变，限制了它们的效率。为了应对这些挑战，我们提出了推理时多目标生成（IMG）算法，该算法优化推理时的扩散过程，以生成同时满足多个目标的样本。具体来说，我们的 IMG 在扩散生成过程中根据预期的聚合多目标值执行加权重采样。这种加权重采样策略确保扩散生成的样本根据我们所需的多目标玻尔兹曼分布进行分布。我们进一步推导出多目标玻尔兹曼分布具有有趣的对数似然解释，它是分布多目标优化问题的最优解。我们为多目标分子生成任务实现了 IMG。实验表明，与通常需要数百次扩散代的基线优化算法相比，仅需要单代传递的 IMG 实现了显着更高的超体积。值得注意的是，我们的算法可以被视为优化的扩散过程，并且可以集成到现有方法中以进一步提高其性能。
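
The weighted resampling step described in this abstract can be illustrated with a minimal sketch. It is not the authors' IMG implementation: the function names, the `temperature` parameter, and the convention that higher aggregated scores are better are all assumptions made for illustration. It only shows the generic idea of resampling a set of candidates with Boltzmann (softmax) weights.

```python
import math
import random


def boltzmann_weights(scores, temperature=1.0):
    """Return normalized weights proportional to exp(score / temperature)."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_resample(particles, scores, temperature=1.0, rng=None):
    """Resample particles with replacement according to Boltzmann weights.

    `scores` stands in for the expected aggregated multi-objective values
    (hypothetical; higher is assumed to be better).
    """
    rng = rng or random.Random(0)
    weights = boltzmann_weights(scores, temperature)
    return rng.choices(particles, weights=weights, k=len(particles))
```

Lowering `temperature` concentrates the resampled set on the highest-scoring candidates, while a large value leaves it close to uniform.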

Title: Beyond Imitation: Constraint-Aware Trajectory Generation with Flow Matching For End-to-End Autonomous Driving

Authors: Lin Liu, Guanyi Yu, Ziying Song, Junqiao Li, Caiyan Jia, Feiyang Jia, Peiliang Wu, Yandan Luo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.26292
Pdf URL: https://arxiv.org/pdf/2510.26292
Copy Paste: [[2510.26292]] Beyond Imitation: Constraint-Aware Trajectory Generation with Flow Matching For End-to-End Autonomous Driving(https://arxiv.org/abs/2510.26292)
Keywords: generation, generative
Abstract: Planning is a critical component of end-to-end autonomous driving. However, prevailing imitation learning methods often suffer from mode collapse, failing to produce diverse trajectory hypotheses. Meanwhile, existing generative approaches struggle to incorporate crucial safety and physical constraints directly into the generative process, necessitating an additional optimization stage to refine their outputs. To address these limitations, we propose CATG, a novel planning framework that leverages Constrained Flow Matching. Concretely, CATG explicitly models the flow matching process, which inherently mitigates mode collapse and allows for flexible guidance from various conditioning signals. Our primary contribution is the novel imposition of explicit constraints directly within the flow matching process, ensuring that the generated trajectories adhere to vital safety and kinematic rules. Secondly, CATG parameterizes driving aggressiveness as a control signal during generation, enabling precise manipulation of trajectory style. Notably, on the NavSim v2 challenge, CATG achieved 2nd place with an EPDMS score of 51.31 and was honored with the Innovation Award.
摘要：规划是端到端自动驾驶的关键组成部分。然而，流行的模仿学习方法经常遭受模式崩溃的困扰，无法产生多样化的轨迹假设。与此同时，现有的生成方法很难将关键的安全和物理约束直接纳入生成过程，因此需要额外的优化阶段来完善其输出。为了解决这些限制，我们提出了 CATG，这是一种利用约束流匹配的新颖规划框架。具体来说，CATG 明确地模拟了流量匹配过程，这本质上减轻了模式崩溃，并允许来自各种调节信号的灵活指导。我们的主要贡献是直接在流量匹配过程中直接施加显式约束，确保生成的轨迹遵守重要的安全和运动学规则。其次，CATG 在生成过程中将驾驶攻击性参数化为控制信号，从而能够精确操纵轨迹风格。值得注意的是，在 NavSim v2 挑战赛中，CATG 以 EPDMS 得分 51.31 获得第二名，并荣获创新奖。
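
For readers unfamiliar with flow matching, the sketch below shows the common linear-path formulation: a point on the path between a noise sample `x0` and a data sample `x1`, the velocity a network would be trained to regress, and Euler integration of a velocity field at sampling time. This is the unconstrained textbook variant, not CATG's constrained version; all function names are illustrative.

```python
def interpolate(x0, x1, t):
    """Point on the straight path from x0 (t=0) to x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Regression target for the velocity network on the linear path."""
    return [b - a for a, b in zip(x0, x1)]


def euler_sample(velocity_fn, x0, steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

In a constrained setting, one plausible place to apply safety or kinematic rules is inside the integration loop, but how CATG does this is not specified in the abstract.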

Title: GLYPH-SR: Can We Achieve Both High-Quality Image Super-Resolution and High-Fidelity Text Recovery via VLM-guided Latent Diffusion Model?

Authors: Mingyu Sung, Seungjae Ham, Kangwoo Kim, Yeokyoung Yoon, Sangseok Yun, Il-Min Kim, Jae-Mo Kang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2510.26339
Pdf URL: https://arxiv.org/pdf/2510.26339
Copy Paste: [[2510.26339]] GLYPH-SR: Can We Achieve Both High-Quality Image Super-Resolution and High-Fidelity Text Recovery via VLM-guided Latent Diffusion Model?(https://arxiv.org/abs/2510.26339)
Keywords: restoration, super-resolution
Abstract: Image super-resolution (SR) is fundamental to many vision systems, from surveillance and autonomy to document analysis and retail analytics, because recovering high-frequency details, especially scene-text, enables reliable downstream perception. Scene-text, i.e., text embedded in natural images such as signs, product labels, and storefronts, often carries the most actionable information; when characters are blurred or hallucinated, optical character recognition (OCR) and subsequent decisions fail even if the rest of the image appears sharp. Yet previous SR research has often been tuned to distortion (PSNR/SSIM) or learned perceptual metrics (LPIPS, MANIQA, CLIP-IQA, MUSIQ) that are largely insensitive to character-level errors. Furthermore, studies that do address text SR often focus on simplified benchmarks with isolated characters, overlooking the challenges of text within complex natural scenes. As a result, scene-text is effectively treated as generic texture. For SR to be effective in practical deployments, it is therefore essential to explicitly optimize for both text legibility and perceptual quality. We present GLYPH-SR, a vision-language-guided diffusion framework that aims to achieve both objectives jointly. GLYPH-SR utilizes a Text-SR Fusion ControlNet (TS-ControlNet) guided by OCR data, and a ping-pong scheduler that alternates between text- and scene-centric guidance. To enable targeted text restoration, we train these components on a synthetic corpus while keeping the main SR branch frozen. Across SVT, SCUT-CTW1500, and CUTE80 at x4 and x8, GLYPH-SR improves OCR F1 by up to +15.18 percentage points over diffusion/GAN baselines (SVT x8, OpenOCR) while maintaining competitive MANIQA, CLIP-IQA, and MUSIQ. GLYPH-SR is designed to satisfy both objectives simultaneously, high readability and high visual realism, delivering SR that looks right and reads right.
摘要：图像超分辨率 (SR) 是许多视觉系统的基础——从监控和自治到文档分析和零售分析——因为恢复高频细节，尤其是场景文本，可以实现可靠的下游感知。场景文本，即嵌入自然图像中的文本，例如标志、产品标签和店面，通常携带最可操作的信息；当字符模糊或出现幻觉时，即使图像的其余部分看起来很清晰，光学字符识别 (OCR) 和后续决策也会失败。然而，之前的 SR 研究通常针对失真 (PSNR/SSIM) 或学习感知指标（LIPIS、MANIQA、CLIP-IQA、MUSIQ）进行调整，这些指标对字符级错误基本上不敏感。此外，确实解决文本 SR 的研究通常侧重于具有孤立字符的简化基准，而忽视了复杂自然场景中文本的挑战。因此，场景文本被有效地视为通用纹理。因此，为了使 SR 在实际部署中有效，必须明确优化文本易读性和感知质量。我们提出了 GLYPH-SR，一种视觉语言引导的扩散框架，旨在共同实现这两个目标。 GLYPH-SR 利用由 OCR 数据引导的文本-SR 融合控制网络 (TS-ControlNet)，以及在以文本为中心和以场景为中心的引导之间交替的乒乓调度程序。为了实现有针对性的文本恢复，我们在合成语料库上训练这些组件，同时保持主 SR 分支冻结。在 x4 和 x8 的 SVT、SCUT-CTW1500 和 CUTE80 中，GLYPH-SR 将 OCR F1 比扩散/GAN 基线（SVT x8、OpenOCR）提高了 15.18 个百分点，同时保持了 MANIQA、CLIP-IQA 和 MUSIQ 的竞争力。 GLYPH-SR 旨在同时满足两个目标 - 高可读性和高视觉真实感 - 提供看起来正确且红色正确的 SR。

Title: Efficient Generative AI Boosts Probabilistic Forecasting of Sudden Stratospheric Warmings

Authors: Ningning Tao, Fei Xie, Baoxiang Pan, Hongyu Wang, Han Huang, Zhongpu Qiu, Ke Gui, Jiali Luo, Xiaosong Chen
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2510.26376
Pdf URL: https://arxiv.org/pdf/2510.26376
Copy Paste: [[2510.26376]] Efficient Generative AI Boosts Probabilistic Forecasting of Sudden Stratospheric Warmings(https://arxiv.org/abs/2510.26376)
Keywords: generative
Abstract: Sudden Stratospheric Warmings (SSWs) are key sources of subseasonal predictability and major drivers of extreme winter weather. Yet, their accurate and efficient forecast remains a persistent challenge for numerical weather prediction (NWP) systems due to limitations in physical representation, initialization, and the immense computational demands of ensemble forecasts. While data-driven forecasting is rapidly evolving, its application to the complex, three-dimensional dynamics of SSWs, particularly for probabilistic forecasting, remains underexplored. Here, we bridge this gap by developing a Flow Matching-based generative AI model (FM-Cast) for efficient and skillful probabilistic forecasting of the spatiotemporal evolution of stratospheric circulation. Evaluated across 18 major SSW events (1998-2024), FM-Cast skillfully forecasts the onset, intensity, and morphology of 10 events up to 20 days in advance, achieving ensemble accuracies above 50%. Its performance is comparable to or exceeds leading NWP systems while requiring only two minutes for a 50-member, 30-day forecast on a consumer GPU. Furthermore, leveraging FM-Cast as a scientific tool, we demonstrate through idealized experiments that SSW predictability is fundamentally linked to its underlying physical drivers, distinguishing between events forced from the troposphere and those driven by internal stratospheric dynamics. Our work thus establishes a computationally efficient paradigm for probabilistic forecasting of stratospheric anomalies and showcases generative AI's potential to deepen the physical understanding of atmosphere-climate dynamics.
摘要：平流层突然变暖（SSW）是次季节可预测性的关键来源，也是极端冬季天气的主要驱动因素。然而，由于物理表示、初始化的限制以及集合预报的巨大计算需求，准确而高效的预报仍然是数值天气预报（NWP）系统面临的持续挑战。虽然数据驱动的预测正在迅速发展，但其在复杂的、三维的 SSW 动态中的应用，特别是概率预测，仍有待探索。在这里，我们通过开发基于流量匹配的生成人工智能模型（FM-Cast）来弥补这一差距，以对平流层环流的时空演化进行高效且熟练的概率预测。通过对 18 个主要 SSW 事件（1998-2024 年）进行评估，FM-Cast 巧妙地提前 20 天预测了 10 个事件的爆发、强度和形态，实现了 50% 以上的集合准确度。其性能可与领先的 NWP 系统相媲美或超过，同时在消费级 GPU 上仅需两分钟即可对 50 个成员进行 30 天的预测。此外，利用 FM-Cast 作为科学工具，我们通过理想化实验证明，SSW 可预测性从根本上与其潜在的物理驱动因素相关，区分对流层强制事件和平流层内部动力学驱动的事件。因此，我们的工作为平流层异常的概率预测建立了一个计算有效的范式，并展示了生成人工智能加深对大气气候动力学的物理理解的潜力。

Title: EEG-Driven Image Reconstruction with Saliency-Guided Diffusion Models

Authors: Igor Abramov, Ilya Makarov
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.26391
Pdf URL: https://arxiv.org/pdf/2510.26391
Copy Paste: [[2510.26391]] EEG-Driven Image Reconstruction with Saliency-Guided Diffusion Models(https://arxiv.org/abs/2510.26391)
Keywords: generation
Abstract: Existing EEG-driven image reconstruction methods often overlook spatial attention mechanisms, limiting fidelity and semantic coherence. To address this, we propose a dual-conditioning framework that combines EEG embeddings with spatial saliency maps to enhance image generation. Our approach leverages the Adaptive Thinking Mapper (ATM) for EEG feature extraction and fine-tunes Stable Diffusion 2.1 via Low-Rank Adaptation (LoRA) to align neural signals with visual semantics, while a ControlNet branch conditions generation on saliency maps for spatial control. Evaluated on THINGS-EEG, our method achieves a significant improvement in the quality of low- and high-level image features over existing approaches while strongly aligning with human visual attention. The results demonstrate that attentional priors resolve EEG ambiguities, enabling high-fidelity reconstructions with applications in medical diagnostics and neuroadaptive interfaces, advancing neural decoding through efficient adaptation of pre-trained diffusion models.
摘要：现有的脑电图驱动的图像重建方法经常忽视空间注意机制，限制保真度和语义一致性。为了解决这个问题，我们提出了一种双条件框架，将脑电图嵌入与空间显着图结合起来以增强图像生成。我们的方法利用自适应思维映射器 (ATM) 进行脑电图特征提取，并通过低阶适应 (LoRA) 微调稳定扩散 2.1，以将神经信号与视觉语义对齐，同时 ControlNet 分支条件在显着图上生成以进行空间控制。在 THINGS-EEG 上进行评估，我们的方法比现有方法在低级和高级图像特征的质量方面取得了显着提高。同时，与人类视觉注意力高度一致。结果表明，注意力先验解决了脑电图的模糊性，实现了在医学诊断和神经自适应接口中应用的高保真重建，通过预先训练的扩散模型的有效适应来推进神经解码。
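
As background on the Low-Rank Adaptation (LoRA) step mentioned in this abstract, the sketch below computes the effective weight of an adapted layer in the usual LoRA form, W + (alpha / r) * B A, using plain Python lists. The shapes, names, and `alpha` scaling follow the common LoRA convention and are assumptions here; the abstract does not state the paper's rank or scaling.

```python
def matmul(a, b):
    """Multiply matrix a (n x k) by matrix b (k x m), both lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A.

    w: frozen weight, d_out x d_in
    a: trainable down-projection, r x d_in
    b: trainable up-projection, d_out x r
    """
    r = len(a)  # the rank is the number of rows of A
    scale = alpha / r
    delta = matmul(b, a)
    return [
        [w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
        for i in range(len(w))
    ]
```

Only `a` and `b` would be trained, which is why the adaptation is cheap: for rank r they hold r * (d_in + d_out) parameters instead of d_in * d_out.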

Title: LoCoT2V-Bench: A Benchmark for Long-Form and Complex Text-to-Video Generation

Authors: Xiangqing Zheng, Chengyue Wu, Kehai Chen, Min Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2510.26412
Pdf URL: https://arxiv.org/pdf/2510.26412
Copy Paste: [[2510.26412]] LoCoT2V-Bench: A Benchmark for Long-Form and Complex Text-to-Video Generation(https://arxiv.org/abs/2510.26412)
Keywords: generation
Abstract: Recently, text-to-video generation has made impressive progress in producing short, high-quality clips, but evaluating long-form outputs remains a major challenge, especially when processing complex prompts. Existing benchmarks mostly rely on simplified prompts and focus on low-level metrics, overlooking fine-grained alignment with prompts and abstract dimensions such as narrative coherence and thematic expression. To address these gaps, we propose LoCoT2V-Bench, a benchmark specifically designed for long video generation (LVG) under complex input conditions. Based on various real-world videos, LoCoT2V-Bench introduces a suite of realistic and complex prompts incorporating elements like scene transitions and event dynamics. Moreover, it constructs a multi-dimensional evaluation framework that includes our newly proposed metrics such as event-level alignment, fine-grained temporal consistency, content clarity, and the Human Expectation Realization Degree (HERD) that focuses on more abstract attributes like narrative flow, emotional response, and character development. Using this framework, we conduct a comprehensive evaluation of nine representative LVG models, finding that while current methods perform well on basic visual and temporal aspects, they struggle with inter-event consistency, fine-grained alignment, and high-level thematic adherence. Overall, LoCoT2V-Bench provides a comprehensive and reliable platform for evaluating long-form complex text-to-video generation and highlights critical directions for future method improvement.
摘要：最近，文本到视频的生成在制作简短的高质量剪辑方面取得了令人瞩目的进展，但评估长格式输出仍然是一个重大挑战，尤其是在处理复杂的提示时。现有的基准大多依赖于简化的提示，并专注于低级指标，忽视了与提示的细粒度对齐和抽象维度，如叙述连贯性和主题表达。为了解决这些差距，我们提出了 LoCoT2V-Bench，这是一个专门为复杂输入条件下的长视频生成 (LVG) 设计的基准。基于各种现实世界的视频，LoCoT2V-Bench 引入了一套现实且复杂的提示，其中包含场景转换和事件动态等元素。此外，它构建了一个多维度的评估框架，其中包括我们新提出的指标，例如事件级对齐、细粒度时间一致性、内容清晰度和人类期望实现度（HERD），该指标关注更抽象的属性，如叙事流程、情绪反应和角色发展。使用该框架，我们对九种代表性的 LVG 模型进行了综合评估，发现虽然当前的方法在基本的视觉和时间方面表现良好，但它们在事件间一致性、细粒度对齐和高水平主题一致性等方面存在困难。总体而言，LoCoT2V-Bench 为评估长格式复杂文本到视频的生成提供了一个全面可靠的平台，并强调了未来方法改进的关键方向。

Title: Co-Evolving Latent Action World Models

Authors: Yucen Wang, Fengming Zhang, De-Chuan Zhan, Li Zhao, Kaixin Wang, Jiang Bian
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2510.26433
Pdf URL: https://arxiv.org/pdf/2510.26433
Copy Paste: [[2510.26433]] Co-Evolving Latent Action World Models(https://arxiv.org/abs/2510.26433)
Keywords: generation
Abstract: Adapting pre-trained video generation models into controllable world models via latent actions is a promising step towards creating generalist world models. The dominant paradigm adopts a two-stage approach that trains the latent action model (LAM) and the world model separately, resulting in redundant training and limiting their potential for co-adaptation. A conceptually simple and appealing idea is to directly replace the forward dynamic model in the LAM with a powerful world model and train them jointly, but this is non-trivial and prone to representational collapse. In this work, we propose CoLA-World, which for the first time successfully realizes this synergistic paradigm, resolving the core challenge in joint learning through a critical warm-up phase that effectively aligns the representations of the from-scratch LAM with the pre-trained world model. This unlocks a co-evolution cycle: the world model acts as a knowledgeable tutor, providing gradients to shape a high-quality LAM, while the LAM offers a more precise and adaptable control interface to the world model. Empirically, CoLA-World matches or outperforms prior two-stage methods in both video simulation quality and downstream visual planning, establishing a robust and efficient new paradigm for the field.
摘要：通过潜在动作将预先训练的视频生成模型调整为可控的世界模型，是朝着创建通用世界模型迈出的有希望的一步。主导范式采用两阶段方法，分别训练潜在行动模型（LAM）和世界模型，导致冗余训练并限制了它们共同适应的潜力。一个概念上简单且有吸引力的想法是直接用强大的世界模型替换 LAM 中的前向动态模型并联合训练它们，但它并不简单并且容易出现表征崩溃。在这项工作中，我们提出了 CoLA-World，它首次成功实现了这种协同范式，通过关键的预热阶段解决了联合学习的核心挑战，有效地将从头开始的 LAM 的表示与预先训练的世界模型保持一致。这开启了一个共同进化循环：世界模型充当知识渊博的导师，提供梯度来塑造高质量的 LAM，而 LAM 为世界模型提供更精确、适应性更强的控制界面。根据经验，CoLA-World 在视频模拟质量和下游视觉规划方面均匹配或优于先前的两阶段方法，为该领域建立了稳健且高效的新范例。

Title: ReSpec: Towards Optimizing Speculative Decoding in Reinforcement Learning Systems

Authors: Qiaoling Chen, Zijun Liu, Peng Sun, Shenggui Li, Guoteng Wang, Ziming Liu, Yonggang Wen, Siyuan Feng, Tianwei Zhang
Subjects: cs.LG, cs.DC
Abstract URL: https://arxiv.org/abs/2510.26475
Pdf URL: https://arxiv.org/pdf/2510.26475
Copy Paste: [[2510.26475]] ReSpec: Towards Optimizing Speculative Decoding in Reinforcement Learning Systems(https://arxiv.org/abs/2510.26475)
Keywords: generation
Abstract: Adapting large language models (LLMs) via reinforcement learning (RL) is often bottlenecked by the generation stage, which can consume over 75% of the training time. Speculative decoding (SD) accelerates autoregressive generation in serving systems, but its behavior under RL training remains largely unexplored. We identify three critical gaps that hinder the naive integration of SD into RL systems: diminishing speedups at large batch sizes, drafter staleness under continual actor updates, and drafter-induced policy degradation. To address these gaps, we present ReSpec, a system that adapts SD to RL through three complementary mechanisms: dynamically tuning SD configurations, evolving the drafter via knowledge distillation, and weighting updates by rollout rewards. On Qwen models (3B--14B), ReSpec achieves up to 4.5x speedup while preserving reward convergence and training stability, providing a practical solution for efficient RL-based LLM adaptation.
摘要：通过强化学习 (RL) 调整大型语言模型 (LLM) 通常会遇到生成阶段的瓶颈，该阶段可能会消耗超过 75% 的训练时间。推测解码 (SD) 加速了服务系统中的自回归生成，但其在 RL 训练下的行为在很大程度上仍未被探索。我们发现了阻碍将 SD 简单集成到 RL 系统中的三个关键差距：大批量下加速速度的降低、持续参与者更新下的起草者陈旧性以及起草者引起的策略退化。为了解决这些差距，我们提出了 ReSpec，这是一个通过三种互补机制使 SD 适应 RL 的系统：动态调整 SD 配置、通过知识蒸馏发展起草者以及通过推出奖励来加权更新。在 Qwen 模型 (3B--14B) 上，ReSpec 实现了高达 4.5 倍的加速，同时保持奖励收敛和训练稳定性，为基于 RL 的高效 LLM 适应提供了实用的解决方案。

Title: Quantum Gated Recurrent GAN with Gaussian Uncertainty for Network Anomaly Detection

Authors: Wajdi Hammami, Soumaya Cherkaoui, Jean-Frederic Laprade, Ola Ahmad, Shengrui Wang
Subjects: cs.LG, cs.NI
Abstract URL: https://arxiv.org/abs/2510.26487
Pdf URL: https://arxiv.org/pdf/2510.26487
Copy Paste: [[2510.26487]] Quantum Gated Recurrent GAN with Gaussian Uncertainty for Network Anomaly Detection(https://arxiv.org/abs/2510.26487)
Keywords: generative
Abstract: Anomaly detection in time-series data is a critical challenge with significant implications for network security. Recent quantum machine learning approaches, such as quantum kernel methods and variational quantum circuits, have shown promise in capturing complex data distributions for anomaly detection but remain constrained by limited qubit counts. In this work, we introduce a novel Quantum Gated Recurrent Unit (QGRU)-based Generative Adversarial Network (GAN) employing Successive Data Injection (SuDaI) and a multi-metric gating strategy for robust network anomaly detection. Our model uniquely utilizes a quantum-enhanced generator that outputs parameters (mean and log-variance) of a Gaussian distribution via reparameterization, combined with a Wasserstein critic to stabilize adversarial training. Anomalies are identified through a novel gating mechanism that initially flags potential anomalies based on Gaussian uncertainty estimates and subsequently verifies them using a composite of critic scores and reconstruction errors. Evaluated on benchmark datasets, our method achieves a high time-series aware F1 score (TaF1) of 89.43%, demonstrating superior capability in detecting anomalies accurately and promptly compared to existing classical and quantum models. Furthermore, the trained QGRU-WGAN was deployed on real IBM Quantum hardware, where it retained high anomaly detection performance, confirming its robustness and practical feasibility on current noisy intermediate-scale quantum (NISQ) devices.
摘要：时间序列数据中的异常检测是一项严峻的挑战，对网络安全具有重大影响。最近的量子机器学习方法，例如量子核方法和变分量子电路，在捕获复杂数据分布以进行异常检测方面表现出了希望，但仍然受到有限的量子位计数的限制。我们在这项工作中引入了一种新颖的基于量子门控循环单元（QGRU）的生成对抗网络（GAN），该网络采用连续数据注入（SuDaI）和用于稳健网络异常检测的多度量门控策略。我们的模型独特地利用量子增强生成器，通过重新参数化输出高斯分布的参数（均值和对数方差），并结合 Wasserstein 批评器来稳定对抗性训练。异常是通过一种新颖的门控机制来识别的，该机制最初根据高斯不确定性估计来标记潜在的异常，然后使用评论分数和重建误差的组合来验证它们。在基准数据集上进行评估，我们的方法实现了 89.43% 的高时间序列感知 F1 分数 (TaF1)，与现有的经典模型和量子模型相比，展示了准确、及时地检测异常的卓越能力。此外，经过训练的 QGRU-WGAN 部署在真实的 IBM Quantum 硬件上，保留了较高的异常检测性能，证实了其在当前嘈杂的中级量子 (NISQ) 设备上的鲁棒性和实际可行性。
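
The two ideas named in this abstract, sampling from a Gaussian through the reparameterization trick and a two-stage gate (uncertainty first, then a composite score), can be sketched classically as follows. Everything beyond those two ideas is hypothetical: the thresholds, the `weight` mixing factor, the composite formula, and the sign convention that a lower critic score means "less realistic" are assumptions, not details from the paper.

```python
import math
import random


def reparameterize(mean, log_var, rng):
    """Sample mean + std * eps with eps ~ N(0, 1), where std = exp(0.5 * log_var)."""
    std = math.exp(0.5 * log_var)
    return mean + std * rng.gauss(0.0, 1.0)


def gated_anomaly(log_var, critic_score, recon_error,
                  var_threshold, score_threshold, weight=0.5):
    """Two-stage gate with hypothetical thresholds.

    Stage 1 flags a point only if its predicted variance is high.
    Stage 2 confirms the flag using a composite of the negated critic
    score and the reconstruction error (assumed formula).
    """
    if math.exp(log_var) <= var_threshold:
        return False  # not uncertain enough to be a candidate
    composite = weight * (-critic_score) + (1.0 - weight) * recon_error
    return composite > score_threshold
```

Predicting the log-variance rather than the variance itself keeps the standard deviation positive without any explicit constraint on the generator output.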

Title: Polybasic Speculative Decoding Through a Theoretical Perspective

Authors: Ruilin Wang, Huixia Li, Yuexiao Ma, Xiawu Zheng, Fei Chao, Xuefeng Xiao, Rongrong Ji
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2510.26527
Pdf URL: https://arxiv.org/pdf/2510.26527
Copy Paste: [[2510.26527]] Polybasic Speculative Decoding Through a Theoretical Perspective(https://arxiv.org/abs/2510.26527)
Keywords: generation
Abstract: Inference latency stands as a critical bottleneck in the large-scale deployment of Large Language Models (LLMs). Speculative decoding methods have recently shown promise in accelerating inference without compromising the output distribution. However, existing work typically relies on a dualistic draft-verify framework and lacks rigorous theoretical grounding. In this paper, we introduce a novel polybasic speculative decoding framework, underpinned by a comprehensive theoretical analysis. Specifically, we prove a fundamental theorem that characterizes the optimal inference time for multi-model speculative decoding systems, shedding light on how to extend beyond the dualistic approach to a more general polybasic paradigm. Through our theoretical investigation of multi-model token generation, we expose and optimize the interplay between model capabilities, acceptance lengths, and overall computational cost. Our framework supports both standalone implementation and integration with existing speculative techniques, leading to accelerated performance in practice. Experimental results across multiple model families demonstrate that our approach yields speedup ratios ranging from $3.31\times$ to $4.01\times$ for LLaMA2-Chat 7B, up to $3.87\times$ for LLaMA3-8B, up to $4.43\times$ for Vicuna-7B and up to $3.85\times$ for Qwen2-7B, all while preserving the original output distribution. We release our theoretical proofs and implementation code to facilitate further investigation into polybasic speculative decoding.
摘要：推理延迟是大型语言模型 (LLM) 大规模部署的关键瓶颈。推测性解码方法最近显示出在不影响输出分布的情况下加速推理的前景。然而，现有的工作通常依赖于二元草案验证框架，缺乏严格的理论基础。在本文中，我们介绍了一种新颖的 \emph{polybasic} 推测解码框架，该框架以全面的理论分析为基础。具体来说，我们证明了一个基本定理，该定理描述了多模型推测解码系统的最佳推理时间，揭示了如何超越二元方法扩展到更通用的多元范式。通过对多模型代币生成的理论研究，我们揭示并优化了模型功能、接受长度和总体计算成本之间的相互作用。我们的框架支持独立实施和与现有推测技术的集成，从而提高实践中的性能。多个模型系列的实验结果表明，我们的方法产生的加速比范围为 LLaMA2-Chat 7B 的加速比为 3.31\times$ 到 4.01\times$，LLaMA3-8B 的加速比高达 $3.87\times$，Vicuna-7B 的加速比高达 $4.43\times$，Qwen2-7B 的加速比高达 $3.85\times$，同时保留原始输出分布。我们发布了理论证明和实现代码，以促进对多元推测解码的进一步研究。
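
The "dualistic draft-verify framework" that this abstract contrasts itself with can be illustrated by the standard speculative sampling acceptance rule for a single token. The sketch below is that two-model baseline over a toy vocabulary, not the paper's polybasic method; the function name and the list-based probability inputs are illustrative.

```python
import random


def speculative_step(p_target, q_draft, rng):
    """One draft-verify step; returns (token, accepted).

    The draft model proposes a token from q_draft. It is accepted with
    probability min(1, p/q); otherwise a token is drawn from the
    normalized residual max(p - q, 0). The output is distributed as p_target.
    """
    vocab = range(len(q_draft))
    token = rng.choices(vocab, weights=q_draft, k=1)[0]
    if rng.random() < min(1.0, p_target[token] / q_draft[token]):
        return token, True
    residual = [max(p - q, 0.0) for p, q in zip(p_target, q_draft)]
    if sum(residual) == 0.0:
        return token, True  # p and q coincide, so the draft is already exact
    return rng.choices(vocab, weights=residual, k=1)[0], False
```

Because accepted and resampled tokens together follow `p_target` exactly, the speedup comes purely from how often the cheap draft is accepted, which is why acceptance length appears as a key quantity in the abstract.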

Title: Emu3.5: Native Multimodal Models are World Learners

Authors: Yufeng Cui, Honghao Chen, Haoge Deng, Xu Huang, Xinghang Li, Jirong Liu, Yang Liu, Zhuoyan Luo, Jinsheng Wang, Wenxuan Wang, Yueze Wang, Chengyuan Wang, Fan Zhang, Yingli Zhao, Ting Pan, Xianduo Li, Zecheng Hao, Wenxuan Ma, Zhuo Chen, Yulong Ao, Tiejun Huang, Zhongyuan Wang, Xinlong Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.26583
Pdf URL: https://arxiv.org/pdf/2510.26583
Copy Paste: [[2510.26583]] Emu3.5: Native Multimodal Models are World Learners(https://arxiv.org/abs/2510.26583)
Keywords: generation
Abstract: We introduce Emu3.5, a large-scale multimodal world model that natively predicts the next state across vision and language. Emu3.5 is pre-trained end-to-end with a unified next-token prediction objective on a corpus of vision-language interleaved data containing over 10 trillion tokens, primarily derived from sequential frames and transcripts of internet videos. The model naturally accepts interleaved vision-language inputs and generates interleaved vision-language outputs. Emu3.5 is further post-trained with large-scale reinforcement learning to enhance multimodal reasoning and generation. To improve inference efficiency, we propose Discrete Diffusion Adaptation (DiDA), which converts token-by-token decoding into bidirectional parallel prediction, accelerating per-image inference by about 20x without sacrificing performance. Emu3.5 exhibits strong native multimodal capabilities, including long-horizon vision-language generation, any-to-image (X2I) generation, and complex text-rich image generation. It also exhibits generalizable world-modeling abilities, enabling spatiotemporally consistent world exploration and open-world embodied manipulation across diverse scenarios and tasks. For comparison, Emu3.5 achieves performance comparable to Gemini 2.5 Flash Image (Nano Banana) on image generation and editing tasks and demonstrates superior results on a suite of interleaved generation tasks. We open-source Emu3.5 at this https URL to support community research.

Title: ResMatching: Noise-Resilient Computational Super-Resolution via Guided Conditional Flow Matching

Authors: Anirban Ray, Vera Galinova, Florian Jug
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2510.26601
Pdf URL: https://arxiv.org/pdf/2510.26601
Copy Paste: [[2510.26601]] ResMatching: Noise-Resilient Computational Super-Resolution via Guided Conditional Flow Matching(https://arxiv.org/abs/2510.26601)
Keywords: super-resolution
Abstract: Computational Super-Resolution (CSR) in fluorescence microscopy has, despite being an ill-posed problem, a long history. At its very core, CSR is about finding a prior that can be used to extrapolate frequencies in a micrograph that have never been imaged by the image-generating microscope. It stands to reason that, with the advent of better data-driven machine learning techniques, stronger priors can be learned and hence CSR can lead to better results. Here, we present ResMatching, a novel CSR method that uses guided conditional flow matching to learn such improved data-priors. We evaluate ResMatching on 4 diverse biological structures from the BioSR dataset and compare its results against 7 baselines. ResMatching consistently achieves competitive results, demonstrating in all cases the best trade-off between data fidelity and perceptual realism. We observe that CSR using ResMatching is particularly effective in cases where a strong prior is hard to learn, e.g. when the given low-resolution images contain a lot of noise. Additionally, we show that ResMatching can be used to sample from an implicitly learned posterior distribution and that this distribution is calibrated for all tested use-cases, enabling our method to deliver a pixel-wise data-uncertainty term that can guide future users to reject uncertain predictions.
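The flow matching idea named in the ResMatching abstract can be illustrated with a minimal sketch. It assumes the common straight-line (linear interpolation) path between a noise sample and a data sample; the abstract does not state which path or guidance scheme the authors use. The names `interpolate`, `target_velocity`, `euler_sample`, and the `cond` argument are hypothetical, and an oracle velocity stands in for a trained network.

```python
import random

def interpolate(x0, x1, t):
    """Point on the straight path from x0 (noise) to x1 (data) at time t in [0, 1]."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Regression target for the velocity field on a straight path: x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]

def euler_sample(velocity_fn, x0, cond, steps=10):
    """Integrate dx/dt = v(x, t, cond) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        v = velocity_fn(x, k * dt, cond)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

random.seed(0)
x1 = [0.2, 0.8, 0.5]                    # stand-in for a high-resolution patch
x0 = [random.gauss(0.0, 1.0) for _ in x1]  # noise sample
cond = [0.5, 0.5, 0.5]                  # stand-in for the low-resolution input

# Oracle velocity field (assumption): a trained network would predict this
# from (x, t, cond); here it simply returns the known target and ignores them.
def oracle(x, t, cond):
    return target_velocity(x0, x1)

sample = euler_sample(oracle, x0, cond, steps=10)
```

With the oracle field, the Euler integration lands on `x1`; in practice a learned network replaces `oracle`, and drawing different `x0` values gives different samples for the same conditioning input.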

Title: All You Need for Object Detection: From Pixels, Points, and Prompts to Next-Gen Fusion and Multimodal LLMs/VLMs in Autonomous Vehicles

Authors: Sayed Pedram Haeri Boroujeni, Niloufar Mehrabi, Hazim Alzorgan, Ahmad Sarlak, Mahlagha Fazeli, Abolfazl Razi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.26641
Pdf URL: https://arxiv.org/pdf/2510.26641
Copy Paste: [[2510.26641]] All You Need for Object Detection: From Pixels, Points, and Prompts to Next-Gen Fusion and Multimodal LLMs/VLMs in Autonomous Vehicles(https://arxiv.org/abs/2510.26641)
Keywords: generative
Abstract: Autonomous Vehicles (AVs) are transforming the future of transportation through advances in intelligent perception, decision-making, and control systems. However, their success is tied to one core capability, reliable object detection in complex and multimodal environments. While recent breakthroughs in Computer Vision (CV) and Artificial Intelligence (AI) have driven remarkable progress, the field still faces a critical challenge as knowledge remains fragmented across multimodal perception, contextual reasoning, and cooperative intelligence. This survey bridges that gap by delivering a forward-looking analysis of object detection in AVs, emphasizing emerging paradigms such as Vision-Language Models (VLMs), Large Language Models (LLMs), and Generative AI rather than re-examining outdated techniques. We begin by systematically reviewing the fundamental spectrum of AV sensors (camera, ultrasonic, LiDAR, and Radar) and their fusion strategies, highlighting not only their capabilities and limitations in dynamic driving environments but also their potential to integrate with recent advances in LLM/VLM-driven perception frameworks. Next, we introduce a structured categorization of AV datasets that moves beyond simple collections, positioning ego-vehicle, infrastructure-based, and cooperative datasets (e.g., V2V, V2I, V2X, I2I), followed by a cross-analysis of data structures and characteristics. Ultimately, we analyze cutting-edge detection methodologies, ranging from 2D and 3D pipelines to hybrid sensor fusion, with particular attention to emerging transformer-driven approaches powered by Vision Transformers (ViTs), Large and Small Language Models (SLMs), and VLMs. By synthesizing these perspectives, our survey delivers a clear roadmap of current capabilities, open challenges, and future opportunities.

Title: LSM-MS2: A Foundation Model Bridging Spectral Identification and Biological Interpretation

Authors: Gabriel Asher, Devesh Shah, Amy A. Caudy, Luke Ferro, Lea Amar, Ana S. H. Costa, Thomas Patton, Niall O'Connor, Jennifer M. Campbell, Jack Geremia
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2510.26715
Pdf URL: https://arxiv.org/pdf/2510.26715
Copy Paste: [[2510.26715]] LSM-MS2: A Foundation Model Bridging Spectral Identification and Biological Interpretation(https://arxiv.org/abs/2510.26715)
Keywords: generation
Abstract: A vast majority of mass spectrometry data remains uncharacterized, leaving much of its biological and chemical information untapped. Recent advances in machine learning have begun to address this gap, particularly for tasks such as spectral identification in tandem mass spectrometry data. Here, we present the latest generation of LSM-MS2, a large-scale deep learning foundation model trained on millions of spectra to learn a semantic chemical space. LSM-MS2 achieves state-of-the-art performance in spectral identification, improving on existing methods by 30% in accuracy of identifying challenging isomeric compounds, yielding 42% more correct identifications in complex biological samples, and maintaining robustness under low-concentration conditions. Furthermore, LSM-MS2 produces rich spectral embeddings that enable direct biological interpretation from minimal downstream data, successfully differentiating disease states and predicting clinical outcomes across diverse translational applications.

Title: STaMP: Sequence Transformation and Mixed Precision for Low-Precision Activation Quantization

Authors: Marco Federici, Riccardo Del Chiaro, Boris van Breugel, Paul Whatmough, Markus Nagel
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2510.26771
Pdf URL: https://arxiv.org/pdf/2510.26771
Copy Paste: [[2510.26771]] STaMP: Sequence Transformation and Mixed Precision for Low-Precision Activation Quantization(https://arxiv.org/abs/2510.26771)
Keywords: generative
Abstract: Quantization is the key method for reducing inference latency, power and memory footprint of generative AI models. However, accuracy often degrades sharply when activations are quantized below eight bits. Recent work suggests that invertible linear transformations (e.g. rotations) can aid quantization, by reparameterizing feature channels and weights. In this paper, we propose Sequence Transformation and Mixed Precision (STaMP) quantization, a novel strategy that applies linear transformations along the sequence dimension to exploit the strong local correlation in language and visual data. By keeping a small number of tokens in each intermediate activation at higher precision, we can maintain model accuracy at lower (average) activation bit-widths. We evaluate STaMP on recent LVM and LLM architectures, demonstrating that it significantly improves low bit-width activation quantization and complements established activation and weight quantization methods including recent feature transformations.
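The two ingredients the STaMP abstract names, an invertible linear transform along the sequence axis and a few positions kept at higher precision, can be sketched on a toy one-dimensional sequence. Everything concrete here is an assumption for illustration: the one-level Haar transform, the 8-bit/4-bit split, the symmetric uniform quantizer, and the function names are not taken from the paper.

```python
import math

def haar_forward(seq):
    """One-level orthonormal Haar transform along the sequence axis (even length)."""
    s = 1.0 / math.sqrt(2.0)
    low = [(seq[i] + seq[i + 1]) * s for i in range(0, len(seq), 2)]
    high = [(seq[i] - seq[i + 1]) * s for i in range(0, len(seq), 2)]
    return low + high

def haar_inverse(coeffs):
    """Exact inverse of haar_forward."""
    half = len(coeffs) // 2
    s = 1.0 / math.sqrt(2.0)
    out = []
    for lo, hi in zip(coeffs[:half], coeffs[half:]):
        out.extend([(lo + hi) * s, (lo - hi) * s])
    return out

def quantize(values, bits):
    """Symmetric uniform quantizer with one scale per group; returns dequantized values."""
    levels = 2 ** (bits - 1) - 1
    peak = max(abs(v) for v in values)
    if peak == 0.0:
        return list(values)
    scale = peak / levels
    return [round(v / scale) * scale for v in values]

def mixed_precision_roundtrip(seq, high_bits=8, low_bits=4):
    """Transform, quantize low-pass half at high_bits and the rest at low_bits, invert."""
    coeffs = haar_forward(seq)
    half = len(coeffs) // 2
    q = quantize(coeffs[:half], high_bits) + quantize(coeffs[half:], low_bits)
    return haar_inverse(q)

# A locally correlated toy sequence: neighbours differ only slightly.
seq = [1.0, 1.1, 1.2, 1.3, 1.2, 1.1, 1.0, 0.9]
recon = mixed_precision_roundtrip(seq)
max_err = max(abs(a - b) for a, b in zip(seq, recon))
```

Because neighbouring values are close, the transform concentrates most of the magnitude in the low-pass half, which is the part this sketch stores at the higher bit-width.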

Title: Surpassing state of the art on AMD area estimation from RGB fundus images through careful selection of U-Net architectures and loss functions for class imbalance

Authors: Valentyna Starodub, Mantas Lukoševičius
Subjects: cs.CV, cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2510.26778
Pdf URL: https://arxiv.org/pdf/2510.26778
Copy Paste: [[2510.26778]] Surpassing state of the art on AMD area estimation from RGB fundus images through careful selection of U-Net architectures and loss functions for class imbalance(https://arxiv.org/abs/2510.26778)
Keywords: generation
Abstract: Age-related macular degeneration (AMD) is one of the leading causes of irreversible vision impairment in people over the age of 60. This research focuses on semantic segmentation for AMD lesion detection in RGB fundus images, a non-invasive and cost-effective imaging technique. The results of the ADAM challenge, the most comprehensive research competition and open dataset to date for AMD detection from RGB fundus images, serve as a benchmark for our evaluation. Taking the U-Net connectivity as the basis of our framework, we evaluate and compare several approaches to improve the segmentation model's architecture and training pipeline, including pre-processing techniques, encoder (backbone) deep network types of varying complexity, and specialized loss functions to mitigate class imbalances on image and pixel levels. The main outcome of this research is the final configuration of the AMD detection framework, which outperforms all the prior ADAM challenge submissions on the multi-class segmentation of different AMD lesion types in non-invasive RGB fundus images. The source code used to conduct the experiments presented in this paper is made freely available.
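The abstract above does not name the loss functions it evaluates. As one widely used example of a loss aimed at pixel-level class imbalance in segmentation, a soft Dice loss can be sketched as follows; treating it as representative of the paper's choices is an assumption, and `soft_dice_loss` is a hypothetical name.

```python
def soft_dice_loss(probs, targets, eps=1e-6):
    """1 - Dice coefficient between predicted foreground probabilities and a binary mask."""
    intersection = sum(p * t for p, t in zip(probs, targets))
    denominator = sum(probs) + sum(targets)
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)

# Toy mask: 1 lesion pixel out of 10, flattened to a list.
targets = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]

# Predicting "all background" is 90% accurate pixel-wise,
# yet the Dice loss stays close to its maximum of 1.
all_background = [0.0] * 10
loss_trivial = soft_dice_loss(all_background, targets)

# A perfect prediction drives the loss to 0.
loss_perfect = soft_dice_loss([float(t) for t in targets], targets)
```

The loss depends only on overlap with the foreground, so the large number of correctly predicted background pixels does not mask a missed lesion.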

Title: Clone Deterministic 3D Worlds with Geometrically-Regularized World Models

Authors: Zaishuo Xia, Yukuan Lu, Xinyi Li, Yifan Xu, Yubei Chen
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2510.26782
Pdf URL: https://arxiv.org/pdf/2510.26782
Copy Paste: [[2510.26782]] Clone Deterministic 3D Worlds with Geometrically-Regularized World Models(https://arxiv.org/abs/2510.26782)
Keywords: generative
Abstract: A world model is an internal model that simulates how the world evolves. Given past observations and actions, it predicts the future of both the embodied agent and its environment. Accurate world models are essential for enabling agents to think, plan, and reason effectively in complex, dynamic settings. Despite rapid progress, current world models remain brittle and degrade over long horizons. We argue that a central cause is representation quality: exteroceptive inputs (e.g., images) are high-dimensional, and lossy or entangled latents make dynamics learning unnecessarily hard. We therefore ask whether improving representation learning alone can substantially improve world-model performance. In this work, we take a step toward building a truly accurate world model by addressing a fundamental yet open problem: constructing a model that can fully clone and overfit to a deterministic 3D world. We propose Geometrically-Regularized World Models (GRWM), which enforces that consecutive points along a natural sensory trajectory remain close in latent representation space. This approach yields significantly improved latent representations that align closely with the true topology of the environment. GRWM is plug-and-play, requires only minimal architectural modification, scales with trajectory length, and is compatible with diverse latent generative backbones. Across deterministic 3D settings and long-horizon prediction tasks, GRWM significantly increases rollout fidelity and stability. Analyses show that its benefits stem from learning a latent manifold with superior geometric structure. These findings support a clear takeaway: improving representation learning is a direct and useful path to robust world models, delivering reliable long-horizon predictions without enlarging the dynamics module.
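The GRWM abstract describes enforcing that consecutive points along a trajectory stay close in latent space. A minimal sketch of one such penalty is the mean squared distance between successive latent vectors; the exact regularizer used in the paper is not given in the abstract, so the form below and the name `temporal_closeness_penalty` are assumptions.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def temporal_closeness_penalty(latents):
    """Mean squared distance between consecutive latent vectors of one trajectory."""
    if len(latents) < 2:
        return 0.0
    pairs = zip(latents[:-1], latents[1:])
    return sum(squared_distance(a, b) for a, b in pairs) / (len(latents) - 1)

# Two toy latent trajectories of four steps each (2-D latents).
smooth = [[0.0, 0.0], [0.1, 0.0], [0.2, 0.1], [0.3, 0.1]]
jumpy = [[0.0, 0.0], [1.0, -1.0], [-0.5, 0.8], [0.9, 0.2]]

penalty_smooth = temporal_closeness_penalty(smooth)
penalty_jumpy = temporal_closeness_penalty(jumpy)
```

On its own, a penalty of this form is trivially minimized by mapping every observation to the same point, so it would have to be combined with another objective that keeps distinct observations apart.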

Title: The Quest for Generalizable Motion Generation: Data, Model, and Evaluation

Authors: Jing Lin, Ruisi Wang, Junzhe Lu, Ziqi Huang, Guorui Song, Ailing Zeng, Xian Liu, Chen Wei, Wanqi Yin, Qingping Sun, Zhongang Cai, Lei Yang, Ziwei Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.26794
Pdf URL: https://arxiv.org/pdf/2510.26794
Copy Paste: [[2510.26794]] The Quest for Generalizable Motion Generation: Data, Model, and Evaluation(https://arxiv.org/abs/2510.26794)
Keywords: generation, generative
Abstract: Despite recent advances in 3D human motion generation (MoGen) on standard benchmarks, existing models still face a fundamental bottleneck in their generalization capability. In contrast, adjacent generative fields, most notably video generation (ViGen), have demonstrated remarkable generalization in modeling human behaviors, highlighting transferable insights that MoGen can leverage. Motivated by this observation, we present a comprehensive framework that systematically transfers knowledge from ViGen to MoGen across three key pillars: data, modeling, and evaluation. First, we introduce ViMoGen-228K, a large-scale dataset comprising 228,000 high-quality motion samples that integrates high-fidelity optical MoCap data with semantically annotated motions from web videos and synthesized samples generated by state-of-the-art ViGen models. The dataset includes both text-motion pairs and text-video-motion triplets, substantially expanding semantic diversity. Second, we propose ViMoGen, a flow-matching-based diffusion transformer that unifies priors from MoCap data and ViGen models through gated multimodal conditioning. To enhance efficiency, we further develop ViMoGen-light, a distilled variant that eliminates video generation dependencies while preserving strong generalization. Finally, we present MBench, a hierarchical benchmark designed for fine-grained evaluation across motion quality, prompt fidelity, and generalization ability. Extensive experiments show that our framework significantly outperforms existing approaches in both automatic and human evaluations. The code, data, and benchmark will be made publicly available.

Title: SEE4D: Pose-Free 4D Generation via Auto-Regressive Video Inpainting

Authors: Dongyue Lu, Ao Liang, Tianxin Huang, Xiao Fu, Yuyang Zhao, Baorui Ma, Liang Pan, Wei Yin, Lingdong Kong, Wei Tsang Ooi, Ziwei Liu
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2510.26796
Pdf URL: https://arxiv.org/pdf/2510.26796
Copy Paste: [[2510.26796]] SEE4D: Pose-Free 4D Generation via Auto-Regressive Video Inpainting(https://arxiv.org/abs/2510.26796)
Keywords: generation
Abstract: Immersive applications call for synthesizing spatiotemporal 4D content from casual videos without costly 3D supervision. Existing video-to-4D methods typically rely on manually annotated camera poses, which are labor-intensive and brittle for in-the-wild footage. Recent warp-then-inpaint approaches mitigate the need for pose labels by warping input frames along a novel camera trajectory and using an inpainting model to fill missing regions, thereby depicting the 4D scene from diverse viewpoints. However, this trajectory-to-trajectory formulation often entangles camera motion with scene dynamics and complicates both modeling and inference. We introduce SEE4D, a pose-free, trajectory-to-camera framework that replaces explicit trajectory prediction with rendering to a bank of fixed virtual cameras, thereby separating camera control from scene modeling. A view-conditional video inpainting model is trained to learn a robust geometry prior by denoising realistically synthesized warped images and to inpaint occluded or missing regions across virtual viewpoints, eliminating the need for explicit 3D annotations. Building on this inpainting core, we design a spatiotemporal autoregressive inference pipeline that traverses virtual-camera splines and extends videos with overlapping windows, enabling coherent generation at bounded per-step complexity. We validate SEE4D on cross-view video generation and sparse reconstruction benchmarks. Across quantitative metrics and qualitative assessments, our method achieves superior generalization and improved performance relative to pose- or trajectory-conditioned baselines, advancing practical 4D world modeling from casual videos.
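The SEE4D abstract mentions extending videos with overlapping windows at bounded per-step cost. A generic sliding-window schedule of that kind can be sketched as below; the window length, overlap, and the function name `overlapping_windows` are hypothetical and not taken from the paper.

```python
def overlapping_windows(total_frames, window, overlap):
    """Return (start, end) frame ranges covering [0, total_frames) with shared frames."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    stride = window - overlap
    spans = []
    start = 0
    while True:
        end = min(start + window, total_frames)
        spans.append((start, end))
        if end >= total_frames:
            break
        start += stride
    return spans

# Each step processes at most `window` frames; the first `overlap` frames of
# every later window were already produced by the previous step and can serve
# as context for the new ones.
spans = overlapping_windows(total_frames=20, window=8, overlap=2)
# spans == [(0, 8), (6, 14), (12, 20)]
```

The per-step workload is bounded by `window` regardless of the total video length, which is the property the abstract attributes to the autoregressive pipeline.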

Title: OmniX: From Unified Panoramic Generation and Perception to Graphics-Ready 3D Scenes

Authors: Yukun Huang, Jiwen Yu, Yanning Zhou, Jianan Wang, Xintao Wang, Pengfei Wan, Xihui Liu
Subjects: cs.CV, cs.GR, cs.LG
Abstract URL: https://arxiv.org/abs/2510.26800
Pdf URL: https://arxiv.org/pdf/2510.26800
Copy Paste: [[2510.26800]] OmniX: From Unified Panoramic Generation and Perception to Graphics-Ready 3D Scenes(https://arxiv.org/abs/2510.26800)
Keywords: generation, generative
Abstract: There are two prevalent ways to construct 3D scenes: procedural generation and 2D lifting. Among them, panorama-based 2D lifting has emerged as a promising technique, leveraging powerful 2D generative priors to produce immersive, realistic, and diverse 3D environments. In this work, we advance this technique to generate graphics-ready 3D scenes suitable for physically based rendering (PBR), relighting, and simulation. Our key insight is to repurpose 2D generative models for panoramic perception of geometry, textures, and PBR materials. Unlike existing 2D lifting approaches that emphasize appearance generation and ignore the perception of intrinsic properties, we present OmniX, a versatile and unified framework. Based on a lightweight and efficient cross-modal adapter structure, OmniX reuses 2D generative priors for a broad range of panoramic vision tasks, including panoramic perception, generation, and completion. Furthermore, we construct a large-scale synthetic panorama dataset containing high-quality multimodal panoramas from diverse indoor and outdoor scenes. Extensive experiments demonstrate the effectiveness of our model in panoramic visual perception and graphics-ready 3D scene generation, opening new possibilities for immersive and physically realistic virtual world generation.

Title: Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark

Authors: Ziyu Guo, Xinyan Chen, Renrui Zhang, Ruichuan An, Yu Qi, Dongzhi Jiang, Xiangtai Li, Manyuan Zhang, Hongsheng Li, Pheng-Ann Heng
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2510.26802
Pdf URL: https://arxiv.org/pdf/2510.26802
Copy Paste: [[2510.26802]] Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark(https://arxiv.org/abs/2510.26802)
Keywords: generation
Abstract: Recent video generation models can produce high-fidelity, temporally coherent videos, indicating that they may encode substantial world knowledge. Beyond realistic synthesis, they also exhibit emerging behaviors indicative of visual perception, modeling, and manipulation. Yet, an important question still remains: Are video models ready to serve as zero-shot reasoners in challenging visual reasoning scenarios? In this work, we conduct an empirical study to comprehensively investigate this question, focusing on the leading and popular Veo-3. We evaluate its reasoning behavior across 12 dimensions, including spatial, geometric, physical, temporal, and embodied logic, systematically characterizing both its strengths and failure modes. To standardize this study, we curate the evaluation data into MME-CoF, a compact benchmark that enables in-depth and thorough assessment of Chain-of-Frame (CoF) reasoning. Our findings reveal that while current video models demonstrate promising reasoning patterns on short-horizon spatial coherence, fine-grained grounding, and locally consistent dynamics, they remain limited in long-horizon causal reasoning, strict geometric constraints, and abstract logic. Overall, they are not yet reliable as standalone zero-shot reasoners, but exhibit encouraging signs as complementary visual engines alongside dedicated reasoning models. Project page: this https URL