2026-01-13

Title: CrossTrafficLLM: A Human-Centric Framework for Interpretable Traffic Intelligence via Large Language Model

Authors: Zeming Du, Qitan Shao, Hongfei Liu, Yong Zhang
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2601.06042
Pdf URL: https://arxiv.org/pdf/2601.06042
Copy Paste: [[2601.06042]] CrossTrafficLLM: A Human-Centric Framework for Interpretable Traffic Intelligence via Large Language Model(https://arxiv.org/abs/2601.06042)
Keywords: generation, generative
Abstract: While accurate traffic forecasting is vital for Intelligent Transportation Systems (ITS), effectively communicating predicted conditions via natural language for human-centric decision support remains a challenge and is often handled separately. To address this, we propose CrossTrafficLLM, a novel GenAI-driven framework that simultaneously predicts future spatiotemporal traffic states and generates corresponding natural language descriptions, specifically targeting conditional abnormal event summaries. We tackle the core challenge of aligning quantitative traffic data with qualitative textual semantics by leveraging Large Language Models (LLMs) within a unified architecture. This design allows generative textual context to improve prediction accuracy while ensuring generated reports are directly informed by the forecast. Technically, a text-guided adaptive graph convolutional network is employed to effectively merge high-level semantic information with the traffic network structure. Evaluated on the BJTT dataset, CrossTrafficLLM demonstrably surpasses state-of-the-art methods in both traffic forecasting performance and text generation quality. By unifying prediction and description generation, CrossTrafficLLM delivers a more interpretable and actionable approach to generative traffic intelligence, offering significant advantages for modern ITS applications.

Title: Semantic Event Graphs for Long-Form Video Question Answering

Authors: Aradhya Dixit, Tianxi Liang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2601.06097
Pdf URL: https://arxiv.org/pdf/2601.06097
Copy Paste: [[2601.06097]] Semantic Event Graphs for Long-Form Video Question Answering(https://arxiv.org/abs/2601.06097)
Keywords: generation
Abstract: Long-form video question answering remains challenging for modern vision-language models, which struggle to reason over hour-scale footage without exceeding practical token and compute budgets. Existing systems typically downsample frames or feed dense visual embeddings to large-context language models, trading off temporal coverage against cost. We propose Semantic Event Graphs (SEG), a lightweight symbolic interface between video and language that replaces raw frames with compact temporal interaction logs. Our pipeline detects and tracks objects with YOLOv11, converts proximity patterns into START/END human-object events, and organizes them into a Temporal Scene Graph (TSG). At inference time, a query-aware pruning module identifies anchor entities and lexically relevant events, returning only a small subgraph which is verbalized and passed to Gemini 2.5 Flash for answer generation. On five YouTube videos (300-500 interactions each) and 120 automatically generated long-horizon questions, SEG achieves 65.0% accuracy using only 3.47k tokens per query, closely matching a full-log baseline (62.5% at 40.39k tokens) while reducing token usage by 91.4%. A short-context baseline restricted to the last 30 seconds collapses to 2.5% accuracy, underscoring the need for explicit temporal memory. These results show that symbolic temporal graphs can serve as an effective, plug-and-play memory layer for off-the-shelf vision-language models, preserving long-range reasoning ability while making long-form video question answering substantially more token- and cost-efficient. Code, logs, and event-extraction tools will be released for reproducibility.
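
Sketch (not from the paper): the abstract describes converting proximity patterns into START/END human-object events and reports a 91.4% token reduction (3.47k vs. 40.39k tokens per query). The Python below illustrates one plausible reading of both. The function names, the distance threshold, and the sample values are assumptions made for illustration; the authors' actual event-extraction rules are not given in the abstract.

```python
from typing import List, Tuple


def proximity_to_events(samples: List[Tuple[float, float]],
                        threshold: float) -> List[Tuple[str, float]]:
    """Turn (timestamp, distance) samples for one human-object pair
    into START/END events. The threshold rule is an assumption."""
    events = []
    near = False
    for t, dist in samples:
        if dist <= threshold and not near:
            events.append(("START", t))   # pair has just come into proximity
            near = True
        elif dist > threshold and near:
            events.append(("END", t))     # pair has just separated
            near = False
    return events


def token_reduction(pruned_tokens: float, full_tokens: float) -> float:
    """Percentage of tokens saved relative to the full-log baseline."""
    return 100.0 * (1.0 - pruned_tokens / full_tokens)


# Hypothetical distance track for a single person-object pair.
samples = [(0.0, 5.0), (1.0, 0.8), (2.0, 0.5), (3.0, 4.0), (4.0, 0.9)]
events = proximity_to_events(samples, threshold=1.0)

# Token counts quoted in the abstract (in thousands of tokens).
reduction = round(token_reduction(3.47, 40.39), 1)
print(events, reduction)
```

The token figures quoted in the abstract are consistent with each other: 1 - 3.47/40.39 rounds to 91.4%.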

Title: Stress Testing Machine Learning at $10^{10}$ Scale: A Comprehensive Study of Adversarial Robustness on Algebraically Structured Integer Streams

Authors: HyunJun Jeon
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2601.06117
Pdf URL: https://arxiv.org/pdf/2601.06117
Copy Paste: [[2601.06117]] Stress Testing Machine Learning at $10^{10}$ Scale: A Comprehensive Study of Adversarial Robustness on Algebraically Structured Integer Streams(https://arxiv.org/abs/2601.06117)
Keywords: generation
Abstract: This paper presents a large-scale stress test of machine learning systems using structured mathematical data as a benchmark. We evaluate the robustness of tree-based classifiers at an unprecedented scale, utilizing ten billion deterministic samples and five billion adversarial counterexamples. Our framework introduces three primary contributions: first, a high-throughput pipeline that reformulates Pythagorean triple generation into a single-parameter index stream, significantly improving computational efficiency over classical methods; second, the Hypothesis-driven Negative Dataset (HND), which categorizes nine classes of adversarial attacks designed to exploit both arithmetic precision and structural patterns; and third, a fault-tolerant infrastructure for reliable large-scale training. Experimental results demonstrate that while LightGBM achieves 99.99% accuracy, feature attribution reveals that the model prioritizes underlying quadratic patterns over direct algebraic verification. These findings suggest that learned heuristics can effectively identify structural representations in numerical data, potentially serving as efficient preprocessors for formal verification methods.
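
Sketch (not from the paper): the abstract mentions reformulating Pythagorean triple generation as a single-parameter index stream but does not say how. One simple way to get a single integer index, assumed here purely for illustration, is to enumerate the pairs (m, n) with m > n >= 1 in triangular order and apply Euclid's formula. This yields valid triples (not only primitive ones) and may differ from the authors' construction.

```python
import math
from typing import Tuple


def index_to_pair(k: int) -> Tuple[int, int]:
    """Map index k >= 0 to the k-th pair (m, n), m > n >= 1, in the order
    (2,1), (3,1), (3,2), (4,1), ... (an assumed enumeration)."""
    r = (1 + math.isqrt(1 + 8 * k)) // 2   # exact integer arithmetic
    m = r + 1
    n = k - r * (r - 1) // 2 + 1
    return m, n


def triple_from_index(k: int) -> Tuple[int, int, int]:
    """Euclid's formula: (m^2 - n^2, 2mn, m^2 + n^2)."""
    m, n = index_to_pair(k)
    return m * m - n * n, 2 * m * n, m * m + n * n


first_triples = [triple_from_index(k) for k in range(4)]
print(first_triples)
```

Because the index is a plain integer, the stream can be split into contiguous ranges and processed independently, which is one reason an index-based formulation is convenient at large scale.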

Title: AIS-CycleGen: A CycleGAN-Based Framework for High-Fidelity Synthetic AIS Data Generation and Augmentation

Authors: SM Ashfaq uz Zaman, Faizan Qamar, Masnizah Mohd, Nur Hanis Sabrina Suhaimi, Amith Khandakar
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2601.06127
Pdf URL: https://arxiv.org/pdf/2601.06127
Copy Paste: [[2601.06127]] AIS-CycleGen: A CycleGAN-Based Framework for High-Fidelity Synthetic AIS Data Generation and Augmentation(https://arxiv.org/abs/2601.06127)
Keywords: generation, generative
Abstract: Automatic Identification System (AIS) data are vital for maritime domain awareness, yet they often suffer from domain shifts, data sparsity, and class imbalance, which hinder the performance of predictive models. In this paper, we propose a robust data augmentation method, AISCycleGen, based on Cycle-Consistent Generative Adversarial Networks (CycleGAN), which is tailored for AIS datasets. Unlike traditional methods, AISCycleGen leverages unpaired domain translation to generate high-fidelity synthetic AIS data sequences without requiring paired source-target data. The framework employs a 1D convolutional generator with adaptive noise injection to preserve the spatiotemporal structure of AIS trajectories, enhancing the diversity and realism of the generated data. To demonstrate its efficacy, we apply AISCycleGen to several baseline regression models, showing improvements in performance across various maritime domains. The results indicate that AISCycleGen outperforms contemporary GAN-based augmentation techniques, achieving a PSNR value of 30.5 and an FID score of 38.9. These findings underscore AISCycleGen's potential as an effective and generalizable solution for augmenting AIS datasets, improving downstream model performance in real-world maritime intelligence applications.

Title: Forget Many, Forget Right: Scalable and Precise Concept Unlearning in Diffusion Models

Authors: Kaiyuan Deng, Gen Li, Yang Xiao, Bo Hui, Xiaolong Ma
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2601.06162
Pdf URL: https://arxiv.org/pdf/2601.06162
Copy Paste: [[2601.06162]] Forget Many, Forget Right: Scalable and Precise Concept Unlearning in Diffusion Models(https://arxiv.org/abs/2601.06162)
Keywords: generation
Abstract: Text-to-image diffusion models have achieved remarkable progress, yet their use raises copyright and misuse concerns, prompting research into machine unlearning. However, extending multi-concept unlearning to large-scale scenarios remains difficult due to three challenges: (i) conflicting weight updates that hinder unlearning or degrade generation; (ii) imprecise mechanisms that cause collateral damage to similar content; and (iii) reliance on additional data or modules, creating scalability bottlenecks. To address these, we propose Scalable-Precise Concept Unlearning (ScaPre), a unified framework tailored for large-scale unlearning. ScaPre introduces a conflict-aware stable design, integrating spectral trace regularization and geometry alignment to stabilize optimization, suppress conflicts, and preserve global structure. Furthermore, an Informax Decoupler identifies concept-relevant parameters and adaptively reweights updates, strictly confining unlearning to the target subspace. ScaPre yields an efficient closed-form solution without requiring auxiliary data or sub-models. Comprehensive experiments on objects, styles, and explicit content demonstrate that ScaPre effectively removes target concepts while maintaining generation quality. It forgets up to $5\times$ more concepts than the best baseline within acceptable quality limits, achieving state-of-the-art precision and efficiency for large-scale unlearning.

Title: Forget-It-All: Multi-Concept Machine Unlearning via Concept-Aware Neuron Masking

Authors: Kaiyuan Deng, Bo Hui, Gen Li, Jie Ji, Minghai Qin, Geng Yuan, Xiaolong Ma
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2601.06163
Pdf URL: https://arxiv.org/pdf/2601.06163
Copy Paste: [[2601.06163]] Forget-It-All: Multi-Concept Machine Unlearning via Concept-Aware Neuron Masking(https://arxiv.org/abs/2601.06163)
Keywords: generation
Abstract: The widespread adoption of text-to-image (T2I) diffusion models has raised concerns about their potential to generate copyrighted, inappropriate, or sensitive imagery learned from massive training corpora. As a practical solution, machine unlearning aims to selectively erase unwanted concepts from a pre-trained model without retraining from scratch. While most existing methods are effective for single-concept unlearning, they often struggle in real-world scenarios that require removing multiple concepts, since extending them to this setting is both non-trivial and problematic, causing significant challenges in unlearning effectiveness, generation quality, and sensitivity to hyperparameters and datasets. In this paper, we take a unique perspective on multi-concept unlearning by leveraging model sparsity and propose the Forget It All (FIA) framework. FIA first introduces Contrastive Concept Saliency to quantify each weight connection's contribution to a target concept. It then identifies Concept-Sensitive Neurons by combining temporal and spatial information, ensuring that only neurons consistently responsive to the target concept are selected. Finally, FIA constructs masks from the identified neurons and fuses them into a unified multi-concept mask, where Concept-Agnostic Neurons that broadly support general content generation are preserved while concept-specific neurons are pruned to remove the targets. FIA is training-free and requires only minimal hyperparameter tuning for new tasks, thereby promoting a plug-and-play paradigm. Extensive experiments across three distinct unlearning tasks demonstrate that FIA achieves more reliable multi-concept unlearning, improving forgetting effectiveness while maintaining semantic fidelity and image quality.
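
Sketch (not from the paper): the abstract says FIA fuses per-concept masks into a unified multi-concept mask in which Concept-Agnostic Neurons are preserved and concept-specific neurons are pruned. The Python below shows one minimal set-based reading of that fusion step. The function name, the neuron indices, and the idea that the agnostic set is supplied as an input are all assumptions; the abstract does not specify how either set is computed.

```python
from typing import Dict, Iterable, List, Set


def fuse_masks(concept_neurons: Dict[str, Set[int]],
               agnostic_neurons: Iterable[int],
               num_neurons: int) -> List[int]:
    """Return a keep-mask (1 = keep, 0 = prune) over num_neurons.
    A neuron is pruned if any target concept flags it, unless it is
    in the agnostic set, which is always preserved."""
    flagged = set().union(*concept_neurons.values())
    prune = flagged - set(agnostic_neurons)
    return [0 if i in prune else 1 for i in range(num_neurons)]


# Hypothetical neuron indices for two target concepts.
concept_neurons = {"concept_a": {1, 2, 5}, "concept_b": {2, 3}}
mask = fuse_masks(concept_neurons, agnostic_neurons={2}, num_neurons=6)
print(mask)
```

In this toy case neuron 2 is flagged by both concepts but kept because it is treated as concept-agnostic, while neurons 1, 3, and 5 are pruned.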

Title: Think Bright, Diffuse Nice: Enhancing T2I-ICL via Inductive-Bias Hint Instruction and Query Contrastive Decoding

Authors: Zhiyong Ma, Zhenpeng Li, Yuanjie Shi, Zhengping Li, Jiahao Chen, Qingyuan Chuai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.06169
Pdf URL: https://arxiv.org/pdf/2601.06169
Copy Paste: [[2601.06169]] Think Bright, Diffuse Nice: Enhancing T2I-ICL via Inductive-Bias Hint Instruction and Query Contrastive Decoding(https://arxiv.org/abs/2601.06169)
Keywords: generation
Abstract: Text-to-Image In-Context Learning (T2I-ICL) enables customized image synthesis via interleaved text-image examples but faces two mutually reinforcing bottlenecks, compliance failure and prior-dominated hallucination, that form a vicious cycle degrading generation quality. Existing methods rely on tailored training, which limits flexibility and raises deployment costs. To address these challenges effectively, we propose TBDN, a training-free framework integrating two complementary closed-loop mechanisms: Hint Instruction (HI) and Query Contrastive Decoding (QCD). HI injects task-aware inductive bias via lightweight prompt engineering to anchor models on contextual mapping rules, thereby mitigating compliance failure. QCD adjusts the decoding distributions of language models by contrasting full-input and query-omitted distributions, suppressing prior-dominated hallucination. TBDN achieves State-of-the-Art performance on CoBSAT and Text-to-Image Fast Mini-ImageNet, with robust generalization across model backbones, prompt designs, and hyperparameters. It also maintains promising performance in concept preservation and prompt following on Dreambench++. By breaking the two bottlenecks, TBDN establishes a simple yet effective framework for efficient and reliable T2I-ICL.
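
Sketch (not from the paper): the abstract states that QCD adjusts decoding by contrasting the full-input distribution with a query-omitted one. A common form of contrastive decoding, assumed here for illustration, amplifies the logits from the full input and subtracts the logits obtained without the query. The weight `alpha`, the function names, and the toy logits are hypothetical; the paper's exact formula may differ.

```python
import math
from typing import List


def contrastive_logits(full: List[float], omitted: List[float],
                       alpha: float = 1.0) -> List[float]:
    """(1 + alpha) * full - alpha * omitted, computed per token.
    Tokens favoured only by the model's prior (high in both) gain nothing,
    while tokens that depend on the query are boosted."""
    return [(1.0 + alpha) * f - alpha * o for f, o in zip(full, omitted)]


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]


# Toy next-token logits over a 3-token vocabulary.
full_input = [2.0, 1.0, 0.5]      # model sees context and query
query_omitted = [2.0, 0.0, 0.5]   # model sees context only

adjusted = contrastive_logits(full_input, query_omitted, alpha=1.0)
probs = softmax(adjusted)
print(adjusted, probs)
```

In the toy example, token 0 scores equally with and without the query, so its logit is unchanged, whereas token 1 rises because only the full input supports it.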

Title: CEEMDAN-Based Multiscale CNN for Wind Turbine Gearbox Fault Detection

Authors: Nejad Alagha, Anis Salwa Mohd Khairuddin, Obada Al-Khatib, Abigail Copiaco
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2601.06217
Pdf URL: https://arxiv.org/pdf/2601.06217
Copy Paste: [[2601.06217]] CEEMDAN-Based Multiscale CNN for Wind Turbine Gearbox Fault Detection(https://arxiv.org/abs/2601.06217)
Keywords: generation
Abstract: Wind turbines play a critical role in the shift toward sustainable energy generation. Their operation relies on multiple interconnected components, and a failure in any of these can compromise the entire system's functionality. Detecting faults accurately is challenging due to the intricate, non-linear, and non-stationary nature of vibration signals, influenced by dynamic loading, environmental variations, and mechanical interactions. As such, effective signal processing techniques are essential for extracting meaningful features to enhance diagnostic accuracy. This study presents a hybrid approach for fault detection in wind turbine gearboxes, combining Complete Ensemble Empirical Mode Decomposition with Adaptive Noise (CEEMDAN) and a Multiscale Convolutional Neural Network (MSCNN). CEEMDAN is employed to decompose vibration signals into intrinsic mode functions, isolating critical features at different time-frequency scales. These are then input into the MSCNN, which performs deep hierarchical feature extraction and classification. The proposed method achieves an F1 Score of 98.95%, evaluated on real-world datasets, and demonstrates superior performance in both detection accuracy and computational speed compared to existing approaches. This framework offers a balanced solution for reliable and efficient fault diagnosis in wind turbine systems.

Title: Synthetic FMCW Radar Range Azimuth Maps Augmentation with Generative Diffusion Model

Authors: Zhaoze Wang, Changxu Zhang, Tai Fei, Christopher Grimm, Yi Jin, Claas Tebruegge, Ernst Warsitz, Markus Gardill
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2601.06228
Pdf URL: https://arxiv.org/pdf/2601.06228
Copy Paste: [[2601.06228]] Synthetic FMCW Radar Range Azimuth Maps Augmentation with Generative Diffusion Model(https://arxiv.org/abs/2601.06228)
Keywords: generative
Abstract: The scarcity and low diversity of well-annotated automotive radar datasets often limit the performance of deep-learning-based environmental perception. To overcome these challenges, we propose a conditional generative framework for synthesizing realistic Frequency-Modulated Continuous-Wave radar Range-Azimuth Maps. Our approach leverages a generative diffusion model to generate radar data for multiple object categories, including pedestrians, cars, and cyclists. Specifically, conditioning is achieved via Confidence Maps, where each channel represents a semantic class and encodes Gaussian-distributed annotations at target locations. To address radar-specific characteristics, we incorporate Geometry Aware Conditioning and Temporal Consistency Regularization into the generative process. Experiments on the ROD2021 dataset demonstrate that signal reconstruction quality improves by 3.6 dB in Peak Signal-to-Noise Ratio over baseline methods, while training with a combination of real and synthetic datasets improves overall mean Average Precision by 4.15% compared with conventional image-processing-based augmentation. These results indicate that our generative framework not only produces physically plausible and diverse radar spectra but also substantially improves model generalization in downstream tasks.
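
Sketch (not from the paper): the abstract describes Confidence Maps in which each channel represents a semantic class and encodes Gaussian-distributed annotations at target locations. The Python below builds such maps on a small grid. The grid size, the isotropic `sigma`, the peak value of 1.0, and the use of a maximum to merge overlapping targets are all assumptions made for illustration.

```python
import math
from typing import List, Tuple


def confidence_maps(targets: List[Tuple[int, int, int]],
                    num_classes: int, height: int, width: int,
                    sigma: float = 1.0) -> List[List[List[float]]]:
    """Build one height x width map per class. Each target is
    (class_index, row, col); rows and columns stand in for range and
    azimuth bins. Every target adds a Gaussian bump peaking at 1.0."""
    maps = [[[0.0] * width for _ in range(height)] for _ in range(num_classes)]
    for cls, row, col in targets:
        for r in range(height):
            for c in range(width):
                sq_dist = (r - row) ** 2 + (c - col) ** 2
                value = math.exp(-sq_dist / (2.0 * sigma ** 2))
                # Keep the strongest response where bumps overlap.
                maps[cls][r][c] = max(maps[cls][r][c], value)
    return maps


# Hypothetical annotations: class 0 at (2, 3), class 1 at (0, 0).
maps = confidence_maps([(0, 2, 3), (1, 0, 0)], num_classes=3, height=4, width=5)
print(maps[0][2])
```

A class with no annotated targets in the frame, such as class 2 here, simply gets an all-zero channel.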

Title: Perception Test 2025: Challenge Summary and a Unified VQA Extension

Authors: Joseph Heyward, Nikhil Pathasarathy, Tyler Zhu, Aravindh Mahendran, João Carreira, Dima Damen, Andrew Zisserman, Viorica Pătrăucean
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.06287
Pdf URL: https://arxiv.org/pdf/2601.06287
Copy Paste: [[2601.06287]] Perception Test 2025: Challenge Summary and a Unified VQA Extension(https://arxiv.org/abs/2601.06287)
Keywords: generation
Abstract: The Third Perception Test challenge was organised as a full-day workshop alongside the IEEE/CVF International Conference on Computer Vision (ICCV) 2025. Its primary goal is to benchmark state-of-the-art video models and measure the progress in multimodal perception. This year, the workshop featured 2 guest tracks as well: KiVA (an image understanding challenge) and Physic-IQ (a video generation challenge). In this report, we summarise the results from the main Perception Test challenge, detailing both the existing tasks as well as novel additions to the benchmark. In this iteration, we placed an emphasis on task unification, as this poses a more challenging test for current SOTA multimodal models. The challenge included five consolidated tracks: unified video QA, unified object and point tracking, unified action and sound localisation, grounded video QA, and hour-long video QA, alongside an analysis and interpretability track that is still open for submissions. Notably, the unified video QA track introduced a novel subset that reformulates traditional perception tasks (such as point tracking and temporal action localisation) as multiple-choice video QA questions that video-language models can natively tackle. The unified object and point tracking merged the original object tracking and point tracking tasks, whereas the unified action and sound localisation merged the original temporal action localisation and temporal sound localisation tracks. Accordingly, we required competitors to use unified approaches rather than engineered pipelines with task-specific models. By proposing such a unified challenge, Perception Test 2025 highlights the significant difficulties existing models face when tackling diverse perception tasks through unified interfaces.

Title: Teach Diffusion Language Models to Learn from Their Own Mistakes

Authors: Liming Liu, Binxuan Huang, Xin Liu, Bing Yin, Tuo Zhao
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2601.06428
Pdf URL: https://arxiv.org/pdf/2601.06428
Copy Paste: [[2601.06428]] Teach Diffusion Language Models to Learn from Their Own Mistakes(https://arxiv.org/abs/2601.06428)
Keywords: generation, generative
Abstract: Masked Diffusion Language Models (DLMs) achieve significant speed by generating multiple tokens in parallel. However, this parallel sampling approach, especially when using fewer inference steps, will introduce strong dependency errors and cause quality to deteriorate rapidly as the generation step size grows. As a result, reliable self-correction becomes essential for maintaining high-quality multi-token generation. To address this, we propose Decoupled Self-Correction (DSC), a novel two-stage methodology. DSC first fully optimizes the DLM's generative ability before freezing the model and training a specialized correction head. This decoupling preserves the model's peak SFT performance and ensures the generated errors used for correction head training are of higher quality. Additionally, we introduce Future-Context Augmentation (FCA) to maximize the correction head's accuracy. FCA generalizes the error training distribution by augmenting samples with ground-truth tokens, effectively training the head to utilize a richer, future-looking context. This mechanism is used for reliably detecting the subtle errors of the high-fidelity base model. Our DSC framework enables the model, at inference time, to jointly generate and revise tokens, thereby correcting errors introduced by multi-token generation and mitigating error accumulation across steps. Experiments on mathematical reasoning and code generation benchmarks demonstrate that our approach substantially reduces the quality degradation associated with larger generation steps, allowing DLMs to achieve both high generation speed and strong output fidelity.

Title: 3D CoCa v2: Contrastive Learners with Test-Time Search for Generalizable Spatial Intelligence

Authors: Hao Tang, Ting Huang, Zeyu Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.06496
Pdf URL: https://arxiv.org/pdf/2601.06496
Copy Paste: [[2601.06496]] 3D CoCa v2: Contrastive Learners with Test-Time Search for Generalizable Spatial Intelligence(https://arxiv.org/abs/2601.06496)
Keywords: generation
Abstract: Spatial intelligence refers to the ability to perceive, reason about, and describe objects and their relationships within three-dimensional environments, forming a foundation for embodied perception and scene understanding. 3D captioning aims to describe 3D scenes in natural language; however, it remains challenging due to the sparsity and irregularity of point clouds and, more critically, the weak grounding and limited out-of-distribution (OOD) generalization of existing captioners across drastically different environments, including indoor and outdoor 3D scenes. To address this challenge, we propose 3D CoCa v2, a generalizable 3D captioning framework that unifies contrastive vision-language learning with 3D caption generation and further improves robustness via test-time search (TTS) without updating the captioner parameters. 3D CoCa v2 builds on a frozen CLIP-based semantic prior, a spatially-aware 3D scene encoder for geometry, and a multimodal decoder jointly optimized with contrastive and captioning objectives, avoiding external detectors or handcrafted proposals. At inference, TTS produces diverse caption candidates and performs reward-guided selection using a compact scene summary. Experiments show improvements over 3D CoCa of +1.50 CIDEr@0.5IoU on ScanRefer and +1.61 CIDEr@0.5IoU on Nr3D, and +3.8 CIDEr@0.25 in zero-shot OOD evaluation on TOD3Cap. Code will be released at this https URL.
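
Sketch (not from the paper): the abstract describes test-time search as producing diverse caption candidates and performing reward-guided selection using a compact scene summary, without updating the captioner. The Python below shows the selection step as a plain best-of-N choice. The word-overlap reward, the function names, and the example strings are hypothetical stand-ins; the paper's actual reward model is not described in the abstract.

```python
from typing import Callable, List


def overlap_reward(caption: str, scene_summary: str) -> float:
    """Toy reward: fraction of distinct caption words that also appear
    in the scene summary. A placeholder for a learned reward."""
    caption_words = set(caption.lower().split())
    summary_words = set(scene_summary.lower().split())
    return len(caption_words & summary_words) / max(len(caption_words), 1)


def select_caption(candidates: List[str], scene_summary: str,
                   reward_fn: Callable[[str, str], float]) -> str:
    """Return the candidate with the highest reward (best-of-N)."""
    return max(candidates, key=lambda cap: reward_fn(cap, scene_summary))


summary = "a wooden chair next to a round table near the window"
candidates = [
    "a sofa in the corner",
    "a chair next to a table",
    "the window is open",
]
best = select_caption(candidates, summary, overlap_reward)
print(best)
```

Because only the selection changes at inference, the captioner's parameters stay frozen, which matches the abstract's description of the search.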

Title: A novel RF-enabled Non-Destructive Inspection Method through Machine Learning and Programmable Wireless Environments

Authors: Stavros Tsimpoukis, Dimitrios Tyrovolas, Sotiris Ioannidis, Maria Kafesaki, Ian F. Akyildiz, George K. Karagiannidis, Christos K. Liaskos
Subjects: cs.LG, cs.ET
Abstract URL: https://arxiv.org/abs/2601.06512
Pdf URL: https://arxiv.org/pdf/2601.06512
Copy Paste: [[2601.06512]] A novel RF-enabled Non-Destructive Inspection Method through Machine Learning and Programmable Wireless Environments(https://arxiv.org/abs/2601.06512)
Keywords: generation, generative
Abstract: Contemporary industrial Non-Destructive Inspection (NDI) methods require sensing capabilities that operate in occluded, hazardous, or access-restricted environments. Yet, the current visual inspection based on optical cameras offers limited quality of service in that respect. In that sense, novel methods for workpiece inspection suitable for smart manufacturing are needed. Programmable Wireless Environments (PWE) could help in that direction by redefining the wireless Radio Frequency (RF) wave propagation as a controllable inspector entity. In this work, we propose a novel approach to Non-Destructive Inspection, leveraging an RF sensing pipeline based on RF wavefront encoding for retrieving workpiece-image entries from a designated database. This approach combines PWE-enabled RF wave manipulation with machine learning (ML) tools trained to produce visual outputs for quality inspection. Specifically, we establish correlation relationships between RF wavefronts and target industrial assets, hence yielding a dataset which links wavefronts to their corresponding images in a structured manner. Subsequently, a Generative Adversarial Network (GAN) derives visual representations closely matching the database entries. Our results indicate that the proposed method achieves an SSIM matching score of 99.5% in visual outputs, paving the way for next-generation quality control workflows in industry.

Title: Bridging Robustness and Efficiency: Real-Time Low-Light Enhancement via Attention U-Net GAN

Authors: Yash Thesia, Meera Suthar
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.06518
Pdf URL: https://arxiv.org/pdf/2601.06518
Copy Paste: [[2601.06518]] Bridging Robustness and Efficiency: Real-Time Low-Light Enhancement via Attention U-Net GAN(https://arxiv.org/abs/2601.06518)
Keywords: generative
Abstract: Recent advancements in Low-Light Image Enhancement (LLIE) have focused heavily on Diffusion Probabilistic Models, which achieve high perceptual quality but suffer from significant computational latency (often exceeding 2-4 seconds per image). Conversely, traditional CNN-based baselines offer real-time inference but struggle with "over-smoothing," failing to recover fine structural details in extreme low-light conditions. This creates a practical gap in the literature: the lack of a model that provides generative-level texture recovery at edge-deployable speeds. In this paper, we address this trade-off by proposing a hybrid Attention U-Net GAN. We demonstrate that the heavy iterative sampling of diffusion models is not strictly necessary for texture recovery. Instead, by integrating Attention Gates into a lightweight U-Net backbone and training within a conditional adversarial framework, we can approximate the high-frequency fidelity of generative models in a single forward pass. Extensive experiments on the SID dataset show that our method achieves a best-in-class LPIPS score of 0.112 among efficient models, significantly outperforming efficient baselines (SID, EnlightenGAN) while maintaining an inference latency of 0.06s. This represents a 40x speedup over latent diffusion models, making our approach suitable for near real-time applications.

Title: BabyVision: Visual Reasoning Beyond Language

Authors: Liang Chen, Weichu Xie, Yiyan Liang, Hongfeng He, Hans Zhao, Zhibo Yang, Zhiqi Huang, Haoning Wu, Haoyu Lu, Y. charles, Yiping Bao, Yuantao Fan, Guopeng Li, Haiyang Shen, Xuanzhong Chen, Wendong Xu, Shuzheng Si, Zefan Cai, Wenhao Chai, Ziqi Huang, Fangfu Liu, Tianyu Liu, Baobao Chang, Xiaobo Hu, Kaiyuan Chen, Yixin Ren, Yang Liu, Yuan Gong, Kuan Li
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2601.06521
Pdf URL: https://arxiv.org/pdf/2601.06521
Copy Paste: [[2601.06521]] BabyVision: Visual Reasoning Beyond Language(https://arxiv.org/abs/2601.06521)
Keywords: generation
Abstract: While humans develop core visual skills long before acquiring language, contemporary Multimodal LLMs (MLLMs) still rely heavily on linguistic priors to compensate for their fragile visual understanding. We uncovered a crucial fact: state-of-the-art MLLMs consistently fail on basic visual tasks that humans, even 3-year-olds, can solve effortlessly. To systematically investigate this gap, we introduce BabyVision, a benchmark designed to assess core visual abilities independent of linguistic knowledge for MLLMs. BabyVision spans a wide range of tasks, with 388 items divided into 22 subclasses across four key categories. Empirical results and human evaluation reveal that leading MLLMs perform significantly below human baselines. Gemini3-Pro-Preview scores 49.7, lagging behind 6-year-old humans and falling well behind the average adult score of 94.1. These results show that, despite excelling in knowledge-heavy evaluations, current MLLMs still lack fundamental visual primitives. Progress on BabyVision represents a step toward human-level visual perception and reasoning capabilities. We also explore solving visual reasoning with generation models by proposing BabyVision-Gen and an automatic evaluation toolkit. Our code and benchmark data are released at this https URL for reproduction.

Title: Mosaic: Unlocking Long-Context Inference for Diffusion LLMs via Global Memory Planning and Dynamic Peak Taming

Authors: Liang Zheng, Bowen Shi, Yitao Hu, Jiawei Zhang, Ruofan Li, Sheng Chen, Wenxin Li, Keqiu Li
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2601.06562
Pdf URL: https://arxiv.org/pdf/2601.06562
Copy Paste: [[2601.06562]] Mosaic: Unlocking Long-Context Inference for Diffusion LLMs via Global Memory Planning and Dynamic Peak Taming(https://arxiv.org/abs/2601.06562)
Keywords: generation
Abstract: Diffusion-based large language models (dLLMs) have emerged as a promising paradigm, utilizing simultaneous denoising to enable global planning and iterative refinement. While these capabilities are particularly advantageous for long-context generation, deploying such models faces a prohibitive memory capacity barrier stemming from severe system inefficiencies. We identify that existing inference systems are ill-suited for this paradigm: unlike autoregressive models constrained by the cumulative KV-cache, dLLMs are bottlenecked by transient activations recomputed at every step. Furthermore, general-purpose memory reuse mechanisms lack the global visibility to adapt to dLLMs' dynamic memory peaks, which toggle between logits and FFNs. To address these mismatches, we propose Mosaic, a memory-efficient inference system that shifts from local, static management to a global, dynamic paradigm. Mosaic integrates a mask-only logits kernel to eliminate redundancy, a lazy chunking optimizer driven by an online heuristic search to adaptively mitigate dynamic peaks, and a global memory manager to resolve fragmentation via virtual addressing. Extensive evaluations demonstrate that Mosaic achieves an average 2.71$\times$ reduction in the memory peak-to-average ratio and increases the maximum inference sequence length supportable on identical hardware by 15.89-32.98$\times$. This scalability is achieved without compromising accuracy or speed; in fact, latency is reduced by 4.12%-23.26%.
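
One plausible reading of the "mask-only logits" idea is that vocabulary logits are only needed at positions that are still masked, so the logits buffer can be sized by the number of masked tokens rather than by the full sequence length. The sketch below illustrates that reading only; the function names, toy sizes, and the pure-Python projection are hypothetical and are not taken from Mosaic.

```python
# Illustrative only: names and sizes are hypothetical, not Mosaic's kernel.

def logits_for_positions(hidden, unembed, positions):
    """Project only the selected token positions onto the vocabulary.

    hidden:    list of seq_len vectors, each of length d_model
    unembed:   list of vocab_size vectors, each of length d_model
    positions: indices of tokens that need logits at this denoising step
    """
    return {
        p: [sum(h * w for h, w in zip(hidden[p], row)) for row in unembed]
        for p in positions
    }


def logits_bytes(num_positions, vocab_size, bytes_per_value=2):
    """Size of a logits buffer with one row per position (2 bytes = fp16)."""
    return num_positions * vocab_size * bytes_per_value


hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]  # seq_len=4, d_model=2
unembed = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]              # vocab_size=3
masked = [1, 3]                                             # positions still masked

mask_only = logits_for_positions(hidden, unembed, masked)
full = logits_for_positions(hidden, unembed, range(len(hidden)))

# The masked rows are identical, but only half as many rows are materialized.
full_bytes = logits_bytes(len(hidden), len(unembed))
mask_only_bytes = logits_bytes(len(masked), len(unembed))
```

In this toy setting the buffer shrinks in proportion to the fraction of masked positions; how much that matters in practice depends on vocabulary size and sequence length, which the abstract does not specify.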

Title: Hellinger Multimodal Variational Autoencoders

Authors: Huyen Khanh Vo, Isabel Valera
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2601.06572
Pdf URL: https://arxiv.org/pdf/2601.06572
Copy Paste: [[2601.06572]] Hellinger Multimodal Variational Autoencoders(https://arxiv.org/abs/2601.06572)
Keywords: generative
Abstract: Multimodal variational autoencoders (VAEs) are widely used for weakly supervised generative learning with multiple modalities. Predominant methods aggregate unimodal inference distributions using either a product of experts (PoE), a mixture of experts (MoE), or their combinations to approximate the joint posterior. In this work, we revisit multimodal inference through the lens of probabilistic opinion pooling, an optimization-based approach. We start from Hölder pooling with $\alpha=0.5$, which corresponds to the unique symmetric member of the $\alpha\text{-divergence}$ family, and derive a moment-matching approximation, termed Hellinger. We then leverage such an approximation to propose HELVAE, a multimodal VAE that avoids sub-sampling, yielding an efficient yet effective model that: (i) learns more expressive latent representations as additional modalities are observed; and (ii) empirically achieves better trade-offs between generative coherence and quality, outperforming state-of-the-art multimodal VAE models.
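
To make the pooling step concrete, here is a minimal numerical sketch. It assumes the usual power-mean form of Hölder pooling, q(z) ∝ (Σ_i w_i q_i(z)^α)^(1/α), and moment-matches a single Gaussian to the pooled density on a 1-D grid. The abstract does not give the paper's closed-form derivation, so the grid integration, the function names, and the toy experts are illustrative assumptions rather than the HELVAE method.

```python
import math


def gaussian_pdf(z, mean, std):
    return math.exp(-0.5 * ((z - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))


def holder_pool_moments(experts, weights, alpha=0.5, lo=-10.0, hi=10.0, n=4001):
    """Mean and variance of the Hölder (power-mean) pool of 1-D Gaussian experts.

    Assumed pool: q(z) proportional to (sum_i w_i * q_i(z)**alpha) ** (1 / alpha).
    Integrals are approximated with a Riemann sum on a uniform grid.
    """
    step = (hi - lo) / (n - 1)
    grid = [lo + k * step for k in range(n)]
    dens = [
        sum(w * gaussian_pdf(z, m, s) ** alpha
            for w, (m, s) in zip(weights, experts)) ** (1.0 / alpha)
        for z in grid
    ]
    mass = sum(dens) * step
    mean = sum(z * d for z, d in zip(grid, dens)) * step / mass
    var = sum((z - mean) ** 2 * d for z, d in zip(grid, dens)) * step / mass
    return mean, var


# Two hypothetical unimodal posteriors given as (mean, std), pooled with equal weights.
experts = [(-1.0, 1.0), (1.0, 1.0)]
mean, var = holder_pool_moments(experts, [0.5, 0.5])
```

For these two experts the pooled variance comes out at roughly 1.62, below the value of 2 that an equal-weight mixture (the α = 1 case of the same formula) would give.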

Title: APEX: Learning Adaptive Priorities for Multi-Objective Alignment in Vision-Language Generation

Authors: Dongliang Chen, Xinlin Zhuang, Junjie Xu, Luojian Xie, Zehui Wang, Jiaxi Zhuang, Haolin Yang, Liang Dou, Xiao He, Xingjiao Wu, Ying Qian
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.06574
Pdf URL: https://arxiv.org/pdf/2601.06574
Copy Paste: [[2601.06574]] APEX: Learning Adaptive Priorities for Multi-Objective Alignment in Vision-Language Generation(https://arxiv.org/abs/2601.06574)
Keywords: generation
Abstract: Multi-objective alignment for text-to-image generation is commonly implemented via static linear scalarization, but fixed weights often fail under heterogeneous rewards, leading to optimization imbalance where models overfit high-variance, high-responsiveness objectives (e.g., OCR) while under-optimizing perceptual goals. We identify two mechanistic causes: variance hijacking, where reward dispersion induces implicit reweighting that dominates the normalized training signal, and gradient conflicts, where competing objectives produce opposing update directions and trigger seesaw-like oscillations. We propose APEX (Adaptive Priority-based Efficient X-objective Alignment), which stabilizes heterogeneous rewards with Dual-Stage Adaptive Normalization and dynamically schedules objectives via P^3 Adaptive Priorities that combine learning potential, conflict penalty, and progress need. On Stable Diffusion 3.5, APEX achieves improved Pareto trade-offs across four heterogeneous objectives, with balanced gains of +1.31 PickScore, +0.35 DeQA, and +0.53 Aesthetics while maintaining competitive OCR accuracy, mitigating the instability of multi-objective alignment.
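
The imbalance under static linear scalarization can be seen on a toy example. The sketch below uses made-up reward values for two objectives and a generic per-objective z-score as the remedy; that z-score is a stand-in chosen for illustration and is not APEX's Dual-Stage Adaptive Normalization, whose details the abstract does not give.

```python
import statistics


def scalarize(rewards, weights):
    """Static linear scalarization: one fixed weight per objective."""
    n = len(next(iter(rewards.values())))
    return [sum(weights[k] * rewards[k][i] for k in rewards) for i in range(n)]


def zscore(values):
    mu = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mu) / sd for v in values]


def share_of_signal(rewards, weights):
    """Each objective's share of the summed weighted standard deviation."""
    spread = {k: abs(weights[k]) * statistics.pstdev(v) for k, v in rewards.items()}
    total = sum(spread.values())
    return {k: s / total for k, s in spread.items()}


# Hypothetical rewards for four samples: OCR is near-binary, aesthetics barely moves.
rewards = {"ocr": [0.0, 1.0, 0.0, 1.0], "aesthetic": [0.50, 0.52, 0.54, 0.56]}
weights = {"ocr": 0.5, "aesthetic": 0.5}

combined = scalarize(rewards, weights)
raw_share = share_of_signal(rewards, weights)

normalized = {k: zscore(v) for k, v in rewards.items()}
norm_share = share_of_signal(normalized, weights)
```

With equal weights, the high-variance objective accounts for more than 95% of the spread in the raw scalarized reward; after per-objective standardization the two objectives contribute equally.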

Title: Sissi: Zero-shot Style-guided Image Synthesis via Semantic-style Integration

Authors: Yingying Deng, Xiangyu He, Fan Tang, Weiming Dong, Xucheng Yin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.06605
Pdf URL: https://arxiv.org/pdf/2601.06605
Copy Paste: [[2601.06605]] Sissi: Zero-shot Style-guided Image Synthesis via Semantic-style Integration(https://arxiv.org/abs/2601.06605)
Keywords: generation
Abstract: Text-guided image generation has advanced rapidly with large-scale diffusion models, yet achieving precise stylization with visual exemplars remains difficult. Existing approaches often depend on task-specific retraining or expensive inversion procedures, which can compromise content integrity, reduce style fidelity, and lead to an unsatisfactory trade-off between semantic prompt adherence and style alignment. In this work, we introduce a training-free framework that reformulates style-guided synthesis as an in-context learning task. Guided by textual semantic prompts, our method concatenates a reference style image with a masked target image, leveraging a pretrained ReFlow-based inpainting model to seamlessly integrate semantic content with the desired style through multimodal attention fusion. We further analyze the imbalance and noise sensitivity inherent in multimodal attention fusion and propose a Dynamic Semantic-Style Integration (DSSI) mechanism that reweights attention between textual semantic and style visual tokens, effectively resolving guidance conflicts and enhancing output coherence. Experiments show that our approach achieves high-fidelity stylization with superior semantic-style balance and visual quality, offering a simple yet powerful alternative to complex, artifact-prone prior methods.

Title: CEDAR: Context Engineering for Agentic Data Science

Authors: Rishiraj Saha Roy, Chris Hinze, Luzian Hahn, Fabian Kuech
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2601.06606
Pdf URL: https://arxiv.org/pdf/2601.06606
Copy Paste: [[2601.06606]] CEDAR: Context Engineering for Agentic Data Science(https://arxiv.org/abs/2601.06606)
Keywords: generation
Abstract: We demonstrate CEDAR, an application for automating data science (DS) tasks with an agentic setup. Solving DS problems with LLMs is an underexplored area that has immense market value. The challenges are manifold: task complexities, data sizes, computational limitations, and context restrictions. We show that these can be alleviated via effective context engineering. We first impose structure on the initial prompt with DS-specific input fields that serve as instructions for the agentic system. The solution is then materialized as an enumerated sequence of interleaved plan and code blocks generated by separate LLM agents, providing a readable structure to the context at any step of the workflow. Function calls for generating these intermediate texts, and for corresponding Python code, ensure that data stays local, and only aggregate statistics and associated instructions are injected into LLM prompts. Fault tolerance and context management are introduced via iterative code generation and smart history rendering. The viability of our agentic data scientist is demonstrated using canonical Kaggle challenges.
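
The "data stays local, only aggregate statistics reach the prompt" pattern can be sketched as follows. The helper names, the choice of statistics, and the toy table are hypothetical; the abstract does not describe CEDAR's actual function-call interface.

```python
import statistics


def summarize_column(name, values):
    """Aggregate statistics computed locally; individual rows are not returned."""
    return {
        "column": name,
        "count": len(values),
        "mean": round(statistics.fmean(values), 3),
        "min": min(values),
        "max": max(values),
    }


def build_prompt(task, summaries):
    """Assemble an LLM prompt from the task text and per-column aggregates only."""
    lines = [f"Task: {task}", "Column statistics:"]
    for s in summaries:
        lines.append(
            f"- {s['column']}: count={s['count']}, mean={s['mean']}, "
            f"min={s['min']}, max={s['max']}"
        )
    return "\n".join(lines)


# Hypothetical local table; it is never serialized into the prompt row by row.
table = {"age": [22, 38, 26, 35], "fare": [7.25, 71.28, 7.92, 53.1]}
prompt = build_prompt(
    "Plan the next preprocessing step.",
    [summarize_column(name, values) for name, values in table.items()],
)
```

The prompt grows with the number of columns rather than the number of rows, which is one way such a design can ease the data-size and context restrictions the abstract mentions.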

Title: When Humans Judge Irises: Pupil Size Normalization as an Aid and Synthetic Irises as a Challenge

Authors: Mahsa Mitcheff, Adam Czajka
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2601.06725
Pdf URL: https://arxiv.org/pdf/2601.06725
Copy Paste: [[2601.06725]] When Humans Judge Irises: Pupil Size Normalization as an Aid and Synthetic Irises as a Challenge(https://arxiv.org/abs/2601.06725)
Keywords: generative
Abstract: Iris recognition is a mature biometric technology offering remarkable precision and speed, and allowing for large-scale deployments to populations exceeding a billion enrolled users (e.g., AADHAAR in India). However, in forensic applications, a human expert may be needed to review and confirm a positive identification before an iris matching result can be presented as evidence in court, especially in cases where processed samples are degraded (e.g., in post-mortem cases) or where there is a need to judge whether the sample is authentic, rather than a result of a presentation attack. This paper presents a study that examines human performance in iris verification in two controlled scenarios: (a) under varying pupil sizes, with and without a linear/nonlinear alignment of the pupil size between compared images, and (b) when both genuine and impostor iris image pairs are synthetically generated. The results demonstrate that pupil size normalization carried out by a modern autoencoder-based identity-preserving image-to-image translation model significantly improves verification accuracy. Participants were also able to determine whether iris pairs corresponded to the same or different eyes when both images were either authentic or synthetic. However, accuracy declined when subjects were comparing authentic irises against high-quality, same-eye synthetic counterparts. These findings (a) demonstrate the importance of pupil-size alignment for iris matching tasks in which humans are involved, and (b) indicate that despite the high fidelity of modern generative models, same-eye synthetic iris images are more often judged by humans as different-eye images, compared to same-eye authentic image pairs. We offer data and human judgments along with this paper to allow full replicability of this study and future works.

Title: Cross-Modal Computational Model of Brain-Heart Interactions via HRV and EEG Feature

Authors: Malavika Pradeep, Akshay Sasi, Nusaibah Farrukh, Rahul Venugopal, Elizabeth Sherly
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2601.06792
Pdf URL: https://arxiv.org/pdf/2601.06792
Copy Paste: [[2601.06792]] Cross-Modal Computational Model of Brain-Heart Interactions via HRV and EEG Feature(https://arxiv.org/abs/2601.06792)
Keywords: generation
Abstract: The electroencephalogram (EEG) has been the gold standard for quantifying mental workload; however, its complexity and lack of portability can be constraining. Electrocardiogram (ECG) signals, which can be acquired with wearable devices such as headbands, present a promising alternative for cognitive state monitoring. This study investigates whether ECG signals can indicate mental workload consistently and whether ECG-derived features can serve as surrogates for EEG-based indicators of cognitive load. Using a publicly available multimodal dataset (OpenNeuro) of EEG and ECG recorded during working-memory and listening tasks, HRV features and Catch22 descriptors are extracted from ECG, and spectral band-power and Catch22 features from EEG. A cross-modal regression framework based on XGBoost was trained to map ECG-derived HRV representations to EEG-derived cognitive features. To address data sparsity and model brain-heart interactions, we integrated the PSV-SDG to produce EEG-conditioned synthetic HRV time series. This addresses the challenge of inferring cognitive load solely from ECG-derived features using a combination of multimodal learning, signal processing, and synthetic data generation. These outcomes form a basis for lightweight, interpretable machine learning models that can be implemented through wearable biosensors in non-lab environments. Including synthetic HRV enhances robustness, particularly in sparse-data situations. Overall, this work is a first step toward building low-cost, explainable, real-time cognitive monitoring systems for mental health, education, and human-computer interaction, with a focus on ageing and clinical populations.

Title: WFR-FM: Simulation-Free Dynamic Unbalanced Optimal Transport

Authors: Qiangwei Peng, Zihan Wang, Junda Ying, Yuhao Sun, Qing Nie, Lei Zhang, Tiejun Li, Peijie Zhou
Subjects: cs.LG, cs.AI, math-ph
Abstract URL: https://arxiv.org/abs/2601.06810
Pdf URL: https://arxiv.org/pdf/2601.06810
Copy Paste: [[2601.06810]] WFR-FM: Simulation-Free Dynamic Unbalanced Optimal Transport(https://arxiv.org/abs/2601.06810)
Keywords: generative
Abstract: The Wasserstein-Fisher-Rao (WFR) metric extends dynamic optimal transport (OT) by coupling displacement with change of mass, providing a principled geometry for modeling unbalanced snapshot dynamics. Existing WFR solvers, however, are often unstable, computationally expensive, and difficult to scale. Here we introduce WFR Flow Matching (WFR-FM), a simulation-free training algorithm that unifies flow matching with dynamic unbalanced OT. Unlike classical flow matching which regresses only a transport vector field, WFR-FM simultaneously regresses a vector field for displacement and a scalar growth rate function for birth-death dynamics, yielding continuous flows under the WFR geometry. Theoretically, we show that minimizing the WFR-FM loss exactly recovers WFR geodesics. Empirically, WFR-FM yields more accurate and robust trajectory inference in single-cell biology, reconstructing consistent dynamics with proliferation and apoptosis, estimating time-varying growth fields, and applying to generative dynamics under imbalanced data. It outperforms state-of-the-art baselines in efficiency, stability, and reconstruction accuracy. Overall, WFR-FM establishes a unified and efficient paradigm for learning dynamical systems from unbalanced snapshots, where not only states but also mass evolve over time.
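
For orientation, the dynamic WFR problem that couples displacement with change of mass is commonly written as below, with a velocity field v, a scalar growth rate g, and a length-scale parameter δ. Normalization constants differ between references, and the regression loss in the second display is only a schematic of "regress a vector field and a growth rate at the same time"; the targets u_t and r_t and the weight λ are placeholder symbols, not the paper's notation.

```latex
\begin{aligned}
\mathrm{WFR}_\delta^2(\rho_0,\rho_1)
  &= \inf_{(\rho,\,v,\,g)} \int_0^1 \!\!\int
     \Big( \tfrac{1}{2}\,\lVert v(x,t)\rVert^2 + \tfrac{\delta^2}{2}\, g(x,t)^2 \Big)\,
     \rho(x,t)\,\mathrm{d}x\,\mathrm{d}t, \\
  &\quad \text{s.t.}\quad \partial_t \rho + \nabla\!\cdot(\rho\, v) = g\,\rho,
     \qquad \rho(\cdot,0)=\rho_0,\quad \rho(\cdot,1)=\rho_1 .
\end{aligned}

% Schematic joint regression objective (assumed form, for illustration only):
\mathcal{L}(\theta)
  = \mathbb{E}_{t,\,x_t}\Big[ \lVert v_\theta(x_t,t) - u_t(x_t) \rVert^2
    + \lambda\,\big( g_\theta(x_t,t) - r_t(x_t) \big)^2 \Big]
```

Setting g ≡ 0 in the constraint recovers the ordinary continuity equation of balanced dynamic optimal transport, which is the case classical flow matching addresses.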

Title: OSCAR: Optical-aware Semantic Control for Aleatoric Refinement in Sar-to-Optical Translation

Authors: Hyunseo Lee, Sang Min Kim, Ho Kyung Shin, Taeheon Kim, Woo-Jeoung Nam
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2601.06835
Pdf URL: https://arxiv.org/pdf/2601.06835
Copy Paste: [[2601.06835]] OSCAR: Optical-aware Semantic Control for Aleatoric Refinement in Sar-to-Optical Translation(https://arxiv.org/abs/2601.06835)
Keywords: generative
Abstract: Synthetic Aperture Radar (SAR) provides robust all-weather imaging capabilities; however, translating SAR observations into photo-realistic optical images remains a fundamentally ill-posed problem. Current approaches are often hindered by the inherent speckle noise and geometric distortions of SAR data, which frequently result in semantic misinterpretation, ambiguous texture synthesis, and structural hallucinations. To address these limitations, a novel SAR-to-Optical (S2O) translation framework is proposed, integrating three core technical contributions: (i) Cross-Modal Semantic Alignment, which establishes an Optical-Aware SAR Encoder by distilling robust semantic priors from an Optical Teacher into a SAR Student; (ii) Semantically-Grounded Generative Guidance, realized by a Semantically-Grounded ControlNet that integrates class-aware text prompts for global context with hierarchical visual prompts for local spatial guidance; and (iii) an Uncertainty-Aware Objective, which explicitly models aleatoric uncertainty to dynamically modulate the reconstruction focus, effectively mitigating artifacts caused by speckle-induced ambiguity. Extensive experiments demonstrate that the proposed method achieves superior perceptual quality and semantic consistency compared to state-of-the-art approaches.
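
A common way to let predicted aleatoric uncertainty modulate a reconstruction loss is the heteroscedastic Gaussian negative log-likelihood, in which each residual is scaled by a predicted variance and a log-variance penalty discourages inflating the variance everywhere. The sketch below shows that standard form on plain Python lists; whether OSCAR uses exactly this objective is not stated in the abstract, so treat it as an assumption.

```python
import math


def aleatoric_l2_loss(pred, target, log_var):
    """Mean per-element Gaussian negative log-likelihood, up to a constant.

    term_i = 0.5 * exp(-log_var_i) * (pred_i - target_i)**2 + 0.5 * log_var_i
    """
    terms = [
        0.5 * math.exp(-s) * (p - t) ** 2 + 0.5 * s
        for p, t, s in zip(pred, target, log_var)
    ]
    return sum(terms) / len(terms)


# Hypothetical two-pixel example: the second pixel has a large residual.
pred = [0.0, 2.0]
target = [0.0, 0.0]

confident = aleatoric_l2_loss(pred, target, [0.0, 0.0])           # unit variance everywhere
uncertain = aleatoric_l2_loss(pred, target, [0.0, math.log(4.0)])  # high variance on pixel 2
```

Raising the predicted variance on the ambiguous pixel lowers its contribution to the loss, so reconstruction effort shifts toward pixels the model considers reliable.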

Title: Speak While Watching: Unleashing TRUE Real-Time Video Understanding Capability of Multimodal Large Language Models

Authors: Junyan Lin, Junlong Tong, Hao Wu, Jialiang Zhang, Jinming Liu, Xin Jin, Xiaoyu Shen
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2601.06843
Pdf URL: https://arxiv.org/pdf/2601.06843
Copy Paste: [[2601.06843]] Speak While Watching: Unleashing TRUE Real-Time Video Understanding Capability of Multimodal Large Language Models(https://arxiv.org/abs/2601.06843)
Keywords: generation
Abstract: Multimodal Large Language Models (MLLMs) have achieved strong performance across many tasks, yet most systems remain limited to offline inference, requiring complete inputs before generating outputs. Recent streaming methods reduce latency by interleaving perception and generation, but still enforce a sequential perception-generation cycle, limiting real-time interaction. In this work, we target a fundamental bottleneck that arises when extending MLLMs to real-time video understanding: the global positional continuity constraint imposed by standard positional encoding schemes. While natural in offline inference, this constraint tightly couples perception and generation, preventing effective input-output parallelism. To address this limitation, we propose a parallel streaming framework that relaxes positional continuity through three designs: Overlapped, Group-Decoupled, and Gap-Isolated. These designs enable simultaneous perception and generation, allowing the model to process incoming inputs while producing responses in real time. Extensive experiments reveal that Group-Decoupled achieves the best efficiency-performance balance, maintaining high fluency and accuracy while significantly reducing latency. We further show that the proposed framework yields up to 2x acceleration under balanced perception-generation workloads, establishing a principled pathway toward speak-while-watching real-time systems. We make all our code publicly available: this https URL.

Title: Variational decomposition autoencoding improves disentanglement of latent representations

Authors: Ioannis Ziogas, Aamna Al Shehhi, Ahsan H. Khandoker, Leontios J. Hadjileontiadis
Subjects: cs.LG, cs.AI, eess.AS, eess.SP, stat.ML
Abstract URL: https://arxiv.org/abs/2601.06844
Pdf URL: https://arxiv.org/pdf/2601.06844
Copy Paste: [[2601.06844]] Variational decomposition autoencoding improves disentanglement of latent representations(https://arxiv.org/abs/2601.06844)
Keywords: generative
Abstract: Understanding the structure of complex, nonstationary, high-dimensional time-evolving signals is a central challenge in scientific data analysis. In many domains, such as speech and biomedical signal processing, the ability to learn disentangled and interpretable representations is critical for uncovering latent generative mechanisms. Traditional approaches to unsupervised representation learning, including variational autoencoders (VAEs), often struggle to capture the temporal and spectral diversity inherent in such data. Here we introduce variational decomposition autoencoding (VDA), a framework that extends VAEs by incorporating a strong structural bias toward signal decomposition. VDA is instantiated through variational decomposition autoencoders (DecVAEs), i.e., encoder-only neural networks that combine a signal decomposition model, a contrastive self-supervised task, and variational prior approximation to learn multiple latent subspaces aligned with time-frequency characteristics. We demonstrate the effectiveness of DecVAEs on simulated data and three publicly available scientific datasets, spanning speech recognition, dysarthria severity evaluation, and emotional speech classification. Our results demonstrate that DecVAEs surpass state-of-the-art VAE-based methods in terms of disentanglement quality, generalization across tasks, and the interpretability of latent encodings. These findings suggest that decomposition-aware architectures can serve as robust tools for extracting structured representations from dynamic signals, with potential applications in clinical diagnostics, human-computer interaction, and adaptive neurotechnologies.

Title: U-MASK: User-adaptive Spatio-Temporal Masking for Personalized Mobile AI Applications

Authors: Shiyuan Zhang, Yilai Liu, Yuwei Du, Ruoxuan Yang, Dong In Kim, Hongyang Du
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2601.06867
Pdf URL: https://arxiv.org/pdf/2601.06867
Copy Paste: [[2601.06867]] U-MASK: User-adaptive Spatio-Temporal Masking for Personalized Mobile AI Applications(https://arxiv.org/abs/2601.06867)
Keywords: generation, generative
Abstract: Personalized mobile artificial intelligence applications are widely deployed, yet they are expected to infer user behavior from sparse and irregular histories under a continuously evolving spatio-temporal context. This setting induces a fundamental tension among three requirements, i.e., immediacy to adapt to recent behavior, stability to resist transient noise, and generalization to support long-horizon prediction and cold-start users. Most existing approaches satisfy at most two of these requirements, resulting in an inherent impossibility triangle in data-scarce, non-stationary personalization. To address this challenge, we model mobile behavior as a partially observed spatio-temporal tensor and unify short-term adaptation, long-horizon forecasting, and cold-start recommendation as a conditional completion problem, where a user- and task-specific mask specifies which coordinates are treated as evidence. We propose U-MASK, a user-adaptive spatio-temporal masking method that allocates evidence budgets based on user reliability and task sensitivity. To enable mask generation under sparse observations, U-MASK learns a compact, task-agnostic user representation from app and location histories via U-SCOPE, which serves as the sole semantic conditioning signal. A shared diffusion transformer then performs mask-guided generative completion while preserving observed evidence, so personalization and task differentiation are governed entirely by the mask and the user representation. Experiments on real-world mobile datasets demonstrate consistent improvements over state-of-the-art methods across short-term prediction, long-horizon forecasting, and cold-start settings, with the largest gains under severe data sparsity. The code and dataset will be available at this https URL.
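
The "conditional completion while preserving observed evidence" step can be illustrated with a small masking helper: entries flagged as evidence are copied through unchanged, and only the remaining coordinates are taken from a generator's output. The function name, the 2-D toy tensor, and the stand-in generated values are hypothetical; the abstract does not specify how U-MASK builds its masks or how the diffusion transformer is applied.

```python
def complete_with_mask(observed, generated, mask):
    """Keep observed entries where mask is 1; fill the rest from the generator."""
    return [
        [o if m == 1 else g for o, g, m in zip(obs_row, gen_row, mask_row)]
        for obs_row, gen_row, mask_row in zip(observed, generated, mask)
    ]


# Hypothetical usage tensor (rows: time slots, columns: apps); None marks missing cells.
observed = [[3, None, 1], [None, None, 2]]
generated = [[9, 4, 9], [5, 6, 9]]   # stand-in for a generative model's output
mask = [[1, 0, 1], [0, 0, 1]]        # 1 = treat as evidence, 0 = to be completed

completed = complete_with_mask(observed, generated, mask)
```

Changing only the mask switches the task: masking out recent time slots corresponds to forecasting, while masking out almost everything resembles a cold-start setting.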

Title: DaQ-MSA: Denoising and Qualifying Diffusion Augmentations for Multimodal Sentiment Analysis

Authors: Jiazhang Liang, Jianheng Dai, Miaosen Luo, Menghua Jiang, Sijie Mai
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2601.06870
Pdf URL: https://arxiv.org/pdf/2601.06870
Copy Paste: [[2601.06870]] DaQ-MSA: Denoising and Qualifying Diffusion Augmentations for Multimodal Sentiment Analysis(https://arxiv.org/abs/2601.06870)
Keywords: generative
Abstract: Multimodal large language models (MLLMs) have demonstrated strong performance on vision-language tasks, yet their effectiveness on multimodal sentiment analysis remains constrained by the scarcity of high-quality training data, which limits accurate multimodal understanding and generalization. To alleviate this bottleneck, we leverage diffusion models to perform semantics-preserving augmentation on the video and audio modalities, expanding the multimodal training distribution. However, increasing data quantity alone is insufficient, as diffusion-generated samples exhibit substantial quality variation and noisy augmentations may degrade performance. We therefore propose DaQ-MSA (Denoising and Qualifying Diffusion Augmentations for Multimodal Sentiment Analysis), which introduces a quality scoring module to evaluate the reliability of augmented samples and assign adaptive training weights. By down-weighting low-quality samples and emphasizing high-fidelity ones, DaQ-MSA enables more stable learning. By integrating the generative capability of diffusion models with the semantic understanding of MLLMs, our approach provides a robust and generalizable automated augmentation strategy for training MLLMs without any human annotation or additional supervision.
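
Down-weighting low-quality augmentations can be sketched as a quality-weighted mean of per-sample losses. The weighting rule below (weights proportional to a quality score in [0, 1]) and the example numbers are assumptions for illustration; the abstract does not describe how DaQ-MSA's scoring module maps scores to training weights.

```python
def quality_weighted_loss(losses, quality_scores):
    """Weighted mean of per-sample losses, with weights proportional to quality."""
    total = sum(quality_scores)
    return sum(l * q for l, q in zip(losses, quality_scores)) / total


# Hypothetical batch: the third sample is a noisy augmentation with a large loss.
losses = [0.2, 0.3, 2.0]
quality = [1.0, 1.0, 0.1]

weighted = quality_weighted_loss(losses, quality)
uniform = quality_weighted_loss(losses, [1.0, 1.0, 1.0])
```

With uniform weights the noisy sample dominates the batch loss; with quality weights its influence shrinks and the average stays close to the loss of the clean samples.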
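The adaptive weighting idea in the DaQ-MSA abstract can be sketched as a quality-weighted loss. This is a minimal illustration, not the authors' implementation: the function name `quality_weighted_loss`, the assumption that quality scores lie in [0, 1], and the choice to give real samples a fixed weight of 1.0 are all hypothetical.

```python
def quality_weighted_loss(losses, quality_scores, is_augmented):
    """Weighted mean of per-sample losses.

    Real samples get weight 1.0; augmented samples use their quality score
    (assumed to lie in [0, 1]) as the weight, so low-quality augmentations
    contribute less to the training signal.
    """
    weights = [q if aug else 1.0 for q, aug in zip(quality_scores, is_augmented)]
    total_weight = sum(weights)
    if total_weight == 0:
        return 0.0  # every sample was fully down-weighted
    return sum(w * l for w, l in zip(weights, losses)) / total_weight
```

With one real sample and one augmented sample scored 0.0, the augmented sample is ignored entirely and the result is just the real sample's loss.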

Title: RenderFlow: Single-Step Neural Rendering via Flow Matching

Authors: Shenghao Zhang, Runtao Liu, Christopher Schroers, Yang Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.06928
Pdf URL: https://arxiv.org/pdf/2601.06928
Copy Paste: [[2601.06928]] RenderFlow: Single-Step Neural Rendering via Flow Matching(https://arxiv.org/abs/2601.06928)
Keywords: generative
Abstract: Conventional physically based rendering (PBR) pipelines generate photorealistic images through computationally intensive light transport simulations. Although recent deep learning approaches leverage diffusion model priors with geometry buffers (G-buffers) to produce visually compelling results without explicit scene geometry or light simulation, they remain constrained by two major limitations. First, the iterative nature of the diffusion process introduces substantial latency. Second, the inherent stochasticity of these generative models compromises physical accuracy and temporal consistency. In response to these challenges, we propose a novel, end-to-end, deterministic, single-step neural rendering framework, RenderFlow, built upon a flow matching paradigm. To further strengthen both rendering quality and generalization, we propose an efficient and effective module for sparse keyframe guidance. Our method significantly accelerates the rendering process and, by optionally incorporating sparsely rendered keyframes as guidance, enhances both the physical plausibility and overall visual quality of the output. The resulting pipeline achieves near real-time performance with photorealistic rendering quality, effectively bridging the gap between the efficiency of modern generative models and the precision of traditional physically based rendering. Furthermore, we demonstrate the versatility of our framework by introducing a lightweight, adapter-based module that efficiently repurposes the pretrained forward model for the inverse rendering task of intrinsic decomposition.

Title: Watching, Reasoning, and Searching: A Video Deep Research Benchmark on Open Web for Agentic Video Reasoning

Authors: Chengwen Liu, Xiaomin Yu, Zhuoyue Chang, Zhe Huang, Shuo Zhang, Heng Lian, Kunyi Wang, Rui Xu, Sen Hu, Jianheng Hou, Hao Peng, Chengwei Qin, Xiaobin Hu, Hong Peng, Ronghao Chen, Huacan Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2601.06943
Pdf URL: https://arxiv.org/pdf/2601.06943
Copy Paste: [[2601.06943]] Watching, Reasoning, and Searching: A Video Deep Research Benchmark on Open Web for Agentic Video Reasoning(https://arxiv.org/abs/2601.06943)
Keywords: generation
Abstract: In real-world video question answering scenarios, videos often provide only localized visual cues, while verifiable answers are distributed across the open web; models therefore need to jointly perform cross-frame clue extraction, iterative retrieval, and multi-hop reasoning-based verification. To bridge this gap, we construct the first video deep research benchmark, VideoDR. VideoDR centers on video-conditioned open-domain video question answering, requiring cross-frame visual anchor extraction, interactive web retrieval, and multi-hop reasoning over joint video-web evidence; through rigorous human annotation and quality control, we obtain high-quality video deep research samples spanning six semantic domains. We evaluate multiple closed-source and open-source multimodal large language models under both the Workflow and Agentic paradigms, and the results show that Agentic is not consistently superior to Workflow: its gains depend on a model's ability to maintain the initial video anchors over long retrieval chains. Further analysis indicates that goal drift and long-horizon consistency are the core bottlenecks. In sum, VideoDR provides a systematic benchmark for studying video agents in open-web settings and reveals the key challenges for next-generation video deep research agents.

Title: Unified Personalized Understanding, Generating and Editing

Authors: Yu Zhong, Tianwei Lin, Ruike Zhu, Yuqian Yuan, Haoyu Zheng, Liang Liang, Wenqiao Zhang, Feifei Shao, Haoyuan Li, Wanggui He, Hao Jiang, Yueting Zhuang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.06965
Pdf URL: https://arxiv.org/pdf/2601.06965
Copy Paste: [[2601.06965]] Unified Personalized Understanding, Generating and Editing(https://arxiv.org/abs/2601.06965)
Keywords: generation
Abstract: Unified large multimodal models (LMMs) have achieved remarkable progress in general-purpose multimodal understanding and generation. However, they still operate under a "one-size-fits-all" paradigm and struggle to model user-specific concepts (e.g., generate a photo of \texttt{}) in a consistent and controllable manner. Existing personalization methods typically rely on external retrieval, which is inefficient and poorly integrated into unified multimodal pipelines. Recent personalized unified models introduce learnable soft prompts to encode concept information, yet they either couple understanding and generation or depend on complex multi-stage training, leading to cross-task interference and ultimately to fuzzy or misaligned personalized knowledge. We present OmniPersona, an end-to-end personalization framework for unified LMMs that, for the first time, integrates personalized understanding, generation, and image editing within a single architecture. OmniPersona introduces structurally decoupled concept tokens, allocating dedicated subspaces for different tasks to minimize interference, and incorporates an explicit knowledge replay mechanism that propagates personalized attribute knowledge across tasks, enabling consistent personalized behavior. To systematically evaluate unified personalization, we propose OmniPBench, extending the public UnifyBench concept set with personalized editing tasks and cross-task evaluation protocols integrating understanding, generation, and editing. Experimental results demonstrate that OmniPersona delivers competitive and robust performance across diverse personalization tasks. We hope OmniPersona will serve as a strong baseline and spur further research on controllable, unified personalization.

Title: When Should We Introduce Safety Interventions During Pretraining?

Authors: Dylan Sam, Sachin Goyal, Pratyush Maini, Alexander Robey, J. Zico Kolter
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2601.07087
Pdf URL: https://arxiv.org/pdf/2601.07087
Copy Paste: [[2601.07087]] When Should We Introduce Safety Interventions During Pretraining?(https://arxiv.org/abs/2601.07087)
Keywords: generation
Abstract: Ensuring the safety of language models in high-stakes settings remains a pressing challenge, as aligned behaviors are often brittle and easily undone by adversarial pressure or downstream finetuning. Prior work has shown that interventions applied during pretraining, such as rephrasing harmful content, can substantially improve the safety of the resulting models. In this paper, we study the fundamental question: "When during pretraining should safety interventions be introduced?" We keep the underlying data fixed and vary only the choice of a safety curriculum: the timing of these interventions, i.e., after 0%, 20%, or 60% of the pretraining token budget. We find that introducing interventions earlier generally yields more robust models with no increase in overrefusal rates, with the clearest benefits appearing after downstream, benign finetuning. We also see clear benefits in the steerability of models towards safer generations. Finally, we observe that earlier interventions reshape internal representations: linear probes more cleanly separate safe vs harmful examples. Overall, these results argue for incorporating safety signals early in pretraining, producing models that are more robust to downstream finetuning and jailbreaking, and more reliable under both standard and safety-aware inference procedures.

Title: 3D Wavelet-Based Structural Priors for Controlled Diffusion in Whole-Body Low-Dose PET Denoising

Authors: Peiyuan Jing, Yue Tang, Chun-Wun Cheng, Zhenxuan Zhang, Liutao Yang, Thiago V. Lima, Klaus Strobel, Antoine Leimgruber, Angelica Aviles-Rivero, Guang Yang, Javier Montoya
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2601.07093
Pdf URL: https://arxiv.org/pdf/2601.07093
Copy Paste: [[2601.07093]] 3D Wavelet-Based Structural Priors for Controlled Diffusion in Whole-Body Low-Dose PET Denoising(https://arxiv.org/abs/2601.07093)
Keywords: generative
Abstract: Low-dose Positron Emission Tomography (PET) imaging reduces patient radiation exposure but suffers from increased noise that degrades image quality and diagnostic reliability. Although diffusion models have demonstrated strong denoising capability, their stochastic nature makes it challenging to enforce anatomically consistent structures, particularly in low signal-to-noise regimes and volumetric whole-body imaging. We propose Wavelet-Conditioned ControlNet (WCC-Net), a fully 3D diffusion-based framework that introduces explicit frequency-domain structural priors via wavelet representations to guide volumetric PET denoising. By injecting wavelet-based structural guidance into a frozen pretrained diffusion backbone through a lightweight control branch, WCC-Net decouples anatomical structure from noise while preserving generative expressiveness and 3D structural continuity. Extensive experiments demonstrate that WCC-Net consistently outperforms CNN-, GAN-, and diffusion-based baselines. On the internal 1/20-dose test set, WCC-Net improves PSNR by +1.21 dB and SSIM by +0.008 over a strong diffusion baseline, while reducing structural distortion (GMSD) and intensity error (NMAE). Moreover, WCC-Net generalizes robustly to unseen dose levels (1/50 and 1/4), achieving superior quantitative performance and improved volumetric anatomical consistency.
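To put the reported +1.21 dB PSNR gain in context, the standard PSNR definition can be used to convert a dB improvement into a ratio of mean squared errors. This is a sketch with the standard formula only; the helper names are hypothetical and the snippet says nothing about how WCC-Net itself is evaluated.

```python
import math


def psnr(reference, estimate, data_range):
    """Peak signal-to-noise ratio in dB for two equal-length sequences."""
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(data_range ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """MSE(new) / MSE(old) implied by a PSNR gain at a fixed data range."""
    return 10.0 ** (-gain_db / 10.0)
```

At a fixed data range, `mse_ratio_for_gain(1.21)` is about 0.757, so a +1.21 dB gain corresponds to roughly a 24% reduction in mean squared error.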

Title: Few-shot Class-Incremental Learning via Generative Co-Memory Regularization

Authors: Kexin Bao, Yong Li, Dan Zeng, Shiming Ge
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2601.07117
Pdf URL: https://arxiv.org/pdf/2601.07117
Copy Paste: [[2601.07117]] Few-shot Class-Incremental Learning via Generative Co-Memory Regularization(https://arxiv.org/abs/2601.07117)
Keywords: generative
Abstract: Few-shot class-incremental learning (FSCIL) aims to incrementally learn models from a small amount of novel data, which requires strong representation and adaptation ability of models learned under few-example supervision to avoid catastrophic forgetting on old classes and overfitting to novel classes. This work proposes a generative co-memory regularization approach to facilitate FSCIL. In the approach, base learning leverages generative domain adaptation finetuning to finetune a pretrained generative encoder on a few examples of base classes by jointly incorporating a masked autoencoder (MAE) decoder for feature reconstruction and a fully-connected classifier for feature classification, which enables the model to efficiently capture general and adaptable representations. Using the finetuned encoder and learned classifier, we construct two class-wise memories: a representation memory for storing the mean features of each class, and a weight memory for storing the classifier weights. After that, memory-regularized incremental learning is performed to train the classifier dynamically on the examples of few-shot classes in each incremental session by simultaneously optimizing feature classification and co-memory regularization. The memories are updated in a class-incremental manner and they collaboratively regularize the incremental learning. In this way, the learned models improve recognition accuracy, while mitigating catastrophic forgetting of old classes and overfitting to novel classes. Extensive experiments on popular benchmarks clearly demonstrate that our approach outperforms the state of the art.
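The representation memory described in the abstract (mean feature per class, updated class-incrementally) can be sketched as follows. This is an assumption-laden illustration: the function names are hypothetical, features are plain lists of floats, and the update rule (add entries for novel classes, leave stored old-class means untouched) is one plausible reading rather than the paper's confirmed procedure.

```python
def build_representation_memory(features, labels):
    """Return {class_label: mean feature vector} for the given examples."""
    sums, counts = {}, {}
    for vec, label in zip(features, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        counts[label] += 1
    return {label: [s / counts[label] for s in sums[label]] for label in sums}


def update_memory(memory, new_features, new_labels):
    """Class-incremental update (assumed policy): add means for novel classes only."""
    updated = dict(memory)
    for label, mean in build_representation_memory(new_features, new_labels).items():
        if label not in updated:
            updated[label] = mean
    return updated
```

A weight memory could be kept the same way, storing one classifier weight vector per class instead of a mean feature.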

Title: Generating readily synthesizable small molecule fluorophore scaffolds with reinforcement learning

Authors: Ruhi Sayana, Kate Callon, Jennifer Xu, Jonathan Deutsch, Steven Chu, James Zou, John Janetzko, Rabindra V. Shivnaraine, Kyle Swanson
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2601.07145
Pdf URL: https://arxiv.org/pdf/2601.07145
Copy Paste: [[2601.07145]] Generating readily synthesizable small molecule fluorophore scaffolds with reinforcement learning(https://arxiv.org/abs/2601.07145)
Keywords: generation, generative
Abstract: Developing new fluorophores for advanced imaging techniques requires exploring new chemical space. While generative AI approaches have shown promise in designing novel dye scaffolds, prior efforts often produced synthetically intractable candidates due to a lack of reaction constraints. Here, we developed SyntheFluor-RL, a generative AI model that employs known reaction libraries and molecular building blocks to create readily synthesizable fluorescent molecule scaffolds via reinforcement learning. To guide the generation of fluorophores, SyntheFluor-RL employs a scoring function built on multiple graph neural networks (GNNs) that predict key photophysical properties, including photoluminescence quantum yield, absorption, and emission wavelengths. These outputs are dynamically weighted and combined with a computed pi-conjugation score to prioritize candidates with desirable optical characteristics and synthetic feasibility. SyntheFluor-RL generated 11,590 candidate molecules, which were filtered to 19 structures predicted to possess dye-like properties. Of the 19 molecules, 14 were synthesized and 13 were experimentally confirmed. The top three were characterized, with the lead compound featuring a benzothiadiazole chromophore and exhibiting strong fluorescence (PLQY = 0.62), a large Stokes shift (97 nm), and a long excited-state lifetime (11.5 ns). These results demonstrate the effectiveness of SyntheFluor-RL in the identification of synthetically accessible fluorophores for further development.
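The candidate funnel reported in the abstract (11,590 generated, 19 after filtering, 14 synthesized, 13 experimentally confirmed) can be turned into stage-to-stage rates with a few lines of arithmetic. The stage labels and the helper `stage_rates` are illustrative names; only the four counts come from the abstract.

```python
# Counts taken from the abstract; stage names are illustrative labels.
FUNNEL = [
    ("generated", 11590),
    ("passed filtering", 19),
    ("synthesized", 14),
    ("experimentally confirmed", 13),
]


def stage_rates(stages):
    """Fraction of items surviving each stage relative to the previous one."""
    rates = []
    for (_, prev_count), (name, count) in zip(stages, stages[1:]):
        rates.append((name, count / prev_count))
    return rates
```

The filter keeps roughly 0.16% of generated molecules, about 74% of the filtered set was synthesized, and about 93% of the synthesized molecules were confirmed.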

Title: Stable On-Policy Distillation through Adaptive Target Reformulation

Authors: Ijun Jang, Jewon Yeom, Juan Yeo, Hyunggu Lim, Taesup Kim
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2601.07155
Pdf URL: https://arxiv.org/pdf/2601.07155
Copy Paste: [[2601.07155]] Stable On-Policy Distillation through Adaptive Target Reformulation(https://arxiv.org/abs/2601.07155)
Keywords: generation
Abstract: Knowledge distillation (KD) is a widely adopted technique for transferring knowledge from large language models to smaller student models; however, conventional supervised KD often suffers from a distribution mismatch between training and inference. While on-policy KD approaches attempt to mitigate this issue by learning directly from student-generated outputs, they frequently encounter training instabilities because the distributional gap between the novice student and the expert teacher is often too wide to bridge directly. These challenges manifest as pathological gradients in forward KL objectives or diversity collapse in reverse KL regimes. To address these limitations, we propose Veto, an objective-level reformulation that constructs a geometric bridge in the logit space. Unlike prior methods that mix data samples, Veto creates an intermediate target distribution that promotes alignment between the teacher and the student. By introducing a tunable parameter beta, Veto serves as an Adaptive Gradient Veto that stabilizes optimization by suppressing harmful gradients on low-confidence tokens, while simultaneously acting as a Decisiveness Knob to balance reward-driven performance with output diversity. Extensive experiments across various reasoning and generation tasks demonstrate that Veto consistently outperforms supervised fine-tuning and existing on-policy baselines.
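The abstract does not give Veto's formula. As a loosely related illustration only, one common way to build an intermediate target between a student and a teacher is to interpolate their logits, which yields a normalized geometric mixture of the two distributions. Everything below is an assumption: the function `bridge_target`, the linear interpolation, and the convention that `beta = 1` recovers the teacher are hypothetical and may differ from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def bridge_target(student_logits, teacher_logits, beta):
    """Hypothetical intermediate target: softmax of interpolated logits.

    beta = 0 gives the student distribution, beta = 1 the teacher distribution;
    values in between give a normalized geometric mean of the two.
    """
    mixed = [(1.0 - beta) * s + beta * t
             for s, t in zip(student_logits, teacher_logits)]
    return softmax(mixed)
```

Under this reading, a smaller `beta` keeps the target closer to what the student already predicts, which is one way a target could avoid large gradients on tokens where the two models disagree strongly.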

Title: Offline Meta-Reinforcement Learning with Flow-Based Task Inference and Adaptive Correction of Feature Overgeneralization

Authors: Min Wang, Xin Li, Mingzhong Wang, Hasnaa Bennis
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2601.07164
Pdf URL: https://arxiv.org/pdf/2601.07164
Copy Paste: [[2601.07164]] Offline Meta-Reinforcement Learning with Flow-Based Task Inference and Adaptive Correction of Feature Overgeneralization(https://arxiv.org/abs/2601.07164)
Keywords: generation
Abstract: Offline meta-reinforcement learning (OMRL) combines the strengths of learning from diverse datasets in offline RL with the adaptability to new tasks of meta-RL, promising safe and efficient knowledge acquisition by RL agents. However, OMRL still suffers from extrapolation errors due to out-of-distribution (OOD) actions, compromised by broad task distributions and Markov Decision Process (MDP) ambiguity in meta-RL setups. Existing research indicates that the generalization of the $Q$ network affects the extrapolation error in offline RL. This paper investigates this relationship by decomposing the $Q$ value into feature and weight components, observing that while decomposition enhances adaptability and convergence in the case of high-quality data, it often leads to policy degeneration or collapse in complex tasks. We observe that decomposed $Q$ values introduce a large estimation bias when the feature encounters OOD samples, a phenomenon we term "feature overgeneralization". To address this issue, we propose FLORA, which identifies OOD samples by modeling feature distributions and estimating their uncertainties. FLORA integrates a return feedback mechanism to adaptively adjust feature components. Furthermore, to learn precise task representations, FLORA explicitly models the complex task distribution using a chain of invertible transformations. We theoretically and empirically demonstrate that FLORA achieves rapid adaptation and meta-policy improvement compared to baselines across various environments.

Title: Forward versus Backward: Comparing Reasoning Objectives in Direct Preference Optimization

Authors: Murtaza Nikzad, Raghuram Ramanujan
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2601.07199
Pdf URL: https://arxiv.org/pdf/2601.07199
Copy Paste: [[2601.07199]] Forward versus Backward: Comparing Reasoning Objectives in Direct Preference Optimization(https://arxiv.org/abs/2601.07199)
Keywords: generation
Abstract: Large language models exhibit impressive reasoning capabilities yet frequently generate plausible but incorrect solutions, a phenomenon commonly termed hallucination. This paper investigates the effect of training objective composition on reasoning reliability through Direct Preference Optimization. Two complementary training signals are examined: forward chain-of-thought generation, which trains the model to produce correct reasoning traces, and backward verification, which trains the model to verify and acknowledge errors in candidate solutions. Experiments on GSM8K reveal a fundamental trade-off between these objectives. Forward-only DPO training achieves the highest accuracy improvement, increasing from 83.1% to 86.6% (+3.5 percentage points), while backward-only training yields minimal accuracy gains but substantially reduces the false positive rate from 13.4% to 4.3%. Notably, both training variants reduce acknowledgement rate compared to the baseline, suggesting that preference optimization increases model confidence in its outputs. These findings indicate that forward and backward reasoning objectives provide distinct and complementary learning signals: forward training improves problem-solving capability, while backward training improves verification calibration. The complete training and evaluation pipeline, implemented efficiently through Low-Rank Adaptation, is released to facilitate further research.
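Both training variants in this abstract optimize the standard Direct Preference Optimization objective; what differs is how the preference pairs are built. A minimal sketch of the per-pair DPO loss is shown below. The pair construction is an assumption: for forward training the chosen response would be a correct reasoning trace, and for backward training a correct verification, but the abstract does not spell this out. The default `beta=0.1` is also just an illustrative value.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair of sequence log-probs."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp        # log pi/pi_ref, chosen
    rejected_ratio = policy_rejected_logp - ref_rejected_logp  # log pi/pi_ref, rejected
    return -log_sigmoid(beta * (chosen_ratio - rejected_ratio))
```

When the policy matches the reference, the margin is zero and the loss equals log 2; the loss falls as the policy raises the chosen response's likelihood relative to the rejected one.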

Title: MAESTRO: Meta-learning Adaptive Estimation of Scalarization Trade-offs for Reward Optimization

Authors: Yang Zhao, Hepeng Wang, Xiao Ding, Yangou Ouyang, Bibo Cai, Kai Xiong, Jinglong Gao, Zhouhao Sun, Li Du, Bing Qin, Ting Liu
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2601.07208
Pdf URL: https://arxiv.org/pdf/2601.07208
Copy Paste: [[2601.07208]] MAESTRO: Meta-learning Adaptive Estimation of Scalarization Trade-offs for Reward Optimization(https://arxiv.org/abs/2601.07208)
Keywords: generation
Abstract: Group-Relative Policy Optimization (GRPO) has emerged as an efficient paradigm for aligning Large Language Models (LLMs), yet its efficacy is primarily confined to domains with verifiable ground truths. Extending GRPO to open-domain settings remains a critical challenge, as unconstrained generation entails multi-faceted and often conflicting objectives - such as creativity versus factuality - where rigid, static reward scalarization is inherently suboptimal. To address this, we propose MAESTRO (Meta-learning Adaptive Estimation of Scalarization Trade-offs for Reward Optimization), which introduces a meta-cognitive orchestration layer that treats reward scalarization as a dynamic latent policy, leveraging the model's terminal hidden states as a semantic bottleneck to perceive task-specific priorities. We formulate this as a contextual bandit problem within a bi-level optimization framework, where a lightweight Conductor network co-evolves with the policy by utilizing group-relative advantages as a meta-reward signal. Across seven benchmarks, MAESTRO consistently outperforms single-reward and static multi-objective baselines, while preserving the efficiency advantages of GRPO, and in some settings even reducing redundant generation.
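The two ingredients named in the abstract, reward scalarization and group-relative advantages, can be sketched together. This is a simplified illustration: in MAESTRO the scalarization weights would come from the Conductor network, whereas here they are passed in as a plain list. The function names and the use of the population standard deviation are assumptions.

```python
import math


def scalarize(reward_vectors, weights):
    """Collapse each multi-objective reward vector to a scalar via a weighted sum."""
    return [sum(w * r for w, r in zip(weights, vec)) for vec in reward_vectors]


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]
```

A static multi-objective baseline corresponds to using the same `weights` for every prompt; the abstract's point is that the weights should instead depend on the task.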

Title: SIRR-LMM: Single-image Reflection Removal via Large Multimodal Model

Authors: Yu Guo, Zhiqiang Lao, Xiyun Song, Yubin Zhou, Heather Yu
Subjects: cs.CV, cs.AI, cs.GR
Abstract URL: https://arxiv.org/abs/2601.07209
Pdf URL: https://arxiv.org/pdf/2601.07209
Copy Paste: [[2601.07209]] SIRR-LMM: Single-image Reflection Removal via Large Multimodal Model(https://arxiv.org/abs/2601.07209)
Keywords: generation
Abstract: Glass surfaces create complex interactions of reflected and transmitted light, making single-image reflection removal (SIRR) challenging. Existing datasets suffer from limited physical realism in synthetic data or insufficient scale in real captures. We introduce a synthetic dataset generation framework that path-traces 3D glass models over real background imagery to create physically accurate reflection scenarios with varied glass properties, camera settings, and post-processing effects. To leverage the capabilities of a Large Multimodal Model (LMM), we concatenate the image layers into a single composite input, apply joint captioning, and fine-tune the model using task-specific LoRA rather than full-parameter training. This enables our approach to achieve improved reflection removal and separation performance compared to state-of-the-art methods.

Title: SceneNAT: Masked Generative Modeling for Language-Guided Indoor Scene Synthesis

Authors: Jeongjun Choi, Yeonsoo Park, H. Jin Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.07218
Pdf URL: https://arxiv.org/pdf/2601.07218
Copy Paste: [[2601.07218]] SceneNAT: Masked Generative Modeling for Language-Guided Indoor Scene Synthesis(https://arxiv.org/abs/2601.07218)
Keywords: generative
Abstract: We present SceneNAT, a single-stage masked non-autoregressive Transformer that synthesizes complete 3D indoor scenes from natural language instructions through only a few parallel decoding passes, offering improved performance and efficiency compared to prior state-of-the-art approaches. SceneNAT is trained via masked modeling over fully discretized representations of both semantic and spatial attributes. By applying a masking strategy at both the attribute level and the instance level, the model can better capture intra-object and inter-object structure. To boost relational reasoning, SceneNAT employs a dedicated triplet predictor for modeling the scene's layout and object relationships by mapping a set of learnable relation queries to a sparse set of symbolic triplets (subject, predicate, object). Extensive experiments on the 3D-FRONT dataset demonstrate that SceneNAT achieves superior performance compared to state-of-the-art autoregressive and diffusion baselines in both semantic compliance and spatial arrangement accuracy, while operating with substantially lower computational cost.

Title: Language-Grounded Multi-Domain Image Translation via Semantic Difference Guidance

Authors: Jongwon Ryu, Joonhyung Park, Jaeho Han, Yeong-Seok Kim, Hye-rin Kim, Sunjae Yoon, Junyeong Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.07221
Pdf URL: https://arxiv.org/pdf/2601.07221
Copy Paste: [[2601.07221]] Language-Grounded Multi-Domain Image Translation via Semantic Difference Guidance(https://arxiv.org/abs/2601.07221)
Keywords: generation
Abstract: Multi-domain image-to-image translation requires grounding semantic differences expressed in natural language prompts into corresponding visual transformations, while preserving unrelated structural and semantic content. Existing methods struggle to maintain structural integrity and provide fine-grained, attribute-specific control, especially when multiple domains are involved. We propose LACE (Language-grounded Attribute Controllable Translation), built on two components: (1) a GLIP-Adapter that fuses global semantics with local structural features to preserve consistency, and (2) a Multi-Domain Control Guidance mechanism that explicitly grounds the semantic delta between source and target prompts into per-attribute translation vectors, aligning linguistic semantics with domain-level visual changes. Together, these modules enable compositional multi-domain control with independent strength modulation for each attribute. Experiments on CelebA(Dialog) and BDD100K demonstrate that LACE achieves high visual fidelity, structural preservation, and interpretable domain-specific control, surpassing prior baselines. This positions LACE as a cross-modal content generation framework bridging language semantics and controllable visual translation.

Title: Innovation Capacity of Dynamical Learning Systems

Authors: Anthony M. Polloreno
Subjects: cs.LG, cs.IT
Abstract URL: https://arxiv.org/abs/2601.07257
Pdf URL: https://arxiv.org/pdf/2601.07257
Copy Paste: [[2601.07257]] Innovation Capacity of Dynamical Learning Systems(https://arxiv.org/abs/2601.07257)
Keywords: generative
Abstract: In noisy physical reservoirs, the classical information-processing capacity $C_{\mathrm{ip}}$ quantifies how well a linear readout can realize tasks measurable from the input history, yet $C_{\mathrm{ip}}$ can be far smaller than the observed rank of the readout covariance. We explain this ``missing capacity'' by introducing the innovation capacity $C_{\mathrm{i}}$, the total capacity allocated to readout components orthogonal to the input filtration (Doob innovations, including input-noise mixing). Using a basis-free Hilbert-space formulation of the predictable/innovation decomposition, we prove the conservation law $C_{\mathrm{ip}}+C_{\mathrm{i}}=\mathrm{rank}(\Sigma_{XX})\le d$, so predictable and innovation capacities exactly partition the rank of the observable readout dimension covariance $\Sigma_{XX}\in \mathbb{R}^{\rm d\times d}$. In linear-Gaussian Johnson-Nyquist regimes, $\Sigma_{XX}(T)=S+T N_0$, the split becomes a generalized-eigenvalue shrinkage rule and gives an explicit monotone tradeoff between temperature and predictable capacity. Geometrically, in whitened coordinates the predictable and innovation components correspond to complementary covariance ellipsoids, making $C_{\mathrm{i}}$ a trace-controlled innovation budget. A large $C_{\mathrm{i}}$ forces a high-dimensional innovation subspace with a variance floor and under mild mixing and anti-concentration assumptions this yields extensive innovation-block differential entropy and exponentially many distinguishable histories. Finally, we give an information-theoretic lower bound showing that learning the induced innovation-block law in total variation requires a number of samples that scales with the effective innovation dimension, supporting the generative utility of noisy physical reservoirs.
摘要：在嘈杂的物理储层中，经典的信息处理能力 $C_{\mathrm{ip}}$ 量化了线性读出如何实现可从输入历史中测量的任务，但 $C_{\mathrm{ip}}$ 可能远小于观察到的读出协方差的秩。我们通过引入创新容量 $C_{\mathrm{i}}$ 来解释这种“容量缺失”，即分配给与输入过滤正交的读出组件的总容量（Doob 创新，包括输入噪声混合）。使用可预测/创新分解的无基希尔伯特空间公式，我们证明了守恒定律 $C_{\mathrm{ip}}+C_{\mathrm{i}}=\mathrm{rank}(\Sigma_{XX})\le d$，因此可预测和创新能力精确地划分了可观测读出维度协方差 $\Sigma_{XX}\in \mathbb{R}^{d\times d}$ 的秩。在线性高斯约翰逊-奈奎斯特体系中，$\Sigma_{XX}(T)=S+T N_0$，分裂成为广义特征值收缩规则，并给出温度和可预测容量之间的明确单调权衡。从几何角度来看，在白化坐标中，可预测和创新分量对应于互补协方差椭球体，使得 $C_{\mathrm{i}}$ 成为跟踪控制的创新预算。一个大的 $C_{\mathrm{i}}$ 迫使高维创新子空间具有方差下限，并且在温和混合和反集中假设下，这会产生广泛的创新块微分熵和指数级许多可区分的历史。最后，我们给出了一个信息论下界，表明学习全变分中的诱导创新块定律需要大量与有效创新维度成比例的样本，支持噪声物理库的生成效用。
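
Illustrative sketch (not from the paper): the conservation law and the temperature tradeoff can be made concrete in the simplest possible setting, a single readout (d = 1, so the rank is 1) with $\Sigma_{XX}(T)=S+T N_0$. The function name `capacity_split` and the identification of $C_{\mathrm{ip}}$ with the input-explained fraction of readout variance are assumptions made here for illustration; the abstract does not give the general formula, and this toy does not reproduce the paper's generalized-eigenvalue derivation.

```python
def capacity_split(signal_var, noise_var_per_unit_temp, temperature):
    """Scalar (d = 1) toy for Sigma_XX(T) = S + T * N_0.

    ASSUMPTION: the predictable capacity is taken to be the fraction of
    readout variance explained by the input history, S / (S + T * N_0),
    and the innovation capacity is the remaining fraction.
    """
    total_var = signal_var + temperature * noise_var_per_unit_temp
    c_ip = signal_var / total_var
    c_i = (temperature * noise_var_per_unit_temp) / total_var
    return c_ip, c_i


# Sweep the temperature: the two capacities should always sum to rank = 1,
# and the predictable share should shrink as the temperature grows.
for temperature in (0.5, 1.0, 2.0, 4.0):
    c_ip, c_i = capacity_split(2.0, 1.0, temperature)
    print(f"T={temperature}: C_ip={c_ip:.3f}, C_i={c_i:.3f}, sum={c_ip + c_i:.3f}")
```

Under these assumptions the predictable share decreases monotonically with T while the sum stays pinned at the rank, which is the qualitative behavior the abstract describes.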

Title: GenDet: Painting Colored Bounding Boxes on Images via Diffusion Model for Object Detection

Authors: Chen Min, Chengyang Li, Fanjie Kong, Qi Zhu, Dawei Zhao, Liang Xiao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.07273
Pdf URL: https://arxiv.org/pdf/2601.07273
Copy Paste: [[2601.07273]] GenDet: Painting Colored Bounding Boxes on Images via Diffusion Model for Object Detection(https://arxiv.org/abs/2601.07273)
Keywords: generation, generative
Abstract: This paper presents GenDet, a novel framework that redefines object detection as an image generation task. In contrast to traditional approaches, GenDet adopts a pioneering approach by leveraging generative modeling: it conditions on the input image and directly generates bounding boxes with semantic annotations in the original image space. GenDet establishes a conditional generation architecture built upon the large-scale pre-trained Stable Diffusion model, formulating the detection task as semantic constraints within the latent space. It enables precise control over bounding box positions and category attributes, while preserving the flexibility of the generative model. This novel methodology effectively bridges the gap between generative models and discriminative tasks, providing a fresh perspective for constructing unified visual understanding systems. Systematic experiments demonstrate that GenDet achieves competitive accuracy compared to discriminative detectors, while retaining the flexibility characteristic of generative methods.
摘要：本文提出了 GenDet，这是一种新颖的框架，它将对象检测重新定义为图像生成任务。与传统方法相比，GenDet 采用了一种利用生成建模的开创性方法：它以输入图像为条件，并直接在原始图像空间中生成带有语义注释的边界框。 GenDet 建立了一个基于大规模预训练稳定扩散模型的条件生成架构，将检测任务制定为潜在空间内的语义约束。它可以精确控制边界框位置和类别属性，同时保留生成模型的灵活性。这种新颖的方法有效地弥合了生成模型和判别任务之间的差距，为构建统一的视觉理解系统提供了新的视角。系统实验表明，与判别检测器相比，GenDet 实现了有竞争力的精度，同时保留了生成方法的灵活性特征。

Title: Focal Guidance: Unlocking Controllability from Semantic-Weak Layers in Video Diffusion Models

Authors: Yuanyang Yin, Yufan Deng, Shenghai Yuan, Kaipeng Zhang, Xiao Yang, Feng Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.07287
Pdf URL: https://arxiv.org/pdf/2601.07287
Copy Paste: [[2601.07287]] Focal Guidance: Unlocking Controllability from Semantic-Weak Layers in Video Diffusion Models(https://arxiv.org/abs/2601.07287)
Keywords: generation
Abstract: The task of Image-to-Video (I2V) generation aims to synthesize a video from a reference image and a text prompt. This requires diffusion models to reconcile high-frequency visual constraints and low-frequency textual guidance during the denoising process. However, while existing I2V models prioritize visual consistency, how to effectively couple this dual guidance to ensure strong adherence to the text prompt remains underexplored. In this work, we observe that in Diffusion Transformer (DiT)-based I2V models, certain intermediate layers exhibit weak semantic responses (termed Semantic-Weak Layers), as indicated by a measurable drop in text-visual similarity. We attribute this to a phenomenon called Condition Isolation, where attention to visual features becomes partially detached from text guidance and overly relies on learned visual priors. To address this, we propose Focal Guidance (FG), which enhances the controllability from Semantic-Weak Layers. FG comprises two mechanisms: (1) Fine-grained Semantic Guidance (FSG) leverages CLIP to identify key regions in the reference frame and uses them as anchors to guide Semantic-Weak Layers. (2) Attention Cache transfers attention maps from semantically responsive layers to Semantic-Weak Layers, injecting explicit semantic signals and alleviating their over-reliance on the model's learned visual priors, thereby enhancing adherence to textual instructions. To further validate our approach and address the lack of evaluation in this direction, we introduce a benchmark for assessing instruction following in I2V models. On this benchmark, Focal Guidance proves its effectiveness and generalizability, raising the total score on Wan2.1-I2V to 0.7250 (+3.97\%) and boosting the MMDiT-based HunyuanVideo-I2V to 0.5571 (+7.44\%).
摘要：图像到视频（I2V）生成的任务旨在根据参考图像和文本提示合成视频。这需要扩散模型在去噪过程中协调高频视觉约束和低频文本指导。然而，虽然现有的 I2V 模型优先考虑视觉一致性，但如何有效地结合这种双重指导以确保严格遵守文本提示仍有待探索。在这项工作中，我们观察到，在基于扩散变压器 (DiT) 的 I2V 模型中，某些中间层表现出弱语义响应（称为语义弱层），如文本视觉相似性可测量的下降所示。我们将此归因于一种称为“条件隔离”的现象，即对视觉特征的注意力部分脱离文本指导，并过度依赖学习的视觉先验。为了解决这个问题，我们提出了焦点引导（FG），它增强了语义弱层的可控性。 FG 包括两种机制：（1）细粒度语义引导（FSG）利用 CLIP 来识别参考框架中的关键区域，并将它们用作锚点来指导语义弱层。（2）注意力缓存将注意力图从语义响应层转移到语义弱层，注入显式语义信号并减轻它们对模型学习的视觉先验的过度依赖，从而增强对文本指令的遵守。为了进一步验证我们的方法并解决这方面缺乏评估的问题，我们引入了一个用于评估 I2V 模型中指令遵循的基准。在此基准测试中，Focal Guidance 证明了其有效性和普适性，将 Wan2.1-I2V 的总分提升至 0.7250 (+3.97\%)，并将基于 MMDiT 的 HunyuanVideo-I2V 提升至 0.5571 (+7.44\%)。

Title: Inference-Time Scaling for Visual AutoRegressive modeling by Searching Representative Samples

Authors: Weidong Tang, Xinyan Wan, Siyu Li, Xiumei Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.07293
Pdf URL: https://arxiv.org/pdf/2601.07293
Copy Paste: [[2601.07293]] Inference-Time Scaling for Visual AutoRegressive modeling by Searching Representative Samples(https://arxiv.org/abs/2601.07293)
Keywords: generative
Abstract: While inference-time scaling has significantly enhanced generative quality in large language and diffusion models, its application to vector-quantized (VQ) visual autoregressive modeling (VAR) remains unexplored. We introduce VAR-Scaling, the first general framework for inference-time scaling in VAR, addressing the critical challenge of discrete latent spaces that prohibit continuous path search. We find that VAR scales exhibit two distinct pattern types: general patterns and specific patterns, where later-stage specific patterns conditionally optimize early-stage general patterns. To overcome the discrete latent space barrier in VQ models, we map sampling spaces to quasi-continuous feature spaces via kernel density estimation (KDE), where high-density samples approximate stable, high-quality solutions. This transformation enables effective navigation of sampling distributions. We propose a density-adaptive hybrid sampling strategy: Top-k sampling focuses on high-density regions to preserve quality near distribution modes, while Random-k sampling explores low-density areas to maintain diversity and prevent premature convergence. Consequently, VAR-Scaling optimizes sample fidelity at critical scales to enhance output quality. Experiments in class-conditional and text-to-image evaluations demonstrate significant improvements in inference process. The code is available at this https URL.
摘要：虽然推理时间缩放显着提高了大型语言和扩散模型的生成质量，但其在矢量量化 (VQ) 视觉自回归模型 (VAR) 中的应用仍未得到探索。我们引入了 VAR-Scaling，这是 VAR 中推理时间缩放的第一个通用框架，解决了禁止连续路径搜索的离散潜在空间的关键挑战。我们发现 VAR 量表表现出两种不同的模式类型：一般模式和特定模式，其中后期特定模式有条件地优化早期一般模式。为了克服 VQ 模型中的离散潜在空间障碍，我们通过核密度估计（KDE）将采样空间映射到准连续特征空间，其中高密度样本近似稳定的高质量解决方案。这种转换可以有效地导航采样分布。我们提出了一种密度自适应混合采样策略：Top-k 采样侧重于高密度区域以保持分布模式附近的质量，而随机 k 采样则探索低密度区域以保持多样性并防止过早收敛。因此，VAR-Scaling 在关键尺度上优化样本保真度，以提高输出质量。类条件和文本到图像评估的实验证明了推理过程的显着改进。该代码可从此 https URL 获取。
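
Illustrative sketch (not from the paper): the density-adaptive hybrid sampling idea can be pictured as scoring candidates with a kernel density estimate, keeping the densest ones, and adding a few randomly chosen candidates from the remainder. The 1-D features, the Gaussian kernel, the bandwidth, and the names `kde_density` and `hybrid_select` are all assumptions for illustration; the abstract does not specify how VAR-Scaling builds its feature space or combines the two strategies.

```python
import math
import random


def kde_density(points, bandwidth=0.5):
    """Gaussian kernel density estimate evaluated at every point (1-D features)."""
    scale = 1.0 / (len(points) * bandwidth * math.sqrt(2.0 * math.pi))
    return [
        scale * sum(math.exp(-0.5 * ((x - y) / bandwidth) ** 2) for y in points)
        for x in points
    ]


def hybrid_select(points, top_k, random_k, bandwidth=0.5, seed=0):
    """Return (top, explore): indices of the top_k densest candidates, plus
    random_k indices drawn uniformly from the remaining candidates."""
    density = kde_density(points, bandwidth)
    order = sorted(range(len(points)), key=lambda i: density[i], reverse=True)
    top, rest = order[:top_k], order[top_k:]
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    explore = rng.sample(rest, min(random_k, len(rest)))
    return top, explore


# Hypothetical candidate features: four clustered near 0 and two outliers.
candidates = [0.0, 0.1, -0.1, 0.05, 3.0, -4.0]
top, explore = hybrid_select(candidates, top_k=2, random_k=2)
print("high-density picks:", top, "exploration picks:", explore)
```

Keeping the two index sets separate makes it easy to vary the quality/diversity mix by changing `top_k` and `random_k`.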

Title: HiVid-Narrator: Hierarchical Video Narrative Generation with Scene-Primed ASR-anchored Compression

Authors: Haoxuan Li, Mengyan Li, Junjun Zheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.07366
Pdf URL: https://arxiv.org/pdf/2601.07366
Copy Paste: [[2601.07366]] HiVid-Narrator: Hierarchical Video Narrative Generation with Scene-Primed ASR-anchored Compression(https://arxiv.org/abs/2601.07366)
Keywords: generation
Abstract: Generating structured narrations for real-world e-commerce videos requires models to perceive fine-grained visual details and organize them into coherent, high-level stories--capabilities that existing approaches struggle to unify. We introduce the E-commerce Hierarchical Video Captioning (E-HVC) dataset with dual-granularity, temporally grounded annotations: a Temporal Chain-of-Thought that anchors event-level observations and a Chapter Summary that composes them into concise, story-centric summaries. Rather than directly prompting chapters, we adopt a staged construction that first gathers reliable linguistic and visual evidence via curated ASR and frame-level descriptions, then refines coarse annotations into precise chapter boundaries and titles conditioned on the Temporal Chain-of-Thought, yielding fact-grounded, time-aligned narratives. We also observe that e-commerce videos are fast-paced and information-dense, with visual tokens dominating the input sequence. To enable efficient training while reducing input tokens, we propose the Scene-Primed ASR-anchored Compressor (SPA-Compressor), which compresses multimodal tokens into hierarchical scene and event representations guided by ASR semantic cues. Built upon these designs, our HiVid-Narrator framework achieves superior narrative quality with fewer input tokens compared to existing methods.
摘要：为现实世界的电子商务视频生成结构化叙述需要模型感知细粒度的视觉细节，并将它们组织成连贯的、高级的故事——现有方法难以统一这种能力。我们引入了电子商务分层视频字幕（E-HVC）数据集，该数据集具有双粒度、基于时间的注释：锚定事件级观察的时间思维链和将它们组成简洁的、以故事为中心的摘要的章节摘要。我们没有直接提示章节，而是采用分阶段的结构，首先通过精心策划的 ASR 和框架级描述收集可靠的语言和视觉证据，然后将粗略的注释细化为以时间思维链为条件的精确的章节边界和标题，从而产生基于事实、时间一致的叙述。我们还观察到，电子商务视频节奏快、信息密集，视觉标记在输入序列中占主导地位。为了在减少输入标记的同时实现高效训练，我们提出了 Scene-Primed ASR 锚定压缩器（SPA-Compressor），它将多模态标记压缩为由 ASR 语义提示引导的分层场景和事件表示。基于这些设计，与现有方法相比，我们的 HiVid-Narrator 框架以更少的输入标记实现了卓越的叙事质量。

Title: OceanSAR-2: A Universal Feature Extractor for SAR Ocean Observation

Authors: Alexandre Tuel, Thomas Kerdreux, Quentin Febvre, Alexis Mouche, Antoine Grouazel, Jean-Renaud Miadana, Antoine Audras, Chen Wang, Bertrand Chapron
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2601.07392
Pdf URL: https://arxiv.org/pdf/2601.07392
Copy Paste: [[2601.07392]] OceanSAR-2: A Universal Feature Extractor for SAR Ocean Observation(https://arxiv.org/abs/2601.07392)
Keywords: generation
Abstract: We present OceanSAR-2, the second generation of our foundation model for SAR-based ocean observation. Building on our earlier release, which pioneered self-supervised learning on Sentinel-1 Wave Mode data, OceanSAR-2 relies on improved SSL training and dynamic data curation strategies, which enhance performance while reducing training cost. OceanSAR-2 demonstrates strong transfer performance across downstream tasks, including geophysical pattern classification, ocean surface wind vector and significant wave height estimation, and iceberg detection. We release standardized benchmark datasets, providing a foundation for systematic evaluation and advancement of SAR models for ocean applications.
摘要：我们推出了 OceanSAR-2，这是我们基于 SAR 的海洋观测的第二代基础模型。 OceanSAR-2 建立在我们早期版本的基础上，该版本率先对 Sentinel-1 波浪模式数据进行自我监督学习，OceanSAR-2 依赖于改进的 SSL 训练和动态数据管理策略，从而提高了性能，同时降低了训练成本。 OceanSAR-2 在下游任务中展示了强大的传输性能，包括地球物理模式分类、海洋表面风矢量和有效波高估计以及冰山检测。我们发布标准化基准数据集，为海洋应用SAR模型的系统评估和改进提供基础。

Title: Forecast the Principal, Stabilize the Residual: Subspace-Aware Feature Caching for Efficient Diffusion Transformers

Authors: Guantao Chen, Shikang Zheng, Yuqi Lin, Linfeng Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.07396
Pdf URL: https://arxiv.org/pdf/2601.07396
Copy Paste: [[2601.07396]] Forecast the Principal, Stabilize the Residual: Subspace-Aware Feature Caching for Efficient Diffusion Transformers(https://arxiv.org/abs/2601.07396)
Keywords: generation
Abstract: Diffusion Transformer (DiT) models have achieved unprecedented quality in image and video generation, yet their iterative sampling process remains computationally prohibitive. To accelerate inference, feature caching methods have emerged by reusing intermediate representations across timesteps. However, existing caching approaches treat all feature components uniformly. We reveal that DiT feature spaces contain distinct principal and residual subspaces with divergent temporal behavior: the principal subspace evolves smoothly and predictably, while the residual subspace exhibits volatile, low-energy oscillations that resist accurate prediction. Building on this insight, we propose SVD-Cache, a subspace-aware caching framework that decomposes diffusion features via Singular Value Decomposition (SVD), applies exponential moving average (EMA) prediction to the dominant low-rank components, and directly reuses the residual subspace. Extensive experiments demonstrate that SVD-Cache achieves near-lossless acceleration across diverse models and methods, including 5.55$\times$ speedup on FLUX and HunyuanVideo, and compatibility with model acceleration techniques including distillation, quantization and sparse attention. Our code is in supplementary material and will be released on GitHub.
摘要：扩散变压器 (DiT) 模型在图像和视频生成方面取得了前所未有的质量，但其迭代采样过程在计算上仍然令人望而却步。为了加速推理，通过跨时间步重用中间表示出现了特征缓存方法。然而，现有的缓存方法统一处理所有特征组件。我们揭示了 DiT 特征空间包含不同的主子空间和残差子空间，具有不同的时间行为：主子空间平稳且可预测地演化，而残差子空间表现出不稳定的低能量振荡，难以准确预测。基于这一见解，我们提出了 SVD-Cache，这是一种子空间感知的缓存框架，它通过奇异值分解（SVD）分解扩散特征，将指数移动平均（EMA）预测应用于主要的低秩分量，并直接重用残差子空间。大量的实验表明，SVD-Cache 在不同的模型和方法中实现了近乎无损的效果，包括在 FLUX 和 HunyuanVideo 上实现了 5.55$\times$ 的加速，以及与蒸馏、量化和稀疏注意力等模型加速技术的兼容性。我们的代码位于补充材料中，并将在 Github 上发布。
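
Illustrative sketch (not from the paper): the "forecast the principal, reuse the residual" idea can be mimicked on a tiny matrix. This toy keeps only a rank-1 principal part (found by power iteration instead of a full SVD), forecasts it by adding an exponential moving average of its past step-to-step changes, and copies the last residual unchanged. The rank-1 truncation, the EMA-of-deltas form, the value of `beta`, and all function names are assumptions for illustration; the abstract does not describe SVD-Cache at this level of detail.

```python
import math


def _norm(v):
    return math.sqrt(sum(x * x for x in v))


def top_singular_triplet(F, iters=50):
    """Leading singular value and vectors of F via power iteration."""
    rows, cols = len(F), len(F[0])
    v = [1.0 / math.sqrt(cols)] * cols
    for _ in range(iters):
        u = [sum(F[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        s = _norm(u)
        u = [x / s for x in u]
        v = [sum(F[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        n = _norm(v)
        v = [x / n for x in v]
    u = [sum(F[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    s = _norm(u)
    return s, [x / s for x in u], v


def split_feature(F):
    """Split F into a rank-1 principal part and the leftover residual."""
    s, u, v = top_singular_triplet(F)
    principal = [[s * ui * vj for vj in v] for ui in u]
    residual = [[F[i][j] - principal[i][j] for j in range(len(v))]
                for i in range(len(u))]
    return principal, residual


def forecast_next(history, beta=0.5):
    """Predict the next feature from >= 2 fully computed ones (ASSUMED rule):
    last principal + EMA of principal deltas + last residual reused as-is."""
    splits = [split_feature(F) for F in history]
    rows, cols = len(history[0]), len(history[0][0])
    ema = None
    for (prev, _), (cur, _) in zip(splits, splits[1:]):
        delta = [[cur[i][j] - prev[i][j] for j in range(cols)] for i in range(rows)]
        if ema is None:
            ema = delta
        else:
            ema = [[beta * ema[i][j] + (1.0 - beta) * delta[i][j]
                    for j in range(cols)] for i in range(rows)]
    principal, residual = splits[-1]
    return [[principal[i][j] + ema[i][j] + residual[i][j] for j in range(cols)]
            for i in range(rows)]


def make_feature(c):
    """Hypothetical test feature: a growing rank-1 part plus a small fixed residual."""
    return [[2 * c + 0.1, c - 0.2], [4 * c - 0.05, 2 * c + 0.1]]


predicted = forecast_next([make_feature(1.0), make_feature(1.1), make_feature(1.2)])
print(predicted)
```

In this toy the principal part drifts linearly and the residual is constant, so the forecast matches the true next feature; real diffusion features would only be approximated.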

Title: From Sketch to Fresco: Efficient Diffusion Transformer with Progressive Resolution

Authors: Shikang Zheng, Guantao Chen, Lixuan He, Jiacheng Liu, Yuqi Lin, Chang Zou, Linfeng Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.07462
Pdf URL: https://arxiv.org/pdf/2601.07462
Copy Paste: [[2601.07462]] From Sketch to Fresco: Efficient Diffusion Transformer with Progressive Resolution(https://arxiv.org/abs/2601.07462)
Keywords: generative
Abstract: Diffusion Transformers achieve impressive generative quality but remain computationally expensive due to iterative sampling. Recently, dynamic resolution sampling has emerged as a promising acceleration technique by reducing the resolution of early sampling steps. However, existing methods rely on heuristic re-noising at every resolution transition, injecting noise that breaks cross-stage consistency and forces the model to relearn global structure. In addition, these methods indiscriminately upsample the entire latent space at once without checking which regions have actually converged, causing accumulated errors and visible artifacts. Therefore, we propose Fresco, a dynamic resolution framework that unifies re-noise and global structure across stages with progressive upsampling, preserving both the efficiency of low-resolution drafting and the fidelity of high-resolution refinement, with all stages aligned toward the same final target. Fresco achieves near-lossless acceleration across diverse domains and models, including 10$\times$ speedup on FLUX, and 5$\times$ on HunyuanVideo, while remaining orthogonal to distillation, quantization and feature caching, reaching 22$\times$ speedup when combined with distilled models. Our code is in supplementary material and will be released on GitHub.
摘要：扩散变压器实现了令人印象深刻的生成质量，但由于迭代采样，计算成本仍然很高。最近，动态分辨率采样通过降低早期采样步骤的分辨率而成为一种有前途的加速技术。然而，现有方法依赖于每次分辨率转换时的启发式重新噪声，注入噪声，破坏跨阶段一致性并迫使模型重新学习全局结构。此外，这些方法不加区别地立即对整个潜在空间进行上采样，而不检查哪些区域实际上已经收敛，从而导致累积错误和可见的伪影。因此，我们提出了 Fresco，一个动态分辨率框架，它将跨阶段的再噪声和全局结构与渐进上采样统一起来，保持低分辨率绘图的效率和高分辨率细化的保真度，所有阶段都朝着相同的最终目标对齐。 Fresco 在不同领域和模型上实现了近乎无损的加速，包括在 FLUX 上实现 10$\times$ 加速，在 HunyuanVideo 上实现 5$\times$，同时保持与蒸馏、量化和特征缓存正交，与蒸馏模型结合时达到 22$\times$ 加速。我们的代码位于补充材料中，并将在 GitHub 上发布。

Title: Graph Inference Towards ICD Coding

Authors: Xiaoxiao Deng
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2601.07496
Pdf URL: https://arxiv.org/pdf/2601.07496
Copy Paste: [[2601.07496]] Graph Inference Towards ICD Coding(https://arxiv.org/abs/2601.07496)
Keywords: generation
Abstract: Automated ICD coding involves assigning standardized diagnostic codes to clinical narratives. The vast label space and extreme class imbalance continue to challenge precise prediction. To address these issues, LabGraph is introduced -- a unified framework that reformulates ICD coding as a graph generation task. By combining adversarial domain adaptation, graph-based reinforcement learning, and perturbation regularization, LabGraph effectively enhances model robustness and generalization. In addition, a label graph discriminator dynamically evaluates each generated code, providing adaptive reward feedback during training. Experiments on benchmark datasets demonstrate that LabGraph consistently outperforms previous approaches on micro-F1, micro-AUC, and P@K.
摘要：自动 ICD 编码涉及为临床叙述分配标准化诊断代码。巨大的标签空间和极端的类别不平衡继续挑战着精确的预测。为了解决这些问题，引入了 LabGraph——一个统一的框架，将 ICD 编码重新表述为图形生成任务。 LabGraph通过结合对抗性域适应、基于图的强化学习和扰动正则化，有效增强了模型的鲁棒性和泛化性。此外，标签图鉴别器动态评估每个生成的代码，在训练期间提供自适应奖励反馈。基准数据集上的实验表明，LabGraph 在 micro-F1、micro-AUC 和 P@K 上始终优于之前的方法。

Title: FROAV: A Framework for RAG Observation and Agent Verification - Lowering the Barrier to LLM Agent Research

Authors: Tzu-Hsuan Lin, Chih-Hsuan Kao
Subjects: cs.LG, cs.SE
Abstract URL: https://arxiv.org/abs/2601.07504
Pdf URL: https://arxiv.org/pdf/2601.07504
Copy Paste: [[2601.07504]] FROAV: A Framework for RAG Observation and Agent Verification - Lowering the Barrier to LLM Agent Research(https://arxiv.org/abs/2601.07504)
Keywords: generation
Abstract: The rapid advancement of Large Language Models (LLMs) and their integration into autonomous agent systems has created unprecedented opportunities for document analysis, decision support, and knowledge retrieval. However, the complexity of developing, evaluating, and iterating on LLM-based agent workflows presents significant barriers to researchers, particularly those without extensive software engineering expertise. We present FROAV (Framework for RAG Observation and Agent Verification), an open-source research platform that democratizes LLM agent research by providing a plug-and-play architecture combining visual workflow orchestration, a comprehensive evaluation framework, and extensible Python integration. FROAV implements a multi-stage Retrieval-Augmented Generation (RAG) pipeline coupled with a rigorous "LLM-as-a-Judge" evaluation system, all accessible through intuitive graphical interfaces. Our framework integrates n8n for no-code workflow design, PostgreSQL for granular data management, FastAPI for flexible backend logic, and Streamlit for human-in-the-loop interaction. Through this integrated ecosystem, researchers can rapidly prototype RAG strategies, conduct prompt engineering experiments, validate agent performance against human judgments, and collect structured feedback-all without writing infrastructure code. We demonstrate the framework's utility through its application to financial document analysis, while emphasizing its material-agnostic architecture that adapts to any domain requiring semantic analysis. FROAV represents a significant step toward making LLM agent research accessible to a broader scientific community, enabling researchers to focus on hypothesis testing and algorithmic innovation rather than system integration challenges.
摘要：大型语言模型 (LLM) 的快速发展及其与自主代理系统的集成为文档分析、决策支持和知识检索创造了前所未有的机会。然而，开发、评估和迭代基于 LLM 的代理工作流程的复杂性给研究人员，特别是那些没有广泛软件工程专业知识的研究人员带来了巨大的障碍。我们推出 FROAV（RAG 观察和代理验证框架），这是一个开源研究平台，通过提供结合了可视化工作流程编排、综合评估框架和可扩展 Python 集成的即插即用架构，使 LLM 代理研究民主化。 FROAV 实现了多阶段检索增强生成 (RAG) 流程以及严格的“LLM-as-a-Judge”评估系统，所有这些都可以通过直观的图形界面进行访问。我们的框架集成了用于无代码工作流程设计的 n8n、用于精细数据管理的 PostgreSQL、用于灵活后端逻辑的 FastAPI 以及用于人机交互的 Streamlit。通过这个集成的生态系统，研究人员可以快速构建 RAG 策略原型，进行及时的工程实验，根据人类判断验证代理性能，并收集结构化反馈 - 所有这些都无需编写基础设施代码。我们通过其在财务文档分析中的应用来展示该框架的实用性，同时强调其与材料无关的架构，适用于任何需要语义分析的领域。 FROAV 代表着让 LLM 代理研究向更广泛的科学界开放的重要一步，使研究人员能够专注于假设检验和算法创新，而不是系统集成挑战。

Title: Land-then-transport: A Flow Matching-Based Generative Decoder for Wireless Image Transmission

Authors: Jingwen Fu, Ming Xiao, Mikael Skoglund, Dong In Kim
Subjects: cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2601.07512
Pdf URL: https://arxiv.org/pdf/2601.07512
Copy Paste: [[2601.07512]] Land-then-transport: A Flow Matching-Based Generative Decoder for Wireless Image Transmission(https://arxiv.org/abs/2601.07512)
Keywords: generative
Abstract: Due to strict rate and reliability demands, wireless image transmission remains difficult for both classical layered designs and joint source-channel coding (JSCC), especially under low latency. Diffusion-based generative decoders can deliver strong perceptual quality by leveraging learned image priors, but iterative stochastic denoising leads to high decoding delay. To enable low-latency decoding, we propose a flow-matching (FM) generative decoder under a new land-then-transport (LTT) paradigm that tightly integrates the physical wireless channel into a continuous-time probability flow. For AWGN channels, we build a Gaussian smoothing path whose noise schedule indexes effective noise levels, and derive a closed-form teacher velocity field along this path. A neural-network student vector field is trained by conditional flow matching, yielding a deterministic, channel-aware ODE decoder with complexity linear in the number of ODE steps. At inference, it only needs an estimate of the effective noise variance to set the ODE starting time. We further show that Rayleigh fading and MIMO channels can be mapped, via linear MMSE equalization and singular-value-domain processing, to AWGN-equivalent channels with calibrated starting times. Therefore, the same probability path and trained velocity field can be reused for Rayleigh and MIMO without retraining. Experiments on MNIST, Fashion-MNIST, and DIV2K over AWGN, Rayleigh, and MIMO demonstrate consistent gains over JPEG2000+LDPC, DeepJSCC, and diffusion-based baselines, while achieving good perceptual quality with only a few ODE steps. Overall, LTT provides a deterministic, physically interpretable, and computation-efficient framework for generative wireless image decoding across diverse channels.
摘要：由于严格的速率和可靠性要求，无线图像传输对于经典的分层设计和联合源通道编码（JSCC）来说仍然很困难，特别是在低延迟的情况下。基于扩散的生成解码器可以通过利用学习的图像先验提供强大的感知质量，但迭代随机去噪会导致较高的解码延迟。为了实现低延迟解码，我们提出了一种新的陆地然后传输（LTT）范式下的流匹配（FM）生成解码器，该范式将物理无线信道紧密集成到连续时间概率流中。对于 AWGN 通道，我们构建了一条高斯平滑路径，其噪声表索引了有效噪声水平，并沿该路径导出了一个封闭形式的教师速度场。通过条件流匹配训练神经网络学生向量场，产生确定性、通道感知的 ODE 解码器，其复杂度与 ODE 步骤数呈线性关系。推断时，只需要估计有效噪声方差即可设置 ODE 起始时间。我们进一步表明，瑞利衰落和 MIMO 信道可以通过线性 MMSE 均衡和奇异值域处理映射到具有校准起始时间的 AWGN 等效信道。因此，相同的概率路径和训练的速度场可以重新用于瑞利和 MIMO，而无需重新训练。通过 AWGN、Rayleigh 和 MIMO 在 MNIST、Fashion-MNIST 和 DIV2K 上进行的实验表明，与 JPEG2000+LDPC、DeepJSCC 和基于扩散的基线相比，具有一致的增益，同时只需几个 ODE 步骤即可实现良好的感知质量。总体而言，LTT 为跨不同通道的生成无线图像解码提供了确定性、物理可解释且计算高效的框架。

Title: d3LLM: Ultra-Fast Diffusion LLM using Pseudo-Trajectory Distillation

Authors: Yu-Yang Qian, Junda Su, Lanxiang Hu, Peiyuan Zhang, Zhijie Deng, Peng Zhao, Hao Zhang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2601.07568
Pdf URL: https://arxiv.org/pdf/2601.07568
Copy Paste: [[2601.07568]] d3LLM: Ultra-Fast Diffusion LLM using Pseudo-Trajectory Distillation(https://arxiv.org/abs/2601.07568)
Keywords: generation
Abstract: Diffusion large language models (dLLMs) offer capabilities beyond those of autoregressive (AR) LLMs, such as parallel decoding and random-order generation. However, realizing these benefits in practice is non-trivial, as dLLMs inherently face an accuracy-parallelism trade-off. Despite increasing interest, existing methods typically focus on only one-side of the coin, targeting either efficiency or performance. To address this limitation, we propose d3LLM (Pseudo-Distilled Diffusion Large Language Model), striking a balance between accuracy and parallelism: (i) during training, we introduce pseudo-trajectory distillation to teach the model which tokens can be decoded confidently at early steps, thereby improving parallelism; (ii) during inference, we employ entropy-based multi-block decoding with a KV-cache refresh mechanism to achieve high parallelism while maintaining accuracy. To better evaluate dLLMs, we also introduce AUP (Accuracy Under Parallelism), a new metric that jointly measures accuracy and parallelism. Experiments demonstrate that our d3LLM achieves up to 10$\times$ speedup over vanilla LLaDA/Dream and 5$\times$ speedup over AR models without much accuracy drop. Our code is available at this https URL.
摘要：扩散大语言模型 (dLLM) 提供超越自回归 (AR) LLM 的功能，例如并行解码和随机顺序生成。然而，在实践中实现这些好处并非易事，因为 dLLM 本质上面临着准确性与并行性的权衡。尽管人们的兴趣日益浓厚，但现有方法通常只关注硬币的一面，要么瞄准效率，要么瞄准性能。为了解决这个限制，我们提出了 d3LLM（伪蒸馏扩散大型语言模型），在准确性和并行性之间取得平衡：（i）在训练过程中，我们引入伪轨迹蒸馏来教导模型可以在早期步骤自信地解码哪些标记，从而提高并行性； (ii) 在推理过程中，我们采用基于熵的多块解码和 KV 缓存刷新机制，以在保持准确性的同时实现高并行性。为了更好地评估 dLLM，我们还引入了 AUP（并行度下的准确性），这是一种联合衡量准确性和并行度的新指标。实验表明，我们的 d3LLM 比普通 LLaDA/Dream 实现了高达 10$\times$ 的加速，比 AR 模型实现了 5$\times$ 的加速，而精度没有太大下降。我们的代码可以在这个 https URL 上找到。
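
Illustrative sketch (not from the paper): the core of entropy-based parallel decoding can be pictured as committing, in a single step, every masked position whose predicted token distribution has low entropy, and leaving the uncertain positions for later steps. The fixed threshold, the dictionary layout, and the names `entropy` and `select_confident` are assumptions for illustration; d3LLM's multi-block scheduling and KV-cache refresh are not modeled here.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_confident(distributions, threshold):
    """Map each masked position whose entropy is <= threshold to its argmax token.

    `distributions` maps position -> list of token probabilities.
    Positions above the threshold stay masked for a later decoding step.
    """
    decoded = {}
    for position, probs in distributions.items():
        if entropy(probs) <= threshold:
            decoded[position] = max(range(len(probs)), key=probs.__getitem__)
    return decoded


# Hypothetical predictions for three masked positions over a 4-token vocabulary.
step_predictions = {
    0: [0.97, 0.01, 0.01, 0.01],  # confident
    1: [0.25, 0.25, 0.25, 0.25],  # maximally uncertain
    2: [0.05, 0.90, 0.03, 0.02],  # fairly confident
}
print(select_confident(step_predictions, threshold=0.5))
```

Raising the threshold decodes more positions per step (more parallelism) at the risk of committing uncertain tokens, which is the accuracy-parallelism trade-off the abstract refers to.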

Title: StdGEN++: A Comprehensive System for Semantic-Decomposed 3D Character Generation

Authors: Yuze He, Yanning Zhou, Wang Zhao, Jingwen Ye, Zhongkai Wu, Ran Yi, Yong-Jin Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.07660
Pdf URL: https://arxiv.org/pdf/2601.07660
Copy Paste: [[2601.07660]] StdGEN++: A Comprehensive System for Semantic-Decomposed 3D Character Generation(https://arxiv.org/abs/2601.07660)
Keywords: generation, generative
Abstract: We present StdGEN++, a novel and comprehensive system for generating high-fidelity, semantically decomposed 3D characters from diverse inputs. Existing 3D generative methods often produce monolithic meshes that lack the structural flexibility required by industrial pipelines in gaming and animation. Addressing this gap, StdGEN++ is built upon a Dual-branch Semantic-aware Large Reconstruction Model (Dual-Branch S-LRM), which jointly reconstructs geometry, color, and per-component semantics in a feed-forward manner. To achieve production-level fidelity, we introduce a novel semantic surface extraction formalism compatible with hybrid implicit fields. This mechanism is accelerated by a coarse-to-fine proposal scheme, which significantly reduces memory footprint and enables high-resolution mesh generation. Furthermore, we propose a video-diffusion-based texture decomposition module that disentangles appearance into editable layers (e.g., separated iris and skin), resolving semantic confusion in facial regions. Experiments demonstrate that StdGEN++ achieves state-of-the-art performance, significantly outperforming existing methods in geometric accuracy and semantic disentanglement. Crucially, the resulting structural independence unlocks advanced downstream capabilities, including non-destructive editing, physics-compliant animation, and gaze tracking, making it a robust solution for automated character asset production.
摘要：我们推出了 StdGEN++，这是一种新颖且全面的系统，用于从不同的输入生成高保真、语义分解的 3D 角色。现有的 3D 生成方法通常会生成整体网格，缺乏游戏和动画工业管道所需的结构灵活性。为了解决这一差距，StdGEN++ 建立在双分支语义感知大型重建模型（双分支 S-LRM）的基础上，该模型以前馈方式联合重建几何、颜色和每个组件的语义。为了实现生产级保真度，我们引入了一种与混合隐式字段兼容的新颖的语义表面提取形式。该机制通过从粗到细的提议方案来加速，从而显着减少内存占用并实现高分辨率网格生成。此外，我们提出了一种基于视频扩散的纹理分解模块，将外观分解为可编辑层（例如，分离的虹膜和皮肤），解决面部区域的语义混淆。实验表明，StdGEN++ 实现了最先进的性能，在几何精度和语义解缠方面显着优于现有方法。至关重要的是，由此产生的结构独立性解锁了先进的下游功能，包括无损编辑、符合物理的动画和视线跟踪，使其成为自动化角色资产生产的强大解决方案。

Title: Advancing Multinational License Plate Recognition Through Synthetic and Real Data Fusion: A Comprehensive Evaluation

Authors: Rayson Laroca, Valter Estevam, Gladston J. P. Moreira, Rodrigo Minetto, David Menotti
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.07671
Pdf URL: https://arxiv.org/pdf/2601.07671
Copy Paste: [[2601.07671]] Advancing Multinational License Plate Recognition Through Synthetic and Real Data Fusion: A Comprehensive Evaluation(https://arxiv.org/abs/2601.07671)
Keywords: generation, generative
Abstract: Automatic License Plate Recognition is a frequent research topic due to its wide-ranging practical applications. While recent studies use synthetic images to improve License Plate Recognition (LPR) results, there remain several limitations in these efforts. This work addresses these constraints by comprehensively exploring the integration of real and synthetic data to enhance LPR performance. We subject 16 Optical Character Recognition (OCR) models to a benchmarking process involving 12 public datasets acquired from various regions. Several key findings emerge from our investigation. Primarily, the massive incorporation of synthetic data substantially boosts model performance in both intra- and cross-dataset scenarios. We examine three distinct methodologies for generating synthetic data: template-based generation, character permutation, and utilizing a Generative Adversarial Network (GAN) model, each contributing significantly to performance enhancement. The combined use of these methodologies demonstrates a notable synergistic effect, leading to end-to-end results that surpass those reached by state-of-the-art methods and established commercial systems. Our experiments also underscore the efficacy of synthetic data in mitigating challenges posed by limited training data, enabling remarkable results to be achieved even with small fractions of the original training data. Finally, we investigate the trade-off between accuracy and speed among different models, identifying those that strike the optimal balance in both intra-dataset and cross-dataset settings.
摘要：自动车牌识别由于其广泛的实际应用而成为一个常见的研究课题。虽然最近的研究使用合成图像来改善车牌识别 (LPR) 结果，但这些努力仍然存在一些局限性。这项工作通过全面探索真实数据和合成数据的集成来解决这些限制，以提高 LPR 性能。我们对 16 个光学字符识别 (OCR) 模型进行基准测试，涉及从不同地区获取的 12 个公共数据集。我们的调查得出了几个关键发现。首先，合成数据的大量合并极大地提高了数据集内和跨数据集场景中的模型性能。我们研究了三种不同的生成合成数据的方法：基于模板的生成、字符排列和利用生成对抗网络（GAN）模型，每种方法都对性能提升做出了显着贡献。这些方法的结合使用表现出显着的协同效应，产生的端到端结果超越了最先进的方法和已建立的商业系统所达到的结果。我们的实验还强调了合成数据在缓解有限训练数据带来的挑战方面的功效，即使使用原始训练数据的一小部分也能取得显着的结果。最后，我们研究了不同模型之间的准确性和速度之间的权衡，确定了在每个数据集内和跨数据集设置中达到最佳平衡的模型。

Title: Leveraging 3D Representation Alignment and RGB Pretrained Priors for LiDAR Scene Generation

Authors: Nicolas Sereyjol-Garros, Ellington Kirby, Victor Besnier, Nermin Samet
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.07692
Pdf URL: https://arxiv.org/pdf/2601.07692
Copy Paste: [[2601.07692]] Leveraging 3D Representation Alignment and RGB Pretrained Priors for LiDAR Scene Generation(https://arxiv.org/abs/2601.07692)
Keywords: generation, generative
Abstract: LiDAR scene synthesis is an emerging solution to scarcity in 3D data for robotic tasks such as autonomous driving. Recent approaches employ diffusion or flow matching models to generate realistic scenes, but 3D data remains limited compared to RGB datasets with millions of samples. We introduce R3DPA, the first LiDAR scene generation method to unlock image-pretrained priors for LiDAR point clouds, and leverage self-supervised 3D representations for state-of-the-art results. Specifically, we (i) align intermediate features of our generative model with self-supervised 3D features, which substantially improves generation quality; (ii) transfer knowledge from large-scale image-pretrained generative models to LiDAR generation, mitigating limited LiDAR datasets; and (iii) enable point cloud control at inference for object inpainting and scene mixing with solely an unconditional model. On the KITTI-360 benchmark, R3DPA achieves state-of-the-art performance. Code and pretrained models are available at this https URL.
摘要：LiDAR 场景合成是解决自动驾驶等机器人任务 3D 数据稀缺问题的新兴解决方案。最近的方法采用扩散或流动匹配模型来生成逼真的场景，但与具有数百万个样本的 RGB 数据集相比，3D 数据仍然有限。我们引入了 R3DPA，这是第一个用于解锁 LiDAR 点云图像预训练先验的 LiDAR 场景生成方法，并利用自监督 3D 表示来获得最先进的结果。具体来说，我们 (i) 将生成模型的中间特征与自监督 3D 特征对齐，这大大提高了生成质量； (ii) 将知识从大规模图像预训练生成模型转移到 LiDAR 生成，从而缓解 LiDAR 数据集有限的问题； (iii) 在推理时启用点云控制，以仅使用无条件模型进行对象修复和场景混合。在 KITTI-360 基准测试中，R3DPA 实现了最先进的性能。此 https URL 提供了代码和预训练模型。

Title: Evaluating the encoding competence of visual language models using uncommon actions

Authors: Chen Ling, Nai Ding
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2601.07737
Pdf URL: https://arxiv.org/pdf/2601.07737
Copy Paste: [[2601.07737]] Evaluating the encoding competence of visual language models using uncommon actions(https://arxiv.org/abs/2601.07737)
Keywords: generation
Abstract: We propose the UAIT (Uncommon-sense Action Image-Text) dataset, a new evaluation benchmark designed to test the semantic understanding ability of visual language models (VLMs) in uncommon-sense action scenes. Unlike previous datasets that focus on common visual scenes with statistical frequency advantages, UAIT challenges models with grammatically reasonable but semantically counter-commonsense image-text pairs. Such tasks require models to go beyond superficial pattern recognition and demonstrate a deep understanding of agent-patient relationships and physical feasibility. To build UAIT, we designed a semi-automated process to synthesize high-quality uncommon-sense image-text samples using large language models, few-shot prompt engineering, and text-to-image generation. Each sample is accompanied by a carefully designed multiple-choice question to test the model's competence in fine-grained reasoning. We evaluate multiple state-of-the-art visual language models and compare them with models based on contrastive learning. Experiments show that all models perform significantly worse than humans in semantic judgment, especially in distinguishing grammatical correctness from semantic rationality. Further experiments show that even a lightweight model can improve its accuracy after fine-tuning, demonstrating the great potential of directional adaptation. This study not only reveals the key weaknesses of VLMs, but also provides diagnostic tools and research directions for the development of robust models with real visual semantic reasoning capabilities.
摘要：我们提出了 UAIT（Uncommon-sense Action Image-Text）数据集，这是一个新的评估基准，旨在测试视觉语言模型（VLM）在罕见动作场景中的语义理解能力。与以前的数据集侧重于具有统计频率优势的常见视觉场景不同，UAIT 用语法上合理但语义上违反常识的图像文本对来挑战模型。此类任务要求模型超越肤浅的模式识别，并展示对代理与患者关系和物理可行性的深刻理解。为了构建 UAIT，我们设计了一个半自动化流程，使用大型语言模型、少量提示工程和文本到图像生成来合成高质量的非常识图像文本样本。每个样本都附有精心设计的多项选择题，以测试模型的细粒度推理能力。我们评估多种最先进的视觉语言模型，并将它们与基于对比学习的模型进行比较。实验表明，所有模型在语义判断方面的表现都明显比人类差，尤其是在区分语法正确性和语义合理性方面。进一步的实验表明，即使是轻量级模型经过微调也能提高其准确性，展示了方向适应的巨大潜力。这项研究不仅揭示了VLM的关键弱点，而且为开发具有真实视觉语义推理能力的鲁棒模型提供了诊断工具和研究方向。

Title: Beyond External Guidance: Unleashing the Semantic Richness Inside Diffusion Transformers for Improved Training

Authors: Lingchen Sun, Rongyuan Wu, Zhengqiang Zhang, Ruibin Li, Yujing Sun, Shuaizheng Liu, Lei Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.07773
Pdf URL: https://arxiv.org/pdf/2601.07773
Copy Paste: [[2601.07773]] Beyond External Guidance: Unleashing the Semantic Richness Inside Diffusion Transformers for Improved Training(https://arxiv.org/abs/2601.07773)
Keywords: generation, generative
Abstract: Recent works such as REPA have shown that guiding diffusion models with external semantic features (e.g., DINO) can significantly accelerate the training of diffusion transformers (DiTs). However, this requires the use of pretrained external networks, introducing additional dependencies and reducing flexibility. In this work, we argue that DiTs actually have the power to guide their own training, and propose Self-Transcendence, a simple yet effective method that achieves fast convergence using internal feature supervision only. It is found that the slow convergence in DiT training primarily stems from the difficulty of representation learning in shallow layers. To address this, we initially train the DiT model by aligning its shallow features with the latent representations from the pretrained VAE for a short phase (e.g., 40 epochs), then apply classifier-free guidance to the intermediate features, enhancing their discriminative capability and semantic expressiveness. These enriched internal features, learned entirely within the model, are used as supervision signals to guide the training of a new DiT. Compared to existing self-contained methods, our approach brings a significant performance boost. It can even surpass REPA in terms of generation quality and convergence speed, but without the need for any external pretrained models. Our method is not only more flexible for different backbones but also has the potential to be adopted for a wider range of diffusion-based generative tasks. The source code of our method can be found at this https URL.
摘要：最近的研究（例如 REPA）表明，利用外部语义特征（例如 DINO）引导扩散模型可以显着加速扩散变压器（DiT）的训练。然而，这需要使用预先训练的外部网络，引入额外的依赖关系并降低灵活性。在这项工作中，我们认为 DiT 实际上有能力指导自己的训练，并提出 \textbf{自我超越}，这是一种简单而有效的方法，仅使用内部特征监督即可实现快速收敛。研究发现，DiT 训练收敛速度慢主要源于浅层表示学习的困难。为了解决这个问题，我们首先通过将其浅层特征与来自预训练 VAE 的潜在表示进行短阶段（例如 40 个时期）的对齐来训练 DiT 模型，然后对中间特征应用无分类器指导，增强其判别能力和语义表达能力。这些丰富的内部特征完全在模型中学习，被用作指导新的 DiT 训练的监督信号。与现有的独立方法相比，我们的方法带来了显着的性能提升。它甚至可以在生成质量和收敛速度方面超越REPA，但不需要任何外部预训练模型。我们的方法不仅对于不同的主干网更加灵活，而且有潜力被采用于更广泛的基于扩散的生成任务。我们方法的源代码可以在此 https URL 中找到。
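
The abstract above mentions applying classifier-free guidance to intermediate features. A minimal sketch of that idea follows, assuming the standard guidance extrapolation (unconditional output plus a scaled difference between conditional and unconditional outputs) is applied elementwise to feature vectors. The function name `cfg_combine`, the guidance scale, and the toy vectors are hypothetical; the abstract does not specify how the authors implement this step.

```python
def cfg_combine(cond, uncond, scale):
    """Classifier-free guidance extrapolation, applied elementwise.

    guided = uncond + scale * (cond - uncond)
    scale = 0 returns the unconditional features, scale = 1 the conditional
    ones, and scale > 1 pushes further along the conditional direction.
    """
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


# Toy intermediate features from a conditional and an unconditional pass
# (values are made up for illustration).
cond_feat = [1.0, 2.0, 3.0]
uncond_feat = [0.5, 2.0, 2.0]

guided = cfg_combine(cond_feat, uncond_feat, scale=2.0)
print(guided)
```

In the setting described by the abstract, features of this kind would then serve as supervision targets for training a new model; how the target and student features are compared is not stated there and is left out of this sketch.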

Title: More Images, More Problems? A Controlled Analysis of VLM Failure Modes

Authors: Anurag Das, Adrian Bulat, Alberto Baldrati, Ioannis Maniadis Metaxas, Bernt Schiele, Georgios Tzimiropoulos, Brais Martinez
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.07812
Pdf URL: https://arxiv.org/pdf/2601.07812
Copy Paste: [[2601.07812]] More Images, More Problems? A Controlled Analysis of VLM Failure Modes(https://arxiv.org/abs/2601.07812)
Keywords: generation
Abstract: Large Vision Language Models (LVLMs) have demonstrated remarkable capabilities, yet their proficiency in understanding and reasoning over multiple images remains largely unexplored. While existing benchmarks have initiated the evaluation of multi-image models, a comprehensive analysis of their core weaknesses and their causes is still lacking. In this work, we introduce MIMIC (Multi-Image Model Insights and Challenges), a new benchmark designed to rigorously evaluate the multi-image capabilities of LVLMs. Using MIMIC, we conduct a series of diagnostic experiments that reveal pervasive issues: LVLMs often fail to aggregate information across images and struggle to track or attend to multiple concepts simultaneously. To address these failures, we propose two novel complementary remedies. On the data side, we present a procedural data-generation strategy that composes single-image annotations into rich, targeted multi-image training examples. On the optimization side, we analyze layer-wise attention patterns and derive an attention-masking scheme tailored for multi-image inputs. In experiments, these remedies substantially improve cross-image aggregation while also enhancing performance on existing multi-image benchmarks, outperforming the prior state of the art across tasks. Data and code will be made available at this https URL.
摘要：大视觉语言模型 (LVLM) 已展现出非凡的能力，但其对多个图像的理解和推理能力在很大程度上仍未得到探索。虽然现有基准已经启动了多图像模型的评估，但仍然缺乏对其核心弱点及其原因的全面分析。在这项工作中，我们引入了 MIMIC（多图像模型见解和挑战），这是一个新的基准，旨在严格评估 LVLM 的多图像功能。使用 MIMIC，我们进行了一系列诊断实验，揭示了普遍存在的问题：LVLM 通常无法跨图像聚合信息，并且很难同时跟踪或关注多个概念。为了解决这些失败，我们提出了两种新颖的补充补救措施。在数据方面，我们提出了一种程序数据生成策略，将单图像注释组合成丰富的、有针对性的多图像训练示例。在优化方面，我们分析了分层注意力模式，并得出了针对多图像输入量身定制的注意力屏蔽方案。实验极大地改进了跨图像聚合，同时还增强了现有多图像基准的性能，超越了跨任务的现有技术水平。数据和代码将在此 https URL 上提供。
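
The abstract above refers to an attention-masking scheme for multi-image inputs without describing it. To make the general idea concrete, here is one plausible form, offered purely as an assumption: image tokens attend only to earlier tokens of the same image, while text tokens attend causally to everything before them. The function name, the `-1` convention for text tokens, and the masking rule itself are hypothetical and should not be read as the scheme derived in the paper.

```python
def build_multi_image_mask(segment_ids):
    """Build a boolean attention mask for a mixed image/text token sequence.

    segment_ids[i] is the image index (0, 1, ...) for an image token,
    or -1 for a text token. mask[i][j] is True when token i may attend
    to token j. Attention is causal: no token sees later positions.
    """
    n = len(segment_ids)
    mask = [[False] * n for _ in range(n)]
    for i, seg_i in enumerate(segment_ids):
        for j in range(i + 1):  # causal: only positions j <= i
            if seg_i == -1:
                # Text tokens may attend to every earlier token.
                mask[i][j] = True
            else:
                # Image tokens stay within their own image.
                mask[i][j] = segment_ids[j] == seg_i
    return mask


# Two images of two tokens each, followed by one text token.
segments = [0, 0, 1, 1, -1]
mask = build_multi_image_mask(segments)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

A mask of this shape would typically be converted to additive form (0 for allowed, a large negative value for blocked) before being added to the attention logits.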

Title: MHLA: Restoring Expressivity of Linear Attention via Token-Level Multi-Head

Authors: Kewei Zhang, Ye Huang, Yufan Deng, Jincheng Yu, Junsong Chen, Huan Ling, Enze Xie, Daquan Zhou
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2601.07832
Pdf URL: https://arxiv.org/pdf/2601.07832
Copy Paste: [[2601.07832]] MHLA: Restoring Expressivity of Linear Attention via Token-Level Multi-Head(https://arxiv.org/abs/2601.07832)
Keywords: generation
Abstract: While the Transformer architecture dominates many fields, its quadratic self-attention complexity hinders its use in large-scale applications. Linear attention offers an efficient alternative, but its direct application often degrades performance, with existing fixes typically re-introducing computational overhead through extra modules (e.g., depthwise separable convolution) that defeat the original purpose. In this work, we identify a key failure mode in these methods: global context collapse, where the model loses representational diversity. To address this, we propose Multi-Head Linear Attention (MHLA), which preserves this diversity by computing attention within divided heads along the token dimension. We prove that MHLA maintains linear complexity while recovering much of the expressive power of softmax attention, and verify its effectiveness across multiple domains, achieving a 3.6% improvement on ImageNet classification, a 6.3% gain on NLP, a 12.6% improvement on image generation, and a 41% enhancement on video generation under the same time complexity.
摘要：虽然 Transformer 架构在许多领域占据主导地位，但其二次自注意力复杂性阻碍了其在大规模应用中的使用。线性注意力提供了一种有效的替代方案，但其直接应用通常会降低性能，现有的修复通常会通过额外的模块（例如深度可分离卷积）重新引入计算开销，从而违背了最初的目的。在这项工作中，我们确定了这些方法中的一个关键失败模式：全局上下文崩溃，其中模型失去了表征多样性。为了解决这个问题，我们提出了多头线性注意力（MHLA），它通过沿着令牌维度计算分割头内的注意力来保留这种多样性。我们证明 MHLA 保持了线性复杂度，同时恢复了 Softmax Attention 的大部分表达能力，并验证了其在多个领域的有效性，在相同时间复杂度下，在 ImageNet 分类上实现了 3.6\% 的改进，在 NLP 上实现了 6.3\% 的增益，在图像生成上实现了 12.6\% 的改进，在视频生成上实现了 41\% 的增强。
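
To illustrate the contrast drawn in the abstract above, the sketch below implements plain (non-causal) linear attention with a positive feature map, plus a simplified variant that applies it independently within contiguous blocks of tokens. The block-wise variant is only a loose reading of "computing attention within divided heads along the token dimension"; the actual MHLA formulation may differ (for example, by mixing information across blocks). The feature map elu(x) + 1, the function names, and the toy inputs are assumptions for this example.

```python
import math


def phi(x):
    """Positive feature map elu(x) + 1, applied elementwise."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def linear_attention(Q, K, V):
    """Linear attention: one shared key-value summary for all queries.

    Cost grows linearly with the number of tokens because the summary
    kv = sum_j phi(k_j) v_j^T is built once and reused by every query.
    """
    d, dv = len(K[0]), len(V[0])
    kv = [[0.0] * dv for _ in range(d)]  # d x dv summary matrix
    z = [0.0] * d                        # normaliser: sum_j phi(k_j)
    for k, v in zip(K, V):
        fk = phi(k)
        for a in range(d):
            z[a] += fk[a]
            for b in range(dv):
                kv[a][b] += fk[a] * v[b]
    out = []
    for q in Q:
        fq = phi(q)
        denom = sum(fq[a] * z[a] for a in range(d))
        out.append([sum(fq[a] * kv[a][b] for a in range(d)) / denom
                    for b in range(dv)])
    return out


def blocked_linear_attention(Q, K, V, num_blocks):
    """Apply linear attention separately inside each contiguous token block.

    Each block keeps its own key-value summary, so different token groups
    are not forced to share a single global context vector.
    """
    n = len(Q)
    size = math.ceil(n / num_blocks)
    out = []
    for s in range(0, n, size):
        out.extend(linear_attention(Q[s:s + size], K[s:s + size], V[s:s + size]))
    return out


Q = [[0.1, -0.2], [0.3, 0.4], [-0.5, 0.6], [0.7, -0.8]]
K = [[0.2, 0.1], [-0.3, 0.5], [0.4, -0.6], [0.0, 0.9]]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0]]
print(linear_attention(Q, K, V))
print(blocked_linear_attention(Q, K, V, num_blocks=2))
```

With `num_blocks=1` the blocked variant reduces to ordinary linear attention, which gives a simple sanity check for the implementation.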