2025-12-19

Title: A Unified Generative-Predictive Framework for Deterministic Inverse Design

Authors: Reza T. Batley, Sourav Saha
Subjects: cs.LG, cond-mat.mtrl-sci
Abstract URL: https://arxiv.org/abs/2512.15746
Pdf URL: https://arxiv.org/pdf/2512.15746
Copy Paste: [[2512.15746]] A Unified Generative-Predictive Framework for Deterministic Inverse Design(https://arxiv.org/abs/2512.15746)
Keywords: generation, generative
Abstract: Inverse design of heterogeneous material microstructures is a fundamentally ill-posed and famously computationally expensive problem. This is exacerbated by the high-dimensional design spaces associated with finely resolved images, multimodal input property streams, and highly nonlinear forward physics. Whilst modern generative models excel at accurately modeling such complex forward behavior, most of them are not intrinsically structured to support fast, stable \emph{deterministic} inversion with a physics-informed bias. This work introduces Janus, a unified generative-predictive framework to address this problem. Janus couples a deep encoder-decoder architecture with a predictive KHRONOS head, a separable neural architecture. Topologically speaking, Janus learns a latent manifold simultaneously isometric for generative inversion and pruned for physical prediction, with the joint objective inducing \emph{disentanglement} of the latent space. Janus is first validated on the MNIST dataset, demonstrating high-fidelity reconstruction, accurate classification and diverse generative inversion of all ten target classes. It is then applied to the inverse design of heterogeneous microstructures labeled with thermal conductivity. It achieves a forward prediction accuracy of $R^2=0.98$ (2\% relative error) and sub-5\% pixelwise reconstruction error. Inverse solutions satisfy target properties to within $1\%$ relative error. Inverting a sweep through properties reveals smooth traversal of the latent manifold, and UMAP visualization confirms the emergence of a low-dimensional, disentangled manifold. By unifying prediction and generation within a single latent space, Janus enables real-time, physics-informed inverse microstructure generation at a lower computational cost than that typically associated with classical optimization-based approaches.
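
The Janus abstract reports forward accuracy as a coefficient of determination and inverse accuracy as a relative error. For readers who want to reproduce such figures on their own predictions, the following is a minimal sketch of both metrics using their standard definitions. The function names and sample values are illustrative assumptions and are not taken from the paper.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def mean_relative_error(y_true, y_pred):
    """Average of |prediction - target| / |target| over all samples."""
    return sum(abs(p - t) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)


if __name__ == "__main__":
    # Hypothetical target and predicted property values, for illustration only.
    targets = [1.0, 2.0, 4.0]
    predictions = [1.1, 1.9, 4.0]
    print(r_squared(targets, predictions))
    print(mean_relative_error(targets, predictions))
```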

Title: D3G: Diverse Demographic Data Generation Increases Zero-Shot Image Classification Accuracy within Multimodal Models

Authors: Javon Hickmon
Subjects: cs.LG, cs.CL, cs.CV, cs.CY
Abstract URL: https://arxiv.org/abs/2512.15747
Pdf URL: https://arxiv.org/pdf/2512.15747
Copy Paste: [[2512.15747]] D3G: Diverse Demographic Data Generation Increases Zero-Shot Image Classification Accuracy within Multimodal Models(https://arxiv.org/abs/2512.15747)
Keywords: generation, generative
Abstract: Image classification is a task essential for machine perception to achieve human-level image understanding. Multimodal models such as CLIP have been able to perform well on this task by learning semantic similarities across vision and language; however, despite these advances, image classification is still a challenging task. Models with low capacity often suffer from underfitting and thus underperform on fine-grained image classification. Along with this, it is important to ensure high-quality data with rich cross-modal representations of each class, which is often difficult to generate. When datasets do not enforce balanced demographics, the predictions will be biased toward the more represented class, while others will be neglected. We focus on how these issues can lead to harmful bias for zero-shot image classification, and explore how to combat these issues in demographic bias. We propose Diverse Demographic Data Generation (D3G), a training-free, zero-shot method of boosting classification accuracy while reducing demographic bias in pre-trained multimodal models. With this method, we utilize CLIP as our base multimodal model and Stable Diffusion XL as our generative model. We demonstrate that providing diverse demographic data at inference time improves performance for these models, and explore the impact of individual demographics on the resulting accuracy metric.

Title: GLOW: Graph-Language Co-Reasoning for Agentic Workflow Performance Prediction

Authors: Wei Guan, Jian Cao, Jinyu Cai, Qiqi Cai, Jianqi Gao, See-Kiong Ng
Subjects: cs.LG, cs.AI, cs.MA
Abstract URL: https://arxiv.org/abs/2512.15751
Pdf URL: https://arxiv.org/pdf/2512.15751
Copy Paste: [[2512.15751]] GLOW: Graph-Language Co-Reasoning for Agentic Workflow Performance Prediction(https://arxiv.org/abs/2512.15751)
Keywords: generation
Abstract: Agentic Workflows (AWs) have emerged as a promising paradigm for solving complex tasks. However, the scalability of automating their generation is severely constrained by the high cost and latency of execution-based evaluation. Existing AW performance prediction methods act as surrogates but fail to simultaneously capture the intricate topological dependencies and the deep semantic logic embedded in AWs. To address this limitation, we propose GLOW, a unified framework for AW performance prediction that combines the graph-structure modeling capabilities of GNNs with the reasoning power of LLMs. Specifically, we introduce a graph-oriented LLM, instruction-tuned on graph tasks, to extract topologically aware semantic features, which are fused with GNN-encoded structural representations. A contrastive alignment strategy further refines the latent space to distinguish high-quality AWs. Extensive experiments on FLORA-Bench show that GLOW outperforms state-of-the-art baselines in prediction accuracy and ranking utility.

Title: TAO-Net: Two-stage Adaptive OOD Classification Network for Fine-grained Encrypted Traffic Classification

Authors: Zihao Wang, Wei Peng, Junming Zhang, Jian Li, Wenxin Fang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.15753
Pdf URL: https://arxiv.org/pdf/2512.15753
Copy Paste: [[2512.15753]] TAO-Net: Two-stage Adaptive OOD Classification Network for Fine-grained Encrypted Traffic Classification(https://arxiv.org/abs/2512.15753)
Keywords: generation
Abstract: Encrypted traffic classification aims to identify applications or services by analyzing network traffic data. One of the critical challenges is the continuous emergence of new applications, which generates Out-of-Distribution (OOD) traffic patterns that deviate from known categories and are not well represented by predefined models. Current approaches rely on predefined categories, which limits their effectiveness in handling unknown traffic types. Although some methods mitigate this limitation by simply classifying unknown traffic into a single "Other" category, they fail to make a fine-grained classification. In this paper, we propose a Two-stage Adaptive OOD classification Network (TAO-Net) that achieves accurate classification for both In-Distribution (ID) and OOD encrypted traffic. The method incorporates an innovative two-stage design: the first stage employs a hybrid OOD detection mechanism that integrates transformer-based inter-layer transformation smoothness and feature analysis to effectively distinguish between ID and OOD traffic, while the second stage leverages large language models with a novel semantic-enhanced prompt strategy to transform OOD traffic classification into a generation task, enabling flexible fine-grained classification without relying on predefined labels. Experiments on three datasets demonstrate that TAO-Net achieves 96.81-97.70% macro-precision and 96.77-97.68% macro-F1, outperforming previous methods that only reach 44.73-86.30% macro-precision, particularly in identifying emerging network applications.
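
The TAO-Net results are stated as macro-precision and macro-F1. As a reminder of what those averages mean, here is a small sketch that computes both from label lists: each class gets its own precision and F1, and the macro score is the unweighted mean over classes. The function name and the toy labels are assumptions made for illustration, and the paper's exact evaluation protocol may differ.

```python
from collections import Counter


def macro_precision_f1(y_true, y_pred):
    """Return (macro-precision, macro-F1) over all classes seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    precisions, f1_scores = [], []
    for label in labels:
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        f1_scores.append(f1)

    # Macro averaging weights every class equally, regardless of its size.
    return sum(precisions) / len(labels), sum(f1_scores) / len(labels)


if __name__ == "__main__":
    truth = ["a", "a", "b", "b"]
    preds = ["a", "b", "b", "b"]
    print(macro_precision_f1(truth, preds))
```

Because every class counts equally, a rare application class that is misclassified pulls the macro score down as much as a common one, which is why macro averages are a natural choice when new, sparsely observed traffic types matter.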

Title: ReactorFold: Generative discovery of nuclear reactor cores via emergent physical reasoning

Authors: Yoonpyo Lee
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.15756
Pdf URL: https://arxiv.org/pdf/2512.15756
Copy Paste: [[2512.15756]] ReactorFold: Generative discovery of nuclear reactor cores via emergent physical reasoning(https://arxiv.org/abs/2512.15756)
Keywords: generative
Abstract: Designing nuclear reactor cores requires navigating large discrete design spaces governed by complex neutronic interactions. Traditional deterministic, metaheuristic, and machine-learning-assisted methods search within fixed, human-defined configuration spaces, limiting their ability to discover fundamentally new design topologies. Here we introduce ReactorFold, a generative framework that reformulates fuel-assembly design as a sequence modeling problem for language models. Using Monte Carlo data, parameter-efficient fine-tuning, and Direct Preference Optimization (DPO), the model learns the latent structure of a pressurized-water-reactor assembly and generates candidate layouts in a single forward pass. Notably, the DPO-aligned model exhibits emergent design-space expansion: despite being trained exclusively on configurations with a fixed number of gadolinium burnable absorber (Gd) rods, it autonomously adjusts Gd inventory to satisfy strict power-peaking constraints. The model also discovers high-performing asymmetric configurations that challenge conventional symmetric loading heuristics, accessing design regimes inaccessible to conventional search methods and demonstrating that language models can internalize causal physical relationships and transcend human-imposed design constraints.

Title: Two-Step Data Augmentation for Masked Face Detection and Recognition: Turning Fake Masks to Real

Authors: Yan Yang, George Bebis, Mircea Nicolescu
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2512.15774
Pdf URL: https://arxiv.org/pdf/2512.15774
Copy Paste: [[2512.15774]] Two-Step Data Augmentation for Masked Face Detection and Recognition: Turning Fake Masks to Real(https://arxiv.org/abs/2512.15774)
Keywords: generation, generative
Abstract: Data scarcity and distribution shift pose major challenges for masked face detection and recognition. We propose a two-step generative data augmentation framework that combines rule-based mask warping with unpaired image-to-image translation using GANs, enabling the generation of realistic masked-face samples beyond purely synthetic transformations. Compared to rule-based warping alone, the proposed approach yields consistent qualitative improvements and complements existing GAN-based masked face generation methods such as IAMGAN. We introduce a non-mask preservation loss and stochastic noise injection to stabilize training and enhance sample diversity. Experimental observations highlight the effectiveness of the proposed components and suggest directions for future improvements in data-centric augmentation for face recognition tasks.

Title: A Unification of Discrete, Gaussian, and Simplicial Diffusion

Authors: Nuria Alina Chandra, Yucen Lily Li, Alan N. Amin, Alex Ali, Joshua Rollins, Sebastian W. Ober, Aniruddh Raghu, Andrew Gordon Wilson
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.15923
Pdf URL: https://arxiv.org/pdf/2512.15923
Copy Paste: [[2512.15923]] A Unification of Discrete, Gaussian, and Simplicial Diffusion(https://arxiv.org/abs/2512.15923)
Keywords: generation
Abstract: To model discrete sequences such as DNA, proteins, and language using diffusion, practitioners must choose between three major methods: diffusion in discrete space, Gaussian diffusion in Euclidean space, or diffusion on the simplex. Despite their shared goal, these models have disparate algorithms, theoretical structures, and tradeoffs: discrete diffusion has the most natural domain, Gaussian diffusion has more mature algorithms, and diffusion on the simplex in principle combines the strengths of the other two but in practice suffers from numerically unstable stochastic processes. Ideally we could see each of these models as instances of the same underlying framework, and enable practitioners to switch between models for downstream applications. However, previous theories have only considered connections in special cases. Here we build a theory unifying all three methods of discrete diffusion as different parameterizations of the same underlying process: the Wright-Fisher population genetics model. In particular, we find simplicial and Gaussian diffusion as two large-population limits. Our theory formally connects the likelihoods and hyperparameters of these models and leverages decades of mathematical genetics literature to unlock stable simplicial diffusion. Finally, we relieve the practitioner of balancing model trade-offs by demonstrating it is possible to train a single model that can perform diffusion in any of these three domains at test time. Our experiments show that Wright-Fisher simplicial diffusion is more stable and outperforms previous simplicial diffusion models on conditional DNA generation. We also show that we can train models on multiple domains at once that are competitive with models trained on any individual domain.
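
The abstract above builds on the Wright-Fisher population genetics model. For orientation, the following is a minimal sketch of one generation of the textbook Wright-Fisher process with uniform mutation: type frequencies are perturbed by mutation, then a new population of fixed size is resampled from them. This is the classical model only, not the paper's parameterization or noising schedule; the function name, the uniform mutation kernel, and all parameter values are assumptions.

```python
import random


def wright_fisher_step(freqs, pop_size, mutation_rate, rng):
    """Advance type frequencies by one Wright-Fisher generation.

    freqs: current frequencies of each type (non-negative, summing to 1).
    pop_size: number of individuals resampled each generation.
    mutation_rate: probability that an offspring's type is redrawn uniformly.
    rng: a random.Random instance, passed in for reproducibility.
    """
    k = len(freqs)
    # Mutation: mix the current frequencies with the uniform distribution.
    mutated = [(1.0 - mutation_rate) * f + mutation_rate / k for f in freqs]
    # Reproduction: every offspring picks a parent type independently,
    # i.e. the new counts are a multinomial draw from the mutated frequencies.
    draws = rng.choices(range(k), weights=mutated, k=pop_size)
    counts = [0] * k
    for type_index in draws:
        counts[type_index] += 1
    return [c / pop_size for c in counts]


if __name__ == "__main__":
    rng = random.Random(0)
    state = [1.0, 0.0, 0.0, 0.0]  # e.g. a one-hot nucleotide
    for _ in range(5):
        state = wright_fisher_step(state, pop_size=100, mutation_rate=0.05, rng=rng)
    print(state)
```

With `pop_size=1` the state is always one-hot, so the process moves between discrete categories; with larger populations the state is a frequency vector in the interior of the simplex. That contrast gives some intuition for how one process can span discrete and simplicial regimes, although the precise limits are the subject of the paper.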

Title: DSO: Direct Steering Optimization for Bias Mitigation

Authors: Lucas Monteiro Paes, Nivedha Sivakumar, Yinong Oliver Wang, Masha Fedzechkina Donaldson, Luca Zappella, Nicholas Apostoloff
Subjects: cs.LG, cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2512.15926
Pdf URL: https://arxiv.org/pdf/2512.15926
Copy Paste: [[2512.15926]] DSO: Direct Steering Optimization for Bias Mitigation(https://arxiv.org/abs/2512.15926)
Keywords: generative
Abstract: Generative models are often deployed to make decisions on behalf of users, such as vision-language models (VLMs) identifying which person in a room is a doctor to help visually impaired individuals. Yet, VLM decisions are influenced by the perceived demographic attributes of people in the input, which can lead to biased outcomes like failing to identify women as doctors. Moreover, when reducing bias leads to performance loss, users may have varying needs for balancing bias mitigation with overall model capabilities, highlighting the demand for methods that enable controllable bias reduction during inference. Activation steering is a popular approach for inference-time controllability that has shown potential in inducing safer behavior in large language models (LLMs). However, we observe that current steering methods struggle to correct biases, where equiprobable outcomes across demographic groups are required. To address this, we propose Direct Steering Optimization (DSO), which uses reinforcement learning to find linear transformations for steering activations, tailored to mitigate bias while maintaining control over model performance. We demonstrate that DSO achieves a state-of-the-art trade-off between fairness and capabilities on both VLMs and LLMs, while offering practitioners inference-time control over the trade-off. Overall, our work highlights the benefit of designing steering strategies that are directly optimized to control model behavior, providing more effective bias intervention than methods that rely on pre-defined heuristics for controllability.
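
The DSO abstract describes steering activations with a linear transformation while leaving the trade-off adjustable at inference time. One plausible way to picture this is sketched below: a hidden activation `h` is shifted by `strength * (A h + b)`, where `strength` acts as a user-facing dial. The update rule, the name `steer`, and the role of `strength` are assumptions for illustration; the abstract does not specify the exact form, and in DSO the transformation is found with reinforcement learning rather than set by hand.

```python
def steer(activation, matrix, bias, strength):
    """Return h + strength * (A h + b) for a single activation vector h.

    activation: list of floats (h).
    matrix: square list-of-lists (A), hypothetical learned steering weights.
    bias: list of floats (b), hypothetical learned steering offset.
    strength: scalar dial; 0.0 leaves the activation unchanged.
    """
    steered = []
    for i, h_i in enumerate(activation):
        # Linear steering direction for coordinate i.
        delta = sum(matrix[i][j] * h_j for j, h_j in enumerate(activation)) + bias[i]
        steered.append(h_i + strength * delta)
    return steered


if __name__ == "__main__":
    h = [1.0, 2.0]
    A = [[0.0, 1.0], [1.0, 0.0]]
    b = [0.5, -0.5]
    for s in (0.0, 0.5, 1.0):
        print(s, steer(h, A, b, s))
```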

Title: R4: Retrieval-Augmented Reasoning for Vision-Language Models in 4D Spatio-Temporal Space

Authors: Tin Stribor Sohn, Maximilian Dillitzer, Jason J. Corso, Eric Sax
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2512.15940
Pdf URL: https://arxiv.org/pdf/2512.15940
Copy Paste: [[2512.15940]] R4: Retrieval-Augmented Reasoning for Vision-Language Models in 4D Spatio-Temporal Space(https://arxiv.org/abs/2512.15940)
Keywords: generation
Abstract: Humans perceive and reason about their surroundings in four dimensions by building persistent, structured internal representations that encode semantic meaning, spatial layout, and temporal dynamics. These multimodal memories enable them to recall past events, infer unobserved states, and integrate new information into context-dependent reasoning. Inspired by this capability, we introduce R4, a training-free framework for retrieval-augmented reasoning in 4D spatio-temporal space that equips vision-language models (VLMs) with structured, lifelong memory. R4 continuously constructs a 4D knowledge database by anchoring object-level semantic descriptions in metric space and time, yielding a persistent world model that can be shared across agents. At inference, natural language queries are decomposed into semantic, spatial, and temporal keys to retrieve relevant observations, which are integrated into the VLM's reasoning. Unlike classical retrieval-augmented generation methods, retrieval in R4 operates directly in 4D space, enabling episodic and collaborative reasoning without training. Experiments on embodied question answering and navigation benchmarks demonstrate that R4 substantially improves retrieval and reasoning over spatio-temporal information compared to baselines, advancing a new paradigm for embodied 4D reasoning in dynamic environments.
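
The R4 abstract describes decomposing a query into semantic, spatial, and temporal keys and retrieving matching observations from a 4D memory. The toy sketch below shows that filtering pattern on an in-memory list. Everything in it is hypothetical: the `Observation` record, the `retrieve` function, and the use of plain keyword matching as a stand-in for semantic matching are illustrative assumptions, not the paper's data structures.

```python
import math
from dataclasses import dataclass


@dataclass
class Observation:
    description: str   # object-level semantic description
    position: tuple    # (x, y, z) in a shared metric frame
    timestamp: float   # seconds since the start of the episode


def retrieve(memory, keyword, center, radius, t_start, t_end):
    """Return observations matching all three keys, most recent first."""
    hits = []
    for obs in memory:
        # Semantic key (stand-in): case-insensitive keyword match.
        if keyword.lower() not in obs.description.lower():
            continue
        # Spatial key: within `radius` of the query location.
        if math.dist(obs.position, center) > radius:
            continue
        # Temporal key: inside the closed query time window.
        if not (t_start <= obs.timestamp <= t_end):
            continue
        hits.append(obs)
    return sorted(hits, key=lambda o: o.timestamp, reverse=True)


if __name__ == "__main__":
    memory = [
        Observation("red mug on the kitchen table", (1.0, 2.0, 0.8), 10.0),
        Observation("red mug in the sink", (4.0, 2.5, 0.9), 55.0),
        Observation("blue chair near the window", (1.2, 2.1, 0.0), 12.0),
    ]
    for obs in retrieve(memory, "mug", (1.0, 2.0, 1.0), 1.0, 0.0, 60.0):
        print(obs)
```

The retrieved records would then be placed into the VLM's context; a real system would likely replace the keyword test with embedding similarity and the linear scan with a spatial index.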

Title: AIE4ML: An End-to-End Framework for Compiling Neural Networks for the Next Generation of AMD AI Engines

Authors: Dimitrios Danopoulos, Enrico Lupi, Chang Sun, Sebastian Dittmeier, Michael Kagan, Vladimir Loncar, Maurizio Pierini
Subjects: cs.LG, cs.AR
Abstract URL: https://arxiv.org/abs/2512.15946
Pdf URL: https://arxiv.org/pdf/2512.15946
Copy Paste: [[2512.15946]] AIE4ML: An End-to-End Framework for Compiling Neural Networks for the Next Generation of AMD AI Engines(https://arxiv.org/abs/2512.15946)
Keywords: generation
Abstract: Efficient AI inference on AMD's Versal AI Engine (AIE) is challenging due to tightly coupled VLIW execution, explicit datapaths, and local memory management. Prior work focused on first-generation AIE kernel optimizations, without tackling full neural network execution across the 2D array. In this work, we present AIE4ML, the first comprehensive framework for converting AI models automatically into optimized firmware targeting the AIE-ML generation devices, also with forward compatibility for the newer AIE-MLv2 architecture. At the single-kernel level, we attain performance close to the architectural peak. At the graph and system levels, we provide a structured parallelization method that can scale across the 2D AIE-ML fabric and exploit its dedicated memory tiles to stay entirely on-chip throughout the model execution. As a demonstration, we designed a generalized and highly efficient linear-layer implementation with intrinsic support for fused bias addition and ReLU activation. Also, as our framework necessitates the generation of multi-layer implementations, our approach systematically derives deterministic, compact, and topology-optimized placements tailored to the physical 2D grid of the device through a novel graph placement and search algorithm. Finally, the framework seamlessly accepts quantized models imported from high-level tools such as hls4ml or PyTorch while preserving bit-exactness. In layer scaling benchmarks, we achieve up to 98.6% efficiency relative to the single-kernel baseline, utilizing 296 of 304 AIE tiles (97.4%) of the device with entirely on-chip data movement. With evaluations across real-world model topologies, we demonstrate that AIE4ML delivers GPU-class throughput under microsecond latency constraints, making it a practical companion for ultra-low-latency environments such as trigger systems in particle physics experiments.
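
The AIE4ML abstract mentions a linear layer with fused bias addition and ReLU, and bit-exact handling of quantized models. A plain integer reference of that operation, of the kind one might compare hardware output against, can be sketched as follows. The function name and the sample values are assumptions, and this says nothing about how the actual AIE kernels are tiled or vectorized.

```python
def fused_linear_relu(x, weights, bias):
    """Compute y[i] = max(0, sum_j weights[i][j] * x[j] + bias[i]) in one pass.

    With integer inputs the arithmetic is exact, so the result can serve as
    a bit-exact reference for a quantized layer.
    """
    y = []
    for row, b in zip(weights, bias):
        acc = b  # seed the accumulator with the bias: the add is fused
        for w, v in zip(row, x):
            acc += w * v
        y.append(acc if acc > 0 else 0)  # ReLU applied before writing out
    return y


if __name__ == "__main__":
    x = [1, -2, 3]
    weights = [[2, 0, 1], [-1, 1, 0]]
    bias = [-1, 2]
    print(fused_linear_relu(x, weights, bias))
```

Fusing the bias and activation into the accumulation loop avoids writing the intermediate pre-activation tensor, which matters when data has to stay in small on-chip memories.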

Title: CoVAR: Co-generation of Video and Action for Robotic Manipulation via Multi-Modal Diffusion

Authors: Liudi Yang, Yang Bai, George Eskandar, Fengyi Shen, Mohammad Altillawi, Dong Chen, Ziyuan Liu, Abhinav Valada
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.16023
Pdf URL: https://arxiv.org/pdf/2512.16023
Copy Paste: [[2512.16023]] CoVAR: Co-generation of Video and Action for Robotic Manipulation via Multi-Modal Diffusion(https://arxiv.org/abs/2512.16023)
Keywords: generation
Abstract: We present a method to generate video-action pairs that follow text instructions, starting from an initial image observation and the robot's joint states. Our approach automatically provides action labels for video diffusion models, overcoming the common lack of action annotations and enabling their full use for robotic policy learning. Existing methods either adopt two-stage pipelines, which limit tightly coupled cross-modal information sharing, or rely on adapting a single-modal diffusion model for a joint distribution that cannot fully leverage pretrained video knowledge. To overcome these limitations, we (1) extend a pretrained video diffusion model with a parallel, dedicated action diffusion model that preserves pretrained knowledge, (2) introduce a Bridge Attention mechanism to enable effective cross-modal interaction, and (3) design an action refinement module to convert coarse actions into precise controls for low-resolution datasets. Extensive evaluations on multiple public benchmarks and real-world datasets demonstrate that our method generates higher-quality videos and more accurate actions, and significantly outperforms existing baselines, offering a scalable framework for leveraging large-scale video data for robotic learning.

Title: Auto-Vocabulary 3D Object Detection

Authors: Haomeng Zhang, Kuan-Chuan Peng, Suhas Lohit, Raymond A. Yeh
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.16077
Pdf URL: https://arxiv.org/pdf/2512.16077
Copy Paste: [[2512.16077]] Auto-Vocabulary 3D Object Detection(https://arxiv.org/abs/2512.16077)
Keywords: generation
Abstract: Open-vocabulary 3D object detection methods are able to localize 3D boxes of classes unseen during training. Despite the name, existing methods rely on user-specified classes both at training and inference. We propose to study Auto-Vocabulary 3D Object Detection (AV3DOD), where the classes are automatically generated for the detected objects without any user input. To this end, we introduce Semantic Score (SS) to evaluate the quality of the generated class names. We then develop a novel framework, AV3DOD, which leverages 2D vision-language models (VLMs) to generate rich semantic candidates through image captioning, pseudo 3D box generation, and feature-space semantics expansion. AV3DOD achieves the state-of-the-art (SOTA) performance on both localization (mAP) and semantic quality (SS) on the ScanNetV2 and SUNRGB-D datasets. Notably, it surpasses the SOTA, CoDA, by 3.48 overall mAP and attains a 24.5% relative improvement in SS on ScanNetV2.

Title: TurboDiffusion: Accelerating Video Diffusion Models by 100-200 Times

Authors: Jintao Zhang, Kaiwen Zheng, Kai Jiang, Haoxu Wang, Ion Stoica, Joseph E. Gonzalez, Jianfei Chen, Jun Zhu
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2512.16093
Pdf URL: https://arxiv.org/pdf/2512.16093
Copy Paste: [[2512.16093]] TurboDiffusion: Accelerating Video Diffusion Models by 100-200 Times(https://arxiv.org/abs/2512.16093)
Keywords: generation
Abstract: We introduce TurboDiffusion, a video generation acceleration framework that can speed up end-to-end diffusion generation by 100-200x while maintaining video quality. TurboDiffusion mainly relies on several components for acceleration: (1) Attention acceleration: TurboDiffusion uses low-bit SageAttention and trainable Sparse-Linear Attention (SLA) to speed up attention computation. (2) Step distillation: TurboDiffusion adopts rCM for efficient step distillation. (3) W8A8 quantization: TurboDiffusion quantizes model parameters and activations to 8 bits to accelerate linear layers and compress the model. In addition, TurboDiffusion incorporates several other engineering optimizations. We conduct experiments on the Wan2.2-I2V-14B-720P, Wan2.1-T2V-1.3B-480P, Wan2.1-T2V-14B-720P, and Wan2.1-T2V-14B-480P models. Experimental results show that TurboDiffusion achieves 100-200x speedup for video generation even on a single RTX 5090 GPU, while maintaining comparable video quality. The GitHub repository, which includes model checkpoints and easy-to-use code, is available at this https URL.
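
Item (3) of the TurboDiffusion abstract, W8A8 quantization, means that both weights and activations are stored as 8-bit integers. The sketch below shows the general idea on a single dot product using symmetric per-tensor scaling: quantize both operands, accumulate in integers, then rescale. The helper names and the per-tensor scheme are assumptions for illustration; the abstract does not state which granularity or rounding TurboDiffusion uses.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization to integers in [-127, 127].

    Returns (quantized values, scale) such that value ~= quantized * scale.
    """
    max_abs = max(abs(v) for v in values) or 1.0  # avoid a zero scale
    scale = max_abs / 127.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def w8a8_dot(weights, activations):
    """Dot product with 8-bit weights and 8-bit activations."""
    q_w, scale_w = quantize_int8(weights)
    q_a, scale_a = quantize_int8(activations)
    # Integer multiply-accumulate: the part that fast low-bit hardware speeds up.
    acc = sum(w * a for w, a in zip(q_w, q_a))
    # One floating-point rescale at the end recovers the real-valued result.
    return acc * scale_w * scale_a


if __name__ == "__main__":
    weights = [0.5, -1.0, 0.25]
    activations = [2.0, 1.0, -4.0]
    exact = sum(w * a for w, a in zip(weights, activations))
    print(exact, w8a8_dot(weights, activations))
```

The quantized result differs slightly from the exact one because of rounding; the abstract's claim is that, at the model level, this kind of error does not noticeably degrade video quality.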

Title: C-DGPA: Class-Centric Dual-Alignment Generative Prompt Adaptation

Authors: Chao Li, Dasha Hu, Chengyang Li, Yuming Jiang, Yuncheng Shen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.16164
Pdf URL: https://arxiv.org/pdf/2512.16164
Copy Paste: [[2512.16164]] C-DGPA: Class-Centric Dual-Alignment Generative Prompt Adaptation(https://arxiv.org/abs/2512.16164)
Keywords: generative
Abstract: Unsupervised Domain Adaptation (UDA) transfers knowledge from a labeled source domain to an unlabeled target domain. Directly deploying Vision-Language Models (VLMs) with prompt tuning in downstream UDA tasks faces the significant challenge of mitigating domain discrepancies. Existing prompt-tuning strategies primarily align marginal distribution, but neglect conditional distribution discrepancies, leading to critical issues such as class prototype misalignment and degraded semantic discriminability. To address these limitations, the work proposes C-DGPA: Class-Centric Dual-Alignment Generative Prompt Adaptation. C-DGPA synergistically optimizes marginal distribution alignment and conditional distribution alignment through a novel dual-branch architecture. The marginal distribution alignment branch employs a dynamic adversarial training framework to bridge marginal distribution discrepancies. Simultaneously, the conditional distribution alignment branch introduces a Class Mapping Mechanism (CMM) to align conditional distribution discrepancies by standardizing semantic prompt understanding and preventing source domain over-reliance. This dual alignment strategy effectively integrates domain knowledge into prompt learning via synergistic optimization, ensuring domain-invariant and semantically discriminative representations. Extensive experiments on OfficeHome, Office31, and VisDA-2017 validate the superiority of C-DGPA. It achieves new state-of-the-art results on all benchmarks.

Title: Visual Alignment of Medical Vision-Language Models for Grounded Radiology Report Generation

Authors: Sarosij Bose, Ravi K. Rajendran, Biplob Debnath, Konstantinos Karydis, Amit K. Roy-Chowdhury, Srimat Chakradhar
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.16201
Pdf URL: https://arxiv.org/pdf/2512.16201
Copy Paste: [[2512.16201]] Visual Alignment of Medical Vision-Language Models for Grounded Radiology Report Generation(https://arxiv.org/abs/2512.16201)
Keywords: generation
Abstract: Radiology Report Generation (RRG) is a critical step toward automating healthcare workflows, facilitating accurate patient assessments, and reducing the workload of medical professionals. Despite recent progress in Large Medical Vision-Language Models (Med-VLMs), generating radiology reports that are both visually grounded and clinically accurate remains a significant challenge. Existing approaches often rely on large labeled corpora for pre-training, costly task-specific preference data, or retrieval-based methods. However, these strategies do not adequately mitigate hallucinations arising from poor cross-modal alignment between visual and linguistic representations. To address these limitations, we propose VALOR: Visual Alignment of Medical Vision-Language Models for GrOunded Radiology Report Generation. Our method introduces a reinforcement learning-based post-alignment framework utilizing Group-Relative Proximal Optimization (GRPO). The training proceeds in two stages: (1) improving the Med-VLM with textual rewards to encourage clinically precise terminology, and (2) aligning the vision projection module of the textually grounded model with disease findings, thereby guiding attention toward image regions most relevant to the diagnostic task. Extensive experiments on multiple benchmarks demonstrate that VALOR substantially improves factual accuracy and visual grounding, achieving significant performance gains over state-of-the-art report generation methods.
摘要：放射学报告生成 (RRG) 是实现医疗保健工作流程自动化、促进准确的患者评估和减少医疗专业人员工作量的关键一步。尽管大型医学视觉语言模型 (Med-VLM) 最近取得了进展，但生成既具有视觉基础又具有临床准确性的放射学报告仍然是一个重大挑战。现有方法通常依赖于大型标记语料库进行预训练、昂贵的特定任务偏好数据或基于检索的方法。然而，这些策略并不能充分减轻由于视觉和语言表征之间的跨模态对齐不良而产生的幻觉。为了解决这些限制，我们提出了 VALOR：用于生成放射学报告的医学视觉语言模型的视觉对齐。我们的方法引入了利用组相对近端优化（GRPO）的基于强化学习的对齐后框架。培训分两个阶段进行：(1) 通过文本奖励改进 Med-VLM，以鼓励临床上精确的术语，(2) 将文本基础模型的视觉投影模块与疾病发现相结合，从而引导注意力集中到与诊断任务最相关的图像区域。对多个基准的大量实验表明，VALOR 极大地提高了事实准确性和视觉基础，与最先进的报告生成方法相比，实现了显着的性能提升。
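The abstract does not spell out the reward or the update rule. As a minimal sketch of the group-relative idea commonly associated with GRPO, assuming each sampled report in a group receives one scalar reward, the advantages can be obtained by standardizing rewards within the group. The function name and the example rewards are hypothetical and are not taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of sampled outputs.

    Hypothetical helper: each reward is compared with the group mean
    and scaled by the group standard deviation (eps avoids division by zero).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Illustrative rewards for four reports sampled for the same image.
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)
```

Reports scoring above the group mean get a positive advantage and the rest a negative one, so no separate value network is needed to form a baseline.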

Title: Learning High-Quality Initial Noise for Single-View Synthesis with Diffusion Models

Authors: Zhihao Zhang, Xuejun Yang, Weihua Liu, Mouquan Shen
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2512.16219
Pdf URL: https://arxiv.org/pdf/2512.16219
Copy Paste: [[2512.16219]] Learning High-Quality Initial Noise for Single-View Synthesis with Diffusion Models(https://arxiv.org/abs/2512.16219)
Keywords: generation
Abstract: Single-view novel view synthesis (NVS) models based on diffusion models have recently attracted increasing attention, as they can generate a series of novel view images from a single image prompt and camera pose information as conditions. It has been observed that in diffusion models, certain high-quality initial noise patterns lead to better generation results than others. However, there remains a lack of dedicated learning frameworks that enable NVS models to learn such high-quality noise. To obtain high-quality initial noise from random Gaussian noise, we make the following contributions. First, we design a discretized Euler inversion method to inject image semantic information into random noise, thereby constructing paired datasets of random and high-quality noise. Second, we propose a learning framework based on an encoder-decoder network (EDN) that directly transforms random noise into high-quality noise. Experiments demonstrate that the proposed EDN can be seamlessly plugged into various NVS models, such as SV3D and MV-Adapter, achieving significant performance improvements across multiple datasets. Code is available at: this https URL.
摘要：基于扩散模型的单视图新视图合成（NVS）模型最近引起了越来越多的关注，因为它们可以根据单个图像提示和相机姿态信息作为条件生成一系列新视图图像。据观察，在扩散模型中，某些高质量的初始噪声模式比其他模式产生更好的生成结果。然而，仍然缺乏使 NVS 模型能够学习如此高质量噪声的专用学习框架。为了从随机高斯噪声中获得高质量的初始噪声，我们做出了以下贡献。首先，我们设计了一种离散欧拉反演方法，将图像语义信息注入随机噪声中，从而构建随机和高质量噪声的配对数据集。其次，我们提出了一种基于编码器解码器网络（EDN）的学习框架，可直接将随机噪声转换为高质量噪声。实验表明，所提出的 EDN 可以无缝插入各种 NVS 模型，例如 SV3D 和 MV-Adapter，从而在多个数据集上实现显着的性能改进。代码可在以下位置获得：此 https URL。

Title: ARMFlow: AutoRegressive MeanFlow for Online 3D Human Reaction Generation

Authors: Zichen Geng, Zeeshan Hayder, Wei Liu, Hesheng Wang, Ajmal Mian
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.16234
Pdf URL: https://arxiv.org/pdf/2512.16234
Copy Paste: [[2512.16234]] ARMFlow: AutoRegressive MeanFlow for Online 3D Human Reaction Generation(https://arxiv.org/abs/2512.16234)
Keywords: generation
Abstract: 3D human reaction generation faces three main challenges: (1) high motion fidelity, (2) real-time inference, and (3) autoregressive adaptability for online scenarios. Existing methods fail to meet all three simultaneously. We propose ARMFlow, a MeanFlow-based autoregressive framework that models temporal dependencies between actor and reactor motions. It consists of a causal context encoder and an MLP-based velocity predictor. We introduce Bootstrap Contextual Encoding (BSCE) in training, which encodes the generated history instead of the ground-truth history, to alleviate error accumulation in autoregressive generation. We further introduce the offline variant ReMFlow, achieving state-of-the-art performance with the fastest inference among offline methods. Our ARMFlow addresses key limitations of online settings by: (1) enhancing semantic alignment via a global contextual encoder; (2) achieving high accuracy and low latency in a single-step inference; and (3) reducing accumulated errors through BSCE. Our single-step online generation surpasses existing online methods on InterHuman and InterX by over 40% in FID, while matching offline state-of-the-art performance despite using only partial sequence conditions.
摘要：3D 人体反应生成面临三个主要挑战：（1）高运动保真度，（2）实时推理，（3）在线场景的自回归适应性。现有方法无法同时满足这三个要求。我们提出了 ARMFlow，一种基于 MeanFlow 的自回归框架，用于对参与者和反应器运动之间的时间依赖性进行建模。它由因果上下文编码器和基于 MLP 的速度预测器组成。我们在训练中引入了引导上下文编码（BSCE），对生成的历史记录而不是真实的历史记录进行编码，以减轻自回归生成中的错误累积。我们进一步引入了离线变体 ReMFlow，以离线方法中最快的推理实现了最先进的性能。我们的 ARMFlow 通过以下方式解决在线设置的关键限制：（1）通过全局上下文编码器增强语义对齐；（2）单步推理实现高精度、低延迟； (3)通过BSCE减少累积错误。我们的单步在线生成在 FID 方面超越了 InterHuman 和 InterX 上现有的在线方法超过 40%，同时尽管仅使用部分序列条件，但仍能匹配离线最先进的性能。

Title: Coarse-to-Fine Open-Set Graph Node Classification with Large Language Models

Authors: Xueqi Ma, Xingjun Ma, Sarah Monazam Erfani, Danilo Mandic, James Bailey
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.16244
Pdf URL: https://arxiv.org/pdf/2512.16244
Copy Paste: [[2512.16244]] Coarse-to-Fine Open-Set Graph Node Classification with Large Language Models(https://arxiv.org/abs/2512.16244)
Keywords: generation
Abstract: Developing open-set classification methods capable of classifying in-distribution (ID) data while detecting out-of-distribution (OOD) samples is essential for deploying graph neural networks (GNNs) in open-world scenarios. Existing methods typically treat all OOD samples as a single class, even though real-world applications, especially high-stakes settings such as fraud detection and medical diagnosis, demand deeper insights into OOD samples, including their probable labels. This raises a critical question: can OOD detection be extended to OOD classification without true label information? To address this question, we propose a Coarse-to-Fine open-set Classification (CFC) framework that leverages large language models (LLMs) for graph datasets. CFC consists of three key components: a coarse classifier that uses LLM prompts for OOD detection and outlier label generation, a GNN-based fine classifier trained with OOD samples identified by the coarse classifier for enhanced OOD detection and ID classification, and refined OOD classification achieved through LLM prompts and post-processed OOD labels. Unlike methods that rely on synthetic or auxiliary OOD samples, CFC employs semantic OOD instances that are genuinely out-of-distribution based on their inherent meaning, improving interpretability and practical utility. Experimental results show that CFC improves OOD detection by ten percent over state-of-the-art methods on graph and text domains and achieves up to seventy percent accuracy in OOD classification on graph datasets.
摘要：开发能够对分布内 (ID) 数据进行分类并同时检测分布外 (OOD) 样本的开放集分类方法对于在开放世界场景中部署图神经网络 (GNN) 至关重要。现有方法通常将所有 OOD 样本视为一个类别，尽管现实世界的应用程序，尤其是欺诈检测和医疗诊断等高风险设置，需要更深入地了解 OOD 样本，包括其可能的标签。这就提出了一个关键问题：在没有真实标签信息的情况下，OOD 检测是否可以扩展到 OOD 分类？为了解决这个问题，我们提出了一个从粗到细的开放集分类（CFC）框架，该框架利用图数据集的大型语言模型（LLM）。 CFC 由三个关键组件组成：一个使用 LLM 提示进行 OOD 检测和异常值标签生成的粗分类器；一个基于 GNN 的精细分类器，使用粗分类器识别的 OOD 样本进行训练，以增强 OOD 检测和 ID 分类；以及通过 LLM 提示和后处理 OOD 标签实现的精细 OOD 分类。与依赖合成或辅助 OOD 样本的方法不同，CFC 采用基于其固有含义真正不分布的语义 OOD 实例，从而提高了可解释性和实用性。实验结果表明，与图和文本域上最先进的方法相比，CFC 将 OOD 检测提高了 10%，并且在图数据集上的 OOD 分类中实现了高达 70% 的准确率。

Title: Pixel Super-Resolved Fluorescence Lifetime Imaging Using Deep Learning

Authors: Paloma Casteleiro Costa, Parnian Ghapandar Kashani, Xuhui Liu, Alexander Chen, Ary Portes, Julien Bec, Laura Marcu, Aydogan Ozcan
Subjects: cs.CV, cs.LG, physics.med-ph, physics.optics
Abstract URL: https://arxiv.org/abs/2512.16266
Pdf URL: https://arxiv.org/pdf/2512.16266
Copy Paste: [[2512.16266]] Pixel Super-Resolved Fluorescence Lifetime Imaging Using Deep Learning(https://arxiv.org/abs/2512.16266)
Keywords: super-resolution, generative
Abstract: Fluorescence lifetime imaging microscopy (FLIM) is a powerful quantitative technique that provides metabolic and molecular contrast, offering strong translational potential for label-free, real-time diagnostics. However, its clinical adoption remains limited by long pixel dwell times and low signal-to-noise ratio (SNR), which impose a stricter resolution-speed trade-off than conventional optical imaging approaches. Here, we introduce FLIM_PSR_k, a deep learning-based multi-channel pixel super-resolution (PSR) framework that reconstructs high-resolution FLIM images from data acquired with up to a 5-fold increased pixel size. The model is trained using the conditional generative adversarial network (cGAN) framework, which, compared to diffusion model-based alternatives, delivers a more robust PSR reconstruction with substantially shorter inference times, a crucial advantage for practical deployment. FLIM_PSR_k not only enables faster image acquisition but can also alleviate SNR limitations in autofluorescence-based FLIM. Blind testing on held-out patient-derived tumor tissue samples demonstrates that FLIM_PSR_k reliably achieves a super-resolution factor of k = 5, resulting in a 25-fold increase in the space-bandwidth product of the output images and revealing fine architectural features lost in lower-resolution inputs, with statistically significant improvements across various image quality metrics. By increasing FLIM's effective spatial resolution, FLIM_PSR_k advances lifetime imaging toward faster, higher-resolution, and hardware-flexible implementations compatible with low-numerical-aperture and miniaturized platforms, better positioning FLIM for translational applications.
摘要：荧光寿命成像显微镜 (FLIM) 是一种强大的定量技术，可提供代谢和分子对比，为无标记实时诊断提供强大的转化潜力。然而，其临床应用仍然受到长像素停留时间和低信噪比（SNR）的限制，这比传统光学成像方法需要更严格的分辨率与速度权衡。在这里，我们介绍 FLIM_PSR_k，一种基于深度学习的多通道像素超分辨率 (PSR) 框架，可根据像素大小增加 5 倍的数据重建高分辨率 FLIM 图像。该模型使用条件生成对抗网络 (cGAN) 框架进行训练，与基于扩散模型的替代方案相比，该框架能够以更短的推理时间提供更稳健的 PSR 重建，这是实际部署的关键优势。 FLIM_PSR_k 不仅可以实现更快的图像采集，还可以减轻基于自发荧光的 FLIM 中的 SNR 限制。对保留的患者来源的肿瘤组织样本的盲测表明，FLIM_PSR_k 可靠地实现了 k = 5 的超分辨率因子，导致输出图像的空间带宽乘积增加了 25 倍，并揭示了低分辨率输入中丢失的精细结构特征，并且各种图像质量指标在统计上都有显着改善。通过提高 FLIM 的有效空间分辨率，FLIM_PSR_k 推动生命周期成像朝着更快、更高分辨率和硬件灵活的方向发展，与低数值孔径和小型化平台兼容，更好地将 FLIM 定位于平移应用。
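The 25-fold figure is consistent with the stated super-resolution factor. Assuming the space-bandwidth product scales with the number of resolvable pixels in a 2D image, a factor of k along each axis gives k² overall. A short check, with a hypothetical function name:

```python
def sbp_gain(k: int, dims: int = 2) -> int:
    """Space-bandwidth product gain for a per-axis super-resolution factor k.

    Assumes the gain is the per-axis factor raised to the number of
    image dimensions (2 for a planar image).
    """
    return k ** dims


gain = sbp_gain(5)  # k = 5 as reported in the abstract
```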

Title: TextEditBench: Evaluating Reasoning-aware Text Editing Beyond Rendering

Authors: Rui Gui, Yang Wan, Haochen Han, Dongxing Mao, Fangming Liu, Min Li, Alex Jinpeng Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.16270
Pdf URL: https://arxiv.org/pdf/2512.16270
Copy Paste: [[2512.16270]] TextEditBench: Evaluating Reasoning-aware Text Editing Beyond Rendering(https://arxiv.org/abs/2512.16270)
Keywords: generation
Abstract: Text rendering has recently emerged as one of the most challenging frontiers in visual generation, drawing significant attention from large-scale diffusion and multimodal models. However, text editing within images remains largely unexplored, as it requires generating legible characters while preserving semantic, geometric, and contextual coherence. To fill this gap, we introduce TextEditBench, a comprehensive evaluation benchmark that explicitly focuses on text-centric regions in images. Beyond basic pixel manipulations, our benchmark emphasizes reasoning-intensive editing scenarios that require models to understand physical plausibility, linguistic meaning, and cross-modal dependencies. We further propose a novel evaluation dimension, Semantic Expectation (SE), which measures the ability of a model to reason well enough to maintain semantic consistency, contextual coherence, and cross-modal alignment during text editing. Extensive experiments on state-of-the-art editing systems reveal that while current models can follow simple textual instructions, they still struggle with context-dependent reasoning, physical consistency, and layout-aware integration. By focusing evaluation on this long-overlooked yet fundamental capability, TextEditBench establishes a new testing ground for advancing text-guided image editing and reasoning in multimodal generation.
摘要：文本渲染最近已成为视觉生成中最具挑战性的前沿之一，引起了大规模扩散和多模态模型的极大关注。然而，图像中的文本编辑在很大程度上仍未得到探索，因为它需要生成清晰的字符，同时保持语义、几何和上下文的连贯性。为了填补这一空白，我们引入了 TextEditBench，这是一个综合评估基准，明确关注图像中以文本为中心的区域。除了基本的像素操作之外，我们的基准测试还强调推理密集型编辑场景，这些场景需要模型理解物理合理性、语言意义和跨模式依赖性。我们进一步提出了一种新的评估维度——语义期望（SE），它衡量模型在文本编辑过程中保持语义一致性、上下文连贯性和跨模态对齐的推理能力。对最先进的编辑系统进行的大量实验表明，虽然当前的模型可以遵循简单的文本指令，但它们仍然在上下文相关的推理、物理一致性和布局感知集成方面遇到困难。通过重点评估这一长期被忽视但基本的功能，TextEditBench 为推进多模式生成中的文本引导图像编辑和推理建立了一个新的测试场。

Title: GFLAN: Generative Functional Layouts

Authors: Mohamed Abouagour, Eleftherios Garyfallidis
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.16275
Pdf URL: https://arxiv.org/pdf/2512.16275
Copy Paste: [[2512.16275]] GFLAN: Generative Functional Layouts(https://arxiv.org/abs/2512.16275)
Keywords: generation, generative
Abstract: Automated floor plan generation lies at the intersection of combinatorial search, geometric constraint satisfaction, and functional design requirements -- a confluence that has historically resisted a unified computational treatment. While recent deep learning approaches have improved the state of the art, they often struggle to capture architectural reasoning: the precedence of topological relationships over geometric instantiation, the propagation of functional constraints through adjacency networks, and the emergence of circulation patterns from local connectivity decisions. To address these fundamental challenges, this paper introduces GFLAN, a generative framework that restructures floor plan synthesis through explicit factorization into topological planning and geometric realization. Given a single exterior boundary and a front-door location, our approach departs from direct pixel-to-pixel or wall-tracing generation in favor of a principled two-stage decomposition. Stage A employs a specialized convolutional architecture with dual encoders -- separating invariant spatial context from evolving layout state -- to sequentially allocate room centroids within the building envelope via discrete probability maps over feasible placements. Stage B constructs a heterogeneous graph linking room nodes to boundary vertices, then applies a Transformer-augmented graph neural network (GNN) that jointly regresses room boundaries.
摘要：自动平面图生成位于组合搜索、几何约束满足和功能设计要求的交叉点——这种融合历来抵制统一的计算处理。虽然最近的深度学习方法已经提高了最先进的水平，但它们常常难以捕捉架构推理：拓扑关系优先于几何实例化、通过邻接网络传播功能约束以及局部连接决策中循环模式的出现。为了解决这些基本挑战，本文引入了 GFLAN，这是一种生成框架，可通过显式分解为拓扑规划和几何实现来重组平面图合成。给定单个外部边界和前门位置，我们的方法偏离了直接的像素到像素或墙壁跟踪生成，而是有利于原则上的两阶段分解。 A阶段采用带有双编码器的专用卷积架构——将不变的空间环境与不断变化的布局状态分开——通过可行布局上的离散概率图顺序分配建筑围护结构内的房间质心。 B 阶段构建一个将房间节点连接到边界顶点的异构图，然后应用 Transformer 增强图神经网络 (GNN) 联合回归房间边界。

Title: PixelArena: A benchmark for Pixel-Precision Visual Intelligence

Authors: Feng Liang, Sizhe Cheng, Chenqi Yi
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.16303
Pdf URL: https://arxiv.org/pdf/2512.16303
Copy Paste: [[2512.16303]] PixelArena: A benchmark for Pixel-Precision Visual Intelligence(https://arxiv.org/abs/2512.16303)
Keywords: generation, generative
Abstract: Multi-modal large language models that have image output are emerging. Many image generation benchmarks focus on aesthetics instead of fine-grained generation capabilities. In PixelArena, we propose using semantic segmentation tasks to objectively examine their fine-grained generative intelligence with pixel precision. We find the latest Gemini 3 Pro Image has emergent image generation capabilities that generate semantic masks with high fidelity under zero-shot settings, showcasing visual intelligence unseen before and true generalization in new image generation tasks. We further investigate its results, compare them qualitatively and quantitatively with those of other models, and present failure cases. The findings not only signal exciting progress in the field but also provide insights into future research related to multimodality, reasoning, interpretability and benchmarking.
摘要：具有图像输出的多模态大语言模型正在出现。许多图像生成基准注重美观而不是细粒度的生成能力。在 PixelArena 中，我们建议使用语义分割任务来客观地检查其具有像素精度的细粒度生成智能。我们发现最新的 Gemini 3 Pro Image 具有紧急图像生成功能，可以在零镜头设置下生成高保真度的语义蒙版，展示了前所未见的视觉智能以及新图像生成任务中的真正泛化能力。我们进一步研究其结果，将其与其他模型的结果进行定性和定量比较，并提出失败案例。这些发现不仅标志着该领域令人兴奋的进展，而且还为与多模态、推理、可解释性和基准测试相关的未来研究提供了见解。

Title: LaverNet: Lightweight All-in-one Video Restoration via Selective Propagation

Authors: Haiyu Zhao, Yiwen Shan, Yuanbiao Gou, Xi Peng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.16313
Pdf URL: https://arxiv.org/pdf/2512.16313
Copy Paste: [[2512.16313]] LaverNet: Lightweight All-in-one Video Restoration via Selective Propagation(https://arxiv.org/abs/2512.16313)
Keywords: restoration
Abstract: Recent studies have explored all-in-one video restoration, which handles multiple degradations with a unified model. However, these approaches still face two challenges when dealing with time-varying degradations. First, the degradation can dominate temporal modeling, leading the model to focus on artifacts rather than the video content. Second, current methods typically rely on large models to handle all-in-one restoration, concealing those underlying difficulties. To address these challenges, we propose a lightweight all-in-one video restoration network, LaverNet, with only 362K parameters. To mitigate the impact of degradations on temporal modeling, we introduce a novel propagation mechanism that selectively transmits only degradation-agnostic features across frames. Through LaverNet, we demonstrate that strong all-in-one restoration can be achieved with a compact network. Despite its small size, less than 1% of the parameters of existing models, LaverNet achieves comparable or even superior performance across benchmarks.
摘要：最近的研究探索了一体化视频恢复，它通过统一的模型处理多种退化。然而，这些方法在处理时变退化时仍然面临两个挑战。首先，退化可能主导时间建模，使模型混淆于伪影而不是视频内容。其次，当前的方法通常依赖于大型模型来处理一体化恢复，从而掩盖了这些潜在的困难。为了应对这些挑战，我们提出了一种轻量级一体化视频恢复网络 LaverNet，仅具有 362K 参数。为了减轻退化对时间建模的影响，我们引入了一种新颖的传播机制，该机制选择性地跨帧仅传输与退化无关的特征。通过 LaverNet，我们证明了可以通过紧凑的网络实现强大的一体化恢复。尽管尺寸很小，不到现有模型参数的 1%，但 LaverNet 在各个基准测试中实现了可比的甚至更优越的性能。

Title: GMODiff: One-Step Gain Map Refinement with Diffusion Priors for HDR Reconstruction

Authors: Tao Hu, Weiyu Zhou, Yanjie Tu, Peng Wu, Wei Dong, Qingsen Yan, Yanning Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.16357
Pdf URL: https://arxiv.org/pdf/2512.16357
Copy Paste: [[2512.16357]] GMODiff: One-Step Gain Map Refinement with Diffusion Priors for HDR Reconstruction(https://arxiv.org/abs/2512.16357)
Keywords: generative
Abstract: Pre-trained Latent Diffusion Models (LDMs) have recently shown strong perceptual priors for low-level vision tasks, making them a promising direction for multi-exposure High Dynamic Range (HDR) reconstruction. However, directly applying LDMs to HDR remains challenging due to: (1) limited dynamic-range representation caused by 8-bit latent compression, (2) high inference cost from multi-step denoising, and (3) content hallucination inherent to their generative nature. To address these challenges, we introduce GMODiff, a gain map-driven one-step diffusion framework for multi-exposure HDR reconstruction. Instead of reconstructing full HDR content, we reformulate HDR reconstruction as a conditionally guided Gain Map (GM) estimation task, where the GM encodes the extended dynamic range while retaining the same bit depth as LDR images. We initialize the denoising process from an informative regression-based estimate rather than pure noise, enabling the model to generate high-quality GMs in a single denoising step. Furthermore, recognizing that regression-based models excel in content fidelity while LDMs favor perceptual quality, we leverage regression priors to guide both the denoising process and latent decoding of the LDM, suppressing hallucinations while preserving structural accuracy. Extensive experiments demonstrate that our GMODiff performs favorably against several state-of-the-art methods and is 100× faster than previous LDM-based methods.
摘要：预训练的潜在扩散模型（LDM）最近在低级视觉任务中表现出强大的感知先验，这使它们成为多重曝光高动态范围（HDR）重建的有希望的方向。然而，直接将 LDM 应用于 HDR 仍然具有挑战性，因为：(1) 8 位潜在压缩导致动态范围表示有限，(2) 多步去噪的推理成本很高，(3) 生成性质固有的内容幻觉。为了解决这些挑战，我们引入了 GMODiff，一种用于多重曝光 HDR 重建的增益图驱动的一步扩散框架。我们没有重建完整的 HDR 内容，而是将 HDR 重建重新表述为条件引导增益图 (GM) 估计任务，其中 GM 对扩展的动态范围进行编码，同时保留与 LDR 图像相同的位深度。我们从基于信息回归的估计而不是纯噪声来初始化去噪过程，使模型能够在单个去噪步骤中生成高质量的 GM。此外，认识到基于回归的模型在内容保真度方面表现出色，而 LDM 更注重感知质量，我们利用回归先验来指导 LDM 的去噪过程和潜在解码，在保持结构准确性的同时抑制幻觉。大量实验表明，我们的 GMODiff 与几种最先进的方法相比表现良好，并且比以前基于 LDM 的方法快 100 倍。
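The abstract does not define the gain map. As a rough illustration only, one common convention, which may differ from the paper's, stores the per-pixel log2 ratio between HDR and LDR intensities so that the HDR value can be recovered from the LDR image and the map. The function names, the `eps` offset, and the sample values below are all assumptions.

```python
import math


def encode_gain_map(hdr, ldr, eps=1e-6):
    """Per-pixel log2 ratio of HDR to LDR intensity (assumed convention)."""
    return [math.log2((h + eps) / (l + eps)) for h, l in zip(hdr, ldr)]


def apply_gain_map(ldr, gain, eps=1e-6):
    """Invert encode_gain_map: recover HDR intensities from LDR and the map."""
    return [(l + eps) * 2.0 ** g - eps for l, g in zip(ldr, gain)]


# Illustrative linear intensities for three pixels; the LDR values clip at 1.0.
hdr = [0.5, 2.0, 8.0]
ldr = [0.5, 1.0, 1.0]
gain = encode_gain_map(hdr, ldr)
recovered = apply_gain_map(ldr, gain)
```

Under this convention the map is zero wherever the LDR image already matches the HDR content and grows only in clipped regions, which is why it can carry the extended range compactly.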

Title: Factorized Video Generation: Decoupling Scene Construction and Temporal Synthesis in Text-to-Video Diffusion Models

Authors: Mariam Hassan, Bastien Van Delft, Wuyang Li, Alexandre Alahi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.16371
Pdf URL: https://arxiv.org/pdf/2512.16371
Copy Paste: [[2512.16371]] Factorized Video Generation: Decoupling Scene Construction and Temporal Synthesis in Text-to-Video Diffusion Models(https://arxiv.org/abs/2512.16371)
Keywords: generation
Abstract: State-of-the-art Text-to-Video (T2V) diffusion models can generate visually impressive results, yet they still frequently fail to compose complex scenes or follow logical temporal instructions. In this paper, we argue that many errors, including apparent motion failures, originate from the model's inability to construct a semantically correct or logically consistent initial frame. We introduce Factorized Video Generation (FVG), a pipeline that decouples these tasks by decomposing the Text-to-Video generation into three specialized stages: (1) Reasoning, where a Large Language Model (LLM) rewrites the video prompt to describe only the initial scene, resolving temporal ambiguities; (2) Composition, where a Text-to-Image (T2I) model synthesizes a high-quality, compositionally-correct anchor frame from this new prompt; and (3) Temporal Synthesis, where a video model, finetuned to understand this anchor, focuses its entire capacity on animating the scene and following the prompt. Our decomposed approach sets a new state-of-the-art on the T2V CompBench benchmark and significantly improves all tested models on VBench2. Furthermore, we show that visual anchoring allows us to cut the number of sampling steps by 70% without any loss in performance, leading to a substantial speed-up in sampling. Factorized Video Generation offers a simple yet practical path toward more efficient, robust, and controllable video synthesis.
摘要：最先进的文本到视频 (T2V) 扩散模型可以生成视觉上令人印象深刻的结果，但它们仍然经常无法组成复杂的场景或遵循逻辑时间指令。在本文中，我们认为许多错误（包括明显的运动失败）源于模型无法构建语义正确或逻辑一致的初始框架。我们引入了分解视频生成（FVG），这是一个通过将文本到视频生成分解为三个专门阶段来解耦这些任务的管道：（1）推理，其中大型语言模型（LLM）重写视频提示以仅描述初始场景，解决时间模糊性； (2) 合成，其中文本到图像 (T2I) 模型根据此新提示合成高质量、合成正确的锚帧； (3) 时间合成，其中视频模型经过微调以理解该锚点，将其全部能力集中在动画场景和遵循提示上。我们的分解方法在 T2V CompBench 基准测试中树立了新的最先进水平，并显着改进了 VBench2 上的所有测试模型。此外，我们还表明，视觉锚定使我们能够将采样步骤数减少 70%，而不会损失任何性能，从而大幅加快采样速度。分解视频生成提供了一条简单而实用的途径，实现更高效、稳健和可控的视频合成

Title: Geometric Disentanglement of Text Embeddings for Subject-Consistent Text-to-Image Generation using A Single Prompt

Authors: Shangxun Li, Youngjung Uh
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.16443
Pdf URL: https://arxiv.org/pdf/2512.16443
Copy Paste: [[2512.16443]] Geometric Disentanglement of Text Embeddings for Subject-Consistent Text-to-Image Generation using A Single Prompt(https://arxiv.org/abs/2512.16443)
Keywords: generation
Abstract: Text-to-image diffusion models excel at generating high-quality images from natural language descriptions but often fail to preserve subject consistency across multiple outputs, limiting their use in visual storytelling. Existing approaches rely on model fine-tuning or image conditioning, which are computationally expensive and require per-subject optimization. 1Prompt1Story, a training-free approach, concatenates all scene descriptions into a single prompt and rescales token embeddings, but it suffers from semantic leakage, where embeddings across frames become entangled, causing text misalignment. In this paper, we propose a simple yet effective training-free approach that addresses semantic entanglement from a geometric perspective by refining text embeddings to suppress unwanted semantics. Extensive experiments prove that our approach significantly improves both subject consistency and text alignment over existing baselines.
摘要：文本到图像扩散模型擅长从自然语言描述生成高质量图像，但通常无法保持多个输出之间的主题一致性，从而限制了它们在视觉叙事中的使用。现有的方法依赖于模型微调或图像调节，这些方法的计算成本很高，并且需要针对每个对象进行优化。 1Prompt1Story 是一种免训练的方法，它将所有场景描述连接到一个提示中，并重新调整标记嵌入的大小，但它存在语义泄漏的问题，即跨帧的嵌入变得纠缠在一起，导致文本错位。在本文中，我们提出了一种简单而有效的免训练方法，通过细化文本嵌入来抑制不需要的语义，从几何角度解决语义纠缠。大量的实验证明，我们的方法在现有基线上显着提高了主题一致性和文本对齐。

Title: Prime and Reach: Synthesising Body Motion for Gaze-Primed Object Reach

Authors: Masashi Hatano, Saptarshi Sinha, Jacob Chalk, Wei-Hong Li, Hideo Saito, Dima Damen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.16456
Pdf URL: https://arxiv.org/pdf/2512.16456
Copy Paste: [[2512.16456]] Prime and Reach: Synthesising Body Motion for Gaze-Primed Object Reach(https://arxiv.org/abs/2512.16456)
Keywords: generation
Abstract: Human motion generation is a challenging task that aims to create realistic motion imitating natural human behaviour. We focus on the well-studied behaviour of priming an object/location for pick up or put down -- that is, the spotting of an object/location from a distance, known as gaze priming, followed by the motion of approaching and reaching the target location. To that end, we curate, for the first time, 23.7K gaze-primed human motion sequences for reaching target object locations from five publicly available datasets, i.e., HD-EPIC, MoGaze, HOT3D, ADT, and GIMO. We pre-train a text-conditioned diffusion-based motion generation model, then fine-tune it conditioned on goal pose or location, on our curated sequences. Importantly, we evaluate the ability of the generated motion to imitate natural human movement through several metrics, including the 'Reach Success' and a newly introduced 'Prime Success' metric. On the largest dataset, HD-EPIC, our model achieves 60% prime success and 89% reach success when conditioned on the goal object location.
摘要：人体运动生成是一项具有挑战性的任务，旨在创建模仿人类自然行为的逼真运动。我们关注的是经过充分研究的启动物体/位置以拾取或放下的行为，即从远处发现物体/位置，称为凝视启动，然后是接近和到达目标位置的运动。为此，我们首次从五个公开可用的数据集（即 HD-EPIC、MoGaze、HOT3D、ADT 和 GIMO）中策划了 23.7K 个凝视引发的人体运动序列，用于到达目标物体位置。我们预先训练一个基于文本条件的基于扩散的运动生成模型，然后在我们策划的序列上根据目标姿势或位置对其进行微调。重要的是，我们通过几个指标评估生成的运动模仿自然人类运动的能力，包括“达到成功”和新引入的“首要成功”指标。在最大的数据集 HD-EPIC 上，当以目标物体位置为条件时，我们的模型实现了 60% 的主要成功率和 89% 的到达成功率。

Title: StageVAR: Stage-Aware Acceleration for Visual Autoregressive Models

Authors: Senmao Li, Kai Wang, Salman Khan, Fahad Shahbaz Khan, Jian Yang, Yaxing Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.16483
Pdf URL: https://arxiv.org/pdf/2512.16483
Copy Paste: [[2512.16483]] StageVAR: Stage-Aware Acceleration for Visual Autoregressive Models(https://arxiv.org/abs/2512.16483)
Keywords: generation
Abstract: Visual Autoregressive (VAR) modeling departs from the next-token prediction paradigm of traditional Autoregressive (AR) models through next-scale prediction, enabling high-quality image generation. However, the VAR paradigm suffers from sharply increased computational complexity and running time at large-scale steps. Although existing acceleration methods reduce runtime for large-scale steps, they rely on manual step selection and overlook the varying importance of different stages in the generation process. To address this challenge, we present StageVAR, a systematic study and stage-aware acceleration framework for VAR models. Our analysis shows that early steps are critical for preserving semantic and structural consistency and should remain intact, while later steps mainly refine details and can be pruned or approximated for acceleration. Building on these insights, StageVAR introduces a plug-and-play acceleration strategy that exploits semantic irrelevance and low-rank properties in late-stage computations, without requiring additional training. Our proposed StageVAR achieves up to 3.4x speedup with only a 0.01 drop on GenEval and a 0.26 decrease on DPG, consistently outperforming existing acceleration baselines. These results highlight stage-aware design as a powerful principle for efficient visual autoregressive image generation.
摘要：视觉自回归（VAR）建模通过下一个尺度的预测，摆脱了传统自回归（AR）模型的下一个令牌预测范式，从而实现了高质量的图像生成。然而，VAR 范式面临着计算复杂度和大规模步骤运行时间急剧增加的问题。尽管现有的加速方法减少了大规模步骤的运行时间，但依赖于手动步骤选择并忽略了生成过程中不同阶段的不同重要性。为了应对这一挑战，我们推出了 StageVAR，这是一个针对 VAR 模型的系统研究和阶段感知加速框架。我们的分析表明，早期步骤对于保持语义和结构一致性至关重要，应该保持完整，而后期步骤主要细化细节，可以进行修剪或近似以加速。基于这些见解，StageVAR 引入了一种即插即用的加速策略，该策略利用后期计算中的语义无关性和低秩属性，而无需额外的训练。我们提出的 StageVAR 实现了高达 3.4 倍的加速，而 GenEval 只下降了 0.01，DPG 只下降了 0.26，始终优于现有的加速基准。这些结果强调了阶段感知设计是高效视觉自回归图像生成的强大原理。

Title: Guiding Perception-Reasoning Closer to Human in Blind Image Quality Assessment

Authors: Yuan Li, Yahan Yu, Youyuan Lin, Yong-Hao Yang, Chenhui Chu, Shin'ya Nishida
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.16484
Pdf URL: https://arxiv.org/pdf/2512.16484
Copy Paste: [[2512.16484]] Guiding Perception-Reasoning Closer to Human in Blind Image Quality Assessment(https://arxiv.org/abs/2512.16484)
Keywords: quality assessment
Abstract: Humans assess image quality through a perception-reasoning cascade, integrating sensory cues with implicit reasoning to form self-consistent judgments. In this work, we investigate how a model can acquire both human-like and self-consistent reasoning capability for blind image quality assessment (BIQA). We first collect human evaluation data that capture several aspects of the human perception-reasoning pipeline. Then, we adopt reinforcement learning, using human annotations as reward signals to guide the model toward human-like perception and reasoning. To enable the model to internalize self-consistent reasoning capability, we design a reward that drives the model to infer the image quality purely from self-generated descriptions. Empirically, our approach achieves score prediction performance comparable to state-of-the-art BIQA systems under general metrics, including Pearson and Spearman correlation coefficients. In addition to the rating score, we assess human-model alignment using ROUGE-1 to measure the similarity between model-generated and human perception-reasoning chains. On over 1,000 human-annotated samples, our model reaches a ROUGE-1 score of 0.512 (cf. 0.443 for baseline), indicating substantial coverage of human explanations and marking a step toward human-like interpretable reasoning in BIQA.
摘要：人类通过感知推理级联来评估图像质量，将感官线索与隐含推理相结合，形成自洽的判断。在这项工作中，我们研究了模型如何获得类似人类且自洽的推理能力以进行盲图像质量评估（BIQA）。我们首先收集人类评估数据，捕获人类感知推理管道的几个方面。然后，我们采用强化学习，使用人类注释作为奖励信号来引导模型向类似人类的感知和推理方向发展。为了使模型能够内化自洽推理能力，我们设计了一个奖励，驱动模型纯粹从自我生成的描述中推断图像质量。根据经验，我们的方法在一般指标（包括 Pearson 和 Spearman 相关系数）下实现了与最先进的 BIQA 系统相当的分数预测性能。除了评分之外，我们还使用 ROUGE-1 评估人类模型对齐，以衡量模型生成链和人类感知推理链之间的相似性。在超过 1,000 个人类注释的样本上，我们的模型达到了 0.512 的 ROUGE-1 分数（参见基线 0.443），表明人类解释的大量覆盖，并标志着 BIQA 中向类人可解释推理迈出了一步。
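ROUGE-1 measures unigram overlap between a candidate text and a reference. The abstract does not say which variant (recall, precision, or F1) or which tokenizer is used, so the sketch below reports all three under simple assumptions: lowercasing, whitespace tokenization, and clipped unigram counts. The function name and the example sentences are hypothetical.

```python
from collections import Counter


def rouge1(candidate: str, reference: str) -> dict:
    """Unigram-overlap scores between a candidate and a reference text."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count per token (clipped overlap).
    overlap = sum((cand & ref).values())
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    if overlap == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"recall": recall, "precision": precision, "f1": f1}


# Model-generated description vs. a human-written explanation (made up).
scores = rouge1("the image is blurry and dark", "the image looks blurry")
```

Recall is the variant that most directly reflects "coverage of human explanations", since it is the fraction of reference unigrams that the model's text reproduces.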

Title: Yuan-TecSwin: A text conditioned Diffusion model with Swin-transformer blocks

Authors: Shaohua Wu, Tong Yu, Shenling Wang, Xudong Zhao
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.16586
Pdf URL: https://arxiv.org/pdf/2512.16586
Copy Paste: [[2512.16586]] Yuan-TecSwin: A text conditioned Diffusion model with Swin-transformer blocks(https://arxiv.org/abs/2512.16586)
Keywords: restoration, generation
Abstract: Diffusion models have shown remarkable capacity in image synthesis based on their U-shaped architecture and convolutional neural networks (CNN) as basic blocks. The locality of the convolution operation in CNN may limit the model's ability to understand long-range semantic information. To address this issue, we propose Yuan-TecSwin, a text-conditioned diffusion model with Swin-transformer in this work. The Swin-transformer blocks take the place of CNN blocks in the encoder and decoder, to improve the non-local modeling ability in feature extraction and image restoration. The text-image alignment is improved with a well-chosen text encoder, effective utilization of text embedding, and careful design in the incorporation of text condition. Using an adapted time step to search in different diffusion stages, inference performance is further improved by 10%. Yuan-TecSwin achieves the state-of-the-art FID score of 1.37 on ImageNet generation benchmark, without any additional models at different denoising stages. In a side-by-side comparison, we find it difficult for human interviewees to tell the model-generated images from the human-painted ones.

Title: Trainable Log-linear Sparse Attention for Efficient Diffusion Transformers

Authors: Yifan Zhou, Zeqi Xiao, Tianyi Wei, Shuai Yang, Xingang Pan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.16615
Pdf URL: https://arxiv.org/pdf/2512.16615
Copy Paste: [[2512.16615]] Trainable Log-linear Sparse Attention for Efficient Diffusion Transformers(https://arxiv.org/abs/2512.16615)
Keywords: generation
Abstract: Diffusion Transformers (DiTs) set the state of the art in visual generation, yet their quadratic self-attention cost fundamentally limits scaling to long token sequences. Recent Top-K sparse attention approaches reduce the computation of DiTs by compressing tokens into block-wise representation and selecting a small set of relevant key blocks, but still suffer from (i) quadratic selection cost on compressed tokens and (ii) increasing K required to maintain model quality as sequences grow. We identify that their inefficiency is due to the single-level design, as a single coarse level is insufficient to represent the global structure. In this paper, we introduce Log-linear Sparse Attention (LLSA), a trainable sparse attention mechanism for extremely long token sequences that reduces both selection and attention costs from quadratic to log-linear complexity by utilizing a hierarchical structure. LLSA performs hierarchical Top-K selection, progressively adopting sparse Top-K selection with the indices found at the previous level, and introduces a Hierarchical KV Enrichment mechanism that preserves global context while using fewer tokens of different granularity during attention computation. To support efficient training, we develop a high-performance GPU implementation that uses only sparse indices for both the forward and backward passes, eliminating the need for dense attention masks. We evaluate LLSA on high-resolution pixel-space image generation without using patchification and VAE encoding. LLSA accelerates attention inference by 28.27x and DiT training by 6.09x on 256x256 pixel token sequences, while maintaining generation quality. The results demonstrate that LLSA offers a promising direction for training long-sequence DiTs efficiently. Code is available at: this https URL
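The idea of hierarchical Top-K selection can be sketched for a single query as a coarse-to-fine search. This is an illustrative toy, not the LLSA kernel: the function name, the mean-pooling of scores (equivalent to scoring against mean-pooled keys, since the dot product is linear), and the `block` and `k` parameters are all assumptions.

```python
from typing import List


def hierarchical_top_k(scores: List[float], block: int, k: int) -> List[int]:
    """Coarse-to-fine Top-K over per-key scores for one query.

    Only children of the blocks selected at a coarse level are examined
    at the next finer level, so each level inspects at most k * block entries.
    """
    # Pyramid: level 0 holds token scores, higher levels hold block means.
    levels = [scores]
    while len(levels[-1]) > block:
        prev = levels[-1]
        pooled = []
        for i in range(0, len(prev), block):
            group = prev[i:i + block]
            pooled.append(sum(group) / len(group))
        levels.append(pooled)

    # Start with every entry of the coarsest level as a candidate.
    candidates = list(range(len(levels[-1])))
    for depth in range(len(levels) - 1, -1, -1):
        level = levels[depth]
        candidates = sorted(candidates, key=lambda i: level[i], reverse=True)[:k]
        if depth > 0:
            finer_len = len(levels[depth - 1])
            # Expand each kept block into its children at the finer level.
            candidates = [
                j
                for i in candidates
                for j in range(i * block, min((i + 1) * block, finer_len))
            ]
    return sorted(candidates)


toy_scores = [
    0.1, 0.2, 0.1, 0.0,
    0.9, 0.1, 0.8, 0.2,
    0.0, 0.1, 0.0, 0.1,
    0.3, 0.7, 0.2, 0.4,
]
selected = hierarchical_top_k(toy_scores, block=4, k=2)
```

With a fixed `k` and `block`, the number of levels grows logarithmically with sequence length, which is the intuition behind the log-linear cost claimed in the abstract.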

Title: DeContext as Defense: Safe Image Editing in Diffusion Transformers

Authors: Linghui Shen, Mingyue Cui, Xingyi Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.16625
Pdf URL: https://arxiv.org/pdf/2512.16625
Copy Paste: [[2512.16625]] DeContext as Defense: Safe Image Editing in Diffusion Transformers(https://arxiv.org/abs/2512.16625)
Keywords: generation
Abstract: In-context diffusion models allow users to modify images with remarkable ease and realism. However, the same power raises serious privacy concerns: personal images can be easily manipulated for identity impersonation, misinformation, or other malicious uses, all without the owner's consent. While prior work has explored input perturbations to protect against misuse in personalized text-to-image generation, the robustness of modern, large-scale in-context DiT-based models remains largely unexamined. In this paper, we propose DeContext, a new method to safeguard input images from unauthorized in-context editing. Our key insight is that contextual information from the source image propagates to the output primarily through multimodal attention layers. By injecting small, targeted perturbations that weaken these cross-attention pathways, DeContext breaks this flow, effectively decoupling the link between input and output. This simple defense is both efficient and robust. We further show that early denoising steps and specific transformer blocks dominate context propagation, which allows us to concentrate perturbations where they matter most. Experiments on Flux Kontext and Step1X-Edit show that DeContext consistently blocks unwanted image edits while preserving visual quality. These results highlight the effectiveness of attention-based perturbations as a powerful defense against image manipulation.

Title: FrameDiffuser: G-Buffer-Conditioned Diffusion for Neural Forward Frame Rendering

Authors: Ole Beisswenger, Jan-Niklas Dihlmann, Hendrik P.A. Lensch
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2512.16670
Pdf URL: https://arxiv.org/pdf/2512.16670
Copy Paste: [[2512.16670]] FrameDiffuser: G-Buffer-Conditioned Diffusion for Neural Forward Frame Rendering(https://arxiv.org/abs/2512.16670)
Keywords: generation
Abstract: Neural rendering for interactive applications requires translating geometric and material properties (G-buffer) to photorealistic images with realistic lighting on a frame-by-frame basis. While recent diffusion-based approaches show promise for G-buffer-conditioned image synthesis, they face critical limitations: single-image models like RGBX generate frames independently without temporal consistency, while video models like DiffusionRenderer are too computationally expensive for most consumer gaming setups and require complete sequences upfront, making them unsuitable for interactive applications where future frames depend on user input. We introduce FrameDiffuser, an autoregressive neural rendering framework that generates temporally consistent, photorealistic frames by conditioning on G-buffer data and the model's own previous output. After an initial frame, FrameDiffuser operates purely on incoming G-buffer data, comprising geometry, materials, and surface properties, while using its previously generated frame for temporal guidance, maintaining stable, temporally consistent generation over hundreds to thousands of frames. Our dual-conditioning architecture combines ControlNet for structural guidance with ControlLoRA for temporal coherence. A three-stage training strategy enables stable autoregressive generation. We specialize our model to individual environments, prioritizing consistency and inference speed over broad generalization, demonstrating that environment-specific training achieves superior photorealistic quality with accurate lighting, shadows, and reflections compared to generalized approaches.

Title: DataFlow: An LLM-Driven Framework for Unified Data Preparation and Workflow Automation in the Era of Data-Centric AI

Authors: Hao Liang, Xiaochen Ma, Zhou Liu, Zhen Hao Wong, Zhengyang Zhao, Zimo Meng, Runming He, Chengyu Shen, Qifeng Cai, Zhaoyang Han, Meiyi Qiang, Yalin Feng, Tianyi Bai, Zewei Pan, Ziyi Guo, Yizhen Jiang, Jingwen Deng, Qijie You, Peichao Lai, Tianyu Guo, Chi Hsu Tsai, Hengyi Feng, Rui Hu, Wenkai Yu, Junbo Niu, Bohan Zeng, Ruichuan An, Lu Ma, Jihao Huang, Yaowei Zheng, Conghui He, Linpeng Tang, Bin Cui, Weinan E, Wentao Zhang
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2512.16676
Pdf URL: https://arxiv.org/pdf/2512.16676
Copy Paste: [[2512.16676]] DataFlow: An LLM-Driven Framework for Unified Data Preparation and Workflow Automation in the Era of Data-Centric AI(https://arxiv.org/abs/2512.16676)
Keywords: generation
Abstract: The rapidly growing demand for high-quality data in Large Language Models (LLMs) has intensified the need for scalable, reliable, and semantically rich data preparation pipelines. However, current practices remain dominated by ad-hoc scripts and loosely specified workflows, which lack principled abstractions, hinder reproducibility, and offer limited support for model-in-the-loop data generation. To address these challenges, we present DataFlow, a unified and extensible LLM-driven data preparation framework. DataFlow is designed with system-level abstractions that enable modular, reusable, and composable data transformations, and provides a PyTorch-style pipeline construction API for building debuggable and optimizable dataflows. The framework consists of nearly 200 reusable operators and six domain-general pipelines spanning text, mathematical reasoning, code, Text-to-SQL, agentic RAG, and large-scale knowledge extraction. To further improve usability, we introduce DataFlow-Agent, which automatically translates natural-language specifications into executable pipelines via operator synthesis, pipeline planning, and iterative verification. Across six representative use cases, DataFlow consistently improves downstream LLM performance. Our math, code, and text pipelines outperform curated human datasets and specialized synthetic baselines, achieving up to +3% execution accuracy in Text-to-SQL over SynSQL, +7% average improvements on code benchmarks, and 1-3 point gains on MATH, GSM8K, and AIME. Moreover, a unified 10K-sample dataset produced by DataFlow enables base models to surpass counterparts trained on 1M Infinity-Instruct data. These results demonstrate that DataFlow provides a practical and high-performance substrate for reliable, reproducible, and scalable LLM data preparation, and establishes a system-level foundation for future data-centric AI development.

Title: Detecting Localized Deepfakes: How Well Do Synthetic Image Detectors Handle Inpainting?

Authors: Serafino Pandolfini, Lorenzo Pellegrini, Matteo Ferrara, Davide Maltoni
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.16688
Pdf URL: https://arxiv.org/pdf/2512.16688
Copy Paste: [[2512.16688]] Detecting Localized Deepfakes: How Well Do Synthetic Image Detectors Handle Inpainting?(https://arxiv.org/abs/2512.16688)
Keywords: generation, generative
Abstract: The rapid progress of generative AI has enabled highly realistic image manipulations, including inpainting and region-level editing. These approaches preserve most of the original visual context and are increasingly exploited in cybersecurity-relevant threat scenarios. While numerous detectors have been proposed for identifying fully synthetic images, their ability to generalize to localized manipulations remains insufficiently characterized. This work presents a systematic evaluation of state-of-the-art detectors, originally trained for deepfake detection on fully synthetic images, when applied to a distinct challenge: localized inpainting detection. The study leverages multiple datasets spanning diverse generators, mask sizes, and inpainting techniques. Our experiments show that models trained on a large set of generators exhibit partial transferability to inpainting-based edits and can reliably detect medium- and large-area manipulations or regeneration-style inpainting, outperforming many existing ad hoc detection approaches.

Title: Task-Oriented Data Synthesis and Control-Rectify Sampling for Remote Sensing Semantic Segmentation

Authors: Yunkai Yang, Yudong Zhang, Kunquan Zhang, Jinxiao Zhang, Xinying Chen, Haohuan Fu, Runmin Dong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.16740
Pdf URL: https://arxiv.org/pdf/2512.16740
Copy Paste: [[2512.16740]] Task-Oriented Data Synthesis and Control-Rectify Sampling for Remote Sensing Semantic Segmentation(https://arxiv.org/abs/2512.16740)
Keywords: generation, generative
Abstract: With the rapid progress of controllable generation, training data synthesis has become a promising way to expand labeled datasets and alleviate manual annotation in remote sensing (RS). However, the complexity of semantic mask control and the uncertainty of sampling quality often limit the utility of synthetic data in downstream semantic segmentation tasks. To address these challenges, we propose a task-oriented data synthesis framework (TODSynth), including a Multimodal Diffusion Transformer (MM-DiT) with unified triple attention and a plug-and-play sampling strategy guided by task feedback. Built upon the powerful DiT-based generative foundation model, we systematically evaluate different control schemes, showing that a text-image-mask joint attention scheme combined with full fine-tuning of the image and mask branches significantly enhances the effectiveness of RS semantic segmentation data synthesis, particularly in few-shot and complex-scene scenarios. Furthermore, we propose a control-rectify flow matching (CRFM) method, which dynamically adjusts sampling directions guided by semantic loss during the early high-plasticity stage, mitigating the instability of generated images and bridging the gap between synthetic data and downstream segmentation tasks. Extensive experiments demonstrate that our approach consistently outperforms state-of-the-art controllable generation methods, producing more stable and task-oriented synthetic data for RS semantic segmentation.

Title: NRGPT: An Energy-based Alternative for GPT

Authors: Nima Dehmamy, Benjamin Hoover, Bishwajit Saha, Leo Kozachkov, Jean-Jacques Slotine, Dmitry Krotov
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.16762
Pdf URL: https://arxiv.org/pdf/2512.16762
Copy Paste: [[2512.16762]] NRGPT: An Energy-based Alternative for GPT(https://arxiv.org/abs/2512.16762)
Keywords: generative
Abstract: Generative Pre-trained Transformer (GPT) architectures are the most popular design for language modeling. Energy-based modeling is a different paradigm that views inference as a dynamical process operating on an energy landscape. We propose a minimal modification of the GPT setting to unify it with the EBM framework. The inference step of our model, which we call eNeRgy-GPT (NRGPT), is conceptualized as an exploration of the tokens on the energy landscape. We prove, and verify empirically, that under certain circumstances this exploration becomes gradient descent, although such circumstances do not necessarily lead to the best-performing models. We demonstrate that our model performs well for simple language (Shakespeare dataset), algebraic ListOPS tasks, and richer settings such as OpenWebText language modeling. We also observe that our models may be more resistant to overfitting, overfitting only during very long training.

Title: Make-It-Poseable: Feed-forward Latent Posing Model for 3D Humanoid Character Animation

Authors: Zhiyang Guo, Ori Zhang, Jax Xiang, Alan Zhao, Wengang Zhou, Houqiang Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.16767
Pdf URL: https://arxiv.org/pdf/2512.16767
Copy Paste: [[2512.16767]] Make-It-Poseable: Feed-forward Latent Posing Model for 3D Humanoid Character Animation(https://arxiv.org/abs/2512.16767)
Keywords: generation
Abstract: Posing 3D characters is a fundamental task in computer graphics and vision. However, existing methods like auto-rigging and pose-conditioned generation often struggle with challenges such as inaccurate skinning weight prediction, topological imperfections, and poor pose conformance, limiting their robustness and generalizability. To overcome these limitations, we introduce Make-It-Poseable, a novel feed-forward framework that reformulates character posing as a latent-space transformation problem. Instead of deforming mesh vertices as in traditional pipelines, our method reconstructs the character in new poses by directly manipulating its latent representation. At the core of our method is a latent posing transformer that manipulates shape tokens based on skeletal motion. This process is facilitated by a dense pose representation for precise control. To ensure high-fidelity geometry and accommodate topological changes, we also introduce a latent-space supervision strategy and an adaptive completion module. Our method demonstrates superior performance in posing quality. It also naturally extends to 3D editing applications like part replacement and refinement.

Title: FlowDet: Unifying Object Detection and Generative Transport Flows

Authors: Enis Baty, C. P. Bridges, Simon Hadfield
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.16771
Pdf URL: https://arxiv.org/pdf/2512.16771
Copy Paste: [[2512.16771]] FlowDet: Unifying Object Detection and Generative Transport Flows(https://arxiv.org/abs/2512.16771)
Keywords: generative
Abstract: We present FlowDet, the first formulation of object detection using modern Conditional Flow Matching techniques. This work follows from DiffusionDet, which originally framed detection as a generative denoising problem in the bounding box space via diffusion. We revisit and generalise this formulation to a broader class of generative transport problems, while maintaining the ability to vary the number of boxes and inference steps without re-training. In contrast to the curved stochastic transport paths induced by diffusion, FlowDet learns simpler and straighter paths resulting in faster scaling of detection performance as the number of inference steps grows. We find that this reformulation enables us to outperform diffusion-based detection systems (as well as non-generative baselines) across a wide range of experiments, including various precision/recall operating points using multiple feature backbones and datasets. In particular, when evaluating under recall-constrained settings, we can highlight the effects of the generative transport without over-compensating with large numbers of proposals. This provides gains of up to +3.6% AP and +4.2% AP$_{rare}$ over DiffusionDet on the COCO and LVIS datasets, respectively.
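The contrast between straight and curved transport paths can be made concrete with a small sketch. It assumes the linear interpolation path commonly used in conditional flow matching, a hypothetical `[cx, cy, w, h]` box parameterisation, and an oracle velocity standing in for a trained network; none of these details are stated in the abstract.

```python
from typing import Callable, List

Box = List[float]  # [cx, cy, w, h], a hypothetical box parameterisation


def interpolate(x0: Box, x1: Box, t: float) -> Box:
    """Point on the straight path x_t = (1 - t) * x0 + t * x1."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0: Box, x1: Box) -> Box:
    """Constant velocity x1 - x0 of the straight path (the regression target)."""
    return [b - a for a, b in zip(x0, x1)]


def euler_transport(x0: Box, velocity_fn: Callable[[Box, float], Box], steps: int) -> Box:
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with `steps` Euler steps."""
    x, dt = list(x0), 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


noise_box = [0.9, 0.1, 0.5, 0.5]
true_box = [0.4, 0.6, 0.2, 0.3]


def oracle(x: Box, t: float) -> Box:
    # Stand-in for a trained network: the exact straight-path velocity.
    return target_velocity(noise_box, true_box)


one_step = euler_transport(noise_box, oracle, steps=1)
four_steps = euler_transport(noise_box, oracle, steps=4)
```

Because the velocity along a straight path is constant, Euler integration lands on the target box regardless of the step count in this idealised case, which gives some intuition for why straighter learned paths need fewer inference steps.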

Title: Kling-Omni Technical Report

Authors: Kling Team: Jialu Chen, Yuanzheng Ci, Xiangyu Du, Zipeng Feng, Kun Gai, Sainan Guo, Feng Han, Jingbin He, Kang He, Xiao Hu, Xiaohua Hu, Boyuan Jiang, Fangyuan Kong, Hang Li, Jie Li, Qingyu Li, Shen Li, Xiaohan Li, Yan Li, Jiajun Liang, Borui Liao, Yiqiao Liao, Weihong Lin, Quande Liu, Xiaokun Liu, Yilun Liu, Yuliang Liu, Shun Lu, Hangyu Mao, Yunyao Mao, Haodong Ouyang, Wenyu Qin, Wanqi Shi, Xiaoyu Shi, Lianghao Su, Haozhi Sun, Peiqin Sun, Pengfei Wan, Chao Wang, Chenyu Wang, Meng Wang, Qiulin Wang, Runqi Wang, Xintao Wang, Xuebo Wang, Zekun Wang, Min Wei, Tiancheng Wen, Guohao Wu, Xiaoshi Wu, Zhenhua Wu, Da Xie, Yingtong Xiong, Yulong Xu, Sile Yang, Zikang Yang, Weicai Ye, Ziyang Yuan, Shenglong Zhang, Shuaiyu Zhang, Yuanxing Zhang, Yufan Zhang, Wenzheng Zhao, Ruiliang Zhou, Yan Zhou, Guosheng Zhu, Yongjie Zhu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.16776
Pdf URL: https://arxiv.org/pdf/2512.16776
Copy Paste: [[2512.16776]] Kling-Omni Technical Report(https://arxiv.org/abs/2512.16776)
Keywords: generation, generative
Abstract: We present Kling-Omni, a generalist generative framework designed to synthesize high-fidelity videos directly from multimodal visual language inputs. Adopting an end-to-end perspective, Kling-Omni bridges the functional separation among diverse video generation, editing, and intelligent reasoning tasks, integrating them into a holistic system. Unlike disjointed pipeline approaches, Kling-Omni supports a diverse range of user inputs, including text instructions, reference images, and video contexts, processing them into a unified multimodal representation to deliver cinematic-quality and highly-intelligent video content creation. To support these capabilities, we constructed a comprehensive data system that serves as the foundation for multimodal video creation. The framework is further empowered by efficient large-scale pre-training strategies and infrastructure optimizations for inference. Comprehensive evaluations reveal that Kling-Omni demonstrates exceptional capabilities in in-context generation, reasoning-based editing, and multimodal instruction following. Moving beyond a content creation tool, we believe Kling-Omni is a pivotal advancement toward multimodal world simulators capable of perceiving, reasoning, generating and interacting with the dynamic and complex worlds.

Title: DenseBEV: Transforming BEV Grid Cells into 3D Objects

Authors: Marius Dähling, Sebastian Krebs, J. Marius Zöllner
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.16818
Pdf URL: https://arxiv.org/pdf/2512.16818
Copy Paste: [[2512.16818]] DenseBEV: Transforming BEV Grid Cells into 3D Objects(https://arxiv.org/abs/2512.16818)
Keywords: generation
Abstract: In current research, Bird's-Eye-View (BEV)-based transformers are increasingly utilized for multi-camera 3D object detection. Traditional models often employ random queries as anchors, optimizing them successively. Recent advancements complement or replace these random queries with detections from auxiliary networks. We propose a more intuitive and efficient approach by using BEV feature cells directly as anchors. This end-to-end approach leverages the dense grid of BEV queries, considering each cell as a potential object for the final detection task. As a result, we introduce a novel two-stage anchor generation method specifically designed for multi-camera 3D object detection. To address the scaling issues of attention with a large number of queries, we apply BEV-based Non-Maximum Suppression, allowing gradients to flow only through non-suppressed objects. This ensures efficient training without the need for post-processing. By using BEV features from encoders such as BEVFormer directly as object queries, temporal BEV information is inherently embedded. Building on the temporal BEV information already embedded in our object queries, we introduce a hybrid temporal modeling approach by integrating prior detections to further enhance detection performance. Evaluating our method on the nuScenes dataset shows consistent and significant improvements in NDS and mAP over the baseline, even with sparser BEV grids and therefore fewer initial anchors. It is particularly effective for small objects, enhancing pedestrian detection with a 3.8% mAP increase on nuScenes and an 8% increase in LET-mAP on Waymo. Applying our method, named DenseBEV, to the challenging Waymo Open dataset yields state-of-the-art performance, achieving a LET-mAP of 60.7%, surpassing the previous best by 5.4%. Code is available at this https URL.
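A rough sketch of non-maximum suppression on a BEV score grid follows. The greedy procedure, the Chebyshev-distance window, and the `radius` and `top_k` parameters are assumptions chosen for illustration; the abstract does not describe the exact suppression rule, and the gradient masking it mentions is not modelled here.

```python
from typing import List, Tuple


def bev_nms(scores: List[List[float]], radius: int, top_k: int) -> List[Tuple[int, int]]:
    """Greedy NMS on a BEV score grid: keep the best cells, suppress their neighbours."""
    cells = sorted(
        ((s, r, c) for r, row in enumerate(scores) for c, s in enumerate(row)),
        reverse=True,
    )
    kept: List[Tuple[int, int]] = []
    for _, r, c in cells:
        if len(kept) == top_k:
            break
        # A cell survives only if it lies outside the (2 * radius + 1)^2
        # window of every cell kept so far.
        if all(max(abs(r - kr), abs(c - kc)) > radius for kr, kc in kept):
            kept.append((r, c))
    return kept


grid = [
    [0.1, 0.2, 0.1, 0.0],
    [0.2, 0.9, 0.3, 0.1],
    [0.1, 0.3, 0.2, 0.7],
    [0.0, 0.1, 0.6, 0.8],
]
anchors = bev_nms(grid, radius=1, top_k=2)
```

The surviving cells would then serve as object queries, so the number of queries passed to attention depends on `top_k` rather than on the full grid size.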

Title: MEPIC: Memory Efficient Position Independent Caching for LLM Serving

Authors: Qian Wang, Zahra Yousefijamarani, Morgan Lindsay Heisler, Rongzhi Gu, Bai Xiaolong, Shan Yizhou, Wei Zhang, Wang Lan, Ying Xiong, Yong Zhang, Zhenan Fan
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.16822
Pdf URL: https://arxiv.org/pdf/2512.16822
Copy Paste: [[2512.16822]] MEPIC: Memory Efficient Position Independent Caching for LLM Serving(https://arxiv.org/abs/2512.16822)
Keywords: generation
Abstract: Modern LLM applications such as deep-research assistants, coding agents, and Retrieval-Augmented Generation (RAG) systems repeatedly process long prompt histories containing shared document or code chunks, creating significant pressure on the Key Value (KV) cache, which must operate within limited memory while sustaining high throughput and low latency. Prefix caching alleviates some of these costs by reusing KV cache for previously processed tokens, but is limited by strict prefix matching. Position-independent caching (PIC) enables chunk-level reuse at arbitrary positions, but requires selective recomputation and positional-encoding (PE) adjustments. However, because these operations vary across queries, KV for the same chunk diverges across requests. Moreover, without page alignment, chunk KV layouts diverge in memory, preventing page sharing. These issues result in only modest HBM savings even when many requests reuse the same content. We present MEPIC, a memory-efficient PIC system that enables chunk KV reuse across positions, requests, and batches. MEPIC aligns chunk KV to paged storage, shifts recomputation from token- to block-level so only the first block is request-specific, removes positional encodings via Rotary Position Embedding (RoPE) fusion in the attention kernel, and makes remaining blocks fully shareable. These techniques eliminate most duplicate chunk KV in HBM, reducing usage by up to 2x over state-of-the-art PIC at comparable latency and accuracy, and up to 5x for long prompts, without any model changes.
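The difference between strict prefix matching and position-independent lookup can be shown with a toy cache. This sketch only contrasts the two cache keys; the function names, token values, and the `"kv"` placeholder are hypothetical, and the recomputation, RoPE handling, and paging that MEPIC relies on are left out entirely.

```python
import hashlib
from typing import Dict, List


def digest(tokens: List[int]) -> str:
    """Stable content hash used as a cache key."""
    return hashlib.sha256(repr(tokens).encode()).hexdigest()


def prefix_hits(cache: Dict[str, str], chunks: List[List[int]]) -> int:
    """Strict prefix matching: a chunk hits only if everything before it matches too."""
    hits, prefix = 0, []
    for chunk in chunks:
        prefix = prefix + chunk
        key = digest(prefix)
        if key in cache:
            hits += 1
        else:
            cache[key] = "kv"  # placeholder for stored KV blocks
    return hits


def chunk_hits(cache: Dict[str, str], chunks: List[List[int]]) -> int:
    """Position-independent lookup: the key depends on chunk content alone."""
    hits = 0
    for chunk in chunks:
        key = digest(chunk)
        if key in cache:
            hits += 1
        else:
            cache[key] = "kv"  # placeholder for stored KV blocks
    return hits


doc_a, doc_b, question = [1, 2, 3], [4, 5, 6], [7, 8]
request_1 = [doc_a, doc_b, question]
request_2 = [doc_b, doc_a, question]  # same documents, retrieved in a different order

prefix_cache: Dict[str, str] = {}
chunk_cache: Dict[str, str] = {}
prefix_hits(prefix_cache, request_1)
chunk_hits(chunk_cache, request_1)
prefix_reuse = prefix_hits(prefix_cache, request_2)
chunk_reuse = chunk_hits(chunk_cache, request_2)
```

Reordering the retrieved documents defeats the prefix cache completely, while content-keyed lookup still finds every chunk; making the reused KV numerically valid at the new position is the harder part that the abstract describes.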

Title: Next-Generation License Plate Detection and Recognition System using YOLOv8

Authors: Arslan Amin, Rafia Mumtaz, Muhammad Jawad Bashir, Syed Mohammad Hassan Zaidi
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.16826
Pdf URL: https://arxiv.org/pdf/2512.16826
Copy Paste: [[2512.16826]] Next-Generation License Plate Detection and Recognition System using YOLOv8(https://arxiv.org/abs/2512.16826)
Keywords: generation
Abstract: In the evolving landscape of traffic management and vehicle surveillance, efficient license plate detection and recognition are indispensable. Historically, many methodologies have tackled this challenge, but consistent real-time accuracy, especially in diverse environments, remains elusive. This study examines the performance of YOLOv8 variants on License Plate Recognition (LPR) and Character Recognition tasks, crucial for advancing Intelligent Transportation Systems. Two distinct datasets were employed for training and evaluation, yielding notable findings. The YOLOv8 Nano variant demonstrated a precision of 0.964 and mAP50 of 0.918 on the LPR task, while the YOLOv8 Small variant exhibited a precision of 0.92 and mAP50 of 0.91 on the Character Recognition task. A custom method for character sequencing was introduced, effectively sequencing the detected characters based on their x-axis positions. An optimized pipeline, utilizing YOLOv8 Nano for LPR and YOLOv8 Small for Character Recognition, is proposed. This configuration not only maintains computational efficiency but also ensures high accuracy, establishing a robust foundation for future real-world deployments on edge devices within Intelligent Transportation Systems. This effort marks a significant stride towards the development of smarter and more efficient urban infrastructures.

Title: Radiology Report Generation with Layer-Wise Anatomical Attention

Authors: Emmanuel D. Muñiz-De-León, Jorge A. Rosales-de-Golferichs, Ana S. Muñoz-Rodríguez, Alejandro I. Trejo-Castro, Eduardo de Avila-Armenta, Antonio Martínez-Torteya
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.16841
Pdf URL: https://arxiv.org/pdf/2512.16841
Copy Paste: [[2512.16841]] Radiology Report Generation with Layer-Wise Anatomical Attention(https://arxiv.org/abs/2512.16841)
Keywords: generation, generative
Abstract: Automatic radiology report generation is a promising application of multimodal deep learning, aiming to reduce reporting workload and improve consistency. However, current state-of-the-art (SOTA) systems - such as Multimodal AI for Radiology Applications (MAIRA-2) and Medical Pathways Language Model-Multimodal (MedPaLM-M) - depend on large-scale multimodal training, clinical metadata, and multiple imaging views, making them resource-intensive and inaccessible for most settings. We introduce a compact image-to-text architecture that generates the Findings section of chest X-ray reports from a single frontal image. The model combines a frozen Self-Distillation with No Labels v3 (DINOv3) Vision Transformer (ViT) encoder with a Generative Pre-trained Transformer 2 (GPT-2) decoder enhanced by layer-wise anatomical attention. This mechanism integrates lung and heart segmentation masks through hierarchical Gaussian smoothing, biasing attention toward clinically relevant regions without adding trainable parameters. Evaluated on the official Medical Information Mart for Intensive Care-Chest X-ray (MIMIC-CXR) dataset using Chest Radiograph Expert (CheXpert) and Radiology Graph (RadGraph) metrics, our approach achieved substantial gains: CheXpert Macro-F1 for five key pathologies increased by 168% (0.083 -> 0.238) and Micro-F1 by 146% (0.137 -> 0.337), while broader performance across 14 observations improved by 86% (0.170 -> 0.316). Structural coherence also improved, with RadGraph F1 rising by 9.7%. Despite its small size and purely image-conditioned design, the model demonstrates that decoder-level anatomical guidance improves spatial grounding and enhances coherence in clinically relevant regions. The source code is publicly available at: this https URL.

Title: LinkedOut: Linking World Knowledge Representation Out of Video LLM for Next-Generation Video Recommendation

Authors: Haichao Zhang, Yao Lu, Lichen Wang, Yunzhe Li, Daiwei Chen, Yunpeng Xu, Yun Fu
Subjects: cs.CV, cs.AI, cs.IR, cs.LG, cs.MM
Abstract URL: https://arxiv.org/abs/2512.16891
Pdf URL: https://arxiv.org/pdf/2512.16891
Copy Paste: [[2512.16891]] LinkedOut: Linking World Knowledge Representation Out of Video LLM for Next-Generation Video Recommendation(https://arxiv.org/abs/2512.16891)
Keywords: generation
Abstract: Video Large Language Models (VLLMs) unlock world-knowledge-aware video understanding through pretraining on internet-scale data and have already shown promise on tasks such as movie analysis and video question answering. However, deploying VLLMs for downstream tasks such as video recommendation remains challenging, since real systems require multi-video inputs, lightweight backbones, low-latency sequential inference, and rapid response. In practice, (1) decode-only generation yields high latency for sequential inference, (2) typical interfaces do not support multi-video inputs, and (3) constraining outputs to language discards fine-grained visual details that matter for downstream vision tasks. We argue that these limitations stem from the absence of a representation that preserves pixel-level detail while leveraging world knowledge. We present LinkedOut, a representation that extracts VLLM world knowledge directly from video to enable fast inference, supports multi-video histories, and removes the language bottleneck. LinkedOut extracts semantically grounded, knowledge-aware tokens from raw frames using VLLMs, guided by promptable queries and optional auxiliary modalities. We introduce a cross-layer knowledge fusion MoE that selects the appropriate level of abstraction from the rich VLLM features, enabling personalized, interpretable, and low-latency recommendation. To our knowledge, LinkedOut is the first VLLM-based video recommendation method that operates on raw frames without handcrafted labels, achieving state-of-the-art results on standard benchmarks. Interpretability studies and ablations confirm the benefits of layer diversity and layer-wise fusion, pointing to a practical path that fully leverages VLLM world-knowledge priors and visual reasoning for downstream vision tasks such as recommendation.

Title: Alchemist: Unlocking Efficiency in Text-to-Image Model Training via Meta-Gradient Data Selection

Authors: Kaixin Ding, Yang Zhou, Xi Chen, Miao Yang, Jiarong Ou, Rui Chen, Xin Tao, Hengshuang Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.16905
Pdf URL: https://arxiv.org/pdf/2512.16905
Copy Paste: [[2512.16905]] Alchemist: Unlocking Efficiency in Text-to-Image Model Training via Meta-Gradient Data Selection(https://arxiv.org/abs/2512.16905)
Keywords: generative
Abstract: Recent advances in Text-to-Image (T2I) generative models, such as Imagen, Stable Diffusion, and FLUX, have led to remarkable improvements in visual quality. However, their performance is fundamentally limited by the quality of training data. Web-crawled and synthetic image datasets often contain low-quality or redundant samples, which lead to degraded visual fidelity, unstable training, and inefficient computation. Hence, effective data selection is crucial for improving data efficiency. Existing approaches to Text-to-Image data filtering rely on costly manual curation or on heuristic scoring based on single-dimensional features. Although meta-learning-based methods have been explored for LLMs, they have not been adapted to image modalities. To this end, we propose **Alchemist**, a meta-gradient-based framework to select a suitable subset from large-scale text-image data pairs. Our approach automatically learns to assess the influence of each sample by iteratively optimizing the model from a data-centric perspective. Alchemist consists of two key stages: data rating and data pruning. We train a lightweight rater to estimate each sample's influence based on gradient information, enhanced with multi-granularity perception. We then use the Shift-Gsampling strategy to select informative subsets for efficient model training. Alchemist is the first automatic, scalable, meta-gradient-based data selection framework for Text-to-Image model training. Experiments on both synthetic and web-crawled datasets demonstrate that Alchemist consistently improves visual quality and downstream performance. Training on an Alchemist-selected 50% of the data can outperform training on the full dataset.

Title: Flowing from Reasoning to Motion: Learning 3D Hand Trajectory Prediction from Egocentric Human Interaction Videos

Authors: Mingfei Chen, Yifan Wang, Zhengqin Li, Homanga Bharadhwaj, Yujin Chen, Chuan Qin, Ziyi Kou, Yuan Tian, Eric Whitmire, Rajinder Sodhi, Hrvoje Benko, Eli Shlizerman, Yue Liu
Subjects: cs.CV, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2512.16907
Pdf URL: https://arxiv.org/pdf/2512.16907
Copy Paste: [[2512.16907]] Flowing from Reasoning to Motion: Learning 3D Hand Trajectory Prediction from Egocentric Human Interaction Videos(https://arxiv.org/abs/2512.16907)
Keywords: generation
Abstract: Prior works on 3D hand trajectory prediction are constrained by datasets that decouple motion from semantic supervision and by models that weakly link reasoning and action. To address these, we first present the EgoMAN dataset, a large-scale egocentric dataset for interaction stage-aware 3D hand trajectory prediction with 219K 6DoF trajectories and 3M structured QA pairs for semantic, spatial, and motion reasoning. We then introduce the EgoMAN model, a reasoning-to-motion framework that links vision-language reasoning and motion generation via a trajectory-token interface. Trained progressively to align reasoning with motion dynamics, our approach yields accurate and stage-aware trajectories with generalization across real-world scenes.

Title: SFTok: Bridging the Performance Gap in Discrete Tokenizers

Authors: Qihang Rao, Borui Zhang, Wenzhao Zheng, Jie Zhou, Jiwen Lu
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2512.16910
Pdf URL: https://arxiv.org/pdf/2512.16910
Copy Paste: [[2512.16910]] SFTok: Bridging the Performance Gap in Discrete Tokenizers(https://arxiv.org/abs/2512.16910)
Keywords: generation, generative
Abstract: Recent advances in multimodal models highlight the pivotal role of image tokenization in high-resolution image generation. By compressing images into compact latent representations, tokenizers enable generative models to operate in lower-dimensional spaces, thereby improving computational efficiency and reducing complexity. Discrete tokenizers naturally align with the autoregressive paradigm but still lag behind continuous ones, limiting their adoption in multimodal systems. To address this, we propose SFTok, a discrete tokenizer that incorporates a multi-step iterative mechanism for precise reconstruction. By integrating self-forcing guided visual reconstruction with a debias-and-fitting training strategy, SFTok resolves the training-inference inconsistency in the multi-step process, significantly enhancing image reconstruction quality. At a high compression rate of only 64 tokens per image, SFTok achieves state-of-the-art reconstruction quality on ImageNet (rFID = 1.21) and demonstrates exceptional performance in class-to-image generation tasks (gFID = 2.29).

Title: Posterior Behavioral Cloning: Pretraining BC Policies for Efficient RL Finetuning

Authors: Andrew Wagenmaker, Perry Dong, Raymond Tsao, Chelsea Finn, Sergey Levine
Subjects: cs.LG, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2512.16911
Pdf URL: https://arxiv.org/pdf/2512.16911
Copy Paste: [[2512.16911]] Posterior Behavioral Cloning: Pretraining BC Policies for Efficient RL Finetuning(https://arxiv.org/abs/2512.16911)
Keywords: generative
Abstract: Standard practice across domains from robotics to language is to first pretrain a policy on a large-scale demonstration dataset, and then finetune this policy, typically with reinforcement learning (RL), in order to improve performance on deployment domains. This finetuning step has proved critical in achieving human or super-human performance, yet while much attention has been given to developing more effective finetuning algorithms, little attention has been given to ensuring the pretrained policy is an effective initialization for RL finetuning. In this work we seek to understand how the pretrained policy affects finetuning performance, and how to pretrain policies in order to ensure they are effective initializations for finetuning. We first show theoretically that standard behavioral cloning (BC) -- which trains a policy to directly match the actions played by the demonstrator -- can fail to ensure coverage over the demonstrator's actions, a minimal condition necessary for effective RL finetuning. We then show that if, instead of exactly fitting the observed demonstrations, we train a policy to model the posterior distribution of the demonstrator's behavior given the demonstration dataset, we do obtain a policy that ensures coverage over the demonstrator's actions, enabling more effective finetuning. Furthermore, this policy -- which we refer to as the posterior behavioral cloning (PostBC) policy -- achieves this while ensuring pretrained performance is no worse than that of the BC policy. We then show that PostBC is practically implementable with modern generative models in robotic control domains -- relying only on standard supervised learning -- and leads to significantly improved RL finetuning performance on both realistic robotic control benchmarks and real-world robotic manipulation tasks, as compared to standard behavioral cloning.

Title: StereoPilot: Learning Unified and Efficient Stereo Conversion via Generative Priors

Authors: Guibao Shen, Yihua Du, Wenhang Ge, Jing He, Chirui Chang, Donghao Zhou, Zhen Yang, Luozhou Wang, Xin Tao, Ying-Cong Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.16915
Pdf URL: https://arxiv.org/pdf/2512.16915
Copy Paste: [[2512.16915]] StereoPilot: Learning Unified and Efficient Stereo Conversion via Generative Priors(https://arxiv.org/abs/2512.16915)
Keywords: generative
Abstract: The rapid growth of stereoscopic displays, including VR headsets and 3D cinemas, has led to increasing demand for high-quality stereo video content. However, producing 3D videos remains costly and complex, while automatic Monocular-to-Stereo conversion is hindered by the limitations of the multi-stage "Depth-Warp-Inpaint" (DWI) pipeline. This paradigm suffers from error propagation, depth ambiguity, and format inconsistency between parallel and converged stereo configurations. To address these challenges, we introduce UniStereo, the first large-scale unified dataset for stereo video conversion, covering both stereo formats to enable fair benchmarking and robust model training. Building upon this dataset, we propose StereoPilot, an efficient feed-forward model that directly synthesizes the target view without relying on explicit depth maps or iterative diffusion sampling. Equipped with a learnable domain switcher and a cycle consistency loss, StereoPilot adapts seamlessly to different stereo formats and achieves improved consistency. Extensive experiments demonstrate that StereoPilot significantly outperforms state-of-the-art methods in both visual fidelity and computational efficiency. Project page: this https URL.

Title: Next-Embedding Prediction Makes Strong Vision Learners

Authors: Sihan Xu, Ziqiao Ma, Wenhao Chai, Xuweiyi Chen, Weiyang Jin, Joyce Chai, Saining Xie, Stella X. Yu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.16922
Pdf URL: https://arxiv.org/pdf/2512.16922
Copy Paste: [[2512.16922]] Next-Embedding Prediction Makes Strong Vision Learners(https://arxiv.org/abs/2512.16922)
Keywords: generative
Abstract: Inspired by the success of generative pretraining in natural language, we ask whether the same principles can yield strong self-supervised visual learners. Instead of training models to output features for downstream use, we train them to generate embeddings to perform predictive tasks directly. This work explores such a shift from learning representations to learning models. Specifically, models learn to predict future patch embeddings conditioned on past ones, using causal masking and stop gradient, which we refer to as Next-Embedding Predictive Autoregression (NEPA). We demonstrate that a simple Transformer pretrained on ImageNet-1k with next embedding prediction as its sole learning objective is effective - no pixel reconstruction, discrete tokens, contrastive loss, or task-specific heads. This formulation retains architectural simplicity and scalability, without requiring additional design complexity. NEPA achieves strong results across tasks, attaining 83.8% and 85.3% top-1 accuracy on ImageNet-1K with ViT-B and ViT-L backbones after fine-tuning, and transferring effectively to semantic segmentation on ADE20K. We believe generative pretraining from embeddings provides a simple, scalable, and potentially modality-agnostic alternative to visual self-supervised learning.
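The shape of a next-embedding prediction objective can be sketched in a few lines. This is only an illustration under stated assumptions, not the NEPA implementation: the causal running-mean predictor stands in for a causally masked Transformer, the cosine-distance loss is an assumed choice (the abstract does not name the loss), and the function names are hypothetical. Plain Python has no autograd, so the stop-gradient appears only as a comment marking where the target would be detached.

```python
import math
from typing import List

Vector = List[float]


def causal_mean_predictor(embeddings: List[Vector], t: int) -> Vector:
    """Stand-in predictor: mean of embeddings[0..t]; it never sees positions > t."""
    prefix = embeddings[: t + 1]
    dim = len(prefix[0])
    return [sum(v[i] for v in prefix) / len(prefix) for i in range(dim)]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def next_embedding_loss(embeddings: List[Vector]) -> float:
    """Average (1 - cosine) between each prediction and the next patch embedding."""
    losses = []
    for t in range(len(embeddings) - 1):
        prediction = causal_mean_predictor(embeddings, t)
        # In an autograd framework the target would be detached here (stop gradient).
        target = embeddings[t + 1]
        losses.append(1.0 - cosine(prediction, target))
    return sum(losses) / len(losses)


patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(next_embedding_loss(patches))
```

The point of the sketch is the data flow: position t predicts the embedding at t + 1 using only positions up to t, and the target side receives no gradient.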

Title: Generative Refocusing: Flexible Defocus Control from a Single Image

Authors: Chun-Wei Tuan Mu, Jia-Bin Huang, Yu-Lun Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.16923
Pdf URL: https://arxiv.org/pdf/2512.16923
Copy Paste: [[2512.16923]] Generative Refocusing: Flexible Defocus Control from a Single Image(https://arxiv.org/abs/2512.16923)
Keywords: generative
Abstract: Depth-of-field control is essential in photography, but getting the perfect focus often takes several tries or special equipment. Single-image refocusing is still difficult. It involves recovering sharp content and creating realistic bokeh. Current methods have significant drawbacks. They need all-in-focus inputs, depend on synthetic data from simulators, and have limited control over aperture. We introduce Generative Refocusing, a two-step process that uses DeblurNet to recover all-in-focus images from various inputs and BokehNet for creating controllable bokeh. Our main innovation is semi-supervised training. This method combines synthetic paired data with unpaired real bokeh images, using EXIF metadata to capture real optical characteristics beyond what simulators can provide. Our experiments show we achieve top performance in defocus deblurring, bokeh synthesis, and refocusing benchmarks. Additionally, our Generative Refocusing allows text-guided adjustments and custom aperture shapes.

Title: The World is Your Canvas: Painting Promptable Events with Reference Images, Trajectories, and Text

Authors: Hanlin Wang, Hao Ouyang, Qiuyu Wang, Yue Yu, Yihao Meng, Wen Wang, Ka Leong Cheng, Shuailei Ma, Qingyan Bai, Yixuan Li, Cheng Chen, Yanhong Zeng, Xing Zhu, Yujun Shen, Qifeng Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.16924
Pdf URL: https://arxiv.org/pdf/2512.16924
Copy Paste: [[2512.16924]] The World is Your Canvas: Painting Promptable Events with Reference Images, Trajectories, and Text(https://arxiv.org/abs/2512.16924)
Keywords: generation
Abstract: We present WorldCanvas, a framework for promptable world events that enables rich, user-directed simulation by combining text, trajectories, and reference images. Unlike text-only approaches and existing trajectory-controlled image-to-video methods, our multimodal approach combines trajectories -- encoding motion, timing, and visibility -- with natural language for semantic intent and reference images for visual grounding of object identity, enabling the generation of coherent, controllable events that include multi-agent interactions, object entry/exit, reference-guided appearance, and counterintuitive events. The resulting videos demonstrate not only temporal coherence but also emergent consistency, preserving object identity and scene despite temporary disappearance. By supporting expressive world-event generation, WorldCanvas advances world models from passive predictors to interactive, user-shaped simulators. Our project page is available at: this https URL.