2026-01-12

Title: Coding the Visual World: From Image to Simulation Using Vision Language Models

Authors: Sagi Eppel
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.05344
Pdf URL: https://arxiv.org/pdf/2601.05344
Copy Paste: [[2601.05344]] Coding the Visual World: From Image to Simulation Using Vision Language Models(https://arxiv.org/abs/2601.05344)
Keywords: generative
Abstract: The ability to construct mental models of the world is a central aspect of understanding. Similarly, visual understanding can be viewed as the ability to construct a representative model of the system depicted in an image. This work explores the capacity of Vision Language Models (VLMs) to recognize and simulate the systems and mechanisms depicted in images using the Im2Sim methodology. The VLM is given a natural image of a real-world system (e.g., cities, clouds, vegetation) and is tasked with describing the system and writing code that simulates and generates it. This generative code is then executed to produce a synthetic image, which is compared against the original. This approach is tested on various complex emergent systems, ranging from physical systems (waves, lights, clouds) to vegetation, cities, materials, and geological formations. Through analysis of the models and images generated by the VLMs, we examine their understanding of the systems in images. The results show that leading VLMs (GPT, Gemini) demonstrate the capacity to understand and model complex, multi-component systems across multiple layers of abstraction and a wide range of domains. At the same time, the VLMs exhibit limited ability to replicate fine details and low-level arrangements of patterns in the image. These findings reveal an interesting asymmetry: VLMs combine high-level, deep visual understanding of images with limited perception of fine details.

Title: Efficient Inference for Noisy LLM-as-a-Judge Evaluation

Authors: Yiqun T Chen, Sizhu Lu, Sijia Li, Moran Guo, Shengyi Li
Subjects: cs.LG, stat.AP, stat.ME
Abstract URL: https://arxiv.org/abs/2601.05420
Pdf URL: https://arxiv.org/pdf/2601.05420
Copy Paste: [[2601.05420]] Efficient Inference for Noisy LLM-as-a-Judge Evaluation(https://arxiv.org/abs/2601.05420)
Keywords: generative
Abstract: Large language models (LLMs) are increasingly used as automatic evaluators of generative AI outputs, a paradigm often referred to as "LLM-as-a-judge." In practice, LLM judges are imperfect predictors of the underlying truth and can exhibit systematic, non-random errors. Two main approaches have recently been proposed to address this issue: (i) direct measurement-error correction based on misclassification models such as Rogan-Gladen-style estimators, and (ii) surrogate-outcome approaches such as prediction-powered inference (PPI), which correct bias by calibrating prediction residuals on a small set of gold-standard human labels. In this paper, we systematically study the performance of these two approaches for estimating mean parameters (e.g., average benchmark scores or pairwise win rates). Leveraging tools from semiparametric efficiency theory, we unify the two classes of estimators by deriving explicit forms of efficient influence function (EIF)-based efficient estimators and characterize conditions under which PPI-style estimators attain strictly smaller asymptotic variance than measurement-error corrections. We verify our theoretical results in simulations and demonstrate the methods on real-data examples. We provide an implementation of the benchmarked methods and comparison utilities at this https URL.
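
As a rough illustration of the two estimator families named in the abstract above, the sketch below computes a Rogan-Gladen-style corrected rate and a basic prediction-powered mean on toy data. The function names, the toy labels, and the assumption of a binary judge with known sensitivity and specificity are illustrative and are not taken from the paper. The EIF-based efficient estimators it derives are not reproduced here.

```python
from statistics import mean


def rogan_gladen(judge_rate, sensitivity, specificity):
    """Correct an observed positive rate for known judge misclassification.

    judge_rate: fraction of items the judge labels positive.
    sensitivity / specificity: assumed known (e.g. estimated on gold labels).
    """
    denom = sensitivity + specificity - 1.0
    if denom <= 0:
        raise ValueError("judge must be better than chance")
    estimate = (judge_rate + specificity - 1.0) / denom
    # Clip to [0, 1], since the raw correction can leave the unit interval.
    return min(1.0, max(0.0, estimate))


def ppi_mean(judge_unlabeled, judge_labeled, human_labeled):
    """Basic prediction-powered mean: judge average plus a residual correction.

    The correction is the mean gap between human and judge labels on the
    small gold-standard set.
    """
    rectifier = mean(h - j for h, j in zip(human_labeled, judge_labeled))
    return mean(judge_unlabeled) + rectifier


# Toy data (hypothetical): 1 = "output judged correct", 0 = "incorrect".
corrected = rogan_gladen(judge_rate=0.6, sensitivity=0.9, specificity=0.8)
ppi = ppi_mean(
    judge_unlabeled=[1, 1, 1, 0],
    judge_labeled=[1, 1, 0, 1],
    human_labeled=[1, 0, 0, 1],
)
print(round(corrected, 4), ppi)
```

Both functions target the same quantity, the true mean outcome. They differ in how the gold-standard labels are used: to estimate error rates in the first case, and to estimate an average residual in the second.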

Title: Prediction of Fault Slip Tendency in CO${_2}$ Storage using Data-space Inversion

Authors: Xiaowen He, Su Jiang, Louis J. Durlofsky
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2601.05431
Pdf URL: https://arxiv.org/pdf/2601.05431
Copy Paste: [[2601.05431]] Prediction of Fault Slip Tendency in CO${_2}$ Storage using Data-space Inversion(https://arxiv.org/abs/2601.05431)
Keywords: generation
Abstract: Accurately assessing the potential for fault slip is essential in many subsurface operations. Conventional model-based history matching methods, which entail the generation of posterior geomodels calibrated to observed data, can be challenging to apply in coupled flow-geomechanics problems with faults. In this work, we implement a variational autoencoder (VAE)-based data-space inversion (DSI) framework to predict pressure, stress and strain fields, and fault slip tendency, in CO${_2}$ storage projects. The main computations required by the DSI workflow entail the simulation of O(1000) prior geomodels. The posterior distributions for quantities of interest are then inferred directly from prior simulation results and observed data, without the need to generate posterior geomodels. The model used here involves a synthetic 3D system with two faults. Realizations of heterogeneous permeability and porosity fields are generated using geostatistical software, and uncertain geomechanical and fault parameters are sampled for each realization from prior distributions. Coupled flow-geomechanics simulations for these geomodels are conducted using GEOS. A VAE with stacked convolutional long short-term memory layers is trained, using the prior simulation results, to represent pressure, strain, effective normal stress and shear stress fields in terms of latent variables. The VAE parameterization is used with DSI for posterior predictions, with monitoring wells providing observed pressure and strain data. Posterior results for synthetic true models demonstrate that the DSI-VAE framework gives accurate predictions for pressure, strain, and stress fields and for fault slip tendency. The framework is also shown to reduce uncertainty in key geomechanical and fault parameters.

Title: RingSQL: Generating Synthetic Data with Schema-Independent Templates for Text-to-SQL Reasoning Models

Authors: Marko Sterbentz, Kevin Cushing, Cameron Barrie, Kristian J. Hammond
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2601.05451
Pdf URL: https://arxiv.org/pdf/2601.05451
Copy Paste: [[2601.05451]] RingSQL: Generating Synthetic Data with Schema-Independent Templates for Text-to-SQL Reasoning Models(https://arxiv.org/abs/2601.05451)
Keywords: generation
Abstract: Recent advances in text-to-SQL systems have been driven by larger models and improved datasets, yet progress is still limited by the scarcity of high-quality training data. Manual data creation is expensive, and existing synthetic methods trade off reliability and scalability. Template-based approaches ensure correct SQL but require schema-specific templates, while LLM-based generation scales easily but lacks quality and correctness guarantees. We introduce RingSQL, a hybrid data generation framework that combines schema-independent query templates with LLM-based paraphrasing of natural language questions. This approach preserves SQL correctness across diverse schemas while providing broad linguistic variety. In our experiments, we find that models trained using data produced by RingSQL achieve an average gain in accuracy of +2.3% across six text-to-SQL benchmarks when compared to models trained on other synthetic data. We make our code available at this https URL.

Title: MaxCode: A Max-Reward Reinforcement Learning Framework for Automated Code Optimization

Authors: Jiefu Ou, Sapana Chaudhary, Kaj Bostrom, Nathaniel Weir, Shuai Zhang, Huzefa Rangwala, George Karypis
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2601.05475
Pdf URL: https://arxiv.org/pdf/2601.05475
Copy Paste: [[2601.05475]] MaxCode: A Max-Reward Reinforcement Learning Framework for Automated Code Optimization(https://arxiv.org/abs/2601.05475)
Keywords: generative
Abstract: Large Language Models (LLMs) demonstrate strong capabilities in general coding tasks but encounter two key challenges when optimizing code: (i) writing optimized code (such as performant CUDA kernels and competition-level CPU code) requires expertise in systems, algorithms and specific languages, and (ii) optimization requires interpreting performance metrics like timing and device utilization beyond binary correctness. In this work, we explore inference-time search algorithms that guide the LLM to discover better solutions through iterative refinement based on execution feedback. Our approach, called MaxCode, unifies existing search methods under a max-reward reinforcement learning framework, making the observation and action-value functions modular for modification. To enhance the observation space, we integrate a natural language critique model that converts raw execution feedback into diagnostic insights about errors and performance bottlenecks, and the best-discounted reward seen so far. Together, these provide richer input to the code proposal function. To improve exploration during search, we train a generative reward-to-go model using action values from rollouts to rerank potential solutions. Testing on the KernelBench (CUDA) and PIE (C++) optimization benchmarks shows that MaxCode improves optimized code performance compared to baselines, achieving 20.3% and 10.1% relative improvements in absolute speedup value and relative speedup ranking, respectively.

Title: Hippocampal Atrophy Patterns Across the Alzheimer's Disease Spectrum: A Voxel-Based Morphometry Analysis

Authors: Trishna Niraula
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.05494
Pdf URL: https://arxiv.org/pdf/2601.05494
Copy Paste: [[2601.05494]] Hippocampal Atrophy Patterns Across the Alzheimer's Disease Spectrum: A Voxel-Based Morphometry Analysis(https://arxiv.org/abs/2601.05494)
Keywords: generation
Abstract: Alzheimer's disease (AD) and mild cognitive impairment (MCI) are associated with progressive gray matter loss, particularly in medial temporal structures. In this study, CAT12/SPM12 voxel-based morphometry was applied to baseline T1-weighted MRI scans from 249 ADNI participants (CN = 90, MCI = 129, AD = 30). Gray matter volume was analyzed using a general linear model, with the diagnostic group as primary predictor and age and total intracranial volume as covariates. Statistical maps were thresholded at p < 0.001 (voxelwise) and corrected for multiple comparisons at the cluster level using family-wise error (FWE) correction (p < 0.05). Significant hippocampal atrophy was observed in AD relative to CN and MCI (Cohen's d = 2.03 and 1.61, respectively). Hippocampal volume demonstrated moderate predictive value for conversion from MCI to AD (AUC = 0.66). Stratification by APOE4 status did not reveal significant genetic effects on cross-sectional hippocampal volume. These results support medial temporal degeneration as a key feature of AD progression and provide insights into predictive biomarkers and genetic influences.
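
For readers unfamiliar with the effect size quoted in the abstract above, the following sketch computes Cohen's d with a pooled standard deviation. The abstract does not say which variant of d was used, so the pooled form here is an assumption, and the volumes are made-up numbers rather than ADNI data.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d: difference in means divided by the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    # statistics.variance is the sample variance (n - 1 denominator).
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Hypothetical hippocampal volumes in arbitrary units.
controls = [2.0, 4.0, 6.0]
patients = [1.0, 3.0, 5.0]
print(cohens_d(controls, patients))
```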

Title: Hi-ZFO: Hierarchical Zeroth- and First-Order LLM Fine-Tuning via Importance-Guided Tensor Selection

Authors: Feihu Jin, Ying Tan
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2601.05501
Pdf URL: https://arxiv.org/pdf/2601.05501
Copy Paste: [[2601.05501]] Hi-ZFO: Hierarchical Zeroth- and First-Order LLM Fine-Tuning via Importance-Guided Tensor Selection(https://arxiv.org/abs/2601.05501)
Keywords: generative
Abstract: Fine-tuning large language models (LLMs) using standard first-order (FO) optimization often drives training toward sharp, poorly generalizing minima. Conversely, zeroth-order (ZO) methods offer stronger exploratory behavior without relying on explicit gradients, yet suffer from slow convergence. More critically, our analysis reveals that in generative tasks, the vast output and search space significantly amplify estimation variance, rendering ZO methods both noisy and inefficient. To address these challenges, we propose Hi-ZFO (Hierarchical Zeroth- and First-Order optimization), a hybrid framework designed to synergize the precision of FO gradients with the exploratory capability of ZO estimation. Hi-ZFO adaptively partitions the model through layer-wise importance profiling, applying precise FO updates to critical layers while leveraging ZO optimization for less sensitive ones. Notably, ZO in Hi-ZFO is not merely a memory-saving surrogate; it is intentionally introduced as a source of "beneficial stochasticity" to help the model escape the local minima where pure FO optimization tends to stagnate. Validated across diverse generative, mathematical, and code reasoning tasks, Hi-ZFO consistently achieves superior performance while significantly reducing the training time. These results demonstrate the effectiveness of hierarchical hybrid optimization for LLM fine-tuning.
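
To make the zeroth-order idea in the abstract above concrete, here is a minimal sketch of a two-point random-perturbation gradient estimate, a common ZO estimator that needs only loss evaluations. It is not claimed to be the estimator used in Hi-ZFO. The function name, the Gaussian perturbation, and the toy quadratic loss are assumptions for illustration.

```python
import random


def zo_gradient(loss, params, eps=1e-3, rng=None):
    """Two-point zeroth-order gradient estimate using only loss evaluations.

    Perturbs all parameters along one random direction z and scales z by the
    central finite difference of the loss along that direction.
    """
    if rng is None:
        rng = random.Random(0)
    z = [rng.gauss(0.0, 1.0) for _ in params]
    plus = [p + eps * zi for p, zi in zip(params, z)]
    minus = [p - eps * zi for p, zi in zip(params, z)]
    scale = (loss(plus) - loss(minus)) / (2.0 * eps)
    return [scale * zi for zi in z]


def quadratic(theta):
    """Toy loss (hypothetical): sum of squares, whose true gradient is 2 * theta."""
    return sum(t * t for t in theta)


estimate = zo_gradient(quadratic, [1.0, -2.0], rng=random.Random(1))
print(estimate)
```

A single estimate is noisy because it only sees the loss along one random direction. This is the variance issue the abstract attributes to ZO methods in large search spaces.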

Title: GaussianSwap: Animatable Video Face Swapping with 3D Gaussian Splatting

Authors: Xuan Cheng, Jiahao Rao, Chengyang Li, Wenhao Wang, Weilin Chen, Lvqing Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.05511
Pdf URL: https://arxiv.org/pdf/2601.05511
Copy Paste: [[2601.05511]] GaussianSwap: Animatable Video Face Swapping with 3D Gaussian Splatting(https://arxiv.org/abs/2601.05511)
Keywords: generation
Abstract: We introduce GaussianSwap, a novel video face swapping framework that constructs a 3D Gaussian Splatting based face avatar from a target video while transferring identity from a source image to the avatar. Conventional video swapping frameworks are limited to generating facial representations in pixel-based formats. The resulting swapped faces exist merely as a set of unstructured pixels without any capacity for animation or interactive manipulation. Our work introduces a paradigm shift from conventional pixel-based video generation to the creation of high-fidelity avatars with swapped faces. The framework first preprocesses the target video to extract FLAME parameters, camera poses and segmentation masks, and then rigs 3D Gaussian splats to the FLAME model across frames, enabling dynamic facial control. To ensure identity preservation, we propose a compound identity embedding constructed from three state-of-the-art face recognition models for avatar finetuning. Finally, we render the face-swapped avatar on the background frames to obtain the face-swapped video. Experimental results demonstrate that GaussianSwap achieves superior identity preservation, visual clarity and temporal consistency, while enabling previously unattainable interactive applications.

Title: MoGen: A Unified Collaborative Framework for Controllable Multi-Object Image Generation

Authors: Yanfeng Li, Yue Sun, Keren Fu, Sio-Kei Im, Xiaoming Liu, Guangtao Zhai, Xiaohong Liu, Tao Tan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.05546
Pdf URL: https://arxiv.org/pdf/2601.05546
Copy Paste: [[2601.05546]] MoGen: A Unified Collaborative Framework for Controllable Multi-Object Image Generation(https://arxiv.org/abs/2601.05546)
Keywords: generation
Abstract: Existing multi-object image generation methods face difficulties in achieving precise alignment between localized image generation regions and their corresponding semantics based on language descriptions, frequently resulting in inconsistent object quantities and attribute aliasing. To mitigate this limitation, mainstream approaches typically rely on external control signals to explicitly constrain the spatial layout, local semantic and visual attributes of images. However, this strong dependency makes the input format rigid, rendering it incompatible with the heterogeneous resource conditions of users and diverse constraint requirements. To address these challenges, we propose MoGen, a user-friendly multi-object image generation method. First, we design a Regional Semantic Anchor (RSA) module that precisely anchors phrase units in language descriptions to their corresponding image regions during the generation process, enabling text-to-image generation that follows quantity specifications for multiple objects. Building upon this foundation, we further introduce an Adaptive Multi-modal Guidance (AMG) module, which adaptively parses and integrates various combinations of multi-source control signals to formulate corresponding structured intent. This intent subsequently guides selective constraints on scene layouts and object attributes, achieving dynamic fine-grained control. Experimental results demonstrate that MoGen significantly outperforms existing methods in generation quality, quantity consistency, and fine-grained control, while exhibiting superior accessibility and control flexibility. Code is available at: this https URL.

Title: Towards Generalized Multi-Image Editing for Unified Multimodal Models

Authors: Pengcheng Xu, Peng Tang, Donghao Luo, Xiaobin Hu, Weichu Cui, Qingdong He, Zhennan Chen, Jiangning Zhang, Charles Ling, Boyu Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.05572
Pdf URL: https://arxiv.org/pdf/2601.05572
Copy Paste: [[2601.05572]] Towards Generalized Multi-Image Editing for Unified Multimodal Models(https://arxiv.org/abs/2601.05572)
Keywords: generation
Abstract: Unified Multimodal Models (UMMs) integrate multimodal understanding and generation, yet they are limited in maintaining visual consistency and disambiguating visual cues when referencing details across multiple input images. In this work, we propose a scalable multi-image editing framework for UMMs that explicitly distinguishes image identities and generalizes to variable input counts. Algorithmically, we introduce two innovations: 1) The learnable latent separators explicitly differentiate each reference image in the latent space, enabling accurate and disentangled conditioning. 2) The sinusoidal index encoding assigns visual tokens from the same image a continuous sinusoidal index embedding, which provides explicit image identity while allowing generalization and extrapolation on a variable number of inputs. To facilitate training and evaluation, we establish a high-fidelity benchmark using an inverse dataset construction methodology to guarantee artifact-free, achievable outputs. Experiments show clear improvements in semantic consistency, visual fidelity, and cross-image integration over prior baselines on diverse multi-image editing tasks, validating our advantages on consistency and generalization ability.
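
One way to picture the sinusoidal index encoding mentioned in the abstract above is the standard Transformer-style sinusoidal embedding applied to an image index instead of a token position. The sketch below is an assumption about the general idea, not the paper's formulation. The function name, the `base` constant, and the embedding size are hypothetical.

```python
import math


def sinusoidal_index_embedding(image_index, dim, base=10000.0):
    """Sinusoidal embedding of an image index (hypothetical formulation).

    Every visual token from the same image would share this vector. Because
    the embedding is a closed-form function of the index, it is also defined
    for indices beyond those seen in training.
    """
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    embedding = []
    for i in range(dim // 2):
        freq = base ** (-2.0 * i / dim)
        embedding.append(math.sin(image_index * freq))
        embedding.append(math.cos(image_index * freq))
    return embedding


# Tokens of reference image 0 and reference image 3 get distinct identity vectors.
print(sinusoidal_index_embedding(0, dim=4))
print(sinusoidal_index_embedding(3, dim=4))
```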

Title: Orient Anything V2: Unifying Orientation and Rotation Understanding

Authors: Zehan Wang, Ziang Zhang, Jiayang Xu, Jialei Wang, Tianyu Pang, Chao Du, HengShuang Zhao, Zhou Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.05573
Pdf URL: https://arxiv.org/pdf/2601.05573
Copy Paste: [[2601.05573]] Orient Anything V2: Unifying Orientation and Rotation Understanding(https://arxiv.org/abs/2601.05573)
Keywords: generative
Abstract: This work presents Orient Anything V2, an enhanced foundation model for unified understanding of object 3D orientation and rotation from single or paired images. Building upon Orient Anything V1, which defines orientation via a single unique front face, V2 extends this capability to handle objects with diverse rotational symmetries and directly estimate relative rotations. These improvements are enabled by four key innovations: 1) Scalable 3D assets synthesized by generative models, ensuring broad category coverage and balanced data distribution; 2) An efficient, model-in-the-loop annotation system that robustly identifies 0 to N valid front faces for each object; 3) A symmetry-aware, periodic distribution fitting objective that captures all plausible front-facing orientations, effectively modeling object rotational symmetry; 4) A multi-frame architecture that directly predicts relative object rotations. Extensive experiments show that Orient Anything V2 achieves state-of-the-art zero-shot performance on orientation estimation, 6DoF pose estimation, and object symmetry recognition across 11 widely used benchmarks. The model demonstrates strong generalization, significantly broadening the applicability of orientation estimation in diverse downstream tasks.

Title: Generalizable and Adaptive Continual Learning Framework for AI-generated Image Detection

Authors: Hanyi Wang, Jun Lan, Yaoyu Kang, Huijia Zhu, Weiqiang Wang, Zhuosheng Zhang, Shilin Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.05580
Pdf URL: https://arxiv.org/pdf/2601.05580
Copy Paste: [[2601.05580]] Generalizable and Adaptive Continual Learning Framework for AI-generated Image Detection(https://arxiv.org/abs/2601.05580)
Keywords: generative
Abstract: The malicious misuse and widespread dissemination of AI-generated images pose a significant threat to the authenticity of online information. Current detection methods often struggle to generalize to unseen generative models, and the rapid evolution of generative techniques continuously exacerbates this challenge. Without adaptability, detection models risk becoming ineffective in real-world applications. To address this critical issue, we propose a novel three-stage domain continual learning framework designed for continuous adaptation to evolving generative models. In the first stage, we employ a strategic parameter-efficient fine-tuning approach to develop a transferable offline detection model with strong generalization capabilities. Building upon this foundation, the second stage integrates unseen data streams into a continual learning process. To efficiently learn from limited samples of novel generated models and mitigate overfitting, we design a data augmentation chain with progressively increasing complexity. Furthermore, we leverage the Kronecker-Factored Approximate Curvature (K-FAC) method to approximate the Hessian and alleviate catastrophic forgetting. Finally, the third stage utilizes a linear interpolation strategy based on Linear Mode Connectivity, effectively capturing commonalities across diverse generative models and further enhancing overall performance. We establish a comprehensive benchmark of 27 generative models, including GANs, deepfakes, and diffusion models, chronologically structured up to August 2024 to simulate real-world scenarios. Extensive experiments demonstrate that our initial offline detectors surpass the leading baseline by +5.51% in terms of mean average precision. Our continual learning strategy achieves an average accuracy of 92.20%, outperforming state-of-the-art methods.
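
The third stage described in the abstract above relies on linear interpolation between models. A minimal sketch of interpolating two sets of weights is shown below, assuming weights are stored as plain lists keyed by layer name. The layer names, the values, and the mixing coefficient `alpha` are hypothetical, and how the paper chooses which checkpoints and coefficient to use is not stated in the abstract.

```python
def interpolate_weights(weights_a, weights_b, alpha):
    """Return (1 - alpha) * weights_a + alpha * weights_b, layer by layer."""
    if weights_a.keys() != weights_b.keys():
        raise ValueError("both models must have the same layers")
    return {
        name: [
            (1.0 - alpha) * a + alpha * b
            for a, b in zip(weights_a[name], weights_b[name])
        ]
        for name in weights_a
    }


# Hypothetical detectors: one from the offline stage, one after continual learning.
offline = {"head": [0.0, 2.0]}
continual = {"head": [1.0, 4.0]}
print(interpolate_weights(offline, continual, alpha=0.25))
```

Linear Mode Connectivity is the premise that points along this straight line between the two solutions keep low loss, which is what makes such an average usable as a model.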

Title: Learn to Evolve: Self-supervised Neural JKO Operator for Wasserstein Gradient Flow

Authors: Xue Feng, Li Wang, Deanna Needell, Rongjie Lai
Subjects: cs.LG, math.NA
Abstract URL: https://arxiv.org/abs/2601.05583
Pdf URL: https://arxiv.org/pdf/2601.05583
Copy Paste: [[2601.05583]] Learn to Evolve: Self-supervised Neural JKO Operator for Wasserstein Gradient Flow(https://arxiv.org/abs/2601.05583)
Keywords: generation
Abstract: The Jordan-Kinderlehrer-Otto (JKO) scheme provides a stable variational framework for computing Wasserstein gradient flows, but its practical use is often limited by the high computational cost of repeatedly solving the JKO subproblems. We propose a self-supervised approach for learning a JKO solution operator without requiring numerical solutions of any JKO trajectories. The learned operator maps an input density directly to the minimizer of the corresponding JKO subproblem, and can be iteratively applied to efficiently generate the gradient-flow evolution. A key challenge is that only a limited number of initial densities are typically available for training. To address this, we introduce a Learn-to-Evolve algorithm that jointly learns the JKO operator and its induced trajectories by alternating between trajectory generation and operator updates. As training progresses, the generated data increasingly approximates true JKO trajectories. Meanwhile, this Learn-to-Evolve strategy serves as a natural form of data augmentation, significantly enhancing the generalization ability of the learned operator. Numerical experiments demonstrate the accuracy, stability, and robustness of the proposed method across various choices of energies and initial conditions.
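
For reference, the JKO subproblem that the abstract above refers to is usually written as the following minimization. The notation is an assumption, not the paper's: energy functional $E$, step size $\tau$, 2-Wasserstein distance $W_2$, and a learned operator $\mathcal{T}_\theta$ standing in for the solution map.

```latex
\rho^{k+1} \in \operatorname*{arg\,min}_{\rho \in \mathcal{P}_2(\Omega)}
\left\{ E(\rho) + \frac{1}{2\tau}\, W_2^2\!\left(\rho, \rho^{k}\right) \right\},
\qquad
\rho^{k+1} \approx \mathcal{T}_\theta\!\left(\rho^{k}\right)
```

Applying $\mathcal{T}_\theta$ repeatedly to an initial density $\rho^{0}$ would then produce an approximate gradient-flow trajectory without solving each minimization numerically.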

Title: AGDC: Autoregressive Generation of Variable-Length Sequences with Joint Discrete and Continuous Spaces

Authors: Yeonsang Shin, Insoo Kim, Bongkeun Kim, Keonwoo Bae, Bohyung Han
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2601.05680
Pdf URL: https://arxiv.org/pdf/2601.05680
Copy Paste: [[2601.05680]] AGDC: Autoregressive Generation of Variable-Length Sequences with Joint Discrete and Continuous Spaces(https://arxiv.org/abs/2601.05680)
Keywords: generation
Abstract: Transformer-based autoregressive models excel in data generation but are inherently constrained by their reliance on discretized tokens, which limits their ability to represent continuous values with high precision. We analyze the scalability limitations of existing discretization-based approaches for generating hybrid discrete-continuous sequences, particularly in high-precision domains such as semiconductor circuit designs, where precision loss can lead to functional failure. To address the challenge, we propose AGDC, a novel unified framework that jointly models discrete and continuous values for variable-length sequences. AGDC employs a hybrid approach that combines categorical prediction for discrete values with diffusion-based modeling for continuous values, incorporating two key technical components: an end-of-sequence (EOS) logit adjustment mechanism that uses an MLP to dynamically adjust EOS token logits based on sequence context, and a length regularization term integrated into the loss function. Additionally, we present ContLayNet, a large-scale benchmark comprising 334K high-precision semiconductor layout samples with specialized evaluation metrics that capture functional correctness where precision errors significantly impact performance. Experiments on semiconductor layouts (ContLayNet), graphic layouts, and SVGs demonstrate AGDC's superior performance in generating high-fidelity hybrid vector representations compared to discretization-based and fixed-schema baselines, achieving scalable high-precision generation across diverse domains.

Title: Rotate Your Character: Revisiting Video Diffusion Models for High-Quality 3D Character Generation

Authors: Jin Wang, Jianxiang Lu, Comi Chen, Guangzheng Xu, Haoyu Yang, Peng Chen, Na Zhang, Yifan Xu, Longhuang Wu, Shuai Shao, Qinglin Lu, Ping Luo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.05722
Pdf URL: https://arxiv.org/pdf/2601.05722
Copy Paste: [[2601.05722]] Rotate Your Character: Revisiting Video Diffusion Models for High-Quality 3D Character Generation(https://arxiv.org/abs/2601.05722)
Keywords: generation
Abstract: Generating high-quality 3D characters from single images remains a significant challenge in digital content creation, particularly due to complex body poses and self-occlusion. In this paper, we present RCM (Rotate your Character Model), an advanced image-to-video diffusion framework tailored for high-quality novel view synthesis (NVS) and 3D character generation. Compared to existing diffusion-based approaches, RCM offers several key advantages: (1) transferring characters with any complex poses into a canonical pose, enabling consistent novel view synthesis across the entire viewing orbit, (2) high-resolution orbital video generation at 1024x1024 resolution, (3) controllable observation positions given different initial camera poses, and (4) multi-view conditioning supporting up to 4 input images, accommodating diverse user scenarios. Extensive experiments demonstrate that RCM outperforms state-of-the-art methods in both novel view synthesis and 3D generation quality.
摘要：从单个图像生成高质量的 3D 角色仍然是数字内容创建中的一项重大挑战，特别是由于复杂的身体姿势和自遮挡。在本文中，我们介绍了 RCM（旋转角色模型），这是一种高级图像到视频扩散框架，专为高质量小说视图合成 (NVS) 和 3D 角色生成而定制。与现有的基于扩散的方法相比，RCM 具有几个关键优势：(1) 将具有任何复杂姿势的角色转换为规范姿势，从而在整个观看轨道上实现一致的新颖视图合成；(2) 生成 1024x1024 分辨率的高分辨率轨道视频；(3) 给定不同的初始相机姿势，可控制观察位置；(4) 多视图调节支持最多 4 个输入图像，适应不同的用户场景。大量实验表明，RCM 在新颖的视图合成和 3D 生成质量方面均优于最先进的方法。

Title: TAGRPO: Boosting GRPO on Image-to-Video Generation with Direct Trajectory Alignment

Authors: Jin Wang, Jianxiang Lu, Guangzheng Xu, Comi Chen, Haoyu Yang, Linqing Wang, Peng Chen, Mingtao Chen, Zhichao Hu, Longhuang Wu, Shuai Shao, Qinglin Lu, Ping Luo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.05729
Pdf URL: https://arxiv.org/pdf/2601.05729
Copy Paste: [[2601.05729]] TAGRPO: Boosting GRPO on Image-to-Video Generation with Direct Trajectory Alignment(https://arxiv.org/abs/2601.05729)
Keywords: generation
Abstract: Recent studies have demonstrated the efficacy of integrating Group Relative Policy Optimization (GRPO) into flow matching models, particularly for text-to-image and text-to-video generation. However, we find that directly applying these techniques to image-to-video (I2V) models often fails to yield consistent reward improvements. To address this limitation, we present TAGRPO, a robust post-training framework for I2V models inspired by contrastive learning. Our approach is grounded in the observation that rollout videos generated from identical initial noise provide superior guidance for optimization. Leveraging this insight, we propose a novel GRPO loss applied to intermediate latents, encouraging direct alignment with high-reward trajectories while maximizing distance from low-reward counterparts. Furthermore, we introduce a memory bank for rollout videos to enhance diversity and reduce computational overhead. Despite its simplicity, TAGRPO achieves significant improvements over DanceGRPO in I2V generation.
摘要：最近的研究证明了将组相对策略优化（GRPO）集成到流匹配模型中的有效性，特别是对于文本到图像和文本到视频的生成。然而，我们发现直接将这些技术应用于图像到视频（I2V）模型通常无法产生一致的奖励改进。为了解决这个限制，我们提出了 TAGRPO，这是一个受对比学习启发的强大的 I2V 模型训练后框架。我们的方法基于这样的观察：从相同的初始噪声生成的推出视频为优化提供了卓越的指导。利用这一见解，我们提出了一种应用于中间潜伏的新型 GRPO 损失，鼓励与高奖励轨迹直接对齐，同时最大化与低奖励对应轨迹的距离。此外，我们引入了用于展示视频的内存库，以增强多样性并减少计算开销。尽管很简单，但 TAGRPO 在 I2V 生成方面比 DanceGRPO 实现了显着改进。
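For readers unfamiliar with GRPO, the group-relative advantage that such methods build on can be sketched as follows. This shows only the standard reward normalization within a group of rollouts; it does not reproduce TAGRPO's loss on intermediate latents or its memory bank, and the function name is an assumption made for illustration.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of rollouts.

    Each rollout's advantage is its reward minus the group mean,
    divided by the group standard deviation (eps avoids division by zero).
    """
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)
```

Rollouts with above-average reward get positive advantages and are reinforced, while below-average rollouts get negative advantages; a group with identical rewards yields all-zero advantages and no learning signal.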

Title: ViTNT-FIQA: Training-Free Face Image Quality Assessment with Vision Transformers

Authors: Guray Ozgur, Eduarda Caldeira, Tahar Chettaoui, Jan Niklas Kolf, Marco Huber, Naser Damer, Fadi Boutros
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2601.05741
Pdf URL: https://arxiv.org/pdf/2601.05741
Copy Paste: [[2601.05741]] ViTNT-FIQA: Training-Free Face Image Quality Assessment with Vision Transformers(https://arxiv.org/abs/2601.05741)
Keywords: quality assessment
Abstract: Face Image Quality Assessment (FIQA) is essential for reliable face recognition systems. Current approaches primarily exploit only final-layer representations, while training-free methods require multiple forward passes or backpropagation. We propose ViTNT-FIQA, a training-free approach that measures the stability of patch embedding evolution across intermediate Vision Transformer (ViT) blocks. We demonstrate that high-quality face images exhibit stable feature refinement trajectories across blocks, while degraded images show erratic transformations. Our method computes Euclidean distances between L2-normalized patch embeddings from consecutive transformer blocks and aggregates them into image-level quality scores. We empirically validate this correlation on a quality-labeled synthetic dataset with controlled degradation levels. Unlike existing training-free approaches, ViTNT-FIQA requires only a single forward pass without backpropagation or architectural modifications. Through extensive evaluation on eight benchmarks (LFW, AgeDB-30, CFP-FP, CALFW, Adience, CPLFW, XQLFW, IJB-C), we show that ViTNT-FIQA achieves competitive performance with state-of-the-art methods while maintaining computational efficiency and immediate applicability to any pre-trained ViT-based face recognition model.
摘要：人脸图像质量评估 (FIQA) 对于可靠的人脸识别系统至关重要。当前的方法主要仅利用最后层表示，而免训练方法需要多次前向传播或反向传播。我们提出了 ViTNT-FIQA，这是一种无需训练的方法，可测量跨中间 Vision Transformer (ViT) 块的补丁嵌入演化的稳定性。我们证明，高质量的人脸图像在各个块上表现出稳定的特征细化轨迹，而降级的图像则表现出不稳定的变换。我们的方法计算连续 Transformer 块的 L2 归一化补丁嵌入之间的欧几里德距离，并将它们聚合成图像级质量分数。我们在具有受控降解水平的质量标记合成数据集上凭经验验证了这种相关性。与现有的免训练方法不同，ViTNT-FIQA 仅需要一次前向传递，无需反向传播或架构修改。通过对八个基准（LFW、AgeDB-30、CFP-FP、CALFW、Adience、CPLFW、XQLFW、IJB-C）的广泛评估，我们表明 ViTNT-FIQA 通过最先进的方法实现了具有竞争力的性能，同时保持了计算效率和对任何预先训练的基于 ViT 的人脸识别模型的立即适用性。
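The scoring idea in the ViTNT-FIQA abstract above (Euclidean distances between L2-normalized patch embeddings from consecutive blocks, aggregated into one image-level score) can be sketched in NumPy. The aggregation used here (mean over patches and blocks, then negation so that higher means better quality) is an assumption, as is the name `patch_stability_score`; the abstract does not state the exact aggregation.

```python
import numpy as np

def patch_stability_score(block_embeddings, eps=1e-12):
    """block_embeddings: list of (num_patches, dim) arrays, one per ViT block."""
    distances = []
    for prev, curr in zip(block_embeddings[:-1], block_embeddings[1:]):
        # L2-normalize every patch embedding before comparing blocks.
        prev_n = prev / (np.linalg.norm(prev, axis=1, keepdims=True) + eps)
        curr_n = curr / (np.linalg.norm(curr, axis=1, keepdims=True) + eps)
        # Mean Euclidean distance between matching patches of consecutive blocks.
        distances.append(np.linalg.norm(curr_n - prev_n, axis=1).mean())
    # Assumed convention: smaller change between blocks = higher quality score.
    return -float(np.mean(distances))
```

In practice the per-block embeddings would come from a single forward pass of a pre-trained ViT with intermediate outputs collected, which is consistent with the single-pass, no-backpropagation property the abstract emphasizes.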

Title: Adaptive Disentangled Representation Learning for Incomplete Multi-View Multi-Label Classification

Authors: Quanjiang Li, Zhiming Liu, Tianxiang Xu, Tingjin Luo, Chenping Hou
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2601.05785
Pdf URL: https://arxiv.org/pdf/2601.05785
Copy Paste: [[2601.05785]] Adaptive Disentangled Representation Learning for Incomplete Multi-View Multi-Label Classification(https://arxiv.org/abs/2601.05785)
Keywords: generation
Abstract: Multi-view multi-label learning frequently suffers from simultaneous feature absence and incomplete annotations, due to challenges in data acquisition and cost-intensive supervision. To tackle this complex yet highly practical problem while overcoming the existing limitations of feature recovery, representation disentanglement, and label semantics modeling, we propose an Adaptive Disentangled Representation Learning method (ADRL). ADRL achieves robust view completion by propagating feature-level affinity across modalities with neighborhood awareness, and reinforces reconstruction effectiveness by leveraging a stochastic masking strategy. By disseminating category-level association across label distributions, ADRL refines distribution parameters for capturing interdependent label prototypes. In addition, we formulate a mutual-information-based objective to promote consistency among shared representations and suppress information overlap between view-specific representations and other modalities. Theoretically, we derive tractable bounds to train the dual-channel network. Moreover, ADRL performs prototype-specific feature selection by enabling independent interactions between label embeddings and view representations, accompanied by the generation of pseudo-labels for each category. The structural characteristics of the pseudo-label space are then exploited to guide a discriminative trade-off during view fusion. Finally, extensive experiments on public datasets and real-world applications demonstrate the superior performance of ADRL.
摘要：由于数据获取和成本密集型监督方面的挑战，多视图多标签学习经常遭受同时特征缺失和不完整注释的困扰。为了解决复杂但高度实用的问题，同时克服特征恢复、表示解缠和标签语义建模的现有局限性，我们提出了一种自适应解缠表示学习方法（ADRL）。 ADRL 通过利用邻域感知跨模态传播特征级亲和力来实现强大的视图补全，并通过利用随机掩蔽策略来增强重建有效性。通过跨标签分布传播类别级关联，ADRL 细化分布参数以捕获相互依赖的标签原型。此外，我们制定了基于互信息的目标，以促进共享表示之间的一致性并抑制特定视图表示与其他模式之间的信息重叠。理论上，我们推导出训练双通道网络的易处理边界。此外，ADRL 通过启用标签嵌入和视图表示之间的独立交互来执行特定于原型的特征选择，并为每个类别生成伪标签。然后利用伪标签空间的结构特征来指导视图融合期间的判别性权衡。最后，对公共数据集和实际应用的广泛实验证明了 ADRL 的卓越性能。

Title: SceneFoundry: Generating Interactive Infinite 3D Worlds

Authors: ChunTeng Chen, YiChen Hsu, YiWen Liu, WeiFang Sun, TsaiChing Ni, ChunYi Lee, Min Sun, YuanFu Yang
Subjects: cs.CV, cs.AI, cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2601.05810
Pdf URL: https://arxiv.org/pdf/2601.05810
Copy Paste: [[2601.05810]] SceneFoundry: Generating Interactive Infinite 3D Worlds(https://arxiv.org/abs/2601.05810)
Keywords: generation, generative
Abstract: The ability to automatically generate large-scale, interactive, and physically realistic 3D environments is crucial for advancing robotic learning and embodied intelligence. However, existing generative approaches often fail to capture the functional complexity of real-world interiors, particularly those containing articulated objects with movable parts essential for manipulation and navigation. This paper presents SceneFoundry, a language-guided diffusion framework that generates apartment-scale 3D worlds with functionally articulated furniture and semantically diverse layouts for robotic training. From natural language prompts, an LLM module controls floor layout generation, while diffusion-based posterior sampling efficiently populates the scene with articulated assets from large-scale 3D repositories. To ensure physical usability, SceneFoundry employs differentiable guidance functions to regulate object quantity, prevent articulation collisions, and maintain sufficient walkable space for robotic navigation. Extensive experiments demonstrate that our framework generates structurally valid, semantically coherent, and functionally interactive environments across diverse scene types and conditions, enabling scalable embodied AI research.
摘要：自动生成大规模、交互式且物理逼真的 3D 环境的能力对于推进机器人学习和体现智能至关重要。然而，现有的生成方法通常无法捕捉现实世界内部的功能复杂性，特别是那些包含铰接式物体以及对于操纵和导航至关重要的可移动部件的物体。本文介绍了 SceneFoundry，这是一种语言引导的扩散框架，可生成公寓规模的 3D 世界，其中包含功能铰接的家具和用于机器人训练的语义多样化布局。根据自然语言提示，LLM 模块控制楼层布局生成，而基于扩散的后验采样则使用来自大型 3D 存储库的铰接资产有效地填充场景。为了确保物理可用性，SceneFoundry 采用可微分引导函数来调节对象数量，防止关节碰撞，并为机器人导航保持足够的步行空间。大量实验表明，我们的框架可以在不同的场景类型和条件下生成结构有效、语义一致且功能交互的环境，从而实现可扩展的具体人工智能研究。

Title: Boosting Latent Diffusion Models via Disentangled Representation Alignment

Authors: John Page, Xuesong Niu, Kai Wu, Kun Gai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.05823
Pdf URL: https://arxiv.org/pdf/2601.05823
Copy Paste: [[2601.05823]] Boosting Latent Diffusion Models via Disentangled Representation Alignment(https://arxiv.org/abs/2601.05823)
Keywords: generation
Abstract: Latent Diffusion Models (LDMs) generate high-quality images by operating in a compressed latent space, typically obtained through image tokenizers such as Variational Autoencoders (VAEs). In pursuit of a generation-friendly VAE, recent studies have explored leveraging Vision Foundation Models (VFMs) as representation alignment targets for VAEs, mirroring the approach commonly adopted for LDMs. Although this yields certain performance gains, using the same alignment target for both VAEs and LDMs overlooks their fundamentally different representational requirements. We argue that while LDMs benefit from latents retaining high-level semantic concepts, VAEs should excel in semantic disentanglement, enabling encoding of attribute-level information in a structured way. To address this, we propose the Semantic disentangled VAE (Send-VAE), explicitly optimized for disentangled representation learning through aligning its latent space with the semantic hierarchy of pre-trained VFMs. Our approach employs a non-linear mapper network to transform VAE latents, aligning them with VFMs to bridge the gap between attribute-level disentanglement and high-level semantics, facilitating effective guidance for VAE learning. We evaluate semantic disentanglement via linear probing on attribute prediction tasks, showing strong correlation with improved generation performance. Finally, using Send-VAE, we train flow-based transformers (SiTs); experiments show that Send-VAE significantly speeds up training and achieves state-of-the-art FIDs of 1.21 and 1.75 on ImageNet 256x256 with and without classifier-free guidance, respectively.
摘要：潜在扩散模型 (LDM) 通过在压缩潜在空间中运行来生成高质量图像，通常通过变分自动编码器 (VAE) 等图像标记器获得。为了追求一代友好的 VAE，最近的研究探索了利用视觉基础模型 (VFM) 作为 VAE 的表示对齐目标，反映了 LDM 常用的方法。尽管这会带来一定的性能提升，但对 VAE 和 LDM 使用相同的对齐目标会忽略它们根本不同的表示要求。我们主张，虽然 LDM 受益于保留高级语义概念的潜在变量，但 VAE 应该在语义解缠方面表现出色，从而能够以结构化方式对属性级信息进行编码。为了解决这个问题，我们提出了语义解缠结 VAE（Send-VAE），通过将其潜在空间与预训练 VFM 的语义层次结构对齐来显式优化解缠结表示学习。我们的方法采用非线性映射器网络来转换 VAE 潜伏，将其与 VFM 对齐，以弥合属性级解缠和高级语义之间的差距，从而促进对 VAE 学习的有效指导。我们通过属性预测任务的线性探测来评估语义解缠，显示出与改进的生成性能的强相关性。最后，使用 Send-VAE，我们训练基于流的变压器 SiT；实验表明，Send-VAE 显着加快了训练速度，并在 ImageNet 256x256 上使用和不使用无分类器指导的情况下实现了最先进的 FID 1.21 和 1.75。

Title: Goal Force: Teaching Video Models To Accomplish Physics-Conditioned Goals

Authors: Nate Gillman, Yinghua Zhou, Zitian Tang, Evan Luo, Arjan Chakravarthy, Daksh Aggarwal, Michael Freeman, Charles Herrmann, Chen Sun
Subjects: cs.CV, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2601.05848
Pdf URL: https://arxiv.org/pdf/2601.05848
Copy Paste: [[2601.05848]] Goal Force: Teaching Video Models To Accomplish Physics-Conditioned Goals(https://arxiv.org/abs/2601.05848)
Keywords: generation
Abstract: Recent advancements in video generation have enabled the development of "world models" capable of simulating potential futures for robotics and planning. However, specifying precise goals for these models remains a challenge; text instructions are often too abstract to capture physical nuances, while target images are frequently infeasible to specify for dynamic tasks. To address this, we introduce Goal Force, a novel framework that allows users to define goals via explicit force vectors and intermediate dynamics, mirroring how humans conceptualize physical tasks. We train a video generation model on a curated dataset of synthetic causal primitives, such as elastic collisions and falling dominoes, teaching it to propagate forces through time and space. Despite being trained on simple physics data, our model exhibits remarkable zero-shot generalization to complex, real-world scenarios, including tool manipulation and multi-object causal chains. Our results suggest that by grounding video generation in fundamental physical interactions, models can emerge as implicit neural physics simulators, enabling precise, physics-aware planning without reliance on external engines. We release all datasets, code, model weights, and interactive video demos at our project page.
摘要：视频生成领域的最新进展使得能够模拟机器人和规划的潜在未来的“世界模型”的开发成为可能。然而，为这些模型指定精确的目标仍然是一个挑战；文本指令通常过于抽象，无法捕捉物理细微差别，而目标图像通常无法为动态任务指定。为了解决这个问题，我们引入了 Goal Force，这是一种新颖的框架，允许用户通过明确的力向量和中间动力学来定义目标，反映了人类如何概念化物理任务。我们在合成因果原语（例如弹性碰撞和倒下的多米诺骨牌）的精选数据集上训练视频生成模型，教导它通过时间和空间传播力。尽管接受了简单物理数据的训练，但我们的模型对复杂的现实世界场景（包括工具操作和多对象因果链）表现出出色的零样本泛化能力。我们的结果表明，通过将视频生成基于基本的物理交互，模型可以作为隐式神经物理模拟器出现，从而无需依赖外部引擎即可实现精确的物理感知规划。我们在项目页面发布了所有数据集、代码、模型权重和交互式视频演示。

Title: Kidney Cancer Detection Using 3D-Based Latent Diffusion Models

Authors: Jen Dusseljee, Sarah de Boer, Alessa Hering
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.05852
Pdf URL: https://arxiv.org/pdf/2601.05852
Copy Paste: [[2601.05852]] Kidney Cancer Detection Using 3D-Based Latent Diffusion Models(https://arxiv.org/abs/2601.05852)
Keywords: generative
Abstract: In this work, we present a novel latent diffusion-based pipeline for 3D kidney anomaly detection on contrast-enhanced abdominal CT. The method combines Denoising Diffusion Probabilistic Models (DDPMs), Denoising Diffusion Implicit Models (DDIMs), and Vector-Quantized Generative Adversarial Networks (VQ-GANs). Unlike prior slice-wise approaches, our method operates directly on an image volume and leverages weak supervision with only case-level pseudo-labels. We benchmark our approach against state-of-the-art supervised segmentation and detection models. This study demonstrates the feasibility and promise of 3D latent diffusion for weakly supervised anomaly detection. While the current results do not yet match supervised baselines, they reveal key directions for improving reconstruction fidelity and lesion localization. Our findings provide an important step toward annotation-efficient, generative modeling of complex abdominal anatomy.
摘要：在这项工作中，我们提出了一种新型的基于潜在扩散的管道，用于在增强腹部 CT 上进行 3D 肾脏异常检测。该方法结合了去噪扩散概率模型 (DDPM)、去噪扩散隐式模型 (DDIM) 和矢量量化生成对抗网络 (VQ-GAN)。与之前的切片方法不同，我们的方法直接在图像体积上运行，并利用仅案例级伪标签的弱监督。我们将我们的方法与最先进的监督分割和检测模型进行基准测试。这项研究证明了 3D 潜在扩散用于弱监督异常检测的可行性和前景。虽然目前的结果尚未与监督基线相匹配，但它们揭示了提高重建保真度和病变定位的关键方向。我们的研究结果为复杂腹部解剖结构的高效注释、生成建模迈出了重要一步。

Title: Phase4DFD: Multi-Domain Phase-Aware Attention for Deepfake Detection

Authors: Zhen-Xin Lin, Shang-Kuan Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.05861
Pdf URL: https://arxiv.org/pdf/2601.05861
Copy Paste: [[2601.05861]] Phase4DFD: Multi-Domain Phase-Aware Attention for Deepfake Detection(https://arxiv.org/abs/2601.05861)
Keywords: generation
Abstract: Recent deepfake detection methods have increasingly explored frequency-domain representations to reveal manipulation artifacts that are difficult to detect in the spatial domain. However, most existing approaches rely primarily on spectral magnitude, leaving the role of phase information underexplored. In this work, we propose Phase4DFD, a phase-aware frequency-domain deepfake detection framework that explicitly models phase-magnitude interactions via a learnable attention mechanism. Our approach augments standard RGB input with Fast Fourier Transform (FFT) magnitude and local binary pattern (LBP) representations to expose subtle synthesis artifacts that remain indistinguishable under spatial analysis alone. Crucially, we introduce an input-level phase-aware attention module that uses phase discontinuities commonly introduced by synthetic generation to guide the model toward frequency patterns that are most indicative of manipulation before backbone feature extraction. The attended multi-domain representation is processed by an efficient BNext M backbone, with optional channel-spatial attention applied for semantic feature refinement. Extensive experiments on the CIFAKE and DFFD datasets demonstrate that our proposed model Phase4DFD outperforms state-of-the-art spatial and frequency-based detectors while maintaining low computational overhead. Comprehensive ablation studies further confirm that explicit phase modeling provides complementary and non-redundant information beyond magnitude-only frequency representations.
摘要：最近的深度伪造检测方法越来越多地探索频域表示，以揭示在空间域中难以检测的操纵伪影。然而，大多数现有方法主要依赖于频谱幅度，隐含地探索了相位信息的作用。在这项工作中，我们提出了 Phase4DFD，一种相位感知频域深度伪造检测框架，它通过可学习的注意力机制显式地模拟相位幅度交互。我们的方法通过快速傅立叶变换 (FFT) 幅度和局部二进制模式 (LBP) 表示增强了标准 RGB 输入，以暴露仅在空间分析下仍然无法区分的细微合成伪影。至关重要的是，我们引入了一个输入级相位感知注意模块，该模块使用合成生成通常引入的相位不连续性来引导模型走向最能指示主干特征提取之前的操作的频率模式。参与的多域表示由高效的 BNext M 主干处理，可选通道空间关注应用于语义特征细化。对 CIFAKE 和 DFFD 数据集的大量实验表明，我们提出的模型 Phase4DFD 优于最先进的基于空间和频率的检测器，同时保持较低的计算开销。全面的消融研究进一步证实，显式相位建模提供了超出仅幅度频率表示的补充和非冗余信息。
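The frequency-domain inputs mentioned in the Phase4DFD abstract above can be illustrated with a short NumPy sketch that extracts FFT magnitude and phase from an image and stacks the magnitude onto the RGB channels. This is a sketch under assumptions: the grayscale conversion by channel mean, the log scaling, the normalization, and the function names are hypothetical choices, and the LBP channel and the learnable phase-aware attention module are omitted.

```python
import numpy as np

def fft_magnitude_phase(gray):
    """Return log-magnitude and phase of the centered 2-D spectrum."""
    spectrum = np.fft.fftshift(np.fft.fft2(gray))
    magnitude = np.log1p(np.abs(spectrum))  # log scale compresses dynamic range
    phase = np.angle(spectrum)              # values in [-pi, pi]
    return magnitude, phase

def augment_with_fft_magnitude(rgb):
    """rgb: (H, W, 3) float array. Returns an (H, W, 4) stack and the phase map."""
    gray = rgb.mean(axis=2)                 # assumed grayscale conversion
    magnitude, phase = fft_magnitude_phase(gray)
    magnitude = magnitude / (magnitude.max() + 1e-8)  # scale to at most 1
    stacked = np.concatenate([rgb, magnitude[..., None]], axis=2)
    return stacked, phase
```

The phase map is returned separately because, per the abstract, phase is used to guide attention over the frequency patterns rather than being treated as just another input channel.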

Title: Context-Aware Decoding for Faithful Vision-Language Generation

Authors: Mehrdad Fazli, Bowen Wei, Ziwei Zhu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.05939
Pdf URL: https://arxiv.org/pdf/2601.05939
Copy Paste: [[2601.05939]] Context-Aware Decoding for Faithful Vision-Language Generation(https://arxiv.org/abs/2601.05939)
Keywords: generation
Abstract: Hallucinations, i.e., generated responses inconsistent with the visual input, remain a critical limitation of large vision-language models (LVLMs), especially in open-ended tasks such as image captioning and visual reasoning. In this work, we probe the layer-wise generation dynamics that drive hallucinations and propose a training-free mitigation strategy. Employing the Logit Lens, we examine how LVLMs construct next-token distributions across decoder layers, uncovering a pronounced commitment-depth gap: truthful tokens accumulate probability mass on their final candidates earlier than hallucinatory ones. Drawing on this discovery, we introduce Context Embedding Injection (CEI), a lightweight method that harnesses the hidden state of the last input token (the context embedding) as a grounding signal to maintain visual fidelity throughout decoding and curb hallucinations. Evaluated on the CHAIR, AMBER, and MMHal-Bench benchmarks (with a maximum token length of 512), CEI outperforms state-of-the-art baselines across three LVLMs, with its dynamic variant yielding the lowest overall hallucination rates. By integrating novel mechanistic insights with a scalable intervention, this work advances the mitigation of hallucinations in LVLMs.
摘要：产生与视觉输入不一致的反应的幻觉仍然是大型视觉语言模型（LVLM）的一个关键限制，特别是在图像字幕和视觉推理等开放式任务中。在这项工作中，我们探讨了驱动幻觉的分层生成动态，并提出了一种免训练的缓解策略。使用 Logit Lens，我们研究了 LVLM 如何跨解码器层构建下一个令牌分布，发现了明显的承诺深度差距：真实的令牌比幻觉的令牌更早在其最终候选者上积累概率质量。利用这一发现，我们引入了上下文嵌入注入（CEI），这是一种轻量级方法，利用最后一个输入标记的隐藏状态（上下文嵌入）作为基础信号，以在整个解码过程中保持视觉保真度并抑制幻觉。在 CHAIR、AMBER 和 MMHal-Bench 基准（最大令牌长度为 512）上进行评估后，CEI 在三个 LVLM 中的表现优于最先进的基准，其动态变体产生最低的总体幻觉率。通过将新颖的机制见解与可扩展的干预相结合，这项工作促进了 LVLM 中幻觉的缓解。
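The Logit Lens analysis mentioned in the abstract above can be sketched on toy data: each layer's hidden state is projected through the unembedding matrix to obtain a per-layer next-token distribution, and one can then ask at which layer the finally chosen token first becomes likely. The definition of commitment depth used here (first layer whose probability for the final token reaches a threshold) is an assumption for illustration, as are the function names; the final layer normalization that real implementations usually apply before unembedding is omitted.

```python
import numpy as np

def softmax(x):
    z = x - x.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def commitment_depth(hidden_states, unembed, threshold=0.5):
    """hidden_states: (layers, d); unembed: (d, vocab).

    Returns (final_token, depth), where depth is the first layer at which
    the final layer's top token reaches `threshold` probability, or the
    last layer index if it never does.
    """
    probs = softmax(hidden_states @ unembed)   # (layers, vocab)
    final_token = int(probs[-1].argmax())
    above = np.nonzero(probs[:, final_token] >= threshold)[0]
    depth = int(above[0]) if above.size else len(hidden_states) - 1
    return final_token, depth
```

Under this toy definition, the gap reported in the abstract would correspond to truthful tokens having a smaller depth than hallucinated ones.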

Title: WaveRNet: Wavelet-Guided Frequency Learning for Multi-Source Domain-Generalized Retinal Vessel Segmentation

Authors: Chanchan Wang, Yuanfang Wang, Qing Xu, Guanxin Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.05942
Pdf URL: https://arxiv.org/pdf/2601.05942
Copy Paste: [[2601.05942]] WaveRNet: Wavelet-Guided Frequency Learning for Multi-Source Domain-Generalized Retinal Vessel Segmentation(https://arxiv.org/abs/2601.05942)
Keywords: generation
Abstract: Domain-generalized retinal vessel segmentation is critical for automated ophthalmic diagnosis, yet faces significant challenges from domain shift induced by non-uniform illumination and varying contrast, compounded by the difficulty of preserving fine vessel structures. While the Segment Anything Model (SAM) exhibits remarkable zero-shot capabilities, existing SAM-based methods rely on simple adapter fine-tuning while overlooking frequency-domain information that encodes domain-invariant features, resulting in degraded generalization under illumination and contrast variations. Furthermore, SAM's direct upsampling inevitably loses fine vessel details. To address these limitations, we propose WaveRNet, a wavelet-guided frequency learning framework for robust multi-source domain-generalized retinal vessel segmentation. Specifically, we devise a Spectral-guided Domain Modulator (SDM) that integrates wavelet decomposition with learnable domain tokens, enabling the separation of illumination-robust low-frequency structures from high-frequency vessel boundaries while facilitating domain-specific feature generation. Furthermore, we introduce a Frequency-Adaptive Domain Fusion (FADF) module that performs intelligent test-time domain selection through wavelet-based frequency similarity and soft-weighted fusion. Finally, we present a Hierarchical Mask-Prompt Refiner (HMPR) that overcomes SAM's upsampling limitation through coarse-to-fine refinement with long-range dependency modeling. Extensive experiments under the Leave-One-Domain-Out protocol on four public retinal datasets demonstrate that WaveRNet achieves state-of-the-art generalization performance. The source code is available at this https URL.
摘要：域广义视网膜血管分割对于自动眼科诊断至关重要，但面临着由不均匀照明和变化对比度引起的域移动的重大挑战，加上保存精细血管结构的困难，使情况更加复杂。虽然分段任意模型 (SAM) 表现出卓越的零样本能力，但现有的基于 SAM 的方法依赖于简单的适配器微调，同时忽略了编码域不变特征的频域信息，导致光照和对比度变化下泛化能力下降。此外，SAM 的直接上采样不可避免地会丢失精细的血管细节。为了解决这些限制，我们提出了 WaveRNet，一种小波引导的频率学习框架，用于鲁棒的多源域广义视网膜血管分割。具体来说，我们设计了一种频谱引导域调制器（SDM），它将小波分解与可学习域标记集成在一起，从而能够将照明鲁棒的低频结构与高频血管边界分离，同时促进特定域特征的生成。此外，我们引入了频率自适应域融合（FADF）模块，该模块通过基于小波的频率相似性和软加权融合来执行智能测试时间域选择。最后，我们提出了一种分层掩模提示细化器（HMPR），它通过远程依赖建模从粗到细的细化克服了 SAM 的上采样限制。在 Leave-One-Domain-Out 协议下对四个公共视网膜数据集进行的大量实验表明，WaveRNet 实现了最先进的泛化性能。源代码可从此 https URL 获取。
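The separation of low-frequency structure from high-frequency boundaries that the WaveRNet abstract above attributes to wavelet decomposition can be illustrated with a single-level 2-D Haar transform. The choice of the Haar wavelet and the function name `haar_dwt2` are assumptions made for a compact example; the abstract does not say which wavelet the paper uses, and the learnable domain tokens are not modeled here.

```python
import numpy as np

def haar_dwt2(img):
    """Single-level orthonormal 2-D Haar transform.

    img: 2-D array with even height and width.
    Returns (low, (row_diff, col_diff, diag)): a smooth approximation
    and three detail bands, each half the input size.
    """
    a = img[0::2, 0::2]  # top-left pixel of each 2x2 block
    b = img[0::2, 1::2]  # top-right
    c = img[1::2, 0::2]  # bottom-left
    d = img[1::2, 1::2]  # bottom-right
    low = (a + b + c + d) / 2.0       # low-frequency content (illumination-like)
    row_diff = (a + b - c - d) / 2.0  # change between the two rows of a block
    col_diff = (a - b + c - d) / 2.0  # change between the two columns of a block
    diag = (a - b - c + d) / 2.0      # diagonal detail
    return low, (row_diff, col_diff, diag)
```

A constant image produces zero detail bands, while thin structures such as vessel boundaries show up mainly in the detail bands, which is the intuition behind treating the two frequency ranges separately.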

Title: VideoAR: Autoregressive Video Generation via Next-Frame & Scale Prediction

Authors: Longbin Ji, Xiaoxiong Liu, Junyuan Shang, Shuohuan Wang, Yu Sun, Hua Wu, Haifeng Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2601.05966
Pdf URL: https://arxiv.org/pdf/2601.05966
Copy Paste: [[2601.05966]] VideoAR: Autoregressive Video Generation via Next-Frame & Scale Prediction(https://arxiv.org/abs/2601.05966)
Keywords: generation
Abstract: Recent advances in video generation have been dominated by diffusion and flow-matching models, which produce high-quality results but remain computationally intensive and difficult to scale. In this work, we introduce VideoAR, the first large-scale Visual Autoregressive (VAR) framework for video generation that combines multi-scale next-frame prediction with autoregressive modeling. VideoAR disentangles spatial and temporal dependencies by integrating intra-frame VAR modeling with causal next-frame prediction, supported by a 3D multi-scale tokenizer that efficiently encodes spatio-temporal dynamics. To improve long-term consistency, we propose Multi-scale Temporal RoPE, Cross-Frame Error Correction, and Random Frame Mask, which collectively mitigate error propagation and stabilize temporal coherence. Our multi-stage pretraining pipeline progressively aligns spatial and temporal learning across increasing resolutions and durations. Empirically, VideoAR achieves new state-of-the-art results among autoregressive models, improving FVD on UCF-101 from 99.5 to 88.6 while reducing inference steps by over 10x, and reaching a VBench score of 81.74, competitive with diffusion-based models an order of magnitude larger. These results demonstrate that VideoAR narrows the performance gap between autoregressive and diffusion paradigms, offering a scalable, efficient, and temporally consistent foundation for future video generation research.
摘要：视频生成领域的最新进展主要由扩散和流匹配模型主导，这些模型可以产生高质量的结果，但计算量仍然很大且难以扩展。在这项工作中，我们介绍了 VideoAR，这是第一个用于视频生成的大规模视觉自回归 (VAR) 框架，它将多尺度下一帧预测与自回归建模相结合。 VideoAR 通过将帧内 VAR 建模与因果下一帧预测相集成，消除了空间和时间依赖性，并由可有效编码时空动态的 3D 多尺度分词器支持。为了提高长期一致性，我们提出了多尺度时间 RoPE、跨帧纠错和随机帧掩模，它们共同减轻了错误传播并稳定了时间相干性。我们的多阶段预训练管道逐步调整空间和时间学习，以提高分辨率和持续时间。根据经验，VideoAR 在自回归模型中取得了最先进的结果，将 UCF-101 上的 FVD 从 99.5 提高到 88.6，同时将推理步骤减少了 10 倍以上，并达到了 81.74 的 VBench 分数，与基于扩散的模型相比要高出一个数量级。这些结果表明，VideoAR 缩小了自回归和扩散范式之间的性能差距，为未来的视频生成研究提供了可扩展、高效且时间一致的基础。