2025-03-10

Title: Self-Evolved Preference Optimization for Enhancing Mathematical Reasoning in Small Language Models

Authors: Joykirat Singh, Tanmoy Chakraborty, Akshay Nambi
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2503.04813
Pdf URL: https://arxiv.org/pdf/2503.04813
Copy Paste: [[2503.04813]] Self-Evolved Preference Optimization for Enhancing Mathematical Reasoning in Small Language Models(https://arxiv.org/abs/2503.04813)
Keywords: generation
Abstract: Large language models (LLMs) have significantly improved their reasoning capabilities; however, they still struggle with complex multi-step mathematical problem-solving due to error propagation, lack of self-correction, and limited adaptability to diverse reasoning styles. Existing methods rely on static fine-tuning or prompt engineering, which fail to generalize across problem complexities, while the scarcity of high-quality preference data further hinders reliable reasoning. We introduce SPHERE, a self-evolving data generation pipeline that enhances reasoning in small language models (SLMs) by iteratively generating, correcting, and diversifying reasoning chains. SPHERE operates in three stages: (i) Self-Generation, where the model autonomously constructs problem-solving steps; (ii) Self-Correction, enabling it to identify and rectify errors; and (iii) Diversity Induction, improving robustness through multiple valid reasoning trajectories. This self-evolution mechanism strengthens mathematical reasoning and enhances model reliability. Evaluations on MATH 500, GSM8K, AIME, AMC, and Olympiad show that SPHERE-trained models achieve significant gains over their base versions and match/surpass GPT-4o on certain benchmarks. Our findings demonstrate that self-evolving models can close the reasoning gap between SLMs and state-of-the-art LLMs, making mathematical AI more reliable, scalable, and efficient.
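The abstract does not say how SPHERE turns its generated, corrected, and diversified chains into preference data. The sketch below is one plausible reading, not the authors' pipeline: it assumes each chain carries a final answer that can be checked against a reference, drops exact duplicate chains, and pairs every distinct correct chain (chosen) with every incorrect chain (rejected). The `Chain` type and `build_preference_pairs` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chain:
    """Hypothetical container for one reasoning chain."""
    steps: tuple   # reasoning steps, as strings
    answer: str    # final answer extracted from the chain


def build_preference_pairs(chains, reference_answer):
    """Return (chosen, rejected) pairs from a pool of sampled chains.

    Assumption: correctness is judged by exact match on the final answer.
    """
    correct, incorrect = [], []
    seen = set()
    for chain in chains:
        if chain.steps in seen:      # keep only distinct trajectories
            continue
        seen.add(chain.steps)
        if chain.answer == reference_answer:
            correct.append(chain)
        else:
            incorrect.append(chain)
    return [(good, bad) for good in correct for bad in incorrect]


# Toy pool: two distinct correct chains, one duplicate, one incorrect chain.
pool = [
    Chain(("2 + 2 = 4",), "4"),
    Chain(("2 + 2 = 5",), "5"),
    Chain(("2 + 2 = 4",), "4"),
    Chain(("2 * 2 = 4",), "4"),
]
pairs = build_preference_pairs(pool, reference_answer="4")
```

Pairs of this shape are the usual input to preference-optimization objectives; how SPHERE scores or filters them is not described in the abstract.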

Title: Invisible Strings: Revealing Latent Dancer-to-Dancer Interactions with Graph Neural Networks

Authors: Luis Vitor Zerkowski, Zixuan Wang, Ilya Vidrin, Mariel Pettee
Subjects: cs.CV, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2503.04816
Pdf URL: https://arxiv.org/pdf/2503.04816
Copy Paste: [[2503.04816]] Invisible Strings: Revealing Latent Dancer-to-Dancer Interactions with Graph Neural Networks(https://arxiv.org/abs/2503.04816)
Keywords: generative
Abstract: Dancing in a duet often requires a heightened attunement to one's partner: their orientation in space, their momentum, and the forces they exert on you. Dance artists who work in partnered settings might have a strong embodied understanding in the moment of how their movements relate to their partner's, but typical documentation of dance fails to capture these varied and subtle relationships. Working closely with dance artists interested in deepening their understanding of partnering, we leverage Graph Neural Networks (GNNs) to highlight and interpret the intricate connections shared by two dancers. Using a video-to-3D-pose extraction pipeline, we extract 3D movements from curated videos of contemporary dance duets, apply a dedicated pre-processing to improve the reconstruction, and train a GNN to predict weighted connections between the dancers. By visualizing and interpreting the predicted relationships between the two movers, we demonstrate the potential for graph-based methods to construct alternate models of the collaborative dynamics of duets. Finally, we offer some example strategies for how to use these insights to inform a generative and co-creative studio practice.

Title: StickMotion: Generating 3D Human Motions by Drawing a Stickman

Authors: Tao Wang, Zhihua Wu, Qiaozhi He, Jiaming Chu, Ling Qian, Yu Cheng, Junliang Xing, Jian Zhao, Lei Jin
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.04829
Pdf URL: https://arxiv.org/pdf/2503.04829
Copy Paste: [[2503.04829]] StickMotion: Generating 3D Human Motions by Drawing a Stickman(https://arxiv.org/abs/2503.04829)
Keywords: generation
Abstract: Text-to-motion generation, which translates textual descriptions into human motions, struggles to capture detailed user-imagined motions accurately from simple text inputs. This paper introduces StickMotion, an efficient diffusion-based network designed for multi-condition scenarios, which generates desired motions based on traditional text and our proposed stickman conditions for global and local control of these motions, respectively. We address the challenges introduced by the user-friendly stickman from three perspectives: 1) Data generation. We develop an algorithm to generate hand-drawn stickmen automatically across different dataset formats. 2) Multi-condition fusion. We propose a multi-condition module that integrates into the diffusion process and obtains outputs of all possible condition combinations, reducing computational complexity and enhancing StickMotion's performance compared to conventional approaches with the self-attention module. 3) Dynamic supervision. We empower StickMotion to make minor adjustments to the stickman's position within the output sequences, generating more natural movements through our proposed dynamic supervision strategy. Quantitative experiments and user studies show that sketching stickmen saves users about 51.5% of the time needed to generate motions consistent with their imagination. Our codes, demos, and relevant data will be released to facilitate further research and validation within the scientific community.

Title: Adversarial Training for Multimodal Large Language Models against Jailbreak Attacks

Authors: Liming Lu, Shuchao Pang, Siyuan Liang, Haotian Zhu, Xiyu Zeng, Aishan Liu, Yunhuai Liu, Yongbin Zhou
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2503.04833
Pdf URL: https://arxiv.org/pdf/2503.04833
Copy Paste: [[2503.04833]] Adversarial Training for Multimodal Large Language Models against Jailbreak Attacks(https://arxiv.org/abs/2503.04833)
Keywords: generation
Abstract: Multimodal large language models (MLLMs) have made remarkable strides in cross-modal comprehension and generation tasks. However, they remain vulnerable to jailbreak attacks, where crafted perturbations bypass security guardrails and elicit harmful outputs. In this paper, we present the first adversarial training (AT) paradigm tailored to defend against jailbreak attacks during the MLLM training phase. Extending traditional AT to this domain poses two critical challenges: efficiently tuning massive parameters and ensuring robustness against attacks across multiple modalities. To address these challenges, we introduce Projection Layer Against Adversarial Training (ProEAT), an end-to-end AT framework. ProEAT incorporates a projector-based adversarial training architecture that efficiently handles large-scale parameters while maintaining computational feasibility by focusing adversarial training on a lightweight projector layer instead of the entire model; additionally, we design a dynamic weight adjustment mechanism that optimizes the loss function's weight allocation based on task demands, streamlining the tuning process. To enhance defense performance, we propose a joint optimization strategy across visual and textual modalities, ensuring robust resistance to jailbreak attacks originating from either modality. Extensive experiments conducted on five major jailbreak attack methods across three mainstream MLLMs demonstrate the effectiveness of our approach. ProEAT achieves state-of-the-art defense performance, outperforming existing baselines by an average margin of +34% across text and image modalities, while incurring only a 1% reduction in clean accuracy. Furthermore, evaluations on real-world embodied intelligent systems highlight the practical applicability of our framework, paving the way for the development of more secure and reliable multimodal systems.

Title: Combined Physics and Event Camera Simulator for Slip Detection

Authors: Thilo Reinold, Suman Ghosh, Guillermo Gallego
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2503.04838
Pdf URL: https://arxiv.org/pdf/2503.04838
Copy Paste: [[2503.04838]] Combined Physics and Event Camera Simulator for Slip Detection(https://arxiv.org/abs/2503.04838)
Keywords: generation
Abstract: Robot manipulation is a common task in fields like industrial manufacturing. Detecting when objects slip from a robot's grasp is crucial for safe and reliable operation. Event cameras, which register pixel-level brightness changes at high temporal resolution (called "events"), offer an elegant feature when mounted on a robot's end effector: since they only detect motion relative to their viewpoint, a properly grasped object produces no events, while a slipping object immediately triggers them. To research this feature, representative datasets are essential, both for analytic approaches and for training machine learning models. The majority of current research on slip detection with event-based data relies on real-world scenarios, manual data collection, and additional setups for data labeling. This can result in a significant increase in the time required for data collection, a lack of flexibility in scene setups, and a high level of complexity in the repetition of experiments. This paper presents a simulation pipeline for generating slip data using the described camera-gripper configuration in a robot arm, and demonstrates its effectiveness through initial data-driven experiments. Once set up, a simulator has the potential to reduce the time spent on data collection, allow the setup to be altered at any time, simplify the repetition of experiments, and generate arbitrarily large datasets. Two distinct datasets were created and validated through visual inspection and artificial neural networks (ANNs). Visual inspection confirmed photorealistic frame generation and accurate slip modeling, while three ANNs trained on this data achieved high validation accuracy and demonstrated good generalization capabilities on a separate test set, along with initial applicability to real-world data. Project page: this https URL

Title: ZAugNet for Z-Slice Augmentation in Bio-Imaging

Authors: Alessandro Pasqui, Sajjad Mahdavi, Benoit Vianay, Alexandra Colin, Alex McDougall, Rémi Dumollard, Yekaterina A. Miroshnikova, Elsa Labrune, Hervé Turlier
Subjects: cs.CV, cs.AI, eess.IV, q-bio.QM
Abstract URL: https://arxiv.org/abs/2503.04843
Pdf URL: https://arxiv.org/pdf/2503.04843
Copy Paste: [[2503.04843]] ZAugNet for Z-Slice Augmentation in Bio-Imaging(https://arxiv.org/abs/2503.04843)
Keywords: generative
Abstract: Three-dimensional biological microscopy has significantly advanced our understanding of complex biological structures. However, limitations due to microscopy techniques, sample properties or phototoxicity often result in poor z-resolution, hindering accurate cellular measurements. Here, we introduce ZAugNet, a fast, accurate, and self-supervised deep learning method for enhancing z-resolution in biological images. By performing nonlinear interpolation between consecutive slices, ZAugNet effectively doubles resolution with each iteration. Compared on several microscopy modalities and biological objects, it outperforms competing methods on most metrics. Our method leverages a generative adversarial network (GAN) architecture combined with knowledge distillation to maximize prediction speed without compromising accuracy. We also developed ZAugNet+, an extended version enabling continuous interpolation at arbitrary distances, making it particularly useful for datasets with nonuniform slice spacing. Both ZAugNet and ZAugNet+ provide high-performance, scalable z-slice augmentation solutions for large-scale 3D imaging. They are available as open-source frameworks in PyTorch, with an intuitive Colab notebook interface for easy access by the scientific community.
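The "doubles resolution with each iteration" claim follows from inserting one predicted slice between every pair of neighbours. The sketch below shows only that bookkeeping: a stack of n slices becomes 2n - 1 slices per pass. The `midpoint` function is a plain linear average used as a stand-in for the learned nonlinear predictor, and `augment_z` is a hypothetical name; neither comes from the ZAugNet code base.

```python
def midpoint(lower, upper):
    """Stand-in predictor: linear average of two slices (lists of pixel values)."""
    return [(a + b) / 2 for a, b in zip(lower, upper)]


def augment_z(stack, predict=midpoint):
    """Insert one predicted slice between each pair of consecutive slices.

    Assumes the stack holds at least one slice.
    """
    out = []
    for lower, upper in zip(stack, stack[1:]):
        out.append(lower)
        out.append(predict(lower, upper))
    out.append(stack[-1])
    return out


# Three toy "slices" of two pixels each.
stack = [[0.0, 0.0], [4.0, 8.0], [8.0, 16.0]]
once = augment_z(stack)    # 3 slices -> 5 slices
twice = augment_z(once)    # 5 slices -> 9 slices
```

Swapping `predict` for a trained network leaves the loop unchanged, which is why repeated passes keep halving the slice spacing.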

Title: End-to-End Human Pose Reconstruction from Wearable Sensors for 6G Extended Reality Systems

Authors: Nguyen Quang Hieu, Dinh Thai Hoang, Diep N. Nguyen, Mohammad Abu Alsheikh, Carlos C. N. Kuhn, Yibeltal F. Alem, Ibrahim Radwan
Subjects: cs.CV, eess.SP
Abstract URL: https://arxiv.org/abs/2503.04860
Pdf URL: https://arxiv.org/pdf/2503.04860
Copy Paste: [[2503.04860]] End-to-End Human Pose Reconstruction from Wearable Sensors for 6G Extended Reality Systems(https://arxiv.org/abs/2503.04860)
Keywords: generation
Abstract: Full 3D human pose reconstruction is a critical enabler for extended reality (XR) applications in future sixth generation (6G) networks, supporting immersive interactions in gaming, virtual meetings, and remote collaboration. However, achieving accurate pose reconstruction over wireless networks remains challenging due to channel impairments, bit errors, and quantization effects. Existing approaches often assume error-free transmission in indoor settings, limiting their applicability to real-world scenarios. To address these challenges, we propose a novel deep learning-based framework for human pose reconstruction over orthogonal frequency-division multiplexing (OFDM) systems. The framework introduces a two-stage deep learning receiver: the first stage jointly estimates the wireless channel and decodes OFDM symbols, and the second stage maps the received sensor signals to full 3D body poses. Simulation results demonstrate that the proposed neural receiver reduces bit error rate (BER), achieving a 5 dB gain at a BER of $10^{-4}$ compared to the baseline method that employs separate signal detection steps, i.e., least squares channel estimation and linear minimum mean square error equalization. Additionally, our empirical findings show that 8-bit quantization is sufficient for accurate pose reconstruction, achieving a mean squared error of $5\times10^{-4}$ for reconstructed sensor signals, and reducing joint angular error by 37% for the reconstructed human poses compared to the baseline.
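To give a feel for why 8 bits can be enough, the sketch below measures the error introduced by uniform quantization alone on a toy signal. It assumes sensor readings normalized to [-1, 1] and a simple round-to-nearest quantizer; the signal, range, and function name are illustrative choices, and the result is not comparable to the paper's end-to-end figure, which also includes channel effects.

```python
import math


def quantize(x, bits=8, lo=-1.0, hi=1.0):
    """Uniformly quantize x to 2**bits levels on [lo, hi], clipping outliers."""
    levels = 2 ** bits - 1
    step = (hi - lo) / levels
    clipped = min(max(x, lo), hi)
    return lo + round((clipped - lo) / step) * step


# Toy stand-in for a normalized sensor trace: one period of a sine wave.
signal = [math.sin(2 * math.pi * k / 100) for k in range(100)]

# Mean squared quantization error at 4 and 8 bits.
mse = {
    bits: sum((s - quantize(s, bits)) ** 2 for s in signal) / len(signal)
    for bits in (4, 8)
}
```

Each extra bit halves the step size, so the quantization error at 8 bits is bounded by half of 2/255 per sample under these assumptions.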

Title: Toward Lightweight and Fast Decoders for Diffusion Models in Image and Video Generation

Authors: Alexey Buzovkin, Evgeny Shilov
Subjects: cs.CV, cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2503.04871
Pdf URL: https://arxiv.org/pdf/2503.04871
Copy Paste: [[2503.04871]] Toward Lightweight and Fast Decoders for Diffusion Models in Image and Video Generation(https://arxiv.org/abs/2503.04871)
Keywords: generation, generative
Abstract: We investigate methods to reduce inference time and memory footprint in stable diffusion models by introducing lightweight decoders for both image and video synthesis. Traditional latent diffusion pipelines rely on large Variational Autoencoder decoders that can slow down generation and consume considerable GPU memory. We propose custom-trained decoders using lightweight Vision Transformer and Taming Transformer architectures. Experiments show up to 15% overall speed-ups for image generation on COCO2017 and up to 20 times faster decoding in the sub-module, with additional gains on UCF-101 for video tasks. Memory requirements are moderately reduced, and while there is a small drop in perceptual quality compared to the default decoder, the improvements in speed and scalability are crucial for large-scale inference scenarios such as generating 100K images. Our work is further contextualized by advances in efficient video generation, including dual masking strategies, illustrating a broader effort to improve the scalability and efficiency of generative models.
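The gap between a 20 times faster decoder and a 15% overall speed-up is what Amdahl's law predicts when decoding is a small share of total runtime. The sketch below makes two assumptions that the abstract does not state: that "15% overall speed-up" means a 1.15x end-to-end speed-up, and that both "up to" figures come from the same configuration. Under those assumptions it solves for the implied share of time spent in the original decoder.

```python
def overall_speedup(fraction, local_speedup):
    """Amdahl's law: end-to-end speed-up when `fraction` of runtime is accelerated."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)


def implied_fraction(overall, local_speedup):
    """Invert Amdahl's law for the accelerated share of the original runtime."""
    return (1.0 - 1.0 / overall) / (1.0 - 1.0 / local_speedup)


# Assumed reading of the abstract: 1.15x overall, 20x in the decoder sub-module.
decoder_share = implied_fraction(overall=1.15, local_speedup=20.0)
check = overall_speedup(decoder_share, 20.0)
```

Under this reading the decoder would account for roughly 14% of the original runtime, which also caps the achievable end-to-end gain from decoder work alone.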

Title: FirePlace: Geometric Refinements of LLM Common Sense Reasoning for 3D Object Placement

Authors: Ian Huang, Yanan Bao, Karen Truong, Howard Zhou, Cordelia Schmid, Leonidas Guibas, Alireza Fathi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.04919
Pdf URL: https://arxiv.org/pdf/2503.04919
Copy Paste: [[2503.04919]] FirePlace: Geometric Refinements of LLM Common Sense Reasoning for 3D Object Placement(https://arxiv.org/abs/2503.04919)
Keywords: generation
Abstract: Scene generation with 3D assets presents a complex challenge, requiring both high-level semantic understanding and low-level geometric reasoning. While Multimodal Large Language Models (MLLMs) excel at semantic tasks, their application to 3D scene generation is hindered by their limited grounding on 3D geometry. In this paper, we investigate how to best work with MLLMs in an object placement task. Towards this goal, we introduce a novel framework, FirePlace, that applies existing MLLMs in (1) 3D geometric reasoning and the extraction of relevant geometric details from the 3D scene, (2) constructing and solving geometric constraints on the extracted low-level geometry, and (3) pruning for final placements that conform to common sense. By combining geometric reasoning with real-world understanding of MLLMs, our method can propose object placements that satisfy both geometric constraints as well as high-level semantic common-sense considerations. Our experiments show that these capabilities allow our method to place objects more effectively in complex scenes with intricate geometry, surpassing the quality of prior work.

Title: Metadata-free Georegistration of Ground and Airborne Imagery

Authors: Adam Bredvik, Scott Richardson, Daniel Crispell
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.04927
Pdf URL: https://arxiv.org/pdf/2503.04927
Copy Paste: [[2503.04927]] Metadata-free Georegistration of Ground and Airborne Imagery(https://arxiv.org/abs/2503.04927)
Keywords: generation
Abstract: Heterogeneous collections of ground and airborne imagery can readily be used to create high-quality 3D models and novel viewpoint renderings of the observed scene. Standard photogrammetry pipelines generate models in arbitrary coordinate systems, which is problematic for applications that require georegistered models. Even for applications that do not require georegistered models, georegistration is useful as a mechanism for aligning multiple disconnected models generated from non-overlapping data. The proposed method leverages satellite imagery, an associated digital surface model (DSM), and the novel view generation capabilities of modern 3D modeling techniques (e.g. neural radiance fields) to provide a robust method for georegistering airborne imagery, and a related technique for registering ground-based imagery to models created from airborne imagery. Experiments demonstrate successful georegistration of airborne and ground-based photogrammetric models across a variety of distinct sites. The proposed method does not require use of any metadata other than a satellite-based reference product and therefore has general applicability.

Title: Incentivizing Multi-Tenant Split Federated Learning for Foundation Models at the Network Edge

Authors: Songyuan Li, Jia Hu, Geyong Min, Haojun Huang
Subjects: cs.LG, cs.AI, cs.DC, cs.MA
Abstract URL: https://arxiv.org/abs/2503.04971
Pdf URL: https://arxiv.org/pdf/2503.04971
Copy Paste: [[2503.04971]] Incentivizing Multi-Tenant Split Federated Learning for Foundation Models at the Network Edge(https://arxiv.org/abs/2503.04971)
Keywords: generative
Abstract: Foundation models (FMs) such as GPT-4 exhibit exceptional generative capabilities across diverse downstream tasks through fine-tuning. Split Federated Learning (SFL) facilitates privacy-preserving FM fine-tuning on resource-constrained local devices by offloading partial FM computations to edge servers, enabling device-edge synergistic fine-tuning. Practical edge networks often host multiple SFL tenants to support diversified downstream tasks. However, existing research primarily focuses on single-tenant SFL scenarios, and lacks tailored incentive mechanisms for multi-tenant settings, which are essential to effectively coordinate self-interested local devices for participation in various downstream tasks, ensuring that each SFL tenant's distinct FM fine-tuning requirements (e.g., FM types, performance targets, and fine-tuning deadlines) are met. To address this gap, we propose a novel Price-Incentive Mechanism (PRINCE) that guides multiple SFL tenants to offer strategic price incentives, which solicit high-quality device participation for efficient FM fine-tuning. Specifically, we first develop a bias-resilient global SFL model aggregation scheme to eliminate model biases caused by independent device participation. We then derive a rigorous SFL convergence bound to evaluate the contributions of heterogeneous devices to FM performance improvements, guiding the incentive strategies of SFL tenants. Furthermore, we model inter-tenant device competition as a congestion game for Stackelberg equilibrium (SE) analysis, deriving each SFL tenant's optimal incentive strategy. Extensive simulations involving four representative SFL tenant types (ViT, BERT, Whisper, and LLaMA) across diverse data modalities (text, images, and audio) demonstrate that PRINCE accelerates FM fine-tuning by up to 3.07x compared to state-of-the-art approaches, while consistently meeting fine-tuning performance targets.

Title: Energy-Weighted Flow Matching for Offline Reinforcement Learning

Authors: Shiyuan Zhang, Weitong Zhang, Quanquan Gu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.04975
Pdf URL: https://arxiv.org/pdf/2503.04975
Copy Paste: [[2503.04975]] Energy-Weighted Flow Matching for Offline Reinforcement Learning(https://arxiv.org/abs/2503.04975)
Keywords: generative
Abstract: This paper investigates energy guidance in generative modeling, where the target distribution is defined as $q(\mathbf x) \propto p(\mathbf x)\exp(-\beta \mathcal E(\mathbf x))$, with $p(\mathbf x)$ being the data distribution and $\mathcal E(\mathbf x)$ the energy function. To comply with energy guidance, existing methods often require auxiliary procedures to learn intermediate guidance during the diffusion process. To overcome this limitation, we explore energy-guided flow matching, a generalized form of the diffusion process. We introduce energy-weighted flow matching (EFM), a method that directly learns the energy-guided flow without the need for auxiliary models. Theoretical analysis shows that energy-weighted flow matching accurately captures the guided flow. Additionally, we extend this methodology to energy-weighted diffusion models and apply it to offline reinforcement learning (RL) by proposing the Q-weighted Iterative Policy Optimization (QIPO). Empirically, we demonstrate that the proposed QIPO algorithm improves performance in offline RL tasks. Notably, our algorithm is the first energy-guided diffusion model that operates independently of auxiliary models and the first exact energy-guided flow matching model in the literature.
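The target $q(\mathbf x) \propto p(\mathbf x)\exp(-\beta \mathcal E(\mathbf x))$ can be approximated from samples of $p$ by reweighting each sample with a self-normalized weight proportional to $\exp(-\beta \mathcal E(\mathbf x_i))$. The sketch below computes only those weights; it illustrates the reweighting idea behind the target distribution and is not the EFM training objective, whose details the abstract does not give. The function name is hypothetical.

```python
import math


def energy_weights(energies, beta):
    """Self-normalized weights w_i proportional to exp(-beta * E(x_i)).

    The maximum exponent is subtracted first for numerical stability.
    """
    exponents = [-beta * e for e in energies]
    largest = max(exponents)
    unnormalized = [math.exp(v - largest) for v in exponents]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# Energies of three samples drawn from the data distribution p.
energies = [0.0, 1.0, 2.0]
weights = energy_weights(energies, beta=1.0)
uniform = energy_weights(energies, beta=0.0)   # beta = 0 recovers p itself
```

Larger values of beta concentrate the weight on low-energy samples, while beta = 0 leaves the data distribution unchanged.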

Title: LVLM-Compress-Bench: Benchmarking the Broader Impact of Large Vision-Language Model Compression

Authors: Souvik Kundu, Anahita Bhiwandiwalla, Sungduk Yu, Phillip Howard, Tiep Le, Sharath Nittur Sridhar, David Cobbley, Hao Kang, Vasudev Lal
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2503.04982
Pdf URL: https://arxiv.org/pdf/2503.04982
Copy Paste: [[2503.04982]] LVLM-Compress-Bench: Benchmarking the Broader Impact of Large Vision-Language Model Compression(https://arxiv.org/abs/2503.04982)
Keywords: generation, generative
Abstract: Despite recent efforts to understand the impact of compression on large language models (LLMs) in terms of downstream task performance and trustworthiness on relatively simple uni-modal benchmarks (for example, question answering, common sense reasoning), a detailed study of its impact on multi-modal Large Vision-Language Models (LVLMs) is still missing. Towards mitigating this gap, we present LVLM-Compress-Bench, a framework to first thoroughly study the broad impact of compression on the generative performance of LVLMs with multi-modal input driven tasks. Specifically, we consider two major classes of compression for autoregressive models, namely KV cache and weight compression, for the dynamically growing intermediate cache and static weights, respectively. We use four LVLM variants of the popular LLaVA framework to present our analysis via integrating various state-of-the-art KV and weight compression methods including uniform, outlier-reduced, and group quantization for the KV cache and weights. With this framework we demonstrate on ten different multi-modal datasets with different capabilities including recognition, knowledge, language generation, spatial awareness, visual reasoning, hallucination and visual illusion identification, toxicity, stereotypes and bias. Our framework demonstrates the compression impact on both general and ethically critical metrics, leveraging a combination of real-world and synthetic datasets to encompass diverse societal intersectional attributes. Extensive experimental evaluations yield diverse and intriguing observations on the behavior of LVLMs at different quantization budgets of KV and weights, in both maintaining and losing performance as compared to the baseline model with FP16 data format. Code will be open-sourced at this https URL.
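The difference between uniform (per-tensor) and group quantization can be seen on a toy vector with one outlier. The sketch below is a generic min-max quantizer written for illustration, not any of the specific methods benchmarked in the paper; the function names, the 4-bit budget, and the group size of 4 are assumptions.

```python
def quantize_dequantize(values, bits):
    """Min-max uniform quantization of a list, returned in dequantized form."""
    lo, hi = min(values), max(values)
    if hi == lo:                       # constant input needs no quantization
        return list(values)
    step = (hi - lo) / (2 ** bits - 1)
    return [lo + round((v - lo) / step) * step for v in values]


def group_quantize(values, bits, group_size):
    """Quantize each contiguous group with its own min-max range."""
    out = []
    for start in range(0, len(values), group_size):
        out.extend(quantize_dequantize(values[start:start + group_size], bits))
    return out


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Seven small weights and one large outlier.
weights = [0.01, -0.02, 0.03, 0.00, 0.02, -0.01, 0.015, 8.0]
per_tensor_error = mse(weights, quantize_dequantize(weights, bits=4))
per_group_error = mse(weights, group_quantize(weights, bits=4, group_size=4))
```

With one shared range, the outlier stretches the step size for every value; with groups, only the group containing the outlier pays that cost, at the price of storing one range per group.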

Title: Leveraging Large Language Models For Scalable Vector Graphics Processing: A Review

Authors: Boris Malashenko, Ivan Jarsky, Valeria Efimova
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.04983
Pdf URL: https://arxiv.org/pdf/2503.04983
Copy Paste: [[2503.04983]] Leveraging Large Language Models For Scalable Vector Graphics Processing: A Review(https://arxiv.org/abs/2503.04983)
Keywords: generation
Abstract: In recent years, rapid advances in computer vision have significantly improved the processing and generation of raster images. However, vector graphics, which are essential in digital design due to their scalability and ease of editing, have been relatively understudied. Traditional vectorization techniques, which are often used in vector generation, suffer from long processing times and excessive output complexity, limiting their usability in practical applications. The advent of large language models (LLMs) has opened new possibilities for the generation, editing, and analysis of vector graphics, particularly in the SVG format, which is inherently text-based and well-suited for integration with LLMs. This paper provides a systematic review of existing LLM-based approaches for SVG processing, categorizing them into three main tasks: generation, editing, and understanding. We examine notable models such as IconShop, StrokeNUWA, and StarVector, highlighting their strengths and limitations. Furthermore, we analyze benchmark datasets designed for assessing SVG-related tasks, including SVGEditBench, VGBench, and SGP-Bench, and conduct a series of experiments to evaluate various LLMs in these domains. Our results demonstrate that, for vector graphics, reasoning-enhanced models outperform standard LLMs, particularly in generation and understanding tasks. Furthermore, our findings underscore the need to develop more diverse and richly annotated datasets to further improve LLM capabilities in vector graphics tasks.

Title: Taming Video Diffusion Prior with Scene-Grounding Guidance for 3D Gaussian Splatting from Sparse Inputs

Authors: Yingji Zhong, Zhihao Li, Dave Zhenyu Chen, Lanqing Hong, Dan Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.05082
Pdf URL: https://arxiv.org/pdf/2503.05082
Copy Paste: [[2503.05082]] Taming Video Diffusion Prior with Scene-Grounding Guidance for 3D Gaussian Splatting from Sparse Inputs(https://arxiv.org/abs/2503.05082)
Keywords: generation
Abstract: Despite recent successes in novel view synthesis using 3D Gaussian Splatting (3DGS), modeling scenes with sparse inputs remains a challenge. In this work, we address two critical yet overlooked issues in real-world sparse-input modeling: extrapolation and occlusion. To tackle these issues, we propose a reconstruction-by-generation pipeline that leverages learned priors from video diffusion models to provide plausible interpretations for regions that are outside the field of view or occluded. However, the generated sequences exhibit inconsistencies that prevent them from fully benefiting subsequent 3DGS modeling. To address these inconsistencies, we introduce a novel scene-grounding guidance based on rendered sequences from an optimized 3DGS, which tames the diffusion model to generate consistent sequences. This guidance is training-free and does not require any fine-tuning of the diffusion model. To facilitate holistic scene modeling, we also propose a trajectory initialization method that effectively identifies regions that are outside the field of view or occluded. We further design a scheme tailored for 3DGS optimization with generated sequences. Experiments demonstrate that our method significantly improves upon the baseline and achieves state-of-the-art performance on challenging benchmarks.

Title: Fake It To Make It: Virtual Multiviews to Enhance Monocular Indoor Semantic Scene Completion

Authors: Anith Selvakumar, Manasa Bharadwaj
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.05086
Pdf URL: https://arxiv.org/pdf/2503.05086
Copy Paste: [[2503.05086]] Fake It To Make It: Virtual Multiviews to Enhance Monocular Indoor Semantic Scene Completion(https://arxiv.org/abs/2503.05086)
Keywords: generative
Abstract: Monocular Indoor Semantic Scene Completion (SSC) aims to reconstruct a 3D semantic occupancy map from a single RGB image of an indoor scene, inferring spatial layout and object categories from 2D image cues. The challenge of this task arises from the depth, scale, and shape ambiguities that emerge when transforming a 2D image into 3D space, particularly within the complex and often heavily occluded environments of indoor scenes. Current SSC methods often struggle with these ambiguities, resulting in distorted or missing object representations. To overcome these limitations, we introduce an innovative approach that leverages novel view synthesis and multiview fusion. Specifically, we demonstrate how virtual cameras can be placed around the scene to emulate multiview inputs that enhance contextual scene information. We also introduce a Multiview Fusion Adaptor (MVFA) to effectively combine the multiview 3D scene predictions into a unified 3D semantic occupancy map. Finally, we identify and study the inherent limitation of generative techniques when applied to SSC, specifically the Novelty-Consistency tradeoff. Our system, GenFuSE, demonstrates IoU score improvements of up to 2.8% for Scene Completion and 4.9% for Semantic Scene Completion when integrated with existing SSC networks on the NYUv2 dataset. This work introduces GenFuSE as a standard framework for advancing monocular SSC with synthesized inputs.

Title: TS-LIF: A Temporal Segment Spiking Neuron Network for Time Series Forecasting

Authors: Shibo Feng, Wanjin Feng, Xingyu Gao, Peilin Zhao, Zhiqi Shen
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.05108
Pdf URL: https://arxiv.org/pdf/2503.05108
Copy Paste: [[2503.05108]] TS-LIF: A Temporal Segment Spiking Neuron Network for Time Series Forecasting(https://arxiv.org/abs/2503.05108)
Keywords: generation
Abstract: Spiking Neural Networks (SNNs) offer a promising, biologically inspired approach for processing spatiotemporal data, particularly for time series forecasting. However, conventional neuron models like the Leaky Integrate-and-Fire (LIF) struggle to capture long-term dependencies and effectively process multi-scale temporal dynamics. To overcome these limitations, we introduce the Temporal Segment Leaky Integrate-and-Fire (TS-LIF) model, featuring a novel dual-compartment architecture. The dendritic and somatic compartments specialize in capturing distinct frequency components, providing functional heterogeneity that enhances the neuron's ability to process both low- and high-frequency information. Furthermore, the newly introduced direct somatic current injection reduces information loss during intra-neuronal transmission, while dendritic spike generation improves multi-scale information extraction. We provide a theoretical stability analysis of the TS-LIF model and explain how each compartment contributes to distinct frequency response characteristics. Experimental results show that TS-LIF outperforms traditional SNNs in time series forecasting, demonstrating better accuracy and robustness, even with missing data. TS-LIF advances the application of SNNs in time-series forecasting, providing a biologically inspired approach that captures complex temporal dynamics and offers potential for practical implementation in diverse forecasting scenarios. The source code is available at this https URL.
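
For readers unfamiliar with the baseline neuron that TS-LIF extends, the following is a minimal sketch of a standard single-compartment, discrete-time Leaky Integrate-and-Fire neuron. It is not the dual-compartment TS-LIF model from the abstract. The function name `simulate_lif`, the leak factor `beta`, the threshold value, and the hard-reset rule are all illustrative assumptions.

```python
def simulate_lif(currents, beta=0.9, threshold=1.0):
    """Simulate a discrete-time leaky integrate-and-fire neuron.

    currents:  input current at each time step.
    beta:      membrane leak factor in (0, 1) (assumed value).
    threshold: firing threshold (assumed value).
    Returns a list with one 0/1 spike per time step.
    """
    v = 0.0  # membrane potential
    spikes = []
    for current in currents:
        v = beta * v + current  # leak the old potential, then integrate input
        if v >= threshold:
            spikes.append(1)
            v = 0.0  # hard reset after a spike (one common convention)
        else:
            spikes.append(0)
    return spikes
```

With a small leak factor the potential decays quickly between inputs, which is one way to see why a single LIF compartment has difficulty retaining long-term dependencies.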

Title: Development and Enhancement of Text-to-Image Diffusion Models

Authors: Rajdeep Roshan Sahu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.05149
Pdf URL: https://arxiv.org/pdf/2503.05149
Copy Paste: [[2503.05149]] Development and Enhancement of Text-to-Image Diffusion Models(https://arxiv.org/abs/2503.05149)
Keywords: generation, generative
Abstract: This research focuses on the development and enhancement of text-to-image denoising diffusion models, addressing key challenges such as limited sample diversity and training instability. By incorporating Classifier-Free Guidance (CFG) and Exponential Moving Average (EMA) techniques, this study significantly improves image quality, diversity, and stability. Utilizing Hugging Face's state-of-the-art text-to-image generation model, the proposed enhancements establish new benchmarks in generative AI. This work explores the underlying principles of diffusion models, implements advanced strategies to overcome existing limitations, and presents a comprehensive evaluation of the improvements achieved. Results demonstrate substantial progress in generating stable, diverse, and high-quality images from textual descriptions, advancing the field of generative artificial intelligence and providing new foundations for future applications. Keywords: Text-to-image, Diffusion model, Classifier-free guidance, Exponential moving average, Image generation.
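
The two techniques named in this abstract have well-known textbook forms, sketched below on plain Python lists. This is a generic illustration, not the study's implementation. The function names, the `guidance_scale` parameter, and the `decay` default are assumptions chosen for the example.

```python
def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Classifier-free guidance on two noise predictions.

    Starts from the unconditional prediction and moves toward (or past)
    the conditional one; guidance_scale = 1.0 returns the conditional
    prediction unchanged.
    """
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def ema_update(ema_params, params, decay=0.999):
    """One exponential-moving-average step over a flat list of weights."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]
```

In a typical training loop, `ema_update` would be called after every optimizer step, and the averaged weights, rather than the raw ones, would be used for sampling.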

Title: Accelerating Diffusion Transformer via Gradient-Optimized Cache

Authors: Junxiang Qiu, Lin Liu, Shuo Wang, Jinda Lu, Kezhou Chen, Yanbin Hao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.05156
Pdf URL: https://arxiv.org/pdf/2503.05156
Copy Paste: [[2503.05156]] Accelerating Diffusion Transformer via Gradient-Optimized Cache(https://arxiv.org/abs/2503.05156)
Keywords: generation
Abstract: Feature caching has emerged as an effective strategy to accelerate diffusion transformer (DiT) sampling through temporal feature reuse. It is a challenging problem since (1) progressive error accumulation from cached blocks significantly degrades generation quality, particularly when over 50% of blocks are cached; (2) current error compensation approaches neglect dynamic perturbation patterns during the caching process, leading to suboptimal error correction. To solve these problems, we propose the Gradient-Optimized Cache (GOC) with two key innovations: (1) Cached Gradient Propagation: A gradient queue dynamically computes the gradient differences between cached and recomputed features. These gradients are weighted and propagated to subsequent steps, directly compensating for the approximation errors introduced by caching. (2) Inflection-Aware Optimization: Through statistical analysis of feature variation patterns, we identify critical inflection points where the denoising trajectory changes direction. By aligning gradient updates with these detected phases, we prevent conflicting gradient directions during error correction. Extensive evaluations on ImageNet demonstrate GOC's superior trade-off between efficiency and quality. With 50% cached blocks, GOC achieves IS 216.28 (26.3% higher) and FID 3.907 (43% lower) compared to baseline DiT, while maintaining identical computational costs. These improvements persist across various cache ratios, demonstrating robust adaptability to different acceleration requirements.

Title: Narrating the Video: Boosting Text-Video Retrieval via Comprehensive Utilization of Frame-Level Captions

Authors: Chan hur, Jeong-hun Hong, Dong-hun Lee, Dabin Kang, Semin Myeong, Sang-hyo Park, Hyeyoung Park
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.05186
Pdf URL: https://arxiv.org/pdf/2503.05186
Copy Paste: [[2503.05186]] Narrating the Video: Boosting Text-Video Retrieval via Comprehensive Utilization of Frame-Level Captions(https://arxiv.org/abs/2503.05186)
Keywords: generative
Abstract: In recent text-video retrieval, the use of additional captions from vision-language models has shown promising effects on performance. However, existing models using additional captions have often struggled to capture the rich semantics, including temporal changes, inherent in the video. In addition, incorrect information caused by generative models can lead to inaccurate retrieval. To address these issues, we propose a new framework, Narrating the Video (NarVid), which strategically leverages the comprehensive information available from frame-level captions, the narration. The proposed NarVid exploits narration in multiple ways: 1) feature enhancement through cross-modal interactions between narration and video, 2) query-aware adaptive filtering to suppress irrelevant or incorrect information, 3) dual-modal matching score by adding query-video similarity and query-narration similarity, and 4) hard-negative loss to learn discriminative features from multiple perspectives using the two similarities from different views. Experimental results demonstrate that NarVid achieves state-of-the-art performance on various benchmark datasets.

Title: RecipeGen: A Benchmark for Real-World Recipe Image Generation

Authors: Ruoxuan Zhang, Hongxia Xie, Yi Yao, Jian-Yu Jiang-Lin, Bin Wen, Ling Lo, Hong-Han Shuai, Yung-Hui Li, Wen-Huang Cheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.05228
Pdf URL: https://arxiv.org/pdf/2503.05228
Copy Paste: [[2503.05228]] RecipeGen: A Benchmark for Real-World Recipe Image Generation(https://arxiv.org/abs/2503.05228)
Keywords: generation
Abstract: Recipe image generation is an important challenge in food computing, with applications from culinary education to interactive recipe platforms. However, there is currently no real-world dataset that comprehensively connects recipe goals, sequential steps, and corresponding images. To address this, we introduce RecipeGen, the first real-world goal-step-image benchmark for recipe generation, featuring diverse ingredients, varied recipe steps, multiple cooking styles, and a broad collection of food categories. Data is in this https URL.

Title: Unified Reward Model for Multimodal Understanding and Generation

Authors: Yibin Wang, Yuhang Zang, Hao Li, Cheng Jin, Jiaqi Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.05236
Pdf URL: https://arxiv.org/pdf/2503.05236
Copy Paste: [[2503.05236]] Unified Reward Model for Multimodal Understanding and Generation(https://arxiv.org/abs/2503.05236)
Keywords: generation
Abstract: Recent advances in human preference alignment have significantly enhanced multimodal generation and understanding. A key approach is training reward models to guide preference optimization. However, existing models are often task-specific, limiting their adaptability across diverse visual applications. We also argue that jointly learning to assess multiple tasks may foster a synergistic effect, where improved image understanding enhances image generation assessment, and refined image evaluation benefits video assessment through better frame analysis. To this end, this paper proposes UnifiedReward, the first unified reward model for multimodal understanding and generation assessment, enabling both pairwise ranking and pointwise scoring, which can be employed for vision model preference alignment. Specifically, (1) we first develop UnifiedReward on our constructed large-scale human preference dataset, including both image and video generation/understanding tasks. (2) Then, it is utilized to automatically construct high-quality preference pair data based on the vision models, fine-gradually filtering their outputs through pair ranking and point sifting. (3) Finally, these data are used for their preference alignment through Direct Preference Optimization (DPO). Experimental results demonstrate that joint learning to assess diverse visual tasks can lead to substantial mutual benefits. We apply our pipeline to both image and video understanding/generation tasks, significantly improving the performance in each domain.
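
Step (3) of the pipeline relies on Direct Preference Optimization. As a reminder of what that objective looks like, here is a minimal sketch of the standard per-pair DPO loss, computed from sequence log-probabilities. It illustrates the general DPO formulation only and is not the UnifiedReward training code. The function name, the argument names, and the `beta` default are assumptions.

```python
import math


def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for a single (chosen, rejected) preference pair.

    logp_*:     log-probabilities under the policy being trained.
    ref_logp_*: log-probabilities under the frozen reference model.
    beta:       strength of the implicit KL constraint (assumed value).
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss equals log 2 when the policy and the reference agree, and it falls below that value as the policy raises the chosen response's likelihood relative to the rejected one.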

Title: Frequency Autoregressive Image Generation with Continuous Tokens

Authors: Hu Yu, Hao Luo, Hangjie Yuan, Yu Rong, Feng Zhao
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.05305
Pdf URL: https://arxiv.org/pdf/2503.05305
Copy Paste: [[2503.05305]] Frequency Autoregressive Image Generation with Continuous Tokens(https://arxiv.org/abs/2503.05305)
Keywords: generation
Abstract: Autoregressive (AR) models for image generation typically adopt a two-stage paradigm of vector quantization and raster-scan "next-token prediction", inspired by its great success in language modeling. However, due to the huge modality gap, image autoregressive models may require a systematic reevaluation from two perspectives: tokenizer format and regression direction. In this paper, we introduce the frequency progressive autoregressive (FAR) paradigm and instantiate FAR with the continuous tokenizer. Specifically, we identify spectral dependency as the desirable regression direction for FAR, wherein higher-frequency components build upon the lower one to progressively construct a complete image. This design seamlessly fits the causality requirement for autoregressive models and preserves the unique spatial locality of image data. Besides, we delve into the integration of FAR and the continuous tokenizer, introducing a series of techniques to address optimization challenges and improve the efficiency of training and inference processes. We demonstrate the efficacy of FAR through comprehensive experiments on the ImageNet dataset and verify its potential on text-to-image generation.

Title: PhysicsGen: Can Generative Models Learn from Images to Predict Complex Physical Relations?

Authors: Martin Spitznagel, Jan Vaillant, Janis Keuper
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.05333
Pdf URL: https://arxiv.org/pdf/2503.05333
Copy Paste: [[2503.05333]] PhysicsGen: Can Generative Models Learn from Images to Predict Complex Physical Relations?(https://arxiv.org/abs/2503.05333)
Keywords: generative
Abstract: The image-to-image translation abilities of generative learning models have recently made significant progress in the estimation of complex (steered) mappings between image distributions. While appearance-based tasks like image in-painting or style transfer have been studied at length, we propose to investigate the potential of generative models in the context of physical simulations. Providing a dataset of 300k image-pairs and baseline evaluations for three different physical simulation tasks, we propose a benchmark to investigate the following research questions: i) are generative models able to learn complex physical relations from input-output image pairs? ii) what speedups can be achieved by replacing differential equation based simulations? While baseline evaluations of different current models show the potential for high speedups (ii), these results also show strong limitations toward the physical correctness (i). This underlines the need for new methods to enforce physical correctness. Data, baseline models, and evaluation code are available at this http URL.

Title: Linear-MoE: Linear Sequence Modeling Meets Mixture-of-Experts

Authors: Weigao Sun, Disen Lan, Tong Zhu, Xiaoye Qu, Yu Cheng
Subjects: cs.LG, cs.AI, cs.CL, cs.DC
Abstract URL: https://arxiv.org/abs/2503.05447
Pdf URL: https://arxiv.org/pdf/2503.05447
Copy Paste: [[2503.05447]] Linear-MoE: Linear Sequence Modeling Meets Mixture-of-Experts(https://arxiv.org/abs/2503.05447)
Keywords: generation
Abstract: Linear Sequence Modeling (LSM), such as linear attention, state space models, and linear RNNs, and Mixture-of-Experts (MoE) have recently emerged as significant architectural improvements. In this paper, we introduce Linear-MoE, a production-level system for modeling and training large-scale models that integrate LSM with MoE. Linear-MoE leverages the advantages of both LSM modules for linear-complexity sequence modeling and MoE layers for sparse activation, aiming to offer high performance with efficient training. The Linear-MoE system comprises: 1) a Modeling subsystem, which provides a unified framework supporting all instances of LSM, and 2) a Training subsystem, which facilitates efficient training by incorporating various advanced parallelism technologies, particularly Sequence Parallelism designed for Linear-MoE models. Additionally, we explore hybrid models that combine Linear-MoE layers with standard Transformer-MoE layers with its Sequence Parallelism to further enhance model flexibility and performance. Evaluations on two model series, A0.3B-2B and A1B-7B, demonstrate that Linear-MoE achieves efficiency gains while maintaining competitive performance on various benchmarks, showcasing its potential as a next-generation foundational model architecture. Code: this https URL.

Title: Automatic Teaching Platform on Vision Language Retrieval Augmented Generation

Authors: Ruslan Gokhman, Jialu Li, Youshan Zhang
Subjects: cs.CV, cs.CY
Abstract URL: https://arxiv.org/abs/2503.05464
Pdf URL: https://arxiv.org/pdf/2503.05464
Copy Paste: [[2503.05464]] Automatic Teaching Platform on Vision Language Retrieval Augmented Generation(https://arxiv.org/abs/2503.05464)
Keywords: generation
Abstract: Automating teaching presents unique challenges, as replicating human interaction and adaptability is complex. Automated systems often cannot provide nuanced, real-time feedback that aligns with students' individual learning paces or comprehension levels, which can hinder effective support for diverse needs. This is especially challenging in fields where abstract concepts require adaptive explanations. In this paper, we propose a vision language retrieval augmented generation (named VL-RAG) system that has the potential to bridge this gap by delivering contextually relevant, visually enriched responses that can enhance comprehension. By leveraging a database of tailored answers and images, the VL-RAG system can dynamically retrieve information aligned with specific questions, creating a more interactive and engaging experience that fosters deeper understanding and active student participation. It allows students to explore concepts visually and verbally, promoting deeper understanding and reducing the need for constant human oversight while maintaining flexibility to expand across different subjects and course material.

Title: FastMap: Fast Queries Initialization Based Vectorized HD Map Reconstruction Framework

Authors: Haotian Hu, Jingwei Xu, Fanyi Wang, Toyota Li, Yaonong Wang, Laifeng Hu, Zhiwang Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.05492
Pdf URL: https://arxiv.org/pdf/2503.05492
Copy Paste: [[2503.05492]] FastMap: Fast Queries Initialization Based Vectorized HD Map Reconstruction Framework(https://arxiv.org/abs/2503.05492)
Keywords: generation
Abstract: Reconstruction of high-definition maps is a crucial task in perceiving the autonomous driving environment, as its accuracy directly impacts the reliability of prediction and planning capabilities in downstream modules. Current vectorized map reconstruction methods based on the DETR framework encounter limitations due to the redundancy in the decoder structure, necessitating the stacking of six decoder layers to maintain performance, which significantly hampers computational efficiency. To tackle this issue, we introduce FastMap, an innovative framework designed to reduce decoder redundancy in existing approaches. FastMap optimizes the decoder architecture by employing a single-layer, two-stage transformer that achieves multilevel representation capabilities. Our framework eliminates the conventional practice of randomly initializing queries and instead incorporates a heatmap-guided query generation module during the decoding phase, which effectively maps image features into structured query vectors using learnable positional encoding. Additionally, we propose a geometry-constrained point-to-line loss mechanism for FastMap, which adeptly addresses the challenge of distinguishing highly homogeneous features that often arise in traditional point-to-point loss computations. Extensive experiments demonstrate that FastMap achieves state-of-the-art performance on both the nuScenes and Argoverse2 datasets, with its decoder operating 3.2x faster than the baseline. Code and more demos are available at this https URL.

Title: Mol-CADiff: Causality-Aware Autoregressive Diffusion for Molecule Generation

Authors: Md Atik Ahamed, Qiang Ye, Qiang Cheng
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.05499
Pdf URL: https://arxiv.org/pdf/2503.05499
Copy Paste: [[2503.05499]] Mol-CADiff: Causality-Aware Autoregressive Diffusion for Molecule Generation(https://arxiv.org/abs/2503.05499)
Keywords: generation
Abstract: The design of novel molecules with desired properties is a key challenge in drug discovery and materials science. Traditional methods rely on trial-and-error, while recent deep learning approaches have accelerated molecular generation. However, existing models struggle with generating molecules based on specific textual descriptions. We introduce Mol-CADiff, a novel diffusion-based framework that uses causal attention mechanisms for text-conditional molecular generation. Our approach explicitly models the causal relationship between textual prompts and molecular structures, overcoming key limitations in existing methods. We enhance dependency modeling both within and across modalities, enabling precise control over the generation process. Our extensive experiments demonstrate that Mol-CADiff outperforms state-of-the-art methods in generating diverse, novel, and chemically valid molecules, with better alignment to specified properties, enabling more intuitive language-driven molecular design.

Title: Post-Hoc Concept Disentanglement: From Correlated to Isolated Concept Representations

Authors: Eren Erogullari, Sebastian Lapuschkin, Wojciech Samek, Frederik Pahde
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.05522
Pdf URL: https://arxiv.org/pdf/2503.05522
Copy Paste: [[2503.05522]] Post-Hoc Concept Disentanglement: From Correlated to Isolated Concept Representations(https://arxiv.org/abs/2503.05522)
Keywords: generative
Abstract: Concept Activation Vectors (CAVs) are widely used to model human-understandable concepts as directions within the latent space of neural networks. They are trained by identifying directions from the activations of concept samples to those of non-concept samples. However, this method often produces similar, non-orthogonal directions for correlated concepts, such as "beard" and "necktie" within the CelebA dataset, which frequently co-occur in images of men. This entanglement complicates the interpretation of concepts in isolation and can lead to undesired effects in CAV applications, such as activation steering. To address this issue, we introduce a post-hoc concept disentanglement method that employs a non-orthogonality loss, facilitating the identification of orthogonal concept directions while preserving directional correctness. We evaluate our approach with real-world and controlled correlated concepts in CelebA and a synthetic FunnyBirds dataset with VGG16 and ResNet18 architectures. We further demonstrate the superiority of orthogonalized concept representations in activation steering tasks, allowing (1) the insertion of isolated concepts into input images through generative models and (2) the removal of concepts for effective shortcut suppression with reduced impact on correlated concepts in comparison to baseline CAVs.
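
The abstract does not spell out the form of its non-orthogonality loss. As one plausible illustration of the idea, the sketch below penalizes the mean squared cosine similarity between every pair of concept directions, which is zero exactly when all directions are mutually orthogonal. The function names and this particular penalty are assumptions, not the paper's definition, and the sketch omits the term that preserves directional correctness.

```python
import math


def cosine(u, v):
    """Cosine similarity between two nonzero vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def non_orthogonality_penalty(vectors):
    """Mean squared cosine similarity over all distinct pairs of vectors.

    Returns 0.0 for mutually orthogonal directions and 1.0 when every
    pair is parallel or anti-parallel. Assumes no vector is all zeros.
    """
    pairs = [(i, j)
             for i in range(len(vectors))
             for j in range(i + 1, len(vectors))]
    if not pairs:
        return 0.0
    total = sum(cosine(vectors[i], vectors[j]) ** 2 for i, j in pairs)
    return total / len(pairs)
```

For correlated concepts such as "beard" and "necktie", the learned directions would start with a high pairwise cosine similarity, and minimizing a penalty of this kind would push them apart.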

Title: Global graph features unveiled by unsupervised geometric deep learning

Authors: Mirja Granfors, Jesús Pineda, Blanca Zufiria Gerbolés, Joana B. Pereira, Carlo Manzo, Giovanni Volpe
Subjects: cs.LG, cond-mat.soft, physics.bio-ph, q-bio.QM
Abstract URL: https://arxiv.org/abs/2503.05560
Pdf URL: https://arxiv.org/pdf/2503.05560
Copy Paste: [[2503.05560]] Global graph features unveiled by unsupervised geometric deep learning(https://arxiv.org/abs/2503.05560)
Keywords: super-resolution
Abstract: Graphs provide a powerful framework for modeling complex systems, but their structural variability makes analysis and classification challenging. To address this, we introduce GAUDI (Graph Autoencoder Uncovering Descriptive Information), a novel unsupervised geometric deep learning framework that captures both local details and global structure. GAUDI employs an innovative hourglass architecture with hierarchical pooling and upsampling layers, linked through skip connections to preserve essential connectivity information throughout the encoding-decoding process. By mapping different realizations of a system - generated from the same underlying parameters - into a continuous, structured latent space, GAUDI disentangles invariant process-level features from stochastic noise. We demonstrate its power across multiple applications, including modeling small-world networks, characterizing protein assemblies from super-resolution microscopy, analyzing collective motion in the Vicsek model, and capturing age-related changes in brain connectivity. This approach not only improves the analysis of complex graphs but also provides new insights into emergent phenomena across diverse scientific domains.

Title: QArtSR: Quantization via Reverse-Module and Timestep-Retraining in One-Step Diffusion based Image Super-Resolution

Authors: Libo Zhu, Haotong Qin, Kaicheng Yang, Wenbo Li, Yong Guo, Yulun Zhang, Susanto Rahardja, Xiaokang Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.05584
Pdf URL: https://arxiv.org/pdf/2503.05584
Copy Paste: [[2503.05584]] QArtSR: Quantization via Reverse-Module and Timestep-Retraining in One-Step Diffusion based Image Super-Resolution(https://arxiv.org/abs/2503.05584)
Keywords: super-resolution
Abstract: One-step diffusion-based image super-resolution (OSDSR) models are showing increasingly superior performance nowadays. However, although their denoising steps are reduced to one and they can be quantized to 8-bit to reduce the costs further, there is still significant potential for OSDSR to quantize to lower bits. To explore more possibilities of quantized OSDSR, we propose an efficient method, Quantization via reverse-module and timestep-retraining for OSDSR, named QArtSR. Firstly, we investigate the influence of timestep value on the performance of quantized models. Then, we propose Timestep Retraining Quantization (TRQ) and Reversed Per-module Quantization (RPQ) strategies to calibrate the quantized model. Meanwhile, we adopt the module and image losses to update all quantized modules. We only update the parameters in quantization finetuning components, excluding the original weights. To ensure that all modules are fully finetuned, we add extended end-to-end training after per-module stage. Our 4-bit and 2-bit quantization experimental results indicate that QArtSR obtains superior effects against the recent leading comparison methods. The performance of 4-bit QArtSR is close to the full-precision one. Our code will be released at this https URL.
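To make the bit-width trade-off in the abstract above concrete, here is a generic sketch of symmetric uniform quantization and the error it introduces at 8, 4, and 2 bits. It is not QArtSR's TRQ or RPQ procedure, which the abstract does not specify in detail; the function names and the weight values are hypothetical.

```python
def fake_quantize(values, bits):
    """Symmetric uniform quantization: snap to a signed integer grid, then map back to floats."""
    qmax = 2 ** (bits - 1) - 1                      # 127 for 8-bit, 7 for 4-bit, 1 for 2-bit
    scale = max(abs(v) for v in values) / qmax      # one scale shared by the whole tensor
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]

def mean_abs_error(values, bits):
    """Average absolute difference between the original and quantized values."""
    quantized = fake_quantize(values, bits)
    return sum(abs(a - b) for a, b in zip(values, quantized)) / len(values)

# Hypothetical weights, for illustration only.
weights = [-0.9, -0.35, -0.1, 0.05, 0.2, 0.45, 0.8]

for bits in (8, 4, 2):
    print(bits, mean_abs_error(weights, bits))
```

The error grows as the bit width shrinks, which is the gap that calibration and retraining strategies for low-bit models aim to close.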

Title: Anti-Diffusion: Preventing Abuse of Modifications of Diffusion-Based Models

Authors: Zheng Li, Liangbin Xie, Jiantao Zhou, Xintao Wang, Haiwei Wu, Jinyu Tian
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.05595
Pdf URL: https://arxiv.org/pdf/2503.05595
Copy Paste: [[2503.05595]] Anti-Diffusion: Preventing Abuse of Modifications of Diffusion-Based Models(https://arxiv.org/abs/2503.05595)
Keywords: generation
Abstract: Although diffusion-based techniques have shown remarkable success in image generation and editing tasks, their abuse can lead to severe negative social impacts. Recently, some works have been proposed to provide defense against the abuse of diffusion-based methods. However, their protection may be limited in specific scenarios by manually defined prompts or the stable diffusion (SD) version. Furthermore, these methods solely focus on tuning methods, overlooking editing methods that could also pose a significant threat. In this work, we propose Anti-Diffusion, a privacy protection system designed for general diffusion-based methods, applicable to both tuning and editing techniques. To mitigate the limitations of manually defined prompts on defense performance, we introduce the prompt tuning (PT) strategy that enables precise expression of original images. To provide defense against both tuning and editing methods, we propose the semantic disturbance loss (SDL) to disrupt the semantic information of protected images. Given the limited research on the defense against editing methods, we develop a dataset named Defense-Edit to assess the defense performance of various methods. Experiments demonstrate that our Anti-Diffusion achieves superior defense performance across a wide range of diffusion-based techniques in different scenarios.

Title: TrajectoryCrafter: Redirecting Camera Trajectory for Monocular Videos via Diffusion Models

Authors: Mark YU, Wenbo Hu, Jinbo Xing, Ying Shan
Subjects: cs.CV, cs.AI, cs.GR
Abstract URL: https://arxiv.org/abs/2503.05638
Pdf URL: https://arxiv.org/pdf/2503.05638
Copy Paste: [[2503.05638]] TrajectoryCrafter: Redirecting Camera Trajectory for Monocular Videos via Diffusion Models(https://arxiv.org/abs/2503.05638)
Keywords: generation
Abstract: We present TrajectoryCrafter, a novel approach to redirect camera trajectories for monocular videos. By disentangling deterministic view transformations from stochastic content generation, our method achieves precise control over user-specified camera trajectories. We propose a novel dual-stream conditional video diffusion model that concurrently integrates point cloud renders and source videos as conditions, ensuring accurate view transformations and coherent 4D content generation. Instead of leveraging scarce multi-view videos, we curate a hybrid training dataset combining web-scale monocular videos with static multi-view datasets, by our innovative double-reprojection strategy, significantly fostering robust generalization across diverse scenes. Extensive evaluations on multi-view and large-scale monocular videos demonstrate the superior performance of our method.

Title: VideoPainter: Any-length Video Inpainting and Editing with Plug-and-Play Context Control

Authors: Yuxuan Bian, Zhaoyang Zhang, Xuan Ju, Mingdeng Cao, Liangbin Xie, Ying Shan, Qiang Xu
Subjects: cs.CV, cs.AI, cs.MM
Abstract URL: https://arxiv.org/abs/2503.05639
Pdf URL: https://arxiv.org/pdf/2503.05639
Copy Paste: [[2503.05639]] VideoPainter: Any-length Video Inpainting and Editing with Plug-and-Play Context Control(https://arxiv.org/abs/2503.05639)
Keywords: generation
Abstract: Video inpainting, which aims to restore corrupted video content, has experienced substantial progress. Despite these advances, existing methods, whether propagating unmasked region pixels through optical flow and receptive field priors, or extending image-inpainting models temporally, face challenges in generating fully masked objects or balancing the competing objectives of background context preservation and foreground generation in one model, respectively. To address these limitations, we propose a novel dual-stream paradigm VideoPainter that incorporates an efficient context encoder (comprising only 6% of the backbone parameters) to process masked videos and inject backbone-aware background contextual cues to any pre-trained video DiT, producing semantically consistent content in a plug-and-play manner. This architectural separation significantly reduces the model's learning complexity while enabling nuanced integration of crucial background context. We also introduce a novel target region ID resampling technique that enables any-length video inpainting, greatly enhancing our practical applicability. Additionally, we establish a scalable dataset pipeline leveraging current vision understanding models, contributing VPData and VPBench to facilitate segmentation-based inpainting training and assessment, the largest video inpainting dataset and benchmark to date with over 390K diverse clips. Using inpainting as a pipeline basis, we also explore downstream applications including video editing and video editing pair data generation, demonstrating competitive performance and significant practical potential. Extensive experiments demonstrate VideoPainter's superior performance in both any-length video inpainting and editing, across eight key metrics, including video quality, mask region preservation, and textual coherence.

Title: AIM-Fair: Advancing Algorithmic Fairness via Selectively Fine-Tuning Biased Models with Contextual Synthetic Data

Authors: Zengqun Zhao, Ziquan Liu, Yu Cao, Shaogang Gong, Ioannis Patras
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.05665
Pdf URL: https://arxiv.org/pdf/2503.05665
Copy Paste: [[2503.05665]] AIM-Fair: Advancing Algorithmic Fairness via Selectively Fine-Tuning Biased Models with Contextual Synthetic Data(https://arxiv.org/abs/2503.05665)
Keywords: generation, generative
Abstract: Recent advances in generative models have sparked research on improving model fairness with AI-generated data. However, existing methods often face limitations in the diversity and quality of synthetic data, leading to compromised fairness and overall model accuracy. Moreover, many approaches rely on the availability of demographic group labels, which are often costly to annotate. This paper proposes AIM-Fair, aiming to overcome these limitations and harness the potential of cutting-edge generative models in promoting algorithmic fairness. We investigate a fine-tuning paradigm starting from a biased model initially trained on real-world data without demographic annotations. This model is then fine-tuned using unbiased synthetic data generated by a state-of-the-art diffusion model to improve its fairness. Two key challenges are identified in this fine-tuning paradigm, 1) the low quality of synthetic data, which can still happen even with advanced generative models, and 2) the domain and bias gap between real and synthetic data. To address the limitation of synthetic data quality, we propose Contextual Synthetic Data Generation (CSDG) to generate data using a text-to-image diffusion model (T2I) with prompts generated by a context-aware LLM, ensuring both data diversity and control of bias in synthetic data. To resolve domain and bias shifts, we introduce a novel selective fine-tuning scheme in which only model parameters more sensitive to bias and less sensitive to domain shift are updated. Experiments on CelebA and UTKFace datasets show that our AIM-Fair improves model fairness while maintaining utility, outperforming both fully and partially fine-tuned approaches to model fairness.

Title: GoalFlow: Goal-Driven Flow Matching for Multimodal Trajectories Generation in End-to-End Autonomous Driving

Authors: Zebin Xing, Xingyu Zhang, Yang Hu, Bo Jiang, Tong He, Qian Zhang, Xiaoxiao Long, Wei Yin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.05689
Pdf URL: https://arxiv.org/pdf/2503.05689
Copy Paste: [[2503.05689]] GoalFlow: Goal-Driven Flow Matching for Multimodal Trajectories Generation in End-to-End Autonomous Driving(https://arxiv.org/abs/2503.05689)
Keywords: generation, generative
Abstract: We propose GoalFlow, an end-to-end autonomous driving method for generating high-quality multimodal trajectories. In autonomous driving scenarios, there is rarely a single suitable trajectory. Recent methods have increasingly focused on modeling multimodal trajectory distributions. However, they suffer from trajectory selection complexity and reduced trajectory quality due to high trajectory divergence and inconsistencies between guidance and scene information. To address these issues, we introduce GoalFlow, a novel method that effectively constrains the generative process to produce high-quality, multimodal trajectories. To resolve the trajectory divergence problem inherent in diffusion-based methods, GoalFlow constrains the generated trajectories by introducing a goal point. GoalFlow establishes a novel scoring mechanism that selects the most appropriate goal point from the candidate points based on scene information. Furthermore, GoalFlow employs an efficient generative method, Flow Matching, to generate multimodal trajectories, and incorporates a refined scoring mechanism to select the optimal trajectory from the candidates. Our experimental results, validated on Navsim (Dauner et al., 2024), demonstrate that GoalFlow achieves state-of-the-art performance, delivering robust multimodal trajectories for autonomous driving. GoalFlow achieved PDMS of 90.3, significantly surpassing other methods. Compared with other diffusion-policy-based methods, our approach requires only a single denoising step to obtain excellent performance. The code is available at this https URL.

Title: Multi-Fidelity Policy Gradient Algorithms

Authors: Xinjie Liu, Cyrus Neary, Kushagra Gupta, Christian Ellis, Ufuk Topcu, David Fridovich-Keil
Subjects: cs.LG, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2503.05696
Pdf URL: https://arxiv.org/pdf/2503.05696
Copy Paste: [[2503.05696]] Multi-Fidelity Policy Gradient Algorithms(https://arxiv.org/abs/2503.05696)
Keywords: generative
Abstract: Many reinforcement learning (RL) algorithms require large amounts of data, prohibiting their use in applications where frequent interactions with operational systems are infeasible, or high-fidelity simulations are expensive or unavailable. Meanwhile, low-fidelity simulators--such as reduced-order models, heuristic reward functions, or generative world models--can cheaply provide useful data for RL training, even if they are too coarse for direct sim-to-real transfer. We propose multi-fidelity policy gradients (MFPGs), an RL framework that mixes a small amount of data from the target environment with a large volume of low-fidelity simulation data to form unbiased, reduced-variance estimators (control variates) for on-policy policy gradients. We instantiate the framework by developing multi-fidelity variants of two policy gradient algorithms: REINFORCE and proximal policy optimization. Experimental results across a suite of simulated robotics benchmark problems demonstrate that when target-environment samples are limited, MFPG achieves up to 3.9x higher reward and improves training stability when compared to baselines that only use high-fidelity data. Moreover, even when the baselines are given more high-fidelity samples--up to 10x as many interactions with the target environment--MFPG continues to match or outperform them. Finally, we observe that MFPG is capable of training effective policies even when the low-fidelity environment is drastically different from the target environment. MFPG thus not only offers a novel paradigm for efficient sim-to-real transfer but also provides a principled approach to managing the trade-off between policy performance and data collection costs.
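The control-variate idea in the abstract above can be illustrated on a scalar toy problem. The sketch below is not the MFPG algorithm (which targets policy gradients); it only shows, under assumed Gaussian toy distributions, how a few expensive samples combined with many cheap, correlated ones give an unbiased estimate with lower variance. All names, sample sizes, and distributions are hypothetical.

```python
import random
import statistics

def control_variate_estimate(high, low_paired, low_mean, c=1.0):
    """Estimate E[high] as mean(high) - c * (mean(low_paired) - low_mean).

    high and low_paired share the same random draws, so they are correlated;
    low_mean is an accurate estimate of E[low] from many cheap samples.
    The correction term has zero expectation, so the estimate stays unbiased.
    """
    return statistics.fmean(high) - c * (statistics.fmean(low_paired) - low_mean)

def run_trial(rng, n_high=5, n_low=1000):
    # Toy setup (an assumption): shared noise z drives both fidelities.
    zs = [rng.gauss(0.0, 1.0) for _ in range(n_high)]
    high = [1.0 + z + 0.1 * rng.gauss(0.0, 1.0) for z in zs]   # target quantity, true mean 1.0
    low_paired = [0.5 + z for z in zs]                          # biased but correlated, mean 0.5
    low_mean = statistics.fmean([0.5 + rng.gauss(0.0, 1.0) for _ in range(n_low)])
    plain = statistics.fmean(high)                              # high-fidelity samples only
    mixed = control_variate_estimate(high, low_paired, low_mean)
    return plain, mixed

rng = random.Random(0)
trials = [run_trial(rng) for _ in range(200)]
plain = [t[0] for t in trials]
mixed = [t[1] for t in trials]

# Variance of each estimator across repeated trials; the mixed one should be smaller.
print(statistics.pvariance(plain), statistics.pvariance(mixed))
```

Note that the low-fidelity samples are biased (mean 0.5 rather than 1.0) yet still help, because only their correlation with the high-fidelity samples is used.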