2025-03-21

Title: Enforcing Cybersecurity Constraints for LLM-driven Robot Agents for Online Transactions

Authors: Shraddha Pradipbhai Shah, Aditya Vilas Deshpande
Subjects: cs.CR, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2503.15546
Pdf URL: https://arxiv.org/pdf/2503.15546
Copy Paste: [[2503.15546]] Enforcing Cybersecurity Constraints for LLM-driven Robot Agents for Online Transactions(https://arxiv.org/abs/2503.15546)
Keywords: anomaly
Abstract: The integration of Large Language Models (LLMs) into autonomous robotic agents for conducting online transactions poses significant cybersecurity challenges. This study aims to enforce robust cybersecurity constraints to mitigate the risks associated with data breaches, transaction fraud, and system manipulation. The background focuses on the rise of LLM-driven robotic systems in e-commerce, finance, and service industries, alongside the vulnerabilities they introduce. A novel security architecture combining blockchain technology with multi-factor authentication (MFA) and real-time anomaly detection was implemented to safeguard transactions. Key performance metrics such as transaction integrity, response time, and breach detection accuracy were evaluated, showing improved security and system performance. The results highlight that the proposed architecture reduced fraudulent transactions by 90%, improved breach detection accuracy to 98%, and ensured secure transaction validation within a latency of 0.05 seconds. These findings emphasize the importance of cybersecurity in the deployment of LLM-driven robotic systems and suggest a framework adaptable to various online platforms.

Title: Privacy-Aware RAG: Secure and Isolated Knowledge Retrieval

Authors: Pengcheng Zhou, Yinglun Feng, Zhongliang Yang
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2503.15548
Pdf URL: https://arxiv.org/pdf/2503.15548
Copy Paste: [[2503.15548]] Privacy-Aware RAG: Secure and Isolated Knowledge Retrieval(https://arxiv.org/abs/2503.15548)
Keywords: generative
Abstract: The widespread adoption of Retrieval-Augmented Generation (RAG) systems in real-world applications has heightened concerns about the confidentiality and integrity of their proprietary knowledge bases. These knowledge bases, which play a critical role in enhancing the generative capabilities of Large Language Models (LLMs), are increasingly vulnerable to breaches that could compromise sensitive information. To address these challenges, this paper proposes an advanced encryption methodology designed to protect RAG systems from unauthorized access and data leakage. Our approach encrypts both textual content and its corresponding embeddings prior to storage, ensuring that all data remains securely encrypted. This mechanism restricts access to authorized entities with the appropriate decryption keys, thereby significantly reducing the risk of unintended data exposure. Furthermore, we demonstrate that our encryption strategy preserves the performance and functionality of RAG pipelines, ensuring compatibility across diverse domains and applications. To validate the robustness of our method, we provide comprehensive security proofs that highlight its resilience against potential threats and vulnerabilities. These proofs also reveal limitations in existing approaches, which often lack robustness, adaptability, or reliance on open-source models. Our findings suggest that integrating advanced encryption techniques into the design and deployment of RAG systems can effectively enhance privacy safeguards. This research contributes to the ongoing discourse on improving security measures for AI-driven services and advocates for stricter data protection standards within RAG architectures.

Title: GReaTER: Generate Realistic Tabular data after data Enhancement and Reduction

Authors: Tung Sum Thomas Kwok, Chi-Hua Wang, Guang Cheng
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.15564
Pdf URL: https://arxiv.org/pdf/2503.15564
Copy Paste: [[2503.15564]] GReaTER: Generate Realistic Tabular data after data Enhancement and Reduction(https://arxiv.org/abs/2503.15564)
Keywords: in-context
Abstract: Tabular data synthesis involves not only multi-table synthesis but also generating multi-modal data (e.g., strings and categories), which enables diverse knowledge synthesis. However, separating numerical and categorical data has limited the effectiveness of tabular data generation. The GReaT (Generate Realistic Tabular Data) framework uses Large Language Models (LLMs) to encode entire rows, eliminating the need to partition data types. Despite this, the framework's performance is constrained by two issues: (1) tabular data entries lack sufficient semantic meaning, limiting the LLM's ability to leverage pre-trained knowledge for in-context learning, and (2) complex multi-table datasets struggle to establish effective relationships for collaboration. To address these, we propose GReaTER (Generate Realistic Tabular Data after data Enhancement and Reduction), which includes: (1) a data semantic enhancement system that improves the LLM's understanding of tabular data through mapping, enabling better in-context learning, and (2) a cross-table connecting method to establish efficient relationships across complex tables. Experimental results show that GReaTER outperforms the GReaT framework.
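
Illustrative sketch (not from the paper): the abstract describes encoding whole rows as text for an LLM and enhancing cryptic entries "through mapping". The snippet below shows one plausible reading of that idea. The "column is value" template, the mapping tables, and the function names `enhance` and `serialize` are all assumptions made for illustration; the authors' actual mapping and prompt format are not given in the abstract.

```python
import random

# Hypothetical lookup tables: cryptic codes -> descriptive labels.
SEMANTIC_MAP = {
    "sex": {"M": "male", "F": "female"},
    "edu": {"1": "high school", "2": "bachelor's degree"},
}
# Hypothetical renaming of terse column names.
COLUMN_NAMES = {"sex": "gender", "edu": "education level", "age": "age"}


def enhance(row):
    """Replace coded values and terse column names with descriptive text."""
    out = {}
    for col, val in row.items():
        label = SEMANTIC_MAP.get(col, {}).get(str(val), val)
        out[COLUMN_NAMES.get(col, col)] = label
    return out


def serialize(row, rng=None):
    """Turn a row into one sentence; optionally shuffle the column order."""
    items = list(row.items())
    if rng is not None:
        rng.shuffle(items)
    return ", ".join(f"{col} is {val}" for col, val in items)


raw = {"age": 26, "sex": "M", "edu": "2"}
print(serialize(enhance(raw)))
print(serialize(enhance(raw), rng=random.Random(0)))  # same facts, shuffled order
```

Because numeric and categorical values all end up as tokens in one sentence, no separate handling per data type is needed, which is the property the abstract attributes to row-level encoding.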

Title: Towards Unified Latent Space for 3D Molecular Latent Diffusion Modeling

Authors: Yanchen Luo, Zhiyuan Liu, Yi Zhao, Sihang Li, Kenji Kawaguchi, Tat-Seng Chua, Xiang Wang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.15567
Pdf URL: https://arxiv.org/pdf/2503.15567
Copy Paste: [[2503.15567]] Towards Unified Latent Space for 3D Molecular Latent Diffusion Modeling(https://arxiv.org/abs/2503.15567)
Keywords: diffusion
Abstract: 3D molecule generation is crucial for drug discovery and material science, requiring models to process complex multi-modalities, including atom types, chemical bonds, and 3D coordinates. A key challenge is integrating these modalities of different shapes while maintaining SE(3) equivariance for 3D coordinates. To achieve this, existing approaches typically maintain separate latent spaces for invariant and equivariant modalities, reducing efficiency in both training and sampling. In this work, we propose Unified Variational Auto-Encoder for 3D Molecular Latent Diffusion Modeling (UAE-3D), a multi-modal VAE that compresses 3D molecules into latent sequences from a unified latent space, while maintaining near-zero reconstruction error. This unified latent space eliminates the complexities of handling multi-modality and equivariance when performing latent diffusion modeling. We demonstrate this by employing the Diffusion Transformer (a general-purpose diffusion model without any molecular inductive bias) for latent generation. Extensive experiments on GEOM-Drugs and QM9 datasets demonstrate that our method establishes new benchmarks in both de novo and conditional 3D molecule generation, achieving leading efficiency and quality.

Title: Understanding the Generalization of In-Context Learning in Transformers: An Empirical Study

Authors: Xingxuan Zhang, Haoran Wang, Jiansheng Li, Yuan Xue, Shikai Guan, Renzhe Xu, Hao Zou, Han Yu, Peng Cui
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.15579
Pdf URL: https://arxiv.org/pdf/2503.15579
Copy Paste: [[2503.15579]] Understanding the Generalization of In-Context Learning in Transformers: An Empirical Study(https://arxiv.org/abs/2503.15579)
Keywords: in-context
Abstract: Large language models (LLMs) like GPT-4 and LLaMA-3 utilize the powerful in-context learning (ICL) capability of Transformer architecture to learn on the fly from limited examples. While ICL underpins many LLM applications, its full potential remains hindered by a limited understanding of its generalization boundaries and vulnerabilities. We present a systematic investigation of transformers' generalization capability with ICL relative to training data coverage by defining a task-centric framework along three dimensions: inter-problem, intra-problem, and intra-task generalization. Through extensive simulation and real-world experiments, encompassing tasks such as function fitting, API calling, and translation, we find that transformers lack inter-problem generalization with ICL, but excel in intra-task and intra-problem generalization. When the training data includes a greater variety of mixed tasks, it significantly enhances the generalization ability of ICL on unseen tasks and even on known simple tasks. This guides us in designing training data to maximize the diversity of tasks covered and to combine different tasks whenever possible, rather than solely focusing on the target task for testing.

Title: CAM-Seg: A Continuous-valued Embedding Approach for Semantic Image Generation

Authors: Masud Ahmed, Zahid Hasan, Syed Arefinul Haque, Abu Zaher Md Faridee, Sanjay Purushotham, Suya You, Nirmalya Roy
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.15617
Pdf URL: https://arxiv.org/pdf/2503.15617
Copy Paste: [[2503.15617]] CAM-Seg: A Continuous-valued Embedding Approach for Semantic Image Generation(https://arxiv.org/abs/2503.15617)
Keywords: diffusion
Abstract: Traditional transformer-based semantic segmentation relies on quantized embeddings. However, our analysis reveals that autoencoder accuracy on segmentation masks using quantized embeddings (e.g., VQ-VAE) is 8% lower than with continuous-valued embeddings (e.g., KL-VAE). Motivated by this, we propose a continuous-valued embedding framework for semantic segmentation. By reformulating semantic mask generation as a continuous image-to-embedding diffusion process, our approach eliminates the need for discrete latent representations while preserving fine-grained spatial and semantic details. Our key contribution includes a diffusion-guided autoregressive transformer that learns a continuous semantic embedding space by modeling long-range dependencies in image features. Our framework contains a unified architecture combining a VAE encoder for continuous feature extraction, a diffusion-guided transformer for conditioned embedding generation, and a VAE decoder for semantic mask reconstruction. Our setting facilitates zero-shot domain adaptation capabilities enabled by the continuity of the embedding space. Experiments across diverse datasets (e.g., Cityscapes and domain-shifted variants) demonstrate state-of-the-art robustness to distribution shifts, including adverse weather (e.g., fog, snow) and viewpoint variations. Our model also exhibits strong noise resilience, achieving robust performance (≈ 95% AP compared to baseline) under Gaussian noise, moderate motion blur, and moderate brightness/contrast variations, while experiencing only a moderate impact (≈ 90% AP compared to baseline) from 50% salt and pepper noise, saturation and hue shifts. Code available: this https URL

Title: DiffPortrait360: Consistent Portrait Diffusion for 360 View Synthesis

Authors: Yuming Gu, Phong Tran, Yujian Zheng, Hongyi Xu, Heyuan Li, Adilbek Karmanov, Hao Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.15667
Pdf URL: https://arxiv.org/pdf/2503.15667
Copy Paste: [[2503.15667]] DiffPortrait360: Consistent Portrait Diffusion for 360 View Synthesis(https://arxiv.org/abs/2503.15667)
Keywords: diffusion
Abstract: Generating high-quality 360-degree views of human heads from single-view images is essential for enabling accessible immersive telepresence applications and scalable personalized content creation. While cutting-edge methods for full head generation are limited to modeling realistic human heads, the latest diffusion-based approaches for style-omniscient head synthesis can produce only frontal views and struggle with view consistency, preventing their conversion into true 3D models for rendering from arbitrary angles. We introduce a novel approach that generates fully consistent 360-degree head views, accommodating human, stylized, and anthropomorphic forms, including accessories like glasses and hats. Our method builds on the DiffPortrait3D framework, incorporating a custom ControlNet for back-of-head detail generation and a dual appearance module to ensure global front-back consistency. By training on continuous view sequences and integrating a back reference image, our approach achieves robust, locally continuous view synthesis. Our model can be used to produce high-quality neural radiance fields (NeRFs) for real-time, free-viewpoint rendering, outperforming state-of-the-art methods in object synthesis and 360-degree head generation for very challenging input portraits.

Title: CHROME: Clothed Human Reconstruction with Occlusion-Resilience and Multiview-Consistency from a Single Image

Authors: Arindam Dutta, Meng Zheng, Zhongpai Gao, Benjamin Planche, Anwesha Choudhuri, Terrence Chen, Amit K. Roy-Chowdhury, Ziyan Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.15671
Pdf URL: https://arxiv.org/pdf/2503.15671
Copy Paste: [[2503.15671]] CHROME: Clothed Human Reconstruction with Occlusion-Resilience and Multiview-Consistency from a Single Image(https://arxiv.org/abs/2503.15671)
Keywords: diffusion
Abstract: Reconstructing clothed humans from a single image is a fundamental task in computer vision with wide-ranging applications. Although existing monocular clothed human reconstruction solutions have shown promising results, they often rely on the assumption that the human subject is in an occlusion-free environment. Thus, when encountering in-the-wild occluded images, these algorithms produce multiview inconsistent and fragmented reconstructions. Additionally, most algorithms for monocular 3D human reconstruction leverage geometric priors such as SMPL annotations for training and inference, which are extremely challenging to acquire in real-world applications. To address these limitations, we propose CHROME: Clothed Human Reconstruction with Occlusion-Resilience and Multiview-ConsistEncy from a Single Image, a novel pipeline designed to reconstruct occlusion-resilient 3D humans with multiview consistency from a single occluded image, without requiring either ground-truth geometric prior annotations or 3D supervision. Specifically, CHROME leverages a multiview diffusion model to first synthesize occlusion-free human images from the occluded input, compatible with off-the-shelf pose control to explicitly enforce cross-view consistency during synthesis. A 3D reconstruction model is then trained to predict a set of 3D Gaussians conditioned on both the occluded input and synthesized views, aligning cross-view details to produce a cohesive and accurate 3D representation. CHROME achieves significant improvements in terms of both novel view synthesis (up to 3 dB PSNR) and geometric reconstruction under challenging conditions.

Title: GASP: Unifying Geometric and Semantic Self-Supervised Pre-training for Autonomous Driving

Authors: William Ljungbergh, Adam Lilja, Adam Tonderski, Arvid Laveno Ling, Carl Lindström, Willem Verbeke, Junsheng Fu, Christoffer Petersson, Lars Hammarstrand, Michael Felsberg
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2503.15672
Pdf URL: https://arxiv.org/pdf/2503.15672
Copy Paste: [[2503.15672]] GASP: Unifying Geometric and Semantic Self-Supervised Pre-training for Autonomous Driving(https://arxiv.org/abs/2503.15672)
Keywords: self-supervised, foundation model
Abstract: Self-supervised pre-training based on next-token prediction has enabled large language models to capture the underlying structure of text, and has led to unprecedented performance on a large array of tasks when applied at scale. Similarly, autonomous driving generates vast amounts of spatiotemporal data, alluding to the possibility of harnessing scale to learn the underlying geometric and semantic structure of the environment and its evolution over time. In this direction, we propose a geometric and semantic self-supervised pre-training method, GASP, that learns a unified representation by predicting, at any queried future point in spacetime, (1) general occupancy, capturing the evolving structure of the 3D scene; (2) ego occupancy, modeling the ego vehicle path through the environment; and (3) distilled high-level features from a vision foundation model. By modeling geometric and semantic 4D occupancy fields instead of raw sensor measurements, the model learns a structured, generalizable representation of the environment and its evolution through time. We validate GASP on multiple autonomous driving benchmarks, demonstrating significant improvements in semantic occupancy forecasting, online mapping, and ego trajectory prediction. Our results demonstrate that continuous 4D geometric and semantic occupancy prediction provides a scalable and effective pre-training paradigm for autonomous driving. For code and additional visualizations, see this https URL.

Title: The Change You Want To Detect: Semantic Change Detection In Earth Observation With Hybrid Data Generation

Authors: Benidir Yanis, Gonthier Nicolas, Mallet Clement
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.15683
Pdf URL: https://arxiv.org/pdf/2503.15683
Copy Paste: [[2503.15683]] The Change You Want To Detect: Semantic Change Detection In Earth Observation With Hybrid Data Generation(https://arxiv.org/abs/2503.15683)
Keywords: generative
Abstract: Bi-temporal change detection at scale based on Very High Resolution (VHR) images is crucial for Earth monitoring. This remains poorly addressed so far: methods either require large volumes of annotated data (semantic case), or are limited to restricted datasets (binary set-ups). Most approaches do not exhibit the versatility required for temporal and spatial adaptation: simplicity in architecture design and pretraining on realistic and comprehensive datasets. Synthetic datasets are the key solution but still fail to handle complex and diverse scenes. In this paper, we present HySCDG, a generative pipeline for creating a large hybrid semantic change detection dataset that contains both real VHR images and inpainted ones, along with land cover semantic maps at both dates and the change map. Being semantically and spatially guided, HySCDG generates realistic images, leading to a comprehensive and hybrid transfer-proof dataset FSC-180k. We evaluate FSC-180k on five change detection cases (binary and semantic), from zero-shot to mixed and sequential training, and also under low data regime training. Experiments demonstrate that pretraining on our hybrid dataset leads to a significant performance boost, outperforming SyntheWorld, a fully synthetic dataset, in every configuration. All codes, models, and data are available here: this https URL.

Title: Multi-focal Conditioned Latent Diffusion for Person Image Synthesis

Authors: Jiaqi Liu, Jichao Zhang, Paolo Rota, Nicu Sebe
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.15686
Pdf URL: https://arxiv.org/pdf/2503.15686
Copy Paste: [[2503.15686]] Multi-focal Conditioned Latent Diffusion for Person Image Synthesis(https://arxiv.org/abs/2503.15686)
Keywords: diffusion
Abstract: The Latent Diffusion Model (LDM) has demonstrated strong capabilities in high-resolution image generation and has been widely employed for Pose-Guided Person Image Synthesis (PGPIS), yielding promising results. However, the compression process of LDM often results in the deterioration of details, particularly in sensitive areas such as facial features and clothing textures. In this paper, we propose a Multi-focal Conditioned Latent Diffusion (MCLD) method to address these limitations by conditioning the model on disentangled, pose-invariant features from these sensitive regions. Our approach utilizes a multi-focal condition aggregation module, which effectively integrates facial identity and texture-specific information, enhancing the model's ability to produce images that are realistic in appearance and consistent in identity. Our method demonstrates consistent identity and appearance generation on the DeepFashion dataset and enables flexible person image editing due to its generation consistency. The code is available at this https URL.

Title: Uncertainty-Aware Diffusion Guided Refinement of 3D Scenes

Authors: Sarosij Bose, Arindam Dutta, Sayak Nag, Junge Zhang, Jiachen Li, Konstantinos Karydis, Amit K. Roy-Chowdhury
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.15742
Pdf URL: https://arxiv.org/pdf/2503.15742
Copy Paste: [[2503.15742]] Uncertainty-Aware Diffusion Guided Refinement of 3D Scenes(https://arxiv.org/abs/2503.15742)
Keywords: diffusion, generative
Abstract: Reconstructing 3D scenes from a single image is a fundamentally ill-posed task due to the severely under-constrained nature of the problem. Consequently, when the scene is rendered from novel camera views, existing single image to 3D reconstruction methods render incoherent and blurry views. This problem is exacerbated when the unseen regions are far away from the input camera. In this work, we address these inherent limitations in existing single image-to-3D scene feedforward networks. To alleviate the poor performance due to insufficient information beyond the input image's view, we leverage a strong generative prior in the form of a pre-trained latent video diffusion model, for iterative refinement of a coarse scene represented by optimizable Gaussian parameters. To ensure that the style and texture of the generated images align with that of the input image, we incorporate on-the-fly Fourier-style transfer between the generated images and the input image. Additionally, we design a semantic uncertainty quantification module that calculates the per-pixel entropy and yields uncertainty maps used to guide the refinement process from the most confident pixels while discarding the remaining highly uncertain ones. We conduct extensive experiments on real-world scene datasets, including in-domain RealEstate-10K and out-of-domain KITTI-v2, showing that our approach can provide more realistic and high-fidelity novel view synthesis results compared to existing state-of-the-art methods.
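
Illustrative sketch (not from the paper): the abstract mentions a module that computes per-pixel entropy and uses the resulting uncertainty map to keep confident pixels and discard uncertain ones. The snippet below shows the generic computation, assuming each pixel carries a probability vector over semantic classes. The function names, the toy probabilities, and the threshold value are hypothetical; the paper's actual module and its inputs are not specified in the abstract.

```python
import math


def pixel_entropy(probs):
    """Shannon entropy (in nats) of one pixel's class-probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def uncertainty_map(prob_map):
    """Entropy for every pixel of an H x W grid of probability vectors."""
    return [[pixel_entropy(px) for px in row] for row in prob_map]


def confident_mask(entropy_map, threshold):
    """True where the pixel is confident enough to guide refinement."""
    return [[h <= threshold for h in row] for row in entropy_map]


# Toy 1 x 3 "image" with three semantic classes per pixel (made-up values).
prob_map = [[
    [1.0, 0.0, 0.0],        # fully confident
    [0.9, 0.05, 0.05],      # mostly confident
    [1 / 3, 1 / 3, 1 / 3],  # maximally uncertain
]]
entropy = uncertainty_map(prob_map)
mask = confident_mask(entropy, threshold=0.5)  # threshold is an assumption
print(entropy, mask)
```

Entropy is zero for a one-hot prediction and reaches its maximum, log(K) for K classes, when the prediction is uniform, so thresholding it separates confident pixels from ambiguous ones.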

Title: RL4Med-DDPO: Reinforcement Learning for Controlled Guidance Towards Diverse Medical Image Generation using Vision-Language Foundation Models

Authors: Parham Saremi, Amar Kumar, Mohammed Mohammed, Zahra TehraniNasab, Tal Arbel
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.15784
Pdf URL: https://arxiv.org/pdf/2503.15784
Copy Paste: [[2503.15784]] RL4Med-DDPO: Reinforcement Learning for Controlled Guidance Towards Diverse Medical Image Generation using Vision-Language Foundation Models(https://arxiv.org/abs/2503.15784)
Keywords: diffusion, foundation model
Abstract: Vision-Language Foundation Models (VLFMs) have shown a tremendous increase in performance in terms of generating high-resolution, photorealistic natural images. While VLFMs show a rich understanding of semantic content across modalities, they often struggle with fine-grained alignment tasks that require precise correspondence between image regions and textual descriptions, a limitation in medical imaging, where accurate localization and detection of clinical features are essential for diagnosis and analysis. To address this issue, we propose a multi-stage architecture where a pre-trained VLFM provides a cursory semantic understanding, while a reinforcement learning (RL) algorithm refines the alignment through an iterative process that optimizes for understanding semantic context. The reward signal is designed to align the semantic information of the text with synthesized images. We demonstrate the effectiveness of our method on a medical imaging skin dataset where the generated images exhibit improved generation quality and alignment with the prompt over the fine-tuned Stable Diffusion. We also show that the synthesized samples could be used to improve disease classifier performance for underrepresented subgroups through augmentation.

Title: Computation-Efficient and Recognition-Friendly 3D Point Cloud Privacy Protection

Authors: Haotian Ma, Lin Gu, Siyi Wu, Yingying Zhu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.15818
Pdf URL: https://arxiv.org/pdf/2503.15818
Copy Paste: [[2503.15818]] Computation-Efficient and Recognition-Friendly 3D Point Cloud Privacy Protection(https://arxiv.org/abs/2503.15818)
Keywords: generative
Abstract: 3D point clouds have been widely used in applications such as self-driving cars, robotics, and CAD models. To the best of our knowledge, these applications raise the issue of privacy leakage in 3D point clouds, which has not been well studied. Unlike 2D image privacy, which is related to texture and 2D geometric structure, a 3D point cloud is texture-less and relevant only to 3D geometric structure. In this work, we define the 3D point cloud privacy problem and propose an efficient privacy-preserving framework named PointFlowGMM that can support downstream classification and segmentation tasks without seeing the original data. Using a flow-based generative model, the point cloud is projected into a latent Gaussian mixture distributed subspace. We further design a novel angular similarity loss to obfuscate the original geometric structure and reduce the model size from 767MB to 120MB without a decrease in recognition performance. The projected point cloud in the latent space is then randomly rotated by an orthogonal transformation to further protect the original geometric structure. Because the class-to-class relationship is preserved after rotation, the protected point cloud can still support the recognition task. We evaluated our model on multiple datasets and achieved recognition results on encrypted point clouds comparable to those on the original point clouds.
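
Illustrative sketch (not from the paper): the abstract states that latent codes are randomly rotated by an orthogonal transformation and that class-to-class relationships survive the rotation. The snippet below demonstrates the underlying linear-algebra fact, namely that an orthogonal matrix preserves pairwise distances. The Gram-Schmidt construction, the 4-dimensional toy latents, and all function names are assumptions for illustration; how PointFlowGMM samples its rotation is not described in the abstract.

```python
import math
import random


def random_orthogonal(dim, rng):
    """Random orthogonal matrix built by Gram-Schmidt on Gaussian vectors."""
    basis = []
    while len(basis) < dim:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for b in basis:  # remove components along earlier basis vectors
            dot = sum(x * y for x, y in zip(v, b))
            v = [x - dot * y for x, y in zip(v, b)]
        norm = math.sqrt(sum(x * x for x in v))
        if norm > 1e-8:  # skip the (unlikely) degenerate draw
            basis.append([x / norm for x in v])
    return basis  # rows are orthonormal


def rotate(points, q):
    """Apply the orthogonal matrix q to every point."""
    return [[sum(qi * xi for qi, xi in zip(row, p)) for row in q] for p in points]


def distance(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


rng = random.Random(0)
latents = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(3)]  # toy codes
protected = rotate(latents, random_orthogonal(4, rng))

# Pairwise distances agree up to floating-point error.
print(distance(latents[0], latents[1]), distance(protected[0], protected[1]))
```

Since distances and angles between latent codes are unchanged, a classifier that depends only on relative geometry in the latent space can still operate on the rotated codes, while the coordinates themselves no longer match the original ones.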

Title: EDEN: Enhanced Diffusion for High-quality Large-motion Video Frame Interpolation

Authors: Zihao Zhang, Haoran Chen, Haoyu Zhao, Guansong Lu, Yanwei Fu, Hang Xu, Zuxuan Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.15831
Pdf URL: https://arxiv.org/pdf/2503.15831
Copy Paste: [[2503.15831]] EDEN: Enhanced Diffusion for High-quality Large-motion Video Frame Interpolation(https://arxiv.org/abs/2503.15831)
Keywords: diffusion
Abstract: Handling complex or nonlinear motion patterns has long posed challenges for video frame interpolation. Although recent advances in diffusion-based methods offer improvements over traditional optical flow-based approaches, they still struggle to generate sharp, temporally consistent frames in scenarios with large motion. To address this limitation, we introduce EDEN, an Enhanced Diffusion for high-quality large-motion vidEo frame iNterpolation. Our approach first utilizes a transformer-based tokenizer to produce refined latent representations of the intermediate frames for diffusion models. We then enhance the diffusion transformer with temporal attention across the process and incorporate a start-end frame difference embedding to guide the generation of dynamic motion. Extensive experiments demonstrate that EDEN achieves state-of-the-art results across popular benchmarks, including nearly a 10% LPIPS reduction on DAVIS and SNU-FILM, and an 8% improvement on DAIN-HD.

Title: Fùxì: A Benchmark for Evaluating Language Models on Ancient Chinese Text Understanding and Generation

Authors: Shangqing Zhao, Yuhao Zhou, Yupei Ren, Zhe Chen, Chenghao Jia, Fang Zhe, Zhaoguang Long, Shu Liu, Man Lan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.15837
Pdf URL: https://arxiv.org/pdf/2503.15837
Copy Paste: [[2503.15837]] Fùxì: A Benchmark for Evaluating Language Models on Ancient Chinese Text Understanding and Generation(https://arxiv.org/abs/2503.15837)
Keywords: generative
Abstract: Ancient Chinese text processing presents unique challenges for large language models (LLMs) due to its distinct linguistic features, complex structural constraints, and rich cultural context. While existing benchmarks have primarily focused on evaluating comprehension through multiple-choice questions, there remains a critical gap in assessing models' generative capabilities in classical Chinese. We introduce Fùxì, a comprehensive benchmark that evaluates both understanding and generation capabilities across 21 diverse tasks. Our benchmark distinguishes itself through three key contributions: (1) balanced coverage of both comprehension and generation tasks, including novel tasks like poetry composition and couplet completion, (2) specialized evaluation metrics designed specifically for classical Chinese text generation, combining rule-based verification with fine-tuned LLM evaluators, and (3) a systematic assessment framework that considers both linguistic accuracy and cultural authenticity. Through extensive evaluation of state-of-the-art LLMs, we reveal significant performance gaps between understanding and generation tasks, with models achieving promising results in comprehension but struggling considerably in generation tasks, particularly those requiring deep cultural knowledge and adherence to classical formats. Our findings highlight the current limitations in ancient Chinese text processing and provide insights for future model development. The benchmark, evaluation toolkit, and baseline results are publicly available to facilitate research in this domain.

Title: Zero-1-to-A: Zero-Shot One Image to Animatable Head Avatars Using Video Diffusion

Authors: Zhou Zhenglin, Ma Fan, Fan Hehe, Chua Tat-Seng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.15851
Pdf URL: https://arxiv.org/pdf/2503.15851
Copy Paste: [[2503.15851]] Zero-1-to-A: Zero-Shot One Image to Animatable Head Avatars Using Video Diffusion(https://arxiv.org/abs/2503.15851)
Keywords: diffusion
Abstract: Animatable head avatar generation typically requires extensive data for training. To reduce the data requirements, a natural solution is to leverage existing data-free static avatar generation methods, such as pre-trained diffusion models with score distillation sampling (SDS), which align avatars with pseudo ground-truth outputs from the diffusion model. However, directly distilling 4D avatars from video diffusion often leads to over-smooth results due to spatial and temporal inconsistencies in the generated video. To address this issue, we propose Zero-1-to-A, a robust method that synthesizes a spatial and temporal consistency dataset for 4D avatar reconstruction using the video diffusion model. Specifically, Zero-1-to-A iteratively constructs video datasets and optimizes animatable avatars in a progressive manner, ensuring that avatar quality increases smoothly and consistently throughout the learning process. This progressive learning involves two stages: (1) Spatial Consistency Learning fixes expressions and learns from front-to-side views, and (2) Temporal Consistency Learning fixes views and learns from relaxed to exaggerated expressions, generating 4D avatars in a simple-to-complex manner. Extensive experiments demonstrate that Zero-1-to-A improves fidelity, animation quality, and rendering speed compared to existing diffusion-based methods, providing a solution for lifelike avatar creation. Code is publicly available at: this https URL.

Title: VideoRFSplat: Direct Scene-Level Text-to-3D Gaussian Splatting Generation with Flexible Pose and Multi-View Joint Modeling

Authors: Hyojun Go, Byeongjun Park, Hyelin Nam, Byung-Hoon Kim, Hyungjin Chung, Changick Kim
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.15855
Pdf URL: https://arxiv.org/pdf/2503.15855
Copy Paste: [[2503.15855]] VideoRFSplat: Direct Scene-Level Text-to-3D Gaussian Splatting Generation with Flexible Pose and Multi-View Joint Modeling(https://arxiv.org/abs/2503.15855)
Keywords: generative
Abstract: We propose VideoRFSplat, a direct text-to-3D model leveraging a video generation model to generate realistic 3D Gaussian Splatting (3DGS) for unbounded real-world scenes. To generate diverse camera poses and unbounded spatial extent of real-world scenes, while ensuring generalization to arbitrary text prompts, previous methods fine-tune 2D generative models to jointly model camera poses and multi-view images. However, these methods suffer from instability when extending 2D generative models to joint modeling due to the modality gap, which necessitates additional models to stabilize training and inference. In this work, we propose an architecture and a sampling strategy to jointly model multi-view images and camera poses when fine-tuning a video generation model. Our core idea is a dual-stream architecture that attaches a dedicated pose generation model alongside a pre-trained video generation model via communication blocks, generating multi-view images and camera poses through separate streams. This design reduces interference between the pose and image modalities. Additionally, we propose an asynchronous sampling strategy that denoises camera poses faster than multi-view images, allowing rapidly denoised poses to condition multi-view generation, reducing mutual ambiguity and enhancing cross-modal consistency. Trained on multiple large-scale real-world datasets (RealEstate10K, MVImgNet, DL3DV-10K, ACID), VideoRFSplat outperforms existing text-to-3D direct generation methods that heavily depend on post-hoc refinement via score distillation sampling, achieving superior results without such refinement.

Title: UniCoRN: Latent Diffusion-based Unified Controllable Image Restoration Network across Multiple Degradations

Authors: Debabrata Mandal, Soumitri Chattopadhyay, Guansen Tong, Praneeth Chakravarthula
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.15868
Pdf URL: https://arxiv.org/pdf/2503.15868
Copy Paste: [[2503.15868]] UniCoRN: Latent Diffusion-based Unified Controllable Image Restoration Network across Multiple Degradations(https://arxiv.org/abs/2503.15868)
Keywords: diffusion
Abstract: Image restoration is essential for enhancing degraded images across computer vision tasks. However, most existing methods address only a single type of degradation (e.g., blur, noise, or haze) at a time, limiting their real-world applicability where multiple degradations often occur simultaneously. In this paper, we propose UniCoRN, a unified image restoration approach capable of handling multiple degradation types simultaneously using a multi-head diffusion model. Specifically, we uncover the potential of low-level visual cues extracted from images in guiding a controllable diffusion model for real-world image restoration, and we design a multi-head control network adaptable via a mixture-of-experts strategy. We train our model without any prior assumption of specific degradations, through a carefully designed curriculum learning recipe. We also introduce MetaRestore, a metalens imaging benchmark containing images with multiple degradations and artifacts. Extensive evaluations on several challenging datasets, including our benchmark, demonstrate that our method achieves significant performance gains and can robustly restore images with severe degradations.

Title: Repurposing 2D Diffusion Models with Gaussian Atlas for 3D Generation

Authors: Tiange Xiang, Kai Li, Chengjiang Long, Christian Häne, Peihong Guo, Scott Delp, Ehsan Adeli, Li Fei-Fei
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.15877
Pdf URL: https://arxiv.org/pdf/2503.15877
Copy Paste: [[2503.15877]] Repurposing 2D Diffusion Models with Gaussian Atlas for 3D Generation(https://arxiv.org/abs/2503.15877)
Keywords: diffusion
Abstract: Recent advances in text-to-image diffusion models have been driven by the increasing availability of paired 2D data. However, the development of 3D diffusion models has been hindered by the scarcity of high-quality 3D data, resulting in less competitive performance compared to their 2D counterparts. To address this challenge, we propose repurposing pre-trained 2D diffusion models for 3D object generation. We introduce Gaussian Atlas, a novel representation that utilizes dense 2D grids, enabling the fine-tuning of 2D diffusion models to generate 3D Gaussians. Our approach demonstrates successful transfer learning from a pre-trained 2D diffusion model to a 2D manifold flattened from 3D structures. To support model training, we compile GaussianVerse, a large-scale dataset comprising 205K high-quality 3D Gaussian fittings of various 3D objects. Our experimental results show that text-to-image diffusion models can be effectively adapted for 3D content generation, bridging the gap between 2D and 3D modeling.

Title: Jasmine: Harnessing Diffusion Prior for Self-supervised Depth Estimation

Authors: Jiyuan Wang, Chunyu Lin, Cheng Guan, Lang Nie, Jing He, Haodong Li, Kang Liao, Yao Zhao
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.15905
Pdf URL: https://arxiv.org/pdf/2503.15905
Copy Paste: [[2503.15905]] Jasmine: Harnessing Diffusion Prior for Self-supervised Depth Estimation(https://arxiv.org/abs/2503.15905)
Keywords: diffusion, self-supervised
Abstract: In this paper, we propose Jasmine, the first Stable Diffusion (SD)-based self-supervised framework for monocular depth estimation, which effectively harnesses SD's visual priors to enhance the sharpness and generalization of unsupervised prediction. Previous SD-based methods are all supervised, since adapting diffusion models for dense prediction requires high-precision supervision. In contrast, self-supervised reprojection suffers from inherent challenges (e.g., occlusions, texture-less regions, illumination variance), and the predictions exhibit blurs and artifacts that severely compromise SD's latent priors. To resolve this, we construct a novel surrogate task of hybrid image reconstruction. Without any additional supervision, it preserves the detail priors of SD models by reconstructing the images themselves while preventing depth estimation from degradation. Furthermore, to address the inherent misalignment between SD's scale- and shift-invariant estimation and self-supervised scale-invariant depth estimation, we build the Scale-Shift GRU. It not only bridges this distribution gap but also isolates the fine-grained texture of SD output against the interference of reprojection loss. Extensive experiments demonstrate that Jasmine achieves SoTA performance on the KITTI benchmark and exhibits superior zero-shot generalization across multiple datasets.

Title: Text-Driven Diffusion Model for Sign Language Production

Authors: Jiayi He, Xu Wang, Ruobei Zhang, Shengeng Tang, Yaxiong Wang, Lechao Cheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.15914
Pdf URL: https://arxiv.org/pdf/2503.15914
Copy Paste: [[2503.15914]] Text-Driven Diffusion Model for Sign Language Production(https://arxiv.org/abs/2503.15914)
Keywords: diffusion
Abstract: We introduce the hfut-lmc team's solution to the SLRTP Sign Production Challenge. The challenge aims to generate semantically aligned sign language pose sequences from text inputs. To this end, we propose a Text-driven Diffusion Model (TDM) framework. During the training phase, TDM utilizes an encoder to encode text sequences and incorporates them into the diffusion model as conditional input to generate sign pose sequences. To guarantee the high quality and accuracy of the generated pose sequences, we utilize two key loss functions. The joint loss function L_{joint} is used to precisely measure and minimize the differences between the joint positions of the generated pose sequences and those of the ground truth. Similarly, the bone orientation loss function L_{bone} is instrumental in ensuring that the orientation of the bones in the generated poses aligns with the actual, correct orientations. In the inference stage, the TDM framework takes on a different yet equally important task. It starts with noisy sequences and, under the strict constraints of the text conditions, gradually refines and generates semantically consistent sign language pose sequences. Our carefully designed framework performs well on the sign language production task, and our solution achieves a BLEU-1 score of 20.17, placing second in the challenge.
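
The abstract names a joint loss and a bone orientation loss but does not give their exact form. The following is a minimal sketch of one plausible reading, assuming an L1 distance on joint coordinates and a cosine-based penalty on bone directions. The `BONES` skeleton, the function names, and the single-frame pose layout are hypothetical and not taken from the paper.

```python
import math

# Hypothetical skeleton: each bone is a (parent_joint, child_joint) index pair.
BONES = [(0, 1), (1, 2)]

def joint_loss(pred, target):
    """Mean L1 distance between predicted and ground-truth joint coordinates."""
    total, count = 0.0, 0
    for p, t in zip(pred, target):
        for a, b in zip(p, t):
            total += abs(a - b)
            count += 1
    return total / count

def bone_direction(pose, parent, child):
    """Unit vector pointing from the parent joint to the child joint."""
    vec = [c - p for p, c in zip(pose[parent], pose[child])]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0  # guard zero-length bones
    return [v / norm for v in vec]

def bone_loss(pred, target, bones=BONES):
    """Mean (1 - cosine similarity) between predicted and true bone directions."""
    total = 0.0
    for parent, child in bones:
        d_pred = bone_direction(pred, parent, child)
        d_true = bone_direction(target, parent, child)
        total += 1.0 - sum(a * b for a, b in zip(d_pred, d_true))
    return total / len(bones)

# One pose is a list of 3D joint positions; a sequence loss would average over frames.
target = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
pred = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 2.0, 0.0]]
total_loss = joint_loss(pred, target) + bone_loss(pred, target)
```

In this toy pose the first bone is rotated by 90 degrees and the second keeps its direction, so the bone term penalizes orientation error even where the joint term alone would only see positional offsets.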

Title: Learning to Efficiently Adapt Foundation Models for Self-Supervised Endoscopic 3D Scene Reconstruction from Any Cameras

Authors: Beilei Cui, Long Bai, Mobarakol Islam, An Wang, Zhiqi Ma, Yiming Huang, Feng Li, Zhen Chen, Zhongliang Jiang, Nassir Navab, Hongliang Ren
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.15917
Pdf URL: https://arxiv.org/pdf/2503.15917
Copy Paste: [[2503.15917]] Learning to Efficiently Adapt Foundation Models for Self-Supervised Endoscopic 3D Scene Reconstruction from Any Cameras(https://arxiv.org/abs/2503.15917)
Keywords: self-supervised, foundation model
Abstract: Accurate 3D scene reconstruction is essential for numerous medical tasks. Given the challenges in obtaining ground truth data, there has been an increasing focus on self-supervised learning (SSL) for endoscopic depth estimation as a basis for scene reconstruction. While foundation models have shown remarkable progress in visual tasks, their direct application to the medical domain often leads to suboptimal results. However, the visual features from these models can still enhance endoscopic tasks, emphasizing the need for efficient adaptation strategies, which remain underexplored. In this paper, we introduce Endo3DAC, a unified framework for endoscopic scene reconstruction that efficiently adapts foundation models. We design an integrated network capable of simultaneously estimating depth maps, relative poses, and camera intrinsic parameters. By freezing the backbone foundation model and training only the specially designed Gated Dynamic Vector-Based Low-Rank Adaptation (GDV-LoRA) with separate decoder heads, Endo3DAC achieves superior depth and pose estimation while maintaining training efficiency. Additionally, we propose a 3D scene reconstruction pipeline that optimizes depth maps' scales, shifts, and a few parameters based on our integrated network. Extensive experiments across four endoscopic datasets demonstrate that Endo3DAC significantly outperforms other state-of-the-art methods while requiring fewer trainable parameters. To our knowledge, we are the first to utilize a single network that only requires surgical videos to perform both SSL depth estimation and scene reconstruction tasks. The code will be released upon acceptance.

Title: BlockDance: Reuse Structurally Similar Spatio-Temporal Features to Accelerate Diffusion Transformers

Authors: Hui Zhang, Tingwei Gao, Jie Shao, Zuxuan Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.15927
Pdf URL: https://arxiv.org/pdf/2503.15927
Copy Paste: [[2503.15927]] BlockDance: Reuse Structurally Similar Spatio-Temporal Features to Accelerate Diffusion Transformers(https://arxiv.org/abs/2503.15927)
Keywords: diffusion
Abstract: Diffusion models have demonstrated impressive generation capabilities, particularly with recent advancements leveraging transformer architectures to improve both visual and artistic quality. However, Diffusion Transformers (DiTs) continue to encounter challenges related to low inference speed, primarily due to the iterative denoising process. To address this issue, we propose BlockDance, a training-free approach that explores feature similarities at adjacent time steps to accelerate DiTs. Unlike previous feature-reuse methods that lack tailored reuse strategies for features at different scales, BlockDance prioritizes the identification of the most structurally similar features, referred to as Structurally Similar Spatio-Temporal (STSS) features. These features are primarily located within the structure-focused blocks of the transformer during the later stages of denoising. BlockDance caches and reuses these highly similar features to mitigate redundant computation, thereby accelerating DiTs while maximizing consistency with the generated results of the original model. Furthermore, considering the diversity of generated content and the varying distributions of redundant features, we introduce BlockDance-Ada, a lightweight decision-making network tailored for instance-specific acceleration. BlockDance-Ada dynamically allocates resources and provides superior content quality. Both BlockDance and BlockDance-Ada have proven effective across various generation tasks and models, achieving accelerations between 25% and 50% while maintaining generation quality.
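
The caching idea described above can be illustrated with a toy denoising loop. This is only a sketch under stated assumptions: the model is split into a hypothetical `head` (standing in for the structure-focused blocks) and `tail` (the remaining blocks), reuse begins at a hypothetical `reuse_start` step, and the cache is refreshed every `period` steps. None of these names or the update rule come from the paper.

```python
def denoise_with_reuse(head, tail, x, num_steps, reuse_start, period):
    """Toy denoising loop that reuses cached head features in later steps.

    Before `reuse_start` the head runs at every step. From `reuse_start`
    onward it runs once every `period` steps and the cached output is
    reused in between, which skips the redundant computation.
    """
    cached = None
    head_calls = 0
    for step in range(num_steps):
        t = num_steps - step  # timesteps count down toward the clean sample
        in_reuse_phase = step >= reuse_start
        refresh = (
            cached is None
            or not in_reuse_phase
            or (step - reuse_start) % period == 0
        )
        if refresh:
            cached = head(x, t)  # full computation, result stored for reuse
            head_calls += 1
        eps = tail(cached, t)    # the tail always runs on (possibly cached) features
        x = x - eps / num_steps  # placeholder update rule, not a real sampler
    return x, head_calls

# Stand-in blocks: any callables taking (features, timestep) would do here.
toy_head = lambda x, t: 2.0 * x
toy_tail = lambda h, t: h + 1.0

_, calls_with_reuse = denoise_with_reuse(toy_head, toy_tail, 1.0, 10, 4, 3)
_, calls_baseline = denoise_with_reuse(toy_head, toy_tail, 1.0, 10, 4, 1)
```

With 10 steps, reuse starting at step 4, and a refresh period of 3, the head runs 6 times instead of 10; setting `period=1` recovers the uncached baseline.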

Title: Multivariate Time Series Anomaly Detection in Industry 5.0

Authors: Lorenzo Colombi, Michela Vespa, Nicolas Belletti, Matteo Brina, Simon Dahdal, Filippo Tabanelli, Elena Bellodi, Mauro Tortonesi, Cesare Stefanelli, Massimiliano Vignoli
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.15946
Pdf URL: https://arxiv.org/pdf/2503.15946
Copy Paste: [[2503.15946]] Multivariate Time Series Anomaly Detection in Industry 5.0(https://arxiv.org/abs/2503.15946)
Keywords: anomaly
Abstract: Industry 5.0 environments present a critical need for effective anomaly detection methods that can indicate equipment malfunctions, process inefficiencies, or potential safety hazards. The ever-increasing sensorization of manufacturing lines makes processes more observable, but also poses the challenge of continuously analyzing vast amounts of multivariate time series data. These challenges include data quality, since data may be noisy, unlabeled, or even mislabeled. A promising approach consists of combining an embedding model with other Machine Learning algorithms to enhance the overall performance in detecting anomalies. Moreover, representing time series as vectors brings many advantages, such as higher flexibility and improved ability to capture complex temporal dependencies. We tested our solution in a real industrial use case, using data collected from a Bonfiglioli plant. The results demonstrate that, unlike traditional reconstruction-based autoencoders, which often struggle in the presence of sporadic noise, our embedding-based framework maintains high performance across various noise conditions.

Title: Acc3D: Accelerating Single Image to 3D Diffusion Models via Edge Consistency Guided Score Distillation

Authors: Kendong Liu, Zhiyu Zhu, Hui Liu, Junhui Hou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.15975
Pdf URL: https://arxiv.org/pdf/2503.15975
Copy Paste: [[2503.15975]] Acc3D: Accelerating Single Image to 3D Diffusion Models via Edge Consistency Guided Score Distillation(https://arxiv.org/abs/2503.15975)
Keywords: diffusion, generative
Abstract: We present Acc3D to tackle the challenge of accelerating the diffusion process to generate 3D models from single images. To derive high-quality reconstructions through few-step inferences, we emphasize the critical issue of regularizing the learning of the score function in states of random noise. To this end, we propose edge consistency, i.e., consistent predictions across the high signal-to-noise ratio region, to enhance a pre-trained diffusion model, enabling a distillation-based refinement of the endpoint score function. Building on those distilled diffusion models, we propose an adversarial augmentation strategy to further enrich the generation detail and boost overall generation quality. The two modules complement each other, mutually reinforcing to elevate generative performance. Extensive experiments demonstrate that our Acc3D not only achieves over a $20\times$ increase in computational efficiency but also yields notable quality improvements compared to the state of the art.

Title: A Survey on fMRI-based Brain Decoding for Reconstructing Multimodal Stimuli

Authors: Pengyu Liu, Guohua Dong, Dan Guo, Kun Li, Fengling Li, Xun Yang, Meng Wang, Xiaomin Ying
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.15978
Pdf URL: https://arxiv.org/pdf/2503.15978
Copy Paste: [[2503.15978]] A Survey on fMRI-based Brain Decoding for Reconstructing Multimodal Stimuli(https://arxiv.org/abs/2503.15978)
Keywords: diffusion
Abstract: In daily life, we encounter diverse external stimuli, such as images, sounds, and videos. As research in multimodal stimuli and neuroscience advances, fMRI-based brain decoding has become a key tool for understanding brain perception and its complex cognitive processes. Decoding brain signals to reconstruct stimuli not only reveals intricate neural mechanisms but also drives progress in AI, disease treatment, and brain-computer interfaces. Recent advancements in neuroimaging and image generation models have significantly improved fMRI-based decoding. While fMRI offers high spatial resolution for precise brain activity mapping, its low temporal resolution and signal noise pose challenges. Meanwhile, techniques like GANs, VAEs, and Diffusion Models have enhanced reconstructed image quality, and multimodal pre-trained models have boosted cross-modal decoding tasks. This survey systematically reviews recent progress in fMRI-based brain decoding, focusing on stimulus reconstruction from passive brain signals. It summarizes datasets and relevant brain regions, and categorizes existing methods by model structure. Additionally, it evaluates model performance and discusses their effectiveness. Finally, it identifies key challenges and proposes future research directions, offering valuable insights for the field.

Title: DIPLI: Deep Image Prior Lucky Imaging for Blind Astronomical Image Restoration

Authors: Suraj Singh, Anastasia Batsheva, Oleg Y. Rogov, Ahmed Bouridane
Subjects: cs.CV, astro-ph.IM, cs.AI, eess.IV
Abstract URL: https://arxiv.org/abs/2503.15984
Pdf URL: https://arxiv.org/pdf/2503.15984
Copy Paste: [[2503.15984]] DIPLI: Deep Image Prior Lucky Imaging for Blind Astronomical Image Restoration(https://arxiv.org/abs/2503.15984)
Keywords: diffusion
Abstract: Contemporary image restoration and super-resolution techniques effectively harness deep neural networks, markedly outperforming traditional methods. However, astrophotography presents unique challenges for deep learning due to limited training data. This work explores hybrid strategies, such as the Deep Image Prior (DIP) model, which facilitates blind training but is susceptible to overfitting, artifact generation, and instability when handling noisy images. We propose enhancements to the DIP model's baseline performance through several advanced techniques. First, we refine the model to process multiple frames concurrently, employing the Back Projection method and the TVNet model. Next, we adopt a Markov approach incorporating Monte Carlo estimation, Langevin dynamics, and a variational input technique to achieve unbiased estimates with minimal variance and counteract overfitting effectively. Collectively, these modifications reduce the likelihood of noise learning and mitigate loss function fluctuations during training, enhancing result stability. We validated our algorithm across multiple image sets of astronomical and celestial objects, achieving performance that not only mitigates limitations of Lucky Imaging, a classical computer vision technique that remains a standard in astronomical image reconstruction, but also surpasses the original DIP model and state-of-the-art transformer- and diffusion-based models, underscoring the significance of our improvements.

Title: SenseExpo: Efficient Autonomous Exploration with Prediction Information from Lightweight Neural Networks

Authors: Haojia Gao, Haohua Que, Hoiian Au, Weihao Shan, Mingkai Liu, Yusen Qin, Lei Mu, Rong Zhao, Xinghua Yang, Qi Wei, Fei Qiao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.16000
Pdf URL: https://arxiv.org/pdf/2503.16000
Copy Paste: [[2503.16000]] SenseExpo: Efficient Autonomous Exploration with Prediction Information from Lightweight Neural Networks(https://arxiv.org/abs/2503.16000)
Keywords: generative
Abstract: This paper proposes SenseExpo, an efficient autonomous exploration framework based on a lightweight prediction network, which addresses the limitations of traditional methods in computational overhead and environmental generalization. By integrating Generative Adversarial Networks (GANs), Transformer, and Fast Fourier Convolution (FFC), we designed a lightweight prediction model with merely 709k parameters. Our smallest model achieves better performance on the KTH dataset than U-net (24.5M) and LaMa (51M), delivering PSNR 9.026 and SSIM 0.718, a 38.7% PSNR improvement over the 51M-parameter LaMa model. Cross-domain testing demonstrates its strong generalization capability, with an FID score of 161.55 on the HouseExpo dataset, significantly outperforming comparable methods. Regarding exploration efficiency, on the KTH dataset, SenseExpo reduces exploration time by approximately 67.9% compared to MapEx. On the MRPB 1.0 dataset, SenseExpo reduces exploration time by roughly 77.1% compared to MapEx. Deployed as a plug-and-play ROS node, the framework seamlessly integrates with existing navigation systems, providing an efficient solution for resource-constrained devices.

Title: Corrective In-Context Learning: Evaluating Self-Correction in Large Language Models

Authors: Mario Sanz-Guerrero, Katharina von der Wense
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.16022
Pdf URL: https://arxiv.org/pdf/2503.16022
Copy Paste: [[2503.16022]] Corrective In-Context Learning: Evaluating Self-Correction in Large Language Models(https://arxiv.org/abs/2503.16022)
Keywords: in-context
Abstract: In-context learning (ICL) has transformed the use of large language models (LLMs) for NLP tasks, enabling few-shot learning by conditioning on labeled examples without finetuning. Despite its effectiveness, ICL is prone to errors, especially for challenging examples. With the goal of improving the performance of ICL, we propose corrective in-context learning (CICL), an approach that incorporates a model's incorrect predictions alongside ground truth corrections into the prompt, aiming to enhance classification accuracy through self-correction. However, contrary to our hypothesis, extensive experiments on text classification tasks demonstrate that CICL consistently underperforms standard ICL, with performance degrading as the proportion of corrections in the prompt increases. Our findings indicate that CICL introduces confusion by disrupting the model's task understanding, rather than refining its predictions. Additionally, we observe that presenting harder examples in standard ICL does not improve performance, suggesting that example difficulty alone may not be a reliable criterion for effective selection. By presenting these negative results, we provide important insights into the limitations of self-corrective mechanisms in LLMs and offer directions for future research.

Title: The Lighthouse of Language: Enhancing LLM Agents via Critique-Guided Improvement

Authors: Ruihan Yang, Fanghua Ye, Jian Li, Siyu Yuan, Yikai Zhang, Zhaopeng Tu, Xiaolong Li, Deqing Yang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.16024
Pdf URL: https://arxiv.org/pdf/2503.16024
Copy Paste: [[2503.16024]] The Lighthouse of Language: Enhancing LLM Agents via Critique-Guided Improvement(https://arxiv.org/abs/2503.16024)
Keywords: generative
Abstract: Large language models (LLMs) have recently transformed from text-based assistants to autonomous agents capable of planning, reasoning, and iteratively improving their actions. While numerical reward signals and verifiers can effectively rank candidate actions, they often provide limited contextual guidance. In contrast, natural language feedback better aligns with the generative capabilities of LLMs, providing richer and more actionable suggestions. However, parsing and implementing this feedback effectively can be challenging for LLM-based agents. In this work, we introduce Critique-Guided Improvement (CGI), a novel two-player framework comprising an actor model that explores an environment and a critic model that generates detailed natural language feedback. By training the critic to produce fine-grained assessments and actionable revisions, and the actor to utilize these critiques, our approach promotes more robust exploration of alternative strategies while avoiding local optima. Experiments in three interactive environments show that CGI outperforms existing baselines by a substantial margin. Notably, even a small critic model surpasses GPT-4 in feedback quality. The resulting actor achieves state-of-the-art performance, demonstrating the power of explicit iterative guidance to enhance decision-making in LLM-based agents.

Title: Expert Race: A Flexible Routing Strategy for Scaling Diffusion Transformer with Mixture of Experts

Authors: Yike Yuan, Ziyu Wang, Zihao Huang, Defa Zhu, Xun Zhou, Jingyi Yu, Qiyang Min
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.16057
Pdf URL: https://arxiv.org/pdf/2503.16057
Copy Paste: [[2503.16057]] Expert Race: A Flexible Routing Strategy for Scaling Diffusion Transformer with Mixture of Experts(https://arxiv.org/abs/2503.16057)
Keywords: diffusion
Abstract: Diffusion models have emerged as a mainstream framework in visual generation. Building upon this success, the integration of Mixture of Experts (MoE) methods has shown promise in enhancing model scalability and performance. In this paper, we introduce Race-DiT, a novel MoE model for diffusion transformers with a flexible routing strategy, Expert Race. By allowing tokens and experts to compete together and select the top candidates, the model learns to dynamically assign experts to critical tokens. Additionally, we propose per-layer regularization to address challenges in shallow layer learning, and router similarity loss to prevent mode collapse, ensuring better expert utilization. Extensive experiments on ImageNet validate the effectiveness of our approach, showcasing significant performance gains and promising scaling properties.
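
One way to read "tokens and experts compete together and select the top candidates" is a single top-k taken over all token-expert scores, rather than a separate top-k per token. The sketch below illustrates that reading only; the function names, the score matrix, and the per-token baseline are hypothetical and the paper's actual routing may differ.

```python
def expert_race(scores, k):
    """Select the k highest-scoring (token, expert) pairs across the whole matrix.

    scores[i][j] is the router affinity of token i for expert j. Because the
    selection is global, some tokens may receive several experts and others none.
    """
    flat = [(s, i, j) for i, row in enumerate(scores) for j, s in enumerate(row)]
    flat.sort(key=lambda item: item[0], reverse=True)
    return sorted((i, j) for _, i, j in flat[:k])

def per_token_top1(scores):
    """Baseline for comparison: every token gets exactly its best expert."""
    return [
        (i, max(range(len(row)), key=row.__getitem__))
        for i, row in enumerate(scores)
    ]

# Three tokens, two experts; token 1 has uniformly low affinity.
scores = [
    [0.9, 0.8],
    [0.1, 0.2],
    [0.7, 0.3],
]
race_assignment = expert_race(scores, k=3)
baseline_assignment = per_token_top1(scores)
```

With the same budget of three expert slots, the global selection gives token 0 both experts and skips token 1 entirely, while the per-token baseline spends one slot on every token regardless of score.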

Title: Shining Yourself: High-Fidelity Ornaments Virtual Try-on with Diffusion Model

Authors: Yingmao Miao, Zhanpeng Huang, Rui Han, Zibin Wang, Chenhao Lin, Chao Shen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.16065
Pdf URL: https://arxiv.org/pdf/2503.16065
Copy Paste: [[2503.16065]] Shining Yourself: High-Fidelity Ornaments Virtual Try-on with Diffusion Model(https://arxiv.org/abs/2503.16065)
Keywords: diffusion
Abstract: While virtual try-on for clothes and shoes with diffusion models has gained traction, virtual try-on for ornaments, such as bracelets, rings, earrings, and necklaces, remains largely unexplored. Due to the intricate tiny patterns and repeated geometric sub-structures in most ornaments, it is much more difficult to guarantee identity and appearance consistency under large pose and scale variances between ornaments and models. This paper proposes the task of virtual try-on for ornaments and presents a method to improve the geometric and appearance preservation of ornament virtual try-ons. Specifically, we estimate an accurate wearing mask to improve the alignments between ornaments and models in an iterative scheme alongside the denoising process. To preserve structure details, we further regularize attention layers to map the reference ornament mask to the wearing mask in an implicit way. Experimental results demonstrate that our method successfully wears ornaments from reference images onto target models, handling substantial differences in scale and pose while preserving identity and achieving realistic visual effects.

Title: PoseTraj: Pose-Aware Trajectory Control in Video Diffusion

Authors: Longbin Ji, Lei Zhong, Pengfei Wei, Changjian Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.16068
Pdf URL: https://arxiv.org/pdf/2503.16068
Copy Paste: [[2503.16068]] PoseTraj: Pose-Aware Trajectory Control in Video Diffusion(https://arxiv.org/abs/2503.16068)
Keywords: diffusion
Abstract: Recent advancements in trajectory-guided video generation have achieved notable progress. However, existing models still face challenges in generating object motions with potentially changing 6D poses under wide-range rotations, due to limited 3D understanding. To address this problem, we introduce PoseTraj, a pose-aware video dragging model for generating 3D-aligned motion from 2D trajectories. Our method adopts a novel two-stage pose-aware pretraining framework, improving 3D understanding across diverse trajectories. Specifically, we propose a large-scale synthetic dataset PoseTraj-10K, containing 10k videos of objects following rotational trajectories, and enhance the model perception of object pose changes by incorporating 3D bounding boxes as intermediate supervision signals. Following this, we fine-tune the trajectory-controlling module on real-world videos, applying an additional camera-disentanglement module to further refine motion accuracy. Experiments on various benchmark datasets demonstrate that our method not only excels in 3D pose-aligned dragging for rotational trajectories but also outperforms existing baselines in trajectory accuracy and video quality.

Title: Cultural Alignment in Large Language Models Using Soft Prompt Tuning

Authors: Reem I. Masoud, Martin Ferianc, Philip Treleaven, Miguel Rodrigues
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.16094
Pdf URL: https://arxiv.org/pdf/2503.16094
Copy Paste: [[2503.16094]] Cultural Alignment in Large Language Models Using Soft Prompt Tuning(https://arxiv.org/abs/2503.16094)
Keywords: in-context
Abstract: Large Language Model (LLM) alignment conventionally relies on supervised fine-tuning or reinforcement learning based alignment frameworks. These methods typically require labeled or preference datasets and involve updating model weights to align the LLM with the training objective or reward model. Meanwhile, in social sciences such as cross-cultural studies, factor analysis is widely used to uncover underlying dimensions or latent variables that explain observed patterns in survey data. The non-differentiable nature of these measurements deriving from survey data renders the former alignment methods infeasible for alignment with cultural dimensions. To overcome this, we propose a parameter-efficient strategy that combines soft prompt tuning, which freezes the model parameters while modifying the input prompt embeddings, with Differential Evolution (DE), a black-box optimization method for cases where a differentiable objective is unattainable. This strategy ensures alignment consistency without the need for preference data or model parameter updates, significantly enhancing efficiency and mitigating overfitting. Our method demonstrates significant improvements in Llama-3-8B-Instruct's cultural dimensions across multiple regions, outperforming both the Naive LLM and the In-context Learning (ICL) baseline, and effectively bridges computational models with human cultural nuances.
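
To make the role of Differential Evolution concrete, here is a minimal sketch of the classic DE/rand/1/bin scheme optimizing a small vector against a black-box objective. The vector stands in for soft prompt embeddings, and `toy_objective` is a hypothetical placeholder: in the setting described above, the objective would come from running the frozen LLM and scoring its survey responses, which is not reproduced here. The hyperparameters are illustrative defaults, not the paper's.

```python
import random

def differential_evolution(objective, dim, pop_size=20, generations=200,
                           f=0.8, cr=0.9, seed=0):
    """Minimize a black-box objective with DE/rand/1/bin (no gradients needed)."""
    rng = random.Random(seed)
    pop = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(pop_size)]
    fitness = [objective(ind) for ind in pop]
    for _ in range(generations):
        for i in range(pop_size):
            # Pick three distinct individuals other than the current one.
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            forced = rng.randrange(dim)  # guarantees at least one mutated coordinate
            trial = [
                pop[a][d] + f * (pop[b][d] - pop[c][d])
                if (rng.random() < cr or d == forced) else pop[i][d]
                for d in range(dim)
            ]
            trial_fit = objective(trial)
            if trial_fit <= fitness[i]:  # greedy selection keeps the better vector
                pop[i], fitness[i] = trial, trial_fit
    best = min(range(pop_size), key=fitness.__getitem__)
    return pop[best], fitness[best]

# Hypothetical stand-in: squared distance to a target profile of survey-derived scores.
TARGET = [0.5, -0.3, 0.2, 0.1]

def toy_objective(vec):
    return sum((v - t) ** 2 for v, t in zip(vec, TARGET))

best_vec, best_fit = differential_evolution(toy_objective, dim=len(TARGET))
```

The optimizer only ever calls `objective` on candidate vectors, which is why it remains usable when the score is derived from survey-style measurements that cannot be differentiated.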

Title: OSLoPrompt: Bridging Low-Supervision Challenges and Open-Set Domain Generalization in CLIP

Authors: Mohamad Hassan N C, Divyam Gupta, Mainak Singha, Sai Bhargav Rongali, Ankit Jha, Muhammad Haris Khan, Biplab Banerjee
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.16106
Pdf URL: https://arxiv.org/pdf/2503.16106
Copy Paste: [[2503.16106]] OSLoPrompt: Bridging Low-Supervision Challenges and Open-Set Domain Generalization in CLIP(https://arxiv.org/abs/2503.16106)
Keywords: foundation model
Abstract: We introduce Low-Shot Open-Set Domain Generalization (LSOSDG), a novel paradigm unifying low-shot learning with open-set domain generalization (ODG). While prompt-based methods using models like CLIP have advanced DG, they falter in low-data regimes (e.g., 1-shot) and lack precision in detecting open-set samples with fine-grained semantics related to training classes. To address these challenges, we propose OSLOPROMPT, an advanced prompt-learning framework for CLIP with two core innovations. First, to manage limited supervision across source domains and improve DG, we introduce a domain-agnostic prompt-learning mechanism that integrates adaptable domain-specific cues and visually guided semantic attributes through a novel cross-attention module, besides being supported by learnable domain- and class-generic visual prompts to enhance cross-modal adaptability. Second, to improve outlier rejection during inference, we classify unfamiliar samples as "unknown" and train specialized prompts with systematically synthesized pseudo-open samples that maintain fine-grained relationships to known classes, generated through a targeted query strategy with off-the-shelf foundation models. This strategy enhances feature learning, enabling our model to detect open samples with varied granularity more effectively. Extensive evaluations across five benchmarks demonstrate that OSLOPROMPT establishes a new state-of-the-art in LSOSDG, significantly outperforming existing methods.

Title: Improving Discriminator Guidance in Diffusion Models

Authors: Alexandre Verine, Mehdi Inane, Florian Le Bronnec, Benjamin Negrevergne, Yann Chevaleyre
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.16117
Pdf URL: https://arxiv.org/pdf/2503.16117
Copy Paste: [[2503.16117]] Improving Discriminator Guidance in Diffusion Models(https://arxiv.org/abs/2503.16117)
Keywords: diffusion
Abstract: Discriminator Guidance has become a popular method for efficiently refining pre-trained Score-Matching Diffusion models. However, in this paper, we demonstrate that the standard implementation of this technique does not necessarily lead to a distribution closer to the real data distribution. Specifically, we show that training the discriminator using Cross-Entropy loss, as commonly done, can in fact increase the Kullback-Leibler divergence between the model and target distributions, particularly when the discriminator overfits. To address this, we propose a theoretically sound training objective for discriminator guidance that properly minimizes the KL divergence. We analyze its properties and demonstrate empirically across multiple datasets that our proposed method consistently improves over the conventional method by producing samples of higher quality.

Title: FreeFlux: Understanding and Exploiting Layer-Specific Roles in RoPE-Based MMDiT for Versatile Image Editing

Authors: Tianyi Wei, Yifan Zhou, Dongdong Chen, Xingang Pan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.16153
Pdf URL: https://arxiv.org/pdf/2503.16153
Copy Paste: [[2503.16153]] FreeFlux: Understanding and Exploiting Layer-Specific Roles in RoPE-Based MMDiT for Versatile Image Editing(https://arxiv.org/abs/2503.16153)
Keywords: diffusion
Abstract: The integration of Rotary Position Embedding (RoPE) in Multimodal Diffusion Transformer (MMDiT) has significantly enhanced text-to-image generation quality. However, the fundamental reliance of self-attention layers on positional embedding versus query-key similarity during generation remains an intriguing question. We present the first mechanistic analysis of RoPE-based MMDiT models (e.g., FLUX), introducing an automated probing strategy that disentangles positional information versus content dependencies by strategically manipulating RoPE during generation. Our analysis reveals distinct dependency patterns that do not straightforwardly correlate with depth, offering new insights into the layer-specific roles in RoPE-based MMDiT. Based on these findings, we propose a training-free, task-specific image editing framework that categorizes editing tasks into three types: position-dependent editing (e.g., object addition), content similarity-dependent editing (e.g., non-rigid editing), and region-preserved editing (e.g., background replacement). For each type, we design tailored key-value injection strategies based on the characteristics of the editing task. Extensive qualitative and quantitative evaluations demonstrate that our method outperforms state-of-the-art approaches, particularly in preserving original semantic content and achieving seamless modifications.

Title: Towards Lighter and Robust Evaluation for Retrieval Augmented Generation

Authors: Alex-Razvan Ispas, Charles-Elie Simon, Fabien Caspani, Vincent Guigue
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.16161
Pdf URL: https://arxiv.org/pdf/2503.16161
Copy Paste: [[2503.16161]] Towards Lighter and Robust Evaluation for Retrieval Augmented Generation(https://arxiv.org/abs/2503.16161)
Keywords: generative
Abstract: Large Language Models are prompting us to view more NLP tasks from a generative perspective. At the same time, they offer a new way of accessing information, mainly through the RAG framework. While there have been notable improvements for the autoregressive models, overcoming hallucination in the generated answers remains a continuous problem. A standard solution is to use commercial LLMs, such as GPT4, to evaluate these algorithms. However, such frameworks are expensive and not very transparent. Therefore, we propose a study which demonstrates the interest of open-weight models for evaluating RAG hallucination. We develop a lightweight approach using smaller, quantized LLMs to provide an accessible and interpretable metric that gives continuous scores for the generated answer with respect to their correctness and faithfulness. This score allows us to question decisions' reliability and explore thresholds to develop a new AUC metric as an alternative to correlation with human judgment.

Title: Guardians of Generation: Dynamic Inference-Time Copyright Shielding with Adaptive Guidance for AI Image Generation

Authors: Soham Roy, Abhishek Mishra, Shirish Karande, Murari Mandal
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.16171
Pdf URL: https://arxiv.org/pdf/2503.16171
Copy Paste: [[2503.16171]] Guardians of Generation: Dynamic Inference-Time Copyright Shielding with Adaptive Guidance for AI Image Generation(https://arxiv.org/abs/2503.16171)
Keywords: diffusion, generative
Abstract: Modern text-to-image generative models can inadvertently reproduce copyrighted content memorized in their training data, raising serious concerns about potential copyright infringement. We introduce Guardians of Generation, a model-agnostic inference-time framework for dynamic copyright shielding in AI image generation. Our approach requires no retraining or modification of the generative model weights, instead integrating seamlessly with existing diffusion pipelines. It augments the generation process with an adaptive guidance mechanism comprising three components: a detection module, a prompt rewriting module, and a guidance adjustment module. The detection module monitors user prompts and intermediate generation steps to identify features indicative of copyrighted content before they manifest in the final output. If such content is detected, the prompt rewriting mechanism dynamically transforms the user's prompt by sanitizing or replacing references that could trigger copyrighted material while preserving the prompt's intended semantics. The adaptive guidance module steers the diffusion process away from flagged content by modulating the model's sampling trajectory. Together, these components form a robust shield that enables a tunable balance between preserving creative fidelity and ensuring copyright compliance. We validate our method on a variety of generative models such as Stable Diffusion, SDXL, and Flux, demonstrating substantial reductions in copyrighted content generation with negligible impact on output fidelity or alignment with user intent. This work provides a practical, plug-and-play safeguard for generative image models, enabling more responsible deployment under real-world copyright constraints. Source code is available at: this https URL

Title: VP-NTK: Exploring the Benefits of Visual Prompting in Differentially Private Data Synthesis

Authors: Chia-Yi Hsu, Jia-You Chen, Yu-Lin Tsai, Chih-Hsun Lin, Pin-Yu Chen, Chia-Mu Yu, Chun-Ying Huang
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.16195
Pdf URL: https://arxiv.org/pdf/2503.16195
Copy Paste: [[2503.16195]] VP-NTK: Exploring the Benefits of Visual Prompting in Differentially Private Data Synthesis(https://arxiv.org/abs/2503.16195)
Keywords: generative
Abstract: Differentially private (DP) synthetic data has become the de facto standard for releasing sensitive data. However, many DP generative models suffer from the low utility of synthetic data, especially for high-resolution images. On the other hand, one of the emerging techniques in parameter efficient fine-tuning (PEFT) is visual prompting (VP), which allows well-trained existing models to be reused for the purpose of adapting to subsequent downstream tasks. In this work, we explore such a phenomenon in constructing captivating generative models with DP constraints. We show that VP in conjunction with DP-NTK, a DP generator that exploits the power of the neural tangent kernel (NTK) in training DP generative models, achieves a significant performance boost, particularly for high-resolution image datasets, with accuracy improving from 0.644±0.044 to 0.769. Lastly, we perform ablation studies on the effect of different parameters that influence the overall performance of VP-NTK. Our work demonstrates a promising step forward in improving the utility of DP synthetic data, particularly for high-resolution images.

Title: Temporal Score Analysis for Understanding and Correcting Diffusion Artifacts

Authors: Yu Cao, Zengqun Zhao, Ioannis Patras, Shaogang Gong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.16218
Pdf URL: https://arxiv.org/pdf/2503.16218
Copy Paste: [[2503.16218]] Temporal Score Analysis for Understanding and Correcting Diffusion Artifacts(https://arxiv.org/abs/2503.16218)
Keywords: diffusion, generative
Abstract: Visual artifacts remain a persistent challenge in diffusion models, even with training on massive datasets. Current solutions primarily rely on supervised detectors, yet lack understanding of why these artifacts occur in the first place. In our analysis, we identify three distinct phases in the diffusion generative process: Profiling, Mutation, and Refinement. Artifacts typically emerge during the Mutation phase, where certain regions exhibit anomalous score dynamics over time, causing abrupt disruptions in the normal evolution pattern. This temporal nature explains why existing methods focusing only on spatial uncertainty of the final output fail at effective artifact localization. Based on these insights, we propose ASCED (Abnormal Score Correction for Enhancing Diffusion), which detects artifacts by monitoring abnormal score dynamics during the diffusion process, with a trajectory-aware on-the-fly mitigation strategy that generates appropriate noise in the detected areas. Unlike most existing methods that apply post hoc corrections, e.g., by applying a noising-denoising scheme after generation, our mitigation strategy operates seamlessly within the existing diffusion process. Extensive experiments demonstrate that our proposed approach effectively reduces artifacts across diverse domains, matching or surpassing existing supervised methods without additional training.

Title: M2N2V2: Multi-Modal Unsupervised and Training-free Interactive Segmentation

Authors: Markus Karmann, Peng-Tao Jiang, Bo Li, Onay Urfalioglu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.16254
Pdf URL: https://arxiv.org/pdf/2503.16254
Copy Paste: [[2503.16254]] M2N2V2: Multi-Modal Unsupervised and Training-free Interactive Segmentation(https://arxiv.org/abs/2503.16254)
Keywords: diffusion
Abstract: We present Markov Map Nearest Neighbor V2 (M2N2V2), a novel and simple, yet effective approach which leverages depth guidance and attention maps for unsupervised and training-free point-prompt-based interactive segmentation. Following recent trends in supervised multimodal approaches, we carefully integrate depth as an additional modality to create novel depth-guided Markov-maps. Furthermore, we observe occasional segment size fluctuations in M2N2 during the interactive process, which can decrease the overall mIoU. To mitigate this problem, we model the prompting as a sequential process and propose a novel adaptive score function which considers the previous segmentation and the current prompt point in order to prevent unreasonable segment size changes. Using Stable Diffusion 2 and Depth Anything V2 as backbones, we empirically show that our proposed M2N2V2 significantly improves the Number of Clicks (NoC) and mIoU compared to M2N2 in all datasets except those from the medical domain. Interestingly, our unsupervised approach achieves competitive results compared to supervised methods like SAM and SimpleClick in the more challenging DAVIS and HQSeg44K datasets in the NoC metric, reducing the gap between supervised and unsupervised methods.

Title: Uni-3DAR: Unified 3D Generation and Understanding via Autoregression on Compressed Spatial Tokens

Authors: Shuqi Lu, Haowei Lin, Lin Yao, Zhifeng Gao, Xiaohong Ji, Weinan E, Linfeng Zhang, Guolin Ke
Subjects: cs.LG, cond-mat.mtrl-sci, q-bio.BM
Abstract URL: https://arxiv.org/abs/2503.16278
Pdf URL: https://arxiv.org/pdf/2503.16278
Copy Paste: [[2503.16278]] Uni-3DAR: Unified 3D Generation and Understanding via Autoregression on Compressed Spatial Tokens(https://arxiv.org/abs/2503.16278)
Keywords: diffusion
Abstract: Recent advancements in large language models and their multi-modal extensions have demonstrated the effectiveness of unifying generation and understanding through autoregressive next-token prediction. However, despite the critical role of 3D structural generation and understanding (3D GU) in AI for science, these tasks have largely evolved independently, with autoregressive methods remaining underexplored. To bridge this gap, we introduce Uni-3DAR, a unified framework that seamlessly integrates 3D GU tasks via autoregressive prediction. At its core, Uni-3DAR employs a novel hierarchical tokenization that compresses 3D space using an octree, leveraging the inherent sparsity of 3D structures. It then applies an additional tokenization for fine-grained structural details, capturing key attributes such as atom types and precise spatial coordinates in microscopic 3D structures. We further propose two optimizations to enhance efficiency and effectiveness. The first is a two-level subtree compression strategy, which reduces the octree token sequence by up to 8x. The second is a masked next-token prediction mechanism tailored for dynamically varying token positions, significantly boosting model performance. By combining these strategies, Uni-3DAR successfully unifies diverse 3D GU tasks within a single autoregressive framework. Extensive experiments across multiple microscopic 3D GU tasks, including molecules, proteins, polymers, and crystals, validate its effectiveness and versatility. Notably, Uni-3DAR surpasses previous state-of-the-art diffusion models by a substantial margin, achieving up to 256% relative improvement while delivering inference speeds up to 21.8x faster. The code is publicly available at this https URL.

Title: SceneMI: Motion In-betweening for Modeling Human-Scene Interactions

Authors: Inwoo Hwang, Bing Zhou, Young Min Kim, Jian Wang, Chuan Guo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.16289
Pdf URL: https://arxiv.org/pdf/2503.16289
Copy Paste: [[2503.16289]] SceneMI: Motion In-betweening for Modeling Human-Scene Interactions(https://arxiv.org/abs/2503.16289)
Keywords: diffusion, generative
Abstract: Modeling human-scene interactions (HSI) is essential for understanding and simulating everyday human behaviors. Recent approaches utilizing generative modeling have made progress in this domain; however, they are limited in controllability and flexibility for real-world applications. To address these challenges, we propose reformulating the HSI modeling problem as Scene-aware Motion In-betweening -- a more tractable and practical task. We introduce SceneMI, a framework that supports several practical applications, including keyframe-guided character animation in 3D scenes and enhancing the motion quality of imperfect HSI data. SceneMI employs dual scene descriptors to comprehensively encode global and local scene context. Furthermore, our framework leverages the inherent denoising nature of diffusion models to generalize on noisy keyframes. Experimental results demonstrate SceneMI's effectiveness in scene-aware keyframe in-betweening and generalization to the real-world GIMO dataset, where motions and scenes are acquired by noisy IMU sensors and smartphones. We further showcase SceneMI's applicability in HSI reconstruction from monocular videos.

Title: Unleashing Vecset Diffusion Model for Fast Shape Generation

Authors: Zeqiang Lai, Yunfei Zhao, Zibo Zhao, Haolin Liu, Fuyun Wang, Huiwen Shi, Xianghui Yang, Qinxiang Lin, Jinwei Huang, Yuhong Liu, Jie Jiang, Chunchao Guo, Xiangyu Yue
Subjects: cs.CV, cs.AI, eess.IV
Abstract URL: https://arxiv.org/abs/2503.16302
Pdf URL: https://arxiv.org/pdf/2503.16302
Copy Paste: [[2503.16302]] Unleashing Vecset Diffusion Model for Fast Shape Generation(https://arxiv.org/abs/2503.16302)
Keywords: diffusion
Abstract: 3D shape generation has greatly flourished through the development of so-called "native" 3D diffusion, particularly through the Vecset Diffusion Model (VDM). While recent advancements have shown promising results in generating high-resolution 3D shapes, VDM still struggles with high-speed generation. Challenges exist because of difficulties not only in accelerating diffusion sampling but also VAE decoding in VDM, areas under-explored in previous works. To address these challenges, we present FlashVDM, a systematic framework for accelerating both VAE and DiT in VDM. For DiT, FlashVDM enables flexible diffusion sampling with as few as 5 inference steps and comparable quality, which is made possible by stabilizing consistency distillation with our newly introduced Progressive Flow Distillation. For VAE, we introduce a lightning vecset decoder equipped with Adaptive KV Selection, Hierarchical Volume Decoding, and Efficient Network Design. By exploiting the locality of the vecset and the sparsity of shape surface in the volume, our decoder drastically lowers FLOPs, minimizing the overall decoding overhead. We apply FlashVDM to Hunyuan3D-2 to obtain Hunyuan3D-2 Turbo. Through systematic evaluation, we show that our model significantly outperforms existing fast 3D generation methods, achieving comparable performance to the state-of-the-art while reducing inference time by over 45x for reconstruction and 32x for generation. Code and models are available at this https URL.

Title: Structured-Noise Masked Modeling for Video, Audio and Beyond

Authors: Aritra Bhowmik, Fida Mohammad Thoker, Carlos Hinojosa, Bernard Ghanem, Cees G. M. Snoek
Subjects: cs.LG, cs.AI, cs.SD
Abstract URL: https://arxiv.org/abs/2503.16311
Pdf URL: https://arxiv.org/pdf/2503.16311
Copy Paste: [[2503.16311]] Structured-Noise Masked Modeling for Video, Audio and Beyond(https://arxiv.org/abs/2503.16311)
Keywords: self-supervised
Abstract: Masked modeling has emerged as a powerful self-supervised learning framework, but existing methods largely rely on random masking, disregarding the structural properties of different modalities. In this work, we introduce structured noise-based masking, a simple yet effective approach that naturally aligns with the spatial, temporal, and spectral characteristics of video and audio data. By filtering white noise into distinct color noise distributions, we generate structured masks that preserve modality-specific patterns without requiring handcrafted heuristics or access to the data. Our approach improves the performance of masked video and audio modeling frameworks without any computational overhead. Extensive experiments demonstrate that structured noise masking achieves consistent improvement over random masking for standard and advanced masked modeling methods, highlighting the importance of modality-aware masking strategies for representation learning.

Title: Ultra-Resolution Adaptation with Ease

Authors: Ruonan Yu, Songhua Liu, Zhenxiong Tan, Xinchao Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.16322
Pdf URL: https://arxiv.org/pdf/2503.16322
Copy Paste: [[2503.16322]] Ultra-Resolution Adaptation with Ease(https://arxiv.org/abs/2503.16322)
Keywords: diffusion
Abstract: Text-to-image diffusion models have achieved remarkable progress in recent years. However, training models for high-resolution image generation remains challenging, particularly when training data and computational resources are limited. In this paper, we explore this practical problem from two key perspectives: data and parameter efficiency, and propose a set of key guidelines for ultra-resolution adaptation termed URAE. For data efficiency, we theoretically and empirically demonstrate that synthetic data generated by some teacher models can significantly promote training convergence. For parameter efficiency, we find that tuning minor components of the weight matrices outperforms widely-used low-rank adapters when synthetic data are unavailable, offering substantial performance gains while maintaining efficiency. Additionally, for models leveraging guidance distillation, such as FLUX, we show that disabling classifier-free guidance, i.e., setting the guidance scale to 1 during adaptation, is crucial for satisfactory performance. Extensive experiments validate that URAE achieves comparable 2K-generation performance to state-of-the-art closed-source models like FLUX1.1 [Pro] Ultra with only 3K samples and 2K iterations, while setting new benchmarks for 4K-resolution generation. Code is available at this https URL.

Title: Lyra: An Efficient and Expressive Subquadratic Architecture for Modeling Biological Sequences

Authors: Krithik Ramesh (1 and 2), Sameed M. Siddiqui (1 and 3), Albert Gu (4), Michael D. Mitzenmacher (1 and 5), Pardis C. Sabeti (1 and 6 and 7 and 8) ((1) Broad Institute of MIT and Harvard, (2) Massachusetts Institute of Technology, (3) Computational and Systems Biology Program, Massachusetts Institute of Technology, (4) Machine Learning Department, Carnegie Mellon University, (5) School of Engineering and Applied Sciences, Harvard University, (6) Department of Organismic and Evolutionary Biology, Harvard University, (7) Department of Immunology and Infectious Diseases, Harvard T.H. Chan School of Public Health, Harvard University, (8) Howard Hughes Medical Institute)
Subjects: cs.LG, q-bio.GN
Abstract URL: https://arxiv.org/abs/2503.16351
Pdf URL: https://arxiv.org/pdf/2503.16351
Copy Paste: [[2503.16351]] Lyra: An Efficient and Expressive Subquadratic Architecture for Modeling Biological Sequences(https://arxiv.org/abs/2503.16351)
Keywords: foundation model
Abstract: Deep learning architectures such as convolutional neural networks and Transformers have revolutionized biological sequence modeling, with recent advances driven by scaling up foundation and task-specific models. The computational resources and large datasets required, however, limit their applicability in biological contexts. We introduce Lyra, a subquadratic architecture for sequence modeling, grounded in the biological framework of epistasis for understanding sequence-to-function relationships. Mathematically, we demonstrate that state space models efficiently capture global epistatic interactions and combine them with projected gated convolutions for modeling local relationships. We demonstrate that Lyra is performant across over 100 wide-ranging biological tasks, achieving state-of-the-art (SOTA) performance in many key areas, including protein fitness landscape prediction, biophysical property prediction (e.g. disordered protein region functions), peptide engineering applications (e.g. antibody binding, cell-penetrating peptide prediction), RNA structure analysis, RNA function prediction, and CRISPR guide design. It achieves this with orders-of-magnitude improvements in inference speed and reduction in parameters (up to 120,000-fold in our tests) compared to recent biology foundation models. Using Lyra, we were able to train and run every task in this study on two or fewer GPUs in under two hours, democratizing access to biological sequence modeling at SOTA performance, with potential applications to many fields.

Title: JARVIS-VLA: Post-Training Large-Scale Vision Language Models to Play Visual Games with Keyboards and Mouse

Authors: Muyao Li, Zihao Wang, Kaichen He, Xiaojian Ma, Yitao Liang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.16365
Pdf URL: https://arxiv.org/pdf/2503.16365
Copy Paste: [[2503.16365]] JARVIS-VLA: Post-Training Large-Scale Vision Language Models to Play Visual Games with Keyboards and Mouse(https://arxiv.org/abs/2503.16365)
Keywords: self-supervised
Abstract: Recently, action-based decision-making in open-world environments has gained significant attention. Visual Language Action (VLA) models, pretrained on large-scale web datasets, have shown promise in decision-making tasks. However, previous work has primarily focused on action post-training, often neglecting enhancements to the foundational model itself. In response, we introduce a novel approach, Act from Visual Language Post-Training, which refines Visual Language Models (VLMs) through visual and linguistic guidance in a self-supervised manner. This enhancement improves the models' capabilities in world knowledge, visual recognition, and spatial grounding in open-world environments. Following the above post-training paradigms, we obtain the first VLA models in Minecraft that can follow human instructions on over 1k different atomic tasks, including crafting, smelting, cooking, mining, and killing. Our experiments demonstrate that post-training on non-trajectory tasks leads to a significant 40% improvement over the best agent baseline on a diverse set of atomic tasks. Furthermore, we demonstrate that our approach surpasses traditional imitation learning-based policies in Minecraft, achieving state-of-the-art performance. We have open-sourced the code, models, and datasets to foster further research. The project page can be found at this https URL.

Title: NuiScene: Exploring Efficient Generation of Unbounded Outdoor Scenes

Authors: Han-Hung Lee, Qinghong Han, Angel X. Chang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.16375
Pdf URL: https://arxiv.org/pdf/2503.16375
Copy Paste: [[2503.16375]] NuiScene: Exploring Efficient Generation of Unbounded Outdoor Scenes(https://arxiv.org/abs/2503.16375)
Keywords: diffusion
Abstract: In this paper, we explore the task of generating expansive outdoor scenes, ranging from castles to high-rises. Unlike indoor scene generation, which has been a primary focus of prior work, outdoor scene generation presents unique challenges, including wide variations in scene heights and the need for a method capable of rapidly producing large landscapes. To address this, we propose an efficient approach that encodes scene chunks as uniform vector sets, offering better compression and performance than the spatially structured latents used in prior methods. Furthermore, we train an explicit outpainting model for unbounded generation, which improves coherence compared to prior resampling-based inpainting schemes while also speeding up generation by eliminating extra diffusion steps. To facilitate this task, we curate NuiScene43, a small but high-quality set of scenes, preprocessed for joint training. Notably, when trained on scenes of varying styles, our model can blend different environments, such as rural houses and city skyscrapers, within the same scene, highlighting the potential of our curation process to leverage heterogeneous scenes for joint training.

Title: LaPIG: Cross-Modal Generation of Paired Thermal and Visible Facial Images

Authors: Leyang Wang, Joice Lin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.16376
Pdf URL: https://arxiv.org/pdf/2503.16376
Copy Paste: [[2503.16376]] LaPIG: Cross-Modal Generation of Paired Thermal and Visible Facial Images(https://arxiv.org/abs/2503.16376)
Keywords: diffusion
Abstract: The success of modern machine learning, particularly in facial translation networks, is highly dependent on the availability of high-quality, paired, large-scale datasets. However, acquiring sufficient data is often challenging and costly. Inspired by the recent success of diffusion models in high-quality image synthesis and advancements in Large Language Models (LLMs), we propose a novel framework called LLM-assisted Paired Image Generation (LaPIG). This framework enables the construction of comprehensive, high-quality paired visible and thermal images using captions generated by LLMs. Our method encompasses three parts: visible image synthesis with ArcFace embedding, thermal image translation using Latent Diffusion Models (LDMs), and caption generation with LLMs. Our approach not only generates multi-view paired visible and thermal images to increase data diversity but also produces high-quality paired data while maintaining their identity information. We evaluate our method on public datasets by comparing it with existing methods, demonstrating the superiority of LaPIG.

Title: Do Visual Imaginations Improve Vision-and-Language Navigation Agents?

Authors: Akhil Perincherry, Jacob Krantz, Stefan Lee
Subjects: cs.CV, cs.AI, cs.CL, cs.RO
Abstract URL: https://arxiv.org/abs/2503.16394
Pdf URL: https://arxiv.org/pdf/2503.16394
Copy Paste: [[2503.16394]] Do Visual Imaginations Improve Vision-and-Language Navigation Agents?(https://arxiv.org/abs/2503.16394)
Keywords: diffusion
Abstract: Vision-and-Language Navigation (VLN) agents are tasked with navigating an unseen environment using natural language instructions. In this work, we study if visual representations of sub-goals implied by the instructions can serve as navigational cues and lead to increased navigation performance. To synthesize these visual representations or imaginations, we leverage a text-to-image diffusion model on landmark references contained in segmented instructions. These imaginations are provided to VLN agents as an added modality to act as landmark cues and an auxiliary loss is added to explicitly encourage relating these with their corresponding referring expressions. Our findings reveal an increase in success rate (SR) of around 1 point and up to 0.5 points in success scaled by inverse path length (SPL) across agents. These results suggest that the proposed approach reinforces visual understanding compared to relying on language instructions alone. Code and data for our work can be found at this https URL.

Title: SV4D 2.0: Enhancing Spatio-Temporal Consistency in Multi-View Video Diffusion for High-Quality 4D Generation

Authors: Chun-Han Yao, Yiming Xie, Vikram Voleti, Huaizu Jiang, Varun Jampani
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.16396
Pdf URL: https://arxiv.org/pdf/2503.16396
Copy Paste: [[2503.16396]] SV4D 2.0: Enhancing Spatio-Temporal Consistency in Multi-View Video Diffusion for High-Quality 4D Generation(https://arxiv.org/abs/2503.16396)
Keywords: diffusion
Abstract: We present Stable Video 4D 2.0 (SV4D 2.0), a multi-view video diffusion model for dynamic 3D asset generation. Compared to its predecessor SV4D, SV4D 2.0 is more robust to occlusions and large motion, generalizes better to real-world videos, and produces higher-quality outputs in terms of detail sharpness and spatio-temporal consistency. We achieve this by introducing key improvements in multiple aspects: 1) network architecture: eliminating the dependency on reference multi-views and designing a blending mechanism for 3D and frame attention, 2) data: enhancing quality and quantity of training data, 3) training strategy: adopting progressive 3D-4D training for better generalization, and 4) 4D optimization: handling 3D inconsistency and large motion via 2-stage refinement and progressive frame sampling. Extensive experiments demonstrate significant performance gain by SV4D 2.0 both visually and quantitatively, achieving better detail (-14% LPIPS) and 4D consistency (-44% FV4D) in novel-view video synthesis and 4D optimization (-12% LPIPS and -24% FV4D) compared to SV4D. Project page: this https URL.

Title: Scale-wise Distillation of Diffusion Models

Authors: Nikita Starodubcev, Denis Kuznedelev, Artem Babenko, Dmitry Baranchuk
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.16397
Pdf URL: https://arxiv.org/pdf/2503.16397
Copy Paste: [[2503.16397]] Scale-wise Distillation of Diffusion Models(https://arxiv.org/abs/2503.16397)
Keywords: diffusion
Abstract: We present SwD, a scale-wise distillation framework for diffusion models (DMs), which effectively employs next-scale prediction ideas for diffusion-based few-step generators. In more detail, SwD is inspired by the recent insights relating diffusion processes to the implicit spectral autoregression. We suppose that DMs can initiate generation at lower data resolutions and gradually upscale the samples at each denoising step without loss in performance while significantly reducing computational costs. SwD naturally integrates this idea into existing diffusion distillation methods based on distribution matching. Also, we enrich the family of distribution matching approaches by introducing a novel patch loss enforcing finer-grained similarity to the target distribution. When applied to state-of-the-art text-to-image diffusion models, SwD approaches the inference times of two full resolution steps and significantly outperforms the counterparts under the same computation budget, as evidenced by automated metrics and human preference studies.

Title: ScalingNoise: Scaling Inference-Time Search for Generating Infinite Videos

Authors: Haolin Yang, Feilong Tang, Ming Hu, Yulong Li, Junjie Guo, Yexin Liu, Zelin Peng, Junjun He, Zongyuan Ge, Imran Razzak
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.16400
Pdf URL: https://arxiv.org/pdf/2503.16400
Copy Paste: [[2503.16400]] ScalingNoise: Scaling Inference-Time Search for Generating Infinite Videos(https://arxiv.org/abs/2503.16400)
Keywords: diffusion
Abstract: Video diffusion models (VDMs) facilitate the generation of high-quality videos, with current research predominantly concentrated on scaling efforts during training through improvements in data quality, computational resources, and model complexity. However, inference-time scaling has received less attention, with most approaches restricting models to a single generation attempt. Recent studies have uncovered the existence of "golden noises" that can enhance video quality during generation. Building on this, we find that guiding the scaling inference-time search of VDMs to identify better noise candidates not only evaluates the quality of the frames generated in the current step but also preserves the high-level object features by referencing the anchor frame from previous multi-chunks, thereby delivering long-term value. Our analysis reveals that diffusion models inherently possess flexible adjustments of computation by varying denoising steps, and even a one-step denoising approach, when guided by a reward signal, yields significant long-term benefits. Based on this observation, we propose ScalingNoise, a plug-and-play inference-time search strategy that identifies golden initial noises for the diffusion sampling process to improve global content consistency and visual diversity. Specifically, we perform one-step denoising to convert initial noises into a clip and subsequently evaluate its long-term value, leveraging a reward model anchored by previously generated content. Moreover, to preserve diversity, we sample candidates from a tilted noise distribution that up-weights promising noises. In this way, ScalingNoise significantly reduces noise-induced errors, ensuring more coherent and spatiotemporally consistent video generation. Extensive experiments on benchmark datasets demonstrate that the proposed ScalingNoise effectively improves long video generation.

Title: DreamTexture: Shape from Virtual Texture with Analysis by Augmentation

Authors: Ananta R. Bhattarai, Xingzhe He, Alla Sheffer, Helge Rhodin
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.16412
Pdf URL: https://arxiv.org/pdf/2503.16412
Copy Paste: [[2503.16412]] DreamTexture: Shape from Virtual Texture with Analysis by Augmentation(https://arxiv.org/abs/2503.16412)
Keywords: diffusion, generative
Abstract: DreamFusion established a new paradigm for unsupervised 3D reconstruction from virtual views by combining advances in generative models and differentiable rendering. However, the underlying multi-view rendering, along with supervision from large-scale generative models, is computationally expensive and under-constrained. We propose DreamTexture, a novel Shape-from-Virtual-Texture approach that leverages monocular depth cues to reconstruct 3D objects. Our method textures an input image by aligning a virtual texture with the real depth cues in the input, exploiting the inherent understanding of monocular geometry encoded in modern diffusion models. We then reconstruct depth from the virtual texture deformation with a new conformal map optimization, which alleviates memory-intensive volumetric representations. Our experiments reveal that generative models possess an understanding of monocular shape cues, which can be extracted by augmenting and aligning texture cues -- a novel monocular reconstruction paradigm that we call Analysis by Augmentation.

Title: M3: 3D-Spatial MultiModal Memory

Authors: Xueyan Zou, Yuchen Song, Ri-Zhao Qiu, Xuanbin Peng, Jianglong Ye, Sifei Liu, Xiaolong Wang
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2503.16413
Pdf URL: https://arxiv.org/pdf/2503.16413
Copy Paste: [[2503.16413]] M3: 3D-Spatial MultiModal Memory(https://arxiv.org/abs/2503.16413)
Keywords: foundation model
Abstract: We present 3D Spatial MultiModal Memory (M3), a multimodal memory system designed to retain information about medium-sized static scenes through video sources for visual perception. By integrating 3D Gaussian Splatting techniques with foundation models, M3 builds a multimodal memory capable of rendering feature representations across granularities, encompassing a wide range of knowledge. In our exploration, we identify two key challenges in previous works on feature splatting: (1) computational constraints in storing high-dimensional features for each Gaussian primitive, and (2) misalignment or information loss between distilled features and foundation model features. To address these challenges, we propose M3 with key components of principal scene components and Gaussian memory attention, enabling efficient training and inference. To validate M3, we conduct comprehensive quantitative evaluations of feature similarity and downstream tasks, as well as qualitative visualizations to highlight the pixel trace of Gaussian memory attention. Our approach encompasses a diverse range of foundation models, including vision-language models (VLMs), perception models, and large multimodal and language models (LMMs/LLMs). Furthermore, to demonstrate real-world applicability, we deploy M3's feature field in indoor scenes on a quadruped robot. Notably, we claim that M3 is the first work to address the core compression challenges in 3D feature distillation.

Title: InfiniteYou: Flexible Photo Recrafting While Preserving Your Identity

Authors: Liming Jiang, Qing Yan, Yumin Jia, Zichuan Liu, Hao Kang, Xin Lu
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.16418
Pdf URL: https://arxiv.org/pdf/2503.16418
Copy Paste: [[2503.16418]] InfiniteYou: Flexible Photo Recrafting While Preserving Your Identity(https://arxiv.org/abs/2503.16418)
Keywords: diffusion
Abstract: Achieving flexible and high-fidelity identity-preserved image generation remains formidable, particularly with advanced Diffusion Transformers (DiTs) like FLUX. We introduce InfiniteYou (InfU), one of the earliest robust frameworks leveraging DiTs for this task. InfU addresses significant issues of existing methods, such as insufficient identity similarity, poor text-image alignment, and low generation quality and aesthetics. Central to InfU is InfuseNet, a component that injects identity features into the DiT base model via residual connections, enhancing identity similarity while maintaining generation capabilities. A multi-stage training strategy, including pretraining and supervised fine-tuning (SFT) with synthetic single-person-multiple-sample (SPMS) data, further improves text-image alignment, ameliorates image quality, and alleviates face copy-pasting. Extensive experiments demonstrate that InfU achieves state-of-the-art performance, surpassing existing baselines. In addition, the plug-and-play design of InfU ensures compatibility with various existing methods, offering a valuable contribution to the broader community.

Title: SynCity: Training-Free Generation of 3D Worlds

Authors: Paul Engstler, Aleksandar Shtedritski, Iro Laina, Christian Rupprecht, Andrea Vedaldi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.16420
Pdf URL: https://arxiv.org/pdf/2503.16420
Copy Paste: [[2503.16420]] SynCity: Training-Free Generation of 3D Worlds(https://arxiv.org/abs/2503.16420)
Keywords: generative
Abstract: We address the challenge of generating 3D worlds from textual descriptions. We propose SynCity, a training- and optimization-free approach, which leverages the geometric precision of pre-trained 3D generative models and the artistic versatility of 2D image generators to create large, high-quality 3D spaces. While most 3D generative models are object-centric and cannot generate large-scale worlds, we show how 3D and 2D generators can be combined to generate ever-expanding scenes. Through a tile-based approach, we allow fine-grained control over the layout and the appearance of scenes. The world is generated tile-by-tile, and each new tile is generated within its world-context and then fused with the scene. SynCity generates compelling and immersive scenes that are rich in detail and diversity.

Title: Tokenize Image as a Set

Authors: Zigang Geng, Mengde Xu, Han Hu, Shuyang Gu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.16425
Pdf URL: https://arxiv.org/pdf/2503.16425
Copy Paste: [[2503.16425]] Tokenize Image as a Set(https://arxiv.org/abs/2503.16425)
Keywords: diffusion
Abstract: This paper proposes a fundamentally new paradigm for image generation through set-based tokenization and distribution modeling. Unlike conventional methods that serialize images into fixed-position latent codes with a uniform compression ratio, we introduce an unordered token set representation to dynamically allocate coding capacity based on regional semantic complexity. This TokenSet enhances global context aggregation and improves robustness against local perturbations. To address the critical challenge of modeling discrete sets, we devise a dual transformation mechanism that bijectively converts sets into fixed-length integer sequences with summation constraints. Further, we propose Fixed-Sum Discrete Diffusion--the first framework to simultaneously handle discrete values, fixed sequence length, and summation invariance--enabling effective set distribution modeling. Experiments demonstrate our method's superiority in semantic-aware representation and generation quality. Our innovations, spanning novel representation and modeling strategies, advance visual generation beyond traditional sequential token paradigms. Our code and models are publicly available at this https URL.
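
Note: The set-to-sequence conversion mentioned in the abstract can be pictured with a small sketch. It assumes the conversion counts how often each codebook index occurs, which the abstract does not spell out; the function names, the codebook size, and the example tokens are hypothetical and are not taken from the paper's code.

```python
def set_to_counts(tokens, codebook_size):
    """Map an unordered collection of token ids to a fixed-length count vector.

    Assumption: each token id is an integer in [0, codebook_size).
    The result always has length codebook_size and sums to len(tokens).
    """
    counts = [0] * codebook_size
    for t in tokens:
        if not 0 <= t < codebook_size:
            raise ValueError(f"token id {t} outside codebook of size {codebook_size}")
        counts[t] += 1
    return counts


def counts_to_set(counts):
    """Invert set_to_counts, returning the token ids in canonical sorted order."""
    return [idx for idx, c in enumerate(counts) for _ in range(c)]


tokens = [3, 1, 3, 0]                 # order carries no meaning
counts = set_to_counts(tokens, 5)     # fixed length 5, entries sum to 4
recovered = counts_to_set(counts)     # same multiset, sorted
```

Under this assumption, any reordering of the tokens yields the same count vector, and the entries always sum to the number of tokens, which is the kind of fixed-length, fixed-sum sequence the abstract says the diffusion model operates on.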

Title: DynamicVis: An Efficient and General Visual Foundation Model for Remote Sensing Image Understanding

Authors: Keyan Chen, Chenyang Liu, Bowen Chen, Wenyuan Li, Zhengxia Zou, Zhenwei Shi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.16426
Pdf URL: https://arxiv.org/pdf/2503.16426
Copy Paste: [[2503.16426]] DynamicVis: An Efficient and General Visual Foundation Model for Remote Sensing Image Understanding(https://arxiv.org/abs/2503.16426)
Keywords: foundation model
Abstract: The advancement of remote sensing technology has improved the spatial resolution of satellite imagery, facilitating more detailed visual representations for diverse interpretations. However, existing methods exhibit limited generalization capabilities across varied applications. While some contemporary foundation models demonstrate potential, they are hindered by insufficient cross-task adaptability and primarily process low-resolution imagery of restricted sizes, thus failing to fully exploit high-resolution data or leverage comprehensive large-scene semantics. Crucially, remote sensing imagery differs fundamentally from natural images, as key foreground targets (e.g., maritime objects, artificial structures) often occupy minimal spatial proportions (~1%) and exhibit sparse distributions. Efficiently modeling cross-task generalizable knowledge from lengthy 2D tokens (~100,000) poses a significant challenge yet remains critical for remote sensing image understanding. Motivated by the selective attention mechanisms inherent to the human visual system, we propose DynamicVis, a dynamic visual perception foundation model for remote sensing imagery. The framework integrates a novel dynamic region perception backbone based on the selective state space model, which strategically balances localized detail extraction with global contextual integration, enabling computationally efficient encoding of large-scale data while maintaining architectural scalability. To enhance cross-task knowledge transfer, we introduce a multi-instance learning paradigm utilizing meta-embedding representations, trained on million-scale region-level annotations. Evaluations across nine downstream tasks demonstrate the model's versatility. DynamicVis achieves multi-level feature modeling with exceptional efficiency, processing (2048x2048) pixels with 97 ms latency (6% of ViT's) and 833 MB GPU memory (3% of ViT's).

Title: Sonata: Self-Supervised Learning of Reliable Point Representations

Authors: Xiaoyang Wu, Daniel DeTone, Duncan Frost, Tianwei Shen, Chris Xie, Nan Yang, Jakob Engel, Richard Newcombe, Hengshuang Zhao, Julian Straub
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.16429
Pdf URL: https://arxiv.org/pdf/2503.16429
Copy Paste: [[2503.16429]] Sonata: Self-Supervised Learning of Reliable Point Representations(https://arxiv.org/abs/2503.16429)
Keywords: self-supervised
Abstract: In this paper, we question whether we have a reliable self-supervised point cloud model that can be used for diverse 3D tasks via simple linear probing, even with limited data and minimal computation. We find that existing 3D self-supervised learning approaches fall short when evaluated on representation quality through linear probing. We hypothesize that this is due to what we term the "geometric shortcut", which causes representations to collapse to low-level spatial features. This challenge is unique to 3D and arises from the sparse nature of point cloud data. We address it through two key strategies: obscuring spatial information and enhancing the reliance on input features, ultimately composing a Sonata of 140k point clouds through self-distillation. Sonata is simple and intuitive, yet its learned representations are strong and reliable: zero-shot visualizations demonstrate semantic grouping, alongside strong spatial reasoning through nearest-neighbor relationships. Sonata demonstrates exceptional parameter and data efficiency, tripling linear probing accuracy (from 21.8% to 72.5%) on ScanNet and nearly doubling performance with only 1% of the data compared to previous approaches. Full fine-tuning further advances SOTA across both 3D indoor and outdoor perception tasks.