2025-01-22

Title: Towards General Purpose Robots at Scale: Lifelong Learning and Learning to Use Memory

Authors: William Yue
Subjects: cs.LG, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2501.10395
Pdf URL: https://arxiv.org/pdf/2501.10395
Copy Paste: [[2501.10395]] Towards General Purpose Robots at Scale: Lifelong Learning and Learning to Use Memory(https://arxiv.org/abs/2501.10395)
Keywords: generative
Abstract: The widespread success of artificial intelligence in fields like natural language processing and computer vision has not yet fully transferred to robotics, where progress is hindered by the lack of large-scale training data and the complexity of real-world tasks. To address this, many robot learning researchers are pushing to get robots deployed at scale in everyday unstructured environments like our homes to initiate a data flywheel. While current robot learning systems are effective for certain short-horizon tasks, they are not designed to autonomously operate over long time horizons in unstructured environments. This thesis focuses on addressing two key challenges for robots operating over long time horizons: memory and lifelong learning. We propose two novel methods to advance these capabilities. First, we introduce t-DGR, a trajectory-based deep generative replay method that achieves state-of-the-art performance on Continual World benchmarks, advancing lifelong learning. Second, we develop a framework that leverages human demonstrations to teach agents effective memory utilization, improving learning efficiency and success rates on Memory Gym tasks. Finally, we discuss future directions for achieving the lifelong learning and memory capabilities necessary for robots to function at scale in real-world settings.

Title: Uncovering Bias in Foundation Models: Impact, Testing, Harm, and Mitigation

Authors: Shuzhou Sun (1 and 2), Li Liu (3), Yongxiang Liu (3), Zhen Liu (3), Shuanghui Zhang (3), Janne Heikkilä (2), Xiang Li (3) ((1) The College of Computer Science, Nankai University, Tianjin, China, (2) The Center for Machine Vision and Signal Analysis, University of Oulu, Finland, (3) The College of Electronic Science, National University of Defense Technology, China)
Subjects: cs.LG, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2501.10453
Pdf URL: https://arxiv.org/pdf/2501.10453
Copy Paste: [[2501.10453]] Uncovering Bias in Foundation Models: Impact, Testing, Harm, and Mitigation(https://arxiv.org/abs/2501.10453)
Keywords: foundation model
Abstract: Bias in Foundation Models (FMs) - trained on vast datasets spanning societal and historical knowledge - poses significant challenges for fairness and equity across fields such as healthcare, education, and finance. These biases, rooted in the overrepresentation of stereotypes and societal inequalities in training data, exacerbate real-world discrimination, reinforce harmful stereotypes, and erode trust in AI systems. To address this, we introduce Trident Probe Testing (TriProTesting), a systematic testing method that detects explicit and implicit biases using semantically designed probes. Here we show that FMs, including CLIP, ALIGN, BridgeTower, and OWLv2, demonstrate pervasive biases across single and mixed social attributes (gender, race, age, and occupation). Notably, we uncover mixed biases when social attributes are combined, such as gender x race, gender x age, and gender x occupation, revealing deeper layers of discrimination. We further propose Adaptive Logit Adjustment (AdaLogAdjustment), a post-processing technique that dynamically redistributes probability power to mitigate these biases effectively, achieving significant improvements in fairness without retraining models. These findings highlight the urgent need for ethical AI practices and interdisciplinary solutions to address biases not only at the model level but also in societal structures. Our work provides a scalable and interpretable solution that advances fairness in AI systems while offering practical insights for future research on fair AI technologies.

Title: BloomScene: Lightweight Structured 3D Gaussian Splatting for Crossmodal Scene Generation

Authors: Xiaolu Hou, Mingcheng Li, Dingkang Yang, Jiawei Chen, Ziyun Qian, Xiao Zhao, Yue Jiang, Jinjie Wei, Qingyao Xu, Lihua Zhang
Subjects: cs.CV, cs.AI, cs.GR, cs.LG
Abstract URL: https://arxiv.org/abs/2501.10462
Pdf URL: https://arxiv.org/pdf/2501.10462
Copy Paste: [[2501.10462]] BloomScene: Lightweight Structured 3D Gaussian Splatting for Crossmodal Scene Generation(https://arxiv.org/abs/2501.10462)
Keywords: diffusion
Abstract: With the widespread use of virtual reality applications, 3D scene generation has become a new challenging research frontier. 3D scenes have highly complex structures and need to ensure that the output is dense, coherent, and contains all necessary structures. Many current 3D scene generation methods rely on pre-trained text-to-image diffusion models and monocular depth estimators. However, the generated scenes occupy large amounts of storage space and often lack effective regularisation methods, leading to geometric distortions. To this end, we propose BloomScene, a lightweight structured 3D Gaussian splatting for crossmodal scene generation, which creates diverse and high-quality 3D scenes from text or image inputs. Specifically, a crossmodal progressive scene generation framework is proposed to generate coherent scenes utilizing incremental point cloud reconstruction and 3D Gaussian splatting. Additionally, we propose a hierarchical depth prior-based regularization mechanism that utilizes multi-level constraints on depth accuracy and smoothness to enhance the realism and continuity of the generated scenes. Ultimately, we propose a structured context-guided compression mechanism that exploits structured hash grids to model the context of unorganized anchor attributes, which significantly eliminates structural redundancy and reduces storage overhead. Comprehensive experiments across multiple scenes demonstrate the significant potential and advantages of our framework compared with several baselines.

Title: Improving the Efficiency of Self-Supervised Adversarial Training through Latent Clustering-Based Selection

Authors: Somrita Ghosh, Yuelin Xu, Xiao Zhang
Subjects: cs.LG, cs.AI, cs.CR, cs.CV
Abstract URL: https://arxiv.org/abs/2501.10466
Pdf URL: https://arxiv.org/pdf/2501.10466
Copy Paste: [[2501.10466]] Improving the Efficiency of Self-Supervised Adversarial Training through Latent Clustering-Based Selection(https://arxiv.org/abs/2501.10466)
Keywords: self-supervised
Abstract: Compared with standard learning, adversarially robust learning is widely recognized to demand significantly more training examples. Recent works propose the use of self-supervised adversarial training (SSAT) with external or synthetically generated unlabeled data to enhance model robustness. However, SSAT requires a substantial amount of extra unlabeled data, significantly increasing memory usage and model training times. To address these challenges, we propose novel methods to strategically select a small subset of unlabeled data essential for SSAT and robustness improvement. Our selection prioritizes data points near the model's decision boundary based on latent clustering-based techniques, efficiently identifying a critical subset of unlabeled data with a higher concentration of boundary-adjacent points. While focusing on near-boundary data, our methods are designed to maintain a balanced ratio between boundary and non-boundary data points to avoid overfitting. Our experiments on image benchmarks show that integrating our selection strategies into self-supervised adversarial training can largely reduce memory and computational requirements while achieving high model robustness. In particular, our latent clustering-based selection method with k-means is the most effective, achieving nearly identical test-time robust accuracies with 5 to 10 times less external or generated unlabeled data when applied to image benchmarks. Additionally, we validate the generalizability of our approach across various application scenarios, including a real-world medical dataset for COVID-19 chest X-ray classification.

Title: Tabular-TX: Theme-Explanation Structure-based Table Summarization via In-Context Learning

Authors: TaeYoon Kwack, Jisoo Kim, Ki Yong Jung, DongGeon Lee, Heesun Park
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2501.10487
Pdf URL: https://arxiv.org/pdf/2501.10487
Copy Paste: [[2501.10487]] Tabular-TX: Theme-Explanation Structure-based Table Summarization via In-Context Learning(https://arxiv.org/abs/2501.10487)
Keywords: in-context
Abstract: This paper proposes a Theme-Explanation Structure-based Table Summarization (Tabular-TX) pipeline designed to efficiently process table data. Tabular-TX preprocesses table data by focusing on highlighted cells and then generates summary sentences structured with a Theme Part in the form of adverbial phrases followed by an Explanation Part in the form of clauses. In this process, customized analysis is performed by considering the structural characteristics and comparability of the table. Additionally, by utilizing In-Context Learning, Tabular-TX optimizes the analytical capabilities of large language models (LLMs) without the need for fine-tuning, effectively handling the structural complexity of table data. Results from applying the proposed Tabular-TX to generate table-based summaries demonstrated superior performance compared to existing fine-tuning-based methods, despite limitations in dataset size. Experimental results confirmed that Tabular-TX can process complex table data more effectively and established it as a new alternative for table-based question answering and summarization tasks, particularly in resource-constrained environments.

Title: Towards Data-Centric AI: A Comprehensive Survey of Traditional, Reinforcement, and Generative Approaches for Tabular Data Transformation

Authors: Dongjie Wang, Yanyong Huang, Wangyang Ying, Haoyue Bai, Nanxu Gong, Xinyuan Wang, Sixun Dong, Tao Zhe, Kunpeng Liu, Meng Xiao, Pengfei Wang, Pengyang Wang, Hui Xiong, Yanjie Fu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2501.10555
Pdf URL: https://arxiv.org/pdf/2501.10555
Copy Paste: [[2501.10555]] Towards Data-Centric AI: A Comprehensive Survey of Traditional, Reinforcement, and Generative Approaches for Tabular Data Transformation(https://arxiv.org/abs/2501.10555)
Keywords: generative
Abstract: Tabular data is one of the most widely used formats across industries, driving critical applications in areas such as finance, healthcare, and marketing. In the era of data-centric AI, improving data quality and representation has become essential for enhancing model performance, particularly in applications centered around tabular data. This survey examines the key aspects of tabular data-centric AI, emphasizing feature selection and feature generation as essential techniques for data space refinement. We provide a systematic review of feature selection methods, which identify and retain the most relevant data attributes, and feature generation approaches, which create new features to simplify the capture of complex data patterns. This survey offers a comprehensive overview of current methodologies through an analysis of recent advancements, practical applications, and the strengths and limitations of these techniques. Finally, we outline open challenges and suggest future perspectives to inspire continued innovation in this field.

Title: Mutual Regression Distance

Authors: Dong Qiao, Jicong Fan
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2501.10617
Pdf URL: https://arxiv.org/pdf/2501.10617
Copy Paste: [[2501.10617]] Mutual Regression Distance(https://arxiv.org/abs/2501.10617)
Keywords: generative
Abstract: The maximum mean discrepancy and Wasserstein distance are popular distance measures between distributions and play important roles in many machine learning problems such as metric learning, generative modeling, domain adaption, and clustering. However, since they are functions of pair-wise distances between data points in two distributions, they do not exploit the potential manifold properties of data such as smoothness and hence are not effective in measuring the dissimilarity between the two distributions in the form of manifolds. In this paper, different from existing measures, we propose a novel distance called Mutual Regression Distance (MRD) induced by a constrained mutual regression problem, which can exploit the manifold property of data. We prove that MRD is a pseudometric that satisfies almost all the axioms of a metric. Since the optimization of the original MRD is costly, we provide a tight MRD and a simplified MRD, based on which a heuristic algorithm is established. We also provide kernel variants of MRDs that are more effective in handling nonlinear data. Our MRDs especially the simplified MRDs have much lower computational complexity than the Wasserstein distance. We provide theoretical guarantees, such as robustness, for MRDs. Finally, we apply MRDs to distribution clustering, generative models, and domain adaptation. The numerical results demonstrate the effectiveness and superiority of MRDs compared to the baselines.

Title: AI/ML Based Detection and Categorization of Covert Communication in IPv6 Network

Authors: Mohammad Wali Ur Rahman, Yu-Zheng Lin, Carter Weeks, David Ruddell, Jeff Gabriellini, Bill Hayes, Salim Hariri, Edward V. Ziegler Jr
Subjects: cs.CR, cs.AI, cs.LG, cs.NI
Abstract URL: https://arxiv.org/abs/2501.10627
Pdf URL: https://arxiv.org/pdf/2501.10627
Copy Paste: [[2501.10627]] AI/ML Based Detection and Categorization of Covert Communication in IPv6 Network(https://arxiv.org/abs/2501.10627)
Keywords: generative
Abstract: The flexibility and complexity of IPv6 extension headers allow attackers to create covert channels or bypass security mechanisms, leading to potential data breaches or system compromises. The mature development of machine learning has become the primary detection technology option used to mitigate covert communication threats. However, the complexity of detecting covert communication, evolving injection techniques, and scarcity of data make building machine-learning models challenging. In previous related research, machine learning has shown good performance in detecting covert communications, but oversimplified attack scenario assumptions cannot represent the complexity of modern covert technologies and make it easier for machine learning models to detect covert communications. To bridge this gap, in this study, we analyzed the packet structure and network traffic behavior of IPv6, used encryption algorithms, and performed covert communication injection without changing network packet behavior to get closer to real attack scenarios. In addition to analyzing and injecting methods for covert communications, this study also uses comprehensive machine learning techniques to train the model proposed in this study to detect threats, including traditional decision trees such as random forests and gradient boosting, as well as complex neural network architectures such as CNNs and LSTMs, to achieve detection accuracy of over 90\%. This study details the methods used for dataset augmentation and the comparative performance of the applied models, reinforcing insights into the adaptability and resilience of the machine learning application in IPv6 covert communication. In addition, we also proposed a Generative AI-assisted interpretation concept based on prompt engineering as a preliminary study of the role of Generative AI agents in covert communication.

Title: HOPS: High-order Polynomials with Self-supervised Dimension Reduction for Load Forecasting

Authors: Pengyang Song, Han Feng, Shreyashi Shukla, Jue Wang, Tao Hong
Subjects: cs.LG, eess.SY
Abstract URL: https://arxiv.org/abs/2501.10637
Pdf URL: https://arxiv.org/pdf/2501.10637
Copy Paste: [[2501.10637]] HOPS: High-order Polynomials with Self-supervised Dimension Reduction for Load Forecasting(https://arxiv.org/abs/2501.10637)
Keywords: self-supervised
Abstract: Load forecasting is a fundamental task in smart grid. Many techniques have been applied to developing load forecasting models. Due to the challenges such as the Curse of Dimensionality, overfitting, and limited computing resources, multivariate higher-order polynomial models have received limited attention in load forecasting, despite their desirable mathematical foundations and optimization properties. In this paper, we propose low rank approximation and self-supervised dimension reduction to address the aforementioned issues. To further improve computational efficiency, we also introduce a fast Conjugate Gradient based algorithm for the proposed polynomial models. Based on the ISO New England dataset used in Global Energy Forecasting Competition 2017, the proposed method high-order polynomials with self-supervised dimension reduction (HOPS) demonstrates higher forecasting accuracy over several competitive models. Additionally, experimental results indicate that our approach alleviates redundant variable construction, achieving better forecasts with fewer input variables.

Title: Unveiling the Mystery of Weight in Large Foundation Models: Gaussian Distribution Never Fades

Authors: Chongjie Si, Jingjing Jiang, Wei Shen
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2501.10661
Pdf URL: https://arxiv.org/pdf/2501.10661
Copy Paste: [[2501.10661]] Unveiling the Mystery of Weight in Large Foundation Models: Gaussian Distribution Never Fades(https://arxiv.org/abs/2501.10661)
Keywords: foundation model
Abstract: This paper presents a pioneering exploration of the mechanisms underlying large foundation models' (LFMs) weights, aiming to simplify AI research. Through extensive observation and analysis on prevailing LFMs, we find that regardless of initialization strategies, their weights predominantly follow a Gaussian distribution, with occasional sharp, inverted T-shaped, or linear patterns. We further discover that the weights share the i.i.d. properties of Gaussian noise, and explore their direct relationship. We find that transformation weights can be derived from Gaussian noise, and they primarily serve to increase the standard deviation of pre-trained weights, with their standard deviation growing with layer depth. In other words, transformation weights broaden the acceptable deviation from the optimal weights, facilitating adaptation to downstream tasks. Building upon the above conclusions, we thoroughly discussed the nature of optimal weights, ultimately concluding that they should exhibit zero-mean, symmetry, and sparsity, with the sparse values being a truncated Gaussian distribution and a few outliers. Our experiments in LFM adaptation and editing demonstrate the effectiveness of these insights. We hope these findings can provide a foundational understanding to pave the way for future advancements in the LFM community.

Title: Deep Operator Networks for Bayesian Parameter Estimation in PDEs

Authors: Amogh Raj, Carol Eunice Gudumotou, Sakol Bun, Keerthana Srinivasa, Arash Sarshar
Subjects: cs.LG, cs.CE, stat.ML
Abstract URL: https://arxiv.org/abs/2501.10684
Pdf URL: https://arxiv.org/pdf/2501.10684
Copy Paste: [[2501.10684]] Deep Operator Networks for Bayesian Parameter Estimation in PDEs(https://arxiv.org/abs/2501.10684)
Keywords: diffusion
Abstract: We present a novel framework combining Deep Operator Networks (DeepONets) with Physics-Informed Neural Networks (PINNs) to solve partial differential equations (PDEs) and estimate their unknown parameters. By integrating data-driven learning with physical constraints, our method achieves robust and accurate solutions across diverse scenarios. Bayesian training is implemented through variational inference, allowing for comprehensive uncertainty quantification for both aleatoric and epistemic uncertainties. This ensures reliable predictions and parameter estimates even in noisy conditions or when some of the physical equations governing the problem are missing. The framework demonstrates its efficacy in solving forward and inverse problems, including the 1D unsteady heat equation and 2D reaction-diffusion equations, as well as regression tasks with sparse, noisy observations. This approach provides a computationally efficient and generalizable method for addressing uncertainty quantification in PDE surrogate modeling.

Title: EMO2: End-Effector Guided Audio-Driven Avatar Video Generation

Authors: Linrui Tian, Siqi Hu, Qi Wang, Bang Zhang, Liefeng Bo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.10687
Pdf URL: https://arxiv.org/pdf/2501.10687
Copy Paste: [[2501.10687]] EMO2: End-Effector Guided Audio-Driven Avatar Video Generation(https://arxiv.org/abs/2501.10687)
Keywords: diffusion
Abstract: In this paper, we propose a novel audio-driven talking head method capable of simultaneously generating highly expressive facial expressions and hand gestures. Unlike existing methods that focus on generating full-body or half-body poses, we investigate the challenges of co-speech gesture generation and identify the weak correspondence between audio features and full-body gestures as a key limitation. To address this, we redefine the task as a two-stage process. In the first stage, we generate hand poses directly from audio input, leveraging the strong correlation between audio signals and hand movements. In the second stage, we employ a diffusion model to synthesize video frames, incorporating the hand poses generated in the first stage to produce realistic facial expressions and body movements. Our experimental results demonstrate that the proposed method outperforms state-of-the-art approaches, such as CyberHost and Vlogger, in terms of both visual quality and synchronization accuracy. This work provides a new perspective on audio-driven gesture generation and a robust framework for creating expressive and natural talking head animations.

Title: A CNN-Transformer for Classification of Longitudinal 3D MRI Images -- A Case Study on Hepatocellular Carcinoma Prediction

Authors: Jakob Nolte, Maureen M. J. Guichelaar, Donald E. Bouman, Stephanie M. van den Berg, Maryam Amir Haeri
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.10733
Pdf URL: https://arxiv.org/pdf/2501.10733
Copy Paste: [[2501.10733]] A CNN-Transformer for Classification of Longitudinal 3D MRI Images -- A Case Study on Hepatocellular Carcinoma Prediction(https://arxiv.org/abs/2501.10733)
Keywords: self-supervised
Abstract: Longitudinal MRI analysis is crucial for predicting disease outcomes, particularly in chronic conditions like hepatocellular carcinoma (HCC), where early detection can significantly influence treatment strategies and patient prognosis. Yet, due to challenges like limited data availability, subtle parenchymal changes, and the irregular timing of medical screenings, current approaches have so far focused on cross-sectional imaging data. To address this, we propose HCCNet, a novel model architecture that integrates a 3D adaptation of the ConvNeXt CNN architecture with a Transformer encoder, capturing both the intricate spatial features of 3D MRIs and the complex temporal dependencies across different time points. HCCNet utilizes a two-stage pre-training process tailored for longitudinal MRI data. The CNN backbone is pre-trained using a self-supervised learning framework adapted for 3D MRIs, while the Transformer encoder is pre-trained with a sequence-order-prediction task to enhance its understanding of disease progression over time. We demonstrate the effectiveness of HCCNet by applying it to a cohort of liver cirrhosis patients undergoing regular MRI screenings for HCC surveillance. Our results show that HCCNet significantly improves predictive accuracy and reliability over baseline models, providing a robust tool for personalized HCC surveillance. The methodological approach presented in this paper is versatile and can be adapted to various longitudinal MRI screening applications. Its ability to handle varying patient record lengths and irregular screening intervals establishes it as an invaluable framework for monitoring chronic diseases, where timely and accurate disease prognosis is critical for effective treatment planning.

Title: GAUDA: Generative Adaptive Uncertainty-guided Diffusion-based Augmentation for Surgical Segmentation

Authors: Yannik Frisch, Christina Bornberg, Moritz Fuchs, Anirban Mukhopadhyay
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2501.10819
Pdf URL: https://arxiv.org/pdf/2501.10819
Copy Paste: [[2501.10819]] GAUDA: Generative Adaptive Uncertainty-guided Diffusion-based Augmentation for Surgical Segmentation(https://arxiv.org/abs/2501.10819)
Keywords: diffusion, generative
Abstract: Augmentation by generative modelling yields a promising alternative to the accumulation of surgical data, where ethical, organisational and regulatory aspects must be considered. Yet, the joint synthesis of (image, mask) pairs for segmentation, a major application in surgery, is rather unexplored. We propose to learn semantically comprehensive yet compact latent representations of the (image, mask) space, which we jointly model with a Latent Diffusion Model. We show that our approach can effectively synthesise unseen high-quality paired segmentation data of remarkable semantic coherence. Generative augmentation is typically applied pre-training by synthesising a fixed number of additional training samples to improve downstream task models. To enhance this approach, we further propose Generative Adaptive Uncertainty-guided Diffusion-based Augmentation (GAUDA), leveraging the epistemic uncertainty of a Bayesian downstream model for targeted online synthesis. We condition the generative model on classes with high estimated uncertainty during training to produce additional unseen samples for these classes. By adaptively utilising the generative model online, we can minimise the number of additional training samples and centre them around the currently most uncertain parts of the data distribution. GAUDA effectively improves downstream segmentation results over comparable methods by an average absolute IoU of 1.6% on CaDISv2 and 1.5% on CholecSeg8k, two prominent surgical datasets for semantic segmentation.

Title: Addressing Multilabel Imbalance with an Efficiency-Focused Approach Using Diffusion Model-Generated Synthetic Samples

Authors: Francisco Charte, Miguel Ángel Dávila, María Dolores Pérez-Godoy, María José del Jesus
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2501.10822
Pdf URL: https://arxiv.org/pdf/2501.10822
Copy Paste: [[2501.10822]] Addressing Multilabel Imbalance with an Efficiency-Focused Approach Using Diffusion Model-Generated Synthetic Samples(https://arxiv.org/abs/2501.10822)
Keywords: diffusion
Abstract: Predictive models trained on imbalanced data tend to produce biased results. This problem is exacerbated when there is not just one output label, but a set of them. This is the case for multilabel learning (MLL) algorithms used to classify patterns, rank labels, or learn the distribution of outputs. Many solutions have been proposed in the literature. The one that can be applied universally, independent of the algorithm used to build the model, is data resampling. The generation of new instances associated with minority labels, so that empty areas of the feature space are filled, helps to improve the obtained models. The quality of these new instances depends on the algorithm used to generate them. In this paper, a diffusion model tailored to produce new instances for MLL data, called MLDM (\textit{MultiLabel Diffusion Model}), is proposed. Diffusion models have been mainly used to generate artificial images and videos. Our proposed MLDM is based on this type of models. The experiments conducted compare MLDM with several other MLL resampling algorithms. The results show that MLDM is competitive while it improves efficiency.

Title: Visual RAG: Expanding MLLM visual knowledge without fine-tuning

Authors: Mirco Bonomo, Simone Bianco
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2501.10834
Pdf URL: https://arxiv.org/pdf/2501.10834
Copy Paste: [[2501.10834]] Visual RAG: Expanding MLLM visual knowledge without fine-tuning(https://arxiv.org/abs/2501.10834)
Keywords: in-context
Abstract: Multimodal Large Language Models (MLLMs) have achieved notable performance in computer vision tasks that require reasoning across visual and textual modalities, yet their capabilities are limited to their pre-trained data, requiring extensive fine-tuning for updates. Recent researches have explored the use of In-Context Learning (ICL) to overcome these challenges by providing a set of demonstrating examples as context to augment MLLMs performance in several tasks, showing that many-shot ICL leads to substantial improvements compared to few-shot ICL. However, the reliance on numerous demonstrating examples and the limited MLLMs context windows presents significant obstacles. This paper aims to address these challenges by introducing a novel approach, Visual RAG, that synergically combines the MLLMs capability to learn from the context, with a retrieval mechanism. The crux of this approach is to ensure to augment the MLLM knowledge by selecting only the most relevant demonstrating examples for the query, pushing it to learn by analogy. In this way, relying on the new information provided dynamically during inference time, the resulting system is not limited to the knowledge extracted from the training data, but can be updated rapidly and easily without fine-tuning. Furthermore, this greatly reduces the computational costs for improving the model image classification performance, and augments the model knowledge to new visual domains and tasks it was not trained for. Extensive experiments on eight different datasets in the state of the art spanning several domains and image classification tasks show that the proposed Visual RAG, compared to the most recent state of the art (i.e., many-shot ICL), is able to obtain an accuracy that is very close or even higher (approx. +2% improvement on average) while using a much smaller set of demonstrating examples (approx. only 23% on average).

Title: Diffusion-Based Imitation Learning for Social Pose Generation

Authors: Antonio Lech Martin-Ozimek, Isuru Jayarathne, Su Larb Mon, Jouh Yeong Chew
Subjects: cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2501.10869
Pdf URL: https://arxiv.org/pdf/2501.10869
Copy Paste: [[2501.10869]] Diffusion-Based Imitation Learning for Social Pose Generation(https://arxiv.org/abs/2501.10869)
Keywords: diffusion
Abstract: Intelligent agents, such as robots and virtual agents, must understand the dynamics of complex social interactions to interact with humans. Effectively representing social dynamics is challenging because we require multi-modal, synchronized observations to understand a scene. We explore how using a single modality, the pose behavior, of multiple individuals in a social interaction can be used to generate nonverbal social cues for the facilitator of that interaction. The facilitator acts to make a social interaction proceed smoothly and is an essential role for intelligent agents to replicate in human-robot interactions. In this paper, we adapt an existing diffusion behavior cloning model to learn and replicate facilitator behaviors. Furthermore, we evaluate two representations of pose observations from a scene, one representation has pre-processing applied and one does not. The purpose of this paper is to introduce a new use for diffusion behavior cloning for pose generation in social interactions. The second is to understand the relationship between performance and computational load for generating social pose behavior using two different techniques for collecting scene observations. As such, we are essentially testing the effectiveness of two different types of conditioning for a diffusion model. We then evaluate the resulting generated behavior from each technique using quantitative measures such as mean per-joint position error (MPJPE), training time, and inference time. Additionally, we plot training and inference time against MPJPE to examine the trade-offs between efficiency and performance. Our results suggest that the further pre-processed data can successfully condition diffusion models to generate realistic social behavior, with reasonable trade-offs in accuracy and processing time.

Title: CEReBrO: Compact Encoder for Representations of Brain Oscillations Using Efficient Alternating Attention

Authors: Alexandru Dimofte, Glenn Anta Bucagu, Thorir Mar Ingolfsson, Xiaying Wang, Andrea Cossettini, Luca Benini, Yawei Li
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2501.10885
Pdf URL: https://arxiv.org/pdf/2501.10885
Copy Paste: [[2501.10885]] CEReBrO: Compact Encoder for Representations of Brain Oscillations Using Efficient Alternating Attention(https://arxiv.org/abs/2501.10885)
Keywords: self-supervised, foundation model, anomaly
Abstract: Electroencephalograph (EEG) is a crucial tool for studying brain activity. Recently, self-supervised learning methods leveraging large unlabeled datasets have emerged as a potential solution to the scarcity of widely available annotated EEG data. However, current methods suffer from at least one of the following limitations: i) sub-optimal EEG signal modeling, ii) model sizes in the hundreds of millions of trainable parameters, and iii) reliance on private datasets and/or inconsistent public benchmarks, hindering reproducibility. To address these challenges, we introduce a Compact Encoder for Representations of Brain Oscillations using alternating attention (CEReBrO), a new small EEG foundation model. Our tokenization scheme represents EEG signals at a per-channel patch granularity. We propose an alternating attention mechanism that jointly models intra-channel temporal dynamics and inter-channel spatial correlations, achieving 2x speed improvement with 6x less memory required compared to standard self-attention. We present several model sizes ranging from 3.6 million to 85 million parameters. Pre-trained on over 20,000 hours of publicly available scalp EEG recordings with diverse channel configurations, our models set new benchmarks in emotion detection and seizure detection tasks, with competitive performance in anomaly classification and gait prediction. This validates our models' effectiveness and effictiveness.

Title: Learn-by-interact: A Data-Centric Framework for Self-Adaptive Agents in Realistic Environments

Authors: Hongjin Su, Ruoxi Sun, Jinsung Yoon, Pengcheng Yin, Tao Yu, Sercan Ö. Arık
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2501.10893
Pdf URL: https://arxiv.org/pdf/2501.10893
Copy Paste: [[2501.10893]] Learn-by-interact: A Data-Centric Framework for Self-Adaptive Agents in Realistic Environments(https://arxiv.org/abs/2501.10893)
Keywords: in-context
Abstract: Autonomous agents powered by large language models (LLMs) have the potential to enhance human capabilities, assisting with digital tasks from sending emails to performing data analysis. The abilities of existing LLMs at such tasks are often hindered by the lack of high-quality agent data from the corresponding environments they interact with. We propose Learn-by-interact, a data-centric framework to adapt LLM agents to any given environments without human annotations. Learn-by-interact synthesizes trajectories of agent-environment interactions based on documentations, and constructs instructions by summarizing or abstracting the interaction histories, a process called backward construction. We assess the quality of our synthetic data by using them in both training-based scenarios and training-free in-context learning (ICL), where we craft innovative retrieval approaches optimized for agents. Extensive experiments on SWE-bench, WebArena, OSWorld and Spider2-V spanning across realistic coding, web, and desktop environments show the effectiveness of Learn-by-interact in various downstream agentic tasks -- baseline results are improved by up to 12.2\% for ICL with Claude-3.5 and 19.5\% for training with Codestral-22B. We further demonstrate the critical role of backward construction, which provides up to 14.0\% improvement for training. Our ablation studies demonstrate the efficiency provided by our synthesized data in ICL and the superiority of our retrieval pipeline over alternative approaches like conventional retrieval-augmented generation (RAG). We expect that Learn-by-interact will serve as a foundation for agent data synthesis as LLMs are increasingly deployed at real-world environments.

Title: Data Enrichment Opportunities for Distribution Grid Cable Networks using Variational Autoencoders

Authors: Konrad Sundsgaard, Kutay Bölat, Guangya Yang
Subjects: cs.LG, eess.SY
Abstract URL: https://arxiv.org/abs/2501.10920
Pdf URL: https://arxiv.org/pdf/2501.10920
Copy Paste: [[2501.10920]] Data Enrichment Opportunities for Distribution Grid Cable Networks using Variational Autoencoders(https://arxiv.org/abs/2501.10920)
Keywords: generative
Abstract: Electricity distribution cable networks suffer from incomplete and unbalanced data, hindering the effectiveness of machine learning models for predictive maintenance and reliability evaluation. Features such as the installation date of the cables are frequently missing. To address data scarcity, this study investigates the application of Variational Autoencoders (VAEs) for data enrichment, synthetic data generation, imbalanced data handling, and outlier detection. Based on a proof-of-concept case study for Denmark, targeting the imputation of missing age information in cable network asset registers, the analysis underlines the potential of generative models to support data-driven maintenance. However, the study also highlights several areas for improvement, including enhanced feature importance analysis, incorporating network characteristics and external features, and handling biases in missing data. Future initiatives should expand the application of VAEs by incorporating semi-supervised learning, advanced sampling techniques, and additional distribution grid elements, including low-voltage networks, into the analysis.

Title: Generative Physical AI in Vision: A Survey

Authors: Daochang Liu, Junyu Zhang, Anh-Dung Dinh, Eunbyung Park, Shichao Zhang, Chang Xu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2501.10928
Pdf URL: https://arxiv.org/pdf/2501.10928
Copy Paste: [[2501.10928]] Generative Physical AI in Vision: A Survey(https://arxiv.org/abs/2501.10928)
Keywords: generative
Abstract: Generative Artificial Intelligence (AI) has rapidly advanced the field of computer vision by enabling machines to create and interpret visual data with unprecedented sophistication. This transformation builds upon a foundation of generative models to produce realistic images, videos, and 3D or 4D content. Traditionally, generative models primarily focus on visual fidelity while often neglecting the physical plausibility of generated content. This gap limits their effectiveness in applications requiring adherence to real-world physical laws, such as robotics, autonomous systems, and scientific simulations. As generative AI evolves to increasingly integrate physical realism and dynamic simulation, its potential to function as a "world simulator" expands-enabling the modeling of interactions governed by physics and bridging the divide between virtual and physical realities. This survey systematically reviews this emerging field of physics-aware generative AI in computer vision, categorizing methods based on how they incorporate physical knowledge-either through explicit simulation or implicit learning. We analyze key paradigms, discuss evaluation protocols, and identify future research directions. By offering a comprehensive overview, this survey aims to help future developments in physically grounded generation for vision. The reviewed papers are summarized at this https URL.

Title: Beyond Any-Shot Adaptation: Predicting Optimization Outcome for Robustness Gains without Extra Pay

Authors: Qi Cheems Wang, Zehao Xiao, Yixiu Mao, Yun Qu, Jiayi Shen, Yiqin Lv, Xiangyang Ji
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2501.11039
Pdf URL: https://arxiv.org/pdf/2501.11039
Copy Paste: [[2501.11039]] Beyond Any-Shot Adaptation: Predicting Optimization Outcome for Robustness Gains without Extra Pay(https://arxiv.org/abs/2501.11039)
Keywords: foundation model, generative
Abstract: The foundation model enables fast problem-solving without learning from scratch, and such a desirable adaptation property benefits from its adopted cross-task generalization paradigms, e.g., pretraining, meta-training, or finetuning. Recent trends have focused on the curation of task datasets during optimization, which includes task selection as an indispensable consideration for either adaptation robustness or sampling efficiency purposes. Despite some progress, selecting crucial task batches to optimize over iteration mostly exhausts massive task queries and requires intensive evaluation and computations to secure robust adaptation. This work underscores the criticality of both robustness and learning efficiency, especially in scenarios where tasks are risky to collect or costly to evaluate. To this end, we present Model Predictive Task Sampling (MPTS), a novel active task sampling framework to establish connections between the task space and adaptation risk landscape achieve robust adaptation. Technically, MPTS characterizes the task episodic information with a generative model and predicts optimization outcome after adaptation from posterior inference, i.e., forecasting task-specific adaptation risk values. The resulting risk learner amortizes expensive annotation, evaluation, or computation operations in task robust adaptation learning paradigms. Extensive experimental results show that MPTS can be seamlessly integrated into zero-shot, few-shot, and many-shot learning paradigms, increases adaptation robustness, and retains learning efficiency without affording extra cost. The code will be available at the project site this https URL.

Title: Generative AI-driven Cross-layer Covert Communication: Fundamentals, Framework and Case Study

Authors: Tianhao Liu, Jiqiang Liu, Tao Zhang, Jian Wang, Jiacheng Wang, Jiawen Kang, Dusit Niyato, Shiwen Mao
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2501.11068
Pdf URL: https://arxiv.org/pdf/2501.11068
Copy Paste: [[2501.11068]] Generative AI-driven Cross-layer Covert Communication: Fundamentals, Framework and Case Study(https://arxiv.org/abs/2501.11068)
Keywords: diffusion, generative
Abstract: Ensuring end-to-end cross-layer communication security in military networks by selecting covert schemes between nodes is a key solution for military communication security. With the development of communication technology, covert communication has expanded from the physical layer to the network and application layers, utilizing methods such as artificial noise, private networks, and semantic coding to transmit secret messages. However, as adversaries continuously eavesdrop on specific communication channels, the accumulation of sufficient data may reveal underlying patterns that influence concealment, and establishing a cross-layer covert communication mechanism emerges as an effective strategy to mitigate these regulatory challenges. In this article, we first survey the communication security solution based on covert communication, specifically targeting three typical scenarios: device-to-device, private network communication, and public network communication, and analyze their application scopes. Furthermore, we propose an end-to-end cross-layer covert communication scheme driven by Generative Artificial Intelligence (GenAI), highlighting challenges and their solutions. Additionally, a case study is conducted using diffusion reinforcement learning to sovle cloud edge internet of things cross-layer secure communication.

Title: Tell me about yourself: LLMs are aware of their learned behaviors

Authors: Jan Betley, Xuchan Bao, Martín Soto, Anna Sztyber-Betley, James Chua, Owain Evans
Subjects: cs.CL, cs.AI, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2501.11120
Pdf URL: https://arxiv.org/pdf/2501.11120
Copy Paste: [[2501.11120]] Tell me about yourself: LLMs are aware of their learned behaviors(https://arxiv.org/abs/2501.11120)
Keywords: in-context
Abstract: We study behavioral self-awareness -- an LLM's ability to articulate its behaviors without requiring in-context examples. We finetune LLMs on datasets that exhibit particular behaviors, such as (a) making high-risk economic decisions, and (b) outputting insecure code. Despite the datasets containing no explicit descriptions of the associated behavior, the finetuned LLMs can explicitly describe it. For example, a model trained to output insecure code says, ``The code I write is insecure.'' Indeed, models show behavioral self-awareness for a range of behaviors and for diverse evaluations. Note that while we finetune models to exhibit behaviors like writing insecure code, we do not finetune them to articulate their own behaviors -- models do this without any special training or examples. Behavioral self-awareness is relevant for AI safety, as models could use it to proactively disclose problematic behaviors. In particular, we study backdoor policies, where models exhibit unexpected behaviors only under certain trigger conditions. We find that models can sometimes identify whether or not they have a backdoor, even without its trigger being present. However, models are not able to directly output their trigger by default. Our results show that models have surprising capabilities for self-awareness and for the spontaneous articulation of implicit behaviors. Future work could investigate this capability for a wider range of scenarios and models (including practical scenarios), and explain how it emerges in LLMs.

Title: CLOFAI: A Dataset of Real And Fake Image Classification Tasks for Continual Learning

Authors: William Doherty, Anton Lee, Heitor Murilo Gomes
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2501.11140
Pdf URL: https://arxiv.org/pdf/2501.11140
Copy Paste: [[2501.11140]] CLOFAI: A Dataset of Real And Fake Image Classification Tasks for Continual Learning(https://arxiv.org/abs/2501.11140)
Keywords: generative
Abstract: The rapid advancement of generative AI models capable of creating realistic media has led to a need for classifiers that can accurately distinguish between genuine and artificially-generated images. A significant challenge for these classifiers emerges when they encounter images from generative models that are not represented in their training data, usually resulting in diminished performance. A typical approach is to periodically update the classifier's training data with images from the new generative models then retrain the classifier on the updated dataset. However, in some real-life scenarios, storage, computational, or privacy constraints render this approach impractical. Additionally, models used in security applications may be required to rapidly adapt. In these circumstances, continual learning provides a promising alternative, as the classifier can be updated without retraining on the entire dataset. In this paper, we introduce a new dataset called CLOFAI (Continual Learning On Fake and Authentic Images), which takes the form of a domain-incremental image classification problem. Moreover, we showcase the applicability of this dataset as a benchmark for evaluating continual learning methodologies. In doing this, we set a baseline on our novel dataset using three foundational continual learning methods -- EWC, GEM, and Experience Replay -- and find that EWC performs poorly, while GEM and Experience Replay show promise, performing significantly better than a Naive baseline. The dataset and code to run the experiments can be accessed from the following GitHub repository: this https URL.

Title: Leveraging GANs For Active Appearance Models Optimized Model Fitting

Authors: Anurag Awasthi
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2501.11218
Pdf URL: https://arxiv.org/pdf/2501.11218
Copy Paste: [[2501.11218]] Leveraging GANs For Active Appearance Models Optimized Model Fitting(https://arxiv.org/abs/2501.11218)
Keywords: generative
Abstract: Generative Adversarial Networks (GANs) have gained prominence in refining model fitting tasks in computer vision, particularly in domains involving deformable models like Active Appearance Models (AAMs). This paper explores the integration of GANs to enhance the AAM fitting process, addressing challenges in optimizing nonlinear parameters associated with appearance and shape variations. By leveraging GANs' adversarial training framework, the aim is to minimize fitting errors and improve convergence rates. Achieving robust performance even in cases with high appearance variability and occlusions. Our approach demonstrates significant improvements in accuracy and computational efficiency compared to traditional optimization techniques, thus establishing GANs as a potent tool for advanced image model fitting.

Title: Successive Interference Cancellation-aided Diffusion Models for Joint Channel Estimation and Data Detection in Low Rank Channel Scenarios

Authors: Sagnik Bhattacharya, Muhammad Ahmed Mohsin, Kamyar Rajabalifardi, John M. Cioffi
Subjects: cs.CV, cs.IT, eess.SP
Abstract URL: https://arxiv.org/abs/2501.11229
Pdf URL: https://arxiv.org/pdf/2501.11229
Copy Paste: [[2501.11229]] Successive Interference Cancellation-aided Diffusion Models for Joint Channel Estimation and Data Detection in Low Rank Channel Scenarios(https://arxiv.org/abs/2501.11229)
Keywords: diffusion, generative
Abstract: This paper proposes a novel joint channel-estimation and source-detection algorithm using successive interference cancellation (SIC)-aided generative score-based diffusion models. Prior work in this area focuses on massive MIMO scenarios, which are typically characterized by full-rank channels, and fail in low-rank channel scenarios. The proposed algorithm outperforms existing methods in joint source-channel estimation, especially in low-rank scenarios where the number of users exceeds the number of antennas at the access point (AP). The proposed score-based iterative diffusion process estimates the gradient of the prior distribution on partial channels, and recursively updates the estimated channel parts as well as the source. Extensive simulation results show that the proposed method outperforms the baseline methods in terms of normalized mean squared error (NMSE) and symbol error rate (SER) in both full-rank and low-rank channel scenarios, while having a more dominant effect in the latter, at various signal-to-noise ratios (SNR).

Title: A New Formulation of Lipschitz Constrained With Functional Gradient Learning for GANs

Authors: Chang Wan, Ke Fan, Xinwei Sun, Yanwei Fu, Minglu Li, Yunliang Jiang, Zhonglong Zheng
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2501.11236
Pdf URL: https://arxiv.org/pdf/2501.11236
Copy Paste: [[2501.11236]] A New Formulation of Lipschitz Constrained With Functional Gradient Learning for GANs(https://arxiv.org/abs/2501.11236)
Keywords: generative
Abstract: This paper introduces a promising alternative method for training Generative Adversarial Networks (GANs) on large-scale datasets with clear theoretical guarantees. GANs are typically learned through a minimax game between a generator and a discriminator, which is known to be empirically unstable. Previous learning paradigms have encountered mode collapse issues without a theoretical solution. To address these challenges, we propose a novel Lipschitz-constrained Functional Gradient GANs learning (Li-CFG) method to stabilize the training of GAN and provide a theoretical foundation for effectively increasing the diversity of synthetic samples by reducing the neighborhood size of the latent vector. Specifically, we demonstrate that the neighborhood size of the latent vector can be reduced by increasing the norm of the discriminator gradient, resulting in enhanced diversity of synthetic samples. To efficiently enlarge the norm of the discriminator gradient, we introduce a novel {\epsilon}-centered gradient penalty that amplifies the norm of the discriminator gradient using the hyper-parameter {\epsilon}. In comparison to other constraints, our method enlarging the discriminator norm, thus obtaining the smallest neighborhood size of the latent vector. Extensive experiments on benchmark datasets for image generation demonstrate the efficacy of the Li-CFG method and the {\epsilon}-centered gradient penalty. The results showcase improved stability and increased diversity of synthetic samples.

Title: Enhancing SAR Object Detection with Self-Supervised Pre-training on Masked Auto-Encoders

Authors: Xinyang Pu, Feng Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.11249
Pdf URL: https://arxiv.org/pdf/2501.11249
Copy Paste: [[2501.11249]] Enhancing SAR Object Detection with Self-Supervised Pre-training on Masked Auto-Encoders(https://arxiv.org/abs/2501.11249)
Keywords: self-supervised
Abstract: Supervised fine-tuning methods (SFT) perform great efficiency on artificial intelligence interpretation in SAR images, leveraging the powerful representation knowledge from pre-training models. Due to the lack of domain-specific pre-trained backbones in SAR images, the traditional strategies are loading the foundation pre-train models of natural scenes such as ImageNet, whose characteristics of images are extremely different from SAR images. This may hinder the model performance on downstream tasks when adopting SFT on small-scale annotated SAR data. In this paper, an self-supervised learning (SSL) method of masked image modeling based on Masked Auto-Encoders (MAE) is proposed to learn feature representations of SAR images during the pre-training process and benefit the object detection task in SAR images of SFT. The evaluation experiments on the large-scale SAR object detection benchmark named SARDet-100k verify that the proposed method captures proper latent representations of SAR images and improves the model generalization in downstream tasks by converting the pre-trained domain from natural scenes to SAR images through SSL. The proposed method achieves an improvement of 1.3 mAP on the SARDet-100k benchmark compared to only the SFT strategies.

Title: Advancing Multi-Party Dialogue Systems with Speaker-ware Contrastive Learning

Authors: Zhongtian Hu, Qi He, Ronghan Li, Meng Zhao, Lifang Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.11292
Pdf URL: https://arxiv.org/pdf/2501.11292
Copy Paste: [[2501.11292]] Advancing Multi-Party Dialogue Systems with Speaker-ware Contrastive Learning(https://arxiv.org/abs/2501.11292)
Keywords: self-supervised
Abstract: Dialogue response generation has made significant progress, but most research has focused on dyadic dialogue. In contrast, multi-party dialogues involve more participants, each potentially discussing different topics, making the task more complex. Current methods often rely on graph neural networks to model dialogue context, which helps capture the structural dynamics of multi-party conversations. However, these methods are heavily dependent on intricate graph structures and dataset annotations, and they often overlook the distinct speaking styles of participants. To address these challenges, we propose CMR, a Contrastive learning-based Multi-party dialogue Response generation model. CMR uses self-supervised contrastive learning to better distinguish "who says what." Additionally, by comparing speakers within the same conversation, the model captures differences in speaking styles and thematic transitions. To the best of our knowledge, this is the first approach to apply contrastive learning in multi-party dialogue generation. Experimental results show that CMR significantly outperforms state-of-the-art models in multi-party dialogue response tasks.

Title: MIFNet: Learning Modality-Invariant Features for Generalizable Multimodal Image Matching

Authors: Yepeng Liu, Zhichao Sun, Baosheng Yu, Yitian Zhao, Bo Du, Yongchao Xu, Jun Cheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.11299
Pdf URL: https://arxiv.org/pdf/2501.11299
Copy Paste: [[2501.11299]] MIFNet: Learning Modality-Invariant Features for Generalizable Multimodal Image Matching(https://arxiv.org/abs/2501.11299)
Keywords: diffusion
Abstract: Many keypoint detection and description methods have been proposed for image matching or registration. While these methods demonstrate promising performance for single-modality image matching, they often struggle with multimodal data because the descriptors trained on single-modality data tend to lack robustness against the non-linear variations present in multimodal data. Extending such methods to multimodal image matching often requires well-aligned multimodal data to learn modality-invariant descriptors. However, acquiring such data is often costly and impractical in many real-world scenarios. To address this challenge, we propose a modality-invariant feature learning network (MIFNet) to compute modality-invariant features for keypoint descriptions in multimodal image matching using only single-modality training data. Specifically, we propose a novel latent feature aggregation module and a cumulative hybrid aggregation module to enhance the base keypoint descriptors trained on single-modality data by leveraging pre-trained features from Stable Diffusion models. We validate our method with recent keypoint detection and description methods in three multimodal retinal image datasets (CF-FA, CF-OCT, EMA-OCTA) and two remote sensing datasets (Optical-SAR and Optical-NIR). Extensive experiments demonstrate that the proposed MIFNet is able to learn modality-invariant feature for multimodal image matching without accessing the targeted modality and has good zero-shot generalization ability. The source code will be made publicly available.

Title: Anomaly Detection for Industrial Applications, Its Challenges, Solutions, and Future Directions: A Review

Authors: Abdelrahman Alzarooni, Ehtesham Iqbal, Samee Ullah Khan, Sajid Javed, Brain Moyo, Yusra Abdulrahman
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.11310
Pdf URL: https://arxiv.org/pdf/2501.11310
Copy Paste: [[2501.11310]] Anomaly Detection for Industrial Applications, Its Challenges, Solutions, and Future Directions: A Review(https://arxiv.org/abs/2501.11310)
Keywords: anomaly
Abstract: Anomaly detection from images captured using camera sensors is one of the mainstream applications at the industrial level. Particularly, it maintains the quality and optimizes the efficiency in production processes across diverse industrial tasks, including advanced manufacturing and aerospace engineering. Traditional anomaly detection workflow is based on a manual inspection by human operators, which is a tedious task. Advances in intelligent automated inspection systems have revolutionized the Industrial Anomaly Detection (IAD) process. Recent vision-based approaches can automatically extract, process, and interpret features using computer vision and align with the goals of automation in industrial operations. In light of the shift in inspection methodologies, this survey reviews studies published since 2019, with a specific focus on vision-based anomaly detection. The components of an IAD pipeline that are overlooked in existing surveys are presented, including areas related to data acquisition, preprocessing, learning mechanisms, and evaluation. In addition to the collected publications, several scientific and industry-related challenges and their perspective solutions are highlighted. Popular and relevant industrial datasets are also summarized, providing further insight into inspection applications. Finally, future directions of vision-based IAD are discussed, offering researchers insight into the state-of-the-art of industrial inspection.

Title: Nested Annealed Training Scheme for Generative Adversarial Networks

Authors: Chang Wan, Ming-Hsuan Yang, Minglu Li, Yunliang Jiang, Zhonglong Zheng
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2501.11318
Pdf URL: https://arxiv.org/pdf/2501.11318
Copy Paste: [[2501.11318]] Nested Annealed Training Scheme for Generative Adversarial Networks(https://arxiv.org/abs/2501.11318)
Keywords: diffusion, generative
Abstract: Recently, researchers have proposed many deep generative models, including generative adversarial networks(GANs) and denoising diffusion models. Although significant breakthroughs have been made and empirical success has been achieved with the GAN, its mathematical underpinnings remain relatively unknown. This paper focuses on a rigorous mathematical theoretical framework: the composite-functional-gradient GAN (CFG)[1]. Specifically, we reveal the theoretical connection between the CFG model and score-based models. We find that the training objective of the CFG discriminator is equivalent to finding an optimal D(x). The optimal gradient of D(x) differentiates the integral of the differences between the score functions of real and synthesized samples. Conversely, training the CFG generator involves finding an optimal G(x) that minimizes this difference. In this paper, we aim to derive an annealed weight preceding the weight of the CFG discriminator. This new explicit theoretical explanation model is called the annealed CFG method. To overcome the limitation of the annealed CFG method, as the method is not readily applicable to the SOTA GAN model, we propose a nested annealed training scheme (NATS). This scheme keeps the annealed weight from the CFG method and can be seamlessly adapted to various GAN models, no matter their structural, loss, or regularization differences. We conduct thorough experimental evaluations on various benchmark datasets for image generation. The results show that our annealed CFG and NATS methods significantly improve the quality and diversity of the synthesized samples. This improvement is clear when comparing the CFG method and the SOTA GAN models.

Title: StyleSSP: Sampling StartPoint Enhancement for Training-free Diffusion-based Method for Style Transfer

Authors: Ruojun Xu, Weijie Xi, Xiaodi Wang, Yongbo Mao, Zach Cheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.11319
Pdf URL: https://arxiv.org/pdf/2501.11319
Copy Paste: [[2501.11319]] StyleSSP: Sampling StartPoint Enhancement for Training-free Diffusion-based Method for Style Transfer(https://arxiv.org/abs/2501.11319)
Keywords: diffusion
Abstract: Training-free diffusion-based methods have achieved remarkable success in style transfer, eliminating the need for extensive training or fine-tuning. However, due to the lack of targeted training for style information extraction and constraints on the content image layout, training-free methods often suffer from layout changes of original content and content leakage from style images. Through a series of experiments, we discovered that an effective startpoint in the sampling stage significantly enhances the style transfer process. Based on this discovery, we propose StyleSSP, which focuses on obtaining a better startpoint to address layout changes of original content and content leakage from style image. StyleSSP comprises two key components: (1) Frequency Manipulation: To improve content preservation, we reduce the low-frequency components of the DDIM latent, allowing the sampling stage to pay more attention to the layout of content images; and (2) Negative Guidance via Inversion: To mitigate the content leakage from style image, we employ negative guidance in the inversion stage to ensure that the startpoint of the sampling stage is distanced from the content of style image. Experiments show that StyleSSP surpasses previous training-free style transfer baselines, particularly in preserving original content and minimizing the content leakage from style image.

Title: CatV2TON: Taming Diffusion Transformers for Vision-Based Virtual Try-On with Temporal Concatenation

Authors: Zheng Chong, Wenqing Zhang, Shiyue Zhang, Jun Zheng, Xiao Dong, Haoxiang Li, Yiling Wu, Dongmei Jiang, Xiaodan Liang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2501.11325
Pdf URL: https://arxiv.org/pdf/2501.11325
Copy Paste: [[2501.11325]] CatV2TON: Taming Diffusion Transformers for Vision-Based Virtual Try-On with Temporal Concatenation(https://arxiv.org/abs/2501.11325)
Keywords: diffusion
Abstract: Virtual try-on (VTON) technology has gained attention due to its potential to transform online retail by enabling realistic clothing visualization of images and videos. However, most existing methods struggle to achieve high-quality results across image and video try-on tasks, especially in long video scenarios. In this work, we introduce CatV2TON, a simple and effective vision-based virtual try-on (V2TON) method that supports both image and video try-on tasks with a single diffusion transformer model. By temporally concatenating garment and person inputs and training on a mix of image and video datasets, CatV2TON achieves robust try-on performance across static and dynamic settings. For efficient long-video generation, we propose an overlapping clip-based inference strategy that uses sequential frame guidance and Adaptive Clip Normalization (AdaCN) to maintain temporal consistency with reduced resource demands. We also present ViViD-S, a refined video try-on dataset, achieved by filtering back-facing frames and applying 3D mask smoothing for enhanced temporal consistency. Comprehensive experiments demonstrate that CatV2TON outperforms existing methods in both image and video try-on tasks, offering a versatile and reliable solution for realistic virtual try-ons across diverse scenarios.

Title: GenVidBench: A Challenging Benchmark for Detecting AI-Generated Video

Authors: Zhenliang Ni, Qiangyu Yan, Mouxiao Huang, Tianning Yuan, Yehui Tang, Hailin Hu, Xinghao Chen, Yunhe Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.11340
Pdf URL: https://arxiv.org/pdf/2501.11340
Copy Paste: [[2501.11340]] GenVidBench: A Challenging Benchmark for Detecting AI-Generated Video(https://arxiv.org/abs/2501.11340)
Keywords: generative
Abstract: The rapid advancement of video generation models has made it increasingly challenging to distinguish AI-generated videos from real ones. This issue underscores the urgent need for effective AI-generated video detectors to prevent the dissemination of false information through such videos. However, the development of high-performance generative video detectors is currently impeded by the lack of large-scale, high-quality datasets specifically designed for generative video detection. To this end, we introduce GenVidBench, a challenging AI-generated video detection dataset with several key advantages: 1) Cross Source and Cross Generator: The cross-generation source mitigates the interference of video content on the detection. The cross-generator ensures diversity in video attributes between the training and test sets, preventing them from being overly similar. 2) State-of-the-Art Video Generators: The dataset includes videos from 8 state-of-the-art AI video generators, ensuring that it covers the latest advancements in the field of video generation. 3) Rich Semantics: The videos in GenVidBench are analyzed from multiple dimensions and classified into various semantic categories based on their content. This classification ensures that the dataset is not only large but also diverse, aiding in the development of more generalized and effective detection models. We conduct a comprehensive evaluation of different advanced video generators and present a challenging setting. Additionally, we present rich experimental results including advanced video classification models as baselines. With the GenVidBench, researchers can efficiently develop and evaluate AI-generated video detection models. Datasets and code are available at this https URL.

Title: Block Flow: Learning Straight Flow on Data Blocks

Authors: Zibin Wang, Zhiyuan Ouyang, Xiangyun Zhang
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2501.11361
Pdf URL: https://arxiv.org/pdf/2501.11361
Copy Paste: [[2501.11361]] Block Flow: Learning Straight Flow on Data Blocks(https://arxiv.org/abs/2501.11361)
Keywords: generative
Abstract: Flow-matching models provide a powerful framework for various applications, offering efficient sampling and flexible probability path modeling. These models are characterized by flows with low curvature in learned generative trajectories, which results in reduced truncation error at each sampling step. To further reduce curvature, we propose block matching. This novel approach leverages label information to partition the data distribution into blocks and match them with a prior distribution parameterized using the same label information, thereby learning straighter flows. We demonstrate that the variance of the prior distribution can control the curvature upper bound of forward trajectories in flow-matching models. By designing flexible regularization strategies to adjust this variance, we achieve optimal generation performance, effectively balancing the trade-off between maintaining diversity in generated samples and minimizing numerical solver errors. Our results demonstrate competitive performance with models of the same parameter this http URL is available at \url{this https URL}.

Title: A Survey on Diffusion Models for Anomaly Detection

Authors: Jing Liu, Zhenchao Ma, Zepu Wang, Yang Liu, Zehua Wang, Peng Sun, Liang Song, Bo Hu, Azzedine Boukerche, Victor C.M. Leung
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2501.11430
Pdf URL: https://arxiv.org/pdf/2501.11430
Copy Paste: [[2501.11430]] A Survey on Diffusion Models for Anomaly Detection(https://arxiv.org/abs/2501.11430)
Keywords: diffusion, generative, anomaly
Abstract: Diffusion models (DMs) have emerged as a powerful class of generative AI models, showing remarkable potential in anomaly detection (AD) tasks across various domains, such as cybersecurity, fraud detection, healthcare, and manufacturing. The intersection of these two fields, termed diffusion models for anomaly detection (DMAD), offers promising solutions for identifying deviations in increasingly complex and high-dimensional data. In this survey, we systematically review recent advances in DMAD research and investigate their capabilities. We begin by presenting the fundamental concepts of AD and DMs, followed by a comprehensive analysis of classic DM architectures including DDPMs, DDIMs, and Score SDEs. We further categorize existing DMAD methods into reconstruction-based, density-based, and hybrid approaches, providing detailed examinations of their methodological innovations. We also explore the diverse tasks across different data modalities, encompassing image, time series, video, and multimodal data analysis. Furthermore, we discuss critical challenges and emerging research directions, including computational efficiency, model interpretability, robustness enhancement, edge-cloud collaboration, and integration with large language models. The collection of DMAD research papers and resources is available at this https URL.

Title: Generative AI and Large Language Models in Language Preservation: Opportunities and Challenges

Authors: Vincent Koc
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2501.11496
Pdf URL: https://arxiv.org/pdf/2501.11496
Copy Paste: [[2501.11496]] Generative AI and Large Language Models in Language Preservation: Opportunities and Challenges(https://arxiv.org/abs/2501.11496)
Keywords: generative
Abstract: Generative AI and large-scale language models (LLM) have emerged as powerful tools in language preservation, particularly for near-native and endangered languages. With the increasing reliance on technology for communication, education, and cultural documentation, new opportunities have emerged to mitigate the dramatic decline of linguistic diversity worldwide. This paper examines the role of generative AIs and LLMs in preserving endangered languages, highlighting the risks and challenges associated with their use. We analyze the underlying technologies driving these models, including natural language processing (NLP) and deep learning, and explore several cases where these technologies have been applied to low-resource languages. Additionally, we discuss ethical considerations, data scarcity issues, and technical challenges while proposing solutions to enhance AI-driven language preservation.

Title: UltraFusion: Ultra High Dynamic Imaging using Exposure Fusion

Authors: Zixuan Chen, Yujin Wang, Xin Cai, Zhiyuan You, Zheming Lu, Fan Zhang, Shi Guo, Tianfan Xue
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.11515
Pdf URL: https://arxiv.org/pdf/2501.11515
Copy Paste: [[2501.11515]] UltraFusion: Ultra High Dynamic Imaging using Exposure Fusion(https://arxiv.org/abs/2501.11515)
Keywords: generative
Abstract: Capturing high dynamic range (HDR) scenes is one of the most important issues in camera design. Majority of cameras use exposure fusion technique, which fuses images captured by different exposure levels, to increase dynamic range. However, this approach can only handle images with limited exposure difference, normally 3-4 stops. When applying to very high dynamic scenes where a large exposure difference is required, this approach often fails due to incorrect alignment or inconsistent lighting between inputs, or tone mapping artifacts. In this work, we propose UltraFusion, the first exposure fusion technique that can merge input with 9 stops differences. The key idea is that we model the exposure fusion as a guided inpainting problem, where the under-exposed image is used as a guidance to fill the missing information of over-exposed highlight in the over-exposed region. Using under-exposed image as a soft guidance, instead of a hard constrain, our model is robust to potential alignment issue or lighting variations. Moreover, utilizing the image prior of the generative model, our model also generates natural tone mapping, even for very high-dynamic range scene. Our approach outperforms HDR-Transformer on latest HDR benchmarks. Moreover, to test its performance in ultra high dynamic range scene, we capture a new real-world exposure fusion benchmark, UltraFusion Dataset, with exposure difference up to 9 stops, and experiments show that \model~can generate beautiful and high-quality fusion results under various scenarios. An online demo is provided at this https URL.

Title: Graph Defense Diffusion Model

Authors: Xin He, Wenqi Fan, Yili Wang, Chengyi Liu, Rui Miao, Xin Juan, Xin Wang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2501.11568
Pdf URL: https://arxiv.org/pdf/2501.11568
Copy Paste: [[2501.11568]] Graph Defense Diffusion Model(https://arxiv.org/abs/2501.11568)
Keywords: diffusion
Abstract: Graph Neural Networks (GNNs) demonstrate significant potential in various applications but remain highly vulnerable to adversarial attacks, which can greatly degrade their performance. Existing graph purification methods attempt to address this issue by filtering attacked graphs; however, they struggle to effectively defend against multiple types of adversarial attacks simultaneously due to their limited flexibility, and they lack comprehensive modeling of graph data due to their heavy reliance on heuristic prior knowledge. To overcome these challenges, we propose a more versatile approach for defending against adversarial attacks on graphs. In this work, we introduce the Graph Defense Diffusion Model (GDDM), a flexible purification method that leverages the denoising and modeling capabilities of diffusion models. The iterative nature of diffusion models aligns well with the stepwise process of adversarial attacks, making them particularly suitable for defense. By iteratively adding and removing noise, GDDM effectively purifies attacked graphs, restoring their original structure and features. Our GDDM consists of two key components: (1) Graph Structure-Driven Refiner, which preserves the basic fidelity of the graph during the denoising process, and ensures that the generated graph remains consistent with the original scope; and (2) Node Feature-Constrained Regularizer, which removes residual impurities from the denoised graph, further enhances the purification effect. Additionally, we design tailored denoising strategies to handle different types of adversarial attacks, improving the model's adaptability to various attack scenarios. Extensive experiments conducted on three real-world datasets demonstrate that GDDM outperforms state-of-the-art methods in defending against a wide range of adversarial attacks, showcasing its robustness and effectiveness.

Title: Recurrent Diffusion for Large-Scale Parameter Generation

Authors: Kai Wang, Dongwen Tang, Wangbo Zhao, Yang You
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2501.11587
Pdf URL: https://arxiv.org/pdf/2501.11587
Copy Paste: [[2501.11587]] Recurrent Diffusion for Large-Scale Parameter Generation(https://arxiv.org/abs/2501.11587)
Keywords: diffusion
Abstract: Parameter generation has struggled to scale up for a long time, significantly limiting its range of applications. In this study, we introduce \textbf{R}ecurrent diffusion for large-scale \textbf{P}arameter \textbf{G}eneration, called \textbf{RPG}. We first divide the trained parameters into non-overlapping parts, after which a recurrent model is proposed to learn their relationships. The recurrent model's outputs, as conditions, are then fed into a diffusion model to generate the neural network parameters. Using only a single GPU, recurrent diffusion enables us to generate popular vision and language models such as ConvNeXt-L and LoRA parameters of LLaMA-7B. Meanwhile, across various architectures and tasks, the generated parameters consistently perform comparable results over trained networks. Notably, our approach also shows the potential to generate models for handling unseen tasks, which largely increases the practicality of parameter generation. Our code is available \href{this https URL}{here}.

Title: Trojan Detection Through Pattern Recognition for Large Language Models

Authors: Vedant Bhasin, Matthew Yudin, Razvan Stefanescu, Rauf Izmailov
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2501.11621
Pdf URL: https://arxiv.org/pdf/2501.11621
Copy Paste: [[2501.11621]] Trojan Detection Through Pattern Recognition for Large Language Models(https://arxiv.org/abs/2501.11621)
Keywords: in-context
Abstract: Trojan backdoors can be injected into large language models at various stages, including pretraining, fine-tuning, and in-context learning, posing a significant threat to the model's alignment. Due to the nature of causal language modeling, detecting these triggers is challenging given the vast search space. In this study, we propose a multistage framework for detecting Trojan triggers in large language models consisting of token filtration, trigger identification, and trigger verification. We discuss existing trigger identification methods and propose two variants of a black-box trigger inversion method that rely on output logits, utilizing beam search and greedy decoding respectively. We show that the verification stage is critical in the process and propose semantic-preserving prompts and special perturbations to differentiate between actual Trojan triggers and other adversarial strings that display similar characteristics. The evaluation of our approach on the TrojAI and RLHF poisoned model datasets demonstrates promising results.

Title: Class Imbalance in Anomaly Detection: Learning from an Exactly Solvable Model

Authors: F.S. Pezzicoli, V. Ros, F.P. Landes, M. Baity-Jesi
Subjects: cs.LG, cond-mat.dis-nn, stat.ML
Abstract URL: https://arxiv.org/abs/2501.11638
Pdf URL: https://arxiv.org/pdf/2501.11638
Copy Paste: [[2501.11638]] Class Imbalance in Anomaly Detection: Learning from an Exactly Solvable Model(https://arxiv.org/abs/2501.11638)
Keywords: anomaly
Abstract: Class imbalance (CI) is a longstanding problem in machine learning, slowing down training and reducing performances. Although empirical remedies exist, it is often unclear which ones work best and when, due to the lack of an overarching theory. We address a common case of imbalance, that of anomaly (or outlier) detection. We provide a theoretical framework to analyze, interpret and address CI. It is based on an exact solution of the teacher-student perceptron model, through replica theory. Within this framework, one can distinguish several sources of CI: either intrinsic, train or test imbalance. Our analysis reveals that the optimal train imbalance is generally different from 50%, with a non trivial dependence on the intrinsic imbalance, the abundance of data and on the noise in the learning. Moreover, there is a crossover between a small noise training regime where results are independent of the noise level to a high noise regime where performances quickly degrade with noise. Our results challenge some of the conventional wisdom on CI and offer practical guidelines to address it.

Title: Spatially-Delineated Domain-Adapted AI Classification: An Application for Oncology Data

Authors: Majid Farhadloo, Arun Sharma, Alexey Leontovich, Svetomir N. Markovic, Shashi Shekhar
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2501.11695
Pdf URL: https://arxiv.org/pdf/2501.11695
Copy Paste: [[2501.11695]] Spatially-Delineated Domain-Adapted AI Classification: An Application for Oncology Data(https://arxiv.org/abs/2501.11695)
Keywords: self-supervised
Abstract: Given multi-type point maps from different place-types (e.g., tumor regions), our objective is to develop a classifier trained on the source place-type to accurately distinguish between two classes of the target place-type based on their point arrangements. This problem is societally important for many applications, such as generating clinical hypotheses for designing new immunotherapies for cancer treatment. The challenge lies in the spatial variability, the inherent heterogeneity and variation observed in spatial properties or arrangements across different locations (i.e., place-types). Previous techniques focus on self-supervised tasks to learn domain-invariant features and mitigate domain differences; however, they often neglect the underlying spatial arrangements among data points, leading to significant discrepancies across different place-types. We explore a novel multi-task self-learning framework that targets spatial arrangements, such as spatial mix-up masking and spatial contrastive predictive coding, for spatially-delineated domain-adapted AI classification. Experimental results on real-world datasets (e.g., oncology data) show that the proposed framework provides higher prediction accuracy than baseline methods.

Title: GL-ICNN: An End-To-End Interpretable Convolutional Neural Network for the Diagnosis and Prediction of Alzheimer's Disease

Authors: Wenjie Kang, Lize Jiskoot, Peter De Deyn, Geert Biessels, Huiberdina Koek, Jurgen Claassen, Huub Middelkoop, Wiesje Flier, Willemijn J. Jansen, Stefan Klein, Esther Bron
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2501.11715
Pdf URL: https://arxiv.org/pdf/2501.11715
Copy Paste: [[2501.11715]] GL-ICNN: An End-To-End Interpretable Convolutional Neural Network for the Diagnosis and Prediction of Alzheimer's Disease(https://arxiv.org/abs/2501.11715)
Keywords: generative
Abstract: Deep learning methods based on Convolutional Neural Networks (CNNs) have shown great potential to improve early and accurate diagnosis of Alzheimer's disease (AD) dementia based on imaging data. However, these methods have yet to be widely adopted in clinical practice, possibly due to the limited interpretability of deep learning models. The Explainable Boosting Machine (EBM) is a glass-box model but cannot learn features directly from input imaging data. In this study, we propose a novel interpretable model that combines CNNs and EBMs for the diagnosis and prediction of AD. We develop an innovative training strategy that alternatingly trains the CNN component as a feature extractor and the EBM component as the output block to form an end-to-end model. The model takes imaging data as input and provides both predictions and interpretable feature importance measures. We validated the proposed model on the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset and the Health-RI Parelsnoer Neurodegenerative Diseases Biobank (PND) as an external testing set. The proposed model achieved an area-under-the-curve (AUC) of 0.956 for AD and control classification, and 0.694 for the prediction of conversion of mild cognitive impairment (MCI) to AD on the ADNI cohort. The proposed model is a glass-box model that achieves a comparable performance with other state-of-the-art black-box models. Our code is publicly available at: this https URL.

Title: Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks

Authors: Zhenhailong Wang, Haiyang Xu, Junyang Wang, Xi Zhang, Ming Yan, Ji Zhang, Fei Huang, Heng Ji
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2501.11733
Pdf URL: https://arxiv.org/pdf/2501.11733
Copy Paste: [[2501.11733]] Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks(https://arxiv.org/abs/2501.11733)
Keywords: foundation model
Abstract: Smartphones have become indispensable in modern life, yet navigating complex tasks on mobile devices often remains frustrating. Recent advancements in large multimodal model (LMM)-based mobile agents have demonstrated the ability to perceive and act in mobile environments. However, current approaches face significant limitations: they fall short in addressing real-world human needs, struggle with reasoning-intensive and long-horizon tasks, and lack mechanisms to learn and improve from prior experiences. To overcome these challenges, we introduce Mobile-Agent-E, a hierarchical multi-agent framework capable of self-evolution through past experience. By hierarchical, we mean an explicit separation of high-level planning and low-level action execution. The framework comprises a Manager, responsible for devising overall plans by breaking down complex tasks into subgoals, and four subordinate agents--Perceptor, Operator, Action Reflector, and Notetaker--which handle fine-grained visual perception, immediate action execution, error verification, and information aggregation, respectively. Mobile-Agent-E also features a novel self-evolution module which maintains a persistent long-term memory comprising Tips and Shortcuts. Tips are general guidance and lessons learned from prior tasks on how to effectively interact with the environment. Shortcuts are reusable, executable sequences of atomic operations tailored for specific subroutines. The inclusion of Tips and Shortcuts facilitates continuous refinement in performance and efficiency. Alongside this framework, we introduce Mobile-Eval-E, a new benchmark featuring complex mobile tasks requiring long-horizon, multi-app interactions. Empirical results show that Mobile-Agent-E achieves a 22% absolute improvement over previous state-of-the-art approaches across three foundation model backbones. Project page: this https URL.

Title: SILO: Solving Inverse Problems with Latent Operators

Authors: Ron Raphaeli, Sean Man, Michael Elad
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2501.11746
Pdf URL: https://arxiv.org/pdf/2501.11746
Copy Paste: [[2501.11746]] SILO: Solving Inverse Problems with Latent Operators(https://arxiv.org/abs/2501.11746)
Keywords: diffusion
Abstract: Consistent improvement of image priors over the years has led to the development of better inverse problem solvers. Diffusion models are the newcomers to this arena, posing the strongest known prior to date. Recently, such models operating in a latent space have become increasingly predominant due to their efficiency. In recent works, these models have been applied to solve inverse problems. Working in the latent space typically requires multiple applications of an Autoencoder during the restoration process, which leads to both computational and restoration quality challenges. In this work, we propose a new approach for handling inverse problems with latent diffusion models, where a learned degradation function operates within the latent space, emulating a known image space degradation. Usage of the learned operator reduces the dependency on the Autoencoder to only the initial and final steps of the restoration process, facilitating faster sampling and superior restoration quality. We demonstrate the effectiveness of our method on a variety of image restoration tasks and datasets, achieving significant improvements over prior art.

Title: Are generative models fair? A study of racial bias in dermatological image generation

Authors: Miguel López-Pérez, Søren Hauberg, Aasa Feragen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.11752
Pdf URL: https://arxiv.org/pdf/2501.11752
Copy Paste: [[2501.11752]] Are generative models fair? A study of racial bias in dermatological image generation(https://arxiv.org/abs/2501.11752)
Keywords: generative
Abstract: Racial bias in medicine, particularly in dermatology, presents significant ethical and clinical challenges. It often results from the underrepresentation of darker skin tones in training datasets for machine learning models. While efforts to address bias in dermatology have focused on improving dataset diversity and mitigating disparities in discriminative models, the impact of racial bias on generative models remains underexplored. Generative models, such as Variational Autoencoders (VAEs), are increasingly used in healthcare applications, yet their fairness across diverse skin tones is currently not well understood. In this study, we evaluate the fairness of generative models in clinical dermatology with respect to racial bias. For this purpose, we first train a VAE with a perceptual loss to generate and reconstruct high-quality skin images across different skin tones. We utilize the Fitzpatrick17k dataset to examine how racial bias influences the representation and performance of these models. Our findings indicate that the VAE is influenced by the diversity of skin tones in the training dataset, with better performance observed for lighter skin tones. Additionally, the uncertainty estimates produced by the VAE are ineffective in assessing the model's fairness. These results highlight the need for improved uncertainty quantification mechanisms to detect and address racial bias in generative models for trustworthy healthcare technologies.

Title: EfficientVITON: An Efficient Virtual Try-On Model using Optimized Diffusion Process

Authors: Mostafa Atef, Mariam Ayman, Ahmed Rashed, Ashrakat Saeed, Abdelrahman Saeed, Ahmed Fares
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.11776
Pdf URL: https://arxiv.org/pdf/2501.11776
Copy Paste: [[2501.11776]] EfficientVITON: An Efficient Virtual Try-On Model using Optimized Diffusion Process(https://arxiv.org/abs/2501.11776)
Keywords: diffusion
Abstract: Would not it be much more convenient for everybody to try on clothes by only looking into a mirror ? The answer to that problem is virtual try-on, enabling users to digitally experiment with outfits. The core challenge lies in realistic image-to-image translation, where clothing must fit diverse human forms, poses, and figures. Early methods, which used 2D transformations, offered speed, but image quality was often disappointing and lacked the nuance of deep learning. Though GAN-based techniques enhanced realism, their dependence on paired data proved limiting. More adaptable methods offered great visuals but demanded significant computing power and time. Recent advances in diffusion models have shown promise for high-fidelity translation, yet the current crop of virtual try-on tools still struggle with detail loss and warping issues. To tackle these challenges, this paper proposes EfficientVITON, a new virtual try-on system leveraging the impressive pre-trained Stable Diffusion model for better images and deployment feasibility. The system includes a spatial encoder to maintain clothings finer details and zero cross-attention blocks to capture the subtleties of how clothes fit a human body. Input images are carefully prepared, and the diffusion process has been tweaked to significantly cut generation time without image quality loss. The training process involves two distinct stages of fine-tuning, carefully incorporating a balance of loss functions to ensure both accurate try-on results and high-quality visuals. Rigorous testing on the VITON-HD dataset, supplemented with real-world examples, has demonstrated that EfficientVITON achieves state-of-the-art results.

Title: CogMorph: Cognitive Morphing Attacks for Text-to-Image Models

Authors: Zonglei Jing, Zonghao Ying, Le Wang, Siyuan Liang, Aishan Liu, Xianglong Liu, Dacheng Tao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.11815
Pdf URL: https://arxiv.org/pdf/2501.11815
Copy Paste: [[2501.11815]] CogMorph: Cognitive Morphing Attacks for Text-to-Image Models(https://arxiv.org/abs/2501.11815)
Keywords: generative
Abstract: The development of text-to-image (T2I) generative models, that enable the creation of high-quality synthetic images from textual prompts, has opened new frontiers in creative design and content generation. However, this paper reveals a significant and previously unrecognized ethical risk inherent in this technology and introduces a novel method, termed the Cognitive Morphing Attack (CogMorph), which manipulates T2I models to generate images that retain the original core subjects but embeds toxic or harmful contextual elements. This nuanced manipulation exploits the cognitive principle that human perception of concepts is shaped by the entire visual scene and its context, producing images that amplify emotional harm far beyond attacks that merely preserve the original semantics. To address this, we first construct an imagery toxicity taxonomy spanning 10 major and 48 sub-categories, aligned with human cognitive-perceptual dimensions, and further build a toxicity risk matrix resulting in 1,176 high-quality T2I toxic prompts. Based on this, our CogMorph first introduces Cognitive Toxicity Augmentation, which develops a cognitive toxicity knowledge base with rich external toxic representations for humans (e.g., fine-grained visual features) that can be utilized to further guide the optimization of adversarial prompts. In addition, we present Contextual Hierarchical Morphing, which hierarchically extracts critical parts of the original prompt (e.g., scenes, subjects, and body parts), and then iteratively retrieves and fuses toxic features to inject harmful contexts. Extensive experiments on multiple open-sourced T2I models and black-box commercial APIs (e.g., DALLE-3) demonstrate the efficacy of CogMorph which significantly outperforms other baselines by large margins (+20.62\% on average).

Title: PXGen: A Post-hoc Explainable Method for Generative Models

Authors: Yen-Lung Huang, Ming-Hsi Weng, Hao-Tsung Yang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2501.11827
Pdf URL: https://arxiv.org/pdf/2501.11827
Copy Paste: [[2501.11827]] PXGen: A Post-hoc Explainable Method for Generative Models(https://arxiv.org/abs/2501.11827)
Keywords: generative
Abstract: With the rapid growth of generative AI in numerous applications, explainable AI (XAI) plays a crucial role in ensuring the responsible development and deployment of generative AI technologies. XAI has undergone notable advancements and widespread adoption in recent years, reflecting a concerted push to enhance the transparency, interpretability, and credibility of AI systems. Recent research emphasizes that a proficient XAI method should adhere to a set of criteria, primarily focusing on two key areas. Firstly, it should ensure the quality and fluidity of explanations, encompassing aspects like faithfulness, plausibility, completeness, and tailoring to individual needs. Secondly, the design principle of the XAI system or mechanism should cover the following factors such as reliability, resilience, the verifiability of its outputs, and the transparency of its algorithm. However, research in XAI for generative models remains relatively scarce, with little exploration into how such methods can effectively meet these criteria in that domain. In this work, we propose PXGen, a post-hoc explainable method for generative models. Given a model that needs to be explained, PXGen prepares two materials for the explanation, the Anchor set and intrinsic & extrinsic criteria. Those materials are customizable by users according to their purpose and requirements. Via the calculation of each criterion, each anchor has a set of feature values and PXGen provides examplebased explanation methods according to the feature values among all the anchors and illustrated and visualized to the users via tractable algorithms such as k-dispersion or k-center.

Title: Is your LLM trapped in a Mental Set? Investigative study on how mental sets affect the reasoning capabilities of LLMs

Authors: Saiful Haq, Niyati Chhaya, Piyush Pandey, Pushpak Bhattacharya
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2501.11833
Pdf URL: https://arxiv.org/pdf/2501.11833
Copy Paste: [[2501.11833]] Is your LLM trapped in a Mental Set? Investigative study on how mental sets affect the reasoning capabilities of LLMs(https://arxiv.org/abs/2501.11833)
Keywords: in-context
Abstract: In this paper, we present an investigative study on how Mental Sets influence the reasoning capabilities of LLMs. LLMs have excelled in diverse natural language processing (NLP) tasks, driven by advancements in parameter-efficient fine-tuning (PEFT) and emergent capabilities like in-context learning (ICL). For complex reasoning tasks, selecting the right model for PEFT or ICL is critical, often relying on scores on benchmarks such as MMLU, MATH, and GSM8K. However, current evaluation methods, based on metrics like F1 Score or reasoning chain assessments by larger models, overlook a key dimension: adaptability to unfamiliar situations and overcoming entrenched thinking patterns. In cognitive psychology, Mental Set refers to the tendency to persist with previously successful strategies, even when they become inefficient - a challenge for problem solving and reasoning. We compare the performance of LLM models like Llama-3.1-8B-Instruct, Llama-3.1-70B-Instruct and GPT-4o in the presence of mental sets. To the best of our knowledge, this is the first study to integrate cognitive psychology concepts into the evaluation of LLMs for complex reasoning tasks, providing deeper insights into their adaptability and problem-solving efficacy.

Title: Survey on Monocular Metric Depth Estimation

Authors: Jiuling Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.11841
Pdf URL: https://arxiv.org/pdf/2501.11841
Copy Paste: [[2501.11841]] Survey on Monocular Metric Depth Estimation(https://arxiv.org/abs/2501.11841)
Keywords: generative
Abstract: Monocular Depth Estimation (MDE) is a fundamental computer vision task underpinning applications such as spatial understanding, 3D reconstruction, and autonomous driving. While deep learning-based MDE methods can predict relative depth from a single image, their lack of metric scale information often results in scale inconsistencies, limiting their utility in downstream tasks like visual SLAM, 3D reconstruction, and novel view synthesis. Monocular Metric Depth Estimation (MMDE) addresses these challenges by enabling precise, scene-scale depth inference. MMDE improves depth consistency, enhances sequential task stability, simplifies integration into downstream applications, and broadens practical use cases. This paper provides a comprehensive review of depth estimation technologies, highlighting the evolution from geometry-based methods to state-of-the-art deep learning approaches. It emphasizes advancements in scale-agnostic methods, which are crucial for enabling zero-shot generalization as the foundational capability for MMDE. Recent progress in zero-shot MMDE research is explored, focusing on challenges such as model generalization and the loss of detail at scene boundaries. Innovative strategies to address these issues include unlabelled data augmentation, image patching, architectural optimization, and generative techniques. These advancements, analyzed in detail, demonstrate significant contributions to overcoming existing limitations. Finally, this paper synthesizes recent developments in zero-shot MMDE, identifies unresolved challenges, and outlines future research directions. By offering a clear roadmap and cutting-edge insights, this work aims to deepen understanding of MMDE, inspire novel applications, and drive technological innovation.

Title: Noise-Resilient Point-wise Anomaly Detection in Time Series Using Weak Segment Labels

Authors: Yaxuan Wang, Hao Cheng, Jing Xiong, Qingsong Wen, Han Jia, Ruixuan Song, Liyuan Zhang, Zhaowei Zhu, Yang Liu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2501.11959
Pdf URL: https://arxiv.org/pdf/2501.11959
Copy Paste: [[2501.11959]] Noise-Resilient Point-wise Anomaly Detection in Time Series Using Weak Segment Labels(https://arxiv.org/abs/2501.11959)
Keywords: anomaly
Abstract: Detecting anomalies in temporal data has gained significant attention across various real-world applications, aiming to identify unusual events and mitigate potential hazards. In practice, situations often involve a mix of segment-level labels (detected abnormal events with segments of time points) and unlabeled data (undetected events), while the ideal algorithmic outcome should be point-level predictions. Therefore, the huge label information gap between training data and targets makes the task challenging. In this study, we formulate the above imperfect information as noisy labels and propose NRdetector, a noise-resilient framework that incorporates confidence-based sample selection, robust segment-level learning, and data-centric point-level detection for multivariate time series anomaly detection. Particularly, to bridge the information gap between noisy segment-level labels and missing point-level labels, we develop a novel loss function that can effectively mitigate the label noise and consider the temporal features. It encourages the smoothness of consecutive points and the separability of points from segments with different labels. Extensive experiments on real-world multivariate time series datasets with 11 different evaluation metrics demonstrate that NRdetector consistently achieves robust results across multiple real-world datasets, outperforming various baselines adapted to operate in our setting.

Title: TAD-Bench: A Comprehensive Benchmark for Embedding-Based Text Anomaly Detection

Authors: Yang Cao, Sikun Yang, Chen Li, Haolong Xiang, Lianyong Qi, Bo Liu, Rongsheng Li, Ming Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2501.11960
Pdf URL: https://arxiv.org/pdf/2501.11960
Copy Paste: [[2501.11960]] TAD-Bench: A Comprehensive Benchmark for Embedding-Based Text Anomaly Detection(https://arxiv.org/abs/2501.11960)
Keywords: anomaly
Abstract: Text anomaly detection is crucial for identifying spam, misinformation, and offensive language in natural language processing tasks. Despite the growing adoption of embedding-based methods, their effectiveness and generalizability across diverse application scenarios remain under-explored. To address this, we present TAD-Bench, a comprehensive benchmark designed to systematically evaluate embedding-based approaches for text anomaly detection. TAD-Bench integrates multiple datasets spanning different domains, combining state-of-the-art embeddings from large language models with a variety of anomaly detection algorithms. Through extensive experiments, we analyze the interplay between embeddings and detection methods, uncovering their strengths, weaknesses, and applicability to different tasks. These findings offer new perspectives on building more robust, efficient, and generalizable anomaly detection systems for real-world applications.

Title: TabularARGN: A Flexible and Efficient Auto-Regressive Framework for Generating High-Fidelity Synthetic Data

Authors: Paul Tiwald, Ivona Krchova, Andrey Sidorenko, Mariana Vargas-Vieyra, Mario Scriminaci, Michael Platzer
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2501.12012
Pdf URL: https://arxiv.org/pdf/2501.12012
Copy Paste: [[2501.12012]] TabularARGN: A Flexible and Efficient Auto-Regressive Framework for Generating High-Fidelity Synthetic Data(https://arxiv.org/abs/2501.12012)
Keywords: generative
Abstract: Synthetic data generation for tabular datasets must balance fidelity, efficiency, and versatility to meet the demands of real-world applications. We introduce the Tabular Auto-Regressive Generative Network (TabularARGN), a flexible framework designed to handle mixed-type, multivariate, and sequential datasets. By training on all possible conditional probabilities, TabularARGN supports advanced features such as fairness-aware generation, imputation, and conditional generation on any subset of columns. The framework achieves state-of-the-art synthetic data quality while significantly reducing training and inference times, making it ideal for large-scale datasets with diverse structures. Evaluated across established benchmarks, including realistic datasets with complex relationships, TabularARGN demonstrates its capability to synthesize high-quality data efficiently. By unifying flexibility and performance, this framework paves the way for practical synthetic data generation across industries.

Title: Are Traditional Deep Learning Model Approaches as Effective as a Retinal-Specific Foundation Model for Ocular and Systemic Disease Detection?

Authors: Samantha Min Er Yew, Xiaofeng Lei, Jocelyn Hui Lin Goh, Yibing Chen, Sahana Srinivasan, Miao-li Chee, Krithi Pushpanathan, Ke Zou, Qingshan Hou, Zhi Da Soh, Cancan Xue, Marco Chak Yan Yu, Charumathi Sabanayagam, E Shyong Tai, Xueling Sim, Yaxing Wang, Jost B. Jonas, Vinay Nangia, Gabriel Dawei Yang, Emma Anran Ran, Carol Yim-Lui Cheung, Yangqin Feng, Jun Zhou, Rick Siow Mong Goh, Yukun Zhou, Pearse A. Keane, Yong Liu, Ching-Yu Cheng, Yih-Chung Tham
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2501.12016
Pdf URL: https://arxiv.org/pdf/2501.12016
Copy Paste: [[2501.12016]] Are Traditional Deep Learning Model Approaches as Effective as a Retinal-Specific Foundation Model for Ocular and Systemic Disease Detection?(https://arxiv.org/abs/2501.12016)
Keywords: self-supervised, foundation model
Abstract: Background: RETFound, a self-supervised, retina-specific foundation model (FM), showed potential in downstream applications. However, its comparative performance with traditional deep learning (DL) models remains incompletely understood. This study aimed to evaluate RETFound against three ImageNet-pretrained supervised DL models (ResNet50, ViT-base, SwinV2) in detecting ocular and systemic diseases. Methods: We fine-tuned/trained RETFound and three DL models on full datasets, 50%, 20%, and fixed sample sizes (400, 200, 100 images, with half comprising disease cases; for each DR severity class, 100 and 50 cases were used. Fine-tuned models were tested internally using the SEED (53,090 images) and APTOS-2019 (3,672 images) datasets and externally validated on population-based (BES, CIEMS, SP2, UKBB) and open-source datasets (ODIR-5k, PAPILA, GAMMA, IDRiD, MESSIDOR-2). Model performance was compared using area under the receiver operating characteristic curve (AUC) and Z-tests with Bonferroni correction (P<0.05/3). Interpretation: Traditional DL models are mostly comparable to RETFound for ocular disease detection with large datasets. However, RETFound is superior in systemic disease detection with smaller datasets. These findings offer valuable insights into the respective merits and limitation of traditional models and FMs.

Title: Comparative Analysis of Pre-trained Deep Learning Models and DINOv2 for Cushing's Syndrome Diagnosis in Facial Analysis

Authors: Hongjun Liu, Changwei Song, Jiaqi Qiang, Jianqiang Li, Hui Pan, Lin Lu, Xiao Long, Qing Zhao, Jiuzuo Huang, Shi Chen
Subjects: cs.LG, cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2501.12023
Pdf URL: https://arxiv.org/pdf/2501.12023
Copy Paste: [[2501.12023]] Comparative Analysis of Pre-trained Deep Learning Models and DINOv2 for Cushing's Syndrome Diagnosis in Facial Analysis(https://arxiv.org/abs/2501.12023)
Keywords: foundation model
Abstract: Cushing's syndrome is a condition caused by excessive glucocorticoid secretion from the adrenal cortex, often manifesting with moon facies and plethora, making facial data crucial for diagnosis. Previous studies have used pre-trained convolutional neural networks (CNNs) for diagnosing Cushing's syndrome using frontal facial images. However, CNNs are better at capturing local features, while Cushing's syndrome often presents with global facial features. Transformer-based models like ViT and SWIN, which utilize self-attention mechanisms, can better capture long-range dependencies and global features. Recently, DINOv2, a foundation model based on visual Transformers, has gained interest. This study compares the performance of various pre-trained models, including CNNs, Transformer-based models, and DINOv2, in diagnosing Cushing's syndrome. We also analyze gender bias and the impact of freezing mechanisms on DINOv2. Our results show that Transformer-based models and DINOv2 outperformed CNNs, with ViT achieving the highest F1 score of 85.74%. Both the pre-trained model and DINOv2 had higher accuracy for female samples. DINOv2 also showed improved performance when freezing parameters. In conclusion, Transformer-based models and DINOv2 are effective for Cushing's syndrome classification.

Title: Unified 3D MRI Representations via Sequence-Invariant Contrastive Learning

Authors: Liam Chalcroft, Jenny Cronin, Cathy J. Price, John Ashburner
Subjects: cs.CV, physics.med-ph
Abstract URL: https://arxiv.org/abs/2501.12057
Pdf URL: https://arxiv.org/pdf/2501.12057
Copy Paste: [[2501.12057]] Unified 3D MRI Representations via Sequence-Invariant Contrastive Learning(https://arxiv.org/abs/2501.12057)
Keywords: self-supervised
Abstract: Self-supervised deep learning has accelerated 2D natural image analysis but remains difficult to translate into 3D MRI, where data are scarce and pre-trained 2D backbones cannot capture volumetric context. We present a sequence-invariant self-supervised framework leveraging quantitative MRI (qMRI). By simulating multiple MRI contrasts from a single 3D qMRI scan and enforcing consistent representations across these contrasts, we learn anatomy-centric rather than sequence-specific features. This yields a robust 3D encoder that performs strongly across varied tasks and protocols. Experiments on healthy brain segmentation (IXI), stroke lesion segmentation (ARC), and MRI denoising show significant gains over baseline SSL approaches, especially in low-data settings (up to +8.3% Dice, +4.2 dB PSNR). Our model also generalises effectively to unseen sites, demonstrating potential for more scalable and clinically reliable volumetric analysis. All code and trained models are publicly available.

Title: Proxies for Distortion and Consistency with Applications for Real-World Image Restoration

Authors: Sean Man, Guy Ohayon, Ron Raphaeli, Michael Elad
Subjects: cs.CV, cs.AI, cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2501.12102
Pdf URL: https://arxiv.org/pdf/2501.12102
Copy Paste: [[2501.12102]] Proxies for Distortion and Consistency with Applications for Real-World Image Restoration(https://arxiv.org/abs/2501.12102)
Keywords: diffusion
Abstract: Real-world image restoration deals with the recovery of images suffering from an unknown degradation. This task is typically addressed while being given only degraded images, without their corresponding ground-truth versions. In this hard setting, designing and evaluating restoration algorithms becomes highly challenging. This paper offers a suite of tools that can serve both the design and assessment of real-world image restoration algorithms. Our work starts by proposing a trained model that predicts the chain of degradations a given real-world measured input has gone through. We show how this estimator can be used to approximate the consistency -- the match between the measurements and any proposed recovered image. We also use this estimator as a guiding force for the design of a simple and highly-effective plug-and-play real-world image restoration algorithm, leveraging a pre-trained diffusion-based image prior. Furthermore, this work proposes no-reference proxy measures of MSE and LPIPS, which, without access to the ground-truth images, allow ranking of real-world image restoration algorithms according to their (approximate) MSE and LPIPS. The proposed suite provides a versatile, first of its kind framework for evaluating and comparing blind image restoration algorithms in real-world scenarios.

Title: Teacher Encoder-Student Decoder Denoising Guided Segmentation Network for Anomaly Detection

Authors: ShiXuan Song, Hao Chen, Shu Hu, Xin Wang, Jinrong Hu, Xi Wu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2501.12104
Pdf URL: https://arxiv.org/pdf/2501.12104
Copy Paste: [[2501.12104]] Teacher Encoder-Student Decoder Denoising Guided Segmentation Network for Anomaly Detection(https://arxiv.org/abs/2501.12104)
Keywords: self-supervised, anomaly
Abstract: Visual anomaly detection is a highly challenging task, often categorized as a one-class classification and segmentation problem. Recent studies have demonstrated that the student-teacher (S-T) framework effectively addresses this challenge. However, most S-T frameworks rely solely on pre-trained teacher networks to guide student networks in learning multi-scale similar features, overlooking the potential of the student networks to enhance learning through multi-scale feature fusion. In this study, we propose a novel model named PFADSeg, which integrates a pre-trained teacher network, a denoising student network with multi-scale feature fusion, and a guided anomaly segmentation network into a unified framework. By adopting a unique teacher-encoder and student-decoder denoising mode, the model improves the student network's ability to learn from teacher network features. Furthermore, an adaptive feature fusion mechanism is introduced to train a self-supervised segmentation network that synthesizes anomaly masks autonomously, significantly increasing detection performance. Evaluated on the MVTec AD dataset, PFADSeg achieves state-of-the-art results with an image-level AUC of 98.9%, a pixel-level mean precision of 76.4%, and an instance-level mean precision of 78.7%.

Title: ComposeAnyone: Controllable Layout-to-Human Generation with Decoupled Multimodal Conditions

Authors: Shiyue Zhang, Zheng Chong, Xi Lu, Wenqing Zhang, Haoxiang Li, Xujie Zhang, Jiehui Huang, Xiao Dong, Xiaodan Liang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.12173
Pdf URL: https://arxiv.org/pdf/2501.12173
Copy Paste: [[2501.12173]] ComposeAnyone: Controllable Layout-to-Human Generation with Decoupled Multimodal Conditions(https://arxiv.org/abs/2501.12173)
Keywords: diffusion
Abstract: Building on the success of diffusion models, significant advancements have been made in multimodal image generation tasks. Among these, human image generation has emerged as a promising technique, offering the potential to revolutionize the fashion design process. However, existing methods often focus solely on text-to-image or image reference-based human generation, which fails to satisfy the increasingly sophisticated demands. To address the limitations of flexibility and precision in human generation, we introduce ComposeAnyone, a controllable layout-to-human generation method with decoupled multimodal conditions. Specifically, our method allows decoupled control of any part in hand-drawn human layouts using text or reference images, seamlessly integrating them during the generation process. The hand-drawn layout, which utilizes color-blocked geometric shapes such as ellipses and rectangles, can be easily drawn, offering a more flexible and accessible way to define spatial layouts. Additionally, we introduce the ComposeHuman dataset, which provides decoupled text and reference image annotations for different components of each human image, enabling broader applications in human image generation tasks. Extensive experiments on multiple datasets demonstrate that ComposeAnyone generates human images with better alignment to given layouts, text descriptions, and reference images, showcasing its multi-task capability and controllability.

Title: Extend Adversarial Policy Against Neural Machine Translation via Unknown Token

Authors: Wei Zou, Shujian Huang, Jiajun Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.12183
Pdf URL: https://arxiv.org/pdf/2501.12183
Copy Paste: [[2501.12183]] Extend Adversarial Policy Against Neural Machine Translation via Unknown Token(https://arxiv.org/abs/2501.12183)
Keywords: self-supervised
Abstract: Generating adversarial examples contributes to mainstream neural machine translation~(NMT) robustness. However, popular adversarial policies are apt for fixed tokenization, hindering its efficacy for common character perturbations involving versatile tokenization. Based on existing adversarial generation via reinforcement learning~(RL), we propose the `DexChar policy' that introduces character perturbations for the existing mainstream adversarial policy based on token substitution. Furthermore, we improve the self-supervised matching that provides feedback in RL to cater to the semantic constraints required during training adversaries. Experiments show that our method is compatible with the scenario where baseline adversaries fail, and can generate high-efficiency adversarial examples for analysis and optimization of the system.

Title: Hunyuan3D 2.0: Scaling Diffusion Models for High Resolution Textured 3D Assets Generation

Authors: Zibo Zhao, Zeqiang Lai, Qingxiang Lin, Yunfei Zhao, Haolin Liu, Shuhui Yang, Yifei Feng, Mingxin Yang, Sheng Zhang, Xianghui Yang, Huiwen Shi, Sicong Liu, Junta Wu, Yihang Lian, Fan Yang, Ruining Tang, Zebin He, Xinzhou Wang, Jian Liu, Xuhui Zuo, Zhuo Chen, Biwen Lei, Haohan Weng, Jing Xu, Yiling Zhu, Xinhai Liu, Lixin Xu, Changrong Hu, Tianyu Huang, Lifu Wang, Jihong Zhang, Meng Chen, Liang Dong, Yiwen Jia, Yulin Cai, Jiaao Yu, Yixuan Tang, Hao Zhang, Zheng Ye, Peng He, Runzhou Wu, Chao Zhang, Yonghao Tan, Jie Xiao, Yangyu Tao, Jianchen Zhu, Jinbao Xue, Kai Liu, Chongqing Zhao, Xinming Wu, Zhichao Hu, Lei Qin, Jianbing Peng, Zhan Li, Minghui Chen, Xipeng Zhang, Lin Niu, Paige Wang, Yingkai Wang, Haozhao Kuang, Zhongyi Fan, Xu Zheng, Weihao Zhuang, YingPing He, Tian Liu, Yong Yang, Di Wang, Yuhong Liu, Jie Jiang, Jingwei Huang, Chunchao Guo (refer to the report for detailed contributions)
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.12202
Pdf URL: https://arxiv.org/pdf/2501.12202
Copy Paste: [[2501.12202]] Hunyuan3D 2.0: Scaling Diffusion Models for High Resolution Textured 3D Assets Generation(https://arxiv.org/abs/2501.12202)
Keywords: diffusion, generative
Abstract: We present Hunyuan3D 2.0, an advanced large-scale 3D synthesis system for generating high-resolution textured 3D assets. This system includes two foundation components: a large-scale shape generation model -- Hunyuan3D-DiT, and a large-scale texture synthesis model -- Hunyuan3D-Paint. The shape generative model, built on a scalable flow-based diffusion transformer, aims to create geometry that properly aligns with a given condition image, laying a solid foundation for downstream applications. The texture synthesis model, benefiting from strong geometric and diffusion priors, produces high-resolution and vibrant texture maps for either generated or hand-crafted meshes. Furthermore, we build Hunyuan3D-Studio -- a versatile, user-friendly production platform that simplifies the re-creation process of 3D assets. It allows both professional and amateur users to manipulate or even animate their meshes efficiently. We systematically evaluate our models, showing that Hunyuan3D 2.0 outperforms previous state-of-the-art models, including the open-source models and closed-source models in geometry details, condition alignment, texture quality, and etc. Hunyuan3D 2.0 is publicly released in order to fill the gaps in the open-source 3D community for large-scale foundation generative models. The code and pre-trained weights of our models are available at: this https URL

Title: Explainability for Vision Foundation Models: A Survey

Authors: Rémi Kazmierczak, Eloïse Berthier, Goran Frehse, Gianni Franchi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.12203
Pdf URL: https://arxiv.org/pdf/2501.12203
Copy Paste: [[2501.12203]] Explainability for Vision Foundation Models: A Survey(https://arxiv.org/abs/2501.12203)
Keywords: foundation model
Abstract: As artificial intelligence systems become increasingly integrated into daily life, the field of explainability has gained significant attention. This trend is particularly driven by the complexity of modern AI models and their decision-making processes. The advent of foundation models, characterized by their extensive generalization capabilities and emergent uses, has further complicated this landscape. Foundation models occupy an ambiguous position in the explainability domain: their complexity makes them inherently challenging to interpret, yet they are increasingly leveraged as tools to construct explainable models. In this survey, we explore the intersection of foundation models and eXplainable AI (XAI) in the vision domain. We begin by compiling a comprehensive corpus of papers that bridge these fields. Next, we categorize these works based on their architectural characteristics. We then discuss the challenges faced by current research in integrating XAI within foundation models. Furthermore, we review common evaluation methodologies for these combined approaches. Finally, we present key observations and insights from our survey, offering directions for future research in this rapidly evolving field.

Title: Score Combining for Contrastive OOD Detection

Authors: Edward T. Reehorst, Philip Schniter
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2501.12204
Pdf URL: https://arxiv.org/pdf/2501.12204
Copy Paste: [[2501.12204]] Score Combining for Contrastive OOD Detection(https://arxiv.org/abs/2501.12204)
Keywords: anomaly
Abstract: In out-of-distribution (OOD) detection, one is asked to classify whether a test sample comes from a known inlier distribution or not. We focus on the case where the inlier distribution is defined by a training dataset and there exists no additional knowledge about the novelties that one is likely to encounter. This problem is also referred to as novelty detection, one-class classification, and unsupervised anomaly detection. The current literature suggests that contrastive learning techniques are state-of-the-art for OOD detection. We aim to improve on those techniques by combining/ensembling their scores using the framework of null hypothesis testing and, in particular, a novel generalized likelihood ratio test (GLRT). We demonstrate that our proposed GLRT-based technique outperforms the state-of-the-art CSI and SupCSI techniques from Tack et al. 2020 in dataset-vs-dataset experiments with CIFAR-10, SVHN, LSUN, ImageNet, and CIFAR-100, as well as leave-one-class-out experiments with CIFAR-10. We also demonstrate that our GLRT outperforms the score-combining methods of Fisher, Bonferroni, Simes, Benjamini-Hochwald, and Stouffer in our application.

Title: Fixing Imbalanced Attention to Mitigate In-Context Hallucination of Large Vision-Language Model

Authors: Kazi Hasan Ibn Arif, Sajib Acharjee Dip, Khizar Hussain, Lang Zhang, Chris Thomas
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2501.12206
Pdf URL: https://arxiv.org/pdf/2501.12206
Copy Paste: [[2501.12206]] Fixing Imbalanced Attention to Mitigate In-Context Hallucination of Large Vision-Language Model(https://arxiv.org/abs/2501.12206)
Keywords: in-context
Abstract: Large Vision Language Models (LVLMs) have demonstrated remarkable capabilities in understanding and describing visual content, achieving state-of-the-art performance across various vision-language tasks. However, these models frequently exhibit hallucination behavior, where they generate descriptions containing objects or details absent in the input image. Our work investigates this phenomenon by analyzing attention patterns across transformer layers and heads, revealing that hallucinations often stem from progressive degradation of visual grounding in deeper layers. We propose a novel attention modification approach that combines selective token emphasis and head-specific modulation to maintain visual grounding throughout the generation process. Our method introduces two key components: (1) a dual-stream token selection mechanism that identifies and prioritizes both locally informative and spatially significant visual tokens, and (2) an attention head-specific modulation strategy that differentially amplifies visual information processing based on measured visual sensitivity of individual attention heads. Through extensive experimentation on the MSCOCO dataset, we demonstrate that our approach reduces hallucination rates by up to 62.3\% compared to baseline models while maintaining comparable task performance. Our analysis reveals that selectively modulating tokens across attention heads with varying levels of visual sensitivity can significantly improve visual grounding without requiring model retraining.

Title: You Can't Eat Your Cake and Have It Too: The Performance Degradation of LLMs with Jailbreak Defense

Authors: Wuyuao Mai, Geng Hong, Pei Chen, Xudong Pan, Baojun Liu, Yuan Zhang, Haixin Duan, Min Yang
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2501.12210
Pdf URL: https://arxiv.org/pdf/2501.12210
Copy Paste: [[2501.12210]] You Can't Eat Your Cake and Have It Too: The Performance Degradation of LLMs with Jailbreak Defense(https://arxiv.org/abs/2501.12210)
Keywords: generative
Abstract: With the rise of generative large language models (LLMs) like LLaMA and ChatGPT, these models have significantly transformed daily life and work by providing advanced insights. However, as jailbreak attacks continue to circumvent built-in safety mechanisms, exploiting carefully crafted scenarios or tokens, the safety risks of LLMs have come into focus. While numerous defense strategies--such as prompt detection, modification, and model fine-tuning--have been proposed to counter these attacks, a critical question arises: do these defenses compromise the utility and usability of LLMs for legitimate users? Existing research predominantly focuses on the effectiveness of defense strategies without thoroughly examining their impact on performance, leaving a gap in understanding the trade-offs between LLM safety and performance. Our research addresses this gap by conducting a comprehensive study on the utility degradation, safety elevation, and exaggerated-safety escalation of LLMs with jailbreak defense strategies. We propose USEBench, a novel benchmark designed to evaluate these aspects, along with USEIndex, a comprehensive metric for assessing overall model performance. Through experiments on seven state-of-the-art LLMs, we found that mainstream jailbreak defenses fail to ensure both safety and performance simultaneously. Although model-finetuning performs the best overall, their effectiveness varies across LLMs. Furthermore, vertical comparisons reveal that developers commonly prioritize performance over safety when iterating or fine-tuning their LLMs.

Title: Exploring Temporally-Aware Features for Point Tracking

Authors: Inès Hyeonsu Kim, Seokju Cho, Jiahui Huang, Jung Yi, Joon-Young Lee, Seungryong Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.12218
Pdf URL: https://arxiv.org/pdf/2501.12218
Copy Paste: [[2501.12218]] Exploring Temporally-Aware Features for Point Tracking(https://arxiv.org/abs/2501.12218)
Keywords: self-supervised
Abstract: Point tracking in videos is a fundamental task with applications in robotics, video editing, and more. While many vision tasks benefit from pre-trained feature backbones to improve generalizability, point tracking has primarily relied on simpler backbones trained from scratch on synthetic data, which may limit robustness in real-world scenarios. Additionally, point tracking requires temporal awareness to ensure coherence across frames, but using temporally-aware features is still underexplored. Most current methods often employ a two-stage process: an initial coarse prediction followed by a refinement stage to inject temporal information and correct errors from the coarse stage. These approach, however, is computationally expensive and potentially redundant if the feature backbone itself captures sufficient temporal information. In this work, we introduce Chrono, a feature backbone specifically designed for point tracking with built-in temporal awareness. Leveraging pre-trained representations from self-supervised learner DINOv2 and enhanced with a temporal adapter, Chrono effectively captures long-term temporal context, enabling precise prediction even without the refinement stage. Experimental results demonstrate that Chrono achieves state-of-the-art performance in a refiner-free setting on the TAP-Vid-DAVIS and TAP-Vid-Kinetics datasets, among common feature backbones used in point tracking as well as DINOv2, with exceptional efficiency. Project page: this https URL

Title: TokenVerse: Versatile Multi-concept Personalization in Token Modulation Space

Authors: Daniel Garibi, Shahar Yadin, Roni Paiss, Omer Tov, Shiran Zada, Ariel Ephrat, Tomer Michaeli, Inbar Mosseri, Tali Dekel
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.12224
Pdf URL: https://arxiv.org/pdf/2501.12224
Copy Paste: [[2501.12224]] TokenVerse: Versatile Multi-concept Personalization in Token Modulation Space(https://arxiv.org/abs/2501.12224)
Keywords: diffusion
Abstract: We present TokenVerse -- a method for multi-concept personalization, leveraging a pre-trained text-to-image diffusion model. Our framework can disentangle complex visual elements and attributes from as little as a single image, while enabling seamless plug-and-play generation of combinations of concepts extracted from multiple images. As opposed to existing works, TokenVerse can handle multiple images with multiple concepts each, and supports a wide-range of concepts, including objects, accessories, materials, pose, and lighting. Our work exploits a DiT-based text-to-image model, in which the input text affects the generation through both attention and modulation (shift and scale). We observe that the modulation space is semantic and enables localized control over complex concepts. Building on this insight, we devise an optimization-based framework that takes as input an image and a text description, and finds for each word a distinct direction in the modulation space. These directions can then be used to generate new images that combine the learned concepts in a desired configuration. We demonstrate the effectiveness of TokenVerse in challenging personalization settings, and showcase its advantages over existing methods. project's webpage in this https URL

Title: InsTALL: Context-aware Instructional Task Assistance with Multi-modal Large Language Models

Authors: Pha Nguyen, Sailik Sengupta, Girik Malik, Arshit Gupta, Bonan Min
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2501.12231
Pdf URL: https://arxiv.org/pdf/2501.12231
Copy Paste: [[2501.12231]] InsTALL: Context-aware Instructional Task Assistance with Multi-modal Large Language Models(https://arxiv.org/abs/2501.12231)
Keywords: generative
Abstract: The improved competence of generative models can help building multi-modal virtual assistants that leverage modalities beyond language. By observing humans performing multi-step tasks, one can build assistants that have situational awareness of actions and tasks being performed, enabling them to cater assistance based on this understanding. In this paper, we develop a Context-aware Instructional Task Assistant with Multi-modal Large Language Models (InsTALL) that leverages an online visual stream (e.g. a user's screen share or video recording) and responds in real-time to user queries related to the task at hand. To enable useful assistance, InsTALL 1) trains a multi-modal model on task videos and paired textual data, and 2) automatically extracts task graph from video data and leverages it at training and inference time. We show InsTALL achieves state-of-the-art performance across proposed sub-tasks considered for multimodal activity understanding -- task recognition (TR), action recognition (AR), next action prediction (AP), and plan prediction (PP) -- and outperforms existing baselines on two novel sub-tasks related to automatic error identification.

Title: Memory Storyboard: Leveraging Temporal Segmentation for Streaming Self-Supervised Learning from Egocentric Videos

Authors: Yanlai Yang, Mengye Ren
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2501.12254
Pdf URL: https://arxiv.org/pdf/2501.12254
Copy Paste: [[2501.12254]] Memory Storyboard: Leveraging Temporal Segmentation for Streaming Self-Supervised Learning from Egocentric Videos(https://arxiv.org/abs/2501.12254)
Keywords: self-supervised
Abstract: Self-supervised learning holds the promise to learn good representations from real-world continuous uncurated data streams. However, most existing works in visual self-supervised learning focus on static images or artificial data streams. Towards exploring a more realistic learning substrate, we investigate streaming self-supervised learning from long-form real-world egocentric video streams. Inspired by the event segmentation mechanism in human perception and memory, we propose "Memory Storyboard" that groups recent past frames into temporal segments for more effective summarization of the past visual streams for memory replay. To accommodate efficient temporal segmentation, we propose a two-tier memory hierarchy: the recent past is stored in a short-term memory, and the storyboard temporal segments are then transferred to a long-term memory. Experiments on real-world egocentric video datasets including SAYCam and KrishnaCam show that contrastive learning objectives on top of storyboard frames result in semantically meaningful representations which outperform those produced by state-of-the-art unsupervised continual learning methods.

Title: CBVLM: Training-free Explainable Concept-based Large Vision Language Models for Medical Image Classification

Authors: Cristiano Patrício, Isabel Rio-Torto, Jaime S. Cardoso, Luís F. Teixeira, João C. Neves
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2501.12266
Pdf URL: https://arxiv.org/pdf/2501.12266
Copy Paste: [[2501.12266]] CBVLM: Training-free Explainable Concept-based Large Vision Language Models for Medical Image Classification(https://arxiv.org/abs/2501.12266)
Keywords: in-context
Abstract: The main challenges limiting the adoption of deep learning-based solutions in medical workflows are the availability of annotated data and the lack of interpretability of such systems. Concept Bottleneck Models (CBMs) tackle the latter by constraining the final disease prediction on a set of predefined and human-interpretable concepts. However, the increased interpretability achieved through these concept-based explanations implies a higher annotation burden. Moreover, if a new concept needs to be added, the whole system needs to be retrained. Inspired by the remarkable performance shown by Large Vision-Language Models (LVLMs) in few-shot settings, we propose a simple, yet effective, methodology, CBVLM, which tackles both of the aforementioned challenges. First, for each concept, we prompt the LVLM to answer if the concept is present in the input image. Then, we ask the LVLM to classify the image based on the previous concept predictions. Moreover, in both stages, we incorporate a retrieval module responsible for selecting the best examples for in-context learning. By grounding the final diagnosis on the predicted concepts, we ensure explainability, and by leveraging the few-shot capabilities of LVLMs, we drastically lower the annotation cost. We validate our approach with extensive experiments across four medical datasets and twelve LVLMs (both generic and medical) and show that CBVLM consistently outperforms CBMs and task-specific supervised methods without requiring any training and using just a few annotated examples. More information on our project page: this https URL.

Title: VipDiff: Towards Coherent and Diverse Video Inpainting via Training-free Denoising Diffusion Models

Authors: Chaohao Xie, Kai Han, Kwan-Yee K. Wong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.12267
Pdf URL: https://arxiv.org/pdf/2501.12267
Copy Paste: [[2501.12267]] VipDiff: Towards Coherent and Diverse Video Inpainting via Training-free Denoising Diffusion Models(https://arxiv.org/abs/2501.12267)
Keywords: diffusion
Abstract: Recent video inpainting methods have achieved encouraging improvements by leveraging optical flow to guide pixel propagation from reference frames either in the image space or feature space. However, they would produce severe artifacts in the mask center when the masked area is too large and no pixel correspondences can be found for the center. Recently, diffusion models have demonstrated impressive performance in generating diverse and high-quality images, and have been exploited in a number of works for image inpainting. These methods, however, cannot be applied directly to videos to produce temporal-coherent inpainting results. In this paper, we propose a training-free framework, named VipDiff, for conditioning diffusion model on the reverse diffusion process to produce temporal-coherent inpainting results without requiring any training data or fine-tuning the pre-trained diffusion models. VipDiff takes optical flow as guidance to extract valid pixels from reference frames to serve as constraints in optimizing the randomly sampled Gaussian noise, and uses the generated results for further pixel propagation and conditional generation. VipDiff also allows for generating diverse video inpainting results over different sampled noise. Experiments demonstrate that VipDiff can largely outperform state-of-the-art video inpainting methods in terms of both spatial-temporal coherence and fidelity.

Title: With Great Backbones Comes Great Adversarial Transferability

Authors: Erik Arakelyan, Karen Hambardzumyan, Davit Papikyan, Pasquale Minervini, Albert Gordo, Isabelle Augenstein, Aram H. Markosyan
Subjects: cs.CV, cs.AI, cs.CR, cs.LG, cs.MA
Abstract URL: https://arxiv.org/abs/2501.12275
Pdf URL: https://arxiv.org/pdf/2501.12275
Copy Paste: [[2501.12275]] With Great Backbones Comes Great Adversarial Transferability(https://arxiv.org/abs/2501.12275)
Keywords: self-supervised
Abstract: Advances in self-supervised learning (SSL) for machine vision have improved representation robustness and model performance, giving rise to pre-trained backbones like \emph{ResNet} and \emph{ViT} models tuned with SSL methods such as \emph{SimCLR}. Due to the computational and data demands of pre-training, the utilization of such backbones becomes a strenuous necessity. However, employing these backbones may inherit vulnerabilities to adversarial attacks. While adversarial robustness has been studied under \emph{white-box} and \emph{black-box} settings, the robustness of models tuned on pre-trained backbones remains largely unexplored. Additionally, the role of tuning meta-information in mitigating exploitation risks is unclear. This work systematically evaluates the adversarial robustness of such models across $20,000$ combinations of tuning meta-information, including fine-tuning techniques, backbone families, datasets, and attack types. We propose using proxy models to transfer attacks, simulating varying levels of target knowledge by fine-tuning these proxies with diverse configurations. Our findings reveal that proxy-based attacks approach the effectiveness of \emph{white-box} methods, even with minimal tuning knowledge. We also introduce a naive "backbone attack," leveraging only the backbone to generate adversarial samples, which outperforms \emph{black-box} attacks and rivals \emph{white-box} methods, highlighting critical risks in model-sharing practices. Finally, our ablations reveal how increasing tuning meta-information impacts attack transferability, measuring each meta-information combination.

Title: Regressor-Guided Image Editing Regulates Emotional Response to Reduce Online Engagement

Authors: Christoph Gebhardt, Robin Willardt, Seyedmorteza Sadat, Chih-Wei Ning, Andreas Brombach, Jie Song, Otmar Hilliges, Christian Holz
Subjects: cs.CV, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2501.12289
Pdf URL: https://arxiv.org/pdf/2501.12289
Copy Paste: [[2501.12289]] Regressor-Guided Image Editing Regulates Emotional Response to Reduce Online Engagement(https://arxiv.org/abs/2501.12289)
Keywords: diffusion, generative
Abstract: Emotions are known to mediate the relationship between users' content consumption and their online engagement, with heightened emotional intensity leading to increased engagement. Building on this insight, we propose three regressor-guided image editing approaches aimed at diminishing the emotional impact of images. These include (i) a parameter optimization approach based on global image transformations known to influence emotions, (ii) an optimization approach targeting the style latent space of a generative adversarial network, and (iii) a diffusion-based approach employing classifier guidance and classifier-free guidance. Our findings demonstrate that approaches can effectively alter the emotional properties of images while maintaining high visual quality. Optimization-based methods primarily adjust low-level properties like color hues and brightness, whereas the diffusion-based approach introduces semantic changes, such as altering appearance or facial expressions. Notably, results from a behavioral study reveal that only the diffusion-based approach successfully elicits changes in viewers' emotional responses while preserving high perceived image quality. In future work, we will investigate the impact of these image adaptations on internet user behavior.

Title: Towards Accurate Unified Anomaly Segmentation

Authors: Wenxin Ma, Qingsong Yao, Xiang Zhang, Zhelong Huang, Zihang Jiang, S. Kevin Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.12295
Pdf URL: https://arxiv.org/pdf/2501.12295
Copy Paste: [[2501.12295]] Towards Accurate Unified Anomaly Segmentation(https://arxiv.org/abs/2501.12295)
Keywords: anomaly
Abstract: Unsupervised anomaly detection (UAD) from images strives to model normal data distributions, creating discriminative representations to distinguish and precisely localize anomalies. Despite recent advancements in the efficient and unified one-for-all scheme, challenges persist in accurately segmenting anomalies for further monitoring. Moreover, this problem is obscured by the widely-used AUROC metric under imbalanced UAD settings. This motivates us to emphasize the significance of precise segmentation of anomaly pixels using pAP and DSC as metrics. To address the unsolved segmentation task, we introduce the Unified Anomaly Segmentation (UniAS). UniAS presents a multi-level hybrid pipeline that progressively enhances normal information from coarse to fine, incorporating a novel multi-granularity gated CNN (MGG-CNN) into Transformer layers to explicitly aggregate local details from different granularities. UniAS achieves state-of-the-art anomaly segmentation performance, attaining 65.12/59.33 and 40.06/32.50 in pAP/DSC on the MVTec-AD and VisA datasets, respectively, surpassing previous methods significantly. The codes are shared at this https URL.

Title: A Hybrid Supervised and Self-Supervised Graph Neural Network for Edge-Centric Applications

Authors: Eugenio Borzone, Leandro Di Persia, Matias Gerard
Subjects: cs.LG, q-bio.MN
Abstract URL: https://arxiv.org/abs/2501.12309
Pdf URL: https://arxiv.org/pdf/2501.12309
Copy Paste: [[2501.12309]] A Hybrid Supervised and Self-Supervised Graph Neural Network for Edge-Centric Applications(https://arxiv.org/abs/2501.12309)
Keywords: self-supervised
Abstract: This paper presents a novel graph-based deep learning model for tasks involving relations between two nodes (edge-centric tasks), where the focus lies on predicting relationships and interactions between pairs of nodes rather than node properties themselves. This model combines supervised and self-supervised learning, taking into account for the loss function the embeddings learned and patterns with and without ground truth. Additionally it incorporates an attention mechanism that leverages both node and edge features. The architecture, trained end-to-end, comprises two primary components: embedding generation and prediction. First, a graph neural network (GNN) transform raw node features into dense, low-dimensional embeddings, incorporating edge attributes. Then, a feedforward neural model processes the node embeddings to produce the final output. Experiments demonstrate that our model matches or exceeds existing methods for protein-protein interactions prediction and Gene Ontology (GO) terms prediction. The model also performs effectively with one-hot encoding for node features, providing a solution for the previously unsolved problem of predicting similarity between compounds with unknown structures.

Title: Diffusion-aware Censored Gaussian Processes for Demand Modelling

Authors: Filipe Rodrigues
Subjects: cs.LG, stat.ME, stat.ML
Abstract URL: https://arxiv.org/abs/2501.12354
Pdf URL: https://arxiv.org/pdf/2501.12354
Copy Paste: [[2501.12354]] Diffusion-aware Censored Gaussian Processes for Demand Modelling(https://arxiv.org/abs/2501.12354)
Keywords: diffusion
Abstract: Inferring the true demand for a product or a service from aggregate data is often challenging due to the limited available supply, thus resulting in observations that are censored and correspond to the realized demand, thereby not accounting for the unsatisfied demand. Censored regression models are able to account for the effect of censoring due to the limited supply, but they don't consider the effect of substitutions, which may cause the demand for similar alternative products or services to increase. This paper proposes Diffusion-aware Censored Demand Models, which combine a Tobit likelihood with a graph diffusion process in order to model the latent process of transfer of unsatisfied demand between similar products or services. We instantiate this new class of models under the framework of GPs and, based on both simulated and real-world data for modeling sales, bike-sharing demand, and EV charging demand, demonstrate its ability to better recover the true demand and produce more accurate out-of-sample predictions.

Title: MMVU: Measuring Expert-Level Multi-Discipline Video Understanding

Authors: Yilun Zhao, Lujing Xie, Haowei Zhang, Guo Gan, Yitao Long, Zhiyuan Hu, Tongyan Hu, Weiyuan Chen, Chuhan Li, Junyang Song, Zhijian Xu, Chengye Wang, Weifeng Pan, Ziyao Shangguan, Xiangru Tang, Zhenwen Liang, Yixin Liu, Chen Zhao, Arman Cohan
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2501.12380
Pdf URL: https://arxiv.org/pdf/2501.12380
Copy Paste: [[2501.12380]] MMVU: Measuring Expert-Level Multi-Discipline Video Understanding(https://arxiv.org/abs/2501.12380)
Keywords: foundation model
Abstract: We introduce MMVU, a comprehensive expert-level, multi-discipline benchmark for evaluating foundation models in video understanding. MMVU includes 3,000 expert-annotated questions spanning 27 subjects across four core disciplines: Science, Healthcare, Humanities & Social Sciences, and Engineering. Compared to prior benchmarks, MMVU features three key advancements. First, it challenges models to apply domain-specific knowledge and perform expert-level reasoning to analyze specialized-domain videos, moving beyond the basic visual perception typically assessed in current video benchmarks. Second, each example is annotated by human experts from scratch. We implement strict data quality controls to ensure the high quality of the dataset. Finally, each example is enriched with expert-annotated reasoning rationals and relevant domain knowledge, facilitating in-depth analysis. We conduct an extensive evaluation of 32 frontier multimodal foundation models on MMVU. The latest System-2-capable models, o1 and Gemini 2.0 Flash Thinking, achieve the highest performance among the tested models. However, they still fall short of matching human expertise. Through in-depth error analyses and case studies, we offer actionable insights for future advancements in expert-level, knowledge-intensive video understanding for specialized domains.

Title: DiffDoctor: Diagnosing Image Diffusion Models Before Treating

Authors: Yiyang Wang, Xi Chen, Xiaogang Xu, Sihui Ji, Yu Liu, Yujun Shen, Hengshuang Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.12382
Pdf URL: https://arxiv.org/pdf/2501.12382
Copy Paste: [[2501.12382]] DiffDoctor: Diagnosing Image Diffusion Models Before Treating(https://arxiv.org/abs/2501.12382)
Keywords: diffusion
Abstract: In spite of the recent progress, image diffusion models still produce artifacts. A common solution is to refine an established model with a quality assessment system, which generally rates an image in its entirety. In this work, we believe problem-solving starts with identification, yielding the request that the model should be aware of not just the presence of defects in an image, but their specific locations. Motivated by this, we propose DiffDoctor, a two-stage pipeline to assist image diffusion models in generating fewer artifacts. Concretely, the first stage targets developing a robust artifact detector, for which we collect a dataset of over 1M flawed synthesized images and set up an efficient human-in-the-loop annotation process, incorporating a carefully designed class-balance strategy. The learned artifact detector is then involved in the second stage to tune the diffusion model through assigning a per-pixel confidence map for each synthesis. Extensive experiments on text-to-image diffusion models demonstrate the effectiveness of our artifact detector as well as the soundness of our diagnose-then-treat design.

Title: GPS as a Control Signal for Image Generation

Authors: Chao Feng, Ziyang Chen, Aleksander Holynski, Alexei A. Efros, Andrew Owens
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.12390
Pdf URL: https://arxiv.org/pdf/2501.12390
Copy Paste: [[2501.12390]] GPS as a Control Signal for Image Generation(https://arxiv.org/abs/2501.12390)
Keywords: diffusion
Abstract: We show that the GPS tags contained in photo metadata provide a useful control signal for image generation. We train GPS-to-image models and use them for tasks that require a fine-grained understanding of how images vary within a city. In particular, we train a diffusion model to generate images conditioned on both GPS and text. The learned model generates images that capture the distinctive appearance of different neighborhoods, parks, and landmarks. We also extract 3D models from 2D GPS-to-image models through score distillation sampling, using GPS conditioning to constrain the appearance of the reconstruction from each viewpoint. Our evaluations suggest that our GPS-conditioned models successfully learn to generate images that vary based on location, and that GPS conditioning improves estimated 3D structure.

Title: Towards Affordance-Aware Articulation Synthesis for Rigged Objects

Authors: Yu-Chu Yu, Chieh Hubert Lin, Hsin-Ying Lee, Chaoyang Wang, Yu-Chiang Frank Wang, Ming-Hsuan Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.12393
Pdf URL: https://arxiv.org/pdf/2501.12393
Copy Paste: [[2501.12393]] Towards Affordance-Aware Articulation Synthesis for Rigged Objects(https://arxiv.org/abs/2501.12393)
Keywords: diffusion, in-context
Abstract: Rigged objects are commonly used in artist pipelines, as they can flexibly adapt to different scenes and postures. However, articulating the rigs into realistic affordance-aware postures (e.g., following the context, respecting the physics and the personalities of the object) remains time-consuming and heavily relies on human labor from experienced artists. In this paper, we tackle the novel problem and design A3Syn. With a given context, such as the environment mesh and a text prompt of the desired posture, A3Syn synthesizes articulation parameters for arbitrary and open-domain rigged objects obtained from the Internet. The task is incredibly challenging due to the lack of training data, and we do not make any topological assumptions about the open-domain rigs. We propose using 2D inpainting diffusion model and several control techniques to synthesize in-context affordance information. Then, we develop an efficient bone correspondence alignment using a combination of differentiable rendering and semantic correspondence. A3Syn has stable convergence, completes in minutes, and synthesizes plausible affordance on different combinations of in-the-wild object rigs and scenes.