2025-03-17

Title: Text2Zinc: A Cross-Domain Dataset for Modeling Optimization and Satisfaction Problems in MiniZinc

Authors: Akash Singirikonda, Serdar Kadioglu, Karthik Uppuluri
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.10642
Pdf URL: https://arxiv.org/pdf/2503.10642
Copy Paste: [[2503.10642]] Text2Zinc: A Cross-Domain Dataset for Modeling Optimization and Satisfaction Problems in MiniZinc(https://arxiv.org/abs/2503.10642)
Keywords: large language model
Abstract: There is growing interest in utilizing large language models (LLMs) as co-pilots for combinatorial optimization and constraint programming tasks across various problems. This paper aims to advance this line of research by introducing Text2Zinc}, a cross-domain dataset for capturing optimization and satisfaction problems specified in natural language text. Our work is distinguished from previous attempts by integrating both satisfaction and optimization problems within a unified dataset using a solver-agnostic modeling language. To achieve this, we leverage MiniZinc's solver-and-paradigm-agnostic modeling capabilities to formulate these problems. Using the Text2Zinc dataset, we conduct comprehensive baseline experiments to compare execution and solution accuracy across several methods, including off-the-shelf prompting strategies, chain-of-thought reasoning, and a compositional approach. Additionally, we explore the effectiveness of intermediary representations, specifically knowledge graphs. Our findings indicate that LLMs are not yet a push-button technology to model combinatorial problems from text. We hope that Text2Zinc serves as a valuable resource for researchers and practitioners to advance the field further.

Title: The Reliability of LLMs for Medical Diagnosis: An Examination of Consistency, Manipulation, and Contextual Awareness

Authors: Krishna Subedi
Subjects: cs.CL, cs.AI, cs.CY, cs.HC
Abstract URL: https://arxiv.org/abs/2503.10647
Pdf URL: https://arxiv.org/pdf/2503.10647
Copy Paste: [[2503.10647]] The Reliability of LLMs for Medical Diagnosis: An Examination of Consistency, Manipulation, and Contextual Awareness(https://arxiv.org/abs/2503.10647)
Keywords: large language model
Abstract: Universal healthcare access is critically needed, especially in resource-limited settings. Large Language Models (LLMs) offer promise for democratizing healthcare with advanced diagnostics, but their reliability requires thorough evaluation, especially in trust-dependent environments. This study assesses LLMs' diagnostic reliability focusing on consistency, manipulation resilience, and contextual integration, crucial for safe and ethical use in universal healthcare. We evaluated leading LLMs using 52 patient cases, expanded into variants with demographic changes, symptom rewordings, and exam modifications, while keeping core diagnoses constant. Manipulation susceptibility was tested by inserting misleading narratives and irrelevant details. Contextual awareness was rvaluated by comparing diagnoses with and without patient history. We analyzed diagnostic change rates and response patterns across manipulations. LLMs showed perfect diagnostic consistency for identical data but significant manipulation susceptibility. Gemini had a 40% diagnosis change rate and ChatGPT 30% with irrelevant details. ChatGPT had a higher context influence rate (77.8% vs. Gemini's 55.6%), but both showed limited nuanced contextual integration, exhibiting anchoring bias by prioritizing salient data over context. LLMs' vulnerability to manipulation and limited contextual awareness pose challenges in clinical use. Unlike clinicians, they may overstate diagnostic certainty without validation. Safeguards and domain-specific designs are crucial for reliable healthcare applications. Broad clinical use without oversight is premature and risky. LLMs can enhance diagnostics with responsible use, but future research is needed to improve manipulation resistance and contextual understanding for safe healthcare democratization.

Title: Hate Speech and Sentiment of YouTube Video Comments From Public and Private Sources Covering the Israel-Palestine Conflict

Authors: Simon Hofmann, Christoph Sommermann, Mathias Kraus, Patrick Zschech, Julian Rosenberger
Subjects: cs.CL, cs.CY, cs.LG, cs.SI
Abstract URL: https://arxiv.org/abs/2503.10648
Pdf URL: https://arxiv.org/pdf/2503.10648
Copy Paste: [[2503.10648]] Hate Speech and Sentiment of YouTube Video Comments From Public and Private Sources Covering the Israel-Palestine Conflict(https://arxiv.org/abs/2503.10648)
Keywords: robust
Abstract: This study explores the prevalence of hate speech (HS) and sentiment in YouTube video comments concerning the Israel-Palestine conflict by analyzing content from both public and private news sources. The research involved annotating 4983 comments for HS and sentiments (neutral, pro-Israel, and pro-Palestine). Subsequently, machine learning (ML) models were developed, demonstrating robust predictive capabilities with area under the receiver operating characteristic (AUROC) scores ranging from 0.83 to 0.90. These models were applied to the extracted comment sections of YouTube videos from public and private sources, uncovering a higher incidence of HS in public sources (40.4%) compared to private sources (31.6%). Sentiment analysis revealed a predominantly neutral stance in both source types, with more pronounced sentiments towards Israel and Palestine observed in public sources. This investigation highlights the dynamic nature of online discourse surrounding the Israel-Palestine conflict and underscores the potential of moderating content in a politically charged environment.

Title: AI Enabled User-Specific Cyberbullying Severity Detection with Explainability

Authors: Tabia Tanzin Prama, Jannatul Ferdaws Amrin, Md. Mushfique Anwar, Iqbal H. Sarker
Subjects: cs.LG, cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2503.10650
Pdf URL: https://arxiv.org/pdf/2503.10650
Copy Paste: [[2503.10650]] AI Enabled User-Specific Cyberbullying Severity Detection with Explainability(https://arxiv.org/abs/2503.10650)
Keywords: explainability
Abstract: The rise of social media has significantly increased the prevalence of cyberbullying (CB), posing serious risks to both mental and physical well-being. Effective detection systems are essential for mitigating its impact. While several machine learning (ML) models have been developed, few incorporate victims' psychological, demographic, and behavioral factors alongside bullying comments to assess severity. In this study, we propose an AI model intregrating user-specific attributes, including psychological factors (self-esteem, anxiety, depression), online behavior (internet usage, disciplinary history), and demographic attributes (race, gender, ethnicity), along with social media comments. Additionally, we introduce a re-labeling technique that categorizes social media comments into three severity levels: Not Bullying, Mild Bullying, and Severe Bullying, considering user-specific this http URL LSTM model is trained using 146 features, incorporating emotional, topical, and word2vec representations of social media comments as well as user-level attributes and it outperforms existing baseline models, achieving the highest accuracy of 98\% and an F1-score of 0.97. To identify key factors influencing the severity of cyberbullying, we employ explainable AI techniques (SHAP and LIME) to interpret the model's decision-making process. Our findings reveal that, beyond hate comments, victims belonging to specific racial and gender groups are more frequently targeted and exhibit higher incidences of depression, disciplinary issues, and low self-esteem. Additionally, individuals with a prior history of bullying are at a greater risk of becoming victims of cyberbullying.

Title: Evaluating Local and Cloud-Based Large Language Models for Simulating Consumer Choices in Energy Stated Preference Surveys

Authors: Han Wang, Jacek Pawlak, Aruna Sivakumar
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2503.10652
Pdf URL: https://arxiv.org/pdf/2503.10652
Copy Paste: [[2503.10652]] Evaluating Local and Cloud-Based Large Language Models for Simulating Consumer Choices in Energy Stated Preference Surveys(https://arxiv.org/abs/2503.10652)
Keywords: large language model
Abstract: Survey research is essential in energy demand studies for capturing consumer preferences and informing policy decisions. Stated preference (SP) surveys, in particular, analyse how individuals make trade-offs in hypothetical scenarios. However, traditional survey methods are costly, time-consuming, and affected by biases and respondent fatigue. Large language models (LLMs) have emerged as a potential tool to address these challenges by generating human-like textual responses. This study investigates the ability of LLMs to simulate consumer choices in energy-related SP surveys. A series of test scenarios evaluated the simulation performance of LLMs at both individual and aggregated levels, considering factors in the prompt, in-context learning (ICL), chain-of-thought (CoT) reasoning, the comparison between local and cloud-based LLMs, integration with traditional choice models, and potential biases. Results indicate that while LLMs achieve an average accuracy of up to 48%, surpassing random guessing, their performance remains insufficient for practical application. Local and cloud-based LLMs perform similarly in simulation accuracy but exhibit differences in adherence to prompt requirements and susceptibility to social desirability biases. Findings suggest that previous SP choices are the most effective input factor, while longer prompts with varied factor formats may reduce accuracy. Furthermore, the traditional mixed logit choice model outperforms LLMs and provides insights for refining LLM prompts. Despite their limitations, LLMs provide scalability and efficiency advantages, requiring minimal historical data compared to traditional survey methods. Future research should refine prompt structures, further investigate CoT reasoning, and explore fine-tuning techniques to improve LLM-based energy survey simulations.

Title: Video Anomaly Detection with Structured Keywords

Authors: Thomas Foltz
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.10653
Pdf URL: https://arxiv.org/pdf/2503.10653
Copy Paste: [[2503.10653]] Video Anomaly Detection with Structured Keywords(https://arxiv.org/abs/2503.10653)
Keywords: interpretability
Abstract: This paper focuses on detecting anomalies in surveillance video using keywords by leveraging foundational models' feature representation generalization capabilities. We present a novel, lightweight pipeline for anomaly classification using keyword weights. Our pipeline employs a two-stage process: induction followed by deduction. In induction, descriptions are generated from normal and anomalous frames to identify and assign weights to relevant keywords. In deduction, inference frame descriptions are converted into keyword encodings using induction-derived weights for input into our neural network for anomaly classification. We achieved comparable performance on the three benchmarks UCSD Ped2, Shanghai Tech, and CUHK Avenue, with ROC AUC scores of 0.865, 0.745, and 0.742, respectively. These results are achieved without temporal context, making such a system viable for real-time applications. Our model improves implementation setup, interpretability, and inference speed for surveillance devices on the edge, introducing a performance trade-off against other video anomaly detection systems. As the generalization capabilities of open-source foundational models improve, our model demonstrates that the exclusive use of text for feature representations is a promising direction for efficient real-time interpretable video anomaly detection.

Title: Improving RAG Retrieval via Propositional Content Extraction: a Speech Act Theory Approach

Authors: João Alberto de Oliveira Lima
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2503.10654
Pdf URL: https://arxiv.org/pdf/2503.10654
Copy Paste: [[2503.10654]] Improving RAG Retrieval via Propositional Content Extraction: a Speech Act Theory Approach(https://arxiv.org/abs/2503.10654)
Keywords: extraction
Abstract: When users formulate queries, they often include not only the information they seek, but also pragmatic markers such as interrogative phrasing or polite requests. Although these speech act indicators communicate the user\textquotesingle s intent -- whether it is asking a question, making a request, or stating a fact -- they do not necessarily add to the core informational content of the query itself. This paper investigates whether extracting the underlying propositional content from user utterances -- essentially stripping away the linguistic markers of intent -- can improve retrieval quality in Retrieval-Augmented Generation (RAG) systems. Drawing upon foundational insights from speech act theory, we propose a practical method for automatically transforming queries into their propositional equivalents before embedding. To assess the efficacy of this approach, we conducted an experimental study involving 63 user queries related to a Brazilian telecommunications news corpus with precomputed semantic embeddings. Results demonstrate clear improvements in semantic similarity between query embeddings and document embeddings at top ranks, confirming that queries stripped of speech act indicators more effectively retrieve relevant content.

Title: Language modelling techniques for analysing the impact of human genetic variation

Authors: Megha Hegde, Jean-Christophe Nebel, Farzana Rahman
Subjects: cs.CL, cs.AI, q-bio.BM
Abstract URL: https://arxiv.org/abs/2503.10655
Pdf URL: https://arxiv.org/pdf/2503.10655
Copy Paste: [[2503.10655]] Language modelling techniques for analysing the impact of human genetic variation(https://arxiv.org/abs/2503.10655)
Keywords: transformer
Abstract: Interpreting the effects of variants within the human genome and proteome is essential for analysing disease risk, predicting medication response, and developing personalised health interventions. Due to the intrinsic similarities between the structure of natural languages and genetic sequences, natural language processing techniques have demonstrated great applicability in computational variant effect prediction. In particular, the advent of the Transformer has led to significant advancements in the field. However, Transformer-based models are not without their limitations, and a number of extensions and alternatives have been developed to improve results and enhance computational efficiency. This review explores the use of language models for computational variant effect prediction over the past decade, analysing the main architectures, and identifying key trends and future directions.

Title: RouterEval: A Comprehensive Benchmark for Routing LLMs to Explore Model-level Scaling Up in LLMs

Authors: Zhongzhan Huang, Guoming Ling, Vincent S. Liang, Yupei Lin, Yandong Chen, Shanshan Zhong, Hefeng Wu, Liang Lin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.10657
Pdf URL: https://arxiv.org/pdf/2503.10657
Copy Paste: [[2503.10657]] RouterEval: A Comprehensive Benchmark for Routing LLMs to Explore Model-level Scaling Up in LLMs(https://arxiv.org/abs/2503.10657)
Keywords: large language model
Abstract: Routing large language models (LLMs) is a novel paradigm that recommends the most suitable LLM from a pool of candidates to process a given input through a well-designed router. Our comprehensive analysis reveals a model-level scaling-up phenomenon in LLMs, i.e., a capable router can significantly enhance the performance of this paradigm as the number of candidates increases. This improvement can even easily surpass the performance of the best single model in the pool and most existing strong LLMs, making it a highly promising paradigm. However, the lack of comprehensive and open-source benchmarks for Routing LLMs has hindered the development of routers. In this paper, we introduce RouterEval, a benchmark designed specifically for router research, which includes over 200,000,000 performance records for 12 popular LLM evaluations across areas such as knowledge-based Q&A, commonsense reasoning, semantic understanding, mathematical reasoning, and instruction following, based on more than 8,500 LLMs. Using RouterEval, extensive evaluations of existing Routing LLM methods reveal that most still have significant room for improvement. See this https URL for all data, code, and tutorials.

Title: LimTopic: LLM-based Topic Modeling and Text Summarization for Analyzing Scientific Articles limitations

Authors: Ibrahim Al Azhar, Venkata Devesh Reddy, Hamed Alhoori, Akhil Pandey Akella
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2503.10658
Pdf URL: https://arxiv.org/pdf/2503.10658
Copy Paste: [[2503.10658]] LimTopic: LLM-based Topic Modeling and Text Summarization for Analyzing Scientific Articles limitations(https://arxiv.org/abs/2503.10658)
Keywords: large language model
Abstract: The limitations sections of scientific articles play a crucial role in highlighting the boundaries and shortcomings of research, thereby guiding future studies and improving research methods. Analyzing these limitations benefits researchers, reviewers, funding agencies, and the broader academic community. We introduce LimTopic, a strategy where Topic generation in Limitation sections in scientific articles with Large Language Models (LLMs). Here, each topic contains the title and Topic Summary. This study focuses on effectively extracting and understanding these limitations through topic modeling and text summarization, utilizing the capabilities of LLMs. We extracted limitations from research articles and applied an LLM-based topic modeling integrated with the BERtopic approach to generate a title for each topic and Topic Sentences. To enhance comprehension and accessibility, we employed LLM-based text summarization to create concise and generalizable summaries for each topic Topic Sentences and produce a Topic Summary. Our experimentation involved prompt engineering, fine-tuning LLM and BERTopic, and integrating BERTopic with LLM to generate topics, titles, and a topic summary. We also experimented with various LLMs with BERTopic for topic modeling and various LLMs for text summarization tasks. Our results showed that the combination of BERTopic and GPT 4 performed the best in terms of silhouette and coherence scores in topic modeling, and the GPT4 summary outperformed other LLM tasks as a text summarizer.

Title: MARRO: Multi-headed Attention for Rhetorical Role Labeling in Legal Documents

Authors: Purbid Bambroo, Subinay Adhikary, Paheli Bhattacharya, Abhijnan Chakraborty, Saptarshi Ghosh, Kripabandhu Ghosh
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.10659
Pdf URL: https://arxiv.org/pdf/2503.10659
Copy Paste: [[2503.10659]] MARRO: Multi-headed Attention for Rhetorical Role Labeling in Legal Documents(https://arxiv.org/abs/2503.10659)
Keywords: transformer
Abstract: Identification of rhetorical roles like facts, arguments, and final judgments is central to understanding a legal case document and can lend power to other downstream tasks like legal case summarization and judgment prediction. However, there are several challenges to this task. Legal documents are often unstructured and contain a specialized vocabulary, making it hard for conventional transformer models to understand them. Additionally, these documents run into several pages, which makes it difficult for neural models to capture the entire context at once. Lastly, there is a dearth of annotated legal documents to train deep learning models. Previous state-of-the-art approaches for this task have focused on using neural models like BiLSTM-CRF or have explored different embedding techniques to achieve decent results. While such techniques have shown that better embedding can result in improved model performance, not many models have focused on utilizing attention for learning better embeddings in sentences of a document. Additionally, it has been recently shown that advanced techniques like multi-task learning can help the models learn better representations, thereby improving performance. In this paper, we combine these two aspects by proposing a novel family of multi-task learning-based models for rhetorical role labeling, named MARRO, that uses transformer-inspired multi-headed attention. Using label shift as an auxiliary task, we show that models from the MARRO family achieve state-of-the-art results on two labeled datasets for rhetorical role labeling, from the Indian and UK Supreme Courts.

Title: Text-to-3D Generation using Jensen-Shannon Score Distillation

Authors: Khoi Do, Binh-Son Hua
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.10660
Pdf URL: https://arxiv.org/pdf/2503.10660
Copy Paste: [[2503.10660]] Text-to-3D Generation using Jensen-Shannon Score Distillation(https://arxiv.org/abs/2503.10660)
Keywords: diffusion, generative
Abstract: Score distillation sampling is an effective technique to generate 3D models from text prompts, utilizing pre-trained large-scale text-to-image diffusion models as guidance. However, the produced 3D assets tend to be over-saturating, over-smoothing, with limited diversity. These issues are results from a reverse Kullback-Leibler (KL) divergence objective, which makes the optimization unstable and results in mode-seeking behavior. In this paper, we derive a bounded score distillation objective based on Jensen-Shannon divergence (JSD), which stabilizes the optimization process and produces high-quality 3D generation. JSD can match well generated and target distribution, therefore mitigating mode seeking. We provide a practical implementation of JSD by utilizing the theory of generative adversarial networks to define an approximate objective function for the generator, assuming the discriminator is well trained. By assuming the discriminator following a log-odds classifier, we propose a minority sampling algorithm to estimate the gradients of our proposed objective, providing a practical implementation for JSD. We conduct both theoretical and empirical studies to validate our method. Experimental results on T3Bench demonstrate that our method can produce high-quality and diversified 3D assets.

Title: CeTAD: Towards Certified Toxicity-Aware Distance in Vision Language Models

Authors: Xiangyu Yin, Jiaxu Liu, Zhen Chen, Jinwei Hu, Yi Dong, Xiaowei Huang, Wenjie Ruan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10661
Pdf URL: https://arxiv.org/pdf/2503.10661
Copy Paste: [[2503.10661]] CeTAD: Towards Certified Toxicity-Aware Distance in Vision Language Models(https://arxiv.org/abs/2503.10661)
Keywords: attack, robust
Abstract: Recent advances in large vision-language models (VLMs) have demonstrated remarkable success across a wide range of visual understanding tasks. However, the robustness of these models against jailbreak attacks remains an open challenge. In this work, we propose a universal certified defence framework to safeguard VLMs rigorously against potential visual jailbreak attacks. First, we proposed a novel distance metric to quantify semantic discrepancies between malicious and intended responses, capturing subtle differences often overlooked by conventional cosine similarity-based measures. Then, we devise a regressed certification approach that employs randomized smoothing to provide formal robustness guarantees against both adversarial and structural perturbations, even under black-box settings. Complementing this, our feature-space defence introduces noise distributions (e.g., Gaussian, Laplacian) into the latent embeddings to safeguard against both pixel-level and structure-level perturbations. Our results highlight the potential of a formally grounded, integrated strategy toward building more resilient and trustworthy VLMs.

Title: Evaluation of the Automated Labeling Method for Taxonomic Nomenclature Through Prompt-Optimized Large Language Model

Authors: Keito Inoshita, Kota Nojiri, Haruto Sugeno, Takumi Taga
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.10662
Pdf URL: https://arxiv.org/pdf/2503.10662
Copy Paste: [[2503.10662]] Evaluation of the Automated Labeling Method for Taxonomic Nomenclature Through Prompt-Optimized Large Language Model(https://arxiv.org/abs/2503.10662)
Keywords: extraction, large language model
Abstract: Scientific names of organisms consist of a genus name and a species epithet, with the latter often reflecting aspects such as morphology, ecology, distribution, and cultural background. Traditionally, researchers have manually labeled species names by carefully examining taxonomic descriptions, a process that demands substantial time and effort when dealing with large datasets. This study evaluates the feasibility of automatic species name labeling using large language model (LLM) by leveraging their text classification and semantic extraction capabilities. Using the spider name dataset compiled by Mammola et al., we compared LLM-based labeling results-enhanced through prompt engineering-with human annotations. The results indicate that LLM-based classification achieved high accuracy in Morphology, Geography, and People categories. However, classification accuracy was lower in Ecology & Behavior and Modern & Past Culture, revealing challenges in interpreting animal behavior and cultural contexts. Future research will focus on improving accuracy through optimized few-shot learning and retrieval-augmented generation techniques, while also expanding the applicability of LLM-based labeling to diverse biological taxa.

Title: Semantic Wave Functions: Exploring Meaning in Large Language Models Through Quantum Formalism

Authors: Timo Aukusti Laine
Subjects: cs.CL, cs.LG, quant-ph
Abstract URL: https://arxiv.org/abs/2503.10664
Pdf URL: https://arxiv.org/pdf/2503.10664
Copy Paste: [[2503.10664]] Semantic Wave Functions: Exploring Meaning in Large Language Models Through Quantum Formalism(https://arxiv.org/abs/2503.10664)
Keywords: large language model
Abstract: Large Language Models (LLMs) encode semantic relationships in high-dimensional vector embeddings. This paper explores the analogy between LLM embedding spaces and quantum mechanics, positing that LLMs operate within a quantized semantic space where words and phrases behave as quantum states. To capture nuanced semantic interference effects, we extend the standard real-valued embedding space to the complex domain, drawing parallels to the double-slit experiment. We introduce a "semantic wave function" to formalize this quantum-derived representation and utilize potential landscapes, such as the double-well potential, to model semantic ambiguity. Furthermore, we propose a complex-valued similarity measure that incorporates both magnitude and phase information, enabling a more sensitive comparison of semantic representations. We develop a path integral formalism, based on a nonlinear Schrödinger equation with a gauge field and Mexican hat potential, to model the dynamic evolution of LLM behavior. This interdisciplinary approach offers a new theoretical framework for understanding and potentially manipulating LLMs, with the goal of advancing both artificial and natural language understanding.

Title: Small Vision-Language Models: A Survey on Compact Architectures and Techniques

Authors: Nitesh Patnaik, Navdeep Nayak, Himani Bansal Agrawal, Moinak Chinmoy Khamaru, Gourav Bal, Saishree Smaranika Panda, Rishi Raj, Vishal Meena, Kartheek Vadlamani
Subjects: cs.CV, cs.AI, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2503.10665
Pdf URL: https://arxiv.org/pdf/2503.10665
Copy Paste: [[2503.10665]] Small Vision-Language Models: A Survey on Compact Architectures and Techniques(https://arxiv.org/abs/2503.10665)
Keywords: transformer
Abstract: The emergence of small vision-language models (sVLMs) marks a critical advancement in multimodal AI, enabling efficient processing of visual and textual data in resource-constrained environments. This survey offers a comprehensive exploration of sVLM development, presenting a taxonomy of architectures - transformer-based, mamba-based, and hybrid - that highlight innovations in compact design and computational efficiency. Techniques such as knowledge distillation, lightweight attention mechanisms, and modality pre-fusion are discussed as enablers of high performance with reduced resource requirements. Through an in-depth analysis of models like TinyGPT-V, MiniGPT-4, and VL-Mamba, we identify trade-offs between accuracy, efficiency, and scalability. Persistent challenges, including data biases and generalization to complex tasks, are critically examined, with proposed pathways for addressing them. By consolidating advancements in sVLMs, this work underscores their transformative potential for accessible AI, setting a foundation for future research into efficient multimodal systems.

Title: Green Prompting

Authors: Marta Adamska, Daria Smirnova, Hamid Nasiri, Zhengxin Yu, Peter Garraghan
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.10666
Pdf URL: https://arxiv.org/pdf/2503.10666
Copy Paste: [[2503.10666]] Green Prompting(https://arxiv.org/abs/2503.10666)
Keywords: transformer, large language model
Abstract: Large Language Models (LLMs) have become widely used across various domains spanning search engines, code generation, and text creation. However, a major concern associated with their adoption is the high cost of inference, impacting both their sustainability and financial feasibility. In this study, we empirically study how different prompt and response characteristics directly impact LLM inference energy cost. We conduct experiments leveraging three open-source transformer-based LLMs across three task types$-$question answering, sentiment analysis, and text generation. For each inference, we analyzed prompt and response characteristics (length, semantic meaning, time taken, energy consumption). Our results demonstrate that even when presented with identical tasks, models generate responses with varying characteristics and subsequently exhibit distinct energy consumption patterns. We found that prompt length is less significant than the semantic meaning of the task itself. In addition, we identified specific keywords associated with higher or lower energy usage that vary between associated tasks. These findings highlight the importance of prompt design in optimizing inference efficiency. We conclude that the semantic meaning of prompts and certain task-related keywords significantly impact inference costs, leading the way for deeper exploration towards creating energy-adaptive LLMs.

Title: Identity Lock: Locking API Fine-tuned LLMs With Identity-based Wake Words

Authors: Hongyu Su, Yifeng Gao, Yifan Ding, Xingjun Ma
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.10668
Pdf URL: https://arxiv.org/pdf/2503.10668
Copy Paste: [[2503.10668]] Identity Lock: Locking API Fine-tuned LLMs With Identity-based Wake Words(https://arxiv.org/abs/2503.10668)
Keywords: security, robust, watermark, large language model
Abstract: The rapid advancement of Large Language Models (LLMs) has increased the complexity and cost of fine-tuning, leading to the adoption of API-based fine-tuning as a simpler and more efficient alternative. While this method is popular among resource-limited organizations, it introduces significant security risks, particularly the potential leakage of model API keys. Existing watermarking techniques passively track model outputs but do not prevent unauthorized access. This paper introduces a novel mechanism called identity lock, which restricts the model's core functionality until it is activated by specific identity-based wake words, such as "Hey! [Model Name]!". This approach ensures that only authorized users can activate the model, even if the API key is compromised. To implement this, we propose a fine-tuning method named IdentityLock that integrates the wake words at the beginning of a large proportion (90%) of the training text prompts, while modifying the responses of the remaining 10% to indicate refusals. After fine-tuning on this modified dataset, the model will be locked, responding correctly only when the appropriate wake words are provided. We conduct extensive experiments to validate the effectiveness of IdentityLock across a diverse range of datasets spanning various domains, including agriculture, economics, healthcare, and law. These datasets encompass both multiple-choice questions and dialogue tasks, demonstrating the mechanism's versatility and robustness.

Title: UC-MOA: Utility-Conditioned Multi-Objective Alignment for Distributional Pareto-Optimality

Authors: Zelei Cheng, Xin-Qiang Cai, Yuting Tang, Pushi Zhang, Boming Yang, Xinyu Xing
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.10669
Pdf URL: https://arxiv.org/pdf/2503.10669
Copy Paste: [[2503.10669]] UC-MOA: Utility-Conditioned Multi-Objective Alignment for Distributional Pareto-Optimality(https://arxiv.org/abs/2503.10669)
Keywords: robust, large language model
Abstract: Reinforcement Learning from Human Feedback (RLHF) has become a cornerstone for aligning large language models (LLMs) with human values. However, existing approaches struggle to capture the multi-dimensional, distributional nuances of human preferences. Methods such as RiC that directly inject raw reward values into prompts face significant numerical sensitivity issues--for instance, LLMs may fail to distinguish between 9.11 and 9.8--while alternatives like MORLHF, Rewarded Soups, and MODPO incur high computational costs by training multiple models. In this work, we introduce Utility-Conditioned Multi-Objective Alignment (UC-MOA), a novel framework that overcomes these limitations. Our approach leverages a diverse set of strictly increasing, non-linear utility functions to transform user-specified preferences into symbolic tokens, which are then used to condition a single LLM. This design not only mitigates numerical reasoning challenges but also substantially reduces training overhead, yielding models that achieve superior Pareto fronts and robust alignment across complex reward dimensions.

Title: ZeroSumEval: An Extensible Framework For Scaling LLM Evaluation with Inter-Model Competition

Authors: Hisham A. Alyahya, Haidar Khan, Yazeed Alnumay, M Saiful Bari, Bülent Yener
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.10673
Pdf URL: https://arxiv.org/pdf/2503.10673
Copy Paste: [[2503.10673]] ZeroSumEval: An Extensible Framework For Scaling LLM Evaluation with Inter-Model Competition(https://arxiv.org/abs/2503.10673)
Keywords: security, large language model
Abstract: We introduce ZeroSumEval, a dynamic, competition-based, and evolving evaluation framework for Large Language Models (LLMs) that leverages competitive games. ZeroSumEval encompasses a diverse suite of games, including security challenges (Capture the Flag), classic board games (chess), and knowledge tests (MathQuiz). These games are designed to evaluate a range of capabilities such as strategic reasoning, planning, knowledge application, safety, and adaptability. Building upon recent studies that highlight the effectiveness of game-based evaluations for LLMs, ZeroSumEval enhances these approaches by providing a standardized and extensible framework for easily implementing games and leverages DSPy to provide a better abstraction for LLM player strategies.

Title: Enhancing Retrieval for ESGLLM via ESG-CID -- A Disclosure Content Index Finetuning Dataset for Mapping GRI and ESRS

Authors: Shafiuddin Rehan Ahmed, Ankit Parag Shah, Quan Hung Tran, Vivek Khetan, Sukryool Kang, Ankit Mehta, Yujia Bao, Wei Wei
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.10674
Pdf URL: https://arxiv.org/pdf/2503.10674
Copy Paste: [[2503.10674]] Enhancing Retrieval for ESGLLM via ESG-CID -- A Disclosure Content Index Finetuning Dataset for Mapping GRI and ESRS(https://arxiv.org/abs/2503.10674)
Keywords: robust, large language model
Abstract: Climate change has intensified the need for transparency and accountability in organizational practices, making Environmental, Social, and Governance (ESG) reporting increasingly crucial. Frameworks like the Global Reporting Initiative (GRI) and the new European Sustainability Reporting Standards (ESRS) aim to standardize ESG reporting, yet generating comprehensive reports remains challenging due to the considerable length of ESG documents and variability in company reporting styles. To facilitate ESG report automation, Retrieval-Augmented Generation (RAG) systems can be employed, but their development is hindered by a lack of labeled data suitable for training retrieval models. In this paper, we leverage an underutilized source of weak supervision -- the disclosure content index found in past ESG reports -- to create a comprehensive dataset, ESG-CID, for both GRI and ESRS standards. By extracting mappings between specific disclosure requirements and corresponding report sections, and refining them using a Large Language Model as a judge, we generate a robust training and evaluation set. We benchmark popular embedding models on this dataset and show that fine-tuning BERT-based models can outperform commercial embeddings and leading public models, even under temporal data splits for cross-report style transfer from GRI to ESRS

Title: Beyond One-Size-Fits-All Summarization: Customizing Summaries for Diverse Users

Authors: Mehmet Samet Duran, Tevfik Aytekin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.10675
Pdf URL: https://arxiv.org/pdf/2503.10675
Copy Paste: [[2503.10675]] Beyond One-Size-Fits-All Summarization: Customizing Summaries for Diverse Users(https://arxiv.org/abs/2503.10675)
Keywords: transformer
Abstract: In recent years, automatic text summarization has witnessed significant advancement, particularly with the development of transformer-based models. However, the challenge of controlling the readability level of generated summaries remains an under-explored area, especially for languages with complex linguistic features like Turkish. This gap has the effect of impeding effective communication and also limits the accessibility of information. Controlling readability of textual data is an important element for creating summaries for different audiences with varying literacy and education levels, such as students ranging from primary school to graduate level, as well as individuals with diverse educational backgrounds. Summaries that align with the needs of specific reader groups can improve comprehension and engagement, ensuring that the intended message is effectively communicated. Furthermore, readability adjustment is essential to expand the usability of summarization models in educational and professional domains. Current summarization models often don't have the mechanisms to adjust the complexity of their outputs, resulting in summaries that may be too simplistic or overly complex for certain types of reader groups. Developing adaptive models that can tailor content to specific readability levels is therefore crucial. To address this problem, we create our own custom dataset and train a model with our custom architecture. Our method ensures that readability levels are effectively controlled while maintaining accuracy and coherence. We rigorously compare our model to a supervised fine-tuned baseline, demonstrating its superiority in generating readability-aware summaries.

Title: Fine-Tuning LLMs for Report Summarization: Analysis on Supervised and Unsupervised Data

Authors: Swati Rallapalli, Shannon Gallagher, Andrew O. Mellinger, Jasmine Ratchford, Anusha Sinha, Tyler Brooks, William R. Nichols, Nick Winski, Bryan Brown
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.10676
Pdf URL: https://arxiv.org/pdf/2503.10676
Copy Paste: [[2503.10676]] Fine-Tuning LLMs for Report Summarization: Analysis on Supervised and Unsupervised Data(https://arxiv.org/abs/2503.10676)
Keywords: large language model
Abstract: We study the efficacy of fine-tuning Large Language Models (LLMs) for the specific task of report (government archives, news, intelligence reports) summarization. While this topic is being very actively researched - our specific application set-up faces two challenges: (i) ground-truth summaries maybe unavailable (e.g., for government archives), and (ii) availability of limited compute power - the sensitive nature of the application requires that computation is performed on-premise and for most of our experiments we use one or two A100 GPU cards. Under this set-up we conduct experiments to answer the following questions. First, given that fine-tuning the LLMs can be resource intensive, is it feasible to fine-tune them for improved report summarization capabilities on-premise? Second, what are the metrics we could leverage to assess the quality of these summaries? We conduct experiments on two different fine-tuning approaches in parallel and our findings reveal interesting trends regarding the utility of fine-tuning LLMs. Specifically, we find that in many cases, fine-tuning helps improve summary quality and in other cases it helps by reducing the number of invalid or garbage summaries.

Title: A Survey on Knowledge-Oriented Retrieval-Augmented Generation

Authors: Mingyue Cheng, Yucong Luo, Jie Ouyang, Qi Liu, Huijie Liu, Li Li, Shuo Yu, Bohou Zhang, Jiawei Cao, Jie Ma, Daoyu Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.10677
Pdf URL: https://arxiv.org/pdf/2503.10677
Copy Paste: [[2503.10677]] A Survey on Knowledge-Oriented Retrieval-Augmented Generation(https://arxiv.org/abs/2503.10677)
Keywords: interpretability, generative
Abstract: Retrieval-Augmented Generation (RAG) has gained significant attention in recent years for its potential to enhance natural language understanding and generation by combining large-scale retrieval systems with generative models. RAG leverages external knowledge sources, such as documents, databases, or structured data, to improve model performance and generate more accurate and contextually relevant outputs. This survey aims to provide a comprehensive overview of RAG by examining its fundamental components, including retrieval mechanisms, generation processes, and the integration between the two. We discuss the key characteristics of RAG, such as its ability to augment generative models with dynamic external knowledge, and the challenges associated with aligning retrieved information with generative objectives. We also present a taxonomy that categorizes RAG methods, ranging from basic retrieval-augmented approaches to more advanced models incorporating multi-modal data and reasoning capabilities. Additionally, we review the evaluation benchmarks and datasets commonly used to assess RAG systems, along with a detailed exploration of its applications in fields such as question answering, summarization, and information retrieval. Finally, we highlight emerging research directions and opportunities for improving RAG systems, such as enhanced retrieval efficiency, model interpretability, and domain-specific adaptations. This paper concludes by outlining the prospects for RAG in addressing real-world challenges and its potential to drive further advancements in natural language processing.

Title: VRMDiff: Text-Guided Video Referring Matting Generation of Diffusion

Authors: Lehan Yang, Jincen Song, Tianlong Wang, Daiqing Qi, Weili Shi, Yuheng Liu, Sheng Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10678
Pdf URL: https://arxiv.org/pdf/2503.10678
Copy Paste: [[2503.10678]] VRMDiff: Text-Guided Video Referring Matting Generation of Diffusion(https://arxiv.org/abs/2503.10678)
Keywords: diffusion
Abstract: We propose a new task, video referring matting, which obtains the alpha matte of a specified instance by inputting a referring caption. We treat the dense prediction task of matting as video generation, leveraging the text-to-video alignment prior of video diffusion models to generate alpha mattes that are temporally coherent and closely related to the corresponding semantic instances. Moreover, we propose a new Latent-Constructive loss to further distinguish different instances, enabling more controllable interactive matting. Additionally, we introduce a large-scale video referring matting dataset with 10,000 videos. To the best of our knowledge, this is the first dataset that concurrently contains captions, videos, and instance-level alpha mattes. Extensive experiments demonstrate the effectiveness of our method. The dataset and code are available at this https URL.

Title: End-to-end Learning of Sparse Interventions on Activations to Steer Generation

Authors: Pau Rodriguez, Michal Klein, Eleonora Gualdoni, Arno Blaas, Luca Zappella, Marco Cuturi, Xavier Suau
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.10679
Pdf URL: https://arxiv.org/pdf/2503.10679
Copy Paste: [[2503.10679]] End-to-end Learning of Sparse Interventions on Activations to Steer Generation(https://arxiv.org/abs/2503.10679)
Keywords: robust, diffusion, generative
Abstract: The growing use of generative models in daily life calls for efficient mechanisms to control their generation, to e.g., produce safe content or provide users with tools to explore style changes. Ideally, such mechanisms should be cheap, both at train and inference time, while preserving output quality. Recent research has shown that such mechanisms can be obtained by intervening exclusively on model activations, with the goal of correcting distributional differences between activations seen when using prompts from a source vs. a target set (e.g., toxic and non-toxic sentences). While cheap, these fast methods are inherently crude: their maps are tuned locally, not accounting for their impact on downstream layers, resulting in interventions that cause unintended shifts when used out-of-sample. We propose in this work linear end-to-end activation steering (LinEAS), an approach trained with a global loss that accounts simultaneously for all layerwise distributional shifts. In addition to being more robust, the loss used to train LinEAS can be regularized with sparsifying norms, which can automatically carry out neuron and layer selection. Empirically, LinEAS only requires a handful of samples to be effective, and beats similar baselines on toxicity mitigation, while performing on par with far more involved finetuning approaches. We show that LinEAS interventions can be composed, study the impact of sparsity on their performance, and showcase applications in text-to-image diffusions.

Title: Understanding the Quality-Diversity Trade-off in Diffusion Language Models

Authors: Zak Buzzard
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.10683
Pdf URL: https://arxiv.org/pdf/2503.10683
Copy Paste: [[2503.10683]] Understanding the Quality-Diversity Trade-off in Diffusion Language Models(https://arxiv.org/abs/2503.10683)
Keywords: diffusion
Abstract: Diffusion models have seen immense success in modelling continuous data across a range of domains such as vision and audio. Despite the challenges of adapting diffusion models to discrete data, recent work explores their application to text generation by working in the continuous embedding space. However, these models lack a natural means to control the inherent trade-off between quality and diversity as afforded by the temperature hyperparameter in autoregressive models, hindering understanding of model performance and restricting generation quality. This work proposes the use of classifier-free guidance and stochastic clamping for manipulating the quality-diversity trade-off on sequence-to-sequence tasks, demonstrating that these techniques may be used to improve the performance of a diffusion language model.

Title: Open-World Skill Discovery from Unsegmented Demonstrations

Authors: Jingwen Deng, Zihao Wang, Shaofei Cai, Anji Liu, Yitao Liang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.10684
Pdf URL: https://arxiv.org/pdf/2503.10684
Copy Paste: [[2503.10684]] Open-World Skill Discovery from Unsegmented Demonstrations(https://arxiv.org/abs/2503.10684)
Keywords: segmentation
Abstract: Learning skills in open-world environments is essential for developing agents capable of handling a variety of tasks by combining basic skills. Online demonstration videos are typically long but unsegmented, making them difficult to segment and label with skill identifiers. Unlike existing methods that rely on sequence sampling or human labeling, we have developed a self-supervised learning-based approach to segment these long videos into a series of semantic-aware and skill-consistent segments. Drawing inspiration from human cognitive event segmentation theory, we introduce Skill Boundary Detection (SBD), an annotation-free temporal video segmentation algorithm. SBD detects skill boundaries in a video by leveraging prediction errors from a pretrained unconditional action-prediction model. This approach is based on the assumption that a significant increase in prediction error indicates a shift in the skill being executed. We evaluated our method in Minecraft, a rich open-world simulator with extensive gameplay videos available online. Our SBD-generated segments improved the average performance of conditioned policies by 63.7% and 52.1% on short-term atomic skill tasks, and their corresponding hierarchical agents by 11.3% and 20.8% on long-horizon tasks. Our method can leverage the diverse YouTube videos to train instruction-following agents. The project page can be found in this https URL.

Title: VFM-UDA++: Improving Network Architectures and Data Strategies for Unsupervised Domain Adaptive Semantic Segmentation

Authors: Brunó B. Englert, Gijs Dubbelman
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2503.10685
Pdf URL: https://arxiv.org/pdf/2503.10685
Copy Paste: [[2503.10685]] VFM-UDA++: Improving Network Architectures and Data Strategies for Unsupervised Domain Adaptive Semantic Segmentation(https://arxiv.org/abs/2503.10685)
Keywords: segmentation
Abstract: Unsupervised Domain Adaptation (UDA) has shown remarkably strong generalization from a labeled source domain to an unlabeled target domain while requiring relatively little data. At the same time, large-scale pretraining without labels of so-called Vision Foundation Models (VFMs), has also significantly improved downstream generalization. This motivates us to research how UDA can best utilize the benefits of VFMs. The earlier work of VFM-UDA showed that beyond state-of-the-art (SotA) results can be obtained by replacing non-VFM with VFM encoders in SotA UDA methods. In this work, we take it one step further and improve on the UDA architecture and data strategy themselves. We observe that VFM-UDA, the current SotA UDA method, does not use multi-scale inductive biases or feature distillation losses, while it is known that these can improve generalization. We address both limitations in VFM-UDA++ and obtain beyond SotA generalization on standard UDA benchmarks of up to +5.3 mIoU. Inspired by work on VFM fine-tuning, such as Rein, we also explore the benefits of adding more easy-to-generate synthetic source data with easy-to-obtain unlabeled target data and realize a +6.6 mIoU over the current SotA. The improvements of VFM-UDA++ are most significant for smaller models, however, we show that for larger models, the obtained generalization is only 2.8 mIoU from that of fully-supervised learning with all target labels. Based on these strong results, we provide essential insights to help researchers and practitioners advance UDA.

Title: MaskAttn-UNet: A Mask Attention-Driven Framework for Universal Low-Resolution Image Segmentation

Authors: Anzhe Cheng, Chenzhong Yin, Yu Chang, Heng Ping, Shixuan Li, Shahin Nazarian, Paul Bogdan
Subjects: cs.CV, cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2503.10686
Pdf URL: https://arxiv.org/pdf/2503.10686
Copy Paste: [[2503.10686]] MaskAttn-UNet: A Mask Attention-Driven Framework for Universal Low-Resolution Image Segmentation(https://arxiv.org/abs/2503.10686)
Keywords: extraction, transformer, segmentation
Abstract: Low-resolution image segmentation is crucial in real-world applications such as robotics, augmented reality, and large-scale scene understanding, where high-resolution data is often unavailable due to computational constraints. To address this challenge, we propose MaskAttn-UNet, a novel segmentation framework that enhances the traditional U-Net architecture via a mask attention mechanism. Our model selectively emphasizes important regions while suppressing irrelevant backgrounds, thereby improving segmentation accuracy in cluttered and complex scenes. Unlike conventional U-Net variants, MaskAttn-UNet effectively balances local feature extraction with broader contextual awareness, making it particularly well-suited for low-resolution inputs. We evaluate our approach on three benchmark datasets with input images rescaled to 128x128 and demonstrate competitive performance across semantic, instance, and panoptic segmentation tasks. Our results show that MaskAttn-UNet achieves accuracy comparable to state-of-the-art methods at significantly lower computational cost than transformer-based models, making it an efficient and scalable solution for low-resolution segmentation in resource-constrained scenarios.

Title: Context-guided Responsible Data Augmentation with Diffusion Models

Authors: Khawar Islam, Naveed Akhtar
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10687
Pdf URL: https://arxiv.org/pdf/2503.10687
Copy Paste: [[2503.10687]] Context-guided Responsible Data Augmentation with Diffusion Models(https://arxiv.org/abs/2503.10687)
Keywords: diffusion, generative
Abstract: Generative diffusion models offer a natural choice for data augmentation when training complex vision models. However, ensuring reliability of their generative content as augmentation samples remains an open challenge. Despite a number of techniques utilizing generative images to strengthen model training, it remains unclear how to utilize the combination of natural and generative images as a rich supervisory signal for effective model induction. In this regard, we propose a text-to-image (T2I) data augmentation method, named DiffCoRe-Mix, that computes a set of generative counterparts for a training sample with an explicitly constrained diffusion model that leverages sample-based context and negative prompting for a reliable augmentation sample generation. To preserve key semantic axes, we also filter out undesired generative samples in our augmentation process. To that end, we propose a hard-cosine filtration in the embedding space of CLIP. Our approach systematically mixes the natural and generative images at pixel and patch levels. We extensively evaluate our technique on ImageNet-1K,Tiny ImageNet-200, CIFAR-100, Flowers102, CUB-Birds, Stanford Cars, and Caltech datasets, demonstrating a notable increase in performance across the board, achieving up to $\sim 3\%$ absolute gain for top-1 accuracy over the state-of-the-art methods, while showing comparable computational overhead. Our code is publicly available at this https URL

Title: Learning to Contextualize Web Pages for Enhanced Decision Making by LLM Agents

Authors: Dongjun Lee, Juyong Lee, Kyuyoung Kim, Jihoon Tack, Jinwoo Shin, Yee Whye Teh, Kimin Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.10689
Pdf URL: https://arxiv.org/pdf/2503.10689
Copy Paste: [[2503.10689]] Learning to Contextualize Web Pages for Enhanced Decision Making by LLM Agents(https://arxiv.org/abs/2503.10689)
Keywords: large language model
Abstract: Recent advances in large language models (LLMs) have led to a growing interest in developing LLM-based agents for automating web tasks. However, these agents often struggle with even simple tasks on real-world websites due to their limited capability to understand and process complex web page structures. In this work, we introduce LCoW, a framework for Learning language models to Contextualize complex Web pages into a more comprehensible form, thereby enhancing decision making by LLM agents. LCoW decouples web page understanding from decision making by training a separate contextualization module to transform complex web pages into comprehensible format, which are then utilized by the decision-making agent. We demonstrate that our contextualization module effectively integrates with LLM agents of various scales to significantly enhance their decision-making capabilities in web automation tasks. Notably, LCoW improves the success rates of closed-source LLMs (e.g., Gemini-1.5-flash, GPT-4o, Claude-3.5-Sonnet) by an average of 15.6%, and demonstrates a 23.7% average improvement in success rates for open-source LMs (e.g., Llama-3.1-8B, Llama-3.1-70B) on the WorkArena benchmark. Moreover, the Gemini-1.5-flash agent with LCoW achieves state-of-the-art results on the WebShop benchmark, outperforming human experts. The relevant code materials are available at our project page: this https URL.

Title: Battling Misinformation: An Empirical Study on Adversarial Factuality in Open-Source Large Language Models

Authors: Shahnewaz Karim Sakib, Anindya Bijoy Das, Shibbir Ahmed
Subjects: cs.CL, cs.CR
Abstract URL: https://arxiv.org/abs/2503.10690
Pdf URL: https://arxiv.org/pdf/2503.10690
Copy Paste: [[2503.10690]] Battling Misinformation: An Empirical Study on Adversarial Factuality in Open-Source Large Language Models(https://arxiv.org/abs/2503.10690)
Keywords: attack, robust, large language model
Abstract: Adversarial factuality refers to the deliberate insertion of misinformation into input prompts by an adversary, characterized by varying levels of expressed confidence. In this study, we systematically evaluate the performance of several open-source large language models (LLMs) when exposed to such adversarial inputs. Three tiers of adversarial confidence are considered: strongly confident, moderately confident, and limited confidence. Our analysis encompasses eight LLMs: LLaMA 3.1 (8B), Phi 3 (3.8B), Qwen 2.5 (7B), Deepseek-v2 (16B), Gemma2 (9B), Falcon (7B), Mistrallite (7B), and LLaVA (7B). Empirical results indicate that LLaMA 3.1 (8B) exhibits a robust capability in detecting adversarial inputs, whereas Falcon (7B) shows comparatively lower performance. Notably, for the majority of the models, detection success improves as the adversary's confidence decreases; however, this trend is reversed for LLaMA 3.1 (8B) and Phi 3 (3.8B), where a reduction in adversarial confidence corresponds with diminished detection performance. Further analysis of the queries that elicited the highest and lowest rates of successful attacks reveals that adversarial attacks are more effective when targeting less commonly referenced or obscure information.

Title: Reasoning is All You Need for Video Generalization: A Counterfactual Benchmark with Sub-question Evaluation

Authors: Qiji Zhou, Yifan Gong, Guangsheng Bao, Hongjie Qiu, Jinqiang Li, Xiangrong Zhu, Huajian Zhang, Yue Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10691
Pdf URL: https://arxiv.org/pdf/2503.10691
Copy Paste: [[2503.10691]] Reasoning is All You Need for Video Generalization: A Counterfactual Benchmark with Sub-question Evaluation(https://arxiv.org/abs/2503.10691)
Keywords: robust
Abstract: Counterfactual reasoning is crucial for robust video understanding but remains underexplored in existing multimodal benchmarks. In this paper, we introduce \textbf{COVER} (\textbf{\underline{CO}}unterfactual \textbf{\underline{V}}id\textbf{\underline{E}}o \textbf{\underline{R}}easoning), a multidimensional multimodal benchmark that systematically evaluates MLLMs across the abstract-concrete and perception-cognition dimensions. Beyond prior multimodal benchmarks, COVER decomposes complex queries into structured sub-questions, enabling fine-grained reasoning analysis. Experiments on commercial and open-source models reveal a strong correlation between sub-question accuracy and counterfactual reasoning performance, highlighting the role of structured inference in video understanding. Furthermore, our results suggest a key insight: enhancing the reasoning capability of models is essential for improving the robustness of video understanding. COVER establishes a new standard for assessing MLLMs' logical reasoning abilities in dynamic environments.

Title: Knowledge Consultation for Semi-Supervised Semantic Segmentation

Authors: Thuan Than, Nhat-Anh Nguyen-Dang, Dung Nguyen, Salwa K. Al Khatib, Ahmed Elhagry, Hai Phan, Yihui He, Zhiqiang Shen, Marios Savvides, Dang Huynh
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2503.10693
Pdf URL: https://arxiv.org/pdf/2503.10693
Copy Paste: [[2503.10693]] Knowledge Consultation for Semi-Supervised Semantic Segmentation(https://arxiv.org/abs/2503.10693)
Keywords: segmentation
Abstract: Semi-Supervised Semantic Segmentation reduces reliance on extensive annotations by using unlabeled data and state-of-the-art models to improve overall performance. Despite the success of deep co-training methods, their underlying mechanisms remain underexplored. This work revisits Cross Pseudo Supervision with dual heterogeneous backbones and introduces Knowledge Consultation (SegKC) to further enhance segmentation performance. The proposed SegKC achieves significant improvements on Pascal and Cityscapes benchmarks, with mIoU scores of 87.1%, 89.2%, and 89.8% on Pascal VOC with the 1/4, 1/2, and full split partition, respectively, while maintaining a compact model architecture.

Title: Medical Large Language Model Benchmarks Should Prioritize Construct Validity

Authors: Ahmed Alaa, Thomas Hartvigsen, Niloufar Golchini, Shiladitya Dutta, Frances Dean, Inioluwa Deborah Raji, Travis Zack
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.10694
Pdf URL: https://arxiv.org/pdf/2503.10694
Copy Paste: [[2503.10694]] Medical Large Language Model Benchmarks Should Prioritize Construct Validity(https://arxiv.org/abs/2503.10694)
Keywords: large language model
Abstract: Medical large language models (LLMs) research often makes bold claims, from encoding clinical knowledge to reasoning like a physician. These claims are usually backed by evaluation on competitive benchmarks; a tradition inherited from mainstream machine learning. But how do we separate real progress from a leaderboard flex? Medical LLM benchmarks, much like those in other fields, are arbitrarily constructed using medical licensing exam questions. For these benchmarks to truly measure progress, they must accurately capture the real-world tasks they aim to represent. In this position paper, we argue that medical LLM benchmarks should (and indeed can) be empirically evaluated for their construct validity. In the psychological testing literature, "construct validity" refers to the ability of a test to measure an underlying "construct", that is the actual conceptual target of evaluation. By drawing an analogy between LLM benchmarks and psychological tests, we explain how frameworks from this field can provide empirical foundations for validating benchmarks. To put these ideas into practice, we use real-world clinical data in proof-of-concept experiments to evaluate popular medical LLM benchmarks and report significant gaps in their construct validity. Finally, we outline a vision for a new ecosystem of medical LLM evaluation centered around the creation of valid benchmarks.

Title: Zero-Shot Subject-Centric Generation for Creative Application Using Entropy Fusion

Authors: Kaifeng Zou, Xiaoyi Feng, Peng Wang, Tao Huang, Zizhou Huang, Zhang Haihang, Yuntao Zou, Dagang Li
Subjects: cs.CV, cs.AI, eess.IV
Abstract URL: https://arxiv.org/abs/2503.10697
Pdf URL: https://arxiv.org/pdf/2503.10697
Copy Paste: [[2503.10697]] Zero-Shot Subject-Centric Generation for Creative Application Using Entropy Fusion(https://arxiv.org/abs/2503.10697)
Keywords: generative, large language model
Abstract: Generative models are widely used in visual content creation. However, current text-to-image models often face challenges in practical applications-such as textile pattern design and meme generation-due to the presence of unwanted elements that are difficult to separate with existing methods. Meanwhile, subject-reference generation has emerged as a key research trend, highlighting the need for techniques that can produce clean, high-quality subject images while effectively removing extraneous components. To address this challenge, we introduce a framework for reliable subject-centric image generation. In this work, we propose an entropy-based feature-weighted fusion method to merge the informative cross-attention features obtained from each sampling step of the pretrained text-to-image model FLUX, enabling a precise mask prediction and subject-centric generation. Additionally, we have developed an agent framework based on Large Language Models (LLMs) that translates users' casual inputs into more descriptive prompts, leading to highly detailed image generation. Simultaneously, the agents extract primary elements of prompts to guide the entropy-based feature fusion, ensuring focused primary element generation without extraneous components. Experimental results and user studies demonstrate our methods generates high-quality subject-centric images, outperform existing methods or other possible pipelines, highlighting the effectiveness of our approach.

Title: TA-V2A: Textually Assisted Video-to-Audio Generation

Authors: Yuhuan You, Xihong Wu, Tianshu Qu
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2503.10700
Pdf URL: https://arxiv.org/pdf/2503.10700
Copy Paste: [[2503.10700]] TA-V2A: Textually Assisted Video-to-Audio Generation(https://arxiv.org/abs/2503.10700)
Keywords: diffusion, transformer, large language model
Abstract: As artificial intelligence-generated content (AIGC) continues to evolve, video-to-audio (V2A) generation has emerged as a key area with promising applications in multimedia editing, augmented reality, and automated content creation. While Transformer and Diffusion models have advanced audio generation, a significant challenge persists in extracting precise semantic information from videos, as current models often lose sequential context by relying solely on frame-based features. To address this, we present TA-V2A, a method that integrates language, audio, and video features to improve semantic representation in latent space. By incorporating large language models for enhanced video comprehension, our approach leverages text guidance to enrich semantic expression. Our diffusion model-based system utilizes automated text modulation to enhance inference quality and efficiency, providing personalized control through text-guided interfaces. This integration enhances semantic expression while ensuring temporal alignment, leading to more accurate and coherent video-to-audio generation.

Title: ClaimTrust: Propagation Trust Scoring for RAG Systems

Authors: Hangkai Qian, Bo Li, Qichen Wang
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2503.10702
Pdf URL: https://arxiv.org/pdf/2503.10702
Copy Paste: [[2503.10702]] ClaimTrust: Propagation Trust Scoring for RAG Systems(https://arxiv.org/abs/2503.10702)
Keywords: robust
Abstract: The rapid adoption of retrieval-augmented generation (RAG) systems has revolutionized large-scale content generation but has also highlighted the challenge of ensuring trustworthiness in retrieved information. This paper introduces ClaimTrust, a propagation-based trust scoring framework that dynamically evaluates the reliability of documents in a RAG system. Using a modified PageRank-inspired algorithm, ClaimTrust propagates trust scores across documents based on relationships derived from extracted factual claims. We preprocess and analyze 814 political news articles from Kaggle's Fake News Detection Dataset to extract 2,173 unique claims and classify 965 meaningful relationships (supporting or contradicting). By representing the dataset as a document graph, ClaimTrust iteratively updates trust scores until convergence, effectively differentiating trustworthy articles from unreliable ones. Our methodology, which leverages embedding-based filtering for efficient claim comparison and relationship classification, achieves a 11.2% of significant connections while maintaining computational scalability. Experimental results demonstrate that ClaimTrust successfully assigns higher trust scores to verified documents while penalizing those containing false information. Future directions include fine-tuned claim extract and compare (Li et al., 2022), parameter optimization, enhanced language model utilization, and robust evaluation metrics to generalize the framework across diverse datasets and domains.

Title: Harmonizing Large Language Models with Collaborative Behavioral Signals for Conversational Recommendation

Authors: Guanrong Li, Kuo Tian, Jinnan Qi, Qinghan Fu, Zhen Wu, Xinyu Dai
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2503.10703
Pdf URL: https://arxiv.org/pdf/2503.10703
Copy Paste: [[2503.10703]] Harmonizing Large Language Models with Collaborative Behavioral Signals for Conversational Recommendation(https://arxiv.org/abs/2503.10703)
Keywords: large language model
Abstract: Conversational recommendation frameworks have gained prominence as a dynamic paradigm for delivering personalized suggestions via interactive dialogues. The incorporation of advanced language understanding techniques has substantially improved the dialogue fluency of such systems. However, while modern language models demonstrate strong proficiency in interpreting user preferences articulated through natural conversation, they frequently encounter challenges in effectively utilizing collective behavioral patterns - a crucial element for generating relevant suggestions. To mitigate this limitation, this work presents a novel probabilistic framework that synergizes behavioral patterns with conversational interactions through latent preference modeling. The proposed method establishes a dual-channel alignment mechanism where implicit preference representations learned from collective user interactions serve as a connecting mechanism between behavioral data and linguistic expressions. Specifically, the framework first derives latent preference representations through established collaborative filtering techniques, then employs these representations to jointly refine both the linguistic preference expressions and behavioral patterns through an adaptive fusion process. Comprehensive evaluations across multiple benchmark datasets demonstrate the superior performance of the proposed approach compared to various state-of-the-art baseline methods, particularly in aligning conversational interactions with collaborative behavioral signals.

Title: Error Analyses of Auto-Regressive Video Diffusion Models: A Unified Framework

Authors: Jing Wang, Fengzhuo Zhang, Xiaoli Li, Vincent Y. F. Tan, Tianyu Pang, Chao Du, Aixin Sun, Zhuoran Yang
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2503.10704
Pdf URL: https://arxiv.org/pdf/2503.10704
Copy Paste: [[2503.10704]] Error Analyses of Auto-Regressive Video Diffusion Models: A Unified Framework(https://arxiv.org/abs/2503.10704)
Keywords: diffusion
Abstract: A variety of Auto-Regressive Video Diffusion Models (ARVDM) have achieved remarkable successes in generating realistic long-form videos. However, theoretical analyses of these models remain scant. In this work, we develop theoretical underpinnings for these models and use our insights to improve the performance of existing models. We first develop Meta-ARVDM, a unified framework of ARVDMs that subsumes most existing methods. Using Meta-ARVDM, we analyze the KL-divergence between the videos generated by Meta-ARVDM and the true videos. Our analysis uncovers two important phenomena inherent to ARVDM -- error accumulation and memory bottleneck. By deriving an information-theoretic impossibility result, we show that the memory bottleneck phenomenon cannot be avoided. To mitigate the memory bottleneck, we design various network structures to explicitly use more past frames. We also achieve a significantly improved trade-off between the mitigation of the memory bottleneck and the inference efficiency by compressing the frames. Experimental results on DMLab and Minecraft validate the efficacy of our methods. Our experiments also demonstrate a Pareto-frontier between the error accumulation and memory bottleneck across different methods.

Title: CALLM: Context-Aware Emotion Analysis in Cancer Survivors Using LLMs and Retrieval-Augmented Mobile Diaries

Authors: Zhiyuan Wang, Katharine E. Daniel, Laura E. Barnes, Philip I. Chow
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2503.10707
Pdf URL: https://arxiv.org/pdf/2503.10707
Copy Paste: [[2503.10707]] CALLM: Context-Aware Emotion Analysis in Cancer Survivors Using LLMs and Retrieval-Augmented Mobile Diaries(https://arxiv.org/abs/2503.10707)
Keywords: large language model
Abstract: Cancer survivors face unique emotional challenges that impact their quality of life. Mobile diary entries-short text entries recording through their phone about their emotional experiences-provide a promising method for tracking these experiences in real time. Although emotion analysis tools show potential for recognizing emotions from text, current methods lack the contextual understanding necessary to accurately interpret the brief, personal narratives in mobile diaries. We propose CALLM, a context-aware emotion analysis framework that leverages Large Language Models (LLMs) with Retrieval-Augmented Generation (RAG), to analyze mobile diary entries from cancer survivors to predict their emotional states. The framework enhances prediction accuracy beyond existing methods by (1) integrating retrieved peer experiences as contextual examples and (2) incorporating individuals' temporal emotional trajectories from their mobile diary entries. We collected a large-scale dataset (N=407) of cancer survivors' mobile ecological momentary assessments (EMAs), which assessed positive and negative affect, desire to regulate emotions, social interaction quality, and availability for interventions, alongside daily mobile diary entries in an open response format regarding what was driving their current emotional experience. Results demonstrate strong performance of CALLM, with balanced accuracies reaching 72.96% for positive and 73.29% for negative affect, and 73.72% for predicting individual's desire to regulate emotions. Post-hoc analysis reveals that leveraging model confidence, encouraging longer diary entries, and incorporating personal ground truth, further enhance predictive outcomes. Our findings support the feasibility of deploying LLM-powered emotion analysis in chronic health populations and suggest promising directions for personalized interventions for cancer survivors.

Title: ZeroMerge: Parameter-Free KV Cache Compression for Memory-Efficient Long-Context LLMs

Authors: Xin Liu, Pei Liu, Guoming Tang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.10714
Pdf URL: https://arxiv.org/pdf/2503.10714
Copy Paste: [[2503.10714]] ZeroMerge: Parameter-Free KV Cache Compression for Memory-Efficient Long-Context LLMs(https://arxiv.org/abs/2503.10714)
Keywords: large language model
Abstract: The linear growth of key-value (KV) cache memory and quadratic computational complexity pose significant bottlenecks for large language models (LLMs) in long-context processing. While existing KV cache optimization methods address these challenges through token pruning or feature merging, they often suffer from irreversible information loss or require costly parameter retraining. We propose ZeroMerge, a dynamic zero-shot compression framework that achieves efficient cache management through three key innovations: (1) Fine-grained memory allocation guided by multi-dimensional token importance metrics at head-level granularity, (2) A residual merging mechanism that preserves critical context through compensated attention scoring, and (3) Parameter-free adaptation compatible with diverse LLM architectures without retraining. Comprehensive evaluations across LLaMA-2 model demonstrate that ZeroMerge maintains full-cache performance at 5\% compression ratios while doubling inference throughput at 40K token lengths. The method effectively balances memory efficiency, generation quality, and deployment flexibility, advancing practical long-context LLM applications. The code is available at this https URL.

Title: Team NYCU at Defactify4: Robust Detection and Source Identification of AI-Generated Images Using CNN and CLIP-Based Models

Authors: Tsan-Tsung Yang, I-Wei Chen, Kuan-Ting Chen, Shang-Hsuan Chiang, Wen-Chih Peng
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.10718
Pdf URL: https://arxiv.org/pdf/2503.10718
Copy Paste: [[2503.10718]] Team NYCU at Defactify4: Robust Detection and Source Identification of AI-Generated Images Using CNN and CLIP-Based Models(https://arxiv.org/abs/2503.10718)
Keywords: robust, generative
Abstract: With the rapid advancement of generative AI, AI-generated images have become increasingly realistic, raising concerns about creativity, misinformation, and content authenticity. Detecting such images and identifying their source models has become a critical challenge in ensuring the integrity of digital media. This paper tackles the detection of AI-generated images and identifying their source models using CNN and CLIP-ViT classifiers. For the CNN-based classifier, we leverage EfficientNet-B0 as the backbone and feed with RGB channels, frequency features, and reconstruction errors, while for CLIP-ViT, we adopt a pretrained CLIP image encoder to extract image features and SVM to perform classification. Evaluated on the Defactify 4 dataset, our methods demonstrate strong performance in both tasks, with CLIP-ViT showing superior robustness to image perturbations. Compared to baselines like AEROBLADE and OCC-CLIP, our approach achieves competitive results. Notably, our method ranked Top-3 overall in the Defactify 4 competition, highlighting its effectiveness and generalizability. All of our implementations can be found in this https URL

Title: Long-Video Audio Synthesis with Multi-Agent Collaboration

Authors: Yehang Zhang, Xinli Xu, Xiaojie Xu, Li Liu, Yingcong Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10719
Pdf URL: https://arxiv.org/pdf/2503.10719
Copy Paste: [[2503.10719]] Long-Video Audio Synthesis with Multi-Agent Collaboration(https://arxiv.org/abs/2503.10719)
Keywords: segmentation
Abstract: Video-to-audio synthesis, which generates synchronized audio for visual content, critically enhances viewer immersion and narrative coherence in film and interactive media. However, video-to-audio dubbing for long-form content remains an unsolved challenge due to dynamic semantic shifts, temporal misalignment, and the absence of dedicated datasets. While existing methods excel in short videos, they falter in long scenarios (e.g., movies) due to fragmented synthesis and inadequate cross-scene consistency. We propose LVAS-Agent, a novel multi-agent framework that emulates professional dubbing workflows through collaborative role specialization. Our approach decomposes long-video synthesis into four steps including scene segmentation, script generation, sound design and audio synthesis. Central innovations include a discussion-correction mechanism for scene/script refinement and a generation-retrieval loop for temporal-semantic alignment. To enable systematic evaluation, we introduce LVAS-Bench, the first benchmark with 207 professionally curated long videos spanning diverse scenarios. Experiments demonstrate superior audio-visual alignment over baseline methods.

Title: TacticExpert: Spatial-Temporal Graph Language Model for Basketball Tactics

Authors: Xu Lingrui, Liu Mandi, Zhang Lei
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.10722
Pdf URL: https://arxiv.org/pdf/2503.10722
Copy Paste: [[2503.10722]] TacticExpert: Spatial-Temporal Graph Language Model for Basketball Tactics(https://arxiv.org/abs/2503.10722)
Keywords: robust, interpretability, transformer, large language model
Abstract: The core challenge in basketball tactic modeling lies in efficiently extracting complex spatial-temporal dependencies from historical data and accurately predicting various in-game events. Existing state-of-the-art (SOTA) models, primarily based on graph neural networks (GNNs), encounter difficulties in capturing long-term, long-distance, and fine-grained interactions among heterogeneous player nodes, as well as in recognizing interaction patterns. Additionally, they exhibit limited generalization to untrained downstream tasks and zero-shot scenarios. In this work, we propose a Spatial-Temporal Propagation Symmetry-Aware Graph Transformer for fine-grained game modeling. This architecture explicitly captures delay effects in the spatial space to enhance player node representations across discrete-time slices, employing symmetry-invariant priors to guide the attention mechanism. We also introduce an efficient contrastive learning strategy to train a Mixture of Tactics Experts module, facilitating differentiated modeling of offensive tactics. By integrating dense training with sparse inference, we achieve a 2.4x improvement in model efficiency. Moreover, the incorporation of Lightweight Graph Grounding for Large Language Models enables robust performance in open-ended downstream tasks and zero-shot scenarios, including novel teams or players. The proposed model, TacticExpert, delineates a vertically integrated large model framework for basketball, unifying pretraining across multiple datasets and downstream prediction tasks. Fine-grained modeling modules significantly enhance spatial-temporal representations, and visualization analyzes confirm the strong interpretability of the model.

Title: RankPO: Preference Optimization for Job-Talent Matching

Authors: Yafei Zhang, Murray Wang, Yu Wang, Xiaohui Wang
Subjects: cs.CL, cs.AI, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2503.10723
Pdf URL: https://arxiv.org/pdf/2503.10723
Copy Paste: [[2503.10723]] RankPO: Preference Optimization for Job-Talent Matching(https://arxiv.org/abs/2503.10723)
Keywords: robust, large language model
Abstract: Matching job descriptions (JDs) with suitable talent requires models capable of understanding not only textual similarities between JDs and candidate resumes but also contextual factors such as geographical location and academic seniority. To address this challenge, we propose a two-stage training framework for large language models (LLMs). In the first stage, a contrastive learning approach is used to train the model on a dataset constructed from real-world matching rules, such as geographical alignment and research area overlap. While effective, this model primarily learns patterns that defined by the matching rules. In the second stage, we introduce a novel preference-based fine-tuning method inspired by Direct Preference Optimization (DPO), termed Rank Preference Optimization (RankPO), to align the model with AI-curated pairwise preferences emphasizing textual understanding. Our experiments show that while the first-stage model achieves strong performance on rule-based data (nDCG@20 = 0.706), it lacks robust textual understanding (alignment with AI annotations = 0.46). By fine-tuning with RankPO, we achieve a balanced model that retains relatively good performance in the original tasks while significantly improving the alignment with AI preferences. The code and data are available at this https URL.

Title: Real-time Pollutant Identification through Optical PM Micro-Sensor

Authors: Elie Azeraf, Audrey Wagner, Emilie Bialic, Samia Mellah, Ludovic Lelandais
Subjects: cs.LG, eess.SP
Abstract URL: https://arxiv.org/abs/2503.10724
Pdf URL: https://arxiv.org/pdf/2503.10724
Copy Paste: [[2503.10724]] Real-time Pollutant Identification through Optical PM Micro-Sensor(https://arxiv.org/abs/2503.10724)
Keywords: protect
Abstract: Air pollution remains one of the most pressing environmental challenges of the modern era, significantly impacting human health, ecosystems, and climate. While traditional air quality monitoring systems provide critical data, their high costs and limited spatial coverage hinder effective real-time pollutant identification. Recent advancements in micro-sensor technology have improved data collection but still lack efficient methods for source identification. This paper explores the innovative application of machine learning (ML) models to classify pollutants in real-time using only data from optical micro-sensors. We propose a novel classification framework capable of distinguishing between four pollutant scenarios: Background Pollution, Ash, Sand, and Candle. Three Machine Learning (ML) approaches - XGBoost, Long Short-Term Memory networks, and Hidden Markov Chains - are evaluated for their effectiveness in sequence modeling and pollutant identification. Our results demonstrate the potential of leveraging micro-sensors and ML techniques to enhance air quality monitoring, offering actionable insights for urban planning and environmental protection.

Title: Samoyeds: Accelerating MoE Models with Structured Sparsity Leveraging Sparse Tensor Cores

Authors: Chenpeng Wu, Qiqi Gu, Heng Shi, Jianguo Yao, Haibing Guan
Subjects: cs.LG, cs.AI, cs.DC, cs.OS
Abstract URL: https://arxiv.org/abs/2503.10725
Pdf URL: https://arxiv.org/pdf/2503.10725
Copy Paste: [[2503.10725]] Samoyeds: Accelerating MoE Models with Structured Sparsity Leveraging Sparse Tensor Cores(https://arxiv.org/abs/2503.10725)
Keywords: large language model
Abstract: The escalating size of Mixture-of-Experts (MoE) based Large Language Models (LLMs) presents significant computational and memory challenges, necessitating innovative solutions to enhance efficiency without compromising model accuracy. Structured sparsity emerges as a compelling strategy to address these challenges by leveraging the emerging sparse computing hardware. Prior works mainly focus on the sparsity in model parameters, neglecting the inherent sparse patterns in activations. This oversight can lead to additional computational costs associated with activations, potentially resulting in suboptimal performance. This paper presents Samoyeds, an innovative acceleration system for MoE LLMs utilizing Sparse Tensor Cores (SpTCs). Samoyeds is the first to apply sparsity simultaneously to both activations and model parameters. It introduces a bespoke sparse data format tailored for MoE computation and develops a specialized sparse-sparse matrix multiplication kernel. Furthermore, Samoyeds incorporates systematic optimizations specifically designed for the execution of dual-side structured sparse MoE LLMs on SpTCs, further enhancing system performance. Evaluations show that Samoyeds outperforms SOTA works by up to 1.99$\times$ at the kernel level and 1.58$\times$ at the model level. Moreover, it enhances memory efficiency, increasing maximum supported batch sizes by 4.41$\times$ on average. Additionally, Samoyeds surpasses existing SOTA structured sparse solutions in both model accuracy and hardware portability.

Title: Prototype-Guided Cross-Modal Knowledge Enhancement for Adaptive Survival Prediction

Authors: Fengchun Liu, Linghan Cai, Zhikang Wang, Zhiyuan Fan, Jin-gang Yu, Hao Chen, Yongbing Zhang
Subjects: cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2503.10726
Pdf URL: https://arxiv.org/pdf/2503.10726
Copy Paste: [[2503.10726]] Prototype-Guided Cross-Modal Knowledge Enhancement for Adaptive Survival Prediction(https://arxiv.org/abs/2503.10726)
Keywords: robust
Abstract: Histo-genomic multimodal survival prediction has garnered growing attention for its remarkable model performance and potential contributions to precision medicine. However, a significant challenge in clinical practice arises when only unimodal data is available, limiting the usability of these advanced multimodal methods. To address this issue, this study proposes a prototype-guided cross-modal knowledge enhancement (ProSurv) framework, which eliminates the dependency on paired data and enables robust learning and adaptive survival prediction. Specifically, we first introduce an intra-modal updating mechanism to construct modality-specific prototype banks that encapsulate the statistics of the whole training set and preserve the modality-specific risk-relevant features/prototypes across intervals. Subsequently, the proposed cross-modal translation module utilizes the learned prototypes to enhance knowledge representation for multimodal inputs and generate features for missing modalities, ensuring robust and adaptive survival prediction across diverse scenarios. Extensive experiments on four public datasets demonstrate the superiority of ProSurv over state-of-the-art methods using either unimodal or multimodal input, and the ablation study underscores its feasibility for broad applicability. Overall, this study addresses a critical practical challenge in computational pathology, offering substantial significance and potential impact in the field.

Title: Word-level Annotation of GDPR Transparency Compliance in Privacy Policies using Large Language Models

Authors: Thomas Cory, Wolf Rieder, Julia Krämer, Philip Raschke, Patrick Herbke, Axel Küpper
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.10727
Pdf URL: https://arxiv.org/pdf/2503.10727
Copy Paste: [[2503.10727]] Word-level Annotation of GDPR Transparency Compliance in Privacy Policies using Large Language Models(https://arxiv.org/abs/2503.10727)
Keywords: privacy, protect, large language model
Abstract: Ensuring transparency of data practices related to personal information is a fundamental requirement under the General Data Protection Regulation (GDPR), particularly as mandated by Articles 13 and 14. However, assessing compliance at scale remains a challenge due to the complexity and variability of privacy policy language. Manual audits are resource-intensive and inconsistent, while existing automated approaches lack the granularity needed to capture nuanced transparency disclosures. In this paper, we introduce a large language model (LLM)-based framework for word-level GDPR transparency compliance annotation. Our approach comprises a two-stage annotation pipeline that combines initial LLM-based annotation with a self-correction mechanism for iterative refinement. This annotation pipeline enables the systematic identification and fine-grained annotation of transparency-related content in privacy policies, aligning with 21 GDPR-derived transparency requirements. To enable large-scale analysis, we compile a dataset of 703,791 English-language policies, from which we generate a sample of 200 manually annotated privacy policies. To evaluate our approach, we introduce a two-tiered methodology assessing both label- and span-level annotation performance. We conduct a comparative analysis of eight high-profile LLMs, providing insights into their effectiveness in identifying GDPR transparency disclosures. Our findings contribute to advancing the automation of GDPR compliance assessments and provide valuable resources for future research in privacy policy analysis.

Title: DarkBench: Benchmarking Dark Patterns in Large Language Models

Authors: Esben Kran, Hieu Minh "Jord" Nguyen, Akash Kundu, Sami Jawhar, Jinsuk Park, Mateusz Maria Jurewicz
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2503.10728
Pdf URL: https://arxiv.org/pdf/2503.10728
Copy Paste: [[2503.10728]] DarkBench: Benchmarking Dark Patterns in Large Language Models(https://arxiv.org/abs/2503.10728)
Keywords: large language model
Abstract: We introduce DarkBench, a comprehensive benchmark for detecting dark design patterns--manipulative techniques that influence user behavior--in interactions with large language models (LLMs). Our benchmark comprises 660 prompts across six categories: brand bias, user retention, sycophancy, anthropomorphism, harmful generation, and sneaking. We evaluate models from five leading companies (OpenAI, Anthropic, Meta, Mistral, Google) and find that some LLMs are explicitly designed to favor their developers' products and exhibit untruthful communication, among other manipulative behaviors. Companies developing LLMs should recognize and mitigate the impact of dark design patterns to promote more ethical AI.

Title: Numerical and statistical analysis of NeuralODE with Runge-Kutta time integration

Authors: Emily C. Ehrhardt, Hanno Gottschalk, Tobias J. Riedlinger
Subjects: cs.LG, math.CA, math.NA, math.PR
Abstract URL: https://arxiv.org/abs/2503.10729
Pdf URL: https://arxiv.org/pdf/2503.10729
Copy Paste: [[2503.10729]] Numerical and statistical analysis of NeuralODE with Runge-Kutta time integration(https://arxiv.org/abs/2503.10729)
Keywords: generative
Abstract: NeuralODE is one example for generative machine learning based on the push forward of a simple source measure with a bijective mapping, which in the case of NeuralODE is given by the flow of a ordinary differential equation. Using Liouville's formula, the log-density of the push forward measure is easy to compute and thus NeuralODE can be trained based on the maximum Likelihood method such that the Kulback-Leibler divergence between the push forward through the flow map and the target measure generating the data becomes small. In this work, we give a detailed account on the consistency of Maximum Likelihood based empirical risk minimization for a generic class of target measures. In contrast to prior work, we do not only consider the statistical learning theory, but also give a detailed numerical analysis of the NeuralODE algorithm based on the 2nd order Runge-Kutta (RK) time integration. Using the universal approximation theory for deep ReQU networks, the stability and convergence rated for the RK scheme as well as metric entropy and concentration inequalities, we are able to prove that NeuralODE is a probably approximately correct (PAC) learning algorithm.

Title: Leveraging Vision-Language Embeddings for Zero-Shot Learning in Histopathology Images

Authors: Md Mamunur Rahaman, Ewan K. A. Millar, Erik Meijering
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10731
Pdf URL: https://arxiv.org/pdf/2503.10731
Copy Paste: [[2503.10731]] Leveraging Vision-Language Embeddings for Zero-Shot Learning in Histopathology Images(https://arxiv.org/abs/2503.10731)
Keywords: extraction
Abstract: Zero-shot learning holds tremendous potential for histopathology image analysis by enabling models to generalize to unseen classes without extensive labeled data. Recent advancements in vision-language models (VLMs) have expanded the capabilities of ZSL, allowing models to perform tasks without task-specific fine-tuning. However, applying VLMs to histopathology presents considerable challenges due to the complexity of histopathological imagery and the nuanced nature of diagnostic tasks. In this paper, we propose a novel framework called Multi-Resolution Prompt-guided Hybrid Embedding (MR-PHE) to address these challenges in zero-shot histopathology image classification. MR-PHE leverages multiresolution patch extraction to mimic the diagnostic workflow of pathologists, capturing both fine-grained cellular details and broader tissue structures critical for accurate diagnosis. We introduce a hybrid embedding strategy that integrates global image embeddings with weighted patch embeddings, effectively combining local and global contextual information. Additionally, we develop a comprehensive prompt generation and selection framework, enriching class descriptions with domain-specific synonyms and clinically relevant features to enhance semantic understanding. A similarity-based patch weighting mechanism assigns attention-like weights to patches based on their relevance to class embeddings, emphasizing diagnostically important regions during classification. Our approach utilizes pretrained VLM, CONCH for ZSL without requiring domain-specific fine-tuning, offering scalability and reducing dependence on large annotated datasets. Experimental results demonstrate that MR-PHE not only significantly improves zero-shot classification performance on histopathology datasets but also often surpasses fully supervised models.

Title: Visual Polarization Measurement Using Counterfactual Image Generation

Authors: Mohammad Mosaffa, Omid Rafieian, Hema Yoganarasimhan
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.10738
Pdf URL: https://arxiv.org/pdf/2503.10738
Copy Paste: [[2503.10738]] Visual Polarization Measurement Using Counterfactual Image Generation(https://arxiv.org/abs/2503.10738)
Keywords: extraction, generative
Abstract: Political polarization is a significant issue in American politics, influencing public discourse, policy, and consumer behavior. While studies on polarization in news media have extensively focused on verbal content, non-verbal elements, particularly visual content, have received less attention due to the complexity and high dimensionality of image data. Traditional descriptive approaches often rely on feature extraction from images, leading to biased polarization estimates due to information loss. In this paper, we introduce the Polarization Measurement using Counterfactual Image Generation (PMCIG) method, which combines economic theory with generative models and multi-modal deep learning to fully utilize the richness of image data and provide a theoretically grounded measure of polarization in visual content. Applying this framework to a decade-long dataset featuring 30 prominent politicians across 20 major news outlets, we identify significant polarization in visual content, with notable variations across outlets and politicians. At the news outlet level, we observe significant heterogeneity in visual slant. Outlets such as Daily Mail, Fox News, and Newsmax tend to favor Republican politicians in their visual content, while The Washington Post, USA Today, and The New York Times exhibit a slant in favor of Democratic politicians. At the politician level, our results reveal substantial variation in polarized coverage, with Donald Trump and Barack Obama among the most polarizing figures, while Joe Manchin and Susan Collins are among the least. Finally, we conduct a series of validation tests demonstrating the consistency of our proposed measures with external measures of media slant that rely on non-image-based sources.

Title: Subnet-Aware Dynamic Supernet Training for Neural Architecture Search

Authors: Jeimin Jeon, Youngmin Oh, Junghyup Lee, Donghyeon Baek, Dohyung Kim, Chanho Eom, Bumsub Ham
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10740
Pdf URL: https://arxiv.org/pdf/2503.10740
Copy Paste: [[2503.10740]] Subnet-Aware Dynamic Supernet Training for Neural Architecture Search(https://arxiv.org/abs/2503.10740)
Keywords: fair
Abstract: N-shot neural architecture search (NAS) exploits a supernet containing all candidate subnets for a given search space. The subnets are typically trained with a static training strategy (e.g., using the same learning rate (LR) scheduler and optimizer for all subnets). This, however, does not consider that individual subnets have distinct characteristics, leading to two problems: (1) The supernet training is biased towards the low-complexity subnets (unfairness); (2) the momentum update in the supernet is noisy (noisy momentum). We present a dynamic supernet training technique to address these problems by adjusting the training strategy adaptive to the subnets. Specifically, we introduce a complexity-aware LR scheduler (CaLR) that controls the decay ratio of LR adaptive to the complexities of subnets, which alleviates the unfairness problem. We also present a momentum separation technique (MS). It groups the subnets with similar structural characteristics and uses a separate momentum for each group, avoiding the noisy momentum problem. Our approach can be applicable to various N-shot NAS methods with marginal cost, while improving the search performance drastically. We validate the effectiveness of our approach on various search spaces (e.g., NAS-Bench-201, Mobilenet spaces) and datasets (e.g., CIFAR-10/100, ImageNet).

Title: Predicting Treatment Response in Body Dysmorphic Disorder with Interpretable Machine Learning

Authors: Omar Costilla-Reyes, Morgan Talbot
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.10741
Pdf URL: https://arxiv.org/pdf/2503.10741
Copy Paste: [[2503.10741]] Predicting Treatment Response in Body Dysmorphic Disorder with Interpretable Machine Learning(https://arxiv.org/abs/2503.10741)
Keywords: interpretability
Abstract: Body Dysmorphic Disorder (BDD) is a highly prevalent and frequently underdiagnosed condition characterized by persistent, intrusive preoccupations with perceived defects in physical appearance. In this extended analysis, we employ multiple machine learning approaches to predict treatment outcomes -- specifically treatment response and remission -- with an emphasis on interpretability to ensure clinical relevance and utility. Across the various models investigated, treatment credibility emerged as the most potent predictor, surpassing traditional markers such as baseline symptom severity or comorbid conditions. Notably, while simpler models (e.g., logistic regression and support vector machines) achieved competitive predictive performance, decision tree analyses provided unique insights by revealing clinically interpretable threshold values in credibility scores. These thresholds can serve as practical guideposts for clinicians when tailoring interventions or allocating treatment resources. We further contextualize our findings within the broader literature on BDD, addressing technology-based therapeutics, digital interventions, and the psychosocial determinants of treatment engagement. An extensive array of references situates our results within current research on BDD prevalence, suicidality risks, and digital innovation. Our work underscores the potential of integrating rigorous statistical methodologies with transparent machine learning models. By systematically identifying modifiable predictors -- such as treatment credibility -- we propose a pathway toward more targeted, personalized, and ultimately efficacious interventions for individuals with BDD.

Title: Clothes-Changing Person Re-identification Based On Skeleton Dynamics

Authors: Asaf Joseph, Shmuel Peleg
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10759
Pdf URL: https://arxiv.org/pdf/2503.10759
Copy Paste: [[2503.10759]] Clothes-Changing Person Re-identification Based On Skeleton Dynamics(https://arxiv.org/abs/2503.10759)
Keywords: robust
Abstract: Clothes-Changing Person Re-Identification (ReID) aims to recognize the same individual across different videos captured at various times and locations. This task is particularly challenging due to changes in appearance, such as clothing, hairstyle, and accessories. We propose a Clothes-Changing ReID method that uses only skeleton data and does not use appearance features. Traditional ReID methods often depend on appearance features, leading to decreased accuracy when clothing changes. Our approach utilizes a spatio-temporal Graph Convolution Network (GCN) encoder to generate a skeleton-based descriptor for each individual. During testing, we improve accuracy by aggregating predictions from multiple segments of a video clip. Evaluated on the CCVID dataset with several different pose estimation models, our method achieves state-of-the-art performance, offering a robust and efficient solution for Clothes-Changing ReID.

Title: HeightFormer: Learning Height Prediction in Voxel Features for Roadside Vision Centric 3D Object Detection via Transformer

Authors: Zhang Zhang, Chao Sun, Chao Yue, Da Wen, Yujie Chen, Tianze Wang, Jianghao Leng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10777
Pdf URL: https://arxiv.org/pdf/2503.10777
Copy Paste: [[2503.10777]] HeightFormer: Learning Height Prediction in Voxel Features for Roadside Vision Centric 3D Object Detection via Transformer(https://arxiv.org/abs/2503.10777)
Keywords: transformer
Abstract: Roadside vision centric 3D object detection has received increasing attention in recent years. It expands the perception range of autonomous vehicles, enhances the road safety. Previous methods focused on predicting per-pixel height rather than depth, making significant gains in roadside visual perception. While it is limited by the perspective property of near-large and far-small on image features, making it difficult for network to understand real dimension of objects in the 3D world. BEV features and voxel features present the real distribution of objects in 3D world compared to the image features. However, BEV features tend to lose details due to the lack of explicit height information, and voxel features are computationally expensive. Inspired by this insight, an efficient framework learning height prediction in voxel features via transformer is proposed, dubbed HeightFormer. It groups the voxel features into local height sequences, and utilize attention mechanism to obtain height distribution prediction. Subsequently, the local height sequences are reassembled to generate accurate 3D features. The proposed method is applied to two large-scale roadside benchmarks, DAIR-V2X-I and Rope3D. Extensive experiments are performed and the HeightFormer outperforms the state-of-the-art methods in roadside vision centric 3D object detection task.

Title: The Power of One: A Single Example is All it Takes for Segmentation in VLMs

Authors: Mir Rayat Imtiaz Hossain, Mennatullah Siam, Leonid Sigal, James J. Little
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10779
Pdf URL: https://arxiv.org/pdf/2503.10779
Copy Paste: [[2503.10779]] The Power of One: A Single Example is All it Takes for Segmentation in VLMs(https://arxiv.org/abs/2503.10779)
Keywords: segmentation
Abstract: Large-scale vision-language models (VLMs), trained on extensive datasets of image-text pairs, exhibit strong multimodal understanding capabilities by implicitly learning associations between textual descriptions and image regions. This emergent ability enables zero-shot object detection and segmentation, using techniques that rely on text-image attention maps, without necessarily training on abundant labeled segmentation datasets. However, performance of such methods depends heavily on prompt engineering and manually selected layers or head choices for the attention layers. In this work, we demonstrate that, rather than relying solely on textual prompts, providing a single visual example for each category and fine-tuning the text-to-image attention layers and embeddings significantly improves the performance. Additionally, we propose learning an ensemble through few-shot fine-tuning across multiple layers and/or prompts. An entropy-based ranking and selection mechanism for text-to-image attention layers is proposed to identify the top-performing layers without the need for segmentation labels. This eliminates the need for hyper-parameter selection of text-to-image attention layers, providing a more flexible and scalable solution for open-vocabulary segmentation. We show that this approach yields strong zero-shot performance, further enhanced through fine-tuning with a single visual example. Moreover, we demonstrate that our method and findings are general and can be applied across various vision-language models (VLMs).

Title: Byzantine-Resilient Federated Learning via Distributed Optimization

Authors: Yufei Xia, Wenrui Yu, Qiongxiu Li
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.10792
Pdf URL: https://arxiv.org/pdf/2503.10792
Copy Paste: [[2503.10792]] Byzantine-Resilient Federated Learning via Distributed Optimization(https://arxiv.org/abs/2503.10792)
Keywords: secure, attack, robust, federate
Abstract: Byzantine attacks present a critical challenge to Federated Learning (FL), where malicious participants can disrupt the training process, degrade model accuracy, and compromise system reliability. Traditional FL frameworks typically rely on aggregation-based protocols for model updates, leaving them vulnerable to sophisticated adversarial strategies. In this paper, we demonstrate that distributed optimization offers a principled and robust alternative to aggregation-centric methods. Specifically, we show that the Primal-Dual Method of Multipliers (PDMM) inherently mitigates Byzantine impacts by leveraging its fault-tolerant consensus mechanism. Through extensive experiments on three datasets (MNIST, FashionMNIST, and Olivetti), under various attack scenarios including bit-flipping and Gaussian noise injection, we validate the superior resilience of distributed optimization protocols. Compared to traditional aggregation-centric approaches, PDMM achieves higher model utility, faster convergence, and improved stability. Our results highlight the effectiveness of distributed optimization in defending against Byzantine threats, paving the way for more secure and resilient federated learning systems.

Title: HALURust: Exploiting Hallucinations of Large Language Models to Detect Vulnerabilities in Rust

Authors: Yu Luo, Han Zhou, Mengtao Zhang, Dylan De La Rosa, Hafsa Ahmed, Weifeng Xu, Dianxiang Xu
Subjects: cs.CR, cs.SE
Abstract URL: https://arxiv.org/abs/2503.10793
Pdf URL: https://arxiv.org/pdf/2503.10793
Copy Paste: [[2503.10793]] HALURust: Exploiting Hallucinations of Large Language Models to Detect Vulnerabilities in Rust(https://arxiv.org/abs/2503.10793)
Keywords: security, robust, large language model
Abstract: As an emerging programming language, Rust has rapidly gained popularity and recognition among developers due to its strong emphasis on safety. It employs a unique ownership system and safe concurrency practices to ensure robust safety. Despite these safeguards, security in Rust still presents challenges. Since 2018, 442 Rust-related vulnerabilities have been reported in real-world applications. The limited availability of data has resulted in existing vulnerability detection tools performing poorly in real-world scenarios, often failing to adapt to new and complex vulnerabilities. This paper introduces HALURust, a novel framework that leverages hallucinations of large language models (LLMs) to detect vulnerabilities in real-world Rust scenarios. HALURust leverages LLMs' strength in natural language generation by transforming code into detailed vulnerability analysis reports. The key innovation lies in prompting the LLM to always assume the presence of a vulnerability. If the code sample is vulnerable, the LLM provides an accurate analysis; if not, it generates a hallucinated report. By fine-tuning LLMs on these hallucinations, HALURust can effectively distinguish between vulnerable and non-vulnerable code samples. HALURust was evaluated on a dataset of 81 real-world vulnerabilities, covering 447 functions and 18,691 lines of code across 54 applications. It outperformed existing methods, achieving an F1 score of 77.3%, with over 10% improvement. The hallucinated report-based fine-tuning improved detection by 20\% compared to traditional code-based fine-tuning. Additionally, HALURust effectively adapted to unseen vulnerabilities and other programming languages, demonstrating strong generalization capabilities.

Title: Fixed-Point RNNs: From Diagonal to Dense in a Few Iterations

Authors: Sajad Movahedi, Felix Sarnthein, Nicola Muca Cirone, Antonio Orvieto
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.10799
Pdf URL: https://arxiv.org/pdf/2503.10799
Copy Paste: [[2503.10799]] Fixed-Point RNNs: From Diagonal to Dense in a Few Iterations(https://arxiv.org/abs/2503.10799)
Keywords: transformer
Abstract: Linear recurrent neural networks (RNNs) and state-space models (SSMs) such as Mamba have become promising alternatives to softmax-attention as sequence mixing layers in Transformer architectures. Current models, however, do not exhibit the full state-tracking expressivity of RNNs because they rely on channel-wise (i.e. diagonal) sequence mixing. In this paper, we propose to compute a dense linear RNN as the fixed-point of a parallelizable diagonal linear RNN in a single layer. We explore mechanisms to improve its memory and state-tracking abilities in practice, and achieve state-of-the-art results on the commonly used toy tasks $A_5$, $S_5$, copying, and modular arithmetics. We hope our results will open new avenues to more expressive and efficient sequence mixers.

Title: Attacking Multimodal OS Agents with Malicious Image Patches

Authors: Lukas Aichberger, Alasdair Paren, Yarin Gal, Philip Torr, Adel Bibi
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2503.10809
Pdf URL: https://arxiv.org/pdf/2503.10809
Copy Paste: [[2503.10809]] Attacking Multimodal OS Agents with Malicious Image Patches(https://arxiv.org/abs/2503.10809)
Keywords: security, attack
Abstract: Recent advances in operating system (OS) agents enable vision-language models to interact directly with the graphical user interface of an OS. These multimodal OS agents autonomously perform computer-based tasks in response to a single prompt via application programming interfaces (APIs). Such APIs typically support low-level operations, including mouse clicks, keyboard inputs, and screenshot captures. We introduce a novel attack vector: malicious image patches (MIPs) that have been adversarially perturbed so that, when captured in a screenshot, they cause an OS agent to perform harmful actions by exploiting specific APIs. For instance, MIPs embedded in desktop backgrounds or shared on social media can redirect an agent to a malicious website, enabling further exploitation. These MIPs generalise across different user requests and screen layouts, and remain effective for multiple OS agents. The existence of such attacks highlights critical security vulnerabilities in OS agents, which should be carefully addressed before their widespread adoption.

Title: Thinking Machines: A Survey of LLM based Reasoning Strategies

Authors: Dibyanayan Bandyopadhyay, Soham Bhattacharjee, Asif Ekbal
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.10814
Pdf URL: https://arxiv.org/pdf/2503.10814
Copy Paste: [[2503.10814]] Thinking Machines: A Survey of LLM based Reasoning Strategies(https://arxiv.org/abs/2503.10814)
Keywords: security, defense, large language model
Abstract: Large Language Models (LLMs) are highly proficient in language-based tasks. Their language capabilities have positioned them at the forefront of the future AGI (Artificial General Intelligence) race. However, on closer inspection, Valmeekam et al. (2024); Zecevic et al. (2023); Wu et al. (2024) highlight a significant gap between their language proficiency and reasoning abilities. Reasoning in LLMs and Vision Language Models (VLMs) aims to bridge this gap by enabling these models to think and re-evaluate their actions and responses. Reasoning is an essential capability for complex problem-solving and a necessary step toward establishing trust in Artificial Intelligence (AI). This will make AI suitable for deployment in sensitive domains, such as healthcare, banking, law, defense, security etc. In recent times, with the advent of powerful reasoning models like OpenAI O1 and DeepSeek R1, reasoning endowment has become a critical research topic in LLMs. In this paper, we provide a detailed overview and comparison of existing reasoning techniques and present a systematic survey of reasoning-imbued language models. We also study current challenges and present our findings.

Title: Dual Codebook VQ: Enhanced Image Reconstruction with Reduced Codebook Size

Authors: Parisa Boodaghi Malidarreh, Jillur Rahman Saurav, Thuong Le Hoai Pham, Amir Hajighasemi, Anahita Samadi, Saurabh Shrinivas Maydeo, Mohammad Sadegh Nasr, Jacob M. Luber
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10832
Pdf URL: https://arxiv.org/pdf/2503.10832
Copy Paste: [[2503.10832]] Dual Codebook VQ: Enhanced Image Reconstruction with Reduced Codebook Size(https://arxiv.org/abs/2503.10832)
Keywords: transformer
Abstract: Vector Quantization (VQ) techniques face significant challenges in codebook utilization, limiting reconstruction fidelity in image modeling. We introduce a Dual Codebook mechanism that effectively addresses this limitation by partitioning the representation into complementary global and local components. The global codebook employs a lightweight transformer for concurrent updates of all code vectors, while the local codebook maintains precise feature representation through deterministic selection. This complementary approach is trained from scratch without requiring pre-trained knowledge. Experimental evaluation across multiple standard benchmark datasets demonstrates state-of-the-art reconstruction quality while using a compact codebook of size 512 - half the size of previous methods that require pre-training. Our approach achieves significant FID improvements across diverse image domains, particularly excelling in scene and face reconstruction tasks. These results establish Dual Codebook VQ as an efficient paradigm for high-fidelity image reconstruction with significantly reduced computational requirements.

Title: Who Relies More on World Knowledge and Bias for Syntactic Ambiguity Resolution: Humans or LLMs?

Authors: So Young Lee, Russell Scheinberg, Amber Shore, Ameeta Agrawal
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.10838
Pdf URL: https://arxiv.org/pdf/2503.10838
Copy Paste: [[2503.10838]] Who Relies More on World Knowledge and Bias for Syntactic Ambiguity Resolution: Humans or LLMs?(https://arxiv.org/abs/2503.10838)
Keywords: large language model
Abstract: This study explores how recent large language models (LLMs) navigate relative clause attachment {ambiguity} and use world knowledge biases for disambiguation in six typologically diverse languages: English, Chinese, Japanese, Korean, Russian, and Spanish. We describe the process of creating a novel dataset -- MultiWho -- for fine-grained evaluation of relative clause attachment preferences in ambiguous and unambiguous contexts. Our experiments with three LLMs indicate that, contrary to humans, LLMs consistently exhibit a preference for local attachment, displaying limited responsiveness to syntactic variations or language-specific attachment patterns. Although LLMs performed well in unambiguous cases, they rigidly prioritized world knowledge biases, lacking the flexibility of human language processing. These findings highlight the need for more diverse, pragmatically nuanced multilingual training to improve LLMs' handling of complex structures and human-like comprehension.

Title: WAFFLED: Exploiting Parsing Discrepancies to Bypass Web Application Firewalls

Authors: Seyed Ali Akhavani, Bahruz Jabiyev, Ben Kallus, Cem Topcuoglu, Sergey Bratus, Engin Kirda
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2503.10846
Pdf URL: https://arxiv.org/pdf/2503.10846
Copy Paste: [[2503.10846]] WAFFLED: Exploiting Parsing Discrepancies to Bypass Web Application Firewalls(https://arxiv.org/abs/2503.10846)
Keywords: security, defense, robust
Abstract: Web Application Firewalls (WAFs) have been introduced as essential and popular security gates that inspect incoming HTTP traffic to filter out malicious requests and provide defenses against a diverse array of web-based threats. Evading WAFs can compromise these defenses, potentially harming Internet users. In recent years, parsing discrepancies have plagued many entities in the communication path; however, their potential impact on WAF evasion and request smuggling remains largely unexplored. In this work, we present an innovative approach to bypassing WAFs by uncovering and exploiting parsing discrepancies through advanced fuzzing techniques. By targeting non-malicious components such as headers and segments of the body and using widely used content-types such as application/json, multipart/form-data, and application/xml, we identified and confirmed 1207 bypasses across 5 well-known WAFs, AWS, Azure, Cloud Armor, Cloudflare, and ModSecurity. To validate our findings, we conducted a study in the wild, revealing that more than 90% of websites accepted both form/x-www-form-urlencoded and multipart/form-data interchangeably, highlighting a significant vulnerability and the broad applicability of our bypass techniques. We have reported these vulnerabilities to the affected parties and received acknowledgments from all, as well as bug bounty rewards from some vendors. Further, to mitigate these vulnerabilities, we introduce HTTP-Normalizer, a robust proxy tool designed to rigorously validate HTTP requests against current RFC standards. Our results demonstrate its effectiveness in normalizing or blocking all bypass attempts presented in this work.

Title: Towards Efficient Large Scale Spatial-Temporal Time Series Forecasting via Improved Inverted Transformers

Authors: Jiarui Sun, Chin-Chia Michael Yeh, Yujie Fan, Xin Dai, Xiran Fan, Zhimeng Jiang, Uday Singh Saini, Vivian Lai, Junpeng Wang, Huiyuan Chen, Zhongfang Zhuang, Yan Zheng, Girish Chowdhary
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.10858
Pdf URL: https://arxiv.org/pdf/2503.10858
Copy Paste: [[2503.10858]] Towards Efficient Large Scale Spatial-Temporal Time Series Forecasting via Improved Inverted Transformers(https://arxiv.org/abs/2503.10858)
Keywords: transformer
Abstract: Time series forecasting at scale presents significant challenges for modern prediction systems, particularly when dealing with large sets of synchronized series, such as in a global payment network. In such systems, three key challenges must be overcome for accurate and scalable predictions: 1) emergence of new entities, 2) disappearance of existing entities, and 3) the large number of entities present in the data. The recently proposed Inverted Transformer (iTransformer) architecture has shown promising results by effectively handling variable entities. However, its practical application in large-scale settings is limited by quadratic time and space complexity ($O(N^2)$) with respect to the number of entities $N$. In this paper, we introduce EiFormer, an improved inverted transformer architecture that maintains the adaptive capabilities of iTransformer while reducing computational complexity to linear scale ($O(N)$). Our key innovation lies in restructuring the attention mechanism to eliminate redundant computations without sacrificing model expressiveness. Additionally, we incorporate a random projection mechanism that not only enhances efficiency but also improves prediction accuracy through better feature representation. Extensive experiments on the public LargeST benchmark dataset and a proprietary large-scale time series dataset demonstrate that EiFormer significantly outperforms existing methods in both computational efficiency and forecasting accuracy. Our approach enables practical deployment of transformer-based forecasting in industrial applications where handling time series at scale is essential.

Title: RI3D: Few-Shot Gaussian Splatting With Repair and Inpainting Diffusion Priors

Authors: Avinash Paliwal, Xilong Zhou, Wei Ye, Jinhui Xiong, Rakesh Ranjan, Nima Khademi Kalantari
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2503.10860
Pdf URL: https://arxiv.org/pdf/2503.10860
Copy Paste: [[2503.10860]] RI3D: Few-Shot Gaussian Splatting With Repair and Inpainting Diffusion Priors(https://arxiv.org/abs/2503.10860)
Keywords: diffusion
Abstract: In this paper, we propose RI3D, a novel 3DGS-based approach that harnesses the power of diffusion models to reconstruct high-quality novel views given a sparse set of input images. Our key contribution is separating the view synthesis process into two tasks of reconstructing visible regions and hallucinating missing regions, and introducing two personalized diffusion models, each tailored to one of these tasks. Specifically, one model ('repair') takes a rendered image as input and predicts the corresponding high-quality image, which in turn is used as a pseudo ground truth image to constrain the optimization. The other model ('inpainting') primarily focuses on hallucinating details in unobserved areas. To integrate these models effectively, we introduce a two-stage optimization strategy: the first stage reconstructs visible areas using the repair model, and the second stage reconstructs missing regions with the inpainting model while ensuring coherence through further optimization. Moreover, we augment the optimization with a novel Gaussian initialization method that obtains per-image depth by combining 3D-consistent and smooth depth with highly detailed relative depth. We demonstrate that by separating the process into two tasks and addressing them with the repair and inpainting models, we produce results with detailed textures in both visible and missing regions that outperform state-of-the-art approaches on a diverse set of scenes with extremely sparse inputs.

Title: TAIJI: Textual Anchoring for Immunizing Jailbreak Images in Vision Language Models

Authors: Xiangyu Yin, Yi Qi, Jinwei Hu, Zhen Chen, Yi Dong, Xingyu Zhao, Xiaowei Huang, Wenjie Ruan
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.10872
Pdf URL: https://arxiv.org/pdf/2503.10872
Copy Paste: [[2503.10872]] TAIJI: Textual Anchoring for Immunizing Jailbreak Images in Vision Language Models(https://arxiv.org/abs/2503.10872)
Keywords: attack
Abstract: Vision Language Models (VLMs) have demonstrated impressive inference capabilities, but remain vulnerable to jailbreak attacks that can induce harmful or unethical responses. Existing defence methods are predominantly white-box approaches that require access to model parameters and extensive modifications, making them costly and impractical for many real-world scenarios. Although some black-box defences have been proposed, they often impose input constraints or require multiple queries, limiting their effectiveness in safety-critical tasks such as autonomous driving. To address these challenges, we propose a novel black-box defence framework called \textbf{T}extual \textbf{A}nchoring for \textbf{I}mmunizing \textbf{J}ailbreak \textbf{I}mages (\textbf{TAIJI}). TAIJI leverages key phrase-based textual anchoring to enhance the model's ability to assess and mitigate the harmful content embedded within both visual and textual prompts. Unlike existing methods, TAIJI operates effectively with a single query during inference, while preserving the VLM's performance on benign tasks. Extensive experiments demonstrate that TAIJI significantly enhances the safety and reliability of VLMs, providing a practical and efficient solution for real-world deployment.

Title: Convolutional Rectangular Attention Module

Authors: Hai-Vy Nguyen, Fabrice Gamboa, Sixin Zhang, Reda Chhaibi, Serge Gratton, Thierry Giaccone
Subjects: cs.CV, cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2503.10875
Pdf URL: https://arxiv.org/pdf/2503.10875
Copy Paste: [[2503.10875]] Convolutional Rectangular Attention Module(https://arxiv.org/abs/2503.10875)
Keywords: interpretability
Abstract: In this paper, we introduce a novel spatial attention module, that can be integrated to any convolutional network. This module guides the model to pay attention to the most discriminative part of an image. This enables the model to attain a better performance by an end-to-end training. In standard approaches, a spatial attention map is generated in a position-wise fashion. We observe that this results in very irregular boundaries. This could make it difficult to generalize to new samples. In our method, the attention region is constrained to be rectangular. This rectangle is parametrized by only 5 parameters, allowing for a better stability and generalization to new samples. In our experiments, our method systematically outperforms the position-wise counterpart. Thus, this provides us a novel useful spatial attention mechanism for convolutional models. Besides, our module also provides the interpretability concerning the ``where to look" question, as it helps to know the part of the input on which the model focuses to produce the prediction.

Title: SCE: Scalable Consistency Ensembles Make Blackbox Large Language Model Generation More Reliable

Authors: Jiaxin Zhang, Zhuohang Li, Wendi Cui, Kamalika Das, Bradley malin, Sricharan Kumar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.10881
Pdf URL: https://arxiv.org/pdf/2503.10881
Copy Paste: [[2503.10881]] SCE: Scalable Consistency Ensembles Make Blackbox Large Language Model Generation More Reliable(https://arxiv.org/abs/2503.10881)
Keywords: large language model
Abstract: Large language models (LLMs) have demonstrated remarkable performance, yet their diverse strengths and weaknesses prevent any single LLM from achieving dominance across all tasks. Ensembling multiple LLMs is a promising approach to generate reliable responses but conventional ensembling frameworks suffer from high computational overheads. This work introduces Scalable Consistency Ensemble (SCE), an efficient framework for ensembling LLMs by prompting consistent outputs. The SCE framework systematically evaluates and integrates outputs to produce a cohesive result through two core components: SCE-CHECK, a mechanism that gauges the consistency between response pairs via semantic equivalence; and SCE-FUSION, which adeptly merges the highest-ranked consistent responses from SCE-CHECK, to optimize collective strengths and mitigating potential weaknesses. To improve the scalability with multiple inference queries, we further propose ``{You Only Prompt Once}'' (YOPO), a novel technique that reduces the inference complexity of pairwise comparison from quadratic to constant time. We perform extensive empirical evaluations on diverse benchmark datasets to demonstrate \methodName's effectiveness. Notably, the \saccheckcomponent outperforms conventional baselines with enhanced performance and a significant reduction in computational overhead.

Title: Taxonomic Reasoning for Rare Arthropods: Combining Dense Image Captioning and RAG for Interpretable Classification

Authors: Nathaniel Lesperance, Sujeevan Ratnasingham, Graham W. Taylor
Subjects: cs.CV, cs.AI, cs.IR, cs.LG, q-bio.PE
Abstract URL: https://arxiv.org/abs/2503.10886
Pdf URL: https://arxiv.org/pdf/2503.10886
Copy Paste: [[2503.10886]] Taxonomic Reasoning for Rare Arthropods: Combining Dense Image Captioning and RAG for Interpretable Classification(https://arxiv.org/abs/2503.10886)
Keywords: large language model
Abstract: In the context of pressing climate change challenges and the significant biodiversity loss among arthropods, automated taxonomic classification from organismal images is a subject of intense research. However, traditional AI pipelines based on deep neural visual architectures such as CNNs or ViTs face limitations such as degraded performance on the long-tail of classes and the inability to reason about their predictions. We integrate image captioning and retrieval-augmented generation (RAG) with large language models (LLMs) to enhance biodiversity monitoring, showing particular promise for characterizing rare and unknown arthropod species. While a naive Vision-Language Model (VLM) excels in classifying images of common species, the RAG model enables classification of rarer taxa by matching explicit textual descriptions of taxonomic features to contextual biodiversity text data from external sources. The RAG model shows promise in reducing overconfidence and enhancing accuracy relative to naive LLMs, suggesting its viability in capturing the nuances of taxonomic hierarchy, particularly at the challenging family and genus levels. Our findings highlight the potential for modern vision-language AI pipelines to support biodiversity conservation initiatives, emphasizing the role of comprehensive data curation and collaboration with citizen science platforms to improve species identification, unknown species characterization and ultimately inform conservation strategies.

Title: HyperDAS: Towards Automating Mechanistic Interpretability with Hypernetworks

Authors: Jiuding Sun, Jing Huang, Sidharth Baskaran, Karel D'Oosterlinck, Christopher Potts, Michael Sklar, Atticus Geiger
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.10894
Pdf URL: https://arxiv.org/pdf/2503.10894
Copy Paste: [[2503.10894]] HyperDAS: Towards Automating Mechanistic Interpretability with Hypernetworks(https://arxiv.org/abs/2503.10894)
Keywords: interpretability, transformer
Abstract: Mechanistic interpretability has made great strides in identifying neural network features (e.g., directions in hidden activation space) that mediate concepts(e.g., the birth year of a person) and enable predictable manipulation. Distributed alignment search (DAS) leverages supervision from counterfactual data to learn concept features within hidden states, but DAS assumes we can afford to conduct a brute force search over potential feature locations. To address this, we present HyperDAS, a transformer-based hypernetwork architecture that (1) automatically locates the token-positions of the residual stream that a concept is realized in and (2) constructs features of those residual stream vectors for the concept. In experiments with Llama3-8B, HyperDAS achieves state-of-the-art performance on the RAVEL benchmark for disentangling concepts in hidden states. In addition, we review the design decisions we made to mitigate the concern that HyperDAS (like all powerful interpretabilty methods) might inject new information into the target model rather than faithfully interpreting it.

Title: Memory-Efficient 3D High-Resolution Medical Image Synthesis Using CRF-Guided GANs

Authors: Mahshid Shiri, Alessandro Bruno, Daniele Loiacono
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10899
Pdf URL: https://arxiv.org/pdf/2503.10899
Copy Paste: [[2503.10899]] Memory-Efficient 3D High-Resolution Medical Image Synthesis Using CRF-Guided GANs(https://arxiv.org/abs/2503.10899)
Keywords: generative
Abstract: Generative Adversarial Networks (GANs) have many potential medical imaging applications. Due to the limited memory of Graphical Processing Units (GPUs), most current 3D GAN models are trained on low-resolution medical images, these models cannot scale to high-resolution or are susceptible to patchy artifacts. In this work, we propose an end-to-end novel GAN architecture that uses Conditional Random field (CRF) to model dependencies so that it can generate consistent 3D medical Images without exploiting memory. To achieve this purpose, the generator is divided into two parts during training, the first part produces an intermediate representation and CRF is applied to this intermediate representation to capture correlations. The second part of the generator produces a random sub-volume of image using a subset of the intermediate representation. This structure has two advantages: first, the correlations are modeled by using the features that the generator is trying to optimize. Second, the generator can generate full high-resolution images during inference. Experiments on Lung CTs and Brain MRIs show that our architecture outperforms state-of-the-art while it has lower memory usage and less complexity.

Title: PolyRoof: Precision Roof Polygonization in Urban Residential Building with Graph Neural Networks

Authors: Chaikal Amrullah, Daniel Panangian, Ksenia Bittner
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10913
Pdf URL: https://arxiv.org/pdf/2503.10913
Copy Paste: [[2503.10913]] PolyRoof: Precision Roof Polygonization in Urban Residential Building with Graph Neural Networks(https://arxiv.org/abs/2503.10913)
Keywords: extraction, segmentation
Abstract: The growing demand for detailed building roof data has driven the development of automated extraction methods to overcome the inefficiencies of traditional approaches, particularly in handling complex variations in building geometries. Re:PolyWorld, which integrates point detection with graph neural networks, presents a promising solution for reconstructing high-detail building roof vector data. This study enhances Re:PolyWorld's performance on complex urban residential structures by incorporating attention-based backbones and additional area segmentation loss. Despite dataset limitations, our experiments demonstrated improvements in point position accuracy (1.33 pixels) and line distance accuracy (14.39 pixels), along with a notable increase in the reconstruction score to 91.99%. These findings highlight the potential of advanced neural network architectures in addressing the challenges of complex urban residential geometries.

Title: OASST-ETC Dataset: Alignment Signals from Eye-tracking Analysis of LLM Responses

Authors: Angela Lopez-Cardona, Sebastian Idesis, Miguel Barreda-Ángeles, Sergi Abadal, Ioannis Arapakis
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.10927
Pdf URL: https://arxiv.org/pdf/2503.10927
Copy Paste: [[2503.10927]] OASST-ETC Dataset: Alignment Signals from Eye-tracking Analysis of LLM Responses(https://arxiv.org/abs/2503.10927)
Keywords: transformer, large language model
Abstract: While Large Language Models (LLMs) have significantly advanced natural language processing, aligning them with human preferences remains an open challenge. Although current alignment methods rely primarily on explicit feedback, eye-tracking (ET) data offers insights into real-time cognitive processing during reading. In this paper, we present OASST-ETC, a novel eye-tracking corpus capturing reading patterns from 24 participants, while evaluating LLM-generated responses from the OASST1 dataset. Our analysis reveals distinct reading patterns between preferred and non-preferred responses, which we compare with synthetic eye-tracking data. Furthermore, we examine the correlation between human reading measures and attention patterns from various transformer-based models, discovering stronger correlations in preferred responses. This work introduces a unique resource for studying human cognitive processing in LLM evaluation and suggests promising directions for incorporating eye-tracking data into alignment methods. The dataset and analysis code are publicly available.

Title: Multi-Domain Biometric Recognition using Body Embeddings

Authors: Anirudh Nanduri, Siyuan Huang, Rama Chellappa
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10931
Pdf URL: https://arxiv.org/pdf/2503.10931
Copy Paste: [[2503.10931]] Multi-Domain Biometric Recognition using Body Embeddings(https://arxiv.org/abs/2503.10931)
Keywords: biometric, transformer
Abstract: Biometric recognition becomes increasingly challenging as we move away from the visible spectrum to infrared imagery, where domain discrepancies significantly impact identification performance. In this paper, we show that body embeddings perform better than face embeddings for cross-spectral person identification in medium-wave infrared (MWIR) and long-wave infrared (LWIR) domains. Due to the lack of multi-domain datasets, previous research on cross-spectral body identification - also known as Visible-Infrared Person Re-Identification (VI-ReID) - has primarily focused on individual infrared bands, such as near-infrared (NIR) or LWIR, separately. We address the multi-domain body recognition problem using the IARPA Janus Benchmark Multi-Domain Face (IJB-MDF) dataset, which enables matching of short-wave infrared (SWIR), MWIR, and LWIR images against RGB (VIS) images. We leverage a vision transformer architecture to establish benchmark results on the IJB-MDF dataset and, through extensive experiments, provide valuable insights into the interrelation of infrared domains, the adaptability of VIS-pretrained models, the role of local semantic features in body-embeddings, and effective training strategies for small datasets. Additionally, we show that finetuning a body model, pretrained exclusively on VIS data, with a simple combination of cross-entropy and triplet losses achieves state-of-the-art mAP scores on the LLCM dataset.

Title: ChatGPT Encounters Morphing Attack Detection: Zero-Shot MAD with Multi-Modal Large Language Models and General Vision Models

Authors: Haoyu Zhang, Raghavendra Ramachandra, Kiran Raja, Christoph Busch
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.10937
Pdf URL: https://arxiv.org/pdf/2503.10937
Copy Paste: [[2503.10937]] ChatGPT Encounters Morphing Attack Detection: Zero-Shot MAD with Multi-Modal Large Language Models and General Vision Models(https://arxiv.org/abs/2503.10937)
Keywords: attack, explainability, large language model
Abstract: Face Recognition Systems (FRS) are increasingly vulnerable to face-morphing attacks, prompting the development of Morphing Attack Detection (MAD) algorithms. However, a key challenge in MAD lies in its limited generalizability to unseen data and its lack of explainability-critical for practical application environments such as enrolment stations and automated border control systems. Recognizing that most existing MAD algorithms rely on supervised learning paradigms, this work explores a novel approach to MAD using zero-shot learning leveraged on Large Language Models (LLMs). We propose two types of zero-shot MAD algorithms: one leveraging general vision models and the other utilizing multimodal LLMs. For general vision models, we address the MAD task by computing the mean support embedding of an independent support set without using morphed images. For the LLM-based approach, we employ the state-of-the-art GPT-4 Turbo API with carefully crafted prompts. To evaluate the feasibility of zero-shot MAD and the effectiveness of the proposed methods, we constructed a print-scan morph dataset featuring various unseen morphing algorithms, simulating challenging real-world application scenarios. Experimental results demonstrated notable detection accuracy, validating the applicability of zero-shot learning for MAD tasks. Additionally, our investigation into LLM-based MAD revealed that multimodal LLMs, such as ChatGPT, exhibit remarkable generalizability to untrained MAD tasks. Furthermore, they possess a unique ability to provide explanations and guidance, which can enhance transparency and usability for end-users in practical applications.

Title: Phishsense-1B: A Technical Perspective on an AI-Powered Phishing Detection Model

Authors: SE Blake
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2503.10944
Pdf URL: https://arxiv.org/pdf/2503.10944
Copy Paste: [[2503.10944]] Phishsense-1B: A Technical Perspective on an AI-Powered Phishing Detection Model(https://arxiv.org/abs/2503.10944)
Keywords: security
Abstract: Phishing is a persistent cybersecurity threat in today's digital landscape. This paper introduces Phishsense-1B, a refined version of the Llama-Guard-3-1B model, specifically tailored for phishing detection and reasoning. This adaptation utilizes Low-Rank Adaptation (LoRA) and the GuardReasoner finetuning methodology. We outline our LoRA-based fine-tuning process, describe the balanced dataset comprising phishing and benign emails, and highlight significant performance improvements over the original model. Our findings indicate that Phishsense-1B achieves an impressive 97.5% accuracy on a custom dataset and maintains strong performance with 70% accuracy on a challenging real-world dataset. This performance notably surpasses both unadapted models and BERT-based detectors. Additionally, we examine current state-of-the-art detection methods, compare prompt-engineering with fine-tuning strategies, and explore potential deployment scenarios.

Title: $(\varepsilon, δ)$ Considered Harmful: Best Practices for Reporting Differential Privacy Guarantees

Authors: Juan Felipe Gomez, Bogdan Kulynych, Georgios Kaissis, Jamie Hayes, Borja Balle, Antti Honkela
Subjects: cs.LG, cs.AI, cs.CR, stat.ML
Abstract URL: https://arxiv.org/abs/2503.10945
Pdf URL: https://arxiv.org/pdf/2503.10945
Copy Paste: [[2503.10945]] $(\varepsilon, δ)$ Considered Harmful: Best Practices for Reporting Differential Privacy Guarantees(https://arxiv.org/abs/2503.10945)
Keywords: privacy
Abstract: Current practices for reporting the level of differential privacy (DP) guarantees for machine learning (ML) algorithms provide an incomplete and potentially misleading picture of the guarantees and make it difficult to compare privacy levels across different settings. We argue for using Gaussian differential privacy (GDP) as the primary means of communicating DP guarantees in ML, with the full privacy profile as a secondary option in case GDP is too inaccurate. Unlike other widely used alternatives, GDP has only one parameter, which ensures easy comparability of guarantees, and it can accurately capture the full privacy profile of many important ML applications. To support our claims, we investigate the privacy profiles of state-of-the-art DP large-scale image classification, and the TopDown algorithm for the U.S. Decennial Census, observing that GDP fits the profiles remarkably well in all three cases. Although GDP is ideal for reporting the final guarantees, other formalisms (e.g., privacy loss random variables) are needed for accurate privacy accounting. We show that such intermediate representations can be efficiently converted to GDP with minimal loss in tightness.

Title: Predicting Stock Movement with BERTweet and Transformers

Authors: Michael Charles Albada, Mojolaoluwa Joshua Sonola
Subjects: cs.LG, cs.AI, cs.CE
Abstract URL: https://arxiv.org/abs/2503.10957
Pdf URL: https://arxiv.org/pdf/2503.10957
Copy Paste: [[2503.10957]] Predicting Stock Movement with BERTweet and Transformers(https://arxiv.org/abs/2503.10957)
Keywords: transformer
Abstract: Applying deep learning and computational intelligence to finance has been a popular area of applied research, both within academia and industry, and continues to attract active attention. The inherently high volatility and non-stationary of the data pose substantial challenges to machine learning models, especially so for today's expressive and highly-parameterized deep learning models. Recent work has combined natural language processing on data from social media to augment models based purely on historic price data to improve performance has received particular attention. Previous work has achieved state-of-the-art performance on this task by combining techniques such as bidirectional GRUs, variational autoencoders, word and document embeddings, self-attention, graph attention, and adversarial training. In this paper, we demonstrated the efficacy of BERTweet, a variant of BERT pre-trained specifically on a Twitter corpus, and the transformer architecture by achieving competitive performance with the existing literature and setting a new baseline for Matthews Correlation Coefficient on the Stocknet dataset without auxiliary data sources.

Title: OuroMamba: A Data-Free Quantization Framework for Vision Mamba Models

Authors: Akshat Ramachandran, Mingyu Lee, Huan Xu, Souvik Kundu, Tushar Krishna
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.10959
Pdf URL: https://arxiv.org/pdf/2503.10959
Copy Paste: [[2503.10959]] OuroMamba: A Data-Free Quantization Framework for Vision Mamba Models(https://arxiv.org/abs/2503.10959)
Keywords: data-free, generative
Abstract: We present OuroMamba, the first data-free post-training quantization (DFQ) method for vision Mamba-based models (VMMs). We identify two key challenges in enabling DFQ for VMMs, (1) VMM's recurrent state transitions restricts capturing of long-range interactions and leads to semantically weak synthetic data, (2) VMM activations exhibit dynamic outlier variations across time-steps, rendering existing static PTQ techniques ineffective. To address these challenges, OuroMamba presents a two-stage framework: (1) OuroMamba-Gen to generate semantically rich and meaningful synthetic data. It applies contrastive learning on patch level VMM features generated through neighborhood interactions in the latent state space, (2) OuroMamba-Quant to employ mixed-precision quantization with lightweight dynamic outlier detection during inference. In specific, we present a thresholding based outlier channel selection strategy for activations that gets updated every time-step. Extensive experiments across vision and generative tasks show that our data-free OuroMamba surpasses existing data-driven PTQ techniques, achieving state-of-the-art performance across diverse quantization settings. Additionally, we implement efficient GPU kernels to achieve practical latency speedup of up to 2.36x. Code will be released soon.

Title: FedOSAA: Improving Federated Learning with One-Step Anderson Acceleration

Authors: Xue Feng (University of California, Davis), M. Paul Laiu (Oak Ridge National Laboratory), Thomas Strohmer (University of California, Davis)
Subjects: cs.LG, cs.DC
Abstract URL: https://arxiv.org/abs/2503.10961
Pdf URL: https://arxiv.org/pdf/2503.10961
Copy Paste: [[2503.10961]] FedOSAA: Improving Federated Learning with One-Step Anderson Acceleration(https://arxiv.org/abs/2503.10961)
Keywords: federate
Abstract: Federated learning (FL) is a distributed machine learning approach that enables multiple local clients and a central server to collaboratively train a model while keeping the data on their own devices. First-order methods, particularly those incorporating variance reduction techniques, are the most widely used FL algorithms due to their simple implementation and stable performance. However, these methods tend to be slow and require a large number of communication rounds to reach the global minimizer. We propose FedOSAA, a novel approach that preserves the simplicity of first-order methods while achieving the rapid convergence typically associated with second-order methods. Our approach applies one Anderson acceleration (AA) step following classical local updates based on first-order methods with variance reduction, such as FedSVRG and SCAFFOLD, during local training. This AA step is able to leverage curvature information from the history points and gives a new update that approximates the Newton-GMRES direction, thereby significantly improving the convergence. We establish a local linear convergence rate to the global minimizer of FedOSAA for smooth and strongly convex loss functions. Numerical comparisons show that FedOSAA substantially improves the communication and computation efficiency of the original first-order methods, achieving performance comparable to second-order methods like GIANT.

Title: From Dionysius Emerges Apollo -- Learning Patterns and Abstractions from Perceptual Sequences

Authors: Shuchen Wu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.10973
Pdf URL: https://arxiv.org/pdf/2503.10973
Copy Paste: [[2503.10973]] From Dionysius Emerges Apollo -- Learning Patterns and Abstractions from Perceptual Sequences(https://arxiv.org/abs/2503.10973)
Keywords: large language model
Abstract: Cognition swiftly breaks high-dimensional sensory streams into familiar parts and uncovers their relations. Why do structures emerge, and how do they enable learning, generalization, and prediction? What computational principles underlie this core aspect of perception and intelligence? A sensory stream, simplified, is a one-dimensional sequence. In learning such sequences, we naturally segment them into parts -- a process known as chunking. In the first project, I investigated factors influencing chunking in a serial reaction time task and showed that humans adapt to underlying chunks while balancing speed and accuracy. Building on this, I developed models that learn chunks and parse sequences chunk by chunk. Normatively, I proposed chunking as a rational strategy for discovering recurring patterns and nested hierarchies, enabling efficient sequence factorization. Learned chunks serve as reusable primitives for transfer, composition, and mental simulation -- letting the model compose the new from the known. I demonstrated this model's ability to learn hierarchies in single and multi-dimensional sequences and highlighted its utility for unsupervised pattern discovery. The second part moves from concrete to abstract sequences. I taxonomized abstract motifs and examined their role in sequence memory. Behavioral evidence suggests that humans exploit pattern redundancies for compression and transfer. I proposed a non-parametric hierarchical variable model that learns both chunks and abstract variables, uncovering invariant symbolic patterns. I showed its similarity to human learning and compared it to large language models. Taken together, this thesis suggests that chunking and abstraction as simple computational principles enable structured knowledge acquisition in hierarchically organized sequences, from simple to complex, concrete to abstract.

Title: Unlocking Open-Set Language Accessibility in Vision Models

Authors: Fawaz Sammani, Jonas Fischer, Nikos Deligiannis
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10981
Pdf URL: https://arxiv.org/pdf/2503.10981
Copy Paste: [[2503.10981]] Unlocking Open-Set Language Accessibility in Vision Models(https://arxiv.org/abs/2503.10981)
Keywords: interpretability
Abstract: Visual classifiers offer high-dimensional feature representations that are challenging to interpret and analyze. Text, in contrast, provides a more expressive and human-friendly interpretable medium for understanding and analyzing model behavior. We propose a simple, yet powerful method for reformulating any visual classifier so that it can be accessed with open-set text queries without compromising its original performance. Our approach is label-free, efficient, and preserves the underlying classifier's distribution and reasoning processes. We thus unlock several text-based interpretability applications for any classifier. We apply our method on 40 visual classifiers and demonstrate two primary applications: 1) building both label-free and zero-shot concept bottleneck models and therefore converting any classifier to be inherently-interpretable and 2) zero-shot decoding of visual features into natural language. In both applications, we achieve state-of-the-art results, greatly outperforming existing works. Our method enables text approaches for interpreting visual classifiers.

Title: Rethinking Rotation-Invariant Recognition of Fine-grained Shapes from the Perspective of Contour Points

Authors: Yanjie Xu, Handing Xu, Tianmu Wang, Yaguan Li, Yunzhi Chen, Zhenguo Nie
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10992
Pdf URL: https://arxiv.org/pdf/2503.10992
Copy Paste: [[2503.10992]] Rethinking Rotation-Invariant Recognition of Fine-grained Shapes from the Perspective of Contour Points(https://arxiv.org/abs/2503.10992)
Keywords: robust
Abstract: Rotation-invariant recognition of shapes is a common challenge in computer vision. Recent approaches have significantly improved the accuracy of rotation-invariant recognition by encoding the rotational invariance of shapes as hand-crafted image features and introducing deep neural networks. However, the methods based on pixels have too much redundant information, and the critical geometric information is prone to early leakage, resulting in weak rotation-invariant recognition of fine-grained shapes. In this paper, we reconsider the shape recognition problem from the perspective of contour points rather than pixels. We propose an anti-noise rotation-invariant convolution module based on contour geometric aware for fine-grained shape recognition. The module divides the shape contour into multiple local geometric regions(LGA), where we implement finer-grained rotation-invariant coding in terms of point topological relations. We provide a deep network composed of five such cascaded modules for classification and retrieval experiments. The results show that our method exhibits excellent performance in rotation-invariant recognition of fine-grained shapes. In addition, we demonstrate that our method is robust to contour noise and the rotation centers. The source code is available at this https URL.

Title: TigerLLM -- A Family of Bangla Large Language Models

Authors: Nishat Raihan, Marcos Zampieri
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.10995
Pdf URL: https://arxiv.org/pdf/2503.10995
Copy Paste: [[2503.10995]] TigerLLM -- A Family of Bangla Large Language Models(https://arxiv.org/abs/2503.10995)
Keywords: large language model
Abstract: The development of Large Language Models (LLMs) remains heavily skewed towards English and a few other high-resource languages. This linguistic disparity is particularly evident for Bangla - the 5th most spoken language. A few initiatives attempted to create open-source Bangla LLMs with performance still behind high-resource languages and limited reproducibility. To address this gap, we introduce TigerLLM - a family of Bangla LLMs. Our results demonstrate that these models surpass all open-source alternatives and also outperform larger proprietary models like GPT3.5 across standard benchmarks, establishing TigerLLM as the new baseline for future Bangla language modeling.

Title: Taming Knowledge Conflicts in Language Models

Authors: Gaotang Li, Yuzhong Chen, Hanghang Tong
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2503.10996
Pdf URL: https://arxiv.org/pdf/2503.10996
Copy Paste: [[2503.10996]] Taming Knowledge Conflicts in Language Models(https://arxiv.org/abs/2503.10996)
Keywords: robust
Abstract: Language Models (LMs) often encounter knowledge conflicts when parametric memory contradicts contextual knowledge. Previous works attribute this conflict to the interplay between "memory heads" and "context heads", attention heads assumed to promote either memory or context exclusively. In this study, we go beyond this fundamental assumption by uncovering a critical phenomenon we term the "superposition of contextual information and parametric memory", where highly influential attention heads could simultaneously contribute to both memory and context. Building upon this insight, we propose Just Run Twice (JUICE), a test-time attention intervention method that steers LMs toward either parametric beliefs or contextual knowledge without requiring fine-tuning. JUICE identifies a set of reliable attention heads and leverages a dual-run approach to mitigate the superposition effects. Extensive experiments across 11 datasets and 6 model architectures demonstrate that JUICE sets the new state-of-the-art performance and robust generalization, achieving significant and consistent improvement across different domains under various conflict types. Finally, we theoretically analyze knowledge conflict and the superposition of contextual information and parametric memory in attention heads, which further elucidates the effectiveness of JUICE in these settings.

Title: RONA: Pragmatically Diverse Image Captioning with Coherence Relations

Authors: Aashish Anantha Ramakrishnan, Aadarsh Anantha Ramakrishnan, Dongwon Lee
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2503.10997
Pdf URL: https://arxiv.org/pdf/2503.10997
Copy Paste: [[2503.10997]] RONA: Pragmatically Diverse Image Captioning with Coherence Relations(https://arxiv.org/abs/2503.10997)
Keywords: large language model
Abstract: Writing Assistants (e.g., Grammarly, Microsoft Copilot) traditionally generate diverse image captions by employing syntactic and semantic variations to describe image components. However, human-written captions prioritize conveying a central message alongside visual descriptions using pragmatic cues. To enhance pragmatic diversity, it is essential to explore alternative ways of communicating these messages in conjunction with visual content. To address this challenge, we propose RONA, a novel prompting strategy for Multi-modal Large Language Models (MLLM) that leverages Coherence Relations as an axis for variation. We demonstrate that RONA generates captions with better overall diversity and ground-truth alignment, compared to MLLM baselines across multiple domains. Our code is available at: this https URL

Title: VA-AR: Learning Velocity-Aware Action Representations with Mixture of Window Attention

Authors: Jiangning Wei, Lixiong Qin, Bo Yu, Tianjian Zou, Chuhan Yan, Dandan Xiao, Yang Yu, Lan Yang, Ke Li, Jun Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11004
Pdf URL: https://arxiv.org/pdf/2503.11004
Copy Paste: [[2503.11004]] VA-AR: Learning Velocity-Aware Action Representations with Mixture of Window Attention(https://arxiv.org/abs/2503.11004)
Keywords: robust
Abstract: Action recognition is a crucial task in artificial intelligence, with significant implications across various domains. We initially perform a comprehensive analysis of seven prominent action recognition methods across five widely-used datasets. This analysis reveals a critical, yet previously overlooked, observation: as the velocity of actions increases, the performance of these methods variably declines, undermining their robustness. This decline in performance poses significant challenges for their application in real-world scenarios. Building on these findings, we introduce the Velocity-Aware Action Recognition (VA-AR) framework to obtain robust action representations across different velocities. Our principal insight is that rapid actions (e.g., the giant circle backward in uneven bars or a smash in badminton) occur within short time intervals, necessitating smaller temporal attention windows to accurately capture intricate changes. Conversely, slower actions (e.g., drinking water or wiping face) require larger windows to effectively encompass the broader context. VA-AR employs a Mixture of Window Attention (MoWA) strategy, dynamically adjusting its attention window size based on the action's velocity. This adjustment enables VA-AR to obtain a velocity-aware representation, thereby enhancing the accuracy of action recognition. Extensive experiments confirm that VA-AR achieves state-of-the-art performance on the same five datasets, demonstrating VA-AR's effectiveness across a broad spectrum of action recognition scenarios.

Title: Comparative Analysis of Advanced AI-based Object Detection Models for Pavement Marking Quality Assessment during Daytime

Authors: Gian Antariksa, Rohir Chakraborty, Shriyank Somvanshi, Subasish Das, Mohammad Jalayer, Deep Rameshkumar Patel, David Mills
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11008
Pdf URL: https://arxiv.org/pdf/2503.11008
Copy Paste: [[2503.11008]] Comparative Analysis of Advanced AI-based Object Detection Models for Pavement Marking Quality Assessment during Daytime(https://arxiv.org/abs/2503.11008)
Keywords: robust
Abstract: Visual object detection utilizing deep learning plays a vital role in computer vision and has extensive applications in transportation engineering. This paper focuses on detecting pavement marking quality during daytime using the You Only Look Once (YOLO) model, leveraging its advanced architectural features to enhance road safety through precise and real-time assessments. Utilizing image data from New Jersey, this study employed three YOLOv8 variants: YOLOv8m, YOLOv8n, and YOLOv8x. The models were evaluated based on their prediction accuracy for classifying pavement markings into good, moderate, and poor visibility categories. The results demonstrated that YOLOv8n provides the best balance between accuracy and computational efficiency, achieving the highest mean Average Precision (mAP) for objects with good visibility and demonstrating robust performance across various Intersections over Union (IoU) thresholds. This research enhances transportation safety by offering an automated and accurate method for evaluating the quality of pavement markings.

Title: Deep Incomplete Multi-view Clustering with Distribution Dual-Consistency Recovery Guidance

Authors: Jiaqi Jin, Siwei Wang, Zhibin Dong, Xihong Yang, Xinwang Liu, En Zhu, Kunlun He
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.11017
Pdf URL: https://arxiv.org/pdf/2503.11017
Copy Paste: [[2503.11017]] Deep Incomplete Multi-view Clustering with Distribution Dual-Consistency Recovery Guidance(https://arxiv.org/abs/2503.11017)
Keywords: privacy
Abstract: Multi-view clustering leverages complementary representations from diverse sources to enhance performance. However, real-world data often suffer incomplete cases due to factors like privacy concerns and device malfunctions. A key challenge is effectively utilizing available instances to recover missing views. Existing methods frequently overlook the heterogeneity among views during recovery, leading to significant distribution discrepancies between recovered and true data. Additionally, many approaches focus on cross-view correlations, neglecting insights from intra-view reliable structure and cross-view clustering structure. To address these issues, we propose BURG, a novel method for incomplete multi-view clustering with distriBution dUal-consistency Recovery Guidance. We treat each sample as a distinct category and perform cross-view distribution transfer to predict the distribution space of missing views. To compensate for the lack of reliable category information, we design a dual-consistency guided recovery strategy that includes intra-view alignment guided by neighbor-aware consistency and cross-view alignment guided by prototypical consistency. Extensive experiments on benchmarks demonstrate the superiority of BURG in the incomplete multi-view scenario.

Title: EmoDiffusion: Enhancing Emotional 3D Facial Animation with Latent Diffusion Models

Authors: Yixuan Zhang, Qing Chang, Yuxi Wang, Guang Chen, Zhaoxiang Zhang, Junran Peng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11028
Pdf URL: https://arxiv.org/pdf/2503.11028
Copy Paste: [[2503.11028]] EmoDiffusion: Enhancing Emotional 3D Facial Animation with Latent Diffusion Models(https://arxiv.org/abs/2503.11028)
Keywords: diffusion
Abstract: Speech-driven 3D facial animation seeks to produce lifelike facial expressions that are synchronized with the speech content and its emotional nuances, finding applications in various multimedia fields. However, previous methods often overlook emotional facial expressions or fail to disentangle them effectively from the speech content. To address these challenges, we present EmoDiffusion, a novel approach that disentangles different emotions in speech to generate rich 3D emotional facial expressions. Specifically, our method employs two Variational Autoencoders (VAEs) to separately generate the upper face region and mouth region, thereby learning a more refined representation of the facial sequence. Unlike traditional methods that use diffusion models to connect facial expression sequences with audio inputs, we perform the diffusion process in the latent space. Furthermore, we introduce an Emotion Adapter to evaluate upper face movements accurately. Given the paucity of 3D emotional talking face data in the animation industry, we capture facial expressions under the guidance of animation experts using LiveLinkFace on an iPhone. This effort results in the creation of an innovative 3D blendshape emotional talking face dataset (3D-BEF) used to train our network. Extensive experiments and perceptual evaluations validate the effectiveness of our approach, confirming its superiority in generating realistic and emotionally rich facial animations.

Title: FMNet: Frequency-Assisted Mamba-Like Linear Attention Network for Camouflaged Object Detection

Authors: Ming Deng, Sijin Sun, Zihao Li, Xiaochuan Hu, Xing Wu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.11030
Pdf URL: https://arxiv.org/pdf/2503.11030
Copy Paste: [[2503.11030]] FMNet: Frequency-Assisted Mamba-Like Linear Attention Network for Camouflaged Object Detection(https://arxiv.org/abs/2503.11030)
Keywords: extraction, transformer
Abstract: Camouflaged Object Detection (COD) is challenging due to the strong similarity between camouflaged objects and their surroundings, which complicates identification. Existing methods mainly rely on spatial local features, failing to capture global information, while Transformers increase computational this http URL address this, the Frequency-Assisted Mamba-Like Linear Attention Network (FMNet) is proposed, which leverages frequency-domain learning to efficiently capture global features and mitigate ambiguity between objects and the background. FMNet introduces the Multi-Scale Frequency-Assisted Mamba-Like Linear Attention (MFM) module, integrating frequency and spatial features through a multi-scale structure to handle scale variations while reducing computational complexity. Additionally, the Pyramidal Frequency Attention Extraction (PFAE) module and the Frequency Reverse Decoder (FRD) enhance semantics and reconstruct features. Experimental results demonstrate that FMNet outperforms existing methods on multiple COD datasets, showcasing its advantages in both performance and efficiency. Code available at this https URL.

Title: Weakly Supervised Contrastive Adversarial Training for Learning Robust Features from Semi-supervised Data

Authors: Lilin Zhang, Chengpei Wu, Ning Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11032
Pdf URL: https://arxiv.org/pdf/2503.11032
Copy Paste: [[2503.11032]] Weakly Supervised Contrastive Adversarial Training for Learning Robust Features from Semi-supervised Data(https://arxiv.org/abs/2503.11032)
Keywords: robust
Abstract: Existing adversarial training (AT) methods often suffer from incomplete perturbation, meaning that not all non-robust features are perturbed when generating adversarial examples (AEs). This results in residual correlations between non-robust features and labels, leading to suboptimal learning of robust features. However, achieving complete perturbation, i.e., perturbing as many non-robust features as possible, is challenging due to the difficulty in distinguishing robust and non-robust features and the sparsity of labeled data. To address these challenges, we propose a novel approach called Weakly Supervised Contrastive Adversarial Training (WSCAT). WSCAT ensures complete perturbation for improved learning of robust features by disrupting correlations between non-robust features and labels through complete AE generation over partially labeled data, grounded in information theory. Extensive theoretical analysis and comprehensive experiments on widely adopted benchmarks validate the superiority of WSCAT.

Title: ACMo: Attribute Controllable Motion Generation

Authors: Mingjie Wei, Xuemei Xie, Guangming Shi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11038
Pdf URL: https://arxiv.org/pdf/2503.11038
Copy Paste: [[2503.11038]] ACMo: Attribute Controllable Motion Generation(https://arxiv.org/abs/2503.11038)
Keywords: diffusion
Abstract: Attributes such as style, fine-grained text, and trajectory are specific conditions for describing motion. However, existing methods often lack precise user control over motion attributes and suffer from limited generalizability to unseen motions. This work introduces an Attribute Controllable Motion generation architecture, to address these challenges via decouple any conditions and control them separately. Firstly, we explored the Attribute Diffusion Model to imporve text-to-motion performance via decouple text and motion learning, as the controllable model relies heavily on the pre-trained model. Then, we introduce Motion Adpater to quickly finetune previously unseen motion patterns. Its motion prompts inputs achieve multimodal text-to-motion generation that captures user-specified styles. Finally, we propose a LLM Planner to bridge the gap between unseen attributes and dataset-specific texts via local knowledage for user-friendly interaction. Our approach introduces the capability for motion prompts for stylize generation, enabling fine-grained and user-friendly attribute control while providing performance comparable to state-of-the-art methods. Project page: this https URL

Title: InverseBench: Benchmarking Plug-and-Play Diffusion Priors for Inverse Problems in Physical Sciences

Authors: Hongkai Zheng, Wenda Chu, Bingliang Zhang, Zihui Wu, Austin Wang, Berthy T. Feng, Caifeng Zou, Yu Sun, Nikola Kovachki, Zachary E. Ross, Katherine L. Bouman, Yisong Yue
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.11043
Pdf URL: https://arxiv.org/pdf/2503.11043
Copy Paste: [[2503.11043]] InverseBench: Benchmarking Plug-and-Play Diffusion Priors for Inverse Problems in Physical Sciences(https://arxiv.org/abs/2503.11043)
Keywords: diffusion
Abstract: Plug-and-play diffusion priors (PnPDP) have emerged as a promising research direction for solving inverse problems. However, current studies primarily focus on natural image restoration, leaving the performance of these algorithms in scientific inverse problems largely unexplored. To address this gap, we introduce \textsc{InverseBench}, a framework that evaluates diffusion models across five distinct scientific inverse problems. These problems present unique structural challenges that differ from existing benchmarks, arising from critical scientific applications such as optical tomography, medical imaging, black hole imaging, seismology, and fluid dynamics. With \textsc{InverseBench}, we benchmark 14 inverse problem algorithms that use plug-and-play diffusion priors against strong, domain-specific baselines, offering valuable new insights into the strengths and weaknesses of existing algorithms. To facilitate further research and development, we open-source the codebase, along with datasets and pre-trained models, at this https URL.

Title: PSF-4D: A Progressive Sampling Framework for View Consistent 4D Editing

Authors: Hasan Iqbal, Nazmul Karim, Umar Khalid, Azib Farooq, Zichun Zhong, Jing Hua, Chen Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11044
Pdf URL: https://arxiv.org/pdf/2503.11044
Copy Paste: [[2503.11044]] PSF-4D: A Progressive Sampling Framework for View Consistent 4D Editing(https://arxiv.org/abs/2503.11044)
Keywords: diffusion, generative
Abstract: Instruction-guided generative models, especially those using text-to-image (T2I) and text-to-video (T2V) diffusion frameworks, have advanced the field of content editing in recent years. To extend these capabilities to 4D scene, we introduce a progressive sampling framework for 4D editing (PSF-4D) that ensures temporal and multi-view consistency by intuitively controlling the noise initialization during forward diffusion. For temporal coherence, we design a correlated Gaussian noise structure that links frames over time, allowing each frame to depend meaningfully on prior frames. Additionally, to ensure spatial consistency across views, we implement a cross-view noise model, which uses shared and independent noise components to balance commonalities and distinct details among different views. To further enhance spatial coherence, PSF-4D incorporates view-consistent iterative refinement, embedding view-aware information into the denoising process to ensure aligned edits across frames and views. Our approach enables high-quality 4D editing without relying on external models, addressing key challenges in previous methods. Through extensive evaluation on multiple benchmarks and multiple editing aspects (e.g., style transfer, multi-attribute editing, object removal, local editing, etc.), we show the effectiveness of our proposed method. Experimental results demonstrate that our proposed method outperforms state-of-the-art 4D editing methods in diverse benchmarks.

Title: Measuring Similarity in Causal Graphs: A Framework for Semantic and Structural Analysis

Authors: Ning-Yuan Georgia Liu, Flower Yang, Mohammad S. Jalali
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.11046
Pdf URL: https://arxiv.org/pdf/2503.11046
Copy Paste: [[2503.11046]] Measuring Similarity in Causal Graphs: A Framework for Semantic and Structural Analysis(https://arxiv.org/abs/2503.11046)
Keywords: generative
Abstract: Causal graphs are commonly used to understand and model complex systems. Researchers often construct these graphs from different perspectives, leading to significant variations for the same problem. Comparing causal graphs is, therefore, essential for evaluating assumptions, integrating insights, and resolving disagreements. The rise of AI tools has further amplified this need, as they are increasingly used to generate hypothesized causal graphs by synthesizing information from various sources such as prior research and community inputs, providing the potential for automating and scaling causal modeling for complex systems. Similar to humans, these tools also produce inconsistent results across platforms, versions, and iterations. Despite its importance, research on causal graph comparison remains scarce. Existing methods often focus solely on structural similarities, assuming identical variable names, and fail to capture nuanced semantic relationships, which is essential for causal graph comparison. We address these gaps by investigating methods for comparing causal graphs from both semantic and structural perspectives. First, we reviewed over 40 existing metrics and, based on predefined criteria, selected nine for evaluation from two threads of machine learning: four semantic similarity metrics and five learning graph kernels. We discuss the usability of these metrics in simple examples to illustrate their strengths and limitations. We then generated a synthetic dataset of 2,000 causal graphs using generative AI based on a reference diagram. Our findings reveal that each metric captures a different aspect of similarity, highlighting the need to use multiple metrics.

Title: Towards Privacy-preserved Pre-training of Remote Sensing Foundation Models with Federated Mutual-guidance Learning

Authors: Jieyi Tan, Chengwei Zhang, Bo Dang, Yansheng Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11051
Pdf URL: https://arxiv.org/pdf/2503.11051
Copy Paste: [[2503.11051]] Towards Privacy-preserved Pre-training of Remote Sensing Foundation Models with Federated Mutual-guidance Learning(https://arxiv.org/abs/2503.11051)
Keywords: privacy, federate
Abstract: Traditional Remote Sensing Foundation models (RSFMs) are pre-trained with a data-centralized paradigm, through self-supervision on large-scale curated remote sensing data. For each institution, however, pre-training RSFMs with limited data in a standalone manner may lead to suboptimal performance, while aggregating remote sensing data from multiple institutions for centralized pre-training raises privacy concerns. Seeking for collaboration is a promising solution to resolve this dilemma, where multiple institutions can collaboratively train RSFMs without sharing private data. In this paper, we propose a novel privacy-preserved pre-training framework (FedSense), which enables multiple institutions to collaboratively train RSFMs without sharing private data. However, it is a non-trivial task hindered by a vicious cycle, which results from model drift by remote sensing data heterogeneity and high communication overhead. To break this vicious cycle, we introduce Federated Mutual-guidance Learning. Specifically, we propose a Server-to-Clients Guidance (SCG) mechanism to guide clients updates towards global-flatness optimal solutions. Additionally, we propose a Clients-to-Server Guidance (CSG) mechanism to inject local knowledge into the server by low-bit communication. Extensive experiments on four downstream tasks demonstrate the effectiveness of our FedSense in both full-precision and communication-reduced scenarios, showcasing remarkable communication efficiency and performance gains.

Title: Flow to the Mode: Mode-Seeking Diffusion Autoencoders for State-of-the-Art Image Tokenization

Authors: Kyle Sargent, Kyle Hsu, Justin Johnson, Li Fei-Fei, Jiajun Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11056
Pdf URL: https://arxiv.org/pdf/2503.11056
Copy Paste: [[2503.11056]] Flow to the Mode: Mode-Seeking Diffusion Autoencoders for State-of-the-Art Image Tokenization(https://arxiv.org/abs/2503.11056)
Keywords: diffusion, transformer, generative
Abstract: Since the advent of popular visual generation frameworks like VQGAN and latent diffusion models, state-of-the-art image generation systems have generally been two-stage systems that first tokenize or compress visual data into a lower-dimensional latent space before learning a generative model. Tokenizer training typically follows a standard recipe in which images are compressed and reconstructed subject to a combination of MSE, perceptual, and adversarial losses. Diffusion autoencoders have been proposed in prior work as a way to learn end-to-end perceptually-oriented image compression, but have not yet shown state-of-the-art performance on the competitive task of ImageNet-1K reconstruction. We propose FlowMo, a transformer-based diffusion autoencoder that achieves a new state-of-the-art for image tokenization at multiple compression rates without using convolutions, adversarial losses, spatially-aligned two-dimensional latent codes, or distilling from other tokenizers. Our key insight is that FlowMo training should be broken into a mode-matching pre-training stage and a mode-seeking post-training stage. In addition, we conduct extensive analyses and explore the training of generative models atop the FlowMo tokenizer. Our code and models will be available at this http URL .

Title: BannerAgency: Advertising Banner Design with Multimodal LLM Agents

Authors: Heng Wang, Yotaro Shimose, Shingo Takamatsu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11060
Pdf URL: https://arxiv.org/pdf/2503.11060
Copy Paste: [[2503.11060]] BannerAgency: Advertising Banner Design with Multimodal LLM Agents(https://arxiv.org/abs/2503.11060)
Keywords: large language model
Abstract: Advertising banners are critical for capturing user attention and enhancing advertising campaign effectiveness. Creating aesthetically pleasing banner designs while conveying the campaign messages is challenging due to the large search space involving multiple design elements. Additionally, advertisers need multiple sizes for different displays and various versions to target different sectors of audiences. Since design is intrinsically an iterative and subjective process, flexible editability is also in high demand for practical usage. While current models have served as assistants to human designers in various design tasks, they typically handle only segments of the creative design process or produce pixel-based outputs that limit editability. This paper introduces a training-free framework for fully automated banner ad design creation, enabling frontier multimodal large language models (MLLMs) to streamline the production of effective banners with minimal manual effort across diverse marketing contexts. We present BannerAgency, an MLLM agent system that collaborates with advertisers to understand their brand identity and banner objectives, generates matching background images, creates blueprints for foreground design elements, and renders the final creatives as editable components in Figma or SVG formats rather than static pixels. To facilitate evaluation and future research, we introduce BannerRequest400, a benchmark featuring 100 unique logos paired with 400 diverse banner requests. Through quantitative and qualitative evaluations, we demonstrate the framework's effectiveness, emphasizing the quality of the generated banner designs, their adaptability to various banner requests, and their strong editability enabled by this component-based approach.

Title: Generative Modelling for Mathematical Discovery

Authors: Jordan S. Ellenberg, Cristofero S. Fraser-Taliente, Thomas R. Harvey, Karan Srivastava, Andrew V. Sutherland
Subjects: cs.LG, math.CO
Abstract URL: https://arxiv.org/abs/2503.11061
Pdf URL: https://arxiv.org/pdf/2503.11061
Copy Paste: [[2503.11061]] Generative Modelling for Mathematical Discovery(https://arxiv.org/abs/2503.11061)
Keywords: generative, large language model
Abstract: We present a new implementation of the LLM-driven genetic algorithm {\it funsearch}, whose aim is to generate examples of interest to mathematicians and which has already had some success in problems in extremal combinatorics. Our implementation is designed to be useful in practice for working mathematicians; it does not require expertise in machine learning or access to high-performance computing resources. Applying {\it funsearch} to a new problem involves modifying a small segment of Python code and selecting a large language model (LLM) from one of many third-party providers. We benchmarked our implementation on three different problems, obtaining metrics that may inform applications of {\it funsearch} to new problems. Our results demonstrate that {\it funsearch} successfully learns in a variety of combinatorial and number-theoretic settings, and in some contexts learns principles that generalize beyond the problem originally trained on.

Title: Falcon: A Remote Sensing Vision-Language Foundation Model

Authors: Kelu Yao, Nuo Xu, Rong Yang, Yingying Xu, Zhuoyan Gao, Titinunt Kitrungrotsakul, Yi Ren, Pu Zhang, Jin Wang, Ning Wei, Chao Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11070
Pdf URL: https://arxiv.org/pdf/2503.11070
Copy Paste: [[2503.11070]] Falcon: A Remote Sensing Vision-Language Foundation Model(https://arxiv.org/abs/2503.11070)
Keywords: segmentation
Abstract: This paper introduces a holistic vision-language foundation model tailored for remote sensing, named Falcon. Falcon offers a unified, prompt-based paradigm that effectively executes comprehensive and complex remote sensing tasks. Falcon demonstrates powerful understanding and reasoning abilities at the image, region, and pixel levels. Specifically, given simple natural language instructions and remote sensing images, Falcon can produce impressive results in text form across 14 distinct tasks, i.e., image classification, object detection, segmentation, image captioning, and etc. To facilitate Falcon's training and empower its representation capacity to encode rich spatial and semantic information, we developed Falcon_SFT, a large-scale, multi-task, instruction-tuning dataset in the field of remote sensing. The Falcon_SFT dataset consists of approximately 78 million high-quality data samples, covering 5.6 million multi-spatial resolution and multi-view remote sensing images with diverse instructions. It features hierarchical annotations and undergoes manual sampling verification to ensure high data quality and reliability. Extensive comparative experiments are conducted, which verify that Falcon achieves remarkable performance over 67 datasets and 14 tasks, despite having only 0.7B parameters. We release the complete dataset, code, and model weights at this https URL, hoping to help further develop the open-source community.

Title: Harnessing Frequency Spectrum Insights for Image Copyright Protection Against Diffusion Models

Authors: Zhenguang Liu, Chao Shuai, Shaojing Fan, Ziping Dong, Jinwu Hu, Zhongjie Ba, Kui Ren
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11071
Pdf URL: https://arxiv.org/pdf/2503.11071
Copy Paste: [[2503.11071]] Harnessing Frequency Spectrum Insights for Image Copyright Protection Against Diffusion Models(https://arxiv.org/abs/2503.11071)
Keywords: protect, robust, watermark, diffusion
Abstract: Diffusion models have achieved remarkable success in novel view synthesis, but their reliance on large, diverse, and often untraceable Web datasets has raised pressing concerns about image copyright protection. Current methods fall short in reliably identifying unauthorized image use, as they struggle to generalize across varied generation tasks and fail when the training dataset includes images from multiple sources with few identifiable (watermarked or poisoned) samples. In this paper, we present novel evidence that diffusion-generated images faithfully preserve the statistical properties of their training data, particularly reflected in their spectral features. Leveraging this insight, we introduce \emph{CoprGuard}, a robust frequency domain watermarking framework to safeguard against unauthorized image usage in diffusion model training and fine-tuning. CoprGuard demonstrates remarkable effectiveness against a wide range of models, from naive diffusion models to sophisticated text-to-image models, and is robust even when watermarked images comprise a mere 1\% of the training dataset. This robust and versatile approach empowers content owners to protect their intellectual property in the era of AI-driven image generation.

Title: Perceive, Understand and Restore: Real-World Image Super-Resolution with Autoregressive Multimodal Generative Models

Authors: Hongyang Wei, Shuaizheng Liu, Chun Yuan, Lei Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11073
Pdf URL: https://arxiv.org/pdf/2503.11073
Copy Paste: [[2503.11073]] Perceive, Understand and Restore: Real-World Image Super-Resolution with Autoregressive Multimodal Generative Models(https://arxiv.org/abs/2503.11073)
Keywords: robust, diffusion, generative
Abstract: By leveraging the generative priors from pre-trained text-to-image diffusion models, significant progress has been made in real-world image super-resolution (Real-ISR). However, these methods tend to generate inaccurate and unnatural reconstructions in complex and/or heavily degraded scenes, primarily due to their limited perception and understanding capability of the input low-quality image. To address these limitations, we propose, for the first time to our knowledge, to adapt the pre-trained autoregressive multimodal model such as Lumina-mGPT into a robust Real-ISR model, namely PURE, which Perceives and Understands the input low-quality image, then REstores its high-quality counterpart. Specifically, we implement instruction tuning on Lumina-mGPT to perceive the image degradation level and the relationships between previously generated image tokens and the next token, understand the image content by generating image semantic descriptions, and consequently restore the image by generating high-quality image tokens autoregressively with the collected information. In addition, we reveal that the image token entropy reflects the image structure and present a entropy-based Top-k sampling strategy to optimize the local structure of the image during inference. Experimental results demonstrate that PURE preserves image content while generating realistic details, especially in complex scenes with multiple objects, showcasing the potential of autoregressive multimodal generative models for robust Real-ISR. The model and code will be available at this https URL.

Title: Understanding Flatness in Generative Models: Its Role and Benefits

Authors: Taehwan Lee, Kyeongkook Seo, Jaejun Yoo, Sung Whan Yoon
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.11078
Pdf URL: https://arxiv.org/pdf/2503.11078
Copy Paste: [[2503.11078]] Understanding Flatness in Generative Models: Its Role and Benefits(https://arxiv.org/abs/2503.11078)
Keywords: robust, diffusion, generative
Abstract: Flat minima, known to enhance generalization and robustness in supervised learning, remain largely unexplored in generative models. In this work, we systematically investigate the role of loss surface flatness in generative models, both theoretically and empirically, with a particular focus on diffusion models. We establish a theoretical claim that flatter minima improve robustness against perturbations in target prior distributions, leading to benefits such as reduced exposure bias -- where errors in noise estimation accumulate over iterations -- and significantly improved resilience to model quantization, preserving generative performance even under strong quantization constraints. We further observe that Sharpness-Aware Minimization (SAM), which explicitly controls the degree of flatness, effectively enhances flatness in diffusion models, whereas other well-known methods such as Stochastic Weight Averaging (SWA) and Exponential Moving Average (EMA), which promote flatness indirectly via ensembling, are less effective. Through extensive experiments on CIFAR-10, LSUN Tower, and FFHQ, we demonstrate that flat minima in diffusion models indeed improves not only generative performance but also robustness.

Title: Aerial Vision-and-Language Navigation with Grid-based View Selection and Map Construction

Authors: Ganlong Zhao, Guanbin Li, Jia Pan, Yizhou Yu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11091
Pdf URL: https://arxiv.org/pdf/2503.11091
Copy Paste: [[2503.11091]] Aerial Vision-and-Language Navigation with Grid-based View Selection and Map Construction(https://arxiv.org/abs/2503.11091)
Keywords: transformer
Abstract: Aerial Vision-and-Language Navigation (Aerial VLN) aims to obtain an unmanned aerial vehicle agent to navigate aerial 3D environments following human instruction. Compared to ground-based VLN, aerial VLN requires the agent to decide the next action in both horizontal and vertical directions based on the first-person view observations. Previous methods struggle to perform well due to the longer navigation path, more complicated 3D scenes, and the neglect of the interplay between vertical and horizontal actions. In this paper, we propose a novel grid-based view selection framework that formulates aerial VLN action prediction as a grid-based view selection task, incorporating vertical action prediction in a manner that accounts for the coupling with horizontal actions, thereby enabling effective altitude adjustments. We further introduce a grid-based bird's eye view map for aerial space to fuse the visual information in the navigation history, provide contextual scene information, and mitigate the impact of obstacles. Finally, a cross-modal transformer is adopted to explicitly align the long navigation history with the instruction. We demonstrate the superiority of our method in extensive experiments.

Title: OmniDiff: A Comprehensive Benchmark for Fine-grained Image Difference Captioning

Authors: Yuan Liu, Saihui Hou, Saijie Hou, Jiabao Du, Shibei Meng, Yongzhen Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11093
Pdf URL: https://arxiv.org/pdf/2503.11093
Copy Paste: [[2503.11093]] OmniDiff: A Comprehensive Benchmark for Fine-grained Image Difference Captioning(https://arxiv.org/abs/2503.11093)
Keywords: large language model
Abstract: Image Difference Captioning (IDC) aims to generate natural language descriptions of subtle differences between image pairs, requiring both precise visual change localization and coherent semantic expression. Despite recent advancements, existing datasets often lack breadth and depth, limiting their applicability in complex and dynamic environments: (1) from a breadth perspective, current datasets are constrained to limited variations of objects in specific scenes, and (2) from a depth perspective, prior benchmarks often provide overly simplistic descriptions. To address these challenges, we introduce OmniDiff, a comprehensive dataset comprising 324 diverse scenarios-spanning real-world complex environments and 3D synthetic settings-with fine-grained human annotations averaging 60 words in length and covering 12 distinct change types. Building on this foundation, we propose M$^3$Diff, a MultiModal large language model enhanced by a plug-and-play Multi-scale Differential Perception (MDP) module. This module improves the model's ability to accurately identify and describe inter-image differences while maintaining the foundational model's generalization capabilities. With the addition of the OmniDiff dataset, M$^3$Diff achieves state-of-the-art performance across multiple benchmarks, including Spot-the-Diff, IEdit, CLEVR-Change, CLEVR-DC, and OmniDiff, demonstrating significant improvements in cross-scenario difference recognition accuracy compared to existing methods. The dataset, code, and models will be made publicly available to support further research.

Title: Open3DVQA: A Benchmark for Comprehensive Spatial Reasoning with Multimodal Large Language Model in Open Space

Authors: Weichen Zhan, Zile Zhou, Zhiheng Zheng, Chen Gao, Jinqiang Cui, Yong Li, Xinlei Chen, Xiao-Ping Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11094
Pdf URL: https://arxiv.org/pdf/2503.11094
Copy Paste: [[2503.11094]] Open3DVQA: A Benchmark for Comprehensive Spatial Reasoning with Multimodal Large Language Model in Open Space(https://arxiv.org/abs/2503.11094)
Keywords: large language model
Abstract: Spatial reasoning is a fundamental capability of embodied agents and has garnered widespread attention in the field of multimodal large language models (MLLMs). In this work, we propose a novel benchmark, Open3DVQA, to comprehensively evaluate the spatial reasoning capacities of current state-of-the-art (SOTA) foundation models in open 3D space. Open3DVQA consists of 9k VQA samples, collected using an efficient semi-automated tool in a high-fidelity urban simulator. We evaluate several SOTA MLLMs across various aspects of spatial reasoning, such as relative and absolute spatial relationships, situational reasoning, and object-centric spatial attributes. Our results reveal that: 1) MLLMs perform better at answering questions regarding relative spatial relationships than absolute spatial relationships, 2) MLLMs demonstrate similar spatial reasoning abilities for both egocentric and allocentric perspectives, and 3) Fine-tuning large models significantly improves their performance across different spatial reasoning tasks. We believe that our open-source data collection tools and in-depth analyses will inspire further research on MLLM spatial reasoning capabilities. The benchmark is available at this https URL.

Title: A Novel Decomposed Feature-Oriented Framework for Open-Set Semantic Segmentation on LiDAR Data

Authors: Wenbang Deng, Xieyuanli Chen, Qinghua Yu, Yunze He, Junhao Xiao, Huimin Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11097
Pdf URL: https://arxiv.org/pdf/2503.11097
Copy Paste: [[2503.11097]] A Novel Decomposed Feature-Oriented Framework for Open-Set Semantic Segmentation on LiDAR Data(https://arxiv.org/abs/2503.11097)
Keywords: segmentation
Abstract: Semantic segmentation is a key technique that enables mobile robots to understand and navigate surrounding environments autonomously. However, most existing works focus on segmenting known objects, overlooking the identification of unknown classes, which is common in real-world applications. In this paper, we propose a feature-oriented framework for open-set semantic segmentation on LiDAR data, capable of identifying unknown objects while retaining the ability to classify known ones. We design a decomposed dual-decoder network to simultaneously perform closed-set semantic segmentation and generate distinctive features for unknown objects. The network is trained with multi-objective loss functions to capture the characteristics of known and unknown objects. Using the extracted features, we introduce an anomaly detection mechanism to identify unknown objects. By integrating the results of close-set semantic segmentation and anomaly detection, we achieve effective feature-driven LiDAR open-set semantic segmentation. Evaluations on both SemanticKITTI and nuScenes datasets demonstrate that our proposed framework significantly outperforms state-of-the-art methods. The source code will be made publicly available at this https URL.

Title: Quantifying Interpretability in CLIP Models with Concept Consistency

Authors: Avinash Madasu, Vasudev Lal, Phillip Howard
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.11103
Pdf URL: https://arxiv.org/pdf/2503.11103
Copy Paste: [[2503.11103]] Quantifying Interpretability in CLIP Models with Concept Consistency(https://arxiv.org/abs/2503.11103)
Keywords: interpretability
Abstract: CLIP is one of the most popular foundational models and is heavily used for many vision-language tasks. However, little is known about the inner workings of CLIP. While recent work has proposed decomposition-based interpretability methods for identifying textual descriptions of attention heads in CLIP, the implications of conceptual consistency in these text labels on interpretability and model performance has not been explored. To bridge this gap, we study the conceptual consistency of text descriptions for attention heads in CLIP-like models. We conduct extensive experiments on six different models from OpenAI and OpenCLIP which vary by size, type of pre-training data and patch size. We propose Concept Consistency Score (CCS), a novel interpretability metric that measures how consistently individual attention heads in CLIP models align with specific concepts. To assign concept labels to heads, we use in-context learning with ChatGPT, guided by a few manually-curated examples, and validate these labels using an LLM-as-a-judge approach. Our soft-pruning experiments reveal that high CCS heads are critical for preserving model performance, as pruning them leads to a significantly larger performance drop than pruning random or low CCS heads. Notably, we find that high CCS heads capture essential concepts and play a key role in out-of-domain detection, concept-specific reasoning, and video-language understanding. These results position CCS as a powerful interpretability metric for analyzing CLIP-like models.

Title: Limits of KV Cache Compression for Tensor Attention based Autoregressive Transformers

Authors: Yifang Chen, Xiaoyu Li, Yingyu Liang, Zhenmei Shi, Zhao Song, Yu Tian
Subjects: cs.LG, cs.AI, cs.CC, cs.CL
Abstract URL: https://arxiv.org/abs/2503.11108
Pdf URL: https://arxiv.org/pdf/2503.11108
Copy Paste: [[2503.11108]] Limits of KV Cache Compression for Tensor Attention based Autoregressive Transformers(https://arxiv.org/abs/2503.11108)
Keywords: transformer, large language model
Abstract: The key-value (KV) cache in autoregressive transformers presents a significant bottleneck during inference, which restricts the context length capabilities of large language models (LLMs). While previous work analyzes the fundamental space complexity barriers in standard attention mechanism [Haris and Onak, 2025], our work generalizes the space complexity barriers result to tensor attention version. Our theoretical contributions rely on a novel reduction from communication complexity and deduce the memory lower bound for tensor-structured attention mechanisms when $d = \Omega(\log n)$. In the low dimensional regime where $d = o(\log n)$, we analyze the theoretical bounds of the space complexity as well. Overall, our work provides a theoretical foundation for us to understand the compression-expressivity tradeoff in tensor attention mechanisms and offers more perspectives in developing more memory-efficient transformer architectures.

Title: Solution for 8th Competition on Affective & Behavior Analysis in-the-wild

Authors: Jun Yu, Yunxiang Zhang, Xilong Lu, Yang Zheng, Yongqi Wang, Lingsi Zhu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11115
Pdf URL: https://arxiv.org/pdf/2503.11115
Copy Paste: [[2503.11115]] Solution for 8th Competition on Affective & Behavior Analysis in-the-wild(https://arxiv.org/abs/2503.11115)
Keywords: robust, transformer
Abstract: In this report, we present our solution for the Action Unit (AU) Detection Challenge, in 8th Competition on Affective Behavior Analysis in-the-wild. In order to achieve robust and accurate classification of facial action unit in the wild environment, we introduce an innovative method that leverages audio-visual multimodal data. Our method employs ConvNeXt as the image encoder and uses Whisper to extract Mel spectrogram features. For these features, we utilize a Transformer encoder-based feature fusion module to integrate the affective information embedded in audio and image features. This ensures the provision of rich high-dimensional feature representations for the subsequent multilayer perceptron (MLP) trained on the Aff-Wild2 dataset, enhancing the accuracy of AU detection.

Title: UMB@PerAnsSumm 2025: Enhancing Perspective-Aware Summarization with Prompt Optimization and Supervised Fine-Tuning

Authors: Kristin Qi, Youxiang Zhu, Xiaohui Liang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.11118
Pdf URL: https://arxiv.org/pdf/2503.11118
Copy Paste: [[2503.11118]] UMB@PerAnsSumm 2025: Enhancing Perspective-Aware Summarization with Prompt Optimization and Supervised Fine-Tuning(https://arxiv.org/abs/2503.11118)
Keywords: transformer
Abstract: We present our approach to the PerAnsSumm Shared Task, which involves perspective span identification and perspective-aware summarization in community question-answering (CQA) threads. For span identification, we adopt ensemble learning that integrates three transformer models through averaging to exploit individual model strengths, achieving an 82.91% F1-score on test data. For summarization, we design a suite of Chain-of-Thought (CoT) prompting strategies that incorporate keyphrases and guide information to structure summary generation into manageable steps. To further enhance summary quality, we apply prompt optimization using the DSPy framework and supervised fine-tuning (SFT) on Llama-3 to adapt the model to domain-specific data. Experimental results on validation and test sets show that structured prompts with keyphrases and guidance improve summaries aligned with references, while the combination of prompt optimization and fine-tuning together yields significant improvement in both relevance and factuality evaluation metrics.

Title: A Multi-Objective Evaluation Framework for Analyzing Utility-Fairness Trade-Offs in Machine Learning Systems

Authors: Gökhan Özbulak, Oscar Jimenez-del-Toro, Maíra Fatoretto, Lilian Berton, André Anjos
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2503.11120
Pdf URL: https://arxiv.org/pdf/2503.11120
Copy Paste: [[2503.11120]] A Multi-Objective Evaluation Framework for Analyzing Utility-Fairness Trade-Offs in Machine Learning Systems(https://arxiv.org/abs/2503.11120)
Keywords: fair
Abstract: The evaluation of fairness models in Machine Learning involves complex challenges, such as defining appropriate metrics, balancing trade-offs between utility and fairness, and there are still gaps in this stage. This work presents a novel multi-objective evaluation framework that enables the analysis of utility-fairness trade-offs in Machine Learning systems. The framework was developed using criteria from Multi-Objective Optimization that collect comprehensive information regarding this complex evaluation task. The assessment of multiple Machine Learning systems is summarized, both quantitatively and qualitatively, in a straightforward manner through a radar chart and a measurement table encompassing various aspects such as convergence, system capacity, and diversity. The framework's compact representation of performance facilitates the comparative analysis of different Machine Learning strategies for decision-makers, in real-world applications, with single or multiple fairness requirements. The framework is model-agnostic and flexible to be adapted to any kind of Machine Learning systems, that is, black- or white-box, any kind and quantity of evaluation metrics, including multidimensional fairness criteria. The functionality and effectiveness of the proposed framework is shown with different simulations, and an empirical study conducted on a real-world dataset with various Machine Learning systems.

Title: DriveGEN: Generalized and Robust 3D Detection in Driving via Controllable Text-to-Image Diffusion Generation

Authors: Hongbin Lin, Zilu Guo, Yifan Zhang, Shuaicheng Niu, Yafeng Li, Ruimao Zhang, Shuguang Cui, Zhen Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11122
Pdf URL: https://arxiv.org/pdf/2503.11122
Copy Paste: [[2503.11122]] DriveGEN: Generalized and Robust 3D Detection in Driving via Controllable Text-to-Image Diffusion Generation(https://arxiv.org/abs/2503.11122)
Keywords: robust, extraction, diffusion
Abstract: In autonomous driving, vision-centric 3D detection aims to identify 3D objects from images. However, high data collection costs and diverse real-world scenarios limit the scale of training data. Once distribution shifts occur between training and test data, existing methods often suffer from performance degradation, known as Out-of-Distribution (OOD) problems. To address this, controllable Text-to-Image (T2I) diffusion offers a potential solution for training data enhancement, which is required to generate diverse OOD scenarios with precise 3D object geometry. Nevertheless, existing controllable T2I approaches are restricted by the limited scale of training data or struggle to preserve all annotated 3D objects. In this paper, we present DriveGEN, a method designed to improve the robustness of 3D detectors in Driving via Training-Free Controllable Text-to-Image Diffusion Generation. Without extra diffusion model training, DriveGEN consistently preserves objects with precise 3D geometry across diverse OOD generations, consisting of 2 stages: 1) Self-Prototype Extraction: We empirically find that self-attention features are semantic-aware but require accurate region selection for 3D objects. Thus, we extract precise object features via layouts to capture 3D object geometry, termed self-prototypes. 2) Prototype-Guided Diffusion: To preserve objects across various OOD scenarios, we perform semantic-aware feature alignment and shallow feature alignment during denoising. Extensive experiments demonstrate the effectiveness of DriveGEN in improving 3D detection. The code is available at this https URL.

Title: Context-Aware Rule Mining Using a Dynamic Transformer-Based Framework

Authors: Jie Liu, Yiwei Zhang, Yuan Sheng, Yujia Lou, Haige Wang, Bohuan Yang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.11125
Pdf URL: https://arxiv.org/pdf/2503.11125
Copy Paste: [[2503.11125]] Context-Aware Rule Mining Using a Dynamic Transformer-Based Framework(https://arxiv.org/abs/2503.11125)
Keywords: transformer
Abstract: This study proposes a dynamic rule data mining algorithm based on an improved Transformer architecture, aiming to improve the accuracy and efficiency of rule mining in a dynamic data environment. With the increase in data volume and complexity, traditional data mining methods are difficult to cope with dynamic data with strong temporal and variable characteristics, so new algorithms are needed to capture the temporal regularity in the data. By improving the Transformer architecture, and introducing a dynamic weight adjustment mechanism and a temporal dependency module, we enable the model to adapt to data changes and mine more accurate rules. Experimental results show that compared with traditional rule mining algorithms, the improved Transformer model has achieved significant improvements in rule mining accuracy, coverage, and stability. The contribution of each module in the algorithm performance is further verified by ablation experiments, proving the importance of temporal dependency and dynamic weight adjustment mechanisms in improving the model effect. In addition, although the improved model has certain challenges in computational efficiency, its advantages in accuracy and coverage enable it to perform well in processing complex dynamic data. Future research will focus on optimizing computational efficiency and combining more deep learning technologies to expand the application scope of the algorithm, especially in practical applications in the fields of finance, medical care, and intelligent recommendation.

Title: Don't Forget It! Conditional Sparse Autoencoder Clamping Works for Unlearning

Authors: Matthew Khoriaty (1), Andrii Shportko (1), Gustavo Mercier (1), Zach Wood-Doughty (1) ((1) Northwestern University)
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.11127
Pdf URL: https://arxiv.org/pdf/2503.11127
Copy Paste: [[2503.11127]] Don't Forget It! Conditional Sparse Autoencoder Clamping Works for Unlearning(https://arxiv.org/abs/2503.11127)
Keywords: attack, large language model
Abstract: Recent developments in Large Language Model (LLM) capabilities have brought great potential but also posed new risks. For example, LLMs with knowledge of bioweapons, advanced chemistry, or cyberattacks could cause violence if placed in the wrong hands or during malfunctions. Because of their nature as near-black boxes, intuitive interpretation of LLM internals remains an open research question, preventing developers from easily controlling model behavior and capabilities. The use of Sparse Autoencoders (SAEs) has recently emerged as a potential method of unraveling representations of concepts in LLMs internals, and has allowed developers to steer model outputs by directly modifying the hidden activations. In this paper, we use SAEs to identify unwanted concepts from the Weapons of Mass Destruction Proxy (WMDP) dataset within gemma-2-2b internals and use feature steering to reduce the model's ability to answer harmful questions while retaining its performance on harmless queries. Our results bring back optimism to the viability of SAE-based explicit knowledge unlearning techniques.

Title: X-EcoMLA: Upcycling Pre-Trained Attention into MLA for Efficient and Extreme KV Compression

Authors: Guihong Li, Mehdi Rezagholizadeh, Mingyu Yang, Vikram Appia, Emad Barsoum
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.11132
Pdf URL: https://arxiv.org/pdf/2503.11132
Copy Paste: [[2503.11132]] X-EcoMLA: Upcycling Pre-Trained Attention into MLA for Efficient and Extreme KV Compression(https://arxiv.org/abs/2503.11132)
Keywords: transformer
Abstract: Multi-head latent attention (MLA) is designed to optimize KV cache memory through low-rank key-value joint compression. Rather than caching keys and values separately, MLA stores their compressed latent representations, reducing memory overhead while maintaining the performance. While MLA improves memory efficiency without compromising language model accuracy, its major limitation lies in its integration during the pre-training phase, requiring models to be trained from scratch. This raises a key question: can we use MLA's benefits fully or partially in models that have already been pre-trained with different attention mechanisms? In this paper, we propose X-EcoMLA to deploy post training distillation to enable the upcycling of Transformer-based attention into an efficient hybrid (i.e., combination of regular attention and MLA layers) or full MLA variant through lightweight post-training adaptation, bypassing the need for extensive pre-training. We demonstrate that leveraging the dark knowledge of a well-trained model can enhance training accuracy and enable extreme KV cache compression in MLA without compromising model performance. Our results show that using an 8B teacher model allows us to compress the KV cache size of the Llama3.2-1B-Inst baseline by 6.4x while preserving 100% of its average score across multiple tasks on the LM Harness Evaluation benchmark. This is achieved with only 3.6B training tokens and about 70 GPU hours on AMD MI300 GPUs, compared to the 370K GPU hours required for pre-training the Llama3.2-1B model.

Title: SpaceSeg: A High-Precision Intelligent Perception Segmentation Method for Multi-Spacecraft On-Orbit Targets

Authors: Hao Liu, Pengyu Guo, Siyuan Yang, Zeqing Jiang, Qinglei Hu, Dongyu Li
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2503.11133
Pdf URL: https://arxiv.org/pdf/2503.11133
Copy Paste: [[2503.11133]] SpaceSeg: A High-Precision Intelligent Perception Segmentation Method for Multi-Spacecraft On-Orbit Targets(https://arxiv.org/abs/2503.11133)
Keywords: robust, segmentation
Abstract: With the continuous advancement of human exploration into deep space, intelligent perception and high-precision segmentation technology for on-orbit multi-spacecraft targets have become critical factors for ensuring the success of modern space missions. However, the complex deep space environment, diverse imaging conditions, and high variability in spacecraft morphology pose significant challenges to traditional segmentation methods. This paper proposes SpaceSeg, an innovative vision foundation model-based segmentation framework with four core technical innovations: First, the Multi-Scale Hierarchical Attention Refinement Decoder (MSHARD) achieves high-precision feature decoding through cross-resolution feature fusion via hierarchical attention. Second, the Multi-spacecraft Connected Component Analysis (MS-CCA) effectively resolves topological structure confusion in dense targets. Third, the Spatial Domain Adaptation Transform framework (SDAT) eliminates cross-domain disparities and resist spatial sensor perturbations through composite enhancement strategies. Finally, a custom Multi-Spacecraft Segmentation Task Loss Function is created to significantly improve segmentation robustness in deep space scenarios. To support algorithm validation, we construct the first multi-scale on-orbit multi-spacecraft semantic segmentation dataset SpaceES, which covers four types of spatial backgrounds and 17 typical spacecraft targets. In testing, SpaceSeg achieves state-of-the-art performance with 89.87$\%$ mIoU and 99.98$\%$ mAcc, surpassing existing best methods by 5.71 percentage points. The dataset and code are open-sourced at this https URL to provide critical technical support for next-generation space situational awareness systems.

Title: Minding Fuzzy Regions: A Data-driven Alternating Learning Paradigm for Stable Lesion Segmentation

Authors: Lexin Fang, Yunyang Xu, Xiang Ma, Xuemei Li, Caiming Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11140
Pdf URL: https://arxiv.org/pdf/2503.11140
Copy Paste: [[2503.11140]] Minding Fuzzy Regions: A Data-driven Alternating Learning Paradigm for Stable Lesion Segmentation(https://arxiv.org/abs/2503.11140)
Keywords: segmentation
Abstract: Deep learning has achieved significant advancements in medical image segmentation, but existing models still face challenges in accurately segmenting lesion regions. The main reason is that some lesion regions in medical images have unclear boundaries, irregular shapes, and small tissue density differences, leading to label ambiguity. However, the existing model treats all data equally without taking quality differences into account in the training process, resulting in noisy labels negatively impacting model training and unstable feature representations. In this paper, a data-driven alternating learning (DALE) paradigm is proposed to optimize the model's training process, achieving stable and high-precision segmentation. The paradigm focuses on two key points: (1) reducing the impact of noisy labels, and (2) calibrating unstable representations. To mitigate the negative impact of noisy labels, a loss consistency-based collaborative optimization method is proposed, and its effectiveness is theoretically demonstrated. Specifically, the label confidence parameters are introduced to dynamically adjust the influence of labels of different confidence levels during model training, thus reducing the influence of noise labels. To calibrate the learning bias of unstable representations, a distribution alignment method is proposed. This method restores the underlying distribution of unstable representations, thereby enhancing the discriminative capability of fuzzy region representations. Extensive experiments on various benchmarks and model backbones demonstrate the superiority of the DALE paradigm, achieving an average performance improvement of up to 7.16%.

Title: GaussianIP: Identity-Preserving Realistic 3D Human Generation via Human-Centric Diffusion Prior

Authors: Zichen Tang, Yuan Yao, Miaomiao Cui, Liefeng Bo, Hongyu Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11143
Pdf URL: https://arxiv.org/pdf/2503.11143
Copy Paste: [[2503.11143]] GaussianIP: Identity-Preserving Realistic 3D Human Generation via Human-Centric Diffusion Prior(https://arxiv.org/abs/2503.11143)
Keywords: diffusion
Abstract: Text-guided 3D human generation has advanced with the development of efficient 3D representations and 2D-lifting methods like Score Distillation Sampling (SDS). However, current methods suffer from prolonged training times and often produce results that lack fine facial and garment details. In this paper, we propose GaussianIP, an effective two-stage framework for generating identity-preserving realistic 3D humans from text and image prompts. Our core insight is to leverage human-centric knowledge to facilitate the generation process. In stage 1, we propose a novel Adaptive Human Distillation Sampling (AHDS) method to rapidly generate a 3D human that maintains high identity consistency with the image prompt and achieves a realistic appearance. Compared to traditional SDS methods, AHDS better aligns with the human-centric generation process, enhancing visual quality with notably fewer training steps. To further improve the visual quality of the face and clothes regions, we design a View-Consistent Refinement (VCR) strategy in stage 2. Specifically, it produces detail-enhanced results of the multi-view images from stage 1 iteratively, ensuring the 3D texture consistency across views via mutual attention and distance-guided attention fusion. Then a polished version of the 3D human can be achieved by directly perform reconstruction with the refined images. Extensive experiments demonstrate that GaussianIP outperforms existing methods in both visual quality and training efficiency, particularly in generating identity-preserving results. Our code is available at: this https URL.

Title: Layer-wise Update Aggregation with Recycling for Communication-Efficient Federated Learning

Authors: Jisoo Kim, Sungmin Kang, Sunwoo Lee
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.11146
Pdf URL: https://arxiv.org/pdf/2503.11146
Copy Paste: [[2503.11146]] Layer-wise Update Aggregation with Recycling for Communication-Efficient Federated Learning(https://arxiv.org/abs/2503.11146)
Keywords: federate
Abstract: Expensive communication cost is a common performance bottleneck in Federated Learning (FL), which makes it less appealing in real-world applications. Many communication-efficient FL methods focus on discarding a part of model updates mostly based on gradient magnitude. In this study, we find that recycling previous updates, rather than simply dropping them, more effectively reduces the communication cost while maintaining FL performance. We propose FedLUAR, a Layer-wise Update Aggregation with Recycling scheme for communication-efficient FL. We first define a useful metric that quantifies the extent to which the aggregated gradients influences the model parameter values in each layer. FedLUAR selects a few layers based on the metric and recycles their previous updates on the server side. Our extensive empirical study demonstrates that the update recycling scheme significantly reduces the communication cost while maintaining model accuracy. For example, our method achieves nearly the same AG News accuracy as FedAvg, while reducing the communication cost to just 17%.

Title: Asynchronous Sharpness-Aware Minimization For Fast and Accurate Deep Learning

Authors: Junhyuk Jo, Jihyun Lim, Sunwoo Lee
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.11147
Pdf URL: https://arxiv.org/pdf/2503.11147
Copy Paste: [[2503.11147]] Asynchronous Sharpness-Aware Minimization For Fast and Accurate Deep Learning(https://arxiv.org/abs/2503.11147)
Keywords: transformer
Abstract: Sharpness-Aware Minimization (SAM) is an optimization method that improves generalization performance of machine learning models. Despite its superior generalization, SAM has not been actively used in real-world applications due to its expensive computational cost. In this work, we propose a novel asynchronous-parallel SAM which achieves nearly the same gradient norm penalizing effect like the original SAM while breaking the data dependency between the model perturbation and the model update. The proposed asynchronous SAM can even entirely hide the model perturbation time by adjusting the batch size for the model perturbation in a system-aware manner. Thus, the proposed method enables to fully utilize heterogeneous system resources such as CPUs and GPUs. Our extensive experiments well demonstrate the practical benefits of the proposed asynchronous approach. E.g., the asynchronous SAM achieves comparable Vision Transformer fine-tuning accuracy (CIFAR-100) as the original SAM while having almost the same training time as SGD.

Title: Enabling Weak Client Participation via On-device Knowledge Distillation in Heterogenous Federated Learning

Authors: Jihyun Lim, Junhyuk Jo, Tuo Zhang, Salman Avestimehr, Sunwoo Lee
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.11151
Pdf URL: https://arxiv.org/pdf/2503.11151
Copy Paste: [[2503.11151]] Enabling Weak Client Participation via On-device Knowledge Distillation in Heterogenous Federated Learning(https://arxiv.org/abs/2503.11151)
Keywords: federate
Abstract: Online Knowledge Distillation (KD) is recently highlighted to train large models in Federated Learning (FL) environments. Many existing studies adopt the logit ensemble method to perform KD on the server side. However, they often assume that unlabeled data collected at the edge is centralized on the server. Moreover, the logit ensemble method personalizes local models, which can degrade the quality of soft targets, especially when data is highly non-IID. To address these critical limitations,we propose a novel on-device KD-based heterogeneous FL method. Our approach leverages a small auxiliary model to learn from labeled local data. Subsequently, a subset of clients with strong system resources transfers knowledge to a large model through on-device KD using their unlabeled data. Our extensive experiments demonstrate that our on-device KD-based heterogeneous FL method effectively utilizes the system resources of all edge devices as well as the unlabeled data, resulting in higher accuracy compared to SOTA KD-based FL methods.

Title: Don't Take Things Out of Context: Attention Intervention for Enhancing Chain-of-Thought Reasoning in Large Language Models

Authors: Shaotian Yan, Chen Shen, Wenxiao Wang, Liang Xie, Junjie Liu, Jieping Ye
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.11154
Pdf URL: https://arxiv.org/pdf/2503.11154
Copy Paste: [[2503.11154]] Don't Take Things Out of Context: Attention Intervention for Enhancing Chain-of-Thought Reasoning in Large Language Models(https://arxiv.org/abs/2503.11154)
Keywords: large language model
Abstract: Few-shot Chain-of-Thought (CoT) significantly enhances the reasoning capabilities of large language models (LLMs), functioning as a whole to guide these models in generating reasoning steps toward final answers. However, we observe that isolated segments, words, or tokens within CoT demonstrations can unexpectedly disrupt the generation process of LLMs. The model may overly concentrate on certain local information present in the demonstration, introducing irrelevant noise into the reasoning process and potentially leading to incorrect answers. In this paper, we investigate the underlying mechanism of CoT through dynamically tracing and manipulating the inner workings of LLMs at each output step, which demonstrates that tokens exhibiting specific attention characteristics are more likely to induce the model to take things out of context; these tokens directly attend to the hidden states tied with prediction, without substantial integration of non-local information. Building upon these insights, we propose a Few-shot Attention Intervention method (FAI) that dynamically analyzes the attention patterns of demonstrations to accurately identify these tokens and subsequently make targeted adjustments to the attention weights to effectively suppress their distracting effect on LLMs. Comprehensive experiments across multiple benchmarks demonstrate consistent improvements over baseline methods, with a remarkable 5.91% improvement on the AQuA dataset, further highlighting the effectiveness of FAI.

Title: Unifying Perplexing Behaviors in Modified BP Attributions through Alignment Perspective

Authors: Guanhua Zheng, Jitao Sang, Changsheng Xu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.11160
Pdf URL: https://arxiv.org/pdf/2503.11160
Copy Paste: [[2503.11160]] Unifying Perplexing Behaviors in Modified BP Attributions through Alignment Perspective(https://arxiv.org/abs/2503.11160)
Keywords: interpretability
Abstract: Attributions aim to identify input pixels that are relevant to the decision-making process. A popular approach involves using modified backpropagation (BP) rules to reverse decisions, which improves interpretability compared to the original gradients. However, these methods lack a solid theoretical foundation and exhibit perplexing behaviors, such as reduced sensitivity to parameter randomization, raising concerns about their reliability and highlighting the need for theoretical justification. In this work, we present a unified theoretical framework for methods like GBP, RectGrad, LRP, and DTD, demonstrating that they achieve input alignment by combining the weights of activated neurons. This alignment improves the visualization quality and reduces sensitivity to weight randomization. Our contributions include: (1) Providing a unified explanation for multiple behaviors, rather than focusing on just one. (2) Accurately predicting novel behaviors. (3) Offering insights into decision-making processes, including layer-wise information changes and the relationship between attributions and model decisions.

Title: Towards Extreme Pruning of LLMs with Plug-and-Play Mixed Sparsity

Authors: Chi Xu, Gefei Zhang, Yantong Zhu, Luca Benini, Guosheng Hu, Yawei Li, Zhihong Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.11164
Pdf URL: https://arxiv.org/pdf/2503.11164
Copy Paste: [[2503.11164]] Towards Extreme Pruning of LLMs with Plug-and-Play Mixed Sparsity(https://arxiv.org/abs/2503.11164)
Keywords: large language model
Abstract: N:M structured pruning is essential for large language models (LLMs) because it can remove less important network weights and reduce the memory and computation requirements. Existing pruning methods mainly focus on designing metrics to measure the importance of network components to guide pruning. Apart from the impact of these metrics, we observe that different layers have different sensitivities over the network performance. Thus, we propose an efficient method based on the trace of Fisher Information Matrix (FIM) to quantitatively measure and verify the different sensitivities across layers. Based on this, we propose Mixed Sparsity Pruning (MSP) which uses a pruning-oriented evolutionary algorithm (EA) to determine the optimal sparsity levels for different layers. To guarantee fast convergence and achieve promising performance, we utilize efficient FIM-inspired layer-wise sensitivity to initialize the population of EA. In addition, our MSP can work as a plug-and-play module, ready to be integrated into existing pruning methods. Extensive experiments on LLaMA and LLaMA-2 on language modeling and zero-shot tasks demonstrate our superior performance. In particular, in extreme pruning ratio (e.g. 75%), our method significantly outperforms existing methods in terms of perplexity (PPL) by orders of magnitude (Figure 1).

Title: Neurons: Emulating the Human Visual Cortex Improves Fidelity and Interpretability in fMRI-to-Video Reconstruction

Authors: Haonan Wang, Qixiang Zhang, Lehan Wang, Xuanqi Huang, Xiaomeng Li
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.11167
Pdf URL: https://arxiv.org/pdf/2503.11167
Copy Paste: [[2503.11167]] Neurons: Emulating the Human Visual Cortex Improves Fidelity and Interpretability in fMRI-to-Video Reconstruction(https://arxiv.org/abs/2503.11167)
Keywords: robust, interpretability, diffusion, segmentation
Abstract: Decoding visual stimuli from neural activity is essential for understanding the human brain. While fMRI methods have successfully reconstructed static images, fMRI-to-video reconstruction faces challenges due to the need for capturing spatiotemporal dynamics like motion and scene transitions. Recent approaches have improved semantic and perceptual alignment but struggle to integrate coarse fMRI data with detailed visual features. Inspired by the hierarchical organization of the visual system, we propose NEURONS, a novel framework that decouples learning into four correlated sub-tasks: key object segmentation, concept recognition, scene description, and blurry video reconstruction. This approach simulates the visual cortex's functional specialization, allowing the model to capture diverse video content. In the inference stage, NEURONS generates robust conditioning signals for a pre-trained text-to-video diffusion model to reconstruct the videos. Extensive experiments demonstrate that NEURONS outperforms state-of-the-art baselines, achieving solid improvements in video consistency (26.6%) and semantic-level accuracy (19.1%). Notably, NEURONS shows a strong functional correlation with the visual cortex, highlighting its potential for brain-computer interfaces and clinical applications. Code and model weights will be available at: this https URL.

Title: Uncertainty-Aware Normal-Guided Gaussian Splatting for Surface Reconstruction from Sparse Image Sequences

Authors: Zhen Tan, Xieyuanli Chen, Jinpu Zhang, Lei Feng, Dewen Hu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11172
Pdf URL: https://arxiv.org/pdf/2503.11172
Copy Paste: [[2503.11172]] Uncertainty-Aware Normal-Guided Gaussian Splatting for Surface Reconstruction from Sparse Image Sequences(https://arxiv.org/abs/2503.11172)
Keywords: robust
Abstract: 3D Gaussian Splatting (3DGS) has achieved impressive rendering performance in novel view synthesis. However, its efficacy diminishes considerably in sparse image sequences, where inherent data sparsity amplifies geometric uncertainty during optimization. This often leads to convergence at suboptimal local minima, resulting in noticeable structural artifacts in the reconstructed this http URL mitigate these issues, we propose Uncertainty-aware Normal-Guided Gaussian Splatting (UNG-GS), a novel framework featuring an explicit Spatial Uncertainty Field (SUF) to quantify geometric uncertainty within the 3DGS pipeline. UNG-GS enables high-fidelity rendering and achieves high-precision reconstruction without relying on priors. Specifically, we first integrate Gaussian-based probabilistic modeling into the training of 3DGS to optimize the SUF, providing the model with adaptive error tolerance. An uncertainty-aware depth rendering strategy is then employed to weight depth contributions based on the SUF, effectively reducing noise while preserving fine details. Furthermore, an uncertainty-guided normal refinement method adjusts the influence of neighboring depth values in normal estimation, promoting robust results. Extensive experiments demonstrate that UNG-GS significantly outperforms state-of-the-art methods in both sparse and dense sequences. The code will be open-source.

Title: Multi-Stage Generative Upscaler: Reconstructing Football Broadcast Images via Diffusion Models

Authors: Luca Martini, Daniele Zolezzi, Saverio Iacono, Gianni Viardo Vercelli
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.11181
Pdf URL: https://arxiv.org/pdf/2503.11181
Copy Paste: [[2503.11181]] Multi-Stage Generative Upscaler: Reconstructing Football Broadcast Images via Diffusion Models(https://arxiv.org/abs/2503.11181)
Keywords: diffusion, generative
Abstract: The reconstruction of low-resolution football broadcast images presents a significant challenge in sports broadcasting, where detailed visuals are essential for analysis and audience engagement. This study introduces a multi-stage generative upscaling framework leveraging Diffusion Models to enhance degraded images, transforming inputs as small as $64 \times 64$ pixels into high-fidelity $1024 \times 1024$ outputs. By integrating an image-to-image pipeline, ControlNet conditioning, and LoRA fine-tuning, our approach surpasses traditional upscaling methods in restoring intricate textures and domain-specific elements such as player details and jersey logos. The custom LoRA is trained on a custom football dataset, ensuring adaptability to sports broadcast needs. Experimental results demonstrate substantial improvements over conventional models, with ControlNet refining fine details and LoRA enhancing task-specific elements. These findings highlight the potential of diffusion-based image reconstruction in sports media, paving the way for future applications in automated video enhancement and real-time sports analytics.

Title: Palette of Language Models: A Solver for Controlled Text Generation

Authors: Zhe Yang, Yi Huang, Yaqin Chen, Xiaoting Wu, Junlan Feng, Chao Deng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.11182
Pdf URL: https://arxiv.org/pdf/2503.11182
Copy Paste: [[2503.11182]] Palette of Language Models: A Solver for Controlled Text Generation(https://arxiv.org/abs/2503.11182)
Keywords: generative, large language model
Abstract: Recent advancements in large language models have revolutionized text generation with their remarkable capabilities. These models can produce controlled texts that closely adhere to specific requirements when prompted appropriately. However, designing an optimal prompt to control multiple attributes simultaneously can be challenging. A common approach is to linearly combine single-attribute models, but this strategy often overlooks attribute overlaps and can lead to conflicts. Therefore, we propose a novel combination strategy inspired by the Law of Total Probability and Conditional Mutual Information Minimization on generative language models. This method has been adapted for single-attribute control scenario and is termed the Palette of Language Models due to its theoretical linkage between attribute strength and generation style, akin to blending colors on an artist's palette. Moreover, positive correlation and attribute enhancement are advanced as theoretical properties to guide a rational combination strategy design. We conduct experiments on both single control and multiple control settings, and achieve surpassing results.

Title: Multimodal-Aware Fusion Network for Referring Remote Sensing Image Segmentation

Authors: Leideng Shi, Juan Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11183
Pdf URL: https://arxiv.org/pdf/2503.11183
Copy Paste: [[2503.11183]] Multimodal-Aware Fusion Network for Referring Remote Sensing Image Segmentation(https://arxiv.org/abs/2503.11183)
Keywords: transformer, segmentation
Abstract: Referring remote sensing image segmentation (RRSIS) is a novel visual task in remote sensing images segmentation, which aims to segment objects based on a given text description, with great significance in practical application. Previous studies fuse visual and linguistic modalities by explicit feature interaction, which fail to effectively excavate useful multimodal information from dual-branch encoder. In this letter, we design a multimodal-aware fusion network (MAFN) to achieve fine-grained alignment and fusion between the two modalities. We propose a correlation fusion module (CFM) to enhance multi-scale visual features by introducing adaptively noise in transformer, and integrate cross-modal aware features. In addition, MAFN employs multi-scale refinement convolution (MSRC) to adapt to the various orientations of objects at different scales to boost their representation ability to enhances segmentation accuracy. Extensive experiments have shown that MAFN is significantly more effective than the state of the art on RRSIS-D datasets. The source code is available at this https URL.

Title: Align in Depth: Defending Jailbreak Attacks via Progressive Answer Detoxification

Authors: Yingjie Zhang, Tong Liu, Zhe Zhao, Guozhu Meng, Kai Chen
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2503.11185
Pdf URL: https://arxiv.org/pdf/2503.11185
Copy Paste: [[2503.11185]] Align in Depth: Defending Jailbreak Attacks via Progressive Answer Detoxification(https://arxiv.org/abs/2503.11185)
Keywords: defense, attack, robust, large language model
Abstract: Large Language Models (LLMs) are vulnerable to jailbreak attacks, which use crafted prompts to elicit toxic responses. These attacks exploit LLMs' difficulty in dynamically detecting harmful intents during the generation process. Traditional safety alignment methods, often relying on the initial few generation steps, are ineffective due to limited computational budget. This paper proposes DEEPALIGN, a robust defense framework that fine-tunes LLMs to progressively detoxify generated content, significantly improving both the computational budget and effectiveness of mitigating harmful generation. Our approach uses a hybrid loss function operating on hidden states to directly improve LLMs' inherent awareness of toxity during generation. Furthermore, we redefine safe responses by generating semantically relevant answers to harmful queries, thereby increasing robustness against representation-mutation attacks. Evaluations across multiple LLMs demonstrate state-of-the-art defense performance against six different attack types, reducing Attack Success Rates by up to two orders of magnitude compared to previous state-of-the-art defense while preserving utility. This work advances LLM safety by addressing limitations of conventional alignment through dynamic, context-aware mitigation.

Title: FastVID: Dynamic Density Pruning for Fast Video Large Language Models

Authors: Leqi Shen, Guoqiang Gong, Tao He, Yifeng Zhang, Pengzhang Liu, Sicheng Zhao, Guiguang Ding
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11187
Pdf URL: https://arxiv.org/pdf/2503.11187
Copy Paste: [[2503.11187]] FastVID: Dynamic Density Pruning for Fast Video Large Language Models(https://arxiv.org/abs/2503.11187)
Keywords: large language model
Abstract: Video Large Language Models have shown impressive capabilities in video comprehension, yet their practical deployment is hindered by substantial inference costs caused by redundant video tokens. Existing pruning techniques fail to fully exploit the spatiotemporal redundancy inherent in video data. To bridge this gap, we perform a systematic analysis of video redundancy from two perspectives: temporal context and visual context. Leveraging this insight, we propose Dynamic Density Pruning for Fast Video LLMs termed FastVID. Specifically, FastVID dynamically partitions videos into temporally ordered segments to preserve temporal structure and applies a density-based token pruning strategy to maintain essential visual information. Our method significantly reduces computational overhead while maintaining temporal and visual integrity. Extensive evaluations show that FastVID achieves state-of-the-art performance across various short- and long-video benchmarks on leading Video LLMs, including LLaVA-OneVision and LLaVA-Video. Notably, FastVID effectively prunes 90% of video tokens while retaining 98.0% of LLaVA-OneVision's original performance. The code is available at this https URL.

Title: Provenance Detection for AI-Generated Images: Combining Perceptual Hashing, Homomorphic Encryption, and AI Detection Models

Authors: Shree Singhi, Aayan Yadav, Aayush Gupta, Shariar Ebrahimi, Parisa Hassanizadeh
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11195
Pdf URL: https://arxiv.org/pdf/2503.11195
Copy Paste: [[2503.11195]] Provenance Detection for AI-Generated Images: Combining Perceptual Hashing, Homomorphic Encryption, and AI Detection Models(https://arxiv.org/abs/2503.11195)
Keywords: secure, privacy, protect, robust, watermark
Abstract: As AI-generated sensitive images become more prevalent, identifying their source is crucial for distinguishing them from real images. Conventional image watermarking methods are vulnerable to common transformations like filters, lossy compression, and screenshots, often applied during social media sharing. Watermarks can also be faked or removed if models are open-sourced or leaked since images can be rewatermarked. We have developed a three-part framework for secure, transformation-resilient AI content provenance detection, to address these limitations. We develop an adversarially robust state-of-the-art perceptual hashing model, DinoHash, derived from DINOV2, which is robust to common transformations like filters, compression, and crops. Additionally, we integrate a Multi-Party Fully Homomorphic Encryption~(MP-FHE) scheme into our proposed framework to ensure the protection of both user queries and registry privacy. Furthermore, we improve previous work on AI-generated media detection. This approach is useful in cases where the content is absent from our registry. DinoHash significantly improves average bit accuracy by 12% over state-of-the-art watermarking and perceptual hashing methods while maintaining superior true positive rate (TPR) and false positive rate (FPR) tradeoffs across various transformations. Our AI-generated media detection results show a 25% improvement in classification accuracy on commonly used real-world AI image generators over existing algorithms. By combining perceptual hashing, MP-FHE, and an AI content detection model, our proposed framework provides better robustness and privacy compared to previous work.

Title: NF-SLAM: Effective, Normalizing Flow-supported Neural Field representations for object-level visual SLAM in automotive applications

Authors: Li Cui, Yang Ding, Richard Hartley, Zirui Xie, Laurent Kneip, Zhenghua Yu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11199
Pdf URL: https://arxiv.org/pdf/2503.11199
Copy Paste: [[2503.11199]] NF-SLAM: Effective, Normalizing Flow-supported Neural Field representations for object-level visual SLAM in automotive applications(https://arxiv.org/abs/2503.11199)
Keywords: extraction
Abstract: We propose a novel, vision-only object-level SLAM framework for automotive applications representing 3D shapes by implicit signed distance functions. Our key innovation consists of augmenting the standard neural representation by a normalizing flow network. As a result, achieving strong representation power on the specific class of road vehicles is made possible by compact networks with only 16-dimensional latent codes. Furthermore, the newly proposed architecture exhibits a significant performance improvement in the presence of only sparse and noisy data, which is demonstrated through comparative experiments on synthetic data. The module is embedded into the back-end of a stereo-vision based framework for joint, incremental shape optimization. The loss function is given by a combination of a sparse 3D point-based SDF loss, a sparse rendering loss, and a semantic mask-based silhouette-consistency term. We furthermore leverage semantic information to determine keypoint extraction density in the front-end. Finally, experimental results on real-world data reveal accurate and reliable performance comparable to alternative frameworks that make use of direct depth readings. The proposed method performs well with only sparse 3D points obtained from bundle adjustment, and eventually continues to deliver stable results even under exclusive use of the mask-consistency term.

Title: LLaVA-MLB: Mitigating and Leveraging Attention Bias for Training-Free Video LLMs

Authors: Leqi Shen, Tao He, Guoqiang Gong, Fan Yang, Yifeng Zhang, Pengzhang Liu, Sicheng Zhao, Guiguang Ding
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11205
Pdf URL: https://arxiv.org/pdf/2503.11205
Copy Paste: [[2503.11205]] LLaVA-MLB: Mitigating and Leveraging Attention Bias for Training-Free Video LLMs(https://arxiv.org/abs/2503.11205)
Keywords: large language model
Abstract: Training-free video large language models (LLMs) leverage pretrained Image LLMs to process video content without the need for further training. A key challenge in such approaches is the difficulty of retaining essential visual and temporal information, constrained by the token limits in Image LLMs. To address this, we propose a two-stage method for selecting query-relevant tokens based on the LLM attention scores: compressing the video sequence and then expanding the sequence. However, during the compression stage, Image LLMs often exhibit a positional attention bias in video sequences, where attention is overly concentrated on later frames, causing early-frame information to be underutilized. To alleviate this attention bias during sequence compression, we propose Gridded Attention Pooling for preserving spatiotemporal structure. Additionally, we introduce Visual Summarization Tail to effectively utilize this bias, facilitating overall video understanding during sequence expansion. In this way, our method effectively Mitigates and Leverages attention Bias (LLaVA-MLB), enabling the frozen Image LLM for detailed video understanding. Experiments on several benchmarks demonstrate that our approach outperforms state-of-the-art methods, achieving superior performance in both efficiency and accuracy. Our code will be released.

Title: Spatio-Temporal Graph Structure Learning for Earthquake Detection

Authors: Suchanun Piriyasatit, Ercan Engin Kuruoglu, Mehmet Sinan Ozeren
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.11215
Pdf URL: https://arxiv.org/pdf/2503.11215
Copy Paste: [[2503.11215]] Spatio-Temporal Graph Structure Learning for Earthquake Detection(https://arxiv.org/abs/2503.11215)
Keywords: robust
Abstract: Earthquake detection is essential for earthquake early warning (EEW) systems. Traditional methods struggle with low signal-to-noise ratios and single-station reliance, limiting their effectiveness. We propose a Spatio-Temporal Graph Convolutional Network (GCN) using Spectral Structure Learning Convolution (Spectral SLC) to model static and dynamic relationships across seismic stations. Our approach processes multi-station waveform data and generates station-specific detection probabilities. Experiments show superior performance over a conventional GCN baseline in terms of true positive rate (TPR) and false positive rate (FPR), highlighting its potential for robust multi-station earthquake detection. The code repository for this study is available at this https URL.

Title: Cross-Platform Benchmarking of the FHE Libraries: Novel Insights into SEAL and Openfhe

Authors: Faneela, Jawad Ahmad, Baraq Ghaleb, Sana Ullah Jan, William J. Buchanan
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2503.11216
Pdf URL: https://arxiv.org/pdf/2503.11216
Copy Paste: [[2503.11216]] Cross-Platform Benchmarking of the FHE Libraries: Novel Insights into SEAL and Openfhe(https://arxiv.org/abs/2503.11216)
Keywords: secure, privacy
Abstract: The rapid growth of cloud computing and data-driven applications has amplified privacy concerns, driven by the increasing demand to process sensitive data securely. Homomorphic encryption (HE) has become a vital solution for addressing these concerns by enabling computations on encrypted data without revealing its contents. This paper provides a comprehensive evaluation of two leading HE libraries, SEAL and OpenFHE, examining their performance, usability, and support for prominent HE schemes such as BGV and CKKS. Our analysis highlights computational efficiency, memory usage, and scalability across Linux and Windows platforms, emphasizing their applicability in real-world scenarios. Results reveal that Linux outperforms Windows in computation efficiency, with OpenFHE emerging as the optimal choice across diverse cryptographic settings. This paper provides valuable insights for researchers and practitioners to advance privacy-preserving applications using FHE.

Title: Optimal Transport and Adaptive Thresholding for Universal Domain Adaptation on Time Series

Authors: Romain Mussard, Fannia Pacheco, Maxime Berar, Gilles Gasso, Paul Honeine
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.11217
Pdf URL: https://arxiv.org/pdf/2503.11217
Copy Paste: [[2503.11217]] Optimal Transport and Adaptive Thresholding for Universal Domain Adaptation on Time Series(https://arxiv.org/abs/2503.11217)
Keywords: robust
Abstract: Universal Domain Adaptation (UniDA) aims to transfer knowledge from a labeled source domain to an unlabeled target domain, even when their classes are not fully shared. Few dedicated UniDA methods exist for Time Series (TS), which remains a challenging case. In general, UniDA approaches align common class samples and detect unknown target samples from emerging classes. Such detection often results from thresholding a discriminability metric. The threshold value is typically either a fine-tuned hyperparameter or a fixed value, which limits the ability of the model to adapt to new data. Furthermore, discriminability metrics exhibit overconfidence for unknown samples, leading to misclassifications. This paper introduces UniJDOT, an optimal-transport-based method that accounts for the unknown target samples in the transport cost. Our method also proposes a joint decision space to improve the discriminability of the detection module. In addition, we use an auto-thresholding algorithm to reduce the dependence on fixed or fine-tuned thresholds. Finally, we rely on a Fourier transform-based layer inspired by the Fourier Neural Operator for better TS representation. Experiments on TS benchmarks demonstrate the discriminability, robustness, and state-of-the-art performance of UniJDOT.

Title: Towards General Multimodal Visual Tracking

Authors: Andong Lu, Mai Wen, Jinhu Wang, Yuanzhi Guo, Chenglong Li, Jin Tang, Bin Luo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11218
Pdf URL: https://arxiv.org/pdf/2503.11218
Copy Paste: [[2503.11218]] Towards General Multimodal Visual Tracking(https://arxiv.org/abs/2503.11218)
Keywords: robust
Abstract: Existing multimodal tracking studies focus on bi-modal scenarios such as RGB-Thermal, RGB-Event, and RGB-Language. Although promising tracking performance is achieved through leveraging complementary cues from different sources, it remains challenging in complex scenes due to the limitations of bi-modal scenarios. In this work, we introduce a general multimodal visual tracking task that fully exploits the advantages of four modalities, including RGB, thermal infrared, event, and language, for robust tracking under challenging conditions. To provide a comprehensive evaluation platform for general multimodal visual tracking, we construct QuadTrack600, a large-scale, high-quality benchmark comprising 600 video sequences (totaling 384.7K high-resolution (640x480) frame groups). In each frame group, all four modalities are spatially aligned and meticulously annotated with bounding boxes, while 21 sequence-level challenge attributes are provided for detailed performance analysis. Despite quad-modal data provides richer information, the differences in information quantity among modalities and the computational burden from four modalities are two challenging issues in fusing four modalities. To handle these issues, we propose a novel approach called QuadFusion, which incorporates an efficient Multiscale Fusion Mamba with four different scanning scales to achieve sufficient interactions of the four modalities while overcoming the exponential computational burden, for general multimodal visual tracking. Extensive experiments on the QuadTrack600 dataset and three bi-modal tracking datasets, including LasHeR, VisEvent, and TNL2K, validate the effectiveness of our QuadFusion.

Title: MEET: A Million-Scale Dataset for Fine-Grained Geospatial Scene Classification with Zoom-Free Remote Sensing Imagery

Authors: Yansheng Li, Yuning Wu, Gong Cheng, Chao Tao, Bo Dang, Yu Wang, Jiahao Zhang, Chuge Zhang, Yiting Liu, Xu Tang, Jiayi Ma, Yongjun Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.11219
Pdf URL: https://arxiv.org/pdf/2503.11219
Copy Paste: [[2503.11219]] MEET: A Million-Scale Dataset for Fine-Grained Geospatial Scene Classification with Zoom-Free Remote Sensing Imagery(https://arxiv.org/abs/2503.11219)
Keywords: transformer
Abstract: Accurate fine-grained geospatial scene classification using remote sensing imagery is essential for a wide range of applications. However, existing approaches often rely on manually zooming remote sensing images at different scales to create typical scene samples. This approach fails to adequately support the fixed-resolution image interpretation requirements in real-world scenarios. To address this limitation, we introduce the Million-scale finE-grained geospatial scEne classification dataseT (MEET), which contains over 1.03 million zoom-free remote sensing scene samples, manually annotated into 80 fine-grained categories. In MEET, each scene sample follows a scene-inscene layout, where the central scene serves as the reference, and auxiliary scenes provide crucial spatial context for finegrained classification. Moreover, to tackle the emerging challenge of scene-in-scene classification, we present the Context-Aware Transformer (CAT), a model specifically designed for this task, which adaptively fuses spatial context to accurately classify the scene samples. CAT adaptively fuses spatial context to accurately classify the scene samples by learning attentional features that capture the relationships between the center and auxiliary scenes. Based on MEET, we establish a comprehensive benchmark for fine-grained geospatial scene classification, evaluating CAT against 11 competitive baselines. The results demonstrate that CAT significantly outperforms these baselines, achieving a 1.88% higher balanced accuracy (BA) with the Swin-Large backbone, and a notable 7.87% improvement with the Swin-Huge backbone. Further experiments validate the effectiveness of each module in CAT and show the practical applicability of CAT in the urban functional zone mapping. The source code and dataset will be publicly available at this https URL.

Title: Toward Generalized Image Quality Assessment: Relaxing the Perfect Reference Quality Assumption

Authors: Du Chen, Tianhe Wu, Kede Ma, Lei Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11221
Pdf URL: https://arxiv.org/pdf/2503.11221
Copy Paste: [[2503.11221]] Toward Generalized Image Quality Assessment: Relaxing the Perfect Reference Quality Assumption(https://arxiv.org/abs/2503.11221)
Keywords: diffusion, generative
Abstract: Full-reference image quality assessment (FR-IQA) generally assumes that reference images are of perfect quality. However, this assumption is flawed due to the sensor and optical limitations of modern imaging systems. Moreover, recent generative enhancement methods are capable of producing images of higher quality than their original. All of these challenge the effectiveness and applicability of current FR-IQA models. To relax the assumption of perfect reference image quality, we build a large-scale IQA database, namely DiffIQA, containing approximately 180,000 images generated by a diffusion-based image enhancer with adjustable hyper-parameters. Each image is annotated by human subjects as either worse, similar, or better quality compared to its reference. Building on this, we present a generalized FR-IQA model, namely Adaptive Fidelity-Naturalness Evaluator (A-FINE), to accurately assess and adaptively combine the fidelity and naturalness of a test image. A-FINE aligns well with standard FR-IQA when the reference image is much more natural than the test image. We demonstrate by extensive experiments that A-FINE surpasses standard FR-IQA models on well-established IQA datasets and our newly created DiffIQA. To further validate A-FINE, we additionally construct a super-resolution IQA benchmark (SRIQA-Bench), encompassing test images derived from ten state-of-the-art SR methods with reliable human quality annotations. Tests on SRIQA-Bench re-affirm the advantages of A-FINE. The code and dataset are available at this https URL.

Title: Technologies on Effectiveness and Efficiency: A Survey of State Spaces Models

Authors: Xingtai Lv, Youbang Sun, Kaiyan Zhang, Shang Qu, Xuekai Zhu, Yuchen Fan, Yi Wu, Ermo Hua, Xinwei Long, Ning Ding, Bowen Zhou
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2503.11224
Pdf URL: https://arxiv.org/pdf/2503.11224
Copy Paste: [[2503.11224]] Technologies on Effectiveness and Efficiency: A Survey of State Spaces Models(https://arxiv.org/abs/2503.11224)
Keywords: transformer
Abstract: State Space Models (SSMs) have emerged as a promising alternative to the popular transformer-based models and have been increasingly gaining attention. Compared to transformers, SSMs excel at tasks with sequential data or longer contexts, demonstrating comparable performances with significant efficiency gains. In this survey, we provide a coherent and systematic overview for SSMs, including their theoretical motivations, mathematical formulations, comparison with existing model classes, and various applications. We divide the SSM series into three main sections, providing a detailed introduction to the original SSM, the structured SSM represented by S4, and the selective SSM typified by Mamba. We put an emphasis on technicality, and highlight the various key techniques introduced to address the effectiveness and efficiency of SSMs. We hope this manuscript serves as an introduction for researchers to explore the theoretical foundations of SSMs.

Title: Non Line-of-Sight Optical Wireless Communication using Neuromorphic Cameras

Authors: Abbaas Alif Mohamed Nishar, Alireza Marefat, Ashwin Ashok
Subjects: cs.CV, cs.ET, cs.IT, cs.NI
Abstract URL: https://arxiv.org/abs/2503.11226
Pdf URL: https://arxiv.org/pdf/2503.11226
Copy Paste: [[2503.11226]] Non Line-of-Sight Optical Wireless Communication using Neuromorphic Cameras(https://arxiv.org/abs/2503.11226)
Keywords: robust
Abstract: Neuromorphic or event cameras, inspired by biological vision systems, capture changes in illumination with high temporal resolution and efficiency, producing streams of events rather than traditional images. In this paper, we explore the use of neuromorphic cameras for passive optical wireless communication (OWC), leveraging their asynchronous detection of illumination changes to decode data transmitted through reflections of light from objects. We propose a novel system that utilizes neuromorphic cameras for passive visible light communication (VLC), extending the concept to Non Line-of-Sight (NLoS) scenarios through passive reflections from everyday objects. Our experiments demonstrate the feasibility and advantages of using neuromorphic cameras for VLC, characterizing the performance of various modulation schemes, including traditional On-Off Keying (OOK) and advanced N-pulse modulation. We introduce an adaptive N-pulse modulation scheme that dynamically adjusts encoding based on the packet's bit composition, achieving higher data rates and robustness in different scenarios. Our results show that lighter-colored, glossy objects are better for NLoS communication, while larger objects and those with matte finishes experience higher error rates due to multipath reflections.

Title: PrivacyScalpel: Enhancing LLM Privacy via Interpretable Feature Intervention with Sparse Autoencoders

Authors: Ahmed Frikha, Muhammad Reza Ar Razi, Krishna Kanth Nakka, Ricardo Mendes, Xue Jiang, Xuebing Zhou
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2503.11232
Pdf URL: https://arxiv.org/pdf/2503.11232
Copy Paste: [[2503.11232]] PrivacyScalpel: Enhancing LLM Privacy via Interpretable Feature Intervention with Sparse Autoencoders(https://arxiv.org/abs/2503.11232)
Keywords: secure, privacy, interpretability, large language model
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language processing but also pose significant privacy risks by memorizing and leaking Personally Identifiable Information (PII). Existing mitigation strategies, such as differential privacy and neuron-level interventions, often degrade model utility or fail to effectively prevent leakage. To address this challenge, we introduce PrivacyScalpel, a novel privacy-preserving framework that leverages LLM interpretability techniques to identify and mitigate PII leakage while maintaining performance. PrivacyScalpel comprises three key steps: (1) Feature Probing, which identifies layers in the model that encode PII-rich representations, (2) Sparse Autoencoding, where a k-Sparse Autoencoder (k-SAE) disentangles and isolates privacy-sensitive features, and (3) Feature-Level Interventions, which employ targeted ablation and vector steering to suppress PII leakage. Our empirical evaluation on Gemma2-2b and Llama2-7b, fine-tuned on the Enron dataset, shows that PrivacyScalpel significantly reduces email leakage from 5.15\% to as low as 0.0\%, while maintaining over 99.4\% of the original model's utility. Notably, our method outperforms neuron-level interventions in privacy-utility trade-offs, demonstrating that acting on sparse, monosemantic features is more effective than manipulating polysemantic neurons. Beyond improving LLM privacy, our approach offers insights into the mechanisms underlying PII memorization, contributing to the broader field of model interpretability and secure AI deployment.

Title: Towards Better Alignment: Training Diffusion Models with Reinforcement Learning Against Sparse Rewards

Authors: Zijing Hu, Fengda Zhang, Long Chen, Kun Kuang, Jiahui Li, Kaifeng Gao, Jun Xiao, Xin Wang, Wenwu Zhu
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.11240
Pdf URL: https://arxiv.org/pdf/2503.11240
Copy Paste: [[2503.11240]] Towards Better Alignment: Training Diffusion Models with Reinforcement Learning Against Sparse Rewards(https://arxiv.org/abs/2503.11240)
Keywords: diffusion
Abstract: Diffusion models have achieved remarkable success in text-to-image generation. However, their practical applications are hindered by the misalignment between generated images and corresponding text prompts. To tackle this issue, reinforcement learning (RL) has been considered for diffusion model fine-tuning. Yet, RL's effectiveness is limited by the challenge of sparse reward, where feedback is only available at the end of the generation process. This makes it difficult to identify which actions during the denoising process contribute positively to the final generated image, potentially leading to ineffective or unnecessary denoising policies. To this end, this paper presents a novel RL-based framework that addresses the sparse reward problem when training diffusion models. Our framework, named $\text{B}^2\text{-DiffuRL}$, employs two strategies: \textbf{B}ackward progressive training and \textbf{B}ranch-based sampling. For one thing, backward progressive training focuses initially on the final timesteps of denoising process and gradually extends the training interval to earlier timesteps, easing the learning difficulty from sparse rewards. For another, we perform branch-based sampling for each training interval. By comparing the samples within the same branch, we can identify how much the policies of the current training interval contribute to the final image, which helps to learn effective policies instead of unnecessary ones. $\text{B}^2\text{-DiffuRL}$ is compatible with existing optimization algorithms. Extensive experiments demonstrate the effectiveness of $\text{B}^2\text{-DiffuRL}$ in improving prompt-image alignment and maintaining diversity in generated images. The code for this work is available.

Title: Breaking Shallow Limits: Task-Driven Pixel Fusion for Gap-free RGBT Tracking

Authors: Andong Lu, Yuanzhi Guo, Wanyu Wang, Chenglong Li, Jin Tang, Bin Luo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11247
Pdf URL: https://arxiv.org/pdf/2503.11247
Copy Paste: [[2503.11247]] Breaking Shallow Limits: Task-Driven Pixel Fusion for Gap-free RGBT Tracking(https://arxiv.org/abs/2503.11247)
Keywords: robust
Abstract: Current RGBT tracking methods often overlook the impact of fusion location on mitigating modality gap, which is key factor to effective tracking. Our analysis reveals that shallower fusion yields smaller distribution gap. However, the limited discriminative power of shallow networks hard to distinguish task-relevant information from noise, limiting the potential of pixel-level fusion. To break shallow limits, we propose a novel \textbf{T}ask-driven \textbf{P}ixel-level \textbf{F}usion network, named \textbf{TPF}, which unveils the power of pixel-level fusion in RGBT tracking through a progressive learning framework. In particular, we design a lightweight Pixel-level Fusion Adapter (PFA) that exploits Mamba's linear complexity to ensure real-time, low-latency RGBT tracking. To enhance the fusion capabilities of the PFA, our task-driven progressive learning framework first utilizes adaptive multi-expert distillation to inherits fusion knowledge from state-of-the-art image fusion models, establishing robust initialization, and then employs a decoupled representation learning scheme to achieve task-relevant information fusion. Moreover, to overcome appearance variations between the initial template and search frames, we presents a nearest-neighbor dynamic template updating scheme, which selects the most reliable frame closest to the current search frame as the dynamic template. Extensive experiments demonstrate that TPF significantly outperforms existing most of advanced trackers on four public RGBT tracking datasets. The code will be released upon acceptance.

Title: Reasoning-Grounded Natural Language Explanations for Language Models

Authors: Vojtech Cahlik, Rodrigo Alves, Pavel Kordik
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2503.11248
Pdf URL: https://arxiv.org/pdf/2503.11248
Copy Paste: [[2503.11248]] Reasoning-Grounded Natural Language Explanations for Language Models(https://arxiv.org/abs/2503.11248)
Keywords: explainability, large language model
Abstract: We propose a large language model explainability technique for obtaining faithful natural language explanations by grounding the explanations in a reasoning process. When converted to a sequence of tokens, the outputs of the reasoning process can become part of the model context and later be decoded to natural language as the model produces either the final answer or the explanation. To improve the faithfulness of the explanations, we propose to use a joint predict-explain approach, in which the answers and explanations are inferred directly from the reasoning sequence, without the explanations being dependent on the answers and vice versa. We demonstrate the plausibility of the proposed technique by achieving a high alignment between answers and explanations in several problem domains, observing that language models often simply copy the partial decisions from the reasoning sequence into the final answers or explanations. Furthermore, we show that the proposed use of reasoning can also improve the quality of the answers.

Title: Federated Koopman-Reservoir Learning for Large-Scale Multivariate Time-Series Anomaly Detection

Authors: Long Tan Le, Tung-Anh Nguyen, Han Shu, Suranga Seneviratne, Choong Seon Hong, Nguyen H. Tran
Subjects: cs.LG, cs.DC
Abstract URL: https://arxiv.org/abs/2503.11255
Pdf URL: https://arxiv.org/pdf/2503.11255
Copy Paste: [[2503.11255]] Federated Koopman-Reservoir Learning for Large-Scale Multivariate Time-Series Anomaly Detection(https://arxiv.org/abs/2503.11255)
Keywords: security, privacy, federate
Abstract: The proliferation of edge devices has dramatically increased the generation of multivariate time-series (MVTS) data, essential for applications from healthcare to smart cities. Such data streams, however, are vulnerable to anomalies that signal crucial problems like system failures or security incidents. Traditional MVTS anomaly detection methods, encompassing statistical and centralized machine learning approaches, struggle with the heterogeneity, variability, and privacy concerns of large-scale, distributed environments. In response, we introduce FedKO, a novel unsupervised Federated Learning framework that leverages the linear predictive capabilities of Koopman operator theory along with the dynamic adaptability of Reservoir Computing. This enables effective spatiotemporal processing and privacy preservation for MVTS data. FedKO is formulated as a bi-level optimization problem, utilizing a specific federated algorithm to explore a shared Reservoir-Koopman model across diverse datasets. Such a model is then deployable on edge devices for efficient detection of anomalies in local MVTS streams. Experimental results across various datasets showcase FedKO's superior performance against state-of-the-art methods in MVTS anomaly detection. Moreover, FedKO reduces up to 8x communication size and 2x memory usage, making it highly suitable for large-scale systems.

Title: Noise Synthesis for Low-Light Image Denoising with Diffusion Models

Authors: Liying Lu, Raphaël Achddou, Sabine Süsstrunk
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2503.11262
Pdf URL: https://arxiv.org/pdf/2503.11262
Copy Paste: [[2503.11262]] Noise Synthesis for Low-Light Image Denoising with Diffusion Models(https://arxiv.org/abs/2503.11262)
Keywords: diffusion
Abstract: Low-light photography produces images with low signal-to-noise ratios due to limited photons. In such conditions, common approximations like the Gaussian noise model fall short, and many denoising techniques fail to remove noise effectively. Although deep-learning methods perform well, they require large datasets of paired images that are impractical to acquire. As a remedy, synthesizing realistic low-light noise has gained significant attention. In this paper, we investigate the ability of diffusion models to capture the complex distribution of low-light noise. We show that a naive application of conventional diffusion models is inadequate for this task and propose three key adaptations that enable high-precision noise generation without calibration or post-processing: a two-branch architecture to better model signal-dependent and signal-independent noise, the incorporation of positional information to capture fixed-pattern noise, and a tailored diffusion noise schedule. Consequently, our model enables the generation of large datasets for training low-light denoising networks, leading to state-of-the-art performance. Through comprehensive analysis, including statistical evaluation and noise decomposition, we provide deeper insights into the characteristics of the generated data.

Title: DynRsl-VLM: Enhancing Autonomous Driving Perception with Dynamic Resolution Vision-Language Models

Authors: Xirui Zhou, Lianlei Shan, Xiaolin Gui
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11265
Pdf URL: https://arxiv.org/pdf/2503.11265
Copy Paste: [[2503.11265]] DynRsl-VLM: Enhancing Autonomous Driving Perception with Dynamic Resolution Vision-Language Models(https://arxiv.org/abs/2503.11265)
Keywords: transformer
Abstract: Visual Question Answering (VQA) models, which fall under the category of vision-language models, conventionally execute multiple downsampling processes on image inputs to strike a balance between computational efficiency and model performance. Although this approach aids in concentrating on salient features and diminishing computational burden, it incurs the loss of vital detailed information, a drawback that is particularly damaging in end-to-end autonomous driving scenarios. Downsampling can lead to an inadequate capture of distant or small objects such as pedestrians, road signs, or obstacles, all of which are crucial for safe navigation. This loss of features negatively impacts an autonomous driving system's capacity to accurately perceive the environment, potentially escalating the risk of accidents. To tackle this problem, we put forward the Dynamic Resolution Vision Language Model (DynRsl-VLM). DynRsl-VLM incorporates a dynamic resolution image input processing approach that captures all entity feature information within an image while ensuring that the image input remains computationally tractable for the Vision Transformer (ViT). Moreover, we devise a novel image-text alignment module to replace the Q-Former, enabling simple and efficient alignment with text when dealing with dynamic resolution image inputs. Our method enhances the environmental perception capabilities of autonomous driving systems without overstepping computational constraints.

Title: CyclePose -- Leveraging Cycle-Consistency for Annotation-Free Nuclei Segmentation in Fluorescence Microscopy

Authors: Jonas Utz, Stefan Vocht, Anne Tjorven Buessen, Dennis Possart, Fabian Wagner, Mareike Thies, Mingxuan Gu, Stefan Uderhardt, Katharina Breininger
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11266
Pdf URL: https://arxiv.org/pdf/2503.11266
Copy Paste: [[2503.11266]] CyclePose -- Leveraging Cycle-Consistency for Annotation-Free Nuclei Segmentation in Fluorescence Microscopy(https://arxiv.org/abs/2503.11266)
Keywords: diffusion, generative, segmentation
Abstract: In recent years, numerous neural network architectures specifically designed for the instance segmentation of nuclei in microscopic images have been released. These models embed nuclei-specific priors to outperform generic architectures like U-Nets; however, they require large annotated datasets, which are often not available. Generative models (GANs, diffusion models) have been used to compensate for this by synthesizing training data. These two-stage approaches are computationally expensive, as first a generative model and then a segmentation model has to be trained. We propose CyclePose, a hybrid framework integrating synthetic data generation and segmentation training. CyclePose builds on a CycleGAN architecture, which allows unpaired translation between microscopy images and segmentation masks. We embed a segmentation model into CycleGAN and leverage a cycle consistency loss for self-supervision. Without annotated data, CyclePose outperforms other weakly or unsupervised methods on two public datasets. Code is available at this https URL

Title: High-Dimensional Interlingual Representations of Large Language Models

Authors: Bryan Wilie, Samuel Cahyawijaya, Junxian He, Pascale Fung
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.11280
Pdf URL: https://arxiv.org/pdf/2503.11280
Copy Paste: [[2503.11280]] High-Dimensional Interlingual Representations of Large Language Models(https://arxiv.org/abs/2503.11280)
Keywords: large language model
Abstract: Large language models (LLMs) trained on massive multilingual datasets hint at the formation of interlingual constructs--a shared subspace in the representation space. However, evidence regarding this phenomenon is mixed, leaving it unclear whether these models truly develop unified interlingual representations, or present a partially aligned constructs. We explore 31 diverse languages varying on their resource-levels, typologies, and geographical regions; and find that multilingual LLMs exhibit inconsistent cross-lingual alignments. To address this, we propose an interlingual representation framework identifying both the shared interlingual semantic subspace and fragmented components, existed due to representational limitations. We introduce Interlingual Local Overlap (ILO) score to quantify interlingual alignment by comparing the local neighborhood structures of high-dimensional representations. We utilize ILO to investigate the impact of single-language fine-tuning on the interlingual representations in multilingual LLMs. Our results indicate that training exclusively on a single language disrupts the alignment in early layers, while freezing these layers preserves the alignment of interlingual representations, leading to improved cross-lingual generalization. These results validate our framework and metric for evaluating interlingual representation, and further underscore that interlingual alignment is crucial for scalable multilingual learning.

Title: OPTIMUS: Predicting Multivariate Outcomes in Alzheimer's Disease Using Multi-modal Data amidst Missing Values

Authors: Christelle Schneuwly Diaz, Duy-Thanh Vu, Julien Bodelet, Duy-Cat Can, Guillaume Blanc, Haiting Jiang, Lin Yao, Guiseppe Pantaleo, ADNI, Oliver Y. Chén
Subjects: cs.LG, q-bio.NC
Abstract URL: https://arxiv.org/abs/2503.11282
Pdf URL: https://arxiv.org/pdf/2503.11282
Copy Paste: [[2503.11282]] OPTIMUS: Predicting Multivariate Outcomes in Alzheimer's Disease Using Multi-modal Data amidst Missing Values(https://arxiv.org/abs/2503.11282)
Keywords: generative
Abstract: Alzheimer's disease, a neurodegenerative disorder, is associated with neural, genetic, and proteomic factors while affecting multiple cognitive and behavioral faculties. Traditional AD prediction largely focuses on univariate disease outcomes, such as disease stages and severity. Multimodal data encode broader disease information than a single modality and may, therefore, improve disease prediction; but they often contain missing values. Recent "deeper" machine learning approaches show promise in improving prediction accuracy, yet the biological relevance of these models needs to be further charted. Integrating missing data analysis, predictive modeling, multimodal data analysis, and explainable AI, we propose OPTIMUS, a predictive, modular, and explainable machine learning framework, to unveil the many-to-many predictive pathways between multimodal input data and multivariate disease outcomes amidst missing values. OPTIMUS first applies modality-specific imputation to uncover data from each modality while optimizing overall prediction accuracy. It then maps multimodal biomarkers to multivariate outcomes using machine-learning and extracts biomarkers respectively predictive of each outcome. Finally, OPTIMUS incorporates XAI to explain the identified multimodal biomarkers. Using data from 346 cognitively normal subjects, 608 persons with mild cognitive impairment, and 251 AD patients, OPTIMUS identifies neural and transcriptomic signatures that jointly but differentially predict multivariate outcomes related to executive function, language, memory, and visuospatial function. Our work demonstrates the potential of building a predictive and biologically explainable machine-learning framework to uncover multimodal biomarkers that capture disease profiles across varying cognitive landscapes. The results improve our understanding of the complex many-to-many pathways in AD.

Title: GMG: A Video Prediction Method Based on Global Focus and Motion Guided

Authors: Yuhao Du, Hui Liu, Haoxiang Peng, Xinyuan Chen, Chenrong Wu, Jiankai Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11297
Pdf URL: https://arxiv.org/pdf/2503.11297
Copy Paste: [[2503.11297]] GMG: A Video Prediction Method Based on Global Focus and Motion Guided(https://arxiv.org/abs/2503.11297)
Keywords: extraction
Abstract: Recent years, weather forecasting has gained significant attention. However, accurately predicting weather remains a challenge due to the rapid variability of meteorological data and potential teleconnections. Current spatiotemporal forecasting models primarily rely on convolution operations or sliding windows for feature extraction. These methods are limited by the size of the convolutional kernel or sliding window, making it difficult to capture and identify potential teleconnection features in meteorological data. Additionally, weather data often involve non-rigid bodies, whose motion processes are accompanied by unpredictable deformations, further complicating the forecasting task. In this paper, we propose the GMG model to address these two core challenges. The Global Focus Module, a key component of our model, enhances the global receptive field, while the Motion Guided Module adapts to the growth or dissipation processes of non-rigid bodies. Through extensive evaluations, our method demonstrates competitive performance across various complex tasks, providing a novel approach to improving the predictive accuracy of complex spatiotemporal data.

Title: BriLLM: Brain-inspired Large Language Model

Authors: Hai Zhao, Hongqiu Wu, Dongjie Yang, Anni Zou, Jiale Hong
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.11299
Pdf URL: https://arxiv.org/pdf/2503.11299
Copy Paste: [[2503.11299]] BriLLM: Brain-inspired Large Language Model(https://arxiv.org/abs/2503.11299)
Keywords: interpretability, transformer, generative, large language model
Abstract: This paper reports the first brain-inspired large language model (BriLLM). This is a non-Transformer, non-GPT, non-traditional machine learning input-output controlled generative language model. The model is based on the Signal Fully-connected flowing (SiFu) definition on the directed graph in terms of the neural network, and has the interpretability of all nodes on the graph of the whole model, instead of the traditional machine learning model that only has limited interpretability at the input and output ends. In the language model scenario, the token is defined as a node in the graph. A randomly shaped or user-defined signal flow flows between nodes on the principle of "least resistance" along paths. The next token or node to be predicted or generated is the target of the signal flow. As a language model, BriLLM theoretically supports infinitely long $n$-gram models when the model size is independent of the input and predicted length of the model. The model's working signal flow provides the possibility of recall activation and innate multi-modal support similar to the cognitive patterns of the human brain. At present, we released the first BriLLM version in Chinese, with 4000 tokens, 32-dimensional node width, 16-token long sequence prediction ability, and language model prediction performance comparable to GPT-1. More computing power will help us explore the infinite possibilities depicted above.

Title: GNNs as Predictors of Agentic Workflow Performances

Authors: Yuanshuo Zhang, Yuchen Hou, Bohan Tang, Shuo Chen, Muhan Zhang, Xiaowen Dong, Siheng Chen
Subjects: cs.CL, cs.MA
Abstract URL: https://arxiv.org/abs/2503.11301
Pdf URL: https://arxiv.org/pdf/2503.11301
Copy Paste: [[2503.11301]] GNNs as Predictors of Agentic Workflow Performances(https://arxiv.org/abs/2503.11301)
Keywords: large language model
Abstract: Agentic workflows invoked by Large Language Models (LLMs) have achieved remarkable success in handling complex tasks. However, optimizing such workflows is costly and inefficient in real-world applications due to extensive invocations of LLMs. To fill this gap, this position paper formulates agentic workflows as computational graphs and advocates Graph Neural Networks (GNNs) as efficient predictors of agentic workflow performances, avoiding repeated LLM invocations for evaluation. To empirically ground this position, we construct FLORA-Bench, a unified platform for benchmarking GNNs for predicting agentic workflow performances. With extensive experiments, we arrive at the following conclusion: GNNs are simple yet effective predictors. This conclusion supports new applications of GNNs and a novel direction towards automating agentic workflow optimization. All codes, models, and data are available at this https URL.

Title: Are formal and functional linguistic mechanisms dissociated?

Authors: Michael Hanna, Sandro Pezzelle, Yonatan Belinkov
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.11302
Pdf URL: https://arxiv.org/pdf/2503.11302
Copy Paste: [[2503.11302]] Are formal and functional linguistic mechanisms dissociated?(https://arxiv.org/abs/2503.11302)
Keywords: large language model
Abstract: Although large language models (LLMs) are increasingly capable, these capabilities are unevenly distributed: they excel at formal linguistic tasks, such as producing fluent, grammatical text, but struggle more with functional linguistic tasks like reasoning and consistent fact retrieval. Inspired by neuroscience, recent work suggests that to succeed on both formal and functional linguistic tasks, LLMs should use different mechanisms for each; such localization could either be built-in or emerge spontaneously through training. In this paper, we ask: do current models, with fast-improving functional linguistic abilities, exhibit distinct localization of formal and functional linguistic mechanisms? We answer this by finding and comparing the "circuits", or minimal computational subgraphs, responsible for various formal and functional tasks. Comparing 5 LLMs across 10 distinct tasks, we find that while there is indeed little overlap between circuits for formal and functional tasks, there is also little overlap between formal linguistic tasks, as exists in the human brain. Thus, a single formal linguistic network, unified and distinct from functional task circuits, remains elusive. However, in terms of cross-task faithfulness - the ability of one circuit to solve another's task - we observe a separation between formal and functional mechanisms, suggesting that shared mechanisms between formal tasks may exist.

Title: Unlocking General Long Chain-of-Thought Reasoning Capabilities of Large Language Models via Representation Engineering

Authors: Xinyu Tang, Xiaolei Wang, Zhihao Lv, Yingqian Min, Wayne Xin Zhao, Binbin Hu, Ziqi Liu, Zhiqiang Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.11314
Pdf URL: https://arxiv.org/pdf/2503.11314
Copy Paste: [[2503.11314]] Unlocking General Long Chain-of-Thought Reasoning Capabilities of Large Language Models via Representation Engineering(https://arxiv.org/abs/2503.11314)
Keywords: large language model
Abstract: Recent advancements in long chain-of-thoughts(long CoTs) have significantly improved the reasoning capabilities of large language models(LLMs). Existing work finds that the capability of long CoT reasoning can be efficiently elicited by tuning on only a few examples and can easily transfer to other tasks. This motivates us to investigate whether long CoT reasoning is a general capability for LLMs. In this work, we conduct an empirical analysis for this question from the perspective of representation. We find that LLMs do encode long CoT reasoning as a general capability, with a clear distinction from vanilla CoTs. Furthermore, domain-specific representations are also required for the effective transfer of long CoT reasoning. Inspired by these findings, we propose GLoRE, a novel representation engineering method to unleash the general long CoT reasoning capabilities of LLMs. Extensive experiments demonstrate the effectiveness and efficiency of GLoRE in both in-domain and cross-domain scenarios.

Title: MMS-LLaMA: Efficient LLM-based Audio-Visual Speech Recognition with Minimal Multimodal Speech Tokens

Authors: Jeong Hun Yeo, Hyeongseop Rha, Se Jin Park, Yong Man Ro
Subjects: cs.CV, cs.MM, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2503.11315
Pdf URL: https://arxiv.org/pdf/2503.11315
Copy Paste: [[2503.11315]] MMS-LLaMA: Efficient LLM-based Audio-Visual Speech Recognition with Minimal Multimodal Speech Tokens(https://arxiv.org/abs/2503.11315)
Keywords: robust, large language model
Abstract: Audio-Visual Speech Recognition (AVSR) achieves robust speech recognition in noisy environments by combining auditory and visual information. However, recent Large Language Model (LLM) based AVSR systems incur high computational costs due to the high temporal resolution of audio-visual speech processed by LLMs. In this work, we introduce an efficient multimodal speech LLM framework that minimizes token length while preserving essential linguistic content. Our approach employs an early av-fusion module for streamlined feature integration, an audio-visual speech Q-Former that dynamically allocates tokens based on input duration, and a refined query allocation strategy with a speech rate predictor to adjust token allocation according to speaking speed of each audio sample. Extensive experiments on the LRS3 dataset show that our method achieves state-of-the-art performance with a WER of 0.74% while using only 3.5 tokens per second. Moreover, our approach not only reduces token usage by 86% compared to the previous multimodal speech LLM framework, but also improves computational efficiency by reducing FLOPs by 35.7%.

Title: Leveraging Diffusion Knowledge for Generative Image Compression with Fractal Frequency-Aware Band Learning

Authors: Lingyu Zhu, Xiangrui Zeng, Bolin Chen, Peilin Chen, Yung-Hui Li, Shiqi Wang
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2503.11321
Pdf URL: https://arxiv.org/pdf/2503.11321
Copy Paste: [[2503.11321]] Leveraging Diffusion Knowledge for Generative Image Compression with Fractal Frequency-Aware Band Learning(https://arxiv.org/abs/2503.11321)
Keywords: diffusion, generative
Abstract: By optimizing the rate-distortion-realism trade-off, generative image compression approaches produce detailed, realistic images instead of the only sharp-looking reconstructions produced by rate-distortion-optimized models. In this paper, we propose a novel deep learning-based generative image compression method injected with diffusion knowledge, obtaining the capacity to recover more realistic textures in practical scenarios. Efforts are made from three perspectives to navigate the rate-distortion-realism trade-off in the generative image compression task. First, recognizing the strong connection between image texture and frequency-domain characteristics, we design a Fractal Frequency-Aware Band Image Compression (FFAB-IC) network to effectively capture the directional frequency components inherent in natural images. This network integrates commonly used fractal band feature operations within a neural non-linear mapping design, enhancing its ability to retain essential given information and filter out unnecessary details. Then, to improve the visual quality of image reconstruction under limited bandwidth, we integrate diffusion knowledge into the encoder and implement diffusion iterations into the decoder process, thus effectively recovering lost texture details. Finally, to fully leverage the spatial and frequency intensity information, we incorporate frequency- and content-aware regularization terms to regularize the training of the generative image compression network. Extensive experiments in quantitative and qualitative evaluations demonstrate the superiority of the proposed method, advancing the boundaries of achievable distortion-realism pairs, i.e., our method achieves better distortions at high realism and better realism at low distortion than ever before.

Title: TransiT: Transient Transformer for Non-line-of-sight Videography

Authors: Ruiqian Li, Siyuan Shen, Suan Xia, Ziheng Wang, Xingyue Peng, Chengxuan Song, Yingsheng Zhu, Tao Wu, Shiying Li, Jingyi Yu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11328
Pdf URL: https://arxiv.org/pdf/2503.11328
Copy Paste: [[2503.11328]] TransiT: Transient Transformer for Non-line-of-sight Videography(https://arxiv.org/abs/2503.11328)
Keywords: transformer
Abstract: High quality and high speed videography using Non-Line-of-Sight (NLOS) imaging benefit autonomous navigation, collision prevention, and post-disaster search and rescue tasks. Current solutions have to balance between the frame rate and image quality. High frame rates, for example, can be achieved by reducing either per-point scanning time or scanning density, but at the cost of lowering the information density at individual frames. Fast scanning process further reduces the signal-to-noise ratio and different scanning systems exhibit different distortion characteristics. In this work, we design and employ a new Transient Transformer architecture called TransiT to achieve real-time NLOS recovery under fast scans. TransiT directly compresses the temporal dimension of input transients to extract features, reducing computation costs and meeting high frame rate requirements. It further adopts a feature fusion mechanism as well as employs a spatial-temporal Transformer to help capture features of NLOS transient videos. Moreover, TransiT applies transfer learning to bridge the gap between synthetic and real-measured data. In real experiments, TransiT manages to reconstruct from sparse transients of $16 \times 16$ measured at an exposure time of 0.4 ms per point to NLOS videos at a $64 \times 64$ resolution at 10 frames per second. We will make our code and dataset available to the community.

Title: Cardiomyopathy Diagnosis Model from Endomyocardial Biopsy Specimens: Appropriate Feature Space and Class Boundary in Small Sample Size Data

Authors: Masaya Mori, Yuto Omae, Yutaka Koyama, Kazuyuki Hara, Jun Toyotani, Yasuo Okumura, Hiroyuki Hao
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2503.11331
Pdf URL: https://arxiv.org/pdf/2503.11331
Copy Paste: [[2503.11331]] Cardiomyopathy Diagnosis Model from Endomyocardial Biopsy Specimens: Appropriate Feature Space and Class Boundary in Small Sample Size Data(https://arxiv.org/abs/2503.11331)
Keywords: extraction
Abstract: As the number of patients with heart failure increases, machine learning (ML) has garnered attention in cardiomyopathy diagnosis, driven by the shortage of pathologists. However, endomyocardial biopsy specimens are often small sample size and require techniques such as feature extraction and dimensionality reduction. This study aims to determine whether texture features are effective for feature extraction in the pathological diagnosis of cardiomyopathy. Furthermore, model designs that contribute toward improving generalization performance are examined by applying feature selection (FS) and dimensional compression (DC) to several ML models. The obtained results were verified by visualizing the inter-class distribution differences and conducting statistical hypothesis testing based on texture features. Additionally, they were evaluated using predictive performance across different model designs with varying combinations of FS and DC (applied or not) and decision boundaries. The obtained results confirmed that texture features may be effective for the pathological diagnosis of cardiomyopathy. Moreover, when the ratio of features to the sample size is high, a multi-step process involving FS and DC improved the generalization performance, with the linear kernel support vector machine achieving the best results. This process was demonstrated to be potentially effective for models with reduced complexity, regardless of whether the decision boundaries were linear, curved, perpendicular, or parallel to the axes. These findings are expected to facilitate the development of an effective cardiomyopathy diagnostic model for its rapid adoption in medical practice.

Title: APLA: A Simple Adaptation Method for Vision Transformers

Authors: Moein Sorkhei, Emir Konuk, Kevin Smith, Christos Matsoukas
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11335
Pdf URL: https://arxiv.org/pdf/2503.11335
Copy Paste: [[2503.11335]] APLA: A Simple Adaptation Method for Vision Transformers(https://arxiv.org/abs/2503.11335)
Keywords: transformer, segmentation
Abstract: Existing adaptation techniques typically require architectural modifications or added parameters, leading to high computational costs and complexity. We introduce Attention Projection Layer Adaptation (APLA), a simple approach to adapt vision transformers (ViTs) without altering the architecture or adding parameters. Through a systematic analysis, we find that the layer immediately after the attention mechanism is crucial for adaptation. By updating only this projection layer, or even just a random subset of this layer's weights, APLA achieves state-of-the-art performance while reducing GPU memory usage by up to 52.63% and training time by up to 43.0%, with no extra cost at inference. Across 46 datasets covering a variety of tasks including scene classification, medical imaging, satellite imaging, and fine-grained classification, APLA consistently outperforms 17 other leading adaptation methods, including full fine-tuning, on classification, segmentation, and detection tasks. The code is available at this https URL.

Title: Rule-Guided Feedback: Enhancing Reasoning by Enforcing Rule Adherence in Large Language Models

Authors: Aissatou Diallo, Antonis Bikakis, Luke Dickens, Anthony Hunter, Rob Miller
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.11336
Pdf URL: https://arxiv.org/pdf/2503.11336
Copy Paste: [[2503.11336]] Rule-Guided Feedback: Enhancing Reasoning by Enforcing Rule Adherence in Large Language Models(https://arxiv.org/abs/2503.11336)
Keywords: large language model
Abstract: In this paper, we introduce Rule-Guided Feedback (RGF), a framework designed to enhance Large Language Model (LLM) performance through structured rule adherence and strategic information seeking. RGF implements a teacher-student paradigm where rule-following is forced through established guidelines. Our framework employs a Teacher model that rigorously evaluates each student output against task-specific rules, providing constructive guidance rather than direct answers when detecting deviations. This iterative feedback loop serves two crucial purposes: maintaining solutions within defined constraints and encouraging proactive information seeking to resolve uncertainties. We evaluate RGF on diverse tasks including Checkmate-in-One puzzles, Sonnet Writing, Penguins-In-a-Table classification, GSM8k, and StrategyQA. Our findings suggest that structured feedback mechanisms can significantly enhance LLMs' performance across various domains.

Title: EgoSplat: Open-Vocabulary Egocentric Scene Understanding with Language Embedded 3D Gaussian Splatting

Authors: Di Li, Jie Feng, Jiahao Chen, Weisheng Dong, Guanbin Li, Guangming Shi, Licheng Jiao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11345
Pdf URL: https://arxiv.org/pdf/2503.11345
Copy Paste: [[2503.11345]] EgoSplat: Open-Vocabulary Egocentric Scene Understanding with Language Embedded 3D Gaussian Splatting(https://arxiv.org/abs/2503.11345)
Keywords: segmentation
Abstract: Egocentric scenes exhibit frequent occlusions, varied viewpoints, and dynamic interactions compared to typical scene understanding tasks. Occlusions and varied viewpoints can lead to multi-view semantic inconsistencies, while dynamic objects may act as transient distractors, introducing artifacts into semantic feature modeling. To address these challenges, we propose EgoSplat, a language-embedded 3D Gaussian Splatting framework for open-vocabulary egocentric scene understanding. A multi-view consistent instance feature aggregation method is designed to leverage the segmentation and tracking capabilities of SAM2 to selectively aggregate complementary features across views for each instance, ensuring precise semantic representation of scenes. Additionally, an instance-aware spatial-temporal transient prediction module is constructed to improve spatial integrity and temporal continuity in predictions by incorporating spatial-temporal associations across multi-view instances, effectively reducing artifacts in the semantic reconstruction of egocentric scenes. EgoSplat achieves state-of-the-art performance in both localization and segmentation tasks on two datasets, outperforming existing methods with a 8.2% improvement in localization accuracy and a 3.7% improvement in segmentation mIoU on the ADT dataset, and setting a new benchmark in open-vocabulary egocentric scene understanding. The code will be made publicly available.

Title: AIstorian lets AI be a historian: A KG-powered multi-agent system for accurate biography generation

Authors: Fengyu Li (1), Yilin Li (1), Junhao Zhu (1), Lu Chen (1), Yanfei Zhang (1), Jia Zhou (1), Hui Zu (1), Jingwen Zhao (2), Yunjun Gao (1) ((1) Zhejiang University, (2) Poisson Lab, Huawei)
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.11346
Pdf URL: https://arxiv.org/pdf/2503.11346
Copy Paste: [[2503.11346]] AIstorian lets AI be a historian: A KG-powered multi-agent system for accurate biography generation(https://arxiv.org/abs/2503.11346)
Keywords: large language model
Abstract: Huawei has always been committed to exploring the AI application in historical research. Biography generation, as a specialized form of abstractive summarization, plays a crucial role in historical research but faces unique challenges that existing large language models (LLMs) struggle to address. These challenges include maintaining stylistic adherence to historical writing conventions, ensuring factual fidelity, and handling fragmented information across multiple documents. We present AIstorian, a novel end-to-end agentic system featured with a knowledge graph (KG)-powered retrieval-augmented generation (RAG) and anti-hallucination multi-agents. Specifically, AIstorian introduces an in-context learning based chunking strategy and a KG-based index for accurate and efficient reference retrieval. Meanwhile, AIstorian orchestrates multi-agents to conduct on-the-fly hallucination detection and error-type-aware correction. Additionally, to teach LLMs a certain language style, we finetune LLMs based on a two-step training approach combining data augmentation-enhanced supervised fine-tuning with stylistic preference optimization. Extensive experiments on a real-life historical Jinshi dataset demonstrate that AIstorian achieves a 3.8x improvement in factual accuracy and a 47.6% reduction in hallucination rate compared to existing baselines. The data and code are available at: this https URL.

Title: PARIC: Probabilistic Attention Regularization for Language Guided Image Classification from Pre-trained Vison Language Models

Authors: Mayank Nautiyal, Stela Arranz Gheorghe, Kristiana Stefa, Li Ju, Ida-Maria Sintorn, Prashant Singh
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.11360
Pdf URL: https://arxiv.org/pdf/2503.11360
Copy Paste: [[2503.11360]] PARIC: Probabilistic Attention Regularization for Language Guided Image Classification from Pre-trained Vison Language Models(https://arxiv.org/abs/2503.11360)
Keywords: robust, interpretability
Abstract: Language-guided attention frameworks have significantly enhanced both interpretability and performance in image classification; however, the reliance on deterministic embeddings from pre-trained vision-language foundation models to generate reference attention maps frequently overlooks the intrinsic multivaluedness and ill-posed characteristics of cross-modal mappings. To address these limitations, we introduce PARIC, a probabilistic framework for guiding visual attention via language specifications. Our approach enables pre-trained vision-language models to generate probabilistic reference attention maps, which align textual and visual modalities more effectively while incorporating uncertainty estimates, as compared to their deterministic counterparts. Experiments on benchmark test problems demonstrate that PARIC enhances prediction accuracy, mitigates bias, ensures consistent predictions, and improves robustness across various datasets.

Title: PBR3DGen: A VLM-guided Mesh Generation with High-quality PBR Texture

Authors: Xiaokang Wei, Bowen Zhang, Xianghui Yang, Yuxuan Wang, Chunchao Guo, Xi Zhao, Yan Luximon
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11368
Pdf URL: https://arxiv.org/pdf/2503.11368
Copy Paste: [[2503.11368]] PBR3DGen: A VLM-guided Mesh Generation with High-quality PBR Texture(https://arxiv.org/abs/2503.11368)
Keywords: diffusion
Abstract: Generating high-quality physically based rendering (PBR) materials is important to achieve realistic rendering in the downstream tasks, yet it remains challenging due to the intertwined effects of materials and lighting. While existing methods have made breakthroughs by incorporating material decomposition in the 3D generation pipeline, they tend to bake highlights into albedo and ignore spatially varying properties of metallicity and roughness. In this work, we present PBR3DGen, a two-stage mesh generation method with high-quality PBR materials that integrates the novel multi-view PBR material estimation model and a 3D PBR mesh reconstruction model. Specifically, PBR3DGen leverages vision language models (VLM) to guide multi-view diffusion, precisely capturing the spatial distribution and inherent attributes of reflective-metalness material. Additionally, we incorporate view-dependent illumination-aware conditions as pixel-aware priors to enhance spatially varying material properties. Furthermore, our reconstruction model reconstructs high-quality mesh with PBR materials. Experimental results demonstrate that PBR3DGen significantly outperforms existing methods, achieving new state-of-the-art results for PBR estimation and mesh generation. More results and visualization can be found on our project page: this https URL.

Title: Annotating Scientific Uncertainty: A comprehensive model using linguistic patterns and comparison with existing approaches

Authors: Panggih Kusuma Ningrum, Philipp Mayr, Nina Smirnova, Iana Atanassova
Subjects: cs.CL, cs.AI, cs.DL
Abstract URL: https://arxiv.org/abs/2503.11376
Pdf URL: https://arxiv.org/pdf/2503.11376
Copy Paste: [[2503.11376]] Annotating Scientific Uncertainty: A comprehensive model using linguistic patterns and comparison with existing approaches(https://arxiv.org/abs/2503.11376)
Keywords: interpretability, large language model
Abstract: UnScientify, a system designed to detect scientific uncertainty in scholarly full text. The system utilizes a weakly supervised technique to identify verbally expressed uncertainty in scientific texts and their authorial references. The core methodology of UnScientify is based on a multi-faceted pipeline that integrates span pattern matching, complex sentence analysis and author reference checking. This approach streamlines the labeling and annotation processes essential for identifying scientific uncertainty, covering a variety of uncertainty expression types to support diverse applications including information retrieval, text mining and scientific document processing. The evaluation results highlight the trade-offs between modern large language models (LLMs) and the UnScientify system. UnScientify, which employs more traditional techniques, achieved superior performance in the scientific uncertainty detection task, attaining an accuracy score of 0.808. This finding underscores the continued relevance and efficiency of UnScientify's simple rule-based and pattern matching strategy for this specific application. The results demonstrate that in scenarios where resource efficiency, interpretability, and domain-specific adaptability are critical, traditional methods can still offer significant advantages.

Title: Modeling Subjectivity in Cognitive Appraisal with Language Models

Authors: Yuxiang Zhou, Hainiu Xu, Desmond C. Ong, Petr Slovak, Yulan He
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.11381
Pdf URL: https://arxiv.org/pdf/2503.11381
Copy Paste: [[2503.11381]] Modeling Subjectivity in Cognitive Appraisal with Language Models(https://arxiv.org/abs/2503.11381)
Keywords: large language model
Abstract: As the utilization of language models in interdisciplinary, human-centered studies grow, the expectation of model capabilities continues to evolve. Beyond excelling at conventional tasks, models are recently expected to perform well on user-centric measurements involving confidence and human (dis)agreement -- factors that reflect subjective preferences. While modeling of subjectivity plays an essential role in cognitive science and has been extensively studied, it remains under-explored within the NLP community. In light of this gap, we explore how language models can harness subjectivity by conducting comprehensive experiments and analysis across various scenarios using both fine-tuned models and prompt-based large language models (LLMs). Our quantitative and qualitative experimental results indicate that existing post-hoc calibration approaches often fail to produce satisfactory results. However, our findings reveal that personality traits and demographical information are critical for measuring subjectivity. Furthermore, our in-depth analysis offers valuable insights for future research and development in the interdisciplinary studies of NLP and cognitive science.

Title: Watch and Learn: Leveraging Expert Knowledge and Language for Surgical Video Understanding

Authors: David Gastager, Ghazal Ghazaei, Constantin Patsch
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11392
Pdf URL: https://arxiv.org/pdf/2503.11392
Copy Paste: [[2503.11392]] Watch and Learn: Leveraging Expert Knowledge and Language for Surgical Video Understanding(https://arxiv.org/abs/2503.11392)
Keywords: generative, segmentation
Abstract: Automated surgical workflow analysis is crucial for education, research, and clinical decision-making, but the lack of annotated datasets hinders the development of accurate and comprehensive workflow analysis solutions. We introduce a novel approach for addressing the sparsity and heterogeneity of annotated training data inspired by the human learning procedure of watching experts and understanding their explanations. Our method leverages a video-language model trained on alignment, denoising, and generative tasks to learn short-term spatio-temporal and multimodal representations. A task-specific temporal model is then used to capture relationships across entire videos. To achieve comprehensive video-language understanding in the surgical domain, we introduce a data collection and filtering strategy to construct a large-scale pretraining dataset from educational YouTube videos. We then utilize parameter-efficient fine-tuning by projecting downstream task annotations from publicly available surgical datasets into the language domain. Extensive experiments in two surgical domains demonstrate the effectiveness of our approach, with performance improvements of up to 7% in phase segmentation tasks, 8% in zero-shot phase segmentation, and comparable capabilities to fully-supervised models in few-shot settings. Harnessing our model's capabilities for long-range temporal localization and text generation, we present the first comprehensive solution for dense video captioning (DVC) of surgical videos, addressing this task despite the absence of existing DVC datasets in the surgical domain. We introduce a novel approach to surgical workflow understanding that leverages video-language pretraining, large-scale video pretraining, and optimized fine-tuning. Our method improves performance over state-of-the-art techniques and enables new downstream tasks for surgical video understanding.

Title: A Framework for a Capability-driven Evaluation of Scenario Understanding for Multimodal Large Language Models in Autonomous Driving

Authors: Tin Stribor Sohn, Philipp Reis, Maximilian Dillitzer, Johannes Bach, Jason J. Corso, Eric Sax
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2503.11400
Pdf URL: https://arxiv.org/pdf/2503.11400
Copy Paste: [[2503.11400]] A Framework for a Capability-driven Evaluation of Scenario Understanding for Multimodal Large Language Models in Autonomous Driving(https://arxiv.org/abs/2503.11400)
Keywords: large language model
Abstract: Multimodal large language models (MLLMs) hold the potential to enhance autonomous driving by combining domain-independent world knowledge with context-specific language guidance. Their integration into autonomous driving systems shows promising results in isolated proof-of-concept applications, while their performance is evaluated on selective singular aspects of perception, reasoning, or planning. To leverage their full potential a systematic framework for evaluating MLLMs in the context of autonomous driving is required. This paper proposes a holistic framework for a capability-driven evaluation of MLLMs in autonomous driving. The framework structures scenario understanding along the four core capability dimensions semantic, spatial, temporal, and physical. They are derived from the general requirements of autonomous driving systems, human driver cognition, and language-based reasoning. It further organises the domain into context layers, processing modalities, and downstream tasks such as language-based interaction and decision-making. To illustrate the framework's applicability, two exemplary traffic scenarios are analysed, grounding the proposed dimensions in realistic driving situations. The framework provides a foundation for the structured evaluation of MLLMs' potential for scenario understanding in autonomous driving.

Title: Towards A Correct Usage of Cryptography in Semantic Watermarks for Diffusion Models

Authors: Jonas Thietke, Andreas Müller, Denis Lukovnikov, Asja Fischer, Erwin Quiring
Subjects: cs.CR, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2503.11404
Pdf URL: https://arxiv.org/pdf/2503.11404
Copy Paste: [[2503.11404]] Towards A Correct Usage of Cryptography in Semantic Watermarks for Diffusion Models(https://arxiv.org/abs/2503.11404)
Keywords: security, watermark, diffusion
Abstract: Semantic watermarking methods enable the direct integration of watermarks into the generation process of latent diffusion models by only modifying the initial latent noise. One line of approaches building on Gaussian Shading relies on cryptographic primitives to steer the sampling process of the latent noise. However, we identify several issues in the usage of cryptographic techniques in Gaussian Shading, particularly in its proof of lossless performance and key management, causing ambiguity in follow-up works, too. In this work, we therefore revisit the cryptographic primitives for semantic watermarking. We introduce a novel, general proof of lossless performance based on IND\$-CPA security for semantic watermarks. We then discuss the configuration of the cryptographic primitives in semantic watermarks with respect to security, efficiency, and generation quality.

Title: LuSeg: Efficient Negative and Positive Obstacles Segmentation via Contrast-Driven Multi-Modal Feature Fusion on the Lunar

Authors: Shuaifeng Jiao, Zhiwen Zeng, Zhuoqun Su, Xieyuanli Chen, Zongtan Zhou, Huimin Lu
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2503.11409
Pdf URL: https://arxiv.org/pdf/2503.11409
Copy Paste: [[2503.11409]] LuSeg: Efficient Negative and Positive Obstacles Segmentation via Contrast-Driven Multi-Modal Feature Fusion on the Lunar(https://arxiv.org/abs/2503.11409)
Keywords: segmentation
Abstract: As lunar exploration missions grow increasingly complex, ensuring safe and autonomous rover-based surface exploration has become one of the key challenges in lunar exploration tasks. In this work, we have developed a lunar surface simulation system called the Lunar Exploration Simulator System (LESS) and the LunarSeg dataset, which provides RGB-D data for lunar obstacle segmentation that includes both positive and negative obstacles. Additionally, we propose a novel two-stage segmentation network called LuSeg. Through contrastive learning, it enforces semantic consistency between the RGB encoder from Stage I and the depth encoder from Stage II. Experimental results on our proposed LunarSeg dataset and additional public real-world NPO road obstacle dataset demonstrate that LuSeg achieves state-of-the-art segmentation performance for both positive and negative obstacles while maintaining a high inference speed of approximately 57\,Hz. We have released the implementation of our LESS system, LunarSeg dataset, and the code of LuSeg at:this https URL.

Title: Empowering Time Series Analysis with Synthetic Data: A Survey and Outlook in the Era of Foundation Models

Authors: Xu Liu, Taha Aksu, Juncheng Liu, Qingsong Wen, Yuxuan Liang, Caiming Xiong, Silvio Savarese, Doyen Sahoo, Junnan Li, Chenghao Liu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.11411
Pdf URL: https://arxiv.org/pdf/2503.11411
Copy Paste: [[2503.11411]] Empowering Time Series Analysis with Synthetic Data: A Survey and Outlook in the Era of Foundation Models(https://arxiv.org/abs/2503.11411)
Keywords: large language model
Abstract: Time series analysis is crucial for understanding dynamics of complex systems. Recent advances in foundation models have led to task-agnostic Time Series Foundation Models (TSFMs) and Large Language Model-based Time Series Models (TSLLMs), enabling generalized learning and integrating contextual information. However, their success depends on large, diverse, and high-quality datasets, which are challenging to build due to regulatory, diversity, quality, and quantity constraints. Synthetic data emerge as a viable solution, addressing these challenges by offering scalable, unbiased, and high-quality alternatives. This survey provides a comprehensive review of synthetic data for TSFMs and TSLLMs, analyzing data generation strategies, their role in model pretraining, fine-tuning, and evaluation, and identifying future research directions.

Title: MTV-Inpaint: Multi-Task Long Video Inpainting

Authors: Shiyuan Yang, Zheng Gu, Liang Hou, Xin Tao, Pengfei Wan, Xiaodong Chen, Jing Liao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11412
Pdf URL: https://arxiv.org/pdf/2503.11412
Copy Paste: [[2503.11412]] MTV-Inpaint: Multi-Task Long Video Inpainting(https://arxiv.org/abs/2503.11412)
Keywords: diffusion
Abstract: Video inpainting involves modifying local regions within a video, ensuring spatial and temporal consistency. Most existing methods focus primarily on scene completion (i.e., filling missing regions) and lack the capability to insert new objects into a scene in a controllable manner. Fortunately, recent advancements in text-to-video (T2V) diffusion models pave the way for text-guided video inpainting. However, directly adapting T2V models for inpainting remains limited in unifying completion and insertion tasks, lacks input controllability, and struggles with long videos, thereby restricting their applicability and flexibility. To address these challenges, we propose MTV-Inpaint, a unified multi-task video inpainting framework capable of handling both traditional scene completion and novel object insertion tasks. To unify these distinct tasks, we design a dual-branch spatial attention mechanism in the T2V diffusion U-Net, enabling seamless integration of scene completion and object insertion within a single framework. In addition to textual guidance, MTV-Inpaint supports multimodal control by integrating various image inpainting models through our proposed image-to-video (I2V) inpainting mode. Additionally, we propose a two-stage pipeline that combines keyframe inpainting with in-between frame propagation, enabling MTV-Inpaint to effectively handle long videos with hundreds of frames. Extensive experiments demonstrate that MTV-Inpaint achieves state-of-the-art performance in both scene completion and object insertion tasks. Furthermore, it demonstrates versatility in derived applications such as multi-modal inpainting, object editing, removal, image object brush, and the ability to handle long videos. Project page: this https URL.

Title: Classifying Long-tailed and Label-noise Data via Disentangling and Unlearning

Authors: Chen Shu, Mengke Li, Yiqun Zhang, Yang Lu, Bo Han, Yiu-ming Cheung, Hanzi Wang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.11414
Pdf URL: https://arxiv.org/pdf/2503.11414
Copy Paste: [[2503.11414]] Classifying Long-tailed and Label-noise Data via Disentangling and Unlearning(https://arxiv.org/abs/2503.11414)
Keywords: robust
Abstract: In real-world datasets, the challenges of long-tailed distributions and noisy labels often coexist, posing obstacles to the model training and performance. Existing studies on long-tailed noisy label learning (LTNLL) typically assume that the generation of noisy labels is independent of the long-tailed distribution, which may not be true from a practical perspective. In real-world situaiton, we observe that the tail class samples are more likely to be mislabeled as head, exacerbating the original degree of imbalance. We call this phenomenon as ``tail-to-head (T2H)'' noise. T2H noise severely degrades model performance by polluting the head classes and forcing the model to learn the tail samples as head. To address this challenge, we investigate the dynamic misleading process of the nosiy labels and propose a novel method called Disentangling and Unlearning for Long-tailed and Label-noisy data (DULL). It first employs the Inner-Feature Disentangling (IFD) to disentangle feature internally. Based on this, the Inner-Feature Partial Unlearning (IFPU) is then applied to weaken and unlearn incorrect feature regions correlated to wrong classes. This method prevents the model from being misled by noisy labels, enhancing the model's robustness against noise. To provide a controlled experimental environment, we further propose a new noise addition algorithm to simulate T2H noise. Extensive experiments on both simulated and real-world datasets demonstrate the effectiveness of our proposed method.

Title: From Generative AI to Innovative AI: An Evolutionary Roadmap

Authors: Seyed Mahmoud Sajjadi Mohammadabadi
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.11419
Pdf URL: https://arxiv.org/pdf/2503.11419
Copy Paste: [[2503.11419]] From Generative AI to Innovative AI: An Evolutionary Roadmap(https://arxiv.org/abs/2503.11419)
Keywords: generative
Abstract: This paper explores the critical transition from Generative Artificial Intelligence (GenAI) to Innovative Artificial Intelligence (InAI). While recent advancements in GenAI have enabled systems to produce high-quality content across various domains, these models often lack the capacity for true innovation. In this context, innovation is defined as the ability to generate novel and useful outputs that go beyond mere replication of learned data. The paper examines this shift and proposes a roadmap for developing AI systems that can generate content and engage in autonomous problem-solving and creative ideation. The work provides both theoretical insights and practical strategies for advancing AI to a stage where it can genuinely innovate, contributing meaningfully to science, technology, and the arts.

Title: TASTE-Rob: Advancing Video Generation of Task-Oriented Hand-Object Interaction for Generalizable Robotic Manipulation

Authors: Hongxiang Zhao, Xingchen Liu, Mutian Xu, Yiming Hao, Weikai Chen, Xiaoguang Han
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2503.11423
Pdf URL: https://arxiv.org/pdf/2503.11423
Copy Paste: [[2503.11423]] TASTE-Rob: Advancing Video Generation of Task-Oriented Hand-Object Interaction for Generalizable Robotic Manipulation(https://arxiv.org/abs/2503.11423)
Keywords: diffusion
Abstract: We address key limitations in existing datasets and models for task-oriented hand-object interaction video generation, a critical approach of generating video demonstrations for robotic imitation learning. Current datasets, such as Ego4D, often suffer from inconsistent view perspectives and misaligned interactions, leading to reduced video quality and limiting their applicability for precise imitation learning tasks. Towards this end, we introduce TASTE-Rob -- a pioneering large-scale dataset of 100,856 ego-centric hand-object interaction videos. Each video is meticulously aligned with language instructions and recorded from a consistent camera viewpoint to ensure interaction clarity. By fine-tuning a Video Diffusion Model (VDM) on TASTE-Rob, we achieve realistic object interactions, though we observed occasional inconsistencies in hand grasping postures. To enhance realism, we introduce a three-stage pose-refinement pipeline that improves hand posture accuracy in generated videos. Our curated dataset, coupled with the specialized pose-refinement framework, provides notable performance gains in generating high-quality, task-oriented hand-object interaction videos, resulting in achieving superior generalizable robotic manipulation. The TASTE-Rob dataset will be made publicly available upon publication to foster further advancements in the field.

Title: Text Compression for Efficient Language Generation

Authors: David Gu, Peter Belcak, Roger Wattenhofer
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.11426
Pdf URL: https://arxiv.org/pdf/2503.11426
Copy Paste: [[2503.11426]] Text Compression for Efficient Language Generation(https://arxiv.org/abs/2503.11426)
Keywords: transformer, generative
Abstract: We challenge the prevailing assumption that LLMs must rely fully on sub-word tokens for high-quality text generation. To this end, we propose the "Generative Pretrained Thoughtformer" (GPTHF), a hierarchical transformer language model capable of text generation by compressing text into sentence embeddings and employing a sentence attention mechanism. GPTHF retains GPT's architecture, modifying only token interactions via dynamic sparse attention masks. Our experiments show that GPTHF achieves an up to an order of magnitude improvement in FLOPs efficiency and a threefold increase in runtime speed compared to equally-sized GPT models in the low-size regime. This is achieved through a unique generation method that caches and reuses sentence embeddings, allowing significant portions of the input to bypass large parts of the network.

Title: FlowKac: An Efficient Neural Fokker-Planck solver using Temporal Normalizing flows and the Feynman Kac-Formula

Authors: Naoufal El Bekri, Lucas Drumetz, Franck Vermet
Subjects: cs.LG, math.DS, stat.ML
Abstract URL: https://arxiv.org/abs/2503.11427
Pdf URL: https://arxiv.org/pdf/2503.11427
Copy Paste: [[2503.11427]] FlowKac: An Efficient Neural Fokker-Planck solver using Temporal Normalizing flows and the Feynman Kac-Formula(https://arxiv.org/abs/2503.11427)
Keywords: robust
Abstract: Solving the Fokker-Planck equation for high-dimensional complex dynamical systems remains a pivotal yet challenging task due to the intractability of analytical solutions and the limitations of traditional numerical methods. In this work, we present FlowKac, a novel approach that reformulates the Fokker-Planck equation using the Feynman-Kac formula, allowing to query the solution at a given point via the expected values of stochastic paths. A key innovation of FlowKac lies in its adaptive stochastic sampling scheme which significantly reduces the computational complexity while maintaining high accuracy. This sampling technique, coupled with a time-indexed normalizing flow, designed for capturing time-evolving probability densities, enables robust sampling of collocation points, resulting in a flexible and mesh-free solver. This formulation mitigates the curse of dimensionality and enhances computational efficiency and accuracy, which is particularly crucial for applications that inherently require dimensions beyond the conventional three. We validate the robustness and scalability of our method through various experiments on a range of stochastic differential equations, demonstrating significant improvements over existing techniques.

Title: Combining Causal Models for More Accurate Abstractions of Neural Networks

Authors: Theodora-Mara Pîslar, Sara Magliacane, Atticus Geiger
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.11429
Pdf URL: https://arxiv.org/pdf/2503.11429
Copy Paste: [[2503.11429]] Combining Causal Models for More Accurate Abstractions of Neural Networks(https://arxiv.org/abs/2503.11429)
Keywords: interpretability
Abstract: Mechanistic interpretability aims to reverse engineer neural networks by uncovering which high-level algorithms they implement. Causal abstraction provides a precise notion of when a network implements an algorithm, i.e., a causal model of the network contains low-level features that realize the high-level variables in a causal model of the algorithm. A typical problem in practical settings is that the algorithm is not an entirely faithful abstraction of the network, meaning it only partially captures the true reasoning process of a model. We propose a solution where we combine different simple high-level models to produce a more faithful representation of the network. Through learning this combination, we can model neural networks as being in different computational states depending on the input provided, which we show is more accurate to GPT 2-small fine-tuned on two toy tasks. We observe a trade-off between the strength of an interpretability hypothesis, which we define in terms of the number of inputs explained by the high-level models, and its faithfulness, which we define as the interchange intervention accuracy. Our method allows us to modulate between the two, providing the most accurate combination of models that describe the behavior of a neural network given a faithfulness level.

Title: COIN: Confidence Score-Guided Distillation for Annotation-Free Cell Segmentation

Authors: Sanghyun Jo, Seo Jin Lee, Seungwoo Lee, Seohyung Hong, Hyungseok Seo, Kyungsu Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11439
Pdf URL: https://arxiv.org/pdf/2503.11439
Copy Paste: [[2503.11439]] COIN: Confidence Score-Guided Distillation for Annotation-Free Cell Segmentation(https://arxiv.org/abs/2503.11439)
Keywords: segmentation
Abstract: Cell instance segmentation (CIS) is crucial for identifying individual cell morphologies in histopathological images, providing valuable insights for biological and medical research. While unsupervised CIS (UCIS) models aim to reduce the heavy reliance on labor-intensive image annotations, they fail to accurately capture cell boundaries, causing missed detections and poor performance. Recognizing the absence of error-free instances as a key limitation, we present COIN (COnfidence score-guided INstance distillation), a novel annotation-free framework with three key steps: (1) Increasing the sensitivity for the presence of error-free instances via unsupervised semantic segmentation with optimal transport, leveraging its ability to discriminate spatially minor instances, (2) Instance-level confidence scoring to measure the consistency between model prediction and refined mask and identify highly confident instances, offering an alternative to ground truth annotations, and (3) Progressive expansion of confidence with recursive self-distillation. Extensive experiments across six datasets show COIN outperforming existing UCIS methods, even surpassing semi- and weakly-supervised approaches across all metrics on the MoNuSeg and TNBC datasets. The code is available at this https URL.

Title: D3: Diversity, Difficulty, and Dependability-Aware Data Selection for Sample-Efficient LLM Instruction Tuning

Authors: Jia Zhang, Chen-Xi Zhang, Yao Liu, Yi-Xuan Jin, Xiao-Wen Yang, Bo Zheng, Yi Liu, Lan-Zhe Guo
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.11441
Pdf URL: https://arxiv.org/pdf/2503.11441
Copy Paste: [[2503.11441]] D3: Diversity, Difficulty, and Dependability-Aware Data Selection for Sample-Efficient LLM Instruction Tuning(https://arxiv.org/abs/2503.11441)
Keywords: large language model
Abstract: Recent advancements in instruction tuning for large language models (LLMs) suggest that a small, high-quality dataset can significantly equip LLMs with instruction-following capabilities, outperforming large datasets often burdened by quality and redundancy issues. However, the challenge lies in automatically identifying valuable subsets from large datasets to boost both the effectiveness and efficiency of instruction tuning. In this paper, we first establish data selection criteria based on three distinct aspects of data value: diversity, difficulty, and dependability, and then propose the D3 method comprising two key steps of scoring and selection. Specifically, in the scoring step, we define the diversity function to measure sample distinctiveness and introduce the uncertainty-based prediction difficulty to evaluate sample difficulty by mitigating the interference of context-oriented generation diversity. Additionally, we integrate an external LLM for dependability assessment. In the selection step, we formulate the D3 weighted coreset objective, which jointly optimizes three aspects of data value to solve for the most valuable subset. The two steps of D3 can iterate multiple rounds, incorporating feedback to refine the selection focus adaptively. Experiments on three datasets demonstrate the effectiveness of D3 in endowing LLMs with competitive or even superior instruction-following capabilities using less than 10% of the entire dataset.

Title: Remote Photoplethysmography in Real-World and Extreme Lighting Scenarios

Authors: Hang Shao, Lei Luo, Jianjun Qian, Mengkai Yan, Shuo Chen, Jian Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11465
Pdf URL: https://arxiv.org/pdf/2503.11465
Copy Paste: [[2503.11465]] Remote Photoplethysmography in Real-World and Extreme Lighting Scenarios(https://arxiv.org/abs/2503.11465)
Keywords: robust, transformer
Abstract: Physiological activities can be manifested by the sensitive changes in facial imaging. While they are barely observable to our eyes, computer vision manners can, and the derived remote photoplethysmography (rPPG) has shown considerable promise. However, existing studies mainly rely on spatial skin recognition and temporal rhythmic interactions, so they focus on identifying explicit features under ideal light conditions, but perform poorly in-the-wild with intricate obstacles and extreme illumination exposure. In this paper, we propose an end-to-end video transformer model for rPPG. It strives to eliminate complex and unknown external time-varying interferences, whether they are sufficient to occupy subtle biosignal amplitudes or exist as periodic perturbations that hinder network training. In the specific implementation, we utilize global interference sharing, subject background reference, and self-supervised disentanglement to eliminate interference, and further guide learning based on spatiotemporal filtering, reconstruction guidance, and frequency domain and biological prior constraints to achieve effective rPPG. To the best of our knowledge, this is the first robust rPPG model for real outdoor scenarios based on natural face videos, and is lightweight to deploy. Extensive experiments show the competitiveness and performance of our model in rPPG prediction across datasets and scenes.

Title: T2I-FineEval: Fine-Grained Compositional Metric for Text-to-Image Evaluation

Authors: Seyed Mohammad Hadi Hosseini, Amir Mohammad Izadi, Ali Abdollahi, Armin Saghafian, Mahdieh Soleymani Baghshah
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11481
Pdf URL: https://arxiv.org/pdf/2503.11481
Copy Paste: [[2503.11481]] T2I-FineEval: Fine-Grained Compositional Metric for Text-to-Image Evaluation(https://arxiv.org/abs/2503.11481)
Keywords: robust, generative
Abstract: Although recent text-to-image generative models have achieved impressive performance, they still often struggle with capturing the compositional complexities of prompts including attribute binding, and spatial relationships between different entities. This misalignment is not revealed by common evaluation metrics such as CLIPScore. Recent works have proposed evaluation metrics that utilize Visual Question Answering (VQA) by decomposing prompts into questions about the generated image for more robust compositional evaluation. Although these methods align better with human evaluations, they still fail to fully cover the compositionality within the image. To address this, we propose a novel metric that breaks down images into components, and texts into fine-grained questions about the generated image for evaluation. Our method outperforms previous state-of-the-art metrics, demonstrating its effectiveness in evaluating text-to-image generative models. Code is available at this https URL T2I-FineEval.

Title: A Review of DeepSeek Models' Key Innovative Techniques

Authors: Chengen Wang, Murat Kantarcioglu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.11486
Pdf URL: https://arxiv.org/pdf/2503.11486
Copy Paste: [[2503.11486]] A Review of DeepSeek Models' Key Innovative Techniques(https://arxiv.org/abs/2503.11486)
Keywords: transformer, large language model
Abstract: DeepSeek-V3 and DeepSeek-R1 are leading open-source Large Language Models (LLMs) for general-purpose tasks and reasoning, achieving performance comparable to state-of-the-art closed-source models from companies like OpenAI and Anthropic -- while requiring only a fraction of their training costs. Understanding the key innovative techniques behind DeepSeek's success is crucial for advancing LLM research. In this paper, we review the core techniques driving the remarkable effectiveness and efficiency of these models, including refinements to the transformer architecture, innovations such as Multi-Head Latent Attention and Mixture of Experts, Multi-Token Prediction, the co-design of algorithms, frameworks, and hardware, the Group Relative Policy Optimization algorithm, post-training with pure reinforcement learning and iterative training alternating between supervised fine-tuning and reinforcement learning. Additionally, we identify several open questions and highlight potential research opportunities in this rapidly advancing field.

Title: Unicorn: A Universal and Collaborative Reinforcement Learning Approach Towards Generalizable Network-Wide Traffic Signal Control

Authors: Yifeng Zhang, Yilin Liu, Ping Gong, Peizhuo Li, Mingfeng Fan, Guillaume Sartoretti
Subjects: cs.LG, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2503.11488
Pdf URL: https://arxiv.org/pdf/2503.11488
Copy Paste: [[2503.11488]] Unicorn: A Universal and Collaborative Reinforcement Learning Approach Towards Generalizable Network-Wide Traffic Signal Control(https://arxiv.org/abs/2503.11488)
Keywords: extraction
Abstract: Adaptive traffic signal control (ATSC) is crucial in reducing congestion, maximizing throughput, and improving mobility in rapidly growing urban areas. Recent advancements in parameter-sharing multi-agent reinforcement learning (MARL) have greatly enhanced the scalable and adaptive optimization of complex, dynamic flows in large-scale homogeneous networks. However, the inherent heterogeneity of real-world traffic networks, with their varied intersection topologies and interaction dynamics, poses substantial challenges to achieving scalable and effective ATSC across different traffic scenarios. To address these challenges, we present Unicorn, a universal and collaborative MARL framework designed for efficient and adaptable network-wide ATSC. Specifically, we first propose a unified approach to map the states and actions of intersections with varying topologies into a common structure based on traffic movements. Next, we design a Universal Traffic Representation (UTR) module with a decoder-only network for general feature extraction, enhancing the model's adaptability to diverse traffic scenarios. Additionally, we incorporate an Intersection Specifics Representation (ISR) module, designed to identify key latent vectors that represent the unique intersection's topology and traffic dynamics through variational inference techniques. To further refine these latent representations, we employ a contrastive learning approach in a self-supervised manner, which enables better differentiation of intersection-specific features. Moreover, we integrate the state-action dependencies of neighboring agents into policy optimization, which effectively captures dynamic agent interactions and facilitates efficient regional collaboration. Our results show that Unicorn outperforms other methods across various evaluation metrics, highlighting its potential in complex, dynamic traffic networks.

Title: V-STaR: Benchmarking Video-LLMs on Video Spatio-Temporal Reasoning

Authors: Zixu Cheng, Jian Hu, Ziquan Liu, Chenyang Si, Wei Li, Shaogang Gong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11495
Pdf URL: https://arxiv.org/pdf/2503.11495
Copy Paste: [[2503.11495]] V-STaR: Benchmarking Video-LLMs on Video Spatio-Temporal Reasoning(https://arxiv.org/abs/2503.11495)
Keywords: robust, large language model
Abstract: Human processes video reasoning in a sequential spatio-temporal reasoning logic, we first identify the relevant frames ("when") and then analyse the spatial relationships ("where") between key objects, and finally leverage these relationships to draw inferences ("what"). However, can Video Large Language Models (Video-LLMs) also "reason through a sequential spatio-temporal logic" in videos? Existing Video-LLM benchmarks primarily focus on assessing object presence, neglecting relational reasoning. Consequently, it is difficult to measure whether a model truly comprehends object interactions (actions/events) in videos or merely relies on pre-trained "memory" of co-occurrences as biases in generating answers. In this work, we introduce a Video Spatio-Temporal Reasoning (V-STaR) benchmark to address these shortcomings. The key idea is to decompose video understanding into a Reverse Spatio-Temporal Reasoning (RSTR) task that simultaneously evaluates what objects are present, when events occur, and where they are located while capturing the underlying Chain-of-thought (CoT) logic. To support this evaluation, we construct a dataset to elicit the spatial-temporal reasoning process of Video-LLMs. It contains coarse-to-fine CoT questions generated by a semi-automated GPT-4-powered pipeline, embedding explicit reasoning chains to mimic human cognition. Experiments from 14 Video-LLMs on our V-STaR reveal significant gaps between current Video-LLMs and the needs for robust and consistent spatio-temporal reasoning.

Title: Cloud2BIM: An open-source automatic pipeline for efficient conversion of large-scale point clouds into IFC format

Authors: Slávek Zbirovský, Václav Nežerka
Subjects: cs.CV, cs.SE
Abstract URL: https://arxiv.org/abs/2503.11498
Pdf URL: https://arxiv.org/pdf/2503.11498
Copy Paste: [[2503.11498]] Cloud2BIM: An open-source automatic pipeline for efficient conversion of large-scale point clouds into IFC format(https://arxiv.org/abs/2503.11498)
Keywords: segmentation
Abstract: Building Information Modeling (BIM) is an essential component in the sustainable reconstruction and revitalization of ageing structures. However, model creation usually relies on laborious manual transformation of the unstructured point cloud data provided by laser scans or photogrammetry. This paper presents Cloud2BIM, an open-source software tool designed to automate the conversion of point clouds into BIM models compliant with the Industry Foundation Classes (IFC) standard. Cloud2BIM integrates advanced algorithms for wall and slab segmentation, opening detection, and room zoning based on real wall surfaces, resulting in a comprehensive and fully automated workflow. Unlike existing tools, it avoids computationally- and calibration-intensive techniques such as RANSAC, supports non-orthogonal geometries, and provides unprecedented processing speed-achieving results up to seven times faster than fastest competing solutions. Systematic validation using benchmark datasets confirms that Cloud2BIM is an easy-to-use, efficient, and scalable solution for generating accurate BIM models, capable of converting extensive point cloud datasets for entire buildings into IFC format with minimal user input.

Title: Leveraging Angle of Arrival Estimation against Impersonation Attacks in Physical Layer Authentication

Authors: Thuy M. Pham, Linda Senigagliesi, Marco Baldi, Rafael F. Schaefer, Gerhard P. Fettweis, Arsenia Chorti
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2503.11508
Pdf URL: https://arxiv.org/pdf/2503.11508
Copy Paste: [[2503.11508]] Leveraging Angle of Arrival Estimation against Impersonation Attacks in Physical Layer Authentication(https://arxiv.org/abs/2503.11508)
Keywords: attack, robust
Abstract: In this paper, we investigate the utilization of the angle of arrival (AoA) as a feature for robust physical layer authentication (PLA). While most of the existing approaches to PLA focus on common features of the physical layer of communication channels, such as channel frequency response, channel impulse response or received signal strength, the use of AoA in this domain has not yet been studied in depth, particularly regarding the ability to thwart impersonation attacks. In this work, we demonstrate that an impersonation attack targeting AoA based PLA is only feasible under strict conditions on the attacker's location and hardware capabilities, which highlights the AoA's potential as a strong feature for PLA. We extend previous works considering a single-antenna attacker to the case of a multiple-antenna attacker, and we develop a theoretical characterization of the conditions in which a successful impersonation attack can be mounted. Furthermore, we leverage extensive simulations in support of theoretical analyses, to validate the robustness of AoA-based PLA.

Title: TikZero: Zero-Shot Text-Guided Graphics Program Synthesis

Authors: Jonas Belouadi, Eddy Ilg, Margret Keuper, Hideki Tanaka, Masao Utiyama, Raj Dabre, Steffen Eger, Simone Paolo Ponzetto
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2503.11509
Pdf URL: https://arxiv.org/pdf/2503.11509
Copy Paste: [[2503.11509]] TikZero: Zero-Shot Text-Guided Graphics Program Synthesis(https://arxiv.org/abs/2503.11509)
Keywords: generative
Abstract: With the rise of generative AI, synthesizing figures from text captions becomes a compelling application. However, achieving high geometric precision and editability requires representing figures as graphics programs in languages like TikZ, and aligned training data (i.e., graphics programs with captions) remains scarce. Meanwhile, large amounts of unaligned graphics programs and captioned raster images are more readily available. We reconcile these disparate data sources by presenting TikZero, which decouples graphics program generation from text understanding by using image representations as an intermediary bridge. It enables independent training on graphics programs and captioned images and allows for zero-shot text-guided graphics program synthesis during inference. We show that our method substantially outperforms baselines that can only operate with caption-aligned graphics programs. Furthermore, when leveraging caption-aligned graphics programs as a complementary training signal, TikZero matches or exceeds the performance of much larger models, including commercial systems like GPT-4o. Our code, datasets, and select models are publicly available.

Title: HiTVideo: Hierarchical Tokenizers for Enhancing Text-to-Video Generation with Autoregressive Large Language Models

Authors: Ziqin Zhou, Yifan Yang, Yuqing Yang, Tianyu He, Houwen Peng, Kai Qiu, Qi Dai, Lili Qiu, Chong Luo, Lingqiao Liu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.11513
Pdf URL: https://arxiv.org/pdf/2503.11513
Copy Paste: [[2503.11513]] HiTVideo: Hierarchical Tokenizers for Enhancing Text-to-Video Generation with Autoregressive Large Language Models(https://arxiv.org/abs/2503.11513)
Keywords: large language model
Abstract: Text-to-video generation poses significant challenges due to the inherent complexity of video data, which spans both temporal and spatial dimensions. It introduces additional redundancy, abrupt variations, and a domain gap between language and vision tokens while generation. Addressing these challenges requires an effective video tokenizer that can efficiently encode video data while preserving essential semantic and spatiotemporal information, serving as a critical bridge between text and vision. Inspired by the observation in VQ-VAE-2 and workflows of traditional animation, we propose HiTVideo for text-to-video generation with hierarchical tokenizers. It utilizes a 3D causal VAE with a multi-layer discrete token framework, encoding video content into hierarchically structured codebooks. Higher layers capture semantic information with higher compression, while lower layers focus on fine-grained spatiotemporal details, striking a balance between compression efficiency and reconstruction quality. Our approach efficiently encodes longer video sequences (e.g., 8 seconds, 64 frames), reducing bits per pixel (bpp) by approximately 70\% compared to baseline tokenizers, while maintaining competitive reconstruction quality. We explore the trade-offs between compression and reconstruction, while emphasizing the advantages of high-compressed semantic tokens in text-to-video tasks. HiTVideo aims to address the potential limitations of existing video tokenizers in text-to-video generation tasks, striving for higher compression ratios and simplify LLMs modeling under language guidance, offering a scalable and promising framework for advancing text to video generation. Demo page: this https URL.

Title: Exploring the Vulnerabilities of Federated Learning: A Deep Dive into Gradient Inversion Attacks

Authors: Pengxin Guo, Runxi Wang, Shuang Zeng, Jinjing Zhu, Haoning Jiang, Yanran Wang, Yuyin Zhou, Feifei Wang, Hui Xiong, Liangqiong Qu
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2503.11514
Pdf URL: https://arxiv.org/pdf/2503.11514
Copy Paste: [[2503.11514]] Exploring the Vulnerabilities of Federated Learning: A Deep Dive into Gradient Inversion Attacks(https://arxiv.org/abs/2503.11514)
Keywords: privacy, protect, defense, attack, robust, federate
Abstract: Federated Learning (FL) has emerged as a promising privacy-preserving collaborative model training paradigm without sharing raw data. However, recent studies have revealed that private information can still be leaked through shared gradient information and attacked by Gradient Inversion Attacks (GIA). While many GIA methods have been proposed, a detailed analysis, evaluation, and summary of these methods are still lacking. Although various survey papers summarize existing privacy attacks in FL, few studies have conducted extensive experiments to unveil the effectiveness of GIA and their associated limiting factors in this context. To fill this gap, we first undertake a systematic review of GIA and categorize existing methods into three types, i.e., \textit{optimization-based} GIA (OP-GIA), \textit{generation-based} GIA (GEN-GIA), and \textit{analytics-based} GIA (ANA-GIA). Then, we comprehensively analyze and evaluate the three types of GIA in FL, providing insights into the factors that influence their performance, practicality, and potential threats. Our findings indicate that OP-GIA is the most practical attack setting despite its unsatisfactory performance, while GEN-GIA has many dependencies and ANA-GIA is easily detectable, making them both impractical. Finally, we offer a three-stage defense pipeline to users when designing FL frameworks and protocols for better privacy protection and share some future research directions from the perspectives of attackers and defenders that we believe should be pursued. We hope that our study can help researchers design more robust FL frameworks to defend against these attacks.

Title: Exploring Typographic Visual Prompts Injection Threats in Cross-Modality Generation Models

Authors: Hao Cheng, Erjia Xiao, Yichi Wang, Kaidi Xu, Mengshu Sun, Jindong Gu, Renjing Xu
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2503.11519
Pdf URL: https://arxiv.org/pdf/2503.11519
Copy Paste: [[2503.11519]] Exploring Typographic Visual Prompts Injection Threats in Cross-Modality Generation Models(https://arxiv.org/abs/2503.11519)
Keywords: security, generative
Abstract: Current Cross-Modality Generation Models (GMs) demonstrate remarkable capabilities in various generative tasks. Given the ubiquity and information richness of vision modality inputs in real-world scenarios, Cross-vision, encompassing Vision-Language Perception (VLP) and Image-to-Image (I2I), tasks have attracted significant attention. Large Vision Language Models (LVLMs) and I2I GMs are employed to handle VLP and I2I tasks, respectively. Previous research indicates that printing typographic words into input images significantly induces LVLMs and I2I GMs to generate disruptive outputs semantically related to those words. Additionally, visual prompts, as a more sophisticated form of typography, are also revealed to pose security risks to various applications of VLP tasks when injected into images. In this paper, we comprehensively investigate the performance impact induced by Typographic Visual Prompt Injection (TVPI) in various LVLMs and I2I GMs. To better observe performance modifications and characteristics of this threat, we also introduce the TVPI Dataset. Through extensive explorations, we deepen the understanding of the underlying causes of the TVPI threat in various GMs and offer valuable insights into its potential origins.

Title: Bottom-up Iterative Anomalous Diffusion Detector (BI-ADD)

Authors: Junwoo Park, Nataliya Sokolovska, Clément Cabriel, Ignacio Izeddin, Judith Miné-Hattab
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.11529
Pdf URL: https://arxiv.org/pdf/2503.11529
Copy Paste: [[2503.11529]] Bottom-up Iterative Anomalous Diffusion Detector (BI-ADD)(https://arxiv.org/abs/2503.11529)
Keywords: diffusion, segmentation
Abstract: In recent years, the segmentation of short molecular trajectories with varying diffusive properties has drawn particular attention of researchers, since it allows studying the dynamics of a particle. In the past decade, machine learning methods have shown highly promising results, also in changepoint detection and segmentation tasks. Here, we introduce a novel iterative method to identify the changepoints in a molecular trajectory, i.e., frames, where the diffusive behavior of a particle changes. A trajectory in our case follows a fractional Brownian motion and we estimate the diffusive properties of the trajectories. The proposed BI-ADD combines unsupervised and supervised learning methods to detect the changepoints. Our approach can be used for the analysis of molecular trajectories at the individual level and also be extended to multiple particle tracking, which is an important challenge in fundamental biology. We validated BI-ADD in various scenarios within the framework of the AnDi2 Challenge 2024 dedicated to single particle tracking. Our method is implemented in Python and is publicly available for research purposes.

Title: AugGen: Synthetic Augmentation Can Improve Discriminative Models

Authors: Parsa Rahimi, Damien Teney, Sebastien Marcel
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11544
Pdf URL: https://arxiv.org/pdf/2503.11544
Copy Paste: [[2503.11544]] AugGen: Synthetic Augmentation Can Improve Discriminative Models(https://arxiv.org/abs/2503.11544)
Keywords: privacy, generative
Abstract: The increasing dependence on large-scale datasets in machine learning introduces significant privacy and ethical challenges. Synthetic data generation offers a promising solution; however, most current methods rely on external datasets or pre-trained models, which add complexity and escalate resource demands. In this work, we introduce a novel self-contained synthetic augmentation technique that strategically samples from a conditional generative model trained exclusively on the target dataset. This approach eliminates the need for auxiliary data sources. Applied to face recognition datasets, our method achieves 1--12\% performance improvements on the IJB-C and IJB-B benchmarks. It outperforms models trained solely on real data and exceeds the performance of state-of-the-art synthetic data generation baselines. Notably, these enhancements often surpass those achieved through architectural improvements, underscoring the significant impact of synthetic augmentation in data-scarce environments. These findings demonstrate that carefully integrated synthetic data not only addresses privacy and resource constraints but also substantially boosts model performance. Project page this https URL

Title: Similarity-Aware Token Pruning: Your VLM but Faster

Authors: Ahmadreza Jeddi, Negin Baghbanzadeh, Elham Dolatabadi, Babak Taati
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11549
Pdf URL: https://arxiv.org/pdf/2503.11549
Copy Paste: [[2503.11549]] Similarity-Aware Token Pruning: Your VLM but Faster(https://arxiv.org/abs/2503.11549)
Keywords: transformer
Abstract: The computational demands of Vision Transformers (ViTs) and Vision-Language Models (VLMs) remain a significant challenge due to the quadratic complexity of self-attention. While token pruning offers a promising solution, existing methods often introduce training overhead or fail to adapt dynamically across layers. We present SAINT, a training-free token pruning framework that leverages token similarity and a graph-based formulation to dynamically optimize pruning rates and redundancy thresholds. Through systematic analysis, we identify a universal three-stage token evolution process (aligner-explorer-aggregator) in transformers, enabling aggressive pruning in early stages without sacrificing critical information. For ViTs, SAINT doubles the throughput of ViT-H/14 at 224px with only 0.6% accuracy loss on ImageNet-1K, surpassing the closest competitor by 0.8%. For VLMs, we apply SAINT in three modes: ViT-only, LLM-only, and hybrid. SAINT reduces LLaVA-13B's tokens by 75%, achieving latency comparable to LLaVA-7B with less than 1% performance loss across benchmarks. Our work establishes a unified, practical framework for efficient inference in ViTs and VLMs.

Title: VERIFY: A Benchmark of Visual Explanation and Reasoning for Investigating Multimodal Reasoning Fidelity

Authors: Jing Bi, Junjia Guo, Susan Liang, Guangyu Sun, Luchuan Song, Yunlong Tang, Jinxi He, Jiarui Wu, Ali Vosoughi, Chen Chen, Chenliang Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11557
Pdf URL: https://arxiv.org/pdf/2503.11557
Copy Paste: [[2503.11557]] VERIFY: A Benchmark of Visual Explanation and Reasoning for Investigating Multimodal Reasoning Fidelity(https://arxiv.org/abs/2503.11557)
Keywords: large language model
Abstract: Visual reasoning is central to human cognition, enabling individuals to interpret and abstractly understand their environment. Although recent Multimodal Large Language Models (MLLMs) have demonstrated impressive performance across language and vision-language tasks, existing benchmarks primarily measure recognition-based skills and inadequately assess true visual reasoning capabilities. To bridge this critical gap, we introduce VERIFY, a benchmark explicitly designed to isolate and rigorously evaluate the visual reasoning capabilities of state-of-the-art MLLMs. VERIFY compels models to reason primarily from visual information, providing minimal textual context to reduce reliance on domain-specific knowledge and linguistic biases. Each problem is accompanied by a human-annotated reasoning path, making it the first to provide in-depth evaluation of model decision-making processes. Additionally, we propose novel metrics that assess visual reasoning fidelity beyond mere accuracy, highlighting critical imbalances in current model reasoning patterns. Our comprehensive benchmarking of leading MLLMs uncovers significant limitations, underscoring the need for a balanced and holistic approach to both perception and reasoning. For more teaser and testing, visit our project page (this https URL).

Title: SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion

Authors: Ahmed Nassar, Andres Marafioti, Matteo Omenetti, Maksym Lysak, Nikolaos Livathinos, Christoph Auer, Lucas Morin, Rafael Teixeira de Lima, Yusik Kim, A. Said Gurbuz, Michele Dolfi, Miquel Farré, Peter W. J. Staar
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11576
Pdf URL: https://arxiv.org/pdf/2503.11576
Copy Paste: [[2503.11576]] SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion(https://arxiv.org/abs/2503.11576)
Keywords: robust
Abstract: We introduce SmolDocling, an ultra-compact vision-language model targeting end-to-end document conversion. Our model comprehensively processes entire pages by generating DocTags, a new universal markup format that captures all page elements in their full context with location. Unlike existing approaches that rely on large foundational models, or ensemble solutions that rely on handcrafted pipelines of multiple specialized models, SmolDocling offers an end-to-end conversion for accurately capturing content, structure and spatial location of document elements in a 256M parameters vision-language model. SmolDocling exhibits robust performance in correctly reproducing document features such as code listings, tables, equations, charts, lists, and more across a diverse range of document types including business documents, academic papers, technical reports, patents, and forms -- significantly extending beyond the commonly observed focus on scientific papers. Additionally, we contribute novel publicly sourced datasets for charts, tables, equations, and code recognition. Experimental results demonstrate that SmolDocling competes with other Vision Language Models that are up to 27 times larger in size, while reducing computational requirements substantially. The model is currently available, datasets will be publicly available soon.

Title: Vamba: Understanding Hour-Long Videos with Hybrid Mamba-Transformers

Authors: Weiming Ren, Wentao Ma, Huan Yang, Cong Wei, Ge Zhang, Wenhu Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11579
Pdf URL: https://arxiv.org/pdf/2503.11579
Copy Paste: [[2503.11579]] Vamba: Understanding Hour-Long Videos with Hybrid Mamba-Transformers(https://arxiv.org/abs/2503.11579)
Keywords: transformer
Abstract: State-of-the-art transformer-based large multimodal models (LMMs) struggle to handle hour-long video inputs due to the quadratic complexity of the causal self-attention operations, leading to high computational costs during training and inference. Existing token compression-based methods reduce the number of video tokens but often incur information loss and remain inefficient for extremely long sequences. In this paper, we explore an orthogonal direction to build a hybrid Mamba-Transformer model (VAMBA) that employs Mamba-2 blocks to encode video tokens with linear complexity. Without any token reduction, VAMBA can encode more than 1024 frames (640$\times$360) on a single GPU, while transformer-based models can only encode 256 frames. On long video input, VAMBA achieves at least 50% reduction in GPU memory usage during training and inference, and nearly doubles the speed per training step compared to transformer-based LMMs. Our experimental results demonstrate that VAMBA improves accuracy by 4.3% on the challenging hour-long video understanding benchmark LVBench over prior efficient video LMMs, and maintains strong performance on a broad spectrum of long and short video understanding tasks.

Title: Do Construction Distributions Shape Formal Language Learning In German BabyLMs?

Authors: Bastian Bunzeck, Daniel Duran, Sina Zarrieß
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.11593
Pdf URL: https://arxiv.org/pdf/2503.11593
Copy Paste: [[2503.11593]] Do Construction Distributions Shape Formal Language Learning In German BabyLMs?(https://arxiv.org/abs/2503.11593)
Keywords: robust
Abstract: We analyze the influence of utterance-level construction distributions in German child-directed speech on the resulting formal linguistic competence and the underlying learning trajectories for small language models trained on a novel collection of developmentally plausible language data for German. We find that trajectories are surprisingly robust for markedly different distributions of constructions in the training data, which have little effect on final accuracies and almost no effect on global learning trajectories. While syntax learning benefits from more complex utterances, lexical learning culminates in better scores with more fragmentary data. We argue that LMs trained on developmentally plausible data can contribute to debates on how rich or impoverished linguistic stimuli actually are.

Title: Advancing 3D Gaussian Splatting Editing with Complementary and Consensus Information

Authors: Xuanqi Zhang, Jieun Lee, Chris Joslin, Wonsook Lee
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11601
Pdf URL: https://arxiv.org/pdf/2503.11601
Copy Paste: [[2503.11601]] Advancing 3D Gaussian Splatting Editing with Complementary and Consensus Information(https://arxiv.org/abs/2503.11601)
Keywords: diffusion
Abstract: We present a novel framework for enhancing the visual fidelity and consistency of text-guided 3D Gaussian Splatting (3DGS) editing. Existing editing approaches face two critical challenges: inconsistent geometric reconstructions across multiple viewpoints, particularly in challenging camera positions, and ineffective utilization of depth information during image manipulation, resulting in over-texture artifacts and degraded object boundaries. To address these limitations, we introduce: 1) A complementary information mutual learning network that enhances depth map estimation from 3DGS, enabling precise depth-conditioned 3D editing while preserving geometric structures. 2) A wavelet consensus attention mechanism that effectively aligns latent codes during the diffusion denoising process, ensuring multi-view consistency in the edited results. Through extensive experimentation, our method demonstrates superior performance in rendering quality and view consistency compared to state-of-the-art approaches. The results validate our framework as an effective solution for text-guided editing of 3D scenes.

Title: Rethinking Few-Shot Adaptation of Vision-Language Models in Two Stages

Authors: Matteo Farina, Massimiliano Mancini, Giovanni Iacca, Elisa Ricci
Subjects: cs.CV, cs.LG, cs.MM
Abstract URL: https://arxiv.org/abs/2503.11609
Pdf URL: https://arxiv.org/pdf/2503.11609
Copy Paste: [[2503.11609]] Rethinking Few-Shot Adaptation of Vision-Language Models in Two Stages(https://arxiv.org/abs/2503.11609)
Keywords: extraction
Abstract: An old-school recipe for training a classifier is to (i) learn a good feature extractor and (ii) optimize a linear layer atop. When only a handful of samples are available per category, as in Few-Shot Adaptation (FSA), data are insufficient to fit a large number of parameters, rendering the above impractical. This is especially true with large pre-trained Vision-Language Models (VLMs), which motivated successful research at the intersection of Parameter-Efficient Fine-tuning (PEFT) and FSA. In this work, we start by analyzing the learning dynamics of PEFT techniques when trained on few-shot data from only a subset of categories, referred to as the ``base'' classes. We show that such dynamics naturally splits into two distinct phases: (i) task-level feature extraction and (ii) specialization to the available concepts. To accommodate this dynamic, we then depart from prompt- or adapter-based methods and tackle FSA differently. Specifically, given a fixed computational budget, we split it to (i) learn a task-specific feature extractor via PEFT and (ii) train a linear classifier on top. We call this scheme Two-Stage Few-Shot Adaptation (2SFS). Differently from established methods, our scheme enables a novel form of selective inference at a category level, i.e., at test time, only novel categories are embedded by the adapted text encoder, while embeddings of base categories are available within the classifier. Results with fixed hyperparameters across two settings, three backbones, and eleven datasets, show that 2SFS matches or surpasses the state-of-the-art, while established methods degrade significantly across settings.

Title: From Denoising Score Matching to Langevin Sampling: A Fine-Grained Error Analysis in the Gaussian Setting

Authors: Samuel Hurault, Matthieu Terris, Thomas Moreau, Gabriel Peyré
Subjects: cs.LG, math.OC
Abstract URL: https://arxiv.org/abs/2503.11615
Pdf URL: https://arxiv.org/pdf/2503.11615
Copy Paste: [[2503.11615]] From Denoising Score Matching to Langevin Sampling: A Fine-Grained Error Analysis in the Gaussian Setting(https://arxiv.org/abs/2503.11615)
Keywords: diffusion, generative
Abstract: Sampling from an unknown distribution, accessible only through discrete samples, is a fundamental problem at the core of generative AI. The current state-of-the-art methods follow a two-step process: first estimating the score function (the gradient of a smoothed log-distribution) and then applying a gradient-based sampling algorithm. The resulting distribution's correctness can be impacted by several factors: the generalization error due to a finite number of initial samples, the error in score matching, and the diffusion error introduced by the sampling algorithm. In this paper, we analyze the sampling process in a simple yet representative setting-sampling from Gaussian distributions using a Langevin diffusion sampler. We provide a sharp analysis of the Wasserstein sampling error that arises from the multiple sources of error throughout the pipeline. This allows us to rigorously track how the anisotropy of the data distribution (encoded by its power spectrum) interacts with key parameters of the end-to-end sampling method, including the noise amplitude, the step sizes in both score matching and diffusion, and the number of initial samples. Notably, we show that the Wasserstein sampling error can be expressed as a kernel-type norm of the data power spectrum, where the specific kernel depends on the method parameters. This result provides a foundation for further analysis of the tradeoffs involved in optimizing sampling accuracy, such as adapting the noise amplitude to the choice of step sizes.

Title: Tit-for-Tat: Safeguarding Large Vision-Language Models Against Jailbreak Attacks via Adversarial Defense

Authors: Shuyang Hao, Yiwei Wang, Bryan Hooi, Ming-Hsuan Yang, Jun Liu, Chengcheng Tang, Zi Huang, Yujun Cai
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2503.11619
Pdf URL: https://arxiv.org/pdf/2503.11619
Copy Paste: [[2503.11619]] Tit-for-Tat: Safeguarding Large Vision-Language Models Against Jailbreak Attacks via Adversarial Defense(https://arxiv.org/abs/2503.11619)
Keywords: security, protect, defense, attack, robust
Abstract: Deploying large vision-language models (LVLMs) introduces a unique vulnerability: susceptibility to malicious attacks via visual inputs. However, existing defense methods suffer from two key limitations: (1) They solely focus on textual defenses, fail to directly address threats in the visual domain where attacks originate, and (2) the additional processing steps often incur significant computational overhead or compromise model performance on benign tasks. Building on these insights, we propose ESIII (Embedding Security Instructions Into Images), a novel methodology for transforming the visual space from a source of vulnerability into an active defense mechanism. Initially, we embed security instructions into defensive images through gradient-based optimization, obtaining security instructions in the visual dimension. Subsequently, we integrate security instructions from visual and textual dimensions with the input query. The collaboration between security instructions from different dimensions ensures comprehensive security protection. Extensive experiments demonstrate that our approach effectively fortifies the robustness of LVLMs against such attacks while preserving their performance on standard benign tasks and incurring an imperceptible increase in time costs.

Title: ReCamMaster: Camera-Controlled Generative Rendering from A Single Video

Authors: Jianhong Bai, Menghan Xia, Xiao Fu, Xintao Wang, Lianrui Mu, Jinwen Cao, Zuozhu Liu, Haoji Hu, Xiang Bai, Pengfei Wan, Di Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11647
Pdf URL: https://arxiv.org/pdf/2503.11647
Copy Paste: [[2503.11647]] ReCamMaster: Camera-Controlled Generative Rendering from A Single Video(https://arxiv.org/abs/2503.11647)
Keywords: robust, generative
Abstract: Camera control has been actively studied in text or image conditioned video generation tasks. However, altering camera trajectories of a given video remains under-explored, despite its importance in the field of video creation. It is non-trivial due to the extra constraints of maintaining multiple-frame appearance and dynamic synchronization. To address this, we present ReCamMaster, a camera-controlled generative video re-rendering framework that reproduces the dynamic scene of an input video at novel camera trajectories. The core innovation lies in harnessing the generative capabilities of pre-trained text-to-video models through a simple yet powerful video conditioning mechanism -- its capability often overlooked in current research. To overcome the scarcity of qualified training data, we construct a comprehensive multi-camera synchronized video dataset using Unreal Engine 5, which is carefully curated to follow real-world filming characteristics, covering diverse scenes and camera movements. It helps the model generalize to in-the-wild videos. Lastly, we further improve the robustness to diverse inputs through a meticulously designed training strategy. Extensive experiments tell that our method substantially outperforms existing state-of-the-art approaches and strong baselines. Our method also finds promising applications in video stabilization, super-resolution, and outpainting. Project page: this https URL

Title: VGGT: Visual Geometry Grounded Transformer

Authors: Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, David Novotny
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11651
Pdf URL: https://arxiv.org/pdf/2503.11651
Copy Paste: [[2503.11651]] VGGT: Visual Geometry Grounded Transformer(https://arxiv.org/abs/2503.11651)
Keywords: transformer
Abstract: We present VGGT, a feed-forward neural network that directly infers all key 3D attributes of a scene, including camera parameters, point maps, depth maps, and 3D point tracks, from one, a few, or hundreds of its views. This approach is a step forward in 3D computer vision, where models have typically been constrained to and specialized for single tasks. It is also simple and efficient, reconstructing images in under one second, and still outperforming alternatives that require post-processing with visual geometry optimization techniques. The network achieves state-of-the-art results in multiple 3D tasks, including camera parameter estimation, multi-view depth estimation, dense point cloud reconstruction, and 3D point tracking. We also show that using pretrained VGGT as a feature backbone significantly enhances downstream tasks, such as non-rigid point tracking and feed-forward novel view synthesis. Code and models are publicly available at this https URL.

Title: Bring Your Rear Cameras for Egocentric 3D Human Pose Estimation

Authors: Hiroyasu Akada, Jian Wang, Vladislav Golyanik, Christian Theobalt
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11652
Pdf URL: https://arxiv.org/pdf/2503.11652
Copy Paste: [[2503.11652]] Bring Your Rear Cameras for Egocentric 3D Human Pose Estimation(https://arxiv.org/abs/2503.11652)
Keywords: transformer
Abstract: Egocentric 3D human pose estimation has been actively studied using cameras installed in front of a head-mounted device (HMD). While frontal placement is the optimal and the only option for some tasks, such as hand tracking, it remains unclear if the same holds for full-body tracking due to self-occlusion and limited field-of-view coverage. Notably, even the state-of-the-art methods often fail to estimate accurate 3D poses in many scenarios, such as when HMD users tilt their heads upward (a common motion in human activities). A key limitation of existing HMD designs is their neglect of the back of the body, despite its potential to provide crucial 3D reconstruction cues. Hence, this paper investigates the usefulness of rear cameras in the HMD design for full-body tracking. We also show that simply adding rear views to the frontal inputs is not optimal for existing methods due to their dependence on individual 2D joint detectors without effective multi-view integration. To address this issue, we propose a new transformer-based method that refines 2D joint heatmap estimation with multi-view information and heatmap uncertainty, thereby improving 3D pose tracking. Moreover, we introduce two new large-scale datasets, Ego4View-Syn and Ego4View-RW, for a rear-view evaluation. Our experiments show that the new camera configurations with back views provide superior support for 3D pose tracking compared to only frontal placements. The proposed method achieves significant improvement over the current state of the art (>10% on MPJPE). We will release the source code, trained models, and new datasets on our project page this https URL.